Week 2 — Dimensionality Reduction
Dimensionality Effect on Performance
A note about neural networks
 Yes, neural networks will perform a kind of automatic feature selection
 However, that's not as efficient as a well-designed dataset and model
 Much of the model can be largely "shut off" to ignore unwanted features
 Even unused parts of the model consume space and compute resources
 Unwanted features can still introduce unwanted noise
 Each feature requires infrastructure to collect, store, and manage.
# WORD EMBEDDING EXAMPLE
import numpy as np
import tensorflow as tf
from tensorflow import keras

NUM_WORDS = 1_000  # Keep the 1,000 most frequent words; rarer words are treated as unknown
(reuters_train_x, reuters_train_y), (reuters_test_x, reuters_test_y) = \
    keras.datasets.reuters.load_data(num_words=NUM_WORDS)
n_labels = np.unique(reuters_train_y).shape[0]  # number of topic classes
# Further preprocessing: one-hot encode the labels, pad/truncate sequences to 20 tokens
reuters_train_y = keras.utils.to_categorical(reuters_train_y, 46)
reuters_test_y = keras.utils.to_categorical(reuters_test_y, 46)
reuters_train_x = keras.preprocessing.sequence.pad_sequences(
    reuters_train_x, maxlen=20)
reuters_test_x = keras.preprocessing.sequence.pad_sequences(
    reuters_test_x, maxlen=20)
# Using all dimensions
from tensorflow.keras import layers
model = keras.Sequential([
    # Each word is embedded in 1000 dimensions here (2nd parameter)
    layers.Embedding(NUM_WORDS, 1000, input_length=20),
    layers.Flatten(),
    layers.Dense(256),
    layers.Dropout(0.25),
    layers.Activation('relu'),
    layers.Dense(46),
    layers.Activation('softmax')
])
# Model compilation and training
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model_1 = model.fit(reuters_train_x, reuters_train_y,
                    validation_data=(reuters_test_x,
                                     reuters_test_y),
                    batch_size=128, epochs=20, verbose=0)
Figure: This model is overfitting: it performs poorly on the validation set, and its validation loss is high compared to its training loss.
Word embeddings: 6 dimensions instead of 1000
model = keras.Sequential([
    # Each word is embedded in only 6 dimensions here (2nd parameter)
    layers.Embedding(NUM_WORDS, 6, input_length=20),
    layers.Flatten(),
    layers.Dense(256),
    layers.Dropout(0.25),
    layers.Activation('relu'),
    layers.Dense(46),
    layers.Activation('softmax')
])
Figure: Although there is still some overfitting, the newer model seems to perform better.
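To make the difference between the two models concrete, their trainable parameter counts can be worked out by hand. The sketch below assumes the layer stack used in both models (Embedding, Flatten, Dense(256), Dense(46)) with sequences of length 20; `dense_params` and `model_params` are hypothetical helper names, not part of Keras.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def model_params(vocab_size, embed_dim, seq_len, hidden=256, n_classes=46):
    """Trainable parameters of Embedding -> Flatten -> Dense -> Dense.

    Dropout and Activation layers have no trainable parameters.
    """
    embedding = vocab_size * embed_dim   # one vector per token id
    flattened = seq_len * embed_dim      # width of the Flatten output
    return (embedding
            + dense_params(flattened, hidden)
            + dense_params(hidden, n_classes))


wide = model_params(vocab_size=1000, embed_dim=1000, seq_len=20)
narrow = model_params(vocab_size=1000, embed_dim=6, seq_len=20)
print(f"1000-dim embedding: {wide:,} parameters")
print(f"   6-dim embedding: {narrow:,} parameters")
```

The wide model has over a hundred times more parameters to fit on the same training set, which is consistent with the heavier overfitting noted above.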
Curse of Dimensionality
Many ML methods, such as KNN, SVM, and recommendation systems, rely on distance measures.
The most common is Euclidean distance.
Why is high-dimensional data a problem?
 More dimensions $\rightarrow$ more features
 Risk of overfitting our models
 Distances grow more and more alike, so vectors might appear equidistant from all others.
 No clear distinction between clustered objects
 Concentration phenomenon for Euclidean distance
 The distribution of norms (distances between vectors) in a given distribution of points tends to concentrate.
 Adding dimensions increases feature space volume
 Solutions take longer to converge, might get stuck in local optima
 Runtime and system memory requirements
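The concentration of distances can be observed with a small experiment. This is a minimal sketch, assuming points drawn uniformly at random from the unit hypercube; `relative_contrast` is a hypothetical helper that reports how much farther the farthest point is than the nearest one.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min of the distances from one query point to random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_by_dim = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest neighbours end up at almost the same distance, which is what makes distance-based methods struggle.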
Why are more features bad?
 Redundant/irrelevant features
 More noise added than signal
 Hard to interpret and visualize
 Hard to store and process data
Curse of dimensionality in the distance function
$$ d_{i,j} = \sqrt{\sum_{k=1}^n (x_{ik} - x_{jk})^2} \tag{Euclidean distance} $$
 New dimensions add non-negative terms to the sum
 Distance increases with the number of dimensions
 For a given number of examples, the feature space becomes increasingly sparse
 The size of the feature space grows exponentially as the number of features increases, making it much harder to generalize efficiently.
 The variance increases, features might even be correlated and thus there are higher chances of overfitting to noise.
 The challenge is to keep as much of the predictive information as possible using as few features as possible.
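Two of the points above can be checked with a few lines of arithmetic: each added dimension contributes a non-negative term to the Euclidean distance, and the number of cells in the feature space grows exponentially with the number of features. The vectors and the choice of 10 bins per feature below are illustrative assumptions.

```python
import math


def prefix_distances(x, y):
    """Euclidean distance between x and y using only the first k coordinates, k = 1..n."""
    total = 0.0
    out = []
    for xk, yk in zip(x, y):
        total += (xk - yk) ** 2   # each new dimension adds a non-negative term
        out.append(math.sqrt(total))
    return out


def grid_cells(bins_per_feature, n_features):
    """Number of cells when every feature is split into the same number of bins."""
    return bins_per_feature ** n_features


x = [1.0, 4.0, 2.0, 2.0, 0.0]
y = [1.0, 1.0, 2.0, 6.0, 3.0]
dists = prefix_distances(x, y)
print("distance as dimensions are added:", [round(d, 2) for d in dists])

n_examples = 1000
for n_features in (1, 2, 3, 6):
    cells = grid_cells(10, n_features)
    print(f"{n_features} features: {cells:>9,} cells, "
          f"{n_examples / cells:.4f} examples per cell")
```

With a fixed number of examples, the average number of examples per cell collapses quickly, which is the sparsity the list describes.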
The Hughes effect
The more features there are, the larger the hypothesis space.
Curse of Dimensionality: an Example
 More features aren't better if they don't add predictive information
 Number of training instances needed increases exponentially with each added feature
What do ML models need?
 No hard and fast rule on how many features are required
 The number of features to use varies depending on the amount of training data available, the variance in that data, the complexity of the decision surface, and the type of classifier that is used. It can also depend on which features actually contain predictive information.
 Prefer uncorrelated features that contain the predictive information needed to produce correct results
Manual Dimensionality Reduction
Increasing predictive performance
 Features must have information to produce correct results
 Derive features from inherent features
 Extract and recombine to create new features
Why reduce dimensionality?
Dimensionality reduction looks for patterns in the data in order to re-express the data in a lower-dimensional form.
 Reduce multicollinearity by removing redundant features.
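One simple way to remove redundant features is a pairwise correlation filter. The sketch below is only one possible approach, not a method prescribed by the course; the column names, the values, and the 0.95 threshold are made up for illustration, and the helper assumes no column is constant.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


columns = {
    "distance_km": [1.0, 2.5, 4.0, 7.5, 10.0],
    "distance_miles": [0.62, 1.55, 2.49, 4.66, 6.21],  # same information, other unit
    "passenger_count": [1, 3, 1, 2, 4],
}
kept = drop_redundant(columns)
print(list(kept))
```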
Feature Engineering
Manual Dimensionality Reduction: Case Study with Taxi Fare dataset
CSV_COLUMNS = [
    'fare_amount',
    'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
    'dropoff_longitude', 'dropoff_latitude',
    'passenger_count', 'key'
]
LABEL_COLUMN = 'fare_amount'
STRING_COLS = ['pickup_datetime']
NUMERIC_COLS = ['pickup_longitude', 'pickup_latitude',
                'dropoff_longitude', 'dropoff_latitude',
                'passenger_count']
# One default per entry in CSV_COLUMNS, in the same order
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
# Build a baseline model using raw features
# `inputs` (Keras input layers) and `feature_columns` are assumed to be defined earlier
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import RootMeanSquaredError as RMSE
dnn_inputs = layers.DenseFeatures(feature_columns.values())(inputs)
h1 = layers.Dense(32, activation='relu', name='h1')(dnn_inputs)
h2 = layers.Dense(8, activation='relu', name='h2')(h1)
output = layers.Dense(1, activation='linear', name='fare')(h2)
model = keras.models.Model(inputs, output)
model.compile(optimizer='adam', loss='mse',
              metrics=[RMSE(name='rmse'), 'mse'])
Increasing model performance with Feature Engineering
 Carefully craft features for the data types
 Temporal (pickup date & time)
 Geographical (latitude and longitude)
Handling temporal features
import datetime
import tensorflow as tf

def parse_datetime(s):
    if type(s) is not str:
        s = s.numpy().decode('utf-8')
    return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")

def get_dayofweek(s):
    ts = parse_datetime(s)
    # isoweekday() is 1 for Monday ... 7 for Sunday; "% 7" maps Sunday to index 0 of DAYS
    return DAYS[ts.isoweekday() % 7]

@tf.function
def dayofweek(ts_in):
    return tf.map_fn(
        lambda s: tf.py_function(get_dayofweek, inp=[s],
                                 Tout=tf.string),
        ts_in)
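The day-of-week lookup can be checked without TensorFlow. This is a minimal sketch, assuming timestamps look like `2010-02-08 09:17:00 UTC` (an illustrative value, not taken from the dataset). Note that `datetime.weekday()` counts Monday as 0, so indexing a Sunday-first list such as `DAYS` needs an offset; the sketch uses `isoweekday() % 7` for that.

```python
import datetime

DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']


def day_of_week(timestamp):
    """Return the abbreviated weekday name for a 'YYYY-MM-DD HH:MM:SS TZ' string."""
    ts = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S %Z")
    # isoweekday(): Monday=1 ... Sunday=7, so "% 7" maps Sunday to index 0.
    return DAYS[ts.isoweekday() % 7]


print(day_of_week("2010-02-08 09:17:00 UTC"))
print(day_of_week("2010-02-07 12:00:00 UTC"))
```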
Geographical features
def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff * londiff + latdiff * latdiff)
Scaling latitude and longitude
def scale_longitude(lon_column):
    return (lon_column + 78) / 8.  # maps longitudes in [-78, -70] to [0, 1]
def scale_latitude(lat_column):
    return (lat_column - 37) / 8.  # maps latitudes in [37, 45] to [0, 1]
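A quick sanity check of the scaling arithmetic, written with plain floats instead of tensors. The coordinate values below are illustrative; the formulas map longitudes in [-78, -70] and latitudes in [37, 45] onto [0, 1].

```python
def scale_longitude(lon):
    return (lon + 78) / 8.0


def scale_latitude(lat):
    return (lat - 37) / 8.0


for lon in (-78.0, -74.0, -70.0):
    print(f"longitude {lon:6.1f} -> {scale_longitude(lon):.2f}")
for lat in (37.0, 41.0, 45.0):
    print(f"latitude  {lat:6.1f} -> {scale_latitude(lat):.2f}")
```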
Preparing the transformations
def transform(inputs, numeric_cols, string_cols, nbuckets):
    ...
    feature_columns = {
        colname: tf.feature_column.numeric_column(colname)
        for colname in numeric_cols
    }
    # Scale the raw coordinates into [0, 1] before they are bucketized
    for lon_col in ['pickup_longitude', 'dropoff_longitude']:
        transformed[lon_col] = layers.Lambda(scale_longitude,
                                             ...)(inputs[lon_col])
    for lat_col in ['pickup_latitude', 'dropoff_latitude']:
        transformed[lat_col] = layers.Lambda(scale_latitude,
                                             ...)(inputs[lat_col])
Bucketizing and feature crossing
Unless the specific geometry of the earth is relevant to your data, a bucketized version of the map is likely to be more useful than the raw inputs.
def transform(inputs, numeric_cols, string_cols, nbuckets):
    ...
    # `fc` is assumed to be an alias for tf.feature_column
    latbuckets = np.linspace(0, 1, nbuckets).tolist()
    lonbuckets = ...
    b_plat = fc.bucketized_column(
        feature_columns['pickup_latitude'], latbuckets)
    b_dlat = ...  # Bucketize 'dropoff_latitude'
    b_plon = ...  # Bucketize 'pickup_longitude'
    b_dlon = ...  # Bucketize 'dropoff_longitude'
    ploc = fc.crossed_column([b_plat, b_plon], nbuckets * nbuckets)
    dloc = ...  # Feature cross 'b_dlat' and 'b_dlon'
    pd_pair = fc.crossed_column([ploc, dloc], nbuckets ** 4)
    feature_columns['pickup_and_dropoff'] = fc.embedding_column(pd_pair, 100)
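The idea behind bucketizing and crossing can be illustrated in plain Python. This is a simplified sketch, not the feature column implementation: it assigns each grid cell an exact index, whereas `crossed_column` hashes the combination into a fixed number of hash buckets. The helper names and the choice of 5 boundaries are assumptions for illustration.

```python
import bisect


def make_boundaries(nbuckets):
    """nbuckets evenly spaced boundaries on [0, 1], like np.linspace(0, 1, nbuckets)."""
    return [i / (nbuckets - 1) for i in range(nbuckets)]


def bucketize(value, boundaries):
    """Index of the bucket that a scaled value falls into (0 .. len(boundaries))."""
    return bisect.bisect_right(boundaries, value)


def cross(lat_bucket, lon_bucket, n_lon_buckets):
    """Combine two bucket indices into a single grid-cell id."""
    return lat_bucket * n_lon_buckets + lon_bucket


boundaries = make_boundaries(5)          # [0.0, 0.25, 0.5, 0.75, 1.0]
n_buckets = len(boundaries) + 1          # values below 0 and above 1 get their own bucket
lat_bucket = bucketize(0.30, boundaries)
lon_bucket = bucketize(0.80, boundaries)
cell = cross(lat_bucket, lon_bucket, n_buckets)
print(lat_bucket, lon_bucket, cell)
```

Two pickups that fall into the same grid cell receive the same id, so the model can learn one value per neighbourhood instead of relying on raw coordinates.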
Algorithmic Dimensionality Reduction
Linear dimensionality reduction
 Linearly project $n$-dimensional data onto a $k$-dimensional subspace ($k < n$, often $k \ll n$)
 There are infinitely many $k$-dimensional subspaces we can project the data onto
 Which one should we choose?
Projecting onto a line
 Let's think of features as vectors existing in a high-dimensional space.
 High-dimensional vectors are difficult to visualize, but projecting them onto a lower-dimensional space lets us visualize the data more easily. This is referred to as embedding.
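Projecting a vector onto a line reduces to a dot product with a unit vector along that line. The sketch below uses illustrative 2-D vectors; `project_onto_line` is a hypothetical helper name.

```python
import math


def project_onto_line(x, direction):
    """Project x onto the line through the origin spanned by `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coordinate = sum(a * u for a, u in zip(x, unit))   # 1-D embedding of x
    reconstruction = [coordinate * u for u in unit]    # closest point on the line
    return coordinate, reconstruction


coord, recon = project_onto_line([3.0, 4.0], [1.0, 0.0])
print(coord, recon)

coord_diag, recon_diag = project_onto_line([2.0, 2.0], [1.0, 1.0])
print(round(coord_diag, 4), [round(v, 4) for v in recon_diag])
```

The single number `coordinate` is the lower-dimensional representation; the gap between `x` and `reconstruction` is the information lost by the projection.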
Best k-dimensional subspace for projection
Classification: maximum separation among classes
Example: Linear Discriminant Analysis (LDA)
Regression: maximize correlation between projected data and output variable
Example: Partial Least Squares (PLS)
Unsupervised: retain as much data variance as possible
Example: Principal Component Analysis (PCA)
Principal Components Analysis
 Relies on eigendecomposition (which can only be done for square matrices, here the covariance matrix of the features)
 PCA minimizes the orthogonal distance between the data points and the subspace they are projected onto
 Widely used method for unsupervised & linear dimensionality reduction
 Accounts for variance of data in as few dimensions as possible using linear projections
PCA performs dimensionality reduction in two steps:

PCA rotates the samples so that they are aligned with the coordinate axes.
PCA also shifts the samples so that they have a mean of zero
Principal components (PCs)
 PCs maximize the variance of projections
 PCs are orthogonal
 Gives the best axis to project
 Goal of PCA: Minimize total squared reconstruction error
Repeat until we have k orthogonal lines
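The first principal component can be found with nothing more than the covariance matrix and power iteration. This is a minimal sketch on made-up 2-D data, not how scikit-learn computes PCA; the helper names are hypothetical.

```python
import math


def center(rows):
    """Shift the samples so that every feature has a mean of zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, n_iter=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


data = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.1], [1.0, 1.2], [-1.0, -1.1]]
cov = covariance(center(data))
pc1, var1 = first_component(cov)
total_var = cov[0][0] + cov[1][1]
print("first principal component:", [round(x, 3) for x in pc1])
print(f"share of variance explained: {var1 / total_var:.3f}")
```

Further components would be obtained by removing the variance along `pc1` from the covariance matrix and repeating, which keeps each new direction orthogonal to the previous ones.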
from sklearn.decomposition import PCA
# PCA that will retain 99% of the variance
pca = PCA(n_components=0.99, whiten=True)
pca.fit(X)
X_pca = pca.transform(X)
Plot the explained variance
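import numpy as np
# Percentage of the total variance explained by each retained component.
# scikit-learn exposes this as explained_variance_ratio_, already sorted from high to low.
var_exp = pca.explained_variance_ratio_ * 100
cum_var_exp = np.cumsum(var_exp)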
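The cumulative explained variance is typically used to pick the number of components. The sketch below shows that selection rule on made-up eigenvalues; `components_for_threshold` is a hypothetical helper, and the 0.9 threshold is only an example.

```python
from itertools import accumulate


def components_for_threshold(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = list(accumulate(v / total for v in ordered))
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k, cumulative
    return len(ordered), cumulative


k, cum = components_for_threshold([1.0, 4.0, 0.5, 2.0, 0.5], threshold=0.9)
print("cumulative explained variance:", cum)
print("components needed for 90%:", k)
```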
PCA factor loadings
The factor loadings are the unstandardized values of the eigenvectors.
When to use PCA?
Strengths:
 A versatile technique
 Fast and simple
 Offers several variations and extensions (e.g., kernel/sparse PCA)
Weaknesses:
 Result is not interpretable
 Requires setting threshold for cumulative explained variance
Other Techniques
Unsupervised
 Latent Semantic Indexing/Analysis (LSI and LSA), based on SVD
 Independent Component Analysis (ICA)
Matrix Factorization
 Non-Negative Matrix Factorization (NMF)
Latent Methods
 Latent Dirichlet Allocation (LDA)
Singular Value Decomposition (SVD)
 SVD decomposes non-square matrices
 Useful for sparse and non-square matrices, such as those produced by TF-IDF.
 Removes redundant features from the dataset
Independent Component Analysis
 PCA seeks directions in feature space that minimize reconstruction error; it looks for uncorrelated factors.
 ICA seeks directions that are most statistically independent
 ICA addresses higher order dependence
How does ICA work?
 Assume there exist independent signals:
 $S = [s_1(t), s_2(t), ..., s_N(t)]$
 Linear combinations of signals: $Y(t) = A S(t)$
 Both $A$ and $S$ are unknown
 $A$: mixing matrix
 Goal of ICA: recover original signal, $S(t)$ from $Y(t)$
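The mixing model can be illustrated with two short signals. Note the limits of this sketch: it only shows $Y(t) = A S(t)$ and that the sources are recoverable when $A$ is known. Real ICA has to estimate the unmixing matrix from $Y$ alone, using statistical independence, which is not implemented here. The signals and the matrix are made up.

```python
def mix(A, signals):
    """Apply Y(t) = A S(t) at every time step; `signals` is a list of N sequences."""
    n_steps = len(signals[0])
    return [[sum(A[i][j] * signals[j][t] for j in range(len(signals)))
             for t in range(n_steps)]
            for i in range(len(A))]


def invert_2x2(A):
    """Inverse of a 2x2 matrix with non-zero determinant."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


s1 = [0.0, 1.0, 0.0, -1.0]
s2 = [1.0, 1.0, -1.0, -1.0]
A = [[1.0, 0.5], [0.5, 1.0]]            # mixing matrix (unknown in practice)

Y = mix(A, [s1, s2])                     # what the sensors observe
recovered = mix(invert_2x2(A), Y)        # possible here only because A is known
print("observed :", Y)
print("recovered:", [[round(v, 6) for v in row] for row in recovered])
```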
Comparing PCA and ICA
Non-negative Matrix Factorization (NMF)
 NMF models are interpretable and easier to understand
 NMF requires the sample features to be non-negative, so it can't be applied to all datasets
Quantization and Pruning — Mobile, IoT, and Similar Use Cases
Factors driving new trend of edge computing
 Demands move ML capability from cloud to on-device
 Cost-effectiveness
 Compliance with privacy regulations
Online ML inference
 To generate real-time predictions you can:
 Host the model on a server
 Embed the model in the device
 Is it faster on a server, or on-device?
 Mobile processing limitations?
Mobile inference
Model development
Benefits and Process of Quantization
Quantization involves transforming a model into an equivalent representation that uses parameters and computations at a lower precision.
This improves model execution performance and efficiency but at the cost of slightly lower model accuracy.
Think of a picture, which is a grid of pixels where each pixel has a certain number of bits. If we reduce the continuous color spectrum of real life to a set of discrete colors, we are quantizing, or approximating, the image.
Beyond a certain point it may get harder to recognize what the data/image really is.
Why quantize neural networks?
 Neural networks have many parameters and take up space
 Shrinking model file size
 Reduce computational resources
 Make models run faster and use less power with low precision
MobileNets: Latency vs Accuracy tradeoff
MobileNets are a family of small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases.
Benefits of quantization
 Faster compute
 Low memory bandwidth
 Low power
 Integer operations supported across CPU/DSP/NPUs
The quantization process
The weights and activations of a particular layer often lie in a small range, which can be estimated beforehand. Because the range is known, we don't need to represent values outside it and can concentrate the few available bits within that smaller range.
Find the maximum absolute weight value, $m$, then map the floating-point range $-m$ to $+m$ to the fixed-point range $-127$ to $+127$.
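The mapping from $[-m, +m]$ to $[-127, +127]$ can be sketched in a few lines. This is a minimal illustration of symmetric 8-bit quantization on made-up weights, assuming at least one weight is non-zero; it is not the TensorFlow Lite implementation.

```python
def quantize_symmetric(weights, n_levels=127):
    """Map floats in [-m, +m] to integers in [-n_levels, +n_levels]."""
    m = max(abs(w) for w in weights)   # maximum absolute weight value
    scale = m / n_levels               # float value of one integer step
    q = [int(round(w / scale)) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Approximate the original floats from the stored integers."""
    return [v * scale for v in q]


weights = [-1.0, -0.3, 0.0, 0.6, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q)
print(f"largest round-trip error: {max_error:.5f} (step size {scale:.5f})")
```

Only the integers and a single `scale` per tensor need to be stored, and the round-trip error is bounded by half a step.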
What parts of the model are affected?
 Static parameters (like weights of the layers)
 Dynamic parameters (like activations inside networks)
 Computation (transformations)
Tradeoffs
 Optimizations impact model accuracy
 Difficult to predict ahead of time
 In rare cases, models may actually gain some accuracy
 Undefined effects on ML interpretability.
Choose the best model for the task
Post Training Quantization
In this technique, the size of an already trained TensorFlow model is reduced by using the TensorFlow Lite converter to save it in the TensorFlow Lite format.
 Reduced precision representation
 Incur small loss in model accuracy
 Joint optimization for model and latency
 In dynamic range quantization, during inference the weights are converted back from eight bits to floating point and activations are computed using floating point kernels.
 This conversion is done once, cached to reduce latency
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Optimize.DEFAULT enables dynamic range quantization (OPTIMIZE_FOR_SIZE is a deprecated alias)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
Model accuracy
 Small accuracy loss incurred (mostly for smaller networks)
 Use the benchmarking tools to evaluate model accuracy
 If the loss of accuracy is not within acceptable limits, consider using quantization-aware training
Quantization Aware Training
Quantization aware training adds fake quantization operations to the model so it can learn to ignore the quantization noise during training.
 Inserts fake quantization (FQ) nodes in the forward pass
 Rewrites the graph to emulate quantized inference
 Reduces the loss of accuracy due to quantization
 Resulting model contains all data to be quantized according to spec
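A fake quantization node can be thought of as "quantize, then immediately dequantize": the value stays a float, but it is snapped to the grid that the quantized model will use. The sketch below shows only this forward-pass behaviour under the assumption of a symmetric 8-bit range; during training, the rounding is typically treated as the identity in the backward pass (a straight-through estimator), which is not shown.

```python
def fake_quantize(x, max_abs, n_levels=127):
    """Clip x to [-max_abs, max_abs] and snap it to the nearest quantization level."""
    scale = max_abs / n_levels
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / scale) * scale


values = [-2.0, -0.4, 0.003, 0.5, 3.0]
simulated = [fake_quantize(v, max_abs=1.0) for v in values]
for v, s in zip(values, simulated):
    print(f"{v:7.3f} -> {s:.5f}")
```

Values outside the range are clipped and small values collapse to zero, so the network sees during training the same noise it will see after conversion.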
QAT on entire model
import tensorflow as tf
import tensorflow_model_optimization as tfmot
model = tf.keras.Sequential([
    ...
])
# Quantize the entire model.
quantized_model = tfmot.quantization.keras.quantize_model(model)
# Continue with training as usual
quantized_model.compile(...)
quantized_model.fit(...)
Quantize part(s) of a model
import tensorflow as tf
import tensorflow_model_optimization as tfmot
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
model = tf.keras.Sequential([
    ...
    # Only annotated layers will be quantized (layer arguments are elided here)
    quantize_annotate_layer(Conv2D()),
    quantize_annotate_layer(ReLU()),
    Dense(),
    ...
])
# Quantize the model
quantized_model = tfmot.quantization.keras.quantize_apply(model)
Quantize custom Keras layer
# `CustomLayer` and `DefaultDenseQuantizeConfig` are assumed to be defined elsewhere
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope
model = quantize_annotate_model(tf.keras.Sequential([
    quantize_annotate_layer(CustomLayer(20, input_shape=(20,)),
                            DefaultDenseQuantizeConfig()),
    tf.keras.layers.Flatten()
]))
# `quantize_apply` requires mentioning `DefaultDenseQuantizeConfig` with
# `quantize_scope`
with quantize_scope(
        {'DefaultDenseQuantizeConfig': DefaultDenseQuantizeConfig,
         'CustomLayer': CustomLayer}):
    # Use `quantize_apply` to actually make the model quantization aware
    quant_aware_model = tfmot.quantization.keras.quantize_apply(model)
Pruning
Pruning increases the efficiency of a model by removing parts (connections) of the model that do not contribute substantially to producing accurate results.
Model Sparsity
Origins of weight pruning
The first major paper advocating sparsity in neural networks dates back to 1990: "Optimal Brain Damage", written by Yann LeCun, John S. Denker, and Sara A. Solla.
At that time, pruning neural networks after training was already a popular way to reduce model size. Pruning was mainly done by using weight magnitude as an approximation of saliency to identify less useful connections.
 The intuition is that the smaller the weight, the smaller its effect on the output.
 The saliency of each weight was estimated as the change in the loss function caused by applying a perturbation to the network. The model was then retrained.
 But retraining the pruned network became a lot harder.
 The answer came with the "lottery ticket hypothesis"
Lottery ticket Hypothesis
As the network size increases, the number of possible subnetworks and the probability of finding the ‘lucky subnetwork’ also increase. As per the lottery ticket hypothesis, if we find this lucky subnetwork, we can train small and sparsified networks to give higher performance even when 90 percent of the full network’s parameters are removed.
Finding Sparse Neural Networks
"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations" - Jonathan Frankle and Michael Carbin
Basically, instead of fine-tuning the remaining weights after pruning, just reset them to their original (initial) values and retrain.
This led to the acceptance of the idea that over-parameterized dense networks contain several sparse subnetworks with varying performance, and that one of these subnetworks is the winning ticket, which outperforms all others.
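The "prune, then reset to the initial values" step can be sketched on a flat list of weights. This illustrates only the bookkeeping, with made-up numbers and a hypothetical `winning_ticket` helper; the actual procedure trains a full network, and the retraining of the resulting subnetwork is not shown.

```python
def winning_ticket(initial_weights, trained_weights, sparsity):
    """Build a pruning mask from trained magnitudes and rewind survivors to their initial values."""
    n_prune = int(len(trained_weights) * sparsity)
    order = sorted(range(len(trained_weights)),
                   key=lambda i: abs(trained_weights[i]))
    pruned = set(order[:n_prune])          # smallest trained magnitudes are removed
    mask = [0 if i in pruned else 1 for i in range(len(trained_weights))]
    # Surviving connections are reset to the values they had at initialization.
    ticket = [w0 if m else 0.0 for w0, m in zip(initial_weights, mask)]
    return ticket, mask


initial = [0.1, -0.2, 0.3, -0.4]
trained = [0.9, -0.01, 0.02, -0.7]
ticket, mask = winning_ticket(initial, trained, sparsity=0.5)
print("mask  :", mask)
print("ticket:", ticket)
```

The mask comes from the trained network, while the values come from the original initialization; the sparse network defined by `ticket` is what gets retrained.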
Pruning research is evolving
 The new method didn't perform well at large scale
 The new method failed to identify the randomly initialized winners
 Active area of research
Eliminate connections based on their magnitude
TensorFlow includes a weight pruning API, which can iteratively remove connections based on their magnitude during training.
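Magnitude-based pruning itself is simple to express. The following is a minimal sketch on a flat list of made-up weights, not the TensorFlow pruning API: it zeroes out the fraction of weights with the smallest absolute value and returns the mask that was applied.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1, -0.02, 0.4]
pruned, mask = prune_by_magnitude(weights, sparsity=0.5)
print("mask  :", mask)
print("pruned:", pruned)
```

During iterative pruning, a step like this is repeated with a gradually increasing `sparsity` while training continues, so the remaining weights can compensate for the removed ones.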
Sparsity increases with training
Figure: Black cells indicate where the non-zero weights exist as pruning is applied to a tensor. Image source: TensorFlow Model Optimization Toolkit, Pruning API (morioh.com).
What's special about pruning?
 Better storage and/or transmission
 Gain speedups in CPU and some ML accelerators
 Can be used in tandem with quantization to get additional benefits
 Unlock performance improvements
Pruning with TF Model Optimization Toolkit with Keras
import tensorflow_model_optimization as tfmot
model = build_your_model()
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.5, final_sparsity=0.8,
    begin_step=2000, end_step=4000)
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=pruning_schedule)
...
# Training needs the UpdatePruningStep callback so that the pruning step advances
model_for_pruning.fit(..., callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
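To see how the target sparsity evolves between `begin_step` and `end_step`, the schedule can be written out as a formula. This sketch is my understanding of a polynomial decay schedule with an exponent of 3 and should be treated as an assumption rather than the toolkit's exact behaviour; for instance, it ignores how often pruning is actually applied.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at a training step, ramping from initial to final sparsity."""
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))      # clamp outside the pruning window
    return (final_sparsity
            + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power)


for step in (2000, 2500, 3000, 3500, 4000):
    s = polynomial_decay_sparsity(step, 0.5, 0.8, begin_step=2000, end_step=4000)
    print(f"step {step}: target sparsity {s:.4f}")
```

The sparsity rises quickly at first and then flattens out, so most connections are removed early while the network still has many training steps left to recover.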