NEURAL NETWORKS AND DEEP LEARNING

UNIT–IV (Regularization for Deep Learning)

Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised Learning, Multi-task Learning, Early Stopping, Parameter Tying
and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier

Introduction:

• A central problem in machine learning is how to make an algorithm perform well not just on
the training data, but also on new inputs.

• Many strategies used in machine learning are explicitly designed to reduce the test error, at
the expense of increased training error. These strategies are known collectively as
regularization.

• Here, we describe regularization in more detail, focusing on regularization strategies for
deep models or models that may be used as building blocks to form deep models.

PARAMETER NORM PENALTIES

• The idea here is to limit the capacity (the space of all possible model families) of the model by
adding a parameter norm penalty, Ω(θ), to the objective function, J:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

• Here, θ represents only the weights and not the biases, the reason being that the biases require
much less data to fit and do not add much variance.
L² Parameter Regularization

• Here, we have the following parameter norm penalty:

Ω(θ) = (1/2)||w||₂²

• Applying the 2nd-order Taylor-series approximation of J (ignoring all terms of order greater
than 2 in the Taylor series) at the point w*, where J assumes its minimum value, i.e.,
∇J(w*) = 0, we get the following expression (the first-order gradient term is 0):

Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀH(w − w*)

where H is the Hessian of J with respect to w, evaluated at w*.

• The gradient of this approximation is ∇Ĵ(w) = H(w − w*), since the first term is just a constant
and differentiating the quadratic term gives H(w − w*). The overall gradient of the regularized
objective function (gradient of Ĵ plus gradient of αΩ(θ)) becomes:

∇J̃(w) = αw + H(w − w*)

Setting this to zero gives the regularized optimum: w̃ = (H + αI)⁻¹Hw*.

• As α approaches 0, w̃ comes closer to w*. Finally, since H is real and symmetric, it can be
decomposed into a diagonal matrix Λ and an orthonormal set of eigenvectors, Q.

• That is, H = QΛQᵀ, which gives:

w̃ = Q(Λ + αI)⁻¹ΛQᵀw*

• Because of the (Λ + αI)⁻¹Λ term, the value of each weight is rescaled along the eigenvectors of H.
• The component of w* along eigenvector i is rescaled by λᵢ/(λᵢ + α), where λᵢ represents the
eigenvalue corresponding to that eigenvector.

• [Figure: each component of w* is shrunk by λᵢ/(λᵢ + α); directions of low curvature (small λᵢ)
are shrunk far more than directions of high curvature.]
• To look at its application to machine learning, we have to look at linear regression. The
objective function there is exactly quadratic, given by:

(Xw − y)ᵀ(Xw − y)

and L² regularization changes the normal-equations solution from w = (XᵀX)⁻¹Xᵀy to
w = (XᵀX + αI)⁻¹Xᵀy.

L¹ Parameter Regularization

• Here, the parameter norm penalty is given by: Ω(θ) = ||w||₁
• This makes the gradient of the overall objective function:

∇J̃(w) = α sign(w) + ∇J(w)
• Now, the last term, sign(w), creates some difficulty as the gradient no longer scales linearly with
w. This leads to a few complexities in arriving at the optimal solution; assuming a diagonal
Hessian, each weight has the closed form:

wᵢ = sign(wᵢ*) max{ |wᵢ*| − α/Hᵢ,ᵢ , 0 }

• The interpretation of the max term is that there shouldn't be a zero crossing: once the penalty
outweighs the contribution of wᵢ*, the weight is clamped exactly to zero (the gradient of the
absolute value function is not defined at zero).
• Thus, L¹ regularization has the property of sparsity, which is its fundamental distinguishing feature
from L². Hence, L¹ is used for feature selection, as in LASSO.
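A minimal sketch of the soft-thresholding solution above (assuming a diagonal Hessian; the function name and values are illustrative), showing how L¹ drives some weights exactly to zero:

```python
import numpy as np

def l1_solution(w_star, h_diag, alpha):
    """Closed-form L1-regularized optimum for a quadratic objective with a
    diagonal Hessian: soft-thresholding of w*. Components with
    |w*_i| <= alpha / H_ii are set exactly to zero (sparsity)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([0.9, -0.2, 0.05, -1.5])
h_diag = np.ones(4)  # identity Hessian for simplicity
print(l1_solution(w_star, h_diag, alpha=0.3))
# -> [ 0.6 -0.   0.  -1.2]: two coordinates are exactly zero
```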

NORM PENALTIES AS CONSTRAINED OPTIMIZATION

• To minimize any function under constraints, we can construct a generalized Lagrangian
function containing the objective function along with the penalties.

• Suppose we wanted Ω(θ) < k; then we could construct the following Lagrangian:

L(θ, α) = J(θ; X, y) + α(Ω(θ) − k)

• We get the optimal θ by solving the Lagrangian. If Ω(θ) > k, then the weights need to be penalized
heavily, and hence α should be large to reduce the norm below k.

• Likewise, if Ω(θ) < k, then the norm shouldn't be reduced too much, and hence α should be small.
This is similar to the parameter-norm-penalty regularized objective function: both
encourage lower values of the norm.

• Thus, parameter norm penalties naturally impose a constraint; e.g., L²-regularization confines
the weights to an L² ball. A larger α implies a smaller constraint region (it pushes the values really
low, allowing only a small radius) and vice versa.

• The idea of constraints over penalties is important for several reasons. Large penalties might cause
non-convex optimization algorithms to get stuck in local minima due to small values of θ, leading
to the formation of so-called dead cells, as the weights entering and leaving them are too small to
have an impact. Constraints don't force the weights to be near zero; they merely confine them to a
constrained region.

• Another reason is that constraints induce higher stability. With high learning rates, a large weight
can lead to a large gradient, which could grow iteratively until numerical overflow in the value
of θ. Constraints, along with reprojection (back onto the corresponding ball), prevent
the weights from becoming too large, thus maintaining stability.

• A final suggestion, made by Hinton, is to restrict the individual column norms of the weight matrix
rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden unit from
having a large weight.


• The idea here is that if we restrict only the Frobenius norm, it doesn't guarantee that the individual
columns would be small, just their overall norm. So, we might have large weights compensated by
extremely small weights to keep the overall norm small.
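A minimal sketch of the column-norm reprojection suggested above (the function name and radius c are illustrative): after each gradient step, any column whose norm exceeds c is scaled back onto the ball of radius c:

```python
import numpy as np

def project_columns(W, c):
    """Reproject each column of W onto the L2 ball of radius c, so that no
    hidden unit accumulates an overly large incoming weight vector."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(1)
W = rng.normal(scale=3.0, size=(5, 3))
W = project_columns(W, c=1.0)
print(np.linalg.norm(W, axis=0))  # every column norm is now <= 1.0
```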
REGULARIZATION AND UNDER-CONSTRAINED PROBLEMS

• Underdetermined problems are those problems that have infinitely many solutions. A logistic
regression problem on linearly separable classes that has w as a solution will always have 2w as a
solution, and so on.

• In some machine learning problems, regularization is necessary. Many algorithms (e.g.,
PCA) require the inversion of XᵀX, which might be singular.

• In such a case, we can use a regularized form instead: (XᵀX + αI) is guaranteed to be invertible.
• Regularization can solve underdetermined problems. For e.g., the Moore-Penrose pseudoinverse
defined earlier as:

X⁺ = lim₍α→0₎ (XᵀX + αI)⁻¹Xᵀ

• This can be seen as performing linear regression with L²-regularization.
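A quick numeric check (illustrative) that (XᵀX + αI)⁻¹Xᵀ approaches the Moore-Penrose pseudoinverse as α → 0, even when XᵀX itself is singular:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 5))  # more columns than rows: X^T X is singular

# Regularized inverse: (X^T X + alpha*I)^-1 X^T with a tiny alpha.
alpha = 1e-8
ridge_pinv = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T)

print(np.allclose(ridge_pinv, np.linalg.pinv(X), atol=1e-4))  # True
```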

DATASET AUGMENTATION

• Having more data is the most desirable way to improve a machine learning model's
performance. In many cases, it is relatively easy to artificially generate data.

• For a classification task, we desire the model to be invariant to certain types of
transformations, and we can generate the corresponding (x, y) pairs by translating the input x. But
for certain problems, like density estimation, we can't apply this directly unless we have already
solved the density estimation problem.

• However, caution needs to be maintained while augmenting data to make sure that the class doesn't
change. For e.g., if the labels contain both "b" and "d", then horizontal flipping would be unsuitable
for data augmentation, since it turns one class into the other.

• Adding random noise to the inputs is another form of data augmentation, while adding noise
to hidden units can be seen as doing data augmentation at multiple levels of abstraction.
• Finally, when comparing machine learning models, we need to evaluate them using the same hand-
designed data augmentation scheme, or else it might happen that algorithm A outperforms
algorithm B just because it was trained on a dataset with more / better data augmentation.
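A minimal augmentation sketch (the flip_ok flag and noise level are illustrative assumptions), combining a label-safe horizontal flip with input noise:

```python
import numpy as np

rng = np.random.default_rng(3)

def augment(image, flip_ok=True, noise_std=0.05):
    """Return a randomly augmented copy of a (H, W) image array.
    flip_ok should be False for classes that flipping would change
    (e.g. distinguishing 'b' from 'd')."""
    out = image.copy()
    if flip_ok and rng.random() < 0.5:
        out = out[:, ::-1]                             # horizontal flip
    out = out + rng.normal(0.0, noise_std, out.shape)  # input noise
    return out

image = rng.random((28, 28))
print(augment(image).shape)  # (28, 28)
```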

NOISE ROBUSTNESS

• Noise with infinitesimal variance added at the input imposes a penalty on the norm of the weights.
Noise added to hidden units is very important and is discussed later in Dropout.

• Noise can even be added to the weights. This has several interpretations. One of them is that
adding noise to weights is a stochastic implementation of Bayesian inference over the weights,
where the weights are considered to be uncertain, with the uncertainty being modeled by a
probability distribution.

• It is also interpreted as a more traditional form of regularization, ensuring stability in learning.

• For e.g., in the linear regression case, we want to learn the mapping y(x) for each feature vector x
by reducing the mean squared error.

• Now, suppose a zero-mean, unit-variance Gaussian random noise, ϵ, is added to the weights. We
still want to learn the appropriate mapping by reducing the mean squared error.

• Minimizing the loss after adding noise to the weights is equivalent to adding another regularization
term, which makes sure that small perturbations in the weight values don't affect the predictions
much, thus stabilizing training.

• Sometimes we may have wrong output labels, in which case maximizing p(y | x) may not be a
good idea. In such a case, we can add noise to the labels by assigning a probability of (1 − ϵ) that the
label is correct and a probability of ϵ that it is not.

• In the latter case, all the other labels are equally likely. Label smoothing regularizes a model
with k softmax outputs by assigning the correct class as the classification target with probability
(1 − ϵ) and each of the remaining (k − 1) classes with probability ϵ/(k − 1).
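A minimal sketch of label smoothing as described above (the function name and ϵ value are illustrative):

```python
import numpy as np

def smooth_labels(y, k, eps=0.1):
    """Convert integer labels into smoothed one-hot targets:
    1 - eps for the true class, eps / (k - 1) for every other class."""
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]   (each row still sums to 1)
```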

SEMI-SUPERVISED LEARNING

• P(x, y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, we have
a label y.
• P(x) denotes the marginal distribution of x, i.e., just the training examples without any labels. In
semi-supervised learning, we use both P(x, y) (some labeled samples) and P(x) (unlabelled samples)
to estimate P(y|x) (since we want to predict the class, given the training sample).

• We want to learn some representation h = f(x) such that samples which are closer in the input space
have similar representations, and a linear classifier in the new space achieves better generalization
error.

• Instead of separating the supervised and unsupervised criteria, we can instead have a generative
model of P(x) (or P(x, y)) which shares parameters with the discriminative model. The idea is to
combine the unsupervised/generative criterion with the supervised criterion to express a prior belief
that the structure of P(x) (or P(x, y)) is connected to the structure of P(y|x), which is expressed by
the shared parameters.

MULTITASK LEARNING

• The idea is to improve the generalization error by pooling together examples from multiple tasks.
Similar to how more data leads to better generalization, using a part of the model for different
tasks constrains that part to learn good values. There are two types of model parameters (a
minimal sketch follows this list):
o Task-specific: These parameters benefit only from their particular task.
o Generic, shared across all tasks: These are the ones which benefit from learning
through various tasks.

• Multitask learning leads to better generalization when there is actually some relationship between
the tasks, which often happens in the context of deep learning, where some of the factors
explaining the variation observed in the data are shared across different tasks.
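A minimal sketch of the two parameter types (all shapes and names are illustrative): a shared trunk whose weights receive gradients from every task, plus task-specific output heads:

```python
import numpy as np

rng = np.random.default_rng(4)

# Generic parameters, shared across all tasks.
W_shared = rng.normal(size=(16, 8))

# Task-specific parameters: one output head per task.
W_task = {"task_a": rng.normal(size=(8, 3)),
          "task_b": rng.normal(size=(8, 5))}

def forward(x, task):
    """Shared representation h = relu(x W_shared), then a task-specific head.
    Gradients from every task update W_shared, which is what constrains it
    to learn values that are good for all tasks."""
    h = np.maximum(x @ W_shared, 0.0)
    return h @ W_task[task]

x = rng.normal(size=(2, 16))
print(forward(x, "task_a").shape, forward(x, "task_b").shape)  # (2, 3) (2, 5)
```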
EARLY STOPPING

• As mentioned at the start, after a certain point during training, for a model with
extremely high representational capacity, the training error continues to decrease but the
validation error begins to increase (which we referred to as overfitting).
• In such a scenario, a better idea would be to return to the point where the validation error was
the least. Thus, we keep calculating the validation metric after each epoch, and if there is
any improvement, we store that parameter setting. Upon termination of training, we return the last
saved parameters.

• The idea of Early Stopping is that if the validation error doesn't improve over a certain fixed
number of iterations, we terminate the algorithm.

• This effectively reduces the capacity of the model by reducing the number of steps required to fit
the model. The evaluation on the validation set can be done either in parallel on another GPU or
after each epoch.

• A drawback of weight decay was that we had to manually tweak the weight decay coefficient,
which, if chosen wrongly, can lead the model into local minima by squashing the weight values too
much. In Early Stopping, no such coefficient needs to be tweaked, which reduces the number of
hyperparameters that we need to tune.

• However, since we are setting aside some part of the training data for validation, we are not using
the complete training set. So, once Early Stopping is done, a second phase of training can be
done using the complete training set. There are two choices here:
o Train from scratch for the same number of steps as in the Early Stopping case.
o Use the weights learned from the first phase of training and retrain using the complete data.
• Other than lowering the number of training steps, it also reduces computational cost by
regularizing the model without having to add additional penalty terms. It affects the optimization
procedure by restricting it to a small volume of the parameter space, in the neighbourhood of the
initial parameters. (A minimal loop sketch follows.)
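A minimal early-stopping loop (the step/validate callables and the patience value are illustrative assumptions, not the notes' prescription):

```python
import numpy as np

def train_with_early_stopping(step, validate, max_epochs=200, patience=10):
    """Run step() once per epoch; keep the parameters with the lowest
    validation error; stop after `patience` epochs without improvement."""
    best_err, best_params, bad_epochs = np.inf, None, 0
    for _ in range(max_epochs):
        params = step()                 # one epoch of training
        err = validate(params)          # evaluate on the validation set
        if err < best_err:
            best_err, best_params, bad_epochs = err, params, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for `patience` epochs
                break
    return best_params, best_err        # return the last SAVED parameters

# Toy usage: the "parameter" is one scalar that decays past its optimum.
state = {"w": 5.0}
def one_epoch():
    state["w"] *= 0.9
    return state["w"]

best_w, best_err = train_with_early_stopping(one_epoch, lambda w: (w - 1.0) ** 2)
print(best_w)  # the w closest to 1.0 that training passed through
```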

• Suppose 𝛕 and ϵ represent the number of iterations and the learning rate respectively. Then ϵ𝛕
effectively represents the capacity of the model. Intuitively, this can be seen as the inverse of
the weight decay coefficient λ.

• When ϵ𝛕 is small (or λ is large), the parameter space is small, and vice versa. This equivalence holds
for a linear model with quadratic cost function (initial parameters w⁰ = 0). Taking the Taylor-series
approximation of J(w) around the empirically optimal weights w*:

Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀH(w − w*), so ∇Ĵ(w) = H(w − w*)

Each gradient descent step is then:

w⁽ᵗ⁾ = w⁽ᵗ⁻¹⁾ − ϵ∇Ĵ(w⁽ᵗ⁻¹⁾) = w⁽ᵗ⁻¹⁾ − ϵH(w⁽ᵗ⁻¹⁾ − w*)

Multiplying with Qᵀ on both sides and using the fact that H = QΛQᵀ (Q is orthonormal, QᵀQ = I):

Qᵀ(w⁽ᵗ⁾ − w*) = (I − ϵΛ)Qᵀ(w⁽ᵗ⁻¹⁾ − w*), which unrolls (with w⁰ = 0) to Qᵀw⁽ᵗ⁾ = [I − (I − ϵΛ)ᵗ]Qᵀw*

• Assuming ϵ to be small enough that |1 − ϵλᵢ| < 1 for every eigenvalue λᵢ.

• The equation for L² regularization is given by:

Qᵀw̃ = [I − (Λ + αI)⁻¹α]Qᵀw*

• Thus, if the hyperparameters are such that:

(I − ϵΛ)^𝛕 = (Λ + αI)⁻¹α, i.e., α ≈ 1/(𝛕ϵ) for small ϵλᵢ,

• L²-regularization can be seen as equivalent to Early Stopping.
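A quick numeric check of this equivalence (values illustrative): with α = 1/(𝛕ϵ), the early-stopping factor 1 − (1 − ϵλ)^𝛕 approximately matches the L² factor λ/(λ + α) in the regime where ϵλ is small:

```python
import numpy as np

eps, tau = 1e-3, 100       # learning rate and number of gradient steps
alpha = 1.0 / (tau * eps)  # the matching weight decay coefficient

lam = np.array([0.1, 0.5, 1.0])          # Hessian eigenvalues (eps*lam small)
early_stop = 1 - (1 - eps * lam) ** tau  # fraction of w* recovered after tau steps
weight_decay = lam / (lam + alpha)       # L2 shrinkage factor lambda/(lambda+alpha)
print(np.round(early_stop, 4))    # [0.01   0.0488 0.0952]
print(np.round(weight_decay, 4))  # [0.0099 0.0476 0.0909]
```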
PARAMETER TYING AND PARAMETER SHARING

• Till now, most of the methods focused on bringing the weights to a fixed point, e.g., 0 in the case
of a norm penalty.

• However, there might be situations where we have some prior knowledge on the kind of
dependencies that the model should encode.

• Suppose two models, A and B, perform a classification task on similar input and output
distributions. In such a case, we'd expect the parameters of both models to be similar to each
other as well.

• We could impose a norm penalty on the distance between the weights (parameter tying), but a more
popular method is to force the sets of parameters to be equal.
• This is the essence behind Parameter Sharing. A major benefit here is that we need to store only a
subset of the parameters (e.g., storing only the parameters for model A instead of storing for both A
and B), which leads to large memory savings.
• In the example of Convolutional Neural Networks (CNNs), the same feature is computed across
different regions of the image, and hence a feature is detected irrespective of whether it is at
position i or i+1.
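A minimal sketch contrasting the two options above (names and β are illustrative): a soft tying penalty versus hard sharing of a single array:

```python
import numpy as np

def tying_penalty(w_a, w_b, beta=0.1):
    """Parameter tying: soft penalty beta * ||w_a - w_b||^2 that
    encourages the two models' weights to stay close."""
    return beta * np.sum((w_a - w_b) ** 2)

rng = np.random.default_rng(5)
w_a, w_b = rng.normal(size=10), rng.normal(size=10)
print(tying_penalty(w_a, w_b))

# Parameter sharing: both "models" literally reuse the same array, so only
# one copy is stored (the memory saving mentioned above).
w_shared = w_a
model_a_weights = model_b_weights = w_shared
```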

SPARSE REPRESENTATIONS

• We can place penalties even on the activation values of the units, which indirectly imposes a penalty
on the parameters. This leads to representational sparsity, where many of the activation values of
the units are zero.

• [Figure: a sparse representation h of the input x, with most entries zero.] Representational sparsity
is obtained similarly to the way parameter sparsity is obtained, by placing a penalty on the
representation h instead of on the weights.
• Another idea could be to average the activation values across various examples and push that
average towards some target value.

• An example of getting representational sparsity by imposing a hard constraint on the activation
values is the Orthogonal Matching Pursuit (OMP) algorithm, where a representation h is learned for
the input x by solving the constrained optimization problem:

arg min₍h, ||h||₀ < k₎ ||x − Wh||²

where the constraint is on the number of non-zero entries, indicated by ||h||₀. The problem can be
solved efficiently when W is restricted to be orthogonal.
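A minimal sketch of a soft representational-sparsity penalty (an L¹ penalty on the activations h rather than OMP's hard ||h||₀ constraint; names and γ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.normal(size=(8, 8))
x = rng.normal(size=8)

h = np.maximum(W @ x, 0.0)  # the representation of x

def loss_with_sparse_h(task_loss, h, gamma=0.1):
    """Total loss = task loss + gamma * ||h||_1. The penalty acts on the
    representation h, not the weights, pushing many activations to zero."""
    return task_loss + gamma * np.sum(np.abs(h))

print(loss_with_sparse_h(0.5, h))
```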

BAGGING AND OTHER ENSEMBLE METHODS

• The techniques which train multiple models and take the majority vote (or average) across those
models for the final prediction are called ensemble methods. The idea is that it's highly unlikely that
multiple models would make the same error on the test set.

• Suppose that we have K regression models, with model i making an error ϵᵢ on each
example, where ϵᵢ is drawn from a zero-mean multivariate normal distribution such that E[ϵᵢ²] = v
and E[ϵᵢϵⱼ] = c. The error on each example is then the average across all the models: (∑ᵢ ϵᵢ)/K.
• The mean of this average error is 0 (as the mean of each of the individual ϵᵢ is 0). The variance of the
average error is given by:

E[((∑ᵢ ϵᵢ)/K)²] = v/K + (K − 1)c/K

• Thus, if c = v, then there is no change. If c = 0, then the variance of the average error decreases with
K. There are various ensembling techniques. In the case of Bagging (Bootstrap Aggregating), the
same training algorithm is used multiple times.

• The dataset is broken into K parts by sampling with replacement, and a model is trained on each of
those K parts. Because of sampling with replacement, the K parts have a few similarities as well as a
few differences. These differences cause the differences in the predictions of the K models. Model
averaging is a very strong technique.
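A minimal bagging sketch for regression (illustrative; simple line fits stand in for real learners): K bootstrap resamples, one model per resample, averaged prediction:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + rng.normal(0.0, 0.3, size=100)  # noisy linear data

K = 10
models = []
for _ in range(K):
    idx = rng.integers(0, len(x), size=len(x))  # sample WITH replacement
    models.append(np.polyfit(x[idx], y[idx], deg=1))

# Final prediction: average the K models' outputs.
x_test = np.array([0.5])
preds = [np.polyval(m, x_test) for m in models]
print(np.mean(preds))  # close to 2.0 * 0.5 = 1.0
```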

DROPOUT

• Dropout is a computationally inexpensive, yet powerful regularization technique. The problem
with bagging is that we can't train an exponentially large number of models and store them for
prediction later.

• Dropout makes bagging practical by making an inexpensive approximation. In a simplistic
view, dropout trains the ensemble of all sub-networks formed by randomly removing a few non-
output units, by multiplying their outputs by 0.

• For every training sample, a mask is computed for all the input and hidden units independently. For
clarification, suppose we have h hidden units in some layer. Then, a mask for that layer is an h-
dimensional vector with values either 0 (remove the unit) or 1 (keep the unit).

• There are a few differences from bagging though:
o In bagging, the models are independent of each other, whereas in dropout, the different
models share parameters, with each model using a sample of the total parameters.
o In bagging, each model is trained till convergence, but in dropout, each model is trained
for just one step, and the parameter sharing makes sure that subsequent updates ensure
better predictions in the future.
• At test time, we combine the predictions of all the models. In the case of bagging with K models, this
was given by the arithmetic mean. In the case of dropout, the probability that a model is chosen is given
by p(μ), with μ denoting the mask vector.

• The prediction then becomes ∑ p(μ)p(y|x, μ). This is not computationally feasible, and there's a
better method to compute this in one go, using the geometric mean instead of the arithmetic
mean.

• We need to take care of two main things when working with the geometric mean:
o None of the probabilities should be zero.
o Re-normalization, to make sure all the probabilities sum to 1.

• The advantage of dropout is that the ensemble prediction can be approximated in one pass of the
complete model by dividing the weight values by the keep probability (the weight scaling
inference rule).

• The motivation behind this is to capture the right expected value of the output of each unit,
i.e., the total expected input to a unit at train time should equal the total expected input at test time.
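A minimal sketch of the common "inverted dropout" implementation of this idea (illustrative; it divides surviving activations by the keep probability at train time so that test time needs no rescaling):

```python
import numpy as np

rng = np.random.default_rng(8)

def dropout_forward(h, keep_prob=0.8, train=True):
    """Inverted dropout on a layer's activations h.
    Train: zero each unit independently with prob (1 - keep_prob) and divide
    the survivors by keep_prob, so the expected input to the next layer
    matches test time. Test: use all units unchanged (weight scaling rule)."""
    if not train:
        return h
    mask = (rng.random(h.shape) < keep_prob).astype(h.dtype)
    return h * mask / keep_prob

h = np.ones(10)
print(dropout_forward(h))               # some zeros, survivors scaled to 1.25
print(dropout_forward(h, train=False))  # all ones at test time
```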
Points to note:

• Dropout reduces the representational capacity of the model, and hence the model should be large
enough to begin with.

• It works better with more data.

• It is equivalent to L² regularization for linear regression, with a different weight decay coefficient
for each input feature.

Biological Interpretation:

• During sexual reproduction, genes are swapped between organisms, so a gene cannot rely on the
presence of any other particular gene. Likewise, the units in dropout learn to perform well
regardless of the presence of other hidden units, and also in many different contexts.

• Adding noise in the hidden layers is more effective than adding noise in the input layer. For e.g.,
let's assume that some unit learns to detect a nose in a face recognition task. Now, if this unit is
removed, then some other unit either learns to redundantly detect a nose or associates some other
feature (like the mouth) with recognizing a face.
• Either way, the model makes more use of the information in the input. On the other hand, adding
noise to the input won't completely remove the nose information, unless the noise is so large as to
remove most of the information from the input.

ADVERSARIAL TRAINING

• Deep learning has outperformed humans in the task of image recognition, which might lead us to
believe that these models have acquired a human-level understanding of an image. However,
experimentally searching for an x′ (given an x), such that the prediction made by the model changes,
shows otherwise.

• Although the newly formed image (the adversarial image) looks almost exactly the same to a human,
the model classifies it wrongly, and that too with very high confidence.

• The main factor attributed to this behavior is the (near-)linearity of the model, y = Wx (say),
caused by the main building blocks being primarily linear.

• Thus, a small change of ϵ in the input causes a drastic change of Wϵ in the output. The idea
of adversarial training is to avoid these jumps by encouraging the model to be locally constant in the
neighbourhood of the training data.

• This can also be used for semi-supervised learning. For an unlabelled sample x, we can assign the
label ŷ(x) using our model. Then, we find an adversarial example, x′, such that y(x′) ≠ ŷ(x) (an
adversarial example found this way is called a virtual adversarial example).
• The objective then is to assign the same class to both x and x′. The idea behind this is that different
classes are assumed to lie on disconnected manifolds, and a little push from one manifold
shouldn't land in any other manifold.
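A minimal sketch of crafting an adversarial example with the fast gradient sign method on a logistic-regression model (the model, label, and ϵ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
w, b = rng.normal(size=100), 0.0  # a (hypothetical) trained linear model
x = rng.normal(size=100)
y = 1.0                           # the true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the logistic loss w.r.t. the INPUT x is (p - y) * w.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# Fast gradient sign method: a small step that maximally increases the loss.
eps = 0.25
x_adv = x + eps * np.sign(grad_x)

# The model's p(y=1|x) drops sharply after the tiny perturbation.
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))
```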

TANGENT DISTANCE, TANGENT PROP AND MANIFOLD TANGENT CLASSIFIER

• Many ML models assume the data to lie on a low-dimensional manifold to overcome the curse of
dimensionality. The inherent assumption which follows is that small perturbations that move the
data along the manifold it originally belonged to shouldn't lead to different class predictions.

• The idea of the tangent distance algorithm is to find the k-nearest neighbors using, as the distance
metric, the distance between manifolds.

• A manifold M is approximated by the tangent plane at xᵢ; hence, this technique needs the
tangent vectors to be specified.
• The tangent prop algorithm proposed to learn a neural network based classifier, f(x), which is
invariant to known transformations causing the input to move along its manifold. Local
invariance would require that ∇f(x) is perpendicular to the tangent vectors V⁽ⁱ⁾. This can also be
achieved by adding a penalty term that minimizes the directional derivative of f(x) along each of
the V⁽ⁱ⁾.

• It is similar to data augmentation in that both of them use prior knowledge of the domain to specify
various transformations that the model should be invariant to. However, tangent prop only resists
infinitesimal perturbations, while data augmentation causes invariance to much larger
perturbations.

• The Manifold Tangent Classifier works in two parts:
o Use autoencoders to learn the manifold structure using unsupervised learning.
o Use these manifolds with tangent prop.
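A minimal sketch of the tangent prop penalty (finite-difference estimate; the toy model and tangent vector are illustrative): the squared directional derivative of f along each tangent vector is penalized, which pushes ∇f(x) to be perpendicular to the tangents:

```python
import numpy as np

def tangent_prop_penalty(f, x, tangents, delta=1e-4):
    """Penalty sum_i (directional derivative of f along v_i)^2, estimated
    with central finite differences. Minimizing it makes f locally
    invariant to movements along the manifold's tangent directions."""
    penalty = 0.0
    for v in tangents:
        v = v / np.linalg.norm(v)
        d = (f(x + delta * v) - f(x - delta * v)) / (2 * delta)
        penalty += d ** 2
    return penalty

# Toy model and one tangent direction (both illustrative). This tangent is
# orthogonal to the model's weight vector, so f is already invariant along
# it and the penalty comes out ~0.
f = lambda x: np.tanh(x @ np.array([1.0, -2.0]))
x = np.array([0.3, 0.1])
tangents = [np.array([2.0, 1.0])]
print(tangent_prop_penalty(f, x, tangents))
```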
