FOUNDATIONS OF DEEP LEARNING
SURYA C
DEPT. OF ADS
Module 2
Training Deep Learning Models
Setup and Initialization Issues
• Complex, high-dimensional loss landscapes tend to have few poor local minima; most critical points are saddle points
• Random weight initialization gives you enough variability to break the symmetry between units
• With random weights, some connections will be strengthened during training and others weakened
• Initial weights that are too small lead to vanishing gradients, and weights that are too large lead to exploding gradients
• It is important to set the variance of the weights in proportion to the size of each layer within the network
• Initializing weights is more important than initializing biases (biases are commonly initialized to zero)
Common Weight Initialization Techniques
• Kaiming (He) Initialization:
• Weights are drawn from a uniform distribution
• The standard deviation is inversely proportional to the number of input units in the layer
• w ∈ U(−σ, σ), with σ = √(1 / N_in) for a linear activation
• In general, σ = √(2 / ((1 + a²) · N_in))
• a = slope of the activation for negative inputs (a = 1 for a linear activation, a = 0 for ReLU)
• Weights of different layers will differ according to the size of each layer and the activation slope
• Specially designed for the ReLU activation function
• Addresses the vanishing gradient problem
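The rule above can be sketched in a few lines. This is a minimal NumPy illustration, not a framework implementation; the helper name `kaiming_uniform` is our own (libraries such as PyTorch ship an equivalent `kaiming_uniform_`):

```python
import numpy as np

def kaiming_uniform(n_in, n_out, a=0.0, seed=0):
    """Draw weights from U(-sigma, sigma), sigma = sqrt(2 / ((1 + a^2) * n_in))."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / ((1.0 + a ** 2) * n_in))
    return rng.uniform(-sigma, sigma, size=(n_in, n_out))

W = kaiming_uniform(512, 256, a=0.0)        # a = 0 for a ReLU layer
print(W.shape)                              # (512, 256)
print(abs(W).max() <= np.sqrt(2.0 / 512))   # True: every weight is bounded by sigma
```

Note how the spread shrinks as the layer's fan-in `n_in` grows, which is exactly what keeps activations from blowing up in wide layers.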
Common Weight Initialization Techniques
• Xavier (Glorot) Initialization:
• Weights are random numbers drawn from a normal distribution with a mean of 0 (a symmetric distribution)
• The variance is set by the total size of the weight matrix (the number of inputs plus the number of outputs)
• w ∼ N(0, σ²), with σ² = 2 / (N_in + N_out)
• Designed to address the exploding gradient problem
• Suitable for feedforward networks with tanh and sigmoid activation functions
• Assumes a linear relationship between the input and output
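As a small NumPy sketch of the formula above (the helper name `xavier_normal` is illustrative; frameworks provide equivalents such as PyTorch's `xavier_normal_`):

```python
import numpy as np

def xavier_normal(n_in, n_out, seed=0):
    """Draw weights from N(0, sigma^2), sigma^2 = 2 / (n_in + n_out)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

W = xavier_normal(400, 100)
# The empirical std should be close to the target sigma = sqrt(2/500) ~ 0.0632
print(round(W.std(), 3))
```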
Common Weight Initialization Techniques
• Freezing Weights
• "Freezing" a layer means switching off gradient updates in that layer
• The weights will not change, and thus the layer will no longer learn
• The main application of freezing is fine-tuning a pre-trained model
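A minimal sketch of the idea with a hand-rolled update loop (in a real framework you would instead set `param.requires_grad = False` in PyTorch or `layer.trainable = False` in Keras; the dictionary layout and dummy gradients here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # pre-trained layer: frozen
W2 = rng.normal(size=(3, 1))   # new head: trainable
frozen = {"W1"}                # parameters excluded from updates

params = {"W1": W1.copy(), "W2": W2.copy()}
lr = 0.1
for step in range(5):
    grads = {"W1": np.ones_like(W1), "W2": np.ones_like(W2)}  # dummy gradients
    for name, g in grads.items():
        if name in frozen:
            continue           # freezing = simply skip the gradient update
        params[name] -= lr * g

print(np.allclose(params["W1"], W1))   # True: frozen weights unchanged
print(np.allclose(params["W2"], W2))   # False: trainable weights moved
```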
Vanishing and Exploding Gradient Problems
• Vanishing gradient • Exploding gradient
• Weights don’t change – no
learning
• Weights change wildly – bad
• Problematic for deep networks solution
• The network never stops learning
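Both failure modes come from the same mechanism: backpropagation multiplies one factor per layer, so repeated small (or large) factors shrink (or blow up) the gradient exponentially with depth. A toy calculation, assuming sigmoid activations (maximum slope 0.25) and a single scalar weight per layer:

```python
def gradient_norm(depth, weight_scale):
    """Magnitude of a gradient after backpropagating through `depth` layers."""
    g = 1.0
    for _ in range(depth):
        g *= weight_scale * 0.25   # 0.25 = maximum slope of the sigmoid
    return g

print(gradient_norm(20, 1.0))   # ~9e-13 -> vanishing gradient
print(gradient_norm(20, 8.0))   # ~1e6   -> exploding gradient
```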
Gradient Descent Algorithm
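A minimal sketch of full-batch gradient descent on a toy least-squares problem (the problem and learning rate are illustrative choices, assuming NumPy):

```python
import numpy as np

# theta <- theta - lr * grad, where grad is computed on the FULL dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

theta = np.zeros(2)
lr = 0.1
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(y)   # gradient of (1/2) * mean squared error
    theta -= lr * grad

print(np.round(theta, 3))   # close to the true parameters [2., -1.]
```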
Stochastic Gradient Descent
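Stochastic gradient descent differs from the full-batch version in one line: the update uses the gradient from a single randomly chosen example instead of the whole dataset. A toy sketch (noiseless problem and hyperparameters are illustrative, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])

theta = np.zeros(2)
lr = 0.05
for epoch in range(20):
    for i in rng.permutation(len(y)):      # shuffle, then update one example at a time
        xi, yi = X[i], y[i]
        grad = (xi @ theta - yi) * xi      # gradient from ONE example
        theta -= lr * grad

print(np.round(theta, 2))   # approximately [2., -1.]
```

The per-example updates are noisy, which is why SGD typically needs a smaller learning rate than full-batch gradient descent.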
Mini-Batch Gradient Descent
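Mini-batch gradient descent sits between the two extremes: the gradient is averaged over a small batch, trading a little noise for much cheaper updates than full-batch. A toy sketch with an illustrative batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])

theta = np.zeros(2)
lr, batch_size = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(y))                    # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)   # average over the batch
        theta -= lr * grad

print(np.round(theta, 3))   # close to [2., -1.]
```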
Courtesy: Deep learning in Malayalam
Concept of Regularization
• Overfitting is an important problem in neural network training
• We use regularization to avoid overfitting
• Regularization redefines the loss function by adding a penalty term
• J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
• Two types
• L1 (Lasso) Regularization
• L2 (Ridge) Regularization
L2 Regularization
• Simplest and most common form of regularization
• Commonly known as 'weight decay'
• Also known as Ridge regression or Tikhonov regularization
• This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = (1/2)‖w‖²
• To simplify, we assume no bias parameters, so θ is just w
• Equivalently:
• J(θ)_L2 = J(θ) + λ Σᵢ₌₁ⁿ wᵢ²
• J(θ) – original cost function without regularization
• wᵢ – weights of the model
• n – total number of weights
• λ – regularization hyperparameter
• L2 regularization encourages all parameters to be small, but not exactly zero
• It prevents the model from becoming too sensitive to small variations in the training data
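The penalized objective above is easy to compute directly. A minimal sketch (the function name and the tiny example values are illustrative):

```python
import numpy as np

def l2_regularized_loss(residuals, w, lam):
    """J_L2 = J(theta) + lam * sum(w_i^2), with J taken as the mean squared residual."""
    return np.mean(residuals ** 2) + lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.1])
r = np.array([0.2, -0.1])
print(l2_regularized_loss(r, w, lam=0.0))   # 0.025: unregularized loss
print(l2_regularized_loss(r, w, lam=0.1))   # 0.025 + 0.1 * 4.26 = 0.451
```

Larger λ pushes the optimizer harder toward small weights; λ = 0 recovers the original loss.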
L1 Regularization
• Also known as Lasso regularization
• L1 regularization adds a penalty term to the loss function that is proportional to the absolute values of the model's parameters
• Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|
• The full objective is
• J(θ)_L1 = J(θ) + λ Σᵢ₌₁ⁿ |wᵢ|
• J(θ) – original cost function without regularization
• wᵢ – weights of the model
• n – total number of weights
• λ – regularization hyperparameter that controls the strength of the penalty
• L1 regularization encourages the model to have sparse weights, i.e., many parameters become exactly zero
• It selects a subset of important features and can be used for feature selection
• It helps prevent overfitting by making the model simpler
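Why does L1 produce exactly-zero weights while L2 does not? One common way to optimize an L1-penalized objective is proximal (soft-threshold) updates, which snap any weight smaller than the threshold to exactly 0. This is a sketch of that mechanism (the soft-threshold step is a standard Lasso solver technique, not something named on the slide):

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink each weight toward 0 by t; weights within t of 0 become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.03, -0.5, 0.002, 1.2, -0.04])
print(soft_threshold(w, t=0.05))   # [ 0.   -0.45  0.    1.15  0.  ]
```

Three of the five weights land on exactly zero, which is the sparsity (and feature selection) effect described above.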
Methods of Regularization
• Early stopping
• Dataset Augmentation
• Parameter tying and sharing
• Ensemble methods
• Dropout
• Batch Normalization
Early Stopping
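Early stopping halts training when the validation loss stops improving for a set number of epochs (the "patience"), then keeps the best checkpoint. A minimal sketch over a pre-recorded validation curve (function name and the toy loss values are illustrative):

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return (best_epoch, best_loss); stop after `patience` epochs without improvement."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: save checkpoint here
        else:
            waited += 1
            if waited >= patience:
                break                                    # validation loss keeps rising: stop
    return best_epoch, best

# Validation loss improves, then starts rising -> stop and keep epoch 3
print(train_with_early_stopping([1.0, 0.6, 0.4, 0.35, 0.36, 0.4, 0.5]))  # (3, 0.35)
```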
Dataset Augmentation
• The best way to make an ML model
generalize better is to give more training
data
• However, the amount of data we have is
limited
• Create fake data and add it into the
training set
• This approach is easiest for classification
• We can generate new data by
transferring the input data in the
training set
• Image dataset can be easily created by
transferring pixels in each direction
• Creating fake data and injecting noise to
the data are also considered as dataset
augmentation
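The pixel-shift augmentation mentioned above can be sketched directly (a minimal zero-padded shift; real pipelines would use a library such as torchvision or albumentations):

```python
import numpy as np

def shift_image(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling vacated pixels with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

img = np.arange(9).reshape(3, 3)
print(shift_image(img, 1, 0))   # rows shifted down by one; top row becomes zeros
```

Each shifted copy keeps the original label, so a handful of shifts multiplies the effective size of a classification dataset.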
Parameter Tying and Sharing
• Techniques used in deep learning to reduce the number of parameters in a neural network model while maintaining its capacity to capture complex patterns in data
• They help mitigate the risk of overfitting and can make models more efficient
• Parameter tying involves using the same set of parameters for multiple layers of a neural network
• It essentially enforces that certain weights are identical or constrained in a particular way
• Examples: embeddings, weight sharing in CNNs
• Parameter sharing is a specific form of parameter tying where the same set of parameters is used across different layers or units in a neural network
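Weight sharing in a CNN is the canonical example: a 1-D convolution applies the SAME small kernel at every input position, so the layer needs only as many parameters as the kernel has taps, regardless of the input size. A minimal sketch (kernel values are illustrative):

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution: one shared kernel slides across all positions."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])   # 3 shared parameters for all output positions
print(conv1d(x, kernel))               # [2. 3. 4.]
```

A fully connected layer mapping 5 inputs to 3 outputs would need 15 weights; the shared kernel needs only 3.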
Ensemble Methods
• Bagging
• Boosting
• Stacking
• Random Forests
• Neural Network ensembles
Dropout
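Dropout randomly zeroes a fraction of activations during training so that no unit can rely on any particular co-activated neighbor. A minimal sketch of "inverted" dropout, the common formulation that rescales by 1/keep_prob so the expected activation matches test time (when dropout is switched off):

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Zero each activation with probability (1 - keep_prob); rescale survivors."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones(10000)
out = dropout(x, keep_prob=0.8, rng=rng)
print(round(out.mean(), 2))   # ~1.0: expected activation is preserved
print((out == 0).mean())      # ~0.2 of the units were dropped
```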
Batch Normalization
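The core of batch normalization is standardizing each feature over the mini-batch, then applying a learnable scale (γ) and shift (β). A minimal sketch of the training-time forward pass (omitting the running statistics a real layer tracks for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(y.mean(axis=0), 6))   # ~0 for every feature
print(np.round(y.std(axis=0), 3))    # ~1 for every feature
```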
Thank You!