
FOUNDATIONS OF DEEP LEARNING

SURYA C
DEPT. OF ADS

Module 2
Training Deep Learning Models

Set up and Initialization Issues

Set up and Initialization Issues
• Complex, multi-dimensional loss landscapes have few local minima

• Random weight initialization gives you enough variability

• With random weights, some connections will be strengthened and some weakened

• Small initial values lead to vanishing gradients, and large values lead to exploding gradients

• It is important to set the variance of the weights in inverse proportion to the size of each layer within the network

• Initializing weights is more important than initializing biases
Common Weight Initialization Techniques
• Kaiming:
  • Uniform distribution
  • Standard deviation is inversely proportional to the square root of the number of input units in the layer
  • w ∈ [−σ, σ], where σ = 1/√(N_in)
  • With activation slope a: σ = √((1 + a²) / N_in)
  • a = slope of the activation for negative inputs (1 for a linear activation, 0 for ReLU)
  • Weights of different layers will differ according to the size of each layer and the activation slope
  • Specially designed for the ReLU activation function
  • Addresses the problem of vanishing gradients
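The bullets above can be sketched in NumPy. The helper below follows the slide's formulas directly; note that library implementations (for example PyTorch's `kaiming_uniform_`) use a somewhat different gain and bound, so treat this as an illustration rather than the canonical recipe.

```python
import numpy as np

def kaiming_uniform(n_in, n_out, a=0.0, seed=0):
    """Kaiming-style uniform init per the slide: w ~ U(-sigma, sigma)
    with sigma = sqrt((1 + a^2) / N_in); a = 0 gives sigma = 1/sqrt(N_in)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt((1.0 + a ** 2) / n_in)
    return rng.uniform(-sigma, sigma, size=(n_in, n_out))

W = kaiming_uniform(256, 128)   # a ReLU layer with 256 inputs (a = 0)
```

Because σ shrinks as N_in grows, wider layers start with smaller weights, which is exactly the per-layer scaling the previous slide calls for.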
Common Weight Initialization Techniques
• Xavier:
  • Weights are random numbers drawn from a normal distribution with a mean of 0 and a symmetric distribution
  • The variance is set by the total size of the weight matrix (total number of inputs and outputs)
  • w ∼ N(0, σ²), where σ² = 2 / (N_in + N_out)
  • Designed to address the exploding gradient problem
  • Suitable for feedforward networks with tanh and sigmoid activation functions
  • Assumes a linear relationship between the input and output
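A minimal NumPy sketch of the formula above, with the variance set by the combined fan-in and fan-out:

```python
import numpy as np

def xavier_normal(n_in, n_out, seed=0):
    """Xavier/Glorot init: w ~ N(0, sigma^2), sigma^2 = 2 / (N_in + N_out)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

W = xavier_normal(300, 100)     # sigma^2 = 2 / 400
```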
Common Weight Initialization Techniques
• Freezing Weights
  • "Freezing" a layer means switching off gradient descent in that layer
  • The weights will not change, and thus the layer will no longer learn
  • The main application of freezing is when we are using a pre-trained model
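A framework-agnostic NumPy sketch of the idea: the update step simply skips any layer whose name is in the frozen set. In a framework such as PyTorch, the same effect is obtained by setting a parameter's `requires_grad` attribute to `False`.

```python
import numpy as np

rng = np.random.default_rng(0)
params = {"layer1": rng.normal(size=(4, 4)), "layer2": rng.normal(size=(4, 2))}
frozen = {"layer1"}                  # e.g. pre-trained layers we keep fixed

def sgd_step(params, grads, frozen, lr=0.1):
    """Apply a gradient step, skipping frozen layers entirely."""
    for name, g in grads.items():
        if name in frozen:           # gradient descent switched off here
            continue
        params[name] -= lr * g

before1 = params["layer1"].copy()
before2 = params["layer2"].copy()
grads = {name: np.ones_like(w) for name, w in params.items()}
sgd_step(params, grads, frozen)
```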
Vanishing and Exploding Gradient Problems
• Vanishing gradient
  • Weights don't change – no learning
  • Problematic for deep networks
• Exploding gradient
  • Weights change wildly – bad solution
  • The network never stops learning
Gradient Descent Algorithm

Stochastic Gradient Descent

Mini-Batch Gradient Descent
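Since the slides in this section were figures, here is a small NumPy sketch of mini-batch gradient descent on a linear least-squares problem; `batch_size=1` recovers stochastic gradient descent and `batch_size=n` recovers full-batch gradient descent.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=16, epochs=200, seed=0):
    """Fit linear weights w by mini-batch gradient descent on MSE loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                      # step on the batch gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
w = minibatch_gd(X, X @ true_w)                 # recovers true_w
```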
Courtesy: Deep learning in Malayalam
Concept of Regularization
• Overfitting is an important problem in neural network training
• We use regularization to avoid overfitting
• It redefines the loss function by adding a penalty term:
• J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
• Two types
  • L1 (Lasso) Regularization
  • L2 (Ridge) Regularization
L2 Regularization
• Simplest and most common form of regularization
• Commonly known as 'weight decay'
• Also known as Ridge regression or Tikhonov regularization
• This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = (1/2)‖w‖²₂
• To simplify, we assume no bias parameters, so θ is just w
• The regularized loss is
• J(θ)_L2 = J(θ) + λ Σᵢ wᵢ²
• J(θ) – original cost function without regularization
• wᵢ – weights of the model
• n – total number of weights
• λ – regularization hyperparameter
• L2 regularization encourages all parameters to be small, but not exactly zero
• It prevents the model from becoming too sensitive to small variations in the training data
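A small NumPy sketch of the formula above, using mean squared error as the base cost J(θ): the penalty's gradient, 2λw, shrinks every weight a little each step (hence "weight decay"), leaving all weights small but nonzero.

```python
import numpy as np

def l2_grad(w, X, y, lam):
    """Gradient of J(w) + lam * sum(w_i^2), with J = mean squared error."""
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w  # decay term: 2*lam*w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5)                   # true weights are all 1.0
w = np.zeros(5)
for _ in range(500):
    w -= 0.05 * l2_grad(w, X, y, lam=0.1)

w_unreg = np.linalg.lstsq(X, y, rcond=None)[0]   # un-penalized solution
```

The regularized weights end up with a smaller norm than the un-penalized least-squares fit, but none of them is driven exactly to zero.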
L1 Regularization
• Also known as Lasso regularization
• L1 regularization adds a penalty term to the loss function that is proportional to the absolute value of the model's parameters
• Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|
• The regularized loss is
• J(θ)_L1 = J(θ) + λ Σᵢ |wᵢ|
• J(θ) – original cost function without regularization
• wᵢ – weights of the model
• n – total number of weights
• λ – regularization hyperparameter that controls the strength of the penalty
• L1 regularization encourages the model to have sparse weights, i.e., many parameters become exactly zero
• It selects a subset of important features and can be used for feature selection
• It helps prevent overfitting by making the model simpler
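To show the sparsity claim concretely, here is a NumPy sketch using proximal gradient descent (ISTA) with soft-thresholding, a standard optimizer for L1 penalties; plain gradient descent on |w| would not produce exact zeros. The orthogonalized design matrix is contrived so the result is easy to verify by hand.

```python
import numpy as np

def ista_step(w, X, y, lam, lr):
    """Gradient step on the MSE, then soft-threshold: the proximal
    operator of lam * sum(|w_i|), which sets small weights exactly to 0."""
    g = 2 * X.T @ (X @ w - y) / len(y)
    w = w - lr * g
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(200, 10)))
X = Q * np.sqrt(200)                 # columns scaled so X.T @ X / n = I
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 1.0]        # only 3 informative features
y = X @ true_w
w = np.zeros(10)
for _ in range(1000):
    w = ista_step(w, X, y, lam=0.4, lr=0.1)
```

The seven irrelevant weights come out exactly zero (feature selection), while the three informative ones survive, shrunk by λ/2 = 0.2 each.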
Methods of Regularization
• Early stopping
• Dataset Augmentation
• Parameter tying and sharing
• Ensemble methods
• Dropout
• Batch Normalization

Early Stopping

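The slide for this section was a figure; a framework-agnostic sketch of the idea (the function names here are illustrative) is to keep training while the validation loss improves and stop after `patience` epochs without improvement. In practice one also restores the weights from the best epoch.

```python
import numpy as np

def train_with_early_stopping(train_one_epoch, val_loss,
                              max_epochs=100, patience=5):
    """Run epochs until validation loss stops improving for `patience` epochs."""
    best, wait = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        loss = val_loss()
        if loss < best - 1e-8:       # improved: remember it, reset counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:     # patience exhausted: stop early
                break
    return epoch

# Toy run: loss improves for 10 epochs, then plateaus at 0.1.
losses = iter([1.0 / (t + 1) if t < 10 else 0.1 for t in range(100)])
epochs_run = train_with_early_stopping(lambda: None, lambda: next(losses))
```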
Dataset Augmentation
• The best way to make an ML model generalize better is to give it more training data
• However, the amount of data we have is limited
• Create fake data and add it to the training set
• This approach is easiest for classification
• We can generate new data by transforming the input data in the training set
• An image dataset can easily be enlarged by shifting pixels in each direction
• Creating synthetic data and injecting noise into the data are also considered dataset augmentation
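A minimal NumPy sketch of the pixel-shift idea above: each image yields four translated copies (zero-padded at the border) that keep the original label, growing the training set fivefold.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, zero-padding the border."""
    h, w = img.shape
    out = np.zeros_like(img)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def augment(images, labels):
    """Append one-pixel shifts in all four directions, labels preserved."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    extra = [shift_image(im, dy, dx) for im in images for dy, dx in shifts]
    extra_lbls = [lb for lb in labels for _ in shifts]
    return (np.concatenate([images, np.stack(extra)]),
            np.concatenate([labels, extra_lbls]))

imgs = np.zeros((2, 5, 5))
imgs[:, 2, 2] = 1.0                  # single bright pixel at the center
X_aug, y_aug = augment(imgs, np.array([0, 1]))
```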
Parameter Tying and Sharing
• Techniques used in deep learning to reduce the number of parameters in a neural network model while maintaining its capacity to capture complex patterns in data
• They help mitigate the risk of overfitting and can make models more efficient
• Parameter tying involves using the same set of parameters for multiple layers of a neural network
• It essentially enforces that certain weights are identical or constrained in a particular way
  • Embeddings
  • Weight sharing in CNNs
• Parameter sharing is a specific form of parameter tying where the same set of parameters is used across different layers or units in a neural network
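Weight sharing as in a CNN can be illustrated with a tiny 1-D convolution: the same 3-parameter kernel is reused at every output position, instead of the 8 × 6 = 48 weights a dense layer would need for the same input and output sizes.

```python
import numpy as np

def conv1d(x, kernel):
    """Slide one shared kernel across x: every output reuses the
    same len(kernel) parameters (parameter sharing)."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.arange(8, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])  # 3 shared parameters in total
out = conv1d(x, kernel)              # 6 outputs from only 3 weights
```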
Ensemble Methods
• Bagging
• Boosting
• Stacking
• Random Forests
• Neural Network ensembles

Dropout

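The dropout slide was a figure; here is a NumPy sketch of inverted dropout, the variant most frameworks use: during training each unit is zeroed with probability p and the survivors are scaled by 1/(1−p) so the expected activation is unchanged, and at test time the layer is the identity.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero units w.p. p, rescale the rest by 1/(1-p)."""
    if not training:
        return x                     # no-op at inference time
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, p=0.5, rng=rng)       # roughly half zeros, half 2.0
```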
Batch Normalization

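The batch normalization slide was a figure; a training-mode sketch of the idea is below: normalize each feature to zero mean and unit variance across the mini-batch, then apply a learned scale γ and shift β. A full implementation also tracks running statistics for use at inference time.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)              # per-feature batch mean
    var = x.var(axis=0)              # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```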
Thank You………………….
