Module 2: Training Deep Models

Introduction, setup and initialization - Kaiming and Xavier weight
initializations, vanishing and exploding gradient problems.
Optimization techniques - Gradient Descent (GD), Stochastic GD,
GD with momentum, GD with Nesterov momentum, AdaGrad,
RMSProp, Adam. Regularization techniques - L1 and L2
regularization, early stopping, dataset augmentation, parameter
tying and sharing, ensemble methods, dropout, batch normalization.

Vanishing and Exploding Gradient Problems in Deep Learning:
These problems occur during backpropagation, the process
used to train deep neural networks.
In backpropagation, we compute gradients (partial
derivatives) layer by layer, moving from the output layer back
toward the input layer. These gradients are used to update the
weights in the network using optimization algorithms like
gradient descent.
When you multiply many gradients together (one per layer),
two things can happen:
1. Vanishing Gradients
 If the derivatives (gradients) at each layer are small
numbers (e.g., 0.1), then when multiplied across many
layers, the final gradient becomes extremely small—
almost zero.
 As a result, the earlier layers (closer to the input) get
almost no signal to update their weights.
 This makes the network unable to learn from the data.
 In extreme cases, gradients can become exactly zero,
stopping learning entirely.

2. Exploding Gradients
 If the derivatives at each layer are large numbers (e.g.,
>1), the product of many such derivatives becomes
extremely large.
 This causes huge weight updates and makes the model
unstable—loss becomes NaN, or weights blow up.
 The model cannot converge or learn effectively.
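To see the effect numerically, here is a toy Python sketch (the depth of 30 layers and the per-layer gradient factors 0.1 and 2.0 are illustrative assumptions, not values from the notes):

# Toy illustration: one gradient factor per layer, multiplied across depth.
depth = 30
small_grad = 0.1   # e.g. a saturated sigmoid derivative
large_grad = 2.0   # e.g. a derivative amplified by large weights

vanishing = small_grad ** depth   # ~1e-30: earlier layers get almost no signal
exploding = large_grad ** depth   # ~1e+9: weight updates blow up

print(f"product of {depth} small gradients: {vanishing:.3e}")
print(f"product of {depth} large gradients: {exploding:.3e}")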

To avoid these problems, deep learning practitioners use:


 Activation functions like ReLU (instead of sigmoid or
tanh)
 Batch normalization to stabilize layer inputs
 Gradient clipping to limit very large gradients
 Proper weight initialization (e.g., Xavier, He
initialization)
 Skip connections (e.g., in ResNet) to ease gradient flow
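A minimal PyTorch sketch combining two of these remedies is given below; the layer sizes, learning rate, and clipping threshold are illustrative assumptions, not prescriptions from the notes:

import torch
import torch.nn as nn

# Small ReLU network; the layer sizes are assumed for illustration.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

# He (Kaiming) initialization, which suits ReLU activations.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 64), torch.randn(32, 1)   # dummy batch

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient clipping: rescale gradients so their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()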

What are Bias and Variance?
These are two types of errors that affect how well your model
learns and generalizes.

1. Bias
 Bias is error due to incorrect assumptions in the learning
algorithm.
 A high bias model is too simple to capture the true
pattern in the data.
Example:
 Trying to fit a straight line (linear model) to a complex
curve → the model underfits.
 High bias → Underfitting
 Low accuracy on both training and test data.
2. Variance
 Variance is error due to sensitivity to small changes in
training data.
 A high variance model is too complex and learns noise
as if it were signal.

 A high variance model fits the training data perfectly but
performs poorly on test data.
 High variance → Overfitting
 High training accuracy but low test accuracy.
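As a hedged illustration, the scikit-learn sketch below fits a degree-1 (high bias) and a degree-15 (high variance) polynomial to the same noisy data; the dataset, degrees, and split are made up for this example, and score() reports R^2 rather than classification accuracy:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy samples from a curved function (synthetic data, for illustration only).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):   # degree 1: underfits (high bias); degree 15: overfits (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, "train:", round(model.score(X_tr, y_tr), 3),
          "test:", round(model.score(X_te, y_te), 3))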

Weight Initialization
Key points:
1. Weights should be small.
2. Weights should not all be the same (symmetry must be broken).
3. Weights should have good variance.
1. Xavier Initialization (also called Glorot Initialization)
Used for: activation functions like Sigmoid and Tanh.
Goal: keep the variance of outputs and gradients the same
across layers, so the model trains smoothly.
Formula:
If a layer has n_in inputs and n_out outputs, the weights are drawn
with variance
    Var(W) = 2 / (n_in + n_out)
for example W ~ N(0, 2 / (n_in + n_out)), or uniformly in
[-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))].
Why it helps:
 Sigmoid and Tanh can saturate if values are too high/low.
 Xavier keeps activations within a useful range to avoid this.
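A minimal NumPy sketch of Xavier (normal) initialization follows; the layer sizes are assumed only for illustration:

import numpy as np

def xavier_init(n_in, n_out, seed=0):
    """Glorot/Xavier normal initialization: Var(W) = 2 / (n_in + n_out)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

# Example layer sizes (assumed, not from the notes).
W = xavier_init(256, 128)
print(W.std(), np.sqrt(2.0 / (256 + 128)))   # empirical std is close to the target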
2. He Initialization (also called Kaiming Initialization)
Used for: activation functions like ReLU or Leaky ReLU.
Formula:
If a layer has n_in input neurons, the weights are drawn with variance
    Var(W) = 2 / n_in
for example W ~ N(0, 2 / n_in).
 Ensures that enough signal passes through ReLU without vanishing.
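A short NumPy sketch (the batch size and layer width are illustrative assumptions) of why this scaling suits ReLU: with Var(W) = 2 / n_in, the mean squared activation after a ReLU layer stays close to that of the input, so the signal does not shrink with depth:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 512             # layer width assumed for illustration
x = rng.normal(size=(1000, n_in))  # inputs with unit mean square

W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
h = np.maximum(x @ W_he, 0.0)      # ReLU layer

# Both printed values are close to 1: the He-scaled ReLU layer preserves
# the mean squared activation instead of shrinking it.
print((x ** 2).mean(), (h ** 2).mean())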
OPTIMIZATION TECHNIQUES
Optimizers are algorithms or methods used to change the attributes
of a neural network, such as its weights and learning rate, in order
to reduce the loss.
• There are different types of optimization techniques. Some of
them are:
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSProp
9. Adam
10. Nadam
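As a hedged illustration of the basic update rules behind the first two families, here is a tiny Python sketch on a one-dimensional quadratic loss L(w) = (w - 3)^2; the loss, learning rate, momentum coefficient, and step count are all assumptions made only for this example:

# Gradient of the toy loss L(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

# Plain gradient descent: w <- w - lr * dL/dw
w, lr = 0.0, 0.1
for _ in range(50):
    w = w - lr * grad(w)
print("GD:", w)          # moves toward the minimizer w = 3

# GD with momentum: a velocity term accumulates past gradients.
w, v, beta = 0.0, 0.0, 0.9
for _ in range(50):
    v = beta * v + grad(w)
    w = w - lr * v
print("Momentum:", w)    # also moves toward w = 3, along a smoothed path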
