Deep Learning - Week 4
1. Using the Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 10^−8, what would be
the bias-corrected first moment estimate after the first update if the initial gradient
is 4?
(a) 0.4
(b) 4.0
(c) 3.6
(d) 0.44
Correct Answer: (b)
Solution: In Adam, the first moment estimate is calculated as:
m_t = β1 * m_{t−1} + (1 − β1) * g_t
For the first update, m_0 = 0, so:
m_1 = 0.9 * 0 + 0.1 * 4 = 0.4
The bias-corrected first moment is:
m̂_t = m_t / (1 − β1^t)
m̂_1 = 0.4 / (1 − 0.9^1) = 0.4 / 0.1 = 4.0
Therefore, the bias-corrected first moment estimate after the first update is 4.0.
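The two-step computation can be checked numerically; a minimal sketch using the values from the question:

```python
beta1 = 0.9          # Adam's first-moment decay rate
m = 0.0              # m_0 is initialized to zero
g1 = 4.0             # first gradient

m = beta1 * m + (1 - beta1) * g1    # m_1 = 0.4
m_hat = m / (1 - beta1 ** 1)        # bias correction: 0.4 / 0.1 = 4.0
```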
2. In a mini-batch gradient descent algorithm, if the total number of training samples
is 50,000 and the batch size is 100, how many iterations are required to complete 10
epochs?
(a) 5,000
(b) 50,000
(c) 500
(d) 5
Correct Answer: (a)
Solution: Let’s break this down step by step:
1) Number of batches per epoch = Total samples / Batch size = 50,000 / 100 = 500 batches
2) Number of iterations for 10 epochs = Batches per epoch × Number of epochs = 500 × 10 = 5,000 iterations
Therefore, 5,000 iterations are required to complete 10 epochs.
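The same arithmetic as a sketch:

```python
total_samples = 50_000
batch_size = 100
epochs = 10

batches_per_epoch = total_samples // batch_size   # 500 updates per epoch
iterations = batches_per_epoch * epochs           # 5,000 updates in total
```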
3. In a stochastic gradient descent algorithm, the learning rate starts at 0.1 and decays
exponentially with a decay rate of 0.1 per epoch. What will be the learning rate after
5 epochs?
(a) 0.09
(b) 0.059
(c) 0.05
(d) 0.061
Correct Answer: (b)
Solution: A decay rate of 0.1 per epoch multiplies the learning rate by (1 − 0.1) = 0.9 at every epoch, so after t epochs:
η_t = η_0 * (1 − k)^t
where η_0 is the initial learning rate, k is the decay rate, and t is the number of epochs.
Plugging in the values:
η_5 = 0.1 * (0.9)^5 = 0.1 * 0.59049 ≈ 0.059
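Reading the decay rate of 0.1 as multiplying the learning rate by 0.9 each epoch (the interpretation that reproduces the keyed answer 0.059), a minimal check:

```python
eta0 = 0.1   # initial learning rate
k = 0.1      # decay rate per epoch

# learning rate after 5 epochs of per-epoch multiplicative decay
eta_5 = eta0 * (1 - k) ** 5   # 0.1 * 0.9^5 = 0.059049
```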
4. In the context of Adam optimizer, what is the purpose of bias correction?
(a) To prevent overfitting
(b) To speed up convergence
(c) To correct for the bias in the estimates of first and second moments
(d) To adjust the learning rate
Correct Answer: (c)
Solution: In Adam optimizer, bias correction is used to correct for the bias in the
estimates of first and second moments. This is particularly important in the early
stages of training when the moving averages are biased towards zero due to their
initialization.
5. The figure below shows the contours of a surface.
Suppose that a man walks, from −1 to +1, along both the horizontal (x) axis and the
vertical (y) axis. The statement that the man would have seen the slope change more
rapidly along the x-axis than along the y-axis is:
(a) True
(b) False
(c) Cannot say
Correct Answer: (a)
Solution: In a contour plot, the closeness of the contour lines indicates the rate of
change of the function. Since the contours in the figure are more closely spaced along
the x-axis than along the y-axis, the function changes more rapidly in the x-direction.
This means that a person walking from x = −1 to x = 1 would experience steeper slope
changes than a person walking along the y-axis. Therefore, the statement that the
slope changes more rapidly along the x-axis than the y-axis is True.
6. What is the primary benefit of using Adagrad compared to other optimization algorithms?
(a) It converges faster than other optimization algorithms.
(b) It is more memory-efficient than other optimization algorithms.
(c) It is less sensitive to the choice of hyperparameters (learning rate).
(d) It is less likely to get stuck in local optima than other optimization algorithms.
Correct Answer: (c)
Solution: The main advantage of Adagrad over other optimization algorithms is that it
adapts the learning rate for each parameter individually, which makes it less sensitive
to the choice of the global learning rate hyperparameter.
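Adagrad's insensitivity to the global learning rate comes from accumulating squared gradients per parameter; a minimal single-parameter sketch (the default eta and eps values here are illustrative):

```python
import math

def adagrad_step(w, g, cache, eta=0.1, eps=1e-8):
    """One Adagrad update for a single parameter.

    `cache` accumulates squared gradients, so the effective step size
    eta / (sqrt(cache) + eps) shrinks for parameters that have seen
    large gradients, reducing sensitivity to the choice of eta.
    """
    cache += g * g
    w -= eta * g / (math.sqrt(cache) + eps)
    return w, cache
```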
7. What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?
(a) SGD converges more quickly than vanilla gradient descent.
(b) SGD is computationally efficient for large datasets.
(c) SGD theoretically guarantees that the descent direction is optimal.
(d) SGD experiences less oscillation compared to vanilla gradient descent.
Correct Answer: (a),(b)
Solution: SGD updates the weights after every training sample, so it makes many updates
per epoch and typically converges faster. Since each update is far cheaper to compute
than a full-batch gradient, it also works well for large datasets.
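A minimal sketch of the difference in update frequency (the `grad_fn` interface and the squared-error gradient used in the usage below are illustrative assumptions, not from the question):

```python
import random

def sgd_epoch(w, data, grad_fn, eta=0.1):
    """One epoch of SGD: one cheap update per sample, so the weights
    move len(data) times per epoch; vanilla gradient descent would
    compute the gradient over all of `data` and move only once."""
    random.shuffle(data)
    for x in data:
        w -= eta * grad_fn(w, x)
    return w
```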
8. Select the true statements about the factor β used in the momentum-based gradient descent algorithm.
(a) Setting β = 0.1 allows the algorithm to move faster than the vanilla gradient
descent algorithm
(b) Setting β = 0 makes it equivalent to the vanilla gradient descent algorithm
(c) Setting β = 1 makes it equivalent to the vanilla gradient descent algorithm
(d) Oscillation around the minimum will be less if we set β = 0.1 than setting
β = 0.99
Correct Answer: (a), (b), (d)
Solution: Let’s analyze the statements about the factor β used in the momentum-based
gradient descent algorithm.
Momentum-based gradient descent updates the weights using the following rule:
v_{t+1} = β v_t + (1 − β) ∇w_t
w_{t+1} = w_t − η v_{t+1}
where:
- v_t is the velocity (momentum term),
- β is the momentum factor,
- ∇w_t is the gradient of the loss with respect to the weights at time t,
- η is the learning rate.
Setting β = 0.1 allows the algorithm to move faster than the vanilla (plain)
gradient descent algorithm: When β is set to a small positive value like 0.1, the
algorithm incorporates some momentum, which can help accelerate convergence by
navigating more effectively through shallow regions of the loss surface. This statement
is generally true.
Setting β = 1 makes it equivalent to the vanilla gradient descent algorithm:
If β = 1, the velocity term v_{t+1} depends solely on the previous velocity v_t and
does not incorporate the current gradient ∇w_t, which effectively stalls the learning
process. The claim that it is equivalent to vanilla gradient descent (which does not
use momentum) is therefore incorrect: vanilla gradient descent updates the weights
purely based on the gradient, without momentum.
Setting β = 0 makes it equivalent to the vanilla gradient descent algorithm:
When β = 0, the velocity term v_{t+1} equals the current gradient ∇w_t. This reduces
the momentum-based update to the plain gradient descent update rule. Thus, this
statement is true.
Oscillation around the minimum will be less if we set β = 0.1 than setting
β = 0.99: Higher values of β (close to 1) carry more momentum, which can cause
larger oscillations around the minimum due to the higher inertia. A lower value of
β like 0.1 carries less momentum, leading to reduced oscillations. Therefore, this
statement is true.
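The β = 0 case can be checked directly against the update rule v_{t+1} = β v_t + (1 − β) ∇w_t given in the solution; a minimal sketch:

```python
def momentum_step(w, v, grad, beta=0.9, eta=0.1):
    # update rule from the solution: v <- beta * v + (1 - beta) * grad
    v = beta * v + (1 - beta) * grad
    w = w - eta * v
    return w, v

def vanilla_step(w, grad, eta=0.1):
    # plain gradient descent: no velocity term
    return w - eta * grad
```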
9. What is the advantage of using mini-batch gradient descent over batch gradient descent?
(a) Mini-batch gradient descent is more computationally efficient than batch gradient descent.
(b) Mini-batch gradient descent leads to a more accurate estimate of the gradient
than batch gradient descent.
(c) Mini batch gradient descent gives us a better solution.
(d) Mini-batch gradient descent can converge faster than batch gradient descent.
Correct Answer: (a),(d)
Solution: The advantage of using mini-batch gradient descent over batch gradient
descent is that it is more computationally efficient, allows for parallel processing of
the training examples, and can converge faster than batch gradient descent.
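A minimal sketch of one mini-batch epoch; averaging over each batch is what makes every update both cheap and parallelizable (the `grad_fn` interface here is an illustrative assumption):

```python
def minibatch_epoch(w, data, grad_fn, batch_size=100, eta=0.1):
    """One epoch of mini-batch gradient descent: one update per batch.
    Each batch gradient averages over batch_size samples, which is
    cheaper than a full-batch gradient and easy to parallelize."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        g = sum(grad_fn(w, x) for x in batch) / len(batch)
        w -= eta * g
    return w
```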
10. In the Nesterov Accelerated Gradient (NAG) algorithm, the gradient is computed at:
(a) The current position
(b) A “look-ahead” position
(c) The previous position
(d) The average of current and previous positions
Correct Answer: (b)
Solution: In NAG, the gradient is computed at a “look-ahead” position, obtained by
applying the momentum step to the current position. This gives the algorithm a form
of “prescience” about where the parameters are heading, which can lead to improved
convergence compared to standard momentum.
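The look-ahead evaluation can be sketched for a single parameter; the quadratic `grad_fn` and the particular update form u ← β u + η ∇w_lookahead are illustrative assumptions:

```python
def nag_step(w, u, grad_fn, beta=0.9, eta=0.1):
    """One NAG update: the gradient is evaluated at the look-ahead
    point w - beta * u, not at the current position w."""
    lookahead = w - beta * u      # apply the momentum step first
    g = grad_fn(lookahead)        # "prescient" gradient
    u = beta * u + eta * g
    return w - u, u
```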