CSE 151B/251B
Deep Learning
Rose Yu
ANNOUNCEMENT
Kaggle Group Signup
• Sign up as a group before April 18th
• Use Piazza's Search for Teammates function
• Sign up on the Google sheet first
• After that, register the info on Canvas to receive AWS
OPTIMIZATION
Learning as Optimization
[Figure: learning as optimization — the loss plotted as a function of a weight parameter]
to learn the weights, we need the derivative of the loss w.r.t. the weight
i.e. “how should the weight be updated to decrease the loss?”
$w = w - \alpha \frac{\partial L}{\partial w}$
with multiple weights, we need the gradient of the loss w.r.t. the weights
$w = w - \alpha \nabla_w L$, where $\alpha$ is the learning rate
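To make the update rule concrete, here is a minimal gradient-descent sketch on a 1-D quadratic loss (the loss, starting point, and learning rate are illustrative assumptions, not course code):

```python
# Minimal gradient descent sketch on the 1-D loss L(w) = (w - 3)^2.
# Loss, starting point, and learning rate are illustrative choices.

def loss(w):
    return (w - 3.0) ** 2

def dloss_dw(w):
    return 2.0 * (w - 3.0)          # dL/dw

w = 0.0                             # initial weight
alpha = 0.1                         # learning rate
for step in range(50):
    w = w - alpha * dloss_dw(w)     # w <- w - alpha * dL/dw

print(w, loss(w))                   # w approaches 3, the minimizer of L
```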
Gradient Descent
Convexity
Traditional loss functions are often assumed to be convex:
$L(x_2) - L(x_1) \ge \nabla L(x_1)^\top (x_2 - x_1)$
Easy to find global optima!
Strictly convex if the inequality is strict ($>$) for all $x_1 \ne x_2$
Convexity
• All local optima are global optima
• Strictly convex: unique global optimum
• Stochastic gradient descent will find the global optimum with a good learning rate
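A quick numerical sanity check of the first-order condition above, using an assumed convex quadratic $L(x) = x^\top A x$ with a positive semi-definite $A$ (purely illustrative, not from the slides):

```python
# Verify L(x2) - L(x1) >= grad L(x1)^T (x2 - x1) for a convex quadratic.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B.T @ B                          # symmetric positive semi-definite => L is convex

def L(x):
    return x @ A @ x

def grad_L(x):
    return 2.0 * A @ x

for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    # first-order condition for convexity (small tolerance for floating point)
    assert L(x2) - L(x1) >= grad_L(x1) @ (x2 - x1) - 1e-9
```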
Poll
Optimizing Deep Neural Nets
[Figure: loss landscape visualizations — convex vs. non-convex (VGG-56); the non-convex surface contains saddle points]
Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems, 2018.
stochastic gradient descent
large-scale datasets (large $N$) lead to a memory bottleneck
stochastic gradient descent (SGD): $w = w - \alpha \tilde{\nabla}_w L$
use a stochastic gradient estimate to descend the surface of the loss function
batch gradient: $\nabla_w L = \frac{1}{n} \sum_{i=1}^{n} \nabla_w L(f(x_i; w), y_i)$
stochastic gradient: $\tilde{\nabla}_w L = \nabla_w L(f(x_i; w), y_i)$ for a single randomly sampled example $i$
the mini-batch size interpolates between these two extremes
Batch and Minibatch
mini-batch size ranges from $n$ (full batch) down to $1$ (pure SGD)

Batch (mini-batch size $= n$):
• Accurate gradient estimate
• Fast training
• Memory bottleneck
• Hard to parallelize
• Worse generalization

Stochastic (mini-batch size $= 1$):
• Approximate estimate
• Slow training
• Memory efficient
• Easy to parallelize
• Better generalization
Stochastic gradient needs to be an unbiased estimator:
$\nabla_w L(w) = \mathbb{E}_{\xi}\left[\tilde{\nabla}_w L(w, \xi)\right]$
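To see the unbiasedness numerically, the toy sketch below (a linear model with squared loss, all data made up) compares the full-batch gradient with a Monte Carlo estimate of the expected single-example gradient:

```python
# E[gradient of a uniformly sampled example] equals the full-batch (average) gradient,
# so the stochastic gradient is an unbiased estimator of the true gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy inputs
y = rng.normal(size=100)             # toy targets
w = rng.normal(size=5)               # current weights

def per_example_grad(i, w):
    # gradient of 0.5 * (x_i @ w - y_i)^2 with respect to w
    return (X[i] @ w - y[i]) * X[i]

batch_grad = np.mean([per_example_grad(i, w) for i in range(len(y))], axis=0)

# Monte Carlo estimate of E[per-example gradient] under uniform sampling
samples = rng.integers(0, len(y), size=50_000)
mc_grad = np.mean([per_example_grad(i, w) for i in samples], axis=0)

print(np.max(np.abs(mc_grad - batch_grad)))   # close to 0: the estimator is unbiased
```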
Mini-Batch
• mini-batch implementation: https://d2l.ai/chapter_optimization/minibatch-sgd.html
• Mini-batch SGD is faster than both full-batch GD and single-example SGD
• Trade-off between convergence speed and computational efficiency
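For reference, here is a minimal NumPy mini-batch SGD loop for a linear model with squared loss; the model, data, batch size of 32, and learning rate are assumptions chosen for illustration (the d2l link above gives the reference implementation):

```python
# Mini-batch SGD sketch: shuffle each epoch, slice mini-batches, and take one
# gradient step per batch. Model, loss, batch size, and learning rate are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1024)

w = np.zeros(10)
alpha, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(y))               # reshuffle the data every epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient of 0.5*MSE
        w -= alpha * grad                        # SGD update

print(np.max(np.abs(w - true_w)))                # close to 0 after training
```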
SGD is Not Enough
• Increased frequency and severity of bad local minima (relative to the global minimum)
• Pathological curvature, like the type seen in the well-known Rosenbrock function
An Ill-Conditioned Example
• Consider optimizing $f(x) = 0.1 x_1^2 + 2 x_2^2$
• The function is very flat along the direction of $x_1$
• The gradient is much larger and changes rapidly along the direction of $x_2$
[Figure: gradient descent trajectory with learning rate 0.4]
• Increasing the learning rate to 0.6 improves the convergence along the $x_1$ direction, but the overall solution is much worse
[Figure: gradient descent trajectory with learning rate 0.6]
• Key idea: can we average the past gradients?
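The behavior described above can be reproduced with a few lines of NumPy; the starting point is an arbitrary assumption:

```python
# Plain gradient descent on the ill-conditioned quadratic f(x) = 0.1*x1^2 + 2*x2^2.
# With lr = 0.4 progress along x1 is slow; with lr = 0.6 the x2 direction diverges.
import numpy as np

def grad(x):
    return np.array([0.2 * x[0], 4.0 * x[1]])   # gradient of f

def run_gd(lr, steps=20, x0=(-5.0, -2.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - lr * grad(x)                    # plain gradient descent step
    return x

print(run_gd(lr=0.4))   # x2 has converged, but x1 is still far from 0
print(run_gd(lr=0.6))   # x2 oscillates and diverges: |1 - 0.6 * 4| = 1.4 > 1
```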
Momentum
• SGD oscillates a lot and converges slowly in the
direction with a flat landscape
• Use a "leaky" gradient: accumulate a fraction of the past gradient updates

$v_t = \beta v_{t-1} - g_{t,t-1}, \qquad w_{t+1} = w_t + v_t$

where $g_{t,t-1}$ is the (stochastic) gradient at step $t$; unrolling the recursion (with $v_0 = 0$) shows that $v_t$ is a geometrically weighted sum of all past gradients:

$v_t = \beta^2 v_{t-2} - \beta g_{t-1,t-2} - g_{t,t-1} = \cdots = -\sum_{\tau=0}^{t-1} \beta^{\tau} g_{t-\tau,\, t-\tau-1}$
Example
[Figure: trajectory with learning rate 0.6, momentum 0.25]
• momentum = 0 recovers the original SGD equation
• momentum improves learning even with the same learning rate

[Figure: trajectory with learning rate 0.6, momentum 0.5]
• increasing the momentum can lead to oscillating trajectories, but these are still much better than the diverged solutions
Momentum
• Momentum
physical interpretation: update a velocity

SGD without momentum: $w_{t+1} = w_t - \alpha \tilde{\nabla}_w L(w_t)$

SGD with momentum: $v_t = \beta v_{t-1} - \alpha \tilde{\nabla}_w L(w_t), \quad w_{t+1} = w_t + v_t$

• Nesterov Momentum
stronger theoretical guarantee; the gradient is evaluated at the look-ahead point $w_t + \beta v_{t-1}$:

$v_t = \beta v_{t-1} - \alpha \tilde{\nabla}_w L(w_t + \beta v_{t-1}), \quad w_{t+1} = w_t + v_t$
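Both updates can be written as small step functions; the sketch below (NumPy, with assumed values of $\alpha$ and $\beta$) follows the notation above and reuses the ill-conditioned quadratic from the earlier example:

```python
# Classical (heavy-ball) momentum and Nesterov momentum update steps.
# grad_fn(w) should return the (stochastic) gradient of the loss at w.
import numpy as np

def momentum_step(w, v, grad_fn, alpha=0.1, beta=0.9):
    v = beta * v - alpha * grad_fn(w)              # v_t = beta*v_{t-1} - alpha*grad(w_t)
    return w + v, v                                # w_{t+1} = w_t + v_t

def nesterov_step(w, v, grad_fn, alpha=0.1, beta=0.9):
    v = beta * v - alpha * grad_fn(w + beta * v)   # gradient at the look-ahead point
    return w + v, v

# Usage on the ill-conditioned quadratic f(x) = 0.1*x1^2 + 2*x2^2:
grad_fn = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
w, v = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_fn)
print(w)   # close to the minimizer (0, 0) in both directions
```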
Learning Rate
$w = w - \alpha \tilde{\nabla}_w L$, where $\alpha$ is the learning rate
Learning rate is essential to convergence speed and accuracy
[Figure: loss $L(w)$ vs. $w$ for a learning rate that is too low, too high, and just right]
Simple Strategy
• Divide Loss Function by Number of Examples:
$w_{t+1} = w_t - \frac{\alpha}{n} \tilde{\nabla}_w L(w_t)$
• Start with a large step size
• If the loss plateaus, divide the step size by 2: $\alpha_t = \frac{\alpha_0}{2^t}$ after $t$ halvings
• (Can also use advanced optimization methods)
• (Step size must decrease over time to guarantee
convergence to global optimum)
• Scale the learning rate by the iteration count: $\alpha_t = \frac{\alpha_0}{t + c}$
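A sketch of the two schedules just described, as plain Python helpers; the constants (patience, tolerance, c) are placeholders, not values from the course:

```python
# Two simple learning-rate schedules (constants are illustrative placeholders).

def halve_on_plateau(alpha, loss_history, patience=3, tol=1e-4):
    """Divide the step size by 2 if the loss has not improved for `patience` steps."""
    if (len(loss_history) > patience
            and loss_history[-1] > min(loss_history[:-patience]) - tol):
        return alpha / 2.0
    return alpha

def inverse_decay(alpha0, t, c=1.0):
    """Scale the learning rate by the iteration count: alpha_t = alpha0 / (t + c)."""
    return alpha0 / (t + c)

print([round(inverse_decay(1.0, t), 3) for t in range(5)])   # 1.0, 0.5, 0.333, 0.25, 0.2
```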
Adaptive Learning Rate
• Potential issues of a fixed learning rate:
• parameters for common features converge rather quickly to their optimal values
• infrequent features have not been observed often enough for their optimal values to be determined
• Remedy: let the learning rate scale according to the features:
$\alpha_t = \frac{\alpha_0}{\sqrt{s(i, t) + c}}$, where $s(i, t)$ counts the number of nonzeros for parameter dimension $i$ that we have observed up to time $t$
• This counting applies only to the input features, not to the gradients (a small sketch follows below)
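A small sketch of the per-feature count $s(i, t)$ on a made-up sparse design matrix; the square-root form follows the formula above:

```python
# Track how often each input feature has been nonzero so far, s(i, t), and use it
# to scale a per-feature learning rate. The sparse data here is made up.
import numpy as np

rng = np.random.default_rng(0)
# features with very different nonzero frequencies (0.9 down to 0.01)
X = (rng.random(size=(1000, 6)) < [0.9, 0.5, 0.2, 0.1, 0.05, 0.01]).astype(float)

alpha0, c = 0.5, 1.0
s = np.zeros(6)                       # s[i] = nonzero count of feature i observed so far
for t in range(len(X)):
    s += (X[t] != 0)
    alpha = alpha0 / np.sqrt(s + c)   # rarely observed features keep a larger rate

print(s)        # frequent features have large counts ...
print(alpha)    # ... and therefore small learning rates; rare features stay large
```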
Adaptive Learning Rate
• AdaGrad: adaptively scales the learning rate for each parameter dimension
• update: $w_{t+1} = w_t - \frac{\epsilon}{\delta + \sqrt{r_t}} \odot g$ (adaptive per-parameter learning rate applied to the gradient $g$)
• accumulator: $r_t = r_{t-1} + g \odot g$ (running sum of squared gradients)
• Hessian matrix:
$$H = \begin{bmatrix} \frac{\partial^2 L}{\partial w_1^2} & \cdots & \frac{\partial^2 L}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 L}{\partial w_n \partial w_1} & \cdots & \frac{\partial^2 L}{\partial w_n^2} \end{bmatrix}$$
• $g \odot g$ approximates $\mathrm{Diag}(H)$, the diagonal of the Hessian
Adaptive Learning Rate
Cost is sensitive to the learning rate only in some directions in the parameter space
maintain a "memory" of previous gradients and scale the gradient per parameter
• AdaGrad
gradient: $g = \tilde{\nabla}_w L(w_t)$
approximate Hessian: $r_t = r_{t-1} + g \odot g$
update: $w_{t+1} = w_t - \frac{\epsilon}{\delta + \sqrt{r_t}} \odot g$
• RMSProp
gradient: $g = \tilde{\nabla}_w L(w_t)$
approximate Hessian: $r_t = \rho r_{t-1} + (1 - \rho)\, g \odot g$
update: $w_{t+1} = w_t - \frac{\epsilon}{\sqrt{\delta + r_t}} \odot g$
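Both updates, written as step functions in NumPy; $\epsilon$, $\delta$, and $\rho$ use common default values assumed here, not values from the course:

```python
# AdaGrad and RMSProp update steps, following the notation above:
# epsilon = global learning rate, delta = small constant for numerical stability,
# r = running sum (AdaGrad) or leaky average (RMSProp) of squared gradients.
import numpy as np

def adagrad_step(w, r, g, epsilon=0.01, delta=1e-7):
    r = r + g * g                                   # accumulate squared gradients
    w = w - (epsilon / (delta + np.sqrt(r))) * g    # per-parameter scaled update
    return w, r

def rmsprop_step(w, r, g, epsilon=0.01, rho=0.9, delta=1e-6):
    r = rho * r + (1.0 - rho) * g * g               # exponential moving average
    w = w - (epsilon / np.sqrt(delta + r)) * g
    return w, r

# Usage on the ill-conditioned quadratic from the earlier example:
grad_fn = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
w, r = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(2000):
    w, r = rmsprop_step(w, r, grad_fn(w))
print(w)   # progresses at a comparable rate in both directions, unlike plain GD
```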
Comparison
local minima and saddle points are largely not an issue
in many dimensions, the optimizer can move in exponentially more directions
http://sebastianruder.com/optimizing-gradient-descent/index.html