Advanced Machine Learning
Gradient Descent for Non-Convex Functions
Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
• Characterize non-convex loss surfaces with
Hessian
• List issues with non-convex surfaces
• Explain how certain optimization techniques
help solve these issues
Contents
• Characterizing non-convex loss surfaces
• Issues with gradient descent
• Issues with Newton’s method
• Stochastic gradient descent to the rescue
• Momentum and its variants
• Saddle-free Newton
Why do we not get stuck in bad local
minima?
• Local minima are close to global minima in terms of
errors
• Saddle points are much more likely at higher
portions of the error surface (in high-dimensional
weight space)
• SGD (and other techniques) allows you to escape
saddle points
Error surfaces and saddle points
Image sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif,
http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png
Eigenvalues of Hessian at critical points
[Figure: sign patterns of the Hessian eigenvalues distinguish local minima, plateaus/long furrows, saddle points, and global minima]
Image source: http://i.stack.imgur.com/NsI2J.png
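As a concrete illustration (not from the slides), here is a minimal NumPy sketch that classifies a critical point of a toy function by the signs of its Hessian eigenvalues; the toy function and tolerance are illustrative:

```python
# Minimal sketch: classify a critical point by the eigenvalues of its Hessian.
# The toy function f(x, y) = x^2 - y^2 has a saddle point at the origin.
import numpy as np

def hessian_at_origin():
    # Analytic Hessian of f(x, y) = x^2 - y^2 (constant everywhere).
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

def classify_critical_point(H, tol=1e-8):
    eig = np.linalg.eigvalsh(H)              # real eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "local minimum"               # all curvatures positive
    if np.all(eig < -tol):
        return "local maximum"               # all curvatures negative
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"                # mixed curvature
    return "degenerate (plateau / furrow)"   # some eigenvalues near zero

print(classify_critical_point(hessian_at_origin()))  # -> saddle point
```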
A realistic picture
[Figure: a realistic loss landscape with many local minima and local maxima]
Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
Is achieving the global minimum important?
• The global minimum for the training data may not be the global minimum for the validation or test data
• Local minima are often good enough
“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15
Under certain assumptions, local minima are theoretically also of high quality
• Results:
– Lowest critical values of the random loss form a band
– Probability of minima outside that band diminishes
exponentially with the size of the network
– Empirical verification
• Assumptions:
– Fully-connected feed-forward neural network
– Variable independence
– Redundancy in network parametrization
– Uniformity
“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15
Empirically, most minima are of high quality
“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
GD vs. Newton’s method
• Gradient descent is based on the first-order approximation
$f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^{T} \Delta\theta$, with update $\Delta\theta = -\eta\, \nabla f$
• Newton’s method is based on the second-order approximation
$f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^{T} \Delta\theta + \tfrac{1}{2}\, \Delta\theta^{T} H\, \Delta\theta$, with update $\Delta\theta = -H^{-1} \nabla f$
• At a critical point ($\nabla f = 0$), rewriting the second-order term in the eigenbasis of $H$ gives
$f(\theta^* + \Delta\theta) = f(\theta^*) + \tfrac{1}{2} \sum_{i=1}^{n_\theta} \lambda_i\, (\Delta v_i)^2$,
so negative eigenvalues $\lambda_i$ correspond to directions of decrease
“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
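A minimal NumPy sketch (illustrative, not from the cited paper) contrasting one gradient-descent step with one Newton step on a toy quadratic, where the $H^{-1}$ rescaling is explicit:

```python
# Sketch: one gradient-descent step vs. one Newton step on a quadratic
# f(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta and Hessian is A.
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 0.5]])          # ill-conditioned positive-definite Hessian
theta = np.array([1.0, 1.0])
eta = 0.1                           # learning rate for gradient descent

grad = A @ theta                    # first-order information only
gd_step = theta - eta * grad        # GD: small step along the negative gradient

newton_step = theta - np.linalg.solve(A, grad)  # Newton: rescale by H^{-1}

print("GD update:    ", gd_step)
print("Newton update:", newton_step)   # exactly [0, 0] for a quadratic
```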
Disadvantages of 2nd order methods
• Updates require O(d³) or at least O(d²) computation
• May not work well for non-convex surfaces
• Get attracted to saddle points (how?)
• Not well suited to mini-batch (stochastic) updates
GD vs. SGD
• GD: $w_{t+1} = w_t - \eta\, g_{\text{all samples}}(w_t)$
• SGD: $w_{t+1} = w_t - \eta\, g_{\text{random subset}}(w_t)$
Compare GD with SGD
• GD requires more computations per update
• SGD updates are noisier
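A minimal sketch comparing a full-batch GD update with a mini-batch SGD update on a toy least-squares problem; the data sizes, batch size, and learning rate are illustrative:

```python
# Sketch: full-batch GD vs. mini-batch SGD on least squares (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

eta, w_gd, w_sgd = 0.1, np.zeros(5), np.zeros(5)
for t in range(200):
    w_gd -= eta * grad(w_gd, X, y)                 # GD: gradient over all samples
    idx = rng.choice(len(y), size=32, replace=False)
    w_sgd -= eta * grad(w_sgd, X[idx], y[idx])     # SGD: gradient over a random mini-batch

print("GD  weights:", np.round(w_gd, 2))
print("SGD weights:", np.round(w_sgd, 2))
```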
SGD helps by changing the loss surface
• Different mini-batches (or samples) have their own loss surfaces
• The loss surface of the entire training set may be different from each of them
• A local minimum of one loss surface may not be a local minimum of another
• This helps us escape local minima when using stochastic or mini-batch gradient descent
• Mini-batch size depends on computational resource utilization
Noise can be added in other ways to
escape saddle points
• Random mini-batches (SGD)
• Add noise to the gradient or the update
• Add noise to the input
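A minimal sketch of the second option, adding Gaussian noise to the gradient before the update; the noise scale sigma is illustrative and would typically be annealed over training:

```python
# Sketch: injecting Gaussian noise into the gradient before the update.
import numpy as np

rng = np.random.default_rng(1)

def noisy_update(w, grad, eta=0.01, sigma=0.01):
    # Perturb the gradient, then take the usual step; sigma is an illustrative constant.
    return w - eta * (grad + sigma * rng.normal(size=grad.shape))

w = np.array([0.0, 0.0])
g = np.array([1.0, -1.0])     # stand-in gradient at the current point
print(noisy_update(w, g))
```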
Learning rate scheduling
• High learning rates explore faster earlier
– But, they can lead to divergence or high final loss
• Low learning rates fine-tune better later
– But, they can be very slow to converge
• LR scheduling combines advantages of both
– Lots of schedules possible: linear, exponential, square-root, step-wise,
cosine
[Figure: training loss vs. training iterations for different learning-rate schedules]
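A minimal sketch of a few such schedules; the constants are illustrative:

```python
# Sketch of common learning-rate schedules (names and constants are illustrative).
import math

def step_lr(lr0, t, drop=0.1, every=30):
    return lr0 * (drop ** (t // every))          # step-wise: drop by a factor every few epochs

def exp_lr(lr0, t, k=0.05):
    return lr0 * math.exp(-k * t)                # exponential decay

def cosine_lr(lr0, t, t_max=100):
    return 0.5 * lr0 * (1 + math.cos(math.pi * min(t, t_max) / t_max))  # cosine annealing

for t in (0, 30, 60, 90):
    print(t, round(step_lr(0.1, t), 4), round(exp_lr(0.1, t), 4), round(cosine_lr(0.1, t), 4))
```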
Classical and Nesterov Momentum
• GD: $w_{t+1} = w_t - \eta\, g(w_t)$
• Classical momentum:
– $v_{t+1} = \alpha v_t - \eta\, g(w_t)$
– $w_{t+1} = w_t + v_{t+1}$
• Nesterov momentum:
– $v_{t+1} = \alpha v_t - \eta\, g(w_t + \alpha v_t)$
– $w_{t+1} = w_t + v_{t+1}$
• Better course-correction for a bad velocity, since the gradient is evaluated at the look-ahead point
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
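A minimal NumPy sketch of the classical and Nesterov momentum updates above on a toy quadratic loss; the learning rate and momentum coefficient are illustrative:

```python
# Sketch: classical vs. Nesterov momentum, following the update equations above.
import numpy as np

def grad(w):
    # Toy quadratic loss 0.5 * ||w||^2, so the gradient is just w.
    return w

eta, alpha = 0.1, 0.9
w_cm, v_cm = np.array([1.0, 1.0]), np.zeros(2)
w_nag, v_nag = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(50):
    # Classical momentum: gradient at the current point.
    v_cm = alpha * v_cm - eta * grad(w_cm)
    w_cm = w_cm + v_cm
    # Nesterov momentum: gradient at the look-ahead point w + alpha * v.
    v_nag = alpha * v_nag - eta * grad(w_nag + alpha * v_nag)
    w_nag = w_nag + v_nag

print("classical:", np.round(w_cm, 4))
print("nesterov: ", np.round(w_nag, 4))
```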
AdaGrad, RMSProp, AdaDelta
• Scales the gradient by a running norm of all
the previous gradients
• Per dimension:
$w_{t+1} = w_t - \eta\, \dfrac{g(w_t)}{\sqrt{\sum_{i=1}^{t} g(w_i)^2 + \epsilon}}$
• Automatically reduces learning rate with t
• Parameters with small gradients speed up
• RMSProp and AdaDelta use a forgetting factor in the running sum of squared gradients so that the updates do not become too small
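A minimal sketch of the per-dimension AdaGrad and RMSProp updates on a toy gradient; the constants are illustrative:

```python
# Sketch: per-dimension AdaGrad and RMSProp updates (constants are illustrative).
import numpy as np

def grad(w):
    return w          # toy gradient of 0.5 * ||w||^2

eta, eps, rho = 0.1, 1e-8, 0.9
w_ada, s_ada = np.array([1.0, 1.0]), np.zeros(2)
w_rms, s_rms = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(100):
    g = grad(w_ada)
    s_ada += g ** 2                               # AdaGrad: accumulate all squared gradients
    w_ada -= eta * g / (np.sqrt(s_ada) + eps)

    g = grad(w_rms)
    s_rms = rho * s_rms + (1 - rho) * g ** 2      # RMSProp: forgetting factor keeps steps alive
    w_rms -= eta * g / (np.sqrt(s_rms) + eps)

print("AdaGrad:", np.round(w_ada, 4))
print("RMSProp:", np.round(w_rms, 4))
```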
Adam optimizer combines AdaGrad
and momentum
• Initialize $m_0 = 0$, $v_0 = 0$
• Loop over $t$:
– $g_t = \nabla_w f_t(w_{t-1})$   (get gradient)
– $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$   (update first moment, biased)
– $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$   (update second moment, biased)
– $\hat{m}_t = m_t / (1 - \beta_1^t)$   (correct bias in first moment)
– $\hat{v}_t = v_t / (1 - \beta_2^t)$   (correct bias in second moment)
– $w_t = w_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$   (update parameters)
“ADAM: A method for stochastic optimization” Kingma and Ba, ICLR’15
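A minimal sketch of the Adam loop above on a toy gradient, using the default constants from the paper:

```python
# Sketch of the Adam update loop from the slide, on a toy gradient.
import numpy as np

def grad(w):
    return w                      # gradient of the toy loss 0.5 * ||w||^2

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8   # defaults from Kingma and Ba
w = np.array([1.0, -1.0])
m = np.zeros_like(w)              # first moment
v = np.zeros_like(w)              # second moment

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(np.round(w, 4))
```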
Visualizing optimizers
Source: http://ruder.io/optimizing-gradient-descent/index.html