Advanced Machine Learning
Gradient Descent for Non-Convex Functions
Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
• Characterize non-convex loss surfaces with
Hessian
• List issues with non-convex surfaces
• Explain how certain optimization techniques
help solve these issues
Contents
• Characterizing non-convex loss surfaces
• Issues with gradient descent
• Issues with Newton’s method
• Stochastic gradient descent to the rescue
• Momentum and its variants
• Saddle-free Newton
Why do we not get stuck in bad local
minima?
• Local minima are close to global minima in terms of
errors
• Saddle points are much more likely at higher
portions of the error surface (in high-dimensional
weight space)
• SGD (and other techniques) allows you to escape
saddle points
Error surfaces and saddle points
Image sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif,
http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png
Eigenvalues of Hessian at critical points
[Figure: sign patterns of the Hessian eigenvalues distinguish local minima, plateaus/long furrows, saddle points, and global minima]
Image source: http://i.stack.imgur.com/NsI2J.png
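As a concrete illustration (not from the slides), here is a minimal NumPy sketch that classifies a critical point of a toy function by the signs of its Hessian eigenvalues; the toy function and tolerance are illustrative:

```python
# Minimal sketch: classify a critical point by the eigenvalues of its Hessian.
# The toy function f(x, y) = x^2 - y^2 has a saddle point at the origin.
import numpy as np

def hessian_at_origin():
    # Analytic Hessian of f(x, y) = x^2 - y^2 (constant everywhere).
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

def classify_critical_point(H, tol=1e-8):
    eig = np.linalg.eigvalsh(H)              # real eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "local minimum"               # all curvatures positive
    if np.all(eig < -tol):
        return "local maximum"               # all curvatures negative
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"                # mixed curvature
    return "degenerate (plateau / furrow)"   # some eigenvalues near zero

print(classify_critical_point(hessian_at_origin()))  # -> saddle point
```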
A realistic picture
[Figure: a realistic loss landscape with many local minima and local maxima]
Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
Is achieving the global minimum important?
• The global minimum for the training data may not be the global minimum for the validation or test data
• Local minima are often good enough
“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15
Under certain assumptions, local minima are theoretically also of high quality
• Results:
– Lowest critical values of the random loss form a band
– Probability of minima outside that band diminishes
exponentially with the size of the network
– Empirical verification
• Assumptions:
– Fully-connected feed-forward neural network
– Variable independence
– Redundancy in network parametrization
– Uniformity
“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15
Empirically, most minima are of high quality
“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
GD vs. Newton’s method
• Gradient descent is based on the first-order approximation
$f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^{T} \Delta\theta$, with update $\Delta\theta = -\eta\, \nabla f$
• Newton’s method is based on the second-order approximation
$f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^{T} \Delta\theta + \tfrac{1}{2}\, \Delta\theta^{T} H\, \Delta\theta$, with update $\Delta\theta = -H^{-1} \nabla f$
• At a critical point ($\nabla f = 0$), rewriting the second-order term in the eigenbasis of $H$ gives
$f(\theta^* + \Delta\theta) = f(\theta^*) + \tfrac{1}{2} \sum_{i=1}^{n_\theta} \lambda_i\, (\Delta v_i)^2$,
so negative eigenvalues $\lambda_i$ correspond to directions of decrease
“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
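A minimal NumPy sketch (illustrative, not from the cited paper) contrasting one gradient-descent step with one Newton step on a toy quadratic, where the $H^{-1}$ rescaling is explicit:

```python
# Sketch: one gradient-descent step vs. one Newton step on a quadratic
# f(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta and Hessian is A.
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 0.5]])          # ill-conditioned positive-definite Hessian
theta = np.array([1.0, 1.0])
eta = 0.1                           # learning rate for gradient descent

grad = A @ theta                    # first-order information only
gd_step = theta - eta * grad        # GD: small step along the negative gradient

newton_step = theta - np.linalg.solve(A, grad)  # Newton: rescale by H^{-1}

print("GD update:    ", gd_step)
print("Newton update:", newton_step)   # exactly [0, 0] for a quadratic
```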
Disadvantages of 2nd order methods
• Updates require O(d³) or at least O(d²) computation
• May not work well for non-convex surfaces
• Get attracted to saddle points (how?)
• Not well suited to mini-batch (stochastic) updates
GD vs. SGD
• GD: $w_{t+1} = w_t - \eta\, g_{\text{all samples}}(w_t)$
• SGD: $w_{t+1} = w_t - \eta\, g_{\text{random subset}}(w_t)$
Compare GD with SGD
• GD requires more computations per update
• SGD updates are noisier
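A minimal sketch comparing a full-batch GD update with a mini-batch SGD update on a toy least-squares problem; the data sizes, batch size, and learning rate are illustrative:

```python
# Sketch: full-batch GD vs. mini-batch SGD on least squares (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

eta, w_gd, w_sgd = 0.1, np.zeros(5), np.zeros(5)
for t in range(200):
    w_gd -= eta * grad(w_gd, X, y)                 # GD: gradient over all samples
    idx = rng.choice(len(y), size=32, replace=False)
    w_sgd -= eta * grad(w_sgd, X[idx], y[idx])     # SGD: gradient over a random mini-batch

print("GD  weights:", np.round(w_gd, 2))
print("SGD weights:", np.round(w_sgd, 2))
```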
SGD helps by changing the loss surface
• Different mini-batches (or samples) have their own loss surfaces
• The loss surface of the entire training set may be different from each of them
• A local minimum of one loss surface may not be a local minimum of another
• This helps us escape local minima when using stochastic or mini-batch gradient descent
• Mini-batch size depends on computational resource utilization
Noise can be added in other ways to
escape saddle points
• Random mini-batches (SGD)
• Add noise to the gradient or the update
• Add noise to the input
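A minimal sketch of the second option, adding Gaussian noise to the gradient before the update; the noise scale sigma is illustrative and would typically be annealed over training:

```python
# Sketch: injecting Gaussian noise into the gradient before the update.
import numpy as np

rng = np.random.default_rng(1)

def noisy_update(w, grad, eta=0.01, sigma=0.01):
    # Perturb the gradient, then take the usual step; sigma is an illustrative constant.
    return w - eta * (grad + sigma * rng.normal(size=grad.shape))

w = np.array([0.0, 0.0])
g = np.array([1.0, -1.0])     # stand-in gradient at the current point
print(noisy_update(w, g))
```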
Learning rate scheduling
• High learning rates explore faster earlier
– But, they can lead to divergence or high final loss
• Low learning rates fine-tune better later
– But, they can be very slow to converge
• LR scheduling combines advantages of both
– Lots of schedules possible: linear, exponential, square-root, step-wise,
cosine
[Figure: training loss vs. training iterations for different learning-rate schedules]
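A minimal sketch of a few such schedules; the constants are illustrative:

```python
# Sketch of common learning-rate schedules (names and constants are illustrative).
import math

def step_lr(lr0, t, drop=0.1, every=30):
    return lr0 * (drop ** (t // every))          # step-wise: drop by a factor every few epochs

def exp_lr(lr0, t, k=0.05):
    return lr0 * math.exp(-k * t)                # exponential decay

def cosine_lr(lr0, t, t_max=100):
    return 0.5 * lr0 * (1 + math.cos(math.pi * min(t, t_max) / t_max))  # cosine annealing

for t in (0, 30, 60, 90):
    print(t, round(step_lr(0.1, t), 4), round(exp_lr(0.1, t), 4), round(cosine_lr(0.1, t), 4))
```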
Classical and Nesterov Momentum
• GD: $w_{t+1} = w_t - \eta\, g(w_t)$
• Classical momentum:
– $v_{t+1} = \alpha v_t - \eta\, g(w_t)$
– $w_{t+1} = w_t + v_{t+1}$
• Nesterov momentum:
– $v_{t+1} = \alpha v_t - \eta\, g(w_t + \alpha v_t)$
– $w_{t+1} = w_t + v_{t+1}$
• Better course-correction for a bad velocity, since the gradient is evaluated at the look-ahead point
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
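A minimal NumPy sketch of the classical and Nesterov momentum updates above on a toy quadratic loss; the learning rate and momentum coefficient are illustrative:

```python
# Sketch: classical vs. Nesterov momentum, following the update equations above.
import numpy as np

def grad(w):
    # Toy quadratic loss 0.5 * ||w||^2, so the gradient is just w.
    return w

eta, alpha = 0.1, 0.9
w_cm, v_cm = np.array([1.0, 1.0]), np.zeros(2)
w_nag, v_nag = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(50):
    # Classical momentum: gradient at the current point.
    v_cm = alpha * v_cm - eta * grad(w_cm)
    w_cm = w_cm + v_cm
    # Nesterov momentum: gradient at the look-ahead point w + alpha * v.
    v_nag = alpha * v_nag - eta * grad(w_nag + alpha * v_nag)
    w_nag = w_nag + v_nag

print("classical:", np.round(w_cm, 4))
print("nesterov: ", np.round(w_nag, 4))
```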
AdaGrad, RMSProp, AdaDelta
• Scales the gradient by a running norm of all
the previous gradients
• Per dimension:
$w_{t+1} = w_t - \eta\, \dfrac{g(w_t)}{\sqrt{\sum_{i=1}^{t} g(w_i)^2 + \epsilon}}$
• Automatically reduces learning rate with t
• Parameters with small gradients speed up
• RMSProp and AdaDelta use a forgetting factor in the running sum of squared gradients so that the updates do not become too small
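A minimal sketch of the per-dimension AdaGrad and RMSProp updates on a toy gradient; the constants are illustrative:

```python
# Sketch: per-dimension AdaGrad and RMSProp updates (constants are illustrative).
import numpy as np

def grad(w):
    return w          # toy gradient of 0.5 * ||w||^2

eta, eps, rho = 0.1, 1e-8, 0.9
w_ada, s_ada = np.array([1.0, 1.0]), np.zeros(2)
w_rms, s_rms = np.array([1.0, 1.0]), np.zeros(2)

for _ in range(100):
    g = grad(w_ada)
    s_ada += g ** 2                               # AdaGrad: accumulate all squared gradients
    w_ada -= eta * g / (np.sqrt(s_ada) + eps)

    g = grad(w_rms)
    s_rms = rho * s_rms + (1 - rho) * g ** 2      # RMSProp: forgetting factor keeps steps alive
    w_rms -= eta * g / (np.sqrt(s_rms) + eps)

print("AdaGrad:", np.round(w_ada, 4))
print("RMSProp:", np.round(w_rms, 4))
```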
Adam optimizer combines AdaGrad
and momentum
• Initialize $m_0 = 0$, $v_0 = 0$
• Loop over $t$:
– $g_t = \nabla_w f_t(w_{t-1})$   (get gradient)
– $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$   (update first moment, biased)
– $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$   (update second moment, biased)
– $\hat{m}_t = m_t / (1 - \beta_1^t)$   (correct bias in first moment)
– $\hat{v}_t = v_t / (1 - \beta_2^t)$   (correct bias in second moment)
– $w_t = w_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$   (update parameters)
“ADAM: A method for stochastic optimization” Kingma and Ba, ICLR’15
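A minimal sketch of the Adam loop above on a toy gradient, using the default constants from the paper:

```python
# Sketch of the Adam update loop from the slide, on a toy gradient.
import numpy as np

def grad(w):
    return w                      # gradient of the toy loss 0.5 * ||w||^2

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8   # defaults from Kingma and Ba
w = np.array([1.0, -1.0])
m = np.zeros_like(w)              # first moment
v = np.zeros_like(w)              # second moment

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(np.round(w, 4))
```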
Visualizing optimizers
Source: http://ruder.io/optimizing-gradient-descent/index.html