Optimization 2: Stochastic Gradient Descent
Intro to Deep Learning, Fall 2021
Recap
• Neural networks are universal approximators
• We must train them to approximate any function
• Networks are trained to minimize total “error” on a training set
  – We do so through empirical risk minimization
• We use variants of gradient descent to do so
  – Gradients are computed through backpropagation
Recap
• Vanilla gradient descent may be too slow or unstable
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences
– Activations
– Normalizations
The training formulation
[Figure: the network maps input (X) to output (y); training fits this function to sampled input-output pairs]
Gradient descent
[Figure sequence: successive gradient descent steps on the error surface]
Effect of number of samples
[Figure: effect of the number of samples]
Alternative: Incremental update
[Figure sequence: the model is adjusted after each individual training instance]
Incremental Update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do: (over multiple epochs)
  – For all $t = 1:T$ (one epoch)
    • For every layer $k$:
      – Compute $\nabla_{W_k} Div(Y_t, d_t)$
      – Update $W_k = W_k - \eta \, \nabla_{W_k} Div(Y_t, d_t)^T$ (one update)
• Until the error has converged
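To make the loop concrete, here is a minimal Python sketch. The `model` object, its `weights` list, and its `backward` method (which runs the forward pass and backpropagation for one instance and returns per-layer gradients) are hypothetical stand-ins, not a prescribed interface:

```python
def incremental_update(model, X, d, eta=0.01, epochs=10):
    """Per-instance (incremental) gradient descent.

    Hypothetical interface:
      model.weights        -- list of per-layer weight arrays
      model.backward(x, d) -- forward pass + backprop for one instance,
                              returns per-layer gradients of Div(Y, d)
    """
    T = len(X)
    for epoch in range(epochs):                  # "Do": over multiple epochs
        for t in range(T):                       # one epoch
            grads = model.backward(X[t], d[t])
            for k in range(len(model.weights)):  # one update per instance
                model.weights[k] -= eta * grads[k]
```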
Caveats: order of presentation
[Figure sequence: the effect of the order in which training instances are presented]
The expected behavior of the gradient

$$\frac{d\,E(W^{(1)}, W^{(2)}, \ldots, W^{(K)})}{d w_{i,j}^{(k)}} = \frac{1}{T} \sum_i \frac{d\,Div\big(Y(X_i), d_i;\; W^{(1)}, W^{(2)}, \ldots, W^{(K)}\big)}{d w_{i,j}^{(k)}}$$

• The batch gradient is the average of the per-instance divergence gradients, so the gradient of a single randomly chosen instance is a noisy but unbiased estimate of it
[Figure: batch vs. SGD update behavior]
Caveats: learning rate
[Figure: fitted function vs. individual training instances, input (X) against output (y)]
• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
  – Correcting the function for individual instances will lead to never-ending, non-convergent updates
  – We must shrink the learning rate with iterations to prevent this
• Corrections for individual instances with the eventual minuscule learning rates will not modify the function
Incremental Update: Stochastic Gradient Descent
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$
• Do:
  – Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
  – For all $t = 1:T$
    • $j = j + 1$
    • For every layer $k$:
      – Compute $\nabla_{W_k} Div(Y_t, d_t)$
      – Update $W_k = W_k - \eta_j \, \nabla_{W_k} Div(Y_t, d_t)^T$
• Until the error has converged

SGD convergence
• For convergence, the learning rates must satisfy
$$\sum_j \eta_j = \infty, \qquad \sum_j \eta_j^2 < \infty$$
• For strongly convex functions (functions that can be placed inside a quadratic bowl, touching at any point), the expected loss converges as $O(1/k)$, giving us the iterations to $\epsilon$-convergence as $O(1/\epsilon)$
• For generically convex (but not strongly convex) functions, various proofs report an $O(1/\sqrt{k})$ convergence using a learning rate of $\eta_j \propto 1/\sqrt{j}$
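As a sketch, a step size $\eta_j = \eta_0 / j$ satisfies both conditions (the harmonic series diverges while the series of its squares converges):

```python
def lr_schedule(eta0, j):
    """Step size eta0 / j: sum_j eta_j diverges while sum_j eta_j^2
    converges, meeting both SGD convergence conditions."""
    return eta0 / j

# usage inside the SGD loop: W[k] -= lr_schedule(0.1, j) * grad[k]
```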
Batch gradient convergence
• In contrast, using the batch update method, for strongly convex functions the loss converges linearly, i.e. as $O(c^k)$ for some $c < 1$, so only $O(\log(1/\epsilon))$ iterations are needed for $\epsilon$-convergence
SGD Convergence
• We can bound the expected difference between the loss over our data using the optimal weights $W^*$ and the weights at any single iteration $W^{(k)}$: it shrinks as $O(1/k)$ for strongly convex loss, or $O(\log k / \sqrt{k})$ for convex loss
SGD Convergence and weight averaging
• Averaging the weights over the SGD iterations smooths out their oscillation around the optimum
• Polynomial Decay Averaging (the constant $\gamma$, e.g. $\gamma = 3$, is an illustrative choice):
$$\bar{W}^{(k)} = \left(1 - \frac{\gamma + 1}{k + \gamma}\right) \bar{W}^{(k-1)} + \frac{\gamma + 1}{k + \gamma}\, W^{(k)}$$
  – This weights recent iterates more heavily than a plain running mean and improves the convergence of the averaged weights
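A one-line sketch of the averaging rule above; `gamma = 3` is an illustrative constant, not a prescribed value:

```python
def poly_decay_average(w_bar, w, k, gamma=3):
    """Polynomial-decay running average of SGD iterates W^(k).
    Weights recent iterates more heavily than a plain mean."""
    alpha = (gamma + 1) / (k + gamma)
    return (1 - alpha) * w_bar + alpha * w
```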
SGD example
[Figure: SGD optimization trace on an example problem]
Recall: The Empirical risk
[Figure: training pairs $(X_i, d_i)$ sampled from the target function $g(X)$]

$$Loss(W) = \frac{1}{N} \sum_i div\big(f(X_i; W), d_i\big)$$
$$\widehat{W} = \underset{W}{\operatorname{argmin}}\ Loss(W)$$

• The expected value of the empirical risk is actually the expected divergence:
$$E\big[Loss(W)\big] = E\big[div\big(f(X; W), g(X)\big)\big]$$
• The empirical risk is thus an unbiased estimate of the expected divergence
  – Though there is no guarantee that minimizing it will minimize the expected divergence
SGD
[Figure: individual training pairs $(X_i, d_i)$ presented one at a time]
• The sample divergence is also an unbiased estimate of the expected error
• The variance of the sample divergence is the variance of the divergence itself: $var(div)$
  – This is $N$ times the variance of the empirical average minimized by the batch update
Explaining the variance
[Figure sequence: illustrating the variance of the per-sample estimates]
Alternative: Mini-batch update
[Figure sequence: the model is adjusted after each small batch of training instances]
Incremental Update: Mini-batch update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$
• Do:
  – Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
  – For $t = 1 : b : T$ (where $b$ is the mini-batch size)
    • $j = j + 1$
    • For every layer $k$:
      – $\Delta W_k = 0$
    • For $t' = t : t + b - 1$
      – For every layer $k$:
        » Compute $\nabla_{W_k} Div(Y_{t'}, d_{t'})$
        » $\Delta W_k = \Delta W_k + \nabla_{W_k} Div(Y_{t'}, d_{t'})$
    • Update
      – For every layer $k$:
        $W_k = W_k - \eta_j \, \Delta W_k$ (with shrinking step size $\eta_j$)
• Until the loss has converged
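A Python sketch of the mini-batch loop, reusing the hypothetical `model` interface from the earlier sketch; the shrinking step size $\eta_0 / j$ is an illustrative choice:

```python
import numpy as np

def minibatch_sgd(model, X, d, b=32, eta0=0.1, epochs=10):
    """Mini-batch SGD: accumulate gradients over b instances,
    then take one update with a shrinking step size."""
    T = len(X)
    j = 0
    for epoch in range(epochs):
        perm = np.random.permutation(T)              # randomly permute the data
        for t in range(0, T, b):                     # one mini-batch per step
            j += 1
            acc = [np.zeros_like(w) for w in model.weights]
            for i in perm[t:t + b]:                  # accumulate per-instance gradients
                grads = model.backward(X[i], d[i])
                for k in range(len(acc)):
                    acc[k] += grads[k]
            for k in range(len(model.weights)):      # one update per mini-batch
                model.weights[k] -= (eta0 / j) * acc[k]
```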
Mini Batches
[Figure: training pairs $(X_i, d_i)$ grouped into mini-batches]
• Mini-batch updates compute and minimize a batch loss over each batch of $b$ instances
• The expected value of the batch loss is also the expected divergence
  – The mini-batch loss is an unbiased estimate of the expected error
• The variance of the mini-batch loss: $var(BatchLoss) = \frac{1}{b} var(div)$
  – This will be much smaller than the variance of the sample error in SGD
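The $1/b$ variance reduction can be checked numerically; the toy divergence values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
divs = rng.normal(loc=1.0, scale=2.0, size=100_000)  # toy per-sample divergences

for b in (1, 10, 100):
    # the variance of the mean of b samples should be var(div) / b
    means = divs[: (len(divs) // b) * b].reshape(-1, b).mean(axis=1)
    print(b, means.var())
# prints approximately 4.0, 0.4, 0.04
```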
Minibatch convergence
• For convex functions, the convergence rate for SGD is $O(1/\sqrt{k})$
• For mini-batch updates with batches of size $b$, the convergence rate is $O(1/\sqrt{bk} + 1/k)$
  – Apparently an improvement of $\sqrt{b}$ over SGD
  – But since the batch size is $b$, we perform $b$ times as many computations per iteration as SGD
  – We actually get a degradation of $\sqrt{b}$ in terms of computation
• However, in practice
  – The objectives are generally not convex; mini-batches are more effective with the right learning rates
  – We also get additional benefits of vector processing
SGD example
[Figure: example optimization traces]
Story so far
• SGD: Presenting training instances one-at-a-time can be more effective than full-batch training
  – Provided the instances are presented in random order
• For SGD to converge, the learning rate must shrink sufficiently rapidly with iterations
  – Otherwise the learning will continuously “chase” the latest sample
Training and minibatches
• Convergence depends on the learning rate
  – Simple technique: fix the learning rate until the error plateaus, then reduce it by a fixed factor (e.g. 10)
  – Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
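A hand-rolled sketch of the simple plateau technique; the function name, `patience`, and `tol` are illustrative, not a library API:

```python
def reduce_on_plateau(eta, errors, patience=5, factor=10.0, tol=1e-4):
    """Divide the learning rate by `factor` when the best error of the
    last `patience` epochs is no better than the best before them."""
    if len(errors) > patience and \
            min(errors[-patience:]) > min(errors[:-patience]) - tol:
        return eta / factor
    return eta
```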
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences
– Activations
– Normalizations
Recall: Momentum Update
• Plain gradient update:
$$W^{(k)} = W^{(k-1)} - \eta \, \nabla_W Loss\big(W^{(k-1)}\big)^T$$
• With momentum, the update maintains a running average of past steps, which damps oscillations and accelerates movement along persistent directions:
$$\Delta W^{(k)} = \beta \, \Delta W^{(k-1)} - \eta \, \nabla_W Loss\big(W^{(k-1)}\big)^T$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
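One momentum step as a sketch (`dW` is the running update term):

```python
def momentum_step(W, dW, grad, eta=0.01, beta=0.9):
    """Momentum update: maintain a running average of past steps.
    Returns the new weights and the new running update."""
    dW = beta * dW - eta * grad
    return W + dW, dW
```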
Nesterov’s Accelerated Gradient
• The momentum method first computes the gradient at the current position, then takes the extended step; here $\nabla_W Loss$ is computed over an SGD instance or mini-batch loss:
$$\Delta W^{(k)} = \beta \, \Delta W^{(k-1)} - \eta \, \nabla_W Loss\big(W^{(k-1)}\big)^T, \qquad W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
• Nesterov’s method reverses the order of operations: first extend the previous step, then compute the gradient at the resultant position:
$$\Delta W^{(k)} = \beta \, \Delta W^{(k-1)} - \eta \, \nabla_W Loss\big(W^{(k-1)} + \beta \, \Delta W^{(k-1)}\big)^T, \qquad W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
Nesterov: Mini-batch update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$, $\Delta W_k = 0$
• Do:
  – Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
  – For $t = 1 : b : T$
    • $j = j + 1$
    • For every layer $k$:
      – $W_k = W_k + \beta \, \Delta W_k$ (the lookahead step)
      – $\nabla_{W_k} Loss = 0$
    • For $t' = t : t + b - 1$
      – For every layer $k$:
        » Compute $\nabla_{W_k} Div(Y_{t'}, d_{t'})$
        » $\nabla_{W_k} Loss \mathrel{+}= \nabla_{W_k} Div(Y_{t'}, d_{t'})$
    • Update
      – For every layer $k$:
        $W_k = W_k - \eta_j \, (\nabla_{W_k} Loss)^T$
        $\Delta W_k = \beta \, \Delta W_k - \eta_j \, (\nabla_{W_k} Loss)^T$
• Until the loss has converged
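The same loop as a Python sketch, again assuming the hypothetical `model` interface used above:

```python
import numpy as np

def nesterov_minibatch_step(model, Xb, db, dW, eta, beta=0.9):
    """One Nesterov step on mini-batch (Xb, db).
    dW: list of per-layer running updates; weights modified in place."""
    K = len(model.weights)
    for k in range(K):                       # lookahead: extend the previous step
        model.weights[k] += beta * dW[k]
    grads = [np.zeros_like(w) for w in model.weights]
    for x, d in zip(Xb, db):                 # gradient at the lookahead position
        g = model.backward(x, d)
        for k in range(K):
            grads[k] += g[k]
    for k in range(K):                       # step from the lookahead position
        model.weights[k] -= eta * grads[k]
        dW[k] = beta * dW[k] - eta * grads[k]
    return dW
```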
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm
• Procedure:
  – Maintain a running estimate of the mean squared value of derivatives for each parameter:
$$E\big[\partial_W^2 D\big]_k = \gamma \, E\big[\partial_W^2 D\big]_{k-1} + (1 - \gamma) \, \big(\partial_W D\big)_k^2$$
  – Scale the update of the parameter by the inverse of the root mean squared derivative:
$$W_{k+1} = W_k - \frac{\eta}{\sqrt{E\big[\partial_W^2 D\big]_k + \epsilon}} \, \big(\partial_W D\big)_k$$
• For each mini-batch: compute the gradient, update the running estimate, update the weights, and set $k = k + 1$
• Until the loss has converged
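A per-parameter sketch of the two RMSprop equations above:

```python
import numpy as np

def rmsprop_step(W, grad, ms, eta=0.001, gamma=0.9, eps=1e-8):
    """ms: running mean of squared derivatives for W.
    Returns updated weights and the updated running estimate."""
    ms = gamma * ms + (1 - gamma) * grad ** 2
    W = W - eta * grad / np.sqrt(ms + eps)
    return W, ms
```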
ADAM: RMSprop with momentum
• RMS prop only considers a second-moment normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
  – Considers both first and second moments
• Procedure:
  – Maintain a running estimate of the mean derivative for each parameter:
$$m_k = \delta \, m_{k-1} + (1 - \delta) \, \big(\partial_W D\big)_k$$
  – Maintain a running estimate of the mean squared value of derivatives for each parameter:
$$v_k = \gamma \, v_{k-1} + (1 - \gamma) \, \big(\partial_W D\big)_k^2$$
  – Bias-correct both estimates, which ensures that the $m_k$ and $v_k$ terms do not stay biased toward their zero initialization in early iterations:
$$\hat{m}_k = \frac{m_k}{1 - \delta^k}, \qquad \hat{v}_k = \frac{v_k}{1 - \gamma^k}$$
  – Scale the update of the parameter by the inverse of the root mean squared derivative:
$$W_{k+1} = W_k - \frac{\eta}{\sqrt{\hat{v}_k + \epsilon}} \, \hat{m}_k$$
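And the corresponding ADAM sketch (the step count `k` is 1-based so the bias corrections are well defined):

```python
import numpy as np

def adam_step(W, grad, m, v, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    """m, v: running first- and second-moment estimates of the derivatives."""
    m = delta * m + (1 - delta) * grad
    v = gamma * v + (1 - gamma) * grad ** 2
    m_hat = m / (1 - delta ** k)             # bias correction for the zero init
    v_hat = v / (1 - gamma ** k)
    W = W - eta * m_hat / np.sqrt(v_hat + eps)
    return W, m, v
```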
Other variants of the same theme
• Many:
  – Adagrad
  – AdaDelta
  – AdaMax
  – …
• Generally no explicit learning rate to optimize
  – But they come with other hyperparameters to be optimized
  – Typical values (the common library defaults): RMSProp: $\gamma = 0.9$; ADAM: $\delta = 0.9$, $\gamma = 0.999$, $\epsilon = 10^{-8}$
Visualizing the optimizers: Beale’s Function
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Long Valley
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Saddle Point
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Story so far
• Gradient descent can be sped up by incremental updates
  – Convergence is guaranteed under most conditions
    • The learning rate must shrink with time for convergence
  – Stochastic gradient descent: update after each observation. Can be much faster than batch learning
  – Mini-batch updates: update after batches of observations. Can be more efficient than SGD