
Training Neural Networks:

Optimization
Intro to Deep Learning, Fall 2021

1
Recap
• Neural networks are universal approximators
• We must train them to approximate any
function
• Networks are trained to minimize total “error”
on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation
2
Recap
• Vanilla gradient descent may be too slow or unstable

• Better convergence can be obtained through


– Second order methods that normalize the variation across
dimensions
– Adaptive or decaying learning rates that can improve
convergence
– Methods like Rprop that decouple the dimensions can
improve convergence
– Momentum methods which emphasize directions of
steady improvement and deemphasize unstable directions
3
Moving on…
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

4
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

5
The training formulation

[Figure: training points plotted as output (y) vs. Input (X)]

• Given input-output pairs at a number of locations, estimate the entire function
6
Gradient descent

• Start with an initial function
• Adjust its value at all points to make the outputs closer to the required value
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the training points

7
Effect of number of samples

• Problem with conventional gradient descent: we try to simultaneously adjust the function at all training points
– We must process all training points before making a single adjustment
– "Batch" update
13
Alternative: Incremental update

• Alternative: adjust the function at one training point at a time
– Keep adjustments small
– Eventually, when we have processed all the training points, we will have adjusted the entire function
• With greater overall adjustment than we would if we made a single "Batch" update
16
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do (over multiple epochs):
– For all t = 1 … T (one epoch; each t is one update):
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t)^T
• Until the error has converged
21
Incremental Updates
• The iterations can make multiple passes over the data
• A single pass through the entire training data is called an "epoch"
– An epoch over a training set with T samples results in T updates of parameters

22
Caveats: order of presentation

• If we loop through the samples in the same order, we may get cyclic behavior
• We must go through them randomly to get more convergent behavior
32
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For all t = 1 … T:
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t)^T
• Until the error has converged
33
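The loop above can be written very compactly. Below is a minimal NumPy sketch of per-sample (stochastic) updates for a generic layered model; the `backward` function and the `weights` list are hypothetical placeholders standing in for whatever network is being trained.

```python
import numpy as np

def sgd_train(X, d, weights, backward, lr=0.01, epochs=10):
    """Per-sample (stochastic) gradient descent, sketching the loop above.

    X, d     : arrays of T training inputs and targets
    weights  : list of per-layer parameter arrays
    backward : backward(x_t, d_t, weights) -> list of per-layer gradients
               of Div(Y_t, d_t) (an assumed interface, not from the slides)
    """
    T = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(T)   # randomize the presentation order
        for t in order:                     # one weight update per sample
            grads = backward(X[t], d[t], weights)
            for k in range(len(weights)):   # update every layer
                weights[k] -= lr * grads[k]
    return weights
```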
Story so far
• In any gradient descent optimization problem, presenting training instances incrementally can be more effective than presenting them all at once
– Provided the training instances are presented in random order
– "Stochastic Gradient Descent"

• This also holds for training neural networks
34
Explanations and restrictions
• So why does this process of incremental updates work?
• Under what conditions?

• For "why": first consider a simplistic explanation that's often given
– Look at an extreme example
35
The expected behavior of the gradient
$$\frac{dE\big(W^{(1)},W^{(2)},\dots,W^{(K)}\big)}{dw^{(k)}_{i,j}} \;=\; \frac{1}{T}\sum_{t}\frac{d\,Div\big(Y(X_t), d_t;\, W^{(1)},W^{(2)},\dots,W^{(K)}\big)}{dw^{(k)}_{i,j}}$$

• The individual training instances contribute different directions to the overall gradient
– The final gradient is the average of the individual gradients
– It points towards the net direction
36
Extreme example

• Extreme instance of data clotting: all the training instances are exactly the same
37
The expected behavior of the gradient
$$\frac{dE}{dw^{(k)}_{i,j}} \;=\; \frac{1}{T}\sum_{t}\frac{d\,Div\big(Y(X_t), d_t\big)}{dw^{(k)}_{i,j}} \;=\; \frac{d\,Div\big(Y(X), d\big)}{dw^{(k)}_{i,j}}$$

• The individual training instances contribute identical directions to the overall gradient
– The final gradient is simply the gradient for an individual instance
38
Batch vs SGD

[Figure: update trajectories, Batch (left) vs. SGD (right)]

• Batch gradient descent operates over T training instances to get a single update
• SGD gets T updates for the same computation
39
Clumpy data..

• Also holds if all the data are not identical, but are tightly clumped together
40
Clumpy data..

• As data get increasingly diverse, the benefits of incremental updates decrease, but do not entirely vanish
41
When does it work
• What are the considerations?

• And how well does it work?

42
Caveats: learning rate

[Figure: training points, output (y) vs. Input (X)]

• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
– Correcting the function for individual instances will lead to never-ending, non-convergent updates
– We must shrink the learning rate with iterations to prevent this
• Corrections for individual instances with the eventual minuscule learning rates will not modify the function
43
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)   (randomize input order)
– For all t = 1 … T:
• j = j + 1
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k = W_k − η_j ∇_{W_k} Div(Y_t, d_t)^T   (learning rate η_j reduces with j)
• Until the error has converged
44
SGD convergence
• SGD converges "almost surely" to a global or local minimum for most functions
– Sufficient condition: the step sizes follow the following conditions (Robbins and Monro, 1951)

$$\sum_k \eta_k = \infty$$
• Eventually the entire parameter space can be searched

$$\sum_k \eta_k^2 < \infty$$
• The steps shrink

– The fastest converging series that satisfies both of the above requirements is
$$\eta_k \propto \frac{1}{k}$$
• This is the optimal rate of shrinking the step size for strongly convex functions
– More generally, the learning rates are heuristically determined
• If the loss is convex, SGD converges to the optimal solution
• For non-convex losses SGD converges to a local minimum
46
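As a quick check of these conditions (standard series facts, not from the slides), the schedule η_k = c/k satisfies both requirements:

$$\sum_{k=1}^{\infty}\frac{c}{k}=\infty \quad\text{(the whole parameter space remains reachable)}, \qquad \sum_{k=1}^{\infty}\frac{c^2}{k^2}=\frac{c^2\pi^2}{6}<\infty \quad\text{(the steps shrink fast enough for the noise to average out)}$$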
SGD convergence
• We will define convergence in terms of the number of iterations taken to get within ε of the optimal solution
$$\big|f\big(W^{(k)}\big) - f\big(W^*\big)\big| < \epsilon$$
– Note: f() here is the optimization objective on the entire training data, although SGD itself updates after every training instance

• Using the optimal learning rate 1/k, for strongly convex functions,
$$\big|f\big(W^{(k)}\big) - f\big(W^*\big)\big| = O\!\left(\frac{1}{k}\,\big|f\big(W^{(0)}\big) - f\big(W^*\big)\big|\right)$$
– Strongly convex → can be placed inside a quadratic bowl, touching at any point
– Giving us the iterations to convergence as O(1/ε)

• For generically convex (but not strongly convex) functions, various proofs report an ε-convergence of O(1/√k) using a learning rate of 1/√k
47
Batch gradient convergence
• In contrast, using the batch update method, for strongly convex functions,
$$\big|f\big(W^{(k)}\big) - f\big(W^*\big)\big| = O\!\left(c^{k}\,\big|f\big(W^{(0)}\big) - f\big(W^*\big)\big|\right), \quad c < 1$$
– Giving us the iterations to convergence as O(log(1/ε))

• For generic convex functions, the iterations to convergence is O(1/ε)

• Batch gradients converge "faster"
– But SGD performs T updates for every batch update
48
SGD Convergence: Loss value
If:
• the loss f is λ-strongly convex, and
• at step t we have a noisy estimate of the subgradient, ĝ_t, with E[‖ĝ_t‖²] ≤ G² for all t,
• and we use step size η_t = 1/(λt),
Then for any T > 1:
$$E\big[f(w_T) - f(w^*)\big] \;=\; O\!\left(\frac{G^2 \log T}{\lambda T}\right)$$
49
SGD Convergence
• We can bound the expected difference between the loss over our data using the optimal weights and the weights at any single iteration k to O(log k / k) for strongly convex loss, or O(log k / √k) for convex loss

• Averaging schemes can improve the bound to O(1/k) and O(1/√k)

• Smoothness of the loss is not required
50
SGD Convergence and weight
averaging
Polynomial Decay Averaging: maintain a decaying running average of the SGD iterates

With γ some small positive constant

Achieves O(1/T) (strongly convex) and O(1/√T) (convex) convergence

51
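The averaging formula itself is not legible in these slides; the sketch below assumes the polynomial-decay form from Shamir & Zhang (2013), w̄_t = (1 − (γ+1)/(t+γ)) w̄_{t−1} + ((γ+1)/(t+γ)) w_t, which weights recent iterates more heavily than a plain running mean.

```python
import numpy as np

def polynomial_decay_average(iterates, gamma=3.0):
    """Polynomial-decay averaging of SGD iterates (assumed Shamir-Zhang form).

    iterates : list of weight vectors w_1, w_2, ... produced by SGD
    gamma    : small positive constant controlling how quickly old iterates decay
    """
    w_avg = np.array(iterates[0], dtype=float)
    for t, w_t in enumerate(iterates[1:], start=2):
        alpha = (gamma + 1.0) / (t + gamma)          # decaying mixing coefficient
        w_avg = (1.0 - alpha) * w_avg + alpha * np.asarray(w_t, dtype=float)
    return w_avg
```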
SGD example

• A simpler problem: K-means

• Note: SGD converges faster
– But to a poorer minimum
• Also note the rather large variation between runs
– Let's try to understand these results…
52
Recall: Modelling a function

• To learn a network to model a function we minimize the expected divergence

55
Recall: The Empirical risk

[Figure: training pairs (X_i, d_i) drawn from the target function]

• In practice, we minimize the empirical risk (or loss)
$$Loss(W) = \frac{1}{N}\sum_{i} div\big(f(X_i; W),\, d_i\big), \qquad \widehat{W} = \arg\min_W Loss(W)$$

• The expected value of the empirical risk is actually the expected divergence
$$E\big[Loss(W)\big] = E\big[div\big(f(X; W),\, g(X)\big)\big]$$
– The empirical risk is an unbiased estimate of the expected divergence
– Though there is no guarantee that minimizing it will minimize the expected divergence

• The variance of the empirical risk: var(Loss) = (1/N) var(div)
– The variance of the estimator is proportional to 1/N
– The larger this variance, the greater the likelihood that the W that minimizes the empirical risk will differ significantly from the W that minimizes the expected divergence
58
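To make the 1/N variance claim concrete, here is a small numerical check (illustrative only, with a made-up stand-in distribution for the per-sample divergence): the variance of an N-sample empirical risk shrinks roughly as var(div)/N.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk_variance(N, trials=2000):
    """Variance of the mean of N i.i.d. per-sample divergences.

    The per-sample divergence is simulated as a unit exponential variable
    (an arbitrary stand-in); only the 1/N scaling is the point here.
    """
    risks = rng.exponential(scale=1.0, size=(trials, N)).mean(axis=1)
    return risks.var()

for N in (1, 10, 100, 1000):
    # Expect roughly var(div)/N = 1/N for the unit exponential
    print(f"N={N:5d}  var(empirical risk) ~ {empirical_risk_variance(N):.5f}")
```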
SGD

[Figure: a single training pair (X_i, d_i)]

• At each iteration, SGD focuses on the divergence of a single sample
• The expected value of the sample divergence is still the expected divergence
– The sample divergence is also an unbiased estimate of the expected divergence
• The variance of the sample divergence is the variance of the divergence itself: var(div)
– This is N times the variance of the empirical average minimized by batch update
61
Explaining the variance

• The blue curve is the function being approximated
• The red curve is the approximation by the model at a given W
• The heights of the shaded regions represent the point-by-point error
– The divergence is a function of the error
– We want to find the W that minimizes the average divergence
62
Explaining the variance

• The sample estimate approximates the shaded area with the average length of the lines between these curves (the red curve is the model's current approximation)
• Variance: the spread between the different curves is the variance
63
Explaining the variance

• The sample estimate approximates the shaded area with the average length of the lines
• This average length will change with the position of the samples
64
Explaining the variance

• Having more samples makes the estimate more robust to changes in the position of the samples
– The variance of the estimate is smaller
66
Explaining the variance
With only one sample

• Having very few samples makes the estimate swing wildly with the sample position
– Since our estimator learns the W to minimize this estimate, the learned W too can swing wildly
67
SGD example

• A simpler problem: K-means

• Note: SGD converges faster
• But also has large variation between runs
70
SGD vs batch
• SGD uses the gradient from only one sample at a time, and is consequently high variance

• But also provides significantly quicker updates than batch

• Is there a good medium?
71
Alternative: Mini-batch update

• Alternative: adjust the function at a small, randomly chosen subset of points
– Keep adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
• As before, vary the subsets randomly in different passes through the training data
72
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T   (b is the mini-batch size)
• j = j + 1
• For every layer k:
– ΔW_k = 0
• For t' = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
» ΔW_k = ΔW_k + (1/b) ∇_{W_k} Div(Y_{t'}, d_{t'})
• Update
– For every layer k:
W_k = W_k − η_j ΔW_k   (shrinking step size η_j)
• Until the error has converged
76
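A minimal NumPy sketch of the mini-batch loop above; the `backward` interface (returning batch-averaged per-layer gradients) and the 1/√j step-size schedule are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np

def minibatch_sgd(X, d, weights, backward, lr0=0.1, batch_size=32, epochs=10):
    """Mini-batch gradient descent with a shrinking step size.

    backward(X_batch, d_batch, weights) -> list of per-layer gradients,
    already averaged over the batch (assumed interface).
    """
    T = len(X)
    j = 0                                     # update counter
    for epoch in range(epochs):
        order = np.random.permutation(T)      # randomize input order each epoch
        for start in range(0, T, batch_size):
            j += 1
            idx = order[start:start + batch_size]
            grads = backward(X[idx], d[idx], weights)
            lr = lr0 / np.sqrt(j)             # one simple shrinking schedule
            for k in range(len(weights)):
                weights[k] -= lr * grads[k]
    return weights
```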
Mini Batches

[Figure: a mini-batch of training pairs (X_i, d_i)]

• Mini-batch updates compute and minimize a batch loss
$$BatchLoss(W) = \frac{1}{b}\sum_{i \in batch} div\big(f(X_i; W),\, d_i\big)$$

• The expected value of the batch loss is also the expected divergence
– The mini-batch loss is also an unbiased estimate of the expected divergence
• The variance of the mini-batch loss: var(BatchLoss) = (1/b) var(div)
– This will be much smaller than the variance of the sample error in SGD
80
Minibatch convergence
• For convex functions, the convergence rate for SGD is O(1/√k)
• For mini-batch updates with batches of size b, the convergence rate is O(1/√(bk) + 1/k)
– Apparently an improvement of √b over SGD
– But since the batch size is b, we perform b times as many computations per iteration as SGD
– We actually get a degradation of √b

• However, in practice
– The objectives are generally not convex; mini-batches are more effective with the right learning rates
– We also get the additional benefits of vector processing
81
SGD example

• Mini-batch training performs comparably to batch training on this simple problem
– But converges orders of magnitude faster
82
Measuring Loss
• Convergence is generally defined in terms of the overall training loss
– Not the sample or batch loss

• It is infeasible to actually measure the overall training loss after each iteration
• More typically, we estimate it as
– The divergence or classification error on a held-out set
– The average sample/batch loss over the past several samples/batches
83
Training and minibatches
• In practice, training is usually performed using mini-batches
– The mini-batch size is a hyperparameter to be optimized

• Convergence depends on the learning rate
– Simple technique: fix the learning rate until the error plateaus, then reduce the learning rate by a fixed factor (e.g. 10)
– Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
84
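The "reduce on plateau" heuristic above can be sketched as follows; the patience and minimum-delta values are illustrative choices, not prescriptions from the slides.

```python
class PlateauLRScheduler:
    """Reduce the learning rate by a fixed factor when the monitored loss plateaus.

    A minimal sketch of the simple technique described above: keep the learning
    rate fixed until the error stops improving, then divide it by `factor`.
    """
    def __init__(self, lr=0.1, factor=10.0, patience=3, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum change that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the held-out (or running) loss estimate."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr
```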
Story so far
• SGD: presenting training instances one at a time can be more effective than full-batch training
– Provided they are presented in random order

• For SGD to converge, the learning rate must shrink sufficiently rapidly with iterations
– Otherwise the learning will continuously "chase" the latest sample

• SGD estimates have higher variance than batch estimates

• Mini-batch updates operate on batches of instances at a time
– Estimates have lower variance than SGD
– Convergence rate is theoretically worse than SGD
– But we compensate by being able to perform batch processing
87
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

89
Recall: Momentum Update
[Figure: plain gradient update vs. update with momentum]

• The momentum method maintains a running average of all gradients until the current step
$$\Delta W^{(k)} = \beta\, \Delta W^{(k-1)} - \eta\, \nabla_W Err\big(W^{(k-1)}\big)^T$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
– Typical β value is 0.9

• The running average steps
– Get longer in directions where the gradient retains the same sign
– Become shorter in directions where the sign keeps flipping
90
Recall: Momentum Update

• The momentum method
$$\Delta W^{(k)} = \beta\, \Delta W^{(k-1)} - \eta\, \nabla_W Err\big(W^{(k-1)}\big)^T$$

• At any iteration, to compute the current step:
– First compute the gradient step at the current location
– Then add in the scaled previous step
• Which is actually a running average
– To get the final step
94
Momentum update
• Momentum update steps are actually computed in two stages
– First: we take a step against the gradient at the current location
– Second: then we add a scaled version of the previous step

• The procedure can be made more optimal by reversing the order of operations…
95
Nesterov's Accelerated Gradient

• Change the order of operations

• At any iteration, to compute the current step:
– First extend by the (scaled) historical average
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
96
Nesterov's Accelerated Gradient

• Nesterov's method
$$\Delta W^{(k)} = \beta\, \Delta W^{(k-1)} - \eta\, \nabla_W Err\big(W^{(k-1)} + \beta\, \Delta W^{(k-1)}\big)^T$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
100
Nesterov's Accelerated Gradient

• Comparison with momentum (example from Hinton)
• Converges much faster
101
Momentum and incremental updates

• The momentum method, with the gradient computed on the SGD instance or mini-batch loss

• Incremental SGD and mini-batch gradients tend to have high variance
• Momentum smooths out the variations
– Smoother and faster convergence
102
Momentum: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– ∇_{W_k} Loss = 0
• For t' = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
» ∇_{W_k} Loss += (1/b) ∇_{W_k} Div(Y_{t'}, d_{t'})
• Update
– For every layer k:
ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)^T
W_k = W_k + ΔW_k
• Until the error has converged
103
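A minimal NumPy sketch of the momentum mini-batch loop above; as before, the `backward` interface and the 1/√j learning-rate decay are illustrative assumptions.

```python
import numpy as np

def momentum_minibatch_sgd(X, d, weights, backward, lr0=0.1, beta=0.9,
                           batch_size=32, epochs=10):
    """Mini-batch SGD with momentum, following the update above.

    backward(X_batch, d_batch, weights) -> per-layer gradients averaged over
    the batch (assumed interface).
    """
    T = len(X)
    velocity = [np.zeros_like(w) for w in weights]   # running step ΔW_k per layer
    j = 0
    for epoch in range(epochs):
        order = np.random.permutation(T)
        for start in range(0, T, batch_size):
            j += 1
            idx = order[start:start + batch_size]
            grads = backward(X[idx], d[idx], weights)
            lr = lr0 / np.sqrt(j)
            for k in range(len(weights)):
                # Scaled previous step minus the current gradient step
                velocity[k] = beta * velocity[k] - lr * grads[k]
                weights[k] += velocity[k]
    return weights
```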
Nesterov's Accelerated Gradient

• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
• This also applies directly to incremental update methods
– The accelerated gradient smooths out the variance in the gradients
104
Nesterov's Accelerated Gradient

• Nesterov's method, with the gradient computed on the SGD instance or mini-batch loss
105
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– W_k = W_k + β ΔW_k   (look-ahead: extend the previous step)
– ∇_{W_k} Loss = 0
• For t' = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
» ∇_{W_k} Loss += (1/b) ∇_{W_k} Div(Y_{t'}, d_{t'})
• Update
– For every layer k:
W_k = W_k − η_j (∇_{W_k} Loss)^T
ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)^T
• Until the error has converged
106
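A sketch of the Nesterov mini-batch loop above: the weights are first extended by the scaled previous step, the gradient is computed at that shifted position, and then both the weights and the running step are corrected. The `backward` interface and decay schedule are assumptions, as in the earlier sketches.

```python
import numpy as np

def nesterov_minibatch_sgd(X, d, weights, backward, lr0=0.1, beta=0.9,
                           batch_size=32, epochs=10):
    """Nesterov accelerated mini-batch SGD, following the update above."""
    T = len(X)
    delta = [np.zeros_like(w) for w in weights]      # running step ΔW_k per layer
    j = 0
    for epoch in range(epochs):
        order = np.random.permutation(T)
        for start in range(0, T, batch_size):
            j += 1
            idx = order[start:start + batch_size]
            # Look-ahead: extend each layer by the scaled previous step
            for k in range(len(weights)):
                weights[k] += beta * delta[k]
            grads = backward(X[idx], d[idx], weights)  # gradient at shifted position
            lr = lr0 / np.sqrt(j)                      # illustrative decay schedule
            for k in range(len(weights)):
                weights[k] -= lr * grads[k]
                delta[k] = beta * delta[k] - lr * grads[k]
    return weights
```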
Still higher-order methods
• Momentum and Nesterov's method improve convergence by normalizing the mean of the derivatives
• More recent methods take this one step further by also considering their variance
– RMSProp
– Adagrad
– AdaDelta
– ADAM: very popular in practice
– …
• All roughly equivalent in performance
107
Smoothing the trajectory
Step   X component   Y component
1      1             +2.5
2      1             −3
3      2             +2.5
4      1             −2
5      1.5           1.5

• Observation: steps in "oscillatory" directions show large total movement
– In the example, total motion in the vertical direction is much greater than in the horizontal direction
– This can happen even when momentum or Nesterov's method is used
• Improvement: dampen the step size in directions with high motion
– Second order term
108
Normalizing steps by second moment

• Modify the usual gradient-based update:
– Scale updates in every component in inverse proportion to the total movement of that component in the recent past
• According to their variation (not just their average)
• This will change the relative update sizes for the individual components
– In the above example it would scale down the Y component
– And scale up the X component (in comparison)
• We will see two popular methods that embody this principle…
109
RMSProp
• Notation:
– Updates are made per parameter, i.e. for each individual weight w
– The derivative of the loss w.r.t. any individual parameter is shown as ∂_w D
• D is the batch or mini-batch loss, or the individual divergence, for batch/mini-batch/SGD updates
– The squared derivative is ∂²_w D = (∂_w D)²
• This short-hand notation represents the squared derivative, not the second derivative
– The mean squared derivative is a running estimate of the average squared derivative. We will show this as E[∂²_w D]

• Modified update rule: we want to
– scale down updates with large mean squared derivatives
– scale up updates with small mean squared derivatives
110
RMSProp
• This is a variant on the basic mini-batch SGD algorithm

• Procedure:
– Maintain a running estimate of the mean squared value of the derivative for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative

• Note the similarity to Rprop
– The magnitude of the derivative is being normalized out
112
RMSProp (updates are for each weight of each layer)
• Do:
– Randomly shuffle inputs to change their order
– Initialize: k = 1; for all weights w in all layers, E[∂²_w D]_0 = 0
– For all t = 1 : B : T (incrementing in blocks of B inputs)
• For all weights in all layers initialize ∂_w D = 0
• For b = 0 : B − 1
– Compute
» Output Y(X_{t+b})
» Gradient dDiv(Y(X_{t+b}), d_{t+b}) / dw
» ∂_w D += (1/B) dDiv(Y(X_{t+b}), d_{t+b}) / dw
• Update: for all weights w (i.e. all w_{i,j}^{(k)} ∀ i, j, k)
E[∂²_w D]_k = γ E[∂²_w D]_{k−1} + (1 − γ) (∂_w D)²_k
w_{k+1} = w_k − (η / √(E[∂²_w D]_k + ε)) ∂_w D
• k = k + 1
• Until the loss has converged
113
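A minimal per-step RMSProp sketch corresponding to the update above. The constants used (γ = 0.9, η = 0.001, ε = 1e-8) are commonly used defaults stated here as assumptions, not values taken from the slides.

```python
import numpy as np

def rmsprop_update(weights, grads, mean_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp step over a list of per-layer weights.

    mean_sq holds the running estimate E[d^2_w D] for each layer; grads are the
    (mini-batch averaged) derivatives for the current step.
    """
    for k in range(len(weights)):
        # Running estimate of the mean squared derivative
        mean_sq[k] = gamma * mean_sq[k] + (1.0 - gamma) * grads[k] ** 2
        # Scale the step by the inverse root mean squared derivative
        weights[k] -= lr * grads[k] / np.sqrt(mean_sq[k] + eps)
    return weights, mean_sq
```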
ADAM: RMSprop with momentum
• RMSProp only considers a second-moment normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
– It considers both first and second moments

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of the derivative for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative
114
ADAM: RMSprop with momentum
• RMSProp only considers a second-moment normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of the derivative for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative
• The running estimates are bias-corrected so that their zero initialization does not dominate the updates in early iterations
115
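The equations on this slide are not legible in the extracted text, so the sketch below uses the standard ADAM update with bias correction; the hyperparameter defaults (β1 = 0.9, β2 = 0.999, η = 0.001, ε = 1e-8) are the commonly used ones, stated as assumptions.

```python
import numpy as np

def adam_update(weights, grads, m, v, step, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step over a list of per-layer weights.

    m and v are running estimates of the mean and mean-squared derivative;
    step is the 1-based update count used for bias correction.
    """
    for k in range(len(weights)):
        # Running estimates of the first and second moments of the derivative
        m[k] = beta1 * m[k] + (1.0 - beta1) * grads[k]
        v[k] = beta2 * v[k] + (1.0 - beta2) * grads[k] ** 2
        # Bias correction: compensates for the zero initialization of m and v
        m_hat = m[k] / (1.0 - beta1 ** step)
        v_hat = v[k] / (1.0 - beta2 ** step)
        # First-moment step scaled by the inverse root of the second moment
        weights[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return weights, m, v
```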
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But they come with other hyperparameters to be optimized
– Typical parameters:
• RMSProp: the smoothing factor γ and the learning rate η
• ADAM: the two moment decay rates and the learning rate
116
Visualizing the optimizers: Beale’s Function

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

119
Visualizing the optimizers: Long Valley

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

120
Visualizing the optimizers: Saddle Point

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

121
Story so far
• Gradient descent can be sped up by incremental updates
– Convergence is guaranteed under most conditions
• The learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each observation. Can be much faster than batch learning
– Mini-batch updates: update after batches. Can be more efficient than SGD

• Convergence can be improved using smoothed updates
– RMSProp and more advanced techniques
122
