0% found this document useful (0 votes)

39 views8 pages

Nestrov Gradient Descent

This document presents a unifying framework for gradient-based optimization methods. It shows that two popular algorithms, Nesterov's Accelerated Gradient and momentum, can be interpreted as approximations of an algorithm called Regularised Update Descent. Regularised Update Descent directly optimizes the parameter update by minimizing a regularized objective function. When approximated, this framework recovers Nesterov's Accelerated Gradient and momentum as special cases. The authors also show that Regularised Update Descent can converge faster than Nesterov's Accelerated Gradient or momentum on quadratic objectives.

Uploaded by

vivek

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views8 pages

Nestrov Gradient Descent

Uploaded by

vivek

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Nesterov’s Accelerated Gradient and Momentum as approximations

to Regularised Update Descent

Aleksandar Botev, Guy Lever and David Barber

Department of Computer Science

University College London

July 12, 2016

Abstract
arXiv:1607.01981v2 [stat.ML] 11 Jul 2016

We present a unifying framework for adapting the update direction in gradient-based iterative
optimization methods. As natural special cases we re-derive classical momentum and Nesterov’s
accelerated gradient method, lending a new intuitive interpretation to the latter algorithm. We
show that a new algorithm, which we term Regularised Gradient Descent, can converge more quickly
than either Nesterov’s algorithm or the classical momentum algorithm.

1 Introduction
We present a framework for optimisation by directly setting the parameter update to optimise the
objective function. Under natural approximations, two special cases of this framework recover Nes-
terov’s Accelerated Gradient (NAG) descent[3] and the classical momentum method (MOM)[5]. This
is particularly interesting in the case of NAG since, though popular and theoretically principled, it
has largely defied intuitive interpretation. We show that (at least for the special quadratic objective
case) our algorithm can converge more quickly than either NAG or MOM.

Given a continuous objective J(θ) we consider iterative algorithms to minimise J. We write J 0 (θ) for
the gradient of the function evaluated at θ, and similarly J 00 (θ) for the second derivative1 . Our focus
is on first-order methods, namely those that form the parameter update on the basis of only first order
gradient information.

1.1 Gradient Descent

Perhaps the simplest optimisation approach is Gradient Descent (GD) which, starting from the current
parameters, locally modifies the parameter θt at iteration t to reduce J. Based on the Taylor series
expansion
J(θt + vt ) = J(θt ) + vt J 0 (θt ) + O vt2

(1)
for a small learning rate αt > 0, setting vt = −αt J 0 (θt ) reduces J. This motivates the GD update
θt+1 = θt + vt . For convex Lipshitz J GD converges to the optimum value J ∗ as J(θt ) − J ∗ ∼ 1/t [4].
Whilst gradient descent is universally popular, alternative methods such as momentum and Nesterov’s
Accelerated Gradient (NAG) can result in significantly faster convergence to the optimum.

1.2 Momentum
The intuition behind momentum (MOM) is to continue updating the parameter along the previous
update direction. This gives the algorithm (see for example [5])
vt+1 = µt vt − αt J 0 (θt )
(2)
θt+1 = θt + vt+1
1
These definitions extend in an obvious way to the gradient vector and Hessian in the vector θ case.

1
where 0 ≤ µt ≤ 1 is the momentum parameter. It is well known that GD can suffer from plateauing
when the objective landscape has ridges (due to poor scaling of the objective, for instance) causing
the optimization path to zig-zag. Momentum can alleviate this since persistent ascent directions
accumulate in (2), whereas directions in which the gradient is quickly changing tend to cancel each
other out. The algorithm is also useful in the stochastic setting when only a sample of the gradient
is available. By averaging the gradient over several minibatches/samples, the averaged gradient will
better approximate the full batch gradient. In a different setting, when the objective function becomes
flat, momentum is useful to maintain progress along directions of shallow gradient. As far as we are
aware, relatively little is known about the convergence properties of momentum. We show below, at
least for a special quadratic objective, that momentum indeed converges.

1.3 Nesterov’s Accelerated Gradient

Nesterov’s Accelerated Gradient (NAG) [3] is given by

yt+1 = (1 + µt )θt − µt θt−1

(3)
θt+1 = yt+1 − αt J 0 (yt+1 )

NAG has the interpretation that the previous two parameter values are smoothed and a gradient de-
scent step is taken from this smoothed value. For Lipshitz convex functions (and a suitable schedule
for µt and αt ), NAG converges at rate 1/t2 . Nesterov proved that this is the optimal possible rate for
any method based on first order gradients2 [3]. Nesterov proposed the schedule µt = 1 − 3/(5 + t) and
fixed αt , which we adopt in the experiments.

Recently, [6] showed that by setting vt+1 = θt+1 − θt , equation (3) can be rewritten as:

vt+1 = µt vt − αt J 0 (θt + µt vt )
(4)
θt+1 = θt + vt+1

This formulation reveals the relation of NAG to the classical momentum algorithm equation (2) which
uses J 0 (θt ) in place of J 0 (θt + µt vt ) in equation (4). In both cases, NAG and MOM tend to continue
updating the parameters along the previous update direction.

In the machine learning community, NAG is largely viewed as somewhat mysterious and explained as
performing a lookahead gradient evaluation and then performing a correction [6]. The closely related
momentum is often loosely motivated by analogy with a physical system [5]. One contribution of our
work, presented in section(2), shows that these algorithms can be intuitively understood from the
perspective of optimising the objective with respect to the update vt itself.

2 Regularised Update Descent

We consider a separable objective

ˆ t , vt ) ≡ J(θt ) + γ v 2
J(θ (5)
2 t

for which the θ that minimises Jˆ is clearly the same as the one that minimises J, with vt = 0 at the
minimum. We propose3 to update θt to θt + vt to reduce J. ˆ To do this we update vt to reduce4

J(θ ˆ t + vt , vt ) = J(θt + vt ) + γ vt2

˜ t , vt ) ≡ J(θ (6)
2
2
This is a ‘worst case’ result. For example for quadratic functions, convergence is exponentially fast, leaving open the
possibility that other algorithms may have superior convergence on ‘benign’ problems.
3
Previous authors have also considered optimising the update, for example [2].
4
Note that the regulariser term γt vt2 /2 is necessary. For the objective J(θt + vt ) alone, the update would be vt+1 =
vt − αt J 0 (θt + vt ). In this case, convergence for v occurs when J 0 (θt + vt ) = 0, for which vt+1 = vt . Using the update
θt+1 = θt + vt would then result in the parameter θ never converging; the parameter θ would pass though the minimum
J 0 (θ) = 0 and continue beyond this, never to return.

2
We note that the optimum of J˜ occurs when

∂ J˜ ∂ J˜
= 0, =0 (7)
∂θt ∂vt
These two conditions give

J 0 (θt + vt ) = 0, J 0 (θt + vt ) + γt vt = 0 (8)

which implies that at the optimum vt = 0 and therefore that J 0 (θt ) = 0 when we have found the
˜ Hence, the θt that minimises J˜ also minimises J.
optimum of J.

Differentiating J˜ with respect to vt we obtain

J 0 (θt + vt ) + γt vt (9)
˜
We thus make a gradient descent update in the direction that lowers J:

vt+1 = vt − αt J 0 (θt + vt ) + γt vt

(10)

We initially proposed to optimise J(θ) via the update θt+1 = θt + vt by performing gradient descent
on J˜ with respect to vt . However, we have now improved vt to vt+1 . This suggests therefore that a
superior update for θt is θt+1 = θt + vt+1 . The complete Regularised Update Descent (RUD) algorithm
is given by (see also algorithm(1))

vt+1 = µt vt − αt J 0 (θt + vt )
(11)
θt+1 = θt + vt+1

where µt ≡ 1 − αt γt . As we converge towards a minimum, the update vt will become small (since the
gradient is small) and the regularisation term can be safely tuned down. This means that µt should
be set so that it tends to 1 with increasing iterations. As we will show below one can view MOM and
NAG as approximations to RUD based on a first order expansion (for MOM) and a more accurate
second order expansion (for NAG).

Algorithm 1 Regularised Update Descent for T iterations

Require: Initial guess θ1 , learning rates αt and increasing momentum schedule 0 ≤ µt ≤ 1
1: v1 ← 0
2: for t ← 1 to T − 1 do
3: vt+1 ← µt vt − αt J 0 (θt + vt )
4: θt+1 ← θt + vt+1
5: end for
6: return θT

2.1 Deriving MOM from RUD

We consider an update vt at the current θt . Assuming vt is small:

J(θt + vt ) = J(θt ) + vt J 0 (θt ) + O vt2

(12)

Under this first order approximation, the RUD objective becomes

γt 2
J(θt ) + vt J 0 (θt ) + v (13)
2 t
Differentiating wrt vt we get

J 0 (θt ) + γt vt (14)

3
We thus make an update in this direction:

vt+1 = vt − αt J 0 (θt ) + γt vt

(15)
0
= µt vt − αt J (θt ) (16)

where µt should be close to 1. We then make a parameter update

θt+1 = θt + vt+1 (17)

which recovers the momentum algorithm. We can therefore view momentum as optimising, with
respect to the update, a first order approximation of the RUD objective.

2.2 Deriving NAG from RUD

Expanding J(θt + vt ) to the next order, we obtain
1
J(θt + vt ) = J(θt ) + vt J 0 (θt ) + vt2 J 00 (θt ) + O vt3

(18)
2
Since vt is not infinitesimally small, we cannot ‘trust’ the higher order terms as we move away from
vt = 0; as we move further from θt we are trying to approximate the function based on curvature
information at θt , rather than the current point vt + θt . This is analogous to the idea of trust regions
in Quasi-Newton approaches which limit the extent to which the Taylor expansion is trusted away
from the origin [4]. To encode this lack of trust, we reduce the second order term by a factor µt < 1
and add another term to encourage vt to be small. This gives the modified approximate RUD objective

µt 2 00 γt
J(θ) + vt J 0 (θt ) + vt J (θ) + vt2 (19)
2 2
Differentiating with respect to vt we get

J 0 (θt ) + µt vt J 00 (θt ) + γt vt = J 0 (θt + µt vt ) + γt vt + O v 2

(20)

We then update vt to reduce this approximate RUD objective:

vt+1 = vt − αt J 0 (θt + µt vt ) + γt vt

(21)
0
= (1 − αt γt )vt − αt J (θt + µt vt ) (22)

We are free to choose αt , and γt which should both be small. Ideally µt should be close to 1. Hence,
it is reasonable to set 1 − αt γt = µt and choose µt to be close to 1. This setting recovers the NAG
algorithm:

vt+1 = µt vt − αt J 0 (θt + µt vt ) (23)

θt+1 = θt + vt+1 (24)

and explains why we want µt to tend to 1 as we converge, since as we zoom in to the minimum, we
can trust more a quadratic approximation to the objective. An alternative interpretation of NAG (as
a two stage optimisation process) and its relation to RUD is outlined in Appendix (A).

From the perspective that NAG and MOM are approximations to RUD, NAG is preferable to MOM
since it is based on a more accurate expansion. In terms of RUD versus NAG, the difference between
NAG and RUD is the use of µt in the argument of J 0 (θt + µt vt ) in NAG, whereas we use J 0 (θt + vt )
in RUD. This means that RUD ‘looks further forward’ than NAG (since µt < 1) in a manner more
consistent with the eventual parameter update θt + vt+1 . This tentatively explains why RUD can
outperform NAG.

4
RUD convergence RUD better than NAG MOM better than NAG MOM better than RUD

0.8 0.8 0.8 0.8

0.6 0.6 0.6 0.6

alpha

alpha
0.4 0.4 0.4 0.4

0.2 0.2 0.2 0.2

0 0 0 0
0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1
mu mu mu mu

(a) (b) (c) (d)

Figure 1: (a) Shaded is the parameter region (µ, α) for which RUD converges for the simple quadratic
function f (θ) = 0.5θ2 . (b) Shaded is the parameter region (µ, α) for which RUD converges more
quickly than NAG. (c) Shaded is the region in which MOM converges more quickly than NAG. (d)
Shaded is the region in which MOM converges more quickly than RUD.

3 Comparison on a Quadratic function

An interesting question is whether and under what conditions RUD may converge more quickly than
NAG for convex Lipshitz functions. To date we have not been able to fully analyse this. In lieu of a
more complete understanding we consider the simple quadratic objective5
1
J(θ) = θ2 (25)
2
For this simple function the gradient is given by θ and, for fixed αt , µt , we are able to fully compute
the update trajectories for NAG and RUD and MOM.

3.1 NAG
For NAG, the algorithm is given by

vt+1 = µvt − α(θt + µvt ) (26)

θt+1 = θt + vt+1 (27)

Assuming v1 = 0, and a given value for θ1 , this gives θ2 = (1 − α)θ1 . Similarly, for both MOM and
NAG, θ2 is given by the same value.

We can write equations (26,27) as a single second order difference equation

θt+1 + bθt + cθt−1 = 0 (28)

where

b ≡ −1 − µ + α + αµ (29)
c ≡ µ − αµ (30)

For the scalar case dim(θ) = 1, assuming a solution of the form θt = Awt gives
√
−b ± b2 − 4c
w= (31)
2
which defines two values w+ and w− , so that the general solution is given by
t t
θt = Aw+ + Bw− (32)
5
For the simple quadratic objective, the convergence is exponentially fast in terms of the number of iterations. This
is clearly a very special case compared to the more general convex Lipshitz scenario. Nevertheless, the analysis gives
some insight that some improvement over NAG might be possible.

5
8 0.35
GD
MOM
6 NAG
0.3
RUD

4
0.25

2
0.2

0.15
−2

0.1
−4

0.05
−6

−8 0
0 5 10 15 20 25 30 35 40 45 50 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(a) (b)

Figure 2: Optimising a 1000 dimensional quadratic function J(θ) using different algorithms, all with
the same learning rate αt and µt schedule. (a) The log objective log J(θt ) for Gradient Descent,
Momentum, Nesterov’s Accelerated Gradient and Regularised Update Descent. (b) Trajectories of
the different algorithms plotted for the first two components (θ1t , θ2t ). The behaviour demonstrated
is typical in that momentum tends to more significantly overshoot the minimum than RUD or NAG,
with RUD typically outperforming NAG.

where A and B are determined by the linear equations

θ1 = Aw+ + Bw− (33)
2 2
θ2 = Aw+ + Bw− (34)
A sufficient condition for NAG to converge is that |w+ | < 1 and |w− | < 1 which is equivalent to the
conditions |b| < 1 + c, c < 1 [7]. For any learning rate 0 < α < 1 and momentum 0 < µ < 1, it
is straightforward to show that these conditions hold and thus that NAG converges to the minimum
θ∗ = 0.

3.2 MOM
The above analysis carries over directly to the MOM algorithm, with the only change being
b ≡ −1 − µ + α (35)
c≡µ (36)
It is straightforward to show that for any learning rate 0 < α < 1 and momentum 0 < µ < 1, the
corresponding conditions |w+ | < 1 and |w− | < 1 are always satisfied. Therefore the MOM algorithm
(at least for this problem) always converges. For MOM to have better asymptotic convergence rate
than NAG, we need max(|w+ M OM |, |w M OM |) < max(|w N AG |, |w N AG |). From fig(1) we see that MOM
− + −
only outperforms NAG (and RUD) when the momentum is small. This is essentially the uninteresting
regime since, in practice, we will typically use a value of momentum that is close to 1. For this simple
quadratic case, for practical purposes, MOM therefore performs worse than RUD or NAG.

3.3 RUD
For the RUD algorithm the corresponding solutions are given by setting
b ≡ −1 − µ + 2α (37)
c≡µ−α (38)

6
RUD has more complex convergence behaviour than NAG or MOM. The conditions |w+ | < 1 and
|w− | < 1 are satisfied only within the region as shown in fig(1a), which is determined by
3
1+µ> α (39)
2
The main requirement is that the learning rate should not be too high, at least for values of momentum
µ less than 0.5. Unlike NAG and MOM, RUD has therefore the possibility to diverge.

In fig(1b) we show the region for which the asymptotic convergence of RUD is faster than NAG. The
main requirement is that the momentum needs to be high (say above 0.8) and is otherwise largely
independent of the learning rate (provided α < 1).

4 Experiments
4.1 A toy high dimensional quadratic function
In fig(2) we show the progress for different algorithms using the same learning rate αt = 0.2 and
µt = 1 − 3/(5 + t) for a toy 1000 dimension quadratic function 21 θT Aθ − θT b for randomly chosen A
and b. This simple experiment shows that the theoretical property derived in section(3) that RUD
can outperform NAG and MOM carries over to the more general quadratic setting. Indeed, in our
experience, the improved convergence of RUD over NAG for the quadratic objective function is typical
behaviour.

4.2 Deep Learning: MNIST

Whilst RUD has interesting convergence for quadratic functions, in practice of course it is important
to see how it behaves in the case of more general non-convex functions. In fig(3) we look at a classical
deep learning problem of training an 784 − 1000 − 500 − 250 − 30 autoencoder for handwritten digit
reconstruction [1]. The dataset consists of black and white images of size 28x28 and we used 50000
training images, with the images scaled to lie in the 0 to 1 range. The target is for the network to
learn to reduce the dimensionality of the input to a 30 dimensional vector and then to reconstruct
the input. The nonlinearity at each layer is the hyperbolic tangent6 and for the last layer we used the
binary cross entropy loss.

Since NAG and RUD are closely related, we use the same schedule µt = 1 − 3/(5 + t) for both
algorithms. All remaining hyperparameters for each method (learning rates) were set optimally based
on a grid search over a set of reasonable parameters for each algorithm. For this problem, there is
little difference between NAG and RUD, with RUD slightly outperforming NAG.

5 Conclusion
We described a general approach to first order optimisation based on optimising the objective with
respect to the updates. This gives a simple optimisation algorithm which we termed Regularised
Update Descent; we showed that his algorithm can converge more quickly than Nesterov’s Accelerated
Gradient. In addition to being a potentially useful optimisation algorithm in its own right, the main
contribution of this work is to show that the Nesterov and momentum algorithms can be viewed as
approximations to the Regularised Update Descent algorithm.

References
[1] G. E. Hinton and R. R Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507, 2006.
6
We tried also rectifier linear units and leaky rectifier linear units, but they did not affect the relative performance of
any of the algorithms.

7
6.0
SGD
NAG
RUD
MOM
5.5 Figure 3: The negative log loss for the classical
MNIST 784 − 1000 − 500 − 25 − 30 autoencoder
network [1] trained using minibatches contains 200
examples. Similar to the small quadratic objec-
5.0 tive experiments, we see that on this much larger
problem, as expected, NAG and RUD perform very
similarly (with RUD slightly outperforming NAG).
4.5 All methods used the same learning rate and mo-
mentum parameter µt schedule.

0 2000 4000 6000 8000 10000

[2] P-Y. Massé and Y. Ollivier. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

[3] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k 2 ).
In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[4] J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research and
Financial Engineering. Springer, Berlin, 2006.

[5] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks,
12(1):145–151, 1999.

[6] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and mo-
mentum in deep learning. In Proceedings of the 30th international conference on machine learning
(ICML-13), pages 1139–1147, 2013.

[7] K. Sydsaeter and P. Hammond. Essential Mathematics for Economic Analysis. Prentice Hall,
2008.

A Alternative NAG derivation

For the objective

˜ t , vt ) = J(θt + vt ) + 1 γt v 2
J(θ (40)
2 t
˜ The algorithm proceeds as follows: given θt and vt
we consider a two stage process of optimizing J.
we first perform a descent step only on the regularizer, followed by a descent step on the ‘lookahead’
J(θt + v). After this we perform the usual step on θt based on the final updated v. The procedure is
summarized below:
ṽt+1 = vt − αt γt vt = (1 − αt γt )vt
gt = J 0 (θt + ṽt+1 )
(41)
vt+1 = vet+1 − αt gt = (1 − αt γt )vt − αt gt
θt+1 = θt + vt+1

Setting µt = 1 − αt γt recovers the NAG formulation as in [6]. RUD therefore differs from NAG in that
it does not perform the initial descent step on the regulariser term so that for RUD ṽt+1 = vt .

Deep Learning Book Part2
No ratings yet
Deep Learning Book Part2
101 pages
UNIT3
No ratings yet
UNIT3
17 pages
ISYE 8803 - Kamran - M5 - Optimization Methods 2
No ratings yet
ISYE 8803 - Kamran - M5 - Optimization Methods 2
17 pages
Gradient Descent (GD) - GD With Momentum - Nesterov Accelerated GD - Stochastic GD - OrIGINAL
No ratings yet
Gradient Descent (GD) - GD With Momentum - Nesterov Accelerated GD - Stochastic GD - OrIGINAL
25 pages
Algorithm
100% (1)
Algorithm
885 pages
Mathematics 11 02466 v2
No ratings yet
Mathematics 11 02466 v2
37 pages
Algorithms and Complexity
No ratings yet
Algorithms and Complexity
130 pages
CS60010 Fitting-1
No ratings yet
CS60010 Fitting-1
39 pages
Module 2 Part1new
No ratings yet
Module 2 Part1new
32 pages
A Geometric Structure of Acceleration and Its Role in Making Gradients Small Fast
No ratings yet
A Geometric Structure of Acceleration and Its Role in Making Gradients Small Fast
40 pages
Rajesh (DL Unit3) 06dec2024
No ratings yet
Rajesh (DL Unit3) 06dec2024
67 pages
Revisiting Natural Gradient For Deep Networks
No ratings yet
Revisiting Natural Gradient For Deep Networks
18 pages
Composite
No ratings yet
Composite
26 pages
DL 26-09
No ratings yet
DL 26-09
22 pages
9 - Gradient Descent Part 3
No ratings yet
9 - Gradient Descent Part 3
31 pages
Optimization
No ratings yet
Optimization
26 pages
DL CS 6 M2 Live Session Flow
No ratings yet
DL CS 6 M2 Live Session Flow
32 pages
Super GD
No ratings yet
Super GD
15 pages
Questions For CSE 7th Sem
No ratings yet
Questions For CSE 7th Sem
14 pages
Super Gradient Descent: Global Optimization Requires Global Gradient
No ratings yet
Super Gradient Descent: Global Optimization Requires Global Gradient
15 pages
Keynote 1
No ratings yet
Keynote 1
32 pages
Opt 202 LN
No ratings yet
Opt 202 LN
86 pages
Nesterov's Momentum Method
No ratings yet
Nesterov's Momentum Method
9 pages
Momentum
No ratings yet
Momentum
2 pages
DL Regularization
No ratings yet
DL Regularization
51 pages
Principles of Communication - Iit Madras Lecture Material
100% (4)
Principles of Communication - Iit Madras Lecture Material
501 pages
4.1 - EDA Lecture Module 4 Vetri Sir New
No ratings yet
4.1 - EDA Lecture Module 4 Vetri Sir New
19 pages
Optimization of Gradiant Descant
No ratings yet
Optimization of Gradiant Descant
7 pages
Algebra Previous Year Question Paper
No ratings yet
Algebra Previous Year Question Paper
35 pages
DL Class2
No ratings yet
DL Class2
30 pages
Mustapha 2021 J. Phys. Conf. Ser. 1743 012002
No ratings yet
Mustapha 2021 J. Phys. Conf. Ser. 1743 012002
13 pages
Optimization Techniques (SGD Alternatives)
No ratings yet
Optimization Techniques (SGD Alternatives)
34 pages
Nesterov Momentum
No ratings yet
Nesterov Momentum
3 pages
06 23ECE216 GradientDescent v2
No ratings yet
06 23ECE216 GradientDescent v2
73 pages
12B Homework Packet (2022 2023)
No ratings yet
12B Homework Packet (2022 2023)
109 pages
054 Report
No ratings yet
054 Report
6 pages
Lecture 11
No ratings yet
Lecture 11
35 pages
EDA Lecture Module 4
No ratings yet
EDA Lecture Module 4
20 pages
Gradient Descent Overview
No ratings yet
Gradient Descent Overview
14 pages
Optim
No ratings yet
Optim
33 pages
Lecture 8 Gradient Descent For Non-Convex Functions
No ratings yet
Lecture 8 Gradient Descent For Non-Convex Functions
21 pages
CS 437 / CS 5317 Deep Learning: Murtaza Taj
No ratings yet
CS 437 / CS 5317 Deep Learning: Murtaza Taj
11 pages
L5 - UCLxDeepMind DL2020
No ratings yet
L5 - UCLxDeepMind DL2020
52 pages
Curs6site PDF
No ratings yet
Curs6site PDF
40 pages
Approximation of Solution Operators for High-dimensional PDEs部分3
No ratings yet
Approximation of Solution Operators for High-dimensional PDEs部分3
2 pages
Unit 2.2
No ratings yet
Unit 2.2
46 pages
All Optimizers
No ratings yet
All Optimizers
13 pages
8811 Acceleration Via Symplectic Discretization of High Resolution Differential Equations
No ratings yet
8811 Acceleration Via Symplectic Discretization of High Resolution Differential Equations
9 pages
LN - Optimization For ML
No ratings yet
LN - Optimization For ML
129 pages
Optimization
No ratings yet
Optimization
6 pages
Optimization For Deep Learning: Sebastian Ruder
No ratings yet
Optimization For Deep Learning: Sebastian Ruder
49 pages
Chap 4 Beyond Gradient Descent
No ratings yet
Chap 4 Beyond Gradient Descent
26 pages
Lecture 7 (With Notes)
No ratings yet
Lecture 7 (With Notes)
39 pages
Butere F2 Exam 3 Maths
No ratings yet
Butere F2 Exam 3 Maths
15 pages
Linear Non Linear Iterative Learning Control
No ratings yet
Linear Non Linear Iterative Learning Control
31 pages
On The Momentum Term in Gradient Descent Learning Algorithms
No ratings yet
On The Momentum Term in Gradient Descent Learning Algorithms
7 pages
Advanced Gradient Descent
No ratings yet
Advanced Gradient Descent
14 pages
Berkeley-Tutorial Optimization For Machine Learningpart2
No ratings yet
Berkeley-Tutorial Optimization For Machine Learningpart2
35 pages
Number Theory For GMAT
No ratings yet
Number Theory For GMAT
15 pages
8.2 Finite Difference, Finite Element and Finite Volume Methods For Partial Differential Equations
No ratings yet
8.2 Finite Difference, Finite Element and Finite Volume Methods For Partial Differential Equations
32 pages
Gauss-Jordan Elimination
No ratings yet
Gauss-Jordan Elimination
18 pages
2017 AMC 8 Sample Exam 2
No ratings yet
2017 AMC 8 Sample Exam 2
13 pages
June 2018 QP
No ratings yet
June 2018 QP
36 pages
Wa0000.
No ratings yet
Wa0000.
4 pages
P2 Mathematics Set 3 Revision Work Budo Junior School
100% (2)
P2 Mathematics Set 3 Revision Work Budo Junior School
7 pages
Nonlinearity in Structural Dynamics Chapter App G
No ratings yet
Nonlinearity in Structural Dynamics Chapter App G
11 pages
OR Quiz
75% (4)
OR Quiz
5 pages
Comparison of Gradient Descent Algorithms On Training Neural Networks
No ratings yet
Comparison of Gradient Descent Algorithms On Training Neural Networks
20 pages
Gradient Descent Optimization
No ratings yet
Gradient Descent Optimization
27 pages
Complex Numbers
No ratings yet
Complex Numbers
30 pages
Worksheet 1 Answers
No ratings yet
Worksheet 1 Answers
16 pages
A Numerical Lifting-Line Method Using Horseshoe Vortex Sheets
No ratings yet
A Numerical Lifting-Line Method Using Horseshoe Vortex Sheets
11 pages
How To Simplify Rational Numbers
No ratings yet
How To Simplify Rational Numbers
5 pages
Xii Target Board 2025
No ratings yet
Xii Target Board 2025
1 page
Gradient Descent
No ratings yet
Gradient Descent
5 pages
Calculus
No ratings yet
Calculus
42 pages
Assignment 1 Solutions 1
No ratings yet
Assignment 1 Solutions 1
10 pages
5 OperationsResearch
No ratings yet
5 OperationsResearch
1 page
Year 1 Test 1
100% (2)
Year 1 Test 1
9 pages
UNIT III Illustrative Programs
No ratings yet
UNIT III Illustrative Programs
12 pages
Allama Iqbal Open University, Islamabad Warning: (Department of Mathematics)
No ratings yet
Allama Iqbal Open University, Islamabad Warning: (Department of Mathematics)
3 pages
Optimization Techniques in Deep Learning
No ratings yet
Optimization Techniques in Deep Learning
14 pages
Maths - BP - Xii
No ratings yet
Maths - BP - Xii
1 page
Maths Viii
No ratings yet
Maths Viii
4 pages
Name: - Section: - Score
No ratings yet
Name: - Section: - Score
4 pages
Mathematics Year 4 Paper 1 September 2009
100% (2)
Mathematics Year 4 Paper 1 September 2009
7 pages
Math Newsletter
No ratings yet
Math Newsletter
2 pages
Math8 Q4 Summative Test 2 1
No ratings yet
Math8 Q4 Summative Test 2 1
2 pages
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
From Everand
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
Yue Jiang
4.5/5 (2)
Level Set Method: Advancing Computer Vision, Exploring the Level Set Method
From Everand
Level Set Method: Advancing Computer Vision, Exploring the Level Set Method
Fouad Sabry
No ratings yet

Nestrov Gradient Descent

Uploaded by

Nestrov Gradient Descent

Uploaded by

Nesterov’s Accelerated Gradient and Momentum as approximations

to Regularised Update Descent

Department of Computer Science

July 12, 2016

1.1 Gradient Descent

1.3 Nesterov’s Accelerated Gradient

yt+1 = (1 + µt )θt − µt θt−1

2 Regularised Update Descent

J(θ ˆ t + vt , vt ) = J(θt + vt ) + γ vt2

J 0 (θt + vt ) = 0, J 0 (θt + vt ) + γt vt = 0 (8)

Differentiating J˜ with respect to vt we obtain

Algorithm 1 Regularised Update Descent for T iterations

2.1 Deriving MOM from RUD

J(θt + vt ) = J(θt ) + vt J 0 (θt ) + O vt2

Under this first order approximation, the RUD objective becomes

where µt should be close to 1. We then make a parameter update

θt+1 = θt + vt+1 (17)

2.2 Deriving NAG from RUD

J 0 (θt ) + µt vt J 00 (θt ) + γt vt = J 0 (θt + µt vt ) + γt vt + O v 2

We then update vt to reduce this approximate RUD objective:

vt+1 = µt vt − αt J 0 (θt + µt vt ) (23)

0.8 0.8 0.8 0.8

0.6 0.6 0.6 0.6

0.2 0.2 0.2 0.2

(a) (b) (c) (d)

3 Comparison on a Quadratic function

vt+1 = µvt − α(θt + µvt ) (26)

We can write equations (26,27) as a single second order difference equation

θt+1 + bθt + cθt−1 = 0 (28)

where A and B are determined by the linear equations

4.2 Deep Learning: MNIST

0 2000 4000 6000 8000 10000

A Alternative NAG derivation

You might also like