Skript Opt Mach
Anton Schiela
Contents

1 Optimization problems in ML
1.1 A general fitting problem
1.2 Some loss-functions
1.3 Some regularizers
1.4 Some models
1.5 Statistical interpretation

5 Momentum
5.1 The dynamics of the heavy-ball method
5.2 First and second order fixed point iterations
5.3 Interpretation as a gradient aggregation method
Preface
These lecture notes were written for a lecture (2+1) held in WS 19/20 at the University of Bayreuth. They cover a particular aspect of machine learning, namely the optimization methods used to solve the problems arising there. The main source is the following review article:
Bottou, L., Curtis, F., Nocedal, J.: "Optimization Methods for Large-Scale Machine Learning", SIAM Review, 60(2), 2018
Concerning proximal gradient methods:
Beck, A.: "First-Order Methods in Optimization", MOS-SIAM Series on Optimization, 2017
For neural networks in general:
Goodfellow, I., Bengio, Y., Courville, A.: "Deep Learning", MIT Press, 2016
In addition, I would like to thank Prof. Stefan Ulbrich and Prof. Marc Pfetsch from TU Darmstadt, and Prof. Roland Herzog from TU Chemnitz for sharing their own lecture notes with me. These were a great starting point for me.
Chapter 1
Optimization problems in ML
The principal ideas behind machine learning are not new. From a mathematical point of view, one could
also use the word data fitting.
In general the following quantities are involved:
• given input data xi ∈ X, i = 1 . . . N .
• given output data yi ∈ Y , i = 1 . . . N .
• unknown model parameters w ∈ W .
Quite often X is a high-dimensional space (e.g. of images, sound samples, . . . ), and also W is high-
dimensional. Also N can be very large. It is the number of examples to learn from.
We would like to find w and an input-to-output mapping h(·, w) : X → Y that can predict for given
input x ∈ X a matching output y ∈ Y .
Example 1.0.1 (Polynomial least squares fit). The simplest (and well known) example, with $X = Y = \mathbb{R}$ and $W = \mathbb{R}^{d+1}$, is a polynomial least squares fit (linear regression):
$$\min_w \sum_{i=1}^N \frac{1}{2}|h(x_i; w) - y_i|^2 \quad\text{where}\quad h(x; w) := \sum_{j=0}^d w_j x^j.$$
If we define the matrix $A \in \mathbb{R}^{N\times(d+1)}$ via $a_{ij} = x_i^j$ we can write this as a classical least-squares problem:
$$\min_w \sum_{i=1}^N \frac{1}{2}|(Aw)_i - y_i|^2 = \frac{1}{2}\|Aw - y\|_2^2.$$
It can be solved via QR-decomposition or the normal equation approach. The latter yields the following
linear system of equations:
$$A^T A w = A^T y$$
• If A is injective (over-determined case), then our problem has a unique solution, but Aw ̸= y in
general.
• If A is surjective (under-determined case), then Aw = y, but w may be non-unique.
The general fitting problem has the form
$$\min_w \frac{1}{N}\sum_{i=1}^N l(h(x_i; w), y_i) + \rho(w), \tag{1.1}$$
where
• h : X × W → Y is a parametrized model
• l : Y × Y → R is called a loss-function
• ρ : W → R is called a regularization term
The choices of h, l, and ρ are the main modelling ingredients of machine learning. The approximate
solution of this problem by an optimization algorithm is called learning. In this lecture we will consider
popular optimization algorithms for the solution of (1.1).
The following challenges arise:
• Often N is very large, making the evaluation of the full sum inefficient
• Often X and W have high dimension and/or h is complicated
• Often ρ is non-smooth (see below)
In short notation we will also write this problem as follows:
$$\min_w \frac{1}{N}\sum_{i=1}^N f_i(w) + \rho(w), \qquad f_i(w) = l(h(x_i, w), y_i). \tag{1.2}$$
Vectorial loss-functions
If $Y = \mathbb{R}^n$ is a vector space, some generalizations may be as follows:
• $l(h, y) = \frac{1}{2}\|h - y\|_2^2$
For classification among n > 2 classes it is popular to use the following procedure:
• Assume that x is a member of class j ∈ {1, . . . , n}. Then $y = e_j$, the j-th unit vector.
• The so-called soft-argmax operator $\mathrm{sam} : Y \to\, ]0, 1[^n \subset Y$ is defined as:
$$\mathrm{sam}(y) = \frac{\exp y}{\|\exp y\|_1} = \frac{\exp y}{\sum_{i=1}^n \exp y_i} \quad (\exp \text{ applied componentwise}),$$
which, however, leads to losses with very flat regions if h and $e_j$ are far apart. Instead one uses:
$$l(h, e_j) = -e_j^T \ln(\mathrm{sam}(h)) = \ln\Big(\sum_{i=1}^n \exp h_i\Big) - h_j.$$
Asymptotically, this functional becomes flat if $h_j$ dominates the other components of h and increases linearly if $h_j$ is very negative.
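In practice this loss is evaluated with the log-sum-exp trick to avoid overflow in exp. A minimal sketch in Python/NumPy (function and variable names are illustrative, not part of the lecture):

import numpy as np

def cross_entropy_loss(h, j):
    # l(h, e_j) = ln(sum_i exp(h_i)) - h_j, evaluated stably:
    # shifting by c = max(h) leaves the result unchanged, since
    # ln sum exp(h - c) = ln sum exp(h) - c.
    c = np.max(h)
    return c + np.log(np.sum(np.exp(h - c))) - h[j]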
Popular regularizers are the squared 2-norm (Tikhonov regularization)
$$\rho(w) = \frac{\lambda}{2}\|w\|_2^2 = \frac{\lambda}{2}\sum_i w_i^2$$
and the 1-norm
$$\rho(w) = \lambda\|w\|_1 = \lambda\sum_i |w_i|.$$
The latter often produces sparse parameters, meaning that only a few $w_i$ are nonzero. This may be helpful if each $w_i$ has a specific interpretation as a weight for a feature.
The regularizers enforce uniqueness of solutions and encourage certain desired properties of the parameters. The hyperparameter λ can be used to weight between regularity and loss-minimization.
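The sparsity effect of the 1-norm can be observed in a tiny synthetic experiment. A sketch using scikit-learn, assuming it is available; data and parameter values are purely illustrative:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]            # only 3 active features
y = A @ w_true + 0.01 * rng.standard_normal(50)

ridge = Ridge(alpha=1.0).fit(A, y)       # 2-norm regularizer
lasso = Lasso(alpha=0.1).fit(A, y)       # 1-norm regularizer
print("nonzeros (ridge):", np.sum(np.abs(ridge.coef_) > 1e-8))  # typically all 20
print("nonzeros (lasso):", np.sum(np.abs(lasso.coef_) > 1e-8))  # typically about 3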
The simplest model is a linear (affine) model:
$$h(x; w) = (x, 1)^T w = \sum_{i=1}^d x_i w_i + w_0.$$
This may, of course, be extended to vector-valued x (yielding a more complicated notation):
$$h(x; w) = p(x)w = \sum_{i=0}^d x_{j_i}^{k_i} w_i,$$
where each component of p(x) is a monomial in the entries of x.
Linear models, combined with convex loss functions and convex regularizers yield convex optimization
problems.
Example 1.4.1 (Support Vector Machine). A well known model that is comparatively simple is the
support vector machine, which is used for binary classification:
$$\min_w \frac{1}{N}\sum_{i=1}^N \max(0, 1 - h_i y_i) + \rho(w) \quad\text{s.t.}\quad h_i = (x_i, 1)^T w.$$
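For concreteness, a sketch of this objective in Python/NumPy for labels $y_i \in \{-1, +1\}$; all names are illustrative:

import numpy as np

def svm_objective(w, X, y, lam):
    # Average hinge loss plus Tikhonov term lam/2 * ||w||^2.
    # A column of ones is appended so that h_i = (x_i, 1)^T w includes a bias.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    h = Xb @ w
    return np.maximum(0.0, 1.0 - y * h).mean() + 0.5 * lam * np.dot(w, w)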
Example 1.4.2 (A feed forward neural network). Neural networks are structured as a composition of layer mappings $h(\cdot; w) = \tilde h_L \circ \cdots \circ \tilde h_1$ through intermediate spaces $Z_k$. Each $Z_i$ is called a layer and each $\tilde h_k(z; \tilde w)$ is constructed as follows:
h̃k (z; w̃) = α(A(w̃)z + b(w̃))
with the affine transformation A · +b : Zk−1 → Zk and the activation function α : Zk → Zk , which is
usually defined in a componentwise way.
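Such a layer is a one-liner in code. A minimal Python/NumPy sketch with ReLU as activation function (names illustrative):

import numpy as np

def layer(z, W, b):
    # One layer h~(z; W, b) = alpha(W z + b) with alpha = ReLU, applied componentwise.
    return np.maximum(W @ z + b, 0.0)

def forward(x, params):
    # Compose the layers; params is a list of (W, b) pairs.
    z = x
    for W, b in params:
        z = layer(z, W, b)
    return z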
• Currently the most popular activation function is called ReLU (rectified linear unit): $\alpha(z) = \max(z, 0)$.
This function is not differentiable at 0, which is regarded more as a theoretical than a practical problem. Of course, smoothed versions could be used (but aren't in practice).
Earlier, sigmoid-type functions were used frequently (like $\arctan$, $\tanh$, $e^z/(1+e^z)$).
The objective functional considered above can be interpreted as an approximation of the expected risk:
$$\text{Expected risk:}\quad R(w) = E[l(h(x; w), y)] = \int_{X\times Y} l(h(x; w), y)\, dP(x, y),$$
$$\text{Empirical risk:}\quad R_N(w) = \frac{1}{N}\sum_{i=1}^N l(h(x_i; w), y_i) = \frac{1}{N}\sum_{i=1}^N f_i(w),$$
where $(x_i, y_i)$ are samples, drawn according to P(x, y). We then know that $E[R_N(w)] = R(w)$, i.e., the empirical risk is an unbiased estimator of the expected risk.
The expected square error can be computed via the variance of $f(w, \xi) := l(h(x; w), y)$ with $\xi = (x, y)$, where we write $f(w) := E[f(w, \xi)] = R(w)$:
$$V[f(w)] = E[|f(w, \xi) - f(w)|^2] = \int_{X\times Y} |f(w, \xi) - f(w)|^2\, dP(\xi) = \|f(w, \cdot) - f(w)\|_{L_2(X\times Y)}^2.$$
$$E[(R_N(w) - f(w))^2] = \frac{1}{N}V[f(w)].$$
Proof. Introduce for fixed w the random variable $\eta = f(w, \xi) - f(w)$, so that $E(\eta) = 0$ and $V(\eta) = V(f) = E(\eta^2)$, and $\eta_N = N^{-1}\sum \eta_i$, so that $V(\eta_N) = V(R_N(w))$. Since $\eta_i$ and $\eta_j$ are independent, we obtain $\mathrm{Cov}(\eta_i, \eta_j) = E[\eta_i\eta_j] = 0$ for $i \neq j$.
Then we compute:
$$V(\eta_N) = E\Big[N^{-2}\Big(\sum_{i=1}^N \eta_i\Big)^2\Big] = E\Big[N^{-2}\sum_{i=1}^N \eta_i^2\Big] + E\Big[N^{-2}\sum_{i=1}^N\sum_{j\neq i} \eta_i\eta_j\Big] = N^{-2}\sum_{i=1}^N E[\eta_i^2] + N^{-2}\sum_{i=1}^N\sum_{j\neq i} E[\eta_i\eta_j] = N^{-2}\sum_{i=1}^N V(\eta_i) + 0 = N^{-1}V(\eta).$$
By Chebyshev's inequality
$$P(|X - \bar X| > \varepsilon) \le \frac{V(X)}{\varepsilon^2}$$
we obtain:
$$P(|R_N(w) - R(w)| > \varepsilon) \le \frac{V(f(w))}{N\varepsilon^2}.$$
Hence, for a given confidence level we obtain $\varepsilon \sim N^{-1/2}$.
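The rate $\varepsilon \sim N^{-1/2}$ is easy to observe numerically. A sketch estimating the root mean square error of empirical means over many independent trials (all choices illustrative):

import numpy as np

rng = np.random.default_rng(1)
true_mean = 0.5                            # E[xi] for xi ~ U(0, 1)
for N in [100, 400, 1600, 6400]:
    # 1000 independent empirical risks, each averaging N samples
    R_N = rng.uniform(0.0, 1.0, size=(1000, N)).mean(axis=1)
    rmse = np.sqrt(np.mean((R_N - true_mean) ** 2))
    print(N, rmse)                         # the rmse halves when N quadruples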
There is a much deeper stochastic theory of machine learning available, for which we recommend attending a lecture on statistical machine learning.
Chapter 2
The ultimate goal of learning algorithms, once a model is given, is to minimize R(w) or, in the presence of a regularization term ρ(w), the quantity
$$F(w) := R(w) + \rho(w). \tag{2.1}$$
However, this is not so easily possible: neither R(w) nor R′(w) is computationally available. The only thing that we can do is evaluate $f_i(w)$ and $f_i'(w)$ for (randomly drawn) samples.
Proposition 2.1.1. A sequence defined by $w_{k+1} = w_k + \delta w_k$ as above yields either $F(w_k) \to -\infty$ or $\|F'(w_k)\| \to 0$.
Proof. Clearly, F (wk ) is decreasing monotonically, so either F (wk ) → −∞ or F (wk ) → F∗ ∈ R. In the
latter case this implies
ν∥F ′ (wk )∥2 ≤ F (wk ) − F (wk+1 ) → 0.
$$\nu(\alpha) = \alpha - \frac{L\alpha^2}{2}$$
This expression is maximized for α = 1/L with ν(1/L) = 1/(2L). It is positive if α < 2/L.
Proof. Using $\delta w = -\alpha\nabla F(w)$:
$$F(w + \delta w) - F(w) \le F'(w)\delta w + \frac{L}{2}\|\delta w\|^2 = -\alpha F'(w)\nabla F(w) + \frac{L\alpha^2}{2}\|\nabla F(w)\|^2 = -\Big(\alpha - \frac{L\alpha^2}{2}\Big)\|F'(w)\|^2 = -\nu(\alpha)\|F'(w)\|^2.$$
For quadratic problems, the resulting speed of convergence is governed by the condition number $\kappa = \lambda_n/\lambda_1$.
With a step direction Δw one may also determine the step-size by exact line search:
$$\alpha \leftarrow \operatorname*{argmin}_\alpha F(w + \alpha\Delta w).$$
Consider now a random step Δw with expectation
$$\overline{\Delta w} := E[\Delta w]$$
and variance
$$V[\Delta w] = E[\|\Delta w - \overline{\Delta w}\|^2] = E[\|\Delta w\|^2] - \|\overline{\Delta w}\|^2,$$
and set δw = αΔw. Then we obtain:
$$E[F(w + \delta w) - F(w)] \le E\Big[F'(w)\alpha\Delta w + \frac{L}{2}\|\alpha\Delta w\|^2\Big] = \alpha F'(w)\overline{\Delta w} + \frac{\alpha^2 L}{2}E[\|\Delta w\|^2] = \alpha F'(w)\overline{\Delta w} + \frac{\alpha^2 L}{2}\big(\|\overline{\Delta w}\|^2 + V[\Delta w]\big).$$
Here we used monotonicity and linearity of the expectation and the definition of the variance.
Stochastic gradients
Stochastic gradient methods choose Δw as an unbiased estimator of the negative gradient, i.e. $\overline{\Delta w} = -\nabla F(w)$. Then the above estimate yields an expected decrease of $-\nu(\alpha)\|F'(w)\|^2$ with
$$\nu(\alpha) = \alpha - \frac{\alpha^2 L}{2}\Big(1 + V[\Delta w]/\|F'(w)\|^2\Big).$$
Assume now that
$$V[\Delta w] \le V_0 + V_1\|F'(w)\|^2.$$
Then:
$$E[F(w + \delta w) - F(w)] \le -\Big(\alpha - \frac{\alpha^2 L}{2}\big(1 + V_0/\|F'(w)\|^2 + V_1\big)\Big)\|F'(w)\|^2 = -\nu_1(\alpha)\|F'(w)\|^2 + \frac{\alpha^2 LV_0}{2}$$
with
$$\nu_1(\alpha) = \alpha - \frac{\alpha^2 L}{2}(1 + V_1).$$
Obviously, ν₁(α) > 0 for sufficiently small α. However, if ∥F′(w)∥ → 0, the second term will dominate and the right hand side becomes positive. Hence, our estimate no longer guarantees an expected decrease if ∥F′(w)∥ is too small.
Theoretically optimal α
The theoretically optimal choice of α results from maximizing ν(α) with respect to α:
$$\alpha^* = \frac{1}{L(1 + V[\Delta w]/\|F'(w)\|^2)}$$
with
$$\nu(\alpha^*) = \frac{1}{2L(1 + V[\Delta w]/\|F'(w)\|^2)} = \frac{\|F'(w)\|^2}{2L(\|F'(w)\|^2 + V[\Delta w])}.$$
The expected optimal decrease factor ν(α*) thus depends on two quantities:
• As usual, the nonlinearity of the problem, encoded in the Lipschitz constant L. The choice of a
problem suited norm (preconditioning) could reduce L. Alternatively, acceleration techniques may
be used.
• The variance of the step, relative to the derivative. This term is only relevant, if
V[∆w] ≫ ∥F ′ (w)∥2 .
We may not assume that $V[\Delta w] \sim \|F'(w)\|^2$ as $\|F'(w)\| \to 0$. A realistic assumption may be:
$$V[\Delta w] \le V_0 + V_1\|F'(w)\|^2.$$
Then:
$$\nu(\alpha^*) \ge \frac{\|F'(w)\|^2}{2L(V_0 + (1 + V_1)\|F'(w)\|^2)} \sim \frac{\|F'(w)\|^2}{2LV_0} \quad\text{for } \|F'(w)\| \to 0.$$
Hence, techniques to reduce the variance may be useful in later stages of the algorithm. For example, if F(w) = R(w) + ρ(w), we may define the step using the exactly evaluated ρ′ and a sampled gradient of R only, so that the regularizer does not contribute to the variance.
Suboptimal choice of α
In practice, the required quantities for finding α are not available. A suboptimal choice of α will change
the properties of the algorithm.
• Too large α may result in expected increase:
$$\alpha \ge \frac{2}{L(1 + V[\Delta w]/\|F'(w)\|^2)} = 2\alpha^*$$
implies ν(α) ≤ 0 and thus the expected value of the functional may not decrease.
• Too small α yields only slow decrease. We observe that ν(α) ∼ α, asymptotically.
If F is µ-strongly convex, then $\|F'(w)\|^2 \ge 2\mu(F(w) - F_*)$, and hence:
$$\nu_1(\alpha)2\mu(F(w) - F_*) \le \nu_1(\alpha)\|F'(w)\|^2 \le F(w) - E[F(w + \delta w)] + \frac{\alpha^2 LV_0}{2}.$$
Rearranging yields:
$$E[F(w + \delta w)] - F_* \le (1 - 2\mu\nu_1(\alpha))(F(w) - F_*) + \frac{\alpha^2 LV_0}{2}$$
with
$$\nu_1(\alpha) = \alpha - \frac{\alpha^2 L}{2}(1 + V_1) \ge \frac{\alpha}{2} \quad\text{for } \alpha \le \frac{1}{L(1 + V_1)},$$
and thus:
$$E[F(w + \delta w)] - F_* \le (1 - \mu\alpha)(F(w) - F_*) + \frac{\alpha^2 LV_0}{2}.$$
Writing $w_k = w$ and $w_{k+1} = w_k + \delta w$ and taking into account that $F(w_k)$ is itself the result of a stochastic gradient method, we obtain:
$$E_{0:k}(F(w_{k+1}) - F_*) = E_{0:k-1}\big(E_k(F(w_{k+1}) - F_*)\big) \le (1 - \mu\alpha_k)E_{0:k-1}(F(w_k) - F_*) + \frac{\alpha_k^2 LV_0}{2}.$$
Constant stepsizes
Theorem 2.3.1. For constant step-sizes satisfying $0 < \alpha \le \frac{1}{L(1+V_1)}$ we obtain
$$E[F(w_k) - F_*] \to \frac{\alpha LV_0}{2\mu}.$$
Proof. Subtracting $\frac{\alpha LV_0}{2\mu}$ from both sides and using $\mu\alpha\cdot\frac{\alpha LV_0}{2\mu} = \frac{\alpha^2 LV_0}{2}$, one obtains:
$$E[F(w_{k+1}) - F_*] - \frac{\alpha LV_0}{2\mu} \le (1 - \mu\alpha)\Big(E[F(w_k) - F_*] - \frac{\alpha LV_0}{2\mu}\Big).$$
This is a contraction.
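The asymptotic level of Theorem 2.3.1 can be reproduced in a scalar toy problem. A sketch with F(w) = w²/2 (so L = µ = 1) and artificial gradient noise of variance V₀; all parameter values are illustrative:

import numpy as np

rng = np.random.default_rng(2)
alpha, V0 = 0.1, 1.0                     # L = mu = 1 for F(w) = w^2 / 2
w, errors = 5.0, []
for k in range(20000):
    g = w + np.sqrt(V0) * rng.standard_normal()   # noisy gradient F'(w) + noise
    w -= alpha * g
    errors.append(0.5 * w * w)                    # F(w) - F_*  (F_* = 0)
# The error does not decay to 0 but plateaus (here near 0.026),
# below the theoretical level alpha*L*V0/(2*mu) = 0.05.
print(np.mean(errors[10000:]))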
Harmonic Stepsizes
Theorem 2.3.2. Suppose that the SG method is run with a step size sequence such that for all k ∈ ℕ:
$$\alpha_k = \frac{\beta}{\gamma + k}, \quad\text{for some } \beta > \frac{1}{\mu} \text{ and } \gamma > 0 \text{ such that } \alpha_1 \le \frac{1}{L(1 + V_1)}.$$
Then, with $\nu := \max\Big\{\frac{\beta^2 LV_0}{2(\beta\mu - 1)},\, (\gamma + 1)(F(w_1) - F_*)\Big\}$:
$$E_{0:k}[F(w_k) - F_*] \le \frac{\nu}{\gamma + k}.$$
Proof. By induction; the base case holds by the definition of ν. For the step, abbreviate $\hat k := \gamma + k$, so $\alpha_k = \beta/\hat k$, and assume $E[F(w_k) - F_*] \le \nu/\hat k$. Then:
$$E[F(w_{k+1})] - F_* \le (1 - \alpha_k\mu)E[F(w_k) - F_*] + \frac{\alpha_k^2 LV_0}{2} \le \Big(1 - \frac{\beta\mu}{\hat k}\Big)\frac{\nu}{\hat k} + \frac{\beta^2 LV_0}{2\hat k^2} = \frac{\hat k - \beta\mu}{\hat k^2}\nu + \frac{\beta^2 LV_0}{2\hat k^2} = \frac{\hat k - 1}{\hat k^2}\nu \underbrace{- \frac{\beta\mu - 1}{\hat k^2}\nu + \frac{\beta^2 LV_0}{2\hat k^2}}_{\le 0} \le \frac{\nu}{\hat k + 1},$$
where the last step uses $(\hat k - 1)(\hat k + 1) \le \hat k^2$.
2.4 Comparison of complexity: stochastic vs. classical gradient method
Suppose we want to reach the statistical accuracy ε ∼ N^{−1/2}. With a linear convergence factor Θ per step and cost N per step, the classical gradient method needs
$$k_G \sim \frac{\log\varepsilon}{\log\Theta} = \frac{\log N}{2|\log\Theta|} \;\Rightarrow\; T_G \sim N\log N,$$
while the stochastic gradient method (cost O(1) per step) needs
$$k_{SG} \sim \frac{1}{\varepsilon} \sim \sqrt{N} \;\Rightarrow\; T_{SG} \sim \sqrt{N}.$$
A comparison yields a clear advantage for the stochastic gradient method in terms of work complexity.
The estimate 1/√N for the statistical error is a worst case estimate, which can be improved in some special situations. However, the errors will not decay faster than 1/N under any set of reasonable assumptions. In this case a balancing of errors ε ∼ 1/N would yield more even results, which still favour stochastic gradients:
$$k_G \sim \frac{\log N}{|\log\Theta|} \;\Rightarrow\; T_G \sim N\log N, \qquad k_{SG} \sim N \;\Rightarrow\; T_{SG} \sim N.$$
By and large it can be said that in this setting one step of the classical gradient method has a similar cost as the whole stochastic gradient algorithm, run to appropriate accuracy.
2.5 Computation of derivatives and gradients
For optimization we need derivatives of F(w) = R(w) + ρ(w), or at least of a summand
$$l(h(x_i, w), y_i) = l_i(h_i(w)) = l_i \circ h_i(w)$$
of $R = \frac{1}{N}\sum l(\cdot)$. We obtain the simple relation
$$F'(w) = R'(w) + \rho'(w), \qquad F''(w) = R''(w) + \rho''(w),$$
where ρ′ and ρ′′ can usually be computed efficiently. Concerning a summand, we obtain by the chain rule:
$$(l_i \circ h_i)'(w) = l_i'(h_i(w))\,h_i'(w), \qquad \nabla(l_i \circ h_i)(w) = h_i'(w)^T\nabla l_i(h_i(w)).$$
In these formulas, $l_i'$ and $l_i''$ can be computed cheaply. Observe that
h′ (w) : W → Y
• In deep neural networks these parameters are distributed within the layers. The same will be true
for the result of h′ (w)T v.
• By the chain-rule, h′ (w) contains products of linear mappings. So in h′ (w)T these products occur
in reverse order.
$$H_k = \partial_w h_k : W_k \to Z_k, \qquad \tilde H_k = \partial_z \tilde h_k : Z_{k-1} \to Z_k, \qquad \tilde R_k = \partial_{\tilde w}\tilde h_k : \tilde W_k \to Z_k,$$
which gives rise to a recursive algorithm for evaluating $h_i'(w)^T v = H_L^T v$:
applyHkT(ζ_k, k):
    ζ_{k-1} = H̃_k^T ζ_k ∈ Z_{k-1}
    s_k = R̃_k^T ζ_k ∈ W̃_k
    if (k > 0): v_{k-1} = applyHkT(ζ_{k-1}, k − 1)
    return v_k := (v_{k-1}, s_k)

h′(w)^T v = applyHkT(v, L)
We observe that this is a recursion that works backwards through the layers. The corresponding
algorithm is thus called back-propagation. We observe a couple of things:
• In each step, some components are propagated via H̃kT , others are generated via R̃kT
• The simple notation with a vector v hides the fact that v is a collection of matrices δ W̃ and vectors
δ w̃0 .
Consider now a concrete layer $\tilde h(z, \tilde W, \tilde w_0) = \alpha(\tilde W z + \tilde w_0)$. Here z is the vector of inputs into the layer, W̃ is the matrix of connection parameters and w̃₀ is the bias vector. Together (W̃, w̃₀) are the parameters associated with the current layer.
Thus, h̃ is a mapping:
h̃ : Z × WM × W0 → Z+ ,
where $W_M$ is a space of matrices. Hence, its partial derivatives are linear mappings:
H̃ : Z → Z+
R̃M : WM → Z+
R̃0 : W0 → Z+
Let us apply the chain rule. Since α is defined componentwise, α′ is a diagonal matrix D. If, for example, ReLUs are used, the diagonal consists of 0 and 1. Thus, calling $\alpha'(\tilde W z + \tilde w_0) = D$, we obtain:
$$\tilde H\delta z = D\tilde W\delta z, \qquad \tilde R_0\delta\tilde w_0 = D\delta\tilde w_0, \qquad \tilde R_M\delta\tilde W = D\delta\tilde W z.$$
The remaining part $\tilde R_M : W_M \to Z_+$ is a little bit harder to interpret and to transpose, because $\tilde R_M$ is a linear mapping that takes a matrix and yields a vector:
$$\tilde R_M\delta\tilde W = D\delta\tilde W z.$$
To compute the transpose
$$\tilde R_M^T : Z_+ \to W_M, \qquad \tilde R_M^T\zeta\cdot\delta\tilde W := \zeta\cdot\tilde R_M\delta\tilde W \quad\forall\delta\tilde W,$$
where · is the sum of products of all matching entries (either of vectors or matrices). Since $\zeta\cdot Dv = D^T\zeta\cdot v$, we define $r = D^T\zeta$ and arrive at the problem of finding M such that for given r, z:
$$M\cdot\delta\tilde W = r\cdot\delta\tilde W z \quad\forall\delta\tilde W.$$
This computation can be performed best with the help of the trace:
$$\operatorname{tr} A = \sum_{i=1}^n A_{ii} \quad\Rightarrow\quad M\cdot\delta\tilde W = \sum_{ij} M_{ij}\delta\tilde W_{ij} = \operatorname{tr} M^T\delta\tilde W.$$
One important property of the trace, that we will use is that it commutes:
tr AB = tr BA.
$$M\cdot\delta\tilde W = r\cdot\delta\tilde W z = r^T\delta\tilde W z = \operatorname{tr}(r^T\delta\tilde W z) = \operatorname{tr}(zr^T\delta\tilde W) = rz^T\cdot\delta\tilde W,$$
so $\tilde R_M^T\zeta = M = rz^T$.
applyHkT(ζ_k, z, k):
    r_k = D_k^T ζ_k
    ζ_{k-1} = W̃_k^T r_k
    M_k = r_k z_{k-1}^T
    if (k > 0): v_{k-1} = applyHkT(ζ_{k-1}, z, k − 1)
    return v_k = (v_{k-1}, M_k, r_k)
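A compact Python/NumPy sketch of this back-propagation for a ReLU network, following the recursion above; it returns the pairs (M_k, r_k) for all layers (names illustrative):

import numpy as np

def forward(x, params):
    # Forward pass; store the input z_{k-1} of every layer for the backward pass.
    zs, z = [x], x
    for W, b in params:
        z = np.maximum(W @ z + b, 0.0)            # alpha = ReLU
        zs.append(z)
    return zs

def apply_hT(zs, params, zeta):
    # Apply h'(w)^T zeta by propagating zeta backwards through the layers.
    # D_k = alpha'(W z + b) is diagonal with entries in {0, 1} (a 0/1 mask).
    grads = []
    for k in reversed(range(len(params))):
        W, b = params[k]
        mask = (W @ zs[k] + b > 0).astype(float)
        r = mask * zeta                           # r_k = D_k^T zeta_k
        grads.append((np.outer(r, zs[k]), r))     # M_k = r_k z_{k-1}^T
        zeta = W.T @ r                            # zeta_{k-1} = W~_k^T r_k
    return list(reversed(grads))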
Chapter 3

We have seen that the variance of the step computation is a decisive quantity when analysing the speed of convergence of stochastic gradient methods, and reducing it improves the guaranteed decrease. There are essentially two ways to achieve this:
• dynamic sampling. Here the sample size used for the step computation is increased during the iteration.
• gradient aggregation. Here gradients from previous iterates are collected and averaged.
3.1 Dynamic sampling
Suppose we want the variance of the steps to decrease geometrically:
$$V[\Delta w_k] \sim \Theta^k,$$
and, as we have seen, the variance depends on the sample size n as follows:
$$V[\Delta w] \sim \frac{1}{n}.$$
Thus, the sample size, and hence the cost for each step, grows exponentially:
$$n_k \sim \Theta^{-k}.$$
On the other hand, this choice yields linear convergence, so after k steps the error satisfies ε ∼ Θ^k. Thus, the total computational effort $\sum_{j\le k} n_j \sim \Theta^{-k}$ is proportional to 1/ε, which is similar to stochastic gradient methods.
This estimate is, however, still not computable, because $\overline{\Delta w}$ is unknown. We may, however, replace $\overline{\Delta w}$ by an estimate:
$$\Delta W := \frac{1}{n}\sum_{j=1}^n \Delta w_j.$$
We apply the standard identity for the sample variance to our case:
$$E\Big[\sum_{i=1}^n \|(\Delta w_i - \overline{\Delta w}) - (\Delta W - \overline{\Delta w})\|^2\Big] = E\Big[\sum_{i=1}^n \|\Delta w_i - \overline{\Delta w}\|^2\Big] - E[n\|\Delta W - \overline{\Delta w}\|^2] = \sum_{i=1}^n V[\Delta w_i] - nV[\Delta W] = nV[\Delta w_i] - V[\Delta w] = (n - 1)V[\Delta w].$$
In our case, $\Delta w_i$ would be a step computed from a single data evaluation, and Δw a step created by dynamic sampling. We thus need the variance of a step $\Delta W^{(m)}$, computed by averaging m gradients:
$$V[\Delta W^{(m)}] = \frac{1}{m}V[\Delta w] \approx V_{Est}[\Delta W] := \frac{1}{m(n-1)}\sum_{i=1}^n \|\Delta w_i - \Delta W\|^2.$$
However, this requires an estimate of $\|F'(w_k)\|$, which is not available in general. In fact, close to a minimizer, using a small sample size, the estimate
$$\|F'(w)\| \approx \|F_m'(w)\| = \Big\|\frac{1}{m}\sum_{i=1}^m f_i'(w) + \rho'(w)\Big\|$$
will usually be an overestimate. One way to avoid this problem is to choose $V_1 < 1$ sufficiently small. Then by the inverse triangle inequality:
$$\|F'(w)\| = \|F_m'(w) + F'(w) - F_m'(w)\| \ge \|F_m'(w)\| - \|F'(w) - F_m'(w)\|.$$
If $\|F'(w) - F_m'(w)\| \ll \|F_m'(w)\|$ with high probability, we obtain a good estimate.
Alternatively, the non-adaptive rule
$$V_{Est}[\Delta W_k^{(m_k)}] \le V_0\,\zeta^k$$
could be chosen. However, here it is hard to choose appropriate parameters to make such a method efficient.
3.2 Gradient aggregation
Consider again steps of the form $\Delta w_k = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_k)$. This implies the expensive computation of n gradients $\nabla f_i(w_k)$ for i = 1 . . . n, where $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ is a randomly chosen subset of the data.
To reduce this effort, it may be a good idea to reuse information. If, for example, $\nabla f_i$ has been evaluated at a previous iterate, say $w_{j(i)}$, then we may use the formula:
$$\Delta w_k = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_{j(i)}).$$
• If ∇fi has never been evaluated before, we may set j(i) = k and compute and store ∇fi (wk ).
• In addition, to update available information, we may recompute ∇fi (wk ) (so j(i) = k) for some i
(which may be chosen randomly or cyclically).
Locally, i.e., if $w_{j(i)} \approx w_k$, we would expect a reduction of variance, similar to dynamic sampling methods. However, we cannot expect that $\Delta w_k$ is an unbiased estimate of $\nabla F(w_k)$.
These ideas leave plenty of room for implementation. The following variants have been proposed and
analysed. Usually linear convergence was shown for strongly convex objectives.
SAG
• Compute first a classical gradient step, and evaluate all $\nabla f_i(w_1)$, so j(i) = 1 for i = 1 . . . N.
• In the following iteration for each k pick i ∈ {1 . . . N } randomly and replace ∇fi (w1 ) by ∇fi (wk ),
so j(i) = k.
• The mean value does not need to be recomputed. We may write:
$$\Delta w_k = \Delta w_{k-1} + \frac{1}{N}\big(\nabla f_i(w_k) - \nabla f_i(w_{j(i)})\big).$$
This complicates convergence theory. The bias, however, vanishes close to the solution.
• The method requires a lot of storage, so it may be unsuitable for very large N .
SAGA
Here the idea is to provide an unbiased estimator. This can be achieved by the following simple modifi-
cation:
" N
! #
1 X
∆wk = ∆wk−1 + ∇fi (wk ) − ∇fi (wj(i) ) = ∇fi (wk ) + ∇fi (wj(i) ) − ∇fi (wj(i) )
N i=1
Since i has been chosen randomly (uniformly distributed), the expected value of the term [. . . ] is 0.
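A sketch of the resulting SAGA loop in Python/NumPy; grad_f(i, w) stands for $\nabla f_i(w)$ and must be supplied by the user (all names illustrative):

import numpy as np

def saga(w, grad_f, N, alpha, iters, rng):
    # Keep a table of the last gradient seen for each data point; the step
    # g - table[i] + mean(table) is an unbiased estimate of the full gradient.
    table = np.array([grad_f(i, w) for i in range(N)])
    mean = table.mean(axis=0)
    for _ in range(iters):
        i = rng.integers(N)
        g = grad_f(i, w)
        w = w - alpha * (g - table[i] + mean)
        mean = mean + (g - table[i]) / N          # update the running mean
        table[i] = g
    return w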
• m inner update steps with $\tilde w_0 = w_k$ and $\tilde w_{j+1} = \tilde w_j + \alpha\Delta\tilde w_j$ with
$$\Delta\tilde w_j = -\big(\nabla f_i(\tilde w_j) - \nabla f_i(w_k) + \nabla R(w_k)\big), \quad i \text{ drawn randomly (the SVRG update).}$$
• Concerning complexity, this method is rather close to the full gradient method.
Chapter 4
Concerning the rate of convergence, not only the variance plays a role, but also the condition number
$$\kappa = \frac{L}{\mu}.$$
Even if V₀ = 0 and V₁ is moderate, we have a contraction factor
$$\theta = 1 - \mu\alpha \ge 1 - \frac{1}{\kappa(1 + V_1)},$$
which means that the (now linear) speed of convergence depends crucially on κ. It is a classical result that the number of iterations needed for a prescribed accuracy ε is proportional to κ.
Also for harmonic step-sizes it can be shown (exercise) that γ ∼ κ and
$$E_{0:k}[F(w_k) - F_*] \le \frac{\nu}{\gamma + k} \sim \frac{\kappa}{2}\max\Big\{\frac{V_0}{\mu(F(w_1) - F_*)},\, (1 + V_1)\Big\}(F(w_1) - F_*)\frac{1}{\gamma + k}.$$
Thus, also here, the error is scaled with κ, asymptotically for k ≫ γ. So the required number of iterations is also proportional to κ.
Hence, reducing the condition number directly reduces the computational effort by the same factor.
The norm for dual quantities is denoted by ∥ · ∥∗. It is defined as an operator norm:
$$\|f\|_* := \sup_{v\neq 0} \frac{f(v)}{\|v\|} \quad\text{for a linear functional } f.$$
Let us now define the following general scalar product (and norm):
$$\langle v, w\rangle := v^T M w \quad (M \text{ s.p.d.}), \qquad \|v\| = \sqrt{v^T M v}.$$
The corresponding dual scalar product is
$$\langle f, g\rangle_* = f^T M^{-1} g.$$
Consider a quadratic functional
$$F(w) = \frac{1}{2}w^T A w + l^T w + c.$$
Then the condition number is given by the solutions of the following generalized eigenvalue problem:
$$Av = \lambda Mv \quad\Leftrightarrow\quad M^{-1}Av = \lambda v$$
as follows:
$$\kappa_M(A) = \frac{\lambda_{\max}}{\lambda_{\min}}.$$
In particular, in the extreme case M = A, we obtain κM (A) = 1. So a good choice of M can reduce
the required computational effort significantly. Note, however, that mere scaling of the norm leaves the
condition number invariant.
The price to pay is that the computation of a search direction via
$$\langle\Delta w, v\rangle = -F'(w)v \quad\forall v$$
requires the solution of the linear system
$$M\Delta w = -F'(w)^T.$$
Given w₁
for k = 1, 2, . . .
    compute an estimate l(w_k) for F′(w_k)^T
    compute M_k
    Δw = −M_k^{-1} l(w_k)
    w_{k+1} = w_k + δw = w_k + αΔw
RMSprop
The motivation of the following algorithm is to gather some estimates of the scaling properties of the
variables.
Here a diagonal matrix Dk is defined recursively as follows:
Sensitivity to scaling
This is one way to make the algorithm less sensitive to the scaling of variables. Setting v = Sw + w₀ (with diagonal S) and $F_S(w) := F(v) = F(Sw + w_0)$, we get $F_S'(w)^T = SF'(v)^T$ and hence $D_S = S^2 D$, so:
$$\Delta v = D^{-1/2}F'(v)^T, \qquad \Delta w = (D_S)^{-1/2}F_S'(w)^T = D^{-1/2}S^{-1}SF'(v)^T = D^{-1/2}F'(v)^T = \Delta v.$$
Hence, the steps are the same. However, given that v = Sw + w0 we rather should expect ∆v = S∆w,
so a scaling with D−1 would also be conceivable.
Adagrad
A similar idea is used in the algorithm Adagrad. Here the following update for D is used:
$$D_k = D_{k-1} + \operatorname{diag}(F'(w_k)_i^2).$$
This has the effect that Mk grows during the iteration, which yields smaller and smaller steps. This effect
may be desirable during a pure stochastic gradient method. This decay, however, often turns out to be
too aggressive in practice.
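A sketch of both diagonal updates in Python/NumPy, with g standing for a (stochastic) gradient F′(w_k)^T; hyperparameter values are illustrative defaults:

import numpy as np

def rmsprop_step(w, g, d, alpha=1e-3, lam=0.9, mu=1e-8):
    # RMSprop: exponentially weighted average of the squared gradient entries.
    d = lam * d + (1.0 - lam) * g**2
    return w - alpha * g / (np.sqrt(d) + mu), d

def adagrad_step(w, g, d, alpha=1e-2, mu=1e-8):
    # Adagrad: the accumulator only grows, so the steps shrink over time.
    d = d + g**2
    return w - alpha * g / (np.sqrt(d) + mu), d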
$$\min_v q(v) := \frac{1}{2}v^T Mv + l^T v.$$
This problem is often very large scale, so it is often not possible to store or to evaluate the matrix M .
Instead, we might have an algorithm that performs the application of M to a vector v:
v → Mv
The absence of M in explicit form rules out matrix factorizations (like M = LR) for the solution of this
system, and we have to use iterative methods.
The simplest idea would be to use a gradient method again, now corresponding to the scalar product $v^T Pw$, where P is itself a preconditioner (an inner preconditioner) for M:
$$p_k := -\nabla_P q(v_k) = -P^{-1}q'(v_k)^T \quad\Big(\text{corresponding to } \min_{\delta v} \frac{1}{2}\delta v^T P\delta v + q'(v_k)\delta v\Big)$$
$$v_{k+1} = v_k + \alpha_k p_k$$
with exact line search:
$$\min_\alpha q(v_k + \alpha p_k) \;\Rightarrow\; 0 = \partial_\alpha q(v_k + \alpha p_k) = q'(v_k + \alpha p_k)p_k = (v_k + \alpha p_k)^T Mp_k + l^T p_k$$
$$\Rightarrow\; \alpha_k = -\frac{(Mv_k + l)^T p_k}{p_k^T Mp_k} = -\frac{q'(v_k)p_k}{p_k^T Mp_k}.$$
As shown in NLP, this method converges with the following speed:
$$\|v_* - v_{k+1}\|_M \le \frac{\kappa - 1}{\kappa + 1}\|v_* - v_k\|_M, \quad\text{where}\quad \kappa = \frac{\lambda_{\max}(M)}{\lambda_{\min}(M)}.$$
This is not a big improvement, compared to application of the unpreconditioned gradient method to the
nonlinear problem.
M -conjugate directions
Application of an additional Gram-Schmidt orthogonalization step improves the situation drastically. We
would like to achieve that the new search direction pk is M -orthogonal to the previous direction pk−1 :
pTk M pk−1 = 0.
Thus, the overall computational effort of cg scales with roughly the square root of that of the gradient method for large condition numbers. A clever implementation uses only one application of M per iteration, namely $p_k \to Mp_k$. The algorithm, with the loop slightly rearranged, is shown below. In addition, we include the possibility of preconditioning the iteration for M with a preconditioner P:
Given v₀; d₀ = Mv₀ + l, g₀ = −P^{-1}d₀, p₀ = g₀
for k = 0, 1, 2, . . .
    p̃_k = M p_k
    α_k = −(d_k^T p_k)/(p_k^T p̃_k)
    v_{k+1} = v_k + α_k p_k
    d_{k+1} = d_k + α_k p̃_k
    g_{k+1} = −P^{-1} d_{k+1}
    β_{k+1} = (d_{k+1}^T g_{k+1})/(d_k^T g_k)
    p_{k+1} = g_{k+1} + β_{k+1} p_k
• For usage inside an optimization method it is important for efficiency to terminate the cg-method early. Thus, the linear system Mv + l = 0 is not solved exactly. The computed steps, however, are better than unpreconditioned gradient steps, even if only a moderate number of cg-iterations is used.
• The analysis of cg is quite involved; it is presented in the lecture "Vertiefung der Numerik". In fact, cg minimizes q(x) on $x_0 + \operatorname{span}\{p_i\}_{i=1\ldots k}$, the space spanned by all previous search directions. In theory, it yields the exact solution after k = dim W iterations. In practice, however, it is terminated much earlier.
• As a preconditioner P one may take an estimate of the diagonal of M. For example, one may use ideas from RMSprop:
$$P_{ii} = \sum_j [\nabla f_j]_i^2 + \mu.$$
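A direct transcription of this loop into Python/NumPy; apply_M and apply_Pinv stand for v ↦ Mv and v ↦ P^{-1}v and are assumed to be supplied by the user (illustrative names):

import numpy as np

def pcg(apply_M, apply_Pinv, l, v0, K, tol=1e-10):
    # Minimize q(v) = 0.5 v^T M v + l^T v with at most K preconditioned cg steps.
    v = v0.copy()
    d = apply_M(v) + l                   # d = q'(v)^T, the residual
    g = -apply_Pinv(d)
    p = g.copy()
    dg = np.dot(d, g)
    for _ in range(K):
        Mp = apply_M(p)                  # the single application of M per step
        alpha = -np.dot(d, p) / np.dot(p, Mp)
        v = v + alpha * p
        d = d + alpha * Mp
        if np.linalg.norm(d) < tol:      # early termination
            break
        g = -apply_Pinv(d)
        dg_new = np.dot(d, g)
        p = g + (dg_new / dg) * p
        dg = dg_new
    return v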
Newton's method, i.e. the choice $M_k = F''(w_k)$, has well-known advantages:
• it is less dependent on the conditioning of the problem (locally κ ∼ 1; globally, it is invariant under linear transformations of the variables),
• step-sizes locally tend to 1 and depend on the non-linearity of the problem, rather than the scaling (no experimentation with step-sizes necessary),
and disadvantages:
• evaluation of the Hessian and solution of the linear system are expensive (remedy: inexact Newton-cg).
4.4 Inexact Newton methods for Neural Networks
If F is convex, we may set $M_k = F''(w_k)$ as an ideal preconditioner. In that case we have to solve the following problem at each step:
$$\min_v \frac{1}{2}F''(w_k)(v, v) + F'(w_k)v.$$
We cannot factorize or even store $F''(w_k)$; thus we use cg with $M = F''(w_k)$ to find an approximate minimizer of the problem.
This requires the following computations:
• Computation of F ′ (w) once at the beginning of the cg-Iteration.
• Computation of F ′′ (w)pk at each iteration of the cg-method.
Thus, one run of the cg-method (using K steps) requires work of about
$$\text{work}(F'(w))(1 + \gamma K),$$
where γ denotes the cost of one evaluation of F″(w)v relative to a gradient evaluation. Obviously, exact evaluation of F″(w)v makes one iteration of cg very expensive compared to a gradient step, because then γ ≥ 1 has to be expected. Rather, we aim to obtain a cheaper approximation Mv ≈ F″(w)v so that γK ≈ 1.
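If no exact Hessian-vector product is implemented, a common cheap substitute is a finite difference of gradients, costing one extra gradient evaluation. A sketch, where grad stands for w ↦ F′(w)^T (illustrative):

import numpy as np

def hessvec_fd(grad, w, v, eps=1e-6):
    # F''(w) v  ~  (F'(w + eps*v) - F'(w)) / eps   (forward difference).
    # In practice eps should be scaled relative to ||w|| and ||v||.
    return (grad(w + eps * v) - grad(w)) / eps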
Hessian Sub-sampling
This is a popular idea to obtain an efficient algorithm in the setting of learning. Assume that our algorithm uses a stochastic gradient method with mini-batches using a sampling size n:
$$R_n'(w)^T = \frac{1}{n}\sum_{i=1}^n f_i'(w)^T = \frac{1}{n}\sum_{i=1}^n h'(x_i, w)^T l'(h(x_i, w), y_i)^T,$$
where the $(x_i, y_i)$ are chosen randomly from the data. This is done, as usual, with back-propagation.
Then we may choose m ≪ n, draw j = 1 . . . m data samples from the already chosen set, and compute in each cg-step (setting $z = h_j(w)$):
$$R_m''(w)v = \frac{1}{m}\sum_{j=1}^m f_j''(w)v = \frac{1}{m}\sum_{j=1}^m \Big(h_j'(w)^T l_j''(z)h_j'(w)v + l_j'(z)h_j''(w)v\Big).$$
Then $M = R_m''(w) + \rho''(w)$ can be used. We have to keep the sample, and thus M, fixed during the cg-iteration; otherwise M would change from cg-step to cg-step.
Concerning
$$f''(w)v = h'(w)^T l''(z)h'(w)v + l'(z)h''(w)v$$
we observe two summands:
$$h'(w)^T l''(z)h'(w)v$$
This summand can be evaluated by one forward propagation $v \to h_j'(w)v$, application of $l_j''(z)$, which is usually cheap, and one back-propagation $\tilde v \to h_j'(w)^T\tilde v$. This costs about twice a gradient evaluation. Moreover, if l is strictly convex, then l″ is positive definite, and thus this term is positive semi-definite as well.
$$l'(z)h''(w)v$$
A naive implementation would compute the matrix h″(w)v first, to which l′(z)^T is applied. However, efficient implementations for this summand are also available. There are no reasonable assumptions that guarantee positive definiteness of this term.
If ReLUs are used, however, it is questionable whether h″(w) makes sense. Recall that h consists of piecewise linear functions α(z) = max(z, 0) and bilinear maps $z_+ = \tilde Wz + \tilde w_0$. While α′ may be seen as a step function, it is hard to give α″ a meaningful interpretation.
In efficient implementations, for each sample the cost is moderately higher than a gradient evaluation (factor $\gamma_{sample} \sim 3$–$5$). We then obtain the total work:
$$\text{work}(F'(w))\Big(1 + \gamma_{sample}\frac{mK}{n}\Big).$$
• Large K and small m yield an accurate solution of the system for an inaccurate M. This is, however, not advocated in the literature and is called "minibatch-overfitting".
Recall the optimal step-size and the corresponding contraction factor, now with norms measured in the preconditioned scalar product:
$$\alpha = \frac{1}{L(1 + V[\Delta w]/\|F'(w)\|_*^2)}, \qquad \theta = 1 - \mu\alpha = 1 - \frac{1}{\kappa(1 + V[\Delta w]/\|F'(w)\|_*^2)}.$$
The parameters of the method influence this as follows:
• minibatch-size m for F ′′ (w): reduction of κ to the “ideal” case M ≈ F ′′ (w). Close to the solution
κ ≈ κM (F ′′ (w)), which would then govern linear convergence.
• number of iterations K for the cg-method. The speed of convergence depends on the condition
number κP (M ) of M relative to the preconditioner P .
• stepsize α
Matrix Chernoff bounds
Just as in the case of function evaluations, a probabilistic measure for the condition number of $M = F_m''(w) = \frac{1}{m}\sum A_j$, relative to $A = F''(w)$, would be helpful. One possible probability-theoretic tool are the so-called matrix Chernoff bounds.
Lemma 4.4.1. Consider symmetric matrices $A_j \in \mathbb{R}^{d\times d}$, j = 1 . . . m. Assume that all matrices are drawn independently and that $E[A_j] = A$.
Let $0 < \lambda_{\min} \le \lambda_{\max}$ be the extreme eigenvalues of A and $\bar\kappa = \lambda_{\max}/\lambda_{\min}$ its condition number. Assume further a bound on the extreme eigenvalues of all $A_j$:
where δw consists of all perturbations $\delta W_k$. If $M^{GN}$ is considered as a matrix, then the partitioning into layers yields a partitioning of $M^{GN}$ into an L × L block matrix.
By the chain rule and the representation $\tilde h(z, \tilde W, \tilde w_0) = \alpha(\tilde Wz + \tilde w_0)$, we obtain the following representation:
$$\partial_{W_k}h(x_j, w)\delta W_k = \partial_{z_k}h(x_j, w)D_k\,\delta W_k\,h_{k-1}(x_j, w),$$
and thus, with $p_j := h_{k-1}(x_j, w)$ and suitable matrices $A_j$ and vectors $q_j$ (collecting the remaining factors), we obtain the blockwise problem:
$$\min_{\delta W_k} \frac{1}{n}\sum_{j=1}^n \frac{1}{2}p_j^T\delta W_k^T A_j\delta W_k p_j + q_j^T\delta W_k p_j.$$
Lemma 4.6.1. Let $v_j \in V$, $w_j \in W$ be some vectors for j = 1 . . . m. Then we have the following equivalence:
$$0 = \sum_{j=1}^m v_j w_j^T \quad\Leftrightarrow\quad \sum_{j=1}^m w_j^T Hv_j = 0 \;\text{ for all matrices } H : V \to W.$$
Proof. Let $e_k \in W$, $e_l \in V$ be unit vectors, so that $H_{kl} = e_k e_l^T$ is a matrix with precisely one entry equal to 1. Then we compute:
$$\sum_{j=1}^m w_j^T H_{kl}v_j = \sum_{j=1}^m (w_j^T e_k)(e_l^T v_j) = \sum_{j=1}^m (e_l^T v_j)(w_j^T e_k) = e_l^T\Big(\sum_{j=1}^m v_j w_j^T\Big)e_k.$$
Thus, the entry $\big[\sum_{j=1}^m v_j w_j^T\big]_{lk} = 0$ iff $\sum_{j=1}^m w_j^T H_{kl}v_j = 0$. This implies our stated equivalence (taking into account that the set of $H_{kl}$ is a basis of the space of linear mappings L(V, W)).
Hence, we have to solve the following linear matrix equation for $\delta W_k$:
$$0 = \frac{1}{n}\sum_{j=1}^n p_j p_j^T\,\delta W_k^T A_j + \frac{1}{n}\sum_{j=1}^n p_j q_j^T \quad\Leftrightarrow\quad \frac{1}{n}\sum_{j=1}^n A_j\,\delta W_k\,p_j p_j^T = -\frac{1}{n}\sum_{j=1}^n q_j p_j^T.$$
Assuming that $A_j \approx A$ is (approximately) independent of j, we can pull A out of the sum and get the following simplified system:
$$A\,\delta W_k^{fac}\,\frac{1}{n}\sum_{j=1}^n p_j p_j^T = -\frac{1}{n}\sum_{j=1}^n q_j p_j^T.$$
Thus, if we define
$$P := \frac{1}{n}\sum_{j=1}^n p_j p_j^T, \qquad Q := \frac{1}{n}\sum_{j=1}^n q_j p_j^T,$$
we obtain the factorized solution
$$\delta W_k^{fac} = -A^{-1}QP^{-1},$$
under the assumption that the symmetric positive semi-definite matrices P and A are invertible. Thus,
we only have to compute Cholesky-factorizations of matrices of size P ∈ Rdin ×din and A ∈ Rdout ×dout .
While A is likely to be positive definite as a sum of positive definite matrices, P is a sum of rank-1
matrices, and thus, there should be at least n ≥ din addends. Otherwise, we may replace P −1 by a
Pseudo-Inverse P + .
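With P, A and Q assembled, the factorized step amounts to two small solves. A Python/NumPy sketch (using generic dense solves in place of explicit Cholesky factorizations; the regularization is an illustrative safeguard):

import numpy as np

def kfac_step(A, P, Q, reg=1e-8):
    # Solve A dW P = -Q, i.e. dW = -A^{-1} Q P^{-1}.
    # Small diagonal shifts keep A and P invertible; for rank-deficient P
    # (n < d_in) one may use np.linalg.pinv(P) instead.
    A_reg = A + reg * np.eye(A.shape[0])
    P_reg = P + reg * np.eye(P.shape[0])
    return -np.linalg.solve(A_reg, np.linalg.solve(P_reg.T, Q.T).T)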
Chapter 5
Momentum
Some moderate reduction of variance and condition number can also be achieved by alternative, maybe less sophisticated strategies. In the following we will discuss two of them.
The heavy-ball method reads:
$$w_{k+1} = w_k - \alpha\nabla F(w_k) + \beta(w_k - w_{k-1}).$$
Its name is due to a physical interpretation of the second term as a momentum or inertia term.
We may interpret the gradient method (β = 0) as an explicit Euler method (with stepsize τ = α) for
the first order equation:
ẇ = −∇F (w)
For stiff problems (where the eigenvalues of −(∇F )′ = −F ′′ are large) it is known that the explicit
Euler method needs small step sizes to be stable, while small eigenvalues of F ′′ yield slow convergence
of w towards the fixed point ∇F = 0. Again we observe that badly conditioned problems yield slowly
converging gradient methods.
To be more concrete, since F″(w) is s.p.d., we can choose a basis (the same for w and v) such that F″(w) = D = diag(d₁, . . . , dₙ) is diagonal. Then the speed of convergence of the trajectory is proportional to the smallest eigenvalue d₁, while the maximal stable step-size is proportional to 1/dₙ. Thus, the required number of steps is proportional to
$$\kappa = \frac{d_n}{d_1}.$$
From the point of view of dynamics, we may want to achieve a damping of the too oscillatory right hand
side.
The heavy-ball method can be seen as a discretization of the following second order equation:
ẇ = v
v̇ = −∇F (w) − ρv
Then our system decouples into 2 × 2 blocks: the Jacobian matrix reads
$$\begin{pmatrix} 0 & I \\ -D & -\rho I \end{pmatrix} \quad\text{with blocks}\quad \begin{pmatrix} 0 & 1 \\ -d_i & -\rho \end{pmatrix}.$$
We may compute the eigenvalues of this 2 × 2 system:
$$0 = \lambda(\lambda + \rho) + d_i \quad\Rightarrow\quad \lambda_{1,2} = -\frac{\rho}{2} \pm \sqrt{\rho^2/4 - d_i}.$$
If we choose ρ such that ρ²/4 − d_i < 0, we obtain conjugate complex eigenvalues with:
$$\operatorname{Re}\lambda_{1,2} = -\frac{\rho}{2}, \qquad |\lambda_{1,2}| = \sqrt{d_i}.$$
Choosing $\rho := 2\sqrt{d_1}$, determined by the smallest eigenvalue, guarantees this for all eigenvalues. The maximal absolute value of our eigenvalues is now $\sqrt{d_n}$.
Hence, just as above, the required number of steps is proportional to
$$\sqrt{\kappa} = \frac{\sqrt{d_n}}{\sqrt{d_1}}.$$
Spectral analysis
Consider a diagonalization of A:
$$S^{-1}AS = D = \operatorname{diag}(d_1, \ldots, d_n) \quad\Rightarrow\quad A = SDS^{-1}.$$
Let us denote by d₁ = γ and dₙ = Γ the smallest and largest eigenvalues, respectively.
Convergence of first order iteration. We obtain for the first order method:
zk+1 = S(I − αD)S −1 zk
and thus for v = S −1 z:
vk+1 = (I − αD)vk ,
so we obtain componentwise:
$$v_{k+1}^i = (1 - \alpha t)v_k^i, \quad t = d_i \in [\gamma, \Gamma],$$
and hence:
$$|v_{k+1}^i| \le \max_i |1 - \alpha d_i|\,|v_k^i| \le \max_{t\in[\gamma,\Gamma]} |1 - \alpha t|\,|v_k^i|.$$
This is straightforward to minimize. Choose α such that both terms in the maximum agree:
$$\alpha = \frac{2}{\gamma + \Gamma} \quad\Rightarrow\quad |1 - \alpha t| \le 1 - \frac{2\gamma}{\gamma + \Gamma} = \frac{\Gamma - \gamma}{\Gamma + \gamma} = \frac{\kappa - 1}{\kappa + 1}.$$
Convergence of second order iteration. The same can be done for the second order iteration and we get:
$$\begin{pmatrix} v_{k+1} \\ v_k \end{pmatrix} = \begin{pmatrix} (1+\beta)I - \alpha D & -\beta I \\ I & 0 \end{pmatrix}\begin{pmatrix} v_k \\ v_{k-1} \end{pmatrix}$$
and thus similarly:
$$\begin{pmatrix} v_{k+1}^i \\ v_k^i \end{pmatrix} = \begin{pmatrix} 1 + \beta - \alpha t & -\beta \\ 1 & 0 \end{pmatrix}\begin{pmatrix} v_k^i \\ v_{k-1}^i \end{pmatrix}, \quad t = d_i.$$
To proceed, we have to compute the eigenvalues of the above matrix. The characteristic equation reads:
$$0 = \lambda(\lambda - (1 + \beta - \alpha t)) + \beta$$
with eigenvalues:
$$\lambda_{1,2} = \frac{(1 + \beta - \alpha t) \pm \sqrt{(1 + \beta - \alpha t)^2 - 4\beta}}{2}.$$
As above, we look for conjugate complex eigenvalues, which occur, if
0 ≥ (1 + β − αt)2 − 4β.
The right hand side is zero if
$$\alpha t = (\sqrt{\beta} - 1)^2 \quad\Rightarrow\quad \sqrt{\beta} - 1 = \pm\sqrt{\alpha t} \quad\Rightarrow\quad \beta = (1 \pm \sqrt{\alpha t})^2.$$
Thus, we obtain conjugate complex eigenvalues iff
$$\beta \in [(1 - \sqrt{\alpha t})^2, (1 + \sqrt{\alpha t})^2] \quad\forall t \in [\gamma, \Gamma],$$
which is implied by
$$\beta := \max\{(1 - \sqrt{\alpha\Gamma})^2, (1 - \sqrt{\alpha\gamma})^2\}.$$
In this case
$$|\lambda_1| = |\lambda_2| = \sqrt{\beta} = \max\{|1 - \sqrt{\alpha\Gamma}|, |1 - \sqrt{\alpha\gamma}|\}.$$
For a minimal choice of β we have to compute:
$$\min_\alpha \max\{|1 - \sqrt{\alpha\Gamma}|, |1 - \sqrt{\alpha\gamma}|\}.$$
Comparison:
$$\text{First order:}\quad \alpha = \frac{2}{\gamma + \Gamma}, \quad \beta = 0, \qquad \Theta \sim \frac{\kappa - 1}{\kappa + 1}$$
$$\text{Second order:}\quad \alpha = \Big(\frac{2}{\sqrt{\gamma} + \sqrt{\Gamma}}\Big)^2, \quad \beta = \Big(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\Big)^2, \qquad \Theta \sim \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$
The difficulty is now, of course, to choose α and β adequately. While Γ can be estimated quite conveniently a-posteriori, γ is rather difficult to estimate but critical. In practice, trial and error is often used.
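The difference between the κ- and √κ-rates is drastic already for moderate condition numbers. A Python/NumPy sketch comparing both iterations on a diagonal quadratic, with α and β chosen as derived above (all parameter values illustrative):

import numpy as np

gamma, Gamma = 1.0, 100.0                       # kappa = 100
d = np.linspace(gamma, Gamma, 50)               # spectrum of A (diagonal)
w_gd = w_hb = w_old = np.ones(50)

alpha_gd = 2.0 / (gamma + Gamma)
alpha_hb = (2.0 / (np.sqrt(gamma) + np.sqrt(Gamma))) ** 2
kappa = Gamma / gamma
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

for k in range(200):
    w_gd = w_gd - alpha_gd * d * w_gd                                  # first order
    w_hb, w_old = w_hb - alpha_hb * d * w_hb + beta * (w_hb - w_old), w_hb
print(np.linalg.norm(w_gd), np.linalg.norm(w_hb))   # heavy ball: far smaller error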
$$w_{k+1} - w_k = -\alpha\sum_{j=0}^k \beta^{k-j}\nabla F(w_j)$$
We observe a weighted sum of gradients. In the case of stochastic gradients, this might yield a reduction of variance, similar to gradient aggregation. However, a theoretical verification of this fact seems to be missing as of yet.
In practice, methods with momentum are quite popular due to their simplicity and the improvement
they yield, compared to stochastic gradients.
We have seen that preconditioning (even if only diagonal) is an important technique to accelerate optimization. Up to now, it did not matter if the preconditioners $M_k$ changed during the course of the optimization, since classical gradient methods do not have a "memory". This is different for methods with momentum. The above equations then read (naming $\nabla_{M_k}F(w_k) := M_k^{-1}F'(w_k)^T$):
$$w_{k+1} - w_k = -\alpha\sum_{j=0}^k \beta^{k-j}M_j^{-1}F'(w_j)^T.$$
Alternatively, we may aggregate the raw derivatives and apply only the current preconditioner, i.e., write:
$$w_{k+1} - w_k = -\alpha M_k^{-1}l_k \quad\text{where}\quad l_k = \beta l_{k-1} + F'(w_k)^T.$$
Adam
The popular Adam algorithm chooses the latter variant. It employs for $M_k$ the diagonal preconditioner of RMSprop and an exponentially weighted aggregation of derivatives.
Similarly to RMSprop, Adam uses a diagonal preconditioner via a diagonal matrix $D_k$ that is defined recursively as follows:
$$D_k = \lambda D_{k-1} + (1 - \lambda)\operatorname{diag}(F'(w_k)_i^2), \quad \lambda \in [0, 1].$$
Then for some µ > 0 the preconditioner is given by:
$$M_k = \sqrt{D_k} + \mu I \quad\text{(where the root is taken componentwise)}.$$
Given w₁; l₀ = 0, d₀ = 0
for k = 1, 2, . . .
    l_k = βl_{k-1} + (1 − β)F′(w_k)    (often a mini-batch gradient is used here)
    d_k = λd_{k-1} + (1 − λ)(F′(w_k)_i^2)
    M_k = diag(√d_k) + µI
    Δw = −M_k^{-1} l_k
    w_{k+1} = w_k + α (√(1 − λ^k)/(1 − β^k)) Δw
The factors (1 − β^k) → 1 and (1 − λ^k) → 1 in the step size are used to compensate the so-called "initialization bias". This heuristic takes into account that the sums defining $l_k$ and $d_k$ are short in the beginning and grow longer and longer.
The corresponding heuristic computation is as follows:
$$l_k = (1 - \beta)\sum_{i=1}^k \beta^{k-i}F'(w_i) = (1 - \beta)\sum_{i=1}^k \beta^{k-i}F'(w_k) + \zeta = (1 - \beta^k)F'(w_k) + \zeta,$$
where
$$\zeta = (1 - \beta)\sum_{i=1}^k \beta^{k-i}\big(F'(w_i) - F'(w_k)\big).$$
Chapter 6

We now consider composite problems of the form
$$\min_w F(w) := f(w) + \rho(w),$$
where f is differentiable and ρ convex and "easy to handle". By this we mean the following:
For any f′(w) and L > 0 the local minimization problem (which has a unique minimizer)
$$\min_{\delta w} p(\delta w) := f(w) + f'(w)\delta w + \rho(w + \delta w) + \frac{L}{2}\|\delta w\|^2 \tag{6.1}$$
can be solved in an efficient way. Here the norm is a (maybe diagonally scaled) 2-norm.
$$p(t\delta w) - p(0) = p((1-t)\cdot 0 + t\delta w) - p(0) \le (1-t)p(0) + t\,p(\delta w) - p(0) = t(p(\delta w) - p(0)) = -tc.$$
Moreover:
$$p(t\delta w) - p(0) = F(w_* + t\delta w) - F(w_*) + o(t) + \frac{L}{2}\|t\delta w\|^2 \ge -o(t).$$
For sufficiently small t this yields a contradiction to δw being a minimizer.
Consider the converse. Since f is convex, $f'(w_*)\delta w \le f(w_* + \delta w) - f(w_*)$. Hence, with $w = w_* + \delta w$:
$$F(w) - F(w_*) = f(w) + \rho(w) - f(w_*) - \rho(w_*) \ge f'(w_*)\delta w + \rho(w_* + \delta w) - \rho(w_*),$$
$$F(w_* + t\delta w) = F((1-t)w_* + t(w_* + \delta w)) \le (1-t)F(w_*) + tF(w_* + \delta w) = F(w_*) - tm.$$
Example: non-negativity constraints. Let ρ be the indicator function of $\{w \ge 0\}$ (i.e., ρ(w) = 0 for w ≥ 0 and +∞ otherwise). In this case (6.1) can be solved by the following componentwise considerations. Let δw be the minimizer of p.
i) Assume that $(w + \delta w)_i > 0$. Then ρ vanishes, and optimality requires $0 = f'(w)_i + L\delta w_i$.
ii) Assume that $(w + \delta w)_i = 0$. Then optimality requires $0 \le f'(w)_i + L\delta w_i$.
The candidate
$$\delta w_i = \max\{-w_i, -f'(w)_i/L\}$$
(which can be computed very efficiently) satisfies these properties in both cases:
• In the case $\delta w_i = -w_i$, it follows that $L\delta w_i \ge -f'(w)_i$, and thus ii) is satisfied.
• Otherwise, $\delta w_i > -w_i$, so $w_i + \delta w_i > 0$ and $f'(w)_i + L\delta w_i = 0$. So i) is fulfilled.
Example: 1-norm. Consider
$$\rho(w) := \lambda\sum_i |w_i|, \quad \lambda > 0.$$
This regularization term is often used together with a quadratic regression functional:
$$\min_w \frac{1}{2}\|Aw - y\|_2^2 + \lambda\|w\|_1,$$
the so-called Lasso problem. Depending on the choice of λ, the solution w∗ of this problem exhibits many zeros.
To compute the proximal gradient step, let again δw be the minimizer of p. Then we have the cases:
i) $(w + \delta w)_i > 0$: then $\rho'(w + \delta w)_i = \lambda$ and $0 = f'(w)_i + \lambda + L\delta w_i$;
ii) $(w + \delta w)_i < 0$: then $\rho'(w + \delta w)_i = -\lambda$ and $0 = f'(w)_i - \lambda + L\delta w_i$;
iii) $(w + \delta w)_i = 0$.
Together, the components of the solution are given by the soft-thresholding formula
$$w_i + \delta w_i = \operatorname{sign}(u_i)\max\{|u_i| - \lambda/L, 0\}, \qquad u_i := w_i - f'(w)_i/L.$$
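The resulting componentwise operation is the well-known soft-thresholding. A Python/NumPy sketch combining it with one gradient step into a proximal gradient (ISTA) step for the Lasso problem (names illustrative):

import numpy as np

def soft_threshold(x, tau):
    # Componentwise minimizer of (1/2)(u - x)^2 + tau*|u|.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista_step(w, A, y, lam, L):
    # One proximal gradient step for min 0.5*||A w - y||^2 + lam*||w||_1.
    grad = A.T @ (A @ w - y)             # f'(w)^T
    return soft_threshold(w - grad / L, lam / L)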
Given: w₁, L
for k = 1, 2, . . .
    do
        increase L
        δw ← min p at w_k with parameter L
    while (6.2) is violated
    w_{k+1} = w_k + δw
Theorem 6.2.3. Assume that f is σ-convex, and assume that L is chosen in a way that (6.2) is fulfilled. Then:
$$\|w_{k+1} - w_*\|^2 \le \Big(1 - \frac{\sigma}{L}\Big)\|w_k - w_*\|^2,$$
$$F(w_{k+1}) - F(w_*) \le \frac{L}{2}\|w_{k+1} - w_*\|^2.$$
If an iterative algorithm keeps $L_k$ bounded, say by L, then we obtain linear convergence:
$$\|w_k - w_*\|^2 \le \Big(1 - \frac{\sigma}{L}\Big)^k\|w_0 - w_*\|^2,$$
$$F(w_k) - F(w_*) \le \frac{L}{2}\Big(1 - \frac{\sigma}{L}\Big)^k\|w_0 - w_*\|^2.$$
Proof. Inserting w∗ into our main estimate, we obtain:
$$0 \ge F(w_*) - F(w_{k+1}) \ge \frac{L}{2}\|w_* - w_{k+1}\|^2 - \frac{L - \sigma}{2}\|w_* - w_k\|^2. \tag{6.3}$$
Thus, we obtain immediately our first result:
$$\frac{L}{2}\|w_* - w_{k+1}\|^2 \le \frac{L - \sigma}{2}\|w_* - w_k\|^2.$$
Our second result follows by neglecting the negative term in (6.3).
We know that (6.2) holds if L is chosen as the Lipschitz constant of the gradient. In that case, we may again define
$$\kappa := \frac{L}{\sigma}$$
and retain our well-known convergence factor, but now w.r.t. errors:
$$\|w_{k+1} - w_*\|^2 \le \Big(1 - \frac{1}{\kappa}\Big)\|w_k - w_*\|^2.$$
6.3 Nesterov-Acceleration
Here we present an alternative to momentum methods, which works in the non-smooth case and admits
a nice convergence result.
Given: w₁, y₁ = w₁, L
for k = 1, 2, . . .
    do
        increase L
        δw ← min p at y_k with parameter L
    while (6.2) is violated
    w_{k+1} = y_k + δw
    y_{k+1} = w_{k+1} + β_{k+1}δw, if k > 1.
Proof. We start with our estimate and observe that by our definition (1 − q) = (1 + λ)(1 − λ):
$$E(w_+) + \|w_0^\lambda - w_+\|^2 \le E(w_0^\lambda) + (1 - \lambda)(1 + \lambda)\|w_0^\lambda - y_0\|^2.$$
By construction E is 2q-strictly convex with E(w∗) = 0, so by Lemma 6.2.1 we obtain:
$$E(w_0^\lambda) - q\|w_0^\lambda - w_*\|^2 \le (1 - \lambda)E(w_0) - q\lambda(1 - \lambda)\|w_* - w_0\|^2,$$
and thus the overall estimate:
$$E(w_+) + \|w_0^\lambda - w_+\|^2 \le (1 - \lambda)\big(E(w_0) + (1 + \lambda)\|w_0^\lambda - y_0\|^2 - \lambda^3\|w_0 - w_*\|^2\big). \tag{6.5}$$
Next, we split:
$$\|w_0^\lambda - y_0\|^2 = \|\lambda(w_* - w_0) + (w_0 - y_0)\|^2 = \lambda^2\|w_* - w_0\|^2 + 2\lambda\langle w_* - w_0, w_0 - y_0\rangle + \|w_0 - y_0\|^2,$$
to obtain, using $(1 + \lambda)\lambda^2 - \lambda^3 = \lambda^2$, $y_0 - w_0 = \beta(w_0 - w_-)$ and $(1 + \lambda)\beta = (1 - \lambda)$:
$$(1 + \lambda)\|w_0^\lambda - y_0\|^2 - \lambda^3\|w_0 - w_*\|^2 = \lambda^2\|w_* - w_0\|^2 + 2(1 + \lambda)\lambda\langle w_* - w_0, w_0 - y_0\rangle + (1 + \lambda)\|w_0 - y_0\|^2$$
$$\le \lambda^2\|w_* - w_0\|^2 + 2(1 + \lambda)\lambda\beta\langle w_* - w_0, w_- - w_0\rangle + (1 + \lambda)^2\beta^2\|w_- - w_0\|^2$$
$$= \lambda^2\|w_* - w_0\|^2 + 2(1 - \lambda)\lambda\langle w_* - w_0, w_- - w_0\rangle + (1 - \lambda)^2\|w_- - w_0\|^2$$
$$= \|\lambda(w_* - w_0) + (1 - \lambda)(w_- - w_0)\|^2 = \|w_-^\lambda - w_0\|^2.$$
Hence, as desired:
$$E(w_+) + \|w_0^\lambda - w_+\|^2 \le (1 - \lambda)\big(E(w_0) + \|w_-^\lambda - w_0\|^2\big).$$
Theorem 6.3.2. Choosing $\beta = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ we obtain the speed of convergence:
$$F(w_{k+1}) - F(w_*) \le \Big(1 - \frac{1}{\sqrt{\kappa}}\Big)^k\Big(F(w_0) - F(w_*) + \frac{\sigma}{2}\|w_1 - w_*\|^2\Big).$$
Proof. To obtain our desired estimate, we have to consider k = 1, where y₀ = w₀ = w₁. This yields
$$w_0^\lambda - y_0 = \lambda w_* + (1 - \lambda)w_0 - w_0 = \lambda(w_* - w_0).$$
With (6.5) and λ² = σ/L we get:
$$E(w_2) + \|w_1^\lambda - w_2\|^2 \le (1 - \lambda)\big(E(w_1) + (1 + \lambda)\lambda^2\|w_* - w_1\|^2 - \lambda^3\|w_* - w_1\|^2\big) = (1 - \lambda)\Big(E(w_1) + \frac{\sigma}{L}\|w_* - w_1\|^2\Big).$$
With Lemma 6.3.1, induction and rescaling we obtain the desired result.
Unfortunately, it is the same as with momentum methods here: σ, and thus κ, is not known a-priori, so it is hard to find the right value of β. If σ is underestimated (β too large), we obtain a suboptimal rate of convergence; if σ is overestimated (β too small), our result no longer applies, which means that the method might not converge.
Restarts
A practical way to dispense with an estimate of σ is to use restarts:
• Choose β = 1 during the iteration.
• If a certain condition is fulfilled, set β_{k+1} = 0 once. This has the same effect as restarting the iteration.
• Possible criteria:
    f(w_{k+1}) > f(w_k)    (restart at ascent, 0th order)
    ⟨δw, y_k − w_k⟩ < 0    (restart at ascent, 1st order)