Optimization for Machine Learning

Anton Schiela

May 17, 2024


Contents

1 Optimization problems in ML
  1.1 A general fitting problem
  1.2 Some loss-functions
  1.3 Some regularizers
  1.4 Some models
  1.5 Statistical interpretation

2 Stochastic gradient methods
  2.1 A short review of gradient descent
  2.2 Stochastic gradients
  2.3 Convergence results
  2.4 Comparison of complexity: stochastic vs. classical gradient method
  2.5 Computation of derivatives and gradients

3 Techniques for variance reduction
  3.1 Dynamic sample size
  3.2 Gradient aggregation

4 Techniques for preconditioning
  4.1 The effect of preconditioning
  4.2 Diagonal scaling methods
  4.3 CG as a matrix-free solver
  4.4 Inexact Newton methods for Neural Networks
  4.5 Generalized Gauss-Newton methods
  4.6 K-FAC Preconditioner

5 Momentum
  5.1 The dynamics of the heavy-ball method
  5.2 First and second order fixed point iterations
  5.3 Interpretation as a gradient aggregation method

6 Non-smooth regularization terms
  6.1 Examples for non-smooth terms
  6.2 Convergence of proximal gradients
  6.3 Nesterov acceleration

Preface
These lecture notes were written for a lecture (2+1) held in WS 19/20 at the University of Bayreuth. They cover a special aspect of machine learning, namely the optimization methods that can be used for the problems arising in this field. The main source is the following review article:

Bottou, L., Curtis, F., Nocedal, J.: "Optimization methods for large-scale machine learning", SIAM Review, 60, 2018

Concerning proximal gradient methods:

Beck, A.: "First-order methods in optimization", MOS-SIAM Series on Optimization, 2017

For neural networks in general:

Goodfellow, I., Bengio, Y., Courville, A.: "Deep learning", 2016

In addition I would like to thank Prof. Stefan Ulbrich and Prof. Marc Pfetsch from TU Darmstadt, and Prof. Roland Herzog from TU Chemnitz for sharing their own lecture notes with me. This was a great starting point for me.
Chapter 1

Optimization problems in ML

The principal ideas behind machine learning are not new. From a mathematical point of view, one could
also use the word data fitting.
In general the following quantities are involved:
• given input data x_i ∈ X, i = 1 . . . N,
• given output data y_i ∈ Y, i = 1 . . . N,
• unknown model parameters w ∈ W.
Quite often X is a high-dimensional space (e.g. of images, sound samples, . . . ), and also W is high-dimensional. Also N, the number of examples to learn from, can be very large.
We would like to find w and an input-to-output mapping h(·, w) : X → Y that can predict for given
input x ∈ X a matching output y ∈ Y .
Example 1.0.1 (Polynomial least squares fit). The simplest (and well known) example, with X = Y = ℝ and W = ℝ^{d+1}, is a polynomial least squares fit (linear regression):
    min_w ∑_{i=1}^N ½ |h(x_i; w) − y_i|²,   where h(x; w) := ∑_{j=0}^d w_j x^j.

If we define the matrix A ∈ ℝ^{N×(d+1)} via a_{ij} = x_i^j, we can write this as a classical least-squares problem:

    min_w ∑_{i=1}^N ½ |(Aw)_i − y_i|² = ½ ∥Aw − y∥₂².

It can be solved via QR-decomposition or the normal equation approach. The latter yields the following linear system of equations:

    AᵀAw = Aᵀy.

• If A is injective (over-determined case), then our problem has a unique solution, but Aw ≠ y in general.
• If A is surjective (under-determined case), then Aw = y, but w may be non-unique.
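As a small illustration, a minimal numpy sketch of this polynomial fit via the normal equations (the degree and the data are made up for the example; np.linalg.lstsq would be the QR/SVD-based alternative):

import numpy as np

# made-up sample data (x_i, y_i), i = 1..N
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x**3 - x + 0.1 * rng.normal(size=50)

d = 3                                     # polynomial degree
A = np.vander(x, d + 1, increasing=True)  # a_ij = x_i^j, A in R^{N x (d+1)}

# normal equations: A^T A w = A^T y
w = np.linalg.solve(A.T @ A, A.T @ y)
# np.linalg.lstsq(A, y, rcond=None) solves the same problem via QR/SVD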

1.1 A general fitting problem


More generally, we may consider the minimization problem:
    min_w (1/N) ∑_{i=1}^N l(h(x_i, w), y_i) + ρ(w).   (1.1)

• h : X × W → Y is a parametrized model
• l : Y × Y → R is called a loss-function
• ρ : W → R is called a regularization term
The choices of h, l, and ρ are the main modelling ingredients of machine learning. The approximate
solution of this problem by an optimization algorithm is called learning. In this lecture we will consider
popular optimization algorithms for the solution of (1.1).
The following challenges arise:
• Often N is very large, making the evaluation of the full sum inefficient
• Often X and W have high dimension and/or h is complicated
• Often ρ is non-smooth (see below)
In short notation we will also write this problem as follows:

    min_w (1/N) ∑_{i=1}^N f_i(w) + ρ(w),   f_i(w) = l(h(x_i, w), y_i).   (1.2)

1.2 Some loss-functions


In the following we discuss some loss-functions that are used regularly for various purposes. In particular
we consider:
• Regression (predicting a real or vector valued result)
• Classification (in two or more classes)

Scalar loss functions


If Y = R the following loss functions are popular:
• linear regression: l(h, y) = ½(h − y)². This yields a least squares fit.
• hinge loss: l(h, y) = max(0, 1 − hy). This function is often used for classification. For given input x_i the desired output is chosen as y_i ∈ {−1, 1}. A new input x is classified according to the sign of h(x; w).
If h and y have different signs, then l ≥ 1, otherwise l < 1. If hy > 1, i.e., the classification is clearly correct, then l(h, y) = 0.
• Often a smoothed version l(h, y) = log(1 + exp(−hy)) with similar properties is used instead.

Vectorial loss-functions
If Y = Rn is a vector space, some generalizations may be as follows:
• l(h, y) = ½ ∥h − y∥₂²

For classification among n > 2 classes it is popular to use the following procedure:

• Assume that x is a member of class j ∈ {1, . . . , n}. Then y = e_j, the j-th unit vector.
• The so-called soft-argmax operator sam : Y → ]0, 1[ⁿ ⊂ Y is defined as:

    sam(y) = exp(y)/∥exp(y)∥₁ = exp(y) / ∑_{i=1}^n exp(y_i)   (exp is applied componentwise to a vector)

This is a smoothed version of the arg max operator:

    arg max(y) = e_j   if y_j > y_i for all i ≠ j.

• With this, for y = e_j one could define the loss functional as follows:

    l(h, e_j) = ∥sam(h) − e_j∥₂²,

which, however, has the disadvantage of very flat regions if h and e_j are far apart. Instead one uses:

    l(h, e_j) = −e_jᵀ ln(sam(h)) = ln(∑_{i=1}^n exp h_i) − h_j.

Asymptotically this functional becomes flat if h_j dominates h and increases linearly if h_j is negative.
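A minimal numpy sketch of this loss in the numerically stable form ln(∑_i exp h_i) − h_j (the log-sum-exp is shifted by max h to avoid overflow; the function names are ours):

import numpy as np

def softargmax(h):
    # sam(h) = exp(h) / sum_i exp(h_i), shifted by max(h) for numerical stability
    e = np.exp(h - np.max(h))
    return e / e.sum()

def cross_entropy(h, j):
    # l(h, e_j) = ln(sum_i exp(h_i)) - h_j, via a stable log-sum-exp
    m = np.max(h)
    return m + np.log(np.sum(np.exp(h - m))) - h[j]

h = np.array([2.0, -1.0, 0.5])
print(softargmax(h), cross_entropy(h, j=0))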

Week 2

1.3 Some regularizers


Some very simple but popular regularizers are as follows:

    ρ(w) = (λ/2) ∥w∥₂² = (λ/2) ∑_i w_i²

and

    ρ(w) = λ ∥w∥₁ = λ ∑_i |w_i|.

The latter often produces sparse parameters, meaning that w_i ≠ 0 only for a few i. This may be helpful if each w_i has a specific interpretation as a weight for a feature.
The regularizers enforce uniqueness of solutions and encourage certain desired properties of the parameters. The hyperparameter λ can be used to weight between regularity and loss-minimization.

1.4 Some models


Linear models
An important distinction is whether a model is linear or nonlinear in w:

    linear:  h(x; w) = A(x)w,

with its simplest scalar example:

    h(x; w) = (x, 1)ᵀ w = ∑_{i=1}^d x_i w_i + w_0,

or, with a polynomial for x ∈ ℝ:

    h(x; w) = p(x)w = ∑_{i=0}^d x^i w_i.

This may, of course, be extended to vector valued x (yielding a more complicated notation):

    h(x; w) = p(x)w = ∑_{i=0}^d x_{j_i}^{k_i} w_i.

Linear models, combined with convex loss functions and convex regularizers, yield convex optimization problems.

Example 1.4.1 (Support Vector Machine). A well known model that is comparatively simple is the
support vector machine, which is used for binary classification:
    min_w (1/N) ∑_{i=1}^N max(0, 1 − h_i y_i) + ρ(w)   s.t.  h_i = (x_i, 1)ᵀ w.

If ρ = 0 this problem can be rewritten as a linear program and solved as such:

    min_{w,t} (1/N) ∑_{i=1}^N t_i   s.t.  h_i = (x_i, 1)ᵀ w,
                                          t_i ≥ 1 − h_i y_i,
                                          t_i ≥ 0.

Often ρ(w) = λwᵀw is used, yielding a quadratic program.
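Alternatively, such a model can be trained directly by a (sub)gradient method on the non-smooth hinge loss. A minimal sketch with ρ(w) = λwᵀw (the data and step size are made up; the hinge loss is non-differentiable at hy = 1, so a subgradient is used):

import numpy as np

def svm_step(w, X, y, lam, alpha):
    # X: N x (d+1) inputs with appended 1-column, y in {-1,+1}^N
    h = X @ w
    active = (1.0 - h * y) > 0                               # positive hinge loss
    g = -(X[active].T @ y[active]) / len(y) + 2.0 * lam * w  # subgradient
    return w - alpha * g

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = np.sign(X[:, 0] + X[:, 1])
w = np.zeros(3)
for k in range(200):
    w = svm_step(w, X, y, lam=1e-2, alpha=0.1)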

Nested nonlinear models


Consider a nested nonlinear model, defined recursively as:

    h_k(x; w_k) := h̃_k(h_{k−1}(x; w_{k−1}); w̃_k),

where k = 1 . . . L, w_k = (w_{k−1}, w̃_k) and

    h̃_k(·; w̃) : Z_{k−1} → Z_k,   Z_0 = X,   w_0 = ∅,   h̃_0 = id_X.

We have the diagram:

    X —h̃_1(·;w̃_1)→ Z_1 —h̃_2(·;w̃_2)→ Z_2 → · · · → Z_{L−1} —h̃_L(·;w̃_L)→ Z_L = Y

Then we set:

    w = w_L,   h(x; w) = h_L(x; w_L).

• This is motivated by the idea that the input X is transformed to the output Y by a chain of transformations h̃_k.
• For derivative based optimization it is necessary to compute the derivative ∂_w h(x; w). This yields first order information on how the model changes under perturbations of the parameters. Using the chain rule we obtain the following recursive formula:

    ∂_{w_k} h_k(x; w_k) = ∂_{w̃_k} h̃_k(h_{k−1}(x; w_{k−1}); w̃_k) + ∂_x h̃_k(h_{k−1}(x; w_{k−1}); w̃_k) ∂_{w_{k−1}} h_{k−1}(x; w_{k−1}),

or in short

    ∂_w h_k = ∂_{w̃} h̃_k + ∂_x h̃_k ∂_w h_{k−1}.

Expanding this expression we obtain:

    ∂_w h = ∑_{i=1}^L ∂_x h̃_L · · · ∂_x h̃_{i+1} ∂_{w̃} h̃_i.

With this we can compute

    f_i′(w) = ∂_w f_i(w) = ∂_h l(h(x_i; w), y_i) ∂_w h(x_i; w).

Example 1.4.2 (A feed forward neural network). Neural networks are structured as described above. Each Z_k is called a layer and each h̃_k(z; w̃) is constructed as follows:

    h̃_k(z; w̃) = α(A(w̃)z + b(w̃))

with the affine transformation A(w̃)· + b(w̃) : Z_{k−1} → Z_k and the activation function α : Z_k → Z_k, which is usually defined in a componentwise way.

• Currently the most popular activation function is the ReLU (rectified linear unit):

    α(z) = max(0, z)   (applied componentwise)

This function is not differentiable at 0, which is regarded more as a theoretical than a practical problem. Of course smoothed versions could be used (but aren't in practice).
Earlier, sigmoid-type functions were used frequently (like arctan, arctanh, e^z/(1 + e^z)).

• Generically we may directly set

    (A, b) = (W̃, w̃_0),

leading, of course, to a large number of parameters, namely (dim Z_{l−1} + 1) · dim Z_l.
For specialized tasks we can reduce the number of parameters drastically. For example we may restrict A to be a tridiagonal or a sparse matrix.
Convolutional Neural Networks, popular in image processing, define the linear mapping via a convolution. A 1-d example could be:

    [A(w̃)z]_i = w̃ᵀ z_{i:i+c},   where w̃ ∈ ℝᶜ.
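A minimal numpy sketch of such a feed forward network (two made-up fully connected layers with ReLU activation; the last layer is left affine, as is common):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    # params: list of (W, b) per layer; each hidden layer computes alpha(W z + b)
    z = x
    for W, b in params[:-1]:
        z = relu(W @ z + b)
    W, b = params[-1]
    return W @ z + b          # output layer left affine

rng = np.random.default_rng(0)
params = [(rng.normal(size=(16, 4)), np.zeros(16)),
          (rng.normal(size=(3, 16)), np.zeros(3))]
print(forward(rng.normal(size=4), params))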

1.5 Statistical interpretation


One important insight in machine learning is that the pairs (x_i, y_i) ∈ X × Y are merely samples of an unknown relationship between X and Y. To exploit this, we may use techniques from statistics and probability theory. In this course we only say very little about this topic and refer to lectures from the field of statistics.
From the point of view of statistics, we assume that the pair ξ = (x, y) is a random variable, drawn from a joint distribution P(x, y). The true objective that we want to minimize is as follows:

    Expected risk:  R(w) = E[l(h(x; w), y)] = ∫_{X×Y} l(h(x; w), y) dP(x, y)
                         = E[f(w, ξ)] = ∫_{X×Y} f(w, ξ) dP(ξ) =: f̄(w).

The objective functional considered above can be interpreted as an approximation of the expected risk:

    Empirical risk:  R_N(w) = (1/N) ∑_{i=1}^N l(h(x_i; w), y_i) = (1/N) ∑_{i=1}^N f_i(w),

where (x_i, y_i) are samples drawn according to P(x, y). We then know

    E[R_N(w)] = E[f(w, ξ)] = f̄(w).

The expected square error can be computed via the variance of f(w, ξ):

    V[f(w)] = E[|f(w, ξ) − f̄(w)|²] = ∫_{X×Y} |f(w, ξ) − f̄(w)|² dP(ξ) = ∥f(w, ·) − f̄(w)∥²_{L²(X×Y)}.

Lemma 1.5.1. It holds:

    V[R_N(w)] = (1/N) V[f(w)].

Thus, for the expected square error we obtain:

    E[(R_N(w) − f̄(w))²] = (1/N) V[f(w)].

Proof. Introduce for fixed w the random variable η = f(w, ξ) − f̄(w), so that E(η) = 0 and V(η) = V(f) = E(η²), and η_N = N⁻¹ ∑_i η_i, so that V(η_N) = V(R_N(w)). Since η_i and η_j are independent, we obtain Cov(η_i, η_j) = E[η_i η_j] = 0 for i ≠ j.
Then we compute:

    V(η_N) = E[N⁻² (∑_{i=1}^N η_i)²] = E[N⁻² ∑_{i=1}^N η_i²] + E[N⁻² ∑_{i≠j} η_i η_j]
           = N⁻² ∑_{i=1}^N E[η_i²] + N⁻² ∑_{i≠j} E[η_i η_j]
           = N⁻² ∑_{i=1}^N V(η_i) + 0 = N⁻¹ V(η).

By the Chebyshev inequality

    P(|X − X̄| > ε) ≤ V(X)/ε²

we obtain:

    P(|R_N(w) − R(w)| > ε) ≤ V(f(w))/(N ε²).

Hence, for a given confidence level we obtain:

    ε ∼ N^{−1/2}.
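The 1/N decay of V[R_N] is easy to observe numerically. A small sketch (the distribution of f(w, ξ) is made up; we average N i.i.d. draws and estimate the variance of the average over many repetitions):

import numpy as np

rng = np.random.default_rng(0)
reps = 2000
for N in [10, 100, 1000]:
    # R_N = average of N i.i.d. samples from a made-up loss distribution (V[f] = 1)
    RN = rng.exponential(scale=1.0, size=(reps, N)).mean(axis=1)
    print(N, RN.var())        # observed variance decays like V[f]/N = 1/N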

There is a much deeper stochastic theory of machine learning available, for which we recommend attending a lecture on statistical machine learning.
Chapter 2

Stochastic gradient methods

Week 3

The ultimate goal of learning algorithms, once a model is given, is to minimize R(w) or, in the presence of a regularization term ρ(w), the quantity

    F(w) := R(w) + ρ(w).   (2.1)
However, this is not so easily possible:

Neither R(w) nor R′(w) are computationally available. The only thing that we can do is evaluate

    l(h(x_i; w), y_i)   and   ∂_w l(h(x_i; w), y_i)

for a large number of samples (x_i, y_i).

A straightforward idea is to minimize

    F_N(w) := R_N(w) + ρ(w)   (2.2)

instead, for a given fixed dataset (x_i, y_i)_{i=1...N}. This minimization should only be done approximately. If w^∗ and w_N^∗ are minimizers of (2.1) and (2.2), respectively, and w_N is an approximate solution of (2.2), then the errors should be balanced:

    F(w_N) − F(w^∗) = [F(w_N) − F_N(w_N)] + [F_N(w_N) − F_N(w_N^∗)] + [F_N(w_N^∗) − F(w^∗)]

We observe that only the second difference can be reduced by optimization for fixed N, while the first and the third difference have to be taken care of by increasing N. Further, we observe that

    F_N(w_N^∗) < F_N(w_N)  ⇏  F(w_N^∗) < F(w_N).

By using the minimizing property F_N(w_N^∗) ≤ F_N(w^∗) the following estimate can be derived for a given confidence level:

    F(w_N) − F(w^∗) ≤ [F_N(w_N) − F_N(w_N^∗)] + (F(w_N) − F_N(w_N)) + (F_N(w^∗) − F(w^∗))
                    ∼ ∆F_{N,opt} + (1/√N)(Vf(w_N) + Vf(w^∗)).

A further difficulty arises:
• If N is large, the evaluation of F_N and F_N′ may be very expensive. Efficient algorithms should probably use cheaper surrogates.
• If W is chosen inadequately (and/or the regularizer is too weak), minimizing F_N very accurately may yield larger errors in F than early termination, since Vf(w_N) may be large.


2.1 A short review of gradient descent


Consider the problem

    min_w F(w)

on a normed vector space W with norm ∥·∥ = √⟨·, ·⟩. Its dual space W^∗ also has a norm, denoted by ∥·∥.
Assume that we can compute a search direction δw that satisfies:

    F(w + δw) ≤ F(w) − ν∥F′(w)∥²,   ν > 0.   (2.3)

Proposition 2.1.1. A sequence defined by w_{k+1} = w_k + δw_k as above yields either F(w_k) → −∞ or ∥F′(w_k)∥ → 0.
Proof. Clearly, F(w_k) is monotonically decreasing, so either F(w_k) → −∞ or F(w_k) → F_∗ ∈ ℝ. In the latter case this implies

    ν∥F′(w_k)∥² ≤ F(w_k) − F(w_{k+1}) → 0.

If we define ∆w as the solution of the following minimization problem:

    min_v F′(w)v + ½⟨v, v⟩,

then ∆w is the negative gradient of F:

    0 = F′(w)v + ⟨∆w, v⟩ ∀v  ⇒  ∆w = −∇F(w).

Assume now that F obeys the following smoothness assumption:

    F(w + δw) ≤ F(w) + F′(w)δw + (L/2)∥δw∥².

This holds, for example, if F′ is Lipschitz continuous with constant L.
This holds, for example, if F ′ is Lipschitz continuous with constant L.
Lemma 2.1.2. For δw = −α∇F(w), (2.3) is satisfied with

    ν(α) = α − Lα²/2.

This expression is maximized for α = 1/L with ν(α) = 1/(2L). It is positive if α < 2/L.
Proof.

    F(w + δw) − F(w) ≤ F′(w)δw + (L/2)∥δw∥²
                     = −αF′(w)∇F(w) + (Lα²/2)∥∇F(w)∥²
                     = −(α − Lα²/2)∥F′(w)∥² = −ν(α)∥F′(w)∥².

If we assume the following Polyak-Lojasiewicz inequality:

    ∥F′(w)∥² ≥ 2µ(F(w) − F_∗),   µ > 0,

where F_∗ = inf F, then we obtain:

    2ν(α)µ(F(w) − F_∗) ≤ ν(α)∥F′(w)∥² ≤ F(w) − F(w + δw).
With that we compute:

    F(w) − F_∗ = (F(w) − F(w + δw)) + (F(w + δw) − F_∗)
              ≥ 2µν(α)(F(w) − F_∗) + (F(w + δw) − F_∗).

Hence, we obtain the result:

    F(w + δw) − F_∗ ≤ (1 − 2µν(α))(F(w) − F_∗),

and for optimal α:

    F(w + δw) − F_∗ ≤ (1 − µ/L)(F(w) − F_∗).

We may define a condition number κ := L/µ to find:

    F(w + δw) − F_∗ ≤ (1 − 1/κ)(F(w) − F_∗),

and thus linear convergence:

    F(w_k) − F_∗ ≤ (1 − 1/κ)^k (F(w_0) − F_∗).

Be aware that our choice of norm can affect the size of κ. A clever choice will yield small κ. This is called preconditioning (note that mere rescaling does not change κ).
If F is quadratic with eigenvalues 0 < λ_1 ≤ · · · ≤ λ_n, then

    κ = λ_n/λ_1

and our result can be refined:

    F(w_{k+1}) − F_∗ ≤ ((κ − 1)/(κ + 1))² (F(w_k) − F_∗).

This holds for an optimal choice of α:

    α ← argmin_α F(w + α∆w),

which can be computed exactly in the quadratic case.
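A small numpy sketch of this gradient descent with exact line search on a made-up quadratic, illustrating how convergence degrades with κ:

import numpy as np

def gradient_descent(A, l, w, steps):
    # minimize F(w) = 1/2 w^T A w + l^T w with exact line search
    for _ in range(steps):
        g = A @ w + l                     # gradient
        alpha = (g @ g) / (g @ (A @ g))   # exact minimizer of F(w - alpha*g)
        w = w - alpha * g
    return w

A = np.diag([1.0, 100.0])                 # kappa = 100
l = np.array([-1.0, -1.0])
w = gradient_descent(A, l, np.zeros(2), steps=200)
print(w, np.linalg.solve(A, -l))          # both close to the minimizer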

2.2 Stochastic gradients


In very large scale problems we can neither evaluate F nor F′ exactly. In the following we will interpret an approximate computation of ∇F as a stochastic gradient step.
Let us now introduce a stochastic search direction ∆w with expectation

    ∆w̄ := E[∆w]

and variance:

    V[∆w] = E[∥∆w − ∆w̄∥²] = E[∥∆w∥²] − ∥∆w̄∥²,

and set δw = α∆w. Then we obtain:

    E[F(w + δw) − F(w)] ≤ E[F′(w)α∆w + (L/2)∥α∆w∥²]
                        = αF′(w)∆w̄ + (α²L/2) E[∥∆w∥²]
                        = αF′(w)∆w̄ + (α²L/2)(∥∆w̄∥² + V[∆w]).

Here we used monotonicity and linearity of the expectation and the definition of the variance.

Stochastic gradients
Stochastic gradient methods now take the following choice:

    ∆w̄ = E[∆w] := −∇F(w).

The search direction is an unbiased estimate of the gradient.

This implies

    −F′(w)∆w̄ = ∥F′(w)∥² = ∥∆w̄∥²

and thus:

    E[F(w + δw) − F(w)] ≤ −α∥F′(w)∥² + (α²L/2)(∥F′(w)∥² + V[∆w])
                        = −(α − (α²L/2)(1 + V[∆w]/∥F′(w)∥²)) ∥F′(w)∥²
                        = −ν(α)∥F′(w)∥²,

with

    ν(α) = α − (α²L/2)(1 + V[∆w]/∥F′(w)∥²).

What happens for fixed α?

Consider the following model for the variance:

    V[∆w] ≤ V_0 + V_1∥F′(w)∥².

In that case, our estimate can be written as follows:

    E[F(w + δw) − F(w)] ≤ −(α − (α²L/2)(1 + V_0/∥F′(w)∥² + V_1)) ∥F′(w)∥²
                        = −ν_1(α)∥F′(w)∥² + α²LV_0/2

with

    ν_1(α) = α − (α²L/2)(1 + V_1).

Obviously, ν_1(α) > 0 for sufficiently small α. However, if ∥F′(w)∥ → 0, the second term will dominate and yield positivity of the right hand side. Hence, our estimate no longer guarantees an expected decrease once ∥F′(w)∥ is too small.
Week 4

Theoretically optimal α
The theoretically optimal choice of α would result from maximizing ν(α) with respect to α:

    0 = ∂_α ν(α) = 1 − αL(1 + V[∆w]/∥F′(w)∥²),

and thus obtaining the theoretically optimal choice:

    α = 1/(L(1 + V[∆w]/∥F′(w)∥²)).

This results in an expected decrease as follows:

    E[F(w + δw) − F(w)] ≤ −ν(α)∥F′(w)∥²

with

    ν(α) = 1/(2L(1 + V[∆w]/∥F′(w)∥²)) = ∥F′(w)∥²/(2L(∥F′(w)∥² + V[∆w])).
The expected optimal decrease factor ν(α) thus depends on two quantities:
• As usual, the nonlinearity of the problem, encoded in the Lipschitz constant L. The choice of a problem suited norm (preconditioning) could reduce L. Alternatively, acceleration techniques may be used.
• The variance of the step, relative to the derivative. This term is only relevant if

    V[∆w] ≫ ∥F′(w)∥².

We may not assume that V[∆w] ∼ ∥F′(w)∥² if ∥F′(w)∥ → 0. A realistic assumption may be:

    V[∆w] ≤ V_0 + V_1∥F′(w)∥².

Then asymptotically, if ∥F′(w)∥ → 0, we obtain

    ν(α) ≥ ∥F′(w)∥²/(2L(V_0 + (1 + V_1)∥F′(w)∥²)) ≈ ∥F′(w)∥²/(2LV_0).

Hence, techniques to reduce the variance may be useful in later stages of the algorithm. For example, if F(w) = R(w) + ρ(w), we may define

    ∆w_n := −∇F_n(w) = −(∇R_n(w) + ∇ρ(w)),

which means solving the problem:

    min_v F_n′(w)v + ½⟨v, v⟩.

Usually n ≪ N is a moderately sized number. Just as in the last section we obtain

    V[∆w_n] ∼ (1/n) V[∆w_1],

but of course, the computational effort to compute F_n′(w) is ∼ n (a small implementation of this mini-batch scheme is sketched at the end of this section).

Suboptimal choice of α
In practice, the required quantities for finding the optimal α are not available. A suboptimal choice of α will change the properties of the algorithm.
• Too large α may result in an expected increase:

    α ≥ 2/(L(1 + V[∆w]/∥F′(w)∥²)),

i.e. twice the optimal choice, implies ν(α) ≤ 0 and thus the expected value of the functional may not decrease.
• Too small α yields only slow decrease. We observe that ν(α) ∼ α asymptotically.
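As announced above, a minimal numpy sketch of the resulting mini-batch stochastic gradient loop (the regression problem, batch size, and step size are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
# made-up linear regression data: f_i(w) = 1/2 (x_i^T w - y_i)^2
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def grad_fi(i, w):
    return (X[i] @ w - y[i]) * X[i]

w, n, alpha = np.zeros(d), 10, 0.05
for k in range(2000):
    batch = rng.integers(0, N, size=n)
    dw = -sum(grad_fi(i, w) for i in batch) / n   # unbiased estimate of -grad R_N
    w = w + alpha * dw                            # averaging reduces V[dw] by 1/n

Averaging over n samples reduces the variance by a factor 1/n at n times the cost per step, which is exactly the trade-off discussed above.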

2.3 Convergence results


Reduction of the functional
We may assume the following Polyak-Lojasiewicz inequality:

    ∥F′(w)∥² ≥ 2µ(F(w) − F_∗),   µ > 0,

where F_∗ = inf F. Then we obtain:

    ν_1(α) 2µ(F(w) − F_∗) ≤ ν_1(α)∥F′(w)∥² ≤ F(w) − E[F(w + δw)] + α²LV_0/2.

With that we compute:

    F(w) − F_∗ = (F(w) − E[F(w + δw)]) + (E[F(w + δw)] − F_∗)
              ≥ 2µν_1(α)(F(w) − F_∗) − α²LV_0/2 + (E[F(w + δw)] − F_∗).

Hence, we obtain the result:

    E[F(w + δw)] − F_∗ ≤ (1 − 2µν_1(α))(F(w) − F_∗) + α²LV_0/2

with

    ν_1(α) = α − (α²L/2)(1 + V_1) ≥ α/2   for α ≤ 1/(L(1 + V_1)).

In that case we have

    E[F(w + δw)] − F_∗ ≤ (1 − µα)(F(w) − F_∗) + α²LV_0/2.

Writing w_k = w and w_{k+1} = w_k + δw and taking into account that F(w_k) is itself the result of a stochastic gradient method, we obtain:

    E_{0:k}(F(w_{k+1}) − F_∗) = E_{0:k−1}(E_k(F(w_{k+1}) − F_∗)) ≤ (1 − µα_k) E_{0:k−1}(F(w_k) − F_∗) + α_k²LV_0/2.

Constant stepsizes
Theorem 2.3.1. For constant step-sizes satisfying 0 < α ≤ 1/(L(1 + V_1)) we obtain

    E[F(w_k) − F_∗] → αLV_0/(2µ).

Proof. Subtracting αLV_0/(2µ) from both sides, one obtains:

    E[F(w_{k+1}) − F_∗] − αLV_0/(2µ) ≤ (1 − αµ) E[F(w_k) − F_∗] + α²LV_0/2 − αLV_0/(2µ)
                                     = (1 − αµ) (E[F(w_k) − F_∗] − αLV_0/(2µ)).

This is a contraction.
Harmonic stepsizes
Theorem 2.3.2. Suppose that the SG method is run with a step size sequence such that for all k ∈ ℕ:

    α_k = β/(γ + k)   for some β > 1/µ and γ > 0 such that α_1 ≤ 1/(L(1 + V_1)).

Then for all k ∈ ℕ, the expected optimality gap satisfies:

    E[F(w_k) − F_∗] ≤ ν/(γ + k),   (2.4)

where

    ν := max{ β²LV_0/(2(βµ − 1)),  (γ + 1)(F(w_1) − F_∗) }.

Proof. We prove (2.4) by induction. By the second term in the definition of ν the result holds for k = 1. Let us set k̂ := γ + k. Then we compute, using α_k = β/k̂:

    E[F(w_{k+1}) − F_∗] ≤ (1 − α_k µ) E[F(w_k) − F_∗] + α_k²LV_0/2
                        ≤ (1 − βµ/k̂) ν/k̂ + β²LV_0/(2k̂²)
                        = ((k̂ − βµ)/k̂²) ν + β²LV_0/(2k̂²)
                        = ((k̂ − 1)/k̂²) ν − ( ((βµ − 1)/k̂²) ν − β²LV_0/(2k̂²) ) ≤ ν/(k̂ + 1),

because the term in parentheses is ≥ 0 by the first part of the definition of ν, and k̂² ≥ (k̂ + 1)(k̂ − 1).

2.4 Comparison of complexity: stochastic vs. classical gradient method

Recall the following splitting:

    F(w_N) − F(w^∗) ≤ [F_N(w_N) − F_N(w_N^∗)] + (F(w_N) − F_N(w_N)) + (F_N(w^∗) − F(w^∗))
                    ∼ ε + C/√N.

The first term ε can be reduced by optimization, while the second term can be reduced by increasing the problem size.
From our convergence theory we know that after k steps:

    Classical gradient:   ε ∼ Θ^k   with computing time T_k ∼ Nk
    Stochastic gradient:  ε ∼ 1/k   with computing time T_k ∼ k

To balance the errors, we set ε = 1/√N and obtain:

    k_G ∼ log ε / log Θ = log N / (2|log Θ|)  ⇒  T_G ∼ N log N
    k_SG ∼ 1/ε ∼ √N                           ⇒  T_SG ∼ √N.
A comparison yields a clear advantage for the stochastic gradient method in terms of work complexity.
The estimate 1/√N for the statistical error is a worst case estimate, which can be improved in some special situations. However, the errors will not decay faster than 1/N under any set of reasonable assumptions. In this case a balancing of errors ε ∼ 1/N would yield more equal results, which still favour stochastic gradients:

    k_G ∼ log N / |log Θ|  ⇒  T_G ∼ N log N
    k_SG ∼ N               ⇒  T_SG ∼ N.

By and large it can be said that in this setting one step of the classical gradient method has a similar cost as the whole stochastic gradient algorithm, run until appropriate accuracy.
Week 5

2.5 Computation of derivatives and gradients


Let us now discuss practical methods to compute the derivative of

    F(w) := R(w) + ρ(w),   (2.5)

or at least of a summand

    l(h(x_i, w), y_i) = l_i(h_i(w)) = (l_i ∘ h_i)(w)

of R. We obtain the simple relations:

    F′(w) = R′(w) + ρ′(w)
    F″(w) = R″(w) + ρ″(w),

where ρ′ and ρ″ can usually be computed efficiently. Concerning a summand, we obtain by the chain rule:

    (l_i ∘ h_i)′(w)δw = l_i′(h_i(w)) h_i′(w)δw
    (l_i ∘ h_i)″(w)(δw)² = l_i″(h_i(w))(h_i′(w)δw)² + l_i′(h_i(w)) h_i″(w)(δw)².

In these formulas, l_i′ and l_i″ can be computed cheaply. Observe that

    h′(w) : W → Y

maps from the parameter space to the output space.

However, h_i′ is rather expensive to compute, as we will see below, and the computation of h_i″ is similarly complicated.
A gradient step is computed by solving the following problem:

    min_{δw} F′(w)δw + ½⟨δw, δw⟩,

which yields for the euclidean product:

    ∇F(w) = F′(w)ᵀ = R′(w)ᵀ + ρ′(w)ᵀ = ∑_i h_i′(w)ᵀ l_i′(h_i(w))ᵀ + ρ′(w)ᵀ
          = ∑_i h_i′(w)ᵀ ∇l_i(h_i(w)) + ∇ρ(w).

Thus, we have to compute h′(w)ᵀv for a gradient step.


We observe:
• h′(w)ᵀ : Y → W maps from the output space into the parameter space.
• Thus, v is located in the output space.
• In deep neural networks the parameters are distributed within the layers. The same will be true for the result of h′(w)ᵀv.
• By the chain rule, h′(w) contains products of linear mappings. So in h′(w)ᵀ these products occur in reverse order.
These observations are reflected in the following algorithm, known as back-propagation.

The computation of h′(w)ᵀv by backpropagation

As we have seen, the layer structure of a neural network can be described mathematically as follows:

    h_k(x; w_k) = h̃_k(h_{k−1}(x; w_{k−1}); w̃_k).

Here the mapping

    h̃_k : Z_{k−1} × W̃_k → Z_k

represents the k-th layer, parametrized by w̃_k ∈ W̃_k.
The chain rule yields the following formula for derivatives:

    H_k = (H̃_k H_{k−1}, R̃_k)  ⇔  H_k δw_k = H̃_k H_{k−1} δw_{k−1} + R̃_k δw̃_k,

where, with W_k := W̃_1 × · · · × W̃_k:

    H_k = ∂_w h_k : W_k → Z_k
    H̃_k = ∂_z h̃_k : Z_{k−1} → Z_k
    R̃_k = ∂_{w̃} h̃_k : W̃_k → Z_k.

Transposition of this block matrix yields the following formula:

    H_kᵀ ζ = (H_{k−1}ᵀ H̃_kᵀ ζ, R̃_kᵀ ζ),   H_kᵀ : Z_k → W_{k−1} × W̃_k = W_k,

which gives rise to a recursive algorithm for evaluating h_i′(w)ᵀv = H_Lᵀ v:

applyHkT(ζ_k, k)
    ζ_{k−1} = H̃_kᵀ ζ_k ∈ Z_{k−1}
    s_k = R̃_kᵀ ζ_k ∈ W̃_k
    if (k > 0): v_{k−1} = applyHkT(ζ_{k−1}, k − 1)
    return v_k := (v_{k−1}, s_k)

With this algorithm, we can compute:

    h′(w)ᵀv = applyHkT(v, L).

We observe that this is a recursion that works backwards through the layers. The corresponding algorithm is thus called back-propagation. We observe a couple of things:

• The vector v grows during the application of the algorithm.
• In each step, some components are propagated via H̃_kᵀ, others are generated via R̃_kᵀ.
• The simple notation with a vector v hides the fact that v is a collection of matrices δW̃ and vectors δw̃_0.

Application of H̃ and R̃

To explain the details of our algorithm, we drop the index k.
Let us recall that a full layer is defined as:

    h̃(z, w̃) = α(W̃z + w̃_0).

Here z is the vector of inputs into the layer, W̃ is the matrix of connection parameters and w̃_0 is the bias vector. Together (W̃, w̃_0) are the parameters associated with the current layer.
Thus, h̃ is a mapping:

    h̃ : Z × W_M × W_0 → Z₊,

where W_M is a space of matrices. Hence, its derivative is a linear mapping:

    h̃′ = (H̃, R̃_M, R̃_0) : Z × W_M × W_0 → Z₊.

Here Z, Z₊, and W_0 are spaces of vectors, while W_M is a space of matrices. We observe:

    H̃ : Z → Z₊
    R̃_M : W_M → Z₊
    R̃_0 : W_0 → Z₊.

Let us apply the chain rule. Since α is defined componentwise, α′ is a diagonal matrix D. If, for example, ReLUs are used, the diagonal consists of 0 and 1. Thus, calling α′(W̃z + w̃_0) = D, we obtain:

    h̃′(z, w̃)(δz, δW̃, δw̃_0) = D(W̃δz + δW̃z + δw̃_0).

Hence, we have the easy parts:

    H̃δz = DW̃δz  ⇒  H̃ = DW̃  ⇒  H̃ᵀ = W̃ᵀDᵀ : Z₊ → Z
    R̃_0 δw̃_0 = Dδw̃_0  ⇒  R̃_0 = D  ⇒  R̃_0ᵀ = Dᵀ : Z₊ → W_0.

The remaining part R̃_M : W_M → Z₊ is a little bit harder to interpret and to transpose, because R̃_M is a linear mapping that takes a matrix and yields a vector:

    R̃_M δW̃ = DδW̃z.

To compute a transpose

    R̃_Mᵀ : Z₊ → W_M

we have to find for each ζ ∈ Z₊ a matrix M such that

    M · δW̃ = ∑_{ij} M_{ij} δW̃_{ij} = ζ · DδW̃z = ∑_i ζ_i (DδW̃z)_i,

where · is the sum of products of all matching entries (either of vectors or matrices). Since ζ · Dv = (Dᵀζ) · v, we define r = Dᵀζ and thus arrive at the problem of finding M such that for given r, z:

    M · δW̃ = r · δW̃z   ∀δW̃.

This computation can be performed best with the help of the trace:

    tr A = ∑_{i=1}^n A_{ii}  ⇒  M · δW̃ = ∑_{ij} M_{ij} δW̃_{ij} = tr(MᵀδW̃).

One important property of the trace that we will use is that it commutes:

    tr AB = tr BA.

With this, we compute:

    M · δW̃ = r · δW̃z = rᵀδW̃z = tr(rᵀδW̃z) = tr(zrᵀδW̃) = rzᵀ · δW̃.

Thus, we have M = rzᵀ (which is a matrix of rank 1) and thus

    R̃_Mᵀ : Z₊ → W_M,   ζ ↦ rzᵀ = (Dᵀζ)zᵀ.

With this information, our algorithm now reads as follows, where z = (x = z_0, . . . , z_{L−1}):

applyHkT(ζ_k, z, k)
    r_k = D_kᵀ ζ_k
    ζ_{k−1} = W̃_kᵀ r_k
    M_k = r_k z_{k−1}ᵀ
    if (k > 0): v_{k−1} = applyHkT(ζ_{k−1}, z, k − 1)
    return v_k = (v_{k−1}, M_k, r_k)
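A minimal numpy version of this back-propagation for the fully connected layers above, written iteratively (ReLU activations; per layer it returns the pair (M_k, r_k) = (r_k z_{k−1}ᵀ, r_k), i.e. the contributions with respect to W̃_k and w̃_{0,k}; all names are ours):

import numpy as np

def forward(x, Ws, bs):
    # store layer inputs z_0 .. z_{L-1} and activation derivatives D_k
    zs, Ds, z = [x], [], x
    for W, b in zip(Ws, bs):
        a = W @ z + b
        Ds.append((a > 0).astype(float))  # ReLU: diagonal of D_k as a vector
        z = np.maximum(a, 0.0)
        zs.append(z)
    return zs, Ds

def apply_HT(zeta, zs, Ws, Ds):
    # evaluates h'(w)^T zeta, backwards through the layers
    grads = []
    for k in reversed(range(len(Ws))):
        r = Ds[k] * zeta                        # r_k = D_k^T zeta_k
        grads.append((np.outer(r, zs[k]), r))   # M_k = r_k z_{k-1}^T, and r_k
        zeta = Ws[k].T @ r                      # zeta_{k-1} = W_k^T r_k
    return list(reversed(grads))

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
bs = [np.zeros(8), np.zeros(3)]
zs, Ds = forward(rng.normal(size=4), Ws, bs)
grads = apply_HT(np.ones(3), zs, Ws, Ds)   # list of (dW_k, db_k) per layer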
Chapter 3

Techniques for variance reduction

We have seen that the variance of the step computation is a decisive quantity when analysing the speed of convergence of stochastic gradient methods. There are essentially two ways to reduce it:

• increasing the sample size (mini-batching).
• gradient aggregation, where gradients from previous iterates are collected and averaged.

3.1 Dynamic sample size


Using our model:

    V[∆w_k] ≤ V_0^(k) + V_1^(k) ∥F′(w_k)∥²,

we recall our central estimate:

    E_{0:k}(F(w_k + δw_k) − F_∗) ≤ (1 − µα_k) E_{0:k−1}(F(w_k) − F_∗) + α_k²LV_0^(k)/2,

as long as

    α_k ≤ 1/(L(1 + V_1^(k))).

By choosing a variable sample size, we can achieve a reduction of the variance, which may be modelled by the constants V_0^(k) and V_1^(k), depending on the iteration number k.
There are two possible scenarios for linear convergence:
• V_0^(k) = 0 and V_1^(k) ≤ V̄_1. This might be achieved by estimating ∥F′(w)∥² and choosing V[∆w] proportional to this estimate.
• V_0^(k) ≤ ζ^k with ζ < 1 and V_1^(k) = 0 (for simplicity). Then linear convergence can also be shown (exercise).
In both cases, we have to expect that for some Θ < 1

    V[∆w] ∼ Θ^k,

and, as we have seen, the variance depends on the sample size n as follows:

    V[∆w] ∼ 1/n.

Thus, the cost for each step grows exponentially as:

    n_k ∼ Θ^{−k}.

Summing up, we obtain for the whole computational effort:

    ∑_{i=1}^k n_i = ∑_{i=1}^k Θ^{−i} = Θ^{−k} ∑_{i=0}^{k−1} Θ^i ≤ Θ^{−k}/(1 − Θ) ∼ Θ^{−k}.

Since we have linear convergence, the achieved accuracy ε behaves like

    ε ∼ Θ^k.

Thus, the total computational effort is proportional to 1/ε, which is similar to stochastic gradient methods.

Estimate of the variance

If the sample size is chosen dynamically, it is certainly desirable to provide an estimate for the variance V[∆w] in order to choose the sample size n accordingly.
Using the definition:

    V[∆w] = E[∥∆w − ∆w̄∥²],

our first idea is to compute n steps ∆w_i and define:

    Ṽ_Est[∆w] := (1/n) ∑_{i=1}^n ∥∆w_i − ∆w̄∥².

This estimate is, however, still not computable, because ∆w̄ is unknown. We may, however, replace ∆w̄ by an estimate:

    ∆W := (1/n) ∑_{j=1}^n ∆w_j.

Lemma 3.1.1. It holds

    E[∑_{i=1}^n ∥∆w_i − ∆W∥²] = (n − 1) V[∆w].

Proof. We apply the well known identity (with nX̄ = ∑_i x_i):

    ∑_{i=1}^n ∥x_i − X̄∥² = ∑_{i=1}^n ∥x_i∥² − n∥X̄∥²

to our case:

    E[∑_{i=1}^n ∥(∆w_i − ∆w̄) − (∆W − ∆w̄)∥²] = E[∑_{i=1}^n ∥∆w_i − ∆w̄∥²] − E[n∥∆W − ∆w̄∥²]
        = ∑_{i=1}^n V[∆w_i] − nV[∆W]
        = nV[∆w] − V[∆w] = (n − 1)V[∆w].

Thus, an unbiased estimate for the variance is

    V_Est[∆w] := (1/(n − 1)) ∑_{i=1}^n ∥∆w_i − ∆W∥².

In our case, ∆w_i would be a step computed from a single data sample and ∆W would be a step created by dynamic sampling. We thus need the variance of a step ∆W^(m), computed by averaging m gradients:

    V[∆W^(m)] = (1/m) V[∆w] ≈ V_Est[∆W^(m)] := (1/(m(n − 1))) ∑_{i=1}^n ∥∆w_i − ∆W∥².
Hence, we could choose m_k (for prescribed V_1) such that

    V_Est[∆W_k^(m_k)] ≤ V_1∥F′(w_k)∥².

However, this requires an estimate of ∥F′(w_k)∥, which is not available in general. In fact, close to a minimizer, using a small sample size,

    ∥F′(w)∥ ≈ ∥F_m′(w)∥ = ∥(1/m) ∑_{i=1}^m f_i′(w) + ρ′(w)∥

will usually be overestimated. One way to avoid this problem is to choose V_1 < 1 sufficiently small. Then by the inverse triangle inequality:

    ∥F′(w)∥ = ∥F_m′(w) + F′(w) − F_m′(w)∥ ≥ ∥F_m′(w)∥ − ∥F′(w) − F_m′(w)∥.

If ∥F′(w) − F_m′(w)∥ ≪ ∥F_m′(w)∥ with high probability, we obtain a good estimate.
Alternatively, the non-adaptive rule:

    V_Est[∆W_k^(m_k)] ≤ V_0ζ^k

could be chosen. However, here it is hard to choose appropriate parameters to make such a method efficient.
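A small numpy sketch of this sample-variance estimate for a batch of per-sample gradient steps (steps are stacked as rows; the function name is ours):

import numpy as np

def variance_estimate(G):
    # G: n x d array; row i is a single-sample step Delta w_i
    n = G.shape[0]
    mean = G.mean(axis=0)                  # Delta W
    v = np.sum((G - mean)**2) / (n - 1)    # unbiased estimate of V[Delta w]
    return mean, v

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 5))
step, v = variance_estimate(G)
# the step averaged over m samples has variance approx v/m, which can be
# compared against V_1 * ||F'(w)||^2 to choose the sample size m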

3.2 Gradient aggregation


Among gradient aggregation methods there are a couple of variants available. We will only sketch a few ideas. For simplicity let ρ = 0.
Recall that a mini-batch stochastic gradient method computes steps of the form:

    w_{k+1} = w_k − α_k ∆w_k   with   ∆w_k = (1/n) ∑_{i=1}^n ∇f_i(w_k).

This implies the expensive computation of n gradients ∇f_i(w_k) for i = 1 . . . n, where {(x_1, y_1), . . . , (x_n, y_n)} is a randomly chosen subset of the data.
To reduce this effort, it may be a good idea to reuse information. If, for example, ∇f_i has been evaluated at a previous iterate, say w_{j(i)}, then we may use the formula:

    ∆w_k = (1/n) ∑_{i=1}^n ∇f_i(w_{j(i)}).

• If ∇f_i has never been evaluated before, we may set j(i) = k and compute and store ∇f_i(w_k).
• In addition, to update available information, we may recompute ∇f_i(w_k) (so j(i) = k) for some i (which may be chosen randomly or cyclically).
Locally, i.e., if w_{j(i)} ≈ w_k, we would expect a reduction of variance similar to dynamic sample size methods. However, we cannot expect that ∆w_k is an unbiased estimate of ∇F(w_k).
These ideas leave plenty of room for implementation. The following variants have been proposed and analysed. Usually linear convergence was shown for strongly convex objectives.

SAG (Stochastic averaged gradient)

Here n = N is chosen, so the full dataset is exploited in averaging.
• First compute a classical gradient step and evaluate all ∇f_i(w_1), so j(i) = 1 for i = 1 . . . N.
• In the following iterations, for each k pick i ∈ {1 . . . N} randomly and replace ∇f_i(w_{j(i)}) by ∇f_i(w_k), so j(i) = k.
• The mean value does not need to be recomputed. We may write:

    ∆w_k = ∆w_{k−1} + (1/N)(∇f_i(w_k) − ∇f_i(w_{j(i)})).

• As already mentioned, ∆w_k is a biased estimator of ∇F(w_k):

    ∆w_k = (1/N) ∑_{i=1}^N ∇f_i(w_k) + (1/N) ∑_{i=1}^N (∇f_i(w_{j(i)}) − ∇f_i(w_k)).

This complicates convergence theory. The bias, however, vanishes close to the solution.
• The method requires a lot of storage, so it may be unsuitable for very large N.

SAGA
Here the idea is to provide an unbiased estimator. This can be achieved by the following simple modification:

    ∆w_k = ∇f_i(w_k) + [ (1/N) ∑_{l=1}^N ∇f_l(w_{j(l)}) − ∇f_i(w_{j(i)}) ].

Since i has been chosen randomly (uniformly distributed), the expected value of the term [. . .] is 0.
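A minimal numpy sketch of the SAGA update (the table G stores the last evaluated gradient per sample; the toy objective at the end is made up):

import numpy as np

def saga(w, grad_fi, N, alpha, steps, rng):
    G = np.array([grad_fi(i, w) for i in range(N)])  # gradient table
    avg = G.mean(axis=0)
    for k in range(steps):
        i = rng.integers(N)
        g = grad_fi(i, w)
        dw = g + (avg - G[i])          # bracketed term has expectation 0
        w = w - alpha * dw
        avg = avg + (g - G[i]) / N     # keep the stored mean up to date
        G[i] = g
    return w

# example: f_i(w) = 1/2 (w - c_i)^2, the minimizer of the average is mean(c)
rng = np.random.default_rng(0)
c = rng.normal(size=100)
print(saga(0.0, lambda i, w: w - c[i], N=100, alpha=0.1, steps=2000), c.mean())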

SVRG (Stochastic Variance Reduced Gradient)

This method works with macro steps. One step w_k → w_{k+1} is computed via:
• one full gradient evaluation ∆w = ∇F_N(w_k) = (1/N) ∑_i ∇f_i(w_k),
• m inner update steps with w̃_0 = w_k and w̃_{j+1} = w̃_j − α∆w̃_j, where i is drawn randomly in each inner step and

    ∆w̃_j = ∇f_i(w̃_j) − ∇f_i(w_k) + ∇F_N(w_k).

• Concerning complexity, this method is rather close to the full gradient method; a sketch of one macro step follows below.
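A minimal numpy sketch of one SVRG macro step (m and alpha are tuning parameters; the toy objective is the same made-up one as above):

import numpy as np

def svrg_macro_step(w, grad_fi, N, m, alpha, rng):
    full = sum(grad_fi(i, w) for i in range(N)) / N   # grad F_N(w_k), computed once
    wt = w
    for j in range(m):
        i = rng.integers(N)
        dwt = grad_fi(i, wt) - grad_fi(i, w) + full   # variance-reduced estimate
        wt = wt - alpha * dwt
    return wt

rng = np.random.default_rng(0)
c = rng.normal(size=100)
w = 0.0
for k in range(20):
    w = svrg_macro_step(w, lambda i, w: w - c[i], N=100, m=50, alpha=0.1, rng=rng)
print(w, c.mean())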
Chapter 4

Techniques for preconditioning

Concerning the rate of convergence, not only the variance plays a role, but also the condition number:

    κ = L/µ.

Even if V_0 = 0 and V_1 is moderate, we have

    E_{0:k}(F(w_k + δw_k) − F_∗) ≤ (1 − µα_k) E_{0:k−1}(F(w_k) − F_∗).

Setting α = 1/(L(1 + V_1)), we get

    E_{0:k}(F(w_k + δw_k) − F_∗) ≤ (1 − 1/(κ(1 + V_1))) E_{0:k−1}(F(w_k) − F_∗),

which means that the (now linear) speed of convergence depends crucially on κ. It is a classical result that the number of iterations needed for a prescribed accuracy ε is proportional to κ.
Also for harmonic step-sizes it can be shown (exercise) that γ ∼ κ and

    E_{0:k}[F(w_k) − F_∗] ≤ ν/(γ + k) ∼ max{ (κ/2) V_0/(µ(F(w_1) − F_∗)), (1 + V_1) } (F(w_1) − F_∗)/(γ + k).

Thus, also here the error is scaled by κ, asymptotically for k ≫ γ. So the required number of iterations is also proportional to κ:

    for fixed ε:   k ∼ κ.

Hence, reducing the condition number directly reduces the computational effort by the same factor.

4.1 The effect of preconditioning


Let us collect our set of assumptions and definitions that depend on the norm and the corresponding scalar product:

    F(w + δw) ≤ F(w) + F′(w)δw + (L/2)∥δw∥²   ∀δw
    ∥F′(w)∥∗² ≥ 2µ(F(w) − F(w∗)),   µ > 0
    ⟨∆w, v⟩ = −F′(w)v   ∀v
    V[∆w] = E[∥∆w − ∆w̄∥²] ≤ V_0 + V_1∥F′(w)∥∗².

The norm for dual quantities is denoted by ∥·∥∗. It is defined as an operator norm:

    ∥f∥∗ := sup_{v≠0} f(v)/∥v∥   for a linear functional f.

Let us now define the following general scalar product (and norm):

    ⟨v, w⟩ := vᵀMw,   M s.p.d.,   ∥v∥ = √(vᵀMv).

Then we can derive the representation:

    ⟨f, g⟩∗ = fᵀM⁻¹g.

If F is a quadratic with spd Hessian (or if, close to a minimizer, F can be approximated by such a quadratic function):

    F(w) = ½wᵀAw + lᵀw + c,

then the condition number is given by the extreme solutions λ_min, λ_max of the following generalized eigenvalue problem:

    Av = λMv  ⇔  M⁻¹Av = λv,

as follows:

    κ_M(A) = λ_max/λ_min.

In particular, in the extreme case M = A, we obtain κ_M(A) = 1. So a good choice of M can reduce the required computational effort significantly. Note, however, that mere scaling of the norm leaves the condition number invariant.
The price to pay is that the computation of a search direction

    ⟨∆w, v⟩ = −F′(w)v   ∀v

now amounts to solving the linear system:

    M∆w = −F′(w)ᵀ.

So the price to pay for preconditioning is as follows:

• derive and compute a suitable preconditioner M
• solve (approximately) a linear system

The potential benefit is:

• a significantly lower number of total iterations

Interestingly, the variance of ∆w also depends on the choice of norm.

Our preconditioned algorithm now looks as follows:

    Given w_1
    for k = 1, 2, . . .
        compute an estimate l(w_k) for F′(w_k)ᵀ
        compute M_k
        ∆w = −M_k⁻¹ l(w_k)
        w_{k+1} = w_k + δw = w_k + α∆w

Since M defines a norm, all of the previous theory applies.

4.2 Diagonal scaling methods


The cheapest way of preconditioning is to choose M as a diagonal matrix: M = diag(m_ii).
Then we can simply compute:

    [∆w]_i = −(1/m_ii) [l(w)]_i.

Some popular ways to define m_ii have been proposed:

RMSprop
The motivation of the following algorithm is to gather some estimates of the scaling properties of the variables.
Here a diagonal matrix D_k is defined recursively as follows:

    D_k = (1 − λ)D_{k−1} + λ diag(F′(w_k)_i²),   λ ∈ [0, 1].

Then for some µ > 0 we define:

    M_k = √(D_k) + µI   (where the root is taken componentwise).

Sensitivity to scaling
This is one way to make the algorithm less sensitive to scaling of the variables. Setting v = Sw + w_0 and F_S(w) := F(v) = F(Sw + w_0) we get:

    F_S′(w) = F′(v)S  ⇒  F_S′(w)_i² = S_ii² F′(v)_i²  ⇒  D_ii^S = S_ii² D_ii.

Thus, we obtain (for µ = 0):

    ∆v = D^{−1/2} F′(v)ᵀ
    ∆w = (D^S)^{−1/2} F_S′(w)ᵀ = D^{−1/2} S⁻¹ S F′(v)ᵀ = D^{−1/2} F′(v)ᵀ = ∆v.

Hence, the steps are the same. However, given that v = Sw + w_0, we should rather expect ∆v = S∆w, so a scaling with D⁻¹ would also be conceivable.

Adagrad
A similar idea is used in the algorithm Adagrad. Here the following update for D is used:

    D_k = D_{k−1} + diag(F′(w_k)_i²).

This has the effect that M_k grows during the iteration, which yields smaller and smaller steps. This effect may be desirable during a pure stochastic gradient method. This decay, however, often turns out to be too aggressive in practice.
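A minimal numpy sketch of these two diagonal preconditioners applied to a gradient estimate g ≈ F′(w)ᵀ (lam and mu are the tuning constants λ and µ from above; values are made up):

import numpy as np

def rmsprop_step(w, g, D, alpha, lam=0.1, mu=1e-8):
    # g is an estimate of F'(w)^T; D keeps a running average of F'(w)_i^2
    D = (1 - lam) * D + lam * g**2
    return w - alpha * g / (np.sqrt(D) + mu), D    # M = sqrt(D) + mu*I

def adagrad_step(w, g, D, alpha, mu=1e-8):
    D = D + g**2                                   # accumulated squared gradients
    return w - alpha * g / (np.sqrt(D) + mu), D

w, D = np.ones(3), np.zeros(3)
g = np.array([0.5, -2.0, 0.1])
w, D = rmsprop_step(w, g, D, alpha=0.01)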

4.3 CG as a matrix-free solver


If the preconditioner matrix M is more complicated than diagonal, we have to solve a linear system of the form:

    M∆w + l = 0.

Since M is by assumption spd, this is equivalent to the minimization problem:

    min_v q(v) := ½vᵀMv + lᵀv.

This problem is often very large scale, so it is often not possible to store or to evaluate the matrix M. Instead, we might have an algorithm that performs the application of M to a vector v:

    v ↦ Mv.

The absence of M in explicit form rules out matrix factorizations (like M = LR) for the solution of this system, and we have to use iterative methods.
The simplest idea would be to use a gradient method again, now corresponding to the scalar product vᵀPw, where P is itself a preconditioner (an inner preconditioner) for M:

    compute p_k := −∇_P q(v_k) = −P⁻¹q′(v_k)ᵀ   (corresponding to min_{δv} ½δvᵀPδv + q′(v_k)δv)
    v_{k+1} = v_k + α_k p_k.

Since the problem is quadratic, an optimal value of α can be computed:

    min_α q(v_k + αp_k)  ⇒  0 = ∂_α q(v_k + αp_k) = q′(v_k + αp_k)p_k = (v_k + αp_k)ᵀMp_k + lᵀp_k
    ⇒  α_k = −(Mv_k + l)ᵀp_k/(p_kᵀMp_k) = −q′(v_k)p_k/(p_kᵀMp_k).

As shown in NLP, this method converges with the following speed:

    ∥v∗ − v_{k+1}∥_M ≤ ((κ − 1)/(κ + 1)) ∥v∗ − v_k∥_M,

where

    κ = λ_max(M)/λ_min(M).

This is not a big improvement compared to the application of the unpreconditioned gradient method to the nonlinear problem.

M-conjugate directions
Application of an additional Gram-Schmidt orthogonalization step improves the situation drastically. We would like to achieve that the new search direction p_k is M-orthogonal to the previous direction p_{k−1}:

    p_kᵀMp_{k−1} = 0.

Then Gram-Schmidt yields:

    p_k = g_k − (g_kᵀMp_{k−1}/(p_{k−1}ᵀMp_{k−1})) p_{k−1} = g_k + β_k p_{k−1},

so that one step of the method reads:

    compute g_k := −∇_P q(v_k) = −P⁻¹q′(v_k)ᵀ
    p_k = g_k + β_k p_{k−1}   (Gram-Schmidt)
    v_{k+1} = v_k + α_k p_k   (optimal line-search)

This simple modification yields a greatly improved rate of convergence:

    ∥v∗ − v_k∥_M ≤ 2 ((√κ_P(M) − 1)/(√κ_P(M) + 1))^k ∥v∗ − v_0∥_M.

Thus, the overall computational effort of cg is roughly the square root of that of the gradient method for large condition numbers. A clever implementation uses only one application of M per iteration, namely p_k → Mp_k. The algorithm, with the loop slightly reordered, is shown below. In addition, we include the possibility of preconditioning the iteration for M with a preconditioner P:
Algorithm 4.3.1. (cg)

    Given: v_0
    d_0 = q′(v_0)ᵀ = Mv_0 + l,   p_0 := −P⁻¹d_0
    for k = 0, 1, 2, . . .
        p̃_k = Mp_k
        α_k = −d_kᵀp_k/(p_kᵀp̃_k)
        v_{k+1} = v_k + α_k p_k
        d_{k+1} = d_k + α_k p̃_k
        g_{k+1} = −P⁻¹d_{k+1}
        β_{k+1} = d_{k+1}ᵀg_{k+1}/(d_kᵀg_k)
        p_{k+1} = g_{k+1} + β_{k+1} p_k

• For usage inside an optimization method it is important for efficiency to terminate the cg method early. Thus, the linear system Mv + l = 0 is not solved exactly. The computed steps, however, are better than unpreconditioned gradient steps, even if only a moderate number of cg iterations is used.
• The analysis of cg is quite involved and is presented in the lecture "Vertiefung der Numerik". In fact, cg minimizes q(v) on v_0 + span{p_i}_{i=0...k}, the space spanned by all previous search directions. In theory, it yields the exact solution after k = dim W iterations. In practice, however, it is terminated much earlier.
• The cg method depends on the positive definiteness of M in order to converge. However, if M is indefinite, cg can still be used to compute a direction of descent: as soon as we encounter a direction of negative curvature, i.e. p_kᵀp̃_k < 0, we terminate the cg method and return the current iterate v_k. This can be used as a search direction in the outer optimization algorithm.
• As a preconditioner P one may take an estimate of the diagonal of M. For example, one may use ideas from RMSprop:

    P_ii = ∑_j [∇f_j]_i² + µ.

Then the preconditioner eliminates the dependence of cg on scaling. A compact matrix-free implementation is sketched below.
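A minimal numpy version of Algorithm 4.3.1, taking M and P⁻¹ as functions so that M never has to be stored (we add the usual residual-based stopping test and the negative-curvature exit discussed above; the test matrix is made up):

import numpy as np

def pcg(apply_M, apply_Pinv, l, v0, iters=100, tol=1e-10):
    # approximately solves M v + l = 0; apply_M(v) = M v, apply_Pinv(d) = P^{-1} d
    v = v0.copy()
    d = apply_M(v) + l               # d_k = q'(v_k)^T
    g = -apply_Pinv(d)
    p = g.copy()
    dg = d @ g
    for k in range(iters):
        Mp = apply_M(p)
        curv = p @ Mp
        if curv <= 0.0:              # negative curvature: return current iterate
            return v
        alpha = -(d @ p) / curv
        v = v + alpha * p
        d = d + alpha * Mp
        if np.linalg.norm(d) < tol:
            break
        g = -apply_Pinv(d)
        dg_new = d @ g
        beta = dg_new / dg
        p = g + beta * p
        dg = dg_new
    return v

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 50))
M = B.T @ B + 50 * np.eye(50)        # made-up spd matrix
l = rng.normal(size=50)
v = pcg(lambda x: M @ x, lambda d: d / np.diag(M), l, np.zeros(50))
print(np.linalg.norm(M @ v + l))     # small residual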

4.4 Inexact Newton methods for Neural Networks


The application of Newton methods has a couple of advantages:

• less dependent on conditioning of the problem (locally κ ∼ 1, globally, invariant against linear
transformation of variables)

• step-sizes locally tend to 1, depending on the non-linearity of the problem, rather than the scaling
(no experimentation with step-sizes necessary)

There is also some cost involved:

• evaluation of the hessian and solution of the system (remedy: inexact Newton-cg)

• need additional smoothness

• deal with non-convexity, singular hessian matrices (remedy: trust-region methods)



Let us recall our objective:

    min_w F(w) = R(w) + ρ(w) = (1/N) ∑_{i=1}^N f_i(w) + ρ(w) = (1/N) ∑_{i=1}^N l(h(x_i, w), y_i) + ρ(w).

If F is convex, we may set M_k = F″(w_k) as an ideal preconditioner. In that case we have to solve the following problem at each step:

    min_v ½F″(w_k)(v, v) + F′(w_k)v.

We cannot factorize or even store F″(w_k); thus we use cg with M = F″(w_k) to find an approximate minimizer of the problem.
This requires the following computations:
• Computation of F′(w) once at the beginning of the cg iteration.
• Computation of F″(w)p_k at each iteration of the cg method.
Thus, one run of the cg method (using K steps) requires the following work:

    work(F′(w)) + K · work(F″(w)v).

Assuming that work(F″(w)v) ∼ γ work(F′(w)) we obtain

    work(F′(w))(1 + γK).

Obviously, exact evaluation of F″(w)v makes one iteration of cg very expensive compared to a gradient step, because then γ ≥ 1 has to be expected. Rather, we aim to obtain a cheaper approximation Mv ≈ F″(w)v so that γK ≈ 1.

Hessian sub-sampling
This is a popular idea to obtain an efficient algorithm in the setting of learning. Assume that our algorithm uses a stochastic gradient method with mini-batches of sampling size n:

    R_n′(w)ᵀ = (1/n) ∑_{i=1}^n f_i′(w)ᵀ = (1/n) ∑_{i=1}^n h′(x_i, w)ᵀ l′(h(x_i, w), y_i)ᵀ,

where the (x_i, y_i) are chosen randomly from the data. This is done, as usual, with back-propagation.
Then we may choose m ≪ n, draw j = 1 . . . m data samples from the already chosen set, and compute in each cg step (setting z = h_j(w))

    R_m″(w)v = (1/m) ∑_{j=1}^m f_j″(w)v = (1/m) ∑_{j=1}^m ( h_j′(w)ᵀ l_j″(z) h_j′(w)v + l_j′(z) h_j″(w)v ).

Then M = R_m″(w) + ρ″(w) can be used. We have to keep M, and thus the sample, fixed during the cg iteration; otherwise M would change during the cg iteration.
Concerning

    f″(w)v = h′(w)ᵀl″(z)h′(w)v + l′(z)h″(w)v

we observe two summands:

    h′(w)ᵀl″(z)h′(w)v

This summand can be evaluated by one forward propagation v ↦ h_j′(w)v, application of l_j″(z), which is usually cheap, and one back-propagation ṽ ↦ h_j′(w)ᵀṽ. This costs about twice a gradient evaluation. Moreover, if l is strictly convex, l″ is positive definite, and thus this term is positive semi-definite as well:

    vᵀh′ᵀl″h′v = (h′v)ᵀl″(h′v) ≥ 0.

The second term is trickier to evaluate:

    l′(z)h″(w)v.

A naive implementation would first form the matrix h″(w)v, to which l′(z) is applied. However, efficient implementations for this summand are also available. There are no reasonable assumptions that guarantee positive definiteness of this term.
If ReLUs are used, however, it is questionable whether h″(w) makes sense. Recall that h consists of piecewise linear functions α(z) = max(z, 0) and bilinear functions z₊ = W̃z + w̃_0. While α′ may be seen as a step function, it is hard to give α″ a meaningful interpretation.
In efficient implementations, for each sample the cost is moderately higher than a gradient evaluation (factor γ_sample ∼ 3−5).
We then obtain the total work:

    work(F′(w)) (1 + γ_sample mK/n),

and thus the tradeoff (if the work is kept constant):
• small K and large m yields an inaccurate solution of the system for an accurate M.
• large K and small m yields an accurate solution of the system for an inaccurate M. This is, however, not advocated in the literature and is called "minibatch-overfitting".
Recall that the theoretically optimal choice of stepsize is given as follows:

    α = 1/(L(1 + V[∆w]/∥F′(w)∥∗²)),

yielding an expected decrease by a factor θ:

    θ = 1 − µα = 1 − 1/(κ(1 + V[∆w]/∥F′(w)∥∗²)).

Our algorithmic task would then be to balance the following quantities (a small numerical illustration of Hessian-vector products follows after this list):

• minibatch size n for F′(w): reduction of V[∆w] ∼ 1/n
• minibatch size m for F″(w): reduction of κ to the "ideal" case M ≈ F″(w). Close to the solution κ ≈ κ_M(F″(w)), which would then govern linear convergence.
• number of iterations K for the cg method. The speed of convergence depends on the condition number κ_P(M) of M relative to the preconditioner P.
• stepsize α
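When an exact Hessian-vector product is inconvenient to implement, a common matrix-free stand-in (a generic numerical device, not taken from these notes) is the finite-difference approximation F″(w)v ≈ (F′(w + tv) − F′(w))/t, which costs one extra gradient evaluation per cg step:

import numpy as np

def hess_vec(grad, w, v, eps=1e-6):
    # finite-difference approximation of F''(w) v from two gradient evaluations
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return np.zeros_like(v)
    t = eps / nv
    return (grad(w + t * v) - grad(w)) / t

# check on F(w) = 1/2 ||A w||^2, where F''(w) v = A^T A v exactly
A = np.arange(6.0).reshape(3, 2)
grad = lambda w: A.T @ (A @ w)
w, v = np.ones(2), np.array([1.0, -1.0])
print(hess_vec(grad, w, v), A.T @ A @ v)   # approximately equal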

Matrix Chernoff bounds
Just as in the case of function evaluations, a probabilistic measure for the condition number of M = F_m″(w) = (1/m) ∑ A_j, relative to A = F″(w), would be helpful. One possible probability theoretic tool are the so-called matrix Chernoff bounds.

Lemma 4.4.1. Consider symmetric matrices A_j ∈ ℝ^{d×d}, j = 1 . . . m. Assume that all matrices are drawn independently and that E[A_j] = A.
Let 0 < λ_min ≤ λ_max be the extreme eigenvalues of A and κ̄ = λ_max/λ_min its condition number. Assume further a bound on the extreme eigenvalues of all A_j:

    0 ≤ λ_min(A_j) ≤ λ_max(A_j) ≤ Rλ_max.

Then for the extreme eigenvalues 0 ≤ λ′_min ≤ λ′_max of (1/m) ∑ A_j we obtain:

    P(λ′_min ≤ (1 − δ)λ_min) ≤ d (e^{−δ}/(1 − δ)^{1−δ})^{m/(κ̄R)} = d Θ₋^{m/(κ̄R)},   δ ∈ ]0, 1[,
    P(λ′_max ≥ (1 + δ)λ_max) ≤ d (e^{δ}/(1 + δ)^{1+δ})^{m/R} = d Θ₊^{m/R},   δ > 0,

with 0 < Θ₋, Θ₊ < 1.

To obtain a moderately good approximation of the extreme eigenvalues we have to fix δ moderately small and obtain Θ not too close to 1. If we want to achieve a probability p we get the equations (for the small eigenvalue, and similarly for the large one):

    p = d Θ₋^{m/(κ̄R)}  ⇒  m = κ̄R ln(p/d)/ln Θ₋.

We observe that the number of samples depends linearly on the condition number. However, the above results are not exactly what we need for a good theoretical understanding. Rather, bounds on

    P(κ_M(F″(w)) ≤ C),   where M = (1/m) ∑ A_j,

would be of interest. Unfortunately, I am not aware of any such bounds in the literature. Also, appropriate assumptions under which such an estimate should hold seem to be unclear.

4.5 Generalized Gauss-Newton methods

The idea of generalized Gauss-Newton methods is to drop the term l′(z)h″(w) in the formula

    f″(w)v = h′(w)ᵀl″(z)h′(w)v + l′(z)h″(w)v,

and thus obtain the so-called generalized Gauss-Newton matrix:

    G(w) := h′(w)ᵀl″(z)h′(w).

The classical Gauss-Newton method uses l(z, y) = ½(z − y)², so l″ = I and

    G_class(w) = h′(w)ᵀh′(w).

In total, we then have (using sub-sampling):

    M^GN(w) := (1/m) ∑_{j=1}^m G_j(w) + ρ″(w) = (1/m) ∑_{j=1}^m h′(x_j, w)ᵀ l″(z) h′(x_j, w) + ρ″(w).

This yields several benefits:

• The computational cost and implementation effort of evaluating l′(z)h″(w)v is saved, as well as the problem with the ReLUs.
• M is always positive definite as long as l is convex and ρ is strictly convex. This is very frequently the case.
• Close to a minimizer, we obtain 0 ≈ F′(w) = R′(w) + ρ′(w) = N⁻¹ ∑_i l_i′(z)h_i′(w) + ρ′(w). Thus, if ρ′(w) is small, we expect ∑_i l_i′(z)h_i′(w) to be small as well. Recall that loss functions are often constructed such that l′(z, y_i) is small if z ≈ y_i, or if z is correctly classified. Thus, G(w) can be a good approximation of F″(w) close to the solution. In practice one often observes rather fast linear convergence.
• Used with appropriate globalization and preconditioning techniques, this class of methods is competitive, in particular for ill-conditioned, difficult problems.

A small illustration of the Gauss-Newton matrix-vector product follows below.
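A minimal numpy sketch of the product Gv = h′(w)ᵀ l″(z) (h′(w)v) for one sample in the classical case l″ = I (the two-parameter model and its Jacobian are made up; forming the Jacobian explicitly is only for illustration, in practice one uses a forward and a backward propagation instead):

import numpy as np

def model_jacobian(w, x):
    # made-up model h(x; w) = (w_0*x, w_1*x^2) with its Jacobian dh/dw
    return np.array([[x, 0.0],
                     [0.0, x**2]])

def ggn_matvec(w, x, v):
    J = model_jacobian(w, x)
    return J.T @ (J @ v)        # h'(w)^T l''(z) h'(w) v with l'' = I

print(ggn_matvec(np.array([1.0, 2.0]), x=3.0, v=np.array([1.0, 1.0])))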

4.6 K-FAC Preconditioner


In the following we consider the case ρ = 0 for simplicity. In this section we construct a preconditioner M that is a reasonable approximation of M^GN and for which M⁻¹v can be evaluated efficiently. It is known under the name Kronecker-Factored Approximate Curvature and its application does not require a cg iteration.
First of all, we recall that in DNNs the space of parameters W is partitioned by layers, and in each layer k = 1 . . . L there are the parameters (W̃_k, w̃_{0,k}) = W_k (weights and bias) with perturbations δW_k, which we will sometimes also write as vectors δw_k by rearranging the entries of δW_k into a column.
We would like to simplify the following bilinear form:

    δwᵀ M^GN δw := (1/n) ∑_{j=1}^n δwᵀ G_j(w) δw = (1/n) ∑_{j=1}^n (h′(x_j, w)δw)ᵀ l″(z) (h′(x_j, w)δw),

where δw consists of all perturbations δW_k. If M^GN is considered as a matrix, then the partitioning into layers yields a partitioning of M^GN into an L × L block matrix.

First simplification to a block diagonal form

As a first simplification we will consider only the diagonal blocks M_{k,k}^GN of M^GN and drop all off-diagonal blocks. We thus define the simplified bilinear form as follows:

    δwᵀ M^GN δw ≈ δwᵀ M^D δw = ∑_{k=1}^L δw_kᵀ M_{k,k}^GN δw_k.

In order to apply (M^D)⁻¹ we only have to invert each block M_{k,k}^GN, which is given by

    δw_kᵀ M_{k,k}^GN δw_k = (1/n) ∑_{j=1}^n (∂_{W_k} h(x_j, w)δW_k)ᵀ l″(z) (∂_{W_k} h(x_j, w)δW_k).

Our computation thus amounts to the following minimization problem:

    min_{δw_k} ½ δw_kᵀ M_{k,k}^GN δw_k + F_m′(w)δw_k,

or

    min_{δW_k} (1/n) ∑_{j=1}^n ( ½ (∂_{W_k} h(x_j, w)δW_k)ᵀ l″(z) (∂_{W_k} h(x_j, w)δW_k) + l′(z) ∂_{W_k} h(x_j, w)δW_k ).

By the chain rule and the representation h̃(z, W̃, w̃_0) = α(W̃z + w̃_0) we obtain the following representation:

    ∂_{W_k} h(x, w)δW_k = ∂_{z_k} h(x, w) D_k δW_k h_{k−1}(x, w),

and thus, with

    p_j := h_{k−1}(x_j, w)   (intermediate variables, computed by forward propagation),
    q_jᵀ := l′(z) ∂_{z_k} h(x_j, w) D_k   (computed by backward propagation),
    A_j := D_kᵀ ∂_{z_k} h(x_j, w)ᵀ l″(z) ∂_{z_k} h(x_j, w) D_k   (modified backward propagation),

we obtain

    min_{δW_k} (1/n) ∑_{j=1}^n ( ½ p_jᵀ δW_kᵀ A_j δW_k p_j + q_jᵀ δW_k p_j ),

and thus by differentiation in an arbitrary direction H:

    0 = (1/n) ∑_{j=1}^n (p_jᵀ δW_kᵀ A_j + q_jᵀ) H p_j.

Lemma 4.6.1. Let v_j ∈ V, w_j ∈ W be some vectors for j = 1 . . . n. Then we have the following equivalence:

    ∑_{j=1}^n v_j w_jᵀ = 0  ⇔  ∑_{j=1}^n w_jᵀ H v_j = 0   for all matrices H : V → W.

Proof. Let e_k ∈ W, e_l ∈ V be unit vectors, such that H_kl = e_k e_lᵀ is a matrix with precisely one entry equal to 1. Then we compute:

    ∑_{j=1}^n w_jᵀ H_kl v_j = ∑_{j=1}^n (w_jᵀ e_k)(e_lᵀ v_j) = ∑_{j=1}^n (e_lᵀ v_j)(w_jᵀ e_k) = e_lᵀ (∑_{j=1}^n v_j w_jᵀ) e_k.

Thus, the entry (∑_{j=1}^n v_j w_jᵀ)_{lk} vanishes iff ∑_{j=1}^n w_jᵀ H_kl v_j = 0. This implies our stated equivalence (taking into account that the set of H_kl is a basis of the space of linear mappings L(V, W)).
Hence, we have to solve the following linear matrix equation for δW k :
n n n
1X T 1X 1X
0= pj pTj δW k Aj + pj qjT ⇔ Aj δW k pj pTj = − qj pTj .
n j=1 n j=1 n j=1

Second simplification: averaged A


This system is still too large to solve. It is a linear system with as many degrees of freedom as δW k has.
It is thus a system of dimension din · dout of the corresponding level k.
Thus, we need an additional simplification step: we replace each Aj by its mean (over a possibly
smaller number m < n of samples):
\[
A_j \to A := \frac{1}{m}\sum_{j=1}^{m} A_j \qquad\text{and}\qquad \frac{1}{n}\sum_{j=1}^{n} A_j\, \delta W_k\, p_j p_j^T \to A\, \delta W_k\, \frac{1}{n}\sum_{j=1}^{n} p_j p_j^T .
\]

Then we can pull A out of the sum and get the following simplified system:
\[
A\, \delta W_k^{kfac} \Big( \frac{1}{n}\sum_{j=1}^{n} p_j p_j^T \Big) = -\frac{1}{n}\sum_{j=1}^{n} q_j p_j^T .
\]

Thus, if we define
\[
P := \frac{1}{n}\sum_{j=1}^{n} p_j p_j^T , \qquad Q := \frac{1}{n}\sum_{j=1}^{n} q_j p_j^T ,
\]

we can now solve the following system:
\[
A\, \delta W_k^{kfac} P = -Q \quad\Leftrightarrow\quad \delta W_k^{kfac} P = -A^{-1} Q \quad\Leftrightarrow\quad \delta W_k^{kfac} = -A^{-1} Q P^{-1}
\]

under the assumption that the symmetric positive semi-definite matrices P and A are invertible. Thus,
we only have to compute Cholesky factorizations of the matrices P ∈ R^{d_in × d_in} and A ∈ R^{d_out × d_out}.
While A is likely to be positive definite as a sum of positive definite matrices, P is a sum of rank-1
matrices, so there should be at least n ≥ d_in addends. Otherwise, we may replace P^{-1} by a
pseudo-inverse P^+.
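To make this concrete, here is a minimal NumPy/SciPy sketch of this layer-wise update. The function name, interface and damping strategy are ours, not those of any library; we assume that p, q and the averaged factor A have already been computed by the forward and (modified) backward propagations described above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kfac_layer_step(p, q, A, damping=1e-4):
    """One K-FAC update for a single layer: solve A dW P = -Q (a sketch).

    p : (n, d_in)  activations p_j = h_{k-1}(x_j, w) from forward propagation
    q : (n, d_out) vectors q_j from backward propagation
    A : (d_out, d_out) averaged curvature factor (mean of the A_j)
    Returns dW of shape (d_out, d_in), i.e. dW = -A^{-1} Q P^{-1}.
    """
    n = p.shape[0]
    P = p.T @ p / n                  # P = (1/n) sum_j p_j p_j^T
    Q = q.T @ p / n                  # Q = (1/n) sum_j q_j p_j^T
    # Tikhonov damping keeps both factors invertible; P is rank deficient
    # whenever n < d_in, as noted above.
    cP = cho_factor(P + damping * np.eye(P.shape[0]))
    cA = cho_factor(A + damping * np.eye(A.shape[0]))
    # Two Cholesky solves instead of forming inverses explicitly:
    return -cho_solve(cA, cho_solve(cP, Q.T).T)   # -A^{-1} Q P^{-1}
```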
Chapter 5

Momentum

Some moderate reduction of variance and condition number can also be achieved by alternative, perhaps
less sophisticated strategies. In the following we will discuss two of them.

5.1 The dynamics of the heavy-ball method


The following method is called the heavy-ball method:

wk+1 = wk − α∇F (wk ) + β(wk − wk−1 ).

Its name is due to a physical interpretation of the second term as a momentum or inertia term:
We may interpret the gradient method (β = 0) as an explicit Euler method (with stepsize τ = α) for
the first order equation:
ẇ = −∇F (w)
For stiff problems (where the eigenvalues of −(∇F )′ = −F ′′ are large) it is known that the explicit
Euler method needs small step sizes to be stable, while small eigenvalues of F ′′ yield slow convergence
of w towards the fixed point ∇F = 0. Again we observe that badly conditioned problems yield slowly
converging gradient methods.
To be more concrete, since F''(w) is spd, we can choose a basis (the same for w and v), such that
F''(w) = D = diag(d_1, . . . , d_n) is diagonal. Then the speed of convergence of the trajectory is proportional
to the smallest eigenvalue d_1, while the maximal stable step-size is proportional to 1/d_n. Thus, the
required number of steps is proportional to
\[
\kappa = \frac{d_n}{d_1} .
\]
From the point of view of dynamics, we may want to damp the overly oscillatory components of the right
hand side.
The heavy-ball method can be seen as a discretization of the following second order equation:

ẅ + ρẇ = −∇F (w)

or alternatively of the system:

ẇ = v
v̇ = −∇F (w) − ρv

The Jacobian matrix of this system reads:
\[
\begin{pmatrix} 0 & I \\ -F''(w) & -\rho I \end{pmatrix}
\]

In the diagonalizing basis our system decouples into 2 × 2 blocks, and the Jacobian matrix reads:
\[
\begin{pmatrix} 0 & I \\ -D & -\rho I \end{pmatrix} \qquad\text{with blocks}\qquad \begin{pmatrix} 0 & 1 \\ -d_i & -\rho \end{pmatrix}
\]
We may compute the eigenvalues of this 2 × 2 system:
\[
0 = \lambda(\lambda + \rho) + d_i \quad\Rightarrow\quad \lambda_{1,2} = -\frac{\rho}{2} \pm \sqrt{\rho^2/4 - d_i}
\]
If we choose ρ such that ρ²/4 − d_i < 0, we obtain conjugate complex eigenvalues with:
\[
\operatorname{Re} \lambda_{1,2} = -\frac{\rho}{2} , \qquad |\lambda_{1,2}| = \sqrt{d_i} .
\]

Choosing ρ := 2√d_1, based on the smallest eigenvalue, guarantees this for all eigenvalues. The maximal
absolute value of our eigenvalues is now √d_n.
Hence, just as above, the required number of steps is proportional to
\[
\sqrt{\kappa} = \frac{\sqrt{d_n}}{\sqrt{d_1}} .
\]
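As a minimal sketch (the function name and interface are ours), the heavy-ball iteration can be written in a few lines of Python; we will reuse this routine in the comparison at the end of the next section.

```python
import numpy as np

def heavy_ball(grad_F, w0, alpha, beta, iters=1000):
    """Heavy-ball iteration: w_{k+1} = w_k - alpha*grad_F(w_k) + beta*(w_k - w_{k-1})."""
    w_prev = np.asarray(w0, dtype=float).copy()
    w = w_prev.copy()
    for _ in range(iters):
        w_next = w - alpha * grad_F(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w
```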

5.2 First and second order fixed point iterations


To understand the idea of momentum, we consider the following quadratic problem:
\[
\min_w q(w) := \frac{1}{2} w^T A w + l^T w \quad\Leftrightarrow\quad q'(w) = A w + l = 0 ,
\]
where A is an spd matrix. For simplicity we do not consider a preconditioner, which would, however, be
possible.
The gradient method for this problem can be written as:
wk+1 = wk − αq ′ (wk ) = wk − α(Awk + l)
This in turn yields a first order fixed point method:
wk+1 − w∗ = wk − w∗ − αA(wk − w∗ ) = (I − αA)(wk − w∗ ).
We may adjust the parameter α to obtain fast convergence. Defining z = w − w∗ this yields:
zk+1 = (I − αA)zk .
A second order fixed point method is defined as follows:
wk+1 = wk − αq ′ (wk ) + β(wk − wk−1 )
or, similar as above:
zk+1 = (I − αA)zk + β(zk − zk−1 ).
This can be rewritten as a two-component recursion:
\[
\begin{pmatrix} z_{k+1} \\ z_k \end{pmatrix} = \begin{pmatrix} (1+\beta)I - \alpha A & -\beta I \\ I & 0 \end{pmatrix} \begin{pmatrix} z_k \\ z_{k-1} \end{pmatrix}
\]

Spectral analysis
Consider a diagonalization of A:
\[
S^{-1} A S = D = \mathrm{diag}(d_1, \dots, d_n) \quad\Rightarrow\quad A = S D S^{-1} .
\]
Let us denote by d_1 = γ and d_n = Γ the smallest and largest eigenvalues, respectively.

Convergence of first order iteration. We obtain for the first order method:
zk+1 = S(I − αD)S −1 zk
and thus for v = S −1 z:
vk+1 = (I − αD)vk ,
so we obtain componentwise:
\[
v_{k+1}^i = (1 - \alpha t)\, v_k^i , \qquad t = d_i \in [\gamma, \Gamma]
\]
and hence:
\[
|v_{k+1}^i| \le \max_i |1 - \alpha d_i|\, |v_k^i| \le \max_{t \in [\gamma,\Gamma]} |1 - \alpha t|\, |v_k^i| .
\]

To obtain the fastest possible convergence, we have to figure out:


\[
\min_\alpha \max_{t \in [\gamma,\Gamma]} |1 - \alpha t| = \min_\alpha \max\{ |1 - \alpha\Gamma|, |1 - \alpha\gamma| \}
\]

This is straightforward to compute. Choose α such that both terms in the maximum agree:
\[
\alpha = \frac{2}{\gamma + \Gamma} \quad\Rightarrow\quad |1 - \alpha t| \le 1 - \frac{2\gamma}{\gamma + \Gamma} = \frac{\Gamma - \gamma}{\Gamma + \gamma} = \frac{\kappa - 1}{\kappa + 1}
\]

Convergence of second order iteration. The same can be done for the second order iteration and
we get:
\[
\begin{pmatrix} v_{k+1} \\ v_k \end{pmatrix} = \begin{pmatrix} (1+\beta)I - \alpha D & -\beta I \\ I & 0 \end{pmatrix} \begin{pmatrix} v_k \\ v_{k-1} \end{pmatrix}
\]
and thus similarly:
\[
\begin{pmatrix} v_{k+1}^i \\ v_k^i \end{pmatrix} = \begin{pmatrix} 1 + \beta - \alpha t & -\beta \\ 1 & 0 \end{pmatrix} \begin{pmatrix} v_k^i \\ v_{k-1}^i \end{pmatrix} , \qquad t = d_i .
\]
To proceed, we have to compute the eigenvalues of the above matrix. The characteristic equation reads:
0 = λ(λ − (1 + β − αt)) + β
with eigenvalues:
\[
\lambda_{1,2} = \frac{(1 + \beta - \alpha t) \pm \sqrt{(1 + \beta - \alpha t)^2 - 4\beta}}{2}
\]
As above, we look for conjugate complex eigenvalues, which occur, if
0 ≥ (1 + β − αt)2 − 4β.
The right hand side is zero, if
\[
\alpha t = (\sqrt{\beta} - 1)^2 \quad\Rightarrow\quad \sqrt{\beta} - 1 = \pm\sqrt{\alpha t} \quad\Rightarrow\quad \beta = (1 \pm \sqrt{\alpha t})^2 .
\]
Thus, we obtain conjugate complex eigenvalues, iff
\[
\beta \in \big[(1 - \sqrt{\alpha t})^2 ,\ (1 + \sqrt{\alpha t})^2\big] \qquad \forall t \in [\gamma, \Gamma] ,
\]
which is implied by
\[
\beta := \max\{ (1 - \sqrt{\alpha\Gamma})^2 ,\ (1 - \sqrt{\alpha\gamma})^2 \} .
\]
In this case
\[
|\lambda_1| = |\lambda_2| = \sqrt{\beta} = \max\{ |1 - \sqrt{\alpha\Gamma}| ,\ |1 - \sqrt{\alpha\gamma}| \} .
\]
For a minimal choice of β we have to compute:
\[
\min_\alpha \max\{ |1 - \sqrt{\alpha\Gamma}| ,\ |1 - \sqrt{\alpha\gamma}| \}
\]

Again, we balance the two terms in the maximum by choosing:
\[
\alpha = \left( \frac{2}{\sqrt{\Gamma} + \sqrt{\gamma}} \right)^2 \quad\Rightarrow\quad \sqrt{\beta} = |\lambda_1| \le 1 - \frac{2\sqrt{\gamma}}{\sqrt{\Gamma} + \sqrt{\gamma}} = \frac{\sqrt{\Gamma} - \sqrt{\gamma}}{\sqrt{\Gamma} + \sqrt{\gamma}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} .
\]

Comparison:
\begin{align*}
\text{First order:} \quad & \alpha = \frac{2}{\gamma + \Gamma} , & \beta &= 0 , & \Theta &\sim \frac{\kappa - 1}{\kappa + 1} \\
\text{Second order:} \quad & \alpha = \left( \frac{2}{\sqrt{\gamma} + \sqrt{\Gamma}} \right)^2 , & \beta &= \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2 , & \Theta &\sim \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}
\end{align*}

The difficulty is now, of course, to choose α and β adequately. While Γ can be estimated quite conveniently
a posteriori, γ is rather difficult to estimate but critical. In practice, trial and error is often
used.
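As an illustration, the following hypothetical experiment applies the heavy_ball sketch from Section 5.1 to an ill-conditioned diagonal quadratic, once with the first order and once with the second order parameter choice; all names and values here are our own.

```python
import numpy as np

# Ill-conditioned diagonal quadratic: grad q(w) = D w, minimizer w* = 0.
gamma, Gamma = 1.0, 1.0e4
d = np.logspace(0, 4, 100)          # eigenvalues of D in [gamma, Gamma]
grad = lambda w: d * w
w0 = np.ones_like(d)
kappa = Gamma / gamma

# First order choice: alpha = 2/(gamma + Gamma), beta = 0.
w_first = heavy_ball(grad, w0, 2.0 / (gamma + Gamma), 0.0, iters=2000)

# Second order choice: alpha = (2/(sqrt(gamma)+sqrt(Gamma)))^2,
# beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2.
alpha = (2.0 / (np.sqrt(gamma) + np.sqrt(Gamma))) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
w_second = heavy_ball(grad, w0, alpha, beta, iters=2000)

# The momentum variant is far closer to w* after the same number of steps.
print(np.linalg.norm(w_first), np.linalg.norm(w_second))
```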

5.3 Interpretation as a gradient aggregation method


Coming back to our original formula and keeping α and β fixed, we have:

wk+1 − wk = −α∇F (wk ) + β(wk − wk−1 ).

Using w1 = w0 − α∇F (w0 ) we can compute by induction:

\[
w_{k+1} - w_k = -\alpha \sum_{j=0}^{k} \beta^{k-j} \nabla F(w_j)
\]

We observe a weighted sum of gradients. In the case of stochastic gradients, this might yield a reduction
of variance, similar to gradient aggregation. However, a theoretical verification of this fact seems to be
missing so far.
In practice, methods with momentum are quite popular due to their simplicity and the improvement
they yield, compared to stochastic gradients.

A subtle distinction for varying preconditioners

We have seen that preconditioning (even if only diagonal) is an important technique to accelerate
optimization. Up to now, it did not matter if preconditioners M_k changed during the course of the
optimization, since classical gradient methods do not have a “memory”. This is different for methods
with momentum. Then the above equations read as follows (naming ∇_{M_k} F(w_k) := M_k^{-1} F'(w_k)):

\[
w_{k+1} - w_k = -\alpha \sum_{j=0}^{k} \beta^{k-j} M_j^{-1} F'(w_j) .
\]

This would be an exponentially weighted aggregation of gradients, which is equivalent to momentum
methods.
Alternatively, we could do an exponentially weighted aggregation of derivatives:
\[
w_{k+1} - w_k = -\alpha M_k^{-1} \Big( \sum_{j=0}^{k} \beta^{k-j} F'(w_j) \Big) .
\]

So we may write:
\[
w_{k+1} - w_k = -\alpha M_k^{-1} l_k \qquad\text{where}\qquad l_k = \beta\, l_{k-1} + F'(w_k) .
\]

Adam
The popular Adam algorithm chooses the latter variant. It employs for M_k the diagonal preconditioner
of RMSprop and exponentially weighted aggregation of derivatives.
Similarly to RMSprop, Adam uses a diagonal preconditioner via a diagonal matrix D_k, defined
recursively as follows:
\[
D_k = \lambda D_{k-1} + (1 - \lambda)\, \mathrm{diag}\big(F'(w_k)_i^2\big) , \qquad \lambda \in [0, 1] .
\]
Then for some µ > 0 the preconditioner is given by:
\[
M_k = \sqrt{D_k} + \mu I \qquad \text{(where the root is taken componentwise)}
\]

Given w_1
for k = 1, 2, . . .
    l_k = β l_{k−1} + (1 − β) F'(w_k)        (often a mini-batch gradient is used here)
    d_k = λ d_{k−1} + (1 − λ) F'(w_k)_i^2
    M_k = √(diag d_k) + µI                   (root taken componentwise)
    ∆w = −M_k^{−1} l_k
    w_{k+1} = w_k + α (1 − λ^k)/(1 − β^k) ∆w

The factors (1 − β^k) → 1 and (1 − λ^k) → 1 in the step size are used to compensate the so-called
“initialization bias”. This heuristic takes into account that the sums defining l_k and d_k are short in the
beginning and grow longer and longer.
The corresponding heuristic computation is as follows:
\[
l_k = (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} F'(w_i) = (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} F'(w_k) + \zeta = (1 - \beta^k) F'(w_k) + \zeta ,
\]

where
\[
\zeta = (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} \big( F'(w_i) - F'(w_k) \big) .
\]

Assuming that E[ζ] = 0, we obtain

E[lk ] = (1 − β k )E[F ′ (wk )]

The same computation can be done for dk .


In recent years, Adam has probably become the method that is chosen as a default.
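A compact Python sketch of this scheme in the notation of the notes (β for derivative aggregation, λ for the diagonal preconditioner, µ as damping); the function name and interface are ours, not those of a library.

```python
import numpy as np

def adam(grad_F, w, alpha=1e-3, beta=0.9, lam=0.999, mu=1e-8, iters=1000):
    """Adam as sketched above; grad_F is typically a mini-batch gradient."""
    w = np.asarray(w, dtype=float)
    l = np.zeros_like(w)          # aggregated derivatives l_k
    d = np.zeros_like(w)          # aggregated squared derivatives d_k
    for k in range(1, iters + 1):
        g = grad_F(w)
        l = beta * l + (1.0 - beta) * g
        d = lam * d + (1.0 - lam) * g ** 2
        M = np.sqrt(d) + mu       # diagonal preconditioner, componentwise root
        # bias correction for the short sums in the beginning; note that the
        # original Adam paper applies a square root to the factor (1 - lam**k)
        w = w + alpha * (1.0 - lam ** k) / (1.0 - beta ** k) * (-l / M)
    return w
```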
Chapter 6

Non-smooth regularization terms

In the following we consider the minimization of problems of the form:
\[
\min_w F(w) := f(w) + \rho(w)
\]
where f is differentiable and ρ is convex and “easy to handle”. By this we mean the following:

For any f'(w) and L > 0 the local minimization problem (which has a unique minimizer)
\[
\min_{\delta w}\ p(\delta w) := f(w) + f'(w)\delta w + \rho(w + \delta w) + \frac{L}{2}\|\delta w\|^2 \tag{6.1}
\]
can be solved in an efficient way. Here the norm is a (maybe diagonally scaled) 2-norm.

This step is called a proximal gradient step.


Lemma 6.0.1.
• Assume that w∗ is a minimizer of F . Then δw = 0 is a minimizer of p at w∗ .
• Let f be convex and assume that δw = 0 is a minimizer of p at w∗ . Then w∗ minimizes F .
Proof. Since p is convex, we may compute for a minimizer δw ≠ 0 of p, with c := p(0) − p(δw) > 0:
\[
p(t\delta w) - p(0) = p\big((1-t)\cdot 0 + t\,\delta w\big) - p(0) \le (1-t)\,p(0) + t\,p(\delta w) - p(0) = t\big(p(\delta w) - p(0)\big) = -tc .
\]
Moreover:
\[
p(t\delta w) - p(0) = F(w_* + t\delta w) - F(w_*) + o(t) + \frac{L}{2}\|t\delta w\|^2 \ge -o(t) .
\]
For sufficiently small t this yields a contradiction to δw being a minimizer.
Consider the converse. Since f is convex, f ′ (w∗ )δw ≤ f (w∗ + δw) − f (w∗ ). Hence, with w = w∗ + δw:

F (w) − F (w∗ ) = f (w) + ρ(w) − f (w∗ ) − ρ(w∗ ) ≥ f ′ (w∗ )δw + ρ(w∗ + δw) − ρ(w∗ )

Since δw = 0 minimizes p, we have:
\[
f'(w_*)\delta w + \rho(w_* + \delta w) + \frac{L}{2}\|\delta w\|^2 \ge \rho(w_*)
\]
and thus:
\[
F(w_* + \delta w) \ge F(w_*) - \frac{L}{2}\|\delta w\|^2 \qquad \forall \delta w .
\]
2
Assume for contradiction that m := F (w∗ ) − F (w∗ + δw) > 0. Then

F (w∗ + tδw) = F ((1 − t)w∗ + t(w∗ + δw)) ≤ (1 − t)F (w∗ ) + tF (w∗ + δw) = F (w∗ ) − tm


On the other hand:
\[
F(w_* + t\delta w) \ge F(w_*) - \frac{L}{2}\|t\delta w\|^2 = F(w_*) - t\left[ \frac{Lt}{2}\|\delta w\|^2 \right] .
\]
The term in square brackets tends to 0 for t → 0, so that both inequalities contradict each other, if t is
sufficiently small.

6.1 Examples for non-smooth terms


Example: Positivity constraint
For example, we may impose positivity constraints on our variables. This can be achieved by:
\[
\rho(w) := \begin{cases} 0 & : w_i \ge 0 \ \forall i \\ \infty & : \text{otherwise} \end{cases}
\]

In this case (6.1) can be solved by the following componentwise considerations. Let δw be the minimizer
of p.
i) Assume that (w + δw)i > 0. Then ρ vanishes, and we have

0 = p′ (δw)i = f ′ (w)i + Lδwi

ii) Assume that (w + δw)i = 0. Then we only have

0 ≤ f ′ (w)i + Lδwi

Choosing a projected gradient step
\[
\delta w_i := \max\left\{ -w_i ,\ -\frac{1}{L} f'(w)_i \right\}
\]

(which can be computed very efficiently) satisfies these properties in both cases:
• In the case δwi = −wi , it follows that Lδwi ≥ −f ′ (w)i and thus ii) is satisfied
• Otherwise, δwi > −wi , so wi + δwi > 0 and f ′ (w)i + Lδwi = 0. So i) is fulfilled.
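In code, this projected gradient step is essentially a one-liner; a NumPy sketch (names are ours):

```python
import numpy as np

def prox_positivity(w, grad, L):
    """Proximal step for the positivity constraint:
    delta_w_i = max(-w_i, -grad_i / L), so that w + delta_w >= 0 holds."""
    return np.maximum(-w, -grad / L)
```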

Example: 1-Norm
\[
\rho(w) := \lambda \sum_i |w_i| , \qquad \lambda > 0
\]

This regularization term is often used together with a quadratic regression functional:
\[
\min_w \|Aw - d\|_2^2 + \rho(w) ,
\]
the so-called Lasso problem. Depending on the choice of λ, the solution w_* of this problem exhibits many
zeros.
To compute the proximal gradient step, let again δw be the minimizer of p. Then we have the cases:
i) Let (w + δw)i > 0. Then ρ′ (w + δw)i = λ

0 = p′ (δw)i = f ′ (w)i + λ + Lδwi

ii) Let (w + δw)i < 0. Then ρ′ (w + δw)i = −λ

0 = p′ (δw)i = f ′ (w)i − λ + Lδwi



iii) Assume that (w + δw)_i = 0. Then the subdifferential ∂ρ(w + δw)_i = [−λ, λ] and we have
\[
0 \in f'(w)_i + [-\lambda, \lambda] + L\,\delta w_i
\]

To satisfy these constraints we have to choose:
\[
\delta w_i = \begin{cases} -\frac{1}{L}\big(f'(w)_i + \lambda\big) & : f'(w)_i + \lambda < L w_i \\[2pt] -\frac{1}{L}\big(f'(w)_i - \lambda\big) & : f'(w)_i - \lambda > L w_i \\[2pt] -w_i & : \text{otherwise} \end{cases}
\]
This specific step is called a “shrinkage step”.


We observe that each line solves the corresponding case, listed above. For example, our first choice
yields 0 = f ′ (w)i + λ + Lδwi , and thus −Lδwi < Lwi , implying 0 < wi + δwi .
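The shrinkage step can be vectorized in the same way; again a sketch with our own naming:

```python
import numpy as np

def prox_l1(w, grad, L, lam):
    """Shrinkage step for rho(w) = lam * sum_i |w_i|, following the case
    distinction above; returns delta_w."""
    return np.where(grad + lam < L * w, -(grad + lam) / L,
           np.where(grad - lam > L * w, -(grad - lam) / L, -w))
```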

6.2 Convergence of proximal gradients


In the following, we say that a convex function F is σ-convex, if:
\[
F(w) \ge F(w_0) + \ell(w - w_0) + \frac{\sigma}{2}\|w - w_0\|^2 ,
\]
where ℓ is any linear function such that F(w) ≥ F(w_0) + ℓ(w − w_0) for all w. If F is differentiable, we
may use ℓ = F'(w_0).
If we add to F a quadratic term (ν/2)∥w − w_0∥², it becomes (σ + ν)-convex; as long as ν ≥ −σ,
the result is still convex. If w_* is a minimizer of F, then:
\[
F(w) \ge F(w_*) + \frac{\sigma}{2}\|w - w_*\|^2 .
\]
Lemma 6.2.1. Assume that F is σ-convex. Then
\[
F((1-\lambda)x + \lambda y) \le (1-\lambda)F(x) + \lambda F(y) - \frac{\sigma}{2}\lambda(1-\lambda)\|x - y\|^2 .
\]
Proof. Since g(x) := F(x) − (σ/2)∥x − y∥² is convex and (1 − λ)x + λy − y = (1 − λ)(x − y), we may compute:
\begin{align*}
F((1-\lambda)x + \lambda y) - \frac{\sigma}{2}(1-\lambda)^2\|x - y\|^2 &= g((1-\lambda)x + \lambda y) \\
&\le (1-\lambda)g(x) + \lambda g(y) = (1-\lambda)F(x) + \lambda F(y) - \frac{\sigma}{2}(1-\lambda)\|x - y\|^2 .
\end{align*}
This implies our assertion.
The convergence theory of proximal gradient methods hinges on the following fundamental lemma:
Lemma 6.2.2. Assume that for given L, δw has been computed at a point w_0 as a proximal gradient step,
such that with w_+ = w_0 + δw:
\[
f(w_+) \le f(w_0) + f'(w_0)\delta w + \frac{L}{2}\|\delta w\|^2 \tag{6.2}
\]
then we obtain:
\[
F(w) - F(w_+) \ge \frac{L}{2}\big( \|w - w_+\|^2 - \|w - w_0\|^2 \big) + \delta f(w, w_0) \qquad \forall w
\]
with δf(w, w_0) = f(w) − f(w_0) − f'(w_0)(w − w_0). If f is σ-convex, we get
\[
F(w) - F(w_+) \ge \frac{L}{2}\|w - w_+\|^2 - \frac{L - \sigma}{2}\|w - w_0\|^2 \qquad \forall w .
\]
In particular we have the decrease condition:
\[
F(w_+) \le F(w_0) - \frac{L}{2}\|w_+ - w_0\|^2 .
\]

Proof. By definition, δw minimizes:
\[
p(\delta w) = f(w_0) + f'(w_0)\delta w + \rho(w_0 + \delta w) + \frac{L}{2}\|\delta w\|^2 .
\]
Since p is L-strongly convex, this implies for any w:
\[
p(w - w_0) - p(\delta w) \ge \frac{L}{2}\|w - w_0 - \delta w\|^2 = \frac{L}{2}\|w - w_+\|^2 .
\]
Then, by assumption:
\[
F(w_+) = f(w_+) + \rho(w_+) \le f(w_0) + f'(w_0)\delta w + \frac{L}{2}\|\delta w\|^2 + \rho(w_+) = p(\delta w) .
\]
So we obtain:
\[
p(w - w_0) - F(w_+) \ge \frac{L}{2}\|w - w_+\|^2 .
\]
Inserting the definition of p(w − w_0) we obtain:
\begin{align*}
f(w_0) + f'(w_0)(w - w_0) + \rho(w) + \frac{L}{2}\|w - w_0\|^2 - F(w_+) &\ge \frac{L}{2}\|w - w_+\|^2 \\
f(w) + \rho(w) + \frac{L}{2}\|w - w_0\|^2 - F(w_+) &\ge \frac{L}{2}\|w - w_+\|^2 + f(w) - f(w_0) - f'(w_0)(w - w_0) ,
\end{align*}
and hence the desired result.

Given: w_1, L
for k = 1, 2, . . .
    do
        increase L
        δw ← minimizer of p at w_k with parameter L
    while (6.2) is violated
    w_{k+1} = w_k + δw
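A Python sketch of this loop (names and interface are ours; prox_step stands for one of the proximal steps from Section 6.1, e.g. prox_positivity, or prox_l1 with λ fixed):

```python
import numpy as np

def proximal_gradient(f, grad_f, prox_step, w, L=1.0, iters=100):
    """Proximal gradient method with backtracking on L, enforcing (6.2)."""
    w = np.asarray(w, dtype=float)
    for _ in range(iters):
        g = grad_f(w)
        dw = prox_step(w, g, L)
        # increase L until the decrease condition (6.2) holds
        while f(w + dw) > f(w) + g @ dw + 0.5 * L * (dw @ dw):
            L *= 2.0
            dw = prox_step(w, g, L)
        w = w + dw
    return w

# e.g. for the Lasso problem:
# proximal_gradient(f, grad_f, lambda w, g, L: prox_l1(w, g, L, lam=0.1), w0)
```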
Theorem 6.2.3. Assume that f is σ-convex, and assume that L is chosen in a way that (6.2) is fulfilled.
Then:
\begin{align*}
\|w_{k+1} - w_*\|^2 &\le \left( 1 - \frac{\sigma}{L} \right) \|w_k - w_*\|^2 \\
F(w_{k+1}) - F(w_*) &\le \frac{L}{2}\|w_{k+1} - w_*\|^2
\end{align*}
If an iterative algorithm keeps L_k bounded, say by L, then we obtain linear convergence:
\begin{align*}
\|w_k - w_*\|^2 &\le \left( 1 - \frac{\sigma}{L} \right)^k \|w_0 - w_*\|^2 \\
F(w_k) - F(w_*) &\le \frac{L}{2} \left( 1 - \frac{\sigma}{L} \right)^k \|w_0 - w_*\|^2
\end{align*}
Proof. Inserting w_* into our main estimate, we obtain:
\[
0 \ge F(w_*) - F(w_{k+1}) \ge \frac{L}{2}\|w_* - w_{k+1}\|^2 - \frac{L - \sigma}{2}\|w_* - w_k\|^2 \tag{6.3}
\]
Thus, we obtain immediately our first result:
\[
\frac{L}{2}\|w_* - w_{k+1}\|^2 \le \frac{L - \sigma}{2}\|w_* - w_k\|^2 .
\]
Our second result follows by neglecting the negative term in (6.3).

We know that (6.2) holds, if L is chosen as the Lipschitz constant of the gradient. In that case, we
may again define:
\[
\kappa := \frac{L}{\sigma}
\]
and retain our well-known convergence factor, but now w.r.t. errors:
\[
\|w_{k+1} - w_*\|^2 \le \left( 1 - \frac{1}{\kappa} \right) \|w_k - w_*\|^2 .
\]

6.3 Nesterov Acceleration
Here we present an alternative to momentum methods, which works in the non-smooth case and admits
a nice convergence result.

Given: w_1, y_1 = w_1, L
for k = 1, 2, . . .
    do
        increase L
        δw ← minimizer of p at y_k with parameter L
    while (6.2) is violated
    w_{k+1} = y_k + δw
    y_{k+1} = w_{k+1} + β_{k+1} δw

The difference to a momentum-based method is small; a code sketch is given below.
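A Python sketch of the accelerated loop with fixed β (names are ours; prox_step as in the previous listing):

```python
import numpy as np

def accelerated_prox_gradient(f, grad_f, prox_step, w, beta, L=1.0, iters=100):
    """Nesterov-accelerated proximal gradient method (sketch, fixed beta)."""
    w = np.asarray(w, dtype=float)
    y = w.copy()
    for _ in range(iters):
        g = grad_f(y)
        dw = prox_step(y, g, L)
        while f(y + dw) > f(y) + g @ dw + 0.5 * L * (dw @ dw):
            L *= 2.0                    # backtracking: enforce (6.2) at y
            dw = prox_step(y, g, L)
        w = y + dw                      # w_{k+1} = y_k + delta_w
        y = w + beta * dw               # y_{k+1} = w_{k+1} + beta * delta_w
    return w
```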


The general inequality
\[
F(w) - F(w_+) \ge \frac{L}{2}\|w - w_+\|^2 - \frac{L - \sigma}{2}\|w - w_0\|^2 \qquad \forall w ,
\]
inserting y_0 ∼ y_k for w_0 (with w_0 ∼ w_k and w_+ = w_{k+1}), can be rewritten as follows:
\[
F(w_+) - F(w_*) + \frac{L}{2}\|w - w_+\|^2 \le F(w) - F(w_*) + \frac{L - \sigma}{2}\|w - y_0\|^2 \qquad \forall w .
\]
Setting
\[
E(w) := \frac{2}{L}\big( F(w) - F(w_*) \big) \qquad\text{and}\qquad q := \frac{1}{\kappa} = \frac{\sigma}{L}
\]
we get:
\[
E(w_+) + \|w - w_+\|^2 \le E(w) + (1 - q)\|w - y_0\|^2 \qquad \forall w .
\]
Now we choose
\[
w := w_0^\lambda := \lambda w_* + (1 - \lambda) w_0
\]
and write
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le E(w_0^\lambda) + (1 - q)\|w_0^\lambda - y_0\|^2 . \tag{6.4}
\]

Our aim is to choose λ (and then β) to achieve
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le \Theta\big( E(w_0) + \|w_-^\lambda - w_0\|^2 \big)
\]
where 0 < Θ < 1 is as small as possible. Here w_-^λ = λw_* + (1 − λ)w_{k−1}.

Lemma 6.3.1. Setting λ = √q and β = (1 − λ)/(1 + λ) we obtain
\[
\Theta = 1 - \lambda = 1 - \sqrt{q} ,
\]
if k > 1.

Proof. We start with our estimate and observe that by our definition (1 − q) = (1 + λ)(1 − λ):
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le E(w_0^\lambda) + (1 - \lambda)(1 + \lambda)\|w_0^\lambda - y_0\|^2 .
\]
By construction E is 2q-convex with E(w_*) = 0, so by Lemma 6.2.1 we obtain:
\[
E(w_0^\lambda) \le (1 - \lambda)E(w_0) - q\lambda(1 - \lambda)\|w_* - w_0\|^2 ,
\]
and thus the overall estimate:
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le (1 - \lambda)\Big( E(w_0) + (1 + \lambda)\|w_0^\lambda - y_0\|^2 - \lambda^3\|w_0 - w_*\|^2 \Big) . \tag{6.5}
\]
Next, we split:
\[
\|w_0^\lambda - y_0\|^2 = \|\lambda(w_* - w_0) + (w_0 - y_0)\|^2 = \lambda^2\|w_* - w_0\|^2 + 2\lambda\langle w_* - w_0, w_0 - y_0\rangle + \|w_0 - y_0\|^2
\]
to obtain, using (1 + λ)λ² − λ³ = λ², y_0 − w_0 = β(w_0 − w_-) and (1 + λ)β = (1 − λ):
\begin{align*}
&(1 + \lambda)\|w_0^\lambda - y_0\|^2 - \lambda^3\|w_0 - w_*\|^2 \\
&\quad= \lambda^2\|w_* - w_0\|^2 + 2(1 + \lambda)\lambda\langle w_* - w_0, w_0 - y_0\rangle + (1 + \lambda)\|w_0 - y_0\|^2 \\
&\quad\le \lambda^2\|w_* - w_0\|^2 + 2(1 + \lambda)\lambda\beta\langle w_* - w_0, w_- - w_0\rangle + (1 + \lambda)^2\beta^2\|w_- - w_0\|^2 \\
&\quad= \lambda^2\|w_* - w_0\|^2 + 2(1 - \lambda)\lambda\langle w_* - w_0, w_- - w_0\rangle + (1 - \lambda)^2\|w_- - w_0\|^2 \\
&\quad= \|\lambda(w_* - w_0) + (1 - \lambda)(w_- - w_0)\|^2 = \|w_-^\lambda - w_0\|^2 .
\end{align*}
Hence, as desired:
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le (1 - \lambda)\big( E(w_0) + \|w_-^\lambda - w_0\|^2 \big) .
\]

Theorem 6.3.2. Choosing β = (√κ − 1)/(√κ + 1) we obtain the speed of convergence:
\[
F(w_{k+1}) - F(w_*) \le \left( 1 - \frac{1}{\sqrt{\kappa}} \right)^k \left( F(w_0) - F(w_*) + \frac{\sigma}{2}\|w_1 - w_*\|^2 \right) .
\]
Proof. To obtain our desired estimate, we have to consider k = 1, where y_0 = w_0 = w_1. This yields
\[
w_0^\lambda - y_0 = \lambda w_* + (1 - \lambda)w_0 - w_0 = \lambda(w_* - w_0) .
\]
With (6.5) and λ² = σ/L we get:
\[
E(w_2) + \|w_1^\lambda - w_2\|^2 \le (1 - \lambda)\big( E(w_1) + (1 + \lambda)\lambda^2\|w_* - w_1\|^2 - \lambda^3\|w_* - w_1\|^2 \big) = (1 - \lambda)\Big( E(w_1) + \frac{\sigma}{L}\|w_* - w_1\|^2 \Big) .
\]
With Lemma 6.3.1, induction and rescaling we obtain the desired result.
Unfortunately, it is the same as with momentum methods here: σ and thus κ are not known a priori, so it
is hard to find the right value of β. If σ is underestimated (β too large), we obtain a suboptimal rate
of convergence; if σ is overestimated (β too small), our result no longer applies, which means that the method
might not converge.

Restarts
A practical way to dispense with an estimate of σ is to use restarts:
• Choose β = 1 during the iteration.
• If a certain condition is fulfilled, set β_{k+1} = 0 once. This has the same effect as restarting the
iteration.
• Possible criteria (a code sketch follows after this list):
    f(w_{k+1}) > f(w_k)          restart at ascent, 0th order
    ⟨δw, y_k − w_k⟩ < 0          restart at ascent, 1st order
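Building on the accelerated sketch above, the first order criterion amounts to a simple restart test before the step is taken (fixed L for brevity; names are ours):

```python
import numpy as np

def apg_with_restart(grad_f, prox_step, w, beta, L=1.0, iters=100):
    """Accelerated loop from above, extended by the 1st order restart test."""
    w = np.asarray(w, dtype=float)
    y = w.copy()
    for _ in range(iters):
        dw = prox_step(y, grad_f(y), L)
        if dw @ (y - w) < 0:        # step opposes the momentum direction
            y = w.copy()            # restart: same effect as beta_{k+1} = 0
            dw = prox_step(y, grad_f(y), L)
        w = y + dw
        y = w + beta * dw
    return w
```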

Acceleration for (not strictly) convex functions

If f is not strictly convex, one can still show a convergence result for the ordinary proximal gradient
method:
\[
F(w_k) - F(w_*) \sim \frac{1}{k}
\]
In that case the following acceleration sequence can be used:
\[
t_0 = 1 , \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2} , \qquad \beta_{k+1} = \frac{t_k - 1}{t_{k+1}} .
\]

This yields, using a similar technique of proof as above:
\[
F(w_k) - F(w_*) \sim \frac{1}{k^2} .
\]
Also here, restarts can be beneficial.
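A sketch of this β_k schedule (this is the recursion known from the FISTA method; the function name is ours):

```python
import numpy as np

def beta_schedule(iters):
    """Momentum factors for the not strictly convex case:
    t_0 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2, beta_{k+1} = (t_k - 1)/t_{k+1}."""
    t, betas = 1.0, []
    for _ in range(iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        betas.append((t - 1.0) / t_next)
        t = t_next
    return betas
```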
