Optimization for Machine Learning

Anton Schiela

May 17, 2024


Contents

1 Optimization problems in ML
  1.1 A general fitting problem
  1.2 Some loss-functions
  1.3 Some regularizers
  1.4 Some models
  1.5 Statistical interpretation

2 Stochastic gradient methods
  2.1 A short review of gradient descent
  2.2 Stochastic gradients
  2.3 Convergence results
  2.4 Comparison of complexity: stochastic vs. classical gradient method
  2.5 Computation of derivatives and gradients

3 Techniques for variance reduction
  3.1 Dynamic sample size
  3.2 Gradient aggregation

4 Techniques for preconditioning
  4.1 The effect of preconditioning
  4.2 Diagonal scaling methods
  4.3 CG as a matrix-free solver
  4.4 Inexact Newton methods for Neural Networks
  4.5 Generalized Gauss-Newton methods
  4.6 K-FAC Preconditioner

5 Momentum
  5.1 The dynamics of the heavy-ball method
  5.2 First and second order fixed point iterations
  5.3 Interpretation as a gradient aggregation method

6 Non-smooth regularization terms
  6.1 Examples for non-smooth terms
  6.2 Convergence of proximal gradients
  6.3 Nesterov acceleration

Preface
These lecture notes were written for a lecture (2+1) held in WS 19/20 at the University of Bayreuth. They cover a special aspect of machine learning, namely the optimization methods that can be used for the problems arising in this field. The main source is the following review article:

Bottou, L., Curtis, F., Nocedal, J.: "Optimization methods for large-scale machine learning", SIAM Review, 60, 2018

Concerning proximal gradient methods:

Beck, A.: "First-order methods in optimization", MOS-SIAM Series on Optimization, 2017

For neural networks in general:

Goodfellow, I., Bengio, Y., Courville, A.: "Deep learning", 2016

In addition I would like to thank Prof. Stefan Ulbrich and Prof. Marc Pfetsch from TU Darmstadt, and Prof. Roland Herzog from TU Chemnitz for sharing their own lecture notes with me. This was a great starting point for me.
Chapter 1

Optimization problems in ML

The principal ideas behind machine learning are not new. From a mathematical point of view, one could
also use the word data fitting.
In general the following quantities are involved:
• given input data x_i ∈ X, i = 1 . . . N,
• given output data y_i ∈ Y, i = 1 . . . N,
• unknown model parameters w ∈ W.
Quite often X is a high-dimensional space (e.g. of images, sound samples, . . . ), and also W is high-dimensional. Also N, the number of examples to learn from, can be very large.
We would like to find w and an input-to-output mapping h(·, w) : X → Y that can predict for given
input x ∈ X a matching output y ∈ Y .
Example 1.0.1 (Polynomial least squares fit). The simplest (and well known) example, with X = Y = ℝ and W = ℝ^{d+1}, is a polynomial least squares fit (linear regression):
    min_w ∑_{i=1}^N ½ |h(x_i; w) − y_i|²,   where h(x; w) := ∑_{j=0}^d w_j x^j.

If we define the matrix A ∈ ℝ^{N×(d+1)} via a_{ij} = x_i^j, we can write this as a classical least-squares problem:

    min_w ∑_{i=1}^N ½ |(Aw)_i − y_i|² = ½ ∥Aw − y∥₂².

It can be solved via QR-decomposition or the normal equation approach. The latter yields the following linear system of equations:

    AᵀAw = Aᵀy.

• If A is injective (over-determined case), then our problem has a unique solution, but Aw ≠ y in general.
• If A is surjective (under-determined case), then Aw = y, but w may be non-unique.
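As a small illustration, a minimal numpy sketch of this polynomial fit via the normal equations (the degree and the data are made up for the example; np.linalg.lstsq would be the QR/SVD-based alternative):

import numpy as np

# made-up sample data (x_i, y_i), i = 1..N
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x**3 - x + 0.1 * rng.normal(size=50)

d = 3                                     # polynomial degree
A = np.vander(x, d + 1, increasing=True)  # a_ij = x_i^j, A in R^{N x (d+1)}

# normal equations: A^T A w = A^T y
w = np.linalg.solve(A.T @ A, A.T @ y)
# np.linalg.lstsq(A, y, rcond=None) solves the same problem via QR/SVD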

1.1 A general fitting problem


More generally, we may consider the minimization problem:
    min_w (1/N) ∑_{i=1}^N l(h(x_i, w), y_i) + ρ(w).   (1.1)

• h : X × W → Y is a parametrized model
• l : Y × Y → R is called a loss-function
• ρ : W → R is called a regularization term
The choices of h, l, and ρ are the main modelling ingredients of machine learning. The approximate
solution of this problem by an optimization algorithm is called learning. In this lecture we will consider
popular optimization algorithms for the solution of (1.1).
The following challenges arise:
• Often N is very large, making the evaluation of the full sum inefficient
• Often X and W have high dimension and/or h is complicated
• Often ρ is non-smooth (see below)
In short notation we will also write this problem as follows:

    min_w (1/N) ∑_{i=1}^N f_i(w) + ρ(w),   f_i(w) = l(h(x_i, w), y_i).   (1.2)

1.2 Some loss-functions


In the following we discuss some loss-functions that are used regularly for various purposes. In particular
we consider:
• Regression (predicting a real or vector valued result)
• Classification (in two or more classes)

Scalar loss functions


If Y = R the following loss functions are popular:
• linear regression: l(h, y) = ½(h − y)². This yields a least squares fit.
• hinge loss: l(h, y) = max(0, 1 − hy). This function is often used for classification. For given input x_i the desired output is chosen as y_i ∈ {−1, 1}. A new input x is classified according to the sign of h(x; w).
If h and y have different signs, then l ≥ 1, otherwise l < 1. If hy > 1, i.e., the classification is clearly correct, then l(h, y) = 0.
• Often a smoothed version l(h, y) = log(1 + exp(−hy)) with similar properties is used instead.

Vectorial loss-functions
If Y = Rn is a vector space, some generalizations may be as follows:
• l(h, y) = ½ ∥h − y∥₂²

For classification among n > 2 classes it is popular to use the following procedure:

• Assume that x is a member of class j ∈ {1, . . . , n}. Then y = e_j, the j-th unit vector.
• The so-called soft-argmax operator sam : Y → ]0, 1[ⁿ ⊂ Y is defined as:

    sam(y) = exp(y)/∥exp(y)∥₁ = exp(y) / ∑_{i=1}^n exp(y_i)   (exp is applied componentwise to a vector)

This is a smoothed version of the arg max operator:

    arg max(y) = e_j   if y_j > y_i for all i ≠ j.

• With this, for y = e_j one could define the loss functional as follows:

    l(h, e_j) = ∥sam(h) − e_j∥₂²,

which, however, has the disadvantage of very flat regions if h and e_j are far apart. Instead one uses:

    l(h, e_j) = −e_jᵀ ln(sam(h)) = ln(∑_{i=1}^n exp h_i) − h_j.

Asymptotically this functional becomes flat if h_j dominates h and increases linearly if h_j is negative.
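A minimal numpy sketch of this loss in the numerically stable form ln(∑_i exp h_i) − h_j (the log-sum-exp is shifted by max h to avoid overflow; the function names are ours):

import numpy as np

def softargmax(h):
    # sam(h) = exp(h) / sum_i exp(h_i), shifted by max(h) for numerical stability
    e = np.exp(h - np.max(h))
    return e / e.sum()

def cross_entropy(h, j):
    # l(h, e_j) = ln(sum_i exp(h_i)) - h_j, via a stable log-sum-exp
    m = np.max(h)
    return m + np.log(np.sum(np.exp(h - m))) - h[j]

h = np.array([2.0, -1.0, 0.5])
print(softargmax(h), cross_entropy(h, j=0))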

Week 2

1.3 Some regularizers


Some very simple but popular regularizers are as follows:

    ρ(w) = (λ/2) ∥w∥₂² = (λ/2) ∑_i w_i²

and

    ρ(w) = λ ∥w∥₁ = λ ∑_i |w_i|.

The latter often produces sparse parameters, meaning that w_i ≠ 0 only for a few i. This may be helpful if each w_i has a specific interpretation as a weight for a feature.
The regularizers enforce uniqueness of solutions and encourage certain desired properties of the parameters. The hyperparameter λ can be used to weight between regularity and loss-minimization.

1.4 Some models


Linear models
An important distinction is whether a model is linear or nonlinear in w:

    linear:  h(x; w) = A(x)w,

with its simplest scalar example:

    h(x; w) = (x, 1)ᵀ w = ∑_{i=1}^d x_i w_i + w_0,

or, with a polynomial for x ∈ ℝ:

    h(x; w) = p(x)w = ∑_{i=0}^d x^i w_i.

This may, of course, be extended to vector valued x (yielding a more complicated notation):

    h(x; w) = p(x)w = ∑_{i=0}^d x_{j_i}^{k_i} w_i.

Linear models, combined with convex loss functions and convex regularizers, yield convex optimization problems.

Example 1.4.1 (Support Vector Machine). A well known model that is comparatively simple is the
support vector machine, which is used for binary classification:
    min_w (1/N) ∑_{i=1}^N max(0, 1 − h_i y_i) + ρ(w)   s.t.  h_i = (x_i, 1)ᵀ w.

If ρ = 0 this problem can be rewritten as a linear program and solved as such:

    min_{w,t} (1/N) ∑_{i=1}^N t_i   s.t.  h_i = (x_i, 1)ᵀ w,
                                          t_i ≥ 1 − h_i y_i,
                                          t_i ≥ 0.

Often ρ(w) = λwᵀw is used, yielding a quadratic program.
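Alternatively, such a model can be trained directly by a (sub)gradient method on the non-smooth hinge loss. A minimal sketch with ρ(w) = λwᵀw (the data and step size are made up; the hinge loss is non-differentiable at hy = 1, so a subgradient is used):

import numpy as np

def svm_step(w, X, y, lam, alpha):
    # X: N x (d+1) inputs with appended 1-column, y in {-1,+1}^N
    h = X @ w
    active = (1.0 - h * y) > 0                               # positive hinge loss
    g = -(X[active].T @ y[active]) / len(y) + 2.0 * lam * w  # subgradient
    return w - alpha * g

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = np.sign(X[:, 0] + X[:, 1])
w = np.zeros(3)
for k in range(200):
    w = svm_step(w, X, y, lam=1e-2, alpha=0.1)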

Nested nonlinear models


Consider a nested nonlinear model, defined recursively as:

    h_k(x; w_k) := h̃_k(h_{k−1}(x; w_{k−1}); w̃_k),

where k = 1 . . . L, w_k = (w_{k−1}, w̃_k) and

    h̃_k(·; w̃) : Z_{k−1} → Z_k,   Z_0 = X,   w_0 = ∅,   h̃_0 = id_X.

We have the diagram:

    X —h̃_1(·;w̃_1)→ Z_1 —h̃_2(·;w̃_2)→ Z_2 → · · · → Z_{L−1} —h̃_L(·;w̃_L)→ Z_L = Y

Then we set:

    w = w_L,   h(x; w) = h_L(x; w_L).

• This is motivated by the idea that the input X is transformed to the output Y by a chain of transformations h̃_k.
• For derivative based optimization it is necessary to compute the derivative ∂_w h(x; w). This yields first order information on how the model changes under perturbations of the parameters. Using the chain rule we obtain the following recursive formula:

    ∂_{w_k} h_k(x; w_k) = ∂_{w̃_k} h̃_k(h_{k−1}(x; w_{k−1}); w̃_k) + ∂_x h̃_k(h_{k−1}(x; w_{k−1}); w̃_k) ∂_{w_{k−1}} h_{k−1}(x; w_{k−1}),

or in short

    ∂_w h_k = ∂_{w̃} h̃_k + ∂_x h̃_k ∂_w h_{k−1}.

Expanding this expression we obtain:

    ∂_w h = ∑_{i=1}^L ∂_x h̃_L · · · ∂_x h̃_{i+1} ∂_{w̃} h̃_i.

With this we can compute

    f_i′(w) = ∂_w f_i(w) = ∂_h l(h(x_i; w), y_i) ∂_w h(x_i; w).

Example 1.4.2 (A feed forward neural network). Neural networks are structured as described above. Each Z_k is called a layer and each h̃_k(z; w̃) is constructed as follows:

    h̃_k(z; w̃) = α(A(w̃)z + b(w̃))

with the affine transformation A(w̃)· + b(w̃) : Z_{k−1} → Z_k and the activation function α : Z_k → Z_k, which is usually defined in a componentwise way.

• Currently the most popular activation function is the ReLU (rectified linear unit):

    α(z) = max(0, z)   (applied componentwise)

This function is not differentiable at 0, which is regarded more as a theoretical than a practical problem. Of course smoothed versions could be used (but aren't in practice).
Earlier, sigmoid-type functions were used frequently (like arctan, arctanh, e^z/(1 + e^z)).

• Generically we may directly set

    (A, b) = (W̃, w̃_0),

leading, of course, to a large number of parameters, namely (dim Z_{l−1} + 1) · dim Z_l.
For specialized tasks we can reduce the number of parameters drastically. For example we may restrict A to be a tridiagonal or a sparse matrix.
Convolutional Neural Networks, popular in image processing, define the linear mapping via a convolution. A 1-d example could be:

    [A(w̃)z]_i = w̃ᵀ z_{i:i+c},   where w̃ ∈ ℝᶜ.
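A minimal numpy sketch of such a feed forward network (two made-up fully connected layers with ReLU activation; the last layer is left affine, as is common):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    # params: list of (W, b) per layer; each hidden layer computes alpha(W z + b)
    z = x
    for W, b in params[:-1]:
        z = relu(W @ z + b)
    W, b = params[-1]
    return W @ z + b          # output layer left affine

rng = np.random.default_rng(0)
params = [(rng.normal(size=(16, 4)), np.zeros(16)),
          (rng.normal(size=(3, 16)), np.zeros(3))]
print(forward(rng.normal(size=4), params))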

1.5 Statistical interpretation


One important insight in machine learning is that the pairs (x_i, y_i) ∈ X × Y are merely samples of an unknown relationship between X and Y. To exploit this, we may use techniques from statistics and probability theory. In this course we only say very little about this topic and refer to lectures from the field of statistics.
From the point of view of statistics, we assume that the pair ξ = (x, y) is a random variable, drawn from a joint distribution P(x, y). The true objective that we want to minimize is as follows:

    Expected risk:  R(w) = E[l(h(x; w), y)] = ∫_{X×Y} l(h(x; w), y) dP(x, y)
                         = E[f(w, ξ)] = ∫_{X×Y} f(w, ξ) dP(ξ) =: f̄(w).

The objective functional considered above can be interpreted as an approximation of the expected risk:

    Empirical risk:  R_N(w) = (1/N) ∑_{i=1}^N l(h(x_i; w), y_i) = (1/N) ∑_{i=1}^N f_i(w),

where (x_i, y_i) are samples drawn according to P(x, y). We then know

    E[R_N(w)] = E[f(w, ξ)] = f̄(w).

The expected square error can be computed via the variance of f(w, ξ):

    V[f(w)] = E[|f(w, ξ) − f̄(w)|²] = ∫_{X×Y} |f(w, ξ) − f̄(w)|² dP(ξ) = ∥f(w, ·) − f̄(w)∥²_{L²(X×Y)}.

Lemma 1.5.1. It holds:

    V[R_N(w)] = (1/N) V[f(w)].

Thus, for the expected square error we obtain:

    E[(R_N(w) − f̄(w))²] = (1/N) V[f(w)].

Proof. Introduce for fixed w the random variable η = f(w, ξ) − f̄(w), so that E(η) = 0 and V(η) = V(f) = E(η²), and η_N = N⁻¹ ∑_i η_i, so that V(η_N) = V(R_N(w)). Since η_i and η_j are independent, we obtain Cov(η_i, η_j) = E[η_i η_j] = 0 for i ≠ j.
Then we compute:

    V(η_N) = E[N⁻² (∑_{i=1}^N η_i)²] = E[N⁻² ∑_{i=1}^N η_i²] + E[N⁻² ∑_{i≠j} η_i η_j]
           = N⁻² ∑_{i=1}^N E[η_i²] + N⁻² ∑_{i≠j} E[η_i η_j]
           = N⁻² ∑_{i=1}^N V(η_i) + 0 = N⁻¹ V(η).

By the Chebyshev inequality

    P(|X − X̄| > ε) ≤ V(X)/ε²

we obtain:

    P(|R_N(w) − R(w)| > ε) ≤ V(f(w))/(N ε²).

Hence, for a given confidence level we obtain:

    ε ∼ N^{−1/2}.
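The 1/N decay of V[R_N] is easy to observe numerically. A small sketch (the distribution of f(w, ξ) is made up; we average N i.i.d. draws and estimate the variance of the average over many repetitions):

import numpy as np

rng = np.random.default_rng(0)
reps = 2000
for N in [10, 100, 1000]:
    # R_N = average of N i.i.d. samples from a made-up loss distribution (V[f] = 1)
    RN = rng.exponential(scale=1.0, size=(reps, N)).mean(axis=1)
    print(N, RN.var())        # observed variance decays like V[f]/N = 1/N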

There is a much deeper stochastic theory of machine learning available, for which we recommend attending a lecture on statistical machine learning.
Chapter 2

Stochastic gradient methods

Week 3

The ultimate goal of learning algorithms, once a model is given, is to minimize R(w) or, in the presence of a regularization term ρ(w), the quantity

    F(w) := R(w) + ρ(w).   (2.1)
However, this is not so easily possible:

Neither R(w) nor R′(w) are computationally available. The only thing that we can do is evaluate

    l(h(x_i; w), y_i)   and   ∂_w l(h(x_i; w), y_i)

for a large number of samples (x_i, y_i).

A straightforward idea is to minimize

    F_N(w) := R_N(w) + ρ(w)   (2.2)

instead, for a given fixed dataset (x_i, y_i)_{i=1...N}. This minimization should only be done approximately. If w^∗ and w_N^∗ are minimizers of (2.1) and (2.2), respectively, and w_N is an approximate solution of (2.2), then the errors should be balanced:

    F(w_N) − F(w^∗) = [F(w_N) − F_N(w_N)] + [F_N(w_N) − F_N(w_N^∗)] + [F_N(w_N^∗) − F(w^∗)]

We observe that only the second difference can be reduced by optimization for fixed N, while the first and the third difference have to be taken care of by increasing N. Further, we observe that

    F_N(w_N^∗) < F_N(w_N)  ⇏  F(w_N^∗) < F(w_N).

By using the minimizing property F_N(w_N^∗) ≤ F_N(w^∗) the following estimate can be derived for a given confidence level:

    F(w_N) − F(w^∗) ≤ [F_N(w_N) − F_N(w_N^∗)] + (F(w_N) − F_N(w_N)) + (F_N(w^∗) − F(w^∗))
                    ∼ ∆F_{N,opt} + (1/√N)(Vf(w_N) + Vf(w^∗)).

A further difficulty arises:
• If N is large, the evaluation of F_N and F_N′ may be very expensive. Efficient algorithms should probably use cheaper surrogates.
• If W is chosen inadequately (and/or the regularizer is too weak), minimizing F_N very accurately may yield larger errors in F than early termination, since Vf(w_N) may be large.


2.1 A short review of gradient descent


Consider the problem

    min_w F(w)

on a normed vector space W with norm ∥·∥ = √⟨·, ·⟩. Its dual space W^∗ also has a norm, denoted by ∥·∥.
Assume that we can compute a search direction δw that satisfies:

    F(w + δw) ≤ F(w) − ν∥F′(w)∥²,   ν > 0.   (2.3)

Proposition 2.1.1. A sequence defined by w_{k+1} = w_k + δw_k as above yields either F(w_k) → −∞ or ∥F′(w_k)∥ → 0.
Proof. Clearly, F(w_k) is monotonically decreasing, so either F(w_k) → −∞ or F(w_k) → F_∗ ∈ ℝ. In the latter case this implies

    ν∥F′(w_k)∥² ≤ F(w_k) − F(w_{k+1}) → 0.

If we define ∆w as the solution of the following minimization problem:

    min_v F′(w)v + ½⟨v, v⟩,

then ∆w is the negative gradient of F:

    0 = F′(w)v + ⟨∆w, v⟩ ∀v  ⇒  ∆w = −∇F(w).

Assume now that F obeys the following smoothness assumption:

    F(w + δw) ≤ F(w) + F′(w)δw + (L/2)∥δw∥².

This holds, for example, if F′ is Lipschitz continuous with constant L.
This holds, for example, if F ′ is Lipschitz continuous with constant L.
Lemma 2.1.2. For δw = −α∇F(w), (2.3) is satisfied with

    ν(α) = α − Lα²/2.

This expression is maximized for α = 1/L with ν(α) = 1/(2L). It is positive if α < 2/L.
Proof.

    F(w + δw) − F(w) ≤ F′(w)δw + (L/2)∥δw∥²
                     = −αF′(w)∇F(w) + (Lα²/2)∥∇F(w)∥²
                     = −(α − Lα²/2)∥F′(w)∥² = −ν(α)∥F′(w)∥².

If we assume the following Polyak-Lojasiewicz inequality:

    ∥F′(w)∥² ≥ 2µ(F(w) − F_∗),   µ > 0,

where F_∗ = inf F, then we obtain:

    2ν(α)µ(F(w) − F_∗) ≤ ν(α)∥F′(w)∥² ≤ F(w) − F(w + δw).
With that we compute:

    F(w) − F_∗ = (F(w) − F(w + δw)) + (F(w + δw) − F_∗)
              ≥ 2µν(α)(F(w) − F_∗) + (F(w + δw) − F_∗).

Hence, we obtain the result:

    F(w + δw) − F_∗ ≤ (1 − 2µν(α))(F(w) − F_∗),

and for optimal α:

    F(w + δw) − F_∗ ≤ (1 − µ/L)(F(w) − F_∗).

We may define a condition number κ := L/µ to find:

    F(w + δw) − F_∗ ≤ (1 − 1/κ)(F(w) − F_∗),

and thus linear convergence:

    F(w_k) − F_∗ ≤ (1 − 1/κ)^k (F(w_0) − F_∗).

Be aware that our choice of norm can affect the size of κ. A clever choice will yield small κ. This is called preconditioning (note that mere rescaling does not change κ).
If F is quadratic with eigenvalues 0 < λ_1 ≤ · · · ≤ λ_n, then

    κ = λ_n/λ_1

and our result can be refined:

    F(w_{k+1}) − F_∗ ≤ ((κ − 1)/(κ + 1))² (F(w_k) − F_∗).

This holds for an optimal choice of α:

    α ← argmin_α F(w + α∆w),

which can be computed exactly in the quadratic case.
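A small numpy sketch of this gradient descent with exact line search on a made-up quadratic, illustrating how convergence degrades with κ:

import numpy as np

def gradient_descent(A, l, w, steps):
    # minimize F(w) = 1/2 w^T A w + l^T w with exact line search
    for _ in range(steps):
        g = A @ w + l                     # gradient
        alpha = (g @ g) / (g @ (A @ g))   # exact minimizer of F(w - alpha*g)
        w = w - alpha * g
    return w

A = np.diag([1.0, 100.0])                 # kappa = 100
l = np.array([-1.0, -1.0])
w = gradient_descent(A, l, np.zeros(2), steps=200)
print(w, np.linalg.solve(A, -l))          # both close to the minimizer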

2.2 Stochastic gradients


In very large scale problems we can neither evaluate F nor F′ exactly. In the following we will interpret an approximate computation of ∇F as a stochastic gradient step.
Let us now introduce a stochastic search direction ∆w with expectation

    ∆w̄ := E[∆w]

and variance:

    V[∆w] = E[∥∆w − ∆w̄∥²] = E[∥∆w∥²] − ∥∆w̄∥²,

and set δw = α∆w. Then we obtain:

    E[F(w + δw) − F(w)] ≤ E[F′(w)α∆w + (L/2)∥α∆w∥²]
                        = αF′(w)∆w̄ + (α²L/2) E[∥∆w∥²]
                        = αF′(w)∆w̄ + (α²L/2)(∥∆w̄∥² + V[∆w]).

Here we used monotonicity and linearity of the expectation and the definition of the variance.

Stochastic gradients
Stochastic gradient methods now take the following choice:

    ∆w̄ = E[∆w] := −∇F(w).

The search direction is an unbiased estimate of the gradient.

This implies

    −F′(w)∆w̄ = ∥F′(w)∥² = ∥∆w̄∥²

and thus:

    E[F(w + δw) − F(w)] ≤ −α∥F′(w)∥² + (α²L/2)(∥F′(w)∥² + V[∆w])
                        = −(α − (α²L/2)(1 + V[∆w]/∥F′(w)∥²)) ∥F′(w)∥²
                        = −ν(α)∥F′(w)∥²,

with

    ν(α) = α − (α²L/2)(1 + V[∆w]/∥F′(w)∥²).

What happens for fixed α?

Consider the following model for the variance:

    V[∆w] ≤ V_0 + V_1∥F′(w)∥².

In that case, our estimate can be written as follows:

    E[F(w + δw) − F(w)] ≤ −(α − (α²L/2)(1 + V_0/∥F′(w)∥² + V_1)) ∥F′(w)∥²
                        = −ν_1(α)∥F′(w)∥² + α²LV_0/2

with

    ν_1(α) = α − (α²L/2)(1 + V_1).

Obviously, ν_1(α) > 0 for sufficiently small α. However, if ∥F′(w)∥ → 0, the second term will dominate and yield positivity of the right hand side. Hence, our estimate no longer guarantees an expected decrease once ∥F′(w)∥ is too small.
Week 4

Theoretically optimal α
The theoretically optimal choice of α would result from maximizing ν(α) with respect to α:

    0 = ∂_α ν(α) = 1 − αL(1 + V[∆w]/∥F′(w)∥²),

and thus obtaining the theoretically optimal choice:

    α = 1/(L(1 + V[∆w]/∥F′(w)∥²)).

This results in an expected decrease as follows:

    E[F(w + δw) − F(w)] ≤ −ν(α)∥F′(w)∥²

with

    ν(α) = 1/(2L(1 + V[∆w]/∥F′(w)∥²)) = ∥F′(w)∥²/(2L(∥F′(w)∥² + V[∆w])).
The expected optimal decrease factor ν(α) thus depends on two quantities:
• As usual, the nonlinearity of the problem, encoded in the Lipschitz constant L. The choice of a problem suited norm (preconditioning) could reduce L. Alternatively, acceleration techniques may be used.
• The variance of the step, relative to the derivative. This term is only relevant if

    V[∆w] ≫ ∥F′(w)∥².

We may not assume that V[∆w] ∼ ∥F′(w)∥² if ∥F′(w)∥ → 0. A realistic assumption may be:

    V[∆w] ≤ V_0 + V_1∥F′(w)∥².

Then asymptotically, if ∥F′(w)∥ → 0, we obtain

    ν(α) ≥ ∥F′(w)∥²/(2L(V_0 + (1 + V_1)∥F′(w)∥²)) ≈ ∥F′(w)∥²/(2LV_0).

Hence, techniques to reduce the variance may be useful in later stages of the algorithm. For example, if F(w) = R(w) + ρ(w), we may define

    ∆w_n := −∇F_n(w) = −(∇R_n(w) + ∇ρ(w)),

which means solving the problem:

    min_v F_n′(w)v + ½⟨v, v⟩.

Usually n ≪ N is a moderately sized number. Just as in the last section we obtain

    V[∆w_n] ∼ (1/n) V[∆w_1],

but of course, the computational effort to compute F_n′(w) is ∼ n (a small implementation of this mini-batch scheme is sketched at the end of this section).

Suboptimal choice of α
In practice, the required quantities for finding the optimal α are not available. A suboptimal choice of α will change the properties of the algorithm.
• Too large α may result in an expected increase:

    α ≥ 2/(L(1 + V[∆w]/∥F′(w)∥²)),

i.e. twice the optimal choice, implies ν(α) ≤ 0 and thus the expected value of the functional may not decrease.
• Too small α yields only slow decrease. We observe that ν(α) ∼ α asymptotically.
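As announced above, a minimal numpy sketch of the resulting mini-batch stochastic gradient loop (the regression problem, batch size, and step size are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
# made-up linear regression data: f_i(w) = 1/2 (x_i^T w - y_i)^2
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def grad_fi(i, w):
    return (X[i] @ w - y[i]) * X[i]

w, n, alpha = np.zeros(d), 10, 0.05
for k in range(2000):
    batch = rng.integers(0, N, size=n)
    dw = -sum(grad_fi(i, w) for i in batch) / n   # unbiased estimate of -grad R_N
    w = w + alpha * dw                            # averaging reduces V[dw] by 1/n

Averaging over n samples reduces the variance by a factor 1/n at n times the cost per step, which is exactly the trade-off discussed above.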

2.3 Convergence results


Reduction of the functional
We may assume the following Polyak-Lojasiewicz inequality:

    ∥F′(w)∥² ≥ 2µ(F(w) − F_∗),   µ > 0,

where F_∗ = inf F. Then we obtain:

    ν_1(α) 2µ(F(w) − F_∗) ≤ ν_1(α)∥F′(w)∥² ≤ F(w) − E[F(w + δw)] + α²LV_0/2.

With that we compute:

    F(w) − F_∗ = (F(w) − E[F(w + δw)]) + (E[F(w + δw)] − F_∗)
              ≥ 2µν_1(α)(F(w) − F_∗) − α²LV_0/2 + (E[F(w + δw)] − F_∗).

Hence, we obtain the result:

    E[F(w + δw)] − F_∗ ≤ (1 − 2µν_1(α))(F(w) − F_∗) + α²LV_0/2

with

    ν_1(α) = α − (α²L/2)(1 + V_1) ≥ α/2   for α ≤ 1/(L(1 + V_1)).

In that case we have

    E[F(w + δw)] − F_∗ ≤ (1 − µα)(F(w) − F_∗) + α²LV_0/2.

Writing w_k = w and w_{k+1} = w_k + δw and taking into account that F(w_k) is itself the result of a stochastic gradient method, we obtain:

    E_{0:k}(F(w_{k+1}) − F_∗) = E_{0:k−1}(E_k(F(w_{k+1}) − F_∗)) ≤ (1 − µα_k) E_{0:k−1}(F(w_k) − F_∗) + α_k²LV_0/2.

Constant stepsizes
Theorem 2.3.1. For constant step-sizes satisfying 0 < α ≤ 1/(L(1 + V_1)) we obtain

    E[F(w_k) − F_∗] → αLV_0/(2µ).

Proof. Subtracting αLV_0/(2µ) from both sides, one obtains:

    E[F(w_{k+1}) − F_∗] − αLV_0/(2µ) ≤ (1 − αµ) E[F(w_k) − F_∗] + α²LV_0/2 − αLV_0/(2µ)
                                     = (1 − αµ) (E[F(w_k) − F_∗] − αLV_0/(2µ)).

This is a contraction.
Harmonic stepsizes
Theorem 2.3.2. Suppose that the SG method is run with a step size sequence such that for all k ∈ ℕ:

    α_k = β/(γ + k)   for some β > 1/µ and γ > 0 such that α_1 ≤ 1/(L(1 + V_1)).

Then for all k ∈ ℕ, the expected optimality gap satisfies:

    E[F(w_k) − F_∗] ≤ ν/(γ + k),   (2.4)

where

    ν := max{ β²LV_0/(2(βµ − 1)),  (γ + 1)(F(w_1) − F_∗) }.

Proof. We prove (2.4) by induction. By the second term in the definition of ν the result holds for k = 1. Let us set k̂ := γ + k. Then we compute, using α_k = β/k̂:

    E[F(w_{k+1}) − F_∗] ≤ (1 − α_k µ) E[F(w_k) − F_∗] + α_k²LV_0/2
                        ≤ (1 − βµ/k̂) ν/k̂ + β²LV_0/(2k̂²)
                        = ((k̂ − βµ)/k̂²) ν + β²LV_0/(2k̂²)
                        = ((k̂ − 1)/k̂²) ν − ( ((βµ − 1)/k̂²) ν − β²LV_0/(2k̂²) ) ≤ ν/(k̂ + 1),

because the term in parentheses is ≥ 0 by the first part of the definition of ν, and k̂² ≥ (k̂ + 1)(k̂ − 1).

2.4 Comparison of complexity: stochastic vs. classical gradient method

Recall the following splitting:

    F(w_N) − F(w^∗) ≤ [F_N(w_N) − F_N(w_N^∗)] + (F(w_N) − F_N(w_N)) + (F_N(w^∗) − F(w^∗))
                    ∼ ε + C/√N.

The first term ε can be reduced by optimization, while the second term can be reduced by increasing the problem size.
From our convergence theory we know that after k steps:

    Classical gradient:   ε ∼ Θ^k   with computing time T_k ∼ Nk
    Stochastic gradient:  ε ∼ 1/k   with computing time T_k ∼ k

To balance the errors, we set ε = 1/√N and obtain:

    k_G ∼ log ε / log Θ = log N / (2|log Θ|)  ⇒  T_G ∼ N log N
    k_SG ∼ 1/ε ∼ √N                           ⇒  T_SG ∼ √N.
A comparison yields a clear advantage for the stochastic gradient method in terms of work complexity.
The estimate 1/√N for the statistical error is a worst case estimate, which can be improved in some special situations. However, the errors will not decay faster than 1/N under any set of reasonable assumptions. In this case a balancing of errors ε ∼ 1/N would yield more equal results, which still favour stochastic gradients:

    k_G ∼ log N / |log Θ|  ⇒  T_G ∼ N log N
    k_SG ∼ N               ⇒  T_SG ∼ N.

By and large it can be said that in this setting one step of the classical gradient method has a similar cost as the whole stochastic gradient algorithm, run until appropriate accuracy.
Week 5

2.5 Computation of derivatives and gradients


Let us now discuss practical methods to compute the derivative of

    F(w) := R(w) + ρ(w),   (2.5)

or at least of a summand

    l(h(x_i, w), y_i) = l_i(h_i(w)) = (l_i ∘ h_i)(w)

of R. We obtain the simple relations:

    F′(w) = R′(w) + ρ′(w)
    F″(w) = R″(w) + ρ″(w),

where ρ′ and ρ″ can usually be computed efficiently. Concerning a summand, we obtain by the chain rule:

    (l_i ∘ h_i)′(w)δw = l_i′(h_i(w)) h_i′(w)δw
    (l_i ∘ h_i)″(w)(δw)² = l_i″(h_i(w))(h_i′(w)δw)² + l_i′(h_i(w)) h_i″(w)(δw)².

In these formulas, l_i′ and l_i″ can be computed cheaply. Observe that

    h′(w) : W → Y

maps from the parameter space to the output space.

However, h_i′ is rather expensive to compute, as we will see below, and the computation of h_i″ is similarly complicated.
A gradient step is computed by solving the following problem:

    min_{δw} F′(w)δw + ½⟨δw, δw⟩,

which yields for the euclidean product:

    ∇F(w) = F′(w)ᵀ = R′(w)ᵀ + ρ′(w)ᵀ = ∑_i h_i′(w)ᵀ l_i′(h_i(w))ᵀ + ρ′(w)ᵀ
          = ∑_i h_i′(w)ᵀ ∇l_i(h_i(w)) + ∇ρ(w).

Thus, we have to compute h′(w)ᵀv for a gradient step.


We observe:
• h′(w)ᵀ : Y → W maps from the output space into the parameter space.
• Thus, v is located in the output space.
• In deep neural networks the parameters are distributed within the layers. The same will be true for the result of h′(w)ᵀv.
• By the chain rule, h′(w) contains products of linear mappings. So in h′(w)ᵀ these products occur in reverse order.
These observations are reflected in the following algorithm, known as back-propagation.

The computation of h′(w)ᵀv by backpropagation

As we have seen, the layer structure of a neural network can be described mathematically as follows:

    h_k(x; w_k) = h̃_k(h_{k−1}(x; w_{k−1}); w̃_k).

Here the mapping

    h̃_k : Z_{k−1} × W̃_k → Z_k

represents the k-th layer, parametrized by w̃_k ∈ W̃_k.
The chain rule yields the following formula for derivatives:

    H_k = (H̃_k H_{k−1}, R̃_k)  ⇔  H_k δw_k = H̃_k H_{k−1} δw_{k−1} + R̃_k δw̃_k,

where, with W_k := W̃_1 × · · · × W̃_k:

    H_k = ∂_w h_k : W_k → Z_k
    H̃_k = ∂_z h̃_k : Z_{k−1} → Z_k
    R̃_k = ∂_{w̃} h̃_k : W̃_k → Z_k.

Transposition of this block matrix yields the following formula:

    H_kᵀ ζ = (H_{k−1}ᵀ H̃_kᵀ ζ, R̃_kᵀ ζ),   H_kᵀ : Z_k → W_{k−1} × W̃_k = W_k,

which gives rise to a recursive algorithm for evaluating h_i′(w)ᵀv = H_Lᵀ v:

applyHkT(ζ_k, k)
    ζ_{k−1} = H̃_kᵀ ζ_k ∈ Z_{k−1}
    s_k = R̃_kᵀ ζ_k ∈ W̃_k
    if (k > 0): v_{k−1} = applyHkT(ζ_{k−1}, k − 1)
    return v_k := (v_{k−1}, s_k)

With this algorithm, we can compute:

    h′(w)ᵀv = applyHkT(v, L).

We observe that this is a recursion that works backwards through the layers. The corresponding algorithm is thus called back-propagation. We observe a couple of things:

• The vector v grows during the application of the algorithm.
• In each step, some components are propagated via H̃_kᵀ, others are generated via R̃_kᵀ.
• The simple notation with a vector v hides the fact that v is a collection of matrices δW̃ and vectors δw̃_0.

Application of H̃ and R̃

To explain the details of our algorithm, we drop the index k.
Let us recall that a full layer is defined as:

    h̃(z, w̃) = α(W̃z + w̃_0).

Here z is the vector of inputs into the layer, W̃ is the matrix of connection parameters and w̃_0 is the bias vector. Together (W̃, w̃_0) are the parameters associated with the current layer.
Thus, h̃ is a mapping:

    h̃ : Z × W_M × W_0 → Z₊,

where W_M is a space of matrices. Hence, its derivative is a linear mapping:

    h̃′ = (H̃, R̃_M, R̃_0) : Z × W_M × W_0 → Z₊.

Here Z, Z₊, and W_0 are spaces of vectors, while W_M is a space of matrices. We observe:

    H̃ : Z → Z₊
    R̃_M : W_M → Z₊
    R̃_0 : W_0 → Z₊.

Let us apply the chain rule. Since α is defined componentwise, α′ is a diagonal matrix D. If, for example, ReLUs are used, the diagonal consists of 0 and 1. Thus, calling α′(W̃z + w̃_0) = D, we obtain:

    h̃′(z, w̃)(δz, δW̃, δw̃_0) = D(W̃δz + δW̃z + δw̃_0).

Hence, we have the easy parts:

    H̃δz = DW̃δz  ⇒  H̃ = DW̃  ⇒  H̃ᵀ = W̃ᵀDᵀ : Z₊ → Z
    R̃_0 δw̃_0 = Dδw̃_0  ⇒  R̃_0 = D  ⇒  R̃_0ᵀ = Dᵀ : Z₊ → W_0.

The remaining part R̃_M : W_M → Z₊ is a little bit harder to interpret and to transpose, because R̃_M is a linear mapping that takes a matrix and yields a vector:

    R̃_M δW̃ = DδW̃z.

To compute a transpose

    R̃_Mᵀ : Z₊ → W_M

we have to find for each ζ ∈ Z₊ a matrix M such that

    M · δW̃ = ∑_{ij} M_{ij} δW̃_{ij} = ζ · DδW̃z = ∑_i ζ_i (DδW̃z)_i,

where · is the sum of products of all matching entries (either of vectors or matrices). Since ζ · Dv = (Dᵀζ) · v, we define r = Dᵀζ and thus arrive at the problem of finding M such that for given r, z:

    M · δW̃ = r · δW̃z   ∀δW̃.

This computation can be performed best with the help of the trace:

    tr A = ∑_{i=1}^n A_{ii}  ⇒  M · δW̃ = ∑_{ij} M_{ij} δW̃_{ij} = tr(MᵀδW̃).

One important property of the trace that we will use is that it commutes:

    tr AB = tr BA.

With this, we compute:

    M · δW̃ = r · δW̃z = rᵀδW̃z = tr(rᵀδW̃z) = tr(zrᵀδW̃) = rzᵀ · δW̃.

Thus, we have M = rzᵀ (which is a matrix of rank 1) and thus

    R̃_Mᵀ : Z₊ → W_M,   ζ ↦ rzᵀ = (Dᵀζ)zᵀ.

With this information, our algorithm now reads as follows, where z = (x = z_0, . . . , z_{L−1}):

applyHkT(ζ_k, z, k)
    r_k = D_kᵀ ζ_k
    ζ_{k−1} = W̃_kᵀ r_k
    M_k = r_k z_{k−1}ᵀ
    if (k > 0): v_{k−1} = applyHkT(ζ_{k−1}, z, k − 1)
    return v_k = (v_{k−1}, M_k, r_k)
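A minimal numpy version of this back-propagation for the fully connected layers above, written iteratively (ReLU activations; per layer it returns the pair (M_k, r_k) = (r_k z_{k−1}ᵀ, r_k), i.e. the contributions with respect to W̃_k and w̃_{0,k}; all names are ours):

import numpy as np

def forward(x, Ws, bs):
    # store layer inputs z_0 .. z_{L-1} and activation derivatives D_k
    zs, Ds, z = [x], [], x
    for W, b in zip(Ws, bs):
        a = W @ z + b
        Ds.append((a > 0).astype(float))  # ReLU: diagonal of D_k as a vector
        z = np.maximum(a, 0.0)
        zs.append(z)
    return zs, Ds

def apply_HT(zeta, zs, Ws, Ds):
    # evaluates h'(w)^T zeta, backwards through the layers
    grads = []
    for k in reversed(range(len(Ws))):
        r = Ds[k] * zeta                        # r_k = D_k^T zeta_k
        grads.append((np.outer(r, zs[k]), r))   # M_k = r_k z_{k-1}^T, and r_k
        zeta = Ws[k].T @ r                      # zeta_{k-1} = W_k^T r_k
    return list(reversed(grads))

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
bs = [np.zeros(8), np.zeros(3)]
zs, Ds = forward(rng.normal(size=4), Ws, bs)
grads = apply_HT(np.ones(3), zs, Ws, Ds)   # list of (dW_k, db_k) per layer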
Chapter 3

Techniques for variance reduction

We have seen that the variance of the step computation is a decisive quantity when analysing the speed of convergence of stochastic gradient methods. There are essentially two ways to reduce it:

• increasing the sample size (mini-batching).
• gradient aggregation, where gradients from previous iterates are collected and averaged.

3.1 Dynamic sample size


Using our model:

    V[∆w_k] ≤ V_0^(k) + V_1^(k) ∥F′(w_k)∥²,

we recall our central estimate:

    E_{0:k}(F(w_k + δw_k) − F_∗) ≤ (1 − µα_k) E_{0:k−1}(F(w_k) − F_∗) + α_k²LV_0^(k)/2,

as long as

    α_k ≤ 1/(L(1 + V_1^(k))).

By choosing a variable sample size, we can achieve a reduction of the variance, which may be modelled by the constants V_0^(k) and V_1^(k), depending on the iteration number k.
There are two possible scenarios for linear convergence:
• V_0^(k) = 0 and V_1^(k) ≤ V̄_1. This might be achieved by estimating ∥F′(w)∥² and choosing V[∆w] proportional to this estimate.
• V_0^(k) ≤ ζ^k with ζ < 1 and V_1^(k) = 0 (for simplicity). Then linear convergence can also be shown (exercise).
In both cases, we have to expect that for some Θ < 1

    V[∆w] ∼ Θ^k,

and, as we have seen, the variance depends on the sample size n as follows:

    V[∆w] ∼ 1/n.

Thus, the cost for each step grows exponentially as:

    n_k ∼ Θ^{−k}.

Summing up, we obtain for the whole computational effort:

    ∑_{i=1}^k n_i = ∑_{i=1}^k Θ^{−i} = Θ^{−k} ∑_{i=0}^{k−1} Θ^i ≤ Θ^{−k}/(1 − Θ) ∼ Θ^{−k}.

Since we have linear convergence, the achieved accuracy ε behaves like

    ε ∼ Θ^k.

Thus, the total computational effort is proportional to 1/ε, which is similar to stochastic gradient methods.

Estimate of the variance

If the sample size is chosen dynamically, it is certainly desirable to provide an estimate for the variance V[∆w] in order to choose the sample size n accordingly.
Using the definition:

    V[∆w] = E[∥∆w − ∆w̄∥²],

our first idea is to compute n steps ∆w_i and define:

    Ṽ_Est[∆w] := (1/n) ∑_{i=1}^n ∥∆w_i − ∆w̄∥².

This estimate is, however, still not computable, because ∆w̄ is unknown. We may, however, replace ∆w̄ by an estimate:

    ∆W := (1/n) ∑_{j=1}^n ∆w_j.

Lemma 3.1.1. It holds

    E[∑_{i=1}^n ∥∆w_i − ∆W∥²] = (n − 1) V[∆w].

Proof. We apply the well known identity (with nX̄ = ∑_i x_i):

    ∑_{i=1}^n ∥x_i − X̄∥² = ∑_{i=1}^n ∥x_i∥² − n∥X̄∥²

to our case:

    E[∑_{i=1}^n ∥(∆w_i − ∆w̄) − (∆W − ∆w̄)∥²] = E[∑_{i=1}^n ∥∆w_i − ∆w̄∥²] − E[n∥∆W − ∆w̄∥²]
        = ∑_{i=1}^n V[∆w_i] − nV[∆W]
        = nV[∆w] − V[∆w] = (n − 1)V[∆w].

Thus, an unbiased estimate for the variance is

    V_Est[∆w] := (1/(n − 1)) ∑_{i=1}^n ∥∆w_i − ∆W∥².

In our case, ∆w_i would be a step computed from a single data sample and ∆W would be a step created by dynamic sampling. We thus need the variance of a step ∆W^(m), computed by averaging m gradients:

    V[∆W^(m)] = (1/m) V[∆w] ≈ V_Est[∆W^(m)] := (1/(m(n − 1))) ∑_{i=1}^n ∥∆w_i − ∆W∥².
Hence, we could choose m_k (for prescribed V_1) such that

    V_Est[∆W_k^(m_k)] ≤ V_1∥F′(w_k)∥².

However, this requires an estimate of ∥F′(w_k)∥, which is not available in general. In fact, close to a minimizer, using a small sample size,

    ∥F′(w)∥ ≈ ∥F_m′(w)∥ = ∥(1/m) ∑_{i=1}^m f_i′(w) + ρ′(w)∥

will usually be overestimated. One way to avoid this problem is to choose V_1 < 1 sufficiently small. Then by the inverse triangle inequality:

    ∥F′(w)∥ = ∥F_m′(w) + F′(w) − F_m′(w)∥ ≥ ∥F_m′(w)∥ − ∥F′(w) − F_m′(w)∥.

If ∥F′(w) − F_m′(w)∥ ≪ ∥F_m′(w)∥ with high probability, we obtain a good estimate.
Alternatively, the non-adaptive rule:

    V_Est[∆W_k^(m_k)] ≤ V_0ζ^k

could be chosen. However, here it is hard to choose appropriate parameters to make such a method efficient.
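A small numpy sketch of this sample-variance estimate for a batch of per-sample gradient steps (steps are stacked as rows; the function name is ours):

import numpy as np

def variance_estimate(G):
    # G: n x d array; row i is a single-sample step Delta w_i
    n = G.shape[0]
    mean = G.mean(axis=0)                  # Delta W
    v = np.sum((G - mean)**2) / (n - 1)    # unbiased estimate of V[Delta w]
    return mean, v

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 5))
step, v = variance_estimate(G)
# the step averaged over m samples has variance approx v/m, which can be
# compared against V_1 * ||F'(w)||^2 to choose the sample size m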

3.2 Gradient aggregation


Among gradient aggregation methods there are a couple of variants available. We will only sketch a few ideas. For simplicity let ρ = 0.
Recall that a mini-batch stochastic gradient method computes steps of the form:

    w_{k+1} = w_k − α_k ∆w_k   with   ∆w_k = (1/n) ∑_{i=1}^n ∇f_i(w_k).

This implies the expensive computation of n gradients ∇f_i(w_k) for i = 1 . . . n, where {(x_1, y_1), . . . , (x_n, y_n)} is a randomly chosen subset of the data.
To reduce this effort, it may be a good idea to reuse information. If, for example, ∇f_i has been evaluated at a previous iterate, say w_{j(i)}, then we may use the formula:

    ∆w_k = (1/n) ∑_{i=1}^n ∇f_i(w_{j(i)}).

• If ∇f_i has never been evaluated before, we may set j(i) = k and compute and store ∇f_i(w_k).
• In addition, to update available information, we may recompute ∇f_i(w_k) (so j(i) = k) for some i (which may be chosen randomly or cyclically).
Locally, i.e., if w_{j(i)} ≈ w_k, we would expect a reduction of variance similar to dynamic sample size methods. However, we cannot expect that ∆w_k is an unbiased estimate of ∇F(w_k).
These ideas leave plenty of room for implementation. The following variants have been proposed and analysed. Usually linear convergence was shown for strongly convex objectives.

SAG (Stochastic averaged gradient)

Here n = N is chosen, so the full dataset is exploited in averaging.
• First compute a classical gradient step and evaluate all ∇f_i(w_1), so j(i) = 1 for i = 1 . . . N.
• In the following iterations, for each k pick i ∈ {1 . . . N} randomly and replace ∇f_i(w_{j(i)}) by ∇f_i(w_k), so j(i) = k.
• The mean value does not need to be recomputed. We may write:

    ∆w_k = ∆w_{k−1} + (1/N)(∇f_i(w_k) − ∇f_i(w_{j(i)})).

• As already mentioned, ∆w_k is a biased estimator of ∇F(w_k):

    ∆w_k = (1/N) ∑_{i=1}^N ∇f_i(w_k) + (1/N) ∑_{i=1}^N (∇f_i(w_{j(i)}) − ∇f_i(w_k)).

This complicates convergence theory. The bias, however, vanishes close to the solution.
• The method requires a lot of storage, so it may be unsuitable for very large N.

SAGA
Here the idea is to provide an unbiased estimator. This can be achieved by the following simple modification:

    ∆w_k = ∇f_i(w_k) + [ (1/N) ∑_{l=1}^N ∇f_l(w_{j(l)}) − ∇f_i(w_{j(i)}) ].

Since i has been chosen randomly (uniformly distributed), the expected value of the term [. . .] is 0.
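A minimal numpy sketch of the SAGA update (the table G stores the last evaluated gradient per sample; the toy objective at the end is made up):

import numpy as np

def saga(w, grad_fi, N, alpha, steps, rng):
    G = np.array([grad_fi(i, w) for i in range(N)])  # gradient table
    avg = G.mean(axis=0)
    for k in range(steps):
        i = rng.integers(N)
        g = grad_fi(i, w)
        dw = g + (avg - G[i])          # bracketed term has expectation 0
        w = w - alpha * dw
        avg = avg + (g - G[i]) / N     # keep the stored mean up to date
        G[i] = g
    return w

# example: f_i(w) = 1/2 (w - c_i)^2, the minimizer of the average is mean(c)
rng = np.random.default_rng(0)
c = rng.normal(size=100)
print(saga(0.0, lambda i, w: w - c[i], N=100, alpha=0.1, steps=2000), c.mean())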

SVRG (Stochastic Variance Reduced Gradient)

This method works with macro steps. One step w_k → w_{k+1} is computed via:
• one full gradient evaluation ∆w = ∇F_N(w_k) = (1/N) ∑_i ∇f_i(w_k),
• m inner update steps with w̃_0 = w_k and w̃_{j+1} = w̃_j − α∆w̃_j, where i is drawn randomly in each inner step and

    ∆w̃_j = ∇f_i(w̃_j) − ∇f_i(w_k) + ∇F_N(w_k).

• Concerning complexity, this method is rather close to the full gradient method; a sketch of one macro step follows below.
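A minimal numpy sketch of one SVRG macro step (m and alpha are tuning parameters; the toy objective is the same made-up one as above):

import numpy as np

def svrg_macro_step(w, grad_fi, N, m, alpha, rng):
    full = sum(grad_fi(i, w) for i in range(N)) / N   # grad F_N(w_k), computed once
    wt = w
    for j in range(m):
        i = rng.integers(N)
        dwt = grad_fi(i, wt) - grad_fi(i, w) + full   # variance-reduced estimate
        wt = wt - alpha * dwt
    return wt

rng = np.random.default_rng(0)
c = rng.normal(size=100)
w = 0.0
for k in range(20):
    w = svrg_macro_step(w, lambda i, w: w - c[i], N=100, m=50, alpha=0.1, rng=rng)
print(w, c.mean())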
Chapter 4

Techniques for preconditioning

Concerning the rate of convergence, not only the variance plays a role, but also the condition number:

    κ = L/µ.

Even if V_0 = 0 and V_1 is moderate, we have

    E_{0:k}(F(w_k + δw_k) − F_∗) ≤ (1 − µα_k) E_{0:k−1}(F(w_k) − F_∗).

Setting α = 1/(L(1 + V_1)), we get

    E_{0:k}(F(w_k + δw_k) − F_∗) ≤ (1 − 1/(κ(1 + V_1))) E_{0:k−1}(F(w_k) − F_∗),

which means that the (now linear) speed of convergence depends crucially on κ. It is a classical result that the number of iterations needed for a prescribed accuracy ε is proportional to κ.
Also for harmonic step-sizes it can be shown (exercise) that γ ∼ κ and

    E_{0:k}[F(w_k) − F_∗] ≤ ν/(γ + k) ∼ max{ (κ/2) V_0/(µ(F(w_1) − F_∗)), (1 + V_1) } (F(w_1) − F_∗)/(γ + k).

Thus, also here the error is scaled by κ, asymptotically for k ≫ γ. So the required number of iterations is also proportional to κ:

    for fixed ε:   k ∼ κ.

Hence, reducing the condition number directly reduces the computational effort by the same factor.

4.1 The effect of preconditioning


Let us collect our set of assumptions and definitions that depend on the norm and the corresponding scalar product:

    F(w + δw) ≤ F(w) + F′(w)δw + (L/2)∥δw∥²   ∀δw
    ∥F′(w)∥∗² ≥ 2µ(F(w) − F(w∗)),   µ > 0
    ⟨∆w, v⟩ = −F′(w)v   ∀v
    V[∆w] = E[∥∆w − ∆w̄∥²] ≤ V_0 + V_1∥F′(w)∥∗².

The norm for dual quantities is denoted by ∥·∥∗. It is defined as an operator norm:

    ∥f∥∗ := sup_{v≠0} f(v)/∥v∥   for a linear functional f.

Let us now define the following general scalar product (and norm):

    ⟨v, w⟩ := vᵀMw,   M s.p.d.,   ∥v∥ = √(vᵀMv).

Then we can derive the representation:

    ⟨f, g⟩∗ = fᵀM⁻¹g.

If F is a quadratic with spd Hessian (or if, close to a minimizer, F can be approximated by such a quadratic function):

    F(w) = ½wᵀAw + lᵀw + c,

then the condition number is given by the extreme solutions λ_min, λ_max of the following generalized eigenvalue problem:

    Av = λMv  ⇔  M⁻¹Av = λv,

as follows:

    κ_M(A) = λ_max/λ_min.

In particular, in the extreme case M = A, we obtain κ_M(A) = 1. So a good choice of M can reduce the required computational effort significantly. Note, however, that mere scaling of the norm leaves the condition number invariant.
The price to pay is that the computation of a search direction

    ⟨∆w, v⟩ = −F′(w)v   ∀v

now amounts to solving the linear system:

    M∆w = −F′(w)ᵀ.

So the price to pay for preconditioning is as follows:

• derive and compute a suitable preconditioner M
• solve (approximately) a linear system

The potential benefit is:

• a significantly lower number of total iterations

Interestingly, the variance of ∆w also depends on the choice of norm.

Our preconditioned algorithm now looks as follows:

    Given w_1
    for k = 1, 2, . . .
        compute an estimate l(w_k) for F′(w_k)ᵀ
        compute M_k
        ∆w = −M_k⁻¹ l(w_k)
        w_{k+1} = w_k + δw = w_k + α∆w

Since M defines a norm, all of the previous theory applies.

4.2 Diagonal scaling methods


The cheapest way of preconditioning is to choose M as a diagonal matrix: M = diag(m_ii).
Then we can simply compute:

    [∆w]_i = −(1/m_ii) [l(w)]_i.

Some popular ways to define m_ii have been proposed:

RMSprop
The motivation of the following algorithm is to gather some estimates of the scaling properties of the variables.
Here a diagonal matrix D_k is defined recursively as follows:

    D_k = (1 − λ)D_{k−1} + λ diag(F′(w_k)_i²),   λ ∈ [0, 1].

Then for some µ > 0 we define:

    M_k = √(D_k) + µI   (where the root is taken componentwise).

Sensitivity to scaling
This is one way to make the algorithm less sensitive to scaling of the variables. Setting v = Sw + w_0 and F_S(w) := F(v) = F(Sw + w_0) we get:

    F_S′(w) = F′(v)S  ⇒  F_S′(w)_i² = S_ii² F′(v)_i²  ⇒  D_ii^S = S_ii² D_ii.

Thus, we obtain (for µ = 0):

    ∆v = D^{−1/2} F′(v)ᵀ
    ∆w = (D^S)^{−1/2} F_S′(w)ᵀ = D^{−1/2} S⁻¹ S F′(v)ᵀ = D^{−1/2} F′(v)ᵀ = ∆v.

Hence, the steps are the same. However, given that v = Sw + w_0, we should rather expect ∆v = S∆w, so a scaling with D⁻¹ would also be conceivable.

Adagrad
A similar idea is used in the algorithm Adagrad. Here the following update for D is used:

    D_k = D_{k−1} + diag(F′(w_k)_i²).

This has the effect that M_k grows during the iteration, which yields smaller and smaller steps. This effect may be desirable during a pure stochastic gradient method. This decay, however, often turns out to be too aggressive in practice.
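A minimal numpy sketch of these two diagonal preconditioners applied to a gradient estimate g ≈ F′(w)ᵀ (lam and mu are the tuning constants λ and µ from above; values are made up):

import numpy as np

def rmsprop_step(w, g, D, alpha, lam=0.1, mu=1e-8):
    # g is an estimate of F'(w)^T; D keeps a running average of F'(w)_i^2
    D = (1 - lam) * D + lam * g**2
    return w - alpha * g / (np.sqrt(D) + mu), D    # M = sqrt(D) + mu*I

def adagrad_step(w, g, D, alpha, mu=1e-8):
    D = D + g**2                                   # accumulated squared gradients
    return w - alpha * g / (np.sqrt(D) + mu), D

w, D = np.ones(3), np.zeros(3)
g = np.array([0.5, -2.0, 0.1])
w, D = rmsprop_step(w, g, D, alpha=0.01)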

4.3 CG as a matrix-free solver


If the preconditioner matrix M is more complicated than diagonal, we have to solve a linear system of the form:

    M∆w + l = 0.

Since M is by assumption spd, this is equivalent to the minimization problem:

    min_v q(v) := ½vᵀMv + lᵀv.

This problem is often very large scale, so it is often not possible to store or to evaluate the matrix M. Instead, we might have an algorithm that performs the application of M to a vector v:

    v ↦ Mv.

The absence of M in explicit form rules out matrix factorizations (like M = LR) for the solution of this system, and we have to use iterative methods.
The simplest idea would be to use a gradient method again, now corresponding to the scalar product vᵀPw, where P is itself a preconditioner (an inner preconditioner) for M:

    compute p_k := −∇_P q(v_k) = −P⁻¹q′(v_k)ᵀ   (corresponding to min_{δv} ½δvᵀPδv + q′(v_k)δv)
    v_{k+1} = v_k + α_k p_k.

Since the problem is quadratic, an optimal value of α can be computed:

    min_α q(v_k + αp_k)  ⇒  0 = ∂_α q(v_k + αp_k) = q′(v_k + αp_k)p_k = (v_k + αp_k)ᵀMp_k + lᵀp_k
    ⇒  α_k = −(Mv_k + l)ᵀp_k/(p_kᵀMp_k) = −q′(v_k)p_k/(p_kᵀMp_k).

As shown in NLP, this method converges with the following speed:

    ∥v∗ − v_{k+1}∥_M ≤ ((κ − 1)/(κ + 1)) ∥v∗ − v_k∥_M,

where

    κ = λ_max(M)/λ_min(M).

This is not a big improvement compared to the application of the unpreconditioned gradient method to the nonlinear problem.

M-conjugate directions
Application of an additional Gram-Schmidt orthogonalization step improves the situation drastically. We would like to achieve that the new search direction p_k is M-orthogonal to the previous direction p_{k−1}:

    p_kᵀMp_{k−1} = 0.

Then Gram-Schmidt yields:

    p_k = g_k − (g_kᵀMp_{k−1}/(p_{k−1}ᵀMp_{k−1})) p_{k−1} = g_k + β_k p_{k−1},

so that one step of the method reads:

    compute g_k := −∇_P q(v_k) = −P⁻¹q′(v_k)ᵀ
    p_k = g_k + β_k p_{k−1}   (Gram-Schmidt)
    v_{k+1} = v_k + α_k p_k   (optimal line-search)

This simple modification yields a greatly improved rate of convergence:

    ∥v∗ − v_k∥_M ≤ 2 ((√κ_P(M) − 1)/(√κ_P(M) + 1))^k ∥v∗ − v_0∥_M.

Thus, the overall computational effort of cg is roughly the square root of that of the gradient method for large condition numbers. A clever implementation uses only one application of M per iteration, namely p_k → Mp_k. The algorithm, with the loop slightly reordered, is shown below. In addition, we include the possibility of preconditioning the iteration for M with a preconditioner P:
Algorithm 4.3.1. (cg)

    Given: v_0
    d_0 = q′(v_0)ᵀ = Mv_0 + l,   p_0 := −P⁻¹d_0
    for k = 0, 1, 2, . . .
        p̃_k = Mp_k
        α_k = −d_kᵀp_k/(p_kᵀp̃_k)
        v_{k+1} = v_k + α_k p_k
        d_{k+1} = d_k + α_k p̃_k
        g_{k+1} = −P⁻¹d_{k+1}
        β_{k+1} = d_{k+1}ᵀg_{k+1}/(d_kᵀg_k)
        p_{k+1} = g_{k+1} + β_{k+1} p_k

• For usage inside an optimization method it is important for efficiency to terminate the cg method early. Thus, the linear system Mv + l = 0 is not solved exactly. The computed steps, however, are better than unpreconditioned gradient steps, even if only a moderate number of cg iterations is used.
• The analysis of cg is quite involved and is presented in the lecture "Vertiefung der Numerik". In fact, cg minimizes q(v) on v_0 + span{p_i}_{i=0...k}, the space spanned by all previous search directions. In theory, it yields the exact solution after k = dim W iterations. In practice, however, it is terminated much earlier.
• The cg method depends on the positive definiteness of M in order to converge. However, if M is indefinite, cg can still be used to compute a direction of descent: as soon as we encounter a direction of negative curvature, i.e. p_kᵀp̃_k < 0, we terminate the cg method and return the current iterate v_k. This can be used as a search direction in the outer optimization algorithm.
• As a preconditioner P one may take an estimate of the diagonal of M. For example, one may use ideas from RMSprop:

    P_ii = ∑_j [∇f_j]_i² + µ.

Then the preconditioner eliminates the dependence of cg on scaling. A compact matrix-free implementation is sketched below.
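A minimal numpy version of Algorithm 4.3.1, taking M and P⁻¹ as functions so that M never has to be stored (we add the usual residual-based stopping test and the negative-curvature exit discussed above; the test matrix is made up):

import numpy as np

def pcg(apply_M, apply_Pinv, l, v0, iters=100, tol=1e-10):
    # approximately solves M v + l = 0; apply_M(v) = M v, apply_Pinv(d) = P^{-1} d
    v = v0.copy()
    d = apply_M(v) + l               # d_k = q'(v_k)^T
    g = -apply_Pinv(d)
    p = g.copy()
    dg = d @ g
    for k in range(iters):
        Mp = apply_M(p)
        curv = p @ Mp
        if curv <= 0.0:              # negative curvature: return current iterate
            return v
        alpha = -(d @ p) / curv
        v = v + alpha * p
        d = d + alpha * Mp
        if np.linalg.norm(d) < tol:
            break
        g = -apply_Pinv(d)
        dg_new = d @ g
        beta = dg_new / dg
        p = g + beta * p
        dg = dg_new
    return v

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 50))
M = B.T @ B + 50 * np.eye(50)        # made-up spd matrix
l = rng.normal(size=50)
v = pcg(lambda x: M @ x, lambda d: d / np.diag(M), l, np.zeros(50))
print(np.linalg.norm(M @ v + l))     # small residual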

4.4 Inexact Newton methods for Neural Networks


The application of Newton methods has a couple of advantages:

• less dependent on conditioning of the problem (locally κ ∼ 1, globally, invariant against linear
transformation of variables)

• step-sizes locally tend to 1, depending on the non-linearity of the problem, rather than the scaling
(no experimentation with step-sizes necessary)

There is also some cost involved:

• evaluation of the hessian and solution of the system (remedy: inexact Newton-cg)

• need additional smoothness

• deal with non-convexity, singular hessian matrices (remedy: trust-region methods)



Let us recall our objective:

    min_w F(w) = R(w) + ρ(w) = (1/N) ∑_{i=1}^N f_i(w) + ρ(w) = (1/N) ∑_{i=1}^N l(h(x_i, w), y_i) + ρ(w).

If F is convex, we may set M_k = F″(w_k) as an ideal preconditioner. In that case we have to solve the following problem at each step:

    min_v ½F″(w_k)(v, v) + F′(w_k)v.

We cannot factorize or even store F″(w_k); thus we use cg with M = F″(w_k) to find an approximate minimizer of the problem.
This requires the following computations:
• Computation of F′(w) once at the beginning of the cg iteration.
• Computation of F″(w)p_k at each iteration of the cg method.
Thus, one run of the cg method (using K steps) requires the following work:

    work(F′(w)) + K · work(F″(w)v).

Assuming that work(F″(w)v) ∼ γ work(F′(w)) we obtain

    work(F′(w))(1 + γK).

Obviously, exact evaluation of F″(w)v makes one iteration of cg very expensive compared to a gradient step, because then γ ≥ 1 has to be expected. Rather, we aim to obtain a cheaper approximation Mv ≈ F″(w)v so that γK ≈ 1.

Hessian sub-sampling
This is a popular idea to obtain an efficient algorithm in the setting of learning. Assume that our algorithm uses a stochastic gradient method with mini-batches of sampling size n:

    R_n′(w)ᵀ = (1/n) ∑_{i=1}^n f_i′(w)ᵀ = (1/n) ∑_{i=1}^n h′(x_i, w)ᵀ l′(h(x_i, w), y_i)ᵀ,

where the (x_i, y_i) are chosen randomly from the data. This is done, as usual, with back-propagation.
Then we may choose m ≪ n, draw j = 1 . . . m data samples from the already chosen set, and compute in each cg step (setting z = h_j(w))

    R_m″(w)v = (1/m) ∑_{j=1}^m f_j″(w)v = (1/m) ∑_{j=1}^m ( h_j′(w)ᵀ l_j″(z) h_j′(w)v + l_j′(z) h_j″(w)v ).

Then M = R_m″(w) + ρ″(w) can be used. We have to keep M, and thus the sample, fixed during the cg iteration; otherwise M would change during the cg iteration.
Concerning

    f″(w)v = h′(w)ᵀl″(z)h′(w)v + l′(z)h″(w)v

we observe two summands:

    h′(w)ᵀl″(z)h′(w)v

This summand can be evaluated by one forward propagation v ↦ h_j′(w)v, application of l_j″(z), which is usually cheap, and one back-propagation ṽ ↦ h_j′(w)ᵀṽ. This costs about twice a gradient evaluation. Moreover, if l is strictly convex, l″ is positive definite, and thus this term is positive semi-definite as well:

    vᵀh′ᵀl″h′v = (h′v)ᵀl″(h′v) ≥ 0.

The second term is trickier to evaluate:

    l′(z)h″(w)v.

A naive implementation would first form the matrix h″(w)v, to which l′(z) is applied. However, efficient implementations for this summand are also available. There are no reasonable assumptions that guarantee positive definiteness of this term.
If ReLUs are used, however, it is questionable whether h″(w) makes sense. Recall that h consists of piecewise linear functions α(z) = max(z, 0) and bilinear functions z₊ = W̃z + w̃_0. While α′ may be seen as a step function, it is hard to give α″ a meaningful interpretation.
In efficient implementations, for each sample the cost is moderately higher than a gradient evaluation (factor γ_sample ∼ 3−5).
We then obtain the total work:

    work(F′(w)) (1 + γ_sample mK/n),

and thus the tradeoff (if the work is kept constant):
• small K and large m yields an inaccurate solution of the system for an accurate M.
• large K and small m yields an accurate solution of the system for an inaccurate M. This is, however, not advocated in the literature and is called "minibatch-overfitting".
Recall that the theoretically optimal choice of stepsize is given as follows:

    α = 1/(L(1 + V[∆w]/∥F′(w)∥∗²)),

yielding an expected decrease by a factor θ:

    θ = 1 − µα = 1 − 1/(κ(1 + V[∆w]/∥F′(w)∥∗²)).

Our algorithmic task would then be to balance the following quantities (a small numerical illustration of Hessian-vector products follows after this list):

• minibatch size n for F′(w): reduction of V[∆w] ∼ 1/n
• minibatch size m for F″(w): reduction of κ to the "ideal" case M ≈ F″(w). Close to the solution κ ≈ κ_M(F″(w)), which would then govern linear convergence.
• number of iterations K for the cg method. The speed of convergence depends on the condition number κ_P(M) of M relative to the preconditioner P.
• stepsize α
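When an exact Hessian-vector product is inconvenient to implement, a common matrix-free stand-in (a generic numerical device, not taken from these notes) is the finite-difference approximation F″(w)v ≈ (F′(w + tv) − F′(w))/t, which costs one extra gradient evaluation per cg step:

import numpy as np

def hess_vec(grad, w, v, eps=1e-6):
    # finite-difference approximation of F''(w) v from two gradient evaluations
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return np.zeros_like(v)
    t = eps / nv
    return (grad(w + t * v) - grad(w)) / t

# check on F(w) = 1/2 ||A w||^2, where F''(w) v = A^T A v exactly
A = np.arange(6.0).reshape(3, 2)
grad = lambda w: A.T @ (A @ w)
w, v = np.ones(2), np.array([1.0, -1.0])
print(hess_vec(grad, w, v), A.T @ A @ v)   # approximately equal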

Matrix Chernoff bounds
Just as in the case of function evaluations, a probabilistic measure for the condition number of M = F_m″(w) = (1/m) ∑ A_j, relative to A = F″(w), would be helpful. One possible probability theoretic tool are the so-called matrix Chernoff bounds.

Lemma 4.4.1. Consider symmetric matrices A_j ∈ ℝ^{d×d}, j = 1 . . . m. Assume that all matrices are drawn independently and that E[A_j] = A.
Let 0 < λ_min ≤ λ_max be the extreme eigenvalues of A and κ̄ = λ_max/λ_min its condition number. Assume further a bound on the extreme eigenvalues of all A_j:

    0 ≤ λ_min(A_j) ≤ λ_max(A_j) ≤ Rλ_max.

Then for the extreme eigenvalues 0 ≤ λ′_min ≤ λ′_max of (1/m) ∑ A_j we obtain:

    P(λ′_min ≤ (1 − δ)λ_min) ≤ d (e^{−δ}/(1 − δ)^{1−δ})^{m/(κ̄R)} = d Θ₋^{m/(κ̄R)},   δ ∈ ]0, 1[,
    P(λ′_max ≥ (1 + δ)λ_max) ≤ d (e^{δ}/(1 + δ)^{1+δ})^{m/R} = d Θ₊^{m/R},   δ > 0,

with 0 < Θ₋, Θ₊ < 1.

To obtain a moderately good approximation of the extreme eigenvalues we have to fix δ moderately small and obtain Θ not too close to 1. If we want to achieve a probability p we get the equations (for the small eigenvalue, and similarly for the large one):

    p = d Θ₋^{m/(κ̄R)}  ⇒  m = κ̄R ln(p/d)/ln Θ₋.

We observe that the number of samples depends linearly on the condition number. However, the above results are not exactly what we need for a good theoretical understanding. Rather, bounds on

    P(κ_M(F″(w)) ≤ C),   where M = (1/m) ∑ A_j,

would be of interest. Unfortunately, I am not aware of any such bounds in the literature. Also, appropriate assumptions under which such an estimate should hold seem to be unclear.

4.5 Generalized Gauss-Newton methods

The idea of generalized Gauss-Newton methods is to drop the term l′(z)h″(w) in the formula

    f″(w)v = h′(w)ᵀl″(z)h′(w)v + l′(z)h″(w)v,

and thus obtain the so-called generalized Gauss-Newton matrix:

    G(w) := h′(w)ᵀl″(z)h′(w).

The classical Gauss-Newton method uses l(z, y) = ½(z − y)², so l″ = I and

    G_class(w) = h′(w)ᵀh′(w).

In total, we then have (using sub-sampling):

    M^GN(w) := (1/m) ∑_{j=1}^m G_j(w) + ρ″(w) = (1/m) ∑_{j=1}^m h′(x_j, w)ᵀ l″(z) h′(x_j, w) + ρ″(w).

This yields several benefits:

• The computational cost and implementation effort of evaluating l′(z)h″(w)v is saved, as well as the problem with the ReLUs.
• M is always positive definite as long as l is convex and ρ is strictly convex. This is very frequently the case.
• Close to a minimizer, we obtain 0 ≈ F′(w) = R′(w) + ρ′(w) = N⁻¹ ∑_i l_i′(z)h_i′(w) + ρ′(w). Thus, if ρ′(w) is small, we expect ∑_i l_i′(z)h_i′(w) to be small as well. Recall that loss functions are often constructed such that l′(z, y_i) is small if z ≈ y_i, or if z is correctly classified. Thus, G(w) can be a good approximation of F″(w) close to the solution. In practice one often observes rather fast linear convergence.
• Used with appropriate globalization and preconditioning techniques, this class of methods is competitive, in particular for ill-conditioned, difficult problems.

A small illustration of the Gauss-Newton matrix-vector product follows below.
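A minimal numpy sketch of the product Gv = h′(w)ᵀ l″(z) (h′(w)v) for one sample in the classical case l″ = I (the two-parameter model and its Jacobian are made up; forming the Jacobian explicitly is only for illustration, in practice one uses a forward and a backward propagation instead):

import numpy as np

def model_jacobian(w, x):
    # made-up model h(x; w) = (w_0*x, w_1*x^2) with its Jacobian dh/dw
    return np.array([[x, 0.0],
                     [0.0, x**2]])

def ggn_matvec(w, x, v):
    J = model_jacobian(w, x)
    return J.T @ (J @ v)        # h'(w)^T l''(z) h'(w) v with l'' = I

print(ggn_matvec(np.array([1.0, 2.0]), x=3.0, v=np.array([1.0, 1.0])))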

4.6 K-FAC Preconditioner


In the following we consider the case ρ = 0 for simplicity. In this section we construct a preconditioner M that is a reasonable approximation of M^GN and for which M⁻¹v can be evaluated efficiently. It is known under the name Kronecker-Factored Approximate Curvature and its application does not require a cg iteration.
First of all, we recall that in DNNs the space of parameters W is partitioned by layers, and in each layer k = 1 . . . L there are the parameters (W̃_k, w̃_{0,k}) = W_k (weights and bias) with perturbations δW_k, which we will sometimes also write as vectors δw_k by rearranging the entries of δW_k into a column.
We would like to simplify the following bilinear form:

    δwᵀ M^GN δw := (1/n) ∑_{j=1}^n δwᵀ G_j(w) δw = (1/n) ∑_{j=1}^n (h′(x_j, w)δw)ᵀ l″(z) (h′(x_j, w)δw),

where δw consists of all perturbations δW_k. If M^GN is considered as a matrix, then the partitioning into layers yields a partitioning of M^GN into an L × L block matrix.

First simplification to a block diagonal form

As a first simplification we will consider only the diagonal blocks M_{k,k}^GN of M^GN and drop all off-diagonal blocks. We thus define the simplified bilinear form as follows:

    δwᵀ M^GN δw ≈ δwᵀ M^D δw = ∑_{k=1}^L δw_kᵀ M_{k,k}^GN δw_k.

In order to apply (M^D)⁻¹ we only have to invert each block M_{k,k}^GN, which is given by

    δw_kᵀ M_{k,k}^GN δw_k = (1/n) ∑_{j=1}^n (∂_{W_k} h(x_j, w)δW_k)ᵀ l″(z) (∂_{W_k} h(x_j, w)δW_k).

Our computation thus amounts to the following minimization problem:

    min_{δw_k} ½ δw_kᵀ M_{k,k}^GN δw_k + F_m′(w)δw_k,

or

    min_{δW_k} (1/n) ∑_{j=1}^n ( ½ (∂_{W_k} h(x_j, w)δW_k)ᵀ l″(z) (∂_{W_k} h(x_j, w)δW_k) + l′(z) ∂_{W_k} h(x_j, w)δW_k ).

By the chain rule and the representation h̃(z, W̃, w̃_0) = α(W̃z + w̃_0) we obtain the following representation:

    ∂_{W_k} h(x, w)δW_k = ∂_{z_k} h(x, w) D_k δW_k h_{k−1}(x, w),

and thus, with

    p_j := h_{k−1}(x_j, w)   (intermediate variables, computed by forward propagation),
    q_jᵀ := l′(z) ∂_{z_k} h(x_j, w) D_k   (computed by backward propagation),
    A_j := D_kᵀ ∂_{z_k} h(x_j, w)ᵀ l″(z) ∂_{z_k} h(x_j, w) D_k   (modified backward propagation),

we obtain

    min_{δW_k} (1/n) ∑_{j=1}^n ( ½ p_jᵀ δW_kᵀ A_j δW_k p_j + q_jᵀ δW_k p_j ),

and thus by differentiation in an arbitrary direction H:

    0 = (1/n) ∑_{j=1}^n (p_jᵀ δW_kᵀ A_j + q_jᵀ) H p_j.

Lemma 4.6.1. Let v_j ∈ V, w_j ∈ W be some vectors for j = 1 . . . n. Then we have the following equivalence:

    ∑_{j=1}^n v_j w_jᵀ = 0  ⇔  ∑_{j=1}^n w_jᵀ H v_j = 0   for all matrices H : V → W.

Proof. Let e_k ∈ W, e_l ∈ V be unit vectors, such that H_kl = e_k e_lᵀ is a matrix with precisely one entry equal to 1. Then we compute:

    ∑_{j=1}^n w_jᵀ H_kl v_j = ∑_{j=1}^n (w_jᵀ e_k)(e_lᵀ v_j) = ∑_{j=1}^n (e_lᵀ v_j)(w_jᵀ e_k) = e_lᵀ (∑_{j=1}^n v_j w_jᵀ) e_k.

Thus, the entry (∑_{j=1}^n v_j w_jᵀ)_{lk} vanishes iff ∑_{j=1}^n w_jᵀ H_kl v_j = 0. This implies our stated equivalence (taking into account that the set of H_kl is a basis of the space of linear mappings L(V, W)).
Hence, we have to solve the following linear matrix equation for δW k :
n n n
1X T 1X 1X
0= pj pTj δW k Aj + pj qjT ⇔ Aj δW k pj pTj = − qj pTj .
n j=1 n j=1 n j=1

Second simplification: averaged A


This system is still too large to solve. It is a linear system with as many degrees of freedom as δW k has.
It is thus a system of dimension din · dout of the corresponding level k.
Thus, we need an additional simplification step: we replace each Aj by its mean (over a possibly
smaller number m < n of samples):
\[
A_j \to A := \frac{1}{m}\sum_{j=1}^{m} A_j \qquad\text{and}\qquad \frac{1}{n}\sum_{j=1}^{n} A_j\, \delta W_k\, p_j p_j^T \to A\, \delta W_k\, \frac{1}{n}\sum_{j=1}^{n} p_j p_j^T .
\]

Then we can pull A out of the sum and get the following simplified system:
\[
A\, \delta W_k^{kfac} \Big( \frac{1}{n}\sum_{j=1}^{n} p_j p_j^T \Big) = -\frac{1}{n}\sum_{j=1}^{n} q_j p_j^T .
\]

Thus, if we define
\[
P := \frac{1}{n}\sum_{j=1}^{n} p_j p_j^T , \qquad Q := \frac{1}{n}\sum_{j=1}^{n} q_j p_j^T ,
\]

we can now solve the following system:
\[
A\, \delta W_k^{kfac} P = -Q \quad\Leftrightarrow\quad \delta W_k^{kfac} P = -A^{-1} Q \quad\Leftrightarrow\quad \delta W_k^{kfac} = -A^{-1} Q P^{-1}
\]

under the assumption that the symmetric positive semi-definite matrices P and A are invertible. Thus,
we only have to compute Cholesky factorizations of the matrices P ∈ R^{d_in × d_in} and A ∈ R^{d_out × d_out}.
While A is likely to be positive definite as a sum of positive definite matrices, P is a sum of rank-1
matrices, so there should be at least n ≥ d_in addends. Otherwise, we may replace P^{-1} by a
pseudo-inverse P^+.
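To make this concrete, here is a minimal NumPy/SciPy sketch of this layer-wise update. The function name, interface and damping strategy are ours, not those of any library; we assume that p, q and the averaged factor A have already been computed by the forward and (modified) backward propagations described above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kfac_layer_step(p, q, A, damping=1e-4):
    """One K-FAC update for a single layer: solve A dW P = -Q (a sketch).

    p : (n, d_in)  activations p_j = h_{k-1}(x_j, w) from forward propagation
    q : (n, d_out) vectors q_j from backward propagation
    A : (d_out, d_out) averaged curvature factor (mean of the A_j)
    Returns dW of shape (d_out, d_in), i.e. dW = -A^{-1} Q P^{-1}.
    """
    n = p.shape[0]
    P = p.T @ p / n                  # P = (1/n) sum_j p_j p_j^T
    Q = q.T @ p / n                  # Q = (1/n) sum_j q_j p_j^T
    # Tikhonov damping keeps both factors invertible; P is rank deficient
    # whenever n < d_in, as noted above.
    cP = cho_factor(P + damping * np.eye(P.shape[0]))
    cA = cho_factor(A + damping * np.eye(A.shape[0]))
    # Two Cholesky solves instead of forming inverses explicitly:
    return -cho_solve(cA, cho_solve(cP, Q.T).T)   # -A^{-1} Q P^{-1}
```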
Chapter 5

Momentum

Some moderate reduction of variance and condition number can also be achieved by alternative, perhaps
less sophisticated strategies. In the following we will discuss two of them.

5.1 The dynamics of the heavy-ball method


The following method is called the heavy-ball method:

wk+1 = wk − α∇F (wk ) + β(wk − wk−1 ).

Its name is due to a physical interpretation of the second term as a momentum or inertia term:
We may interpret the gradient method (β = 0) as an explicit Euler method (with stepsize τ = α) for
the first order equation:
ẇ = −∇F (w)
For stiff problems (where the eigenvalues of −(∇F )′ = −F ′′ are large) it is known that the explicit
Euler method needs small step sizes to be stable, while small eigenvalues of F ′′ yield slow convergence
of w towards the fixed point ∇F = 0. Again we observe that badly conditioned problems yield slowly
converging gradient methods.
To be more concrete, since F''(w) is spd, we can choose a basis (the same for w and v), such that
F''(w) = D = diag(d_1, . . . , d_n) is diagonal. Then the speed of convergence of the trajectory is proportional
to the smallest eigenvalue d_1, while the maximal stable step-size is proportional to 1/d_n. Thus, the
required number of steps is proportional to
\[
\kappa = \frac{d_n}{d_1} .
\]
From the point of view of dynamics, we may want to damp the overly oscillatory components of the right
hand side.
The heavy-ball method can be seen as a discretization of the following second order equation:

ẅ + ρẇ = −∇F (w)

or alternatively of the system:

ẇ = v
v̇ = −∇F (w) − ρv

The Jacobian matrix of this system reads:
\[
\begin{pmatrix} 0 & I \\ -F''(w) & -\rho I \end{pmatrix}
\]

In the diagonalizing basis our system decouples into 2 × 2 blocks, and the Jacobian matrix reads:
\[
\begin{pmatrix} 0 & I \\ -D & -\rho I \end{pmatrix} \qquad\text{with blocks}\qquad \begin{pmatrix} 0 & 1 \\ -d_i & -\rho \end{pmatrix}
\]
We may compute the eigenvalues of this 2 × 2 system:
\[
0 = \lambda(\lambda + \rho) + d_i \quad\Rightarrow\quad \lambda_{1,2} = -\frac{\rho}{2} \pm \sqrt{\rho^2/4 - d_i}
\]
If we choose ρ such that ρ²/4 − d_i < 0, we obtain conjugate complex eigenvalues with:
\[
\operatorname{Re} \lambda_{1,2} = -\frac{\rho}{2} , \qquad |\lambda_{1,2}| = \sqrt{d_i} .
\]

Choosing ρ := 2√d_1, based on the smallest eigenvalue, guarantees this for all eigenvalues. The maximal
absolute value of our eigenvalues is now √d_n.
Hence, just as above, the required number of steps is proportional to
\[
\sqrt{\kappa} = \frac{\sqrt{d_n}}{\sqrt{d_1}} .
\]
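As a minimal sketch (the function name and interface are ours), the heavy-ball iteration can be written in a few lines of Python; we will reuse this routine in the comparison at the end of the next section.

```python
import numpy as np

def heavy_ball(grad_F, w0, alpha, beta, iters=1000):
    """Heavy-ball iteration: w_{k+1} = w_k - alpha*grad_F(w_k) + beta*(w_k - w_{k-1})."""
    w_prev = np.asarray(w0, dtype=float).copy()
    w = w_prev.copy()
    for _ in range(iters):
        w_next = w - alpha * grad_F(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w
```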

5.2 First and second order fixed point iterations


To understand the idea of momentum, we consider the following quadratic problem:
\[
\min_w q(w) := \frac{1}{2} w^T A w + l^T w \quad\Leftrightarrow\quad q'(w) = A w + l = 0 ,
\]
where A is an spd matrix. For simplicity we do not consider a preconditioner, which would, however, be
possible.
The gradient method for this problem can be written as:
wk+1 = wk − αq ′ (wk ) = wk − α(Awk + l)
This in turn yields a first order fixed point method:
wk+1 − w∗ = wk − w∗ − αA(wk − w∗ ) = (I − αA)(wk − w∗ ).
We may adjust the parameter α to obtain fast convergence. Defining z = w − w∗ this yields:
zk+1 = (I − αA)zk .
A second order fixed point method is defined as follows:
wk+1 = wk − αq ′ (wk ) + β(wk − wk−1 )
or, similar as above:
zk+1 = (I − αA)zk + β(zk − zk−1 ).
This can be rewritten as a two-component recursion:
\[
\begin{pmatrix} z_{k+1} \\ z_k \end{pmatrix} = \begin{pmatrix} (1+\beta)I - \alpha A & -\beta I \\ I & 0 \end{pmatrix} \begin{pmatrix} z_k \\ z_{k-1} \end{pmatrix}
\]

Spectral analysis
Consider a diagonalization of A:
\[
S^{-1} A S = D = \mathrm{diag}(d_1, \dots, d_n) \quad\Rightarrow\quad A = S D S^{-1} .
\]
Let us denote by d_1 = γ and d_n = Γ the smallest and largest eigenvalues, respectively.

Convergence of first order iteration. We obtain for the first order method:
zk+1 = S(I − αD)S −1 zk
and thus for v = S −1 z:
vk+1 = (I − αD)vk ,
so we obtain componentwise:
\[
v_{k+1}^i = (1 - \alpha t)\, v_k^i , \qquad t = d_i \in [\gamma, \Gamma]
\]
and hence:
\[
|v_{k+1}^i| \le \max_i |1 - \alpha d_i|\, |v_k^i| \le \max_{t \in [\gamma,\Gamma]} |1 - \alpha t|\, |v_k^i| .
\]

To obtain the fastest possible convergence, we have to figure out:


\[
\min_\alpha \max_{t \in [\gamma,\Gamma]} |1 - \alpha t| = \min_\alpha \max\{ |1 - \alpha\Gamma|, |1 - \alpha\gamma| \}
\]

This is straightforward to compute. Choose α such that both terms in the maximum agree:
\[
\alpha = \frac{2}{\gamma + \Gamma} \quad\Rightarrow\quad |1 - \alpha t| \le 1 - \frac{2\gamma}{\gamma + \Gamma} = \frac{\Gamma - \gamma}{\Gamma + \gamma} = \frac{\kappa - 1}{\kappa + 1}
\]

Convergence of second order iteration. The same can be done for the second order iteration and
we get:
\[
\begin{pmatrix} v_{k+1} \\ v_k \end{pmatrix} = \begin{pmatrix} (1+\beta)I - \alpha D & -\beta I \\ I & 0 \end{pmatrix} \begin{pmatrix} v_k \\ v_{k-1} \end{pmatrix}
\]
and thus similarly:
\[
\begin{pmatrix} v_{k+1}^i \\ v_k^i \end{pmatrix} = \begin{pmatrix} 1 + \beta - \alpha t & -\beta \\ 1 & 0 \end{pmatrix} \begin{pmatrix} v_k^i \\ v_{k-1}^i \end{pmatrix} , \qquad t = d_i .
\]
To proceed, we have to compute the eigenvalues of the above matrix. The characteristic equation reads:
0 = λ(λ − (1 + β − αt)) + β
with eigenvalues:
\[
\lambda_{1,2} = \frac{(1 + \beta - \alpha t) \pm \sqrt{(1 + \beta - \alpha t)^2 - 4\beta}}{2}
\]
As above, we look for conjugate complex eigenvalues, which occur, if
0 ≥ (1 + β − αt)2 − 4β.
The right hand side is zero, if
\[
\alpha t = (\sqrt{\beta} - 1)^2 \quad\Rightarrow\quad \sqrt{\beta} - 1 = \pm\sqrt{\alpha t} \quad\Rightarrow\quad \beta = (1 \pm \sqrt{\alpha t})^2 .
\]
Thus, we obtain conjugate complex eigenvalues, iff
\[
\beta \in \big[(1 - \sqrt{\alpha t})^2 ,\ (1 + \sqrt{\alpha t})^2\big] \qquad \forall t \in [\gamma, \Gamma] ,
\]
which is implied by
\[
\beta := \max\{ (1 - \sqrt{\alpha\Gamma})^2 ,\ (1 - \sqrt{\alpha\gamma})^2 \} .
\]
In this case
\[
|\lambda_1| = |\lambda_2| = \sqrt{\beta} = \max\{ |1 - \sqrt{\alpha\Gamma}| ,\ |1 - \sqrt{\alpha\gamma}| \} .
\]
For a minimal choice of β we have to compute:
\[
\min_\alpha \max\{ |1 - \sqrt{\alpha\Gamma}| ,\ |1 - \sqrt{\alpha\gamma}| \}
\]

Again, we balance the two terms in the maximum by choosing:
\[
\alpha = \left( \frac{2}{\sqrt{\Gamma} + \sqrt{\gamma}} \right)^2 \quad\Rightarrow\quad \sqrt{\beta} = |\lambda_1| \le 1 - \frac{2\sqrt{\gamma}}{\sqrt{\Gamma} + \sqrt{\gamma}} = \frac{\sqrt{\Gamma} - \sqrt{\gamma}}{\sqrt{\Gamma} + \sqrt{\gamma}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} .
\]

Comparison:
\begin{align*}
\text{First order:} \quad & \alpha = \frac{2}{\gamma + \Gamma} , & \beta &= 0 , & \Theta &\sim \frac{\kappa - 1}{\kappa + 1} \\
\text{Second order:} \quad & \alpha = \left( \frac{2}{\sqrt{\gamma} + \sqrt{\Gamma}} \right)^2 , & \beta &= \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2 , & \Theta &\sim \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}
\end{align*}

The difficulty is now, of course, to choose α and β adequately. While Γ can be estimated quite conveniently
a posteriori, γ is rather difficult to estimate but critical. In practice, trial and error is often
used.
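As an illustration, the following hypothetical experiment applies the heavy_ball sketch from Section 5.1 to an ill-conditioned diagonal quadratic, once with the first order and once with the second order parameter choice; all names and values here are our own.

```python
import numpy as np

# Ill-conditioned diagonal quadratic: grad q(w) = D w, minimizer w* = 0.
gamma, Gamma = 1.0, 1.0e4
d = np.logspace(0, 4, 100)          # eigenvalues of D in [gamma, Gamma]
grad = lambda w: d * w
w0 = np.ones_like(d)
kappa = Gamma / gamma

# First order choice: alpha = 2/(gamma + Gamma), beta = 0.
w_first = heavy_ball(grad, w0, 2.0 / (gamma + Gamma), 0.0, iters=2000)

# Second order choice: alpha = (2/(sqrt(gamma)+sqrt(Gamma)))^2,
# beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2.
alpha = (2.0 / (np.sqrt(gamma) + np.sqrt(Gamma))) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
w_second = heavy_ball(grad, w0, alpha, beta, iters=2000)

# The momentum variant is far closer to w* after the same number of steps.
print(np.linalg.norm(w_first), np.linalg.norm(w_second))
```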

5.3 Interpretation as a gradient aggregation method


Coming back to our original formula and keeping α and β fixed, we have:

wk+1 − wk = −α∇F (wk ) + β(wk − wk−1 ).

Using w1 = w0 − α∇F (w0 ) we can compute by induction:

\[
w_{k+1} - w_k = -\alpha \sum_{j=0}^{k} \beta^{k-j} \nabla F(w_j)
\]

We observe a weighted sum of gradients. In the case of stochastic gradients, this might yield a reduction
of variance, similar to gradient aggregation. However, a theoretical verification of this fact seems to be
missing so far.
In practice, methods with momentum are quite popular due to their simplicity and the improvement
they yield, compared to stochastic gradients.

A subtle distinction for varying preconditioners

We have seen that preconditioning (even if only diagonal) is an important technique to accelerate
optimization. Up to now, it did not matter if preconditioners M_k changed during the course of the
optimization, since classical gradient methods do not have a “memory”. This is different for methods
with momentum. Then the above equations read as follows (naming ∇_{M_k} F(w_k) := M_k^{-1} F'(w_k)):

\[
w_{k+1} - w_k = -\alpha \sum_{j=0}^{k} \beta^{k-j} M_j^{-1} F'(w_j) .
\]

This would be an exponentially weighted aggregation of gradients, which is equivalent to momentum
methods.
Alternatively, we could do an exponentially weighted aggregation of derivatives:
\[
w_{k+1} - w_k = -\alpha M_k^{-1} \Big( \sum_{j=0}^{k} \beta^{k-j} F'(w_j) \Big) .
\]

So we may write:
\[
w_{k+1} - w_k = -\alpha M_k^{-1} l_k \qquad\text{where}\qquad l_k = \beta\, l_{k-1} + F'(w_k) .
\]

Adam
The popular Adam algorithm chooses the latter variant. It employs for M_k the diagonal preconditioner
of RMSprop and exponentially weighted aggregation of derivatives.
Similarly to RMSprop, Adam uses a diagonal preconditioner via a diagonal matrix D_k, defined
recursively as follows:
\[
D_k = \lambda D_{k-1} + (1 - \lambda)\, \mathrm{diag}\big(F'(w_k)_i^2\big) , \qquad \lambda \in [0, 1] .
\]
Then for some µ > 0 the preconditioner is given by:
\[
M_k = \sqrt{D_k} + \mu I \qquad \text{(where the root is taken componentwise)}
\]

Given w_1
for k = 1, 2, . . .
    l_k = β l_{k−1} + (1 − β) F'(w_k)        (often a mini-batch gradient is used here)
    d_k = λ d_{k−1} + (1 − λ) F'(w_k)_i^2
    M_k = √(diag d_k) + µI                   (root taken componentwise)
    ∆w = −M_k^{−1} l_k
    w_{k+1} = w_k + α (1 − λ^k)/(1 − β^k) ∆w

The factors (1 − β^k) → 1 and (1 − λ^k) → 1 in the step size are used to compensate the so-called
“initialization bias”. This heuristic takes into account that the sums defining l_k and d_k are short in the
beginning and grow longer and longer.
The corresponding heuristic computation is as follows:
\[
l_k = (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} F'(w_i) = (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} F'(w_k) + \zeta = (1 - \beta^k) F'(w_k) + \zeta ,
\]

where
\[
\zeta = (1 - \beta) \sum_{i=1}^{k} \beta^{k-i} \big( F'(w_i) - F'(w_k) \big) .
\]

Assuming that E[ζ] = 0, we obtain

E[lk ] = (1 − β k )E[F ′ (wk )]

The same computation can be done for dk .


In recent years, Adam has probably become the method that is chosen as a default.
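A compact Python sketch of this scheme in the notation of the notes (β for derivative aggregation, λ for the diagonal preconditioner, µ as damping); the function name and interface are ours, not those of a library.

```python
import numpy as np

def adam(grad_F, w, alpha=1e-3, beta=0.9, lam=0.999, mu=1e-8, iters=1000):
    """Adam as sketched above; grad_F is typically a mini-batch gradient."""
    w = np.asarray(w, dtype=float)
    l = np.zeros_like(w)          # aggregated derivatives l_k
    d = np.zeros_like(w)          # aggregated squared derivatives d_k
    for k in range(1, iters + 1):
        g = grad_F(w)
        l = beta * l + (1.0 - beta) * g
        d = lam * d + (1.0 - lam) * g ** 2
        M = np.sqrt(d) + mu       # diagonal preconditioner, componentwise root
        # bias correction for the short sums in the beginning; note that the
        # original Adam paper applies a square root to the factor (1 - lam**k)
        w = w + alpha * (1.0 - lam ** k) / (1.0 - beta ** k) * (-l / M)
    return w
```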
Chapter 6

Non-smooth regularization terms

In the following we consider the minimization of problems of the form:
\[
\min_w F(w) := f(w) + \rho(w)
\]
where f is differentiable and ρ is convex and “easy to handle”. By this we mean the following:

For any f'(w) and L > 0 the local minimization problem (which has a unique minimizer)
\[
\min_{\delta w}\ p(\delta w) := f(w) + f'(w)\delta w + \rho(w + \delta w) + \frac{L}{2}\|\delta w\|^2 \tag{6.1}
\]
can be solved in an efficient way. Here the norm is a (maybe diagonally scaled) 2-norm.

This step is called a proximal gradient step.


Lemma 6.0.1.
• Assume that w∗ is a minimizer of F . Then δw = 0 is a minimizer of p at w∗ .
• Let f be convex and assume that δw = 0 is a minimizer of p at w∗ . Then w∗ minimizes F .
Proof. Since p is convex, we may compute for a minimizer δw ≠ 0 of p, with c := p(0) − p(δw) > 0:
\[
p(t\delta w) - p(0) = p\big((1-t)\cdot 0 + t\,\delta w\big) - p(0) \le (1-t)\,p(0) + t\,p(\delta w) - p(0) = t\big(p(\delta w) - p(0)\big) = -tc .
\]
Moreover:
\[
p(t\delta w) - p(0) = F(w_* + t\delta w) - F(w_*) + o(t) + \frac{L}{2}\|t\delta w\|^2 \ge -o(t) .
\]
For sufficiently small t this yields a contradiction to δw being a minimizer.
Consider the converse. Since f is convex, f ′ (w∗ )δw ≤ f (w∗ + δw) − f (w∗ ). Hence, with w = w∗ + δw:

F (w) − F (w∗ ) = f (w) + ρ(w) − f (w∗ ) − ρ(w∗ ) ≥ f ′ (w∗ )δw + ρ(w∗ + δw) − ρ(w∗ )

Since δw = 0 minimizes p, we have:
\[
f'(w_*)\delta w + \rho(w_* + \delta w) + \frac{L}{2}\|\delta w\|^2 \ge \rho(w_*)
\]
and thus:
\[
F(w_* + \delta w) \ge F(w_*) - \frac{L}{2}\|\delta w\|^2 \qquad \forall \delta w .
\]
2
Assume for contradiction that m := F (w∗ ) − F (w∗ + δw) > 0. Then

F (w∗ + tδw) = F ((1 − t)w∗ + t(w∗ + δw)) ≤ (1 − t)F (w∗ ) + tF (w∗ + δw) = F (w∗ ) − tm


On the other hand:
\[
F(w_* + t\delta w) \ge F(w_*) - \frac{L}{2}\|t\delta w\|^2 = F(w_*) - t\left[ \frac{Lt}{2}\|\delta w\|^2 \right] .
\]
The term in square brackets tends to 0 for t → 0, so that both inequalities contradict each other, if t is
sufficiently small.

6.1 Examples for non-smooth terms


Example: Positivity constraint
For example, we may impose positivity constraints on our variables. This can be achieved by:
\[
\rho(w) := \begin{cases} 0 & : w_i \ge 0 \ \forall i \\ \infty & : \text{otherwise} \end{cases}
\]

In this case (6.1) can be solved by the following componentwise considerations. Let δw be the minimizer
of p.
i) Assume that (w + δw)i > 0. Then ρ vanishes, and we have

0 = p′ (δw)i = f ′ (w)i + Lδwi

ii) Assume that (w + δw)i = 0. Then we only have

0 ≤ f ′ (w)i + Lδwi

Choosing a projected gradient step
\[
\delta w_i := \max\left\{ -w_i ,\ -\frac{1}{L} f'(w)_i \right\}
\]

(which can be computed very efficiently) satisfies these properties in both cases:
• In the case δwi = −wi , it follows that Lδwi ≥ −f ′ (w)i and thus ii) is satisfied
• Otherwise, δwi > −wi , so wi + δwi > 0 and f ′ (w)i + Lδwi = 0. So i) is fulfilled.
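In code, this projected gradient step is essentially a one-liner; a NumPy sketch (names are ours):

```python
import numpy as np

def prox_positivity(w, grad, L):
    """Proximal step for the positivity constraint:
    delta_w_i = max(-w_i, -grad_i / L), so that w + delta_w >= 0 holds."""
    return np.maximum(-w, -grad / L)
```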

Example: 1-Norm
\[
\rho(w) := \lambda \sum_i |w_i| , \qquad \lambda > 0
\]

This regularization term is often used together with a quadratic regression functional:
\[
\min_w \|Aw - d\|_2^2 + \rho(w) ,
\]
the so-called Lasso problem. Depending on the choice of λ, the solution w_* of this problem exhibits many
zeros.
To compute the proximal gradient step, let again δw be the minimizer of p. Then we have the cases:
i) Let (w + δw)i > 0. Then ρ′ (w + δw)i = λ

0 = p′ (δw)i = f ′ (w)i + λ + Lδwi

ii) Let (w + δw)i < 0. Then ρ′ (w + δw)i = −λ

0 = p′ (δw)i = f ′ (w)i − λ + Lδwi



iii) Assume that (w + δw)_i = 0. Then the subdifferential ∂ρ(w + δw)_i = [−λ, λ] and we have
\[
0 \in f'(w)_i + [-\lambda, \lambda] + L\,\delta w_i
\]

To satisfy these constraints we have to choose:
\[
\delta w_i = \begin{cases} -\frac{1}{L}\big(f'(w)_i + \lambda\big) & : f'(w)_i + \lambda < L w_i \\[2pt] -\frac{1}{L}\big(f'(w)_i - \lambda\big) & : f'(w)_i - \lambda > L w_i \\[2pt] -w_i & : \text{otherwise} \end{cases}
\]
This specific step is called a “shrinkage step”.


We observe that each line solves the corresponding case, listed above. For example, our first choice
yields 0 = f ′ (w)i + λ + Lδwi , and thus −Lδwi < Lwi , implying 0 < wi + δwi .
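The shrinkage step can be vectorized in the same way; again a sketch with our own naming:

```python
import numpy as np

def prox_l1(w, grad, L, lam):
    """Shrinkage step for rho(w) = lam * sum_i |w_i|, following the case
    distinction above; returns delta_w."""
    return np.where(grad + lam < L * w, -(grad + lam) / L,
           np.where(grad - lam > L * w, -(grad - lam) / L, -w))
```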

6.2 Convergence of proximal gradients


In the following, we say that a convex function F is σ-convex, if:
\[
F(w) \ge F(w_0) + \ell(w - w_0) + \frac{\sigma}{2}\|w - w_0\|^2 ,
\]
where ℓ is any linear function such that F(w) ≥ F(w_0) + ℓ(w − w_0) for all w. If F is differentiable, we
may use ℓ = F'(w_0).
If we add to F a quadratic term (ν/2)∥w − w_0∥², it becomes (σ + ν)-convex; as long as ν ≥ −σ,
the result is still convex. If w_* is a minimizer of F, then:
\[
F(w) \ge F(w_*) + \frac{\sigma}{2}\|w - w_*\|^2 .
\]
Lemma 6.2.1. Assume that F is σ-convex. Then
\[
F((1-\lambda)x + \lambda y) \le (1-\lambda)F(x) + \lambda F(y) - \frac{\sigma}{2}\lambda(1-\lambda)\|x - y\|^2 .
\]
Proof. Since g(x) := F(x) − (σ/2)∥x − y∥² is convex and (1 − λ)x + λy − y = (1 − λ)(x − y), we may compute:
\begin{align*}
F((1-\lambda)x + \lambda y) - \frac{\sigma}{2}(1-\lambda)^2\|x - y\|^2 &= g((1-\lambda)x + \lambda y) \\
&\le (1-\lambda)g(x) + \lambda g(y) = (1-\lambda)F(x) + \lambda F(y) - \frac{\sigma}{2}(1-\lambda)\|x - y\|^2 .
\end{align*}
This implies our assertion.
The convergence theory of proximal gradient methods hinges on the following fundamental lemma:
Lemma 6.2.2. Assume that for given L, δw has been computed at a point w_0 as a proximal gradient step,
such that with w_+ = w_0 + δw:
\[
f(w_+) \le f(w_0) + f'(w_0)\delta w + \frac{L}{2}\|\delta w\|^2 \tag{6.2}
\]
then we obtain:
\[
F(w) - F(w_+) \ge \frac{L}{2}\big( \|w - w_+\|^2 - \|w - w_0\|^2 \big) + \delta f(w, w_0) \qquad \forall w
\]
with δf(w, w_0) = f(w) − f(w_0) − f'(w_0)(w − w_0). If f is σ-convex, we get
\[
F(w) - F(w_+) \ge \frac{L}{2}\|w - w_+\|^2 - \frac{L - \sigma}{2}\|w - w_0\|^2 \qquad \forall w .
\]
In particular we have the decrease condition:
\[
F(w_+) \le F(w_0) - \frac{L}{2}\|w_+ - w_0\|^2 .
\]

Proof. By definition, δw minimizes:
\[
p(\delta w) = f(w_0) + f'(w_0)\delta w + \rho(w_0 + \delta w) + \frac{L}{2}\|\delta w\|^2 .
\]
Since p is L-strongly convex, this implies for any w:
\[
p(w - w_0) - p(\delta w) \ge \frac{L}{2}\|w - w_0 - \delta w\|^2 = \frac{L}{2}\|w - w_+\|^2 .
\]
Then, by assumption:
\[
F(w_+) = f(w_+) + \rho(w_+) \le f(w_0) + f'(w_0)\delta w + \frac{L}{2}\|\delta w\|^2 + \rho(w_+) = p(\delta w) .
\]
So we obtain:
\[
p(w - w_0) - F(w_+) \ge \frac{L}{2}\|w - w_+\|^2 .
\]
Inserting the definition of p(w − w_0) we obtain:
\begin{align*}
f(w_0) + f'(w_0)(w - w_0) + \rho(w) + \frac{L}{2}\|w - w_0\|^2 - F(w_+) &\ge \frac{L}{2}\|w - w_+\|^2 \\
f(w) + \rho(w) + \frac{L}{2}\|w - w_0\|^2 - F(w_+) &\ge \frac{L}{2}\|w - w_+\|^2 + f(w) - f(w_0) - f'(w_0)(w - w_0) ,
\end{align*}
and hence the desired result.

Given: w_1, L
for k = 1, 2, . . .
    do
        increase L
        δw ← minimizer of p at w_k with parameter L
    while (6.2) is violated
    w_{k+1} = w_k + δw
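A Python sketch of this loop (names and interface are ours; prox_step stands for one of the proximal steps from Section 6.1, e.g. prox_positivity, or prox_l1 with λ fixed):

```python
import numpy as np

def proximal_gradient(f, grad_f, prox_step, w, L=1.0, iters=100):
    """Proximal gradient method with backtracking on L, enforcing (6.2)."""
    w = np.asarray(w, dtype=float)
    for _ in range(iters):
        g = grad_f(w)
        dw = prox_step(w, g, L)
        # increase L until the decrease condition (6.2) holds
        while f(w + dw) > f(w) + g @ dw + 0.5 * L * (dw @ dw):
            L *= 2.0
            dw = prox_step(w, g, L)
        w = w + dw
    return w

# e.g. for the Lasso problem:
# proximal_gradient(f, grad_f, lambda w, g, L: prox_l1(w, g, L, lam=0.1), w0)
```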
Theorem 6.2.3. Assume that f is σ-convex, and assume that L is chosen in a way that (6.2) is fulfilled.
Then:
\begin{align*}
\|w_{k+1} - w_*\|^2 &\le \left( 1 - \frac{\sigma}{L} \right) \|w_k - w_*\|^2 \\
F(w_{k+1}) - F(w_*) &\le \frac{L}{2}\|w_{k+1} - w_*\|^2
\end{align*}
If an iterative algorithm keeps L_k bounded, say by L, then we obtain linear convergence:
\begin{align*}
\|w_k - w_*\|^2 &\le \left( 1 - \frac{\sigma}{L} \right)^k \|w_0 - w_*\|^2 \\
F(w_k) - F(w_*) &\le \frac{L}{2} \left( 1 - \frac{\sigma}{L} \right)^k \|w_0 - w_*\|^2
\end{align*}
Proof. Inserting w_* into our main estimate, we obtain:
\[
0 \ge F(w_*) - F(w_{k+1}) \ge \frac{L}{2}\|w_* - w_{k+1}\|^2 - \frac{L - \sigma}{2}\|w_* - w_k\|^2 \tag{6.3}
\]
Thus, we obtain immediately our first result:
\[
\frac{L}{2}\|w_* - w_{k+1}\|^2 \le \frac{L - \sigma}{2}\|w_* - w_k\|^2 .
\]
Our second result follows by neglecting the negative term in (6.3).

We know that (6.2) holds, if L is chosen as the Lipschitz constant of the gradient. In that case, we
may again define:
\[
\kappa := \frac{L}{\sigma}
\]
and retain our well-known convergence factor, but now w.r.t. errors:
\[
\|w_{k+1} - w_*\|^2 \le \left( 1 - \frac{1}{\kappa} \right) \|w_k - w_*\|^2 .
\]

6.3 Nesterov Acceleration
Here we present an alternative to momentum methods, which works in the non-smooth case and admits
a nice convergence result.

Given: w_1, y_1 = w_1, L
for k = 1, 2, . . .
    do
        increase L
        δw ← minimizer of p at y_k with parameter L
    while (6.2) is violated
    w_{k+1} = y_k + δw
    y_{k+1} = w_{k+1} + β_{k+1} δw

The difference to a momentum-based method is small; a code sketch is given below.
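A Python sketch of the accelerated loop with fixed β (names are ours; prox_step as in the previous listing):

```python
import numpy as np

def accelerated_prox_gradient(f, grad_f, prox_step, w, beta, L=1.0, iters=100):
    """Nesterov-accelerated proximal gradient method (sketch, fixed beta)."""
    w = np.asarray(w, dtype=float)
    y = w.copy()
    for _ in range(iters):
        g = grad_f(y)
        dw = prox_step(y, g, L)
        while f(y + dw) > f(y) + g @ dw + 0.5 * L * (dw @ dw):
            L *= 2.0                    # backtracking: enforce (6.2) at y
            dw = prox_step(y, g, L)
        w = y + dw                      # w_{k+1} = y_k + delta_w
        y = w + beta * dw               # y_{k+1} = w_{k+1} + beta * delta_w
    return w
```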


The general inequality
\[
F(w) - F(w_+) \ge \frac{L}{2}\|w - w_+\|^2 - \frac{L - \sigma}{2}\|w - w_0\|^2 \qquad \forall w ,
\]
inserting y_0 ∼ y_k for w_0 (with w_0 ∼ w_k and w_+ = w_{k+1}), can be rewritten as follows:
\[
F(w_+) - F(w_*) + \frac{L}{2}\|w - w_+\|^2 \le F(w) - F(w_*) + \frac{L - \sigma}{2}\|w - y_0\|^2 \qquad \forall w .
\]
Setting
\[
E(w) := \frac{2}{L}\big( F(w) - F(w_*) \big) \qquad\text{and}\qquad q := \frac{1}{\kappa} = \frac{\sigma}{L}
\]
we get:
\[
E(w_+) + \|w - w_+\|^2 \le E(w) + (1 - q)\|w - y_0\|^2 \qquad \forall w .
\]
Now we choose
\[
w := w_0^\lambda := \lambda w_* + (1 - \lambda) w_0
\]
and write
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le E(w_0^\lambda) + (1 - q)\|w_0^\lambda - y_0\|^2 . \tag{6.4}
\]

Our aim is to choose λ (and then β) to achieve
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le \Theta\big( E(w_0) + \|w_-^\lambda - w_0\|^2 \big)
\]
where 0 < Θ < 1 is as small as possible. Here w_-^λ = λw_* + (1 − λ)w_{k−1}.

Lemma 6.3.1. Setting λ = √q and β = (1 − λ)/(1 + λ) we obtain
\[
\Theta = 1 - \lambda = 1 - \sqrt{q} ,
\]
if k > 1.

Proof. We start with our estimate and observe that by our definition (1 − q) = (1 + λ)(1 − λ):
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le E(w_0^\lambda) + (1 - \lambda)(1 + \lambda)\|w_0^\lambda - y_0\|^2 .
\]
By construction E is 2q-convex with E(w_*) = 0, so by Lemma 6.2.1 we obtain:
\[
E(w_0^\lambda) \le (1 - \lambda)E(w_0) - q\lambda(1 - \lambda)\|w_* - w_0\|^2 ,
\]
and thus the overall estimate:
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le (1 - \lambda)\Big( E(w_0) + (1 + \lambda)\|w_0^\lambda - y_0\|^2 - \lambda^3\|w_0 - w_*\|^2 \Big) . \tag{6.5}
\]
Next, we split:
\[
\|w_0^\lambda - y_0\|^2 = \|\lambda(w_* - w_0) + (w_0 - y_0)\|^2 = \lambda^2\|w_* - w_0\|^2 + 2\lambda\langle w_* - w_0, w_0 - y_0\rangle + \|w_0 - y_0\|^2
\]
to obtain, using (1 + λ)λ² − λ³ = λ², y_0 − w_0 = β(w_0 − w_-) and (1 + λ)β = (1 − λ):
\begin{align*}
&(1 + \lambda)\|w_0^\lambda - y_0\|^2 - \lambda^3\|w_0 - w_*\|^2 \\
&\quad= \lambda^2\|w_* - w_0\|^2 + 2(1 + \lambda)\lambda\langle w_* - w_0, w_0 - y_0\rangle + (1 + \lambda)\|w_0 - y_0\|^2 \\
&\quad\le \lambda^2\|w_* - w_0\|^2 + 2(1 + \lambda)\lambda\beta\langle w_* - w_0, w_- - w_0\rangle + (1 + \lambda)^2\beta^2\|w_- - w_0\|^2 \\
&\quad= \lambda^2\|w_* - w_0\|^2 + 2(1 - \lambda)\lambda\langle w_* - w_0, w_- - w_0\rangle + (1 - \lambda)^2\|w_- - w_0\|^2 \\
&\quad= \|\lambda(w_* - w_0) + (1 - \lambda)(w_- - w_0)\|^2 = \|w_-^\lambda - w_0\|^2 .
\end{align*}
Hence, as desired:
\[
E(w_+) + \|w_0^\lambda - w_+\|^2 \le (1 - \lambda)\big( E(w_0) + \|w_-^\lambda - w_0\|^2 \big) .
\]

Theorem 6.3.2. Choosing β = (√κ − 1)/(√κ + 1) we obtain the speed of convergence:
\[
F(w_{k+1}) - F(w_*) \le \left( 1 - \frac{1}{\sqrt{\kappa}} \right)^k \left( F(w_0) - F(w_*) + \frac{\sigma}{2}\|w_1 - w_*\|^2 \right) .
\]
Proof. To obtain our desired estimate, we have to consider k = 1, where y_0 = w_0 = w_1. This yields
\[
w_0^\lambda - y_0 = \lambda w_* + (1 - \lambda)w_0 - w_0 = \lambda(w_* - w_0) .
\]
With (6.5) and λ² = σ/L we get:
\[
E(w_2) + \|w_1^\lambda - w_2\|^2 \le (1 - \lambda)\big( E(w_1) + (1 + \lambda)\lambda^2\|w_* - w_1\|^2 - \lambda^3\|w_* - w_1\|^2 \big) = (1 - \lambda)\Big( E(w_1) + \frac{\sigma}{L}\|w_* - w_1\|^2 \Big) .
\]
With Lemma 6.3.1, induction and rescaling we obtain the desired result.
Unfortunately, it is the same as with momentum methods here: σ and thus κ are not known a priori, so it
is hard to find the right value of β. If σ is underestimated (β too large), we obtain a suboptimal rate
of convergence; if σ is overestimated (β too small), our result no longer applies, which means that the method
might not converge.

Restarts
A practical way to dispense with an estimate of σ is to use restarts:
• Choose β = 1 during the iteration.
• If a certain condition is fulfilled, set β_{k+1} = 0 once. This has the same effect as restarting the
iteration.
• Possible criteria (a code sketch follows after this list):
    f(w_{k+1}) > f(w_k)          restart at ascent, 0th order
    ⟨δw, y_k − w_k⟩ < 0          restart at ascent, 1st order
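Building on the accelerated sketch above, the first order criterion amounts to a simple restart test before the step is taken (fixed L for brevity; names are ours):

```python
import numpy as np

def apg_with_restart(grad_f, prox_step, w, beta, L=1.0, iters=100):
    """Accelerated loop from above, extended by the 1st order restart test."""
    w = np.asarray(w, dtype=float)
    y = w.copy()
    for _ in range(iters):
        dw = prox_step(y, grad_f(y), L)
        if dw @ (y - w) < 0:        # step opposes the momentum direction
            y = w.copy()            # restart: same effect as beta_{k+1} = 0
            dw = prox_step(y, grad_f(y), L)
        w = y + dw
        y = w + beta * dw
    return w
```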

Acceleration for (not strictly) convex functions

If f is not strictly convex, one can still show a convergence result for the ordinary proximal gradient
method:
\[
F(w_k) - F(w_*) \sim \frac{1}{k}
\]
In that case the following acceleration sequence can be used:
\[
t_0 = 1 , \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2} , \qquad \beta_{k+1} = \frac{t_k - 1}{t_{k+1}} .
\]

This yields, using a similar technique of proof as above:
\[
F(w_k) - F(w_*) \sim \frac{1}{k^2} .
\]
Also here, restarts can be beneficial.
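A sketch of this β_k schedule (this is the recursion known from the FISTA method; the function name is ours):

```python
import numpy as np

def beta_schedule(iters):
    """Momentum factors for the not strictly convex case:
    t_0 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2, beta_{k+1} = (t_k - 1)/t_{k+1}."""
    t, betas = 1.0, []
    for _ in range(iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        betas.append((t - 1.0) / t_next)
        t = t_next
    return betas
```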
