
Machine Learning

Section 15: Neural Networks

Stefan Harmeling

13./15./20. December 2021

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 1


What have we seen so far?
Sections:
1. Introduction
2. Plausible reasoning and Bayes Rule
3. From Logic to Probabilities
4. Bayesian networks
5. Continuous Probabilities
6. The Gaussian distribution
7. More on distributions, models, MAP, ML
8. Linear Regression
9. Matrix Differential Calculus
10. Model selection
11. Support Vector Machines
12. PCA, kPCA
13. ISOMAP, LLE
14. k-means, GMM, EM algorithm
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 2
We follow Chapters 3 and 4 of the book:
Neural Networks for Pattern Recognition
by Chris Bishop (1995)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 3


Single-layer networks
Two category classification problem
▸ linear discriminant function

y(x) = w^T x + w0

with weights w and threshold w0 (sometimes called bias)


▸ if y (x) > 0 assign x to class 1 otherwise to class 2
▸ our first neural network! (see board)
▸ creates a linear decision boundary
▸ the bias can be merged into the weights:

x̃ = [1, x^T]^T
w̃ = [w0, w^T]^T
y(x) = w̃^T x̃

(see board, getting from 1D to 2D, or from 2D to 3D)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 4


Single-layer networks
Several category classification problem
▸ linear discriminant function for k classes

yk(x) = wk^T x + wk0

with per-class weights wk and threshold wk0 (sometimes called bias)
▸ if yk(x) > yj(x) for all j ≠ k assign x to class k
▸ neural network representation (see board)
▸ consider x assigned to class k, i.e. for j ≠ k:

yk(x) − yj(x) > 0

so the decision boundary between classes k and j is

(wk − wj)^T x + wk0 − wj0 = 0

▸ also creates linear decision boundaries (see board, similar to
Voronoi cells)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 5
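As a concrete illustration (not from Bishop's book; the weight values below are made up), a minimal numpy sketch of the multi-class decision rule with the bias merged into the weights:

import numpy as np

# toy example: 3 classes, 2-dimensional inputs, arbitrary weights [wk0, wk^T]
W = np.array([[ 0.0,  1.0,  1.0],    # class 1
              [ 0.5, -1.0,  1.0],    # class 2
              [-0.5,  0.0, -1.0]])   # class 3

def classify(x):
    x_tilde = np.concatenate(([1.0], x))   # merge the bias: x~ = [1, x^T]^T
    y = W @ x_tilde                        # yk(x) = wk^T x + wk0 for all k at once
    return np.argmax(y)                    # assign x to the class with the largest yk(x)

print(classify(np.array([2.0, 0.5])))      # -> 0, i.e. the first class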


Single-layer networks

Logistic discriminants
▸ generalize the linear discriminant function for two classes by applying a
monotonic non-linear function g, called the activation function:

y(x) = g(w^T x + w0)

▸ still a linear decision boundary (by monotonicity), so what’s the point?
Are there useful choices for g?

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 6


Single-layer networks
Consider a two-class problem
▸ assume the locations x are sampled from Gaussians with
class-dependent means µ1 and µ2 and a shared covariance matrix Σ:

p(x|Ck) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp(−½ (x − µk)^T Σ^(-1) (x − µk))

▸ Bayes' theorem implies:

p(C1|x) = p(x|C1) p(C1) / ( p(x|C1) p(C1) + p(x|C2) p(C2) )
        = 1 / ( 1 + p(x|C2) p(C2) / (p(x|C1) p(C1)) )
        = 1 / ( 1 + exp(−a) ) = g(a)   “the logistic sigmoid”

with a = ln[ p(x|C1) p(C1) / (p(x|C2) p(C2)) ] = w^T x + w0. Can be solved for w and w0.
▸ Thus: the output of the logistic sigmoid can be interpreted as a
posterior probability.
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 7
Single-layer networks

The expressions for w and w0 from the previous slide are:

w = Σ^(-1) (µ1 − µ2)
w0 = −½ µ1^T Σ^(-1) µ1 + ½ µ2^T Σ^(-1) µ2 + ln( p(C1) / p(C2) )

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 8
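As a quick numerical check (my own sketch, with made-up means, covariance and priors; scipy is used only to evaluate the Gaussian densities), the sigmoid of w^T x + w0 with w and w0 as above indeed reproduces the posterior p(C1|x) computed directly from Bayes' rule:

import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
p1, p2 = 0.6, 0.4                         # priors p(C1), p(C2)

Sinv = np.linalg.inv(Sigma)
w  = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

x = np.array([0.2, -0.7])
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
post_sigmoid = sigmoid(w @ x + w0)        # posterior via the logistic sigmoid

l1 = multivariate_normal.pdf(x, mu1, Sigma) * p1
l2 = multivariate_normal.pdf(x, mu2, Sigma) * p2
post_bayes = l1 / (l1 + l2)               # posterior directly via Bayes' rule

print(post_sigmoid, post_bayes)           # both print the same number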


Single-layer networks

Linear discriminants
▸ discriminant function for two classes by applying sigmoid:

y(x) = g(w^T x + w0)

▸ how powerful is this? any limitations? e.g. limitations in 2D?


▸ the XOR problem cannot be solved! (see board)
▸ this is related to the VC dimension of a linear classifier in two
dimensions, which is 3 (see the later Section on Statistical Learning Theory)
▸ from the perspective of statistical learning theory, failing at the
XOR problem is not a bad thing: it bounds the capacity of
single-layer networks, which leads to regularization...
▸ nonetheless: how can we make it more powerful?

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 9


Single-layer networks

Generalized linear discriminants


▸ generalized discriminant function for several classes:

yk(x) = Σ_{j=0}^{M} wkj φj(x)

with pre-defined basis functions φj (aka features) and fixing
φ0(x) = 1 so that a separate threshold/bias can be omitted
▸ how powerful is this? any limitations?
▸ one can show that for a suitable choice of basis functions any
continuous function can be approximated arbitrarily well
▸ so there is also a choice that solves the XOR-problem, e.g. in 2D:

y(x) = w φ(x)

with a single basis function φ(x) = x1 x2 for x = [x1, x2]^T (see the sketch below).

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 10
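A minimal sketch (my own, not from the slides) of how the single basis function φ(x) = x1 x2 solves the XOR-type problem, with the four corner points labelled ±1:

import numpy as np

# XOR-type data: the two diagonals carry different labels
X = np.array([[ 1,  1], [-1, -1], [ 1, -1], [-1,  1]], dtype=float)
t = np.array([ 1,  1, -1, -1], dtype=float)

# no linear discriminant w^T x + w0 separates these four points,
# but the single basis function phi(x) = x1 * x2 does:
phi = X[:, 0] * X[:, 1]
w = 1.0                           # y(x) = w * phi(x)
y = w * phi

print(np.sign(y) == np.sign(t))   # [ True  True  True  True ]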


Single-layer networks
Learn weights from data:
▸ given N training data points (x^1, t^1), …, (x^N, t^N) consisting of
locations and targets t^n = [t^n_1, …, t^n_c]^T for c classes
▸ t^n_k = 1 if x^n is in class k, otherwise t^n_k = 0
Least-squares techniques
▸ Minimize the sum-of-squares error function:

E(w) = ½ Σ_{n=1}^{N} Σ_{k=1}^{c} ( yk(x^n) − t^n_k )² = ½ Σ_{n=1}^{N} Σ_{k=1}^{c} ( Σ_{j=0}^{M} wkj φj(x^n) − t^n_k )²

▸ derivative of E wrt. wkj:

∂E(w)/∂wkj = Σ_{n=1}^{N} ( Σ_{j'=0}^{M} wkj' φj'(x^n) − t^n_k ) φj(x^n) = Σ_{n=1}^{N} r^n_k φj(x^n)

where the residuals are r^n_k = yk(x^n) − t^n_k

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 11


Single-layer networks

Derivative for a single location:


E n (w) = rkn φj (x n )
∂wkj

Stochastic gradient descent


▸ iterate over all data points and update the weights using some
learning rate, i.e. for n = 1 to N update w:

wkj ← wkj − η r^n_k φj(x^n)

with learning rate η

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 12
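A hedged sketch (my own; the basis functions and the toy data are made up) of this stochastic update wkj ← wkj − η r^n_k φj(x^n) for a generalized linear discriminant with 1-of-c targets:

import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # pre-defined basis functions, phi_0(x) = 1 absorbs the bias
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

N, c, M = 200, 2, 4
X = rng.normal(size=(N, 2))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)   # class from the sign of x1*x2
T = np.eye(c)[labels]                          # 1-of-c targets t^n

W = np.zeros((c, M))                           # weights wkj
eta = 0.1
for epoch in range(20):
    for n in rng.permutation(N):
        r = W @ phi(X[n]) - T[n]               # residuals r^n_k = yk(x^n) - t^n_k
        W -= eta * np.outer(r, phi(X[n]))      # wkj <- wkj - eta * r^n_k * phi_j(x^n)

pred = np.argmax(np.stack([W @ phi(x) for x in X]), axis=1)
print("training accuracy:", np.mean(pred == labels))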


Goal: given training examples D = {(x^1, t^1), …, (x^N, t^N)}

min_w E(w) = min_w ½ Σ_{n=1}^{N} E^n(w) = min_w ½ Σ_{n=1}^{N} (t^n − f(x^n, w))²

Algorithm 15.1 (Batch gradient descent)


1. Initialize w0 .
2. For several epochs update the parameters with the overall
gradient:
wt+1 = wt − η Σ_{n=1}^{N} ∇w E^n(w)

Algorithm 15.2 (Stochastic gradient descent)


1. Initialize w0 .
2. For several epochs update the parameters with a local gradient at
a randomly chosen location (e.g. for point x n ):

wt+1 = wt − η ∇w E^n(w)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 13


Early single-layer networks: Perceptron / Adalines
The perceptron for the two class problem (Rosenblatt, 1962)
▸ output of the perceptron:

y(x) = g( Σ_{j=0}^{M} wj φj(x) ) = g(w^T φ(x))

with activation function g(a) = sign(a), being 1 if a ≥ 0 and −1 otherwise
▸ given N training data points (x^1, t^1), …, (x^N, t^N) consisting of
locations and targets t^n ∈ {−1, +1}
▸ perceptron criterion: minimize

E^perc(w) = − Σ_{y(x^n) ≠ t^n} w^T φ(x^n) t^n

where the summation is over all misclassified training examples


▸ essentially the same as adalines (ADAptive LINear Element,
Widrow and Hoff 1960)
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 14
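A small sketch (my own) of the perceptron learning rule, which is stochastic gradient descent on E^perc: whenever a point is misclassified, add η t^n φ(x^n) to the weights. Here φ is simply the input with a constant feature φ0 = 1, and the toy data are made up:

import numpy as np

rng = np.random.default_rng(1)

# linearly separable toy data with targets in {-1, +1}
N = 100
s = rng.choice([-1.0, 1.0], size=N)                  # true class
X = rng.normal(scale=0.5, size=(N, 2)) + 1.5 * s[:, None]
t = s

phi = lambda x: np.concatenate(([1.0], x))           # phi_0 = 1 merges in the bias
w = np.zeros(3)
eta = 0.1

for epoch in range(50):
    for n in range(N):
        if np.sign(w @ phi(X[n])) != t[n]:           # only misclassified points contribute
            w += eta * t[n] * phi(X[n])              # gradient step on E^perc

pred = np.sign(X @ w[1:] + w[0])
print("training accuracy:", np.mean(pred == t))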
Single-layer networks

Description: Note on back of photograph reads Dr. Frank Rosenblatt (left) [Charles] W. Wightman (right) working on a
prototype association unit for 1st perceptron. Dec. 1958 / Cornell University Faculty
Biographical Files, 47-10-3394. Division of Rare and Manuscript Collections, Cornell
University Library. / https://digital.library.cornell.edu/catalog/ss:547372

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 15


Single-layer networks
The perceptron (Rosenblatt, 1962)

Mark I Perceptron at the Cornell Aeronautical Laboratory, Cornell University News


Service records, 4-3-15. Division of Rare and Manuscript Collections, Cornell University
Library / https://digital.library.cornell.edu/catalog/ss:550351.

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 16


Single-layer networks

The perceptron (Rosenblatt, 1962)

From Perceptrons by M. L Minsky and S. Papert, Copyright 1969 by MIT Press

Limits?
▸ can solve XOR problem with appropriate choice of φ
▸ however, when limiting the receptive fields of the individual φj(x),
certain connectivity problems cannot be solved (book
“Perceptrons”, Minsky and Papert, 1969)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 17


Spot the tiger!

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 18


Profberger at English Wikipedia https://commons.wikimedia.org/wiki/File:Cheetah_chasing_Thompsons_gazelle.jpg, “Cheetah

chasing Thompsons gazelle”, CC-BY-SA 3.0

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 19


The diameter-limited perceptron cannot solve the
connectivity problem

Can you?
From Perceptrons by M. L Minsky and S. Papert, Copyright 1969 by MIT Press
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 20
From Single-layer to Multi-layer networks
How to generalize the single-layer network?

y(x) = g( Σ_{j=0}^{M} wj φj(x) ) = g(w^T φ(x))

Make φj (x) adaptive as well! Make it another single-layer network!


Multi-layer network
▸ with one hidden layer:

y (x) = b2 + W2 tanh(b1 + W1 x)

with a component-wise activation function tanh, vector-valued input x,
thresholds b1 and b2, output y, and matrix-valued weights W1 and W2.
▸ with two hidden layers:

y (x) = b3 + W3 tanh(b2 + W2 tanh(b1 + W1 x))

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 21


Activation functions

Sigmoid

σ(a) = 1 / (1 + exp(−a))

tanh

tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = 1 − 2 / (1 + exp(2a))

ReLU

ReLU(a) = max(0, a)

▸ always nonlinear
▸ always a component-wise function

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 22
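For reference, the three activation functions in numpy (a sketch of my own), including a check of the tanh identity from the previous slide:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))        # sigma(a) = 1 / (1 + exp(-a))

def relu(a):
    return np.maximum(0.0, a)              # ReLU(a) = max(0, a)

a = np.linspace(-3, 3, 7)
print(sigmoid(a), np.tanh(a), relu(a), sep="\n")           # all act component-wise
print(np.allclose(np.tanh(a), 1 - 2 / (1 + np.exp(2*a))))  # True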


Multi-layer networks (python/numpy)

Multi-layer network:

y (x) = b3 + W3 tanh(b2 + W2 tanh(b1 + W1 x))

Forward computation:
z1 = x # input layer
z2 = np.tanh(b1 + W1 @ z1) # second layer (hidden)
z3 = np.tanh(b2 + W2 @ z2) # third layer (hidden)
y = b3 + W3 @ z3 # output layer
r = y - t # compare to truth
E = 0.5 * r.T @ r # squared error

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 23


Multi-layer networks

Calculating the differentials

dE = r^T dr
   = r^T dy
   = r^T (db3 + (dW3) z3 + W3 dz3)                        read off Db3 E and DW3 E
   = r^T W3 dz3                                           continuing with z3
   = r^T W3 d tanh(b2 + W2 z2)
   = r^T W3 Diag(1 − z3 ⊙ z3) (db2 + (dW2) z2 + W2 dz2)
   = ...

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 24


d tanh(a) = d[ (e^a − e^(−a)) / (e^a + e^(−a)) ]
          = [ (e^a + e^(−a))(e^a + e^(−a)) da − (e^a − e^(−a))(e^a − e^(−a)) da ] / (e^a + e^(−a))²
          = (1 − tanh(a)²) da

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 25


Multi-layer networks (python/numpy)

Forward computation:
z1 = x # input layer
z2 = np.tanh(b1 + W1 @ z1) # second layer (hidden)
z3 = np.tanh(b2 + W2 @ z2) # third layer (hidden)
y = b3 + W3 @ z3 # output layer
r = y - t # compare to truth
E = 0.5 * r.T @ r # squared error
Backward computation (calculate gradients):

b3E = r                              # dE/db3
W3E = np.outer(b3E, z3)              # dE/dW3
b2E = (1 - z3**2) * (W3.T @ b3E)     # dE/db2 (tanh derivative times backpropagated gradient)
W2E = np.outer(b2E, z2)              # dE/dW2
b1E = (1 - z2**2) * (W2.T @ b2E)     # dE/db1
W1E = np.outer(b1E, z1)              # dE/dW1

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 26


Multi-layer networks (python/numpy)
Backward computation (calculate gradients):

b3E = r
W3E = np.outer(b3E, z3)
b2E = (1 - z3**2) * (W3.T @ b3E)
W2E = np.outer(b2E, z2)
b1E = (1 - z2**2) * (W2.T @ b2E)
W1E = np.outer(b1E, z1)

Gradient descent:
eta = 0.1;
W1 = W1 - eta * W1E
b1 = b1 - eta * b1E
W2 = W2 - eta * W2E
b2 = b2 - eta * b2E
W3 = W3 - eta * W3E
b3 = b3 - eta * b3E

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 27


Multi-layer networks (matlab)

Multi-layer network:

y (x) = b3 + W3 tanh(b2 + W2 tanh(b1 + W1 x))

Forward computation:
z1 = x;                   % input layer
z2 = tanh(b1 + W1*z1);    % second layer (hidden)
z3 = tanh(b2 + W2*z2);    % third layer (hidden)
y = b3 + W3*z3;           % output layer
r = y-t;                  % compare to truth
E = 0.5 * r' * r;         % squared error

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 28


Multi-layer networks (matlab)

Forward computation:
z1 = x;                   % input layer
z2 = tanh(b1 + W1*z1);    % second layer (hidden)
z3 = tanh(b2 + W2*z2);    % third layer (hidden)
y = b3 + W3*z3;           % output layer
r = y-t;                  % compare to truth
E = 0.5 * r' * r;         % squared error
Backward computation (calculate gradients):

b3E = r;
W3E = b3E * z3';
b2E = (1-z3.^2) .* (W3' * b3E);
W2E = b2E * z2';
b1E = (1-z2.^2) .* (W2' * b2E);
W1E = b1E * z1';

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 29


Multi-layer networks (matlab)
Backward computation (calculate gradients):
b3E = r;
W3E = b3E * z3';
b2E = (1-z3.^2) .* (W3' * b3E);
W2E = b2E * z2';
b1E = (1-z2.^2) .* (W2' * b2E);
W1E = b1E * z1';
Gradient descent:

eta = 0.1;
W1 = W1 - eta * W1E;
b1 = b1 - eta * b1E;
W2 = W2 - eta * W2E;
b2 = b2 - eta * b2E;
W3 = W3 - eta * W3E;
b3 = b3 - eta * b3E;

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 30


What is back propagation?

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 31


▸ Consider simple neural network with scalars

y = f3 (w3 , f2 (w2 , f1 (w1 , x)))

x and y are scalars, the parameters w3, w2, w1 are also scalars. We
don’t specify f1, f2, f3.
▸ For a data point x with target value t we define the loss function

E = ½ (y − t)²

▸ Goal: learn parameters that minimize the loss

min E
w3 ,w2 ,w1

▸ Back-propagation is an efficient way to calculate the derivatives of
the loss wrt the parameters:

∂E/∂w3,   ∂E/∂w2,   ∂E/∂w1

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 32


Rewrite the network into layers

z1 = x input layer
z2 = f1 (w1 , z1 ) second layer (hidden)
z3 = f2 (w2 , z2 ) third layer (hidden)
y = z4 = f3 (w3 , z3 ) output layer
E = ½ (y − t)²           squared error (the loss)

Applying chain rule

∂E/∂w3 = (∂E/∂z4) (∂z4/∂w3)
∂E/∂w2 = (∂E/∂z4) (∂z4/∂z3) (∂z3/∂w2)
∂E/∂w1 = (∂E/∂z4) (∂z4/∂z3) (∂z3/∂z2) (∂z2/∂w1)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 33


Propagation (feed forward)

z1 = x input layer
z2 = f1 (w1 , z1 ) second layer (hidden)
z3 = f2 (w2 , z2 ) third layer (hidden)
y = z4 = f3 (w3 , z3 ) output layer
E = ½ (y − t)²           squared error (the loss)

(diagram) x = z1 → [f1] → z2 → [f2] → z3 → [f3] → z4 → [loss] → E,
with w1, w2, w3 and the target t feeding into f1, f2, f3 and the loss

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 34


Back propagation

∂E/∂w3 = (∂E/∂z4) (∂z4/∂w3)
∂E/∂w2 = (∂E/∂z4) (∂z4/∂z3) (∂z3/∂w2)
∂E/∂w1 = (∂E/∂z4) (∂z4/∂z3) (∂z3/∂z2) (∂z2/∂w1)

(diagram) starting from ∂E/∂E = 1, the loss sends ∂E/∂z4 to f3, f3 sends ∂E/∂z3 to f2,
f2 sends ∂E/∂z2 to f1; along the way the blocks compute ∂E/∂w3, ∂E/∂w2, ∂E/∂w1

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 35


Propagation (feed forward)

(diagram, forward) x = z1 → [f1] → z2 → [f2] → z3 → [f3] → z4 → [loss] → E,
with w1, w2, w3 and the target t feeding in

Back propagation

(diagram, backward) the loss sends ∂E/∂z4 to f3, f3 sends ∂E/∂z3 to f2, f2 sends ∂E/∂z2 to f1
(starting from ∂E/∂E = 1); each block also outputs ∂E/∂w3, ∂E/∂w2, ∂E/∂w1

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 36


Back-propagation step by step
1. At the loss:
   ▸ sends ∂E/∂z4 to f3
2. At f3:
   ▸ receives ∂E/∂z4
   ▸ calculates ∂E/∂w3 = (∂E/∂z4) (∂z4/∂w3) with ∂z4/∂w3 = ∂/∂w3 f3(w3, z3)
   ▸ updates w3 = w3 − η ∂E/∂w3
   ▸ sends ∂E/∂z3 = (∂E/∂z4) (∂z4/∂z3) to f2 with ∂z4/∂z3 = ∂/∂z3 f3(w3, z3)
3. At f2:
   ▸ receives ∂E/∂z3
   ▸ calculates ∂E/∂w2 = (∂E/∂z3) (∂z3/∂w2) with ∂z3/∂w2 = ∂/∂w2 f2(w2, z2)
   ▸ updates w2 = w2 − η ∂E/∂w2
   ▸ sends ∂E/∂z2 = (∂E/∂z3) (∂z3/∂z2) to f1 with ∂z3/∂z2 = ∂/∂z2 f2(w2, z2)
4. At f1:
   ▸ receives ∂E/∂z2
   ▸ calculates ∂E/∂w1 = (∂E/∂z2) (∂z2/∂w1) with ∂z2/∂w1 = ∂/∂w1 f1(w1, z1)
   ▸ updates w1 = w1 − η ∂E/∂w1

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 37
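A tiny self-contained sketch (my own) of this message passing, with the concrete choice fi(wi, zi) = tanh(wi * zi) for all three blocks; the functions and values are just for illustration:

import numpy as np

def f(w, z):                   # each layer: f_i(w_i, z_i) = tanh(w_i * z_i)
    return np.tanh(w * z)

def df_dw(w, z):               # partial derivative wrt the parameter
    return (1 - np.tanh(w * z) ** 2) * z

def df_dz(w, z):               # partial derivative wrt the input
    return (1 - np.tanh(w * z) ** 2) * w

x, t = 0.7, 0.2
w1, w2, w3 = 0.5, -1.3, 2.0
eta = 0.1

# forward pass
z1 = x
z2 = f(w1, z1); z3 = f(w2, z2); z4 = f(w3, z3)
E = 0.5 * (z4 - t) ** 2

# backward pass: each block receives dE/dz_(i+1) and passes dE/dz_i on
dE_dz4 = z4 - t                       # sent by the loss
dE_dw3 = dE_dz4 * df_dw(w3, z3)
dE_dz3 = dE_dz4 * df_dz(w3, z3)       # sent to f2
dE_dw2 = dE_dz3 * df_dw(w2, z2)
dE_dz2 = dE_dz3 * df_dz(w2, z2)       # sent to f1
dE_dw1 = dE_dz2 * df_dw(w1, z1)

# gradient step on all parameters
w1, w2, w3 = w1 - eta * dE_dw1, w2 - eta * dE_dw2, w3 - eta * dE_dw3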


Notation!
▸ Scalars only (EASY)

  ∂E/∂w = (∂E/∂y) (∂y/∂w)        with all three factors in R

▸ Scalars and vectors (BE CAREFUL)

  ∂E/∂x = (∂E/∂y) (∂y/∂x)        with ∂E/∂x ∈ R^(1×n), ∂E/∂y ∈ R^(1×m), ∂y/∂x ∈ R^(m×n)

  where E ∈ R, x ∈ R^n, y ∈ R^m.

▸ Scalars, vectors, and matrices (DANGER!)

  ∂E/∂W = (∂E/∂y) (∂y/∂W)        with ∂E/∂W ∈ R^(n×m), ∂E/∂y ∈ R^(1×m), ∂y/∂W ∈ R^(m×m×n)
  (here, good notation is difficult!)

  where E ∈ R, W ∈ R^(m×n), y ∈ R^m.
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 38
What is back propagation?
Let’s look at Stanford’s course on Deep Neural Networks for Computer
vision (2017, Fei-Fei Li, Justin Johnson, Serena Yeung,
http://cs231n.stanford.edu/2017/)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 39


[Pages 40–102 of this deck are slides copied from Fei-Fei Li, Justin Johnson, and Serena Yeung’s Stanford course on deep learning for computer vision (Lecture 4, April 13, 2017). They work through backpropagation on small computational graphs: a simple example with inputs x = -2, y = 5, z = -4 where the chain rule is applied gate by gate; the notion of a “local gradient” at each gate f, multiplied by the upstream gradient, e.g. [local gradient] x [upstream gradient] = [2] x [0.2] = 0.4; a larger example ending in a sigmoid gate, whose local gradient is (0.73) * (1 - 0.73) = 0.2; the patterns in the backward flow (add gate: gradient distributor, max gate: gradient router, mul gate: gradient switcher, and gradients add at branches); and gradients for vectorized code, where the local gradient becomes a Jacobian matrix: for an elementwise f(x) = max(0, x) on a 4096-d input vector the Jacobian is 4096 x 4096, and for a minibatch of 100 examples it would technically be a 409,600 x 409,600 matrix. A vectorized example closes with the check: the gradient with respect to a variable should have the same shape as the variable.]
Back to our slides!
Thanks to Justin, Serena and Fei-Fei Li for letting me use their great
slides on deep learning!

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 103


Building blocks for neural networks
see Bottou, Gallinari, A Framework for the Cooperation of Learning Algorithms, 1991

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 104


In a nutshell
Neural networks are sequences of computational blocks (or bricks).
More generally

Neural networks are directed acyclic graphs (DAGs)
of computational blocks (or bricks).
Question
What kind of blocks?

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 105


General block with input x, output y , parameters w

x f y

Propagation (feed forward)

y = f (w, x)

Back propagation

∂E/∂w = (∂E/∂y) (∂y/∂w)               ∂E/∂x = (∂E/∂y) (∂y/∂x)
(to update w)                         (to send to the previous block)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 106
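One possible way to code such blocks (a sketch of my own; the class interface is an assumption and not taken from Bottou & Gallinari): every brick offers a forward pass and a backward pass that stores the gradient wrt its own parameters and returns the gradient to send to the previous block. Gradients are stored with the same shape as the variables, as in the numpy code earlier.

import numpy as np

class Linear:
    def __init__(self, n_out, n_in, rng):
        self.W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
    def forward(self, x):
        self.x = x                              # remember the input for the backward pass
        return self.W @ x                       # y = W x
    def backward(self, dE_dy):
        self.dE_dW = np.outer(dE_dy, self.x)    # to update W
        return self.W.T @ dE_dy                 # dE/dx, sent to the previous block

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, dE_dy):
        return (1 - self.y ** 2) * dE_dy        # Diag(1 - y ⊙ y) applied to dE/dy

# chaining bricks: forward left-to-right, backward right-to-left
rng = np.random.default_rng(0)
bricks = [Linear(5, 3, rng), Tanh(), Linear(1, 5, rng)]
x, t = rng.normal(size=3), np.array([0.5])

z = x
for b in bricks:
    z = b.forward(z)
dE = z - t                                      # gradient of the square loss
for b in reversed(bricks):
    dE = b.backward(dE)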


Linear block:
x linear y

Propagation (feed forward)

y = Wx

Back propagation

∂E/∂W = x (∂E/∂y)               ∂E/∂x = (∂E/∂y) W

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 107


Bias block:
x bias y

Propagation (feed forward)

y =x +b

Back propagation

∂E/∂b = ∂E/∂y                   ∂E/∂x = ∂E/∂y

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 108


Tanh block (component-wise):

x tanh y

Propagation (feed forward)

y = tanh(x)

Back propagation

∂E/∂x = (∂E/∂y) Diag(1 − y ⊙ y) = (∂E/∂y) ⊙ (1 − y ⊙ y)^T

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 109


Logistic block (component-wise):

x logistic y

Propagation (feed forward)

y = 1 / (1 + e^(−x))

Back propagation

∂E/∂x = (∂E/∂y) Diag(y ⊙ (1 − y)) = (∂E/∂y) ⊙ (y ⊙ (1 − y))^T

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 110


ReLU block (rectified linear unit, component-wise):

x ReLU y

Propagation (feed forward)

y = max(x, 0) = [x > 0] ⊙ x

Back propagation

∂E/∂x = (∂E/∂y) Diag([x > 0]) = (∂E/∂y) ⊙ [x > 0]^T

where [x > 0] is a vector of ones and zeros.

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 111


Convolutional block:
x conv y1 ⋯yk

f1 ⋯fk

Propagation (feed forward)

yi = fi ∗ x "convolution"

Back propagation

∂E/∂fi = ...                    ∂E/∂x = ...

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 112


Max pooling block:

x max pool y

Propagation (feed forward)

y = Dx "D downsampling matrix"

Back propagation

∂E/∂x = (∂E/∂y) D

▸ pooling means “downsampling”


▸ alternatives: maximum or mean over region

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 113


Square loss block:

y square loss E

Propagation (feed forward)

E = 0.5(y − t)2

Back propagation

∂E/∂y = y − t

▸ a loss layer is for convenience, it is not part of the solution

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 114


Softmax loss block:
y softmax loss E

Propagation (feed forward)

E = − log( exp(y_t) / Σ_j exp(y_j) )

Back propagation

∂E/∂y = ...

▸ loss function for multi-class problems


▸ assuming y = [y1 , . . . , yk ] for k classes, and t ∈ {1, . . . , k }

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 115
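The elided backward pass is the usual softmax cross-entropy gradient, ∂E/∂yj = softmax(y)j − [j = t]. A hedged numpy sketch of my own (class indices are 0-based here, unlike the slide):

import numpy as np

def softmax_loss_forward(y, t):
    # y: scores for k classes, t: index of the correct class
    p = np.exp(y - np.max(y))          # shift for numerical stability
    p /= p.sum()
    E = -np.log(p[t])                  # E = -log( exp(y_t) / sum_j exp(y_j) )
    return E, p

def softmax_loss_backward(p, t):
    dE_dy = p.copy()                   # dE/dy_j = softmax(y)_j - [j == t]
    dE_dy[t] -= 1.0
    return dE_dy

y, t = np.array([2.0, -1.0, 0.5]), 0
E, p = softmax_loss_forward(y, t)
print(E, softmax_loss_backward(p, t))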


Neural network as a sequence of bricks

(figure) x → b1 → b2 → b3 → b4 → b5 → y

Neural network as a DAG of bricks

(figure) the bricks b1, …, b9 form a directed acyclic graph between x and y

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 116


Famous examples

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 117


LeNet-5

(figure: Layers of LeNet-5, not its bricks; from Bottou’s MLSS 2013 slides on neural networks)

1991 Face recognition. Sonar image analysis. (Neuristique)
1993 Vehicle recognition. (Onera)
1994 Zip code recognition (AT&T Bell Labs)
1996 Check reading (AT&T Bell Labs)

▸ LeNet-5 robustly recognizes digits.


▸ Developed at AT&T in the 1990s by Yann LeCun, Léon Bottou et al.
▸ LeNet-5 has been used for a decade to read digits on about 10-20
percent of all (!) US checks.
▸ Demo at http://yann.lecun.com/exdb/lenet/index.html.
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 118
ImageNet classification

▸ 1.2 million images, 1000 classes


▸ network with 60 million parameters

[Figure 2 from Krizhevsky, Sutskever, Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.]

[Excerpt of the accompanying paper text: “… neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening …”]

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 119
ImageNet classification

Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model.
The correct label is written under each image, and the probability assigned to the correct label is also shown
with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The
remaining columns show the six training images that produce feature vectors in the last hidden layer with the
smallest Euclidean distance from the feature vector for the test image.

from Krizhevsky, Sutskever, Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012

In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its
top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the
top-left, can be recognized by the net. Most of the top-5 labels appear reasonable. For example,
only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry)
there is genuine ambiguity about the intended focus of the photograph.
Machine Learning / Stefan Harmeling / 13./15./20. December 2021 120
Tricks of the trade

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 121


General tricks
copied from Burger, Schuler, Harmeling, Image denoising: Can plain neural networks compete with BM3D, CVPR 2012.

Goal:
Avoid flat regions of the loss function.
Try to start at the steep parts of the activation functions.
Tricks:
▸ data normalization: transform input points to have mean zero and
variance one
▸ weight initialization for linear bricks: sample from a normal
distribution with mean zero and std σ = 1/√N, with N being the
number of input nodes
▸ learning rate division: in each linear layer divide the learning rate
by N, the number of input nodes
See also:
▸ LeCun, Bottou, Orr, Müller, “Efficient back-prop”, 1998.

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 122


Dropout

▸ a very large network might overfit (i.e. be perfect on the training data
but fail on test data)
▸ dropout is a trick during training to reduce overfitting
▸ during forward pass randomly set hidden nodes to zero (e.g. with
probability 0.5)
▸ this leads to more robust features

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 123


Dropout block (component-wise):

x dropout y

Propagation (feed forward)

m = [rand(size(x)) <= p]
y =x ⊙m

Back propagation

∂E/∂x = (∂E/∂y) ⊙ m

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 124
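A numpy sketch (my own) of the dropout block exactly as specified above; how to rescale at test time is not covered on the slide and is omitted here as well:

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p):
    m = (rng.random(x.shape) <= p).astype(x.dtype)   # keep each component with probability p
    return x * m, m                                  # y = x ⊙ m

def dropout_backward(dE_dy, m):
    return dE_dy * m                                 # dE/dx = dE/dy ⊙ m

x = rng.normal(size=5)
y, m = dropout_forward(x, p=0.5)
print(y, dropout_backward(np.ones_like(y), m))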


Optimization strategies

▸ stochastic gradient descent with momentum (SGD)

vt+1 = µvt − η∇w E(wt )


wt+1 = wt + vt+1

▸ Nesterov acceleration gradient (NAG)


Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), 1983.

vt+1 = µvt − η∇w E(wt + µvt )


wt+1 = wt + vt+1

▸ AdaGrad
see Duchi, Hazan, Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011.

(wt+1)_i = (wt)_i − η (∇w E(wt))_i / √( Σ_{t'=1}^{t} (∇w E(wt'))_i² )

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 125
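A sketch (my own) of the three update rules on a toy quadratic loss; grad(w) stands for ∇w E(w), and the small constant in the AdaGrad denominator is an implementation detail not on the slide:

import numpy as np

def grad(w):                         # toy loss E(w) = 0.5 * ||w||^2, so grad E(w) = w
    return w

eta, mu = 0.1, 0.9

# SGD with momentum
w, v = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(100):
    v = mu * v - eta * grad(w)
    w = w + v

# Nesterov accelerated gradient: evaluate the gradient at the look-ahead point
w, v = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(100):
    v = mu * v - eta * grad(w + mu * v)
    w = w + v

# AdaGrad: per-coordinate step sizes from the accumulated squared gradients
w, hist = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(100):
    g = grad(w)
    hist += g ** 2
    w = w - eta * g / np.sqrt(hist + 1e-8)   # small constant only for numerical stability

print(w)                             # has moved towards the minimizer at w = 0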


Unsupervised neural networks

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 126


Auto-associative neural network (autoencoder)

image from Matthias Scholz, http://nlpca.org and Scholz, Nichtlinear Hauptkomponentenanalyse auf Basis neuronaler Netze,

Diplomarbeit, 2002.

see also, Kramer, Nonlinear principal component analysis using auto- associative neural networks, AIChE Journal, 1991.

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 127


Auto-associative neural network (autoencoder)

from Matthias Scholz, http://nlpca.org

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 128


Auto-associative neural network (autoencoder)

What is optimized?
▸ The neural network can be written as follows:

f(x) = b4 + W4 tanh(b3 + w3 tanh(b2 + w2^T tanh(b1 + W1 x)))

where W4, W1 are matrices, w3, w2, b4, b3, b1, x are vectors, and b2 is a scalar.
▸ the loss function to minimize is

E = ∥x − f(x)∥²

i.e. we are trying to reconstruct the input with the output
▸ the network reconstructs the input via a single nonlinear direction
(probably the one of largest variance)

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 129
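A sketch of the forward pass and the reconstruction loss of this autoencoder with a one-dimensional bottleneck (my own, with made-up layer sizes and randomly initialized parameters); training would proceed by back-propagation exactly as before:

import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 10                                    # input dimension, hidden width

# W1, W4 are matrices; w2, w3, b1, b3, b4 are vectors; the bottleneck bias b2 is a scalar
W1, b1 = 0.1 * rng.normal(size=(h, d)), np.zeros(h)
w2, b2 = 0.1 * rng.normal(size=h), 0.0
w3, b3 = 0.1 * rng.normal(size=h), np.zeros(h)
W4, b4 = 0.1 * rng.normal(size=(d, h)), np.zeros(d)

def autoencoder(x):
    z = np.tanh(b1 + W1 @ x)                    # encoder layer
    s = np.tanh(b2 + w2 @ z)                    # scalar bottleneck (the nonlinear component)
    z = np.tanh(b3 + w3 * s)                    # decoder layer
    return b4 + W4 @ z                          # reconstruction f(x)

x = rng.normal(size=d)
E = np.sum((x - autoencoder(x)) ** 2)           # reconstruction error ||x - f(x)||^2
print(E)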


Auto-associative neural network (autoencoder)

What if we want to reconstruct a two dimensional embedding?


▸ The neural network can be written as follows:

f(x) = b4 + W4 tanh(b3 + W3 tanh(b2 + W2^T tanh(b1 + W1 x)))

where W4, W3, W2, W1 are matrices and b4, b3, b2, b1, x are vectors.


▸ the loss function to minimize is

E = ∥x − f (x)∥2

▸ note that if there are several dimensions for the bottleneck there is
no order on the nodes (unlike linear PCA), i.e. it is only about the
reconstruction error
▸ Scholz (2002) calls this s-NLPCA, “s” like symmetric.

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 130


Auto-associative neural network (autoencoder)

(Hierarchical) h-NLPCA, Scholz, 2002


▸ Scholz suggests a hierarchical error function (called h-NLPCA,
“h” like hierarchical), e.g. for two dimensions:

E = α E1 + E12    for 0 ≤ α < ∞

▸ For α → ∞ we get one-dimensional NLPCA, for α = 0 we get
s-NLPCA; in between he calls it h-NLPCA.

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 131


Auto-associative neural network (autoencoder)

from Matthias Scholz, http://nlpca.org

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 132


end for today!

Machine Learning / Stefan Harmeling / 13./15./20. December 2021 133
