ML Section 15: Neural Networks
Stefan Harmeling
▸ linear discriminant function for two classes:
      y(x) = w^T x + w0
▸ with the augmented vectors x̃ = [1, x^T]^T and w̃ = [w0, w^T]^T this becomes
      y(x) = w̃^T x̃
▸ for several classes, one discriminant function per class:
      yk(x) = wk^T x + wk0
Logistic discriminants
▸ generalize the linear discriminant function for two classes by applying a
  monotonic non-linear function g, called the activation function:
      y(x) = g(w^T x + w0)
▸ for Gaussian class-conditional densities with shared covariance Σ,
      p(x|Ck) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−½ (x − µk)^T Σ^{-1} (x − µk))
  the posterior probability of class C1 is
      p(C1|x) = p(x|C1) p(C1) / (p(x|C1) p(C1) + p(x|C2) p(C2))
              = 1 / (1 + p(x|C2) p(C2) / (p(x|C1) p(C1)))
              = 1 / (1 + exp(−a)) = g(a)        “the logistic sigmoid”
  with a = w^T x + w0, where
      w  = Σ^{-1} (µ1 − µ2)
      w0 = −½ µ1^T Σ^{-1} µ1 + ½ µ2^T Σ^{-1} µ2 + ln( p(C1) / p(C2) )
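A minimal numeric sketch (not from the lecture) checking this identity: for two Gaussian classes with shared covariance, the posterior computed via Bayes’ rule agrees with the logistic sigmoid of w^T x + w0. All numbers are made up for illustration.

import numpy as np

def gauss(x, mu, Sigma):                       # Gaussian density p(x|C_k)
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # shared covariance
p1, p2 = 0.4, 0.6                              # priors p(C1), p(C2)

w = np.linalg.solve(Sigma, mu1 - mu2)
w0 = (-0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
      + 0.5 * mu2 @ np.linalg.solve(Sigma, mu2)
      + np.log(p1 / p2))

x = np.array([0.3, -0.7])
bayes = p1 * gauss(x, mu1, Sigma)
bayes /= bayes + p2 * gauss(x, mu2, Sigma)
sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + w0)))
print(bayes, sigmoid)                          # the two numbers agree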
Linear discriminants
▸ discriminant function for two classes, obtained by applying the sigmoid:
      y(x) = g(w^T x + w0)
▸ more generally, with basis functions φj:
      yk(x) = ∑_{j=0}^{M} wkj φj(x),    in matrix notation  y(x) = w φ(x)
  with w the matrix of the weights wkj
▸ derivative of the sum-of-squares error E with respect to w:
      ∂E(w)/∂wkj = ∑_{n=1}^{N} ( ∑_{j'=0}^{M} wkj' φj'(xn) − tkn ) φj(xn) = ∑_{n=1}^{N} rkn φj(xn)
  and for a single data point xn (with residual rkn):
      ∂En(w)/∂wkj = rkn φj(xn)
▸ applying the activation function g on top gives
      y(x) = g( ∑_{j=0}^{M} wj φj(x) ) = g(w^T φ(x))
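As a small illustration of the per-sample gradient rkn φj(xn) above, a single stochastic-gradient step (the basis functions, the data point and the learning rate are my own choices, not the lecture’s):

import numpy as np

phi = lambda x: np.array([1.0, x, x**2])       # example basis functions, M = 2
w = np.zeros((1, 3))                           # K = 1 output, M + 1 weights
x_n, t_n, eta = 0.7, 1.3, 0.1
r_n = w @ phi(x_n) - t_n                       # residual r_{kn} = y_k(x_n) - t_{kn}
w = w - eta * np.outer(r_n, phi(x_n))          # w_{kj} <- w_{kj} - eta * r_{kn} * phi_j(x_n)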
Limits?
▸ the XOR problem can be solved with an appropriate choice of φ (see the
  sketch below)
▸ however, when the receptive fields of the individual φj(x) are limited,
  certain connectivity problems cannot be solved (book “Perceptrons”,
  Minsky and Papert, 1969)
Can you?
[Figure: connectivity examples from “Perceptrons” by M. L. Minsky and S. Papert, Copyright 1969 by MIT Press]
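Back to the XOR bullet above: a minimal sketch showing that with the (hand-picked) feature map φ(x) = [1, x1, x2, x1·x2] the XOR targets become reachable by a linear discriminant in feature space. The concrete φ is my own choice, not the lecture’s.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])                    # XOR targets

def phi(x):                                           # fixed nonlinear features
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Phi = np.stack([phi(x) for x in X])                   # design matrix, 4 x 4
w = np.linalg.lstsq(Phi, t, rcond=None)[0]            # least-squares weights
print(Phi @ w)                                        # approximately [0, 1, 1, 0]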
From Single-layer to Multi-layer networks
How to generalize the single-layer network?
▸ single-layer network:
      y(x) = g( ∑_{j=0}^{M} wj φj(x) ) = g(w^T φ(x))
▸ two-layer network, where the fixed basis functions are replaced by adaptive ones:
      y(x) = b2 + W2 tanh(b1 + W1 x)
Sigmoid
      σ(a) = 1 / (1 + exp(−a))
tanh
      tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = 1 − 2 / (1 + exp(2a))
ReLU
      ReLU(a) = max(0, a)
▸ always nonlinear
▸ always a component-wise function
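A small sketch of the three activation functions; it also checks the tanh identity above numerically.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

a = np.linspace(-3.0, 3.0, 7)
print(np.allclose(np.tanh(a), 1.0 - 2.0 / (1.0 + np.exp(2.0 * a))))   # True
print(sigmoid(a))
print(relu(a))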
Multi-layer network:
Forward computation:
import numpy as np            # (input x, target t and parameters W1, b1, ..., b3 assumed given)
z1 = x                        # input layer
z2 = np.tanh(b1 + W1 @ z1)    # second layer (hidden)
z3 = np.tanh(b2 + W2 @ z2)    # third layer (hidden)
y = b3 + W3 @ z3              # output layer
r = y - t                     # compare to truth
E = 0.5 * r.T @ r             # squared error
▸ differential of the error:
      dE = r^T dr = r^T dy = r^T (db3 + (dW3) z3 + W3 dz3)        → read off D_{b3}E and D_{W3}E
▸ continuing with the z3 term:
      r^T W3 dz3 = r^T W3 d tanh(b2 + W2 z2) = ...
Backward computation (calculate gradients):
b3E = r                              # dE/db3
W3E = np.outer(b3E, z3)              # dE/dW3
b2E = (1 - z3**2) * (W3.T @ b3E)     # dE/db2   (tanh'(a) = 1 - tanh(a)**2)
W2E = np.outer(b2E, z2)              # dE/dW2
b1E = (1 - z2**2) * (W2.T @ b2E)     # dE/db1
W1E = np.outer(b1E, z1)              # dE/dW1
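A finite-difference check of these formulas (sizes and data are made up for illustration): the analytic gradient W2E should match a numerical gradient of E with respect to W2.

import numpy as np
rng = np.random.default_rng(0)

n_in, n_h, n_out = 3, 4, 2
x, t = rng.normal(size=n_in), rng.normal(size=n_out)
W1, b1 = rng.normal(size=(n_h, n_in)), rng.normal(size=n_h)
W2, b2 = rng.normal(size=(n_h, n_h)), rng.normal(size=n_h)
W3, b3 = rng.normal(size=(n_out, n_h)), rng.normal(size=n_out)

def loss(W2_):                                  # forward pass as a function of W2
    z2 = np.tanh(b1 + W1 @ x)
    z3 = np.tanh(b2 + W2_ @ z2)
    r = (b3 + W3 @ z3) - t
    return 0.5 * r @ r

z2 = np.tanh(b1 + W1 @ x)                       # analytic gradient, as above
z3 = np.tanh(b2 + W2 @ z2)
r = (b3 + W3 @ z3) - t
b3E = r
b2E = (1 - z3**2) * (W3.T @ b3E)
W2E = np.outer(b2E, z2)

num = np.zeros_like(W2)                         # numerical gradient
eps = 1e-6
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        D = np.zeros_like(W2)
        D[i, j] = eps
        num[i, j] = (loss(W2 + D) - loss(W2 - D)) / (2 * eps)
print(np.max(np.abs(num - W2E)))                # close to zero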
Gradient descent:
eta = 0.1
W1 = W1 - eta * W1E
b1 = b1 - eta * b1E
W2 = W2 - eta * W2E
b2 = b2 - eta * b2E
W3 = W3 - eta * W3E
b3 = b3 - eta * b3E
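Putting the three pieces together, a minimal end-to-end training sketch on a toy 1-D regression problem (data, layer sizes, learning rate and number of epochs are made up for illustration):

import numpy as np
rng = np.random.default_rng(0)

X = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)      # toy inputs, one per row
T = np.sin(2.0 * X)                                # toy targets

n_h = 10
W1, b1 = 0.5 * rng.normal(size=(n_h, 1)), np.zeros(n_h)
W2, b2 = 0.5 * rng.normal(size=(n_h, n_h)), np.zeros(n_h)
W3, b3 = 0.5 * rng.normal(size=(1, n_h)), np.zeros(1)
eta = 0.01

for epoch in range(200):
    for x, t in zip(X, T):                         # stochastic gradient descent
        z1 = x                                     # forward pass
        z2 = np.tanh(b1 + W1 @ z1)
        z3 = np.tanh(b2 + W2 @ z2)
        y = b3 + W3 @ z3
        r = y - t
        b3E = r                                    # backward pass
        W3E = np.outer(b3E, z3)
        b2E = (1 - z3**2) * (W3.T @ b3E)
        W2E = np.outer(b2E, z2)
        b1E = (1 - z2**2) * (W2.T @ b2E)
        W1E = np.outer(b1E, z1)
        W1 -= eta * W1E; b1 -= eta * b1E           # gradient descent step
        W2 -= eta * W2E; b2 -= eta * b2E
        W3 -= eta * W3E; b3 -= eta * b3E

errors = [(((b3 + W3 @ np.tanh(b2 + W2 @ np.tanh(b1 + W1 @ x))) - t) ** 2).item()
          for x, t in zip(X, T)]
print(0.5 * np.mean(errors))                       # average squared error after training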
The same computations in MATLAB/Octave syntax:
Forward computation:
z1 = x;                      % input layer
z2 = tanh(b1 + W1*z1);       % second layer (hidden)
z3 = tanh(b2 + W2*z2);       % third layer (hidden)
y = b3 + W3*z3;              % output layer
r = y - t;                   % compare to truth
E = 0.5 * r' * r;            % squared error
Backward computation (calculate gradients):
b3E = r;                             % dE/db3
W3E = b3E * z3';                     % dE/dW3
b2E = (1 - z3.^2) .* (W3' * b3E);    % dE/db2
W2E = b2E * z2';                     % dE/dW2
b1E = (1 - z2.^2) .* (W2' * b2E);    % dE/db1
W1E = b1E * z1';                     % dE/dW1
Gradient descent:
eta = 0.1;
W1 = W1 - eta * W1E;
b1 = b1 - eta * b1E;
W2 = W2 - eta * W2E;
b2 = b2 - eta * b2E;
W3 = W3 - eta * W3E;
b3 = b3 - eta * b3E;
▸ squared error  E = ½ (y − t)^2,  minimized over all weights:
      min_{w3, w2, w1} E
▸ for gradient descent we need the partial derivatives ∂E/∂w3, ∂E/∂w2, ∂E/∂w1
▸ with the layer-wise notation
      z1 = x                      input layer
      z2 = f1(w1, z1)             second layer (hidden)
      z3 = f2(w2, z2)             third layer (hidden)
      y = z4 = f3(w3, z3)         output layer
      E = ½ (y − t)^2             squared error (the loss)
  the chain rule gives
      ∂E/∂w3 = (∂E/∂z4)(∂z4/∂w3)
      ∂E/∂w2 = (∂E/∂z4)(∂z4/∂z3)(∂z3/∂w2)
      ∂E/∂w1 = (∂E/∂z4)(∂z4/∂z3)(∂z3/∂z2)(∂z2/∂w1)
▸ as a computational graph (forward pass):
      x = z1 → f1 → z2 → f2 → z3 → f3 → z4 → loss → E
  with parameters w1, w2, w3 entering f1, f2, f3 and the target t entering the loss
▸ backward pass: starting from ∂E/∂E = 1, the derivatives ∂E/∂z4, ∂E/∂z3, ∂E/∂z2 flow
  backwards through the graph, and ∂E/∂w3, ∂E/∂w2, ∂E/∂w1 are read off at f3, f2, f1
Back propagation
1. At the loss:
   ▸ starts the backward pass with ∂E/∂E = 1
   ▸ computes and sends ∂E/∂z4 (= z4 − t for the squared loss) to f3
2. At f3:
   ▸ receives ∂E/∂z4
   ▸ calculates ∂E/∂w3 = (∂E/∂z4)(∂z4/∂w3) with ∂z4/∂w3 = ∂/∂w3 f3(w3, z3)
   ▸ updates w3 = w3 − η ∂E/∂w3
   ▸ sends ∂E/∂z3 = (∂E/∂z4)(∂z4/∂z3) to f2 with ∂z4/∂z3 = ∂/∂z3 f3(w3, z3)
3. At f2:
   ▸ receives ∂E/∂z3
   ▸ calculates ∂E/∂w2 = (∂E/∂z3)(∂z3/∂w2) with ∂z3/∂w2 = ∂/∂w2 f2(w2, z2)
   ▸ updates w2 = w2 − η ∂E/∂w2
   ▸ sends ∂E/∂z2 = (∂E/∂z3)(∂z3/∂z2) to f1 with ∂z3/∂z2 = ∂/∂z2 f2(w2, z2)
4. At f1:
   ▸ receives ∂E/∂z2
   ▸ calculates ∂E/∂w1 = (∂E/∂z2)(∂z2/∂w1) with ∂z2/∂w1 = ∂/∂w1 f1(w1, z1)
   ▸ updates w1 = w1 − η ∂E/∂w1
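The same scheme written as code: a minimal sketch with my own interface (not the lecture’s), where each block stores its input during the forward pass and, given ∂E/∂z_{i+1}, updates its own weights and sends ∂E/∂z_i to the previous block. For brevity all blocks are tanh layers here.

import numpy as np

class TanhLayer:                                   # one block f_i(w_i, z_i)
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
        self.b = np.zeros(n_out)
    def forward(self, z_in):
        self.z_in = z_in
        self.z_out = np.tanh(self.b + self.W @ z_in)
        return self.z_out
    def backward(self, dE_dout, eta):
        delta = (1 - self.z_out**2) * dE_dout      # dE wrt the pre-activation
        dE_din = self.W.T @ delta                  # to send to the previous block
        self.W -= eta * np.outer(delta, self.z_in) # update the weights
        self.b -= eta * delta
        return dE_din

rng = np.random.default_rng(0)
blocks = [TanhLayer(2, 5, rng), TanhLayer(5, 5, rng), TanhLayer(5, 1, rng)]
x, t, eta = np.array([0.3, -0.8]), np.array([0.5]), 0.1

z = x
for f in blocks:                                   # forward: z1 -> z2 -> z3 -> z4
    z = f.forward(z)
dE = z - t                                         # step 1: at the squared loss
for f in reversed(blocks):                         # steps 2-4: at f3, f2, f1
    dE = f.backward(dE, eta)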
▸ Scalars only:
      ∂E/∂w = (∂E/∂y)(∂y/∂w)          with ∂E/∂w, ∂E/∂y, ∂y/∂w ∈ R
▸ Scalars and vectors:
      ∂E/∂x = (∂E/∂y)(∂y/∂x)          with ∂E/∂x ∈ R^{1×n}, ∂E/∂y ∈ R^{1×m}, ∂y/∂x ∈ R^{m×n}
  where E ∈ R, x ∈ R^n, y ∈ R^m.
▸ Scalars, vectors, and matrices (DANGER!):
      ∂E/∂W = (∂E/∂y)(∂y/∂W)          with ∂E/∂W ∈ R^{n×m}, ∂E/∂y ∈ R^{1×m}, ∂y/∂W ∈ R^{m×m×n}
  where E ∈ R, W ∈ R^{m×n}, y ∈ R^m.  Here, good notation is difficult!
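A tiny numpy shape check of the vector case, plus one consistent choice for the matrix case (matching the linear block y = Wx further below); the sizes m = 3, n = 5 are arbitrary.

import numpy as np
rng = np.random.default_rng(0)

m, n = 3, 5
W, x = rng.normal(size=(m, n)), rng.normal(size=n)
y = W @ x                                  # y in R^m
dE_dy = rng.normal(size=(1, m))            # row vector, 1 x m
dE_dx = dE_dy @ W                          # 1 x n
dE_dW = np.outer(x, dE_dy)                 # n x m (the "dangerous" case)
print(dE_dx.shape, dE_dW.shape)            # (1, 5) (5, 3)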
What is back propagation?
Let’s look at Stanford’s course on Deep Neural Networks for Computer Vision
(2017, Fei-Fei Li, Justin Johnson, Serena Yeung, http://cs231n.stanford.edu/2017/)
[A sequence of slides copied from Fei-Fei Li, Justin Johnson, Serena Yeung’s Stanford course on Deep Learning (Lecture 4, April 13, 2017) is not reproduced here. They walk through back propagation on small computational graphs:
▸ a worked example with inputs x = -2, y = 5, z = -4, where the wanted gradients are computed node by node with the chain rule
▸ a second example involving the sigmoid function, discussed as a single “sigmoid gate”
▸ the general pattern of gradients flowing backwards through a single gate f
▸ vectorized operations and the Jacobian matrix (Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like?)
▸ a vectorized example worked through in the same node-by-node fashion]
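The copied slides themselves are not shown, so as a stand-in here is a minimal sketch of node-by-node back propagation on a tiny expression graph with the inputs mentioned on the slides; the expression q = x + y, f = q · z is my own illustrative choice, not necessarily the one used on the slides.

x, y, z = -2.0, 5.0, -4.0
q = x + y                   # forward pass: add gate
f = q * z                   # forward pass: multiply gate
df_df = 1.0                 # start of the backward pass
df_dq = z * df_df           # multiply gate: local gradient wrt q is z
df_dz = q * df_df           # multiply gate: local gradient wrt z is q
df_dx = 1.0 * df_dq         # add gate routes the gradient unchanged
df_dy = 1.0 * df_dq
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0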
▸ generic block:  x → f → y  with  y = f(w, x)
Back propagation
      ∂E/∂w = (∂E/∂y)(∂y/∂w)        → to update w
      ∂E/∂x = (∂E/∂y)(∂y/∂x)        → to send to the previous block
▸ linear block:  y = W x
Back propagation
      ∂E/∂W = x (∂E/∂y)             ∂E/∂x = (∂E/∂y) W
▸ bias block:  y = x + b
Back propagation
      ∂E/∂b = ∂E/∂y                 ∂E/∂x = ∂E/∂y
▸ tanh block:  x → tanh → y  with  y = tanh(x)
Back propagation
      ∂E/∂x = (∂E/∂y) Diag(1 − y ⊙ y) = (∂E/∂y) ⊙ (1 − y ⊙ y)^T
▸ logistic block:  x → logistic → y  with  y = 1 / (1 + e^{−x})
Back propagation
      ∂E/∂x = (∂E/∂y) Diag(y ⊙ (1 − y)) = (∂E/∂y) ⊙ (y ⊙ (1 − y))^T
▸ ReLU block:  x → ReLU → y  with  y = max(x, 0) = [x > 0] ⊙ x
Back propagation
      ∂E/∂x = (∂E/∂y) Diag([x > 0]) = (∂E/∂y) ⊙ [x > 0]^T
▸ convolution block with filters f1, …, fk:  yi = fi ∗ x   “convolution”
Back propagation
      ∂E/∂fi = ...                  ∂E/∂x = ...
▸ max-pool block:  x → max pool → y
Back propagation
      ∂E/∂x = (∂E/∂y) D             (D selects the entries where the maximum was attained)
▸ square-loss block:  y → square loss → E  with  E = 0.5 (y − t)^2
Back propagation
      ∂E/∂y = y − t
▸ cross-entropy (softmax) loss for true class t:
      E = −log( exp(yt) / ∑_j exp(yj) )
Back propagation
      ∂E/∂y = ...
[Diagram: the network drawn as a chain of blocks b1, …, b9 between input x and output y]
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities
between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts
at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and
the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–
4096–4096–1000.
from Krizhevsky, Sutskever, Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012
ImageNet classification
Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model.
The correct label is written under each image, and the probability assigned to the correct label is also shown
with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The
remaining columns show the six training images that produce feature vectors in the last hidden layer with the
smallest Euclidean distance from the feature vector for the test image.
from Krizhevsky, Sutskever, Hinton, ImageNet Classification with Deep Convolutional Neural Networks, 2012
In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its
top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the
top-left, can be recognized by the net. Most of the top-5 labels appear reasonable. For example,
only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry)
there is genuine ambiguity about the intended focus of the photograph.
Tricks of the trade
Goal:
Avoid flat regions of the loss function.
Try to start at the steep parts of the activation functions.
Tricks:
▸ data normalization: transform input points to have mean zero and
variance one
▸ weight initialization for linear bricks: sample from a normal distribution
  with mean zero and standard deviation σ = 1/√N, with N being the
  number of input nodes
▸ learning rate division: in each linear layer divide the learning rate
by N, the number of input nodes
See also:
▸ LeCun, Bottou, Orr, Müller, “Efficient back-prop”, 1998.
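A short sketch of the first two tricks in code, assuming the initialization standard deviation is 1/√N as recommended in “Efficient back-prop” (all sizes are arbitrary):

import numpy as np
rng = np.random.default_rng(0)

X = rng.normal(loc=3.0, scale=5.0, size=(100, 20))   # raw inputs
X = (X - X.mean(axis=0)) / X.std(axis=0)             # data normalization: mean zero, variance one

N, M = 20, 10                                        # number of input / output nodes
W = rng.normal(scale=1.0 / np.sqrt(N), size=(M, N))  # weight initialization
eta_layer = 0.1 / N                                  # learning-rate division for this layer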
x → dropout → y
      m = [rand(size(x)) <= p]
      y = x ⊙ m
Back propagation
      ∂E/∂x = (∂E/∂y) ⊙ m
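A minimal numpy sketch of the dropout block (p and the sizes are arbitrary):

import numpy as np
rng = np.random.default_rng(0)

p = 0.5
x = rng.normal(size=8)
m = (rng.random(size=x.shape) <= p).astype(float)    # random keep mask
y = x * m                                            # forward pass
dE_dy = rng.normal(size=y.shape)                     # gradient from the next block
dE_dx = dE_dy * m                                    # backward pass: same mask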
▸ AdaGrad
see Duchi, Hazan, Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011.
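A minimal sketch of the AdaGrad update in its standard form (the small eps is for numerical stability):

import numpy as np

def adagrad_step(w, grad, hist, eta=0.1, eps=1e-8):
    hist = hist + grad**2                        # accumulate squared gradients
    w = w - eta * grad / (np.sqrt(hist) + eps)   # per-coordinate step size
    return w, hist

w, hist = np.zeros(3), np.zeros(3)
w, hist = adagrad_step(w, np.array([0.5, -1.0, 0.1]), hist)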
image from Matthias Scholz, http://nlpca.org and Scholz, Nichtlineare Hauptkomponentenanalyse auf Basis neuronaler Netze, Diplomarbeit, 2002.
see also: Kramer, Nonlinear principal component analysis using auto-associative neural networks, AIChE Journal, 1991.
What is optimized?
▸ the training objective of the auto-encoder network can be written as
      E = ∥x − f(x)∥^2
▸ note that if the bottleneck has several dimensions, there is no ordering of
  the nodes (unlike in linear PCA), i.e. it is only about the reconstruction error
▸ Scholz (2002) calls this s-NLPCA, “s” like symmetric.
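A minimal auto-encoder sketch for this symmetric setting (one tanh bottleneck layer, squared reconstruction error; data, sizes and learning rate are made up for illustration):

import numpy as np
rng = np.random.default_rng(0)

X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))  # 10-D data lying in a 5-D subspace
d, k = 10, 2                                              # input dimension, bottleneck size
W1, b1 = 0.1 * rng.normal(size=(k, d)), np.zeros(k)       # encoder
W2, b2 = 0.1 * rng.normal(size=(d, k)), np.zeros(d)       # decoder
eta = 0.01

for epoch in range(100):
    for x in X:
        h = np.tanh(b1 + W1 @ x)          # bottleneck code
        xhat = b2 + W2 @ h                # reconstruction f(x)
        r = xhat - x                      # gradient of 0.5 * ||x - f(x)||^2 wrt xhat
        W2E, b2E = np.outer(r, h), r
        hE = (1 - h**2) * (W2.T @ r)
        W1E, b1E = np.outer(hE, x), hE
        W1 -= eta * W1E; b1 -= eta * b1E
        W2 -= eta * W2E; b2 -= eta * b2E

print(np.mean(np.sum((X - (b2 + np.tanh(b1 + X @ W1.T) @ W2.T))**2, axis=1)))   # reconstruction error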