
Lecture 3:

CNN: Back-propagation

boris.ginzburg@intel.com

1
Agenda

• Introduction to gradient-based learning for Convolutional NN
• Backpropagation for basic layers
  – Softmax
  – Fully Connected layer
  – Pooling
  – ReLU
  – Convolutional layer
• Implementation of back-propagation for Convolutional layer
• CIFAR-10 training

2
Good Links

1. http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
2. http://www.iro.umontreal.ca/~pift6266/H10/notes/gradient.html#flowgraph

3
Gradient based training

A convolutional NN is a function y = f(x0, w), where

x0 is the input image [28,28],
w – the network parameters (weights, biases),
y – the softmax output: the probability that x0 belongs to one of the 10 classes 0..9.

4
Gradient based training

We want to find the parameters w that minimize the error, i.e. the negative log-probability that the network assigns to the correct label y0:

E( f(x0, w), y0 ) = −log f_{y0}(x0, w)

For this we will do iterative gradient descent:

w(t) = w(t−1) − λ · ∂E/∂w (t−1)
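
As a small, hypothetical illustration of the update rule above (assuming the gradient ∂E/∂w has already been computed into a vector grad), one gradient-descent step is just:

#include <vector>

// One plain gradient-descent step: w <- w - lambda * dE/dw.
// `weights` and `grad` are assumed to have the same length.
void sgd_step(std::vector<float>& weights,
              const std::vector<float>& grad,
              float lambda) {
    for (size_t i = 0; i < weights.size(); ++i)
        weights[i] -= lambda * grad[i];
}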

How do we compute the gradient of E w.r.t. the weights?

The loss E is a chain (composition) of functions, one per layer. Let's go layer by layer, from the last layer back, and use the chain rule for gradients of composed functions:

∂E/∂y_{l−1} = ∂E/∂y_l × ∂y_l(w, y_{l−1})/∂y_{l−1}

∂E/∂w_l = ∂E/∂y_l × ∂y_l(w, y_{l−1})/∂w_l

5
LeNet topology

Soft Max + LogLoss
Inner Product
ReLU
Inner Product
Pooling [2x2, stride 2]
Convolutional layer [5x5]
Pooling [2x2, stride 2]
Convolutional layer [5x5]
Data Layer

(The FORWARD pass flows bottom-up through this stack; the BACKWARD pass retraces it top-down.)
6
Layer:: Backward( )

class Layer {
  Setup(bottom, top);     // initialize layer
  Forward(bottom, top);   // compute: y_l = f(w_l, y_{l−1})
  Backward(top, bottom);  // compute gradients
}

Backward: we start from the gradient ∂E/∂y_l coming from the last layer, and
1) propagate the gradient back: ∂E/∂y_l → ∂E/∂y_{l−1}
2) compute the gradient of E w.r.t. the weights w_l: ∂E/∂w_l
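
As a rough sketch (hypothetical interface, not Caffe's actual classes), the whole backward pass is just a loop over the layers in reverse order, where each layer consumes ∂E/∂y_l and returns ∂E/∂y_{l−1} while accumulating its own weight gradients:

#include <vector>

// Hypothetical minimal layer interface mirroring the pseudo-class above.
struct LayerIface {
    virtual std::vector<float> Forward(const std::vector<float>& bottom) = 0;
    // Takes dE/dy_l (top_diff), returns dE/dy_{l-1} (bottom_diff),
    // and accumulates dE/dw_l internally.
    virtual std::vector<float> Backward(const std::vector<float>& top_diff) = 0;
    virtual ~LayerIface() {}
};

// Backward pass: start from the loss gradient and walk the layers in reverse.
void BackwardPass(std::vector<LayerIface*>& net, std::vector<float> diff) {
    for (int l = (int)net.size() - 1; l >= 0; --l)
        diff = net[l]->Backward(diff);
}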

7
Softmax with LogLoss Layer

Consider the last layer, softmax with log-loss (MNIST example). With scores y_k, k = 0..9, and correct class k0:

E = −log p_{k0} = −log( e^{y_{k0}} / Σ_{k=0..9} e^{y_k} ) = −y_{k0} + log( Σ_{k=0..9} e^{y_k} )

For all k ≠ k0 (wrong answers) we want to decrease p_k:

∂E/∂y_k = e^{y_k} / Σ_{k=0..9} e^{y_k} = p_k

For k = k0 (the right answer) we want to increase p_{k0}:

∂E/∂y_{k0} = p_{k0} − 1 = −(1 − p_{k0})

See http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
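
A minimal, self-contained sketch of this forward/backward computation (a hypothetical helper, not the Caffe layer; assumes raw scores y and the true class index k0):

#include <cmath>
#include <vector>

// Returns the log-loss E and fills `grad` with dE/dy_k.
float softmax_logloss(const std::vector<float>& y, int k0,
                      std::vector<float>& grad) {
    // Softmax probabilities p_k = exp(y_k) / sum_j exp(y_j),
    // shifted by max(y) for numerical stability.
    float m = y[0];
    for (float v : y) if (v > m) m = v;
    float sum = 0.f;
    grad.resize(y.size());
    for (size_t k = 0; k < y.size(); ++k) {
        grad[k] = std::exp(y[k] - m);
        sum += grad[k];
    }
    for (size_t k = 0; k < y.size(); ++k)
        grad[k] /= sum;                 // grad[k] now holds p_k
    float E = -std::log(grad[k0]);      // E = -log p_{k0}
    grad[k0] -= 1.f;                    // dE/dy_k = p_k - [k == k0]
    return E;
}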

8
Inner product (Fully Connected) Layer

A fully connected layer is just a matrix–vector multiplication:

y_l = W_l · y_{l−1}

So ∂E/∂y_{l−1} = W_l^T · ∂E/∂y_l

and ∂E/∂W_l = ∂E/∂y_l · y_{l−1}^T

Note: we need y_{l−1} here, so we have to keep it from the forward pass.
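
The same two products written out with plain loops (a hedged sketch with an assumed row-major layout for W, out_dim × in_dim; not Caffe's GEMM-based code):

#include <vector>

// Backward pass of y = W * x (no bias); W is rows x cols, row-major.
// top_diff = dE/dy (size rows), x = y_{l-1} saved from forward (size cols).
void fc_backward(const std::vector<float>& W, const std::vector<float>& x,
                 const std::vector<float>& top_diff, int rows, int cols,
                 std::vector<float>& bottom_diff,   // dE/dx, size cols
                 std::vector<float>& weight_diff) { // dE/dW, size rows*cols
    bottom_diff.assign(cols, 0.f);
    weight_diff.assign(rows * cols, 0.f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            bottom_diff[c]            += W[r * cols + c] * top_diff[r]; // W^T * dE/dy
            weight_diff[r * cols + c]  = top_diff[r] * x[c];            // dE/dy * x^T
        }
}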

9
ReLU Layer

Rectified Linear Unit:

y_l = max(0, y_{l−1})

so ∂E/∂y_{l−1} = 0 if y_{l−1} < 0, and ∂E/∂y_l otherwise.
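
An element-wise sketch of this rule (a hypothetical helper; assumes the layer kept its input x = y_{l−1} from the forward pass):

#include <vector>

// dE/dx_i = dE/dy_i if x_i > 0, else 0.
void relu_backward(const std::vector<float>& x,
                   const std::vector<float>& top_diff,
                   std::vector<float>& bottom_diff) {
    bottom_diff.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        bottom_diff[i] = (x[i] > 0.f) ? top_diff[i] : 0.f;
}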

10
Max-Pooling Layer
Forward (for each output pixel (x, y); y_n(x, y) is initialized to −∞):
for (p = 0; p < k; p++)
  for (q = 0; q < k; q++)
    y_n(x, y) = max( y_n(x, y), y_{n−1}(x + p, y + q) );

Backward:

∂E/∂y_{n−1}(x + p, y + q) = 0 if y_n(x, y) != y_{n−1}(x + p, y + q), and ∂E/∂y_n(x, y) otherwise

i.e. the gradient is routed only to the input pixel that was the maximum of its pooling window.
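
A sketch of both passes for a single feature map (one channel, non-overlapping windows, stride equal to the window size k; hypothetical helpers, not the Caffe layer):

#include <vector>

// Max-pooling over non-overlapping k x k windows (stride = k).
// in: H x W row-major; out and argmax: (H/k) x (W/k).
void maxpool_forward(const std::vector<float>& in, int H, int W, int k,
                     std::vector<float>& out, std::vector<int>& argmax) {
    int oh = H / k, ow = W / k;
    out.assign(oh * ow, 0.f);
    argmax.assign(oh * ow, 0);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            int best = (y * k) * W + (x * k);
            for (int p = 0; p < k; ++p)
                for (int q = 0; q < k; ++q) {
                    int idx = (y * k + p) * W + (x * k + q);
                    if (in[idx] > in[best]) best = idx;
                }
            out[y * ow + x] = in[best];      // max over the window
            argmax[y * ow + x] = best;       // remember who won
        }
}

// Backward: route each output gradient to the input pixel that won the max.
void maxpool_backward(const std::vector<float>& top_diff,
                      const std::vector<int>& argmax, int H, int W,
                      std::vector<float>& bottom_diff) {
    bottom_diff.assign(H * W, 0.f);
    for (size_t i = 0; i < top_diff.size(); ++i)
        bottom_diff[argmax[i]] += top_diff[i];
}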

Quiz:
1. What will the gradient be for sum-pooling?
2. What will the gradient be if the pooling areas overlap (e.g. stride = 1)?

11
Convolutional Layer :: Backward
for (n = 0; n < N; n ++)
for (m = 0; m < M; m ++) M N

for(y = 0; y<Y; y++) K


W

for(x = 0; x<X; x++) X

for (p = 0; p< K; p++) Y

for (q = 0; q< K; q++)


yL (n; x, y) += yL-1(m, x+p, y+q) * w (n ,m; p, q);

Let's use the chain rule for the convolutional layer:

∂E/∂y_{l−1} = ∂E/∂y_l × ∂y_l(w, y_{l−1})/∂y_{l−1}

∂E/∂w_l = ∂E/∂y_l × ∂y_l(w, y_{l−1})/∂w_l
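
A self-contained version of the forward loop above, using flat row-major arrays (a sketch with assumed layouts, not Caffe's code): the input is M×(X+K−1)×(Y+K−1), the output N×X×Y, the weights N×M×K×K.

#include <vector>

// Naive forward convolution ("valid" mode, stride 1).
void conv_forward(const std::vector<float>& in, const std::vector<float>& w,
                  std::vector<float>& out, int N, int M, int X, int Y, int K) {
    int iX = X + K - 1, iY = Y + K - 1;   // input spatial size
    out.assign(N * X * Y, 0.f);
    for (int n = 0; n < N; ++n)
      for (int m = 0; m < M; ++m)
        for (int x = 0; x < X; ++x)
          for (int y = 0; y < Y; ++y)
            for (int p = 0; p < K; ++p)
              for (int q = 0; q < K; ++q)
                out[(n * X + x) * Y + y] +=
                    in[(m * iX + (x + p)) * iY + (y + q)] *
                    w[((n * M + m) * K + p) * K + q];
}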
12
Convolutional Layer :: Backward
Example: M = 1, N = 2, K = 2.
Take one pixel in layer (l−1). Which pixels in the next layer are influenced by it?
It contributes to a 2x2 window of outputs in each of the N = 2 feature maps of layer l, so during the backward pass its gradient is a sum over those two 2x2 windows.

13
Convolutional Layer :: Backward

Let's use the chain rule for the convolutional layer.

The gradient ∂E/∂y_{l−1} is a sum of (back-)correlations of the weights with the gradients ∂E/∂y_l over all feature maps of the "upper" layer:

∂E/∂y_{l−1} = ∂E/∂y_l × ∂y_l(w, y_{l−1})/∂y_{l−1} = Σ_{n=1..N} back_corr( W_n, ∂E/∂y_l^{(n)} )

The gradient of E w.r.t. w is a sum over all "pixels" (x, y) in the input map:

∂E/∂w_l = ∂E/∂y_l × ∂y_l(w, y_{l−1})/∂w_l = Σ_{0≤x≤X, 0≤y≤Y} ∂E/∂y_l(x, y) ∘ y_{l−1}(x, y)

14
Convolutional Layer :: Backward
How this is implemented:
backward ( ) { …
  // im2col data to col_data
  im2col_cpu(bottom_data, CHANNELS_, HEIGHT_, WIDTH_, KSIZE_, PAD_, STRIDE_,
             col_data);
  // gradient w.r.t. weight:
  caffe_cpu_gemm(CblasNoTrans, CblasTrans, M_, K_, N_, 1., top_diff, col_data, 1.,
                 weight_diff);
  // gradient w.r.t. bottom data:
  caffe_cpu_gemm(CblasTrans, CblasNoTrans, K_, N_, M_, 1., weight, top_diff, 0.,
                 col_diff);
  // col2im back to the data
  col2im_cpu(col_diff, CHANNELS_, HEIGHT_, WIDTH_, KSIZE_, PAD_, STRIDE_,
             bottom_diff);
}

15
Convolutional Layer : im2col

The implementation is based on reducing the convolution layer to a matrix–matrix multiply (see Chellapilla et al., "High Performance Convolutional Neural Networks for Document Processing").
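
A sketch of the idea for the forward pass (one input channel, valid convolution, stride 1; hypothetical helper names): im2col lays every K×K patch out as a column, so applying all N filters becomes a single (N × K·K) by (K·K × X·Y) matrix product.

#include <vector>

// im2col for one channel: each K x K patch becomes one column of `col`
// (K*K rows, X*Y columns), where X = iX-K+1, Y = iY-K+1.
void im2col(const std::vector<float>& in, int iX, int iY, int K,
            std::vector<float>& col) {
    int X = iX - K + 1, Y = iY - K + 1;
    col.assign(K * K * X * Y, 0.f);
    for (int p = 0; p < K; ++p)
      for (int q = 0; q < K; ++q)
        for (int x = 0; x < X; ++x)
          for (int y = 0; y < Y; ++y)
            col[(p * K + q) * (X * Y) + (x * Y + y)] = in[(x + p) * iY + (y + q)];
}

// Forward conv as a plain matrix product: out(N x X*Y) = w(N x K*K) * col(K*K x X*Y).
void conv_forward_gemm(const std::vector<float>& w, const std::vector<float>& col,
                       std::vector<float>& out, int N, int KK, int XY) {
    out.assign(N * XY, 0.f);
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < KK; ++k)
        for (int j = 0; j < XY; ++j)
          out[n * XY + j] += w[n * KK + k] * col[k * XY + j];
}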

16
CIFAR-10 Training

http://www.cs.toronto.edu/~kriz/cifar.html
https://www.kaggle.com/c/cifar-10
60000 32x32 colour images in 10 classes, with 6000 images per class.
There are:
• 50000 training images
• 10000 test images

17
Exercises

1. Look at the definition of the Backward() pass for the following layers:
   – sigmoid, tanh
2. Implement a new layer:
   – softplus: y_l = log(1 + e^{y_{l−1}}) (a starting-point sketch follows below)
3. Train CIFAR-10 with different topologies

Project:
1. Port CIFAR-100 to caffe
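
For exercise 2, a minimal starting-point sketch of the element-wise math only (not a full Caffe layer), using dy/dx = 1 / (1 + e^{−x}):

#include <cmath>
#include <vector>

// Softplus forward: y = log(1 + exp(x)).
void softplus_forward(const std::vector<float>& x, std::vector<float>& y) {
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = std::log(1.f + std::exp(x[i]));
}

// Softplus backward: dE/dx = dE/dy * sigmoid(x).
void softplus_backward(const std::vector<float>& x,
                       const std::vector<float>& top_diff,
                       std::vector<float>& bottom_diff) {
    bottom_diff.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        bottom_diff[i] = top_diff[i] / (1.f + std::exp(-x[i]));
}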

18
