Deep Learning Basics
Lecture 6: Convolutional Neural Networks
Princeton University COS 495
Instructor: Yingyu Liang
Review: convolutional layers
Convolution: two dimensional case

Input (3x4):
a b c d
e f g h
i j k l

Kernel/filter (2x2):
w x
y z

Feature map (first two entries shown):
aw + bx + ey + fz    bw + cx + fy + gz
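A minimal NumPy sketch of this computation (my own illustration, not from the slides): concrete numbers stand in for a through l and w, x, y, z, and the sliding window matches the figure (the kernel is not flipped, i.e., cross-correlation as commonly used in CNNs).

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """2D convolution (no padding, stride 1), as in the figure:
    each output cell is the sum of elementwise products of the
    kernel with the input patch under it."""
    H, W = inp.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(inp[i:i + kH, j:j + kW] * kernel)
    return out

# 3x4 input (stand-ins for a..l) and 2x2 kernel (stand-ins for w, x, y, z)
inp = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1., 2.], [3., 4.]])
print(conv2d_valid(inp, kernel))   # 2x3 feature map
```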
Convolutional layers
The same kernel weights are shared across all output nodes: with $n$ input nodes, $m$ output nodes, and kernel size $k$, a 1D convolutional layer has only $k$ distinct weights, compared to $m \times n$ for a fully connected layer.
Figure from Deep Learning, by Goodfellow, Bengio, and Courville
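To make the parameter-sharing point concrete, a small back-of-the-envelope sketch (an illustration, not from the slides) counting the weights in a 1D convolutional layer versus a fully connected layer with $n$ input nodes and $m$ output nodes:

```python
# Parameter counts for a 1D layer with n input nodes and m output nodes.
n, k = 1000, 3                 # n input nodes, kernel size k
m = n - k + 1                  # number of output nodes for stride 1, no padding

fully_connected_weights = m * n   # every output connects to every input
convolutional_weights = k         # the same k weights are shared by all m outputs

print(fully_connected_weights)    # 998000
print(convolutional_weights)      # 3
```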
Terminology (figure from Deep Learning, by Goodfellow, Bengio, and Courville)
Case study: LeNet-5
LeNet-5
• Proposed in "Gradient-based learning applied to document recognition", by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, in Proceedings of the IEEE, 1998
• Applies convolution to 2D images (MNIST) and is trained with backpropagation
• Structure: 2 convolutional layers (each followed by pooling) + 3 fully connected layers (shapes worked out in the sketch below)
• Input size: 32x32x1
• Convolution kernel size: 5x5
• Pooling: 2x2
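A rough shape walkthrough under the sizes listed above (a sketch that only tracks tensor shapes; it assumes valid convolutions with stride 1 and 2x2 pooling with stride 2, matching the slides):

```python
def conv_out(size, kernel, stride=1):
    """Output spatial size of a valid convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output spatial size of pooling."""
    return (size - window) // stride + 1

s = 32                      # input: 32x32x1
s = conv_out(s, 5); print(s, "x", s, "x 6")    # 28 x 28 x 6
s = pool_out(s);    print(s, "x", s, "x 6")    # 14 x 14 x 6
s = conv_out(s, 5); print(s, "x", s, "x 16")   # 10 x 10 x 16
s = pool_out(s);    print(s, "x", s, "x 16")   # 5 x 5 x 16
flat = s * s * 16;  print(flat)                # 400
# Fully connected part: 400 -> 120 -> 84 -> 10
# (weight matrices 400x120, 120x84, 84x10)
```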
LeNet-5 overall architecture (figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner)
LeNet-5 layer by layer (figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner):
• First convolutional layer: filter 5x5, stride 1x1, #filters: 6
• First pooling layer: 2x2, stride 2
• Second convolutional layer: filter 5x5x6, stride 1x1, #filters: 16
• Second pooling layer: 2x2, stride 2
• First fully connected layer: weight matrix 400x120
• Second fully connected layer: weight matrix 120x84
• Output layer: weight matrix 84x10
Software platforms for CNN
List compiled in April 2016; check online for more recent platforms
Platform: Marvin (marvin.is)
LeNet in Marvin: convolutional layer
LeNet in Marvin: pooling layer
LeNet in Marvin: fully connected layer
Platform: Caffe (caffe.berkeleyvision.org)
LeNet in Caffe
Platform: TensorFlow (tensorflow.org)
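For illustration only, here is a LeNet-5-style network written with the modern tf.keras API (not the 2016-era TensorFlow code shown on the slides); the layer sizes follow the slides, while the activation and pooling choices are common modern defaults rather than the original paper's exact design:

```python
import tensorflow as tf

# LeNet-5-style model; sizes follow the slides (32x32x1 input,
# 5x5 kernels, 2x2 pooling, 400 -> 120 -> 84 -> 10 fully connected part).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="tanh"),   # -> 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),      # -> 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="tanh"),  # -> 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),      # -> 5x5x16
    tf.keras.layers.Flatten(),                                     # -> 400
    tf.keras.layers.Dense(120, activation="tanh"),
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()
```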
Others
• Theano – CPU/GPU symbolic expression compiler in Python (from the MILA lab at the University of Montreal)
• Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms, in Lua
• Lasagne – a lightweight library for building and training neural networks in Theano
• See: http://deeplearning.net/software_links/
Optimization: momentum
Basic algorithms
• Minimize the (regularized) empirical loss
  $L_R(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t) + R(\theta)$
  where the hypothesis is parametrized by $\theta$
• Gradient descent
  $\theta_{t+1} = \theta_t - \eta_t \nabla L_R(\theta_t)$
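A minimal sketch of this update (an illustration; grad_LR is an assumed user-supplied function returning $\nabla L_R(\theta)$):

```python
import numpy as np

def gradient_descent(grad_LR, theta0, lr=0.1, num_steps=100):
    """theta_{t+1} = theta_t - eta * grad L_R(theta_t), with a constant step size eta."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(num_steps):
        theta = theta - lr * grad_LR(theta)
    return theta

# Example: minimize L_R(theta) = ||theta||^2 / 2, whose gradient is theta itself.
print(gradient_descent(lambda th: th, theta0=[1.0, -2.0]))
```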
Mini-batch stochastic gradient descent
• Instead of one data point, work with a small batch of $b$ points
  $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$
• Update rule
  $\theta_{t+1} = \theta_t - \eta_t \nabla\left[\frac{1}{b}\sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) + R(\theta_t)\right]$
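A sketch of one epoch of mini-batch SGD implementing the update above (grad_loss and grad_R are assumed helper functions returning the gradients of $l$ and $R$):

```python
import numpy as np

def minibatch_sgd(grad_loss, grad_R, theta0, X, Y, batch_size=32, lr=0.01):
    """One epoch of mini-batch SGD:
    theta <- theta - eta * ( (1/b) * sum_i grad l(theta, x_i, y_i) + grad R(theta) )"""
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for start in range(0, n, batch_size):
        xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
        g = np.mean([grad_loss(theta, x, y) for x, y in zip(xb, yb)], axis=0)
        theta = theta - lr * (g + grad_R(theta))
    return theta
```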
Momentum
• Drawback of SGD: progress can be slow when the gradient is small
• Observation: when the gradient is consistent across consecutive steps, we can take larger steps
• Metaphor: a marble rolling down a gentle slope gathers speed
Momentum
(Figure from Deep Learning, by Goodfellow, Bengio, and Courville: contours show the loss function, the path shows SGD with momentum, and the arrows show the stochastic gradients.)
Momentum
• Work with a small batch of $b$ points $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$
• Keep a momentum variable $v_t$ and set a decay rate $\alpha$
• Update rule
  $v_t = \alpha v_{t-1} - \eta_t \nabla\left[\frac{1}{b}\sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) + R(\theta_t)\right]$
  $\theta_{t+1} = \theta_t + v_t$
• Practical guide: $\alpha$ is set to 0.5 until the initial learning stabilizes, and is then increased to 0.9 or higher.
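A sketch of SGD with momentum, including the practical $\alpha$ schedule just mentioned (grad_batch is an assumed function returning the mini-batch gradient of the regularized objective at step t):

```python
import numpy as np

def sgd_momentum(grad_batch, theta0, lr=0.01, alpha0=0.5, alpha_final=0.9,
                 warmup_steps=100, num_steps=1000):
    """v_t = alpha * v_{t-1} - eta * (mini-batch gradient at theta_t)
       theta_{t+1} = theta_t + v_t
    alpha starts at 0.5 and is raised to 0.9 after a warm-up period."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for t in range(num_steps):
        alpha = alpha0 if t < warmup_steps else alpha_final
        v = alpha * v - lr * grad_batch(theta, t)   # grad_batch: assumed helper
        theta = theta + v
    return theta
```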