Lecture 2
Convolutional Neural
Networks
Chen-Change Loy
吕健勤
http://www.ntu.edu.sg/home/ccloy/
https://twitter.com/ccloy
Outline
Motivation
Issues
• A fully connected network flattens a 28×28 image into 784 inputs; with a hidden layer of 100 units, this already requires 784×100 weights, and the parameter count grows quickly with image size.
• The network is not invariant to transformations (e.g., translation).
Locally connected networks
Instead, let us keep only a sparse set of connections, where all weights of the same color are shared.
Basic Components in CNN
Credits
• CS 230: Deep Learning
• https://stanford.edu/~shervine/teaching/cs-230.html
As we go deeper (left to right), the height and width tend to go down and the number of channels increases.
Common layer arrangement: Conv → pool → Conv → pool → fully connected → fully connected → output
Convolution layer
A 3×32×32 image: width 32, height 32, depth 3 (channels).
A 3×5×5 filter. Filters always extend the full depth of the input volume.
Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”.
The weights of the filters are learned using the backpropagation algorithm. The learned weights are known as filters or kernels.
Convolution layer
At each location we get 1 number: the result of taking a dot product between the 3×5×5 filter and a small 3×5×5 chunk of the image (i.e., a 3·5·5 = 75-dimensional dot product).
[Figure: an example of convolving a 1×5×5 image with a 1×3×3 filter, kernel [[1,0,1],[0,1,1],[0,0,1]]]
Sliding the 3×5×5 filter over all spatial locations of the 3×32×32 image produces a 1×28×28 activation map.
Convolution layer
Applying six filters to the 3×32×32 image gives six 1×28×28 activation/feature maps.
We stack these up to get a “new image” of size 6×28×28.
Convolution layer
The convolution layer maps the 3×32×32 image to 6×28×28 activation/feature maps; it also has a 6-dim bias vector (one bias per filter).
Convolution layer
For a batch: the convolution layer maps a 2×3×32×32 batch of images to a 2×6×28×28 batch of activation/feature maps (again with the 6-dim bias vector).
Convolution layer
𝑁×𝐷%×𝐻%×𝑊% batch of 𝑁×𝐷$×𝐻$×𝑊$ batch of
images activation/feature maps
Also the 𝐷!-dim bias vector
𝐻!
𝐻"
convolution layer
𝑊! 𝑊"
𝐷! 𝐷! 𝐷" ×𝐷! ×𝐹×𝐹 𝐷" 𝐷"
An example of convolutional network: LeNet 5
Convolution layer
Back to this example: 6@28×28 means that we have 6 feature maps of size 28×28.
You can infer that the first convolutional layer uses 6 filters, each of size 5×5 (how do we know that? We will discuss it later).
Convolution layer
Stacking convolution layers with ReLU activations:
3×32×32 image → CONV, ReLU (e.g., six 3×5×5 filters) → 6×28×28 → CONV, ReLU (e.g., ten 6×5×5 filters) → 10×24×24 → CONV, ReLU → …
Convolution layer
Consider a kernel $\boldsymbol{w} = \{w(l, m)\}$ of size $L \times M$, with $L = 2a + 1$ and $M = 2b + 1$.
The synaptic input at location $p = (i, j)$ of the first hidden layer due to a kernel is given by

$$u(i, j) = \sum_{l=-a}^{a} \sum_{m=-b}^{b} x(i + l, j + m)\, w(l, m) + b$$

and the corresponding activation is $y(i, j) = f(u(i, j))$, where $f$ is an activation function. For deep CNNs, we typically use ReLU, $f(x) = \max(0, x)$.
Note that one weight tensor $\boldsymbol{w}_k = \{w_k(l, m)\}$, i.e., one kernel (filter), creates one feature map $\boldsymbol{y}_k$; with $K$ filters we obtain $\boldsymbol{y} = (\boldsymbol{y}_k)_{k=1}^{K}$.
Convolution layer
Convolution by doing a sliding window:

$$\mathbf{w} \star \mathbf{x} = \begin{bmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{bmatrix} \star \begin{bmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{bmatrix} = \begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}$$

$$u(i, j) = \sum_{l=-a}^{a} \sum_{m=-b}^{b} x(i + l, j + m)\, w(l, m) + b$$

$u(1,1) = 4×1 + 5×4 + 8×1 + 1×1 + 8×4 + 8×3 + 3×3 + 6×3 + 6×1 = 122$
$u(1,2) = 5×1 + 8×4 + 7×1 + 8×1 + 8×4 + 8×3 + 6×3 + 6×3 + 4×1 = 148$
$u(2,1) = 1×1 + 8×4 + 8×1 + 3×1 + 6×4 + 6×3 + 6×3 + 5×3 + 7×1 = 126$
$u(2,2) = 8×1 + 8×4 + 8×1 + 6×1 + 6×4 + 4×3 + 5×3 + 7×3 + 8×1 = 134$
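As a sanity check, the double sum above can be transcribed directly into Python (a minimal sketch; with bias b = 0 it reproduces the four values):

```python
# Direct implementation of u(i, j) = sum_l sum_m x(i+l, j+m) * w(l, m) + b,
# using 0-based top-left indexing over all valid positions; bias b is 0 here.
x = [[4, 5, 8, 7],
     [1, 8, 8, 8],
     [3, 6, 6, 4],
     [6, 5, 7, 8]]
w = [[1, 4, 1],
     [1, 4, 3],
     [3, 3, 1]]

def conv2d_valid(x, w, b=0):
    L, M = len(w), len(w[0])
    H, W = len(x), len(x[0])
    return [[b + sum(x[i + l][j + m] * w[l][m]
                     for l in range(L) for m in range(M))
             for j in range(W - M + 1)]
            for i in range(H - L + 1)]

print(conv2d_valid(x, w))  # [[122, 148], [126, 134]]
```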
Convolution layer
Convolution as matrix multiplication
• The convolution operation essentially performs dot products between the filters and local regions of the input (as in the example above).
• A common practice is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiplication.
Convolution layer
The convolution operation can be equivalently re-expressed as a single matrix multiplication. Using the same guiding example, consider the convolution of the single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{w} \in \mathbb{R}^{3 \times 3}$ from above, with $\mathbf{w} \star \mathbf{x} = \begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}$.
Convolution layer
• The convolutional kernel $\mathbf{w}$ is rearranged as a sparse, doubly block Toeplitz matrix, called the convolution matrix:

$$\mathbf{W} = \begin{bmatrix}
1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1
\end{bmatrix}$$

• The input $\mathbf{x}$ is flattened row by row, from top to bottom:
$$v(\mathbf{x}) = \begin{bmatrix} 4 & 5 & 8 & 7 & 1 & 8 & 8 & 8 & 3 & 6 & 6 & 4 & 6 & 5 & 7 & 8 \end{bmatrix}^\top$$
Then,
$$\mathbf{W}\, v(\mathbf{x}) = \begin{bmatrix} 122 & 148 & 126 & 134 \end{bmatrix}^\top$$
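This can be made concrete in PyTorch: F.unfold performs the im2col flattening, after which the convolution becomes a single matrix multiply (a minimal sketch reproducing the example above):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[4., 5., 8., 7.],
                  [1., 8., 8., 8.],
                  [3., 6., 6., 4.],
                  [6., 5., 7., 8.]])
w = torch.tensor([[1., 4., 1.],
                  [1., 4., 3.],
                  [3., 3., 1.]])

# im2col: each column holds one flattened 3x3 patch of x -> shape (1, 9, 4)
cols = F.unfold(x.view(1, 1, 4, 4), kernel_size=3)

# one big matrix multiply: (1, 9) @ (9, 4) -> (1, 4), then reshape to 2x2
out = (w.reshape(1, 9) @ cols.squeeze(0)).reshape(2, 2)
print(out)  # tensor([[122., 148.], [126., 134.]])
```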
Convolution layer – spatial dimensions
7×7 input (spatially), assume a 3×3 filter. The stride is the factor by which the output is subsampled, i.e., the distance between adjacent placements of the kernel.
With stride 1, the 3×3 filter fits in 5 positions along each dimension, giving a 5×5 output; with stride 2, a 3×3 output.
Convolution layer – spatial dimensions
$$\text{Output size} = \frac{N - F}{S} + 1$$
e.g., $N = 7$, $F = 3$, stride 1 → output size 5
e.g., $N = 32$, $F = 5$, stride 1 → output size 28 (the running example: 3×32×32 image, 3×5×5 filter → 1×28×28 map)
The valid feature map is smaller than the input after convolution. Without zero-padding, the width of the representation shrinks by $F - 1$ at each layer.
To avoid shrinking the spatial extent of the network too rapidly, small filters have to be used.
Convolution layer – zero padding
By zero-padding in each layer, we prevent the representation from shrinking with depth. In practice, we zero-pad the border; in PyTorch, by default, we pad the top, bottom, left, and right with zeros (this can be customized).
e.g., input 7×7, 3×3 filter applied with stride 1, padded with a 1-pixel border → what is the output shape?
Recall that without padding, output size $= \frac{N - F}{S} + 1$.
With padding $P$, output size $= \frac{N - F + 2P}{S} + 1$.
e.g., $N = 7$, $F = 3$, $P = 1$, $S = 1$ → 7×7 output!
In general, it is common to see CONV layers with stride 1, filters of size $F \times F$, and zero-padding of $(F - 1)/2$, which preserves the spatial size:
• F = 3 → zero-pad with 1
• F = 5 → zero-pad with 2
• F = 7 → zero-pad with 3
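These shape rules are easy to wrap in a small helper (a sketch, assuming the stride divides the padded extent evenly):

```python
def conv_out_size(n, f, s=1, p=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    assert (n - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (n - f + 2 * p) // s + 1

print(conv_out_size(7, 3))        # 5  (no padding)
print(conv_out_size(7, 3, p=1))   # 7  (padding preserves size)
print(conv_out_size(32, 5))       # 28
print(conv_out_size(32, 5, p=2))  # 32
```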
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2
Output size $= \frac{N - F + 2P}{S} + 1 = \frac{32 - 5 + 4}{1} + 1 = 32$
Output volume size: 10×32×32
Each filter has 5×5×3 + 1 = 76 params (+1 for bias)
⟹ 76×10 = 760 parameters in total
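The same counts can be checked in PyTorch (a quick sketch of the example layer):

```python
import torch
import torch.nn as nn

# The example layer: 3 input channels, 10 filters of size 5x5, stride 1, pad 2
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 760  (= 10 * (5*5*3) weights + 10 biases)

out = conv(torch.zeros(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 10, 32, 32])
```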
Convolution layer – a numerical example
Input volume (+pad 1): $C \times I \times I = 3 \times 7 \times 7$. Filters W0, W1: $C \times F \times F = 3 \times 3 \times 3$ each. Output volume: $2 \times 3 \times 3$ (two output maps of size $O \times O = 3 \times 3$).
• $C = 3$ channel inputs
• Two filters of size $F \times F = 3 \times 3$, applied with a stride of 2
• Each element of the output activations (green) is computed by element-wise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias
Can we have a 1×1 filter?
1×1 CONV with 32 filters: a 64×56×56 input volume becomes a 32×56×56 output volume (each filter has size 1×1×64, and performs a 64-dimensional dot product).
The channel dimension is reduced!
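A quick PyTorch check of the shape change (sketch):

```python
import torch
import torch.nn as nn

# 1x1 convolution: 64 input channels -> 32 output channels
conv1x1 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)

x = torch.randn(1, 64, 56, 56)
print(conv1x1(x).shape)  # torch.Size([1, 32, 56, 56]) -- spatial size unchanged
```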
Receptive field
Sparse local connectivity: CNNs exploit spatially local correlations by enforcing local connectivity between neurons of adjacent layers.
Note that the receptive fields of the neurons are limited because of this local connectivity.
For a convolution with kernel size $K$, each element in the output depends on a $K \times K$ receptive field in the input.
Receptive field
• The receptive field can be briefly defined as the region in the input space that a particular CNN feature is looking at
• The receptive field of units in the deeper layers of a convolutional network is larger than the receptive field of units in the shallow layers
• Even though direct connections in a CNN are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image
Receptive field
The receptive field at layer $k$ is the area, denoted $R_k \times R_k$, of the input that each pixel of the $k$-th activation map can 'see'.
Calling $F_j$ the filter size of layer $j$ and $S_i$ the stride value of layer $i$, and with the convention $S_0 = 1$, the receptive field at layer $k$ can be computed with the formula:
$$R_k = 1 + \sum_{j=1}^{k} (F_j - 1) \prod_{i=0}^{j-1} S_i$$
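The formula transcribes directly into a short function (a sketch; the layer lists below are hypothetical examples):

```python
def receptive_field(filter_sizes, strides):
    """R_k = 1 + sum_{j=1..k} (F_j - 1) * prod_{i=0..j-1} S_i, with S_0 = 1.

    filter_sizes[j-1] is F_j and strides[j-1] is S_j for layers j = 1..k.
    """
    r, jump = 1, 1  # jump accumulates the product of strides S_0 .. S_{j-1}
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r

# Two stacked 3x3 convolutions with stride 1 "see" a 5x5 input region
print(receptive_field([3, 3], [1, 1]))  # 5
# A 3x3 conv with stride 2 followed by a 3x3 conv with stride 1
print(receptive_field([3, 3], [2, 1]))  # 7
```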
Receptive field
Example: stacking two 3×3 convolutions with stride 1 gives each output unit a $R_2 = 1 + 2 + 2 = 5$, i.e., 5×5, receptive field on the input.
Convolution layer - summary
A convolution layer
• Accepts a volume of size $D_1 \times H_1 \times W_1$
• Requires four hyperparameters:
  • Number of filters $K$
  • Their spatial extent $F$
  • The stride $S$
  • The amount of zero padding $P$
• Produces a volume of size $D_2 \times H_2 \times W_2$, where
  • $W_2 = (W_1 - F + 2P)/S + 1$
  • $H_2 = (H_1 - F + 2P)/S + 1$ (i.e., width and height are computed equally by symmetry)
  • $D_2 = K$
• With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $F \cdot F \cdot D_1 \cdot K$ weights and $K$ biases
• In the output volume, the $d$-th depth slice (of size $H_2 \times W_2$) is the result of a valid convolution of the $d$-th filter over the input volume with a stride of $S$, offset by the $d$-th bias
Convolutional layer in PyTorch
https://pytorch.org/docs/stable/nn.html
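A minimal usage sketch mirroring the running 3×32×32 example (based on the nn.Conv2d documentation linked above):

```python
import torch
import torch.nn as nn

# K=6 filters, spatial extent F=5, stride S=1, zero padding P=0
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1, padding=0)

x = torch.randn(2, 3, 32, 32)   # batch of two 3x32x32 images
y = conv(x)
print(y.shape)                  # torch.Size([2, 6, 28, 28])
print(conv.weight.shape)        # torch.Size([6, 3, 5, 5]) -- D2 x D1 x F x F
print(conv.bias.shape)          # torch.Size([6]) -- one bias per filter
```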
An example of convolutional network: LeNet 5
Pooling layer
Pooling: 64×224×224 → 64×112×112
• Operates over each activation map independently
• Either 'max' or 'average' pooling is used at the pooling layer. That is, the convolved features are divided into disjoint regions and pooled by taking either the maximum or the average.
Pooling layer
MAX pooling
[Figure: max pooling applied to a feature map]
Pooling is intended to subsample the convolution layer. The default stride for pooling is equal to the filter width.
Pooling layer
Consider pooling with non-overlapping windows $\{(l, m)\}_{l = -L/2, \, m = -M/2}^{L/2, \, M/2}$ of size $L \times M$. Pooling a feature map $\boldsymbol{y}$ at $p = (i, j)$ produces the pooled feature $z(i, j)$.
The max pooling output is the maximum of the activations inside the pooling window:
$$z(i, j) = \max_{(l, m)} y(i + l, j + m)$$
The mean pooling output is the mean of the activations in the pooling window:
$$z(i, j) = \frac{1}{L \times M} \sum_{l} \sum_{m} y(i + l, j + m)$$
Pooling layer
Why pooling?
• Pooling layers can be used for building inner activations that are (slightly) invariant to small
translations of the input.
• Invariance to local translation is helpful if we care more about the presence of a pattern
rather than its exact position.
Pooling layer - summary
A pooling layer
• Accepts a volume of size $D_1 \times H_1 \times W_1$
• Requires two hyperparameters:
  • Their spatial extent $F$
  • The stride $S$
• Produces a volume of size $D_2 \times H_2 \times W_2$, where
  • $W_2 = (W_1 - F)/S + 1$
  • $H_2 = (H_1 - F)/S + 1$
  • $D_2 = D_1$
• Introduces zero parameters, since it computes a fixed function of the input
• Note that it is not common to use zero-padding for pooling layers
Pooling layer in PyTorch
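A minimal usage sketch matching the 64×224×224 example above (note that nn.MaxPool2d's stride defaults to the kernel size, consistent with the default mentioned earlier):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size, i.e., 2

x = torch.randn(1, 64, 224, 224)
print(pool(x).shape)                # torch.Size([1, 64, 112, 112])

avg = nn.AvgPool2d(kernel_size=2)   # average pooling, same shape behavior
print(avg(x).shape)                 # torch.Size([1, 64, 112, 112])
```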
An example of convolutional network: LeNet 5
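A minimal LeNet-5-style network in PyTorch (a sketch, not the lecture's code; layer sizes follow the classic 32×32-input LeNet-5, with tanh activations and average pooling):

```python
import torch
import torch.nn as nn

# LeNet-5-style sketch: conv -> pool -> conv -> pool -> FC -> FC -> output
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 6x28x28 -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 6x14x14 -> 16x10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 16x10x10 -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```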
Fully connected layer
Stretch the 3×32×32 input volume into a 1×3072 vector, and multiply it by a 3072×10 weight matrix $W$ to obtain the 10 output activations.
Each activation is 1 number: the result of taking a dot product between a column of $W$ and the input (a 3072-dimensional dot product).
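A minimal sketch of this layer in PyTorch (note that nn.Linear stores its weight as a 10×3072 matrix, so each output is a dot product with a row rather than a column):

```python
import torch
import torch.nn as nn

fc = nn.Linear(3 * 32 * 32, 10)       # weight: 10 x 3072, plus 10 biases

x = torch.randn(1, 3, 32, 32)
scores = fc(x.flatten(start_dim=1))   # stretch to 1x3072, then one matrix multiply
print(scores.shape)                   # torch.Size([1, 10])
```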
Activation function
Each output is $y = f(u)$, where $u$ is the synaptic input and $f$ is an activation function (it aims at introducing non-linearities to the network).
Activation function – Sigmoid
$$\sigma(x) = 1/(1 + e^{-x})$$
• The sigmoid non-linearity squashes real numbers to the range [0, 1]
• Large negative numbers become 0 and large positive numbers become 1
• Frequently used historically, since it has a nice interpretation as the firing rate of a neuron
Activation function – Sigmoid
$$\sigma(x) = 1/(1 + e^{-x})$$
• The sigmoid non-linearity has fallen out of favor and is now rarely used
• Weaknesses:
  • Sigmoids saturate and kill gradients
  • A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero
  • During backpropagation, this (local) gradient is multiplied by the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and, recursively, to its data. This makes learning very slow.
Activation function – Sigmoid
$$\sigma(x) = 1/(1 + e^{-x})$$
• Weaknesses:
  • Sigmoid outputs are not zero-centered
  • This is undesirable, since neurons in later layers of processing in a neural network would receive data that is not zero-centered. This has implications for the dynamics during gradient descent (zig-zagging).
Activation function – Tanh
• The tanh non-linearity squashes real numbers to the range [-1, 1]
• Its activations saturate, but unlike the sigmoid neuron its output is zero-centered
• Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity
• Tanh is simply a scaled sigmoid neuron: $\tanh(x) = 2\sigma(2x) - 1$
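A quick numerical check of this identity (sketch):

```python
import torch

x = torch.linspace(-3, 3, 7)
# tanh(x) = 2 * sigmoid(2x) - 1
print(torch.allclose(torch.tanh(x), 2 * torch.sigmoid(2 * x) - 1))  # True
```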
Activation function – ReLU
• Rectified Linear Unit (ReLU) activation
function, which is zero when x < 0 and
then linear with slope 1 when x > 0
𝑓(𝑥) = max(0, 𝑥)
Activation function – ReLU
• (+) It was found to greatly accelerate the convergence of stochastic gradient descent compared to the sigmoid/tanh functions (e.g., by a factor of 6 in Krizhevsky et al.). It is argued that this is due to its linear, non-saturating form.
[Figure: training error vs. epochs, ReLU converging faster than tanh (Krizhevsky et al.)]
https://poloclub.github.io/cnn-explainer/
Outline
AlexNet: 8 layers
Deeper networks, with more intricate and different connectivity structures.
• The split (i.e., two pathways) in the image above is the split between two GPUs
• Trained for about a week on two NVIDIA GTX 580 3GB GPUs
• 60 million parameters
• Input layer: size 227×227×3
• 8 layers deep: 5 convolutional layers (with pooling) and 3 fully connected layers
• Batch normalization
  • Normalizes the activations for each training mini-batch
  • Allows us to use much higher learning rates and be less careful about initialization
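A minimal sketch of batch normalization in PyTorch (shapes are illustrative):

```python
import torch
import torch.nn as nn

# Normalize each of the 64 channels over the mini-batch (and spatial dims)
bn = nn.BatchNorm2d(num_features=64)

x = torch.randn(8, 64, 28, 28)          # mini-batch of 8
y = bn(x)
print(y.mean().item(), y.std().item())  # roughly 0 and 1 after normalization
```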
Deep architectures - DenseNet
G. Huang et al., Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
The research is ongoing.
https://colab.research.google.com/
Mounting Google Drive locally
https://colab.research.google.com/notebooks/io.ipynb
An example (padding)
Import packages
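The notebook code itself is not preserved in this text; a minimal sketch of what such a padding example might look like (all names are illustrative):

```python
# Import packages
import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)

# Without padding: 7x7 -> 5x5
conv_nopad = nn.Conv2d(3, 6, kernel_size=3, stride=1, padding=0)
print(conv_nopad(x).shape)  # torch.Size([1, 6, 5, 5])

# With 1-pixel zero padding: 7x7 -> 7x7 (spatial size preserved)
conv_pad = nn.Conv2d(3, 6, kernel_size=3, stride=1, padding=1)
print(conv_pad(x).shape)    # torch.Size([1, 6, 7, 7])
```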
https://pytorch.org/tutorials/