
AI6126 Advanced Computer Vision

Last update: 19 January 2022 10:30am

Convolutional Neural
Networks
Chen-Change Loy
吕健勤

http://www.ntu.edu.sg/home/ccloy/
https://twitter.com/ccloy
Outline

• Basic components in CNN
• CNN architectures
• Short introduction to Colab
Motivation

Let us consider the first layer of an MLP taking images as input. What are the problems with this architecture?

(Figure: a fully connected layer mapping a 784-dimensional input to 100 hidden units)
Motivation
Issues

• Too many parameters: 100 × 784 + 100.
• What if images are 640 × 480 × 3?
• What if the first layer has 1000 units?
• Fully connected networks, in which every neuron in one layer is connected to every neuron in the next, are not feasible for signals of large resolution: the feedforward and backpropagation computations become too expensive.
• The spatial organization of the input is destroyed.
• The network is not invariant to transformations (e.g., translation).
Locally connected networks
Instead, let us keep only a sparse set of connections, where all weights of the same color (in the figure) are shared.

• The resulting operation can be seen as shifting the same weight triplet (kernel) along the input.
• The set of inputs seen by each unit is its receptive field.

This is a 1D convolution (here mapping 28 inputs to 26 outputs), which can be generalized to more dimensions.
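A minimal NumPy sketch of this 1D convolution with a shared weight triplet (the kernel values below are made up for illustration):

import numpy as np

def conv1d_valid(x, w):
    """1D 'valid' convolution: slide the same kernel w over x."""
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

x = np.random.randn(28)          # 28 input units
w = np.array([0.2, 0.5, 0.3])    # one shared weight triplet (kernel)
y = conv1d_valid(x, w)
print(y.shape)                   # (26,) -- each output sees a 3-element receptive field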
Basic Components in CNN
Credits
• CS 230: Deep Learning
  • https://stanford.edu/~shervine/teaching/cs-230.html
• CS231n: Convolutional Neural Networks for Visual Recognition
  • http://cs231n.stanford.edu/syllabus.html
An example of convolutional network: LeNet 5

LeCun et al., 1998


An example of convolutional network: LeNet 5

Convolution layer Pooling layer Fully connected layer

As we go deeper (left to right), the height and width tend to go down and the number of channels increases.

Common layer arrangement: Conv → pool → Conv → pool → fully connected → fully connected → output
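A hedged PyTorch sketch of this arrangement, using the classic LeNet-5 layer sizes (1×32×32 input, 6 and 16 filters of size 5×5, FC layers of 120/84/10 units); the exact configuration in the figure above may differ:

import torch
import torch.nn as nn

# Conv -> pool -> Conv -> pool -> fully connected -> fully connected -> output
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 6x28x28 -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5),  # 6x14x14 -> 16x10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 16x10x10 -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fully connected
    nn.Tanh(),
    nn.Linear(120, 84),               # fully connected
    nn.Tanh(),
    nn.Linear(84, 10),                # output (10 classes)
)

print(lenet(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])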
Convolution layer

Input: a 3×32×32 image (3 channels / depth, height 32, width 32).

Convolution layer

Input: a 3×32×32 image; filter: 3×5×5.
The weights of the filters are learned with the backpropagation algorithm; the learned weights are known as filters or kernels.

Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”.

Convolution layer

Filters always extend the full depth of the input volume (here, the 3×5×5 filter covers all 3 channels of the 3×32×32 image).
Convolution layer

3×32×32 image, 3×5×5 filter.

Each output is 1 number: the result of taking a dot product between the filter and a small 3×5×5 chunk of the image (i.e., a 3·5·5 = 75-dimensional dot product, plus a bias).

An example of convolving a 1×5×5 image with a 1×3×3 filter, whose weights (kernel) are

1 0 1
0 1 1
0 0 1


Convolution layer

Convolving (sliding) a 3×5×5 filter over all spatial locations of a 3×32×32 image produces a 1×28×28 activation/feature map.

The activations are obtained by convolving the filters (weights) with the input activations. The output activation produced by a particular filter is known as an activation map / feature map.
Convolution layer

Consider a second (green) filter: convolving two 3×5×5 filters over the 3×32×32 image produces two 1×28×28 activation/feature maps.
Convolution layer

If we had six 3×5×5 filters, we would get six separate 1×28×28 activation maps. We stack these up to get a “new image” of size 6×28×28.
Convolution layer

The convolution layer maps the 3×32×32 image to 6×28×28 activation/feature maps (there is also a 6-dim bias vector, one bias per filter).
Convolution layer

With a batch dimension, the layer maps a 2×3×32×32 batch of images to a 2×6×28×28 batch of activation/feature maps (again with the 6-dim bias vector).
Convolution layer

In general, the layer maps an N×D₁×H₁×W₁ batch of images to an N×D₂×H₂×W₂ batch of activation/feature maps, using a D₂×D₁×F×F weight tensor (D₂ filters of size D₁×F×F) and a D₂-dim bias vector.
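A minimal PyTorch sketch of these shapes, using the sizes from the previous slides (a batch of 2 images, 6 filters of size 3×5×5):

import torch
import torch.nn.functional as F

N, D1, H1, W1 = 2, 3, 32, 32          # batch of images
D2, F_sz = 6, 5                        # 6 filters of size 3x5x5

x = torch.randn(N, D1, H1, W1)
w = torch.randn(D2, D1, F_sz, F_sz)    # D2 x D1 x F x F weight tensor
b = torch.randn(D2)                    # D2-dim bias vector

y = F.conv2d(x, w, b)                  # stride 1, no padding
print(y.shape)                         # torch.Size([2, 6, 28, 28])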
An example of convolutional network: LeNet 5

Convolution layer

Back to this example: 6@28x28 means that we have 6 feature maps of size 28×28.
You can infer that the first convolutional layer uses 6 filters, each of size 5×5 (how do we know that? We will discuss it later).
Convolution layer

A CNN is a sequence of convolutional layers, interspersed with activation functions:

3×32×32 input → [CONV (e.g., six 3×5×5 filters), ReLU] → 6×28×28 → [CONV (e.g., ten 6×5×5 filters), ReLU] → 10×24×24 → [CONV, ReLU] → …
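A minimal PyTorch sketch of this CONV → ReLU chain (filter counts and sizes follow the slide):

import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),    # 3x32x32 -> 6x28x28
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=5),   # 6x28x28 -> 10x24x24
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)
print(convnet(x).shape)                # torch.Size([1, 10, 24, 24])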
Convolution layer

Consider a kernel $\mathbf{w} = \{w(l, m)\}$ of size $L \times M$, with $L = 2a + 1$, $M = 2b + 1$.

The synaptic input at location $p = (i, j)$ of the first hidden layer due to a kernel is given by

$$u(i, j) = \sum_{l=-a}^{a} \sum_{m=-b}^{b} x(i + l,\, j + m)\, w(l, m) + b$$

For instance, given $L = 3$, $M = 3$:

$$u(i, j) = x(i-1, j-1)\, w(-1, -1) + x(i-1, j)\, w(-1, 0) + \dots + x(i, j)\, w(0, 0) + \dots + x(i+1, j+1)\, w(1, 1) + b$$
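A direct NumPy transcription of this formula, as a sketch (the kernel is assumed to be centered at (i, j), and only the 'valid' output region is computed):

import numpy as np

def synaptic_input(x, w, bias=0.0):
    """u(i, j) = sum_{l=-a..a} sum_{m=-b..b} x(i+l, j+m) * w(l, m) + bias."""
    L, M = w.shape
    a, b = (L - 1) // 2, (M - 1) // 2
    H, W = x.shape
    u = np.zeros((H - 2 * a, W - 2 * b))          # 'valid' region only
    for i in range(a, H - a):
        for j in range(b, W - b):
            s = bias
            for l in range(-a, a + 1):
                for m in range(-b, b + 1):
                    s += x[i + l, j + m] * w[l + a, m + b]
            u[i - a, j - b] = s
    return u

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)   # L = M = 3, i.e., a = b = 1
print(synaptic_input(x, w).shape)   # (3, 3)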
Convolution layer

The output of the neuron at $(i, j)$ of the convolution layer is

$$y(i, j) = f(u(i, j))$$

where $f$ is an activation function. For deep CNNs, we typically use ReLU, $f(x) = \max(0, x)$.

Note that one weight tensor $\mathbf{w}_k = \{w_k(l, m)\}$, i.e., one kernel (filter), creates one feature map:

$$\mathbf{y}_k = \{y_k(i, j)\}$$

If there are $K$ weight tensors $(\mathbf{w}_k)_{k=1}^{K}$, the convolutional layer is formed by $K$ feature maps:

$$\mathbf{y} = (\mathbf{y}_k)_{k=1}^{K}$$
Convolution layer

Convolution by doing a sliding window

As a guiding example, let us consider the convolution of single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{w} \in \mathbb{R}^{3 \times 3}$:

$$
\mathbf{w} \star \mathbf{x} =
\begin{bmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{bmatrix}
\star
\begin{bmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{bmatrix}
=
\begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}
$$
Convolution layer

$$
\mathbf{w} \star \mathbf{x} =
\begin{bmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{bmatrix}
\star
\begin{bmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{bmatrix}
=
\begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}
$$

$$u(i, j) = \sum_{l=-a}^{a} \sum_{m=-b}^{b} x(i + l,\, j + m)\, w(l, m) + b$$

u(1,1) = 4×1 + 5×4 + 8×1 + 1×1 + 8×4 + 8×3 + 3×3 + 6×3 + 6×1 = 122
u(1,2) = 5×1 + 8×4 + 7×1 + 8×1 + 8×4 + 8×3 + 6×3 + 6×3 + 4×1 = 148
u(2,1) = 1×1 + 8×4 + 8×1 + 3×1 + 6×4 + 6×3 + 6×3 + 5×3 + 7×1 = 126
u(2,2) = 8×1 + 8×4 + 8×1 + 6×1 + 6×4 + 4×3 + 5×3 + 7×3 + 8×1 = 134
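A quick check of this result in PyTorch; F.conv2d slides the kernel and takes dot products exactly as in the formula above (i.e., cross-correlation, without flipping the kernel):

import torch
import torch.nn.functional as F

x = torch.tensor([[4., 5., 8., 7.],
                  [1., 8., 8., 8.],
                  [3., 6., 6., 4.],
                  [6., 5., 7., 8.]]).reshape(1, 1, 4, 4)
w = torch.tensor([[1., 4., 1.],
                  [1., 4., 3.],
                  [3., 3., 1.]]).reshape(1, 1, 3, 3)

print(F.conv2d(x, w))   # [[122., 148.], [126., 134.]]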
Convolution layer
Convolution as matrix multiplication
• The convolution operation essentially performs dot products between the filters and local regions of the input.
• Common practice is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply.

Convolution layer
The convolution operation can be equivalently re-expressed as a single matrix multiplication:

(Figure: convolution operation; rearrange the kernel; flattened input)


Convolution layer
Convolution as matrix multiplication

Using the same guiding example, consider the convolution of single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{w} \in \mathbb{R}^{3 \times 3}$, with $\mathbf{w} \star \mathbf{x} = \begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}$ as before.
Convolution layer
• The convolutional kernel $\mathbf{w}$ is rearranged as a sparse Toeplitz matrix, called the convolution matrix:

$$
\mathbf{W} =
\begin{bmatrix}
1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1
\end{bmatrix}
$$

• The input $\mathbf{x}$ is flattened row by row, from top to bottom:

$$v(\mathbf{x}) = \begin{bmatrix} 4 & 5 & 8 & 7 & 1 & 8 & 8 & 8 & 3 & 6 & 6 & 4 & 6 & 5 & 7 & 8 \end{bmatrix}^\top$$

Then

$$\mathbf{W} v(\mathbf{x}) = \begin{bmatrix} 122 & 148 & 126 & 134 \end{bmatrix}^\top$$

which we can reshape to a 2×2 matrix to obtain $\mathbf{w} \star \mathbf{x}$.
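A sketch verifying the matrix-multiplication view in NumPy, with W and v(x) copied from the slide:

import numpy as np

W = np.array([
    [1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1, 0, 0, 0, 0, 0],
    [0, 1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1, 0],
    [0, 0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1],
])
x = np.array([[4, 5, 8, 7],
              [1, 8, 8, 8],
              [3, 6, 6, 4],
              [6, 5, 7, 8]])

v = x.reshape(-1)                  # flatten row by row
print((W @ v).reshape(2, 2))       # [[122 148] [126 134]]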


Convolution layer – visualization

(Figure: a 3-channel input; one filter => one activation map; example 3×5×5 filters, 32 filters in total. Figure copyright Andrej Karpathy.)


Convolution layer – spatial dimensions

Convolving (sliding) a 3×5×5 filter over all spatial locations of a 3×32×32 image produces a 1×28×28 activation/feature map. Why does the feature map have a size of 28×28?
Convolution layer – spatial dimensions

7×7 input (spatially), assume a 3×3 filter. The stride is the factor by which the output is subsampled, i.e., the distance between adjacent centers of the kernel.

conv with stride = 1
Convolution layer – spatial dimensions

7×7 input (spatially), assume a 3×3 filter.

conv with stride = 2
Convolution layer – spatial dimensions

N×N input (spatially), assume an F×F filter and stride S.

Output size = (N − F)/S + 1

e.g., N = 7, F = 3

stride 1 => (7 − 3)/1 + 1 = 5
stride 2 => (7 − 3)/2 + 1 = 3
stride 3 => (7 − 3)/3 + 1 = 2.33 :\ (does not fit)
Convolution layer – spatial dimensions

Back to the earlier question: why does the 1×28×28 feature map (from a 3×32×32 image and a 3×5×5 filter) have a size of 28×28?

Output size = (N − F)/S + 1

N = 32, F = 5

stride 1 => (32 − 5)/1 + 1 = 28


Convolution layer – spatial dimensions

The valid feature map is smaller than the input after convolution (e.g., 3×32×32 → 1×28×28 with a 3×5×5 filter).
Without zero-padding, the width of the representation shrinks by F − 1 at each layer.
To avoid shrinking the spatial extent of the network rapidly, small filters have to be used.
Convolution layer – zero padding

By zero-padding in each layer, we prevent the representation from shrinking with depth. In practice, we zero pad the border. In PyTorch, padding (when specified) adds zeros to the top, bottom, left, and right by default; this can be customized.

e.g., input 7×7, 3×3 filter applied with stride 1, padded with a 1-pixel border of zeros => what is the output shape?
Convolution layer – zero padding

e.g., input 7×7, 3×3 filter applied with stride 1, padded with a 1-pixel border => what is the output shape?

Recall that without padding, output size = (N − F)/S + 1.
With padding P, output size = (N − F + 2P)/S + 1.

e.g., N = 7, F = 3
stride 1 => (7 − 3 + 2(1))/1 + 1 = 7
Convolution layer – zero padding

7×7 output!

In general, it is common to see CONV layers with stride 1, filters of size F×F, and zero-padding of (F − 1)/2, which preserves the spatial size.

e.g., F = 3 => zero pad with 1
      F = 5 => zero pad with 2
      F = 7 => zero pad with 3
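A tiny helper implementing the output-size formula above (the function name is made up for illustration):

def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    size = (n - f + 2 * pad) / stride + 1
    if not size.is_integer():
        raise ValueError(f"filter/stride/padding do not fit: got {size}")
    return int(size)

print(conv_output_size(7, 3, stride=1))          # 5
print(conv_output_size(7, 3, stride=2))          # 3
print(conv_output_size(7, 3, stride=1, pad=1))   # 7
print(conv_output_size(32, 5, stride=1, pad=2))  # 32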
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Output volume size: ?
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Output volume size: ?

Output size = (N − F + 2P)/S + 1
(32 − 5 + 2(2))/1 + 1 = 32 spatially

The output volume size is 10×32×32


Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Number of parameters in this layer: ?
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Number of parameters in this layer: ?

Each filter has 5×5×3 + 1 = 76 params (+1 for bias)
=> 76×10 = 760
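A quick way to confirm this count in PyTorch:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)   # 760  (10 filters x (5*5*3 weights + 1 bias))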
Convolution layer – a numerical example

(Figure: input volume (+pad 1) of size C×I×I = 3×7×7; filters W0, W1 of size C×F×F = 3×3×3; output volume of size 2×3×3.)

• C = 3 channel inputs
• Two filters of size F×F = 3×3, applied with a stride of 2.
• Each element of the output activations (green) is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
Can we have a 1x1 filter?

Yes: a 1×1 CONV with 32 filters maps a 64×56×56 input to a 32×56×56 output (each filter has size 1×1×64, and performs a 64-dimensional dot product at every spatial location).

The channel dimension is reduced!
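A minimal PyTorch sketch of this 1×1 convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)              # 64 x 56 x 56 input
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)  # 1x1 CONV with 32 filters
print(conv1x1(x).shape)                     # torch.Size([1, 32, 56, 56])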
Receptive field
Sparse local connectivity - CNNs exploit spatially local correlations by enforcing local connectivity between
neurons of adjacent layers.

Note that the receptive fields of the neurons are limited because of local connectivity.

For convolution with kernel size 𝐾, each element in the output depends on a 𝐾×𝐾 receptive field in the input
Receptive field
• Receptive field can be briefly defined as the region in the input space that a
particular CNN’s feature is looking at

• The receptive field of the units in the deeper layers of a convolutional network
is larger than the receptive field of the units in the shallow layers

• This effect increases if the network includes strided convolutions or pooling

• Even though direct connections in a CNN are very sparse, units in the deeper
layers can be indirectly connected to all or most of the input image
Receptive field
The receptive field at layer k is the area, denoted R_k × R_k, of the input that each pixel of the k-th activation map can 'see'.

Calling F_j the filter size of layer j and S_i the stride of layer i, with the convention S_0 = 1, the receptive field at layer k can be computed with the formula:

$$R_k = 1 + \sum_{j=1}^{k} (F_j - 1) \prod_{i=0}^{j-1} S_i$$

Receptive field
Example

Filter sizes F_1 = F_2 = F_3 = 3, strides S_1 = S_2 = S_3 = 1

$$R_3 = 1 + \sum_{j=1}^{3} (F_j - 1) \prod_{i=0}^{j-1} S_i = 1 + (2 \times 1) + (2 \times 1) + (2 \times 1) = 7$$
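A small Python sketch of this formula (a hypothetical helper, not from the course materials):

def receptive_field(filter_sizes, strides):
    """R_k = 1 + sum_j (F_j - 1) * prod_{i<j} S_i, with S_0 = 1."""
    r, jump = 1, 1           # jump = product of strides of the previous layers
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r

print(receptive_field([3, 3, 3], [1, 1, 1]))   # 7, matching the example above
print(receptive_field([3, 3, 3], [2, 2, 2]))   # 1 + 2*1 + 2*2 + 2*4 = 15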
Convolution layer - summary

A convolution layer
• Accepts a volume of size D₁ × H₁ × W₁
• Requires four hyperparameters
  • Number of filters K
  • Their spatial extent F
  • The stride S
  • The amount of zero padding P
• Produces a volume of size D₂ × H₂ × W₂, where
  • W₂ = (W₁ − F + 2P)/S + 1
  • H₂ = (H₁ − F + 2P)/S + 1 (i.e., width and height are computed equally by symmetry)
  • D₂ = K
• With parameter sharing, it introduces F · F · D₁ weights per filter, for a total of F · F · D₁ · K weights and K biases
• In the output volume, the d-th depth slice (of size H₂ × W₂) is the result of a valid convolution of the d-th filter over the input volume with a stride of S, offset by the d-th bias
Convolutional layer in PyTorch

https://pytorch.org/docs/stable/nn.html
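A hedged sketch of how the summary hyperparameters map onto torch.nn.Conv2d:

import torch
import torch.nn as nn

# K filters of spatial extent F, stride S, zero padding P, over D1 input channels
D1, K, F, S, P = 3, 10, 5, 1, 2
conv = nn.Conv2d(in_channels=D1, out_channels=K, kernel_size=F, stride=S, padding=P)

x = torch.randn(1, D1, 32, 32)
print(conv(x).shape)   # torch.Size([1, 10, 32, 32]); W2 = (32 - 5 + 2*2)/1 + 1 = 32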
An example of convolutional network: LeNet 5

Pooling layer
Pooling layer

(Example: 64×224×224 → 64×112×112)

- Operates over each activation map independently
- Either ‘max’ or ‘average’ pooling is used at the pooling layer. That is, the convolved features are divided into disjoint regions and pooled by taking either the maximum or the average.
Pooling layer

MAX pooling

Single depth slice:

1 1 2 4
5 6 7 8     max pool with 2×2 filters        6 8
3 2 1 0     and stride 2               =>    3 4
1 2 3 4

Pooling is intended to subsample the convolution layer.
The default stride for pooling is equal to the filter width.
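A quick check of this example in PyTorch:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

print(F.max_pool2d(x, kernel_size=2, stride=2))  # [[6., 8.], [3., 4.]]
print(F.avg_pool2d(x, kernel_size=2, stride=2))  # [[3.25, 5.25], [2.00, 2.00]]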
Pooling layer

Consider pooling with non-overlapping windows $\{(l, m)\}_{l=-L/2,\, m=-M/2}^{L/2,\, M/2}$ of size $L \times M$.

The max pooling output is the maximum of the activations inside the pooling window. Pooling of a feature map $\mathbf{y}$ at $p = (i, j)$ produces the pooled feature

$$z(i, j) = \max_{l, m} \{ y(i + l, j + m) \}$$

The mean pooling output is the mean of the activations in the pooling window:

$$z(i, j) = \frac{1}{L \times M} \sum_{l} \sum_{m} y(i + l, j + m)$$
Pooling layer
Why pooling?

A function 𝑓 is invariant to g if 𝑓(𝑔(𝒙)) = 𝑓(𝒙).

• Pooling layers can be used for building inner activations that are (slightly) invariant to small
translations of the input.
• Invariance to local translation is helpful if we care more about the presence of a pattern than about its exact position.
Pooling layer - summary
A pooling layer
• Accepts a volume of size D₁ × H₁ × W₁
• Requires two hyperparameters
  • Their spatial extent F
  • The stride S
• Produces a volume of size D₂ × H₂ × W₂, where
  • W₂ = (W₁ − F)/S + 1
  • H₂ = (H₁ − F)/S + 1
  • D₂ = D₁
• Introduces zero parameters since it computes a fixed function of the input
• Note that it is not common to use zero-padding for pooling layers
Pooling layer in PyTorch
An example of convolutional network: LeNet 5

Fully connected layer


Fully connected layer
The fully connected layer (FC) operates on a flattened input where each input is
connected to all neurons.
If present, FC layers are usually found towards the end of CNN architectures and
can be used to optimize objectives such as class scores.
Fully connected layer

A 32×32×3 image is stretched to a 1×3072 input vector; the weight matrix W maps it to a 1×10 output activation.

Each output is 1 number: the result of taking a dot product between a column of W and the input (a 3072-dimensional dot product). Each neuron looks at the full input volume.


Fully connected layer in PyTorch
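A minimal sketch of the fully connected layer in PyTorch, matching the 3072 → 10 example above:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)        # 32x32x3 image (batch of 1)
flatten = nn.Flatten()               # stretch to 1 x 3072
fc = nn.Linear(3 * 32 * 32, 10)      # fully connected layer: 3072 -> 10

print(fc(flatten(x)).shape)          # torch.Size([1, 10])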
Activation function
Recall that the output of the neuron at (i, j) of the convolution layer is

y(i, j) = f(u(i, j))

where u is the synaptic input and f is an activation function (its role is to introduce non-linearity into the network).
Activation function
Activation function – Sigmoid

σ(x) = 1/(1 + e^(−x))

• The sigmoid non-linearity squashes real numbers to the range [0, 1]
• Large negative numbers become 0 and large positive numbers become 1
• Historically used frequently, since it has a nice interpretation as the firing rate of a neuron
Activation function – Sigmoid
• The sigmoid non-linearity has fallen out of favor and is rarely used
• Weaknesses:
  • Sigmoids saturate and kill gradients
    • A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero
    • During backpropagation, this (local) gradient is multiplied by the gradient of the gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and, recursively, to its data. This makes learning very slow.
Activation function – Sigmoid
• Weaknesses:
  • Sigmoid outputs are not zero-centered
    • This is undesirable, since neurons in later layers of processing would receive data that is not zero-centered. This has implications for the dynamics during gradient descent (zig-zagging).
Activation function – Tanh

tanh(x) = 2σ(2x) − 1 (simply a scaled sigmoid neuron)

• The tanh non-linearity squashes real numbers to the range [−1, 1]
• Its activations saturate, but unlike the sigmoid neuron its output is zero-centered
• Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity
Activation function – ReLU
• The Rectified Linear Unit (ReLU) activation function is zero when x < 0 and linear with slope 1 when x > 0:

f(x) = max(0, x)
Activation function – ReLU
• (+) It was found to greatly accelerate the convergence of stochastic gradient descent (e.g., by a factor of 6 in Krizhevsky et al.) compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
• (+) Compared to tanh/sigmoid neurons, which involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
Activation function – ReLU
• (-) Unfortunately, ReLU units can be fragile during training and can "die".
• For example, a large gradient flowing through a ReLU neuron could cause the
weights to update in such a way that the neuron will never activate on any data
point again (fall into the negative regime).
• If this happens, then the gradient flowing through the unit will forever be zero
from that point on.
• That is, the ReLU units can irreversibly die during training since they can get knocked off
the data manifold.
Activation function – Leaky ReLU
• Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the
function being zero when x < 0, a leaky ReLU will instead have a small negative
slope (of 0.01, or so). That is, the function computes
𝑓(𝑥) = 𝟙(𝑥 < 0)(𝛼𝑥) + 𝟙(𝑥 >= 0)(𝑥)
where 𝛼 is a small constant.
• Some people report success with this form of activation function, but the
results are not always consistent.
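A short PyTorch sketch applying these activation functions:

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(torch.sigmoid(x))                       # squashes to (0, 1)
print(torch.tanh(x))                          # squashes to (-1, 1), zero-centered
print(F.relu(x))                              # max(0, x)
print(F.leaky_relu(x, negative_slope=0.01))   # small slope alpha = 0.01 for x < 0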
Visualization

https://poloclub.github.io/cnn-explainer/
Outline

• Basic components in CNN
• CNN architectures
• Short introduction to Colab
CNN Architectures

Deep networks for ImageNet

(Timeline 2012–2018: AlexNet 8 layers → VGG 19 layers → GoogLeNet 22 layers → ResNet 152 layers (MSRA). Architectures became deeper, with more intricate and different connectivity structures.)


Performance of previous years on ImageNet (Top-5 Error)

shallow @ 2010: 28.0%
shallow @ 2011: 26.0%
AlexNet (8 layers) @ 2012: 15.3%
ZFNet (8 layers) @ 2013: 11.7%
GoogLeNet (22 layers) @ 2014: 6.7%
ResNet (152 layers) @ 2015: 3.6%

Depth is the key to high classification accuracy.


Deep architectures - AlexNet

• The split (i.e., two pathways) in the image above is the split between two GPUs.
• Trained for about a week on two NVIDIA GTX 580 3GB GPUs
• 60 million parameters
• Input layer: size 227×227×3
• 8 layers deep: 5 convolution and pooling layers and 3 fully connected layers


Deep architectures - AlexNet

(Figure: the 96 kernels learned by the first convolution layer; 48 kernels were learned by each GPU.)


Deep architectures - AlexNet

• Escape from a few layers
• ReLU non-linearity to mitigate the vanishing-gradient problem
• Data augmentation
• Dropout
• Outperformed all previous models on ILSVRC by 10%
Deep architectures - AlexNet

Layer           Weights
L1 (Conv)       34,944
L2 (Conv)       614,656
L3 (Conv)       885,120
L4 (Conv)       1,327,488
L5 (Conv)       884,992
L6 (FC)         37,752,832
L7 (FC)         16,781,312
L8 (FC)         4,097,000
Conv Subtotal   3,747,200
FC Subtotal     58,631,144
Total           62,378,344

First convolution layer: 96 kernels of size 11x11x3, applied with a stride of 4 pixels.
Number of parameters = (11x11x3 + 1)*96 = 34,944

Note: There are no parameters associated with a pooling layer. The pool size, stride, and padding are hyperparameters.
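A quick check of the first-layer count with PyTorch (a sketch; only this one layer is built):

import torch.nn as nn

# AlexNet's first convolution layer: 96 kernels of size 11x11x3, stride 4
conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
print(sum(p.numel() for p in conv1.parameters()))   # 34944 = (11*11*3 + 1) * 96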


Deep architectures - AlexNet

Second convolution layer: 256 kernels of size 5x5x96
Number of parameters = (5x5x96 + 1)*256 = 614,656


Deep architectures - AlexNet

First FC layer (L6):
Number of neurons = 4096
Number of kernels in the previous Conv layer = 256
Size (width) of the output image of the previous Conv layer = 6
Number of parameters = (6x6x256*4096) + 4096 = 37,752,832


Deep architectures - AlexNet

The last FC layer (L8):
Number of neurons = 1000
Number of neurons in the previous FC layer = 4096
Number of parameters = (1000*4096) + 1000 = 4,097,000


Deep architectures - GoogLeNet
• An important lesson - go deeper
• Inception structures (v2, v3, v4)
• Reduce parameters (4M vs 60M in AlexNet)
  • The 1x1 convolutions are performed to reduce the dimensions of the input/output
• Batch normalization
  • Normalizes the activations for each training mini-batch
  • Allows us to use much higher learning rates and be less careful about initialization


Deep architectures - VGG
• An important lesson - go deeper
• 140M parameters
• Now commonly used for computing perceptual loss



Deep architectures - ResNet
• An important lesson - go deeper
• Escape from 100 layers
• Residual learning: the identity shortcut drives the new layer to learn something different
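A minimal sketch of a residual block in PyTorch (simplified: batch normalization and the full ResNet layout are omitted):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))   # residual branch F(x)
        return self.relu(out + x)                    # add the identity shortcut

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])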


The importance of skip connections
• When networks become sufficiently
deep, neural loss landscapes quickly
transition from being nearly convex
to being highly chaotic.
• This transition from convex to
chaotic behavior coincides with a
dramatic increase in generalization
error, and ultimately to a lack of
trainability.
• Skip connections promote flat
minimizers and prevent the
transition to chaotic behavior,
which helps explain why skip
connections are necessary for
training extremely deep networks.

Visualizing the Loss Landscape of Neural Nets, NeurIPS 2018


Deep architectures - Inception-ResNet
Deep architectures - DenseNet
• Dense Block
  • Each layer is directly connected to every other layer in a feed-forward fashion
  • The feature maps of all preceding layers are treated as separate inputs, and its own feature maps are passed on as inputs to all subsequent layers
  • Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision
  • Deep supervision encourages the intermediate layers to learn discriminative features

G. Huang et al., Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
Deep architectures - DenseNet

In a standard ConvNet, the input image goes through multiple convolutions to obtain high-level features.

In ResNet, identity mapping is proposed to promote gradient propagation. Element-wise addition is used.

In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Concatenation is used. Each layer receives a “collective knowledge” from all preceding layers.

G. Huang et al., Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
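A minimal sketch of the DenseNet-style concatenation in PyTorch (simplified: the actual DenseNet layers use BN-ReLU-Conv compositions and transition layers):

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """A minimal dense layer: concatenate the input with its own feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        new_features = self.relu(self.conv(x))
        return torch.cat([x, new_features], dim=1)   # concatenation, not addition

x = torch.randn(1, 16, 32, 32)
block = nn.Sequential(DenseLayer(16, 12), DenseLayer(28, 12), DenseLayer(40, 12))
print(block(x).shape)   # torch.Size([1, 52, 32, 32])  (16 + 3*12 channels)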
The research is ongoing

Finding the optimal neural network architecture remains an active area of research.
Outline

• Basic components in CNN
• CNN architectures
• Short introduction to Colab
Short Introduction to Colab
Get started

https://colab.research.google.com/
Mounting Google Drive locally

https://colab.research.google.com/notebooks/io.ipynb
An example (padding)

Install PyTorch and Torchvision

!pip install -U torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Import packages
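For instance, a hedged padding example one might run in Colab (illustrative values; not the exact notebook from the slide):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

conv_nopad = nn.Conv2d(3, 6, kernel_size=5, padding=0)
conv_pad = nn.Conv2d(3, 6, kernel_size=5, padding=2)

print(conv_nopad(x).shape)   # torch.Size([1, 6, 28, 28])
print(conv_pad(x).shape)     # torch.Size([1, 6, 32, 32])  -- padding preserves the size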

Try more examples on

https://pytorch.org/tutorials/
