
AI6126 Advanced Computer Vision

Last update: 19 January 2022 10:30am

Convolutional Neural
Networks
Chen-Change Loy
吕健勤

http://www.ntu.edu.sg/home/ccloy/
https://twitter.com/ccloy
Outline

• Basic components in CNN
• CNN architectures
• Short introduction to Colab
Motivation

Let us consider the first layer of an MLP taking images as input. What are the problems with this architecture?

(Figure: a fully connected layer mapping a 784-dimensional input to 100 hidden units)
Motivation
Issues

• Too many parameters: 100 × 784 + 100.
• What if images are 640 × 480 × 3?
• What if the first layer has 1000 units?
• Fully connected networks, in which every neuron in one layer is connected to every neuron in the next, are not feasible for signals of large resolution: the feedforward and backpropagation computations become too expensive.
• The spatial organization of the input is destroyed.
• The network is not invariant to transformations (e.g., translation).
Locally connected networks
Instead, let us keep only a sparse set of connections, where all weights of the same color (in the figure) are shared.

• The resulting operation can be seen as shifting the same weight triplet (kernel) along the input.
• The set of inputs seen by each unit is its receptive field.

This is a 1D convolution (here mapping 28 inputs to 26 outputs), which can be generalized to more dimensions.
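A minimal NumPy sketch of this 1D convolution with a shared weight triplet (the kernel values below are made up for illustration):

import numpy as np

def conv1d_valid(x, w):
    """1D 'valid' convolution: slide the same kernel w over x."""
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

x = np.random.randn(28)          # 28 input units
w = np.array([0.2, 0.5, 0.3])    # one shared weight triplet (kernel)
y = conv1d_valid(x, w)
print(y.shape)                   # (26,) -- each output sees a 3-element receptive field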
Basic Components in CNN
Credits
• CS 230: Deep Learning
  • https://stanford.edu/~shervine/teaching/cs-230.html
• CS231n: Convolutional Neural Networks for Visual Recognition
  • http://cs231n.stanford.edu/syllabus.html
An example of convolutional network: LeNet 5

LeCun et al., 1998


An example of convolutional network: LeNet 5

Convolution layer Pooling layer Fully connected layer

As we go deeper (left to right), the height and width tend to go down and the number of channels increases.

Common layer arrangement: Conv → pool → Conv → pool → fully connected → fully connected → output
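A hedged PyTorch sketch of this arrangement, using the classic LeNet-5 layer sizes (1×32×32 input, 6 and 16 filters of size 5×5, FC layers of 120/84/10 units); the exact configuration in the figure above may differ:

import torch
import torch.nn as nn

# Conv -> pool -> Conv -> pool -> fully connected -> fully connected -> output
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 6x28x28 -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5),  # 6x14x14 -> 16x10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 16x10x10 -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fully connected
    nn.Tanh(),
    nn.Linear(120, 84),               # fully connected
    nn.Tanh(),
    nn.Linear(84, 10),                # output (10 classes)
)

print(lenet(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])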
Convolution layer

Input: a 3×32×32 image (3 channels / depth, height 32, width 32).

Convolution layer

Input: a 3×32×32 image; filter: 3×5×5.
The weights of the filters are learned with the backpropagation algorithm; the learned weights are known as filters or kernels.

Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”.

Convolution layer

Filters always extend the full depth of the input volume (here, the 3×5×5 filter covers all 3 channels of the 3×32×32 image).
Convolution layer

3×32×32 image, 3×5×5 filter.

Each output is 1 number: the result of taking a dot product between the filter and a small 3×5×5 chunk of the image (i.e., a 3·5·5 = 75-dimensional dot product, plus a bias).

An example of convolving a 1×5×5 image with a 1×3×3 filter, whose weights (kernel) are

1 0 1
0 1 1
0 0 1


Convolution layer

Convolving (sliding) a 3×5×5 filter over all spatial locations of a 3×32×32 image produces a 1×28×28 activation/feature map.

The activations are obtained by convolving the filters (weights) with the input activations. The output activation produced by a particular filter is known as an activation map / feature map.
Convolution layer

Consider a second (green) filter: convolving two 3×5×5 filters over the 3×32×32 image produces two 1×28×28 activation/feature maps.
Convolution layer

If we had six 3×5×5 filters, we would get six separate 1×28×28 activation maps. We stack these up to get a “new image” of size 6×28×28.
Convolution layer

The convolution layer maps the 3×32×32 image to 6×28×28 activation/feature maps (there is also a 6-dim bias vector, one bias per filter).
Convolution layer

With a batch dimension, the layer maps a 2×3×32×32 batch of images to a 2×6×28×28 batch of activation/feature maps (again with the 6-dim bias vector).
Convolution layer

In general, the layer maps an N×D₁×H₁×W₁ batch of images to an N×D₂×H₂×W₂ batch of activation/feature maps, using a D₂×D₁×F×F weight tensor (D₂ filters of size D₁×F×F) and a D₂-dim bias vector.
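A minimal PyTorch sketch of these shapes, using the sizes from the previous slides (a batch of 2 images, 6 filters of size 3×5×5):

import torch
import torch.nn.functional as F

N, D1, H1, W1 = 2, 3, 32, 32          # batch of images
D2, F_sz = 6, 5                        # 6 filters of size 3x5x5

x = torch.randn(N, D1, H1, W1)
w = torch.randn(D2, D1, F_sz, F_sz)    # D2 x D1 x F x F weight tensor
b = torch.randn(D2)                    # D2-dim bias vector

y = F.conv2d(x, w, b)                  # stride 1, no padding
print(y.shape)                         # torch.Size([2, 6, 28, 28])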
An example of convolutional network: LeNet 5

Convolution layer

Back to this example: 6@28x28 means that we have 6 feature maps of size 28×28.
You can infer that the first convolutional layer uses 6 filters, each of size 5×5 (how do we know that? We will discuss it later).
Convolution layer

A CNN is a sequence of convolutional layers, interspersed with activation functions:

3×32×32 input → [CONV (e.g., six 3×5×5 filters), ReLU] → 6×28×28 → [CONV (e.g., ten 6×5×5 filters), ReLU] → 10×24×24 → [CONV, ReLU] → …
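A minimal PyTorch sketch of this CONV → ReLU chain (filter counts and sizes follow the slide):

import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),    # 3x32x32 -> 6x28x28
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=5),   # 6x28x28 -> 10x24x24
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)
print(convnet(x).shape)                # torch.Size([1, 10, 24, 24])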
Convolution layer

Consider a kernel $\mathbf{w} = \{w(l, m)\}$ of size $L \times M$, with $L = 2a + 1$, $M = 2b + 1$.

The synaptic input at location $p = (i, j)$ of the first hidden layer due to a kernel is given by

$$u(i, j) = \sum_{l=-a}^{a} \sum_{m=-b}^{b} x(i + l,\, j + m)\, w(l, m) + b$$

For instance, given $L = 3$, $M = 3$:

$$u(i, j) = x(i-1, j-1)\, w(-1, -1) + x(i-1, j)\, w(-1, 0) + \dots + x(i, j)\, w(0, 0) + \dots + x(i+1, j+1)\, w(1, 1) + b$$
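A direct NumPy transcription of this formula, as a sketch (the kernel is assumed to be centered at (i, j), and only the 'valid' output region is computed):

import numpy as np

def synaptic_input(x, w, bias=0.0):
    """u(i, j) = sum_{l=-a..a} sum_{m=-b..b} x(i+l, j+m) * w(l, m) + bias."""
    L, M = w.shape
    a, b = (L - 1) // 2, (M - 1) // 2
    H, W = x.shape
    u = np.zeros((H - 2 * a, W - 2 * b))          # 'valid' region only
    for i in range(a, H - a):
        for j in range(b, W - b):
            s = bias
            for l in range(-a, a + 1):
                for m in range(-b, b + 1):
                    s += x[i + l, j + m] * w[l + a, m + b]
            u[i - a, j - b] = s
    return u

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)   # L = M = 3, i.e., a = b = 1
print(synaptic_input(x, w).shape)   # (3, 3)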
Convolution layer

The output of the neuron at $(i, j)$ of the convolution layer is

$$y(i, j) = f(u(i, j))$$

where $f$ is an activation function. For deep CNNs, we typically use ReLU, $f(x) = \max(0, x)$.

Note that one weight tensor $\mathbf{w}_k = \{w_k(l, m)\}$, i.e., one kernel (filter), creates one feature map:

$$\mathbf{y}_k = \{y_k(i, j)\}$$

If there are $K$ weight tensors $(\mathbf{w}_k)_{k=1}^{K}$, the convolutional layer is formed by $K$ feature maps:

$$\mathbf{y} = (\mathbf{y}_k)_{k=1}^{K}$$
Convolution layer

Convolution by doing a sliding window

As a guiding example, let us consider the convolution of single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{w} \in \mathbb{R}^{3 \times 3}$:

$$
\mathbf{w} \star \mathbf{x} =
\begin{bmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{bmatrix}
\star
\begin{bmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{bmatrix}
=
\begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}
$$
Convolution layer

$$
\mathbf{w} \star \mathbf{x} =
\begin{bmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{bmatrix}
\star
\begin{bmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{bmatrix}
=
\begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}
$$

$$u(i, j) = \sum_{l=-a}^{a} \sum_{m=-b}^{b} x(i + l,\, j + m)\, w(l, m) + b$$

u(1,1) = 4×1 + 5×4 + 8×1 + 1×1 + 8×4 + 8×3 + 3×3 + 6×3 + 6×1 = 122
u(1,2) = 5×1 + 8×4 + 7×1 + 8×1 + 8×4 + 8×3 + 6×3 + 6×3 + 4×1 = 148
u(2,1) = 1×1 + 8×4 + 8×1 + 3×1 + 6×4 + 6×3 + 6×3 + 5×3 + 7×1 = 126
u(2,2) = 8×1 + 8×4 + 8×1 + 6×1 + 6×4 + 4×3 + 5×3 + 7×3 + 8×1 = 134
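A quick check of this result in PyTorch; F.conv2d slides the kernel and takes dot products exactly as in the formula above (i.e., cross-correlation, without flipping the kernel):

import torch
import torch.nn.functional as F

x = torch.tensor([[4., 5., 8., 7.],
                  [1., 8., 8., 8.],
                  [3., 6., 6., 4.],
                  [6., 5., 7., 8.]]).reshape(1, 1, 4, 4)
w = torch.tensor([[1., 4., 1.],
                  [1., 4., 3.],
                  [3., 3., 1.]]).reshape(1, 1, 3, 3)

print(F.conv2d(x, w))   # [[122., 148.], [126., 134.]]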
Convolution layer
Convolution as matrix multiplication
• The convolution operation essentially performs dot products between the filters and local regions of the input.
• Common practice is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply.

Convolution layer
The convolution operation can be equivalently re-expressed as a single matrix multiplication:

(Figure: convolution operation; rearrange the kernel; flattened input)


Convolution layer
Convolution as matrix multiplication

Using the same guiding example, consider the convolution of single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{w} \in \mathbb{R}^{3 \times 3}$, with $\mathbf{w} \star \mathbf{x} = \begin{bmatrix} 122 & 148 \\ 126 & 134 \end{bmatrix}$ as before.
Convolution layer
• The convolutional kernel $\mathbf{w}$ is rearranged as a sparse Toeplitz matrix, called the convolution matrix:

$$
\mathbf{W} =
\begin{bmatrix}
1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1
\end{bmatrix}
$$

• The input $\mathbf{x}$ is flattened row by row, from top to bottom:

$$v(\mathbf{x}) = \begin{bmatrix} 4 & 5 & 8 & 7 & 1 & 8 & 8 & 8 & 3 & 6 & 6 & 4 & 6 & 5 & 7 & 8 \end{bmatrix}^\top$$

Then

$$\mathbf{W} v(\mathbf{x}) = \begin{bmatrix} 122 & 148 & 126 & 134 \end{bmatrix}^\top$$

which we can reshape to a 2×2 matrix to obtain $\mathbf{w} \star \mathbf{x}$.
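A sketch verifying the matrix-multiplication view in NumPy, with W and v(x) copied from the slide:

import numpy as np

W = np.array([
    [1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1, 0, 0, 0, 0, 0],
    [0, 1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1, 0],
    [0, 0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 3, 0, 3, 3, 1],
])
x = np.array([[4, 5, 8, 7],
              [1, 8, 8, 8],
              [3, 6, 6, 4],
              [6, 5, 7, 8]])

v = x.reshape(-1)                  # flatten row by row
print((W @ v).reshape(2, 2))       # [[122 148] [126 134]]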


Convolution layer – visualization

(Figure: a 3-channel input; one filter => one activation map; example 3×5×5 filters, 32 filters in total. Figure copyright Andrej Karpathy.)


Convolution layer – spatial dimensions

Convolving (sliding) a 3×5×5 filter over all spatial locations of a 3×32×32 image produces a 1×28×28 activation/feature map. Why does the feature map have a size of 28×28?
Convolution layer – spatial dimensions

7×7 input (spatially), assume a 3×3 filter. The stride is the factor by which the output is subsampled, i.e., the distance between adjacent centers of the kernel.

conv with stride = 1
Convolution layer – spatial dimensions

7×7 input (spatially), assume a 3×3 filter.

conv with stride = 2
Convolution layer – spatial dimensions

N×N input (spatially), assume an F×F filter and stride S.

Output size = (N − F)/S + 1

e.g., N = 7, F = 3

stride 1 => (7 − 3)/1 + 1 = 5
stride 2 => (7 − 3)/2 + 1 = 3
stride 3 => (7 − 3)/3 + 1 = 2.33 :\ (does not fit)
Convolution layer – spatial dimensions

Back to the earlier question: why does the 1×28×28 feature map (from a 3×32×32 image and a 3×5×5 filter) have a size of 28×28?

Output size = (N − F)/S + 1

N = 32, F = 5

stride 1 => (32 − 5)/1 + 1 = 28


Convolution layer – spatial dimensions

The valid feature map is smaller than the input after convolution (e.g., 3×32×32 → 1×28×28 with a 3×5×5 filter).
Without zero-padding, the width of the representation shrinks by F − 1 at each layer.
To avoid shrinking the spatial extent of the network rapidly, small filters have to be used.
Convolution layer – zero padding

By zero-padding in each layer, we prevent the representation from shrinking with depth. In practice, we zero pad the border. In PyTorch, padding (when specified) adds zeros to the top, bottom, left, and right by default; this can be customized.

e.g., input 7×7, 3×3 filter applied with stride 1, padded with a 1-pixel border of zeros => what is the output shape?
Convolution layer – zero padding

e.g., input 7×7, 3×3 filter applied with stride 1, padded with a 1-pixel border => what is the output shape?

Recall that without padding, output size = (N − F)/S + 1.
With padding P, output size = (N − F + 2P)/S + 1.

e.g., N = 7, F = 3
stride 1 => (7 − 3 + 2(1))/1 + 1 = 7
Convolution layer – zero padding

7×7 output!

In general, it is common to see CONV layers with stride 1, filters of size F×F, and zero-padding of (F − 1)/2, which preserves the spatial size.

e.g., F = 3 => zero pad with 1
      F = 5 => zero pad with 2
      F = 7 => zero pad with 3
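A tiny helper implementing the output-size formula above (the function name is made up for illustration):

def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    size = (n - f + 2 * pad) / stride + 1
    if not size.is_integer():
        raise ValueError(f"filter/stride/padding do not fit: got {size}")
    return int(size)

print(conv_output_size(7, 3, stride=1))          # 5
print(conv_output_size(7, 3, stride=2))          # 3
print(conv_output_size(7, 3, stride=1, pad=1))   # 7
print(conv_output_size(32, 5, stride=1, pad=2))  # 32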
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Output volume size: ?
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Output volume size: ?

Output size = (N − F + 2P)/S + 1
(32 − 5 + 2(2))/1 + 1 = 32 spatially

The output volume size is 10×32×32


Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Number of parameters in this layer: ?
Example
Input volume: 3×32×32
Ten 3×5×5 filters with stride 1, pad 2

Number of parameters in this layer: ?

Each filter has 5×5×3 + 1 = 76 params (+1 for bias)
=> 76×10 = 760
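A quick way to confirm this count in PyTorch:

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)   # 760  (10 filters x (5*5*3 weights + 1 bias))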
Convolution layer – a numerical example

(Figure: input volume (+pad 1) of size C×I×I = 3×7×7; filters W0, W1 of size C×F×F = 3×3×3; output volume of size 2×3×3.)

• C = 3 channel inputs
• Two filters of size F×F = 3×3, applied with a stride of 2.
• Each element of the output activations (green) is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
Can we have a 1x1 filter?

Yes: a 1×1 CONV with 32 filters maps a 64×56×56 input to a 32×56×56 output (each filter has size 1×1×64, and performs a 64-dimensional dot product at every spatial location).

The channel dimension is reduced!
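A minimal PyTorch sketch of this 1×1 convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)              # 64 x 56 x 56 input
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)  # 1x1 CONV with 32 filters
print(conv1x1(x).shape)                     # torch.Size([1, 32, 56, 56])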
Receptive field
Sparse local connectivity - CNNs exploit spatially local correlations by enforcing local connectivity between
neurons of adjacent layers.

Note that the receptive fields of the neurons are limited because of local connectivity.

For convolution with kernel size 𝐾, each element in the output depends on a 𝐾×𝐾 receptive field in the input
Receptive field
• Receptive field can be briefly defined as the region in the input space that a
particular CNN’s feature is looking at

• The receptive field of the units in the deeper layers of a convolutional network
is larger than the receptive field of the units in the shallow layers

• This effect increases if the network includes strided convolutions or pooling

• Even though direct connections in a CNN are very sparse, units in the deeper
layers can be indirectly connected to all or most of the input image
Receptive field
The receptive field at layer k is the area, denoted R_k × R_k, of the input that each pixel of the k-th activation map can 'see'.

Calling F_j the filter size of layer j and S_i the stride of layer i, with the convention S_0 = 1, the receptive field at layer k can be computed with the formula:

$$R_k = 1 + \sum_{j=1}^{k} (F_j - 1) \prod_{i=0}^{j-1} S_i$$

Receptive field
Example

Filter sizes F_1 = F_2 = F_3 = 3, strides S_1 = S_2 = S_3 = 1

$$R_3 = 1 + \sum_{j=1}^{3} (F_j - 1) \prod_{i=0}^{j-1} S_i = 1 + (2 \times 1) + (2 \times 1) + (2 \times 1) = 7$$
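A small Python sketch of this formula (a hypothetical helper, not from the course materials):

def receptive_field(filter_sizes, strides):
    """R_k = 1 + sum_j (F_j - 1) * prod_{i<j} S_i, with S_0 = 1."""
    r, jump = 1, 1           # jump = product of strides of the previous layers
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r

print(receptive_field([3, 3, 3], [1, 1, 1]))   # 7, matching the example above
print(receptive_field([3, 3, 3], [2, 2, 2]))   # 1 + 2*1 + 2*2 + 2*4 = 15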
Convolution layer - summary

A convolution layer
• Accepts a volume of size D₁ × H₁ × W₁
• Requires four hyperparameters
  • Number of filters K
  • Their spatial extent F
  • The stride S
  • The amount of zero padding P
• Produces a volume of size D₂ × H₂ × W₂, where
  • W₂ = (W₁ − F + 2P)/S + 1
  • H₂ = (H₁ − F + 2P)/S + 1 (i.e., width and height are computed equally by symmetry)
  • D₂ = K
• With parameter sharing, it introduces F · F · D₁ weights per filter, for a total of F · F · D₁ · K weights and K biases
• In the output volume, the d-th depth slice (of size H₂ × W₂) is the result of a valid convolution of the d-th filter over the input volume with a stride of S, offset by the d-th bias
Convolutional layer in PyTorch

https://pytorch.org/docs/stable/nn.html
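A hedged sketch of how the summary hyperparameters map onto torch.nn.Conv2d:

import torch
import torch.nn as nn

# K filters of spatial extent F, stride S, zero padding P, over D1 input channels
D1, K, F, S, P = 3, 10, 5, 1, 2
conv = nn.Conv2d(in_channels=D1, out_channels=K, kernel_size=F, stride=S, padding=P)

x = torch.randn(1, D1, 32, 32)
print(conv(x).shape)   # torch.Size([1, 10, 32, 32]); W2 = (32 - 5 + 2*2)/1 + 1 = 32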
An example of convolutional network: LeNet 5

Pooling layer
Pooling layer

(Example: 64×224×224 → 64×112×112)

- Operates over each activation map independently
- Either ‘max’ or ‘average’ pooling is used at the pooling layer. That is, the convolved features are divided into disjoint regions and pooled by taking either the maximum or the average.
Pooling layer

MAX pooling

Single depth slice:

1 1 2 4
5 6 7 8     max pool with 2×2 filters        6 8
3 2 1 0     and stride 2               =>    3 4
1 2 3 4

Pooling is intended to subsample the convolution layer.
The default stride for pooling is equal to the filter width.
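A quick check of this example in PyTorch:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

print(F.max_pool2d(x, kernel_size=2, stride=2))  # [[6., 8.], [3., 4.]]
print(F.avg_pool2d(x, kernel_size=2, stride=2))  # [[3.25, 5.25], [2.00, 2.00]]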
Pooling layer

Consider pooling with non-overlapping windows $\{(l, m)\}_{l=-L/2,\, m=-M/2}^{L/2,\, M/2}$ of size $L \times M$.

The max pooling output is the maximum of the activations inside the pooling window. Pooling of a feature map $\mathbf{y}$ at $p = (i, j)$ produces the pooled feature

$$z(i, j) = \max_{l, m} \{ y(i + l, j + m) \}$$

The mean pooling output is the mean of the activations in the pooling window:

$$z(i, j) = \frac{1}{L \times M} \sum_{l} \sum_{m} y(i + l, j + m)$$
Pooling layer
Why pooling?

A function 𝑓 is invariant to g if 𝑓(𝑔(𝒙)) = 𝑓(𝒙).

• Pooling layers can be used for building inner activations that are (slightly) invariant to small
translations of the input.
• Invariance to local translation is helpful if we care more about the presence of a pattern than about its exact position.
Pooling layer - summary
A pooling layer
• Accepts a volume of size D₁ × H₁ × W₁
• Requires two hyperparameters
  • Their spatial extent F
  • The stride S
• Produces a volume of size D₂ × H₂ × W₂, where
  • W₂ = (W₁ − F)/S + 1
  • H₂ = (H₁ − F)/S + 1
  • D₂ = D₁
• Introduces zero parameters since it computes a fixed function of the input
• Note that it is not common to use zero-padding for pooling layers
Pooling layer in PyTorch
An example of convolutional network: LeNet 5

Fully connected layer


Fully connected layer
The fully connected layer (FC) operates on a flattened input where each input is
connected to all neurons.
If present, FC layers are usually found towards the end of CNN architectures and
can be used to optimize objectives such as class scores.
Fully connected layer

A 32×32×3 image is stretched to a 1×3072 input vector; the weight matrix W maps it to a 1×10 output activation.

Each output is 1 number: the result of taking a dot product between a column of W and the input (a 3072-dimensional dot product). Each neuron looks at the full input volume.


Fully connected layer in PyTorch
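A minimal sketch of the fully connected layer in PyTorch, matching the 3072 → 10 example above:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)        # 32x32x3 image (batch of 1)
flatten = nn.Flatten()               # stretch to 1 x 3072
fc = nn.Linear(3 * 32 * 32, 10)      # fully connected layer: 3072 -> 10

print(fc(flatten(x)).shape)          # torch.Size([1, 10])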
Activation function
Recall that the output of the neuron at (i, j) of the convolution layer is

y(i, j) = f(u(i, j))

where u is the synaptic input and f is an activation function (its role is to introduce non-linearity into the network).
Activation function
Activation function – Sigmoid

σ(x) = 1/(1 + e^(−x))

• The sigmoid non-linearity squashes real numbers to the range [0, 1]
• Large negative numbers become 0 and large positive numbers become 1
• Historically used frequently, since it has a nice interpretation as the firing rate of a neuron
Activation function – Sigmoid
• The sigmoid non-linearity has fallen out of favor and is rarely used
• Weaknesses:
  • Sigmoids saturate and kill gradients
    • A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero
    • During backpropagation, this (local) gradient is multiplied by the gradient of the gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and, recursively, to its data. This makes learning very slow.
Activation function – Sigmoid
• Weaknesses:
  • Sigmoid outputs are not zero-centered
    • This is undesirable, since neurons in later layers of processing would receive data that is not zero-centered. This has implications for the dynamics during gradient descent (zig-zagging).
Activation function – Tanh

tanh(x) = 2σ(2x) − 1 (simply a scaled sigmoid neuron)

• The tanh non-linearity squashes real numbers to the range [−1, 1]
• Its activations saturate, but unlike the sigmoid neuron its output is zero-centered
• Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity
Activation function – ReLU
• The Rectified Linear Unit (ReLU) activation function is zero when x < 0 and linear with slope 1 when x > 0:

f(x) = max(0, x)
Activation function – ReLU
• (+) It was found to greatly accelerate the convergence of stochastic gradient descent (e.g., by a factor of 6 in Krizhevsky et al.) compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
• (+) Compared to tanh/sigmoid neurons, which involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
Activation function – ReLU
• (-) Unfortunately, ReLU units can be fragile during training and can "die".
• For example, a large gradient flowing through a ReLU neuron could cause the
weights to update in such a way that the neuron will never activate on any data
point again (fall into the negative regime).
• If this happens, then the gradient flowing through the unit will forever be zero
from that point on.
• That is, the ReLU units can irreversibly die during training since they can get knocked off
the data manifold.
Activation function – Leaky ReLU
• Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the
function being zero when x < 0, a leaky ReLU will instead have a small negative
slope (of 0.01, or so). That is, the function computes
𝑓(𝑥) = 𝟙(𝑥 < 0)(𝛼𝑥) + 𝟙(𝑥 >= 0)(𝑥)
where 𝛼 is a small constant.
• Some people report success with this form of activation function, but the
results are not always consistent.
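A short PyTorch sketch applying these activation functions:

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(torch.sigmoid(x))                       # squashes to (0, 1)
print(torch.tanh(x))                          # squashes to (-1, 1), zero-centered
print(F.relu(x))                              # max(0, x)
print(F.leaky_relu(x, negative_slope=0.01))   # small slope alpha = 0.01 for x < 0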
Visualization

https://poloclub.github.io/cnn-explainer/
Outline

• Basic components in CNN
• CNN architectures
• Short introduction to Colab
CNN Architectures

Deep networks for ImageNet

(Timeline 2012–2018: AlexNet 8 layers → VGG 19 layers → GoogLeNet 22 layers → ResNet 152 layers (MSRA). Architectures became deeper, with more intricate and different connectivity structures.)


Performance of previous years on ImageNet (Top-5 Error)

shallow @ 2010: 28.0%
shallow @ 2011: 26.0%
AlexNet (8 layers) @ 2012: 15.3%
ZFNet (8 layers) @ 2013: 11.7%
GoogLeNet (22 layers) @ 2014: 6.7%
ResNet (152 layers) @ 2015: 3.6%

Depth is the key to high classification accuracy.


Deep architectures - AlexNet

• The split (i.e., two pathways) in the image above is the split between two GPUs.
• Trained for about a week on two NVIDIA GTX 580 3GB GPUs
• 60 million parameters
• Input layer: size 227×227×3
• 8 layers deep: 5 convolution and pooling layers and 3 fully connected layers


Deep architectures - AlexNet

(Figure: the 96 kernels learned by the first convolution layer; 48 kernels were learned by each GPU.)


Deep architectures - AlexNet

• Escape from a few layers
• ReLU non-linearity to mitigate the vanishing-gradient problem
• Data augmentation
• Dropout
• Outperformed all previous models on ILSVRC by 10%
Deep architectures - AlexNet

Layer           Weights
L1 (Conv)       34,944
L2 (Conv)       614,656
L3 (Conv)       885,120
L4 (Conv)       1,327,488
L5 (Conv)       884,992
L6 (FC)         37,752,832
L7 (FC)         16,781,312
L8 (FC)         4,097,000
Conv Subtotal   3,747,200
FC Subtotal     58,631,144
Total           62,378,344

First convolution layer: 96 kernels of size 11x11x3, applied with a stride of 4 pixels.
Number of parameters = (11x11x3 + 1)*96 = 34,944

Note: There are no parameters associated with a pooling layer. The pool size, stride, and padding are hyperparameters.
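A quick check of the first-layer count with PyTorch (a sketch; only this one layer is built):

import torch.nn as nn

# AlexNet's first convolution layer: 96 kernels of size 11x11x3, stride 4
conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
print(sum(p.numel() for p in conv1.parameters()))   # 34944 = (11*11*3 + 1) * 96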


Deep architectures - AlexNet

Second convolution layer: 256 kernels of size 5x5x96
Number of parameters = (5x5x96 + 1)*256 = 614,656


Deep architectures - AlexNet

First FC layer (L6):
Number of neurons = 4096
Number of kernels in the previous Conv layer = 256
Size (width) of the output image of the previous Conv layer = 6
Number of parameters = (6x6x256*4096) + 4096 = 37,752,832


Deep architectures - AlexNet

The last FC layer (L8):
Number of neurons = 1000
Number of neurons in the previous FC layer = 4096
Number of parameters = (1000*4096) + 1000 = 4,097,000


Deep architectures - GoogLeNet
• An important lesson - go deeper
• Inception structures (v2, v3, v4)
• Reduce parameters (4M vs 60M in AlexNet)
  • The 1x1 convolutions are performed to reduce the dimensions of the input/output
• Batch normalization
  • Normalizes the activations for each training mini-batch
  • Allows us to use much higher learning rates and be less careful about initialization


Deep architectures - VGG
• An important lesson - go deeper
• 140M parameters
• Now commonly used for computing perceptual loss



Deep architectures - ResNet
• An important lesson - go deeper
• Escape from 100 layers
• Residual learning: the identity shortcut drives the new layer to learn something different
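A minimal sketch of a residual block in PyTorch (simplified: batch normalization and the full ResNet layout are omitted):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))   # residual branch F(x)
        return self.relu(out + x)                    # add the identity shortcut

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])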


The importance of skip connections
• When networks become sufficiently
deep, neural loss landscapes quickly
transition from being nearly convex
to being highly chaotic.
• This transition from convex to
chaotic behavior coincides with a
dramatic increase in generalization
error, and ultimately to a lack of
trainability.
• Skip connections promote flat
minimizers and prevent the
transition to chaotic behavior,
which helps explain why skip
connections are necessary for
training extremely deep networks.

Visualizing the Loss Landscape of Neural Nets, NeurIPS 2018


Deep architectures - Inception-ResNet
Deep architectures - DenseNet
• Dense Block
  • Each layer is directly connected to every other layer in a feed-forward fashion
  • The feature maps of all preceding layers are treated as separate inputs, and its own feature maps are passed on as inputs to all subsequent layers
  • Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision
  • Deep supervision encourages the intermediate layers to learn discriminative features

G. Huang et al., Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
Deep architectures - DenseNet

In a standard ConvNet, the input image goes through multiple convolutions to obtain high-level features.

In ResNet, identity mapping is proposed to promote gradient propagation. Element-wise addition is used.

In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Concatenation is used. Each layer receives a “collective knowledge” from all preceding layers.

G. Huang et al., Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
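A minimal sketch of the DenseNet-style concatenation in PyTorch (simplified: the actual DenseNet layers use BN-ReLU-Conv compositions and transition layers):

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """A minimal dense layer: concatenate the input with its own feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        new_features = self.relu(self.conv(x))
        return torch.cat([x, new_features], dim=1)   # concatenation, not addition

x = torch.randn(1, 16, 32, 32)
block = nn.Sequential(DenseLayer(16, 12), DenseLayer(28, 12), DenseLayer(40, 12))
print(block(x).shape)   # torch.Size([1, 52, 32, 32])  (16 + 3*12 channels)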
The research is ongoing

Finding the optimal neural network architecture remains an active area of research.
Outline

• Basic components in CNN
• CNN architectures
• Short introduction to Colab
Short Introduction to Colab
Get started

https://colab.research.google.com/
Mounting Google Drive locally

https://colab.research.google.com/notebooks/io.ipynb
An example (padding)

Install PyTorch and Torchvision

!pip install -U torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Import packages
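For instance, a hedged padding example one might run in Colab (illustrative values; not the exact notebook from the slide):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

conv_nopad = nn.Conv2d(3, 6, kernel_size=5, padding=0)
conv_pad = nn.Conv2d(3, 6, kernel_size=5, padding=2)

print(conv_nopad(x).shape)   # torch.Size([1, 6, 28, 28])
print(conv_pad(x).shape)     # torch.Size([1, 6, 32, 32])  -- padding preserves the size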

Try more examples on

https://pytorch.org/tutorials/
