Convolutional Networks

Introduction
• Convolutional networks, also known as convolutional neural
networks (or) CNNs, are a specialized kind of neural network
for processing data that has a known, grid-like topology.
– Examples include time-series data, which can be thought
of as a 1D grid taking samples at regular time intervals, and
image data, which can be thought of as a 2D grid of pixels.
• Convolutional networks have been tremendously successful in
practical applications. The name “convolutional neural
network” indicates that the network employs a mathematical
operation called convolution. Convolution is a specialized kind
of linear operation.
• Convolutional networks are simply neural networks that use
convolution in place of general matrix multiplication in at
least one of their layers.
Neural net matrix multiplication
• Each layer produces values that are obtained from the previous layer by performing a matrix multiplication
• In an unaugmented network:
  • Hidden layer produces values z = h(W(1)T x + b(1))
  • Output layer produces values y = σ(W(2)T z + b(2))
  • Note: W(1) and W(2) are matrices rather than vectors
• Example with D = 3, x = [x1, x2, x3]T, and M = 3 hidden units
  • We have two weight matrices W(1) and W(2), whose columns are
  • W(1) columns: W1(1) = [W11(1), W12(1), W13(1)]T, W2(1) = [W21(1), W22(1), W23(1)]T, W3(1) = [W31(1), W32(1), W33(1)]T
  • W(2) columns: W1(2) = [W11(2), W12(2), W13(2)]T, W2(2) = [W21(2), W22(2), W23(2)]T, W3(2) = [W31(2), W32(2), W33(2)]T
• Each network layer output can be written in matrix multiplication notation
Matrix multiplication for 2D Convolution
• Far fewer weights are needed than for full matrix multiplication
Convolutional Layer for Image Recognition
• A CNN is a neural network with a convolutional layer
• CNNs are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers
• Convolution can be viewed as multiplication by a matrix
Runtime of Traditional vs Convolutional Networks
Deep Learning Srihari
• Traditional neural network layers use multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit: s = g(WTx)
• With m inputs and n outputs, matrix multiplication requires m × n parameters and O(m × n) runtime per example
• This means every output unit interacts with every input unit
• Convolutional network layers have sparse interactions
• If we limit the number of connections for each output to k, we need k × n parameters and O(k × n) runtime
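The parameter-count comparison above can be sketched directly. This is a minimal illustration with assumed sizes (m = n = 1000 units, k = 3 connections per output); the names are illustrative, not from any library:

```python
# Sketch: parameter counts for a fully connected layer vs a sparsely
# connected (convolution-style) layer, using illustrative sizes.
m, n, k = 1000, 1000, 3   # inputs, outputs, connections per output unit

dense_params = m * n      # every output interacts with every input
sparse_params = k * n     # each output sees only k inputs

print(dense_params)       # 1000000
print(sparse_params)      # 3000
```

With k much smaller than m, the sparse layer uses orders of magnitude fewer parameters, matching the O(k × n) runtime claim above.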
The Convolution Operation
What is convolution?
• Convolution is an operation on two functions of a real-
valued argument
• Examples of the two functions in One Dimension
• Tracking location of a spaceship by a laser sensor
• A laser sensor provides a single output x(t), the position of
spaceship at time t
• w a function of a real-valued argument
• If laser sensor is noisy, we want a weighted average that
gives more weight to recent observations
• Weighting function is w(a) where a is age of measurement
• Convolution is the weighted average or smoothed estimate of the position of the spaceship
• A new function s(t) = ∫ x(a) w(t − a) da
Definition of convolution of input and kernel
• Convolution produces a new function s, the weighted average of x:
  s(t) = ∫ x(a) w(t − a) da
• This operation is typically denoted with an asterisk: s(t) = (x ∗ w)(t)
• w needs to be a valid pdf, or the output is not a weighted average
• w needs to be 0 for negative arguments, or we will look into the future
• In convolutional network terminology the first function x is referred to as the input, and the second function w is referred to as the kernel
• The output s is referred to as the feature map
Performing 1-D Convolution
• One-dimensional continuous case
• Input f(t) is convolved with a kernel g(t):
  (f ∗ g)(t) ≡ ∫ from −∞ to +∞ of f(τ) g(t − τ) dτ
• Note that (f ∗ g)(t) = (g ∗ f)(t)
1. Express each function in terms of a dummy variable τ
2. Reflect one of the functions: g(τ) → g(−τ)
3. Add a time offset t, which allows g(t − τ) to slide along the τ axis
4. Start t at −∞ and slide it all the way to +∞
• Wherever the two functions intersect, find the integral of their product
Convolution with Discrete Variables
• Laser sensor may only provide data at regular intervals
• Time index t can take on only integer values
• x and w are defined only on integer t:
  s(t) = (x ∗ w)(t) = Σ from a = −∞ to ∞ of x(a) w(t − a)
• In ML applications, the input is a multidimensional array of data and the kernel is a multidimensional array of parameters that are adapted by the learning algorithm
• These arrays are referred to as tensors
• Input and kernel are explicitly stored separately
• The functions are zero everywhere except at these points
Convolution in discrete case
• Here we have discrete functions f and g:
  (f ∗ g)[t] = Σ from τ = −∞ to ∞ of f[τ] · g[t − τ]
Computation of 1-D discrete convolution
• Parameters of convolution: kernel size (F), padding (P), stride (S)
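The discrete formula above can be implemented directly for finite sequences (zero outside their support). A minimal sketch, with an assumed smoothing kernel as in the spaceship example:

```python
# A minimal sketch of 1-D discrete convolution,
#   s[t] = sum over a of x[a] * w[t - a],
# for finite sequences treated as zero outside their support.
def conv1d(x, w):
    n = len(x) + len(w) - 1          # full output length
    s = [0.0] * n
    for t in range(n):
        for a in range(len(x)):
            if 0 <= t - a < len(w):  # w is zero outside its support
                s[t] += x[a] * w[t - a]
    return s

x = [1.0, 2.0, 3.0]      # input signal
w = [0.5, 0.5]           # simple smoothing kernel
print(conv1d(x, w))      # [0.5, 1.5, 2.5, 1.5]
```

The result agrees with library routines such as numpy.convolve, which implement the same sum.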
Two-dimensional convolution
• Convolutions over more than one axis
• If we use a 2D image I as input and use a 2D kernel K we have
  S(i, j) = (I ∗ K)(i, j) = Σ over m Σ over n of I(m, n) K(i − m, j − n)
• Figure: a sharply peaked kernel K for edge detection; kernels K1–K4 for line detection
Commutativity of Convolution
• Convolution is commutative. We can equivalently write:
  S(i, j) = (K ∗ I)(i, j) = Σ over m Σ over n of I(i − m, j − n) K(m, n)
• This formula is easier to implement in an ML library since there is less variation in the range of valid values of m and n
• Commutativity arises because we have flipped the kernel relative to the input
• As m increases, the index into the input increases, but the index into the kernel decreases
Cross-Correlation
• Same as convolution, but without flipping the kernel:
  S(i, j) = (K ∗ I)(i, j) = Σ over m Σ over n of I(i + m, j + n) K(m, n)
• Both operations are often referred to as convolution, whether or not the kernel is flipped
• In ML, the learning algorithm will learn appropriate values of the kernel in the appropriate place, so the convention does not matter
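The flip is the only difference between the two operations: cross-correlating with a flipped kernel gives true convolution. A small sketch with a hypothetical helper (the function name and sizes are illustrative):

```python
import numpy as np

# Sketch: cross-correlation slides the kernel without flipping it.
# Flipping the kernel on both axes before cross-correlating yields
# true convolution, so the two differ for asymmetric kernels.
def cross_correlate2d(I, K):
    H, W = I.shape
    h, w = K.shape
    out = np.zeros((H - h + 1, W - w + 1))   # "valid" region only
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+h, j:j+w] * K)
    return out

I = np.arange(16.0).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])       # asymmetric kernel

conv = cross_correlate2d(I, K[::-1, ::-1])   # flip both axes -> convolution
xcorr = cross_correlate2d(I, K)
print(np.allclose(conv, xcorr))              # False: the results differ
```

For a symmetric kernel the two coincide, which is one reason the terminology is used loosely.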
Example of 2D convolution
• Convolution without
kernel flipping applied
to a 2D tensor
• Output is restricted to
case where kernel is
situated entirely within
the image
• Arrows show how upper-
left of input tensor is
used to form upper-left
of output tensor
Discrete Convolution Viewed as Matrix multiplication
• Convolution can be viewed as multiplication by a matrix
• However the matrix has several entries constrained to
be zero
• Or constrained to be equal to other elements
• For univariate discrete convolution: Univariate Toeplitz matrix:
• Rows are shifted versions of previous row

• 2D case: doubly block circulant


matrix
• It corresponds to convolution
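The Toeplitz view above can be made concrete: each row of the matrix is the kernel shifted one position to the right, and multiplying by this matrix reproduces sliding the kernel over the input. A small sketch with illustrative sizes (kernel applied without flipping, as is common in ML):

```python
import numpy as np

# Sketch: univariate discrete convolution as multiplication by a sparse,
# Toeplitz-structured matrix whose rows are shifted copies of the kernel.
x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, -1.0])            # width-2 difference kernel

n_out = len(x) - len(k) + 1          # "valid" convolution, no padding
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i+len(k)] = k             # row i: kernel shifted right by i

print(T @ x)                          # same result as sliding k over x
```

Most entries of T are constrained to zero, and the nonzero entries are constrained to equal the shared kernel values, exactly as stated above.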
Motivation
Motivation for using convolution networks

1. Convolution leverages three important ideas


to improve ML systems:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
2. Convolution also allows for working with inputs
of variable size
Sparse connectivity due to Image Convolution
• Input image may have millions of pixels,
• But we can detect edges with kernels of only hundreds of pixels
• If we limit the number of connections for each output to k
  • we need k × n parameters and O(k × n) runtime
• It is possible to get good performance with k << n
• Convolutional networks have sparse interactions
  • Accomplished by making the kernel smaller than the input
• Next slide shows a graphical depiction
Neural network for 1-D convolution
• A kernel g(t) is applied to the input f[t]
• Equations for the outputs of this network can be written out for y1, y2, etc., up to y8
• We can also write the equations in terms of elements of a general 8 × 8 weight matrix W, where the nonzero entries of W are copies of the kernel values and all other entries are zero
Sparse Connectivity, viewed from below
• Highlight one input x3 and the output units s affected by it
• Top: when s is formed by convolution with a kernel of width 3, only three outputs are affected by x3
• Bottom: when s is formed by matrix multiplication, connectivity is no longer sparse
  • So all outputs are affected by x3
Sparse Connectivity, viewed from above
• Highlight one output s3 and the inputs x that affect this unit
• These units are known as the receptive field of s3
• Top: when s3 is formed by convolution with a kernel of width 3
• Bottom: when s3 is formed by matrix multiplication
Keeping up performance with reduced connections
• It is possible to obtain good performance while keeping k several orders of magnitude lower than m
• In a deep neural network, units in deeper layers may indirectly interact with a larger portion of the input
• The receptive field of units in deeper layers is larger than the receptive field of units in shallow layers
• This allows the network to efficiently describe complicated interactions between many variables from simple building blocks that each describe only sparse interactions
Convolution with a stride
• Receptive Field in
Deeper layers is larger
than the receptive
field of units in
shallow layers
• This effect increases if
the network includes
architectural features
like strided
convolution or pooling
Parameter Sharing
• Parameter sharing refers to using the same parameter
for more than one function in a model
• In a traditional neural net each element of the weight
matrix is used exactly once when computing the
output of a layer
• It is multiplied by one element of the input and never revisited
• Parameter sharing is synonymous with tied weights
• Value of the weight applied to one input is tied to a weight
applied elsewhere
• In a Convolutional net, each member of the kernel is
used in every position of the input (except at the
boundary– subject to design decisions)
Efficiency of Parameter Sharing
• Parameter sharing by convolution operation
means that rather than learning a separate set of
parameters for every location, we learn only one
set
• This does not affect runtime of forward
propagation– which is still O(k ✕ n)
• But it further reduces the storage requirements to k parameters
  • k is orders of magnitude smaller than m
  • Since m and n are roughly the same size, k is practically insignificant compared to m × n
How parameter sharing works
• Black arrows: connections that use a particular parameter
1. Convolutional model: black arrows indicate uses of the central element of a 3-element kernel
2. Fully connected model: a single black arrow indicates use of the central element of the weight matrix
  • This model has no parameter sharing, so the parameter is used only once
• How sparse connectivity and parameter sharing can dramatically improve the efficiency of image edge detection is shown on the next slide
Efficiency of Convolution for Edge Detection
• Image on right formed by taking each pixel of the input image and subtracting the value of its neighboring pixel on the left
• This is a measure of all the vertically oriented edges in the input image, which is useful for object detection
• Both images are 280 pixels tall; the input image is 320 pixels wide and the output image is 319 pixels wide
• The transformation can be described by a convolution kernel containing two elements and requires 319×280×3 = 267,960 flops (two multiplies and one add per output pixel)
• The same transformation by matrix multiplication would require 320×280×319×280, i.e., over 8 billion, entries in the matrix
• Convolution thus needs roughly 4 billion times fewer parameters to represent the same transformation
Equivariance of Convolution to Translation
• The particular form of parameter sharing leads to equivariance to translation
• Equivariant means that if the input changes, the output changes in the same way
• A function f(x) is equivariant to a function g if f(g(x)) = g(f(x))
• If g is a function that translates the input, i.e., shifts it, then the convolution function is equivariant to g
• Let I(x, y) be image brightness at point (x, y), and I′ = g(I) the image with I′(x, y) = I(x − 1, y), i.e., every pixel of I shifted one unit to the right
• If we apply g to I and then apply convolution, the output will be the same as if we applied convolution to I and then applied the transformation g to the output
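The identity conv(g(x)) = g(conv(x)) can be checked numerically. This sketch uses a 1-D circular convolution and a cyclic shift so the equality holds exactly at the edges (with ordinary "valid" convolution, boundary pixels would need separate handling); the helper names and values are illustrative:

```python
import numpy as np

# Sketch of translation equivariance: shifting then convolving gives
# the same result as convolving then shifting. Circular convolution is
# used so the identity is exact, with no boundary effects.
def circ_conv(x, w):
    n = len(x)
    return np.array([sum(x[(t - a) % n] * w[a] for a in range(len(w)))
                     for t in range(n)])

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0, 7.0])   # toy 1-D "image"
w = np.array([0.25, 0.5, 0.25])                # smoothing kernel

shift_then_conv = circ_conv(np.roll(x, 2), w)  # g then conv
conv_then_shift = np.roll(circ_conv(x, w), 2)  # conv then g
print(np.allclose(shift_then_conv, conv_then_shift))   # True
```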
Example of equivariance
• With 2D images convolution creates a map where
certain features appear in the input
• If we move the object in the input, the
representation will move the same amount in the
output
• It is useful to detect edges in first layer of
convolutional network
• Same edges appear everywhere in image, so it is
practical to share parameters across entire image
Absence of equivariance
• In some cases, we may not wish to share
parameters across entire image
• If image is cropped to be centered on a face,
we may want different features from different parts
of the face
• Part of the network processing the top of the
face looks for eyebrows
• Part of the network processing the bottom of the face
looks for the chin
• Certain image operations such as scale and
rotation are not equivariant to convolution
• Other mechanisms are needed for such transformations
Pooling
What is Pooling?
• Pooling in a CNN is a subsampling step
• It replaces the output at a location with a summary statistic of nearby outputs
• E.g., max pooling reports the maximum output within a rectangular neighborhood
The pooling stage in a
CNN
• Typical layer of a CNN
consists of three stages
• Stage 1:
• perform several convolutions
in parallel to produce a set
of linear activations
• Stage 2 (Detector):
• each linear activation is run
through a nonlinear activation
function such as ReLU
• Stage 3 (Pooling):
• Use a pooling function to
modify output of the layer
further
Typical subsampling in a deep
network

• Input image is filtered by 4 5×5 convolutional kernels


which create 4 feature maps,
• Feature maps are subsampled by max pooling.
• The next layer applies ten 5×5 convolutional kernels to
these subsampled images and again we pool the feature
maps.
• The final layer is a fully connected layer where all
generated features are combined and used in the
classifier (essentially logistic regression).
Two terminologies for a typical CNN layer
1. The net is viewed as a small number of complex layers, each layer having many stages
  • 1-1 mapping between kernel tensors and network layers
2. The net is viewed as a larger number of simple layers
  • Every processing step is a layer in its own right
  • Not every layer has parameters
Why is Pooling performed?
• Pooling is performed for two reasons:
  1. Dimensionality reduction
  2. Invariance to transformations of rotation and translation
• Figure: some types of linear transformations
Types of Pooling functions
• A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby inputs
• Popular pooling functions:
  1. Max pooling reports the maximum output within a rectangular neighborhood
  2. Average of a rectangular neighborhood
  3. L2 norm of a rectangular neighborhood
  4. Weighted average based on the distance from the central pixel
Pooling causes translation invariance
• In all cases, pooling helps make the representation approximately invariant to small translations of the input
• If we translate the input by a small amount, the values of most of the outputs do not change (example next slide)
• Pooling can be viewed as adding a strong prior that the function the layer learns must be invariant to small translations
Max pooling introduces invariance to translation
• View of the middle of the output of a convolutional layer: outputs of the nonlinearity feed into outputs of max pooling
• Same network after the input has been shifted by one pixel:
• Every input value has changed, but only half the values of the output have changed, because max-pooling units are only sensitive to the maximum value in the neighborhood, not its exact location
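The effect is easy to reproduce in one dimension. In this sketch (illustrative values, circular shift for simplicity), every input value changes after the shift, yet every pooled output stays the same, because each window still contains the same maximum:

```python
import numpy as np

# Sketch: max pooling over a width-3 neighborhood is insensitive to a
# one-pixel shift of the input, since only the window maximum matters.
def maxpool1d(x, width=3):
    # one pooled output per position where the window fits
    return np.array([x[i:i+width].max() for i in range(len(x) - width + 1)])

x = np.array([0.1, 1.0, 0.2, 0.1, 1.0])
shifted = np.roll(x, 1)        # shift input by one pixel (circularly);
                               # every input value changes

print(maxpool1d(x))            # [1. 1. 1.]
print(maxpool1d(shifted))      # [1. 1. 1.] - pooled outputs unchanged
```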
Importance of Translation Invariance

• Invariance to translation is important if we care about


whether a feature is present rather than exactly where
it is
• For detecting a face we just need to know that an eye is
present in a region, not its exact location
• In other contexts it is more important to preserve
location of a feature
• E.g., to determine a corner we need to know whether two
edges are present and test whether they meet
Learning other invariances
• Pooling over spatial regions produces invariance to translation
• But if we pool over the results of separately parameterized convolutions, the features can learn which transformations to become invariant to
Learning Invariance to rotation
• A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input
• Figure: an input tilted left gets a large response from the unit tuned to left-tilted images; an input tilted right gets a large response from the unit tuned to right-tilted images
Using fewer pooling units than detector units
• Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units
• By reporting summary statistics for pooling regions spaced k pixels apart rather than one pixel apart
• This improves computational efficiency because the next layer has roughly k times fewer inputs to process
• An example is given next
Pooling with down-sampling
• Max pooling with a pool width of three and a stride between pools of two
• This reduces the representation size by a factor of two
  • Which reduces the computational burden of the next layer
• The rightmost pooling region has a smaller size, but must be included if we don’t want to ignore some of the detector units
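The pool-width-3, stride-2 scheme above, including the smaller rightmost region, can be sketched as follows (illustrative input values):

```python
# Sketch: max pooling with pool width 3 and stride 2. The final window
# may be narrower than 3 so that no detector unit is ignored.
def maxpool_strided(x, width=3, stride=2):
    out = []
    i = 0
    while i < len(x):
        out.append(max(x[i:i+width]))   # last window may be narrower
        i += stride
    return out

x = [0.1, 1.0, 0.2, 0.1, 0.0, 0.3]
print(maxpool_strided(x))   # [1.0, 0.2, 0.3] - half as many outputs
```

Six detector outputs become three pooled outputs, halving the number of inputs the next layer must process.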
Subsampling as Average pooling
Theoretical Guidance on Pooling
• Theoretical work gives some guidance as to which kind of pooling one should use in different situations
• It is possible to dynamically pool features together
• By running a clustering algorithm on locations of
interesting features
• Yields a different set of pooling regions for each image
• Another approach: learn a single pooling structure
• Pooling can complicate architectures that use
top-down information
• E.g., Boltzmann machines and autoencoders
Examples of Architectures for Classification with CNNs
• A CNN that processes a fixed-size image
• A CNN that processes a variable-sized image
• A CNN that does not have any fully connected layer
• Real networks have branching structures; chain structures are shown for simplicity
A Convolutional Neural Network (ConvNet or CNN) is a
specialized deep learning architecture designed mainly for
image, video, and spatial data. Its architecture is inspired by
how the visual cortex processes information, focusing on local
patterns through convolutional operations.
1. Input Layer
•Takes in raw data (e.g., image: 32×32×3 for a color image).
•Preprocessing may include normalization or resizing.
2. Convolutional Layer(s)
Core building block of CNNs.
Applies filters/kernels that slide over the input to detect local patterns (edges, textures,
shapes).
Each filter produces a feature map.
Parameters: filter size (e.g., 3×3, 5×5), stride, padding.
3. Activation Function
Non-linear transformation applied after convolution.
ReLU (Rectified Linear Unit) is the default choice.
Introduces non-linearity, enabling learning of complex features.
4. Pooling (Subsampling) Layer
• Reduces spatial dimensions of feature maps.
• Common: Max Pooling (takes max value), Average Pooling.
• Helps with translation invariance and reduces computational load.

5. Stacking Layers
• Several convolution + activation + pooling layers are stacked to progressively extract
low-level → high-level features.
• Early layers detect edges/corners.
• Deeper layers detect shapes/objects.

6. Fully Connected (Dense) Layer(s)


• Flattens feature maps into a 1D vector.
• Dense layers perform high-level reasoning and classification.

7. Output Layer
• Final dense layer with softmax (for multi-class) or sigmoid (for binary classification).
• Produces probability distribution over classes.

8. Regularization Layers (Optional)


• Dropout: Randomly drops neurons during training to prevent overfitting.
• Batch Normalization: Normalizes activations to speed up training and improve
stability.
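The layer sequence described above (conv → ReLU → max pool → flatten → dense → softmax) can be sketched end-to-end in plain NumPy. This is illustrative only: random weights, a single 32×32×3 image, assumed filter counts, and no training; real systems would use an optimized framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):                     # x: (H, W, C), k: (h, w, C, F)
    H, W, _ = x.shape
    h, w, _, F = k.shape
    out = np.zeros((H - h + 1, W - w + 1, F))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of each local patch with every filter
            out[i, j] = np.tensordot(x[i:i+h, j:j+w, :], k, axes=3)
    return out

def maxpool2d(x, s=2):                # non-overlapping s x s pooling
    H, W, F = x.shape
    return x[:H//s*s, :W//s*s].reshape(H//s, s, W//s, s, F).max(axis=(1, 3))

x = rng.standard_normal((32, 32, 3))          # input image (32x32 RGB)
k = rng.standard_normal((3, 3, 3, 8)) * 0.1   # eight 3x3 filters (assumed)

z = np.maximum(conv2d(x, k), 0)               # conv + ReLU -> (30, 30, 8)
p = maxpool2d(z)                              # max pool    -> (15, 15, 8)
f = p.reshape(-1)                             # flatten     -> (1800,)
W = rng.standard_normal((f.size, 10)) * 0.01  # dense layer, 10 classes
logits = f @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over classes
print(z.shape, p.shape, probs.shape)
```

The shapes trace the progressive reduction described above, and the softmax output sums to one, forming a probability distribution over the 10 assumed classes.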
A ConvNet architecture
• INPUT: 32×32×3 holds the raw pixel values: an image of width 32, height 32, and 3 color channels (RGB)
• CONV layer will compute the output of neurons connected to local regions in the input, each computing a dot product between their weights and the small region they are connected to in the input volume. This may result in a volume such as 32×32×12 if we used 12 filters
• POOL layer will perform a down-sampling operation along the spatial dimensions (width, height), resulting in a volume such as 16×16×12
Activations of an example ConvNet architecture
• The initial volume stores raw image pixels (left) and the last volume stores class scores (right)
• Each volume of activations along the processing path is shown as a column
• Since it is difficult to visualize 3D volumes, each volume’s slices are laid out in rows
VGG Net
• VGG is a convolutional neural network model by K. Simonyan and A. Zisserman, University of Oxford: “Very Deep Convolutional Networks for Large-Scale Image Recognition”
• The model achieves 93% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes
VGG 16
Source: https://[Link]/~frossard/post/vgg16/
Convolution and Pooling as an
Infinitely Strong Prior
Topics in Infinitely Strong Prior
• Weak and Strong Priors
• Convolution as an infinitely strong
prior
• Pooling as an infinitely strong prior
• Under-fitting with convolution and
pooling
• Permutation invariance
Prior parameter distribution
• Role of a prior probability distribution
over the parameters of a model is:
• Encode our belief as to what models are
reasonable before seeing the data
Weak and Strong Priors
• A weak prior
• A distribution with high
entropy
• e.g., Gaussian with high variance
• Data can move parameters
freely
• A strong prior
• It has very low entropy
• E.g., a Gaussian with low
variance
• Such a prior plays a more
active role in determining
where the parameters end
up
Infinitely Strong Prior
• An infinitely strong prior places zero probability
on some parameters
• It says that some parameter values are forbidden
regardless of support from data
• With an infinitely strong prior, irrespective of the data the prior
cannot be changed
Convolutional Network

• Convolutional
networks are
simply neural
networks that use
convolution in
place of general
matrix
multiplication in
at least one of
their layers
Convolution as infinitely strong prior
• A convolutional net is similar to a fully connected net but with an infinitely strong prior over its weights
• It says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space
• The prior also says that the weights must be zero, except for in the small spatially contiguous receptive field assigned to that hidden unit
• Figure: convolution with a kernel of width 3; s3 is a hidden unit with 3 weights, which are the same as those of s4
• Convolution introduces an infinitely strong prior probability distribution over the parameters of a layer
• This prior says that the function the layer should learn contains only local interactions and is equivariant to translation
Pooling as an Infinitely strong prior
• The use of pooling is an infinitely
strong prior that each unit should be
invariant to small translations
• Maxpooling example:
Implementing as a prior
• Implementing a convolutional net as a
fully connected net with an infinitely
strong prior would be extremely
computationally wasteful
• But thinking of a convolutional net as a
fully connected net with an infinitely
strong prior can give us insights into
how convolutional nets work
Key Insight: Underfitting
• Convolution and pooling can cause under-fitting
• Under-fitting happens when a model has high bias
• Convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate
• Pooling may be inappropriate in some cases
  • If the task relies on preserving spatial information
  • Using pooling on all features can increase training error
• High bias/underfitting can be countered by:
  1. Adding hidden layers
  2. Adding hidden units per layer
  3. Decreasing the regularization parameter λ
When pooling may be inappropriate
• Some convolutional architectures are designed
to use pooling on some channels but not on
other channels
• In order to get highly invariant features and features
that will not under-fit when the translation
invariance prior is incorrect
• When a task involves incorporating information
from a distant location
• In which case, prior imposed by
convolution may be inappropriate
Comparing models with/without
convolution
• Convolutional models have spatial relationships
• In benchmarks of statistical learning
performance we should only compare
convolutional models to other convolutional
models – since they have
knowledge of spatial relationships hard-coded
• Models without convolution will be able to learn
even if we permuted all pixels in the image
• Permutation invariance: f (x1,x2,x3)=f
(x2,x1,x3)=f(x3,x1,x2)
• There are separate benchmarks for models
that are permutation invariant
Variants of the Basic Convolution
Function
Topics in
Variants of Convolution
Functions
• Neural net convolution is not same as mathematical
convolution
• How convolution in neural networks is different
• Multichannel convolution due to image color and
batches
• Convolution with a stride
• Locally connected layers (unshared convolution)
• Tiled convolution
• Implementation of a convolutional network
Neural Net Convolution is Different
• Convolution in the context of neural networks does not refer exactly to the standard convolution operation in mathematics
• The functions used differ slightly
• Here we describe the differences in detail and highlight their useful properties
Convolution Operation in Neural
Networks
1. It refers to an operation that consists of many
applications of convolution in parallel
• This is because convolution with a single kernel can only
extract one
kind of feature, albeit at many locations
• Usually we want to extract many kinds of features at many
locations
2. The input is usually not a grid of real values
  • Rather it is a grid of vector-valued observations
  • E.g., a color image has R, G, B values at each pixel
  • The input to the next layer is the output of the first layer, which has many different convolutions at each position

• When working with images, input and output are 3-D


tensors
Four indices with image
software
1. One index for Channel
2. Two indices for spatial coordinates of each
channel
3. Fourth index for different samples in a batch
• We omit the batch axis for simplicity of discussion
Multichannel Convolution
• Because we are dealing with multichannel convolution, the linear operations are not usually commutative, even if kernel flipping is used
• These multichannel operations are only commutative if each operation has the same number of output channels as input channels
Definition of 4-D kernel tensor
• Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between
  • a unit in channel i of the output and
  • a unit in channel j of the input,
  • with an offset of k rows and l columns between the output and input units
• Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k
• Assume our output consists of Z with the same format as V
• If Z is produced by convolving K across V without flipping K, then
  Z_{i,j,k} = Σ over l, m, n of V_{l, j+m−1, k+n−1} K_{i,l,m,n}
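The 4-D kernel formula above can be implemented directly. This is a slow, illustrative sketch (0-indexed, "valid" region only, assumed channel counts) rather than an optimized routine:

```python
import numpy as np

# Direct implementation of the multichannel, no-flip formula
#   Z[i,j,k] = sum over l, m, n of V[l, j+m, k+n] * K[i, l, m, n]
# (0-indexed), with i the output channel and l the input channel.
def multichannel_conv(V, K):
    C_in, H, W = V.shape
    C_out, _, h, w = K.shape
    Z = np.zeros((C_out, H - h + 1, W - w + 1))
    for i in range(C_out):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                # sum over input channels and kernel offsets at once
                Z[i, j, k] = np.sum(V[:, j:j+h, k:k+w] * K[i])
    return Z

V = np.random.default_rng(1).standard_normal((3, 5, 5))     # 3-channel input
K = np.random.default_rng(2).standard_normal((4, 3, 2, 2))  # 4 output channels
print(multichannel_conv(V, K).shape)   # (4, 4, 4)
```

Each output channel i pools information from all three input channels, which is why a single 2-D kernel is not enough once the input is vector-valued.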
Convolution with a stride: Definition
• We may want to skip over some positions of the kernel to reduce computational cost
  • At the cost of not extracting features as finely
• We can think of this as down-sampling the output of the full convolution function
• If we want to sample only every s pixels in each direction of the output, then we can define a down-sampled convolution function c such that
  Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ over l, m, n of [ V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n} ]
• We refer to s as the stride. It is possible to define a different stride for each direction
Convolution with a stride: Implementation
• Here we use a stride of 2
• Top: convolution with a stride of length two, implemented in a single operation
• Bottom: convolution with a stride greater than one pixel is mathematically equivalent to convolution with a unit stride followed by down-sampling
• The two-step approach is computationally wasteful, because it computes many values that are then discarded
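The equivalence between a strided convolution and a unit-stride convolution followed by down-sampling can be checked in one dimension. A sketch using the ML (no-flip) convention with illustrative values:

```python
import numpy as np

# Sketch: a stride-s convolution equals a unit-stride convolution
# followed by keeping every s-th output - but computes far fewer values.
def conv1d_valid(x, w, stride=1):
    width = len(w)
    return np.array([np.dot(x[i:i+width], w)
                     for i in range(0, len(x) - width + 1, stride)])

x = np.arange(10.0)
w = np.array([1.0, 0.0, -1.0])          # simple difference kernel

strided = conv1d_valid(x, w, stride=2)        # single strided operation
two_step = conv1d_valid(x, w, stride=1)[::2]  # full conv, then down-sample
print(np.array_equal(strided, two_step))      # True
```

The strided version evaluates only 4 dot products here, while the two-step version evaluates 8 and discards half of them.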
Effect of Zero-padding on Network Size
• Consider a convolutional net with a kernel of width 6 at every layer
• No pooling, so only the convolution shrinks the network size
• Without any implicit zero padding, the representation shrinks by five pixels at each layer
  • Starting from an input of 16 pixels, we are only able to have 3 convolutional layers, and the last layer does not ever move the kernel, so arguably only 2 of the layers are truly convolutional
• By adding five implicit zeroes to each layer, we prevent the representation from shrinking with depth
  • This allows us to make an arbitrarily deep convolutional network
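The size arithmetic in this example follows the standard output-width formula. A sketch, using total padding (frameworks often quote per-side padding P, giving 2P total):

```python
# Sketch: output width of a convolutional layer with input width W,
# kernel width F, total zero padding pad_total, and stride S:
#   output width = (W - F + pad_total) // S + 1
def out_size(W, F, pad_total=0, S=1):
    return (W - F + pad_total) // S + 1

# The slide's example: width-6 kernel, 16-pixel input, no padding:
print(out_size(16, 6))               # 11 - shrinks by 5 pixels per layer
print(out_size(11, 6))               # 6
print(out_size(6, 6))                # 1 - kernel can no longer move
# Adding 5 implicit zeroes per layer keeps the width constant:
print(out_size(16, 6, pad_total=5))  # 16
```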
Locally connected layer
• In some cases, we do not actually want to use convolution, but rather locally connected layers
• The adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified by a 6-D tensor W
• The indices into W are respectively:
  • i, the output channel,
  • j, the output row,
  • k, the output column,
  • l, the input channel,
  • m, the row offset within the input, and
  • n, the column offset within the input
• The linear part of a locally connected layer is then given by
  Z_{i,j,k} = Σ over l, m, n of [ V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n} ]
• This is also called unshared convolution
Local connections, convolution, full
connections
Use of locally connected
• layers
Locally connected layers are useful when
• we know that each feature should be a
function of a small part of space, but there is no
reason to think that the same feature should occur
across all of space

• Ex: if we want to tell if an image is a picture of


a face, we only need to look for the mouth
in the bottom half of the image
Constraining Outputs
• Constrain each output channel i to be a function of only a subset of the input channels l
  • Make the first m output channels connect to only the first n input channels,
  • the second m output channels connect to only the second n input channels, etc.
• Modeling interactions between few channels allows fewer parameters, which:
  • reduces memory, increases statistical efficiency, and reduces computation for forward/back-propagation
• It accomplishes these goals without reducing the number of hidden units
• Figure: a network with further restricted connectivity
Tiled Convolution
• A compromise between a convolutional layer and a locally connected layer
• Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space
• This means that immediately neighboring locations will have different filters, as in a locally connected layer,
  • but the memory requirements for storing the parameters increase only by a factor of the size of this set of kernels,
  • rather than by the size of the entire output feature map
Comparison of locally connected layers, tiled convolution and standard convolution
• A locally connected layer has no sharing at all: each connection has its own weight
• Tiled convolution has a set of t different kernels (here t = 2)
• Traditional convolution is equivalent to tiled convolution with t = 1: there is only one kernel and it is applied everywhere
Defining Tiled Convolution
Algebraically
• Let k be a 6-D tensor, where two of the dimensions
correspond to different locations in the output map.
• Rather than having a separate index for each
location in the output map, output locations cycle
through a set of t different choices of kernel stack in
each direction.
• If t is equal to the output width, this is the same as
a locally connected layer

• where % is the modulo operation, with t%t = 0, (t + 1)%t =


1, etc
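The cycling of kernels can be made concrete with a small 1-D sketch (a hypothetical illustration, not the book's notation): output position j uses kernel j % t, so with t=1 this reduces to ordinary shared-kernel convolution, and with t equal to the output width it becomes a locally connected layer.

```python
import numpy as np

# Hypothetical 1-D sketch of tiled convolution: a stack of t kernels is
# cycled through as we move across the input, instead of one shared kernel
# (t = 1) or a distinct kernel per position (locally connected layer).
rng = np.random.default_rng(0)
x = rng.random(10)
width, t = 3, 2
kernels = rng.random((t, width))       # a stack of t different kernels

n_out = len(x) - width + 1
z = np.zeros(n_out)
for j in range(n_out):
    k = kernels[j % t]                 # output position j uses kernel j % t
    z[j] = np.sum(x[j:j + width] * k)  # cross-correlation-style application
```

Positions j and j + t share a kernel, so memory grows only with the size of the kernel stack, not with the output width.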
Operations to implement convolutional
nets
• Besides convolution, other operations are
necessary to implement a convolutional
network.
• To perform learning, we need to compute the gradient with
respect to the kernel, given the gradient with respect to the
outputs.
• In some simple cases, this operation can be performed
using the convolution operation itself, but when the stride is
greater than 1, it does not have this property.
Implementation of
Convolution
• Convolution is a linear operation and can
thus be described as a matrix
multiplication
• if we first reshape the input tensor into a flat
vector
• Matrix involved is a function of the
convolution kernel
• Matrix is sparse and each element of the kernel is
copied to several elements of the matrix.
• This view helps us to derive some of the
other operations needed to implement a
convolutional network
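The matrix-multiplication view can be verified directly in a minimal sketch: for a 1-D "valid" convolution, each row of the matrix holds a shifted copy of the (flipped) kernel, so the matrix is sparse and every kernel element appears in several matrix entries.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])          # a small 1-D kernel of width 3

# Build the sparse banded matrix that implements "valid" convolution:
# row i holds the flipped kernel shifted to position i.
n_out = len(x) - len(k) + 1
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + len(k)] = k[::-1]     # np.convolve flips the kernel

y = W @ x                            # convolution as matrix multiplication
```

Each kernel element is copied into several entries of W, which is exactly the sparsity-plus-sharing structure described above.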
Structured Outputs
• Convolutional networks can be used to output a high-
dimensional, structured object, rather than just predicting a
class label for a classification task or a real value for a
regression task. Typically this object is just a tensor, emitted
by a standard convolutional layer.
– A good example is the task of image segmentation where
each pixel needs to be associated with an object class.
Here the output is the same size (spatially) as the input.
The model outputs a tensor S where Si,j,k is the probability
that pixel (j,k) belongs to class i.
• This allows the model to label every pixel in an image and
draw precise masks that follow the outlines of individual
objects.
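The output tensor S described above can be sketched as a per-pixel softmax over class channels; the logits here are random stand-ins for the activations of a final convolutional layer, and all names are illustrative.

```python
import numpy as np

# Hypothetical sketch: turn a final feature map into per-pixel class
# probabilities S[i, j, k] = P(pixel (j, k) belongs to class i).
rng = np.random.default_rng(0)
n_classes, H, W = 4, 6, 6
logits = rng.normal(size=(n_classes, H, W))   # stand-in for the last conv layer

# Softmax over the class axis, computed independently at every spatial position
e = np.exp(logits - logits.max(axis=0, keepdims=True))
S = e / e.sum(axis=0, keepdims=True)

labels = S.argmax(axis=0)   # one class label per pixel, same spatial size as input
```

The spatial dimensions of S match the feature map, which is why the output can be made the same size as the input when pooling is avoided or kept at unit stride.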
• One issue that often comes up is that the output plane can
be smaller than the input plane.
• In the kinds of architectures typically used for classification
of a single object in an image, the greatest reduction in the
spatial dimensions of the network comes from using pooling
layers with large stride.
• In order to produce an output map of similar size as the
input, one can avoid pooling altogether.
• Another strategy is to simply emit a lower-resolution grid of
labels. Finally, in principle, one could use a pooling operator
with unit stride.
• One strategy for pixel-wise labeling of images is to produce
an initial guess of the image labels, then refine this initial
guess using the interactions between neighboring pixels.
• Repeating this refinement step several times corresponds to
using the same convolutions at each stage, sharing weights
between the last layers of the deep net.
• This makes the sequence of computations performed by the
successive convolutional layers with weights shared across
layers a particular kind of recurrent network .
• Once a prediction for each pixel is made, various methods can
be used to further process these predictions in order to obtain
a segmentation of the image into regions.
• The general idea is to assume that large groups of contiguous
pixels tend to be associated with the same label. Graphical
models can describe the probabilistic relationships between
neighboring pixels.
• Alternatively, the convolutional network can be trained to
maximize an approximation of the graphical model training
objective.
Data Types
• The data used with a convolutional network usually consist
of several channels, each channel being the observation of a
different quantity at some point in space or time.
• One advantage to convolutional networks is that they can
also process inputs with varying spatial extents.
• When the output is accordingly variable sized, no extra
design change needs to be made. If however the output is
fixed sized, as in the classification task, a pooling stage with
kernel size proportional to the input size needs to be used.
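A pooling stage with kernel size proportional to the input size can be sketched as follows (a minimal numpy illustration, assuming average pooling; the function name is hypothetical): the input is split into a fixed grid of regions whose sizes scale with the input, so inputs of varying spatial extent all map to the same output shape.

```python
import numpy as np

def adaptive_avg_pool(feat, out_size):
    """Average-pool an (H, W) feature map to a fixed (out_size, out_size)
    grid by choosing pooling regions proportional to the input size."""
    H, W = feat.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r0, r1 = i * H // out_size, (i + 1) * H // out_size
            c0, c1 = j * W // out_size, (j + 1) * W // out_size
            out[i, j] = feat[r0:r1, c0:c1].mean()
    return out

# Inputs of different spatial extents map to the same fixed output shape
small = adaptive_avg_pool(np.ones((12, 12)), 4)
large = adaptive_avg_pool(np.ones((30, 22)), 4)
```

This is the mechanism that lets a classification head with a fixed number of inputs sit on top of variable-sized images.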
Different data types based on number of spatial
dimensions and channels are listed:

Dimensions | Single channel                                                       | Multichannel
1D         | Raw audio (single amplitude value per time point)                    | Skeleton animation data (orientation of each joint)
2D         | Audio spectrogram (one FFT coefficient per time point per frequency) | Color image (RGB triplet per (x,y) tuple)
3D         | CT scan (one value per (x,y,z) tuple)                                | Color video (one RGB triplet per (x,y) tuple per time instant)
Efficient Convolution Algorithms
• Modern convolutional network applications often involve
networks containing more than one million units. Powerful
implementations exploiting parallel computation resources
are essential. However, in many cases it is also possible to
speed up convolution by selecting an appropriate convolution
algorithm.
• Convolution is equivalent to converting both the input and the
kernel to the frequency domain using a Fourier transform,
performing point-wise multiplication of the two signals, and
converting back to the time domain using an inverse Fourier
transform. For some problem sizes, this can be faster than the
naive implementation of discrete convolution.
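The frequency-domain route can be checked in a few lines: zero-pad both signals to the full output length, multiply their Fourier transforms point-wise, and invert. This sketch verifies it against numpy's direct convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(64)   # input signal
k = rng.random(5)    # convolution kernel

# Direct (naive) linear convolution
direct = np.convolve(x, k)

# FFT-based convolution: pad both to the full output length so the
# circular convolution computed by the FFT equals the linear one
n = len(x) + len(k) - 1
fft_conv = np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(k, n)))
```

For large inputs and kernels the FFT route costs O(n log n) instead of the O(n·w) of the direct loop, which is why it wins for some problem sizes.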
• When a d-dimensional kernel can be expressed as the outer
product of d vectors, one vector per dimension, the kernel is
called separable. When the kernel is separable, naive
convolution is inefficient: it is equivalent to composing d one-
dimensional convolutions with each of these vectors.
• The composed approach is significantly faster than
performing one d-dimensional convolution with their outer
product. The kernel also takes fewer parameters to represent
as vectors. If the kernel is w elements wide in each dimension,
then naive multidimensional convolution requires O(w^d)
runtime and parameter storage space, while separable
convolution requires O(w × d) runtime and parameter storage
space. Of course, not every convolution can be represented in
this way.
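Separability can be demonstrated in 2-D with a minimal numpy sketch: a kernel built as an outer product of two vectors gives the same result whether applied as one 2-D pass or as a row pass followed by a column pass (cross-correlation convention throughout; the helper name is illustrative).

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D cross-correlation: O(w^2) work per output pixel."""
    H, W = img.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * kernel)
    return out

rng = np.random.default_rng(0)
u = rng.random(3)          # column filter
v = rng.random(3)          # row filter
K = np.outer(u, v)         # separable 3x3 kernel
img = rng.random((8, 8))

full = conv2d_valid(img, K)

# Separable route: filter rows with v, then columns with u (O(w) each).
# Flipping the vectors turns np.convolve into cross-correlation.
rows = np.array([np.convolve(r, v[::-1], mode='valid') for r in img])
sep = np.array([np.convolve(c, u[::-1], mode='valid') for c in rows.T]).T
```

The two results agree, while the separable route does O(w × d) work per output element instead of O(w^d).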
• Devising faster ways of performing convolution or
approximate convolution without harming the accuracy of the
model is an active area of research.
Random or Unsupervised Features
• Typically, the most expensive part of convolutional network
training is learning the features. The output layer is usually
relatively inexpensive due to the small number of features
provided as input to this layer after passing through several
layers of pooling.
• When performing supervised training with gradient descent,
every gradient step requires a complete run of forward
propagation and backward propagation through the entire
network.
• One way to reduce the cost of convolutional network training
is to use features that are not trained in a supervised fashion.
• There are three basic strategies for obtaining convolution
kernels without supervised training.
1. Random initialization has been shown to create filters that
are frequency selective and translation invariant. This can be
used to inexpensively select the model architecture: randomly
initialize several CNN architectures and train just the last
classification layer. Once a winner is determined, that model
can be fully trained in a supervised manner.
2. Hand designed kernels may be used; e.g. to detect edges at
different orientations and intensities
3. Unsupervised training of kernels may be performed; e.g.
applying k-means clustering to image patches and using the
centroids as convolutional kernels. Unsupervised pretraining
may offer a regularization effect (though this is not well
established). It may also allow training of larger CNNs because
of reduced computation cost.
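The k-means strategy from point 3 can be sketched in plain numpy (a toy illustration with a random "image"; patch size, k, and iteration counts are arbitrary choices, and Lloyd's algorithm is written out by hand):

```python
import numpy as np

# Hypothetical sketch: derive convolution kernels without supervision by
# clustering random image patches and using the centroids as filters.
rng = np.random.default_rng(0)
image = rng.random((32, 32))          # stand-in for a natural image
patch, k, n_patches, n_iter = 5, 8, 200, 10

# Sample random 5x5 patches and flatten each into a 25-vector
ys = rng.integers(0, 32 - patch, n_patches)
xs = rng.integers(0, 32 - patch, n_patches)
P = np.stack([image[y:y + patch, x:x + patch].ravel() for y, x in zip(ys, xs)])

# Plain Lloyd's algorithm: assign patches to nearest centroid, recompute means
centroids = P[rng.choice(n_patches, k, replace=False)]
for _ in range(n_iter):
    d = ((P[:, None, :] - centroids[None]) ** 2).sum(-1)
    labels = d.argmin(1)
    for c in range(k):
        if (labels == c).any():
            centroids[c] = P[labels == c].mean(0)

kernels = centroids.reshape(k, patch, patch)  # one 5x5 kernel per centroid
```

The resulting kernel stack would then be fixed and only the classifier on top trained, avoiding backpropagation through the feature layers.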
• Another approach to CNN training is greedy layer-wise
pretraining, most notably used in the
convolutional deep belief network. As in the case of multi-
layer perceptrons, each layer is trained in isolation, starting
with the first.
The Neuroscientific Basis for Convolutional
Network
• Hubel and Wiesel studied the activity of neurons in a cat's brain
in response to visual stimuli. Their work characterized many
aspects of brain function.
In a simplified view, we have:
1. The light entering the eye stimulates the retina. The image
then passes through the optic nerve and a region of the brain
called the LGN (lateral geniculate nucleus)
2. V1 (primary visual cortex): The image produced on the retina
is transported to the V1 with minimal processing. The
properties of V1 that have been replicated in CNNs are
 The V1 response is localized spatially, i.e. the upper image stimulates the
cells in the upper region of V1 [kernel]
 V1 has simple cells whose activity is a linear function of the input in a
small neighbourhood [convolution]
 V1 has complex cells whose activity is invariant to shifts in the position of
the feature [pooling] as well as some changes in lighting which cannot be
captured by spatial pooling [cross-channel pooling]
3. There are several stages of V1 like operations.
4. In the medial temporal lobe, researchers found so-called
grandmother cells. These cells respond to specific concepts
and are invariant to several transforms of the input: e.g. the
"Halle Berry neuron" fires when looking at a photo or drawing
of Halle Berry, or even when reading the text "Halle Berry".
• The medial temporal neurons are more generic than CNN in
that they respond even to specific ideas. A closer match to the
function of the last layers of a CNN is the IT (inferotemporal
cortex). When viewing an object, information flows from the
retina, through LGN, V1, V2, V4 and reaches IT. This happens
within 100ms. When a person continues to look at an object,
the brain sends top-down feedback signals to affect lower
level activation.
Some of the major differences between the human visual system (HVS) and
the CNN model are:
• The human eye is low resolution except in a region called the fovea.
Essentially, the eye does not receive the whole image at high resolution
but stitches together several patches through eye movements called
saccades. This attention-based gazing at the input image is an active
research problem. Note: attention mechanisms have been shown to work on
natural language tasks.
• Integration of several senses in the HVS while CNNs are only visual
• The HVS processes rich 3D information, and can also determine
relations between objects. CNNs for such tasks are in their early stages.
• The feedback from higher levels to V1 has not been incorporated into
CNNs with substantial improvement.
• While the CNN can capture firing rates in the IT, the similarity between
intermediate computations is not established. The brain probably uses
different activation and pooling functions. Even the linearity of filter
response is doubtful as recent models for V1 involve quadratic filters.
• Neuroscience tells us very little about the training procedure.
Backpropagation, which is the standard training mechanism
today, was not inspired by neuroscience and is sometimes
considered biologically implausible.
• In order to determine the filter parameters used by neurons, a
process called reverse correlation is used. The neuron
activations are measured by an electrode when viewing
several white noise images and a linear model is used to
approximate this behaviour. It has been shown experimentally
that the weights of the fitted model of V1 neurons are
described by Gabor functions.
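A Gabor function is simply a Gaussian envelope multiplied by a sinusoidal carrier; the sketch below (parameter names and values are illustrative) generates the kind of oriented filter that reverse correlation recovers from V1 simple cells.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength):
    """Sketch of a Gabor function: a Gaussian envelope times a sinusoid,
    the form fitted to V1 simple-cell responses by reverse correlation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# A 9x9 filter tuned to 45-degree edges at a wavelength of 4 pixels
g = gabor_kernel(size=9, sigma=2.0, theta=np.pi / 4, wavelength=4.0)
```

Varying theta and wavelength yields a bank of orientation- and frequency-selective filters, much like the first layer of a trained CNN.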
• Going by the simplified version of the HVS: if the simple
cells detect Gabor-like features, then complex cells learn a
function of simple cell outputs that is invariant to certain
translations and magnitude changes.
• A wide variety of statistical learning algorithms (from
unsupervised methods such as sparse coding to the first-layer
features of deep networks) learn Gabor-like features when
applied to natural images. This shows that while no algorithm
can be touted as the right method merely because it yields Gabor-
like feature detectors, a lack of such features may be taken as
a bad sign.
Convolutional Networks and the History of
Deep Learning
• Convolutional networks have played an important role in the
history of deep learning. They are a key example of a successful
application of insights obtained by studying the brain to
machine learning applications. They were also some of the first
deep models to perform well, long before arbitrary deep
models were considered viable.
• Convolutional networks were also some of the first neural
networks to solve important commercial applications and
remain at the forefront of commercial applications of deep
learning today.
– For example, in the 1990s, the neural network research
group at AT&T developed a convolutional network for
reading checks.
– By the end of the 1990s, this system deployed by NEC was
reading over 10% of all the checks in the US.
– Later, several OCR and handwriting recognition systems
based on convolutional nets were deployed by Microsoft.
• Deep learning can be used in a wide variety of applications,
including:
– Image recognition: To identify objects and features in
images, such as people, animals, places, etc.
– Natural language processing: To help understand the
meaning of text, such as in customer service chatbots and
spam filters.
– Finance: To help analyze financial data and make
predictions about market trends
– Text to image: Converting text descriptions into
images.
• See LeCun et al. (2010) for a more in-depth history of
convolutional networks up to 2010.
• Convolutional networks were also used to win many contests.
The current intensity of commercial interest in deep learning
began when Krizhevsky et al. (2012) won the ImageNet object
recognition challenge, but convolutional networks had been
used to win other machine learning and computer vision
contests with less impact for years earlier.
– LeNet-5 (1990)
– AlexNet (2012)
– VGGNet
– ResNet
– U-Net (2015)
– SegNet
– RCNN
– Fast/Faster - RCNN
– Mask R-CNN
• While ConvNets have been revolutionary in the field of
computer vision, their application extends to other domains
as well, including audio processing, time series analysis, and
financial forecasting.
– WaveNet (Audio Processing)
– TCN (Sequence Modelling Tasks)
– Transformer
• Convolutional nets were some of the first working deep
networks trained with back-propagation. It is not entirely clear
why convolutional networks succeeded when general back-
propagation networks were considered to have failed. It may
simply be that convolutional networks were more
computationally efficient than fully connected networks, so it
was easier to run multiple experiments with them and tune
their implementation and hyperparameters. Larger networks
also seem to be easier to train.
• With modern hardware, large fully connected networks
appear to perform reasonably on many tasks, even when
using datasets that were available and activation functions
that were popular during the times when fully connected
networks were believed not to work well. It may be that the
primary barriers to the success of neural networks were
psychological (practitioners did not expect neural networks to
work, so they did not make a serious effort to use neural
networks).
• Whatever the case, it is fortunate that convolutional networks
performed well decades ago. In many ways, they carried the
torch for the rest of deep learning and paved the way to the
acceptance of neural networks in general.
• Convolutional networks provide a way to specialize neural
networks to work with data that has a clear grid-structured
topology and to scale such models to very large size. This
approach has been the most successful on a two-dimensional,
image topology.
• To process one-dimensional, sequential data, we turn next to
another powerful specialization of the neural networks
framework: recurrent neural networks.
