Understanding Convolutional Networks
Introduction
• Convolutional networks, also known as convolutional neural
networks (or) CNNs, are a specialized kind of neural network
for processing data that has a known, grid-like topology.
– Examples include time-series data, which can be thought
of as a 1D grid taking samples at regular time intervals, and
image data, which can be thought of as a 2D grid of pixels.
• Convolutional networks have been tremendously successful in
practical applications. The name “convolutional neural
network” indicates that the network employs a mathematical
operation called convolution. Convolution is a specialized kind
of linear operation.
• Convolutional networks are simply neural networks that use
convolution in place of general matrix multiplication in at
least one of their layers.
Neural net matrix multiplication
• Each layer produces values that are obtained from the previous layer by performing a matrix multiplication
• In an unaugmented network:
  • Hidden layer produces values z = h(W(1)T x + b(1))
  • Output layer produces values y = σ(W(2)T z + b(2))
• Note: W(1) and W(2) are matrices rather than vectors
• Example with D = 3, x = [x1, x2, x3]T and M = 3: we have two weight matrices W(1) and W(2)
• W(1) columns: W1(1) = [W11(1), W12(1), W13(1)]T, W2(1) = [W21(1), W22(1), W23(1)]T, W3(1) = [W31(1), W32(1), W33(1)]T
• W(2) columns: W1(2) = [W11(2), W12(2), W13(2)]T, W2(2) = [W21(2), W22(2), W23(2)]T, W3(2) = [W31(2), W32(2), W33(2)]T
• [Figure: the first network layer output written in matrix multiplication notation]
Computation of 1-D discrete convolution

(f * g)[t] = Σ(τ = −∞ to ∞) f[τ] · g[t − τ]

[Figure: one of the functions is flipped, g(τ) → g(−τ), then shifted to give g[t − τ]; its overlap with f[t] yields (f * g)[t]]

Parameters of convolution:
• Kernel size (F)
• Padding (P)
• Stride (S)
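The 1-D discrete convolution above can be transcribed directly from the formula; a minimal pure-Python sketch for illustration (not optimized):

```python
# naive 1-D discrete convolution: (f * g)[t] = sum over tau of f[tau] * g[t - tau]
def conv1d(f, g):
    n = len(f) + len(g) - 1          # "full" output length
    out = [0.0] * n
    for t in range(n):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]
    return out

print(conv1d([1, 2, 3], [1, 1]))     # [1.0, 3.0, 5.0, 3.0]
```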
Two-dimensional convolution
• Convolutions over more than one axis
• If we use a 2D image I as input and a 2D kernel K, we have

  S(i, j) = (I * K)(i, j) = Σm Σn I(m, n) K(i − m, j − n)
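A direct NumPy transcription of S(i, j) over the "valid" region; flipping the kernel turns the sum into an elementwise product over a window (a sketch for illustration only):

```python
import numpy as np

def convolve2d_valid(I, K):
    # S(i, j) = sum_m sum_n I(m, n) K(i - m, j - n), valid positions only
    kh, kw = K.shape
    Kf = K[::-1, ::-1]               # flipping K turns convolution into cross-correlation
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i+kh, j:j+kw] * Kf)
    return S

I = np.arange(16.0).reshape(4, 4)
K = np.array([[0.0, 1.0], [2.0, 3.0]])
S = convolve2d_valid(I, K)
print(S.shape)                       # (3, 3)
```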
• Convolutional networks
have sparse
interactions
• Accomplished by making
the kernel smaller
than the input
Next slide shows graphical
depiction
Neural network for 1-D convolution
• A kernel g(t) is applied across the input f[t], producing outputs y1 through y8
• We can also write the equations in terms of elements of a general 8 × 8 weight matrix W, where W is sparse: each row contains the kernel values, shifted by one position per row, and all other entries are zero
Sparse Connectivity, viewed from below
• When s3 is formed by convolution with a kernel of width 3
• When s3 is formed by matrix multiplication
Keeping up performance with reduced connections
• It is possible to obtain good performance while keeping k several orders of magnitude smaller than m
• In a deep neural network, units in deeper layers
may indirectly interact with a larger portion of the
input
• The receptive field of units in deeper layers is larger than the receptive field of units in shallow layers
[Figure: outputs of maxpooling shown above outputs of the nonlinearity, before and after shifting the input]
• Every input value has changed, but only half of the output values have changed, because maxpooling units are sensitive only to the maximum value in the neighborhood, not its exact location
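This insensitivity to small shifts is easy to check numerically; a toy 1-D max pooling over hypothetical detector values (width 3, stride 1):

```python
import numpy as np

def maxpool1d(x, width=3):
    # stride-1 max pooling: each output is the max over a width-3 window
    return np.array([max(x[i:i+width]) for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3])
shifted = np.roll(detector, 1)       # every detector output moves by one position
pooled_a = maxpool1d(detector)
pooled_b = maxpool1d(shifted)
print(pooled_a)                      # the first two pooled values...
print(pooled_b)                      # ...are unchanged by the shift
```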
Importance of Translation Invariance
• A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input
• [Figure: an input tilted left gets a large response from the unit tuned to left-tilted images; a right-tilted input activates a different unit, but the pooled output is similar]
Using fewer pooling units than detector units
• Because pooling summarizes the responses over a
whole neighborhood, it is possible to use fewer
pooling units than detector units
• By reporting summary statistics for pooling regions spaced k pixels apart rather than one pixel apart
• This improves computational efficiency because the next layer has k times fewer inputs to process
• An example is given next
Pooling with down-sampling
• Max-pooling with a pool width of three and a stride
between pools of two
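The max pooling with a pool width of three and a stride of two can be reproduced in a few lines; detector values here are made up for illustration:

```python
def maxpool(x, width=3, stride=2):
    # report the summary statistic (max) only every `stride` positions
    return [max(x[i:i+width]) for i in range(0, len(x) - width + 1, stride)]

detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.3, 0.4]
print(maxpool(detector))             # [1.0, 0.2, 0.4] -- roughly half as many outputs
```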
• Real networks have branching structures
• Chain structures shown for simplicity
A Convolutional Neural Network (ConvNet or CNN) is a
specialized deep learning architecture designed mainly for
image, video, and spatial data. Its architecture is inspired by
how the visual cortex processes information, focusing on local
patterns through convolutional operations.
1. Input Layer
•Takes in raw data (e.g., image: 32×32×3 for a color image).
•Preprocessing may include normalization or resizing.
2. Convolutional Layer(s)
Core building block of CNNs.
Applies filters/kernels that slide over the input to detect local patterns (edges, textures,
shapes).
Each filter produces a feature map.
Parameters: filter size (e.g., 3×3, 5×5), stride, padding.
3. Activation Function
Non-linear transformation applied after convolution.
ReLU (Rectified Linear Unit) is the default choice.
Introduces non-linearity, enabling learning of complex features.
4. Pooling (Subsampling) Layer
• Reduces spatial dimensions of feature maps.
• Common: Max Pooling (takes max value), Average Pooling.
• Helps with translation invariance and reduces computational load.
5. Stacking Layers
• Several convolution + activation + pooling layers are stacked to progressively extract
low-level → high-level features.
• Early layers detect edges/corners.
• Deeper layers detect shapes/objects.
6. Fully Connected Layer(s)
• Feature maps are flattened into a vector and passed through one or more dense layers that combine the extracted features.
7. Output Layer
• Final dense layer with softmax (for multi-class) or sigmoid (for binary classification).
• Produces probability distribution over classes.
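The layer sequence above can be sketched end to end in NumPy. This is a shape-flow illustration only: a single channel, random placeholder weights, and no training.

```python
import numpy as np

def conv2d(x, k):                    # "valid" cross-correlation, the CNN convention
    kh, kw = k.shape
    H, W = x.shape
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

def relu(x):
    return np.maximum(0, x)

def maxpool2x2(x):                   # non-overlapping 2x2 max pooling
    H, W = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.random((8, 8))                        # 1. input
fmap = relu(conv2d(img, rng.random((3, 3))))    # 2-3. convolution + ReLU -> (6, 6)
pooled = maxpool2x2(fmap)                       # 4. pooling -> (3, 3)
logits = pooled.flatten() @ rng.random((9, 4))  # 6-7. dense output, 4 classes
probs = softmax(logits)
print(probs)                                    # sums to ~1.0: a probability distribution
```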
Source: https://[Link]/~frossard/post/vgg16/
Convolution and Pooling as an
Infinitely Strong Prior
Topics in Infinitely Strong Prior
• Weak and Strong Priors
• Convolution as an infinitely strong
prior
• Pooling as an infinitely strong prior
• Under-fitting with convolution and
pooling
• Permutation invariance
Prior parameter distribution
• The role of a prior probability distribution over the parameters of a model is to encode our beliefs about what models are reasonable, before seeing the data
Weak and Strong Priors
• A weak prior
  • A distribution with high entropy
  • e.g., a Gaussian with high variance
  • Data can move parameters freely
• A strong prior
  • It has very low entropy
  • e.g., a Gaussian with low variance
  • Such a prior plays a more active role in determining where the parameters end up
Infinitely Strong Prior
• An infinitely strong prior places zero probability
on some parameters
• It says that some parameter values are forbidden
regardless of support from data
• An infinitely strong prior cannot be overridden, no matter how much data is observed
Convolutional Network
• Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers
Convolution as infinitely strong prior
• Convolutional net is similar to a fully connected net
but with an infinitely strong prior over its weights
• It says that the weights for one hidden unit must be
identical to the weights of its neighbor, but shifted in
space
• Prior also says that the weights must be zero, except for in
the small spatially contiguous receptive field assigned to
that hidden unit
The two-step approach is computationally wasteful, because it computes many values that are then discarded
Effect of Zero-padding on network size
• Consider a convolutional network with a kernel of width 6 at every layer
• No pooling, so only convolution shrinks the network size, and we do not use any implicit zero padding
• This causes the representation to shrink by five pixels at each layer
• Starting from an input of 16 pixels, we are only able to have 3 convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional
• By adding five implicit zeroes to each layer, we prevent the representation from shrinking with depth
• This allows us to make an arbitrarily deep convolutional network
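The shrinkage arithmetic follows from the standard output-size formula, output width = (input + 2·padding − kernel)/stride + 1 (the formula is not stated on the slide but is the usual convention); checking the 16-pixel example:

```python
def output_width(w_in, kernel, padding=0, stride=1):
    # standard "valid + explicit padding" convolution output-size formula
    return (w_in + 2 * padding - kernel) // stride + 1

w = 16
widths = []
for _ in range(3):                   # kernel of width 6, no padding
    w = output_width(w, kernel=6)
    widths.append(w)
print(widths)                        # [11, 6, 1] -- the third layer leaves no room to move
```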
Locally connected layer
• In some cases, we do not actually want to use convolution, but rather locally connected layers
• The adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified by a 6-D tensor W.
• The indices into W are respectively:
• i, the output channel,
• j, the output row,
• k, the output column,
• l, the input channel,
• m, the row offset within the input, and
• n, the column offset within the input.
• The linear part of a locally connected layer is then given by

  Z(i, j, k) = Σ(l, m, n) V(l, j + m − 1, k + n − 1) · W(i, j, k, l, m, n)

• Also called unshared convolution
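A direct (slow) NumPy realization of the 6-D weight tensor, purely to show the indexing; the shapes below are made up for the example:

```python
import numpy as np

def locally_connected(V, W):
    # V: (in_channels, height, width); W: (out_ch, out_h, out_w, in_ch, kh, kw)
    # like convolution, but each output location (j, k) has its own kernel
    out_ch, out_h, out_w, in_ch, kh, kw = W.shape
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                Z[i, j, k] = np.sum(V[:, j:j+kh, k:k+kw] * W[i, j, k])
    return Z

V = np.random.rand(2, 5, 5)          # 2 input channels, 5x5 input
W = np.random.rand(3, 3, 3, 2, 3, 3) # 3x3 kernels, unshared across the 3x3 output grid
Z = locally_connected(V, W)
print(Z.shape)                       # (3, 3, 3)
```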
Local connections, convolution, full connections
Use of locally connected layers
• Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space
Tiled convolution
• Tiled convolution has a set of different kernels (here t = 2)
• Traditional convolution is equivalent to tiled convolution with t = 1: there is only one kernel and it is applied everywhere
Defining Tiled Convolution
Algebraically
• Let k be a 6-D tensor, where two of the dimensions
correspond to different locations in the output map.
• Rather than having a separate index for each
location in the output map, output locations cycle
through a set of t different choices of kernel stack in
each direction.
• If t is equal to the output width, this is the same as
a locally connected layer
[Table excerpt, 3-D data: single channel — CT scan (one value per (x, y, z) tuple); multi-channel — color video (one RGB triplet per (x, y) tuple per time instant)]
Efficient Convolution Algorithms
• Modern convolutional network applications often involve
networks containing more than one million units. Powerful
implementations exploiting parallel computation resources
are essential. However, in many cases it is also possible to
speed up convolution by selecting an appropriate convolution
algorithm.
• Convolution is equivalent to converting both the input and the
kernel to the frequency domain using a Fourier transform,
performing point-wise multiplication of the two signals, and
converting back to the time domain using an inverse Fourier
transform. For some problem sizes, this can be faster than the
naive implementation of discrete convolution.
• When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient; the same result can be obtained by composing d one-dimensional convolutions, one with each of these vectors.
• The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.
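The separability claim can be checked numerically: a kernel built as an outer product gives the same output whether applied as one 2-D pass or as two 1-D passes. A sketch with an arbitrary Sobel-like factorization:

```python
import numpy as np

def conv2d(x, k):                    # "valid" cross-correlation, the CNN convention
    kh, kw = k.shape
    H, W = x.shape
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

u = np.array([1.0, 2.0, 1.0])        # column factor
v = np.array([-1.0, 0.0, 1.0])       # row factor
K = np.outer(u, v)                   # separable 3x3 kernel

img = np.random.default_rng(0).random((8, 8))
full = conv2d(img, K)                         # one 2-D convolution
rows = conv2d(img, v.reshape(1, -1))          # 1-D pass along rows...
sep = conv2d(rows, u.reshape(-1, 1))          # ...then 1-D pass along columns
print(np.allclose(full, sep))                 # True
```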
• Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research.
Random or Unsupervised Features
• Typically, the most expensive part of convolutional network
training is learning the features. The output layer is usually
relatively inexpensive due to the small number of features
provided as input to this layer after passing through several
layers of pooling.
• When performing supervised training with gradient descent,
every gradient step requires a complete run of forward
propagation and backward propagation through the entire
network.
• One way to reduce the cost of convolutional network training
is to use features that are not trained in a supervised fashion.
• There are three basic strategies for obtaining convolution
kernels without supervised training.
1. Random initialization has been shown to create filters that are frequency selective and translation invariant. This can be used to inexpensively select the model architecture: randomly initialize several CNN architectures and train just the last classification layer. Once a winner is determined, that model can be fully trained in a supervised manner.
2. Hand designed kernels may be used; e.g. to detect edges at
different orientations and intensities
3. Unsupervised training of kernels may be performed; e.g., applying k-means clustering to image patches and using the centroids as convolutional kernels. Unsupervised pretraining may offer a regularization effect (not well established). It may also allow training of larger CNNs because of reduced computational cost.
• Another approach for CNN training is greedy layer-wise pretraining, most notably used in convolutional deep belief networks. As in the case of multi-layer perceptrons, each layer is trained in isolation, starting with the first.
The Neuroscientific Basis for Convolutional Networks
• Hubel and Wiesel studied the activity of neurons in a cat's brain
in response to visual stimuli. Their work characterized many
aspects of brain function.
In a simplified view, we have:
1. The light entering the eye stimulates the retina. The image
then passes through the optic nerve and a region of the brain
called the LGN (lateral geniculate nucleus)
2. V1 (primary visual cortex): The image produced on the retina
is transported to the V1 with minimal processing. The
properties of V1 that have been replicated in CNNs are
The V1 response is localized spatially, i.e. the upper image stimulates the
cells in the upper region of V1 [kernel]
V1 has simple cells whose activity is a linear function of the input in a
small neighbourhood [convolution]
V1 has complex cells whose activity is invariant to shifts in the position of
the feature [pooling] as well as some changes in lighting which cannot be
captured by spatial pooling [cross-channel pooling]
3. There are several stages of V1-like operations.
4. In the medial temporal lobe, we find grandmother cells.
These cells respond to specific concepts and are invariant to
several transforms of the input. In the medial temporal lobe,
researchers also found neurons spiking on a particular
concept, e.g. the Halle Berry neuron fires when looking at a
photo/drawing of Halle Berry or even reading the text Halle
Berry.
• The medial temporal neurons are more abstract than CNN features, in that they respond to specific concepts rather than just visual patterns. A closer match to the
function of the last layers of a CNN is the IT (inferotemporal
cortex). When viewing an object, information flows from the
retina, through LGN, V1, V2, V4 and reaches IT. This happens
within 100ms. When a person continues to look at an object,
the brain sends top-down feedback signals to affect lower
level activation.
Some of the major differences between the human visual system (HVS) and
the CNN model are:
• The human eye is low resolution except in a region called the fovea. Essentially, the eye does not receive the whole image at high resolution but stitches together several patches through eye movements called saccades. This attention-based gazing at the input image is an active research problem. Note: attention mechanisms have been shown to work on natural language tasks.
• Integration of several senses in the HVS while CNNs are only visual
• The HVS processes rich 3D information, and can also determine
relations between objects. CNNs for such tasks are in their early stages.
• The feedback from higher levels to V1 has not been incorporated into
CNNs with substantial improvement.
• While the CNN can capture firing rates in the IT, the similarity between
intermediate computations is not established. The brain probably uses
different activation and pooling functions. Even the linearity of filter
response is doubtful as recent models for V1 involve quadratic filters.
• Neuroscience tells us very little about the training procedure. Backpropagation, which is a standard training mechanism today, is not inspired by neuroscience and is sometimes considered biologically implausible.
• In order to determine the filter parameters used by neurons, a
process called reverse correlation is used. The neuron
activations are measured by an electrode when viewing
several white noise images and a linear model is used to
approximate this behaviour. It has been shown experimentally
that the weights of the fitted model of V1 neurons are
described by Gabor functions.
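A Gabor function is a Gaussian envelope multiplying an oriented sinusoidal carrier; a minimal generator (the parameter values here are arbitrary, chosen only for illustration):

```python
import numpy as np

def gabor(size=9, sigma=2.0, freq=0.25, theta=0.0):
    # Gaussian envelope times an oriented sinusoidal carrier
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)   # carrier orientation
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * x_rot)

g = gabor()
print(g.shape)                       # (9, 9), with the peak response at the center
```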
• If we go by the simplified version of the HVS, if the simple
cells detect Gabor-like features, then complex cells learn a
function of simple cell outputs which is invariant to certain
translations and magnitude changes.
• A wide variety of statistical learning algorithms (from unsupervised methods such as sparse coding to the first-layer features of deep networks) learn Gabor-like features when applied to natural images. While the presence of Gabor-like feature detectors cannot by itself establish that an algorithm is the right method, their absence may be taken as a bad sign.
Convolutional Networks and the History of
Deep Learning
• Convolutional networks have played an important role in the
history of deep learning. They are a key example of a successful
application of insights obtained by studying the brain to
machine learning applications. They were also some of the first
deep models to perform well, long before arbitrary deep
models were considered viable.
• Convolutional networks were also some of the first neural
networks to solve important commercial applications and
remain at the forefront of commercial applications of deep
learning today.
– For example, in the 1990s, the neural network research
group at AT&T developed a convolutional network for
reading checks.
– By the end of the 1990s, this system deployed by NEC was
reading over 10% of all the checks in the US.
– Later, several OCR and handwriting recognition systems
based on convolutional nets were deployed by Microsoft.
• Deep learning can be used in a wide variety of applications,
including:
– Image recognition: To identify objects and features in
images, such as people, animals, places, etc.
– Natural language processing: To help understand the
meaning of text, such as in customer service chatbots and
spam filters.
– Finance: To help analyze financial data and make
predictions about market trends
– Text to image: Generate images from textual descriptions using generative models.
• See LeCun et al. (2010) for a more in-depth history of
convolutional networks up to 2010.
• Convolutional networks were also used to win many contests.
The current intensity of commercial interest in deep learning
began when Krizhevsky et al. (2012) won the ImageNet object
recognition challenge, but convolutional networks had been
used to win other machine learning and computer vision
contests with less impact for years earlier.
– LeNet-5 (1990)
– AlexNet (2012)
– VGGNet
– ResNet
– U-Net (2015)
– SegNet
– R-CNN
– Fast/Faster R-CNN
– Mask R-CNN
• While ConvNets have been revolutionary in the field of
computer vision, their application extends to other domains
as well, including audio processing, time series analysis, and
financial forecasting.
– WaveNet (Audio Processing)
– TCN (Sequence Modelling Tasks)
– Transformer
• Convolutional nets were some of the first working deep
networks trained with back-propagation. It is not entirely clear
why convolutional networks succeeded when general back-
propagation networks were considered to have failed. It may
simply be that convolutional networks were more
computationally efficient than fully connected networks, so it
was easier to run multiple experiments with them and tune
their implementation and hyperparameters. Larger networks
also seem to be easier to train.
• With modern hardware, large fully connected networks
appear to perform reasonably on many tasks, even when
using datasets that were available and activation functions
that were popular during the times when fully connected
networks were believed not to work well. It may be that the
primary barriers to the success of neural networks were
psychological (practitioners did not expect neural networks to
work, so they did not make a serious effort to use neural
networks).
• Whatever the case, it is fortunate that convolutional networks
performed well decades ago. In many ways, they carried the
torch for the rest of deep learning and paved the way to the
acceptance of neural networks in general.
• Convolutional networks provide a way to specialize neural
networks to work with data that has a clear grid-structured
topology and to scale such models to very large size. This
approach has been the most successful on a two-dimensional,
image topology.
• To process one-dimensional, sequential data, we turn next to
another powerful specialization of the neural networks
framework: recurrent neural networks.