ARTIFICIAL NEURAL NETWORKS AND DEEP LEARNING
Deep Learning and its role in Computer Vision
Dr. Sandeep Singh Sengar
What is a Digital Image?
An image is a two-dimensional intensity function f(x, y): the value of f at a spatial location (x, y) is the intensity (gray level) of the image at that point.
[Figure: image plane with x and y axes; the surface height at (x, y) is the gray level f(x, y).]
Common image formats
– 1 sample per point (B&W) [0,1]
– 1 sample per point (Grayscale)[0-255]
– 3 samples per point (Red, Green, and Blue)[0-255]
– 4 samples per point (Red, Green, Blue, and “Alpha”, a.k.a. Opacity) [0-255, 0-1]
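A minimal NumPy sketch (not from the slides) of how these formats are typically stored as arrays; the 4x4 image size is purely illustrative:

```python
import numpy as np

# Binary (B&W): one sample per pixel, values in {0, 1}
bw = np.zeros((4, 4), dtype=np.uint8)
bw[1, 2] = 1

# Grayscale: one sample per pixel, values in [0, 255]
gray = np.full((4, 4), 128, dtype=np.uint8)

# RGB: three samples per pixel (red, green, blue), each in [0, 255]
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255                                   # a pure-red image

# RGBA: RGB plus an alpha (opacity) channel
rgba = np.dstack([rgb, np.full((4, 4), 255, dtype=np.uint8)])

print(bw.shape, gray.shape, rgb.shape, rgba.shape)  # (4, 4) (4, 4) (4, 4, 3) (4, 4, 4)
```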
Color Image
RGB Color Space
A color image is just three functions pasted together. We can write this as a “vector-valued” function:
f(x, y) = [ r(x, y), g(x, y), b(x, y) ]
RGB Image
Image Processing
An image processing operation typically defines a new image g in terms of an existing image f.
We can write the image transform as g(x, y) = T[f(x, y)], where T is an operator on f defined over a neighbourhood of the point (x, y).
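As a concrete illustration of g(x, y) = T[f(x, y)], here is a hedged NumPy sketch of one simple point transform, the image negative T[r] = 255 - r (the choice of transform is illustrative, not from the slides):

```python
import numpy as np

def negative(f: np.ndarray) -> np.ndarray:
    """Point transform g(x, y) = T[f(x, y)] with T[r] = 255 - r."""
    return (255 - f.astype(int)).astype(np.uint8)

f = np.array([[0, 64], [128, 255]], dtype=np.uint8)
g = negative(f)
print(g)   # [[255 191], [127 0]]
```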
Why Digital Image Processing?
Digital image processing focuses on two major tasks
– Improvement of pictorial information for human interpretation
– Processing of image data for storage, transmission and
representation for autonomous machine perception
There is some debate about where image processing ends and fields such as image analysis and computer vision begin.
The Spatial Filtering Process
A simple 3*3 filter w (coefficients j-r) is slid over the image f(x, y), starting from the origin. For the 3*3 neighbourhood of pixel e (values a-i):
a b c       j k l
d e f   *   m n o
g h i       p q r
eprocessed = j*a + k*b + l*c + m*d + n*e + o*f + p*g + q*h + r*i
The above is repeated for every pixel in the original image to generate the filtered image (see the code sketch that follows).
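A small Python/NumPy sketch of this sliding-window filtering; the function name filter2d is hypothetical, and border handling is deliberately naive here (border pixels are skipped, padding is discussed later):

```python
import numpy as np

def filter2d(image: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Slide a 3x3 filter w over the image and compute the weighted
    sum of each 3x3 neighbourhood. Border pixels are left untouched."""
    height, width = image.shape
    out = np.zeros_like(image, dtype=float)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbourhood = image[y - 1:y + 2, x - 1:x + 2]
            out[y, x] = np.sum(neighbourhood * w)
    return out

# Example: the identity filter leaves interior pixels unchanged
w = np.zeros((3, 3))
w[1, 1] = 1
img = np.arange(25, dtype=float).reshape(5, 5)
print(filter2d(img, w))
```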
Levels of Digital Image Processing
The continuum from image processing to computer vision
can be broken up into low-, mid- and high-level processes
Low-level processes. Input: image; output: image. Examples: noise removal, image sharpening.
Mid-level processes. Input: image; output: attributes. Examples: object recognition, segmentation.
High-level processes. Input: attributes; output: understanding. Examples: scene understanding, autonomous navigation.
Spatial filters
Recall the two types of neighbourhood:
– intensity transformation: neighbourhood of size 1*1
– spatial filter (also called mask, kernel, template or window): neighbourhood of larger size, e.g. a 3*3 mask
The spatial filter mask is moved from point to point in an image. At each point (x, y), the response of the filter is calculated.
[Figure: a neighbourhood centred at (x, y) in image f(x, y), with the origin at the top-left corner.]
Neighbourhood Operations
For each pixel in the original image, the outcome is written at the same location in the target image.
[Figure: a neighbourhood about (x, y) in the original image f(x, y) maps to the same location in the target image.]
Smoothing Spatial Filtering
A simple 3*3 smoothing filter (every coefficient equal to 1/9) applied to the neighbourhood
104 100 108
 99 106  98
 95  90  85
e = 1/9*(104 + 100 + 108 + 99 + 106 + 98 + 95 + 90 + 85) = 885/9 ≈ 98.33
The above is repeated for every pixel in the original image to generate the smoothed image.
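Assuming SciPy is available, a short sketch reproducing this worked example: the response at the centre pixel is simply the average of the 3*3 neighbourhood.

```python
import numpy as np
from scipy import ndimage

neighbourhood = np.array([[104, 100, 108],
                          [ 99, 106,  98],
                          [ 95,  90,  85]], dtype=float)

# 3x3 averaging (box) filter: every coefficient is 1/9
kernel = np.full((3, 3), 1 / 9)

# Response at the centre pixel: the plain average of the neighbourhood
print(np.sum(neighbourhood * kernel))                        # 98.333...

# Applied to a whole image, e.g. with scipy.ndimage
smoothed = ndimage.uniform_filter(neighbourhood, size=3, mode='reflect')
print(smoothed[1, 1])                                        # 98.333...
```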
Spatial filters : Smoothing
linear smoothing : averaging kernels
Standard average
Spatial filters : Smoothing
Standard Average - example
The mask is moved from point to point in an image. At each point (x, y), the response of the filter is calculated.
110 120  90 130
 91  94  98 200
 90  91  99 100
 82  96  85  90
Standard averaging filter applied to the top-left 3*3 neighbourhood:
(110 + 120 + 90 + 91 + 94 + 98 + 90 + 91 + 99)/9 = 883/9 ≈ 98.1
Spatial filters : Smoothing
Weighted Average- example
Spatial filters : Smoothing
Median Filter- example
Another smoothing example
By smoothing the original image we remove much of the finer detail, leaving only the gross features for thresholding.
[Figure: original image, smoothed image, thresholded image.]
Averaging filter vs. median filter example
[Figure: original image with noise; image after averaging filter; image after median filter.]
• Filtering is often used to remove noise from images.
• Sometimes a median filter works better than an averaging filter.
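A small SciPy sketch (the 7x7 test image is made up for illustration) showing why a median filter can beat an averaging filter on impulse ("salt") noise:

```python
import numpy as np
from scipy import ndimage

# A flat 7x7 region (value 100) corrupted by one salt-noise pixel
img = np.full((7, 7), 100.0)
img[3, 3] = 255.0

mean_out = ndimage.uniform_filter(img, size=3)   # averaging filter
med_out = ndimage.median_filter(img, size=3)     # median filter

print(mean_out[3, 3])   # ~117.2 -- the spike is smeared into the output
print(med_out[3, 3])    # 100.0  -- the spike is removed entirely
```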
Strange things happen at the edges!
At the edges of an image we are missing pixels to form a neighbourhood.
[Figure: the filter centred at border pixels of image f(x, y), where part of the neighbourhood falls outside the image.]
What happens when the values of the kernel fall outside the image?
Border padding
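A brief NumPy sketch of common padding strategies used when the kernel would otherwise fall outside the image (the 2x2 example image is illustrative):

```python
import numpy as np

img = np.array([[1, 2],
                [3, 4]])

# Common border-padding strategies:
print(np.pad(img, 1, mode='constant', constant_values=0))  # zero padding
print(np.pad(img, 1, mode='edge'))      # replicate the border pixels
print(np.pad(img, 1, mode='reflect'))   # mirror the image at the border
```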
Applications
Text Recognition
Biometrics
Computer Vision
“One picture is worth more than a thousand words”
Object Detection
• Moving-object detection is one of the basic and most
active research domains in the field of computer vision.
• The underlying assumption is that moving objects generally entail intensity changes between consecutive frames.
Object Tracking
Object tracking computes the configuration (i.e., position and size) of the target in subsequent frames, given the state of the target in the initial frame.
Object Recognition
Object recognition is a computer vision technique for
identifying objects in images or videos.
Medical Image Segmentation
Medical Imaging
What is Machine Learning?
Machine learning is a subset of Artificial Intelligence that provides computers with the ability to learn without being explicitly programmed.
Machine learning emerged in the 1950s; the term was coined in 1959 by Arthur Samuel at IBM, who designed a checkers-playing program.
Ref: https://www.forbes.com/sites/kalevleetaru/2019/01/15/why-machine-learning-needs-semantics-not-just-statistics/?sh=730fa3aa77b5
Branches of Machine Learning
Ref: https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications
Deep Learning
Deep Learning is a subfield of machine learning concerned
with algorithms inspired by the structure and function of the
brain called artificial neural networks.
DL/ML is used to learn the algorithm (model) from data; deep learning in particular benefits from large data and high-performance hardware.
Ref: https://www.intel.la/content/www/xl/es/artificial-intelligence/posts/difference-between-ai-machine-learning-deep-learning.html
Why Deep Learning Today?
▪ Better algorithms and
understanding
▪ Computational power (GPUs,
TPUs, …)
▪ Massive labelled data
▪ Variety of open source tools
and models
Slide adapted from Wai K.
End-to-end approach?
Ref: https://lawtomated.com/a-i-technical-machine-vs-deep-learning/
Deep Learning Process
▪ Deep neural networks provide state-of-the-art accuracy in many tasks, from object detection to speech recognition
▪ They can learn features automatically, without predefined knowledge explicitly coded by the programmers
Effectiveness of Deep Learning
▪ Deep learning algorithms attempt to learn representations by using a hierarchy of multiple layers
▪ If we provide the system with tons of information, it begins to understand it and respond in useful ways
▪ Manually designed features are often over-specified, incomplete and take a long time to design and validate
▪ Learned features are easy to adapt, fast to learn
Effectiveness of Deep Learning
▪ Deep learning provides a very flexible, universal and learnable framework for representing the world
▪ Can learn in both unsupervised and supervised
manner
▪ Utilize large amounts of training data
▪ Since 2010, deep learning started outperforming
other machine learning techniques especially in
the areas of machine vision and speech
recognition
Deep Learning Examples
▪ Hierarchy of representations with increasing level
of abstraction
▪ Each stage is a kind of trainable nonlinear feature
transform
▪ Image recognition example
• Pixel → edge → texton → motif → part → object
▪ Text example
• Character → word → word group → clause →
sentence → story
Deep Learning in Practice
▪ Visual question answering : Given an image and a
natural language question about the image, the
task is to provide an accurate natural language
answer
▪ Click here for demo: http://visualqa.csail.mit.edu/
Deep Learning Architectures
Architecture: Application
CNN: Image recognition, video analysis, natural language processing
RNN: Speech recognition, handwriting recognition, machine translation
LSTM/GRU networks: Natural language text compression, handwriting recognition, speech recognition, gesture recognition, image captioning
DBN: Image recognition, information retrieval, natural language understanding, failure prediction
DSN: Information retrieval, continuous speech recognition
The Spatial Filtering Process (recap)
A 3*3 filter w slides over the image; at each pixel the filtered value is the weighted sum of the 3*3 neighbourhood (eprocessed = j*a + k*b + l*c + m*d + n*e + o*f + p*g + q*h + r*i), and this is repeated for every pixel to generate the filtered image. Convolutional layers apply the same operation, but with filter coefficients that are learned from data.
Convolutional Neural Network
A Convolutional Neural Network is a Deep Learning algorithm which can take
in an input image, assign importance (learnable weights and biases) to various
aspects/objects in the image and be able to differentiate one from the other.
The pre-processing required in a CNN is much lower as compared to other
classification algorithms.
CNN layers
An image is passed through a series of layers (see the sketch below):
– Convolutional layers: the filters can be thought of as feature identifiers
– Non-linear activation (ReLU): lets the network approximate complex functions
– Max pooling: down-sampling
– Fully connected layers (softmax/sigmoid): produce the output
Ref: https://towardsdatascience.com/understanding-and-implementing-lenet-5-cnn-architecture-deep-learning-a2d531ebc342
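A minimal PyTorch sketch of such a layer stack; the layer sizes and the 28x28 grayscale input are illustrative assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

# Convolution -> ReLU -> Max pooling -> Fully connected -> class probabilities
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),          # down-sampling
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),           # fully connected layer
    nn.Softmax(dim=1),                    # class probabilities
)

x = torch.randn(1, 1, 28, 28)             # one 28x28 grayscale image
print(model(x).shape)                     # torch.Size([1, 10])
```

In practice the softmax is often folded into the loss function (e.g. cross-entropy) rather than kept as a separate layer.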
Convolutional Neural Network
Basic idea of convolution
Convolutional Layer Example
Stride s=2
#filters=2
#channels=3
Padding p=1
Size of Output
I/P size: n*n
Filter size: f*f
O/P size: (n-f+1)*(n-f+1)
Padding and stride convolutions
Padding: used to keep the output the same size as the input (“same” convolution)
With padding p: O/P size = (n+2p-f+1)*(n+2p-f+1); for same-size output, p = (f-1)/2
With stride s: O/P size = [(n+2p-f)/s + 1] * [(n+2p-f)/s + 1], rounded down (a small helper is sketched below)
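A tiny helper (the name conv_output_size is hypothetical) that evaluates these formulas:

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size for an n x n input, f x f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))             # 4 -> (n - f + 1)
print(conv_output_size(6, 3, p=1))        # 6 -> 'same' padding, p = (f-1)/2
print(conv_output_size(7, 3, p=1, s=2))   # 4
```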
Multiple filters
For example, to detect horizontal and vertical edges.
An n*n*nc input convolved with nc' filters of size f*f*nc gives an output of size (n-f+1)*(n-f+1)*nc', where nc' is the number of filters.
Number of parameters in one layer
Suppose 10 filters of size 3*3*3
Then the total number of parameters is [3*3*3 + 1 (bias)]*10 = 280.
That is, one bias per filter.
The parameter count does not depend on the input image size (one of the beauties of deep learning).
This makes the model less prone to overfitting (a one-line helper follows below).
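The same count as a small helper (the function name is hypothetical), reproducing the slide's 280:

```python
def conv_layer_params(f: int, n_c: int, n_filters: int) -> int:
    """Learnable parameters in one convolutional layer:
    each filter has f*f*n_c weights plus one bias."""
    return (f * f * n_c + 1) * n_filters

# The slide's example: 10 filters of size 3x3x3
print(conv_layer_params(f=3, n_c=3, n_filters=10))   # 280
```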
Automatically learnt features
Early layers retain most of the information (edge detectors); deeper layers move towards more abstract representations and encode high-level concepts.
Representations become sparser: they detect fewer (more abstract) features.
https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2
Non-linear Activation Function
Pooling
▪ The goal of the pooling operation is to reduce the
spatial size of convolved features
▪ Pooling helps in extracting salient features that are rotation and position invariant
• For example, even if the orientation of the nose, eyes and ears changes, the image segment would still be detected as a head
• This is one of the most prominent features of CNNs
Pooling
▪ Two types of pooling operators are common: Max
pooling and Average pooling
• Max pooling returns the maximum value from the
portion of the image covered by the filter
• Average pooling returns the average of all the
values from the portion of the image covered by the
filter
Max Pooling
▪ Let's apply a 3 x 3 max-pooling filter (stride 1) to a 5 x 5 convolved feature map
15.5 23.8  7.9 20.6 12.9
12.7 18.3 22.3  7.9  8.3
11.3  9.2 11.8 18.9 10.3
11.7 11.3 17.5  6.8 19.3
18.3 19.6 11.2 15.2  7.2
▪ Sliding the window one position at a time and keeping the maximum of each 3 x 3 window gives the pooled map
23.8 23.8 22.3
22.3 22.3 22.3
19.6 19.6 19.3
Average Pooling
▪ Applying a 3 x 3 average-pooling filter (stride 1) to the same 5 x 5 convolved feature map, each output value is the mean of the values covered by the window
▪ The first window (the top-left 3 x 3 block) averages to 14.8 and the next one to 15.6; the remaining positions are computed in the same way
Max Pooling
Hidden layer i (4*4 input):
-4  5  4  6
 0 -3  2 -3
 7  8 -5  9
 3  0 -4  1
Possible pooled nodes in hidden layer i + 1:
– 4*4 max over the whole map: 9
– 2*2 max, non-overlapping (stride 2):
  5 6
  8 9
– 2*2 max, overlapping (stride 1):
  5 5 6
  8 8 9
  8 8 9
  (this contains the non-overlapping result, so there is no need for both)
As for convolution, with input size n*n, filter size f*f, padding p and stride s, the pooled output size is (n+2p-f)/s + 1.
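A NumPy sketch (the pool2d helper is hypothetical) that reproduces the 5 x 5 max- and average-pooling example above and can be configured for the stride variants:

```python
import numpy as np

def pool2d(x: np.ndarray, size: int, stride: int, op=np.max) -> np.ndarray:
    """Apply max or average pooling with the given window size and stride."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

fmap = np.array([[15.5, 23.8,  7.9, 20.6, 12.9],
                 [12.7, 18.3, 22.3,  7.9,  8.3],
                 [11.3,  9.2, 11.8, 18.9, 10.3],
                 [11.7, 11.3, 17.5,  6.8, 19.3],
                 [18.3, 19.6, 11.2, 15.2,  7.2]])

print(pool2d(fmap, size=3, stride=1, op=np.max))   # 3x3 max-pooled map
print(pool2d(fmap, size=3, stride=1, op=np.mean))  # 3x3 average-pooled map
```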
Fully Connected Layer
Fully Connected Layer
• These are simply feed-forward neural networks.
• Fully connected layers form the last few layers in the network.
• The input to the fully connected layer is the output of the final pooling or convolutional layer, in flattened form.
• After passing through the fully connected layers, the final layer uses the softmax activation function to obtain the probability of the input belonging to each class (classification); a small sketch follows below.
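A small NumPy sketch of the softmax used in that final layer (the example scores are made up):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Turn the final layer's raw scores into class probabilities."""
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])    # hypothetical class scores
print(softmax(scores))                # ~[0.659, 0.242, 0.099], sums to 1
```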
CNN Architectures
Various CNN architectures have been key in building the algorithms that power, and will continue to power, AI in the foreseeable future. Some of them are listed below:
• LeNet
• AlexNet
• VGGNet
• GoogLeNet
• ResNet
• ZFNet
U-Net
Ref: Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241. Springer, Cham, 2015.
Train, Validation and Test Datasets
• Training Dataset: The sample of data used to fit the model.
• Validation Dataset: The sample of data used to evaluate a given model while fine-tuning its hyperparameters. The model occasionally "sees" this data but never learns from it, so the validation set affects the model only indirectly.
• Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the
training dataset. The Test dataset provides the gold standard used to evaluate the model. It is only
used once a model is completely trained (using the train and validation sets).
Make sure the validation and test sets come from the same distribution (see the split sketch below).
Hyperparameters: learning rate, #iterations, #hidden layers, #hidden units, choice of activation function
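A hedged scikit-learn sketch of one common way to carve out the three sets; the 80/20 and 75/25 ratios and the random data are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)          # hypothetical feature matrix
y = np.random.randint(0, 2, 1000)     # hypothetical labels

# First carve out the test set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```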
Under-fitting and Over-fitting
High bias: underfitting
High variance: overfitting
Bias and Variance
Training set error:    1%              15%          15%                       0.5%
Validation set error:  11%             16%          30%                       1%
Result:                high variance   high bias    high bias and variance    low bias and variance
Bias-variance Trade-off
Under fitting (High bias)
• A statistical model or machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data.
• Underfitting destroys the accuracy of the machine learning model.
• Training accuracy is low in this case.
Steps for reducing underfitting:
⮚ Use a bigger network
⮚ Train for longer
⮚ Increase the number of parameters in the model
Overfitting (high variance)
• Overfitting happens when your model fits too well to the training set.
• It then becomes difficult for the model to generalize to new examples
that were not in the training set.
Steps for reducing overfitting:
⮚ Add more data
⮚ Data augmentation (rotate, crop, zoom)
⮚ Simplify the model
⮚ Change the training process (like loss function)
⮚ Early termination
⮚ Regularization (see the sketch below)
❑ Dropout and drop connect
❑ L1 and L2 regularization
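A short PyTorch sketch of two of these regularizers, dropout and L2 weight decay; the layer sizes and hyperparameter values are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Dropout randomly zeroes activations during training, which discourages
# co-adaptation of units and reduces overfitting.
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # dropout regularization
    nn.Linear(64, 10),
)

# L2 regularization is applied through the optimizer's weight_decay term
# (used in the training loop together with a loss function).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()                              # dropout active during training
print(model(torch.randn(4, 100)).shape)    # torch.Size([4, 10])
model.eval()                               # dropout disabled at evaluation time
```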
Ideas to improve ML/DL strategies
• Collect more data
• Collect more diverse training examples
• Train algorithm longer with suitable optimizer
• Try bigger network
• Try smaller network
• Try dropout
• Add regularization
• Network architectures:
❑ Activation function
❑ #hidden units
❑ Learning rate
❑ Iterations
Problems where ML/DL significantly surpasses
human level performance
• Online advertising: estimating how likely someone is to click on an ad
• Product recommendations
• Loan approval
• Lots of data
CNN for Computer Vision tasks
• Object Detection
• Object Tracking
• Recognition
• Face Recognition
• Action and Activity Recognition
• Human Pose Estimation
• Image Classification
• Image Classification with Localization
• Object Segmentation
• Image Style Transfer
• Image Colorization
• Image Reconstruction
• Image Super-Resolution
• Image Synthesis
Challenges
The challenge of making systems human-like:
• It is difficult to simulate something as complex as the human visual system.
• Objects may appear in a variety of sizes and aspect ratios.
• One object must be distinguished from multiple others.
• There is a variety of handwriting styles, curves and shapes employed while writing.
• Deformation, appearance variation, scale variation, occlusion and rotation of objects.
Computer vision has its present challenges, but the humans working on this technology are steadily
improving it.
CNN: A Real Example
[Figure sequence: a worked CNN example showing an input image, the learned filters and the resulting feature maps at successive layers.]
Convolutional Neural Network
Suppose the task is to predict an image caption
▪ The CNN receives an image, say of a cat
• To the computer, this image is a collection of pixels
▪ Generally there is one channel for a greyscale picture and three channels for a colour picture
▪ During feature learning (i.e., in the hidden layers), the network identifies unique features, for instance the tail of the cat, the ear, etc.
▪ Once the network has thoroughly learned how to recognise a picture, it can provide a probability for each label it knows
▪ The label with the highest probability becomes the prediction of the network
Which Works Better: RNN or CNN?
▪ There is a vast number of neural network architectures, each designed to perform a given task
▪ CNN works very well with images
▪ RNN (Recurrent Neural Network) provides
impressive results with time series and text
analysis
Self-Review Questions
▪ What is convolution and how does it work?
▪ What is pooling and how does it work?
▪ What would be the impact of a large/small stride length?
References
“Digital Image Processing”, Rafael C. Gonzalez & Richard E. Woods, Addison-Wesley, 2002
– Much of the material in this lecture is taken from this book
“Machine Vision: Automated Visual Inspection and Robot Vision”, David Vernon, Prentice Hall, 1991
Thank You