UNIT 2 - Neural Networks & DL
Syllabus:
Deep neural networks (DNNs):
1. Difficulty of training DNNs,
2. Greedy layerwise training,
3. Optimization for training DNNs,
4. Newer optimization methods for neural networks (AdaGrad, RMSProp, Adam),
5. Second-order methods
What are Deep Neural Networks (DNNs)?
Deep neural networks (DNNs) are a class of artificial neural networks (ANNs) that have many layers of hidden units between the input and output layers. They are the core models of deep learning, which is itself a branch of machine learning. The term "deep" refers to the depth of the network, i.e., the number of hidden layers.
Key Characteristics of DNN:
• Multiple Hidden Layers: DNNs have multiple hidden layers, which allows
them to model intricate relationships within the data.
• Hierarchical Feature Learning: Each layer in a DNN learns to transform its
input into a more abstract representation. For example, in image
processing, lower layers might detect edges, while higher layers detect
shapes or objects.
• Non-linearity: DNNs use activation functions (like ReLU, sigmoid, or tanh)
to introduce non-linearity into the network, allowing it to model complex
functions.
Training DNN: Training a model refers to the process of teaching a model to
make predictions or decisions based on data. The goal is to enable the model
to learn patterns, relationships, and features from the data so that it can
perform a specific task, such as classification, regression, or clustering,
effectively on new, unseen data.
How to Train DNNs:
1. Define Loss Function:
Select an appropriate loss function that measures how well the model's predictions match the target values (e.g., cross-entropy loss for classification, mean squared error for regression).
2. Choose an Optimizer:
Select an optimization algorithm to minimize the loss function (e.g., Stochastic Gradient Descent (SGD), Adam, RMSprop).
3. Set Hyperparameters:
• Learning Rate: Determine the learning rate, which controls the step size in the optimization
process.
• Batch Size: Choose the number of samples processed before the model is updated.
• Epochs: Specify the number of times the entire dataset will be passed through the model.
4. Training Loop:
Forward Pass: Pass input data through the model to obtain predictions.
Compute Loss: Calculate the loss between predictions and actual values.
Backward Pass: Perform backpropagation to compute gradients of the loss with respect to the
model’s parameters.
Update Weights: Adjust the model’s weights using the optimizer based on computed gradients.
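The four training-loop steps above can be made concrete with a short, purely illustrative sketch. The following PyTorch snippet (the toy data, model size, and hyperparameter values are assumptions, not taken from these notes) shows one forward pass, loss computation, backward pass, and weight update per mini-batch:

```python
import torch
import torch.nn as nn

# Hypothetical toy data: 100 samples, 4 features, binary labels
X = torch.randn(100, 4)
y = torch.randint(0, 2, (100, 1)).float()

# A small feed-forward network (one hidden layer)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

loss_fn = nn.BCEWithLogitsLoss()                           # 1. loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # 2. optimizer
batch_size, epochs = 10, 20                                # 3. hyperparameters

for epoch in range(epochs):                                # 4. training loop
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        logits = model(xb)                                 # forward pass
        loss = loss_fn(logits, yb)                         # compute loss
        optimizer.zero_grad()
        loss.backward()                                    # backward pass (gradients)
        optimizer.step()                                   # update weights
```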
Example: Simple DNN for predicting whether a student will pass or fail the exam based on study hours.
1. Problem Statement:
Task: Classification (pass or fail).
Input Feature: Number of hours studied.
Output Label: Pass (1) or Fail (0).
2. Data Collection: Refer to the dataset table below.

Hours Studied | Pass (Label)
1             | 0
2             | 0
3             | 0
4             | 1
5             | 1
6             | 1
7             | 1
8             | 1

3. Data Preparation: Data cleaning, normalization, scaling, etc. Split into training (70%) and test (30%) data: 5 samples for training and 3 for testing.
4. Model Selection: Binary classifier( Logistic Regression)
5. Train the model:
• Loss Function: For logistic regression, the loss function is Logistic
Loss (Binary Cross-Entropy).
• Choose the Optimizer: Gradient Descent will be used to
minimize the loss function.
• Set Hyperparameters:
Learning Rate: Set a learning rate (e.g., 0.01).
Epochs: Set the number of iterations for training (e.g., 1000
epochs)
• Training Loop:
• Forward Pass: Calculate the predicted probability of passing based on the number of hours
studied.
• Compute Loss: Calculate the loss between the predicted probabilities and actual labels.
• Backward Pass: Compute gradients of the loss function with respect to the model parameters.
• Update Weights: Adjust model parameters (weights) using the gradients.
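For illustration only (not part of the original example), the whole loop can be written from scratch in NumPy. The sketch below trains on all eight samples from the dataset table for brevity, whereas the notes suggest a 5/3 train/test split; the learning rate and epoch count follow the values suggested above:

```python
import numpy as np

# Dataset from the table: hours studied -> pass (1) / fail (0)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
label = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

w, b = 0.0, 0.0          # model parameters (weight and bias)
lr, epochs = 0.01, 1000  # hyperparameters from the example

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):
    p = sigmoid(w * hours + b)             # forward pass: predicted probability of passing
    grad_w = np.mean((p - label) * hours)  # gradient of binary cross-entropy w.r.t. w
    grad_b = np.mean(p - label)            # gradient w.r.t. b
    w -= lr * grad_w                       # update weights
    b -= lr * grad_b

print("P(pass | 4.5 hours studied) =", sigmoid(w * 4.5 + b))
```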
6. Validation and Hyperparameter Tuning: Monitor and adjust
hyperparameters as needed.
7. Model Evaluation: Assess performance using metrics like accuracy
on the test set.
8. Deployment: Save and deploy the model for making predictions.
9. Model Maintenance: Monitor and retrain the model as needed.
1. Difficulty of training DNN
Training a model can be challenging due to various difficulties that may arise
throughout the process. The most common problems and their typical solutions are explained below.
1. Overfitting
Overfitting occurs when a model learns to perform well on the training data but does not generalize well to unseen data. This can be caused by having too many parameters, training for too long, or not having enough training data.
• Solutions:
• Regularization: Techniques such as L1 and L2 regularization can help prevent overfitting by
penalizing large weights in the model.
• Dropout: Randomly dropping out neurons during training can help prevent the model from
relying too heavily on any single neuron.
• Early stopping: Stop training when the validation loss does not improve for a certain
number of epochs to prevent the model from fitting the noise in the data.
• Data augmentation: Artificially increase the size of the dataset by creating new training examples through transformations such as rotation, scaling, and flipping.
2. Underfitting
Underfitting occurs when a model fails to capture the underlying structure of the data,
resulting in poor performance on both the training and validation sets. This can be caused by
an overly simplistic model architecture or insufficient training.
Solutions:
• Increase model complexity: Add more layers or neurons to the model to increase its
capacity to learn complex patterns.
• Train longer: Train the model for more epochs to allow it to learn the underlying structure
of the data.
• Experiment with optimization algorithms: Try different optimization algorithms, such as
Adam, RMSprop, or Adagrad, to improve model convergence.
3. Vanishing and Exploding Gradients
Vanishing gradients occur when the gradients during backpropagation become too
small, making it difficult for the model to learn. Exploding gradients, on the other
hand, occur when the gradients become too large, leading to unstable training.
Solutions:
• Weight initialization: Use techniques like Xavier or He initialization to set the initial
weights of the model.
• Activation functions: Choose appropriate activation functions, such as ReLU or ELU, to prevent gradients from vanishing or exploding.
• Batch normalization: Normalize the inputs to each layer to ensure they have a
consistent mean and variance, which can help stabilize the gradients.
4. Slow Training
Training deep learning models can be time-consuming, especially for large
models and datasets.
Solutions:
• Mini-batch gradient descent: Train the model on smaller batches of data to
speed up the training process.
• Parallelization: Utilize multi-core CPUs, GPUs, or TPUs to parallelize the training process.
• Distributed training: Train the model across multiple devices or machines to
speed up training.
5. Insufficient or Imbalanced Data
Having a small or imbalanced dataset can lead to poor model performance.
Solutions:
• Data augmentation: Create new examples by applying transformations to the
existing data.
• Oversampling or undersampling: Balance the dataset by oversampling minority classes or undersampling majority classes.
• Transfer learning: Utilize pre-trained models to leverage knowledge from
similar tasks or domains.
6. Hyperparameter Tuning
Selecting the optimal hyperparameters for a model can be challenging and time-
consuming.
Solutions:
• Grid search: Search for the best hyperparameters by exhaustively trying all possible
combinations within a predefined range.
• Random search: Sample random combinations of hyperparameters within a
predefined range.
• Bayesian optimization: Use a probabilistic model to guide the search for optimal
hyperparameters, focusing on promising regions of the search space.
7. Model Architecture Selection
Choosing the right model architecture for a specific problem can be difficult and may
require experimentation with different architectures and layer configurations.
Solutions:
• Start with well-known architectures: Use proven architectures, such as CNNs for
image classification or LSTMs for sequence data, as a starting point.
• Experiment with different layer configurations: Try adding or removing layers and
adjusting layer sizes to find the optimal architecture.
• Use architecture search algorithms: Leverage techniques like neural architecture search (NAS) to automatically discover the best model architecture for a given problem.
2. Greedy Layerwise Training
Greedy Layerwise Training is a technique used in machine learning, particularly in deep learning, to
train deep neural networks layer by layer. It's a bottom-up approach where each layer is trained
separately, starting with the first layer and gradually adding subsequent layers.
Processes Involved:
• Initialize the first layer: The first layer is typically a simple layer, like a linear layer or a convolutional
layer. It's initialized with random weights.
• Train the first layer: This layer is trained using a supervised learning algorithm, such as
backpropagation, on the input data.
• Add the next layer: Once the first layer is trained, a new layer is added on top of it. The weights of
this new layer are also initialized randomly.
• Train the second layer: The entire network (now with two layers) is trained using backpropagation,
but the weights of the first layer are fixed. This means that the second layer is learning to transform
the outputs of the first layer into the desired target.
• Repeat: This process is repeated until all the desired layers have been added and trained.
Why Greedy Layerwise Training?
• Efficiency: It can be more efficient than training the entire network at once, especially for
deep networks.
• Debugging: It can help in identifying and fixing issues in individual layers.
• Initialization: It can provide a good initialization for the weights of the network.
Limitations:
• Suboptimal solutions: Greedy Layerwise Training may not always find the globally optimal
solution.
• Difficulty with deep networks: As the network becomes deeper, it can become more
challenging to train each layer effectively.
Applications:
• Deep belief networks: Greedy Layerwise Training is commonly used to train deep belief
networks, a type of unsupervised learning model.
• Stacked autoencoders: It can also be used to train stacked autoencoders, another
unsupervised learning model.
(Diagrams: the layer-by-layer training process, illustrated for stacked autoencoders and deep belief networks.)
Example: Training a Deep Belief Network using Greedy Layerwise Training
• Deep Belief Networks (DBNs) are a type of generative model that can learn
complex patterns in data. They are often trained using Greedy Layerwise
Training.
• Steps:
• Initialize the network: Start with a stack of Restricted Boltzmann Machines (RBMs). Each
RBM has a visible layer and a hidden layer.
• Train the first RBM: Use unsupervised learning (e.g., contrastive divergence) to train the
first RBM. The visible layer is connected to the input data, and the hidden layer learns to
represent the underlying patterns.
• Use the first RBM's hidden layer as input for the second RBM: The hidden layer
activations from the first RBM become the input for the visible layer of the second RBM.
• Train the second RBM: Repeat the training process for the second RBM.
• Continue adding and training RBMs: Continue this process until all RBMs in the DBN are
trained.
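The worked example above uses RBMs trained with contrastive divergence, which takes more code; the sketch below illustrates the same greedy layer-wise idea with the stacked-autoencoder variant mentioned earlier (all sizes, epoch counts, and learning rates are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical data: 256 samples with 784 features (e.g., flattened 28x28 images)
X = torch.rand(256, 784)

layer_sizes = [784, 256, 64]     # input size followed by the stacked hidden-layer sizes
trained_encoders = []
inputs = X

for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    encoder = nn.Sequential(nn.Linear(n_in, n_hid), nn.Sigmoid())
    decoder = nn.Linear(n_hid, n_in)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Train this single autoencoder layer; previously trained layers stay frozen.
    for _ in range(50):
        recon = decoder(encoder(inputs))
        loss = loss_fn(recon, inputs)
        opt.zero_grad()
        loss.backward()
        opt.step()

    trained_encoders.append(encoder)
    inputs = encoder(inputs).detach()   # hidden activations become the input for the next layer
```

After pretraining, the stacked encoders can be combined with an output layer and fine-tuned end-to-end with backpropagation.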
3. Optimization for Training DNNs
What is optimization?
Process of minimizing the error and maximizing the performance
What is an optimizer?
• Optimizers are algorithms or methods used to minimize an error function (loss function) or to maximize the efficiency of production.
• Optimizers are mathematical functions that depend on the model's learnable parameters, i.e., weights and biases.
• Optimizers help to know how to change weights and learning rate of neural network to reduce the losses.
• The goal of optimizing a DNN is to find the best parameters w that minimize the loss function f(w, x, y), written as f(w) below for simplicity, where x are the data and y are the labels.
Before we proceed, it’s essential to acquaint yourself with a few terms
• The epoch is the number of times the algorithm iterates over the entire training dataset.
• Batch size refers to the number of samples used for one update of the model parameters.
• A sample is a single record of data in a dataset.
• Learning Rate is a parameter determining the scale of model weight updates
• Weights and Bias are learnable parameters in a model that regulate the signal between two neurons.
1. Gradient Descent:
• Gradient descent is an optimization algorithm based on a convex function
and tweaks its parameters iteratively to minimize a given function to its
local minimum.
• Gradient Descent iteratively reduces a loss function by moving in the
direction opposite to that of steepest ascent.
• It relies on the derivatives (gradients) of the loss function to find a minimum. Batch gradient descent uses the entire training set to calculate the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process.
Advantages of Gradient Descent
• Easy to understand and Easy to implement.
Disadvantages of Gradient Descent
• Because this method calculates the gradient over the entire dataset for each update, it is very slow, requires a large amount of memory, and is computationally expensive.
Learning Rate
• The size of the steps gradient descent takes towards the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.
2. Stochastic Gradient Descent
• It is a variant of Gradient Descent that updates the model parameters one sample at a time. If the dataset has 10,000 samples, SGD updates the parameters 10,000 times per epoch.
• In GD, the gradient is calculated over the entire dataset, but in SGD it is calculated on each data point.
• In a typical illustration of SGD, a red line shows how the algorithm follows the gradient vectors to reach the minimum of the loss function. The zigzag pattern indicates that the algorithm may not always take the most direct path to the minimum, but it will eventually converge.
• The arrows pointing downhill represent the negative gradient direction, which is the direction of steepest descent of the loss function at that point.
Advantages of Stochastic Gradient Descent
• Frequent updates of model parameter
• Requires less Memory.
• Allows the use of large data sets as it has to update only one example at a
time.
Disadvantages of Stochastic Gradient Descent
• The frequent updates can result in noisy gradients, which may cause the error to increase instead of decrease.
• High Variance.
• Frequent updates are computationally expensive.
3. Mini-Batch Gradient Descent
It is a combination of the concepts of SGD and batch gradient descent. It
simply splits the training dataset into small batches and performs an update
for each of those batches. This creates a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent. It can reduce the variance of the parameter updates, leading to more stable convergence. The dataset is typically split into batches of between 50 and 256 examples, chosen at random.
4. SGD with Momentum
• SGD with Momentum is a stochastic optimization method that adds a momentum term
to regular stochastic gradient descent. Momentum simulates the inertia of an object
when it is moving, that is, the direction of the previous update is retained to a certain
extent during the update, while the current update gradient is used to fine-tune the final
update direction. In this way, stability is increased to a certain extent, learning can be faster, and the optimizer has a better chance of escaping poor local optima.
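As a minimal sketch (not from the notes), the momentum update can be written as follows; the decay factor 0.9 is a typical default and the quadratic loss is just a stand-in:

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
lr, beta = 0.1, 0.9              # learning rate and momentum coefficient

for _ in range(100):
    velocity = beta * velocity - lr * grad(w)   # keep a decaying history of past updates
    w = w + velocity                            # move using the accumulated velocity
```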
Advantages of SGD with momentum
• Momentum helps to reduce the noise.
• Exponential Weighted Average is used to smoothen the curve.
Disadvantage of SGD with momentum
• Extra hyperparameter is added.
4. Newer optimization methods for neural networks
5. AdaGrad (Adaptive Gradient Descent)
• In all the algorithms discussed previously, the learning rate remains constant. The intuition behind AdaGrad is to use a different, adaptive learning rate for each neuron in each hidden layer, updated across iterations.
cache_new: This is the running (cumulative) sum of the squared gradients. It helps to adapt the learning rate to the curvature of the loss surface. ε: This is a small constant added to the denominator to prevent division by zero.
Advantages of AdaGrad
• Learning Rate changes adaptively with iterations.
• It is able to train sparse data as well.
Disadvantage of AdaGrad
• If the neural network is deep, the accumulated cache keeps growing, so the effective learning rate becomes a very small number; learning can effectively stop (the dead neuron problem).
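A minimal sketch of the AdaGrad update described above (the toy loss and learning rate are illustrative assumptions):

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
cache = np.zeros_like(w)
lr, eps = 0.5, 1e-8

for _ in range(100):
    g = grad(w)
    cache = cache + g ** 2                    # cumulative sum of squared gradients
    w = w - lr * g / (np.sqrt(cache) + eps)   # per-parameter adaptive step size
```

Because the cache only grows, the effective step size keeps shrinking, which is exactly the weakness noted above.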
6. RMS-Prop (Root Mean Square Propagation)
• RMS-Prop is a modified version of AdaGrad in which the cache is an exponentially decaying (moving) average of the squared gradients instead of their cumulative sum. In effect, it applies a momentum-style weighted average to AdaGrad's squared-gradient cache, which keeps the learning rate from shrinking towards zero.
Advantages of RMS-Prop
• The learning rate gets adjusted automatically, and a different learning rate is chosen for each parameter.
Notation used in the RMS-Prop update rule:
• cache_new: the updated value of the cache, an exponentially weighted average of the squared gradients; a decay factor determines how much weight is given to the previous cache value versus the new gradient calculation.
• ∂Loss/∂W_old: the partial derivative of the loss function with respect to the old weights, i.e., the gradient of the loss function at the previous iteration.
• ^2: squaring of the gradient.
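A minimal sketch of the RMS-Prop update (the decay rate 0.9 is a common default; the toy loss is an assumption):

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
cache = np.zeros_like(w)
lr, decay, eps = 0.01, 0.9, 1e-8

for _ in range(100):
    g = grad(w)
    cache = decay * cache + (1 - decay) * g ** 2   # exponentially weighted average of squared gradients
    w = w - lr * g / (np.sqrt(cache) + eps)        # adaptive per-parameter step
```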
7. Adam (Adaptive Moment Estimation)
• Adam is one of the most popular gradient descent optimization algorithms. It computes adaptive learning rates for each parameter by storing both a decaying average of past gradients (similar to momentum) and a decaying average of past squared gradients (similar to RMS-Prop and Adadelta). Thus, it combines the advantages of both methods.
Advantages of Adam
• Easy to implement
• Computationally efficient.
• Little memory requirements.
• w_t and b_t: the weight vector and bias vector at iteration t.
• w_t-1 and b_t-1: the weight vector and bias vector from the previous iteration.
• η: the learning rate, which controls how much the parameters are updated in each step.
• V_dw_t and V_db_t: the momentum (first-moment) terms for the weights and biases; they are exponentially decaying averages of past gradients, and the corresponding second-moment terms hold decaying averages of past squared gradients.
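A minimal sketch of the Adam update using the quantities described above (β1 = 0.9 and β2 = 0.999 are the commonly used defaults; the toy loss is an assumption):

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
m = np.zeros_like(w)             # first moment: decaying average of gradients (momentum)
v = np.zeros_like(w)             # second moment: decaying average of squared gradients
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```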
5. Second-Order Methods
Note:
• First-order: first-order methods use the gradient, which indicates the direction of steepest change; it is similar to feeling your way along the maze's walls.
• Second-order: second-order methods use the Hessian, which is similar to having a map of the maze; it tells you both the direction and the curvature of the path.
Advantages:
• Faster Convergence: Can converge faster than first-order methods, especially in
cases with poorly conditioned loss surfaces.
• Adaptive Learning Rates: Automatically adjusts learning rates based on
curvature, leading to more efficient optimization.
Disadvantages:
• Computational Complexity: Computing the Hessian or its approximation can be
costly, especially with large datasets and high-dimensional parameter spaces.
• Memory Usage: Storing the Hessian matrix can require substantial memory,
making it less practical for very large networks.
Convolutional Neural Networks (CNN)
6. Introduction to CNN
What are Convolutional Neural Networks (CNNs)?
• Convolutional Neural Networks (CNN) are the most popular and
powerful tools for image processing, classification, and segmentation.
• A convolutional neural network is a deep learning algorithm that can
take an input image, assign significance (weights and biases) to
various objects in the image, enable the differentiation of objects,
and concurrently extract relationships among them.
BASIC STRUCTURE AND FUNCTIONING OF CNN
• CNNs are specifically designed structures for feature extraction and
pattern recognition processes. They generally consist of several
layers:
• The CNN architecture consists of a stacking of three building blocks:
• Convolution layers
• Pooling layers,
• Fully connected (FC) layers
1. Convolution Layer
• The convolutional layer is the first layer of a convolutional network. While
convolutional layers can be followed by additional convolutional layers or
pooling layers, the fully connected layer is the final layer.
• A convolution layer is a key component of the CNN architecture. This layer
helps us perform feature extractions on input data using the convolution
operation. The convolution operation involves performing an element-wise
multiplication between the filter’s weights and the patch of the input image
with the same dimensions. Finally, the resulting output values are added
together.
• This layer forms the essential component of Feature-Extraction.
• By using multiple convolutional layers in succession, a neural network can
detect higher-level objects, people, and even facial expressions.
Convolution operation:
Convolution Kernels(Filters)
• A kernel is a small 2D matrix whose contents are based upon the operations to be performed. A kernel is applied to the input image by simple element-wise multiplication and addition; the output obtained is of lower dimensions and therefore easier to work with.
• The shape of a kernel depends heavily on the input shape of the image and the architecture of the entire network; mostly the size of kernels is (M x M), i.e., a square matrix. The movement of a kernel is always from left to right and top to bottom.
• Here the input matrix has shape 4x4x1 and the kernel is of size 3x3. Since the input is larger than the kernel, we can implement a sliding-window protocol and apply the kernel over the entire input. The first entry in the convolved result is calculated as:
• 45*0 + 12*(-1) + 5*0 + 22*(-1) + 10*5 + 35*(-1) + 88*0 + 26*(-1) + 51*0 = -45
• There are three hyperparameters which affect the volume size of the output that need
to be set before the training of the neural network begins. These include:
• The number of filters affects the depth of the output. For example, three distinct filters would yield
three different feature maps, creating a depth of three.
• Filters are one dimension higher than kernels and can be seen as multiple kernels stacked on each
other where every kernel is for a particular channel. Therefore, for an RGB image of (32x32) we have a
filter of the shape say (5x5x3).
• Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.
• Stride defines the step by which the kernel moves; for example, a stride of 1 makes the kernel slide by one row/column at a time, and a stride of 2 moves the kernel by 2 rows/columns.
Zero-padding is usually used when the filters do not fit the input
image. This sets all elements that fall outside of the input matrix to
zero, producing a larger or equally sized output.
Sliding window protocol:
• The kernel gets into position at the top-left corner of the input matrix.
• Then it starts moving left to right, calculating the dot product and
saving it to a new matrix until it has reached the last column.
• Next, the kernel resets its position at the first column but slides one row down, thus following a left-to-right, top-to-bottom fashion.
• Steps 2 and 3 are repeated till the entire input has been processed.
• For a 3D input matrix the movement of the kernel will be from front to
back, left to right and top to bottom.
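The sliding-window protocol can be written directly in NumPy. The sketch below is illustrative only; the top-left 3x3 patch of the input reproduces the worked example above (giving -45 as the first output entry), while the remaining input values are made up:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):                                  # top to bottom
        for j in range(ow):                              # left to right
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)           # element-wise multiply, then sum
    return out

image = np.array([[45, 12,  5,  7],
                  [22, 10, 35,  9],
                  [88, 26, 51, 14],
                  [30, 40, 20, 11]], dtype=float)
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=float)

print(conv2d_valid(image, kernel)[0, 0])   # -> -45.0
```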
Activation:
• Activation functions are crucial components in neural networks, as
they introduce non-linearity into the model. Every convolution layer is typically followed by an activation.
2. POOLING LAYER:
• Pooling layers, also known as downsampling layers, conduct dimensionality reduction,
reducing the number of parameters in the input. Like the convolutional layer, the pooling
operation sweeps a filter across the entire input, but the difference is that this filter does
not have any weights.
• There are two main types of pooling:
• Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to send
to the output array. As an aside, this approach tends to be used more often compared to average
pooling.
• Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
• They help to reduce complexity, improve efficiency, and limit risk of overfitting.
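A small illustrative NumPy sketch of both pooling variants (the 4x4 feature map is made up):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 6, 8]], dtype=float)

print(pool2d(fm, mode="max"))   # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="avg"))   # [[3.75 2.25] [4. 6.]]
```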
Flattening
• As the name of this step implies, we are literally going to flatten our pooled feature map into a column like in the
image below.
• The reason we do this is that we're going to need to insert this data into an artificial neural network in the next
stage.
• To sum up, here is what we have after we're done with each of the steps that we have covered up until now:
• Input image (starting point)
• Convolutional layer (convolution operation)
• Pooling layer (pooling)
• Input layer for the artificial neural network (flattening)
• Definition: Flattening is a process used to convert multi-dimensional input data (such as images) into a one-
dimensional vector. This step typically follows convolutional and pooling layers.
• Purpose: The main goal of flattening is to prepare the data for the subsequent fully connected layers. Since fully
connected layers expect a 1D input, flattening reshapes the output of the last convolutional or pooling layer into a
1D array.
• Example: For an input tensor of shape (batch_size, height, width, channels), flattening will convert it into a tensor
of shape (batch_size, height * width * channels).
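A two-line illustrative NumPy check of that shape change:

```python
import numpy as np

x = np.random.rand(2, 4, 4, 3)        # (batch_size, height, width, channels)
flat = x.reshape(x.shape[0], -1)      # -> (batch_size, height * width * channels)
print(flat.shape)                     # (2, 48)
```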
3. FULLY CONNECTED LAYER:
• This layer forms the last block of the CNN architecture, related to the
task of classification. This is essentially a Fully connected Simple
Neural Network, consisting of two or three hidden layers and an
output layer, generally implemented using softmax regression, that performs the classification among a large number of categories.
• Notice that in artificial neural networks, we called the layer in the middle a
“hidden layer” whereas in the convolutional context we use the term “fully-
connected layer.”
• The input layer contains the vector of data that was created in the
flattening step. The features that we distilled throughout the previous steps
are encoded in this vector.
• The role of the artificial neural network is to take this data and combine the
features into a wider variety of attributes that make the convolutional
network more capable of classifying images
• SUMMARY OF THE THREE LAYERS: convolution layers perform feature extraction, pooling layers downsample the feature maps, and fully connected layers perform the final classification.
Summary of CNN: Convolutional Neural Networks (CNNs) are a type of deep learning model particularly well-suited for analyzing visual data. Here are the key
steps involved in building and training a CNN:
1. Input Layer: The CNN begins with an input layer that takes in the raw pixel values of the images. The input shape typically has three dimensions: height,
width, and the number of channels (e.g., RGB for color images).
2. Convolutional Layers:
• Convolution Operation:
• Convolutional layers apply convolution operations to the input image. This involves sliding filters (kernels) over the image to extract features like edges, textures, and patterns.
• Each filter detects specific features and produces a feature map.
• Activation Function:
• After convolution, an activation function (commonly ReLU) is applied to introduce non-linearity into the model.
3. Pooling Layers
• Downsampling:
• Pooling layers (e.g., Max Pooling) reduce the spatial dimensions (height and width) of the feature maps while retaining the most important features.
• This helps decrease the computational load, reduce the number of parameters, and mitigate overfitting.
4. Flattening: The output from the convolutional and pooling layers is a multi-dimensional tensor. Flattening converts this tensor into a one-dimensional vector
to prepare it for the fully connected layers.
5. Fully Connected (Dense) Layers
• Neural Network Layers:
• These layers are standard neural network layers that learn complex representations. The flattened output is passed through one or more dense layers.
• Activation functions (like ReLU) are applied in these layers to introduce non-linearity.
• Output Layer:
• The final layer is a fully connected layer with a number of neurons equal to the number of classes in the classification task.
• It typically uses the softmax activation function for multi-class classification to output probabilities for each class.
6. Loss Function
• A loss function (like categorical cross-entropy for multi-class classification) is used to measure the difference between the predicted
output and the true labels. This guides the optimization process during training.
7. Optimization
• An optimizer (e.g., Adam, SGD) adjusts the weights of the network based on the gradients of the loss function with respect to the
weights. This is done through backpropagation, which computes gradients and updates weights to minimize the loss.
8. Training the Model
• The CNN is trained on labeled data through multiple epochs, iterating over the training dataset and updating weights at each step.
• Batch Size: The number of samples processed before updating the model’s weights.
• Learning Rate: Controls how much to change the model in response to the estimated error each time the model weights are updated.
9. Validation and Testing
• During training, a validation set is used to monitor the model’s performance and prevent overfitting.
• After training, the model is evaluated on a test set to assess its accuracy and generalization capability.
10. Prediction
• Once the model is trained, it can be used to make predictions on new, unseen data. The output is the class with the highest probability
from the softmax layer.
11. Post-Processing (Optional)
• Depending on the application, post-processing steps like thresholding or non-maximum suppression might be applied to refine the
model's predictions.
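The layer stack described in steps 1-5 of this summary can be sketched in PyTorch as follows; the input size (3-channel 32x32 images), channel counts, and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                    # activation
            nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # flattening
            nn.Linear(32 * 8 * 8, 128),                   # fully connected layer
            nn.ReLU(),
            nn.Linear(128, num_classes),                  # output layer (logits)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
loss_fn = nn.CrossEntropyLoss()                           # applies softmax internally (step 6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) # optimizer (step 7)
```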
Deep Convolutional Neural Networks
• Deep convolutional neural networks are mainly focused on applications like object
detection, image classification, recommendation systems, and are also sometimes
used for natural language processing.
• The strength of DCNNs is in their layering. A DCNN uses a three-dimensional
neural network to process the Red, Green, and Blue elements of the image at the
same time. This considerably reduces the number of artificial neurons required to
process an image, compared to traditional feed forward neural networks.
• Deep convolutional neural networks receive images as an input and use them to
train a classifier. The network employs a special mathematical operation called a
“convolution” instead of matrix multiplication.
• The architecture of a convolutional network typically consists of four types of
layers: convolution, pooling, activation, and fully connected.
Convolutional Layer
• Applies a convolution filter to the image to detect features of the image. Here is
how this process works:
• A convolution—takes a set of weights and multiplies them with inputs from the
neural network.
• Kernels or filters—during the multiplication process, a kernel (applied for 2D
arrays of weights) or a filter (applied for 3D structures) passes over an image
multiple times. To cover the entire image, the filter is applied from left to right and
from top to bottom.
• Dot or scalar product—a mathematical process performed during the
convolution. Each filter multiplies the weights with different input values. The
total inputs are summed, providing a unique value for each filter position.
ReLU Activation Layer
• The convolution maps are passed through a nonlinear activation layer, such as the Rectified Linear Unit (ReLU), which replaces negative numbers of the filtered images with zeros.
Pooling Layer
• The pooling layers gradually reduce the size of the image, keeping only the most important information. For example,
for each group of 4 pixels, the pixel having the maximum value is retained (this is called max pooling), or only the
average is retained (average pooling).
• Pooling layers help control overfitting by reducing the number of calculations and parameters in the network.
• After several iterations of convolution and pooling layers (in some deep convolutional neural network architectures
this may happen thousands of times), at the end of the network there is a traditional multi layer perceptron or “fully
connected” neural network.
Fully Connected Layer
• In many CNN architectures, there are multiple fully connected layers, with activation and pooling layers in between
them. Fully connected layers receive an input vector containing the flattened pixels of the image, which have been
filtered, corrected and reduced by convolution and pooling layers. The softmax function is applied at the end to the
outputs of the fully connected layers, giving the probability of a class the image belongs to – for example, is it a car, a
boat or an airplane.
What are the Types of Deep Convolutional Neural Networks?
1. R-CNN (Region-based Convolutional Neural Networks)
2. Fast R-CNN
3. GoogLeNet
4. VGGNet (Visual Geometry Group Neural Network)
5. ResNet (Residual Neural Network )
6. PlacesNet
7. Different deep CNN architectures – LeNet, AlexNet, VGG, PlacesNet
1. LeNet:
A series of CNNs designed to recognise hand-written digits.
• LeNet-1 (1989) is the earliest version, developed by Yann LeCun and colleagues.
• Architecture: This initial version consisted of a simple architecture designed for digit recognition, using convolutional and subsampling
layers.
• LeNet-5 (1998): Developer: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
• Architecture: LeNet-5 is the most famous version and has a more complex structure compared to LeNet-1. It includes:
• Input Layer: Receives a 32x32 grayscale image as input.
• Convolutional Layer (C1): Applies six 5x5 convolution filters, resulting in six feature maps of size 28x28. Each filter is responsible
for detecting specific patterns or features in the input image.
• Pooling Layer (S2): Downsamples the feature maps to 14x14 using a 2x2 max pooling operation. This reduces the
dimensionality and computational cost while preserving the most important features.
• Convolutional Layer (C3): Applies 16 5x5 convolution filters to the pooled feature maps, producing 16 feature maps of size 10x10.
• Pooling Layer (S4): Downsamples the feature maps to 5x5 using a 2x2 max pooling operation.
• Fully Connected Layer (F5): Flattens the 16 5x5 feature maps into a 400-dimensional vector and connects it to 120 neurons.
• Fully Connected Layer (F6): Connects the 120 neurons from F5 to 84 neurons.
• Output Layer: Contains 10 neurons, each representing a possible class for handwritten character recognition (0-9).
• Activation Function: Tanh (in original implementation), but modern variants often use ReLU.
LeNet-5 was trained using backpropagation with stochastic gradient descent (SGD). The network was optimized to minimize the cross-
entropy loss between the predicted class probabilities and the true class labels.
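For illustration, a LeNet-5-style network following the layer sizes listed above can be written in PyTorch as below (pooling is shown as max pooling, matching the description in these notes):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32 -> 6 feature maps of 28x28
            nn.Tanh(),
            nn.MaxPool2d(2),                   # S2: -> 6 x 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: -> 16 x 10x10
            nn.Tanh(),
            nn.MaxPool2d(2),                   # S4: -> 16 x 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),        # F5
            nn.Tanh(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # output: 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
out = model(torch.randn(1, 1, 32, 32))   # one 32x32 grayscale image
print(out.shape)                         # torch.Size([1, 10])
```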
2. AlexNet:
AlexNet is a CNN architecture introduced in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It played a pivotal role in revolutionizing the field of computer vision.
Key Features
Depth: AlexNet was significantly deeper than previous CNNs, with eight learned layers: five convolutional layers and three fully connected layers, interleaved with max pooling layers. This depth allowed it to learn more complex features from the input data.
• Batch size of 128
• SGD Momentum is used as a learning algorithm
Activation: AlexNet used rectified linear units (ReLU) as activation functions, which helped to address the
vanishing gradient problem and improve training efficiency compared to traditional sigmoid or tanh functions.
Dropout: To prevent overfitting, AlexNet incorporated dropout, a regularization technique that randomly drops
out neurons during training. This helps to reduce the network's reliance on any individual neuron and improves
its generalization performance.
Data Augmentation: AlexNet employed data augmentation techniques such as random cropping and horizontal
flipping to increase the size of the training dataset and improve the network's robustness to variations in input
data
Architecture : The AlexNet architecture consists of the following layers:
• Input Layer: Receives a 224x224x3 color image as input.
• Convolutional Layer 1: Applies 96 11x11x3 filters with a stride of 4, resulting in 96 feature maps of size 55x55x96.
• Pooling Layer 1: Downsamples the feature maps to 27x27x96 using a 3x3 max pooling operation with a stride of 2.
• Normalization Layer 1: Applies local response normalization (LRN) to normalize the activity of neurons across different feature maps.
• Convolutional Layer 2: Applies 256 5x5x96 filters with a stride of 1, resulting in 256 feature maps of size 27x27x256.
• Pooling Layer 2: Downsamples the feature maps to 13x13x256 using a 3x3 max pooling operation with a stride of 2.
• Normalization Layer 2: Applies local response normalization (LRN) to normalize the activity of neurons across different feature maps.
• Convolutional Layer 3: Applies 384 3x3x256 filters with a stride of 1, resulting in 384 feature maps of size 13x13x384.
• Convolutional Layer 4: Applies 384 3x3x384 filters with a stride of 1, resulting in 384 feature maps of size 13x13x384.
• Convolutional Layer 5: Applies 256 3x3x384 filters with a stride of 1, resulting in 256 feature maps of size 13x13x256.
• Pooling Layer 3: Downsamples the feature maps to 6x6x256 using a 3x3 max pooling operation with a stride of 2.
• Fully Connected Layer 1: Flattens the 6x6x256 feature maps into a 9216-dimensional vector and connects it to 4096 neurons.
• Dropout Layer 1: Applies dropout with a probability of 0.5.
• Fully Connected Layer 2: Connects the 4096 neurons from the previous layer to 4096 neurons.
• Dropout Layer 2: Applies dropout with a probability of 0.5.
• Output Layer: Contains 1000 neurons, each representing a possible class for image classification.
3. VGG:
VGG stands for Visual Geometry Group; it is a standard deep Convolutional Neural Network (CNN) architecture with multiple layers, introduced in 2014 by Karen Simonyan and Andrew Zisserman. The "deep" refers to the number of layers, with VGG-16 and VGG-19 consisting of 16 and 19 weight layers respectively.
• Key Characteristics:
• Uniform Architecture: VGGNet follows a uniform architecture, using only 3x3 convolutional filters
throughout the network. This simplifies the design and makes it easier to train.
• Depth: VGGNet is known for its depth, with networks ranging from 11 to 19 layers. This depth allows the
network to learn more complex features from the input data.
• Max Pooling: The network uses max pooling layers to reduce the spatial dimensions of the feature maps
while preserving the most important information.
• VGGNet introduced several variants with different depths, including:
• VGG-11: 11 layers deep
• VGG-13: 13 layers deep
• VGG-16: 16 layers deep
• VGG-19: 19 layers deep
• The VGG network is constructed with very small convolutional filters. The VGG-16 consists of 13
convolutional layers and three fully connected layers.
• Let’s take a brief look at the architecture of VGG:
• Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition, the
creators of the model cropped out the center 224×224 patch in each image to keep the input size
of the image consistent.
• Convolutional Layers: VGG’s convolutional layers leverage a minimal receptive field, i.e., 3×3, the
smallest possible size that still captures up/down and left/right. Moreover, there are also 1×1
convolution filters acting as a linear transformation of the input. This is followed by a ReLU unit,
which is a huge innovation from AlexNet that reduces training time. ReLU stands for rectified
linear unit activation function; it is a piecewise linear function that will output the input if positive;
otherwise, the output is zero. The convolution stride is fixed at 1 pixel to keep the spatial
resolution preserved after convolution (stride is the number of pixel shifts over the input matrix).
• Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually leverage
Local Response Normalization (LRN) as it increases memory consumption and training time.
Moreover, it makes no improvements to overall accuracy.
• Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the three layers, the
first two have 4096 channels each, and the third has 1000 channels, 1 for each class.
4. PlacesNet:
PlacesNet is a deep convolutional neural network (CNN) architecture specifically designed for scene
recognition. It was introduced in 2014. It was trained on the Places database, a large-scale scene-centric
dataset with 205 natural scene categories, rather than on ImageNet. The authors showed that the deep
features from PlacesNet are more effective for recognizing natural scenes than deep features from CNNs
trained on ImageNet.
Key Features:
• Scene Recognition: PlacesNet is optimized for recognizing the scene or environment depicted in an image.
It can identify categories such as indoor, outdoor, city, mountain, beach, and more.
• Large-Scale Dataset: PlacesNet was trained on the Places dataset, a large-scale collection of images from
various scenes. This extensive training data helped the network learn robust features for scene recognition.
• Deep Architecture: PlacesNet employs a deep CNN architecture with multiple convolutional and pooling
layers, allowing it to capture complex visual patterns and relationships.
• Fine-Tuning: The network can be fine-tuned on specific scene datasets to improve performance for
particular applications.
• Key components of the PlacesNet architecture:
• Input Layer: Receives a 256x256x3 color image as input.
• Convolutional Layers: Multiple convolutional layers with 3x3 filters are
used to extract features from the input image. Each layer is followed by a
ReLU activation function.
• Pooling Layers: Max pooling layers are used to reduce the spatial
dimensions of the feature maps while preserving the most important
information.
• Fully Connected Layers: Fully connected layers are used to combine the
extracted features into a single vector.
• Output Layer: A softmax layer is used to classify the input image into one
of the predefined scene categories.
8. Training CNNs: Weight initialization, batch normalization, hyperparameter optimization
Steps in Training a CNN
• The steps to train a Convolutional Neural Network (CNN) include:
• Prepare the dataset: Collect a labeled dataset and preprocess the images. Split the
dataset into training and test data.
• Design the CNN architecture: The CNN has several layers, including:
• Convolutional layer: An essential building block of the CNN, with learnable filters (kernels).
• Pooling layer: A sliding-window technique that downsamples and generalizes lower-level features.
• ReLU layer: Improves the nonlinearity of the image's pixel data.
• Output layer: The final layer, with neurons equal to the number of classes. It provides the
likelihood of the input image belonging to a particular class.
• Train the model: Use an optimization algorithm to train the model.
• Define a loss function: Use it to calculate the training and validation loss.
• Experiment with hyperparameters: Monitor progress and fine-tune if needed.
• Evaluate the model's performance: Choose the model with the lowest loss.
1. Weights initialization:
Weight initialization is a fundamental aspect of training neural networks. It significantly impacts the convergence speed
and overall performance of the model. By carefully selecting the initialization method, we can help prevent common
issues like vanishing or exploding gradients, which can hinder the learning process.
Key Initialization Techniques
1. Zero Initialization:
• is a simple yet often ineffective technique for initializing weights in a neural network. It involves setting all weights to zero at the
beginning of training.
• Not recommended: the network becomes symmetric (all neurons compute the same thing) and also suffers from the vanishing gradient problem.
Example:
Consider a simple neural network with one hidden layer and one output layer. Let's assume that all weights are
initialized to zero:
Input layer: x
Hidden layer: h = W1 * x + b1
Output layer: y = W2 * h + b2
If all weights (W1, W2, b1, and b2) are initialized to zero, then:
h=0*x+0=0
y=0*0+0=0
No matter what the input is, the output will always be zero. Moreover, every neuron receives an identical (here, zero) gradient, so the weights never become different from one another and the network cannot learn any useful patterns.
2. Random Initialization:
In an attempt to overcome the shortcomings of Zero or Constant Initialization, random initialization assigns
random values (other than zero) as weights to the neuron paths. However, when weights are assigned purely at random, problems such as overfitting, the vanishing gradient problem, or the exploding gradient problem might occur.
• Random Initialization can be of two kinds:
• Random Normal
• Random Uniform
a) Random Normal: The weights are initialized from values in a normal distribution.
b) Random Uniform: The weights are initialized from values in a uniform distribution.
3. Xavier/Glorot Initialization
In Xavier/Glorot weight initialization, the weights are assigned from values
of a uniform distribution as follows:
4. Normalized Xavier/Glorot Initialization
• In Normalized Xavier/Glorot weight initialization, the weights are
assigned from values of a normal distribution as follows:
• Xavier/Glorot Initialization, too, is suitable for layers where the
activation function used is Sigmoid.
5. He Uniform Initialization
• In He Uniform weight initialization, the weights are assigned from
values of a uniform distribution as follows:
• He Uniform Initialization is suitable for layers where ReLU activation
function is used.
6. He Normal Initialization
In He Normal weight initialization, the weights are assigned from values of a normal distribution as follows:
He Normal Initialization, too, is suitable for layers where the ReLU activation function is used.
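A brief illustrative sketch of these initializers using PyTorch's built-in init functions (the layer sizes are arbitrary; the functions implement the commonly used Glorot and He formulas):

```python
import torch.nn as nn

layer_sigmoid = nn.Linear(256, 128)   # a layer followed by a sigmoid activation
layer_relu = nn.Linear(256, 128)      # a layer followed by a ReLU activation

# Xavier/Glorot initialization (uniform or normal) - suited to sigmoid/tanh layers
nn.init.xavier_uniform_(layer_sigmoid.weight)
# nn.init.xavier_normal_(layer_sigmoid.weight)          # normal-distribution variant

# He/Kaiming initialization (uniform or normal) - suited to ReLU layers
nn.init.kaiming_uniform_(layer_relu.weight, nonlinearity="relu")
# nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")

# Biases are commonly initialized to zero
nn.init.zeros_(layer_sigmoid.bias)
nn.init.zeros_(layer_relu.bias)
```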
Batch normalization
Batch Normalization is a technique used to improve the training and performance
of neural networks, particularly CNNs.
Batch normalization is a technique to improve the training of DNN by stabilizing and
accelerating the learning process.
Introduced by Sergey Ioffe and Christian Szegedy in 2015, it addresses the issue
known as “internal covariate shift” where the distribution of each layer’s inputs
changes during training, as the parameters of the previous layers change.
Batch normalization is a process that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer performs standardizing and normalizing operations on the input of a layer coming from a previous layer.
Benefits: it addresses internal covariate shift, improves gradient flow, has a regularization effect, speeds up learning, and allows higher learning rates.
How Does Batch Normalization Work in CNN?
Batch normalization works in convolutional neural networks (CNNs) by
normalizing the activations of each layer across mini-batch during training.
The working is discussed below:
1. Normalization within Mini-Batch
• In a CNN, each layer receives inputs from multiple channels (feature maps)
and processes them through convolutional filters. Batch Normalization
operates on each feature map separately, normalizing the activations
across the mini-batch.
• During training, batch normalization (BN) standardizes the activations of
each layer by subtracting the mean and dividing by the standard
deviation of each mini-batch.
2. Scaling and Shifting
• After normalization, BN adjusts the normalized activations using learned
scaling and shifting parameters. These parameters enable the network to
adaptively scale and shift the activations, thereby maintaining the
network’s ability to represent complex patterns in the data.
3. Learnable Parameters
The scale parameter gamma (γ) and the shift parameter beta (β) are learned during training through backpropagation. This allows the network to adaptively adjust the normalization and ensure that the activations are in the appropriate range for learning.
4. Applying Batch Normalization
Batch Normalization is typically applied after the convolutional and activation layers
in a CNN, before passing the outputs to the next layer. It can also be applied before
or after the activation function, depending on the network architecture.
5. Training and Inference
During training, Batch Normalization calculates the mean and variance of each mini-
batch. During inference (testing), it uses the aggregated mean and variance
calculated during training to normalize the activations. This ensures consistent
normalization between training and inference.
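A minimal NumPy sketch of the computation described above, covering training-time statistics only (the running averages used at inference are omitted, and γ/β start at their usual defaults of 1 and 0):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                      # mini-batch mean per feature
    var = x.var(axis=0)                        # mini-batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta                # scale and shift (learnable)

x = np.random.randn(8, 4) * 3.0 + 5.0          # a mini-batch of 8 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)          # learnable parameters
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # approx. 0 and 1 per feature
```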
Hyperparameters Optimization:
• Hyperparameters are the parameters that are set before the training process begins and are not
learned from the data. They include things like learning rates, batch sizes, the number of layers,
and the number of neurons in each layer.
• Common hyperparameters include:
• Learning rate: Controls the step size during gradient descent.
• Batch size: The number of samples processed at once during training.
• Number of epochs: The number of times the entire dataset is passed through the network.
• Network architecture: The number of layers, filters, and neurons.
• Regularization: Techniques like L1/L2 regularization and dropout to prevent overfitting.
• The Hyperparameters can be optimized as follows:
1. Manual Search
2. Bayesian Optimization
3. GridSearchCV
4. RandomizedSearchCV
1. Manual Search Hyperparameter Optimization:
Experimenting with different hyperparameters based on domain knowledge. It is simple and easy, but time-consuming and often inefficient, as it may not explore the parameter space comprehensively.
2. Bayesian Optimization
is a probabilistic framework for hyperparameter tuning that leverages Bayesian statistics to efficiently
explore the hyperparameter space. It's particularly effective when dealing with complex, expensive-to-
evaluate functions, such as training deep neural networks. Bayesian optimization is more efficient in time
and memory capacity for tuning many hyperparameters
Steps Involved:
1. Choose a suitable surrogate model (e.g., Gaussian process). Initialize the surrogate model with a small
set of randomly sampled hyperparameter configurations. Evaluate the objective function (e.g.,
validation accuracy) for these initial configurations.
2. Use the acquisition function to determine the next hyperparameter configuration to evaluate. Common
acquisition functions include Expected Improvement, Probability of Improvement, and Entropy Search.
3. Evaluate the objective function for the newly acquired point. Update the surrogate model with the new
data point.
4. Repeat steps 2 and 3 until a stopping criterion is met (e.g., maximum number of iterations or
convergence).
3. Grid-Search:
is a hyperparameter optimization technique that involves exhaustively trying all
combinations of hyperparameters within a specified grid. It's a simple but often
computationally expensive method.
• Steps involved:
• Define Hyperparameter Space: Specify a grid of values for each hyperparameter you want to
optimize.
• Try All Combinations: Train a model for each combination of hyperparameters in the grid.
• Evaluate Performance: Evaluate the performance of each model on a validation set.
• Choose Best Hyperparameters: Select the hyperparameter combination that yields the best
performance.
4. Randomized Search
• Grid Search, discussed above, usually increases computational complexity and is sometimes considered inefficient because it attempts every combination of the given hyperparameters. Randomized Search instead trains models on randomly sampled hyperparameter combinations; as a result, the number of models trained is much smaller than in grid search.
Key steps:
• Define Hyperparameter Space: Specify the range or distribution for each
hyperparameter you want to optimize.
• Randomly Sample Hyperparameters: Generate random combinations of
hyperparameters within the defined space.
• Train Model: Train a model with the sampled hyperparameters.
• Evaluate Performance: Evaluate the model's performance on a validation set using
appropriate metrics.
• Repeat: Repeat steps 2-4 for a specified number of iterations.
Comparison:
• GridSearchCV: a guided, exhaustive search over all combinations in the grid for the best combination.
• RandomizedSearchCV: as the name suggests, no such guidance; it samples combinations at random.
9. Understanding and visualizing CNNs training
Understanding CNN Training
• Training a convolutional neural network (CNN) involves teaching the network to recognize patterns in
data, such as images or audio. This is achieved through a process of iterative optimization, where the
network's weights and biases are adjusted to minimize the error between its predicted outputs and the
true labels.
Key Steps in CNN Training:
• Data Preparation: Collect and preprocess a large dataset of labeled examples.
• Network Architecture: Design a CNN architecture suitable for the task, including the number of layers,
filters, and activation functions.
• Forward Pass: Feed an input sample through the network to compute the predicted output.
• Backward Pass: Calculate the error between the predicted output and the true label.
• Weight Update: Use an optimization algorithm (e.g., stochastic gradient descent) to update the
network's weights and biases based on the calculated gradients.
• Repeat: Iterate through the dataset multiple times (epochs), updating the network's parameters with
each iteration.
Importance of Visualizing a CNN model
1. Preliminary Methods
1.1 Plotting model architecture: The simplest thing you can do is to
print/plot the model. Here, you can also print the shapes of individual
layers of neural network and the parameters in each layer.
1.2 Visualize filters: Another way is to plot the filters of a trained model, so
that we can understand the behaviour of those filters.
• CNN filters can be visualized when we optimize the input image with
respect to the output of the specific convolution operation; this can be done, for example, for the first filter of the first layer of the model above.
2. Activation Maps/ Feature Maps Visualization:
• Feature maps (or activations) show how the output of each convolutional
layer looks after passing through the network.
• Visualizing feature maps helps in understanding how different layers of the
network respond to various input patterns.
3. Gradient Visualization:
• Visualizing gradients can help understand which parameters are changing
during training.
• This can be done using gradient flow or gradient histograms to check for
issues like vanishing or exploding gradients.
10. Regularization methods (dropout, drop connect, batch
normalization).
What is Regularization?
• Regularization is a technique used in machine learning and deep learning
to prevent overfitting and improve the generalization performance of a
model. It involves adding a penalty term to the loss function during
training.
• This penalty discourages the model from becoming too complex or having
large parameter values, which helps in controlling the model’s ability to fit
noise in the training data.
• By applying regularization for deep learning, models become more robust
and better at making accurate predictions on unseen data.
• As we move towards the right in this image, our model tries to learn
too well the details and the noise from the training data, which
ultimately results in poor performance on the unseen data.
• In other words, while going toward the right, the complexity of the
model increases such that the training error reduces but the testing
error doesn’t. This is shown in the image below:
• Regularization to reduce the overfitting
Assume that our regularization coefficient is so high that some of the weight matrices are nearly
equal to zero.
• This will result in a much simpler linear network and slight
underfitting of the training data.
• Such a large value of the regularization coefficient is not that useful.
We need to optimize the value of the regularization coefficient to
obtain a well-fitted model as shown in the image below:
Different Regularization Techniques in Deep Learning
1. L1 & L2 Regularization
• L1 and L2 are the most common types of regularization in deep learning. These
update the general cost function by adding another term known as the
regularization term.
• Due to the addition of this regularization term, the values of weight matrices
decrease because it assumes that a neural network with smaller weight
matrices leads to simpler models. Therefore, it will also reduce overfitting to
quite an extent.
• However, this regularization term differs in L1 and L2.
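In a commonly used formulation (not reproduced from the original slides), the two penalty terms can be written as:

Cost(L2) = Loss + (λ / 2m) · Σ w²
Cost(L1) = Loss + (λ / 2m) · Σ |w|

where λ is the regularization coefficient, m is the number of training samples, and the sums run over all weights. L2 shrinks weights smoothly towards zero, while L1 can drive some weights exactly to zero, producing sparse models.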
2. Dropout
• This is one of the most interesting types of regularization techniques. It also produces very good
results and is consequently the most frequently used regularization technique in the field of
deep learning.
• In Dropout, a random subset of neurons is temporarily excluded or “dropped out” during each
iteration. This helps prevent overfitting by promoting more robust learning and reducing the
reliance on specific neurons.
• Dropout mimics ensemble learning during training by randomly deactivating a subset of
neurons in each iteration, creating diverse network instances. Each instance can be viewed as a
different model.
3. Drop Connect: DropConnect has a similar flavour to dropout. However, instead of randomly dropping individual units (neurons) during training, DropConnect zeroes out some of the values of the weight matrix. This means that for each training iteration, a random subset of connections in the neural network is set to zero.
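A small illustrative NumPy sketch contrasting the two techniques (the drop probability of 0.5 and the layer sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # activations entering a layer
W = rng.normal(size=(3, 4))      # weight matrix of the layer
p = 0.5                          # drop probability

# Dropout: zero out whole units (activations), then rescale (inverted dropout)
unit_mask = rng.random(x.shape) > p
out_dropout = W @ ((x * unit_mask) / (1 - p))

# DropConnect: zero out individual weights instead of whole units
weight_mask = rng.random(W.shape) > p
out_dropconnect = (W * weight_mask) @ x
```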
Thank You