UNIT 2 - Neural Networks & DL
Syllabus:
Deep neural networks (DNNs):
1. Difficulty of training DNNs,
2. Greedy layerwise training,
3. Optimization for training DNNs,
4. Newer optimization methods for neural networks (AdaGrad, RMSProp, Adam),
5. Second-order methods
What are Deep Neural Networks (DNNs)?
Deep neural networks (DNNs) are a class of artificial neural networks (ANNs) that have many layers of hidden units between the input and output layers. They are the core models of deep learning, which is itself a branch of machine learning. The term "deep" refers to the depth of the network, i.e., the number of hidden layers.
Key Characteristics of DNN:
• Multiple Hidden Layers: DNNs have multiple hidden layers, which allows
them to model intricate relationships within the data.
• Hierarchical Feature Learning: Each layer in a DNN learns to transform its
input into a more abstract representation. For example, in image
processing, lower layers might detect edges, while higher layers detect
shapes or objects.
• Non-linearity: DNNs use activation functions (like ReLU, sigmoid, or tanh)
to introduce non-linearity into the network, allowing it to model complex
functions.
Training DNN: Training a model refers to the process of teaching a model to
make predictions or decisions based on data. The goal is to enable the model
to learn patterns, relationships, and features from the data so that it can
perform a specific task, such as classification, regression, or clustering,
effectively on new, unseen data.
How to Train DNNs:
1. Define Loss Function:
Select an appropriate loss function that measures how well the model's predictions match the target values (e.g., cross-entropy loss for classification, mean squared error for regression).
2. Choose an Optimizer:
Select an optimization algorithm to minimize the loss function (e.g., Stochastic Gradient Descent (SGD), Adam, RMSprop).
3. Set Hyperparameters:
• Learning Rate: Determine the learning rate, which controls the step size in the optimization
process.
• Batch Size: Choose the number of samples processed before the model is updated.
• Epochs: Specify the number of times the entire dataset will be passed through the model.
4. Training Loop:
Forward Pass: Pass input data through the model to obtain predictions.
Compute Loss: Calculate the loss between predictions and actual values.
Backward Pass: Perform backpropagation to compute gradients of the loss with respect to the
model’s parameters.
Update Weights: Adjust the model’s weights using the optimizer based on computed gradients.
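The four training-loop steps above can be made concrete with a short, purely illustrative sketch. The following PyTorch snippet (the toy data, model size, and hyperparameter values are assumptions, not taken from these notes) shows one forward pass, loss computation, backward pass, and weight update per mini-batch:

```python
import torch
import torch.nn as nn

# Hypothetical toy data: 100 samples, 4 features, binary labels
X = torch.randn(100, 4)
y = torch.randint(0, 2, (100, 1)).float()

# A small feed-forward network (one hidden layer)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

loss_fn = nn.BCEWithLogitsLoss()                           # 1. loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # 2. optimizer
batch_size, epochs = 10, 20                                # 3. hyperparameters

for epoch in range(epochs):                                # 4. training loop
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        logits = model(xb)                                 # forward pass
        loss = loss_fn(logits, yb)                         # compute loss
        optimizer.zero_grad()
        loss.backward()                                    # backward pass (gradients)
        optimizer.step()                                   # update weights
```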
Example: Simple DNN for predicting whether a student will pass or fail the exam based on study hours.
1. Problem Statement:
Task: Classification (pass or fail).
Input Feature: Number of hours studied.
Output Label: Pass (1) or Fail (0).
2. Data Collection: Refer to the dataset table below.

Hours Studied | Pass (Label)
1             | 0
2             | 0
3             | 0
4             | 1
5             | 1
6             | 1
7             | 1
8             | 1

3. Data Preparation: Data cleaning, normalization, scaling, etc. Split into training (70%) and test (30%) data: 5 samples for training and 3 for testing.
4. Model Selection: Binary classifier( Logistic Regression)
5. Train the model:
• Loss Function: For logistic regression, the loss function is Logistic
Loss (Binary Cross-Entropy).
• Choose the Optimizer: Gradient Descent will be used to
minimize the loss function.
• Set Hyperparameters:
Learning Rate: Set a learning rate (e.g., 0.01).
Epochs: Set the number of iterations for training (e.g., 1000
epochs)
• Training Loop:
• Forward Pass: Calculate the predicted probability of passing based on the number of hours
studied.
• Compute Loss: Calculate the loss between the predicted probabilities and actual labels.
• Backward Pass: Compute gradients of the loss function with respect to the model parameters.
• Update Weights: Adjust model parameters (weights) using the gradients.
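For illustration only (not part of the original example), the whole loop can be written from scratch in NumPy. The sketch below trains on all eight samples from the dataset table for brevity, whereas the notes suggest a 5/3 train/test split; the learning rate and epoch count follow the values suggested above:

```python
import numpy as np

# Dataset from the table: hours studied -> pass (1) / fail (0)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
label = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

w, b = 0.0, 0.0          # model parameters (weight and bias)
lr, epochs = 0.01, 1000  # hyperparameters from the example

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):
    p = sigmoid(w * hours + b)             # forward pass: predicted probability of passing
    grad_w = np.mean((p - label) * hours)  # gradient of binary cross-entropy w.r.t. w
    grad_b = np.mean(p - label)            # gradient w.r.t. b
    w -= lr * grad_w                       # update weights
    b -= lr * grad_b

print("P(pass | 4.5 hours studied) =", sigmoid(w * 4.5 + b))
```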
6. Validation and Hyperparameter Tuning: Monitor and adjust
hyperparameters as needed.
7. Model Evaluation: Assess performance using metrics like accuracy
on the test set.
8. Deployment: Save and deploy the model for making predictions.
9. Model Maintenance: Monitor and retrain the model as needed.
1. Difficulty of training DNN
Training a model can be challenging due to various difficulties that may arise
throughout the process. The most common problems and their typical solutions are explained below.
1. Overfitting
Overfitting occurs when a model learns to perform well on the training data but does not generalize well to unseen data. This can be caused by having too many parameters, training for too long, or not having enough training data.
• Solutions:
• Regularization: Techniques such as L1 and L2 regularization can help prevent overfitting by
penalizing large weights in the model.
• Dropout: Randomly dropping out neurons during training can help prevent the model from
relying too heavily on any single neuron.
• Early stopping: Stop training when the validation loss does not improve for a certain
number of epochs to prevent the model from fitting the noise in the data.
• Data augmentation: Artificially increase the size of the dataset by creating new training examples through transformations such as rotation, scaling, and flipping.
2. Underfitting
Underfitting occurs when a model fails to capture the underlying structure of the data,
resulting in poor performance on both the training and validation sets. This can be caused by
an overly simplistic model architecture or insufficient training.
Solutions:
• Increase model complexity: Add more layers or neurons to the model to increase its
capacity to learn complex patterns.
• Train longer: Train the model for more epochs to allow it to learn the underlying structure
of the data.
• Experiment with optimization algorithms: Try different optimization algorithms, such as
Adam, RMSprop, or Adagrad, to improve model convergence.
3. Vanishing and Exploding Gradients
Vanishing gradients occur when the gradients during backpropagation become too
small, making it difficult for the model to learn. Exploding gradients, on the other
hand, occur when the gradients become too large, leading to unstable training.
Solutions:
• Weight initialization: Use techniques like Xavier or He initialization to set the initial
weights of the model.
• Activation functions: Choose appropriate activation functions, such as ReLU or ELU, to prevent gradients from vanishing or exploding.
• Batch normalization: Normalize the inputs to each layer to ensure they have a
consistent mean and variance, which can help stabilize the gradients.
4. Slow Training
Training deep learning models can be time-consuming, especially for large
models and datasets.
Solutions:
• Mini-batch gradient descent: Train the model on smaller batches of data to
speed up the training process.
• Parallelization: Utilize multi-core CPUs, GPUs, or TPUs to parallelize the training process.
• Distributed training: Train the model across multiple devices or machines to
speed up training.
5. Insufficient or Imbalanced Data
Having a small or imbalanced dataset can lead to poor model performance.
Solutions:
• Data augmentation: Create new examples by applying transformations to the
existing data.
• Oversampling or undersampling: Balance the dataset by oversampling minority classes or undersampling majority classes.
• Transfer learning: Utilize pre-trained models to leverage knowledge from
similar tasks or domains.
6. Hyperparameter Tuning
Selecting the optimal hyperparameters for a model can be challenging and time-
consuming.
Solutions:
• Grid search: Search for the best hyperparameters by exhaustively trying all possible
combinations within a predefined range.
• Random search: Sample random combinations of hyperparameters within a
predefined range.
• Bayesian optimization: Use a probabilistic model to guide the search for optimal
hyperparameters, focusing on promising regions of the search space.
7. Model Architecture Selection
Choosing the right model architecture for a specific problem can be difficult and may
require experimentation with different architectures and layer configurations.
Solutions:
• Start with well-known architectures: Use proven architectures, such as CNNs for
image classification or LSTMs for sequence data, as a starting point.
• Experiment with different layer configurations: Try adding or removing layers and
adjusting layer sizes to find the optimal architecture.
• Use architecture search algorithms: Leverage techniques like neural architecture search (NAS) to automatically discover the best model architecture for a given problem.
2. Greedy Layerwise Training
Greedy Layerwise Training is a technique used in machine learning, particularly in deep learning, to
train deep neural networks layer by layer. It's a bottom-up approach where each layer is trained
separately, starting with the first layer and gradually adding subsequent layers.
Processes Involved:
• Initialize the first layer: The first layer is typically a simple layer, like a linear layer or a convolutional
layer. It's initialized with random weights.
• Train the first layer: This layer is trained using a supervised learning algorithm, such as
backpropagation, on the input data.
• Add the next layer: Once the first layer is trained, a new layer is added on top of it. The weights of
this new layer are also initialized randomly.
• Train the second layer: The entire network (now with two layers) is trained using backpropagation,
but the weights of the first layer are fixed. This means that the second layer is learning to transform
the outputs of the first layer into the desired target.
• Repeat: This process is repeated until all the desired layers have been added and trained.
Why Greedy Layerwise Training?
• Efficiency: It can be more efficient than training the entire network at once, especially for
deep networks.
• Debugging: It can help in identifying and fixing issues in individual layers.
• Initialization: It can provide a good initialization for the weights of the network.
Limitations:
• Suboptimal solutions: Greedy Layerwise Training may not always find the globally optimal
solution.
• Difficulty with deep networks: As the network becomes deeper, it can become more
challenging to train each layer effectively.
Applications:
• Deep belief networks: Greedy Layerwise Training is commonly used to train deep belief
networks, a type of unsupervised learning model.
• Stacked autoencoders: It can also be used to train stacked autoencoders, another
unsupervised learning model.
(Diagrams: the layer-by-layer training process, illustrated for stacked autoencoders and deep belief networks.)
Example: Training a Deep Belief Network using Greedy Layerwise Training
• Deep Belief Networks (DBNs) are a type of generative model that can learn
complex patterns in data. They are often trained using Greedy Layerwise
Training.
• Steps:
• Initialize the network: Start with a stack of Restricted Boltzmann Machines (RBMs). Each
RBM has a visible layer and a hidden layer.
• Train the first RBM: Use unsupervised learning (e.g., contrastive divergence) to train the
first RBM. The visible layer is connected to the input data, and the hidden layer learns to
represent the underlying patterns.
• Use the first RBM's hidden layer as input for the second RBM: The hidden layer
activations from the first RBM become the input for the visible layer of the second RBM.
• Train the second RBM: Repeat the training process for the second RBM.
• Continue adding and training RBMs: Continue this process until all RBMs in the DBN are
trained.
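The worked example above uses RBMs trained with contrastive divergence, which takes more code; the sketch below illustrates the same greedy layer-wise idea with the stacked-autoencoder variant mentioned earlier (all sizes, epoch counts, and learning rates are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical data: 256 samples with 784 features (e.g., flattened 28x28 images)
X = torch.rand(256, 784)

layer_sizes = [784, 256, 64]     # input size followed by the stacked hidden-layer sizes
trained_encoders = []
inputs = X

for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    encoder = nn.Sequential(nn.Linear(n_in, n_hid), nn.Sigmoid())
    decoder = nn.Linear(n_hid, n_in)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Train this single autoencoder layer; previously trained layers stay frozen.
    for _ in range(50):
        recon = decoder(encoder(inputs))
        loss = loss_fn(recon, inputs)
        opt.zero_grad()
        loss.backward()
        opt.step()

    trained_encoders.append(encoder)
    inputs = encoder(inputs).detach()   # hidden activations become the input for the next layer
```

After pretraining, the stacked encoders can be combined with an output layer and fine-tuned end-to-end with backpropagation.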
3. Optimization for Training DNNs
What is optimization?
Process of minimizing the error and maximizing the performance
What is an optimizer?
• Optimizers are algorithms or methods used to minimize an error function (loss function) or to maximize the efficiency of production.
• Optimizers are mathematical functions that depend on the model's learnable parameters, i.e., weights and biases.
• Optimizers help to know how to change weights and learning rate of neural network to reduce the losses.
• The goal of optimizing a DNN is to find the best parameters w that minimize the loss function f(w, x, y), written as f(w) below for simplicity, where x are the data and y are the labels.
Before we proceed, it’s essential to acquaint yourself with a few terms
• The epoch is the number of times the algorithm iterates over the entire training dataset.
• Batch size refers to the number of samples used for one update of the model parameters.
• A sample is a single record of data in a dataset.
• Learning Rate is a parameter determining the scale of model weight updates
• Weights and Bias are learnable parameters in a model that regulate the signal between two neurons.
1. Gradient Descent:
• Gradient descent is an optimization algorithm based on a convex function
and tweaks its parameters iteratively to minimize a given function to its
local minimum.
• Gradient Descent iteratively reduces a loss function by moving in the
direction opposite to that of steepest ascent.
• It relies on the derivatives (gradients) of the loss function to find a minimum. Batch gradient descent uses the entire training set to calculate the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process.
Advantages of Gradient Descent
• Easy to understand and Easy to implement.
Disadvantages of Gradient Descent
• Because this method calculates the gradient over the entire dataset for each update, it is very slow, requires a large amount of memory, and is computationally expensive.
Learning Rate
• The size of the steps gradient descent takes towards the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.
2. Stochastic Gradient Descent
• It is a variant of Gradient Descent that updates the model parameters one sample at a time. If the dataset has 10,000 samples, SGD updates the parameters 10,000 times per epoch.
• In GD, the gradient is calculated over the entire dataset, but in SGD it is calculated on each data point.
• In a typical illustration of SGD, a red line shows how the algorithm follows the gradient vectors to reach the minimum of the loss function. The zigzag pattern indicates that the algorithm may not always take the most direct path to the minimum, but it will eventually converge.
• The arrows pointing downhill represent the negative gradient direction, which is the direction of steepest descent of the loss function at that point.
Advantages of Stochastic Gradient Descent
• Frequent updates of model parameter
• Requires less Memory.
• Allows the use of large data sets as it has to update only one example at a
time.
Disadvantages of Stochastic Gradient Descent
• The frequent updates can result in noisy gradients, which may cause the error to increase instead of decrease.
• High Variance.
• Frequent updates are computationally expensive.
3. Mini-Batch Gradient Descent
It is a combination of the concepts of SGD and batch gradient descent. It
simply splits the training dataset into small batches and performs an update
for each of those batches. This creates a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent. It can reduce the variance of the parameter updates, leading to more stable convergence. The dataset is typically split into batches of between 50 and 256 examples, chosen at random.
4. SGD with Momentum
• SGD with Momentum is a stochastic optimization method that adds a momentum term
to regular stochastic gradient descent. Momentum simulates the inertia of an object
when it is moving, that is, the direction of the previous update is retained to a certain
extent during the update, while the current update gradient is used to fine-tune the final
update direction. In this way, stability is increased to a certain extent, learning can be faster, and the optimizer has a better chance of escaping poor local optima.
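As a minimal sketch (not from the notes), the momentum update can be written as follows; the decay factor 0.9 is a typical default and the quadratic loss is just a stand-in:

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
lr, beta = 0.1, 0.9              # learning rate and momentum coefficient

for _ in range(100):
    velocity = beta * velocity - lr * grad(w)   # keep a decaying history of past updates
    w = w + velocity                            # move using the accumulated velocity
```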
Advantages of SGD with momentum
• Momentum helps to reduce the noise.
• Exponential Weighted Average is used to smoothen the curve.
Disadvantage of SGD with momentum
• Extra hyperparameter is added.
4. Newer optimization methods for neural networks
5. AdaGrad (Adaptive Gradient Descent)
• In all the algorithms discussed previously, the learning rate remains constant. The intuition behind AdaGrad is to use a different, adaptive learning rate for each neuron in each hidden layer, updated across iterations.
cache_new: This is the running (cumulative) sum of the squared gradients. It helps to adapt the learning rate to the curvature of the loss surface. ε: This is a small constant added to the denominator to prevent division by zero.
Advantages of AdaGrad
• Learning Rate changes adaptively with iterations.
• It is able to train sparse data as well.
Disadvantage of AdaGrad
• If the neural network is deep, the accumulated cache keeps growing, so the effective learning rate becomes a very small number; learning can effectively stop (the dead neuron problem).
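A minimal sketch of the AdaGrad update described above (the toy loss and learning rate are illustrative assumptions):

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
cache = np.zeros_like(w)
lr, eps = 0.5, 1e-8

for _ in range(100):
    g = grad(w)
    cache = cache + g ** 2                    # cumulative sum of squared gradients
    w = w - lr * g / (np.sqrt(cache) + eps)   # per-parameter adaptive step size
```

Because the cache only grows, the effective step size keeps shrinking, which is exactly the weakness noted above.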
6. RMS-Prop (Root Mean Square Propagation)
• RMS-Prop is a modified version of AdaGrad in which the cache is an exponentially decaying (moving) average of the squared gradients instead of their cumulative sum. In effect, it applies a momentum-style weighted average to AdaGrad's squared-gradient cache, which keeps the learning rate from shrinking towards zero.
Advantages of RMS-Prop
• The learning rate gets adjusted automatically, and a different learning rate is chosen for each parameter.
Notation used in the RMS-Prop update rule:
• cache_new: the updated value of the cache, an exponentially weighted average of the squared gradients; a decay factor determines how much weight is given to the previous cache value versus the new gradient calculation.
• ∂Loss/∂W_old: the partial derivative of the loss function with respect to the old weights, i.e., the gradient of the loss function at the previous iteration.
• ^2: squaring of the gradient.
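A minimal sketch of the RMS-Prop update (the decay rate 0.9 is a common default; the toy loss is an assumption):

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
cache = np.zeros_like(w)
lr, decay, eps = 0.01, 0.9, 1e-8

for _ in range(100):
    g = grad(w)
    cache = decay * cache + (1 - decay) * g ** 2   # exponentially weighted average of squared gradients
    w = w - lr * g / (np.sqrt(cache) + eps)        # adaptive per-parameter step
```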
7. Adam (Adaptive Moment Estimation)
• Adam is one of the most popular gradient descent optimization algorithms. It computes adaptive learning rates for each parameter by storing both a decaying average of past gradients (similar to momentum) and a decaying average of past squared gradients (similar to RMS-Prop and Adadelta). Thus, it combines the advantages of both methods.
Advantages of Adam
• Easy to implement
• Computationally efficient.
• Little memory requirements.
• w_t and b_t: the weight vector and bias vector at iteration t.
• w_t-1 and b_t-1: the weight vector and bias vector from the previous iteration.
• η: the learning rate, which controls how much the parameters are updated in each step.
• V_dw_t and V_db_t: the momentum (first-moment) terms for the weights and biases; they are exponentially decaying averages of past gradients, and the corresponding second-moment terms hold decaying averages of past squared gradients.
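A minimal sketch of the Adam update using the quantities described above (β1 = 0.9 and β2 = 0.999 are the commonly used defaults; the toy loss is an assumption):

```python
import numpy as np

def grad(w):
    return w                     # hypothetical gradient of the toy loss 0.5 * ||w||^2

w = np.array([5.0, -3.0])
m = np.zeros_like(w)             # first moment: decaying average of gradients (momentum)
v = np.zeros_like(w)             # second moment: decaying average of squared gradients
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```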
5. Second-Order Methods
Note:
• First-order: first-order methods use the gradient, which indicates the direction of steepest change; it is similar to feeling your way along the maze's walls.
• Second-order: second-order methods use the Hessian, which is similar to having a map of the maze; it tells you both the direction and the curvature of the path.
Advantages:
• Faster Convergence: Can converge faster than first-order methods, especially in
cases with poorly conditioned loss surfaces.
• Adaptive Learning Rates: Automatically adjusts learning rates based on
curvature, leading to more efficient optimization.
Disadvantages:
• Computational Complexity: Computing the Hessian or its approximation can be
costly, especially with large datasets and high-dimensional parameter spaces.
• Memory Usage: Storing the Hessian matrix can require substantial memory,
making it less practical for very large networks.
Convolutional Neural Networks (CNN)
6. Introduction to CNN
What are Convolutional Neural Networks (CNNs)?
• Convolutional Neural Networks (CNN) are the most popular and
powerful tools for image processing, classification, and segmentation.
• A convolutional neural network is a deep learning algorithm that can
take an input image, assign significance (weights and biases) to
various objects in the image, enable the differentiation of objects,
and concurrently extract relationships among them.
BASIC STRUCTURE AND FUNCTIONING OF CNN
• CNNs are specifically designed structures for feature extraction and
pattern recognition processes. They generally consist of several
layers:
• The CNN architecture consists of a stacking of three building blocks:
• Convolution layers
• Pooling layers,
• Fully connected (FC) layers
1. Convolution Layer
• The convolutional layer is the first layer of a convolutional network. While
convolutional layers can be followed by additional convolutional layers or
pooling layers, the fully connected layer is the final layer.
• A convolution layer is a key component of the CNN architecture. This layer
helps us perform feature extractions on input data using the convolution
operation. The convolution operation involves performing an element-wise
multiplication between the filter’s weights and the patch of the input image
with the same dimensions. Finally, the resulting output values are added
together.
• This layer forms the essential component of Feature-Extraction.
• By using multiple convolutional layers in succession, a neural network can
detect higher-level objects, people, and even facial expressions.
Convolution operation:
Convolution Kernels(Filters)
• A kernel is a small 2D matrix whose contents are based upon the operations to be performed. A kernel is applied to the input image by simple element-wise multiplication and addition; the output obtained is of lower dimensions and therefore easier to work with.
• The shape of a kernel depends heavily on the input shape of the image and the architecture of the entire network; mostly the size of kernels is (M x M), i.e., a square matrix. The movement of a kernel is always from left to right and top to bottom.
• Here the input matrix has shape 4x4x1 and the kernel is of size 3x3. Since the input is larger than the kernel, we can implement a sliding-window protocol and apply the kernel over the entire input. The first entry in the convolved result is calculated as:
• 45*0 + 12*(-1) + 5*0 + 22*(-1) + 10*5 + 35*(-1) + 88*0 + 26*(-1) + 51*0 = -45
• There are three hyperparameters which affect the volume size of the output that need
to be set before the training of the neural network begins. These include:
• The number of filters affects the depth of the output. For example, three distinct filters would yield
three different feature maps, creating a depth of three.
• Filters are one dimension higher than kernels and can be seen as multiple kernels stacked on each
other where every kernel is for a particular channel. Therefore, for an RGB image of (32x32) we have a
filter of the shape say (5x5x3).
• Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.
• Stride defines the step by which the kernel moves; for example, a stride of 1 makes the kernel slide by one row/column at a time, and a stride of 2 moves the kernel by 2 rows/columns.
Zero-padding is usually used when the filters do not fit the input
image. This sets all elements that fall outside of the input matrix to
zero, producing a larger or equally sized output.
Sliding window protocol:
• The kernel gets into position at the top-left corner of the input matrix.
• Then it starts moving left to right, calculating the dot product and
saving it to a new matrix until it has reached the last column.
• Next, the kernel resets its position at the first column but slides one row down, thus following a left-to-right, top-to-bottom fashion.
• Steps 2 and 3 are repeated till the entire input has been processed.
• For a 3D input matrix the movement of the kernel will be from front to
back, left to right and top to bottom.
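The sliding-window protocol can be written directly in NumPy. The sketch below is illustrative only; the top-left 3x3 patch of the input reproduces the worked example above (giving -45 as the first output entry), while the remaining input values are made up:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):                                  # top to bottom
        for j in range(ow):                              # left to right
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)           # element-wise multiply, then sum
    return out

image = np.array([[45, 12,  5,  7],
                  [22, 10, 35,  9],
                  [88, 26, 51, 14],
                  [30, 40, 20, 11]], dtype=float)
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=float)

print(conv2d_valid(image, kernel)[0, 0])   # -> -45.0
```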
Activation:
• Activation functions are crucial components in neural networks, as
they introduce non-linearity into the model. Every convolution layer is typically followed by an activation.
2. POOLING LAYER:
• Pooling layers, also known as downsampling layers, conduct dimensionality reduction,
reducing the number of parameters in the input. Like the convolutional layer, the pooling
operation sweeps a filter across the entire input, but the difference is that this filter does
not have any weights.
• There are two main types of pooling:
• Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to send
to the output array. As an aside, this approach tends to be used more often compared to average
pooling.
• Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
• They help to reduce complexity, improve efficiency, and limit risk of overfitting.
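A small illustrative NumPy sketch of both pooling variants (the 4x4 feature map is made up):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 6, 8]], dtype=float)

print(pool2d(fm, mode="max"))   # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="avg"))   # [[3.75 2.25] [4. 6.]]
```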
Flattening
• As the name of this step implies, we are literally going to flatten our pooled feature map into a column like in the
image below.
• The reason we do this is that we're going to need to insert this data into an artificial neural network in the next
stage.
• To sum up, here is what we have after we're done with each of the steps that we have covered up until now:
• Input image (starting point)
• Convolutional layer (convolution operation)
• Pooling layer (pooling)
• Input layer for the artificial neural network (flattening)
• Definition: Flattening is a process used to convert multi-dimensional input data (such as images) into a one-
dimensional vector. This step typically follows convolutional and pooling layers.
• Purpose: The main goal of flattening is to prepare the data for the subsequent fully connected layers. Since fully
connected layers expect a 1D input, flattening reshapes the output of the last convolutional or pooling layer into a
1D array.
• Example: For an input tensor of shape (batch_size, height, width, channels), flattening will convert it into a tensor
of shape (batch_size, height * width * channels).
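A two-line illustrative NumPy check of that shape change:

```python
import numpy as np

x = np.random.rand(2, 4, 4, 3)        # (batch_size, height, width, channels)
flat = x.reshape(x.shape[0], -1)      # -> (batch_size, height * width * channels)
print(flat.shape)                     # (2, 48)
```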
3. FULLY CONNECTED LAYER:
• This layer forms the last block of the CNN architecture, related to the
task of classification. This is essentially a Fully connected Simple
Neural Network, consisting of two or three hidden layers and an
output layer, generally implemented using softmax regression, that performs the classification among a large number of categories.
• Notice that in artificial neural networks, we called the layer in the middle a
“hidden layer” whereas in the convolutional context we use the term “fully-
connected layer.”
• The input layer contains the vector of data that was created in the
flattening step. The features that we distilled throughout the previous steps
are encoded in this vector.
• The role of the artificial neural network is to take this data and combine the
features into a wider variety of attributes that make the convolutional
network more capable of classifying images
• SUMMARY OF THE THREE LAYERS: convolution layers perform feature extraction, pooling layers downsample the feature maps, and fully connected layers perform the final classification.
Summary of CNN: Convolutional Neural Networks (CNNs) are a type of deep learning model particularly well-suited for analyzing visual data. Here are the key
steps involved in building and training a CNN:
1. Input Layer: The CNN begins with an input layer that takes in the raw pixel values of the images. The input shape typically has three dimensions: height,
width, and the number of channels (e.g., RGB for color images).
2. Convolutional Layers:
• Convolution Operation:
• Convolutional layers apply convolution operations to the input image. This involves sliding filters (kernels) over the image to extract features like edges, textures, and patterns.
• Each filter detects specific features and produces a feature map.
• Activation Function:
• After convolution, an activation function (commonly ReLU) is applied to introduce non-linearity into the model.
3. Pooling Layers
• Downsampling:
• Pooling layers (e.g., Max Pooling) reduce the spatial dimensions (height and width) of the feature maps while retaining the most important features.
• This helps decrease the computational load, reduce the number of parameters, and mitigate overfitting.
4. Flattening: The output from the convolutional and pooling layers is a multi-dimensional tensor. Flattening converts this tensor into a one-dimensional vector
to prepare it for the fully connected layers.
5. Fully Connected (Dense) Layers
• Neural Network Layers:
• These layers are standard neural network layers that learn complex representations. The flattened output is passed through one or more dense layers.
• Activation functions (like ReLU) are applied in these layers to introduce non-linearity.
• Output Layer:
• The final layer is a fully connected layer with a number of neurons equal to the number of classes in the classification task.
• It typically uses the softmax activation function for multi-class classification to output probabilities for each class.
6. Loss Function
• A loss function (like categorical cross-entropy for multi-class classification) is used to measure the difference between the predicted
output and the true labels. This guides the optimization process during training.
7. Optimization
• An optimizer (e.g., Adam, SGD) adjusts the weights of the network based on the gradients of the loss function with respect to the
weights. This is done through backpropagation, which computes gradients and updates weights to minimize the loss.
8. Training the Model
• The CNN is trained on labeled data through multiple epochs, iterating over the training dataset and updating weights at each step.
• Batch Size: The number of samples processed before updating the model’s weights.
• Learning Rate: Controls how much to change the model in response to the estimated error each time the model weights are updated.
9. Validation and Testing
• During training, a validation set is used to monitor the model’s performance and prevent overfitting.
• After training, the model is evaluated on a test set to assess its accuracy and generalization capability.
10. Prediction
• Once the model is trained, it can be used to make predictions on new, unseen data. The output is the class with the highest probability
from the softmax layer.
11. Post-Processing (Optional)
• Depending on the application, post-processing steps like thresholding or non-maximum suppression might be applied to refine the
model's predictions.
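The layer stack described in steps 1-5 of this summary can be sketched in PyTorch as follows; the input size (3-channel 32x32 images), channel counts, and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                    # activation
            nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # flattening
            nn.Linear(32 * 8 * 8, 128),                   # fully connected layer
            nn.ReLU(),
            nn.Linear(128, num_classes),                  # output layer (logits)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
loss_fn = nn.CrossEntropyLoss()                           # applies softmax internally (step 6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) # optimizer (step 7)
```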
Deep Convolutional Neural Networks
• Deep convolutional neural networks are mainly focused on applications like object
detection, image classification, recommendation systems, and are also sometimes
used for natural language processing.
• The strength of DCNNs is in their layering. A DCNN uses a three-dimensional
neural network to process the Red, Green, and Blue elements of the image at the
same time. This considerably reduces the number of artificial neurons required to
process an image, compared to traditional feed forward neural networks.
• Deep convolutional neural networks receive images as an input and use them to
train a classifier. The network employs a special mathematical operation called a
“convolution” instead of matrix multiplication.
• The architecture of a convolutional network typically consists of four types of
layers: convolution, pooling, activation, and fully connected.
Convolutional Layer
• Applies a convolution filter to the image to detect features of the image. Here is
how this process works:
• A convolution—takes a set of weights and multiplies them with inputs from the
neural network.
• Kernels or filters—during the multiplication process, a kernel (applied for 2D
arrays of weights) or a filter (applied for 3D structures) passes over an image
multiple times. To cover the entire image, the filter is applied from left to right and
from top to bottom.
• Dot or scalar product—a mathematical process performed during the
convolution. Each filter multiplies the weights with different input values. The
total inputs are summed, providing a unique value for each filter position.
ReLU Activation Layer
• The convolution maps are passed through a nonlinear activation layer, such as the Rectified Linear Unit (ReLU), which replaces negative numbers of the filtered images with zeros.
Pooling Layer
• The pooling layers gradually reduce the size of the image, keeping only the most important information. For example,
for each group of 4 pixels, the pixel having the maximum value is retained (this is called max pooling), or only the
average is retained (average pooling).
• Pooling layers help control overfitting by reducing the number of calculations and parameters in the network.
• After several iterations of convolution and pooling layers (in some deep convolutional neural network architectures
this may happen thousands of times), at the end of the network there is a traditional multi layer perceptron or “fully
connected” neural network.
Fully Connected Layer
• In many CNN architectures, there are multiple fully connected layers, with activation and pooling layers in between
them. Fully connected layers receive an input vector containing the flattened pixels of the image, which have been
filtered, corrected and reduced by convolution and pooling layers. The softmax function is applied at the end to the
outputs of the fully connected layers, giving the probability of a class the image belongs to – for example, is it a car, a
boat or an airplane.
What are the Types of Deep Convolutional Neural Networks?
1. R-CNN (Region-based Convolutional Neural Networks)
2. Fast R-CNN
3. GoogLeNet
4. VGGNet (Visual Geometry Group Neural Network)
5. ResNet (Residual Neural Network )
6. PlacesNet
7. Different deep CNN architectures – LeNet, AlexNet, VGG, PlacesNet
1. LeNet:
A series of CNNs designed to recognise hand-written digits.
• LeNet-1 (1989) is the earliest version, developed by Yann LeCun and colleagues.
• Architecture: This initial version consisted of a simple architecture designed for digit recognition, using convolutional and subsampling
layers.
• LeNet-5 (1998): Developer: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
• Architecture: LeNet-5 is the most famous version and has a more complex structure compared to LeNet-1. It includes:
• Input Layer: Receives a 32x32 grayscale image as input.
• Convolutional Layer (C1): Applies six 5x5 convolution filters, resulting in six feature maps of size 28x28. Each filter is responsible
for detecting specific patterns or features in the input image.
• Pooling Layer (S2): Downsamples the feature maps to 14x14 using a 2x2 max pooling operation. This reduces the
dimensionality and computational cost while preserving the most important features.
• Convolutional Layer (C3): Applies 16 5x5 convolution filters to the pooled feature maps, producing 16 feature maps of size 10x10.
• Pooling Layer (S4): Downsamples the feature maps to 5x5 using a 2x2 max pooling operation.
• Fully Connected Layer (F5): Flattens the 16 5x5 feature maps into a 400-dimensional vector and connects it to 120 neurons.
• Fully Connected Layer (F6): Connects the 120 neurons from F5 to 84 neurons.
• Output Layer: Contains 10 neurons, each representing a possible class for handwritten character recognition (0-9).
• Activation Function: Tanh (in original implementation), but modern variants often use ReLU.
LeNet-5 was trained using backpropagation with stochastic gradient descent (SGD). The network was optimized to minimize the cross-
entropy loss between the predicted class probabilities and the true class labels.
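For illustration, a LeNet-5-style network following the layer sizes listed above can be written in PyTorch as below (pooling is shown as max pooling, matching the description in these notes):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32 -> 6 feature maps of 28x28
            nn.Tanh(),
            nn.MaxPool2d(2),                   # S2: -> 6 x 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: -> 16 x 10x10
            nn.Tanh(),
            nn.MaxPool2d(2),                   # S4: -> 16 x 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),        # F5
            nn.Tanh(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # output: 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
out = model(torch.randn(1, 1, 32, 32))   # one 32x32 grayscale image
print(out.shape)                         # torch.Size([1, 10])
```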
2. AlexNet:
AlexNet is a CNN architecture introduced in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It played a pivotal role in revolutionizing the field of computer vision.
Key Features
Depth: AlexNet was significantly deeper than previous CNNs, with eight learned layers: five convolutional layers and three fully connected layers, interleaved with max pooling layers. This depth allowed it to learn more complex features from the input data.
• Batch size of 128
• SGD Momentum is used as a learning algorithm
Activation: AlexNet used rectified linear units (ReLU) as activation functions, which helped to address the
vanishing gradient problem and improve training efficiency compared to traditional sigmoid or tanh functions.
Dropout: To prevent overfitting, AlexNet incorporated dropout, a regularization technique that randomly drops
out neurons during training. This helps to reduce the network's reliance on any individual neuron and improves
its generalization performance.
Data Augmentation: AlexNet employed data augmentation techniques such as random cropping and horizontal
flipping to increase the size of the training dataset and improve the network's robustness to variations in input
data
Architecture : The AlexNet architecture consists of the following layers:
• Input Layer: Receives a 224x224x3 color image as input.
• Convolutional Layer 1: Applies 96 11x11x3 filters with a stride of 4, resulting in 96 feature maps of size 55x55x96.
• Pooling Layer 1: Downsamples the feature maps to 27x27x96 using a 3x3 max pooling operation with a stride of 2.
• Normalization Layer 1: Applies local response normalization (LRN) to normalize the activity of neurons across different feature maps.
• Convolutional Layer 2: Applies 256 5x5x96 filters with a stride of 1, resulting in 256 feature maps of size 27x27x256.
• Pooling Layer 2: Downsamples the feature maps to 13x13x256 using a 3x3 max pooling operation with a stride of 2.
• Normalization Layer 2: Applies local response normalization (LRN) to normalize the activity of neurons across different feature maps.
• Convolutional Layer 3: Applies 384 3x3x256 filters with a stride of 1, resulting in 384 feature maps of size 13x13x384.
• Convolutional Layer 4: Applies 384 3x3x384 filters with a stride of 1, resulting in 384 feature maps of size 13x13x384.
• Convolutional Layer 5: Applies 256 3x3x384 filters with a stride of 1, resulting in 256 feature maps of size 13x13x256.
• Pooling Layer 3: Downsamples the feature maps to 6x6x256 using a 3x3 max pooling operation with a stride of 2.
• Fully Connected Layer 1: Flattens the 6x6x256 feature maps into a 9216-dimensional vector and connects it to 4096 neurons.
• Dropout Layer 1: Applies dropout with a probability of 0.5.
• Fully Connected Layer 2: Connects the 4096 neurons from the previous layer to 4096 neurons.
• Dropout Layer 2: Applies dropout with a probability of 0.5.
• Output Layer: Contains 1000 neurons, each representing a possible class for image classification.
3. VGG:
VGG stands for Visual Geometry Group; it is a standard deep Convolutional Neural Network (CNN) architecture with multiple layers, introduced in 2014 by Karen Simonyan and Andrew Zisserman. The "deep" refers to the number of layers, with VGG-16 and VGG-19 consisting of 16 and 19 weight layers respectively.
• Key Characteristics:
• Uniform Architecture: VGGNet follows a uniform architecture, using only 3x3 convolutional filters
throughout the network. This simplifies the design and makes it easier to train.
• Depth: VGGNet is known for its depth, with networks ranging from 11 to 19 layers. This depth allows the
network to learn more complex features from the input data.
• Max Pooling: The network uses max pooling layers to reduce the spatial dimensions of the feature maps
while preserving the most important information.
• VGGNet introduced several variants with different depths, including:
• VGG-11: 11 layers deep
• VGG-13: 13 layers deep
• VGG-16: 16 layers deep
• VGG-19: 19 layers deep
• The VGG network is constructed with very small convolutional filters. The VGG-16 consists of 13
convolutional layers and three fully connected layers.
• Let’s take a brief look at the architecture of VGG:
• Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition, the
creators of the model cropped out the center 224×224 patch in each image to keep the input size
of the image consistent.
• Convolutional Layers: VGG’s convolutional layers leverage a minimal receptive field, i.e., 3×3, the
smallest possible size that still captures up/down and left/right. Moreover, there are also 1×1
convolution filters acting as a linear transformation of the input. This is followed by a ReLU unit,
which is a huge innovation from AlexNet that reduces training time. ReLU stands for rectified
linear unit activation function; it is a piecewise linear function that will output the input if positive;
otherwise, the output is zero. The convolution stride is fixed at 1 pixel to keep the spatial
resolution preserved after convolution (stride is the number of pixel shifts over the input matrix).
• Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually leverage
Local Response Normalization (LRN) as it increases memory consumption and training time.
Moreover, it makes no improvements to overall accuracy.
• Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the three layers, the
first two have 4096 channels each, and the third has 1000 channels, 1 for each class.
4. PlacesNet:
PlacesNet is a deep convolutional neural network (CNN) architecture specifically designed for scene
recognition. It was introduced in 2014. It was trained on the Places database, a large-scale scene-centric
dataset with 205 natural scene categories, rather than on ImageNet. The authors showed that the deep
features from PlacesNet are more effective for recognizing natural scenes than deep features from CNNs
trained on ImageNet.
Key Features:
• Scene Recognition: PlacesNet is optimized for recognizing the scene or environment depicted in an image.
It can identify categories such as indoor, outdoor, city, mountain, beach, and more.
• Large-Scale Dataset: PlacesNet was trained on the Places dataset, a large-scale collection of images from
various scenes. This extensive training data helped the network learn robust features for scene recognition.
• Deep Architecture: PlacesNet employs a deep CNN architecture with multiple convolutional and pooling
layers, allowing it to capture complex visual patterns and relationships.
• Fine-Tuning: The network can be fine-tuned on specific scene datasets to improve performance for
particular applications.
• Key components of the PlacesNet architecture:
• Input Layer: Receives a 256x256x3 color image as input.
• Convolutional Layers: Multiple convolutional layers with 3x3 filters are
used to extract features from the input image. Each layer is followed by a
ReLU activation function.
• Pooling Layers: Max pooling layers are used to reduce the spatial
dimensions of the feature maps while preserving the most important
information.
• Fully Connected Layers: Fully connected layers are used to combine the
extracted features into a single vector.
• Output Layer: A softmax layer is used to classify the input image into one
of the predefined scene categories.
8. Training CNNs: Weight initialization, batch normalization, hyperparameter optimization
Steps in Training a CNN
• The steps to train a Convolutional Neural Network (CNN) include:
• Prepare the dataset: Collect a labeled dataset and preprocess the images. Split the
dataset into training and test data.
• Design the CNN architecture: The CNN has several layers, including:
• Convolutional layer: An essential building block of the CNN, with learnable filters (kernels).
• Pooling layer: A sliding-window technique that downsamples and generalizes lower-level features.
• ReLU layer: Improves the nonlinearity of the image's pixel data.
• Output layer: The final layer, with neurons equal to the number of classes. It provides the
likelihood of the input image belonging to a particular class.
• Train the model: Use an optimization algorithm to train the model.
• Define a loss function: Use it to calculate the training and validation loss.
• Experiment with hyperparameters: Monitor progress and fine-tune if needed.
• Evaluate the model's performance: Choose the model with the lowest loss.
1. Weights initialization:
Weight initialization is a fundamental aspect of training neural networks. It significantly impacts the convergence speed
and overall performance of the model. By carefully selecting the initialization method, we can help prevent common
issues like vanishing or exploding gradients, which can hinder the learning process.
Key Initialization Techniques
1. Zero Initialization:
• is a simple yet often ineffective technique for initializing weights in a neural network. It involves setting all weights to zero at the
beginning of training.
• Not recommended: the network becomes symmetric (all neurons compute the same thing) and also suffers from the vanishing gradient problem.
Example:
Consider a simple neural network with one hidden layer and one output layer. Let's assume that all weights are
initialized to zero:
Input layer: x
Hidden layer: h = W1 * x + b1
Output layer: y = W2 * h + b2
If all weights (W1, W2, b1, and b2) are initialized to zero, then:
h=0*x+0=0
y=0*0+0=0
No matter what the input is, the output will always be zero. Moreover, every neuron receives an identical (here, zero) gradient, so the weights never become different from one another and the network cannot learn any useful patterns.
2. Random Initialization:
In an attempt to overcome the shortcomings of Zero or Constant Initialization, random initialization assigns
random values (other than zero) as weights to the neuron paths. However, when weights are assigned purely at random, problems such as overfitting, the vanishing gradient problem, or the exploding gradient problem might occur.
• Random Initialization can be of two kinds:
• Random Normal
• Random Uniform
a) Random Normal: The weights are initialized from values in a normal distribution.
b) Random Uniform: The weights are initialized from values in a uniform distribution.
3. Xavier/Glorot Initialization
In Xavier/Glorot weight initialization, the weights are assigned from values
of a uniform distribution as follows:
4. Normalized Xavier/Glorot Initialization
• In Normalized Xavier/Glorot weight initialization, the weights are
assigned from values of a normal distribution as follows:
• Xavier/Glorot Initialization, too, is suitable for layers where the
activation function used is Sigmoid.
5. He Uniform Initialization
• In He Uniform weight initialization, the weights are assigned from
values of a uniform distribution as follows:
• He Uniform Initialization is suitable for layers where ReLU activation
function is used.
6. He Normal Initialization
In He Normal weight initialization, the weights are assigned from values of a normal distribution as follows:
He Normal Initialization, too, is suitable for layers where the ReLU activation function is used.
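A brief illustrative sketch of these initializers using PyTorch's built-in init functions (the layer sizes are arbitrary; the functions implement the commonly used Glorot and He formulas):

```python
import torch.nn as nn

layer_sigmoid = nn.Linear(256, 128)   # a layer followed by a sigmoid activation
layer_relu = nn.Linear(256, 128)      # a layer followed by a ReLU activation

# Xavier/Glorot initialization (uniform or normal) - suited to sigmoid/tanh layers
nn.init.xavier_uniform_(layer_sigmoid.weight)
# nn.init.xavier_normal_(layer_sigmoid.weight)          # normal-distribution variant

# He/Kaiming initialization (uniform or normal) - suited to ReLU layers
nn.init.kaiming_uniform_(layer_relu.weight, nonlinearity="relu")
# nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")

# Biases are commonly initialized to zero
nn.init.zeros_(layer_sigmoid.bias)
nn.init.zeros_(layer_relu.bias)
```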
Batch normalization
Batch Normalization is a technique used to improve the training and performance
of neural networks, particularly CNNs.
Batch normalization is a technique to improve the training of DNN by stabilizing and
accelerating the learning process.
Introduced by Sergey Ioffe and Christian Szegedy in 2015, it addresses the issue
known as “internal covariate shift” where the distribution of each layer’s inputs
changes during training, as the parameters of the previous layers change.
Batch normalization is a process that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer performs standardizing and normalizing operations on the input of a layer coming from a previous layer.
Benefits: it addresses internal covariate shift, improves gradient flow, has a regularization effect, speeds up learning, and allows higher learning rates.
How Does Batch Normalization Work in CNN?
Batch normalization works in convolutional neural networks (CNNs) by
normalizing the activations of each layer across mini-batch during training.
The working is discussed below:
1. Normalization within Mini-Batch
• In a CNN, each layer receives inputs from multiple channels (feature maps)
and processes them through convolutional filters. Batch Normalization
operates on each feature map separately, normalizing the activations
across the mini-batch.
• During training, batch normalization (BN) standardizes the activations of
each layer by subtracting the mean and dividing by the standard
deviation of each mini-batch.
2. Scaling and Shifting
• After normalization, BN adjusts the normalized activations using learned
scaling and shifting parameters. These parameters enable the network to
adaptively scale and shift the activations, thereby maintaining the
network’s ability to represent complex patterns in the data.
3. Learnable Parameters
The scale parameter gamma (γ) and the shift parameter beta (β) are learned during training through backpropagation. This allows the network to adaptively adjust the normalization and ensure that the activations are in the appropriate range for learning.
4. Applying Batch Normalization
Batch Normalization is typically applied after the convolutional and activation layers
in a CNN, before passing the outputs to the next layer. It can also be applied before
or after the activation function, depending on the network architecture.
5. Training and Inference
During training, Batch Normalization calculates the mean and variance of each mini-
batch. During inference (testing), it uses the aggregated mean and variance
calculated during training to normalize the activations. This ensures consistent
normalization between training and inference.
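A minimal NumPy sketch of the computation described above, covering training-time statistics only (the running averages used at inference are omitted, and γ/β start at their usual defaults of 1 and 0):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                      # mini-batch mean per feature
    var = x.var(axis=0)                        # mini-batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta                # scale and shift (learnable)

x = np.random.randn(8, 4) * 3.0 + 5.0          # a mini-batch of 8 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)          # learnable parameters
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # approx. 0 and 1 per feature
```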
Hyperparameters Optimization:
• Hyperparameters are the parameters that are set before the training process begins and are not
learned from the data. They include things like learning rates, batch sizes, the number of layers,
and the number of neurons in each layer.
• Common hyperparameters include:
• Learning rate: Controls the step size during gradient descent.
• Batch size: The number of samples processed at once during training.
• Number of epochs: The number of times the entire dataset is passed through the network.
• Network architecture: The number of layers, filters, and neurons.
• Regularization: Techniques like L1/L2 regularization and dropout to prevent overfitting.
• The Hyperparameters can be optimized as follows:
1. Manual Search
2. Bayesian Optimization
3. GridSearchCV
4. RandomizedSearchCV
1. Manual Search Hyperparameter Optimization:
Experimenting with different hyperparameters based on domain knowledge. It is simple and easy, but time-consuming and often inefficient, as it may not explore the parameter space comprehensively.
2. Bayesian Optimization
is a probabilistic framework for hyperparameter tuning that leverages Bayesian statistics to efficiently
explore the hyperparameter space. It's particularly effective when dealing with complex, expensive-to-
evaluate functions, such as training deep neural networks. Bayesian optimization is more efficient in time
and memory capacity for tuning many hyperparameters
Steps Involved:
1. Choose a suitable surrogate model (e.g., Gaussian process). Initialize the surrogate model with a small
set of randomly sampled hyperparameter configurations. Evaluate the objective function (e.g.,
validation accuracy) for these initial configurations.
2. Use the acquisition function to determine the next hyperparameter configuration to evaluate. Common
acquisition functions include Expected Improvement, Probability of Improvement, and Entropy Search.
3. Evaluate the objective function for the newly acquired point. Update the surrogate model with the new
data point.
4. Repeat steps 2 and 3 until a stopping criterion is met (e.g., maximum number of iterations or
convergence).
3. Grid-Search:
is a hyperparameter optimization technique that involves exhaustively trying all
combinations of hyperparameters within a specified grid. It's a simple but often
computationally expensive method.
• Steps involved:
• Define Hyperparameter Space: Specify a grid of values for each hyperparameter you want to
optimize.
• Try All Combinations: Train a model for each combination of hyperparameters in the grid.
• Evaluate Performance: Evaluate the performance of each model on a validation set.
• Choose Best Hyperparameters: Select the hyperparameter combination that yields the best
performance.
4. Randomized Search
• Grid Search, discussed above, usually increases computational complexity and is sometimes considered inefficient because it attempts every combination of the given hyperparameters. Randomized Search instead trains models on randomly sampled hyperparameter combinations; as a result, the number of models trained is much smaller than in grid search.
Key steps:
• Define Hyperparameter Space: Specify the range or distribution for each
hyperparameter you want to optimize.
• Randomly Sample Hyperparameters: Generate random combinations of
hyperparameters within the defined space.
• Train Model: Train a model with the sampled hyperparameters.
• Evaluate Performance: Evaluate the model's performance on a validation set using
appropriate metrics.
• Repeat: Repeat steps 2-4 for a specified number of iterations.
Comparison:
• GridSearchCV: a guided, exhaustive search over all combinations in the grid for the best combination.
• RandomizedSearchCV: as the name suggests, no such guidance; it samples combinations at random.
9. Understanding and visualizing CNNs training
Understanding CNN Training
• Training a convolutional neural network (CNN) involves teaching the network to recognize patterns in
data, such as images or audio. This is achieved through a process of iterative optimization, where the
network's weights and biases are adjusted to minimize the error between its predicted outputs and the
true labels.
Key Steps in CNN Training:
• Data Preparation: Collect and preprocess a large dataset of labeled examples.
• Network Architecture: Design a CNN architecture suitable for the task, including the number of layers,
filters, and activation functions.
• Forward Pass: Feed an input sample through the network to compute the predicted output.
• Backward Pass: Calculate the error between the predicted output and the true label.
• Weight Update: Use an optimization algorithm (e.g., stochastic gradient descent) to update the
network's weights and biases based on the calculated gradients.
• Repeat: Iterate through the dataset multiple times (epochs), updating the network's parameters with
each iteration.
Importance of Visualizing a CNN model
1. Preliminary Methods
1.1 Plotting model architecture: The simplest thing you can do is to
print/plot the model. Here, you can also print the shapes of individual
layers of neural network and the parameters in each layer.
1.2 Visualize filters: Another way is to plot the filters of a trained model, so
that we can understand the behaviour of those filters.
• CNN filters can be visualized when we optimize the input image with
respect to the output of the specific convolution operation; this can be done, for example, for the first filter of the first layer of the model above.
2. Activation Maps/ Feature Maps Visualization:
• Feature maps (or activations) show how the output of each convolutional
layer looks after passing through the network.
• Visualizing feature maps helps in understanding how different layers of the
network respond to various input patterns.
3. Gradient Visualization:
• Visualizing gradients can help understand which parameters are changing
during training.
• This can be done using gradient flow or gradient histograms to check for
issues like vanishing or exploding gradients.
10. Regularization methods (dropout, drop connect, batch
normalization).
What is Regularization?
• Regularization is a technique used in machine learning and deep learning
to prevent overfitting and improve the generalization performance of a
model. It involves adding a penalty term to the loss function during
training.
• This penalty discourages the model from becoming too complex or having
large parameter values, which helps in controlling the model’s ability to fit
noise in the training data.
• By applying regularization for deep learning, models become more robust
and better at making accurate predictions on unseen data.
• As we move towards the right in this image, our model tries to learn
too well the details and the noise from the training data, which
ultimately results in poor performance on the unseen data.
• In other words, while going toward the right, the complexity of the
model increases such that the training error reduces but the testing
error doesn’t. This is shown in the image below:
• Regularization to reduce the overfitting
Assume that our regularization coefficient is so high that some of the weight matrices are nearly
equal to zero.
• This will result in a much simpler linear network and slight
underfitting of the training data.
• Such a large value of the regularization coefficient is not that useful.
We need to optimize the value of the regularization coefficient to
obtain a well-fitted model as shown in the image below:
Different Regularization Techniques in Deep Learning
1. L1 & L2 Regularization
• L1 and L2 are the most common types of regularization in deep learning. These
update the general cost function by adding another term known as the
regularization term.
• Due to the addition of this regularization term, the values of weight matrices
decrease because it assumes that a neural network with smaller weight
matrices leads to simpler models. Therefore, it will also reduce overfitting to
quite an extent.
• However, this regularization term differs in L1 and L2.
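In a commonly used formulation (not reproduced from the original slides), the two penalty terms can be written as:

Cost(L2) = Loss + (λ / 2m) · Σ w²
Cost(L1) = Loss + (λ / 2m) · Σ |w|

where λ is the regularization coefficient, m is the number of training samples, and the sums run over all weights. L2 shrinks weights smoothly towards zero, while L1 can drive some weights exactly to zero, producing sparse models.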
2. Dropout
• This is one of the most interesting types of regularization techniques. It also produces very good
results and is consequently the most frequently used regularization technique in the field of
deep learning.
• In Dropout, a random subset of neurons is temporarily excluded or “dropped out” during each
iteration. This helps prevent overfitting by promoting more robust learning and reducing the
reliance on specific neurons.
• Dropout mimics ensemble learning during training by randomly deactivating a subset of
neurons in each iteration, creating diverse network instances. Each instance can be viewed as a
different model.
3. Drop Connect: DropConnect has a similar flavour to dropout. However, instead of randomly dropping individual units (neurons) during training, DropConnect zeroes out some of the values of the weight matrix. This means that for each training iteration, a random subset of connections in the neural network is set to zero.
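A small illustrative NumPy sketch contrasting the two techniques (the drop probability of 0.5 and the layer sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # activations entering a layer
W = rng.normal(size=(3, 4))      # weight matrix of the layer
p = 0.5                          # drop probability

# Dropout: zero out whole units (activations), then rescale (inverted dropout)
unit_mask = rng.random(x.shape) > p
out_dropout = W @ ((x * unit_mask) / (1 - p))

# DropConnect: zero out individual weights instead of whole units
weight_mask = rng.random(W.shape) > p
out_dropconnect = (W * weight_mask) @ x
```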
Thank You