CONVOLUTIONAL NEURAL NETWORK
What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN), also known as ConvNet, is a specialized type of deep learning
algorithm mainly designed for tasks that necessitate object recognition, including image classification,
detection, and segmentation. CNNs are employed in a variety of practical scenarios, such as autonomous
vehicles, security camera systems, and others.
Convolution Neural Network
A Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN), predominantly used to extract features from grid-like matrix datasets such as images.
Layers used to build ConvNets
A complete convolutional neural network architecture is also known as a ConvNet. A ConvNet is a sequence of layers, and every layer transforms one volume into another through a differentiable function.
Types of layers:
Let's take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
Input Layer: This is the layer in which we give input to our model. In a CNN, the input will generally be an image or a sequence of images. This layer holds the raw input image with width 32, height 32, and depth 3.
Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are small matrices, usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we then get an output volume of dimension 32 x 32 x 12. Stride determines the step size at which the filter moves across the input, and padding adds additional border pixels to the input image to control the spatial dimensions of the output feature map.
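As a rough sketch of this step (assuming PyTorch; the random input tensor is purely illustrative), a convolution layer with 12 learnable 3×3 kernels, stride 1, and padding 1 maps a 32 x 32 x 3 image to a 32 x 32 x 12 output volume:

import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)  # one RGB image: (batch, channels, height, width)

# 12 learnable 3x3 kernels; stride 1 and padding 1 keep the spatial size at 32 x 32
conv = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, stride=1, padding=1)

feature_maps = conv(image)
print(feature_maps.shape)  # torch.Size([1, 12, 32, 32]), i.e. a 32 x 32 x 12 volume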
Activation Layer: By applying an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. The activation function is applied element-wise to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, and Leaky ReLU. The volume remains unchanged, so the output volume still has dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted into the ConvNet. Its main function is to reduce the size of the volume, which makes computation faster, reduces memory usage, and also helps prevent overfitting. Two common types of pooling are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resulting volume will have dimension 16 x 16 x 12.
Image source: cs231n.stanford.edu
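As a hedged continuation of the same PyTorch sketch (the feature-map tensor below is random and stands in for the convolution output), ReLU leaves the shape unchanged while a 2 x 2 max pool with stride 2 halves the spatial dimensions:

import torch
import torch.nn as nn

feature_maps = torch.randn(1, 12, 32, 32)   # stand-in for the 32 x 32 x 12 convolution output

activated = nn.ReLU()(feature_maps)          # element-wise max(0, x); shape is unchanged
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(activated)

print(activated.shape)  # torch.Size([1, 12, 32, 32])
print(pooled.shape)     # torch.Size([1, 12, 16, 16]), i.e. a 16 x 16 x 12 volume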
Flattening: After the convolution and pooling layers, the resulting feature maps are flattened into a one-dimensional vector so they can be passed into a fully connected layer for classification or regression.
Fully Connected Layers: These layers take the input from the previous layer and compute the final classification or regression output.
Output Layer: The output from the fully connected layers is then fed into a logistic function appropriate for the classification task, such as sigmoid or softmax, which converts the output for each class into a probability score.
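A minimal sketch of the output side (again assuming PyTorch; the pooled tensor and the choice of 10 classes are illustrative): the 16 x 16 x 12 volume is flattened to a 3072-dimensional vector, passed through a fully connected layer, and turned into class probabilities with softmax:

import torch
import torch.nn as nn

pooled = torch.randn(1, 12, 16, 16)               # stand-in for the pooling-layer output

flattened = torch.flatten(pooled, start_dim=1)    # shape: (1, 12 * 16 * 16) = (1, 3072)
fc = nn.Linear(12 * 16 * 16, 10)                  # fully connected layer for 10 classes
logits = fc(flattened)

probabilities = torch.softmax(logits, dim=1)      # each row sums to 1
print(probabilities.shape)                        # torch.Size([1, 10])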
Overall Process:
1. Input image passes through the convolution layer, applying filters to extract features.
2. The resulting feature map undergoes an activation function (e.g., ReLU) to introduce non-
linearity.
3. Max-pooling layers downsample the feature map by selecting maximum values in local regions.
4. This process of convolution and pooling is typically repeated in a stack of layers to create a deep
CNN architecture.
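Putting the four steps together, a hedged end-to-end sketch (assuming PyTorch; the channel counts and the 10 output classes are illustrative choices, not fixed by the text) of a small stacked ConvNet could look like this:

import torch
import torch.nn as nn

# (conv -> ReLU -> max pool) repeated twice, then flatten -> fully connected
model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),   # 32 x 32 x 3  -> 32 x 32 x 12
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 32 x 32 x 12 -> 16 x 16 x 12
    nn.Conv2d(12, 24, kernel_size=3, padding=1),  # 16 x 16 x 12 -> 16 x 16 x 24
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 16 x 16 x 24 -> 8 x 8 x 24
    nn.Flatten(),
    nn.Linear(24 * 8 * 8, 10),                    # class scores for 10 illustrative classes
)

scores = model(torch.randn(1, 3, 32, 32))         # one random 32 x 32 RGB image
print(scores.shape)                               # torch.Size([1, 10])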
Spatial Convolution and Pooling
The POOL layer performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16 x 16 x 12].
Dropout:
Dropout helps prevent overfitting by randomly nullifying the outputs of neurons during training. This encourages the network to learn redundant representations and hence increases the model's ability to generalize.
Dropout is a regularization technique used in deep neural networks to tackle overfitting: it involves randomly ignoring, or "dropping out", some layer outputs during training so that no units become overly codependent on one another.
Let's try to understand this with a given input x: {1, 2, 3, 4, 5} to a fully connected layer. Suppose we have a dropout layer with probability p = 0.2 (or keep probability = 0.8). During forward propagation in training, 20% of the nodes would be dropped, i.e. x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on. Similarly, dropout is applied to the hidden layers.
For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop probability = 0.5, then roughly 500 neurons would be randomly dropped in every iteration (batch).
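A minimal sketch of this behaviour (assuming PyTorch; the input values are taken from the example above): in training mode nn.Dropout zeroes roughly p of the elements and scales the survivors by 1/(1 − p), while in evaluation mode it passes the input through unchanged:

import torch
import torch.nn as nn

x = torch.tensor([1., 2., 3., 4., 5.])
dropout = nn.Dropout(p=0.2)   # drop probability 0.2, keep probability 0.8

dropout.train()
print(dropout(x))   # e.g. tensor([1.2500, 0.0000, 3.7500, 5.0000, 6.2500]); which entries are zeroed varies per call

dropout.eval()
print(dropout(x))   # tensor([1., 2., 3., 4., 5.]); dropout is disabled at inference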
Understanding Dropout Regularization
Dropout regularization leverages this mechanism during training in deep learning models to specifically address overfitting, which occurs when a model performs well on training data but poorly on new, unseen data.
During training, dropout randomly deactivates a chosen proportion of neurons (and their connections)
within a layer. This essentially temporarily removes them from the network.
The deactivated neurons are chosen at random for each training iteration. This randomness is crucial for
preventing overfitting.
To account for the deactivated neurons, the outputs of the remaining active neurons are scaled up by the inverse of the probability of keeping a neuron active (e.g., if 50% are dropped, the remaining ones are multiplied by 2).
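The scaling described above is what frameworks usually implement as inverted dropout. A hand-written sketch (the function below is illustrative, not a library API):

import torch

def inverted_dropout(x, p=0.5, training=True):
    # Zero each unit with probability p and scale survivors by 1 / (1 - p),
    # so the expected activation is the same with and without dropout.
    if not training or p == 0.0:
        return x
    keep_prob = 1.0 - p
    mask = (torch.rand_like(x) < keep_prob).float()   # 1 keeps a unit, 0 drops it
    return x * mask / keep_prob

activations = torch.ones(1000)
print(inverted_dropout(activations, p=0.5).mean())    # roughly 1.0 on average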
Advantages of Dropout Regularization in Deep Learning
Prevents Overfitting: By randomly disabling neurons, the network cannot overly rely on the specific
connections between them.
Ensemble Effect: Dropout acts like training an ensemble of smaller neural networks with varying
structures during each iteration. This ensemble effect improves the model’s ability to generalize to unseen
data.
Enhancing Data Representation: Dropout methods are used to enhance data representation by
introducing noise, generating additional training samples, and improving the effectiveness of the model
during training.
Drawbacks of Dropout Regularization and How to Mitigate Them
Despite its benefits, dropout regularization in deep learning is not without its drawbacks. Here are some
of the challenges related to dropout and methods to mitigate them:
Longer Training Times: Dropout increases training duration due to random dropout of units in hidden
layers. To address this, consider powerful computing resources or parallelize training where possible.
Optimization Complexity: It is not fully understood why dropout works, which can make optimization challenging. Experiment with dropout rates on a smaller scale before full implementation to fine-tune model performance.
Hyperparameter Tuning: Dropout adds hyperparameters such as the dropout rate, which interact with others like the learning rate and require careful tuning. Use techniques such as grid search or random search to systematically find optimal combinations.
Redundancy with Batch Normalization: Batch normalization can sometimes replace dropout effects.
Evaluate model performance with and without dropout when using batch normalization to determine its
necessity.
Model Complexity: Dropout layers add complexity. Simplify the model architecture where possible,
ensuring each dropout layer is justified by performance gains in validation.
What is Recurrent Neural Network (RNN)?
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other. However, in cases where it is required to predict the next word of a sentence, the previous words are needed, and hence there is a need to remember them. Thus RNNs came into existence, solving this issue with the help of a hidden state. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence. The state is also referred to as the memory state, since it remembers the previous input to the network. An RNN uses the same parameters for each input, as it performs the same task on all inputs or hidden layers to produce the output. This reduces the number of parameters, unlike other neural networks.
How does an RNN differ from a Feedforward Neural Network?
Artificial neural networks that do not have looping nodes are called feed forward neural networks.
Because all information is only passed forward, this kind of neural network is also referred to as a multi-
layer neural network.
Information moves from the input layer to the output layer – if any hidden layers are present –
unidirectionally in a feedforward neural network. These networks are appropriate for image classification
tasks, for example, where input and output are independent. Nevertheless, their inability to retain previous
inputs automatically renders them less useful for sequential data analysis.
Recurrent Neuron and RNN Unfolding
The fundamental processing unit in a Recurrent Neural Network (RNN) is the recurrent unit (sometimes loosely called a recurrent neuron). This unit has the unique ability to maintain a hidden state, allowing the network to capture sequential dependencies by remembering previous inputs while processing. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) variants improve the RNN's ability to handle long-term dependencies.
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
One to One
One to Many
Many to One
Many to Many
One to One
This type of RNN behaves the same as any simple neural network; it is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several time steps, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this problem is language translation, where we provide multiple words from one language as input and predict multiple words in the second language as output.
Recurrent Neural Network Architecture
How does RNN work?
The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step.
Each unit has an internal state which is called the hidden state of the unit. This hidden state signifies the
past knowledge that the network currently holds at a given time step. This hidden state is updated at every
time step to signify the change in the knowledge of the network about the past. The hidden state is
updated using the following recurrence relation:
The formula for calculating the current state:
ht=f(ht−1,xt)
where,
ht -> current state
ht-1 -> previous state
xt -> input state
Formula for applying the activation function (tanh):
ht = tanh(Whh·ht−1 + Wxh·xt)
where,
Whh -> weight at the recurrent neuron
Wxh -> weight at the input neuron
The formula for calculating the output:
yt = Why·ht
where,
yt -> output
Why -> weight at the output layer
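A hedged sketch of these two formulas (assuming PyTorch; the sizes are arbitrary and the weight matrices Wxh, Whh, Why are random stand-ins for learned parameters):

import torch

input_size, hidden_size, output_size = 4, 3, 2

Wxh = torch.randn(hidden_size, input_size)    # weight at the input neuron
Whh = torch.randn(hidden_size, hidden_size)   # weight at the recurrent neuron
Why = torch.randn(output_size, hidden_size)   # weight at the output layer

def rnn_step(x_t, h_prev):
    h_t = torch.tanh(Whh @ h_prev + Wxh @ x_t)  # ht = tanh(Whh·ht-1 + Wxh·xt)
    y_t = Why @ h_t                             # yt = Why·ht
    return h_t, y_t

h = torch.zeros(hidden_size)                    # initial hidden state
for x_t in torch.randn(5, input_size):          # a toy sequence of 5 time steps
    h, y = rnn_step(x_t, h)                     # the same weights are reused at every step
print(h, y)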
Issues of Standard RNNs
Vanishing Gradient: Text generation, machine translation, and stock market prediction are just a few examples of the time-dependent and sequential data problems that can be modelled with recurrent neural networks. However, as gradients are propagated back through many time steps they tend to shrink toward zero, which makes training RNNs difficult.
Exploding Gradient: An Exploding Gradient occurs when a neural network is being trained and the
slope tends to grow exponentially rather than decay. Large error gradients that build up during
training lead to very large updates to the neural network model weights, which is the source of this
issue.
These parameters are updated using backpropagation. However, since an RNN works on sequential data, we use an extended version of backpropagation known as Backpropagation Through Time (BPTT).
Vanishing Gradient
Vanishing gradients occur when, as backpropagation proceeds, the gradients get smaller and smaller, gradually approaching zero. This leaves the weights of the initial or lower layers nearly unchanged, so gradient descent never converges to the optimum.
Exploding gradients are the opposite of vanishing gradients: the gradients keep getting larger, which causes very large weight updates and makes gradient descent diverge. Exploding gradients arise from the weights in the neural network, not from the activation function.
The weights in the lower layers of the neural network are more likely to be affected by exploding gradients because their associated gradients are products of more terms. This makes the gradients of the lower layers more unstable, causing the algorithm to diverge.
How can we identify it?
Identifying the vanishing gradient problem typically involves monitoring the training dynamics of a deep
neural network.
One key indicator is observing model weights converging to 0 or stagnation in the improvement of the
model’s performance metrics over training epochs.
During training, if the loss function fails to decrease significantly, or if there is erratic behavior in the
learning curves, it suggests that the gradients may be vanishing.
Additionally, examining the gradients themselves during backpropagation can provide insights.
Visualization techniques, such as gradient histograms or norms, can aid in assessing the distribution of
gradients throughout the network.
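One hedged way to inspect this in practice (assuming PyTorch; the model and batch below are placeholders) is to print per-parameter gradient norms after a backward pass; norms drifting toward zero in the earlier layers suggest vanishing gradients, while very large norms suggest exploding gradients:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))  # placeholder network
x, target = torch.randn(8, 10), torch.randn(8, 1)                      # placeholder batch

loss = nn.MSELoss()(model(x), target)
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())  # tiny values -> vanishing, huge values -> exploding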
LSTM networks are the most commonly used variant of Recurrent Neural Networks (RNNs). The critical components of the LSTM are the memory cell and the gates (including the forget gate and the input gate); the inner contents of the memory cell are modulated by the input and forget gates.
The Logic Behind LSTM
These three parts of an LSTM unit are known as gates. They control the flow of information into and out of the memory cell, or LSTM cell. The first gate is called the Forget gate, the second gate is the Input gate, and the last one is the Output gate. An LSTM unit consisting of these three gates and a memory cell can be thought of as a layer of neurons in a traditional feedforward neural network, with each unit maintaining a hidden state and a cell state.
Just like a simple RNN, an LSTM also has a hidden state where H(t-1) represents the hidden state of the
previous timestamp and Ht is the hidden state of the current timestamp. In addition to that, LSTM also
has a cell state represented by C(t-1) and C(t) for the previous and current timestamps, respectively.
Here the hidden state is known as Short term memory, and the cell state is known as Long term memory.
Forget Gate
In a cell of the LSTM network, the first step is to decide whether we should keep the information from the previous time step or forget it. The equation for the forget gate is:
ft = sigmoid(Xt·Uf + Ht−1·Wf)
Let's try to understand the equation. Here,
Xt: input at the current timestamp
Uf: weight matrix associated with the input
Ht−1: the hidden state of the previous timestamp
Wf: the weight matrix associated with the hidden state
A sigmoid function is applied to this sum, which makes ft a number between 0 and 1. This ft is later multiplied with the cell state of the previous timestamp.
Input Gate
The input gate is used to quantify the importance of the new information carried by the input. The equation of the input gate is:
it = sigmoid(Xt·Ui + Ht−1·Wi)
Here,
Xt: input at the current timestamp t
Ui: weight matrix of the input
Ht−1: the hidden state at the previous timestamp
Wi: weight matrix of the input associated with the hidden state
Again, a sigmoid function is applied, so the value of it at timestamp t will be between 0 and 1.
New Information
Now, the new information that needs to be passed to the cell state is a function of the hidden state at the previous timestamp t−1 and the input x at timestamp t:
Nt = tanh(Xt·Uc + Ht−1·Wc)
The activation function here is tanh, so the value of the new information lies between −1 and 1. If the value of Nt is negative, the information is subtracted from the cell state, and if the value is positive, the information is added to the cell state at the current timestamp.
However, Nt is not added directly to the cell state. Instead, the cell state is updated as:
Ct = ft·Ct−1 + it·Nt
Here, Ct−1 is the cell state at the previous timestamp, and the others are the values we have calculated previously.
Output Gate
The output gate is again a sigmoid of the current input and the previous hidden state:
Ot = sigmoid(Xt·Uo + Ht−1·Wo)
Its value will also lie between 0 and 1 because of the sigmoid function. Now, to calculate the current hidden state, we use Ot and the tanh of the updated cell state:
Ht = Ot·tanh(Ct)
It turns out that the hidden state is a function of the long-term memory (Ct) and the current output. If you need the output of the current timestamp, just apply the softmax activation to the hidden state Ht.
Here the token with the maximum score in the output is the prediction.
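Pulling the gate equations together, here is a hedged single-step sketch (assuming PyTorch; all weight matrices are random stand-ins for learned parameters, and the candidate weights Uc and Wc follow the same naming pattern as the gates described above):

import torch

input_size, hidden_size = 4, 3

def rand(*shape):
    return torch.randn(*shape) * 0.1

# Stand-ins for learned parameters: U* act on x_t, W* act on h_{t-1}
Uf, Wf = rand(hidden_size, input_size), rand(hidden_size, hidden_size)  # forget gate
Ui, Wi = rand(hidden_size, input_size), rand(hidden_size, hidden_size)  # input gate
Uc, Wc = rand(hidden_size, input_size), rand(hidden_size, hidden_size)  # candidate (new information)
Uo, Wo = rand(hidden_size, input_size), rand(hidden_size, hidden_size)  # output gate

def lstm_step(x_t, h_prev, c_prev):
    f_t = torch.sigmoid(Uf @ x_t + Wf @ h_prev)   # how much of C_{t-1} to keep
    i_t = torch.sigmoid(Ui @ x_t + Wi @ h_prev)   # how much new information to admit
    n_t = torch.tanh(Uc @ x_t + Wc @ h_prev)      # candidate new information, in (-1, 1)
    c_t = f_t * c_prev + i_t * n_t                # updated cell state (long-term memory)
    o_t = torch.sigmoid(Uo @ x_t + Wo @ h_prev)   # output gate
    h_t = o_t * torch.tanh(c_t)                   # hidden state (short-term memory)
    return h_t, c_t

h, c = torch.zeros(hidden_size), torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):            # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c)
print(h, c)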
LSTM vs RNN
Architecture: LSTM (Long Short-Term Memory) is a type of RNN with additional memory cells; a standard RNN is the basic form.
Memory Retention: LSTM handles long-term dependencies and mitigates the vanishing gradient problem; a standard RNN struggles with long-term dependencies and the vanishing gradient problem.
Cell Structure: LSTM has a complex cell structure with input, output, and forget gates; a standard RNN has a simple cell structure with only a hidden state.
Handling Sequences: LSTM is suitable for processing sequential data; a standard RNN is also designed for sequential data, but with limited memory.
Training Efficiency: LSTM trains more slowly due to its increased complexity; a standard RNN trains faster due to its simpler architecture.
Performance on Long Sequences: LSTM performs better on long sequences; a standard RNN struggles to retain information over long sequences.
Usage: LSTM is best suited for tasks requiring long-term memory, such as language translation and sentiment analysis; a standard RNN is appropriate for simple sequential tasks, such as time series forecasting.
Vanishing Gradient Problem: LSTM addresses the vanishing gradient problem; a standard RNN is prone to it.
Tensorboard
TensorBoard is an open-source toolkit that enables us to understand training progress and improve model performance by tuning hyperparameters. The TensorBoard toolkit displays a dashboard where logs can be visualized as graphs, images, histograms, embeddings, text, etc. It also helps in tracking information like gradients, losses, metrics, and intermediate outputs. The arcgis.learn module integrates the TensorBoard toolkit into the model training process, which makes it possible for us to monitor the training process.
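As a generic sketch of the kind of logging TensorBoard consumes (assuming PyTorch's torch.utils.tensorboard here rather than the arcgis.learn integration; the log directory and metric values are illustrative placeholders):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")   # illustrative log directory

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)                    # placeholder metric values
    valid_loss = 1.2 / (epoch + 1)
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/valid", valid_loss, epoch)

writer.close()
# The dashboard can then be launched with:  tensorboard --logdir runs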