Deep Learning
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as
input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other,
but in cases when it is required to predict the next word of a sentence, the previous words are required and hence
there is a need to remember the previous words. Thus RNN came into existence, which solved this issue with the
help of a Hidden Layer. The main and most important feature of RNN is its Hidden state, which remembers
some information about a sequence. The state is also referred to as Memory State since it remembers the previous
input to the network. It uses the same parameters for each input as it performs the same task on all the inputs or
hidden layers to produce the output. This reduces the complexity of parameters, unlike other neural networks.
Below are some examples of RNN architectures that can help you better understand this.
One To One: There is only one input-output pair here. A one-to-one architecture is used in traditional neural networks.
One To Many: A single input in a one-to-many network can result in numerous outputs.
Many To One: In this scenario, a single output is produced by combining many inputs from distinct time
steps. Sentiment analysis and emotion identification use such networks, in which the class label is
predicted from a whole input sequence.
Many To Many: For many to many, there are numerous options, and the number of inputs need not match the number of outputs. Machine
translation systems, such as English-to-French (or vice versa) translation systems, use many-to-many
networks.
Advantages of RNNs:
Handle sequential data effectively, including text, speech, and time series.
Remember previous inputs through the hidden (memory) state and share the same parameters across time steps.
Disadvantages of RNNs:
Suffer from vanishing and exploding gradients when trained over long sequences, which makes long-range dependencies hard to learn.
Process inputs step by step, which makes the computation hard to parallelize.
1. One-to-One RNN
The above diagram represents the structure of the Vanilla Neural Network. It is used to solve general machine
learning problems that have only one input and output.
Example: classification of images.
2. One-to-Many RNN:
A single input and several outputs describe a one-to-many Recurrent Neural Network. The above diagram is an
example of this.
Example: In image captioning, an image is fed in and a sentence of words describing it is generated.
3. Many-to-One RNN:
This RNN creates a single output from the given series of inputs.
Example: Sentiment analysis is one of the examples of this type of network, in which a text is identified as
expressing positive or negative feelings.
4. Many-to-Many RNN:
This RNN maps a sequence of inputs to a sequence of outputs.
Example: Machine translation, where a sentence in one language is translated into another language.
Structure of LSTM (Long Short-Term Memory):
LSTM has a chain structure that contains four neural networks and different memory blocks called cells.
Information is retained by the cells and the memory manipulations are done by the gates. There are three gates –
1. Forget Gate: The information that is no longer useful in the cell state is removed with the forget gate. Two
inputs, x_t (input at the particular time) and h_t-1 (previous hidden state), are fed to the gate and multiplied with
weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which
gives an output between 0 and 1. If for a particular element of the cell state the output is close to 0, that piece of information is forgotten; for
an output close to 1, the information is retained for future use. The equation for the forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
where:
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the current input and the previous hidden state.
b_f is the bias with the forget gate.
σ is the sigmoid activation function.
2. Input gate: The addition of useful information to the cell state is done by the input gate. First, the information
is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using
the inputs h_t-1 and x_t. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and
contains all the candidate values from h_t-1 and x_t. Finally, the values of the vector and the regulated values are
multiplied to obtain the useful information, which is added to the cell state. The equations for the input gate and the cell-state update are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t
where
⊙ denotes element-wise multiplication
tanh is tanh activation function
3. Output gate: The task of extracting useful information from the current cell state to be presented as output is
done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is
regulated using the sigmoid function, which filters the values to be remembered using the inputs h_t-1 and x_t. Finally,
the values of the vector and the regulated values are multiplied and sent as the output and as the input to the next
cell. The equation for the output gate is:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
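To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the equations above plus the standard hidden-state update h_t = o_t ⊙ tanh(C_t); the weight matrices, biases, and dimensions are assumed to be initialized elsewhere, so this is an illustrative sketch rather than a full implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # One LSTM time step; every gate acts on the concatenation [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)      # forget gate
    i_t = sigmoid(W_i @ z + b_i)      # input gate
    c_hat = np.tanh(W_c @ z + b_c)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat  # element-wise cell-state update
    o_t = sigmoid(W_o @ z + b_o)      # output gate
    h_t = o_t * np.tanh(c_t)          # new hidden state
    return h_t, c_t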
Advantages of LSTM
1. Long-term dependencies can be captured by LSTM networks. They have a memory cell that is
capable of long-term information storage.
2. In traditional RNNs, there is a problem of vanishing and exploding gradients when models are
trained over long sequences. By using a gating mechanism that selectively recalls or forgets
information, LSTM networks deal with this problem.
3. LSTM enables the model to capture and remember important context, even when there is a
significant time gap between relevant events in the sequence. So wherever understanding context is
important, LSTMs are used, e.g., in machine translation.
Disadvantages of LSTM
1. Compared to simpler architectures like feed-forward neural networks, LSTM networks are
computationally more expensive. This can limit their scalability for large-scale datasets or
resource-constrained environments.
2. Training LSTM networks can be more time-consuming compared to simpler models due to their
computational complexity. So training LSTMs often requires more data and longer training times to
achieve high performance.
3. Since it is processed word by word in a sequential manner, it is hard to parallelize the work of
processing the sentences.
Encoder Decoder architectures
In this architecture, the input data is first fed through what’s called as an encoder network. The
encoder network maps the input data into a numerical representation that captures the important
information from the input. Thee numerical representation of the input data is also called as hidden
state. The numerical representation (hidden state) is then fed into what’s called as the decoder
network. The decoder network generates the output by generating one element of the output sequence at
a time. The following picture represents the encoder decoder architecture as explained here. Note that
both input and output sequence of data can be of varying length as shown in the picture below.
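As an illustration, the following PyTorch-style sketch outlines a minimal sequence-to-sequence encoder-decoder. The embedding sizes, hidden sizes, and the use of a GRU are illustrative assumptions, not a prescribed design.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of token ids
        _, hidden = self.rnn(self.embed(src))
        return hidden                          # the hidden state handed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):
        # produces the output sequence one element at a time (teacher-forced here)
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden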
A popular form of neural network architecture called the autoencoder is a type of encoder-decoder
architecture. An autoencoder is a type of neural network architecture that uses an encoder to
compress an input into a lower-dimensional representation, and a decoder to reconstruct the
original input from the compressed representation. It is primarily used for unsupervised learning
and data compression. The other types of encoder-decoder architecture can be used for supervised
learning tasks, such as machine translation, image captioning, and speech recognition. In this
architecture, the encoder maps the input to a fixed-length representation, which is then passed to the
decoder to generate the output. So while the encoder-decoder architecture and autoencoder have similar
components, their main purposes and applications differ.
Recursive Neural Networks
What Is a Recursive Neural Network?
Deep Learning is a subfield of machine learning and artificial intelligence (AI) that attempts to imitate how the
human brain processes data and gains certain knowledge. Neural Networks form the backbone of Deep Learning.
These are loosely modeled after the human brain and designed to accurately recognize underlying patterns in a data
set. If you want to predict the unpredictable, Deep Learning is the solution.
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed and structured
information. With RvNN, you can get a structured prediction by recursively applying the same set of weights on
structured inputs. The word recursive indicates that the neural network is applied to its output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The tree structure
means combining child nodes and producing parent nodes. Each child-parent bond has a weight matrix, and similar
children have the same weights. The number of children for every node in the tree is fixed to enable it to perform
recursive operations and use the same weights. RvNNs are used when there's a need to parse an entire sentence.
Recurrent Neural Network vs. Recursive Neural Networks
Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for
processing sequential data. They are closely related to the Recursive Neural Network.
Recurrent Neural Networks represent temporal sequences, which is why they find application in Natural
Language Processing (NLP), since language-related data like sentences and paragraphs are sequential in
nature. Recurrent networks are usually chain structures. The weights are shared across the chain length,
keeping the dimensionality constant.
On the other hand, Recursive Neural Networks operate on hierarchical data models due to their tree
structure. There are a fixed number of children for each node in the tree so that it can execute recursive
operations and use the same weights for each step. Child representations are combined into parent
representations.
Recurrent Networks are recurrent over time, forming a chain; a recurrent network can therefore be seen as a special
case of a recursive network, i.e. recursive networks are a generalization of recurrent networks.
A branch of machine learning and artificial intelligence (AI) known as "deep learning" aims
to replicate how the human brain analyses information and learns certain concepts. Deep
Learning's foundation is made up of neural networks. These are intended to precisely
identify underlying patterns in a data collection and are roughly modelled after the human
brain. Deep Learning provides the answer to the problem of predicting the unpredictable.
A subset of deep neural networks called recursive neural networks (RvNNs) are capable of
learning organized and detailed data. By repeatedly using the same set of weights on
structured inputs, RvNN enables you to obtain a structured prediction. Recursive refers to
the neural network's application to its output.
Recursive neural networks are capable of handling hierarchical data because of their in-depth
tree-like structure. In a tree structure, parent nodes are created by joining child nodes.
There is a weight matrix for every child-parent bond, and comparable children have the
same weights. To allow for recursive operations and the use of the same weights, the
number of children for each node in the tree is fixed. When it's necessary to parse a whole
sentence, RvNNs are employed.
We add the products of the weight matrices (W_i) and the children's representations (C_i) and apply the transformation f
to determine the parent node's representation:
parent = f(Σ_i W_i · C_i)
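A tiny NumPy sketch of this composition for a binary tree node, assuming tanh as the transformation f (the dimensions and values are toy choices):

import numpy as np

def compose_parent(children, weights, bias):
    # parent = f(sum_i W_i @ C_i + b), with f = tanh
    pre_activation = sum(W @ c for W, c in zip(weights, children)) + bias
    return np.tanh(pre_activation)

# example: a node with two 4-dimensional children
c_left, c_right = np.random.randn(4), np.random.randn(4)
W_left, W_right = np.random.randn(4, 4), np.random.randn(4, 4)
parent = compose_parent([c_left, c_right], [W_left, W_right], np.zeros(4))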
Unit IV
Autoencoders
1. Undercomplete autoencoders
An undercomplete autoencoder takes in an image and tries to predict the same image as output, thus reconstructing the image from the compressed bottleneck region.
Undercomplete autoencoders are truly unsupervised as they do not take any form of label; the target is simply the input itself.
The primary use of autoencoders like such is the generation of the latent space or the bottleneck, which forms a
compressed substitute of the input data and can be easily decompressed back with the help of the network when
needed.
This form of compression in the data can be modeled as a form of dimensionality reduction.
When we think of dimensionality reduction, we tend to think of methods like PCA (Principal Component Analysis)
that form a lower-dimensional hyperplane to represent higher-dimensional data without losing too much
information.
However—
PCA can only build linear relationships. As a result, it is put at a disadvantage compared with methods like
undercomplete autoencoders that can learn non-linear relationships and, therefore, perform better in
dimensionality reduction.
This form of nonlinear dimensionality reduction, where the autoencoder learns a non-linear manifold, is also
termed manifold learning. If only linear activations are used, however, we reduce the undercomplete autoencoder to something that works on an equal footing with PCA.
The loss function used to train an undercomplete autoencoder is called reconstruction loss, as it is a check of
how well the image has been reconstructed from the input data.
Although the reconstruction loss can be anything depending on the input and output, we will use an L1 loss to
depict the term (also called the norm loss), represented by:
L(x, x̂) = |x − x̂|
where x̂ represents the predicted output and x represents the ground truth.
As the loss function has no explicit regularisation term, the only way to ensure that the model is not
memorising the input data is by regulating the size of the bottleneck and the number of hidden layers within the network.
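A minimal PyTorch-style sketch of an undercomplete autoencoder trained with an L1 reconstruction loss; the layer widths, bottleneck size, and toy batch are illustrative assumptions.

import torch
import torch.nn as nn

class UndercompleteAE(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # the bottleneck is smaller than the input, forcing compression
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck_dim))
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = UndercompleteAE()
criterion = nn.L1Loss()                           # reconstruction (norm) loss |x - x_hat|
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)                           # toy batch standing in for flattened images
loss = criterion(model(x), x)                     # the target is the input itself
optimizer.zero_grad(); loss.backward(); optimizer.step()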
2. Sparse autoencoders
Sparse autoencoders are similar to undercomplete autoencoders in that they use the same image as input and as the reconstruction target. However, the encoding is regulated differently: rather than through the size of the bottleneck alone, the
sparse autoencoder is regulated by changing the number of (active) nodes at each hidden layer.
Since it is not possible to design a neural network that has a flexible number of nodes at its hidden layers, sparse autoencoders instead work by penalizing the activations of neurons within the hidden layers.
In other words, the loss function has a term that calculates the number of neurons that have been activated and provides a penalty proportional to that number.
This penalty, called the sparsity function, prevents the neural network from activating more neurons and serves
as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the nodes, the sparsity regularizer
creates a penalty on the number of neurons that are activated for particular images during training, treating the regularization problem as a problem separate from the latent space
problem.
We can thus set latent space dimensionality at the bottleneck without worrying about regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated into the loss function.
L1 Loss: Here, we add the magnitude of the sparsity regularizer as we do for general regularizers:
L = L_reconstruction + λ Σ_i |a_i^(h)|
where a_i^(h) denotes the activation of the i-th neuron in hidden layer h and λ weights the sparsity term.
KL-Divergence: In this case, we consider the activations over a collection of samples at once rather than
summing them as in the L1 Loss method. We constrain the average activation of each neuron over this
collection.
Considering the ideal distribution to be a Bernoulli distribution, we include KL divergence within the loss to reduce
the difference between the current distribution of the activations and the ideal (Bernoulli) distribution:
Σ_j KL(ρ ‖ ρ̂_j) = Σ_j [ ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j)) ]
where ρ is the target (ideal) average activation, ρ̂_j is the average activation of neuron j in hidden layer h over the collection of samples, and the sum runs over all hidden neurons j.
The cost function of the variational autoencoder (VAE) is based on log-likelihood maximization. The cost function
consists of reconstruction and regularization error terms:
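A commonly used form of this objective (the evidence lower bound, to be maximized, per data point x, with approximate posterior q(z|x) and prior p(z)) is:
L_VAE = E_q(z|x)[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )
where the first term corresponds to the reconstruction error and the KL term acts as the regularizer.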
Denoising autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an image.
As opposed to the autoencoders we have already covered, this is the first of its kind that does not have the input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been added via digital
alterations. The noisy image is fed to the encoder-decoder architecture, and the output is compared with the original, clean ground-truth image.
While removing noise directly from the image seems difficult, the autoencoder performs this by mapping the input
data into a lower-dimensional manifold (like in undercomplete autoencoders), where filtering of noise becomes
much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction. The loss function typically used in these networks is an L2 or L1 reconstruction loss between the denoised output and the original, clean image.
One way we can modify the autoencoder to learn useful features is by changing the inputs: we add random
noise to the input and ask the network to recover the original form by removing the noise. This prevents
the autoencoder from simply copying the data from input to output, because the input contains random noise. We ask it
to subtract the noise and produce the meaningful underlying data. This is called a denoising autoencoder.
In a typical illustration, the first row contains the original images. In the second row, random
noise is added to the original images; this noise is called Gaussian noise. The autoencoder does not
receive the original images as input, but it is trained in such a way that it removes the noise
and regenerates the original images.
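A brief sketch of this idea, using a small fully connected autoencoder and Gaussian corruption of the inputs (the sizes and noise level are illustrative assumptions):

import torch
import torch.nn as nn

# a small fully connected autoencoder; sizes are toy choices
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(16, 784)                          # stand-in for a clean image batch
x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)    # add Gaussian noise

# key point: the noisy image goes in, but the loss compares against the clean image
loss = criterion(model(x_noisy), x_clean)
optimizer.zero_grad(); loss.backward(); optimizer.step()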
Contractive Autoencoders
Similar to other autoencoders, contractive autoencoders perform the task of learning a representation of the image while passing it through a bottleneck and reconstructing it in the decoder.
The contractive autoencoder also has a regularization term to prevent the network from learning the identity function and simply mapping the input onto the output.
Contractive autoencoders work on the basis that similar inputs should have similar encodings and a similar latent
space representation. It means that the latent space should not vary by a huge amount for minor variations in the
input.
To train a model that works along with this constraint, we have to ensure that the derivatives of the hidden layer activations with respect to the input are small.
Mathematically:
L = ||x − x̂||² + λ ||J_h(x)||²_F, with ||J_h(x)||²_F = Σ_ij ( ∂h_j(x) / ∂x_i )²
An important thing to note about the loss function (formed from the norm of the derivatives and the reconstruction loss) is that the two terms contradict each other.
While the reconstruction loss wants the model to tell differences between two inputs and observe variations in the
data, the frobenius norm of the derivatives says that the model should be able to ignore variations in the input
data.
Putting these two contradictory conditions into one loss function enables us to train a network where the hidden
layers now capture only the most essential information. This information is necessary to separate images while
ignoring variations that are not important.
In the penalty term above, h is the hidden layer for which the gradient is calculated, represented with respect to the input x
as ∂h/∂x. The gradient is summed over all training samples, and a Frobenius norm of the result is
taken.
Contractive Autoencoders (CAEs) are a type of autoencoder that incorporates a regularization term in the
loss function to enforce a contractive property on the learned latent representation. The goal of this
regularization is to make the autoencoder more robust to small changes in the input data. The contractive
term penalizes the model for sensitivity to input variations, helping to create a more stable and
meaningful representation in the latent space.
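As a sketch of how the contractive term can be added in practice, the following assumes a single sigmoid encoder layer, for which the Frobenius norm of the Jacobian has a simple closed form; the sizes and the penalty weight are illustrative assumptions.

import torch
import torch.nn as nn

W_enc = nn.Linear(784, 32)          # one-layer sigmoid encoder
W_dec = nn.Linear(32, 784)          # linear decoder
lam = 1e-4                          # assumed weight of the contractive term

x = torch.rand(16, 784)
h = torch.sigmoid(W_enc(x))         # hidden representation
x_hat = W_dec(h)

recon = ((x_hat - x) ** 2).mean()
# For a sigmoid layer, ||dh/dx||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
dh = (h * (1 - h)) ** 2                      # (batch, 32)
w_sq = (W_enc.weight ** 2).sum(dim=1)        # (32,)
contractive = (dh * w_sq).sum(dim=1).mean()

loss = recon + lam * contractive
loss.backward()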
Applications of Autoencoders.
1. Dimensionality reduction
Undercomplete autoencoders are those that are used for dimensionality reduction.
These can be used as a pre-processing step for dimensionality reduction, as they can perform fast and accurate dimensionality reduction without losing much information.
Furthermore, while dimensionality reduction procedures like PCA can only perform linear projections, undercomplete autoencoders can also capture non-linear relationships.
2. Image denoising
Autoencoders like the denoising autoencoder can be used for performing efficient and highly accurate image
denoising.
Unlike traditional methods of denoising, autoencoders do not search for noise; they extract the image from the
noisy data that has been fed to them by learning a representation of it. The representation is then decompressed to form a noise-free image.
Denoising autoencoders can thus denoise complex images that cannot be denoised via traditional methods.
3. Generation of image and time series data
Variational Autoencoders can be used to generate both image and time series data.
The parameterized distribution at the bottleneck of the autoencoder can be randomly sampled to generate
discrete values for latent attributes, which can then be forwarded to the decoder, leading to generation of image
data. VAEs can also be used to model time series data like music.
4. Anomaly detection
For example, consider an autoencoder that has been trained on a specific dataset P. For any image sampled
from the training dataset, the autoencoder is bound to give a low reconstruction loss and is expected to reconstruct the image well.
For any image which is not present in the training dataset, however, the autoencoder cannot perform the
reconstruction well, as the latent attributes are not adapted to an image that has never been seen by the
network.
As a result, the outlier image gives a very high reconstruction loss and can easily be identified as an anomaly.
Traditionally, training deep neural networks with many layers was challenging.
Pretraining involves successively adding a new hidden layer to a model and refitting,
allowing the newly added model to learn the inputs from the existing hidden layer,
often while keeping the weights for the existing hidden layers fixed. This gives the
technique the name “layer-wise” as the model is trained one layer at a time.
The technique is referred to as "greedy" because of the piecewise or layer-wise
approach to solving the harder problem of training a deep network. As an
optimization process, dividing the training process into a succession of layer-wise
training processes is seen as a greedy shortcut that likely leads to an aggregate of
locally optimal solutions, a shortcut to a good enough global solution.
Greedy algorithms break a problem into many components, then solve for the
optimal version of each component in isolation. Unfortunately, combining the
individually optimal components is not guaranteed to yield an optimal complete
solution.
It is common to use the word “pretraining” to refer not only to the pretraining stage
itself but to the entire two phase protocol that combines the pretraining phase and a
supervised learning phase. The supervised learning phase may involve training a
simple classifier on top of the features learned in the pretraining phase, or it may
involve supervised fine-tuning of the entire network learned in the pretraining phase.
Today, we now know that greedy layer-wise pretraining is not required to train fully
connected deep architectures, but the unsupervised pretraining approach was the
first method to succeed.
To mitigate the vanishing gradient problem, we use this technique. Let us see the mechanism of the
greedy layer-wise pretraining method. First, we make a base model with just the input and output layers; then we
train the model using the available dataset. After training the model, we remove the output layer and store
it in another variable. We add a new hidden layer to the model, which becomes the first hidden layer,
and re-add the output layer. Now there are three layers in the model: the input layer, hidden layer 1, and the
output layer. Once again, we train the model after inserting hidden layer 1. To
add one more hidden layer, we remove the output layer and set all existing layers as non-trainable (no further change
in the weights of the input layer and hidden layer 1). We then insert the new hidden layer 2 and re-add
the output layer, and train the model again. The model structure is now: input layer, hidden layer 1,
hidden layer 2, output layer. Repeat the above steps for every new hidden layer you want to add (each time
you insert a new hidden layer, train the model on the same dataset), as sketched below.
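A compact Keras-style sketch of this procedure, following the steps just described; the layer sizes, the binary-classification setup, and the placeholder data X, y are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

def add_layer_and_retrain(model, X, y):
    # greedy step: drop the output layer, freeze what remains,
    # insert a new hidden layer, re-attach the output layer, retrain
    output_layer = model.layers[-1]
    model.pop()                                       # remove the current output layer
    for layer in model.layers:
        layer.trainable = False                       # keep earlier weights fixed
    model.add(layers.Dense(64, activation="relu"))    # the new hidden layer
    model.add(output_layer)                           # re-attach the stored output layer
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(X, y, epochs=5, verbose=0)
    return model

# base model: one hidden layer plus the output layer
model = keras.Sequential([layers.Dense(64, activation="relu", input_shape=(100,)),
                          layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X, y, ...)  then call add_layer_and_retrain(model, X, y) once per new hidden layer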
The greedy layer-wise training is a pre-training algorithm that aims to train each layer of a DBN in a sequential way,
feeding lower layers’ results to the upper layers. This renders a better optimization of a network than traditional
training algorithms, i.e. training method using stochastic gradient descent à la RBMs.
In terms of computational units, deep structures such as the DBN can be much more efficient (25) than their
shallow counterparts since they require fewer units (23) for performing the same function. Multi-layer deep
structures can represent abstract concepts and varying functions by keeping many non-linear layers in a
hierarchy (25). From a lower level to a higher level in this hierarchy, the layers' abstractness ascends in terms of the
complexity of objects they are representing (in the illustration below, the top layer shows all elemental pixels
whereas the images are kept in the bottom layer). The process of how we divide more complex objects into
simpler objects is by modeling a set of joint distribution between each visible and hidden layer.
Domain Adaptation
In domain adaptation, we solely change the underlying datasets and thus the features of our machine learning
model. However, the feature space stays the same, and the predictive function stays the same.
Application
Applying domain adaptation to our example, we could think of a significantly different, but somehow similar
dataset. This could still contain dog and cat pictures, but those that are vastly different from the ones in our
source dataset. For example, in our source data set, we only have poodles and black cats. In our target dataset, on
the other hand, we could have schnauzers and white cats.
Now, how can we ensure that our predictive function will still predict the right labels for our dataset? Domain
adaptation delivers an answer for this question.
We consider three types of domain adaptation. These are defined by the number of labeled examples in the
underlying domain:
Unsupervised domain adaptation works with a source domain that has labeled examples, but also
unlabeled examples. The target domain only has unlabeled examples.
Semi-supervised domain adaptation expects some of the examples in the target domain are labeled.
Supervised domain adaptation indicates that all examples in the target domain are labeled.
Within domain adaptation, we can look a bit closer at pragmatic approaches, because changing only
the dataset makes it much easier to tune our model for the new machine learning process.
Divergence-based domain adaptation is a method of testing if two samples are from the same distribution. As we
have seen in our blueprint illustration, the features that are extracted from the datasets are vastly different. This
difference causes our predictive function to not work as intended. If it’s fed by features that it was not trained for, it
malfunctions. This is also the reason why we accept different features but require the same feature space.
For this reason, divergence-based domain adaptation creates features that are “equally close” to both
datasets. This can be achieved by applying various algorithms, including the Maximum Mean Discrepancy,
Correlation Alignment, Contrastive Domain Discrepancy, or the Wasserstein Metric.
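As one illustration, a (biased) estimate of the squared Maximum Mean Discrepancy with an RBF kernel can be computed from two batches of features as below; this is a generic NumPy sketch, not tied to a particular domain-adaptation library, and the feature sizes are toy assumptions.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # pairwise squared distances between rows of A and rows of B
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(source, target, gamma=1.0):
    # biased estimate of squared MMD between two feature samples
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2 * k_st

source_feats = np.random.randn(100, 16)          # toy source-domain features
target_feats = np.random.randn(100, 16) + 0.5    # toy, shifted target-domain features
print(mmd2(source_feats, target_feats))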
In the iterative approach, we use our prediction function to label those samples of our target domain for which we
have very high confidence. Doing so, we retrain our function, creating a prediction function that fits our
target domain more and more as we apply it to samples with lower confidence.
Transfer Learning:
Let’s apply this concept to our example of dog and cat pictures. Now imagine we have a second dataset that shows
pictures of cows and horses. Cows and horses are significantly different from cats.
Nevertheless, they are all mammals; they have four feet and a similar shape. As a solution, we can take the layers
that describe the shape of the object we want to detect, whether it is a dog, a cow, or a horse,
and freeze them. Freezing means we cut these layers out of the source predictive function, put them into the new
predictive function, and train the new function without training the frozen layers:
In the accompanying illustration, the green layers are the ones we have to train while creating our
predictive function; the rest of the predictive function is created using the already existing,
frozen layers from the source domain. The frozen layers stay untouched during the training process.
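A minimal PyTorch-style sketch of this freezing idea, using a generic ImageNet-pretrained backbone; the choice of ResNet-18, the two-class head, and the weights argument (available in recent torchvision versions) are illustrative assumptions.

import torch.nn as nn
from torchvision import models

# load a network pretrained on the source domain (ImageNet weights)
backbone = models.resnet18(weights="IMAGENET1K_V1")

# freeze all existing layers so their weights stay untouched during training
for param in backbone.parameters():
    param.requires_grad = False

# replace the final classifier with a new head for the target task (e.g. 2 classes)
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
# only the new head's parameters will be updated when training on the target data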
What is transfer learning?
Transfer learning is a technique in machine learning where a model trained on one task is used as the starting
point for a model on a second task. This can be useful when the second task is similar to the first task, or when
there is limited data available for the second task. By using the learned features from the first task as a starting
point, the model can learn more quickly and effectively on the second task. This can also help to prevent
overfitting, as the model will have already learned general features that are likely to be useful in the second task.
Why transfer learning?
Many deep neural networks trained on images have a curious phenomenon in common: in the early layers of the
network, a deep learning model tries to learn a low level of features, like detecting edges, colors, variations of
intensities, etc. Such features appear not to be specific to a particular dataset or task, because no matter
what type of image we are processing, whether for detecting a lion or a car, these low-
level features must be detected in both cases. All these features occur regardless of the exact cost function or image dataset. Thus, learning these
features in one task, such as detecting lions, can be reused in other tasks, such as detecting humans.
Advantages :
Speed up the training process: By using a pre-trained model, the model can learn more quickly and
effectively on the second task, as it already has a good understanding of the features and patterns in
the data.
Better performance: Transfer learning can lead to better performance on the second task, as the
model can leverage the knowledge it has gained from the first task.
Handling small datasets: When there is limited data available for the second task, transfer learning
can help to prevent overfitting, as the model will have already learned general features that are likely
to be useful in the second task.
Disadvantages:
Domain mismatch: The pre-trained model may not be well-suited to the second task if the two tasks
are vastly different or the data distribution between the two tasks is very different.
Overfitting: Transfer learning can lead to overfitting if the model is fine-tuned too much on the
second task, as it may learn task-specific features that do not generalize well to new data.
Complexity: The pre-trained model and the fine-tuning process can be computationally expensive
and may require specialized hardware.
Transfer learning is a machine learning method where a model already developed for a task is reused in
another task. Transfer learning is a popular approach in deep learning, as it enables the training of
deep neural networks with less data compared to having to create a model from scratch.
Typically, training a model takes a large amount of compute resources and time. Using a pre-trained model as a starting point reduces both.
Machine learning algorithms are typically designed to address isolated tasks. Through transfer learning,
methods are developed to transfer knowledge from one or more of these source tasks to improve learning
in a related target task. Knowledge from an already trained machine learning model must be similar to the
new task to be transferable. For example, the knowledge gained from recognizing an image of a dog in a
supervised machine learning system could be transferred to a new system to recognize images of cats.
The new system will filter out images it already recognizes as a dog.
Distributed Representations
Distributed representations are a fundamental concept in the field of machine learning and natural
language processing (NLP). They refer to a way of representing data, typically words or phrases, as
continuous vectors in a high-dimensional space. Unlike local representations, where each entity is
represented by a unique identifier in an isolated manner (such as one-hot encoding), distributed
representations capture a notion of similarity and semantic meaning by allowing an entity to be
represented by a pattern of values across many dimensions.
In distributed representations, also known as embeddings, the idea is that the "meaning" or "semantic
content" of a data point is distributed across multiple dimensions. For example, in NLP, words with similar
meanings are mapped to points in the vector space that are close to each other. This closeness is not
arbitrary but is learned from the context in which words appear. This context-dependent learning is often
achieved through neural network models, such as Word2Vec or GloVe, which process large corpora of
text to learn these representations.
One of the key advantages of distributed representations is their ability to capture fine-grained semantic
relationships. For instance, in a well-trained word embedding space, synonyms would be represented by
vectors that are close together, and it's even possible to perform arithmetic operations with these vectors
that correspond to meaningful semantic operations (e.g., "king" - "man" + "woman" might result in a vector
close to "queen").
Distributed representations have a wide range of applications, particularly in tasks that involve natural
language understanding. They are used for:
Moreover, distributed representations are not limited to text data. They can also be applied to other types
of data, such as images, where deep learning models learn to represent images as high-dimensional
vectors that capture visual features and semantics.
Learning distributed representations typically involves training a model on a task that forces it to capture
semantic or feature similarities. For text, this might involve predicting a word given its surrounding words
(continuous bag of words model) or predicting surrounding words given a word (skip-gram model). During
this process, the model learns to place semantically similar words closer together in the vector space.
For images, convolutional neural networks (CNNs) learn distributed representations by being trained to
recognize objects within images. Through layers of convolutions and pooling operations, CNNs learn to
represent images in a way that captures hierarchical visual features.
Despite their effectiveness, distributed representations come with their own set of challenges. One major
issue is the requirement of large amounts of data to learn meaningful representations. Without sufficient
data, the embeddings may not capture the true semantic relationships. Additionally, distributed
representations can be computationally expensive to learn, requiring significant processing power and
memory, especially for large datasets.
Another challenge is the interpretability of these representations. Unlike local representations, where each
dimension corresponds to a specific feature, the dimensions in distributed representations do not have an
easily interpretable meaning. This can make it difficult to understand what the model has learned and to
diagnose issues when the model makes incorrect predictions.
Distributed representations in deep learning refer to the representation of data, such as features or
concepts, by using distributed patterns of activation across multiple neurons or units in a neural network.
This is in contrast to traditional, non-distributed representations where each feature or concept is
represented by the activity of a single neuron or a small set of neurons.
In the context of deep learning, distributed representations are often learned automatically from the data
through the training process. This is typically done in neural networks with multiple layers, such as deep
neural networks or deep learning models. The idea is that each layer in the network learns increasingly
abstract and complex features, and the final representation of the input data is distributed across the
activations of many neurons in the last layer.
Here are some key points about distributed representations in deep learning:
1. Feature Learning:
In deep learning, each layer of a neural network can be thought of as learning hierarchical
features. The lower layers might capture simple features like edges or textures, while
higher layers combine these features to represent more complex patterns or concepts.
2. Sparse vs. Distributed Representations:
Traditional approaches often used sparse representations, where each feature or concept
is represented by the activity of a specific neuron. In contrast, distributed representations
involve the collective activity of multiple neurons to represent a feature or concept.
3. Generalization:
Distributed representations often lead to better generalization. This means that the
model can perform well on new, unseen data because it has learned a more abstract and
versatile representation of the underlying patterns in the training data.
4. Efficiency:
Distributed representations are often more efficient in terms of storage and computation.
They can capture complex relationships with fewer parameters compared to non-
distributed representations.
5. Word Embeddings:
In natural language processing (NLP), distributed representations are commonly used for
words through techniques like word embeddings. Word embeddings, such as Word2Vec,
GloVe, or FastText, represent words as vectors in a high-dimensional space, capturing
semantic relationships between words.
6. Autoencoders and Variational Autoencoders:
Autoencoders and variational autoencoders are types of neural network architectures that
can learn distributed representations. Autoencoders learn an efficient representation of
the input data by encoding it into a lower-dimensional space and then decoding it back
to the original input. Variational autoencoders introduce a probabilistic element, allowing
them to generate new samples in addition to encoding and decoding.
DenseNet, short for Densely Connected Convolutional Networks, is a variant of Convolutional Neural
Networks (CNNs) designed to address some challenges associated with traditional CNN architectures.
DenseNet was introduced by Gao Huang, Zhuang Liu, and Laurens van der Maaten in their paper "Densely
Connected Convolutional Networks" in 2017.
The key idea behind DenseNet is to establish dense connections between layers, allowing each layer to
receive direct inputs from all preceding layers. This is in contrast to traditional CNN architectures, where
each layer typically connects only to the immediately preceding layer. The dense connectivity in DenseNet
leads to several advantages, including parameter efficiency, feature reuse, and improved gradient flow
during training.
1. Dense Connectivity:
In DenseNet, each layer receives input from all preceding layers and, in turn, passes its
feature maps to all subsequent layers. This dense connectivity is achieved by
concatenating the feature maps from all previous layers as input to the current layer.
2. Dense Blocks:
The building blocks of DenseNet are dense blocks, which consist of a series of densely
connected layers. Within a dense block, the output of each layer is concatenated with the
feature maps of all preceding layers. This promotes feature reuse and allows the network
to maintain a compact internal representation.
3. Transition Layers:
Between dense blocks, transition layers are used to reduce the spatial dimensions (width
and height) of the feature maps and control the number of channels. This helps in
reducing the computational cost and allows the network to scale more efficiently.
4. Batch Normalization and ReLU:
DenseNet typically employs batch normalization and rectified linear unit (ReLU) activation
functions after each convolutional layer. Batch normalization helps with training stability
and accelerates convergence, while ReLU introduces non-linearity.
5. Growth Rate:
The growth rate is a hyperparameter that defines the number of feature maps added to
the network at each layer within a dense block. A higher growth rate increases the
number of parameters but can also enhance the representational capacity of the network.
6. Global Average Pooling:
DenseNet often uses global average pooling instead of fully connected layers at the end
of the network. Global average pooling helps reduce the number of parameters and
encourages spatial hierarchies in the learned features.
DenseNet has been shown to achieve competitive performance on various computer vision tasks, such as
image classification and object detection, while requiring fewer parameters compared to traditional CNN
architectures. The dense connectivity facilitates feature reuse, which is particularly beneficial in scenarios
with limited data or computational resources.
DenseNet is one of the newer architectures in neural networks for visual object recognition. DenseNet is quite
similar to ResNet, with some fundamental differences: ResNet uses an additive method (+) that merges
the previous layer (identity) with the future layer, whereas DenseNet concatenates the output of the
previous layer with the future layer.
Why Do We Need DenseNet?
DenseNet was developed specifically to improve the declined accuracy caused by the vanishing gradient
in high-level neural networks. In simpler terms, due to the longer path between the input layer and the
output layer, the information vanishes before reaching its destination.
In short, DenseNet-121 has the following layers:
1 7x7 Convolution
58 3x3 Convolution
61 1x1 Convolution
4 AvgPool
Introduction to DenseNet
In a traditional feed-forward Convolutional Neural Network (CNN), each convolutional layer except the
first one (which takes in the input), receives the output of the previous convolutional layer and produces
an output feature map that is then passed on to the next convolutional layer. Therefore, for 'L' layers,
there are 'L' direct connections; one between each layer and the next layer.
However, as the number of layers in the CNN increases, i.e. as the network gets deeper, the 'vanishing gradient'
problem arises. This means that as the path for information from the input to the output layers increases,
it can cause certain information to 'vanish' or get lost which reduces the ability of the network to train
effectively.
DenseNets resolve this problem by modifying the standard CNN architecture and simplifying the
connectivity pattern between layers. In a DenseNet architecture, each layer is connected directly with
every other layer, hence the name Densely Connected Convolutional Network. For 'L' layers, there are
L(L+1)/2 direct connections.
DenseNet Architecture & Components
Components of DenseNet include:
Connectivity
DenseBlocks
Growth Rate
Bottleneck layers
Connectivity
In each layer, the feature maps of all the previous layers are not summed, but concatenated and used as
inputs. Consequently, DenseNets require fewer parameters than an equivalent traditional CNN, and this
allows for feature reuse as redundant feature maps are discarded. So, the lth layer receives the feature-
maps of all preceding layers, x0,...,xl-1, as input:
x_l = H_l([x0, x1, ..., xl-1])
where [x0, x1, ..., xl-1] is the concatenation of the feature-maps, i.e. the output produced in all the layers
preceding l (0,...,l-1). The multiple inputs of H_l are concatenated into a single tensor to ease
implementation.
DenseBlocks
The use of the concatenation operation is not feasible when the size of feature maps changes. However,
an essential part of CNNs is the down-sampling of layers which reduces the size of feature-maps through
dimensionality reduction to gain higher computation speeds.
To enable this, DenseNets are divided into DenseBlocks, where the dimensions of the feature maps
remains constant within a block, but the number of filters between them is changed. The layers between
the blocks are called Transition Layers, which reduce the number of channels to half of that of the
existing channels.
For each layer, from the equation above, Hl is defined as a composite function which applies three
consecutive operations: batch normalization (BN), a rectified linear unit (ReLU) and a convolution (Conv).
In the above image, a deep DenseNet with three dense blocks is shown. The layers between two
adjacent blocks are the transition layers which perform downsampling (i.e. change the size of the feature-
maps) via convolution and pooling operations, whilst within the dense block the size of the feature maps
is the same to enable feature concatenation.
Growth Rate
One can think of the features as a global state of the network. The size of the feature map grows after a
pass through each dense layer with each layer adding 'K' features on top of the global state (existing
features). This parameter 'K' is referred to as the growth rate of the network, which regulates the amount
of information added in each layer of the network. If each function H_l produces k feature maps, then the
lth layer has
k0 + k × (l − 1)
input feature-maps, where k0 is the number of channels in the input layer. Unlike existing network
architectures, DenseNets can have very narrow layers.
Bottleneck layers
Although each layer only produces k output feature-maps, the number of inputs can be quite high,
especially for further layers. Thus, a 1x1 convolution layer can be introduced as a bottleneck layer before
each 3x3 convolution to improve the efficiency and speed of computations.
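A compact PyTorch-style sketch of one DenseNet layer with a bottleneck, showing the concatenation-based connectivity; the channel counts and the four-layer block are illustrative assumptions that follow the growth-rate convention.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # BN -> ReLU -> 1x1 conv (bottleneck) -> BN -> ReLU -> 3x3 conv, then concatenate
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        inter = 4 * growth_rate                  # a common bottleneck width
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(inter)
        self.conv2 = nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return torch.cat([x, out], dim=1)        # dense connectivity: k new feature maps appended

# a dense block of 4 layers: the channel count grows by the growth rate each time
layers, channels, k = [], 64, 32
for _ in range(4):
    layers.append(DenseLayer(channels, k))
    channels += k
block = nn.Sequential(*layers)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)   # torch.Size([1, 192, 32, 32])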
Unit VI
Applications of Deep Learning
Generative Adversarial Networks (GANs) were introduced in 2014 by Ian J. Goodfellow and co-authors. GANs
perform unsupervised learning tasks in machine learning. It consists of 2 models that automatically discover and
learn the patterns in input data.
They compete with each other to scrutinize, capture, and replicate the variations within a dataset. GANs can be used
to generate new examples that plausibly could have been drawn from the original dataset.
Shown below is an example of a GAN. There is a database that has real 100 rupee notes. The generator neural
network generates fake 100 rupee notes. The discriminator network will help identify the real and fake notes.
What is a Generator?
A Generator in GANs is a neural network that creates fake data to be trained on the discriminator. It learns to
generate plausible data. The generated examples/instances become negative training examples for the discriminator.
It takes a fixed-length random vector carrying noise as input and generates a sample.
The main aim of the Generator is to make the discriminator classify its output as real. The part of the GAN that
trains the Generator includes:
generator network, which transforms the random input into a data instance
generator loss, which penalizes the Generator for failing to fool the discriminator
The backpropagation method is used to adjust each weight in the right direction by calculating the weight's impact
on the output. It is also used to obtain gradients and these gradients can help change the generator weights.
Generator:
The Generator is a neural network that takes random noise or a random vector as input and
generates synthetic data samples. In the context of image generation, for example, the generator
produces images that ideally are realistic and similar to the training data. The goal of the
generator is to produce data that is convincing enough to fool the discriminator.
What is a Discriminator?
The Discriminator is a neural network that identifies real data from the fake data created by the Generator. The
discriminator's training data comes from two different sources:
The real data instances, such as real pictures of birds, humans, currency notes, etc., are used by the
Discriminator as positive samples during training.
The fake data instances created by the Generator are used as negative examples during the training
process.
While training the discriminator, it connects to two loss functions. During discriminator training, the discriminator
ignores the generator loss and just uses the discriminator loss.
In the process of training the discriminator, the discriminator classifies both real data and fake data from the
generator. The discriminator loss penalizes the discriminator for misclassifying a real data instance as fake or a fake
data instance as real.
The discriminator updates its weights through backpropagation from the discriminator loss through the discriminator
network.
Discriminator:
The Discriminator is another neural network that evaluates the authenticity of a given input,
determining whether it is a real (from the training data) or fake (generated by the generator)
sample. The discriminator is trained to correctly classify real and fake samples. The objective is to
make the discriminator as accurate as possible in distinguishing between real and generated data.
TRAINING
Step 1: Define the problem. Do you want to generate fake images or fake text? Here you should completely define the problem and collect the data for it.
Step 2: Define the architecture of the GAN. Define what your GAN should look like. Should both your generator and
discriminator be multi-layer perceptrons, or convolutional neural networks? This step will depend on what problem you are trying to solve.
Step 3: Train the discriminator on real data for n epochs. Get the data you want to generate fakes of and train the
discriminator to correctly predict it as real. Here the value n can be any natural number between 1 and infinity.
Step 4: Generate fake inputs for the generator and train the discriminator on fake data. Get the generated data and let the
discriminator correctly predict it as fake.
Step 5: Train the generator with the output of the discriminator. Now that the discriminator is trained, you can get its
predictions and use them as an objective for training the generator. Train the generator to fool the discriminator.
Step 6: Repeat steps 3 to 5 for a few epochs.
Step 7: Check the fake data manually to see if it seems legitimate. If it seems appropriate, stop training; else go back to step
3. This is a bit of a manual task, as hand-evaluating the data is the best way to check its fakeness. When this step is
over, you can evaluate whether the GAN is performing well enough.
Training Generative Adversarial Networks (GANs) involves an iterative process where the generator and
discriminator are trained in a competitive manner. The goal is to have the generator produce realistic data
that is difficult for the discriminator to distinguish from real data. Here is a high-level overview of the
training process for GANs:
1. Initialize Networks:
Initialize the weights and biases of the generator and discriminator networks. These
networks are typically neural networks, and their architectures depend on the specific task
(e.g., image generation).
2. Define Loss Functions:
Define the loss functions for both the generator and discriminator. The discriminator aims
to minimize the binary cross-entropy loss, correctly classifying real and generated
samples. The generator aims to maximize the same loss, fooling the discriminator into
classifying generated samples as real.
3. Generate Synthetic Samples:
Input random noise into the generator to produce synthetic samples. These samples are
initially random and are not likely to resemble the real data.
4. Train Discriminator:
Feed a batch of real samples from the training set and the corresponding synthetic
samples generated by the generator to the discriminator.
Compute the discriminator loss based on how well it classifies real and generated
samples.
Update the discriminator's weights using gradient descent to minimize its loss.
5. Train Generator:
Generate a new batch of synthetic samples using the generator.
Feed these generated samples into the discriminator.
Compute the generator loss based on the discriminator's response to the generated
samples.
Update the generator's weights so as to increase the discriminator's error on generated
samples (equivalently, to make the generated samples look real to the discriminator). This
encourages the generator to produce samples that are more convincing to the
discriminator.
6. Iterate:
Repeat steps 3-5 for a predefined number of iterations or until convergence.
The adversarial training process involves the continuous back-and-forth training of the
generator and discriminator. The generator becomes better at generating realistic
samples, and the discriminator becomes more accurate in distinguishing between real
and generated samples.
7. Monitoring and Evaluation:
Monitor the performance of the generator and discriminator using metrics such as
generator loss, discriminator loss, and visual inspection of generated samples.
Evaluate the quality of generated samples on a validation set or using domain-specific
metrics.
8. Fine-Tuning and Hyperparameter Tuning:
Fine-tune hyperparameters such as learning rates, batch sizes, and network architectures
based on the observed performance.
Experiment with architectural modifications or regularization techniques to improve
stability and convergence.
9. End Training:
Stop training when the generator produces high-quality, realistic samples, and the
discriminator performs well in distinguishing between real and generated data.
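The training loop described above can be condensed into a short PyTorch-style sketch; the network sizes, learning rates, and the random stand-in for real data are illustrative assumptions.

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, data_dim)     # stand-in for a batch of real data
    z = torch.randn(32, latent_dim)
    fake = G(z)

    # train the discriminator: real -> 1, fake -> 0 (fake is detached so G is not updated here)
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # train the generator: try to make the discriminator output 1 on fake samples
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()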
GAN variants
Generative Adversarial Networks (GANs) have inspired numerous variants and extensions, each designed
to address specific challenges or to cater to different applications. Here are some notable GAN variants:
Architecture of autoencoders
Encoder: An encoder is a feedforward, fully connected neural network that compresses the input
into a latent space representation and encodes the input image as a compressed representation in a
reduced dimension. The compressed image is the distorted version of the original image.
Code: This part of the network contains the reduced representation of the input that is fed into the
decoder.
Decoder: Decoder is also a feedforward network like the encoder and has a similar structure to the
encoder. This network is responsible for reconstructing the input back to the original dimensions
from the code.
First, the input goes through the encoder where it is compressed and stored in the layer called Code, then the
decoder decompresses the original input from the code. The main objective of the autoencoder is to get an output
identical to the input.
Note that the decoder architecture is typically the mirror image of the encoder. This is not a requirement, but it is usually the
case. The only requirement is that the dimensionality of the input and output must be the same.
An autoencoder is a type of artificial neural network used for unsupervised learning. It consists of an
encoder and a decoder, and its primary purpose is to learn efficient representations of data. The
architecture of an autoencoder is relatively simple, comprising two main components:
1. Encoder:
The encoder takes input data and maps it to a lower-dimensional representation, often
referred to as a "latent space" or "encoding." The encoder's role is to compress the input
data into a more compact representation that captures the essential features.
Architecture:
The encoder typically consists of one or more layers of neurons, with each layer
performing a linear transformation followed by a non-linear activation function.
Common activation functions include Rectified Linear Units (ReLU) or hyperbolic
tangent (tanh).
The number of neurons in the output layer of the encoder determines the
dimensionality of the latent space.
Example:
Input Layer (e.g., features) -> Hidden Layer 1 (with ReLU activation) -> Hidden Layer 2 (with
ReLU activation) -> ... -> Output Layer (Latent Space)
2. Latent Space:
The latent space is the compressed representation of the input data produced by the
encoder. It is a lower-dimensional space that ideally captures the most relevant
information from the input.
3. Decoder:
The decoder takes the encoded representation from the latent space and reconstructs the
input data. Its goal is to generate output data that closely matches the original input.
Architecture:
Similar to the encoder, the decoder comprises one or more layers of neurons.
Each layer performs a linear transformation followed by a non-linear activation
function, reconstructing the data.
Example:
Latent Space -> Hidden Layer 1 (with ReLU activation) -> Hidden Layer 2 (with ReLU
activation) -> ... -> Output Layer (Reconstructed Input)
4. Loss Function:
The autoencoder is trained by minimizing a loss function that measures the difference
between the input data and its reconstruction. Mean Squared Error (MSE) is a common
choice for this purpose.
5. Training:
During training, the autoencoder learns to encode and decode the data by adjusting the
weights and biases of its neural network layers. The process involves feeding input data
into the encoder, obtaining the encoded representation, passing it through the decoder,
and comparing the reconstructed output with the original input.
Overall, the architecture of an autoencoder is characterized by the encoding and decoding stages, with
the neural network layers in each stage responsible for transforming and reconstructing the data.
Autoencoders are widely used for tasks such as data compression, denoising, and feature learning, as they
can capture meaningful representations of complex data.
Denoising in the context of deep learning refers to the process of removing noise or unwanted variations
from data. This concept is particularly relevant in various applications where the input data may be
corrupted by noise, artifacts, or other undesired elements. Denoising techniques are often employed to
enhance the quality of data, making it more suitable for downstream tasks such as image recognition,
speech processing, or any other domain where clean and accurate data is crucial. Here are a few key
aspects of denoising in the application of deep learning:
1. Image Denoising:
Objective: In image processing, denoising aims to remove unwanted artifacts or noise
from images.
Application: Denoising is widely used in medical imaging, surveillance, and photography
to improve the quality and clarity of images.
Deep Learning Technique: Denoising autoencoders and convolutional neural networks
(CNNs) are commonly used for image denoising. These models learn to map noisy
images to their clean counterparts (a minimal sketch follows after this list).
2. Speech Denoising:
Objective: In speech processing, denoising focuses on removing background noise or
distortions from audio signals.
Application: Speech denoising is crucial in voice recognition systems,
telecommunications, and audio processing applications.
Deep Learning Technique: Recurrent Neural Networks (RNNs) and Long Short-Term
Memory networks (LSTMs) are often used for modeling sequential data, including the
denoising of speech signals.
3. Data Preprocessing:
Objective: Denoising is an essential step in data preprocessing to improve the quality of
input data for subsequent tasks.
Application: Clean and denoised data is crucial for tasks like classification, regression,
and other machine learning applications.
Deep Learning Technique: Autoencoders, especially denoising autoencoders, are
effective for learning robust representations of data by removing noise during the
training process.
4. Sensor Data Denoising:
Objective: Denoising is applied to sensor data to filter out noise introduced during data
acquisition.
Application: In fields like Internet of Things (IoT), sensor data denoising is critical for
accurate monitoring and decision-making.
Deep Learning Technique: Autoencoders or specialized architectures tailored to the
characteristics of sensor data may be employed for effective denoising.
5. Document Denoising:
Objective: Denoising can be used to clean up noisy or distorted text in documents.
Application: Document denoising is valuable in optical character recognition (OCR) and
natural language processing tasks.
Deep Learning Technique: Recurrent neural networks or transformer-based models may
be employed for document denoising.
6. Video Denoising:
Objective: In video processing, denoising is applied to remove noise or artifacts from
video sequences.
Application: Video denoising is crucial in video surveillance, broadcasting, and video
analytics.
Deep Learning Technique: Spatiotemporal architectures, such as 3D CNNs, can be
effective for video denoising tasks.
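As an illustration of how a denoising model is trained, the sketch below corrupts each input with Gaussian noise
and asks a small fully connected autoencoder to reconstruct the clean version. The architecture, the noise level
(0.3), and the dummy data are assumptions made for this example only.
Example (Python sketch):
import torch
import torch.nn as nn

# A small fully connected denoising autoencoder
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),   # latent code
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784),
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 784)                          # dummy batch of clean inputs
for _ in range(100):
    noisy = clean + 0.3 * torch.randn_like(clean)    # corrupt the input with Gaussian noise
    optimizer.zero_grad()
    loss = criterion(model(noisy), clean)            # the target is the clean input, not the noisy one
    loss.backward()
    optimizer.step()

The key difference from a plain autoencoder is that the network sees the noisy input but is penalized against the
clean target, which forces it to learn features that are robust to the corruption.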
Sparsity is a concept in deep learning that refers to a situation where a large number of elements in a
particular representation are zero or close to zero. Introducing sparsity into neural networks can have
several benefits, and it is applied in various ways across deep learning, for example in sparse autoencoders
(which penalize the latent activations), L1 regularization of weights, and the pruning of near-zero weights
after training.
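One common way to introduce sparsity, mentioned above, is to add an L1 penalty on the latent activations of an
autoencoder so that most of them are pushed toward zero. The penalty weight (1e-3), the layer sizes, and the
dummy data below are assumptions for illustration.
Example (Python sketch):
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                              # dummy inputs
for _ in range(100):
    optimizer.zero_grad()
    z = encoder(x)                                   # latent activations
    # Reconstruction loss plus an L1 term that drives most activations toward zero
    loss = nn.functional.mse_loss(decoder(z), x) + 1e-3 * z.abs().mean()
    loss.backward()
    optimizer.step()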
Greedy layer-wise pre-training is a technique used in deep learning, specifically in the training of deep
neural networks, such as autoencoders and deep belief networks. The approach involves training
individual layers of a deep network one at a time in a "greedy" manner, where each layer is pretrained as
an unsupervised model before fine-tuning the entire network.
1. Initialization:
Start with an initial shallow model, typically a single-layer neural network or an
unsupervised learning model.
2. Greedy Layer-wise Training:
Train the model layer by layer in a "greedy" manner. This means that each layer is trained
independently without considering the other layers. The input to each layer is the output
of the previously trained layers.
Commonly used unsupervised learning methods for pre-training include Restricted
Boltzmann Machines (RBMs) or autoencoders.
3. Stacking Layers:
Once each layer is pretrained, stack them together to create a deep neural network. The
pretrained layers serve as the initial weights for the corresponding layers in the deep
network.
4. Fine-tuning:
Fine-tune the entire deep network using supervised learning with labeled data. This
involves adjusting the weights of the entire network to minimize a supervised loss
function (a rough code sketch of steps 1-4 is given after this list).
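The four steps above can be sketched in code. The version below pretrains each layer as a shallow autoencoder,
stacks the encoders, and then fine-tunes the stack with a classifier head; the layer sizes, number of steps, and
dummy data are assumptions for illustration, not a definitive implementation.
Example (Python sketch):
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, out_dim, steps=100):
    # Step 2: train one layer as a shallow autoencoder and keep only its encoder
    enc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
    dec = nn.Linear(out_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)
        loss.backward()
        opt.step()
    return enc

x = torch.rand(256, 784)                             # dummy unlabeled data
sizes = [784, 256, 64]                               # assumed layer sizes

# Steps 1-2: greedy layer-wise training; each layer sees the output of the previously trained layers
encoders, h = [], x
for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
    enc = pretrain_layer(h, in_dim, out_dim)
    encoders.append(enc)
    h = enc(h).detach()                              # input for the next layer

# Step 3: stack the pretrained encoders and add a classifier on top
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))

# Step 4: fine-tune the whole network with labeled data (dummy labels here)
y = torch.randint(0, 10, (256,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()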
The motivation behind greedy layer-wise pre-training is that it helps the model learn better
representations of the data in a step-wise manner. Training individual layers independently can
sometimes lead to more robust and meaningful features in each layer. The pretrained layers capture
useful hierarchical representations that can be fine-tuned for the specific task at hand.
This approach was particularly popular in the early days of deep learning when training very deep
networks directly with backpropagation and stochastic gradient descent was computationally challenging.
Greedy layer-wise pre-training acted as a form of initialization, providing a good starting point for the
subsequent supervised fine-tuning.
However, it's worth noting that recent advances in optimization algorithms, increased computing power,
and improvements in network architectures have diminished the importance of greedy layer-wise pre-
training in some contexts. In many cases, end-to-end training of deep networks from scratch using
techniques like batch normalization and advanced weight initialization has proven to be effective.
Greedy layer-wise pretraining is called so because it optimizes one layer at a time, greedily. After the
unsupervised training, there is usually a fine-tuning stage, in which a joint supervised training algorithm is
applied to the whole network. Studies have shown that, on average, pretraining was slightly harmful, but for many
tasks it was significantly helpful. Unsupervised pretraining combines two ideas: 1) the choice of initial parameters
of a deep neural network can have a significant regularizing effect; 2) learning about the input distribution can
help with learning the mapping from inputs to outputs. Neither idea is fully understood.
A. From the regularizing view, it is possible that pretraining initializes the model in a region of parameter space
that would otherwise be inaccessible. But it is difficult to tell which aspects of the pretrained parameters are
retained during the supervised training stage, which makes this kind of analysis intractable. There are two ways to
work around this: 1) train the supervised and unsupervised objectives simultaneously; 2) freeze the parameters of
the feature extractor and use supervised learning only to add a classifier on top of it. Thinking of this process as
a regularizer, i.e. a way to avoid overfitting, we can expect unsupervised pretraining to be most helpful when there
is little labeled data and a large amount of unlabeled data.
B. From the representation view, the idea is that some features that are useful for the unsupervised task are
also useful for supervised tasks, but this is not understood at a mathematical or theoretical level.
We can expect unsupervised pretraining to be more effective when the initial representation is poor. The use of
word embeddings is a good example: learned word embeddings naturally encode similarity between words, whereas
one-hot representations treat every pair of distinct words as equally dissimilar.
C. Other factors also matter. For example, unsupervised pretraining is likely to be most useful when the function
to be learned is very complicated.
Disadvantages: unsupervised pretraining has the obvious disadvantage of operating in two separate phases. It also
introduces many more hyperparameters, whose effect can be measured after training but is difficult to predict
beforehand. To overcome this, we can train the unsupervised and supervised objectives simultaneously, so that there
is a single training phase and a single coefficient weighting the unsupervised term. Another disadvantage of having
two separate phases is that each phase has its own hyperparameters: the performance of the second phase usually
cannot be predicted during the first phase, so there is a long delay between choosing the first-phase
hyperparameters and being able to evaluate them.
The most principled solution to this is to use validation set error in the supervised phase to select the
hyperparameters of the pretraining phase. In practice, some hyperparameters, like the number of
iterations, are more conveniently set during the pretraining phase, using early stopping.
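As mentioned above, one alternative to the two-phase procedure is to train the supervised and unsupervised
objectives simultaneously, so that a single coefficient controls the strength of the unsupervised term. The sketch
below is a minimal illustration under assumed layer sizes, dummy data, and an assumed weighting of 0.5.
Example (Python sketch):
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)                         # unsupervised (reconstruction) head
classifier = nn.Linear(64, 10)                       # supervised (classification) head
params = list(encoder.parameters()) + list(decoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(128, 784)                             # dummy inputs
y = torch.randint(0, 10, (128,))                     # dummy labels
lambda_recon = 0.5                                   # single hyperparameter weighting the unsupervised term

for _ in range(100):
    optimizer.zero_grad()
    z = encoder(x)
    loss = (nn.functional.cross_entropy(classifier(z), y)
            + lambda_recon * nn.functional.mse_loss(decoder(z), x))
    loss.backward()
    optimizer.step()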