Deep Learning
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as
input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other,
but in cases when it is required to predict the next word of a sentence, the previous words are required and hence
there is a need to remember the previous words. Thus RNN came into existence, which solved this issue with the
help of a Hidden Layer. The main and most important feature of RNN is its Hidden state, which remembers
some information about a sequence. The state is also referred to as Memory State since it remembers the previous
input to the network. It uses the same parameters for each input as it performs the same task on all the inputs or
hidden layers to produce the output. This reduces the complexity of parameters, unlike other neural networks.
Below are some examples of RNN architectures that can help you better understand this.
One To One: There is only one input-output pair here. A one-to-one architecture is used in traditional neural networks.
One To Many: A single input in a one-to-many network can result in numerous outputs.
Many To One: In this scenario, a single output is produced by combining many inputs from distinct time
steps. Sentiment analysis and emotion identification use such networks, in which the class label is
predicted from a whole input sequence.
Many To Many: For many to many, there are numerous options, and the number of inputs need not match the number of outputs. Machine
translation systems, such as English-to-French (or vice versa) translation systems, use many-to-many
networks.
Advantages of RNNs:
Handle sequential data effectively, including text, speech, and time series.
Remember previous inputs through the hidden (memory) state and share the same parameters across time steps.
Disadvantages of RNNs:
Suffer from vanishing and exploding gradients when trained over long sequences, which makes long-range dependencies hard to learn.
Process inputs step by step, which makes the computation hard to parallelize.
1. One-to-One RNN
The above diagram represents the structure of the Vanilla Neural Network. It is used to solve general machine
learning problems that have only one input and output.
Example: classification of images.
2. One-to-Many RNN:
A single input and several outputs describe a one-to-many Recurrent Neural Network. The above diagram is an
example of this.
Example: In image captioning, an image is fed in and a sentence of words describing it is generated.
3. Many-to-One RNN:
This RNN creates a single output from the given series of inputs.
Example: Sentiment analysis is one of the examples of this type of network, in which a text is identified as
expressing positive or negative feelings.
4. Many-to-Many RNN:
This RNN maps a sequence of inputs to a sequence of outputs.
Example: Machine translation, where a sentence in one language is translated into another language.
Structure of LSTM (Long Short-Term Memory):
LSTM has a chain structure that contains four neural networks and different memory blocks called cells.
Information is retained by the cells and the memory manipulations are done by the gates. There are three gates –
1. Forget Gate: The information that is no longer useful in the cell state is removed with the forget gate. Two
inputs, x_t (input at the particular time) and h_t-1 (previous hidden state), are fed to the gate and multiplied with
weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which
gives an output between 0 and 1. If for a particular element of the cell state the output is close to 0, that piece of information is forgotten; for
an output close to 1, the information is retained for future use. The equation for the forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
where:
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the current input and the previous hidden state.
b_f is the bias with the forget gate.
σ is the sigmoid activation function.
2. Input gate: The addition of useful information to the cell state is done by the input gate. First, the information
is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using
the inputs h_t-1 and x_t. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and
contains all the candidate values from h_t-1 and x_t. Finally, the values of the vector and the regulated values are
multiplied to obtain the useful information, which is added to the cell state. The equations for the input gate and the cell-state update are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t
where
⊙ denotes element-wise multiplication
tanh is tanh activation function
3. Output gate: The task of extracting useful information from the current cell state to be presented as output is
done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is
regulated using the sigmoid function, which filters the values to be remembered using the inputs h_t-1 and x_t. Finally,
the values of the vector and the regulated values are multiplied and sent as the output and as the input to the next
cell. The equation for the output gate is:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
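To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the equations above plus the standard hidden-state update h_t = o_t ⊙ tanh(C_t); the weight matrices, biases, and dimensions are assumed to be initialized elsewhere, so this is an illustrative sketch rather than a full implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # One LSTM time step; every gate acts on the concatenation [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)      # forget gate
    i_t = sigmoid(W_i @ z + b_i)      # input gate
    c_hat = np.tanh(W_c @ z + b_c)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat  # element-wise cell-state update
    o_t = sigmoid(W_o @ z + b_o)      # output gate
    h_t = o_t * np.tanh(c_t)          # new hidden state
    return h_t, c_t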
Advantages of LSTM
1. Long-term dependencies can be captured by LSTM networks. They have a memory cell that is
capable of long-term information storage.
2. In traditional RNNs, there is a problem of vanishing and exploding gradients when models are
trained over long sequences. By using a gating mechanism that selectively recalls or forgets
information, LSTM networks deal with this problem.
3. LSTM enables the model to capture and remember important context, even when there is a
significant time gap between relevant events in the sequence. So wherever understanding context is
important, LSTMs are used, e.g., in machine translation.
Disadvantages of LSTM
1. Compared to simpler architectures like feed-forward neural networks, LSTM networks are
computationally more expensive. This can limit their scalability for large-scale datasets or
resource-constrained environments.
2. Training LSTM networks can be more time-consuming compared to simpler models due to their
computational complexity. So training LSTMs often requires more data and longer training times to
achieve high performance.
3. Since it is processed word by word in a sequential manner, it is hard to parallelize the work of
processing the sentences.
Encoder Decoder architectures
In this architecture, the input data is first fed through what’s called as an encoder network. The
encoder network maps the input data into a numerical representation that captures the important
information from the input. Thee numerical representation of the input data is also called as hidden
state. The numerical representation (hidden state) is then fed into what’s called as the decoder
network. The decoder network generates the output by generating one element of the output sequence at
a time. The following picture represents the encoder decoder architecture as explained here. Note that
both input and output sequence of data can be of varying length as shown in the picture below.
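As an illustration, the following PyTorch-style sketch outlines a minimal sequence-to-sequence encoder-decoder. The embedding sizes, hidden sizes, and the use of a GRU are illustrative assumptions, not a prescribed design.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of token ids
        _, hidden = self.rnn(self.embed(src))
        return hidden                          # the hidden state handed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):
        # produces the output sequence one element at a time (teacher-forced here)
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden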
A popular form of neural network architecture called the autoencoder is a type of encoder-decoder
architecture. An autoencoder is a type of neural network architecture that uses an encoder to
compress an input into a lower-dimensional representation, and a decoder to reconstruct the
original input from the compressed representation. It is primarily used for unsupervised learning
and data compression. The other types of encoder-decoder architecture can be used for supervised
learning tasks, such as machine translation, image captioning, and speech recognition. In this
architecture, the encoder maps the input to a fixed-length representation, which is then passed to the
decoder to generate the output. So while the encoder-decoder architecture and autoencoder have similar
components, their main purposes and applications differ.
Recursive Neural Networks
What Is a Recursive Neural Network?
Deep Learning is a subfield of machine learning and artificial intelligence (AI) that attempts to imitate how the
human brain processes data and gains certain knowledge. Neural Networks form the backbone of Deep Learning.
These are loosely modeled after the human brain and designed to accurately recognize underlying patterns in a data
set. If you want to predict the unpredictable, Deep Learning is the solution.
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed and structured
information. With RvNN, you can get a structured prediction by recursively applying the same set of weights on
structured inputs. The word recursive indicates that the neural network is applied to its output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The tree structure
means combining child nodes and producing parent nodes. Each child-parent bond has a weight matrix, and similar
children have the same weights. The number of children for every node in the tree is fixed to enable it to perform
recursive operations and use the same weights. RvNNs are used when there's a need to parse an entire sentence.
Recurrent Neural Network vs. Recursive Neural Networks
Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for
processing sequential data. They are closely related to the Recursive Neural Network.
Recurrent Neural Networks represent temporal sequences, which is why they find application in Natural
Language Processing (NLP), since language-related data like sentences and paragraphs are sequential in
nature. Recurrent networks are usually chain structures. The weights are shared across the chain length,
keeping the dimensionality constant.
On the other hand, Recursive Neural Networks operate on hierarchical data models due to their tree
structure. There are a fixed number of children for each node in the tree so that it can execute recursive
operations and use the same weights for each step. Child representations are combined into parent
representations.
Recurrent Networks are recurrent over time, forming a chain; a recurrent network can therefore be seen as a special
case of a recursive network, i.e. recursive networks are a generalization of recurrent networks.
A branch of machine learning and artificial intelligence (AI) known as "deep learning" aims
to replicate how the human brain analyses information and learns certain concepts. Deep
Learning's foundation is made up of neural networks. These are intended to precisely
identify underlying patterns in a data collection and are roughly modelled after the human
brain. Deep Learning provides the answer to the problem of predicting the unpredictable.
A subset of deep neural networks called recursive neural networks (RvNNs) are capable of
learning organized and detailed data. By repeatedly using the same set of weights on
structured inputs, RvNN enables you to obtain a structured prediction. Recursive refers to
the neural network's application to its output.
Recursive neural networks are capable of handling hierarchical data because of their in-depth
tree-like structure. In a tree structure, parent nodes are created by joining child nodes.
There is a weight matrix for every child-parent bond, and comparable children have the
same weights. To allow for recursive operations and the use of the same weights, the
number of children for each node in the tree is fixed. When it's necessary to parse a whole
sentence, RvNNs are employed.
We add the products of the weight matrices (W_i) and the children's representations (C_i) and apply the transformation f
to determine the parent node's representation:
parent = f(Σ_i W_i · C_i)
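A tiny NumPy sketch of this composition for a binary tree node, assuming tanh as the transformation f (the dimensions and values are toy choices):

import numpy as np

def compose_parent(children, weights, bias):
    # parent = f(sum_i W_i @ C_i + b), with f = tanh
    pre_activation = sum(W @ c for W, c in zip(weights, children)) + bias
    return np.tanh(pre_activation)

# example: a node with two 4-dimensional children
c_left, c_right = np.random.randn(4), np.random.randn(4)
W_left, W_right = np.random.randn(4, 4), np.random.randn(4, 4)
parent = compose_parent([c_left, c_right], [W_left, W_right], np.zeros(4))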
Unit IV
Autoencoders
1. Undercomplete autoencoders
An undercomplete autoencoder takes in an image and tries to predict the same image as output, thus reconstructing the image from the compressed bottleneck region.
Undercomplete autoencoders are truly unsupervised as they do not take any form of label; the target is simply the input itself.
The primary use of autoencoders like such is the generation of the latent space or the bottleneck, which forms a
compressed substitute of the input data and can be easily decompressed back with the help of the network when
needed.
This form of compression in the data can be modeled as a form of dimensionality reduction.
When we think of dimensionality reduction, we tend to think of methods like PCA (Principal Component Analysis)
that form a lower-dimensional hyperplane to represent higher-dimensional data without losing too much
information.
However—
PCA can only build linear relationships. As a result, it is put at a disadvantage compared with methods like
undercomplete autoencoders that can learn non-linear relationships and, therefore, perform better in
dimensionality reduction.
This form of nonlinear dimensionality reduction, where the autoencoder learns a non-linear manifold, is also
termed manifold learning. If only linear activations are used, however, we reduce the undercomplete autoencoder to something that works on an equal footing with PCA.
The loss function used to train an undercomplete autoencoder is called reconstruction loss, as it is a check of
how well the image has been reconstructed from the input data.
Although the reconstruction loss can be anything depending on the input and output, we will use an L1 loss to
depict the term (also called the norm loss), represented by:
L(x, x̂) = |x − x̂|
where x̂ represents the predicted output and x represents the ground truth.
As the loss function has no explicit regularisation term, the only way to ensure that the model is not
memorising the input data is by regulating the size of the bottleneck and the number of hidden layers within the network.
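A minimal PyTorch-style sketch of an undercomplete autoencoder trained with an L1 reconstruction loss; the layer widths, bottleneck size, and toy batch are illustrative assumptions.

import torch
import torch.nn as nn

class UndercompleteAE(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # the bottleneck is smaller than the input, forcing compression
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck_dim))
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = UndercompleteAE()
criterion = nn.L1Loss()                           # reconstruction (norm) loss |x - x_hat|
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)                           # toy batch standing in for flattened images
loss = criterion(model(x), x)                     # the target is the input itself
optimizer.zero_grad(); loss.backward(); optimizer.step()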
2. Sparse autoencoders
Sparse autoencoders are similar to undercomplete autoencoders in that they use the same image as input and as the reconstruction target. However, the encoding is regulated differently: rather than through the size of the bottleneck alone, the
sparse autoencoder is regulated by changing the number of (active) nodes at each hidden layer.
Since it is not possible to design a neural network that has a flexible number of nodes at its hidden layers, sparse autoencoders instead work by penalizing the activations of neurons within the hidden layers.
In other words, the loss function has a term that calculates the number of neurons that have been activated and provides a penalty proportional to that number.
This penalty, called the sparsity function, prevents the neural network from activating more neurons and serves
as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the nodes, the sparsity regularizer
creates a penalty on the number of neurons that are activated for particular images during training, treating the regularization problem as a problem separate from the latent space
problem.
We can thus set latent space dimensionality at the bottleneck without worrying about regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated into the loss function.
L1 Loss: Here, we add the magnitude of the sparsity regularizer as we do for general regularizers:
L = L_reconstruction + λ Σ_i |a_i^(h)|
where a_i^(h) denotes the activation of the i-th neuron in hidden layer h and λ weights the sparsity term.
KL-Divergence: In this case, we consider the activations over a collection of samples at once rather than
summing them as in the L1 Loss method. We constrain the average activation of each neuron over this
collection.
Considering the ideal distribution to be a Bernoulli distribution, we include KL divergence within the loss to reduce
the difference between the current distribution of the activations and the ideal (Bernoulli) distribution:
Σ_j KL(ρ ‖ ρ̂_j) = Σ_j [ ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j)) ]
where ρ is the target (ideal) average activation, ρ̂_j is the average activation of neuron j in hidden layer h over the collection of samples, and the sum runs over all hidden neurons j.
The cost function of the variational autoencoder (VAE) is based on log-likelihood maximization. The cost function
consists of reconstruction and regularization error terms:
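A commonly used form of this objective (the evidence lower bound, to be maximized, per data point x, with approximate posterior q(z|x) and prior p(z)) is:
L_VAE = E_q(z|x)[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )
where the first term corresponds to the reconstruction error and the KL term acts as the regularizer.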
Denoising autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an image.
As opposed to the autoencoders we have already covered, this is the first of its kind that does not have the input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been added via digital
alterations. The noisy image is fed to the encoder-decoder architecture, and the output is compared with the original, clean ground-truth image.
While removing noise directly from the image seems difficult, the autoencoder performs this by mapping the input
data into a lower-dimensional manifold (like in undercomplete autoencoders), where filtering of noise becomes
much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction. The loss function typically used in these networks is an L2 or L1 reconstruction loss between the denoised output and the original, clean image.
One way we can modify the autoencoder to learn useful features is by changing the inputs: we add random
noise to the input and ask the network to recover the original form by removing the noise. This prevents
the autoencoder from simply copying the data from input to output, because the input contains random noise. We ask it
to subtract the noise and produce the meaningful underlying data. This is called a denoising autoencoder.
In a typical illustration, the first row contains the original images. In the second row, random
noise is added to the original images; this noise is called Gaussian noise. The autoencoder does not
receive the original images as input, but it is trained in such a way that it removes the noise
and regenerates the original images.
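A brief sketch of this idea, using a small fully connected autoencoder and Gaussian corruption of the inputs (the sizes and noise level are illustrative assumptions):

import torch
import torch.nn as nn

# a small fully connected autoencoder; sizes are toy choices
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(16, 784)                          # stand-in for a clean image batch
x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)    # add Gaussian noise

# key point: the noisy image goes in, but the loss compares against the clean image
loss = criterion(model(x_noisy), x_clean)
optimizer.zero_grad(); loss.backward(); optimizer.step()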
Contractive Autoencoders
Similar to other autoencoders, contractive autoencoders perform the task of learning a representation of the image while passing it through a bottleneck and reconstructing it in the decoder.
The contractive autoencoder also has a regularization term to prevent the network from learning the identity function and simply mapping the input onto the output.
Contractive autoencoders work on the basis that similar inputs should have similar encodings and a similar latent
space representation. It means that the latent space should not vary by a huge amount for minor variations in the
input.
To train a model that works along with this constraint, we have to ensure that the derivatives of the hidden layer activations with respect to the input are small.
Mathematically:
L = ||x − x̂||² + λ ||J_h(x)||²_F, with ||J_h(x)||²_F = Σ_ij ( ∂h_j(x) / ∂x_i )²
An important thing to note about the loss function (formed from the norm of the derivatives and the reconstruction loss) is that the two terms contradict each other.
While the reconstruction loss wants the model to tell differences between two inputs and observe variations in the
data, the frobenius norm of the derivatives says that the model should be able to ignore variations in the input
data.
Putting these two contradictory conditions into one loss function enables us to train a network where the hidden
layers now capture only the most essential information. This information is necessary to separate images while
ignoring variations that are not important.
In the penalty term above, h is the hidden layer for which the gradient is calculated, represented with respect to the input x
as ∂h/∂x. The gradient is summed over all training samples, and a Frobenius norm of the result is
taken.
Contractive Autoencoders (CAEs) are a type of autoencoder that incorporates a regularization term in the
loss function to enforce a contractive property on the learned latent representation. The goal of this
regularization is to make the autoencoder more robust to small changes in the input data. The contractive
term penalizes the model for sensitivity to input variations, helping to create a more stable and
meaningful representation in the latent space.
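As a sketch of how the contractive term can be added in practice, the following assumes a single sigmoid encoder layer, for which the Frobenius norm of the Jacobian has a simple closed form; the sizes and the penalty weight are illustrative assumptions.

import torch
import torch.nn as nn

W_enc = nn.Linear(784, 32)          # one-layer sigmoid encoder
W_dec = nn.Linear(32, 784)          # linear decoder
lam = 1e-4                          # assumed weight of the contractive term

x = torch.rand(16, 784)
h = torch.sigmoid(W_enc(x))         # hidden representation
x_hat = W_dec(h)

recon = ((x_hat - x) ** 2).mean()
# For a sigmoid layer, ||dh/dx||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
dh = (h * (1 - h)) ** 2                      # (batch, 32)
w_sq = (W_enc.weight ** 2).sum(dim=1)        # (32,)
contractive = (dh * w_sq).sum(dim=1).mean()

loss = recon + lam * contractive
loss.backward()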
Applications of Autoencoders.
1. Dimensionality reduction
Undercomplete autoencoders are those that are used for dimensionality reduction.
These can be used as a pre-processing step for dimensionality reduction, as they can perform fast and accurate dimensionality reduction without losing much information.
Furthermore, while dimensionality reduction procedures like PCA can only perform linear projections, undercomplete autoencoders can also capture non-linear relationships.
2. Image denoising
Autoencoders like the denoising autoencoder can be used for performing efficient and highly accurate image
denoising.
Unlike traditional methods of denoising, autoencoders do not search for noise; they extract the image from the
noisy data that has been fed to them by learning a representation of it. The representation is then decompressed to form a noise-free image.
Denoising autoencoders can thus denoise complex images that cannot be denoised via traditional methods.
3. Generation of image and time series data
Variational Autoencoders can be used to generate both image and time series data.
The parameterized distribution at the bottleneck of the autoencoder can be randomly sampled to generate
discrete values for latent attributes, which can then be forwarded to the decoder, leading to generation of image
data. VAEs can also be used to model time series data like music.
4. Anomaly detection
For example, consider an autoencoder that has been trained on a specific dataset P. For any image sampled
from the training dataset, the autoencoder is bound to give a low reconstruction loss and is expected to reconstruct the image well.
For any image which is not present in the training dataset, however, the autoencoder cannot perform the
reconstruction well, as the latent attributes are not adapted to an image that has never been seen by the
network.
As a result, the outlier image gives a very high reconstruction loss and can easily be identified as an anomaly.
Traditionally, training deep neural networks with many layers was challenging.
Pretraining involves successively adding a new hidden layer to a model and refitting,
allowing the newly added model to learn the inputs from the existing hidden layer,
often while keeping the weights for the existing hidden layers fixed. This gives the
technique the name “layer-wise” as the model is trained one layer at a time.
The technique is referred to as "greedy" because of the piecewise or layer-wise
approach to solving the harder problem of training a deep network. As an
optimization process, dividing the training process into a succession of layer-wise
training processes is seen as a greedy shortcut that likely leads to an aggregate of
locally optimal solutions, a shortcut to a good enough global solution.
Greedy algorithms break a problem into many components, then solve for the
optimal version of each component in isolation. Unfortunately, combining the
individually optimal components is not guaranteed to yield an optimal complete
solution.
It is common to use the word “pretraining” to refer not only to the pretraining stage
itself but to the entire two phase protocol that combines the pretraining phase and a
supervised learning phase. The supervised learning phase may involve training a
simple classifier on top of the features learned in the pretraining phase, or it may
involve supervised fine-tuning of the entire network learned in the pretraining phase.
Today, we now know that greedy layer-wise pretraining is not required to train fully
connected deep architectures, but the unsupervised pretraining approach was the
first method to succeed.
To mitigate the vanishing gradient problem, we use this technique. Let us see the mechanism of the
greedy layer-wise pretraining method. First, we make a base model with just the input and output layers; then we
train the model using the available dataset. After training the model, we remove the output layer and store
it in another variable. We add a new hidden layer to the model, which becomes the first hidden layer,
and re-add the output layer. Now there are three layers in the model: the input layer, hidden layer 1, and the
output layer. Once again, we train the model after inserting hidden layer 1. To
add one more hidden layer, we remove the output layer and set all existing layers as non-trainable (no further change
in the weights of the input layer and hidden layer 1). We then insert the new hidden layer 2 and re-add
the output layer, and train the model again. The model structure is now: input layer, hidden layer 1,
hidden layer 2, output layer. Repeat the above steps for every new hidden layer you want to add (each time
you insert a new hidden layer, train the model on the same dataset), as sketched below.
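A compact Keras-style sketch of this procedure, following the steps just described; the layer sizes, the binary-classification setup, and the placeholder data X, y are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

def add_layer_and_retrain(model, X, y):
    # greedy step: drop the output layer, freeze what remains,
    # insert a new hidden layer, re-attach the output layer, retrain
    output_layer = model.layers[-1]
    model.pop()                                       # remove the current output layer
    for layer in model.layers:
        layer.trainable = False                       # keep earlier weights fixed
    model.add(layers.Dense(64, activation="relu"))    # the new hidden layer
    model.add(output_layer)                           # re-attach the stored output layer
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(X, y, epochs=5, verbose=0)
    return model

# base model: one hidden layer plus the output layer
model = keras.Sequential([layers.Dense(64, activation="relu", input_shape=(100,)),
                          layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X, y, ...)  then call add_layer_and_retrain(model, X, y) once per new hidden layer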
The greedy layer-wise training is a pre-training algorithm that aims to train each layer of a DBN in a sequential way,
feeding lower layers’ results to the upper layers. This renders a better optimization of a network than traditional
training algorithms, i.e. training method using stochastic gradient descent à la RBMs.
In terms of computational units, deep structures such as the DBN can be much more efficient (25) than their
shallow counterparts since they require fewer units (23) for performing the same function. Multi-layer deep
structures can represent abstract concepts and varying functions by keeping many non-linear layers in a
hierarchy (25). From a lower level to a higher level in this hierarchy, the layers' abstractness ascends in terms of the
complexity of objects they are representing (in the illustration below, the top layer shows all elemental pixels
whereas the images are kept in the bottom layer). The process of how we divide more complex objects into
simpler objects is by modeling a set of joint distribution between each visible and hidden layer.
Domain Adaptation
In domain adaptation, we solely change the underlying datasets and thus the features of our machine learning
model. However, the feature space stays the same, and the predictive function stays the same.
Application
Applying domain adaptation to our example, we could think of a significantly different, but somehow similar
dataset. This could still contain dog and cat pictures, but those that are vastly different from the ones in our
source dataset. For example, in our source data set, we only have poodles and black cats. In our target dataset, on
the other hand, we could have schnauzers and white cats.
Now, how can we ensure that our predictive function will still predict the right labels for our dataset? Domain
adaptation delivers an answer for this question.
We consider three types of domain adaptation. These are defined by the number of labeled examples in the
underlying domain:
Unsupervised domain adaptation works with a source domain that has labeled examples, but also
unlabeled examples. The target domain only has unlabeled examples.
Semi-supervised domain adaptation expects some of the examples in the target domain are labeled.
Supervised domain adaptation indicates that all examples in the target domain are labeled.
Within domain adaptation, we can look a bit closer at pragmatic approaches, because changing only
the dataset makes it much easier to tune our model for the new machine learning process.
Divergence-based domain adaptation is a method of testing if two samples are from the same distribution. As we
have seen in our blueprint illustration, the features that are extracted from the datasets are vastly different. This
difference causes our predictive function to not work as intended. If it’s fed by features that it was not trained for, it
malfunctions. This is also the reason why we accept different features but require the same feature space.
For this reason, divergence-based domain adaptation creates features that are “equally close” to both
datasets. This can be achieved by applying various algorithms, including the Maximum Mean Discrepancy,
Correlation Alignment, Contrastive Domain Discrepancy, or the Wasserstein Metric.
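As one illustration, a (biased) estimate of the squared Maximum Mean Discrepancy with an RBF kernel can be computed from two batches of features as below; this is a generic NumPy sketch, not tied to a particular domain-adaptation library, and the feature sizes are toy assumptions.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # pairwise squared distances between rows of A and rows of B
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(source, target, gamma=1.0):
    # biased estimate of squared MMD between two feature samples
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2 * k_st

source_feats = np.random.randn(100, 16)          # toy source-domain features
target_feats = np.random.randn(100, 16) + 0.5    # toy, shifted target-domain features
print(mmd2(source_feats, target_feats))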
In the iterative approach, we use our prediction function to label those samples of our target domain for which we
have very high confidence. Doing so, we retrain our function, creating a prediction function that fits our
target domain more and more as we apply it to samples with lower confidence.
Transfer Learning:
Let’s apply this concept to our example of dog and cat pictures. Now imagine we have a second dataset that shows
pictures of cows and horses. Cows and horses are significantly different from cats.
Nevertheless, they are all mammals; they have four feet and a similar shape. As a solution, we can take the layers
that describe the shape of the object we want to detect, whether it is a dog, a cow, or a horse,
and freeze them. Freezing means we cut these layers out of the source predictive function, put them into the new
predictive function, and train the new function without training the frozen layers:
In the accompanying illustration, the green layers are the ones we have to train while creating our
predictive function; the rest of the predictive function is created using the already existing,
frozen layers from the source domain. The frozen layers stay untouched during the training process.
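A minimal PyTorch-style sketch of this freezing idea, using a generic ImageNet-pretrained backbone; the choice of ResNet-18, the two-class head, and the weights argument (available in recent torchvision versions) are illustrative assumptions.

import torch.nn as nn
from torchvision import models

# load a network pretrained on the source domain (ImageNet weights)
backbone = models.resnet18(weights="IMAGENET1K_V1")

# freeze all existing layers so their weights stay untouched during training
for param in backbone.parameters():
    param.requires_grad = False

# replace the final classifier with a new head for the target task (e.g. 2 classes)
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
# only the new head's parameters will be updated when training on the target data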
What is transfer learning?
Transfer learning is a technique in machine learning where a model trained on one task is used as the starting
point for a model on a second task. This can be useful when the second task is similar to the first task, or when
there is limited data available for the second task. By using the learned features from the first task as a starting
point, the model can learn more quickly and effectively on the second task. This can also help to prevent
overfitting, as the model will have already learned general features that are likely to be useful in the second task.
Why transfer learning?
Many deep neural networks trained on images have a curious phenomenon in common: in the early layers of the
network, a deep learning model tries to learn a low level of features, like detecting edges, colors, variations of
intensities, etc. Such features appear not to be specific to a particular dataset or task, because no matter
what type of image we are processing, whether for detecting a lion or a car, these low-
level features must be detected in both cases. All these features occur regardless of the exact cost function or image dataset. Thus, learning these
features in one task, such as detecting lions, can be reused in other tasks, such as detecting humans.
Advantages :
Speed up the training process: By using a pre-trained model, the model can learn more quickly and
effectively on the second task, as it already has a good understanding of the features and patterns in
the data.
Better performance: Transfer learning can lead to better performance on the second task, as the
model can leverage the knowledge it has gained from the first task.
Handling small datasets: When there is limited data available for the second task, transfer learning
can help to prevent overfitting, as the model will have already learned general features that are likely
to be useful in the second task.
Disadvantages:
Domain mismatch: The pre-trained model may not be well-suited to the second task if the two tasks
are vastly different or the data distribution between the two tasks is very different.
Overfitting: Transfer learning can lead to overfitting if the model is fine-tuned too much on the
second task, as it may learn task-specific features that do not generalize well to new data.
Complexity: The pre-trained model and the fine-tuning process can be computationally expensive
and may require specialized hardware.
Transfer learning is a machine learning method where a model already developed for a task is reused in
another task. Transfer learning is a popular approach in deep learning, as it enables the training of
deep neural networks with less data compared to having to create a model from scratch.
Typically, training a model takes a large amount of compute resources and time. Using a pre-trained model as a starting point reduces both.
Machine learning algorithms are typically designed to address isolated tasks. Through transfer learning,
methods are developed to transfer knowledge from one or more of these source tasks to improve learning
in a related target task. Knowledge from an already trained machine learning model must be similar to the
new task to be transferable. For example, the knowledge gained from recognizing an image of a dog in a
supervised machine learning system could be transferred to a new system to recognize images of cats.
The new system will filter out images it already recognizes as a dog.
Distributed Representations
Distributed representations are a fundamental concept in the field of machine learning and natural
language processing (NLP). They refer to a way of representing data, typically words or phrases, as
continuous vectors in a high-dimensional space. Unlike local representations, where each entity is
represented by a unique identifier in an isolated manner (such as one-hot encoding), distributed
representations capture a notion of similarity and semantic meaning by allowing an entity to be
represented by a pattern of values across many dimensions.
In distributed representations, also known as embeddings, the idea is that the "meaning" or "semantic
content" of a data point is distributed across multiple dimensions. For example, in NLP, words with similar
meanings are mapped to points in the vector space that are close to each other. This closeness is not
arbitrary but is learned from the context in which words appear. This context-dependent learning is often
achieved through neural network models, such as Word2Vec or GloVe, which process large corpora of
text to learn these representations.
One of the key advantages of distributed representations is their ability to capture fine-grained semantic
relationships. For instance, in a well-trained word embedding space, synonyms would be represented by
vectors that are close together, and it's even possible to perform arithmetic operations with these vectors
that correspond to meaningful semantic operations (e.g., "king" - "man" + "woman" might result in a vector
close to "queen").
Distributed representations have a wide range of applications, particularly in tasks that involve natural
language understanding. They are used for:
Moreover, distributed representations are not limited to text data. They can also be applied to other types
of data, such as images, where deep learning models learn to represent images as high-dimensional
vectors that capture visual features and semantics.
Learning distributed representations typically involves training a model on a task that forces it to capture
semantic or feature similarities. For text, this might involve predicting a word given its surrounding words
(continuous bag of words model) or predicting surrounding words given a word (skip-gram model). During
this process, the model learns to place semantically similar words closer together in the vector space.
For images, convolutional neural networks (CNNs) learn distributed representations by being trained to
recognize objects within images. Through layers of convolutions and pooling operations, CNNs learn to
represent images in a way that captures hierarchical visual features.
Despite their effectiveness, distributed representations come with their own set of challenges. One major
issue is the requirement of large amounts of data to learn meaningful representations. Without sufficient
data, the embeddings may not capture the true semantic relationships. Additionally, distributed
representations can be computationally expensive to learn, requiring significant processing power and
memory, especially for large datasets.
Another challenge is the interpretability of these representations. Unlike local representations, where each
dimension corresponds to a specific feature, the dimensions in distributed representations do not have an
easily interpretable meaning. This can make it difficult to understand what the model has learned and to
diagnose issues when the model makes incorrect predictions.
Distributed representations in deep learning refer to the representation of data, such as features or
concepts, by using distributed patterns of activation across multiple neurons or units in a neural network.
This is in contrast to traditional, non-distributed representations where each feature or concept is
represented by the activity of a single neuron or a small set of neurons.
In the context of deep learning, distributed representations are often learned automatically from the data
through the training process. This is typically done in neural networks with multiple layers, such as deep
neural networks or deep learning models. The idea is that each layer in the network learns increasingly
abstract and complex features, and the final representation of the input data is distributed across the
activations of many neurons in the last layer.
Here are some key points about distributed representations in deep learning:
1. Feature Learning:
In deep learning, each layer of a neural network can be thought of as learning hierarchical
features. The lower layers might capture simple features like edges or textures, while
higher layers combine these features to represent more complex patterns or concepts.
2. Sparse vs. Distributed Representations:
Traditional approaches often used sparse representations, where each feature or concept
is represented by the activity of a specific neuron. In contrast, distributed representations
involve the collective activity of multiple neurons to represent a feature or concept.
3. Generalization:
Distributed representations often lead to better generalization. This means that the
model can perform well on new, unseen data because it has learned a more abstract and
versatile representation of the underlying patterns in the training data.
4. Efficiency:
Distributed representations are often more efficient in terms of storage and computation.
They can capture complex relationships with fewer parameters compared to non-
distributed representations.
5. Word Embeddings:
In natural language processing (NLP), distributed representations are commonly used for
words through techniques like word embeddings. Word embeddings, such as Word2Vec,
GloVe, or FastText, represent words as vectors in a high-dimensional space, capturing
semantic relationships between words.
6. Autoencoders and Variational Autoencoders:
Autoencoders and variational autoencoders are types of neural network architectures that
can learn distributed representations. Autoencoders learn an efficient representation of
the input data by encoding it into a lower-dimensional space and then decoding it back
to the original input. Variational autoencoders introduce a probabilistic element, allowing
them to generate new samples in addition to encoding and decoding.
DenseNet, short for Densely Connected Convolutional Networks, is a variant of Convolutional Neural
Networks (CNNs) designed to address some challenges associated with traditional CNN architectures.
DenseNet was introduced by Gao Huang, Zhuang Liu, and Laurens van der Maaten in their paper "Densely
Connected Convolutional Networks" in 2017.
The key idea behind DenseNet is to establish dense connections between layers, allowing each layer to
receive direct inputs from all preceding layers. This is in contrast to traditional CNN architectures, where
each layer typically connects only to the immediately preceding layer. The dense connectivity in DenseNet
leads to several advantages, including parameter efficiency, feature reuse, and improved gradient flow
during training.
1. Dense Connectivity:
In DenseNet, each layer receives input from all preceding layers and, in turn, passes its
feature maps to all subsequent layers. This dense connectivity is achieved by
concatenating the feature maps from all previous layers as input to the current layer.
2. Dense Blocks:
The building blocks of DenseNet are dense blocks, which consist of a series of densely
connected layers. Within a dense block, the output of each layer is concatenated with the
feature maps of all preceding layers. This promotes feature reuse and allows the network
to maintain a compact internal representation.
3. Transition Layers:
Between dense blocks, transition layers are used to reduce the spatial dimensions (width
and height) of the feature maps and control the number of channels. This helps in
reducing the computational cost and allows the network to scale more efficiently.
4. Batch Normalization and ReLU:
DenseNet typically employs batch normalization and rectified linear unit (ReLU) activation
functions after each convolutional layer. Batch normalization helps with training stability
and accelerates convergence, while ReLU introduces non-linearity.
5. Growth Rate:
The growth rate is a hyperparameter that defines the number of feature maps added to
the network at each layer within a dense block. A higher growth rate increases the
number of parameters but can also enhance the representational capacity of the network.
6. Global Average Pooling:
DenseNet often uses global average pooling instead of fully connected layers at the end
of the network. Global average pooling helps reduce the number of parameters and
encourages spatial hierarchies in the learned features.
DenseNet has been shown to achieve competitive performance on various computer vision tasks, such as
image classification and object detection, while requiring fewer parameters compared to traditional CNN
architectures. The dense connectivity facilitates feature reuse, which is particularly beneficial in scenarios
with limited data or computational resources.
DenseNet is one of the newer architectures in neural networks for visual object recognition. DenseNet is quite
similar to ResNet, with some fundamental differences: ResNet uses an additive method (+) that merges
the previous layer (identity) with the future layer, whereas DenseNet concatenates the output of the
previous layer with the future layer.
Why Do We Need DenseNet?
DenseNet was developed specifically to improve the declined accuracy caused by the vanishing gradient
in high-level neural networks. In simpler terms, due to the longer path between the input layer and the
output layer, the information vanishes before reaching its destination.
In short, DenseNet-121 has the following layers:
1 7x7 Convolution
58 3x3 Convolution
61 1x1 Convolution
4 AvgPool
Introduction to DenseNet
In a traditional feed-forward Convolutional Neural Network (CNN), each convolutional layer except the
first one (which takes in the input), receives the output of the previous convolutional layer and produces
an output feature map that is then passed on to the next convolutional layer. Therefore, for 'L' layers,
there are 'L' direct connections; one between each layer and the next layer.
However, as the number of layers in the CNN increases, i.e. as the network gets deeper, the 'vanishing gradient'
problem arises. This means that as the path for information from the input to the output layers increases,
it can cause certain information to 'vanish' or get lost which reduces the ability of the network to train
effectively.
DenseNets resolve this problem by modifying the standard CNN architecture and simplifying the
connectivity pattern between layers. In a DenseNet architecture, each layer is connected directly with
every other layer, hence the name Densely Connected Convolutional Network. For 'L' layers, there are
L(L+1)/2 direct connections.
DenseNet Architecture & Components
Components of DenseNet include:
Connectivity
DenseBlocks
Growth Rate
Bottleneck layers
Connectivity
In each layer, the feature maps of all the previous layers are not summed, but concatenated and used as
inputs. Consequently, DenseNets require fewer parameters than an equivalent traditional CNN, and this
allows for feature reuse as redundant feature maps are discarded. So, the lth layer receives the feature-
maps of all preceding layers, x0,...,xl-1, as input:
x_l = H_l([x0, x1, ..., xl-1])
where [x0, x1, ..., xl-1] is the concatenation of the feature-maps, i.e. the output produced in all the layers
preceding l (0,...,l-1). The multiple inputs of H_l are concatenated into a single tensor to ease
implementation.
DenseBlocks
The use of the concatenation operation is not feasible when the size of feature maps changes. However,
an essential part of CNNs is the down-sampling of layers which reduces the size of feature-maps through
dimensionality reduction to gain higher computation speeds.
To enable this, DenseNets are divided into DenseBlocks, where the dimensions of the feature maps
remains constant within a block, but the number of filters between them is changed. The layers between
the blocks are called Transition Layers, which reduce the number of channels to half of that of the
existing channels.
For each layer, from the equation above, Hl is defined as a composite function which applies three
consecutive operations: batch normalization (BN), a rectified linear unit (ReLU) and a convolution (Conv).
In the above image, a deep DenseNet with three dense blocks is shown. The layers between two
adjacent blocks are the transition layers which perform downsampling (i.e. change the size of the feature-
maps) via convolution and pooling operations, whilst within the dense block the size of the feature maps
is the same to enable feature concatenation.
Growth Rate
One can think of the features as a global state of the network. The size of the feature map grows after a
pass through each dense layer with each layer adding 'K' features on top of the global state (existing
features). This parameter 'K' is referred to as the growth rate of the network, which regulates the amount
of information added in each layer of the network. If each function H_l produces k feature maps, then the
lth layer has
k0 + k × (l − 1)
input feature-maps, where k0 is the number of channels in the input layer. Unlike existing network
architectures, DenseNets can have very narrow layers.
Bottleneck layers
Although each layer only produces k output feature-maps, the number of inputs can be quite high,
especially for further layers. Thus, a 1x1 convolution layer can be introduced as a bottleneck layer before
each 3x3 convolution to improve the efficiency and speed of computations.
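A compact PyTorch-style sketch of one DenseNet layer with a bottleneck, showing the concatenation-based connectivity; the channel counts and the four-layer block are illustrative assumptions that follow the growth-rate convention.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # BN -> ReLU -> 1x1 conv (bottleneck) -> BN -> ReLU -> 3x3 conv, then concatenate
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        inter = 4 * growth_rate                  # a common bottleneck width
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(inter)
        self.conv2 = nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return torch.cat([x, out], dim=1)        # dense connectivity: k new feature maps appended

# a dense block of 4 layers: the channel count grows by the growth rate each time
layers, channels, k = [], 64, 32
for _ in range(4):
    layers.append(DenseLayer(channels, k))
    channels += k
block = nn.Sequential(*layers)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)   # torch.Size([1, 192, 32, 32])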
Unit VI
Applications of Deep Learning
Generative Adversarial Networks (GANs) were introduced in 2014 by Ian J. Goodfellow and co-authors. GANs
perform unsupervised learning tasks in machine learning. It consists of 2 models that automatically discover and
learn the patterns in input data.
They compete with each other to scrutinize, capture, and replicate the variations within a dataset. GANs can be used
to generate new examples that plausibly could have been drawn from the original dataset.
Shown below is an example of a GAN. There is a database that has real 100 rupee notes. The generator neural
network generates fake 100 rupee notes. The discriminator network will help identify the real and fake notes.
What is a Generator?
A Generator in GANs is a neural network that creates fake data to be trained on the discriminator. It learns to
generate plausible data. The generated examples/instances become negative training examples for the discriminator.
It takes a fixed-length random vector carrying noise as input and generates a sample.
The main aim of the Generator is to make the discriminator classify its output as real. The part of the GAN that
trains the Generator includes:
generator network, which transforms the random input into a data instance
generator loss, which penalizes the Generator for failing to fool the discriminator
The backpropagation method is used to adjust each weight in the right direction by calculating the weight's impact
on the output. It is also used to obtain gradients and these gradients can help change the generator weights.
Generator:
The Generator is a neural network that takes random noise or a random vector as input and
generates synthetic data samples. In the context of image generation, for example, the generator
produces images that ideally are realistic and similar to the training data. The goal of the
generator is to produce data that is convincing enough to fool the discriminator.
What is a Discriminator?
The Discriminator is a neural network that identifies real data from the fake data created by the Generator. The
discriminator's training data comes from two different sources:
The real data instances, such as real pictures of birds, humans, currency notes, etc., are used by the
Discriminator as positive samples during training.
The fake data instances created by the Generator are used as negative examples during the training
process.
While training the discriminator, it connects to two loss functions. During discriminator training, the discriminator
ignores the generator loss and just uses the discriminator loss.
In the process of training the discriminator, the discriminator classifies both real data and fake data from the
generator. The discriminator loss penalizes the discriminator for misclassifying a real data instance as fake or a fake
data instance as real.
The discriminator updates its weights through backpropagation from the discriminator loss through the discriminator
network.
Discriminator:
The Discriminator is another neural network that evaluates the authenticity of a given input,
determining whether it is a real (from the training data) or fake (generated by the generator)
sample. The discriminator is trained to correctly classify real and fake samples. The objective is to
make the discriminator as accurate as possible in distinguishing between real and generated data.
TRAINING
Step 1: Define the problem. Do you want to generate fake images or fake text? Here you should completely define the problem and collect the data for it.
Step 2: Define the architecture of the GAN. Define what your GAN should look like. Should both your generator and
discriminator be multi-layer perceptrons, or convolutional neural networks? This step will depend on what problem you are trying to solve.
Step 3: Train the discriminator on real data for n epochs. Get the data you want to generate fakes of and train the
discriminator to correctly predict it as real. Here the value n can be any natural number between 1 and infinity.
Step 4: Generate fake inputs for the generator and train the discriminator on fake data. Get the generated data and let the
discriminator correctly predict it as fake.
Step 5: Train the generator with the output of the discriminator. Now that the discriminator is trained, you can get its
predictions and use them as an objective for training the generator. Train the generator to fool the discriminator.
Step 6: Repeat steps 3 to 5 for a few epochs.
Step 7: Check the fake data manually to see if it seems legitimate. If it seems appropriate, stop training; else go back to step
3. This is a bit of a manual task, as hand-evaluating the data is the best way to check its fakeness. When this step is
over, you can evaluate whether the GAN is performing well enough.
Training Generative Adversarial Networks (GANs) involves an iterative process where the generator and
discriminator are trained in a competitive manner. The goal is to have the generator produce realistic data
that is difficult for the discriminator to distinguish from real data. Here is a high-level overview of the
training process for GANs:
1. Initialize Networks:
Initialize the weights and biases of the generator and discriminator networks. These
networks are typically neural networks, and their architectures depend on the specific task
(e.g., image generation).
2. Define Loss Functions:
Define the loss functions for both the generator and discriminator. The discriminator aims
to minimize the binary cross-entropy loss, correctly classifying real and generated
samples. The generator aims to maximize the same loss, fooling the discriminator into
classifying generated samples as real.
3. Generate Synthetic Samples:
Input random noise into the generator to produce synthetic samples. These samples are
initially random and are not likely to resemble the real data.
4. Train Discriminator:
Feed a batch of real samples from the training set and the corresponding synthetic
samples generated by the generator to the discriminator.
Compute the discriminator loss based on how well it classifies real and generated
samples.
Update the discriminator's weights using gradient descent to minimize its loss.
5. Train Generator:
Generate a new batch of synthetic samples using the generator.
Feed these generated samples into the discriminator.
Compute the generator loss based on the discriminator's response to the generated
samples.
Update the generator's weights so as to increase the discriminator's error on generated
samples (equivalently, to make the generated samples look real to the discriminator). This
encourages the generator to produce samples that are more convincing to the
discriminator.
6. Iterate:
Repeat steps 3-5 for a predefined number of iterations or until convergence.
The adversarial training process involves the continuous back-and-forth training of the
generator and discriminator. The generator becomes better at generating realistic
samples, and the discriminator becomes more accurate in distinguishing between real
and generated samples.
7. Monitoring and Evaluation:
Monitor the performance of the generator and discriminator using metrics such as
generator loss, discriminator loss, and visual inspection of generated samples.
Evaluate the quality of generated samples on a validation set or using domain-specific
metrics.
8. Fine-Tuning and Hyperparameter Tuning:
Fine-tune hyperparameters such as learning rates, batch sizes, and network architectures
based on the observed performance.
Experiment with architectural modifications or regularization techniques to improve
stability and convergence.
9. End Training:
Stop training when the generator produces high-quality, realistic samples, and the
discriminator performs well in distinguishing between real and generated data.
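The training loop described above can be condensed into a short PyTorch-style sketch; the network sizes, learning rates, and the random stand-in for real data are illustrative assumptions.

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, data_dim)     # stand-in for a batch of real data
    z = torch.randn(32, latent_dim)
    fake = G(z)

    # train the discriminator: real -> 1, fake -> 0 (fake is detached so G is not updated here)
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # train the generator: try to make the discriminator output 1 on fake samples
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()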
GAN variants
Generative Adversarial Networks (GANs) have inspired numerous variants and extensions, each designed
to address specific challenges or to cater to different applications. Here are some notable GAN variants:
Architecture of autoencoders
Encoder: An encoder is a feedforward, fully connected neural network that compresses the input
into a latent space representation and encodes the input image as a compressed representation in a
reduced dimension. The compressed image is the distorted version of the original image.
Code: This part of the network contains the reduced representation of the input that is fed into the
decoder.
Decoder: Decoder is also a feedforward network like the encoder and has a similar structure to the
encoder. This network is responsible for reconstructing the input back to the original dimensions
from the code.
First, the input goes through the encoder where it is compressed and stored in the layer called Code, then the
decoder decompresses the original input from the code. The main objective of the autoencoder is to get an output
identical to the input.
Note that the decoder architecture is typically the mirror image of the encoder. This is not a requirement, but it is usually the
case. The only requirement is that the dimensionality of the input and output must be the same.
An autoencoder is a type of artificial neural network used for unsupervised learning. It consists of an
encoder and a decoder, and its primary purpose is to learn efficient representations of data. The
architecture of an autoencoder is relatively simple, comprising two main components:
1. Encoder:
The encoder takes input data and maps it to a lower-dimensional representation, often
referred to as a "latent space" or "encoding." The encoder's role is to compress the input
data into a more compact representation that captures the essential features.
Architecture:
The encoder typically consists of one or more layers of neurons, with each layer
performing a linear transformation followed by a non-linear activation function.
Common activation functions include Rectified Linear Units (ReLU) or hyperbolic
tangent (tanh).
The number of neurons in the output layer of the encoder determines the
dimensionality of the latent space.
Example:
Input Layer (e.g., features) -> Hidden Layer 1 (with ReLU activation) -> Hidden Layer 2 (with
ReLU activation) -> ... -> Output Layer (Latent Space)
2. Latent Space:
The latent space is the compressed representation of the input data produced by the
encoder. It is a lower-dimensional space that ideally captures the most relevant
information from the input.
3. Decoder:
The decoder takes the encoded representation from the latent space and reconstructs the
input data. Its goal is to generate output data that closely matches the original input.
Architecture:
Similar to the encoder, the decoder comprises one or more layers of neurons.
Each layer performs a linear transformation followed by a non-linear activation
function, reconstructing the data.
Example:
Latent Space -> Hidden Layer 1 (with ReLU activation) -> Hidden Layer 2 (with ReLU
activation) -> ... -> Output Layer (Reconstructed Input)
4. Loss Function:
The autoencoder is trained by minimizing a loss function that measures the difference
between the input data and its reconstruction. Mean Squared Error (MSE) is a common
choice for this purpose.
5. Training:
During training, the autoencoder learns to encode and decode the data by adjusting the
weights and biases of its neural network layers. The process involves feeding input data
into the encoder, obtaining the encoded representation, passing it through the decoder,
and comparing the reconstructed output with the original input.
Overall, the architecture of an autoencoder is characterized by the encoding and decoding stages, with
the neural network layers in each stage responsible for transforming and reconstructing the data.
Autoencoders are widely used for tasks such as data compression, denoising, and feature learning, as they
can capture meaningful representations of complex data.
Denoising in the context of deep learning refers to the process of removing noise or unwanted variations
from data. This concept is particularly relevant in various applications where the input data may be
corrupted by noise, artifacts, or other undesired elements. Denoising techniques are often employed to
enhance the quality of data, making it more suitable for downstream tasks such as image recognition,
speech processing, or any other domain where clean and accurate data is crucial. Here are a few key
aspects of denoising in the application of deep learning:
1. Image Denoising:
Objective: In image processing, denoising aims to remove unwanted artifacts or noise
from images.
Application: Denoising is widely used in medical imaging, surveillance, and photography
to improve the quality and clarity of images.
Deep Learning Technique: Denoising autoencoders and convolutional neural networks
(CNNs) are commonly used for image denoising. These models learn to map noisy
images to their clean counterparts (a minimal sketch follows after this list).
2. Speech Denoising:
Objective: In speech processing, denoising focuses on removing background noise or
distortions from audio signals.
Application: Speech denoising is crucial in voice recognition systems,
telecommunications, and audio processing applications.
Deep Learning Technique: Recurrent Neural Networks (RNNs) and Long Short-Term
Memory networks (LSTMs) are often used for modeling sequential data, including the
denoising of speech signals.
3. Data Preprocessing:
Objective: Denoising is an essential step in data preprocessing to improve the quality of
input data for subsequent tasks.
Application: Clean and denoised data is crucial for tasks like classification, regression,
and other machine learning applications.
Deep Learning Technique: Autoencoders, especially denoising autoencoders, are
effective for learning robust representations of data by removing noise during the
training process.
4. Sensor Data Denoising:
Objective: Denoising is applied to sensor data to filter out noise introduced during data
acquisition.
Application: In fields like Internet of Things (IoT), sensor data denoising is critical for
accurate monitoring and decision-making.
Deep Learning Technique: Autoencoders or specialized architectures tailored to the
characteristics of sensor data may be employed for effective denoising.
5. Document Denoising:
Objective: Denoising can be used to clean up noisy or distorted text in documents.
Application: Document denoising is valuable in optical character recognition (OCR) and
natural language processing tasks.
Deep Learning Technique: Recurrent neural networks or transformer-based models may
be employed for document denoising.
6. Video Denoising:
Objective: In video processing, denoising is applied to remove noise or artifacts from
video sequences.
Application: Video denoising is crucial in video surveillance, broadcasting, and video
analytics.
Deep Learning Technique: Spatiotemporal architectures, such as 3D CNNs, can be
effective for video denoising tasks.
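As an illustration of how a denoising model is trained, the sketch below corrupts each input with Gaussian noise
and asks a small fully connected autoencoder to reconstruct the clean version. The architecture, the noise level
(0.3), and the dummy data are assumptions made for this example only.
Example (Python sketch):
import torch
import torch.nn as nn

# A small fully connected denoising autoencoder
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),   # latent code
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784),
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 784)                          # dummy batch of clean inputs
for _ in range(100):
    noisy = clean + 0.3 * torch.randn_like(clean)    # corrupt the input with Gaussian noise
    optimizer.zero_grad()
    loss = criterion(model(noisy), clean)            # the target is the clean input, not the noisy one
    loss.backward()
    optimizer.step()

The key difference from a plain autoencoder is that the network sees the noisy input but is penalized against the
clean target, which forces it to learn features that are robust to the corruption.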
Sparsity is a concept in deep learning that refers to a situation where a large number of elements in a
particular representation are zero or close to zero. Introducing sparsity into neural networks can have
several benefits, and it is applied in various ways across deep learning, for example in sparse autoencoders
(which penalize the latent activations), L1 regularization of weights, and the pruning of near-zero weights
after training.
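One common way to introduce sparsity, mentioned above, is to add an L1 penalty on the latent activations of an
autoencoder so that most of them are pushed toward zero. The penalty weight (1e-3), the layer sizes, and the
dummy data below are assumptions for illustration.
Example (Python sketch):
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                              # dummy inputs
for _ in range(100):
    optimizer.zero_grad()
    z = encoder(x)                                   # latent activations
    # Reconstruction loss plus an L1 term that drives most activations toward zero
    loss = nn.functional.mse_loss(decoder(z), x) + 1e-3 * z.abs().mean()
    loss.backward()
    optimizer.step()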
Greedy layer-wise pre-training is a technique used in deep learning, specifically in the training of deep
neural networks, such as autoencoders and deep belief networks. The approach involves training
individual layers of a deep network one at a time in a "greedy" manner, where each layer is pretrained as
an unsupervised model before fine-tuning the entire network.
1. Initialization:
Start with an initial shallow model, typically a single-layer neural network or an
unsupervised learning model.
2. Greedy Layer-wise Training:
Train the model layer by layer in a "greedy" manner. This means that each layer is trained
independently without considering the other layers. The input to each layer is the output
of the previously trained layers.
Commonly used unsupervised learning methods for pre-training include Restricted
Boltzmann Machines (RBMs) or autoencoders.
3. Stacking Layers:
Once each layer is pretrained, stack them together to create a deep neural network. The
pretrained layers serve as the initial weights for the corresponding layers in the deep
network.
4. Fine-tuning:
Fine-tune the entire deep network using supervised learning with labeled data. This
involves adjusting the weights of the entire network to minimize a supervised loss
function (a rough code sketch of steps 1-4 is given after this list).
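The four steps above can be sketched in code. The version below pretrains each layer as a shallow autoencoder,
stacks the encoders, and then fine-tunes the stack with a classifier head; the layer sizes, number of steps, and
dummy data are assumptions for illustration, not a definitive implementation.
Example (Python sketch):
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, out_dim, steps=100):
    # Step 2: train one layer as a shallow autoencoder and keep only its encoder
    enc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
    dec = nn.Linear(out_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)
        loss.backward()
        opt.step()
    return enc

x = torch.rand(256, 784)                             # dummy unlabeled data
sizes = [784, 256, 64]                               # assumed layer sizes

# Steps 1-2: greedy layer-wise training; each layer sees the output of the previously trained layers
encoders, h = [], x
for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
    enc = pretrain_layer(h, in_dim, out_dim)
    encoders.append(enc)
    h = enc(h).detach()                              # input for the next layer

# Step 3: stack the pretrained encoders and add a classifier on top
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))

# Step 4: fine-tune the whole network with labeled data (dummy labels here)
y = torch.randint(0, 10, (256,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()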
The motivation behind greedy layer-wise pre-training is that it helps the model learn better
representations of the data in a step-wise manner. Training individual layers independently can
sometimes lead to more robust and meaningful features in each layer. The pretrained layers capture
useful hierarchical representations that can be fine-tuned for the specific task at hand.
This approach was particularly popular in the early days of deep learning when training very deep
networks directly with backpropagation and stochastic gradient descent was computationally challenging.
Greedy layer-wise pre-training acted as a form of initialization, providing a good starting point for the
subsequent supervised fine-tuning.
However, it's worth noting that recent advances in optimization algorithms, increased computing power,
and improvements in network architectures have diminished the importance of greedy layer-wise pre-
training in some contexts. In many cases, end-to-end training of deep networks from scratch using
techniques like batch normalization and advanced weight initialization has proven to be effective.
Greedy layer-wise pretraining is called so because it optimizes one layer at a time, greedily. After the
unsupervised training, there is usually a fine-tuning stage, in which a joint supervised training algorithm is
applied to the whole network. Studies have shown that, on average, pretraining was slightly harmful, but for many
tasks it was significantly helpful. Unsupervised pretraining combines two ideas: 1) the choice of initial parameters
of a deep neural network can have a significant regularizing effect; 2) learning about the input distribution can
help with learning the mapping from inputs to outputs. Neither idea is fully understood.
A. From the regularizing view, it is possible that pretraining initializes the model in a region of parameter space
that would otherwise be inaccessible. But it is difficult to tell which aspects of the pretrained parameters are
retained during the supervised training stage, which makes this kind of analysis intractable. There are two ways to
work around this: 1) train the supervised and unsupervised objectives simultaneously; 2) freeze the parameters of
the feature extractor and use supervised learning only to add a classifier on top of it. Thinking of this process as
a regularizer, i.e. a way to avoid overfitting, we can expect unsupervised pretraining to be most helpful when there
is little labeled data and a large amount of unlabeled data.
B. From the representation view, the idea is that some features that are useful for the unsupervised task are
also useful for supervised tasks, but this is not understood at a mathematical or theoretical level.
We can expect unsupervised pretraining to be more effective when the initial representation is poor. The use of
word embeddings is a good example: learned word embeddings naturally encode similarity between words, whereas
one-hot representations treat every pair of distinct words as equally dissimilar.
C. Other factors also matter. For example, unsupervised pretraining is likely to be most useful when the function
to be learned is very complicated.
Disadvantages: unsupervised pretraining has the obvious disadvantage of operating in two separate phases. It also
introduces many more hyperparameters, whose effect can be measured after training but is difficult to predict
beforehand. To overcome this, we can train the unsupervised and supervised objectives simultaneously, so that there
is a single training phase and a single coefficient weighting the unsupervised term. Another disadvantage of having
two separate phases is that each phase has its own hyperparameters: the performance of the second phase usually
cannot be predicted during the first phase, so there is a long delay between choosing the first-phase
hyperparameters and being able to evaluate them.
The most principled solution to this is to use validation set error in the supervised phase to select the
hyperparameters of the pretraining phase. In practice, some hyperparameters, like the number of
iterations, are more conveniently set during the pretraining phase, using early stopping.
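As mentioned above, one alternative to the two-phase procedure is to train the supervised and unsupervised
objectives simultaneously, so that a single coefficient controls the strength of the unsupervised term. The sketch
below is a minimal illustration under assumed layer sizes, dummy data, and an assumed weighting of 0.5.
Example (Python sketch):
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)                         # unsupervised (reconstruction) head
classifier = nn.Linear(64, 10)                       # supervised (classification) head
params = list(encoder.parameters()) + list(decoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(128, 784)                             # dummy inputs
y = torch.randint(0, 10, (128,))                     # dummy labels
lambda_recon = 0.5                                   # single hyperparameter weighting the unsupervised term

for _ in range(100):
    optimizer.zero_grad()
    z = encoder(x)
    loss = (nn.functional.cross_entropy(classifier(z), y)
            + lambda_recon * nn.functional.mse_loss(decoder(z), x))
    loss.backward()
    optimizer.step()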