DL Notes
History of AI → ML → DL
Artificial Intelligence (AI - 1950s)
Machine Learning (ML)
Now, the computer could predict new things without strict rules.
But there was still a problem… humans had to decide which features to show.
Like: “Look at the ears, look at the nose, look at the fur.”
Deep Learning (DL)
Computers got faster, the internet gave us huge data, and scientists built neural networks with many layers (deep networks).
It looks at raw data (pixels in an image) and figures out everything itself.
First layer: finds edges.
Next layers: find eyes, nose.
Final layer: says “It’s a cat”
Self-driving cars
ChatGPT
Generative AI
Feature | Machine Learning | Deep Learning
Definition | Teaches computers to learn from data using algorithms. | A special type of ML using neural networks with many layers.
Feature Extraction | Needs human help to select features (e.g., ear size, fur color in cat/dog detection). | Learns features automatically from raw data (edges → shapes → whole object).
Data Requirement | Works well with small to medium datasets. | Needs huge datasets (millions of images/text samples).
Computation Power | Runs on normal CPUs. | Needs high computing power (GPUs, TPUs).
Training Time | Relatively fast (minutes to hours). | Slow (hours to weeks for large models).
Examples | Spam filter, predicting house prices, recommendation systems. | Self-driving cars, face recognition, ChatGPT, image generators.
Analogy | Like a school student who needs a teacher to explain what to look at. | Like a grown-up adult brain that figures things out on its own.
Feature | Biological Neural Networks (Human Brain) | Artificial Neural Networks (Computer Brain)
Basic Unit | Neuron (nerve cell) with dendrites & axon. | Artificial neuron (perceptron) – a simple math function.
How it Works | Neurons receive signals (electrical impulses), process them, and pass them on. | Neurons take numbers as input, apply weights & activation, and give output.
Connections | Trillions of synapses connect ~86 billion neurons. | Layers of nodes connected by weights (parameters).
Signal Transmission | Uses electro-chemical signals in the brain. | Uses mathematical operations (dot products, activations).
Learning Method | Learns by adjusting synapse strengths (neuroplasticity). | Learns by adjusting weights using backpropagation & gradient descent.
Speed | Slower (milliseconds per signal), but massively parallel. | Faster (nanoseconds per operation), but depends on computing power.
Example | Child recognizing a cat after seeing it a few times. | Neural net trained on millions of cat pictures to recognize cats.
1. TensorFlow
Created by Google.
Very powerful, used in both research and industry.
Good for building large, complex deep learning models.
2. Keras
A high-level API (now part of TensorFlow) that makes building models simple and beginner-friendly.
3. PyTorch
Created by Facebook.
Very flexible and popular in research.
Easy to experiment with new model ideas.
Limitation
A single perceptron can only handle simple decisions (like AND, OR).
It cannot solve more complex problems (like XOR) because it has no hidden layers.
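A tiny illustrative sketch in Python/NumPy (weights picked by hand, not learned) showing a single perceptron computing AND, and why XOR is out of reach:

import numpy as np

def perceptron(x, w, b):
    # Single perceptron: weighted sum followed by a step activation.
    return int(np.dot(w, x) + b > 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked weights that implement AND: fires only when both inputs are 1.
w_and, b_and = np.array([1.0, 1.0]), -1.5
print("AND:", [perceptron(np.array(x), w_and, b_and) for x in inputs])  # [0, 0, 0, 1]

# XOR targets are [0, 1, 1, 0] -- no single line (one perceptron) can separate them,
# which is exactly why hidden layers are needed.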
Input Layer → where data comes in.
Hidden Layers → where learning and pattern recognition happen.
Output Layer → gives the final result.
Gradient Descent
It’s a way to find the best values of weights in a neural network so that predictions are accurate.
So, gradient descent is like the ball rolling down the slope, step by step, until it finds the lowest point.
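A minimal sketch in plain Python, using a made-up toy loss L(w) = (w − 3)², of the "ball rolling downhill" idea:

# Gradient descent on a toy loss L(w) = (w - 3)**2, whose minimum is at w = 3.
w = 0.0    # starting point (top of the slope)
lr = 0.1   # learning rate (size of each step)

for step in range(25):
    grad = 2 * (w - 3)   # dL/dw, the slope at the current point
    w = w - lr * grad    # roll a little further down the slope

print(w)  # close to 3.0, the lowest point of the curve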
Backpropagation
It’s how the network adjusts weights after making a mistake.
Step by step:
1. Do a forward pass and make a prediction.
2. Measure the mistake with a loss function.
3. Send the error backward through the network (chain rule) to find how much each weight contributed.
4. Adjust each weight slightly in the direction that reduces the error (gradient descent).
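A hand-written one-neuron example in plain Python (numbers made up for illustration) that walks through those four steps:

# One linear neuron y_pred = w * x, trained on a single example (x = 2, target y = 10).
x, y_true = 2.0, 10.0
w, lr = 0.5, 0.05

for epoch in range(40):
    y_pred = w * x                      # 1) forward pass
    loss = (y_pred - y_true) ** 2       # 2) measure the mistake (MSE)
    grad_w = 2 * (y_pred - y_true) * x  # 3) backpropagate: dLoss/dw via the chain rule
    w -= lr * grad_w                    # 4) adjust the weight

print(w)  # approaches 5.0, since 5 * 2 = 10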
Loss Functions
1. Mean Squared Error (MSE)
Formula: MSE = (1/n) Σ (yi − yp)²
yi = actual value
yp = predicted value
Example:
Actual = [3, 5], Predicted = [2, 6]
MSE = ((3 − 2)² + (5 − 6)²) / 2
= (1 + 1) / 2
= 1
2. Cross-Entropy Loss
Binary Classification Formula: Loss = −[ y · log(yp) + (1 − y) · log(1 − yp) ]
y = true label (0 or 1)
yp = predicted probability
Example:
True = 1, Predicted probability = 0.9
Loss = −log(0.9)
≈ 0.10
If predicted probability = 0.1 (very wrong):
Loss = −log(0.1)
≈ 2.30
3. Hinge Loss
Formula: Loss = max(0, 1 − y · yp)
Example:
True = +1, Predicted = 0.8
Loss = max(0, 1 − (1)(0.8))
= max(0, 0.2)
= 0.2
If the prediction is correct with a big margin (e.g., predicted = 1.5): Loss = max(0, 1 − 1.5) = 0
MSE : parabolic curve, error grows as prediction moves away from true value.
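The same example numbers from above, checked in Python/NumPy:

import numpy as np

# MSE for Actual = [3, 5], Predicted = [2, 6]
y, yp = np.array([3, 5]), np.array([2, 6])
print(np.mean((y - yp) ** 2))  # 1.0

# Binary cross-entropy for true label 1
print(-np.log(0.9))            # ~0.105 (good prediction -> small loss)
print(-np.log(0.1))            # ~2.303 (bad prediction -> large loss)

# Hinge loss for true label +1, predicted score 0.8
print(max(0, 1 - 1 * 0.8))     # 0.2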
Optimizer | How it works | Advantages | Limitations | Best used for
SGD (Stochastic Gradient Descent) | Learns step by step, adjusting weights immediately after each training sample | Simple, easy to implement; works well on large datasets; can escape local minima due to noise | Noisy updates; convergence is slow; may oscillate around the minimum | Online learning, very large datasets
Mini-Batch SGD | Learns from small groups of data instead of one sample or all | Faster than batch GD; smoother than pure SGD; efficient use of hardware (vectorization, GPUs) | Still noisy; performance depends on batch size | Deep learning training (default choice)
RMSProp (Root Mean Square Propagation) | Adjusts step size per parameter; slows down when the gradient is large, speeds up when it is small | Handles non-stationary objectives well; good for RNNs and deep networks; adapts learning rates | May overshoot; needs tuning of β (decay rate) | Training deep RNNs, non-convex problems
Adam (Adaptive Moment Estimation) | Combines momentum (m) + RMSProp (v) → faster convergence and adaptive learning rates | Most widely used; fast | Slightly more memory use; can sometimes overfit; not always best for very large-scale problems | Default optimizer in TensorFlow, PyTorch; NLP models, CV models
Adam : Fastest and smoothest convergence; combines momentum + adaptive learning rate.
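A short, illustrative PyTorch sketch (model and data are placeholders) showing how SGD and Adam are created and used; the training step looks the same for either:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # any model would do

# Plain SGD: one fixed learning rate for every weight.
sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# Adam: momentum + per-parameter adaptive learning rates (the usual default).
adam = torch.optim.Adam(model.parameters(), lr=0.001)

# One training step (here with Adam):
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.MSELoss()(model(x), y)
adam.zero_grad()
loss.backward()
adam.step()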
They automatically learn what features matter (you don’t tell them “look at the ear” — they figure it out).
They are good at handling big image data.
They mimic how our visual system works (eyes detect edges → brain combines them → recognizes object).
CNNs work the same way: filters detect parts → combine → final classification.
Example:
Imagine you have a black-and-white image of a cat (numbers represent pixel brightness).
If you slide a filter that looks for vertical edges, the output highlights the whiskers and body edges.
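An illustrative NumPy sketch (toy 5x5 image and a hand-made vertical-edge kernel) of sliding a filter over an image:

import numpy as np

# Tiny 5x5 "image": dark left half, bright right half (a vertical edge in the middle).
img = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float) * 255

# 3x3 vertical-edge filter: responds where brightness changes from left to right.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Slide the filter over the image (stride 1, no padding) to build the feature map.
h, w = img.shape
fm = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        fm[i, j] = np.sum(img[i:i+3, j:j+3] * kernel)

print(fm)  # large values only around the columns where the edge sits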
2. Filters (Kernels)
A filter (or kernel) is just a tiny matrix, like 3x3 or 5x5.
Each filter is designed to detect something specific:
Vertical edges
Horizontal edges
Corners
Textures
Example:
3. Feature Maps
The feature map is the output after applying the filter.
It shows where in the image that feature appears.
Bright spots in the feature map mean “this filter found something important here.”
Example:
If you apply a filter for circles on a coin image → the feature map lights up where the coins are.
If you apply an edge filter → feature map shows outlines of objects.
Types of Pooling
1. Max Pooling → keeps the largest value in each small block (the strongest signal).
2. Average Pooling → keeps the average value of each small block.
3. Global Average Pooling → instead of a small block, it shrinks the entire feature map into a single number. Used at the end of CNNs before classification.
Reduces size → makes the CNN faster.
Keeps important info → only minor details are ignored.
Makes the CNN stable → small shifts (like moving a cat's ear a little) won't confuse the model.
Like looking at a dog from far away: you can't see every hair, but you still know it's a dog.
That's what pooling does → reduces details but keeps the "big picture."
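A small NumPy sketch (made-up 4x4 feature map) of 2x2 max pooling:

import numpy as np

# 4x4 feature map
fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 5, 6]])

# 2x2 max pooling with stride 2: keep the strongest response in each block.
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 4]
#  [7 9]]  -> half the size in each direction, biggest signals kept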
Dropout
What it is:
During training, dropout randomly "switches off" a fraction of neurons (e.g., 50%) at each step.
Why it helps:
The network cannot depend too much on any single neuron, so it learns more robust features and overfits less.
Real-life analogy:
Imagine a cricket team practicing, but sometimes their best batsman sits out. The rest of the team learns to play better — not just depending on one
player.
Batch Normalization (BN)
What it is:
Adjusts (normalizes) the output of each layer so the data going forward has a stable distribution (mean ≈ 0, variance ≈ 1).
Done during training for each mini-batch of data.
Why it helps:
Training becomes faster and more stable, higher learning rates can be used, and it adds a slight regularization effect.
Example:
Think of it like checking students’ mood in a class — if some are too hyper and some are sleepy, the teacher balances everyone so learning goes
smoothly.
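An illustrative PyTorch sketch (layer sizes are arbitrary) showing where Dropout and BatchNorm typically sit in a network:

import torch.nn as nn

# A small fully-connected block showing the usual placement of BatchNorm and Dropout.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # normalize layer outputs (mean ~0, variance ~1) per mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly switch off 50% of neurons during training
    nn.Linear(256, 10),
)

# model.train() enables dropout and batch statistics; model.eval() turns them off for inference.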
Regularization Techniques
Regularization = techniques to reduce overfitting (when the model performs well on training but poorly on new data).
Common Types:
L1 Regularization (Lasso):
Adds penalty = sum of absolute weights.
Makes many weights exactly zero → feature selection.
L2 Regularization (Ridge):
Adds penalty = sum of squared weights.
Keeps weights small, but rarely zero.
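An illustrative PyTorch sketch (penalty strengths are arbitrary) of adding L2 via weight_decay and L1 as a manual penalty on the loss:

import torch
import torch.nn as nn

model = nn.Linear(20, 1)

# L2 (Ridge): most optimizers expose it directly as weight_decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# L1 (Lasso): add the sum of absolute weights to the loss yourself.
x, y = torch.randn(8, 20), torch.randn(8, 1)
mse = nn.MSELoss()(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = mse + 1e-4 * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()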
Think of it like reading a story word by word — you always remember the earlier words to understand the current word.
Formula: h_t = f(W_x · x_t + W_h · h_(t-1) + b), where h_t is the hidden state (memory) at time t, x_t is the current input, and f is an activation such as tanh.
RNN = a neural network with memory, used for sequential data like text, speech, and time-series.
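A minimal NumPy sketch (random weights, made-up 5-step sequence) of the hidden-state update above:

import numpy as np

# One RNN step: new memory = tanh(W_x * input + W_h * old memory + b)
def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                                     # memory starts empty
sequence = [rng.normal(size=3) for _ in range(5)]   # e.g., 5 words as 3-dim vectors
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)               # the same weights are reused at every step

print(h)  # final hidden state summarizes the whole sequence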
In normal datasets, you can shuffle rows and nothing changes (e.g., a table of student marks).
But in sequential data, past values affect future values.
Examples:
A sentence → words must be in order (“I love you” ≠ “You love I”).
Stock prices → today’s value depends on yesterday’s.
Music → the next note depends on the earlier ones.
Key idea: We cannot treat each data point separately; we must consider the sequence.
Time-series is a special type of sequential data where values are recorded over time.
Examples: daily temperature readings, stock prices recorded each day, monthly sales, heart-rate measurements.
Gradients (learning signals) are multiplied many times as they travel back through each time step.
Depending on the values:
If weights < 1 → gradient shrinks → vanishing gradient.
If weights > 1 → gradient grows uncontrollably → exploding gradient.
Vanishing Gradient
The gradient becomes tiny as it moves backward.
Network forgets long-term dependencies.
Example:
Sentence: “I was born in Paris … I speak ___”
RNN cannot connect “Paris” to “French” because the signal vanished.
Exploding Gradient
The gradient becomes huge as it moves backward.
Causes unstable training, weights jump around, loss fluctuates.
Example: Model’s output changes randomly instead of learning properly.
Vanishing Gradient : The gradient value shrinks as we go back in time steps → old information gets lost.
Exploding Gradient : The gradient value blows up as we go back in time steps → training becomes unstable.
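A two-line numeric illustration of why repeated multiplication causes both problems (0.9 and 1.1 are arbitrary stand-ins for the factor applied at each time step):

# Gradients in an RNN are (roughly) multiplied by the same factor at every time step.
steps = 50

print(0.9 ** steps)   # ~0.005 -> the signal from 50 steps back almost vanishes
print(1.1 ** steps)   # ~117   -> the signal blows up and destabilizes training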
Think of LSTM as an RNN with memory + gates to control what to remember and what to forget.
Why GRU?
LSTM is powerful but a bit complex (3 gates + memory cell).
GRU is a simpler version of LSTM → fewer gates, faster training, but still solves vanishing gradient.
Think of GRU as “LSTM Lite”.
GRU Structure
GRU has only 2 gates (instead of 3 in LSTM):
1. Update Gate → decides how much of the past to keep.
2. Reset Gate → decides how much of the past to forget.
👉 Because of fewer gates, GRUs are faster and need less data to train compared to LSTMs.
Feature | LSTM | GRU
Performance | Slightly better for complex tasks | Similar, often faster for small datasets
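An illustrative PyTorch sketch (sizes are arbitrary) comparing the two layers and their parameter counts:

import torch
import torch.nn as nn

x = torch.randn(8, 20, 32)  # batch of 8 sequences, 20 time steps, 32 features each

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)

out_lstm, (h_n, c_n) = lstm(x)  # LSTM keeps a separate cell state c_n
out_gru, h_gru = gru(x)         # GRU has no cell state -> fewer parameters

print(sum(p.numel() for p in lstm.parameters()))  # ~25k
print(sum(p.numel() for p in gru.parameters()))   # ~18.8k (roughly 3/4 of the LSTM)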
1. Text Generation
The model learns the style of writing and generates new text.
Example:
Train on Shakespeare’s plays → generate new lines in Shakespeare style.
Train on your WhatsApp chats → generate messages like you.
2. Speech Recognition
3. Sentiment Analysis
4. Machine Translation
5. Time-Series Forecasting
It takes input data → compresses it to a smaller form (encoding) → then reconstructs it back (decoding).
Structure of Autoencoder
1. Encoder → shrinks the input into a smaller hidden form (latent space).
2. Latent Space (Code) → the compressed knowledge.
3. Decoder → reconstructs the original input from the compressed code.
Types of Autoencoders
1. Basic Autoencoder
Learns to compress the input and then reconstruct it as closely as possible.
2. Denoising Autoencoder
Receives a noisy/corrupted input and learns to reconstruct the clean version.
3. Variational Autoencoder (VAE)
Special type → doesn’t just copy data but generates new data similar to the training data.
Example:
Train VAE on faces → it can generate new human faces that never existed.
Use case: Generative tasks (image synthesis, anomaly detection, drug design).
Basic Autoencoder → Like compressing a movie into a zip file and unzipping it back.
Denoising Autoencoder → Like Photoshop’s “Auto Fix” that removes blur/noise.
VAE → Like an artist who learns a style and then paints new artworks in that style.
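A minimal PyTorch sketch (sizes chosen for flattened 28x28 images, purely illustrative) of the encoder → latent code → decoder structure of a basic autoencoder:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())     # shrink to a latent code
        self.decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())  # rebuild the input

    def forward(self, x):
        code = self.encoder(x)     # latent space (compressed knowledge)
        return self.decoder(code)  # reconstruction

model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.MSELoss()(model(x), x)  # reconstruction loss: the output should match the input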
1. Generator (G) → creates fake data (tries to fool).
2. Discriminator (D) → judges if data is real or fake.
Together, they improve until the generator becomes so good that it produces realistic data.
Structure of a GAN
1. Generator
Input: Random noise
Output: Fake but realistic-looking data (like fake images)
Goal: Fool the discriminator
2. Discriminator
Input: Both real data + fake data from generator
Output: Probability (real or fake)
Goal: Catch the fake
Training Process
Step 1: Generator makes fake images
Step 2: Discriminator checks → “Real or Fake?”
Step 3: If the Discriminator catches the fake → Generator improves
Step 4: If Generator fools Discriminator → Discriminator improves
Repeat until Generator makes data almost identical to real
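A toy PyTorch sketch (tiny networks, 1-D "real" data centred at 4, all values made up) of this back-and-forth training loop:

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # sample -> P(real)

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(32, 1) + 4.0        # "real" data: normal distribution around 4
    fake = G(torch.randn(32, 8))           # generator turns noise into fake samples

    # Discriminator: label real as 1, fake as 0 (the "catch the fake" step).
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make D call the fakes "real" (label 1).
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(100, 8)).mean())  # should drift toward ~4, the mean of the real data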
Real-Life Example
Think of a student (Generator) making fake currency notes.
A police officer (Discriminator) inspects them.
If the officer catches the fake → the student improves next time.
If the student fools the officer → the officer trains harder.
Over time, the student becomes a master at making realistic notes.
Example sentence: “The dog chased the ball and finally caught it.”
So, when the model sees “it”, attention tells it: “Look at ball, not dog.”
Transformer Architecture
1. Input Embedding – Words converted into vectors.
2. Positional Encoding – Since order matters (cat vs act), position info is added.
3. Encoder – Reads the input with multi-head attention + feedforward layers.
4. Decoder – Uses its own attention + encoder info to generate output (e.g., translation).
5. Output – Predicted text sequence.
Multi-Head Attention
Instead of looking at just one relation, the model looks at different angles simultaneously.
Example:
One head might focus on subject–verb relation.
Another on object–adjective relation.
Then all heads combine for richer understanding.
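An illustrative NumPy sketch of scaled dot-product attention, the building block each head uses (random vectors stand in for word embeddings); multi-head attention simply runs several of these in parallel with different projections and concatenates the results:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: scores say how much each word looks at the others.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)     # one row per query word, each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_words, d_k = 4, 8               # e.g., a 4-word sentence, 8-dim vectors
Q = rng.normal(size=(n_words, d_k))
K = rng.normal(size=(n_words, d_k))
V = rng.normal(size=(n_words, d_k))

out, w = attention(Q, K, V)
print(w.round(2))  # attention weights: which words each word attends to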
So we replace the Q-table with a Deep Neural Network → this is Deep Q-Learning.
DQN Working (in simple steps):
1. Input: Current state (e.g., game screen).
2. Neural Network: Predicts Q-values for all possible actions.
3. Agent chooses action: Best Q-value = action taken.
4. Reward received from environment.
5. Network updated using backpropagation.
6. Repeat → Agent learns optimal strategy.
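An illustrative PyTorch sketch (state/action sizes and the transition are made-up placeholders) of one DQN update step:

import torch
import torch.nn as nn

# Q-network: state in, one Q-value per action out.
n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One (state, action, reward, next_state) transition, here just random placeholders.
state = torch.randn(1, n_states)
next_state = torch.randn(1, n_states)
action, reward, done = 1, 1.0, False

# Predict Q-values for the current state and pick the taken action's value.
q_value = q_net(state)[0, action]

# Bellman target: reward + gamma * best Q-value of the next state.
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max() * (0.0 if done else 1.0)

# Update the network with backpropagation on the TD error.
loss = nn.functional.mse_loss(q_value, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()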