Unit 4 Short Notes
History of Deep Learning - A Probabilistic Theory of Deep Learning - Gradient Learning – Chain Rule and
Backpropagation - Regularization: Dataset Augmentation – Noise Robustness - Early Stopping, Bagging
and Dropout - Batch Normalization - VC Dimension and Neural Nets.
Why Uncertainty Is Important
In many real-life situations, data is noisy or incomplete. A good model should say “I’m not sure” when
the data is unclear. Probabilistic models do exactly that. They give not just predictions, but also a
measure of confidence (or uncertainty) in those predictions. Common ways to obtain uncertainty
estimates:
1. Bayesian Neural Networks:
The model learns a distribution over weights (not just one value), which allows it to make
predictions with uncertainty estimates.
2. Variational Inference:
Approximates complex probability distributions using simpler ones.
3. Monte Carlo (MC) Dropout:
It turns out that if we keep using dropout during testing, it is like doing approximate Bayesian
inference. This trick helps estimate uncertainty without big changes to the model.
4. Gaussian Processes (GPs):
GPs are models that can predict a distribution over functions (not just values).
5. Test-Time Sampling (how MC dropout is used in practice):
At test time, you run the model multiple times with dropout turned on. This gives different
results each time, and the variation between them tells you how uncertain the prediction is
(see the sketch after this list).
6. Ensemble Methods:
Train several models on the same task and compare their outputs. If the predictions vary a lot
between models, that means the model is less certain.
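To make the test-time sampling concrete, here is a minimal MC-dropout sketch (assuming PyTorch; the network SmallNet, its layer sizes, and the 100-sample count are illustrative choices, not from the notes):

```python
import torch
import torch.nn as nn

# Hypothetical small network containing dropout (sizes are arbitrary).
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 16), nn.ReLU(),
            nn.Dropout(p=0.5),   # stays active while the module is in train() mode
            nn.Linear(16, 1),
        )
    def forward(self, x):
        return self.net(x)

model = SmallNet()
model.train()                    # keep dropout ON at test time (MC dropout)

x = torch.randn(1, 4)            # one test input
samples = torch.stack([model(x) for _ in range(100)])  # 100 stochastic passes

mean = samples.mean(dim=0)       # the prediction
std = samples.std(dim=0)         # spread across passes ~ uncertainty
print(mean.item(), std.item())
```

A large standard deviation across the passes signals an input the model is unsure about; a small one signals a confident prediction.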
GRADIENT LEARNING
What is Gradient Learning?
Gradient learning is the process of training machine learning models (especially neural networks) by
optimizing their parameters (weights and biases). This is done using an algorithm called gradient
descent, which works by:
Looking at the loss function (which tells how wrong the model is).
Calculating the gradient (the direction and steepness of the slope of the loss).
Taking small steps in the direction that reduces the loss (like walking downhill to reach the
bottom).
Gradient methods work well for smooth and continuous functions (like those in neural networks).
It’s much easier to minimize these functions than discrete or irregular ones.
By estimating how small changes in the parameters affect the loss, we can improve the model
gradually.
Mini-batch gradient descent: computes gradients on small batches of data, which is more stable
than single-example SGD and faster than using the whole dataset.
Convex optimization: If the loss function is convex, gradient descent is guaranteed to find the
global minimum no matter where you start.
Non-convex functions (common in deep learning): No guarantee of reaching the best solution.
Starting values matter.
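A minimal runnable sketch of the walking-downhill idea (the loss L(w) = (w − 3)², the learning rate, and the step count are arbitrary illustrative choices):

```python
# Gradient descent on a convex quadratic loss L(w) = (w - 3)^2.
def loss_grad(w):
    return 2.0 * (w - 3.0)     # dL/dw

w = 10.0                        # arbitrary starting point
lr = 0.1                        # step size (learning rate)
for step in range(100):
    w -= lr * loss_grad(w)      # take a small step downhill

print(w)                        # converges toward the global minimum at w = 3
```

Because this loss is convex, any starting point reaches the same minimum; with a non-convex loss the final w would depend on where we started.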
Cost Func on
The mean squared error (MSE) loss function arises from maximum likelihood estimation (MLE) when
assuming a Gaussian distribution for the outputs.
1. Maximum Likelihood Estimation: Training chooses the parameters θ that maximize the likelihood of
the observed outputs, i.e., that minimize the negative log-likelihood −log p(y | x; θ). If we assume
p(y | x) is Gaussian with mean f(x; θ), this negative log-likelihood is (up to constants) the squared error.
2. The Cost Function from MLE: Putting this into the form of an expected value over the data
distribution p_data:
J(θ) = ½ E_(x,y ~ p_data) ||y − f(x; θ)||² + const
This is the mean squared error (MSE) cost function with a constant offset. The constant doesn't affect
optimization because it doesn't depend on θ.
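A compact sketch of that derivation in math form (the output dimension m and a fixed variance σ² are assumptions of this sketch):

```latex
% Negative log-likelihood of a Gaussian model p(y|x) = N(y; f(x;\theta), \sigma^2 I):
-\log p(y \mid x; \theta)
  = \frac{1}{2\sigma^2}\,\lVert y - f(x;\theta)\rVert^{2}
  + \frac{m}{2}\log\!\left(2\pi\sigma^{2}\right)
% Taking the expectation over the data distribution (with \sigma^2 = 1) gives
% the MSE cost plus a \theta-independent constant:
J(\theta)
  = \frac{1}{2}\,\mathbb{E}_{x,y \sim \hat{p}_{\text{data}}}
    \lVert y - f(x;\theta)\rVert^{2} + \text{const}
```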
3. Desirable Properties of the Gradient
The gradient of the cost function (how fast the cost is changing) should be large and predictable:
“the gradient must be large and predictable enough to serve as a good guide to the learning algorithm.”
The gradient tells the model how to update its parameters during training.
The cross-entropy cost used for MLE does not have a minimum value. If we model the output with a
Gaussian distribution, the cross-entropy involves the variance. If the model learns a tiny variance, it can
assign extremely high density to the correct output. This causes the log-likelihood to diverge to
negative infinity, so there is no well-defined minimum. That’s why regularization is needed: to prevent
the model from becoming overly confident.
We often want to learn just one conditional statistic of y given x. Instead of learning the whole
distribution p(y|x), we may only care about a single summary of it, such as its mean.
Example: In MSE regression, we’re learning the expected value of y given x.
In deep learning, we’re not just adjusting parameters; we’re learning a function f(x). The cost then
becomes a functional:
o Example: It takes the entire function f and gives a single number (the total loss).
So instead of thinking about minimizing cost by tuning parameters, we can think of:
o Choosing the best function from a space of all possible functions.
o The cost functional is designed so that its minimum lies at the function we want (e.g.,
the one mapping x to E[y|x]).
CHAIN RULE AND BACKPROPAGATION
The Chain Rule is a rule from calculus used to compute the derivative of a function composed of
other functions.
o In neural networks, we use it to compute how a change in weights affects the final
output error, even across multiple layers.
Backpropagation uses the chain rule to calculate gradients of the loss (error) with respect to the
weights in the network.
Backpropagation is a training algorithm for multi-layer neural networks (also called deep neural
networks).
It’s also called the generalized delta rule, an extension of simpler learning rules like the Widrow-
Hoff rule.
It systematically updates weights to minimize the error between predicted and actual output by
using gradient descent.
Step-by-step idea:
1. Forward Pass:
o Input data flows through the network layer by layer to produce an output, and the
error between the predicted and desired output is computed.
2. Backward Pass:
o Gradients (slopes of error with respect to weights) are calculated layer by layer, starting
from the output layer and going backward.
o These gradients show how much each weight contributed to the error.
3. Update Weights:
o Each weight is nudged in the direction that reduces the error, using gradient descent.
Nonlinearity
Without non-linearity, no matter how many layers we have, the entire network acts like a single
linear function (see the sketch below).
Hidden layers allow the network to learn non-linear and complex mappings.
This enables the network to solve real-world problems like image recognition, language
processing, etc.
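A tiny numeric sketch of the collapse claim (the matrix sizes and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))   # first "layer" weights (arbitrary sizes)
W2 = rng.standard_normal((3, 5))   # second "layer" weights
x = rng.standard_normal(4)

two_layers = W2 @ (W1 @ x)         # two stacked linear layers, no activation
one_layer = (W2 @ W1) @ x          # one equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: depth added no expressiveness
```

Inserting a non-linear activation such as σ between the two layers breaks this equality, which is exactly what lets deep networks represent more than a single linear map.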
The output of each neuron is scaled by the weight and passed forward.
The network learns by adjusting these weights using backpropagation so that the output gets
closer to the desired result.
Training Procedure:
Training a neural network means adjusting the weights so that it can produce correct outputs for a given
set of inputs. This is done through repeated exposure to input-output pairs.
1. Initialize Weights:
o All weights in the network are set to small random values (both positive and negative).
o This prevents neurons from becoming saturated (e.g., stuck with outputs too close to 0
or 1 in sigmoid).
2. Present a Training Pair:
o Select one input-output pair from the dataset. This is called supervised learning
because we provide the desired output.
3. Forward Pass:
o Data flows through the network from input → hidden layer(s) → output.
o Each neuron calculates its output using a weighted sum and activation function.
4. Compute the Error:
o Compute the error using a loss function (e.g., Mean Squared Error).
5. Adjust the Weights:
o The goal is to reduce the error by shifting the weights in the direction that lowers the
loss.
6. Repeat:
o Repeat the process for multiple epochs (full passes through the dataset) until the total
error is low enough.
Forward Pass:
Inputs are propagated through the network to produce a prediction, which is compared with the target.
Backward Pass:
Gradients are calculated for each weight using the chain rule.
1. Output Layer:
o The error here can be computed directly, by comparing the prediction with the desired output.
2. Hidden Layers:
o Hidden neurons have no target output of their own.
o Instead, their errors are inferred from the layers above using the chain rule.
In calculus, the chain rule is used to compute the derivative of a composite function. If you have
two functions composed as
y = f(g(x)),
then the derivative is
dy/dx = f′(g(x)) · g′(x).
Neural networks are composed of layers where each layer applies a function to the previous
layer’s output:
x → z = Wx + b → a = σ(z)
During training, we want to compute the gradient of the loss function with respect to each
parameter (e.g., the weights W) to update them using gradient descent. By the chain rule:
∂L/∂W = (∂L/∂a) · (∂a/∂z) · (∂z/∂W)
where L is the loss, a is the layer’s activation, and z is its pre-activation input.
This is done layer by layer from the output to the input (backward), hence the name
backpropagation.
Example: Single Neuron
z = w·x + b,
a = σ(z),
L = Loss(a, y).
Then, by the chain rule:
∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)
Each of these partial derivatives is easy to compute, and the chain rule lets us link them together to find
the gradient.
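A minimal numeric sketch of this single-neuron case (a sigmoid activation, squared-error loss, and the specific values of w, b, x, y are assumptions made for concreteness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative values.
w, b = 0.5, 0.1
x, y = 2.0, 1.0

# Forward pass: z = w*x + b, a = sigmoid(z), L = (a - y)^2 / 2
z = w * x + b
a = sigmoid(z)
L = 0.5 * (a - y) ** 2

# Backward pass: one local derivative per link in the chain.
dL_da = a - y                  # dL/da for squared error
da_dz = a * (1 - a)            # derivative of the sigmoid
dz_dw = x                      # since z = w*x + b
dL_dw = dL_da * da_dz * dz_dw  # chain rule links them together

w -= 0.1 * dL_dw               # one gradient descent step
print(L, dL_dw, w)
```

Repeating the forward and backward pass drives L down, which is exactly the update loop backpropagation performs across every weight of a deep network.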
1. The chain rule allows us to propagate error gradients backward through the layers.
2. It enables gradient-based optimization methods like SGD, Adam, etc.
3. It’s the core idea behind backpropagation, which powers training of deep neural
networks.
Regularization: Dataset Augmentation
Regularization techniques are essential for preventing overfitting in machine learning models, including
neural networks. Dataset augmentation is one such technique, used to enhance the generalization ability
of models by artificially increasing the size and diversity of the training dataset.
Heuristic data augmentation schemes often rely on the composition of a set of simple transformation
functions (TFs) such as rotations and flips. When chosen carefully, data augmentation
schemes tuned by human experts can improve model performance. However, in practice such heuristic
strategies can cause large variances in end model performance and may not produce the augmentations
needed for state-of-the-art models.
Data augmentation can be defined as the technique of improving the diversity of the data by slightly
modifying copies of already existing data or by creating new synthetic data from the existing data. It
regularizes the model and helps to reduce overfitting. Some of the techniques used for data
augmentation are described below.
Why Use Dataset Augmentation?
These changes make the model more robust (able to generalize better) by teaching it to handle
variations it might see in real-world data, without changing the essential meaning of the data.
1. Prevents overfitting: Helps the model avoid memorizing the training data.
2. Improves generalization: Makes the model better at handling unseen data.
3. Expands small datasets: Useful when real-world data is limited or hard to collect.
1. Geometric Transformations
Change the position or shape of the image:
o Rotation: Turn the image slightly.
o Translation: Shift the image up/down or left/right.
o Scaling: Zoom in or out.
o Cropping: Cut out a part of the image.
o Flipping: Mirror the image horizontally or vertically.
2. Color Transformations
Change how the image looks visually:
o Brightness: Make the image lighter or darker.
o Contrast: Change the difference between dark and light areas.
o Saturation: Make colors more or less intense.
o Hue: Shift the overall color tone.
3. Noise Injection
Add small random changes (noise) to simulate imperfections:
o Helps the model learn to ignore irrelevant variations.
4. Random Cropping and Padding
o Random cropping: Take a random part of the image.
o Padding: Add extra borders with a certain color or pattern.
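As a concrete illustration of combining these techniques, a typical pipeline might look like the following sketch (assuming the torchvision library; all parameter values are arbitrary, not recommendations):

```python
from torchvision import transforms

# A sample augmentation pipeline combining the techniques listed above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # geometric: slight rotation
    transforms.RandomHorizontalFlip(p=0.5),      # geometric: mirror
    transforms.ColorJitter(brightness=0.2,       # color: brightness, contrast,
                           contrast=0.2,         # saturation, and hue shifts
                           saturation=0.2,
                           hue=0.05),
    transforms.RandomCrop(size=224, padding=8),  # random crop with padding
    transforms.ToTensor(),
])

# Each epoch, the same training image yields a slightly different tensor:
# augmented = augment(pil_image)   # pil_image: a PIL.Image from the dataset
```

Because the transformations are re-sampled every epoch, the model effectively never sees the exact same input twice, which is the source of the regularization effect discussed next.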
Why Does Augmentation Have a Regularization Effect?
It helps the model avoid overfitting by making learning harder in a good way, so the model
doesn't just memorize but actually learns patterns that work on new, unseen data.
When you apply random changes (like rotations, brightness shifts, or noise) to your training
data, you're:
Making the data less perfect and more like the real world, where data isn’t always clean
or consistent.
Forcing the model to adapt to this variation, instead of just memorizing specific
examples.
This process "regularizes" the model, meaning it makes the learning more stable and
general.
Without augmentation:
A model might memorize training images, like "I know this cat because of the exact
position of its ears and background."
This leads to overfitting, where the model performs well on training data but poorly on
new data.
With augmentation:
The model sees many slightly different versions of each training image, so it must learn the
general features that survive those changes.
Example:
If you rotate, shift, and change the lighting of car images during training, the model learns:
“Ah, that’s still a car, even if it’s turned, shifted, or in different lighting.”
Early Stopping
Use: stop training when the validation error stops improving, even if the training error keeps falling.
It saves time by not training unnecessarily, and it keeps the model from overfitting the training set.
Bagging
Bagging (short for Bootstrap Aggregating) is an ensemble learning technique in machine learning
designed to improve the accuracy and stability of models, particularly those that are high-variance (e.g.,
decision trees).
1. Bootstrap Sampling:
o From the original dataset, multiple subsets are created by random sampling with
replacement.
o Each subset is the same size as the original dataset (or slightly smaller).
2. Model Training:
o A separate model is trained independently on each bootstrap subset.
3. Aggregation:
o The predictions of all models are combined: majority vote for classification,
averaging for regression.
Benefits:
Improves accuracy: Especially effective for unstable learners like decision trees.
Reduces variance: Averaging over many models smooths out individual errors.
Example:
The most well-known application of bagging is the Random Forest algorithm, which builds multiple
decision trees using bagged samples and random feature selection.
Pseudocode
Bagging (Bootstrap Aggregating) – Pseudocode
Input:
    Training data D = {(x1, y1), ..., (xn, yn)}
    Number of base models T
    Base learning algorithm L
Algorithm:
    For t = 1 to T:
        D_t = bootstrap sample of size n drawn from D with replacement
        h_t = L(D_t)    // train a base model on the sample
Output:
    Classification: H(x) = majority vote of h_1(x), ..., h_T(x)
    Regression:     H(x) = average of h_1(x), ..., h_T(x)
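A compact runnable sketch of this procedure (using scikit-learn's decision tree as the base learner; the dataset, T = 25, and the seed are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate by majority vote across the ensemble."""
    votes = np.stack([m.predict(X) for m in models])   # shape (T, n_samples)
    # Majority vote per column (assumes non-negative integer class labels).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# Tiny synthetic demo (illustrative only).
X = np.random.default_rng(1).standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]), y[:5])
```

Each tree sees a slightly different resampled dataset, so their individual errors partially cancel when the votes are aggregated.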
Dropout
Dropout is a regularization technique that randomly deactivates neurons while the network trains, so it
cannot rely too heavily on any single neuron.
1. Random Deactivation:
o In each training iteration, a fraction of neurons is set to zero with a probability p
(usually between 0.2 and 0.5).
2. Training and Inference:
o Dropout is only applied during training.
o During inference, all neurons are active.
o Outputs are scaled at inference (by the keep probability 1 − p, since p here is the drop
probability) so that the expected activations match those seen during training.
3. Ensemble Effect:
o Dropout simulates training many different subnetworks.
o This ensemble behavior helps in learning more generalizable features and reduces
reliance on specific neurons.
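A minimal numpy sketch of the training-time mask and inference-time scaling described above (vanilla dropout; the layer size and p = 0.5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout (drop) probability

def dropout_train(a, p):
    """Training: zero each activation independently with probability p."""
    mask = rng.random(a.shape) >= p       # keep with probability 1 - p
    return a * mask

def dropout_infer(a, p):
    """Inference: keep all neurons, scale outputs to match the training-time
    expected value E[mask * a] = (1 - p) * a."""
    return a * (1 - p)

a = rng.standard_normal(8)                # activations of some hidden layer
print(dropout_train(a, p))                # a different random subnetwork each call
print(dropout_infer(a, p))                # deterministic, scaled
```

Each call to dropout_train corresponds to one of the many subnetworks in the ensemble view; the inference-time scaling approximates averaging over all of them.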
Batch Normalization
Batch Normalization is a technique used to make training deep neural networks faster
and more stable.
When a neural network is training, the parameters (like weights) in each layer keep
changing. This causes the distribution (e.g., the range and mean of values) of inputs to
the next layers to also change. This is called internal covariate shift.
Imagine you're trying to learn something new, but the rules keep changing slightly every
time: it's harder to learn. That’s what internal covariate shift does to neural networks.
Batch normalization addresses this by:
1. Normalizing the inputs of each layer (subtracting the mean and dividing by the standard
deviation).
2. Then scaling and shifting the normalized values using learned parameters (so the network still
has flexibility).
This lets the network:
Train faster,
Use higher learning rates,
Be less sensitive to weight initialization,
And often generalize better (perform well on unseen data).
Normalization
For every mini-batch (a small subset of your dataset used during training), batch normalization
standardizes the inputs to a layer.
It does this by subtracting the mini-batch mean and dividing by the mini-batch standard deviation:
x̂ = (x − μ_B) / √(σ²_B + ε)
where μ_B and σ²_B are the mean and variance of the current mini-batch and ε is a small constant
for numerical stability.
Purpose: This ensures the inputs to each layer have zero mean and unit variance, which helps
the network learn more efficiently.
After normalization, we don’t just pass the standardized values as-is. We apply two trainable
parameters, a scale γ (gamma) and a shift β (beta):
y = γ·x̂ + β
This allows the network to undo the normalization if needed and still learn the best
representation for the task.
During training: The mean and variance are calculated from each mini-batch.
During inference (when making predictions): We use running averages of the mean and
variance computed over the whole training process, not batch-wise stats (see the sketch below).
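A small numpy sketch of this train/inference difference (the scalar γ, β, the momentum value, and the batch size are illustrative assumptions):

```python
import numpy as np

eps, momentum = 1e-5, 0.1
gamma, beta = 1.0, 0.0             # trainable scale and shift (scalars here)
running_mean, running_var = 0.0, 1.0

def batchnorm(x, training):
    global running_mean, running_var
    if training:
        mu, var = x.mean(), x.var()            # mini-batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var    # stored averages at inference
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                # learned scale and shift

batch = np.random.default_rng(0).standard_normal(32)
out_train = batchnorm(batch, training=True)    # uses batch statistics
out_test = batchnorm(batch, training=False)    # uses running averages
print(out_train.mean().round(6), out_train.std().round(6))
```

The training branch is what injects the mini-batch randomness mentioned under "Regularization" below; the inference branch makes predictions deterministic.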
1. Improved Optimization
Batch normalization allows the model to use higher learning rates safely.
Normally, high learning rates can make training unstable, but batch norm helps keep
activations in a predictable range.
This speeds up training and reduces the need for careful manual tuning of learning rates
or other hyperparameters.
2. Regularization
During training, batch norm uses the mini-batch statistics (mean and variance), which
introduces a bit of randomness into each forward pass.
This acts like a regularizer by slightly disturbing the activations each time, much like
dropout.
As a result, it helps reduce overfitting: the model becomes less likely to just memorize
the training data.
3. Reduced Sensitivity to Weight Initialization
Neural networks are often sensitive to their initial weights; bad initializations can slow
down or ruin training.
Batch normalization lessens this sensitivity, because it keeps activations well-behaved
even if the initial weights aren’t ideal.
This means the network is more robust and more likely to converge to a good solution.
4. Enables Deeper Architectures
One of the main challenges in training very deep networks is the internal covariate shift,
where the input distribution to layers changes constantly.
Batch normalization reduces this shift, which makes it easier to train deeper
architectures.
That’s why modern deep models like ResNet, VGG, and Transformers often use batch
norm.
VC Dimension and Neural Nets
The VC dimension is a theoretical concept that measures the capacity or expressiveness of a
learning algorithm: specifically, the size of the largest set of points that can be shattered by the
model.
To shatter a set of points means that for every possible way of labeling those points
(e.g., as + or −), there exists some classifier in the model's hypothesis space that can
perfectly separate them.
A linear classifier (like a straight line) can shatter 3 points in 2D space: for any labeling of
3 points in general position, there is always a line that separates the + and − correctly.
However, 4 points cannot always be shattered by a linear classifier: there is at least one
labeling of 4 points (e.g., the XOR arrangement) for which no single straight line can separate
the + and − labels perfectly.
The VC dimension tells us how complex a neural network is in terms of the variety of patterns it
can learn.
A higher VC dimension usually means a model can fit more complex data, but it also means a
higher risk of overfitting.
It’s a crucial concept for understanding generalization: whether a model just memorizes data or
truly learns patterns.
Shattering a set of examples:
Assume a binary classification problem with N examples in R^D and consider the set of 2^N possible
dichotomies. For instance, with N = 3 examples, the set of all possible dichotomies is {(000), (001), (010),
(011), (100), (101), (110), (111)}. A class of functions is said to shatter the dataset if, for every possible
dichotomy, there is a function f(α) that models it. Consider as an example a finite concept class C =
{c1, ..., c4} applied to three instance vectors x1, x2, x3, with the results analyzed below.
Step-by-Step Breakdown Using the Table:
We are working with a concept class C = {c1, c2, c3, c4}, and each concept function gives output labels on
three input instances: x1, x2, x3.
To shatter a set of input points means that for every way you could assign 0s and 1s (labels) to those
points, there is some concept function in C that gives exactly those labels.
We now ask: can our concept class produce all those combinations?
Detailed Analysis
1 Point (x1):
c2(x1) = 0
c3(x1) = 1
c4(x1) = 0
So we can generate both outputs, 0 and 1 → all labelings possible → 1 point is shattered.
2 Points (x1, x2):
(0, 0), (0, 1), (1, 0), (1, 1) → all four dichotomies present → 2 points are shattered.
3 Points (x1, x2, x3):
There are 2³ = 8 possible dichotomies: (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0),
(1,1,1). With only four concept functions, the class can realize at most four of these eight
labelings, so some dichotomy has no matching concept → 3 points are not shattered.
Final Result:
The largest set that can be shattered contains 2 points.
So, VC dimension = 2.
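A short sketch that checks shattering by brute force. Note the concept table below is a hypothetical completion invented so the code can run (only the pattern of "both labels occur on x1, all four pairs occur on (x1, x2), eight triples cannot all occur" is taken from the analysis above):

```python
# Hypothetical concept class: rows are c1..c4, columns are labels on x1, x2, x3.
C = [
    (1, 0, 0),   # c1 (illustrative values)
    (0, 0, 1),   # c2
    (1, 1, 0),   # c3
    (0, 1, 1),   # c4
]

def shattered(concepts, points):
    """True if every 0/1 labeling of the given point indices is realized."""
    realized = {tuple(c[i] for i in points) for c in concepts}
    return len(realized) == 2 ** len(points)

print(shattered(C, [0]))        # True: both labels occur on x1
print(shattered(C, [0, 1]))     # True: all four pairs occur on (x1, x2)
print(shattered(C, [0, 1, 2]))  # False: 4 concepts cannot cover 8 labelings

# VC dimension = size of the largest shattered subset (here 2).
```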