Unit 4 Short Notes
History of Deep Learning - A Probabilistic Theory of Deep Learning - Gradient Learning – Chain Rule and
Backpropagation - Regularization: Dataset Augmentation – Noise Robustness - Early Stopping, Bagging
and Dropout - Batch Normalization - VC Dimension and Neural Nets.
Why Uncertainty Is Important
In many real-life situations, data is noisy or incomplete. A good model should say “I’m not sure” when
the data is unclear. Probabilistic models do exactly that. They give not just predictions, but also a
measure of confidence (or uncertainty) in those predictions. Common ways to obtain uncertainty
estimates:
1. Bayesian Neural Networks:
The model learns a distribution over weights (not just one value), which allows it to make
predictions with uncertainty estimates.
2. Variational Inference:
Approximates complex probability distributions using simpler ones.
3. Monte Carlo (MC) Dropout:
It turns out that if we keep using dropout during testing, it is like doing approximate Bayesian
inference. This trick helps estimate uncertainty without big changes to the model.
4. Gaussian Processes (GPs):
GPs are models that can predict a distribution over functions (not just values).
5. Test-Time Sampling (how MC dropout is used in practice):
At test time, you run the model multiple times with dropout turned on. This gives different
results each time, and the variation between them tells you how uncertain the prediction is
(see the sketch after this list).
6. Ensemble Methods:
Train several models on the same task and compare their outputs. If the predictions vary a lot
between models, that means the model is less certain.
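To make the test-time sampling concrete, here is a minimal MC-dropout sketch (assuming PyTorch; the network SmallNet, its layer sizes, and the 100-sample count are illustrative choices, not from the notes):

```python
import torch
import torch.nn as nn

# Hypothetical small network containing dropout (sizes are arbitrary).
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 16), nn.ReLU(),
            nn.Dropout(p=0.5),   # stays active while the module is in train() mode
            nn.Linear(16, 1),
        )
    def forward(self, x):
        return self.net(x)

model = SmallNet()
model.train()                    # keep dropout ON at test time (MC dropout)

x = torch.randn(1, 4)            # one test input
samples = torch.stack([model(x) for _ in range(100)])  # 100 stochastic passes

mean = samples.mean(dim=0)       # the prediction
std = samples.std(dim=0)         # spread across passes ~ uncertainty
print(mean.item(), std.item())
```

A large standard deviation across the passes signals an input the model is unsure about; a small one signals a confident prediction.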
GRADIENT LEARNING
What is Gradient Learning?
Gradient learning is the process of training machine learning models (especially neural networks) by
optimizing their parameters (weights and biases). This is done using an algorithm called gradient
descent, which works by:
Looking at the loss function (which tells how wrong the model is).
Calculating the gradient (the direction and steepness of the slope of the loss).
Taking small steps in the direction that reduces the loss (like walking downhill to reach the
bottom).
Gradient methods work well for smooth and continuous functions (like those in neural networks).
It’s much easier to minimize these functions than discrete or irregular ones.
By estimating how small changes in the parameters affect the loss, we can improve the model
gradually.
Mini-batch gradient descent: computes gradients on small batches of data, which is more stable
than single-example SGD and faster than using the whole dataset.
Convex optimization: If the loss function is convex, gradient descent is guaranteed to find the
global minimum no matter where you start.
Non-convex functions (common in deep learning): No guarantee of reaching the best solution.
Starting values matter.
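A minimal runnable sketch of the walking-downhill idea (the loss L(w) = (w − 3)², the learning rate, and the step count are arbitrary illustrative choices):

```python
# Gradient descent on a convex quadratic loss L(w) = (w - 3)^2.
def loss_grad(w):
    return 2.0 * (w - 3.0)     # dL/dw

w = 10.0                        # arbitrary starting point
lr = 0.1                        # step size (learning rate)
for step in range(100):
    w -= lr * loss_grad(w)      # take a small step downhill

print(w)                        # converges toward the global minimum at w = 3
```

Because this loss is convex, any starting point reaches the same minimum; with a non-convex loss the final w would depend on where we started.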
Cost Func on
The mean squared error (MSE) loss function arises from maximum likelihood estimation (MLE) when
assuming a Gaussian distribution for the outputs.
1. Maximum Likelihood Estimation: Training chooses the parameters θ that maximize the likelihood of
the observed outputs, i.e., that minimize the negative log-likelihood −log p(y | x; θ). If we assume
p(y | x) is Gaussian with mean f(x; θ), this negative log-likelihood is (up to constants) the squared error.
2. The Cost Function from MLE: Putting this into the form of an expected value over the data
distribution p_data:
J(θ) = ½ E_(x,y ~ p_data) ||y − f(x; θ)||² + const
This is the mean squared error (MSE) cost function with a constant offset. The constant doesn't affect
optimization because it doesn't depend on θ.
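A compact sketch of that derivation in math form (the output dimension m and a fixed variance σ² are assumptions of this sketch):

```latex
% Negative log-likelihood of a Gaussian model p(y|x) = N(y; f(x;\theta), \sigma^2 I):
-\log p(y \mid x; \theta)
  = \frac{1}{2\sigma^2}\,\lVert y - f(x;\theta)\rVert^{2}
  + \frac{m}{2}\log\!\left(2\pi\sigma^{2}\right)
% Taking the expectation over the data distribution (with \sigma^2 = 1) gives
% the MSE cost plus a \theta-independent constant:
J(\theta)
  = \frac{1}{2}\,\mathbb{E}_{x,y \sim \hat{p}_{\text{data}}}
    \lVert y - f(x;\theta)\rVert^{2} + \text{const}
```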
3. Desirable Properties of the Gradient
The gradient of the cost function (how fast the cost is changing) should be large and predictable:
“the gradient must be large and predictable enough to serve as a good guide to the learning algorithm.”
The gradient tells the model how to update its parameters during training.
The cross-entropy cost used for MLE does not have a minimum value. If we model the output with a
Gaussian distribution, the cross-entropy involves the variance. If the model learns a tiny variance, it can
assign extremely high density to the correct output. This causes the log-likelihood to diverge to
negative infinity, so there is no well-defined minimum. That’s why regularization is needed: to prevent
the model from becoming overly confident.
We often want to learn just one conditional statistic of y given x. Instead of learning the whole
distribution p(y|x), we may only care about a single summary of it, such as its mean.
Example: In MSE regression, we’re learning the expected value of y given x.
In deep learning, we’re not just adjusting parameters; we’re learning a function f(x). The cost then
becomes a functional:
o Example: It takes the entire function f and gives a single number (the total loss).
So instead of thinking about minimizing cost by tuning parameters, we can think of:
o Choosing the best function from a space of all possible functions.
o The cost functional is designed so that its minimum lies at the function we want (e.g.,
the one mapping x to E[y|x]).
CHAIN RULE AND BACKPROPAGATION
The Chain Rule is a rule from calculus used to compute the derivative of a function composed of
other functions.
o In neural networks, we use it to compute how a change in weights affects the final
output error, even across multiple layers.
Backpropagation uses the chain rule to calculate gradients of the loss (error) with respect to the
weights in the network.
Backpropagation is a training algorithm for multi-layer neural networks (also called deep neural
networks).
It’s also called the generalized delta rule, an extension of simpler learning rules like the Widrow-
Hoff rule.
It systematically updates weights to minimize the error between predicted and actual output by
using gradient descent.
Step-by-step idea:
1. Forward Pass:
o Input data flows through the network layer by layer to produce an output, and the
error between the predicted and desired output is computed.
2. Backward Pass:
o Gradients (slopes of error with respect to weights) are calculated layer by layer, starting
from the output layer and going backward.
o These gradients show how much each weight contributed to the error.
3. Update Weights:
o Each weight is nudged in the direction that reduces the error, using gradient descent.
Nonlinearity
Without non-linearity, no matter how many layers we have, the entire network acts like a single
linear function (see the sketch below).
Hidden layers allow the network to learn non-linear and complex mappings.
This enables the network to solve real-world problems like image recognition, language
processing, etc.
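A tiny numeric sketch of the collapse claim (the matrix sizes and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))   # first "layer" weights (arbitrary sizes)
W2 = rng.standard_normal((3, 5))   # second "layer" weights
x = rng.standard_normal(4)

two_layers = W2 @ (W1 @ x)         # two stacked linear layers, no activation
one_layer = (W2 @ W1) @ x          # one equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: depth added no expressiveness
```

Inserting a non-linear activation such as σ between the two layers breaks this equality, which is exactly what lets deep networks represent more than a single linear map.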
The output of each neuron is scaled by the weight and passed forward.
The network learns by adjusting these weights using backpropagation so that the output gets
closer to the desired result.
Training Procedure:
Training a neural network means adjusting the weights so that it can produce correct outputs for a given
set of inputs. This is done through repeated exposure to input-output pairs.
1. Initialize Weights:
o All weights in the network are set to small random values (both positive and negative).
o This prevents neurons from becoming saturated (e.g., stuck with outputs too close to 0
or 1 in sigmoid).
2. Present a Training Pair:
o Select one input-output pair from the dataset. This is called supervised learning
because we provide the desired output.
3. Forward Pass:
o Data flows through the network from input → hidden layer(s) → output.
o Each neuron calculates its output using a weighted sum and activation function.
4. Compute the Error:
o Compute the error using a loss function (e.g., Mean Squared Error).
5. Adjust the Weights:
o The goal is to reduce the error by shifting the weights in the direction that lowers the
loss.
6. Repeat:
o Repeat the process for multiple epochs (full passes through the dataset) until the total
error is low enough.
Forward Pass:
Inputs are propagated through the network to produce a prediction, which is compared with the target.
Backward Pass:
Gradients are calculated for each weight using the chain rule.
1. Output Layer:
o The error here can be computed directly, by comparing the prediction with the desired output.
2. Hidden Layers:
o Hidden neurons have no target output of their own.
o Instead, their errors are inferred from the layers above using the chain rule.
In calculus, the chain rule is used to compute the derivative of a composite function. If you have
two functions composed as
y = f(g(x)),
then the derivative is
dy/dx = f′(g(x)) · g′(x).
Neural networks are composed of layers where each layer applies a function to the previous
layer’s output:
x → z = Wx + b → a = σ(z)
During training, we want to compute the gradient of the loss function with respect to each
parameter (e.g., the weights W) to update them using gradient descent. By the chain rule:
∂L/∂W = (∂L/∂a) · (∂a/∂z) · (∂z/∂W)
where L is the loss, a is the layer’s activation, and z is its pre-activation input.
This is done layer by layer from the output to the input (backward), hence the name
backpropagation.
Example: Single Neuron
z = w·x + b,
a = σ(z),
L = Loss(a, y).
Then, by the chain rule:
∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)
Each of these partial derivatives is easy to compute, and the chain rule lets us link them together to find
the gradient.
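A minimal numeric sketch of this single-neuron case (a sigmoid activation, squared-error loss, and the specific values of w, b, x, y are assumptions made for concreteness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative values.
w, b = 0.5, 0.1
x, y = 2.0, 1.0

# Forward pass: z = w*x + b, a = sigmoid(z), L = (a - y)^2 / 2
z = w * x + b
a = sigmoid(z)
L = 0.5 * (a - y) ** 2

# Backward pass: one local derivative per link in the chain.
dL_da = a - y                  # dL/da for squared error
da_dz = a * (1 - a)            # derivative of the sigmoid
dz_dw = x                      # since z = w*x + b
dL_dw = dL_da * da_dz * dz_dw  # chain rule links them together

w -= 0.1 * dL_dw               # one gradient descent step
print(L, dL_dw, w)
```

Repeating the forward and backward pass drives L down, which is exactly the update loop backpropagation performs across every weight of a deep network.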
1. The chain rule allows us to propagate error gradients backward through the layers.
2. It enables gradient-based optimization methods like SGD, Adam, etc.
3. It’s the core idea behind backpropagation, which powers training of deep neural
networks.
Regularization: Dataset Augmentation
Regularization techniques are essential for preventing overfitting in machine learning models, including
neural networks. Dataset augmentation is one such technique, used to enhance the generalization ability
of models by artificially increasing the size and diversity of the training dataset.
Heuristic data augmentation schemes often rely on the composition of a set of simple transformation
functions (TFs) such as rotations and flips. When chosen carefully, data augmentation
schemes tuned by human experts can improve model performance. However, in practice such heuristic
strategies can cause large variances in end model performance and may not produce the augmentations
needed for state-of-the-art models.
Data augmentation can be defined as the technique of improving the diversity of the data by slightly
modifying copies of already existing data or by creating new synthetic data from the existing data. It
regularizes the model and helps to reduce overfitting. Some of the techniques used for data
augmentation are described below.
Why Use Dataset Augmentation?
These changes make the model more robust (able to generalize better) by teaching it to handle
variations it might see in real-world data, without changing the essential meaning of the data.
1. Prevents overfitting: Helps the model avoid memorizing the training data.
2. Improves generalization: Makes the model better at handling unseen data.
3. Expands small datasets: Useful when real-world data is limited or hard to collect.
1. Geometric Transformations
Change the position or shape of the image:
o Rotation: Turn the image slightly.
o Translation: Shift the image up/down or left/right.
o Scaling: Zoom in or out.
o Cropping: Cut out a part of the image.
o Flipping: Mirror the image horizontally or vertically.
2. Color Transformations
Change how the image looks visually:
o Brightness: Make the image lighter or darker.
o Contrast: Change the difference between dark and light areas.
o Saturation: Make colors more or less intense.
o Hue: Shift the overall color tone.
3. Noise Injection
Add small random changes (noise) to simulate imperfections:
o Helps the model learn to ignore irrelevant variations.
4. Random Cropping and Padding
o Random cropping: Take a random part of the image.
o Padding: Add extra borders with a certain color or pattern.
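As a concrete illustration of combining these techniques, a typical pipeline might look like the following sketch (assuming the torchvision library; all parameter values are arbitrary, not recommendations):

```python
from torchvision import transforms

# A sample augmentation pipeline combining the techniques listed above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # geometric: slight rotation
    transforms.RandomHorizontalFlip(p=0.5),      # geometric: mirror
    transforms.ColorJitter(brightness=0.2,       # color: brightness, contrast,
                           contrast=0.2,         # saturation, and hue shifts
                           saturation=0.2,
                           hue=0.05),
    transforms.RandomCrop(size=224, padding=8),  # random crop with padding
    transforms.ToTensor(),
])

# Each epoch, the same training image yields a slightly different tensor:
# augmented = augment(pil_image)   # pil_image: a PIL.Image from the dataset
```

Because the transformations are re-sampled every epoch, the model effectively never sees the exact same input twice, which is the source of the regularization effect discussed next.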
Why Does Augmentation Have a Regularization Effect?
It helps the model avoid overfitting by making learning harder in a good way, so the model
doesn't just memorize but actually learns patterns that work on new, unseen data.
When you apply random changes (like rotations, brightness shifts, or noise) to your training
data, you're:
Making the data less perfect and more like the real world, where data isn’t always clean
or consistent.
Forcing the model to adapt to this variation, instead of just memorizing specific
examples.
This process "regularizes" the model, meaning it makes the learning more stable and
general.
Without augmentation:
A model might memorize training images, like "I know this cat because of the exact
position of its ears and background."
This leads to overfitting, where the model performs well on training data but poorly on
new data.
With augmentation:
The model sees many slightly different versions of each training image, so it must learn the
general features that survive those changes.
Example:
If you rotate, shift, and change the lighting of car images during training, the model learns:
“Ah, that’s still a car, even if it’s turned, shifted, or in different lighting.”
Early Stopping
Use: stop training when the validation error stops improving, even if the training error keeps falling.
It saves time by not training unnecessarily, and it keeps the model from overfitting the training set.
Bagging
Bagging (short for Bootstrap Aggregating) is an ensemble learning technique in machine learning
designed to improve the accuracy and stability of models, particularly those that are high-variance (e.g.,
decision trees).
1. Bootstrap Sampling:
o From the original dataset, multiple subsets are created by random sampling with
replacement.
o Each subset is the same size as the original dataset (or slightly smaller).
2. Model Training:
o A separate model is trained independently on each bootstrap subset.
3. Aggregation:
o The predictions of all models are combined: majority vote for classification,
averaging for regression.
Benefits:
Improves accuracy: Especially effective for unstable learners like decision trees.
Reduces variance: Averaging over many models smooths out individual errors.
Example:
The most well-known application of bagging is the Random Forest algorithm, which builds multiple
decision trees using bagged samples and random feature selection.
Pseudocode
Bagging (Bootstrap Aggregating) – Pseudocode
Input:
    Training data D = {(x1, y1), ..., (xn, yn)}
    Number of base models T
    Base learning algorithm L
Algorithm:
    For t = 1 to T:
        D_t = bootstrap sample of size n drawn from D with replacement
        h_t = L(D_t)    // train a base model on the sample
Output:
    Classification: H(x) = majority vote of h_1(x), ..., h_T(x)
    Regression:     H(x) = average of h_1(x), ..., h_T(x)
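A compact runnable sketch of this procedure (using scikit-learn's decision tree as the base learner; the dataset, T = 25, and the seed are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate by majority vote across the ensemble."""
    votes = np.stack([m.predict(X) for m in models])   # shape (T, n_samples)
    # Majority vote per column (assumes non-negative integer class labels).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# Tiny synthetic demo (illustrative only).
X = np.random.default_rng(1).standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]), y[:5])
```

Each tree sees a slightly different resampled dataset, so their individual errors partially cancel when the votes are aggregated.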
Dropout
Dropout is a regularization technique that randomly deactivates neurons while the network trains, so it
cannot rely too heavily on any single neuron.
1. Random Deactivation:
o In each training iteration, a fraction of neurons is set to zero with a probability p
(usually between 0.2 and 0.5).
2. Training and Inference:
o Dropout is only applied during training.
o During inference, all neurons are active.
o Outputs are scaled at inference (by the keep probability 1 − p, since p here is the drop
probability) so that the expected activations match those seen during training.
3. Ensemble Effect:
o Dropout simulates training many different subnetworks.
o This ensemble behavior helps in learning more generalizable features and reduces
reliance on specific neurons.
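A minimal numpy sketch of the training-time mask and inference-time scaling described above (vanilla dropout; the layer size and p = 0.5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout (drop) probability

def dropout_train(a, p):
    """Training: zero each activation independently with probability p."""
    mask = rng.random(a.shape) >= p       # keep with probability 1 - p
    return a * mask

def dropout_infer(a, p):
    """Inference: keep all neurons, scale outputs to match the training-time
    expected value E[mask * a] = (1 - p) * a."""
    return a * (1 - p)

a = rng.standard_normal(8)                # activations of some hidden layer
print(dropout_train(a, p))                # a different random subnetwork each call
print(dropout_infer(a, p))                # deterministic, scaled
```

Each call to dropout_train corresponds to one of the many subnetworks in the ensemble view; the inference-time scaling approximates averaging over all of them.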
Batch Normalization
Batch Normalization is a technique used to make training deep neural networks faster
and more stable.
When a neural network is training, the parameters (like weights) in each layer keep
changing. This causes the distribution (e.g., the range and mean of values) of inputs to
the next layers to also change. This is called internal covariate shift.
Imagine you're trying to learn something new, but the rules keep changing slightly every
time: it's harder to learn. That’s what internal covariate shift does to neural networks.
Batch normalization addresses this by:
1. Normalizing the inputs of each layer (subtracting the mean and dividing by the standard
deviation).
2. Then scaling and shifting the normalized values using learned parameters (so the network still
has flexibility).
This lets the network:
Train faster,
Use higher learning rates,
Be less sensitive to weight initialization,
And often generalize better (perform well on unseen data).
Normalization
For every mini-batch (a small subset of your dataset used during training), batch normalization
standardizes the inputs to a layer.
It does this by subtracting the mini-batch mean and dividing by the mini-batch standard deviation:
x̂ = (x − μ_B) / √(σ²_B + ε)
where μ_B and σ²_B are the mean and variance of the current mini-batch and ε is a small constant
for numerical stability.
Purpose: This ensures the inputs to each layer have zero mean and unit variance, which helps
the network learn more efficiently.
After normalization, we don’t just pass the standardized values as-is. We apply two trainable
parameters, a scale γ (gamma) and a shift β (beta):
y = γ·x̂ + β
This allows the network to undo the normalization if needed and still learn the best
representation for the task.
During training: The mean and variance are calculated from each mini-batch.
During inference (when making predictions): We use running averages of the mean and
variance computed over the whole training process, not batch-wise stats (see the sketch below).
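A small numpy sketch of this train/inference difference (the scalar γ, β, the momentum value, and the batch size are illustrative assumptions):

```python
import numpy as np

eps, momentum = 1e-5, 0.1
gamma, beta = 1.0, 0.0             # trainable scale and shift (scalars here)
running_mean, running_var = 0.0, 1.0

def batchnorm(x, training):
    global running_mean, running_var
    if training:
        mu, var = x.mean(), x.var()            # mini-batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var    # stored averages at inference
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                # learned scale and shift

batch = np.random.default_rng(0).standard_normal(32)
out_train = batchnorm(batch, training=True)    # uses batch statistics
out_test = batchnorm(batch, training=False)    # uses running averages
print(out_train.mean().round(6), out_train.std().round(6))
```

The training branch is what injects the mini-batch randomness mentioned under "Regularization" below; the inference branch makes predictions deterministic.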
1. Improved Optimization
Batch normalization allows the model to use higher learning rates safely.
Normally, high learning rates can make training unstable, but batch norm helps keep
activations in a predictable range.
This speeds up training and reduces the need for careful manual tuning of learning rates
or other hyperparameters.
2. Regularization
During training, batch norm uses the mini-batch statistics (mean and variance), which
introduces a bit of randomness into each forward pass.
This acts like a regularizer by slightly disturbing the activations each time, much like
dropout.
As a result, it helps reduce overfitting: the model becomes less likely to just memorize
the training data.
3. Reduced Sensitivity to Weight Initialization
Neural networks are often sensitive to their initial weights; bad initializations can slow
down or ruin training.
Batch normalization lessens this sensitivity, because it keeps activations well-behaved
even if the initial weights aren’t ideal.
This means the network is more robust and more likely to converge to a good solution.
4. Enables Deeper Architectures
One of the main challenges in training very deep networks is the internal covariate shift,
where the input distribution to layers changes constantly.
Batch normalization reduces this shift, which makes it easier to train deeper
architectures.
That’s why modern deep models like ResNet, VGG, and Transformers often use batch
norm.
VC Dimension and Neural Nets
The VC dimension is a theoretical concept that measures the capacity or expressiveness of a
learning algorithm: specifically, the size of the largest set of points that can be shattered by the
model.
To shatter a set of points means that for every possible way of labeling those points
(e.g., as + or −), there exists some classifier in the model's hypothesis space that can
perfectly separate them.
A linear classifier (like a straight line) can shatter 3 points in 2D space: for any labeling of
3 points in general position, there is always a line that separates the + and − correctly.
However, 4 points cannot always be shattered by a linear classifier: there is at least one
labeling of 4 points (e.g., the XOR arrangement) for which no single straight line can separate
the + and − labels perfectly.
The VC dimension tells us how complex a neural network is in terms of the variety of patterns it
can learn.
A higher VC dimension usually means a model can fit more complex data, but it also means a
higher risk of overfitting.
It’s a crucial concept for understanding generalization: whether a model just memorizes data or
truly learns patterns.
Shattering a set of examples:
Assume a binary classification problem with N examples in R^D and consider the set of 2^N possible
dichotomies. For instance, with N = 3 examples, the set of all possible dichotomies is {(000), (001), (010),
(011), (100), (101), (110), (111)}. A class of functions is said to shatter the dataset if, for every possible
dichotomy, there is a function f(α) that models it. Consider as an example a finite concept class C =
{c1, ..., c4} applied to three instance vectors x1, x2, x3, with the results analyzed below.
Step-by-Step Breakdown Using the Table:
We are working with a concept class C = {c1, c2, c3, c4}, and each concept function gives output labels on
three input instances: x1, x2, x3.
To shatter a set of input points means that for every way you could assign 0s and 1s (labels) to those
points, there is some concept function in C that gives exactly those labels.
We now ask: can our concept class produce all those combinations?
Detailed Analysis
1 Point (x1):
c2(x1) = 0
c3(x1) = 1
c4(x1) = 0
So we can generate both outputs, 0 and 1 → all labelings possible → 1 point is shattered.
2 Points (x1, x2):
(0, 0), (0, 1), (1, 0), (1, 1) → all four dichotomies present → 2 points are shattered.
3 Points (x1, x2, x3):
There are 2³ = 8 possible dichotomies: (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0),
(1,1,1). With only four concept functions, the class can realize at most four of these eight
labelings, so some dichotomy has no matching concept → 3 points are not shattered.
Final Result:
The largest set that can be shattered contains 2 points.
So, VC dimension = 2.
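A short sketch that checks shattering by brute force. Note the concept table below is a hypothetical completion invented so the code can run (only the pattern of "both labels occur on x1, all four pairs occur on (x1, x2), eight triples cannot all occur" is taken from the analysis above):

```python
# Hypothetical concept class: rows are c1..c4, columns are labels on x1, x2, x3.
C = [
    (1, 0, 0),   # c1 (illustrative values)
    (0, 0, 1),   # c2
    (1, 1, 0),   # c3
    (0, 1, 1),   # c4
]

def shattered(concepts, points):
    """True if every 0/1 labeling of the given point indices is realized."""
    realized = {tuple(c[i] for i in points) for c in concepts}
    return len(realized) == 2 ** len(points)

print(shattered(C, [0]))        # True: both labels occur on x1
print(shattered(C, [0, 1]))     # True: all four pairs occur on (x1, x2)
print(shattered(C, [0, 1, 2]))  # False: 4 concepts cannot cover 8 labelings

# VC dimension = size of the largest shattered subset (here 2).
```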