Input Neurons for Handwritten Digit Classifier

The document provides an overview of key concepts in training neural networks, including risk minimization, loss functions, backpropagation, regularization, model selection, and optimization. It details procedures, applications, and examples for each concept, emphasizing their roles in effective model training and generalization. Additionally, it discusses the architecture of feed-forward neural networks and their universal approximation capabilities.

Unit 3: Training Neural Networks

1. Risk Minimization
✅ Definition:
Risk minimization is the process of reducing the expected loss (or error) a model
incurs when predicting outputs on unseen data.
There are two types of risk:
 Empirical Risk: the average loss over the training data.
 True Risk: the expected loss over the entire data distribution (unknown in practice).

🔁 Procedure:
 Use training data to estimate empirical risk.
 Apply optimization algorithms to minimize it.
💡 Application:
Used in all supervised learning tasks like classification and regression.
📌 Example:
Minimizing cross-entropy loss in a neural network for image classification.
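Empirical risk is simply the average of a chosen loss over a finite training sample. A minimal sketch (function and data names here are illustrative, not from the original):

```python
import numpy as np

def empirical_risk(y_true, y_pred, loss):
    """Empirical risk: average per-example loss over the training sample."""
    return np.mean([loss(t, p) for t, p in zip(y_true, y_pred)])

# Squared loss as an example per-example loss
squared_loss = lambda t, p: (t - p) ** 2

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

risk = empirical_risk(y_true, y_pred, squared_loss)
print(risk)  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```

Minimizing this quantity over the model parameters is empirical risk minimization; the true risk replaces the sample average with an expectation over the (unknown) data distribution.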

2. Loss Function
✅ Definition:
A loss function quantifies how well the model's predictions match the true labels. It
guides the learning process by providing a numerical measure of error.

📈 Common Types:

TASK | LOSS FUNCTION | FORMULA
Binary Classification | Binary Cross-Entropy | -∑ₙ [tₙ ln yₙ + (1-tₙ) ln(1-yₙ)]
Multi-Class Classification | Categorical Cross-Entropy | -∑ₙ ∑ₖ tₙₖ ln yₙₖ
Regression | Mean Squared Error (MSE) | (1/N) ∑ₙ (yₙ - tₙ)²
Object Detection | Smooth L1 Loss | Hybrid of MSE and MAE: quadratic for small errors, linear for large ones

🔁 Procedure:
 Compute loss after forward pass.
 Pass gradients back via backpropagation.
💡 Application:
Used in every neural network to guide learning.

📌 Example:
Using MSE loss to train a neural network for house price prediction.
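Two of the losses above can be sketched directly in NumPy (the helper names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Regression: house prices (in thousands)
prices_true = np.array([200.0, 310.0])
prices_pred = np.array([210.0, 300.0])
print(mse(prices_true, prices_pred))  # 100.0

# Binary classification: labels vs predicted probabilities
labels = np.array([1.0, 0.0])
probs = np.array([0.9, 0.2])
print(binary_cross_entropy(labels, probs))  # ≈ 0.1643
```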

3. Backpropagation
✅ Definition:
Backpropagation is an algorithm used to compute the gradient of the loss function with
respect to each weight in the neural network using the chain rule of calculus .
🧠 How It Works:
1. Forward Pass : Input is passed through the network to get output.
2. Loss Calculation : Compute error using loss function.
3. Backward Pass :
 Compute derivative of loss w.r.t. weights.
 Update weights using gradient descent.
🔁 Algorithm Steps:
1. Initialize weights randomly.
2. For each batch:
 Forward propagate inputs.
 Compute loss.
 Backpropagate gradients.
 Update weights using optimizer.
💡 Application:
Essential for training all neural networks — from simple MLPs to deep CNNs and
RNNs.
📌 Example:
Training a feedforward network to classify handwritten digits (MNIST dataset).

4. Regularization
✅ Definition:
Regularization techniques prevent overfitting , where the model performs well on
training data but poorly on new, unseen data.
🧰 Common Techniques:
METHOD | DESCRIPTION
L1 Regularization (Lasso) | Adds a penalty proportional to the absolute value of the weights
L2 Regularization (Ridge) | Adds a penalty proportional to the square of the weights
Dropout | Randomly drops neurons during training
Early Stopping | Stops training when validation performance plateaus
Data Augmentation | Increases the diversity of the training data artificially
🔁 How Used:
 Add a regularization term to the loss function:
Total Loss = Original Loss + λ · Regularization Term
💡 Application:
Improves generalization in models trained on small or noisy datasets.
📌 Example:
Using dropout in a CNN for image classification to reduce overfitting.
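The penalized loss above can be sketched in a few lines; the weights, data loss, and λ below are made-up illustrative values:

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lambda times the sum of squared weights."""
    return lam * sum(np.sum(w ** 2) for w in weights)

weights = [np.array([0.5, -1.0]), np.array([2.0])]  # per-layer weight arrays
data_loss = 0.30                                    # the "original" loss
lam = 0.01

total_loss = data_loss + l2_penalty(weights, lam)
# penalty = 0.01 * (0.25 + 1.0 + 4.0) = 0.0525
print(total_loss)  # ≈ 0.3525
```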

5. Model Selection
✅ Definition:
Model selection involves choosing the best-performing model architecture and
hyperparameters based on performance metrics.
Process:
1. Define candidate models (e.g., different layer sizes, activation functions).
2. Train and evaluate them using:
 Cross-validation
 Validation set
3. Choose the one with best performance on validation/test sets.
📊 Evaluation Metrics:
 Accuracy
 Precision, Recall, F1-score (classification)
 MSE, MAE (regression)
💡 Application:
Critical in developing robust AI systems for real-world deployment.
📌 Example:
Selecting between a 2-layer vs 4-layer neural network for sentiment analysis based on
validation accuracy.
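The final selection step can be as simple as picking the candidate with the best validation score; the scores below are invented for illustration:

```python
# Hypothetical validation accuracies for candidate architectures
candidates = {
    "2-layer": 0.86,
    "4-layer": 0.89,
    "6-layer": 0.84,  # deeper, but overfits on this (hypothetical) task
}

best_model = max(candidates, key=candidates.get)
print(best_model)  # 4-layer
```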

6. Optimization
✅ Definition:
Optimization refers to the method used to update the model parameters to minimize the
loss function.
🧰 Common Optimizers:
OPTIMIZER | DESCRIPTION
Stochastic Gradient Descent (SGD) | Simple; uses a single sample per update
SGD with Momentum | Accelerates SGD in relevant directions
RMSProp | Adapts the learning rate per parameter
Adam | Combines momentum and RMSProp; most popular
Learning Rate Schedulers | Adjust the learning rate dynamically during training
🔁 How It Works:
Each update step moves the parameters against the gradient of the loss:
w ← w - η∇E(w), where η is the learning rate.
💡 Application:
Used in all phases of training — from initial convergence to fine-tuning.
📌 Example:
Using Adam optimizer to train a Transformer-based language model.
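All of these optimizers build on the same basic update, w ← w - η∇E(w). A sketch on a hypothetical one-dimensional quadratic (not from the original):

```python
def gd_step(w, grad_fn, lr=0.1):
    """One gradient-descent update: w <- w - lr * dE/dw."""
    return w - lr * grad_fn(w)

# Toy objective E(w) = (w - 3)^2 with gradient 2(w - 3)
grad = lambda w: 2 * (w - 3)

w = 0.0
for _ in range(100):
    w = gd_step(w, grad)
print(round(w, 4))  # converges to the minimizer, 3.0
```

Momentum, RMSProp, and Adam all modify how the raw gradient is accumulated and scaled before this step, but the subtract-the-gradient core is the same.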

🧩 Putting It All Together: Full Training Pipeline


Here’s how these components interact in a typical neural network training loop:
python
for epoch in epochs:
    for batch in batches:
        # 1. Forward Propagation
        output = model(input)

        # 2. Compute Loss
        loss = loss_function(output, true_labels)

        # 3. Backpropagation (PyTorch-style)
        loss.backward()

        # 4. Optimization (Update Weights)
        optimizer.step()
        optimizer.zero_grad()

        # 5. Regularization (e.g., dropout, weight decay)
        apply_regularization(model)

    # 6. Model Selection & Early Stopping
    val_loss = evaluate_on_validation_set()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model()
    else:
        patience_counter += 1
        if patience_counter > early_stop_threshold:
            break

📊 Summary Table
COMPONENT | PURPOSE | KEY METHODS | APPLICATIONS | EXAMPLE
Risk Minimization | Reduce prediction error | Empirical risk estimation | Supervised learning | Image classification
Loss Function | Measure prediction error | MSE, Cross-Entropy | Any ML task | House price prediction
Backpropagation | Compute gradients | Chain rule | All neural nets | MNIST digit recognition
Regularization | Prevent overfitting | Dropout, L2, Early Stopping | Small/noisy datasets | CNN for object detection
Model Selection | Choose best model | Cross-validation | Final deployment | Choosing optimal network depth
Optimization | Update weights | Adam, SGD | Model training | Training large language models

📝 Final Notes:
 Training a neural network is a complex interplay of risk minimization, loss
computation, gradient propagation, regularization, model tuning, and parameter
optimization.
 Each component plays a critical role in ensuring that the model learns effectively
and generalizes well to new data.

🔍 What is MAE?
✅ MAE stands for Mean Absolute Error .

It is a loss function used in regression tasks to measure the average magnitude of errors
between predicted values and actual values, without considering their direction (i.e.,
sign).

📈 Definition & Formula:
MAE = (1/n) ∑ᵢ₌₁ⁿ |yᵢ − ŷᵢ|
where yᵢ is the actual value, ŷᵢ the prediction, and n the number of samples.
🧠 Key Characteristics of MAE:


FEATURE | DESCRIPTION
Robustness | Less sensitive to outliers than MSE
Interpretability | Easy to understand — same unit as the target variable
Optimization | Not differentiable at zero (due to the absolute value)

🆚 MAE vs. MSE


CRITERIA | MAE | MSE
Formula | (1/n) ∑ |yᵢ − ŷᵢ| | (1/n) ∑ (yᵢ − ŷᵢ)²
Sensitivity to Outliers | Low | High
Interpretation | Average error magnitude | Average squared error
Use Case | When all errors are treated equally | When large errors should be penalized more

💡 Applications of MAE
 House price prediction
 Temperature forecasting
 Sales forecasting
 Any regression task where outlier errors shouldn't dominate training

📌 Example:
Suppose we have:

ACTUAL (yᵢ) | PREDICTED (ŷᵢ)
5 | 7
10 | 9
3 | 2

Then:
MAE = (|5 − 7| + |10 − 9| + |3 − 2|) / 3 = (2 + 1 + 1) / 3 ≈ 1.33
🧩 So, What Does “Hybrid of MSE and MAE” Mean?


Sometimes, neither MSE nor MAE alone performs optimally — especially when you
want:
 Robustness to outliers (like MAE)
 Smooth gradients for optimization (like MSE)
To get the best of both worlds, hybrid loss functions like Smooth L1 Loss or Huber Loss
are used.

🔄 Hybrid Loss Example: Huber Loss

For error e and threshold δ:
L_δ(e) = ½ e²           if |e| ≤ δ
L_δ(e) = δ(|e| − ½ δ)   otherwise

 For small errors → behaves like MSE (quadratic)
 For large errors → behaves like MAE (linear)
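A sketch of Huber loss showing both regimes, assuming threshold δ = 1 (the function name is illustrative):

```python
import numpy as np

def huber(error, delta=1.0):
    """Quadratic (MSE-like) for |error| <= delta, linear (MAE-like) beyond."""
    a = np.abs(error)
    return np.where(a <= delta,
                    0.5 * error ** 2,
                    delta * (a - 0.5 * delta))

losses = huber(np.array([0.5, 3.0]))
print(losses)  # 0.5 * 0.5^2 = 0.125 and 1 * (3 - 0.5) = 2.5
```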

📦 Applications of Hybrid (MSE + MAE) Loss


 Object detection (e.g., Faster R-CNN uses Smooth L1 Loss)
 Noisy datasets with outliers
 Tasks requiring stable gradient flow during training

✅ Summary Table
CONCEPT | DESCRIPTION
MAE | Mean Absolute Error – measures the average absolute difference between predictions and actuals
MSE | Mean Squared Error – penalizes larger errors more heavily
Hybrid Losses | Combine properties of both (e.g., Huber, Smooth L1)
Use Case | Regression tasks with possible outliers or a need for robustness
Example | Using Huber loss in object detection models

A note on Logit
The logit is a mathematical function that plays a central role in statistics and machine
learning, especially in logistic regression. Here's a breakdown:
🔍 What is a Logit?

The logit function transforms a probability value (between 0 and 1) into a real number (from
−∞ to +∞). It's defined as:

logit(p) = ln( p / (1 − p) )
Where:
 p is the probability of an event occurring
 ln is the natural logarithm
This expression is also known as the log-odds, because it’s the logarithm of the odds of the
event happening.
📊 Why Use the Logit?
 In logistic regression, we model the logit of the probability as a linear combination of
input features:
logit(p) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ
 This lets us use linear models to predict probabilities while keeping the output
bounded between 0 and 1 after applying the sigmoid function, which is the inverse of
the logit.
🧠 Intuition
 A logit of 0 means the event is equally likely to happen or not (i.e., p = 0.5)
 A positive logit means the event is more likely than not
 A negative logit means the event is less likely than not
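These properties are easy to verify in code; sigmoid below is the standard logistic function, the inverse of the logit:

```python
import math

def logit(p):
    """Log-odds of probability p."""
    return math.log(p / (1 - p))

def sigmoid(a):
    """Inverse of the logit."""
    return 1 / (1 + math.exp(-a))

print(logit(0.5))   # 0.0 -> event equally likely to happen or not
print(logit(0.9))   # positive -> more likely than not
print(logit(0.1))   # negative -> less likely than not
print(round(sigmoid(logit(0.73)), 6))  # 0.73 -> sigmoid undoes the logit
```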

Neural Networks Chapter in Book: Pattern Recognition and Machine
Learning by Bishop
Detailed Analysis of Feed-Forward Neural Networks (Multilayer
Perceptrons)
Introduction to Neural Networks
Neural networks, specifically feed-forward networks or multilayer perceptrons (MLPs), are
powerful models for regression and classification tasks that overcome limitations of linear
models with fixed basis functions. The key advantage of neural networks is their ability to
adapt basis functions to the data, addressing the curse of dimensionality that plagues fixed
basis function approaches.
Compared to support vector machines (SVMs) and relevance vector machines (RVMs),
neural networks offer:
 More compact models (fewer parameters for equivalent performance)
 Faster evaluation of new data points
 The tradeoff of non-convex optimization during training

Network Architecture and Functional Form

[Figures in the original: a single neuron with a thresholding activation function, and a complete neural network]
The basic neural network model consists of a series of functional transformations:


Input to Hidden Layer Transformation:
First, M linear combinations of input variables x₁,...,x_D are formed:

aⱼ = ∑ᵢ₌₁ᴰ wⱼᵢ⁽¹⁾xᵢ + wⱼ₀⁽¹⁾ for j = 1,...,M


Where:
 wⱼᵢ⁽¹⁾ are first-layer weights
 wⱼ₀⁽¹⁾ are first-layer biases
 aⱼ are called activations
Each activation is then transformed using a nonlinear activation function h(·):
zⱼ = h(aⱼ)
Common activation functions include:
 Logistic sigmoid: σ(a) = 1/(1 + exp(-a))
 Hyperbolic tangent: tanh(a)
Hidden to Output Layer Transformation
The hidden unit outputs zⱼ are then linearly combined to give output unit activations:

aₖ = ∑ⱼ₌₁ᴹ wₖⱼ⁽²⁾zⱼ + wₖ₀⁽²⁾ for k = 1,...,K
Where:
 K is the number of outputs
 wₖⱼ⁽²⁾ are second-layer weights
 wₖ₀⁽²⁾ are second-layer biases
Output Transformation
The final outputs yₖ are obtained by applying appropriate activation functions:
 For regression: identity function, yₖ = aₖ (the output equals the activation)
 For binary classification: logistic sigmoid, yₖ = σ(aₖ) = 1/(1 + exp(−aₖ))
 For multiclass classification: softmax function, yₖ = exp(aₖ) / ∑ⱼ₌₁ᴷ exp(aⱼ)

Combined Network Function


The complete network function for sigmoidal output activation is:
yₖ(x,w) = σ(∑ⱼ₌₁ᴹ wₖⱼ⁽²⁾h(∑ᵢ₌₁ᴰ wⱼᵢ⁽¹⁾xᵢ + wⱼ₀⁽¹⁾) + wₖ₀⁽²⁾)
Simplified Notation with Absorbed Biases
We can simplify notation by absorbing bias terms into the weights:
1. Add an input x₀ = 1
2. The first layer becomes: aⱼ = ∑ᵢ₌₀ᴰ wⱼᵢ⁽¹⁾xᵢ
3. Similarly for the second layer by adding z₀ = 1
The network function then becomes:
yₖ(x,w) = σ(∑ⱼ₌₀ᴹ wₖⱼ⁽²⁾h(∑ᵢ₌₀ᴰ wⱼᵢ⁽¹⁾xᵢ))
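This network function can be sketched directly in NumPy, with the constant units x₀ = 1 and z₀ = 1 made explicit; the weights below are random placeholders, not trained values:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(x, W1, W2):
    """Two-layer network with biases absorbed into the weight matrices."""
    x = np.concatenate(([1.0], x))   # prepend x0 = 1
    z = np.tanh(W1 @ x)              # hidden activations z_j = h(a_j)
    z = np.concatenate(([1.0], z))   # prepend z0 = 1
    return sigmoid(W2 @ z)           # sigmoidal outputs y_k

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                    # inputs, hidden units, outputs
W1 = rng.normal(size=(M, D + 1))     # first-layer weights (bias column included)
W2 = rng.normal(size=(K, M + 1))     # second-layer weights (bias column included)

y = forward(rng.normal(size=D), W1, W2)
print(y.shape)                       # (2,): one value per output unit, each in (0, 1)
```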

Network Diagram Representation


The network can be visualized as:
1. Input nodes (x₀,x₁,...,x_D)
2. Hidden nodes (z₀,z₁,...,z_M)
3. Output nodes (y₁,...,y_K)
4. Weighted connections between layers
Key properties:
 Information flows forward (feed-forward)
 No cycles allowed (ensures deterministic outputs)
 Bias units are represented as connections from constant unit (x₀=1, z₀=1)

Universal Approximation Capability
Feed-forward neural networks are universal approximators:
 A two-layer network with linear outputs can approximate any continuous function on
a compact input domain to arbitrary accuracy
 Requires sufficiently large number of hidden units
 Works for wide range of activation functions (excluding polynomials)
Example Approximation
Figure 5.3 demonstrates how a two-layer network with:
 3 hidden units
 tanh activation functions
 Linear output units
can approximate various functions:
1. f(x) = x²
2. f(x) = sin(x)
3. f(x) = |x|
4. f(x) = H(x) (Heaviside step function)

The network was trained on N=50 uniformly sampled points in (-1,1).
Weight Space Symmetries
Neural networks exhibit important symmetries in weight space:
Sign-Flip Symmetry
For a hidden unit with tanh activation:
1. Change sign of all incoming weights and bias: wⱼᵢ⁽¹⁾ → -wⱼᵢ⁽¹⁾, wⱼ₀⁽¹⁾ → -wⱼ₀⁽¹⁾
2. Change sign of all outgoing weights: wₖⱼ⁽²⁾ → -wₖⱼ⁽²⁾
3. The network function remains unchanged because tanh(-a) = -tanh(a)
For M hidden units, there are 2ᴹ equivalent weight vectors from this symmetry.
Hidden Unit Interchange Symmetry
We can interchange all weights associated with two hidden units:
1. Swap incoming weights and biases for units j and k
2. Swap outgoing weights for units j and k
3. The network function remains identical
For M hidden units, there are M! possible orderings, leading to M! equivalent weight vectors.
Total Symmetry
Combining both symmetries, the total number of equivalent weight vectors is M!2ᴹ.
For networks with more layers, the total symmetry is the product of these factors for each
hidden layer.
Example: Classification Problem
Figure 5.4 shows a two-class classification problem solved with a network having:
 2 inputs
 2 hidden units with tanh activation
 1 logistic sigmoid output
Key components visualized:
1. Dashed blue lines: z=0.5 contours for each hidden unit
2. Red line: y=0.5 decision boundary
3. Green line: optimal decision boundary from true data distributions

Mathematical Derivations and Properties
Forward Propagation
The network computes outputs via forward propagation:
1. For each hidden unit j:

aⱼ = ∑ᵢ wⱼᵢ⁽¹⁾xᵢ + wⱼ₀⁽¹⁾
zⱼ = h(aⱼ)
2. For each output unit k:

aₖ = ∑ⱼ wₖⱼ⁽²⁾zⱼ + wₖ₀⁽²⁾
yₖ = fₖ(aₖ)
Where fₖ is the appropriate output activation function.
Linear Network Limitation
If all activation functions h(·) are linear:
 The network reduces to a single linear transformation
 Can always find equivalent network without hidden units
 Only interesting if hidden layer has lower dimension than input/output
(dimensionality reduction)
Extension to Deep Networks
The architecture can be extended to more layers:
 Additional hidden layers can be inserted
 Each layer follows the same pattern: linear combination + nonlinearity
 Number of layers typically counted by number of weight layers (e.g., 2-layer network
has one hidden layer)
Sparse and Skip Connections
Variations include:
1. Skip-layer connections: Direct inputs to outputs
o Can be explicitly included
o Can theoretically be approximated by hidden units but may be more efficient
explicitly
2. Sparse connections: Not all possible connections present
o Example: Convolutional neural networks (Section 5.5.6)
Practical Considerations
Terminology Note
There is variation in how layers are counted:

1. Some count all layers of units (input + hidden + output)
o e.g., "3-layer network" for one hidden layer
2. Some count only hidden layers
o e.g., "single-hidden-layer network"
3. Recommended: Count layers of adaptive weights
o Inputs are not adaptive, so "2-layer network" for one hidden layer
Biological Inspiration
While originally inspired by biological neurons:
 Modern neural networks are primarily mathematical constructs
 Biological realism is not necessary for practical pattern recognition
 Focus is on statistical pattern recognition capabilities
Conclusion
Feed-forward neural networks provide:
1. Flexible nonlinear function approximation
2. Universal approximation capabilities
3. Compact representation compared to some other models
4. Efficient forward evaluation (after training)
The price is:
1. Non-convex optimization during training
2. Potential multiple equivalent solutions due to weight symmetries
3. Need for careful architecture selection
The next sections would typically cover:
 Network training via backpropagation
 Regularization approaches
 Bayesian treatments
 Specialized architectures like convolutional networks

In neural networks, the choice of activation function for the input, hidden,
and output layers depends on the type of problem you're solving and the
behavior you want from each layer.
Let’s break it down step-by-step with examples:
🔹 1. Input Layer Activation Function
 Typically: No activation function is used.
 Why? The input layer just passes the raw data to the next layer.
 Exception: Sometimes normalization or embedding layers are used for preprocessing.
Example:
If your input is image pixel values (e.g., 0–255), you might normalize them to 0–1 or -1 to
+1, but no activation function is applied.
🔹 2. Hidden Layer Activation Function
 Used to introduce non-linearity so the network can model complex patterns.
 Common choices:
o ReLU (Rectified Linear Unit): f(x) = max(0, x)
o Sigmoid: f(x) = 1 / (1 + e^(-x))
o Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

🔸 Which to choose?

ACTIVATION | WHEN TO USE | PROS | CONS
ReLU | Default for deep nets | Fast, reduces vanishing gradient | Can "die" (outputs 0)
Sigmoid | For probabilistic behavior | Output between 0–1 | Vanishing gradient
Tanh | When output needs to be between -1 and 1 | Zero-centered | Slower than ReLU
Here's a tabular form showing practical examples/applications for each activation
function, along with when and why they are used:

ACTIVATION FUNCTION | WHEN TO USE | APPLICATION EXAMPLE | WHY USED IN THIS CASE
ReLU, f(x) = max(0, x) | Default for hidden layers in deep networks | Image classification (e.g., CNNs like ResNet, VGG) | Efficient, avoids vanishing gradient, fast convergence
Sigmoid, f(x) = 1 / (1 + e^(-x)) | Output layer for binary classification | Spam detection, tumor detection (yes/no) | Outputs a probability between 0 and 1 for a binary decision
Tanh, f(x) = (e^x - e^(-x)) / (e^x + e^(-x)) | When input needs centering around 0 | Time series prediction (e.g., in RNNs) | Output is between -1 and 1; better for models needing zero-centered outputs
Leaky ReLU, f(x) = x if x > 0, αx otherwise | For deeper nets to avoid dying ReLU | Deep GANs or NLP (e.g., Transformer-based models) | Keeps a small gradient for negative inputs to avoid dead neurons
Softmax, f(xᵢ) = e^(xᵢ) / ∑ⱼ e^(xⱼ) | Output layer for multi-class classification | Handwritten digit recognition (MNIST 0–9) | Converts outputs into probabilities across multiple classes

🔹 3. Output Layer Activation Function


 Depends on the task type:

TASK TYPE | OUTPUT LAYER ACTIVATION | WHY
Binary Classification | Sigmoid | Gives output between 0 and 1 (probability)
Multi-Class Classification | Softmax | Converts outputs into class probabilities
Regression | None or Linear | To predict continuous values

✅ Example 1: Binary Classification (e.g., Spam Detection)

Input: Email features
Hidden Layer: ReLU
Output Layer: Sigmoid (output: probability of spam)
The sigmoid gives a probability over the two classes – select the one with the bigger probability.
✅ Example 2: Multi-Class Classification (e.g., Handwritten digit: 0–9)
Input: Image pixels (28x28)
Hidden Layers: ReLU
Output Layer: Softmax (10 neurons for 10 digits)
The softmax gives a probability for each class – select the one with the largest probability.
✅ Example 3: Regression (e.g., House Price Prediction)
Input: Features like size, location, age
Hidden Layers: ReLU or Tanh
Output Layer: Linear (no activation)
The linear output maps the learned features directly to a continuous value.

🔚 Summary Table
LAYER TYPE | COMMON ACTIVATION FUNCTIONS | PURPOSE
Input | None | Pass data
Hidden | ReLU, Tanh, Sigmoid | Add non-linearity
Output | Sigmoid / Softmax / Linear | Task-specific output

Vanishing gradient problem


The vanishing gradient problem happens when gradients—used to update weights during
backpropagation—become extremely small as they move backward through a deep neural
network. This causes the early layers to learn very slowly or not at all, effectively “freezing”
them.
🧠 Why It Happens
It’s especially common when using activation functions like sigmoid or tanh, whose
derivatives are less than 1. When you multiply many small numbers (gradients) across layers,
they shrink exponentially. Eventually, the gradient becomes so tiny that weight updates stop
making meaningful progress.
📉 A Simple Example
Imagine a 5-layer neural network using sigmoid activations. During backpropagation,
suppose the gradient at each layer is around 0.2:
Final gradient ≈ 0.2 × 0.2 × 0.2 × 0.2 × 0.2 = 0.00032
That’s a very small number—so the weights in the first layer barely change, even after many
epochs.
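The shrinkage in the example can be reproduced directly (0.2 is the same illustrative per-layer factor used above):

```python
# Gradient magnitude after backpropagating through n layers, when each
# layer multiplies the gradient by roughly 0.2 (a saturated sigmoid regime)
layer_factor = 0.2
for n in (1, 3, 5, 10):
    print(n, layer_factor ** n)  # the n = 5 case reproduces the 0.00032 above
```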
🧪 Real-World Analogy
Think of trying to shout a message through five thick walls. By the time it reaches the first
room, it’s just a whisper—if it’s heard at all.
How to Fix It
 Use ReLU or variants like Leaky ReLU, which don’t squash gradients.
 Apply batch normalization to stabilize activations.
 Use residual connections (like in ResNets) to help gradients flow backward more
easily.

Neurons within the same layer can technically use different activation
functions, though it's not common in standard practice.
Most neural networks apply the same activation function across all neurons in a layer for
simplicity, efficiency, and compatibility with hardware acceleration. But there are scenarios
where mixing activations can be useful or even beneficial.
🧪 When Might You Use Mixed Activations?
1. Heterogeneous Feature Processing
If different neurons are responsible for processing different types of input features,
using varied activations might help. For example:
o Some neurons might process categorical embeddings (where tanh or sigmoid
could help).
o Others might handle continuous signals (where ReLU or Leaky ReLU might
be better).
2. Neuroevolution or AutoML
In evolutionary algorithms or automated architecture search, the system might
discover that mixing activations improves performance. These methods don’t assume
uniformity and can evolve diverse neuron behaviors.
3. Research Experiments
Some experimental architectures—like multi-activation networks—intentionally
assign different activations to different neurons to increase expressiveness. For
instance, a paper might explore combining ReLU, tanh, and ELU in the same layer to
capture a broader range of nonlinearities.
🧠 Example (Conceptual)
Imagine a hidden layer with 6 neurons:
 Neurons 1–3 use ReLU to capture sparse, high-magnitude patterns.
 Neurons 4–6 use tanh to capture smooth, bounded variations.
This hybrid layer might be useful in a model that processes both financial time series (which
can spike) and sensor data (which is smoother).
⚠️Trade-offs
 Harder to optimize: Backpropagation becomes more complex.
 Less efficient: GPU acceleration is optimized for uniform operations.
 Harder to interpret: Debugging and tuning become trickier.

Comprehensive Analysis of Neural Network Training


Introduction to Network Training
Neural network training involves determining optimal weight parameters that minimize an
error function measuring the discrepancy between network outputs and target values. This
process is fundamentally an optimization problem in high-dimensional weight space.
Error Functions and Probabilistic Interpretation
For regression tasks with single target variable t ∈ ℝ:
Regression Problems

 Assume Gaussian distribution: p(t|x,w) = 𝒩(t|y(x,w), β⁻¹)


 Network output y(x,w) serves as mean of Gaussian distribution
 β is precision (inverse variance) of noise
N.B:
‘t’ is target, ‘x’ is input and ‘w’ is the weight into the neuron, y is the predicted output
of the network.

Likelihood function for N i.i.d. (independent and identically distributed) observations:

p(t|X,w,β) = ∏ₙ₌₁ᴺ p(tₙ|xₙ,w,β)

Negative log-likelihood (error function):

E(w,β) = β/2 ∑ₙ₌₁ᴺ (y(xₙ,w)-tₙ)² - N/2 lnβ + N/2 ln(2π)

[Figure in the original: E(w, β) plotted against w — different values of w give different errors, and the maximum-likelihood solution is the value of w that minimizes E(w, β).]
Maximum likelihood solution:


1. For weights w: Minimize sum-of-squares error

E(w) = 1/2 ∑ₙ₌₁ᴺ (y(xₙ,w)-tₙ)²


2. For precision β:

1/β_ML = 1/N ∑ₙ₌₁ᴺ (y(xₙ,w_ML)-tₙ)²


N.B- Derivations given above for ML

Multiple target variables (K outputs):


 Assume independent Gaussian noise with shared precision
 Error function:

E(w) = 1/2 ∑ₙ₌₁ᴺ ∥y(xₙ,w)-tₙ∥²

 Precision estimate:

1/β_ML = 1/(NK) ∑ₙ₌₁ᴺ ∥y(xₙ,w_ML)-tₙ∥²

Derivation:
Let's carefully derive the maximum likelihood solution for the case with multiple target
variables (K outputs), where we assume:

Binary Classification
For a single binary target (t ∈ {0,1}):

 Single output with logistic sigmoid:

y = σ(a) = 1/(1+exp(-a))
 Interpret y(x,w) as p(C₁|x) and 1-y(x,w) as p(C₂|x)
 Bernoulli distribution: p(t|x,w) = y(x,w)ᵗ (1-y(x,w))¹⁻ᵗ
Cross-entropy error function:

E(w) = -∑ₙ₌₁ᴺ [tₙ ln yₙ + (1-tₙ) ln(1-yₙ)]


where yₙ = y(xₙ,w)

Multiple binary classifications (K independent binary outputs):


 K outputs, each with logistic sigmoid
 Error function:

E(w) = -∑ₙ₌₁ᴺ ∑ₖ₌₁ᴷ [tₙₖ ln yₙₖ + (1-tₙₖ) ln(1-yₙₖ)]

Let's derive the cross-entropy error function for binary classification, starting from first
principles. This is a fundamental result in neural networks and logistic regression.

Multiclass Classification
For K mutually exclusive classes with 1-of-K coding:
 Output activation: softmax function

yₖ(x,w) = exp(aₖ(x,w))/∑ⱼ exp(aⱼ(x,w))

 Error function (multiclass cross-entropy):

E(w) = -∑ₙ₌₁ᴺ ∑ₖ₌₁ᴷ tₖₙ ln yₖ(xₙ,w)

Key property for all cases:
∂E/∂aₖ = yₖ - tₖ
This consistent form simplifies error backpropagation.
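The property ∂E/∂aₖ = yₖ − tₖ can be checked numerically by comparing the analytic gradient against finite differences; the logits and target below are arbitrary illustrative values:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(a, t):
    """Multiclass cross-entropy as a function of the logits a."""
    return -np.sum(t * np.log(softmax(a)))

a = np.array([2.0, -1.0, 0.5])       # logits
t = np.array([0.0, 1.0, 0.0])        # 1-of-K target

analytic = softmax(a) - t            # the claimed gradient, y - t

eps = 1e-6                           # central finite differences
numeric = np.array([
    (cross_entropy(a + eps * np.eye(3)[k], t)
     - cross_entropy(a - eps * np.eye(3)[k], t)) / (2 * eps)
    for k in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```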
N.B.
In the context of neural networks, logits are the raw, unnormalized output values produced by
the final layer of the model—before applying an activation function like SoftMax.
The notation aₖ(x, w) typically refers to the activation (or output) of the k-th neuron in the
final layer, given input x and weights w. When no activation function is applied yet, this is
the logit for class k.
🔍 Breaking it down:
 x: input features
 w: weights of the model
 aₖ(x, w): the score (logit) assigned to class k
These logits are then passed through SoftMax to convert them into probabilities:
SoftMax(aₖ) = exp(aₖ) / ∑ⱼ exp(aⱼ)
So, logits are like the model’s raw opinions about each class—SoftMax turns those opinions
into probabilities we can interpret.
Derivation
Let's derive the multiclass cross-entropy error function step-by-step for a neural network
with K mutually exclusive classes, using 1-of-K encoding for the targets tₖ.

N.B 1)

Special Case: Networks with Hidden Layers
If the network has hidden layers:
 w includes all weights (input-to-hidden and hidden-to-output).
 wₖ refers only to the final-layer weights connecting the last hidden layer to the k-th
output.

Why This Matters

1. Efficiency: Updates to wₖ are localized to class k, avoiding unnecessary
computations.
2. Interpretability: Each wₖ learns features discriminative for class k.
3. Modularity: In frameworks like PyTorch/TensorFlow, wₖ is accessed via the
output layer’s weight matrix (e.g., [Link][:, k]).

Summary
 w: all network parameters (global scope).
 wₖ: weights specific to the k-th output class (local scope).
 Gradient updates target wₖ independently for each class, leveraging the structure of
the softmax/cross-entropy loss.
N.B 2)

Parameter Optimization
Geometrical View of Error Function
The error function E(w) can be visualized as a surface over weight space (Figure 5.5):

 w_A: a local minimum
 w_B: the global minimum
 w_C: an arbitrary point with gradient ∇E

At any minimum w, the gradient vanishes: ∇E(w) = 0

Local Quadratic Approximation


Taylor expansion around a point ŵ:

E(w) ≈ E(ŵ) + (w-ŵ)ᵀb + 1/2 (w-ŵ)ᵀH(w-ŵ)

where:
b = ∇E|_w=ŵ (gradient vector)
H = ∇∇E|_w=ŵ (Hessian matrix)

At a minimum w* (where ∇E = 0):

E(w) ≈ E(w*) + 1/2 (w-w*)ᵀH(w-w*)

Eigenvalue analysis of Hessian H:


 Eigenvectors uᵢ: H uᵢ = λᵢ uᵢ
 Orthonormal: uᵢᵀuⱼ = δᵢⱼ
 Express w-w* = ∑ᵢ αᵢ uᵢ
Then error becomes:

E(w) = E(w*) + 1/2 ∑ᵢ λᵢ αᵢ²

Positive definite Hessian at minimum:


 vᵀHv > 0 for all v ≠ 0
 All eigenvalues λᵢ > 0
 Contours of constant E are ellipses aligned with eigenvectors (Figure 5.6)

N.B

When a matrix is symmetric and positive definite, its eigenvectors are guaranteed to be
orthogonal (and orthonormal if normalized).
A matrix is positive definite when it’s symmetric and its quadratic form is always strictly
positive—except at the origin. Here’s what that means in more detail:
✅ Formal Definition
A real symmetric matrix A ∈ ℝⁿˣⁿ is positive definite if:
xᵀAx > 0 for all x ≠ 0
This says that no matter what nonzero vector x you choose, the result of this quadratic
expression is always greater than zero.
🔁 Equivalent Ways to Tell
A matrix A is positive definite if any of these are true:

30
1. All eigenvalues of A are positive
2. All leading principal minors are positive (i.e., the determinants of the top-left k×k
submatrices for k = 1 to n)
3. Cholesky decomposition is possible: A = LLᵀ, where L is a lower triangular matrix
with positive diagonal entries
🧠 Geometric Interpretation
The matrix defines a bowl-shaped surface. So, in optimization, if the Hessian (second
derivative matrix) of your loss function is positive definite, you’re sitting in a local (and
actually global) minimum with nice curvature—perfect for gradient-based methods.
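The equivalent checks listed above can be run directly in NumPy; the matrix H below is an arbitrary symmetric example:

```python
import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # symmetric 2x2 "Hessian"

# Check 1: all eigenvalues positive
eigvals = np.linalg.eigvalsh(H)
print(np.all(eigvals > 0))           # True -> positive definite

# Check 3: Cholesky factorization succeeds only for positive definite matrices
L = np.linalg.cholesky(H)
print(np.allclose(L @ L.T, H))       # True: A = L L^T
```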

Calculus condition for minima

Optimization Algorithms
Gradient Descent

Basic weight update rule:

w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ - η∇E(w⁽ᵗ⁾)

where η > 0 is the learning rate.

Batch methods:
 Use the entire training set to compute ∇E
 Simple but computationally expensive
 Poor performance in practice due to a zig-zag path

On-line (stochastic) gradient descent:

w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ - η∇Eₙ(w⁽ᵗ⁾)

 Update based on single data points or mini-batches
 Advantages:
o Handles redundant data efficiently
o Can escape local minima
o Suitable for large datasets

More advanced batch methods:
 Conjugate gradients
 Quasi-Newton methods
 Generally faster convergence than simple gradient descent

N.B.

Computational Efficiency
Without gradient information:
 Need O(W²) function evaluations
 Each evaluation costs O(W)
 Total cost: O(W³)
With gradient information (via backpropagation):
 Each gradient evaluation provides W pieces of information
 O(W) gradient evaluations needed
 Each gradient evaluation costs O(W)
 Total cost: O(W²)

Practical Considerations
Multiple Minima
Neural network error functions typically have:
 Multiple inequivalent minima (local and global)
 Families of equivalent minima due to weight space symmetries
o For M hidden units: M! 2ᴹ equivalent weight vectors
In practice:
 Global minimum often not required
 Compare performance of different local minima on validation set
 Multiple random initializations help find good solutions
Implementation Example
Consider a simple regression problem with:
 Single input x, single output y
 Training set: {(xₙ,tₙ)} for n=1,...,N
 Network with 1 hidden unit (for simplicity)
Forward pass:
a = w₁₁⁽¹⁾x + w₁₀⁽¹⁾
z = h(a)
y = w₁₁⁽²⁾z + w₁₀⁽²⁾
Sum-of-squares error:
E(w) = 1/2 ∑ₙ (y(xₙ,w)-tₙ)²
Gradient descent update:
For each parameter wᵢ:
wᵢ ← wᵢ - η ∂E/∂wᵢ
Where derivatives can be computed via backpropagation.
Numerical Example
Suppose we have:
 Single data point (x,t) = (1.0, 0.5)
 Current weights: w₁₁⁽¹⁾=0.3, w₁₀⁽¹⁾=-0.2, w₁₁⁽²⁾=0.4, w₁₀⁽²⁾=0.1
 Activation h(a) = tanh(a)
 Learning rate η = 0.1
Forward pass:
a = 0.3*1.0 + (-0.2) = 0.1 (a = w₁₁⁽¹⁾x + w₁₀⁽¹⁾)
z = tanh(0.1) ≈ 0.0997 (z = tanh(a))
y = 0.4*0.0997 + 0.1 ≈ 0.1399 (y = w₁₁⁽²⁾z + w₁₀⁽²⁾)
Error:
E = 1/2 (0.1399 - 0.5)² ≈ 0.0648
Backward pass (computing gradients):
∂E/∂y = y - t = 0.1399 - 0.5 = -0.3601
∂E/∂w₁₁⁽²⁾ = ∂E/∂y * ∂y/∂w₁₁⁽²⁾ = -0.3601 * z ≈ -0.0359
∂E/∂w₁₀⁽²⁾ = ∂E/∂y * ∂y/∂w₁₀⁽²⁾ = -0.3601 * 1 = -0.3601
∂E/∂z = ∂E/∂y * ∂y/∂z = -0.3601 * 0.4 ≈ -0.1440
∂z/∂a = sech²(0.1) ≈ 0.9901
∂E/∂a = ∂E/∂z * ∂z/∂a ≈ -0.1440 * 0.9901 ≈ -0.1426
∂E/∂w₁₁⁽¹⁾ = ∂E/∂a * ∂a/∂w₁₁⁽¹⁾ ≈ -0.1426 * 1.0 = -0.1426
∂E/∂w₁₀⁽¹⁾ = ∂E/∂a * ∂a/∂w₁₀⁽¹⁾ ≈ -0.1426 * 1 = -0.1426
Weight updates:
w₁₁⁽²⁾ ← 0.4 - 0.1*(-0.0359) ≈ 0.4036
w₁₀⁽²⁾ ← 0.1 - 0.1*(-0.3601) ≈ 0.1360

w₁₁⁽¹⁾ ← 0.3 - 0.1*(-0.1426) ≈ 0.3143
w₁₀⁽¹⁾ ← -0.2 - 0.1*(-0.1426) ≈ -0.1857
This demonstrates a single step of gradient descent for a simple network.
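The step above can be reproduced in a few lines. A sketch using the text's values (variable names are my own):

```python
import math

# 1-input / 1-tanh-hidden / 1-linear-output network, single data point.
x, t = 1.0, 0.5
w11_1, w10_1 = 0.3, -0.2   # first-layer weight and bias
w11_2, w10_2 = 0.4, 0.1    # second-layer weight and bias
eta = 0.1

# Forward pass
a = w11_1 * x + w10_1          # 0.1
z = math.tanh(a)               # ~0.0997
y = w11_2 * z + w10_2          # ~0.1399
E = 0.5 * (y - t) ** 2         # ~0.0648

# Backward pass (tanh'(a) = 1 - tanh(a)^2)
dE_dy = y - t                  # ~-0.3601
dE_dw11_2 = dE_dy * z
dE_dw10_2 = dE_dy
dE_da = dE_dy * w11_2 * (1.0 - z ** 2)
dE_dw11_1 = dE_da * x
dE_dw10_1 = dE_da

# Gradient descent updates
w11_2 -= eta * dE_dw11_2       # ~0.4036
w10_2 -= eta * dE_dw10_2       # ~0.1360
w11_1 -= eta * dE_dw11_1       # ~0.3143
w10_1 -= eta * dE_dw10_1       # ~-0.1857
print(round(E, 4), round(w11_1, 4), round(w11_2, 4))
```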
Conclusion
Neural network training involves:
1. Choosing appropriate error function based on problem type
o Regression: sum-of-squares
o Classification: cross-entropy
2. Efficient gradient computation via backpropagation
3. Iterative optimization using gradient information
o Batch or on-line approaches
o Advanced optimization methods often preferred
4. Handling multiple minima through random initializations
The probabilistic interpretation provides a principled framework for selecting error functions
and output activations, while efficient gradient computation enables practical training of large
networks.
Detailed Explanation of Error Backpropagation in Neural Networks
1. Introduction to Backpropagation
Backpropagation is an efficient technique for evaluating the gradient of an error function
E(w) for a feed-forward neural network. It implements a local message passing scheme where
information flows alternately forwards and backwards through the network.
Key points about terminology:
 The term "backpropagation" is sometimes used to refer to:
o The multilayer perceptron architecture itself (called a backpropagation
network)
o The training process using gradient descent on a sum-of-squares error function
 More precisely, backpropagation specifically refers to the evaluation of derivatives
The training process involves two distinct stages:
1. Evaluation of derivatives of the error function with respect to weights (this is
backpropagation proper)
2. Use of these derivatives to compute weight adjustments (can use various optimization
methods)
2. General Derivation of Backpropagation
2.1 Network Structure and Forward Propagation

2.2 Error Function Derivatives

2.3 Output Unit Errors

2.4 Hidden Unit Errors (Backpropagation Proper)

2.5 Backpropagation Algorithm Summary

3. A Concrete Example

3.1 Forward Propagation Equations


For each pattern:

3.2 Backward Propagation
1. Output unit errors:
δₖ = yₖ - tₖ
2. Hidden unit errors:
δⱼ = (1 - zⱼ²) Σₖ₌₁ᴷ wₖⱼ δₖ
3. Derivatives:
∂Eₙ/∂wⱼᵢ⁽¹⁾ = δⱼ xᵢ

4. Computational Efficiency
Backpropagation is efficient because:
 Forward pass: O(W) operations
 Backward pass: O(W) operations
 Total: O(W) operations per pattern
Compare with finite differences:
 Central differences approach:
∂Eₙ/∂wⱼᵢ = [Eₙ(wⱼᵢ + ε) − Eₙ(wⱼᵢ − ε)] / (2ε) + O(ε²)
 Requires O(W²) operations since each weight must be perturbed individually
Numerical differentiation is useful for verifying implementations but too slow for actual
training.
Numerical Example
Let's work through a concrete numerical example for a small network.
Network Architecture:
 2 input units (x₁, x₂)
 2 hidden units (tanh activation)
 1 output unit (linear activation)
Weights:
First layer (input to hidden):
 w₁₁ = 0.5, w₁₂ = -0.5
 w₂₁ = 0.3, w₂₂ = 0.8
Second layer (hidden to output):
 w₁ = 1.0, w₂ = -0.6
Input and Target:
 Input: x₁ = 1.0, x₂ = 0.5
 Target: t = 0.7
Forward Pass:
1. Hidden unit activations:
a₁ = w₁₁x₁ + w₁₂x₂ = 0.5*1.0 + (-0.5)*0.5 = 0.25
a₂ = w₂₁x₁ + w₂₂x₂ = 0.3*1.0 + 0.8*0.5 = 0.7

2. Hidden unit outputs:
z₁ = tanh(0.25) ≈ 0.2449
z₂ = tanh(0.7) ≈ 0.6044
3. Output activation:
y = w₁z₁ + w₂z₂ = 1.0*0.2449 + (-0.6)*0.6044 ≈ -0.1177
Error Calculation:
E = ½(y - t)² = ½(-0.1177 - 0.7)² ≈ ½(-0.8177)² ≈ 0.3343
Backward Pass:
1. Output error:
δ = y - t ≈ -0.8177
2. Hidden unit errors:
δ₁ = (1 - z₁²) * w₁ * δ = (1 - 0.2449²)*1.0*(-0.8177) ≈ 0.94*(-0.8177) ≈ -0.7686
δ₂ = (1 - z₂²) * w₂ * δ = (1 - 0.6044²)*(-0.6)*(-0.8177) ≈ 0.6347*0.4906 ≈ 0.3114
3. Derivatives:
∂E/∂w₁ = δ*z₁ ≈ -0.8177*0.2449 ≈ -0.2002
∂E/∂w₂ = δ*z₂ ≈ -0.8177*0.6044 ≈ -0.4942
∂E/∂w₁₁ = δ₁*x₁ ≈ -0.7686*1.0 ≈ -0.7686
∂E/∂w₁₂ = δ₁*x₂ ≈ -0.7686*0.5 ≈ -0.3843
∂E/∂w₂₁ = δ₂x₁ ≈ 0.3114*1.0 ≈ 0.3114
∂E/∂w₂₂ = δ₂x₂ ≈ 0.3114*0.5 ≈ 0.1557
This example demonstrates how the errors propagate backward through the network to
compute all required derivatives efficiently.
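The worked gradients above, combined with the central-difference check from Section 4, can be sketched as follows (using the text's weights and data; the tolerance and ε are assumptions):

```python
import math

# 2 inputs -> 2 tanh hidden units -> 1 linear output, single data point.
x1, x2, t = 1.0, 0.5, 0.7
W1 = {'w11': 0.5, 'w12': -0.5, 'w21': 0.3, 'w22': 0.8}   # input -> hidden
W2 = {'w1': 1.0, 'w2': -0.6}                             # hidden -> output

def forward(W1, W2):
    a1 = W1['w11'] * x1 + W1['w12'] * x2
    a2 = W1['w21'] * x1 + W1['w22'] * x2
    z1, z2 = math.tanh(a1), math.tanh(a2)
    y = W2['w1'] * z1 + W2['w2'] * z2
    return a1, a2, z1, z2, y

def error(W1, W2):
    return 0.5 * (forward(W1, W2)[4] - t) ** 2

# Backward pass (exact derivatives via backpropagation)
a1, a2, z1, z2, y = forward(W1, W2)
delta = y - t                                   # ~-0.8177
d1 = (1 - z1 ** 2) * W2['w1'] * delta           # ~-0.7686
d2 = (1 - z2 ** 2) * W2['w2'] * delta           # ~0.3114
grads = {'w1': delta * z1, 'w2': delta * z2,
         'w11': d1 * x1, 'w12': d1 * x2, 'w21': d2 * x1, 'w22': d2 * x2}

# Central-difference check: perturb one weight at a time (O(W^2) total cost)
eps = 1e-6
for name in grads:
    W1p, W2p = dict(W1), dict(W2)
    layer = W2p if name in W2p else W1p
    layer[name] += eps
    E_plus = error(W1p, W2p)
    layer[name] -= 2 * eps
    E_minus = error(W1p, W2p)
    numeric = (E_plus - E_minus) / (2 * eps)
    assert abs(numeric - grads[name]) < 1e-6, name

print({k: round(v, 4) for k, v in grads.items()})
```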

Key Observations:
1. Direction of Updates:
o Weights with negative gradients (∂E/∂w < 0) increased
o Weights with positive gradients (∂E/∂w > 0) decreased
2. Impactful Updates:
o w₁₁ (input→1st hidden) changed most significantly (+15.4%)
o w₂ (2nd hidden→output) saw the second largest change (+8.2%)
3. Error Reduction:
o The update successfully reduced the error
o Further iterations would continue to minimize the error
This demonstrates how backpropagation provides the exact derivatives needed to efficiently
update weights in the direction that reduces the error.

5. Jacobian Matrix Computation

5.1 Backpropagation for Jacobian
1. First write:
J_ki = ∂yₖ/∂xᵢ = Σⱼ (∂yₖ/∂aⱼ)(∂aⱼ/∂xᵢ) = Σⱼ wⱼᵢ (∂yₖ/∂aⱼ)
2. Recursive backpropagation formula:
∂yₖ/∂aⱼ = h′(aⱼ) Σₗ wₗⱼ (∂yₖ/∂aₗ)
3. For output units:
 Sigmoidal outputs:
∂yₖ/∂aⱼ = δₖⱼ σ′(aⱼ)
 Softmax outputs:
∂yₖ/∂aⱼ = δₖⱼ yₖ − yₖ yⱼ
5.2 Jacobian Computation Procedure

From above:
∂aₗ/∂aⱼ = wₗⱼ h′(aⱼ)
and therefore
∂yₖ/∂aⱼ = Σₗ (∂yₖ/∂aₗ) wₗⱼ h′(aⱼ) = h′(aⱼ) Σₗ wₗⱼ (∂yₖ/∂aₗ)

5. Complete Algorithm

J_ki = Σⱼ wⱼᵢ (∂yₖ/∂aⱼ)
Example Computation

(Network: inputs x₁ and x₂ feed a single tanh hidden unit with pre-activation a₁ and output z₁,
which feeds the linear output y.)
∂y/∂x₁ = (∂y/∂a₁)·(∂a₁/∂x₁) = w₁₁ w₂₁ sech²(a₁)

Similarly,
∂y/∂x₂ = (∂y/∂a₁)·(∂a₁/∂x₂) = w₁₂ w₂₁ sech²(a₁)

🧠 1. What is Backpropagation?
Backpropagation is the algorithm used to compute gradients of the loss function with
respect to the weights in a neural network. These gradients are used to update weights using
gradient descent.
For a network with input x, weights W, output y, and loss L, we want ∂L/∂W.

🔄 2. What is the Jacobian?

🔗 3. Connection Between Backpropagation and Jacobian


Backpropagation uses the chain rule, and the Jacobian is a formal way of expressing the
chain rule for vector-valued functions.

So, backpropagation is just repeatedly applying Jacobian-based chain rule from output
layer back to each weight.

🧮 4. Why Calculate ∂y/∂x?

🔧 5. How Weights Are Updated

✅ Summary:

Numerical example showing use of the Jacobian in calculating the error gradient w.r.t. inner
weights
Let's walk through a numerical example and use the Jacobian to compute error gradients
for a simple neural network with:
 2 input neurons
 2 hidden neurons (with sigmoid activation)
 1 output neuron (with linear activation for simplicity)
We'll show how to compute the Jacobian and use it to find the gradient of the loss w.r.t. the
weights.
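A minimal sketch of that setup (the weight values below are illustrative assumptions): the Jacobian ∂y/∂x is computed by the chain rule, checked by central differences, and then chained with ∂E/∂y to give gradients for the inputs and the inner (first-layer) weights.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

W1 = [[0.4, -0.2],   # hidden unit 1 input weights
      [0.1, 0.5]]    # hidden unit 2 input weights
W2 = [0.7, -0.3]     # hidden -> output weights

def net(x):
    a = [sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
    z = [sigmoid(aj) for aj in a]
    y = sum(W2[j] * z[j] for j in range(2))
    return a, z, y

x = [1.0, 0.5]
a, z, y = net(x)

# Jacobian: dy/dx_i = sum_j w2_j * sigmoid'(a_j) * w1_ji,
# with sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a)).
J = [sum(W2[j] * z[j] * (1 - z[j]) * W1[j][i] for j in range(2)) for i in range(2)]

# Central-difference check of the Jacobian
eps = 1e-6
for i in range(2):
    xp = list(x); xp[i] += eps
    xm = list(x); xm[i] -= eps
    numeric = (net(xp)[2] - net(xm)[2]) / (2 * eps)
    assert abs(numeric - J[i]) < 1e-8

# Chain with dE/dy = (y - t) for a sum-of-squares error:
t = 1.0
dE_dx = [(y - t) * J[i] for i in range(2)]
# Error gradient w.r.t. an inner weight, via the same chain:
# dE/dW1[j][i] = (y - t) * w2_j * sigmoid'(a_j) * x_i
dE_dW1 = [[(y - t) * W2[j] * z[j] * (1 - z[j]) * x[i] for i in range(2)]
          for j in range(2)]

print([round(v, 4) for v in J], [round(v, 4) for v in dE_dx])
```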

7. Conclusion
Backpropagation provides an efficient O(W) method for computing error function derivatives
in feed-forward neural networks. The key steps are:
1. Forward propagation of inputs
2. Backward propagation of errors using the chain rule
3. Accumulation of derivatives
The algorithm can be generalized to compute other derivatives like the Jacobian matrix and
can be verified using numerical differentiation methods. Its efficiency compared to finite
difference methods makes it essential for training practical neural networks.

Numerical on Hessian
Numerical Example: Hessian Matrix in Backpropagation
Let's demonstrate the application of the Hessian matrix through a complete numerical
example with a 2-layer neural network.
Network Architecture
 Inputs: x₁ = 1.0, x₂ = 0.5
 Hidden layer: 2 tanh units
 Output: 1 linear unit
 Target: t = 0.7
Weights
First Layer (input → hidden):
w₁₁ = 0.5, w₁₂ = -0.5
w₂₁ = 0.3, w₂₂ = 0.8
Second Layer (hidden → output):
w₁ = 1.0, w₂ = -0.6
Step 1: Forward Pass
1. Hidden layer activations:
a₁ = 0.5*1.0 + (-0.5)*0.5 = 0.25
a₂ = 0.3*1.0 + 0.8*0.5 = 0.7
2. Hidden layer outputs (tanh):
z₁ = tanh(0.25) ≈ 0.2449

z₂ = tanh(0.7) ≈ 0.6044
3. Output:
y = 1.0*0.2449 + (-0.6)*0.6044 ≈ -0.1177
4. Error (MSE):
E = ½(-0.1177 - 0.7)² ≈ 0.3343
Step 2: First Derivatives (Standard Backpropagation)
1. Output error:
δ = y - t ≈ -0.8177
2. Hidden layer errors:
δ₁ = (1 - 0.2449²)*1.0*(-0.8177) ≈ -0.7686
δ₂ = (1 - 0.6044²)*(-0.6)*(-0.8177) ≈ 0.3114
3. Weight gradients:
∂E/∂w₁ = δ*z₁ ≈ -0.2002
∂E/∂w₂ = δ*z₂ ≈ -0.4942
∂E/∂w₁₁ = δ₁*x₁ ≈ -0.7686
∂E/∂w₁₂ = δ₁*x₂ ≈ -0.3843
∂E/∂w₂₁ = δ₂*x₁ ≈ 0.3114
∂E/∂w₂₂ = δ₂*x₂ ≈ 0.1557
Step 3: Hessian Computation (Exact Method)
A) Second-Layer Weight Hessian (w₁ and w₂)
For linear output with MSE:
M = 1 (since ∂²E/∂y² = 1)
∂²E/∂w₁² = z₁² * M ≈ 0.2449² * 1 ≈ 0.0600
∂²E/∂w₂² = z₂² * M ≈ 0.6044² * 1 ≈ 0.3653
∂²E/∂w₁∂w₂ = z₁*z₂ * M ≈ 0.2449*0.6044 ≈ 0.1480
B) First-Layer Weight Hessian (w₁₁)
Using the exact formula:
Term 1: h''(a₁)·w₁·δ = (-0.4604)*1.0*(-0.8177) ≈ 0.3765
Term 2: h'(a₁)²·w₁² = (0.94)²*1.0 ≈ 0.8836
∂²E/∂w₁₁² = x₁²*(Term1 + Term2) = 1*(0.3765 + 0.8836) ≈ 1.2601
Where:
h'(a₁) = 1 - tanh²(0.25) ≈ 0.94
h''(a₁) = -2 tanh(0.25)(1 - tanh²(0.25)) ≈ -0.4604
C) Cross-Layer Hessian (w₁₁ and w₁)
∂²E/∂w₁₁∂w₁ = x₁*h'(a₁)*[δ + z₁*w₁*M]
= 1.0*0.94*[-0.8177 + 0.2449*1.0*1]
≈ 0.94*(-0.5728) ≈ -0.5384

Step 4: Hessian Application - Weight Update with Newton's Method
Newton's update rule:
Δw = -H⁻¹ * ∇E
For just w₁ and w₂ (2nd-layer weights):
Hessian submatrix:
H = [ 0.0600 0.1480
      0.1480 0.3653 ]
Gradient vector:
∇E = [-0.2002, -0.4942]ᵀ
Inverse Hessian:
H⁻¹ ≈ [ 42.1223 -17.0699
        -17.0699 6.9200 ]
Weight update (with η=1):
Δw = -H⁻¹ * ∇E ≈ - [ 42.1223*(-0.2002) + (-17.0699)*(-0.4942)
                     -17.0699*(-0.2002) + 6.9200*(-0.4942) ]
≈ - [ -8.4289 + 8.4369
       3.4174 - 3.4209 ]
≈ [ -0.0080, 0.0035 ]ᵀ
New weights:
w₁_new = 1.0 - 0.0080 ≈ 0.9920
w₂_new = -0.6 + 0.0035 ≈ -0.5965

Verification
New output:
y_new = 0.9920*0.2449 + (-0.5965)*0.6044 ≈ -0.1176
New error:
E_new = ½(-0.1176 - 0.7)² ≈ 0.3342
Note: The error barely moves because this second-layer Hessian is exactly singular: for a
single data point with a linear output it equals the rank-1 matrix zzᵀ, so the tabulated inverse
is dominated by rounding error and the Newton step is ill-conditioned. Using the full Hessian
over all weights, or a damped inverse (H + εI)⁻¹, gives better-behaved updates.
N.B.: In this example the Hessian has been used to calculate the weight change between the
hidden and output layers only. For a complete application see below.

For derivation of H
See below

Key Observations:
1. The Hessian provides curvature information that standard backpropagation (1st
derivatives) doesn't
2. Newton's method can converge faster than gradient descent
3. Exact Hessian computation is expensive (O(W²))
4. In practice, approximations (diagonal Hessian, BFGS) are often used
5. The example shows how second derivatives influence weight updates differently than
first derivatives alone
This demonstrates how the Hessian matrix provides additional information about the error
surface curvature that can be used to make more informed weight updates during neural
network training.
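The second-layer Newton step above can be reproduced numerically. Because the single-pattern Hessian zzᵀ is rank-1 and singular, this sketch uses a damped inverse (H + εI)⁻¹, which is an assumption beyond the text's plain Newton step; the damping value eps is illustrative.

```python
import numpy as np

z = np.array([0.2449, 0.6044])      # hidden outputs from the example
g = np.array([-0.2002, -0.4942])    # gradient w.r.t. w1, w2 from the example

# Exact second-layer Hessian for one pattern with a linear output: H = z z^T.
H = np.outer(z, z)                  # ~[[0.0600, 0.1480], [0.1480, 0.3653]]
rank = np.linalg.matrix_rank(H)     # 1: singular, so plain Newton is undefined

# Damped (Levenberg-style) Newton step: solve (H + eps*I) dw = -g.
eps = 1e-2
dw = -np.linalg.solve(H + eps * np.eye(2), g)
print(rank, dw)
```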

🔷 What is Regularization in Neural Networks?
Regularization is a technique used to prevent overfitting in neural networks. Overfitting
occurs when the model learns not just the general pattern in the training data but also the
noise and random fluctuations, causing poor performance on unseen data.
Regularization adds a penalty term to the loss function, discouraging the model from
learning overly complex or extreme weights.

🔸 Types of Regularization
Type            Description
L1 (Lasso)      Adds absolute value of weights: λ * Σ|w|
L2 (Ridge)      Adds squared value of weights: λ * Σw²
Dropout         Randomly drops units during training to reduce co-adaptation
Early Stopping  Stops training when validation error starts increasing

🔷 Mathematical Formulation

✅ Numerical Example (L2 Regularization)

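A minimal sketch of how an L2 penalty enters a single gradient-descent update; the weight, data gradient, η, and λ values below are illustrative assumptions:

```python
# Total error: E_total(w) = E(w) + (lam/2) * w^2, so its gradient
# gains an extra lam * w term.
w = 0.8          # current weight
dE_dw = -0.25    # gradient of the unregularized (data) error
eta = 0.1        # learning rate
lam = 0.5        # regularization strength lambda

dEtot_dw = dE_dw + lam * w

# Update. Rearranged: w <- w*(1 - eta*lam) - eta*dE_dw, i.e. the weight
# first "decays" toward zero by the factor (1 - eta*lam), then takes the
# ordinary data-gradient step.
w_new = w - eta * dEtot_dw
w_decay_form = w * (1 - eta * lam) - eta * dE_dw
print(w_new, w_decay_form)  # both forms agree
```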

🔷 Effect on Gradient Descent

🔷 Conclusion
Without Regularization            With Regularization
Learns exact patterns, noise too  Learns smoothed/general patterns
Can overfit training data         Reduces overfitting
Complex weight values             Simpler, smaller weights
Detailed Analysis of Regularization Techniques in Neural Networks
This comprehensive write-up covers all the regularization methods discussed in the provided
text, including mathematical derivations, examples, and numerical illustrations to enhance
understanding.
1. Introduction to Regularization in Neural Networks
Neural networks contain free parameters (weights and biases) that must be adjusted to
achieve good predictive performance. The number of hidden units M controls model
complexity:
 Too small M: Underfitting (high bias)

 Too large M: Overfitting (high variance)
Figure 5.9 illustrates this trade-off using sinusoidal regression examples with M=1, 3, and 10
hidden units. However, as shown in Figure 5.10, the generalization error isn't a simple
function of M due to local minima in the error function. The plot shows test-set error versus
number of hidden units with 30 random initializations per size, where the best validation
performance occurred at M=8.

What is Lambda (λ) in Regularization?


Lambda (λ) is a hyperparameter that controls the strength of regularization in machine
learning models, including neural networks. It determines how much the regularization term
influences the overall loss function during training.

N.B.
🧠 Why Scaling Matters in Regularization
In neural networks, regularization techniques like L1/L2 penalties, dropout, or noise
injection are used to prevent overfitting by discouraging overly complex models. Scaling, on
the other hand, refers to adjusting the input data or model parameters to observe how the
network's behavior changes. Here's how they intersect:
🔄 1. Scaling Inputs to Simulate Data Variability
 When you scale input features (e.g., multiply by a constant), you're effectively
simulating a change in the data distribution.
 This helps test how robust the model is to such changes.
 In some cases, adding Gaussian noise or scaling inputs is mathematically equivalent
to applying L2 regularization.
🧮 2. Scaling and the Loss Function
 Regularization modifies the loss function by adding a penalty term (like λ‖w‖² for L2).
 If you scale the input data, it can affect the magnitude of the gradients and,
consequently, how the regularization term influences learning.

 For example, scaling inputs up might require adjusting the regularization strength (λ)
to maintain the same effect on weight decay.
🧪 3. Scaling as a Diagnostic Tool
 Researchers sometimes scale data or weights to observe how sensitive the model is
to such perturbations.
 This can reveal whether the model has learned robust, generalizable patterns or is
overly reliant on specific input magnitudes.
🧰 Practical Example
Suppose you train a network on normalized data (mean 0, std 1), and then retrain it on data
scaled by a factor of 10. If the model performs worse without adjusting regularization, it
suggests that the regularization strength was tuned to the original scale.
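The practical example above can be sketched with ridge (L2) regression, where the closed form makes the effect easy to see. The toy data, λ, and scaling factor are illustrative assumptions: keeping λ fixed after scaling the inputs by 10 changes the fitted function, while rescaling λ by the square of the factor restores it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
t = X @ np.array([1.5, -0.8]) + 0.1 * rng.normal(size=20)

def ridge(X, t, lam):
    # Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T t
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

lam = 1.0
pred = X @ ridge(X, t, lam)

Xs = 10.0 * X                                    # inputs scaled by c = 10
pred_same_lam = Xs @ ridge(Xs, t, lam)           # same lambda: different fit
pred_scaled_lam = Xs @ ridge(Xs, t, lam * 100)   # lambda * c^2: same fit

print(np.abs(pred - pred_same_lam).max())    # noticeably nonzero
print(np.abs(pred - pred_scaled_lam).max())  # close to zero
```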

Explaining early stopping with a numerical example
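The idea can be sketched with an illustrative sequence of per-epoch validation errors; the numbers, patience value, and stopping rule below are assumptions, not taken from the text:

```python
# Early stopping: track the best validation error; stop after `patience`
# consecutive epochs without improvement and keep the best model.
val_errors = [0.90, 0.62, 0.45, 0.38, 0.35, 0.34, 0.36, 0.39, 0.44, 0.51]
patience = 2

best_error = float('inf')
best_epoch = -1
bad_epochs = 0
for epoch, err in enumerate(val_errors):
    if err < best_error:            # validation improved: remember this model
        best_error, best_epoch = err, epoch
        bad_epochs = 0
    else:                           # no improvement this epoch
        bad_epochs += 1
        if bad_epochs >= patience:  # stop; roll back to best_epoch's weights
            break

print(best_epoch, best_error)  # training halts once validation error rises
```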

🔷 What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN) is a special type of feedforward neural network
designed to process grid-like data, such as images. Instead of using fully connected layers,
CNNs use convolutional layers that apply filters to detect patterns (like edges, textures,
shapes) in local regions.

🔹 Why CNNs?
Images are high-dimensional, and connecting every pixel to every neuron (as in a regular
feedforward neural network) is inefficient. CNNs solve this by:
 Using local connections (via filters)
 Sharing weights (same filter across image)
 Reducing parameters
 Capturing spatial hierarchies

🔷 CNN vs. Feedforward Neural Network


Feature                  Feedforward NN   CNN
Layer type               Fully connected  Convolution + pooling layers
Input                    Vector (1D)      Image/matrix (2D/3D)
Weight sharing           ❌ No            ✅ Yes (filter shared)
Spatial structure used?  ❌ No            ✅ Yes
Parameters               High             Lower due to local connections
Use cases                Tabular data     Images, speech, video, etc.

(Figure: the filter W(m,n) is superimposed on the image at position X(i,j) to carry out the
elementwise multiplication and summation.)

Kernel is the filter weight

Let's elaborate on the mathematical formulation of CNNs from Section 5.2 and connect it
with the architectural understanding and examples we've already discussed (like edge
detection, pooling, and backpropagation).
🔷 5.1 Architecture (Conceptual Summary)
You're referring to a standard deep CNN architecture often used in image tasks like digit
recognition (e.g., MNIST).
🧠 Key Ideas:
Layer                  Role                           What It Learns
Convolutional Layer    Detects local patterns         Edges, textures
Subsampling (Pooling)  Provides invariance            Location-insensitive features
Multiple Layers        Build hierarchies of patterns  Corners, digits
Fully Connected Layer  Global reasoning               Final classification
Example in Digit Recognition:
 Layer 1: detects lines (vertical/horizontal)
 Layer 2: combines lines into curves/corners
 Layer 3: recognizes complete digits
Now let’s connect this to the mathematics.
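The core operation, superimpose the filter at each position, multiply elementwise, and sum, can be sketched directly. The image (a vertical step edge) and the 3×3 vertical-edge filter below are illustrative assumptions:

```python
import numpy as np

# A 4x5 image with a dark-to-bright vertical edge between columns 2 and 3.
X = np.array([[0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)

# A filter that responds to left-dark / right-bright vertical edges.
W = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]], dtype=float)

def conv2d_valid(X, W):
    # Valid (no padding) cross-correlation, as typically used in CNN layers:
    # S(i, j) = sum_{m,n} X(i+m, j+n) * W(m, n)
    kh, kw = W.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    S = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            S[i, j] = np.sum(X[i:i + kh, j:j + kw] * W)
    return S

S = conv2d_valid(X, W)
print(S)  # zero over flat regions, strong response where the filter straddles the edge
```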

🔷 1. What Is Soft Weight Sharing?
Soft weight sharing is a regularization technique where instead of forcing weights to be
exactly equal (as in hard sharing), we encourage them to cluster around a few values (e.g.,
multiple means) by assuming a prior distribution over them, typically a mixture of
Gaussians.

This allows weights to be "softly grouped", which reduces overfitting and helps in model
compression and generalization.
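A minimal sketch of the mixture-of-Gaussians prior: the regularizer −Σᵢ ln Σⱼ πⱼ N(wᵢ|μⱼ, σⱼ²) and the responsibilities that produce the soft grouping. The two clusters (means, variances, mixing coefficients) and the weight values are illustrative assumptions:

```python
import math

mu = [0.0, 1.0]           # cluster centers mu_j
var = [0.1, 0.1]          # cluster variances sigma_j^2
pi = [0.5, 0.5]           # mixing coefficients (sum to 1)
weights = [0.05, 0.9, 0.48]

def gauss(w, m, v):
    # Gaussian density N(w | m, v)
    return math.exp(-(w - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

# Regularizer: Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)
omega = -sum(math.log(sum(p * gauss(w, m, v) for p, m, v in zip(pi, mu, var)))
             for w in weights)

def responsibilities(w):
    # gamma_j(w_i) = pi_j N(w_i | mu_j, var_j) / sum_k pi_k N(w_i | mu_k, var_k)
    comps = [p * gauss(w, m, v) for p, m, v in zip(pi, mu, var)]
    total = sum(comps)
    return [c / total for c in comps]

for w in weights:
    print(w, [round(g, 3) for g in responsibilities(w)])
```

A weight near 0 is assigned almost wholly to cluster 0, one near 1 to cluster 1, and 0.48 is split between the two; this split is the "soft" grouping, so its gradient pulls it toward both centers in proportion to the responsibilities.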

📘 Soft Weight Sharing: Formula Summary Table
Step  Purpose                                                  Example / Application
1️⃣    Mixture prior on weights                                 Defines a prior over weights that encourages them to cluster around a few centers.
2️⃣    Total prior over all weights                             Needed to compute the regularization loss (negative log prior).
3️⃣    Regularizer (penalty term)                               Added to the total loss to penalize unstructured weights.
4️⃣    Responsibilities (posterior of cluster j for weight i)   Soft assignment: how much each weight belongs to a cluster.
5️⃣    Total loss (augmented error)                             Combines task loss (e.g., MSE) and regularization.
6️⃣    Gradient w.r.t. weight wᵢ                                Guides weight updates: task loss plus pulls toward cluster centers.
7️⃣    Gradient w.r.t. mean μⱼ                                  Updates the cluster center to better fit its assigned weights.
8️⃣    Gradient w.r.t. variance ηⱼ = log(σⱼ²)                   Updates the spread of each cluster; controls tightness of fit.
9️⃣    Gradient w.r.t. mixing coefficients (via logits ηⱼ)      Adjusts how many weights are expected in each cluster.
🔟    Mixing coefficient from softmax                          πⱼ = exp(ηⱼ) / Σₖ exp(ηₖ)

📌 Practical Example/Application
Scenario                Application
Model Compression       Weights are softly pulled to a few shared values → reduce memory via quantization.
Regularization          Prevents overfitting by restricting weights to fall into clusters instead of spreading freely.
Knowledge Distillation  Teacher networks can suggest priors (μⱼ, σⱼ²) to guide student weights.

Here’s a structured summary of regularization methods in neural networks, with math, examples, and key insights:
Regularization Methods in Neural Networks
Method                    Key Insight
Simple Weight Decay (L2)  Penalizes large weights; may unfairly treat scaled networks.
Grouped Weight Decay      Fixes the inconsistency by scaling λₖ with input/output transformations.
Early Stopping            Approximates weight decay without an explicit penalty.
Tangent Propagation       Encourages invariance to transformations (e.g., rotation).
Data Augmentation         Simulates infinite data; improves generalization.
Convolutional Networks    Built-in translation invariance; reduces parameters.
Soft Weight Sharing       Learns weight distributions instead of fixed decay.

Key Observations
1. Weight Decay: Simple L2 penalizes all weights equally; grouped L2 is transformation-aware.
2. Early Stopping: Implicit regularization via optimization dynamics.
3. Tangent Propagation: Explicitly enforces invariance (e.g., rotation).
4. Convolutional Nets: Achieve invariance architecturally (e.g., translation).
5. Soft Sharing: Dynamically adapts to weight distributions during training.
Each method balances bias-variance tradeoff differently. Grouped weight decay and convolutional nets are scaling-invariant, while early
stopping and data augmentation are optimization/data-based.