
Machine Learning (advanced) documentation

Pranav Misra (2022AM11527)

19th November 2024


Contents

1 Introduction to Design and Neural Networks
    1.1 Introduction to Design
    1.2 Parametric Design
    1.3 Rule-Based Design vs. Generative Design
        1.3.1 Rule-Based Design
        1.3.2 Generative Design
    1.4 Introduction to Neural Networks (NN)
        1.4.1 Overview
        1.4.2 Structure of Neural Networks
        1.4.3 Why Neural Networks?
        1.4.4 Learning and Representation
    1.5 Training of Neural Networks
        1.5.1 Core Steps in Training
        1.5.2 Key Concepts in Training
    1.6 Classification Tasks
        1.6.1 Overview
        1.6.2 Example: Image Classification with CIFAR-10
        1.6.3 Performance Evaluation
    1.7 Regression Tasks
        1.7.1 Overview
        1.7.2 Example: Olympics Winning Time Prediction
        1.7.3 Metrics for Evaluation
        1.7.4 Practical Considerations
    1.8 Mathematical Modelling of Neural Networks
    1.9 Optimization, Gradient Descent Algorithm, and Its Variants
        1.9.1 Variants of Gradient Descent
    1.10 How Do You Define Probabilities Over Data?
        1.10.1 Discrete Probability
        1.10.2 Continuous Probability
    1.11 Probabilistic Modeling Framework
        1.11.1 Unsupervised Setup
        1.11.2 Generative Modeling
        1.11.3 Discriminative Modeling
    1.12 Comparison: Generative vs. Discriminative
    1.13 Importance of Weight Initialization
    1.14 Techniques to Avoid Local Minima
        1.14.1 Random Initialization
        1.14.2 Xavier Initialization (Glorot Initialization)
        1.14.3 He Initialization (Kaiming Initialization)
        1.14.4 LeCun Initialization
    1.15 Layer-Specific Initializations
    1.16 Bias Initialization
    1.17 Regularization Techniques
    1.18 RMSProp
        1.18.1 What RMSProp Doesn't Solve
        1.18.2 How RMSProp and Weight Initialization Work Together
    1.19 RMSProp: Understanding Its Role in Optimization
        1.19.1 What RMSProp Does
        1.19.2 What RMSProp Doesn't Solve
        1.19.3 Does RMSProp Avoid Local Minima?
    1.20 Exponentially Decaying Moving Average of Squared Gradients

2 Data Analysis and Generative Algorithms
    2.1 Data Cleaning, Imputing, and Standardization
        2.1.1 Data Cleaning
        2.1.2 Data Imputation
        2.1.3 Standardization
    2.2 Introduction to Dimensional Reduction in Data Space
        2.2.1 Principal Component Analysis (PCA)
        2.2.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)
    2.3 Data Cleaning, Imputing, and Standardization
        2.3.1 Data Cleaning
        2.3.2 Data Imputation
        2.3.3 Standardization
    2.4 Design Datasets and Examples
        2.4.1 Cluster Analysis
        2.4.2 Optimization in Generative Design
    2.5 Dimensional Reduction in Data Space
        2.5.1 Principal Component Analysis (PCA)
        2.5.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)
    2.6 Determinantal Point Process (DPP) Matrix and Its Mathematical Background
        2.6.1 Mathematical Background of DPP
        2.6.2 Properties of DPP
        2.6.3 Applications
        2.6.4 Example: Constructing a DPP Matrix
        2.6.5 Connection to PCA
    2.7 Solved Example of Principal Component Analysis (PCA)
        2.7.1 Step 1: Center the Data
        2.7.2 Step 2: Compute the Covariance Matrix
        2.7.3 Step 3: Eigenvalue Decomposition
        2.7.4 Step 4: Project the Data onto the Principal Components
    2.8 Solved Example of Determinantal Point Process (DPP) Matrix Construction
        2.8.1 Step 1: Compute the Similarity Matrix S
        2.8.2 Step 2: Construct the Kernel Matrix K
        2.8.3 Step 3: Calculate the Probability of Selecting Subset A = {y1, y3}
        2.8.4 Step 4: Normalization for Subset Size Control
    2.9 Purpose of Sampling
        2.9.1 Efficiency
        2.9.2 Generalization
        2.9.3 Exploration
    2.10 Types of Sampling
        2.10.1 Random Sampling
        2.10.2 Stratified Sampling
        2.10.3 Systematic Sampling
        2.10.4 Cluster Sampling
        2.10.5 Importance Sampling
        2.10.6 Reservoir Sampling
    2.11 Theoretical Foundations
        2.11.1 Law of Large Numbers (LLN)
        2.11.2 Central Limit Theorem (CLT)
    2.12 Applications of Sampling
        2.12.1 Machine Learning
        2.12.2 Big Data
        2.12.3 Data Visualization
    2.13 Advanced Sampling in ML and Statistics
        2.13.1 Markov Chain Monte Carlo (MCMC)
        2.13.2 Determinantal Point Processes (DPP)
    2.14 Introduction to Generative Algorithms and their Novelties
    2.15 Definitions
        2.15.1 p-Novelty
        2.15.2 h-Novelty
    2.16 Applications of Novelty Measures
    2.17 Example Calculation
        2.17.1 p-Novelty Calculation
        2.17.2 h-Novelty Calculation
    2.18 Visualizing the Concept of Novelty
    2.19 Conclusion

3 Generative Algorithms and Variational Encoders
    3.1 Introduction
    3.2 Autoencoders (AE) and Variational Autoencoders (VAE)
        3.2.1 Variational Autoencoders (VAE)
        3.2.2 Generative Adversarial Networks (GANs)
    3.3 PaDGAN Architecture
        3.3.1 Parallelization of GAN
        3.3.2 Distributed Training Dynamics
    3.4 Optimization Dynamics
    3.5 Communication and Synchronization
        3.5.1 Parameter Averaging
        3.5.2 Gradient Synchronization
    3.6 Distributed GAN Stability and Challenges
        3.6.1 Communication Latency
        3.6.2 Asynchronous Updates
        3.6.3 Load Balancing
    3.7 Conclusion
Chapter 1

Introduction to Design and Neural Networks

1.1. Introduction to Design


Design is the process of conceptualizing and creating objects, systems, or solutions that address specific
needs or problems. It is both an art and a science, requiring creativity, technical knowledge, and problem-
solving skills. The design process typically involves:

1. Understanding User Needs: Identifying pain points or requirements.

2. Creative Exploration: Generating ideas and solutions.

3. Practical Implementation: Turning abstract concepts into functional, market-ready products.

Importance: Design shapes how users interact with products, impacting usability, aesthetics, and
value. For example, a smartphone’s sleek design not only appeals visually but also ensures ergonomic
comfort and intuitive functionality.

1.2. Parametric Design


Parametric design leverages parameters (variables) and rules to define the relationships among design
components. It allows designers to adjust specific inputs (e.g., dimensions, shapes, or materials), with
the entire design dynamically updating to reflect these changes.

Key Characteristics
• Dynamic Adjustability: For example, in architectural design, altering a building's height parameter automatically adjusts related aspects like the façade, internal spaces, and structural loads.

• Constraint-Based Modelling: Encodes rules to ensure designs stay within defined boundaries,
such as maintaining proportions or structural integrity.

• Iterative Exploration: Enables rapid generation of variations, useful in optimizing designs.

Applications
1. Architecture: Complex geometries like curved facades or parametric roof structures (e.g., Zaha
Hadid’s works).

2. Engineering: Optimizing aerodynamic shapes in automotive or aircraft design.

3. Fashion and Jewellery: Generating intricate patterns and structures.

Advantages:

7
8 CHAPTER 1. INTRODUCTION TO DESIGN AND NEURAL NETWORKS

• Saves time by automating repetitive updates.


• Ensures accuracy by maintaining constraints.
• Facilitates innovation by exploring multiple design iterations quickly.

1.3. Rule-Based Design vs. Generative Design


1.3.1. Rule-Based Design
This approach strictly adheres to predefined rules or conditions. It’s highly deterministic and ensures
that all outputs comply with established standards or constraints.
Example:
• Designing a bridge where the rules dictate load-bearing capacity, material limits, and safety codes.
• Automating the layout of electrical circuits based on wiring standards.
Advantages:
• Ensures compliance with industry norms.
• Reliable for repetitive or routine tasks.
Limitations:
• Limited scope for creativity or innovation.
• Cannot adapt dynamically to complex, multi-variable problems.

1.3.2. Generative Design


Generative design is an exploratory approach that uses algorithms (often involving AI or optimization
techniques) to autonomously create and test multiple design options. The process involves:
1. Setting objectives (e.g., minimize weight, maximize strength).
2. Inputting constraints (e.g., material type, load conditions).
3. Allowing the system to generate and evaluate designs.
Example:
• Automotive manufacturers use generative design to create lightweight yet strong components. For
instance, General Motors developed seatbelt brackets that are 40% lighter using generative design.
Advantages:
• Generates unconventional, high-performance solutions.
• Automates the exploration of vast design spaces.
Limitations:
• Requires advanced computational resources and expertise.
• Outputs may require further refinement for practical manufacturing.

1.4. Introduction to Neural Networks (NN)


1.4.1. Overview
• Neural Networks (NNs) are computational systems inspired by biological neural networks in the human brain.
• Composed of layers of interconnected nodes (neurons), they aim to identify patterns and relationships in data.

1.4.2. Structure of Neural Networks


1. Input Layer:

• Receives raw data (e.g., images, text, numbers).


• Number of neurons equals the number of input features.

2. Hidden Layers:

• Transform inputs via weights, biases, and activation functions to extract features.
• Non-linear activation functions like ReLU or Sigmoid introduce complexity.
• Networks can have multiple layers (deep networks), allowing for hierarchical feature learning.

3. Output Layer:

• Produces predictions, either as probabilities (for classification) or continuous values (for re-
gression).

1.4.3. Why Neural Networks?


• They can model complex, non-linear relationships.
• They are adaptable across a variety of domains, including computer vision, natural language processing, and time-series forecasting.

1.4.4. Learning and Representation


• NNs learn "features" from data during training, eliminating the need for manual feature engineering.
• They utilize distributed representations, meaning a concept is encoded across many neurons rather than a single neuron.

1.5. Training of Neural Networks


1.5.1. Core Steps in Training
1. Forward Propagation:

• Data flows through the network, layer by layer, to produce an output (e.g., a classification
label or predicted value).
• Uses the equation: y = f (W x + b), where:
– W : Weights.
– b: Biases.
– f : Activation function.

2. Loss Calculation:

• Compares the predicted output with the actual target value using a loss function:
– Example Loss Functions:
∗ Mean Squared Error (MSE): For regression tasks.
∗ Cross-Entropy Loss: For classification tasks.

3. Backpropagation:

• Calculates the gradient of the loss with respect to each weight and bias using the chain rule.
• Gradients are used to update weights and biases.

4. Weight Updates:

• Optimization algorithms (e.g., Gradient Descent, Adam) adjust weights to minimize the
loss function.
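
To make these four steps concrete, here is a minimal NumPy sketch of a training loop for a one-layer network with a sigmoid activation and MSE loss. All data, shapes, and hyperparameters are illustrative assumptions, not taken from the text above:

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples, 3 input features, 1 continuous target.
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# Parameters: weights W and bias b, small random initialization.
W = 0.1 * rng.normal(size=(3, 1))
b = np.zeros((1, 1))
eta = 0.1  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    # 1. Forward propagation: y_pred = f(XW + b)
    z = X @ W + b
    y_pred = sigmoid(z)

    # 2. Loss calculation (MSE)
    loss = np.mean((y - y_pred) ** 2)

    # 3. Backpropagation: chain rule through MSE and the sigmoid
    dloss_dpred = 2 * (y_pred - y) / len(X)
    dpred_dz = y_pred * (1 - y_pred)      # sigmoid derivative
    delta = dloss_dpred * dpred_dz
    dW = X.T @ delta
    db = delta.sum(axis=0, keepdims=True)

    # 4. Weight update (plain gradient descent)
    W -= eta * dW
    b -= eta * db

print("final loss:", loss)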

1.5.2. Key Concepts in Training


• Learning Rate: Controls the size of updates. Small values ensure stability but slow convergence; large values risk overshooting.
• Overfitting and Regularization: Overfitting occurs when the model learns noise instead of patterns. Techniques like dropout, early stopping, and L2 regularization combat overfitting.
• Epochs and Batches: Training is done in epochs (one full pass through the dataset). Datasets are divided into batches for memory efficiency and faster convergence.

1.6. Classification Tasks


1.6.1. Overview
• Classification involves assigning discrete labels to data.
• Example Applications: Spam detection, image recognition, sentiment analysis.

1.6.2. Example: Image Classification with CIFAR-10


• Dataset: 50,000 training images and 10,000 test images; images are 32x32 pixels with 3 color channels (RGB); 10 classes (e.g., airplane, bird, car).
• Neural Network Process: The input is the flattened image data (32 × 32 × 3 = 3072 numbers); the output is a vector of class scores for the 10 categories.
• Parametric Approach: Learn the mapping f(x; W) using weights W, and use Softmax to convert scores into probabilities.

1.6.3. Performance Evaluation:


• Accuracy: Proportion of correctly classified instances.
• F1 Score: Balances precision and recall, especially useful for imbalanced datasets.

1.7. Regression Tasks


1.7.1. Overview
• Regression predicts continuous outcomes based on input features.
• Example Applications: Predicting stock prices, house values, or weather conditions.

1.7.2. Example: Olympics Winning Time Prediction


• Problem Statement: Given historical data, predict future winning times in the 100m race.
• Model: Single-feature linear regression with equation y = W · x + b, where:

• y: Predicted time.

• x: Year of the Olympics.

• W, b: Learnable parameters.

1.7.3. Metrics for Evaluation:


• Mean Squared Error (MSE): Measures the average squared error between predictions and actual values.
• R-Squared (Coefficient of Determination): Indicates how well the model explains the variance in the target.

1.7.4. Practical Considerations:


• Feature Engineering: Identifying relevant features (e.g., economic conditions for house price predictions).
• Model Complexity: Balancing underfitting (too simple) and overfitting (too complex).


1.8. Mathematical Modelling of Neural Networks


Here is the detailed mathematical formulation for each topic:

1. Introduction to Neural Networks (NN)


A neural network is mathematically represented as a composition of layers, where each layer applies a
linear transformation followed by a non-linear activation function.

Single Layer Perceptron:


The output of a single neuron in layer j is:
z_j = W_j^{\top} x + b_j

where x is the input vector, W_j the weight vector for neuron j, and b_j the bias term.
The activation function f is applied to z_j:

a_j = f(z_j)

Multi-Layer Network:
For a network with L layers:

a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)})

where a^{(l)} is the activation in layer l, W^{(l)} the weight matrix for layer l, b^{(l)} the bias vector for layer l, and f the activation function.
The output layer computes predictions y:

y = f_{out}(W^{(L)} a^{(L-1)} + b^{(L)})

2. Training of Neural Networks

Forward Propagation:
The network computes predictions from the input x:

y_{pred} = f(W^{(L)} \cdots f(W^{(1)} x + b^{(1)}) \cdots + b^{(L)})

Loss Function:
For a training dataset \{(x_i, y_i)\}_{i=1}^{n}, the loss quantifies the difference between predicted and true outputs:

• Mean Squared Error (MSE) for regression:

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - y_{pred,i})^2

• Cross-Entropy Loss for classification:

L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}

where K is the number of classes, and \hat{y}_{i,k} is the predicted probability for class k.

Backpropagation:
Using the chain rule, compute the gradients of the loss with respect to weights W and biases b:

\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^{\top}

\delta^{(l)} = \frac{\partial L}{\partial a^{(l)}} \circ f'(z^{(l)})

where \delta^{(l)} is the error term for layer l, and f'(z) is the derivative of the activation function.

Gradient Descent:

Weights are updated iteratively:

W^{(l)} = W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}

where \eta is the learning rate.

3. Classification Tasks
Softmax Function:

For a classification network with K classes, the output is passed through the softmax function:

\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}

where z_k is the raw score (logit) for class k, and \hat{y}_k is the predicted probability for class k.

Prediction:

The predicted class \hat{y} is:

\hat{y} = \arg\max_{k} \hat{y}_k
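
A short illustrative sketch of the softmax and arg-max prediction step in NumPy; the logits are made-up values, and the max-subtraction is a standard numerical-stability trick that leaves the probabilities unchanged:

import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores z_k for K = 3 classes
probs = softmax(logits)              # predicted probabilities for each class
pred = np.argmax(probs)              # predicted class

print(probs, "sum =", probs.sum(), "predicted class:", pred)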

4. Regression Tasks
Linear Regression:

A linear model predicts the target y as:

ypred = W T x + b

Loss Function:

Use Mean Squared Error (MSE):

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - y_{pred,i})^2

Gradient Descent Updates:

The gradients are computed as:

\frac{\partial L}{\partial W} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - y_{pred,i})

\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - y_{pred,i})

Weights and biases are updated as:

W = W - \eta \frac{\partial L}{\partial W}

b = b - \eta \frac{\partial L}{\partial b}

Activation Function Examples:


1. ReLU (Rectified Linear Unit):

f(x) = \max(0, x), \qquad f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}

2. Sigmoid:

f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,(1 - f(x))

3. Tanh:

f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad f'(x) = 1 - f(x)^2

1.9. Optimization, Gradient Descent Algorithm, and Its Variants
Optimization Overview
Optimization refers to the process of finding the best solution (e.g., minimizing a loss function or max-
imizing a utility function) from a set of feasible solutions. In machine learning, optimization is crucial
for training models by adjusting their parameters to minimize a loss function.

Gradient Descent Algorithm


Gradient Descent (GD) is a first-order optimization algorithm used to minimize a function f (θ), where
θ represents the parameters.

Mathematics of Gradient Descent


The update rule is:
θt+1 = θt − η∇f (θt )
where:
• θt : Parameters at iteration t
• η: Learning rate
• ∇f (θt ): Gradient of the function at θt
The gradient provides the direction of the steepest ascent; taking a step in the negative gradient
direction minimizes the function.

Challenges
• Learning rate (η): If too large, it may overshoot the minimum; if too small, convergence becomes
slow.
• Local minima: May converge to a local minimum instead of the global minimum.

1.9.1. Variants of Gradient Descent


Several variants improve upon the basic gradient descent:

Stochastic Gradient Descent (SGD)


Uses a single data point (or a small batch) for each gradient computation. The update rule is:
θt+1 = θt − η∇fi (θt )
where fi is the loss for the i-th data point.
Pros: Faster updates, avoids saddle points.
Cons: Noisy updates can lead to instability.

Mini-Batch Gradient Descent


Uses a small batch of data points to calculate the gradient. It combines the benefits of GD and SGD.

Momentum
Incorporates a velocity term to accelerate convergence:
vt+1 = βvt + η∇f (θt )
θt+1 = θt − vt+1
Momentum reduces oscillations in steep dimensions.

RMSProp
Scales the learning rate based on recent gradients:

E[\nabla f^2]_t = \beta E[\nabla f^2]_{t-1} + (1-\beta)(\nabla f(\theta_t))^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[\nabla f^2]_t + \epsilon}}\, \nabla f(\theta_t)

RMSProp is effective for non-convex problems.

Adam
Combines Momentum and RMSProp:

m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_t)

v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla f(\theta_t))^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t

where \hat{m}_t = m_t/(1-\beta_1^t) and \hat{v}_t = v_t/(1-\beta_2^t) are the bias-corrected moment estimates.
Pros: Adaptive and robust for sparse gradients.
Cons: May not generalize well in some cases.
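
As a sketch of how these pieces combine, the following Python loop applies the Adam update equations above to a toy quadratic objective; the objective, step count, and constants are illustrative assumptions:

import numpy as np

def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * ||theta||^2.
    return theta

theta = np.array([1.0, -2.0])
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = np.zeros_like(theta)  # first-moment (momentum-style) estimate
v = np.zeros_like(theta)  # second-moment (RMSProp-style) estimate

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # m_t
    v = beta2 * v + (1 - beta2) * g**2       # v_t
    m_hat = m / (1 - beta1**t)               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)               # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print("theta after Adam updates:", theta)    # should approach [0, 0]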

1.10. How Do You Define Probabilities Over Data?


Probabilities over data are defined by assigning a likelihood value to events or outcomes within a dataset.

1.10.1. Discrete Probability


For a discrete random variable X, the probability mass function (PMF) defines probabilities:
P(X = x) = p(x)

where:
• p(x) \ge 0
• \sum_x p(x) = 1

Example: For a fair six-sided die, P(X = k) = \frac{1}{6} for k \in \{1, 2, 3, 4, 5, 6\}.

1.10.2. Continuous Probability


For a continuous random variable X, probabilities are defined via the probability density function (PDF):
P(a \le X \le b) = \int_a^b f(x)\, dx

where:
• f(x) \ge 0
• \int_{-\infty}^{\infty} f(x)\, dx = 1

Example: For a standard normal distribution, f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.


1.11. Probabilistic Modeling Framework


Probabilistic models use probability distributions to describe data. They are categorized into unsuper-
vised setups, generative models, and discriminative models.

1.11.1. Unsupervised Setup


In unsupervised learning:
• Data lacks labels.
• Probabilistic models are used to uncover hidden structures (e.g., clustering or dimensionality reduction).
Example: Gaussian Mixture Models (GMMs) assume data points are generated from a mixture of
Gaussian distributions.

1.11.2. Generative Modeling


Generative models learn the joint probability P(X, Y), where X is the input and Y is the label. They can:
1. Generate new data similar to the training set.
2. Perform classification using Bayes' theorem:

P(Y \mid X) = \frac{P(X, Y)}{P(X)}

Examples: Naive Bayes, Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs).

1.11.3. Discriminative Modeling


Discriminative models learn the conditional probability P(Y | X), focusing directly on classification or regression tasks. These models:
• Do not model the data distribution explicitly.
• Are often simpler and more efficient for classification tasks.

Examples: Logistic Regression, Support Vector Machines (SVMs), Neural Networks.

1.12. Comparison: Generative vs. Discriminative


Aspect              | Generative Models                          | Discriminative Models
--------------------|--------------------------------------------|---------------------------
Focus               | Models P(X, Y)                             | Models P(Y | X)
Use Case            | Data generation, semi-supervised learning  | Classification, regression
Training Complexity | Higher (models full distribution)          | Lower
Example Algorithms  | Naive Bayes, VAEs, GANs                    | Logistic regression, SVMs
Proper initialization of weights is crucial to ensure efficient training of neural networks and to avoid
getting stuck in poor local minima or saddle points. Here’s a detailed guide on how to initialize weights
effectively:

1.13. Importance of Weight Initialization


Weight initialization affects:
• Symmetry breaking: Ensures neurons learn unique features.
• Gradient flow: Prevents vanishing or exploding gradients.

1.14. Techniques to Avoid Local Minima


1.14.1. Random Initialization
Assign small random values (e.g., drawn from a uniform or normal distribution) to weights to break
symmetry. However:

• Too small values can cause vanishing gradients.

• Too large values can cause exploding gradients.

1.14.2. Xavier Initialization (Glorot Initialization)


Balances the variance of activations and gradients across layers. For a layer with nin input neurons and
nout output neurons, weights are drawn from:

• Uniform distribution:

W \sim U\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\ \sqrt{\frac{6}{n_{in}+n_{out}}}\right)

• Normal distribution:

W \sim N\left(0,\ \frac{2}{n_{in}+n_{out}}\right)

Best suited for sigmoid or tanh activation functions.

1.14.3. He Initialization (Kaiming Initialization)


Addresses vanishing gradients for ReLU-like activations by scaling weights based on the number of input
neurons (nin ).
Weights are drawn from:

• Uniform distribution:

W \sim U\left(-\sqrt{\frac{6}{n_{in}}},\ \sqrt{\frac{6}{n_{in}}}\right)

• Normal distribution:

W \sim N\left(0,\ \frac{2}{n_{in}}\right)

1.14.4. LeCun Initialization


Specialized for Leaky ReLU or SELU activations:

W \sim N\left(0,\ \frac{1}{n_{in}}\right)
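
The three schemes above differ only in the scale of the random draw. A minimal NumPy sketch, with illustrative layer sizes:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128  # illustrative layer sizes

# Xavier/Glorot (uniform): U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))
limit = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, size=(n_in, n_out))

# He/Kaiming (normal): N(0, 2/n_in), suited to ReLU-like activations
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

# LeCun (normal): N(0, 1/n_in), suited to SELU-style activations
W_lecun = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

for name, W in [("xavier", W_xavier), ("he", W_he), ("lecun", W_lecun)]:
    print(f"{name}: std = {W.std():.4f}")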

1.15. Layer-Specific Initializations


For deep networks, initialization may depend on the layer type:

1. Convolutional Layers: Apply Xavier or He initialization based on the size of the filter and input
channels.

2. Recurrent Layers (e.g., LSTMs): Use orthogonal initialization for weights and set biases for
forget gates to 1 (helps in learning long-term dependencies).

1.16. Bias Initialization


Biases are typically initialized to:

• Zero: Prevents unnecessary bias in the early stages of training.

• Small constants: Occasionally used to help with faster convergence.



1.17. Regularization Techniques


Weight initialization can be complemented with:
• Batch Normalization: Normalizes layer outputs to stabilize gradients.
• Dropout: Reduces dependency on specific neurons, preventing overfitting.

Practical Advice
• Choose based on activation: Use Xavier for sigmoid/tanh, He for ReLU.
• Monitor loss/accuracy trends: Adjust initialization if gradients vanish or explode.
• Combine with advanced optimizers: Optimizers like Adam or RMSProp handle some gradient
issues efficiently.
By carefully initializing weights, training becomes more stable and less prone to poor local minima or saddle points.

1.18. RMSProp


1.18.1. What RMSProp Doesn't Solve


• Weight Initialization:
– RMSProp cannot fix poor weight initialization. If weights are initialized too small, activations
may vanish (e.g., sigmoid/tanh activations saturating). If initialized too large, gradients may
explode.
– Proper initialization (e.g., Xavier or He) ensures balanced gradient flow and complements
RMSProp.
• Local Minima:
– RMSProp does not specifically avoid local minima or saddle points. However, its ability to
adaptively scale gradients can help it escape shallow minima or plateaus by allowing the
network to move faster in flat regions.
– It is still possible for RMSProp to converge to a poor local minimum if the loss surface is
complex.
• Non-Convex Optimization:
– Many deep learning loss functions are non-convex, with numerous local minima and saddle
points. RMSProp improves optimization efficiency but does not alter the fundamental land-
scape of the loss function.

1.18.2. How RMSProp and Weight Initialization Work Together


• Good Weight Initialization: Ensures stable activation distributions and gradient flow at the
start of training.
• RMSProp: Adapts learning rates dynamically to improve convergence, especially in challenging
scenarios like sparse gradients or varying magnitudes.

In summary, RMSProp reduces the sensitivity to weight initialization and helps improve
optimization efficiency, but proper weight initialization and complementary strategies are still critical
for avoiding poor minima.

1.19. RMSProp: Understanding Its Role in Optimization


RMSProp is a powerful optimizer, but it does not entirely eliminate the need for proper weight ini-
tialization or guarantee avoidance of sticking in local minima. Instead, RMSProp complements weight
initialization by addressing challenges related to learning rate adaptation and gradient scaling. Here’s a
breakdown:

1.19.1. What RMSProp Does


• Adaptive Learning Rates: RMSProp dynamically adjusts the learning rate for each weight
based on the magnitude of recent gradients. This is done by maintaining a running average of
squared gradients:
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2

where:

– E[g^2]_t: Exponentially decayed average of squared gradients.
– \beta: Decay rate (e.g., 0.9).
– g_t: Gradient at step t.

The parameter update is then:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t

where:

– \eta: Global learning rate.
– \epsilon: Small constant to avoid division by zero.

• Benefits:

– Handles exploding gradients by scaling down large gradients.


– Speeds up training in flat regions by boosting learning rates for smaller gradients.
– Avoids oscillations in the optimization trajectory.
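
A minimal NumPy sketch of the two RMSProp equations above, applied to a deliberately badly scaled toy quadratic; the objective and constants are illustrative assumptions:

import numpy as np

def grad(theta):
    # Gradient of a badly scaled quadratic: f = 50*x^2 + 0.5*y^2.
    return np.array([100.0 * theta[0], 1.0 * theta[1]])

theta = np.array([1.0, 1.0])
eta, beta, eps = 0.01, 0.9, 1e-8
Eg2 = np.zeros_like(theta)  # running average E[g^2]_t

for _ in range(500):
    g = grad(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g**2           # decayed average of g^2
    theta = theta - eta * g / np.sqrt(Eg2 + eps)   # per-parameter scaled step

print("theta:", theta)  # both coordinates shrink at comparable rates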

1.19.2. What RMSProp Doesn’t Solve


• Weight Initialization:

– RMSProp cannot fix poor weight initialization. If weights are initialized too small, activations
may vanish (e.g., sigmoid/tanh activations saturating). If initialized too large, gradients may
explode.
– Proper initialization (e.g., Xavier or He) ensures balanced gradient flow and complements
RMSProp.

• Local Minima:

– RMSProp does not specifically avoid local minima or saddle points. However, its ability to
adaptively scale gradients can help it escape shallow minima or plateaus by allowing the
network to move faster in flat regions.
– It is still possible for RMSProp to converge to a poor local minimum if the loss surface is
complex.

• Non-Convex Optimization:

– Many deep learning loss functions are non-convex, with numerous local minima and saddle
points. RMSProp improves optimization efficiency but does not alter the fundamental land-
scape of the loss function.

1.19.3. Does RMSProp Avoid Local Minima?


While RMSProp’s adaptive mechanism helps in escaping shallow local minima, it does not guarantee
avoidance of deeper, more problematic local minima. For enhanced robustness against local minima:

• Use optimizers like Adam, which combines RMSProp with momentum.

• Employ techniques such as annealing, early stopping, or stochasticity (e.g., SGD with small
batches).

In summary, RMSProp reduces the sensitivity to weight initialization and helps improve
optimization efficiency, but proper weight initialization and complementary strategies are still critical
for avoiding poor minima.

1.20. Exponentially Decaying Moving Average of Squared Gradients

The term E[g^2]_t in the RMSProp optimizer represents the exponentially decaying moving average of the squared gradients over time. Here is the detailed formulation and explanation:

Definition
At each step t, E[g^2]_t is updated using the following formula:

E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2

where:

• E[g^2]_t: The moving average of the squared gradients at step t.
• g_t^2: The square of the gradient at step t for a parameter.
• \beta: The decay rate (usually set to 0.9).
• E[g^2]_{t-1}: The moving average from the previous step.

How It Works
• Smoothing Gradients: The term E[g^2]_t smooths the squared gradients over time, allowing RMSProp to use a stable estimate of their magnitude.

• Controlling Step Size: E[g^2]_t is used to scale the learning rate inversely:

\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t

– When gradients are large, E[g^2]_t increases, reducing the step size.
– When gradients are small, E[g^2]_t decreases, increasing the step size.

• Decay Factor: The decay rate \beta ensures that more recent gradients contribute more significantly to the moving average. A higher \beta gives greater weight to past gradients, while a lower \beta emphasizes recent changes.



Expanded Form
If you expand the recurrence relation, you can see how the moving average incorporates all past squared
gradients:
E[g^2]_t = (1-\beta) g_t^2 + \beta(1-\beta) g_{t-1}^2 + \beta^2 (1-\beta) g_{t-2}^2 + \cdots

This shows that E[g^2]_t is a weighted sum of past squared gradients, with weights decreasing exponentially.

Role of ϵ
A small constant \epsilon (e.g., 10^{-8}) is added in the denominator to prevent division by zero:

\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t

Summary
• E[g^2]_t adapts the learning rate by keeping track of the recent history of gradient magnitudes.
• It balances learning across parameters by normalizing updates based on gradient variance.

• It works effectively in situations with sparse or noisy gradients, making RMSProp robust for many
machine learning problems.
Chapter 2

Data Analysis and Generative Algorithms

Figure 2.1: Visualisation of dimensionality reduction, which exploits operations on a low-dimensional structure embedded in the high-dimensional space.

2.1. Data Cleaning, Imputing, and Standardization


Data preprocessing is essential for effective modeling and analysis. Common steps include:

2.1.1. Data Cleaning


Identifying and correcting errors, handling missing or outlier values, and ensuring consistent formatting.
Techniques include:
• Removing duplicates.
• Applying domain-specific validation rules.

2.1.2. Data Imputation


Filling missing values using:
• Statistical methods (mean, median, mode).
• Regression-based techniques.
• Advanced algorithms like KNN or machine learning-based imputation.


2.1.3. Standardization
Ensuring variables have zero mean and unit variance to:
• Allow models like PCA or k-means to treat all features equally.
• Aid in faster convergence during training.

2.2. Introduction to Dimensional Reduction in Data Space


Dimensional reduction is critical for handling high-dimensional data efficiently. Common techniques
include:

2.2.1. Principal Component Analysis (PCA)


• Reduces dimensionality by projecting data onto principal axes that capture maximum variance.
• Applications include noise reduction, visualization, and feature extraction.
• Mathematical foundation involves eigenvalue decomposition of the covariance matrix.

2.2.2. t-SNE (t-Distributed Stochastic Neighbor Embedding)


• Optimized for visualization of high-dimensional data in 2D or 3D.
• Captures local and global structure but focuses on preserving neighborhood relationships.

2.3. Data Cleaning, Imputing, and Standardization


Mathematical approaches ensure consistency and quality of datasets:

2.3.1. Data Cleaning


Outlier Detection
Use statistical methods like z-scores or the interquartile range (IQR):

z = \frac{x - \mu}{\sigma}

Values with |z| > k (commonly k = 3) are flagged as outliers.

Missing Data Identification


Identify null values as xi = NaN, and compute the missingness proportion:
\text{Missingness} = \frac{\text{Number of NaN entries}}{\text{Total entries}}

2.3.2. Data Imputation


Mean Imputation
Replace missing values with the mean:
x_i = \mu, \quad \text{where } \mu = \frac{1}{n} \sum_{i=1}^{n} x_i

Regression Imputation
Estimate missing values xi based on other features:

xi = w1 x1 + w2 x2 + · · · + wm xm + ϵ
2.4. DESIGN DATASETS AND EXAMPLES 23

2.3.3. Standardization
Transform data to have zero mean and unit variance:
z_i = \frac{x_i - \mu}{\sigma}, \quad \text{where } \mu = \text{mean},\ \sigma = \text{standard deviation}
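
A short sketch combining mean imputation (Section 2.3.2) with standardization in NumPy; the toy matrix is an illustrative assumption:

import numpy as np

# Toy feature matrix with missing entries (NaN).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 240.0],
              [4.0, 260.0]])

# Mean imputation: replace each NaN with its column mean.
col_mean = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_mean, X)

# Standardization: z_i = (x_i - mu) / sigma, per column.
mu = X_imputed.mean(axis=0)
sigma = X_imputed.std(axis=0)
Z = (X_imputed - mu) / sigma

print(Z.mean(axis=0))  # ~0 per feature
print(Z.std(axis=0))   # ~1 per feature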

2.4. Design Datasets and Examples


Design datasets often require extracting patterns:

2.4.1. Cluster Analysis


Group similar samples using metrics like Euclidean distance:
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

2.4.2. Optimization in Generative Design


Minimize a loss function subject to constraints:

\min_{x}\ f(x) + \lambda g(x)

where f (x) is the design objective and g(x) represents constraints.

2.5. Dimensional Reduction in Data Space


Dimensionality reduction ensures computational efficiency and insight. Key methods include:

2.5.1. Principal Component Analysis (PCA)

Figure 2.2: Suppose we pick u to correspond to the direction shown in the figure. The circles denote the projections of the original data onto this line.

Steps of PCA
1. Center the data:

X_{centered} = X - \mu

2. Compute the covariance matrix:

C = \frac{1}{n-1} X_{centered}^{\top} X_{centered}

Figure 2.3: We see that the projected data still has a fairly large variance, and the points tend to be far from zero. In contrast, suppose we had instead picked the direction shown above.

3. Perform eigen decomposition of C:

Cv = \lambda v

Principal components are the eigenvectors v corresponding to the largest eigenvalues \lambda.

4. Project data onto the top k components:

X_{reduced} = X_{centered} V_k

where V_k is the matrix of the top k eigenvectors.
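
The four steps above translate directly into NumPy; this sketch assumes illustrative random data and k = 2 components:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # illustrative data: 100 samples, 5 features
k = 2                          # number of components to keep

# 1. Center the data.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix.
C = X_centered.T @ X_centered / (len(X) - 1)

# 3. Eigen decomposition (eigh, since C is symmetric), sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
V_k = eigvecs[:, order[:k]]

# 4. Project onto the top-k principal components.
X_reduced = X_centered @ V_k
print(X_reduced.shape)  # (100, 2)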

2.5.2. t-SNE (t-Distributed Stochastic Neighbor Embedding)


High-Dimensional Space
Define probabilities of pairwise similarity:

P_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

Low-Dimensional Space
Define probabilities using a Student-t distribution:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}

Minimizing KL Divergence
Minimize the Kullback-Leibler (KL) divergence between P_{ij} and q_{ij}:

KL(P \| Q) = \sum_{i \neq j} P_{ij} \log\frac{P_{ij}}{q_{ij}}

2.6. Determinantal Point Process (DPP) Matrix and Its Mathematical Background
A Determinantal Point Process (DPP) is a probabilistic model used to select diverse subsets of data
points. It has applications in machine learning, recommendation systems, and summarization tasks. The
foundation of a DPP lies in linear algebra and the geometry of determinants.

Figure 2.4: DPP-Kernel

2.6.1. Mathematical Background of DPP


A DPP is defined on a ground set Y = {y1 , y2 , . . . , yN }, where subsets are sampled with probabilities
proportional to the determinant of a positive semi-definite kernel matrix K.

Kernel Matrix K
• K ∈ RN ×N is symmetric and positive semi-definite.

• The entry Kij represents the similarity between items yi and yj .

• Diversity is encouraged because the determinant of K is high when the rows (or columns) are
orthogonal, i.e., less similar.

Probability of Selecting Subset A ⊆ Y


Pr(A) \propto \det(K_A)

Here, K_A is the submatrix of K formed by rows and columns indexed by A.

2.6.2. Properties of DPP


• Diversity Maximization: Subsets with less similar elements are more probable because deter-
minants favor orthogonal vectors.

• Marginal Probabilities: For a single item yi , its inclusion probability is:

Pr(yi ∈ A) = Kii

• Subset Size Control: Using a parameterized kernel L, one can control the size of sampled subsets
via normalization:
Pr(A) = \frac{\det(L_A)}{\det(L + I)}

2.6.3. Applications
• Data Summarization: Selecting representative data points while maintaining diversity.
• Recommendation Systems: Suggesting items that are both relevant and diverse to users.
• Dimensionality Reduction: Selecting diverse features for models to prevent redundancy.

2.6.4. Example: Constructing a DPP Matrix


Consider Y = {y1 , y2 , y3 }, with feature vectors f1 , f2 , f3 :
1. Compute the similarity matrix S:

S_{ij} = \exp\left(-\frac{\|f_i - f_j\|^2}{2\sigma^2}\right)

2. Construct the kernel K:

K = S

Alternatively, K can be normalized to ensure eigenvalues are in [0, 1].

2.6.5. Connection to PCA


DPPs can be connected to eigen-decomposition. If:
K = \sum_{i=1}^{r} \lambda_i u_i u_i^{\top}

• Eigenvectors (ui ) represent orthogonal directions.


• Eigenvalues (λi ) control the probability of selecting associated features or items.
This makes DPPs particularly useful for diverse subset selection in high-dimensional spaces, comple-
menting dimensionality reduction techniques like PCA or t-SNE.

2.7. Solved Example of Principal Component Analysis (PCA)


Consider a small dataset with three 2-dimensional data points:
 
X = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \end{pmatrix}

Each row represents a data point, and each column represents a feature.

2.7.1. Step 1: Center the Data


First, subtract the mean of each feature (column) from the data points to center the data.
   
\mu = \left(\frac{1+2+3}{3},\ \frac{2+3+4}{3}\right) = (2,\ 3)

Now, subtract the mean vector \mu from each data point in X:

X_{centered} = X - \mu = \begin{pmatrix} -1 & -1 \\ 0 & 0 \\ 1 & 1 \end{pmatrix}

2.7.2. Step 2: Compute the Covariance Matrix


Next, compute the covariance matrix of the centered data:
 
C = \frac{1}{n-1} X_{centered}^{\top} X_{centered} = \frac{1}{2} \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix} \begin{pmatrix} -1 & -1 \\ 0 & 0 \\ 1 & 1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}

2.7.3. Step 3: Eigenvalue Decomposition


Now, perform eigenvalue decomposition of the covariance matrix:

Cv = \lambda v

We solve for eigenvalues \lambda and eigenvectors v using the characteristic equation:

\det(C - \lambda I) = 0

Solving this equation yields:

\lambda_1 = 2, \quad \lambda_2 = 0

The corresponding eigenvectors are:

v_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad v_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}

2.7.4. Step 4: Project the Data onto the Principal Components


To reduce the dimensionality, we project the centered data onto the first principal component (v1 ):
   
X_{reduced} = X_{centered}\, v_1 = \begin{pmatrix} -1 & -1 \\ 0 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} -2 \\ 0 \\ 2 \end{pmatrix}

Thus, the reduced data (projected onto the first principal component) is:

X_{reduced} = \begin{pmatrix} -2 \\ 0 \\ 2 \end{pmatrix}
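
The hand calculation can be checked numerically. Note that np.linalg.eigh returns unit-length eigenvectors, so the projection below is the hand-computed (−2, 0, 2) scaled by 1/√2, up to an overall sign:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
Xc = X - X.mean(axis=0)                 # centered data
C = Xc.T @ Xc / (len(X) - 1)            # [[1, 1], [1, 1]]

eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)                          # [0, 2]
v1 = eigvecs[:, np.argmax(eigvals)]     # unit-norm version of (1, 1)

# Projection onto the first principal component (up to sign):
print(Xc @ v1)                          # ~[-1.414, 0, 1.414]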

2.8. Solved Example of Determinantal Point Process (DPP) Matrix Construction
Consider a small set of data points Y = \{y_1, y_2, y_3\}, where each data point is represented by a 2-dimensional feature vector:

f_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad f_2 = \begin{pmatrix} 2 \\ 3 \end{pmatrix}, \quad f_3 = \begin{pmatrix} 3 \\ 4 \end{pmatrix}

2.8.1. Step 1: Compute the Similarity Matrix S


First, compute the pairwise similarity matrix S, where each entry is the Gaussian similarity between the feature vectors:

S_{ij} = \exp\left(-\frac{\|f_i - f_j\|^2}{2\sigma^2}\right)

Assuming \sigma = 1, compute the pairwise squared Euclidean distances:

\|f_1 - f_2\|^2 = (1-2)^2 + (2-3)^2 = 2, \quad \|f_1 - f_3\|^2 = (1-3)^2 + (2-4)^2 = 8, \quad \|f_2 - f_3\|^2 = (2-3)^2 + (3-4)^2 = 2

Thus, the similarity matrix S is:

S = \begin{pmatrix} 1 & e^{-1} & e^{-4} \\ e^{-1} & 1 & e^{-1} \\ e^{-4} & e^{-1} & 1 \end{pmatrix}

Approximating the exponentials:

S \approx \begin{pmatrix} 1 & 0.3679 & 0.0183 \\ 0.3679 & 1 & 0.3679 \\ 0.0183 & 0.3679 & 1 \end{pmatrix}

2.8.2. Step 2: Construct the Kernel Matrix K


The kernel matrix K is simply the similarity matrix K = S:
 
K \approx \begin{pmatrix} 1 & 0.3679 & 0.0183 \\ 0.3679 & 1 & 0.3679 \\ 0.0183 & 0.3679 & 1 \end{pmatrix}

2.8.3. Step 3: Calculate the Probability of Selecting Subset A = {y1 , y3 }


The probability of selecting the subset A = \{y_1, y_3\} is given by the determinant of the submatrix of K indexed by A:

K_A = \begin{pmatrix} 1 & 0.0183 \\ 0.0183 & 1 \end{pmatrix}

Thus, the probability is:

Pr(A) \propto \det(K_A) = 1 - (0.0183)^2 \approx 0.99967

2.8.4. Step 4: Normalization for Subset Size Control


If we want to control the size of the sampled subsets using a parameterized kernel L, we normalize the probability:

Pr(A) = \frac{\det(L_A)}{\det(L + I)}

where L is a kernel matrix with additional adjustments for subset size.
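
A short NumPy sketch reproducing Steps 1-4 for this example; taking L = K for the normalized form is an illustrative choice:

import numpy as np

F = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])  # feature vectors f1, f2, f3
sigma = 1.0

# Step 1: Gaussian similarity matrix S_ij = exp(-||f_i - f_j||^2 / (2 sigma^2)).
sq_dists = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
S = np.exp(-sq_dists / (2 * sigma**2))

# Step 2: use S directly as the kernel K.
K = S

# Step 3: unnormalized probability of subset A = {y1, y3} (indices 0 and 2).
A = [0, 2]
K_A = K[np.ix_(A, A)]
print(np.linalg.det(K_A))  # ~0.99967, matching the hand calculation

# Step 4 (L-ensemble normalization): Pr(A) = det(L_A) / det(L + I), with L = K here.
L = K
prob = np.linalg.det(L[np.ix_(A, A)]) / np.linalg.det(L + np.eye(3))
print(prob)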

2.9. Purpose of Sampling


Sampling is a fundamental concept in data science, statistics, and machine learning, involving the se-
lection of a representative subset from a larger dataset or population. Effective sampling ensures that
analysis and modeling are computationally efficient while maintaining accuracy and generalizability.

2.9.1. Efficiency
Large datasets can be computationally expensive to process. Sampling reduces the data size without
significantly compromising results.

2.9.2. Generalization
A well-chosen sample can reflect the characteristics of the entire population, enabling robust model
training.

2.9.3. Exploration
Sampling can help identify patterns, outliers, or trends in the data before committing to a full-scale
analysis.

2.10. Types of Sampling


2.10.1. Random Sampling
Each data point has an equal probability of being selected.
Pr(S) = \binom{N}{n}^{-1}

where N is the population size and n is the sample size. This is used for unbiased representations of the
population.

2.10.2. Stratified Sampling


Divides the population into subgroups (strata) and samples proportionally from each. Ensures represen-
tation of all subgroups.
n_i = W_i \cdot n
where Wi is the weight (proportion) of stratum i, and ni is the sample size for i.

2.10.3. Systematic Sampling


Selects every k-th data point after a random start. The selected indices are:

\{i,\ i+k,\ i+2k,\ \ldots\}

where k = N/n.

2.10.4. Cluster Sampling


Divides the population into clusters and randomly samples some clusters. Often used in geographical or
segmented datasets.

2.10.5. Importance Sampling


Assigns higher selection probabilities to important data points (e.g., rare events or outliers).
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \frac{p(x_i)}{q(x_i)} f(x_i)

where p(x) is the target probability distribution, q(x) is the proposal distribution, and f (x) is the
function of interest.

2.10.6. Reservoir Sampling


A streaming algorithm for random sampling from a large or infinite population. Ensures all elements
have equal probability of selection in a fixed-size sample.
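
A minimal Python sketch of reservoir sampling (the classic Algorithm R); the stream and sample size are illustrative:

import random

def reservoir_sample(stream, k, seed=0):
    """Return k items from an iterable of unknown length, each kept with
    equal probability k/N (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # uniform index in [0, i]
            if j < k:
                reservoir[j] = item       # replace with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))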

2.11. Theoretical Foundations


Sampling connects deeply with probability theory:

2.11.1. Law of Large Numbers (LLN)


As sample size n increases, sample statistics converge to population parameters:
\bar{x} \xrightarrow{\ n \to \infty\ } \mu

where x̄ is the sample mean and µ is the population mean.



2.11.2. Central Limit Theorem (CLT)


The sampling distribution of the sample mean approaches a normal distribution for sufficiently large n,
regardless of the population’s original distribution:

\bar{X} \sim N\left(\mu,\ \frac{\sigma^2}{n}\right)

where N denotes a normal distribution with mean \mu and variance \sigma^2/n.

2.12. Applications of Sampling


2.12.1. Machine Learning
• Training and Validation Splits: Partition datasets into training and test sets using random
sampling or stratification.

• Subsampling: Efficiently train models on large datasets (e.g., random forests or stochastic gra-
dient descent).

2.12.2. Big Data


Sample subsets for preliminary analysis in high-volume systems.

2.12.3. Data Visualization


Visualize trends without overloading resources or cluttering graphs.

2.13. Advanced Sampling in ML and Statistics


2.13.1. Markov Chain Monte Carlo (MCMC)
Sampling from complex distributions via iterative processes. Applications include Bayesian inference
and probabilistic modeling.

Metropolis-Hastings Algorithm

Propose a new state x′ from q(x′ |x). Accept x′ with probability:

\alpha = \min\left(1,\ \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)

where p(x) is the target distribution and q(x' \mid x) is the proposal distribution.
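
A minimal Python sketch of Metropolis-Hastings targeting a standard normal, using a symmetric Gaussian proposal so that the q terms cancel in the acceptance ratio; all constants are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Unnormalized target density: standard normal.
    return np.exp(-0.5 * x**2)

x = 0.0
samples = []
for _ in range(10_000):
    x_prop = x + rng.normal(0.0, 1.0)          # symmetric proposal: q(x'|x) = q(x|x')
    alpha = min(1.0, p_unnorm(x_prop) / p_unnorm(x))
    if rng.uniform() < alpha:                  # accept with probability alpha
        x = x_prop
    samples.append(x)

print(np.mean(samples), np.std(samples))       # ~0 and ~1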

2.13.2. Determinantal Point Processes (DPP)


Sampling diverse subsets, as discussed earlier, for applications in machine learning, such as feature
selection and diversity optimization in models.

2.14. Introduction to Generative Algorithms and their novelties


Novelty is an important measure when evaluating new designs in optimization and generative design
processes. Novelty metrics like p-novelty and h-novelty quantify the ”distance” of a new design from
existing designs, thus helping to identify unique or diverse designs.

2.15. Definitions
Given a set of design points in the design space D, novelty measures the distance between a new design
and the designs in the existing population.
Let the new design be represented by xnew , and the existing designs are x1 , x2 , . . . , xN where N is
the number of designs in the population.

2.15.1. p-Novelty
The p-novelty measure calculates the distance between the new design and the closest design in the
population, typically using a distance metric such as Euclidean distance.
The p-novelty of the new design xnew is defined as:

p_{novelty}(x_{new}) = \min_{i \in \{1, 2, \ldots, N\}} \|x_{new} - x_i\|

where \|x_{new} - x_i\| is the Euclidean distance (or any other suitable distance metric) between the new design x_{new} and an existing design x_i.

2.15.2. h-Novelty
The h-novelty measure is an extension of p-novelty and considers the set of closest designs to the new
design. It computes the average distance to the h closest designs, providing a more comprehensive
measure of novelty.
The h-novelty of the new design xnew is defined as:
h_{novelty}(x_{new}) = \frac{1}{h} \sum_{i=1}^{h} \|x_{new} - x_i^{(h)}\|

where x_i^{(h)} are the h nearest neighbors to x_{new} based on the chosen distance metric, and the sum is taken over the h closest designs to the new design in the population.

2.16. Applications of Novelty Measures


• Generative Design: Novelty measures are useful for selecting diverse designs in generative design processes.
• Evolutionary Algorithms: In evolutionary algorithms, novelty measures can guide the selection of solutions that explore novel regions of the design space.
• Optimization Problems: Novelty metrics can be incorporated into multi-objective optimization to balance between exploitation and exploration.

2.17. Example Calculation


Consider a design space D with three existing designs x1 = (1, 2), x2 = (3, 4), x3 = (6, 7), and a new
design xnew = (2, 3).

2.17.1. p-Novelty Calculation


The Euclidean distances between the new design and the existing designs are:
\|x_{new} - x_1\| = \sqrt{(2-1)^2 + (3-2)^2} = \sqrt{1+1} = \sqrt{2}

\|x_{new} - x_2\| = \sqrt{(2-3)^2 + (3-4)^2} = \sqrt{1+1} = \sqrt{2}

\|x_{new} - x_3\| = \sqrt{(2-6)^2 + (3-7)^2} = \sqrt{16+16} = \sqrt{32}

The minimum distance is \min(\sqrt{2}, \sqrt{2}, \sqrt{32}) = \sqrt{2}, so:

p_{novelty}(x_{new}) = \sqrt{2}

2.17.2. h-Novelty Calculation


For h = 2: the two nearest neighbors are x_1 and x_2. The average distance is:

h_{novelty}(x_{new}) = \frac{1}{2}\left(\|x_{new} - x_1\| + \|x_{new} - x_2\|\right) = \frac{1}{2}\left(\sqrt{2} + \sqrt{2}\right) = \sqrt{2}
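
Both measures are a few lines of NumPy; the helper names below are illustrative, and the example reproduces the hand calculation above:

import numpy as np

def p_novelty(x_new, population):
    """Distance to the single closest existing design."""
    dists = np.linalg.norm(population - x_new, axis=1)
    return dists.min()

def h_novelty(x_new, population, h):
    """Average distance to the h nearest existing designs."""
    dists = np.sort(np.linalg.norm(population - x_new, axis=1))
    return dists[:h].mean()

designs = np.array([[1.0, 2.0], [3.0, 4.0], [6.0, 7.0]])
x_new = np.array([2.0, 3.0])

print(p_novelty(x_new, designs))      # sqrt(2) ~ 1.414
print(h_novelty(x_new, designs, 2))   # sqrt(2) ~ 1.414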

2.18. Visualizing the Concept of Novelty


To aid the understanding of novelty metrics, we present visual illustrations of both p-novelty and h-
novelty.

Figure 2.5: p-Novelty: Distance from the new design to the closest existing design.

2.19. Conclusion
Novelty measures such as p-novelty and h-novelty are useful for evaluating the uniqueness of new de-
signs in various optimization and generative design tasks. By using these metrics, we can guide design
exploration and ensure that new designs contribute to the diversity and innovation of the solution space.
Chapter 3

Generative Algorithms and Variational Encoders

3.1. Introduction
Generative models are a class of machine learning algorithms that aim to model the underlying distribu-
tion of data. These models can generate new data points similar to the training data, making them widely
used in various applications, including image generation, anomaly detection, and semi-supervised learn-
ing. In this report, we explore several key generative algorithms: Gaussian Mixture Models (GMM), the
Expectation-Maximization (EM) algorithm, Autoencoders (AE), Variational Autoencoders (VAE), and
Generative Adversarial Networks (GANs). These models represent foundational techniques in modern
machine learning, each contributing unique capabilities to modeling and data generation.

Mixture Models
A mixture model is a probabilistic model that represents the presence of sub-populations within an
overall population. It assumes that the observed data is generated from a mixture of multiple probability
distributions, where each component represents a sub-population.
Mathematically, the probability density function of a mixture model is given by:

\[
p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x),
\]

where:
• $K$: Number of components (or clusters) in the mixture.
• $\pi_k$: Mixing coefficients, satisfying $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$.
• $p_k(x)$: Probability density function of the $k$-th component.

Key Features of Mixture Models


• Flexibility in modeling complex distributions by combining simpler components.
• Applicable to both supervised and unsupervised learning tasks.


Gaussian Mixture Models (GMMs)

Figure 3.1: Comparison of K-Means and Gaussian Mixture Model clustering results.

A Gaussian Mixture Model (GMM) is a specific type of mixture model where each component pk (x)
is a Gaussian (normal) distribution. It is widely used in clustering, density estimation, and classification.
The density function of a GMM is expressed as:

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\]

where:

• $\mathcal{N}(x \mid \mu_k, \Sigma_k)$: Multivariate Gaussian distribution of the $k$-th component, given by:

\[
\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) \right),
\]

where $d$ is the dimensionality of $x$, $\mu_k$ is the mean vector, and $\Sigma_k$ is the covariance matrix.

• $\pi_k$: Mixing coefficient of the $k$-th component, as defined earlier.

GMM Properties
• Clustering: Assigns data points to the component with the highest posterior probability.
• Soft Assignment: Unlike k-means clustering, GMM allows for probabilistic membership.
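
As an illustration of soft assignment, a GMM can be fitted in a few lines with scikit-learn's GaussianMixture (assuming scikit-learn is available; the synthetic two-cluster data below is purely illustrative). The predict_proba output gives the probabilistic memberships mentioned above, in contrast to k-means' hard labels:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian clusters in 2-D.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)              # mixing coefficients pi_k
print(gmm.means_)                # component means mu_k
print(gmm.predict_proba(X[:3]))  # soft (probabilistic) memberships
```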

Expectation-Maximization (EM) Algorithm


The Expectation-Maximization (EM) algorithm is used to estimate the parameters of a mixture model
(e.g., the means, covariances, and mixing coefficients in GMM). It is an iterative method that alternates
between two steps: the Expectation (E) step and the Maximization (M) step.

Figure 3.2: EM-Algorithm Flowchart

EM for GMMs
Given a dataset $\{x_1, x_2, \ldots, x_N\}$, the goal is to estimate:

\[
\Theta = \{\pi_k, \mu_k, \Sigma_k \mid k = 1, \ldots, K\}.
\]

Log-Likelihood Function:

\[
\log p(X \mid \Theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right).
\]

Since the log-likelihood is difficult to maximize directly, EM iteratively maximizes a lower bound.

Steps of the EM Algorithm


1. Initialization: Initialize the parameters $\Theta^{(0)} = \{\pi_k^{(0)}, \mu_k^{(0)}, \Sigma_k^{(0)}\}$.

2. E-Step: Compute the posterior probabilities (responsibilities) that a data point belongs to a particular component:

\[
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\]

where $\gamma_{ik}$ represents the responsibility of component $k$ for data point $x_i$.

3. M-Step: Update the parameters based on the responsibilities:

\[
\pi_k^{\mathrm{new}} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik},
\qquad
\mu_k^{\mathrm{new}} = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}},
\qquad
\Sigma_k^{\mathrm{new}} = \frac{\sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k^{\mathrm{new}})(x_i - \mu_k^{\mathrm{new}})^{T}}{\sum_{i=1}^{N} \gamma_{ik}}.
\]

4. Convergence: Repeat the E-Step and M-Step until convergence, typically when the log-likelihood changes by less than a specified tolerance.
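
To make these four steps concrete, the sketch below implements EM for a GMM in NumPy/SciPy. The function name em_gmm, the random-data-point initialization, and the small regularization term added to the covariances are our choices, not part of any standard library (production code would typically initialize with k-means):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """Minimal EM for a Gaussian mixture. X: (N, d) data, K: number of components."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # 1. Initialization: random data points as means, shared covariance, uniform weights.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E-step: responsibilities gamma_ik.
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])          # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: update pi, mu, Sigma from the responsibilities.
        Nk = gamma.sum(axis=0)                                # effective counts per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # 4. Convergence: log-likelihood at the parameters used in the E-step above.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma
```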

Advantages of EM
• Handles incomplete data effectively.

• Converges to a local optimum of the log-likelihood function.

Limitations of EM
• Sensitive to initialization.

• May converge to a local, rather than global, maximum.

Applications
• Clustering: GMMs provide soft clustering for applications like image segmentation and document
classification.

• Anomaly Detection: Detecting outliers in data based on low probability under the model.

• Speech Recognition: Modeling acoustic features for phoneme classification.

3.2. Autoencoders (AE) and Variational Autoencoders (VAE)


Autoencoders (AEs) are neural networks designed to learn efficient representations of input data. The
autoencoder consists of an encoder network that maps input data x to a lower-dimensional latent space z,
and a decoder network that reconstructs the input from the latent representation. The model is trained
to minimize the reconstruction error, typically using the mean squared error:

\[
\mathcal{L}_{\mathrm{AE}} = \| x - \hat{x} \|^2,
\]

where x̂ is the reconstruction of x produced by the decoder.
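
As a sketch, a small fully connected autoencoder with this encoder-decoder structure might look as follows in PyTorch (the layer widths and input dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: x -> z (lower-dimensional latent representation)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Decoder: z -> x_hat (reconstruction of the input)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
loss_fn = nn.MSELoss()        # reconstruction error ||x - x_hat||^2
x = torch.randn(16, 784)      # a dummy batch of inputs
loss = loss_fn(model(x), x)
```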

3.2.1. Variational Autoencoders (VAE)

Figure 3.3: Basic Variational Auto-Encoder



Variational Autoencoders (VAEs) extend the basic AE framework by introducing a probabilistic approach. Instead of directly mapping each input to a fixed latent representation, VAEs model the latent space as a distribution. The encoder outputs the parameters of a probability distribution (usually a Gaussian) over the latent variables, and the decoder samples from this distribution to generate the output.
The VAE objective is to maximize a lower bound on the marginal likelihood of the data, which consists of two terms:
1. Reconstruction loss: Measures how well the model reconstructs the input data.
2. KL divergence: Regularizes the distribution of the latent variables.
The loss function for VAEs is given by:

\[
\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\mathrm{KL}}(q(z|x) \,\|\, p(z)), \tag{3.1}
\]

where:
• $q(z|x)$ is the approximate posterior distribution (output of the encoder),
• $p(x|z)$ is the likelihood (reconstruction probability),
• $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence, which measures the difference between two distributions.
By optimizing this objective function, VAEs learn both a generative model for the data and a useful
representation of the latent space.
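
A minimal sketch of the two VAE-specific ingredients, in the same PyTorch style: the reparameterization trick and the closed-form KL term for a diagonal Gaussian posterior against a standard normal prior (the function names are ours). Note that minimizing recon + kl below is equivalent to maximizing the bound in (3.1):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients w.r.t. mu, logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term (here: mean squared error, summed over the batch).
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # negative ELBO: minimizing this maximizes the bound (3.1)
```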

3.2.2. Generative Adversarial Networks (GANs)


Generative Adversarial Networks (GANs) consist of two neural networks: a generator G and a dis-
criminator D, which are trained in opposition. The generator learns to produce realistic samples that
mimic real data, while the discriminator learns to distinguish between real and fake data. The training
process is adversarial, with each network trying to outsmart the other.
The objective function in GANs is a minimax game, defined as:

\[
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{3.2}
\]

where:
• $p_{\mathrm{data}}(x)$ is the distribution of real data,
• $p_z(z)$ is the distribution of the noise $z$ used to generate fake samples,
• $D(x)$ is the probability that $x$ is real (i.e., coming from the true data distribution),
• $G(z)$ generates a fake sample from noise $z$.
The generator G tries to minimize the log of the discriminator's output for fake samples, while the discriminator D tries to maximize its ability to distinguish real from fake data. Equilibrium is reached when the generator produces fake samples indistinguishable from the real data, so that the discriminator can no longer differentiate them.
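
One alternating training step under this objective can be sketched in PyTorch as follows. G, D, the optimizers, the real batch, and the noise z are assumed to be defined elsewhere, and D is assumed to output probabilities via a final sigmoid; for the generator we use the common non-saturating variant, which maximizes log D(G(z)) rather than directly minimizing log(1 - D(G(z))):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z):
    """One alternating GAN update (a sketch; models/optimizers assumed given)."""
    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    fake = G(z).detach()                    # detach: no generator gradients here
    d_real, d_fake = D(real), D(fake)
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Generator step (non-saturating): maximize log D(G(z)) ---
    pred = D(G(z))
    g_loss = F.binary_cross_entropy(pred, torch.ones_like(pred))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```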

3.3. PaDGANN Architecture

Figure 3.4: PADGANN-Architecture



PaDGANN extends the basic GAN framework by distributing the generator and discriminator across
multiple machines or computing units. The system consists of multiple worker nodes that run parallel
versions of G and D, with a centralized coordinator to synchronize the processes.

3.3.1. Parallelization of GAN


In PaDGANN, both the generator and discriminator are distributed across multiple workers:

• Parallel Generator: Multiple copies of G are used to generate data points concurrently. Each
worker i samples a noise vector zi and generates a corresponding data sample G(zi ).

• Parallel Discriminator: Similarly, multiple copies of D are used to evaluate whether the gen-
erated data samples are real or fake. Each worker i computes the probability D(G(zi )) for the
corresponding sample.

3.3.2. Distributed Training Dynamics


Let $G_i$ and $D_i$ represent the generator and discriminator models on worker $i$, respectively. The objective function in a parallelized setting becomes:

\[
\min_{G_i} \max_{D_i} \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D_i(x)] + \mathbb{E}_{z_i \sim p_z(z)}[\log(1 - D_i(G_i(z_i)))]. \tag{3.3}
\]

In this formulation, the workers compute their local loss and then synchronize their parameters with
the global model at regular intervals. This is typically done using methods like parameter averaging or
gradient synchronization.

3.4. Optimization Dynamics


GAN optimization is a min-max game between the generator and discriminator, where the generator G
tries to minimize the discriminator’s ability to distinguish between real and fake data, and the discrimi-
nator D tries to maximize its ability to classify real and generated samples.
The updates for the discriminator in PaDGANN are:

\[
\mathcal{L}_D = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D_i(x)] + \mathbb{E}_{z_i \sim p_z(z)}[\log(1 - D_i(G_i(z_i)))].
\]

The gradient update for the discriminator is:

\[
\nabla_{D_i} \mathcal{L}_D = \frac{\partial}{\partial D_i} \left( \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D_i(x)] + \mathbb{E}_{z_i \sim p_z(z)}[\log(1 - D_i(G_i(z_i)))] \right).
\]

For the generator, the update rule becomes:

\[
\mathcal{L}_G = \mathbb{E}_{z_i \sim p_z(z)}[\log(1 - D_i(G_i(z_i)))],
\]

and the gradient update for the generator is:

\[
\nabla_{G_i} \mathcal{L}_G = \frac{\partial}{\partial G_i} \left( \mathbb{E}_{z_i \sim p_z(z)}[\log(1 - D_i(G_i(z_i)))] \right).
\]

Both the generator and discriminator are trained iteratively using these gradient updates. The dis-
tributed nature of PaDGANN requires that these updates be synchronized across all workers, potentially
slowing down convergence.

3.5. Communication and Synchronization


The success of PaDGANN depends heavily on efficient communication and synchronization between
workers. Two common strategies are:

3.5.1. Parameter Averaging


In parameter averaging, after each worker computes its local updates, the parameters are averaged and broadcast to all workers:

\[
G_{\mathrm{global}} = \frac{1}{N} \sum_{i=1}^{N} G_i, \qquad D_{\mathrm{global}} = \frac{1}{N} \sum_{i=1}^{N} D_i.
\]

This strategy reduces the communication overhead but can result in slower convergence as the parameters
are averaged across workers.
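
A schematic of parameter averaging over worker models in PyTorch (a single-process sketch with a hypothetical helper; a real system would communicate via a library such as torch.distributed):

```python
import torch

def average_parameters(models):
    """Average corresponding parameter tensors across worker models, in place."""
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in models)):
            mean = torch.stack(params).mean(dim=0)
            for p in params:            # broadcast the average back to every worker
                p.copy_(mean)
```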

3.5.2. Gradient Synchronization


In gradient synchronization, workers share their gradients instead of full model parameters. The aggregated gradients are then used to update the global model:

\[
G_{\mathrm{global}} \leftarrow G_{\mathrm{global}} - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_{G_i} \mathcal{L}_G, \qquad
D_{\mathrm{global}} \leftarrow D_{\mathrm{global}} - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_{D_i} \mathcal{L}_D.
\]

This approach incurs lower communication costs, as only gradients are shared, but still introduces over-
head due to synchronization.
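
A corresponding single-process sketch of gradient synchronization, where the averaged worker gradients drive one SGD step on the global model (the function name and the data layout are our assumptions):

```python
import torch

def synchronized_sgd_step(global_model, worker_grads, lr):
    """Apply one SGD step using gradients averaged over workers.

    worker_grads: one list of gradient tensors per worker, ordered
    like global_model.parameters().
    """
    with torch.no_grad():
        for p, grads in zip(global_model.parameters(), zip(*worker_grads)):
            p -= lr * torch.stack(grads).mean(dim=0)  # averaged-gradient update
```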

3.6. Distributed GAN Stability and Challenges


The stability of PaDGANN depends on multiple factors, including:

3.6.1. Communication Latency


Delays in communication between workers can result in outdated updates, causing instability and slower
convergence.

3.6.2. Asynchronous Updates


Asynchronous updates may lead to synchronization issues, where workers use outdated information to
update their models, leading to suboptimal performance.

3.6.3. Load Balancing


Uneven workload distribution among workers can result in some workers finishing their updates faster
than others, causing inefficiencies in the training process.

3.7. Conclusion

Figure 3.5: Comparative Probability Distribution for CL/CD ratios of airfoil set for different GANs

PaDGANN introduces significant scalability and efficiency improvements over traditional GANs by utilizing a distributed architecture; in doing so, it also mitigates the mode collapse problem, in which the generator fools the discriminator by producing only one type of design. However, challenges such as slower convergence, communication overhead, and stability must still be addressed. Future research can explore more efficient synchronization strategies, load balancing, and optimization techniques to enhance the performance of distributed GAN systems.
