AI Bootcamp
The AI Engineer’s
Path to Success
All Hands-On: From Algorithms and Programming to Real Projects
INSTRUCTOR
VIVIAN ARANHA
WEEK 1: Python Programming Basics
01
Introduction to Python and
Development Setup
Overview of Python and its Role in AI
• Installing Python
• Setting Up a Coding Environment
• Jupyter Notebooks
• Visual Studio Code
Introduction to Basic Python Syntax,
Variables, and Data Types
• Basic Syntax
• Variables
• Data Types
• Integers | Floats | Strings | Lists | Tuples | Dictionaries |
Booleans
• Examples of Using Variables and Data Types
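For illustration (not part of the original slides), a minimal snippet showing each of these types; the variable names are arbitrary:
```python
# Minimal examples of Python variables and core data types
age = 30                      # int
price = 19.99                 # float
name = "Ada"                  # str
skills = ["Python", "NumPy"]  # list (mutable)
point = (3, 4)                # tuple (immutable)
person = {"name": "Ada", "age": 30}  # dict
is_active = True              # bool

print(type(age), type(price), type(name))
print(person["name"], skills[0], point[1])
```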
Hands-On Exercise
02
Control Flow in Python
Conditional Statements
• Syntax:
• if: Executes code if a condition is True
• elif: Adds additional conditions after the initial if
• else: Executes code if none of the previous conditions
are met
Loops
• for Loop
• Iterates over a sequence
• while Loop
• Executes as long as a condition is True
Using break and continue for Control Flow
• break
• Terminates the loop prematurely when a condition is met
• continue
• Skips the current iteration and proceeds to the next
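A short illustrative sketch of these constructs (names and values are arbitrary):
```python
# if / elif / else
score = 72
if score >= 90:
    grade = "A"
elif score >= 70:
    grade = "B"
else:
    grade = "C"
print(grade)

# for loop over a sequence, with break and continue
for n in [3, 7, -1, 12, 8]:
    if n < 0:
        continue   # skip negative values
    if n > 10:
        break      # stop once a value exceeds 10
    print(n)

# while loop runs as long as the condition is True
count = 3
while count > 0:
    print(count)
    count -= 1
```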
Hands-On Exercise
03
Functions and Modules
in Python
Defining Functions with def
• Scope
• Local Scope
• Global Scope
• Lifetime
• Examples
Importing and Using Modules
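A minimal sketch of defining a function, local vs. global scope, and importing a standard-library module (the `math` module is just an example):
```python
import math  # importing a standard-library module

greeting = "Hello"          # global scope

def area_of_circle(radius):
    """Return the area of a circle; pi_value exists only in local scope."""
    pi_value = math.pi      # local scope: created and destroyed with each call
    return pi_value * radius ** 2

def greet(name):
    return f"{greeting}, {name}!"   # reads the global variable

print(area_of_circle(2.0))
print(greet("Ada"))
```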
04
Data Structures
(Lists, Tuples, Dictionaries, Sets)
Lists
05
Working with Strings
String Manipulation
• Concatenation
• Slicing
• Formatting
Common String Methods
• split()
• join()
• replace()
• strip()
Regular Expressions for Pattern Matching
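A small illustrative example of these string methods plus a simple regular expression (the sample text and pattern are assumptions, not course material):
```python
import re

text = "  alice@example.com, bob@example.org  "

cleaned = text.strip()                 # remove leading/trailing whitespace
parts = cleaned.split(", ")            # split into a list
joined = " | ".join(parts)             # join back with a separator
replaced = cleaned.replace(".org", ".net")

# Regular expression: find all email-like patterns
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", cleaned)

print(parts)
print(joined)
print(replaced)
print(emails)
```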
06
File Handling
Reading and Writing Text Files
• Opening Files
• Use the built-in open() function to open a file
• r | w | a | r+
• Reading Files
• .read() | .readline() | .readlines()
• Writing to Files
• .write() | .writelines()
Using ‘with’ Statements for File Management
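A minimal sketch of reading and writing a text file inside `with` blocks (the file name `notes.txt` is arbitrary):
```python
# Writing and reading a text file with `with` (the file is closed automatically)
with open("notes.txt", "w") as f:          # "w" overwrites, "a" appends
    f.write("first line\n")
    f.writelines(["second line\n", "third line\n"])

with open("notes.txt", "r") as f:
    first = f.readline()        # read one line
    rest = f.readlines()        # read remaining lines into a list

print(first.strip())
print([line.strip() for line in rest])
```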
07
Pythonic Code
and Project Work
Writing Clean, “Pythonic” Code
• map()
• Applies a function to each item in an iterable
• filter()
• Filters items based on a condition
• reduce()
• Reduces an iterable to a single value
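Roughly, in code (illustrative only; `reduce` lives in `functools` in Python 3):
```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

squares = list(map(lambda x: x ** 2, nums))       # apply a function to each item
evens = list(filter(lambda x: x % 2 == 0, nums))  # keep items matching a condition
total = reduce(lambda a, b: a + b, nums)          # fold the list into one value

print(squares, evens, total)   # [1, 4, 9, 16, 25] [2, 4] 15
```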
Python’s os and sys Modules
• os Module
• Provides functions to interact with the operating system
• sys Module
• Provides access to system-specific parameters and
functions
Hands-On Project
01
Introduction to NumPy for
Numerical Computing
Understanding the Role of NumPy in Data
Science and AI
• What is NumPy?
• Why Use NumPy in AI?
• Performance
• Ease of Use
• Integration
Creating and Manipulating NumPy Arrays
• Import NumPy
• Creating Arrays
• From a list
• Using built-in functions
• Manipulating Arrays
• Change shape
• Add dimensions
Basic Operations on Arrays
• Element-wise Operations
• Mathematical Operations
Array Indexing, Slicing, and Reshaping
• Indexing
• Slicing
• Reshaping
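A compact illustrative example of creating, reshaping, indexing, and operating on arrays (values are arbitrary):
```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])     # array from a list
zeros = np.zeros((2, 3))             # built-in constructor
evens = np.arange(0, 10, 2)          # built-in constructor

m = a.reshape(2, 3)                  # change shape to 2x3
col = a[:, np.newaxis]               # add a dimension: (6,) -> (6, 1)

print(m + 10)                        # element-wise arithmetic
print(m * m)                         # element-wise multiplication
print(m[1, 2])                       # indexing: row 1, column 2
print(a[1:4])                        # slicing
print(zeros.shape, evens.shape, col.shape)
```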
Hands-On Exercises
02
Advanced NumPy Operations
Broadcasting in NumPy
• What is Broadcasting?
• Rules of Broadcasting
• Dimensions are aligned from the right
• A dimension is compatible if:
• It matches the other array’s dimension
• One of the dimensions is 1
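A minimal sketch of these broadcasting rules (shapes chosen only for illustration):
```python
import numpy as np

matrix = np.ones((3, 4))             # shape (3, 4)
row = np.array([0, 1, 2, 3])         # shape (4,): broadcast across the 3 rows
col = np.array([[10], [20], [30]])   # shape (3, 1): broadcast across the 4 columns

print(matrix + row)   # (3,4) + (4,): dimensions align from the right
print(matrix + col)   # (3,4) + (3,1): the size-1 dimension is stretched
```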
Aggregation Functions
03
Introduction to Pandas for
Data Manipulation
Introduction to Pandas Data Structures
• What is Pandas?
• Pandas Data Structures
• Series
• DataFrame
Loading Data from CSV, Excel, and Other Sources
• Viewing Data
• Selecting and Indexing
• Selecting columns
• Filtering rows
• Selecting by position
• Selecting by label
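An illustrative sketch using a small in-memory DataFrame; loading from a real file would use `pd.read_csv` / `pd.read_excel`, as hinted in the commented line:
```python
import pandas as pd

# df = pd.read_csv("data.csv")      # load from CSV (pd.read_excel for Excel files)
df = pd.DataFrame({
    "name": ["Ann", "Bo", "Cy"],
    "age": [25, 32, 41],
    "city": ["NY", "LA", "NY"],
})

print(df.head())                    # view the first rows
print(df["age"])                    # select a column
print(df[df["age"] > 30])           # filter rows by condition
print(df.iloc[0])                   # select by position
print(df.loc[df["city"] == "NY", ["name", "age"]])  # select by label
```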
Hands-On Exercises
04
Data Cleaning and
Preparation with Pandas
Handling Missing Values
• Why Handle Missing Values?
• Methods to Handle Missing Values
• Drop Missing Values
• Fill Missing Values
• Interpolation
Data Transformations
• Renaming Columns
• Changing Data Types
• Creating or Modifying Columns
Combining and Merging DataFrames
• Concatenation
• Merging
• Joining
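A minimal sketch combining these cleaning and merging steps on toy DataFrames (column names are assumptions):
```python
import pandas as pd
import numpy as np

sales = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, np.nan, 250.0]})
customers = pd.DataFrame({"id": [1, 2, 3], "Name ": ["Ann", "Bo", "Cy"]})

sales["amount"] = sales["amount"].fillna(sales["amount"].mean())  # fill missing values
customers = customers.rename(columns={"Name ": "name"})           # rename a column
sales["amount"] = sales["amount"].astype(int)                     # change data type

merged = pd.merge(sales, customers, on="id", how="inner")         # merge on a key
stacked = pd.concat([sales, sales], ignore_index=True)            # concatenation

print(merged)
print(stacked.shape)
```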
Hands-On Exercises
• Exercise 1: Clean a Dataset by Handling Missing Values and Renaming
Columns
• Exercise 2: Merge Two Datasets and Perform Data Transformations
• Additional Practice
• Drop columns with more than 50% missing values
• Merge three datasets and analyze relationships between them
• Convert categorical data to numerical using one-hot encoding
05
Data Aggregation and
Grouping in Pandas
Grouping Data by Categories
• Using groupby
• Using pivot_table
• Custom Aggregation
• Apply custom functions using .agg()
Calculating Summary Statistics for Grouped Data
• Common Statistics
• Mean
• Max
• Min
• Multi-Aggregation
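A small illustrative example of `groupby`, `pivot_table`, and custom aggregation with `.agg()`:
```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR", "HR"],
    "salary": [90, 110, 60, 65, 70],
})

print(df.groupby("dept")["salary"].mean())                        # one statistic per group
print(df.groupby("dept")["salary"].agg(["mean", "max", "min"]))   # multi-aggregation
print(df.pivot_table(values="salary", index="dept", aggfunc="mean"))
print(df.groupby("dept")["salary"].agg(lambda s: s.max() - s.min()))  # custom function
```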
Hands-On Exercises
06
Data Visualization with
Matplotlib and Seaborn
Introduction to Matplotlib for Plotting
• What is Matplotlib?
• Basic Syntax
Basic Plots
• Line Plot
• Bar Chart
• Histogram
• Scatter Plot
Customizing Plots
• What is Seaborn?
• Common Seaborn Plots
• Heatmap
• Pairplot
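An illustrative sketch of basic Matplotlib plots and a Seaborn heatmap (random data, purely for demonstration):
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

x = np.arange(10)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, x ** 2, label="y = x^2")        # line plot
axes[0].set_title("Line"); axes[0].legend()
axes[1].bar(["A", "B", "C"], [5, 3, 7])          # bar chart
axes[1].set_title("Bar")
axes[2].hist(np.random.randn(500), bins=20)      # histogram
axes[2].set_title("Histogram")

plt.figure()
sns.heatmap(np.random.rand(4, 4), annot=True)    # Seaborn heatmap
plt.show()
```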
Hands-On Exercises
07
Exploratory Data Analysis
(EDA) Project
Applying Data Manipulation and Visualization for EDA
• What is EDA?
• Steps in EDA
• Data Cleaning
• Data Transformation
• Aggregation and Filtering
Identifying Patterns, Trends, and Correlations
• Summary Statistics
• Hypothesis Generation
Hands-On Project: EDA on a Sample Dataset
01
Linear Algebra Fundamentals
Vectors and Matrices
• Vector
• Example: [2, 3, 4]
• Matrix
• Properties
Matrix Operations
02
Advanced Linear Algebra
Concepts
Determinants and Inverse of a Matrix
• Determinants
• Scalar value that provides information about a matrix’s properties
• Only for square matrices
• If det(A) = 0, the matrix A is singular
• If det(A) ≠ 0, A is invertible
• Geometric Interpretation
• For a 2×2 matrix, the determinant represents the scaling factor of the area
formed by its column vectors
• Geometric Interpretation
• Eigenvectors point in the direction where the matrix transformation stretches or compresses vectors
• Eigenvalues indicate the factor of stretching or compression
• Properties
• A matrix of size 𝑛 × 𝑛 has 𝑛 eigenvalues and eigenvectors
• Eigenvalues can be real or complex
• For a symmetric matrix, eigenvalues are always real.
• Computing Eigenvalues and Eigenvectors in NumPy
Introduction to Matrix Decomposition
• What is Matrix Decomposition?
• Process of breaking a matrix into simpler components to analyze or solve
problems
• Singular Value Decomposition (SVD)
• SVD decomposes a matrix 𝐴 into three matrices: 𝐴 = 𝑈 ⋅ Σ ⋅ 𝑉ᵀ
• U: Left singular vectors (orthogonal matrix)
• Σ: Diagonal matrix of singular values (non-negative)
• 𝑉ᵀ: Right singular vectors (orthogonal matrix)
• Applications of SVD
• Computing SVD in NumPy
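A minimal sketch of computing eigenvalues/eigenvectors and the SVD with NumPy (the matrix is arbitrary):
```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eig(A)
print("eigenvalues:", eigvals)
print("eigenvectors (columns):\n", eigvecs)

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)
print("singular values:", S)
print("reconstruction matches A:", np.allclose(U @ np.diag(S) @ Vt, A))
```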
Hands-On Exercises
03
Calculus for Machine Learning
(Derivatives)
Introduction to Derivatives and Their Role in Optimization
• Partial Derivatives
• Measures how a function changes with respect to one
variable while keeping other variables constant
Gradients
• Gradient
• Vector of all partial derivatives, indicating the direction
of the steepest ascent
Gradient Descent Optimization Algorithm
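As a toy illustration of gradient descent (the function f(x) = (x − 3)² is an assumption, not from the slides):
```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
def grad(x):
    return 2 * (x - 3)

x = 0.0                 # starting point
learning_rate = 0.1
for step in range(50):
    x = x - learning_rate * grad(x)   # step against the gradient (steepest descent)

print(round(x, 4))      # converges toward the minimum at x = 3
```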
04
Calculus for Machine Learning
(Integrals and Optimization)
Understanding Integrals and their Applications in ML
• Applications in ML
• Probability Distributions
• Cost Functions
Optimization Concepts
05
Probability Theory and
Distributions
Probability Basics
• Conditional Probability
• The probability of an event 𝐴 given that event 𝐵 has occurred
• Bayes’ Theorem
• P(A|B) = P(B|A) · P(A) / P(B)
• P(A): Prior probability
• P(B|A): Likelihood
• P(B): Evidence
Common Probability Distributions
• Bernoulli Distribution
• Describes outcomes of a binary experiment
Common Probability Distributions
• Binomial Distribution
• Models the number of successes in 𝑛 independent Bernoulli trials
• Poisson Distribution
• Models the number of events in a fixed interval of time or space
Applications in Machine Learning
• Gaussian Distribution
• Used in Gaussian Naive Bayes and kernel density estimation
• Bernoulli Distribution
• Models binary classification problems
• Binomial Distribution
• Used in logistic regression to model binary outcomes.
• Poisson Distribution
• Models count data
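An illustrative sketch of these distributions using `scipy.stats` (parameters are arbitrary):
```python
import numpy as np
from scipy import stats

# Gaussian: density at 0 for mean=0, std=1
print(stats.norm.pdf(0, loc=0, scale=1))

# Bernoulli: P(success) for p = 0.3
print(stats.bernoulli.pmf(1, p=0.3))

# Binomial: P(exactly 3 successes in 10 trials, p = 0.5)
print(stats.binom.pmf(3, n=10, p=0.5))

# Poisson: P(2 events) when the average rate is 4 per interval
print(stats.poisson.pmf(2, mu=4))

# Sampling from a Gaussian
samples = np.random.normal(loc=0, scale=1, size=1000)
print(samples.mean(), samples.std())
```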
Hands-On Exercises
06
Statistics Fundamentals
Measures of Central Tendency and Dispersion
• Central Tendency
• Mean: The average value of a dataset
• Median: The middle value when data is sorted
• Mode: The most frequently occurring value
• Dispersion
• Variance: The average squared deviation from the mean
• Standard Deviation: The square root of variance, indicating the spread of data
Hypothesis Testing
• Confidence Interval
• Range of values within which the true population parameter is
expected to lie
• x: Sample mean
• 𝑧: Z-score
• 𝑠: Standard deviation
• Statistical Significance
Hands-On Exercises
07
Math-Driven Mini Project
Linear Regression from Scratch
Applying Linear Algebra, Calculus, and Statistics
• Linear Algebra
• Mathematical Model: ŷ = Xθ
X: Feature matrix (with bias term) | 𝜃: Parameters (weights and bias) | ŷ: Predicted values
• Calculus
• Optimization of 𝜃 involves minimizing the loss function
• Gradient of 𝐽 (𝜃):
• Statistics
• Metrics like Mean Squared Error (MSE) and 𝑅2 are used to evaluate model
performance
Using Gradient Descent for Parameter Optimization
• R-squared (𝑅2)
• Measures how well the regression line explains the variance in the data
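A minimal from-scratch sketch tying these pieces together on synthetic data (the data-generating equation is an assumption):
```python
import numpy as np

# Synthetic data: y = 4 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * x + rng.normal(0, 0.5, size=(100, 1))

X = np.hstack([np.ones_like(x), x])          # feature matrix with bias column
theta = np.zeros((2, 1))                     # parameters [bias, weight]
lr, n_iters = 0.1, 1000

for _ in range(n_iters):
    y_hat = X @ theta                        # predictions
    grad = (2 / len(X)) * X.T @ (y_hat - y)  # gradient of the MSE loss
    theta -= lr * grad                       # gradient descent update

mse = np.mean((X @ theta - y) ** 2)
r2 = 1 - np.sum((y - X @ theta) ** 2) / np.sum((y - y.mean()) ** 2)
print(theta.ravel(), mse, r2)                # theta approaches [4, 3]
```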
Hands-On Project: Linear Regression from Scratch
01
Probability Theory and
Random Variables
Basic Probability Concepts
• Sample Space and Events
• Sample Space: The set of all possible outcomes of a random experiment
• Events: A subset of the sample space
• Conditional Probability
• The probability of an event 𝐴 occurring, given that 𝐵 has occurred
• Independence
• Two events 𝐴 and 𝐵 are independent if
• P(A∩B) = P(A) ⋅ P(B)
• Example with Python
Random Variables
• Expectation (𝐸 [𝑋])
• Weighted average of a random variable’s possible values
• Variance (𝑉𝑎𝑟[𝑋])
• Measures the spread of a random variable
02
Probability Distributions in
Machine Learning
Common Probability Distributions
• Gaussian (Normal) Distribution
• Bell-shaped curve characterized by mean (𝜇) and standard deviation (𝜎)
• Probability Density Function (PDF)
• Properties
• Symmetric about the mean
• Mean, median, and mode are the same
• Application in ML
• Common assumption in many algorithms (e.g., Naive Bayes)
• Used in feature scaling (e.g., standardization)
Common Probability Distributions
• Binomial Distribution
• Models the number of successes in 𝑛 independent Bernoulli trials
• Probability Mass Function (PMF)
• Properties
• Discrete distribution
• Parameters: 𝑛(number of trials), 𝑝(probability of success)
• Application in ML
• Logistic regression assumes a binomial distribution for binary classification
Common Probability Distributions
• Poisson Distribution
• Models the number of events in a fixed interval
• Probability Mass Function (PMF)
• Properties
• Discrete distribution
• Parameter: 𝜆 (average rate of occurrence)
• Application in ML
• Used in event modeling
Common Probability Distributions
• Uniform Distribution
• Equal probability for all outcomes in a range
• Probability Density Function (PDF)
• Properties
• Continuous distribution
• Parameters: 𝑎(lower bound), 𝑏(upper bound)
• Application in ML
• Random initialization of weights in neural networks
Application of Distributions in Machine Learning
• Gaussian Distribution
• Used in algorithms like Naive Bayes and Gaussian Mixture Models
• Assumed in statistical tests
• Binomial Distribution
• Foundational for logistic regression and other binary classification models
• Poisson Distribution
• Applied in modeling count data
• Uniform Distribution
• Commonly used in random sampling and initialization of parameters
Visualizing Distributions and Understanding Their
Properties
03
Statistical Inference -
Estimation and Confidence Intervals
Introduction to Statistical Inference
• For Means
• When the population standard deviation (𝜎) is unknown
• For Proportions
04
Hypothesis Testing and
P-Values
Introduction to Hypothesis Testing
• What is Hypothesis Testing?
• Statistical method to determine if there is enough evidence in a sample to infer a conclusion about the
population
• Key Components
• Null Hypothesis: Assumes no effect or no difference
• Alternative Hypothesis: Indicates an effect or difference
• P-Value
• The probability of observing results as extreme as the test statistic under Null
Hypothesis
• Smaller p-values indicate stronger evidence against Null Hypothesis
• Significance Level (𝛼)
• Threshold for deciding whether to reject
• Example: 𝛼 = 0.05 means a 5% risk of rejecting Null Hypothesis when it is true
• Decision Rule
• Reject Null Hypothesis: 𝑝 ≤ 𝛼
• Fail to Reject Null Hypothesis: 𝑝 > 𝛼
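A small illustrative example of this decision rule using a one-sample t-test from `scipy.stats` (the sample and hypothesized mean are assumptions):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=10, size=40)   # sample whose true mean is 52

# One-sample t-test: H0 says the population mean is 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value <= alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```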
Types of Errors
• Type I Error (𝛼)
• Incorrectly rejecting Null Hypothesis when it is true
• Example: Concluding a drug is effective when it is not
• Type II Error ( 𝛽 )
• Failing to reject Null Hypothesis when it is false
• Example: Concluding a drug is not effective when it is
• Example:
Hands-On Exercises
• Exercise 1: Perform a Hypothesis Test
• Exercise 2: Two-Sample T-Test
• Additional Practice
• Perform a z-test for large sample sizes
• Use the Iris dataset to test if the mean sepal length differs between two
species
• Perform hypothesis testing on proportions using the binomial distribution
05
Types of Hypothesis Tests
T-Tests
• Purpose: Test whether the means of one or more groups differ significantly
• Types
• One-Sample T-Test: Tests if the mean of a sample differs from a known value or population
mean
• Two-Sample T-Test (Independent T-Test): Compares the means of two independent groups
• Paired Sample T-Test: Compares means of two related groups (e.g., pre-test vs. post-test)
• Example Use Cases
• One-Sample: Testing if the average test score of a class differs from the national average
• Two-Sample: Comparing test scores between two classes
• Paired Sample: Comparing weight before and after a diet program
Chi-Square Test
• Purpose: Test for independence or goodness-of-fit in categorical data
• Chi-Square Test of Independence: Tests if two categorical variables are
independent
• Example Use Case: Testing if gender is independent of preference for a product
• Steps
• Create a contingency table
• Calculate expected frequencies
• Compute 𝜒2 statistic and p-value
• Python Implementation
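For the Python implementation step, a minimal sketch using `scipy.stats.chi2_contingency` (the contingency-table counts are made up for illustration):
```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = gender, columns = product preference
table = np.array([[30, 10],
                  [20, 25]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
print("expected frequencies:\n", expected)
```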
ANOVA (Analysis of Variance)
06
Correlation and Regression
Analysis
Understanding Correlation
• What is Correlation?
• Measures the strength and direction of the relationship
between two variables
• Values range from − 1 to 1, with 0 indicating no correlation
• Types of Correlation
• Pearson Correlation Coefficient (𝑟)
• Spearman Correlation Coefficient (𝜌)
Linear Regression Basics
• β1: Slope
• ϵ: Error term
Linear Regression Basics
• Key Metrics
• Slope (𝛽1)
• Intercept (𝛽0)
• R-Squared (𝑅2)
Interpreting Regression Results
• Slope (𝛽1)
• Indicates the magnitude and direction of the relationship
• Intercept (𝛽0)
• Starting point of the regression line
• R-Squared (𝑅2)
• Closer to 1 indicates better fit
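A compact illustrative example of correlation coefficients and simple linear regression with `scipy.stats` (synthetic data):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 2, 50)

pearson_r, _ = stats.pearsonr(x, y)        # linear relationship
spearman_rho, _ = stats.spearmanr(x, y)    # rank-based relationship

result = stats.linregress(x, y)            # simple linear regression
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}, "
      f"R^2 = {result.rvalue ** 2:.2f}")
```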
Hands-On Exercises
07
Statistical Analysis Project –
Analyzing Real-World Data
Applying Probability and Statistical Concepts
• Steps in EDA
1. Load and inspect the dataset
2. Check for missing or inconsistent data
3. Visualize distributions and relationships using histograms,
scatter plots, and correlation heatmaps
• Key Goals
• Understand the data structure
• Identify patterns, trends, and outliers
Conducting Hypothesis Testing
• Steps
1. Formulate null and alternative hypotheses
2. Choose and perform an appropriate hypothesis test
3. Interpret p-values and test results
• Example: Comparing Tip Amounts by Gender
Applying Linear Regression
• Steps
1. Select dependent and independent variables
2. Fit a regression model to the data
3. Interpret coefficients and the 𝑅2 value
• Example: Analyzing Relationship Between Total Bill and Tip
Hands-On Project
01
Machine Learning Basics and
Terminology
What is Machine Learning?
• Machine Learning
• Real-World Applications
• Healthcare
• Finance
• E-commerce
• Autonomous Vehicles
• Natural Language Processing
• Why is ML Important?
Types of Machine Learning
• Supervised Learning
• Model is trained on labeled data
• Model learns to map inputs (features) to outputs (target)
• Examples: Classification | Regression
• Key Features
• Requires labeled data
• Accuracy depends heavily on the quality of the training data
Types of Machine Learning
• Unsupervised Learning
• Model works on unlabeled data to find hidden patterns or structures
• Examples: Clustering | Dimensionality Reduction
• Key Features
• No labeled data is needed
• Focused on exploratory analysis and identifying patterns
Types of Machine Learning
• Reinforcement Learning
• An agent interacts with an environment and learns by trial and error
to maximize cumulative rewards
• Examples: Robotics | Gaming | Dynamic Systems
• Key Features
• Goal-oriented learning based on rewards and penalties
• Suitable for sequential decision-making problems
Key Concepts
• Features
• The input variables (independent variables) used to train the model
• Example: In predicting house prices, features could include the number of
bedrooms, size, and location
• Target
• The output variable (dependent variable) the model predicts
• Example: House price is the target variable
• Training and Testing Datasets
• The data is split into two subsets: Training Set | Testing Set
• A typical split is 80% training and 20% testing
Key Concepts
• Overfitting
• Model learns noise and details in the training data, performing poorly on new data
• Model becomes too complex for the dataset
• Underfitting
• Model is too simple to capture the underlying patterns in the data
• Example: Fitting a linear model to non-linear data
• Bias-Variance Tradeoff
• Bias: The error introduced by assuming a simplified model
• Variance: Error introduced by the model's sensitivity to small changes in the training data
• Goal: Balance bias and variance to achieve optimal performance.
Hands-On Exercise
02
Introduction to Supervised
Learning and Regression
Models
Overview of Supervised Learning
• Key Characteristics of Supervised Learning
• Labeled Data
• Supervised learning requires a dataset with labeled examples
• Example: Inputs | Outputs
• Objective
• Minimize the error between the predicted output and the actual output
• Types of Supervised Learning
• Regression: Predicts continuous outputs
• Classification: Predicts discrete outputs
Overview of Supervised Learning
• Applications of Supervised Learning
• Healthcare
• Predicting patient outcomes based on medical data
• Finance
• Fraud detection in transactions, credit risk assessment
• Retail
• Personalized product recommendations based on customer behavior
• Autonomous Vehicles
• Object detection and lane tracking using image classification
Introduction to Regression Analysis
• Linear Regression
• Assumes a linear relationship between the dependent variable (𝑦) and the independent
variable (𝑥)
• Equation of a Line: y = β0 + β1x + ϵ
• β0: Intercept of the line
• β1: Slope of the line
• ϵ: Error term representing the difference between the observed and predicted values
Linear regression aims to minimize the error between the predicted and actual values of the target
variable. This is achieved using a cost function
• Cost Function
• Measures how far the predictions are from the actual values
• Most common cost function is the Mean Squared Error (MSE)
• Convergence
• Algorithm stops when the updates become very small or a predefined number of iterations is
reached
• Visualizing Optimization
• The optimization process can be visualized as finding the lowest point on a cost surface
Hands-On Exercise
03
Advanced Regression Models
Polynomial Regression and
Regularization
Polynomial Regression for Modeling Non-Linear Relationships
• What is Regularization?
• Technique used to prevent overfitting by adding a penalty term to the cost function of a regression model
• Types of Regularization
• Ridge Regression (L2 Regularization)
• Adds the sum of the squared coefficients to the cost function
04
Introduction to Classification
and Logistic Regression
Classification Problems and Common Use Cases
• What is Classification?
• Types of Classification
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Common Use Cases
• Healthcare | Finance | Retail | Natural Language Processing |
Autonomous Systems
Logistic Regression for Binary Classification
• Where:
• 𝑃 ( 𝑦 = 1 ∣ 𝑥 ): Probability of the positive class
• 𝜎 ( 𝑧 ): Sigmoid function
Logistic Regression for Binary Classification
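A minimal sketch of logistic regression with scikit-learn (the hours-studied data is a toy assumption):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification: pass/fail based on hours studied
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Sigmoid output: probability of each class for 4.5 hours of study
print(model.predict_proba([[4.5]]))   # [[P(y=0|x), P(y=1|x)]]
print(model.predict([[4.5]]))
```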
05
Model Evaluation and Cross-
Validation
Model Evaluation Metrics for Regression and Classification
• Stratified K-Fold
• Ensures each fold has a proportional representation of classes in classification problems
• Advantages
• Reduces the risk of overfitting by testing on multiple subsets of data
• Provides a more generalized evaluation of model performance.
Understanding the Confusion Matrix
The confusion matrix is a table that summarizes the performance of a classification model by comparing predicted and
actual values
• Structure of a Confusion Matrix
06
k-Nearest Neighbors (k-NN)
Algorithm
Introduction to k-Nearest Neighbors (k-NN)
Algorithm and Its Applications
• Step-by-Step Process
• Feature Scaling
• Calculate Distances
• Identify 𝑘 Nearest Neighbors
• Make Predictions
• Classification
• Regression
Choosing the Optimal Value of 𝑘
• Choosing 𝑘
• Small 𝑘
• High sensitivity to noise
• Captures local variations in data
• Large 𝑘
• Smoother decision boundaries but can miss finer details
• Common Practices
• Use cross-validation to determine the optimal value of 𝑘
• A common starting point is 𝑘 = √𝑛, where 𝑛 is the number of training samples
Understanding the Model’s Limitations
• Computationally Expensive
• Predictions require distance computation for all training samples
• Feature Scaling Dependence
• Requires proper scaling to avoid feature dominance
• Not Robust to Imbalanced Data
• Classes with more samples can dominate predictions
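A small illustrative k-NN example with feature scaling, using the Iris dataset as an assumed stand-in:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling matters for distance-based models like k-NN
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```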
Hands-On Exercise
07
Supervised Learning
Mini Project
Building an End-to-End Supervised Learning Project
• Key Steps:
• Define the Problem
• Identify the objective: Regression or Classification
• Data Preparation
• Exploratory Data Analysis (EDA)
• Understand the structure of the dataset
• Visualize data distributions and relationships
• Preprocessing
• Handle missing values
• Scale features for algorithms like k-NN
• Encode categorical variables
Building an End-to-End Supervised Learning Project
• Key Steps:
• Model Selection
• Choose appropriate models based on the problem
• Model Evaluation
• Use performance metrics like Mean Squared Error (MSE) for regression or
Accuracy, Precision, Recall, and F1 Score for classification
• Comparison
• Compare multiple models to identify the best-performing one
Applying Regression and Classification Models on a Real-World Dataset
• Dataset Options
• Regression Example
• Predict house prices using features like square footage, number of rooms,
and location
• Dataset: California Housing Dataset or any housing dataset
• Classification Example
• Classify customer churn based on customer demographics and behavior
• Dataset: Telco Customer Churn Dataset
Evaluating and Comparing Model Performance
• Steps
1. Evaluate models using cross-validation
2. Generate performance metrics
3. Summarize findings to identify strengths and weaknesses of each model
Hands-On Project
01
Introduction to Feature
Engineering
Importance of Feature Engineering in Machine
Learning
• What is Feature Engineering?
• Process of transforming raw data into meaningful inputs for machine learning
models
• Why is Feature Engineering Important?
• Improves Model Accuracy
• Reduces Model Complexity
• Enables Model Interpretability
• Handles Data Challenges
• Key Applications
• Finance | Healthcare | E-Commerce
Types of Features: Categorical, Numerical, Ordinal
• Categorical Features
• Represent discrete categories or labels
• Encoding Techniques
• One-Hot Encoding
• Label Encoding
• Numerical Features
• Represent continuous or discrete numbers
• Preprocessing Techniques
• Scaling
• Ordinal Features
• Represent categorical data with a meaningful order
• Encoding Techniques
• Ordinal Encoding
Overview of Feature Engineering Techniques
• Scaling
• Ensures all features contribute equally to the model
• Techniques: Min-Max Scaling | Standardization
• Encoding
• Converts categorical data into numerical format
• Techniques: One-Hot Encoding | Label Encoding
• Transformation
• Applies mathematical functions to modify features
• Examples: Log Transformation | Polynomial Features
• Feature Selection
• Reduces the number of input features to improve model performance
• Techniques: Statistical Methods | Recursive Feature Elimination (RFE)
Hands-On Exercise
02
Data Scaling and
Normalization
Importance of Scaling and Normalization in
Machine Learning
• What is Scaling and Normalization?
• Preprocessing techniques used to transform numerical features
to a common range or distribution
• Why is Scaling and Normalization Important?
• Improves Algorithm Performance
• Ensures Fair Comparisons
• Stabilizes Training
Methods: Min-Max Scaling, Standardization (Z-Score Scaling)
• Min-Max Scaling
• Transforms features to a specified range, typically [0, 1]
• Ensures all feature values are within the same range
• Use Cases: k-NN or neural networks
• Limitations: Sensitive to outliers, as extreme values can distort the scale
• Standardization (Z-Score Scaling)
• Centers the data around zero and scales it to have a standard deviation of 1
• Ensures a standard normal distribution for each feature
• Use Cases: SVM, logistic regression, and PCA
• Advantages: Handles outliers better than Min-Max scaling
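A minimal sketch contrasting the two scalers with scikit-learn (values are arbitrary):
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 1000.0]])

print(MinMaxScaler().fit_transform(X))    # rescales each feature to [0, 1]
print(StandardScaler().fit_transform(X))  # zero mean, unit standard deviation
```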
When to Use Scaling and Normalization for Different
Algorithms
03
Encoding Categorical
Variables
One-Hot Encoding, Label Encoding
• What Are Categorical Variables?
• Binary Categorical Features: Gender (Male/Female)
• Multi-Class Categorical Features: Country (USA, Canada, UK).
• One-Hot Encoding
• Creates binary columns for each category in a categorical feature
• Each row is marked with a 1 for its respective category and 0 elsewhere
• Example: Feature: Color = ['Red', 'Blue', 'Green']
• Applications
• Categorical features with a small number of unique categories
• Tree-based models, logistic regression, and neural networks
One-Hot Encoding, Label Encoding
• Label Encoding
• Label Encoding assigns a unique integer to each category
• Example: Red = 0, Blue = 1, Green = 2.
• Applications:
• Ordinal features where the order matters
• Can introduce unintended ordinal relationships for nominal features
• Limitations
• Can mislead algorithms into interpreting categories as ordered, especially
when the variable is nominal
Dealing with High-Cardinality Categorical Features
• Label Encoding: Ordinal features or when used with algorithms like tree-based models
• Frequency Encoding: High-cardinality features in both regression and classification tasks
• Target Encoding: High-cardinality features in supervised learning tasks
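An illustrative sketch of one-hot, label, and frequency encoding (column names and categories are assumptions):
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"],
                   "city": ["NY", "LA", "NY", "SF"],
                   "price": [10, 12, 9, 15]})

one_hot = pd.get_dummies(df, columns=["color"])               # one-hot encoding
df["city_label"] = LabelEncoder().fit_transform(df["city"])   # label encoding

# Frequency encoding for a (potentially) high-cardinality feature
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

print(one_hot)
print(df[["city", "city_label", "city_freq"]])
```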
Hands-On Exercise
04
Feature Selection Techniques
Introduction to Feature Selection
• What is Feature Selection?
• Process of identifying and retaining the most relevant features (input variables) in a dataset
while discarding irrelevant or redundant ones
• Why is Feature Selection Important?
• Improves Model Performance
• Reduces Overfitting
• Enhances Interpretability
• Increases Computational Efficiency
• When to Use Feature Selection?
• High-Dimensional Data
• Correlated Features
• Reducing Complexity
Techniques for Feature Selection
• Filter Methods
• Evaluate the relevance of features by analyzing their statistical properties in relation to the target variable
• Examples: Correlation | Mutual Information
• When to Use: Quick evaluation of features before training a model
• Wrapper Methods
• Iteratively selects features by training and evaluating a model
• Examples: Forward Selection | Backward Elimination
• When to Use: Useful when feature interactions are important but computationally expensive
• Embedded Methods
• Perform feature selection as part of the model training process
• Examples: Lasso Regression | Tree-Based Models
• When to Use: Effective when training tree-based models or regularized regression
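A minimal sketch of a filter method (SelectKBest) and a wrapper method (RFE) in scikit-learn; the dataset is an assumed example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 5 features most related to the target
filtered = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter-selected feature indices:", filtered.get_support(indices=True))

# Wrapper method: recursive feature elimination with a simple model
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print("RFE-selected feature indices:", rfe.get_support(indices=True))
```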
Hands-On Exercise
05
Creating and Transforming
Features
Feature Creation
• What is Feature Creation?
• Feature creation involves deriving new, meaningful features from existing ones to
enhance a model's ability to capture important patterns in the data
• Examples of Feature Creation
• Date-Time Features
• Interaction Features
• Aggregations
• Importance
• Adds domain knowledge to the dataset
• Captures hidden patterns and trends not evident in the original features
Transforming Features
• What is Feature Transformation?
• Feature transformation modifies existing features to better suit the learning algorithm
• Common Transformations
• Logarithmic Transformation
• Reduces skewness in highly skewed distributions
• Square Root Transformation
• Moderately reduces skewness, often used for count data
• Polynomial Transformation
• Adds higher-order terms ( 𝑥2, 𝑥3 ) to capture non-linear relationships
• Importance
• Enhances the model's ability to fit non-linear relationships
• Makes distributions more normal-like, aiding algorithms that assume normality
Importance of Feature Transformations in Non-Linear
Relationships
• Create new features from a date column (e.g., day of the week, month,
year)
• Apply polynomial transformations to a dataset and compare model
performance before and after transformation
06
Model Evaluation Techniques
Evaluation Metrics for Regression
Regression Metrics
• Mean Absolute Error (MAE)
• Measures the average magnitude of errors without considering their direction
• Use Case: Suitable when all errors have equal importance
• Regression
• Use MAE for interpretability and uniform importance of errors
• Use MSE/RMSE when larger errors need greater penalization
• Use 𝑅2 to explain variance but not as a sole performance metric
• Classification
• Use accuracy for balanced datasets
• Use precision and recall for imbalanced datasets, depending on the problem's
focus (e.g., minimizing false positives or false negatives)
• Use F1 score for a balanced evaluation of precision and recall
• Use ROC-AUC for overall model performance evaluation in binary classification
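For reference, a compact illustrative example computing these metrics with scikit-learn on made-up predictions:
```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Regression metrics
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.0, 6.5]
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R2:", r2_score(y_true, y_pred))

# Classification metrics
y_true_c = [0, 1, 1, 0, 1, 0]
y_pred_c = [0, 1, 0, 0, 1, 1]
y_score = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6]   # predicted probabilities for ROC-AUC
print("accuracy:", accuracy_score(y_true_c, y_pred_c))
print("precision:", precision_score(y_true_c, y_pred_c))
print("recall:", recall_score(y_true_c, y_pred_c))
print("F1:", f1_score(y_true_c, y_pred_c))
print("ROC-AUC:", roc_auc_score(y_true_c, y_score))
```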
Hands-On Exercise
07
Cross-Validation and
Hyperparameter Tuning
Introduction to Cross-Validation
• What is Cross-Validation?
• Technique used to assess how well a machine learning model generalizes to an independent dataset
• Types of Cross-Validation
• K-Fold Cross-Validation
• Splits the dataset into 𝐾 folds of approximately equal size
• The model is trained on 𝐾 − 1 folds and validated on the remaining fold
• This process is repeated 𝐾 times, and the average performance is computed
• Stratified K-Fold
• Ensures that each fold maintains the same class distribution as the original dataset
• Useful for imbalanced datasets
• Leave-One-Out Cross-Validation (LOOCV)
• Uses a single data point for validation and the rest for training
• Repeats this process for all data points
• Computationally expensive but provides the most robust evaluation
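A minimal sketch of K-Fold vs. Stratified K-Fold with scikit-learn (the model and dataset are assumed examples):
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
strat_scores = cross_val_score(
    model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("K-Fold mean accuracy:", kfold_scores.mean())
print("Stratified K-Fold mean accuracy:", strat_scores.mean())
```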
Hyperparameter Tuning
• What is Hyperparameter Tuning?
• Hyperparameters are settings that are not learned by the model but are set before training; tuning them is crucial for optimizing model performance
• Techniques for Hyperparameter Tuning
• Grid Search
• Exhaustively searches over a predefined hyperparameter space
• Example: Testing all combinations of values for max_depth and learning_rate
• Random Search
• Randomly samples combinations of hyperparameters from the predefined space
• More efficient than Grid Search when the parameter space is large
• Importance of Hyperparameter Tuning
• Prevents overfitting and underfitting by selecting the best configuration
• Enhances model performance by optimizing critical settings.
Importance of Tuning Hyperparameters for Model
Performance
• Without tuning, the model might not reach its optimal performance, leading to:
• Underfitting: Model fails to capture the underlying patterns.
• Overfitting: Model memorizes the training data and performs poorly on unseen data
Hands-On Project:
Feature Engineering and Model Evaluation
• Objective
• Perform end-to-end feature engineering, model evaluation, and
hyperparameter tuning on a dataset
• Tasks
• Task 1: Perform Feature Engineering
• Task 2: Train and Evaluate Models
• Task 3: Apply Grid Search for Hyperparameter Tuning
WEEK 7: Advanced Machine Learning Algorithms
01
Introduction to Ensemble
Learning
Concept of Ensemble Learning
• What is Ensemble Learning?
• Machine learning technique that combines the predictions of multiple models to
produce a final output
• AdaBoost
• Adjusts model weights based on performance
• Focuses on misclassified instances
• XGBoost
• Optimized version of gradient boosting, known for speed and accuracy
• Voting Classifier
• Combines predictions of multiple models using majority voting or averaging
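An illustrative sketch of a voting ensemble in scikit-learn (the three base models and the dataset are assumptions):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=5000)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
], voting="hard")                       # majority voting over the three models

print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```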
Hands-On Exercise
• Build a basic ensemble model combining predictions from Linear Regression, Decision
Tree, and k-NN to observe the impact on accuracy
02
Bagging and Random Forests
Understanding Bagging (Bootstrap Aggregating)
• What is Bagging?
• Ensemble learning technique that trains multiple models on different subsets of the data,
created by random sampling with replacement
• Regression: Average the predictions of individual models
• Classification: Use majority voting to determine the final class
• Why Use Bagging?
• Reduces Variance
• Improves Robustness
• Applications
• Bagging is commonly used with decision trees, which are prone to high variance
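A minimal bagging sketch with scikit-learn, using decision trees as the base learner (the dataset is an assumed example):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees, each trained on a bootstrap sample, combined by majority vote
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=50,
                                 random_state=0)
print("bagged trees accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```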
Introduction to Random Forests
03
Boosting and Gradient
Boosting
Concept of Boosting
• What is Boosting?
• Ensemble technique that sequentially combines weak learners to form a strong learner
• Each subsequent model focuses on correcting the errors made by previous models
• How Does Boosting Differ from Bagging?
Gradient Boosting
04
Introduction to XGBoost
Overview of XGBoost
• What is XGBoost?
• Advanced implementation of the Gradient Boosting algorithm designed for speed and
performance
• It introduces various enhancements that make it faster, more efficient, and capable of
handling complex datasets
• Improvements Over Traditional Gradient Boosting
• Speed
• Handling Missing Data
• Regularization
• Custom Loss Functions
• Tree Pruning
Key Features of XGBoost
• Handling Missing Data
• Automatically assigns missing values to the branch that minimizes the loss function
• Reduces preprocessing steps for datasets with missing values
• Regularization
• Includes penalties for overly complex models, reducing overfitting
• Hyperparameters
• lambda: L2 regularization term
• alpha: L1 regularization term
• Parallel Processing
• Splits calculations for tree construction across multiple cores, significantly improving
training time
Hyperparameters in XGBoost and How to Tune Them
• Learning Rate (learning_rate)
• Controls the contribution of each tree to the model
• Typical range: 0.01–0.3
• Number of Trees (n_estimators)
• Determines the number of boosting rounds
• Larger values may improve performance but increase computation time
• Tree Depth (max_depth)
• Limits the depth of trees, balancing bias and variance
• Subsample
• Fraction of data used to train each tree
• Helps reduce overfitting; typical range: 0.5–1.0
• Colsample_bytree
• Fraction of features used for each tree split
• Typical range: 0.5–1.0
• Regularization Parameters: lambda and alpha
• Control L2 and L1 regularization, respectively
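A minimal illustrative sketch, assuming the `xgboost` package is installed, showing where these hyperparameters are set:
```python
# Assumes xgboost is installed (pip install xgboost)
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    n_estimators=200,      # number of boosting rounds
    learning_rate=0.1,     # typical range 0.01-0.3
    max_depth=4,           # tree depth
    subsample=0.8,         # fraction of rows per tree
    colsample_bytree=0.8,  # fraction of features per tree
    reg_lambda=1.0,        # L2 regularization
    reg_alpha=0.0,         # L1 regularization
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```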
Hands-On Exercise
05
LightGBM and CatBoost
Introduction to LightGBM
• What is LightGBM?
• Implementation of Gradient Boosting designed to handle large datasets and high-dimensional data with speed and
accuracy
• Key Features of LightGBM:
• Histogram-Based Splitting
• Leaf-Wise Tree Growth
• Support for GPU Training
• Handling Sparse Data
• Advantages
• Faster training than XGBoost
• Handles large datasets effectively
• Reduces memory usage with histogram-based splitting
• When to Use LightGBM
• Large datasets with numerical features
• Time-sensitive tasks requiring fast training
Overview of CatBoost
• What is CatBoost?
• Gradient Boosting library developed specifically to handle categorical features without the need for preprocessing like
one-hot encoding
• Key Features of CatBoost:
• Native Support for Categorical Data
• Ordered Boosting
• Robust to Overfitting
• Advantages
• Eliminates the need for manual encoding of categorical data
• Reduces overfitting with robust boosting techniques
• Easy to implement for datasets with many categorical features
• When to Use CatBoost
• Datasets with a high proportion of categorical features
• Applications where overfitting is a concern
XGBoost, LightGBM, and CatBoost
Hands-On Exercise
06
Handling Imbalanced Data
Problems Caused by Imbalanced Data in Classification Tasks
• Resampling Techniques
• Oversampling
• Increase the number of minority class samples by duplicating or
synthesizing new samples
• Example: SMOTE (Synthetic Minority Over-sampling Technique),
which generates synthetic examples
• Undersampling
• Reduce the number of majority class samples to balance the dataset
• Risk: Loss of valuable information from majority class.
Techniques to Handle Imbalanced Data
• Algorithmic Solutions
• Class Weights
• Assign higher weights to the minority class during model training
• Many algorithms (e.g., Logistic Regression, Random Forest) have
built-in support for class weights
• Anomaly Detection Models
• Treat the minority class as anomalies, focusing the model on
detecting them.
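An illustrative sketch of class weights (and, in a comment, SMOTE via the optional `imbalanced-learn` package) on a synthetic imbalanced dataset:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Imbalanced toy dataset: roughly 10% positive class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: penalize mistakes on the minority class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Oversampling with SMOTE requires the imbalanced-learn package:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
```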
Techniques to Handle Imbalanced Data
07
Ensemble Learning Project –
Comparing Models on a Real Dataset
Building and Evaluating Multiple Ensemble Models
• Bagging
• Builds models independently using random subsets of data
• Robust against overfitting with strong base learners
• Boosting
• Sequentially builds models, focusing on hard-to-predict samples
• Requires careful tuning to prevent overfitting.
Model Performance on Balanced vs. Imbalanced Data
01
Introduction to
Hyperparameter Tuning
Parameters and Hyperparameters
• What are Parameters?
• Values learned by a machine learning model during training
• Adjusted to minimize the loss function and optimize predictions
• Examples
• Coefficients in Linear Regression
• Weights and biases in Neural Networks
• What are Hyperparameters?
• Settings defined before training that influence how the model learns from data
• Not learned from the data but instead control the training process
• Examples
• Tree Depth
• Learning Rate
• Number of Estimators
Importance of Tuning Hyperparameters
• Why Tune Hyperparameters?
• Improve Model Performance
• Optimal hyperparameters help models generalize better, reducing
overfitting and underfitting
• Enhance Efficiency
• Proper tuning can reduce training time and computational resources
• Objective
• Train a model with default hyperparameters, evaluate its
performance, and manually adjust a few
hyperparameters to observe their impact on results
02
Grid Search and Random
Search
Introduction to Grid Search and Random Search
• Objective
• Implement both Grid Search and Random Search for
hyperparameter tuning, compare their efficiency, and
analyze the impact on model performance
03
Advanced Hyperparameter
Tuning with Bayesian
Optimization
Introduction to Bayesian Optimization
• What is Bayesian Optimization?
• Advanced method for hyperparameter tuning that balances exploration (searching new regions) and
exploitation (refining promising regions)
• Uses a probabilistic model to guide the search for optimal hyperparameters
• How It Works
• Surrogate Model
• Builds a probabilistic model (e.g., Gaussian Process) of the objective function based on prior evaluations
• Acquisition Function
• Balances exploration and exploitation by choosing the next hyperparameters to evaluate based on predicted
performance and uncertainty
• Iterative Refinement
• Updates the surrogate model after each evaluation, refining the search
• Why Use Bayesian Optimization?
• Efficient for high-dimensional and expensive-to-evaluate functions
• Reduces the number of evaluations required to find near-optimal hyperparameters
Using Libraries for Bayesian Optimization
• Popular Libraries
• Hyperopt
• Simplifies Bayesian Optimization for hyperparameter tuning
• Works with fmin to minimize objective functions over a parameter
space
• Optuna
• Flexible and user-friendly library for hyperparameter optimization
• Supports dynamic search spaces and pruning of unpromising trials
Understanding Exploration vs. Exploitation
• Exploration
• Focuses on sampling hyperparameters from unexplored regions
• Useful for identifying new areas of high potential
• Exploitation
• Focuses on refining the search around regions with known high performance
• Useful for fine-tuning near-optimal hyperparameters
• Bayesian Optimization’s Advantage
• Balances these approaches using the acquisition function to minimize
unnecessary evaluations while improving results.
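A minimal sketch, assuming the `optuna` and `xgboost` packages are installed, of an Optuna study tuning a few hyperparameters:
```python
# Assumes optuna and xgboost are installed
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")   # surrogate model guides the search
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```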
Hands-On Exercise
• Objective
• Apply Bayesian Optimization using Optuna to tune an
XGBoost model and compare the results with Grid Search
and Random Search
04
Regularization Techniques for
Model Optimization
Understanding Overfitting and Underfitting
• Overfitting
• Occurs when a model learns the noise in the training data along with the patterns,
leading to poor generalization on unseen data
• Symptoms
• High training accuracy but low test accuracy
• Large differences between training and validation losses
• Underfitting
• Occurs when a model is too simple to capture the underlying patterns in the data
• Symptoms
• Low accuracy on both training and test sets
• High bias in predictions
Regularization Techniques
• Regularization introduces a penalty term to the loss function during model training to
prevent overfitting by discouraging overly complex models
• L1 Regularization (Lasso)
• Adds the absolute values of coefficients to the loss function
• Encourages sparsity by setting some coefficients to zero, effectively selecting
features.
• L2 Regularization (Ridge)
• Adds the squared values of coefficients to the loss function
• Shrinks coefficients toward zero but does not set them to zero.
• Elastic Net
• Combines L1 and L2 regularization
• Useful when there are correlated predictors and when feature selection is desired
Practical Applications of Regularization
• Prevent Overfitting
• Penalizes large coefficients, reducing model complexity
• Handle Multicollinearity
• Ridge regularization is effective when predictors are highly correlated
• Feature Selection
• Lasso automatically performs feature selection by setting some
coefficients to zero
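A small illustrative comparison of Lasso, Ridge, and Elastic Net in scikit-learn (the diabetes dataset and alpha values are assumptions):
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

for name, model in [("OLS", LinearRegression()),
                    ("Lasso (L1)", Lasso(alpha=1.0)),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Elastic Net", ElasticNet(alpha=1.0, l1_ratio=0.5))]:
    model.fit(X, y)
    zeros = sum(abs(c) < 1e-6 for c in model.coef_)   # coefficients driven to zero
    print(f"{name}: {zeros} coefficients set to ~0")
```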
Hands-On Exercise
• Objective
• Apply Lasso and Ridge regularization on a linear regression
model, compare performance, and analyze the effects on
coefficients
05
Cross-Validation and Model
Evaluation Techniques
Importance of Cross-Validation in Model Evaluation
• What is Cross-Validation?
• Statistical method used to evaluate the performance of a model by partitioning
the data into training and validation subsets multiple times
• It helps ensure that the model's performance generalizes well to unseen data
• Why Use Cross-Validation?
• Prevents Overfitting
• By evaluating the model on multiple subsets, cross-validation provides a more robust measure
of its performance
• Reliable Performance Estimate
• Reduces the variance of performance metrics compared to a single train-test split
• Optimizes Model Selection
• Helps in comparing and selecting the best model or hyperparameter configuration.
Types of Cross-Validation
• K-Fold Cross-Validation
• Splits the dataset into 𝐾 equal-sized folds
• Trains the model on 𝐾 − 1 folds and validates on the remaining fold
• Repeats the process 𝐾 times, ensuring each fold is used as a validation set once
• Best For: General-purpose datasets.
• Stratified K-Fold Cross-Validation
• Ensures that each fold maintains the same class distribution as the original dataset
• Particularly useful for imbalanced datasets
• Best For: Classification tasks with imbalanced data
• Leave-One-Out Cross-Validation (LOOCV)
• Uses a single data point as the validation set and the rest as the training set
• Repeats the process for each data point
• Pros: Maximizes training data for each fold
• Cons: Computationally expensive for large datasets
• Best For: Small datasets where maximizing training data is critical
Practical Guidance on Cross-Validation
• Choose K Based on Dataset Size
• 𝐾 = 5 or 𝐾 = 10 are commonly used for large datasets
• Use LOOCV for small datasets
• Objective
• Evaluate a classification model using K-Fold and Stratified K-Fold Cross-
Validation
• Compare the results to demonstrate the importance of stratification for
imbalanced datasets
06
Automated Hyperparameter
Tuning with GridSearchCV
and RandomizedSearchCV
Using GridSearchCV and RandomizedSearchCV in
scikit-learn
• What is GridSearchCV?
• Exhaustive search over a specified parameter grid
• Trains and evaluates a model for every combination of hyperparameters
in the grid using cross-validation
• What is RandomizedSearchCV?
• Selects a fixed number of random combinations from a parameter
distribution
• Faster than GridSearchCV for large hyperparameter spaces while still
providing good results
Using GridSearchCV and RandomizedSearchCV in
scikit-learn
• Key Features
• Automates Hyperparameter Tuning
• Combines model training, evaluation, and hyperparameter search into a
single step
• Cross-Validation Integration
• Ensures robust performance metrics by using cross-validation
• Result Interpretation
• Provides the best hyperparameter combination and associated metrics
Integrating Cross-Validation with Hyperparameter Tuning
• Cross-Validation
• Ensures that the hyperparameters selected generalize well to
unseen data
• Benefits
• Reduces overfitting to the training dataset
• Provides robust estimates of model performance
Interpreting Results and Selecting the Best Model
• Best Parameters
• Access the optimal hyperparameter combination using
.best_params_
• Best Estimator
• Retrieve the model trained with the best hyperparameters using
.best_estimator_
• Performance Metrics
• Use .best_score_ to evaluate the performance of the best
hyperparameters.
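A minimal sketch of both searches in scikit-learn (the model and parameter grid are assumed examples):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(random_state=0)

param_grid = {"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1, 0.2]}

grid = GridSearchCV(model, param_grid, cv=5).fit(X, y)         # exhaustive search
rand = RandomizedSearchCV(model, param_grid, n_iter=4, cv=5,
                          random_state=0).fit(X, y)            # random sampling

print("GridSearchCV:", grid.best_params_, grid.best_score_)
print("RandomizedSearchCV:", rand.best_params_, rand.best_score_)
best_model = grid.best_estimator_   # model refit with the best hyperparameters
```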
Hands-On Exercise
• Objective
• Use GridSearchCV and RandomizedSearchCV to tune
hyperparameters of Gradient Boosting and Support Vector
Machine models, and compare results
07
Optimization Project –
Building and Tuning a Final
Model
Applying All Learned Tuning and Optimization
Techniques
• Comprehensive Model Optimization
• Data Preprocessing:
• Ensure data is clean, scaled, and encoded appropriately
• Feature Engineering
• Derive new features and select the most important ones
• Regularization
• Avoid overfitting by penalizing complex models
• Cross-Validation
• Use techniques like K-Fold or Stratified K-Fold for robust performance metrics
• Hyperparameter Tuning
• Use methods like GridSearchCV, RandomizedSearchCV, or Bayesian Optimization
Evaluating and Interpreting Model Performance
• Performance Metrics
• Classification
• Accuracy, Precision, Recall, F1-Score, ROC-AUC
• Regression
• Mean Squared Error (MSE), Mean Absolute Error (MAE), 𝑅2
• Importance of Interpretability
• Use feature importance and coefficient analysis for
transparency
Hands-On Project
• Objective
• Build, tune, and optimize a machine learning model using
a structured process and evaluate its performance
comprehensively
WEEK 9: Neural Networks and Deep Learning Fundamentals
01
Introduction to
Deep Learning and Neural
Networks
What is Deep Learning?
• What is Deep Learning?
• Subset of Machine Learning that uses artificial neural networks (ANNs) with multiple layers
(deep architectures) to model and learn complex patterns in data
• Key Feature
• Automatically extracts relevant features from raw data, eliminating the need for manual
feature engineering
• Machine Learning vs. Deep Learning
Overview of Artificial Neural Networks (ANNs)
• Structure of a Neural Network
• Input Layer
• Accepts input data features
• Hidden Layers
• Perform computations to extract patterns
• Output Layer
• Produces predictions or classifications
• Key Components
• Neurons
• Basic units of computation that take inputs, apply weights and biases, and produce outputs using an activation
function
• Weights and Biases
• Weights determine the importance of each input
• Bias shifts the output of the activation function
• Activation Functions
• Add non-linearity to the model (e.g., ReLU, Sigmoid, Tanh).
Overview of Artificial Neural Networks (ANNs)
02
Forward Propagation and
Activation Functions
Understanding Forward Propagation
• What is Forward Propagation?
• Process by which input data flows through the layers of a neural network to
produce an output
• Input Layer
• Accepts input features and passes them to the next layer
• Hidden Layers
• Compute weighted sums of inputs, apply biases, and pass the result through activation
functions
• Output Layer
• Produces predictions, typically using an activation function suitable for the task.
Understanding Forward Propagation
• Steps in Forward Propagation
• Compute Weighted Sum
• 𝑧=𝑊⋅𝑋+𝑏
𝑊: Weights | 𝑋: Inputs | 𝑏: Bias
• Apply Activation Function
• 𝑎=𝜎(𝑧)
σ: Activation function
• Repeat for Each Layer
• Outputs of one layer become inputs to the next.
Common Activation Functions
• Sigmoid
• Use Case: Binary classification in the output layer
• Limitation: Can suffer from vanishing gradients for large positive/negative 𝑧
• Tanh (Hyperbolic Tangent)
• Use Case: Hidden layers where zero-centered outputs are preferred
• Limitation: Also prone to vanishing gradients
• ReLU (Rectified Linear Unit)
• Use Case: Most commonly used in hidden layers due to simplicity and efficiency
• Limitation: Can suffer from the "dying ReLU" problem (neurons stuck at zero)
• Softmax
• Use Case: Multi-class classification in the output layer
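An illustrative NumPy sketch of forward propagation with these activation functions (layer sizes and weights are arbitrary):
```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z):    return np.maximum(0, z)
def softmax(z): return np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Forward propagation for one hidden layer: z = W·x + b, a = activation(z)
x = np.array([0.5, -1.2, 3.0])                     # input features
W1, b1 = np.random.randn(4, 3) * 0.1, np.zeros(4)  # hidden-layer weights and bias
W2, b2 = np.random.randn(2, 4) * 0.1, np.zeros(2)  # output-layer weights and bias

a1 = relu(W1 @ x + b1)          # hidden layer with ReLU
output = softmax(W2 @ a1 + b2)  # output layer: probabilities over 2 classes

print(a1, output, output.sum())
```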
Choosing Activation Functions
Hands-On Exercise
• Objective
• Implement forward propagation for a simple neural network in
Python and experiment with different activation functions.
03
Loss Functions and
Backpropagation
Understanding Loss Functions
• What Are Loss Functions?
• Quantify the difference between the predicted output of a model and the
actual target value
• Guide the training process by providing a metric to minimize during
optimization
• Role in Neural Networks
• Error Measurement
• Evaluate the accuracy of predictions
• Feedback for Optimization
• Provide gradients for weight updates via backpropagation
Understanding Loss Functions
• Key Concepts
• Gradient: The rate of change of the loss with respect to a parameter
• Gradient Descent: An optimization algorithm that minimizes the loss by updating parameters in the
direction of the negative gradient
Hands-On Exercise
• Objective
• Implement basic loss functions, calculate gradients manually, and
visualize the effects of different loss functions
04
Gradient Descent and
Optimization Techniques
Gradient Descent and Its Variants
• What is Gradient Descent?
• Optimization algorithm used to minimize the loss function by iteratively adjusting the model's parameters in
the direction of the negative gradient
• Variants of Gradient Descent
• Batch Gradient Descent
• Uses the entire dataset to compute gradients at each step
• Pros: Accurate gradients
• Cons: Computationally expensive for large datasets.
• Stochastic Gradient Descent (SGD)
• Updates parameters using one data point at a time
• Pros: Faster updates
• Cons: High variance in updates; can lead to oscillations.
• Mini-batch Gradient Descent
• Updates parameters using a small subset (batch) of the dataset
• Pros: Combines the efficiency of SGD with the stability of Batch Gradient Descent
Advanced Optimization Techniques
• Adagrad
• Adapts learning rates for each parameter by scaling inversely with the sum of gradients
squared
• Pros: Suitable for sparse data
• Cons: Learning rate decreases too aggressively over time
• RMSprop
• Modifies Adagrad by using an exponentially weighted moving average of squared gradients
• Pros: Addresses Adagrad’s aggressive learning rate decay; works well for non-convex
problems
• Adam (Adaptive Moment Estimation)
• Combines momentum and RMSprop to adapt learning rates for each parameter
• Pros: Works well in practice for most problems; computationally efficient
Importance of Learning Rate and Choosing
the Right Optimizer
• Learning Rate
• Determines the step size for parameter updates
• Too High: May overshoot the minimum or cause divergence
• Too Low: Leads to slow convergence
• Choosing the Right Optimizer
• SGD: Works well for simple, convex problems
• Adam: Generally performs well across tasks
• RMSprop: Often preferred for RNNs and sequence-based tasks
Hands-On Exercise
• Objective
• Implement gradient descent to update model weights and
experiment with different optimizers using TensorFlow and/or
PyTorch
05
Building Neural Networks
with TensorFlow and Keras
Introduction to TensorFlow and Keras
• What is TensorFlow?
• Open-source library for numerical computation and machine learning
• Provides tools for building and training deep learning models
• What is Keras?
• High-level API integrated with TensorFlow that simplifies the process of
creating and training neural networks
• Key Features of Keras
• User-Friendly: Intuitive syntax for rapid prototyping
• Modular: Building blocks for defining layers, optimizers, and loss functions
• Integration: Compatible with TensorFlow for scalable deep learning tasks
Defining Layers, Models, and Compiling Networks in Keras
• Defining Layers
• Layers are the building blocks of neural networks. Common types
include:
• Dense (Fully Connected) Layers
• Each neuron is connected to every neuron in the previous layer
• Dropout Layers
• Randomly drops connections to prevent overfitting
• Activation Layers
• Apply activation functions to introduce non-linearity
Defining Layers, Models, and Compiling Networks in Keras
• Building a Model
• Keras supports two primary ways to define models
• Sequential API: A linear stack of layers
• Functional API: More flexible, allows for complex architectures
• Compiling a Model
• Specifies
• Optimizer: Algorithm to update weights
• Loss Function: Metric to minimize during training
• Metrics: Additional performance measures
Training, Evaluating, and Saving a Model
• Training
• Fit the model to data using model.fit()
• Evaluation
• Test the model on unseen data using model.evaluate()
• Saving and Loading
• Save a trained model using model.save() and reload it with
keras.models.load_model()
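A minimal Keras sketch of this define–compile–fit–evaluate–save workflow, assuming TensorFlow 2.x is installed (MNIST is downloaded on first use):
```python
# Assumes TensorFlow 2.x
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.2),                 # randomly drops units to reduce overfitting
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print(model.evaluate(x_test, y_test))
model.save("mnist_model.keras")    # reload later with keras.models.load_model()
```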
Hands-On Exercise
• Objective
• Build, train, evaluate, and save a simple neural network to classify
digits from the MNIST dataset
06
Building Neural Networks
with PyTorch
Introduction to PyTorch and Its Core Components
• What is PyTorch?
• Open-source deep learning framework that provides flexibility and dynamic
computation for building and training machine learning models
• Core Components of PyTorch
• Tensors: Multi-dimensional arrays similar to NumPy arrays but with GPU
support for accelerated computation
• Autograd: Automatic differentiation engine that computes gradients for
optimization
• torch.nn Module: Provides tools to define and train neural networks with
layers, activation functions, and loss functions.
Building a Neural Network in PyTorch
• Steps
• Define the Model
• Use torch.nn.Module to create a neural network with layers and
forward propagation
• Define the Loss Function
• Use built-in loss functions like Cross-Entropy Loss
• Define the Optimizer
• Use optimizers like SGD or Adam for weight updates
Training, Evaluating, and Saving a Model in PyTorch
• Training
• Forward pass to compute predictions
• Compute loss and gradients using backpropagation
• Update weights using an optimizer
• Evaluation
• Test the model on unseen data and calculate metrics like accuracy
• Saving and Loading
• Save the model's parameters using torch.save() and load them using
torch.load()
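A minimal PyTorch sketch of these steps on a dummy batch, assuming PyTorch is installed (real training would loop over a DataLoader):
```python
# Assumes PyTorch is installed
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

model = SimpleNet()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch of 64 "images"
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

optimizer.zero_grad()
loss = loss_fn(model(images), labels)   # forward pass + loss
loss.backward()                         # backpropagation (autograd)
optimizer.step()                        # weight update

torch.save(model.state_dict(), "mnist_net.pt")   # save parameters
```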
Hands-On Exercise
• Objective
• Build, train, evaluate, and save a neural network for MNIST digit
classification using PyTorch.
07
Neural Network Project –
Image Classification on
CIFAR-10
Applying Learned Concepts to a More
Complex Dataset
• Why CIFAR-10?
• The CIFAR-10 dataset is a challenging benchmark dataset for
image classification. It contains 60,000 32x32 color images
across 10 classes (e.g., airplane, car, bird, dog)
• Unlike MNIST, CIFAR-10 involves more complex patterns and
requires robust neural network architectures
Building and Optimizing a Neural Network
for Image Classification
• Key Steps:
• Preprocess the dataset for training (e.g., normalization, one-hot
encoding)
• Define a neural network with convolutional layers for feature
extraction
• Optimize the network using techniques like learning rate adjustment
and dropout.
Analyzing Model Performance and
Experimenting with Hyperparameters
• Performance Analysis
• Evaluate accuracy and loss curves during training
• Use test set metrics to measure generalization
• Experimentation
• Try different activation functions (e.g., ReLU, Tanh)
• Test optimizers like SGD, Adam, and RMSprop
• Adjust the learning rate and regularization techniques (e.g., dropout,
weight decay).
Hands-On Project
• Objective
• Build, train, and optimize a neural network for CIFAR-10
image classification, experimenting with
hyperparameters to improve performance
WEEK 10: Convolutional Neural Networks (CNNs)
01
Introduction to Convolutional
Neural Networks
Overview of CNNs and Their Role in Image Processing
• Translation Invariance
• CNNs can detect patterns irrespective of their position in the image
• Reduced Parameters
• Shared weights and local connectivity make CNNs computationally
efficient
• Automatic Feature Extraction
• CNNs learn to identify meaningful patterns like edges, shapes, and
textures directly from data
Hands-On Exercise
• Objective
• Visualize images in a dataset, explore their pixel data, and set up
an environment for building CNNs using TensorFlow or PyTorch
Day
02
Convolutional Layers and
Filters
Convolution Operations, Filters, and Feature Maps
• Edge Detection
• Kernels like Sobel or Prewitt highlight edges in images
• Feature Extraction
• Initial layers focus on edges; deeper layers capture abstract patterns.
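A minimal sketch of a convolution producing an edge-highlighting feature map (SciPy performs the 2-D convolution; the random image is only a stand-in):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)                 # stand-in grayscale image
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])               # Sobel kernel: responds to horizontal intensity changes

feature_map = convolve2d(image, sobel_x, mode="same")  # one feature map
print(feature_map.shape)                        # (28, 28) -- same spatial size with mode="same"
```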
Hands-On Exercise
• Objective
• Understand convolution operations by implementing and
visualizing their effects using TensorFlow and PyTorch
Day
03
Pooling Layers and
Dimensionality Reduction
Introduction to Pooling Layers
• What Are Pooling Layers?
• Used to reduce the dimensions of feature maps while retaining the most important
information
• Help make the network computationally efficient and robust to variations in the input
• Types of Pooling
• Max Pooling
• Selects the maximum value from each region of the input feature map
• Captures the strongest activations (features)
• Average Pooling
• Computes the average value for each region of the input feature map
• Provides a more generalized summary of features.
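A minimal sketch contrasting the two pooling types on a tiny feature map (PyTorch; a 1x1x4x4 tensor stands in for one channel):

```python
import torch
import torch.nn as nn

fmap = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)   # keeps the strongest activation in each 2x2 region
avg_pool = nn.AvgPool2d(kernel_size=2)   # averages each 2x2 region

print(max_pool(fmap).shape)   # torch.Size([1, 1, 2, 2]) -- spatial dimensions halved
print(avg_pool(fmap))         # smoother, more generalized summary of the same regions
```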
Role of Pooling in Reducing Dimensionality
• Dimensionality Reduction
• Pooling reduces the spatial dimensions (height and width) of feature maps,
resulting in fewer parameters and faster computations
• Robustness
• Makes the model invariant to small translations or distortions in the input image
Combining Convolution and Pooling Layers
• Pooling layers typically follow convolutional layers to downsample the feature
maps
• This combination helps extract hierarchical features
• Early layers focus on simple features (e.g., edges)
• Deeper layers capture complex patterns (e.g., objects)
Hands-On Exercise
• Objective
• Implement max pooling and average pooling layers on feature
maps and observe their effects on size and representation.
Day
04
Building CNN Architectures
with Keras and TensorFlow
Building a CNN Architecture in Keras
• Steps to Build a CNN
• Convolutional Layers: Extract features from the input images
• Pooling Layers: Downsample feature maps to reduce dimensions and retain
key features
• Dense (Fully Connected) Layers: Combine features for final predictions
• Basic CNN Architecture
• Input Layer → Convolutional Layer → Activation → Pooling → Fully
Connected Layer → Output Layer
• Repeat convolution and pooling layers for deeper networks
Compiling, Training, and Evaluating a CNN
• Steps
• Compile the Model
• Define loss, optimizer, and metrics
• Example loss functions
• Categorical Cross-Entropy: Multi-class classification
• Example optimizers
• Adam: Efficient optimization for large networks
• Example metrics: Accuracy
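A minimal Keras sketch of the basic architecture and compilation steps above (32x32 RGB inputs such as CIFAR-10 and 10 classes are assumed):

```python
from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),  # feature extraction
    layers.MaxPooling2D((2, 2)),                                            # downsample feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),            # combine features
    layers.Dense(10, activation="softmax"),         # final predictions
])

cnn.compile(optimizer="adam",
            loss="categorical_crossentropy",        # multi-class classification (one-hot labels)
            metrics=["accuracy"])
```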
Popular CNN Architectures
• LeNet
• One of the earliest CNNs for handwritten digit classification (e.g., MNIST)
• AlexNet
• Revolutionized deep learning for image classification in 2012
• Introduced ReLU activation and dropout for regularization
• VGG
• Uses deep networks with small filters (e.g., 3×3)
• Known for its simplicity and effectiveness
Hands-On Exercise
• Objective
• Build, train, and evaluate a CNN for image classification on the
MNIST or CIFAR-10 dataset using Keras and TensorFlow
Day
05
Building CNN Architectures
with PyTorch
Building CNN Architectures in PyTorch Using the nn Module
• Key Steps
• Define a Model
• Use torch.nn.Module to build CNN layers like convolutional, pooling,
and fully connected layers
• Forward Pass
• Define how input flows through the layers to produce output
• Model Summary
• Inspect the structure and learnable parameters
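A minimal sketch of these steps with torch.nn.Module (3-channel 32x32 inputs such as CIFAR-10 are assumed):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):                          # how input flows through the layers
        x = self.pool(torch.relu(self.conv1(x)))   # 32x32 -> 16x16
        x = self.pool(torch.relu(self.conv2(x)))   # 16x16 -> 8x8
        x = x.flatten(1)
        return self.fc(x)

model = SmallCNN()
print(model)   # quick summary of the layer structure and learnable parameters
```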
Training and Evaluating CNNs in PyTorch
• Training
• Perform forward and backward passes, calculate loss, and update
weights using an optimizer
• Evaluation
• Test the model on unseen data and compute metrics like
accuracy and loss.
Experimenting with CNN Model Design and Tuning
Hyperparameters
• Experimentation Areas
• Layer Depth
• Add or remove convolutional and pooling layers to observe the impact
• Filter Size
• Experiment with kernel sizes (e.g., 3 × 3, 5 × 5)
• Learning Rate
• Adjust the learning rate to improve convergence speed and accuracy
Hands-On Exercise
• Objective
• Build, train, evaluate, and experiment with CNNs for CIFAR-10
classification using PyTorch
Day
06
Regularization and Data
Augmentation for CNNs
Overfitting in CNNs and Methods to Prevent It
• What Is Overfitting?
• Occurs when a model performs well on the training data but fails to generalize to unseen data
• In CNNs, overfitting is common due to the large number of parameters in deep networks
• Methods to Prevent Overfitting
• Dropout
• Randomly sets a fraction of neurons to zero during training
• Prevents co-adaptation of neurons
• Controlled by a dropout rate (e.g., 0.5)
• Batch Normalization
• Normalizes the input of each layer to stabilize training
• Reduces internal covariate shift and allows higher learning rates
• Data Augmentation
• Increases dataset size artificially by applying transformations to images
• Examples: rotation, flipping, scaling, cropping, brightness adjustment.
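A minimal Keras sketch of where dropout and batch normalization typically sit inside a convolutional block (the rates and sizes are illustrative hyperparameters):

```python
from tensorflow import keras
from tensorflow.keras import layers

block = keras.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", input_shape=(32, 32, 3)),
    layers.BatchNormalization(),     # normalizes layer inputs to stabilize training
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.5),             # randomly zeroes 50% of activations, during training only
])
```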
Introduction to Data Augmentation Techniques
• Common Techniques
• Rotation
• Rotates the image by a specified angle range (e.g., -30° to 30°)
• Flipping
• Horizontally or vertically flips the image
• Scaling
• Resizes the image by zooming in or out
• Cropping
• Extracts random portions of the image
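A minimal sketch of these augmentation techniques using torchvision transforms, applied on the fly to each training image (the specific ranges are illustrative):

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(30),                        # rotate within -30 to 30 degrees
    transforms.RandomHorizontalFlip(),                    # random left-right flip
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),   # random crop + rescale (zoom/scaling)
    transforms.ColorJitter(brightness=0.2),               # brightness adjustment
    transforms.ToTensor(),
])
```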
Implementing Regularization and Data Augmentation in CNN Training
• Objective
• Apply dropout, batch normalization, and data
augmentation to improve CNN performance
Day
07
CNN Project
Image Classification on
Fashion MNIST or CIFAR-10
Applying CNN Architecture to a Larger Dataset
• Evaluation Metrics
• Accuracy: Overall classification correctness
• Loss: Measures the difference between predictions and ground
truth
• Confusion Matrix: Highlights misclassified classes for deeper
insights
Hands-On Exercise
• Objective
• Build, train, and optimize a CNN for Fashion MNIST or
CIFAR-10 image classification, experimenting with
regularization and data augmentation to achieve the best
performance
WEEK 11: Recurrent Neural Networks (RNNs) and Sequence Modeling
01
Introduction to Sequence
Modeling and RNNs
Overview of Sequence Modeling
• What is Sequence Modeling?
• Involves predicting or generating outputs based on sequential data
• Capture temporal or contextual dependencies
• Why Is Sequence Modeling Important?
• Natural Language Processing (NLP)
• Tasks like language modeling, machine translation, and sentiment analysis
depend on understanding sequential relationships in text
• Time-Series Analysis
• Sequence models are essential for tasks like stock price prediction, weather
forecasting, and sensor data analysis
Introduction to Recurrent Neural Networks (RNNs)
• Text Generation
• Generate new text based on learned patterns
• Language Translation
• Convert text from one language to another
• Stock Price Prediction
• Predict future stock prices using historical data
• Speech Recognition
• Understand spoken words and phrases
• Video Frame Prediction
• Anticipate future frames in video sequences
Hands-On Exercise
• Objective
• Preprocess a text dataset for use in RNNs and set up an
environment in TensorFlow or PyTorch for building RNNs
Day
02
Understanding RNN Architecture
and Backpropagation Through
Time (BPTT)
Detailed Architecture of RNNs
• Components of an RNN
• Input Layer
• Takes sequential data as input at each time step
• Hidden Layer
• Maintains a "memory" of past inputs through recurrent connections. The hidden state at time 𝑡(ℎ𝑡) is calculated
as
• ℎ𝑡 = 𝑓 ( 𝑊ℎ ⋅ ℎ𝑡−1 + 𝑊𝑥 ⋅ 𝑥𝑡 + 𝑏ℎ)
• 𝑊ℎ : Weight matrix for recurrent connections
• 𝑊𝑥 : Weight matrix for input connections
• 𝑏ℎ : Bias term
• 𝑓 : Non-linear activation function (e.g., tanh, ReLU)
• Output Layer
• Produces output 𝑦𝑡 based on the hidden state ℎ𝑡
• 𝑦𝑡 = 𝑔 ( 𝑊𝑦 ⋅ ℎ𝑡 + 𝑏𝑦 )
• 𝑔: Activation function (e.g., softmax for classification)
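A minimal NumPy sketch of one recurrence step, computing ℎ𝑡 and 𝑦𝑡 exactly as in the equations above (the dimensions and random weights are illustrative):

```python
import numpy as np

input_dim, hidden_dim, output_dim = 4, 3, 2
Wx = np.random.randn(hidden_dim, input_dim)    # input-to-hidden weights
Wh = np.random.randn(hidden_dim, hidden_dim)   # recurrent weights
Wy = np.random.randn(output_dim, hidden_dim)   # hidden-to-output weights
bh, by = np.zeros(hidden_dim), np.zeros(output_dim)

x_t = np.random.randn(input_dim)   # input at time t
h_prev = np.zeros(hidden_dim)      # previous hidden state h_{t-1}

h_t = np.tanh(Wh @ h_prev + Wx @ x_t + bh)   # h_t = f(Wh·h_{t-1} + Wx·x_t + b_h)
y_t = Wy @ h_t + by                          # y_t = g(Wy·h_t + b_y), with g omitted here
```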
Backpropagation Through Time (BPTT)
• What is BPTT?
• Extension of standard backpropagation to handle sequential data in RNNs
• It calculates gradients for each time step and propagates them backward through the sequence
• Steps of BPTT
• Unroll the RNN across the sequence for a fixed number of time steps
• Compute the loss for each time step
• Backpropagate the errors across all time steps to update weights
• Challenges in BPTT
• Vanishing Gradient Problem
• Gradients diminish exponentially as they are propagated back through time
• Leads to difficulty in learning long-term dependencies
• Exploding Gradient Problem
• Gradients grow exponentially, causing numerical instability during training
• Solutions
• Use gradient clipping to handle exploding gradients
• Use architectures like Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) to mitigate the vanishing gradient
problem
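A minimal sketch of gradient clipping inside one training step (PyTorch; the model, criterion, optimizer, and batch are assumed to exist already):

```python
import torch

def training_step(model, criterion, optimizer, inputs, targets, max_norm=1.0):
    """One optimization step with gradient clipping to tame exploding gradients."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()                                               # BPTT computes the gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # rescale gradients if their norm is too large
    optimizer.step()
    return loss.item()
```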
Limitations of Vanilla RNNs
• Short-Term Memory
• Struggle to learn dependencies in long sequences due to vanishing
gradients
• Sequential Computation
• Cannot parallelize training across time steps, making them
computationally expensive
• Sensitive Initialization
• Performance depends heavily on proper weight initialization and
learning rates
Hands-On Exercise
• Objective
• Build a simple RNN model for text classification using TensorFlow or
PyTorch
• Train the RNN and observe how it captures sequence patterns
Day
03
Long Short-Term Memory
(LSTM) Networks
Introduction to LSTMs and How They Address RNN Limitations
• Forget Gate
• Decides what information to discard from the cell state
• 𝑓𝑡 = σ(𝑊𝑓 ⋅ [ℎ𝑡−1, 𝑥𝑡] + 𝑏𝑓)
• 𝑊𝑓: Weight matrix for the forget gate
• 𝑓𝑡: Forget gate output
• 𝑥𝑡: Input
• ℎ𝑡−1: Previous hidden state
• Input Gate
• Decides what new information to add to the cell state
• it = σ (Wi ⋅ [ht−1, xt] + bi)
• σ: Sigmoid activation function
• 𝑊𝑖 : Weight matrix for the input gate
• ℎ𝑡 − 1 : Hidden state from the previous time step
• 𝑥𝑡 : Current input
• 𝑏 𝑖: Bias for the input gate
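In practice the gate equations above are implemented inside the framework's LSTM layer; a minimal PyTorch sketch (the sizes are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
x = torch.randn(8, 20, 32)            # batch of 8 sequences, 20 time steps, 32 features each
output, (h_n, c_n) = lstm(x)          # per-step hidden states, plus final hidden and cell states
print(output.shape, h_n.shape)        # (8, 20, 64) and (1, 8, 64)
```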
LSTM Cell Structure: Input, Forget, and Output Gates
• Objective
• Build an LSTM model for sentiment analysis on the IMDB Movie
Reviews Dataset and compare its performance with a basic RNN
model
Day
04
Gated Recurrent Units (GRUs)
Introduction to Gated Recurrent Units (GRUs)
• Reset Gate
• Controls how much of the past information to forget when combining with new
input.
• Objective
• Build a GRU-based model for the IMDB Movie Reviews Dataset and
compare its performance with the LSTM model
Day
05
Text Preprocessing and Word
Embeddings for RNNs
Importance of Text Preprocessing
• What is Text Preprocessing?
• Involves cleaning and preparing raw text data to make it suitable for machine learning models
• Critical step for achieving high performance in Natural Language Processing (NLP) tasks
• Key Steps
• Tokenization: Splits text into individual units (e.g., words, sentences)
• Example: "I love NLP" → ["I", "love", "NLP"]
• Stemming: Reduces words to their root form by removing suffixes
• Example: "running", "runner" → "run"
• Lemmatization: Converts words to their base form using a vocabulary
• Example: "better" → "good”
• Why is Preprocessing Important?
• Reduces noise in the data
• Standardizes input for models
• Improves feature extraction and model accuracy
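A minimal NLTK sketch of these preprocessing steps (the nltk.download calls and resource names may vary by NLTK version):

```python
import nltk
nltk.download("punkt")
nltk.download("wordnet")
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The runners were running"
tokens = word_tokenize(text)                              # ['The', 'runners', 'were', 'running']
stems = [PorterStemmer().stem(t) for t in tokens]         # 'running' -> 'run', 'runners' -> 'runner'
lemma = WordNetLemmatizer().lemmatize("better", pos="a")  # adjective 'better' -> 'good'
```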
Introduction to Word Embeddings
• What Are Word Embeddings?
• Dense vector representations of words that capture semantic meaning
• Represent words in a continuous vector space
• Popular Word Embedding Models
• Word2Vec:
• Models: Continuous Bag of Words (CBOW) and Skip-gram
• Captures word relationships based on context
• GloVe (Global Vectors for Word Representation)
• Uses word co-occurrence statistics to generate embeddings
• Represents global semantic relationships
• Pre-trained Embeddings in Frameworks
• Frameworks like TensorFlow and PyTorch offer pre-trained embeddings for quick integration
• Benefits of Word Embeddings
• Reduce dimensionality | Capture semantic similarity | Improve model generalization
Using Pre-trained Embeddings for NLP Tasks
• Objective
• Preprocess a text dataset and integrate word embeddings (e.g.,
GloVe) into an LSTM model for sentiment analysis
Day
06
Sequence-to-Sequence
Models and Applications
Sequence-to-Sequence (Seq2Seq) Models
and Their Architecture
• What Are Seq2Seq Models?
• Map an input sequence to an output sequence of different lengths
• Widely used for tasks like language translation, text summarization, speech-to-
text, and chatbots
• Architecture
• Encoder
• Processes the input sequence and encodes it into a fixed-length vector (context vector)
• Decoder
• Takes the context vector as input and generates the output sequence, step by step
Encoder-Decoder Framework for Seq2Seq Tasks
• How It Works
• Encoder
• Sequentially processes the input sequence using RNN, LSTM, or
GRU
• Produces a context vector representing the entire input sequence
• Decoder
• Initializes its hidden state with the encoder's context vector
• Generates the output sequence one token at a time
• Predicts the next token using the previously generated tokens
Attention Mechanism Overview
• Why Attention?
• Standard Seq2Seq models compress the entire input sequence into a fixed-
length vector, which can lead to information loss for long sequences
• Attention Mechanism dynamically focuses on different parts of the input
sequence when generating each output token
• How Attention Works
• Calculates a weight (or score) for each input token based on its relevance to
the current decoder state
• Outputs a weighted sum of the encoder outputs, creating a context vector for
each decoder step
Hands-On Exercise
• Objective
• Build a basic Seq2Seq model using LSTMs for translation and
experiment with hyperparameters
Day
07
RNN Project
Sentiment Analysis
Applying RNN, LSTM, and GRU Models to a Complete Task
• Project Overview
• Choose a task: Sentiment Analysis
• Sentiment Analysis: Classify text into categories like positive or negative
sentiment
• Use RNN, LSTM, or GRU models to solve the chosen task
• Focus on preprocessing, embedding integration, model architecture, and
hyperparameter tuning
Analyzing Model Performance
• Key Metrics
• For Sentiment Analysis
• Accuracy, Precision, Recall, F1 Score
• Tuning Hyperparameters
• Embedding size
• Number of hidden units
• Learning rate and optimizer choice
• Sequence length and batch size
Experimenting with Architectures and Techniques
• Architectural Variations
• Single-Layer vs. Multi-Layer RNNs
• Bidirectional RNNs for improved context capture
• Preprocessing Techniques
• Tokenization, stemming, lemmatization
• Padding for sequence uniformity
Hands-On Project
• Objective
• Build, train, and optimize RNN, LSTM, or GRU models for
Sentiment Analysis
WEEK 12: Transformers and Attention Mechanisms
01
Introduction to Attention
Mechanisms
Understanding the Limitations of RNNs and
the Need for Attention
• Challenges of RNNs
• Sequential Processing
• Long-Term Dependency Problems
• Fixed Context Vector
• The Role of Attention Mechanisms
• Attention overcomes these limitations by allowing the model to focus on specific
parts of the input sequence dynamically during each output generation step
• Instead of relying on a single context vector, attention provides a weighted
combination of all input tokens relevant to the current output token
Basics of the Attention Mechanism
• Core Components
• Queries ( 𝑄 )
• Represents the current focus of the model (e.g., the current
decoder state in Seq2Seq tasks)
• Keys ( 𝐾 )
• Encoded representations of the input sequence
• Values ( 𝑉 )
• Additional information associated with the keys.
Basics of the Attention Mechanism
• Attention Mechanism
• The attention score is computed using the dot product of the query
and keys, followed by a softmax function to normalize into a
probability distribution
• The weighted sum of the values forms the context vector
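A minimal NumPy sketch of (scaled) dot-product attention: scores from Q·Kᵀ, a softmax over the keys, and a weighted sum of the values as the context vectors (the shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                              # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax -> probability distribution
    return weights @ V, weights                                  # context vectors + attention map

Q = np.random.randn(2, 4)    # 2 queries
K = np.random.randn(5, 4)    # 5 keys
V = np.random.randn(5, 8)    # 5 values
context, attn = attention(Q, K, V)
print(context.shape, attn.shape)   # (2, 8) and (2, 5)
```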
Types of Attention
• Self-Attention
• The query, key, and value all come from the same input sequence
• Widely used in Transformer models for learning interdependencies within a
sequence
• Multi-Head Attention
• Extends self-attention by applying multiple attention mechanisms in parallel
• Captures different aspects of relationships in the sequence
MultiHead(𝑄, 𝐾, 𝑉) = Concat(head₁, head₂, …, headₕ) 𝑊^𝑂
• where each head computes attention with different learned projections of 𝑄,
𝐾, and 𝑉
Hands-On Exercise
• Objective
• Implement a basic Attention Mechanism using NumPy or
PyTorch and visualize its impact on a simple sequence
task
Day
02
Introduction to
Transformers Architecture
Overview of the Transformer Architecture
• What is a Transformer?
• Neural network architecture introduced in the paper
"Attention is All You Need”
• It relies entirely on the attention mechanism to process
sequential data without using recurrence or convolution
• Transformative for NLP tasks like translation,
summarization, and text generation
Overview of the Transformer Architecture
• Components of the Transformer
• Encoder
• Processes the input sequence and generates a contextualized representation
• Consists of multiple identical layers, each with
• Self-Attention Mechanism: Captures dependencies between all input tokens
• Feed-Forward Neural Network (FFNN): Processes the attention outputs
• Decoder
• Generates the output sequence one token at a time
• Consists of multiple identical layers, each with
• Masked Self-Attention Mechanism: Prevents the decoder from attending to future tokens
• Encoder-Decoder Attention: Attends to encoder outputs
• Feed-Forward Neural Network
• Workflow: Input sequence → Encoder → Context vectors → Decoder → Output sequence
Detailed Breakdown of the Transformer Model Layers
• Self-Attention Layer
• Captures relationships between all tokens in the input sequence
• Computes the importance of each token to all other tokens
• Positional Encoding
• Since Transformers lack recurrence, positional encoding injects information about the token order into the
model
• Feed-Forward Neural Network
• Applies a position-wise FFNN to the outputs of the attention layer
• Non-linear transformation enhances the representation
• Layer Normalization
• Stabilizes training by normalizing inputs within each layer
• Multi-Head Attention
• Combines multiple self-attention mechanisms to learn various aspects of relationships within the sequence
Key Differences Between Transformers and RNNs
Hands-On Exercise
• Objective
• Visualize the architecture of a Transformer model and
set up an environment for working with Transformers
using PyTorch and/or TensorFlow
Day
03
Self-Attention and Multi-Head
Attention in Transformers
Self-Attention Mechanism
• What is Self-Attention?
• Allows a model to dynamically focus on different parts of an input sequence when encoding a token
• It captures dependencies across all tokens in a sequence, enabling context-aware representations
• Steps in Self-Attention
• Compute Attention Scores:
• Calculate dot products between the query ( 𝑄 ) and key ( 𝐾 ) vectors for all tokens
• Scale by the square root of the key dimension ( 𝑑𝑘 ) to stabilize gradients
• Apply the softmax function to convert scores into probabilities
• Weight Values
• Use the attention scores to compute a weighted sum of value ( 𝑉 ) vectors
Multi-Head Attention
• What is Multi-Head Attention?
• Applies several attention mechanisms in parallel
• Each attention "head" focuses on different aspects of the sequence
• Steps
• Linear Projections
• Project 𝑄, 𝐾, and 𝑉 into multiple subspaces using learned weight matrices
• Apply Self-Attention
• Perform self-attention for each head independently
• Concatenate Outputs
• Combine outputs from all heads
• Final Linear Projection
• Project concatenated outputs back into the original dimension
MultiHead(𝑄, 𝐾, 𝑉) = Concat(head₁, head₂, …, headₕ) 𝑊^𝑂
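A minimal sketch of multi-head self-attention with PyTorch's built-in layer (8 heads over 64-dimensional embeddings; the sizes are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)            # batch of 2 sequences, 10 tokens each
out, attn_weights = mha(x, x, x)      # self-attention: Q, K, V all come from the same input
print(out.shape, attn_weights.shape)  # (2, 10, 64) and (2, 10, 10) (weights averaged over heads)
```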
Applications of Multi-Head Attention in NLP
• Machine Translation
• Captures dependencies across languages for better translations
• Text Summarization
• Identifies key phrases to generate concise summaries
• Named Entity Recognition
• Focuses on contextual clues to detect entities in text
Hands-On Exercise
• Objective
• Implement a simplified Self-Attention and Multi-Head
Attention mechanism and visualize their effects on text
sequences
Day
04
Positional Encoding and
Feed-Forward Networks
Understanding the Role of Positional
Encoding in Transformers
• Why Positional Encoding?
• Unlike RNNs, Transformers do not process sequences sequentially
• They process all tokens in parallel
• Transformers lack inherent knowledge of token positions, which is crucial for tasks like
translation or sequence modeling
• What is Positional Encoding?
• Positional encoding introduces information about the order of tokens in a sequence
• It allows the model to differentiate between identical tokens in different positions.
Mathematical Foundation and
Implementation of Positional Encoding
• Sinusoidal Positional Encoding
• Encodes positional information using sine and cosine functions
• Formula for positional encoding
• PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
• PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
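A minimal NumPy sketch of the sinusoidal encoding, with sine on even dimensions and cosine on odd dimensions:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                      # token positions 0..max_len-1
    i = np.arange(d_model)[None, :]                        # embedding dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                   # odd dimensions use cosine
    return pe

print(positional_encoding(50, 16).shape)   # (50, 16); added to the token embeddings
```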
• Objective
• Implement positional encoding and integrate it with a
basic Transformer model
• Experiment with different positional encoding methods
and observe the effects
Day
05
Hands-On with Pre-Trained
Transformers
BERT and GPT
Introduction to BERT and GPT
• What is BERT?
• BERT (Bidirectional Encoder Representations from Transformers)
• Developed by Google AI
• Processes input sequences bidirectionally, enabling it to capture context from both
directions
• Pre-trained on tasks like Masked Language Modeling (MLM) and Next Sentence
Prediction (NSP)
• Key Features of BERT
• Bidirectional: Understands context from both left and right sides of a word
• Transformer Encoder-Based: Optimized for understanding input text
• Applications: Sentiment analysis, named entity recognition, question answering
What is GPT?
• GPT (Generative Pretrained Transformer)
• Developed by OpenAI
• Processes input sequences unidirectionally (left-to-right), focusing on
generative tasks
• Pre-trained using causal language modeling
• Key Features of GPT
• Unidirectional: Processes text from left to right, focusing on text generation
• Transformer Decoder-Based: Optimized for generating coherent text
• Applications: Text generation, chatbots, summarization
Key Differences Between BERT and GPT
Fine-Tuning Pre-Trained Models for Downstream Tasks
• Why Fine-Tune?
• Pre-trained models are trained on large generic datasets
• Fine-tuning adapts them to specific tasks like sentiment analysis or classification
• Steps to Fine-Tune
• Load a Pre-Trained Model
• Use libraries like Hugging Face to load a pre-trained BERT or GPT model
• Prepare Dataset
• Format the dataset for the specific task (e.g., tokenization for text classification)
• Train and Evaluate
• Fine-tune the model using task-specific data
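A minimal sketch of these steps with Hugging Face's Transformers and Datasets libraries (the IMDB dataset, its column names, the subset sizes, and the hyperparameters are illustrative assumptions):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")         # 1. load pre-trained model + tokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")                                         # 2. prepare the dataset
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,                              # 3. train and evaluate
                  train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=dataset["test"].select(range(500)))
trainer.train()
print(trainer.evaluate())
```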
Hands-On Exercise
• Objective
• Use Hugging Face’s Transformers library to fine-tune a
pre-trained BERT or GPT model for a text classification
task
Day
06
Advanced Transformers
BERT Variants and GPT-3
Exploration of BERT Variants
• Why BERT Variants?
• While BERT is powerful, it has limitations like large computational requirements and inefficiencies in capturing certain
nuances
• BERT variants optimize the model for specific tasks, improve performance, or reduce computational overhead
• Key BERT Variants
• RoBERTa (Robustly Optimized BERT)
• Removes Next Sentence Prediction (NSP) task for better efficiency
• Trains on more data with larger batch sizes
• Use Case: Superior performance in tasks requiring deeper context
• DistilBERT
• A distilled (smaller) version of BERT that retains 97% of BERT’s performance while being 60% faster
• BERTweet
• Fine-tuned on Twitter data
• Use Case: Social media sentiment analysis, hashtag prediction.
Introduction to GPT-3
• What is GPT-3?
• GPT-3 (Generative Pretrained Transformer 3)
• Developed by OpenAI
• A massive model with 175 billion parameters trained on diverse datasets
• Excels at generating coherent and contextually relevant text
• Key Features of GPT-3
• Zero-shot and Few-shot Learning
• Can perform tasks with minimal or no fine-tuning
• Versatility
• Used for text generation, summarization, question answering, and conversational AI
• Applications
• Conversational AI: Chatbots and virtual assistants
• Content Generation: Articles, scripts, code snippets
• Creative Writing: Poems, stories, and creative ideas
Transfer Learning in NLP with Transformer Models
• Objective
• Experiment with a BERT variant (e.g., RoBERTa) and fine-
tune it on an NLP task. Use the GPT-3 API for text
generation and analyze the quality of the generated text
Day
07
Transformer Project – Text
Summarization or Translation
Applying Transformer-Based Models to
Advanced NLP Tasks
• Text Summarization
• The process of condensing a piece of text while retaining the key information
• Two types
• Extractive Summarization: Selects key phrases or sentences from the original text
• Abstractive Summarization: Generates new sentences that capture the meaning of the
original text
• Text Translation
• Converts text from one language to another while maintaining meaning and grammar
• Examples
• English to French translation
• Multi-lingual translations with models like T5 or mT5
Fine-Tuning and Optimizing Models
• Pre-Trained Models for Summarization and Translation
• T5 (Text-to-Text Transfer Transformer)
• Treats every NLP problem as a text-to-text task
• Fine-tuned for summarization and translation tasks
• BART (Bidirectional and Auto-Regressive Transformer)
• Combines BERT-like encoder and GPT-like decoder
• Pre-trained for denoising and fine-tuned for summarization and translation
• Optimization Techniques
• Learning rate scheduling
• Hyperparameter tuning (batch size, optimizer type, maximum sequence length)
Analyzing Model Performance
• Evaluation Metrics
• Text Summarization
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
• BLEU (Bilingual Evaluation Understudy) for generated
summaries
• Text Translation
• BLEU score for translation quality
• Perplexity to measure model performance
Hands-On Project
• Objective
• Fine-tune a pre-trained Transformer model (e.g., T5 or
BART) for text summarization or translation and evaluate
its performance
WEEK 13: Transfer Learning and Fine-Tuning
01
Introduction to
Transfer Learning
What is Transfer Learning?
• A machine learning technique where a model trained on one task is reused as a starting
point for another related task
• Instead of training a model from scratch, pre-trained models are fine-tuned on a
smaller dataset for a new task
• How It Differs from Traditional Training:
Benefits of Transfer Learning
• Objective
• Set up a transfer learning environment, load a pre-trained model,
and explore its architecture and layers
Day
02
Transfer Learning in
Computer Vision
Popular Pre-Trained Models for Vision Tasks
• VGG
• VGG16/VGG19: Deep networks with 16 or 19 layers
• Known for simplicity in architecture: a stack of convolutional layers followed by fully connected layers
• Applications: General-purpose image classification and feature extraction
• ResNet
• Residual Networks: Introduced residual connections (skip connections) to tackle vanishing gradients
• Popular variants: ResNet18, ResNet50, ResNet101
• Applications: Large-scale image classification tasks, object detection
• Inception
• InceptionV3: Known for inception modules, which allow for multi-scale feature extraction in one layer
• Applications: Scene recognition, fine-grained image classification
• EfficientNet
• Family of models that scales network depth, width, and resolution efficiently
• Provides better performance with fewer parameters
• Applications: Resource-constrained environments requiring high accuracy
Freezing and Unfreezing Layers for Fine-Tuning
• Steps
• Load a pre-trained model (e.g., ResNet, VGG)
• Replace the last layer with a task-specific classifier (e.g.,
softmax for multi-class classification)
• Fine-tune the model on the new dataset
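A minimal PyTorch sketch of the freeze-then-replace-head pattern with a torchvision ResNet (the weights argument and the 5-class head are illustrative):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")        # load a pre-trained backbone

for param in model.parameters():
    param.requires_grad = False                         # freeze all pre-trained layers

num_classes = 5                                         # e.g., 5 animal species
model.fc = nn.Linear(model.fc.in_features, num_classes) # new task-specific classifier head (trainable)

# Later, selected blocks can be unfrozen for deeper fine-tuning:
for param in model.layer4.parameters():
    param.requires_grad = True
```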
Hands-On Exercise
• Objective
• Load a pre-trained ResNet or VGG model and fine-tune it
for a new image classification task (e.g., classifying
animals or plants)
• Experiment with freezing and unfreezing layers and
observe the impact on performance
Day
03
Fine-Tuning Techniques in
Computer Vision
Choosing Layers to Fine-Tune and Understanding the
Feature Extraction Process
• Objective
• Apply data augmentation to a dataset and train a fine-
tuned model. Experiment with hyperparameters to
observe their impact on performance
Day
04
Transfer Learning in NLP
Popular Pre-Trained NLP Models
• BERT (Bidirectional Encoder Representations from Transformers)
• Architecture: Transformer-based encoder model
• Training Tasks
• Masked Language Modeling (MLM)
• Next Sentence Prediction (NSP)
• Applications
• Text classification, sentiment analysis, question answering
• GPT (Generative Pretrained Transformer)
• Architecture: Transformer-based decoder model
• Training Task
• Causal Language Modeling (predicting next word)
• Applications
• Text generation, summarization, dialogue systems
Popular Pre-Trained NLP Models
• T5 (Text-to-Text Transfer Transformer)
• Treats all NLP tasks as text-to-text transformations
• Applications
• Summarization, translation, text classification
• RoBERTa (Robustly Optimized BERT)
• Removes Next Sentence Prediction
• Pre-trained on a larger dataset with optimized training strategies
• Applications
• Similar to BERT but with better performance on downstream tasks
Tokenization and Text Preprocessing for Fine-Tuning
NLP Models
• Tokenization
• Converts raw text into numerical representations
• Types
• WordPiece Tokenization: Used in BERT
• Byte-Pair Encoding (BPE): Used in GPT and RoBERTa
• Text Preprocessing
• Cleaning
• Remove unnecessary characters (e.g., URLs, special symbols)
• Normalization
• Convert text to lowercase
• Remove stopwords if necessary
• Tokenization
• Break text into tokens compatible with the pre-trained model
Adapting Pre-Trained Models for NLP Tasks
• Common Tasks
• Text Classification
• Categorize text into predefined labels (e.g., positive/negative sentiment)
• Sentiment Analysis
• Determine the sentiment polarity of text (e.g., positive, neutral, negative)
• Summarization
• Generate concise summaries from lengthy texts
• Steps
• Load pre-trained model
• Add a task-specific head (e.g., classification layer)
• Fine-tune the model on task-specific data.
Hands-On Exercise
• Objective
• Fine-tune a pre-trained BERT or T5 model for sentiment
analysis using Hugging Face’s Transformers library
• Preprocess the text data, tokenize it, and evaluate the
model
Day
05
Fine-Tuning
Techniques in NLP
Fine-Tuning Methods for NLP Tasks
• Discriminative Fine-Tuning
• Different layers of a pre-trained model capture different types of information
• Approach
• Use different learning rates for different layers of the model
• Lower learning rates for early layers (general features)
• Higher learning rates for later layers (task-specific features).
• Slanted Triangular Learning Rates (STLR)
• Dynamically adjusts learning rates during training to balance exploration and convergence
• Phases
• Warm-Up: Gradually increase the learning rate to promote exploration
• Decay: Slowly decrease the learning rate to ensure convergence
• Use Case
• Effective for fine-tuning pre-trained models like BERT and GPT
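A minimal sketch combining discriminative learning rates (per-layer-group LRs) with a warm-up/decay schedule; `model` is an assumed Hugging Face BERT-style model with `.bert.encoder.layer` and `.classifier` attributes, and the linear warm-up/decay scheduler stands in for STLR:

```python
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW([
    {"params": model.bert.encoder.layer[:6].parameters(), "lr": 1e-5},   # early layers: general features, small LR
    {"params": model.bert.encoder.layer[6:].parameters(), "lr": 3e-5},   # later layers: task-specific, larger LR
    {"params": model.classifier.parameters(), "lr": 5e-5},               # task head: largest LR
])

num_training_steps = 1000
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=100,                    # warm-up phase
                                            num_training_steps=num_training_steps)   # decay phase
# During training, call optimizer.step() followed by scheduler.step() once per batch.
```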
Regularization and Dropout for Preventing
Overfitting in NLP Models
• Regularization
• L1 Regularization: Encourages sparsity by penalizing absolute weights
• L2 Regularization (Ridge): Penalizes large weights to improve
generalization
• Dropout
• Randomly drops units (along with their connections) during training
• Prevents over-reliance on specific neurons
• Commonly used in Transformer-based models
Evaluating Model Performance with NLP-Specific Metrics
• Key Metrics
• F1-Score
• Harmonic mean of precision and recall
• Suitable for classification tasks with imbalanced datasets
• BLEU Score
• Evaluates the quality of generated text against reference text
• Commonly used for translation and summarization tasks
• ROUGE Score
• Measures overlap between generated and reference text
• Used for summarization tasks
Hands-On Exercise
• Objective
• Experiment with advanced fine-tuning techniques (e.g.,
STLR) on an NLP model and evaluate its performance
using F1-score or BLEU score
Day
06
Domain Adaptation and
Transfer Learning Challenges
Understanding Domain Adaptation and Handling
Domain-Specific Data
• Domain-Specific Embeddings
• Use pre-trained embeddings tailored to the target domain (e.g., BioBERT for
biomedical data, LegalBERT for legal text)
Hands-On Exercise
• Objective
• Fine-tune a pre-trained model on a domain-specific
dataset (e.g., BERT for medical text classification) and
experiment with domain-specific embeddings
Day
07
Transfer Learning Project
Fine-Tuning for a Custom Task
Applying Transfer Learning Techniques to a Custom Project
• Project Objective
• Leverage transfer learning to solve a specific task in either computer vision or NLP
• Fine-tune a pre-trained model for domain-specific data to achieve optimal performance
• Steps to Follow
• Dataset Selection
• Computer Vision: Custom image dataset (e.g., animal species classification)
• NLP: Text classification task (e.g., sentiment analysis, product categorization)
• Pre-Trained Model
• Computer Vision: Models like ResNet, EfficientNet, or MobileNet
• NLP: BERT, RoBERTa, or T5
• Fine-Tuning Techniques
• Regularization, hyperparameter tuning, data augmentation, discriminative learning rates
Analyzing Fine-Tuning Techniques
• Fine-Tuning Process
• Freeze the pre-trained layers and train the custom classifier head first
• Unfreeze some pre-trained layers for domain adaptation
• Gradually reduce the learning rate to avoid catastrophic forgetting
• Key Techniques
• Regularization
• Dropout, L2 regularization to prevent overfitting
• Data Augmentation
• Enhance diversity in training data (rotation, cropping, or text paraphrasing)
• Hyperparameter Tuning
• Experiment with learning rate, batch size, and optimizer
Documenting Results and Baseline Comparisons
• Steps
• Evaluate the baseline performance of the pre-trained
model without fine-tuning
• Track performance improvements after fine-tuning and
hyperparameter optimization
• Use metrics like accuracy, F1-score, BLEU, or ROC-AUC
to compare results
Hands-On Exercise
• Objective
• Fine-tune a pre-trained model (ResNet for computer
vision or BERT for NLP) to solve a custom task
• Evaluate and document results against a baseline
WEEK 14: Model Deployment and Serving
WEEK 15: Advanced Topics in Machine Learning Deployment