
Exploratory Data Analysis (EDA) is an important step in data analysis that focuses on understanding patterns, trends and relationships through statistical tools and visualizations. Python offers libraries such as pandas, NumPy, Matplotlib, Seaborn and Plotly that enable effective exploration and insight generation to support further modeling and analysis. In this article, we will see how to perform EDA using Python.
Key Steps for Exploratory Data Analysis (EDA)
Let's see the various steps involved in Exploratory Data Analysis:
Step 1: Importing Required Libraries
We need to import the Pandas, NumPy, Matplotlib and Seaborn libraries in Python to proceed further.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wr
wr.filterwarnings('ignore')
Step 2: Reading the Dataset
Download the dataset from this link and let's read it using pandas.
df = pd.read_csv("/content/WineQT.csv")
print(df.head())
Output:
First 5 rows
Step 3: Analyzing the Data
1. df.shape: This attribute gives the number of rows (observations) and columns (features) in the dataset, providing an overview of the dataset's size and structure.
df.shape
Output:
(1143, 13)
2. df.info(): This function helps us to understand the dataset
by showing the number of records in each column, type of
data, whether any values are missing and how much memory
the dataset uses.
df.info()
Output:
info()
3. df.describe(): This method gives a statistical summary of
the DataFrame showing values like count, mean, standard
deviation, minimum and quartiles for each numerical column.
It helps in summarizing the central tendency and spread of
the data.
df.describe()
Output:
describe()
4. df.columns.tolist(): This converts the column names of the
DataFrame into a Python list making it easy to access and
manipulate the column names.
df.columns.tolist()
Output:
column names
Step 4 : Checking Missing Values
df.isnull().sum(): This checks for missing values in each
column and returns the total number of null values per
column helping us to identify any gaps in our data.
df.isnull().sum()
Output:
Missing values in each column
Step 5: Checking Unique and Duplicate Values
df.nunique(): This function tells us how many unique values exist in each column, which provides insight into the variety of data in each feature.
df.nunique()
Output:
nunique()
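Since this step also mentions duplicates, a minimal complementary check with pandas' duplicated() is sketched below; it is not part of the original snippet and simply assumes df is the wine DataFrame loaded earlier.
# Count fully duplicated rows (0 means no duplicates).
print(df.duplicated().sum())

# Optionally drop duplicates before further analysis.
df = df.drop_duplicates()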
Step 6: Univariate Analysis
In univariate analysis, plotting the right charts helps us better understand the data, which is why data visualization is so important.
1. Bar Plot for evaluating the count of wines at each quality rating.
quality_counts = df['quality'].value_counts()

plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
Output:
Bar Plot
This count plot shows the number of wines at each quality rating.
2. Kernel density plot for understanding variance in the
dataset
sns.set_style("darkgrid")

numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns

plt.figure(figsize=(14, len(numerical_columns) * 3))

for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(len(numerical_columns), 2, idx)
    sns.histplot(df[feature], kde=True)
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")

plt.tight_layout()
plt.show()
Output:
Kernel density plot
Features with a skewness of 0 show a symmetrical distribution. A skewness of 1 or above suggests a positively skewed (right-skewed) distribution, where the tail extends to the right, indicating the presence of extremely high values.
3. Swarm Plot for showing outliers in the data
plt.figure(figsize=(10, 8))

sns.swarmplot(x="quality", y="alcohol", data=df, palette='viridis')

plt.title('Swarm Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
Output:

Swarm Plot
This graph shows the swarm plot for the 'Quality' and 'Alcohol' columns. Higher point density in certain areas shows where most of the data points are concentrated, while isolated points far from these clusters represent outliers, highlighting unusual values in the dataset.
Step 7: Bivariate Analysis
In bivariate analysis two variables are analyzed together to
identify patterns, dependencies or interactions between
them. This method helps in understanding how changes in
one variable might affect another.
Let's visualize these relationships by plotting various plots for the data, which will show how the variables interact with each other across multiple dimensions.
1. Pair Plot for showing the distribution of the individual
variables
sns.set_palette("Pastel1")

plt.figure(figsize=(10, 6))
sns.pairplot(df)
plt.suptitle('Pair Plot for DataFrame')
plt.show()
Output:
Pair Plot
 On the diagonal, histograms or kernel density plots show the distribution of the individual variables.
 Scatter plots in the lower triangle display the relationship between pairs of variables.
 The scatter plots above and below the diagonal are mirror images of each other, indicating symmetry.
 The location of a histogram's peak shows where values are concentrated.
 Skewness is found by observing whether a histogram is symmetrical or skewed to the left or right.
2. Violin Plot for examining the relationship between alcohol
and Quality.
df['quality'] = df['quality'].astype(str)

plt.figure(figsize=(10, 8))
sns.violinplot(x="quality", y="alcohol", data=df,
               palette={'3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen',
                        '6': 'gold', '7': 'lightskyblue', '8': 'lightpink'},
               alpha=0.7)
plt.title('Violin Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
Output:

Violin Plot
For interpreting the Violin Plot:
 If the violin is wider at a value, that value has a higher density, suggesting more data points there.
 A symmetrical violin shows a balanced distribution.
 A peak or bulge in the violin represents the most common value in the distribution.
 Longer tails show greater variability.
 The median line is the middle line inside the violin; it helps in understanding central tendencies.
3. Box Plot for examining the relationship between alcohol
and Quality
sns.boxplot(x='quality', y='alcohol', data=df)
Output:
Box Plot
 The box represents the IQR; the longer the box, the greater the variability.
 The median line in the box shows central tendency.
 Whiskers extend from the box to the smallest and largest values within a specified range.
 Individual points beyond the whiskers represent outliers.
 A compact box shows low variability, while a stretched box shows higher variability.
Step 8: Multivariate Analysis
It involves finding the interactions between three or more
variables in a dataset at the same time. This approach
focuses to identify complex patterns, relationships and
interactions which provides understanding of how multiple
variables collectively behave and influence each other.
Here, we are going to show the multivariate analysis using
a correlation matrix plot.
plt.figure(figsize=(15, 10))

# 'quality' was cast to string for the violin plot above, so restrict the
# correlation to numeric columns.
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f',
            cmap='Pastel2', linewidths=2)
plt.title('Correlation Heatmap')
plt.show()
Output:

Correlation Matrix
Values close to +1 show strong positive correlation, values close to -1 show strong negative correlation and values near 0 suggest no linear correlation.
 Darker colors signify stronger correlations, while lighter colors represent weaker correlations.
 With a positive correlation, the variables move in the same direction: as one increases, the other also increases.
 With a negative correlation, the variables move in opposite directions: an increase in one variable is associated with a decrease in the other.
How Dimensionality Reduction Works?
Let's understand how dimensionality reduction works with the help of an example. Imagine a dataset where each data point exists in a 3D space defined by axes X, Y and Z. If most of the data variance occurs along X and Y, then the Z-dimension may contribute very little to understanding the structure of the data.
 Before reduction, the data exists in 3D (X, Y, Z). It has high redundancy and Z contributes little meaningful information.
 After reducing the dimensionality, the data is represented in lower-dimensional spaces. The X-Y plot maintains the meaningful structure, while the Z-Y plot shows that the Z-dimension contributed little useful information.
This process makes data analysis more efficient, improving computation speed and visualization while minimizing redundancy.
Dimensionality Reduction Techniques
Dimensionality reduction techniques can be broadly divided
into two categories:
1. Feature Selection
Feature selection chooses the most relevant features from
the dataset without altering them. It helps remove redundant
or irrelevant features, improving model efficiency. Some
common methods are:
 Filter methods rank the features based on their
relevance to the target variable.
 Wrapper methods use the model performance as the
criteria for selecting features.
 Embedded methods combine feature selection with the
model training process.
Please refer to Feature Selection Techniques for a more in-depth understanding of these techniques; a minimal filter-method sketch follows.
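As a brief illustration of a filter method, here is a minimal scikit-learn sketch; the breast cancer dataset and the choice of k=10 are assumptions for demonstration only, not taken from this article.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Example dataset; any (X, y) pair would work here.
X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features by their ANOVA F-score against the target
# and keep the 10 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)          # (569, 30)
print("Reduced shape:", X_selected.shape)  # (569, 10)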
2. Feature Extraction
Feature extraction involves creating new features by
combining or transforming the original features. These new
features retain most of the dataset’s important information in
fewer dimensions. Common feature extraction methods are:
1. Principal Component Analysis (PCA): Converts correlated variables into uncorrelated 'principal components', reducing dimensionality while maintaining as much variance as possible, enabling more efficient analysis (a short PCA sketch follows this list).
2. Missing Value Ratio: Variables with missing data beyond
a set threshold are removed, improving dataset
reliability.
3. Backward Feature Elimination: Starts with all features
and removes the least significant ones in each iteration.
The process continues until only the most impactful
features remain, optimizing model performance.
4. Forward Feature Selection: Begins with one feature, adds others incrementally and keeps those that improve model performance.
5. Random Forest: Uses decision trees to evaluate feature importance, automatically selecting the most relevant features without the need for manual coding, enhancing model accuracy.
6. Factor Analysis: Groups variables by correlation and
keeps the most relevant ones for further analysis.
7. Independent Component Analysis (ICA): Identifies
statistically independent components, ideal for
applications like ‘blind source separation’ where
traditional correlation-based methods fall short.
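As a brief illustration of feature extraction, the sketch below applies PCA with scikit-learn; the Iris dataset and the choice of two components are assumptions for demonstration, not taken from this article.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example dataset with 4 correlated numeric features.
X, _ = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto 2 uncorrelated principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)                    # (150, 2)
print("Variance explained:", pca.explained_variance_ratio_)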
Dimensionality Reduction Real World Examples
Dimensionality reduction plays an important role in many real-world applications such as text categorization, image retrieval, gene expression analysis and more. Here are a few examples:
1. Text Categorization: With vast amounts of online data
dimensionality reduction helps classify text documents
into predefined categories by reducing the feature space
like word or phrase features while maintaining accuracy.
2. Image Retrieval: As image data grows indexing based on
visual content like color, texture, shape rather than just
text descriptions has become essential. This allows for
better retrieval of images from large databases.
3. Gene Expression Analysis: Dimensionality reduction accelerates gene expression analysis, helping classify samples (such as leukemia) by identifying key features and improving both speed and accuracy.
4. Intrusion Detection: In cybersecurity dimensionality
reduction helps analyze user activity patterns to detect
suspicious behaviors and intrusions by identifying
optimal features for network monitoring.
Advantages of Dimensionality Reduction
As seen earlier high dimensionality makes models inefficient.
Let's now summarize the key advantages of reducing
dimensionality.
 Faster Computation: With fewer features machine
learning algorithms can process data more quickly. This
results in faster model training and testing which is
particularly useful when working with large datasets.
 Better Visualization: As we saw in the earlier figure
reducing dimensions makes it easier to visualize data
and reveal hidden patterns.
 Prevent Overfitting: With fewer features, models are less likely to memorize the training data and overfit. This helps the model generalize better to new, unseen data, improving its ability to make accurate predictions.
Disadvantages of Dimensionality Reduction
 Data Loss & Reduced Accuracy: Some important
information may be lost during dimensionality reduction
and affect model performance.
 Choosing the Right Components: Deciding how many dimensions to keep is difficult, as keeping too few may lose valuable information while keeping too many can lead to overfitting.

Linear regression is a type of supervised machine-learning algorithm that learns from labelled datasets and maps the data points to the most optimized linear function, which can be used for prediction on new datasets. It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
For example, suppose we want to predict a student's exam score based on how many hours they studied. We observe that as students study more hours, their scores go up. In this example:
 Independent variable (input): Hours studied because it's the factor we control
or observe.
 Dependent variable (output): Exam score because it depends on how many
hours were studied.
We use the independent variable to predict the dependent variable.

Why Linear Regression is Important?
Here's why linear regression is important:
 Simplicity and Interpretability: It’s easy to understand and interpret, making it
a starting point for learning about machine learning.
 Predictive Ability: Helps predict future outcomes based on past data, making it
useful in various fields like finance, healthcare and marketing.
 Basis for Other Models: Many advanced algorithms, like logistic regression or
neural networks, build on the concepts of linear regression.
 Efficiency: It’s computationally efficient and works well for problems with a
linear relationship.
 Widely Used: It’s one of the most widely used techniques in both statistics and
machine learning for regression tasks.
 Analysis: It provides insights into relationships between variables (e.g., how
much one variable influences another).
Best Fit Line in Linear Regression
In linear regression, the best-fit line is the straight line that most accurately represents
the relationship between the independent variable (input) and the dependent variable
(output). It is the line that minimizes the difference between the actual data points and
the predicted values from the model.
1. Goal of the Best-Fit Line
The goal of linear regression is to find a straight line that minimizes the error (the
difference) between the observed data points and the predicted values. This line helps
us predict the dependent variable for new, unseen data.

Linear Regression
Here Y is called a dependent or target variable and X is called an independent variable
also known as the predictor of Y. There are many types of functions or modules that
can be used for regression. A linear function is the simplest type of function. Here, X
may be a single feature or multiple features representing the problem.
2. Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the best-fit line is
represented by the equation
y = mx + b
Where:
 y is the predicted value (dependent variable)
 x is the input (independent variable)
 m is the slope of the line (how much y changes when x changes)
 b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b (intercept)
so that the predicted y values are as close as possible to the actual data points.
3. Minimizing the Error: The Least Squares Method
To find the best-fit line, we use a method called Least Squares. The idea behind this
method is to minimize the sum of squared differences between the actual values (data
points) and the predicted values from the line. These differences are called residuals.
The formula for residuals is:
Residual = yᵢ − ŷᵢ
Where:
 yᵢ is the actual observed value
 ŷᵢ is the predicted value from the line for that xᵢ
The least squares method minimizes the sum of the squared residuals:
Sum of squared errors (SSE) = Σ(yᵢ − ŷᵢ)²
This method ensures that the line best represents the data where the sum of the
squared differences between the predicted values and actual values is as small as
possible.
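For a concrete sense of how the slope and intercept fall out of the least squares criterion, here is a minimal NumPy sketch using the closed-form formulas; the toy study-hours data is invented purely for illustration.
import numpy as np

# Toy data: hours studied (x) vs. exam score (y); values are illustrative only.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 61, 68, 74], dtype=float)

# Closed-form least squares estimates for simple linear regression:
# m = cov(x, y) / var(x),  b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print("predicted scores:", m * x + b)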
4. Interpretation of the Best-Fit Line
 Slope (m): The slope of the best-fit line indicates how much the dependent
variable (y) changes with each unit change in the independent variable (x). For
example if the slope is 5, it means that for every 1-unit increase in x, the value
of y increases by 5 units.
 Intercept (b): The intercept represents the predicted value of y when x = 0. It’s
the point where the line crosses the y-axis.
In linear regression, some assumptions are made to ensure the reliability of the model's results.
Limitations
 Assumes Linearity: The method assumes the relationship between the
variables is linear. If the relationship is non-linear, linear regression might not
work well.
 Sensitivity to Outliers: Outliers can significantly affect the slope and intercept,
skewing the best-fit line.
Hypothesis function in Linear Regression
In linear regression, the hypothesis function is the equation used to make predictions
about the dependent variable based on the independent variables. It represents the
relationship between the input features and the target output.
For a simple case with one independent variable, the hypothesis function is:
h(x) = β₀ + β₁x
Where:
 h(x) (or ŷ) is the predicted value of the dependent variable (y).
 x is the independent variable.
 β₀ is the intercept, representing the value of y when x is 0.
 β₁ is the slope, indicating how much y changes for each unit change in x.
For multiple linear regression (with more than one independent variable), the hypothesis function expands to:
h(x₁, x₂, ..., xₖ) = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ
Where:
 x₁, x₂, ..., xₖ are the independent variables.
 β₀ is the intercept.
 β₁, β₂, ..., βₖ are the coefficients, representing the influence of each respective independent variable on the predicted output.
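A minimal NumPy sketch of the multiple-regression hypothesis function; the coefficient values and the small feature matrix are assumed for illustration and do not come from this article.
import numpy as np

# Assumed example coefficients: intercept beta0 and weights beta for 3 features.
beta0 = 2.0
beta = np.array([0.5, -1.2, 3.0])

# Feature matrix X: each row is one observation with 3 features.
X = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.5, 2.0]])

# Hypothesis h(x) = beta0 + beta1*x1 + ... + betak*xk, computed for all rows at once.
y_hat = beta0 + X @ beta
print(y_hat)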
Assumptions of the Linear Regression
1. Linearity: The relationship between inputs (X) and the output (Y) is a straight line.

Linearity
2. Independence of Errors: The errors in predictions should not affect each other.
3. Constant Variance (Homoscedasticity): The errors should have equal spread across
all values of the input. If the spread changes (like fans out or shrinks), it's called
heteroscedasticity and it's a problem for the model.

Homoscedasticity
4. Normality of Errors: The errors should follow a normal (bell-shaped) distribution.
5. No Multicollinearity(for multiple regression): Input variables shouldn’t be too
closely related to each other.
6. No Autocorrelation: Errors shouldn't show repeating patterns, especially in time-
based data.
7. Additivity: The total effect on Y is just the sum of the effects from each X, with no mixing or interaction between them.
To understand Multicollinearity in detail refer to article: Multicollinearity.
Types of Linear Regression
When there is only one independent feature it is known as Simple Linear Regression
or Univariate Linear Regression and when there are more than one feature it is known
as Multiple Linear Regression or Multivariate Regression.
1. Simple Linear Regression
Simple linear regression is used when we want to predict a target value (dependent
variable) using only one input feature (independent variable). It assumes a straight-line
relationship between the two.
Formula:
ŷ = θ₀ + θ₁x
Where:
 ŷ is the predicted value
 x is the input (independent variable)
 θ₀ is the intercept (value of ŷ when x = 0)
 θ₁ is the slope or coefficient (how much ŷ changes with one unit of x)
Example:
Predicting a person’s salary (y) based on their years of experience (x).
2. Multiple Linear Regression
Multiple linear regression involves more than one independent variable and one
dependent variable. The equation for multiple linear regression is:

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ
where:
 ŷ is the predicted value
 x₁, x₂, …, xₙ are the independent variables
 θ₁, θ₂, …, θₙ are the coefficients (weights) corresponding to each predictor.
 θ₀ is the intercept.
The goal of the algorithm is to find the best Fit Line equation that can predict the
values based on the independent variables.
In regression, a set of records with X and Y values is used to learn a function, so that Y can be predicted for an unseen X using this learned function. Since regression must produce a continuous output, the learned function must predict a continuous value of Y given X as the independent features.
Use Case of Multiple Linear Regression
Multiple linear regression allows us to analyze relationship between multiple
independent variables and a single dependent variable. Here are some use cases:
 Real Estate Pricing: In real estate MLR is used to predict property prices based
on multiple factors such as location, size, number of bedrooms, etc. This helps
buyers and sellers understand market trends and set competitive prices.
 Financial Forecasting: Financial analysts use MLR to predict stock prices or economic indicators based on multiple influencing factors such as interest rates, inflation rates and market trends. This enables better investment strategies and risk management.
 Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields
based on several variables like rainfall, temperature, soil quality and fertilizer
usage. This information helps in planning agricultural practices for optimal
productivity
 E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess
how various factors such as product price, marketing promotions and seasonal
trends impact sales.
Now that we have understood linear regression, its assumptions and its types, we will learn how to build a linear regression model.
Cost function for Linear Regression
As discussed earlier, the best-fit line is not easy to obtain directly in real-life cases, so we need to measure the errors that affect it and then reduce them. The difference between the predicted value Ŷ and the true value Y is measured by the cost function, also called the loss function.
In linear regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values ŷᵢ and the actual values yᵢ. The purpose is to determine the optimal values for the intercept θ₁ and the coefficient of the input feature θ₂, providing the best-fit line for the given data points. The linear equation expressing this relationship is ŷᵢ = θ₁ + θ₂xᵢ.
The MSE cost function can be written as:
Cost function J(θ₁, θ₂) = (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²
Utilizing the MSE function, the iterative process of gradient descent is applied to update the values of θ₁ and θ₂. This ensures that the MSE value converges to the global minimum, signifying the most accurate fit of the linear regression line to the dataset.
This process involves continuously adjusting the parameters θ₁ and θ₂ based on the gradients calculated from the MSE. The final result is a linear regression line that minimizes the overall squared differences between the predicted and actual values, providing an optimal representation of the underlying relationship in the data.
Now that we have defined the loss function, we need to optimize the model to reduce this error, which is done through gradient descent.
Gradient Descent for Linear Regression
A linear regression model can be trained using the optimization algorithm gradient
descent by iteratively modifying the model's parameters to reduce the mean squared
error (MSE) of the model on a training dataset. To update θ1 and θ2 values in order to
reduce the Cost function (minimizing RMSE value) and achieve the best-fit line the
model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and
then iteratively update the values, reaching minimum cost.
A gradient is nothing but a derivative that defines the effects on outputs of the
function with a little bit of variation in inputs.
Let's differentiate the cost function J with respect to θ₁:
J'_{θ₁} = ∂J(θ₁, θ₂)/∂θ₁
        = ∂/∂θ₁ [ (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² ]
        = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · ∂/∂θ₁ (θ₁ + θ₂xᵢ − yᵢ)
        = (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)
Let's differentiate the cost function J with respect to θ₂:
J'_{θ₂} = ∂J(θ₁, θ₂)/∂θ₂
        = ∂/∂θ₂ [ (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² ]
        = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · ∂/∂θ₂ (θ₁ + θ₂xᵢ − yᵢ)
        = (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) · xᵢ
Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression. The coefficients are updated by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients. With learning rate α, the respective updates for the intercept and the coefficient of X are:
Gradient Descent
θ₁ = θ₁ − α · J'_{θ₁} = θ₁ − α · (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)
θ₂ = θ₂ − α · J'_{θ₂} = θ₂ − α · (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) · xᵢ
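A minimal NumPy sketch of these update rules; the toy dataset, learning rate and iteration count are assumptions chosen only for illustration, and a fuller class-based implementation appears later in the article.
import numpy as np

# Toy data: y is roughly 2x + 1 plus a little noise (values are illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

theta1, theta2 = 0.0, 0.0   # intercept and slope, started at zero
alpha, iters = 0.05, 200    # assumed learning rate and iteration count

for _ in range(iters):
    y_hat = theta1 + theta2 * x
    # Gradients of the MSE with respect to theta1 and theta2 (as derived above).
    grad1 = (2 / len(x)) * np.sum(y_hat - y)
    grad2 = (2 / len(x)) * np.sum((y_hat - y) * x)
    theta1 -= alpha * grad1
    theta2 -= alpha * grad2

print(f"intercept = {theta1:.2f}, slope = {theta2:.2f}")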
After optimizing our model, we evaluate its accuracy to see how well it will perform in a real-world scenario.
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of any linear
regression model. These assessment metrics often give an indication of how well the
model is producing the observed outputs.
The most common measurements are:
1. Mean Square Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the average of the
squared differences between the actual and predicted values for all the data points.
The difference is squared to ensure that negative and positive differences don't cancel
each other out.
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Here,
 n is the number of data points.
 yᵢ is the actual or observed value for the iᵗʰ data point.
 ŷᵢ is the predicted value for the iᵗʰ data point.
MSE is a way to quantify the accuracy of a model's predictions. MSE is sensitive to
outliers as large errors contribute significantly to the overall score.
2. Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a
regression model. MAE measures the average absolute difference between the
predicted values and actual values.
Mathematically MAE is expressed as:

MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|
Here,
 n is the number of observations
 Yᵢ represents the actual values.
 Ŷᵢ represents the predicted values.
Lower MAE value indicates better model performance. It is not sensitive to the outliers
as we consider absolute differences.
3. Root Mean Squared Error (RMSE)
The square root of the residuals' variance is the Root Mean Squared Error. It describes
how well the observed data points match the expected values or the model's absolute
fit to the data.
In mathematical notation, it can be expressed as:
RMSE = √(RSS / n) = √( Σᵢ₌₁ⁿ (yᵢ,actual − yᵢ,predicted)² / n )
To obtain an unbiased estimate, the sum of squared residuals is divided by the degrees of freedom rather than by the total number of data points; the resulting figure is referred to as the Residual Standard Error (RSE).
In mathematical notation, it can be expressed as:
RSE = √( RSS / (n − 2) ) = √( Σᵢ₌₁ⁿ (yᵢ,actual − yᵢ,predicted)² / (n − 2) )
RMSE is not as good a metric as R-squared. Root Mean Squared Error can fluctuate when the units of the variables vary, since its value depends on the variables' units (it is not a normalized measure).
4. Coefficient of Determination (R-squared)
R-Squared is a statistic that indicates how much variation the developed model can
explain or capture. It is always in the range of 0 to 1. In general, the better the model
matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:
R² = 1 − (RSS / TSS)
 Residual sum of Squares(RSS): The sum of squares of the residual for each data
point in the plot or data is known as the residual sum of squares or RSS. It is a
measurement of the difference between the output that was observed and
what was anticipated.
RSS = Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ)²
 Total Sum of Squares (TSS): The sum of the squared differences of the data points from the mean of the response variable is known as the total sum of squares, or TSS.
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
The R-squared metric is a measure of the proportion of variance in the dependent variable that is explained by the independent variables in the model.
5. Adjusted R-Squared Error
Adjusted R² measures the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. Adjusted R-squared accounts for the number of predictors in the model and penalizes the model for including irrelevant predictors that don't contribute significantly to explaining the variance in the dependent variable.
Mathematically, adjusted R² is expressed as:
Adjusted R² = 1 − ( (1 − R²)(n − 1) / (n − k − 1) )
Here,
 n is the number of observations
 k is the number of predictors in the model
 R² is the coefficient of determination
Adjusted R-square helps to prevent overfitting. It penalizes the model with additional
predictors that do not contribute significantly to explain the variance in the dependent
variable.
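All of these metrics are available in scikit-learn; a minimal sketch, where the actual and predicted arrays are invented purely for illustration.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative actual and predicted values.
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")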
While evaluation metrics help us measure the performance of a model, regularization
helps in improving that performance by addressing overfitting and enhancing
generalization.
Regularization Techniques for Linear Models
1. Lasso Regression (L1 Regularization)
Lasso Regression is a technique used for regularizing a linear regression model, it adds
a penalty term to the linear regression objective function to prevent overfitting.
The objective function after applying lasso regression is:

J(θ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)² + λ Σⱼ₌₁ⁿ |θⱼ|
 the first term is the least squares loss, representing the squared difference between predicted and actual values.
 the second term is the L1 regularization term; it penalizes the sum of the absolute values of the regression coefficients θⱼ.
2. Ridge Regression (L2 Regularization)
Ridge regression is a linear regression technique that adds a regularization term to the standard linear objective. Again, the goal is to prevent overfitting by penalizing large coefficients in the linear regression equation. It is useful when the dataset has multicollinearity, where predictor variables are highly correlated.
The objective function after applying ridge regression is:
J(θ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)² + λ Σⱼ₌₁ⁿ θⱼ²
 the first term is the least squares loss, representing the squared difference between predicted and actual values.
 the second term is the L2 regularization term; it penalizes the sum of the squares of the regression coefficients θⱼ.
3. Elastic Net Regression
Elastic Net Regression is a hybrid regularization technique that combines the power of
both L1 and L2 regularization in linear regression objective.

J(θ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)² + αλ Σⱼ₌₁ⁿ |θⱼ| + ½(1 − α)λ Σⱼ₌₁ⁿ θⱼ²
 the first term is the least squares loss.
 the second term is the L1 regularization term and the third is the L2 (ridge) regularization term.
 λ is the overall regularization strength.
 α controls the mix between L1 and L2 regularization.
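scikit-learn provides these regularized linear models directly; a minimal sketch, where the synthetic data and the alpha / l1_ratio values are assumptions chosen only for illustration.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic regression data for demonstration.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# alpha is the regularization strength; l1_ratio mixes L1 and L2 in Elastic Net.
models = {
    "Lasso (L1)": Lasso(alpha=0.1),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R^2 on training data = {model.score(X, y):.3f}")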
Now that we have learned how a linear regression model is built, we will implement it.
Python Implementation of Linear Regression
1. Import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as ax
from matplotlib.animation import FuncAnimation
2. Load the dataset and separate input and Target variables
Here is the link for dataset: Dataset Link
url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240320114716/data_for_lr.csv'
data = pd.read_csv(url)

data = data.dropna()
train_input = np.array(data.x[0:500]).reshape(500, 1)
train_output = np.array(data.y[0:500]).reshape(500, 1)

test_input = np.array(data.x[500:700]).reshape(199, 1)
test_output = np.array(data.y[500:700]).reshape(199, 1)
3. Build the Linear Regression Model and Plot the regression line
In forward propagation, the linear regression function Y = mx + c is applied by initially assigning random values to the parameters m and c. We then write the cost function, i.e. the mean squared error between the predictions and the training outputs.
class LinearRegression:
    def __init__(self):
        self.parameters = {}

    def forward_propagation(self, train_input):
        m = self.parameters['m']
        c = self.parameters['c']
        predictions = np.multiply(m, train_input) + c
        return predictions

    def cost_function(self, predictions, train_output):
        cost = np.mean((train_output - predictions) ** 2)
        return cost

    def backward_propagation(self, train_input, train_output, predictions):
        derivatives = {}
        df = (predictions - train_output)
        dm = 2 * np.mean(np.multiply(train_input, df))
        dc = 2 * np.mean(df)
        derivatives['dm'] = dm
        derivatives['dc'] = dc
        return derivatives

    def update_parameters(self, derivatives, learning_rate):
        self.parameters['m'] = self.parameters['m'] - learning_rate * derivatives['dm']
        self.parameters['c'] = self.parameters['c'] - learning_rate * derivatives['dc']

    def train(self, train_input, train_output, learning_rate, iters):
        # Initialize parameters with random (negative) values.
        self.parameters['m'] = np.random.uniform(0, 1) * -1
        self.parameters['c'] = np.random.uniform(0, 1) * -1

        self.loss = []

        fig, ax = plt.subplots()
        x_vals = np.linspace(min(train_input), max(train_input), 100)
        line, = ax.plot(x_vals, self.parameters['m'] * x_vals + self.parameters['c'],
                        color='red', label='Regression Line')
        ax.scatter(train_input, train_output, marker='o', color='green',
                   label='Training Data')

        ax.set_ylim(0, max(train_output) + 1)

        def update(frame):
            # One gradient-descent step per animation frame.
            predictions = self.forward_propagation(train_input)
            cost = self.cost_function(predictions, train_output)
            derivatives = self.backward_propagation(train_input, train_output, predictions)
            self.update_parameters(derivatives, learning_rate)
            line.set_ydata(self.parameters['m'] * x_vals + self.parameters['c'])
            self.loss.append(cost)
            print("Iteration = {}, Loss = {}".format(frame + 1, cost))
            return line,

        ani = FuncAnimation(fig, update, frames=iters, interval=200, blit=True)
        ani.save('linear_regression_A.gif', writer='ffmpeg')

        plt.xlabel('Input')
        plt.ylabel('Output')
        plt.title('Linear Regression')
        plt.legend()
        plt.show()

        return self.parameters, self.loss

The linear regression line provides valuable insights into the relationship between the
two variables. It represents the best-fitting line that captures the overall trend of how a
dependent variable (Y) changes in response to variations in an independent variable
(X).
 Positive Linear Regression Line: A positive linear regression line indicates a direct relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y also increases. The slope of a positive linear regression line is positive, meaning that the line slants upward from left to right.
 Negative Linear Regression Line: A negative linear regression line indicates an inverse relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y decreases. The slope of a negative linear regression line is negative, meaning that the line slants downward from left to right.
4. Train the Model and Make the Final Prediction
linear_reg = LinearRegression()
parameters, loss = linear_reg.train(train_input, train_output, 0.0001, 20)
Output:
Model Training
Applications of Linear Regression
Linear regression is used in many different fields including finance, economics and
psychology to understand and predict the behavior of a particular variable.
For example linear regression is widely used in finance to analyze relationships and
make predictions. It can model how a company's earnings per share (EPS) influence its
stock price. If the model shows that a $1 increase in EPS results in a $15 rise in stock
price, investors gain insights into the company's valuation. Similarly, linear regression
can forecast currency values by analyzing historical exchange rates and economic
indicators, helping financial professionals make informed decisions and manage risks
effectively.
Also read - Linear Regression - In Simple Words, with real-life Examples
Advantages and Disadvantages of Linear Regression
Advantages of Linear Regression
 Linear regression is a relatively simple algorithm, making it easy to understand
and implement. The coefficients of the linear regression model can be
interpreted as the change in the dependent variable for a one-unit change in
the independent variable, providing insights into the relationships between
variables.
 Linear regression is computationally efficient and can handle large datasets
effectively. It can be trained quickly on large datasets, making it suitable for
real-time applications.
 Linear regression is relatively robust to outliers compared to other machine
learning algorithms. Outliers may have a smaller impact on the overall model
performance.
 Linear regression often serves as a good baseline model for comparison with
more complex machine learning algorithms.
 Linear regression is a well-established algorithm with a rich history and is
widely available in various machine learning libraries and software packages.
Disadvantages of Linear Regression
 Linear regression assumes a linear relationship between the dependent and
independent variables. If the relationship is not linear, the model may not
perform well.
 Linear regression is sensitive to multicollinearity, which occurs when there is a
high correlation between independent variables. Multicollinearity can inflate
the variance of the coefficients and lead to unstable model predictions.
 Linear regression assumes that the features are already in a suitable form for
the model. Feature engineering may be required to transform features into a
format that can be effectively used by the model.
 Linear regression is susceptible to both overfitting and underfitting. Overfitting
occurs when the model learns the training data too well and fails to generalize
to unseen data. Underfitting occurs when the model is too simple to capture
the underlying relationships in the data.
 Linear regression provides limited explanatory power for complex relationships
between variables. More advanced machine learning techniques may be
necessary for deeper insights.

Logistic Regression is a supervised machine learning algorithm used for classification problems. Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class. It is used for binary classification, where the output can be one of two possible categories such as Yes/No, True/False or 0/1. It uses the sigmoid function to convert inputs into a probability value between 0 and 1. In this article, we will see the basics of logistic regression and its core concepts.

Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of the
dependent variable:
1. Binomial Logistic Regression: This type is used when the dependent variable
has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is
the most common form of logistic regression and is used for binary
classification problems.
2. Multinomial Logistic Regression: This is used when the dependent variable has
three or more possible categories that are not ordered. For example, classifying
animals into categories like "cat," "dog" or "sheep." It extends the binary
logistic regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the dependent variable has
three or more categories with a natural order or ranking. Examples include
ratings like "low," "medium" and "high." It takes the order of the categories into
account when modeling.
Assumptions of Logistic Regression
Understanding the assumptions behind logistic regression is important to ensure the model is applied correctly. The main assumptions are:
1. Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.
2. Binary dependent variable: The dependent variable must be binary, meaning it can take only two values. For more than two categories, the softmax function is used.
3. Linearity relationship between independent variables and log odds: The
model assumes a linear relationship between the independent variables and
the log odds of the dependent variable which means the predictors affect the
log odds in a linear way.
4. No outliers: The dataset should not contain extreme outliers as they can distort
the estimation of the logistic regression coefficients.
5. Large sample size: It requires a sufficiently large sample size to produce reliable
and stable results.
Understanding Sigmoid Function
1. The sigmoid function is an important part of logistic regression which is used to convert the raw output of the model into a probability value between 0 and 1.
2. This function takes any real number and maps it into the range 0 to 1, forming an "S" shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is well suited for this purpose.
3. In logistic regression, we use a threshold value, usually 0.5, to decide the class label.
 If the sigmoid output is at or above the threshold, the input is classified as Class 1.
 If it is below the threshold, the input is classified as Class 0.
This approach helps to transform continuous input values into meaningful class
predictions.
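A minimal NumPy sketch of this thresholding step; the raw z values below are invented purely for illustration.
import numpy as np

def sigmoid(z):
    # Maps any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative raw linear outputs z = w.x + b for a few samples.
z = np.array([-2.0, -0.3, 0.0, 1.5, 4.0])

probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5

print("probabilities:", np.round(probs, 3))
print("predicted classes:", labels)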
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. This function is known as the logistic function.
Suppose we have input features represented as a matrix:
X = [ x₁₁ … x₁ₘ
      x₂₁ … x₂ₘ
      ⋮   ⋱   ⋮
      xₙ₁ … xₙₘ ]
and the dependent variable Y takes only binary values, i.e. 0 or 1:
Y = 0 if Class 1, 1 if Class 2
Then, apply the multi-linear function to the input variables X:
z = ( Σᵢ₌₁ⁿ wᵢxᵢ ) + b
Here xᵢ is the iᵗʰ observation of X, w = [w₁, w₂, w₃, ⋯, wₘ] is the vector of weights (coefficients) and b is the bias term, also known as the intercept. Simply, this can be represented as the dot product of the weights and the input plus the bias:
z = w·X + b
At this stage, z is a continuous value from the linear regression. Logistic regression then applies the sigmoid function to z to convert it into a probability between 0 and 1, which can be used to predict the class.
Now we use the sigmoid function, where the input is z, to find the probability between 0 and 1, i.e. the predicted y:
σ(z) = 1 / (1 + e⁻ᶻ)

Sigmoid function
As shown above, the sigmoid function converts continuous values into probabilities, i.e. values between 0 and 1.
 σ(z) tends towards 1 as z → ∞
 σ(z) tends towards 0 as z → −∞
 σ(z) is always bounded between 0 and 1
The probability of belonging to a class can then be measured as:
P(y = 1) = σ(z)
P(y = 0) = 1 − σ(z)
Logistic Regression Equation and Odds:
It models the odds of the event occurring, which is the ratio of the probability of the event to the probability of it not occurring:
p(x) / (1 − p(x)) = eᶻ
Taking the natural logarithm of the odds gives the log-odds, or logit:
log[ p(x) / (1 − p(x)) ] = z = w·X + b
Exponentiating both sides and solving for p(x):
p(x) / (1 − p(x)) = e^(w·X + b)
p(x) = e^(w·X + b) · (1 − p(x))
p(x) (1 + e^(w·X + b)) = e^(w·X + b)
p(x) = e^(w·X + b) / (1 + e^(w·X + b))
Then the final logistic regression equation will be:
p(X; b, w) = e^(w·X + b) / (1 + e^(w·X + b)) = 1 / (1 + e^(−(w·X + b)))
This formula represents the probability of the input belonging to Class 1.
Likelihood Function for Logistic Regression
The goal is to find weights w and bias b that maximize the likelihood of observing the data.
For each data point i:
 for yᵢ = 1, the predicted probability is p(X; b, w) = p(xᵢ)
 for yᵢ = 0, the predicted probability is 1 − p(X; b, w) = 1 − p(xᵢ)
The likelihood is:
L(b, w) = Πᵢ₌₁ⁿ p(xᵢ)^(yᵢ) (1 − p(xᵢ))^(1−yᵢ)
Taking natural logs on both sides:
log(L(b, w)) = Σᵢ₌₁ⁿ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ]
             = Σᵢ₌₁ⁿ [ yᵢ log p(xᵢ) + log(1 − p(xᵢ)) − yᵢ log(1 − p(xᵢ)) ]
             = Σᵢ₌₁ⁿ log(1 − p(xᵢ)) + Σᵢ₌₁ⁿ yᵢ log[ p(xᵢ) / (1 − p(xᵢ)) ]
             = Σᵢ₌₁ⁿ −log(1 + e^(w·xᵢ + b)) + Σᵢ₌₁ⁿ yᵢ (w·xᵢ + b)
This is known as the log-likelihood function.
Gradient of the log-likelihood function
To find the best w and b, we use gradient ascent on the log-likelihood function. The gradient with respect to each weight wⱼ is:
∂ log(L(b, w)) / ∂wⱼ = −Σᵢ₌₁ⁿ [ e^(w·xᵢ + b) / (1 + e^(w·xᵢ + b)) ] xᵢⱼ + Σᵢ₌₁ⁿ yᵢ xᵢⱼ
                     = −Σᵢ₌₁ⁿ p(xᵢ; b, w) xᵢⱼ + Σᵢ₌₁ⁿ yᵢ xᵢⱼ
                     = Σᵢ₌₁ⁿ ( yᵢ − p(xᵢ; b, w) ) xᵢⱼ
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
1. Independent Variables: These are the input features or predictor variables
used to make predictions about the dependent variable.
2. Dependent Variable: This is the target variable that we aim to predict. In
logistic regression, the dependent variable is categorical.
3. Logistic Function: This function transforms the independent variables into a
probability between 0 and 1 which represents the likelihood that the
dependent variable is either 0 or 1.
4. Odds: This is the ratio of the probability of an event happening to the
probability of it not happening. It differs from probability because probability is
the ratio of occurrences to total possibilities.
5. Log-Odds (Logit): The natural logarithm of the odds. In logistic regression, the
log-odds are modeled as a linear combination of the independent variables and
the intercept.
6. Coefficient: These are the parameters estimated by the logistic regression
model which shows how strongly the independent variables affect the
dependent variable.
7. Intercept: The constant term in the logistic regression model which represents
the log-odds when all independent variables are equal to zero.
8. Maximum Likelihood Estimation (MLE): This method is used to estimate the
coefficients of the logistic regression model by maximizing the likelihood of
observing the given data.
Implementation for Logistic Regression
Now, let's see the implementation of logistic regression in Python. Here we will be
implementing two main types of Logistic Regression:
1. Binomial Logistic regression:
In binomial logistic regression, the target variable can only have two possible values
such as "0" or "1", "pass" or "fail". The sigmoid function is used for prediction.
We will be using the scikit-learn library for this and show how to use the breast cancer dataset to implement a Logistic Regression model for classification.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=23)

clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")
Output:
Logistic Regression model accuracy (in %): 96.49%
This code uses logistic regression to classify whether a sample from the breast cancer
dataset is malignant or benign.
2. Multinomial Logistic Regression:
Target variable can have 3 or more possible types which are not ordered i.e types have
no quantitative significance like “disease A” vs “disease B” vs “disease C”.
In this case, the softmax function is used in place of the sigmoid function. Softmax
function for K classes will be:
softmax(zᵢ) = e^(zᵢ) / Σⱼ₌₁ᴷ e^(zⱼ)
Here K represents the number of elements in the vector z, and i, j iterate over the elements of the vector.
Then the probability for class c will be:
P(Y = c | X = x) = e^(w_c·x + b_c) / Σₖ₌₁ᴷ e^(wₖ·x + bₖ)
Below is an example of implementing multinomial logistic regression using the Digits
dataset from scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

digits = datasets.load_digits()

X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

reg = linear_model.LogisticRegression(max_iter=10000, random_state=0)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print(f"Logistic Regression model accuracy: {metrics.accuracy_score(y_test, y_pred) * 100:.2f}%")
Output:
Logistic Regression model accuracy: 96.66%
This model is used to predict one of 10 digits (0-9) based on the image features.
How to Evaluate Logistic Regression Model?
Evaluating the logistic regression model helps assess its performance and ensure it
generalizes well to new, unseen data. The following metrics are commonly used:
1. Accuracy: Accuracy gives the proportion of correctly classified instances.
Accuracy = (True Positives + True Negatives) / Total
2. Precision: Precision focuses on the accuracy of positive predictions.
Precision = True Positives / (True Positives + False Positives)
3. Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly predicted positive instances among all actual positive instances.
Recall = True Positives / (True Positives + False Negatives)
4. F1 Score: The F1 score is the harmonic mean of precision and recall.
F1 Score = 2 · (Precision · Recall) / (Precision + Recall)
5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC
curve plots the true positive rate against the false positive rate at various
thresholds. AUC-ROC measures the area under this curve which provides an
aggregate measure of a model's performance across different classification
thresholds.
6. Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC, AUC-PR measures the area under the precision-recall curve, providing a summary of a model's performance across different precision-recall trade-offs.
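These metrics are available in scikit-learn; a minimal sketch continuing from the binomial example above, assuming clf, X_test and y_test from that snippet are still in scope.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))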
Differences Between Linear and Logistic Regression
Logistic regression and linear regression differ in their application and output. Here's a
comparison:
Linear Regression | Logistic Regression
Linear regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
It is used for solving regression problems. | It is used for solving classification problems.
In this we predict the value of continuous variables. | In this we predict the values of categorical variables.
In this we find the best fit line. | In this we find the S-curve.
The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
The output must be a continuous value, such as price, age, etc. | The output must be a categorical value such as 0 or 1, Yes or No, etc.
It requires a linear relationship between the dependent and independent variables. | It does not require a linear relationship.
There may be collinearity between the independent variables. | There should be little to no collinearity between the independent variables.

A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure which consists of a root node, branches, internal nodes and leaf nodes. It works like a flowchart, helping to make decisions step by step, where:
 Internal nodes represent attribute tests
 Branches represent attribute values
 Leaf nodes represent final decisions or predictions.
Decision trees are widely used due to their interpretability, flexibility and low
preprocessing needs.
How Does a Decision Tree Work?
A decision tree splits the dataset based on feature values to create pure subsets ideally
all items in a group belong to the same class. Each leaf node of the tree corresponds to
a class label and the internal nodes are feature-based decision points. Let’s understand
this with an example.
Decision Tree

Let’s consider a decision tree for predicting whether a customer will buy a product
based on age, income and previous purchases: Here's how the decision tree works:
1. Root Node (Income)
First Question: "Is the person’s income greater than $50,000?"
 If Yes, proceed to the next question.
 If No, predict "No Purchase" (leaf node).
2. Internal Node (Age):
If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"
 If Yes, proceed to the next question.
 If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases):
 If the person is above 30 and has made previous purchases, predict "Purchase"
(leaf node).
 If the person is above 30 and has not made previous purchases, predict "No
Purchase" (leaf node).
Decision making with 2 Decision Trees
Example: Predicting Whether a Customer Will Buy a Product Using Two Decision Trees
Tree 1: Customer Demographics
First tree asks two questions:
1. "Income > $50,000?"
 If Yes, Proceed to the next question.
 If No, "No Purchase"
2. "Age > 30?"
 Yes: "Purchase"
 No: "No Purchase"
Tree 2: Previous Purchases
"Previous Purchases > 0?"
 Yes: "Purchase"
 No: "No Purchase"
Once we have predictions from both trees, we can combine the results to make a final
prediction. If Tree 1 predicts "Purchase" and Tree 2 predicts "No Purchase", the final
prediction might be "Purchase" or "No Purchase" depending on the weight or
confidence assigned to each tree. This can be decided based on the problem context.
Information Gain and Gini Index in Decision Tree
So far we have covered the basic intuition of how a decision tree works, so let's move on to the attribute selection measures of a decision tree. Two popular attribute selection measures are used:
1. Information Gain
Information Gain tells us how useful a question (or feature) is for splitting data into
groups. It measures how much the uncertainty decreases after the split. A good
question will create clearer groups and the feature with the highest Information Gain is
chosen to make the decision.
For example if we split a dataset of people into "Young" and "Old" based on age and all
young people bought the product while all old people did not, the Information Gain
would be high because the split perfectly separates the two groups with no uncertainty
left
 Suppose S is a set of instances, A is an attribute, S_v is the subset of S, v represents an individual value that the attribute A can take and Values(A) is the set of all possible values of A, then
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)
 Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content. For a set S with class proportions p_i, Entropy(S) = -\sum_i p_i \log_2 p_i.
For example if a dataset has an equal number of "Yes" and "No" outcomes (like 3
people who bought a product and 3 who didn’t), the entropy is high because it’s
uncertain which outcome to predict. But if all the outcomes are the same (all "Yes" or
all "No") the entropy is 0 meaning there is no uncertainty left in predicting the
outcome
Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v and Values(A) is the set of all possible values of A, then
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)
Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy H(X) = -\left[\frac{3}{8}\log_2\frac{3}{8} + \frac{5}{8}\log_2\frac{5}{8}\right] = -[0.375(-1.415) + 0.625(-0.678)] = -(-0.53 - 0.424) = 0.954
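As a quick numerical check of this worked example, here is a small sketch (assuming NumPy is available) that computes the entropy of the set X = {a, a, a, b, b, b, b, b}:
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    values, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

X = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']
print(round(entropy(X), 3))  # 0.954, matching the hand calculation above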
Building a Decision Tree using Information Gain: the essentials
 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Recursively construct each subtree on the subset of training instances that
would be classified down that path in the tree.
 If all positive or all negative training instances remain, label that node "yes" or "no" accordingly
 If no attributes remain label with a majority vote of training instances left at
that node
 If no instances remain label with a majority vote of the parent's training
instances.
Example: Now let us draw a Decision Tree for the following data using Information
gain. Training set: 3 features and 2 classes
X  Y  Z  C
1  1  1  I
1  1  0  I
0  0  1  II
1  0  0  II
Here, we have 3 features and 2 output classes. To build a decision tree using
Information gain. We will take each of the features and calculate the information for
each feature.

Split on feature X
Split on feature Y
Split on feature Z
From the above images we can see that the information gain is maximum when we make a split on feature Y. So the best-suited feature for the root node is feature Y. While splitting the dataset by feature Y, each child contains a pure subset of the target variable, so we don't need to split the dataset further. The final tree for this dataset is therefore a single split on Y: Y = 1 predicts class I and Y = 0 predicts class II.
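Since the split figures are not reproduced here, the following sketch (an illustrative reconstruction, not the article's original code) computes the information gain of X, Y and Z for this training set and confirms that Y gives the maximum gain:
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    # Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))
    total = entropy(target)
    weighted = 0.0
    for v in np.unique(feature):
        subset = target[feature == v]
        weighted += len(subset) / len(target) * entropy(subset)
    return total - weighted

data = np.array([  # columns: X, Y, Z
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
target = np.array(['I', 'I', 'II', 'II'])

for name, col in zip(['X', 'Y', 'Z'], data.T):
    print(name, round(information_gain(col, target), 3))
# Y has the highest gain (1.0), so it is chosen as the root split.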
2. Gini Index
Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified. It means an attribute with a lower Gini index should be
preferred. Sklearn supports “Gini” criteria for Gini Index and by default it takes “gini”
value.
For example if we have a group of people where all bought the product (100% "Yes") the Gini Index is 0, indicating perfect purity. But if the group has an equal mix of "Yes" and "No" the Gini Index would be 0.5, showing high impurity or uncertainty. The formula for the Gini Index is given by:
Gini = 1 - \sum_{i=1}^{n} p_i^2
Some additional features of the Gini Index are:
1. It is calculated by summing the squared probabilities of each outcome in a
distribution and subtracting the result from 1.
2. A lower Gini Index indicates a more homogeneous or pure distribution while a
higher Gini Index indicates a more heterogeneous or impure distribution.
3. In decision trees the Gini Index is used to evaluate the quality of a split by
measuring the difference between the impurity of the parent node and the
weighted impurity of the child nodes.
4. Compared to other impurity measures like entropy, the Gini Index is faster to
compute and more sensitive to changes in class probabilities.
5. One disadvantage of the Gini Index is that it tends to favour splits that create
equally sized child nodes, even if they are not optimal for classification
accuracy.
6. In practice the choice between using the Gini Index or other impurity measures
depends on the specific problem and dataset and requires experimentation and
tuning.
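To make the formula concrete, here is a minimal sketch of a Gini Index computation; the two toy groups are assumptions chosen only to mirror the "all Yes" and "equal mix" examples above.
import numpy as np

def gini_index(labels):
    # Gini = 1 - sum(p_i^2) over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_index(['Yes', 'Yes', 'Yes', 'Yes']))  # 0.0 -> pure group
print(gini_index(['Yes', 'Yes', 'No', 'No']))    # 0.5 -> maximally mixed for two classes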
Understanding Decision Tree with a Real-Life Use Case
So far we have understood the attributes and components of a decision tree. Now let's walk through a real-life use case to see how a decision tree works step by step.
Step 1. Start with the Whole Dataset
We begin with all the data which is treated as the root node of the decision tree.
Step 2. Choose the Best Question (Attribute)
Pick the best question to divide the dataset. For example ask: "What is the outlook?"
Possible answers: Sunny, Cloudy or Rainy.
Step 3. Split the Data into Subsets :
Divide the dataset into groups based on the question:
 If Sunny go to one subset.
 If Cloudy go to another subset.
 If Rainy go to the last subset.
Step 4. Split Further if Needed (Recursive Splitting)
For each subset ask another question to refine the groups. For example If the Sunny
subset is mixed ask: "Is the humidity high or normal?"
 High humidity → "Swimming".
 Normal humidity → "Hiking".
Step 5. Assign Final Decisions (Leaf Nodes)
When a subset contains only one activity, stop splitting and assign it a label:
 Cloudy → "Hiking".
 Rainy → "Stay Inside".
 Sunny + High Humidity → "Swimming".
 Sunny + Normal Humidity → "Hiking".
Step 6. Use the Tree for Predictions
To predict an activity follow the branches of the tree. Example: If the outlook is Sunny
and the humidity is High follow the tree:
 Start at Outlook.
 Take the branch for Sunny.
 Then go to Humidity and take the branch for High Humidity.
 Result: "Swimming".
A decision tree works by breaking down data step by step asking the best possible
questions at each point and stopping once it reaches a clear decision. It's an easy and
understandable way to make choices. Because of their simple and clear structure
decision trees are very helpful in machine learning for tasks like sorting data into
categories or making predictions.
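The discussion above is conceptual; as a hedged illustration, here is a short scikit-learn sketch on the built-in Iris dataset (the dataset and the hyperparameter values are assumptions, not part of the original text) showing how such a tree is trained and evaluated in practice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset (Iris is an assumption; any labeled data works)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion can be "gini" or "entropy"; max_depth limits tree depth to reduce overfitting
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")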
Frequently Asked Questions (FAQs)
1. What are the major issues in decision tree learning?
Major issues in decision tree learning include overfitting, sensitivity to small data
changes and limited generalization. Ensuring proper pruning, tuning and handling
imbalanced data can help mitigate these challenges for more robust decision tree
models.
2. How does decision tree help in decision making?
Decision trees aid decision-making by representing complex choices in a hierarchical
structure. Each node tests specific attributes, guiding decisions based on data values.
Leaf nodes provide final outcomes, offering a clear and interpretable path for decision
analysis in machine learning.
3. What is the maximum depth of a decision tree?
The maximum depth of a decision tree is a hyperparameter that determines the
maximum number of levels or nodes from the root to any leaf. It controls the
complexity of the tree and helps prevent overfitting.
4. What is the concept of decision tree?
A decision tree is a supervised learning algorithm that models decisions based on input
features. It forms a tree-like structure where each internal node represents a decision
based on an attribute, leading to leaf nodes representing outcomes.
5. What is entropy in decision tree?
In decision trees, entropy is a measure of impurity or disorder within a dataset. It
quantifies the uncertainty associated with classifying instances, guiding the algorithm
to make informative splits for effective decision-making.
6. What are the Hyperparameters of decision tree?
1. Max Depth: Maximum depth of the tree.
2. Min Samples Split: Minimum samples required to split an internal node.
3. Min Samples Leaf: Minimum samples required in a leaf node.
4. Criterion: The function used to measure the quality of a split
Random Forest is a machine learning algorithm that uses many
decision trees to make better predictions. Each tree looks at different random parts of
the data and their results are combined by voting for classification or averaging for
regression. This helps in improving accuracy and reducing errors.
Working of Random Forest Algorithm
 Create Many Decision Trees: The algorithm makes many decision trees each
using a random part of the data. So every tree is a bit different.
 Pick Random Features: When building each tree it doesn’t look at all the
features (columns) at once. It picks a few at random to decide how to split the
data. This helps the trees stay different from each other.
 Each Tree Makes a Prediction: Every tree gives its own answer or prediction
based on what it learned from its part of the data.
 Combine the Predictions:
o For classification, the final answer is the category that most trees agree on, i.e. majority voting.
o For regression, the final answer is the average of all the trees' predictions.
 Why It Works Well: Using random data and features for each tree helps avoid
overfitting and makes the overall prediction more accurate and trustworthy.
Random forest is also an ensemble learning technique, which you can learn more about from: Ensemble Learning
Key Features of Random Forest
 Handles Missing Data: It can work even if some data is missing so you don’t
always need to fill in the gaps yourself.
 Shows Feature Importance: It tells you which features (columns) are most
useful for making predictions which helps you understand your data better.
 Works Well with Big and Complex Data: It can handle large datasets with many
features without slowing down or losing accuracy.
 Used for Different Tasks: You can use it for both classification like predicting
types or labels and regression like predicting numbers or amounts.
Assumptions of Random Forest
 Each tree makes its own decisions: Every tree in the forest makes its own
predictions without relying on others.
 Random parts of the data are used: Each tree is built using random samples
and features to reduce mistakes.
 Enough data is needed: Sufficient data ensures the trees are different and learn
unique patterns and variety.
 Different predictions improve accuracy: Combining the predictions from
different trees leads to a more accurate final result.
Implementing Random Forest for Classification Tasks
Here we will predict whether a passenger survived the Titanic disaster.
 Import libraries and load the Titanic dataset.
 Remove rows with missing target values ('Survived').
 Select features like class, sex, age, etc and convert 'Sex' to numbers.
 Fill missing age values with the median.
 Split the data into training and testing sets, then train a Random Forest model.
 Predict on test data, check accuracy and print a sample prediction result.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)

titanic_data = titanic_data.dropna(subset=['Survived'])

X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = titanic_data['Survived']

X.loc[:, 'Sex'] = X['Sex'].map({'female': 0, 'male': 1})
X.loc[:, 'Age'].fillna(X['Age'].median(), inplace=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_rep)

sample = X_test.iloc[0:1]
prediction = rf_classifier.predict(sample)

sample_dict = sample.iloc[0].to_dict()
print(f"\nSample Passenger: {sample_dict}")
print(f"Predicted Survival: {'Survived' if prediction[0] == 1 else 'Did Not Survive'}")
Output:

Random Forest for Classification Tasks
We evaluated the model's performance using a classification report to see how well it predicts the outcomes and used a random sample to check the model's prediction.
Implementing Random Forest for Regression Tasks
We will do house price prediction here.
 Load the California housing dataset and create a DataFrame with features and
target.
 Separate the features and the target variable.
 Split the data into training and testing sets (80% train, 20% test).
 Initialize and train a Random Forest Regressor using the training data.
 Predict house values on test data and evaluate using MSE and R² score.
 Print a sample prediction and compare it with the actual value.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

california_housing = fetch_california_housing()
california_data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target

X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

y_pred = rf_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

single_data = X_test.iloc[0].values.reshape(1, -1)
predicted_value = rf_regressor.predict(single_data)
print(f"Predicted Value: {predicted_value[0]:.2f}")
print(f"Actual Value: {y_test.iloc[0]:.2f}")

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
Output:

Random Forest for Regression Tasks
We evaluated the model's performance using Mean Squared Error and R-squared
Score which show how accurate the predictions are and used a random sample to
check model prediction.
Advantages of Random Forest
 Random Forest provides very accurate predictions even with large datasets.
 Random Forest can handle missing data well without compromising with
accuracy.
 It doesn’t require normalization or standardization on dataset.
 When we combine multiple decision trees it reduces the risk of overfitting of
the model.
Limitations of Random Forest
 It can be computationally expensive especially with a large number of trees.
 It’s harder to interpret the model compared to simpler models like decision
trees
K-Nearest Neighbors (KNN) is a supervised
machine learning algorithm generally used for classification but can also be used for
regression tasks. It works by finding the "k" closest data points (neighbors) to a given input and makes predictions based on the majority class (for classification) or the average value (for regression). Since KNN makes no assumptions about the underlying data distribution, it is a non-parametric and instance-based learning method.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
For example, consider the following set of data points containing two features:
KNN Algorithm working visualization
The new point is classified as Category 2 because most of its closest neighbors are blue
squares. KNN assigns the category based on the majority of nearby points. The image
shows how KNN predicts the category of a new data point based on its closest
neighbours.
 The red diamonds represent Category 1 and the blue squares represent
Category 2.
 The new data point checks its closest neighbors (circled points).
 Since the majority of its closest neighbors are blue squares (Category 2) KNN
predicts the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the algorithm how
many nearby points or neighbors to look at when it makes a decision.
Example: Imagine you're deciding which fruit it is based on its shape and size. You
compare it to fruits you already know.
 If k = 3, the algorithm looks at the 3 closest fruits to the new one.
 If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new
fruit is an apple because most of its neighbors are apples.
How to choose the value of k for KNN Algorithm?
 The value of k in KNN decides how many neighbors the algorithm looks at when
making a prediction.
 Choosing the right k is important for good results.
 If the data has lots of noise or outliers, using a larger k can make the
predictions more stable.
 But if k is too large the model may become too simple and miss important
patterns and this is called underfitting.
 So k should be picked carefully based on the data.
Statistical Methods for Selecting k
 Cross-Validation: A good way to find the best value of k is k-fold cross-validation. This means dividing the dataset into several parts (folds); the model is trained on some of these parts and tested on the remaining ones, and the process is repeated for each part. The k value that gives the highest average accuracy during these tests is usually the best one to use, as shown in the sketch after this list.
 Elbow Method: In Elbow Method we draw a graph showing the error rate or
accuracy for different k values. As k increases the error usually drops at first.
But after a certain point error stops decreasing quickly. The point where the
curve changes direction and looks like an "elbow" is usually the best choice for
k.
 Odd Values for k: It’s a good idea to use an odd number for k especially in
classification problems. This helps avoid ties when deciding which class is the
most common among the neighbors.
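Here is the sketch referred to above: a minimal cross-validation loop (using scikit-learn's cross_val_score on the built-in Iris dataset, both of which are assumptions chosen only for illustration) that scores odd values of k and keeps the best one.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and keep the one with the best mean cross-validated accuracy
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best k:", best_k)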
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbors; these neighbors are then used for the classification or regression task. To identify the nearest neighbors we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a
plane or space. You can think of it like the shortest path you would walk if you were to
go directly from one point to another.
distance(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines like a grid or city streets. It’s also called "taxicab distance" because a taxi
can only drive along the grid-like streets of a city.

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and
Manhattan distances as special cases.
d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}
From the formula above, when p=2, it becomes the same as the Euclidean distance
formula and when p=1, it turns into the Manhattan distance formula. Minkowski
distance is essentially a flexible formula that can represent either Euclidean or
Manhattan distance depending on the value of p.
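As a small illustration of these three metrics, here is a sketch of NumPy implementations; the sample points a and b are arbitrary assumptions.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((np.array(x) - np.array(y)) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(np.array(x) - np.array(y)))

def minkowski(x, y, p):
    return np.sum(np.abs(np.array(x) - np.array(y)) ** p) ** (1 / p)

a, b = [1, 2], [4, 6]
print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7
print(minkowski(a, b, 2))  # 5.0 -> same as Euclidean when p = 2
print(minkowski(a, b, 1))  # 7.0 -> same as Manhattan when p = 1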
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors that needs to be considered
while making prediction.
Step 2: Calculating distance
 To measure the similarity between target and training data points Euclidean
distance is used. Distance is calculated between data points in the dataset and
target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the target point are nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 When you want to classify a data point into a category like spam or not spam,
the KNN algorithm looks at the K closest points in the dataset. These closest
points are called neighbors. The algorithm then looks at which category the
neighbors belong to and picks the one that appears the most. This is called
majority voting.
 In regression, the algorithm still looks for the K closest points. But instead of
voting for a class in classification, it takes the average of the values of those K
neighbors. This average is the predicted value for the new point for the
algorithm.
This shows how a test point is classified based on its nearest neighbors: as the test point moves, the algorithm identifies the closest k data points (5 in this case) and assigns the test point the majority class label among them, which is the grey class here.
Python Implementation of KNN Algorithm
1. Importing Libraries
Counter is used to count the occurrences of elements in a list or iterable. In KNN after
finding the k nearest neighbor labels Counter helps count how many times each label
appears.
import numpy as np
from collections import Counter
2. Defining the Euclidean Distance Function
euclidean_distance is to calculate euclidean distance between points.
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2))**2))
3. KNN Prediction Function
 distances.append saves how far each training point is from the test point, along
with its label.
 distances.sort is used to sorts the list so the nearest points come first.
 k_nearest_labels picks the labels of the k closest points.
 Uses Counter to find which label appears most among those k labels that
becomes the prediction.
def knn_predict(training_data, training_labels, test_point, k):
    distances = []
    for i in range(len(training_data)):
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]
4. Training Data, Labels and Test Point
training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k=3
5. Prediction
prediction = knn_predict(training_data, training_labels, test_point, k)
print(prediction)
Output:
A
The algorithm calculates the distances of the test point [4, 5] to all training points
selects the 3 closest points as k = 3 and determines their labels. Since the majority of
the closest points are labelled 'A' the test point is classified as 'A'.
In machine learning we can also use the scikit-learn Python library, which has built-in functions to build a KNN model; for that you can refer to Implementation of KNN classifier using Sklearn.
Applications of KNN
 Recommendation Systems: Suggests items like movies or products by finding
users with similar preferences.
 Spam Detection: Identifies spam emails by comparing new emails to known
spam and non-spam examples.
 Customer Segmentation: Groups customers by comparing their shopping
behavior to others.
 Speech Recognition: Matches spoken words to known patterns to convert
them into text.
Advantages of KNN
 Simple to use: Easy to understand and implement.
 No training step: No need to train as it just stores the data and uses it during
prediction.
 Few parameters: Only needs to set the number of neighbors (k) and a distance
method.
 Versatile: Works for both classification and regression problems.
Disadvantages of KNN
 Slow with large data: Needs to compare every point during prediction.
 Struggles with many features: Accuracy drops when data has too many
features.
 Can Overfit: It can overfit especially when the data is high-dimensional or not
clean.

Support Vector Machine (SVM) is a supervised
machine learning algorithm used for classification and regression tasks. It tries to find
the best boundary known as hyperplane that separates different classes in the data. It
is useful when you want to do binary classification like spam vs. not spam or cat vs.
dog.
The main goal of SVM is to maximize the margin between the two classes. The larger
the margin the better the model performs on new and unseen data.
Key Concepts of Support Vector Machine
 Hyperplane: A decision boundary separating different classes in feature space
and is represented by the equation wx + b = 0 in linear classification.
 Support Vectors: The closest data points to the hyperplane, crucial for
determining the hyperplane and margin in SVM.
 Margin: The distance between the hyperplane and the support vectors. SVM
aims to maximize this margin for better classification performance.
 Kernel: A function that maps data to a higher-dimensional space enabling SVM
to handle non-linearly separable data.
 Hard Margin: A maximum-margin hyperplane that perfectly separates the data
without misclassifications.
 Soft Margin: Allows some misclassifications by introducing slack variables,
balancing margin maximization and misclassification penalties when data is not
perfectly separable.
 C: A regularization term balancing margin maximization and misclassification
penalties. A higher C value forces stricter penalty for misclassifications.
 Hinge Loss: A loss function penalizing misclassified points or margin violations
and is combined with regularization in SVM.
 Dual Problem: Involves solving for Lagrange multipliers associated with support
vectors, facilitating the kernel trick and efficient computation.
How does Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best separates
two classes by maximizing the margin between them. This margin is the distance from
the hyperplane to the nearest data points (support vectors) on each side.

Multiple hyperplanes separate the data from two classes
The best hyperplane also known as the "hard margin" is the one that maximizes the
distance between the hyperplane and the nearest data points from both classes. This
ensures a clear separation between the classes. So from the above figure, we choose
L2 as hard margin. Let's consider a scenario like shown below:

Selecting hyperplane for data with outlier
Here, we have one blue ball in the boundary of the red ball.
How does SVM classify the data?
The blue ball in the boundary of red ones is an outlier of blue balls. The SVM algorithm
has the characteristics to ignore the outlier and finds the best hyperplane that
maximizes the margin. SVM is robust to outliers.

Hyperplane which is the most optimized one
A soft margin allows for some misclassifications or violations of the margin to improve
generalization. The SVM optimizes the following equation to balance margin
maximization and penalty minimization:
\text{Objective Function} = \frac{1}{\text{margin}} + \lambda \sum \text{penalty}
The penalty used for violations is often hinge loss which has the following behavior:
 If a data point is correctly classified and within the margin there is no penalty
(loss = 0).
 If a point is incorrectly classified or violates the margin the hinge loss increases
proportionally to the distance of the violation.
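A minimal sketch of this hinge-loss behaviour, assuming labels in {+1, -1} and illustrative decision values:
import numpy as np

def hinge_loss(y_true, decision_value):
    # y_true is +1 or -1; decision_value is w^T x + b for the point
    return np.maximum(0, 1 - y_true * decision_value)

print(hinge_loss(+1, 2.5))   # 0.0 -> correctly classified, outside the margin
print(hinge_loss(+1, 0.4))   # 0.6 -> correct side but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0 -> misclassified, larger penalty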
Till now we were talking about linearly separable data, where a straight (linear) line separates the group of blue balls from the red balls.
What to do if data are not linearly separable?
When data is not linearly separable i.e it can't be divided by a straight line, SVM uses a
technique called kernels to map the data into a higher-dimensional space where it
becomes separable. This transformation helps SVM find a decision boundary even for
non-linear data.
Original 1D dataset for classification
A kernel is a function that maps data points into a higher-dimensional space without
explicitly computing the coordinates in that space. This allows SVM to work efficiently
with non-linear data by implicitly performing the mapping. For example consider data
points that are not linearly separable. By applying a kernel function SVM transforms
the data points into a higher-dimensional space where they become linearly separable.
 Linear Kernel: For linear separability.
 Polynomial Kernel: Maps data into a polynomial space.
 Radial Basis Function (RBF) Kernel: Transforms data into a space based on
distances between data points.

Mapping 1D data to 2D to become able to separate the two classes
In this case the new variable y is created as a function of distance from the origin.
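To see the effect of the kernel choice, here is a hedged sketch comparing a linear and an RBF kernel on scikit-learn's make_circles toy data; the dataset and parameter values are assumptions used only for illustration.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A non-linearly separable toy dataset: one class forms a ring around the other
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for kernel in ["linear", "rbf"]:
    model = SVC(kernel=kernel, C=1.0)
    model.fit(X_train, y_train)
    print(kernel, "accuracy:", round(model.score(X_test, y_test), 2))
# The RBF kernel separates the two rings far better than the linear kernel can.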
Mathematical Computation of SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We
have a training dataset consisting of input feature vectors X and their corresponding
class labels Y. The equation for the linear hyperplane can be written as:
w^T x + b = 0
Where:
 w is the normal vector to the hyperplane (the direction perpendicular to it).
 b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector w.
Distance from a Data Point to the Hyperplane
The distance between a data point x_i and the decision boundary can be calculated as:

d_i = \frac{w^T x_i + b}{\|w\|}
where \|w\| represents the Euclidean norm of the weight vector w (the normal vector to the hyperplane).
Linear SVM Classifier
Prediction:
\hat{y} = \begin{cases} 1 & \text{if } w^T x + b \geq 0 \\ 0 & \text{if } w^T x + b < 0 \end{cases}
Where \hat{y} is the predicted label of a data point.
Optimization Problem for SVM
For a linearly separable dataset the goal is to find the hyperplane that maximizes the
margin between the two classes while ensuring that all data points are correctly
classified. This leads to the following optimization problem:

\min_{w, b} \; \frac{1}{2}\|w\|^2
Subject to the constraint:

y_i(w^T x_i + b) \geq 1 \quad \text{for } i = 1, 2, 3, \cdots, m
Where:
 y_i is the class label (+1 or -1) for each training instance.
 x_i is the feature vector for the i-th training instance.
 m is the total number of training instances.
The condition y_i(w^T x_i + b) \geq 1 ensures that each data point is correctly classified and lies outside the margin.
Soft Margin in Linear SVM Classifier
In the presence of outliers or non-separable data the SVM allows some misclassification by introducing slack variables \zeta_i. The optimization problem is modified as:
\min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \zeta_i
Subject to the constraints:
y_i(w^T x_i + b) \geq 1 - \zeta_i \quad \text{and} \quad \zeta_i \geq 0 \quad \text{for } i = 1, 2, \ldots, m
Where:
 C is a regularization parameter that controls the trade-off between margin maximization and the penalty for misclassifications.
 \zeta_i are slack variables that represent the degree of violation of the margin by each data point.
Dual Problem for SVM
The dual problem involves maximizing the Lagrange multipliers associated with the
support vectors. This transformation allows solving the SVM optimization using kernel
functions for non-linear classification.
The dual objective function is given by:
\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j t_i t_j K(x_i, x_j)
Where:
 \alpha_i are the Lagrange multipliers associated with the i-th training sample.
 t_i is the class label for the i-th training sample.
 K(x_i, x_j) is the kernel function that computes the similarity between data points x_i and x_j. The kernel allows SVM to handle non-linear classification problems by mapping data into a higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers \alpha_i, and the support vectors are those training samples where \alpha_i > 0.
SVM Decision Boundary
Once the dual problem is solved, the decision function is given by:
f(x) = \sum_{i=1}^{m} \alpha_i t_i K(x_i, x) + b
Where x is the test data point, b is the bias term and the sum effectively runs over the support vectors (the samples with \alpha_i > 0). Finally the bias term b is determined by the support vectors, which satisfy:
t_i(w^T x_i - b) = 1 \;\Rightarrow\; b = w^T x_i - t_i
Where x_i is any support vector.
This completes the mathematical framework of the Support Vector Machine algorithm
which allows for both linear and non-linear classification using the dual problem and
kernel trick.
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data
points of different classes. When the data can be precisely linearly separated,
linear SVMs are very suitable. This means that a single straight line (in 2D) or a
hyperplane (in higher dimensions) can entirely divide the data points into their
respective classes. A hyperplane that maximizes the margin between the
classes is the decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot
be separated into two classes by a straight line (in the case of 2D). By using
kernel functions, nonlinear SVMs can handle nonlinearly separable data. The
original input data is transformed by these kernel functions into a higher-
dimensional feature space where the data points can be linearly separated. A
linear SVM is used to locate a nonlinear decision boundary in this modified
space.
Implementing SVM Algorithm in Python
Here we predict whether a cancer is benign or malignant, using historical data about patients diagnosed with cancer in which each case is described by a set of independent attributes.
 Load the breast cancer dataset from sklearn.datasets
 Separate input features and target variables.
 Build and train the SVM classifiers using RBF kernel.
 Plot the scatter plot of the input features.
# Load the important packages
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# Load the dataset
cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

# Build the model
svm = SVC(kernel="rbf", gamma=0.5, C=1.0)
# Train the model
svm.fit(X, y)

# Plot the decision boundary
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    cmap=plt.cm.Spectral,
    alpha=0.8,
    xlabel=cancer.feature_names[0],
    ylabel=cancer.feature_names[1],
)

# Scatter plot
plt.scatter(X[:, 0], X[:, 1],
            c=y,
            s=20, edgecolors="k")
plt.show()
Output:
Breast Cancer Classifications with SVM RBF kernel
Advantages of Support Vector Machine (SVM)
1. High-Dimensional Performance: SVM excels in high-dimensional spaces,
making it suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial SVM
effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers,
enhancing robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary classification
and multiclass classification suitable for applications in text classification.
5. Memory Efficiency: It focuses on support vectors making it memory efficient
compared to other algorithms.
Disadvantages of Support Vector Machine (SVM)
1. Slow Training: SVM can be slow for large datasets, affecting performance in
SVM in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting
parameters like C requires careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher
dimensions makes SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential, otherwise SVM
models may perform poorly.

K-Means Clustering is an Unsupervised Machine Learning algorithm which groups an unlabeled dataset into different clusters. It is used to organize data into groups based on their similarity.
Understanding K-means Clustering

For example online store uses K-Means to group customers based on purchase
frequency and spending creating segments like Budget Shoppers, Frequent Buyers and
Big Spenders for personalised marketing.
The algorithm works by first randomly picking some central points called centroids and
each data point is then assigned to the closest centroid forming a cluster. After all the
points are assigned to a cluster the centroids are updated by finding the average
position of the points in each cluster. This process repeats until the centroids stop
changing forming clusters. The goal of clustering is to divide the data points into
clusters so that similar data points belong to same group.
How k-means clustering works?
We are given a data set of items with certain features and values for these features like
a vector. The task is to categorize those items into groups. To achieve this we will use
the K-means algorithm. 'K' in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.

K means Clustering
The algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity we will use the Euclidean distance as a measurement. The
algorithm works as follows:
1. First we randomly initialize k points called means or cluster centroids.
2. We categorize each item to its closest mean and we update the mean's
coordinates, which are the averages of the items categorized in that cluster so
far.
3. We repeat the process for a given number of iterations and at the end, we have
our clusters.
The "points" mentioned above are called means because they are the mean values of
the items categorized in them. To initialize these means, we have a lot of options. An
intuitive method is to initialize the means at random items in the data set. Another
method is to initialize the means at random values between the boundaries of the data
set. For example, if a feature x has values in [0,3], we will initialize the means with values for x within [0,3].
Selecting the right number of clusters is important for meaningful segmentation. To do this we use the Elbow Method for the optimal value of k in KMeans, a graphical tool used to determine the optimal number of clusters (k) in K-means, sketched below.
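Here is a minimal sketch of the Elbow Method on the same kind of blobs dataset used in the implementation below; the range of k values tried is an assumption for illustration.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

# Fit K-Means for several values of k and record the inertia (within-cluster sum of squares)
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=23).fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# The "elbow" of the curve (around k = 3 for this data) suggests the optimal k.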
Implementation of K-Means Clustering in Python
We will use blobs datasets and show how clusters are made.
Step 1: Importing the necessary libraries
We are importing Numpy, Matplotlib and scikit learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Step 2: Create custom dataset with make_blobs and plot it
X,y = make_blobs(n_samples = 500,n_features = 2,centers = 3,random_state = 23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()
Output:
Clustering dataset
Step 3: Initializing random centroids
The code initializes three clusters for K-means clustering. It sets a random seed and
generates random cluster centers within a specified range and creates an empty list of
points for each cluster.
k = 3

clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    points = []
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster

clusters
Output:

Random Centroids
Step 4: Plotting random initialize center with data points
plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Output:
Data points with random center
The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks
the initial cluster centers (red stars) generated for K-means clustering.
Step 5: Defining Euclidean distance
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2)**2))
Step 6: Creating function Assign and Update the cluster center
This step assigns data points to the nearest cluster center and the M-step updates
cluster centers based on the mean of assigned points in K-means clustering.
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters
Step 7: Creating function to Predict the cluster for the datapoints
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Step 8: Assign, Update and predict the cluster center
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
Step 9: Plotting data points with their predicted cluster center
plt.scatter(X[:,0],X[:,1],c = pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
Output:

K-means Clustering
The plot shows data points colored by their predicted clusters. The red markers
represent the updated cluster centers after the E-M steps in the K-means clustering
algorithm.

Hierarchical clustering is used to group similar data
points together based on their similarity creating a hierarchy or tree-like structure.
The key idea is to begin with each data point as its own separate cluster and then
progressively merge or split them based on their similarity. Let's understand this with the help of an example.
Imagine you have four fruits with different weights: an apple (100g), a banana (120g),
a cherry (50g) and a grape (30g). Hierarchical clustering starts by treating each fruit
as its own group.
 It then merges the closest groups based on their weights.
 First the cherry and grape are grouped together because they are the lightest.
 Next the apple and banana are grouped together.
Finally all the fruits are merged into one large group, showing how hierarchical
clustering progressively combines the most similar data points.
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or
groups of data merge together. The bottom shows each data point as its own group,
and as you move up, similar groups are combined. The lower the merge point, the
more similar the groups are. It helps you see how things are grouped step by step. The
working of the dendrogram can be explained using the below diagram:

Dendrogram
In the above image on the left side there are five points labeled P, Q, R, S and T. These represent individual data points that are being clustered. On the right side there's a dendrogram which shows how these points are grouped together step by step.
 At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
 As you move up, the closest points are merged into a single group.
 The lines connecting the points show how they are progressively merged based
on similarity.
 The height at which they are connected shows how similar the points are to
each other; the shorter the line the more similar they are
Types of Hierarchical Clustering
Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering
(HAC). Unlike flat clustering hierarchical clustering provides a structured way to group
data. This clustering algorithm does not require us to prespecify the number of
clusters. Bottom-up algorithms treat each data as a singleton cluster at the outset and
then successively agglomerate pairs of clusters until all clusters have been merged into
a single cluster that contains all data.

Hierarchical Agglomerative Clustering
Workflow for Hierarchical Agglomerative clustering
1. Start with individual points: Each data point is its own cluster. For example if
you have 5 data points you start with 5 clusters each containing just one data
point.
2. Calculate distances between clusters: Calculate the distance between every
pair of clusters. Initially since each cluster has one point this is the distance
between the two data points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance
and merge them into a single cluster.
4. Update distance matrix: After merging you now have one less cluster.
Recalculate the distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the
distance matrix until you have only one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging
of clusters using a tree-like diagram called a dendrogram. It shows the
hierarchy of how clusters are merged.
Python implementation of the above algorithm using the scikit-learn library:
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

clustering = AgglomerativeClustering(n_clusters=2).fit(X)

print(clustering.labels_)
Output :
[1, 1, 1, 0, 0, 0]
Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not require to
prespecify the number of clusters. Top-down clustering requires a method for splitting
a cluster that contains the whole data and proceeds by splitting clusters recursively
until individual data have been split into singleton clusters.
Workflow for Hierarchical Divisive clustering :
1. Start with all data points in one cluster: Treat the entire dataset as a single
large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is
typically done by finding the two most dissimilar points in the cluster and using
them to separate the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until
every data point is its own cluster, or the stopping condition (such as a
predefined number of clusters) is met.

Hierarchical Divisive clustering
Computing Distance Matrix
While merging two clusters we check the distance between every pair of clusters and merge the pair with the least distance/most similarity. But the question is how that distance is determined. There are different ways of defining inter-cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any two points of the
cluster.
2. Max Distance: Find the maximum distance between any two points of the
cluster.
3. Group Average: Find the average distance between every two points of the
clusters.
4. Ward's Method: The similarity of two clusters is based on the increase in
squared error when two clusters are merged.
Distance Matrix Comparison in Hierarchical Clustering
Implementation code for Distance Matrix Comparison
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

Z = linkage(X, 'ward')  # Ward distance

dendrogram(Z)  # plotting the dendrogram

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()
Output:
Hierarchical Clustering Dendrogram
Hierarchical clustering is a widely used unsupervised learning technique that organizes data into a tree-like structure, allowing us to visualize relationships between data points using a dendrogram. Unlike flat clustering methods it does not require a predefined number of clusters and provides a structured way to explore data similarity.

PCA (Principal Component Analysis) is
a dimensionality reduction technique used in data analysis and machine learning. It
helps you to reduce the number of features in a dataset while keeping the most
important information. It changes your original features into new features; these new features don't overlap with each other and the first few keep most of the important differences found in the original data.
PCA is commonly used for data preprocessing for use with machine learning
algorithms. It helps to remove redundancy, improve computational efficiency and
make data easier to visualize and analyze especially when dealing with high-
dimensional data.
How Principal Component Analysis Works
PCA uses linear algebra to transform data into new features called principal
components. It finds these by calculating eigenvectors (directions) and eigenvalues
(importance) from the covariance matrix. PCA selects the top components with the highest eigenvalues and projects the data onto them to simplify the dataset.
Note: It prioritizes the directions where the data varies the most because more
variation = more useful information.
Imagine you’re looking at a messy cloud of data points like stars in the sky and want to
simplify it. PCA helps you find the "most important angles" to view this cloud so you
don’t miss the big patterns. Here’s how it works step by step:
Step 1: Standardize the Data
Different features may have different units and scales like salary vs. age. To compare
them fairly PCA first standardizes the data by making each feature have:
 A mean of 0
 A standard deviation of 1
Z = \frac{X - \mu}{\sigma}
where:
 \mu is the mean of the independent features, \mu = \{\mu_1, \mu_2, \cdots, \mu_m\}
 \sigma is the standard deviation of the independent features, \sigma = \{\sigma_1, \sigma_2, \cdots, \sigma_m\}
Step 2: Calculate Covariance Matrix
Next PCA calculates the covariance matrix to see how features relate to each other
whether they increase or decrease together. The covariance between two features x_1 and x_2 is:
cov(x_1, x_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n - 1}
Where:
 \bar{x}_1 and \bar{x}_2 are the mean values of features x_1 and x_2
 n is the number of data points
The value of covariance can be positive, negative or zero.
Step 3: Find the Principal Components
PCA identifies new axes where the data spreads out the most:
 1st Principal Component (PC1): The direction of maximum variance (most
spread).
 2nd Principal Component (PC2): The next best direction, perpendicular to
PC1 and so on.
These directions come from the eigenvectors of the covariance matrix and their
importance is measured by eigenvalues. For a square matrix A an eigenvector X (a
non-zero vector) and its corresponding eigenvalue λ satisfy:
AX = \lambda X
This means:
 When A acts on X it only stretches or shrinks X by the scalar λ.
 The direction of X remains unchanged hence eigenvectors define "stable
directions" of A.
Eigenvalues help rank these directions by importance.
Step 4: Pick the Top Directions & Transform Data
After calculating the eigenvalues and eigenvectors PCA ranks them by the amount of
information they capture. We then:
1. Select the top k components that capture most of the variance, such as 95%.
2. Transform the original dataset by projecting it onto these top components.
This means we reduce the number of features (dimensions) while keeping the
important patterns in the data.

Transform this 2D dataset into a 1D representation while preserving as much variance as possible.
In the above image the original dataset has two features "Radius" and "Area"
represented by the black axes. PCA identifies two new directions: PC₁ and PC₂ which
are the principal components.
 These new axes are rotated versions of the original ones. PC₁ captures the
maximum variance in the data meaning it holds the most information
while PC₂ captures the remaining variance and is perpendicular to PC₁.
 The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is
chosen for dimensionality reduction. By projecting the data points (blue
crosses) onto PC₁ we effectively transform the 2D data into 1D and retain most
of the important structure and patterns.
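Before the scikit-learn implementation below, here is a hedged from-scratch sketch of Steps 1-4 using NumPy; the random toy data is an assumption used only for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # toy data: 100 samples, 3 features (an assumption)

# Step 1: standardize each feature to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues (variance captured) and eigenvectors (directions)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: keep the top 2 components and project the data onto them
X_pca = X_std @ eigenvectors[:, :2]
print("Explained variance ratio:", eigenvalues[:2] / eigenvalues.sum())
print("Reduced shape:", X_pca.shape)  # (100, 2)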
Implementation of Principal Component Analysis in Python
Hence PCA uses a linear transformation that is based on preserving the most variance
in the data using the least number of dimensions. It involves the following steps:
Step 1: Importing Required Libraries
We import the necessary libraries like pandas, numpy, scikit-learn, seaborn and matplotlib to visualize results.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Creating Sample Dataset
We make a small dataset with three features (Height, Weight, Age) and a Gender label.
data = {
'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
'Gender': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0] # 1 = Male, 0 = Female
}
df = pd.DataFrame(data)
print(df)
Output:
Dataset
Step 3: Standardizing the Data
Since the features have different scales Height vs Age we standardize the data. This
makes all features have mean = 0 and standard deviation = 1 so that no feature
dominates just because of its units.
X = df.drop('Gender', axis=1)
y = df['Gender']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # scale only the three feature columns, not the Gender label
Step 4: Applying PCA algorithm
 We reduce the data from 3 features to 2 new features called principal
components. These components capture most of the original information but in
fewer dimensions.
 We split the data into 70% training and 30% testing sets.
 We train a logistic regression model on the reduced training data and predict
gender labels on the test set.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
Step 5: Evaluating with Confusion Matrix
The confusion matrix compares actual vs predicted labels. This makes it easy to see
where predictions were correct or wrong.
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Female', 'Male'],
yticklabels=['Female', 'Male'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Output:

Confusion matrix
Step 6: Visualizing PCA Result
y_numeric = pd.factorize(y)[0]

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_numeric, cmap='coolwarm', edgecolor='k', s=80)
plt.xlabel('Original Feature 1')
plt.ylabel('Original Feature 2')
plt.title('Before PCA: Using First 2 Standardized Features')
plt.colorbar(label='Target classes')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric, cmap='coolwarm', edgecolor='k', s=80)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('After PCA: Projected onto 2 Principal Components')
plt.colorbar(label='Target classes')

plt.tight_layout()
plt.show()
Output:
PCA Algorithm
 Left Plot Before PCA: This shows the original standardized data plotted using
the first two features. There is no guarantee of clear separation between
classes as these are raw input dimensions.
 Right Plot After PCA: This displays the transformed data using the top 2
principal components. These new components capture the maximum
variance often showing better class separation and structure making it easier
to analyze or model.
Advantages of Principal Component Analysis
1. Multicollinearity Handling: Creates new, uncorrelated variables to address
issues when original features are highly correlated.
2. Noise Reduction: Eliminates components with low variance, enhancing data
clarity.
3. Data Compression: Represents data with fewer components, reducing storage
needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones
deviate significantly in the reduced space.
Disadvantages of Principal Component Analysis
1. Interpretation Challenges: The new components are combinations of original
variables which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data before application or
results may be misleading.
3. Information Loss: Reducing dimensions may lose some important information
if too few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are
linear and may struggle with non-linear data.
5. Computational Complexity: Can be slow and resource-intensive on very large
datasets.
6. Risk of Overfitting: Using too many components or working with a small
dataset might lead to models that don't generalize well.

Apriori Algorithm is a basic method used in data analysis to find groups of items that often appear together in large sets of data. It helps
to discover useful patterns or rules about how items are related which is
particularly valuable in market basket analysis.
Like in a grocery store if many customers buy bread and butter together, the
store can use this information to place these items closer or create special
offers. This helps the store sell more and make customers happy.
How the Apriori Algorithm Works?
The Apriori Algorithm operates through a systematic process that involves
several key steps:
1. Identifying Frequent Itemsets
 The Apriori algorithm starts by looking through all the data to count how many
times each single item appears. These single items are called 1-itemsets.
 Next it uses a rule called minimum support: a threshold that tells us how often an
item or group of items needs to appear to be considered important. If an item
appears often enough, meaning its count is above this minimum support, it is
called a frequent itemset.
2. Creating Possible Item Groups
 After finding the single items that appear often enough (frequent 1-item
groups) the algorithm combines them to create pairs of items (2-item groups).
Then it checks which pairs are frequent by seeing if they appear enough times
in the data.
 This process keeps going step by step making groups of 3 items, then 4 items
and so on. The algorithm stops when it can’t find any bigger groups that
happen often enough.
3. Removing Infrequent Item Groups
 The Apriori algorithm uses a helpful rule to save time. This rule says: if a group
of items does not appear often enough, then any larger group that includes
these items will also not appear often.
 Because of this, the algorithm does not check those larger groups. This way it
avoids wasting time looking at groups that won't be important, making the whole
process faster.
4. Generating Association Rules
 The algorithm makes rules to show how items are related.
 It checks these rules using support, confidence and lift to find the strongest
ones.
Key Metrics of Apriori Algorithm
 Support: This metric measures how frequently an item appears in the dataset
relative to the total number of transactions. A higher support indicates a more
significant presence of the itemset in the dataset. Support tells us how often a
particular item or combination of items appears in all the transactions ("Bread
is bought in 20% of all transactions.")
 Confidence: Confidence assesses the likelihood that an item Y is purchased
when item X is purchased. It provides insight into the strength of the
association between two items. Confidence tells us how often items go
together. ("If bread is bought, butter is bought 75% of the time.")
 Lift: Lift evaluates how much more likely two items are to be purchased
together compared to being purchased independently. A lift greater than 1
suggests a strong positive association. Lift shows how strong the connection is
between items. ("Bread and butter are much more likely to be bought together
than by chance.")
Let's understand the concept of the Apriori Algorithm with the help of an example.
Consider the following dataset; we will find frequent itemsets and generate
association rules from it:

Transactions of a Grocery Shop


Step 1 : Setting the parameters
 Minimum Support Threshold: 50% (item must appear in at least 3/5
transactions). This threshold is formulated from this formula:
\text{Support}(A) = \frac{\text{Number of transactions containing itemset } A}{\text{Total number of transactions}}
 Minimum Confidence Threshold: 70% ( You can change the value of
parameters as per the use case and problem statement ). This threshold is
formulated from this formula:

\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}
Step 2: Find Frequent 1-Itemsets
Lets count how many transactions include each item in the dataset (calculating
the frequency of each item).

Frequent 1-Itemsets
All items have support ≥ 50%, so they qualify as frequent 1-itemsets. If any item
had support < 50%, it would be omitted from the frequent 1-itemsets.
Step 3: Generate Candidate 2-Itemsets
Combine the frequent 1-itemsets into pairs and calculate their support. For this
use case we get 3 item pairs, (bread, butter), (bread, milk) and (butter, milk),
and we calculate their support in the same way as in step 2.

Candidate 2-Itemsets
Frequent 2-itemsets: {Bread, Milk} meets the 50% threshold, but {Butter, Milk}
and {Bread, Butter} do not meet the threshold, so they will be omitted.
Step 4: Generate Candidate 3-Itemsets
Combine the frequent 2-itemsets into groups of 3 and calculate their support.
For the triplet we have only one case, i.e. {Bread, Butter, Milk}, and we
calculate its support.

Candidate 3-Itemsets
Since this does not meet the 50% threshold, there are no frequent 3-itemsets.
Step 5: Generate Association Rules
Now we generate rules from the frequent itemsets and calculate confidence.
Rule 1: If Bread → Butter (if customer buys bread, the customer will buy
butter also)
 Support of {Bread, Butter} = 2.
 Support of {Bread} = 4.
 Confidence = 2/4 = 50% (Failed threshold).
Rule 2: If Butter → Bread (if customer buys butter, the customer will buy
bread also)
 Support of {Bread, Butter} = 3.
 Support of {Butter} = 3.
 Confidence = 3/3 = 100% (Passes threshold).
Rule 3: If Bread → Milk (if customer buys bread, the customer will buy milk
also)
 Support of {Bread, Milk} = 3.
 Support of {Bread} = 4.
 Confidence = 3/4 = 75% (Passes threshold).
The Apriori Algorithm, as demonstrated in the bread-butter example, is widely
used in modern startups like Zomato, Swiggy and other food delivery platforms.
These companies use it to perform market basket analysis which helps them
identify customer behaviour patterns and optimise recommendations.
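As a rough sketch of how this could be done in Python, the third-party mlxtend library provides an Apriori implementation. The transactions below are illustrative rather than the exact table used above, and the association_rules signature can vary slightly between mlxtend versions:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative grocery transactions (not the exact table from the example above)
transactions = [
    ['Bread', 'Butter'],
    ['Bread', 'Milk'],
    ['Bread', 'Butter', 'Milk'],
    ['Bread', 'Milk'],
    ['Butter', 'Milk'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
basket = pd.DataFrame(te_array, columns=te.columns_)

# Frequent itemsets with a 50% minimum support threshold
frequent_itemsets = apriori(basket, min_support=0.5, use_colnames=True)
print(frequent_itemsets)

# Association rules with a 70% minimum confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])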
Applications of Apriori Algorithm
Below are some applications of Apriori algorithm used in today's companies
and startups
1. E-commerce: Used to recommend products that are often bought together like
laptop + laptop bag, increasing sales.
2. Food Delivery Services: Identifies popular combos such as burger + fries, to
offer combo deals to customers.
3. Streaming Services: Recommends related movies or shows based on what
users often watch together like action + superhero movies.
4. Financial Services: Analyzes spending habits to suggest personalised offers such
as credit card deals based on frequent purchases.
5. Travel & Hospitality: Creates travel packages like flight + hotel by finding
commonly purchased services together.
6. Health & Fitness: Suggests workout plans or supplements based on users' past
activities like protein shakes + workouts.

Classification Metrics
In a classification task, our main task is to predict the target variable, which is in the
form of discrete values. To evaluate the performance of such a model, following are
the commonly used evaluation metrics:
 Accuracy
 Logarithmic Loss
 Area Under Curve
 Precision
 Recall
 F1 Score
 Confusion Matrix
Accuracy
Accuracy is a fundamental metric for evaluating the performance of a classification
model, providing a quick snapshot of how well the model is performing in terms of
correct predictions. It is calculated as the ratio of correct predictions to the total
number of input samples.

It works well when there are an equal number of samples for each class. For example, suppose
90% of the samples in our training set belong to class A and only 10% to class B. A model
that simply predicts class A for every sample will then achieve 90% accuracy on the training
set. If we test the same model on a test set with 60% of samples from class A and 40% from
class B, the accuracy falls to 60%. Accuracy is useful, but it can give a false sense of
achieving high performance, because the chance of misclassifying minority-class samples is
very high.
Logarithmic Loss
Log loss penalizes confident but incorrect classifications. It usually works well with
multi-class classification. To work with log loss, the classifier must assign a probability
to each class for every sample. If there are N samples belonging to M classes, then we
calculate the Log loss in this way:
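In standard notation, the multi-class log loss is:
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})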

Now the Terms,


 y_ij indicates whether sample i belongs to class j (1 if it does, 0 otherwise).
 p_ij is the predicted probability that sample i belongs to class j.
 The range of log loss is [0, ∞). A log loss near 0 indicates high accuracy, while
values further from 0 indicate lower accuracy.
 As a bonus point, minimizing log loss generally gives you higher accuracy for
the classifier.
Area Under Curve (AUC)
It is one of the most widely used metrics and is mainly used for binary classification. The AUC
of a classifier is the probability that the classifier will rank a randomly chosen positive
example higher than a randomly chosen negative example. Before going into AUC further, let me
make you comfortable with a few basic terms.
True Positive Rate:
Also termed sensitivity. The True Positive Rate is the proportion of positive data points
that are correctly classified as positive, out of all data points that are actually positive.

True Negative Rate


Also termed specificity. The True Negative Rate is the proportion of negative data points
that are correctly classified as negative, out of all data points that are actually negative.

False Positive Rate


The False Positive Rate is the proportion of actual negatives that are incorrectly
identified as positives.
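For reference, these rates can be written in terms of the confusion-matrix counts defined in the Confusion Matrix section below:
\text{TPR} = \frac{TP}{TP + FN}, \quad \text{TNR} = \frac{TN}{TN + FP}, \quad \text{FPR} = \frac{FP}{FP + TN}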

False Positive Rate and True Positive Rate both have values in the range [0, 1]. So what is
AUC then? The ROC curve plots the True Positive Rate against the False Positive Rate at
different classification thresholds, and AUC is the area under this curve, also in the range
[0, 1]. The greater the value of AUC, the better the performance of the model.
ROC Curve for Evaluation of Classification Models
Precision
There is another metric named Precision. Precision is a measure of a model’s
performance that tells you how many of the positive predictions made by the model
are actually correct.

Recall
Recall is the ratio of correctly predicted positive instances to the total actual positive
instances. It measures how well the model captures all relevant positive cases.

F1 Score
F1-Score is a harmonic mean between recall and precision. Its range is [0,1]. This
metric usually tells us how precise (correctly classifies how many
instances) and robust (does not miss any significant number of instances) our
classifier is.
Higher precision with lower recall can give you seemingly great accuracy, but the model then
misses a large number of instances. The higher the F1 score, the better the performance. It
can be expressed mathematically in this way:
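For reference, with TP, FP and FN as defined in the Confusion Matrix section below:
\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}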

Confusion Matrix
A confusion matrix is an N x N matrix, where N is the number of classes or categories to be
predicted. Here we have N = 2, so we get a 2 x 2 matrix. Suppose we have a binary
classification problem whose samples belong to either Yes or No. We build a classifier that
predicts the class for a new input sample. After testing the model with 165 samples, we get
the following result.

There are 4 terms you should keep in mind:


1. True Positives: It is the case where we predicted Yes and the real output was
also Yes.
2. True Negatives: It is the case where we predicted No and the real output was
also No.
3. False Positives: It is the case where we predicted Yes but it was actually No.
4. False Negatives: It is the case where we predicted No but it was actually Yes.
The accuracy of the model is calculated from the confusion matrix by summing the values on
the main diagonal (the correct predictions) and dividing by the total number of samples, i.e.
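\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}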

Regression Evaluation Metrics


In the regression task, we are supposed to predict the target variable which is in the
form of continuous values. To evaluate the performance of such a model below
mentioned evaluation metrics are used:
 Mean Absolute Error
 Mean Squared Error
 Root Mean Square Error
 Root Mean Square Logarithmic Error
 R2 - Score
Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is the average absolute difference between the predicted and actual
values. Basically, it tells how far, on average, our predictions are from the actual output.
However, it has one limitation: it doesn't give any idea about the direction of the error,
i.e. whether we are under-predicting or over-predicting our data. It can be
represented mathematically in this way:
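\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
where y_i is the actual value, ŷ_i the predicted value and N the number of samples.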

Mean Squared Error (MSE)


MSE is similar to Mean Absolute Error, but the difference is that it takes the average of the
squared differences between the predicted and original values. The main advantage of this
metric is that its gradient is easier to calculate, whereas Mean Absolute Error requires more
complicated programming tools to compute the gradient. By squaring the errors it emphasizes
larger errors more than smaller ones, so we
can focus more on larger errors. It can be expressed mathematically in this way.
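\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2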

Root Mean Square Error (RMSE)


RMSE is a metric obtained by taking the square root of the MSE value. Like MSE, RMSE is not
robust to outliers, since it gives higher weightage to the large errors in predictions.
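For reference:
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}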
Root Mean Squared Logarithmic Error (RMSLE)
There are times when the target variable varies in a wide range of values. And hence
we do not want to penalize the overestimation of the target values but penalize the
underestimation of the target values. For such cases, RMSLE is used as an evaluation
metric which helps us to achieve the above objective.
Making some changes to the RMSE formula gives us the RMSLE formula, as shown below:
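\text{RMSLE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2}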

R2 - Score
The coefficient of determination, also called the R2 score, is used to evaluate the
performance of a linear regression model. It is the proportion of variation in the output
(dependent) attribute that is predictable from the input independent variable(s). It is
used to check how well the observed results are reproduced by the model, based on
the proportion of total variation that the model explains.
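In terms of the residual and total sums of squares:
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
where ȳ is the mean of the actual values.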

Regularization is an important technique in machine learning that helps to improve model
accuracy by preventing overfitting, which happens when a model learns the training data too
well, including noise and outliers, and then performs poorly on new data. By adding a penalty
for complexity it helps simpler models perform better on new data. In this article, we will
see the main types of regularization i.e. Lasso, Ridge and Elastic Net and see how they help
to build more reliable models.
Table of Content
 Types of Regularization
 What are Overfitting and Underfitting?
 What are Bias and Variance?
 Benefits of Regularization
Types of Regularization
1. Lasso Regression
A regression model which uses the L1 Regularization technique is called LASSO (Least
Absolute Shrinkage and Selection Operator) regression. It adds the absolute value of
magnitude of the coefficient as a penalty term to the loss function(L). This penalty can
shrink some coefficients to zero which helps in selecting only the important features
and ignoring the less important ones.
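One common way to write the Lasso objective, using the symbols defined below (with w_j the model coefficients and λ the penalty strength), is:
L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} |w_j|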

where
 m = Number of features
 n = Number of examples
 y_i = Actual target value
 ŷ_i = Predicted target value
Lets see how to implement this using python:
 X, y = make_regression(n_samples=100, n_features=5, noise=0.1,
random_state=42) : Generates a regression dataset with 100 samples, 5
features and some noise.
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42) : Splits the data into 80% training and 20% testing sets.
 lasso = Lasso(alpha=0.1) : Creates a Lasso regression model with regularization
strength alpha set to 0.1.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

mse = mean_squared_error(y_test, y_pred)


print(f"Mean Squared Error: {mse}")

print("Coefficients:", lasso.coef_)
Output:

Lasso Regression
The output shows the model's prediction error and the importance of features with
some coefficients reduced to zero due to L1 regularization.
2. Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge
regression. It adds the squared magnitude of the coefficient as a penalty term to the
loss function(L).
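One common way to write the Ridge objective, using the symbols defined below, is:
L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} w_i^2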

where,
 n = Number of examples or data points
 m = Number of features i.e. predictor variables
 y_i = Actual target value for the i-th example
 ŷ_i = Predicted target value for the i-th example
 w_i = Coefficients of the features
 λ = Regularization parameter that controls the strength of regularization
Lets see how to implement this using python:
 ridge = Ridge(alpha=1.0) : Creates a Ridge regression model with regularization
strength alpha set to 1.0.
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

mse = mean_squared_error(y_test, y_pred)


print("Mean Squared Error:", mse)
print("Coefficients:", ridge.coef_)
Output:

Ridge Regression
The output shows the MSE showing model performance. Lower MSE means better
accuracy. The coefficients reflect the regularized feature weights.
3. Elastic Net Regression
Elastic Net Regression combines both L1 and L2 regularization: we add the absolute norm of
the weights as well as the squared norm of the weights, with an extra hyperparameter that
controls the ratio between the L1 and L2 penalties.
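One common way to write the Elastic Net objective, using the symbols defined below, is:
L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{i=1}^{m} |w_i| + (1 - \alpha) \sum_{i=1}^{m} w_i^2 \right)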

where
 n = Number of examples (data points)
 m = Number of features (predictor variables)
 y_i = Actual target value for the i-th example
 ŷ_i = Predicted target value for the i-th example
 w_i = Coefficients of the features
 λ = Regularization parameter that controls the strength of regularization
 α = Mixing parameter where 0 ≤ α ≤ 1; α = 1 corresponds to Lasso (L1) regularization,
α = 0 corresponds to Ridge (L2) regularization and values between 0 and 1 provide a
balance of both L1 and L2 regularization
Lets see how to implement this using python:
 model = ElasticNet(alpha=1.0, l1_ratio=0.5) : Creates an Elastic Net model with
regularization strength alpha=1.0 and L1/L2 mixing ratio 0.5.
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = ElasticNet(alpha=1.0, l1_ratio=0.5)


model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)


print("Coefficients:", model.coef_)
Output:

Elastic Net Regression


The output shows MSE which measures how far off predictions are from actual values
i.e lower is better and coefficients show feature importance.
Learn more about the difference between the regularization techniques here: Lasso vs
Ridge vs Elastic Net
What are Overfitting and Underfitting?
Overfitting and underfitting are terms used to describe the performance of machine
learning models in relation to their ability to generalize from the training data to
unseen data.

Overfitting happens when a machine learning model learns the training data too well,
including the noise and random details. This makes the model perform poorly on new, unseen
data because it memorizes the training data instead of learning the general patterns.
For example, if we only study last week's weather to predict tomorrow's, our model might
focus on one-time events like a sudden rainstorm, which won't help for future predictions.
Underfitting is the opposite problem which happens when the model is too simple to
learn even the basic patterns in the data. An underfitted model performs poorly on
both training and new data. To fix this we need to make the model more complex or
add more features.
For example, if we use only the yearly average temperature to predict tomorrow's weather,
the model misses important details like seasonal changes, which results in bad predictions.
What are Bias and Variance?
 Bias refers to the error that occurs when we try to fit a statistical model to
real-world data that does not fit perfectly into a simple mathematical model. If we use
far too simplistic a model to fit the data, we are more likely to face High Bias
(underfitting), the case when the model is unable to learn the patterns in the data at
hand and performs poorly.
 Variance is the error that occurs when the model makes predictions on data it has not
previously seen. High variance (overfitting) occurs when the model learns the noise that
is present in the training data.
Finding a proper balance between the two is also known as the Bias-Variance
Tradeoff which helps us to design an accurate model.
Bias Variance tradeoff
The Bias-Variance Tradeoff refers to the balance between bias and variance which
affect predictive model performance. Finding the right tradeoff is important for
creating models that generalize well to new data.
 The bias-variance tradeoff shows the inverse relationship between bias and
variance. When one decreases, the other tends to increase and vice versa.
 Finding the right balance is important. An overly simple model with high bias
won't capture the underlying patterns while an overly complex model with high
variance will fit the noise in the data.

Benefits of Regularization
Now, let’s see various benefits of regularization which are as follows:
1. Prevents Overfitting: Regularization helps models focus on underlying patterns
instead of memorizing noise in the training data.
2. Improves Interpretability: L1 (Lasso) regularization simplifies models by
reducing less important feature coefficients to zero.
3. Enhances Performance: Prevents excessive weighting of outliers or irrelevant
features, helping to improve overall model accuracy.
4. Stabilizes Models: Reduces sensitivity to minor data changes, which ensures
consistency across different data subsets.
5. Prevents Complexity: Keeps the model from becoming too complex, which is
important for limited or noisy data.
6. Handles Multicollinearity: Reduces the magnitudes of correlated coefficients,
helping to improve model stability.
7. Allows Fine-Tuning: Hyperparameters like alpha and lambda control
regularization strength, helping to balance bias and variance.
8. Promotes Consistency: Ensures reliable performance across different datasets
which reduces the risk of large performance shifts.

Cross-validation is a technique used to check how well a machine learning model performs on unseen data. It splits the data into several parts, trains the
model on some parts and tests it on the remaining part repeating this process multiple
times. Finally the results from each validation step are averaged to produce a more
accurate estimate of the model's performance.
The main purpose of cross validation is to prevent overfitting. If you want to make sure
your machine learning model is not just memorizing the training data but is capable of
adapting to real-world data cross-validation is a commonly used technique.
Types of Cross-Validation
There are several types of cross validation techniques which are as follows:
1. Holdout Validation
In Holdout Validation we perform training on 50% of the given dataset and the remaining 50%
is used for testing. It's a simple and quick way to evaluate a model. The major drawback of
this method is that since we train on only 50% of the dataset, it is possible that the
remaining 50% of the data contains important information that the model never sees, which
can lead to higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method we train on the whole dataset except a single data point and test on that
omitted point, iterating over every data point. In LOOCV the model is trained on n−1 samples
and tested on the one omitted sample, repeating this process for each data point in the
dataset. It has advantages as well as disadvantages.
 An advantage of using this method is that we make use of all data points and
hence it is low bias.
 The major drawback of this method is that it leads to higher variation in the
testing model as we are testing against one data point. If the data point is an
outlier it can lead to higher variation.
 Another drawback is it takes a lot of execution time as it iterates over the
number of data points we have.
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-
validation process maintains the same class distribution as the entire dataset. This is
particularly important when dealing with imbalanced datasets where certain classes
may be under represented. In this method:
 The dataset is divided into k folds while maintaining the proportion of classes in
each fold.
 During each iteration, one-fold is used for testing and the remaining folds are
used for training.
 The process is repeated k times with each fold serving as the test set exactly
once.
Stratified Cross-Validation is essential when dealing with classification problems where
maintaining the balance of class distribution is crucial for the model to generalize well
to unseen data.
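A minimal sketch of stratified cross-validation with scikit-learn's StratifiedKFold; the Iris dataset and logistic regression model here are only illustrative:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Each of the 5 folds keeps the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())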
4. K-Fold Cross Validation
In K-Fold Cross Validation we split the dataset into k subsets, known as folds. We then train
the model on k−1 of the subsets and leave one subset out for evaluating the trained model.
We repeat this k times, with a different subset reserved for testing each time.
Note: It is often suggested that the value of k be 10, as a lower value of k brings the
procedure closer to the simple holdout method while a higher value of k approaches the
LOOCV method.
Example of K Fold Cross Validation
The diagram below shows an example of the training and evaluation subsets generated in
k-fold cross-validation. Here we have 25 instances in total. In the first iteration we use
the first 20 percent of the data for evaluation and the remaining 80 percent for training,
i.e. [0-4] for testing and [5-24] for training. In the second iteration we use the second
subset of 20 percent for evaluation and the remaining data for training, i.e. [5-9] for
testing and [0-4, 10-24] for training, and so on.

Iteration | Training Set Observations | Testing Set Observations
1 | [5-24] | [0-4]
2 | [0-4, 10-24] | [5-9]
3 | [0-9, 15-24] | [10-14]
4 | [0-14, 20-24] | [15-19]
5 | [0-19] | [20-24]

Each iteration uses different subsets for testing and training, ensuring that all data
points are used for both training and testing.
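The split pattern shown in the table can be reproduced with scikit-learn's KFold; a small illustrative sketch:
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(25)  # 25 observations, indexed 0-24 as in the table above
kf = KFold(n_splits=5)  # no shuffling, so each fold is a contiguous block

for i, (train_idx, test_idx) in enumerate(kf.split(X), 1):
    print(f"Iteration {i}: testing {test_idx.tolist()}, training {train_idx.tolist()}")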
Comparison between K-Fold Cross-Validation and Hold Out Method
K-Fold Cross-Validation and Hold Out Method are widely used technique and
sometimes they are confusing so here is the quick comparison between them:
Feature | K-Fold Cross-Validation | Hold-Out Method
Definition | The dataset is divided into 'k' subsets (folds). Each fold gets a turn to be the test set while the others are used for training. | The dataset is split into two sets: one for training and one for testing.
Training Sets | The model is trained 'k' times, each time on a different training subset. | The model is trained once on the training set.
Testing Sets | The model is tested 'k' times, each time on a different test subset. | The model is tested once on the test set.
Bias | Less biased due to multiple splits and testing. | Can have higher bias due to a single split.
Variance | Lower variance, as it tests on multiple splits. | Higher variance, as results depend on the single split.
Computation Cost | High, as the model is trained and tested 'k' times. | Low, as the model is trained and tested only once.
Use in Model Selection | Better for tuning and evaluating model performance due to reduced bias. | Less reliable for model selection, as it might give inconsistent results.
Data Utilization | The entire dataset is used for both training and testing. | Only a portion of the data is used for testing, so some data is not used for validation.
Suitability for Small Datasets | Preferred for small datasets, as it maximizes data usage. | Less ideal for small datasets, as a significant portion is held out for testing.
Risk of Overfitting | Less prone to overfitting due to multiple training and testing cycles. | Higher risk of overfitting as the model is trained on one set.

Advantages and Disadvantages of Cross Validation


Advantages:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by
providing a more robust estimate of the model's performance on unseen data.
2. Model Selection: Cross validation is used to compare different models and
select the one that performs the best on average.
3. Hyperparameter tuning: This is used to optimize the hyperparameters of a
model such as the regularization parameter by selecting the values that result
in the best performance on the validation set.
4. Data Efficient: It allows the use of all the available data for both training and
validation, making it a more data-efficient method compared to traditional
validation techniques.
Disadvantages:
1. Computationally Expensive: It can be computationally expensive especially
when the number of folds is large or when the model is complex and requires a
long time to train.
2. Time-Consuming: It can be time-consuming especially when there are many
hyperparameters to tune or when multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation
can impact the bias-variance tradeoff i.e too few folds may result in high bias
while too many folds may result in high variance.
Python implementation for k fold cross-validation
Step 1: Importing necessary libraries
We will import scikit learn.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris
Step 2: Loading the dataset
let's use the iris dataset which is a multi-class classification in-built dataset.
iris = load_iris()
X, y = iris.data, iris.target
Step 3: Creating SVM classifier
SVC is a Support Vector Classification model from scikit-learn.
svm_classifier = SVC(kernel='linear')
Step 4: Defining the number of folds for cross-validation
Here we will be using 5 folds.
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)
Step 5: Performing k-fold cross-validation
cross_val_results = cross_val_score(svm_classifier, X, y, cv=kf)
Step 6: Evaluation metrics
print("Cross-Validation Results (Accuracy):")
for i, result in enumerate(cross_val_results, 1):
    print(f"Fold {i}: {result * 100:.2f}%")

print(f'Mean Accuracy: {cross_val_results.mean()* 100:.2f}%')


Output:

Cross validation accuracy


The output shows the accuracy scores from each of the 5 folds in the K-fold cross-
validation process. The mean accuracy is the average of these individual scores which
is approximately 97.33% indicating the model's overall performance across all the
folds.

Hyperparameter tuning is the process of selecting the optimal values for a machine learning model's hyperparameters. These are typically set
before the actual training process begins and control aspects of the learning process
itself. They influence the model's performance its complexity and how fast it learns.
For example, the learning rate and the number of neurons in a neural network, or the kernel
size in a support vector machine, can significantly impact how
well the model trains and generalizes. The goal of hyperparameter tuning is to find the
values that lead to the best performance on a given task.
These settings can affect both the speed and quality of the model's performance.
 A high learning rate can cause the model to converge too quickly possibly
skipping over the optimal solution.
 A low learning rate might lead to slower convergence and require more time
and computational resources.
Different models have different hyperparameters and they need to be tuned
accordingly.
Techniques for Hyperparameter Tuning
Models can have many hyperparameters and finding the best combination of
parameters can be treated as a search problem. The two best strategies for
Hyperparameter tuning are:
1. GridSearchCV
GridSearchCV is a brute-force technique for hyperparameter tuning. It trains the model
using all possible combinations of specified hyperparameter values to find the best-
performing setup. It is slow and uses a lot of computer power which makes it hard to
use with big datasets or many settings. It works using below steps:
 Create a grid of potential values for each hyperparameter.
 Train the model for every combination in the grid.
 Evaluate each model using cross-validation.
 Select the combination that gives the highest score.
For example if we want to tune two hyperparameters C and Alpha for a Logistic
Regression Classifier model with the following sets of values:
C = [0.1, 0.2, 0.3, 0.4, 0.5]
Alpha = [0.01, 0.1, 0.5, 1.0]

The grid search technique will construct multiple versions of the model with all
possible combinations of C and Alpha, resulting in a total of 5 * 4 = 20 different
models. The best-performing combination is then chosen.
Example: Tuning Logistic Regression with GridSearchCV
The following code illustrates how to use GridSearchCV . In this below code:
 We generate sample data using make_classification.
 We define a range of C values using logarithmic scale.
 GridSearchCV tries all combinations from param_grid and uses 5-fold cross-
validation.
 It returns the best hyperparameter (C) and its corresponding validation score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Output:
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853
This represents the highest accuracy achieved by the model using the hyperparameter
combination C = 0.0061. The best score of 0.853 means the model achieved 85.3%
accuracy on the validation data during the grid search process.
2. RandomizedSearchCV
As the name suggests RandomizedSearchCV picks random combinations of
hyperparameters from the given ranges instead of checking every single combination
like GridSearchCV.
 In each iteration it tries a new random combination of hyperparameter values.
 It records the model’s performance for each combination.
 After several attempts it selects the best-performing set.
Example: Tuning Decision Tree with RandomizedSearchCV
The following code illustrates how to use RandomizedSearchCV. In this example:
 We define a range of values for each hyperparameter
e.g, max_depth, min_samples_leaf etc.
 Random combinations are picked and evaluated using 5-fold cross-validation.
 The best combination and score are printed.
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
n_classes=2, random_state=42)
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
"max_depth": [3, None],
"max_features": randint(1, 9),
"min_samples_leaf": randint(1, 9),
"criterion": ["gini", "entropy"]
}

tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
Output:
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': None,
'max_features': 6, 'min_samples_leaf': 6}
Best score is 0.8
A best score of 0.8 means the model performed with an accuracy of 80% on the
validation set with the above hyperparameters.
3. Bayesian Optimization
Grid Search and Random Search can be inefficient because they blindly try many
hyperparameter combinations, even if some are clearly not useful. Bayesian
Optimization takes a smarter approach. It treats hyperparameter tuning like a
mathematical optimization problem and learns from past results to decide what to try
next.
 Build a probabilistic model (surrogate function) that predicts performance
based on hyperparameters.
 Update this model after each evaluation.
 Use the model to choose the next best set to try.
 Repeat until the optimal combination is found. The surrogate function models:

P(score(y) ∣ hyperparameters(x))
Here the surrogate function models the relationship between the hyperparameters x and
the score y. By updating this model iteratively with each new evaluation, Bayesian
optimization makes more informed decisions. Common surrogate models used in
Bayesian optimization include:
 Gaussian Processes
 Random Forest Regression
 Tree-structured Parzen Estimators (TPE)
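A minimal sketch using the third-party scikit-optimize package (assuming it is installed; BayesSearchCV and Real are part of its documented API, and the model and search range below are only illustrative):
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)

# Bayesian search over a log-uniform range for the regularization strength C
search = BayesSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': Real(1e-5, 1e+3, prior='log-uniform')},  # illustrative search space
    n_iter=25, cv=5, random_state=42
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best score:", search.best_score_)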
Advantages of Hyperparameter tuning
 Improved Model Performance: Finding the optimal combination of
hyperparameters can significantly boost model accuracy and robustness.
 Reduced Overfitting and Underfitting: Tuning helps to prevent both overfitting
and underfitting resulting in a well-balanced model.
 Enhanced Model Generalizability: By selecting hyperparameters that optimize
performance on validation data the model is more likely to generalize well to
unseen data.
 Optimized Resource Utilization: With careful tuning resources such as
computation time and memory can be used more efficiently avoiding
unnecessary work.
 Improved Model Interpretability: Properly tuned hyperparameters can make
the model simpler and easier to interpret.
Challenges in Hyperparameter Tuning
 Dealing with High-Dimensional Hyperparameter Spaces: The larger the
hyperparameter space the more combinations need to be explored. This makes
the search process computationally expensive and time-consuming especially
for complex models with many hyperparameters.
 Handling Expensive Function Evaluations: Evaluating a model's performance
can be computationally expensive, particularly for models that require a lot of
data or iterations.
 Incorporating Domain Knowledge: It can help guide the hyperparameter
search, narrowing down the search space and making the process more
efficient. Using insights from the problem context can improve both the
efficiency and effectiveness of tuning.
 Developing Adaptive Hyperparameter Tuning Methods: Dynamic adjustment
of hyperparameters during training such as learning rate schedules or early
stopping can lead to better model performance.
Reinforcement Learning (RL) is a branch
of machine learning that focuses on how agents can learn to make decisions
through trial and error to maximize cumulative rewards. RL allows machines to
learn by interacting with an environment and receiving feedback based on their
actions. This feedback comes in the form of rewards or penalties.

Reinforcement Learning revolves around the idea that an agent (the learner or
decision-maker) interacts with an environment to achieve a goal. The agent
performs actions and receives feedback to optimize its decision-making over
time.
 Agent: The decision-maker that performs actions.
 Environment: The world or system in which the agent operates.
 State: The situation or condition the agent is currently in.
 Action: The possible moves or decisions the agent can make.
 Reward: The feedback or result from the environment based on the agent’s
action.
How Reinforcement Learning Works?
The RL process involves an agent performing actions in an environment,
receiving rewards or penalties based on those actions, and adjusting its
behavior accordingly. This loop helps the agent improve its decision-making
over time to maximize the cumulative reward.
Here’s a breakdown of RL components:
 Policy: A strategy that the agent uses to determine the next action based on
the current state.
 Reward Function: A function that provides feedback on the actions taken,
guiding the agent towards its goal.
 Value Function: Estimates the future cumulative rewards the agent will receive
from a given state.
 Model of the Environment: A representation of the environment that predicts
future states and rewards, aiding in planning.
Reinforcement Learning Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while avoiding fire
hazards. The goal is to find the optimal path with the least number of hazards
while maximizing the reward:
 Each time the robot moves correctly, it receives a reward.
 If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By trying various
moves, it evaluates the rewards and penalties for each path. Over time, the
robot determines the best route by selecting the actions that lead to the
highest cumulative reward.

The robot's learning process can be summarized as follows:


1. Exploration: The robot starts by exploring all possible paths in the maze, taking
different actions at each step (e.g., move left, right, up, or down).
2. Feedback: After each move, the robot receives feedback from the environment:
 A positive reward for moving closer to the diamond.
 A penalty for moving into a fire hazard.
3. Adjusting Behavior: Based on this feedback, the robot adjusts its behavior to
maximize the cumulative reward, favoring paths that avoid hazards and bring it
closer to the diamond.
4. Optimal Path: Eventually, the robot discovers the optimal path with the least
number of hazards and the highest reward by selecting the right actions based
on past experiences.
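As a rough, self-contained illustration of this trial-and-error loop (not part of the original example), here is a minimal tabular Q-learning sketch on a toy one-dimensional "maze"; the states, rewards and hyperparameters are invented for illustration:
import numpy as np

# A tiny 1-D "maze": states 0..4, the diamond is at state 4, a fire hazard at state 2.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    if next_state == 4:              # reached the diamond
        return next_state, 10.0, True
    if next_state == 2:              # stepped into the fire hazard
        return next_state, -5.0, False
    return next_state, -1.0, False   # small step penalty encourages short paths

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection: explore sometimes, exploit otherwise
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update rule
        q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state, action])
        state = next_state

print(np.argmax(q_table, axis=1))  # learned greedy action per state (1 = move right)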
Types of Reinforcements in RL
1. Positive Reinforcement
Positive Reinforcement is when an event that occurs due to a particular behavior increases
the strength and frequency of that behavior. In other words, it has a positive effect on the
behavior.
 Advantages: Maximizes performance, helps sustain change over time.
 Disadvantages: Overuse can lead to excess states that may reduce
effectiveness.
2. Negative Reinforcement
Negative Reinforcement is defined as strengthening of behavior because a
negative condition is stopped or avoided.
 Advantages: Increases behavior frequency, ensures a minimum performance
standard.
 Disadvantages: It may only encourage just enough action to avoid penalties.
CartPole in OpenAI Gym
One of the classic RL problems is the CartPole environment in OpenAI Gym,
where the goal is to balance a pole on a cart. The agent can either push the cart
left or right to prevent the pole from falling over.
 State space: Describes the four key variables (position, velocity, angle, angular
velocity) of the cart-pole system.
 Action space: Discrete actions—either move the cart left or right.
 Reward: The agent earns 1 point for each step the pole remains balanced.
import gym
import numpy as np
import warnings

# Suppress specific deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Load the environment with render mode specified
env = gym.make('CartPole-v1', render_mode="human")

# Initialize the environment to get the initial state
state = env.reset()

# Print the state space and action space
print("State space:", env.observation_space)
print("Action space:", env.action_space)

# Run a few steps in the environment with random actions
for _ in range(10):
    env.render()  # Render the environment for visualization
    action = env.action_space.sample()  # Take a random action

    # Take a step in the environment
    step_result = env.step(action)

    # Older gym versions return 4 values, newer ones return 5; unpack accordingly
    if len(step_result) == 4:
        next_state, reward, done, info = step_result
        terminated = done
    else:
        next_state, reward, done, truncated, info = step_result
        terminated = done or truncated

    print(f"Action: {action}, Reward: {reward}, Next State: {next_state}, Done: {done}, Info: {info}")

    if terminated:
        state = env.reset()  # Reset the environment if the episode is finished

env.close()  # Close the environment when done


Output:

Application of Reinforcement Learning


1. Robotics: RL is used to automate tasks in structured environments such as
manufacturing, where robots learn to optimize movements and improve
efficiency.
2. Game Playing: Advanced RL algorithms have been used to develop strategies
for complex games like chess, Go, and video games, outperforming human
players in many instances.
3. Industrial Control: RL helps in real-time adjustments and optimization of
industrial operations, such as refining processes in the oil and gas industry.
4. Personalized Training Systems: RL enables the customization of instructional
content based on an individual's learning patterns, improving engagement and
effectiveness.
Advantages of Reinforcement Learning
 Solving Complex Problems: RL is capable of solving highly complex problems
that cannot be addressed by conventional techniques.
 Error Correction: The model continuously learns from its environment and can
correct errors that occur during the training process.
 Direct Interaction with the Environment: RL agents learn from real-time
interactions with their environment, allowing adaptive learning.
 Handling Non-Deterministic Environments: RL is effective in environments
where outcomes are uncertain or change over time, making it highly useful for
real-world applications.
Disadvantages of Reinforcement Learning
 Not Suitable for Simple Problems: RL is often an overkill for straightforward
tasks where simpler algorithms would be more efficient.
 High Computational Requirements: Training RL models requires a significant
amount of data and computational power, making it resource-intensive.
 Dependency on Reward Function: The effectiveness of RL depends heavily on
the design of the reward function. Poorly designed rewards can lead to
suboptimal or undesired behaviors.
 Difficulty in Debugging and Interpretation: Understanding why an RL agent
makes certain decisions can be challenging, making debugging and
troubleshooting complex

Ensemble learning is a method where we use many small models instead of just
one. Each of these models may not be very strong on its own, but when we put
their results together, we get a better and more accurate answer. It's like asking
a group of people for advice instead of just one person—each one might be a
little wrong, but together, they usually give a better answer.
Types of Ensembles Learning in Machine Learning
There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training
data. Their results are then combined—usually by averaging (for regression) or
voting (for classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the
errors made by the previous ones. The final prediction is a weighted
combination of all models, which helps reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their
predictions are used as inputs to a final model, called a meta-model. The meta-
model learns how to best combine the predictions of the base models, aiming
for better performance than any individual model.
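Since the sections below walk through code only for bagging and boosting, here is a minimal sketch of stacking using scikit-learn's StackingClassifier; the choice of base models and meta-model is only illustrative:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models whose predictions feed the meta-model
base_models = [('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
               ('svc', SVC(probability=True, random_state=42))]

# Logistic regression learns how best to combine the base-model predictions
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, stack.predict(X_test)))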
1. Bagging Algorithm
Bagging classifier can be used for both regression and classification tasks. Here
is an overview of Bagging classifier algorithm:
 Bootstrap Sampling: Divides the original training data into ‘N’ subsets and
randomly selects a subset with replacement in some rows from other subsets.
This step ensures that the base models are trained on diverse subsets of the
data and there is no class imbalance.
 Base Model Training: For each bootstrapped sample we train a base model
independently on that subset of data. These weak models are trained in
parallel to increase computational efficiency and reduce time consumption. We
can use different base learners i.e. different ML models as base learners to
bring variety and robustness.
 Prediction Aggregation: To make a prediction on testing data combine the
predictions of all base models. For classification tasks it can include majority
voting or weighted majority while for regression it involves averaging the
predictions.
 Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training
subset of particular base models during the bootstrapping method. These “out-
of-bag” samples can be used to estimate the model’s performance without the
need for cross-validation.
 Final Prediction: After aggregating the predictions from all the base models,
Bagging produces a final prediction for each instance.
Python pseudo code for Bagging Estimator implementing libraries:
1. Importing Libraries and Loading Data
 BaggingClassifier: for creating an ensemble of classifiers trained on different
subsets of data.
 DecisionTreeClassifier: the base classifier used in the bagging ensemble.
 load_iris: to load the Iris dataset for classification.
 train_test_split: to split the dataset into training and testing subsets.
 accuracy_score: to evaluate the model’s prediction accuracy.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Iris Dataset
 data = load_iris(): loads the Iris dataset, which includes features and target
labels.
 X = data.data: extracts the feature matrix (input variables).
 y = data.target: extracts the target vector (class labels).
 train_test_split(...): splits the data into training (80%) and testing (20%) sets,
with random_state=42 to ensure reproducibility.
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
3. Creating a Base Classifier
Decision tree is chosen as the base model. They are prone to overfitting when
trained on small datasets making them good candidates for bagging.
 base_classifier = DecisionTreeClassifier(): initializes a Decision Tree classifier,
which will serve as the base estimator in the Bagging ensemble.
base_classifier = DecisionTreeClassifier()
4. Creating and Training the Bagging Classifier
 A BaggingClassifier is created using the decision tree as the base classifier.
 n_estimators = 10 specifies that 10 decision trees will be trained on different
bootstrapped subsets of the training data.
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10,
random_state=42)
bagging_classifier.fit(X_train, y_train)
5. Making Predictions and Evaluating Accuracy
 The trained bagging model predicts labels for test data.
 The accuracy of the predictions is calculated by comparing the predicted labels
(y_pred) to the actual labels (y_test).
y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
2. Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to
create a strong learner. Weak models are trained in series such that each next
model tries to correct errors of the previous model until the entire training
dataset is predicted correctly. One of the most well-known boosting algorithms
is AdaBoost (Adaptive Boosting). Here is an overview of Boosting algorithm:
 Initialize Model Weights: Begin with a single weak learner and assign equal
weights to all training examples.
 Train Weak Learner: Train weak learners on these dataset.
 Sequential Learning: Boosting works by training models sequentially where
each model focuses on correcting the errors of its predecessor. Boosting
typically uses a single type of weak learner like decision trees.
 Weight Adjustment: Boosting assigns weights to training datapoints.
Misclassified examples receive higher weights in the next iteration so that next
models pay more attention to them.
Python pseudo code for boosting Estimator implementing libraries:
1. Importing Libraries and Modules
 AdaBoostClassifier from sklearn.ensemble: for building the AdaBoost
ensemble model.
 DecisionTreeClassifier from sklearn.tree: as the base weak learner for
AdaBoost.
 load_iris from sklearn.datasets: to load the Iris dataset.
 train_test_split from sklearn.model_selection: to split the dataset into training
and testing sets.
 accuracy_score from sklearn.metrics: to evaluate the model’s accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Dataset
 data = load_iris(): loads the Iris dataset, which includes features and target
labels.
 X = data.data: extracts the feature matrix (input variables).
 y = data.target: extracts the target vector (class labels).
 train_test_split(...): splits the data into training (80%) and testing (20%) sets,
with random_state=42 to ensure reproducibility.
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
3. Defining the Weak Learner
We are creating the base classifier as a decision tree with maximum depth 1 (a
decision stump). This simple tree will act as a weak learner for the AdaBoost
algorithm, which iteratively improves by combining many such weak learners.
base_classifier = DecisionTreeClassifier(max_depth=1)
4. Creating and Training the AdaBoost Classifier
 base_classifier: The weak learner used in boosting.
 n_estimators = 50: Number of weak learners to train sequentially.
 learning_rate = 1.0: Controls the contribution of each weak learner to the
final model.
 random_state = 42: Ensures reproducibility.
adaboost_classifier = AdaBoostClassifier(
base_classifier, n_estimators=50, learning_rate=1.0, random_state=42
)
adaboost_classifier.fit(X_train, y_train)
5. Making Predictions and Calculating Accuracy
We are calculating the accuracy of the model by comparing the true
labels y_test with the predicted labels y_pred. The accuracy_score function
returns the proportion of correctly predicted samples. Then, we print the
accuracy value.
y_pred = adaboost_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
Benefits of Ensemble Learning in Machine Learning
Ensemble learning is a versatile approach that can be applied to machine learning models for:
 Reduction in Overfitting: By aggregating predictions of multiple model's
ensembles can reduce overfitting that individual complex models might exhibit.
 Improved Generalization: It generalizes better to unseen data by minimizing
variance and bias.
 Increased Accuracy: Combining multiple models gives higher predictive
accuracy.
 Robustness to Noise: It mitigates the effect of noisy or incorrect data points by
averaging out predictions from diverse models.
 Flexibility: It can work with diverse models including decision trees, neural
networks and support vector machines making them highly adaptable.
 Bias-Variance Tradeoff: Techniques like bagging reduce variance, while
boosting reduces bias leading to better overall performance.
There are various ensemble learning techniques we can use, each with its own pros and cons.
Ensemble Learning Techniques
 Random Forest (Bagging): Constructs multiple decision trees on bootstrapped subsets of the data and aggregates their predictions for the final output, reducing overfitting and variance.
 Random Subspace Method (Bagging): Trains models on random subsets of the input features to enhance diversity and improve generalization while reducing overfitting.
 Gradient Boosting Machines (GBM) (Boosting): Sequentially builds decision trees, with each tree correcting the errors of the previous ones, enhancing predictive accuracy iteratively.
 Extreme Gradient Boosting (XGBoost) (Boosting): Adds optimizations such as tree pruning, regularization and parallel processing for robust and efficient predictive models.
 AdaBoost (Adaptive Boosting) (Boosting): Focuses on challenging examples by assigning weights to data points and combines weak classifiers with weighted voting for the final prediction.
 CatBoost (Boosting): Specializes in handling categorical features natively without extensive preprocessing, offering high predictive accuracy and automatic overfitting handling.
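As a quick illustration of how two of these techniques compare in practice, the following sketch cross-validates one bagging-based and one boosting-based ensemble on scikit-learn's built-in Iris data; it is an illustrative example, and the exact scores will vary with the data and settings.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# One bagging-based and one boosting-based ensemble on the same data
models = {
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting (boosting)": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(name, "mean CV accuracy:", round(scores.mean(), 3))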

ETL & ELT
In managing and analyzing data, two primary approaches, ETL (Extract, Transform,
Load) and ELT (Extract, Load, Transform), are commonly used to move data from
various sources into a data warehouse. Understanding the differences between these
methods is crucial for selecting the right approach based on our data needs, storage
system and performance requirements.
ELT Process
Extract, Load and Transform (ELT) is the technique of extracting raw data from the source, storing it in the data warehouse of the target server and preparing it for downstream users.
ELT consists of three different operations performed on the data:
1. Extract: Extracting data is the process of identifying data from one or more
sources. The sources may include databases, files, ERP, CRM, or any other
useful source of data.
2. Load: Loading is the process of storing the extracted raw data in a data
warehouse or data lake.
3. Transform: Data transformation is the process in which the raw data from the
source is transformed into the target format required for analysis.
Data from the sources is extracted and stored in the data warehouse. The entire data is
not transformed; only the required transformations are done when necessary. Raw
data can be retrieved from the warehouse anytime when required. The data,
transformed as needed, is then sent forward for analysis. When we use ELT, we move
the entire data set as it exists in the source systems to the target. This means that we
have the raw data at our disposal in the data warehouse, in contrast to the ETL
approach.
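A minimal ELT sketch in Python is shown below; the file name sales.csv, the warehouse.db SQLite file and the order_date and amount columns are placeholder assumptions used only for illustration.
import sqlite3
import pandas as pd

# 1. Extract: read raw data from the source (placeholder file name)
raw = pd.read_csv("sales.csv")

# 2. Load: store the raw data as-is in the target warehouse (SQLite stands in here)
conn = sqlite3.connect("warehouse.db")
raw.to_sql("raw_sales", conn, if_exists="replace", index=False)

# 3. Transform: run the transformation inside the warehouse only when it is needed
monthly = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS revenue
    FROM raw_sales
    GROUP BY month
    """,
    conn,
)
conn.close()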
ETL Process
ETL is the traditional technique of extracting raw data, transforming it as required for
the users and storing it in data warehouses. ELT was later developed, with ETL as its
base. The three operations in ETL and ELT are the same, except that their order of
processing is slightly different. This change in sequence was made to overcome some
drawbacks.
1. Extract: It is the process of extracting raw data from all available data sources
such as databases, files, ERP, CRM or any other.
2. Transform: The extracted data is immediately transformed as required by the
user.
3. Load: The transformed data is then loaded into the data warehouse from
where the users can access it.
The data collected from the sources is directly stored in the staging area. The required
transformations are performed on the data in the staging area. Once the data is
transformed, the resultant data is stored in the data warehouse. The main drawback of
the ETL architecture is that once the transformed data is stored in the warehouse, it
cannot be modified again. In contrast, in ELT, a copy of the raw data is always available
in the warehouse and only the required data is transformed when needed.
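For contrast, here is a minimal ETL sketch under the same placeholder assumptions (sales.csv, warehouse.db, order_date and amount columns); the transformation now happens in the staging step, and only the transformed result is loaded.
import sqlite3
import pandas as pd

# 1. Extract: pull raw data from the source
raw = pd.read_csv("sales.csv")

# 2. Transform: clean and aggregate in the staging step, before loading
staged = raw.dropna(subset=["amount"]).copy()
staged["order_date"] = pd.to_datetime(staged["order_date"])
monthly = (
    staged.groupby(staged["order_date"].dt.to_period("M"))["amount"]
    .sum()
    .reset_index(name="revenue")
)
monthly["order_date"] = monthly["order_date"].astype(str)

# 3. Load: only the transformed result reaches the warehouse
conn = sqlite3.connect("warehouse.db")
monthly.to_sql("monthly_revenue", conn, if_exists="replace", index=False)
conn.close()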
Difference between ELT and ETL
 Acronym: ETL stands for Extract, Transform, Load; ELT stands for Extract, Load, Transform.
 Definition: ETL extracts raw data, transforms it on a secondary server and then loads it into the destination; ELT extracts raw data, loads it directly into the destination and transforms it there.
 Processing speed: ETL is slower because data transformation occurs before loading; ELT is faster because data is loaded first and transformed in parallel.
 Data volume: ETL is best for smaller, complex data sets such as marketing data; ELT is suited for large data sets requiring speed, like real-time analytics.
 Data output: ETL handles primarily structured data; ELT handles structured, semi-structured and unstructured data.
 Data lake compatibility: ETL is not compatible with data lakes; ELT is fully compatible with data lakes.
 Maturity: ETL is well established, used for 20+ years, with extensive documentation; ELT is a newer approach with fewer tools and less documentation.
 Cost efficiency: ETL has higher costs due to the need for separate servers and processing infrastructure; ELT is more cost-effective, leveraging cloud resources for scalability.
 Security: ETL requires custom security solutions to protect sensitive data; ELT offers built-in security features such as access control and multifactor authentication.
 Transformation location: In ETL, data is transformed on a secondary server before loading; in ELT, data is loaded as-is and transformed within the target system.
 Flexibility: ETL is best for structured data transformation; ELT handles structured and unstructured data with ease.
Similarities Between ETL and ELT
Both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are data
integration processes that consolidate data from various sources into a single, unified
repository for further analysis. They share several key similarities:
 Data Extraction: Both processes begin by extracting raw data from multiple
sources like databases, files, SaaS applications, or IoT devices. This data can be
structured, semi-structured, or unstructured.
 Data Transformation: While the timing of transformation differs, both ETL and
ELT involve transforming the extracted data into a format that aligns with the
target system's requirements. This ensures data is clean, accurate and ready for
analysis.
 Data Loading: Both methods ultimately load the processed data into a data
warehouse or data lake, providing a central repository where the data can be
accessed and analyzed.
 Unified Data Repository: Both processes help create a single source of truth,
ensuring that enterprise data is consistent, accurate and up-to-date for
decision-making.
Choosing Between ELT and ETL
The choice between ETL and ELT depends on our specific needs and requirements.
 ETL works well for smaller datasets and structured data where the data needs
to be transformed immediately. It often requires special hardware and can be
less flexible when handling large amounts of data.
 ELT is better for large datasets and unstructured or non-relational data. It is
more flexible and cost-effective, especially with cloud-based data solutions.
With ELT, we can store raw data and transform it as needed.
Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to make business
decisions. OLAP provides a platform for gaining insights from databases retrieved from
multiple database systems at the same time. It is based on a multidimensional data
model, which enables users to extract and view data from various perspectives. A
multidimensional database is used to store OLAP data. Many Business Intelligence (BI)
applications rely on OLAP technology.
Types of OLAP servers:
The three major types of OLAP servers are as follows:
 ROLAP
 MOLAP
 HOLAP
Relational OLAP (ROLAP):
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as
relational tables. ROLAP servers are used to bridge the gap between the relational
back-end server and the client's front-end tools. ROLAP servers store and manage
warehouse data using RDBMS, and OLAP middleware fills in the gaps.
Benefits:
 It is compatible with data warehouses and OLTP systems.
 The data size limitation of ROLAP technology is determined by the underlying
RDBMS. As a result, ROLAP does not limit the amount of data that can be
stored.
Limitations:
 SQL functionality is constrained.
 It's difficult to keep aggregate tables up to date.

Multidimensional OLAP (MOLAP):
Through array-based multidimensional storage engines, Multidimensional On-Line
Analytical Processing (MOLAP) supports multidimensional views of data. Storage
utilization in multidimensional data stores may be low if the data set is sparse.
MOLAP stores data on discs in the form of a specialized multidimensional array
structure. It is used for OLAP, which is based on the arrays' random access capability.
Dimension instances determine array elements, and the data or measured value
associated with each cell is typically stored in the corresponding array element. The
multidimensional array is typically stored in MOLAP in a linear allocation based on
nested traversal of the axes in some predetermined order.
However, unlike ROLAP, which stores only records with non-zero facts, all array
elements are defined in MOLAP, and as a result, the arrays tend to be sparse, with
empty elements occupying a larger portion of them. MOLAP systems typically include
provisions such as advanced indexing and hashing to locate data while performing
queries for handling sparse arrays, because both storage and retrieval costs are
important when evaluating online performance. MOLAP cubes are ideal for slicing and
dicing data and can perform complex calculations. When the cube is created, all
calculations are pre-generated.
Benefits:
 Suitable for slicing and dicing operations.
 Outperforms ROLAP when data is dense.
 Capable of performing complex calculations.
Limitations:
 It is difficult to change the dimensions without re-aggregating.
 Since all calculations are performed when the cube is built, a large amount of
data cannot be stored in the cube itself.
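The slicing, dicing and pre-aggregation ideas above can be illustrated with a small pandas sketch; the sales data below is made up purely for illustration, and pandas is of course not a MOLAP engine.
import pandas as pd

# Tiny illustrative fact table with three dimensions and one measure
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "revenue": [100, 150, 120, 90, 130, 160],
})

# Multidimensional view: aggregate revenue over region x product
cube = pd.pivot_table(sales, values="revenue", index="region",
                      columns="product", aggfunc="sum")
print(cube)

# "Slice": fix a single dimension (year == 2024)
print(sales[sales["year"] == 2024])

# "Dice": restrict several dimensions at once
print(sales[(sales["region"] == "North") & (sales["product"] == "A")])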

Hybrid OLAP (HOLAP):
ROLAP and MOLAP are combined in Hybrid On-Line Analytical Processing (HOLAP), which offers the greater scalability of ROLAP together with the faster computation of MOLAP. HOLAP servers are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from ROLAP's greater scalability; on the other hand, it makes use of cube technology for faster performance and summary-type information. Because detailed data is stored in a relational database, the cubes are smaller than in MOLAP.
Benefits:
 HOLAP combines the benefits of MOLAP and ROLAP.
 Provide quick access at all aggregation levels.
Limitations
 Because it supports both MOLAP and ROLAP servers, HOLAP architecture is
extremely complex.
 There is a greater likelihood of overlap, particularly in their functionalities.
Other types of OLAP include:
 Web OLAP (WOLAP): WOLAP refers to an OLAP application that can be accessed through a web browser. In contrast to traditional client/server OLAP applications, WOLAP has a three-tiered architecture consisting of a client, middleware and a database server.
 Desktop OLAP (DOLAP): DOLAP stands for desktop online analytical processing. The user downloads the data from the source and works with it on their desktop or laptop. Functionality is limited in comparison to other OLAP applications, but it is less expensive.
 Mobile OLAP (MOLAP): Mobile OLAP brings OLAP functionality to wireless and mobile devices, allowing users to work with and access data on the go.
 Spatial OLAP (SOLAP): SOLAP combines the capabilities of Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because data can be alphanumeric, image or vector data. It allows quick and easy exploration of data stored in a spatial database.

Online Transaction Processing (OLTP)
Online Transaction Processing (OLTP) is a data processing approach emphasizing real-time execution of transactions. The majority of OLTP systems are designed to manage numerous short, atomic operations that keep databases consistent. To maintain transaction integrity and reliability, these systems support ACID (Atomicity, Consistency, Isolation, Durability) properties. This is what allows critical applications such as online banking and reservation systems to run reliably.
OLTP Examples
A common example of an OLTP system is an ATM center: the person who authenticates first is served first, provided the amount to be withdrawn is available in the ATM. Typical uses of OLTP systems are described below.
 An ATM center is an OLTP application.
 OLTP handles the ACID properties during data transactions via the application.
 It is also used for online banking, online airline ticket booking, sending a text message and adding a book to a shopping cart.
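A minimal sketch of such a short, atomic OLTP-style transaction is shown below, using an in-memory SQLite database with a made-up accounts table; it only illustrates the commit/rollback behaviour, not a production banking system.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 500.0)")
conn.commit()

amount = 200.0
try:
    with conn:  # commits on success, rolls back automatically on an exception
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (1,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")  # triggers a rollback
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, 1)
        )
except ValueError as exc:
    print("Transaction rolled back:", exc)

# The balance reflects the withdrawal only if the whole transaction committed
print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())
conn.close()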

Benefits of OLTP Services
 Allow users to quickly read, write and delete data.
 Support an increase in users and transactions for real-time data access.
 Provide better data protection through multiple security features.
 Aid in decision-making with accurate, up-to-date data.
 Ensure data integrity, consistency, and high availability.
Drawbacks of OLTP Services
 Limited analysis capability, not suited for complex analysis or reporting.
 High maintenance costs due to frequent updates, backups, and recovery.
 Susceptible to disruption during hardware failures, impacting online
transactions.
 Prone to issues like duplicate or inconsistent data.
Difference Between OLAP and OLTP
 Definition: OLAP is well known as an online database query management system, while OLTP is well known as an online database modifying system.
 Data source: OLAP consists of historical data from various databases, whereas OLTP consists of only operational current data.
 Method used: OLAP makes use of a data warehouse, while OLTP makes use of a standard database management system (DBMS).
 Application: OLAP is subject-oriented and is used for data mining, analytics and decision making; OLTP is application-oriented and is used for business tasks.
 Normalization: In an OLAP database, tables are not normalized; in an OLTP database, tables are normalized (3NF).
 Usage of data: OLAP data is used in planning, problem-solving and decision-making; OLTP data is used to perform day-to-day fundamental operations.
 Task: OLAP provides a multi-dimensional view of different business tasks, while OLTP reveals a snapshot of present business tasks.
 Purpose: OLAP serves the purpose of extracting information for analysis and decision-making; OLTP serves the purpose of inserting, updating and deleting information from the database.
 Volume of data: OLAP stores a large amount of data, typically in TB or PB; in OLTP the data is relatively small, as historical data is archived, typically in MB or GB.
 Queries: OLAP queries are relatively slow because of the large amount of data involved and may take hours; OLTP queries are very fast as they operate on about 5% of the data.
 Update: The OLAP database is not often updated, so data integrity is unaffected; in an OLTP database, the data integrity constraint must be maintained.
 Backup and recovery: OLAP only needs backup from time to time compared to OLTP; in OLTP the backup and recovery process is maintained rigorously.
 Processing time: In OLAP, processing complex queries can take a lengthy time; OLTP is comparatively fast because of simple and straightforward queries.
 Types of users: OLAP data is generally managed by CEOs, MDs and GMs; OLTP data is managed by clerks, forex and managers.
 Operations: OLAP involves only read and rarely write operations; OLTP involves both read and write operations.
 Updates: In OLAP, data is refreshed on a regular basis through lengthy, scheduled batch operations; in OLTP, the user initiates updates, which are brief and quick.
 Nature of audience: The OLAP process is market-oriented, while the OLTP process is customer-oriented.
 Database design: OLAP design focuses on the subject, while OLTP design focuses on the application.
 Productivity: OLAP improves the efficiency of business analysts, while OLTP enhances the user's productivity.
Eigenvalues are unique scalar values linked to a matrix or linear
transformation. They indicate how much an eigenvector gets stretched or compressed
during the transformation. The eigenvector's direction remains unchanged unless the
eigenvalue is negative, in which case the direction is simply reversed.
The equation for an eigenvalue is given by
Av = λv
where
A is the matrix,
v is the associated eigenvector and
λ is the scalar eigenvalue.

Eigenvectors are non-zero vectors that, when multiplied by a matrix, only
stretch or shrink without changing direction. The eigenvalue must be found first before
the eigenvector. For any square matrix A of order n × n, the eigenvector is a column
matrix of size n × 1. This is known as the right eigenvector, as matrix multiplication is
not commutative.
Alternatively, the left eigenvector can be found using the equation
vA=λv, where v is a row matrix of size 1 × n.

2.1 Eigenvector Equation
The eigenvector equation is the equation used to find the eigenvectors of any square matrix. It is
Av = λv
where
A is the given square matrix,
v is the eigenvector of matrix A and
λ is any scalar multiple (the eigenvalue).
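The relation Av = λv can also be checked numerically; the small matrix below is just an illustrative example.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of `eigenvectors` are the eigenvectors v; `eigenvalues` holds the λ's
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]
    # A @ v and lam * v agree up to floating-point error, confirming Av = λv
    print(np.allclose(A @ v, lam * v))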
