Machine Learning
quality_counts = df['quality'].value_counts()

plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
Output:

Bar Plot
This count plot shows how many wines fall into each quality rating.
2. Kernel density plot for understanding variance in the
dataset
sns.set_style("darkgrid")
numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns

# Draw a density plot for each numeric column (plotting loop reconstructed)
plt.figure(figsize=(14, len(numerical_columns) * 3))
for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(len(numerical_columns), 2, idx)
    sns.histplot(df[feature], kde=True)
plt.tight_layout()
plt.show()
Output:

Kernel density plot
Features with a skewness near 0 show a roughly symmetrical distribution. A skewness of 1 or above suggests a positively skewed (right-skewed) distribution, where the tail extends to the right, indicating the presence of extremely high values.
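These skewness values can be checked directly with pandas (assuming df is the wine-quality DataFrame used above):

# Skewness of every numeric column: values near 0 are symmetric,
# values of 1 or above indicate a strong right skew
print(df.select_dtypes(include="number").skew().sort_values(ascending=False))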
3. Swarm Plot for showing the outliers in the data

plt.figure(figsize=(10, 8))
sns.swarmplot(x='quality', y='alcohol', data=df)
plt.show()

Output:

Swarm Plot
This swarm plot compares the 'quality' and 'alcohol' columns. Higher point density in certain areas shows where most of the data points are concentrated, while isolated points far from these clusters represent outliers, highlighting uneven values in the dataset.
Step 7: Bivariate Analysis
In bivariate analysis two variables are analyzed together to
identify patterns, dependencies or interactions between
them. This method helps in understanding how changes in
one variable might affect another.
Let's visualize these relationships by plotting various plots for the data, which will show how the variables interact with each other across multiple dimensions.
1. Pair Plot for showing the distribution of the individual
variables
sns.set_palette("Pastel1")
plt.figure(figsize=(10, 6))
sns.pairplot(df)
plt.show()

2. Violin Plot
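One way to draw the violin plot with seaborn (the choice of the 'quality' and 'alcohol' columns here is an assumption, mirroring the box plot in the next step):

plt.figure(figsize=(10, 8))
sns.violinplot(x='quality', y='alcohol', data=df)
plt.show()

Output:

Violin Plot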
For interpreting the Violin Plot:
A wider section shows higher density, suggesting more data points.
A symmetrical plot shows a balanced distribution.
A peak or bulge in the violin plot represents the most common value in the distribution.
Longer tails show greater variability.
The median line is the middle line inside the violin plot; it helps in understanding central tendency.
3. Box Plot for examining the relationship between alcohol
and Quality
sns.boxplot(x='quality', y='alcohol', data=df)
Output:
Box Plot
The box represents the IQR: the longer the box, the greater the variability.
The median line in the box shows central tendency.
Whiskers extend from the box to the smallest and largest values within a specified range.
Individual points beyond the whiskers represent outliers.
A compact box shows low variability while a stretched box shows higher variability.
Step 8: Multivariate Analysis
It involves finding the interactions between three or more variables in a dataset at the same time. This approach aims to identify complex patterns, relationships and interactions, providing an understanding of how multiple variables collectively behave and influence each other.
Here, we are going to perform multivariate analysis using a correlation matrix plot.
plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', linewidths=2)   # correlation heatmap
plt.show()

Output:

Correlation Matrix
Values close to +1 show strong positive correlation, values close to -1 show strong negative correlation and 0 suggests no linear correlation.
Darker colors signify strong correlation, while lighter colors represent weaker correlations.
Positively correlated variables move in the same direction: as one increases, the other also increases.
Negatively correlated variables move in opposite directions: an increase in one variable is associated with a decrease in the other.
How Does Dimensionality Reduction Work?
Let's understand how dimensionality reduction is used with the help of an example. Imagine a dataset where each data point
exists in a 3D space defined by axes X, Y and Z. If most of the
data variance occurs along X and Y then the Z-dimension may
contribute very little to understanding the structure of the
data.
Linear Regression
Here Y is called a dependent or target variable and X is called an independent variable
also known as the predictor of Y. There are many types of functions or modules that
can be used for regression. A linear function is the simplest type of function. Here, X
may be a single feature or multiple features representing the problem.
2. Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the best-fit line is
represented by the equation
y = mx + b
Where:
y is the predicted value (dependent variable)
x is the input (independent variable)
m is the slope of the line (how much y changes when x changes)
b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b (intercept)
so that the predicted y values are as close as possible to the actual data points.
3. Minimizing the Error: The Least Squares Method
To find the best-fit line, we use a method called Least Squares. The idea behind this
method is to minimize the sum of squared differences between the actual values (data
points) and the predicted values from the line. These differences are called residuals.
The formula for residuals is:
Residual = yᵢ − ŷᵢ
Where:
yᵢ is the actual observed value
ŷᵢ is the predicted value from the line for that xᵢ
The least squares method minimizes the sum of the squared residuals:
Sum of squared errors (SSE) = Σ(yᵢ − ŷᵢ)²
This method ensures that the line best represents the data where the sum of the
squared differences between the predicted values and actual values is as small as
possible.
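As a quick illustration of the least squares idea, NumPy's polyfit solves for the slope and intercept that minimize the SSE (the data points below are invented for demonstration):

import numpy as np

# Toy data that roughly follows y = 3x + 2 with a little noise
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

m, b = np.polyfit(x, y, deg=1)          # least squares fit of a degree-1 polynomial
sse = np.sum((y - (m * x + b)) ** 2)    # sum of squared residuals
print(f"slope={m:.3f}, intercept={b:.3f}, SSE={sse:.3f}")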
4. Interpretation of the Best-Fit Line
Slope (m): The slope of the best-fit line indicates how much the dependent
variable (y) changes with each unit change in the independent variable (x). For
example if the slope is 5, it means that for every 1-unit increase in x, the value
of y increases by 5 units.
Intercept (b): The intercept represents the predicted value of y when x = 0. It’s
the point where the line crosses the y-axis.
In linear regression some assumptions are made to ensure the reliability of the model's results.
Limitations
Assumes Linearity: The method assumes the relationship between the
variables is linear. If the relationship is non-linear, linear regression might not
work well.
Sensitivity to Outliers: Outliers can significantly affect the slope and intercept,
skewing the best-fit line.
Hypothesis function in Linear Regression
In linear regression, the hypothesis function is the equation used to make predictions
about the dependent variable based on the independent variables. It represents the
relationship between the input features and the target output.
For a simple case with one independent variable, the hypothesis function is:
h(x) = β₀ + β₁x
Where:
h(x) (or ŷ) is the predicted value of the dependent variable (y).
x is the independent variable.
β₀ is the intercept, representing the value of y when x is 0.
β₁ is the slope, indicating how much y changes for each unit change in x.
For multiple linear regression (with more than one independent variable), the
hypothesis function expands to:
h(x₁, x₂, ..., xₖ) = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ
Where:
x₁, x₂, ..., xₖ are the independent variables.
β₀ is the intercept.
β₁, β₂, ..., βₖ are the coefficients, representing the influence of each respective independent variable on the predicted output.
Assumptions of the Linear Regression
1. Linearity: The relationship between inputs (X) and the output (Y) is a straight line.
Linearity
2. Independence of Errors: The errors in predictions should not affect each other.
3. Constant Variance (Homoscedasticity): The errors should have equal spread across
all values of the input. If the spread changes (like fans out or shrinks), it's called
heteroscedasticity and it's a problem for the model.
Homoscedasticity
4. Normality of Errors: The errors should follow a normal (bell-shaped) distribution.
5. No Multicollinearity(for multiple regression): Input variables shouldn’t be too
closely related to each other.
6. No Autocorrelation: Errors shouldn't show repeating patterns, especially in time-
based data.
7. Additivity: The total effect on Y is just the sum of the effects from each X, with no mixing or interaction between them.
To understand Multicollinearity in detail refer to article: Multicollinearity.
Types of Linear Regression
When there is only one independent feature it is known as Simple Linear Regression
or Univariate Linear Regression and when there are more than one feature it is known
as Multiple Linear Regression or Multivariate Regression.
1. Simple Linear Regression
Simple linear regression is used when we want to predict a target value (dependent
variable) using only one input feature (independent variable). It assumes a straight-line
relationship between the two.
Formula:
ŷ = θ₀ + θ₁x
Where:
ŷ is the predicted value
x is the input (independent variable)
θ₀ is the intercept (value of ŷ when x = 0)
θ₁ is the slope or coefficient (how much ŷ changes with one unit change in x)
Example:
Predicting a person’s salary (y) based on their years of experience (x).
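A minimal sketch of this salary example with scikit-learn (the experience and salary figures are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # years of experience
y = np.array([35, 42, 50, 57, 66])        # salary in thousands (toy numbers)

model = LinearRegression().fit(x, y)
print("Intercept (theta_0):", model.intercept_)
print("Slope (theta_1):", model.coef_[0])
print("Predicted salary for 6 years:", model.predict([[6]])[0])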
2. Multiple Linear Regression
Multiple linear regression involves more than one independent variable and one
dependent variable. The equation for multiple linear regression is:
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ
where:
ŷ is the predicted value
x₁, x₂, …, xₙ are the independent variables
θ₁, θ₂, …, θₙ are the coefficients (weights) corresponding to each predictor
θ₀ is the intercept.
The goal of the algorithm is to find the best Fit Line equation that can predict the
values based on the independent variables.
In regression, a set of records is present with X and Y values, and these values are used to learn a function so that if you want to predict Y from an unknown X, this learned function can be used. In regression we have to find the value of Y, so a function is required that predicts a continuous Y given X as the independent features.
Use Case of Multiple Linear Regression
Multiple linear regression allows us to analyze relationship between multiple
independent variables and a single dependent variable. Here are some use cases:
Real Estate Pricing: In real estate MLR is used to predict property prices based
on multiple factors such as location, size, number of bedrooms, etc. This helps
buyers and sellers understand market trends and set competitive prices.
Financial Forecasting: Financial analysts use MLR to predict stock prices or
economic indicators based on multiple influencing factors such as interest
rates, inflation rates and market trends. This enables better investment
strategies and risk management.
Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields
based on several variables like rainfall, temperature, soil quality and fertilizer
usage. This information helps in planning agricultural practices for optimal
productivity.
E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess
how various factors such as product price, marketing promotions and seasonal
trends impact sales.
Now that we have understood linear regression, its assumptions and its types, we will learn how to build a linear regression model.
Cost function for Linear Regression
As discussed earlier, the best-fit line is not easy to obtain in real-life cases, so we need to calculate the errors that affect it. These errors need to be calculated in order to mitigate them. The difference between the predicted value Ŷ and the true value Y is called the cost function or the loss function.
In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values ŷᵢ and the actual values yᵢ. The purpose is to determine the optimal values for the intercept θ₁ and the coefficient of the input feature θ₂, providing the best-fit line for the given data points. The linear equation expressing this relationship is ŷᵢ = θ₁ + θ₂xᵢ.
MSE function can be calculated as:
Cost function (J) = (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²
Utilizing the MSE function, the iterative process of gradient descent is applied to update the values of θ₁ and θ₂. This ensures that the MSE value converges to the global minimum, signifying the most accurate fit of the linear regression line to the dataset.
This process involves continuously adjusting the parameters θ₁ and θ₂ based on the gradients calculated from the MSE. The final result is a linear regression line that minimizes the overall squared differences between the predicted and actual values, providing an optimal representation of the underlying relationship in the data.
Now that we have calculated the loss function, we need to optimize the model to mitigate this error, and this is done through gradient descent.
Gradient Descent for Linear Regression
A linear regression model can be trained using the optimization algorithm gradient
descent by iteratively modifying the model's parameters to reduce the mean squared
error (MSE) of the model on a training dataset. To update θ1 and θ2 values in order to
reduce the Cost function (minimizing RMSE value) and achieve the best-fit line the
model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and
then iteratively update the values, reaching minimum cost.
A gradient is nothing but a derivative that defines the effects on outputs of the
function with a little bit of variation in inputs.
Let's differentiate the cost function J with respect to θ₁:
J'θ₁ = ∂J(θ₁, θ₂) / ∂θ₁
     = ∂/∂θ₁ [ (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² ]
     = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · ∂/∂θ₁ (ŷᵢ − yᵢ)
     = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · ∂/∂θ₁ (θ₁ + θ₂xᵢ − yᵢ)
     = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · (1 + 0 − 0)
     = (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)
Let's differentiate the cost function J with respect to θ₂:
J'θ₂ = ∂J(θ₁, θ₂) / ∂θ₂
     = ∂/∂θ₂ [ (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² ]
     = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · ∂/∂θ₂ (ŷᵢ − yᵢ)
     = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · ∂/∂θ₂ (θ₁ + θ₂xᵢ − yᵢ)
     = (1/n) Σᵢ₌₁ⁿ 2(ŷᵢ − yᵢ) · (0 + xᵢ − 0)
     = (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) · xᵢ
Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression. The coefficients can be changed by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients. If α is the learning rate, the intercept and coefficient of X are updated as follows.
Gradient Descent
θ₁ = θ₁ − α · J'θ₁ = θ₁ − α · (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)
θ₂ = θ₂ − α · J'θ₂ = θ₂ − α · (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) · xᵢ
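These two update rules translate almost line for line into NumPy; a minimal sketch on toy data (the learning rate and iteration count are arbitrary choices):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2 * x + 1                                # toy data exactly on y = 2x + 1

theta1, theta2, alpha = 0.0, 0.0, 0.01       # intercept, slope, learning rate
for _ in range(5000):
    y_hat = theta1 + theta2 * x
    theta1 -= alpha * (2 / len(x)) * np.sum(y_hat - y)         # dJ/d(theta1)
    theta2 -= alpha * (2 / len(x)) * np.sum((y_hat - y) * x)   # dJ/d(theta2)

print("intercept:", round(theta1, 3), "slope:", round(theta2, 3))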
After optimizing our model, we evaluate its accuracy to see how well it will perform in real-world scenarios.
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of any linear
regression model. These assessment metrics often give an indication of how well the
model is producing the observed outputs.
The most common measurements are:
1. Mean Square Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the average of the
squared differences between the actual and predicted values for all the data points.
The difference is squared to ensure that negative and positive differences don't cancel
each other out.
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Here,
n is the number of data points.
yᵢ is the actual or observed value for the i-th data point.
ŷᵢ is the predicted value for the i-th data point.
MSE is a way to quantify the accuracy of a model's predictions. MSE is sensitive to
outliers as large errors contribute significantly to the overall score.
2. Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a
regression model. MAE measures the average absolute difference between the
predicted values and actual values.
Mathematically MAE is expressed as:
MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|
Here,
n is the number of observations
Yi represents the actual values.
Ŷᵢ represents the predicted values
A lower MAE value indicates better model performance. It is less sensitive to outliers than MSE since we consider absolute rather than squared differences.
3. Root Mean Squared Error (RMSE)
The square root of the residuals' variance is the Root Mean Squared Error. It describes
how well the observed data points match the expected values or the model's absolute
fit to the data.
In mathematical notation, it can be expressed as:
RMSE = √(RSS / n) = √( Σᵢ₌₁ⁿ (yᵢ,actual − yᵢ,predicted)² / n )
To obtain an unbiased estimate, the sum of the squared residuals is divided by the degrees of freedom rather than by the total number of data points in the model. This figure is referred to as the Residual Standard Error (RSE).
In mathematical notation, it can be expressed as:
RSE = √( RSS / (n − 2) ) = √( Σᵢ₌₁ⁿ (yᵢ,actual − yᵢ,predicted)² / (n − 2) )
RMSE is not as good a metric as R-squared. Root Mean Squared Error can fluctuate when the units of the variables vary, since its value depends on the variables' units (it is not a normalized measure).
4. Coefficient of Determination (R-squared)
R-Squared is a statistic that indicates how much variation the developed model can
explain or capture. It is always in the range of 0 to 1. In general, the better the model
matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:
R² = 1 − (RSS / TSS)
Residual sum of Squares(RSS): The sum of squares of the residual for each data
point in the plot or data is known as the residual sum of squares or RSS. It is a
measurement of the difference between the output that was observed and
what was anticipated.
RSS = Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ)²
Total Sum of Squares (TSS): The sum of the squared differences of the data points from the mean of the response variable is known as the total sum of squares or TSS.
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
The R-squared metric is a measure of the proportion of variance in the dependent variable that is explained by the independent variables in the model.
5. Adjusted R-Squared Error
Adjusted R² measures the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. Adjusted R-squared accounts for the number of predictors in the model and penalizes the model for including irrelevant predictors that don't contribute significantly to explaining the variance in the dependent variable.
Mathematically, adjusted R² is expressed as:
Adjusted R² = 1 − ( (1 − R²)(n − 1) / (n − k − 1) )
Here,
n is the number of observations
k is the number of predictors in the model
R² is the coefficient of determination
Adjusted R-square helps to prevent overfitting. It penalizes the model with additional
predictors that do not contribute significantly to explain the variance in the dependent
variable.
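All of these metrics are available in scikit-learn; a short sketch computing them on hypothetical actual and predicted arrays:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])    # toy actual values
y_pred = np.array([2.8, 5.4, 7.1, 9.3])    # toy predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

n, k = len(y_true), 1                       # observations and number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"MSE={mse:.3f} MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f} AdjR2={adj_r2:.3f}")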
While evaluation metrics help us measure the performance of a model, regularization
helps in improving that performance by addressing overfitting and enhancing
generalization.
Regularization Techniques for Linear Models
1. Lasso Regression (L1 Regularization)
Lasso Regression is a technique used for regularizing a linear regression model, it adds
a penalty term to the linear regression objective function to prevent overfitting.
The objective function after applying lasso regression is:
J(θ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)² + λ Σⱼ₌₁ⁿ |θⱼ|
the first term is the least squares loss, representing the squared difference
between predicted and actual values.
the second term is the L1 regularization term; it penalizes the sum of the absolute values of the regression coefficients θⱼ.
2. Ridge Regression (L2 Regularization)
Ridge regression is a linear regression technique that adds a regularization term to the standard linear objective. Again, the goal is to prevent overfitting by penalizing large coefficients in the linear regression equation. It is useful when the dataset has multicollinearity, where predictor variables are highly correlated.
The objective function after applying ridge regression is:
J(θ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)² + λ Σⱼ₌₁ⁿ θⱼ²
the first term is the least squares loss, representing the squared difference
between predicted and actual values.
the second term is the L2 regularization term; it penalizes the sum of the squares of the regression coefficients θⱼ.
3. Elastic Net Regression
Elastic Net Regression is a hybrid regularization technique that combines the power of
both L1 and L2 regularization in linear regression objective.
J(θ) = (1/2m) Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)² + αλ Σⱼ₌₁ⁿ |θⱼ| + ½(1 − α)λ Σⱼ₌₁ⁿ θⱼ²
the first term is least square loss.
the second term is the L1 (lasso) penalty and the third is the L2 (ridge) penalty.
λ is the overall regularization strength.
α controls the mix between L1 and L2 regularization.
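scikit-learn exposes these three penalties directly as Lasso, Ridge and ElasticNet; a short sketch (the synthetic data and penalty strengths are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100) * 0.5   # only two informative features

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))   # lasso drives noise weights towards 0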
Now that we have learned how to build a linear regression model, we will implement it in Python.
Python Implementation of Linear Regression
1. Import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as ax
from matplotlib.animation import FuncAnimation
2. Load the dataset and separate input and Target variables
Here is the link for dataset: Dataset Link
url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240320114716/
data_for_lr.csv'
data = pd.read_csv(url)
data = data.dropna()
train_input = np.array(data.x[0:500]).reshape(500, 1)
train_output = np.array(data.y[0:500]).reshape(500, 1)
test_input = np.array(data.x[500:700]).reshape(199, 1)
test_output = np.array(data.y[500:700]).reshape(199, 1)
3. Build the Linear Regression Model and Plot the regression line
In forward propagation, the linear regression function Y = mx + c is applied by initially assigning random values to the parameters m and c. We then write functions to compute the cost function (the mean squared error), the gradients via backward propagation, and the parameter updates.
class LinearRegression:
    def __init__(self):
        # Randomly initialise the slope (m) and intercept (c)
        self.parameters = {'m': np.random.uniform(-1, 1),
                           'c': np.random.uniform(-1, 1)}
        self.loss = []

    def forward_propagation(self, train_input):
        # Predict with the current line: y = m*x + c
        return self.parameters['m'] * train_input + self.parameters['c']

    def cost_function(self, predictions, train_output):
        # Mean squared error between predictions and targets
        return np.mean((train_output - predictions) ** 2)

    def backward_propagation(self, train_input, train_output, predictions):
        # Gradients of the MSE with respect to m and c
        diff = predictions - train_output
        return {'dm': 2 * np.mean(diff * train_input),
                'dc': 2 * np.mean(diff)}

    def update_parameters(self, derivatives, learning_rate):
        self.parameters['m'] -= learning_rate * derivatives['dm']
        self.parameters['c'] -= learning_rate * derivatives['dc']

    def train(self, train_input, train_output, learning_rate, iters):
        fig, ax = plt.subplots()
        x_vals = np.linspace(min(train_input), max(train_input), 100)
        line, = ax.plot(x_vals, self.parameters['m'] * x_vals + self.parameters['c'],
                        color='red', label='Regression Line')
        ax.scatter(train_input, train_output, marker='o', color='green',
                   label='Training Data')
        ax.set_ylim(0, max(train_output) + 1)

        def update(frame):
            # One gradient-descent step, then redraw the regression line
            predictions = self.forward_propagation(train_input)
            cost = self.cost_function(predictions, train_output)
            derivatives = self.backward_propagation(train_input, train_output, predictions)
            self.update_parameters(derivatives, learning_rate)
            line.set_ydata(self.parameters['m'] * x_vals + self.parameters['c'])
            self.loss.append(cost)
            print("Iteration = {}, Loss = {}".format(frame + 1, cost))
            return line,

        ani = FuncAnimation(fig, update, frames=iters, interval=200, repeat=False)
        plt.xlabel('Input')
        plt.ylabel('Output')
        plt.title('Linear Regression')
        plt.legend()
        plt.show()
        return self.parameters, self.loss
The linear regression line provides valuable insights into the relationship between the
two variables. It represents the best-fitting line that captures the overall trend of how a
dependent variable (Y) changes in response to variations in an independent variable
(X).
Positive Linear Regression Line: A positive linear regression line indicates a direct relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y also increases. The slope of a positive linear regression line is positive, meaning that the line slants upward from left to right.
Negative Linear Regression Line: A negative linear regression line indicates an inverse relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y decreases. The slope of a negative linear regression line is negative, meaning that the line slants downward from left to right.
4. Train the Model and Final Prediction
linear_reg = LinearRegression()
parameters, loss = linear_reg.train(train_input, train_output, 0.0001, 20)
Output:
Model Training
Applications of Linear Regression
Linear regression is used in many different fields including finance, economics and
psychology to understand and predict the behavior of a particular variable.
For example linear regression is widely used in finance to analyze relationships and
make predictions. It can model how a company's earnings per share (EPS) influence its
stock price. If the model shows that a $1 increase in EPS results in a $15 rise in stock
price, investors gain insights into the company's valuation. Similarly, linear regression
can forecast currency values by analyzing historical exchange rates and economic
indicators, helping financial professionals make informed decisions and manage risks
effectively.
Advantages and Disadvantages of Linear Regression
Advantages of Linear Regression
Linear regression is a relatively simple algorithm, making it easy to understand
and implement. The coefficients of the linear regression model can be
interpreted as the change in the dependent variable for a one-unit change in
the independent variable, providing insights into the relationships between
variables.
Linear regression is computationally efficient and can handle large datasets
effectively. It can be trained quickly on large datasets, making it suitable for
real-time applications.
Linear regression is relatively robust to outliers compared to other machine
learning algorithms. Outliers may have a smaller impact on the overall model
performance.
Linear regression often serves as a good baseline model for comparison with
more complex machine learning algorithms.
Linear regression is a well-established algorithm with a rich history and is
widely available in various machine learning libraries and software packages.
Disadvantages of Linear Regression
Linear regression assumes a linear relationship between the dependent and
independent variables. If the relationship is not linear, the model may not
perform well.
Linear regression is sensitive to multicollinearity, which occurs when there is a
high correlation between independent variables. Multicollinearity can inflate
the variance of the coefficients and lead to unstable model predictions.
Linear regression assumes that the features are already in a suitable form for
the model. Feature engineering may be required to transform features into a
format that can be effectively used by the model.
Linear regression is susceptible to both overfitting and underfitting. Overfitting
occurs when the model learns the training data too well and fails to generalize
to unseen data. Underfitting occurs when the model is too simple to capture
the underlying relationships in the data.
Linear regression provides limited explanatory power for complex relationships
between variables. More advanced machine learning techniques may be
necessary for deeper insights.
Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of the
dependent variable:
1. Binomial Logistic Regression: This type is used when the dependent variable
has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is
the most common form of logistic regression and is used for binary
classification problems.
2. Multinomial Logistic Regression: This is used when the dependent variable has
three or more possible categories that are not ordered. For example, classifying
animals into categories like "cat," "dog" or "sheep." It extends the binary
logistic regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the dependent variable has
three or more categories with a natural order or ranking. Examples include
ratings like "low," "medium" and "high." It takes the order of the categories into
account when modeling.
Assumptions of Logistic Regression
Understanding the assumptions behind logistic regression is important to ensure the
model is applied correctly, main assumptions are:
1. Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.
2. Binary dependent variables: The dependent variable is assumed to be binary, meaning it can take only two values. For more than two categories, the softmax function is used.
3. Linearity relationship between independent variables and log odds: The
model assumes a linear relationship between the independent variables and
the log odds of the dependent variable which means the predictors affect the
log odds in a linear way.
4. No outliers: The dataset should not contain extreme outliers as they can distort
the estimation of the logistic regression coefficients.
5. Large sample size: It requires a sufficiently large sample size to produce reliable
and stable results.
Understanding Sigmoid Function
1. The sigmoid function is an important part of logistic regression which is used to convert the raw output of the model into a probability value between 0 and 1.
2. This function takes any real number and maps it into the range 0 to 1 forming an "S"
shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie
between 0 and 1, the sigmoid function is perfect for this purpose.
3. In logistic regression, we use a threshold value usually 0.5 to decide the class label.
If the sigmoid output is equal to or above the threshold, the input is classified as Class 1.
If it is below the threshold, the input is classified as Class 0.
This approach helps to transform continuous input values into meaningful class
predictions.
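A tiny NumPy sketch of this thresholding step (the raw scores below are made-up examples):

import numpy as np

def sigmoid(z):
    # Maps any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # raw linear-model outputs
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)         # threshold at 0.5
print(probs.round(3), labels)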
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. This function is known as the logistic function.
Suppose we have input features represented as a matrix:

X = [ x₁₁ … x₁ₘ
      x₂₁ … x₂ₘ
      ⋮   ⋱  ⋮
      xₙ₁ … xₙₘ ]

and the dependent variable Y takes only binary values, i.e. 0 or 1:

Y = 0 if Class 1, 1 if Class 2
then, apply the multi-linear function to the input variables X.
z = (Σᵢ₌₁ⁿ wᵢxᵢ) + b = w · X + b
At this stage, z is a continuous value from the linear regression. Logistic regression then applies the sigmoid function to z to convert it into a probability between 0 and 1, which can be used to predict the class.
Now we use the sigmoid function where the input will be z and we find the probability
between 0 and 1. i.e. predicted y.
σ(z) = 1 / (1 + e⁻ᶻ)
Sigmoid function
As shown above the sigmoid function converts the continuous variable data into the
probability i.e between 0 and 1.
σ(z) tends towards 1 as z → ∞
σ(z) tends towards 0 as z → −∞
σ(z) is always bounded between 0 and 1
where the probability of being a class can be measured as:
P(y = 1) = σ(z) and P(y = 0) = 1 − σ(z)
Logistic Regression Equation and Odds:
It models the odds of the dependent event occurring which is the ratio of the
probability of the event to the probability of it not occurring:
p(x) / (1 − p(x)) = eᶻ
Taking the natural logarithm of the odds gives the log-odds or logit:
log[ p(x) / (1 − p(x)) ] = w · X + b
Solving for p(x) gives the probability of the input belonging to Class 1:
p(X; b, w) = e^(w·X + b) / (1 + e^(w·X + b)) = 1 / (1 + e^−(w·X + b))
Likelihood Function for Logistic Regression
The goal is to find weights w and bias b that maximize the likelihood of observing the data.
For each data point i:
for yᵢ = 1, the predicted probability is p(xᵢ)
for yᵢ = 0, the predicted probability is 1 − p(xᵢ)
L(b, w) = Πᵢ₌₁ⁿ p(xᵢ)^yᵢ (1 − p(xᵢ))^(1 − yᵢ)
Taking natural logs on both sides:
log L(b, w) = Σᵢ₌₁ⁿ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ]
            = Σᵢ₌₁ⁿ [ yᵢ log p(xᵢ) + log(1 − p(xᵢ)) − yᵢ log(1 − p(xᵢ)) ]
            = Σᵢ₌₁ⁿ log(1 − p(xᵢ)) + Σᵢ₌₁ⁿ yᵢ log[ p(xᵢ) / (1 − p(xᵢ)) ]
            = Σᵢ₌₁ⁿ −log(1 + e^(w·xᵢ + b)) + Σᵢ₌₁ⁿ yᵢ (w · xᵢ + b)
This is known as the log-likelihood function.
Gradient of the log-likelihood function
To find the best w and b we use gradient ascent on the log-likelihood function. The gradient with respect to each weight wⱼ is:
∂ log L(b, w) / ∂wⱼ = −Σᵢ₌₁ⁿ [ e^(w·xᵢ + b) / (1 + e^(w·xᵢ + b)) ] xᵢⱼ + Σᵢ₌₁ⁿ yᵢ xᵢⱼ
                   = −Σᵢ₌₁ⁿ p(xᵢ; b, w) xᵢⱼ + Σᵢ₌₁ⁿ yᵢ xᵢⱼ
                   = Σᵢ₌₁ⁿ (yᵢ − p(xᵢ; b, w)) xᵢⱼ
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
1. Independent Variables: These are the input features or predictor variables
used to make predictions about the dependent variable.
2. Dependent Variable: This is the target variable that we aim to predict. In
logistic regression, the dependent variable is categorical.
3. Logistic Function: This function transforms the independent variables into a
probability between 0 and 1 which represents the likelihood that the
dependent variable is either 0 or 1.
4. Odds: This is the ratio of the probability of an event happening to the
probability of it not happening. It differs from probability because probability is
the ratio of occurrences to total possibilities.
5. Log-Odds (Logit): The natural logarithm of the odds. In logistic regression, the
log-odds are modeled as a linear combination of the independent variables and
the intercept.
6. Coefficient: These are the parameters estimated by the logistic regression
model which shows how strongly the independent variables affect the
dependent variable.
7. Intercept: The constant term in the logistic regression model which represents
the log-odds when all independent variables are equal to zero.
8. Maximum Likelihood Estimation (MLE): This method is used to estimate the
coefficients of the logistic regression model by maximizing the likelihood of
observing the given data.
Implementation for Logistic Regression
Now, let's see the implementation of logistic regression in Python. Here we will be
implementing two main types of Logistic Regression:
1. Binomial Logistic regression:
In binomial logistic regression, the target variable can only have two possible values
such as "0" or "1", "pass" or "fail". The sigmoid function is used for prediction.
We will be using the scikit-learn library for this and show how to use the breast cancer dataset to implement a logistic regression model for classification.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
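The remaining split, fit and evaluation steps might look like the following sketch (the split ratio and max_iter value are illustrative choices):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=10000)    # higher max_iter helps convergence on unscaled data
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))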
2. Multinomial Logistic Regression:
Here the target variable can take one of K unordered classes, and the probability of class c is given by the softmax function:
P(Y = c | X = x) = e^(w_c · x + b_c) / Σₖ₌₁ᴷ e^(w_k · x + b_k)
Below is an example of implementing multinomial logistic regression using the Digits
dataset from scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

digits = datasets.load_digits()
X = digits.data
y = digits.target

# Split the data and fit a multinomial logistic regression model
# (the split ratio and max_iter below are reconstructed choices)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
reg = linear_model.LogisticRegression(max_iter=10000)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print(f"Logistic Regression model accuracy: {metrics.accuracy_score(y_test, y_pred) * 100:.2f}%")
Output:
Logistic Regression model accuracy: 96.66%
This model is used to predict one of 10 digits (0-9) based on the image features.
How to Evaluate Logistic Regression Model?
Evaluating the logistic regression model helps assess its performance and ensure it generalizes well to new, unseen data. The following metrics are commonly used (a short scikit-learn sketch follows the list):
1. Accuracy: Accuracy provides the proportion of correctly classified instances.
Accuracy = (True Positives + True Negatives) / Total
2. Precision: Precision focuses on the accuracy of positive predictions.
Precision = True Positives / (True Positives + False Positives)
3. Recall (Sensitivity or True Positive Rate): Recall measures the proportion of
correctly predicted positive instances among all actual positive instances.
Recall = True Positives / (True Positives + False Negatives)
4. F1 Score: F1 score is the harmonic mean of precision and recall.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC
curve plots the true positive rate against the false positive rate at various
thresholds. AUC-ROC measures the area under this curve which provides an
aggregate measure of a model's performance across different classification
thresholds.
6. Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC, AUC-
PR measures the area under the precision-recall curve helps in providing a
summary of a model's performance across different precision-recall trade-offs.
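A brief sketch computing these metrics with scikit-learn on hypothetical true labels, hard predictions and predicted probabilities:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                    # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # toy hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]    # toy predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))
print("AUC-PR   :", average_precision_score(y_true, y_prob))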
Differences Between Linear and Logistic Regression
Logistic regression and linear regression differ in their application and output: linear regression predicts a continuous numeric value, while logistic regression outputs a probability that is mapped to a categorical class.
Let’s consider a decision tree for predicting whether a customer will buy a product
based on age, income and previous purchases: Here's how the decision tree works:
1. Root Node (Income)
First Question: "Is the person’s income greater than $50,000?"
If Yes, proceed to the next question.
If No, predict "No Purchase" (leaf node).
2. Internal Node (Age):
If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"
If Yes, proceed to the next question.
If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases):
If the person is above 30 and has made previous purchases, predict "Purchase"
(leaf node).
If the person is above 30 and has not made previous purchases, predict "No
Purchase" (leaf node).
Decision Making with Two Decision Trees
Example: Predicting Whether a Customer Will Buy a Product Using Two Decision Trees
Tree 1: Customer Demographics
First tree asks two questions:
1. "Income > $50,000?"
If Yes, Proceed to the next question.
If No, "No Purchase"
2. "Age > 30?"
Yes: "Purchase"
No: "No Purchase"
Tree 2: Previous Purchases
"Previous Purchases > 0?"
Yes: "Purchase"
No: "No Purchase"
Once we have predictions from both trees, we can combine the results to make a final
prediction. If Tree 1 predicts "Purchase" and Tree 2 predicts "No Purchase", the final
prediction might be "Purchase" or "No Purchase" depending on the weight or
confidence assigned to each tree. This can be decided based on the problem context.
Information Gain and Gini Index in Decision Tree
Till now we have discussed the basic intuition and approach of how a decision tree works, so let's move to the attribute selection measures of decision trees. Two popular attribute selection measures are used:
1. Information Gain
Information Gain tells us how useful a question (or feature) is for splitting data into
groups. It measures how much the uncertainty decreases after the split. A good
question will create clearer groups and the feature with the highest Information Gain is
chosen to make the decision.
For example, if we split a dataset of people into "Young" and "Old" based on age, and all young people bought the product while all old people did not, the Information Gain would be high because the split perfectly separates the two groups with no uncertainty left.
Suppose S is a set of instances, A is an attribute, Sᵥ is the subset of S for which attribute A has value v, and Values(A) is the set of all possible values of A. Then:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} ( |Sᵥ| / |S| ) · Entropy(Sᵥ)
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
For example if a dataset has an equal number of "Yes" and "No" outcomes (like 3
people who bought a product and 3 who didn’t), the entropy is high because it’s
uncertain which outcome to predict. But if all the outcomes are the same (all "Yes" or
all "No") the entropy is 0, meaning there is no uncertainty left in predicting the outcome.
Suppose S is a set of instances, A is an attribute, Sᵥ is the subset of S with A = v and Values(A) is the set of all possible values of A, then:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} ( |Sᵥ| / |S| ) · Entropy(Sᵥ)
Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy H(X) = −[ (3/8) log₂(3/8) + (5/8) log₂(5/8) ]
            = −[ 0.375 × (−1.415) + 0.625 × (−0.678) ]
            = −(−0.53 − 0.424) = 0.954
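The same number can be checked with a few lines of Python (the counts come from the set X above):

import math

counts = {'a': 3, 'b': 5}            # X = {a, a, a, b, b, b, b, b}
total = sum(counts.values())

# H(X) = -sum(p * log2(p)) over the outcome probabilities
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(entropy, 3))             # ~0.954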
Building a Decision Tree using Information Gain: the essentials
Start with all training instances associated with the root node
Use info gain to choose which attribute to label each node with
Recursively construct each subtree on the subset of training instances that
would be classified down that path in the tree.
If all positive or all negative training instances remain, label that node "yes" or "no" accordingly
If no attributes remain, label with a majority vote of the training instances left at that node
If no instances remain, label with a majority vote of the parent's training instances
Example: Now let us draw a Decision Tree for the following data using Information
gain. Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes. To build a decision tree using Information Gain, we will take each of the features and calculate the information gain for each feature.
Split on feature X
Split on feature Y
Split on feature Z
From the above images we can see that the information gain is maximum when we make a split on feature Y. So the best-suited feature for the root node is feature Y. Now we can see that while splitting the dataset by feature Y, each child contains a pure subset of the target variable, so we don't need to split the dataset further. The final tree for the above dataset would look like this:
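For reference, the same tiny dataset can be handed to scikit-learn's DecisionTreeClassifier with the entropy criterion; it likewise splits once on feature Y (the column order below is an assumption about how the table is encoded):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1, 1],    # columns: X, Y, Z
     [1, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
y = ['I', 'I', 'II', 'II']

tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(tree, feature_names=['X', 'Y', 'Z']))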
2. Gini Index
Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified. It means an attribute with a lower Gini index should be
preferred. Sklearn supports “Gini” criteria for Gini Index and by default it takes “gini”
value.
For example, if we have a group of people where all bought the product (100% "Yes"), the Gini Index is 0, indicating perfect purity. But if the group has an equal mix of "Yes" and "No", the Gini Index would be 0.5, showing high impurity or uncertainty. The formula for the Gini Index is given by:
Gini = 1 − Σᵢ₌₁ⁿ pᵢ²
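A quick sketch of this formula in Python, applied to the two example groups mentioned above:

def gini(class_counts):
    # Gini = 1 - sum(p_i^2) over the class proportions
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))   # all "Yes" -> 0.0 (pure)
print(gini([5, 5]))    # even mix  -> 0.5 (maximally impure for two classes)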
Some additional features of the Gini Index are:
1. It is calculated by summing the squared probabilities of each outcome in a
distribution and subtracting the result from 1.
2. A lower Gini Index indicates a more homogeneous or pure distribution while a
higher Gini Index indicates a more heterogeneous or impure distribution.
3. In decision trees the Gini Index is used to evaluate the quality of a split by
measuring the difference between the impurity of the parent node and the
weighted impurity of the child nodes.
4. Compared to other impurity measures like entropy, the Gini Index is faster to
compute and more sensitive to changes in class probabilities.
5. One disadvantage of the Gini Index is that it tends to favour splits that create
equally sized child nodes, even if they are not optimal for classification
accuracy.
6. In practice the choice between using the Gini Index or other impurity measures
depends on the specific problem and dataset and requires experimentation and
tuning.
Understanding Decision Tree with a Real-Life Use Case
Till now we have understood the attributes and components of a decision tree. Now let's jump to a real-life use case and see how a decision tree works step by step.
Step 1. Start with the Whole Dataset
We begin with all the data which is treated as the root node of the decision tree.
Step 2. Choose the Best Question (Attribute)
Pick the best question to divide the dataset. For example ask: "What is the outlook?"
Possible answers: Sunny, Cloudy or Rainy.
Step 3. Split the Data into Subsets :
Divide the dataset into groups based on the question:
If Sunny go to one subset.
If Cloudy go to another subset.
If Rainy go to the last subset.
Step 4. Split Further if Needed (Recursive Splitting)
For each subset ask another question to refine the groups. For example, if the Sunny subset is mixed, ask: "Is the humidity high or normal?"
High humidity → "Swimming".
Normal humidity → "Hiking".
Step 5. Assign Final Decisions (Leaf Nodes)
When a subset contains only one activity, stop splitting and assign it a label:
Cloudy → "Hiking".
Rainy → "Stay Inside".
Sunny + High Humidity → "Swimming".
Sunny + Normal Humidity → "Hiking".
Step 6. Use the Tree for Predictions
To predict an activity follow the branches of the tree. Example: If the outlook is Sunny
and the humidity is High follow the tree:
Start at Outlook.
Take the branch for Sunny.
Then go to Humidity and take the branch for High Humidity.
Result: "Swimming".
A decision tree works by breaking down data step by step asking the best possible
questions at each point and stopping once it reaches a clear decision. It's an easy and
understandable way to make choices. Because of their simple and clear structure
decision trees are very helpful in machine learning for tasks like sorting data into
categories or making predictions.
Frequently Asked Questions (FAQs)
1. What are the major issues in decision tree learning?
Major issues in decision tree learning include overfitting, sensitivity to small data
changes and limited generalization. Ensuring proper pruning, tuning and handling
imbalanced data can help mitigate these challenges for more robust decision tree
models.
2. How does decision tree help in decision making?
Decision trees aid decision-making by representing complex choices in a hierarchical
structure. Each node tests specific attributes, guiding decisions based on data values.
Leaf nodes provide final outcomes, offering a clear and interpretable path for decision
analysis in machine learning.
3. What is the maximum depth of a decision tree?
The maximum depth of a decision tree is a hyperparameter that determines the
maximum number of levels or nodes from the root to any leaf. It controls the
complexity of the tree and helps prevent overfitting.
4. What is the concept of decision tree?
A decision tree is a supervised learning algorithm that models decisions based on input
features. It forms a tree-like structure where each internal node represents a decision
based on an attribute, leading to leaf nodes representing outcomes.
5. What is entropy in decision tree?
In decision trees, entropy is a measure of impurity or disorder within a dataset. It
quantifies the uncertainty associated with classifying instances, guiding the algorithm
to make informative splits for effective decision-making.
6. What are the Hyperparameters of decision tree?
1. Max Depth: Maximum depth of the tree.
2. Min Samples Split: Minimum samples required to split an internal node.
3. Min Samples Leaf: Minimum samples required in a leaf node.
4. Criterion: The function used to measure the quality of a split
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)
titanic_data = titanic_data.dropna(subset=['Survived'])
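# A sketch of the feature preparation and training steps that precede the
# predictions below (the feature choice, encoding and model parameters are assumptions)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})      # encode sex as 0/1
X['Age'] = X['Age'].fillna(X['Age'].median())          # fill missing ages
y = titanic_data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)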
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_rep)
sample = X_test.iloc[0:1]
prediction = rf_classifier.predict(sample)
sample_dict = sample.iloc[0].to_dict()
print(f"\nSample Passenger: {sample_dict}")
print(f"Predicted Survival: {'Survived' if prediction[0] == 1 else 'Did Not Survive'}")
Output:
from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing()
california_data = pd.DataFrame(california_housing.data,
                               columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target
X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']
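# A minimal sketch of the split and model construction used below
# (the split ratio and model parameters are arbitrary choices)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)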
rf_regressor.fit(X_train, y_train)
y_pred = rf_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
2. Manhattan Distance
Manhattan distance is the sum of the absolute differences along each dimension:
d(x, y) = Σᵢ₌₁ⁿ |xᵢ − yᵢ|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and
Manhattan distances as special cases.
d(x, y) = ( Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)
From the formula above, when p=2, it becomes the same as the Euclidean distance
formula and when p=1, it turns into the Manhattan distance formula. Minkowski
distance is essentially a flexible formula that can represent either Euclidean or
Manhattan distance depending on the value of p.
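A small sketch showing how the three distances relate, using SciPy's minkowski helper (the two points are arbitrary):

from scipy.spatial import distance

a, b = [1, 2, 3], [4, 0, 3]

print(distance.minkowski(a, b, p=1))   # Manhattan: |1-4| + |2-0| + |3-3| = 5
print(distance.minkowski(a, b, p=2))   # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606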
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
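A compact illustration with scikit-learn's KNeighborsClassifier (the points and the choice of K are made up):

from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]   # two obvious groups
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)    # K = 3 nearest neighbours
knn.fit(X_train, y_train)
print(knn.predict([[2, 2], [6, 5]]))         # -> [0 1]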
Mapping 1D data to 2D to make the two classes separable
In this case the new variable y is created as a function of distance from the origin.
Mathematical Computation of SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We
have a training dataset consisting of input feature vectors X and their corresponding
class labels Y. The equation for the linear hyperplane can be written as:
wᵀx + b = 0
Where:
w is the normal vector to the hyperplane (the direction perpendicular to it).
b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector w.
Distance from a Data Point to the Hyperplane
The distance between a data point xᵢ and the decision boundary can be calculated as:
dᵢ = (wᵀxᵢ + b) / ||w||
where ||w|| represents the Euclidean norm of the weight vector w.
Linear SVM Classifier
The predicted class for a data point is:
ŷ = 1 if wᵀx + b ≥ 0
ŷ = 0 if wᵀx + b < 0
Where ŷ is the predicted label of a data point.
Optimization Problem for SVM
For a linearly separable dataset the goal is to find the hyperplane that maximizes the
margin between the two classes while ensuring that all data points are correctly
classified. This leads to the following optimization problem:
minimize over w, b:   (1/2) ||w||²
Subject to the constraint:
yᵢ(wᵀxᵢ + b) ≥ 1   for i = 1, 2, 3, ⋯, m
Where:
yᵢ is the class label (+1 or −1) for each training instance.
xᵢ is the feature vector for the i-th training instance.
m is the total number of training instances.
The condition yᵢ(wᵀxᵢ + b) ≥ 1 ensures that each data point is correctly classified and lies outside the margin.
Soft Margin in Linear SVM Classifier
In the presence of outliers or non-separable data, the SVM allows some misclassification by introducing slack variables ζᵢ. The optimization problem is modified as:
minimize over w, b, ζ:   (1/2) ||w||² + C Σᵢ₌₁ᵐ ζᵢ
subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ζᵢ and ζᵢ ≥ 0, where C controls the trade-off between maximizing the margin and penalizing misclassifications.
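The C term above is exposed directly by scikit-learn's SVC; a tiny sketch, separate from the plotting snippet that follows (the data and values of C are illustrative):

from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=6)

for C in (0.01, 1, 100):                     # smaller C = softer margin, more slack allowed
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")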
# Scatter plot
plt.scatter(X[:, 0], X[:, 1],
c=y,
s=20, edgecolors="k")
plt.show()
Output:
Breast Cancer Classifications with SVM RBF kernel
Advantages of Support Vector Machine (SVM)
1. High-Dimensional Performance: SVM excels in high-dimensional spaces,
making it suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial SVM
effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers,
enhancing robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary classification
and multiclass classification suitable for applications in text classification.
5. Memory Efficiency: It focuses on support vectors making it memory efficient
compared to other algorithms.
Disadvantages of Support Vector Machine (SVM)
1. Slow Training: SVM can be slow for large datasets, affecting performance in
SVM in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting
parameters like C requires careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher
dimensions makes SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential, otherwise SVM
models may perform poorly.
For example, an online store can use K-Means to group customers based on purchase frequency and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for personalised marketing.
The algorithm works by first randomly picking some central points called centroids and
each data point is then assigned to the closest centroid forming a cluster. After all the
points are assigned to a cluster the centroids are updated by finding the average
position of the points in each cluster. This process repeats until the centroids stop
changing, forming the final clusters. The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
How k-means clustering works?
We are given a data set of items with certain features and values for these features like
a vector. The task is to categorize those items into groups. To achieve this we will use
the K-means algorithm. 'K' in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.
K means Clustering
The algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity we will use the Euclidean distance as a measurement. The
algorithm works as follows:
1. First we randomly initialize k points called means or cluster centroids.
2. We categorize each item to its closest mean and we update the mean's
coordinates, which are the averages of the items categorized in that cluster so
far.
3. We repeat the process for a given number of iterations and at the end, we have
our clusters.
The "points" mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have several options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize them at random values between the boundaries of the data set. For example, if a feature x takes values in [0, 3], we will initialize the means with values for x in [0, 3].
Selecting the right number of clusters is important for meaningful segmentation. To do this we use the Elbow Method, a graphical tool used to determine the optimal number of clusters (k) in K-means.
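A brief sketch of the Elbow Method: fit K-means for a range of k values and plot the inertia (within-cluster sum of squares); the bend or "elbow" in the curve suggests a good k (blobs data similar to the next section is assumed):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X_demo, _ = make_blobs(n_samples=500, centers=3, random_state=23)

inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()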
Implementation of K-Means Clustering in Python
We will use blobs datasets and show how clusters are made.
Step 1: Importing the necessary libraries
We are importing NumPy, Matplotlib and scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Step 2: Create custom dataset with make_blobs and plot it
X,y = make_blobs(n_samples = 500,n_features = 2,centers = 3,random_state = 23)
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()
Output:
Clustering dataset
Step 3: Initializing random centroids
The code initializes three clusters for K-means clustering. It sets a random seed and
generates random cluster centers within a specified range and creates an empty list of
points for each cluster.
k = 3
clusters = {}
np.random.seed(23)

# Create k clusters, each with a random center and an empty list of points
for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {'center': center, 'points': []}
    clusters[idx] = cluster

clusters
Output:
Random Centroids
Step 4: Plotting random initialize center with data points
plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
center = clusters[i]['center']
plt.scatter(center[0],center[1],marker = '*',c = 'red')
plt.show()
Output:
Data points with random center
The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks
the initial cluster centers (red stars) generated for K-means clustering.
Step 5: Defining Euclidean distance
def distance(p1,p2):
return np.sqrt(np.sum((p1-p2)**2))
Step 6: Creating function Assign and Update the cluster center
This step assigns data points to the nearest cluster center and the M-step updates
cluster centers based on the mean of assigned points in K-means clustering.
def assign_clusters(X, clusters):
    # E-step: assign every point to its nearest cluster center
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    # M-step: move each center to the mean of its assigned points
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            clusters[i]['center'] = points.mean(axis=0)
            clusters[i]['points'] = []
    return clusters
Step 7: Creating function to Predict the cluster for the datapoints
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Step 8: Assign, Update and predict the cluster center
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
Step 9: Plotting data points with their predicted cluster center
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
Output:
K-means Clustering
The plot shows data points colored by their predicted clusters. The red markers
represent the updated cluster centers after the E-M steps in the K-means clustering
algorithm.
Dendrogram
In the above image, the left side shows five points labeled P, Q, R, S and T. These
represent individual data points that are being clustered. On the right side there's
a dendrogram which shows how these points are grouped together step by step.
At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
As you move up, the closest points are merged into a single group.
The lines connecting the points show how they are progressively merged based
on similarity.
The height at which they are connected shows how similar the points are to
each other; the shorter the line, the more similar they are.
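A short sketch that draws such a dendrogram with SciPy's hierarchical clustering utilities, using five hypothetical 2-D points standing in for P, Q, R, S and T (illustrative only, not the article's own data):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 1], [1.2, 1.1], [3, 3], [3.1, 3.2], [5, 1]])
labels = ['P', 'Q', 'R', 'S', 'T']

Z = linkage(points, method='ward')   # merge the closest clusters step by step
dendrogram(Z, labels=labels)         # height of each merge reflects dissimilarity
plt.ylabel('Merge distance')
plt.show()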
Types of Hierarchical Clustering
Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering
(HAC). Unlike flat clustering hierarchical clustering provides a structured way to group
data. This clustering algorithm does not require us to prespecify the number of
clusters. Bottom-up algorithms treat each data as a singleton cluster at the outset and
then successively agglomerate pairs of clusters until all clusters have been merged into
a single cluster that contains all data.
from sklearn.cluster import AgglomerativeClustering
X = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)
Output :
[1, 1, 1, 0, 0, 0]
Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not require us to
prespecify the number of clusters. Top-down clustering requires a method for splitting
a cluster that contains the whole data and proceeds by splitting clusters recursively
until individual data points have been split into singleton clusters.
Workflow for Hierarchical Divisive clustering :
1. Start with all data points in one cluster: Treat the entire dataset as a single
large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is
typically done by finding the two most dissimilar points in the cluster and using
them to separate the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until
every data point is its own cluster, or the stopping condition (such as a
predefined number of clusters) is met.
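scikit-learn does not ship a generic divisive hierarchical algorithm, but its BisectingKMeans (available in scikit-learn 1.1+, used here only as an illustrative stand-in) follows the same top-down splitting idea:
from sklearn.cluster import BisectingKMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# repeatedly bisect clusters (top-down) until n_clusters groups remain
bisect = BisectingKMeans(n_clusters=2, random_state=0).fit(X)
print(bisect.labels_)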
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
Step 4: Applying PCA algorithm
We reduce the data from 3 features to 2 new features called principal
components. These components capture most of the original information but in
fewer dimensions.
We split the data into 70% training and 30% testing sets.
We train a logistic regression model on the reduced training data and predict
gender labels on the test set.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Step 5: Evaluating with Confusion Matrix
The confusion matrix compares actual vs predicted labels. This makes it easy to see
where predictions were correct or wrong.
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Female', 'Male'],
yticklabels=['Female', 'Male'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Output:
Confusion matrix
Step 6: Visualizing PCA Result
y_numeric = pd.factorize(y)[0]
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_numeric, cmap='coolwarm', edgecolor='k',
s=80)
plt.xlabel('Original Feature 1')
plt.ylabel('Original Feature 2')
plt.title('Before PCA: Using First 2 Standardized Features')
plt.colorbar(label='Target classes')
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric, cmap='coolwarm', edgecolor='k',
s=80)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('After PCA: Projected onto 2 Principal Components')
plt.colorbar(label='Target classes')
plt.tight_layout()
plt.show()
Output:
PCA Algorithm
Left Plot Before PCA: This shows the original standardized data plotted using
the first two features. There is no guarantee of clear separation between
classes as these are raw input dimensions.
Right Plot After PCA: This displays the transformed data using the top 2
principal components. These new components capture the maximum
variance often showing better class separation and structure making it easier
to analyze or model.
Advantages of Principal Component Analysis
1. Multicollinearity Handling: Creates new, uncorrelated variables to address
issues when original features are highly correlated.
2. Noise Reduction: Eliminates components with low variance, enhancing data
clarity.
3. Data Compression: Represents data with fewer components, reducing storage
needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones
deviate significantly in the reduced space.
Disadvantages of Principal Component Analysis
1. Interpretation Challenges: The new components are combinations of original
variables which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data before application or
results may be misleading.
3. Information Loss: Reducing dimensions may lose some important information
if too few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are
linear and may struggle with non-linear data.
5. Computational Complexity: Can be slow and resource-intensive on very large
datasets.
6. Risk of Overfitting: Using too many components or working with a small
dataset might lead to models that don't generalize well.
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Step 2: Find Frequent 1-Itemsets
Lets count how many transactions include each item in the dataset (calculating
the frequency of each item).
Frequent 1-Itemsets
All items have support ≥ 50%, so they qualify as frequent 1-itemsets. If any item
had support below 50%, it would be omitted from the frequent 1-itemsets.
Step 3: Generate Candidate 2-Itemsets
Combine the frequent 1-itemsets into pairs and calculate their support. For this
use case we get 3 item pairs: (bread, butter), (bread, milk) and (butter, milk),
and we calculate their support in the same way as step 2.
Candidate 2-Itemsets
Frequent 2-itemsets: {Bread, Milk} meets the 50% threshold, but {Butter, Milk}
and {Bread, Butter} do not meet the threshold, so they are omitted.
Step 4: Generate Candidate 3-Itemsets
Combine the frequent 2-itemsets into groups of 3 and calculate their support.
For the triplet we only have one candidate, {bread, butter, milk}, and we
calculate its support.
Candidate 3-Itemsets
Since this does not meet the 50% threshold, there are no frequent 3-itemsets.
Step 5: Generate Association Rules
Now we generate rules from the frequent itemsets and calculate confidence.
Rule 1: If Bread → Butter (if customer buys bread, the customer will buy
butter also)
Support of {Bread, Butter} = 2.
Support of {Bread} = 4.
Confidence = 2/4 = 50% (Failed threshold).
Rule 2: If Butter → Bread (if customer buys butter, the customer will buy
bread also)
Support of {Bread, Butter} = 3.
Support of {Butter} = 3.
Confidence = 3/3 = 100% (Passes threshold).
Rule 3: If Bread → Milk (if customer buys bread, the customer will buy milk
also)
Support of {Bread, Milk} = 3.
Support of {Bread} = 4.
Confidence = 3/4 = 75% (Passes threshold).
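One way to run these steps programmatically is with the mlxtend library (an assumption here; the article's walkthrough is manual), using a small hypothetical transaction list in the spirit of the bread-butter-milk example:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# hypothetical transactions, not the article's exact table
transactions = [
    ['bread', 'butter'],
    ['bread', 'milk'],
    ['bread', 'butter', 'milk'],
    ['bread', 'milk'],
]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df_items = pd.DataFrame(te_array, columns=te.columns_)

# frequent itemsets with support >= 50%, then rules with confidence >= 70%
frequent_itemsets = apriori(df_items, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(frequent_itemsets)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])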
The Apriori Algorithm, as demonstrated in the bread-butter example, is widely
used in modern startups like Zomato, Swiggy and other food delivery platforms.
These companies use it to perform market basket analysis which helps them
identify customer behaviour patterns and optimise recommendations.
Applications of Apriori Algorithm
Below are some applications of Apriori algorithm used in today's companies
and startups
1. E-commerce: Used to recommend products that are often bought together like
laptop + laptop bag, increasing sales.
2. Food Delivery Services: Identifies popular combos such as burger + fries, to
offer combo deals to customers.
3. Streaming Services: Recommends related movies or shows based on what
users often watch together like action + superhero movies.
4. Financial Services: Analyzes spending habits to suggest personalised offers such
as credit card deals based on frequent purchases.
5. Travel & Hospitality: Creates travel packages like flight + hotel by finding
commonly purchased services together.
6. Health & Fitness: Suggests workout plans or supplements based on users' past
activities like protein shakes + workouts.
Classification Metrics
In a classification task, our main task is to predict the target variable, which is in the
form of discrete values. To evaluate the performance of such a model, following are
the commonly used evaluation metrics:
Accuracy
Logarithmic Loss
Area Under Curve
Precision
Recall
F1 Score
Confusion Matrix
Accuracy
Accuracy is a fundamental metric for evaluating the performance of a classification
model, providing a quick snapshot of how well the model is performing in terms of
correct predictions. It is calculated as the ratio of correct predictions to the total
number of input samples.
It works well only when there are roughly equal numbers of samples in each class. For
example, suppose our training set has 90% samples of class A and 10% of class B. A
model that predicts class A for every input reaches 90% training accuracy without
learning anything useful. If we then test the same model on a set that is 60% class A
and 40% class B, the accuracy falls to 60%. Accuracy therefore gives a false sense of
good performance on imbalanced data, because the probability of misclassifying
minority-class samples is very high.
Logarithmic Loss
Log loss penalizes confident but incorrect classifications and usually works well for
multi-class classification. To compute log loss, the classifier must assign a probability
to each class for every sample. If there are N samples belonging to
M classes, then we calculate the Log loss in this way:
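With y_ij indicating whether sample i belongs to class j and p_ij the predicted probability for that class, the usual form is:
Log Loss = −(1/N) Σ_i Σ_j y_ij · log(p_ij)   (outer sum over the N samples, inner sum over the M classes)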
Area Under Curve
False Positive Rate and True Positive Rate both have values in the range [0, 1]. The
ROC curve plots the True Positive Rate against the False Positive Rate at different
classification thresholds, and AUC is the area under this curve, also in the range [0, 1].
The greater the value of AUC, the better the performance of the model.
ROC Curve for Evaluation of Classification Models
Precision
There is another metric named Precision. Precision is a measure of a model’s
performance that tells you how many of the positive predictions made by the model
are actually correct.
Recall
Recall is the ratio of correctly predicted positive instances to the total actual positive
instances. It measures how well the model captures all relevant positive cases.
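In terms of confusion-matrix counts (TP = true positives, FP = false positives, FN = false negatives), the standard definitions are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)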
F1 Score
F1-Score is a harmonic mean between recall and precision. Its range is [0,1]. This
metric usually tells us how precise (correctly classifies how many
instances) and robust (does not miss any significant number of instances) our
classifier is.
High precision with low recall gives very accurate positive predictions but misses a
large number of positive instances. The higher the F1 score, the better the
performance. It can be expressed mathematically in this way:
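In its standard harmonic-mean form:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)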
Confusion Matrix
Confusion matrix creates a N X N matrix, where N is the number of classes or
categories that are to be predicted. Here we have N = 2, so we get a 2 X 2 matrix.
Suppose we are working on a binary classification problem in which samples belong to
either Yes or No. We build a classifier that predicts the class for each new input sample,
and we then test the model on 165 samples and get the following result.
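All of these classification metrics are available in scikit-learn; a small illustrative sketch with hypothetical labels (not the 165-sample example above) is:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# hypothetical ground-truth and predicted labels for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))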
R2 - Score
The coefficient of determination, also called the R2 score, is used to evaluate the
performance of a linear regression model. It is the proportion of variation in the output
(dependent) attribute that is predictable from the input independent variable(s). It is
used to check how well the observed results are reproduced by the model, based on
the proportion of the total variation in the outcomes that the model explains.
where
m - Number of Features
n - Number of Examples
y_i - Actual Target Value
ŷ_i - Predicted Target Value
Lets see how to implement this using python:
X, y = make_regression(n_samples=100, n_features=5, noise=0.1,
random_state=42) : Generates a regression dataset with 100 samples, 5
features and some noise.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42) : Splits the data into 80% training and 20% testing sets.
lasso = Lasso(alpha=0.1) : Creates a Lasso regression model with regularization
strength alpha set to 0.1.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Coefficients:", lasso.coef_)
Output:
Lasso Regression
The output shows the model's prediction error and the importance of features with
some coefficients reduced to zero due to L1 regularization.
2. Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge
regression. It adds the squared magnitude of the coefficient as a penalty term to the
loss function(L).
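A common way to write this penalized loss, using the notation listed below, is:
L = Σ (y_i − ŷ_i)² + λ Σ w_i²   (first sum over the n examples, second over the m coefficients)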
where,
n = Number of examples or data points
m = Number of features i.e. predictor variables
y_i = Actual target value for the i-th example
ŷ_i = Predicted target value for the i-th example
w_i = Coefficients of the features
λ = Regularization parameter that controls the strength of regularization
Lets see how to implement this using python:
ridge = Ridge(alpha=1.0) : Creates a Ridge regression model with regularization
strength alpha set to 1.0.
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Coefficients:", ridge.coef_)
Ridge Regression
The output shows the MSE showing model performance. Lower MSE means better
accuracy. The coefficients reflect the regularized feature weights.
3. Elastic Net Regression
Elastic Net Regression is a combination of both L1 and L2 regularization. It adds both
the absolute norm of the weights and the squared magnitude of the weights to the
loss function, with an extra hyperparameter that controls the ratio between the
L1 and L2 regularization.
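Using the notation listed below, the combined penalty is commonly written as:
L = Σ (y_i − ŷ_i)² + λ [ α Σ |w_i| + (1 − α) Σ w_i² ]   (first sum over the n examples, the penalty sums over the m coefficients)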
where
n = Number of examples (data points)
m = Number of features (predictor variables)
y_i = Actual target value for the i-th example
ŷ_i = Predicted target value for the i-th example
w_i = Coefficients of the features
λ = Regularization parameter that controls the strength of regularization
α = Mixing parameter, where 0 ≤ α ≤ 1: α = 1 corresponds to Lasso (L1) regularization,
α = 0 corresponds to Ridge (L2) regularization and values between 0 and 1 provide a
balance of both L1 and L2 regularization
Lets see how to implement this using python:
model = ElasticNet(alpha=1.0, l1_ratio=0.5) : Creates an Elastic Net model with
regularization strength alpha=1.0 and L1/L2 mixing ratio 0.5.
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Overfitting happens when a machine learning model learns the training data too well,
including the noise and random details. This makes the model perform poorly on
new, unseen data because it memorizes the training data instead of understanding the
general patterns.
For example, if we only study last week's weather to predict tomorrow's, our model
might focus on one-time events like a sudden rainstorm which won't help for future
predictions.
Underfitting is the opposite problem which happens when the model is too simple to
learn even the basic patterns in the data. An underfitted model performs poorly on
both training and new data. To fix this we need to make the model more complex or
add more features.
For example, if we use only the average temperature of the year to predict tomorrow's
weather, the model misses important details like seasonal changes, which results
in bad predictions.
What are Bias and Variance?
Bias refers to the error that occurs when we try to fit a statistical model to
real-world data that does not fit any mathematical model perfectly.
If we use far too simplistic a model to fit the data, we are likely to face
high bias (underfitting), the case when the model is unable to learn the
patterns in the data at hand and performs poorly.
Variance is the error that occurs when the model makes predictions on data
that it has not previously seen. High variance (overfitting) occurs when the
model learns noise that is present in the training data.
Finding a proper balance between the two is also known as the Bias-Variance
Tradeoff which helps us to design an accurate model.
Bias Variance tradeoff
The Bias-Variance Tradeoff refers to the balance between bias and variance which
affect predictive model performance. Finding the right tradeoff is important for
creating models that generalize well to new data.
The bias-variance tradeoff shows the inverse relationship between bias and
variance. When one decreases, the other tends to increase and vice versa.
Finding the right balance is important. An overly simple model with high bias
won't capture the underlying patterns while an overly complex model with high
variance will fit the noise in the data.
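This tradeoff is often summarized by the standard decomposition of a model's expected prediction error:
Expected Error = Bias² + Variance + Irreducible Error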
Benefits of Regularization
Now, let’s see various benefits of regularization which are as follows:
1. Prevents Overfitting: Regularization helps models focus on underlying patterns
instead of memorizing noise in the training data.
2. Improves Interpretability: L1 (Lasso) regularization simplifies models by
reducing less important feature coefficients to zero.
3. Enhances Performance: Prevents excessive weighting of outliers or irrelevant
features, which helps improve overall model accuracy.
4. Stabilizes Models: Reduces sensitivity to minor data changes which ensures
consistency across different data subsets.
5. Prevents Complexity: Keeps model from becoming too complex which is
important for limited or noisy data.
6. Handles Multicollinearity: Reduces the magnitudes of correlated coefficients,
which improves model stability.
7. Allows Fine-Tuning: Hyperparameters like alpha and lambda control
regularization strength, helping balance bias and variance.
8. Promotes Consistency: Ensures reliable performance across different datasets
which reduces the risk of large performance shifts.
Iteration | Training Set Observations | Testing Set Observations
1 | [5-24] | [0-4]
5 | [0-19] | [20-24]
Each iteration uses different subsets for testing and training, ensuring that all data
points are used for both training and testing.
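A minimal sketch reproducing the iterations above with scikit-learn's KFold on 25 observations (illustrative only, not part of the article's own code):
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(25, 1)   # 25 observations indexed 0-24
kf = KFold(n_splits=5)             # 5 contiguous folds, no shuffling

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # each fold of 5 observations is used exactly once as the test set
    print(f"Iteration {i}: test observations {test_idx[0]}-{test_idx[-1]}")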
Comparison between K-Fold Cross-Validation and Hold Out Method
K-Fold Cross-Validation and Hold Out Method are widely used technique and
sometimes they are confusing so here is the quick comparison between them:
Feature | K-Fold Cross-Validation | Hold-Out Method
Training Sets | The model is trained 'k' times, each time on a different training subset. | The model is trained once on the training set.
Testing Sets | The model is tested 'k' times, each time on a different test subset. | The model is tested once on the test set.
Bias | Less biased due to multiple splits and testing. | Can have higher bias due to a single split.
Suitability for Small Datasets | Preferred for small datasets, as it maximizes data usage. | Less ideal for small datasets, as a significant portion of the data is never used for training.
The grid search technique will construct multiple versions of the model with all
possible combinations of C and Alpha, resulting in a total of 5 * 4 = 20 different
models. The best-performing combination is then chosen.
Example: Tuning Logistic Regression with GridSearchCV
The following code illustrates how to use GridSearchCV . In this below code:
We generate sample data using make_classification.
We define a range of C values using logarithmic scale.
GridSearchCV tries all combinations from param_grid and uses 5-fold cross-
validation.
It returns the best hyperparameter (C) and its corresponding validation score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Output:
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853
This represents the highest accuracy achieved by the model using the hyperparameter
combination C = 0.0061. The best score of 0.853 means the model achieved 85.3%
accuracy on the validation data during the grid search process.
2. RandomizedSearchCV
As the name suggests RandomizedSearchCV picks random combinations of
hyperparameters from the given ranges instead of checking every single combination
like GridSearchCV.
In each iteration it tries a new random combination of hyperparameter values.
It records the model’s performance for each combination.
After several attempts it selects the best-performing set.
Example: Tuning Decision Tree with RandomizedSearchCV
The following code illustrates how to use RandomizedSearchCV. In this example:
We define a range of values for each hyperparameter
e.g, max_depth, min_samples_leaf etc.
Random combinations are picked and evaluated using 5-fold cross-validation.
The best combination and score are printed.
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"]
}
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
Output:
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': None,
'max_features': 6, 'min_samples_leaf': 6}
Best score is 0.8
The best score printed above is the mean cross-validated accuracy the model achieved
with the listed hyperparameters; exact values vary between runs because the
combinations are sampled randomly.
3. Bayesian Optimization
Grid Search and Random Search can be inefficient because they blindly try many
hyperparameter combinations, even if some are clearly not useful. Bayesian
Optimization takes a smarter approach. It treats hyperparameter tuning like a
mathematical optimization problem and learns from past results to decide what to try
next.
Build a probabilistic model (surrogate function) that predicts performance
based on hyperparameters.
Update this model after each evaluation.
Use the model to choose the next best set to try.
Repeat until the optimal combination is found. The surrogate function models:
P(score (y) | hyperparameters (x))
Here the surrogate function models the relationship between the hyperparameters x and
the score y. By updating this model iteratively with each new evaluation, Bayesian
optimization makes more informed decisions. Common surrogate models used in
Bayesian optimization include:
Gaussian Processes
Random Forest Regression
Tree-structured Parzen Estimators (TPE)
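As a rough sketch, assuming the optuna library is available (its default sampler is a TPE, one of the surrogates listed above), Bayesian-style tuning of logistic regression's C could look like this; the search range and trial count are illustrative choices:
import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)

def objective(trial):
    # sample C on a log scale; the surrogate model learns which regions score well
    C = trial.suggest_float("C", 1e-5, 1e2, log=True)
    model = LogisticRegression(C=C, max_iter=1000)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)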
Advantages of Hyperparameter tuning
Improved Model Performance: Finding the optimal combination of
hyperparameters can significantly boost model accuracy and robustness.
Reduced Overfitting and Underfitting: Tuning helps to prevent both overfitting
and underfitting resulting in a well-balanced model.
Enhanced Model Generalizability: By selecting hyperparameters that optimize
performance on validation data the model is more likely to generalize well to
unseen data.
Optimized Resource Utilization: With careful tuning resources such as
computation time and memory can be used more efficiently avoiding
unnecessary work.
Improved Model Interpretability: Properly tuned hyperparameters can make
the model simpler and easier to interpret.
Challenges in Hyperparameter Tuning
Dealing with High-Dimensional Hyperparameter Spaces: The larger the
hyperparameter space the more combinations need to be explored. This makes
the search process computationally expensive and time-consuming especially
for complex models with many hyperparameters.
Handling Expensive Function Evaluations: Evaluating a model's performance
can be computationally expensive, particularly for models that require a lot of
data or iterations.
Incorporating Domain Knowledge: It can help guide the hyperparameter
search, narrowing down the search space and making the process more
efficient. Using insights from the problem context can improve both the
efficiency and effectiveness of tuning.
Developing Adaptive Hyperparameter Tuning Methods: Dynamic adjustment
of hyperparameters during training such as learning rate schedules or early
stopping can lead to better model performance.
Reinforcement Learning (RL) is a branch
of machine learning that focuses on how agents can learn to make decisions
through trial and error to maximize cumulative rewards. RL allows machines to
learn by interacting with an environment and receiving feedback based on their
actions. This feedback comes in the form of rewards or penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or
decision-maker) interacts with an environment to achieve a goal. The agent
performs actions and receives feedback to optimize its decision-making over
time.
Agent: The decision-maker that performs actions.
Environment: The world or system in which the agent operates.
State: The situation or condition the agent is currently in.
Action: The possible moves or decisions the agent can make.
Reward: The feedback or result from the environment based on the agent’s
action.
How Reinforcement Learning Works?
The RL process involves an agent performing actions in an environment,
receiving rewards or penalties based on those actions, and adjusting its
behavior accordingly. This loop helps the agent improve its decision-making
over time to maximize the cumulative reward.
Here’s a breakdown of RL components:
Policy: A strategy that the agent uses to determine the next action based on
the current state.
Reward Function: A function that provides feedback on the actions taken,
guiding the agent towards its goal.
Value Function: Estimates the future cumulative rewards the agent will receive
from a given state.
Model of the Environment: A representation of the environment that predicts
future states and rewards, aiding in planning.
Reinforcement Learning Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while avoiding fire
hazards. The goal is to find the optimal path with the least number of hazards
while maximizing the reward:
Each time the robot moves correctly, it receives a reward.
If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By trying various
moves, it evaluates the rewards and penalties for each path. Over time, the
robot determines the best route by selecting the actions that lead to the
highest cumulative reward.
if terminated:
    state = env.reset()  # Reset the environment if the episode is finished
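This fragment comes from an agent-environment interaction loop. A fuller, self-contained version, sketched here with the gymnasium package and the FrozenLake environment (both assumptions, since the article does not show its own setup), might look like:
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
state, info = env.reset(seed=0)

for _ in range(100):
    action = env.action_space.sample()   # random policy, purely for illustration
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()        # reset the environment when the episode ends
env.close()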
Ensemble learning is a method where we use many small models instead of just
one. Each of these models may not be very strong on its own, but when we put
their results together, we get a better and more accurate answer. It's like asking
a group of people for advice instead of just one person—each one might be a
little wrong, but together, they usually give a better answer.
Types of Ensembles Learning in Machine Learning
There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training
data. Their results are then combined—usually by averaging (for regression) or
voting (for classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the
errors made by the previous ones. The final prediction is a weighted
combination of all models, which helps reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their
predictions are used as inputs to a final model, called a meta-model. The meta-
model learns how to best combine the predictions of the base models, aiming
for better performance than any individual model.
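As a brief sketch of the stacking idea using scikit-learn's StackingClassifier (illustrative only, separate from the bagging and boosting walkthroughs below): two different base models feed their predictions into a logistic-regression meta-model.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# base models produce predictions that the meta-model learns to combine
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)), ('svc', SVC())],
    final_estimator=LogisticRegression()
)
stack.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, stack.predict(X_test)))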
1. Bagging Algorithm
Bagging classifier can be used for both regression and classification tasks. Here
is an overview of Bagging classifier algorithm:
Bootstrap Sampling: Draws 'N' random subsets of the original training data by
sampling with replacement, so each subset may repeat some rows and leave out
others. This step ensures that the base models are trained on diverse subsets of the
data.
Base Model Training: For each bootstrapped sample we train a base model
independently on that subset of data. These weak models are trained in
parallel to increase computational efficiency and reduce time consumption. We
can use different base learners i.e. different ML models as base learners to
bring variety and robustness.
Prediction Aggregation: To make a prediction on testing data combine the
predictions of all base models. For classification tasks it can include majority
voting or weighted majority while for regression it involves averaging the
predictions.
Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training
subset of particular base models during the bootstrapping method. These “out-
of-bag” samples can be used to estimate the model’s performance without the
need for cross-validation.
Final Prediction: After aggregating the predictions from all the base models,
Bagging produces a final prediction for each instance.
Python code for implementing a Bagging estimator with scikit-learn:
1. Importing Libraries and Loading Data
BaggingClassifier: for creating an ensemble of classifiers trained on different
subsets of data.
DecisionTreeClassifier: the base classifier used in the bagging ensemble.
load_iris: to load the Iris dataset for classification.
train_test_split: to split the dataset into training and testing subsets.
accuracy_score: to evaluate the model’s prediction accuracy.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Iris Dataset
data = load_iris(): loads the Iris dataset, which includes features and target
labels.
X = data.data: extracts the feature matrix (input variables).
y = data.target: extracts the target vector (class labels).
train_test_split(...): splits the data into training (80%) and testing (20%) sets,
with random_state=42 to ensure reproducibility.
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
3. Creating a Base Classifier
A decision tree is chosen as the base model. Decision trees are prone to overfitting
when trained on small datasets, which makes them good candidates for bagging.
base_classifier = DecisionTreeClassifier(): initializes a Decision Tree classifier,
which will serve as the base estimator in the Bagging ensemble.
base_classifier = DecisionTreeClassifier()
4. Creating and Training the Bagging Classifier
A BaggingClassifier is created using the decision tree as the base classifier.
n_estimators = 10 specifies that 10 decision trees will be trained on different
bootstrapped subsets of the training data.
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10,
random_state=42)
bagging_classifier.fit(X_train, y_train)
5. Making Predictions and Evaluating Accuracy
The trained bagging model predicts labels for test data.
The accuracy of the predictions is calculated by comparing the predicted labels
(y_pred) to the actual labels (y_test).
y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
2. Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to
create a strong learner. Weak models are trained in series such that each next
model tries to correct errors of the previous model until the entire training
dataset is predicted correctly. One of the most well-known boosting algorithms
is AdaBoost (Adaptive Boosting). Here is an overview of Boosting algorithm:
Initialize Model Weights: Begin with a single weak learner and assign equal
weights to all training examples.
Train Weak Learner: Train a weak learner on this dataset.
Sequential Learning: Boosting works by training models sequentially where
each model focuses on correcting the errors of its predecessor. Boosting
typically uses a single type of weak learner like decision trees.
Weight Adjustment: Boosting assigns weights to training datapoints.
Misclassified examples receive higher weights in the next iteration so that next
models pay more attention to them.
Python code for implementing a Boosting estimator with scikit-learn:
1. Importing Libraries and Modules
AdaBoostClassifier from sklearn.ensemble: for building the AdaBoost
ensemble model.
DecisionTreeClassifier from sklearn.tree: as the base weak learner for
AdaBoost.
load_iris from sklearn.datasets: to load the Iris dataset.
train_test_split from sklearn.model_selection: to split the dataset into training
and testing sets.
accuracy_score from sklearn.metrics: to evaluate the model’s accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Dataset
data = load_iris(): loads the Iris dataset, which includes features and target
labels.
X = data.data: extracts the feature matrix (input variables).
y = data.target: extracts the target vector (class labels).
train_test_split(...): splits the data into training (80%) and testing (20%) sets,
with random_state=42 to ensure reproducibility.
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
3. Defining the Weak Learner
We are creating the base classifier as a decision tree with maximum depth 1 (a
decision stump). This simple tree will act as a weak learner for the AdaBoost
algorithm, which iteratively improves by combining many such weak learners.
base_classifier = DecisionTreeClassifier(max_depth=1)
4. Creating and Training the AdaBoost Classifier
base_classifier: The weak learner used in boosting.
n_estimators = 50: Number of weak learners to train sequentially.
learning_rate = 1.0: Controls the contribution of each weak learner to the
final model.
random_state = 42: Ensures reproducibility.
adaboost_classifier = AdaBoostClassifier(
base_classifier, n_estimators=50, learning_rate=1.0, random_state=42
)
adaboost_classifier.fit(X_train, y_train)
5. Making Predictions and Calculating Accuracy
We are predicting the test labels with the trained AdaBoost model and then calculating
the accuracy by comparing the true labels y_test with the predicted labels y_pred. The
accuracy_score function returns the proportion of correctly predicted samples. Then,
we print the accuracy value.
y_pred = adaboost_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
Benefits of Ensemble Learning in Machine Learning
Ensemble learning is a versatile approach that can be applied to machine
learning models for:
Reduction in Overfitting: By aggregating predictions of multiple model's
ensembles can reduce overfitting that individual complex models might exhibit.
Improved Generalization: It generalizes better to unseen data by minimizing
variance and bias.
Increased Accuracy: Combining multiple models gives higher predictive
accuracy.
Robustness to Noise: It mitigates the effect of noisy or incorrect data points by
averaging out predictions from diverse models.
Flexibility: It can work with diverse models including decision trees, neural
networks and support vector machines making them highly adaptable.
Bias-Variance Tradeoff: Techniques like bagging reduce variance, while
boosting reduces bias leading to better overall performance.
There are various ensemble learning techniques we can use as each one of
them has their own pros and cons.
Ensemble Learning Techniques
Technique | Category | Description
Data from the sources is extracted and stored in the data warehouse. The entire data is
not transformed; only the required transformations are done when necessary. Raw
data can be retrieved from the warehouse anytime when required. The data,
transformed as needed, is then sent forward for analysis. When we use ELT, we move
the entire data set as it exists in the source systems to the target. This means that we
have the raw data at our disposal in the data warehouse, in contrast to the ETL
approach.
ETL Process
ETL is the traditional technique of extracting raw data, transforming it as required for
the users and storing it in data warehouses. ELT was later developed, with ETL as its
base. The three operations in ETL and ELT are the same, except that their order of
processing is slightly different. This change in sequence was made to overcome some
drawbacks.
1. Extract: It is the process of extracting raw data from all available data sources
such as databases, files, ERP, CRM or any other.
2. Transform: The extracted data is immediately transformed as required by the
user.
3. Load: The transformed data is then loaded into the data warehouse from
where the users can access it.
The data collected from the sources is directly stored in the staging area. The required
transformations are performed on the data in the staging area. Once the data is
transformed, the resultant data is stored in the data warehouse. The main drawback of
the ETL architecture is that once the transformed data is stored in the warehouse, it
cannot be modified again. In contrast, in ELT, a copy of the raw data is always available
in the warehouse and only the required data is transformed when needed.
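A toy sketch of this Extract-Transform-Load order, using pandas and an in-memory SQLite database as a stand-in for the warehouse (all file, column and table names here are hypothetical):
import sqlite3
import pandas as pd

# Extract: read raw data from a source (a small inline frame stands in for pd.read_csv("source.csv"))
raw = pd.DataFrame({"name": [" Alice ", "Bob"], "amount": ["10", "20"]})

# Transform: clean text fields and convert types before loading
transformed = raw.assign(name=raw["name"].str.strip(), amount=raw["amount"].astype(int))

# Load: write the transformed data into the warehouse table
conn = sqlite3.connect(":memory:")
transformed.to_sql("sales", conn, index=False)
print(pd.read_sql("SELECT * FROM sales", conn))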
Difference between ELT and ETL
Category | ETL | ELT
Acronym Meaning | Extract, Transform, Load | Extract, Load, Transform
Data Output | Primarily structured data. | Structured, semi-structured and unstructured data.
Category | OLAP | OLTP
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for Data Mining, Analytics, Decision making, etc. | It is application-oriented. Used for business tasks.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Av = λv, where
A is the matrix,
v is the associated eigenvector and
λ is the scalar eigenvalue.
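A quick illustrative check of this relationship with NumPy's eigen-decomposition (the matrix below is hypothetical):
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]     # first eigenvector (column)
lam = eigenvalues[0]       # its eigenvalue
print(np.allclose(A @ v, lam * v))  # True: A v equals λ v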