Supervised Learning
Kalaiselvi
231ADC501T
Machine Learning Techniques
UNIT II
SUPERVISED LEARNING AND ENSEMBLE TECHNIQUES
Regression –Simple Linear Regression –Multiple Linear Regression – Logistic
Regression –Classification – K Nearest Neighbours Classifier – Naïve Bayes Classifier –
Support Vector Machine – Ensemble Techniques – Decision Trees – Random Forest –
Bagging – Boosting.
2.1 Regression:
Regression refers to methods used to model and analyze the relationship between a
dependent variable and one or more independent variables. It helps predict or explain the value
of the dependent variable based on the independent variables.
Regression is a type of supervised learning technique in machine learning that models
the relationship between a dependent variable (target/output) and one or more independent
variables (features/inputs). The goal of regression is to predict a continuous output.
📘 Key Concepts
Input: Features (e.g., age, income, hours studied)
Output: A continuous value (e.g., price, salary)
Linear Regression:
Linear regression is a statistical method used in machine learning to model the relationship between a
dependent variable and one or more independent variables. It models relationships by fitting a linear
equation to observed data, often serving as a starting point for more complex algorithms and is widely
used in predictive analysis.
• Linear regression is a statistical regression method which is used for predictive analysis.
• It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
• It is used for solving the regression problem in machine learning.
• Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
• If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is called multiple
linear regression.
• The relationship between variables in the linear regression model can be visualized with a scatter plot, for example when predicting the salary of an employee on the basis of years of experience.
Essentially, linear regression models the relationship between a dependent variable (the outcome you
want to predict) and one or more independent variables (the input features you use for prediction) by
finding the best-fitting straight line through a set of data points. This line, called the regression line,
represents the relationship between the dependent variable (the outcome we want to predict) and the
independent variable(s) (the input features we use for prediction). The equation for a simple linear
regression line is defined as:
y=mx+c
where y is the dependent variable, x is the independent variable, m is the slope of the line, and c is the
y-intercept. This equation provides a mathematical model for mapping inputs to predicted outputs,
with the goal of minimizing the differences between predicted and observed values, known as residuals.
By minimizing these residuals, linear regression produces a model that best represents the data
Conceptually, linear regression can be visualized as drawing a straight line through points on a graph
to determine if there is a relationship between those data points. The ideal linear regression model for
a set of data points is the line that best approximates the values of every point in the data set.
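To make the idea concrete, here is a small illustrative sketch (not part of the original notes) that fits the line y = mx + c by least squares using Python and NumPy; the experience and salary values are assumed purely for demonstration.

import numpy as np

# Hypothetical data: years of experience (x) vs. salary in thousands (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)

# Closed-form least-squares estimates of slope m and intercept c
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

print(f"Fitted line: y = {m:.2f}x + {c:.2f}")
print("Predicted salary for 6 years of experience:", m * 6 + c)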
Simple Linear Regression is a regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, or a sloped straight line, hence it is called Simple Linear Regression. The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.
Linear regression is a statistical method used for predictive analysis. It models the relationship between
a dependent variable and a single independent variable by fitting a linear equation to the data.
Multiple Linear Regression extends this concept by modelling the relationship between a dependent
variable and two or more independent variables. This technique allows us to understand how multiple
features collectively affect the outcomes.
Y = β0 + β1X1 + β2X2 + ⋯ + βnXn
Where:
Y is the dependent variable
X1, X2, ⋯, Xn are the independent variables
β0 is the intercept
β1, β2, ⋯, βn are the slopes
The goal of the algorithm is to find the best fit line equation that can predict the values based on the
independent variables. A regression model learns from the dataset with known X and y values and uses
it to predict y values for unknown X.
“Multiple Linear Regression is one of the important regression algorithms which models the linear relationship between a single dependent continuous variable and more than one independent variable.”
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple predictor
variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear Regression, so the same is applied
for the multiple linear regression equation, the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ...... + bnxn ............... (a)
Where,
Y= Output/Response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model.
x1, x2, x3, ..., xn = Various independent/feature variables
As an example, consider a real estate agent who wants to estimate house prices. The agent could use a
simple linear regression based on a single variable, like the size of the house or the zip code, but this
model would be too simplistic, as housing prices are often driven by a complex interplay of multiple
factors. A multiple linear regression, incorporating variables like the size of the house, the
neighborhood, and the number of bedrooms, will likely provide a more accurate prediction model.
Linear regression works by finding the best-fitting line through a set of data points. This process
involves:
1 Selecting the model: In the first step, the appropriate linear equation to describe the relationship
between the dependent and independent variables is selected.
2 Fitting the model: Next, a technique called Ordinary Least Squares (OLS) is used to minimize the sum
of the squared differences between the observed values and the values predicted by the model. This is
done by adjusting the slope and intercept of the line to find the best fit. The purpose of this method is
to minimize the error, or difference, between the predicted and actual values. This fitting process is a
core part of supervised machine learning, in which the model learns from the training data.
3 Evaluating the model: In the final step, the quality of fit is assessed using metrics such as R-squared,
which measures the proportion of the variance in the dependent variable that is predictable from the
independent variables. In other words, R-squared measures how well the data actually fits the
regression model.
This process generates a machine learning model that can then be used to make predictions based on
new data.
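The three steps above can be sketched with scikit-learn, assuming a toy house-price dataset; the sizes, bedroom counts and prices below are made up for illustration only.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed toy data: [house size in sq.ft, number of bedrooms] -> price in thousands
X = [[1200, 2], [1500, 3], [1700, 3], [2000, 4], [2400, 4], [3000, 5]]
y = [150, 200, 210, 260, 300, 370]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = LinearRegression()      # 1. select the model (a linear equation)
model.fit(X_train, y_train)     # 2. fit the model with Ordinary Least Squares
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("R-squared on unseen data:", model.score(X_test, y_test))   # 3. evaluate the model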
Marketers can use linear regression to understand how advertising spend affects sales revenue. By
applying a linear regression model to historical data, future sales revenue can be predicted, allowing
marketers to optimize their budgets and advertising strategies for maximum impact.
Advantages of linear regression:
Computationally efficient
Machine learning models can be resource-intensive. Linear regression requires relatively low
computational power compared to many algorithms and can still provide meaningful predictive
insights.
Interpretable results
Advanced statistical models, while powerful, are often hard to interpret. With a simple model like linear
regression, the relationship between variables is easy to understand, and the impact of each variable is
clearly indicated by its coefficient.
Limitations of linear regression:
Sensitivity to outliers
Outliers are data points that significantly deviate from the majority of observations in a dataset. If not
handled properly, these extreme value points can skew results, leading to inaccurate conclusions. In
machine learning, this sensitivity means that outliers can disproportionately affect the predictive
accuracy and reliability of the model.
Multicollinearity
In multiple linear regression models, highly correlated independent variables can distort the results, a
phenomenon known as multicollinearity. For example, the number of bedrooms in a house and its size
might be highly correlated since larger houses tend to have more bedrooms. This can make it difficult
to determine the individual impact of individual variables on housing prices, leading to unreliable
results.
Unlike the linear function used in linear regression, logistic regression models the probability of a
categorical outcome using an S-shaped curve called a logistic function. In the example of binary
classification, data points that belong to the “yes” category fall on one side of the S-shape, while the data
points in the “no” category fall on the other side. Practically speaking, logistic regression can be used to
classify whether an email is spam or not, or predict whether a customer will purchase a product or not.
Essentially, linear regression is used for predicting quantitative values, whereas logistic regression is
used for classification tasks.
Linear Regression Line:
A linear line showing the relationship between the dependent and independent variables is called a
regression line.
A regression line can show two types of relationship:
• Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, such a relationship is termed a positive linear relationship.
• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, such a relationship is termed a negative linear relationship.
Cost Function:
When working with linear regression, our main goal is to find the best fit line, which means the error between predicted values and actual values should be minimized. The best fit line will have the least error. Different values of the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1; to do this we use a cost function.
The cost function is used to estimate the values of the coefficients for the best fit line.
• Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the linear equation y = a1x + a0, MSE can be calculated as:
MSE = (1/N) Σ (yi − (a1xi + a0))², summed over i = 1 to N
Where,
N = Total number of observations
yi = Actual value
(a1xi + a0) = Predicted value.
Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small and hence the cost function will be small.
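The following short sketch (with assumed data and an assumed candidate line, not taken from the notes) shows how residuals and the MSE cost are computed in practice:

import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2.1, 4.2, 6.1, 8.3], dtype=float)   # actual values
a1, a0 = 2.0, 0.1                                  # candidate slope and intercept

y_pred = a1 * x + a0            # predicted values
residuals = y - y_pred          # distance between actual and predicted values
mse = np.mean(residuals ** 2)   # cost: average of the squared residuals
print("Residuals:", residuals)
print("MSE:", mse)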
A cost function is an important parameter that determines how well a machine learning model
performs for a given dataset. It calculates the difference between the expected value and
predicted value and represents it as a single real number.
In machine learning, once we train our model, we want to see how well it is performing. Although there are various accuracy metrics that tell you how your model is performing, they do not give insights into how to improve it. So, we need a function that can find when the model is most accurate by finding the spot between the undertrained and overtrained model.
In simple terms, "Cost function is a measure of how wrong the model is in estimating the relationship between X (input) and Y (output)." A cost function is sometimes also referred to as a loss function, and it can be estimated by iteratively running the model to compare estimated predictions against the known values of Y.
The main aim of each ML model is to determine parameters or weights that can minimize the cost
function.
It means that to get the optimal solution, we need a cost function. It calculates the difference between the actual values and the predicted values and measures how wrong our model was in its predictions. By minimizing the value of the cost function, we can get the optimal solution.
Regression cost functions: Mean Error, Mean Squared Error, Mean Absolute Error
Binary classification cost function: Cross Entropy
Multi-class classification cost function: Categorical Cross Entropy
Gradient Descent:
• Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
• A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
• It is done by a random selection of values of coefficient and then iteratively update the values to
reach the minimum cost function.
As we discussed in the above section, the cost function tells how wrong your model is, and each machine learning model tries to minimize the cost function in order to give the best results. Here comes the role of gradient descent.
"Gradient Descent is an optimization algorithm which is used for optimizing the cost function or error in the model." It enables the model to follow the gradient, or direction, that reduces the error toward the least possible value. Here, direction refers to how the model parameters should be corrected to further reduce the cost function. The error in your model can be different at different points, and you have to find the quickest way to minimize it, to prevent resource wastage.
Gradient descent is an iterative process where the model gradually converges towards a
minimum value, and if the model iterates further than this point, it produces little or zero changes
in the loss. This point is known as convergence, and at this point, the error is least, and the cost
function is optimized.
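A minimal gradient-descent sketch for the MSE cost of the line y = a1x + a0 is given below; the data, learning rate and number of iterations are assumptions chosen only for illustration.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)   # underlying relationship: y = 2x + 1

a1, a0 = 0.0, 0.0    # initial coefficients
lr = 0.02            # learning rate (step size)

for _ in range(2000):
    y_pred = a1 * x + a0
    error = y_pred - y
    grad_a1 = 2 * np.mean(error * x)   # gradient of MSE with respect to a1
    grad_a0 = 2 * np.mean(error)       # gradient of MSE with respect to a0
    a1 -= lr * grad_a1                 # step in the direction that reduces the cost
    a0 -= lr * grad_a0

print(f"Learned line: y = {a1:.2f}x + {a0:.2f}")   # converges towards y = 2x + 1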
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below method:
1. R-squared method:
• R-squared is a statistical method that determines the goodness of fit.
• It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
• A high value of R-squared indicates a small difference between the predicted values and actual values and hence represents a good model.
• It is also called a coefficient of determination, or coefficient of multiple determination for
multiple regression.
• It can be calculated from the below formula:
R² = 1 − (Sum of squared residuals) / (Total sum of squares)
Assumptions of linear regression:
• Homoscedasticity: Homoscedasticity is a situation when the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of data in the scatter plot.
• Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If error
terms are not normally distributed, then confidence intervals will become either too wide or too
narrow, which may cause difficulties in finding coefficients.
It can be checked using the q-q plot. If the plot shows a straight line without any deviation, it means the errors are normally distributed.
• No autocorrelation: The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
2.4 Logistic Regression:
• Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary or discrete
format such as 0 or 1.
• Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True
or False, Spam or not spam, etc.
• It is a predictive analysis algorithm which works on the concept of probability.
• Logistic regression is a type of regression, but it is different from the linear regression algorithm
in the term how they are used.
• Logistic regression uses the sigmoid function, or logistic function, to model the data. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
where,
• f(x) = output between the 0 and 1 value
• x = input to the function
• e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives the S-curve as follows:
It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
Binary Logistic Regression: This type is used when the dependent variable has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic regression and is used for binary classification problems.
Multinomial Logistic Regression: This is used when the dependent variable has three or more
possible categories that are not ordered. For example, classifying animals into categories like "cat,"
"dog" or "sheep." It extends the binary logistic regression to handle multiple classes.
Ordinal Logistic Regression:
This type applies when the dependent variable has three or more categories with a natural order or
ranking.
Understanding the assumptions behind logistic regression is important to ensure the model is applied
correctly, main assumptions are:
Independent observations: Each data point is assumed to be independent of the others means there
should be no correlation or dependence between the input samples.
Binary dependent variables: It takes the assumption that the dependent variable must be binary,
means it can take only two values. For more than two categories SoftMax functions are used.
Linearity relationship between independent variables and log odds: The model assumes a linear
relationship between the independent variables and the log odds of the dependent variable which
means the predictors affect the log odds in a linear way.
No outliers: The dataset should not contain extreme outliers as they can distort the estimation of the
logistic regression coefficients.
Large sample size: It requires a sufficiently large sample size to produce reliable and stable results.
1. The model first computes a linear combination of the input features, z = w·X + b, as detailed below.
2. The sigmoid function takes any real number and maps it into the range 0 to 1, forming an "S"-shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is perfect for this purpose.
3. In logistic regression, we use a threshold value, usually 0.5, to decide the class label.
If the sigmoid output is at or above the threshold, the input is classified as Class 1.
If it is below the threshold, the input is classified as Class 0.
This approach helps transform continuous input values into meaningful class predictions.
z = w·X + b
At this stage, z is a continuous value from the linear part of the model. Logistic regression then applies the sigmoid function to z to convert it into a probability between 0 and 1, which can be used to predict the class.
Now we use the sigmoid function, where the input is z, and we find the probability between 0 and 1, i.e. the predicted y.
As shown above, the sigmoid function converts the continuous value into a probability between 0 and 1:
P(y=1)=σ(z)
P(y=0)=1−σ(z)
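The prediction step described above can be sketched as follows; the weights, bias and inputs are assumed values for illustration, not parameters learned from real data.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])   # assumed learned weights
b = 0.2                     # assumed learned bias
X = np.array([[2.0, 1.0],   # two example inputs
              [0.5, 3.0]])

z = X @ w + b                      # linear combination z = w.X + b
p = sigmoid(z)                     # P(y = 1), a probability between 0 and 1
labels = (p >= 0.5).astype(int)    # apply the 0.5 threshold
print("Probabilities:", p)
print("Predicted classes:", labels)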
2.5 Classification
Classification teaches a machine to sort things into categories. It learns by looking at examples
with labels (like emails marked "spam" or "not spam"). After learning, it can decide which
category new items belong to, like identifying if a new email is spam or not. For example a
classification model might be trained on dataset of images labeled as either dogs or cats and it
can be used to predict the class of new and unseen images as dogs or cats based on their features
such as color, texture and shape.
To explain classification in ML, imagine a plot in which the horizontal axis represents the combined values of color and texture features and the vertical axis represents the combined values of shape and size features.
Each colored dot in the plot represents an individual image, with the color indicating whether the model predicts the image to be a dog or a cat.
The shaded areas in the plot show the decision boundary, which is the line or region that the model uses
to decide which category (dog or cat) an image belongs to. The model classifies images on one side of
the boundary as dogs and on the other side as cats, based on their features. Basically, machine looks at
the features in the image (like shape, color, or texture) and chooses which animal the picture is most
likely to be based on the training it received.
1. Binary Classification
This is the simplest kind of classification. In binary classification, the goal is to sort the data into two
distinct categories. Think of it like a simple choice between two options. Imagine a system that sorts
emails into either spam or not spam. It works by looking at different features of the email like certain
keywords or sender details, and decides whether it’s spam or not. It only chooses between these two
options.
2. Multiclass Classification
Here, instead of just two categories, the data needs to be sorted into more than two categories. The
model picks the one that best matches the input. Think of an image recognition system that sorts
pictures of animals into categories like cat, dog, and bird.
3. Multi-Label Classification
In multi-label classification, a single piece of data can belong to multiple categories at once. Unlike multiclass classification, where each data point belongs to only one class, multi-label classification allows data points to belong to multiple classes. A movie recommendation system could tag a movie as
both action and comedy. The system checks various features (like movie plot, actors, or genre tags) and
assigns multiple labels to a single piece of data, rather than just one. Multilabel classification is relevant
in specific use cases, but not as crucial for a starting overview of classification.
In machine learning, classification works by training a model to learn patterns from labeled data, so it
can predict the category or class of new, unseen data.
Working
Data Collection: You start with a dataset where each item is labeled with the correct class (for
example, "cat" or "dog").
Feature Extraction: The system identifies features (like color, shape, or texture) that help
distinguish one class from another. These features are what the model uses to make predictions.
Model Training: The classification algorithm uses the labeled data to learn how to map the features to the correct class. It looks for patterns and relationships in the data.
Model Evaluation: Once the model is trained, it's tested on new, unseen data to check how
accurately it can classify the items.
Prediction: After being trained and evaluated, the model can be used to predict the class of new
data based on the features it has learned.
Model Evaluation: Evaluating a classification model is a key step in machine learning. It helps us
check how well the model performs and how good it is at handling new, unseen data. Depending
on the problem and needs we can use different metrics to measure its performance. If the quality
metric is not satisfactory, the ML algorithm or hyperparameters can be adjusted, and the model
is retrained. This iterative process continues until a satisfactory performance is achieved. In
short, classification in machine learning is all about using existing labeled data to teach the
model how to predict the class of new, unlabeled data based on the patterns it has learned.
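As an end-to-end illustration of this workflow, the sketch below trains and evaluates a classifier with scikit-learn on its built-in iris dataset; logistic regression is used here only as an example, and any classifier could be substituted.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # labeled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000)                 # model training
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                            # prediction on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))      # model evaluation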
Non-linear models create a non-linear decision boundary between classes. They can capture
more complex relationships between input features and target variable. Some of the non-linear
classification models are as follows:
K-Nearest Neighbours
Kernel SVM
Naive Bayes
Decision Tree Classification
Ensemble learning classifiers:
Random Forests,
AdaBoost,
Bagging Classifier,
Voting Classifier,
Extra Trees Classifier
Multi-layer Artificial Neural Networks
Classification algorithms are widely used in many real-world applications across various domains,
including:
Credit risk assessment: Algorithms predict whether a loan applicant is likely to default by
analyzing factors such as credit score, income, and loan history. This helps banks make informed
lending decisions and minimize financial risk.
Medical diagnosis : Machine learning models classify whether a patient has a certain condition
(e.g., cancer or diabetes) based on medical data such as test results, symptoms, and patient
history. This aids doctors in making quicker, more accurate diagnoses, improving patient care.
Image classification : Applied in fields such as facial recognition, autonomous driving, and
medical imaging.
Sentiment analysis: Determining whether the sentiment of a piece of text is positive, negative,
or neutral. Businesses use this to understand customer opinions, helping to improve products
and services.
Fraud detection : Algorithms detect fraudulent activities by analyzing transaction patterns and
identifying anomalies crucial in protecting against credit card fraud and other financial crimes.
Recommendation systems : Used to recommend products or content based on past user
behavior, such as suggesting movies on Netflix or products on Amazon. This personalization
boosts user satisfaction and sales for businesses.
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for classification, but it can also be used for regression tasks. It works by finding the "k" closest data points (neighbors) to a given input and makes predictions based on the majority class (for classification) or the average value (for regression). Since KNN makes no assumptions about the underlying data distribution, it is a non-parametric and instance-based learning method.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
For example, consider a set of data points described by two features and plotted on a graph, with a new point that must be assigned to one of two categories.
The new point is classified as Category 2 because most of its closest neighbors are blue squares. KNN assigns the category based on the majority of nearby points, predicting the category of a new data point from its closest neighbours.
The red diamonds represent Category 1 and the blue squares represent Category 2.
The new data point checks its closest neighbors (circled points).
Since the majority of its closest neighbors are blue squares (Category 2) KNN predicts the
new data point belongs to Category 2.
In the k-Nearest Neighbours algorithm k is just a number that tells the algorithm how
many nearby points or neighbors to look at when it makes a decision.
The value of k in KNN decides how many neighbors the algorithm looks at when making a
prediction.
Choosing the right k is important for good results.
If the data has lots of noise or outliers, using a larger k can make the predictions more
stable.
But if k is too large the model may become too simple and miss important patterns and
this is called underfitting.
So k should be picked carefully based on the data.
Example: Imagine you're deciding which fruit a new fruit is, based on its shape and size. You compare it to fruits you already know.
KNN uses distance metrics to identify the nearest neighbors; these neighbors are then used for the classification or regression task. To identify the nearest neighbors, we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or space.
You can think of it like the shortest path you would walk if you were to go directly from one point
to another.
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and vertical
lines like a grid or city streets. It’s also called "taxicab distance" because a taxi can only drive along
the grid-like streets of a city.
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan distances as special cases. It is defined as:
d(x, y) = (Σ |xi − yi|^p)^(1/p)
From the formula above, when p = 2 it becomes the same as the Euclidean distance formula, and when p = 1 it turns into the Manhattan distance formula. Minkowski distance is essentially a flexible formula that can represent either Euclidean or Manhattan distance depending on the value of p.
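These three metrics can be computed with a single helper function, since Euclidean and Manhattan distances are Minkowski distances with p = 2 and p = 1; the feature vectors below are assumed for illustration.

import numpy as np

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print("Euclidean (p=2):", minkowski(a, b, 2))   # straight-line distance -> 5.0
print("Manhattan (p=1):", minkowski(a, b, 1))   # taxicab distance -> 7.0
print("Minkowski (p=3):", minkowski(a, b, 3))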
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
To measure the similarity between target and training data points Euclidean distance
is used. Distance is calculated between data points in the dataset and target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
When you want to classify a data point into a category like spam or not spam, the KNN
algorithm looks at the K closest points in the dataset. These closest points are called neighbors.
The algorithm then looks at which category the neighbors belong to and picks the one that
appears the most. This is called majority voting.
In regression, the algorithm still looks for the K closest points. But instead of voting for a class as in classification, it takes the average of the values of those K neighbors. This average is the algorithm's predicted value for the new point.
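The four steps can be combined into a short from-scratch sketch of KNN classification; the training points and labels below are made up to mirror the Category 1 / Category 2 example above.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]], dtype=float)
y_train = np.array(["Category 1", "Category 1", "Category 1",
                    "Category 2", "Category 2", "Category 2"])

print(knn_predict(X_train, y_train, np.array([6.5, 6.0]), k=3))   # -> Category 2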
Ensemble Techniques
Decision Trees
Random Forest
Bagging – Boosting.