Chapter III - Supervised and Unsupervised Algorithms

Department: Mathematics & Computer Science
Master of DPEIC – First year

Semester 2
OR & Artificial Intelligence
Chapter III - Supervised Machine Learning
Pr. Soufiane HAMIDA

Supervised ML Algorithms
How Does Supervised Learning Work?
• In supervised learning, models are trained on a labelled dataset, in which every example carries the correct output. Once training is complete, the model is evaluated on test data (a held-out portion of the labelled data that was not used for training), and it can then predict the output for new inputs.
• [Figure: example and diagram of the supervised learning workflow]

Steps Involved in Supervised Learning
• First, determine the type of training dataset.
• Collect the labelled training data.
• Split the dataset into a training set, a validation set, and a test set.
• Determine the input features of the training dataset, which should carry enough information for the model to predict the output accurately.
• Choose a suitable algorithm for the model, such as a support vector machine, a decision tree, etc.
• Execute the algorithm on the training set. A validation set, held out from the training data, is sometimes needed to tune control parameters (hyperparameters).
• Evaluate the accuracy of the model on the test set. If the model predicts the correct output for most test examples, the model is accurate.
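The split and evaluation steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the `split_dataset` and `accuracy` helpers, the 60/20/20 proportions, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the examples, then cut them into train / validation / test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left
    return train, val, test

def accuracy(model, examples):
    """Fraction of (x, y) examples for which model(x) equals y."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# Made-up labelled data: the label is 1 when x >= 50
data = [(x, int(x >= 50)) for x in range(100)]
train, val, test = split_dataset(data)

# Hypothetical "trained" model, standing in for the output of a real algorithm
def threshold_model(x):
    return int(x >= 50)

test_accuracy = accuracy(threshold_model, test)
```

The three parts never overlap, so the test accuracy is measured on examples the model has not seen during training.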
Key Concepts
To master supervised learning, you absolutely must understand and know the following 4 concepts:
1. The Dataset
2. The learning algorithm
3. The Model and its parameters
4. The Cost Function

1) The Dataset
We talk about supervised learning when we provide a machine with many examples (x, y) in order to make it learn the relationship that connects x to y.

• The variable y is called the Target. This is the value we are trying to predict.
• The variable x is called a Feature. A Feature influences the value of y, and we generally have many Features (x1, x2, …) in our Dataset, which we group together in a matrix X.
Example: a Dataset brings together examples of apartments with their price y as well as some of their characteristics (Features).

2) The learning algorithm
• The main objective in Supervised Learning is to find the model parameters that minimize the Cost Function. To do this, we use a learning algorithm, the most common example being the Gradient Descent algorithm.
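As an illustration, here is a minimal sketch of Gradient Descent for a linear model f(x) = a·x + b. The choice of the mean squared error as Cost Function, the learning rate, the number of iterations, and the toy data are assumptions made for this example; the slides do not fix them.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iterations=2000):
    """Fit f(x) = a*x + b by minimising J(a, b) = 1/(2m) * sum((a*x + b - y)^2)."""
    a, b = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iterations):
        # Prediction errors with the current parameters
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to a and b
        grad_a = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        # Step against the gradient
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Made-up data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent(xs, ys)
```

On this noise-free data the parameters approach a = 2 and b = 1; a learning rate that is too large would make the updates diverge instead.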

3) The Model and its parameters
• We develop a model from the Dataset. It can be a linear model or a non-linear model.
• We define a, b, c, etc. as the parameters of a model.

4) The Cost Function
A model can produce errors when making predictions compared to the actual values in our dataset. These errors measure how well the model is performing: a lower error indicates a better fit to the data.
The method by which we aggregate these errors to measure the overall performance of the model is known as the Cost Function or Loss Function.

• A 'good' model is generally characterized by its ability to make accurate predictions on new, previously unseen data.
• The smaller the value returned by the Cost Function, the smaller the differences between the predicted and actual values, indicating a better-performing model.
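One common way to aggregate the errors is the mean squared error (MSE). The slides do not say which Cost Function is used, so MSE is an assumption here, and the values below are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and actual values."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]

close_predictions = [3.0, 5.0, 8.0]   # off by 1 on a single example
far_predictions = [0.0, 5.0, 10.0]    # off by 3 on two examples

cost_close = mean_squared_error(y_true, close_predictions)
cost_far = mean_squared_error(y_true, far_predictions)
```

The model whose predictions are closer to the actual values gets the smaller cost, which is the behaviour described above.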

Types of Supervised ML Algorithms
• Supervised learning can be further divided into two types of problems: regression and classification.

Regression vs. Classification in ML

Recap
Regression Algorithm | Classification Algorithm
In regression, the output variable must be continuous (a real value). | In classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data.
In regression, we try to find the best-fit line, which can predict the output more accurately. | In classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
Regression algorithms can be further divided into linear and non-linear regression. | Classification algorithms can be divided into binary classifiers and multi-class classifiers.

Choosing the most appropriate algorithm
1. Problem Nature: Classification or Regression
2. Data Characteristics: Size of the Dataset, Feature Types, Feature Dimensionality, Data Quality, …
3. Model Complexity and Interpretability: Complexity, Interpretability, …
4. Experience and Domain Knowledge: Previous Successes and Expertise, …
5. Model Updates and Scalability: Static vs. Dynamic Data, Scalability, …

Performance Evaluation
Generalization and overfitting
Main challenge of Supervised learning:
• It is relatively easy to train a model that “works” well (low prediction error) on the training data. Extreme example: learning “by rote”.
• Generalization: the ability of the model to make good predictions on data whose label is unknown.
• Overfitting: when performance is clearly better on the training data than on new data.
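The "learning by rote" extreme can be made concrete with a small sketch. The `RoteLearner` class, its fallback rule for unseen inputs, and the toy data are hypothetical and serve only to show the gap between training and test performance.

```python
class RoteLearner:
    """Memorises every training example; has no real rule for unseen inputs."""

    def fit(self, examples):
        self.memory = dict(examples)
        labels = [y for _, y in examples]
        # Fallback for unseen inputs: the most frequent training label
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return self.memory.get(x, self.default)

def accuracy(model, examples):
    return sum(model.predict(x) == y for x, y in examples) / len(examples)

# Made-up data: the true label is 1 when x >= 10
train_set = [(x, int(x >= 10)) for x in range(0, 30, 2)]  # even x only
test_set = [(x, int(x >= 10)) for x in range(1, 30, 2)]   # odd x, never seen

model = RoteLearner().fit(train_set)
train_accuracy = accuracy(model, train_set)
test_accuracy = accuracy(model, test_set)
```

The training accuracy is perfect because every training example was memorised, while the test accuracy is lower: the model does not generalize.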
Over-fitting and Under-fitting
1. Over-fitting - Example
• Over-fitting occurs when the model fits the training data so closely that it pays too much attention to noise. The model learns the relationship between features and labels in so much detail that it picks up the noise as well.
2. Under-fitting - Example
• Under-fitting is the opposite of over-fitting. It occurs when the model does not approximate the underlying function well enough and is therefore unable to capture the underlying trend of the data.
Training and test set
Cross validation
• Goal: to use all the data for both training and validation
• Goal: to obtain an average performance
• We separate the data set into K blocks (folds)
• In practice, K=5 or K=10 most often (a balance between the number of experiments and the size of each training set)
• We use each of the blocks in turn as a validation set and the union of the others as a training set.
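The fold rotation described above can be sketched as follows. The helper names `k_fold_splits` and `cross_validate`, and the `train_and_score` callback, are hypothetical; a real experiment would normally shuffle the data before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices), one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score(train, validation) over the k folds."""
    scores = [train_and_score(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(10, 5))
```

Every index appears in exactly one validation fold, so all the data is used for both training and validation across the K experiments.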
Model Selection: Validation Set
How to determine the best model among those learned:
- with different learning algorithms;
- with different hyperparameter values for the same algorithm?
• Idea: select the one with the best performance on the test set.
• Problem: we can no longer estimate the generalization error, because the test data has already been used to make the choice.
• Solution: we separate the data into 3 sets: training, validation and test. The validation set is used to select the model; the test set is used only once, to estimate the generalization error of the selected model.
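A minimal sketch of this three-set protocol is shown below. The candidate "models" are hypothetical threshold rules standing in for models trained with different algorithms or hyperparameters, and the data is made up.

```python
def accuracy(model, examples):
    """Fraction of (x, y) pairs for which model(x) == y."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# Hypothetical candidates, assumed to have been fitted on the training set
candidates = {
    "threshold_3": lambda x: int(x >= 3),
    "threshold_5": lambda x: int(x >= 5),
    "threshold_7": lambda x: int(x >= 7),
}

# Made-up labelled data; the true rule is y = 1 when x >= 5
validation_set = [(1, 0), (4, 0), (5, 1), (9, 1)]
test_set = [(0, 0), (3, 0), (6, 1), (8, 1)]

# Model selection looks at the validation set only
best_name = max(candidates, key=lambda name: accuracy(candidates[name], validation_set))

# The test set is used once, after the choice has been made
test_accuracy = accuracy(candidates[best_name], test_set)
```

Because the test set played no part in the choice, `test_accuracy` remains an honest estimate of the generalization performance of the selected model.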
Model Selection: Cross-Validation
Hyper-parameters Tuning
GridSearchCV systematically works through every combination of the specified hyper-parameter values, cross-validating as it goes to determine which combination gives the best performance. It is thorough but can be slow for large datasets and many parameters.
RandomizedSearchCV (the randomized search class in scikit-learn) samples a fixed number of parameter settings from specified distributions. This approach can be faster and more efficient, especially when dealing with a large hyper-parameter space, as it does not try every combination but samples at random across a wide range of values.
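The difference between the two strategies can be sketched without any library. The functions below are not the scikit-learn API: `grid_search`, `random_search`, and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a cross-validated score.

```python
import itertools
import random

def grid_search(param_grid, score):
    """Score every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

def random_search(param_grid, score, n_iter, seed=0):
    """Score only n_iter randomly sampled settings; return (best_params, best_score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Made-up score that peaks at k=5 with distance weighting
def toy_score(params):
    bonus = 0.5 if params["weights"] == "distance" else 0.0
    return -abs(params["k"] - 5) + bonus

grid = {"k": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
best_grid = grid_search(grid, toy_score)                # 10 evaluations
best_random = random_search(grid, toy_score, n_iter=4)  # 4 evaluations
```

The grid search always finds the best setting in the grid at the cost of evaluating all 10 combinations; the random search evaluates only 4 and can therefore never score higher than the grid search on the same grid.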
Hyper-parameters Tuning - Example
Evaluation of a Classification model: Confusion Matrix
• The confusion matrix is a matrix used to assess the performance of a classification model on a given set of test data. It can only be computed if the true values for the test data are known.
• The matrix itself is easy to understand, but the related terminology may be confusing. Since it shows the errors of the model in the form of a matrix, it is also known as an error matrix.
Confusion Matrix in Machine Learning
Some features of the confusion matrix are given below:
• For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, it is a 3×3 table, and so on.
• The matrix has two dimensions, predicted values and actual values, along with the total number of predictions.
• Predicted values are the values predicted by the model, and actual values are the true values for the given observations.
• For a two-class classifier, it looks like the table below:
                  Actual: Yes             Actual: No
Predicted: Yes    True Positive (TP)      False Positive (FP)
Predicted: No     False Negative (FN)     True Negative (TN)
From the previous example, we can conclude that:
• The table is given for a two-class classifier, which has two predictions, "Yes" and "No". Here, "Yes" means that the patient has the disease, and "No" means that the patient does not have the disease.
• The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct and 11 are incorrect.
• The model predicted "Yes" 32 times and "No" 68 times, whereas the actual value was "Yes" 27 times and "No" 73 times.
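The four cells of the confusion matrix can be recovered from the totals quoted above. The short derivation below is a sketch that uses only those totals (100 predictions, 89 correct, 32 predicted "Yes", 27 actual "Yes"); the variable names are chosen for this example.

```python
total = 100          # TP + TN + FP + FN
correct = 89         # TP + TN
predicted_yes = 32   # TP + FP
actual_yes = 27      # TP + FN

# TN = total - predicted_yes - actual_yes + TP, and correct = TP + TN,
# hence correct = 2*TP + total - predicted_yes - actual_yes
tp = (correct - total + predicted_yes + actual_yes) // 2
fp = predicted_yes - tp
fn = actual_yes - tp
tn = correct - tp

print(tp, fp, fn, tn)  # 24 8 3 65
```

The 11 incorrect predictions are therefore 8 false positives and 3 false negatives.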
Multi-class classification: Confusion Matrix
[Figure: two confusion matrices, one for a binary classification problem and one for a multiclass classification problem; axes: predicted class and actual class]
Calculations using Confusion Matrix
We can perform various calculations for the model, such as the model's accuracy, using this matrix. These calculations are given below:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
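These four measures translate directly into code. In the sketch below, the function name `classification_metrics` is hypothetical, and the counts (TP=24, TN=65, FP=8, FN=3) are the values implied by the totals of the disease example (100 predictions, 89 correct, 32 predicted "Yes", 27 actual "Yes").

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four measures from the cells of a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

metrics = classification_metrics(tp=24, tn=65, fp=8, fn=3)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.89 and the precision is 0.75; the function does not guard against empty denominators, which would need handling in real use.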
ROC Curve
ROC Curve: the ROC curve is a graph displaying a classifier's performance across all possible thresholds. It plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
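The points of a ROC curve can be computed by sweeping the decision threshold over the scores produced by a classifier. The sketch below assumes binary labels (1 = positive) and made-up scores; `roc_points` is a hypothetical helper, not a library function.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the most lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at least the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up labels and classifier scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

Lowering the threshold can only add predicted positives, so both rates grow from (0, 0) to (1, 1) as the curve is traced.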
Evaluation of a regression model
Some ML Algorithms

Regression solutions
Types of Regression Algorithm:

1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. K-Nearest Neighbors Regression
5. Decision Tree Regression
6. Random Forest Regression
7. Artificial Neural Networks (ANN)
8. …..

Classification solutions
Classification Algorithms can be further divided into the following types:

1. K-Nearest Neighbors (KNN)
2. Decision Tree
3. Random Forest
4. Support Vector Machines (SVM)
5. Artificial Neural Networks
6. Logistic Regression (LR)
7. Naïve Bayes
8. ….

K-Nearest Neighbors Algorithm (KNN)

K-NN (K-Nearest Neighbors) is one of the simplest classification algorithms. It uses data points that are already separated into several classes to predict the class of a new data point (sample).
K-NN is a non-parametric, lazy learning algorithm. It classifies new cases based on a similarity measure (e.g. a distance function).


KNN Algorithm - Example
Input data:
• A dataset D.
• A distance function d.
• An integer K.
For a new observation X whose output variable y we want to predict, do:
1. Calculate the distance between X and every observation in the dataset D, using the distance function d.
2. Retain the K observations of D that are closest to X.
3. Take the y values of the K retained observations:
   a. For a regression, calculate the mean (or median) of the retained y values.
   b. For a classification, calculate the mode of the retained y values.
4. Return the value calculated in step 3 as the K-NN prediction for observation X.
End Algorithm
To predict the category label y of a new point x (classification):
• Find the k nearest neighbors (according to some distance metric)
• Assign the majority label to the new point
To predict the numeric value y of a new point x (regression):
• Find the k nearest neighbors
• “Average” the values associated with the neighbors
If we change k, we may get a different prediction!
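The procedure above can be sketched in a few lines of Python. This is an illustrative version, assuming the Euclidean distance as the distance function d and a dataset stored as a list of (features, y) pairs; the function name `knn_predict` and the toy points are made up.

```python
import math
from collections import Counter

def knn_predict(dataset, x_new, k, task="classification"):
    """Predict y for x_new from its k nearest neighbours in dataset."""
    # Steps 1-2: sort by distance to x_new and keep the k closest observations
    neighbors = sorted(dataset, key=lambda pair: math.dist(pair[0], x_new))[:k]
    targets = [y for _, y in neighbors]
    # Step 3: mean for a regression, mode (majority label) for a classification
    if task == "regression":
        return sum(targets) / k
    return Counter(targets).most_common(1)[0][0]

# Made-up 2-D points with class labels
labelled_points = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((3, 3), "B"), ((6, 6), "B"), ((7, 6), "B"),
]
x_new = (2.6, 2.6)
label_k1 = knn_predict(labelled_points, x_new, k=1)
label_k3 = knn_predict(labelled_points, x_new, k=3)

# Made-up 1-D regression data
numeric_points = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
value_k3 = knn_predict(numeric_points, (2,), k=3, task="regression")
```

Here the single nearest neighbour of the new point belongs to class "B", but two of its three nearest neighbours belong to class "A", so changing k from 1 to 3 changes the prediction.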

kNN Prediction: What Label?

KNN Algorithm - Example

Linear and Logistic Regression algorithm

Linear Regression algorithm





The Math Behind LR

Linear Regression & Logistic Regression - Difference








Applications of LR

Use case – Predicting Numbers









Naive Bayes algorithm

• The Naive Bayes classifier is a popular algorithm in Machine Learning. It is a Supervised Learning algorithm used for classification, and it is particularly useful for text classification problems.
• The naive Bayes classifier is based on Bayes' theorem, a classic result of probability theory that is itself based on conditional probabilities.

Conditional probabilities:
• What is the probability that an event occurs,
• knowing that another event has already happened?
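For reference, Bayes' theorem and the way the naive Bayes classifier uses it can be written as follows. The notation is an assumption made here, since the slides do not fix one: A and B are events, y is a class, and x_1, …, x_n are the features, which the "naive" assumption treats as conditionally independent given the class.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad\qquad
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The classifier predicts the class y for which the right-hand product is largest.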
Naive Bayes algorithm - Example



Naive Bayes algorithm - USE CASES
The naive Bayes classifier can be applied in various scenarios. One of the classic use cases for this learning model is document classification, which involves determining whether or not a document belongs to certain categories. It is used for:
• Spam filtering.
• Sentiment analysis.
• Recommendation systems.

PW

Unsupervised Machine Learning
What is Unsupervised Learning?
• As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things.
• Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example - Unsupervised Learning
• Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never given labels for this dataset, which means it has no prior idea of which features distinguish the images. The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.

Why use Unsupervised Learning?
Below are some of the main reasons that describe the importance of Unsupervised Learning:
• Unsupervised learning is helpful for finding useful insights in the data.
• Unsupervised learning is similar to the way a human learns to think from their own experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
• In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning
[Figure: diagram of how unsupervised learning works]

Types of Unsupervised Learning Algorithm
Below is the list of some popular unsupervised learning algorithms:

• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Independent Component Analysis
• Apriori algorithm

Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks than supervised learning because, in unsupervised learning, we do not have labeled input data.
• Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled data.

Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised learning, as it has no corresponding output to learn from.
• The result of an unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance.

K-Means Clustering Algorithm
• K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. It groups an unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any labelled training data.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters.

• The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the clusters no longer change. The value of k must be chosen in advance.
• The k-means clustering algorithm mainly performs two tasks:
1. Determines the best positions for the K center points (centroids) by an iterative process.
2. Assigns each data point to its closest centroid. The data points that are nearest to a particular centroid form a cluster.
• Hence each cluster contains data points with some commonalities, and it is separated from the other clusters.

How does the K-Means Algorithm Work?
• The working of the K-Means algorithm is explained in the steps below:

Step 1: Select the number K to decide the number of clusters.
Step 2: Select K random points as initial centroids. (They need not be points from the input dataset.)
Step 3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step 4: Compute the mean of the points in each cluster and place the cluster's new centroid there.
Step 5: Repeat step 3, that is, reassign each data point to the new closest centroid.
Step 6: If any reassignment occurred, go to step 4; otherwise go to FINISH.
Step 7: The model is ready.
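The steps above can be sketched as follows. This is a minimal illustration, assuming the Euclidean distance; unlike step 2, the initial centroids are passed in explicitly rather than drawn at random, so that the result is reproducible. The function name `k_means` and the sample points are made up.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, assignments) once no point changes cluster."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Steps 3 and 5: assign each point to its closest centroid
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop when no reassignment occurred
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of the points in its cluster
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Made-up 2-D data forming two groups, with K=2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = k_means(points, initial_centroids=[(0, 0), (10, 10)])
```

On this data the algorithm stops after the second pass, because the assignments obtained with the updated centroids are the same as before.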

• Suppose we have two variables M1 and M2. [Figure: x-y scatter plot of these two variables]
• Let's take the number of clusters K=2, which means we will try to group this dataset into two different clusters.

• We need to choose K random points as centroids to form the clusters. These points can be either points from the dataset or any other points. Here we select two points as centroids that are not part of our dataset. [Figure: scatter plot with the two initial centroids]

• Now we assign each data point of the scatter plot to its closest centroid. We compute this with the usual formula for the distance between two points. Geometrically, we draw the perpendicular bisector of the segment joining the two centroids.

• The points on the left side of this line are nearer to the K1 (blue) centroid, and the points on the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.

• As we need to find the closest clusters, we repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the data points in each cluster and place the new centroids there.

• Next, we reassign each data point to the new closest centroid. For this, we repeat the same process of drawing the perpendicular bisector between the two new centroids.

• In this example, one yellow point is now on the left side of the line, and two blue points are on the right of the line. So, these three points are assigned to the other centroid.

• As reassignment has taken place, we go back to step 4, which is finding new centroids.
• We repeat the process by finding the center of gravity of the points in each cluster, which gives the new centroids.

• With the new centroids, we again draw the dividing line and reassign the data points.

• This time, there are no misassigned data points on either side of the line, which means our model is formed.

• As our model is ready, we can now remove the assumed centroids, which leaves the two final clusters.

How to choose the value of "K number of clusters"
• The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but choosing the optimal number of clusters is a difficult task. There are several ways to find the optimal number of clusters; here we discuss one of the most widely used methods for choosing the value of K:

Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. For 3 clusters, the WCSS is the sum, over the 3 clusters, of the squared distances between each data point and the centroid of its cluster.
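Written out for 3 clusters, the WCSS takes the following form. The notation is assumed here: P_i denotes a data point, C_1, C_2, C_3 are the centroids of the three clusters, and d is the distance between a point and a centroid (typically the Euclidean distance).

```latex
\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} d(P_i, C_1)^2
              + \sum_{P_i \in \text{Cluster}_2} d(P_i, C_2)^2
              + \sum_{P_i \in \text{Cluster}_3} d(P_i, C_3)^2
```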

Elbow Method
To find the optimal number of clusters, the elbow method follows the steps below:
• It executes K-means clustering on the given dataset for different values of K (ranging from 1 to 10).
• For each value of K, it calculates the WCSS value.
• It plots a curve of the calculated WCSS values against the number of clusters K.
• The point where the curve bends sharply, so that the plot looks like an arm, is considered the best value of K.
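The first two steps of the elbow method can be sketched as follows; plotting is left out. This is an illustration under stated assumptions: the compact `k_means` helper uses the first K data points as initial centroids (a simplification chosen for reproducibility), the data is made up, and K only ranges from 1 to 6 because the toy dataset has 9 points.

```python
import math

def k_means(points, k, max_iterations=100):
    """Compact K-means; assumption: the first k points serve as initial centroids."""
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iterations):
        new = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

def wcss(points, centroids, assignments):
    """Sum of squared distances between each point and the centroid of its cluster."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

# Made-up data with three visible groups
points = [(1, 1), (5, 5), (9, 1), (1, 2), (5, 6), (9, 2), (2, 1), (6, 5), (10, 1)]

wcss_by_k = {}
for k in range(1, 7):
    centroids, assignments = k_means(points, k)
    wcss_by_k[k] = wcss(points, centroids, assignments)
```

On this data the WCSS falls from about 132 (K=1) to about 52 (K=2) and about 4 (K=3), then barely changes at K=4, which is the kind of bend the elbow method looks for at K=3.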

Elbow Method
Since the graph shows a sharp bend that looks like an elbow, the method is known as the elbow method. [Figure: WCSS plotted against the number of clusters K, showing the elbow]

PW

Any questions?
The End
