PW3 Supervised Learning
1 Summary
In this lesson, we focus on supervised learning problems: the aim is to develop algorithms able to learn predictive
models. Based on labeled examples, these models will be able to predict the label of new objects. The aim of this
PW is to develop, with Scikit-learn, the general concepts that enable us to formalize this type of problem. Here is
a definition of the supervised learning concept:
Definition: Supervised learning is a machine learning technique where an algorithm learns from labeled data to make
predictions or decisions. It is widely applied in various domains, such as classifying emails as spam or non-
spam, predicting housing prices based on different factors, or detecting anomalies in datasets. Scikit-learn, a
popular Python library, provides a comprehensive set of tools and algorithms for supervised learning tasks.
Problem formalization
A supervised learning problem can be formalized as follows: given $n$ observations $\{x_1, x_2, \dots, x_n\}$, where each
observation $x_i$ is an element of the observation space $\mathcal{X}$, and their labels $\{y_1, y_2, \dots, y_n\}$, where each label $y_i$
belongs to the label space $\mathcal{Y}$, the aim of supervised learning is to find a function $f : \mathcal{X} \to \mathcal{Y}$ such that $f(x) \approx y$
for all pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ having the same relationship as the observed pairs. The set $D = \{(x_i, y_i)\}_{i=1,\dots,n}$
forms the training set. In this lesson, we will consider three special cases for $\mathcal{Y}$:
• $\mathcal{Y} = \{0, 1\}$: this is called a binary classification problem, and observations whose label is 0 are called negative,
while those with label 1 are called positive. In some cases, it is mathematically convenient to use $\mathcal{Y} = \{-1, 1\}$;
• $\mathcal{Y} = \{1, 2, \dots, C\}$: this is called a multi-class classification problem;
• $\mathcal{Y} = \mathbb{R}$: this is called a regression problem.
In many situations, we assume $\mathcal{X} = \mathbb{R}^p$, i.e. the observations are represented by $p$ variables. In this
case, the matrix $X \in \mathbb{R}^{n \times p}$ such that $X_{ij} = x_{ij}$ is the $j$-th variable of the $i$-th observation is called the data matrix.
Decision
In the case of a classification problem, the predictive model can take the form of a function $f$ with values in $\{0, 1\}$,
or use a real-valued intermediate function $g$, which associates a score to each observation. This score is, for
example, the probability that the observation belongs to the positive class. We then obtain $f$ by thresholding $g$;
$g$ is called the decision function.
(Decision function) In a binary classification problem, a decision function, or discriminant function, is a function $g : \mathcal{X} \to \mathbb{R}$ such that $f(x) = 0$ if and only if $g(x) \le 0$, and $f(x) = 1$ if and only if $g(x) > 0$.
(Decision space) In the case of binary classification, the function $g$ splits the observation space $\mathcal{X}$ into two decision
areas $A_0$ and $A_1$ such that:
$A_0 = \{x \in \mathcal{X} \mid g(x) \le 0\}$ and $A_1 = \{x \in \mathcal{X} \mid g(x) > 0\}$.
Cost Function
Solving a supervised learning problem means finding a function $f \in \mathcal{F}$ whose predictions are as close as possible
to the true labels over the whole space $\mathcal{X}$; this is known as minimizing the empirical risk. To formalize this, we use the notion
of cost function.
(Cost function) A cost function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, also called loss function or error function, is a function used to
quantify the quality of a prediction: $L(y, f(x))$ is greater the further the prediction $f(x)$ is from the true label $y$.
2 Exercises
Exercise 1: Scale the data
You will need the Wine Quality dataset, whose quality column has two classes, good and bad. We have to classify
the wine into these two classes based on the following features:
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   object
1. Read the dataset, store the data in a variable wine_data, and print the basic information and statistics.
The data can be found at this location: Red-Wine
2. Transform the wine data into a DataFrame using the pandas.DataFrame function, passing wine_data.values
and wine_data.columns as parameters.
3. Drop the quality column to put the rest of the data into a variable X, using the drop(labels, axis) function; store the quality column in a variable y.
4. As you can see, the numeric features have different scales; to avoid misinterpretation of the ML results,
we want to bring all features to a common scale. Many techniques exist, but two are most commonly
used: standardization and min-max scaling, the latter often simply called normalization. A minimal
sketch of both is given below.
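For illustration, a minimal sketch of the two techniques, assuming X is the feature matrix built above (the scaler classes are scikit-learn's):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: center each feature to mean 0 and unit variance
X_std = StandardScaler().fit_transform(X)

# Min-max scaling ("normalization"): map each feature into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)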
5. Create the training and test datasets x_train, x_test, y_train, y_test, with 0.3 for the test proportion and random_state=20.
6. Use the correct scaler and apply it to the training and test datasets, as in the sketch below.
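A possible sketch of steps 5 and 6, assuming y holds the quality column; note that the scaler is fitted on the training data only, then reused on the test data:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split before scaling, so no information from the test set leaks into the scaler
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)  # fit on the training data only
x_test_scaled = scaler.transform(x_test)        # apply the same transform to the test data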
7. You have to apply a linear regression model to predict the quality. Try to formalize the problem.
8. With scikit-learn, use the LinearRegression() class after importing it as follows: from sklearn.linear_model
import LinearRegression
9. Calculate the mean squared error on the test dataset using: from sklearn.metrics import mean_squared_error,
r2_score
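A minimal sketch of steps 8 and 9, assuming the quality labels have been encoded numerically and reusing x_train_scaled and x_test_scaled from above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit a linear model on the scaled training data
predictor = LinearRegression()
predictor.fit(x_train_scaled, y_train)

# Evaluate on the held-out test data
y_pred = predictor.predict(x_test_scaled)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))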
However, to evaluate the goodness of fit of a regression model, we also need the coefficient of determination ($R^2$).
R-squared ($R^2$)
$R^2$ is a statistical measure. It quantifies the proportion of the variance in the dependent variable (the
predicted variable) that is explained by the independent variables (the features or input predictors) in
the regression model. In short, R-squared tells you how well the model fits the data. $R^2$ is calculated as
$$R^2 = 1 - \frac{\sum_i \left(y_{\mathrm{test},i} - y_{\mathrm{predict},i}\right)^2}{\sum_i \left(y_{\mathrm{test},i} - \overline{y}_{\mathrm{test}}\right)^2}.$$
$R^2$ ranges between 0 and 1. It measures the proportion of the variance in the dependent variable that
is predictable from the independent variables. For example, an $R^2$ value of 0.7 means that 70% of the
variance in the dependent variable is explained by the independent variables in the model. But in our case,
we only obtained 0.2 (20%), which means that our model is unable to explain wine quality on the basis of
the input features.
We now want to compare the same prediction while changing different parameters: the random state, the transformer,
etc. To do this in the right way, we have to implement pipelines.
1. Define a function allowing you to create your pipeline with the following steps: 1) scale the data, 2) predict. Your function
needs the scaler as a parameter, so that it can be called with MinMaxScaler() as well as StandardScaler(). A minimal sketch follows.
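One possible sketch of such a function; the function name make_reg_pipeline and the step names are our own choices, and the variables come from Exercise 1:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression

def make_reg_pipeline(scaler):
    # Step 1: scale the data; step 2: predict with a linear model
    return Pipeline([("scaler", scaler), ("model", LinearRegression())])

pipe = make_reg_pipeline(MinMaxScaler())  # or make_reg_pipeline(StandardScaler())
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)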
2. Call your function with the StandardScaler and compare the results (with a random state equal to 20) by filling in the following table:
evaluation                        R2   mse
scaler=min-max, test size=0.1     ?    ?
scaler=standard, test size=0.1    ?    ?
scaler=min-max, test size=0.2     ?    ?
scaler=standard, test size=0.2    ?    ?

Comment your results!
3. Plot the quality prediction vs. the real quality for the best configuration of your predictor model.
# Visualize the predicted vs. actual values
import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred)
plt.xlabel("Real Wine Quality")
plt.ylabel("Predicted Wine Quality")
plt.title("Real vs. Predicted Wine Quality")
plt.show()
4. If we denote by $\Theta = (\theta_1, \dots, \theta_{11})$ and $B$ the estimated parameters of our predictor model, try to display them
from your pipeline and write the exact formula of your predictor.
# Access the coefficients (weights)
coefficients = predictor.coef_
# Access the intercept (bias) term
intercept = predictor.intercept_
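If the predictor is wrapped in a pipeline, the coefficients live on the fitted final step. A sketch, assuming the step is named "model" as in the pipeline sketch above:

# Retrieve the fitted regression step from the pipeline
reg = pipe.named_steps["model"]
theta = reg.coef_    # (theta_1, ..., theta_11)
B = reg.intercept_
# Display the predictor formula f(x) = theta_1*x1 + ... + theta_11*x11 + B
print("f(x) =", " + ".join(f"{t:.3f}*x{j+1}" for j, t in enumerate(theta)), f"+ {B:.3f}")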
2. Define another function for your pipeline, taking the scaler and the predictor model as parameters.
def Reg_Pipeline(scaler, model):
    # put your code here
    return my_pipeline
3. Fit and predict with your pipeline, using a StandardScaler and a LogisticRegression model; a possible sketch is shown below.
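A possible sketch for steps 2 and 3, assuming x_train, x_test, y_train, y_test from Exercise 1 with the binary quality labels:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def Reg_Pipeline(scaler, model):
    # Chain the scaler and the predictor into a single estimator
    my_pipeline = Pipeline([("scaler", scaler), ("model", model)])
    return my_pipeline

log_pipe = Reg_Pipeline(StandardScaler(), LogisticRegression())
log_pipe.fit(x_train, y_train)      # scales the data, then fits the model
y_pred = log_pipe.predict(x_test)   # applies the same scaling before predicting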
Logistic Regression
Logistic regression is a statistical method for predictive analysis:
• It is used to predict a binary outcome (1 or 0, Yes or No, True or False) given a set of independent
variables. Logistic regression is a classification algorithm used to assign observations to
a discrete set of classes. For this dataset, it is better to use binary logistic regression with
the modified "quality" column.
• Logistic regression is a statistical model used to study the relationships between a set of explanatory
variables $X_i$ and a qualitative variable $Y$. It is a generalized linear model using a logistic
function as the link function.
• A logistic regression model can also predict the probability of an event occurring (value of 1)
or not (value of 0), based on the optimization of regression coefficients. This result always
varies between 0 and 1. When the predicted value is above a threshold, the event is likely to
occur, whereas when this value is below the same threshold, it is not.
• The aim of logistic regression is to find a probability function $P$ such that we can compute:
$$y = \begin{cases} 1 & \text{if } P(X) \ge \text{threshold} \\ 0 & \text{if } P(X) < \text{threshold} \end{cases}$$
The function $P$ that fulfils these conditions is the sigmoid function, defined on $\mathbb{R}$ with values in $[0, 1]$.
It is written as follows: $\sigma(x) = \frac{1}{1 + e^{-x}}$. The sigmoid transforms any real value $x$ into a value between 0 and 1:
when $x$ is large and positive, the sigmoid approaches 1; when $x$ is large and negative, it approaches 0; and when $x = 0$, it equals 0.5.
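A tiny sketch of the sigmoid to check these properties numerically:

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.0, 0.5, ~1.0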
Evaluation metrics
Let us note 0 as a negative prediction and 1 as a positive one. We can obtain four measures: true positives
(TP: true 1s), true negatives (TN: true 0s), false positives (FP: false 1s) and false negatives (FN: false 0s).
Let us now detail each of them:
TP = number of samples with $y_{\mathrm{predict}} = 1$ when $y_{\mathrm{true}} = 1$
TN = number of samples with $y_{\mathrm{predict}} = 0$ when $y_{\mathrm{true}} = 0$
$$\mathrm{Precision}\ (P) = \frac{TP}{TP + FP}, \qquad \mathrm{Recall}\ (R) = \frac{TP}{TP + FN}, \qquad \text{F-score} = 2 \cdot \frac{P \cdot R}{P + R}$$
The global accuracy is calculated as: $\mathrm{Accuracy} = \frac{TP + TN}{\mathrm{len}(y\_test)}$
Accuracy values range from 0 to 1, where:
• an accuracy of 1 (100%) indicates that the model's predictions are entirely correct, with no prediction errors;
• an accuracy of 0 (0%) indicates that the model's predictions are entirely incorrect, with none of the predictions correct.
While accuracy is a widely used metric, it has some limitations, especially in situations with imbal-
anced datasets. In imbalanced datasets, where one class significantly outnumbers the other, a high
accuracy score can be misleading. For example, if 95% of the instances belong to class A and only
5% belong to class B, a model that predicts all instances as class A will have a high accuracy of 95%.
However, it fails to correctly predict any instances of class B, which may be the more critical class.
In such cases, other metrics like precision, recall, F1-score, or area under the ROC curve (AUC-ROC)
may provide a more comprehensive assessment of a classification model’s performance, as they take
into account true positives, false positives, true negatives, and false negatives.
In summary, accuracy measures the overall correctness of a classification model’s predictions, and
it is an essential metric to consider. However, it should be used in conjunction with other metrics,
especially when dealing with imbalanced datasets or when different costs are associated with false
positives and false negatives.
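As an illustration, the formulas above can be computed from scikit-learn's confusion matrix; a sketch assuming binary labels whose negative class sorts first (e.g. 0/1):

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_test)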
5. Check the imbalanced-dataset problem by comparing the rate of 0 samples vs. 1 samples in the training set.
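One possible check, assuming y_train is a pandas Series:

# Proportion of each class among the training labels
print(y_train.value_counts(normalize=True))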
Naive Bayes
• Bayes’ Theorem: Naive Bayes classifiers are based on Bayes’ theorem, which calculates the proba-
bility of a particular event based on prior knowledge of conditions that might be related to the event.
In the context of classification, it calculates the probability of a particular class label given a set of
features.
• Conditional Independence: Naive Bayes assumes that all features are conditionally independent of each other given
the class label. While this assumption is often unrealistic in practice, Naive Bayes can still perform
well, especially in cases with limited data and a small number of features.
• Probability Estimation: Naive Bayes calculates the conditional probabilities of each feature given
each class label and the prior probability of each class label. It uses these probabilities to estimate
the probability of a specific class label given the observed features.
• Classification Rule: The classification rule in Naive Bayes selects the class label with the highest
probability given the observed features. This is known as the Maximum A Posteriori (MAP) classification rule.
During the training step, Naive Bayes estimates the probabilities and parameters needed for classification.
It calculates the prior probabilities of class labels and conditional probabilities of features given each class.
During the classification step, Naive Bayes uses the Bayes’ theorem to calculate the posterior probabilities
of class labels given the observed features. The class with the highest posterior probability is selected as
the predicted class label.
1. Import the adequate class from Scikit-learn to use Naive Bayes with a Gaussian distribution: GaussianNB.
2. Use your pipeline with this new model.
3. Fit the model to the training set.
4. Predict the classes of your test dataset.
5. Calculate the confusion matrix, the precision, recall, F-score and accuracy (a sketch combining these steps follows the list).
6. What is your conclusion?
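A sketch combining these steps, reusing the Reg_Pipeline function defined earlier (GaussianNB does not require scaling, but the pipeline keeps the comparison consistent):

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

nb_pipe = Reg_Pipeline(StandardScaler(), GaussianNB())
nb_pipe.fit(x_train, y_train)
y_pred = nb_pipe.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F-score, accuracy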
1. ID number
2. Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
1. radius
2. texture
3. perimeter
4. area
5. smoothness
6. compactness
7. concavity
8. concave points
9. symmetry
10. fractal dimension
The mean, standard error and ”worst” or largest (mean of the three largest values) of these features were
computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE,
field 23 is Worst Radius.
All feature values are recorded with four significant digits.
Missing attribute values: none, but you have to check. Class distribution: 357 benign, 212 malignant.
To do: make a comparative study between logistic regression and Naive Bayes, starting from the sketch below.
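A minimal starting sketch, assuming the dataset is the one shipped with scikit-learn (load_breast_cancer); the split parameters are our own choices:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=20)

# Evaluate both classifiers under the same preprocessing
for model in (LogisticRegression(max_iter=1000), GaussianNB()):
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(x_train, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, pipe.predict(x_test)))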