MSBA 315
Machine Learning
&
Predictive Analytics
Wael Khreich
wk47@aub.edu.lb
ML Model Evaluation
Content
• Review of ML Approaches/Types
• Evaluation Metrics for Regression
• Evaluation Metrics for Classification (Lab)
  • Binary (Two-Class) Classification
  • Multi-Class Classification

Learning Outcomes
• Understand the importance of model evaluation
• Identify which metrics to use
• Implement these techniques
Machine Learning (ML) Approaches
• Supervised Learning: Regression, Classification
• Unsupervised Learning: Clustering, Dimensionality Reduction
• Reinforcement Learning: Model-based, Model-free
Supervised Learning Tasks
Regression
• Predict a numeric value: y ∈ ℝ

Classification
• Predict a class (category): y = {×, o}

(Figures: a regression fit of y (Target) against x₁ (Feature 1), and a classification scatter plot in the x₁–x₂ feature space)
Evaluation Metrics for Regression Tasks
• An evaluation metric quantifies the difference (or agreement) between the actual values and the values predicted by an ML model
• Sklearn: metric(y_true, y_pred)
  • y = y_true and ŷ = y_pred
• Commonly used metrics:
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Root Mean Squared Log Error (RMSLE)
• Mean Absolute Error (MAE)
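A minimal sklearn sketch computing these four metrics on small, made-up arrays of actual and predicted values (the numbers are illustrative, not from the slides):

    import numpy as np
    from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                                 mean_squared_log_error)

    y_true = np.array([9, 20, 35, 12])   # illustrative actual values
    y_pred = np.array([13, 15, 30, 14])  # illustrative predictions

    mse = mean_squared_error(y_true, y_pred)                  # MSE
    rmse = np.sqrt(mse)                                       # RMSE (same scale as y)
    mae = mean_absolute_error(y_true, y_pred)                 # MAE
    rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))   # RMSLE (requires non-negative values)
    print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  RMSLE={rmsle:.3f}")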
Regression Evaluation Metrics – MSE
• Mean Squared Error (MSE) finds the average squared error between the predicted and actual values:

  MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

(Figure: actual values (yᵢ) and predicted values (ŷᵢ) plotted against x)

  Actual (yᵢ) | Predicted (ŷᵢ) | Error (yᵢ − ŷᵢ) | Squared Error (yᵢ − ŷᵢ)²
  9           | 13             | −4              | 16
  20          | 15             | 5               | 25
  …           | …              | …               | …
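A quick numpy check of the two rows shown in the table (the remaining rows are elided on the slide):

    import numpy as np

    y_true = np.array([9, 20])   # the two example rows
    y_pred = np.array([13, 15])

    errors = y_true - y_pred             # [-4, 5]
    squared_errors = errors ** 2         # [16, 25]
    print(errors, squared_errors, squared_errors.mean())  # MSE over these two rows: 20.5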
Regression Evaluation Metrics
• Root Mean Squared Error (RMSE) is on the same scale as the target values:

  RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}

• Root Mean Squared Log Error (RMSLE):

  RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \log \frac{y_i + 1}{\hat{y}_i + 1} \right)^2}

  • The 1 is added to keep the logarithm defined when a value is 0 (log 0 → −∞)

• Mean Absolute Error (MAE) finds the average absolute distance between the predicted and target values:

  MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
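A sketch implementing these formulas directly in numpy (natural log, matching sklearn's mean_squared_log_error; RMSLE assumes non-negative values):

    import numpy as np

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    def rmsle(y_true, y_pred):
        # log1p(x) = log(1 + x); requires y_true, y_pred >= 0
        return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))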
Regression Evaluation Metrics – Outliers
• MSE/RMSE: sensitive to outliers (squaring the errors)
• MAE: less sensitive to outliers
• RMSLE: robust to outliers

  Without an outlier:
  Actual (yᵢ) | Predicted (ŷᵢ)
  60          | 67
  80          | 78
  90          | 91
  RMSE = 4.242, RMSLE ≈ 0.065, MAE = 3.334

  With an outlier (actual value 705 predicted as 102):
  Actual (yᵢ) | Predicted (ŷᵢ)
  60          | 67
  80          | 78
  90          | 91
  705         | 102
  RMSE = 301.52, RMSLE = 0.964, MAE = 153.25

(Image: Wikipedia)
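A short sketch reproducing this comparison with sklearn (values match the tables up to rounding):

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error

    def report(y_true, y_pred):
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
        mae = mean_absolute_error(y_true, y_pred)
        print(f"RMSE={rmse:.3f}  RMSLE={rmsle:.3f}  MAE={mae:.3f}")

    report([60, 80, 90], [67, 78, 91])            # without the outlier
    report([60, 80, 90, 705], [67, 78, 91, 102])  # the outlier inflates RMSE and MAE far more than RMSLE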
Regression Evaluation Metrics – Scale of Errors
• RMSE and MAE increase in magnitude with the scale of the errors
• RMSLE is robust to the scale of the errors; it considers relative errors
• RMSLE penalizes underestimation of the actual values more than overestimation

  Actual (yᵢ) | Predicted (ŷᵢ) | RMSE  | RMSLE  | MAE
  100         | 90             | 10    | 0.1043 | 10
  10,000      | 9,000          | 1,000 | 0.1053 | 1,000
  1,000       | 600            | 400   | 0.51   | 400
  1,000       | 1,400          | 400   | 0.33   | 400
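A sketch reproducing the four single-observation cases above (for a single observation, RMSE = MAE = |error|, so only RMSLE reflects the relative error and its direction):

    import numpy as np

    def rmsle(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

    for y, yhat in [(100, 90), (10_000, 9_000), (1_000, 600), (1_000, 1_400)]:
        print(y, yhat, abs(y - yhat), round(rmsle([y], [yhat]), 4))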
Regression Evaluation Metrics
• These error metrics range from 0 to infinity (the lower, the better)
• RMSE is always higher than or equal to MAE
• Exception: MAE = RMSE when all the differences are equal (or zero)
• E.g., Actual Values = [2,4,6,8] , Predicted Values = [4,6,8,10] -> RMSE=MAE=2
• RMSE is differentiable (typically used as a loss function)
• RMSLE is useful when
• The distribution of the actual values 𝑦𝑖 is long-tailed
• We are interested in the ratio of true value and predicted value
• We want to penalize underestimation of the actual values more than
overestimation
Regression Evaluation Metrics – R² (Coefficient of Determination)
• SST = total sum of squares
  • Measures the variation of the actual yᵢ values around their mean ȳ
• SSE = sum of squares error (residual)
  • Variation attributable to factors other than the relationship between x and y
• SSR = sum of squares regression
  • Explained variation attributable to the relationship between x and y
• SST = SSR + SSE
Regression Evaluation Metrics – R² vs. Adjusted R²

  R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

• Sum of squares explained by the regression / total sum of squares

Adjusted R² (R²_A)
• Adding features or independent variables (k) to a regression model (multiple regression) never decreases R², so adjusted R² penalizes model size:

  R^2_A = 1 - (1 - R^2) \, \frac{n - 1}{n - k - 1}        (k: number of features, n: sample size)
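A sketch computing R² with sklearn and adjusted R² from the formula above (the arrays and the assumed k are illustrative):

    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])   # illustrative actual values
    y_pred = np.array([2.8, 5.4, 7.0, 10.5, 11.8])   # illustrative predictions

    r2 = r2_score(y_true, y_pred)

    n, k = len(y_true), 2    # assume the model used k = 2 features
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(round(r2, 3), round(adj_r2, 3))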
Supervised Learning – Classification Tasks
Binary (two-class) classification
• Predict a class out of 2 classes
• y = {×, o}

Multi-class classification
• Predict a class out of n classes
• y = {a, b, c}

(Figures: scatter plots of the classes in the x₁–x₂ feature space)
Classification Metrics -- Confusion Matrix
• True Positive (TP): a sick person correctly predicted as Sick
• False Negative (FN): a sick person incorrectly predicted as Healthy
• False Positive (FP): a healthy person incorrectly predicted as Sick
• True Negative (TN): a healthy person correctly predicted as Healthy
Classification Metrics -- Confusion Matrix
• True Positive Rate (tpr) or Sensitivity: the proportion of actual positives that are correctly identified, tpr = TP / (TP + FN)
• True Negative Rate (tnr) or Specificity: the proportion of actual negatives that are correctly identified, tnr = TN / (TN + FP)
• Totals from the confusion matrix:
  • PP = TP + FP (total # of positive predictions)
  • NP = FN + TN (total # of negative predictions)
  • Pos = TP + FN (total # of actual positives)
  • Neg = FP + TN (total # of actual negatives)
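A minimal sketch with sklearn's confusion_matrix on made-up Sick/Healthy labels (1 = Sick, 0 = Healthy; the label vectors are illustrative):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 1 = Sick (positive), 0 = Healthy (negative)
    y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

    # For binary labels [0, 1], sklearn orders the matrix as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    tpr = tp / (tp + fn)   # sensitivity
    tnr = tn / (tn + fp)   # specificity
    print(tp, fn, fp, tn, round(tpr, 2), round(tnr, 2))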
Classification Metrics - Confusion Matrix
(Worked example on the slide: the confusion matrix is filled in from a table of actual (yᵢ) and predicted (ŷᵢ) labels)
Precision and Recall
• Precision (or Positive Predictive Value) is the fraction of relevant instances among the retrieved instances
  • Out of all the positive predictions, how many are actually positive?

  Precision = \frac{TP}{PP} = \frac{TP}{TP + FP}

• Recall (same as tpr) is the fraction of relevant instances that were retrieved
  • Out of all the actual positives, how many are predicted as positive?

  Recall = \frac{TP}{Pos} = \frac{TP}{TP + FN}

• Maximizing precision minimizes the false positive errors
• Maximizing recall minimizes the false negative errors
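A sketch computing precision and recall with sklearn on the same illustrative labels used above (assumed, not from the slides):

    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # illustrative labels (1 = positive class)
    y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

    precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
    recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 3 / 4
    print(precision, recall)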
F-Score
• Depending on the application, give higher priority to recall or precision
  • Prioritize precision when avoiding false positives is more important than avoiding false negatives
  • Prioritize recall when avoiding false negatives is more important than avoiding false positives
• F1-score is the harmonic mean of precision and recall:

  F_1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = 2 \times \frac{Precision \times Recall}{Precision + Recall}

• Why not the arithmetic mean? The harmonic mean punishes extreme values
• The F1-score is maximized when Precision = Recall
• The F_β-score attaches β times as much importance to recall as to precision:

  F_\beta = (1 + \beta^2) \times \frac{Precision \times Recall}{\beta^2 \times Precision + Recall}

  • Precision-focused: β = 0.5 (false positives hurt performance more than false negatives)
  • Recall-focused: β = 2 (false negatives hurt performance more than false positives)
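A sketch with sklearn's f1_score and fbeta_score on the same illustrative labels (assumed):

    from sklearn.metrics import f1_score, fbeta_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # illustrative labels
    y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

    print(f1_score(y_true, y_pred))                # harmonic mean of precision and recall
    print(fbeta_score(y_true, y_pred, beta=0.5))   # precision-focused
    print(fbeta_score(y_true, y_pred, beta=2))     # recall-focused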
Log-Loss
• Also known as cross-entropy
• It is the negative average of the log of the probability assigned to the true class of each instance:

  LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

  where:
  • N is the number of observations
  • yᵢ is the true label for observation i
  • pᵢ = p(xᵢ) is the predicted probability that observation i belongs to the positive class

• Takes into account the uncertainty of the model's predictions
• Log-loss heavily penalizes the model if it confidently predicts the wrong class
• Lower log-loss means better model performance (0 is the best score)
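A sketch with sklearn's log_loss on illustrative labels and predicted probabilities, checked against the formula above:

    import numpy as np
    from sklearn.metrics import log_loss

    y_true = np.array([1, 0, 1, 1, 0])               # illustrative labels
    p_pred = np.array([0.9, 0.2, 0.6, 0.95, 0.4])    # predicted P(class = 1)

    print(log_loss(y_true, p_pred))

    # Same value computed directly from the formula
    manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    print(manual)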
Receiver Operating Characteristics (ROC) Curve
• Perfect classifier: AUC = 1.0
• Random classifier: AUC = 0.5
• AUC = Area Under the (ROC) Curve
(Figure: ROC curve, adapted from Wikipedia)
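A sketch computing the ROC curve and its AUC with sklearn from illustrative labels and scores (the ROC curve plots tpr against fpr as the decision threshold varies):

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # illustrative labels
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45]    # predicted scores / probabilities

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))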
ROC Curve Construction
(Figures: the ROC curve is built point by point by sweeping the decision threshold over the classifier's scores)
Interactive ROC demo: http://arogozhnikov.github.io/2015/10/05/roc-curve.html
AUC – Global Measure
• AUC(M1) = AUC(M2): which model should you select?
(Figure: ROC curves of two models, M1 and M2, with equal AUC)
Quiz
1. Which of the below represents the total number of actual positives?
   a. TP + FP
   b. TP + FN
   c. FP + FN
   d. FN + TN
   e. All of the above
   f. None of the above

2. The ROC AUC cannot be used in the case of a skewed distribution of the target classes.
   • True
   • False

3. Which classification metric does not vary with the decision threshold?
   a. TPR
   b. Recall
   c. Log-loss
   d. Error rate
Evaluation of Multi-class Classification
• An extension of two-class classification
• For each class we have true vs. predicted labels
  • Classification report per class (see the sketch below)
  • The confusion matrix generalizes to multi-class
• Average classification metrics across classes
  • Different ways of averaging: macro, micro, weighted
  • For imbalanced data, the support (number of instances) of each class becomes important to consider
(Figures: multi-class data in the x₁–x₂ feature space)
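A sketch of a per-class report and multi-class confusion matrix with sklearn, on made-up labels for three classes {a, b, c}:

    from sklearn.metrics import classification_report, confusion_matrix

    y_true = ["a", "a", "a", "b", "b", "c", "c", "c"]   # illustrative labels
    y_pred = ["a", "b", "a", "b", "b", "c", "a", "c"]

    print(confusion_matrix(y_true, y_pred, labels=["a", "b", "c"]))
    print(classification_report(y_true, y_pred))   # precision, recall, F1, and support per class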
Evaluation of Multi-class Classification - Example
• MNIST Handwritten Digit Classification Dataset
• 60,000 28×28-pixel grayscale images of handwritten single digits between 0 and 9 (10 classes)
(Figure: sample digit images passed to a trained ML model, which predicts a digit for each)
Evaluation of Multi-class Classification – Example
Multi-Class Confusion Matrix (MNIST Handwritten Digit Classification Dataset)
• Rows: actual labels; columns: predicted labels
• Diagonal values: correct classifications
• Off-diagonal values: errors
Micro vs. Macro Average

  Actual Class | Predicted Class | Correct?
  Orange       | Lemon           | 0
  Orange       | Lemon           | 0
  Orange       | Apple           | 0
  Orange       | Orange          | 1
  Orange       | Apple           | 0
  Lemon        | Lemon           | 1
  Lemon        | Apple           | 0
  Apple        | Apple           | 1
  Apple        | Apple           | 1

Macro-average
• Each class has equal weight:
  1. Compute the metric within each class
  2. Average the results across classes
• Example: recall. For each class, out of all the actual positives, how many are predicted positive? recall = TP / (TP + FN)
  • Orange: 1/5 = 0.2  (TP = 1, FN = 4)
  • Lemon: 1/2 = 0.5
  • Apple: 2/2 = 1.0
• Macro-average recall: (0.2 + 0.5 + 1.0) / 3 = 0.57
• Each class contributes equally, regardless of its size!
Micro vs. Macro Average
Micro-average (using the same fruit table as above)
• Each instance has equal weight
• Aggregate the contributions of all classes
• Example: recall. Out of all the actual positives, how many are predicted positive?
• Micro-average recall: 4/9 ≈ 0.44
• The largest classes have the most influence on the result
Micro vs Macro Average
• For balanced datasets (nearly equal # of instances per class):
  • Macro- and micro-averages will be about the same
• For imbalanced datasets (some classes have many more instances):
  • Macro-averaging emphasizes the smallest classes
  • Micro-averaging emphasizes the largest classes
  • If the micro-average << the macro-average, check the larger classes for poor performance; otherwise check the smaller classes
• Weighted average: average the per-class metric (as in macro) but weighted by support (the number of instances in each class); see the sketch below
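A sketch reproducing the fruit example with sklearn's averaging options:

    from sklearn.metrics import recall_score

    y_true = ["Orange"] * 5 + ["Lemon"] * 2 + ["Apple"] * 2
    y_pred = ["Lemon", "Lemon", "Apple", "Orange", "Apple",   # predictions for the Orange instances
              "Lemon", "Apple",                               # predictions for the Lemon instances
              "Apple", "Apple"]                               # predictions for the Apple instances

    print(recall_score(y_true, y_pred, average="macro"))      # (0.2 + 0.5 + 1.0) / 3 ≈ 0.57
    print(recall_score(y_true, y_pred, average="micro"))      # 4 / 9 ≈ 0.44
    print(recall_score(y_true, y_pred, average="weighted"))   # per-class recall weighted by support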
Activities
• Read and watch additional materials (posted on Moodle)
• Revise Lab 02
• Expect an assignment