Lect 02 Evaluation Part 1

The document covers machine learning model evaluation, focusing on various approaches and evaluation metrics for regression and classification tasks. Key topics include understanding the importance of model evaluation, identifying appropriate metrics, and implementing techniques in a lab setting. Specific metrics discussed include Mean Squared Error, Root Mean Squared Error, and classification metrics like precision, recall, and F1-score.


MSBA 315

Machine Learning & Predictive Analytics
Wael Khreich
wk47@aub.edu.lb
ML Model Evaluation

Content
• Review of ML Approaches/Types
• Evaluation Metrics for Regression
• Evaluation Metrics for Classification
  • Binary (Two-Class) Classification
  • Multi-Class Classification

Learning Outcomes
• Understand the importance of model evaluation
• Identify which metrics to use
• Implement these techniques (Lab)
Machine Learning (ML) Approaches

Machine Learning
• Supervised Learning: Regression, Classification
• Unsupervised Learning: Clustering, Dimensionality Reduction
• Reinforcement Learning: Model-based, Model-free
Supervised Learning Tasks

Regression
• Predict a numeric value
• y ∈ ℝ
(Figure: target y plotted against feature x₁)

Classification
• Predict a class (category)
• y = {×, o}
(Figure: classes plotted in the feature space x₁ vs. x₂)
Evaluation Metrics for Regression Tasks
• An evaluation metric quantifies the differences or similarities
  between the actual and predicted values (produced by an ML model)
• Sklearn convention: metric(y_true, y_pred)
  • y = y_true and ŷ = y_pred

• Commonly used metrics:


• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Root Mean Squared Log Error (RMSLE)
• Mean Absolute Error (MAE)

Regression Evaluation Metrics – MSE
• Mean Squared Error (MSE) finds the average squared error between the
  predicted values (ŷᵢ) and the actual values (yᵢ):

  $MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

(Figure: actual values yᵢ and predicted values ŷᵢ plotted against x)

  Actual (yᵢ)   Predicted (ŷᵢ)   Error (yᵢ − ŷᵢ)   Squared Error (yᵢ − ŷᵢ)²
  9             13               -4                 16
  20            15               5                  25
  …             …                …                  …
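A minimal sketch (mine, not from the slides) that reproduces the two table rows above with NumPy and the sklearn metric(y_true, y_pred) convention:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# The two rows shown in the table above
y_true = np.array([9, 20])
y_pred = np.array([13, 15])

# Manual computation: mean of the squared errors = (16 + 25) / 2 = 20.5
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same result via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # 20.5 20.5
```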
Regression Evaluation Metrics
• Root Mean Squared Error (RMSE) has the same scale as the target values:

  $RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$

• Root Mean Squared Log Error (RMSLE):

  $RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i + 1) - \log(\hat{y}_i + 1)\right)^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log\frac{y_i + 1}{\hat{y}_i + 1}\right)^2}$

  • The 1 is added to avoid divergence, since log(0) → −∞

• Mean Absolute Error (MAE) finds the average absolute distance
  between the predicted and target values:

  $MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$
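A small NumPy sketch of the three formulas above (the toy arrays are illustrative, not from the slides):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))                       # same scale as y
rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))  # log1p(y) = log(y + 1)
mae = np.mean(np.abs(y_true - y_pred))                                # average absolute error
print(rmse, rmsle, mae)
```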
Regression Evaluation Metrics – Outliers
• MSE/RMSE: sensitive to outliers (squaring the errors)
• MAE: less sensitive to outliers
• RMSLE: robust to outliers

  Without outlier                      With outlier
  Actual (yᵢ)   Predicted (ŷᵢ)         Actual (yᵢ)   Predicted (ŷᵢ)
  60            67                     60            67
  80            78                     80            78
  90            91                     90            91
                                       705           102

  RMSE  = 4.242                        RMSE  = 301.52
  RMSLE = 0.0646                       RMSLE = 0.964
  MAE   = 3.334                        MAE   = 153.25

(image: Wikipedia)
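A quick sketch (mine) that reproduces the two columns above with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error

def report(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))  # uses log(1 + y) internally
    mae = mean_absolute_error(y_true, y_pred)
    return rmse, rmsle, mae

print(report([60, 80, 90], [67, 78, 91]))            # ≈ (4.24, 0.065, 3.33)
print(report([60, 80, 90, 705], [67, 78, 91, 102]))  # ≈ (301.5, 0.964, 153.25)
```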
Regression Evaluation Metrics – Scale of Errors
• RMSE and MAE increase in magnitude with the scale of the errors
• RMSLE is robust to the scale of the errors; it considers relative errors
• RMSLE penalizes underestimation of the actual values more than overestimation

  Actual (yᵢ)   Predicted (ŷᵢ)   RMSE     RMSLE    MAE
  100           90               10       0.1043   10
  10,000        9,000            1,000    0.1053   1,000
  1,000         600              400      0.51     400
  1,000         1,400            400      0.33     400
Regression Evaluation Metrics
• Errors range from 0 to infinity (the lower, the better)

• RMSE is always higher than or equal to MAE


• Exception: MAE = RMSE when all the differences are equal (or zero)
• E.g., Actual Values = [2,4,6,8] , Predicted Values = [4,6,8,10] -> RMSE=MAE=2

• RMSE is differentiable (typically used as a loss function)

• RMSLE is useful when


• The distribution of the actual values 𝑦𝑖 is long-tailed
• We are interested in the ratio of true value and predicted value
• We want to penalize underestimation of the actual values more than
overestimation

Regression Evaluation Metrics – R² (Coefficient of Determination)
• SST = total sum of squares
  • Measures the variation of the actual yᵢ values around their mean ȳ
• SSE = sum of squares error (residual)
  • Variation attributable to factors other than the relationship between x and y
• SSR = sum of squares regression
  • Explained variation attributable to the relationship between x and y
Regression Evaluation Metrics – R² vs. Adjusted R²

  $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

• Sum of squares explained by the regression / total sum of squares

  $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

Adjusted R²
• Adding features or independent variables (k) to a regression model
  (multiple regression) will always increase R², so the adjusted version
  penalizes model complexity:

  $R^2_A = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$

  where k is the number of features and n is the sample size.
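A minimal sketch, assuming a model with 2 features and made-up data, of R² via scikit-learn and adjusted R² from the formula above:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

r2 = r2_score(y_true, y_pred)
n, k = len(y_true), 2                          # sample size, number of features (assumed)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R^2
print(r2, r2_adj)
```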
Supervised Learning – Classification Tasks

Binary or two-class classification
• Predict a class out of 2 classes
• y = {×, o}

Multi-class classification
• Predict a class out of n classes
• y = {a, b, c}

(Figures: decision regions plotted in the feature space x₁ vs. x₂)
Classification Metrics – Confusion Matrix
• True Positive (TP): Sick person correctly predicted as Sick
• False Negative (FN): Sick person incorrectly predicted as Healthy
• False Positive (FP): Healthy person incorrectly predicted as Sick
• True Negative (TN): Healthy person correctly predicted as Healthy
Classification Metrics – Confusion Matrix
• True Positive rate (tpr) or Sensitivity: measures the proportion of
  positives that are correctly identified
• True Negative rate (tnr) or Specificity: measures the proportion of
  negatives that are correctly identified

• PP = TP + FP : total # of positive predictions
• NP = FN + TN : total # of negative predictions
• Pos = TP + FN : total # of actual positives
• Neg = FP + TN : total # of actual negatives
Classification Metrics – Confusion Matrix: Example
(Worked example: a table of actual labels (yᵢ) vs. predicted labels (ŷᵢ) and the confusion matrix built from them)
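A hedged sketch of how such an example can be computed with scikit-learn (the labels below are made up; 1 = Sick/positive, 0 = Healthy/negative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # predicted labels

# Rows = actual, columns = predicted, label order [0, 1]:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)   # tpr
specificity = tn / (tn + fp)   # tnr
print(cm, sensitivity, specificity)
```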
Precision and Recall
• Precision (or Positive Predictive Value) is the fraction of relevant
  instances among the retrieved instances
  • Out of all the positive predictions, how many are actually positive?

  $Precision = \frac{TP}{PP} = \frac{TP}{TP + FP}$

• Recall (same as tpr) is the fraction of relevant instances that were retrieved
  • Out of all the actual positives, how many are predicted as positive?

  $Recall = \frac{TP}{Pos} = \frac{TP}{TP + FN}$

• Maximizing precision will minimize the false positive errors
• Maximizing recall will minimize the false negative errors
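Continuing the toy labels from the confusion-matrix sketch above (my example, not the slides'):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```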
F-Score
• Depending on the application, give higher priority to recall or precision
  • Use precision when avoiding FPs is more important than avoiding FNs
  • Use recall when avoiding FNs is more important than avoiding FPs
• F1-score is the harmonic mean of precision and recall:

  $F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = 2 \times \frac{precision \times recall}{precision + recall}$

• Why not the arithmetic mean? The harmonic mean punishes extreme values
• The F1-score is maximized when Precision = Recall
• The Fβ-score attaches β times as much importance to recall as to precision:

  $F_\beta = (1 + \beta^2) \times \frac{precision \times recall}{\beta^2 \times precision + recall}$

  • Precision-focused: β = 0.5 (false positives hurt performance more than false negatives)
  • Recall-focused: β = 2 (false negatives hurt performance more than false positives)
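A sketch with scikit-learn's f1_score and fbeta_score (toy labels reused from the examples above):

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print(f1_score(y_true, y_pred))               # harmonic mean of precision and recall (0.75 here)
print(fbeta_score(y_true, y_pred, beta=2))    # recall-focused
print(fbeta_score(y_true, y_pred, beta=0.5))  # precision-focused
```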
Log-Loss
• Also known as cross-entropy
• It is the negative average of the log of the predicted probabilities pᵢ
  for each instance:

  $\text{Log-loss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)$

  where:
  • N is the number of observations
  • yᵢ is the true label for observation i
  • pᵢ = p(xᵢ) is the predicted probability that observation i belongs to the positive class

• Takes into account the uncertainty of the model predictions
• Log-Loss heavily penalizes the model if it confidently predicts the wrong class
• Lower Log-Loss => better model performance (0 is the best score)

(Figure: example classifier over two features, Length and Weight)
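A minimal sketch of scikit-learn's log_loss (the probabilities below are made up):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.1, 0.8, 0.35, 0.2]   # predicted probability of the positive class

# Confident wrong predictions would blow this up; confident correct ones keep it small
print(log_loss(y_true, p_pred))
```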
Receiver Operating Characteristics (ROC) Curve
• Plots the true positive rate (tpr) against the false positive rate (fpr)
  as the decision threshold varies
• AUC: Area Under the Curve
• Perfect classifier: AUC = 1.0
• Random classifier: AUC = 0.5

(Figure adapted from Wikipedia)
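A sketch of ROC points and AUC with scikit-learn (toy scores, my own example):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points of the ROC curve
auc = roc_auc_score(y_true, scores)                # area under that curve
print(fpr, tpr, auc)
```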


ROC Curve Construction
(Figures: the ROC curve is traced by sweeping the decision threshold over the ranked scores and plotting tpr vs. fpr at each threshold)
Interactive ROC: http://arogozhnikov.github.io/2015/10/05/roc-curve.html
AUC – Global Measure

• AUC(M₁) = AUC(M₂)
• Which model should you select?
(Figure: ROC curves of two models M₁ and M₂ with equal AUC)
Quiz
1. Which of the below represents the total number of actual positives?
   a. TP + FP
   b. TP + FN
   c. FP + FN
   d. FN + TN
   e. All of the above
   f. None of the above

2. The AUC ROC cannot be used in case of skewed distribution of target classes
   • True
   • False

3. Which classification metric does not vary with the decision threshold?
   a. TPR
   b. Recall
   c. Log-loss
   d. Error rate
Evaluation of Multi-class Classification
• Extension of the two-class classification
  • For each class we will have true vs. predicted labels
  • Classification report per class
• The confusion matrix can be generalized to multi-class
• Average classification metrics across classes
  • Different ways of averaging: macro, micro, weighted
• For imbalanced data: the support (number of instances) for each class
  becomes important to consider

(Figures: binary and multi-class decision regions in the feature space x₁ vs. x₂)
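A hedged sketch of a per-class report and a multi-class confusion matrix with scikit-learn (the three-class toy labels are mine):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

print(confusion_matrix(y_true, y_pred))       # generalizes to an n x n matrix
print(classification_report(y_true, y_pred))  # precision / recall / F1 / support per class,
                                              # plus macro and weighted averages
```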
Evaluation of Multi-class Classification - Example
• MNIST Handwritten Digit Classification Dataset
• 60,000 square 28×28 pixel grayscale images of handwritten single
digits between 0 and 9 (10 Classes)

(Diagram: sample digit images fed into a trained ML model that outputs predicted digits)
Evaluation of Multi-class Classification – Example
Multi-Class Confusion Matrix (MNIST Handwritten Digit Classification Dataset)
• Rows: actual labels; columns: predicted labels
• Diagonal values: correct classifications
• Off-diagonal values: errors
Micro vs Macro Average

  Actual Class   Predicted Class   Correct?
  Orange         Lemon             0
  Orange         Lemon             0
  Orange         Apple             0
  Orange         Orange            1
  Orange         Apple             0
  Lemon          Lemon             1
  Lemon          Apple             0
  Apple          Apple             1
  Apple          Apple             1

Macro-average
• Each class has equal weight
  1. Compute the metric within each class
  2. Average the results across classes
• Example: Recall. For each class: out of all the actual positives, how many
  are predicted positive? recall = TP / (TP + FN)
  • Orange: 1/5 = 0.2  (TP = 1, FN = 4)
  • Lemon:  1/2 = 0.5
  • Apple:  2/2 = 1.0
  • Macro-average Recall: (0.2 + 0.5 + 1.0) / 3 = 0.57
• Each class contributes equally, regardless of its size!
Micro vs Macro Average
Micro-average (using the same nine predictions as above)
• Each instance has equal weight
• Aggregate the contributions of all classes
• Example: Recall. Out of all the actual positives, how many are predicted positive?
  • Micro-average recall: 4/9 = 0.44 (4 correct predictions out of 9 instances)
• The largest classes have the most influence on the results
Micro vs Macro Average
• For balanced datasets (nearly equal # of instances among classes):
  • Macro- and micro-averages will be about the same
• For imbalanced datasets (some classes have more instances than others):
  • Macro-averaging: focus on the smallest classes
  • Micro-averaging: focus on the largest classes
  • If micro-average << macro-average: check the larger classes for poor metric
    performance; otherwise check the smaller classes
• Weighted-average: average the per-class metric (similar to macro) but
  weighted by support (the number of instances in each class)
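A sketch reproducing the fruit example above with scikit-learn's `average` parameter:

```python
from sklearn.metrics import recall_score

y_true = ["Orange"] * 5 + ["Lemon"] * 2 + ["Apple"] * 2
y_pred = ["Lemon", "Lemon", "Apple", "Orange", "Apple",   # Orange instances
          "Lemon", "Apple",                                # Lemon instances
          "Apple", "Apple"]                                # Apple instances

print(recall_score(y_true, y_pred, average="macro"))     # (0.2 + 0.5 + 1.0) / 3 ≈ 0.57
print(recall_score(y_true, y_pred, average="micro"))     # 4 correct / 9 instances ≈ 0.44
print(recall_score(y_true, y_pred, average="weighted"))  # per-class recall weighted by support
```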
Activities
• Read and watch additional materials (posted on Moodle)
• Revise the Lab 02
• Expect an assignment

