Chapter 5: Performance Evaluation
Evaluating Classification & Predictive Performance
Sagar Kamarthi
Topics
$$\text{RMS error} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e(i)^2}$$
$$\text{SSE} = \sum_{i=1}^{n} e(i)^2$$
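A minimal sketch of these two error measures (numpy assumed), where the residuals e(i) are actual minus predicted values on a given dataset; the numbers in the example are hypothetical:

```python
import numpy as np

def sse(y_actual, y_predicted):
    """Sum of squared errors over the residuals e(i) = actual - predicted."""
    e = np.asarray(y_actual) - np.asarray(y_predicted)
    return float(np.sum(e ** 2))

def rms_error(y_actual, y_predicted):
    """Root-mean-squared error: square root of the average squared residual."""
    e = np.asarray(y_actual) - np.asarray(y_predicted)
    return float(np.sqrt(np.mean(e ** 2)))

# Hypothetical actual vs. predicted values on a validation set
y_val = [10.0, 12.5, 9.0, 14.0]
y_hat = [11.0, 12.0, 8.5, 15.5]
print("SSE =", sse(y_val, y_hat), " RMS =", rms_error(y_val, y_hat))
```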
Comparing
Training and Validation Performance
• Residuals based on the training set tell us about model fit, or "goodness of fit"
• Residuals based on the validation set measure the model's ability to predict new data, or "prediction performance"
Benchmark: The Naïve Rule
Naïve Rule: classify all records as belonging to the most prevalent class
• The naïve rule ignores all predictor information
• It is often used as a benchmark classifier: any classifier that uses the predictor values should perform better than the naïve rule (a quick sketch of this benchmark follows below)
• Exception: when the goal is to identify high-value but rare outcomes, we may do well even while doing worse than the naïve rule on overall accuracy
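A short sketch of the naïve-rule benchmark with hypothetical labels: classify every record as the most prevalent class and report the resulting accuracy.

```python
from collections import Counter

def naive_rule_accuracy(y_actual):
    """Accuracy obtained by classifying every record as the most prevalent class."""
    counts = Counter(y_actual)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(y_actual)

# Hypothetical labels: 1 = important class C1, 0 = C2
y = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
cls, acc = naive_rule_accuracy(y)
print(f"Naive rule predicts class {cls} for every record; accuracy = {acc:.2f}")
# Any classifier that actually uses the predictors should beat this benchmark accuracy.
```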
Class Separation
• "High separation of records" means that using the predictor variables attains low error
The Classification (Confusion) Matrix

                   Predicted Class C1                       Predicted Class C2
Actual Class C1    n11 = True Positive (TP):                n12 = False Negative (FN):
                   number of C1 cases classified correctly  number of C1 cases classified incorrectly as C2
Actual Class C2    n21 = False Positive (FP):               n22 = True Negative (TN):
                   number of C2 cases classified            number of C2 cases classified correctly
                   incorrectly as C1
Class Example

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       1
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       1
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       1
8     82.8    22.4      0.762       0
9     69.0    20.0      0.707       1
10    93.0    20.8      0.681       1
11    51.0    22.0      0.656       1
12    81.0    20.0      0.622       0
13    75.0    19.6      0.506       1
14    52.8    20.8      0.471       0
15    64.8    17.2      0.337       0
16    43.2    20.4      0.218       1
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       0
20    66.0    18.4      0.038       0
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       0
24    63.0    14.8      0.004       0
Cutoff Table
[Table: the 24 cases sorted by descending propensity, with columns Case, Income, Lot Size, Propensity, Actual Ownership, and the Prediction at the chosen cutoff]
• If the cutoff is 0.25: Error = 5/24 = 0.2083
• If the cutoff is 0.50: Error = 3/24 = 0.1250
• If the cutoff is 0.75: Error = 6/24 = 0.2500
Accuracy and Overall Error as a Function of the
Cutoff Value
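A small sketch that reproduces the error rates quoted above on the Class Example propensities, assuming a record is classified as C1 when its propensity is at or above the cutoff:

```python
import numpy as np

# Propensities and actual ownership (1 = owner) from the Class Example table
propensity = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                       0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                       0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])
actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,
                   0, 0, 0, 0, 0, 0, 0, 0])

for cutoff in (0.25, 0.50, 0.75):
    predicted = (propensity >= cutoff).astype(int)   # classify as C1 at or above the cutoff
    error = np.mean(predicted != actual)
    print(f"cutoff={cutoff:.2f}  error={error:.4f}  accuracy={1 - error:.4f}")
```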
Two Cutoff Options
Variance
• Variance is a measure of how much the learned model shifts around its mean
• Variance quantifies how much the parameter estimates would change if the training data were varied slightly or different training data were used
• Variance tells us how scattered the model estimates are, across training samples, from their true values
• High-variance models are likely to overfit the data because they model random noise
Irreducible error
• Irreducible error arises from the inherent noise present in the data
• Irreducible error cannot be reduced, irrespective of the learning algorithm we use
Bias-variance tradeoff
• Increasing the complexity of a model increases its variance but reduces its bias
• Reducing a model's complexity increases its bias but reduces its variance
• This is called the bias-variance trade-off
Bias-variance tradeoff
[Figure: error components, including the irreducible error floor, as a function of model complexity]
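A small illustrative sketch, using synthetic data rather than anything from the slides, showing how training and validation RMS error behave as model complexity (here, polynomial degree) grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Synthetic data: noisy sine curve; the noise level plays the role of irreducible error
x_train = np.sort(rng.uniform(0, 1, 30))
x_valid = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
y_valid = np.sin(2 * np.pi * x_valid) + rng.normal(0, 0.3, x_valid.size)

for degree in (1, 3, 10):                                # model complexity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, degree)        # fit on the training set only
    train_err = rms(y_train, np.polyval(coeffs, x_train))
    valid_err = rms(y_valid, np.polyval(coeffs, x_valid))
    print(f"degree={degree:2d}  training RMS={train_err:.3f}  validation RMS={valid_err:.3f}")
# Low degree: high bias (both errors stay high). High degree: high variance
# (training error keeps dropping while validation error typically rises).
```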
Sensitivity and Specificity
$$\text{Sensitivity} = \frac{n_{11}}{n_{11} + n_{12}}$$
$$\text{Specificity} = \frac{n_{22}}{n_{21} + n_{22}}$$
Positive Predictive Value and Negative Predictive Value
$$\text{Positive Predictive Value} = \frac{n_{11}}{n_{11} + n_{21}}$$
$$\text{Negative Predictive Value} = \frac{n_{22}}{n_{12} + n_{22}}$$
False Positive Rate and False Negative Rate
$$\text{False positive rate} = \frac{n_{21}}{n_{11} + n_{21}}$$
$$\text{False negative rate} = \frac{n_{12}}{n_{12} + n_{22}}$$
Accuracy Measures

                   Predicted Class C1   Predicted Class C2
Actual Class C1    n11                  n12
Actual Class C2    n21                  n22

$$\text{False positive rate} = \frac{n_{21}}{n_{11} + n_{21}} \qquad \text{False negative rate} = \frac{n_{12}}{n_{12} + n_{22}}$$
$$\text{Positive Predictive Value} = \frac{n_{11}}{n_{11} + n_{21}} \qquad \text{Negative Predictive Value} = \frac{n_{22}}{n_{12} + n_{22}}$$
Performance Measures

                   Predicted Class C1   Predicted Class C2
Actual Class C1    n11                  n12
Actual Class C2    n21                  n22
If "C1" is the important class (success class):
• Error = % of "true C1 and C2 cases" incorrectly classified as "C2 and C1", respectively
  The inability to classify observations correctly into their respective classes.
• Accuracy = % of "true C1 and C2 cases" correctly classified as "C1 and C2", respectively
  The ability to classify observations correctly into their respective classes.
• Sensitivity = % of "true C1 cases" correctly classified as "C1 cases"
  Probability that the model classifies a case as C1 when the case is truly C1
• Specificity = % of "true C2 cases" correctly classified as "C2 cases"
  Probability that the model classifies a case as C2 when the case is truly C2
• Positive predictive value = % of "predicted C1 cases" that are truly "C1 cases"
  Probability that a case is truly C1 when the model classifies it as C1
• Negative predictive value = % of "predicted C2 cases" that are truly "C2 cases"
  Probability that a case is truly C2 when the model classifies it as C2
• False positive rate = % of "predicted C1 cases" that are truly "C2 cases" (misclassified C2 cases)
  Probability that a case is truly C2 when the model classifies it as C1
• False negative rate = % of "predicted C2 cases" that are truly "C1 cases" (misclassified C1 cases)
  Probability that a case is truly C1 when the model classifies it as C2
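A minimal sketch computing the measures above from the four counts of a 2x2 classification matrix, using the slide's definitions (note that the false positive/negative rates here are conditioned on the predicted class); the example counts are the 8/2/20/970 matrix that appears later under Asymmetric Costs.

```python
def classification_measures(n11, n12, n21, n22):
    """Performance measures from a 2x2 classification matrix.

    n11 = TP (actual C1, predicted C1), n12 = FN (actual C1, predicted C2),
    n21 = FP (actual C2, predicted C1), n22 = TN (actual C2, predicted C2).
    """
    n = n11 + n12 + n21 + n22
    return {
        "accuracy":                  (n11 + n22) / n,
        "error":                     (n12 + n21) / n,
        "sensitivity":               n11 / (n11 + n12),
        "specificity":               n22 / (n21 + n22),
        "positive predictive value": n11 / (n11 + n21),
        "negative predictive value": n22 / (n12 + n22),
        # The slide defines these rates relative to the predicted class:
        "false positive rate":       n21 / (n11 + n21),
        "false negative rate":       n12 / (n12 + n22),
    }

for name, value in classification_measures(n11=8, n12=2, n21=20, n22=970).items():
    print(f"{name:>27s}: {value:.4f}")
```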
Performance Measures

                   Predicted Class C1   Predicted Class C2
Actual Class C1    n11 = TP             n12 = FN
Actual Class C2    n21 = FP             n22 = TN
• If the proportion of positive (disease present) and negative (disease absent) groups in the sample data does not reflect the real prevalence of the disease, then the positive and negative predictive values, and accuracy, cannot be estimated directly from the classification matrix.
• When the disease prevalence is known, the positive and negative predictive values can be calculated using the following formulas, based on Bayes' theorem:
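The formulas the bullet refers to are not reproduced on the slide; the standard prevalence-adjusted (Bayes' theorem) forms are:

$$\text{PPV} = \frac{\text{Sensitivity} \times \text{Prevalence}}{\text{Sensitivity} \times \text{Prevalence} + (1 - \text{Specificity}) \times (1 - \text{Prevalence})}$$

$$\text{NPV} = \frac{\text{Specificity} \times (1 - \text{Prevalence})}{(1 - \text{Sensitivity}) \times \text{Prevalence} + \text{Specificity} \times (1 - \text{Prevalence})}$$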
https://www.medcalc.org/calc/diagnostic_test.php
F1 Score or F Score

                   Predicted Class C1   Predicted Class C2
Actual Class C1    n11                  n12
Actual Class C2    n21                  n22

$$F_1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
where Recall = Sensitivity = $n_{11}/(n_{11}+n_{12})$ and Precision = Positive Predictive Value = $n_{11}/(n_{11}+n_{21})$
F1 Score or F Score

                   Predicted Class C1   Predicted Class C2
Actual Class C1    TP = n11             FN = n12
Actual Class C2    FP = n21             TN = n22

$$F_1 = \frac{2 \times \dfrac{TP}{TP + FN} \times \dfrac{TP}{TP + FP}}{\dfrac{TP}{TP + FN} + \dfrac{TP}{TP + FP}}$$
Matthews correlation coefficient (MCC)

                   Predicted Class C1   Predicted Class C2
Actual Class C1    TP = n11             FN = n12
Actual Class C2    FP = n21             TN = n22

$$MCC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
Matthews correlation coefficient (MCC)

                   Predicted Class C1   Predicted Class C2
Actual Class C1    TP = n11             FN = n12
Actual Class C2    FP = n21             TN = n22

$$n = TN + TP + FN + FP$$
$$S = \frac{TP + FN}{n} \qquad P = \frac{TP + FP}{n}$$
$$MCC = \frac{TP/n - S \times P}{\sqrt{S \times P \times (1 - S)(1 - P)}}$$
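A minimal sketch computing F1 and MCC from the four counts, checking that the two MCC expressions above (the direct TP/TN/FP/FN form and the S-and-P form) give the same value; the counts are hypothetical.

```python
import math

def f1_score(tp, fp, fn):
    """F1 = harmonic mean of recall (sensitivity) and precision (PPV)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

def mcc_direct(tp, fp, fn, tn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

def mcc_via_s_and_p(tp, fp, fn, tn):
    """MCC from S = (TP+FN)/n and P = (TP+FP)/n, as on the second MCC slide."""
    n = tp + fp + fn + tn
    s = (tp + fn) / n
    p = (tp + fp) / n
    return (tp / n - s * p) / math.sqrt(s * p * (1 - s) * (1 - p))

tp, fn, fp, tn = 8, 2, 20, 970          # hypothetical counts
print("F1 :", round(f1_score(tp, fp, fn), 4))
print("MCC:", round(mcc_direct(tp, fp, fn, tn), 4), round(mcc_via_s_and_p(tp, fp, fn, tn), 4))
```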
Matthews correlation coefficient (MCC)
Perfect Classifier

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       1
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       1
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       1
8     82.8    22.4      0.707       1
9     69.0    20.0      0.681       1
10    93.0    20.8      0.656       1
11    51.0    22.0      0.520       1
12    81.0    20.0      0.471       1
13    75.0    19.6      0.442       0
14    52.8    20.8      0.410       0
15    64.8    17.2      0.340       0
16    43.2    20.4      0.337       0
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       0
20    66.0    18.4      0.038       0
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       0
24    63.0    14.8      0.004       0
Good Classifier

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       1
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       1
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       1
8     82.8    22.4      0.762       0
9     69.0    20.0      0.707       1
10    93.0    20.8      0.681       1
11    51.0    22.0      0.656       1
12    81.0    20.0      0.622       0
13    75.0    19.6      0.506       1
14    52.8    20.8      0.471       0
15    64.8    17.2      0.337       0
16    43.2    20.4      0.218       1
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       0
20    66.0    18.4      0.038       0
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       0
24    63.0    14.8      0.004       0
Poor Classifier

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       0
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       0
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       0
8     82.8    22.4      0.762       0
9     69.0    20.0      0.707       1
10    93.0    20.8      0.681       0
11    51.0    22.0      0.656       1
12    81.0    20.0      0.622       0
13    75.0    19.6      0.506       1
14    52.8    20.8      0.471       0
15    64.8    17.2      0.337       0
16    43.2    20.4      0.218       1
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       1
20    66.0    18.4      0.038       1
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       1
24    63.0    14.8      0.004       1
ROC (Receiver Operating Characteristic) Curve
• The ROC curve plots the pairs {sensitivity, 1 - specificity} as the cutoff value varies from 0 to 1
• A classifier's performance is better the closer its ROC curve is to the top left corner
• The comparison curve is the diagonal, which reflects the performance of the Naïve Rule as the cutoff is varied
• Area Under the Curve (AUC) is another metric of a classifier's performance
• AUC ranges from 1 (perfect discrimination between the classes) down to 0.5 (no better than the naïve rule)
Steps to Constructing the ROC Curve
• Sort the records in descending order of their estimated propensity, y(i)
• Vary the propensity cutoff from 0 to 1 over a grid (for example, k = 1, ..., 6 with cutoffs 0.0, 0.2, ..., 1.0)
• For each cutoff k, do the following (see the sketch below):
  Classify observations as positive if their propensity is greater than or equal to the cutoff, otherwise as negative.
  Construct the classification matrix using the above classification rule.
  From the classification matrix, compute Sensitivity(k) and Specificity(k).
• Plot Sensitivity(k) vs. 1 - Specificity(k) for k = 1, 2, ...
• Draw the benchmark performance line by connecting the points (0,0) and (1,1)
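A sketch of these steps on the Class Example propensities, assuming classification as positive at or above each cutoff on a six-point grid:

```python
import numpy as np

# Propensities and actual classes (1 = C1) from the Class Example table
propensity = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                       0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                       0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004])
actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,
                   0, 0, 0, 0, 0, 0, 0, 0])

roc_points = []
for cutoff in np.linspace(0.0, 1.0, 6):             # k = 1..6 -> cutoffs 0.0, 0.2, ..., 1.0
    predicted = (propensity >= cutoff).astype(int)  # classify as C1 at or above the cutoff
    tp = np.sum((predicted == 1) & (actual == 1))
    fn = np.sum((predicted == 0) & (actual == 1))
    fp = np.sum((predicted == 1) & (actual == 0))
    tn = np.sum((predicted == 0) & (actual == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    roc_points.append((1 - specificity, sensitivity))   # one (x, y) point on the ROC curve

for x, y in roc_points:
    print(f"1-specificity = {x:.3f}   sensitivity = {y:.3f}")
# The benchmark line connects (0, 0) and (1, 1).
```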
Benchmark Line for ROC Curve
• The ROC curve is a plot of Sensitivity (true positive rate) vs. 1 - Specificity (false positive rate)
• The higher the area under the ROC curve, the better the classifier's performance
Definition of Area Under the ROC Curve (AUROC)
[Figure: propensity distributions of the C2 and C1 cases along the propensity axis]
• AUROC is the probability that a binary classifier gives a higher predictive score to a random C1 case than to a random C2 case
• AUROC = P[p1 > p2], where p1 is the propensity of a random C1 case and p2 is the propensity of a random C2 case
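A minimal sketch estimating AUROC directly from this definition, by comparing every C1 propensity with every C2 propensity (ties counted as one half); the propensity vectors in the example are hypothetical:

```python
import numpy as np

def auroc_pairwise(propensity, actual):
    """AUROC estimated as P[p1 > p2]: compare every C1 score with every C2 score."""
    propensity = np.asarray(propensity, dtype=float)
    actual = np.asarray(actual)
    p1 = propensity[actual == 1]                     # propensities of C1 cases
    p2 = propensity[actual == 0]                     # propensities of C2 cases
    greater = (p1[:, None] > p2[None, :]).sum()
    ties = (p1[:, None] == p2[None, :]).sum()
    return (greater + 0.5 * ties) / (p1.size * p2.size)

# Perfectly separated classes give AUROC = 1.0
print(auroc_pairwise([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
# A partially overlapping ranking gives an AUROC between 0.5 and 1.0
print(auroc_pairwise([0.9, 0.4, 0.7, 0.8, 0.2, 0.1], [1, 1, 0, 0, 0, 0]))
```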
• Perfect Classifier (propensities and actual ownership as tabulated earlier under "Perfect Classifier")
• Good Classifier (propensities and actual ownership as tabulated earlier under "Good Classifier")
• Poor Classifier (propensities and actual ownership as tabulated earlier under "Poor Classifier")
Asymmetric Costs
Misclassification Costs May Differ
                Predicted C1   Predicted C2
Actual C1       8              2
Actual C2       20             970

$$\text{Average misclassification cost} = \frac{q_1 n_{12} + q_2 n_{21}}{n}$$
where q1 is the cost of misclassifying a C1 case (as C2) and q2 is the cost of misclassifying a C2 case (as C1)
Average Misclassification Cost
$$\text{Average misclassification cost} = q_1 \frac{n_{12}}{n_{11} + n_{12}} \cdot \frac{n_{11} + n_{12}}{n} + q_2 \frac{n_{21}}{n_{21} + n_{22}} \cdot \frac{n_{21} + n_{22}}{n}$$
$$\text{Average misclassification cost} = p_1 q_1 \frac{n_{12}}{n_{11} + n_{12}} + p_2 q_2 \frac{n_{21}}{n_{21} + n_{22}}$$
$$\text{Average misclassification cost} = p_1 q_1 (1 - \text{Sensitivity}) + p_2 q_2 (1 - \text{Specificity})$$
where $p_1 = (n_{11} + n_{12})/n$ and $p_2 = (n_{21} + n_{22})/n$ are the proportions of C1 and C2 cases
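As a worked example, take the classification matrix shown above (n11 = 8, n12 = 2, n21 = 20, n22 = 970, so n = 1000) together with hypothetical costs q1 = \$50 per misclassified C1 case and q2 = \$5 per misclassified C2 case; the first formula then gives

$$\text{Average misclassification cost} = \frac{q_1 n_{12} + q_2 n_{21}}{n} = \frac{50 \times 2 + 5 \times 20}{1000} = \$0.20 \text{ per record}$$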
Profit Matrix
                Predicted C1   Predicted C2
Actual C1       $80            $0
Actual C2       ($20)          $0
• Note that adding costs does not improve the actual classifications
themselves
• Use the lift curve and change the cutoff value for C1 to maximize profit
Opportunity Cost
                Predicted C1   Predicted C2
Actual C1       $8             $20
Actual C2       $20            $0
Lift and Decile-wise Lift Charts for Classification
• Plot C(k), the cumulative number of C1 cases among the top k records ranked by propensity, vs. k to get the gain or lift chart for the classification model
• Plot A(k), the expected cumulative number of C1 cases among k randomly chosen records, vs. k on the same graph to draw the baseline for the benchmark classifier
Lift Chart for Classification Performance
[Figure: cumulative ownership when sorted by predicted values vs. cumulative ownership using the average; x-axis: # cases]
• Example: randomly pick 10 cases: expected number of C1 cases = 10 × (12/24) = 5
• Using the model (the 10 cases with the highest propensities), the number of C1 cases identified = 9
• Lift = 9/5 = 1.8
Deciles
• Both lift chart and decile-wise lift chart embody the concept of
“moving down” through the records, starting with the most
probable
• Lift chart shows continuous cumulative results
y-axis shows number of important class records identified
• Decile chart does this in decile chunks of data
y-axis shows ratio of decile mean to overall mean
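A short sketch computing the cumulative lift values C(k) and the benchmark A(k), plus decile-wise lift, on the riding-mower data described above; it reproduces the lift of 9/5 = 1.8 for the top 10 cases. With only 24 records the "deciles" are approximate chunks.

```python
import numpy as np

# Actual ownership (1 = C1), already sorted in descending order of propensity
actual = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,
                   0, 0, 0, 0, 0, 0, 0, 0])

n = actual.size
overall_rate = actual.mean()                    # 12/24 = 0.5

cumulative = np.cumsum(actual)                  # C(k): cumulative C1 cases among top k
baseline = overall_rate * np.arange(1, n + 1)   # A(k): expected C1 cases among k random picks

k = 10
print(f"C({k}) = {cumulative[k-1]},  A({k}) = {baseline[k-1]:.1f},  "
      f"lift = {cumulative[k-1] / baseline[k-1]:.2f}")

# Decile-wise lift: ratio of each decile's C1 rate to the overall C1 rate
deciles = np.array_split(actual, 10)
decile_lift = [chunk.mean() / overall_rate for chunk in deciles]
print([round(x, 2) for x in decile_lift])
```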
Adding Costs/Benefits to Lift Curve
• If total net benefit from all cases is negative, reference line will
have negative slope
• Nonetheless, goal is still to use cutoff to select the point where net
benefit is at a maximum
Lift Curve May Go Negative
Oversampling and Asymmetric Costs
Rare Cases
Assuming that misclassifying “o” is five times the cost of misclassifying “x”
Validate the models with an unweighted (simple random) sample from the
original data
Oversampling the Training Set
• Training dataset: half of the C1 pool (n1/2 cases) + an equal number of cases (n1/2) drawn from the C2 pool, giving a 50/50 class mix
• Validation dataset: the other half of the C1 pool (n1/2 cases) + half of the C2 pool (n2/2 cases), preserving the original class proportions
Evaluating Model Performance Using a Non-oversampled Validation Set
• In some cases very low response rates may make it more practical to use
oversampled data for both the training data and the validation data
• This might happen, for example, if an analyst is given a dataset for exploration
and prototyping that is already oversampled to boost the proportion
• This requires the oversampled validation set to be reweighted, in order to restore the original proportions of the classes that were over- and under-represented in the sampling process
• This adjustment should be made to the classification matrix and to the lift chart
in order to derive good accuracy measures
Adjusting the Confusion Matrix for Oversampling
• Suppose that the C1 rate in the data as a whole is 2%, and that the data were
oversampled, yielding a sample in which the C1 rate is 25 times higher (50% C1)
C1 : 2% of the whole data; 50% of the sampled data
• Each C1 member in the whole data is worth 25 C1 members in the sample data
(50/2)
• Each nonresponder in the whole data is worth 0.5102 nonresponders in the sample
(50/98)
• We call these values oversampling weights. Assume that the validation
classification matrix looks like this:
Adjusting the Confusion Matrix for Oversampling
• Correction factor for the C1 class: f1 = (proportion of C1 class in the oversampled data) / (proportion of C1 class in the original data)
• Correction factor for the C2 class: f2 = (proportion of C2 class in the oversampled data) / (proportion of C2 class in the original data)
• Correction for the classification matrix: see the sketch below
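The corrected matrix itself is not shown on the slide. One plausible way to apply f1 and f2 (an assumption, not taken verbatim from the slides) is to divide each actual-class row of the oversampled confusion matrix by that class's correction factor, which rescales the counts back toward the original class mix; the counts below are hypothetical.

```python
import numpy as np

def adjust_confusion_matrix(matrix, p_over, p_orig):
    """Reweight an oversampled 2x2 confusion matrix back to the original class mix.

    matrix : rows = actual class (C1, C2), columns = predicted class (C1, C2)
    p_over : (proportion of C1, proportion of C2) in the oversampled data
    p_orig : (proportion of C1, proportion of C2) in the original data
    """
    f1 = p_over[0] / p_orig[0]          # correction factor for the C1 row
    f2 = p_over[1] / p_orig[1]          # correction factor for the C2 row
    adjusted = np.array(matrix, dtype=float)
    adjusted[0, :] /= f1                # scale down the oversampled C1 counts
    adjusted[1, :] /= f2                # scale up the undersampled C2 counts
    return adjusted

# Hypothetical oversampled validation matrix (50% C1 in the sample, 2% C1 originally)
sample_matrix = [[420, 80],             # actual C1: predicted C1, predicted C2
                 [110, 390]]            # actual C2: predicted C1, predicted C2
print(adjust_confusion_matrix(sample_matrix, p_over=(0.50, 0.50), p_orig=(0.02, 0.98)))
```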
• In some applications the goal is to search, among a set of new records, for a subset of records that gives the highest cumulative predicted values
• In such cases, a graphical way to assess predictive performance is through a lift chart
• The y-axis is the cumulative value of the numeric target variable (e.g., revenue), and the x-axis is the number of cases, with records sorted in descending order of predicted value (see the sketch below)
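A minimal sketch of this lift chart for a numeric target, using hypothetical predicted and actual revenue values: sort by predicted value, accumulate the actual values, and compare with the average-based reference line.

```python
import numpy as np

# Hypothetical validation records: predicted and actual revenue per case
predicted = np.array([120.0, 310.0, 45.0, 220.0, 90.0, 400.0, 60.0, 150.0])
actual    = np.array([100.0, 280.0, 70.0, 250.0, 50.0, 380.0, 40.0, 170.0])

order = np.argsort(-predicted)                 # sort cases by predicted value, descending
cumulative_actual = np.cumsum(actual[order])   # y-axis of the lift chart
baseline = actual.mean() * np.arange(1, actual.size + 1)   # reference line: average per case

for k, (c, b) in enumerate(zip(cumulative_actual, baseline), start=1):
    print(f"top {k} cases: cumulative actual = {c:7.1f}   baseline = {b:7.1f}")
```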
Class Example: the riding-mower table shown earlier (Case, Income, Lot Size, Propensity, Actual Ownership)
Lift Chart for Predictive Performance