Chapter 5

Evaluating Classification & Predictive Performance

Sagar Kamarthi
Topics

• Evaluating predictive performance
• Performance metrics for classification
• Asymmetric costs
• Oversampling
Prediction Outcomes
• In supervised learning we are interested in predicting three main types of outcome for new records:
 Predicted numerical value—when the outcome variable is numerical (e.g., house price)
 Predicted class membership—when the outcome variable is categorical (e.g., buyer/nonbuyer)
 Propensity—the probability of class membership, when the outcome variable is categorical (e.g., the propensity to default)
Why Evaluate?

• Multiple methods are available to classify or predict

• For each method, multiple choices are available for settings

• To choose the best model, need to assess each model’s performance


Evaluating Predictive Performance
Measuring Predictive Error
• Predictive accuracy is not the same as “goodness-of-fit”
• Classical statistical measures of performance are aimed at finding a model
that fits well to the data on which the model was trained
• In data mining we are interested in models that have high predictive
accuracy when applied to new records
• Measures such as R2 and standard error of estimate are common metrics
in classical regression modeling, and residual analysis is used to gauge
goodness-of-fit in that situation
• R2 and standard error of estimate do not tell much about the ability of the
model to predict new cases
Measuring Predictive Error

• The key component of most measures is the difference between the actual value y and the predicted value ŷ (the "error")
• The prediction error for record i is defined as the difference between its actual y value and its predicted value:

  $e(i) = y(i) - \hat{y}(i)$

Prediction Accuracy Measures
• MAE or MAD (mean absolute error/deviation) gives the magnitude of the average absolute error:

  $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e(i)|$

• Average error (AE) gives an indication of whether the predictions are, on average, overpredicting or underpredicting the response:

  $\mathrm{AE} = \frac{1}{n}\sum_{i=1}^{n} e(i)$

• MAPE (mean absolute percentage error) gives a percentage score of how much the predictions deviate (on average) from the actual values:

  $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{e(i)}{y(i)}\right|$
Prediction Accuracy Measures

• RMSE (root-mean-squared error) is similar to the standard error of estimate in linear regression, except that it is computed on the validation data rather than on the training data:

  $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e(i)^2}$

• Total SSE (total sum of squared errors):

  $\mathrm{SSE} = \sum_{i=1}^{n} e(i)^2$
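As an illustration of these measures, here is a minimal NumPy sketch (function and variable names are our own, not from the slides):

```python
import numpy as np

def prediction_accuracy_measures(y_actual, y_pred):
    """Compute MAE, AE, MAPE, RMSE, and total SSE for numeric predictions."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    e = y_actual - y_pred                            # prediction errors e(i) = y(i) - yhat(i)
    return {
        "MAE": np.mean(np.abs(e)),                   # mean absolute error
        "AE": np.mean(e),                            # average error (sign shows over/underprediction)
        "MAPE": 100 * np.mean(np.abs(e / y_actual)), # mean absolute percentage error
        "RMSE": np.sqrt(np.mean(e ** 2)),            # root-mean-squared error
        "SSE": np.sum(e ** 2),                       # total sum of squared errors
    }

# Example: errors of a model on a small validation set
print(prediction_accuracy_measures([200, 150, 300], [210, 140, 320]))
```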
Comparing
Training and Validation Performance

• Residuals that are based on the training set tell about model fit or
“goodness of fit”
• Residuals that are based on the validation set measure the model's
ability to predict new data or “prediction performance”
Comparing Training and Validation Performance

• Training errors are expected to be smaller than the validation errors
• This is because the model was fitted using the training set
Comparing Training and Validation Performance

• The figure shows histograms and boxplots of Toyota price prediction errors for the training and validation sets
• The discrepancies between training and validation performance are due to some outliers, especially the large negative training error
Performance Metrics (Classification)
Performance Evaluation Methods for
Classification
• Lift charts
• Error rate/Accuracy
• Sensitivity
• Specificity
• Positive predictive value
• Negative predictive value
• False positive rate
• False negative rate
• Receiver operating characteristic (ROC) curve
• Area under curve (AUROC)
Benchmark: Naïve Rule

Naïve Rule: classify all records as belonging to the most prevalent class
• Naïve rule ignores all predictor information
• Often used as a benchmark classifier: any classifier that uses predictor values should perform better than the naïve rule
• Exception: when the goal is to identify high-value but rare outcomes, we may do well overall even while doing worse than the naïve rule in terms of overall error
Class Separation

• “High separation of
records” means that using
predictor variables attains
low error
Class Separation

• “Low separation of records”


means that using predictor
variables does not improve
much on naïve rule. Even a
very large dataset will not help
Meaning of Each Cell in Classification Matrix

                 Predicted Class C1                                    Predicted Class C2
Actual Class C1  n11 = number of C1 cases classified correctly        n12 = number of C1 cases classified incorrectly as C2
Actual Class C2  n21 = number of C2 cases classified incorrectly as C1  n22 = number of C2 cases classified correctly

• Total number of cases, n = n11 + n12 + n21 + n22
• Total number of C1 cases = n11 + n12
• Total number of C2 cases = n21 + n22
Meaning of Each Cell in Classification Matrix
Predicted Class C1 Predicted Class C2

Actual Class C1 n11 = True Positive (TP) n12 = False Negative (FN)

Actual Class C2 n21 = False Positive (FP) n22 = True Negative (TN)

• Total number of cases, n = n11 + n12 + n21 + n22
• Total number of C1 cases = n11 + n12
• Total number of C2 cases = n21 + n22
Generalizing the Classification Matrix to Multiple Classes

• For m classes, the classification matrix has m rows and m columns
• Theoretically, there are m(m−1) misclassification costs, since any case could be misclassified in m−1 ways
• Practically, this is too many to work with
• In a decision-making context, though, such complexity rarely arises; usually one class is of primary interest
Classification Matrix (Confusion matrix)
• Summarizes the correct and incorrect classifications that a
classifier produced for a certain dataset.

Classification Confusion Matrix

                Predicted Class
Actual Class    1       0
1               201     85
0               25      2689

• 201 1's correctly classified as "1"
• 85 1's incorrectly classified as "0"
• 25 0's incorrectly classified as "1"
• 2689 0's correctly classified as "0"
• Gives estimates of the true classification and misclassification rates.
Using the Validation Data

• To obtain an honest estimate of classification error, use the classification matrix that is computed from the validation data.
• Partition the data into training and validation sets by random selection of cases.
• Construct a classifier using the training data and apply it to the validation data.
• Summarize these classifications in a classification matrix.
Error and Accuracy Measures
                 Predicted Class C1                                    Predicted Class C2
Actual Class C1  n11 = number of C1 cases classified correctly        n12 = number of C1 cases classified incorrectly as C2
Actual Class C2  n21 = number of C2 cases classified incorrectly as C1  n22 = number of C2 cases classified correctly

• Estimated misclassification rate (overall error rate):

  $\mathrm{Error} = \frac{n_{12} + n_{21}}{n} = \frac{n_{12} + n_{21}}{n_{11} + n_{12} + n_{21} + n_{22}}$

• Estimated overall accuracy:

  $\mathrm{Accuracy} = 1 - \mathrm{Error} = \frac{n_{11} + n_{22}}{n} = \frac{n_{11} + n_{22}}{n_{11} + n_{12} + n_{21} + n_{22}}$
Error and Accuracy Measures
Classification Confusion Matrix

                Predicted Class
Actual Class    1       0
1               201     85
0               25      2689

• Overall error rate = (25 + 85)/3000 = 3.67%
• Accuracy = 1 − Error = (201 + 2689)/3000 = 96.33%
• With multiple classes, the error rate equals the sum of misclassified records divided by the total number of records
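A short sketch of these two measures computed directly from the four cell counts (using the counts from the confusion matrix above):

```python
def error_and_accuracy(n11, n12, n21, n22):
    """Overall error rate and accuracy from the 2x2 classification matrix."""
    n = n11 + n12 + n21 + n22
    error = (n12 + n21) / n
    return error, 1 - error

# Counts from the confusion matrix above: 201, 85, 25, 2689
error, accuracy = error_and_accuracy(201, 85, 25, 2689)
print(f"error = {error:.4f}, accuracy = {accuracy:.4f}")   # 0.0367, 0.9633
```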
Cutoff for Classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50


 If >= 0.50, classify as “1”
 If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50
Class Example

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       1
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       1
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       1
8     82.8    22.4      0.762       0
9     69.0    20.0      0.707       1
10    93.0    20.8      0.681       1
11    51.0    22.0      0.656       1
12    81.0    20.0      0.622       0
13    75.0    19.6      0.506       1
14    52.8    20.8      0.471       0
15    64.8    17.2      0.337       0
16    43.2    20.4      0.218       1
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       0
20    66.0    18.4      0.038       0
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       0
24    63.0    14.8      0.004       0
Cutoff Table

(The table repeats the 24 records of the Class Example above, sorted by propensity, together with the model's prediction for each record.)

• If cutoff is 0.25
 15 records are classified as "1"
 5 records misclassified
 Error = 5/24 = 0.2083
• If cutoff is 0.50
 13 records are classified as "1"
 3 records misclassified
 Error = 3/24 = 0.1250
• If cutoff is 0.75
 8 records are classified as "1"
 6 records misclassified
 Error = 6/24 = 0.2500
Classification Matrix for Different Cutoffs

• If cutoff is 0.25
 Error = 5/24 = 0.2083
• If cutoff is 0.50
 Error = 3/24 = 0.1250
• If cutoff is 0.75
 Error = 6/24 = 0.2500
Accuracy and Overall Error as a Function of the
Cutoff Value
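A sketch of how such a plot can be produced from propensities and actual classes; the data are the 24-record example above, and matplotlib is used only for the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

def error_at_cutoff(propensity, actual, cutoff):
    """Classify as 1 when propensity >= cutoff and return the overall error rate."""
    predicted = (np.asarray(propensity) >= cutoff).astype(int)
    return np.mean(predicted != np.asarray(actual))

# Propensities and actual classes of the 24-record example (sorted by propensity)
propensity = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
              0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
              0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]
actual =     [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,
              0, 0, 0, 0, 0, 0, 0, 0]

print(error_at_cutoff(propensity, actual, 0.25))  # 0.2083
print(error_at_cutoff(propensity, actual, 0.50))  # 0.1250
print(error_at_cutoff(propensity, actual, 0.75))  # 0.2500

# Accuracy and overall error as a function of the cutoff value
cutoffs = np.linspace(0, 1, 101)
errors = np.array([error_at_cutoff(propensity, actual, c) for c in cutoffs])
plt.plot(cutoffs, errors, label="overall error")
plt.plot(cutoffs, 1 - errors, label="accuracy")
plt.xlabel("cutoff value"); plt.ylabel("rate"); plt.legend(); plt.show()
```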
Two Cutoff Option

• In some cases it is useful to have two cutoffs to allow a "cannot say" option for the classifier.
• In a two-class situation, this means that for a case, we can make
one of three predictions: The case belongs to C1, or the case belongs
to C2, or we cannot make a prediction because there is not enough
information to pick C1 or C2 confidently.
• Cases that the classifier cannot classify are subjected to closer
scrutiny either by using expert judgment or by enriching the set of
predictor variables by gathering additional information that is
perhaps more difficult or expensive to obtain.
Unequal Importance of Classes

• In many cases it is more important to identify members of one class:
 Tax fraud
 Credit default
 Response to promotional offer
 Detecting electronic network intrusion
 Predicting delayed flights
• In such cases, we are willing to tolerate greater overall error, in
return for better identifying the important class for further
attention
Bias-variance tradeoff

• The generalization error of the model can be expressed as the sum of three error terms:

  Total Error = Bias² + Variance + Irreducible Error
Bias

• Bias is the difference between the average of the parameter estimates from different samples and the actual value
• Bias may arise from wrong assumptions about the data and/or from models that are too inflexible
• Bias tells us how far our estimated parameters (function) are from the actual parameters
• High-bias models are likely to underfit the data because they fail to capture the correct underlying relationship
Variance

• Variance is a measure of how much the learned model shifts around its mean
• Variance quantifies the amount by which the parameter estimates would change if the training data were varied slightly or a different training sample were used
• Variance tells us how scattered the model estimates are across samples from their actual values
• High-variance models are likely to overfit the data because they model random noise
Irreducible error

• Irreducible error occurs due to the inherent noise present in the data
• Irreducible error cannot be reduced irrespective of the learning algorithm we use
Bias-variance tradeoff

• Increasing complexity of the model would increase its variance but reduce
bias
• Reducing a model's complexity increases its bias but reduces variance
• This is called bias-variance trade-off
Bias-variance tradeoff

• Variation of bias and variance with model complexity (figure: error vs. model complexity, showing an underfitting zone at low complexity, an overfitting zone at high complexity, the irreducible-error floor, and the optimum model complexity in between)
Sensitivity and Specificity

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11                   n12
Actual Class C2  n21                   n22

  $\mathrm{Sensitivity} = \frac{n_{11}}{n_{11} + n_{12}}$

  $\mathrm{Specificity} = \frac{n_{22}}{n_{21} + n_{22}}$
Positive Predictive Value and
Negative Predictive Value
                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11                   n12
Actual Class C2  n21                   n22

  $\mathrm{Positive\ Predictive\ Value} = \frac{n_{11}}{n_{11} + n_{21}}$

  $\mathrm{Negative\ Predictive\ Value} = \frac{n_{22}}{n_{12} + n_{22}}$
False Positive Rate and
False Negative Rate
                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11                   n12
Actual Class C2  n21                   n22

  $\mathrm{False\ positive\ rate} = \frac{n_{21}}{n_{11} + n_{21}}$

  $\mathrm{False\ negative\ rate} = \frac{n_{12}}{n_{12} + n_{22}}$
Accuracy Measures

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11                   n12
Actual Class C2  n21                   n22

  $\mathrm{Accuracy} = \frac{n_{11} + n_{22}}{n}$        $\mathrm{Error} = \frac{n_{12} + n_{21}}{n}$

  $\mathrm{Sensitivity} = \frac{n_{11}}{n_{11} + n_{12}}$        $\mathrm{Specificity} = \frac{n_{22}}{n_{21} + n_{22}}$

  $\mathrm{False\ positive\ rate} = \frac{n_{21}}{n_{11} + n_{21}}$        $\mathrm{False\ negative\ rate} = \frac{n_{12}}{n_{12} + n_{22}}$

  $\mathrm{Positive\ Predictive\ Value} = \frac{n_{11}}{n_{11} + n_{21}}$        $\mathrm{Negative\ Predictive\ Value} = \frac{n_{22}}{n_{12} + n_{22}}$
Performance Measures

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11                   n12
Actual Class C2  n21                   n22

If "C1" is the important class (success class):
• Error = % of "true C1 and C2 cases" incorrectly classified as "C2 and C1", respectively
 Inability to classify observations correctly into their respective classes.
• Accuracy = % of "true C1 and C2 cases" correctly classified as "C1 and C2", respectively
 Ability to classify observations correctly into their respective classes.
• Sensitivity = % of "true C1 cases" correctly classified as "C1 cases"
 Probability that the model classifies a case as C1 when the case is truly C1
• Specificity = % of "true C2 cases" correctly classified as "C2 cases"
 Probability that the model classifies a case as C2 when the case is truly C2
• Positive predictive value = % of "predicted C1 cases" that are truly "C1 cases"
 Probability that the case is truly C1 when the model classifies the case as C1
• Negative predictive value = % of "predicted C2 cases" that are truly "C2 cases"
 Probability that the case is truly C2 when the model classifies the case as C2
• False positive rate = % of "predicted C1 cases" that are actually "C2 cases"
 Probability that the case is truly C2 when the model classifies the case as C1
• False negative rate = % of "predicted C2 cases" that are actually "C1 cases"
 Probability that the case is truly C1 when the model classifies the case as C2
Performance Measures

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11 = TP              n12 = FN
Actual Class C2  n21 = FP              n22 = TN

Definitions in the medical domain:
• Error = Overall probability that a patient will not be correctly classified.
• Accuracy = Overall probability that a patient will be correctly classified.
• Sensitivity = Probability that a test result will be positive when the disease is present (true positive rate)
• Specificity = Probability that a test result will be negative when the disease is not present (true negative rate)
• Positive predictive value = Probability that the disease is present when the test is positive
• Negative predictive value = Probability that the disease is not present when the test is negative
• False positive rate = Probability that the disease is not present when the test is positive
• False negative rate = Probability that the disease is present when the test is negative
• Positive likelihood ratio = Ratio between the probability of a positive test result given the presence of the
disease and the probability of a positive test result given the absence of the disease, i.e. True positive rate / False
positive rate = Sensitivity / (1-Specificity)
• Negative likelihood ratio = Ratio between the probability of a negative test result given the presence of the
disease and the probability of a negative test result given the absence of the disease, i.e. False negative rate /
True negative rate = (1-Sensitivity) / Specificity
Performance Measures

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11 = TP              n12 = FN
Actual Class C2  n21 = FP              n22 = TN

• If the proportions of the positive (disease present) and negative (disease absent) groups in the sample data do not reflect the real prevalence of the disease, then the positive and negative predictive values, and accuracy, cannot be estimated directly from the sample.
• When the disease prevalence is known, the positive and negative predictive values can be calculated using the following formulas, based on Bayes' theorem:

  PPV = [Sensitivity × Prevalence] / [Sensitivity × Prevalence + (1 − Specificity) × (1 − Prevalence)]

  NPV = [Specificity × (1 − Prevalence)] / [(1 − Sensitivity) × Prevalence + Specificity × (1 − Prevalence)]

https://www.medcalc.org/calc/diagnostic_test.php
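A small sketch of these prevalence-adjusted predictive values (the sensitivity, specificity, and prevalence values in the example are hypothetical):

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Positive/negative predictive value adjusted for disease prevalence (Bayes' theorem)."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence))
    return ppv, npv

# Illustrative values: a test with 90% sensitivity, 95% specificity, 2% prevalence;
# PPV comes out low despite a "good" test, because the disease is rare
print(ppv_npv(0.90, 0.95, 0.02))
```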
F1 Score or F Score

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11                   n12
Actual Class C2  n21                   n22

If "C1" is the important class (success class):
• Recall = Sensitivity = % of "true C1 cases" correctly classified as "C1 cases"
 Ability to detect the important class members correctly.
• Precision = Positive predictive value = % of "predicted C1 cases" that are truly "C1 cases"

  $F_1 = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$
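A minimal sketch computing recall, precision, and F1 from the confusion-matrix cells (function name is illustrative):

```python
def f1_score(n11, n12, n21):
    """F1 score from the confusion-matrix cells, with C1 as the important class."""
    recall = n11 / (n11 + n12)       # sensitivity
    precision = n11 / (n11 + n21)    # positive predictive value
    return 2 * recall * precision / (recall + precision)

# Example with the confusion matrix used earlier (n11=201, n12=85, n21=25)
print(round(f1_score(201, 85, 25), 3))   # ≈ 0.785
```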
F1 Score or F Score

                 Predicted Class C1   Predicted Class C2
Actual Class C1  TP = n11              FN = n12
Actual Class C2  FP = n21              TN = n22

  $F_1 = \frac{2 \cdot \frac{TP}{TP+FN} \cdot \frac{TP}{TP+FP}}{\frac{TP}{TP+FN} + \frac{TP}{TP+FP}}$
Matthews correlation coefficient (MCC)

                 Predicted Class C1   Predicted Class C2
Actual Class C1  TP = n11              FN = n12
Actual Class C2  FP = n21              TN = n22

  $\mathrm{MCC} = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
Matthews correlation coefficient (MCC)

                 Predicted Class C1   Predicted Class C2
Actual Class C1  TP = n11              FN = n12
Actual Class C2  FP = n21              TN = n22

  $n = TN + TP + FN + FP$

  $S = \frac{TP + FN}{n}$

  $P = \frac{TP + FP}{n}$

  $\mathrm{MCC} = \frac{TP/n - S \cdot P}{\sqrt{S \cdot P \,(1-S)(1-P)}}$
Matthews correlation coefficient (MCC)

• MCC takes into account true and false positives and negatives to compute a balanced measure that can be used even if the class sizes are imbalanced
• MCC takes a value between −1 and +1
 +1 represents a perfect prediction
 0 indicates predictions no better than random prediction
 −1 indicates total disagreement between prediction and observation
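A minimal sketch of MCC from the four cell counts (the counts reuse the earlier 201/85/25/2689 confusion matrix):

```python
import math

def matthews_corrcoef(tp, fn, fp, tn):
    """Matthews correlation coefficient from the 2x2 confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Example with the earlier confusion matrix (TP=201, FN=85, FP=25, TN=2689)
print(round(matthews_corrcoef(201, 85, 25, 2689), 2))   # ≈ 0.77, well above random
```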
Perfect Classifier

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       1
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       1
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       1
8     82.8    22.4      0.707       1
9     69.0    20.0      0.681       1
10    93.0    20.8      0.656       1
11    51.0    22.0      0.520       1
12    81.0    20.0      0.471       1
13    75.0    19.6      0.442       0
14    52.8    20.8      0.410       0
15    64.8    17.2      0.340       0
16    43.2    20.4      0.337       0
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       0
20    66.0    18.4      0.038       0
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       0
24    63.0    14.8      0.004       0
Good Classifier

(Same 24 records and propensities as the Class Example table shown earlier.)
Poor Classifier

Case  Income  Lot Size  Propensity  Actual Ownership
1     60.0    18.4      0.996       1
2     85.5    16.8      0.988       1
3     64.8    21.6      0.984       0
4     61.5    20.8      0.980       1
5     87.0    23.6      0.948       0
6     110.1   19.2      0.889       1
7     108.0   17.6      0.848       0
8     82.8    22.4      0.762       0
9     69.0    20.0      0.707       1
10    93.0    20.8      0.681       0
11    51.0    22.0      0.656       1
12    81.0    20.0      0.622       0
13    75.0    19.6      0.506       1
14    52.8    20.8      0.471       0
15    64.8    17.2      0.337       0
16    43.2    20.4      0.218       1
17    84.0    17.6      0.199       0
18    49.2    17.6      0.149       0
19    59.4    16.0      0.048       1
20    66.0    18.4      0.038       1
21    47.4    16.4      0.025       0
22    33.0    18.8      0.022       0
23    51.0    14.0      0.016       1
24    63.0    14.8      0.004       1
ROC (Receiver Operating Characteristic) Curve

• The ROC curve plots the pairs {sensitivity, 1 − specificity} as the cutoff value varies from 0 to 1
• A classifier's performance is better if the ROC curve is closer to the top left corner
• The comparison curve is the diagonal, which reflects the performance of the naïve rule as its cutoff level is varied
• Area Under the Curve (AUC) is another metric of a classifier
• AUC ranges from 1 (perfect discrimination between classes) to 0.5 (no better than the naïve rule)
Steps to Constructing ROC Curve

• Sort the records in descending order of their estimated propensity $\hat{p}(i)$
• Vary the propensity cutoff from 0 to 1 (for example, cutoffs 0.0, 0.2, …, 1.0)
• For each cutoff k, do the following:
 Classify observations as positive if their propensity is greater than or equal to cutoff k, otherwise as negative.
 Construct the classification matrix using the above classification rule
 From the classification matrix, compute Sensitivity(k) and Specificity(k)
• Plot Sensitivity(k) vs 1 − Specificity(k) for k = 1, 2, …
• Draw the benchmark performance line by connecting points (0,0) and (1,1)
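A sketch of this procedure; for simplicity it sweeps cutoffs over a fixed 0.01 grid rather than record by record, and assumes both classes are present in the data:

```python
import numpy as np

def roc_points(propensity, actual):
    """Return (1 - specificity, sensitivity) pairs over a grid of cutoffs."""
    propensity = np.asarray(propensity, dtype=float)
    actual = np.asarray(actual, dtype=int)
    points = []
    for cutoff in np.linspace(0.0, 1.0, 101):
        predicted = (propensity >= cutoff).astype(int)
        tp = np.sum((predicted == 1) & (actual == 1))
        fp = np.sum((predicted == 1) & (actual == 0))
        fn = np.sum((predicted == 0) & (actual == 1))
        tn = np.sum((predicted == 0) & (actual == 0))
        sensitivity = tp / (tp + fn)          # assumes at least one actual C1 case
        specificity = tn / (tn + fp)          # assumes at least one actual C2 case
        points.append((1 - specificity, sensitivity))
    return points  # plot these pairs; the (0,0)-(1,1) diagonal is the benchmark line

# Example: a perfectly separating classifier's points hug the top left corner
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])[:3])
```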
Benchmark Line for ROC Curve

• Let n = number of observations
• Let p = proportion of observations belonging to class C1 in the data
• (1 − p) = proportion of observations belonging to class C2 in the data
• List the observations in a random order
• For each observation i, assign propensity $\hat{y}(i) = (n - i)/(n - 1)$ for i = 1, 2, …, n
• Note that the observations are then already sorted in descending order of propensity
• Let f = cutoff on a scale of 1 to 0:
 Classify observations as positive if their propensity is greater than or equal to f, otherwise as negative
 (1 − f)n observations are predicted as belonging to class C1
 fn observations are predicted as belonging to class C2

Sensitivity and Specificity for Benchmark Rule

  $n_{11} + n_{12} = np \qquad n_{21} + n_{22} = n(1-p)$

  $n_{11} = n(1-f)p \qquad n_{22} = nf(1-p)$

  $\mathrm{Sensitivity} = \frac{n_{11}}{n_{11} + n_{12}} = \frac{n(1-f)p}{np} = 1 - f$

  $\mathrm{Specificity} = \frac{n_{22}}{n_{21} + n_{22}} = \frac{nf(1-p)}{n(1-p)} = f$

  $1 - \mathrm{Specificity} = 1 - f$

• To generate the baseline curve, f is varied from 1 to 0
• As f is varied from 1 to 0, both Sensitivity and 1 − Specificity synchronously vary from 0 to 1
ROC for Perfect Classifier

(Uses the 24-record Perfect Classifier example shown earlier.)

• Plot the pairs {sensitivity, 1 − specificity} versus the cutoff value
• Better performance is reflected by curves that are closer to the top left corner
ROC for Good Classifier

(Uses the 24-record Good Classifier example shown earlier.)

• Plot the pairs {sensitivity, 1 − specificity} versus the cutoff value
• Better performance is reflected by curves that are closer to the top left corner
ROC for Poor Classifier

(Uses the 24-record Poor Classifier example shown earlier.)

• Plot the pairs {sensitivity, 1 − specificity} versus the cutoff value
• Better performance is reflected by curves that are closer to the top left corner
Receiver Operating Characteristic (ROC) Curve

• It is a plot of sensitivity (true positive rate) vs. 1 − specificity (false positive rate)
• The area under the ROC curve is a measure of the quality of the classifier
Receiver Operating Characteristics (ROC) Curve

• The higher the area under the ROC curve, the better the classifier performance
Definition of Area Under ROC (AUROC)

(Figure: propensity distributions of the C2 and C1 classes along the propensity axis.)

• AUROC is the probability that a binary classifier gives a higher predictive score to a random C1 case than to a random C2 case
• AUROC = P[p1 > p2]
 where p1 = propensity for a random C1 case, and p2 = propensity for a random C2 case
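A sketch of this probabilistic interpretation, estimating AUROC by comparing all C1/C2 propensity pairs (ties counted as one half):

```python
import itertools

def auroc_by_pairs(propensity, actual):
    """Estimate AUROC as P[p1 > p2]: the share of (C1, C2) pairs in which the C1 case
    receives the higher propensity (ties count one half)."""
    p1 = [p for p, a in zip(propensity, actual) if a == 1]   # propensities of C1 cases
    p2 = [p for p, a in zip(propensity, actual) if a == 0]   # propensities of C2 cases
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x, y in itertools.product(p1, p2))
    return wins / (len(p1) * len(p2))

# Small illustrative example: 5 of the 6 C1-C2 pairs are ranked correctly
print(auroc_by_pairs([0.9, 0.8, 0.7, 0.3, 0.2], [1, 1, 0, 1, 0]))   # ≈ 0.833
```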
• Perfect Classifier (uses the 24-record example shown earlier)
• Good Classifier (uses the 24-record example shown earlier)
• Poor Classifier (uses the 24-record example shown earlier)
Asymmetric Costs
Misclassification Costs May Differ

• The cost of making a misclassification error may be higher for one class than the other(s)
• Alternatively, the benefit of making a correct classification may be higher for one class than the other(s)
• In such a scenario, using accuracy or misclassification rate as a criterion can be misleading
Example: Response to Promotional Offer

• Suppose we send an offer to 1000 people, with a 1% average response rate (C1 = response, C2 = nonresponse)
• Naïve rule (classify everyone as C2) has an error rate of 1% (seems good)
• Using DM we can correctly classify eight C1 cases as C1 cases
 Example: it comes at the cost of misclassifying twenty C2 cases as C1 cases and two C1 cases as C2 cases
Classification Matrix of a DM Classifier

Predicted C1 Predicted C2
Actual C1 8 2
Actual C2 20 970

• Error rate = (2+20)/1000 = 2.2% (higher than naïve rate)


Introducing Costs and Benefits
Suppose:
• Profit from a C1 is $10
• Cost of sending offer is $1
Then:
• Under naïve rule, all are classified as C2, so no offers are sent: no cost, no
profit
• Under the DM classifier's predictions, 28 offers are sent.
 8 respond with profit of $10 each (total profit $80) minus $8 cost for sending offer
 20 fail to respond, cost $1 each (total cost $20)
 972 receive nothing (no cost, no profit)
• Net profit = $52 (= 80 – 8 – 20)
Average Misclassification Cost

q1 = cost of misclassifying an actual C1 case
q2 = cost of misclassifying an actual C2 case
p1 = proportion of C1 cases in the data set
p2 = proportion of C2 cases in the data set

Goal: find a classifier that minimizes the average misclassification cost

  $\mathrm{Average\ misclassification\ cost} = \frac{q_1 n_{12} + q_2 n_{21}}{n}$
Average Misclassification Cost

  $\mathrm{Average\ misclassification\ cost} = q_1 \frac{n_{12}}{n_{11}+n_{12}} \cdot \frac{n_{11}+n_{12}}{n} + q_2 \frac{n_{21}}{n_{21}+n_{22}} \cdot \frac{n_{21}+n_{22}}{n}$

  $\mathrm{Average\ misclassification\ cost} = p_1 q_1 \frac{n_{12}}{n_{11}+n_{12}} + p_2 q_2 \frac{n_{21}}{n_{21}+n_{22}}$

  $\mathrm{Average\ misclassification\ cost} = (1 - \mathrm{Sensitivity})\, p_1 q_1 + (1 - \mathrm{Specificity})\, p_2 q_2$
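A one-line sketch of this cost measure (the example values for sensitivity, specificity, proportions, and costs are hypothetical):

```python
def average_misclassification_cost(sensitivity, specificity, p1, p2, q1, q2):
    """Average cost per record: missed C1 cases cost q1, misclassified C2 cases cost q2."""
    return (1 - sensitivity) * p1 * q1 + (1 - specificity) * p2 * q2

# Hypothetical values: 1% C1 prevalence, misclassifying a C1 case is 10x as costly
print(average_misclassification_cost(sensitivity=0.80, specificity=0.97,
                                     p1=0.01, p2=0.99, q1=10.0, q2=1.0))
```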


Minimizing Cost Ratio

• Sometimes actual costs and benefits are hard to estimate:
 Need to express everything in terms of costs (i.e., cost of misclassification per record)
 Goal is to minimize the average cost per record
• A good practical substitute for individual costs is the ratio of misclassification costs (e.g., "misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms")
• It can be shown that optimizing this quantity depends on the costs only through their ratio (q2/q1) and on the prior probabilities only through their ratio (p2/p1)
Average Misclassification Cost

  $\mathrm{Average\ misclassification\ cost} = \left[(1 - \mathrm{Sensitivity}) + (1 - \mathrm{Specificity})\,\frac{p_2 q_2}{p_1 q_1}\right] p_1 q_1$
Profit Matrix

Predicted C1 Predicted C2
Actual C1 $80 $0
Actual C2 ($20) $0

• Note that adding costs does not improve the actual classifications
themselves
• Use the lift curve and change the cutoff value for C1 to maximize profit
Opportunity Cost

• As we see, best to convert everything to costs, as opposed to a mix


of costs and benefits
• E.g., instead of “benefit from sale” refer to “opportunity cost of lost
sale”
• Leads to same decisions, but referring only to costs allows greater
applicability
Cost Matrix Including Opportunity Costs
• Recall original confusion matrix (profit from a C1 = $10, cost of
sending offer = $1):
Predicted C1 Predicted C2
Actual C1 8 2
Actual C2 20 970

• Cost matrix including opportunity costs is as follows:

Predicted C1 Predicted C2
Actual C1 $8 $20
Actual C2 $20 $0
Lift and Decile-wise Lift Charts for
Classification

• Useful for assessing performance in terms of identifying the most


important but rare class
• Helps evaluate, e.g.:
 How many tax records to examine
 How many loans to grant
 How many customers to mail offer to
Lift and Decile-wise Lift Charts

• Compare performance of DM model to “No model”


• Measures ability of DM model to identify the important class,
relative to its average prevalence
• Charts give explicit assessment of results over a large number of
cutoffs
Lift and Decile-wise Lift Charts: How to Use

• Compare lift to “No model” baseline:

 In lift chart: compare step function to straight line

 In decile chart: compare to ratio of 1


Lift Charts: How to Compute

1. Using the model’s classification probabilities (or


propensities), sort records from most likely to least likely
members of the important class
2. Compute lift: Accumulate the correctly classified C1 records
(Y axis) and compare to number of total records (X axis)
3. A lift chart for classification is built by ordering the set of
records of interest (typically, validation data) by their
propensity value, from high to low. Then, plot their
cumulative count of C1 cases as a function of the number of
records (the x-axis value)
• Good Classifier (the 24-record Good Classifier example shown earlier)
Steps in Constructing a Gains or Lift Chart

• Sort observations in descending order of their predicted response
 $\hat{y}(i)$ = estimated response for observation i for prediction problems
 $\hat{p}(i)$ = propensity for observation i for classification problems
• For each observation k, compute the actual cumulative response, $C(k) = \sum_{i=1}^{k} y(i)$
 y(i) = actual response for observation i for both prediction and classification problems
• For each observation k, compute the expected cumulative response, $A(k) = k\,\bar{y}$, where $\bar{y} = \frac{1}{n}\sum_{k=1}^{n} y(k)$ is the expected response per observation
• Plot C(k) vs k to get the gains or lift chart for the classification model
• Plot A(k) vs k on the same graph to draw the baseline for the benchmark classifier
Lift Chart for Classification Performance

• Step 1: Order the set of records of interest (typically, validation data) by their propensity value $\hat{p}(i)$ in descending order
• Step 2: Compute the cumulative count of C1 cases up to record i from top to bottom, $S(i) = \sum_{k=1}^{i} y(k)$, where i = 1, 2, …, n and y(k) is 1 if record k belongs to C1, otherwise 0
• Step 3: Compute the expected number of C1 cases in i randomly selected records, $R(i) = i\,p_1$, where i = 1, 2, …, n and p1 is the proportion of C1 cases in the dataset
• Step 4: Plot S(i) versus i as well as R(i) versus i on the same graph
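A sketch of these four steps on a small illustrative validation set (matplotlib used only for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

def classification_lift_curve(propensity, actual):
    """S(i): cumulative count of C1 cases among the top-i records (sorted by propensity);
    R(i): expected count under random selection, i * p1."""
    order = np.argsort(propensity)[::-1]          # Step 1: sort by propensity, descending
    y = np.asarray(actual)[order]
    i = np.arange(1, len(y) + 1)
    s = np.cumsum(y)                              # Step 2: cumulative C1 count
    r = i * np.mean(actual)                       # Step 3: random-selection baseline
    return i, s, r

# Illustrative data (propensities and 0/1 classes); any validation set works the same way
propensity = [0.95, 0.90, 0.80, 0.70, 0.55, 0.40, 0.30, 0.15, 0.10, 0.05]
actual     = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]

i, s, r = classification_lift_curve(propensity, actual)
plt.plot(i, s, label="model")                     # Step 4: plot S(i) and R(i) vs i
plt.plot(i, r, label="random baseline")
plt.xlabel("# cases"); plt.ylabel("cumulative C1 cases"); plt.legend(); plt.show()
```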
Lift Chart for Classification

(Figure: lift chart on the training dataset; "Cumulative Ownership when sorted using predicted values" is plotted against "# cases", together with the reference line "Cumulative Ownership using average".)

• Reference line: for any given number of cases (x-axis value), it represents the expected number of C1 predictions if we simply select cases at random
• Example: randomly pick 10 cases: expected number of C1 cases = 10 × (12/24) = 5
• Using the model, the number of C1 cases among the top 10 = 9
• Lift = 9/5 = 1.8
• After examining (e.g.) 10 cases (x-axis), 9 owners (y-axis) have been correctly identified
Steps in Constructing a Decile Chart

• Sort the observations in descending order of their predicted response
 $\hat{y}(i)$ = estimated response for observation i for prediction problems
 $\hat{p}(i)$ = propensity for observation i for classification problems
• Group the observations into the 1st, 2nd, …, 10th decile, such that the 1st decile has the largest responses, while the 10th decile has the smallest responses
• Compute the number of observations in each decile d, $n_d = n/10$, for d = 1, 2, …, 10
• Compute the expected total response of each decile d, $E_d = n_d\,\bar{y}$, where $\bar{y} = \frac{1}{n}\sum_{k=1}^{n} y(k)$
 y(k) = actual response of the kth observation for both prediction and classification problems
• Compute the actual total response for each decile d, $D_d = \sum_{k=1}^{n_d} y_d(k)$, where $y_d(k)$ is the response of the kth observation in the dth decile
• Compute for each decile the lift ratio $L_d = D_d / E_d$
• Plot a bar graph of the lift ratios $L_d$, where d = 1, 2, …, 10
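A sketch of the decile computation; here the deciles are formed as near-equal chunks via NumPy, and the example data are illustrative:

```python
import numpy as np

def decile_lift(predicted, actual, n_deciles=10):
    """Lift ratio (decile mean / global mean) for each decile of records
    sorted by predicted value (or propensity), top decile first."""
    order = np.argsort(predicted)[::-1]
    y = np.asarray(actual, dtype=float)[order]
    global_mean = y.mean()
    deciles = np.array_split(y, n_deciles)          # near-equal chunks
    return [chunk.mean() / global_mean for chunk in deciles]

# Illustrative data: a useful model shows ratios well above 1 in the top deciles
print(decile_lift([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1],
                  [1,   1,   0,   1,   0,   0,   0,   0,   0,    0]))
```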
Decile-wise Lift Chart

(Figure: decile-wise lift chart on the training dataset; bars show "Decile mean / Global mean" for deciles 1 through 10.)

• The bars show the factor by which a DM classifier outperforms a random assignment of C1s, taking one decile at a time
• In the "most probable" (top) decile, the model is twice as likely to identify the important class (compared to the average prevalence)
Lift Charts vs Decile-wise Lift Charts

• Both lift chart and decile-wise lift chart embody the concept of
“moving down” through the records, starting with the most
probable
• Lift chart shows continuous cumulative results
 y-axis shows number of important class records identified
• Decile chart does this in decile chunks of data
 y-axis shows ratio of decile mean to overall mean
Adding Costs/Benefits to Lift Curve

1. Sort records in descending probability of success


2. For each case, record cost/benefit of actual outcome
3. Also record cumulative cost/benefit
4. Plot all records
 x-axis is index number (1 for 1st case, n for nth case)

 y-axis is cumulative cost/benefit

 Reference line from origin to yn ( yn = total net benefit)


Lift Curve May Go Negative

• If total net benefit from all cases is negative, reference line will
have negative slope
• Nonetheless, goal is still to use cutoff to select the point where net
benefit is at a maximum
Oversampling and Asymmetric Costs
Rare Cases

• Asymmetric costs/benefits typically go hand in hand with


presence of rare but important class
 Responder to mailing
 Someone who commits fraud
 Debt defaulter
• Often, we oversample rare cases to give model more information
to work with
• Typically use 50% C1 and 50% C2 for training
Example

Following graphs show the optimal classification under three scenarios:
 Assuming equal costs of misclassification
 Assuming that misclassifying "o" is five times the cost of misclassifying "x"
 An oversampling scheme allowing DM methods to incorporate asymmetric costs
Classification: Equal Misclassification Costs

Error: 1/8 = 12.5%

Classification: Unequal Misclassification Costs

Error: 2/8 = 25%

Oversampling Scheme

Error: 2/16 = 12.5%

Oversample "o" to appropriately weight misclassification costs

Classification When Response Rate is Very Low

• When classifying data with very low response rates, practitioners


typically:
 Train models on data that are 50% responder (C1), 50% nonresponder (C2)

 Validate the models with an unweighted (simple random) sample from the
original data
Oversampling the Training Set

1. Separate the C1 (rare class) from C2


2. Randomly assign half the C1 to the training sample, plus equal
number of C2
3. Remaining C1 go to validation sample
4. Add C2 to validation data, to maintain original ratio of C1 to C2
5. Randomly take test set (if needed) from validation set
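A minimal sketch of this partitioning scheme on class labels only (function name and the 5% example are our own):

```python
import numpy as np

def oversample_indices(y, rare_class=1, seed=0):
    """Index split following the scheme above: training = half the C1 pool plus an equal
    number of C2 cases (50/50 mix); validation = the other half of the C1 pool plus
    enough C2 cases to keep roughly the original C1:C2 ratio."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    idx_c1 = rng.permutation(np.flatnonzero(y == rare_class))
    idx_c2 = rng.permutation(np.flatnonzero(y != rare_class))

    half_c1 = len(idx_c1) // 2
    train_idx = np.concatenate([idx_c1[:half_c1], idx_c2[:half_c1]])
    # number of C2 cases that keeps the original ratio for the remaining C1 cases
    n_c2_valid = min(len(idx_c2) - half_c1,
                     round((len(idx_c1) - half_c1) * len(idx_c2) / len(idx_c1)))
    valid_idx = np.concatenate([idx_c1[half_c1:], idx_c2[half_c1:half_c1 + n_c2_valid]])
    return train_idx, valid_idx

# Example: 1000 records with a 5% rare class
y = np.array([1] * 50 + [0] * 950)
train_idx, valid_idx = oversample_indices(y)
print(len(train_idx), len(valid_idx))   # 50 training records (25 C1 + 25 C2), 500 validation
```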
Oversampling the Training Set

Available dataset: C1 pool (n1) and C2 pool (n2)

Training dataset:
• Half of the C1 pool (n1/2) + an equal number of C2 cases (n1/2)

Validation dataset:
• The other half of the C1 pool (n1/2) + C2 cases (n2/2)
Evaluating Model Performance Using a Non-oversampled Validation Set

• Although the oversampled data can be used to train models, they


are often not suitable for evaluating model performance because
the number of C1 will (of course) be exaggerated
• The most straightforward way of gaining an unbiased estimate of
model performance is to apply the model to regular data (i.e., data
not oversampled)
• In short: Train the model on oversampled data, but validate it
with regular data
Evaluating Model Performance If Only
Oversampled Validation Set Exists

• In some cases very low response rates may make it more practical to use
oversampled data for both the training data and the validation data
• This might happen, for example, if an analyst is given a dataset for exploration
and prototyping that is already oversampled to boost the proportion
• This requires the oversampled validation set to be reweighted, in order to restore
the class of observations that were underrepresented in the sampling process
• This adjustment should be made to the classification matrix and to the lift chart
in order to derive good accuracy measures
Adjusting the Confusion Matrix for Oversampling

• Suppose that the C1 rate in the data as a whole is 2%, and that the data were
oversampled, yielding a sample in which the C1 rate is 25 times higher (50% C1)
 C1 : 2% of the whole data; 50% of the sampled data

 C2 : 98% of the whole data, 50% of the sampled data

• Each C1 member in the whole data is worth 25 C1 members in the sample data
(50/2)
• Each nonresponder in the whole data is worth 0.5102 nonresponders in the sample
(50/98)
• We call these values oversampling weights. Assume that the validation
classification matrix looks like this:
Adjusting the Confusion Matrix for Oversampling

• Correction factor for the C1 class: f1 = (proportion of C1 in the oversampled data) / (proportion of C1 in the original data)
• Correction factor for the C2 class: f2 = (proportion of C2 in the oversampled data) / (proportion of C2 in the original data)
• Correction for the classification matrix:

                 Predicted Class C1   Predicted Class C2
Actual Class C1  n11 = m11 / f1        n12 = m12 / f1
Actual Class C2  n21 = m21 / f2        n22 = m22 / f2

• where m11, m12, m21, and m22 are computed from the model's performance on the oversampled validation data
Adjusting the Confusion Matrix for Oversampling

• Classification matrix, oversampled validation data:

            Predicted 1                Predicted 0                Total
Actual 1    m11 = 420                  m12 = 80                   500
Actual 0    m21 = 110                  m22 = 390                  500
Total       530                        470                        1000

• f1 = 50/2 = 25
• f2 = 50/98 = 0.5102

• Classification matrix, reweighted data:

            Predicted 1                Predicted 0                Total
Actual 1    n11 = 420/25 = 16.8        n12 = 80/25 = 3.2          20
Actual 0    n21 = 110/0.5102 = 215.6   n22 = 390/0.5102 = 764.4   980
Total       232.4                      767.6                      1000
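A sketch of this reweighting, reproducing the numbers from the example above:

```python
def reweight_confusion_matrix(m11, m12, m21, m22, p1_sample, p1_population):
    """Rescale an oversampled validation confusion matrix back to the original class mix.
    m11..m22 are counts on the oversampled validation set; p1_* are the C1 proportions."""
    f1 = p1_sample / p1_population                 # e.g. 0.50 / 0.02 = 25
    f2 = (1 - p1_sample) / (1 - p1_population)     # e.g. 0.50 / 0.98 = 0.5102
    n11, n12 = m11 / f1, m12 / f1
    n21, n22 = m21 / f2, m22 / f2
    error = (n12 + n21) / (n11 + n12 + n21 + n22)
    return (n11, n12, n21, n22), error

# Slide example: 2% responders in the population, 50% in the oversampled validation set
cells, error = reweight_confusion_matrix(420, 80, 110, 390, 0.50, 0.02)
print(cells)              # (16.8, 3.2, 215.6, 764.4) approximately
print(round(error, 3))    # ≈ 0.219
```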
Adjusting the Confusion Matrix for
Oversampling
• At this point the misclassification rate appears to be (80 + 110)/1000 = 19%, and
the model ends up classifying 53% of the records as 1's. However, this reflects the
performance on a sample where 50% are responders.
• To estimate predictive performance when this model is used to score the original
population (with 2% responders), we need to undo the effects of the
oversampling.
• The actual number of responders must be divided by 25, and the actual number
of nonresponders divided by 0.5102.
• The adjusted misclassification rate is (3.2 + 215.6)/1000 = 21.9%. The model ends
up classifying (215.6 + 16.8)/1000 = 23.24% of the records as 1's, when we assume
2% responders.
Adjusting the Lift Curve for Oversampling
1. Sort the validation records in order of the predicted probability of success (where success =
belonging to the class of interest).
2. For each record, record the cost (benefit) associated with the actual outcome.
3. Divide that value by the oversampling rate. For example, if responders are overweighted by a
factor of 25, divide by 25.
4. For the highest probability (i.e., first) record, the value above is the y-coordinate of the first point
on the lift chart. The x-coordinate is index number 1.
5. For the next record, again calculate the adjusted value associated with the actual outcome. Add
this to the adjusted cost (benefit) for the previous record. This sum is the y-coordinate of the
second point on the lift curve. The x-coordinate is index number 2.
6. Repeat step 5 until all records have been examined. Connect all the points, and this is the lift
curve.
7. The reference line is a straight line from the origin to the point y = total net benefit and x = n (n =
number of records).
Lift Chart for Predictive Performance

• In some applications the goal is to search, among a set of new records, for a
subset of records that gives the highest cumulative predicted values
• In such cases a graphical way to assess predictive performance is through a
lift chart
• Y axis is cumulative value of numeric target variable (e.g., revenue), and X
axis is the index of each case in sequence
Class Example

(The 24-record Class Example table shown earlier, sorted by propensity.)
Lift Chart for Predictive Performance

• A lift chart is built by ordering the set of records of interest (typically,


validation data) by their predicted response value, from high to low
• Plot their cumulative actual value on the y-axis as a function of the number
of records (x-axis value)
Lift Chart Construction Procedure

• Step 1: Order the set of records of interest (typically, validation data) by their predicted value $\hat{y}(i)$ in descending order
• Step 2: Compute the cumulative actual value up to each record i from top to bottom, $S(i) = \sum_{k=1}^{i} y(k)$, where i = 1, 2, …, n
• Step 3: Compute the cumulative average actual value up to record i from top to bottom, $A(i) = i\,\bar{y}$, where i = 1, 2, …, n and $\bar{y}$ is the global mean of the actual response values of all records
• Step 4: Plot S(i) versus i as well as A(i) versus i on the same graph
Lift Chart for Predictive Performance

• This curve is compared to assigning a naïve prediction ($\bar{y}$) to each record and accumulating these average values, which results in a diagonal line
• The farther away the lift curve is from the diagonal benchmark line, the better the model is doing in separating records with high-value outcomes from those with low-value outcomes
Decile-wise Lift Chart for Predictive Performance

• A decile-wise lift chart is like a lift chart
• In a decile-wise lift chart the ordered records are grouped into 10 deciles, and for each decile the chart presents the ratio of model lift to the naïve benchmark lift
Decile-wise Lift Chart Construction Procedure

• Step 1: Order the set of records of interest (typically, validation data) by their predicted value $\hat{y}(i)$ in descending order
• Step 2: Group the ordered records into 10 deciles, i.e., each decile includes 10% of the records
• Step 3: For each decile compute the mean of the actual response values of the records in the decile, $D = \frac{1}{n_d}\sum_{k} y(k)$, where k runs over the records in the decile and $n_d$ is the number of records in the decile
• Step 4: For each decile compute the lift as the ratio of the decile mean to the global mean, $D/\bar{y}$, where $\bar{y}$ is the global mean of the actual response values of all records
• Step 5: Plot the lifts of all deciles as a bar graph
Comparison of Lift and Decile-wise Lift Chart
for Predictive Performance
Summary

• Evaluation metrics are important for comparing across DM models, for


choosing the right configuration of a specific DM model, and for
comparing to the baseline
• Major metrics: confusion matrix, error rate, predictive error
• Other metrics when
 one class is more important
 asymmetric costs
• When important class is rare, use oversampling
• In all cases, metrics computed from validation data
