Topic 7: Data Mining
Chapter 7
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Introduction (Slide 1 of 2)
Introduction (Slide 2 of 2)
Data Sampling, Preparation, and Partitioning
Data Sampling, Preparation, and Partitioning (Slide 1 of 7)
Data Sampling, Preparation, and Partitioning (Slide 2 of 7)
Data Sampling, Preparation, and Partitioning (Slide 3 of 7)
Data Sampling, Preparation, and Partitioning (Slide 4 of 7)
Data Sampling, Preparation, and Partitioning (Slide 5 of 7)
k-Fold Cross-Validation
• k-Fold Cross-Validation: A robust procedure to train and validate models in which the observations used to train and validate the model are repeatedly and randomly divided into k subsets called folds. In each iteration, one fold is designated as the validation set and the remaining k − 1 folds are designated as the training set. The results of the iterations are then combined and evaluated.
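Below is a minimal sketch of k-fold cross-validation using scikit-learn (an illustrative tool choice, not the textbook's software; the data are synthetic):

```python
# A minimal sketch of k-fold cross-validation (illustrative; synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 divides the observations into k = 5 folds; each fold serves once
# as the validation set while the remaining k-1 folds form the training set.
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())  # per-fold results and their combined average
```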
Data Sampling, Preparation, and Partitioning (Slide 6 of 7)
k-Fold Cross-Validation
• A special case of k-fold cross-validation is leave-one-out cross-validation.
• In this case, the number of folds equals the number of observations in the combined training and validation data.
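A minimal sketch of leave-one-out cross-validation under the same illustrative assumptions:

```python
# A minimal sketch of leave-one-out cross-validation (illustrative tooling;
# synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# LeaveOneOut() builds n folds for n observations: each iteration holds
# out exactly one observation for validation.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(scores.mean())  # share of held-out observations classified correctly
```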
Data Sampling, Preparation, and Partitioning (Slide 7 of 7)
Performance Measures
Evaluating the Classification of Categorical Outcomes
Evaluating the Estimation of Continuous Outcomes
Performance Measures (Slide 1 of 19)
Evaluating the Classification of Categorical Outcomes:
• By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model's classification performance.
• Classification confusion matrix: Displays a model's correct and incorrect classifications.
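A minimal sketch of building a confusion matrix, assuming scikit-learn and hypothetical labels:

```python
# A minimal sketch of a classification confusion matrix (illustrative
# tooling; the actual and predicted labels are hypothetical).
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[n00, n01],   <- actual Class 0: correct, false positives
#  [n10, n11]]   <- actual Class 1: false negatives, correct
print(confusion_matrix(actual, predicted))
```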
Performance Measures (Slide 2 of 19)
Table 9.1: Confusion Matrix
Performance Measures (Slide 3 of 19)
Evaluating the Classification of Categorical Outcomes (cont.):
• One minus the overall error rate is often referred to as the accuracy of the model.
• While the overall error rate conveys an aggregate measure of misclassification, it counts misclassifying an actual Class 0 observation as a Class 1 observation (a false positive) the same as misclassifying an actual Class 1 observation as a Class 0 observation (a false negative).
Performance Measures (Slide 4 of 19)
Evaluating the Classification of Categorical Outcomes (cont.):
• To account for asymmetric costs of misclassification, we define the error rate with respect to the individual classes:
$$\text{Class 1 error rate} = \frac{n_{10}}{n_{11} + n_{10}}, \qquad \text{Class 0 error rate} = \frac{n_{01}}{n_{00} + n_{01}}$$
(where $n_{ij}$ is the number of actual Class $i$ observations classified as Class $j$)
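A minimal sketch of these error-rate calculations with hypothetical confusion-matrix counts:

```python
# A minimal sketch of overall and class-specific error rates from
# confusion-matrix counts (hypothetical values, following the n_ij
# notation: actual class i classified as class j).
n00, n01 = 60, 10   # actual Class 0: correct, false positives
n10, n11 = 5, 25    # actual Class 1: false negatives, correct

class1_error_rate = n10 / (n11 + n10)   # share of actual Class 1 missed
class0_error_rate = n01 / (n00 + n01)   # share of actual Class 0 missed
overall_error_rate = (n01 + n10) / (n00 + n01 + n10 + n11)
accuracy = 1 - overall_error_rate
print(class1_error_rate, class0_error_rate, accuracy)
```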
Performance Measures (Slide 5 of 19)
Table 9.2: Classification Probabilities

Actual Class | Probability of Class 1 | Actual Class | Probability of Class 1
1 | 1.00 | 0 | 0.66
1 | 1.00 | 0 | 0.65
0 | 1.00 | 1 | 0.64
1 | 1.00 | 0 | 0.62
0 | 1.00 | 0 | 0.60
0 | 0.90 | 0 | 0.51
1 | 0.90 | 0 | 0.49
0 | 0.88 | 0 | 0.49
0 | 0.88 | 1 | 0.46
1 | 0.88 | 0 | 0.46
Performance Measures (Slide 6 of 19)
Table 9.2: Classification Probabilities (cont.)

Actual Class | Probability of Class 1 | Actual Class | Probability of Class 1
0 | 0.87 | 1 | 0.45
0 | 0.87 | 1 | 0.45
0 | 0.87 | 0 | 0.45
0 | 0.86 | 0 | 0.44
1 | 0.86 | 0 | 0.44
0 | 0.86 | 0 | 0.30
0 | 0.86 | 0 | 0.28
0 | 0.85 | 0 | 0.26
0 | 0.84 | 1 | 0.24
0 | 0.84 | 0 | 0.22
Performance Measures (Slide 7 of 19)
Table 9.2: Classification Probabilities (cont.)

Actual Class | Probability of Class 1 | Actual Class | Probability of Class 1
0 | 0.83 | 0 | 0.21
0 | 0.68 | 0 | 0.04
0 | 0.67 | 0 | 0.04
0 | 0.67 | 0 | 0.01
0 | 0.67 | 0 | 0.00
Performance Measures (Slide 8 of 19)
Table 9.3: Confusion Matrices for Various Cutoff Values
Performance Measures (Slide 9 of 19)
Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values (cont.)
Performance Measures (Slide 10 of 19)
Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values (cont.)
Performance Measures (Slide 11 of 19)
Figure 9.1: Classification Error Rates vs. Cutoff Value
Performance Measures (Slide 12 of 19)
Evaluating the Classification of Categorical Outcomes (cont.):
• Cumulative lift chart: Compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 against the number of actual Class 1 observations identified when observations are randomly selected.
• Decile-wise lift chart: Another way to view how much better a classifier is at identifying Class 1 observations than random classification.
• Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups (see the sketch below).
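A minimal sketch of the decile-wise lift calculation, assuming hypothetical arrays of actual classes and model-estimated probabilities:

```python
# A minimal sketch of decile-wise lift (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
actual = rng.integers(0, 2, size=100)       # hypothetical actual classes
prob_class1 = rng.uniform(size=100)         # hypothetical model probabilities

order = np.argsort(-prob_class1)            # decreasing probability of Class 1
deciles = np.array_split(actual[order], 10) # 10 equal-sized groups
overall_rate = actual.mean()                # Class 1 rate under random selection

# Lift per decile: the decile's actual Class 1 rate / the overall rate
lift = [group.mean() / overall_rate for group in deciles]
print(np.round(lift, 2))
```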
Performance Measures (Slide 13 of 19)
Figure 9.2: Cumulative and Decile-Wise Lift Charts
Performance Measures (Slide 14 of 19)
Evaluating the Classification of Categorical Outcomes (cont.):
• The ability to correctly predict Class 1 (positive) observations is commonly expressed as sensitivity, or recall, and is calculated as:
$$\text{Sensitivity} = \frac{n_{11}}{n_{11} + n_{10}} = 1 - \text{Class 1 error rate}$$
Performance Measures (Slide 15 of 19)
Evaluating the Classification of Categorical Outcomes (cont.):
• Precision is the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:
$$\text{Precision} = \frac{n_{11}}{n_{11} + n_{01}}$$
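A minimal sketch of both measures, assuming scikit-learn and the same hypothetical labels as earlier:

```python
# A minimal sketch of sensitivity (recall) and precision (illustrative
# tooling; labels are hypothetical).
from sklearn.metrics import precision_score, recall_score

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

print(recall_score(actual, predicted))     # n11 / (n11 + n10) = 4/5
print(precision_score(actual, predicted))  # n11 / (n11 + n01) = 4/5
```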
Performance Measures (Slide 16 of 19)
Evaluating the Classification of Categorical Outcomes (cont.):
• The receiver operating characteristic (ROC) curve is an alternative graphical approach for displaying the tradeoff between a classifier's ability to correctly identify Class 1 observations and its Class 0 error rate.
• In general, we can evaluate the quality of a classifier by computing the area under the ROC curve, often referred to as the AUC.
• The greater the area under the ROC curve, i.e., the larger the AUC, the better the classifier performs (see the sketch below).
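A minimal sketch of the ROC curve and AUC, assuming scikit-learn and hypothetical probabilities:

```python
# A minimal sketch of an ROC curve and its AUC (illustrative tooling;
# the probabilities are hypothetical).
from sklearn.metrics import roc_auc_score, roc_curve

actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
prob_class1 = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.1, 0.3, 0.75]

# roc_curve returns one (false positive rate, true positive rate) point
# per candidate cutoff; the AUC is the area under that curve.
fpr, tpr, cutoffs = roc_curve(actual, prob_class1)
print(roc_auc_score(actual, prob_class1))
```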
Performance Measures (Slide 17 of 19)
Figure 9.3: Receiver Operating Characteristic (ROC) Curve
Performance Measures (Slide 18 of 19)
Evaluating the Estimation of Continuous Outcomes:
• The measures of accuracy are functions of the error $e_i$ in estimating the outcome for observation $i$.
• Two common measures are:
$$\text{Average error} = \frac{\sum_{i=1}^{n} e_i}{n}$$
$$\text{Root mean squared error (RMSE)} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n}}$$
• The average error estimates the bias in a model's predictions:
• If the average error is negative, then the model tends to overestimate the value of the outcome variable.
• If the average error is positive, the model tends to underestimate it.
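A minimal sketch of both measures, assuming error is measured as actual minus predicted (hypothetical values):

```python
# A minimal sketch of average error and RMSE (hypothetical values).
import numpy as np

actual = np.array([120.0, 80.0, 150.0, 95.0, 110.0])
predicted = np.array([130.0, 75.0, 160.0, 100.0, 105.0])

errors = actual - predicted
average_error = errors.mean()          # negative here => overestimation bias
rmse = np.sqrt((errors ** 2).mean())   # penalizes large errors more heavily
print(average_error, rmse)
```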
Performance Measures (Slide 19 of 19)
Table 9.4: Computed Error in Estimates of Average Balance for 10 Customers
Logistic Regression
Logistic Regression (Slide 1 of 8)
• Logistic regression attempts to classify a binary categorical outcome (y = 0 or 1) as a linear function of explanatory variables.
• A linear regression model fails to appropriately explain a categorical outcome variable.
Logistic Regression (Slide 2 of 8)
Figure 9.4: Scatter Chart and Simple Linear Regression Fit for Oscars Example
Logistic Regression (Slide 3 of 8)
Figure 9.5: Residuals for Simple Linear Regression on Oscars Data
• An unmistakable pattern of systematic misprediction suggests that the simple linear regression model is not appropriate.
Logistic Regression (Slide 4 of 8)
• Odds is a measure related to probability.
• If an estimate of the probability of an event is $\hat{p}$, then the equivalent odds measure is $\hat{p}/(1-\hat{p})$.
• For example, $\hat{p} = 0.75$ corresponds to odds of $0.75/0.25 = 3$, or "3 to 1."
Logistic Regression (Slide 5 of 8)
• Logistic regression model:
$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 x_1 + \cdots + b_q x_q$$
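A minimal sketch of fitting this model, assuming scikit-learn; the nomination/win data are hypothetical, not the textbook's Oscars data:

```python
# A minimal sketch of fitting a logistic regression (illustrative tooling;
# hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2], [4], [6], [8], [10], [12], [14]])  # e.g., Oscar nominations
y = np.array([0, 0, 0, 1, 0, 1, 1])                   # 1 = won, 0 = lost

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)   # b0 and b1 in the log-odds equation
print(model.predict_proba(X)[:, 1])    # estimated probability of Class 1
```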
Logistic Regression (Slide 6 of 8)
Figure 9.6: Logistic S-Curve for Oscars Example
Logistic Regression (Slide 7 of 8)
• Logistic regression classifies an observation by using the logistic function to compute the probability of the observation belonging to Class 1 and then comparing this probability to a cutoff value (see the sketch below).
• If the probability exceeds the cutoff value, the observation is classified as Class 1; otherwise it is classified as Class 0.
• While a logistic regression model used for prediction should ultimately be judged by its classification accuracy on validation and test sets, Mallow's Cp statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables.
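A minimal sketch of cutoff-based classification from estimated probabilities (hypothetical values; 0.5 is just one possible cutoff):

```python
# A minimal sketch of cutoff-based classification (hypothetical values).
import numpy as np

prob_class1 = np.array([0.89, 0.58, 0.44, 0.12])
cutoff = 0.5

# Probability above the cutoff => Class 1; otherwise Class 0.
predicted_class = (prob_class1 > cutoff).astype(int)
print(predicted_class)  # [1 1 0 0]
```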
Logistic Regression (Slide 8 of 8)

Total Number of Oscar Nominations | Predicted Probability of Winning | Predicted Class | Actual Class
14 | 0.89 | Winner | Winner
11 | 0.58 | Winner | Loser
10 | 0.44 | Loser | Loser
k-Nearest Neighbors
Classifying Categorical Outcomes with k-Nearest Neighbors
Estimating Continuous Outcomes with k-Nearest Neighbors
k-Nearest Neighbors (Slide 1 of 7)
• k-Nearest Neighbors (k-NN): This method can be used either to classify a categorical outcome or to estimate a continuous outcome.
• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
k-Nearest Neighbors (Slide 2 of 7)
Classifying Categorical Outcomes with k-Nearest Neighbors:
• A nearest-neighbor classifier is a "lazy learner" that directly uses the entire training set to classify observations in the validation and test sets.
• The value of k can plausibly range from 1 to n, the number of observations in the training set.
• If k = 1, then the classification of a new observation is simply the class of the single most similar observation in the training set (see the sketch below).
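A minimal sketch of k-NN classification, assuming scikit-learn and using the training data from Table 9.6 on the next slide; standardizing the features first is an assumption motivated by the very different scales of Average Balance and Age:

```python
# A minimal sketch of k-NN classification (illustrative tooling), using
# the (Average Balance, Age) training data from Table 9.6. Features are
# standardized because Euclidean distance is sensitive to scale.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = [[49, 38], [671, 26], [772, 47], [136, 48], [123, 40],
           [36, 29], [192, 31], [6574, 35], [2200, 58], [2100, 30]]
y_train = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # Loan Default

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
print(knn.predict([[900, 28]]))  # classify a new observation with k = 3
```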
k-Nearest Neighbors (Slide 3 of 7)
Table 9.6: Training Set Observations for k-NN Classifier

Observation | Average Balance | Age | Loan Default
1 | 49 | 38 | 1
2 | 671 | 26 | 1
3 | 772 | 47 | 1
4 | 136 | 48 | 1
5 | 123 | 40 | 1
6 | 36 | 29 | 0
7 | 192 | 31 | 0
8 | 6,574 | 35 | 0
9 | 2,200 | 58 | 0
10 | 2,100 | 30 | 0
Average | 1,285 | 38.2 |
Standard Deviation | 2,029 | 10.2 |
k-Nearest Neighbors (Slide 4 of 7)
Figure 9.7: Scatter Chart for k-NN Classification
k-Nearest Neighbors (Slide 5 of 7)
Table 9.7: Classification of Observation with Average Balance = 900 and Age = 28 for Different Values of k

k | % of Class 1 Neighbors | Classification
1 | 1.00 | 1
2 | 0.50 | 1
3 | 0.33 | 0
4 | 0.25 | 0
5 | 0.40 | 0
6 | 0.50 | 1
7 | 0.57 | 1
8 | 0.63 | 1
9 | 0.56 | 1
10 | 0.50 | 1
k-Nearest Neighbors (Slide 6 of 7)
Estimating Continuous Outcomes with k-Nearest Neighbors:
• When k-NN is used to estimate a continuous outcome, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set (see the sketch below).
• The value of k can plausibly range from 1 to n, the number of observations in the training set.
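A minimal sketch of k-NN estimation, assuming scikit-learn and using Age from Table 9.6 to estimate Average Balance, as Table 9.8's caption (next slide) suggests:

```python
# A minimal sketch of k-NN estimation of a continuous outcome
# (illustrative tooling), using Age from Table 9.6 to estimate
# Average Balance.
from sklearn.neighbors import KNeighborsRegressor

X_train = [[38], [26], [47], [48], [40], [29], [31], [35], [58], [30]]  # Age
y_train = [49, 671, 772, 136, 123, 36, 192, 6574, 2200, 2100]  # Avg Balance

# For Age = 28, the k = 3 nearest neighbors are ages 29, 26, and 30, so
# the estimate is (36 + 671 + 2100)/3, about $936, matching Table 9.8.
knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[28]]))
```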
k-Nearest Neighbors (Slide 7 of 7)
Table 9.8: Estimation of Average Balance for Observation with Age = 28 for Different Values of k

k | Average Balance Estimate
1 | $36
2 | $936
3 | $936
4 | $750
5 | $1,915
6 | $1,604
7 | $1,392
8 | $1,315
9 | $1,184
10 | $1,285
Classification and Regression Trees
Classifying Categorical Outcomes with a Classification Tree
Estimating Continuous Outcomes with a Regression Tree
Ensemble Methods
Classification and Regression Trees (Slide 1 of 20)
Classification and Regression Trees (Slide 2 of 20)
Classification and Regression Trees (Slide 3 of 20)
Classification and Regression Trees (Slide 4 of 20)
Figure 9.9: Construction Sequence of Branches in a Classification Tree
Classification and Regression Trees (Slide 5 of 20)
Classification and Regression Trees (Slide 6 of 20)
Figure 9.11: Classification Tree with One Pruned Branch
Classification and Regression Trees (Slide 7 of 20)
Table 9.9: Classification Error Rates on Sequence of Pruned Trees

Number of Decision Nodes | % Classification Error on Training Set | % Classification Error on Validation Set
0 | 43.5 | 39.4
1 | 8.7 | 20.9
2 | 8.7 | 20.9
3 | 8.7 | 20.9
4 | 6.5 | 20.9
5 | 4.3 | 21.3
6 | 2.2 | 21.3
7 | 0.0 | 21.6
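The pruning procedure behind Table 9.9 is not shown on these slides; below is a minimal sketch using scikit-learn's cost-complexity pruning, one common approach (not necessarily the textbook's exact procedure), on synthetic data:

```python
# A minimal sketch of growing a full classification tree and examining
# pruned versions via cost-complexity pruning (illustrative tooling;
# synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each ccp_alpha corresponds to a smaller (more pruned) tree; pick the
# tree with the lowest validation error, analogous to Table 9.9.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    print(alpha, 1 - tree.score(X_val, y_val))  # validation error rate
```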
Classification and Regression Trees (Slide 8 of 20)
Figure 9.12: Best-Pruned Classification Tree
Classification and Regression Trees (Slide 9 of 20)
Classification and Regression Trees (Slide 10 of 20)
Figure 9.13: Geometric Illustration of First Six Rules of a Regression Tree
Classification and Regression Trees (Slide 11 of 20)
Ensemble Methods:
• In an ensemble method, predictions are made based on the combination of a collection of models.
• Two necessary conditions for an ensemble to perform better than a single model (a bagging sketch follows this list):
1. Individual base models are constructed independently of each other.
2. Individual models perform better than just randomly guessing.
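A minimal sketch of a bagging ensemble of classification trees, assuming scikit-learn and synthetic data:

```python
# A minimal sketch of a bagging ensemble (illustrative tooling; synthetic
# data). BaggingClassifier's default base model is a classification tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 10 trees, each trained on an independent bootstrap sample of the
# training data; the ensemble classifies by majority vote.
ensemble = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(ensemble.predict(X[:5]))
```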
Classification and Regression Trees (Slide 12 of 20)
Classification and Regression Trees (Slide 13 of 20)
Classification and Regression Trees (Slide 14 of 20)
Classification and Regression Trees (Slide 15 of 20)
Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees
Classification and Regression Trees (Slide 16 of 20)
Classification and Regression Trees (Slide 17 of 20)
Classification and Regression Trees (Slide 18 of 20)
Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble

Age | 26 | 29 | 30 | 32 | 34 | 37 | 42 | 47 | 48 | 54 | Overall Classification Error Rate
Tree 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 30%
Tree 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40%
Tree 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 30%
Tree 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 30%
Tree 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 40%
Tree 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50%
Tree 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50%
Classification and Regression Trees (Slide 19 of 20)
Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble (cont.)

Age | 26 | 29 | 30 | 32 | 34 | 37 | 42 | 47 | 48 | 54 | Overall Error Rate
Loan Default (actual) | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
Tree 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50%
Tree 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50%
Tree 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40%
Average Vote | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.7 | 0.8 | 0.8 | 0.8 | 0.4 |
Bagging Ensemble | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 20%
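A minimal sketch of the ensemble's majority vote, using the average votes and actual Loan Default values from Table 9.12:

```python
# A minimal sketch of the bagging ensemble's majority vote (data taken
# from Table 9.12).
import numpy as np

average_vote = np.array([0.4, 0.4, 0.4, 0.4, 0.4, 0.7, 0.8, 0.8, 0.8, 0.4])
actual = np.array([1, 0, 0, 0, 0, 1, 0, 1, 1, 0])  # Loan Default row

# Classify 1 when more than half of the 10 trees vote for Class 1.
ensemble_class = (average_vote > 0.5).astype(int)
print(ensemble_class)                     # [0 0 0 0 0 1 1 1 1 0]
print((ensemble_class != actual).mean())  # 0.2, the 20% in Table 9.12
```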
Classification and Regression Trees (Slide 20 of 20)
End of Chapter 7