
Predictive Data Mining
Chapter 7

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Introduction (Slide 1 of 2)

• An observation, or record, is the set of recorded values of variables associated with a single entity.
• Supervised learning: Data mining methods for predicting an outcome based on a set of input variables, or features.
• Supervised learning can be used for:
  • Estimation of a continuous outcome.
  • Classification of a categorical outcome.
Introduction (Slide 2 of 2)

The data mining process comprises the following steps:
1. Data sampling.
2. Data preparation.
3. Data partitioning.
4. Model construction.
5. Model assessment.
Data Sampling, Preparation,
and Partitioning

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Sampling, Preparation, and Partitioning (Slide 1 of 7)

• When dealing with large volumes of data, best practice is to extract a representative sample for analysis.
• A sample is representative if the analyst can make the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.
• Data mining algorithms typically are more effective given more data.
Data Sampling, Preparation, and Partitioning (Slide 2 of 7)

• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.
• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.
• Data mining applications deal with an abundance of data, which simplifies the process of assessing the accuracy of data-based estimates of variable effects.
Data Sampling, Preparation, and Partitioning (Slide 3 of 7)

• Overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.
• We can use the abundance of data to guard against the potential for overfitting by splitting the data set into different subsets for:
  • The training (or construction) of candidate models.
  • The validation (or performance comparison) of candidate models.
  • The testing (or assessment) of future performance of a selected model.
Data Sampling, Preparation, and Partitioning (Slide 4 of 7)

Static Holdout Method
• Training set: Consists of the data used to build the candidate models.
• Validation set: The data set to which the promising subset of models is applied to identify which model is the most accurate at predicting observations that were not used to build the model.
• Test set: The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.
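A minimal sketch of a static holdout partition, assuming scikit-learn (a tooling choice; the slides are software-agnostic) and an illustrative 60/20/20 split on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)    # hypothetical binary outcome

# First carve off 60% for training, then split the remainder evenly
# into a 20% validation set and a 20% test set.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate models are built on the training set, compared on the
# validation set, and the chosen model is assessed once on the test set.
```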
Data Sampling, Preparation, and Partitioning (Slide 5 of 7)

k-Fold Cross-Validation
• k-Fold cross-validation: A robust procedure to train and validate models in which the observations to be used to train and validate the model are repeatedly randomly divided into k subsets called folds. In each iteration, one fold is designated as the validation set and the remaining k − 1 folds are designated as the training set. The results of the iterations are then combined and evaluated.
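A sketch of the procedure with scikit-learn's KFold on hypothetical data; setting n_splits equal to the number of observations would give the leave-one-out special case described on the next slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # hypothetical features
y = rng.integers(0, 2, size=100)     # hypothetical binary outcome

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # One fold is held out for validation; the other k - 1 folds train.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

combined_accuracy = np.mean(scores)  # combine the k iteration results
```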
Data Sampling, Preparation, and Partitioning (Slide 6 of 7)

k-Fold Cross-Validation (cont.)
• A special case of k-fold cross-validation is leave-one-out cross-validation.
• In this case, the number of folds equals the number of observations in the combined training and validation data.
Data Sampling, Preparation, and Partitioning (Slide 7 of 7)

Class-Imbalanced Data
• There are two basic sampling approaches for modifying the class distribution of the training set:
  • Undersampling: Balances the number of Class 1 and Class 0 observations in a training set by removing majority-class observations from the training set.
  • Oversampling: Balances the number of Class 1 and Class 0 observations in a training set by inserting copies of minority-class observations into the training set.
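Both approaches can be sketched directly with NumPy index manipulation (dedicated libraries such as imbalanced-learn exist, but the plain version shows the idea; the labels below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 10 + [0] * 90)          # hypothetical imbalanced labels
class1 = np.flatnonzero(y == 1)            # minority-class indices
class0 = np.flatnonzero(y == 0)            # majority-class indices

# Undersampling: keep a random subset of majority-class rows equal in
# size to the minority class.
keep0 = rng.choice(class0, size=class1.size, replace=False)
undersampled_idx = np.concatenate([class1, keep0])

# Oversampling: duplicate randomly chosen minority-class rows until the
# two classes are the same size.
extra1 = rng.choice(class1, size=class0.size, replace=True)
oversampled_idx = np.concatenate([extra1, class0])
```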
Performance Measures
Evaluating the Classification of Categorical Outcomes
Evaluating the Estimation of Continuous Outcomes
Performance Measures (Slide 1 of 19)

Evaluating the Classification of Categorical Outcomes:
• By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model's classification performance.
• Classification confusion matrix: Displays a model's correct and incorrect classifications.
Performance Measures (Slide 2 of 19)

Table 9.1: Confusion Matrix

• Many measures of classification performance are based on the confusion matrix.
• Overall error rate: Percentage of misclassified observations:

$$\text{Overall error rate} = \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}$$
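A sketch of the count notation in code, assuming scikit-learn and hypothetical labels; passing labels=[1, 0] makes the matrix entries line up with the $n_{11}, n_{10}, n_{01}, n_{00}$ notation above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

actual    = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])   # hypothetical
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])   # hypothetical

# Rows are actual classes, columns are predicted classes, ordered [1, 0].
(n11, n10), (n01, n00) = confusion_matrix(actual, predicted, labels=[1, 0])

overall_error_rate = (n10 + n01) / (n11 + n10 + n01 + n00)
accuracy = 1 - overall_error_rate
```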
Performance Measures (Slide 3 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• One minus the overall error rate is often referred to as the accuracy of the model.
• While the overall error rate conveys an aggregate measure of misclassification, it counts misclassifying an actual Class 0 observation as a Class 1 observation (a false positive) the same as misclassifying an actual Class 1 observation as a Class 0 observation (a false negative).
Performance Measures (Slide 4 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• To account for asymmetric costs of misclassification, we define the error rate with respect to the individual classes:

$$\text{Class 1 error rate} = \frac{n_{10}}{n_{11} + n_{10}} \qquad \text{Class 0 error rate} = \frac{n_{01}}{n_{01} + n_{00}}$$

• Cutoff value: The probability threshold above which an observation is classified as Class 1; varying the cutoff reveals the tradeoff between the Class 1 error rate and the Class 0 error rate.
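The effect of the cutoff can be sketched with NumPy on hypothetical predicted probabilities; lowering the cutoff classifies more observations as Class 1, trading a lower Class 1 error rate for a higher Class 0 error rate:

```python
import numpy as np

actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])               # hypothetical
prob_class1 = np.array([0.95, 0.80, 0.75, 0.60, 0.55,
                        0.45, 0.40, 0.30, 0.20, 0.10])          # hypothetical

for cutoff in (0.25, 0.50, 0.75):
    pred = (prob_class1 > cutoff).astype(int)
    class1_error = np.mean(pred[actual == 1] == 0)   # n10 / (n11 + n10)
    class0_error = np.mean(pred[actual == 0] == 1)   # n01 / (n01 + n00)
    print(f"cutoff={cutoff}: Class 1 error={class1_error:.2f}, "
          f"Class 0 error={class0_error:.2f}")
```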
Performance Measures (Slide 5 of 19)

Table 9.2: Classification Probabilities

Actual Class   Probability of Class 1     Actual Class   Probability of Class 1
1              1.00                       0              0.66
1              1.00                       0              0.65
0              1.00                       1              0.64
1              1.00                       0              0.62
0              1.00                       0              0.60
0              0.90                       0              0.51
1              0.90                       0              0.49
0              0.88                       0              0.49
0              0.88                       1              0.46
1              0.88                       0              0.46
Performance Measures (Slide 6 of 19)

Table 9.2: Classification Probabilities (cont.)

Actual Class   Probability of Class 1     Actual Class   Probability of Class 1
0              0.87                       1              0.45
0              0.87                       1              0.45
0              0.87                       0              0.45
0              0.86                       0              0.44
1              0.86                       0              0.44
0              0.86                       0              0.30
0              0.86                       0              0.28
0              0.85                       0              0.26
0              0.84                       1              0.24
0              0.84                       0              0.22
Performance Measures (Slide 7 of 19)

Table 9.2: Classification Probabilities (cont.)

Actual Class   Probability of Class 1     Actual Class   Probability of Class 1
0              0.83                       0              0.21
0              0.68                       0              0.04
0              0.67                       0              0.04
0              0.67                       0              0.01
0              0.67                       0              0.00
Performance Measures (Slide 8 of 19)

Table 9.3: Confusion Matrices for Various Cutoff Values
Performance Measures (Slide 9 of 19)

Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values (cont.)
Performance Measures (Slide 10 of 19)

Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values (cont.)
Performance Measures (Slide 11 of 19)

Figure 9.1: Classification Error Rates vs. Cutoff Value
Performance Measures (Slide 12 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• Cumulative lift chart: Compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 to the number of actual Class 1 observations identified by random selection.
• Decile-wise lift chart: Another way to view how much better a classifier is at identifying Class 1 observations than random classification.
  • Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups.
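The quantity plotted by a cumulative lift chart can be sketched with NumPy (hypothetical data; the plotting itself is omitted):

```python
import numpy as np

actual = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])               # hypothetical
prob_class1 = np.array([0.90, 0.85, 0.80, 0.70, 0.60,
                        0.55, 0.40, 0.35, 0.30, 0.10])          # hypothetical

order = np.argsort(prob_class1)[::-1]        # decreasing estimated probability
cumulative_found = np.cumsum(actual[order])  # actual Class 1 cases found so far
random_baseline = np.arange(1, actual.size + 1) * actual.mean()
lift = cumulative_found / random_baseline    # > 1 means better than random
```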
Performance Measures (Slide 13 of 19)

Figure 9.2: Cumulative and Decile-Wise Lift Charts
Performance Measures (Slide 14 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• The ability to correctly predict Class 1 (positive) observations is commonly expressed as sensitivity, or recall, and is calculated as:

$$\text{Sensitivity} = 1 - \text{Class 1 error rate} = \frac{n_{11}}{n_{11} + n_{10}}$$

• The ability to correctly predict Class 0 (negative) observations is commonly expressed as specificity and is calculated as:

$$\text{Specificity} = 1 - \text{Class 0 error rate} = \frac{n_{00}}{n_{01} + n_{00}}$$
Performance Measures (Slide 15 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• Precision is a measure that corresponds to the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:

$$\text{Precision} = \frac{n_{11}}{n_{11} + n_{01}}$$

• The F1 score combines precision and sensitivity into a single measure and is defined as:

$$F_1\ \text{score} = \frac{2 n_{11}}{2 n_{11} + n_{01} + n_{10}}$$
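All four measures follow directly from the confusion-matrix counts; a sketch with hypothetical counts (scikit-learn's precision_score, recall_score, and f1_score compute the same quantities from labels):

```python
# Hypothetical confusion-matrix counts.
n11, n10, n01, n00 = 40, 10, 20, 130

sensitivity = n11 / (n11 + n10)          # recall: 1 - Class 1 error rate
specificity = n00 / (n01 + n00)          # 1 - Class 0 error rate
precision   = n11 / (n11 + n01)          # predicted Class 1 that are Class 1
f1 = 2 * n11 / (2 * n11 + n01 + n10)     # combines precision and sensitivity

# Equivalently, F1 is the harmonic mean of precision and sensitivity.
assert abs(f1 - 2 * precision * sensitivity / (precision + sensitivity)) < 1e-12
```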
Performance Measures (Slide 16 of 19)

Evaluating the Classification of Categorical Outcomes (cont.):
• The receiver operating characteristic (ROC) curve is an alternative graphical approach for displaying the tradeoff between a classifier's ability to correctly identify Class 1 observations and its Class 0 error rate.
• In general, we can evaluate the quality of a classifier by computing the area under the ROC curve, often referred to as the AUC.
• The greater the area under the ROC curve, i.e., the larger the AUC, the better the classifier performs.
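A sketch of computing the ROC curve and AUC with scikit-learn on hypothetical data; the false positive rate returned by roc_curve is exactly the Class 0 error rate at each cutoff:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])               # hypothetical
prob_class1 = np.array([0.90, 0.80, 0.70, 0.65, 0.50,
                        0.45, 0.40, 0.30, 0.20, 0.10])          # hypothetical

# tpr is sensitivity and fpr is the Class 0 error rate at each cutoff.
fpr, tpr, cutoffs = roc_curve(actual, prob_class1)
auc = roc_auc_score(actual, prob_class1)    # area under the ROC curve
```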
Performance Measures (Slide 17 of 19)

Figure 9.3: Receiver Operating Characteristic (ROC) Curve
Performance Measures (Slide 18 of 19)

Evaluating the Estimation of Continuous Outcomes:
• The measures of accuracy are some function of the error in estimating an outcome for an observation i, where $e_i$ is the error in estimating the outcome for observation i.
• Two common measures are:

$$\text{Average error} = \frac{\sum_{i=1}^{n} e_i}{n} \qquad \text{Root mean squared error (RMSE)} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n}}$$

• The average error estimates the bias in a model's predictions:
  • If the average error is negative, then the model tends to overestimate the value of the outcome variable.
  • If the average error is positive, the model tends to underestimate.
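Both measures are one line each in NumPy; the sign convention $e_i = \text{actual} - \text{estimated}$ (an assumption consistent with the bias interpretation above) makes a negative average error indicate overestimation:

```python
import numpy as np

actual    = np.array([120.0, 80.0, 200.0, 150.0])   # hypothetical outcomes
estimated = np.array([110.0, 95.0, 180.0, 160.0])   # hypothetical estimates

errors = actual - estimated              # e_i for each observation i
average_error = errors.mean()            # negative => model overestimates
rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error
```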
Performance Measures (Slide 19 of 19)

Table 9.4: Computed Error in Estimates of Average Balance for 10 Customers
Logistic Regression
Logistic Regression (Slide 1 of 8)

• Logistic regression attempts to classify a binary categorical outcome (y = 0 or 1) as a linear function of explanatory variables.
• A linear regression model fails to appropriately explain a categorical outcome variable.
Logistic Regression (Slide 2 of 8)

Figure 9.4: Scatter Chart and Simple Linear Regression Fit for Oscars Example
Logistic Regression (Slide 3 of 8)

Figure 9.5: Residuals for Simple Linear Regression on Oscars Data
• An unmistakable pattern of systematic misprediction suggests that the simple linear regression model is not appropriate.
Logistic Regression (Slide 4 of 8)

• Odds is a measure related to probability. If an estimate of the probability of an event is $\hat{p}$, then the equivalent odds measure is $\hat{p}/(1-\hat{p})$.
• The odds metric ranges between zero and positive infinity.
• We eliminate the fit problem by using the logit, $\ln\left(\hat{p}/(1-\hat{p})\right)$.
• Estimating the logit with a linear function results in the estimated logistic regression model.
Logistic Regression (Slide 5 of 8)

• Logistic regression model:

$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 x_1 + \cdots + b_q x_q$$

• Given a set of explanatory variables, a logistic regression algorithm determines the values of $b_0, b_1, \ldots, b_q$ that best estimate the log odds.
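A sketch of fitting and applying such a model with scikit-learn; the nomination counts and win labels below are hypothetical stand-ins for the Oscars example, not the textbook's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: x = total Oscar nominations, y = 1 if the film won.
x = np.array([[14], [11], [10], [6], [5], [12], [4], [9]])
y = np.array([1, 0, 0, 1, 0, 1, 0, 0])

model = LogisticRegression().fit(x, y)
b0 = model.intercept_[0]                       # estimated intercept of the logit
b1 = model.coef_[0][0]                         # estimated coefficient of the logit
prob_win = model.predict_proba([[11]])[0, 1]   # estimated P(Class 1) at x = 11
```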
Logistic Regression (Slide 6 of 8)

Figure 9.6: Logistic S-Curve for Oscars Example
Logistic Regression (Slide 7 of 8)

• Logistic regression classifies an observation by using the logistic function to compute the probability of the observation belonging to Class 1 and then comparing this probability to a cutoff value.
  • If the probability exceeds the cutoff value, the observation is classified as Class 1; otherwise, it is classified as Class 0.
• While a logistic regression model used for prediction should ultimately be judged based on its classification accuracy on validation and test sets, Mallows' $C_p$ statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables.
Logistic Regression (Slide 8 of 8)

Table 9.5: Predicted Probabilities by Logistic Regression for Oscars Example

Total Number of      Predicted Probability
Oscar Nominations    of Winning               Predicted Class   Actual Class
14                   0.89                     Winner            Winner
11                   0.58                     Winner            Loser
10                   0.44                     Loser             Loser
6                    0.07                     Loser             Winner
k-Nearest Neighbors
Classifying Categorical Outcomes with k-Nearest Neighbors
Estimating Continuous Outcomes with k-Nearest Neighbors
k-Nearest Neighbors (Slide 1 of 7)

• k-Nearest neighbors (k-NN): This method can be used either to classify a categorical outcome or to estimate a continuous outcome.
• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.
k-Nearest Neighbors (Slide 2 of 7)

Classifying Categorical Outcomes with k-Nearest Neighbors:
• A nearest-neighbor classifier is a "lazy learner" that directly uses the entire training set to classify observations in the validation and test sets.
• The value of k can plausibly range from 1 to n, the number of observations in the training set.
  • If k = 1, then the classification of a new observation is set equal to the class of the single most similar observation from the training set.
  • If k = n, then the new observation's class is naïvely assigned to the most common class in the training set.
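A sketch with scikit-learn on data patterned after the loan-default example in Table 9.6 on the next slide. The features are standardized first so that Euclidean distance is not dominated by the much larger balance scale (the mean and standard deviation reported in the table suggest this step, though scikit-learn's scaling details may differ slightly from the text's):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Training data patterned on Table 9.6: (average balance, age) -> default.
X = np.array([[49, 38], [671, 26], [772, 47], [136, 48], [123, 40],
              [36, 29], [192, 31], [6574, 35], [2200, 58], [2100, 30]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Standardize so Euclidean distance weights both variables comparably.
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

# Classify the new observation with balance 900 and age 28 (cf. Table 9.7).
new_obs = scaler.transform(np.array([[900, 28]]))
predicted_class = knn.predict(new_obs)[0]
```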
k-Nearest Neighbors (Slide 3 of 7)

Table 9.6: Training Set Observations for k-NN Classifier

Observation            Average Balance   Age    Loan Default
1                      49                38     1
2                      671               26     1
3                      772               47     1
4                      136               48     1
5                      123               40     1
6                      36                29     0
7                      192               31     0
8                      6,574             35     0
9                      2,200             58     0
10                     2,100             30     0
Average:               1,285             38.2
Standard Deviation:    2,029             10.2
k-Nearest Neighbors (Slide 4 of 7)

Figure 9.7: Scatter Chart for k-NN Classification
k-Nearest Neighbors (Slide 5 of 7)

Table 9.7: Classification of Observation with Average Balance = 900 and Age = 28 for Different Values of k

k     % of Class 1 Neighbors   Classification
1     1.00                     1
2     0.50                     1
3     0.33                     0
4     0.25                     0
5     0.40                     0
6     0.50                     1
7     0.57                     1
8     0.63                     1
9     0.56                     1
10    0.50                     1
k-Nearest Neighbors (Slide 6 of 7)

Estimating Continuous Outcomes with k-Nearest Neighbors:
• When k-NN is used to estimate a continuous outcome, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set.
• The value of k can plausibly range from 1 to n, the number of observations in the training set.

Figure 9.8: Scatter Chart for k-NN Estimation
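A sketch with scikit-learn's KNeighborsRegressor, estimating average balance from age using the Table 9.6 data; with k = 10 the estimate is simply the overall mean of $1,285, agreeing with the last row of Table 9.8 (tie-breaking among equidistant neighbors could differ from the text for some intermediate k):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Table 9.6 data: estimate average balance from age for a customer aged 28.
age = np.array([[38], [26], [47], [48], [40], [29], [31], [35], [58], [30]])
balance = np.array([49, 671, 772, 136, 123, 36, 192, 6574, 2200, 2100])

for k in (1, 5, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(age, balance)
    # The estimate is the mean balance of the k nearest ages to 28.
    estimate = knn.predict(np.array([[28]]))[0]
    print(f"k={k}: estimated average balance = ${estimate:,.0f}")
```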
k-Nearest Neighbors (Slide 7 of 7)

Table 9.8: Estimated Average Balance for Observation with Age = 28 for Different Values of k

k     Average Balance Estimate
1     $36
2     $936
3     $936
4     $750
5     $1,915
6     $1,604
7     $1,392
8     $1,315
9     $1,184
10    $1,285
Classification and Regression Trees
Classifying Categorical Outcomes with a Classification Tree
Estimating Continuous Outcomes with a Regression Tree
Ensemble Methods
Classification and Regression Trees (Slide 1 of 20)

• Classification and regression trees (CART) successively partition a data set of observations into increasingly smaller and more homogeneous subsets.
• At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable.
• The CART method can be thought of as a series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity, which is a measure of the heterogeneity in a group of observations' outcome classes or outcome values.
Classification and Regression Trees (Slide 2 of 20)

Classifying Categorical Outcomes with a Classification Tree:
• Classification trees: The impurity of a group of observations is based on the proportion of observations belonging to the same class.
  • There is zero impurity if all observations in a group are in the same class.
• After a final tree is constructed, the classification of a new observation is based on the final partition into which the new observation belongs.
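A sketch of growing and applying a classification tree with scikit-learn; the two features mimic the DemoHHI example on the next slide (percentage of $ and ! characters), but the data here are randomly generated placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in: two features (% of '$' and '!' characters).
X = rng.uniform(0, 5, size=(46, 2))
y = (X[:, 0] + rng.normal(0, 1, size=46) > 2.5).astype(int)

# Each split uses a single variable to reduce impurity (Gini index here);
# limiting depth is one simple guard against overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# A new observation is assigned the class of its final partition (leaf).
predicted_class = tree.predict(np.array([[1.2, 0.4]]))[0]
```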
Classification and Regression Trees (Slide 3 of 20)

Classifying a Categorical Outcome with a Classification Tree (cont.):
• To explain how a classification tree categorizes observations:
  • We use a small sample of data from DemoHHI consisting of 46 observations.
  • Only two variables from HHI are used—the percentage of the $ character and the percentage of the ! character.
Classification and Regression Trees (Slide 4 of 20)

Figure 9.9: Construction Sequence of Branches in a Classification Tree
Classification and Regression Trees (Slide 5 of 20)

Figure 9.10: Geometric Illustration of Classification Tree Partitions
• The figure shows the final partitioning resulting from the sequence of variable splits.
Classification and Regression Trees (Slide 6 of 20)

Figure 9.11: Classification Tree with One Pruned Branch
Classification and Regression Trees (Slide 7 of 20)

Table 9.9: Classification Error Rates on Sequence of Pruned Trees

Number of        % Classification Error   % Classification Error
Decision Nodes   on Training Set          on Validation Set
0                43.5                     39.4
1                8.7                      20.9
2                8.7                      20.9
3                8.7                      20.9
4                6.5                      20.9
5                4.3                      21.3
6                2.2                      21.3
7                0                        21.6
Classification and Regression Trees (Slide 8 of 20)

Figure 9.12: Best-Pruned Classification Tree
Classification and Regression Trees (Slide 9 of 20)

Estimating Continuous Outcomes with a Regression Tree:
• A regression tree successively partitions observations of the training set into smaller and smaller groups in a similar fashion as a classification tree.
• The differences are:
  • A regression tree bases the impurity of a partition on the variance of the outcome value for the observations in the group.
  • After a final tree is constructed, the estimated outcome value of an observation is based on the mean outcome value of the partition into which the new observation belongs.
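The regression-tree variant in scikit-learn, sketched on hypothetical data; splits reduce the within-partition variance of the outcome, and each leaf predicts its partition's mean:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))             # hypothetical features
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)  # hypothetical continuous outcome

# The "squared_error" criterion makes each split reduce the variance of y
# within the resulting partitions; a leaf predicts its group's mean outcome.
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)
estimate = tree.predict(np.array([[4.0, 7.5]]))[0]
```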
Classification and Regression Trees (Slide 10 of 20)

Figure 9.13: Geometric Illustration of First Six Rules of a Regression Tree
Classification and Regression Trees (Slide 11 of 20)

Ensemble Methods:
• In an ensemble method, predictions are made based on the combination of a collection of models.
• Two necessary conditions for an ensemble to perform better than a single model:
  1. Individual base models are constructed independently of each other.
  2. Individual models perform better than just randomly guessing.
Classification and Regression Trees (Slide 12 of 20)

Ensemble Methods (cont.):
• Two primary steps to an ensemble approach:
  1. The development of a committee of individual base models.
  2. The combination of the individual base models' predictions to form a composite prediction.
• A classification or estimation method is unstable if relatively small changes in the training set cause its predictions to fluctuate.
• Three different ways to construct an ensemble of classification or regression trees:
  • Bagging.
  • Boosting.
  • Random forests.
Classification and Regression Trees (Slide 13 of 20)

Ensemble Methods (cont.):
• In the bagging approach, the committee of individual base models is generated by first constructing multiple training sets by repeated random sampling of the n observations in the original data with replacement.

Table 9.10: Original 10-Observation Training Data

Age            29  31  35  38  47  48  53  54  58  70
Loan default    0   0   0   1   1   1   1   0   0   0
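The bootstrap-sampling step of bagging is easy to sketch with NumPy using the Table 9.10 data; each pass draws n = 10 observations with replacement, so some rows appear multiple times and others not at all:

```python
import numpy as np

rng = np.random.default_rng(0)
age = np.array([29, 31, 35, 38, 47, 48, 53, 54, 58, 70])   # Table 9.10
default = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

m, n = 10, age.size
for tree_number in range(m):
    # Sample n row indices with replacement to form one new training set.
    idx = rng.integers(0, n, size=n)
    boot_age, boot_default = age[idx], default[idx]
    # A base classification tree would be fit to (boot_age, boot_default)
    # here; the m trees' predictions are later combined by majority vote.
```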
Classification and Regression Trees (Slide 14 of 20)

Ensemble Methods (cont.):
• The boosting method generates its committee of individual base models by sampling multiple training sets.
  • Boosting iteratively adapts how it samples the original data when constructing a new training set based on the prediction error of the models constructed on the previous training sets.
• Random forests can be viewed as a variation of bagging specifically tailored for use with classification or regression trees.
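Both ensembles are available off the shelf in scikit-learn; the sketch below uses hypothetical data, with RandomForestClassifier for the bagging-plus-random-feature-subset idea and GradientBoostingClassifier as one common boosting implementation (the text does not prescribe a specific library):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                                # hypothetical
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, size=300) > 0).astype(int)

# Random forest: bagging plus a random subset of variables at each split;
# the trees are independent, so they can be built simultaneously.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting: trees are built sequentially, each one focusing on the
# observations the previous trees mispredicted.
booster = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
```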
Classification and Regression Trees (Slide 15 of 20)

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees
Classification and Regression Trees (Slide 16 of 20)

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees (cont.)
Classification and Regression Trees (Slide 17 of 20)

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees (cont.)
Classification and Regression Trees (Slide 18 of 20)

Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble

                                                          Overall Classification
Age             26   29   30   32   34   37   42   47   48   54   Error Rate
Loan default     1    0    0    0    0    1    0    1    1    0
Tree 1           0    0    0    0    0    1    1    1    1    1   30%
Tree 2           0    0    0    0    0    0    0    0    0    0   40%
Tree 3           0    0    0    0    0    1    1    1    1    1   30%
Tree 4           0    0    0    0    0    1    1    1    1    1   30%
Tree 5           0    0    0    0    0    0    1    1    1    1   40%
Tree 6           1    1    1    1    1    1    1    1    1    0   50%
Tree 7           1    1    1    1    1    1    1    1    1    0   50%
Classification and Regression Trees (Slide 19 of 20)

Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble (cont.)

                                                          Overall Classification
Age                26   29   30   32   34   37   42   47   48   54   Error Rate
Loan default        1    0    0    0    0    1    0    1    1    0
Tree 8              1    1    1    1    1    1    1    1    1    0   50%
Tree 9              1    1    1    1    1    1    1    1    1    0   50%
Tree 10             0    0    0    0    0    0    0    0    0    0   40%
Average Vote      0.4  0.4  0.4  0.4  0.4  0.7  0.8  0.8  0.8  0.4
Bagging Ensemble    0    0    0    0    0    1    1    1    1    0   20%
Classification and Regression Trees (Slide 20 of 20)

Ensemble Methods (cont.):
• For most problems, the predictive accuracy of boosting ensembles exceeds the predictive performance of bagging ensembles.
• Boosting achieves its performance advantage because:
  • It evolves its committee of models by focusing on observations that are mispredicted.
  • The member models' votes are weighted by their accuracy.
• Boosting is more computationally expensive than bagging.
• There is no adaptive feedback in a bagging approach, so all m training sets and corresponding models can be constructed simultaneously.
• The random forests approach has performance similar to boosting, but maintains the computational simplicity of bagging.
End of Chapter 7
