Classification Tree

Context
[Diagram: in a classification problem, a model maps the inputs X1, X2, X3 to a qualitative output Y, producing predicted class labels or class probabilities.]
Default Data Set
A data set on ten thousand customers.
Variables
• Default: A factor with levels “No” and “Yes” indicating whether the customer
defaulted on their debt.
• Student: A factor with levels “No” and “Yes” indicating whether the customer is a
student.
• Balance: The average balance that the customer has remaining on their credit card
after making their monthly payment.
• Income: Income of the customer.
Default Dataset
[Diagram: the classification model maps the inputs Balance, Income, and Student to the predicted Default status, as a class label or a class probability.]
Objective
• Inference: the relationship between the output (i.e., default) and the input variables (i.e., balance, income, student).
• Prediction: whether an individual will default on his or her credit card payment.
Classification Tree
• A classification tree is very similar to a regression tree, except that we try to predict a categorical response rather than a continuous one.
Classification Tree Output
• In a regression tree, the predicted response for an observation is given by the average response of the training observations that belong to the same terminal node.
• In a classification tree, we predict that each observation belongs to the most commonly occurring class of the training observations in the region to which it belongs.
Algorithm
• The tree is grown in the same manner as a regression tree.
• However, in a classification tree, minimizing the MSE no longer makes sense.
• A natural alternative is the classification error rate: the fraction of the training observations in a region that do not belong to the most common class.
• Several other criteria are available as well, such as the Gini index and cross-entropy.
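As a sketch of how these criteria compare, the snippet below (plain Python, not tied to any particular library, and using a toy region rather than the Default data) computes all three from the class proportions p_k of a region: error rate = 1 − max p_k, Gini = Σ p_k(1 − p_k), cross-entropy = −Σ p_k log p_k.

```python
import math

def class_proportions(labels):
    """Proportion of each class among the observations in a region."""
    n = len(labels)
    return [labels.count(c) / n for c in sorted(set(labels))]

def error_rate(p):
    """Classification error rate: fraction not in the most common class."""
    return 1 - max(p)

def gini(p):
    """Gini index: sum_k p_k * (1 - p_k)."""
    return sum(pk * (1 - pk) for pk in p)

def cross_entropy(p):
    """Cross-entropy (deviance): -sum_k p_k * log(p_k)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# A toy region with 8 "No" and 2 "Yes" observations:
p = class_proportions(["No"] * 8 + ["Yes"] * 2)   # [0.8, 0.2]
print(round(error_rate(p), 4))     # 0.2
print(round(gini(p), 4))           # 0.32
print(round(cross_entropy(p), 4))  # 0.5004
```

Note that the Gini index and cross-entropy are more sensitive to node purity than the error rate, which is why they are usually preferred when growing the tree.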
Steps (Default Data Set)
1. Divide the data set into two parts: one part to be used for training the model and the other part to test it.
2. Build a large tree on the training data set.
3. Prune the tree to improve accuracy.
4. Check the performance of the pruned tree on the test data set.
Step 1
We have data from 10,000 customers in total.
We randomly split the observations into two parts: a training set containing 8,000 observations and a test set containing 2,000 observations.
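A minimal sketch of such a random split, using only the Python standard library (the fixed seed is an assumption added so the split is reproducible):

```python
import random

def train_test_split(n_obs, n_train, seed=1):
    """Randomly partition observation indices into a training and a test set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    indices = list(range(n_obs))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split(10_000, 8_000)
print(len(train_idx), len(test_idx))   # 8000 2000
```

The two index lists can then be used to select the training and test rows of the Default data set.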
Step 2
Build a large tree on the
training data set.
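The core operation in growing the tree is the recursive split search. As a sketch (toy data, not the actual Default data set): for a single numeric predictor, find the threshold that minimizes the weighted Gini index of the two resulting regions. A real tree repeats this over all predictors and all current regions.

```python
def region_gini(labels):
    """Gini index of a region: sum_k p_k * (1 - p_k) over its class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n)
               for c in set(labels))

def best_split(x, y):
    """Return (threshold, weighted_gini) of the best binary split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best = (None, float("inf"))
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                          # no boundary between equal values
        thr = (xs[i - 1] + xs[i]) / 2         # midpoint candidate threshold
        left, right = ys[:i], ys[i:]
        score = (len(left) * region_gini(left)
                 + len(right) * region_gini(right)) / len(ys)
        if score < best[1]:
            best = (thr, score)
    return best

# Toy example: default tends to occur at high balances.
balance = [500, 800, 1200, 1800, 2000]
default = ["No", "No", "No", "Yes", "Yes"]
thr, score = best_split(balance, default)
print(thr, score)   # 1500.0 0.0 -- a perfect split on this toy data
```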
Step 3:
Tree Pruning
Step 3:
Pruned Tree
Step 4: Compute the Test Error

                            True Default Status
                            No      Yes     Total
Predicted        No         1932    36      1968
Default Status   Yes        12      20      32
                 Total      1944    56      2000

Classification Error Rate = (12 + 36) / 2000 = 0.024
Sensitivity = 20 / 56 = 0.357
Specificity = 1932 / 1944 = 0.9938
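The three test-set metrics above can be reproduced directly from the cells of the confusion matrix (plain Python, using the counts from the slide):

```python
# Confusion matrix for the pruned tree (rows: predicted, columns: true):
tp, fn = 20, 36      # true "Yes": 20 predicted Yes, 36 predicted No
tn, fp = 1932, 12    # true "No": 1932 predicted No, 12 predicted Yes
total = tp + fn + tn + fp

error_rate  = (fp + fn) / total   # fraction of misclassified observations
sensitivity = tp / (tp + fn)      # fraction of true "Yes" correctly flagged
specificity = tn / (tn + fp)      # fraction of true "No" correctly cleared

print(round(error_rate, 4))    # 0.024
print(round(sensitivity, 3))   # 0.357
print(round(specificity, 4))   # 0.9938
```

Note the trade-off: overall error is low, but sensitivity is modest because defaulters ("Yes") are rare in the data.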
Logistic Regression Model
Step 1
We have data from 10,000 customers in total.
We randomly split the observations into two parts: a training set containing 8,000 observations and a test set containing 2,000 observations.
Step 2
Build a logistic regression model on the training data set.

               Coeff.      Std. Error   z-stat    p-value
Intercept      −11.1300    0.5551       −20.04    <0.0001
Balance          0.0057    0.0002        22.59    <0.0001
Income           0.0000    0.0000         1.04     0.2985
Student[Yes]    −0.5406    0.2658        −2.03     0.0419
Step 3: Build the Refined Logistic Regression Model
• Use the stepwise method with AIC as the model selection criterion.

               Coeff.      Std. Error   z-stat    p-value
Intercept      −10.7400    0.4062       −26.44    <0.0001
Balance          0.0057    0.0002        22.61    <0.0001
Student[Yes]    −0.7565    0.1645        −4.59    <0.0001
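The refined model can be read off directly: P(Default = Yes) = 1 / (1 + exp(−η)) with η = −10.74 + 0.0057·Balance − 0.7565·Student[Yes]. A small sketch using the coefficients from the table (the example balance of $2,000 is an illustrative assumption, not a value from the slides):

```python
import math

# Coefficients of the refined logistic regression model (from the table above).
intercept = -10.74
b_balance = 0.0057
b_student = -0.7565     # applies when Student = "Yes"

def default_probability(balance, student):
    """P(Default = Yes) under the fitted model: 1 / (1 + exp(-eta))."""
    eta = intercept + b_balance * balance + (b_student if student else 0.0)
    return 1 / (1 + math.exp(-eta))

# A non-student vs. a student, both with a $2,000 balance:
print(round(default_probability(2000, student=False), 3))   # 0.659
print(round(default_probability(2000, student=True), 3))    # 0.476
```

At the same balance, students are predicted to be less likely to default, matching the negative Student[Yes] coefficient.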
Step 4: Compute the Test Error

                            True Default Status
                            No      Yes     Total
Predicted        No         1939    40      1979
Default Status   Yes        5       16      21
                 Total      1944    56      2000

Classification Error Rate = (5 + 40) / 2000 = 0.0225
Sensitivity = 16 / 56 = 0.2857
Specificity = 1939 / 1944 = 0.9974
Model Comparison
• Classification Tree: ROC curve for the test data, AUC = 0.9159
• Logistic Regression: ROC curve for the test data, AUC = 0.9374
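AUC has a direct interpretation: the probability that the model scores a randomly chosen defaulter above a randomly chosen non-defaulter. A minimal stdlib sketch of this pairwise computation (toy scores, not the actual test-set predictions):

```python
def auc(scores, labels, positive="Yes"):
    """AUC as the fraction of (positive, negative) pairs the model
    ranks correctly; ties count as half a correct ranking."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted default probabilities and true labels:
scores = [0.10, 0.40, 0.35, 0.80]
labels = ["No", "No", "Yes", "Yes"]
print(auc(scores, labels))   # 0.75
```

Unlike the error rate, AUC does not depend on a particular classification cutoff, which makes it a fairer way to compare the tree and the logistic regression model here.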
Which model is better?

Trees vs. Linear Models
◦ If the relationship between the predictors and the response is linear, then classical linear models such as linear regression would outperform regression trees.
◦ On the other hand, if the relationship between the predictors and the response is non-linear, then decision trees would outperform classical approaches.
Trees vs. Linear Models: Classification Example
Top row: the true decision boundary is linear
◦ Left: linear model (better)
◦ Right: decision tree
Bottom row: the true decision boundary is non-linear
◦ Left: linear model
◦ Right: decision tree (better)
Advantages and Disadvantages of Decision Trees

Advantages:
◦ Trees are very easy to explain to people (even easier than linear regression).
◦ Trees can be plotted graphically, and hence can be easily communicated even to a non-expert.
◦ They work well for both classification and regression problems.

Disadvantages:
◦ Trees generally do not have the same prediction accuracy as some of the more flexible approaches available in practice.
Assignment
Consider the case “HR Analytics at ScaleneWorks - Behavioural Modelling to
predict Renege.” Fit an appropriate classification tree model for the data set
provided with the case. Compare its performance with the logistic regression
model.
Reading Material
• James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to
Statistical Learning: with Applications in R. New York: Springer-Verlag. (web:
http://www-bcf.usc.edu/~gareth/ISL/).
✓ Chapter 8: Sub-sections 8.1.2, 8.1.3, 8.3.1.