
Classification Tree

Context
Classification Problem

[Diagram: inputs 𝑋1, 𝑋2, 𝑋3 feed a classification model; the qualitative output 𝑌 consists of class labels and class probabilities.]
Default Data Set
A data set on ten thousand customers.

Variables

• Default: A factor with levels “No” and “Yes” indicating whether the customer
defaulted on their debt.
• Student: A factor with levels “No” and “Yes” indicating whether the customer is a
student.
• Balance: The average balance that the customer has remaining on their credit card
after making their monthly payment.
• Income: The customer's income.
Default Dataset

[Diagram: Balance, Income, and Student feed the classification model; the output is the predicted default status, given as class labels and class probabilities.]
Objective
• Inference: the relationship between the output (i.e., default) and the input variables (i.e., balance, income, student).
• Prediction: whether an individual will default on his or her credit card payment.
Classification Tree
• A classification tree is very similar to a regression tree, except that we try to make a prediction for a categorical response rather than a continuous one.
Classification Tree Output
• In a regression tree, the predicted response for an observation is given by the average response of the training observations that belong to the same terminal node.
• In a classification tree, we predict that each observation belongs to the most commonly occurring class of the training observations in the region to which it belongs.
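As a toy illustration (not from the slides), the majority-vote rule for a terminal node takes only a few lines of Python:

```python
from collections import Counter

def node_prediction(labels):
    """Return the most commonly occurring class among the training
    observations that fall into a terminal node."""
    return Counter(labels).most_common(1)[0][0]

# A node containing 7 "No" and 3 "Yes" training observations predicts "No".
print(node_prediction(["No"] * 7 + ["Yes"] * 3))
```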
Algorithm
• The tree is grown in the same manner as for a regression tree.
• However, for a classification tree, minimizing the MSE no longer makes sense.
• A natural alternative is the classification error rate: the fraction of the training observations in a region that do not belong to the most common class.
• Several other criteria are available as well, such as the Gini index and cross-entropy (see the formulas below).
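Concretely, for a node m, writing p̂_mk for the proportion of training observations in the node that come from class k, the three criteria are (standard definitions, following ISL):

```latex
\begin{aligned}
E_m &= 1 - \max_k \hat{p}_{mk} && \text{(classification error rate)} \\
G_m &= \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}) && \text{(Gini index)} \\
D_m &= -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} && \text{(cross-entropy)}
\end{aligned}
```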
Steps
Default Data Set
1. Divide the data set into two parts: one part to train the model and the other part to test it.
2. Build a large tree on the training data set.
3. Prune the tree to improve accuracy.
4. Check the performance of the pruned tree on the test data set.
Step 1

In total, we have data on 10,000 customers.

We randomly split the observations into two parts: a training set containing 8,000 observations and a test set containing 2,000 observations.
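A minimal sketch of this split in Python with scikit-learn, assuming the Default data has been read from a hypothetical Default.csv export (the file name and column names are assumptions, not from the slides):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file and columns: default, student, balance, income (10,000 rows).
default = pd.read_csv("Default.csv")

X = pd.get_dummies(default[["student", "balance", "income"]], drop_first=True)
y = default["default"]

# 8,000 training observations and 2,000 test observations, chosen at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=8000, test_size=2000, random_state=1
)
```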
Step 2
Build a large tree on the
training data set.
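One way to grow a deliberately large tree in scikit-learn (a sketch; the slides do not specify the software used):

```python
from sklearn.tree import DecisionTreeClassifier

# With no depth limit, splitting continues until the leaves are (nearly)
# pure, which yields the large, overgrown tree that Step 3 prunes back.
large_tree = DecisionTreeClassifier(criterion="gini", random_state=1)
large_tree.fit(X_train, y_train)
print(large_tree.get_n_leaves())
```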
Step 3: Tree Pruning
Step 3: Pruned Tree
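In scikit-learn, pruning is done via cost-complexity pruning: candidate values of the penalty alpha come from the large tree, and cross-validation on the training set picks one. A sketch, continuing from the objects above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas along the cost-complexity pruning path of the large tree.
path = large_tree.cost_complexity_pruning_path(X_train, y_train)

# 5-fold cross-validation over alpha selects the pruned tree.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
)
search.fit(X_train, y_train)
pruned_tree = search.best_estimator_
```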
Step 4: Compute the Test Error

                           True Default Status
Predicted Default Status     No    Yes   Total
No                         1932     36    1968
Yes                          12     20      32
Total                      1944     56    2000

Classification Error Rate = (12 + 36) / 2000 = 0.024
Step 4: Compute the Test Error

                           True Default Status
Predicted Default Status     No    Yes   Total
No                         1932     36    1968
Yes                          12     20      32
Total                      1944     56    2000

Sensitivity = 20 / 56 = 0.357
Step 4: Compute the Test Error

                           True Default Status
Predicted Default Status     No    Yes   Total
No                         1932     36    1968
Yes                          12     20      32
Total                      1944     56    2000

Specificity = 1932 / 1944 = 0.9938
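All three test-set measures come straight from the confusion matrix. A sketch, continuing from the pruned tree above (the slide's figures appear in the comments for reference; an independent run will not reproduce them exactly):

```python
from sklearn.metrics import confusion_matrix

y_pred = pruned_tree.predict(X_test)

# Rows are true classes, columns predicted classes, ordered ["No", "Yes"],
# so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=["No", "Yes"]).ravel()

error_rate  = (fp + fn) / (tn + fp + fn + tp)  # slide: (12 + 36) / 2000 = 0.024
sensitivity = tp / (tp + fn)                   # slide: 20 / 56 = 0.357
specificity = tn / (tn + fp)                   # slide: 1932 / 1944 = 0.9938
print(error_rate, sensitivity, specificity)
```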
Logistic Regression Model

Step 1

In total, we have data on 10,000 customers.

We randomly split the observations into two parts: a training set containing 8,000 observations and a test set containing 2,000 observations.
Step 2
Build a logistic regression model on the training data set.

               Coeff      Std. Error   z-stat   p-value
Intercept      −11.1300   0.5551       −20.04   <0.0001
Balance          0.0057   0.0002        22.59   <0.0001
Income           0.0000   0.0000         1.04    0.2985
Student[Yes]    −0.5406   0.2658        −2.03    0.0419
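A sketch of the same fit with statsmodels, which reports the kind of coefficient table shown above (coefficients, standard errors, z-statistics, p-values); column names follow the assumed DataFrame from Step 1:

```python
import statsmodels.api as sm

# Code the response as 1 for "Yes" (default) and add an intercept column.
y_train_bin = (y_train == "Yes").astype(int)
X_train_const = sm.add_constant(X_train.astype(float))

logit_full = sm.Logit(y_train_bin, X_train_const).fit()
print(logit_full.summary())
```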


Step 3: Build the Refined Logistic Regression Model
• Use the stepwise method with AIC as the model selection criterion.

               Coeff      Std. Error   z-stat   p-value
Intercept      −10.7400   0.4062       −26.44   <0.0001
Balance          0.0057   0.0002        22.61   <0.0001
Student[Yes]    −0.7565   0.1645        −4.59   <0.0001
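statsmodels has no built-in stepwise procedure, but AIC is exposed directly; a minimal sketch of the one step that matters here is to drop the insignificant income term and confirm that the AIC improves (continuing from the fit above):

```python
import statsmodels.api as sm

# Refit without "income" and compare AIC values; keep the lower one.
reduced_cols = [c for c in X_train_const.columns if c != "income"]
logit_reduced = sm.Logit(y_train_bin, X_train_const[reduced_cols]).fit()
print(logit_full.aic, logit_reduced.aic)
```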
Step 4: Compute the Test Error

                           True Default Status
Predicted Default Status     No    Yes   Total
No                         1939     40    1979
Yes                           5     16      21
Total                      1944     56    2000

Classification Error Rate = (5 + 40) / 2000 = 0.0225
Step 4: Compute the Test Error

                           True Default Status
Predicted Default Status     No    Yes   Total
No                         1939     40    1979
Yes                           5     16      21
Total                      1944     56    2000

Sensitivity = 16 / 56 = 0.2857
Step 4: Compute the Test Error

                           True Default Status
Predicted Default Status     No    Yes   Total
No                         1939     40    1979
Yes                           5     16      21
Total                      1944     56    2000

Specificity = 1939 / 1944 = 0.9974
Model Comparison
• Classification Tree, ROC for test data: AUC = 0.9159
• Logistic Regression, ROC for test data: AUC = 0.9374
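A sketch of the AUC comparison on the test data, continuing from the two fitted models above (the slide's AUC values are in the comments; an independent run will differ slightly):

```python
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

y_test_bin = (y_test == "Yes").astype(int)

# Probability of the "Yes" (default) class under each model.
yes_col = list(pruned_tree.classes_).index("Yes")
p_tree = pruned_tree.predict_proba(X_test)[:, yes_col]

X_test_const = sm.add_constant(X_test.astype(float))
p_logit = logit_reduced.predict(X_test_const[reduced_cols])

print("Tree AUC:    ", roc_auc_score(y_test_bin, p_tree))   # slide: 0.9159
print("Logistic AUC:", roc_auc_score(y_test_bin, p_logit))  # slide: 0.9374
```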
Which model is better?

Trees vs. Linear Models
◦ If the relationship between the predictors and the response is linear, then classical linear models such as linear regression would outperform regression trees.
◦ On the other hand, if the relationship between the predictors and the response is non-linear, then decision trees would outperform the classical approaches.
Trees vs. Linear Model: Classification Example
Top row: the true decision boundary is linear.
◦ Left: linear model (better)
◦ Right: decision tree
Bottom row: the true decision boundary is non-linear.
◦ Left: linear model
◦ Right: decision tree (better)
Advantages and Disadvantages of Decision Trees

Advantages:
◦ Trees are very easy to explain to people (even easier than linear regression).
◦ Trees can be plotted graphically, and hence can be easily communicated even to a non-expert.
◦ They work well for both classification and regression problems.

Disadvantages:
◦ Trees do not have the same predictive accuracy as some of the more flexible approaches available in practice.
Assignment
Consider the case “HR Analytics at ScaleneWorks - Behavioural Modelling to
predict Renege.” Fit an appropriate classification tree model for the data set
provided with the case. Compare its performance with the logistic regression
model.
Reading Material
• James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to
Statistical Learning: with Applications in R. New York: Springer-Verlag. (web:
http://www-bcf.usc.edu/~gareth/ISL/).
✓ Chapter 8: Sub-sections 8.1.2, 8.1.3, 8.3.1.
