Decision Trees
Yudi Agusta, PhD
Data Mining, Lecture 07
Copyright © Yudi Agusta, PhD, 2021
Lecture’s Structure
◼ Unsupervised, Supervised and Semi-supervised Classification
◼ Definition of Decision Trees
◼ Methods for
◼ Finding a consistent decision tree
◼ Finding a good decision tree
◼ Overfitting Problem
◼ Pruning Method
Supervised, Unsupervised, Semi-Supervised
◼ Supervised classification:
◼ Needs guidance from the data (class labels)
◼ Each record in the dataset analysed has already been classified
◼ Needs a large amount of data
◼ Unsupervised classification:
◼ Does not need guidance from the data
◼ Records in the dataset analysed are not classified in advance
◼ A large amount of data is not necessarily needed
◼ Semi-supervised classification:
◼ Needs the user to set one or more parameters at the beginning of the classification process
◼ Example: Clustering (fuzzy c-means)
Supervised, Unsupervised, Semi-Supervised
◼ Supervised Classification
◼ Neural Networks (NN)
◼ Decision Trees
◼ Naïve Bayes
◼ Support Vector Machine (SVM)
◼ Bayesian Networks (BN)
◼ Unsupervised Classification
◼ Clustering (Mixture Modelling)
◼ Self Organising Map (SOM)
◼ Semi-supervised Classification
◼ Analytical Hierarchy Process (AHP)
◼ Simple Additive Model (SAM)
Decision Trees
◼ Given an event, predict its category.
Examples:
◼ Who will win a given ball game?
◼ How should we file a given e-mail?
◼ Event = list of features. Examples:
◼ Ball game: Who is the goalkeeper?
◼ E-mail: Who sent the e-mail?
Decision Trees
◼ Each data in the training data has its own category
◼ Use training data to build the decision tree
◼ Use a decision tree model to predict categories of a
new event
Training Events & Categories → Decision Tree Model;  New Event → Decision Tree Model → Category
Decision Trees
◼ A decision tree is a tree where:
◼ Each interior node is labeled with a feature
◼ Each arc out of an interior node is labeled with a feature
value for the node’s feature
◼ Each leaf is labeled with a category
Jumlah ART
  <1.5 → Jenis Lantai:  tanah → Miskin;  semen/keramik → Tidak Miskin
  <3.5 → WC:  bersama → Miskin;  sendiri → Tidak Miskin
  >3.5 → Luas Lantai:  <20.5 → Miskin;  >20.5 → Tidak Miskin
An Example of Training Data
Jumlah ART Jenis Lantai WC Luas Lantai Category
5 Tanah Sendiri 15 Miskin
3 Tanah Sendiri 20 Tidak Miskin
4 Semen Sendiri 50 Tidak Miskin
1 Semen Bersama 40 Tidak Miskin
1 Tanah Sendiri 30 Miskin
2 Semen Bersama 40 Miskin
8 Semen Bersama 18 Miskin
Finding A Consistent Decision Tree
◼ Too many decision trees to try them all
◼ Build a decision tree top-down, using
recursive partitioning, as follows:
◼ If all of the events have the same category:
◼ Create a leaf with that category
◼ Else
◼ Pick a feature, and create a node for that feature
◼ For each feature value:
◼ Go back to the first step to build a sub-tree for the events with that feature value (a sketch of this procedure follows below)
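A minimal Python sketch of this recursive-partitioning procedure, assuming events are dicts mapping feature names to values and that a choose_feature helper (for example, the information-gain criterion from the later slides) picks the split feature. It handles categorical values only, and all names are illustrative rather than any library's API.

```python
from collections import Counter

def build_tree(events, categories, features, choose_feature):
    """Recursively partition (events, categories) into a decision tree.

    events         : list of dicts, feature name -> feature value
    categories     : list of category labels, one per event
    features       : feature names still available for splitting
    choose_feature : function(events, categories, features) -> feature name
    """
    # If all events have the same category, create a leaf with that category.
    if len(set(categories)) == 1:
        return {"leaf": categories[0]}
    # If no features remain, fall back to the majority category.
    if not features:
        return {"leaf": Counter(categories).most_common(1)[0][0]}

    # Otherwise pick a feature and create an interior node for it.
    f = choose_feature(events, categories, features)
    node = {"feature": f, "children": {}}

    # For each feature value, recurse on the events having that value.
    for value in set(e[f] for e in events):
        pairs = [(e, c) for e, c in zip(events, categories) if e[f] == value]
        sub_events = [e for e, _ in pairs]
        sub_cats = [c for _, c in pairs]
        remaining = [g for g in features if g != f]
        node["children"][value] = build_tree(sub_events, sub_cats,
                                             remaining, choose_feature)
    return node
```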
Creating a Decision Tree
◼ Use training data consisting of events and
categories to create a new decision tree
◼ What properties do we want the decision tree
to have?
◼ It should be consistent with the training data
◼ Many decision trees are consistent
◼ It should be simple
◼ Occam’s Razor
◼ Simple models tend to generalise better
◼ In simpler decision trees, each node is based on more
data (on average)
Finding a Good Decision Tree
◼ What feature should we split on?
◼ We want a small decision tree
◼ Pick the feature that gives us the most information about
what the category should be
◼ 20 Questions
◼ I am thinking of a number from 1 to 1000
◼ You can ask yes/no questions
◼ What is the first question you would ask?
◼ On average, some questions give you more information than
others:
◼ Is it 752? Or Is it prime? Or Is it between 1 and 500?
◼ Pick a question that provides the most information
Entropy
◼ Entropy provides a measure of how much we know about the
category or how much information we gain
◼ On average, how many more yes/no questions do we need to ask
to determine the category?
◼ The more we know, the lower the entropy
◼ The entropy of a set of events/data E is H(E):
H(E) = - Σ_{c ∈ C} P(c) log2 P(c)
◼ Where P(c) is the probability that an event in E has category c
◼ Entropy is the average amount of information we expect to gain when the category of an event is revealed
Entropy
◼ Example of Entropy calculation
H(E) = - Σ_{c ∈ C} P(c) log2 P(c)
◼ For the training data above, H(E) is calculated as follows (these slides use base-10 logarithms; with log2 the value is about 0.99 bits):
◼ H(E) = - [P(tidak miskin) log P(tidak miskin) + P(miskin) log P(miskin)]
◼ H(E) = - [3/7 log 3/7 + 4/7 log 4/7]
◼ H(E) = - [0.43 x (-0.37) + 0.57 x (-0.24)]
◼ H(E) = 0.16 + 0.14 = 0.30
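As a quick check of the numbers above, a short entropy sketch in Python (illustrative only; the category list is taken from the training-data table):

```python
import math
from collections import Counter

def entropy(categories, base=2):
    """H(E) = -sum_{c in C} P(c) * log P(c)."""
    counts = Counter(categories)
    total = len(categories)
    return -sum((n / total) * math.log(n / total, base)
                for n in counts.values())

cats = ["Miskin", "Tidak Miskin", "Tidak Miskin", "Tidak Miskin",
        "Miskin", "Miskin", "Miskin"]
print(entropy(cats, base=10))  # ~0.30, as above (base-10 logs)
print(entropy(cats, base=2))   # ~0.99 bits with base-2 logs
```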
Information Gain
◼ How much information does a feature give us
about what the category should be?
◼ H(E)=entropy of event set E
◼ H(E|f)=expected entropy of event set E, once we
know the value of feature f.
◼ G(E,f)=H(E)-H(E|f)=the amount of new
information provided by feature f
◼ Split on the feature that maximises
information gain
Information Gain
◼ The Information Gain for the variable ‘Jenis Lantai’ is calculated as follows (base-10 logarithms again, and each value's entropy weighted by P(v)):
◼ H(E|f) = P(tanah) x H(E|tanah) + P(semen) x H(E|semen)
◼ H(E|tanah) = - [P(tidak miskin|tanah) log P(tidak miskin|tanah) + P(miskin|tanah) log P(miskin|tanah)]
◼ = - [1/3 log 1/3 + 2/3 log 2/3] = 0.16 + 0.12 = 0.28
◼ H(E|semen) = - [1/2 log 1/2 + 1/2 log 1/2] = 0.15 + 0.15 = 0.30
◼ H(E|f) = 3/7 x 0.28 + 4/7 x 0.30 = 0.12 + 0.17 = 0.29
◼ G(E,f) = H(E) - H(E|f) = 0.30 - 0.29 = 0.01
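The same calculation in a short Python sketch, reusing the entropy helper and the cats list from the previous sketch (all names are illustrative):

```python
def conditional_entropy(values, categories, base=10):
    """H(E|f): per-value entropies weighted by P(v)."""
    total = len(values)
    h = 0.0
    for v in set(values):
        cats_v = [c for x, c in zip(values, categories) if x == v]
        h += (len(cats_v) / total) * entropy(cats_v, base=base)
    return h

def information_gain(values, categories, base=10):
    """G(E, f) = H(E) - H(E|f)."""
    return entropy(categories, base=base) - conditional_entropy(values, categories, base=base)

# 'Jenis Lantai' column of the training data, in row order.
jenis_lantai = ["Tanah", "Tanah", "Semen", "Semen", "Tanah", "Semen", "Semen"]
print(information_gain(jenis_lantai, cats, base=10))  # ~0.006
```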
Information Gain
◼ The variables ‘Jumlah ART’ and ‘Luas Lantai’, on the other hand, are continuous, so a continuous distribution such as the Normal distribution has to be used to obtain P(c|v)
◼ For the variable ‘Jumlah ART’: H(E|f) = P(tidak miskin|3) log P(tidak miskin|3) + ...
◼ For the variable ‘Luas Lantai’: H(E|f) = P(tidak miskin|20) log P(tidak miskin|20) + ...
◼ Where P(tidak miskin|3) and P(tidak miskin|20) are calculated using the Normal distribution
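One possible way to obtain such a conditional probability from a Normal distribution, sketched under the assumption that one Gaussian is fitted per category and combined with Bayes' rule; this only illustrates the idea, and reuses the cats list from the earlier sketches — it is not necessarily the exact procedure intended here.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a Normal(mean, std) distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def p_category_given_value(x, values, categories, target):
    """Estimate P(target | x) by fitting one Gaussian per category (Bayes' rule)."""
    posteriors = {}
    for c in set(categories):
        xs = [v for v, cc in zip(values, categories) if cc == c]
        mean = sum(xs) / len(xs)
        std = (sum((v - mean) ** 2 for v in xs) / len(xs)) ** 0.5 or 1e-6
        posteriors[c] = (len(xs) / len(values)) * gaussian_pdf(x, mean, std)
    return posteriors[target] / sum(posteriors.values())

# 'Jumlah ART' column of the training data, in row order.
jumlah_art = [5, 3, 4, 1, 1, 2, 8]
print(p_category_given_value(3, jumlah_art, cats, "Tidak Miskin"))
```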
Information Gain Ratio
◼ But: information gain is biased towards features that have many
values
◼ Maximum information gain for a feature depends on the number of feature values:
◼ Max information gain for a binary feature is 1 bit
◼ Max information gain for a feature with 1024 values is 10 bits
◼ Use information gain ratio to remove this bias:
GR(E, f) = G(E, f) / ( - Σ_{v ∈ V} P(v) log2 P(v) )
◼ Where v ranges over V, the set of all values of the feature considered in the calculation
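A matching sketch for the gain ratio, dividing the gain by the split information of the feature's values (reusing math, Counter and the helpers from the earlier sketches; illustrative only):

```python
def split_information(values, base=2):
    """-sum_{v in V} P(v) * log P(v): entropy of the feature's value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())

def gain_ratio(values, categories, base=2):
    """GR(E, f) = G(E, f) / split information of f."""
    return information_gain(values, categories, base=base) / split_information(values, base=base)

print(gain_ratio(jenis_lantai, cats, base=2))
```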
Overfitting
◼ Decision tree may encode idiosyncrasies of the
training data.
◼ E.g., consider this sub-tree:
Jumlah ART
  <1.5 → Jenis Lantai:  tanah → Miskin;  semen/keramik → Miskin
  <3.5 → WC:  bersama → Miskin;  sendiri → Miskin
  >3.5 → Luas Lantai:  <20.5 → Miskin;  >20.5 → Tidak Miskin
◼ How much should we trust the “Tidak Miskin” leaf, if it is based on just one training sample?
Overfitting (2)
◼ Fully consistent decision trees tend to overfit
◼ Especially if the decision tree is big
◼ Or if there is not much training data
◼ Trade-off full consistency for compactness:
◼ Larger decision trees can be more consistent
◼ Smaller decision trees generalise better
Pruning
◼ We can reduce overfitting by reducing the
size of the decision tree:
◼ 1. Create the complete decision tree
◼ 2. Discard any unreliable leaves
◼ 3. Repeat step 2 until the decision tree is compact enough
◼ Which leaves are unreliable?
◼ Leaves with low counts?
◼ When should we stop?
◼ No more unreliable leaves?
Test Data
◼ How can we tell if we’re overfitting?
◼ 1. Use a subset of the training data to train the model (the “training set”)
◼ 2. Use the rest of the training data to test for overfitting (the “test set”)
◼ 3. For each leaf:
◼ Test the performance of the decision tree on the test set without that leaf
◼ If the performance improves, discard the leaf
◼ Repeat step 3 until no more leaves are discarded (a sketch of this loop follows below)
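A simplified reduced-error-pruning sketch over the dict-based tree built earlier. It relies on the predict helper sketched under "Using A Decision Tree" below, prunes bottom-up using only the test events that reach each node, and (for brevity) labels a collapsed node with the majority category of those test events; a fuller implementation would normally take that label from the training data.

```python
def prune(tree, test_events, test_categories):
    """Collapse a subtree into a leaf whenever that does not hurt test accuracy."""
    if "leaf" in tree or not test_events:
        return tree
    f = tree["feature"]
    # Prune the children first, each with the test events that reach it.
    for value, child in tree["children"].items():
        pairs = [(e, c) for e, c in zip(test_events, test_categories)
                 if e.get(f) == value]
        tree["children"][value] = prune(child, [e for e, _ in pairs],
                                        [c for _, c in pairs])
    # Try replacing this whole subtree by a single majority leaf.
    majority = Counter(test_categories).most_common(1)[0][0]
    leaf_acc = sum(c == majority for c in test_categories) / len(test_categories)
    tree_acc = sum(predict(tree, e) == c
                   for e, c in zip(test_events, test_categories)) / len(test_categories)
    return {"leaf": majority} if leaf_acc >= tree_acc else tree
```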
Using A Decision Tree
◼ Given a new event (= list of feature
values):
◼ Start at the root
◼ At each interior node, follow the outgoing
arc for the feature value that matches our
event
◼ When we reach a leaf node, return its
category as the result of classification
using the decision tree model
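A minimal traversal sketch for the dict-based tree from the earlier sketches (illustrative names only):

```python
def predict(tree, event, default=None):
    """Follow the arcs that match the event's feature values until a leaf is reached."""
    node = tree
    while "leaf" not in node:
        value = event.get(node["feature"])
        if value not in node["children"]:
            return default  # feature value never seen in training
        node = node["children"][value]
    return node["leaf"]

# Example (hypothetical new household):
# new_event = {"Jenis Lantai": "Tanah", "WC": "Sendiri", "Jumlah ART": 2, "Luas Lantai": 25}
# print(predict(tree, new_event))
```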
When are Decision Trees Useful
◼ Disadvantages of decision trees:
◼ Problems with sparse data & overfitting
◼ Pruning helps these problems, but doesn’t solve them
◼ Don’t (directly) combine information about
different features
◼ Advantages of decision trees:
◼ Fast
◼ Easy to interpret
◼ Helps us understand what is important in new domains
◼ We can explain the “why” of how the decision tree
makes decisions
Exercise
Jumlah ART Jenis Lantai WC Luas Lantai Category
5 Tanah Sendiri 15 Miskin
3 Tanah Sendiri 20 Tidak Miskin
4 Semen Sendiri 50 Tidak Miskin
1 Semen Bersama 40 Tidak Miskin
1 Tanah Sendiri 30 Miskin
2 Semen Bersama 40 Miskin
8 Semen Bersama 18 Miskin
Exercise
◼ Compute the information gain of each variable for the first split
Discussion Topics
◼ Two types of classification methods, unsupervised and supervised, have been introduced (clustering and decision trees). What is the main difference between the two with regard to their possible applications?
◼ Think of a problem that can be solved
using decision trees.