SCMA801204 – Advanced Machine Learning
Decision Tree
Dr. rer. nat. Hendri Murfi
Machine Learning Group
Department of Mathematics, Universitas Indonesia – Depok 16424
Tel. +62-21-7862719/7863439, Fax +62-21-7863439, Email: hendri@ui.ac.id
Decision Tree
Description
• A Decision Tree is a flowchart-like model consisting of nodes and branches
• Internal nodes represent tests on a particular feature
• Branches represent the outcomes of those tests
• Each leaf node represents a class or a value
• The path from the root to a leaf defines a rule
Decision Tree
Advantage
• A Decision Tree is a white-box machine learning model: the knowledge extracted by the model takes the form of rules that can easily be explained with Boolean logic (If … Then …).
• In contrast, many other machine learning models are black boxes, meaning the knowledge extracted by the model is difficult to interpret.
Decision Tree
Algorithm
• Iterative Dichotomiser 3 (ID3) is an algorithm for building a Decision Tree in the form of a multiway tree. At each node, the algorithm looks for the categorical feature that produces the largest information gain
− C4.5 is a successor of ID3 that removes the restriction that features must be categorical, by defining discrete features obtained from partitioning continuous features into several intervals
− The most recent version in this line of development is C5.0
• Classification And Regression Trees (CART) is an algorithm similar to C4.5, but it also supports continuous targets (regression problems) and builds a binary tree. At each node, the algorithm looks for the feature and threshold value that produce the largest information gain
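As a point of reference, scikit-learn implements an optimized version of CART. A minimal sketch of fitting and inspecting a classification tree on the library's bundled Iris data (the parameter choices here are illustrative only):

```python
# A minimal sketch of CART-style classification with scikit-learn,
# which implements an optimized version of the CART algorithm.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" selects the Gini impurity; "entropy" is also available.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Each internal node tests one feature against one threshold (a binary tree).
print(export_text(clf, feature_names=load_iris().feature_names))
```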
CART
Problem Formulation
Given training data {xn, tn}, n = 1, ..., N,
where tn is the label or target value
• Problem:
How to build a binary tree from the training data so that training data with the same class or value are grouped together
• Method:
At each node, select the feature and threshold value that create the purest partition of the data, i.e., data with the same class or value are grouped together
CART
Algorithm
• For example, let Q be the data set at a node and let θ = (j, a) be a candidate split consisting of the j-th feature and a threshold value a. Then Qleft(θ) and Qright(θ) are the data in the left and right partitions defined by the threshold, namely:
Qleft(θ) = {(x, t) | x(j) ≤ a}
Qright(θ) = Q − Qleft(θ)
• The purity of a node is then evaluated with an impurity function H() through the information gain G(), namely
G(Q, θ) = (Nleft / N) H(Qleft(θ)) + (Nright / N) H(Qright(θ))
where N is the number of data points at the node, and Nleft and Nright are the numbers of data points in the left and right partitions
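A small numpy sketch of these definitions; the function names (split, weighted_impurity) are illustrative, and the variance-style impurity at the end is just one possible choice of H:

```python
import numpy as np

def split(X, t, j, a):
    """Partition the node data into Qleft = {(x, t) | x_j <= a} and its complement Qright."""
    mask = X[:, j] <= a
    return (X[mask], t[mask]), (X[~mask], t[~mask])

def weighted_impurity(X, t, j, a, H):
    """G(Q, theta) = (Nleft/N) H(Qleft) + (Nright/N) H(Qright) for theta = (j, a)."""
    (X_l, t_l), (X_r, t_r) = split(X, t, j, a)
    N = len(t)
    G = 0.0
    if len(t_l) > 0:
        G += len(t_l) / N * H(t_l)
    if len(t_r) > 0:
        G += len(t_r) / N * H(t_r)
    return G

# One possible impurity H: variance of the targets in the partition (regression).
variance = lambda t: float(np.mean((t - t.mean()) ** 2))
```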
CART
Algorithm
• Thus, the selection of the feature and threshold value that creates the purest data partition at a node can be expressed as the following optimization problem:
θ* = argmin_θ G(Q, θ)
Since the impurity H(Q) of the node itself is fixed, minimizing the weighted impurity G is equivalent to maximizing the information gain
• The process of selecting features and threshold values is carried out recursively on the subsets Qleft(θ*) and Qright(θ*) until a stopping criterion is met, for example the maximum tree depth is reached or the minimum number of data points Nm at a node is reached; a self-contained sketch of this recursion is given below
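A hedged, self-contained sketch of this recursive procedure: at each node it searches all (feature, threshold) candidates for the θ that minimizes G, then recurses on Qleft and Qright. The names best_split and build_tree and the MSE impurity are assumptions for illustration, not taken from the slides:

```python
import numpy as np

def mse(t):
    """Impurity of a node: mean squared deviation of the targets from their mean."""
    return float(np.mean((t - t.mean()) ** 2)) if len(t) else 0.0

def best_split(X, t):
    """Return theta* = (j, a) minimizing G(Q, theta), or None if no valid split exists."""
    N, D = X.shape
    best, best_G = None, np.inf
    for j in range(D):
        for a in np.unique(X[:, j]):          # candidate thresholds from the feature values
            left = X[:, j] <= a
            nl, nr = left.sum(), N - left.sum()
            if nl == 0 or nr == 0:            # skip splits that leave one side empty
                continue
            G = nl / N * mse(t[left]) + nr / N * mse(t[~left])
            if G < best_G:
                best, best_G = (j, a), G
    return best

def build_tree(X, t, depth=0, max_depth=3, min_samples=2):
    """Recursively split until a stopping criterion is met; leaves predict the mean target."""
    theta = best_split(X, t)
    if depth >= max_depth or len(t) <= min_samples or theta is None:
        return {"leaf": float(t.mean())}
    j, a = theta
    left = X[:, j] <= a
    return {"feature": j, "threshold": float(a),
            "left": build_tree(X[left], t[left], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~left], t[~left], depth + 1, max_depth, min_samples)}
```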
CART
Algorithm
• Given data S = {xm, tm}, m = 1, ..., M, with class labels tm ∈ {0, 1, ..., K−1}. There are several impurity functions for classification problems, namely:
Gini: H(S) = ∑k pk (1 − pk)
Entropy: H(S) = − ∑k pk log(pk)
Misclassification: H(S) = 1 − maxk (pk)
where pk is the probability (proportion) of class k at the node,
pk = (1/M) ∑m I(tm = k)
and I(tm = k) is the indicator function having value 1 if tm = k and 0 otherwise
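These three impurity functions can be written directly in numpy; the helper name class_probabilities is illustrative:

```python
import numpy as np

def class_probabilities(t):
    """p_k = (1/M) * sum_m I(t_m = k) for each class k present at the node."""
    _, counts = np.unique(t, return_counts=True)
    return counts / len(t)

def gini(t):
    p = class_probabilities(t)
    return float(np.sum(p * (1.0 - p)))

def entropy(t):
    p = class_probabilities(t)
    return float(-np.sum(p * np.log(p)))   # natural log; base 2 is also common

def misclassification(t):
    p = class_probabilities(t)
    return float(1.0 - np.max(p))

# Example: a node with labels [0, 0, 1, 1, 1]
t = np.array([0, 0, 1, 1, 1])
print(gini(t), entropy(t), misclassification(t))  # 0.48, ~0.673, 0.4
```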
CART
Algorithm
• Given data S = {xm, tm}, m = 1, ..., M, with continuous targets tm. There are several impurity functions for regression problems, namely:
Mean Squared Error: H(S) = (1/M) ∑m (tm − t̄)²
Mean Absolute Error: H(S) = (1/M) ∑m |tm − t̄|
where t̄ = (1/M) ∑m tm
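The corresponding regression impurities, written around the node mean t̄ exactly as defined above (note that scikit-learn's absolute_error criterion uses the node median instead of the mean):

```python
import numpy as np

def mse(t):
    """H(S) = (1/M) * sum_m (t_m - t_bar)^2"""
    return float(np.mean((t - np.mean(t)) ** 2))

def mae(t):
    """H(S) = (1/M) * sum_m |t_m - t_bar|"""
    return float(np.mean(np.abs(t - np.mean(t))))
```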
CART
Illustration
Given seven observations of mortality rates together with five air-quality features for several regions in England in 2007–2012. Which feature will be selected as the root node?
Obs  O3      PM10    PM2.5  NO2     T2M      Mortality Rate
1    22.251  20.447  8.891  28.858  274.723  1.208
2    30.977  15.687  7.250  27.145  276.907  1.498
3    46.849  13.461  4.166  16.238  280.981  1.387
4    64.451  7.447   2.109  5.192   277.892  1.342
5    59.234  7.281   2.781  7.062   278.626  1.431
6    42.514  8.020   4.256  16.457  279.427  1.319
7    30.100  15.848  6.924  23.323  279.413  1.252
Source: K. C. Dewi. Analisis Akurasi Model Random Forest untuk Big Data – Studi Kasus Prediksi Klaim Severity pada Asuransi Mobil. Master's thesis, Program Magister Matematika, Departemen Matematika FMIPA UI, 2019
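For later reference, the table can be transcribed into numpy arrays (decimal points; values exactly as in the table):

```python
import numpy as np

# Columns: O3, PM10, PM2.5, NO2, T2M (features); t = Mortality Rate (target)
feature_names = ["O3", "PM10", "PM2.5", "NO2", "T2M"]
X = np.array([
    [22.251, 20.447, 8.891, 28.858, 274.723],
    [30.977, 15.687, 7.250, 27.145, 276.907],
    [46.849, 13.461, 4.166, 16.238, 280.981],
    [64.451,  7.447, 2.109,  5.192, 277.892],
    [59.234,  7.281, 2.781,  7.062, 278.626],
    [42.514,  8.020, 4.256, 16.457, 279.427],
    [30.100, 15.848, 6.924, 23.323, 279.413],
])
t = np.array([1.208, 1.498, 1.387, 1.342, 1.431, 1.319, 1.252])
```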
CART
Illustration
• There are five features that are candidates for the root node. For problems with a very large number of features, candidate features are often selected randomly from a subset of the features
• The threshold value can be chosen randomly from the observed feature values. Alternatively, it can be chosen as the midpoint between consecutive feature values (Cutler, Cutler, & Stevens, 2012); a small sketch of this midpoint rule follows below
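A small sketch of the midpoint rule for candidate thresholds, shown on the O3 column of the table above (the function name candidate_thresholds is illustrative):

```python
import numpy as np

def candidate_thresholds(x):
    """Midpoints between consecutive sorted unique values of one feature column."""
    v = np.unique(x)                 # sorted unique feature values
    return (v[:-1] + v[1:]) / 2.0    # one midpoint per adjacent pair

# Example with the O3 column of the table above:
o3 = np.array([22.251, 30.977, 46.849, 64.451, 59.234, 42.514, 30.100])
print(candidate_thresholds(o3))
```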
CART
Illustration
For example, consider the feature O3 with the threshold value 22.251, i.e. θ = {O3, 22.251}. The test O3 ≤ 22.251 partitions the observations as follows:
Yes (O3 ≤ 22.251): observation 1, with mortality rate 1.208
No (O3 > 22.251): observations 2–7, with mortality rates 1.498, 1.387, 1.342, 1.431, 1.319, 1.252
CART
Illustration
So, using the mean squared error as the impurity function, the impurity of each branch of θ = {O3, 22.251} is as follows:
H(Qleft(θ)) = (1/nleft) ∑ (t − t̄left)²
= (1.208 − 1.208)²
= 0
H(Qright(θ)) = (1/nright) ∑ (t − t̄right)²
= (1/6) [(1.498 − 1.3715)² + (1.387 − 1.3715)² + (1.342 − 1.3715)² + (1.431 − 1.3715)² + (1.319 − 1.3715)² + (1.252 − 1.3715)²]
= (1/6)(0.0376895)
≈ 0.0063
where t̄left = 1.208 and t̄right = 1.3715 are the mean mortality rates in the left and right partitions
CART
Illustration
• So, the information gain of θ = {O3, 22.251} is as follows:
G(Q, θ) = (nleft/n) H(Qleft(θ)) + (nright/n) H(Qright(θ))
= (1/7)(0) + (6/7)(0.0063)
≈ 0.0054
• Using the same procedure, compute the information gain for the features PM10, PM2.5, NO2, and T2M.
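This worked example can be checked numerically; the short sketch below reproduces H(Qleft) = 0, H(Qright) ≈ 0.0063, and G ≈ 0.0054 for θ = {O3, 22.251}:

```python
import numpy as np

o3 = np.array([22.251, 30.977, 46.849, 64.451, 59.234, 42.514, 30.100])
t  = np.array([1.208, 1.498, 1.387, 1.342, 1.431, 1.319, 1.252])

mse = lambda y: float(np.mean((y - y.mean()) ** 2))

left = o3 <= 22.251
H_left, H_right = mse(t[left]), mse(t[~left])
G = left.sum() / len(t) * H_left + (~left).sum() / len(t) * H_right

print(H_left)   # 0.0
print(H_right)  # ~0.0063  (0.0376895 / 6)
print(G)        # ~0.0054
```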
CART
Illustration
For example, for each of the remaining features the chosen threshold value is as follows: θ = {PM10, 15.687}, θ = {PM2.5, 4.166}, θ = {NO2, 5.192}, θ = {T2M, 278.626}. These tests partition the mortality rates as follows:
PM10 ≤ 15.687: Yes → 1.498, 1.387, 1.342, 1.431, 1.319; No → 1.208, 1.252
PM2.5 ≤ 4.166: Yes → 1.387, 1.342, 1.431; No → 1.208, 1.498, 1.319, 1.252
NO2 ≤ 5.192: Yes → 1.342; No → 1.208, 1.498, 1.387, 1.431, 1.319, 1.252
T2M ≤ 278.626: Yes → 1.208, 1.498, 1.342, 1.431; No → 1.387, 1.319, 1.252
CART
Illustration
• So, the information gain for each candidate split θ is obtained as follows:
θ = {O3, 22.251}: 0.0054
θ = {PM10, 15.687}: 0.0031
θ = {PM2.5, 4.166}: 0.0076
θ = {NO2, 5.192}: 0.0087
θ = {T2M, 278.626}: 0.0080
• Next, choose the θ that minimizes this quantity, i.e.
θ* = argmin_θ G(Q, θ)
• Because θ = {PM10, 15.687} gives the smallest value, the PM10 feature with a threshold value of 15.687 is selected as the root node
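The whole comparison can be reproduced with a short loop over the five candidate splits; up to the rounding used in the slides, it yields the values listed above and selects PM10:

```python
import numpy as np

feature_names = ["O3", "PM10", "PM2.5", "NO2", "T2M"]
X = np.array([
    [22.251, 20.447, 8.891, 28.858, 274.723],
    [30.977, 15.687, 7.250, 27.145, 276.907],
    [46.849, 13.461, 4.166, 16.238, 280.981],
    [64.451,  7.447, 2.109,  5.192, 277.892],
    [59.234,  7.281, 2.781,  7.062, 278.626],
    [42.514,  8.020, 4.256, 16.457, 279.427],
    [30.100, 15.848, 6.924, 23.323, 279.413],
])
t = np.array([1.208, 1.498, 1.387, 1.342, 1.431, 1.319, 1.252])

candidates = [("O3", 22.251), ("PM10", 15.687), ("PM2.5", 4.166),
              ("NO2", 5.192), ("T2M", 278.626)]

mse = lambda y: float(np.mean((y - y.mean()) ** 2))

gains = {}
for name, a in candidates:
    j = feature_names.index(name)
    left = X[:, j] <= a
    gains[(name, a)] = (left.sum() * mse(t[left]) + (~left).sum() * mse(t[~left])) / len(t)

best = min(gains, key=gains.get)
print(gains)  # close to the values on the slide (small differences come from rounding)
print(best)   # ('PM10', 15.687) -> selected as the root node
```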
CART
Illustration
• Next, apply the same steps to find the binary split θ for the samples contained in the left child node and the right child node.
• Perform this step recursively until the stopping criteria are met, for example the minimum number of samples at a node (Nm ≤ 2) or the maximum tree depth.
[Figure: the resulting tree. The root tests PM10 ≤ 15.687; its No branch is a leaf containing the mortality rates 1.208 and 1.252, while its Yes branch is split further by T2M ≤ 276.907, T2M ≤ 278.626, and PM2.5 ≤ 4.166, giving leaves with the mortality rates 1.498, 1.431, 1.387, 1.342, and 1.319.]
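The same construction can be reproduced with scikit-learn's CART implementation. This is a sketch under the assumption that MSE is the splitting criterion and nodes with at most two samples are not split; note that the library places thresholds at midpoints between feature values, so the printed thresholds may differ slightly from those in the illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

feature_names = ["O3", "PM10", "PM2.5", "NO2", "T2M"]
X = np.array([[22.251, 20.447, 8.891, 28.858, 274.723],
              [30.977, 15.687, 7.250, 27.145, 276.907],
              [46.849, 13.461, 4.166, 16.238, 280.981],
              [64.451,  7.447, 2.109,  5.192, 277.892],
              [59.234,  7.281, 2.781,  7.062, 278.626],
              [42.514,  8.020, 4.256, 16.457, 279.427],
              [30.100, 15.848, 6.924, 23.323, 279.413]])
t = np.array([1.208, 1.498, 1.387, 1.342, 1.431, 1.319, 1.252])

# squared_error is CART's MSE criterion; min_samples_split=3 stops nodes with <= 2 samples
reg = DecisionTreeRegressor(criterion="squared_error", min_samples_split=3, random_state=0)
reg.fit(X, t)
print(export_text(reg, feature_names=feature_names))
```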
CART
Practical Usage Tips
• The main parameter that needs to be optimized at the model selection stage is the maximum tree depth (tree size)
• Other parameters that are often optimized are the minimum number of samples required for a node to be split and the minimum number of samples in a leaf
• Some other practical usage tips can be found at https://scikit-learn.org/stable/modules/tree.html
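As an illustration of such tuning, here is a hedged sketch of a cross-validated grid search over these parameters (the dataset and grid values are placeholders, not from the slides):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# The parameters discussed above: tree size and the minimum-sample constraints.
param_grid = {
    "max_depth": [2, 3, 5, 8, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```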
References
C. M. Bishop (2006). Pattern Recognition and Machine Learning, Springer (Chapter 14.4)