
SCMA801204 – Advanced Machine Learning

Decision Tree

Dr. rer. nat. Hendri Murfi

Machine Learning Group


Department of Mathematics, Universitas Indonesia – Depok 16424
Telp. +62-21-7862719/7863439, Fax. +62-21-7863439, Email. hendri@ui.ac.id

Decision Tree
Description

• A Decision Tree is a model shaped like a flow diagram, consisting of nodes and branches
• Nodes represent tests of a particular feature
• Branches represent the outcomes of those tests
• Each leaf node represents a class or value
• The path from the root to a leaf states a rule
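As a concrete picture of this structure, here is a minimal sketch of a binary tree node in Python; the field names are illustrative and not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a binary decision tree (illustrative field names)."""
    feature: Optional[int] = None      # index of the feature tested at this node
    threshold: Optional[float] = None  # branch left if x[feature] <= threshold
    left: Optional["Node"] = None      # branch taken when the test is satisfied
    right: Optional["Node"] = None     # branch taken otherwise
    value: Optional[float] = None      # class or value stored at a leaf

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None
```

A path from the root to a leaf then reads as a conjunction of the tests along the way, i.e. a rule.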
Decision Tree
Advantage

• A Decision Tree is a white-box machine learning model: the knowledge extracted by the model takes the form of rules that can be explained easily with Boolean logic (If … Then …). A rule-extraction sketch is shown below.
• Many other machine learning models are black boxes, meaning the knowledge they extract is difficult to interpret.
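As a sketch of this white-box property, scikit-learn's export_text prints the learned rules of a fitted tree; the Iris data and the depth limit here are just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree and print its rules as nested If ... Then ... statements.
data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
print(export_text(clf, feature_names=list(data.feature_names)))
```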

Decision Tree
Algorithm

• Iterative Dichotomiser 3 (ID3) is an algorithm for building a Decision Tree in the form of a multiway tree. At each node, the algorithm looks for the categorical feature that yields the largest information gain
  − C4.5 is a development of ID3 that removes the restriction that features must be categorical, by defining a discrete feature obtained by partitioning a continuous feature into several intervals
  − The latest version of this line of development is C5.0
• Classification And Regression Tree (CART) is an algorithm similar to C4.5, but it also supports continuous targets (regression problems) and builds a binary tree. At each node, the algorithm looks for the feature and threshold value that yield the largest information gain (a usage sketch follows)
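For example, scikit-learn documents its tree estimators as an optimized CART-style implementation covering both tasks; the tiny data sets below are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])

# Classification: categorical target, Gini impurity by default.
clf = DecisionTreeClassifier().fit(X, np.array([0, 0, 1, 1]))
print(clf.predict([[2.5]]))        # -> [1]

# Regression: continuous target, squared-error impurity by default.
reg = DecisionTreeRegressor().fit(X, np.array([0.1, 0.2, 1.1, 1.2]))
print(reg.predict([[2.5]]))        # predicted value for x = 2.5
```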
CART
Problem Formulation

• Given training data {(xn, tn)}, n = 1, …, N, where tn is a label (a class or a value)
• Problem: how to build a binary tree from the training data so that training data with the same class or value are grouped together
• Method: at each node, select the feature and threshold value that create the purest partition of the data, i.e., data with the same class or value are grouped together

CART
Algorithm

• Let Q be the data set at a node and let θ = (j, a) be a split candidate, consisting of the j-th feature and a threshold value a. Qleft(θ) and Qright(θ) are the data in the left and right partitions of the threshold value, namely:

  Q_{\mathrm{left}}(\theta) = \{(x, t) \mid x_j \le a\}
  Q_{\mathrm{right}}(\theta) = Q \setminus Q_{\mathrm{left}}(\theta)

• The purity of a node is then measured using an impurity function H(·) and the information gain G(·), namely

  G(Q, \theta) = \frac{N_{\mathrm{left}}}{N} H(Q_{\mathrm{left}}(\theta)) + \frac{N_{\mathrm{right}}}{N} H(Q_{\mathrm{right}}(\theta))

  where N is the number of data points at the node, and N_{\mathrm{left}} and N_{\mathrm{right}} are the numbers of data points in the left and right partitions
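A minimal Python sketch of these two definitions (Qleft, Qright, and G), with illustrative function names and the impurity function H passed in as a callable:

```python
import numpy as np

def partition(X, t, j, a):
    """Q_left = {(x, t) : x_j <= a}, Q_right = Q \\ Q_left, for theta = (j, a)."""
    mask = X[:, j] <= a
    return (X[mask], t[mask]), (X[~mask], t[~mask])

def gain(X, t, j, a, H):
    """G(Q, theta) = (N_left / N) * H(Q_left) + (N_right / N) * H(Q_right)."""
    (_, t_left), (_, t_right) = partition(X, t, j, a)
    n, n_left, n_right = len(t), len(t_left), len(t_right)
    if n_left == 0 or n_right == 0:          # degenerate split: nothing separated
        return H(t)
    return (n_left / n) * H(t_left) + (n_right / n) * H(t_right)
```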
CART
Algorithm

• Thus, the selection of the feature and threshold value that create the purest data partition for a node can be expressed as the following optimization problem:

  \theta^{*} = \arg\min_{\theta} G(Q, \theta)

• The process of selecting features and threshold values is carried out recursively for the subsets Qleft(θ*) and Qright(θ*) until a stopping criterion is met, for example the maximum tree depth is reached or the minimum amount of data at a node (Nm) is reached. A greedy search sketch is given below.
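As a sketch, the greedy search for θ* at one node can be written as an exhaustive scan over features and candidate thresholds; here the candidate thresholds are simply the observed feature values, and the impurity H defaults to the variance (i.e. a regression tree), both of which are choices made for illustration.

```python
import numpy as np

def best_split(X, t, H=np.var):
    """Return theta* = (j*, a*) and G(Q, theta*), minimizing the weighted impurity."""
    best_j, best_a, best_g = None, None, np.inf
    for j in range(X.shape[1]):                    # every feature ...
        for a in np.unique(X[:, j]):               # ... and every observed threshold
            mask = X[:, j] <= a
            if mask.all() or not mask.any():       # skip splits that separate nothing
                continue
            g = mask.mean() * H(t[mask]) + (~mask).mean() * H(t[~mask])
            if g < best_g:
                best_j, best_a, best_g = j, a, g
    return best_j, best_a, best_g
```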

CART
Algorithm

• Given data S = {(xm, tm)}, m = 1, …, M, with tm ∈ {0, 1, 2, …, K−1}. There are several functions for measuring impurity in classification problems, namely:

  − Gini:               H(S) = \sum_{k} p_k (1 - p_k)
  − Entropy:            H(S) = -\sum_{k} p_k \log(p_k)
  − Misclassification:  H(S) = 1 - \max_{k}(p_k)

  where p_k is the probability of class k at that node,

  p_k = \frac{1}{M} \sum_{m} I(t_m = k)

  and I(t_m = k) is an indicator function whose value is 1 if t_m = k and 0 otherwise
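A sketch of the three classification impurity functions in Python, taking the vector of labels at a node as input:

```python
import numpy as np

def class_probs(t):
    """p_k = (1/M) * sum_m I(t_m = k), for each class k present at the node."""
    _, counts = np.unique(t, return_counts=True)
    return counts / counts.sum()

def gini(t):
    p = class_probs(t)
    return float(np.sum(p * (1.0 - p)))

def entropy(t):
    p = class_probs(t)
    return float(-np.sum(p * np.log(p)))       # only classes with p_k > 0 appear

def misclassification(t):
    return float(1.0 - np.max(class_probs(t)))

t = np.array([0, 0, 1, 1, 1])                  # a node with p = (0.4, 0.6)
print(gini(t), entropy(t), misclassification(t))   # 0.48, ~0.673, 0.4
```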
CART
Algorithm

• Given data S = {(xm, tm)}, m = 1, …, M, with continuous targets tm ∈ ℝ. There are several functions for measuring impurity in regression problems, namely:

  − Mean Squared Error:   H(S) = \frac{1}{M} \sum_{m} (t_m - \bar{t})^2
  − Mean Absolute Error:  H(S) = \frac{1}{M} \sum_{m} |t_m - \bar{t}|

  where \bar{t} = \frac{1}{M} \sum_{m} t_m
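And the two regression impurities in Python, written exactly as defined above (note that the MAE here is taken around the mean t̄, following the slide):

```python
import numpy as np

def mse(t):
    """H(S) = (1/M) * sum_m (t_m - t_bar)^2"""
    return float(np.mean((t - np.mean(t)) ** 2))

def mae(t):
    """H(S) = (1/M) * sum_m |t_m - t_bar|"""
    return float(np.mean(np.abs(t - np.mean(t))))
```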

CART
Illustration

Given are seven observations of mortality rates, together with five air-quality-related features, for several regions in England during 2007–2012. Which feature will be selected as the root?

Obs    O3      PM10    PM2.5   NO2     T2M       Mortality Rate
1      22.251  20.447  8.891   28.858  274.723   1.208
2      30.977  15.687  7.250   27.145  276.907   1.498
3      46.849  13.461  4.166   16.238  280.981   1.387
4      64.451   7.447  2.109    5.192  277.892   1.342
5      59.234   7.281  2.781    7.062  278.626   1.431
6      42.514   8.020  4.256   16.457  279.427   1.319
7      30.100  15.848  6.924   23.323  279.413   1.252

Source: K. C. Dewi. Analisis Akurasi Model Random Forest untuk Big Data – Studi Kasus Prediksi Klaim Severity pada Asuransi Mobil. Tesis. Program Magister Matematika, Departemen Matematika FMIPA UI, 2019.
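For the worked computations that follow, the table can be transcribed into NumPy arrays (the decimal commas of the original slides are read as decimal points):

```python
import numpy as np

features = ["O3", "PM10", "PM2.5", "NO2", "T2M"]
X = np.array([
    [22.251, 20.447, 8.891, 28.858, 274.723],   # obs 1
    [30.977, 15.687, 7.250, 27.145, 276.907],   # obs 2
    [46.849, 13.461, 4.166, 16.238, 280.981],   # obs 3
    [64.451,  7.447, 2.109,  5.192, 277.892],   # obs 4
    [59.234,  7.281, 2.781,  7.062, 278.626],   # obs 5
    [42.514,  8.020, 4.256, 16.457, 279.427],   # obs 6
    [30.100, 15.848, 6.924, 23.323, 279.413],   # obs 7
])
t = np.array([1.208, 1.498, 1.387, 1.342, 1.431, 1.319, 1.252])   # mortality rate
```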
CART
Illustration

• There are five features that can be candidates for the root node. For problems with a very large number of features, the candidate features are often selected randomly as a subset of the features
• The threshold value can be selected randomly from the observed feature values. Alternatively, a threshold value can be chosen as a midpoint between feature values (Cutler, Cutler, & Stevens, 2012); a midpoint sketch is shown below
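A sketch of the midpoint rule for generating candidate thresholds, applied here to the PM10 column of the table above:

```python
import numpy as np

def midpoint_thresholds(values):
    """Candidate thresholds: midpoints between consecutive sorted unique feature values."""
    v = np.unique(values)                  # sorted unique values
    return (v[:-1] + v[1:]) / 2.0

pm10 = np.array([20.447, 15.687, 13.461, 7.447, 7.281, 8.020, 15.848])
print(midpoint_thresholds(pm10))
# -> [ 7.364   7.7335 10.7405 14.574  15.7675 18.1475]
```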

CART
Illustration

For example, take feature O3 and the threshold value 22.251, i.e. θ = (O3, 22.251):

  O3 ≤ 22.251?
    Yes (left partition):  observation 1, with mortality rate {1.208}
    No  (right partition): observations 2–7, with mortality rates {1.498, 1.387, 1.342, 1.431, 1.319, 1.252}
CART
Illustration

So, the impurity of each branch of θ = (O3, 22.251) is as follows (using the MSE impurity):

  H(Q_{\mathrm{left}}(\theta)) = \frac{1}{N_{\mathrm{left}}} \sum_{n} (t_n - \bar{t}_{\mathrm{left}})^2 = (1.208 - 1.208)^2 = 0

  H(Q_{\mathrm{right}}(\theta)) = \frac{1}{N_{\mathrm{right}}} \sum_{n} (t_n - \bar{t}_{\mathrm{right}})^2, \quad \bar{t}_{\mathrm{right}} = 1.3715
    = \frac{1}{6}\left[(1.498 - 1.3715)^2 + (1.387 - 1.3715)^2 + (1.342 - 1.3715)^2 + (1.431 - 1.3715)^2 + (1.319 - 1.3715)^2 + (1.252 - 1.3715)^2\right]
    = \frac{0.0376895}{6} = 0.0063
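The same two branch impurities can be checked numerically (MSE impurity, target values taken from the table above):

```python
import numpy as np

t_left  = np.array([1.208])                                        # obs 1
t_right = np.array([1.498, 1.387, 1.342, 1.431, 1.319, 1.252])     # obs 2-7

mse = lambda y: np.mean((y - y.mean()) ** 2)
print(mse(t_left))                  # 0.0
print(round(mse(t_right), 4))       # 0.0063
```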

CART
Illustration

• So, the information gain of θ = (O3, 22.251) is as follows:

  G(Q, \theta) = \frac{N_{\mathrm{left}}}{N} H(Q_{\mathrm{left}}(\theta)) + \frac{N_{\mathrm{right}}}{N} H(Q_{\mathrm{right}}(\theta)) = \frac{1}{7} \cdot 0 + \frac{6}{7} \cdot 0.0063 = 0.0054

• Using the same procedure, compute the information gain for the features PM10, PM2.5, NO2, and T2M.
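In code, weighting the two branch impurities by their partition sizes (1 of 7 samples left, 6 of 7 right) reproduces this value:

```python
G = (1 / 7) * 0.0 + (6 / 7) * 0.0063
print(round(G, 4))                  # 0.0054
```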
CART
Illustration

For example, for each remaining feature the chosen threshold value is as follows: θ = (PM10, 15.687), θ = (PM2.5, 4.166), θ = (NO2, 5.192), θ = (T2M, 278.626). The resulting partitions of the mortality rates are:

  PM10 ≤ 15.687:   Yes: {1.498, 1.387, 1.342, 1.431, 1.319}   No: {1.208, 1.252}
  PM2.5 ≤ 4.166:   Yes: {1.387, 1.342, 1.431}                 No: {1.208, 1.498, 1.319, 1.252}
  NO2 ≤ 5.192:     Yes: {1.342}                               No: {1.208, 1.498, 1.387, 1.431, 1.319, 1.252}
  T2M ≤ 278.626:   Yes: {1.208, 1.498, 1.342, 1.431}          No: {1.387, 1.319, 1.252}

CART
Illustration

• So, the information gain for all candidate features θ is obtained as follows:
  − θ = (O3, 22.251):    0.0054
  − θ = (PM10, 15.687):  0.0031
  − θ = (PM2.5, 4.166):  0.0076
  − θ = (NO2, 5.192):    0.0087
  − θ = (T2M, 278.626):  0.0080
• Next, choose the θ that minimizes the information gain (a smaller G means a purer, lower-impurity partition):

  \theta^{*} = \arg\min_{\theta} G(Q, \theta)

• Because θ = (PM10, 15.687) gives the smallest information gain, select the PM10 feature with a threshold value of 15.687 as the root node
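The five gains can be recomputed from the table; the values printed below agree with the list above up to the rounding used in the slides.

```python
import numpy as np

X = np.array([
    [22.251, 20.447, 8.891, 28.858, 274.723],
    [30.977, 15.687, 7.250, 27.145, 276.907],
    [46.849, 13.461, 4.166, 16.238, 280.981],
    [64.451,  7.447, 2.109,  5.192, 277.892],
    [59.234,  7.281, 2.781,  7.062, 278.626],
    [42.514,  8.020, 4.256, 16.457, 279.427],
    [30.100, 15.848, 6.924, 23.323, 279.413],
])
t = np.array([1.208, 1.498, 1.387, 1.342, 1.431, 1.319, 1.252])
features   = ["O3", "PM10", "PM2.5", "NO2", "T2M"]
thresholds = [22.251, 15.687, 4.166, 5.192, 278.626]   # the candidate theta above

def gain(j, a):
    """G(Q, theta) with MSE impurity, theta = (j, a)."""
    mask = X[:, j] <= a
    mse = lambda y: np.mean((y - y.mean()) ** 2)
    return mask.mean() * mse(t[mask]) + (~mask).mean() * mse(t[~mask])

gains = {f: round(gain(j, a), 4) for j, (f, a) in enumerate(zip(features, thresholds))}
print(gains)                            # PM10 has the smallest value, about 0.0031
print(min(gains, key=gains.get))        # -> 'PM10', so PM10 <= 15.687 becomes the root
```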
CART
Illustration

[Figure: the resulting binary tree. The root tests PM10 ≤ 15.687; its "No" branch is a leaf containing {1.208, 1.252}. The "Yes" branch is split further on T2M ≤ 276.907 (giving the leaf 1.498), then T2M ≤ 278.626, and finally PM2.5 ≤ 4.166, yielding the leaves 1.431, 1.387, 1.342, and 1.319.]

• Next, apply the same steps to find the binary split θ for the samples contained in the left child node and in the right child node.
• Perform this step recursively until a stopping criterion is met, for example the minimum number of samples at a node (Nm ≤ 2) or the maximum tree depth. A recursive sketch is given below.
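A compact recursive sketch of the whole procedure (dict-based nodes, MSE impurity, observed values as thresholds). Unlike the illustration above, which fixed one candidate threshold per feature, this search tries every observed value, so the resulting tree may differ in detail.

```python
import numpy as np

def build(X, t, depth=0, max_depth=3, min_samples=2):
    """Recursively grow a binary regression tree; returns nested dicts."""
    # Stopping criteria: few samples (N_m <= min_samples), depth limit, or a pure node.
    if len(t) <= min_samples or depth == max_depth or np.all(t == t[0]):
        return {"value": float(np.mean(t))}            # leaf: predict the node mean

    mse = lambda y: np.mean((y - y.mean()) ** 2)
    best = None
    for j in range(X.shape[1]):                        # greedy search for theta = (j, a)
        for a in np.unique(X[:, j]):
            mask = X[:, j] <= a
            if mask.all() or not mask.any():
                continue
            g = mask.mean() * mse(t[mask]) + (~mask).mean() * mse(t[~mask])
            if best is None or g < best[0]:
                best = (g, j, a, mask)
    if best is None:                                   # no valid split found
        return {"value": float(np.mean(t))}

    _, j, a, mask = best
    return {"feature": j, "threshold": float(a),
            "left":  build(X[mask],  t[mask],  depth + 1, max_depth, min_samples),
            "right": build(X[~mask], t[~mask], depth + 1, max_depth, min_samples)}
```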

CART
Practical Usage Tips

• The main parameter to optimize at the model selection stage is the maximum tree depth (tree size)
• Other parameters that are often optimized are the minimum number of samples required for a node to branch and the minimum number of samples in a leaf
• Further practical usage tips can be found at https://scikit-learn.org/stable/modules/tree.html; a tuning sketch is shown below
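A sketch of tuning these parameters with scikit-learn; the synthetic data and the grid values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={
        "max_depth": [2, 3, 5, None],      # maximum tree depth (tree size)
        "min_samples_split": [2, 5, 10],   # minimum samples for a node to branch
        "min_samples_leaf": [1, 2, 5],     # minimum samples in a leaf
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```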
References

C. M. Bishop (2006). Pattern Recognition and Machine Learning, Springer (Chapter 14.4, Tree-based Models)
