COS10022 DSP Week05 Decision Tree and Random Forest
Decision Tree
Ensemble Learning
Random Forest
DECISION TREE
Decision Tree
[Tree anatomy figure: Root, Trunk, Decision Node]
Learning Type: Supervised Learning
Processing data types: Discrete, Continuous (depending on the algorithm)
Splitting Rules: Entropy, Gain Ratio, Chi-Square Test, Gini Index, etc.
Pruning Techniques: Predicting Error Rate, Entire Error Rate (Training and Predicting), No pruning
◦ Goal: Predict/classify the result based on the given criteria.
https://p7.hiclipart.com/preview/635/85/121/trunk-tree-stump-bark-tree.jpg, https://miro.medium.com/max/592/0*X-UrBzBeKMnoTY6H.png
Predictors: Outlook, Temp, Humidity, Windy. Target: Outdoor Exercise.

Outlook  Temp  Humidity  Windy  Outdoor Exercise
Rainy    Hot   High      False  No
Rainy    Hot   High      True   No
Cloudy   Hot   High      False  Yes
Sunny    Mild  High      False  Yes
Sunny    Cool  Normal    False  Yes
Sunny    Cool  Normal    True   No
Cloudy   Cool  Normal    True   Yes
Rainy    Mild  High      False  No
Rainy    Cool  Normal    False  Yes
Sunny    Mild  Normal    False  Yes
Rainy    Mild  Normal    True   Yes
Cloudy   Mild  High      True   Yes
Cloudy   Hot   Normal    False  Yes
Sunny    Mild  High      True   No

Resulting decision tree: Outlook is the root node. Sunny → Windy (False: Yes, True: No); Cloudy → Yes; Rainy → Humidity (Normal: Yes, High: No).
Decision Tree (2)
◦ Using algorithms such as ID3 allows users to build decision trees easily.
https://medium.com/@rishabhjain_22692/decision-trees-it-begins-here-93ff54ef134
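Below is a minimal sketch (not from the slides) of building such a tree with scikit-learn on the Outdoor Exercise table above; the one-hot encoding step and the new-day query are illustrative assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The Outdoor Exercise table from the previous slide
data = pd.DataFrame({
    "Outlook":  ["Rainy", "Rainy", "Cloudy", "Sunny", "Sunny", "Sunny", "Cloudy",
                 "Rainy", "Rainy", "Sunny", "Rainy", "Cloudy", "Cloudy", "Sunny"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":    [False, True, False, False, False, True, True,
                 False, False, False, True, True, False, True],
    "Exercise": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Exercise"))   # one-hot encode the categorical predictors
y = data["Exercise"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # entropy-based splits, as in ID3
tree.fit(X, y)

# Classify a new day (Sunny, Cool, High humidity, not windy)
new_day = pd.get_dummies(pd.DataFrame([{"Outlook": "Sunny", "Temp": "Cool",
                                        "Humidity": "High", "Windy": False}]))
new_day = new_day.reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_day))
```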
Outlook  Wind  Temp  Storage  Sailing
Sunny    3     30    Yes      Yes
A Clever Choice of the Attribute for Splitting Would Benefit Your Result
• Algorithms: ID3, CART, etc.
• Goal: splitting the dataset in a way that the subsets are as pure as possible in terms of the target variable.
• Purity measures: Entropy, Information Gain Ratio, Gini Index, etc.
Entropy
◦ Where can we use Entropy?
◦ Entropy is one of the measurements in information theory, a field of study concerned with quantifying information for communication.
◦ $H(p) = -\sum_{i=1}^{n} p_i \log_2 p_i$, for $p \in \mathbb{Q}^n$
https://www.knime.com/knime-introductory-course/chapter6/section3/decision-tree
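A minimal sketch of the formula above in Python (the helper name and the probability lists are illustrative):

```python
import math

def entropy(probs):
    """H(p) = -sum(p_i * log2(p_i)); terms with p_i equal to 0 or 1 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if 0 < p < 1)

print(entropy([7/13, 6/13]))    # ~0.9957, a mixed two-class node
print(entropy([13/13, 0/13]))   # 0, a pure node
```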
Entropy Calculation
$H(p) = -\sum_{i=1}^{n} p_i \log_2 p_i$ for $p \in \mathbb{Q}^n$
◦ Mixed node: $H(p) = -\frac{7}{13}\log_2\frac{7}{13} - \frac{6}{13}\log_2\frac{6}{13} = 0.995$
◦ Pure node: $H(p) = -\frac{13}{13}\log_2\frac{13}{13} - \frac{0}{13}\log_2\frac{0}{13} = 0$
https://www.knime.com/knime-introductory-course/chapter6/section3/decision-tree
Entropy Calculation Example
◦ Calculate the Entropy for going sailing from all observations.

Outlook   Wind  Temp  Storage  Sailing
Sunny     3     30    Yes      Yes
Sunny     3     25    Yes      No
Rain      12    15    Yes      Yes
Overcast  15    2     No       No
Rain      16    25    Yes      Yes
Sunny     14    18    Yes      Yes
Rain      3     5     No       No
Sunny     9     20    Yes      Yes
Overcast  14    5     No       No
Sunny     1     7     No       No
Rain      4     25    Yes      No
Rain      14    24    Yes      Yes
Sunny     11    20    Yes      Yes
Sunny     2     18    Yes      No
Overcast  8     22    Yes      Yes
Overcast  13    24    Yes      Yes

◦ $H(9_+, 7_-) = -\frac{9}{16}\log_2\frac{9}{16} - \frac{7}{16}\log_2\frac{7}{16} = 0.9887$
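A quick check of the arithmetic above (self-contained so it runs on its own):

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if 0 < p < 1)

print(round(entropy([9/16, 7/16]), 4))   # 0.9887, matching the calculation above
```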
Entropy Calculation Example 2
◦ Assume a pool of animals (cats and dogs) is somehow split into two groups.
◦ Let's find the entropy for the whole pool and for each group.
◦ Whole pool (6 cats, 6 dogs): $H_A(p) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\frac{6}{12}\log_2\frac{6}{12} - \frac{6}{12}\log_2\frac{6}{12} = 1$
◦ Group C (5 vs. 1): $H_C(p) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6} = 0.65$
◦ Group D (5 vs. 1): $H_D(p) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6} = 0.65$
Information Gain – One of the Possible Split Criteria
◦ In the ID3 algorithm (one of the decision tree construction algorithms), the split criterion is the information gain: the difference between the entropy before and the weighted entropy after applying the dataset splitting rule.
◦ Entropy before the split: $H_0(p) = -\frac{7}{13}\log_2\frac{7}{13} - \frac{6}{13}\log_2\frac{6}{13} = 0.995$
◦ Entropies of the two subsets after the split:
  $H_1(p) = -\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6} = 0.65$, with weight $w_1 = \frac{6}{13} = 0.462$
  $H_2(p) = -\frac{2}{7}\log_2\frac{2}{7} - \frac{5}{7}\log_2\frac{5}{7} = 0.863$, with weight $w_2 = \frac{7}{13} = 0.539$
◦ Information Gain after the split:
  $Gain = H_0 - (w_1 H_1 + w_2 H_2) = 0.995 - 0.462 \times 0.65 - 0.539 \times 0.863 = 0.995 - 0.766 = 0.23$
◦ Next splitting feature: start from the one with the highest Gain.
◦ Disadvantage: prefers highly represented features. Solution: use the Gain Ratio in the C4.5 algorithm to replace ID3.
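As a rough check, the same gain can be computed in a few lines of Python (values taken from this slide):

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if 0 < p < 1)

h0 = entropy([7/13, 6/13])       # entropy before the split, ~0.995
h1 = entropy([5/6, 1/6])         # subset 1, ~0.650
h2 = entropy([2/7, 5/7])         # subset 2, ~0.863
w1, w2 = 6/13, 7/13              # subset weights
gain = h0 - (w1 * h1 + w2 * h2)
print(round(gain, 2))            # ~0.23
```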
C4.5 Algorithm
◦ To overcome the disadvantage of ID3, the C4.5 algorithm introduces the SplitInfo concept with the Gain Ratio.
◦ The SplitInfo is defined as the negative sum over the weights multiplied by the logarithm of the weights.
◦ The Gain Ratio measure is then calculated by dividing the gain from the ID3 algorithm by the SplitInfo:
  $GainRatio = \frac{Gain}{SplitInfo} = \frac{H_0 - \sum_{i=1}^{k} w_i H_i}{-\sum_{i=1}^{k} w_i \log_2 w_i}$
◦ Question: Can Entropy be greater than 1?
https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8
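A corresponding sketch of the Gain Ratio for the same split (the negative sign keeps SplitInfo positive; the weights are the ones from the previous slide):

```python
import math

weights = [6/13, 7/13]                                    # w_1, w_2 from the previous slide
gain = 0.23                                               # information gain computed above
split_info = -sum(w * math.log2(w) for w in weights)      # ~0.996
print(round(split_info, 3), round(gain / split_info, 3))  # gain ratio ~0.231
```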
CART Algorithm
◦ Another measurement of the purity (or, actually, the impurity), which is used by the CART algorithm, is called the Gini index.
◦ Gini impurity of a node: $Gini_{impurity}(p) = 1 - \sum_{i=1}^{n} p_i^2$
◦ For the whole set, with $p_1 = \frac{7}{13}$ and $p_2 = \frac{6}{13}$:
  $Gini_{impurity}(p) = 1 - \frac{7^2}{13^2} - \frac{6^2}{13^2} = 0.497$
◦ Split criterion in the CART algorithm: $Gini_{index} = \sum_{i=1}^{n} w_i \, Gini_{impurity_i}$
◦ In this case:
  $Gini_{impurity_1} = 1 - \frac{1^2}{6^2} - \frac{5^2}{6^2} = 0.278$, with $w_1 = \frac{6}{13} = 0.462$
  $Gini_{impurity_2} = 1 - \frac{2^2}{7^2} - \frac{5^2}{7^2} = 0.408$, with $w_2 = \frac{7}{13} = 0.539$
  $Gini_{index} = \frac{6}{13} \times Gini_{impurity_1} + \frac{7}{13} \times Gini_{impurity_2} = 0.348$
◦ Next splitting feature: the feature with the lowest $Gini_{index}$.
https://www.knime.com/knime-introductory-course/chapter6/section3/decision-tree
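The same Gini figures can be reproduced with a short sketch:

```python
def gini_impurity(probs):
    return 1 - sum(p * p for p in probs)

g0 = gini_impurity([7/13, 6/13])          # whole set, ~0.497
g1 = gini_impurity([1/6, 5/6])            # subset 1, ~0.278
g2 = gini_impurity([2/7, 5/7])            # subset 2, ~0.408
gini_index = (6/13) * g1 + (7/13) * g2    # weighted impurity of this split
print(round(g0, 3), round(g1, 3), round(g2, 3), round(gini_index, 3))
```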
Gini Index Calculation Example
◦ Assume that we want to segregate students based on a specific target variable (playing computer games).
◦ We split the population using two input variables: Gender and Class.
◦ Now, we want to identify which split produces more homogeneous sub-nodes using the Gini index.

S/N  Gender  Class  Play Computer Games
1    M       IV     Yes
2    F       IV     Yes
3    F       IV     Yes
4    M       IV     No
5    F       IV     No
6    M       IV     Yes
7    M       IV     No
8    F       V      No
9    M       V      Yes
10   F       V      No
11   M       V      Yes
12   F       V      No
13   F       V      No

Total Observations = 13
Play Computer Games = 6 (46%)
Split on Gender vs. Split on Class
(Using the student table above: Total Observations = 13, Play Computer Games = 6 (46%).)

Split on Gender:
  Male: Students = 6, Play = 4 (67%)
  Female: Students = 7, Play = 2 (29%)
  $Gini_{impurity_M} = 1 - (0.67^2 + 0.33^2) = 0.44$
  $Gini_{impurity_F} = 1 - (0.29^2 + 0.71^2) = 0.41$
  $Gini_{index} = \frac{6}{13} \times 0.44 + \frac{7}{13} \times 0.41 = 0.42$

Split on Class:
  Class IV: Students = 7, Play = 4 (57%)
  Class V: Students = 6, Play = 2 (33%)
  $Gini_{impurity_{IV}} = 1 - (0.57^2 + 0.43^2) = 0.49$
  $Gini_{impurity_V} = 1 - (0.33^2 + 0.67^2) = 0.44$
  $Gini_{index} = \frac{7}{13} \times 0.49 + \frac{6}{13} \times 0.44 = 0.47$

Better Split: Gender (lower Gini index).
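The comparison can be verified with a few lines (using the exact fractions from the table rather than the rounded percentages):

```python
def gini_impurity(probs):
    return 1 - sum(p * p for p in probs)

# Split on Gender: Male 4/6 play, Female 2/7 play
gini_gender = (6/13) * gini_impurity([4/6, 2/6]) + (7/13) * gini_impurity([2/7, 5/7])
# Split on Class: Class IV 4/7 play, Class V 2/6 play
gini_class = (7/13) * gini_impurity([4/7, 3/7]) + (6/13) * gini_impurity([2/6, 4/6])
print(round(gini_gender, 3), round(gini_class, 3))   # ~0.425 vs ~0.469: Gender gives the purer split
```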
Tree Construction Summary
Quantify the amount of uncertainty at a single node using the Gini impurity metric.
Quantify how much a question reduces the uncertainty using Information Gain.
Repeat asking questions until the data cannot be divided further.
General Steps for Tree Construction
[Flowchart: Start → evaluate the splitting criterion at each node → if the criterion is met (>), split and repeat; otherwise (≤), stop.]
https://www.youtube.com/watch?v=u4kbPtiVVB8
Numerical Feature Handling
$Gini_{impurity}(p) = 1 - \sum_{i=1}^{n} p_i^2$ for $p \in \mathbb{Q}^n$
$Gini_{index} = \sum_{i=1}^{n} w_i \, Gini_{impurity_i}$
https://www.knime.com/knime-introductory-course/chapter6/section3/decision-tree
Numerical Feature Handling
◦ Solution: Binary splits.
◦ In the case of numerical features, the split is always a binary split.
https://www.knime.com/knime-introductory-course/chapter6/section3/decision-tree
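A minimal sketch of the binary-split idea for a numerical feature: candidate thresholds are placed between consecutive sorted values and the one with the lowest weighted Gini impurity is kept. The values below are the Wind column and Sailing label from the first eight rows of the earlier sailing table.

```python
def gini_impurity(labels):
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_thr, best_gini = None, float("inf")
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2        # midpoint between consecutive values
        left = [lab for val, lab in pairs if val <= thr]
        right = [lab for val, lab in pairs if val > thr]
        if not left or not right:
            continue
        gini = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(pairs)
        if gini < best_gini:
            best_thr, best_gini = thr, gini
    return best_thr, best_gini

wind = [3, 3, 12, 15, 16, 14, 3, 9]                           # Wind (first eight rows)
sail = ["Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"]  # Sailing (first eight rows)
print(best_binary_split(wind, sail))
```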
Right Size of the Tree?
◦ Like many other machine learning algorithms, Decision Trees are subject to potentially overfitting the training data.
◦ Too shallow: a too-simple model that doesn't fit the data.
◦ Too deep: a too-detailed model that doesn't generalize to new data.
◦ Solution: Pruning.
https://www.knime.com/knime-introductory-course/chapter6/section3/decision-tree
Pruning
◦ Goal: a tree that generalizes better.
◦ Idea: cut branches that seem to be a result of overfitting the training set.
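One concrete way to prune is sketched below, using scikit-learn's cost-complexity pruning rather than the error-rate methods listed earlier; the dataset and the alpha value are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)  # cut weak branches

# The pruned tree has fewer nodes and usually generalizes better
print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)
```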
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
Bootstrapping
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
Bootstrapping (2)
◦ Assume we have a set of data $X = \{x_1, x_2, \cdots, x_n\}$ containing $n$ values, where $n$ is not a relatively large number.
◦ If we simply calculate the mean, standard deviation, etc. by taking the whole dataset into account, the result can be biased from the actual situation.
◦ $mean(X) = \frac{\sum_{i=1}^{n} x_i}{n}$
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
Bootstrapping (3)
◦ Create Subsamples: create many (e.g. $m$) random sub-samples from the dataset with replacement (meaning we can select the same value multiple times).
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
Bootstrapping Application Example
◦ Assume we have a small sample dataset $X = \{13, 21, 17, 22, 4, 5, 9, 28\}$ extracted from the population $\{17, 10, 6, 13, 9, 12, 28, 13, 0, 4, 21, 5, 22, 9, 13, 4, 21, 16, 3, 22\}$.
◦ You'll find that the mean value of this dataset is $mean(X) = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n = 8$ in this case, and $mean(X) = 14.88$.
◦ To calibrate the mean value, we can randomly take $m$ sub-samples. Let's say $m = 5$ as an example. Find the means:
  $X_1 = \{13, 22, 4, 28\}$: $s_1 = mean(X_1) = \frac{13 + 22 + 4 + 28}{4} = \frac{67}{4} = 16.75$
  $X_2 = \{13, 21, 5, 9, 28\}$: $s_2 = mean(X_2) = \frac{13 + 21 + 5 + 9 + 28}{5} = \frac{76}{5} = 15.2$
  $X_3 = \{17, 4, 9\}$: $s_3 = mean(X_3) = \frac{17 + 4 + 9}{3} = \frac{30}{3} = 10$
  $X_4 = \{21, 22, 5, 28\}$: $s_4 = mean(X_4) = \frac{21 + 22 + 5 + 28}{4} = \frac{76}{4} = 19$
  $X_5 = \{13, 17, 4, 9\}$: $s_5 = mean(X_5) = \frac{13 + 17 + 4 + 9}{4} = \frac{43}{4} = 10.75$
◦ Ensemble the final results:
  Calibrated Mean $= \frac{\sum_{j=1}^{m} s_j}{m} = \frac{16.75 + 15.2 + 10 + 19 + 10.75}{5} = \frac{71.7}{5} = 14.34$
  Population Mean $= 12.4$
◦ Is this always the case?
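A minimal sketch of the same procedure in Python; here each subsample is drawn with replacement and, by the usual convention, has the same size as the original sample (the slide's subsamples are smaller, which is also allowed):

```python
import random

random.seed(42)
X = [13, 21, 17, 22, 4, 5, 9, 28]          # the small sample from the slide
m = 5                                      # number of bootstrap subsamples

means = []
for _ in range(m):
    subsample = random.choices(X, k=len(X))        # sampling with replacement
    means.append(sum(subsample) / len(subsample))

print(round(sum(means) / m, 2))            # the calibrated (bootstrap) estimate of the mean
```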
Bagging
◦ Bagging is also known as Bootstrap Aggregation.
◦ A simple but powerful ensemble method.
◦ It is typically an application of the Bootstrap procedure to the decision tree, which is a high-variance machine learning algorithm.
◦ In bagging, training instances can be sampled several times for the same predictor.
https://medium.com/machine-learning-through-visuals/machine-learning-through-visuals-part-1-what-is-bagging-ensemble-learning-432059568cc8
Bagging (2)
◦ Moving on to bagging ensemble learning, the output of each predictor (classifier) is summarised by the aggregation function.
◦ The aggregation function is typically the statistical mode for classification and the average for regression.
◦ It mainly helps to reduce variance: the aggregated model is more stable than any single tree.
https://medium.com/machine-learning-through-visuals/machine-learning-through-visuals-part-1-what-is-bagging-ensemble-learning-432059568cc8
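A rough bagging sketch (not the article's code): bootstrap the training set, grow one decision tree per subsample, and aggregate the predictions by majority vote. The Iris dataset and the number of trees are illustrative choices.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 25
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample: indices drawn with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    votes = [int(tree.predict([x])[0]) for tree in trees]
    return Counter(votes).most_common(1)[0][0]   # statistical mode = majority vote

print(bagged_predict(X[0]), y[0])
```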
Boosting
◦ Boosting algorithms utilise weighted averages to turn weak learners into strong learners.
◦ Different from bagging, which runs each model independently before aggregating the outputs at the end without preference to any model, boosting is all about "teamwork."
◦ Each model that runs in boosting dictates what features the next model will focus on.
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
Boosting
◦ Data points
  ◦ + and – samples are equally weighted at the beginning. (The weight is denoted by the size of the symbols.)
◦ Box 1
  ◦ The first decision stump (D1) generates a vertical line to classify the data points.
  ◦ The three + on the right side are mis-classified, so higher weights are assigned to them and the classification process is handed over to D2.
◦ Box 2
  ◦ Since the three + now have higher weights, the second decision stump (D2) tries to predict them correctly.
  ◦ A vertical line on the right side of Box 2 classifies the three previously mis-classified + correctly.
  ◦ But again, it causes mis-classification errors on the three – on the left side. Thus, those three – gain weight in D3.
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
Boosting
◦ Box 3
  ◦ Since the three mis-classified – are given higher weights, a decision stump (D3) is applied to predict these mis-classified observations correctly.
  ◦ This time, a horizontal line is generated to classify + and – based on the higher weights of the mis-classified observations.
◦ Box 4
  ◦ By combining D1, D2, and D3, a strong predictor, which has complex rules compared to the individual weak learners, is generated.
https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
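A simplified AdaBoost-style sketch of the re-weighting idea in the boxes above: each round fits a decision stump using the current sample weights, then increases the weights of the points it mis-classified. The dataset and the number of rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = np.where(y == 1, 1, -1)                 # use +/- labels, as in the boxes above

n_rounds = 5
w = np.full(len(X), 1 / len(X))             # all samples equally weighted at the beginning
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)          # a decision stump (D1, D2, ...)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)               # weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))      # stump's weight in the final vote
    w = w * np.exp(-alpha * y * pred)                    # mis-classified points gain weight
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Strong predictor: weighted vote over the weak stumps (D1, D2, ... combined)
strong = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(strong == y))
```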
Stacking
◦ Stacking often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different weak models' predictions.
◦ In recent years, a more modern approach is to create an ensemble of a well-chosen collection of strong yet diverse models in stacking.
https://www.researchgate.net/profile/Mahsa_Soufineyestani/publication/326224119/figure/download/fig1/AS:645263400112129@1530854191613/Stacking-model-1-S1.png
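A minimal stacking sketch using scikit-learn's StackingClassifier (available in recent scikit-learn releases); the particular base learners and meta-model are illustrative assumptions, not the models from the figure.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous weak learners, combined by a meta-model trained on their predictions
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```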
Stacking
◦ On the popular data science competition site Kaggle (https://www.kaggle.com/), you can explore numerous winning solutions through its discussion forums to get a flavour of the state of the art.
◦ Another popular data science competition is the KDD Cup (http://www.kdd.org/kdd-cup). The figure shown on the left is the winning solution for the 2015 competition, which used a three-stage stacked modelling approach.
https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/
RANDOM FOREST
Application of Random Forest: Where to Use
◦ Remote Sensing: Enhanced Thematic Mapping (ETM) devices are used to acquire images of the earth's surface; Random Forest achieves higher accuracy with less training time compared to other ML methods.
◦ Object Detection: multiclass object detection is done using Random Forest algorithms, providing better detection in complicated environments.
◦ Kinect: Random Forest is used in the Kinect game console to track body movements and recreate them in the game.
http://x-tech.am/wp-content/uploads/2013/10/blog11-884x442.jpg
Where does Random Forest fit in Data Science?
◦ Random Forest is a Machine Learning method that can be used for both Classification and Regression.
Why use Random Forest?
◦ Reduced overfitting (compared with a single decision tree)
◦ High accuracy
What is Random Forest?
◦ Definition: a Random Forest is an ensemble of decision trees (Tree 1, Tree 2, Tree 3, ...); each tree produces its own output (Output 1, Output 2, Output 3, ...), and the outputs are combined into the final prediction.
HOW DOES RANDOM FOREST WORK?
Let's Review How Decision Tree Works, First.
◦ Answering the earlier question: with three classes ($\frac{7}{16}$, $\frac{4}{16}$, $\frac{5}{16}$ of the observations), the entropy can exceed 1:
  $H_0(p) = -\sum_{i=1}^{3} p_i \log_2 p_i = -\frac{7}{16}\log_2\frac{7}{16} - \frac{4}{16}\log_2\frac{4}{16} - \frac{5}{16}\log_2\frac{5}{16} = 1.55 > 1$
◦ Candidate splitting conditions:
  1. Colour == Brown?
  2. Diameter < 8?
  3. Colour == White?
  4. Shape == Oval?
◦ A decision tree repeatedly picks one of these conditions and splits the data into True/False branches, e.g. one tree splits on "Shape == Oval?", another splits on "Colour == Brown?" and then on "Shape == Oval?".
How Does Random Forest Work?
◦ Tree 3 is built from the same pool of candidate conditions: it splits first on "Colour == White?" (True/False) and then on "Shape == Oval?" (True/False).
How Does Random Forest Work?
◦ Assume we have a new test data point with no colour information.
◦ Let the default answer be False if a tree cannot judge a particular classifying criterion.
◦ Let's guess what this input will be classified as.
How Does Random Forest Work?
◦ Each tree (Tree 1, Tree 2, Tree 3) makes its own prediction for the new input, and the final class is decided by a Majority Vote.
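A minimal sketch of the same idea with scikit-learn's RandomForestClassifier: many trees are grown on bootstrapped data with random feature subsets, and their votes are aggregated. The dataset is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # 100 trees, each voting on the class
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))                                 # accuracy of the aggregated predictions
```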
WEEK 05 SUMMARY
• Decision Tree: Entropy, Information Gain, Gini Index, Creation, Pruning
• Ensemble Learning: Bagging, Boosting, Stacking
• Random Forest: Definition, Majority Vote