Decision Trees
Lecture 5
Salahadin Seid
School of Data Science
Emerland University
September 1, 2024
Outline
• What are Decision Trees?
• Example
• How Does the Decision Tree Algorithm Work?
• Attribute Selection Measures
• Information Gain
• Gini Index
• Python Implementations
What are Decision Trees?
• Decision trees are another popular ML algorithm that can be used for both regression and classification tasks.
• Decision trees make predictions by recursively splitting on different attributes according to a tree structure.
• A decision tree has a hierarchical structure consisting of a root node, branches, internal nodes, and leaf nodes.
• They are easy to understand, interpret, and implement,
making them an ideal choice for beginners in the field of ML.
Example
• Example: classifying fruit as an orange or lemon based on
height and width
Example ...
Example ...
• For continuous attributes, split on whether the value is less than or greater than some threshold (see the sketch below).
• Thus, the input space is divided into regions with boundaries parallel to the axes.
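This threshold-splitting behaviour can be seen directly in scikit-learn. The sketch below fits a tree on a made-up orange/lemon dataset described by height and width; the measurements are illustrative assumptions, not data from the lecture.

    # Minimal sketch: axis-parallel splits learned from made-up fruit data.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row is [height_cm, width_cm]; labels are the fruit type.
    X = [[7.0, 7.2], [6.8, 7.5], [7.5, 7.8],   # oranges: about as wide as tall
         [8.5, 5.5], [9.0, 5.8], [8.8, 6.0]]   # lemons: taller than wide
    y = ["orange", "orange", "orange", "lemon", "lemon", "lemon"]

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Every internal node tests one feature against a threshold, so the learned
    # decision boundaries are parallel to the height/width axes.
    print(export_text(clf, feature_names=["height", "width"]))
    print(clf.predict([[8.7, 5.6]]))   # -> ['lemon']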
How Does the Decision Tree Algorithm Work?
The basic idea behind any decision tree algorithm is as follows:
1 Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
2 Make that attribute a decision node and break the dataset into smaller subsets.
3 Build the tree by repeating this process recursively for each child until one of the following conditions is met (a code sketch of the procedure follows this list):
• All the tuples belong to the same class.
• There are no more remaining attributes.
• There are no more instances.
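A compact sketch of this recursive procedure is shown below. It assumes the data comes as a list of dicts with a "label" key and takes any scoring function (an ASM, introduced on the next slide) as a parameter; the names build_tree and score are illustrative, not part of a specific library.

    # Illustrative sketch of the generic recursive procedure (ID3-style).
    from collections import Counter

    def build_tree(rows, attributes, score):
        labels = [r["label"] for r in rows]
        # Stopping conditions: pure node, no remaining attributes, or no instances.
        if len(set(labels)) == 1 or not attributes or not rows:
            return Counter(labels).most_common(1)[0][0] if labels else None

        # 1) Pick the best attribute according to the ASM (higher score = better).
        best = max(attributes, key=lambda a: score(rows, a))

        # 2) Make it a decision node and split the data on its values.
        node = {"attribute": best, "children": {}}
        remaining = [a for a in attributes if a != best]
        for value in {r[best] for r in rows}:
            subset = [r for r in rows if r[best] == value]
            # 3) Recurse on each child subset.
            node["children"][value] = build_tree(subset, remaining, score)
        return node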
Attribute Selection Measures
• Attribute selection measures (ASMs), also known as splitting rules, are heuristics for selecting the splitting criterion that partitions the data in the best possible manner.
• An ASM assigns a rank to each feature (or attribute) of the given dataset.
• The attribute with the best score is selected as the splitting attribute.
• The most popular selection measures are Information Gain and the Gini Index.
Information Gain
• Information gain is a measure used to determine which
feature should be used to split the data at each internal node
of the decision tree. It is calculated using entropy.
• Entropy is a metric that measures the impurity of a given attribute; it quantifies the randomness in the data.
• In a decision tree, the goal is to decrease the entropy of the
dataset by creating more pure subsets of data.
• Since entropy is a measure of impurity, by decreasing the
entropy, we are increasing the purity of the data.
Information Gain
Entropy
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
where p(x) is the probability of randomly selecting an example in class x.
For a dataset made up of three colours (red, purple, and yellow), the equation becomes:
H(X) = -(p_r \log_2 p_r + p_p \log_2 p_p + p_y \log_2 p_y)
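A minimal Python sketch of this entropy formula (a generic illustration, not the entropy_iris.py script referenced at the end of the lecture):

    # H = -sum_x p(x) * log2 p(x) over the classes present in `labels`.
    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    # Three-colour example: the more mixed the data, the higher the entropy.
    print(entropy(["red"] * 10))                                # 0.0 (pure)
    print(round(entropy(["red", "purple", "yellow"] * 10), 3))  # 1.585 (= log2 3)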
Information Gain...
Information Gain
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)
Equivalently: Information Gain = entropy(parent) - [weighted average] * entropy(children)
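Continuing the sketch, information gain is the parent's entropy minus the weighted average of the children's entropies after a split. The function below reuses the entropy() helper from the previous slide and is illustrative only.

    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    # Reuses entropy() from the sketch on the previous slide.
    def information_gain(parent_labels, child_label_groups):
        total = len(parent_labels)
        weighted = sum(
            (len(group) / total) * entropy(group) for group in child_label_groups
        )
        return entropy(parent_labels) - weighted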
Information Gain... example
• A decision tree to predict whether a loan given to a person
would result in a write-off or not.
• Our entire population consists of 30 instances:
• 16 belong to the write-off class, and
• the other 14 belong to the non-write-off class.
• We have two features, namely
• Balance, which can take on two values: < 50K or >= 50K, and
• Residence, which can take on three values: OWN, RENT, or OTHER.
Information Gain... example
• Feature 1: Balance
• Balance < 50K : Sub total: 13 (12 write-off, 1 not)
• Balance >= 50K : Sub total: 17 (4 write-off, 13 not)
E(parent) = -\frac{16}{30}\log_2\frac{16}{30} - \frac{14}{30}\log_2\frac{14}{30} = 0.99
E(Balance < 50K) = -\frac{12}{13}\log_2\frac{12}{13} - \frac{1}{13}\log_2\frac{1}{13} = 0.39
E(Balance >= 50K) = -\frac{4}{17}\log_2\frac{4}{17} - \frac{13}{17}\log_2\frac{13}{17} = 0.79
Weighted average of entropy:
E(Balance) = \frac{13}{30} \cdot 0.39 + \frac{17}{30} \cdot 0.79 = 0.62
Information Gain:
IG(Parent, Balance) = E(Parent) - E(Balance) = 0.99 - 0.62 = 0.37
Information Gain... example
• Feature 2: Residence
• Residence = own : Sub total: 8 (7 write-off, 1 not)
• Residence= rent : Sub total: 10 (4 write-off, 6 not)
• Residence= other : Sub total: 12 (5 write-off, 7 not)
E(parent) = -\frac{16}{30}\log_2\frac{16}{30} - \frac{14}{30}\log_2\frac{14}{30} = 0.99
E(Residence = own) = -\frac{7}{8}\log_2\frac{7}{8} - \frac{1}{8}\log_2\frac{1}{8} = 0.54
E(Residence = rent) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{6}{10}\log_2\frac{6}{10} = 0.97
E(Residence = other) = -\frac{5}{12}\log_2\frac{5}{12} - \frac{7}{12}\log_2\frac{7}{12} = 0.98
Weighted average of entropy:
E(Residence) = \frac{8}{30} \cdot 0.54 + \frac{10}{30} \cdot 0.97 + \frac{12}{30} \cdot 0.98 = 0.86
Information Gain:
IG(Parent, Residence) = E(Parent) - E(Residence) = 0.99 - 0.86 = 0.13
The information gain from Balance (0.37) is almost three times the information gain from Residence (0.13): the feature with the higher information gain, Balance, is more informative and should be used to split the data at this node.
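These numbers can be re-checked with the entropy() and information_gain() helpers sketched earlier. The label lists below are reconstructed from the counts on the slides; the exact results (about 0.38 and 0.14) differ slightly from 0.37 and 0.13 only because the slides round the intermediate entropies to two decimals.

    # Re-checking the worked example; labels reconstructed from the slide counts.
    parent = ["w"] * 16 + ["n"] * 14                      # 16 write-off, 14 not

    balance = [["w"] * 12 + ["n"] * 1,                    # Balance < 50K
               ["w"] * 4 + ["n"] * 13]                    # Balance >= 50K
    print(round(information_gain(parent, balance), 2))    # ~0.38 (slide: 0.37)

    residence = [["w"] * 7 + ["n"] * 1,                   # own
                 ["w"] * 4 + ["n"] * 6,                   # rent
                 ["w"] * 5 + ["n"] * 7]                   # other
    print(round(information_gain(parent, residence), 2))  # ~0.14 (slide: 0.13)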
Gini Index
• The Gini Index is also known as Gini impurity; it is a measure of how mixed or impure a dataset is.
• Gini impurity ranges between 0 and 1, where 0 represents a
pure dataset and 1 represents a completely impure dataset.
• Pure dataset - all the samples belong to the same class or
category.
• Impure dataset - contains a mixture of different classes or
categories.
Gini Index
Gini_impurity = 1 - \sum_{i} p(i)^2
where p(i) is the probability of a specific class and the summation
is done for all classes present in the dataset.
Lets consider a toy dataset with two classes Yes and No and the
following class probabilities: p(Yes) = 0.3 and p(No) = 0.7
Gini_impurity = 1 - (0.3)^2 - (0.7)^2 = 0.42
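A minimal sketch of the formula, checked against the toy probabilities above:

    # Gini impurity = 1 - sum_i p(i)^2 for a list of class probabilities.
    def gini_impurity(probs):
        return 1 - sum(p * p for p in probs)

    print(round(gini_impurity([0.3, 0.7]), 2))   # 0.42
    print(round(gini_impurity([1.0]), 2))        # 0.0  (pure dataset)
    print(round(gini_impurity([0.5, 0.5]), 2))   # 0.5  (maximally mixed, 2 classes)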
Gini Index - Example
Consider a toy dataset with the following features and class labels:
• Target class label is Buys_insurance and it can take two
values Yes or No.
• We want to determine the best feature to use as the root
node for the decision tree.
• To do this, we will calculate the Gini impurity for each feature
and select the feature with the lowest Gini impurity.
Gini Index - Example
Calculate the Gini impurity for each feature using the formula:
Gini_impurity = 1 - p(Feature = value_1)^2 - p(Feature = value_2)^2 - \dots - p(Feature = value_N)^2
1 Age Gini impurity: 1 - p(age = 20)^2 - p(age = 25)^2 - \dots - p(age = 45)^2
Gini_impurity = 1 - (1/6)^2 - (1/6)^2 - \dots - (1/6)^2 \approx 0.83
2 Gender Gini impurity: 1 - p(Gender = male)^2 - p(Gender = female)^2
Gini_impurity = 1 - (3/6)^2 - (3/6)^2 = 0.5
3 Income Gini impurity: 1 - p(Income = high)^2 - p(Income = medium)^2 - p(Income = low)^2
Gini_impurity = 1 - (3/6)^2 - (2/6)^2 - (1/6)^2 \approx 0.61
4 Credit Score Gini impurity: 1 - p(cs = excellent)^2 - p(cs = fair)^2 - p(cs = poor)^2
Gini_impurity = 1 - (3/6)^2 - (2/6)^2 - (1/6)^2 \approx 0.61
From the above calculations, the feature Gender has the lowest Gini impurity (0.5), so we select Gender as the root node for the decision tree.
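The toy table itself is not reproduced in these notes, so the value counts below are assumptions chosen to match the fractions used in the calculations above (6 rows; the specific age values are placeholders). The sketch simply repeats the per-feature Gini computation in code.

    # Per-feature Gini impurity from assumed value counts (6-row toy dataset).
    from collections import Counter

    def gini_from_counts(counts):
        total = sum(counts.values())
        return 1 - sum((n / total) ** 2 for n in counts.values())

    features = {
        "Age":          Counter({20: 1, 25: 1, 30: 1, 35: 1, 40: 1, 45: 1}),
        "Gender":       Counter({"male": 3, "female": 3}),
        "Income":       Counter({"high": 3, "medium": 2, "low": 1}),
        "Credit Score": Counter({"excellent": 3, "fair": 2, "poor": 1}),
    }
    for name, counts in features.items():
        print(name, round(gini_from_counts(counts), 3))
    # Age 0.833, Gender 0.5, Income 0.611, Credit Score 0.611 -> Gender is lowest.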
Decision Trees - over-fitting issue
• Decision Trees are prone to over-fitting.
• Overfitting in decision trees occurs when the tree becomes too
complex and captures the noise in the training data, rather
than the underlying pattern.
• A decision tree will always overfit the training data if we allow
it to grow to its max depth.
• This can lead to poor generalization performance on new, unseen data.
Python Implementations
1 entropy example: entropy_iris.py
2 Training an Unpruned Decision Tree: dt_iris_without.py
• Accuracy on the training set: 1.0
• Accuracy on the testing set: 0.9333
3 Using pruning - removing branches that do not provide much
information gain or that are not necessary for the tree to make
accurate predictions: dt_iris_with_pruning.py
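The course scripts are not reproduced here; the sketch below shows roughly what the unpruned-versus-pruned comparison looks like in scikit-learn (the max_depth=3 choice is an illustrative assumption, and exact accuracies depend on the train/test split).

    # Rough sketch of the unpruned vs. pruned comparison (not the course scripts).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42, stratify=y
    )

    # Unpruned: grown to full depth, so it fits the training set perfectly.
    full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print(full.score(X_train, y_train), full.score(X_test, y_test))

    # Pre-pruned: limiting depth trades a little training accuracy for better
    # generalization; cost-complexity pruning via ccp_alpha is another option.
    pruned = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
    print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))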
Thanks For Your Attention!
Any questions?