Decision Trees
Introduction
Entropy
Information gain
Tree models
Trees are expressive and easy to understand.
A feature tree is a tree such that each internal node is labeled
with a feature and each edge emanating from an internal node
is labeled with a literal.
The set of literals at a node is called a split.
Each leaf of the tree represents a logical expression, which is
the conjunction of the literals on the path from the root to that leaf.
The extension of that conjunction is called the instance space
segment associated with the leaf.
For example, in the tennis tree grown later in these slides, one leaf corresponds to the conjunction Outlook = Sunny ∧ Humidity = High, and its instance space segment is the set of all instances that satisfy this conjunction.
Tree models
An internal node tests the value of a particular feature.
A branch represents an outcome of the test.
A leaf node represents a class label h(x) or a class-label
distribution.
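As a minimal sketch of this structure (assuming discrete features; the class names are ours, and the example tree is the PlayTennis tree built later in these slides), a feature tree can be represented in Python as:

class Leaf:
    def __init__(self, label):
        self.label = label                      # class label h(x) stored at the leaf

class Node:
    def __init__(self, feature, children):
        self.feature = feature                  # feature tested at this internal node
        self.children = children                # dict: test outcome -> subtree

def predict(tree, x):
    # Follow the branch matching x's value for the tested feature until a leaf is reached.
    while isinstance(tree, Node):
        tree = tree.children[x[tree.feature]]
    return tree.label

# The tree grown at the end of these slides for the PlayTennis data:
tree = Node('Outlook', {
    'Sunny': Node('Humidity', {'High': Leaf('No'), 'Normal': Leaf('Yes')}),
    'Overcast': Leaf('Yes'),
    'Rain': Node('Wind', {'Strong': Leaf('No'), 'Weak': Leaf('Yes')}),
})
print(predict(tree, {'Outlook': 'Sunny', 'Humidity': 'Normal'}))   # Yes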
Building decision trees
Top-down tree construction
At start, all training examples are at the root.
Partition the examples recursively by choosing one attribute
each time.
Bottom-up tree pruning
Remove subtrees or branches, in a bottom-up manner, to
improve the estimated accuracy on new cases.
Building decision trees
The construction of a tree involves the following three general
elements:
The selection of the splits, i.e., which node to split and how to
split it?
When to declare a node terminal and stop splitting?
We have to assign each terminal node to a class. How do we
assign these class labels?
Building decision trees
The algorithm starts with a single node N, representing the
training tuples in D.
If the tuples in D are all of the same class, then node N becomes
a leaf and is labeled with that class.
Otherwise, the algorithm calls BestSplit to determine the splitting
criterion.
This tells which attribute to test at node N by determining the
best way to partition the tuples in D into individual classes.
The splitting attribute is determined so that the resulting
partitions are as pure as possible – a partition is pure if all the tuples in it belong to the same class.
Building decision trees
The node N is labeled with the splitting criterion, which serves as
a test at the node.
A branch is grown from N for each outcome of the splitting
criterion.
The tuples in D are partitioned accordingly.
The algorithm uses the same process recursively to form a decision
tree for the tuples in each resulting partition Dj of D.
Building decision trees
The recursive partitioning stops under any one of the following
terminating conditions:
All of the tuples in partition D (at node N) belong to the same
class, or
There are no remaining attributes on which the tuples may be
further partitioned, or
There are no tuples for a given branch, i.e., a partition Dj is
empty.
Choosing the Best attribute
How to choose the best attribute or feature?
A goodness-of-split measure is used for this purpose.
The goodness of split is in turn measured by an impurity function.
Typical impurity functions:
Minority class : min(p, 1-p)
Gini index : 2p(1-p)
Entropy : – p log2 p – (1 – p) log2 (1 – p)
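As a small illustration (a sketch, not from the sources; the function names are ours), these impurity measures can be written directly in Python as functions of the proportion p of positives in a node:

from math import log2

def minority_class(p):
    return min(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)

def entropy(p):
    if p in (0, 1):                      # convention: 0 log2 0 = 0
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(entropy(9 / 14))                   # ≈ 0.940, the entropy of the [9+, 5–] sample used below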
Choosing the Best attribute
A1 and A2 are “attributes” (i.e. features or inputs).
[Figure: the numbers of + and – examples before and after a split on A1 and on A2.]
- Many different frameworks for choosing BEST have been
proposed!
- We will look at Entropy Gain.
Entropy
A formula to calculate the homogeneity of a sample.
A completely homogeneous sample has entropy of 0.
An equally divided sample has entropy of 1.
p+ = no. of positive samples / total samples in S.
p- = no. of negative samples / total samples in S.
Entropy
Entropy(S) ≡ – p+ log2 p+ – p– log2 p– [0 log2 0 = 0]
Entropy([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 (0) = 0
Entropy([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2 (7/14) – 7/14 log2 (7/14) = 1/2 + 1/2 = 1 [log2 1/2 = – 1]
Note: 0 ≤ p ≤ 1, 0 ≤ entropy ≤ 1
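These three values can be checked with a few lines of Python (a sketch; the entropy helper below works from the positive and negative counts and uses the 0 log2 0 = 0 convention):

from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                        # skip empty classes: 0 log2 0 = 0
            p = count / total
            result -= p * log2(p)
    return result

print(entropy(14, 0))                    # 0.0
print(round(entropy(9, 5), 2))           # 0.94
print(entropy(7, 7))                     # 1.0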
Entropy
[Figure: splitting a [9+, 5–] sample; a resulting node with 100% yes is pure, one with only 40% yes is not pure at all.]
Information gain (IG)
Which is the best attribute?
The one which will result in the smallest tree.
Heuristic: choose the attribute that produces the “purest” nodes
Popular impurity criterion: information gain
Information gain increases with the average purity of the
subsets that an attribute produces.
Information gain uses entropy H(p)
Strategy: choose attribute that results in greatest information
gain.
Information gain (IG)
The information gain is based on the decrease in entropy after a
dataset is split on an attribute.
Which attribute creates the most homogeneous branches?
Calculate the entropy of the total dataset D before split.
The dataset is then split on the different attributes.
The entropy of each branch is calculated. The branch entropies are then
added, weighted by the fraction of tuples in each branch, to get the total entropy for the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the Information Gain, or decrease in entropy.
Information gain (IG)
Information gain = entropy before split – total entropy for the split.
The attribute that yields the largest IG is chosen for the decision node.
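A minimal sketch of this computation in Python, working directly from class counts (the function names and the two-way example split at the end are illustrative, not taken from the slides):

from math import log2

def entropy(counts):
    # Entropy of a node given its class counts, e.g. [9, 5]; 0 log2 0 = 0.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent_counts, branch_counts):
    # Entropy before the split minus the weighted entropy after the split.
    total = sum(parent_counts)
    after = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - after

# Illustrative split of a [9, 5] node into a pure branch and an evenly mixed branch.
print(round(information_gain([9, 5], [[4, 0], [5, 5]]), 3))     # ≈ 0.226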
Decision Trees
When do I play tennis?
Best node?
Attribute: Outlook
“Outlook” = “Sunny”:
info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
“Outlook” = “Overcast”:
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
“Outlook” = “Rainy”:
info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
Expected Information for “Outlook”:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971
= 0.693 bits
Information gain
Entropy(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
Information gain:
(information before split) – (information after split)
gain(" Outlook" ) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693
= 0.247 bits
Similarly calculate the information gain for 'Wind” attribute!
Attribute: Wind
Values(Wind) = {Weak, Strong}
S = [9+, 5−], Entropy(S) = 0.94
SWeak = [6+, 2−], Entropy(SWeak) = − 6/8 log(6/8) − 2/8 log(2/8) = 0.811
SStrong = [3+, 3−], Entropy(SStrong) = − 3/6 log(3/6) − 3/6 log(3/6) = 1
Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(SWeak) − 6/14 Entropy(SStrong)
= 0.94 − 8/14 × 0.811 − 6/14 × 1.00
= 0.048
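The two gains can be double-checked with a short calculation (a sketch; H below is just the entropy computed from class counts):

from math import log2

def H(counts):
    # Entropy from class counts, with 0 log2 0 = 0.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

before = H([9, 5])                                                   # 0.940
gain_outlook = before - (5/14 * H([2, 3]) + 4/14 * H([4, 0]) + 5/14 * H([3, 2]))
gain_wind = before - (8/14 * H([6, 2]) + 6/14 * H([3, 3]))
print(round(gain_outlook, 3), round(gain_wind, 3))                   # 0.247 0.048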
Best Attribute?
Which attribute should be tested at the root?
gain(" Outlook" ) = 0.247 bits
gain(" Temperature" ) = 0.029 bits
gain(" Humidity") = 0.152 bits
gain(" Windy") = 0.048 bits
Outlook provides the best prediction for the target.
Let's grow the tree:
add to the tree a successor for each possible value of
Outlook
partition the training samples according to the value of
Outlook
After the first step?
Second step
Gain(SSunny, Temp.) = 0.971 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.571
Gain(SSunny, Wind) = 0.971 − 2/5 × 1.0 − 3/5 × 0.918 = 0.020
Gain(SSunny, Humidity) = 0.971 − 3/5 × 0.0 − 2/5 × 0.0 = 0.971
Second step
Working on Outlook=Sunny node:
gain(" Temperature" ) = 0.571 bits
gain(" Windy") = 0.020 bits
gain(" Humidity") = 0.971 bits
Humidity provides the best prediction for the target
Let's grow the tree:
add to the tree a successor for each possible value of
Humidity
partition the training samples according to the value of
Humidity
After second step
[Tree after the second step: under Outlook = Sunny, the Humidity test gives the leaves {D1, D2, D8} → No and {D9, D11} → Yes.]
After second step
The test on Humidity provides two splits, and the resulting tuples
of each split fall in the same class.
Hence leaf nodes are labeled with that class.
Let's grow the tree at the Outlook = Rain node:
choose the best attribute to test at that node and add to the tree a successor for each of its possible values
partition the training samples according to the value of that attribute
Wind Attribute
Let S = {D4, D5, D6, D10, D14}
Entropy: H(S) = – 3/5 log(3/5) – 2/5 log(2/5) = 0.971
Information Gain
IG(S,Temp) = H(S) – H(S|Temp) = 0.01997
IG(S,Wind) = H(S) – H(S|Wind)
= 0.971 − 3/5 × 0.0 − 2/5 × 0.0 = 0.971
After third step
[Tree after the third step: under Outlook = Sunny, Humidity gives {D1, D2, D8} → No and {D9, D11} → Yes; under Outlook = Rain, Wind gives {D6, D14} → No and {D4, D5, D10} → Yes.]
Final decision tree
ID3 Algorithm
ID3(X, T, Attrs)
X: training examples; T: target attribute (e.g. PlayTennis);
Attrs: candidate attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty, return Root with class = most common value of T in X
else
A ← best attribute; decision attribute for Root ← A
For each possible value vi of A:
- add a new branch below Root, for test A = vi
- Xi ←subset of X with A = vi
- If Xi is empty then add a new leaf with class the most common
value of T in X
else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
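A compact, runnable Python sketch of this algorithm (an approximation for discrete attributes, not the exact implementation from the sources; examples are dicts mapping attribute names to values, and leaves are plain class labels):

from collections import Counter
from math import log2

def entropy(examples, target):
    counts = Counter(x[target] for x in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    total = len(examples)
    remainder = 0.0
    for value in {x[attr] for x in examples}:
        subset = [x for x in examples if x[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attrs):
    labels = [x[target] for x in examples]
    if len(set(labels)) == 1:                  # all examples share one class: leaf
        return labels[0]
    if not attrs:                              # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a, target))
    # One branch per value of the best attribute that occurs in the current examples
    # (so the "empty partition" case of the pseudocode does not arise here).
    tree = {best: {}}
    for value in {x[best] for x in examples}:
        subset = [x for x in examples if x[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attrs if a != best])
    return tree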
Acknowledgement
Machine Learning – Peter Flach
Notes from Tom Mitchell