Decision Trees
Decision Trees are a type of Supervised Machine Learning (that is, the training data specifies both the input and the corresponding output) where the data is continuously split according to a certain parameter. The tree can be explained by two entities: decision nodes and leaves. The leaves are the decisions or final outcomes, and the decision nodes are where the data is split.
There are two main types of Decision Trees:
Classification trees (Yes/No types): the decision variable is categorical.
Regression trees (continuous data types): the decision or outcome variable is continuous.
Decision Trees
A decision tree is a flow-chart-like tree structure
Internal node denotes a test on an attribute (feature)
Branch represents an outcome of the test
All records in a branch have the same value for the tested attribute
Leaf node represents class label or class label distribution
[Figure: example tree. The root tests outlook (sunny / overcast / rain); the sunny branch tests humidity (high → N, normal → P), the overcast branch is the leaf P, and the rain branch tests windy (true → N, false → P).]
Instance Language for Classification
Example: “is it a good day to play golf?”
A set of attributes and their possible values:
outlook: sunny, overcast, rain
temperature: cool, mild, hot
humidity: high, normal
windy: true, false
A particular instance in the training set might be:
<overcast, hot, normal, false>: play
In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
Using Decision Trees for Classification
Examples can be classified as follows:
1. start at the root and look at the example's value for the feature tested at the current node
2. move along the edge labeled with this value
3. if you reach a leaf, return the label of the leaf
4. otherwise, repeat from step 1 at the node you reached
Example (a decision tree to decide whether to go on a picnic):
[Figure: the same tree as above, with outlook at the root (sunny / overcast / rain), humidity tested on the sunny branch (high → N, normal → P), the leaf P under overcast, and windy tested on the rain branch (true → N, false → P).]
So a new instance:
<rainy, hot, normal, true>: ?
will be classified as “no play”.
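To make the procedure concrete, here is a minimal sketch (not from the slides) that encodes the picnic tree as nested Python dictionaries and walks it exactly as in the steps above; the dictionary encoding and attribute names are assumptions made for this example.

```python
# A decision tree encoded as nested dicts: an internal node maps an attribute
# name to a dict of {attribute value: subtree}; a leaf is just a class label.
picnic_tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"windy": {True: "N", False: "P"}},
    }
}

def classify(tree, instance):
    """Walk the tree: test the attribute at the current node, follow the
    branch labeled with the instance's value, stop at a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                      # attribute tested at this node
        tree = tree[attribute][instance[attribute]]       # follow the matching branch
    return tree                                           # leaf: the class label

# The new instance from the slide: <rainy, hot, normal, true>
print(classify(picnic_tree, {"outlook": "rain", "temp": "hot",
                             "humidity": "normal", "windy": True}))  # -> "N" (no play)
```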
Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold.
[Figure: the same tree with thresholds. The sunny branch tests humidity (> 75% → N, <= 75% → P); overcast is the leaf P; the rain branch tests wind (> 20 → N, <= 20 → P).]
Each path in the tree represents a decision rule:
Rule 1: If (outlook = “sunny”) AND (humidity <= 0.75) Then (play = “yes”)
Rule 2: If (outlook = “rainy”) AND (wind > 20) Then (play = “no”)
Rule 3: If (outlook = “overcast”) Then (play = “yes”)
...
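The same rules can be written directly as code. This small function (mine, not from the slides) encodes Rules 1-3 and their complementary branches, with humidity given as a fraction and wind in the slide's unspecified units.

```python
def play_decision(outlook, humidity, wind):
    """Decision rules read off the tree paths above. Returns "yes" or "no"."""
    if outlook == "sunny":
        return "yes" if humidity <= 0.75 else "no"   # Rule 1 and its complement
    if outlook == "overcast":
        return "yes"                                 # Rule 3
    if outlook == "rainy":
        return "no" if wind > 20 else "yes"          # Rule 2 and its complement
    raise ValueError(f"unknown outlook: {outlook}")

print(play_decision("sunny", humidity=0.70, wind=5))   # -> "yes"
print(play_decision("rainy", humidity=0.80, wind=25))  # -> "no"
```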
Top-Down Decision Tree Generation
The basic approach usually consists of two phases:
Tree construction
At the start, all the training instances are at the root
Partition instances recursively based on selected attributes
Tree pruning (to improve accuracy)
remove tree branches that may reflect noise in the training data and lead to errors when
classifying test data
Basic Steps in Decision Tree Construction
Tree starts as a single node representing all the data
If the instances are all of the same class, the node becomes a leaf labeled with that class label
Otherwise, select the feature that best distinguishes among the instances
Partition the data based on the values of the selected feature (with each branch representing one partition)
Recursion stops when:
the instances in a node belong to the same class (or too few instances remain)
there are no remaining attributes on which to split
Tree Construction Algorithm (ID3)
Decision Tree Learning Method (ID3)
Input: a set of training instances S, a set of features F
1. If every element of S has the class value “yes”, return “yes”; if every element of S has the class value “no”, return “no”
2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
3. Extend the tree from f by adding a new branch for each attribute value of f
3.1. Set F’ = F – {f}
4. Distribute the training instances to the leaf nodes (so each leaf node n represents the subset Sn of S with the corresponding attribute value)
5. Repeat steps 1-5 for each leaf node n, with Sn as the new set of training instances and F’ as the new set of features
Main Question:
how do we choose the best feature at each step?
Note: the ID3 algorithm only deals with categorical attributes, but it can be extended (as in C4.5) to handle continuous attributes
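The following is a minimal sketch of this recursive procedure in Python, assuming categorical features and training instances given as (feature-dict, label) pairs; the names id3 and select_best are illustrative, and the selection function (information gain, covered next) is passed in as a parameter. Where step 2 returns failure when no features remain, this sketch falls back to the majority class, a common variant.

```python
from collections import Counter

def id3(instances, features, select_best):
    """instances: list of (feature_dict, label) pairs; features: list of names;
    select_best: callable(instances, features) -> feature name.
    Returns a nested-dict tree as in the earlier classification sketch."""
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:                     # step 1: pure node -> leaf
        return labels[0]
    if not features:                              # no features left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    f = select_best(instances, features)          # step 2: choose the best feature
    remaining = [g for g in features if g != f]   # step 3.1: F' = F - {f}
    tree = {f: {}}
    for v in {x[f] for x, _ in instances}:        # steps 3-4: one branch per value of f
        subset = [(x, y) for x, y in instances if x[f] == v]
        tree[f][v] = id3(subset, remaining, select_best)   # step 5: recurse
    return tree
```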
Choosing the “Best” Feature
Use Information Gain to find the “best” (most discriminating) feature
Assume there are two classes, P and N (e.g., P = “yes” and N = “no”)
Let the set of instances S (the training data) contain p elements of class P and n elements of class N
The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p, n):
I(p, n) = −Pr(P) log2 Pr(P) − Pr(N) log2 Pr(N)
Note that Pr(P) = p / (p + n) and Pr(N) = n / (p + n)
In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.
Entropy
[Figure: entropy for a two-class variable.]
Entropy in Multi-Class Problems
More generally, if we have m classes, c1, c2, …, cm, with s1, s2, …, sm as the numbers of instances from S in each class, then the entropy is:
I(s1, s2, …, sm) = − Σi=1..m pi log2 pi
where pi = si / |S| is the probability that an arbitrary instance belongs to the class ci.
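A minimal sketch of this entropy computation in Python (the function name and interface are my own); it reproduces the two-class value used in the worked example later, I(9, 5) ≈ 0.94.

```python
from math import log2

def entropy(*counts):
    """Entropy I(s1, ..., sm) of a set with the given per-class counts,
    using the convention 0 * log2(0) = 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]    # skip empty classes
    return -sum(p * log2(p) for p in probs)

print(round(entropy(9, 5), 3))   # -> 0.94  (the golf data: 9 positive, 5 negative)
print(entropy(4, 0) == 0)        # -> True  (a pure node has zero entropy)
```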
Information Gain
Now, assume that using attribute A a set S of instances will be
partitioned into sets S1, S2 , …, Sv each corresponding to distinct
values of attribute A.
If Si contains pi cases of P and ni cases of N, then the entropy, or the expected information needed to classify objects in all subtrees Si, is:
E(A) = Σi=1..v Pr(Si) I(pi, ni), where Pr(Si) = |Si| / |S| = (pi + ni) / (p + n) is the probability that an arbitrary instance in S belongs to the partition Si
The encoding information that would be gained by branching on A is:
Gain(A) = I(p, n) − E(A)
At any point we want to branch using the attribute that provides the highest information gain.
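A short sketch of Gain(A) for a categorical attribute, under the same (feature-dict, label) representation as the earlier ID3 sketch; the helper names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def _entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(instances, attribute):
    """Gain(A) = I(p, n) - E(A) for a categorical attribute A.
    instances: list of (feature_dict, class_label) pairs."""
    labels = Counter(label for _, label in instances)
    base = _entropy(labels.values())                      # I(p, n) for all of S
    partitions = defaultdict(Counter)
    for features, label in instances:
        partitions[features[attribute]][label] += 1       # class counts per value of A
    n = len(instances)
    expected = sum(sum(c.values()) / n * _entropy(c.values())
                   for c in partitions.values())          # E(A) = sum Pr(Si) * I(pi, ni)
    return base - expected
```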
Attribute Selection - Example
The “Golf” example: what attribute should we choose as the root?

Day  outlook   temp  humidity  wind    play
D1   sunny     hot   high      weak    No
D2   sunny     hot   high      strong  No
D3   overcast  hot   high      weak    Yes
D4   rain      mild  high      weak    Yes
D5   rain      cool  normal    weak    Yes
D6   rain      cool  normal    strong  No
D7   overcast  cool  normal    strong  Yes
D8   sunny     mild  high      weak    No
D9   sunny     cool  normal    weak    Yes
D10  rain      mild  normal    weak    Yes
D11  sunny     mild  normal    strong  Yes
D12  overcast  mild  high      strong  Yes
D13  overcast  hot   normal    weak    Yes
D14  rain      mild  high      strong  No

Splitting S: [9+, 5-] on Outlook gives sunny [2+, 3-], overcast [4+, 0-], and rainy [3+, 2-].

I(9,5) = -(9/14)·log(9/14) - (5/14)·log(5/14) = 0.94
I(4,0) = -(4/4)·log(4/4) - (0/4)·log(0/4) = 0
I(2,3) = -(2/5)·log(2/5) - (3/5)·log(3/5) = 0.97
I(3,2) = -(3/5)·log(3/5) - (2/5)·log(2/5) = 0.97

Gain(outlook) = .94 - (4/14)*0 - (5/14)*.97 - (5/14)*.97 = .24
Attribute Selection - Example (Cont.)
Splitting S: [9+, 5-] (I = 0.940) on humidity gives high [3+, 4-] (I = 0.985) and normal [6+, 1-] (I = 0.592).
Gain(humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151
Splitting S: [9+, 5-] (I = 0.940) on wind gives weak [6+, 2-] (I = 0.811) and strong [3+, 3-] (I = 1.00).
Gain(wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048
So, classifying examples by humidity provides more information gain than by wind. Similarly, we must find the information gain for “temp”. In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.
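For readers who want to check these numbers, the following self-contained sketch recomputes the information gain of every attribute on the golf table above; the data layout and helper names are my own.

```python
from collections import Counter, defaultdict
from math import log2

# The golf training set from the table above: (outlook, temp, humidity, wind, play)
DATA = [
    ("sunny", "hot", "high", "weak", "No"),          ("sunny", "hot", "high", "strong", "No"),
    ("overcast", "hot", "high", "weak", "Yes"),      ("rain", "mild", "high", "weak", "Yes"),
    ("rain", "cool", "normal", "weak", "Yes"),       ("rain", "cool", "normal", "strong", "No"),
    ("overcast", "cool", "normal", "strong", "Yes"), ("sunny", "mild", "high", "weak", "No"),
    ("sunny", "cool", "normal", "weak", "Yes"),      ("rain", "mild", "normal", "weak", "Yes"),
    ("sunny", "mild", "normal", "strong", "Yes"),    ("overcast", "mild", "high", "strong", "Yes"),
    ("overcast", "hot", "normal", "weak", "Yes"),    ("rain", "mild", "high", "strong", "No"),
]
ATTRS = ["outlook", "temp", "humidity", "wind"]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(column):
    base = entropy(Counter(row[-1] for row in DATA).values())   # I(9, 5) = 0.940
    parts = defaultdict(Counter)
    for row in DATA:
        parts[row[column]][row[-1]] += 1                        # class counts per value
    return base - sum(sum(c.values()) / len(DATA) * entropy(c.values())
                      for c in parts.values())

for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(i):.3f}")
# -> outlook 0.247, temp 0.029, humidity 0.152, wind 0.048: outlook wins the root
```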
Attribute Selection - Example (Cont.)
Partially learned decision tree: Outlook is at the root, with S: [9+, 5-] = {D1, D2, …, D14}.
sunny → ? [2+, 3-] {D1, D2, D8, D9, D11}
overcast → yes [4+, 0-] {D3, D7, D12, D13}
rainy → ? [3+, 2-] {D4, D5, D6, D10, D14}
Which attribute should be tested at the sunny branch?
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
Gain(Ssunny, temp) = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
Gain(Ssunny, wind) = .970 - (2/5)*1.0 - (3/5)*.918 = .019
Other Attribute Selection Measures
Gain ratio:
The Information Gain measure tends to be biased in favor of attributes with a large number of values
Gain Ratio normalizes the Information Gain with respect to the total entropy of all
splits based on values of an attribute
Used by C4.5 (the successor of ID3)
But, tends to prefer unbalanced splits (one partition much smaller than others)
Gini index:
A measure of impurity (based on relative frequencies of classes in a set of instances)
The attribute that provides the smallest Gini index (or the largest reduction in impurity
due to the split) is chosen to split the node
Possible Problems:
Biased towards multivalued attributes; similar to Info. Gain.
Has difficulty when # of classes is large
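For comparison with information gain, here is a minimal sketch of the Gini index and the impurity reduction of a categorical split; the names gini and gini_gain are illustrative, not part of any particular library.

```python
from collections import Counter, defaultdict

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for the given per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(instances, attribute):
    """Reduction in Gini impurity from splitting on a categorical attribute.
    instances: list of (feature_dict, class_label) pairs."""
    base = gini(Counter(label for _, label in instances).values())
    parts = defaultdict(Counter)
    for features, label in instances:
        parts[features[attribute]][label] += 1      # class counts per attribute value
    n = len(instances)
    weighted = sum(sum(c.values()) / n * gini(c.values()) for c in parts.values())
    return base - weighted                          # largest reduction -> chosen split
```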
Overfitting
• An induced tree may overfit the training data
• Too many branches, some may reflect anomalies due to noise or
outliers
• Some splits or leaf nodes may be the result of decisions based on very few instances, resulting in poor accuracy on unseen instances
Overfitting and Tree Pruning
Two approaches to avoid overfitting
Prepruning: Halt tree construction early; do not split a node if this would result in the error rate going above a pre-specified threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree
Get a sequence of progressively pruned trees
Use test data different from the training data to measure error rates
Select the “best pruned tree”
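As one practical illustration (not part of these slides), scikit-learn's DecisionTreeClassifier supports postpruning via cost-complexity pruning; the sketch below grows a full tree, builds a sequence of progressively pruned trees, and keeps the one that scores best on held-out data, roughly in the spirit of the procedure above. The dataset is just a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_held, y_train, y_held = train_test_split(X, y, random_state=0)

# Candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one progressively pruned tree per alpha and keep the one that does
# best on data not used for training.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_held, y_held),
)
print(best.get_n_leaves(), best.score(X_held, y_held))
```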
Enhancements to Basic Decision Tree
Learning Approach
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals (see the sketch after this list)
Handle missing attribute values
Assign the most common value of the attribute
Assign a probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
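As a sketch of the first enhancement, a common approach (used in C4.5) is to sort the continuous attribute and evaluate candidate thresholds at the midpoints between consecutive distinct values, keeping the threshold with the highest information gain; the helper below and its example numbers are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary split (<= t vs. > t) on a continuous attribute that
    gives the highest information gain. Candidate thresholds are midpoints
    between consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = float("-inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # same value: no class boundary here
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# e.g. humidity readings with play / no-play labels (made-up numbers)
print(best_threshold([65, 70, 75, 80, 85, 90],
                     ["yes", "yes", "yes", "no", "no", "no"]))  # -> (1.0, 77.5)
```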