Chapter 6
Decision Tree Learning

"Prediction is very difficult, especially if it's about the future."
~ Niels Bohr
Decision Tree Learning is a widely used predictive model in machine learning, applied mainly for classification tasks. The decision tree model basically represents logical rules that predict the value of a target variable by inferring from the data features. This chapter provides an insight into how to construct a decision tree and infer knowledge from the tree once it is constructed.
Learning Objectives
• Understand the structure of the decision tree
• Know about the fundamentals of Entropy
• Learn and understand popular univariate Decision Tree Induction algorithms such as ID3, C4.5, and a multivariate decision tree algorithm such as CART
• Deal with continuous attributes using the improved C4.5 algorithm
• Construct a Classification and Regression Tree (CART) for classifying both categorical and continuous-valued target variables
• Construct regression trees where the target feature is a continuous-valued variable
• Understand the basics of validating and pruning of decision trees
6.1 INTRODUCTION TO DECISION TREE LEARNING MODEL
Decision tree learning model, one of the most popular supervised learning models, classifies data instances with high accuracy and consistency. The model performs an inductive inference that reaches a general conclusion from observed examples, and it is used for solving complex classification applications.

A decision tree is a concept tree which summarizes the information contained in the training dataset in the form of a tree structure. Once the concept model is built, test data can be easily classified.
This model can be used to classify both categorical target variables and continuous-valued target variables. Given a training dataset X, the model computes a hypothesis function f(x) as a decision tree.

Inputs to the model are data instances or objects with a set of features or attributes, which can be discrete or continuous, and the output of the model is a decision tree which predicts or classifies the target class for the test data object.
In statistical terms, attributes or features are called independent variables. The target class is called the response variable, which indicates the category we need to predict for a test object.

The decision tree learning model generates a complete hypothesis space in the form of a tree structure with the given training dataset and allows us to search through the possible hypotheses, which in fact would be a smaller decision tree as we walk through the tree. This kind of search bias is called preference bias.
6.1.1 Structure of a Decision Tree
A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches, and terminal nodes/leaf nodes. The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called decision nodes. These nodes represent a choice or test of an attribute, and the outcomes or outputs of the test condition are the branches emanating from the decision node. The branches are labelled as per the outcomes or output values of the test condition. Each branch represents a sub-tree or subsection of the entire tree. Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a decision path. The labels of the leaf nodes are the different target classes a data instance can belong to.

Every path from the root to a leaf node represents a logical rule that corresponds to a conjunction of test attributes, and the whole tree represents a disjunction of these conjunctions. The decision tree model, in general, represents a collection of logical rules of classification in the form of a tree structure.
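The structure described above maps naturally onto a small recursive data type. The sketch below is only an illustration, not code from this text; the names DecisionNode, attribute, branches, label and classify are assumptions chosen for clarity.

    from dataclasses import dataclass, field
    from typing import Any, Dict, Optional

    @dataclass
    class DecisionNode:
        """A node of a decision tree: a decision/test node or, if branches is empty, a leaf."""
        attribute: Optional[str] = None                    # attribute tested at this node (None for a leaf)
        branches: Dict[Any, "DecisionNode"] = field(default_factory=dict)  # outcome value -> subtree
        label: Optional[str] = None                        # target class stored at a leaf node

        def is_leaf(self) -> bool:
            return not self.branches

    def classify(node: DecisionNode, instance: Dict[str, Any]) -> Optional[str]:
        """Follow one root-to-leaf path: at each decision node, take the branch whose
        outcome matches the instance's value for the tested attribute."""
        while not node.is_leaf():
            node = node.branches[instance[node.attribute]]
        return node.label

Each root-to-leaf path walked by classify corresponds to one logical rule, which is exactly the conjunction-of-tests view given above.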
Decision networks, otherwise called influence diagrams, have a directed graph structure with nodes and links. They are an extension of Bayesian belief networks that represents information about each node's current state, its possible actions, the possible outcomes of those actions, and its utility. The concept of Bayesian Belief Network (BBN) is discussed in Chapter 9.
Figure 6.1 shows the symbols that are used in this book to represent the different nodes in the construction of a decision tree. A circle is used to represent a root node, a diamond symbol is used to represent a decision node or the internal nodes, and all leaf nodes are represented with a rectangle.

Figure 6.1: Nodes in a Decision Tree (circle: root node; diamond: decision node; rectangle: leaf node)
Goal Given a training dataset, construct a decision tree
Building the Tree Construct a decision tree with the given training dataset. The tree is constructed in a top-down fashion. It starts from the root node. At every level of tree construction, we need to find the best split attribute or best decision node among all attributes. This process is recursive and is continued until we end up in the last level of the tree or find a leaf node which cannot be split further. The tree construction is complete when all the test conditions lead to a leaf node. The leaf node contains the target class or output of classification.
output Decision tree representing the complete hypothesis space
Knowledge Inference or Classification
Goal Given a test instance, infer the target class it belongs to.

Classification Inferring the target class for the test instance or object is based on inductive inference on the constructed decision tree. In order to classify an object, we need to start traversing the tree from the root. We traverse as we evaluate the test condition on every decision node with the test object's attribute value and walk to the branch corresponding to the test's outcome. This process is repeated until we end up in a leaf node which contains the target class of the test object.
Output Target label of the test instance.
Advantages of Decision Trees
1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables
4. Can model a high degree of nonlinearity in the relationship between the target variables and the predictor variables
5. Quick to train
Disadvantages of Decision Trees
Some of the issues that generally arise with decision tree learning are:
1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing it.
2. If the training data has errors or missing attribute values, then the decision tree constructed may become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.
4. A complex decision tree may also over-fit the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.
Example 6.1: Predict a student's academic performance of whether he will pass or fail in the final examination based on the given information such as class attendance, class assignments, home-work assignments, assessment, participation in competitions or other events, and group activities such as projects and presentations.

Solution: The target feature is the student's performance in the final examination, that is, whether he will pass or fail in the examination. The decision nodes are test nodes which check for conditions such as 'What is the student's class attendance?', 'How did he perform in his class assignments?', 'Did he do his home assignments properly?', 'What about his assessment results?', 'Did he participate in competitions or other events?', and 'What is the performance rating in group activities such as projects and presentations?'. Table 6.1 shows the attributes and the set of values for each attribute.
Table 6.1: Attributes and Associated Values

Attribute                                               Values
Class attendance                                        Good, Average, Poor
Class assignments                                       Good, Moderate, Poor
Home-work assignments                                   Yes, No
Assessment                                              Good, Moderate, Poor
Participation in competitions or other events           Yes, No
Group activities such as projects and presentations     Yes, No
Exam Result                                             Pass, Fail
The leaf nodes represent the outcomes, that is, either 'pass' or 'fail'.

A decision tree would be constructed by following a set of if-else conditions which may or may not include all the attributes, and the decision node outcomes are two or more than two. Hence, the tree is not a binary tree.

Note: A decision tree is not always a binary tree. It is a tree which can have more than two branches.
Example 6.2: Predict a student's academic performance of whether he will pass or fail based on the given information such as 'Assessment' and 'Assignment'. Table 6.2 shows the independent variables, Assessment and Assignment, and the target variable Exam Result with their values. Draw a binary decision tree.

Table 6.2: Attributes and Associated Values

Attribute        Values
Assessment       ≥ 50, < 50
Assignment       Yes, No
Exam Result      Pass, Fail
Solution: Consider the root node as 'Assessment'. If a student's marks are ≥ 50, the root node is branched to the leaf node 'Pass', and if the assessment marks are < 50, it is branched to another decision node. The decision node in the next level of the tree is 'Assignment': if a student has submitted his assignment, the node branches to 'Pass', and if not submitted, the node branches to 'Fail'. Figure 6.2 depicts this rule.
Figure 6.2: Illustration of a Decision Tree

This tree can be interpreted as a sequence of logical rules as follows:

if (Assessment ≥ 50) then 'Pass'
else if (Assessment < 50) then
    if (Assignment = Yes) then 'Pass'
    else if (Assignment = No) then 'Fail'

Now, if a test instance is given, such as a student who has scored 42 marks in his assessment and has not submitted his assignment, then it is predicted with the decision tree that his exam result is 'Fail'.
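The same rule sequence can be written as a tiny program. The sketch below is illustrative only and not from the text; the function name and argument types are assumptions.

    def predict_exam_result(assessment_marks: int, assignment_submitted: bool) -> str:
        """Apply the binary decision tree of Figure 6.2 to one student record."""
        if assessment_marks >= 50:
            return "Pass"
        # Assessment < 50: fall through to the 'Assignment' decision node
        if assignment_submitted:
            return "Pass"
        return "Fail"

    # Test instance from the text: 42 marks, assignment not submitted
    print(predict_exam_result(42, False))   # prints 'Fail'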
Many algorithms exist for constructing decision trees, and they will be studied in the sections below.
6.1.2 Fundamentals of Entropy

Given the training dataset with a set of attributes or features, the decision tree is constructed by finding the attribute or feature that best describes the target class for the given test instances. The best split feature is the one which contains more information about how to split the dataset among all features so that the target class is accurately identified for the test instances. In other words, the best split attribute is the most informative one for splitting the dataset into sub-datasets, and this process is continued until the stopping criterion is reached. This splitting should be pure at every stage of selecting the best feature.

The best feature is selected based on the amount of information among the features, which is basically calculated from probabilities. Quantifying information is closely related to information theory. In the field of information theory, the features are quantified by a measure called Shannon Entropy, which is calculated based on the probability distribution of the events.

Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an event. Moreover, entropy describes the homogeneity of the data instances. The best feature is selected based on the entropy value. For example, when a coin is flipped, head or tail are the only two outcomes, hence its entropy is lower when compared to rolling a dice which has got six outcomes.

Hence, the interpretation is:
Higher the entropy → Higher the uncertainty
Lower the entropy → Lower the uncertainty
Similarly, if all instances are homogeneous, say (1, 0), which means all instances belong to the same class (here it is positive), or (0, 1), where all instances are negative, then the entropy is 0. On the other hand, if the instances are equally distributed, say (0.5, 0.5), which means 50% positive and 50% negative, then the entropy is 1. If there are 10 data instances, out of which 6 belong to the positive class and 4 belong to the negative class, then the entropy is calculated as shown in Eq. (6.1):

Entropy = −[(6/10) log₂(6/10) + (4/10) log₂(4/10)]    (6.1)

It is concluded that if the dataset has instances that are completely homogeneous, then the entropy is 0, and if the dataset has samples that are equally divided (i.e., 50%–50%), it has an entropy of 1. Thus, the entropy value ranges between 0 and 1 based on the randomness of the samples in the dataset. If the entropy is 0, then the split is pure, which means that all samples in the set will partition into one class or category. But if the entropy is 1, the split is impure and the distribution of the samples is more random. The stopping criterion is based on the entropy value.
Let P be the probability distribution of the data instances over the classes 1 to m, as shown in Eq. (6.2):

P = P₁, P₂, ..., Pₘ    (6.2)

Entropy of P is the information measure of this probability distribution, given in Eq. (6.3):

Entropy_Info(P) = Entropy_Info(P₁, P₂, ..., Pₘ)
               = −(P₁ log₂(P₁) + P₂ log₂(P₂) + ... + Pₘ log₂(Pₘ))    (6.3)

where P₁ is the probability of data instances classified as class 1, P₂ is the probability of data instances classified as class 2, and so on:

P₁ = |No. of data instances belonging to class 1| / |Total no. of data instances in the training dataset|
Thus, Entropy_Info(P) can be computed as shown in Eq. (6.4). For the earlier example with 6 positive and 4 negative instances, Entropy_Info(6, 4) is calculated as:

Entropy_Info(6, 4) = −[(6/10) log₂(6/10) + (4/10) log₂(4/10)]    (6.4)
Mathematically, entropy is defined in Eq. (6.5) as:

Entropy_Info(X) = Σ_(x ∈ values(X)) Pr[X = x] · log₂(1 / Pr[X = x])
               = −Σ_(x ∈ values(X)) Pr[X = x] · log₂(Pr[X = x])    (6.5)

where Pr[X = x] is the probability of a random variable X with a possible outcome x.
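To make the arithmetic concrete, the short sketch below (not from the text; the function name entropy_from_counts is an assumption) computes Shannon entropy from class counts and reproduces the (6, 4) example of Eq. (6.4) as well as the pure and equally-divided cases.

    import math
    from typing import Sequence

    def entropy_from_counts(counts: Sequence[int]) -> float:
        """Shannon entropy (in bits) of a class distribution given as raw counts."""
        total = sum(counts)
        result = 0.0
        for c in counts:
            if c == 0:
                continue                  # the term 0 * log2(0) is taken as 0
            p = c / total
            result -= p * math.log2(p)
        return result

    print(entropy_from_counts([6, 4]))    # ~0.971 for the 6 positive / 4 negative example
    print(entropy_from_counts([10, 0]))   # 0.0 -> homogeneous (pure) set
    print(entropy_from_counts([5, 5]))    # 1.0 -> equally divided set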
Algorithm 6.1: General Algorithm for Decision Trees
1. Find the best attribute from the training dataset using an attribute selection measure and place it at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute, where each subset in a branch contains the data instances or tuples with the same value for the selected test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
4. This splitting process is recursive until the stopping criterion is reached.
Stopping Criteria
The following are some of the common stopping conditions:
1. The data instances are homogeneous, which means all belong to the same class C, and hence the entropy of the node is 0.
2. A node with some defined minimum number of data instances becomes a leaf (the number of data instances in a node is between 0.25% and 1.00% of the full training dataset).
3. The maximum tree depth is reached, so further splitting is not done and the node becomes a leaf node.
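Putting Algorithm 6.1 and these stopping conditions together, a minimal recursive sketch is given below. It is illustrative only and not from the text: the helper names (best_attribute, majority_class), the dictionary-based row format with a 'target' key, and the max_depth parameter are assumptions, and the attribute selection measure is passed in as a function so that any measure (information gain, gain ratio, GINI index) can be plugged in.

    from collections import Counter
    from typing import Any, Callable, Dict, List

    Instance = Dict[str, Any]       # one row: attribute name -> value, plus a 'target' key

    def majority_class(rows: List[Instance]) -> Any:
        """Most frequent target class in a subset (used when a node is forced to become a leaf)."""
        return Counter(row["target"] for row in rows).most_common(1)[0][0]

    def build_tree(rows: List[Instance],
                   attributes: List[str],
                   best_attribute: Callable[[List[Instance], List[str]], str],
                   depth: int = 0,
                   max_depth: int = 10) -> Any:
        classes = {row["target"] for row in rows}
        # Stopping criteria: homogeneous node, no attributes left, or maximum depth reached
        if len(classes) == 1:
            return classes.pop()
        if not attributes or depth >= max_depth:
            return majority_class(rows)
        # Step 1: choose the best split attribute using the supplied selection measure
        attr = best_attribute(rows, attributes)
        node = {"attribute": attr, "branches": {}}
        remaining = [a for a in attributes if a != attr]
        # Step 2: split into subsets, one branch per distinct value of the chosen attribute
        for value in {row[attr] for row in rows}:
            subset = [row for row in rows if row[attr] == value]
            # Step 3: recurse on each subset
            node["branches"][value] = build_tree(subset, remaining, best_attribute,
                                                 depth + 1, max_depth)
        return node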
6.2 DECISION TREE INDUCTION ALGORITHMS
There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE, that are used for classification in real-time environments. The most commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed by J. R. Quinlan in 1986, and C4.5, an advancement of ID3 presented by the same author in 1993. CART, which stands for Classification and Regression Trees, is another algorithm developed by Breiman et al. in 1984.

The accuracy of the tree constructed depends upon the selection of the best split attribute. Different algorithms are used for building decision trees, and they use different measures to decide on the splitting criterion. Algorithms such as ID3, C4.5 and CART are popular algorithms used in the construction of decision trees. The algorithm ID3 uses 'Information Gain' as the splitting criterion, whereas the algorithm C4.5 uses 'Gain Ratio' as the splitting criterion. The CART algorithm is popularly used for classifying both categorical and continuous-valued target variables. CART uses the GINI Index to construct a decision tree.

Decision trees constructed using ID3 and C4.5 are also called univariate decision trees, which consider only one feature/attribute to split at each decision node, whereas decision trees constructed using the CART algorithm are multivariate decision trees, which consider a conjunction of univariate splits. The details about univariate and multivariate data have been discussed in Chapter 2.
6.2.1 ID3 Tree Construction

ID3 is a supervised learning algorithm which uses a training dataset with labels and constructs a decision tree. ID3 is an example of univariate decision trees, as it considers only one feature at each decision node, which results in axis-aligned splits. The tree is then used to classify the future test instances. It constructs the tree using a greedy approach in a top-down fashion by identifying the best attribute at each level of the tree.
ID3 works well if the attributes or features to be considered are discrete/categorical. If some attributes are continuous, then we have to partition them into discrete or nominal attributes or features.

The algorithm builds the tree using a purity measure called 'Information Gain' with the training data instances and then uses the constructed tree to classify the test data. It is suited for a training set with only nominal/categorical attributes and with no missing values. ID3 works well for a large dataset; if the dataset is small, overfitting may happen. Moreover, it is not accurate if the dataset has missing attribute values, and no pruning is done during or after construction of the tree. ID3 is prone to outliers and cannot handle continuous attributes directly. Both C4.5 and CART can handle categorical attributes as well as continuous attributes, and both can also handle missing values, but C4.5 is prone to outliers whereas CART can handle outliers as well.
The steps in the ID3 algorithm are as follows:
1. Compute Entropy_Info, Eq. (6.8), for the whole training dataset based on the target attribute.
2. Compute Entropy_Info, Eq. (6.9), and Information_Gain, Eq. (6.10), for each of the attributes in the training dataset.
3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition of the root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining attributes until a leaf node is derived or no more training instances are available in the subset.

Note: We stop branching a node if its entropy is 0.

The best split attribute at every iteration is the attribute with the highest information gain.
Definitions
Let T be the training dataset.
Let A be the set of attributes A = {A₁, A₂, A₃, ..., Aₙ}.
Let m be the number of classes in the training dataset.
Let Pᵢ be the probability that a data instance or a tuple 'd' belongs to class Cᵢ. Mathematically, it is represented as shown in Eq. (6.7):

Pᵢ = |Number of data instances that belong to class Cᵢ in T| / |Total number of tuples in the training set T|    (6.7)
The expected information or Entropy needed to classify a data instance 'd' in T is denoted as Entropy_Info(T) and is given in Eq. (6.8):

Entropy_Info(T) = −Σᵢ₌₁ᵐ Pᵢ log₂(Pᵢ)    (6.8)

The entropy of every attribute, denoted as Entropy_Info(T, A), is shown in Eq. (6.9):

Entropy_Info(T, A) = Σᵥ (|Aᵥ| / |T|) × Entropy_Info(Aᵥ)    (6.9)

where the attribute A has got 'v' distinct values (a₁, a₂, ..., aᵥ), |Aᵥ| is the number of instances with the distinct value 'v' in attribute A, and Entropy_Info(Aᵥ) is the entropy of that set of instances.

Information_Gain is a metric that measures how much information is gained by branching on attribute A. In other words, it measures the reduction in impurity in an arbitrary subset of data. It is calculated as given in Eq. (6.10):

Information_Gain(A) = Entropy_Info(T) − Entropy_Info(T, A)    (6.10)
It can be noted that as the entropy of an attribute's split increases, its information gain decreases; the two quantities are inversely related.
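The following sketch (not from the text) implements Eq. (6.8)–(6.10) for categorical attributes. The function names and the list-of-dictionaries dataset layout with a 'target' key are assumptions of this sketch.

    import math
    from collections import Counter
    from typing import Any, Dict, List

    Instance = Dict[str, Any]       # one row: attribute name -> value, plus a 'target' key

    def entropy_info(rows: List[Instance]) -> float:
        """Entropy_Info(T), Eq. (6.8): -sum_i P_i log2(P_i) over the target classes."""
        total = len(rows)
        counts = Counter(row["target"] for row in rows)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def entropy_info_attribute(rows: List[Instance], attribute: str) -> float:
        """Entropy_Info(T, A), Eq. (6.9): weighted entropy of the subsets for each value of A."""
        total = len(rows)
        result = 0.0
        for value in {row[attribute] for row in rows}:
            subset = [row for row in rows if row[attribute] == value]
            result += (len(subset) / total) * entropy_info(subset)
        return result

    def information_gain(rows: List[Instance], attribute: str) -> float:
        """Information_Gain(A), Eq. (6.10): reduction in entropy obtained by splitting on A."""
        return entropy_info(rows) - entropy_info_attribute(rows, attribute)

A function like information_gain could serve as the attribute selection measure in the general build_tree sketch shown earlier, with ID3 simply choosing the attribute whose gain is maximum at each node.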
Example 6.3: Assess a student's performance during his course of study and predict whether a student will get a job offer or not in his final year of the course. The training dataset T consists of 10 data instances with attributes such as 'CGPA', 'Interactiveness', 'Practical Knowledge' and 'Communication Skills' as shown in Table 6.3. The target class attribute is 'Job Offer'.
Table 6.3: Training Dataset T

S.No.   CGPA   Interactiveness   Practical Knowledge   Communication Skills   Job Offer
1.      ≥9     Yes               Very good             Good                   Yes
2.      ≥8     No                Good                  Moderate               Yes
3.      ≥9     No                Average               Poor                   No
4.      <8     No                Average               Good                   No
5.      ≥8     Yes               Good                  Moderate               Yes
6.      ≥9     Yes               Good                  Moderate               Yes
7.      <8     Yes               Good                  Poor                   No
8.      ≥9     No                Very good             Good                   Yes
9.      ≥8     Yes               Good                  Good                   Yes
10.     ≥8     Yes               Average               Good                   Yes
Solution:
Step 1: Calculate the Entropy for the target class 'Job Offer'. There are 7 instances with Job Offer = Yes and 3 instances with Job Offer = No:

Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
    = −[(7/10) log₂(7/10) + (3/10) log₂(3/10)]
    = 0.3602 + 0.5211 = 0.8813

Iteration 1:
Step 2: Calculate the Entropy_Info and Gain (Information_Gain) for each of the attributes in the training dataset.

Table 6.4 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA.

Table 6.4: Entropy Information for CGPA

CGPA   Job Offer = Yes   Job Offer = No   Total   Entropy
≥9     3                 1                4       0.8113
≥8     4                 0                4       0
<8     0                 2                2       0

Entropy_Info(T, CGPA)
    = (4/10)[−(3/4) log₂(3/4) − (1/4) log₂(1/4)] + (4/10)[−(4/4) log₂(4/4)] + (2/10)[−(2/2) log₂(2/2)]
    = (4/10)(0.3113 + 0.5) + 0 + 0
    = 0.3245

Gain(CGPA) = 0.8813 − 0.3245 = 0.5568
Table 6.5 shows the number of data instances classified with Job Offer as Yes or No for the attribute Interactiveness.

Table 6.5: Entropy Information for Interactiveness

Interactiveness   Job Offer = Yes   Job Offer = No   Total
Yes               5                 1                6
No                2                 2                4

Entropy_Info(T, Interactiveness)
    = (6/10)[−(5/6) log₂(5/6) − (1/6) log₂(1/6)] + (4/10)[−(2/4) log₂(2/4) − (2/4) log₂(2/4)]
    = (6/10)(0.2192 + 0.4308) + (4/10)(0.5 + 0.5)
    = 0.3900 + 0.4000 = 0.7900

Gain(Interactiveness) = 0.8813 − 0.7900 = 0.0913

Table 6.6 shows the number of data instances classified with Job Offer as Yes or No for the attribute Practical Knowledge.
Table 6.6: Entropy Information for Practical Knowledge

Practical Knowledge   Job Offer = Yes   Job Offer = No   Total
Very good             2                 0                2
Average               1                 2                3
Good                  4                 1                5

Entropy_Info(T, Practical Knowledge)
    = (2/10)[−(2/2) log₂(2/2)] + (3/10)[−(1/3) log₂(1/3) − (2/3) log₂(2/3)] + (5/10)[−(4/5) log₂(4/5) − (1/5) log₂(1/5)]
    = 0 + (3/10)(0.5283 + 0.3900) + (5/10)(0.2575 + 0.4644)
    = 0 + 0.2755 + 0.3610
    = 0.6365

Gain(Practical Knowledge) = 0.8813 − 0.6365 = 0.2448

Table 6.7 shows the number of data instances classified with Job Offer as Yes or No for the attribute Communication Skills.
Table 6.7: Entropy Information for Communication Skills

Communication Skills   Job Offer = Yes   Job Offer = No   Total
Good                   4                 1                5
Moderate               3                 0                3
Poor                   0                 2                2

Entropy_Info(T, Communication Skills)
    = (5/10)[−(4/5) log₂(4/5) − (1/5) log₂(1/5)] + (3/10)[−(3/3) log₂(3/3)] + (2/10)[−(2/2) log₂(2/2)]
    = (5/10)(0.2575 + 0.4644) + 0 + 0
    = 0.3610

Gain(Communication Skills) = 0.8813 − 0.3610 = 0.5203
The Gain calculated for all the attributes is shown in Table 6.8:

Table 6.8: Gain

Attribute              Gain
CGPA                   0.5568
Interactiveness        0.0913
Practical Knowledge    0.2448
Communication Skills   0.5203
| Communicaienhich entropy is minimum and therefore the bh
yr wh iy
166 + Machine Learning
. :
: ple 6.8, choose ain. So, we choose CG)
Step 3: From heey best split ati jt has the a Sones 29, 28 and <8. The &y
‘GPA GPA wil fer = Yes for 28 and Job Offer sk
of
js maximum as
it attribute .
The best split attribu! ccra wt
» are three d ified a5 or the subset of >
root node. There are vec Of of
value is 0 for >8 and <8 oo a a ws noe. THE ree 8 "
<8. Hence, both 28 and
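As a cross-check on these hand calculations (not part of the original text), the sketch below encodes Table 6.3 and recomputes the iteration-1 entropy and gains; the row encoding and function names are assumptions, and the printed values match the figures above to within rounding.

    import math
    from collections import Counter

    def entropy(rows):
        """Entropy of the 'Job Offer' class distribution in a subset of rows, Eq. (6.8)."""
        n = len(rows)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(r["Job Offer"] for r in rows).values())

    def gain(rows, attr):
        """Information gain of splitting the rows on attribute attr, Eq. (6.10)."""
        n = len(rows)
        remainder = sum((len(sub) / n) * entropy(sub)
                        for v in {r[attr] for r in rows}
                        for sub in [[r for r in rows if r[attr] == v]])
        return entropy(rows) - remainder

    columns = ["CGPA", "Interactiveness", "Practical Knowledge", "Communication Skills", "Job Offer"]
    rows = [dict(zip(columns, r)) for r in [
        (">=9", "Yes", "Very good", "Good",     "Yes"),
        (">=8", "No",  "Good",      "Moderate", "Yes"),
        (">=9", "No",  "Average",   "Poor",     "No"),
        ("<8",  "No",  "Average",   "Good",     "No"),
        (">=8", "Yes", "Good",      "Moderate", "Yes"),
        (">=9", "Yes", "Good",      "Moderate", "Yes"),
        ("<8",  "Yes", "Good",      "Poor",     "No"),
        (">=9", "No",  "Very good", "Good",     "Yes"),
        (">=8", "Yes", "Good",      "Good",     "Yes"),
        (">=8", "Yes", "Average",   "Good",     "Yes"),
    ]]

    print(round(entropy(rows), 4))            # 0.8813  (7 Yes, 3 No)
    for a in columns[:-1]:
        print(a, round(gain(rows, a), 4))     # CGPA 0.5568, Interactiveness 0.0913,
                                              # Practical Knowledge 0.2448, Communication Skills 0.5203
    # CGPA has the highest gain, so it becomes the root node, as in Step 3.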