Classification: Decision Tree (DT)
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2020)
Outline
What is the decision tree (DT) algorithm
Why we need DT
The Pros and Cons of DT
Information Theory
Some Issues in DT
Assignment II
Decision Tree (DT)
Decision trees: splitting datasets one feature at a time.
The decision tree is one of the most commonly used classification techniques.
It has decision blocks (rectangles)
and terminating blocks (ovals).
The right and left arrows are called branches.
The kNN algorithm can do a great job of classification, but it does not give any
insight into the underlying structure of the data.
Decision Tree (DT)
Figure 1: A decision tree
Decision Tree (DT)
The best part of the DT (decision tree) algorithm is that humans can easily
understand the data:
The DT algorithm:
Takes a set of data (training examples),
Builds a decision tree (model), and draws it.
It can also be re-represented as a set of if-then rules to improve human readability.
The DT does a great job of distilling data into knowledge:
it takes a set of unfamiliar data and extracts a set of rules.
DT is often used in expert system development.
Decision Tree (DT)
The DT can be expressed using the following expression:
(Outlook = Sunny ∧ Humidity = Normal) → Yes
∨ (Outlook = Overcast) → Yes
∨ (Outlook = Rain ∧ Wind = Weak) → Yes
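Since the tree can be re-written as if-then rules, the same expression can be sketched directly as code. A minimal Python sketch; the function name and the implied "No" branches are assumptions for illustration, not part of the slide:

```python
def play_tennis(outlook, humidity, wind):
    """Walk the rules above; the 'No' branches are implied by the tree,
    not spelled out in the expression."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    return "No"  # unseen Outlook value

print(play_tennis("Sunny", "Normal", "Strong"))  # -> Yes
```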
Decision Tree (DT)
The pros and cons of DT:
Pros of DT:
Computationally cheap to use,
Easy for humans to understand the learned results,
Missing values OK (robust to errors),
Can deal with irrelevant features.
Cons of DT:
Prone to overfitting.
Works with: numeric values, nominal values.
Decision Tree (DT)
Appropriate problems for DT learning:
Instances are represented by attribute-value pairs (a fixed set of attributes and
their values),
The target function has discrete output values,
Disjunctive descriptions may be required,
The training data may contain errors,
The training data may contain missing attribute values.
Decision Tree (DT)
The mathematics used by the DT to split the dataset comes from
information theory:
The first decision you need to make is:
which feature should be used to split the data.
You need to try every feature and measure which split will give the
best result.
Then split the dataset into subsets.
The subsets will then traverse down the branches of the decision
node.
If the data on a branch all belongs to the same class, stop; otherwise repeat the splitting.
Decision Tree (DT)
Figure 2: Pseudo-code for the splitting function
Decision Tree (DT)
General approach to decision trees:
Collect: Any method.
Prepare: the ID3 algorithm works only on nominal values, so any
continuous values will need to be quantized.
Analyze: Any method. You should visually inspect the tree after it
is built.
Train: Construct a tree data structure. (DT)
Test: Calculate the error rate with the learned tree.
Use: This can be used in any supervised learning task; often it is used to
better understand the data.
Decision Tree (DT)
We would like to classify the following animals into two
classes:
Fish and not Fish
Table 1: Marine animal data
Decision Tree (DT)
We need to decide whether to split the data based on the
first feature or the second feature:
The goal is to better organize the unorganized data.
One way to do this is to measure the information.
Measure the information before and after the split.
Information theory is a branch of science that is concerned with
quantifying information.
The change in information before and after the split is known
as the information gain.
Decision Tree (DT)
The split with the highest information gain is the best option.
The measure of information of a set is known as the Shannon
entropy or entropy.
Decision Tree (DT)
To calculate entropy, you need the expected value of the
information of all possible values of our class.
This is given by:
H(X) = -∑ p(xᵢ) log₂ p(xᵢ), summed over i = 1, ..., n
where n is the number of classes and p(xᵢ) is the proportion of examples in class i.
Decision Tree (DT)
The higher the entropy, the more mixed up the data.
Another common measure of disorder in a set is the Gini
impurity:
the probability that an item chosen at random from the set would be
misclassified if it were labelled randomly according to the class distribution of the set.
Building the tree requires three pieces (a minimal sketch follows below):
Calculate the Shannon entropy of a dataset.
Split the dataset on a given feature.
Choose the best feature to split on.
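A minimal Python sketch of the three steps just listed, assuming each dataset row is a list of feature values with the class label in the last column; the function names and the toy rows are illustrative (the toy rows are in the spirit of Table 1, not the actual table):

```python
from math import log2
from collections import Counter

def calc_shannon_entropy(dataset):
    """Shannon entropy of the class labels (last column of each row)."""
    counts = Counter(row[-1] for row in dataset)
    total = len(dataset)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def split_dataset(dataset, feature, value):
    """Rows whose `feature` equals `value`, with that feature column removed."""
    return [row[:feature] + row[feature + 1:]
            for row in dataset if row[feature] == value]

def choose_best_feature(dataset):
    """Index of the feature whose split yields the highest information gain."""
    base_entropy = calc_shannon_entropy(dataset)
    n_features = len(dataset[0]) - 1
    best_gain, best_feature = 0.0, -1
    for f in range(n_features):
        new_entropy = 0.0
        for value in {row[f] for row in dataset}:
            subset = split_dataset(dataset, f, value)
            new_entropy += (len(subset) / len(dataset)) * calc_shannon_entropy(subset)
        gain = base_entropy - new_entropy
        if gain > best_gain:
            best_gain, best_feature = gain, f
    return best_feature

# Toy rows: [can survive without surfacing?, has flippers?, label]
toy = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]
print(round(calc_shannon_entropy(toy), 3))  # ~0.971 before any split
print(choose_best_feature(toy))             # -> 0 (the first feature gives the best split)
```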
Decision Tree (DT)
Recursively building the tree (see the sketch below):
Start with the dataset and split it based on the best attribute.
The data will traverse down the branches of the tree to another
node.
This node will then split the data again (recursively).
Stop under the following conditions: we run out of attributes, or all the
instances in a branch belong to the same class.
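A rough sketch of this recursion, reusing `calc_shannon_entropy`, `split_dataset`, and `choose_best_feature` from the previous sketch (assumed names, not from the slides); the tree is stored as nested dictionaries:

```python
from collections import Counter

def create_tree(dataset, feature_names):
    """Recursively build a decision tree as nested dicts."""
    labels = [row[-1] for row in dataset]
    if labels.count(labels[0]) == len(labels):    # all rows share one class: leaf
        return labels[0]
    if len(dataset[0]) == 1:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(dataset)           # helper from the earlier sketch
    if best == -1:                                # no informative split remains
        return Counter(labels).most_common(1)[0][0]
    name = feature_names[best]
    remaining = feature_names[:best] + feature_names[best + 1:]
    return {name: {value: create_tree(split_dataset(dataset, best, value), remaining)
                   for value in {row[best] for row in dataset}}}

# e.g. create_tree(toy, ["no surfacing", "flippers"]) on the toy rows above gives
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}} (key order may vary)
```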
Decision Tree (DT)
Table 2: Example training sets
Decision Tree (DT)
Figure 3: Data path while splitting
Decision Tree (DT)
ID3 uses the information gain measure to select among the
candidate attributes.
Start with the dataset and split it based on the best attribute.
Given a collection S containing positive and negative examples
of some target concept,
the entropy of S relative to this Boolean classification is:
Entropy(S) = -p₊ log₂ p₊ - p₋ log₂ p₋
where p₊ is the proportion of positive examples in S and p₋ the proportion of negative examples.
Decision Tree (DT)
Example:
The target attribute is PlayTennis. (yes/no)
Table 3: Example training sets
Decision Tree (DT)
Suppose S is a collection of 14 examples of some Boolean
concept, including 9 positive and 5 negative examples.
Then the entropy of S relative to this Boolean classification is:
Entropy(S) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940
Decision Tree (DT)
Note that the entropy is 0 if all members of S belong to the
same class.
For example, if all the members are positive (p₊ = 1), then p₋ = 0, and:
Entropy(S) = -1·log₂(1) - 0·log₂(0) = 0 (defining 0·log₂(0) to be 0)
Note the entropy is one (1) when the collection contains an
equal number of positive and negative examples.
If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1.
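These three cases can be checked numerically; a small Python check (the `entropy` helper is illustrative, not from the slides):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples;
    0 * log2(0) is treated as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

print(entropy(14, 0))           # 0.0  -> all members in one class
print(entropy(7, 7))            # 1.0  -> equal numbers of positive and negative
print(round(entropy(9, 5), 3))  # 0.94 -> the 9+/5- collection above, between 0 and 1
```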
Decision Tree (DT)
Suppose S is a collection of training-example days described by
the attribute Wind (Weak, Strong).
The information gain is the measure used by ID3 to select the
best attribute at each step in growing the tree:
Gain(S, A) = Entropy(S) - ∑ (|Sᵥ| / |S|) Entropy(Sᵥ), summed over the values v of attribute A.
Decision Tree (DT)
Information gain of the two attributes, Humidity and Wind (a quick numeric check follows below).
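A small Python check of the two gains, assuming the standard 14-day PlayTennis counts (9+/5- overall; Humidity: High = 3+/4-, Normal = 6+/1-; Wind: Weak = 6+/2-, Strong = 3+/3-); these counts are an assumption taken from the classic example, not shown on the slide:

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

def gain(pos, neg, partitions):
    """Information gain of splitting a (pos, neg) collection into the
    given (pos, neg) partitions, one per attribute value."""
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(pos, neg) - remainder

print(gain(9, 5, [(3, 4), (6, 1)]))  # ~0.1518 -> Gain(S, Humidity), quoted as 0.151 below
print(gain(9, 5, [(6, 2), (3, 3)]))  # ~0.0481 -> Gain(S, Wind), quoted as 0.048 below
```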
Decision Tree (DT)
Example:
ID3 determines the information gain for each attribute.
(Outlook, Temperature, Humidity and Wind)
Then select the one with the highest information gain.
The information gain values for all four attributes are:
Gain (S, Outlook) = 0.246
Gain (S, Humidity) = 0.151
Gain (S, Wind) = 0.048
Gain (S, Temperature) = 0.029
Outlook provides greater information gain than the other attributes.
Decision Tree (DT)
Example:
According to the information gain measure, the Outlook attribute
is selected as the root node.
Branches are created below the root for each of its possible
values. (Sunny, Overcast, and Rain)
Decision Tree (DT)
The partially learned decision tree resulting from the first step of ID3
Decision Tree (DT)
The Overcast descendant has only positive examples and
therefore becomes a leaf node with classification Yes.
The other two nodes will be expanded by selecting the attribute
with the highest information gain relative to their new subsets of examples.
Decision Tree (DT)
Decision Tree learning can be:
Classification tree: the target variable takes a finite set of values.
Regression tree: the target variable takes continuous values.
There are many specific Decision Tree algorithms:
ID3 (Iterative Dichotomiser 3)
C4.5 (successor of ID3)
CART (Classification And Regression Tree)
CHAID (Chi-squared Automatic Interaction Detector)
MARS (Multivariate Adaptive Regression Splines): extends DTs to handle numerical data better
Decision Tree (DT)
Different Decision Tree algorithms use different metrics for
measuring the “best attribute”:
Information gain: used by ID3, C4.5 and C5.0
Gini impurity: used by CART
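The two metrics behave similarly; a small Python comparison on a two-class distribution (illustrative only):

```python
from math import log2

def entropy(probs):
    """Shannon entropy of a class-probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity of the same distribution."""
    return 1.0 - sum(p * p for p in probs)

# Both measures are 0 for a pure node and largest for a 50/50 mix.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    dist = (p, 1.0 - p)
    print(f"p={p:.2f}  entropy={entropy(dist):.3f}  gini={gini(dist):.3f}")
```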
Decision Tree (DT)
ID3 in terms of its search space and search strategy:
ID3’s hypothesis space of all decision trees is a complete space of
finite discrete-valued functions.
ID3 maintains only a single current hypothesis as it searches
through the space of decision trees.
ID3 in its pure form performs no backtracking in its search
(post-pruning the decision tree adds a form of backtracking).
ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis (making it much less sensitive to errors in individual examples).
Decision Tree (DT)
Inductive bias in Decision Tree learning (ID3) :
The inductive bias is the set of assumptions a learner uses to generalize beyond the training data.
ID3 selects in favor of shorter trees over longer ones (comparable to a
breadth-first preference for small trees).
Selects trees that place the attributes with the highest information
gain closest to the root.
Decision Tree (DT)
Issues in Decision Tree learning:
How deeply to grow the decision tree,
Handling continuous attributes,
Choosing an appropriate attribute selection measure,
Handling training data with missing attribute values,
Handling attributes with differing costs, and
Improving computational efficiency.
ID3 was extended to address most of these issues, resulting in C4.5.
Decision Tree (DT)
Avoiding overfitting the data:
Noisy data and too few training examples are problems.
Overfitting is a practical problem for decision trees and many other
learning algorithms.
Overfitting has been found to decrease the accuracy of the learned tree
by 10-25%.
Approaches to avoid overfitting:
Stop growing the tree before it overfits (direct, but less practical; see the sketch below).
Allow the tree to overfit, and then post-prune it (the most successful approach
in practice).
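A minimal sketch of the first approach (stop growing early), written as a depth-limited variant of the `create_tree` sketch from earlier; `choose_best_feature` and `split_dataset` are the assumed helpers from that sketch, and `max_depth` is just one illustrative stopping criterion (a validation-set test is another common one):

```python
from collections import Counter

def create_tree_limited(dataset, feature_names, max_depth=3):
    """Pre-pruning variant of the earlier create_tree sketch: stop splitting
    once max_depth is reached and return a majority-vote leaf instead."""
    labels = [row[-1] for row in dataset]
    if labels.count(labels[0]) == len(labels) or len(dataset[0]) == 1 or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(dataset)          # helper from the earlier sketch
    if best == -1:
        return Counter(labels).most_common(1)[0][0]
    name = feature_names[best]
    remaining = feature_names[:best] + feature_names[best + 1:]
    return {name: {value: create_tree_limited(split_dataset(dataset, best, value),
                                              remaining, max_depth - 1)
                   for value in {row[best] for row in dataset}}}
```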
Decision Tree (DT)
Incorporating continuous-valued attributes:
In the initial definition of ID3, attributes and the target value must
take values from a discrete set.
The attributes tested in the decision nodes of the tree must be
discrete-valued:
Create a new Boolean attribute by thresholding the continuous value (see the sketch below), or
Split the range into multiple intervals rather than just two.
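A minimal sketch of the first option (a Boolean test "value ≤ c"): scan candidate thresholds at the midpoints between adjacent sorted values and keep the one with the highest information gain. The Temperature numbers below follow the small example commonly used in Mitchell's discussion; all names are illustrative:

```python
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Best cut point c for a Boolean test `value <= c` on a continuous attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_c = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        c = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate midpoint
        left = [l for v, l in pairs if v <= c]
        right = [l for v, l in pairs if v > c]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_gain, best_c = base - remainder, c
    return best_c, best_gain

# Temperature 40 48 60 72 80 90 with PlayTennis No No Yes Yes Yes No
print(best_threshold([40, 48, 60, 72, 80, 90],
                     ["No", "No", "Yes", "Yes", "Yes", "No"]))  # -> (54.0, ~0.459)
```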
Decision Tree (DT)
Alternative measure for selecting attributes:
Information gain favors attributes with many values.
One alternative measure that has been used successfully is the
gain ratio (sketched below).
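A minimal sketch of the gain ratio as used in C4.5: the information gain divided by the "split information" (the entropy of the attribute's own value distribution). The per-example identifier attribute below shows how a large split information penalises a many-valued attribute even when its raw gain is maximal; all names are illustrative:

```python
from math import log2
from collections import Counter

def entropy(items):
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on the feature, divided by the
    split information (entropy of the feature's value distribution)."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for v, l in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info else 0.0

labels = ["Yes"] * 9 + ["No"] * 5
unique_ids = list(range(14))   # an attribute with a distinct value for every example
print(round(gain_ratio(unique_ids, labels), 3))  # ~0.247: gain 0.94 divided by log2(14)
```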
Decision Tree (DT)
Handling training examples with missing attribute values:
Assign the most common value among the training examples at
node n, or
Assign a probability to each of the possible values of the attribute (distributing the example fractionally).
The second approach is used in C4.5; a sketch of the first appears below.
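A tiny sketch of the first strategy (the row layout, function name, and the "?" marker are assumptions for illustration):

```python
from collections import Counter

def fill_missing(rows, feature, missing="?"):
    """Strategy 1 above: replace a missing entry in one feature column with
    the most common observed value among the examples at this node.
    (C4.5's strategy 2 instead splits the example into fractional pieces,
    weighted by the observed frequency of each value.)"""
    observed = [row[feature] for row in rows if row[feature] != missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [row[:feature] + [most_common] + row[feature + 1:]
            if row[feature] == missing else row
            for row in rows]

rows = [["Sunny", "High", "No"], ["Rain", "?", "Yes"], ["Rain", "High", "Yes"]]
print(fill_missing(rows, 1))  # the '?' becomes 'High', the most common observed value
```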
Decision Tree (DT)
Handling attributes with different costs:
Prefer low-cost attributes to high-cost attributes where possible.
ID3 can be modified to take costs into account by introducing a cost
term into the attribute selection measure:
Divide the gain by the cost of the attribute (a tiny sketch follows below).
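A tiny sketch of dividing gain by cost; the attribute names, gains, and costs are made up for illustration (other cost-weighted criteria, such as Gain²/Cost, also appear in the literature):

```python
def choose_cheapest_informative(gains, costs):
    """Pick the attribute with the highest gain-to-cost ratio."""
    return max(gains, key=lambda attr: gains[attr] / costs[attr])

# Hypothetical gains and measurement costs, for illustration only.
gains = {"Temperature": 0.029, "Pulse": 0.15, "BloodTest": 0.25}
costs = {"Temperature": 1.0,   "Pulse": 1.5,  "BloodTest": 20.0}
print(choose_cheapest_informative(gains, costs))  # -> Pulse (0.10 gain per unit cost)
```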
Question & Answer
Thank You !!!
Assignment II
Answer the given questions by considering the following set of training examples.
Assignment II
(a) What is the entropy of this collection of training examples with respect to the target function classification?
(b) What is the information gain of a2 relative to these training examples?
Decision Tree (DT)
Do some research on the following Decision Tree algorithms:
ID3 (Iterative Dichotomiser 3)
C4.5 (successor of ID3)
CART (Classification And Regression Tree)
CHAID (Chi-squared Automatic Interaction Detector)
MARS (Multivariate Adaptive Regression Splines): extends DTs to handle numerical data better