Machine Learning
Decision Tree
Overview
Decision Tree Analysis is a general, predictive modelling tool that has applications spanning a number
of different areas. In general, decision trees are constructed via an algorithmic approach that identifies
ways to split a data set based on different conditions. It is one of the most widely used and practical
methods for supervised learning. Decision Trees are a non-parametric supervised learning method
used for both classification and regression tasks. The goal is to create a model that predicts the value
of a target variable by learning simple decision rules inferred from the data features.
The decision rules are generally in the form of if-then-else statements. The deeper the tree, the more
complex the rules and the fitter the model.
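For instance, scikit-learn can print the rules of a fitted tree as if-then-else statements. The short
sketch below (using the bundled iris data set, which also appears in the example at the end of this
report) trains a shallow tree and prints its rules; increasing max_depth would produce deeper, more
complex rules:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules short; a deeper tree yields more complex rules.
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))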
Before we dive deep, let's get familiar with some of the terminologies:
Instance: The vector of features or attributes that defines the input space
Attribute: A quantity describing an instance
Concept: The function that maps input to output
Target Concept: The function that we are trying to find, i.e., the actual answer
Hypothesis Class: The set of all possible functions
Sample: A set of inputs paired with their correct output labels (also known as the Training Set)
Candidate Concept: A concept which we think is the target concept
Testing Set: Similar to the training set; used to evaluate the candidate concept and determine its
performance
Introduction
A decision tree is a tree-like graph with nodes representing the places where we pick an attribute and
ask a question; edges represent the answers to the question; and the leaves represent the actual
output or class label. Decision trees are used for non-linear decision making with simple linear decision surfaces.
Decision trees classify the examples by sorting them down the tree from the root to some leaf node,
with the leaf node providing the classification to the example. Each node in the tree acts as a test
case for some attribute, and each edge descending from that node corresponds to one of the
possible answers to the test case. This process is recursive in nature and is repeated for every
subtree rooted at the new nodes.
Let's look at an example. Assume we want to play badminton on a particular day, say Saturday. How
will you decide whether to play or not? You go out and check whether it is hot or cold, check the speed
of the wind and the humidity, and see how the weather is, i.e. whether it is sunny, cloudy, or rainy. You
take all these factors into account to decide whether to play. So, you record all these factors for the last
ten days and form a lookup table like the one below.
Now, you may use this table to decide whether to play or not. But what if the weather pattern on
Saturday does not match any of the rows in the table? This may be a problem. A decision tree
would be a great way to represent data like this because it takes into account all the possible
paths that can lead to the final decision by following a tree-like structure.
Fig 2. A decision tree for the concept Play Badminton (when attributes are continuous)
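Read as code, such a tree is just a set of nested if-then-else rules. The sketch below is only
illustrative: the attribute names and thresholds are hypothetical and are not taken from Fig 2.

def play_badminton(outlook, humidity, wind_speed):
    """Illustrative decision rules; the actual splits in Fig 2 may differ."""
    if outlook == "rainy":
        return "no"
    if humidity > 70:        # hypothetical threshold on a continuous attribute
        return "no"
    if wind_speed > 20:      # hypothetical threshold, e.g. in km/h
        return "no"
    return "yes"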
The general algorithm for building a decision tree can be described as follows:
1. Pick the best attribute/feature. The best attribute is the one which best splits or separates the data.
2. Ask the relevant question.
3. Follow the answer path.
4. Go to step 1 until you arrive at an answer.
Entropy
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure of
the amount of uncertainty or randomness in the data.
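For a set S whose examples belong to classes $c \in C$, with $p(c)$ the proportion of examples in S
that belong to class $c$, the usual definition is

$H(S) = -\sum_{c \in C} p(c) \log_2 p(c)$

Entropy is zero when every example has the same class and is largest when the classes are evenly
represented.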
Now, the next big question is how to choose the best attribute. For ID3, we think of the best
attribute as the one with the most information gain, a measure that expresses
how well an attribute splits the data into groups based on the classification.
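Concretely, with $H(S)$ as defined above, the information gain of an attribute A on a set S is the
expected reduction in entropy from partitioning S by the values of A:

$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$

where $S_v$ is the subset of S for which A takes value $v$. A minimal Python sketch of both
quantities (the helper names entropy and information_gain are illustrative, not part of any library)
could look like this:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy from splitting on `attribute`.
    Each example is a dict mapping attribute names to values."""
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder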
Pseudocode: ID3 is a greedy algorithm that grows the tree top-down, at each node selecting the
attribute that best classifies the local training examples. This process continues until the tree
perfectly classifies the training examples or until all attributes have been used.
The pseudocode assumes that the attributes are discrete and that the classification is binary.
Examples is the set of training examples. Target_attribute is the attribute whose value is to be
predicted by the tree. Attributes is a list of other attributes that may be tested by the learned
decision tree. Finally, it returns a decision tree that correctly classifies the given Examples.
ID3(Examples, Target_attribute, Attributes):
- Create a root node for the tree.
- If all Examples are positive, return the single-node tree root with a positive label.
- If all Examples are negative, return the single-node tree root with a negative label.
- If Attributes is empty, return the single-node tree root labelled with the most common value of
  Target_attribute in Examples.
- Otherwise, begin:
  - A ← the attribute from Attributes that best classifies Examples (i.e., has the highest information gain)
  - The decision attribute for root ← A
  - For each possible value $v_i$ of A:
    - Add a new tree branch below root, corresponding to the test A = $v_i$
    - Let Examples_vi be the subset of Examples that have value $v_i$ for A
    - If Examples_vi is empty, then below this new branch add a leaf node labelled with the most
      common value of Target_attribute in Examples
    - Else, below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
  - End
- Return root
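A compact Python rendering of this pseudocode, reusing the entropy and information_gain helpers
sketched earlier and representing the tree as nested dicts (a sketch under those assumptions, not a
full implementation), might look like this:

from collections import Counter

def id3(examples, labels, attributes):
    """Grow a decision tree top-down; returns a class label or a nested dict."""
    # If all examples share one label, return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # If no attributes are left to test, return the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <- the attribute that best classifies the examples (highest information gain).
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    root = {best: {}}
    for value in set(ex[best] for ex in examples):
        pairs = [(ex, lab) for ex, lab in zip(examples, labels)
                 if ex[best] == value]
        sub_examples = [ex for ex, _ in pairs]
        sub_labels = [lab for _, lab in pairs]
        # Recurse on the subset, with the chosen attribute removed.
        root[best][value] = id3(sub_examples, sub_labels,
                                [a for a in attributes if a != best])
    return root

Since the loop only visits attribute values that actually occur in the examples, the "Examples_vi is
empty" case from the pseudocode does not arise in this sketch.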
Alternative measures for selecting attributes
The information gain formula used by the ID3 algorithm treats all variables the same regardless
of their distribution and their importance. This is a problem for continuous variables or discrete
variables with many possible values, because training examples may be few and far between for
each possible value. Splitting the data into many small subsets then yields low entropy and high
information gain, but the resulting decision tree might not generalize well.
One way to avoid this is to use some other measure than information gain to find the best
attribute. An alternative measure is the gain ratio (Quinlan 1986). Gain ratio tries to correct
information gain's bias towards attributes with many possible values by dividing information gain
by a denominator called split information. Split information tries to
measure how broadly and uniformly the attribute splits the data:
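In Quinlan's formulation, split information is defined as

$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_1$ through $S_c$ are the subsets of S produced by partitioning it on the c values of
attribute A, and the gain ratio is then

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$

The scikit-learn example below trains a decision tree classifier on the iris data set and renders the
learned tree as an image with pydotplus (Graphviz must be installed for the rendering step):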
import pydotplus
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image, display

__author__ = "Mayur Kulkarni <mayur.kulkarni@xoriant.com>"


def load_data_set():
    """Load the iris data set."""
    return load_iris()


def train_model(iris):
    """Train a decision tree classifier on the iris data."""
    clf = tree.DecisionTreeClassifier()
    return clf.fit(iris.data, iris.target)


def display_image(clf, iris):
    """Render the trained tree as a PNG image (requires Graphviz)."""
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    display(Image(data=graph.create_png()))


if __name__ == '__main__':
    iris_data = load_data_set()
    decision_tree_classifier = train_model(iris_data)
    display_image(clf=decision_tree_classifier, iris=iris_data)
References:
1. Information Gain: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
2. Entropy: https://en.wikipedia.org/wiki/Entropy_(information_theory)
3. Decision Tree: https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/
4. Decision Trees for Classification: A Machine Learning Algorithm, https://www.xoriant.com/blog/product-engineering/decision-trees-machine-learning-algorithm.html