Machine Learning
Decision Tree
Overview
Decision Tree Analysis is a general, predictive modelling tool that has applications spanning a number
of different areas. In general, decision trees are constructed via an algorithmic approach that identifies
ways to split a data set based on different conditions. It is one of the most widely used and practical
methods for supervised learning. Decision Trees are a non-parametric supervised learning method
used for both classification and regression tasks. The goal is to create a model that predicts the value
of a target variable by learning simple decision rules inferred from the data features.
The decision rules are generally in the form of if-then-else statements. The deeper the tree, the more
complex the rules and the fitter the model.
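For instance, scikit-learn can print the rules of a fitted tree as if-then-else statements. The short
sketch below (using the bundled iris data set, which also appears in the example at the end of this
report) trains a shallow tree and prints its rules; increasing max_depth would produce deeper, more
complex rules:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules short; a deeper tree yields more complex rules.
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))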
Before we dive deep, let's get familiar with some of the terminologies:
Instance: The vector of features or attributes that defines the input space
Attribute: A quantity describing an instance
Concept: The function that maps input to output
Target Concept: The function that we are trying to find, i.e., the actual answer
Hypothesis Class: The set of all possible functions
Sample: A set of inputs paired with their correct output labels (also known as the Training Set)
Candidate Concept: A concept which we think is the target concept
Testing Set: Similar to the training set; used to evaluate the candidate concept and determine its
performance
Introduction
A decision tree is a tree-like graph with nodes representing the places where we pick an attribute and
ask a question; edges represent the answers to the question; and the leaves represent the actual
output or class label. Decision trees are used for non-linear decision making with simple linear decision surfaces.
Decision trees classify the examples by sorting them down the tree from the root to some leaf node,
with the leaf node providing the classification to the example. Each node in the tree acts as a test
case for some attribute, and each edge descending from that node corresponds to one of the
possible answers to the test case. This process is recursive in nature and is repeated for every
subtree rooted at the new nodes.
Let's look at an example. Assume we want to play badminton on a particular day, say Saturday. How
will you decide whether to play or not? You go out and check whether it is hot or cold, check the speed
of the wind and the humidity, and see how the weather is, i.e. whether it is sunny, cloudy, or rainy. You
take all these factors into account to decide whether to play. So, you record all these factors for the last
ten days and form a lookup table like the one below.
Now, you may use this table to decide whether to play or not. But what if the weather pattern on
Saturday does not match any of the rows in the table? This may be a problem. A decision tree
would be a great way to represent data like this because it takes into account all the possible
paths that can lead to the final decision by following a tree-like structure.
Fig 2. A decision tree for the concept Play Badminton (when attributes are continuous)
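Read as code, such a tree is just a set of nested if-then-else rules. The sketch below is only
illustrative: the attribute names and thresholds are hypothetical and are not taken from Fig 2.

def play_badminton(outlook, humidity, wind_speed):
    """Illustrative decision rules; the actual splits in Fig 2 may differ."""
    if outlook == "rainy":
        return "no"
    if humidity > 70:        # hypothetical threshold on a continuous attribute
        return "no"
    if wind_speed > 20:      # hypothetical threshold, e.g. in km/h
        return "no"
    return "yes"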
The general algorithm for building a decision tree can be described as follows:
1. Pick the best attribute/feature. The best attribute is the one which best splits or separates the data.
2. Ask the relevant question.
3. Follow the answer path.
4. Go to step 1 until you arrive at an answer.
Entropy
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure of
the amount of uncertainty or randomness in the data.
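For a set S whose examples belong to classes $c \in C$, with $p(c)$ the proportion of examples in S
that belong to class $c$, the usual definition is

$H(S) = -\sum_{c \in C} p(c) \log_2 p(c)$

Entropy is zero when every example has the same class and is largest when the classes are evenly
represented.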
Now, the next big question is how to choose the best attribute. For ID3, we think of the best
attribute as the one with the most information gain, a measure that expresses
how well an attribute splits the data into groups based on the classification.
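Concretely, with $H(S)$ as defined above, the information gain of an attribute A on a set S is the
expected reduction in entropy from partitioning S by the values of A:

$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$

where $S_v$ is the subset of S for which A takes value $v$. A minimal Python sketch of both
quantities (the helper names entropy and information_gain are illustrative, not part of any library)
could look like this:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy from splitting on `attribute`.
    Each example is a dict mapping attribute names to values."""
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder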
Pseudocode: ID3 is a greedy algorithm that grows the tree top-down, at each node selecting the
attribute that best classifies the local training examples. This process continues until the tree
perfectly classifies the training examples or until all attributes have been used.
The pseudocode assumes that the attributes are discrete and that the classification is binary.
Examples is the set of training examples. Target_attribute is the attribute whose value is to be
predicted by the tree. Attributes is a list of other attributes that may be tested by the learned
decision tree. Finally, it returns a decision tree that correctly classifies the given Examples.
ID3(Examples, Target_attribute, Attributes):
- Create a root node for the tree.
- If all Examples are positive, return the single-node tree root with a positive label.
- If all Examples are negative, return the single-node tree root with a negative label.
- If Attributes is empty, return the single-node tree root labelled with the most common value of
  Target_attribute in Examples.
- Otherwise, begin:
  - A ← the attribute from Attributes that best classifies Examples (i.e., has the highest information gain)
  - The decision attribute for root ← A
  - For each possible value $v_i$ of A:
    - Add a new tree branch below root, corresponding to the test A = $v_i$
    - Let Examples_vi be the subset of Examples that have value $v_i$ for A
    - If Examples_vi is empty, then below this new branch add a leaf node labelled with the most
      common value of Target_attribute in Examples
    - Else, below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
  - End
- Return root
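A compact Python rendering of this pseudocode, reusing the entropy and information_gain helpers
sketched earlier and representing the tree as nested dicts (a sketch under those assumptions, not a
full implementation), might look like this:

from collections import Counter

def id3(examples, labels, attributes):
    """Grow a decision tree top-down; returns a class label or a nested dict."""
    # If all examples share one label, return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # If no attributes are left to test, return the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <- the attribute that best classifies the examples (highest information gain).
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    root = {best: {}}
    for value in set(ex[best] for ex in examples):
        pairs = [(ex, lab) for ex, lab in zip(examples, labels)
                 if ex[best] == value]
        sub_examples = [ex for ex, _ in pairs]
        sub_labels = [lab for _, lab in pairs]
        # Recurse on the subset, with the chosen attribute removed.
        root[best][value] = id3(sub_examples, sub_labels,
                                [a for a in attributes if a != best])
    return root

Since the loop only visits attribute values that actually occur in the examples, the "Examples_vi is
empty" case from the pseudocode does not arise in this sketch.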
Alternative measures for selecting attributes
The information gain formula used by the ID3 algorithm treats all variables the same regardless
of their distribution and their importance. This is a problem for continuous variables or discrete
variables with many possible values, because training examples may be few and far between for
each possible value. Splitting the data into many small subsets then yields low entropy and high
information gain, but the resulting decision tree might not generalize well.
One way to avoid this is to use some other measure than information gain to find the best
attribute. An alternative measure is the gain ratio (Quinlan 1986). Gain ratio tries to correct
information gain's bias towards attributes with many possible values by dividing information gain
by a denominator called split information. Split information tries to
measure how broadly and uniformly the attribute splits the data:
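In Quinlan's formulation, split information is defined as

$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_1$ through $S_c$ are the subsets of S produced by partitioning it on the c values of
attribute A, and the gain ratio is then

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$

The scikit-learn example below trains a decision tree classifier on the iris data set and renders the
learned tree as an image with pydotplus (Graphviz must be installed for the rendering step):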
import pydotplus
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image, display

__author__ = "Mayur Kulkarni <mayur.kulkarni@xoriant.com>"


def load_data_set():
    """Load the iris data set."""
    return load_iris()


def train_model(iris):
    """Train a decision tree classifier on the iris data."""
    clf = tree.DecisionTreeClassifier()
    return clf.fit(iris.data, iris.target)


def display_image(clf, iris):
    """Render the trained tree as a PNG image (requires Graphviz)."""
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    display(Image(data=graph.create_png()))


if __name__ == '__main__':
    iris_data = load_data_set()
    decision_tree_classifier = train_model(iris_data)
    display_image(clf=decision_tree_classifier, iris=iris_data)
References:
1. Information Gain: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
2. Entropy: https://en.wikipedia.org/wiki/Entropy_(information_theory)
3. Decision Tree: https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/
4. Decision Trees for Classification: A Machine Learning Algorithm, https://www.xoriant.com/blog/product-engineering/decision-trees-machine-learning-algorithm.html