Decision Trees in Machine Learning: Representation of Algorithm As A Tree
Trees occupy an important place in the life of man. They provide us with
flowers, fruits, fodder for animals, wood for fire and furniture, and cool
shade from the scorching sun. They give us so many good things and yet
expect nothing in return. Besides being such an important element for the
survival of human beings, trees have also inspired a wide variety of algorithms
in Machine Learning, for both classification and regression.
With this table, other people would be able to use your intuition to decide
whether they should play tennis by looking up what you did given a certain
weather pattern, but after just 14 days, it’s a little unwieldy to match your
current weather situation with one of the rows in the table. We could represent
this tabular data in the form of a tree.
Creating a tree from the data
Here all of the information is represented in the form of a tree. Each rectangular box
represents a node of the tree. The data is split by asking a question at
a node. The branches represent the possible known outcomes of asking
the question at that node. The end nodes are the leaves; they
represent the classes into which the data can be classified. The two classes in
this example are Yes and No. Thus, to obtain the class/final output, ask the
question at the node and follow the branch given by the answer until a
leaf node is reached.
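To make this traversal concrete, here is a minimal sketch in Python. It encodes a small decision tree as nested dictionaries and walks from the root to a leaf by answering the question at each node. The particular tree shown is only illustrative; it assumes Outlook ends up at the root, as derived later in the article.

```python
# A decision tree represented as nested dictionaries.
# Inner nodes map an attribute name to a dict of {answer: subtree};
# leaves are plain class labels ("Yes" / "No").
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))      # question asked at this node
        answer = example[attribute]       # known outcome for this example
        tree = tree[attribute][answer]    # follow the matching branch
    return tree

print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes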
Algorithm
1. Start with a training data set, which we'll call S. It should have
attributes and a classification.
2. Determine the best attribute in the dataset. (We will go over the
definition of the best attribute below.)
3. Split S into subsets that contain the possible values of the best
attribute.
In the Decision Tree algorithm, the best attribute is the one with the
highest information gain.
The left split has less information gain because the data is split into two subsets
that each contain almost equal numbers of ‘+’ and ‘-’ examples, while the split on the right puts more ‘+’
examples in one subset and more ‘-’ examples in the other. In order to
calculate the best attribute we will use Entropy.
Entropy
In the machine learning sense, and especially in this case, entropy is a measure of
the impurity (the lack of homogeneity) of the data. For a two-class problem its value ranges from 0 to 1: it is close to 0 if
all of the examples belong to the same class, and close to 1 if there is an almost equal
split of the data between the classes. The formula to calculate entropy is:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

Here p_i represents the proportion of the data with the ith classification and c
represents the number of different classifications.
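As a quick sanity check of the formula, here is a small sketch of an entropy function in plain Python (no external libraries assumed):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, e.g. ["Yes", "No", "Yes", ...]."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

print(entropy(["Yes"] * 9 + ["No"] * 5))  # ~0.940 for a 9-vs-5 split
print(entropy(["Yes"] * 7))               # 0.0 when all examples share a class
print(entropy(["Yes", "No"]))             # 1.0 for an even two-class split
```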
The information gain obtained by splitting the dataset S on an attribute A is:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

Here Entropy(S) represents the entropy of the dataset, and the second term on
the right is the weighted entropy of the subsets S_v obtained after splitting on
each value v of A. The goal is to maximize this information gain. The attribute
with the maximum information gain is selected as the parent node, and the data is
then successively split at that node.
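A sketch of this calculation, reusing the entropy function above. It assumes the dataset is a list of dictionaries and that the class label is stored under a "Play" key; those names are only illustrative.

```python
def information_gain(examples, attribute, target="Play"):
    """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    base = entropy([ex[target] for ex in examples])
    weighted = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        weighted += (len(subset) / total) * entropy(subset)
    return base - weighted

def best_attribute(examples, attributes, target="Play"):
    """The attribute with the maximum information gain becomes the next node."""
    return max(attributes, key=lambda a: information_gain(examples, a, target))
```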
As we can see, the Outlook attribute has the maximum information gain and hence
it is placed at the top of the tree.
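Putting the pieces together, a minimal recursive tree builder in the spirit of the algorithm above might look like the sketch below. It reuses the best_attribute helper defined earlier and produces the same nested-dictionary representation used at the start of the article; tie-breaking and error handling are omitted.

```python
from collections import Counter

def build_tree(examples, attributes, target="Play"):
    labels = [ex[target] for ex in examples]
    # Stop if the node is pure or no attributes are left to split on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute with the highest information gain.
    attr = best_attribute(examples, attributes, target)
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, target)
    return tree
```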
Now one question that may arise is how the data is split in the case of a continuous attribute.
Suppose there is an attribute temperature whose values range from 10 to 45 degrees
Celsius. We cannot make a split on every distinct value. What we can do in this
case is bin the continuous values into discrete ranges, such as
10–20 for class 1, 20–30 for class 2, and so on, so that a particular value falls into a
particular class.
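A simple way to do this binning in Python (the bin edges here are only an assumption for illustration):

```python
def bin_temperature(celsius):
    """Map a continuous temperature reading onto a discrete range label."""
    edges = [10, 20, 30, 40, 45]          # assumed bin boundaries
    for low, high in zip(edges, edges[1:]):
        if low <= celsius < high:
            return f"{low}-{high}"
    return f"{edges[-2]}-{edges[-1]}"      # clamp values at the upper edge

print(bin_temperature(27))  # "20-30"
```

Note that library implementations such as scikit-learn handle numeric attributes by searching for a single threshold split instead, but the binning above matches the approach described here.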
Now the main problem with decision trees is that they are prone to overfitting. We
could keep growing a tree until it classifies the training data perfectly, or until we are not left with
any attribute to split on. Such a tree would work well on the training dataset but will
give poor results on the testing dataset. There are two popular approaches to
avoid this in decision trees: stop growing the tree before it becomes too
large, or prune the tree after it becomes too large. Typically, a limit to a
decision tree’s growth will be specified in terms of the maximum number of
layers, or depth, it’s allowed to have. The data available to train the decision
tree will be split into a training set and a test set, and trees with
various maximum depths will be created based on the training set and
tested against the test set. Cross-validation can be used as part of this
approach as well. Pruning the tree, on the other hand, involves testing the
original tree against pruned versions of it. Leaf nodes are removed from the
tree as long as the pruned tree performs better against the test data than the larger
tree.
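As an illustration of the first approach, limiting the maximum depth, here is a short sketch assuming scikit-learn; the iris dataset is only a stand-in for whatever data is at hand.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow trees with different maximum depths and compare them on held-out data.
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    test_score = tree.score(X_test, y_test)
    cv_score = cross_val_score(tree, X_train, y_train, cv=5).mean()
    print(f"max_depth={depth}  test accuracy={test_score:.3f}  cv accuracy={cv_score:.3f}")
```

scikit-learn also exposes a ccp_alpha parameter for cost-complexity pruning, which is one concrete way to carry out the post-pruning approach described above.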