
Decision trees

Decision trees are a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. They are also called an instance-based algorithm, because at each instance a decision is taken; in other words, a decision tree is a set of nested if-else conditions.
A decision tree is a non-linear model that is built from several linear, axis-parallel planes. In logistic regression we have a single separating plane, but in a decision tree we have multiple planes.
Suppose we want to classify whether a person is a defaulter in paying tax or not on the basis of three variables – employed, gender, and area (location). The decision tree is built as follows:

                     
(1 = no defaulter, 0 = defaulter)

ROOT NODE (N = 1000): 1 = 600, 0 = 400
|
+-- Employed == Yes (N = 700): 1 = 550, 0 = 150
|   |
|   +-- Gender == M (N = 300): 1 = 230, 0 = 70   -> no defaulter
|   +-- Gender == F (N = 400): 1 = 10,  0 = 390  -> defaulter
|
+-- Employed == No (N = 300): 1 = 50, 0 = 250
    |
    +-- Location == N  (N = 100): 1 = 25,  0 = 75   -> defaulter
    +-- Location != N  (N = 200): 1 = 150, 0 = 50   -> no defaulter

The decision tree can also be represented as a diagram (figure not reproduced here).

Geometrically, a decision tree is a set of axis-parallel planes that tessellate the feature space into hypercuboids or hypercubes.
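As a minimal sketch (with the counts and majority classes taken from the diagram above, and illustrative field names), the same tree can be written as nested if-else conditions in Python:

def predict_defaulter(employed, gender, location):
    # Each leaf returns the majority class of that node in the diagram above.
    if employed == "Yes":
        if gender == "M":
            return "no defaulter"   # 230 of 300 are class 1
        else:
            return "defaulter"      # 390 of 400 are class 0
    else:
        if location == "N":
            return "defaulter"      # 75 of 100 are class 0
        else:
            return "no defaulter"   # 150 of 200 are class 1

print(predict_defaulter("Yes", "F", "N"))   # -> defaulter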

Decision trees can be of two types – regression trees and classification trees. Regression trees are used when the dependent variable is continuous, while classification trees are used when the dependent variable is categorical.
Some terms related to decision trees are:
• Decision Node: Any node that gets split is known as a decision node.
• Root Node: This is the top-most decision node and represents the entire sample, which further gets divided.
• Leaf Node: Also known as a terminal node, this is a node that does not split further.
• Splitting: The process of dividing a node into sub-nodes. Most commonly, a top-down approach is taken, splitting each node on the best available split.
• Pruning: The opposite of splitting; here we remove sub-nodes of a decision node.

How to build a decision tree?

In the case of a classification problem we use entropy, information gain, or the Gini index as the measure for deciding which feature splits the data, whereas in the case of a regression tree we use the mean sum of squares (MSE) to split the data.
The most common criteria/metrics are:
• Gini Index
• Entropy
• Chi-Square (for classification trees)
• Reduction in Variance (for regression trees)
• ANOVA

1. Entropy: Consider a case where we need to classify data points into two categories, 1 and 0. Let p be the probability that the outcome variable (defaulter) is classified in class 1 and q the probability that it is classified in class 0. Then the entropy is calculated as

H = −p log₂(p) − q log₂(q)

Entropy measures uncertainty, or randomness; in other words, it is a measure of unpredictability.
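As a small illustrative sketch (not part of the original notes), the entropy of a binary node can be computed directly from this formula:

import math

def entropy(p):
    # H = -p*log2(p) - q*log2(q), where q = 1 - p
    q = 1 - p
    if p == 0 or q == 0:        # a pure node has zero entropy
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)

print(entropy(0.5))   # 1.0  -> maximum uncertainty
print(entropy(0.9))   # ~0.47 -> fairly pure node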
Suppose we have data on 1000 employees, a sample of which looks like:

Defaulter  Employed  Area  Gender
1          1         N     M
0          0         S     F
0          0         S     M
1          0         S     F
0          1         N     F
1          0         N     M
0          0         N     M
1          1         S     M
1          0         S     F
0          1         S     M
0          0         N     F

We want to classify whether a person is a defaulter or not on the basis of employed, area, and gender.

The steps in computing the entropy are as follows.

Suppose Entropy(employed) is 0.97 (using the formula above), Entropy(area) is 0.43 and Entropy(gender) is 0.58. We then choose employed as our root node, because using the employed variable we get a more random (higher-entropy) split than with the other variables (the percentage of persons in class 1 for employed = Yes is 78% and the percentage in class 1 for employed = No is 83%, which are nearly comparable, i.e. homogeneous).

               
Employed == Y?
+-- Yes (N = 700): 1 = 550, 0 = 150
+-- No  (N = 300): 1 = 50,  0 = 250

Now we compute Entropy(gender | employed = Yes) = 0.74 and Entropy(location = North | employed = Yes) = 0.35. Since Entropy(gender | employed = Yes) gives a more random (higher-entropy) split than Entropy(location = North | employed = Yes), we choose gender as the feature for further splitting.

           
Employed == Yes (N = 700): 1 = 550, 0 = 150
|
Gender == M?
+-- Gender == M (N = 300): 1 = 230 (77%),  0 = 70 (23%)
+-- Gender == F (N = 400): 1 = 10 (2.5%),  0 = 390 (97.5%)

Similarly we compute values for employed = No.

         
Employed == No (N = 300): 1 = 50, 0 = 250
|
Location == N?
+-- Location == N  (N = 100): 1 = 25 (25%), 0 = 75 (75%)
+-- Location != N  (N = 200): 1 = 150, 0 = 50
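To make the choice of split concrete, here is a hedged sketch (using the class counts from the diagrams above) that computes the weighted child entropy and the information gain of the employed split:

import math

def entropy_from_counts(n1, n0):
    # Entropy of a node given its class counts.
    total = n1 + n0
    h = 0.0
    for n in (n1, n0):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

h_root  = entropy_from_counts(600, 400)                # root node, ~0.97
h_yes   = entropy_from_counts(550, 150)                # employed = Yes node
h_no    = entropy_from_counts(50, 250)                 # employed = No node
h_split = (700 / 1000) * h_yes + (300 / 1000) * h_no   # weighted child entropy
print("information gain:", round(h_root - h_split, 2))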

2. Gini Impurity: This method uses an impurity-based criterion, is used in CART, and only performs binary splits.

Let us consider an example:

We find that out of a total of 50 students, half of them were selected for the coaching.

To calculate the Gini index, we first find the proportions of these values.

Now we can find the weighted Gini for split X1 (Grade) by calculating the Gini for the sub-nodes 'A or B' and 'C or D'. The calculations are shown below.

We can similarly do the same for X2, which here is 'Class', and find the Gini for splitting 'Class'.

For X3, Height, which is a continuous variable, the statistical software in the backend splits the variable into height > x and height < x and finds the Gini value; whichever value of x provides the minimum Gini value is used to calculate the Gini for splitting this continuous variable.

In our example, we find that x = 1.75 provides the minimum value in comparison to other values of height, say 1.80 or 1.60. We find that the Gini for the height split comes out to be 0.23.

Now we compare the Gini values and find that the minimum Gini is provided by height (0.23); thus, this variable will be the root node that splits the data into two more homogeneous, 'purer' sets. The second minimum value of Gini is found for the variable 'Grade' and the maximum value of Gini is found for 'Class'. Thus the data will first be split by height, as it plays the major role in deciding whether a student will be selected for the basketball coaching or not.

Gini impurity and entropy serve the same purpose. The only difference is that the Gini impurity formula does not use a logarithm, which makes it faster to compute than entropy (lower computational complexity).
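A minimal sketch (illustrative, not the notes' original code) of the Gini impurity of a node and the weighted Gini of a binary split:

def gini(n1, n0):
    # Gini impurity of a node: 1 - p1^2 - p0^2.
    total = n1 + n0
    p1, p0 = n1 / total, n0 / total
    return 1 - p1 ** 2 - p0 ** 2

def weighted_gini(left, right):
    # Weighted Gini of a binary split; left and right are (n1, n0) count pairs.
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(*left) + (n_right / n) * gini(*right)

# The split with the lowest weighted Gini is chosen; counts are illustrative.
print(weighted_gini((550, 150), (50, 250)))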

3. Mean sum of squares (MSE): It is defined as

MSE = Σ(xᵢ − x̄)² ÷ n

However, when the variable is categorical it can get a bit complicated. We build the tree top-down, from the root to the leaves, and we want all similar data points to end up in the same leaf, so that the standard deviation (our measure of homogeneity) is as small as possible.

For example, let us calculate the mean and variance for the root node. Here we have 25 ones (y = 1) and 25 zeros (y = 0). We calculate the mean as ((25 × 1) + (25 × 0)) ÷ 50, which comes out to be 0.5. We can now use the formula for variance, with x̄ = 0.5 and each x equal to 1 or 0. Since there are 50 entries, the full sum contains 25 terms of (1 − 0.5)² and 25 terms of (0 − 0.5)², so it can be written compactly as

Variance = (25 × (1 − 0.5)² + 25 × (0 − 0.5)²) ÷ 50 = 0.25

We use the above formula to find the mean and variance for each variable. See below the calculation to find the variance of the X1 (Grade) variable.

We can use the same formula to find the variance for Split Class.

We find that the results are similar to entropy, with 'Grade' being the less important variable, as its variance (0.24) is higher than that of the other categorical variable, 'Class', which has a variance of 0.21. The variance of a numerical variable follows similar steps, and the variable which provides the minimum variance is considered for the split.
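As an illustrative sketch (the 25/25 counts come from the root-node example above), node variance and the weighted variance after a split can be computed as:

def variance(values):
    # Mean squared deviation from the mean: sum((x - mean)^2) / n.
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def variance_after_split(left, right):
    # Weighted variance of the child nodes; a lower value means a better split.
    n = len(left) + len(right)
    return (len(left) / n) * variance(left) + (len(right) / n) * variance(right)

root = [1] * 25 + [0] * 25
print(variance(root))   # 0.25, as in the worked example above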

Now we know how to build a decision tree. But we must take care while building it: as the depth increases, the chance of overfitting increases (the tree starts fitting noisy points in the data), and if the depth is too small, say 1, the decision tree underfits. So an appropriate depth should be found using cross-validation, as in the sketch below.
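A minimal sketch of choosing the depth with cross-validation, assuming scikit-learn and an already prepared feature matrix X and target vector y:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# X, y are assumed to be the prepared features and target.
param_grid = {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))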
Parameters Related to the Size of the Tree:

Maximum Depth: Among the easiest parameters to understand, maximum depth is an important parameter for tuning trees and tree-based models. It is the length of the longest path from the root node to a leaf node, or the depth of the layer of leaves. We often provide a maximum value/depth limit to the decision tree model to limit further splitting of nodes, so that when the specified depth is reached, the tree stops splitting.

In the example (figure not reproduced here), we have two features, one on the x-axis and the other on the y-axis. The tree finds that the first best split can be made on the x-axis, i.e. it splits on the x-axis variable and finds the value which provides the best result (the value that provides the minimum entropy or Gini, depending on the algorithm the decision tree is using). For max depth = 1, it splits the data and says that data points on the left will be classified as 0 (blue background colour), while the actual 0s are the blue dots. This depth makes the model too generalized and it underfits. However, if it keeps going and recursively partitions the data, it finds further ways to split the data, creating more and more branches and splitting the space further. By the time it reaches the maximum depth all the leaves are pure; however, this leads to overfitting, as the decision boundary becomes very complex and will yield very poor results in testing or on unseen new data. Thus we can set the maximum depth at 5 to have a moderately complex and accurate decision tree model.
Minimum Split Size: This is another parameter, through which we define the minimum number of observations required to allow a node to split further. This can control overfitting and underfitting. A very low value of minimum split size will lead to overfitting, with a small number of observations being considered for further splitting; an extremely large value, however, will lead to underfitting, as a good number of observations that could provide useful insight when split will not be split.

Minimum Leaf Size


Here we provide a value that decides the minimum number of observations a terminal node must have. It helps with imbalanced splits: for example, if we define the minimum split size as 99 and we have a node with 100 observations, it can be split; however, upon splitting we may find that the child nodes have 99 observations in one node and one observation in the other. We discussed in ANOVA and other basic statistical concepts that each class of a variable should have a minimum number of observations to be considered for any statistical test. Similarly, here we define the minimum leaf size, which is generally 10% of the minimum split size.

Maximum Impurity
In Maximum Depth we saw how a low depth caused the red points to be classified as blue. We can limit how many wrongly classified points are allowed in a region before splitting is stopped.
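A minimal sketch of how these size-related parameters map onto scikit-learn's DecisionTreeClassifier (the values are illustrative, and min_impurity_decrease is only a loose analogue of the 'maximum impurity' idea above):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # maximum depth of the tree
    min_samples_split=20,        # minimum observations a node needs before it can be split
    min_samples_leaf=2,          # minimum observations required in each terminal node
    min_impurity_decrease=0.01,  # require at least this impurity reduction to make a split
    random_state=42,
)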
PYTHON CODE:

In case of classification

The dataset has the following columns:


['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

Outcome is the dependent variable and the rest are independent variables.

Outcome is imbalanced.

It is clearly visible that there is no or low multicollinearity.

First, we balance the dataset.

We have now removed outliers, and there are no missing values and no multicollinearity.


Now we train our model, as in the sketch below.
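Since the original code screenshots are not reproduced here, a hedged sketch of this training step might look as follows (the file name diabetes.csv is an assumption, and balancing/outlier treatment are assumed to have been done already):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("diabetes.csv")                  # assumed file name
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = DecisionTreeClassifier(random_state=42)   # default parameters, no tuning yet
model.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))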

Accuracy before cross-validation and hyperparameter tuning:

After K-fold cross-validation:

Accuracy:

The training accuracy is 100% and the test accuracy is 70%.


After hyperparameter tuning:

Accuracy:

After parameter tuning, the test accuracy increased from 60% to 73%.

In case of regression

The train data has 8,523 rows and the following columns:
# Item_Identifier: product ID
# Item_Weight: weight of the product in kg
# Item_Fat_Content: whether the product is low fat or not
# Item_Visibility: the % of total display area of the product in a store
# Item_Type: the category to which the product belongs
# Item_MRP: Maximum Retail Price (selling price) of the product
# Outlet_Identifier: unique store ID
# Outlet_Establishment_Year: the year in which the store was established
# Outlet_Size: the size of the store in terms of ground area covered
# Outlet_Location_Type: the type of city in which the store is located
# Outlet_Type: whether the outlet is just a grocery store or a supermarket
# Item_Outlet_Sales: sales of the product in the particular store; this is the outcome variable to be predicted.
Item_Outlet_Sales is the dependent variable and the rest are independent variables.

After all the preprocessing steps, we train a decision tree model without hyperparameter tuning, as in the sketch below.
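Since the original screenshot is not reproduced here, a hedged sketch of this step might look as follows (the file name and the exact preprocessing/encoding are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("train.csv")    # assumed file name; preprocessing/encoding already applied
X = df.drop(columns=["Item_Outlet_Sales", "Item_Identifier"])
y = df["Item_Outlet_Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

reg = DecisionTreeRegressor(random_state=42)      # default parameters, fully grown tree
reg.fit(X_train, y_train)

print("train score (R^2):", reg.score(X_train, y_train))
print("test score (R^2):", reg.score(X_test, y_test))
print("test MSE:", mean_squared_error(y_test, reg.predict(X_test)))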
Model score

We got a 100% score on the training data.

On the test data we got a 57% score, because we have not tuned any parameters and simply initialized the tree with default values. The tree has therefore expanded to its full depth, splitting on features that are not actually useful. That is why we get 100% accuracy on the training data (i.e. a highly overfitted model).

Now we tune the parameters to get rid of this.


Simple model without any hyperparameter tuning, but with K-fold cross-validation:

Error after cross-validation:

After hyperparameter tuning:

Error after hyperparameter tuning:

It was observed that after tuning, the MSE decreased to a smaller value.
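A hedged sketch of how such tuning might be done (parameter ranges are illustrative, and X_train, y_train come from the earlier sketch):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": [3, 5, 8, 10],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [2, 5, 10],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)
print("cross-validated MSE:", -search.best_score_)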

When to stop growing?

• Stop growing when a node becomes pure.
• Stop growing when there are too few points left to split.
• Stop growing when the tree becomes too deep.

Advantages of decision trees:

• Decision trees provide high accuracy and require very little data preprocessing, such as outlier capping, missing-value treatment, variable transformation, etc.
• They also work when the relationship between the dependent and independent variables is not linear, i.e. they can capture non-linear relationships.
• Decision trees require less data than a regression model and are easy to understand as well as quite intuitive.
• Tree-based models can be easily visualized, with clear-cut demarcation, allowing people with no background in statistics to understand the process easily.
• Decision trees can also be used for data cleaning, data exploration, and variable selection and creation.
• Decision trees can work with high-dimensional data having both continuous and categorical variables.
• Feature interactions are built into decision trees.
• Decision trees are highly interpretable.
