Decision Tree

Decision Tree: Introduction


❖ A decision tree is a popular and versatile supervised machine learning algorithm used for both
classification and regression tasks.

❖ A decision tree is a hierarchical structure that resembles an inverted tree, where each internal
node represents a decision based on a feature or attribute, and each leaf node represents a class
label (in classification) or a numerical value (in regression).

A decision tree comprises:

▪ Root node: This is the starting point of the tree. It contains the feature that best splits the
dataset into two or more homogeneous sets based on certain criteria.
▪ Internal nodes: These nodes represent decision points containing decision rules based on a
feature.
▪ Branches: The branches emanating from each internal node represent the possible outcomes
or values of the corresponding feature.
▪ Leaf nodes: These are the terminal nodes of the tree. Each leaf node represents a class label
(in classification) or a predicted value (in regression).

Fig. 1: Iris Decision Tree
Decision Trees: Introduction (Cont.)
Here's how it works:

▪ Feature selection: The algorithm selects the feature that results in the most homogeneous
subsets (i.e., subsets with similar outcomes) after splitting.
▪ Splitting: Once a feature is selected, the dataset is split into subsets based on the possible
values of that feature. This process is repeated recursively for each subset until a stopping
criterion is met.
▪ Final outcomes: At the end of the process, the tree forms leaf nodes that represent the final
outcomes or predictions.
▪ Prediction: To make predictions for new data, the algorithm traverses the tree from the root node to
a leaf node based on the values of the features in the new data. The outcome associated with the
leaf node reached is then assigned as the prediction.

Fig. 2: Iris Decision Tree
For example, let us consider two new instances with, say, petal length = 1.67 cm, petal width = 1.78 cm
and petal length = 2.9 cm, petal width = 1.78 cm.
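
As a small illustration (not part of the original slides), the sketch below trains a scikit-learn classifier on the two iris petal features and runs the two instances above through it. The exact tree, and hence the predictions, depend on the training data and on max_depth, which is an assumed choice here; it may differ from the tree in the figures.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]   # petal length and petal width, the two features used above
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)  # max_depth=2 is an assumption
tree_clf.fit(X, y)

# The two new instances from the example: [petal length (cm), petal width (cm)]
new_instances = [[1.67, 1.78], [2.9, 1.78]]
print(tree_clf.predict(new_instances))        # predicted class indices (0=setosa, 1=versicolor, 2=virginica)
print(tree_clf.predict_proba(new_instances))  # per-class probabilities at the leaf that is reached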

❖ Decision trees are versatile machine learning algorithms that can perform both classification and
regression tasks, and even multioutput tasks. They are capable of fitting complex datasets and are
fundamental components of random forests.
Working of Decision Tree: Toy Example

❖ We have data for 15 students on pass or fail in an online ML exam. To understand the basic
process, we begin with a dataset which comprises a binary target variable (Pass/Fail) and various
binary or categorical predictor variables such as:

• whether the student is enrolled in other online courses
• whether the student is from a maths, computer science or other background
• whether the student is working or not working

Table 1: The toy data set

❖ Selection of the feature variable for the root node:

• We want the variable that best splits the target column into homogeneous sets. Our candidates
are: Background, Working Status and Other Online Courses.
• Before evaluating the candidate variables for the root node, we have to define a measure of
homogeneity or purity of a node:
I. Entropy and Information Gain
II. Gini impurity or Gini index

Fig. 3: Working Status as root node


Feature variable for decision node: Entropy and Information Gain

❖ Entropy: In information theory, the entropy of a random variable is the average level of
“information”, “surprise”, or “uncertainty” inherent to the variable’s possible outcomes. It is
measured by the formula:

E = − Σ_i p_i log2(p_i)

where p_i is the probability of randomly selecting an example of the i-th outcome of the target variable.
For example, p_pass = number of passing students / total students = 8/15 and
p_fail = number of fails / total students = 7/15.

❖ Information Gain = Entropy_parent − Entropy_children,
where Entropy_parent is the entropy of the parent node and Entropy_children represents the average
entropy of the child nodes that are obtained by splitting the parent node.

❖ To calculate entropy, first let us put our formulas for Entropy and Information Gain in terms of the
variables in our dataset:

• Probability of pass or fail at each node = (no. of pass or fail) / (sample size of the node)

• Entropy of any node: E = − [ p_pass log2(p_pass) + p_fail log2(p_fail) ]

• Average entropy across the child nodes: E_av,child = Σ_{j=1}^{total child nodes} (S_c,j / S_p) · E_j,
  where S_c,j is the sample size of the j-th child node and S_p is the sample size of the parent node.

• Information Gain = E_parent − E_av,child
Toy Example (cont.): Evaluating candidate variables for root node

❖ E_parent = − [ p_pass log2(p_pass) + p_fail log2(p_fail) ]
           = − [ (8/15) log2(8/15) + (7/15) log2(7/15) ] = 0.9968

❖ Let us calculate the average entropy of the child nodes, E_av,child, and the Information Gain for
each candidate feature variable in turn.

❖ Working Status as feature variable of the root node:
E_working     = − [ (3/9) log2(3/9) + (6/9) log2(6/9) ] = 0.9183
E_not-working = − [ (5/6) log2(5/6) + (1/6) log2(1/6) ] = 0.65
E_av,child    = (9/15) × 0.9183 + (6/15) × 0.65 = 0.8110
Information Gain = 0.9968 − 0.8110 = 0.1858

❖ In a similar fashion we can evaluate the entropy and information gain for the Student Background
and Online Courses variables. The results are provided in the table below:
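
A minimal Python sketch (not from the slides) that reproduces the Working Status numbers above:

import numpy as np

def entropy(counts):
    """Entropy of a node given its class counts, e.g. [pass, fail]."""
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log2(0) is treated as 0
    return -(p * np.log2(p)).sum()

parent = entropy([8, 7])              # 15 students: 8 pass, 7 fail (as used above)
working = entropy([3, 6])             # 9 working students
not_working = entropy([5, 1])         # 6 not-working students

avg_child = (9 / 15) * working + (6 / 15) * not_working
info_gain = parent - avg_child

print(round(parent, 4), round(working, 4), round(not_working, 4))  # 0.9968 0.9183 0.65
print(round(avg_child, 4), round(info_gain, 4))                    # 0.811 0.1858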

Table 2: Root node calculation

Root node variable   Nodes                 Entropy per node   Average entropy   Information Gain
(Parent)             Parent                0.9968             –                 –
Working status       Working               0.9183             0.8110            0.1858
                     Not working           0.6500
Background           Background_Math       0.9852             0.4598            0.5370
                     Background_CS         0.0000
                     Background_Others     0.0000
Online course        Online_course_yes     0.9544             0.9688            0.0280
                     Online_course_no      0.9852

Fig. 4: Decision tree for the toy problem
Gini impurity as feature selection criterion

❖ Gini Impurity or Index:


The Gini index or impurity measures the probability of a randomly chosen instance being misclassified.

❖ The Gini index or impurity of the i-th node:

G_i = 1 − Σ_k (p_i,k)²

In this equation:
• G_i is the Gini impurity of the i-th node.
• p_i,k is the ratio of class k instances among the training instances in the i-th node.

❖ CART cost function for classification:

J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right,
where G_left/right measures the impurity of the left/right subset and m_left/right is the number of
instances in the left/right subset.

❖ Gini impurity vs Information Gain: in practice the two criteria lead to similar trees most of the time;
Gini impurity is slightly faster to compute, while entropy tends to produce slightly more balanced trees.
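
As an illustrative sketch (not from the slides), the Gini impurity and the weighted CART-style impurity of the Working Status split from the toy example can be computed as follows:

import numpy as np

def gini(counts):
    """Gini impurity of a node given its class counts, e.g. [pass, fail]."""
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - (p ** 2).sum()

# Working Status split of the 15-student toy example (8 pass / 7 fail overall)
g_parent = gini([8, 7])
g_working = gini([3, 6])       # 9 students in the "working" child
g_not_working = gini([5, 1])   # 6 students in the "not working" child

# Weighted (CART-style) impurity of the split
weighted = (9 / 15) * g_working + (6 / 15) * g_not_working
print(round(g_parent, 4), round(weighted, 4))   # approx. 0.4978 and 0.3778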
Tree algorithms:

❖ ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. It supports categorical
features only; the tree is grown to its full size, and pruning is applied afterwards.

❖ C4.5 is the successor to ID3 and removed the restriction that features must be categorical.

❖ C5.0 is Quinlan’s latest version, released under a proprietary license. It uses less memory and builds
smaller rulesets than C4.5 while being more accurate.

❖ CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports
numerical target variables (regression) and does not compute rule sets. CART constructs binary
trees using the feature and threshold that yield the largest information gain at each node.

❖ scikit-learn uses an optimized version of the CART algorithm; however, the scikit-learn
implementation does not support categorical variables for now.
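
As a small illustration (scikit-learn and the iris dataset are assumed here), the binary "feature ≤ threshold" splits produced by scikit-learn's CART-based DecisionTreeClassifier can be inspected with export_text:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)  # max_depth=2 is an assumption
clf.fit(iris.data, iris.target)

# Each internal node is a binary "feature <= threshold" test, since CART constructs binary trees
print(export_text(clf, feature_names=list(iris.feature_names)))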
CART (Classification and Regression Trees)

❖ The CART algorithm recursively partitions the feature space such that samples with the same
labels or similar target values are grouped together.

❖ The algorithm works by first splitting the training set into two subsets using a single feature k and
a threshold t_k (e.g., “petal length ≤ 2.45 cm”) such that it produces the purest subsets, weighted
by their size.

❖ CART cost function for classification (the chosen split minimizes this weighted impurity):

J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right
❖ The process continues until no further split is possible or it hits one of its stopping criteria.

❖ The CART algorithm is a greedy algorithm; it often produces a solution that is reasonably good but
not guaranteed to be optimal.

❖ Finding the optimal tree is known to be an NP-complete problem. It requires O(exp(m)) time,
making the problem intractable even for small training sets.

❖ Computational complexity: prediction takes O(log2(m)) time; training takes O(n × m log2(m)) time,
where m is the number of training instances and n is the number of features.


CART: Hyperparameters

❖ criterion: {“gini”, “entropy”, “log_loss”}, default=”gini”

❖ splitter: {“best”, “random”}, default=”best”

❖ max_depth: int, default=None

❖ min_samples_split: int or float, default=2
  If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the
  minimum number of samples for each split.

❖ min_weight_fraction_leaf: float, default=0.0

❖ max_features: int, float or {“sqrt”, “log2”}, default=None

❖ max_leaf_nodes: int, default=None

❖ min_impurity_decrease: float, default=0.0

❖ ccp_alpha: non-negative float, default=0.0

❖ random_state: int or None, default=None

❖ Pruning: the process of reducing the size of the tree by removing parts of it so that it does not
overfit.

❖ Regularization parameters:
  • Pre-pruning: max_depth, min_samples_split, max_features, max_leaf_nodes, etc.
  • Post-pruning: ccp_alpha

❖ Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model
(see the sketch below).

❖ The unregularized model on the left is clearly overfitting, and the regularized model on the right
will probably generalize better.
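
A minimal sketch of pre-pruning regularization (the make_moons dataset and the specific hyperparameter values are illustrative choices, not from the slides):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No regularization: the tree grows until every leaf is pure and tends to overfit
tree_unreg = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning via min_samples_leaf (an illustrative value)
tree_reg = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X_train, y_train)

print("unregularized test accuracy:", tree_unreg.score(X_test, y_test))
print("regularized test accuracy:  ", tree_reg.score(X_test, y_test))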
Minimal Cost Complexity Pruning

❖ The idea behind minimal cost-complexity pruning is this:

❖ For any subtree T ≤ T_max, define its complexity |T̃| as the number of terminal nodes in T. Let
α ≥ 0 be a real number called the complexity parameter and define the cost-complexity measure
R_α(T) as

R_α(T) = R(T) + α · |T̃|

❖ For each value of α, the algorithm finds the subtree T(α) ≤ T_max which minimizes R_α(T).

❖ If α is small, the penalty for having a large number of terminal nodes is small and T(α) will be
large. For instance, if T_max is so large that each terminal node contains only one case, then every
case is classified correctly; R(T_max) = 0, so that T_max minimizes R_α(T).

❖ Finally, for α sufficiently large, the minimizing subtree T(α) will consist of the root node only, and
the tree T_max will have been completely pruned.

❖ Although α runs through a continuum of values, there are at most a finite number of subtrees of
T_max. Thus, the pruning process produces a finite sequence of subtrees T1, T2, T3, ... with
progressively fewer terminal nodes.
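
A brief sketch (the dataset and the alpha subsampling are illustrative assumptions) of how scikit-learn exposes this via cost_complexity_pruning_path and the ccp_alpha hyperparameter:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees get pruned away, computed from the training set
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per alpha; larger alpha -> fewer leaves (more pruning)
for alpha in path.ccp_alphas[::5]:           # subsample the alphas to keep the output short
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.5f}  leaves={clf.get_n_leaves()}  test acc={clf.score(X_test, y_test):.3f}")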
Decision Tree: Regression

❖ Regressor: Decision trees are also capable of performing regression tasks. The main difference
from a classification tree is that instead of predicting a class at each node, it predicts a value.

❖ Accordingly, the CART regression algorithm works in much the same way, except that instead of
trying to split the training set in a way that minimizes impurity, it now tries to split it in a way that
minimizes the MSE. The CART cost function for regression is given below:

J(k, t_k) = (m_left / m) · MSE_left + (m_right / m) · MSE_right,
where MSE_node = (1/m_node) Σ_{i ∈ node} (ŷ_node − y^(i))² and ŷ_node is the mean target value of
the instances in the node.

❖ Scikit-Learn’s DecisionTreeRegressor class:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5   # a single random input feature
y_quad = X_quad ** 2 + 0.025 * np.random.randn(200, 1)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

Fig. 5: A decision tree for regression
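
Predictions then work as for the classifier: for an input x, the regressor returns the average target value of the training instances in the leaf that x falls into. For instance, continuing with the fitted tree_reg above (the input value 0.2 is just an illustrative choice):

print(tree_reg.predict([[0.2]]))   # mean y of the training samples in the leaf containing x = 0.2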
Decision Tree: Regression (cont.)

Fig. 5: A decision tree for regression

Fig. 6: Predictions of two decision tree regression models


Decision Tree: Challenges

❖ Sensitivity to axis orientation: Decision tree boundaries are always orthogonal to the feature axes;
as a result, a simple rotation of the training set can make the decision boundary unnecessarily complex.

Fig. 7: Sensitivity to training set rotation

❖ Decision trees have a high variance: small changes to the data or to the hyperparameters can
produce very different trees. Luckily, by averaging predictions over many trees, it is possible to reduce
variance significantly. Such an ensemble of trees is called a random forest, and it is one of the most
powerful types of models available today (see the sketch below).
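
A minimal sketch (illustrative dataset and settings, not from the slides) of the variance-reducing ensemble mentioned above:

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=200, random_state=42)  # 200 trees, an illustrative choice

# Averaging over many trees typically reduces variance and improves accuracy
print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())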
