Decision Tree
❖ A decision tree is a hierarchical structure that resembles an inverted tree, where each internal
node represents a decision based on a feature or attribute, and each leaf node represents a class
label (in classification) or a numerical value (in regression).
❖ Decision trees are versatile machine learning algorithms that can perform classification, regression, and even multioutput tasks. They are capable of fitting complex datasets and are the fundamental building blocks of random forests.
Working of Decision Tree: Toy Example
❖ We have data for 15 students on pass or fail in an online ML exam (Table 1: the toy data set). To understand the basic process, we begin with a dataset comprising a binary target variable (Pass/Fail) and various binary or categorical predictor variables such as Work Status, Student Background, and Online Courses.
❖ Entropy: In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent to the variable’s possible outcomes. It is measured by the formula:
E = − Σ_i p_i · log₂(p_i)
where p_i is the probability of randomly selecting an example of the i-th outcome of the target variable. For example, p_pass = number of passing students / total students = 8/15 and p_fail = number of fails / total students = 7/15.
❖ Information Gain: Entropy_parent − Entropy_children
where Entropy_parent is the entropy of the parent node and Entropy_children represents the average entropy of the child nodes obtained by splitting the parent node.
❖ To calculate entropy, let us first put our formulas for Entropy and Information Gain in terms of the variables in our dataset:
• Probability of pass or fail at each node = No. of pass or fail / sample size of the node
• Average entropy across the child nodes: E_av,child = Σ_{j=1}^{total child nodes} (S_c,j / S_p) · E_j , where E_j is the entropy of child node j, S_c,j is the sample size of child node j, and S_p is the sample size of the parent node
• Information Gain = E_parent − E_av,child
Toy Example (cont.): Evaluating candidate variables for root node
❖ E_parent = −(p_pass · log₂(p_pass) + p_fail · log₂(p_fail)) = −(8/15 · log₂(8/15) + 7/15 · log₂(7/15)) = 0.9968
❖ Let us calculate the average entropy of the child nodes, E_av,child, and the Information Gain for each candidate feature variable in turn.
❖ Work Status as the feature variable of the root node:
E_working = −(3/9 · log₂(3/9) + 6/9 · log₂(6/9)) = 0.9183
E_not-working = −(5/6 · log₂(5/6) + 1/6 · log₂(1/6)) = 0.65
E_av,child = (9/15) · 0.9183 + (6/15) · 0.65 = 0.8110
Information Gain = 0.9968 − 0.8110 = 0.1858
❖ In a similar fashion we can evaluate the entropy and information gain for the Student Background and Online Courses variables. The results are provided in the table below:
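These calculations can be reproduced with a minimal Python sketch; the class counts (8 pass / 7 fail at the root, 3/6 in the working node, 5/1 in the not-working node) are the ones assumed from the toy example above:

import numpy as np

def entropy(counts):
    # Entropy of a node given its class counts, e.g. [n_pass, n_fail]
    p = np.array(counts) / np.sum(counts)
    return float(-np.sum(p * np.log2(p)))

e_parent = entropy([8, 7])                                    # ≈ 0.9968
e_working, e_not_working = entropy([3, 6]), entropy([5, 1])   # ≈ 0.9183, 0.6500
e_av_child = (9 / 15) * e_working + (6 / 15) * e_not_working  # ≈ 0.8110
info_gain = e_parent - e_av_child                             # ≈ 0.1858
print(round(e_parent, 4), round(e_av_child, 4), round(info_gain, 4))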
❖ Gini impurity: An alternative impurity measure to entropy is the Gini impurity of a node, defined as
G_i = 1 − Σ_k p_{i,k}²
In this equation:
• G_i is the Gini impurity of the i-th node.
• p_{i,k} is the ratio of class k instances among the training instances in the i-th node.
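For comparison with entropy, a short sketch computing the Gini impurity of the toy example’s root node (using the same assumed 8 pass / 7 fail counts):

def gini(counts):
    # Gini impurity of a node given its class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([8, 7]))   # ≈ 0.4978 for the root node of the toy example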
❖ ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. It supports categorical features only; the tree is grown to full size and pruning is applied afterwards.
❖ C4.5 is the successor to ID3 and removed the restriction that features must be categorical.
❖ C5.0 is Quinlan’s latest version, released under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.
❖ CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports
numerical target variables (regression) and does not compute rule sets. CART constructs binary
trees using the feature and threshold that yield the largest information gain at each node.
❖ scikit-learn uses an optimized version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now.
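Because of this restriction, categorical predictors such as those in the toy example must be encoded numerically before fitting a scikit-learn tree. A minimal sketch (the column names and values here are illustrative, not taken from Table 1):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy-style data with categorical predictors
X = pd.DataFrame({
    "WorkStatus": ["working", "not-working", "working", "not-working"],
    "OnlineCourses": ["yes", "no", "no", "yes"],
})
y = ["fail", "pass", "fail", "pass"]

X_enc = OrdinalEncoder().fit_transform(X)   # map each category to an integer code
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_enc, y)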
CART (Classification and Regression Trees)
❖ CART algorithm recursively partitions the feature space such that the samples with the same
labels or similar target values are grouped together.
❖ The algorithm works by first splitting the training set into two subsets using a single feature k and
a threshold tk (e.g., “petal length ≤ 2.45 cm”) such that it produces the purest subsets, weighted
by their size.
❖ The process continues until no further split is possible or it hits one of its stopping criteria.
❖ The CART algorithm is greedy: it searches for the best split at the current node without checking whether that choice leads to the lowest impurity several levels down. It therefore often produces a solution that is reasonably good but not guaranteed to be optimal.
❖ Finding the optimal tree is known to be an NP-complete problem. It requires O(exp(m)) time,
making the problem intractable even for small training sets.
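A brief scikit-learn sketch of this greedy splitting on the iris data, restricted to petal length and width so that the root split matches the “petal length ≤ 2.45 cm” example above (the exact tree depends on the data and random_state):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]          # petal length and petal width only
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Show the learned splits; the root typically comes out as "petal length (cm) <= 2.45"
print(export_text(tree_clf, feature_names=iris.feature_names[2:]))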
❖ For any subtree T ≤ T_max, define its complexity as |T̃|, the number of terminal nodes in T. Let α ≥ 0 be a real number called the complexity parameter and define the cost-complexity measure R_α(T) as
R_α(T) = R(T) + α·|T̃|
where R(T) is the misclassification (resubstitution) cost of the subtree T.
❖ For each value of α, the algorithm finds the subtree T(α) ≤ T_max which minimizes R_α(T).
❖ If α is small, the penalty for having a large number of terminal nodes is small and T(α) will be large.
For instance, if T_max is so large that each terminal node contains only one case, then every case is classified correctly and R(T_max) = 0, so that T_max minimizes R_α(T) for small α.
❖ Finally, for α sufficiently large, the minimizing subtree T(α) will consist of the root node only, and the tree T_max will have been completely pruned.
❖ Although 𝛼 runs through a continuum of values, there are at most a finite number of subtrees of
𝑇𝑚𝑎𝑥 . Thus, the pruning process produces a finite sequence of subtrees T1, T2, T3, ... with
progressively fewer terminal nodes.
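In scikit-learn this pruning scheme is exposed through the ccp_alpha parameter and the cost_complexity_pruning_path method of the tree estimators; a short sketch on an arbitrary built-in dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alpha values at which successive subtrees are pruned away
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas[:5])

# A larger ccp_alpha penalizes terminal nodes more heavily and yields a smaller tree
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=path.ccp_alphas[-2])
pruned.fit(X, y)
print(pruned.get_n_leaves())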
Decision Tree: Regression
❖ Regressor: Decision trees are also capable of performing regression tasks. The main difference from a classification tree is that instead of predicting a class at each node, it predicts a numerical value.
❖ Accordingly, the CART regression algorithm works such that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE. The CART cost function for regression is given below:
J(k, t_k) = (m_left / m) · MSE_left + (m_right / m) · MSE_right
where MSE_node is the mean squared error of the node’s samples around the node’s mean target value, and m_left, m_right, m are the numbers of instances in the left subset, the right subset, and the node being split.
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

Fig. 5: A decision tree for regression
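The snippet above assumes the quadratic toy dataset X_quad, y_quad is already defined (it is not shown on the slide); a noisy quadratic dataset along the following lines reproduces the setup of Fig. 5:

import numpy as np

np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5                        # a single feature in [-0.5, 0.5]
y_quad = X_quad[:, 0] ** 2 + 0.025 * np.random.randn(200)    # quadratic target plus Gaussian noise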
Decision Tree: Regression (cont.)
❖ Sensitivity to Axis Orientation: Decision boundaries are always orthogonal to a feature axis; consequently, when the data’s natural boundaries are not axis-aligned (for example, after a simple rotation of the dataset), the tree tends to become unnecessarily complex.
❖ Decision Trees Have a High Variance: small changes in the data or hyperparameters can produce very different trees. Luckily, by averaging predictions over many trees, it’s possible to reduce variance significantly. Such an ensemble of trees is called a random forest, and it’s one of the most powerful types of models available today.
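A short sketch of the variance-reduction idea, comparing the cross-validated accuracy of a single tree against a random forest on an arbitrary synthetic dataset (all names and parameters here are illustrative):

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=200, random_state=42)  # averages many randomized trees

# The averaged forest predictions typically generalize better than a single unpruned tree
print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())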