Decision Trees
Decision trees are models that make predictions by splitting examples according to a sequence of questions about their features.
Here's an overview:
Cat Classification Example
Consider an example where you want to classify if an animal is a cat or not. You might use
features like ear shape, face shape, and presence of whiskers to make this classification.
Here's an image that shows the decision-making process:
A decision tree consists of:
Root Node: The initial decision point.
Decision Nodes: Intermediate nodes that split the data based on features.
Leaf Nodes: Final nodes that give the prediction.
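As a rough illustration (not from the course material), such a tree could be represented in Python as follows; the class and field names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    feature: Optional[int] = None        # feature index this decision node splits on
    left: Optional["TreeNode"] = None    # branch for examples where the feature is 1 ("yes")
    right: Optional["TreeNode"] = None   # branch for examples where the feature is 0 ("no")
    prediction: Optional[int] = None     # class stored at a leaf (e.g. 1 = cat, 0 = not cat)

def classify(node: TreeNode, x) -> int:
    """Walk from the root node down to a leaf, following x's feature values."""
    while node.prediction is None:
        node = node.left if x[node.feature] == 1 else node.right
    return node.prediction

# Tiny tree: the root splits on feature 0 (say, ear shape); its children are leaves.
root = TreeNode(feature=0,
                left=TreeNode(prediction=1),
                right=TreeNode(prediction=0))
print(classify(root, [1]))  # -> 1 (classified as cat)
```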
⚙️ Decision Tree Learning Process
To build a decision tree, you need to learn how to make splits based on the features. For
example, you might start by splitting based on ear shape:
The image above shows how ear shape (pointy vs floppy) can be used to begin classifying
animal faces.
You continue splitting until you reach a point where you can confidently classify the data.
The image above provides a simple visual guide for distinguishing between cats and
other animals.
🤔 Key Decisions in Decision Tree Learning
There are two critical decisions to make when learning a decision tree:
1. How to choose what feature to split on at each node?
Maximize purity (or minimize impurity).
2. When do you stop splitting?
When a node is 100% one class.
When splitting a node will result in the tree exceeding a maximum depth.
When improvements in purity score are below a threshold.
When the number of examples in a node is below a threshold.
The image below shows a tree data structure with nodes labeled with "Depth" values,
ranging from 0 at the top to 2 at the bottom, illustrating a basic hierarchical structure:
📏 Measuring Purity: Entropy
Entropy is a measure of impurity. In the context of decision trees, it helps determine the
best splits to maximize information gain. Entropy is measured by the following formula:
H(p1) = -p1 log2(p1) - (1 - p1) log2(1 - p1)
Where p1 is the fraction of examples that are cats (using the convention 0 * log2(0) = 0).
The image below shows a graphical representation of the relationship
between p1 and H(p1).
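A minimal sketch of this computation in Python (assuming binary labels and the convention 0 * log2(0) = 0):

```python
import numpy as np

def entropy(p1):
    """Entropy H(p1) of a node where a fraction p1 of the examples are cats."""
    if p1 == 0 or p1 == 1:
        return 0.0  # a pure node has zero entropy
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.5))  # 1.0 -> maximally impure (half cats, half not cats)
print(entropy(1.0))  # 0.0 -> perfectly pure node
```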
📈 Choosing a Split: Information Gain
Information gain is used to determine the best feature to split on. It measures the reduction
in entropy (impurity) obtained by splitting a node on a given feature. The general formula
is:
Information Gain = H(p1_root) - ( n_left / (n_left + n_right) * H(p1_left) + n_right / (n_left + n_right) * H(p1_right) )
Where:
H(p1_root) is the entropy of the node being split.
H(p1_left) is the entropy of the left child node.
H(p1_right) is the entropy of the right child node.
n_left and n_right are the numbers of examples that go to the left and right branches.
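As a rough illustration, the formula can be computed with a small helper like the one below (the function names are illustrative, not from the course):

```python
import numpy as np

def entropy(p1):
    """Binary entropy H(p1), with the convention 0 * log2(0) = 0."""
    return 0.0 if p1 in (0, 1) else -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(y_root, y_left, y_right):
    """Entropy at the parent minus the weighted entropy of the two children."""
    w_left = len(y_left) / len(y_root)
    w_right = len(y_right) / len(y_root)
    return (entropy(np.mean(y_root))
            - (w_left * entropy(np.mean(y_left)) + w_right * entropy(np.mean(y_right))))

# Example: splitting 10 examples (5 cats) into two branches of 5 examples each.
y_root = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_left, y_right = np.array([1, 1, 1, 1, 0]), np.array([1, 0, 0, 0, 0])
print(information_gain(y_root, y_left, y_right))  # about 0.28
```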
🧩 Putting It Together
Here’s the process for constructing a decision tree:
Start with all examples at the root node.
Calculate information gain for all possible features and pick the one with the highest
information gain.
Split the dataset according to the selected feature, and create left and right branches
of the tree.
Keep repeating the splitting process until stopping criteria are met:
When a node is 100% one class.
When splitting a node will result in the tree exceeding a maximum depth.
When information gain from additional splits is less than a threshold.
When the number of examples in a node is below a threshold.
The image below shows how face shape can be used to classify cats, with ear shape as a
secondary factor.
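As a rough sketch of this recursive procedure (assuming binary 0/1 features, and that the entropy and information_gain helpers sketched earlier are in scope; this is an illustration, not the course's reference implementation):

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=2, min_examples=1):
    """Recursively split on the feature with the highest information gain.

    X: (num_examples, num_features) array of 0/1 features; y: 0/1 labels.
    Returns a nested dict representing the tree.
    """
    # Stopping criteria: pure node, maximum depth reached, or too few examples.
    if len(np.unique(y)) == 1 or depth == max_depth or len(y) <= min_examples:
        return {"prediction": int(round(float(np.mean(y))))}

    def gain_for(f):
        left, right = y[X[:, f] == 1], y[X[:, f] == 0]
        if len(left) == 0 or len(right) == 0:
            return 0.0  # degenerate split carries no information
        return information_gain(y, left, right)

    gains = [gain_for(f) for f in range(X.shape[1])]
    best = int(np.argmax(gains))
    if gains[best] <= 0:  # no split improves purity
        return {"prediction": int(round(float(np.mean(y))))}

    mask = X[:, best] == 1
    return {"feature": best,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_examples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_examples)}
```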
Using One-Hot Encoding for Categorical Features
When dealing with categorical features, it's common to use one-hot encoding.
One-Hot Encoding: If a categorical feature can take on n values, create n binary features (0
or 1 valued).
Here's an example of how to apply one-hot encoding to the ear shape feature:
| Ear shape | Pointy | Floppy | Oval |
|-----------|--------|--------|------|
| Pointy    | 1      | 0      | 0    |
| Floppy    | 0      | 1      | 0    |
| Oval      | 0      | 0      | 1    |
The image below shows a table with cat icons and features that have been one-hot encoded.
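In practice this is often done with pandas; a minimal sketch (the column name ear_shape and the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ear_shape": ["Pointy", "Floppy", "Oval", "Pointy"]})

# Each of the 3 categories becomes its own 0/1 column.
one_hot = pd.get_dummies(df, columns=["ear_shape"], dtype=int)
print(one_hot)
#    ear_shape_Floppy  ear_shape_Oval  ear_shape_Pointy
# 0                 0               0                 1
# 1                 1               0                 0
# 2                 0               1                 0
# 3                 0               0                 1
```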
⚖️ Continuous Valued Features
For continuous features, you can split the data based on a threshold. For example, if you
have a weight feature, you might split the data into examples where the weight is less than
or equal to a certain value, and examples where the weight is greater than that value. The
image below shows how a weight feature can be used to classify an animal.
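One common way to pick the threshold (a sketch that reuses the information_gain helper from the earlier section; the function name best_threshold is illustrative) is to try the midpoints between consecutive sorted values and keep the one with the highest information gain:

```python
import numpy as np

def best_threshold(values, y):
    """Return (threshold, gain) maximizing information gain for a continuous feature.

    Assumes the information_gain helper from the entropy section is in scope.
    """
    order = np.argsort(values)
    v, labels = np.asarray(values)[order], np.asarray(y)[order]
    best_t, best_gain = None, 0.0
    for i in range(len(v) - 1):
        t = (v[i] + v[i + 1]) / 2                 # candidate midpoint threshold
        left, right = labels[v <= t], labels[v > t]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```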
🌲 Regression Trees (Optional)
Decision trees can also be used for regression tasks, where the goal is to predict a
continuous value. In this case, the leaf nodes would contain the average value of the target
variable for the examples that fall into that node.
The image below shows how ear and face shape could be used to predict a dog's weight.
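For regression trees, splits are typically chosen to maximize the reduction in the variance of the target values rather than entropy, and each leaf predicts the mean target value of its examples. A rough sketch:

```python
import numpy as np

def variance_reduction(y_root, y_left, y_right):
    """Weighted decrease in target variance from a split (regression analogue of information gain)."""
    w_left = len(y_left) / len(y_root)
    w_right = len(y_right) / len(y_root)
    return np.var(y_root) - (w_left * np.var(y_left) + w_right * np.var(y_right))

# A leaf simply predicts the average target value of the examples it contains,
# e.g. the mean weight of the animals that ended up in that leaf.
weights_in_leaf = np.array([7.2, 8.4, 7.6])
leaf_prediction = weights_in_leaf.mean()
```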
🌳🌲🌴 Tree Ensembles
To improve the performance and robustness of decision trees, you can use tree ensembles.
These methods combine multiple decision trees to make predictions. The image below
shows the faces of a group of cats.
Tree ensembles can reduce overfitting and improve generalization performance.
Here's how a prediction might look using a tree ensemble:
Sampling with Replacement
Sampling with replacement involves selecting items from a dataset where each item, once
chosen, is returned to the pool, allowing it to be selected again.
Here's an image demonstrating the concept of sampling with replacement:
This image displays a grid of colored blocks, each representing an item that can be sampled.
The colors are randomly distributed, illustrating how, with replacement, the same item
(color) can be chosen multiple times in a sample.
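As a quick illustration in code (indices stand in for training examples):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
dataset = np.arange(10)  # pretend these are the indices of 10 training examples

# Draw a new training set of the same size; because replace=True,
# some examples appear multiple times and others not at all.
bootstrap_sample = rng.choice(dataset, size=len(dataset), replace=True)
print(bootstrap_sample)
```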
🌲 Random Forest Algorithm
The random forest algorithm is an ensemble learning method that operates by constructing
multiple decision trees during training and outputting the mode of the classes (classification)
or mean prediction (regression) of the individual trees.
The process can be described as follows:
1. Given a training set of size m.
2. For b = 1 to B (the number of trees in the ensemble):
Use sampling with replacement to create a new training set of size m.
Train a decision tree on the new dataset.
Randomizing the Feature Choice
At each node, when choosing a feature to use to split:
1. If n features are available, pick a random subset of k < n features (a common choice is k ≈ √n when n is large).
2. Allow the algorithm to only choose from that subset of features.
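As a minimal sketch, scikit-learn's RandomForestClassifier implements both ideas (bootstrap sampling and random feature subsets); the parameter values below are illustrative, and X_train / y_train / X_test are assumed to be defined as in the XGBoost snippet later in these notes:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators = B, the number of trees; max_features="sqrt" considers
# roughly sqrt(n) randomly chosen features at each split.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```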
🚀 XGBoost (eXtreme Gradient Boosting)
XGBoost is an optimized distributed gradient boosting library designed to be
highly efficient, flexible and portable. It implements machine learning algorithms under the
Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as a
Gradient Boosting Machine) that solves many data science problems in a fast and accurate
way.
Boosted Trees Intuition
1. Given a training set of size m.
2. For b = 1 to B (the number of trees in the ensemble):
Use sampling with replacement to create a new training set of size m. But
instead of picking from all examples with equal (1/m) probability, make it
more likely to pick examples that the previously trained trees misclassify.
Train a decision tree on the new dataset.
XGBoost Features:
Open-source implementation of boosted trees
Fast, efficient implementation
Good choice of default splitting criteria and criteria for when to stop splitting
Built-in regularization to prevent overfitting
Highly competitive algorithm for machine learning competitions (e.g., Kaggle
competitions)
This image shows a decision tree diagram for classifying animals as cats or not-cats based on
physical characteristics:
Here are the code snippets for using XGBoost in classification and regression tasks:
# Classification: predict a discrete label (e.g. cat vs. not cat)
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)     # fit the boosted trees on the training set
y_pred = model.predict(X_test)  # predicted class labels

# Regression: predict a continuous value (e.g. an animal's weight)
from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predicted continuous values
🤔 When to Use Decision Trees
Decision Trees vs Neural Networks
| Feature | Decision Trees and Tree Ensembles | Neural Networks |
|---|---|---|
| Data type | Works well on tabular (structured) data | Works well on all types of data, including tabular (structured) and unstructured data |
| Data recommendation | Not recommended for unstructured data (images, audio, text) | |
| Speed | Fast to train | May be slower than a decision tree |
| Transfer learning | | Works with transfer learning |
| Interpretability | Small decision trees may be human interpretable | |
| System building | | When building a system of multiple models working together, it may be easier to string together multiple neural networks |