DECISION TREE ALGORITHM
DECISION TREE
A decision tree is a type of supervised
learning in which the data is continuously split
according to certain parameters.
The tree has two kinds of entities: decision
nodes and leaf nodes.
Decision nodes are where the data is split,
and leaves are the decisions or final
outcomes.
Decision trees are used for both classification and
regression.
Two key measures guide how the splits are chosen:
1. Entropy
2. Information gain
Entropy is a measure of impurity or uncertainty
in a dataset, and it is a key factor in decision
trees. It helps the algorithm decide how to
split the data at each node so that the
resulting subsets are purer.
Formula: Entropy(S) = - Σ_i p_i log2(p_i), where p_i is the
proportion of samples in S that belong to class i.
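
As a minimal sketch of how this formula can be computed (Python; the function name and input format are illustrative, assuming the class labels are given as a plain list):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # 1.0: a 50/50 split is maximally uncertain
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0: a pure set has no uncertainty
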
Information gain is a measure used to
determine which feature should be used to
split the data at each internal node of the
decision tree. It is calculated using entropy.
Formula: IG(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v),
where S_v is the subset of S for which attribute A takes the value v.
EXAMPLE
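An illustrative worked example (the numbers here are invented, not taken from the original slide): suppose a node holds 10 samples, 5 labelled "yes" and 5 labelled "no", so Entropy(S) = 1.0. A candidate feature splits them into a subset of 4 samples (all "yes", entropy 0) and a subset of 6 samples (1 "yes", 5 "no", entropy ≈ 0.65). The information gain is IG = 1.0 - (4/10)·0 - (6/10)·0.65 ≈ 0.61, so this split removes most of the uncertainty at the node.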
Gini Index:
The Gini Index is a statistical measure used to
determine inequality or impurity in a dataset.
Purpose: Measures how "pure" or "impure" a
dataset is. Pure data means all elements belong
to one category, while impure data means they
are distributed among multiple categories.
Range: Values range from 0 up to (but below) 1:
0: Perfectly pure (all elements belong to one
class).
Values near 1: Maximum impurity (elements are evenly
spread across the classes; for k classes the maximum
is 1 - 1/k, e.g. 0.5 for two classes).
• Formula: Gini = 1 - Σ_i p_i², where p_i is the proportion of
samples in class i.
• Higher Gini Index indicates more impurity. Lower Gini
Index indicates more purity.
Example:
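As a minimal sketch in Python (the function name and input format are illustrative, assuming the class labels are given as a plain list):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # 0.0 -> perfectly pure
print(gini(["a", "a", "b", "b"]))  # 0.5 -> maximum impurity for two classes
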
ALGORITHMS:
1. CART (Classification and Regression
Trees)
Use Case: Both classification and regression.
Split Criterion:
Gini Index for classification.
Mean Squared Error (MSE) for regression.
Output: Binary tree (each node splits into
two branches).
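
A minimal sketch using scikit-learn, whose tree estimators follow the CART approach (a recent scikit-learn is assumed; the toy data below is invented for illustration):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: Gini index as the split criterion
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(clf.predict([[2.5]]))  # predicts the class of the leaf that 2.5 falls into

# Regression tree: squared error (MSE) as the split criterion
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=2, random_state=0)
reg.fit([[0], [1], [2], [3]], [0.0, 0.1, 0.9, 1.0])
print(reg.predict([[2.5]]))  # predicts the mean target value of the matching leaf
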
2. ID3 (Iterative Dichotomiser 3)
Use Case: Classification tasks.
Split Criterion: Information Gain
(computed from Entropy)
Limitations:
Does not handle continuous data directly.
Prone to overfitting.
ID3 is one of the earliest decision tree
algorithms, developed by Ross Quinlan, and
is used for classification tasks.
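
A rough sketch of the ID3 split choice (categorical features only; the dataset and function names here are invented for illustration):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    # Return the feature index whose split gives the highest information gain
    base = entropy(labels)
    best_feature, best_gain = None, -1.0
    for f in range(len(rows[0])):
        remainder = 0.0
        for value in set(row[f] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
            remainder += (len(subset) / len(labels)) * entropy(subset)
        gain = base - remainder
        if gain > best_gain:
            best_feature, best_gain = f, gain
    return best_feature, best_gain

# Toy data: columns are [outlook, windy], labels say whether to play
rows = [["sunny", "yes"], ["sunny", "no"], ["rain", "no"], ["rain", "no"]]
labels = ["no", "no", "yes", "yes"]
print(best_split(rows, labels))  # outlook (index 0) separates the classes perfectly
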
3. C4.5
C4.5 is an advanced decision tree algorithm
developed by Ross Quinlan as an
improvement over the ID3 algorithm.
Use Case: Classification tasks.
Split Criterion: Gain Ratio
Features:
Handles continuous and missing data.
Prunes trees to avoid overfitting.
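For reference (this is the standard definition, not taken from the original slides), the gain ratio divides the information gain by the split information of the attribute:
Gain Ratio(S, A) = IG(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = - Σ_v (|S_v| / |S|) log2(|S_v| / |S|).
This penalizes attributes that split the data into many small subsets, which plain information gain tends to favour.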
4. CHAID (Chi-squared Automatic
Interaction Detection)
Use Case: Primarily for categorical data.
Split Criterion: Chi-square test for
independence.
Features:
Produces multi-way splits (not restricted to binary
splits).
Often used in marketing and survey data analysis.
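
A minimal sketch of the underlying chi-square test using SciPy (the contingency-table counts below are invented for illustration):

from scipy.stats import chi2_contingency

# Rows: the two groups a candidate split would create;
# columns: outcome counts (e.g. buy / not buy)
table = [[30, 10],
         [15, 45]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a very small p-value means the split groups differ significantly
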
Evaluating the decision tree:
Split Data
Divide the dataset into training and test sets to evaluate
performance on unseen data.
Make Predictions
Use the decision tree to predict outcomes for the test set.
Compare Predictions
Compare the tree’s predictions with the actual outcomes
from the test set.
Calculate Metrics
Measure performance using metrics like accuracy, precision,
recall, or F1 score.
Analyze and Improve
Check for overfitting, adjust parameters and improve if
needed.
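
A minimal sketch of this evaluation workflow with scikit-learn, using the built-in Iris data as a stand-in for a real dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Split the data into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Fit the tree and make predictions on the unseen test set
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# 3.-4. Compare predictions with the true labels and compute metrics
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class

# 5. If training accuracy is far above test accuracy, the tree is likely
#    overfitting: reduce max_depth, increase min_samples_leaf, or prune.
print(accuracy_score(y_train, tree.predict(X_train)))
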
THANK YOU