Decision tree:
A Decision Tree is a non-parametric, non-linear supervised learning model used in machine
learning for classification and regression tasks. It represents decisions and their possible
consequences, including chance event outcomes, resource costs, and utility. Essentially, it's a
flowchart-like structure where:
• Each internal node represents a decision based on a feature (or attribute).
• Each branch represents the outcome of that decision.
• Each leaf node represents a class label (in classification) or a continuous value (in regression).
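As a rough illustration (the weather features, thresholds, and labels below are made up purely for demonstration), a learned decision tree behaves like a set of nested if/else rules:

```python
# Hypothetical example: a small classification tree written out as if/else rules.
# Internal nodes test a feature, branches are the test outcomes, leaves return a class label.
def predict_play_tennis(outlook: str, humidity: float, wind: str) -> str:
    if outlook == "sunny":          # internal node: test on 'outlook'
        if humidity > 75:           # internal node: test on 'humidity'
            return "no"             # leaf: class label
        return "yes"
    elif outlook == "overcast":
        return "yes"
    else:  # outlook == "rainy"
        if wind == "strong":
            return "no"
        return "yes"

print(predict_play_tennis("sunny", humidity=80, wind="weak"))  # -> "no"
```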
Decision Tree Induction
Decision Tree Induction refers to the process of learning (or constructing) a decision tree from
training data. Here's a detailed explanation of how it works:
Feature Selection Criterion:
To decide which feature to split on at each step in the tree, a feature selection criterion is used.
Common criteria include:
• Gini Index: Measures the impurity of a node. The goal is to choose the split that minimizes the (weighted) Gini Index of the resulting child nodes.
• Information Gain: Based on entropy, it measures the reduction in entropy achieved by the split. Higher information gain indicates a better split.
• Gain Ratio: Adjusts information gain by the intrinsic (split) information of a split, penalizing features with many distinct values.
• Chi-square: Measures the statistical significance of the differences in class distributions among the branches.
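The sketch below shows how these quantities could be computed for a candidate split. The formulas (Gini = 1 - sum(p_i^2), Entropy = -sum(p_i * log2(p_i)), and information gain as the entropy reduction) are the standard ones, but the function names and toy data are purely illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_label_groups):
    """Reduction in entropy obtained by splitting the parent into the given child subsets."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy node with 10 samples, split into two branches by some feature.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]            # a perfectly separating split
print(gini(parent))                             # ~0.48
print(information_gain(parent, [left, right]))  # ~0.97 (children are pure)
```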
Tree Construction Process:
Step 1: Start with the entire dataset as the root.
Step 2: At each node, select the best feature to split the data based on the chosen criterion.
Step 3: Split the dataset into subsets so that each subset contains the instances with the same value (or value range) of the selected feature.
Step 4: Repeat the process recursively for each subset, using only the remaining features.
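The following is a minimal sketch of this greedy recursive procedure for numeric features, splitting on whichever (feature, threshold) pair gives the highest information gain. The function names and toy dataset are illustrative; real implementations such as ID3, C4.5, and CART add many refinements.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Greedy search over (feature, threshold) pairs; keep the split with the highest gain."""
    best = (None, None, 0.0)  # (feature index, threshold, information gain)
    parent_entropy = entropy(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = parent_entropy - child
            if gain > best[2]:
                best = (j, t, gain)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    # Stopping criteria: pure node, depth limit reached, or too few samples to split.
    if len(np.unique(y)) == 1 or depth >= max_depth or len(y) < min_samples:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}            # majority-class leaf
    feature, threshold, gain = best_split(X, y)
    if feature is None or gain <= 0.0:                        # no useful split found
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    mask = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples),
    }

# Tiny toy dataset: two numeric features, binary labels.
X = np.array([[2.0, 1.0], [3.0, 1.5], [6.0, 2.0], [7.0, 2.5]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y))
```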
Stopping Criteria:
The recursive process stops when one of the following conditions is met:
All instances in a node belong to the same class (pure node).
There are no remaining features to split upon.
The predefined depth limit (maximum depth) of the tree is reached.
The number of instances in a node falls below the minimum required for it to be considered for splitting.
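In libraries these stopping rules typically appear as hyperparameters. For example, scikit-learn's DecisionTreeClassifier exposes them roughly as follows (the specific values here are arbitrary, chosen only to show the mapping):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping criteria listed above.
clf = DecisionTreeClassifier(
    criterion="gini",        # feature selection criterion (could also be "entropy")
    max_depth=5,             # predefined depth limit of the tree
    min_samples_split=10,    # minimum number of instances a node needs before it can be split
    min_samples_leaf=4,      # minimum number of instances allowed in a leaf
)
```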
Pruning:
After the tree is fully grown, it might be too complex and overfit the training data. Pruning helps to
reduce this complexity:
Pre-pruning: Stops the tree growth early based on predefined thresholds (like maximum depth or
minimum samples per leaf).
Post-pruning: Removes branches that provide little additional information or predictive value after
the tree is fully grown. Methods include cost-complexity pruning and reduced error pruning.
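As a sketch of post-pruning in practice, scikit-learn implements minimal cost-complexity pruning via the ccp_alpha parameter; the dataset and alpha value below are chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate effective alphas for cost-complexity pruning of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)

# Refit with a non-zero ccp_alpha: larger values prune more aggressively.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)
print("test accuracy:", pruned.score(X_test, y_test))
```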
Example:
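A small end-to-end illustration (the Iris dataset and hyperparameters are chosen only for demonstration): train a decision tree classifier, evaluate it, and print the induced rules.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small, well-known dataset and hold out a test split.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Induce the tree: entropy-based splits (information gain) with a depth limit.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
# Print the learned tree as nested if/else rules (internal nodes, branches, leaves).
print(export_text(clf, feature_names=list(iris.feature_names)))
```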