
UNIT-3

Classification Techniques – Decision Tree

5-Marks Questions (Answer in Brief)

1. Define a classification tree. What are its key characteristics?

A classification tree is a type of decision tree used in supervised machine learning to categorize data into predefined classes. It works by recursively splitting the dataset into
subsets based on feature values, forming a tree structure where internal nodes represent
decision rules, branches represent outcomes of the rules, and leaf nodes represent class labels
or predictions.

The process starts with the entire dataset at the root. The algorithm evaluates all possible
splits using criteria such as Gini index, entropy (information gain), or classification error,
choosing the one that best separates the classes. The data is then split accordingly, and this
process is repeated for each resulting subset until a stopping condition is met (e.g., maximum
depth, minimum node size, or pure node).
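
As a quick illustration of these splitting criteria, the short R sketch below computes the Gini index and entropy of a node from its class labels; the helper names gini() and entropy() are illustrative, not part of any package.

# Illustrative impurity measures used when evaluating candidate splits
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  1 - sum(p^2)                          # Gini index: 0 means a pure node
}

entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                         # avoid log2(0)
  -sum(p * log2(p))                     # entropy in bits: 0 means a pure node
}

# Example: a node with 6 samples of class A and 2 of class B
node <- c(rep("A", 6), rep("B", 2))
gini(node)     # 0.375
entropy(node)  # approximately 0.811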

Key characteristics of classification trees include:

 Interpretability: The rules and structure are easy to understand and visualize.
 Hierarchical decision-making: Decisions are made in a step-by-step top-down
manner.
 Non-linear boundaries: Trees can model complex class boundaries.
 Handling of mixed data types: Categorical and numerical features can be used.
 Feature importance: Trees can highlight which variables contribute most to
classification.
 Prone to overfitting: Especially on noisy data, unless pruning is applied.
 No need for data scaling: Unlike many other models, trees are not affected by data
normalization.
 Robust to outliers: Splits are not influenced heavily by extreme values.
 Can handle missing values: Some implementations support surrogate splits.
 Basis of ensemble methods: Trees form the foundation of powerful models like
Random Forest and Gradient Boosted Trees.

2. Explain the significance of training and testing datasets in decision tree algorithms.

In decision tree algorithms, the training and testing datasets play a vital role in building
and evaluating the model's performance. These two subsets ensure that the model learns
effectively from data and can generalize to unseen instances.

The training dataset is used to build the decision tree. It contains labeled data (features and
corresponding class labels) that the algorithm uses to identify patterns, determine the best
splits, and construct the decision rules in the tree. During training, the algorithm aims to
minimize classification error and create branches that best separate the classes.
The testing dataset, on the other hand, is not used during training. It is reserved to evaluate
how well the trained model performs on new, unseen data. This helps in assessing the
generalization ability of the decision tree — its capacity to make accurate predictions on
real-world data rather than just memorizing the training samples.

Using separate training and testing datasets prevents overfitting, where the tree becomes too
complex and performs well only on training data but poorly on new data. The performance
metrics (like accuracy, precision, recall, and F1-score) calculated on the testing set indicate
the model’s predictive strength and reliability.

In summary, training data helps the model learn, while testing data helps us verify its
effectiveness and validate the model’s real-world applicability.
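
A minimal R sketch of this workflow, using the iris data and an illustrative 70/30 split, is shown below.

# Load the rpart package for building the tree
library(rpart)

data(iris)
set.seed(42)                                   # reproducible split

# Illustrative 70/30 train/test split
train_idx  <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Learn the tree on the training set only
model <- rpart(Species ~ ., data = train_data, method = "class")

# Evaluate generalization on the held-out test set
pred     <- predict(model, test_data, type = "class")
accuracy <- mean(pred == test_data$Species)
accuracy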

3. What is Laplace correction? When is it used in decision trees?

What is Laplace Correction?

Laplace correction, also known as Laplace smoothing, is a technique used to avoid zero
probability estimates in categorical prediction problems. In the context of decision trees,
especially when making class probability estimates at the leaf nodes, it ensures that no
class is assigned a zero probability just because it didn’t appear in the training subset that led
to that leaf.

Why Zero Probability is a Problem

When a leaf node has not seen any instances of a particular class in training, the estimated
probability for that class becomes zero. This can be misleading — especially when the
dataset is small or imbalanced — because a class might not appear in the sample due to
chance, not because it's impossible.

Laplace Correction Formula

If a leaf node has:

 N: total number of instances
 N_c: instances of class c
 K: total number of classes

Then the Laplace-corrected probability for class c is:

P(c) = \frac{N_c + 1}{N + K}

This ensures that each class has at least a small, non-zero probability.
When is it Used in Decision Trees?

Laplace correction is used:

 When estimating class probabilities at leaf nodes for classification.


 In probabilistic decision trees where final predictions are made based on
probabilities, not just majority voting.
 During pruning or post-pruning to better estimate expected error rates.

Benefits of Laplace Correction

 Prevents zero probabilities for unseen classes.


 Improves generalization on small or imbalanced datasets.
 Makes the model more robust to data sparsity.

Example

Suppose a leaf node contains 4 samples:

 Class A: 4 samples
 Class B: 0 samples

Without correction:

 P(A) = 1.0, P(B) = 0.0

With Laplace correction (K = 2):

 P(A) = \frac{4 + 1}{4 + 2} = \frac{5}{6} \approx 0.83
 P(B) = \frac{0 + 1}{4 + 2} = \frac{1}{6} \approx 0.17

This correction avoids falsely ruling out Class B.
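
The same arithmetic can be verified with a couple of lines of plain R (a standalone sketch, not tied to any package):

# Worked example: 4 samples of class A, 0 of class B, K = 2 classes
counts <- c(A = 4, B = 0)
K <- length(counts)
N <- sum(counts)

(counts + 1) / (N + K)   # Laplace-corrected probabilities: A ≈ 0.83, B ≈ 0.17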

4. List and briefly explain stopping criteria for tree development.

Here are the common stopping criteria used in decision tree development to prevent
overfitting and excessive growth:

1. Maximum Tree Depth

 Limits how deep the tree can grow.


 Prevents over-complex trees and helps generalization.
2. Minimum Samples per Leaf

 Specifies the minimum number of samples required at a leaf node.


 Ensures that decisions are based on enough data to be reliable.

3. Minimum Samples for Split

 Defines the minimum number of samples needed to consider a split.


 Avoids unnecessary splits on small subsets.

4. Impurity Reduction Threshold

 Splitting stops if the gain in impurity (e.g., Gini, entropy) is below a threshold.
 Ensures only meaningful splits are made.

5. Maximum Number of Nodes

 Puts a cap on the total number of nodes in the tree.


 Controls model complexity and memory usage.

6. No Further Improvement

 If no split improves classification significantly, splitting stops.


 Avoids overfitting to noise or minor variations in data.

7. Pure Node

 If all samples in a node belong to the same class, it becomes a leaf.


 No need to split further as the class is already perfectly separated.

8. Early Stopping (Validation-based)

 Tree stops growing if validation error starts increasing.


 A practical approach during cross-validation or pruning.
These criteria help balance model accuracy and complexity, ensuring the tree is neither too
shallow (underfitting) nor too deep (overfitting).
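
Several of these criteria map directly onto parameters of rpart.control() in the rpart package; the sketch below shows an illustrative configuration (the specific values are arbitrary examples).

library(rpart)

data(iris)

# Illustrative stopping criteria expressed as rpart.control() parameters
ctrl <- rpart.control(
  maxdepth  = 4,     # 1. maximum tree depth
  minbucket = 5,     # 2. minimum samples per leaf
  minsplit  = 10,    # 3. minimum samples required to attempt a split
  cp        = 0.01   # 4. minimum complexity (impurity-reduction) gain per split
)

model <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
print(model)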

5. What is pruning in decision trees? Mention its importance.

What is Pruning in Decision Trees?

Pruning is the process of removing unnecessary branches from a decision tree to reduce its
size and improve generalization. After a decision tree is built, it may overfit the training data
by modeling noise or minor variations. Pruning helps simplify the model by eliminating
nodes that provide little or no predictive power.

Types of Pruning

1. Pre-Pruning (Early Stopping)


o Stops tree growth early using criteria like minimum samples or maximum
depth.
o Prevents the tree from becoming too complex.
2. Post-Pruning (Simplification After Full Growth)
o Builds the full tree and then removes branches that do not improve accuracy
on validation data.
o Common techniques include Reduced Error Pruning and Cost Complexity
Pruning (CCP).

Importance of Pruning

 Reduces Overfitting
Simplifies the model to avoid learning noise or irregular patterns in training data.
 Improves Generalization
Makes the model more robust and better at predicting unseen data.
 Enhances Interpretability
Smaller trees are easier to visualize and explain.
 Increases Efficiency
Smaller models use less memory and compute power during prediction.
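
A minimal R sketch of both ideas using rpart is shown below: pre-pruning via rpart.control() and post-pruning via prune(); the cp values are illustrative only.

library(rpart)

data(iris)

# Pre-pruning: restrict growth while the tree is being built
pre_pruned <- rpart(Species ~ ., data = iris, method = "class",
                    control = rpart.control(maxdepth = 3, minsplit = 20))

# Post-pruning: grow a large tree first (cp = 0), then cut it back
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))
printcp(full_tree)                           # complexity table recorded while growing
post_pruned <- prune(full_tree, cp = 0.05)   # illustrative cp value
print(post_pruned)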

6. Write a simple R code to build a classification tree using the rpart package.

Here’s a simple R code to build a classification tree using the rpart package:

# Load necessary libraries


library(rpart)
library(rpart.plot) # For plotting the tree

# Load built-in dataset


data(iris)

# Build a classification tree model


model <- rpart(Species ~ ., data = iris, method = "class")

# Print the model summary


print(model)

# Plot the decision tree


rpart.plot(model, main = "Classification Tree for Iris Dataset")

Explanation:

 Species ~ . means predicting Species using all other variables in the iris dataset.
 method = "class" tells rpart to perform classification (not regression).
 rpart.plot() visualizes the tree in a clean, interpretable format.

Output:

This will create a tree that classifies iris flowers into setosa, versicolor, or virginica based on
their features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).

12-Marks Questions (Detailed Explanation or Code)

1. Write an R program to build and visualize a classification tree using a sample dataset.
Explain the output.

Here is a complete R program to build and visualize a classification tree using a sample dataset (iris), along with an explanation of the output.

✅ R Program: Classification Tree using rpart and rpart.plot

# Load necessary libraries


library(rpart) # For building the decision tree
library(rpart.plot) # For visualizing the decision tree

# Load the built-in iris dataset


data(iris)
# View first few rows of the dataset
head(iris)

# Build a classification tree to predict Species


tree_model <- rpart(Species ~ ., data = iris, method = "class")

# Print model summary


print(tree_model)

# Visualize the tree


rpart.plot(tree_model, main = "Classification Tree - Iris Dataset", extra = 104)

✅Explanation of Each Step

 iris Dataset: A built-in dataset with 150 samples of iris flowers. It has:
o 4 numerical features: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
o 1 categorical label: Species (setosa, versicolor, virginica)
 rpart() Function: Builds a classification tree using Gini index as the default impurity
measure.
o Species ~ . means we use all other variables to predict the Species.
 print(tree_model): Displays a text summary of the decision tree with splits and
predicted classes at each node.
 rpart.plot(): Graphically plots the decision tree.
o extra = 104 shows both the class probability and the number of observations in
each node.

✅Explanation of the Output

 The tree plot shows a top-down structure starting from the root node, which contains
all samples.
 At each internal node:
o A feature and threshold are used to split the data.
o For example: Petal.Length < 2.45 is often the first split, perfectly classifying
"setosa".
 Leaf nodes show:
o The predicted species,
o The proportion of samples from each class,
o The number of samples in the node.

✅Key Insights from the Tree

 The tree often separates setosa with one clear split on Petal.Length.
 Further splits differentiate between versicolor and virginica.
 The tree helps in understanding which features are most important for classification
(e.g., Petal.Length and Petal.Width are usually the top splitters).
This R program is ideal for teaching students the basics of classification trees with a real-
world dataset and intuitive visual output.

2. Discuss various pruning techniques used in decision tree algorithms.

Pruning Techniques in Decision Tree Algorithms

Pruning is a vital process in decision trees that removes unnecessary branches to reduce
overfitting, improve model generalization, and simplify interpretation. There are two major
types of pruning: pre-pruning (early stopping) and post-pruning (after full tree growth).

✅1. Pre-Pruning (Early Stopping)

Pre-pruning stops the tree from growing once certain conditions are met, such as:

a) Maximum Depth Limit

 Stops tree expansion beyond a set depth.


 Prevents highly complex trees.

b) Minimum Samples for Split

 A node must have at least n samples to split.


 Avoids unreliable decisions from small subsets.

c) Minimum Information Gain / Gini Decrease

 Stops splitting if the improvement in impurity is below a threshold.


 Prevents unnecessary splits with minimal benefit.

✅2. Post-Pruning (Pruning After Full Tree Construction)

Post-pruning first grows the complete tree and then removes branches that do not improve
accuracy on validation data.

a) Reduced Error Pruning

 Evaluates tree performance on a validation set.


 Prunes nodes if removing them doesn’t reduce accuracy.
 Simple and intuitive, but depends on a good validation set.

b) Cost-Complexity Pruning (CCP) – Used in rpart (R) and CART

 Balances tree complexity and classification accuracy.


 Prunes branches that reduce a cost function:

R_\alpha(T) = R(T) + \alpha \cdot \lvert T \rvert

Where:

o R(T): misclassification cost
o \alpha: complexity parameter
o \lvert T \rvert: number of terminal nodes
 Generates a series of subtrees, and selects the one with best performance (e.g., via
cross-validation).
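
A hedged sketch of cost-complexity pruning with rpart, selecting the cp value that minimizes the cross-validated error stored in the model's cptable, might look like this:

library(rpart)

data(iris)

# Grow a full tree; rpart runs internal cross-validation (xval folds)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, xval = 10))

# cptable columns: CP, nsplit, rel error, xerror, xstd
plotcp(full_tree)                     # cross-validated error versus cp

# Choose the cp with the lowest cross-validated error (xerror)
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]

pruned <- prune(full_tree, cp = best_cp)
print(pruned)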

c) Pessimistic Error Pruning – Used in C4.5

 Estimates error without needing a separate validation set.


 Adds a penalty for complexity using statistical bounds.
 Reduces the risk of overfitting when data is limited.

✅Why Prune?

 Improves generalization by reducing overfitting.


 Simplifies the model for better understanding.
 Improves prediction speed with fewer decision rules.

✅Summary Table

Pruning Type Method Key Benefit

Pre-Pruning Max depth, Min samples Faster training, avoids overfit

Reduced Error Pruning Prune if error unchanged Simple, uses validation data

Cost-Complexity Trade-off error vs size Optimal subtree selection

Pessimistic Error Error estimate with bounds No need for validation set

3. Using R, demonstrate how to apply Laplace correction in a decision tree model and
compare results.

Here is a clear and practical demonstration of how to apply Laplace correction in a decision
tree model using R, and how to compare results with and without the correction.
✅ Objective

 Build a classification tree using rpart


 Manually apply Laplace correction to class probabilities
 Compare predicted class probabilities (with and without correction)

✅R Code: Laplace Correction in Decision Trees

# Load required libraries


library(rpart)
library(rpart.plot)

# Load dataset
data(iris)

# Build decision tree without Laplace correction


model <- rpart(Species ~ ., data = iris, method = "class")

# Predict class probabilities (default - no Laplace correction)


pred_default <- predict(model, type = "prob")

# Display first 6 predictions without Laplace correction


cat("Without Laplace Correction:\n")
print(head(pred_default))

# Extract class counts at each leaf node


leaf_nodes <- unique(model$where) # Unique terminal nodes

# Apply Laplace correction manually to leaf probabilities


laplace_corrected_probs <- pred_default # Initialize

for (node in leaf_nodes) {


# Get class counts for samples in this node
idx <- which(model$where == node)
actual_classes <- iris$Species[idx]
class_table <- table(actual_classes)

# Apply Laplace correction


K <- length(levels(iris$Species))
N <- sum(class_table)
corrected_probs <- rep(1, K) # Start with 1 for Laplace (pseudo-counts)
names(corrected_probs) <- levels(iris$Species)

for (cls in names(class_table)) {


corrected_probs[cls] <- class_table[cls] + 1
}
corrected_probs <- corrected_probs / (N + K) # Normalize

# Assign corrected probs back


for (i in idx) {
laplace_corrected_probs[i, ] <- corrected_probs
}
}

# Display first 6 predictions with Laplace correction


cat("\nWith Laplace Correction:\n")
print(head(laplace_corrected_probs))

✅Explanation of Code

 predict(..., type = "prob") returns the predicted class probabilities without Laplace
correction.
 For each leaf node, the code manually:
o Counts the number of samples per class.
o Applies Laplace smoothing: adds 1 to each class count and normalizes by
N+KN + K, where:
 NN = number of samples at the node
 KK = number of classes
 Updates the predicted probabilities with the smoothed values.

✅Comparison and Importance

Feature Without Laplace With Laplace


Rare class probability May be 0 Always > 0
Prediction robustness Less reliable on small nodes More stable on small datasets
Use in probabilistic tasks Risky due to 0-probs Safe and well-calibrated

Laplace correction helps prevent zero probability estimates, especially when a class is not
present in a node due to sample bias, making the tree more robust in real-world applications
like Naive Bayes, risk estimation, or uncertainty-aware decisions.

4. Describe the evaluation metrics: Confusion matrix, ROC curve, and F-measure.

✅Evaluation Metrics in Classification: Confusion Matrix, ROC Curve, and F-measure

When evaluating the performance of a classification model like a decision tree, it’s crucial to
understand different metrics that help assess its accuracy, precision, recall, and overall
behavior. Below are three key evaluation tools:
✅1. Confusion Matrix

A confusion matrix is a table used to describe the performance of a classification model on a set of test data where the true values are known.

Example (for binary classification):


Predicted: Yes Predicted: No

Actual: Yes True Positive (TP) False Negative (FN)

Actual: No False Positive (FP) True Negative (TN)

Key metrics derived from it:

 Accuracy = (TP + TN) / (TP + TN + FP + FN)


 Precision = TP / (TP + FP)
 Recall (Sensitivity) = TP / (TP + FN)
 Specificity = TN / (TN + FP)
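
A small base-R sketch of these calculations, using made-up actual and predicted labels purely for illustration:

# Illustrative labels for a binary problem
actual    <- factor(c("Yes","Yes","Yes","No","No","No","Yes","No","Yes","No"))
predicted <- factor(c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))

cm <- table(Predicted = predicted, Actual = actual)
cm

TP <- cm["Yes","Yes"]; TN <- cm["No","No"]
FP <- cm["Yes","No"];  FN <- cm["No","Yes"]

accuracy    <- (TP + TN) / sum(cm)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)

c(accuracy = accuracy, precision = precision,
  recall = recall, specificity = specificity)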

✅2. ROC Curve (Receiver Operating Characteristic Curve)

The ROC curve is a graphical plot that shows the trade-off between sensitivity (True
Positive Rate) and 1 - specificity (False Positive Rate) at different threshold settings.

 The x-axis shows the False Positive Rate (FPR)


 The y-axis shows the True Positive Rate (TPR)
 A model with perfect predictions will have a point in the top-left corner.
 AUC (Area Under Curve) is used to summarize the ROC curve:
o AUC = 1.0 → perfect model
o AUC = 0.5 → random guessing
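
Assuming the pROC package is installed, a minimal sketch of plotting a ROC curve and computing AUC from a decision tree's predicted probabilities could look like this; reducing iris to two classes here is purely illustrative.

library(rpart)
library(pROC)    # assumed available; provides roc() and auc()

data(iris)
iris2 <- droplevels(subset(iris, Species != "setosa"))   # two-class problem

set.seed(1)
idx   <- sample(seq_len(nrow(iris2)), size = 0.7 * nrow(iris2))
train <- iris2[idx, ]
test  <- iris2[-idx, ]

model <- rpart(Species ~ ., data = train, method = "class")

# Predicted probability of the positive class ("virginica") for each test sample
probs <- predict(model, test, type = "prob")[, "virginica"]

roc_obj <- roc(response = test$Species, predictor = probs)
plot(roc_obj, main = "ROC Curve - Decision Tree")
auc(roc_obj)     # area under the curve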

✅3. F-measure (F1 Score)

The F1 score is the harmonic mean of Precision and Recall, and it gives a single score that
balances both concerns.

F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

 Range: 0 to 1
 Best when there is an uneven class distribution or when both precision and recall
are equally important.
✅ Summary Comparison

Metric Purpose Best For

Confusion Matrix Detailed performance summary Binary/multiclass classification

ROC Curve Visual trade-off (TPR vs FPR) Model comparison at various thresholds

F-measure Balancing precision and recall Imbalanced datasets

These metrics help you choose the most suitable model and evaluate how well it performs
across different aspects of classification.

5. Write an R program to evaluate a decision tree using K-fold cross-validation and report accuracy.

Here’s a complete R program that uses K-fold cross-validation to evaluate a decision tree
model built using the rpart package, and it reports the average accuracy across all folds.

✅ R Program: K-Fold Cross-Validation for Decision Tree Evaluation

# Load necessary libraries


library(rpart)
library(caret) # For cross-validation and evaluation

# Set seed for reproducibility


set.seed(123)

# Load the built-in iris dataset


data(iris)

# Define the number of folds (e.g., 5-fold cross-validation)


k <- 5

# Create k-folds using caret's createFolds function


folds <- createFolds(iris$Species, k = k, list = TRUE, returnTrain = FALSE)

# Initialize a vector to store accuracy from each fold


accuracy_scores <- numeric(k)

# Perform K-fold cross-validation


for (i in 1:k) {
test_indices <- folds[[i]]
test_data <- iris[test_indices, ]
train_data <- iris[-test_indices, ]
# Train the decision tree model
model <- rpart(Species ~ ., data = train_data, method = "class")

# Predict on the test set


predictions <- predict(model, test_data, type = "class")

# Calculate accuracy
cm <- confusionMatrix(predictions, test_data$Species)
accuracy_scores[i] <- cm$overall['Accuracy']

cat(sprintf("Fold %d Accuracy: %.4f\n", i, accuracy_scores[i]))


}

# Report the average accuracy across all folds


cat(sprintf("\nAverage Accuracy over %d folds: %.4f\n", k, mean(accuracy_scores)))

✅Explanation

 createFolds(): Splits data into k stratified parts.


 Loop: For each fold, it:
o Trains the model on remaining (k−1) folds.
o Tests on the current fold.
o Calculates and stores accuracy.
 confusionMatrix(): From caret, provides metrics like accuracy, sensitivity, etc.
 Result: You get individual accuracies per fold and the overall average accuracy.

✅Sample Output (Will vary by random seed)

Fold 1 Accuracy: 0.9667


Fold 2 Accuracy: 0.9667
Fold 3 Accuracy: 0.9333
Fold 4 Accuracy: 1.0000
Fold 5 Accuracy: 0.9333

Average Accuracy over 5 folds: 0.9600

This method gives a reliable evaluation of model performance and helps detect overfitting or
underfitting more effectively than a single train-test split.

6. Explain the decision tree ensemble methods. How do they improve classification
performance?
✅ Decision Tree Ensemble Methods and Their Role in Improving Classification
Performance

Ensemble methods combine multiple decision trees to create a stronger predictive model.
Instead of relying on a single decision tree (which can be unstable or prone to overfitting),
ensembles aggregate the outputs of several trees to enhance accuracy, robustness, and
generalization.

✅Types of Decision Tree Ensemble Methods

1. Bagging (Bootstrap Aggregating)

 Technique: Trains multiple trees on random bootstrapped samples of the training data.
 Each tree votes, and the final prediction is made by majority vote (classification) or
average (regression).
 Key Algorithm: Random Forest
o Adds randomness by choosing a random subset of features at each split.
 Advantages:
o Reduces variance and overfitting.
o Works well with high-dimensional data.

2. Boosting

 Technique: Builds trees sequentially, where each new tree focuses on correcting
errors made by the previous ones.
 Trees are usually shallow (weak learners).
 Key Algorithms:
o AdaBoost (Adaptive Boosting): Assigns higher weights to misclassified
instances.
o Gradient Boosting Machines (GBM): Optimizes a loss function using
gradient descent.
o XGBoost / LightGBM / CatBoost: Scalable and faster variants of gradient
boosting.
 Advantages:
o Improves both bias and variance.
o Handles complex patterns better than a single tree.

3. Stacking (Stacked Generalization)

 Technique: Combines multiple models (could be different types, including trees) using a meta-model.
 Predictions from base models are used as input to a final model that makes the
ultimate prediction.
 Advantage: Leverages the strengths of multiple learning algorithms.

✅How Ensemble Methods Improve Classification Performance


Benefit                  Description

Reduced Overfitting      Bagging reduces variance; boosting reduces bias and variance.

Higher Accuracy          Combines weak learners into a stronger model, improving predictive power.

Robustness               Less sensitive to noisy data or outliers.

Better Generalization    Performs well on unseen data compared to a single complex tree.

Feature Importance       Random Forest and boosting models can rank feature importance.

✅Summary Table
Method    Strategy                                  Key Focus                Main Advantage

Bagging   Parallel training with resampling         Reduce variance          Stability, robustness

Boosting  Sequential training with weight updates   Reduce bias & variance   Accuracy, focus on errors

Stacking  Combine model outputs via meta-model      Model diversity          Flexibility, customization

Decision tree ensembles like Random Forest and Boosted Trees significantly outperform
single decision trees in real-world tasks. They are widely used in competitions (like Kaggle)
and production systems because they offer better accuracy, scalability, and resilience to
overfitting.
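
As a concrete illustration of the bagging idea, the sketch below fits a Random Forest with the randomForest package (assumed to be installed); the parameter values are illustrative only.

library(randomForest)   # assumed to be installed

data(iris)
set.seed(7)

# Random Forest: 500 trees, each grown on a bootstrap sample,
# with a random subset of 2 features (mtry) considered at each split
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

print(rf_model)        # includes the out-of-bag (OOB) error estimate
importance(rf_model)   # feature importance ranking
varImpPlot(rf_model)   # visualize which features matter most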

15-Marks Questions (Comprehensive Answers with Diagrams/Programs)

1. Using R, implement a complete decision tree classifier on a real-world dataset (e.g., iris or Titanic), including training, pruning, Laplace correction, and ROC curve plotting. Interpret the results.
2. Compare the performance of two decision tree models using McNemar’s test and K-
fold cross-validated paired t-test in R. Explain how to choose the better model for
prediction.
