Decision Tree (5 Marks Answer)
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It works by splitting the data into branches based on conditions or questions about
the input features. Each internal node represents a decision (based on a feature), each branch
represents the outcome of the decision, and each leaf node gives the final result or prediction.
It is called a "tree" because it starts from a root node and splits into branches like a tree.
Structure of a Decision Tree
Root Node: The topmost node, which represents the entire dataset. This node is split into two or more homogeneous sets.
Decision Nodes: Nodes where the data is split based on certain criteria (features).
Leaf Nodes: Nodes that represent the final outcome (classification or decision) and do not split further.
Branches: The arrows from one node to another, each representing the outcome of a decision.
How Decision Trees Work (Simple Explanation)
1. Start with the Whole Data:
o The decision tree begins at the root node, which has the entire dataset.
2. Choose the Best Feature to Split:
o The algorithm selects the best feature to divide the data.
o It uses methods like:
   - Gini Impurity for classification.
   - Variance Reduction for regression.
o The goal is to split the data in a way that makes each group more similar (pure).
3. Split the Data:
o The chosen feature is used to divide the data into smaller groups.
o The process is repeated recursively for each resulting group.
4. When to Stop (Stopping Conditions):
o The tree stops splitting when:
   - All data points in a group belong to the same class.
   - There are no more features left to split on.
   - The tree reaches a maximum depth (set by the user).
   - A node has too few samples to split further.
5. Making Predictions:
o For a new data point, start from the root node.
o Follow the path by checking the feature values at each step.
o Stop when you reach a leaf node.
o The leaf node gives the final answer (class or value).
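A minimal sketch of these steps, assuming scikit-learn is available, using its DecisionTreeClassifier with the Gini criterion on the built-in Iris dataset (all choices here are for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Start with the whole data, then hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2-4. At each node the tree picks the best feature/threshold by Gini impurity,
#      splits the data, and stops when a condition such as max_depth is reached.
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

# 5. Making predictions: each row is routed from the root down to a leaf.
print(model.predict(X_test[:5]))    # predicted classes for five unseen samples
print(model.score(X_test, y_test))  # accuracy on the held-out data
```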
Advantages of Decision Trees
Easy to Understand: The structure of decision trees makes them easy to interpret and visualize.
Non-Parametric: They do not assume any underlying distribution of data.
Versatile: Can be used for both classification and regression tasks.
Disadvantages of Decision Trees
Overfitting: Decision trees can become very complex and overfit the training data.
Unstable: Small changes in the data can lead to a completely different tree.
Bias: They can be biased towards features with more levels (high cardinality).
Cross Validation – Full Explanation
📌 Definition:
Cross-validation is a technique used to evaluate the performance of a machine learning model by
splitting the data into multiple parts — so the model is trained and tested on different subsets of
data.
🧠 Why do we use it?
Because:
We want to know how well our model will perform on unseen data.
Training and testing on the same data risks an over-optimistic score that hides overfitting.
Cross-validation gives a more realistic estimate of model accuracy.
🔧 Types of Cross Validation
1. ✅ K-Fold Cross Validation (most popular)
Steps:
1. Split dataset into K equal parts (called “folds”)
2. Train the model on K−1 folds, test it on the remaining 1 fold
3. Repeat this process K times, each time using a different fold for testing
4. Finally, average all K accuracy scores → that’s your final result!
🧪 Example for 5-Fold: If you have 100 rows:
Split into 5 parts (20 rows each)
Train on 80, test on 20 → do this 5 times
You get 5 accuracy scores → take the average
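A minimal sketch of exactly this 5-fold procedure, assuming scikit-learn; cross_val_score runs the train/test loop and returns one score per fold (the model and dataset are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Split the data into 5 folds; each fold is the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)  # one accuracy score per fold

print(scores)         # the 5 individual accuracy scores
print(scores.mean())  # the averaged score reported as the final estimate
```

The printed mean is the averaged result described in step 4 above.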
Benefits:
Reliable model evaluation
Detects overfitting or underfitting
Works well with small datasets
Helps in model tuning (Grid Search + CV)
📊 Where is it used?
Any model: Logistic regression, Random Forest, XGBoost, etc.
Used in hyperparameter tuning to find the best parameters (see the sketch below)
Also used to compare different models (e.g., SVM vs RF vs NB)
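As a rough example of the Grid Search + CV combination mentioned above, assuming scikit-learn (the parameter grid is invented for illustration, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try (illustrative only).
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}

# Every combination is scored with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated accuracy
```

GridSearchCV evaluates every parameter combination with 5-fold cross-validation and keeps the one with the best mean score.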