SHORT QUESTIONS ON DECISION TREE
1. What is the main purpose of a decision tree in machine learning?
A decision tree is used to model decisions and their possible outcomes in a structured
way. It supports both classification and regression by recursively partitioning a dataset
into smaller subsets, forming a tree in which each internal node represents a test on a
feature and each leaf node represents a final decision or prediction.
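The structure described above can be sketched as nested feature tests. This is only an illustration: the feature names ("petal_length", "petal_width"), thresholds, and class labels are invented, loosely echoing a classic flower-classification setup.

```python
# Minimal illustration of a decision tree as nested feature tests.
# Features, thresholds, and labels are invented for illustration.

def predict(sample: dict) -> str:
    """Walk the tree: each 'if' is an internal node, each return a leaf."""
    if sample["petal_length"] < 2.5:      # internal node: test on a feature
        return "setosa"                   # leaf node: final predicted class
    if sample["petal_width"] < 1.8:       # another internal node
        return "versicolor"
    return "virginica"

print(predict({"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict({"petal_length": 5.0, "petal_width": 2.0}))  # virginica
```

Every input follows exactly one path from the root to a leaf, which is what makes the model's decisions easy to trace.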
2. What criterion is commonly used to split nodes in a decision tree?
The most common criteria for splitting nodes are Gini impurity and Information Gain
(based on entropy). These metrics measure how well a split separates the data into
classes, aiming to make each group as "pure" as possible.
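Both criteria can be computed directly from a list of class labels. The sketch below implements Gini impurity, entropy, and information gain (parent entropy minus the weighted child entropy); the toy labels are made up for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: chance a random element is misclassified if
    labeled according to the node's class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: disorder of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

parent = ["a", "a", "b", "b"]
print(gini(parent))                                       # 0.5
print(information_gain(parent, ["a", "a"], ["b", "b"]))   # 1.0 (perfect split)
```

A split that cleanly separates the classes drives both children to zero impurity, so the information gain equals the parent's entropy.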
3. What is a leaf node in a decision tree?
A leaf node, also called a terminal node, is the end point of a decision path in the tree. It
represents the final output or decision of the model—either a predicted class (in
classification) or a value (in regression).
4. What is overfitting in the context of decision trees?
Overfitting occurs when a decision tree learns the training data too well, including noise
and outliers. This leads to a complex tree that performs very well on the training data but
poorly on unseen (test) data because it fails to generalize.
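A fully grown tree can end up with one leaf per training example, which amounts to memorizing the data, noise included. The toy sketch below (with invented data and a deliberately mislabeled point) contrasts that memorizer with a single generalizing split.

```python
# Toy illustration: an unpruned tree with one leaf per training point
# memorizes the data, noise included. Data are invented; the point at
# x=3 is deliberately mislabeled (the "true" rule is A if x < 4).

train = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

def overfit_predict(x):
    """One 'leaf' per training example: perfect on train, brittle elsewhere."""
    return train[x]  # raises KeyError on any unseen x

def pruned_predict(x):
    """A single generalizing split that ignores the noisy point."""
    return "A" if x < 4 else "B"

# The memorizer scores 100% on the training data...
print(all(overfit_predict(x) == y for x, y in train.items()))  # True
# ...but the simpler rule handles unseen inputs the memorizer cannot.
print(pruned_predict(3.5))  # A
```

The simpler rule accepts a small training error (it gets the noisy point at x=3 wrong) in exchange for sensible behavior on inputs it has never seen, which is exactly the trade-off generalization requires.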
5. How can you prevent a decision tree from overfitting?
Overfitting can be prevented by techniques like pruning (removing unnecessary
branches), setting a maximum depth, limiting the minimum number of samples per
leaf or split, or using ensemble methods like Random Forests.
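These controls map directly onto hyperparameters in common libraries. A minimal sketch, assuming scikit-learn is installed (the tiny dataset is invented; `ccp_alpha` enables cost-complexity post-pruning):

```python
# Typical anti-overfitting knobs on scikit-learn's decision tree.
# The one-feature dataset below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(
    max_depth=3,          # cap how deep the tree may grow
    min_samples_split=2,  # a node needs this many samples to be split
    min_samples_leaf=1,   # every leaf must keep at least this many samples
    ccp_alpha=0.0,        # > 0 turns on cost-complexity (post-)pruning
    random_state=0,
).fit(X, y)

print(tree.predict([[2.5]]))  # [0]
```

In practice these values are tuned with cross-validation rather than set by hand.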
6. Is a decision tree suitable for both classification and regression?
Yes, decision trees can be used for both. In classification, the tree predicts a category or
class label. In regression, it predicts a continuous numeric value by averaging outcomes
in the leaf nodes.
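The averaging in regression leaves can be sketched in a few lines. The single split threshold and the target values here are invented for illustration.

```python
# Regression sketch: each leaf predicts the mean of the training
# targets that fall into it. Data and threshold are invented.

def leaf_mean(values):
    return sum(values) / len(values)

# One split on x at 10; the training targets that landed in each region:
left_targets = [1.0, 1.5, 2.0]    # y-values of training points with x < 10
right_targets = [8.0, 9.0, 10.0]  # y-values of training points with x >= 10

def predict(x):
    """Route the input to a leaf, then return that leaf's average target."""
    return leaf_mean(left_targets) if x < 10 else leaf_mean(right_targets)

print(predict(4))   # 1.5
print(predict(12))  # 9.0
```

Because each leaf outputs a constant, a regression tree's prediction function is piecewise constant; deeper trees produce finer-grained steps.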
7. What is the difference between Gini impurity and entropy?
Both are used to measure the quality of a split. Entropy is based on information theory
and measures the level of disorder or unpredictability, while Gini impurity measures the
probability that a randomly chosen element would be incorrectly classified if it were
labeled according to the node's class distribution. Gini is often faster to compute
because it avoids logarithms, and it is the default criterion in common implementations
such as scikit-learn.
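The two measures can be compared directly for a binary node with class probability p. Both are zero for pure nodes and maximal at a 50/50 mix; the difference is that entropy needs logarithms while Gini is a simple polynomial.

```python
import math

def gini(p):
    """Gini impurity for a binary node with class probability p."""
    return 2 * p * (1 - p)

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Both vanish at the pure ends (p=0, p=1) and peak at p=0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

In practice the two criteria usually choose very similar splits, which is why the cheaper Gini is the common default.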