Q1: What are Decision Trees?
Decision Trees are a fundamental machine learning algorithm used for both
classification and regression tasks. They are a non-parametric supervised learning
method that is widely used in data science. Decision Trees create a tree-like model of
decisions and their possible consequences.
In a Decision Tree:
● The root node represents the entire dataset or the initial problem.
● Internal nodes are decision nodes that split the data into subgroups based on specific criteria, often using features from the dataset.
● Leaf nodes are the final outcomes or predictions.
The process of building a Decision Tree involves selecting the best feature to split the
data, typically using metrics like Gini impurity or information gain. Decision Trees are
known for their interpretability, as you can easily visualize the tree structure, making it
understandable even to non-technical stakeholders.
However, Decision Trees can be prone to overfitting, so techniques like pruning or using
ensemble methods like Random Forests are often employed to improve their
performance.
● A decision tree is a tool that uses a tree-like model of decisions and their possible consequences. If an algorithm contains only conditional control statements, a decision tree can model that algorithm very well.
● Decision trees are a non-parametric, supervised learning method.
● Decision trees are used for classification and regression tasks.
● A classic illustrative example is a decision tree trained on the Titanic dataset to predict whether a passenger survived or not; a minimal version is sketched below.
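As a rough sketch (assuming scikit-learn is installed), such a classifier could look like the following; the inline data and feature encoding are made up for illustration and are not the real Titanic dataset:

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# The tiny dataset below is illustrative only, not the real Titanic data.
from sklearn.tree import DecisionTreeClassifier

# Features per passenger: [passenger_class, is_male, age]
X = [[3, 1, 22], [1, 0, 38], [3, 0, 26], [1, 0, 35], [3, 1, 35], [2, 1, 54]]
y = [0, 1, 1, 1, 0, 0]  # 1 = survived, 0 = did not survive

# criterion="gini" splits on Gini impurity; criterion="entropy" uses information gain
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

print(clf.predict([[2, 0, 30]]))  # predicted survival for a new passenger
```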
Q2: Explain the structure of a Decision Tree
A decision tree is a flowchart-like structure in which:
● Each internal node represents a test on an attribute (e.g. the outcome of a coin flip).
● Each branch represents the outcome of the test.
● Each leaf node represents a class label.
● The paths from the root to leaf represent the classification rules.
● Splitting Criteria: At each internal node, a splitting criterion is used to determine
how the data should be divided into subgroups. Common criteria include Gini
impurity, information gain, or mean squared error, depending on whether it's a
classification or regression tree.
● Depth of the Tree: The depth of the tree is the length of the longest path from the
root node to a leaf node. A deeper tree can capture more complex patterns in the
data but is also more prone to overfitting.
● Features: Each internal node uses a specific feature from the dataset to make a
decision on how to split the data.
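A quick way to see this structure in practice (a sketch assuming scikit-learn and its bundled Iris dataset) is to print a fitted tree as text: each indented line is an internal node testing one feature, and the "class:" lines are the leaves.

```python
# Sketch: inspecting the flowchart-like structure of a fitted tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Internal nodes appear as feature tests (e.g. "petal width (cm) <= 0.80"),
# branches as the indentation, and leaf nodes as "class: ..." lines.
print(export_text(clf, feature_names=list(iris.feature_names)))
print("Tree depth:", clf.get_depth())  # length of the longest root-to-leaf path
```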
Q3: How are the different nodes of decision trees represented?
A decision tree consists of three types of nodes:
● Decision nodes: represented by squares. A decision node is a point where the flow branches into several optional paths.
● Chance nodes: represented by circles. A chance node represents the probabilities of certain results.
● End nodes: represented by triangles. An end node shows the final outcome of a decision path.
Q4: What are some advantages of using Decision Trees?
● It is simple to understand and interpret. It can be visualized easily.
● It does not require as much data preprocessing as other methods.
● It can handle both numerical and categorical data.
● It can handle multiple output problems.
Advantages of using Decision Trees:
● Interpretability: Decision Trees are easy to understand and visualize, making them great for explaining decisions.
● Simple implementation: they are straightforward to implement and can handle various data types without much preprocessing.
● Versatility: suitable for classification and regression tasks, making them applicable to a wide range of problems.
● Feature selection: they can automatically rank and select important features.
● Robustness: can handle missing values and are robust to outliers.
● Scalability: efficient on large datasets and parallelizable.
● No data normalization: they don't require data normalization.
● Non-linear relationships: can model complex non-linear relationships in the data.
Q5: What type of node is considered Pure?
In the context of Decision Trees, a node is considered "pure" when it contains only one class or category
in a classification problem or when it contains data points with the same target value in a regression
problem.
● If the Gini Index of the data is 0 then it means that all the elements belong to a specific class.
When this happens it is said to be pure.
● When all of the data belongs to a single class (pure) then the leaf node is reached in the tree.
● The leaf node represents the class label in the tree (which means that it gives the final output).
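As a small illustration (plain Python, no particular library assumed), the Gini impurity of a node can be computed directly; a pure node comes out at exactly 0:

```python
# Sketch: Gini impurity of a node; a pure node (a single class) has Gini = 0.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))       # 0.0 -> pure node (leaf)
print(gini(["yes", "no", "yes", "no"]))  # 0.5 -> maximally impure for two classes
```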
Q7: What is the difference between OOB score and validation score?
The oob_score is computed on the "left-over" (out-of-bag) samples that were not used to fit a given tree, whereas the validation score is computed on a sample of data you explicitly set aside yourself. In this way, the OOB sample is a little more random than the validation set, so the OOB sample (on which the oob_score is measured) may be "harder" than the validation set. As a consequence, the oob_score may on average be slightly lower than the validation score.
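A sketch of how the two scores can be compared in practice (assuming scikit-learn and one of its bundled datasets; the split size and number of trees are arbitrary):

```python
# Sketch: OOB score vs. validation score for a random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# oob_score=True scores each tree on the bootstrap samples it did NOT see
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print("OOB score:       ", rf.oob_score_)
print("Validation score:", rf.score(X_val, y_val))
```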
Q8: How would you deal with an Overfitted Decision Tree?
Two approaches to avoiding overfitting can be distinguished: pre-pruning (generating a tree with fewer branches than would otherwise be the case) and post-pruning (generating the tree in full and then removing parts of it). Pre-pruning is typically done with either a size cutoff or a maximum-depth cutoff. Post-pruning can be done by comparing the static and backed-up estimated error rates at each node and removing subtrees whose backed-up estimate is no better than the static one.
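In scikit-learn terms, a rough sketch of both ideas might look like this (note that scikit-learn's built-in post-pruning is minimal cost-complexity pruning rather than the error-comparison method described above, and the cutoff values are arbitrary):

```python
# Sketch: pre-pruning vs. post-pruning of a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with a depth / leaf-size cutoff.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
post = DecisionTreeClassifier(ccp_alpha=alphas[len(alphas) // 2], random_state=0)
post.fit(X_train, y_train)

print("pre-pruned accuracy: ", pre.score(X_test, y_test))
print("post-pruned accuracy:", post.score(X_test, y_test))
```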
Q9: What are some disadvantages of using Decision Trees and how would you solve them?
1. Overfitting the data:
Definition: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training
data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h'
over the training examples, but h' has smaller error than h over the entire distribution of
instances.
How can we prevent overfitting? Here are some common heuristics:
● Don't try to fit all examples, stop before the training set is exhausted.
● Fit all examples then prune the resultant tree.
How does one tell if a given tree is one that has overfit the data?
● Extract a validation set not used for training from the training set and use this to check for overfitting.
Usually the validation set consists of one-third of the training set, chosen randomly.
● Then use statistical tests, e.g. the chi-squared test, to determine if changing the tree improves its
performance over the validation set.
● A variation of the above is to use MDL (minimum description length) to check if modifying the tree increases its MDL with respect
to the validation set.
If we use the validation set to guide pruning, how do we tell that the tree is not overfitting the validation set? In
this case, we need to extract yet another set called the test set from the training set and use this for the final
check.
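A sketch of that train/validation/test workflow (assuming scikit-learn; here the pruning strength ccp_alpha plays the role of "changing the tree", and the split sizes are arbitrary):

```python
# Sketch: use a validation set to guide pruning, and a separate test set
# for the final, unbiased check.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Pick the pruning strength that scores best on the validation set...
best = max(alphas, key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
           .fit(X_train, y_train).score(X_val, y_val))

# ...then report the final estimate on the untouched test set.
final = DecisionTreeClassifier(ccp_alpha=best, random_state=0).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```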
2. Guarding against bad attribute choices:
The information gain measure has a bias that favors attributes with many values over those with only a few.
For example, if you imagine an attribute Date with a unique value for each training example, then Gain(S, Date)
will yield the maximum possible value, H(S), since each partition Sv contains a single example and therefore Σv (|Sv|/|S|) · H(Sv) = 0.
Obviously no other attribute can do better. This will result in a very broad tree of depth 1.
To guard against this, Quinlan uses GainRatio(S,A) instead of Gain(S,A), where

GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
SplitInformation(S,A) = − Σv P(Sv) · log2 P(Sv)

with P(Sv) estimated by relative frequency as before.
This introduces a penalty for partitioning the data into many equally sized (equipossible) groups.
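A small illustration of the penalty (plain Python; the toy labels and attributes are invented for the example):

```python
# Sketch: information gain vs. gain ratio. An attribute with a unique value per
# example (like Date) gets a large split information and hence a lower gain ratio.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(labels, attribute_values):
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain, (gain / split_info if split_info > 0 else 0.0)

labels = ["yes", "yes", "no", "no"]
date  = ["d1", "d2", "d3", "d4"]      # unique per example
windy = [True, True, False, False]    # equally predictive, but only two groups
print(gain_and_ratio(labels, date))   # (1.0, 0.5) -> penalized
print(gain_and_ratio(labels, windy))  # (1.0, 1.0) -> preferred by GainRatio
```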
3. Handling continuous valued attributes:
Note that continuous valued attributes can be partitioned into a discrete number of disjoint intervals. Then we
can test for membership to these intervals.
If the Temperature attribute in the PlayTennis example took on continuous values in the range 40-90, note that
we cannot just use the same approach as before.
Why? Because Temperature would then be a bad choice for classification: treated as a discrete attribute, it alone may perfectly classify the
training examples and therefore promise the highest information gain (as in the earlier Date example), while
remaining a poor predictor on the test set.
The solution to this problem is to classify based not on the actual temperature, but on dynamically determined
intervals within which the temperature falls.
For instance, we can introduce boolean attributes T > a, T > b, and T > c
instead of the real-valued T. The thresholds a, b and c can be computed dynamically, based on boundaries where T is likely or
known to change.
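A sketch of how such thresholds can be found, using the Temperature/PlayTennis values as they are usually quoted (plain Python):

```python
# Sketch: derive candidate thresholds c for boolean tests T > c at the
# boundaries where the class label changes between sorted adjacent values.
def candidate_thresholds(values, labels):
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2:                          # class changes between neighbours
            thresholds.append((v1 + v2) / 2)  # midpoint as candidate cut point
    return thresholds

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["no", "no", "yes", "yes", "yes", "no"]
print(candidate_thresholds(temperature, play_tennis))  # [54.0, 85.0]
```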
4. Handling missing attribute values:
What happens if some of the training examples contain one or more "?", meaning "value not known", instead
of the actual attribute values?
Here are some common ad hoc solutions:
● Substitute "?" by the most common value in that column.
● Substitute "?" by the most common value among all training examples that have been sorted into the
tree at that node.
● Substitute "?" by the most common value among all training examples that have been sorted into the
tree at that node with the same classification as the incomplete example.
● etc.
Quinlan's solution in C4.5 is slightly more complex: instead of a single value, he assigns a distribution over values to
the "?". This distribution is used when computing Gain(S,A): it determines how much of the incomplete example
contributes to each partition's entropy H(Sv).
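A sketch of the simplest of these fixes, substituting "?" by the most common value in the column (plain Python; C4.5's fractional weighting is more involved and not shown):

```python
# Sketch: replace "?" with the most common known value in the same column.
from collections import Counter

def fill_missing(column, missing="?"):
    known = [v for v in column if v != missing]
    most_common = Counter(known).most_common(1)[0][0]
    return [most_common if v == missing else v for v in column]

outlook = ["sunny", "?", "rain", "sunny", "?", "overcast", "sunny"]
print(fill_missing(outlook))  # every "?" becomes "sunny"
```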
5. Handling attributes with differing costs:
Sometimes we want to introduce an additional bias against the selection of certain attributes, e.g. "wherever
possible, try not to use InvasiveTest as an attribute, since this might inconvenience the patient."
Here is another ad hoc heuristic to fix this:
introduce a user-defined measure Cost(A) to specify a fixed bias against each attribute.
Now we can use a CostedGain(S,A), defined along the lines of Tan and Schlimmer's

CostedGain(S,A) = Gain(S,A)^2 / Cost(A)

so that an expensive attribute must promise a correspondingly higher information gain in order to be selected.
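A toy illustration of that bias (plain Python; the gain and cost numbers are made up):

```python
# Sketch: cost-sensitive attribute selection using Gain(S, A)^2 / Cost(A).
def costed_gain(gain, cost):
    return gain ** 2 / cost

attributes = {"Temperature": (0.6, 1.0), "InvasiveTest": (0.9, 10.0)}  # (gain, cost)
for name, (gain, cost) in attributes.items():
    print(name, round(costed_gain(gain, cost), 3))
# InvasiveTest has the higher raw gain, but its cost penalty lets Temperature win.
```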
Q10: How would you define the Stopping Criteria for decision trees?
All decision trees need stopping criteria, or it would be possible, and undesirable, to grow a tree
in which each case occupies its own node. The resulting tree would be computationally
expensive, difficult to interpret and would probably not work very well with new data. For
example, picture training data in which a dotted, highly convoluted decision boundary
accurately separates out the two classes; a simple diagonal line is probably a better
decision boundary for new cases.
The stopping criteria used by CTREE are typical of many decision tree programs.
● Number of cases in the node is less than some pre-specified limit.
● Purity of the node is more than some pre-specified limit. Because CTREE only allows
equal costs, this means that the proportion of cases in the node belonging to the
majority class is more than the pre-specified limit.
● Depth of the node is more than some pre-specified limit.
● Predictor values for all records are identical, in which case no rule could be generated to split
them.
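Expressed as scikit-learn parameters, a rough equivalent of these criteria is sketched below (the values are illustrative, and scikit-learn has no direct purity cutoff; min_impurity_decrease is the closest analogue):

```python
# Sketch: typical stopping criteria as DecisionTreeClassifier parameters.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=20,        # don't split nodes with fewer than 20 cases
    min_samples_leaf=10,         # every leaf must keep at least 10 cases
    max_depth=6,                 # stop once a node would exceed this depth
    min_impurity_decrease=0.01,  # skip splits that barely improve purity
    random_state=0,
)
# Identical predictor values in a node are handled automatically: no split is possible.
```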