8000 Duplicate logic in Decision Tree Building · Issue #7338 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
Duplicate logic in Decision Tree Building #7338
@nelson-liu

Description

@nelson-liu

Description

I've been working on #7301, and I found some seemingly duplicate logic in the decision tree code. At the least, it's something that requires more documentation; at the most, it's a bug.

So I was poking around, trying to figure out why I couldn't get equal trees for uniform sample weights when i set min_weight_fraction_leaf and min_samples_leaf to the same float value. I saw that the two built trees actually had radically different values of min_samples_split, despite me not setting this value at all. In the code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L229), it shows that min_samples_split is set to the max(min_samples_split, 2 * min_samples_leaf). This isn't documented at all, and should be if we decide to keep this statement.

Now this logic makes sense to me, as you can't make a split if there's no way to split it in such a manner that each leaf has the minimum number of samples. However, isn't this sort of logic encapsulated when we're building the tree? Namely, I'm thinking of the is_leaf conditional, which determines whether to split on a node or not (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L219):

is_leaf = ((depth >= max_depth) or
  (n_node_samples < min_samples_split) or
  (n_node_samples < 2 * min_samples_leaf) or
   (weighted_n_node_samples < min_weight_leaf))

I think the statement (n_node_samples < 2 * min_samples_leaf) has the same effect as setting min_samples_split explicitly to the max of itself or 2*min_samples_leaf. Indeed, removing the max call does not cause any tests to fail.

Should we be removing the explicit call to max? It seems like a piece of duplicate logic, but maybe I'm missing something. ping @glouppe @raghavrv @jmschrei / anyone else.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0