Description
I've been working on #7301, and I found some seemingly duplicate logic in the decision tree code. At the least, it's something that requires more documentation; at the most, it's a bug.
So I was poking around, trying to figure out why I couldn't get equal trees for uniform sample weights when I set `min_weight_fraction_leaf` and `min_samples_leaf` to the same float value. I saw that the two built trees actually had radically different values of `min_samples_split`, despite me not setting this value at all. In the code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L229), `min_samples_split` is set to `max(min_samples_split, 2 * min_samples_leaf)`. This isn't documented anywhere, and it should be if we decide to keep this statement.
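For context, here is a minimal sketch of the kind of comparison I was running; the dataset and the fraction value are just placeholders, but with uniform sample weights I expected the two constraints to produce the same tree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
frac = 0.05  # illustrative value only

# One tree constrained by a fraction of samples per leaf, the other by a
# fraction of total sample weight per leaf. With uniform (default) weights
# these constraints describe the same thing.
tree_samples = DecisionTreeClassifier(
    min_samples_leaf=frac, random_state=0).fit(X, y)
tree_weight = DecisionTreeClassifier(
    min_weight_fraction_leaf=frac, random_state=0).fit(X, y)

# Compare the resulting tree structures
print(tree_samples.tree_.node_count, tree_weight.tree_.node_count)
```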
Now this logic makes sense to me: you can't split a node if there's no way to divide it such that each child leaf gets the minimum number of samples. However, isn't this logic already encapsulated when we're building the tree? Namely, I'm thinking of the `is_leaf` conditional, which determines whether or not to split a node (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L219):
```python
is_leaf = ((depth >= max_depth) or
           (n_node_samples < min_samples_split) or
           (n_node_samples < 2 * min_samples_leaf) or
           (weighted_n_node_samples < min_weight_leaf))
```
I think the condition `(n_node_samples < 2 * min_samples_leaf)` has the same effect as explicitly setting `min_samples_split` to the max of itself and `2 * min_samples_leaf`: since the `is_leaf` check also tests `(n_node_samples < min_samples_split)`, checking the two values separately is equivalent to checking against their max. Indeed, removing the `max` call does not cause any tests to fail.
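To spell out the equivalence I'm relying on, here's a quick throwaway brute-force check (not part of the codebase):

```python
# Check that the two formulations of the "don't split" test agree:
#   n < max(min_samples_split, 2 * min_samples_leaf)
# versus
#   (n < min_samples_split) or (n < 2 * min_samples_leaf)
for n in range(100):
    for mss in range(2, 20):      # candidate min_samples_split values
        for msl in range(1, 10):  # candidate min_samples_leaf values
            combined = n < max(mss, 2 * msl)
            separate = (n < mss) or (n < 2 * msl)
            assert combined == separate
print("The two conditions agree on all tested values.")
```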
Should we remove the explicit call to `max`? It seems like a piece of duplicate logic, but maybe I'm missing something. Ping @glouppe @raghavrv @jmschrei / anyone else.