8000 Duplicate logic in Decision Tree Building · Issue #7338 · scikit-learn/scikit-learn · GitHub

Closed
nelson-liu opened this issue Sep 4, 2016 · 23 comments
@nelson-liu (Contributor) commented Sep 4, 2016

Description

I've been working on #7301, and I found some seemingly duplicate logic in the decision tree code. At the least it requires more documentation; at worst, it's a bug.

So I was poking around, trying to figure out why I couldn't get equal trees for uniform sample weights when I set min_weight_fraction_leaf and min_samples_leaf to the same float value. I saw that the two built trees actually had radically different values of min_samples_split, despite my not setting this value at all. In the code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L229), min_samples_split is set to max(min_samples_split, 2 * min_samples_leaf). This isn't documented at all, and should be if we decide to keep this statement.
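To make that concrete, the clamp is equivalent to this standalone sketch (the function name is mine for illustration, not from the scikit-learn source):

```python
# Toy restatement of the clamp at tree.py#L229 (my own sketch, not the
# actual scikit-learn source): min_samples_split is silently raised so
# that any split can leave min_samples_leaf samples in each child.
def effective_min_samples_split(min_samples_split, min_samples_leaf):
    # Splitting a node into two children, each holding at least
    # min_samples_leaf samples, requires >= 2 * min_samples_leaf
    # samples in the parent node.
    return max(min_samples_split, 2 * min_samples_leaf)

print(effective_min_samples_split(2, 5))   # default split of 2 is raised to 10
```

This is what produced the "radically different" min_samples_split values I saw: the clamp rewrites a parameter the user never touched.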

Now this logic makes sense to me, as you can't make a split if there's no way to split it in such a manner that each leaf has the minimum number of samples. However, isn't this sort of logic encapsulated when we're building the tree? Namely, I'm thinking of the is_leaf conditional, which determines whether to split on a node or not (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L219):

is_leaf = ((depth >= max_depth) or
           (n_node_samples < min_samples_split) or
           (n_node_samples < 2 * min_samples_leaf) or
           (weighted_n_node_samples < min_weight_leaf))

I think the statement (n_node_samples < 2 * min_samples_leaf) has the same effect as setting min_samples_split explicitly to the max of itself or 2*min_samples_leaf. Indeed, removing the max call does not cause any tests to fail.
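To see why the clamp looks redundant, here's a toy model (my own sketch, not the real Cython builder) comparing the sample-count part of the quoted is_leaf test with and without the max call:

```python
# Toy model of the sample-count checks in the is_leaf test quoted above
# (a sketch for illustration, not the real Cython builder). Clamping
# min_samples_split to 2 * min_samples_leaf never changes the outcome,
# because n < max(a, b) is equivalent to (n < a) or (n < b).
def is_leaf_unclamped(n_node_samples, min_samples_split, min_samples_leaf):
    return (n_node_samples < min_samples_split or
            n_node_samples < 2 * min_samples_leaf)

def is_leaf_clamped(n_node_samples, min_samples_split, min_samples_leaf):
    return n_node_samples < max(min_samples_split, 2 * min_samples_leaf)

# Exhaustive check over small parameter grids: the two agree everywhere.
assert all(
    is_leaf_unclamped(n, s, l) == is_leaf_clamped(n, s, l)
    for n in range(50) for s in range(2, 20) for l in range(1, 10)
)
```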

Should we be removing the explicit call to max? It seems like a piece of duplicate logic, but maybe I'm missing something. ping @glouppe @raghavrv @jmschrei / anyone else.

@nelson-liu (Contributor, Author)

Additionally, there's a separate issue: is_leaf handles min_weight_leaf and min_samples_leaf differently (min_samples_leaf is multiplied by 2), which means that trees grown on equal weights with equivalent values of min_weight_leaf and min_samples_leaf will not be grown in the same manner. Any suggestions for fixing this? I'm not sure removing the 2 * min_samples_leaf factor and the max(min_samples_split, 2 * min_samples_leaf) clamp is technically sound, but doing so doesn't cause any tests to fail (at least on my local machine, unsure about CI yet) and gives min_weight_leaf and min_samples_leaf equal interpretations of the float parameter when the sample weights are uniform.

@jnothman (Member) commented Sep 4, 2016

Input from @glouppe would be most welcome.

@jnothman (Member) commented Sep 4, 2016

Doesn't setting is_leaf to True where weighted_n_node_samples < min_weight_leaf mean that the required minimum is not upheld?

@jnothman (Member) commented Sep 4, 2016

I suspect that check is redundant and never used, as long as the splitters correctly ensure that no split results in min_weight_leaf being violated.

@nelson-liu (Contributor, Author) commented Sep 4, 2016

Doesn't setting is_leaf to True where weighted_n_node_samples < min_weight_leaf mean that the required minimum is not upheld?

Indeed, the main issue is that given two trees built with only min_weight_fraction_leaf and min_samples_split set, the two trees grow differently because there are times when n_node_samples < 2 * min_samples_leaf is false but weighted_n_node_samples < min_weight_leaf is true. This is contrary to the idea that they should give equivalent results in certain cases.

@jnothman (Member) commented Sep 4, 2016

I would hope the condition there regarding min_weight_leaf is never executed. And I would think that the min_samples_leaf criterion there is mostly a shortcut: the splitter would give up if asked to split that node, wouldn't it? So unless I'm wrong in my presumptions, if there's actually a case where these differ, then there might be a bug elsewhere...?


@nelson-liu (Contributor, Author)

Well, this is the code I'm running:

est_weight = TreeEstimator(max_leaf_nodes=max_leaf_nodes,
                           min_weight_fraction_leaf=frac,
                           random_state=0)
est_sample = TreeEstimator(max_leaf_nodes=max_leaf_nodes,
                           min_samples_leaf=min_samples_leaf,
                           random_state=0)
est_weight.fit(X, y)
est_sample.fit(X, y)

In est_weight, there are definitely cases where n_node_samples < 2 * min_samples_leaf (min_samples_leaf is set to 2 by default) is false and weighted_n_node_samples < min_weight_leaf is true. This basically happens whenever min_weight_leaf is greater than 4.

In est_sample, min_weight_fraction_leaf = 0.0 (by default), so min_weight_leaf is 0 and the condition with min_weight_leaf is never triggered (since weighted_n_node_samples is always greater than 0).

Regardless, it shows that these two if statements do not have the same effect at all, which leads to (incorrect) growth of different trees.
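Plugging in concrete (hypothetical) numbers makes the divergence explicit:

```python
# Hypothetical numbers showing the two conditions disagreeing on the
# same node. With uniform unit weights, min_weight_leaf works out to
# min_weight_fraction_leaf * n_samples.
n_samples = 30
min_weight_fraction_leaf = 0.2
min_weight_leaf = min_weight_fraction_leaf * n_samples  # 6.0
min_samples_leaf = 2  # scikit-learn default, as in est_weight

# A node holding 5 uniformly weighted samples:
n_node_samples = 5
weighted_n_node_samples = 5.0

by_samples = n_node_samples < 2 * min_samples_leaf     # 5 < 4    -> False
by_weight = weighted_n_node_samples < min_weight_leaf  # 5.0 < 6.0 -> True
print(by_samples, by_weight)  # False True: only the weight check marks a leaf
```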

@nelson-liu (Contributor, Author)

Opened a PR at #7340 so you can see the changes I am proposing.

@jnothman (Member) commented Sep 4, 2016

In est_weight, there are definitely cases where n_node_samples < 2 * min_samples_leaf (min_samples_leaf is set to 2 by default) is false and weighted_n_node_samples < min_weight_leaf is true. This basically happens whenever min_weight_leaf is greater than 4.

But aren't these cases eliminated by the Splitter checks on min_weight_leaf? Is the min_weight_leaf criterion actually violated in the final tree?

@nelson-liu (Contributor, Author)

No, the min_weight_leaf criterion is not violated in the final tree. I'm not sure whether the cases are eliminated by the splitter checks, but supposedly the ones that are incorrectly marked as a leaf node (i.e. fulfill one of the conditions) do not get split further.

@jnothman (Member) commented Sep 4, 2016

Yes, but if one is marked as a leaf node that has weighted_n_node_samples < min_weight_leaf, then it will violate the min_weight_leaf condition... but if the splitter then rejects it, it won't be retained in the final tree. Of course this means that the min_weight_leaf case tries more splits than the min_samples_leaf case, which would create differences if there are RNGs involved.


@jnothman (Member) commented Sep 4, 2016

I will admit I'm not familiar enough with the code to know when splits are validated relative to this code.


@jnothman (Member) commented Sep 5, 2016

If things are working as I suspect, then this condition regarding min_samples_leaf is only an optimisation, allowing the splitter not to descend that path. However, by doing so, the random_state is not altered where it is in the min_weight_leaf path (since the splitter needs to descend farther), so we cannot expect the resultant trees to be identical.

@nelson-liu (Contributor, Author) commented Sep 5, 2016

Yeah, that makes sense. Still, I feel it is logical to expect them to be identical (in the sense that I think that would be the correct behavior): building two separate trees, one with a set min_weight_fraction_leaf and another with that same value given to min_samples_leaf, on uniformly weighted data, should have the same interpretation and produce the same output tree, shouldn't it?

@jnothman (Member) commented Sep 5, 2016

Their effect as regularisers is identical, but min_samples_split can be implemented more efficiently, and results in a different tree due to randomisation. I don't think that's a problem, but you're welcome to propose a note in the documentation. And yes, to get rid of that incorrect condition.

@nelson-liu (Contributor, Author)

Their effect as regularisers is identical, but min_samples_split can be implemented more efficiently, and results in a different tree due to randomisation.

Fair enough; I'll remove the test in #7301 that checks whether the two parameters have the same effect, since the trees differ due to randomization. Thanks for the clarity, @jnothman.

@jnothman (Member) commented Sep 5, 2016

The clarity came from collaboration. Thank you.

@amueller (Member) commented Sep 6, 2016

The random state is only used for tie-breaking. Are the results also different for data with no ties?
I didn't go through it in detail, but it sounds to me like the min_weight_leaf case stops too early. Can you show visualizations of the trees (if they are small enough)?

@nelson-liu (Contributor, Author) commented Sep 6, 2016

Here are visualizations of the following trees:

    # X and y are the "multilabel" test set in test_tree.py
    for max_leaf_nodes, frac in product([None], [.2]):
        # test that min_samples_leaf and min_weight_fraction_leaf
        # yield the same results when sample_weight is None.
        est_weight = ExtraTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                   min_weight_fraction_leaf=frac,
                                   random_state=0)
        est_sample = ExtraTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                   min_samples_leaf=frac,
                                   random_state=0)
        est_weight.fit(X, y)
        est_sample.fit(X, y)
        export_graphviz(est_weight, "weight.dot")
        export_graphviz(est_sample, "sample.dot")

        assert_tree_equal(est_weight.tree_, est_sample.tree_,
                          "{0} with equal values of min_samples_leaf "
                          "and min_weight_fraction_leaf "
                          "are unequal with parameter value "
                          "{1}".format(name, frac))

the output:

AssertionError:
Arrays are not equal
ExtraTreeRegressor with equal values of min_samples_leaf and min_weight_fraction_leaf are unequal with parameter value 0.2: inequal features
(mismatch 50.0%)
 x: array([2, 8])
 y: array([2, 9])

The trees:

sample: (tree visualization image)

weight: (tree visualization image)

@amueller (Member) commented Sep 6, 2016

Is that tie-breaking? That's not obvious to me.

@nelson-liu (Contributor, Author)

No, I don't think there are any ties...

@amueller (Member)

This question is still open, right?

@glouppe (Contributor) commented Sep 29, 2016

Closed in #7441

glouppe closed this as completed Sep 29, 2016