8000 Duplicate logic in Decision Tree Building · Issue #7338 · scikit-learn/scikit-learn · GitHub

Closed
nelson-liu opened this issue Sep 4, 2016 · 23 comments
@nelson-liu (Contributor) commented Sep 4, 2016

Description

I've been working on #7301, and I found some seemingly duplicate logic in the decision tree code. At the least it requires more documentation; at worst, it's a bug.

So I was poking around, trying to figure out why I couldn't get equal trees for uniform sample weights when I set min_weight_fraction_leaf and min_samples_leaf to the same float value. I saw that the two built trees actually had radically different values of min_samples_split, despite my not setting this value at all. In the code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L229), min_samples_split is set to max(min_samples_split, 2 * min_samples_leaf). This isn't documented at all, and should be if we decide to keep this statement.
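To make that concrete, the clamp is equivalent to this standalone sketch (the function name is mine for illustration, not from the scikit-learn source):

```python
# Toy restatement of the clamp at tree.py#L229 (my own sketch, not the
# actual scikit-learn source): min_samples_split is silently raised so
# that any split can leave min_samples_leaf samples in each child.
def effective_min_samples_split(min_samples_split, min_samples_leaf):
    # Splitting a node into two children, each holding at least
    # min_samples_leaf samples, requires >= 2 * min_samples_leaf
    # samples in the parent node.
    return max(min_samples_split, 2 * min_samples_leaf)

print(effective_min_samples_split(2, 5))   # default split of 2 is raised to 10
```

This is what produced the "radically different" min_samples_split values I saw: the clamp rewrites a parameter the user never touched.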

Now this logic makes sense to me, as you can't make a split if there's no way to split it in such a manner that each leaf has the minimum number of samples. However, isn't this sort of logic encapsulated when we're building the tree? Namely, I'm thinking of the is_leaf conditional, which determines whether to split on a node or not (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L219):

is_leaf = ((depth >= max_depth) or
           (n_node_samples < min_samples_split) or
           (n_node_samples < 2 * min_samples_leaf) or
           (weighted_n_node_samples < min_weight_leaf))

I think the statement (n_node_samples < 2 * min_samples_leaf) has the same effect as setting min_samples_split explicitly to the max of itself or 2*min_samples_leaf. Indeed, removing the max call does not cause any tests to fail.
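To see why the clamp looks redundant, here's a toy model (my own sketch, not the real Cython builder) comparing the sample-count part of the quoted is_leaf test with and without the max call:

```python
# Toy model of the sample-count checks in the is_leaf test quoted above
# (a sketch for illustration, not the real Cython builder). Clamping
# min_samples_split to 2 * min_samples_leaf never changes the outcome,
# because n < max(a, b) is equivalent to (n < a) or (n < b).
def is_leaf_unclamped(n_node_samples, min_samples_split, min_samples_leaf):
    return (n_node_samples < min_samples_split or
            n_node_samples < 2 * min_samples_leaf)

def is_leaf_clamped(n_node_samples, min_samples_split, min_samples_leaf):
    return n_node_samples < max(min_samples_split, 2 * min_samples_leaf)

# Exhaustive check over small parameter grids: the two agree everywhere.
assert all(
    is_leaf_unclamped(n, s, l) == is_leaf_clamped(n, s, l)
    for n in range(50) for s in range(2, 20) for l in range(1, 10)
)
```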

Should we be removing the explicit call to max? It seems like a piece of duplicate logic, but maybe I'm missing something. ping @glouppe @raghavrv @jmschrei / anyone else.

@nelson-liu (Contributor, Author)

Additionally, there's a separate issue: is_leaf handles min_weight_leaf and min_samples_leaf differently (min_samples_leaf is multiplied by 2), which means that trees grown on equal weights with equivalent values of min_weight_leaf and min_samples_leaf will not be grown in the same manner. Any suggestions for fixing this? I'm not sure removing the 2 * min_samples_leaf factor and the max(min_samples_split, 2 * min_samples_leaf) clamp is technically sound, but doing so doesn't cause any tests to fail (at least on my local machine, unsure about CI yet) and gives min_weight_leaf and min_samples_leaf equal interpretations of the float parameter when the sample weights are uniform.

@jnothman (Member) commented Sep 4, 2016

Input from @glouppe would be most welcome.

@jnothman (Member) commented Sep 4, 2016

Doesn't setting is_leaf to True where weighted_n_node_samples < min_weight_leaf mean that the required minimum is not upheld?

@jnothman (Member) commented Sep 4, 2016

I suspect that check is redundant and never used, as long as the splitters correctly ensure that no split results in min_weight_leaf being violated.

@nelson-liu (Contributor, Author) commented Sep 4, 2016

Doesn't setting is_leaf to True where weighted_n_node_samples < min_weight_leaf mean that the required minimum is not upheld?

Indeed, the main issue is that given two trees built with only min_weight_fraction_leaf and min_samples_split set, the two trees grow differently because there are times when n_node_samples < 2 * min_samples_leaf is false but weighted_n_node_samples < min_weight_leaf is true. This is contrary to the idea that they should give equivalent results in certain cases.

@jnothman (Member) commented Sep 4, 2016

I would hope the condition there regarding min_weight_leaf is never executed. And I would think that the min_samples_leaf criterion there is mostly a shortcut: the splitter would give up if asked to split that node, wouldn't it? So unless I'm wrong in my presumptions, if there's actually a case where these differ, then there might be a bug elsewhere...?


@nelson-liu (Contributor, Author)

Well, this is the code I'm running:

est_weight = TreeEstimator(max_leaf_nodes=max_leaf_nodes,
                           min_weight_fraction_leaf=frac,
                           random_state=0)
est_sample = TreeEstimator(max_leaf_nodes=max_leaf_nodes,
                           min_samples_leaf=min_samples_leaf,
                           random_state=0)
est_weight.fit(X, y)
est_sample.fit(X, y)

In est_weight, there are definitely cases where n_node_samples < 2 * min_samples_leaf (min_samples_leaf is set to 2 by default) is false and weighted_n_node_samples < min_weight_leaf is true. This basically happens whenever min_weight_leaf is greater than 4.

In est_sample, min_weight_fraction_leaf = 0.0 (by default), so min_weight_leaf is 0 and the condition with min_weight_leaf is never triggered (since weighted_n_node_samples is always greater than 0).

Regardless, it shows that these two if statements do not have the same effect at all, which leads to (incorrect) growth of different trees.
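Plugging in concrete (hypothetical) numbers makes the divergence explicit:

```python
# Hypothetical numbers showing the two conditions disagreeing on the
# same node. With uniform unit weights, min_weight_leaf works out to
# min_weight_fraction_leaf * n_samples.
n_samples = 30
min_weight_fraction_leaf = 0.2
min_weight_leaf = min_weight_fraction_leaf * n_samples  # 6.0
min_samples_leaf = 2  # scikit-learn default, as in est_weight

# A node holding 5 uniformly weighted samples:
n_node_samples = 5
weighted_n_node_samples = 5.0

by_samples = n_node_samples < 2 * min_samples_leaf     # 5 < 4    -> False
by_weight = weighted_n_node_samples < min_weight_leaf  # 5.0 < 6.0 -> True
print(by_samples, by_weight)  # False True: only the weight check marks a leaf
```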

@nelson-liu (Contributor, Author)

Opened a PR at #7340 so you can see the changes I am proposing.

@jnothman (Member) commented Sep 4, 2016

In est_weight, there are definitely cases where n_node_samples < 2 * min_samples_leaf (min_samples_leaf is set to 2 by default) is false and weighted_n_node_samples < min_weight_leaf is true. This basically happens whenever min_weight_leaf is greater than 4.

But aren't these cases eliminated by the Splitter checks on min_weight_leaf? Is the min_weight_leaf criterion actually violated in the final tree?

@nelson-liu (Contributor, Author)

No, the min_weight_leaf criterion is not violated in the final tree. I'm not sure whether the cases are eliminated by the splitter checks, but supposedly the ones that are incorrectly marked as a leaf node (i.e. fulfill one of the conditions) do not get split further.

@jnothman (Member) commented Sep 4, 2016

Yes, but if one is marked as a leaf node that has weighted_n_node_samples < min_weight_leaf, then it will violate the min_weight_leaf condition... but if the splitter then rejects it, it won't be retained in the final tree. Of course this means that the min_weight_leaf case tries more splits than the min_samples_leaf case, which would create differences if there are RNGs involved.


@jnothman (Member) commented Sep 4, 2016

I will admit I'm not familiar enough with the code to know when splits are validated relative to this code.


@jnothman (Member) commented Sep 5, 2016

If things are working as I suspect, then this condition regarding min_samples_leaf is only an optimisation, allowing the splitter not to descend that path. However, by doing so, the random_state is not altered where it is in the min_weight_leaf path (since the splitter needs to descend farther), so we cannot expect the resultant trees to be identical.

@nelson-liu (Contributor, Author) commented Sep 5, 2016

Yeah, that makes sense. Still, I feel it is logical to expect them to be identical (in the sense that I think that would be the correct behavior): building two separate trees, one with a set min_weight_fraction_leaf and another with that same value given to min_samples_leaf, on uniformly weighted data, should have the same interpretation and produce the same output tree, shouldn't it?

@jnothman (Member) commented Sep 5, 2016

Their effect as regularisers is identical, but min_samples_split can be implemented more efficiently, and results in a different tree due to randomisation. I don't think that's a problem, but you're welcome to propose a note in the documentation. And yes, to get rid of that incorrect condition.

@nelson-liu (Contributor, Author)

Their effect as regularisers is identical, but min_samples_split can be implemented more efficiently, and results in a different tree due to randomisation.

Fair enough; I'll remove the test in #7301 that checks whether the two parameters have the same effect, since the trees differ due to randomization. Thanks for the clarity, @jnothman.

@jnothman (Member) commented Sep 5, 2016

The clarity came from collaboration. Thank you.

@amueller (Member) commented Sep 6, 2016

The random state is only used for tie-breaking. Are the results also different for data with no ties?
I didn't go through it in detail, but it sounds to me like the min_weight_leaf case stops too early. Can you show visualizations of the trees (if they are small enough)?

@nelson-liu (Contributor, Author) commented Sep 6, 2016

Here are visualizations of the following trees:

    # X and y are the "multilabel" test set in test_tree.py
    for max_leaf_nodes, frac in product([None], [.2]):
        # test that min_samples_leaf and min_weight_fraction_leaf
        # yield the same results when sample_weight is None.
        est_weight = ExtraTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                   min_weight_fraction_leaf=frac,
                                   random_state=0)
        est_sample = ExtraTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                   min_samples_leaf=frac,
                                   random_state=0)
        est_weight.fit(X, y)
        est_sample.fit(X, y)
        export_graphviz(est_weight, "weight.dot")
        export_graphviz(est_sample, "sample.dot")

        assert_tree_equal(est_weight.tree_, est_sample.tree_,
                          "{0} with equal values of min_samples_leaf "
                          "and min_weight_fraction_leaf "
                          "are unequal with parameter value "
                          "{1}".format(name, frac))

the output:

AssertionError:
Arrays are not equal
ExtraTreeRegressor with equal values of min_samples_leaf and min_weight_fraction_leaf are unequal with parameter value 0.2: inequal features
(mismatch 50.0%)
 x: array([2, 8])
 y: array([2, 9])

The trees:

sample: (tree visualization image)

weight: (tree visualization image)

@amueller (Member) commented Sep 6, 2016

Is that tie-breaking? That's not obvious to me.

@nelson-liu (Contributor, Author)

No, I don't think there are any ties...

@amueller (Member)

This question is still open, right?

@glouppe (Contributor) commented Sep 29, 2016

Closed in #7441

glouppe closed this as completed Sep 29, 2016