[MRG+1] correct condition in decision tree construction by nelson-liu · Pull Request #7441 · scikit-learn/scikit-learn · GitHub

[MRG+1] correct condition in decision tree construction #7441


Merged
merged 4 commits into from
Sep 29, 2016

Conversation

nelson-liu
Contributor
@nelson-liu nelson-liu commented Sep 15, 2016

Reference Issue

Continuation of #7340

What does this implement/fix? Explain your changes.

Removes an unused leaf condition in _tree.pyx.

Any other comments?

ping @jnothman, who I previously discussed this with


This change is Reviewable

@nelson-liu nelson-liu changed the title remove unused condition in decision tree construction [MRG] remove unused condition in decision tree construction Sep 15, 2016
@jnothman
Member

While I think these conditions are logically inappropriate, just in case I'm wrong, have you confirmed that they are never executed at least by the test suite?

@nelson-liu
Contributor Author

Just verified: in the original code, weighted_n_node_samples < min_weight_leaf is never True after the is_leaf conditional changed here.

@jnothman
Member

I mean with something like printing a message if the condition is ever true here.


@nelson-liu
Contributor Author

yes, that's what i did (and nothing was printed). sorry for being unclear.

@jnothman
Member

Great, thanks. LGTM


@jnothman jnothman changed the title [MRG] remove unused condition in decision tree construction [MRG+1] remove unused condition in decision tree construction Sep 15, 2016
@amueller
Member

Would be great to have @glouppe or @jmschrei look at this.

@glouppe
Contributor
glouppe commented Sep 16, 2016

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

@jmschrei
Member

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important? Can you create by hand a dataset where you would expect that the weight would cause a difference in the splits and confirm?

@nelson-liu
Contributor Author

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important?

Yes, that's very possible. Are said "cases where this might be important" just when you have a dataset with sample_weight such that sample_weight.sum() < min_weight_leaf? Or is there something else that I am missing?

@jmschrei
Member

Yeah, but sample_weight for the samples in the current node, not the entire dataset. I haven't read through the test cases recently, but do we have a test where we should make a split on weighted data (not unweighted data), but don't because of min_weight_leaf?

@jnothman
Member

What does it mean logically to say that something should be a leaf if its weight is so small that it violates the min_weight_leaf condition?

@jnothman
Member

@glouppe:

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

This shouldn't be possible through the public interface, which sets min_weight_fraction_leaf, not min_weight_leaf, and requires the former to be at most .5 * sample_weight.sum().

(If this isn't a valid change to make, then surely changing < to <= is required in these conditions.)
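The mapping jnothman describes from the public parameter to the internal threshold can be sketched as follows (a hypothetical illustration: the function name is invented here, and the real computation lives inside the tree builder):

```python
# Hypothetical sketch of how the public min_weight_fraction_leaf maps to the
# internal min_weight_leaf discussed in this thread (names follow the
# conversation, not necessarily the actual code).
def internal_min_weight_leaf(min_weight_fraction_leaf, sample_weight):
    # The public interface requires the fraction to be at most 0.5.
    assert 0.0 <= min_weight_fraction_leaf <= 0.5
    return min_weight_fraction_leaf * sum(sample_weight)

weights = [0.05] * 5
threshold = internal_min_weight_leaf(0.5, weights)
# Even at the maximum allowed fraction, the threshold is half the total
# weight, so sample_weight.sum() < min_weight_leaf can never hold at the root.
assert sum(weights) >= threshold
```

This is why the scenario glouppe raises cannot arise through the public interface: the derived threshold is bounded by half the total sample weight.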

@raghavrv
Member
raghavrv commented Sep 21, 2016

IIRC Nothing would break even if you remove the whole set of conditions as the splitter will check it again.

The reason why this check is given here is to check and avoid entering the splitter at all if we know it is a leaf for sure. And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf...

Just verified: in the original code, weighted_n_node_samples < min_weight_leaf is never True after the is_leaf conditional changed here.

Where does "here" refer to? How did you verify?

The flow, IIUC, is as follows:

Assume min_weight_leaf=0.1 and min_samples_leaf=1.
Assume a particular node's sample weights are [0.05, 0.05, 0.05, 0.05, 0.05].
The weighted_n_node_samples for that node is then 0.25.

  • Let that node be the one currently popped from the stack.
  • Since the data can be split into two parts with weights [0.05, 0.05, 0.05] on the right and [0.05, 0.05] on the left, the node is allowed to enter the splitter.
  • Assume the splitter splits it that way. The splitter tells us it is not a leaf and gives us the two partitions.
  • The right/left partitions are pushed onto the stack.
  • The left is popped. Since the cumulative weight of the left partition is now 0.1, there is no way to split it further without violating min_weight_leaf=0.1.
  • Hence the condition you just removed (if corrected as mentioned above) marks it as a leaf before it enters the splitter, saving some computational time.
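The walkthrough above can be checked numerically (a minimal sketch; min_weight_leaf and the split are taken directly from the example):

```python
min_weight_leaf = 0.1
node = [0.05] * 5          # weighted_n_node_samples == 0.25

# Root node: both sides of the proposed split satisfy min_weight_leaf,
# so the node may enter the splitter.
left, right = node[:2], node[2:]
assert sum(left) >= min_weight_leaf and sum(right) >= min_weight_leaf

# Popped left child: its total weight is below 2 * min_weight_leaf, so any
# further split must push one side under min_weight_leaf; the corrected
# condition marks it as a leaf without entering the splitter.
assert sum(left) < 2 * min_weight_leaf
# The root itself clears the corrected threshold, so it was allowed in.
assert sum(node) >= 2 * min_weight_leaf
```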

@jnothman
Member

And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf

I had originally thought that condition inappropriate (@nelson-liu had suggested it), but now I think you might be right. If that condition is true, there is no way to split the points at that node such that both splits satisfy min_weight_leaf, is there?

Yes, I now think this is appropriate as a fast path. However, this will change the fitted trees, so we should note that in what's new.

Thanks.
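The claim that no valid split exists below 2 * min_weight_leaf can be verified by brute force over every two-way partition (a self-contained sketch; `splittable` is a name invented here, and it ignores that real tree splits are threshold-based on a feature, which only shrinks the set of candidate partitions):

```python
from itertools import combinations

def splittable(weights, min_weight_leaf):
    """Does any two-way partition leave both sides with weight >= min_weight_leaf?"""
    total = sum(weights)
    for r in range(1, len(weights)):
        for left in combinations(weights, r):
            if sum(left) >= min_weight_leaf and total - sum(left) >= min_weight_leaf:
                return True
    return False

# Total weight below 2 * min_weight_leaf: no valid split exists.
assert not splittable([0.05, 0.05], min_weight_leaf=0.1)
# Total weight at or above 2 * min_weight_leaf: a valid split may exist.
assert splittable([0.05] * 5, min_weight_leaf=0.1)
```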

@jnothman jnothman changed the title [MRG+1] remove unused condition in decision tree construction [MRG] remove unused condition in decision tree construction Sep 22, 2016
@raghavrv
Member

No, it won't change the fitted trees: before, the splitter marked it as a leaf; now, with the corrected condition, it will not enter the splitter at all. And yes, with the corrected condition in place there is no way to split it such that min_weight_leaf is not violated...

@jnothman
Member

Yes, that will change the trees due to random_state.


@raghavrv
Member

Wow. Good catch! I totally missed that. Indeed the tree will change!

@nelson-liu
Contributor Author

following up with this; should we change the condition to weighted_n_node_samples < 2 * min_weight_leaf as suggested by @raghavrv ?

@jnothman
Member

I think so. (Sorry for changing my mind!)


@nelson-liu
Contributor Author

no need to be sorry, I'm glad we were able to come to a better solution. pushed the new change.

@@ -218,8 +218,8 @@ cdef class DepthFirstTreeBuilder(TreeBuilder):

is_leaf = ((depth >= max_depth) or
Member

Do you mind removing all the extraneous parentheses too? :/

Contributor Author

yeah, sorry about that.
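For reference, the corrected leaf check can be rendered in Python roughly as follows (a sketch of the Cython logic in `_tree.pyx`; the exact set of clauses is assumed from the diff and the discussion, not copied from the merged code):

```python
def is_leaf(depth, n_node_samples, weighted_n_node_samples,
            max_depth, min_samples_split, min_samples_leaf, min_weight_leaf):
    # A node is declared a leaf before entering the splitter when no
    # admissible split can exist.
    return (depth >= max_depth or
            n_node_samples < min_samples_split or
            n_node_samples < 2 * min_samples_leaf or
            # corrected clause: was `weighted_n_node_samples < min_weight_leaf`
            weighted_n_node_samples < 2 * min_weight_leaf)

# The left child from the walkthrough above (weight 0.10) is now caught early:
assert is_leaf(depth=2, n_node_samples=2, weighted_n_node_samples=0.10,
               max_depth=10, min_samples_split=2, min_samples_leaf=1,
               min_weight_leaf=0.1)
```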

@jnothman
Member

please change PR title

@jnothman
Member

actually, I will, seeing as I'm adding a +1. LGTM

@jnothman jnothman changed the title [MRG] remove unused condition in decision tree construction [MRG+1] remove unused condition in decision tree construction Sep 28, 2016
@jnothman jnothman changed the title [MRG+1] remove unused condition in decision tree construction [MRG+1] correct condition in decision tree construction Sep 28, 2016
@jnothman
Member

What I'd like to see is a what's new entry that says it's more efficient, but that trees using the parameter will change.

@nelson-liu nelson-liu force-pushed the remove_wrong_tree_condition branch from 640d413 to 35ed851 Compare September 28, 2016 23:44
@nelson-liu
Contributor Author

What I'd like to see is a what's new entry that says it's more efficient, but that trees using the parameter will change.

@jnothman thanks for changing the title for me. I added a what's new entry and removed the extra parens.

@raghavrv
Member

Thanks for the change. This looks good to me. I think it would be better if @glouppe gave his final +1 and merge :)

@glouppe
Contributor
glouppe commented Sep 29, 2016

LGTM too! Merging

@glouppe glouppe merged commit 2f4b661 into scikit-learn:master Sep 29, 2016
- Edited criterion for leaf nodes in decision tree criterion by declaring a
node as a leaf if the weighted number of samples at the node is less than
2 * the minimum weight specified to be at a node. This makes growth more
efficient, but trees using parameters that modify the weight at each leaf
Member

I think this is a confusing way of stating it. Will fix it up in master.

Contributor Author

thanks @jnothman


TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new