[MRG+1] correct condition in decision tree construction by nelson-liu · Pull Request #7441 · scikit-learn/scikit-learn · GitHub

[MRG+1] correct condition in decision tree construction #7441


Merged
merged 4 commits into from
Sep 29, 2016

Conversation

nelson-liu
Contributor
@nelson-liu nelson-liu commented Sep 15, 2016

Reference Issue

Continuation of #7340

What does this implement/fix? Explain your changes.

Removes an unused leaf condition in _tree.pyx.

Any other comments?

ping @jnothman, who I previously discussed this with


This change is Reviewable

@nelson-liu nelson-liu changed the title remove unused condition in decision tree construction [MRG] remove unused condition in decision tree construction Sep 15, 2016
@jnothman
Member

While I think these conditions are logically inappropriate, just in case I'm wrong, have you confirmed that they are never executed at least by the test suite?

@nelson-liu
Contributor Author

Just verified: in the original code, weighted_n_node_samples < min_weight_leaf is never True after the is_leaf conditional changed here.

@jnothman
Member

I mean with something like printing a message if the condition is ever true here.


@nelson-liu
Contributor Author

yes, that's what i did (and nothing was printed). sorry for being unclear.

@jnothman
Member

Great, thanks. LGTM


@jnothman jnothman changed the title [MRG] remove unused condition in decision tree construction [MRG+1] remove unused condition in decision tree construction Sep 15, 2016
@amueller
Member

Would be great to have @glouppe or @jmschrei look at this.

@glouppe
Contributor
glouppe commented Sep 16, 2016

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

@jmschrei
Member

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important? Can you create by hand a dataset where you would expect that the weight would cause a difference in the splits and confirm?

@nelson-liu
Contributor Author

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important?

Yes, that's very possible. Are said "cases where this might be important" just when you have a dataset with sample_weight such that sample_weight.sum() < min_weight_leaf? Or is there something else that I am missing?

@jmschrei
Member

Yeah, but sample_weight for the samples in the current node, not the entire dataset. I haven't read through the test cases recently, but do we have a test where we should make a split on weighted data (not unweighted data), but don't because of min_weight_leaf?

@jnothman
Member

What does it mean logically to say that something should be a leaf if its weight is so small that it violates the min_weight_leaf condition?

@jnothman
Member

@glouppe:

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

This shouldn't be possible through the public interface, which sets min_weight_fraction_leaf, not min_weight_leaf, and requires the former to be at most .5 * sample_weight.sum().

(If this isn't a valid change to make, then surely changing < to <= is required in these conditions.)
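The mapping jnothman describes from the public parameter to the internal threshold can be sketched as follows (a hypothetical illustration: the function name is invented here, and the real computation lives inside the tree builder):

```python
# Hypothetical sketch of how the public min_weight_fraction_leaf maps to the
# internal min_weight_leaf discussed in this thread (names follow the
# conversation, not necessarily the actual code).
def internal_min_weight_leaf(min_weight_fraction_leaf, sample_weight):
    # The public interface requires the fraction to be at most 0.5.
    assert 0.0 <= min_weight_fraction_leaf <= 0.5
    return min_weight_fraction_leaf * sum(sample_weight)

weights = [0.05] * 5
threshold = internal_min_weight_leaf(0.5, weights)
# Even at the maximum allowed fraction, the threshold is half the total
# weight, so sample_weight.sum() < min_weight_leaf can never hold at the root.
assert sum(weights) >= threshold
```

This is why the scenario glouppe raises cannot arise through the public interface: the derived threshold is bounded by half the total sample weight.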

@raghavrv
Member
raghavrv commented Sep 21, 2016

IIRC Nothing would break even if you remove the whole set of conditions as the splitter will check it again.

The reason why this check is given here is to check and avoid entering the splitter at all if we know it is a leaf for sure. And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf...

Just verified: in the original code, weighted_n_node_samples < min_weight_leaf is never True after the is_leaf conditional changed here.

Where does "here" refer to? How did you verify?

The flow, IIUC, is as follows:

Assume min_weight_leaf=0.1 and min_samples_leaf=1.
Assume a particular node's sample weights are [0.05, 0.05, 0.05, 0.05, 0.05].
The weighted_n_node_samples for that node is then 0.25.

  • Let that node be the one currently popped from the stack.
  • Since the data can be split into two parts with weights [0.05, 0.05, 0.05] on the right and [0.05, 0.05] on the left, the node is allowed to enter the splitter.
  • Assume the splitter splits it that way. The splitter tells us it is not a leaf and gives us the two partitions.
  • The right/left partitions are pushed onto the stack.
  • The left is popped. Since the cumulative weight of the left partition is now 0.1, there is no way to split it further without violating min_weight_leaf=0.1.
  • Hence the condition you just removed (if corrected as mentioned above) marks it as a leaf before it enters the splitter, saving some computational time.
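The walkthrough above can be checked numerically (a minimal sketch; min_weight_leaf and the split are taken directly from the example):

```python
min_weight_leaf = 0.1
node = [0.05] * 5          # weighted_n_node_samples == 0.25

# Root node: both sides of the proposed split satisfy min_weight_leaf,
# so the node may enter the splitter.
left, right = node[:2], node[2:]
assert sum(left) >= min_weight_leaf and sum(right) >= min_weight_leaf

# Popped left child: its total weight is below 2 * min_weight_leaf, so any
# further split must push one side under min_weight_leaf; the corrected
# condition marks it as a leaf without entering the splitter.
assert sum(left) < 2 * min_weight_leaf
# The root itself clears the corrected threshold, so it was allowed in.
assert sum(node) >= 2 * min_weight_leaf
```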

@jnothman
Member

And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf

I had originally thought that condition inappropriate (@nelson-liu had suggested it), but now I think you might be right. If that condition is true, there is no way to split the points at that node such that both splits satisfy min_weight_leaf, is there?

Yes, I now think this is appropriate as a fast path. However, this will change the fitted trees, so we should note that in what's new.

Thanks.
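The claim that no valid split exists below 2 * min_weight_leaf can be verified by brute force over every two-way partition (a self-contained sketch; `splittable` is a name invented here, and it ignores that real tree splits are threshold-based on a feature, which only shrinks the set of candidate partitions):

```python
from itertools import combinations

def splittable(weights, min_weight_leaf):
    """Does any two-way partition leave both sides with weight >= min_weight_leaf?"""
    total = sum(weights)
    for r in range(1, len(weights)):
        for left in combinations(weights, r):
            if sum(left) >= min_weight_leaf and total - sum(left) >= min_weight_leaf:
                return True
    return False

# Total weight below 2 * min_weight_leaf: no valid split exists.
assert not splittable([0.05, 0.05], min_weight_leaf=0.1)
# Total weight at or above 2 * min_weight_leaf: a valid split may exist.
assert splittable([0.05] * 5, min_weight_leaf=0.1)
```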

@jnothman jnothman changed the title [MRG+1] remove unused condition in decision tree construction [MRG] remove unused condition in decision tree construction Sep 22, 2016
@raghavrv
Member

No, it won't change the fitted trees: before, the splitter marked it as a leaf; now, with the corrected condition, it will not enter the splitter at all. And yes, with the corrected condition in place there is no way to split it such that min_weight_leaf is not violated...

@jnothman
Member

Yes, that will change the trees due to random_state.


@raghavrv
Member

Wow. Good catch! I totally missed that. Indeed the tree will change!

@nelson-liu
Contributor Author

following up with this; should we change the condition to weighted_n_node_samples < 2 * min_weight_leaf as suggested by @raghavrv ?

@jnothman
Member

I think so. (Sorry for changing my mind!)


@nelson-liu
Contributor Author

no need to be sorry, I'm glad we were able to come to a better solution. pushed the new change.

@@ -218,8 +218,8 @@ cdef class DepthFirstTreeBuilder(TreeBuilder):

is_leaf = ((depth >= max_depth) or
Member

Do you mind removing all the extraneous parentheses too? :/

Contributor Author

yeah, sorry about that.
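For reference, the corrected leaf check can be rendered in Python roughly as follows (a sketch of the Cython logic in `_tree.pyx`; the exact set of clauses is assumed from the diff and the discussion, not copied from the merged code):

```python
def is_leaf(depth, n_node_samples, weighted_n_node_samples,
            max_depth, min_samples_split, min_samples_leaf, min_weight_leaf):
    # A node is declared a leaf before entering the splitter when no
    # admissible split can exist.
    return (depth >= max_depth or
            n_node_samples < min_samples_split or
            n_node_samples < 2 * min_samples_leaf or
            # corrected clause: was `weighted_n_node_samples < min_weight_leaf`
            weighted_n_node_samples < 2 * min_weight_leaf)

# The left child from the walkthrough above (weight 0.10) is now caught early:
assert is_leaf(depth=2, n_node_samples=2, weighted_n_node_samples=0.10,
               max_depth=10, min_samples_split=2, min_samples_leaf=1,
               min_weight_leaf=0.1)
```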

@jnothman
Member

please change PR title

@jnothman
Member

actually, I will, seeing as I'm adding a +1. LGTM

@jnothman jnothman changed the title [MRG] remove unused condition in decision tree construction [MRG+1] remove unused condition in decision tree construction Sep 28, 2016
@jnothman jnothman changed the title [MRG+1] remove unused condition in decision tree construction [MRG+1] correct condition in decision tree construction Sep 28, 2016
@jnothman
Member

What I'd like to see is a what's new entry that says it's more efficient, but that trees using the parameter will change.

@nelson-liu nelson-liu force-pushed the remove_wrong_tree_condition branch from 640d413 to 35ed851 Compare September 28, 2016 23:44
@nelson-liu
Contributor Author

What I'd like to see is a what's new entry that says it's more efficient, but that trees using the parameter will change.

@jnothman thanks for changing the title for me. I added a what's new entry and removed the extra parens.

@raghavrv
Member

Thanks for the change. This looks good to me. I think it would be better if @glouppe gave his final +1 and merge :)

@glouppe
Contributor
glouppe commented Sep 29, 2016

LGTM too! Merging

@glouppe glouppe merged commit 2f4b661 into scikit-learn:master Sep 29, 2016
- Edited criterion for leaf nodes in decision tree criterion by declaring a
node as a leaf if the weighted number of samples at the node is less than
2 * the minimum weight specified to be at a node. This makes growth more
efficient, but trees using parameters that modify the weight at each leaf
Member

I think this is a confusing way of stating it. Will fix it up in master.

Contributor Author

thanks @jnothman


TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new