Consider reversing deprecation of min_samples_leaf #11976

Closed
jph00 opened this issue Sep 2, 2018 · 10 comments
Comments

jph00 commented Sep 2, 2018

Current RC proposes removing min_samples_leaf. In my experience across numerous projects and competitions, this param is very useful for generalization. Conceptually, I've often seen extreme outliers like this one (the top-left point) end up isolated in their own single-point leaf, which clearly isn't what you want. Larger leaves have less variance and therefore generalize better.
[figure: scatter plot with an extreme outlier at the top left]
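One quick way to probe that claim (a minimal sketch with assumed synthetic data, not from this issue) is to sweep min_samples_leaf and compare cross-validated scores:

    # Sketch (assumed synthetic data): does a larger min_samples_leaf
    # improve cross-validated R^2 for a random forest regressor?
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=2000, n_features=20, noise=20.0,
                           random_state=0)

    for msl in (1, 3, 10, 30):
        rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=msl,
                                   random_state=0)
        print(msl, f"{cross_val_score(rf, X, y, cv=5).mean():.3f}")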

This param is also one way to reduce tree size:
[figure: tree size shrinking as min_samples_leaf increases]
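As a rough illustration (again a sketch on assumed synthetic data), node counts shrink as min_samples_leaf grows:

    # Sketch (assumed synthetic data): tree size vs. min_samples_leaf.
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                           random_state=0)

    for msl in (1, 5, 25, 100):
        tree = DecisionTreeRegressor(min_samples_leaf=msl,
                                     random_state=0).fit(X, y)
        print(msl, tree.tree_.node_count)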

@amueller asked for this issue to be filed in this Twitter thread: https://twitter.com/jeremyphoward/status/1036110165236363266

parrt commented Sep 2, 2018

I concur that this parameter is useful. I have found it helps with overfitting and also seems to speed up training, which I rely on during model development. I thought the speed-up was due to fewer nodes, but the Twitter thread makes me wonder whether I know exactly how that parameter is used in scikit-learn's RF implementation.
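A rough way to check the training speed-up (a sketch with assumed synthetic data, not from the thread):

    # Sketch (assumed synthetic data): fit time vs. min_samples_leaf.
    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

    for msl in (1, 10, 50):
        rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=msl,
                                    random_state=0, n_jobs=-1)
        start = time.perf_counter()
        rf.fit(X, y)
        print(msl, f"{time.perf_counter() - start:.2f}s")

Presumably the trees end up shallower under the constraint, which is where the speed-up would come from.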

jnothman (Member) commented Sep 2, 2018 via email

jnothman (Member) commented Sep 3, 2018

"Larger leaves have less variance" does seem to be a valuable notion for regression. Less so for classification, but maybe it can indeed help there too??

We should be thinking of this as min_samples_branch rather than min_samples_leaf, and perhaps we should reframe the documentation accordingly:

min_samples_leaf : int
    A split point at any depth will only be considered if it leaves at least
    `min_samples_leaf` training samples in each of the left and right branches.
    This may have the effect of smoothing the model, especially in regression.

We could even rename it, but that might be unnecessarily disruptive to users.
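The proposed wording is easy to check empirically (a minimal sketch on a toy dataset, not from this issue): every leaf of a fitted tree holds at least min_samples_leaf samples.

    # Sketch (toy data): confirm no fitted leaf has fewer than
    # min_samples_leaf samples.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.arange(10).reshape(-1, 1)
    y = np.array([0] * 9 + [1])          # one "outlier" label

    clf = DecisionTreeClassifier(min_samples_leaf=3).fit(X, y)

    tree = clf.tree_
    is_leaf = tree.children_left == -1
    print(tree.n_node_samples[is_leaf])  # every entry is >= 3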

parrt commented Sep 3, 2018

@jnothman That really cleared things up for me, thanks! I thought that's what min_samples_leaf did, i.e., don't split if splitting would create a leaf with fewer samples than the hyperparameter value. What did min_samples_leaf do in the code before? The doc led me to believe it was doing what I tell students, a simple check before splitting:

The minimum number of samples required to be at a leaf node: If int, then consider min_samples_leaf as the minimum number.

I'd vote for keeping the same name, changing the doc, and making it do as you suggest. Can you summarize how it's actually used now? Thanks! Sorry, but I haven't looked up this bit in the code.

Yes, for regression, preventing n=1 leaves most definitely helps avoid outlier leaves. I don't have enough experience to say for classification, but I'd argue an outlier class lurking amongst a similar group of feature vectors shouldn't be isolated and taken as valid. That seems like overfitting to me.

parrt commented Sep 3, 2018

Come to think of it, you're right that min_samples_split is almost what I'd want: it does the "don't split if smaller than this number" check. On the other hand, gating splits by the minimum size of the current node gives less control, because it considers only the size of the current node, not the purity of the potential children. min_samples_leaf prevents a split of the current node if a child created based on purity would be too small. Hmm... I guess we need an example.

OK, imagine the current node has 10 samples and min_samples_split=10, so we are okay to split. But what if all but one of the predicted values were the same, and the remaining one is an outlier? Splitting would create a node with the single outlier sample and another node with 9 samples.

Now imagine the same scenario but with min_samples_leaf=3. Either we prevent the split because it would create an outlier child node, or we split the 10-sample node but borrow samples to create a node with three samples and a node with seven. I'm not sure of the proper way to borrow, so maybe the easiest thing is to prevent splitting of the 10-sample node, since the normal process would create a leaf that is too small.

That sounds like the simplest thing and is exactly what @jnothman has proposed in his doc. heh, cool!
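That scenario is easy to put into code (a sketch with assumed toy data, not from the thread): nine identical targets plus one outlier, under each parameter.

    # Sketch (toy data): parrt's 10-sample scenario, nine identical targets
    # plus one outlier, under min_samples_split vs. min_samples_leaf.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.arange(10, dtype=float).reshape(-1, 1)
    y = np.array([5.0] * 9 + [100.0])    # one extreme outlier

    # min_samples_split=10 lets the root split, isolating the outlier
    # in a single-sample leaf.
    by_split = DecisionTreeRegressor(min_samples_split=10).fit(X, y)
    # min_samples_leaf=3 rejects any split leaving fewer than 3 samples
    # in a child, so the outlier stays grouped with its neighbors.
    by_leaf = DecisionTreeRegressor(min_samples_leaf=3).fit(X, y)

    for tree in (by_split, by_leaf):
        leaves = tree.tree_.children_left == -1
        print(sorted(tree.tree_.n_node_samples[leaves]))

The first tree should report leaf sizes [1, 9], the second [3, 7]: with the leaf constraint, the split that isolates the outlier is never considered, which matches the stopping behavior described above.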

jnothman (Member) commented Sep 3, 2018 via email

parrt commented Sep 3, 2018

Interesting. I wonder how it would handle the 10-sample node case where all samples have the same predicted value except for one. If min_samples_leaf=3, it doesn't seem like it would find a best split that fits the constraint. I guess in that case it would in fact act as a stopping condition and not split the current 10-sample node. (edit: actually, that depends on what features we have... oops)

jph00 (Author) commented Sep 4, 2018 via email

jnothman (Member) commented Sep 4, 2018 via email

amueller (Member) commented Sep 7, 2018

I think we should undo the deprecation until we have hard evidence for all cases.

rth closed this as completed in #11998 on Sep 8, 2018