Consider reversing deprecation of min_samples_leaf #11976
Comments
I concur that this parameter is useful. I have found it helps with overfitting and also seems to speed up training, which I rely on during model development. I thought the speed-up was due to fewer nodes, but the Twitter thread makes me wonder whether I know exactly how that parameter is used in scikit-learn's RF implementation. |
It's great that we are getting this feedback during RC.
The problem here was that core developers as well as users expected these
parameters to do something they didn't.
I'm still not certain from your illustration whether the same, if not
better, can be achieved with min_samples_split. After all, if
min_samples_leaf is limiting the depth, this would imply there are
insufficient instances to create a larger leaf at those points. The problem
with min_samples_leaf is that it doesn't necessarily stop when it discovers
that the best split at a point would create an invalid leaf. If
min_samples_split is sufficient, then we would benefit from having fewer
parameters.
But if you can show that min_samples_split does not suffice, then reverting
deprecation and improving docs would be good.
|
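A minimal sketch of the question above (mine, not from the thread), using scikit-learn's public API: a 1-D regression set where the best split isolates one extreme outlier. `min_samples_split` only gates which nodes may be *considered* for splitting, so a permitted split can still produce a 1-sample child; `min_samples_leaf` constrains the children of every split directly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = X.ravel() + rng.normal(scale=0.01, size=40)
y[0] = 10.0  # one extreme outlier at the left edge

def leaf_sizes(tree):
    t = tree.tree_
    return t.n_node_samples[t.children_left == -1]

# Nodes with >= 10 samples may split, but a child may end up with 1 sample.
split_only = DecisionTreeRegressor(min_samples_split=10,
                                   random_state=0).fit(X, y)
# Every split must leave >= 3 samples on each side.
leaf_constrained = DecisionTreeRegressor(min_samples_leaf=3,
                                         random_state=0).fit(X, y)

print(leaf_sizes(split_only).min())       # typically 1 on this data
print(leaf_sizes(leaf_constrained).min())  # >= 3 by construction
```

If `min_samples_split` sufficed, the first tree would not grow a single-sample leaf here; it does, because isolating the outlier is the best split at the root.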
"Larger leaves have less variance" does seem to be a valuable notion for regression; less so for classification, but maybe it can indeed help there too. We should be thinking of it as:

    min_samples_leaf : int
        A split point at any depth will only be considered if it leaves at
        least `min_samples_leaf` training samples in each of the left and
        right branches. This may have the effect of smoothing the model,
        especially in regression.

We could even rename it, but that might be unnecessarily disruptive to users. |
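To illustrate the "smoothing" sentence in that docstring, a toy regression example (my own, using scikit-learn's public API): a fully grown tree reproduces an outlier target exactly, while a `min_samples_leaf` constraint averages it with its neighbours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(4 * X.ravel()) + rng.normal(scale=0.05, size=50)
y[25] += 5.0  # inject one outlier target

default = DecisionTreeRegressor(random_state=0).fit(X, y)
smoothed = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

# The unconstrained tree memorizes the outlier; the constrained tree's
# leaf contains at least 5 samples, so the outlier is averaged away.
print(default.predict(X[25:26])[0])   # close to y[25] (~5.9)
print(smoothed.predict(X[25:26])[0])  # pulled back toward the sine curve
```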
@jnothman That really cleared things up for me, thanks! I thought that's what it already did.
I'd vote for keeping the same name, changing the doc, and making it do as you suggest. Can you summarize how it's actually used now? Thanks! Sorry, but I haven't looked up this bit in the code. Yes, for regression, preventing n=1 leaves most definitely helps avoid outlier leaves. I don't have enough experience to say for classification, but I'd say an outlier class lurking amongst a similar group of feature vectors shouldn't be isolated and taken as valid. That seems like overfitting to me. |
Come to think of it:

> Ok, imagine the current node has 10 samples and …

> Now imagine the same scenario but we have …

That sounds like the simplest thing and is exactly what @jnothman has proposed in his doc. heh, cool! |
Yes, with very many classes that may be the case.
Many users, and core developers, had expected that if the best split could
not adhere to this constraint, the node would be left unsplit (a stopping
criterion). Instead, the implementation keeps searching for the best split
that does satisfy the constraint.
|
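The difference between the two semantics can be sketched in plain Python (my own toy model of a single node, not scikit-learn's splitter): "stopping" refuses to split when the overall best split violates the constraint; "constraint" (the actual behavior) falls back to the best *valid* split.

```python
def sse(ys):
    """Sum of squared errors around the mean, the regression impurity."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(y, min_samples_leaf, semantics):
    """Split index i puts y[:i] left and y[i:] right (x assumed sorted)."""
    n = len(y)
    cands = sorted(range(1, n), key=lambda i: sse(y[:i]) + sse(y[i:]))
    best = cands[0]
    if semantics == "stopping":
        # what many expected: refuse to split if the best split is invalid
        return best if min(best, n - best) >= min_samples_leaf else None
    # what the code does: best split among those satisfying the constraint
    for i in cands:
        if min(i, n - i) >= min_samples_leaf:
            return i
    return None

# 10-sample node, one extreme outlier: the best split isolates it.
y = [10.0, 0.1, 0.2, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.2]
print(best_split(y, 3, "stopping"))    # None: best split violates the leaf size
print(best_split(y, 3, "constraint"))  # 3: best split that keeps 3 per side
```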
Interesting. I wonder how it would handle the 10-sample node case where all samples predicted the same value except for one. |
> Ok, imagine the current node has 10 samples and min_samples_split=10
> so we are okay splitting. But, what if all but one predicted value
> were the same; one is an outlier. Splitting would create a node with
> a single outlier sample/value and another node with 9 samples.

> Now imagine the same scenario but we have min_samples_leaf=3. Either
> we prevent splitting because it would create an outlier child node, or
> we split the 10-sample node but borrow samples to create a node with
> three samples and a node with seven samples. I'm not sure of the proper
> way to borrow, so maybe the easiest thing to do is prevent splitting of
> the 10-sample node because it would create a leaf that is too small,
> using the normal process.

Yes, this is the kind of issue to consider. Note that it's extremely
common at the end of the tree (which is ~50% of the splits, of course!).
Avoiding a tiny leaf here is important, since tiny leaves don't
generalize (both for classification and regression).
Also, note that all proximity-matrix / RF-kernel methods rely on having
reasonably sized leaves. min_samples_leaf is exactly what you want to
create these; min_samples_split doesn't give quite the right behavior here.
|
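For context, a minimal sketch of the proximity matrix mentioned above (mine, not from the thread), built from `RandomForestClassifier.apply`: proximity between two samples is the fraction of trees in which they share a leaf, so leaf sizes set by `min_samples_leaf` directly bound how many samples can ever be "proximate".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5,
                            random_state=0).fit(X, y)

leaves = rf.apply(X)  # shape (n_samples, n_trees): leaf index per tree
# proximity[i, j] = fraction of trees where samples i and j share a leaf
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)

print(prox.shape)   # (100, 100)
print(prox[0, 0])   # 1.0: a sample always shares a leaf with itself
```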
I think in doing the deprecation, we had under-considered the regression
and very-many-class cases.
@amueller, @glemaitre, your thoughts?
|
I think we should undo the deprecation until we have hard evidence for all cases. |
Current RC proposes removing min_samples_leaf. In my experience across numerous projects and competitions, this param is very useful for generalization. Conceptually, I've often seen an extreme outlier like this (top-left point) get carved off into a single-point leaf within a group, which clearly isn't what you want. Larger leaves have less variance and therefore generalize better.

This param is also one way to reduce tree size:

@amueller requested opening this issue in this Twitter thread: https://twitter.com/jeremyphoward/status/1036110165236363266
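A rough check of the tree-size claim (my own sketch, standard scikit-learn API): on pure-noise targets, a fully grown tree memorizes every sample, while `min_samples_leaf` caps the number of leaves at `n_samples / min_samples_leaf`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 3))
y = rng.normal(size=500)  # pure noise: a fully grown tree memorizes it

full = DecisionTreeRegressor(random_state=0).fit(X, y)
constrained = DecisionTreeRegressor(min_samples_leaf=10,
                                    random_state=0).fit(X, y)

# With leaves of >= 10 samples there can be at most 50 leaves (99 nodes),
# versus one leaf per sample for the unconstrained tree.
print(full.tree_.node_count, constrained.tree_.node_count)
```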