Understanding min_samples_leaf / min_samples_split #8399
Comments
Actually, I'm not sure it's fair to say that growing small regions is smoothing. It seems a bit like the opposite. In the above example, a tie was created, which is a bit weird on its own, but consider this example (default parameters left, We're basically just growing the smaller class and misclassifying two points... That seems really, really weird to me.
Yes, this is the intended behavior. Maybe the documentation can be improved if that was not clear.
@glouppe then my question is: why? That seems really strange. Do you have a reference? Consider my examples above.
The same applies to The example in #8399 (comment) is disconcerting to me.
The behaviour of
Actually, let me correct that. So I understand that the options are heuristics for reducing the number of splits to consider, which makes sense, particularly in the case of ensembles. But that's quite different from pruning a tree. Usually when building single trees you're interested in small trees.
Isn't "what you thought it does" more or less equivalent to setting
Nope, it's not possible to eliminate small leafs using
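I was trying to do a lecture on selecting regularization for single decision trees, and I found that we basically have none that works reasonably. The only one that does the right kind of pruning is `max_leaf_nodes`, but that's really hard to set. You basically need to create a full tree, then introspect the structure to see how many leafs it has, and then do a grid-search up to that count.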
Very interesting. Is what you seek, @amueller, that the builder should stop
considering a particular feature if its best split is invalid?
I'm quite envious of this course you're putting together given the rigour
with which it is finding issues in the package.
On 20 Feb 2017 6:53 am, "Andreas Mueller" ***@***.***> wrote:
I was trying to do a lecture on selecting regularization for single
decision trees, and I found that we basically have none that works
reasonably. The only one that does the right kind of pruning is
max_leaf_nodes but that's really hard to set. You basically need to
create a full tree, then introspect the structure to see how many leafs it
has, and then do a grid-search up to that count.
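The procedure described in the quoted message (grow a full tree, count its leaves, then grid-search `max_leaf_nodes` up to that count) might look roughly like the sketch below. The synthetic dataset, the 5-fold cross-validation, and the variable names are assumptions made for illustration, and `get_n_leaves()` assumes a reasonably recent scikit-learn release.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely illustrative (not from the thread).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Step 1: grow an unrestricted tree and count its leaves.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
n_leaves_full = full_tree.get_n_leaves()

# Step 2: grid-search max_leaf_nodes from 2 (the smallest allowed value)
# up to the leaf count of the full tree.
param_grid = {"max_leaf_nodes": list(range(2, n_leaves_full + 1))}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

best_n_leaves = search.best_params_["max_leaf_nodes"]
print("leaves in full tree:", n_leaves_full)
print("selected max_leaf_nodes:", best_n_leaves)
```

The upper bound of the grid depends on the data, which is why the full tree has to be built first; that extra step is the inconvenience the message refers to.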
What I'm seeking is to stop splitting if the best candidate doesn't fulfill the requirement. You can find the slides here if you're interested: https://amueller.github.io/applied_ml_spring_2017/lectures.html Unfortunately I don't have the time to write notes to my slides right now :-/
@jnothman there are also the most sloppy notebooks here: https://github.com/amueller/applied_ml_spring_2017/tree/master/slides
Yes, the documentation was a bit confusing, as I had the same expectation. I thought the rule was "if `n_samples <= min_samples_leaf`, don't split", but this would have the exact same behavior as "if `n_samples <= min_samples_split`, don't split the internal node", right?
Improvements to documentation are always welcome.
I'll check out the developer guide. Thanks!
No. What I wanted when I proposed this is "don't split if the resulting leafs are small". This is not something you can achieve with
FYI #11870 (about deprecating min_samples_leaf and min_weight_fraction_leaf) has been merged.
I think this is ok to close. We might want to think about adding the actual pruning options.
This is maybe for @arjoly, @jmschrei, @glouppe.
I'm trying to understand `min_samples_leaf` and `min_samples_split`. I thought these were pre-pruning options, but they are not. They are smoothing options. Is that clear from the docs and I'm just slow? Is there a reference for the current behavior?
What I expected was: "if the best split results in fewer than `min_samples_leaf` samples in the leaf, don't split". What it is instead is: "don't consider a split that leaves fewer than `min_samples_leaf` samples in the leaf".
These are very, very different, and that was kind of non-obvious to me (because I didn't really think about it before).
Example:
Basically, setting `min_samples_leaf` leads to exactly the same tree, only with the threshold moved so that there are enough samples in the leaf. Was that really the intent?
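The difference between the two rules can be sketched in plain Python for a single 1-D feature. This is a toy illustration, not scikit-learn's implementation: the function names, the Gini criterion on binary labels, and the data are all assumptions made for the example.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def candidate_splits(xs, ys):
    """Return (weighted_impurity, threshold, n_left, n_right) per candidate."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    out = []
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        out.append((impurity, threshold, len(left), len(right)))
    return out


def split_filter_candidates(xs, ys, min_samples_leaf):
    """Rule described as the current behaviour: drop invalid candidates first,
    then take the best of whatever remains."""
    valid = [c for c in candidate_splits(xs, ys)
             if c[2] >= min_samples_leaf and c[3] >= min_samples_leaf]
    return min(valid)[1] if valid else None


def split_check_best(xs, ys, min_samples_leaf):
    """Rule the issue author expected: find the best unconstrained split and
    refuse to split at all if it produces a leaf that is too small."""
    cands = candidate_splits(xs, ys)
    if not cands:
        return None
    _, threshold, n_left, n_right = min(cands)
    ok = n_left >= min_samples_leaf and n_right >= min_samples_leaf
    return threshold if ok else None


# One positive sample at x=1, seven negatives after it.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 0, 0, 0, 0, 0, 0, 0]

print(split_filter_candidates(xs, ys, 2))  # 2.5: threshold moved to fit 2 samples
print(split_check_best(xs, ys, 2))         # None: the node stays a leaf
```

With `min_samples_leaf=2`, the filtering rule still splits, but it shifts the threshold from 1.5 to 2.5 and produces a left leaf holding one sample of each class, whereas the check-the-best rule leaves the node unsplit.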