Understanding min_samples_leaf / min_samples_split · Issue #8399 · scikit-learn/scikit-learn · GitHub

Understanding min_samples_leaf / min_samples_split #8399


Closed
amueller opened this issue Feb 19, 2017 · 20 comments

@amueller (Member)

This is maybe for @arjoly, @jmschrei, @glouppe.

I'm trying to understand min_samples_leaf and min_samples_split. I thought these were pre-pruning options, but they are not. They are smoothing options. Is that clear from the docs and I'm just slow?

Is there a reference for the current behavior?

What I expected was: "if the best split results in fewer than min_samples_leaf samples in the leaf, don't split".
What it is instead is "don't consider a split that leaves fewer than min_samples_leaf samples in the leaf".

These are very, very different, and that was kind of non-obvious to me (because I didn't really think about it before).
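
Roughly, in toy form, the two readings differ like this (a hypothetical 1-D split finder, nothing like scikit-learn's actual Cython splitter; `best_threshold` and `gini` are made-up helpers):

```python
import numpy as np

def gini(labels):
    """Gini impurity of an integer label array."""
    p = np.bincount(labels) / len(labels)
    return 1.0 - (p ** 2).sum()

def best_threshold(x, y, min_samples_leaf, filter_candidates):
    """Toy 1-D split finder contrasting the two interpretations.

    filter_candidates=True  -> current behavior: thresholds that would leave
        fewer than min_samples_leaf samples on a side are never considered,
        so a different (worse) threshold gets picked instead.
    filter_candidates=False -> the 'pre-pruning' reading: pick the best
        unconstrained threshold, then refuse to split at all if it would
        create a too-small leaf.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_imp, best_min_side = None, np.inf, None
    for i in range(1, len(x)):
        if filter_candidates and min(i, len(x) - i) < min_samples_leaf:
            continue  # drop this candidate, keep searching
        imp = (i * gini(y[:i]) + (len(x) - i) * gini(y[i:])) / len(x)
        if imp < best_imp:
            best_thr, best_imp = (x[i - 1] + x[i]) / 2, imp
            best_min_side = min(i, len(x) - i)
    if (not filter_candidates and best_thr is not None
            and best_min_side < min_samples_leaf):
        return None  # best split makes a tiny leaf -> don't split at all
    return best_thr

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0, 0, 0, 0, 0, 1])
print(best_threshold(x, y, 2, filter_candidates=True))   # 3.5: the split moved
print(best_threshold(x, y, 2, filter_candidates=False))  # None: don't split
```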

Example:

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = np.zeros(X.shape[0], dtype=int)  # plain int; np.int was removed from numpy
y[X[:, 1] > 2] = 1

tree = DecisionTreeClassifier().fit(X, y)

# create a grid for plotting decision functions...
x_lin = np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 1000)
y_lin = np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 1000)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]

fig, axes = plt.subplots(1, 2)
axes[0].contourf(x_grid, y_grid,
                 tree.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[0].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))  # Vega10 was renamed tab10

tree2 = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
axes[1].contourf(x_grid, y_grid,
                 tree2.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[1].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))
```

[figure: decision surface of the default tree (left) vs. min_samples_leaf=2 (right)]

Basically setting min_samples_leaf leads to exactly the same tree, only with the threshold moved so that there are enough samples in the leaf. Was that really the intent?
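
One way to check that it really is the same structure, reusing `tree` and `tree2` from the snippet above (a quick sketch):

```python
# Same number of nodes in both trees; only the split thresholds move.
print(tree.tree_.node_count, tree2.tree_.node_count)
internal = tree.tree_.feature >= 0   # leaf nodes are marked with feature == -2
print(tree.tree_.threshold[internal])
print(tree2.tree_.threshold[tree2.tree_.feature >= 0])
```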

@amueller (Member Author)

Actually, I'm not sure it's fair to say that growing small regions is smoothing. It seems a bit like the opposite. In the above example, a tie was created, which is a bit weird on its own, but consider this example (default parameters on the left, min_samples_leaf=6 on the right):

[figure: decision surface with default parameters (left) vs. min_samples_leaf=6 (right)]

We're basically just growing the smaller class's region and misclassifying two points... That seems really weird to me.

@glouppe (Contributor) commented Feb 19, 2017

> What it is instead is "don't consider a split that leaves fewer than min_samples_leaf samples in the leaf".

Yes, this is the intended behavior. Maybe the documentation can be improved if that was not clear.

@amueller (Member Author) commented Feb 19, 2017

@glouppe then my question is: why? That seems really strange. Do you have a reference? Consider my examples above.

@glouppe (Contributor) commented Feb 19, 2017

Well, that was implemented at the time by @ndawe in #522, which you cherry picked yourself in #627 :-)

@amueller (Member Author) commented Feb 19, 2017

The same applies to min_samples_split, which was introduced much earlier. I didn't look into the implementation of min_samples_split at the time. Again, I thought min_samples_split was a pruning option, but it is not.

The example in #8399 (comment) is disconcerting to me.

@glouppe (Contributor) commented Feb 19, 2017

The behaviour of min_samples_split is quite standard and documented in many papers (e.g. the extra-trees paper). From there the same logic was replicated for the other criteria. These criteria should be seen as heuristics for pruning the space of splits.
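
For example, the documented semantics are easy to check on a fitted tree (a quick sketch, using breast_cancer as an arbitrary dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
t = DecisionTreeClassifier(min_samples_split=40, random_state=0).fit(X, y)

# every node that was actually split had at least min_samples_split samples
internal = t.tree_.children_left != -1
assert t.tree_.n_node_samples[internal].min() >= 40
```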

@amueller (Member Author) commented Feb 19, 2017

Actually, let me correct that. min_samples_split is a pruning option. I guess I should have looked into the implementation of min_samples_leaf more closely at the time. It's not doing what I thought it was doing, which is pruning the tree. Maybe @ndawe can comment?

So I understand that the options are heuristics for reducing the number of splits to consider, which makes sense, particularly in the case of ensembles. But that's quite different from pruning a tree. Usually when building single trees you're interested in small trees.

min_samples_split actually leads to a smaller tree, but min_samples_leaf does not.
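
A quick way to compare sizes (a sketch; breast_cancer as an arbitrary dataset, and the exact leaf counts will vary by data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for params in [{}, {"min_samples_split": 20}, {"min_samples_leaf": 10}]:
    t = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    n_leaves = (t.tree_.children_left == -1).sum()  # leaves have no children
    print(params or "default", "->", n_leaves, "leaves")
```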

@amueller (Member Author) commented Feb 19, 2017

Here is, on the breast_cancer dataset, what I thought min_samples_leaf=2 does versus what it actually does. The result of "what I thought it does" is actually the same as if you cross-validate max_leaf_nodes, because creating nodes with single data points is a bad idea:

[figure: the tree pruned the way I expected min_samples_leaf=2 to work]
[figure: the tree actually produced by min_samples_leaf=2]

@glouppe (Contributor) commented Feb 19, 2017

Isn't "what you thought it does" more or less equivalent to setting min_samples_split to 2*min_samples_leaf? (in expectation, assuming a random assignment of the node samples upon a non-informative split)

@amueller (Member Author) commented Feb 19, 2017

Nope, it's not possible to eliminate small leaves using min_samples_split. There's a node with 300 items that splits one off. That's why I thought min_samples_leaf would be a good idea, if it actually avoided creating the split rather than just removing a candidate.
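
This is easy to reproduce (a sketch; breast_cancer again, and on many datasets the smallest leaf indeed still holds a single sample):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# even with a large min_samples_split, a big node can split a single sample off
t = DecisionTreeClassifier(min_samples_split=50, random_state=0).fit(X, y)
leaves = t.tree_.children_left == -1
print(t.tree_.n_node_samples[leaves].min())  # often 1
```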

@amueller (Member Author)

I was trying to do a lecture on selecting regularization for single decision trees, and I found that we basically have none that works reasonably. The only one that does the right kind of pruning is max_leaf_nodes, but that's really hard to set: you basically need to create a full tree, then introspect the structure to see how many leaves it has, and then do a grid-search up to that count.
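
That procedure looks roughly like this (a sketch, using breast_cancer as a stand-in dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. grow a full tree and count its leaves
full = DecisionTreeClassifier(random_state=0).fit(X, y)
n_leaves = (full.tree_.children_left == -1).sum()

# 2. grid-search max_leaf_nodes up to that count
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_leaf_nodes": list(range(2, n_leaves + 1))}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```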

@jnothman (Member) commented Feb 19, 2017 via email

@amueller (Member Author)

What I'm seeking is to stop splitting if the best candidate doesn't fulfill the requirement.
Basically I want an option that implements pre-pruning, so it results in smaller trees. Anything that forbids particular splits (even if they are all for a given feature) just means we use worse features, not that we stop splitting.

You can find the slides here if you're interested: https://amueller.github.io/applied_ml_spring_2017/lectures.html

Unfortunately I don't have the time to write notes to my slides right now :-/

@amueller (Member Author)

@jnothman there are also the (rather sloppy) notebooks here: https://github.com/amueller/applied_ml_spring_2017/tree/master/slides

@NicholasBoeke

Yes, the documentation was a bit confusing, as I had the same expectation. I thought it meant "if n_samples <= min_samples_leaf, don't split", but wouldn't that have exactly the same behavior as "if n_samples <= min_samples_split, don't split the internal node"?

@jnothman (Member) commented Feb 18, 2018 via email

@NicholasBoeke

I'll check out the developer guide. Thanks!

@amueller (Member Author) commented Mar 7, 2018

> Yes, the documentation was a bit confusing, as I had the same expectation. I thought it meant "if n_samples <= min_samples_leaf, don't split", but wouldn't that have exactly the same behavior as "if n_samples <= min_samples_split, don't split the internal node"?

No. What I wanted when I proposed this is "don't split if the resulting leaves are small". This is not something you can achieve with min_samples_split.
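
Concretely, that rule can be emulated after the fact on a tree fitted with default parameters, by stopping a prediction-time descent whenever the chosen split sent too few training samples to a child (a sketch; `predict_prepruned` is a hypothetical helper, not a scikit-learn API):

```python
import numpy as np

def predict_prepruned(clf, X, min_leaf):
    """Predict with a fitted tree, but treat a node as a leaf whenever its
    (unconstrained) best split left fewer than min_leaf samples in a child."""
    t = clf.tree_
    out = np.empty(len(X), dtype=clf.classes_.dtype)
    for i, x in enumerate(np.asarray(X)):
        node = 0
        while t.children_left[node] != -1:  # while at an internal node
            left, right = t.children_left[node], t.children_right[node]
            if min(t.n_node_samples[left], t.n_node_samples[right]) < min_leaf:
                break  # the best split here creates a tiny leaf -> don't split
            node = left if x[t.feature[node]] <= t.threshold[node] else right
        out[i] = clf.classes_[np.argmax(t.value[node])]  # majority class
    return out

# usage: clf = DecisionTreeClassifier().fit(X_train, y_train)
#        y_hat = predict_prepruned(clf, X_test, min_leaf=5)
```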

@rth (Member) commented Aug 23, 2018

FYI #11870 (about deprecating min_samples_leaf and min_weight_fraction_leaf) has been merged.

@amueller (Member Author)

I think this is ok to close. We might want to think about adding the actual pruning options.
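
(A sketch of what such an option could look like; this matches the minimal cost-complexity pruning that scikit-learn eventually added in 0.22 as ccp_alpha, which did not exist at the time of this thread:)

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# compute the pruning path, then refit with one of the candidate alphas
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print((pruned.tree_.children_left == -1).sum(), "leaves after pruning")
```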
