Understanding min_samples_leaf / min_samples_split · Issue #8399 · scikit-learn/scikit-learn · GitHub

Understanding min_samples_leaf / min_samples_split #8399


Closed
amueller opened this issue Feb 19, 2017 · 20 comments

@amueller (Member)

This is maybe for @arjoly, @jmschrei, @glouppe.

I'm trying to understand min_samples_leaf and min_samples_split. I thought these were pre-pruning options, but they are not. They are smoothing options. Is that clear from the docs and I'm just slow?

Is there a reference for the current behavior?

What I expected was: "if the best split results in fewer than min_samples_leaf samples in the leaf, don't split".
What it is instead is "don't consider a split that leaves fewer than min_samples_leaf samples in the leaf".

These are very, very different, and that was kind of non-obvious to me (because I didn't really think about it before).
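
Roughly, in toy form, the two readings differ like this (a hypothetical 1-D split finder, nothing like scikit-learn's actual Cython splitter; `best_threshold` and `gini` are made-up helpers):

```python
import numpy as np

def gini(labels):
    """Gini impurity of an integer label array."""
    p = np.bincount(labels) / len(labels)
    return 1.0 - (p ** 2).sum()

def best_threshold(x, y, min_samples_leaf, filter_candidates):
    """Toy 1-D split finder contrasting the two interpretations.

    filter_candidates=True  -> current behavior: thresholds that would leave
        fewer than min_samples_leaf samples on a side are never considered,
        so a different (worse) threshold gets picked instead.
    filter_candidates=False -> the 'pre-pruning' reading: pick the best
        unconstrained threshold, then refuse to split at all if it would
        create a too-small leaf.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_imp, best_min_side = None, np.inf, None
    for i in range(1, len(x)):
        if filter_candidates and min(i, len(x) - i) < min_samples_leaf:
            continue  # drop this candidate, keep searching
        imp = (i * gini(y[:i]) + (len(x) - i) * gini(y[i:])) / len(x)
        if imp < best_imp:
            best_thr, best_imp = (x[i - 1] + x[i]) / 2, imp
            best_min_side = min(i, len(x) - i)
    if (not filter_candidates and best_thr is not None
            and best_min_side < min_samples_leaf):
        return None  # best split makes a tiny leaf -> don't split at all
    return best_thr

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0, 0, 0, 0, 0, 1])
print(best_threshold(x, y, 2, filter_candidates=True))   # 3.5: the split moved
print(best_threshold(x, y, 2, filter_candidates=False))  # None: don't split
```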

Example:

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = np.zeros(X.shape[0], dtype=int)  # plain int; np.int was removed from numpy
y[X[:, 1] > 2] = 1

tree = DecisionTreeClassifier().fit(X, y)

# create a grid for plotting decision functions...
x_lin = np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 1000)
y_lin = np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 1000)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]

fig, axes = plt.subplots(1, 2)
axes[0].contourf(x_grid, y_grid,
                 tree.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[0].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))  # Vega10 was renamed tab10

tree2 = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
axes[1].contourf(x_grid, y_grid,
                 tree2.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[1].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))
```

[figure: decision surface of the default tree (left) vs. min_samples_leaf=2 (right)]

Basically setting min_samples_leaf leads to exactly the same tree, only with the threshold moved so that there are enough samples in the leaf. Was that really the intent?
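
One way to check that it really is the same structure, reusing `tree` and `tree2` from the snippet above (a quick sketch):

```python
# Same number of nodes in both trees; only the split thresholds move.
print(tree.tree_.node_count, tree2.tree_.node_count)
internal = tree.tree_.feature >= 0   # leaf nodes are marked with feature == -2
print(tree.tree_.threshold[internal])
print(tree2.tree_.threshold[tree2.tree_.feature >= 0])
```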

@amueller (Member Author)

Actually, I'm not sure it's fair to say that growing small regions is smoothing. It seems a bit like the opposite. In the above example, a tie was created, which is a bit weird on its own, but consider this example (default parameters on the left, min_samples_leaf=6 on the right):

[figure: decision surface with default parameters (left) vs. min_samples_leaf=6 (right)]

We're basically just growing the smaller class's region and misclassifying two points... That seems really weird to me.

@glouppe (Contributor) commented Feb 19, 2017

> What it is instead is "don't consider a split that leaves fewer than min_samples_leaf samples in the leaf".

Yes, this is the intended behavior. Maybe the documentation can be improved if that was not clear.

@amueller (Member Author) commented Feb 19, 2017

@glouppe then my question is: why? That seems really strange. Do you have a reference? Consider my examples above.

@glouppe (Contributor) commented Feb 19, 2017

Well, that was implemented at the time by @ndawe in #522, which you cherry picked yourself in #627 :-)

@amueller (Member Author) commented Feb 19, 2017

The same applies to min_samples_split, which was introduced much earlier. I didn't look into the implementation of min_samples_split at the time. Again, I thought min_samples_split was a pruning option, but it is not.

The example in #8399 (comment) is disconcerting to me.

@glouppe (Contributor) commented Feb 19, 2017

The behaviour of min_samples_split is quite standard and documented in many papers (e.g. the extra-trees paper). From there the same logic was replicated for the other criteria. These criteria should be seen as heuristics for pruning the space of splits.
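
For example, the documented semantics are easy to check on a fitted tree (a quick sketch, using breast_cancer as an arbitrary dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
t = DecisionTreeClassifier(min_samples_split=40, random_state=0).fit(X, y)

# every node that was actually split had at least min_samples_split samples
internal = t.tree_.children_left != -1
assert t.tree_.n_node_samples[internal].min() >= 40
```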

@amueller (Member Author) commented Feb 19, 2017

Actually, let me correct that. min_samples_split is a pruning option. I guess I should have looked into the implementation of min_samples_leaf more closely at the time. It's not doing what I thought it was doing, which is pruning the tree. Maybe @ndawe can comment?

So I understand that the options are heuristics for reducing the number of splits to consider, which makes sense, particularly in the case of ensembles. But that's quite different from pruning a tree. Usually when building single trees you're interested in small trees.

min_samples_split actually leads to a smaller tree, but min_samples_leaf does not.
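
A quick way to compare sizes (a sketch; breast_cancer as an arbitrary dataset, and the exact leaf counts will vary by data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for params in [{}, {"min_samples_split": 20}, {"min_samples_leaf": 10}]:
    t = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    n_leaves = (t.tree_.children_left == -1).sum()  # leaves have no children
    print(params or "default", "->", n_leaves, "leaves")
```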

@amueller (Member Author) commented Feb 19, 2017

Here is, on the breast_cancer dataset, what I thought min_samples_leaf=2 does versus what it actually does. The result of "what I thought it does" is actually the same as if you cross-validate max_leaf_nodes, because creating nodes with single data points is a bad idea:

[figure: the tree pruned the way I expected min_samples_leaf=2 to work]
[figure: the tree actually produced by min_samples_leaf=2]

@glouppe (Contributor) commented Feb 19, 2017

Isn't "what you thought it does" more or less equivalent to setting min_samples_split to 2*min_samples_leaf? (in expectation, assuming a random assignment of the node samples upon a non-informative split)

@amueller (Member Author) commented Feb 19, 2017

Nope, it's not possible to eliminate small leaves using min_samples_split. There's a node with 300 items that splits one off. That's why I thought min_samples_leaf would be a good idea, if it actually avoided creating the split rather than just removing a candidate.
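
This is easy to reproduce (a sketch; breast_cancer again, and on many datasets the smallest leaf indeed still holds a single sample):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# even with a large min_samples_split, a big node can split a single sample off
t = DecisionTreeClassifier(min_samples_split=50, random_state=0).fit(X, y)
leaves = t.tree_.children_left == -1
print(t.tree_.n_node_samples[leaves].min())  # often 1
```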

@amueller (Member Author)

I was trying to do a lecture on selecting regularization for single decision trees, and I found that we basically have none that works reasonably. The only one that does the right kind of pruning is max_leaf_nodes, but that's really hard to set: you basically need to create a full tree, then introspect the structure to see how many leaves it has, and then do a grid-search up to that count.
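
That procedure looks roughly like this (a sketch, using breast_cancer as a stand-in dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. grow a full tree and count its leaves
full = DecisionTreeClassifier(random_state=0).fit(X, y)
n_leaves = (full.tree_.children_left == -1).sum()

# 2. grid-search max_leaf_nodes up to that count
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_leaf_nodes": list(range(2, n_leaves + 1))}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```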

@jnothman (Member) commented Feb 19, 2017 via email

@amueller (Member Author)

What I'm seeking is to stop splitting if the best candidate doesn't fulfill the requirement.
Basically I want an option that implements pre-pruning, so it results in smaller trees. Anything that forbids particular splits (even if they are all for a given feature) just means we use worse features, not that we stop splitting.

You can find the slides here if you're interested: https://amueller.github.io/applied_ml_spring_2017/lectures.html

Unfortunately I don't have the time to write notes to my slides right now :-/

@amueller (Member Author)

@jnothman there are also the (rather sloppy) notebooks here: https://github.com/amueller/applied_ml_spring_2017/tree/master/slides

@NicholasBoeke

Yes, the documentation was a bit confusing, as I had the same expectation. I thought it meant "if n_samples <= min_samples_leaf, don't split", but wouldn't that have exactly the same behavior as "if n_samples <= min_samples_split, don't split the internal node"?

@jnothman (Member) commented Feb 18, 2018 via email

@NicholasBoeke

I'll check out the developer guide. Thanks!

@amueller (Member Author) commented Mar 7, 2018

> Yes, the documentation was a bit confusing, as I had the same expectation. I thought it meant "if n_samples <= min_samples_leaf, don't split", but wouldn't that have exactly the same behavior as "if n_samples <= min_samples_split, don't split the internal node"?

No. What I wanted when I proposed this is "don't split if the resulting leaves are small". This is not something you can achieve with min_samples_split.
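
Concretely, that rule can be emulated after the fact on a tree fitted with default parameters, by stopping a prediction-time descent whenever the chosen split sent too few training samples to a child (a sketch; `predict_prepruned` is a hypothetical helper, not a scikit-learn API):

```python
import numpy as np

def predict_prepruned(clf, X, min_leaf):
    """Predict with a fitted tree, but treat a node as a leaf whenever its
    (unconstrained) best split left fewer than min_leaf samples in a child."""
    t = clf.tree_
    out = np.empty(len(X), dtype=clf.classes_.dtype)
    for i, x in enumerate(np.asarray(X)):
        node = 0
        while t.children_left[node] != -1:  # while at an internal node
            left, right = t.children_left[node], t.children_right[node]
            if min(t.n_node_samples[left], t.n_node_samples[right]) < min_leaf:
                break  # the best split here creates a tiny leaf -> don't split
            node = left if x[t.feature[node]] <= t.threshold[node] else right
        out[i] = clf.classes_[np.argmax(t.value[node])]  # majority class
    return out

# usage: clf = DecisionTreeClassifier().fit(X_train, y_train)
#        y_hat = predict_prepruned(clf, X_test, min_leaf=5)
```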

@rth (Member) commented Aug 23, 2018

FYI #11870 (about deprecating min_samples_leaf and min_weight_fraction_leaf) has been merged.

@amueller (Member Author)

I think this is ok to close. We might want to think about adding the actual pruning options.
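
(A sketch of what such an option could look like; this matches the minimal cost-complexity pruning that scikit-learn eventually added in 0.22 as ccp_alpha, which did not exist at the time of this thread:)

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# compute the pruning path, then refit with one of the candidate alphas
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print((pruned.tree_.children_left == -1).sum(), "leaves after pruning")
```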
