min_impurity_split in tree is very odd behavior #8400

Closed · amueller opened this issue Feb 19, 2017 · 14 comments

@amueller (Member)

From the implementation and the name it looks like this is intended, but the current behavior of min_impurity_split seems odd to me and is not what is usually used as a stopping criterion in the literature.

In the literature, you do not make a split unless it decreases the impurity by at least a given threshold.
In scikit-learn, you do not split a node unless its impurity is above a given threshold.
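
In pseudocode (hypothetical helper names of mine, just to make the contrast explicit):

def literature_stop(node_impurity, best_split_decrease, threshold):
    # Stop (do not split) unless the best split reduces the impurity by at
    # least `threshold`.
    return best_split_decrease < threshold

def min_impurity_split_stop(node_impurity, best_split_decrease, threshold):
    # Stop (do not split) as soon as the node itself is "pure enough"; how much
    # the best split would reduce the impurity is never taken into account.
    return node_impurity <= threshold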

Check out the example on breast_cancer here:

[tree visualization: decision tree grown on breast_cancer with min_impurity_split]

The left node has an impurity of 0.08, so it is not split any further, even though a very good split is possible there that would decrease the gini a lot:

[tree visualization: the split that would still have been possible at that node]

Basically, even if another split could create entirely pure leaves, it is never attempted because this leaf already counts as "pure enough".
You can see that the tree above is very imbalanced, and this imbalance is really arbitrary. I don't think this is a good stopping criterion, and I think we should implement the standard stopping criterion from the literature.
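
A minimal sketch of how to reproduce this (mine, not part of the original report); it assumes a scikit-learn version that still accepts min_impurity_split (the parameter was later deprecated), and the 0.1 threshold is just an illustrative value:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A node becomes a leaf as soon as its impurity falls below the threshold,
# no matter how much a further split would reduce it.
clf = DecisionTreeClassifier(min_impurity_split=0.1, random_state=0).fit(X, y)

# Impurities of the resulting leaves: some sit just under the threshold and
# stay impure even though a clean split may still be available there.
t = clf.tree_
print(t.impurity[t.children_left == -1])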

Again pinging the tree builders: @glouppe @jmschrei @arjoly @raghavrv

@amueller (Member, Author) commented Feb 19, 2017

Maybe a more obvious example: consider a 1d regression with noise and a non-informative feature:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Pure-noise target: the single feature carries no information about y.
rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

tree = DecisionTreeRegressor().fit(X, y)

The standard stopping criterion from the literature would allow you to not split at all, because no split is informative. However, since there is noise, there is basically no setting of min_impurity_split that has any effect :-/
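
A quick way to see this, continuing the snippet above (my sketch, again assuming a version that still accepts min_impurity_split): the root impurity is the MSE of the pure-noise target, roughly 1.0, so thresholds below that still allow a large tree fit to noise, while thresholds above it forbid even the first split.

for threshold in [0.01, 0.5, 1.5]:
    t = DecisionTreeRegressor(min_impurity_split=threshold).fit(X, y)
    print(threshold, t.tree_.node_count)
# Thresholds below the root impurity still grow a sizeable tree on noise;
# a threshold above it (here 1.5) leaves the tree as a single node.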

@raghavrv (Member)

I agree that we need a stopping criterion that says "do not split if the impurity decrease is not at least x", probably in addition to the existing min_impurity_split, which says "do not split if the impurity is already low enough that I don't care whether the node can be split further"...

The current min_impurity_split was added as a kind of "post-pruning" method which lets you snip off further splits at branches where the impurity is already pretty low...

@amueller (Member, Author)

I don't think there is a good argument to call this post-pruning. Post-pruning would imply building the whole tree and then snipping off branches which are not very helpful. This kills off branches that are very helpful. I find the current behavior confusing since it's not the standard in the literature.

I had opened an issue to implement this, but I guess that issue was misunderstood?

@amueller (Member, Author) commented Feb 19, 2017

So the PR #6954 claims to implement the feature request by @glouppe here.
@glouppe actually asked for the thing I was expecting:

An easy addition would be to stop the construction when p(t) i(t, s*) < beta, i.e. when the weighted impurity p(t) i(t, s*) for the best split s* becomes less than some user-defined threshold beta.

The PR implements something very different.
If we implement this, I don't think the current parameter would be useful any more.
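
For reference, a decrease-based criterion along these lines was later added to scikit-learn as min_impurity_decrease; a sketch of what it looks like on the pure-noise example from above (the thresholds are arbitrary illustrative values):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same pure-noise data as in the earlier comment.
rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

for threshold in [0.0, 0.05, 0.2]:
    t = DecisionTreeRegressor(min_impurity_decrease=threshold).fit(X, y)
    print(threshold, t.tree_.node_count)
# A split is only made if it reduces the weighted impurity by at least the
# threshold, wherever the node sits in the tree.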

@raghavrv (Member)

Ah that is true! So we need to adjust min_impurity_split to check the impurity_improvement rather than node impurity...

@raghavrv (Member)

Then should the parameter be called min_impurity_decrease rather than min_impurity_split?

@amueller (Member, Author)

sounds good to me.

@raghavrv (Member)

"sounds good to me."

To the first comment or both?

@amueller (Member, Author)

both ;)

@raghavrv (Member)

Thx. @glouppe @jmetzen @nelson-liu you guys okay with that?

@nelson-liu (Contributor)

+1, sorry I must have misunderstood the originally intended behavior. This is far more sensible.

@glouppe (Contributor) commented Feb 19, 2017

Fair enough.

(Shall we deprecate min_impurity_split? Having too many parameters makes things difficult for the users...)

@amueller (Member, Author)

yeah, I think deprecating that is good.

@jnothman (Member) commented Feb 19, 2017 via email
