min_impurity_split in tree is very odd behavior #8400

Closed · amueller opened this issue Feb 19, 2017 · 14 comments

@amueller (Member)

From the implementation and the name it looks like this is intended, but the current behavior of min_impurity_split seems odd to me and is not what is usually used as a stopping criterion in the literature.

In the literature, you do not make a split unless it decreases the impurity by at least a given threshold.
In scikit-learn, you do not split a node unless its impurity is above a given threshold.
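
In pseudocode (hypothetical helper names of mine, just to make the contrast explicit):

def literature_stop(node_impurity, best_split_decrease, threshold):
    # Stop (do not split) unless the best split reduces the impurity by at
    # least `threshold`.
    return best_split_decrease < threshold

def min_impurity_split_stop(node_impurity, best_split_decrease, threshold):
    # Stop (do not split) as soon as the node itself is "pure enough"; how much
    # the best split would reduce the impurity is never taken into account.
    return node_impurity <= threshold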

Check out the example on breast_cancer here:

[tree visualization: decision tree grown on breast_cancer with min_impurity_split]

The left node has an impurity of 0.08, so it is not split any further, even though a very good split is possible there that would decrease the gini a lot:

[tree visualization: the split that would still have been possible at that node]

Basically, even if another split could create entirely pure leaves, it is never attempted because this leaf already counts as "pure enough".
You can see that the tree above is very imbalanced, and this imbalance is really arbitrary. I don't think this is a good stopping criterion, and I think we should implement the standard stopping criterion from the literature.
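
A minimal sketch of how to reproduce this (mine, not part of the original report); it assumes a scikit-learn version that still accepts min_impurity_split (the parameter was later deprecated), and the 0.1 threshold is just an illustrative value:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A node becomes a leaf as soon as its impurity falls below the threshold,
# no matter how much a further split would reduce it.
clf = DecisionTreeClassifier(min_impurity_split=0.1, random_state=0).fit(X, y)

# Impurities of the resulting leaves: some sit just under the threshold and
# stay impure even though a clean split may still be available there.
t = clf.tree_
print(t.impurity[t.children_left == -1])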

Again pinging the tree builders: @glouppe @jmschrei @arjoly @raghavrv

@amueller (Member, Author) commented Feb 19, 2017

Maybe a more obvious example: consider a 1d regression with noise and a non-informative feature:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Pure-noise target: the single feature carries no information about y.
rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

tree = DecisionTreeRegressor().fit(X, y)

The standard stopping criterion from the literature would allow you to not split at all, because no split is informative. However, since there is noise, there is basically no setting of min_impurity_split that has any effect :-/
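
A quick way to see this, continuing the snippet above (my sketch, again assuming a version that still accepts min_impurity_split): the root impurity is the MSE of the pure-noise target, roughly 1.0, so thresholds below that still allow a large tree fit to noise, while thresholds above it forbid even the first split.

for threshold in [0.01, 0.5, 1.5]:
    t = DecisionTreeRegressor(min_impurity_split=threshold).fit(X, y)
    print(threshold, t.tree_.node_count)
# Thresholds below the root impurity still grow a sizeable tree on noise;
# a threshold above it (here 1.5) leaves the tree as a single node.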

@raghavrv (Member)

I agree that we need a stopping criterion that says "do not split if the impurity decrease is not at least x", probably in addition to the existing min_impurity_split, which says "do not split if the impurity is already low enough that I don't care whether the node can be split further"...

The current min_impurity_split was added as a kind of "post-pruning" method which lets you snip off further splits at branches where the impurity is already pretty low...

@amueller (Member, Author)

I don't think there is a good argument to call this post-pruning. Post-pruning would imply building the whole tree and then snipping off branches which are not very helpful. This kills off branches that are very helpful. I find the current behavior confusing since it's not the standard in the literature.

I had opened an issue to implement this, but I guess that issue was misunderstood?

@amueller (Member, Author) commented Feb 19, 2017

So the PR #6954 claims to implement the feature request by @glouppe here.
@glouppe actually asked for the thing I was expecting:

An easy addition would be to stop the construction when p(t) i(t, s*) < beta, i.e. when the weighted impurity p(t) i(t, s*) for the best split s* becomes less than some user-defined threshold beta.

The PR implements something very different.
If we implement this, I don't think the current parameter would be useful any more.
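
For reference, a decrease-based criterion along these lines was later added to scikit-learn as min_impurity_decrease; a sketch of what it looks like on the pure-noise example from above (the thresholds are arbitrary illustrative values):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same pure-noise data as in the earlier comment.
rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

for threshold in [0.0, 0.05, 0.2]:
    t = DecisionTreeRegressor(min_impurity_decrease=threshold).fit(X, y)
    print(threshold, t.tree_.node_count)
# A split is only made if it reduces the weighted impurity by at least the
# threshold, wherever the node sits in the tree.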

@raghavrv (Member)

Ah that is true! So we need to adjust min_impurity_split to check the impurity_improvement rather than node impurity...

@raghavrv (Member)

Then should the parameter be called min_impurity_decrease rather than min_impurity_split?

@amueller (Member, Author)

sounds good to me.

@raghavrv (Member)

"sounds good to me."

To the first comment or both?

@amueller (Member, Author)

both ;)

@raghavrv (Member)

Thx. @glouppe @jmetzen @nelson-liu you guys okay with that?

@nelson-liu (Contributor)

+1, sorry I must have misunderstood the originally intended behavior. This is far more sensible.

@glouppe (Contributor) commented Feb 19, 2017

Fair enough.

(Shall we deprecate min_impurity_split? Having too many parameters makes things difficult for the users...)

@amueller (Member, Author)

yeah, I think deprecating that is good.

@jnothman (Member) commented Feb 19, 2017 via email
