min_impurity_split in tree is very odd behavior #8400
Maybe a more obvious example: consider a 1d regression with noise and a non-informative feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

tree = DecisionTreeRegressor().fit(X, y)
```

The standard stopping criterion from the literature would allow you to not split at all, because no split is informative. However, since there is noise, there is basically no setting of `min_impurity_split` that avoids splitting here.
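To make that concrete, a sketch (assuming a scikit-learn version from this era that still accepts `min_impurity_split`, i.e. pre-1.0): the noise has variance around 1, so any threshold below the root impurity still lets the tree grow.

```python
# Sketch, not from the original thread: a nonzero min_impurity_split does
# not stop splitting on pure noise, because node impurities stay above any
# modest threshold. Requires a scikit-learn with min_impurity_split (pre-1.0).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

tree = DecisionTreeRegressor(min_impurity_split=0.5).fit(X, y)
print(tree.tree_.node_count)  # far more than 1 node, despite the threshold
```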
I agree that we need a stopping criterion specifying not to split if the impurity decrease is not at least x, probably in addition to the existing `min_impurity_split`.
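A minimal sketch of what that criterion would look like (illustrative names only, not scikit-learn API):

```python
# Illustrative sketch of the proposed rule: only split when the
# weighted impurity decrease reaches some threshold x.
def impurity_decrease(n_node, impurity, n_left, impurity_left,
                      n_right, impurity_right):
    """Impurity of the node minus the weighted impurity of its children."""
    return impurity - (n_left * impurity_left
                       + n_right * impurity_right) / n_node

def should_split(n_node, impurity, n_left, impurity_left,
                 n_right, impurity_right, x):
    # split only when the candidate split is informative enough
    return impurity_decrease(n_node, impurity, n_left, impurity_left,
                             n_right, impurity_right) >= x
```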
I don't think there is a good argument for calling this post-pruning. Post-pruning would imply building the whole tree and then snipping off branches that are not very helpful. This kills off branches that are very helpful. I find the current behavior confusing since it's not the standard in the literature. I had an issue to implement this, but I guess that issue was misunderstood?
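For contrast, post-pruning in the sense described here did later land in scikit-learn as minimal cost-complexity pruning (`ccp_alpha`, added in version 0.22); a sketch:

```python
# Post-pruning for contrast: the full tree is grown first, then branches
# whose removal costs little accuracy are snipped off. Uses ccp_alpha
# (added in scikit-learn 0.22; not what this issue is about).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)  # pruned tree is smaller
```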
So PR #6954 claims to implement the feature request by @glouppe here. The PR implements something very different.
Ah, that is true! So we need to adjust the behavior.
Then should the parameter be called `min_impurity_decrease`?
sounds good to me.
To the first comment or both?
both ;)
Thx. @glouppe @jmetzen @nelson-liu, you guys okay with that?
+1, sorry I must have misunderstood the original intended behavior. This is far more sensible.
Fair enough. (Shall we deprecate `min_impurity_split`?)
Yeah, I think deprecating that is good.
+1 for deprecation
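For the record, this is the direction that shipped: `min_impurity_split` was deprecated in favor of `min_impurity_decrease` in scikit-learn 0.19. A sketch of the renamed criterion on the noise example from above:

```python
# Sketch using the parameter that eventually replaced min_impurity_split:
# a node is split only if the split yields a weighted impurity decrease
# of at least this value.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = rng.normal(size=100)

tree = DecisionTreeRegressor(min_impurity_decrease=0.2).fit(X, y)
print(tree.tree_.node_count)  # should print 1: with a high enough threshold,
                              # pure noise admits no informative split
```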
From the implementation and the name it looks like that's intended, but the current behavior of `min_impurity_split` seems odd to me and is not what is usually used as a stopping criterion in the literature. In the literature, you do not do a split unless it decreases the impurity by at least a given threshold. In scikit-learn, you do not do a split if the impurity of the current node is not above a given threshold.
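In pseudocode (illustrative names, not the actual Cython internals):

```python
# Illustrative contrast of the two stopping rules (not actual
# scikit-learn internals).
def split_allowed_literature(decrease, threshold):
    # literature: the split must be worth at least `threshold`
    return decrease >= threshold

def split_allowed_sklearn(node_impurity, min_impurity_split):
    # scikit-learn's min_impurity_split: split whenever the node is
    # still "impure enough", no matter how little the split helps
    return node_impurity > min_impurity_split
```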
Check out the example on breast_cancer here:
The left node has an impurity of 0.08, so it is not split any further - even though there is a very good split possible here that would decrease the gini a lot:

Basically, even a split that would create entirely pure leaves is never attempted, because this leaf is already considered "pure enough". You can see that the tree above is very imbalanced, and the imbalance is really arbitrary. I don't think this is a good stopping criterion, and I think we should implement the standard stopping criterion from the literature.
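A sketch of how a tree like the one above could be reproduced (the exact threshold behind the figure is a guess, and this assumes a pre-1.0 scikit-learn that still accepts `min_impurity_split`):

```python
# Sketch: nodes whose gini is at or below the threshold become leaves,
# even when a further split would make them purer. The threshold 0.1 is
# a guess at what produced the figure above.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(min_impurity_split=0.1, random_state=0).fit(X, y)
export_graphviz(tree, out_file="tree.dot")
```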
Again pinging the tree builders: @glouppe @jmschrei @arjoly @raghavrv