[MRG+2] ENH/FIX Introduce min_impurity_decrease param for early stopping based on impurity; Deprecate min_impurity_split #8449
Conversation
sklearn/tree/tests/test_tree.py
Outdated
imp_right = est.tree_.impurity[right]
weighted_n_right = est.tree_.weighted_n_node_samples[right]

actual_decrease = (est.tree_.impurity[node] -
# TODO this is an incorrect comparison. The actual decrease should again be multiplied by the fractional weight of the parent node...
sklearn/tree/_tree.pyx
Outdated
@@ -446,7 +454,8 @@ cdef class BestFirstTreeBuilder(TreeBuilder):

             if not is_leaf:
                 splitter.node_split(impurity, &split, &n_constant_features)
-                is_leaf = is_leaf or (split.pos >= end)
+                is_leaf = (is_leaf or split.pos >= end or
+                           split.improvement + EPSILON < min_impurity_decrease)
What's the need for epsilon here?
I did this to avoid floating precision inconsistencies affecting the split... I'll explain clearly in a subsequent comment...
So I did this to avoid not splitting if `split.improvement` is almost equal to `min_impurity_decrease` within the precision of the machine. For instance if you give `min_impurity_decrease` as `1e-7`, it does not build the tree completely, as sometimes the improvement is almost equal to `1e-7`...

And I added it to the left and not the right as it would give splitting the benefit of the doubt (as opposed to not splitting)...
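A minimal sketch of the tolerance check being discussed; `EPSILON`'s value and the standalone `should_stop` helper are assumptions here, mirroring the comparison in the PR's `_tree.pyx` rather than copying it:

```python
# Hypothetical sketch; EPSILON's value is assumed, not taken from the PR.
EPSILON = 1e-10  # small slack for floating-point comparisons

def should_stop(improvement, min_impurity_decrease):
    """Return True if the node should become a leaf.

    EPSILON is added on the improvement side so that an improvement equal
    to the threshold up to machine precision still splits -- the 'benefit
    of the doubt' goes to splitting.
    """
    return improvement + EPSILON < min_impurity_decrease

# An improvement a hair below 1e-7 still splits thanks to the slack:
print(should_stop(1e-7 - 1e-12, 1e-7))  # False -> keep splitting
print(should_stop(1e-8, 1e-7))          # True  -> make it a leaf
```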
To clarify further: setting it to `1e-7`, as done for other stopping params to denote eps, will not let the tree grow fully and will produce trees dissimilar to master...
Add this as an inline comment, then.
@@ -272,10 +275,23 @@ def fit(self, X, y, sample_weight=None, check_input=True,
             min_weight_leaf = (self.min_weight_fraction_leaf *
                                np.sum(sample_weight))

-        if self.min_impurity_split < 0.:
+        if self.min_impurity_split is not None:
Is there a deprecation decorator which can be used? I know there is one for deprecated functions, but I'm not sure about parameters.
I think we typically use our `deprecated` decorator for attributes, not parameters... But I'm unsure... @amueller thoughts?
In general this looks good. I didn't check your test though to make sure it was correct.

Thanks a lot @jmschrei for the review!

Functionality-wise this looks good to me, pending that comment about the deprecation decorator. Good work @raghavrv

Thanks @nelson-liu and @jmschrei. Andy or Gilles??

Or maybe @glemaitre / @ogrisel have some time for reviews?

Should you mention in the docstring that...
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
I would change with:
A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.
sklearn/ensemble/forest.py
Outdated
.. versionadded:: 0.18
The impurity decrease due to a potential split is the difference in the
I would remove "due to a potential split"
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
@@ -1406,7 +1417,8 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
     def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                  subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                  min_samples_leaf=1, min_weight_fraction_leaf=0.,
-                 max_depth=3, min_impurity_split=1e-7, init=None,
+                 max_depth=3, min_impurity_decrease=0.,
`min_impurity_decrease` is defined at `1e-7` in the above docstring.
Thanks for the catch. I changed the doc to 0... I'm using 0 because of the `EPSILON` added as described here...
min_impurity_split : float, optional (default=1e-7)
    Threshold for early stopping in tree growth. A node will split
    if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=1e-7)
Check the default value
@@ -1790,7 +1811,8 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
     def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
                  subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                  min_samples_leaf=1, min_weight_fraction_leaf=0.,
-                 max_depth=3, min_impurity_split=1e-7, init=None, random_state=None,
+                 max_depth=3, min_impurity_decrease=0.,
check the default value
(Same as above)
sklearn/tree/tree.py
Outdated
Threshold for early stopping in tree growth. If the impurity
of a node is below the threshold, the node is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
sklearn/tree/tree.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
Generally we don't mention that in the docstring. We deprecate it and remove the doc for that param...

Thanks for the review. Have addressed it :) Another round?

@jnothman Could you take a look at this too?
4775b93 to 0ca3a4e
Some minor comments, looks fine otherwise.
sklearn/ensemble/forest.py
Outdated
.. versionadded:: 0.18
The impurity decrease is the difference in the parent node's impurity
I would prefer the easier-to-follow definition over here (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L177).
Also, there seems to be an extra term outside the bracket (N_parent / N_total) from your tests here. (https://github.com/scikit-learn/scikit-learn/pull/8449/files#diff-c3874016cfa1f9bc378d573240ff0502R890)
sklearn/tree/tests/test_tree.py
Outdated
fractional_node_weight = (
    est.tree_.weighted_n_node_samples[node] /
    est.tree_.weighted_n_node_samples[0])
nitpick: Can you replace the denominator by just X.shape[0]?
sklearn/tree/tests/test_tree.py
Outdated
est.tree_.impurity[node] -
(weighted_n_left * imp_left +
 weighted_n_right * imp_right) /
(weighted_n_left + weighted_n_right)))
It might be simpler to write (N_parent * Imp_parent - N_left * imp_left - N_right * imp_right) / N
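To check that the two formulations agree, here is a small numeric sketch with made-up counts and impurities (all names illustrative, not from the scikit-learn code base):

```python
# Toy numbers to check that the test's 'fractional weight' form and the
# reviewer's flatter form compute the same weighted impurity decrease.
N, N_parent, N_left, N_right = 100.0, 40.0, 25.0, 15.0
imp_parent, imp_left, imp_right = 0.48, 0.30, 0.20

# Form 1: fraction of samples at the node times the local decrease.
form1 = (N_parent / N) * (imp_parent -
                          (N_left * imp_left + N_right * imp_right) / N_parent)

# Form 2: the flatter expression suggested in the review.
form2 = (N_parent * imp_parent - N_left * imp_left - N_right * imp_right) / N

assert abs(form1 - form2) < 1e-12  # algebraically identical
```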
def test_min_impurity_decrease():
    # test that min_impurity_decrease ensures a split is made only if
    # the impurity decrease is at least that value
    X, y = datasets.make_classification(n_samples=10000, random_state=42)
You should test regressors also no?
Yes! The `ALL_TREES[...]` contains regressors too... Just that I use the same classification data to test the regressors too...
# Test if min_impurity_split of base estimators is set
# Regression test for #8006
X, y = datasets.make_hastie_10_2(n_samples=100, random_state=1)
all_estimators = [GradientBoostingRegressor,
You need to test for random forests also?
Thanks! done in the latest commit..
I agree that the behaviour of...
It's the same expression as your one with the `fractional_weight` and the one documented in the criterion file. It is just that I find the latter easier to read, but it's fine. (I meant having the extra term is right and it wasn't reflected in the documentation)

LGTM!
sklearn/ensemble/forest.py
Outdated
.. versionadded:: 0.18
The weighted impurity decrease equation is the following:
Are we using the `::math` environment in the docstring?
@raghavrv will the math display correctly from lines 815-816? The `` tag will work properly, but does indenting alone work as intended?
LGTM. If you can address the one typesetting comment I'll go ahead and merge it.
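One way to sidestep the typesetting question is a reStructuredText literal block: indentation alone is not special to Sphinx, but a paragraph ending in `::` renders the following indented lines preformatted. A hypothetical docstring sketch (not copied from the PR):

```python
def example_param_doc():
    """Illustrative docstring sketch (hypothetical, not from scikit-learn).

    Parameters
    ----------
    min_impurity_decrease : float, optional (default=0.)
        A node will be split if this split induces a decrease of the
        impurity greater than or equal to this value.

        The weighted impurity decrease equation is the following::

            N_t / N * (impurity - N_t_R / N_t * right_impurity
                                - N_t_L / N_t * left_impurity)
    """

# The '::' literal-block marker survives into the rendered docs as-is:
print("::" in example_param_doc.__doc__)  # True
```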
@jmschrei @glemaitre Thanks for pointing that out! It was not displaying correctly before, but after the latest commit it should look like this.

Yohoo!! Thanks for the reviews and merge @jmschrei @MechCoder and @glemaitre :)

Nice :)

Sweet, thanks!
…ing based on impurity; Deprecate min_impurity_split (scikit-learn#8449) [MRG+2] ENH/FIX Introduce min_impurity_decrease param for early stopping based on impurity; Deprecate min_impurity_split
Requires scikit-learn >= 0.19 See scikit-learn/scikit-learn#8449 Fixes #11
Fixes #8400
Also ref Gilles' comment
This PR tries to stop splitting if the weighted impurity gain after a potential split is not above a user-given threshold...
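The quantity being thresholded can be illustrated with a tiny self-contained computation (this is an illustration of the weighted impurity decrease, not the scikit-learn implementation; the function names are made up):

```python
def gini(counts):
    """Gini impurity of a node given per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_decrease(parent, left, right, n_total):
    """N_t / N * (imp(parent) - N_L/N_t * imp(left) - N_R/N_t * imp(right))."""
    n_t, n_l, n_r = sum(parent), sum(left), sum(right)
    return (n_t / n_total) * (
        gini(parent) - (n_l / n_t) * gini(left) - (n_r / n_t) * gini(right))

# A 50/50 parent split into two pure children gives the maximal decrease:
print(weighted_decrease([50, 50], [50, 0], [0, 50], n_total=100))  # 0.5
```

A split would then be made only when this value is at least `min_impurity_decrease`.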
@amueller Can you try this on your use cases and see if it gives better control than `min_impurity_split`?

@jnothman @glouppe @nelson-liu @glemaitre @jmschrei