[MRG+2] ENH/FIX Introduce min_impurity_decrease param for early stopping based on impurity; Deprecate min_impurity_split #8449
Conversation
sklearn/tree/tests/test_tree.py
Outdated
imp_right = est.tree_.impurity[right]
weighted_n_right = est.tree_.weighted_n_node_samples[right]

actual_decrease = (est.tree_.impurity[node] -
# TODO this is an incorrect comparison. The actual decrease should again be multiplied by the fractional weight of the parent node...
sklearn/tree/_tree.pyx
Outdated
@@ -446,7 +454,8 @@ cdef class BestFirstTreeBuilder(TreeBuilder):

             if not is_leaf:
                 splitter.node_split(impurity, &split, &n_constant_features)
-                is_leaf = is_leaf or (split.pos >= end)
+                is_leaf = (is_leaf or split.pos >= end or
+                           split.improvement + EPSILON < min_impurity_decrease)
What's the need for epsilon here?
I did this to avoid floating precision inconsistencies affecting the split... I'll explain clearly in a subsequent comment...
So I did this to avoid not splitting if `split.improvement` is almost equal to `min_impurity_decrease` within the precision of the machine. For instance if you give `min_impurity_decrease` as `1e-7`, it does not build the tree completely, as sometimes the improvement is almost equal to `1e-7`...

And I added it to the left and not the right as it would give splitting the benefit of the doubt (as opposed to not splitting)...
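A minimal sketch of the tolerance check being discussed; `EPSILON`'s value and the standalone `should_stop` helper are assumptions here, mirroring the comparison in the PR's `_tree.pyx` rather than copying it:

```python
# Hypothetical sketch; EPSILON's value is assumed, not taken from the PR.
EPSILON = 1e-10  # small slack for floating-point comparisons

def should_stop(improvement, min_impurity_decrease):
    """Return True if the node should become a leaf.

    EPSILON is added on the improvement side so that an improvement equal
    to the threshold up to machine precision still splits -- the 'benefit
    of the doubt' goes to splitting.
    """
    return improvement + EPSILON < min_impurity_decrease

# An improvement a hair below 1e-7 still splits thanks to the slack:
print(should_stop(1e-7 - 1e-12, 1e-7))  # False -> keep splitting
print(should_stop(1e-8, 1e-7))          # True  -> make it a leaf
```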
To clarify further: setting it to `1e-7`, as done for other stopping params to denote eps, will not let the tree grow fully and will produce trees dissimilar to master...
Add this as an inline comment, then.
@@ -272,10 +275,23 @@ def fit(self, X, y, sample_weight=None, check_input=True,
             min_weight_leaf = (self.min_weight_fraction_leaf *
                                np.sum(sample_weight))

-        if self.min_impurity_split < 0.:
+        if self.min_impurity_split is not None:
Is there a deprecation decorator which can be used? I know there is one for deprecated functions, but I'm not sure about parameters.
I think we typically use our `deprecated` decorator for attributes, not parameters... But I'm unsure... @amueller thoughts?
In general this looks good. I didn't check your test though to make sure it was correct.

Thanks a lot @jmschrei for the review!

Functionality-wise this looks good to me, pending that comment about the deprecation decorator. Good work @raghavrv

Thanks @nelson-liu and @jmschrei. Andy or Gilles??

Or maybe @glemaitre / @ogrisel have some time for reviews?

Should you mention in the docstring that...
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
I would change with:
A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.
sklearn/ensemble/forest.py
Outdated
.. versionadded:: 0.18
The impurity decrease due to a potential split is the difference in the
I would remove "due to a potential split"
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
sklearn/ensemble/forest.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
@@ -1406,7 +1417,8 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
     def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                  subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                  min_samples_leaf=1, min_weight_fraction_leaf=0.,
-                 max_depth=3, min_impurity_split=1e-7, init=None,
+                 max_depth=3, min_impurity_decrease=0.,
`min_impurity_decrease` is defined at `1e-7` in the above docstring.
Thanks for the catch. I changed the doc to 0... I'm using 0 because of the `EPSILON` added as described here...
min_impurity_split : float, optional (default=1e-7)
    Threshold for early stopping in tree growth. A node will split
    if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=1e-7)
Check the default value
@@ -1790,7 +1811,8 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
     def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
                  subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                  min_samples_leaf=1, min_weight_fraction_leaf=0.,
-                 max_depth=3, min_impurity_split=1e-7, init=None, random_state=None,
+                 max_depth=3, min_impurity_decrease=0.,
check the default value
(Same as above)
sklearn/tree/tree.py
Outdated
Threshold for early stopping in tree growth. If the impurity
of a node is below the threshold, the node is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
sklearn/tree/tree.py
Outdated
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
min_impurity_decrease : float, optional (default=0.)
    Threshold for early stopping in tree growth. A node will be split
Same changes as in RandomForestClassifier
Generally we don't mention that in the docstring. We deprecate it and remove the doc for that param...

Thanks for the review. Have addressed it :) Another round?

@jnothman Could you take a look at this too?
4775b93 to 0ca3a4e
Some minor comments, looks fine otherwise.
sklearn/ensemble/forest.py
Outdated
.. versionadded:: 0.18
The impurity decrease is the difference in the parent node's impurity
I would prefer the easier-to-follow definition over here (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L177).
Also, there seems to be an extra term outside the bracket (N_parent / N_total) from your tests here. (https://github.com/scikit-learn/scikit-learn/pull/8449/files#diff-c3874016cfa1f9bc378d573240ff0502R890)
sklearn/tree/tests/test_tree.py
Outdated
fractional_node_weight = (
    est.tree_.weighted_n_node_samples[node] /
    est.tree_.weighted_n_node_samples[0])
nitpick: Can you replace the denominator by just X.shape[0]?
sklearn/tree/tests/test_tree.py
Outdated
est.tree_.impurity[node] -
(weighted_n_left * imp_left +
 weighted_n_right * imp_right) /
(weighted_n_left + weighted_n_right)))
It might be simpler to write (N_parent * Imp_parent - N_left * imp_left - N_right * imp_right) / N
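To check that the two formulations agree, here is a small numeric sketch with made-up counts and impurities (all names illustrative, not from the scikit-learn code base):

```python
# Toy numbers to check that the test's 'fractional weight' form and the
# reviewer's flatter form compute the same weighted impurity decrease.
N, N_parent, N_left, N_right = 100.0, 40.0, 25.0, 15.0
imp_parent, imp_left, imp_right = 0.48, 0.30, 0.20

# Form 1: fraction of samples at the node times the local decrease.
form1 = (N_parent / N) * (imp_parent -
                          (N_left * imp_left + N_right * imp_right) / N_parent)

# Form 2: the flatter expression suggested in the review.
form2 = (N_parent * imp_parent - N_left * imp_left - N_right * imp_right) / N

assert abs(form1 - form2) < 1e-12  # algebraically identical
```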
def test_min_impurity_decrease():
    # test that min_impurity_decrease ensures a split is made only if
    # the impurity decrease is at least that value
    X, y = datasets.make_classification(n_samples=10000, random_state=42)
You should test regressors also no?
Yes! The `ALL_TREES[...]` contains regressors too... Just that I use the same classification data to test the regressors too...
# Test if min_impurity_split of base estimators is set
# Regression test for #8006
X, y = datasets.make_hastie_10_2(n_samples=100, random_state=1)
all_estimators = [GradientBoostingRegressor,
You need to test for random forests also?
Thanks! done in the latest commit..
I agree that the behaviour of...
It's the same expression as your one with the `fractional_weight` and the one documented in the criterion file. It is just that I find the latter easier to read, but it's fine. (I meant having the extra term is right and it wasn't reflected in the documentation)

LGTM!
sklearn/ensemble/forest.py
Outdated
.. versionadded:: 0.18
The weighted impurity decrease equation is the following:
Are we using the `::math` environment in the docstring?
@raghavrv will the math display correctly from lines 815-816? The `` tag will work properly, but does indenting alone work as intended?
LGTM. If you can address the one typesetting comment I'll go ahead and merge it.
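One way to sidestep the typesetting question is a reStructuredText literal block: indentation alone is not special to Sphinx, but a paragraph ending in `::` renders the following indented lines preformatted. A hypothetical docstring sketch (not copied from the PR):

```python
def example_param_doc():
    """Illustrative docstring sketch (hypothetical, not from scikit-learn).

    Parameters
    ----------
    min_impurity_decrease : float, optional (default=0.)
        A node will be split if this split induces a decrease of the
        impurity greater than or equal to this value.

        The weighted impurity decrease equation is the following::

            N_t / N * (impurity - N_t_R / N_t * right_impurity
                                - N_t_L / N_t * left_impurity)
    """

# The '::' literal-block marker survives into the rendered docs as-is:
print("::" in example_param_doc.__doc__)  # True
```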
@jmschrei @glemaitre Thanks for pointing that out! It was not displaying correctly before, but after the latest commit it should look like this.

Yohoo!! Thanks for the reviews and merge @jmschrei @MechCoder and @glemaitre :)

Nice :)

Sweet, thanks!
…ing based on impurity; Deprecate min_impurity_split (scikit-learn#8449) [MRG+2] ENH/FIX Introduce min_impurity_decrease param for early stopping based on impurity; Deprecate min_impurity_split
Requires scikit-learn >= 0.19 See scikit-learn/scikit-learn#8449 Fixes #11
Fixes #8400
Also ref Gilles' comment
This PR tries to stop splitting if the weighted impurity gain after a potential split is not above a user-given threshold...
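The quantity being thresholded can be illustrated with a tiny self-contained computation (this is an illustration of the weighted impurity decrease, not the scikit-learn implementation; the function names are made up):

```python
def gini(counts):
    """Gini impurity of a node given per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_decrease(parent, left, right, n_total):
    """N_t / N * (imp(parent) - N_L/N_t * imp(left) - N_R/N_t * imp(right))."""
    n_t, n_l, n_r = sum(parent), sum(left), sum(right)
    return (n_t / n_total) * (
        gini(parent) - (n_l / n_t) * gini(left) - (n_r / n_t) * gini(right))

# A 50/50 parent split into two pure children gives the maximal decrease:
print(weighted_decrease([50, 50], [50, 0], [0, 50], n_total=100))  # 0.5
```

A split would then be made only when this value is at least `min_impurity_decrease`.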
@amueller Can you try this on your use cases and see if it gives better control than `min_impurity_split`?

@jnothman @glouppe @nelson-liu @glemaitre @jmschrei