[MRG] FIX gradient boosting with sklearn estimator as init by jeremiedbb · Pull Request #12436 · scikit-learn/scikit-learn · GitHub

[MRG] FIX gradient boosting with sklearn estimator as init #12436


Closed
wants to merge 11 commits into master from gbc-with-init

Conversation

jeremiedbb
Member
@jeremiedbb jeremiedbb commented Oct 22, 2018

Fixes #10302, Fixes #12429, Fixes #2691

Gradient boosting used to fail when init was a sklearn estimator, which is a bit ironic :)
The issue was that the predict output didn't have the expected shape. And apparently there was no test for the init parameter with an estimator other than the default one.

Edit: Also accept an initial estimator that does not support sample weights, as long as the gradient boosting itself is not fitted with sample weights.
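
A minimal sketch of the failure (the regressor choice here is just illustrative, not from the PR):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=0)
# Before this fix: init_.predict(X) returns shape (n_samples,) while the
# boosting code expects (n_samples, 1), so fit() crashes.
GradientBoostingRegressor(init=LinearRegression()).fit(X, y)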

rth
rth previously approved these changes Oct 23, 2018
Member
@rth rth left a comment

I'm not too familiar with the GB code, but this looks good to me except for a minor comment below.

You also need to repeat "close" or "fix" in the description for every issue you want to auto-close, i.e. close #123, close #1234, etc.

I can confirm the added test fails on master.

Maybe @glemaitre would be able to do another review.

@@ -1421,7 +1421,9 @@ def fit(self, X, y, sample_weight=None, monitor=None):
self.init_.fit(X, y, sample_weight)

# init predictions
y_pred = self.init_.predict(X)
y_pred = self.init_.predict(X).astype(np.float64, copy=False)
if y_pred.shape == (X.shape[0],):
Member

Wouldn't,

y_pred.ndim == 1

be enough?

Member Author

probably :)

@@ -1421,7 +1421,9 @@ def fit(self, X, y, sample_weight=None, monitor=None):
self.init_.fit(X, y, sample_weight)

# init predictions
y_pred = self.init_.predict(X)
y_pred = self.init_.predict(X).astype(np.float64, copy=False)
Member

What happens if you don't cast it to 64bit?

Member Author

An error is raised due to unsafe casting because y_pred is updated later by the actual fit of the gradient boosting, and filled with float64 values.

y_pred[:, k] += (learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0))
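
A standalone illustration of that casting error (toy values, not from the PR):

import numpy as np

y_pred = np.zeros((5, 1), dtype=np.int64)  # e.g. integer class labels from predict()
update = 0.1 * np.ones(5)                  # float64, like the tree leaf values
try:
    y_pred[:, 0] += update                 # in-place ufunc uses "same_kind" casting
except TypeError as e:
    print(e)  # cannot cast the float64 result back into the int64 array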

@rth
Member
rth commented Oct 23, 2018

Also please add a what's new entry. I guess we could try to squeeze it into 0.20.1 since it's a long-standing issue that has been reported multiple times.

@jeremiedbb
Member Author

@stoddardg pointed out another related bug: when the init estimator does not support sample weights, fitting crashes even if one calls fit with sample_weight=None.
I'm removing the MRG tag temporarily, but I think I have a fix for that too.
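
A sketch of that bug, using a wrapper in the spirit of the _NoSampleWeightWrapper helper that appears in the test further down:

from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

class _NoSampleWeightWrapper:
    # hides the wrapped estimator's sample_weight support
    def __init__(self, est):
        self.est = est

    def fit(self, X, y):
        self.est.fit(X, y)
        return self

    def predict(self, X):
        return self.est.predict(X)

X, y = make_regression(random_state=0)
init = _NoSampleWeightWrapper(DummyRegressor())
# Crashed before the fix, even though no sample weights are passed:
GradientBoostingRegressor(init=init).fit(X, y)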

@jeremiedbb jeremiedbb changed the title [MRG] FIX gradient boosting with sklearn estimator as init [WIP] FIX gradient boosting with sklearn estimator as init Oct 23, 2018
@jeremiedbb jeremiedbb changed the title [WIP] FIX gradient boosting with sklearn estimator as init [MRG] FIX gradient boosting with sklearn estimator as init Oct 24, 2018
# fit initial model - FIXME make sample_weight optional
self.init_.fit(X, y, sample_weight)
# fit initial model
if has_fit_parameter(self.init_, "sample_weight"):
Member

We should avoid has_fit_parameter unless try-except really won't work.

Member

It means keyword args won't work. We should only use this kind of thing when we really need to take a different approach when sample_weight is not supported. (Feel free to document this caveat more in a separate PR.)
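
A small illustration of that caveat with a made-up estimator: signature inspection cannot see a sample_weight that is only accepted through **kwargs.

from sklearn.utils.validation import has_fit_parameter

class KwargsEstimator:
    def fit(self, X, y, **kwargs):
        # would happily accept sample_weight, but only via **kwargs
        return self

print(has_fit_parameter(KwargsEstimator(), "sample_weight"))  # False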

Member
@jnothman jnothman left a comment

I'm not certain this should go in 0.20.1, but whatever...?

Nice work!

@jnothman jnothman dismissed rth’s stale review October 30, 2018 07:33

New features added

@jeremiedbb
Member Author

@rth I made some changes since your last review. Maybe you want to take another look.

I'm not certain this should go in 0.20.1

It's not a regression from 0.19 so maybe it should go in 0.21 indeed.

@jeremiedbb jeremiedbb force-pushed the gbc-with-init branch 2 times, most recently from ae99693 to 92cb295 Compare November 8, 2018 13:08
rth
rth previously requested changes Nov 18, 2018
# init does not support sample weights
init_est = _NoSampleWeightWrapper(init())
gb(init=init_est).fit(X, y) # ok no sample weights
with pytest.raises(ValueError):
Member

To be more specific about the error,

Suggested change
with pytest.raises(ValueError):
with pytest.raises(
ValueError,
match=r'.*estimator.*does not support sample weights.*'):

gb, dataset_maker, init = task

X, y = dataset_maker()
sample_weight = np.random.rand(100)
Member

Let's use a RNG,

Suggested change
sample_weight = np.random.rand(100)
sample_weight = np.random.RandomState(42).rand(100)

# estimator.
# Check that an error is raised if trying to fit with sample weight but
# initial estimator does not support sample weight
gb, dataset_maker, init = task
Member

Maybe use a bit more explicit name here,

init_estimator

else:
raise ValueError(
"The initial estimator {} does not support sample "
"weights yet.".format(self.init_.__class__.__name__))
Member

Suggested change
"weights yet.".format(self.init_.__class__.__name__))
"weights.".format(self.init_.__class__.__name__))

Member Author

It was to point out that it's not a definitive statement :)

if sample_weight is None:
sample_weight = np.ones(n_samples, dtype=np.float32)
sample_weight_was_none = True
Member

Suggested change
sample_weight_was_none = True
sample_weight_is_none = True

else:
sample_weight = column_or_1d(sample_weight, warn=True)
sample_weight_was_none = False
Member

Suggested change
sample_weight_was_none = False
sample_weight_is_none = False

# fit initial model
try:
self.init_.fit(X, y, sample_weight=sample_weight)
except TypeError:
Member

Maybe calling inspect.signature(self.init_.fit) and adapting the kwargs accordingly would be cleaner? Here I wonder what happens if one gets a genuine, unrelated TypeError; understanding the traceback will then be somewhat difficult.

Member

Never mind, I see: it was the case originally and Joel requested the opposite change in #12436 (comment).
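
Pieced together from the hunks above, the pattern under discussion looks roughly like this (a sketch using the suggested sample_weight_is_none naming, not the exact diff):

# fit initial model
try:
    self.init_.fit(X, y, sample_weight=sample_weight)
except TypeError:
    if sample_weight_is_none:
        # init does not accept sample_weight, but the user did not
        # pass any either, so retry without it
        self.init_.fit(X, y)
    else:
        raise ValueError(
            "The initial estimator {} does not support sample "
            "weights.".format(self.init_.__class__.__name__))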

Member
@NicolasHug NicolasHug left a comment

Happy to see this fixed!

The current fix does not support multiclass classification; I added some suggestions.

@pytest.mark.parametrize(
"task",
[(GradientBoostingClassifier, make_classification, DummyClassifier),
(GradientBoostingRegressor, make_regression, DummyRegressor)],
Member
@NicolasHug NicolasHug Nov 28, 2018

Testing for multiclass classification currently fails:

def make_multiclass():
    return make_classification(n_classes=3, n_clusters_per_class=1)


@pytest.mark.parametrize(
    "task",
    [(GradientBoostingClassifier, make_classification, DummyClassifier),
     (GradientBoostingClassifier, make_multiclass, DummyClassifier),
     (GradientBoostingRegressor, make_regression, DummyRegressor)],
    ids=["classification", "multiclass", "regression"])
def test_gradient_boosting_with_init(task):
    ...

# will fail with make_multiclass

y_pred = self.init_.predict(X)
y_pred = self.init_.predict(X).astype(np.float64, copy=False)
if y_pred.ndim == 1:
y_pred = y_pred[:, np.newaxis]
Member

Suggested change
y_pred = y_pred[:, np.newaxis]
if self.loss_.K >= 2:
    # multiclass classification needs y_pred to be of shape
    # (n_samples, K) where K is the number of trees per
    # iteration, i.e. the number of classes.
    # see e.g. PriorProbabilityEstimator or ZeroEstimator
    n_samples = y_pred.shape[0]
    y_pred = np.tile(y_pred, self.n_classes_)
    y_pred = y_pred.reshape(n_samples, self.loss_.K)
else:
    y_pred = y_pred[:, np.newaxis]

Member Author

Good catch. Wouldn't it be better to make a one-hot encoding of y_pred?
I pushed it, let me know which version is best.

Member

Yes, one-hot seems to be the better option. I think my version would have been equivalent to a zero initialization (or any constant value).

BTW I realized we could use if loss.is_multi_class to be more explicit.
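
The one-hot variant looks roughly like this (a standalone sketch, not the exact pushed code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

X, y = make_classification(n_classes=3, n_clusters_per_class=1,
                           random_state=0)
init_estimator = DummyClassifier(strategy="most_frequent").fit(X, y)

classes = np.unique(y)
pred = init_estimator.predict(X)          # shape (n_samples,)
# one column per class, 1.0 where the init estimator predicted that class
y_pred = (pred[:, np.newaxis] == classes).astype(np.float64)
print(y_pred.shape)                       # (n_samples, n_classes)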

@amueller
Member

Should we detect that this is happening and bail out in the partial dependence plot? Non-constant init estimators will make the partial dependence plot misleading or wrong.

@NicolasHug
Member

@amueller what do you mean "this is happening"? I don't think this is related to partial dependence plot, we're simply fixing a shape issue here.

@amueller
Member

By "this" I mean "a non-constant init estimator". So far in our implementation init estimators were always constant. After this patch they will not be.
In the limited setting of before the patch, our partial dependence plot implementation was meaningful, in the more general setting after the patch, it is not.

@rth rth dismissed their stale review November 30, 2018 12:25

outdated review

@NicolasHug
Member

@amueller I guess we could merge this as-is, and simply add a warning in the doc in #12599 about this?

Member
@jnothman jnothman left a comment

LGTM again! Thanks @NicolasHug!

@NicolasHug
Member
NicolasHug commented Dec 26, 2018

Thinking about this again, I don't think it makes any sense to have an init predictor in a classification setting (it's OK for regression).

This y_pred variable that we are initializing here contains the sum of the raw values from the trees, for each training sample. Those trees are all regression trees, even for classification problems, so y_pred contains continuous values. For classification, those raw values in y_pred first have to go through a sigmoid or a softmax to be converted into probabilities, and then into classes.

So currently init_.predict() will predict classes, while it should instead be predicting continuous values that somehow make sense. By 'making sense' here I mean values that minimize the loss, as done by the LogOddsEstimator class used by default.

This is not a problem in regression because ensemble.predict(X) == sum(tree.predict(X) for tree in estimators_): the prediction of the ensemble is homogeneous to the predictions of the trees, and doesn't need to go through another function (sigmoid or softmax) to be converted into a final prediction.
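
A quick numeric check of that identity on a recent scikit-learn (up to the learning rate and the init term, which the shorthand above elides):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(random_state=0)
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

trees = gbr.estimators_[:, 0]               # estimators_ has shape (n_stages, 1)
raw = sum(tree.predict(X) for tree in trees)
manual = gbr.init_.predict(X) + gbr.learning_rate * raw
print(np.allclose(manual, gbr.predict(X)))  # True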

I would propose to either:

  • deprecate the initial predictor for classification and ultimately remove it, or
  • only accept a predictor that has predict_proba and pass those probabilities through log(p / (1 - p)) (or the equivalent for multiclass classification).

Slightly related: lots of the docstrings actually state that y_pred contains the predicted labels, but that's not true: those are the values in the tree leaves. I'm to blame for this and will provide a fix later.

Hope it's clear... I can try to clarify if needed

@jnothman
Member
jnothman commented Dec 27, 2018 via email

@NicolasHug
Member

Is a classifier with decision_function appropriate?

That would be slightly less inappropriate than using predict but still not ideal, I think.

In binary classification:

  • at any point in fit(), y_pred contains the sum of the raw values from the trees for each training sample.
  • sigmoid(y_pred) is the probability that each sample belongs to the positive class, and thresholding sigmoid(y_pred) at 0.5 gives the predicted labels.
  • Before fitting, the best initial prediction we can have is just to predict the most common class in the training set. Denoting p = y_train.mean() the proportion of positive examples in the training set, the best we can do is that predict_proba returns p for each sample. To achieve this, we set y_pred to sigmoid^{-1}(p) = log(p / (1 - p)), since this will result in predicted_proba = sigmoid(y_pred) = sigmoid(sigmoid^{-1}(p)) = p. That's what the default LogOddsEstimator is doing.

My proposal is to replace the constant p by some_estimator.predict_proba(X_train).

It does not make sense to set y_pred to some_estimator.predict(X_train) because then sigmoid(y_pred) isn't the probability of anything meaningful.
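
A tiny numeric check of the log-odds identity above (the class balance p is made up):

import numpy as np

p = 0.7                        # proportion of positive training samples
raw = np.log(p / (1 - p))      # sigmoid^{-1}(p), what LogOddsEstimator stores
print(1 / (1 + np.exp(-raw)))  # 0.7: sigmoid(log(p / (1 - p))) == p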

@amueller
Member

From discussing with @NicolasHug I think the solution in this PR is not correct for the classification case and we should do what @NicolasHug proposes (which actually also cleans up the GradientBoosting interface).

@NicolasHug
Member

I propose the following:

Each loss function should have a get_init_predictions(self, X_train, init_estimator) method that returns the initial (raw) predictions. init_estimator would be an estimator already fitted on X_train, y_train.

For the binary classification loss, this would look like

def get_init_predictions(self, X_train, init_estimator):
    p = init_estimator.predict_proba(X_train)
    return np.log(p / (1 - p))

while for least squares this would be

def get_init_predictions(self, X_train, init_estimator):
    return init_estimator.predict(X_train)

The built-in init estimators like LogOddsEstimator would need to be changed to only return p and not log(p / (1 - p)). We might even completely remove those classes, since they could be treated as the default when init_estimator=None in loss.get_init_predictions.

It would be similar to what we do in pygbm, except that pygbm does not accept custom init estimators.


@jeremiedbb you probably didn't sign up for this so if you want, I can take it up. I was planning on refactoring the GB code anyway (putting the losses in another file, renaming y_pred to raw_predictions, and fixing the incorrect docstrings).

@jnothman
Member
jnothman commented Jan 2, 2019 via email

@NicolasHug
Member

I don't think so, because y_pred is not a probability. Also, sigmoid is only used in binary classification to transform the raw predictions (i.e. the output of decision_function) into probabilities; sigmoid isn't used in regression or in multiclass classification.

@jeremiedbb
Member Author

@jeremiedbb you probably didn't sign up for this so if you want, I can take it up.

@NicolasHug feel free to take over. I won't have much time to work on it, and this is going out of my comfort zone :)

@jeremiedbb jeremiedbb deleted the gbc-with-init branch July 20, 2020 14:59