FIX several bugs in initial predictions for GBDT #12983
Conversation
Partial review
@@ -61,7 +63,15 @@
from ..exceptions import NotFittedError


class QuantileEstimator(object):
    # 0.23
You can add a FIXME:
                dtype=np.float64)
        else:
            try:
                self.init_.fit(X, y, sample_weight=sample_weight)
I think that we are using something like that to check the support for sample_weight:

support_sample_weight = has_fit_parameter(self.init_, "sample_weight")
if not support_sample_weight and not sample_weight_is_none:
    raise ValueError("xxxx")
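A rough, self-contained sketch of that check. Only `has_fit_parameter` is the real scikit-learn utility; the helper name and error message are made up for illustration and are not part of this PR:

```python
from sklearn.utils.validation import has_fit_parameter


def fit_init_estimator(init_estimator, X, y, sample_weight=None):
    # Forward sample_weight only if the init estimator's fit() accepts it;
    # otherwise fail loudly when weights were explicitly provided.
    if has_fit_parameter(init_estimator, "sample_weight"):
        init_estimator.fit(X, y, sample_weight=sample_weight)
    elif sample_weight is None:
        init_estimator.fit(X, y)
    else:
        raise ValueError(
            "The initial estimator {} does not support sample weights."
            .format(init_estimator.__class__.__name__))
    return init_estimator
```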
        If None it uses ``loss.init_estimator``.

    init : estimator or 'zero', optional (default=None)
        An estimator object that is used to compute the initial predictions.
        ``init`` has to provide `fit` and `predict_proba`. If 'zero', the
are we using double or single backticks?
I think we use single backticks for anything that has a glossary entry (that will automatically link it) and double backticks for everything else (to avoid Sphinx warnings).
@glemaitre Once #13000 gets merged, we do not need to worry about this anymore! Single backticks without a reference will automatically be treated as double backticks!
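For what it's worth, a made-up docstring snippet (not from the PR) showing the convention being discussed:

```python
def fit(X, y=None):
    """Illustrate the backtick convention from the discussion above.

    Parameters
    ----------
    y : array-like, shape (n_samples,), optional
        Target values, or ``None`` for unsupervised fitting. ``None`` uses
        double backticks because it is a plain literal; a term with a
        glossary entry, e.g. `predict_proba`, uses single backticks so
        Sphinx can turn it into a link.
    """
    return X, y
```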
sklearn/ensemble/losses.py (Outdated)
        Number of classes
    """
    def init_estimator(self):
        # Predicts the median
I think that you can remove this comment. Strategy 'quantile' with 0.5 should be self-explanatory, and the same goes for the Least Absolute Error one.
Another partial review
sklearn/ensemble/losses.py (Outdated)
    raw_predictions : array, shape (n_samples, K)
        The raw_predictions (i.e. values from the tree leaves)

    sample_weight : array-like, shape (n_samples,), optional
is it array-like?
sklearn/ensemble/losses.py (Outdated)
        the does not support probabilities raises AttributeError.
        """
        raise TypeError(
            '%s does not support predict_proba' % type(self).__name__)
`self.__class__.__name__`, this might be the same actually
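They are indeed the same for any ordinary instance; a tiny illustration (not from the PR):

```python
class SomeLoss:
    pass

obj = SomeLoss()
# type(obj) and obj.__class__ refer to the same class object here,
# so both spellings yield the same name string.
assert type(obj).__name__ == obj.__class__.__name__ == "SomeLoss"
```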
No, the problem isn't related to the validation data. It's related to the training data. Basically, the raised ValueErrors in e2e1a5c may happen in 2 scenarios:
I'm sorry, I find it hard to explain clearly and concisely. I think that using stratify in
The init predictor's output can be fixed by aligning it up through `classes_`, I think. But stratifying does seem a reasonable fix. Does this happen elsewhere in the library with validation splits?
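A rough sketch of that alignment idea (the helper and its name are hypothetical, not part of this PR): map the init estimator's `predict_proba` columns, which only cover its own `classes_`, onto the full class set.

```python
import numpy as np


def align_init_proba(init_estimator, X, all_classes):
    """Expand predict_proba output to cover every class in all_classes,
    leaving zero probability for classes the init estimator never saw."""
    proba = init_estimator.predict_proba(X)
    full = np.zeros((X.shape[0], len(all_classes)), dtype=np.float64)
    # classes_ and all_classes are sorted arrays, so searchsorted gives the
    # column index of each observed class within the full class set.
    cols = np.searchsorted(all_classes, init_estimator.classes_)
    full[:, cols] = proba
    return full
```

Note that the zero-probability columns would still be problematic once converted to log-odds, which may be why a plain alignment is not a complete fix on its own.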
Yes, but we still need to identify which of the classes are missing in
I don't think the same issue happens in other estimators. This issue is happening because our main estimator (GBDT) relies on another estimator (the init estimator) which is passed only a subset of the data, where some classes might be missing because of early stopping. GBDTs are the only estimators that use early stopping + another estimator, as far as I know. I took a look at different early stopping implementations:
Would it be acceptable that I open a new PR to "bugfix" multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?
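For context, a minimal sketch of what a stratified split buys here (using `train_test_split` directly; the PR's actual change may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
# Very imbalanced target: class 2 has only 2 of the 100 samples.
y = np.array([0] * 49 + [1] * 49 + [2] * 2)

# With stratify=y, the rare class cannot be swallowed entirely by the
# validation fold, so the init estimator still sees every class.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

assert set(np.unique(y_train)) == set(np.unique(y))
```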
fine with me.
I am still +1 here.
Thanks for a phenomenal effort, @NicolasHug!
awesome! thanks heaps @NicolasHug <https://github.com/NicolasHug>!
Reference Issues/PRs
Closes #12436, a continuation of the work from @jeremiedbb.
What does this implement/fix? Explain your changes.
This PR:

- fixes several bugs related to the `init` estimator parameter in GBDTs
- documents `init='zero'`, which was supported but undocumented
- moves the losses out of `gradient_boosting.py` into a new `losses.py` file; they now have a new `get_init_raw_predictions` method
- uses `DummyClassifier` or `DummyRegressor` as the default init estimators
- renames `y_pred` into `raw_predictions`, and relevant (private) methods as well
- reports a difference of 0.000388 (std = 0.008266) between the 2 methods, with `n_estimators=1` and `n_samples=10000`
The `loss.get_init_raw_predictions(X, predictor)` methods return a `raw_prediction` with the correct shape `(n_samples, K)`, where `K` is the number of classes in multiclass classification, else 1.

Those `raw_predictions` are homogeneous to what the trees predict, not homogeneous to what the ensemble predicts. For regression GBDT this is the same thing, but for e.g. binary classification `raw_prediction` is homogeneous to a log-odds ratio.

For more, see #12436 (comment) and the following discussion (not sure if this is much clearer ^^)
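To make that last point concrete, here is an illustrative computation (not taken from the PR) of an initial raw prediction for binary classification from a prior-fitting dummy classifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([0, 0, 0, 1])   # positive class prior = 0.25
X = np.zeros((len(y), 1))    # features are ignored by DummyClassifier

init = DummyClassifier(strategy="prior").fit(X, y)
p = init.predict_proba(X)[:, 1]           # constant 0.25

# For binomial deviance the initial raw prediction is the log-odds of the
# prior: applying the sigmoid to it recovers the prior probability, just
# like applying it to the summed tree leaf values recovers predict_proba.
raw_prediction = np.log(p / (1 - p))
print(raw_prediction[0])                  # log(0.25 / 0.75) ≈ -1.10
```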
Any other comments?