[MRG] FIX gradient boosting with sklearn estimator as init by jeremiedbb · Pull Request #12436 · scikit-learn/scikit-learn · GitHub

[MRG] FIX gradient boosting with sklearn estimator as init #12436


Closed
wants to merge 11 commits into master from gbc-with-init

Conversation

jeremiedbb
Member
@jeremiedbb jeremiedbb commented Oct 22, 2018

Fixes #10302, Fixes #12429, Fixes #2691

Gradient boosting used to fail when init was a sklearn estimator, which is a bit ironic :)
The issue was that the predict output didn't have the expected shape. And apparently there was no test for the init parameter with an estimator other than the default one.

Edit: Also accept an initial estimator that does not support sample weights, as long as the gradient boosting itself is not fitted with sample weights.
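
A minimal sketch of the failure (the regressor choice here is just illustrative, not from the PR):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=0)
# Before this fix: init_.predict(X) returns shape (n_samples,) while the
# boosting code expects (n_samples, 1), so fit() crashes.
GradientBoostingRegressor(init=LinearRegression()).fit(X, y)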

rth
rth previously approved these changes Oct 23, 2018
Member
@rth rth left a comment

I'm not too familiar with the GB code, but this looks good to me except for a minor comment below.

You also need to repeat "close" or "fix" in the description for every issue you want to auto-close, i.e. close #123, close #1234, etc.

I can confirm the added test fails on master.

Maybe @glemaitre would be able to do another review.

@@ -1421,7 +1421,9 @@ def fit(self, X, y, sample_weight=None, monitor=None):
self.init_.fit(X, y, sample_weight)

# init predictions
y_pred = self.init_.predict(X)
y_pred = self.init_.predict(X).astype(np.float64, copy=False)
if y_pred.shape == (X.shape[0],):
Member

Wouldn't,

y_pred.ndim == 1

be enough?

Member Author

probably :)

@@ -1421,7 +1421,9 @@ def fit(self, X, y, sample_weight=None, monitor=None):
self.init_.fit(X, y, sample_weight)

# init predictions
y_pred = self.init_.predict(X)
y_pred = self.init_.predict(X).astype(np.float64, copy=False)
Member

What happens if you don't cast it to 64bit?

Member Author

An error is raised due to unsafe casting because y_pred is updated later by the actual fit of the gradient boosting, and filled with float64 values.

y_pred[:, k] += (learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0))
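
A standalone illustration of that casting error (toy values, not from the PR):

import numpy as np

y_pred = np.zeros((5, 1), dtype=np.int64)  # e.g. integer class labels from predict()
update = 0.1 * np.ones(5)                  # float64, like the tree leaf values
try:
    y_pred[:, 0] += update                 # in-place ufunc uses "same_kind" casting
except TypeError as e:
    print(e)  # cannot cast the float64 result back into the int64 array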

@rth
Member
rth commented Oct 23, 2018

Also please add a what's new entry. I guess we could try to squeeze it into 0.20.1 since it's a long-standing issue that has been reported multiple times.

@jeremiedbb
Member Author

@stoddardg pointed out another related bug: when the init estimator does not support sample weights, fitting crashes even if one calls fit with sample_weight=None.
I'm removing the MRG tag temporarily, but I think I have a fix for that too.
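
A sketch of that bug, using a wrapper in the spirit of the _NoSampleWeightWrapper helper that appears in the test further down:

from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

class _NoSampleWeightWrapper:
    # hides the wrapped estimator's sample_weight support
    def __init__(self, est):
        self.est = est

    def fit(self, X, y):
        self.est.fit(X, y)
        return self

    def predict(self, X):
        return self.est.predict(X)

X, y = make_regression(random_state=0)
init = _NoSampleWeightWrapper(DummyRegressor())
# Crashed before the fix, even though no sample weights are passed:
GradientBoostingRegressor(init=init).fit(X, y)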

@jeremiedbb jeremiedbb changed the title [MRG] FIX gradient boosting with sklearn estimator as init [WIP] FIX gradient boosting with sklearn estimator as init Oct 23, 2018
@jeremiedbb jeremiedbb changed the title [WIP] FIX gradient boosting with sklearn estimator as init [MRG] FIX gradient boosting with sklearn estimator as init Oct 24, 2018
# fit initial model - FIXME make sample_weight optional
self.init_.fit(X, y, sample_weight)
# fit initial model
if has_fit_parameter(self.init_, "sample_weight"):
Member

We should avoid has_fit_parameter unless try-except really won't work.

Member

It means keyword args won't work. We should only use this kind of thing when we really need to take a different approach when sample_weight is not supported. (Feel free to document this caveat more in a separate PR.)
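
A small illustration of that caveat with a made-up estimator: signature inspection cannot see a sample_weight that is only accepted through **kwargs.

from sklearn.utils.validation import has_fit_parameter

class KwargsEstimator:
    def fit(self, X, y, **kwargs):
        # would happily accept sample_weight, but only via **kwargs
        return self

print(has_fit_parameter(KwargsEstimator(), "sample_weight"))  # False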

Member
@jnothman jnothman left a comment

I'm not certain this should go in 0.20.1, but whatever...?

Nice work!

@jnothman jnothman dismissed rth’s stale review October 30, 2018 07:33

New features added

@jeremiedbb
Member Author

@rth I made some changes since your last review. Maybe you want to take another look.

I'm not certain this should go in 0.20.1

It's not a regression from 0.19 so maybe it should go in 0.21 indeed.

@jeremiedbb jeremiedbb force-pushed the gbc-with-init branch 2 times, most recently from ae99693 to 92cb295 Compare November 8, 2018 13:08
rth
rth previously requested changes Nov 18, 2018
# init does not support sample weights
init_est = _NoSampleWeightWrapper(init())
gb(init=init_est).fit(X, y) # ok no sample weights
with pytest.raises(ValueError):
Member

To be more specific about the error,

Suggested change
with pytest.raises(ValueError):
with pytest.raises(
ValueError,
match=r'.*estimator.*does not support sample weights.*'):

gb, dataset_maker, init = task

X, y = dataset_maker()
sample_weight = np.random.rand(100)
Member

Let's use a RNG,

Suggested change
sample_weight = np.random.rand(100)
sample_weight = np.random.RandomState(42).rand(100)

# estimator.
# Check that an error is raised if trying to fit with sample weight but
# initial estimator does not support sample weight
gb, dataset_maker, init = task
Member

Maybe use a bit more explicit name here,

init_estimator

else:
raise ValueError(
"The initial estimator {} does not support sample "
"weights yet.".format(self.init_.__class__.__name__))
Member

Suggested change
"weights yet.".format(self.init_.__class__.__name__))
"weights.".format(self.init_.__class__.__name__))

Member Author

It was to point out that it's not a definitive statement :)

if sample_weight is None:
sample_weight = np.ones(n_samples, dtype=np.float32)
sample_weight_was_none = True
Member

Suggested change
sample_weight_was_none = True
sample_weight_is_none = True

else:
sample_weight = column_or_1d(sample_weight, warn=True)
sample_weight_was_none = False
Member

Suggested change
sample_weight_was_none = False
sample_weight_is_none = False

# fit initial model
try:
self.init_.fit(X, y, sample_weight=sample_weight)
except TypeError:
Member

Maybe calling inspect.signature(self.init_.fit) and adapting the kwargs accordingly would be cleaner? Here I wonder what happens if one gets a genuine, unrelated TypeError; understanding the traceback will then be somewhat difficult.

Member

Never mind, I see: it was the case originally and Joel requested the opposite change in #12436 (comment).
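
Pieced together from the hunks above, the pattern under discussion looks roughly like this (a sketch using the suggested sample_weight_is_none naming, not the exact diff):

# fit initial model
try:
    self.init_.fit(X, y, sample_weight=sample_weight)
except TypeError:
    if sample_weight_is_none:
        # init does not accept sample_weight, but the user did not
        # pass any either, so retry without it
        self.init_.fit(X, y)
    else:
        raise ValueError(
            "The initial estimator {} does not support sample "
            "weights.".format(self.init_.__class__.__name__))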

Member
@NicolasHug NicolasHug left a comment

Happy to see this fixed!

The current fix does not support multiclass classification; I added some suggestions.

@pytest.mark.parametrize(
"task",
[(GradientBoostingClassifier, make_classification, DummyClassifier),
(GradientBoostingRegressor, make_regression, DummyRegressor)],
Member
@NicolasHug NicolasHug Nov 28, 2018

Testing for multiclass classification currently fails:

def make_multiclass():
    return make_classification(n_classes=3, n_clusters_per_class=1)


@pytest.mark.parametrize(
    "task",
    [(GradientBoostingClassifier, make_classification, DummyClassifier),
     (GradientBoostingClassifier, make_multiclass, DummyClassifier),
     (GradientBoostingRegressor, make_regression, DummyRegressor)],
    ids=["classification", "multiclass", "regression"])
def test_gradient_boosting_with_init(task):
    ...

# will fail with make_multiclass

y_pred = self.init_.predict(X)
y_pred = self.init_.predict(X).astype(np.float64, copy=False)
if y_pred.ndim == 1:
y_pred = y_pred[:, np.newaxis]
Member

Suggested change
y_pred = y_pred[:, np.newaxis]
if self.loss_.K >= 2:
    # multiclass classification needs y_pred to be of shape
    # (n_samples, K) where K is the number of trees per
    # iteration, i.e. the number of classes.
    # see e.g. PriorProbabilityEstimator or ZeroEstimator
    n_samples = y_pred.shape[0]
    y_pred = np.tile(y_pred, self.n_classes_)
    y_pred = y_pred.reshape(n_samples, self.loss_.K)
else:
    y_pred = y_pred[:, np.newaxis]

Member Author

Good catch. Wouldn't it be better to make a one-hot encoding of y_pred?
I pushed it, let me know which version is best.

Member

Yes, one-hot seems to be the better option. I think my version would have been equivalent to a zero initialization (or any constant value).

BTW I realized we could use if loss.is_multi_class to be more explicit.
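
The one-hot variant looks roughly like this (a standalone sketch, not the exact pushed code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

X, y = make_classification(n_classes=3, n_clusters_per_class=1,
                           random_state=0)
init_estimator = DummyClassifier(strategy="most_frequent").fit(X, y)

classes = np.unique(y)
pred = init_estimator.predict(X)          # shape (n_samples,)
# one column per class, 1.0 where the init estimator predicted that class
y_pred = (pred[:, np.newaxis] == classes).astype(np.float64)
print(y_pred.shape)                       # (n_samples, n_classes)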

@amueller
Member

Should we detect that this is happening and bail out in the partial dependence plot? Non-constant init estimators will make the partial dependence plot misleading or wrong.

@NicolasHug
Member

@amueller what do you mean "this is happening"? I don't think this is related to partial dependence plot, we're simply fixing a shape issue here.

@amueller
Member

By "this" I mean "a non-constant init estimator". So far in our implementation init estimators were always constant. After this patch they will not be.
In the limited setting of before the patch, our partial dependence plot implementation was meaningful, in the more general setting after the patch, it is not.

@rth rth dismissed their stale review November 30, 2018 12:25

outdated review

@NicolasHug
Member

@amueller I guess we could merge this as-is, and simply add a warning in the doc in #12599 about this?

Member
@jnothman jnothman left a comment

LGTM again! Thanks @NicolasHug!

@NicolasHug
Member
NicolasHug commented Dec 26, 2018

Thinking about this again, I don't think it makes any sense to have an init predictor in a classification setting (it's OK for regression).

This y_pred variable that we are initializing here contains the sum of the raw values from the trees, for each training sample. Those trees are all regression trees, even for classification problems, so y_pred contains continuous values. For classification, those raw values in y_pred first have to go through a sigmoid or a softmax to be converted into probabilities, and then into classes.

So currently init_.predict() will predict classes, while it should instead be predicting continuous values that somehow make sense. By 'making sense' here I mean values that minimize the loss, as done by the LogOddsEstimator class used by default.

This is not a problem in regression because ensemble.predict(X) == sum(tree.predict(X) for tree in estimators_): the prediction of the ensemble is homogeneous to the predictions of the trees, and doesn't need to go through another function (sigmoid or softmax) to be converted into a final prediction.
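
A quick numeric check of that identity on a recent scikit-learn (up to the learning rate and the init term, which the shorthand above elides):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(random_state=0)
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

trees = gbr.estimators_[:, 0]               # estimators_ has shape (n_stages, 1)
raw = sum(tree.predict(X) for tree in trees)
manual = gbr.init_.predict(X) + gbr.learning_rate * raw
print(np.allclose(manual, gbr.predict(X)))  # True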

I would propose to either:

  • deprecate the initial predictor for classification and ultimately remove it, or
  • only accept a predictor that has predict_proba and pass those probabilities through log(p / (1 - p)) (or the equivalent for multiclass classification).

Slightly related: lots of the docstrings actually state that y_pred contains the predicted labels, but that's not true: those are the values in the tree leaves. I'm to blame for this and will provide a fix later.

Hope it's clear... I can try to clarify if needed

@jnothman
Member
jnothman commented Dec 27, 2018 via email

@NicolasHug
Member

Is a classifier with decision_function appropriate?

That would be slightly less inappropriate than using predict but still not ideal, I think.

In binary classification:

  • at any point in fit(), y_pred contains the sum of the raw values from the trees for each training sample.
  • sigmoid(y_pred) is the probability that each sample belongs to the positive class, and thresholding sigmoid(y_pred) at 0.5 gives the predicted labels.
  • Before fitting, the best initial prediction we can have is just to predict the most common class in the training set. Denoting p = y_train.mean() the proportion of positive examples in the training set, the best we can do is that predict_proba returns p for each sample. To achieve this, we set y_pred to sigmoid^{-1}(p) = log(p / (1 - p)), since this will result in predicted_proba = sigmoid(y_pred) = sigmoid(sigmoid^{-1}(p)) = p. That's what the default LogOddsEstimator is doing.

My proposal is to replace the constant p by some_estimator.predict_proba(X_train).

It does not make sense to set y_pred to some_estimator.predict(X_train) because then sigmoid(y_pred) isn't the probability of anything meaningful.
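
A tiny numeric check of the log-odds identity above (the class balance p is made up):

import numpy as np

p = 0.7                        # proportion of positive training samples
raw = np.log(p / (1 - p))      # sigmoid^{-1}(p), what LogOddsEstimator stores
print(1 / (1 + np.exp(-raw)))  # 0.7: sigmoid(log(p / (1 - p))) == p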

@amueller
Member

From discussing with @NicolasHug I think the solution in this PR is not correct for the classification case and we should do what @NicolasHug proposes (which actually also cleans up the GradientBoosting interface).

@NicolasHug
Member

I propose the following:

Each loss function should have a get_init_predictions(self, X_train, init_estimator) method that returns the initial (raw) predictions. init_estimator would be an estimator already fitted on X_train, y_train.

For the binary classification loss, this would look like

def get_init_predictions(self, X_train, init_estimator):
    p = init_estimator.predict_proba(X_train)
    return np.log(p / (1 - p))

while for least squares this would be

def get_init_predictions(self, X_train, init_estimator):
    return init_estimator.predict(X_train)

The built-in init estimators like LogOddsEstimator would need to be changed to only return p and not log(p / (1 - p)). We might even completely remove those classes, since they could be treated as the default when init_estimator=None in loss.get_init_predictions.

It would be similar to what we do in pygbm, except that pygbm does not accept custom init estimators.


@jeremiedbb you probably didn't sign up for this so if you want, I can take it up. I was planning on refactoring the GB code anyway (putting the losses in another file, renaming y_pred to raw_predictions, and fixing the incorrect docstrings).

@jnothman
Member
jnothman commented Jan 2, 2019 via email

@NicolasHug
Member

I don't think so, because y_pred is not a probability. Also, sigmoid is only used in binary classification to transform the raw predictions (i.e. the output of decision_function) into probabilities; sigmoid isn't used in regression or in multiclass classification.

@jeremiedbb
Member Author

@jeremiedbb you probably didn't sign up for this so if you want, I can take it up.

@NicolasHug feel free to take over. I won't have much time to work on it, and this is going out of my comfort zone :)

@jeremiedbb jeremiedbb deleted the gbc-with-init branch July 20, 2020 14:59