FIX several bugs in initial predictions for GBDT #12983
Conversation
Partial review
@@ -61,7 +63,15 @@
from ..exceptions import NotFittedError


class QuantileEstimator(object):
    # 0.23
You can add a FIXME:
                dtype=np.float64)
        else:
            try:
                self.init_.fit(X, y, sample_weight=sample_weight)
I think that we are using something like that to check the support for sample_weight:

support_sample_weight = has_fit_parameter(self.init_, "sample_weight")
if not support_sample_weight and not sample_weight_is_none:
    raise ValueError("xxxx")
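A rough, self-contained sketch of that check. Only `has_fit_parameter` is the real scikit-learn utility; the helper name and error message are made up for illustration and are not part of this PR:

```python
from sklearn.utils.validation import has_fit_parameter


def fit_init_estimator(init_estimator, X, y, sample_weight=None):
    # Forward sample_weight only if the init estimator's fit() accepts it;
    # otherwise fail loudly when weights were explicitly provided.
    if has_fit_parameter(init_estimator, "sample_weight"):
        init_estimator.fit(X, y, sample_weight=sample_weight)
    elif sample_weight is None:
        init_estimator.fit(X, y)
    else:
        raise ValueError(
            "The initial estimator {} does not support sample weights."
            .format(init_estimator.__class__.__name__))
    return init_estimator
```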
        If None it uses ``loss.init_estimator``.

    init : estimator or 'zero', optional (default=None)
        An estimator object that is used to compute the initial predictions.
        ``init`` has to provide `fit` and `predict_proba`. If 'zero', the
are we using double or single backticks?
I think we use single backticks for anything that has a glossary entry (that will automatically link it) and double backticks for everything else (to avoid Sphinx warnings).
@glemaitre Once #13000 gets merged, we do not need to worry about this anymore! Single backticks without a reference will automatically be treated as double backticks!
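For what it's worth, a made-up docstring snippet (not from the PR) showing the convention being discussed:

```python
def fit(X, y=None):
    """Illustrate the backtick convention from the discussion above.

    Parameters
    ----------
    y : array-like, shape (n_samples,), optional
        Target values, or ``None`` for unsupervised fitting. ``None`` uses
        double backticks because it is a plain literal; a term with a
        glossary entry, e.g. `predict_proba`, uses single backticks so
        Sphinx can turn it into a link.
    """
    return X, y
```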
sklearn/ensemble/losses.py (Outdated)
        Number of classes
    """
    def init_estimator(self):
        # Predicts the median
I think that you can remove this comment. Strategy 'quantile' with 0.5 should be self-explanatory, and the same goes for the Least Absolute Error one.
Another partial review
sklearn/ensemble/losses.py (Outdated)
    raw_predictions : array, shape (n_samples, K)
        The raw_predictions (i.e. values from the tree leaves)

    sample_weight : array-like, shape (n_samples,), optional
is it array-like?
sklearn/ensemble/losses.py (Outdated)
        the does not support probabilities raises AttributeError.
        """
        raise TypeError(
            '%s does not support predict_proba' % type(self).__name__)
`self.__class__.__name__`, this might be the same actually
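They are indeed the same for any ordinary instance; a tiny illustration (not from the PR):

```python
class SomeLoss:
    pass

obj = SomeLoss()
# type(obj) and obj.__class__ refer to the same class object here,
# so both spellings yield the same name string.
assert type(obj).__name__ == obj.__class__.__name__ == "SomeLoss"
```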
No, the problem isn't related to the validation data. It's related to the training data. Basically, the raised ValueErrors in e2e1a5c may happen in 2 scenarios:
I'm sorry, I find it hard to explain clearly and concisely. I think that using stratify in
The init predictor's output can be fixed by aligning it up through `classes_`, I think. But stratifying does seem a reasonable fix. Does this happen elsewhere in the library with validation splits?
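A rough sketch of that alignment idea (the helper and its name are hypothetical, not part of this PR): map the init estimator's `predict_proba` columns, which only cover its own `classes_`, onto the full class set.

```python
import numpy as np


def align_init_proba(init_estimator, X, all_classes):
    """Expand predict_proba output to cover every class in all_classes,
    leaving zero probability for classes the init estimator never saw."""
    proba = init_estimator.predict_proba(X)
    full = np.zeros((X.shape[0], len(all_classes)), dtype=np.float64)
    # classes_ and all_classes are sorted arrays, so searchsorted gives the
    # column index of each observed class within the full class set.
    cols = np.searchsorted(all_classes, init_estimator.classes_)
    full[:, cols] = proba
    return full
```

Note that the zero-probability columns would still be problematic once converted to log-odds, which may be why a plain alignment is not a complete fix on its own.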
Yes, but we still need to identify which of the classes are missing in
I don't think the same issue happens in other estimators. This issue is happening because our main estimator (GBDT) relies on another estimator (the init estimator) which is passed only a subset of the data, where some classes might be missing because of early stopping. GBDTs are the only estimators that use early stopping + another estimator, as far as I know. I took a look at different early stopping implementations:
Would it be acceptable that I open a new PR to "bugfix" multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?
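For context, a minimal sketch of what a stratified split buys here (using `train_test_split` directly; the PR's actual change may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
# Very imbalanced target: class 2 has only 2 of the 100 samples.
y = np.array([0] * 49 + [1] * 49 + [2] * 2)

# With stratify=y, the rare class cannot be swallowed entirely by the
# validation fold, so the init estimator still sees every class.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

assert set(np.unique(y_train)) == set(np.unique(y))
```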
fine with me.
I am still +1 here.
Thanks for a phenomenal effort, @NicolasHug!
awesome! thanks heaps @NicolasHug <https://github.com/NicolasHug>!
Reference Issues/PRs
Closes #12436, a continuation of the work from @jeremiedbb.
What does this implement/fix? Explain your changes.
This PR:

- fixes several bugs related to the `init` estimator parameter in GBDTs
- documents `init='zero'`, which was supported but undocumented
- moves the losses out of `gradient_boosting.py` into a new `losses.py` file; they now have a new `get_init_raw_predictions` method
- uses `DummyClassifier` or `DummyRegressor` as the default init estimators
- renames `y_pred` into `raw_predictions`, and relevant (private) methods as well
- reports a difference of 0.000388 (std = 0.008266) between the 2 methods, with `n_estimators=1` and `n_samples=10000`
The `loss.get_init_raw_predictions(X, predictor)` methods return a `raw_prediction` with the correct shape `(n_samples, K)`, where `K` is the number of classes in multiclass classification, else 1.

Those `raw_predictions` are homogeneous to what the trees predict, not homogeneous to what the ensemble predicts. For regression GBDT this is the same thing, but for e.g. binary classification `raw_prediction` is homogeneous to a log-odds ratio.

For more, see #12436 (comment) and the following discussion (not sure if this is much clearer ^^)
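To make that last point concrete, here is an illustrative computation (not taken from the PR) of an initial raw prediction for binary classification from a prior-fitting dummy classifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([0, 0, 0, 1])   # positive class prior = 0.25
X = np.zeros((len(y), 1))    # features are ignored by DummyClassifier

init = DummyClassifier(strategy="prior").fit(X, y)
p = init.predict_proba(X)[:, 1]           # constant 0.25

# For binomial deviance the initial raw prediction is the log-odds of the
# prior: applying the sigmoid to it recovers the prior probability, just
# like applying it to the summed tree leaf values recovers predict_proba.
raw_prediction = np.log(p / (1 - p))
print(raw_prediction[0])                  # log(0.25 / 0.75) ≈ -1.10
```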
Any other comments?