[MRG] FIX ensure equivalence between resampling and sample_weight in AdaBoost #14252
Conversation
So we previously supported regressors without `sample_weight` support, but we did so incorrectly?
```python
estimator.fit(X_, y_)
# create a bootstrap sample with observations having non-null weights
n_samples = X.shape[0]
sample_mask = sample_weight > 0
```
Here, I assume that we cannot have negative weights. Does that make sense?
The bootstrapping was equivalent to what was done previously. However, the previous way of constructing the bootstrap sample did not ensure that passing `sample_weight` was equivalent to resampling `X` and `y`.

Then, there is a bug when computing the error vector: we should not consider the samples with 0 weight when finding the maximum used to normalize the computed error (which influences the weight of each weak learner).

Now I see that we could break code which was using a regressor without `sample_weight` support.
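To illustrate the error-vector point, a minimal sketch (the helper name `normalized_error_vector` is hypothetical; it mirrors the masking idea described above, not the exact scikit-learn code):

```python
import numpy as np

def normalized_error_vector(y_true, y_pred, sample_weight):
    """Absolute errors normalized by the max over non-null-weight samples."""
    error_vect = np.abs(y_pred - y_true)
    sample_mask = sample_weight > 0  # assumes non-negative weights
    error_max = error_vect[sample_mask].max()
    if error_max != 0:
        # zero-weight samples must not drive the normalization constant
        error_vect /= error_max
    return error_vect
```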
```python
@staticmethod
def _fit_estimator(estimator, X, y, sample_weight):
    """Fit an estimator checking support for sample weight."""
    try:
```
@jnothman I meant something like:
```python
try:
    return estimator.fit(X, y, sample_weight=sample_weight)
except TypeError:
    # generate a bootstrap based on sample_weight
    bootstrap_idx = random_state.choice(
        np.arange(sample_weight.size), size=sample_weight.size,
        replace=True, p=sample_weight
    )
    return estimator.fit(safe_indexing(X, bootstrap_idx),
                         safe_indexing(y, bootstrap_idx))
```
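For completeness, a self-contained sketch of that fallback (the function name is hypothetical, `X` and `y` are assumed to be NumPy arrays, and the weights are normalized so that `choice` receives a valid probability vector):

```python
import numpy as np

def fit_with_fallback(estimator, X, y, sample_weight, random_state):
    """Try fitting with sample_weight; emulate it by resampling otherwise."""
    try:
        return estimator.fit(X, y, sample_weight=sample_weight)
    except TypeError:
        # the estimator does not accept sample_weight: draw a bootstrap
        # sample with probability proportional to the weights instead
        p = sample_weight / sample_weight.sum()
        bootstrap_idx = random_state.choice(
            sample_weight.size, size=sample_weight.size, replace=True, p=p
        )
        return estimator.fit(X[bootstrap_idx], y[bootstrap_idx])
```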
```python
sample_weight_bootstrap = (np.ones((n_samples,)) *
                           np.bincount(indices, minlength=n_samples))
# combine the bootstrap sample with the original sample_weight
sample_weight_bootstrap *= sample_weight
```
This might be normalized to sum to 1
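Concretely, a small sketch of the weight combination above together with the suggested normalization (`indices` plays the role of the bootstrap indices; the weight values are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 5
sample_weight = np.array([0.1, 0.4, 0.2, 0.2, 0.1])

# express the bootstrap sample as per-sample repetition counts
indices = rng.choice(n_samples, size=n_samples, replace=True)
sample_weight_bootstrap = sample_weight * np.bincount(indices,
                                                      minlength=n_samples)

# suggested here: renormalize so the combined weights sum to 1
sample_weight_bootstrap /= sample_weight_bootstrap.sum()
```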
Uhm, I am having second thoughts here. The failure of all the ensemble algorithms is due to the bootstrap sample. For instance, let's define … When using …
I am more inclined toward 2. @ogrisel @jnothman any thoughts? (if my explanation is good enough and not confusing :))
closing in favor of #14294
Toward solving #14191.

`AdaBoostRegressor` did not ensure that resampling `X` and `y` was equivalent to using `sample_weight`. The current changes ensure that this is true while keeping the rest of the algorithm unchanged. However, `AdaBoostRegressor` will now require the `base_estimator` to support `sample_weight`, as `AdaBoostClassifier` does. In addition, some tests were fixed since they were not checking the right thing.
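As a rough illustration of the property this PR targets (only a conceptual check: the internal bootstrap consumes the random state differently in the two fits, so exact equality is not guaranteed):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

X, y = make_regression(n_samples=50, random_state=0)
repeats = np.random.RandomState(0).randint(0, 3, size=len(y))

# fit once with integer sample weights ...
est_w = AdaBoostRegressor(random_state=0)
est_w.fit(X, y, sample_weight=repeats.astype(float))

# ... and once on data where each sample is repeated weight-many times
est_r = AdaBoostRegressor(random_state=0)
est_r.fit(np.repeat(X, repeats, axis=0), np.repeat(y, repeats))

# under the fix the two models should behave (approximately) the same
print(np.abs(est_w.predict(X) - est_r.predict(X)).mean())
```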