[MRG] FIX ensure equivalence between resampling and sample_weight in AdaBoost by glemaitre · Pull Request #14252 · scikit-learn/scikit-learn

Closed

Conversation

glemaitre (Member)
Toward solving #14191

AdaBoostRegressor did not ensure that resampling X and y was equivalent to passing sample_weight. The current changes ensure that this equivalence holds while keeping the rest of the algorithm unchanged. However, AdaBoostRegressor will now require the base_estimator to support sample_weight, as AdaBoostClassifier already does. In addition, some tests were fixed since they were not checking the right thing.
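
For reference, a minimal sketch of the equivalence being targeted; the dataset, random_state, and the assertion itself are illustrative assumptions, not taken from the PR's test suite:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import AdaBoostRegressor

    X, y = make_regression(n_samples=100, random_state=0)
    sample_weight = np.ones(X.shape[0])
    sample_weight[::2] = 0.0  # silence half of the samples via zero weights

    # Fitting with zero weights ...
    ada_weighted = AdaBoostRegressor(random_state=42)
    ada_weighted.fit(X, y, sample_weight=sample_weight)

    # ... should give the same model as fitting on the resampled data.
    mask = sample_weight > 0
    ada_resampled = AdaBoostRegressor(random_state=42)
    ada_resampled.fit(X[mask], y[mask])

    np.testing.assert_allclose(ada_weighted.predict(X),
                               ada_resampled.predict(X))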

jnothman (Member) left a comment

So we previously supported regressors without sample_weight support but we did so incorrectly?

    estimator.fit(X_, y_)
    # create a bootstrap sample with observations having non-null weights
    n_samples = X.shape[0]
    sample_mask = sample_weight > 0
glemaitre (Member, Author):

Here, I assume that we cannot have negative weights. Does that make sense?

glemaitre (Member, Author) commented Jul 4, 2019

> So we previously supported regressors without sample_weight support but we did so incorrectly?

The bootstrapping was equivalent to what was done previously. However, the previous way of constructing the bootstrap sample did not ensure that passing sample_weight with some zeros was equivalent to removing those samples from the dataset. Now it does.

Then, there is a bug when computing the error vector: we should not consider the samples with zero weight when finding the maximum, and we should normalize the computed error properly (which influences the weight of each weak learner).
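
Concretely, that fix amounts to masking the zero-weight samples before taking the maximum; a sketch of the idea, with illustrative names rather than the exact diff:

    import numpy as np

    def normalized_error_vector(y_true, y_pred, sample_weight):
        # AdaBoost.R2-style error: normalize by the maximum error taken
        # only over the samples with a non-zero weight.
        error_vect = np.abs(y_pred - y_true)
        sample_mask = sample_weight > 0
        error_max = error_vect[sample_mask].max()
        if error_max != 0:
            error_vect /= error_max
        return error_vect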

Now I see that we could break code that was using a regressor without sample_weight support.
@jnothman Do you think that we could manually generate a bootstrap sample for both AdaBoostClassifier and AdaBoostRegressor when the underlying estimator does not support sample_weight?

    @staticmethod
    def _fit_estimator(estimator, X, y, sample_weight):
        """Fit an estimator checking support for sample weight."""
        try:
glemaitre (Member, Author):

@jnothman I meant something like:

        try:
            return estimator.fit(X, y, sample_weight=sample_weight)
        except TypeError:
            # generate a bootstrap based on sample_weight
            bootstrap_idx = random_state.choice(
                np.arange(sample_weight.size), size=sample_weight.size,
                replace=True, p=sample_weight
            )
            return estimator.fit(safe_indexing(X, bootstrap_idx),
                                 safe_indexing(y, bootstrap_idx))
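
A self-contained variant of that fallback could use has_fit_parameter (the existing scikit-learn utility) instead of catching TypeError; the helper name and the explicit normalization of p are my own choices, not part of the diff:

    import numpy as np
    from sklearn.utils import safe_indexing
    from sklearn.utils.validation import has_fit_parameter

    def fit_with_fallback(estimator, X, y, sample_weight, random_state):
        # Fit with sample_weight when supported; otherwise approximate it
        # with a weighted bootstrap of (X, y).
        if has_fit_parameter(estimator, "sample_weight"):
            return estimator.fit(X, y, sample_weight=sample_weight)
        bootstrap_idx = random_state.choice(
            np.arange(sample_weight.size), size=sample_weight.size,
            replace=True, p=sample_weight / sample_weight.sum()
        )
        return estimator.fit(safe_indexing(X, bootstrap_idx),
                             safe_indexing(y, bootstrap_idx))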

    sample_weight_bootstrap = (np.ones((n_samples,)) *
                               np.bincount(indices, minlength=n_samples))
    # combine the bootstrap sample with the original sample_weight
    sample_weight_bootstrap *= sample_weight
glemaitre (Member, Author):

This might need to be normalized to sum to 1.
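
For instance, as a sketch of that suggestion (not part of the diff):

    # normalize the combined weights so that they sum to 1
    sample_weight_bootstrap /= sample_weight_bootstrap.sum()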

glemaitre (Member, Author):

Uhm, I am having second thoughts here. The failure of all the ensemble algorithms is due to the bootstrap sample. For instance, let's take X with the associated sample_weight [0, 1, 0, 1] (so 4 samples).

When using sample_weight, we pass the full X and a bootstrap sample is drawn from these 4 samples, while when resampling X the bootstrap sample is drawn from only 2 samples (see the sketch after the list below). Thus, we have 2 possible solutions:

  1. force the bootstrap sample to be drawn from the samples associated with non-null weights, so that the two cases (using sample_weight or resampling) become equivalent;
  2. leave the code as it is, considering this an implementation detail (for the current PR, there is still a bug regarding the detection of the max). In this case we would need to introduce an estimator tag to skip the common test, and we would need a good name for it.
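
To illustrate the difference between the two draws (a sketch with made-up names, assuming a fixed RandomState):

    import numpy as np

    rng = np.random.RandomState(0)
    sample_weight = np.array([0.0, 1.0, 0.0, 1.0])

    # With sample_weight: draw from all 4 samples; zero-weight samples
    # can never be selected, but they still occupy slots in the draw.
    p = sample_weight / sample_weight.sum()
    idx_weighted = rng.choice(4, size=4, replace=True, p=p)

    # With resampling: X is reduced first, then the draw is uniform
    # over the 2 remaining samples.
    kept = np.flatnonzero(sample_weight > 0)  # array([1, 3])
    idx_resampled = kept[rng.choice(2, size=4, replace=True)]

    # Both only ever index samples 1 and 3, but the random streams
    # differ, so the two fits are not guaranteed to match exactly.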

I am more inclined towards 2.

@ogrisel @jnothman any thoughts? (hoping my explanation is good enough and not confusing :))

glemaitre closed this Jul 8, 2019

glemaitre (Member, Author) commented Jul 8, 2019

closing in favor of #14294
