
SLEP006: Metadata routing for bagging #24250


Closed

Conversation

BenjaminBossan (Contributor)

This PR adds metadata routing to BaggingClassifier and
BaggingRegressor (see #22893).

Description

With this change, in addition to sample_weight, which was already
supported, it's now also possible to pass arbitrary fit_params to the
sub estimator (which have to be requested explicitly).
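To illustrate what this enables, here is a hedged usage sketch written against the sample-props branch API referenced in this PR; the request-setting call (fit_requests) reflects that branch at the time and is an assumption, as the final API name differs:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    sample_weight = np.ones_like(y, dtype=float)

    # The sub estimator must explicitly request the metadata it wants routed.
    base = LogisticRegression().fit_requests(sample_weight=True)
    bag = BaggingClassifier(base_estimator=base)
    bag.fit(X, y, sample_weight=sample_weight)
    # Any other fit parameter requested by the sub estimator would be routed
    # the same way through **fit_params of BaggingClassifier.fit.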

Implementation

Most of the changes should be pretty straightforward with the existing
infrastructure for testing metadata routing. There was one aspect that
was not quite trivial, though: the current implementation of bagging
works by inspecting the sub estimator's fit method. If the sub estimator
supports sample_weight, then subsampling is performed by making use of
sample_weight. This happens even if the user does not explicitly pass
sample_weight.

At first, I wanted to change the implementation so that subsampling uses
the sample_weight approach only when sample weights are requested, and
index-based subsampling otherwise. However, that breaks a couple of
tests, so I rolled back the change and stuck very closely to the
existing implementation. I can't judge whether this prevents the user
from doing certain things, or whether subsampling with and without
sample_weight is equivalent.
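For context, the two subsampling mechanisms discussed here look roughly like this; a simplified sketch of the idea, not the actual code in _bagging.py:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=100, random_state=0)
    n_samples = X.shape[0]
    rng = np.random.RandomState(0)
    indices = rng.randint(0, n_samples, n_samples)  # bootstrap draw

    # Mechanism 1: weight-based subsampling (used when the sub estimator accepts
    # sample_weight). Rows drawn k times get weight k, rows never drawn get 0.
    curr_sample_weight = np.bincount(indices, minlength=n_samples).astype(float)
    DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=curr_sample_weight)

    # Mechanism 2: index-based subsampling (used otherwise).
    DecisionTreeClassifier(random_state=0).fit(X[indices], y[indices])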

Coincidental changes

The method _validate_estimator on the BaseEnsemble class used to
validate the sub estimator and then set it as an attribute. This was
inconvenient because get_metadata_routing needs to fetch the sub
estimator, which is not easily possible with that method. Therefore, the
method now returns the sub estimator and the caller is responsible for
setting it as an attribute. This has the added advantages that the
caller can now decide the attribute name, and that the method more
closely mirrors _BaseHeterogeneousEnsemble._validate_estimators.
Affected by this change are random forests, extra trees, and AdaBoost.

The function process_routing used to mutate the incoming param
dict (adding new items); it now creates a shallow copy first.
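Illustratively (schematic names, not the real process_routing signature), the behavioural difference is:

    def route_params(params):
        routed = dict(params)  # shallow copy instead of mutating in place
        routed.setdefault("sample_weight", None)  # new items land in the copy only
        return routed

    caller_params = {"classes": [0, 1]}
    routed = route_params(caller_params)
    assert "sample_weight" not in caller_params  # the caller's dict stays untouched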

Extended docstring for the check_input argument of BaseBagging._fit.

Testing

I noticed that the bagging tests didn't have a test case for sparse
input + using sample weights, so I extended an existing test to include
it.

The test test_bagging_sample_weight_unsupported_but_passed now expects a
TypeError, not a ValueError, when sample_weight is passed but not
supported.

@BenjaminBossan (Contributor Author)

@adrinjalali ready for review

@adrinjalali self-requested a review August 25, 2022 13:14
    X_ = X[:, features] if requires_feature_indexing else X
-   estimator_fit(X_, y, sample_weight=curr_sample_weight)
+   estimator_fit(X_, y, sample_weight=curr_sample_weight, **fit_params)
else:
    X_ = X[indices][:, features] if requires_feature_indexing else X[indices]
    estimator_fit(X_, y[indices])
Member

should this not pass the other fit params?

Contributor Author

Yes, good catch. It's interesting that this is not caught at the moment. The reason is that CheckingClassifier has a sample_weight argument. Therefore, we always hit the top condition, never the bottom. (This comes back to the current implementation forcing the use of sample_weight if the sub estimator supports it.)

My proposal is to implement a new test with a custom CheckingClassifier that does not support sample_weight. WDYT?
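A rough sketch of what such a subclass could look like (the class name is hypothetical and the actual test added later may differ):

    from sklearn.utils._mocking import CheckingClassifier
    from sklearn.utils.validation import has_fit_parameter

    class CheckingClassifierWithoutSampleWeight(CheckingClassifier):
        # Redefine fit without sample_weight in the signature so that
        # has_fit_parameter(est, "sample_weight") is False and the bottom
        # (index-based) branch is exercised.
        def fit(self, X, y, **fit_params):
            return super().fit(X, y, **fit_params)

    assert not has_fit_parameter(
        CheckingClassifierWithoutSampleWeight(), "sample_weight"
    )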

Contributor Author

I went ahead with my proposal: ed1a154

LMK if you have a different idea.

Comment on lines 126 to 128
support_sample_weight = has_fit_parameter(
ensemble.base_estimator_, "sample_weight"
)
Member

This doesn't seem to support a meta-estimator which forwards sample-weight to the child estimator. Should we instead try to route metadata, and see if sample-weight gets routed?

cc @glemaitre @thomasjpfan

Member

As noted in https://github.com/scikit-learn/scikit-learn/pull/24250/files/04882927e1c98adb0e9b62657c9d491067cef8c3#r963820342 this is a bit more involved. I'll continue the discussion in the other thread.

Member

I'll respond here to not overload the other thread. Here are the two situations I see:

Case 1: base_estimator_ is a meta estimator.

If base_estimator_ requests sample_weight, then pass it in. This is the same logic as we have for other meta-estimators.

Case 2: base_estimator_ is a regular estimator.

If base_estimator has a sample_weight fit parameter, bagging wants to pass in sample weights independently of the metadata request, to simulate the bagging process. For backward compatibility, I think we need to preserve this behavior, although it is inconsistent with the metadata routing API. For example, on main, the following will pass sample weights to SVR even if they are not requested:

regr = BaggingRegressor(base_estimator=SVR())

Do we want to require BaggingRegressor to request sample weights to use the "bagging with sample weight" feature?

regr = BaggingRegressor(base_estimator=SVR().fit_requests(sample_weight=True))

Contributor Author

Personally, I'm not in favor of making the routing of sample weights dependent on whether the base estimator is a meta estimator or not. Maybe there is precedent for that, but as a user, I would be surprised if the behavior changed in such a fashion just because I swapped one base estimator for another.

Let's list all the factors we have:

  1. Base estimator is a meta estimator or not
  2. Base estimator supports sample weights (as determined by its signature)
  3. Base estimator requested sample weight (as per fit request)
  4. bootstrap=True or False
  5. User passed sample weight to fit or not

As mentioned, factor 1 should not matter IMHO, leaving us to resolve the others.

Furthermore, we have 2 choices to make when it comes to using sample weights:

a. Sample via sample weights or via indexing
b. Pass sample weight to base estimator or not

I'm not really clear on what the point is of having sampling via sample weights if bootstrap=False. Is that not functionally equivalent to sampling via indexing, only potentially slower? Can we remove that branch completely? For my proposal, I assume that the two approaches are indeed equivalent.

With that out of the way, my proposal would be:

| BE supports SW | BE req SW | bootstrap | user passes SW | outcome |
|---|---|---|---|---|
| T | T | T | T | pass SW to BE, sample via SW |
| T | T | T | F | error |
| T | T | F | T | pass SW to BE, sample via indexing |
| T | T | F | F | error |
| T | F | T | T | don't pass SW, sample via SW |
| T | F | T | F | don't pass SW, sample via SW |
| T | F | F | T | don't pass SW, sample via indexing |
| T | F | F | F | don't pass SW, sample via SW |
| F | T | T | T | error |
| F | T | T | F | error |
| F | T | F | T | error |
| F | T | F | F | error |
| F | F | T | T | error |
| F | F | T | F | error |
| F | F | F | T | error |
| F | F | F | F | sample via indexing |

(BE = base estimator, SW = sample weight)

If I'm not mistaken, this preserves the current behavior but it opens the possibility for the user to control whether they want their sample weights to be passed to the base estimator or not.

@pytest.mark.parametrize("estimator_cls", [BaggingClassifier, BaggingRegressor])
@pytest.mark.parametrize("fit_params_type", ["list", "array"])
@pytest.mark.parametrize("use_sample_weight", [True, False])
def test_bagging_with_arbitrary_fit_params(
Member

isn't this test covered in test_metaestimators_metadata_routing.py?

Contributor Author

I added these tests because they did catch an error that the tests in test_metaestimators_metadata_routing.py didn't. I cannot remember what exactly the error was, but below is a change that will make these tests fail while test_metaestimators_metadata_routing.py still passes. Change this line:

https://github.com/BenjaminBossan/scikit-learn/blob/384127292b610785030ddd6fad0e6b1ee4376f4a/sklearn/ensemble/_bagging.py#L498

like this:

-                fit_params=routed_params.base_estimator.fit,
+                fit_params={"sample_weight": None},

Maybe instead of adding these tests, we have to fix test_metaestimators_metadata_routing.py, but I'm not sure how.

BenjaminBossan and others added 4 commits August 25, 2022 15:46
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
- Raise NotImplementedError in base class _get_estimator
- Use f-string
There was a bug that fit_params were not passed to fit of the base
estimator if the base estimator does not support sample_weight.

It is a bit difficult to test this. I decided to go with a test that
uses a subclassed CheckingClassifier that does not support
sample_weight (i.e. does not have it in the signature of fit). This way,
we can test the code branch that resulted in the bug.
Use _get_estimator to set the base_estimator of IsolationForest.
@adrinjalali (Member) left a comment

otherwise LGTM!

Comment on lines 126 to 128
support_sample_weight = has_fit_parameter(
ensemble.base_estimator_, "sample_weight"
)
Member

I think we should instead just route and see if sample_weight gets routed, to support meta-estimators that accept/route sample_weight.

Contributor Author

Sorry, I don't quite understand. Do you mean if the base estimator is itself a meta estimator? Would that mean that we have potentially two different sample weights, the ones explicitly passed by the user (and requested by the base estimator) + the one used by bagging to "simulate" the bagging process?

And do you still want to "force" the use of sample weight if the base estimator supports it, even if it's not requested, or do you want to get rid of that? If so, would that not break existing code?

Member

We had a short discussion in the last triage meeting. I think that the idea of @adrinjalali would work. You need to inspect the result of process_routing at this stage to know whether the underlying estimator expects sample_weight.

Member

As in:

routed_metadata = process_routing(...)
support_sample_weight = "sample_weight" in routed_metadata.base_estimator.fit

@glemaitre self-requested a review September 9, 2022 09:45
@glemaitre (Member)

Related to this PR: we just merged in master something that deprecates base_estimator and base_estimator_ in favour of estimator and estimator_.

@BenjaminBossan (Contributor Author)

Thanks for the heads up. Here is the commit. @adrinjalali could you please update the base branch?

@glemaitre (Member) left a comment

A first pass in the Bagging specialized classes.

@@ -381,7 +400,14 @@ def _fit(
y = self._validate_y(y)

# Check parameters
Member

We can remove this comment. It is outdated.


def _parallel_args(self):
return {}

def _get_estimator(self):
# should be overridden by child classes
raise NotImplementedError("Return the default base estimator instance.")
Member

Instead, it might be better to declare this method as an abstractmethod.
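A minimal sketch of the abstractmethod alternative (class names simplified and hypothetical; the real ensemble base classes already use an ABC metaclass):

    from abc import ABC, abstractmethod
    from sklearn.tree import DecisionTreeClassifier

    class _BaseBaggingSketch(ABC):
        @abstractmethod
        def _get_estimator(self):
            """Return the default base estimator instance."""

    class _BaggingClassifierSketch(_BaseBaggingSketch):
        def _get_estimator(self):
            return DecisionTreeClassifier()

    # Instantiating _BaseBaggingSketch() would raise a TypeError; the concrete
    # subclass works:
    _BaggingClassifierSketch()._get_estimator()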

Contributor Author

Done

Comment on lines 584 to 587
.add(
base_estimator=base_estimator,
method_mapping=MethodMapping().add(callee="fit", caller="fit"),
)
Member

I am wondering how to handle the deprecation here :)
We should probably have a self._estimator to set the metadata on, and a property that returns _estimator when either base_estimator or estimator is accessed.

Member

Or add two identical rules for base_estimator and estimator and remove one of them after the deprecation cycle.

Contributor Author

It doesn't seem to be necessary: when using the deprecated base_estimator, we get a warning for it, but there is no need to add a separate mapping for it because the correct estimator is resolved beforehand. There is a test for this (test_base_estimator_argument_missing_fit_request_warns).

A :class:`~utils.metadata_routing.MetadataRouter` encapsulating
routing information.
"""
base_estimator = self._validate_estimator(self._get_estimator())
Member

Is there any way that get_metadata_routing gets called before fit?

In that case, the base_estimator instance is potentially not the same as the base_estimator_ initialized in fit. I don't know if it could trigger an issue.

Contributor Author

I think it's not necessary for the instance to be identical. @adrinjalali can you confirm?

Member

Yes, the method can be called before fit. GridSearch would need to do that for instance. That's why we have refactored and added _validate_estimator here.

If somebody implements a get_metadata_routing which returns a different value if the object is fit vs when it's not, then we can't help them 😁

if support_sample_weight:
sample_weight = fit_params.get("sample_weight")
if sample_weight is None:
Member

It seems that this is something that should use sklearn.utils.validation._check_sample_weight.

Contributor Author

We check sample weight earlier: https://github.com/BenjaminBossan/scikit-learn/blob/1221996815589ef076f9c2ac7434ea28a9dcf6bc/sklearn/ensemble/_bagging.py#L395

I think there is no advantage in checking again at the suggested line.

Member

It will create the array of ones for you and trigger a copy if requested.
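For reference, a small sketch of what _check_sample_weight provides; it is a private helper in sklearn.utils.validation, so the exact signature across versions is an assumption:

    import numpy as np
    from sklearn.utils.validation import _check_sample_weight

    X = np.arange(10, dtype=float).reshape(5, 2)
    sw = _check_sample_weight(None, X)  # -> array of ones with shape (5,)
    sw_copied = _check_sample_weight(np.ones(5), X, copy=True)  # validated copy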

Contributor Author

Done

if support_sample_weight:
sample_weight = fit_params.get("sample_weight")
Member

I am wondering if we could maybe do a pop:

try:
    sample_weight = fit_params.pop("sample_weight")
except KeyError:
    sample_weight = None

Then we don't need the dict comprehension below.

Contributor Author

I changed the code to use fit_params.pop("sample_weight", None)

Member

Did not know that pop has a "default" to be returned if the key is not in the dict.
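For reference, this is plain dict behaviour:

    fit_params = {"classes": [0, 1]}
    sample_weight = fit_params.pop("sample_weight", None)  # -> None, no KeyError
    classes = fit_params.pop("classes", None)  # -> [0, 1]; the key is removed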

@glemaitre self-requested a review September 9, 2022 14:38
BenjaminBossan and others added 6 commits September 9, 2022 16:53
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
- Use pop on the dictionary to avoid having to remove the item later
- Make _get_estimator abstractmethod
- "meta-estimator" instead of "metaestimator"
@adrinjalali (Member) left a comment

There are also some untested lines here.

Also, there were some related changes in main, which I've pulled into the sample-props branch; you should merge those. (Hopefully that branch stays green lol)


@jnothman (Member)

Late to the party 🥳

TBH I've never liked the fact that we inspect signatures to determine if something accepts sample_weight.

What are the downsides to trying to send sample_weight and using the subsampling approach if an error is raised?

One downside is that the user won't get any indication that setting set_fit_request would make their code work differently/faster!

Reworking when sample_weight is being used in BaggingClassifier and
BaggingRegressor.

They provide two mechanisms to achieve sampling of rows, first via
sample_weight (the preferred mechanism) and second via indexing (i.e.
selecting the rows from the data).

The new metadata routing information is now used to determine whether
the base estimator can actually deal with sample_weight (the assumption
is made that it does the correct thing). This is a change from the
current situation, where the signature is checked for the presence of
the "sample_weight" argument instead. With this change, in the future it
will be possible to support the sample_weight mechanism when using
meta-estimators even if they don't have sample_weight in their signature.
@BenjaminBossan (Contributor Author)

@adrinjalali Could you please take a look at whether the latest commit reflects what we discussed? The PR is not quite ready, but the major changes are there.

@adrinjalali (Member)

Yep, I think it's what we talked about, and it looks quite nice I'd say.

@glemaitre (Member)

Do you need us to merge main into sample-props to have the CIs working properly?

@adrinjalali (Member)

I updated sample-props in case anything relevant has happened on main.

@jnothman (Member) left a comment

I need to clear my head and look again, but this is looking good, assuming that sniffing for sample weight support is the right thing to do.

f"{type(ensemble).__name__}"
)

if _supports_sample_weight(request_or_router):
Member

I wonder if, for observability, we should add an attribute to the bagging estimator that notes whether sample weight or resampling was used...

Contributor Author

If there is a wish for that, I can change the code to set an attribute during fit that indicates if sample weights are supported, and then use that attribute here.

X, y = iris.data, iris.target
sample_weight = np.ones_like(y)

base_estimator = _weighted(CheckingClassifier(expected_sample_weight=False))
Member

Should this be expected_sample_weight=True? The comment above says "is achieved through the use of sample weights"

Contributor Author

Good catch.


class BaseEstimatorWithoutSW(BaseEstimator):
def fit(self, X, y):
assert len(y) == 0.5 * num_samples # check that indexing was used
Member

Is integer-float comparison brittle?

Contributor Author

Good point. As long as the iris dataset doesn't change in size, it should be fine. But just in case, I changed the code to use max_samples=<int> for this to be more robust.

- CheckingClassifier should have had expected_sample_weight=True in one
  test
- Use max_samples=<int> instead of <float> for less brittle comparison
@BenjaminBossan (Contributor Author) left a comment

@jnothman I addressed your comments.

@adrinjalali (Member)

@BenjaminBossan we have switched to a feature flag system (introduced in #26103), which makes things much easier for implementing slep6 for meta-estimators.

The big slep6 PR is now merged into main (#24027), and the sample-props branch has been removed from the main repo; therefore this PR was auto-closed.

Would you be up for converting this PR to the feature-flag-enabled form and submitting it against main?

Also happy to document/answer your questions on how to do the feature flag thingy.

@adam2392 (Member)

Hi,

I was wondering whether @BenjaminBossan is interested in continuing this work; otherwise, I would be interested in taking it over for the sake of supporting metadata routing in Bagging*.

@adrinjalali (Member)

@adam2392 I think you can go ahead and open a PR, taking the good parts of this one.
