SLEP006: Metadata routing for bagging #24250
Conversation
This PR adds metadata routing to BaggingClassifier and BaggingRegressor (see scikit-learn#22893). With this change, in addition to sample_weight, which was already supported, it's now also possible to pass arbitrary fit_params to the sub-estimator.

Implementation

Most of the changes should be pretty straightforward with the existing infrastructure for testing metadata routing. There was one aspect that was not quite trivial, though: the current implementation of bagging works by inspecting the sub-estimator's fit method. If the sub-estimator supports sample_weight, then subsampling is performed by making use of sample weights. This also happens if the user does not explicitly pass sample weights. At first, I wanted to change the implementation so that if sample weights are requested, subsampling uses the sample weight approach, and otherwise it doesn't. However, that broke a couple of tests, so I rolled back the change and stuck very closely to the existing implementation. I can't judge whether this prevents the user from doing certain things, or whether subsampling with vs. without sample_weight is equivalent.

Coincidental changes

- The method _validate_estimator on the BaseEnsemble class used to validate, and then set as an attribute, the sub-estimator. This was inconvenient because for get_metadata_routing we want to fetch the sub-estimator, which is not easily possible with this method. The method therefore now returns the sub-estimator, and the caller is responsible for setting it as an attribute. This has the added advantages that the caller can decide the attribute name, and that the method now more closely mirrors _BaseHeterogeneousEnsemble._validate_estimators. Affected by this change are random forests, extra trees, and AdaBoost.
- The function process_routing used to mutate the incoming param dict (adding new items); now it creates a shallow copy first.
- Extended the docstring for check_input of BaseBagging._fit.

Testing

I noticed that the bagging tests didn't have a test case for sparse input combined with sample weights, so I extended an existing test to cover it. The test test_bagging_sample_weight_unsupported_but_passed now raises a TypeError, not a ValueError, when sample_weight is passed but not supported.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali ready for review
sklearn/ensemble/_bagging.py
Outdated
    X_ = X[:, features] if requires_feature_indexing else X
    estimator_fit(X_, y, sample_weight=curr_sample_weight, **fit_params)
else:
    X_ = X[indices][:, features] if requires_feature_indexing else X[indices]
    estimator_fit(X_, y[indices])
should this not pass the other fit params?
Yes, good catch. It's interesting that this is not caught at the moment. The reason is that CheckingClassifier has a sample_weight argument. Therefore, we always hit the top condition, never the bottom. (This comes back to the current implementation forcing the use of sample_weight if the sub-estimator supports it.)
My proposal is to implement a new test with a custom CheckingClassifier that does not support sample_weight. WDYT?
I went ahead with my proposal: ed1a154
LMK if you have a different idea.
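For context, the existing behavior hinges on signature inspection. A minimal sketch of that mechanism, using hypothetical stand-in estimators rather than the real CheckingClassifier, could look like this:

```python
import inspect

# Hypothetical stand-ins for the proposed test setup: one estimator whose
# fit accepts sample_weight, and a variant that does not.
class ClassifierWithSW:
    def fit(self, X, y, sample_weight=None):
        return self

class ClassifierWithoutSW:
    def fit(self, X, y):
        return self

def supports_sample_weight(estimator):
    # Roughly what sklearn's has_fit_parameter does: look for a
    # "sample_weight" parameter in the signature of fit.
    return "sample_weight" in inspect.signature(estimator.fit).parameters

print(supports_sample_weight(ClassifierWithSW()))    # True
print(supports_sample_weight(ClassifierWithoutSW()))  # False
```

An estimator like the second one exercises the "no sample_weight" branch that the existing tests never reached.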
sklearn/ensemble/_bagging.py
Outdated
support_sample_weight = has_fit_parameter(
    ensemble.base_estimator_, "sample_weight"
)
This doesn't seem to support a meta-estimator which forwards sample-weight to the child estimator. Should we instead try to route metadata, and see if sample-weight gets routed?
As noted in https://github.com/scikit-learn/scikit-learn/pull/24250/files/04882927e1c98adb0e9b62657c9d491067cef8c3#r963820342 this is a bit more involved. I'll continue the discussion in the other thread.
I'll respond here to not overload the other thread. Here are the two situations I see:
Case 1: base_estimator_ is a meta-estimator.
If base_estimator_ requests sample_weight, then pass it in. This is the same logic as we have for other meta-estimators.

Case 2: base_estimator_ is a regular estimator.
If base_estimator_ has a sample_weight fit parameter, bagging wants to pass in sample weights independent of the metadata request, to simulate the bagging process. For backward compatibility, I think we need to preserve this behavior, although it is inconsistent with the metadata routing API. For example, on main, the following will pass sample weights to SVR even if they are not requested:

regr = BaggingRegressor(base_estimator=SVR())

Do we want to require BaggingRegressor to request sample weights to use the "bagging with sample weight" feature?

regr = BaggingRegressor(base_estimator=SVR().fit_requests(sample_weight=True))
Personally, I'm not in favor of making the routing of sample weights dependent on whether the base estimator is a meta-estimator or not. Maybe there is precedent for that, but as a user, I would be surprised if the behavior changed in such a fashion just because I swapped one base estimator for another.
Let's list all the factors we have:
1. Base estimator is a meta-estimator or not
2. Base estimator supports sample weights (as determined by its signature)
3. Base estimator requested sample weight (as per fit request)
4. bootstrap=True or False
5. User passed sample weight to fit or not
As mentioned, factor 1 should not matter IMHO, leaving us to resolve the others.
Furthermore, we have 2 choices to make when it comes to using sample weights:
a. Sample via sample weights or via indexing
b. Pass sample weight to base estimator or not
I'm not really clear on what the point is of having sampling via sample weights if bootstrap=False. Is that not functionally equivalent to sampling via indexing, only potentially slower? Can we remove that branch completely? For my proposal, I assume that the two approaches are indeed equivalent.
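To make the claimed equivalence concrete, here is a minimal stdlib sketch (not scikit-learn's implementation) of the two sampling mechanisms: drawing a bootstrap sample via indexing versus encoding the same draw as per-row sample weights.

```python
import random
from collections import Counter

random.seed(0)
n_samples = 10

# Bootstrap draw (with replacement) of row indices.
indices = [random.randrange(n_samples) for _ in range(n_samples)]

# Mechanism 1, "sample via indexing": keep the drawn rows (duplicates kept).
rows_by_indexing = sorted(indices)

# Mechanism 2, "sample via sample weights": fit on all rows, weighting each
# row by how often it was drawn. For estimators that treat an integer
# weight k like k copies of a row, the two mechanisms are equivalent.
counts = Counter(indices)
sample_weight = [counts.get(i, 0) for i in range(n_samples)]

# Both encode the same multiset of rows.
rows_from_weights = sorted(i for i, w in enumerate(sample_weight) for _ in range(w))
assert rows_from_weights == rows_by_indexing
```

The equivalence only holds for estimators that interpret integer weights as repetition counts, which is the assumption the proposal above relies on.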
With that out of the way, my proposal would be:
BE supports SW | BE req SW | bootstrap | user passes SW | outcome |
---|---|---|---|---|
T | T | T | T | pass SW to BE, sample via SW |
T | T | T | F | error |
T | T | F | T | pass SW to BE, sample via indexing |
T | T | F | F | error |
T | F | T | T | don’t pass SW, sample via SW |
T | F | T | F | don’t pass SW, sample via SW |
T | F | F | T | don’t pass SW, sample via indexing |
T | F | F | F | don’t pass SW, sample via SW |
F | T | T | T | error |
F | T | T | F | error |
F | T | F | T | error |
F | T | F | F | error |
F | F | T | T | error |
F | F | T | F | error |
F | F | F | T | error |
F | F | F | F | sample via indexing |
(BE = base estimator, SW = sample weight)
If I'm not mistaken, this preserves the current behavior but it opens the possibility for the user to control whether they want their sample weights to be passed to the base estimator or not.
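For illustration, the table above can be written down directly as a lookup; this is a sketch of the proposal, not actual scikit-learn code, and the outcome strings are just the table cells.

```python
# Sketch of the proposed decision table.
# Key: (BE supports SW, BE requests SW, bootstrap, user passes SW)
OUTCOME = {
    (True, True, True, True): "pass SW to BE, sample via SW",
    (True, True, True, False): "error",
    (True, True, False, True): "pass SW to BE, sample via indexing",
    (True, True, False, False): "error",
    (True, False, True, True): "don't pass SW, sample via SW",
    (True, False, True, False): "don't pass SW, sample via SW",
    (True, False, False, True): "don't pass SW, sample via indexing",
    (True, False, False, False): "don't pass SW, sample via SW",
    (False, True, True, True): "error",
    (False, True, True, False): "error",
    (False, True, False, True): "error",
    (False, True, False, False): "error",
    (False, False, True, True): "error",
    (False, False, True, False): "error",
    (False, False, False, True): "error",
    (False, False, False, False): "sample via indexing",
}

# Example: SW requested but not passed by the user is always an error.
print(OUTCOME[(True, True, True, False)])  # error
```

Note that 9 of the 16 combinations resolve to an error, mostly those where sample weights are requested or passed without being supported end to end.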
@pytest.mark.parametrize("estimator_cls", [BaggingClassifier, BaggingRegressor])
@pytest.mark.parametrize("fit_params_type", ["list", "array"])
@pytest.mark.parametrize("use_sample_weight", [True, False])
def test_bagging_with_arbitrary_fit_params(
isn't this test covered in test_metaestimators_metadata_routing.py?
I added these tests because they did catch an error that the tests in test_metaestimators_metadata_routing.py didn't. I cannot remember what exactly the error was, but below is a change that will make these tests fail whereas test_metaestimators_metadata_routing.py passes. Change this line:

- fit_params=routed_params.base_estimator.fit,

like this:

+ fit_params={"sample_weight": None},

Maybe instead of adding these tests, we have to fix test_metaestimators_metadata_routing.py, but I'm not sure how.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
- Raise NotImplementedError in base class _get_estimator
- Use f-string
There was a bug where fit_params were not passed to fit of the base estimator if the base estimator does not support sample_weight. It is a bit difficult to test this. I decided to go with a test that uses a subclassed CheckingClassifier that does not support sample_weight (i.e. does not have it in the signature of fit). This way, we can test the code branch that resulted in the bug.
Use _get_estimator to set the base_estimator of IsolationForest.
otherwise LGTM!
sklearn/ensemble/_bagging.py
Outdated
support_sample_weight = has_fit_parameter(
    ensemble.base_estimator_, "sample_weight"
)
I think we should instead just route and see if sample_weight gets routed, to support meta-estimators which accept/route sample_weight.
Sorry, I don't quite understand. Do you mean if the base estimator is itself a meta estimator? Would that mean that we have potentially two different sample weights, the ones explicitly passed by the user (and requested by the base estimator) + the one used by bagging to "simulate" the bagging process?
And do you still want to "force" the use of sample weight if the base estimator supports it, even if it's not requested, or do you want to get rid of that? If so, would that not break existing code?
We had a short discussion in the last triage meeting. I think that the idea of @adrinjalali would work. You need to inspect the result of process_routing at this stage to know if the underlying estimator is expecting the sample_weight.
As in:
routed_metadata = process_routing(...)
support_sample_weight = "sample_weight" in routed_metadata.base_estimator.fit
Related to this PR: we just merged in master something that deprecates base_estimator.
Thanks for the heads up. Here is the commit. @adrinjalali could you please update the base branch?
A first pass in the Bagging specialized classes.
sklearn/ensemble/_bagging.py
Outdated
@@ -381,7 +400,14 @@ def _fit(
    y = self._validate_y(y)

    # Check parameters
We can remove this comment. It is outdated.
sklearn/ensemble/_bagging.py
Outdated
def _parallel_args(self):
    return {}

def _get_estimator(self):
    # should be overridden by child classes
    raise NotImplementedError("Return the default base estimator instance.")
Instead, it might be better to declare this method as an abstractmethod.
Done
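The pattern being referred to, sketched on hypothetical minimal classes (not the real BaseBagging hierarchy):

```python
from abc import ABC, abstractmethod

class BaseBaggingSketch(ABC):
    # Declaring _get_estimator abstract documents the contract and makes
    # instantiation of incomplete subclasses fail early, instead of
    # failing only when the method is eventually called.
    @abstractmethod
    def _get_estimator(self):
        """Return the default base estimator instance."""

class BaggingClassifierSketch(BaseBaggingSketch):
    def _get_estimator(self):
        return "default-estimator"  # placeholder for a real default estimator

try:
    BaseBaggingSketch()
except TypeError as exc:
    print("cannot instantiate:", exc)

print(BaggingClassifierSketch()._get_estimator())  # default-estimator
```

Compared with raising NotImplementedError, the abstract method moves the failure from call time to construction time.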
sklearn/ensemble/_bagging.py
Outdated
.add(
    base_estimator=base_estimator,
    method_mapping=MethodMapping().add(callee="fit", caller="fit"),
)
I am wondering how to handle the deprecation here :)
We should probably have a self._estimator to set the metadata on, and a property returning _estimator when base_estimator or estimator is accessed.
Or adding 2 identical rules for base_estimator and estimator, and removing one of them after the deprecation cycle.
It doesn't seem to be necessary: when using the deprecated base_estimator, we get a warning for it, but it's not necessary to add a separate mapping for it, because the correct estimator is resolved beforehand. There is a test for it (test_base_estimator_argument_missing_fit_request_warns).
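For reference, the property-based aliasing floated earlier in the thread could be sketched like this on a toy class; the attribute and message names are illustrative, not scikit-learn's actual deprecation code.

```python
import warnings

class EnsembleSketch:
    """Toy illustration: one private attribute, two public access paths."""

    def __init__(self, estimator=None):
        self._estimator = estimator

    @property
    def estimator(self):
        return self._estimator

    @property
    def base_estimator(self):
        # Deprecated alias: returns the same object, plus a warning.
        warnings.warn(
            "`base_estimator` is deprecated, use `estimator` instead.",
            FutureWarning,
        )
        return self._estimator

ens = EnsembleSketch(estimator="tree")
print(ens.estimator)  # tree
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(ens.base_estimator)  # tree, emitting a FutureWarning
print(len(caught))  # 1
```

Because both properties read the same private attribute, routing rules only ever need to be registered once.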
sklearn/ensemble/_bagging.py
Outdated
A :class:`~utils.metadata_routing.MetadataRouter` encapsulating
routing information.
"""
base_estimator = self._validate_estimator(self._get_estimator())
Is there any way that get_metadata_routing gets called before fit?
In such a case, the base_estimator instance is potentially not the same as the base_estimator_ initialized in fit. I don't know if it could trigger an issue.
I think it's not necessary for the instance to be identical. @adrinjalali can you confirm?
Yes, the method can be called before fit; GridSearch would need to do that, for instance. That's why we have refactored and added _validate_estimator here.
If somebody implements a get_metadata_routing which returns a different value depending on whether the object is fitted, then we can't help them 😁
sklearn/ensemble/_bagging.py
Outdated
if support_sample_weight:
    sample_weight = fit_params.get("sample_weight")
    if sample_weight is None:
It seems that this is something that should use sklearn.utils.validation._check_sample_weight.
We check sample weight earlier: https://github.com/BenjaminBossan/scikit-learn/blob/1221996815589ef076f9c2ac7434ea28a9dcf6bc/sklearn/ensemble/_bagging.py#L395
I think there is no advantage in checking again at the suggested line.
It will create the array of ones for you and trigger a copy if requested.
Done
sklearn/ensemble/_bagging.py
Outdated
if support_sample_weight:
    sample_weight = fit_params.get("sample_weight")
I am wondering if we could maybe do a pop:

try:
    sample_weight = fit_params.pop("sample_weight")
except KeyError:
    sample_weight = None

Then we don't need the dict comprehension below.
I changed the code to use fit_params.pop("sample_weight", None)
Did not know that pop has a "default" to be returned if the key is not in the dict.
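A quick illustration of the dict.pop default being discussed; the fit_params contents here are made up for the example.

```python
# Hypothetical fit_params as bagging might receive them.
fit_params = {"sample_weight": [0.5, 1.0, 1.5], "groups": [0, 0, 1]}

# pop with a default replaces the try/except KeyError dance and removes
# the key in one step, so no later dict comprehension is needed to drop it.
sample_weight = fit_params.pop("sample_weight", None)

print(sample_weight)                          # [0.5, 1.0, 1.5]
print("sample_weight" in fit_params)          # False
print(fit_params.pop("sample_weight", None))  # None (key already gone)
```

The remaining fit_params can then be forwarded to the sub-estimator unchanged.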
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
- Use pop on dictionary to avoid removing the item later
- Make _get_estimator an abstractmethod
- "meta-estimator" instead of "metaestimator"
There are also some untested lines here.
Also, there were some related changes in main, and I've pulled them into the sample-props branch; you should merge those. (Hopefully that branch stays green lol)
sklearn/ensemble/_bagging.py
Outdated
support_sample_weight = has_fit_parameter(
    ensemble.base_estimator_, "sample_weight"
)
As in:
routed_metadata = process_routing(...)
support_sample_weight = "sample_weight" in routed_metadata.base_estimator.fit
sklearn/ensemble/_bagging.py
Outdated
A :class:`~utils.metadata_routing.MetadataRouter` encapsulating
routing information.
"""
base_estimator = self._validate_estimator(self._get_estimator())
Yes, the method can be called before fit; GridSearch would need to do that, for instance. That's why we have refactored and added _validate_estimator here.
If somebody implements a get_metadata_routing which returns a different value depending on whether the object is fitted, then we can't help them 😁
Late to the party 🥳 TBH I've never liked the fact that we inspect signatures to determine if something accepts sample_weight. What are the downsides to trying to send sample_weight? One downside is that the user won't get any indication that setting …
Reworking when sample_weight is used in BaggingClassifier and BaggingRegressor. They provide two mechanisms to achieve sampling of rows: first via sample_weight (the preferred mechanism) and second via indexing (i.e. selecting the rows from the data). The new metadata routing information is now used to determine whether the base estimator can actually deal with sample_weight (the assumption is made that it does the correct thing). This is a change from the current situation, where the signature is checked for the presence of the "sample_weight" argument instead. With this change, in the future it will be possible to support the sample_weight mechanism when using meta-estimators even if they don't have sample_weight in their signature.
@adrinjalali Could you please take a look if the latest commit reflects what we discussed? The PR is not quite ready but the major changes are there.
Yep, I think it's what we talked about, and it looks quite nice, I'd say.
Do you need that we merge …
I updated …
Still needs some docstrings and tests.
- Remove "supports" as its own method since it is only used for bagging
- Correct implementation of is_param_aliased, adding tests
- Add missing documentation
I need to clear my head and look again, but this is looking good, assuming that sniffing for sample weight support is the right thing to do.
    f"{type(ensemble).__name__}"
)

if _supports_sample_weight(request_or_router):
I wonder if, for observability, we should add an attribute to the bagging estimator that notes whether sample weight or resampling was used...
If there is a wish for that, I can change the code to set an attribute during fit that indicates if sample weights are supported, and then use that attribute here.
X, y = iris.data, iris.target
sample_weight = np.ones_like(y)

base_estimator = _weighted(CheckingClassifier(expected_sample_weight=False))
Should this be expected_sample_weight=True? The comment above says "is achieved through the use of sample weights".
Good catch.
class BaseEstimatorWithoutSW(BaseEstimator):
    def fit(self, X, y):
        assert len(y) == 0.5 * num_samples  # check that indexing was used
Is integer-float comparison brittle?
Good point. As long as the iris dataset doesn't change in size, it should be fine. But just in case, I changed the code to use max_samples=<int> to make this more robust.
- CheckingClassifier should have had expected_sample_weight=True in one test
- Use max_samples=<int> instead of <float> for a less brittle comparison
@jnothman I addressed your comments.
@BenjaminBossan we have switched to a feature flag system (introduced in #26103), which makes things much easier for implementing SLEP006 for meta-estimators. The big SLEP006 PR is now merged into main. Would you be up for converting this PR to the feature-flag-enabled form and submitting it against main? Also happy to document/answer your questions on how to do the feature flag thingy.
Hi, I was wondering if @BenjaminBossan is interested in continuing this work; otherwise I was interested, for the sake of supporting metadata routing in …
@adam2392 I think you can go ahead and open a PR, taking the good parts of this one.