SLEP006 - Add Metadata Routing to LogisticRegressionCV #24498
Conversation
Thanks for this @OmarManzoor! Nice work!

I agree it's hard to know what to test, and we should probably define expectations for this family of PRs. I'd love to see a test, or an example, that uses `cv=GroupKFold(n_splits=5)` or similar, which is a real motivating use case.

I can also see an error here: if the scorer requests `sample_weight`, it won't receive it, and, unless I'm much mistaken, this will happen silently (i.e. there will be no error). To fix this, we need to pass `sample_weight` to `process_routing` in `fit`. We would benefit from a test that the scorer receives `sample_weight` if requested.
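A rough sketch of that motivating use case, assuming the routing API this PR targets (on the `sample-props` branch routing was unconditionally enabled; the `set_config` call below is only needed with the feature-flag releases mentioned at the end of this thread):

```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GroupKFold

sklearn.set_config(enable_metadata_routing=True)  # feature-flag API only

X, y = make_classification(n_samples=100, random_state=0)
groups = np.random.RandomState(0).randint(0, 10, size=100)

# With routing in place, `groups` passed to fit should be forwarded to
# GroupKFold.split, since splitters request `groups` by default.
lr_cv = LogisticRegressionCV(cv=GroupKFold(n_splits=5))
lr_cv.fit(X, y, groups=groups)
```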
Thank you for the guidance!
I can't quite understand how to route the scoring parameters. Even if I pass `sample_weight` here, we still need some way to specify the child method of the scorer. However, the available methods only contain `score`, whereas the individual scorers themselves do not seem to define a `score` method; they instead have `_score`, which is called through the scorer's `__call__`.
Scorers don't have a `score` method, but for the purpose of allowing the user to set routing requests, we call it "score". So when the scorer is applied, we need to pass it the parameters that are routed to it under the name "score". Does that help, or is a code snippet more the right kind of support here?
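Since a code snippet was offered, here is a minimal sketch of the idea (`model`, `X_test`, and `y_test` are illustrative; the `process_routing` call follows the style used later in this thread):

```python
# The user sets the request on the scorer under the method name "score"; the
# router then groups the routed parameters under routed_params.scorer.score,
# and the scorer (a callable, with no literal `score` method) receives them.
scorer = scorer.set_score_request(sample_weight=True)

routed_params = process_routing(obj=self, method="fit", sample_weight=sample_weight)
score = scorer(model, X_test, y_test, **routed_params.scorer.score)
```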
These PRs should also add the meta estimators here: https://github.com/scikit-learn/scikit-learn/blob/sample-props/sklearn/tests/test_metaestimators_metadata_routing.py
@jnothman anything else you think should be tested on all meta-estimators apart from what's there?
sklearn/linear_model/_logistic.py (outdated):

```python
return scoring(
    self, X, y, sample_weight=sample_weight, **routed_params.scorer.score
)
```
`routed_params.scorer.score` would include `sample_weight`; no need to pass it explicitly.
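In other words, the call could presumably be reduced to:

```python
# sample_weight arrives via the routed score params (sketch):
return scoring(self, X, y, **routed_params.scorer.score)
```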
How would I actually define the `estimator_name` for this case?
@OmarManzoor your link is to a file, I'm not sure what you're referring to :)
For example, we need the `estimator_name` here. For meta-estimators this should just be the estimator parameter name, but in this case it is a little different.
Oh, I see what you mean. This meta-estimator doesn't actually have any sub-estimators the way others do. So we kind of need to replicate the same kinds of tests, but for this one; ideally we'd have them in a way that can be reused for other similar `*CV` classes.
@adrinjalali Could you kindly clarify one thing: for scorers it seems like we won't raise any warnings or errors, since when routing we search for the actual method specified.
For scorers we would raise an error if routing is not correct. For instance, if the scorer supports `sample_weight` and it is passed without the request being set, that should raise. We don't warn for them, since the existing code doesn't do the routing; all of it would therefore be new functionality. I'm not sure why you say the error wouldn't be raised.
Well, if an error is raised then I think we can use the existing tests. However, in that case I am not clear on how the error would be raised: the underlying scorer does not have a `score` method, yet we check the signature of the `score` method, since "score" is the method name passed during routing processing.
The scorer is the consumer here. Instead of checking the signature, we should now check the routing of the scorer object. Scorers expose routing the same way as estimators do.
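For illustration, a small sketch of what "checking the routing of the scorer object" could look like, assuming the scorer API of recent routing-enabled scikit-learn:

```python
from sklearn.metrics import get_scorer

# Scorers carry a metadata request just like estimators do, so no signature
# inspection is needed.
scorer = get_scorer("accuracy").set_score_request(sample_weight=True)
print(scorer.get_metadata_routing())  # the "score" request now lists sample_weight
```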
So I was trying to create a ConsumingScorer and use it within the existing test structure in test_metaestimators_metadata_routing, but I seem to be doing it incorrectly. Would it be okay for me to push the code so that you can take a look and guide me?
Sure, happy to check.
Pushed the code. It does not raise an error for the test_error_for_other_methods function.
Could you please merge with the latest `sample-props` branch?
@adrinjalali Any feedback on this? Is it incorrect to test with the ConsumingScorer like this? Should we instead check the actual routing data and decide based on that?
I'm trying to figure out why the CI fails here, but I get a "build not found" error when I try to look. Could you please send an empty commit? Hoping to see what the issue is.
This is great work, @OmarManzoor, thank you!
sklearn/linear_model/_logistic.py (outdated):

```python
# Remove the sample_weight parameter from score_params as it is
# explicitly being passed in the path_func.
score_params = routed_params.scorer.score
score_params.pop("sample_weight", None)
```
Those sample weights are used for `fit`, while the score params passed are used to score. `_log_reg_scoring_path` with the current implementation doesn't take `sample_weight` into account for scoring, since it doesn't pass it to the score method if `sample_weight` is passed. Either here it needs to not be popped, or there it needs to be handled explicitly.
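One reading of the "not popped" option, as a rough sketch (`model`, `scores`, and the fold indexing are illustrative, not the PR's actual code):

```python
# Keep sample_weight inside the routed score params so the scorer actually
# weights the test fold; the params would still need per-fold indexing,
# which is discussed further below.
score_params = routed_params.scorer.score  # may include sample_weight
scores.append(scoring(model, X[test], y[test], **score_params))
```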
The reason I popped it is that `sample_weight` is also explicitly being passed to the `_log_reg_scoring_path` method. If we pass `sample_weight` directly and then also pass it through the score params, it just gets duplicated.
```python
@@ -1984,3 +1985,33 @@ def test_warning_on_penalty_string_none():
    )
    with pytest.warns(FutureWarning, match=warning_message):
        lr.fit(iris.data, target)


def test_lr_cv_scorer_does_not_receive_sample_weight_when_score_request_is_not_set():
```
I quite like these two tests, but the tests you had in the common tests were also in the right direction. I don't think the tests here cover what we need: in the common tests we actually test whether metadata is passed to the underlying objects when it should be, and here we don't do that. I think it'd be nice to have specific tests for LRCV in the common tests using the machinery there, even if they're not part of the common tests like the other ones.
The problem with the ones in the common tests was that they were not actually working properly for scorers: the score method does not raise an exception in the inner implementation of `MethodMetadataRequest` if we don't set the score request.
Can you please write a minimal test for it? It should raise; there shouldn't be a difference between `fit` and `score` in that regard.
Could you kindly provide an example of a scorer with LogisticRegressionCV that would raise UnsetMetadataPassedError when sample_weight is passed but the score request is not set?
Generally it's not the consumer which raises the error; rather, the router does that when `process_routing` is called. So the only thing we need to test is that the scorer has the right request values.
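To make the earlier question concrete, here is a sketch of the scenario (written against the later feature-flag API for illustration; on the `sample-props` branch no flag existed):

```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import get_scorer

sklearn.set_config(enable_metadata_routing=True)

X, y = make_classification(random_state=0)
scorer = get_scorer("accuracy")  # score request for sample_weight left unset

# The router (LogisticRegressionCV) should raise UnsetMetadataPassedError in
# process_routing, because sample_weight is passed but never requested.
LogisticRegressionCV(scoring=scorer).fit(X, y, sample_weight=np.ones(len(y)))
```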
I see. So should I then use the code that I had earlier to record the metadata in the sample scorer, and then check those values?
yes, pretty much, very similar to how we do it in consumer estimators there.
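A rough sketch of such a recording scorer (names hypothetical; the actual ConsumingScorer in the common tests additionally exposes `set_score_request` so requests can be set on it):

```python
from sklearn.metrics import accuracy_score

class RecordingScorer:
    """Record every routed parameter so a test can assert on it afterwards."""

    def __init__(self):
        self.recorded_metadata = {}

    def __call__(self, estimator, X, y, **score_params):
        # Remember everything routed to us, e.g. {"sample_weight": ...}.
        self.recorded_metadata = dict(score_params)
        return accuracy_score(y, estimator.predict(X))
```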
@adrinjalali The tests in CircleCI seem to be failing. I actually reverted the code to the point where it passed previously but it still fails. Could you kindly check?
Seems to have something to do with `set_output` (I might be wrong). I'm checking. cc @thomasjpfan
@adrinjalali The checks still seem to be failing but could you kindly check the test that I added in the common tests?
Merging the latest `sample-props` branch.
@adrinjalali Could you kindly have a look at the updates?
Otherwise LGTM, thanks!
```python
routed_params = process_routing(
    obj=lr_cv, method="fit", sample_weight=sample_weight, other_params=other_params
)
assert routed_params.scorer.score
```
this is unnecessary given the next line
```python
assert not routed_params.scorer.score


def test_lr_cv_scorer_receives_sample_weight_when_score_request_is_set():
```
We don't actually test that the weights are received or have an effect. Perhaps we should check that the results are different from when `sample_weight` is not requested.
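A sketch of that check (again written against the later feature-flag API; non-uniform random weights are used so the weighted and unweighted fold scores can actually differ):

```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import get_scorer

sklearn.set_config(enable_metadata_routing=True)

X, y = make_classification(random_state=0)
sample_weight = np.random.RandomState(0).rand(len(y))

weighted = get_scorer("accuracy").set_score_request(sample_weight=True)
unweighted = get_scorer("accuracy").set_score_request(sample_weight=False)

lr_w = LogisticRegressionCV(scoring=weighted).fit(X, y, sample_weight=sample_weight)
lr_u = LogisticRegressionCV(scoring=unweighted).fit(X, y, sample_weight=sample_weight)

# scores_ maps each class to its grid of per-fold scores; the grids should
# differ once the scorer actually uses the weights.
scores_w = next(iter(lr_w.scores_.values()))
scores_u = next(iter(lr_u.scores_.values()))
assert not np.allclose(scores_w, scores_u)
```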
Thank you for the suggestion. I just wanted to mention that there seems to be an issue when I call `fit` on LogisticRegressionCV while requesting sample weight for the underlying scorer. The problem occurs at this point: https://github.com/scikit-learn/scikit-learn/pull/24498/files#diff-66422c7ed307653c80e5d68ed353d44d8fbb02319c428e140daff082eaf9d6cbR742. There, the sample weight has a length of 10 whereas y_test has a length of 2, and this causes an inconsistent-number-of-samples error.
Please make sure to index a subset of the weights for the train/test set, like we do for X and y... (not that I've checked the code just now).
@jnothman I updated the sample weight to only index the test set at the required place and also updated the test. Could you kindly have a look?
Thanks for the work so far @OmarManzoor
sklearn/linear_model/_logistic.py (outdated):

```python
if "sample_weight" in _score_params:
    _score_params["sample_weight"] = _score_params["sample_weight"][test]
```
We should call `_check_fit_params` the same as done in `BaseSearchCV`, and index any metadata which might be sample-aligned.
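A sketch of that suggestion, using the private helper as it existed at the time (it was later renamed `_check_method_params`):

```python
from sklearn.utils.validation import _check_fit_params

# Index every sample-aligned metadata entry against the test fold at once,
# instead of special-casing sample_weight.
score_params_test = _check_fit_params(X, score_params, indices=test)
```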
```python
@@ -1751,6 +1760,9 @@ def fit(self, X, y, sample_weight=None):
            Array of weights that are assigned to individual samples.
            If not provided, then each sample is given unit weight.

        **fit_params : dict
            Parameters to pass to the underlying splitter and scorer.
```
The parameters for the public methods need a `.. versionadded:: 2.0`.
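Concretely, the new docstring entry would carry a directive like this (the version number follows the comment above; purely illustrative):

```python
def fit(self, X, y, sample_weight=None, **fit_params):
    """Fit the model according to the given training data.

    ...
    **fit_params : dict
        Parameters to pass to the underlying splitter and scorer.

        .. versionadded:: 2.0
    """
```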
@OmarManzoor we have switched to a feature flag system (introduced in #26103), which makes things much easier for implementing SLEP6 for meta-estimators. The big SLEP6 PR is now merged into `main`. Would you be up for converting this PR to the feature-flag-enabled form and submitting it to `main`? Also happy to document/answer your questions on how to do the feature flag thingy.
Thank you for the suggestion. Sure, I would like to try this out as soon as I find time.
Reference Issues/PRs
Towards: #22893
What does this implement/fix? Explain your changes.
Any other comments?