Use sample_weight when validating LogisticRegressionCV · Issue #25906 · scikit-learn/scikit-learn · GitHub

Closed
luiscastro193 opened this issue Mar 19, 2023 · 12 comments · Fixed by #26525

luiscastro193 commented Mar 19, 2023

Metadata Routing #24027 will add much needed support for taking into account sample_weight when cross-validating. However, the current implementation of LogisticRegressionCV doesn't seem to be taking advantage of this:

scores.append(scoring(log_reg, X_test, y_test))

Therefore, the scores used for choosing the correct hyperparameter will still be misleading even when the tools for solving this become available.
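To make the impact concrete, here is a minimal sketch (hypothetical data and a hand-rolled accuracy function, not the scikit-learn implementation) of how an unweighted score can be misleading when the user intended weighted evaluation:

```python
# Hand-rolled accuracy supporting optional per-sample weights
# (an illustrative stand-in for a scikit-learn scorer).
def accuracy(y_true, y_pred, sample_weight=None):
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]      # the last sample is misclassified...
weights = [1, 1, 1, 10]    # ...and it carries most of the weight

print(accuracy(y_true, y_pred))           # 0.75 (unweighted, looks fine)
print(accuracy(y_true, y_pred, weights))  # ~0.23 (weighted, much worse)
```

If hyperparameters are ranked by the first number while the user cares about the second, the "best" hyperparameter can easily be the wrong one.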

github-actions bot added the Needs Triage label Mar 19, 2023
thomasjpfan (Member) commented:

During the triaging meeting, we decided that this is a bug and that sample_weight should be passed into the scoring. There is SLEP006 work moving in this direction: #24498

But I can see passing in the sample weights as a fix before #24498 is completed.

adrinjalali (Member) commented:

I don't think this is a bug per se that we can/should fix in the current state of affairs. The result of LogisticRegressionCV(...).fit(...) should be more or less the same as GridSearchCV(LogisticRegression(), ...).fit(...), and we don't have a way of passing sample_weight to the scorer in *SearchCV classes w/o SLEP006.
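As a rough illustration of the equivalence being invoked here (a sketch, assuming default solvers and splitters; the two estimators are not guaranteed to be bit-for-bit identical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)
Cs = [0.01, 1.0, 100.0]

# The dedicated CV estimator...
lr_cv = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000).fit(X, y)

# ...versus a generic grid search over the same grid.
gs = GridSearchCV(LogisticRegression(max_iter=1000), {"C": Cs}, cv=5).fit(X, y)

print(lr_cv.C_[0], gs.best_params_["C"])
```

In neither path does the internal scoring call receive sample_weight, which is why fixing one without the other would make the two diverge.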

luiscastro193 (Author) commented:

As a user, it is clear to me as well that this is a bug, and I would appreciate it if it could be fixed sooner. However, I understand that this is part of a much deeper issue with lots of factors to consider. Knowing that there is work moving in that direction (#24498, as part of SLEP006) is comforting enough.

ogrisel (Member) commented Mar 31, 2023

I have the feeling that both LogisticRegressionCV and GridSearchCV have the same bug: I think that by default scikit-learn should always score models with nested CV with the weights passed to the outer fit.
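What this would mean mechanically can be sketched with a hypothetical helper (not scikit-learn API): the weights passed to the outer fit are sliced per fold and forwarded to both the inner fit and the inner scoring call.

```python
import numpy as np

def weighted_cv_scores(fit, score, X, y, sample_weight, n_splits=3):
    """Hypothetical sketch: route the outer fit's sample_weight into
    both the per-fold fit and the per-fold scoring call."""
    folds = np.array_split(np.arange(len(y)), n_splits)
    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        model = fit(X[train_idx], y[train_idx], sample_weight[train_idx])
        # The crux of this issue: the test-fold weights reach the scorer too.
        scores.append(score(model, X[test_idx], y[test_idx],
                            sample_weight[test_idx]))
    return scores
```

Today, LogisticRegressionCV does the first forwarding (to fit) but not the second (to the scorer).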

adrinjalali (Member) commented:

Only pass sample_weight? Or any metadata whose name happens to match a parameter in the scorer's signature? Also, this would be a breaking change: people know that scorers don't get the metadata in a grid search and have hacks around it, so suddenly passing things through would break quite a bit. Ref: https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function/49598597#49598597

ogrisel (Member) commented Mar 31, 2023

We could special case sample_weight. For the other future metadata, I don't know ahead of time what is the good practice, and for those, maybe explicit routing is better.

Fixing only the sample_weight case would be a bug fix, not a breaking change.

luiscastro193 (Author) commented:

> only pass sample_weight? or any metadata passed which somehow matches the name present in the signature of a scorer? also, this would be a breaking change. People know scorers don't get the metadata in a gridsearch and have hacks around it. Suddenly passing things around would break things quite a bit. ref: https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function/49598597#49598597

That Stack Overflow discussion reminded me of this joke.

adrinjalali (Member) commented:

So what happens if the user passes sample_weight, but neither the scorer nor the sub-estimator accepts it?

What happens when they accept it, but through **kwargs?

What if a scorer doesn't support sample_weight in version x, but does in version x+1?

What if the sub-estimator doesn't support sample weights, but the scorer does?

We have thought about these in the context of SLEP006, and I don't think the proposal here, to always pass it around when we can (which is not well defined), is really a good solution.

ogrisel (Member) commented Mar 31, 2023

> so what happens if the user passes sample_weight, but neither scorer or the sub-estimator accept it?

We could warn first and raise in a later release.

> what happens when they accept it but through **kwargs?

Best effort: we pass them and hope for the best :)

> what if a scorer doesn't support sample weight in version x, but does it in version x+1?

Warn in the old version and then work correctly in the new version.

> what if the sub-estimator doesn't support sample weights, but scorer does?

Raise an error: why would passing sample_weight to fit make sense in this case?

> we have thought about these in the context of slep6, and I don't think the proposal here to pass it always around if we can (which is not well defined), is a good solution really.

What I described above is the way we should aim for (I think). If we implement this via SLEP6, this should be the default behavior, without explicit requests. If users configure explicit routing, they can override the default behavior.

adrinjalali (Member) commented:

Considering the complications of fixing this w/o SLEP006, is the resolution to fix it in the context of the slep, or to fix it regardless?

ogrisel (Member) commented Apr 1, 2023

We had a drafting meeting on SLEP6 integration yesterday. By using the new routing mechanism, we will have a way forward to implement the right routing policy for this case with a global setting (which might become the default when SLEP6 is enabled; to be decided). See the working proposal in #26050.

ogrisel (Member) commented Apr 1, 2023

> Considering the complications of fixing this w/o SLEP006, is the resolution to fix it in the context of the slep, or to fix it regardless?

For LogisticRegressionCV, it's probably quite easy to fix it without SLEP6. For GridSearchCV and other similar meta-estimators, I don't know.
