API deprecate CalibratedClassifierCV(..., cv=prefit) for FrozenEstimator #30171

Merged
ogrisel merged 3 commits into scikit-learn:main from calibration/cv on Oct 30, 2024

Conversation

adrinjalali (Member)

Towards #29893

This deprecates CalibratedClassifierCV(..., cv="prefit") in favor of CalibratedClassifierCV(FrozenEstimator(est)).

It also adds an ensemble="auto" option that automatically handles a FrozenEstimator correctly.
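For illustration, a minimal before/after sketch. The dataset, split, and LogisticRegression base model are hypothetical; only the CalibratedClassifierCV, FrozenEstimator, and ensemble="auto" usage comes from this PR, and it assumes a scikit-learn version where sklearn.frozen is available:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Deprecated spelling: signal "already fitted" through the cv parameter.
calibrated = CalibratedClassifierCV(model, cv="prefit").fit(X_cal, y_cal)

# New spelling: wrap the fitted model so its fit is a no-op;
# ensemble="auto" resolves to the right behavior for a FrozenEstimator.
calibrated = CalibratedClassifierCV(
    FrozenEstimator(model), ensemble="auto"
).fit(X_cal, y_cal)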

cc @glemaitre @ogrisel

github-actions bot commented Oct 29, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4c17d22.

adrinjalali added this to the 1.6 milestone Oct 29, 2024
adam2392 (Member) left a comment

Looks simple enough to me, but I'm by no means an expert in this area of the codebase :p

ogrisel (Member) left a comment

I have mixed feelings about this: asking the user to import a class and wrap an argument feels like a small usability regression.

What about not deprecating cv="prefit" and instead doing the wrapping with FrozenEstimator automatically, on the fly, in _get_estimator:

    def _get_estimator(self):
        """Resolve which estimator to return (default is LinearSVC)"""
        if self.cv == "prefit" and self.estimator is None:
            raise ValueError(
                'estimator cannot be left to None when cv="prefit"'
            )
        if self.estimator is None:
            # we want all classifiers that don't expose a random_state
            # to be deterministic (and we don't want to expose this one).
            estimator = LinearSVC(random_state=0)
            if _routing_enabled():
                estimator.set_fit_request(sample_weight=True)
        else:
            estimator = self.estimator
            if self.cv == "prefit":
                estimator = FrozenEstimator(estimator)
        return estimator

But then we would have two equivalent ways to do the same thing:

from sklearn.frozen import FrozenEstimator

CalibratedClassifierCV(FrozenEstimator(model)).fit(X_cal, y_cal)

and:

CalibratedClassifierCV(model, cv="prefit").fit(X_cal, y_cal)

But maybe that's ok?

adrinjalali (Member, Author)

I wouldn't say it's a usability regression. cv="prefit" doesn't make any sense semantically; it's just that we didn't have anywhere else to put "please don't fit my estimator", so we put it under cv.

ogrisel (Member) commented Oct 30, 2024

cv="prefit" doesn't make any sense semantically

I partially agree... But at the same time, calling cross_val_predict on a frozen estimator is just a complex way to do estimator.predict(X_cal) when cv is a CV splitter with non-overlapping, full-coverage validation folds, as with the default cv=5.

In that respect, cv="prefit" is a usability improvement in the sense that it's simpler to understand compared to inferring the meaning and statistical implications of a call such as CalibratedClassifierCV(FrozenEstimator(model), cv=5).
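To make that equivalence concrete, a small sketch with hypothetical data and model; it assumes FrozenEstimator's no-op fit composes with the cloning done inside cross_val_predict:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(random_state=0)
model = LogisticRegression().fit(X, y)

# Each per-fold "fit" is a no-op, and the default 5 folds are disjoint and
# cover every sample, so the stitched out-of-fold predictions are exactly
# what a single direct predict call returns.
assert np.array_equal(
    cross_val_predict(FrozenEstimator(model), X, y, cv=5),
    model.predict(X),
)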

ogrisel (Member) left a comment

If @thomasjpfan and others think it's better, and I am in the minority, then so be it. It's true that this will also allow a small code simplification once the deprecation period is over.

@@ -540,7 +554,7 @@ def test_calibration_dict_pipeline(dict_data, dict_data_pipeline):
     """
     X, y = dict_data
     clf = dict_data_pipeline
-    calib_clf = CalibratedClassifierCV(clf, cv="prefit")
+    calib_clf = CalibratedClassifierCV(FrozenEstimator(clf), cv=2)
Member

This change highlights another small usability regression: it's difficult to understand why we would need to adjust the CV strategy when calibrating a frozen estimator on a dataset with fewer than 5 data points per class. One could expect that cross-fitting would be entirely disabled in that case.

Again, I can live with it.
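For context on why the test above needed cv=2: StratifiedKFold refuses to create more folds than the smallest class has samples, so the default cv=5 fails on a tiny fixture. A sketch with hypothetical data:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((4, 1))
y = np.array([0, 0, 1, 1])  # only 2 samples per class

# n_splits=5 exceeds the smallest class size -> ValueError at split time.
try:
    list(StratifiedKFold(n_splits=5).split(X, y))
except ValueError as exc:
    print(exc)

# n_splits=2 fits within each class, so it works.
folds = list(StratifiedKFold(n_splits=2).split(X, y))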

adrinjalali (Member, Author)

I'd agree with you, but in reality, people with 5 data points shouldn't be doing statistical ML 😁

ogrisel (Member) commented Oct 30, 2024

In particular, meaningful calibration with 5 data points per class is quite challenging. Still, it reveals a rough edge in my opinion. But I believe we can fix it later if needed.

@ogrisel ogrisel merged commit b4eef25 into scikit-learn:main Oct 30, 2024
30 checks passed
@adrinjalali adrinjalali deleted the calibration/cv branch October 31, 2024 10:03