
FEA add FrozenEstimator #29705


Merged: 25 commits into scikit-learn:main on Oct 28, 2024

Conversation

@adrinjalali (Member)

This adds a FrozenEstimator, with tests for pipeline as well. I think it'd be nice to have this and extend the tests if we find regressions?

github-actions bot commented Aug 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3e59770.

@adrinjalali (Member Author)

Interesting discovery: we have a bunch of common tests that are not in check_estimator. That needs to be fixed for these tests to pass.

@adrinjalali (Member Author)

This test fails:

    @pytest.mark.parametrize(
        "estimator", DATA_VALIDATION_META_ESTIMATORS, ids=_get_meta_estimator_id
    )
    def test_meta_estimators_delegate_data_validation(estimator):
        # Check that meta-estimators delegate data validation to the inner
        # estimator(s).
        rng = np.random.RandomState(0)
        set_random_state(estimator)
    
        n_samples = 30
        X = rng.choice(np.array(["aa", "bb", "cc"], dtype=object), size=n_samples)
    
        if is_regressor(estimator):
            y = rng.normal(size=n_samples)
        else:
            y = rng.randint(3, size=n_samples)
    
        # We convert to lists to make sure it works on array-like
        X = _enforce_estimator_tags_X(estimator, X).tolist()
        y = _enforce_estimator_tags_y(estimator, y).tolist()
    
        # Calling fit should not raise any data validation exception since X is a
        # valid input datastructure for the first step of the pipeline passed as
        # base estimator to the meta estimator.
        estimator.fit(X, y)
    
        # n_features_in_ should not be defined since data is not tabular data.
>       assert not hasattr(estimator, "n_features_in_")
E       AssertionError

@thomasjpfan @glemaitre I wonder why we expect the meta-estimator not to have n_features_in_. Shouldn't n_features_in_ actually be part of the public API?

@thomasjpfan (Member) left a comment

Shouldn't n_features_in_ actually be part of the public API?

For that test, it's passing in a list of strings of shape (n_samples,), which does not have a notion of n_features_in_.
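
A minimal sketch of the kind of input that test constructs; a flat, 1-d list of strings has no second axis, so there is no feature count for n_features_in_ to record:

    import numpy as np

    rng = np.random.RandomState(0)
    # Same shape of input as in the failing test: 30 samples, no columns.
    X = rng.choice(np.array(["aa", "bb", "cc"], dtype=object), size=30).tolist()
    # X is a plain list like ['bb', 'aa', 'aa', ...] of length 30; with no
    # second axis there is nothing for n_features_in_ to count.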

@adrinjalali (Member Author)

This is now ready for review.

@adam2392 (Member) left a comment

LGTM overall!

A naive question: when pickling, or loading the "frozen estimator" from disk with skops, is it guaranteed to be the same byte for byte?

@adrinjalali (Member Author)

A naive question: when pickling, or loading the "frozen estimator" from disk with skops, is it guaranteed to be the same byte for byte?

@adam2392 Neither pickle nor skops guarantees byte-for-byte equivalence. But FrozenEstimator doesn't really interfere with those mechanisms, so whatever happens with persistence isn't changed by this estimator.
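
For illustration, a minimal sketch (not from the PR) of what "doesn't interfere" means in practice: the wrapper round-trips through pickle like any other estimator, even though byte-for-byte identity of the serialized form is still not guaranteed:

    import pickle

    from sklearn.datasets import make_classification
    from sklearn.frozen import FrozenEstimator
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    frozen = FrozenEstimator(LogisticRegression(random_state=0).fit(X, y))

    # Round-trip through pickle; the restored wrapper predicts identically.
    restored = pickle.loads(pickle.dumps(frozen))
    assert (restored.predict(X) == frozen.predict(X)).all()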

@glemaitre self-requested a review September 9, 2024 21:22
@glemaitre self-requested a review September 10, 2024 15:51
@glemaitre (Member)

Some additional thoughts: we need an example, maybe with several sections. In general, this is an alternative to the prefit parameter, which is not available everywhere, or to cv="prefit". So I see the scope of the example as "How to provide pre-fitted models to a meta-estimator".

It makes me think that we need to give additional thought to prefit and cv="prefit", and to whether we should deprecate those options and advocate always using FrozenEstimator instead.
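
To make the comparison concrete, here is a minimal sketch (not part of the PR) of the two patterns, using CalibratedClassifierCV as an example of a meta-estimator that exposes cv="prefit"; reusing the training data for calibration here is purely illustrative:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.frozen import FrozenEstimator
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    clf = LogisticRegression(random_state=0).fit(X, y)

    # Existing pattern: only meta-estimators that expose cv="prefit" allow this.
    calibrated_prefit = CalibratedClassifierCV(clf, cv="prefit").fit(X, y)

    # FrozenEstimator works with any meta-estimator: fit on the wrapper is a
    # no-op, so the already-fitted clf is reused as-is.
    calibrated_frozen = CalibratedClassifierCV(FrozenEstimator(clf)).fit(X, y)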

@thomasjpfan (Member)

And right now we explain "frozen" by using the word frozen a lot ♻️. And I struggle to come up with a better suggestion.

If we are using the estimator's name to describe the feature, then I see that as good naming. i.e. The "RandomForestClassifier" estimator is a "random forest classifier".

In any case, I am +0.5 on PretrainedEstimator, since it is normally associated with deep learning. Other options are ImmutableEstimator or ReadOnlyEstimator.

@ogrisel (Member) commented Oct 10, 2024

I think I like the FrozenEstimator name. We could just better explain in the docstring and the docs what this wrapper does and what one might want to use it for.

@ogrisel (Member) commented Oct 10, 2024

Also added set_params / get_params to make sure users can't override inner estimator params using a frozen.set_params(estimator__<param>=value) syntax.

I wonder if we really want this or not. This seems unrelated to the problem of making sure that the model is not refitted. Being able to set the parameters of the inner model could be useful in some rare cases where parameters influence the prediction / transformation behavior of the model without refitting (one could imagine the temperature parameter of a softmax-based transformation, for instance, or changing the probability value of test-time dropout), or the verbosity level.

Maybe set_params could be protected by default, but FrozenEstimator could make it possible to opt in explicitly with allow_set_params=True?

@lorentzenchr added this to the 1.6 milestone Oct 14, 2024
@lorentzenchr (Member)

Let's get this in. Postponing edge cases and advanced features.
For the naming, let's vote (non-binding) with 👍 and 👎 :

@lorentzenchr (Member)

FrozenEstimator

@lorentzenchr (Member)

PretrainedEstimator

@lorentzenchr (Member)

ImmutableEstimator

@lorentzenchr (Member)

ReadOnlyEstimator

@adrinjalali (Member Author)

Since PretrainedEstimator has some sort of a DL connotation to it, I would prefer FrozenEstimator, PrefittedEstimator, or FittedEstimator.

@lorentzenchr (Member)

@scikit-learn/core-devs @scikit-learn/communication-team @scikit-learn/contributor-experience-team @scikit-learn/documentation-team Your (non-binding) vote for names could help @adrinjalali finish this PR.

@betatim (Member) commented Oct 16, 2024

No mega-strong feelings. I like pretrained because I know the word from DL, and to me this feels like using a pretrained model. I also like prefitted, kind of a DL/scikit-learn hybrid. But my opinions on naming aren't strong enough to veto any of the suggestions.

@adrinjalali (Member Author) left a comment

#29705 (comment)

When I started reading the PR I assumed that FrozenEstimator somehow allows you to fit the estimator once, but after that it would be frozen.

I actually don't mind that idea. It would also make this estimator more "scikit-learn compatible". But it might be more confusing compared to what we have here? I'm easy either way.

#29705 (comment)

I wonder if we really want this or not. This seems unrelated to the problem of making sure that the model is not refitted. Being able to set the parameters of the inner model could be useful in some rare cases where parameters influence the prediction / transformation behavior of the model without refitting (one could imagine the temperature parameter of a softmax-based transformation, for instance, or changing the probability value of test-time dropout), or the verbosity level.

Maybe set_params could be protected by default, but FrozenEstimator could make it possible to opt in explicitly with allow_set_params=True?

I think we don't want users to set those parameters in a GridSearch kinda setting. If they want to set parameters, they can always do frozen.estimator.set_params(), which I've added to the error message now.
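
A rough sketch of the resulting behavior; the exact exception type and message are assumptions here, not verbatim from the PR:

    from sklearn.datasets import make_classification
    from sklearn.frozen import FrozenEstimator
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    frozen = FrozenEstimator(LogisticRegression(random_state=0).fit(X, y))

    # Setting inner-estimator params through the wrapper is rejected...
    try:
        frozen.set_params(estimator__C=0.1)
    except Exception as exc:  # the concrete exception type is an assumption
        print(exc)

    # ...but, as noted above, the inner estimator stays directly reachable:
    frozen.estimator.set_params(C=0.1)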

Naming

Thinking about the name, I think I like FrozenEstimator more than others, since pretrained models also have the ability to be fine-tuned, which we're not really doing / allowing here. Hopefully the improved docstring makes the name issue go away?

Comment on lines 29 to 30
REGRESSION_DATASET = make_regression()
CLASSIFICATION_DATASET = make_classification()
@adrinjalali (Member Author)

Sure, applying the change. But I find it about 100 times more convoluted than the code I had 😬. Like, one needs to understand how request works (good luck with that) on top of fixtures. There are so many layers of magic in this code. I wonder why you prefer it to the code I had.

Comment on lines +50 to +59
>>> from sklearn.datasets import make_classification
>>> from sklearn.frozen import FrozenEstimator
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(random_state=0)
>>> clf = LogisticRegression(random_state=0).fit(X, y)
>>> frozen_clf = FrozenEstimator(clf)
>>> frozen_clf.fit(X, y) # No-op
FrozenEstimator(estimator=LogisticRegression(random_state=0))
>>> frozen_clf.predict(X) # Predictions from `clf.predict`
array(...)
Member

Maybe just a nitpick, but to really show the behavior I would fit the first time with, say, the first 10 data points and print the explicit predictions of a few elements, then call fit with all the data and print those elements again.

Another nit: do we really need to pass a random_state to the logistic regression when using the default solver?

@adrinjalali (Member Author)

Maybe just a nitpick, but to really show the behavior I would fit the first time with, say, the first 10 data points and print the explicit predictions of a few elements, then call fit with all the data and print those elements again.

I think that's more of a test than documentation. Writing a good example to show that behavior is outside the scope of these tiny docstring examples; we shouldn't make them too long.

Another nit: do we really need to pass a random_state to the logistic regression when using the default solver?

I don't know, and that might also change. Whenever there's a random_state, I'd rather set it.

@adrinjalali (Member Author)

Could we move this forward now?

@lorentzenchr added the Waiting for Second Reviewer label Oct 21, 2024
@glemaitre (Member) left a comment

LGTM.

@glemaitre (Member)

Since FrozenEstimator is the most-voted name and the other requests have been addressed, I'm merging.

@glemaitre merged commit 4ee3afa into scikit-learn:main Oct 28, 2024
33 of 34 checks passed
@adrinjalali deleted the frozen branch October 28, 2024 09:58
Labels: No Changelog Needed, Waiting for Second Reviewer

9 participants