FEA add FrozenEstimator #29705
Conversation
Interesting discovery, we have a bunch of common tests which are not in …
This test fails:

    @pytest.mark.parametrize(
        "estimator", DATA_VALIDATION_META_ESTIMATORS, ids=_get_meta_estimator_id
    )
    def test_meta_estimators_delegate_data_validation(estimator):
        # Check that meta-estimators delegate data validation to the inner
        # estimator(s).
        rng = np.random.RandomState(0)
        set_random_state(estimator)
        n_samples = 30
        X = rng.choice(np.array(["aa", "bb", "cc"], dtype=object), size=n_samples)
        if is_regressor(estimator):
            y = rng.normal(size=n_samples)
        else:
            y = rng.randint(3, size=n_samples)
        # We convert to lists to make sure it works on array-like
        X = _enforce_estimator_tags_X(estimator, X).tolist()
        y = _enforce_estimator_tags_y(estimator, y).tolist()
        # Calling fit should not raise any data validation exception since X is a
        # valid input datastructure for the first step of the pipeline passed as
        # base estimator to the meta estimator.
        estimator.fit(X, y)
        # n_features_in_ should not be defined since data is not tabular data.
    >   assert not hasattr(estimator, "n_features_in_")
    E   AssertionError

@thomasjpfan @glemaitre I wonder, why do we expect the meta-estimator not to have n_features_in_?
shouldn't n_features_in_ actually be part of the public API?
For that test, it's passing in a list of strings of shape (n_samples,), which does not have a notion of n_features_in_.
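A small illustration of that point (not from the PR): each sample is a single string, so the data has only a sample axis and no feature axis from which n_features_in_ could be derived.

```python
import numpy as np

rng = np.random.RandomState(0)
# One string per sample, as in the common test above.
X = rng.choice(np.array(["aa", "bb", "cc"], dtype=object), size=30)
print(X.shape)  # (30,) -- a single axis of samples, no feature dimension
```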
This is now ready for review.
LGTM overall!
A naive question: when pickling the "frozen estimator", or loading it from disk with skops, is it guaranteed to be the same byte for byte?
@adam2392 neither …
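To make the question concrete, here is a small sketch of a pickle round trip (skops is not exercised here; the FrozenEstimator import path is the one this PR introduces). Byte-for-byte equality of a re-serialized object is generally not guaranteed by pickle, which is what the question is probing.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
frozen = FrozenEstimator(LogisticRegression(random_state=0).fit(X, y))

blob = pickle.dumps(frozen)
restored = pickle.loads(blob)

# Check whether re-serializing the restored object yields identical bytes.
print(pickle.dumps(restored) == blob)
```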
Co-authored-by: Adam Li <adam2392@gmail.com>
Some additional thoughts: we need to have an example, maybe with several sections. But in general, this is an alternative to the … It makes me think that we need to have additional thoughts regarding …
If we are using the estimator's name to describe the feature, then I see that as good naming, i.e. the "RandomForestClassifier" estimator is a "random forest classifier". In any case, I am +0.5 with …
I think I like the …
I wonder if we really want this or not. This seems unrelated to the problem of making sure that the model is not refitted again. Being able to set the parameters of the inner model could be useful in some rare case where the parameters influence the prediction / transformation behavior of the model without refitting (one could imagine, the temperature parameter of a softmax based transformation for instance or changing the probability value of test time dropout), or the verbosity level or some. Maybe set_params could be protected by default, but FrozenEstimator could make it possible to allow_set_params=True explicitly?
Let's get this in. Postponing edge cases and advanced features.
Since …
@scikit-learn/core-devs @scikit-learn/communication-team @scikit-learn/contributor-experience-team @scikit-learn/documentation-team Your (non-binding) vote for names could help @adrinjalali in finishing this PR.
No mega strong feelings. I like pretrained, because I know the word from DL and to me this feels like I am using a pretrained model. I also like prefitted - kinda a DL / scikit-learn hybrid. But my opinions on naming aren't strong enough to veto any of the suggestions.
When I started reading the PR I assumed that FrozenEstimator somehow allows you to fit the estimator once, but after that it would be frozen.
I actually don't mind that idea. It would also make this estimator more "scikit-learn compatible". But it might be more confusing compared to what we have here? I'm easy either way.
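For clarity, a hypothetical sketch of that alternative "fit once, then frozen" semantics; the class name FitOnceEstimator and its behavior are illustrative only, and not what this PR implements.

```python
from sklearn.base import BaseEstimator


class FitOnceEstimator(BaseEstimator):
    """Fit the wrapped estimator on the first call; later fit calls are no-ops."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # Only the first fit call actually trains the inner estimator.
        if not getattr(self, "_fitted_once", False):
            self.estimator.fit(X, y, **fit_params)
            self._fitted_once = True
        return self

    def predict(self, X):
        # Prediction (like other methods) simply delegates to the inner model.
        return self.estimator.predict(X)
```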
I wonder if we really want this or not. This seems unrelated to the problem of making sure that the model is not refitted again. Being able to set the parameters of the inner model could be useful in some rare case where the parameters influence the prediction / transformation behavior of the model without refitting (one could imagine, the temperature parameter of a softmax based transformation for instance or changing the probability value of test time dropout), or the verbosity level or some.
Maybe set_params could be protected by default, but FrozenEstimator could make it possible to allow_set_params=True explicitly?
I think we don't want users to set those parameters in a GridSearch kinda setting. If they want to set parameters, they can always do frozen.estimator.set_params(), which I've added to the error message now.
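As an illustration of the comment above (the raising behavior is an assumption based on this discussion, not copied from the PR diff): set_params on the frozen wrapper is expected to raise, while the inner estimator stays reachable through the estimator attribute.

```python
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
frozen = FrozenEstimator(LogisticRegression().fit(X, y))

try:
    frozen.set_params(C=10.0)  # assumed to raise, per the discussion above
except Exception as exc:
    print(exc)

# Escape hatch mentioned above: set parameters on the wrapped estimator itself.
frozen.estimator.set_params(C=10.0)
```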
Naming
Thinking about the name, I think I like FrozenEstimator more than others, since pretrained models also have the ability to be fine-tuned, which we're not really doing / allowing here. Hopefully the improved docstring makes the name issue go away?
sklearn/frozen/tests/test_frozen.py
REGRESSION_DATASET = make_regression()
CLASSIFICATION_DATASET = make_classification()
Sure, applying the change. But I find it about 100 times more convoluted than the code I had 😬. Like, one needs to understand how request works, which, good luck, on top of fixtures. There are so many layers of magic in this code. I wonder why you prefer it to the code I had.
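For context, a hypothetical sketch of the kind of request-based fixture pattern being discussed; the fixture names and the test body are illustrative, not the PR's actual code.

```python
import pytest
from sklearn.datasets import make_classification, make_regression


@pytest.fixture
def regression_dataset():
    return make_regression()


@pytest.fixture
def classification_dataset():
    return make_classification()


@pytest.mark.parametrize(
    "dataset_name", ["regression_dataset", "classification_dataset"]
)
def test_dataset_shapes(dataset_name, request):
    # The fixture is looked up by name at runtime via the `request` fixture.
    X, y = request.getfixturevalue(dataset_name)
    assert X.shape[0] == y.shape[0]
```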
>>> from sklearn.datasets import make_classification
>>> from sklearn.frozen import FrozenEstimator
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(random_state=0)
>>> clf = LogisticRegression(random_state=0).fit(X, y)
>>> frozen_clf = FrozenEstimator(clf)
>>> frozen_clf.fit(X, y)  # No-op
FrozenEstimator(estimator=LogisticRegression(random_state=0))
>>> frozen_clf.predict(X)  # Predictions from `clf.predict`
array(...)
Maybe just a nitpick to really show the behavior: I would fit the first time with, say, the first 10 data points and print the explicit predictions of a few elements, then call fit with all data and print those elements again.
Another nit, do we really need to pass a random_state to the logistic regression when using the default solver?
Maybe just a nitpick to really show the behavior: I would fit the first time with, say, the first 10 data points and print the explicit predictions of a few elements, then call fit with all data and print those elements again.
I think that's more of a test than documentation. Writing a good example to show that behavior is outside the scope of these tiny examples we have in docstrings. We shouldn't make them too long.
Another nit, do we really need to pass a random_state to the logistic regression when using the default solver?
I don't know, and that might also change. Whenever there's a random_state, I'd rather set it.
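For reference, a minimal sketch of the behavior the reviewer describes, kept outside the docstring as suggested above; it relies only on fit being a no-op, as shown in the docstring example.

```python
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# Fit once on the first 10 samples, then freeze.
clf = LogisticRegression(random_state=0).fit(X[:10], y[:10])
frozen = FrozenEstimator(clf)

before = frozen.predict(X[:5])
frozen.fit(X, y)  # no-op: the inner model is not refitted
after = frozen.predict(X[:5])

print((before == after).all())  # True -- predictions are unchanged
```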
Could we move this forward now?
LGTM.
Since …
This adds a FrozenEstimator, with tests for pipeline as well. I think it'd be nice to have this and extend the tests if we find regressions?
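As a rough sketch of the pipeline usage the description mentions, assuming (as in the docstring example above) that fit on the frozen step is a no-op while other methods are delegated; the scaler/classifier choice here is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# A classifier trained elsewhere on scaled data.
scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)

# Freezing the classifier means Pipeline.fit only fits the new scaler step;
# the pre-fitted classifier is left untouched.
pipe = make_pipeline(StandardScaler(), FrozenEstimator(clf))
pipe.fit(X, y)
print(pipe.score(X, y))
```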