FEA add FrozenEstimator #29705
Conversation
Interesting discovery, we have a bunch of common tests which are not in …
This test fails:

    @pytest.mark.parametrize(
        "estimator", DATA_VALIDATION_META_ESTIMATORS, ids=_get_meta_estimator_id
    )
    def test_meta_estimators_delegate_data_validation(estimator):
        # Check that meta-estimators delegate data validation to the inner
        # estimator(s).
        rng = np.random.RandomState(0)
        set_random_state(estimator)
        n_samples = 30
        X = rng.choice(np.array(["aa", "bb", "cc"], dtype=object), size=n_samples)
        if is_regressor(estimator):
            y = rng.normal(size=n_samples)
        else:
            y = rng.randint(3, size=n_samples)
        # We convert to lists to make sure it works on array-like
        X = _enforce_estimator_tags_X(estimator, X).tolist()
        y = _enforce_estimator_tags_y(estimator, y).tolist()
        # Calling fit should not raise any data validation exception since X is a
        # valid input datastructure for the first step of the pipeline passed as
        # base estimator to the meta estimator.
        estimator.fit(X, y)
        # n_features_in_ should not be defined since data is not tabular data.
    >   assert not hasattr(estimator, "n_features_in_")
    E   AssertionError

@thomasjpfan @glemaitre I wonder, why do we expect the meta-estimator not to have n_features_in_?
shouldn't n_features_in_ actually be part of the public API?
For that test, it's passing in a list of strings of shape (n_samples,), which does not have a notion of n_features_in_.
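A small illustration of that point (not from the PR): each sample is a single string, so the data has only a sample axis and no feature axis from which n_features_in_ could be derived.

```python
import numpy as np

rng = np.random.RandomState(0)
# One string per sample, as in the common test above.
X = rng.choice(np.array(["aa", "bb", "cc"], dtype=object), size=30)
print(X.shape)  # (30,) -- a single axis of samples, no feature dimension
```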
This is now ready for review.
LGTM overall!
A naive question: when pickling the "frozen estimator", or loading it from disk with skops, is it guaranteed to be the same byte for byte?
@adam2392 neither …
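To make the question concrete, here is a small sketch of a pickle round trip (skops is not exercised here; the FrozenEstimator import path is the one this PR introduces). Byte-for-byte equality of a re-serialized object is generally not guaranteed by pickle, which is what the question is probing.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
frozen = FrozenEstimator(LogisticRegression(random_state=0).fit(X, y))

blob = pickle.dumps(frozen)
restored = pickle.loads(blob)

# Check whether re-serializing the restored object yields identical bytes.
print(pickle.dumps(restored) == blob)
```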
Co-authored-by: Adam Li <adam2392@gmail.com>
Some additional thoughts: we need to have an example, maybe with several sections. But in general, this is an alternative to the … It makes me think that we need to have additional thoughts regarding …
If we are using the estimator's name to describe the feature, then I see that as good naming, i.e. the "RandomForestClassifier" estimator is a "random forest classifier". In any case, I am +0.5 with …
I think I like the …
I wonder if we really want this or not. This seems unrelated to the problem of making sure that the model is not refitted again. Being able to set the parameters of the inner model could be useful in some rare case where the parameters influence the prediction / transformation behavior of the model without refitting (one could imagine, the temperature parameter of a softmax based transformation for instance or changing the probability value of test time dropout), or the verbosity level or some. Maybe set_params could be protected by default, but FrozenEstimator could make it possible to allow_set_params=True explicitly?
Let's get this in. Postponing edge cases and advanced features.
Since …
@scikit-learn/core-devs @scikit-learn/communication-team @scikit-learn/contributor-experience-team @scikit-learn/documentation-team Your (non-binding) vote for names could help @adrinjalali in finishing this PR.
No mega strong feelings. I like pretrained, because I know the word from DL and to me this feels like I am using a pretrained model. I also like prefitted - kinda a DL / scikit-learn hybrid. But my opinions on naming aren't strong enough to veto any of the suggestions.
When I started reading the PR I assumed that FrozenEstimator somehow allows you to fit the estimator once, but after that it would be frozen.
I actually don't mind that idea. It would also make this estimator more "scikit-learn compatible". But it might be more confusing compared to what we have here? I'm easy either way.
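For clarity, a hypothetical sketch of that alternative "fit once, then frozen" semantics; the class name FitOnceEstimator and its behavior are illustrative only, and not what this PR implements.

```python
from sklearn.base import BaseEstimator


class FitOnceEstimator(BaseEstimator):
    """Fit the wrapped estimator on the first call; later fit calls are no-ops."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # Only the first fit call actually trains the inner estimator.
        if not getattr(self, "_fitted_once", False):
            self.estimator.fit(X, y, **fit_params)
            self._fitted_once = True
        return self

    def predict(self, X):
        # Prediction (like other methods) simply delegates to the inner model.
        return self.estimator.predict(X)
```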
I wonder if we really want this or not. This seems unrelated to the problem of making sure that the model is not refitted again. Being able to set the parameters of the inner model could be useful in some rare case where the parameters influence the prediction / transformation behavior of the model without refitting (one could imagine, the temperature parameter of a softmax based transformation for instance or changing the probability value of test time dropout), or the verbosity level or some.
Maybe set_params could be protected by default, but FrozenEstimator could make it possible to allow_set_params=True explicitly?
I think we don't want users to set those parameters in a GridSearch kinda setting. If they want to set parameters, they can always do frozen.estimator.set_params(), which I've added to the error message now.
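As an illustration of the comment above (the raising behavior is an assumption based on this discussion, not copied from the PR diff): set_params on the frozen wrapper is expected to raise, while the inner estimator stays reachable through the estimator attribute.

```python
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
frozen = FrozenEstimator(LogisticRegression().fit(X, y))

try:
    frozen.set_params(C=10.0)  # assumed to raise, per the discussion above
except Exception as exc:
    print(exc)

# Escape hatch mentioned above: set parameters on the wrapped estimator itself.
frozen.estimator.set_params(C=10.0)
```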
Naming
Thinking about the name, I think I like FrozenEstimator more than others, since pretrained models also have the ability to be fine-tuned, which we're not really doing / allowing here. Hopefully the improved docstring makes the name issue go away?
sklearn/frozen/tests/test_frozen.py
REGRESSION_DATASET = make_regression()
CLASSIFICATION_DATASET = make_classification()
Sure, applying the change. But I find it about 100 times more convoluted than the code I had 😬. Like, one needs to understand how request works, which, good luck, on top of fixtures. There are so many layers of magic in this code. I wonder why you prefer it to the code I had.
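For context, a hypothetical sketch of the kind of request-based fixture pattern being discussed; the fixture names and the test body are illustrative, not the PR's actual code.

```python
import pytest
from sklearn.datasets import make_classification, make_regression


@pytest.fixture
def regression_dataset():
    return make_regression()


@pytest.fixture
def classification_dataset():
    return make_classification()


@pytest.mark.parametrize(
    "dataset_name", ["regression_dataset", "classification_dataset"]
)
def test_dataset_shapes(dataset_name, request):
    # The fixture is looked up by name at runtime via the `request` fixture.
    X, y = request.getfixturevalue(dataset_name)
    assert X.shape[0] == y.shape[0]
```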
>>> from sklearn.datasets import make_classification
>>> from sklearn.frozen import FrozenEstimator
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(random_state=0)
>>> clf = LogisticRegression(random_state=0).fit(X, y)
>>> frozen_clf = FrozenEstimator(clf)
>>> frozen_clf.fit(X, y)  # No-op
FrozenEstimator(estimator=LogisticRegression(random_state=0))
>>> frozen_clf.predict(X)  # Predictions from `clf.predict`
array(...)
Maybe just a nitpick to really show the behavior: I would fit the first time with, say, the first 10 data points and print the explicit predictions of a few elements, then call fit with all data and print those elements again.
Another nit, do we really need to pass a random_state to the logistic regression when using the default solver?
Maybe just a nitpick to really show the behavior: I would fit the first time with, say, the first 10 data points and print the explicit predictions of a few elements, then call fit with all data and print those elements again.
I think that's more of a test than documentation. Writing a good example to show that behavior is outside the scope of these tiny examples we have in docstrings. We shouldn't make them too long.
Another nit, do we really need to pass a random_state to the logistic regression when using the default solver?
I don't know, and that might also change. Whenever there's a random_state, I'd rather set it.
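For reference, a minimal sketch of the behavior the reviewer describes, kept outside the docstring as suggested above; it relies only on fit being a no-op, as shown in the docstring example.

```python
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# Fit once on the first 10 samples, then freeze.
clf = LogisticRegression(random_state=0).fit(X[:10], y[:10])
frozen = FrozenEstimator(clf)

before = frozen.predict(X[:5])
frozen.fit(X, y)  # no-op: the inner model is not refitted
after = frozen.predict(X[:5])

print((before == after).all())  # True -- predictions are unchanged
```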
Could we move this forward now?
LGTM.
Since …
This adds a FrozenEstimator, with tests for pipeline as well. I think it'd be nice to have this and extend the tests if we find regressions?
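As a rough sketch of the pipeline usage the description mentions, assuming (as in the docstring example above) that fit on the frozen step is a no-op while other methods are delegated; the scaler/classifier choice here is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# A classifier trained elsewhere on scaled data.
scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)

# Freezing the classifier means Pipeline.fit only fits the new scaler step;
# the pre-fitted classifier is left untouched.
pipe = make_pipeline(StandardScaler(), FrozenEstimator(clf))
pipe.fit(X, y)
print(pipe.score(X, y))
```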