ENH add __sklearn_version__ to show pickled version #25280
Conversation
Trying to think of cases where this would result in a regression; I can't think of any. Could you check that something like gradient boosting wouldn't see a massive slowdown?
Otherwise, the changelog also needs to be fixed.
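A rough sketch of the kind of timing check being asked for here (not from the PR; the dataset size and estimator choice are arbitrary), to be run on both this branch and main:

```python
# Rough timing sketch: measure fit time for an estimator that creates many
# objects, to check whether the extra work in __new__ adds visible overhead.
import timeit

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

n_runs = 3
total = timeit.timeit(
    lambda: GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y),
    number=n_runs,
)
print(f"mean fit time: {total / n_runs:.3f} s")  # compare against main
```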
@@ -119,6 +119,12 @@ class BaseEstimator:
        arguments (no ``*args`` or ``**kwargs``).
        """

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
You should safeguard against not having `__module__`; not all classes have it. I'm not sure if that can ever be the case for something inheriting from `BaseEstimator`, but while working on skops we discovered many cases where `__module__` is not there, or is modified to something which is not true (like a scipy ufunc pretending it belongs to numpy).
E.g.:
-        if cls.__module__.startswith("sklearn."):
+        if getattr(cls, "__module__", "").startswith("sklearn."):
    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
            est.__sklearn_version__ = __version__
I think we should not introduce our own "dunder" attributes on scikit-learn estimator instances. Just using a private attribute should be enough:

-            est.__sklearn_version__ = __version__
+            est._sklearn_version = __version__
Note that the rest of the code would need to be adapted accordingly if this suggestion is accepted.
For reference: https://peps.python.org/pep-0008/#descriptive-naming-styles

`__double_leading_and_trailing_underscore__`: “magic” objects or attributes that live in user-controlled namespaces. E.g. `__init__`, `__import__` or `__file__`. Never invent such names; only use them as documented.
I think we're already down the path of "dunder is how we signal dev API but not user API": we already have `__sklearn_is_fitted__`, and we just approved the `__sklearn_clone__` SLEP.

Here, the question is: do we want to expose this attribute in a way people should rely on and read, or is it only a private thing for internal use? Since internally we don't actually need it, and the issue started with a user/dev requesting access to this attribute, a private attribute with no guarantees on backward compatibility doesn't make much sense. We can expose it as a public attribute either with or without dunders.

And since we already use `__sklearn_is_fitted__` inside sklearn, and we might do the same with `__sklearn_clone__`, it seems sensible to me to have a `__sklearn_version__` here.
`__protocol__` is a common pattern with Python numerical libraries for creating protocols. For example:
- NumPy has `__array_function__`, `__array_interface__`, `__array__`, etc.
- The DataFrame exchange protocol has `__dataframe__`
- Dask has `__dask_graph__`, etc.
- CUDA GPU array libraries use `__cuda_array_interface__`
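For context, a minimal sketch (not part of this PR) of how an existing dunder dev-API hook is consumed: `check_is_fitted` uses `__sklearn_is_fitted__` when an estimator defines it.

```python
# Toy estimator opting into the existing __sklearn_is_fitted__ protocol,
# which check_is_fitted() consults when the method is defined.
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted


class AlwaysFittedEstimator(BaseEstimator):
    def fit(self, X, y=None):
        return self

    def __sklearn_is_fitted__(self):
        # check_is_fitted() uses this instead of looking for
        # attributes ending in an underscore.
        return True


check_is_fitted(AlwaysFittedEstimator())  # does not raise NotFittedError
```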
if "_sklearn_version" in state: | ||
pickle_version = state.pop("_sklearn_version") | ||
else: | ||
pickle_version = state["__sklearn_version__"] |
if "_sklearn_version" in state: | |
pickle_version = state.pop("_sklearn_version") | |
else: | |
pickle_version = state["__sklearn_version__"] | |
pickle_version = state.get("_sklearn_version", "pre-0.18") |
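A simplified sketch (not the exact PR code; the mixin name is hypothetical) of how the suggested `state.get(...)` fallback could sit inside a `__setstate__` that warns on version mismatch:

```python
# Simplified sketch of a __setstate__ using the suggested state.get(...)
# fallback; "pre-0.18" marks pickles that predate version tracking.
import warnings

from sklearn import __version__


class VersionCheckedPickleMixin:  # hypothetical name, for illustration only
    def __setstate__(self, state):
        # Only classes defined inside scikit-learn carry the version marker.
        if type(self).__module__.startswith("sklearn."):
            pickle_version = state.get("_sklearn_version", "pre-0.18")
            if pickle_version != __version__:
                warnings.warn(
                    f"Trying to unpickle estimator {type(self).__name__} from "
                    f"version {pickle_version} when using version {__version__}.",
                    UserWarning,
                )
        self.__dict__.update(state)
```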
        # Before 1.3, `_sklearn_version` was used to store the sklearn version
        # when the estimator was pickled
-        # Before 1.3, `_sklearn_version` was used to store the sklearn version
-        # when the estimator was pickled
If we go this route, the warning message should be adapted because the attribute now marks the version of scikit-learn used to originally instantiate the scikit-learn estimator rather than the version of scikit-learn used to last pickle the estimator. Those are not necessarily the same versions:
@ogrisel The use case that you provide seems valid. It somehow means that we should be triggering an update of the
I"m a bit lost on what this attribute is used for/what user problem it solves. In the original issue #22055 the request was for "programmatically determine which version a pickle was saved from.". For me this means that if you (1) fit and dump an estimator in v1, then (2) load and dump it in v2 and then (3) load it in v3 the attribute should read v2. That is the version it was (last) pickled with. The bug that was reported was that as soon as you load the estimator in step (2) the attribute would say v2, instead of v1. A related question is, is "last version used to pickle the estimator" actually what the user was thinking of, or is what they "really" meant "last version this estimator was modified with"? In 99% of the cases the answer is the same, but in cases like the one Olivier described the answer is different. My hunch is that what people actually care about is "version this was fitted with". At least I can't think of a good reason why I'd want to know the version that step (2) happened with (AFAIK it doesn't tell you anything about whether or not the estimator actually works with v2). Trying to figure out the answer to this would be nice. Hopefully it will also help with the name of the attribute, because "sklearn pickle version" is as clear as mud to me. The version of pickle used to store this? The version of sklearn? Something else? |
We cannot really support the use case in #25280 (comment), because our internal data structures can change and parameters can change between versions. A model can be instantiated in v1.1 with a parameter that was removed in v1.2. If one wants to train a model in v1.2, they can create a model in v1.2 and train it. If they want to continue to train a v1.1 model with

I agree that this is almost always the case, and that is the assumption we are making with the warning in

For me, this was the intention. With this PR, we are making the assumption that "the version used to create the estimator is the same as the version used to train it". I'd say that is ~96% the case. An edge case is during a distributed hyper-parameter search with

I opened #25297 as an alternative that places more information about the versions in the warning object itself and does not add any more state to the estimator.
Closing since we introduced the new type of warning.
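For readers landing here later: a hedged usage sketch of the merged alternative (#25297), assuming the `InconsistentVersionWarning` API documented in subsequent scikit-learn releases (attribute names may differ in your version):

```python
# Hedged sketch: catch the version-mismatch warning introduced by the
# alternative PR and read the pickling version from the warning object.
import pickle
import warnings

from sklearn.exceptions import InconsistentVersionWarning

with open("model.pkl", "rb") as f:  # hypothetical path
    payload = f.read()

with warnings.catch_warnings():
    warnings.simplefilter("error", InconsistentVersionWarning)
    try:
        est = pickle.loads(payload)
    except InconsistentVersionWarning as w:
        print("pickled with scikit-learn", w.original_sklearn_version)
```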
closes #25273
First proposal for setting an instance attribute `__sklearn_version__`.