8000 ENH add __sklearn_version__ to show pickled version by glemaitre · Pull Request #25280 · scikit-learn/scikit-learn · GitHub

Closed
wants to merge 1 commit

Conversation

glemaitre (Member)

closes #25273

First proposal for setting an instance attribute __sklearn_version__.

@adrinjalali (Member) left a comment

Trying to think of cases where this would result in a regression; I can't think of any. Could you check that something like gradient boosting wouldn't see a massive slowdown?

Otherwise the changelog also needs to be fixed.

@@ -119,6 +119,12 @@ class BaseEstimator:
    arguments (no ``*args`` or ``**kwargs``).
    """

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
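The hunk above is cut off by the page extraction. A minimal, self-contained sketch of the proposed mechanism; the `__version__` string and the `StandardScaler`/`ThirdPartyEstimator` classes here are illustrative stand-ins, not the actual PR code:

```python
__version__ = "1.3.dev0"  # stand-in for sklearn.__version__


class BaseEstimator:
    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        # Only stamp estimators defined inside scikit-learn itself,
        # so third-party subclasses are left untouched.
        if cls.__module__.startswith("sklearn."):
            est.__sklearn_version__ = __version__
        return est


class StandardScaler(BaseEstimator):
    pass


# Fake the module name to simulate a first-party estimator.
StandardScaler.__module__ = "sklearn.preprocessing"


class ThirdPartyEstimator(BaseEstimator):
    pass  # __module__ is not under "sklearn.", so no stamp


scaler = StandardScaler()
print(scaler.__sklearn_version__)                             # 1.3.dev0
print(hasattr(ThirdPartyEstimator(), "__sklearn_version__"))  # False
```

Because the stamping happens in `__new__` rather than `__init__` or `fit`, it records the version that *created* the instance, which is the crux of the discussion further down.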
Member

You should safeguard against not having __module__; not all classes have one. I'm not sure whether that can ever be the case for something inheriting from BaseEstimator, but while working on skops we discovered many cases where __module__ is missing, or is modified to something untrue (like a scipy ufunc pretending it belongs to numpy).

@@ -119,6 +119,12 @@ class BaseEstimator:
    arguments (no ``*args`` or ``**kwargs``).
    """

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
Member

E.g.:

Suggested change:
-        if cls.__module__.startswith("sklearn."):
+        if getattr(cls, "__module__", "").startswith("sklearn."):
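To illustrate why the guard matters, here is a hypothetical class whose `__module__` lookup fails (simulating the kind of broken extension types mentioned above): the plain attribute access raises, while the `getattr` form with a default degrades gracefully.

```python
class NoModuleMeta(type):
    # Hypothetical: make __module__ lookup on the class raise,
    # mimicking objects that lack or corrupt this attribute.
    def __getattribute__(cls, name):
        if name == "__module__":
            raise AttributeError(name)
        return super().__getattribute__(name)


class Weird(metaclass=NoModuleMeta):
    pass


# Unguarded access raises:
try:
    Weird.__module__
except AttributeError:
    print("no __module__")

# Guarded access degrades gracefully to the empty-string default:
print(getattr(Weird, "__module__", "").startswith("sklearn."))  # False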

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
            est.__sklearn_version__ = __version__
@ogrisel (Member), Jan 3, 2023

I think we should not introduce our own "dunder" attributes on scikit-learn estimator instances. Just using a private attribute should be enough:

Suggested change:
-            est.__sklearn_version__ = __version__
+            est._sklearn_version = __version__

Member

Note that the rest of the code would need to be adapted accordingly when accepting this suggestion.

Member

For reference:

https://peps.python.org/pep-0008/#descriptive-naming-styles

__double_leading_and_trailing_underscore__: “magic” objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.

Member

I think we're already down the path of "dunder is how we signal dev API but not user API". We already have __sklearn_is_fitted__, and we just approved the __sklearn_clone__ SLEP.

Here, the question is: do we want to expose this attribute in a way people should rely on and read, or is it only a private thing for internal use? Since internally we don't actually need it, and the issue was started by a user/dev requesting access to this attribute, a private attribute with no backward-compatibility guarantees doesn't make much sense. We can expose it as a public attribute either with the dunder or without it.

And we already use __sklearn_is_fitted__ inside sklearn, and we might do that with __sklearn_clone__ as well. Therefore to me it is sensible to have a __sklearn_version__ here.

@thomasjpfan (Member), Jan 5, 2023

__protocol__ is a common pattern among Python numerical libraries for defining protocols. For example:

  • NumPy has __array_function__, __array_interface__, __array__, etc.
  • DataFrame exchange protocol has __dataframe__
  • Dask has __dask_graph__, etc.
  • CUDA GPU array libraries use __cuda_array_interface__
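A minimal sketch of how such a protocol is consumed in practice, loosely modelled on `__sklearn_is_fitted__` (simplified; not scikit-learn's actual `check_is_fitted` implementation): the class opts in by defining the dunder method, and consumers probe for it with `hasattr`.

```python
class Estimator:
    def __init__(self):
        self._fitted = False

    def fit(self):
        self._fitted = True
        return self

    def __sklearn_is_fitted__(self):
        # Custom fittedness check, instead of relying on attribute scans.
        return self._fitted


def is_fitted(est):
    # Consumers honour the protocol when it is present.
    if hasattr(est, "__sklearn_is_fitted__"):
        return est.__sklearn_is_fitted__()
    # Fallback heuristic: look for learned attributes ending in "_".
    return any(k.endswith("_") and not k.startswith("__") for k in vars(est))


print(is_fitted(Estimator()))        # False
print(is_fitted(Estimator().fit()))  # True
```

The same probe-for-a-dunder shape would apply to a hypothetical `__sklearn_version__` attribute, which is why the naming question matters.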

Comment on lines +302 to +305
        if "_sklearn_version" in state:
            pickle_version = state.pop("_sklearn_version")
        else:
            pickle_version = state["__sklearn_version__"]
Member

Suggested change:
-        if "_sklearn_version" in state:
-            pickle_version = state.pop("_sklearn_version")
-        else:
-            pickle_version = state["__sklearn_version__"]
+        pickle_version = state.get("_sklearn_version", "pre-0.18")
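What the suggested one-liner does: `dict.get` returns the stored version when the key is present and the `"pre-0.18"` sentinel otherwise, collapsing the if/else. A minimal, hypothetical `__setstate__` illustrating the semantics (the `_loaded_from_version` attribute is invented for the demo; the real method instead compares against the running version and warns):

```python
class Estimator:
    def __setstate__(self, state):
        # "pre-0.18" is the sentinel for pickles that predate
        # version stamping.
        pickle_version = state.get("_sklearn_version", "pre-0.18")
        self.__dict__.update(state)
        self._loaded_from_version = pickle_version


est = Estimator()
est.__setstate__({"_sklearn_version": "1.1.0", "coef_": [1, 2]})
print(est._loaded_from_version)  # 1.1.0

old = Estimator()
old.__setstate__({"coef_": [1, 2]})  # no version key: pickled before 0.18
print(old._loaded_from_version)  # pre-0.18
```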

Comment on lines 300 to 301
# Before 1.3, `_sklearn_version` was used to store the sklearn version
# when the estimator was pickled
Member

Suggested change:
-    # Before 1.3, `_sklearn_version` was used to store the sklearn version
-    # when the estimator was pickled

@ogrisel (Member) commented Jan 3, 2023

If we go this route, the warning message should be adapted because the attribute now marks the version of scikit-learn used to originally instantiate the scikit-learn estimator rather than the version of scikit-learn used to last pickle the estimator.

Those are not necessarily the same versions:

  • The estimator could be instantiated (and typically trained) in 1.1.0;
  • then loaded in 1.2.0, possibly retrained or otherwise mutated, and dumped again;
  • then loaded again in 1.3.0 to compute predictions/transformations with that version of scikit-learn.

@glemaitre (Member, Author)
@ogrisel The use case that you describe seems valid. It implies that we should be updating _sklearn_version in fit, shouldn't we? In other words, the estimator refitted and pickled in 1.2.0 is supposedly compatible even though it was constructed in 1.1.0.

@betatim (Member) commented Jan 4, 2023

I'm a bit lost on what this attribute is used for / what user problem it solves. In the original issue #22055 the request was to "programmatically determine which version a pickle was saved from". For me this means that if you (1) fit and dump an estimator in v1, then (2) load and dump it in v2, and then (3) load it in v3, the attribute should read v2. That is the version it was (last) pickled with.

The bug that was reported was that as soon as you load the estimator in step (2) the attribute would say v2, instead of v1.

A related question is, is "last version used to pickle the estimator" actually what the user was thinking of, or is what they "really" meant "last version this estimator was modified with"? In 99% of the cases the answer is the same, but in cases like the one Olivier described the answer is different.

My hunch is that what people actually care about is "version this was fitted with". At least I can't think of a good reason why I'd want to know the version that step (2) happened with (AFAIK it doesn't tell you anything about whether or not the estimator actually works with v2).

Trying to figure out the answer to this would be nice. Hopefully it will also help with the name of the attribute, because "sklearn pickle version" is as clear as mud to me. The version of pickle used to store this? The version of sklearn? Something else?

@thomasjpfan (Member)
We cannot really support the use case in #25280 (comment), because our internal data structures and parameters can change between versions. A model can be instantiated in v1.1 with a parameter that was removed in v1.2. If one wants to train a model in v1.2, they can create a model in v1.2 and train it. If they want to continue training a v1.1 model with partial_fit, they can load the model in v1.1.

A related question is, is "last version used to pickle the estimator" actually what the user was thinking of, or is what they "really" meant "last version this estimator was modified with"? In 99% of the cases the answer is the same,

I agree that this is almost always the case, and it is the assumption we are making with the warning in main.

My hunch is that what people actually care about is "version this was fitted with".

For me, this was the intention.

With this PR, we are making the assumption that "the version used to create the estimator is the same as the version used to train it". I'd say that is ~96% the case. An edge case is a distributed hyper-parameter search with dask where the local scikit-learn version is v1.1 and the server uses v1.2. In that case, there is a mismatch between the initialization version and the version used for training. I do not want to support this use case, since the v1.1 parameters may not even work with v1.2.

I opened #25297 as an alternative that places more information about the versions in the warning object itself and does not add any more state to the estimator.
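For context, a rough sketch of that alternative direction: attach both versions to the warning object itself instead of storing extra state on the estimator, so callers can catch and inspect it programmatically. The class below is a simplified stand-in, not scikit-learn's actual `InconsistentVersionWarning` API.

```python
import warnings


class InconsistentVersionWarning(UserWarning):
    # Simplified stand-in carrying both versions as attributes.
    def __init__(self, estimator_name, original_sklearn_version,
                 current_sklearn_version):
        self.estimator_name = estimator_name
        self.original_sklearn_version = original_sklearn_version
        self.current_sklearn_version = current_sklearn_version

    def __str__(self):
        return (
            f"Trying to unpickle estimator {self.estimator_name} from "
            f"version {self.original_sklearn_version} when using version "
            f"{self.current_sklearn_version}."
        )


CURRENT_VERSION = "1.2.0"  # placeholder for sklearn.__version__


class Estimator:
    def __setstate__(self, state):
        pickle_version = state.get("_sklearn_version", "pre-0.18")
        if pickle_version != CURRENT_VERSION:
            # Warn with a rich warning instance, not just a message string.
            warnings.warn(
                InconsistentVersionWarning(
                    type(self).__name__, pickle_version, CURRENT_VERSION
                )
            )
        self.__dict__.update(state)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Estimator().__setstate__({"_sklearn_version": "1.1.0"})

print(caught[0].message.original_sklearn_version)  # 1.1.0
```

The appeal is that `estimator.__dict__` stays exactly as pickled, sidestepping the key-mismatch bug from #25273.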

@glemaitre (Member, Author)
Closing since we introduced the new type of warning.

@glemaitre glemaitre closed this Jan 14, 2023
Successfully merging this pull request may close these issues.

__sklearn_pickle_version__ makes estimator.__dict__.keys() == loaded.__dict__.keys() to fail