8000 ENH add __sklearn_version__ to show pickled version by glemaitre · Pull Request #25280 · scikit-learn/scikit-learn · GitHub

Closed
wants to merge 1 commit

Conversation

glemaitre (Member)

closes #25273

First proposal for setting an instance attribute __sklearn_version__.

@adrinjalali (Member) left a comment

Trying to think of cases where this would result in a regression; I can't think of any. Could you check that something like gradient boosting wouldn't see a massive slowdown?

Otherwise the changelog also needs to be fixed.

@@ -119,6 +119,12 @@ class BaseEstimator:
    arguments (no ``*args`` or ``**kwargs``).
    """

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
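The hunk above is cut off by the page extraction. A minimal, self-contained sketch of the proposed mechanism; the `__version__` string and the `StandardScaler`/`ThirdPartyEstimator` classes here are illustrative stand-ins, not the actual PR code:

```python
__version__ = "1.3.dev0"  # stand-in for sklearn.__version__


class BaseEstimator:
    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        # Only stamp estimators defined inside scikit-learn itself,
        # so third-party subclasses are left untouched.
        if cls.__module__.startswith("sklearn."):
            est.__sklearn_version__ = __version__
        return est


class StandardScaler(BaseEstimator):
    pass


# Fake the module name to simulate a first-party estimator.
StandardScaler.__module__ = "sklearn.preprocessing"


class ThirdPartyEstimator(BaseEstimator):
    pass  # __module__ is not under "sklearn.", so no stamp


scaler = StandardScaler()
print(scaler.__sklearn_version__)                             # 1.3.dev0
print(hasattr(ThirdPartyEstimator(), "__sklearn_version__"))  # False
```

Because the stamping happens in `__new__` rather than `__init__` or `fit`, it records the version that *created* the instance, which is the crux of the discussion further down.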
Member

You should safeguard against not having __module__; not all classes have one. I'm not sure whether that can ever be the case for something inheriting from BaseEstimator, but while working on skops we discovered many cases where __module__ is missing, or is modified to something untrue (like a scipy ufunc pretending it belongs to numpy).

@@ -119,6 +119,12 @@ class BaseEstimator:
    arguments (no ``*args`` or ``**kwargs``).
    """

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
Member

E.g.:

Suggested change:
-        if cls.__module__.startswith("sklearn."):
+        if getattr(cls, "__module__", "").startswith("sklearn."):
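To illustrate why the guard matters, here is a hypothetical class whose `__module__` lookup fails (simulating the kind of broken extension types mentioned above): the plain attribute access raises, while the `getattr` form with a default degrades gracefully.

```python
class NoModuleMeta(type):
    # Hypothetical: make __module__ lookup on the class raise,
    # mimicking objects that lack or corrupt this attribute.
    def __getattribute__(cls, name):
        if name == "__module__":
            raise AttributeError(name)
        return super().__getattribute__(name)


class Weird(metaclass=NoModuleMeta):
    pass


# Unguarded access raises:
try:
    Weird.__module__
except AttributeError:
    print("no __module__")

# Guarded access degrades gracefully to the empty-string default:
print(getattr(Weird, "__module__", "").startswith("sklearn."))  # False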

    def __new__(cls, *args, **kwargs):
        est = super().__new__(cls)
        if cls.__module__.startswith("sklearn."):
            est.__sklearn_version__ = __version__
@ogrisel (Member), Jan 3, 2023

I think we should not introduce our own "dunder" attributes on scikit-learn estimator instances. Just using a private attribute should be enough:

Suggested change:
-            est.__sklearn_version__ = __version__
+            est._sklearn_version = __version__

Member

Note that the rest of the code would need to be adapted accordingly when accepting this suggestion.

Member

For reference:

https://peps.python.org/pep-0008/#descriptive-naming-styles

__double_leading_and_trailing_underscore__: “magic” objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.

Member

I think we're already down the path of "dunder is how we signal dev API but not user API". We already have __sklearn_is_fitted__, and we just approved the __sklearn_clone__ SLEP.

Here, the question is: do we want to expose this attribute in a way people should rely on and read, or is it only a private thing for internal use? Since internally we don't actually need it, and the issue was started by a user/dev requesting access to this attribute, a private attribute with no backward-compatibility guarantees doesn't make much sense. We can expose it as a public attribute either with the dunder or without it.

And we already use __sklearn_is_fitted__ inside sklearn, and we might do that with __sklearn_clone__ as well. Therefore to me it is sensible to have a __sklearn_version__ here.

@thomasjpfan (Member), Jan 5, 2023

__protocol__ is a common pattern among Python numerical libraries for defining protocols. For example:

  • NumPy has __array_function__, __array_interface__, __array__, etc.
  • DataFrame exchange protocol has __dataframe__
  • Dask has __dask_graph__, etc.
  • CUDA GPU array libraries use __cuda_array_interface__
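A minimal sketch of how such a protocol is consumed in practice, loosely modelled on `__sklearn_is_fitted__` (simplified; not scikit-learn's actual `check_is_fitted` implementation): the class opts in by defining the dunder method, and consumers probe for it with `hasattr`.

```python
class Estimator:
    def __init__(self):
        self._fitted = False

    def fit(self):
        self._fitted = True
        return self

    def __sklearn_is_fitted__(self):
        # Custom fittedness check, instead of relying on attribute scans.
        return self._fitted


def is_fitted(est):
    # Consumers honour the protocol when it is present.
    if hasattr(est, "__sklearn_is_fitted__"):
        return est.__sklearn_is_fitted__()
    # Fallback heuristic: look for learned attributes ending in "_".
    return any(k.endswith("_") and not k.startswith("__") for k in vars(est))


print(is_fitted(Estimator()))        # False
print(is_fitted(Estimator().fit()))  # True
```

The same probe-for-a-dunder shape would apply to a hypothetical `__sklearn_version__` attribute, which is why the naming question matters.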

Comment on lines +302 to +305
        if "_sklearn_version" in state:
            pickle_version = state.pop("_sklearn_version")
        else:
            pickle_version = state["__sklearn_version__"]
Member

Suggested change:
-        if "_sklearn_version" in state:
-            pickle_version = state.pop("_sklearn_version")
-        else:
-            pickle_version = state["__sklearn_version__"]
+        pickle_version = state.get("_sklearn_version", "pre-0.18")
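What the suggested one-liner does: `dict.get` returns the stored version when the key is present and the `"pre-0.18"` sentinel otherwise, collapsing the if/else. A minimal, hypothetical `__setstate__` illustrating the semantics (the `_loaded_from_version` attribute is invented for the demo; the real method instead compares against the running version and warns):

```python
class Estimator:
    def __setstate__(self, state):
        # "pre-0.18" is the sentinel for pickles that predate
        # version stamping.
        pickle_version = state.get("_sklearn_version", "pre-0.18")
        self.__dict__.update(state)
        self._loaded_from_version = pickle_version


est = Estimator()
est.__setstate__({"_sklearn_version": "1.1.0", "coef_": [1, 2]})
print(est._loaded_from_version)  # 1.1.0

old = Estimator()
old.__setstate__({"coef_": [1, 2]})  # no version key: pickled before 0.18
print(old._loaded_from_version)  # pre-0.18
```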

Comment on lines 300 to 301
# Before 1.3, `_sklearn_version` was used to store the sklearn version
# when the estimator was pickled
Member

Suggested change:
-    # Before 1.3, `_sklearn_version` was used to store the sklearn version
-    # when the estimator was pickled

@ogrisel (Member) commented Jan 3, 2023

If we go this route, the warning message should be adapted because the attribute now marks the version of scikit-learn used to originally instantiate the scikit-learn estimator rather than the version of scikit-learn used to last pickle the estimator.

Those are not necessarily the same versions:

  • The estimator could be instantiated (and typically trained) in 1.1.0;
  • then loaded in 1.2.0, possibly retrained or otherwise mutated, and dumped again;
  • then loaded again in 1.3.0 to compute predictions/transformations with that version of scikit-learn.

@glemaitre (Member, Author)
@ogrisel The use case that you describe seems valid. It implies that we should be updating _sklearn_version in fit, shouldn't we? In other words, the estimator refitted and pickled in 1.2.0 is supposedly compatible even though it was constructed in 1.1.0.

@betatim (Member) commented Jan 4, 2023

I'm a bit lost on what this attribute is used for / what user problem it solves. In the original issue #22055 the request was to "programmatically determine which version a pickle was saved from". For me this means that if you (1) fit and dump an estimator in v1, then (2) load and dump it in v2, and then (3) load it in v3, the attribute should read v2. That is the version it was (last) pickled with.

The bug that was reported was that as soon as you load the estimator in step (2) the attribute would say v2, instead of v1.

A related question is, is "last version used to pickle the estimator" actually what the user was thinking of, or is what they "really" meant "last version this estimator was modified with"? In 99% of the cases the answer is the same, but in cases like the one Olivier described the answer is different.

My hunch is that what people actually care about is "version this was fitted with". At least I can't think of a good reason why I'd want to know the version that step (2) happened with (AFAIK it doesn't tell you anything about whether or not the estimator actually works with v2).

Trying to figure out the answer to this would be nice. Hopefully it will also help with the name of the attribute, because "sklearn pickle version" is as clear as mud to me. The version of pickle used to store this? The version of sklearn? Something else?

@thomasjpfan (Member)
We cannot really support the use case in #25280 (comment), because our internal data structures and parameters can change between versions. A model can be instantiated in v1.1 with a parameter that was removed in v1.2. If one wants to train a model in v1.2, they can create a model in v1.2 and train it. If they want to continue training a v1.1 model with partial_fit, they can load the model in v1.1.

A related question is, is "last version used to pickle the estimator" actually what the user was thinking of, or is what they "really" meant "last version this estimator was modified with"? In 99% of the cases the answer is the same,

I agree that this is almost always the case, and it is the assumption we are making with the warning in main.

My hunch is that what people actually care about is "version this was fitted with".

For me, this was the intention.

With this PR, we are making the assumption that "the version used to create the estimator is the same as the version used to train it". I'd say that is ~96% the case. An edge case is a distributed hyper-parameter search with dask where the local scikit-learn version is v1.1 and the server uses v1.2. In that case, there is a mismatch between the initialization version and the version used for training. I do not want to support this use case, since the v1.1 parameters may not even work with v1.2.

I opened #25297 as an alternative that places more information about the versions in the warning object itself and does not add any more state to the estimator.
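For context, a rough sketch of that alternative direction: attach both versions to the warning object itself instead of storing extra state on the estimator, so callers can catch and inspect it programmatically. The class below is a simplified stand-in, not scikit-learn's actual `InconsistentVersionWarning` API.

```python
import warnings


class InconsistentVersionWarning(UserWarning):
    # Simplified stand-in carrying both versions as attributes.
    def __init__(self, estimator_name, original_sklearn_version,
                 current_sklearn_version):
        self.estimator_name = estimator_name
        self.original_sklearn_version = original_sklearn_version
        self.current_sklearn_version = current_sklearn_version

    def __str__(self):
        return (
            f"Trying to unpickle estimator {self.estimator_name} from "
            f"version {self.original_sklearn_version} when using version "
            f"{self.current_sklearn_version}."
        )


CURRENT_VERSION = "1.2.0"  # placeholder for sklearn.__version__


class Estimator:
    def __setstate__(self, state):
        pickle_version = state.get("_sklearn_version", "pre-0.18")
        if pickle_version != CURRENT_VERSION:
            # Warn with a rich warning instance, not just a message string.
            warnings.warn(
                InconsistentVersionWarning(
                    type(self).__name__, pickle_version, CURRENT_VERSION
                )
            )
        self.__dict__.update(state)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Estimator().__setstate__({"_sklearn_version": "1.1.0"})

print(caught[0].message.original_sklearn_version)  # 1.1.0
```

The appeal is that `estimator.__dict__` stays exactly as pickled, sidestepping the key-mismatch bug from #25273.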

@glemaitre (Member, Author)
Closing since we introduced the new type of warning.

@glemaitre glemaitre closed this Jan 14, 2023
Successfully merging this pull request may close these issues.

__sklearn_pickle_version__ makes estimator.__dict__.keys() == loaded.__dict__.keys() to fail