MNT add __reduce__ to loss objects by adrinjalali · Pull Request #30356 · scikit-learn/scikit-learn · GitHub

MNT add __reduce__ to loss objects #30356


Merged
merged 1 commit into scikit-learn:main on Nov 29, 2024

Conversation

adrinjalali
Member

This adds __reduce__ to the loss objects, using the same format that was already used in the sgd_fast loss objects. It changes the default Cython format:

loss.__reduce__()
(<cyfunction __pyx_unpickle_CyHalfTweedieLossIdentity at 0x793fed21c520>,
 (_loss.CyHalfTweedieLossIdentity, 8221022, (5.0,)))

to the simpler format:

Hinge().__reduce__()
(sklearn.linear_model._sgd_fast.Hinge, (1.0,))

This doesn't change anything with respect to pickle, but it makes it possible for skops.io to save/load these objects.
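The difference matters because pickle rebuilds an object by calling the first element of the `__reduce__` tuple with the second element as arguments. A minimal sketch of the simpler format, using a toy stand-in class rather than the actual Cython code:

```python
import pickle

class HalfTweedieLoss:
    """Toy stand-in for a loss object with a single parameter."""
    def __init__(self, power):
        self.power = power

    def __reduce__(self):
        # Simpler format: (class, constructor args). pickle -- and a
        # tool like skops -- can rebuild the object by calling
        # HalfTweedieLoss(power), with no opaque helper function.
        return (self.__class__, (self.power,))

loss = HalfTweedieLoss(5.0)
restored = pickle.loads(pickle.dumps(loss))
print(restored.power)  # -> 5.0
```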

cc @lesteve @glemaitre

Since some objects are removed from sgd_fast in this release, and the errors are triggered in skops for the 1.6 release, it'd be nice to backport this to the release branch.

@adrinjalali adrinjalali added this to the 1.6 milestone Nov 27, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: de0b93a.

@lesteve
Member
lesteve commented Nov 28, 2024

The change seems small enough, but at the same time I find it weird that skops cares about this; the main reason is probably that I know close to nothing about skops ...

Can you post a snippet reproducing the issue and give a short summary of skops' expectations, so that we can come back to it if the need arises?

So far my understanding is that skops does not want to rely on the first element of the __reduce__ output (the reconstruct function) for security reasons, and expects to be able to recreate any object using the class and the remaining elements of __reduce__, doing something like:

# I am assuming that you get sklearn_cls and reduce_info from the skops persisted file somehow
reconstruct_func, *args = reduce_info
sklearn_cls(*args)
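For illustration, with the simpler format this reconstruction becomes straightforward. A rough sketch using a plain-Python stand-in for the real Cython class:

```python
class Hinge:
    """Toy stand-in for sklearn.linear_model._sgd_fast.Hinge."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold

    def __reduce__(self):
        return (Hinge, (self.threshold,))

# With the simpler format, the first element of __reduce__ is the
# class itself, so the object can be rebuilt from the class and the
# remaining arguments, with no opaque reconstruct function:
cls, args = Hinge(1.0).__reduce__()
rebuilt = cls(*args)
print(rebuilt.threshold)  # -> 1.0
```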

@@ -818,6 +818,9 @@ cdef inline double_pair cgrad_hess_exponential(
cdef class CyLossFunction:
"""Base class for convex loss functions."""

def __reduce__(self):
Member


Is it actually needed? It seems to me like CyLossFunction is only used as a base class and you are not really meant to create an instance of CyLossFunction.

Member Author


This is needed since not all losses have a parameter, which means they won't have our custom __reduce__ generated for them. Note the {{if param is not None}} around the __reduce__ implementation below.

Member Author
@adrinjalali adrinjalali left a comment


Pickling works by asking objects how they should be constructed, which is done via __reduce__. In many cases there is no need for a custom constructor function, since objects can simply be constructed by calling their constructors. However, this doesn't generalise if the object does fancy things in its __init__ method. Therefore tools such as Cython work around it by generating a constructor function, which is returned by __reduce__.
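To make this concrete, here is a toy sketch of why a constructor function can be needed when __init__ transforms its arguments (illustrative code only; it assumes nothing about the real loss classes):

```python
import pickle

def _rebuild(cls, value):
    # Stand-in for a Cython-generated __pyx_unpickle_* helper: it
    # bypasses __init__ and restores the internal state directly.
    obj = cls.__new__(cls)
    obj.value = value
    return obj

class Fancy:
    """__init__ transforms its argument, so re-calling it with the
    stored state would apply the transformation twice."""
    def __init__(self, raw):
        self.value = raw * 2

    def __reduce__(self):
        # Returning (Fancy, (self.value,)) would be wrong here:
        # unpickling would run __init__ again and double the value.
        return (_rebuild, (Fancy, self.value))

f = Fancy(3)                        # f.value == 6
g = pickle.loads(pickle.dumps(f))
print(g.value)                      # -> 6, not 12
```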

However, that defeats the purpose of not calling random functions, not clearly related to the object being created, when constructing objects.

Now when it comes to skops, if we don't care about this, I'm not sure how we're supposed to reconstruct the objects. We could have custom code which completely ignores the output of __reduce__ and constructs each class with the right arguments, but that seems somewhat unnecessary when we can have these __reduce__ methods here.

An alternative is to have something like __skops_init_args__, which would return what can be given to the constructor. I'd be happier with that than changing __reduce__ here, but it would bring more discussion as to whether we want skops-specific code here and whether it's the right thing to do. So @glemaitre suggested we simply implement these __reduce__ methods; that won't bring any discussion, since we had them for the old loss objects and we're bringing the same thing here.

Does this help @lesteve ?

@adrinjalali
Member Author

The change seems small enough, but at the same time I find it weird that skops cares about this; the main reason is probably that I know close to nothing about skops ...

We persist objects pickle-free in skops.io, so the pickle machinery becomes very relevant there BTW 😁

@jeremiedbb jeremiedbb added the "To backport" label (PR merged in master that needs a backport to a release branch, defined based on the milestone) Nov 28, 2024
Member
@jeremiedbb jeremiedbb left a comment


The change looks minimal and I'm fine with it if it makes things easier for skops.

I'm wondering if we can add some tests in scikit-learn directly to not rely on skops only.

I also think it deserves a changelog entry, since it's a regression introduced in 1.6.

@adrinjalali
Member Author

This isn't really a regression for people who use pickle and don't use skops: it doesn't change the pickle-ability of objects, which is already tested in our common tests. Also, this is a change in private modules which users shouldn't be accessing directly anyway. So I don't think it warrants a changelog entry.

@lesteve
Member
lesteve commented Nov 28, 2024

Thanks for the explanation @adrinjalali, it does help! (Edit: sorry, I originally pinged the wrong Adrin ...)

I am fine with the change to make life easier for skops for 1.6, even if in an ideal world skops could deal with the default Cython __reduce__ implementation. That said, I have to say I don't understand the 8221022 and why it is useful at all in the following snippet, so 🤷

In [4]: from sklearn._loss._loss import CyHalfTweedieLossIdentity

In [5]: loss = CyHalfTweedieLossIdentity(2)

In [6]: loss.__reduce__()
Out[6]: 
(<cyfunction __pyx_unpickle_CyHalfTweedieLossIdentity at 0x7b4424573e00>,
 (_loss.CyHalfTweedieLossIdentity, 8221022, (2.0,)))

I double-checked, and indeed these __reduce__ methods were implemented in sgd before, see

def __reduce__(self):
    return Hinge, (self.threshold,)

I am going to wild-guess that this was because otherwise Cython class instances were not serializable at the time, see e.g. this very old #946 (comment). I also found the Cython doc that seems to imply that there is now a default Cython __reduce__ implementation in some cases.

I guess one of my hidden questions was whether there was something scikit-learn specific, but looking at https://github.com/skops-dev/skops/blob/fb356747ff465658ec4f8359b61fda895206d17f/skops/io/_numpy.py for example, I guess you have some custom logic for other types, like numpy or scipy types.

Member
@glemaitre glemaitre left a comment


LGTM

@glemaitre glemaitre merged commit 0d9fb78 into scikit-learn:main Nov 29, 2024
38 checks passed
@adrinjalali adrinjalali deleted the loss/reduce branch November 29, 2024 11:25
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
jeremiedbb pushed a commit that referenced this pull request Dec 6, 2024
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Labels
cython, To backport