[MRG+1] Fix pickling bug due to multiple inheritance & __getstate__ #8324

HolgerPeters · 2017-02-09T09:12:58Z

Includes a reproducing test case.

sklearn.base.BaseEstimator now tries to use other __getstate__ methods of the class hierarchy first, before defaulting to the __dict__ attribute
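To make the description concrete, here is a minimal sketch of the failure mode and of the delegation described above. It does not import scikit-learn: `OldBase` and `NewBase` are hypothetical stand-ins for `BaseEstimator` before and after the change, and the mixin and attribute names are illustrative only. `__getstate__` is called directly because it is the hook pickle relies on.

```python
class DontPickleAttributeMixin:
    """Mixin that wants to keep one attribute out of the pickled state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_attribute_not_pickled", None)
        return state


class OldBase:
    """Stand-in for the old behaviour: always return a copy of __dict__."""

    def __getstate__(self):
        return dict(self.__dict__.items())


class NewBase:
    """Stand-in for the new behaviour: ask the next class in the MRO first."""

    def __getstate__(self):
        try:
            return super().__getstate__()
        except AttributeError:
            # Nothing else in the hierarchy defines __getstate__
            # (possible on Python versions where object lacks it).
            return self.__dict__.copy()


class OldEstimator(OldBase, DontPickleAttributeMixin):
    def __init__(self):
        self.attribute_pickled = 5
        self._attribute_not_pickled = "cache"


class NewEstimator(NewBase, DontPickleAttributeMixin):
    def __init__(self):
        self.attribute_pickled = 5
        self._attribute_not_pickled = "cache"


# OldBase comes first in the MRO and never calls super(), so the mixin's
# __getstate__ is shadowed; NewBase delegates and the mixin gets its say.
old_state = OldEstimator().__getstate__()
new_state = NewEstimator().__getstate__()
```

With the old-style base the unwanted attribute ends up in the state; with the delegating base it is filtered out by the mixin.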

Reference Issue

Fix issue #8316

codecov · 2017-02-09T09:45:00Z

Codecov Report

Merging #8324 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8324      +/-   ##
==========================================
+ Coverage   94.75%   94.75%   +<.01%     
==========================================
  Files         342      342              
  Lines       60816    60886      +70     
==========================================
+ Hits        57624    57695      +71     
+ Misses       3192     3191       -1

Impacted Files	Coverage Δ
sklearn/base.py	`94.94% <100%> (+0.79%)`	✅
sklearn/tests/test_base.py	`97.6% <100%> (+0.8%)`	✅

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ba8771f...934efaa.

lesteve · 2017-02-09T10:39:45Z

sklearn/base.py

@@ -290,10 +290,11 @@ def __repr__(self):
                                               offset=len(class_name),),)

    def __getstate__(self):


Just wondering whether we should adopt a similar strategy for __setstate__? Quickly looking at it, it seems like we are doing startswith('sklearn.') there as well to avoid messing with classes outside scikit-learn deriving from sklearn.base.BaseEstimator.

Yes, I believe there should be parallel changes in __setstate__

jnothman · 2017-02-09T11:34:29Z

sklearn/base.py

@@ -290,10 +290,11 @@ def __repr__(self):
                                               offset=len(class_name),),)

    def __getstate__(self):
-        if type(self).__module__.startswith('sklearn.'):


I think we still want this... we just want to delegate to super instead of blindly getting self.__dict__.items().

will reintroduce the check

jnothman · 2017-02-09T11:34:41Z

sklearn/base.py

@@ -290,10 +290,11 @@ def __repr__(self):
                                               offset=len(class_name),),)

    def __getstate__(self):


Yes, I believe there should be parallel changes in __setstate__

HolgerPeters · 2017-02-09T13:24:04Z

sklearn/base.py

+            state = super(BaseEstimator, self).__getstate__()
+        except AttributeError:
+            state = self.__dict__
+
        if type(self).__module__.startswith('sklearn.'):


This conditional is not properly covered, because the classes I use in the test are from within the sklearn module namespace (since scikit-learn has its tests in the sklearn namespace). So basically return state.copy() is never reached in the tests. Problem is, I cannot patch type(self).__module__ without breaking the pickling mechanism (it makes a lookup for the module). So either we make the string 'sklearn.' in the BaseEstimator mock-patchable, or we need to create a test-class outside of the sklearn namespace, to get this conditional fully covered.

My idea was to write something like this

    def test_multiple_inheritance_setting_foreign_namespace(self):
        try:
            estimator = MultiInheritanceEstimator()
            old_mod = type(estimator).__module__
            type(estimator).__module__ = "notsklearn"
            serialized = pickle.dumps(estimator, protocol=2)
        finally:
            type(estimator).__module__ = old_mod

which doesn't work for the aforementioned reason.

I think we already test something like this. You can either test get/set_state directly (instead of pickling) or hack the pickle loading, perhaps by hacking sys.modules.

+1 for testing getstate / setstate directly.
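A rough sketch of what testing `__getstate__` directly could look like. Nothing here is scikit-learn code: `VersionedBase`, the `"mypkg."` prefix, the `_mypkg_version` key and the helper `getstate_as_module` are all hypothetical names chosen for illustration. Because pickle is never invoked, temporarily changing `__module__` cannot break a module lookup.

```python
__version__ = "0.0.1"  # hypothetical package version


class VersionedBase:
    """Stand-in base class that tags state only for classes inside 'mypkg.'."""

    def __getstate__(self):
        state = self.__dict__.copy()
        if type(self).__module__.startswith("mypkg."):
            state["_mypkg_version"] = __version__
        return state


class Estimator(VersionedBase):
    def __init__(self, b=5):
        self.b = b


def getstate_as_module(estimator, module_name):
    """Call __getstate__ while the class pretends to live in module_name."""
    cls = type(estimator)
    old_mod = cls.__module__
    try:
        cls.__module__ = module_name
        # No pickle involved, so no module lookup happens here.
        return estimator.__getstate__()
    finally:
        # Always restore the real module so other tests are unaffected.
        cls.__module__ = old_mod


inside = getstate_as_module(Estimator(), "mypkg.tree")
outside = getstate_as_module(Estimator(), "notmypkg")
```

Both branches of the conditional are exercised this way, and the `try`/`finally` keeps the class usable afterwards.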

HolgerPeters · 2017-02-09T14:26:35Z

Not quite sure why codecov reports a reduction in coverage. It seems all code paths are now fully tested.

lesteve · 2017-02-09T14:40:07Z

Not quite sure why codecov reports a reduction in coverage

codecov seems to complain about a drop in coverage in sklearn/tests/test_base.py looking at #8324 (comment).

Quickly looking at the diff it seems like you are not using MultiInheritanceEstimator.cache and SingleInheritanceEstimator.

HolgerPeters · 2017-02-13T07:40:11Z

Alright, I think I have incorporated your feedback in this PR. And all CI was green. I have now squashed the commits, so it makes for a nicer patch in the history and rebased stuff on the most recent master. Would you say it is mergeable? Anything else I need to address?

jnothman · 2017-02-13T09:56:42Z

You don't need to squash: github provides a "squash and merge" button. Also, unless there are merge conflicts, rebase is usually unnecessary (and even then, merging in the latest master suffices).

However, you generally require two full reviews and "LGTM"s before merge. We have a long backlog of reviewing. Thanks for your patience.

jnothman · 2017-02-13T09:59:40Z

sklearn/base.py

        else:
-            return dict(self.__dict__.items())
+            return state.copy()


surely it should only be necessary to copy in the .__dict__ case.

Indeed, an object's state need not be a dictionary and this line will break given some other types.

Fixed in 13c74fc
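For illustration, a small sketch of why copying unconditionally is fragile. The classes are hypothetical and not taken from the PR; the only assumption is a parent whose state is a tuple rather than a dict.

```python
class PointStateMixin:
    """Hypothetical mixin whose pickled state is a tuple, not a dict."""

    def __init__(self):
        self.x, self.y = 1, 2

    def __getstate__(self):
        return (self.x, self.y)


class CopyAlways:
    def __getstate__(self):
        state = super().__getstate__()
        return state.copy()  # fails when the state is a tuple


class CopyOnlyDict:
    def __getstate__(self):
        try:
            return super().__getstate__()  # parent state returned unchanged
        except AttributeError:
            return self.__dict__.copy()  # copy only our own __dict__


class BrokenPoint(CopyAlways, PointStateMixin):
    pass


class WorkingPoint(CopyOnlyDict, PointStateMixin):
    pass


try:
    BrokenPoint().__getstate__()
    broken_error = None
except AttributeError as exc:
    broken_error = exc  # tuple objects have no copy() method

working_state = WorkingPoint().__getstate__()
```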

jnothman · 2017-02-13T10:06:22Z

sklearn/base.py

+        try:
+            super(BaseEstimator, self).__setstate__(state)
+        except AttributeError:
+            self.__dict__.update(state)


I think this is okay, but FYI, pickle doesn't directly use update, in order to ensure all strings are interned: https://github.com/python/cpython/blob/master/Lib/pickle.py#L1522.
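As a rough paraphrase of the interning behaviour mentioned here (not a copy of CPython's code), restoring state key by key might look like the sketch below. `restore_state` and `Thing` are hypothetical names.

```python
import sys


def restore_state(obj, state):
    """Update obj.__dict__ from state, interning string keys on the way."""
    inst_dict = obj.__dict__
    for key, value in state.items():
        if type(key) is str:
            inst_dict[sys.intern(key)] = value
        else:
            inst_dict[key] = value


class Thing:
    pass


# Build the key at runtime so it is not an interned literal.
key = "".join(["attr", "_", "name"])
thing = Thing()
restore_state(thing, {key: 1})
stored_key = next(iter(thing.__dict__))
```

Plain `self.__dict__.update(state)` restores the same values; the difference is only whether attribute-name strings are shared with the interned copies.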

jnothman · 2017-02-13T10:18:51Z

sklearn/tests/test_base.py

+        return self._cache
+
+
+class TestPicklingConstraints(object):


We usually don't use test classes, but I'm personally okay with this as a way of grouping the code.

It might be good to mention that this test is about BaseEstimator somewhere (though I see there's a lack of that in this file).

Addressed in 0a55414

Since the tests are about estimators, isn't it (implicitly) clear that BaseEstimator is involved? Also, the module is sklearn.base.

jnothman · 2017-02-13T10:23:59Z

sklearn/tests/test_base.py

+
+    def test_singleinheritance_clone(self):
+        estimator = SingleInheritanceEstimator()
+        assert estimator.cache


It's really hard to tell from the test code that this modifies __dict__. I think I'd rather something more transparent than a cache. For example, the getstate could store a timestamp.

jnothman · 2017-02-13T10:26:24Z

sklearn/tests/test_base.py

+
+        serialized = pickle.dumps(estimator, protocol=2)
+        estimator_restored = pickle.loads(serialized)
+        assert estimator_restored.b == 5


I notice this test tests the basic restoration that should have also been tested in test_pickle_version_warning which should really be checking that the loaded pickled trees can still predict. (You could add that if you wish.)

Added in 393ff64

jnothman · 2017-02-13T10:27:04Z

sklearn/tests/test_base.py

+        finally:
+            type(estimator).__module__ = old_mod
+
+    def test_uses_object_dictionary_when_getstate_not_present(self):


I don't get what you mean by "when_getstate_not_present". You're using a MultiInheritanceEstimator for which __getstate__ is present twice in the MRO.

I am not sure what I meant by this test anymore, so I removed it in a94e4d1 without decreasing coverage

jnothman

Thanks. In addition to these nitpicks, you've not tested your modifications to __setstate__

jnothman · 2017-02-15T00:11:56Z

sklearn/tests/test_base.py

    tree = TreeBadVersion().fit(iris.data, iris.target)
    tree_pickle_other = pickle.dumps(tree)
    message = ("Trying to unpickle estimator TreeBadVersion from "
-               "version {0} when using version {1}. This might lead to "


I think it's clearer to the maintainer if "something" is still formatted in

Hope a35832c is as you intend it to be. I really didn't like the replace calls in the old tests. I assume you favour a global message template over repeating it in the tests. The other option would be to duplicate the template in the tests (which I think is probably the least optimal variant).

jnothman · 2017-02-15T00:12:45Z

sklearn/tests/test_base.py

+        self._cache = None
+
+    @property
+    def cache(self):


I still don't think this makes for the test being easily read, and would rather something more explicit than a property with side effects. Even if the test did estimator._cache = "some_value" directly it would be better.

Addressed in c965511, indeed this is better than the property

lesteve

A few comments

lesteve · 2017-02-16T09:31:47Z

sklearn/tests/test_base.py

+        return data
+
+    @property
+    def cache(self):


You are still not using cache anywhere, right? If so can you please remove it?

fixed in 32cbb36

lesteve · 2017-02-16T09:38:11Z

sklearn/tests/test_base.py

+
+    serialized = pickle.dumps(estimator, protocol=2)
+    estimator_restored = pickle.loads(serialized)
+    assert estimator_restored.b == 5


Using bare asserts with nose creates not so great error messages. This does matter on CIs. Can you please use assert_* helpers in sklearn.utils.testing in all your tests?

I know the migration to pytest is definitely on the radar, but still, I think this is the right thing to do. I'd be happy to hear different opinions.

An example of using bare asserts vs an assert_* helper when running with nose:

======================================================================
FAIL: test_nose.test
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/volatile/le243287/miniconda3/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/tmp/test_nose.py", line 7, in test
    assert x > y
AssertionError

======================================================================
FAIL: test_nose.test2
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/volatile/le243287/miniconda3/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/tmp/test_nose.py", line 13, in test2
    assert_greater(x, y)
AssertionError: 2 not greater than 4

----------------------------------------------------------------------
Ran 2 tests in 0.002s

see e1417ad

using py.test locally so I wasn't aware that this is an issue with nose :)
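A small sketch of the difference, using the standard library's `unittest` as a stand-in for the `assert_*` helpers (an assumption: the helpers are taken to behave like `unittest`'s assert methods, which matches the message shown in the nose output above). The function names are hypothetical.

```python
import unittest

_case = unittest.TestCase()


def check_with_bare_assert(x, y):
    assert x > y


def check_with_helper(x, y):
    # Raises AssertionError with a message that includes both values.
    _case.assertGreater(x, y)


def failure_message(check, x, y):
    """Return the AssertionError text produced by check, or None if it passes."""
    try:
        check(x, y)
    except AssertionError as exc:
        return str(exc)
    return None


helper_message = failure_message(check_with_helper, 2, 4)
bare_message = failure_message(check_with_bare_assert, 2, 4)
```

The helper's message names the offending values, whereas the bare `assert` carries no message of its own unless the test runner rewrites assertions, as pytest does.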

lesteve · 2017-02-16T09:45:08Z

sklearn/base.py

        if type(self).__module__.startswith('sklearn.'):
-            return dict(self.__dict__.items(), _sklearn_version=__version__)
+            return dict(state.items(), _sklearn_version=__version__)


Actually thinking about it I am a bit confused about this, should we not always have a warning mechanism even if the estimator is outside scikit-learn? I could imagine someone inheriting from say LogisticRegression with a minor modification and the warning applies to this case as well, right?

Firstly that is not a problem for this PR. Secondly, we do not issue the warning if out of sklearn because such an estimator is likely to be versioned differently.

Firstly that is not a problem for this PR.

Agreed.

Secondly, we do not issue the warning if out of sklearn because such an estimator is likely to be versioned differently.

OK I trust your judgement on this. The use case I had in mind was a thin wrapper around a scikit-learn estimator (deriving from an estimator class mostly for convenience), in which case it would make sense to get the warnings.
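To illustrate the behaviour under discussion, here is a hedged sketch of a version check that only fires for classes inside the package. It is not the scikit-learn implementation: the `"mypkg."` prefix, the `_mypkg_version` key, the class names and `warnings_on_restore` are hypothetical, and the message is shortened from the one quoted in the tests.

```python
import warnings

__version__ = "0.19"  # hypothetical current package version


class VersionCheckedBase:
    def __setstate__(self, state):
        state = dict(state)  # do not mutate the caller's dict
        if type(self).__module__.startswith("mypkg."):
            pickled_version = state.pop("_mypkg_version", "unknown")
            if pickled_version != __version__:
                warnings.warn(
                    "Trying to unpickle estimator {0} from version {1} when "
                    "using version {2}.".format(
                        type(self).__name__, pickled_version, __version__),
                    UserWarning)
        self.__dict__.update(state)


class InsideEstimator(VersionCheckedBase):
    pass


class OutsideEstimator(VersionCheckedBase):
    pass


InsideEstimator.__module__ = "mypkg.tree"   # pretend it ships with the package
OutsideEstimator.__module__ = "thirdparty"  # pretend it is user code


def warnings_on_restore(cls, state):
    """Restore state into a fresh instance and collect warning messages."""
    obj = cls.__new__(cls)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        obj.__setstate__(state)
    return [str(w.message) for w in caught]


old_state = {"b": 5, "_mypkg_version": "0.18"}
inside_msgs = warnings_on_restore(InsideEstimator, old_state)
outside_msgs = warnings_on_restore(OutsideEstimator, old_state)
```

A thin wrapper defined outside the package takes the silent path, which is the case raised in the comment above.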

lesteve

Some small comments. LGTM otherwise.

lesteve · 2017-02-17T08:59:32Z

sklearn/tests/test_base.py

+def test_pickle_version_no_warning_is_issued_with_non_sklearn_estimator():
+    iris = datasets.load_iris()
+    tree = TreeNoVersion().fit(iris.data, iris.target)
+    tree_pickle_noversion = pickle.dumps(tree)
    TreeNoVersion.__module__ = "notsklearn"


I know it was like this before, but don't you need a try/finally here too to make sure that TreeNoVersion.__module__ is set to its original value?

lesteve · 2017-02-17T09:01:02Z

sklearn/tests/test_base.py

+    estimator = MultiInheritanceEstimator()
+    estimator._cache = "this attribute should not be pickled"
+
+    serialized = pickle.dumps(estimator, protocol=2)


Why protocol=2? The tests pass fine with serialized = pickle.dumps(estimator).

Removed in 7985b0b

lesteve · 2017-02-17T09:44:03Z

sklearn/tests/test_base.py

+
+class MultiInheritanceEstimator(BaseEstimator, DontPickleCacheMixin):
+    def __init__(self, b=5):
+        self.b = b


I feel like the test could be made clearer by better variable naming, e.g. attr_pickled and attr_not_pickled. If you choose to do it, change it uniformly (and also maybe the naming of DontPickleCacheMixin).

lesteve · 2017-02-17T09:46:29Z

sklearn/tests/test_base.py

    assert_warns_message(UserWarning, message, pickle.loads, tree_pickle_other)

-    # check that not including any version also works:
+
+class TreeNoVersion(DecisionTreeClassifier):


My 2c: moving things around like this adds unnecessary noise in the diff thus making it harder to review without any significant benefit.

lesteve · 2017-02-20T08:39:09Z

LGTM, could you add an entry in doc/whats_new.rst?

lesteve

LGTM, @jnothman do you want to have a look at this one so we can merge it? Two minor comments in the changelog.

lesteve · 2017-02-20T19:12:28Z

doc/whats_new.rst

@@ -220,6 +220,10 @@ Bug fixes
   - Fix a bug in cases where `numpy.cumsum` may be numerically unstable,
     raising an exception if instability is identified.  :issue:`7376` and
     :issue:`7331` by `Joel Nothman`_ and :user:`yangarbiter`.
+   - Fix a bug where :meth:`sklearn.base.BaseEstimator.__getstate__` blocked
+     obstructed pickling customizations of child-classes, when used in a
+     multiple inheritence context.


typo: inheritance

indeed, should be fixed with the last push.

lesteve · 2017-02-20T19:12:55Z

doc/whats_new.rst

@@ -220,6 +220,10 @@ Bug fixes
   - Fix a bug in cases where `numpy.cumsum` may be numerically unstable,
     raising an exception if instability is identified.  :issue:`7376` and
     :issue:`7331` by `Joel Nothman`_ and :user:`yangarbiter`.
+   - Fix a bug where :meth:`sklearn.base.BaseEstimator.__getstate__` blocked


looks like you could not make up your mind between blocked and obstructed ;-)

jnothman · 2017-02-20T22:31:41Z

LGTM thanks @HolgerPeters

lesteve · 2017-02-21T08:25:43Z

Great stuff @HolgerPeters, thanks a lot!

…cikit-learn#8324)

Fixes scikit-learn#8316

* Don't use test classes to group tests
* only use formatting for parts of the string that change
* Flake 8 column limit
* Make the modification of the estimator more explicit in the tests
* As suggested in code review, prefer formatting over two literals
* Also assert, that __setstate__ overwriting works in mixin
* Remove cache property
* Use assertion functions from sklearn.utils.testing
* remove the protocol argument in tests
* Rename attributes to better convey their purpose
* Revert change of module in TreeNoVersion
* Adhere to column-limit
* changelog entry
* Fix commit message

HolgerPeters force-pushed the master branch from 29dd3af to 9c857f6 Compare February 9, 2017 10:23

lesteve reviewed Feb 9, 2017

jnothman reviewed Feb 9, 2017

HolgerPeters commented Feb 9, 2017

HolgerPeters force-pushed the master branch from 1fb1bb4 to 5ba79aa Compare February 13, 2017 07:38

jnothman reviewed Feb 13, 2017

jnothman reviewed Feb 15, 2017

jnothman added the Bug label Feb 15, 2017

lesteve reviewed Feb 16, 2017

lesteve reviewed Feb 17, 2017

HolgerPeters and others added 14 commits February 20, 2017 17:23

Fix pickling bug (issue # 8316) due to multiple inheritance & __getstate__

8cb90fd

Includes a reproducing test case. sklearn.base.BaseEstimator now tries to use other __getstate__ methods of the class hierarchy first, before defaulting to the __dict__ attribute

In BaseEstimator.__getstate__ copy only the dictionary

85e3583

Split up test_pickle_version_warning tests into several, independent test cases

f084ce0

Add a check that the restored estimator can also make predictions

3c0f913

Don't use test classes to group tests

27dac5f

only use formatting for parts of the string that change

9117c06

Flake 8 column limit

deae48a

Make the modification of the estimator more explicit in the tests

69e98e6

As suggested in code review, prefer formatting over two literals

06ac4ab

Also assert, that __setstate__ overwriting works in mixin

1878211

Remove cache property

ee770ff

Use assertion functions from sklearn.utils.testing

40004d1

remove the protocol argument in tests

2c5263f

Rename attributes to better convey their purpose

9edcf6d

HolgerPeters and others added 3 commits February 20, 2017 17:23

Revert change of module in TreeNoVersion

ae425c4

Adhere to column-limit

e7b2987

changelog entry

6d73269

HolgerPeters force-pushed the master branch from 89cf94e to 6d73269 Compare February 20, 2017 16:23

lesteve approved these changes Feb 20, 2017

lesteve changed the title ~~Fix pickling bug due to multiple inheritance & __getstate__~~ [MRG+1] Fix pickling bug due to multiple inheritance & __getstate__ Feb 20, 2017

Fix commit message

934efaa

jnothman merged commit 4493d37 into scikit-learn:master Feb 20, 2017

lesteve mentioned this pull request Mar 10, 2017

[MRG+1] Fix test of SingleInheritanceEstimator to not raise DeprecationWarning #8526

Merged

HolgerPeters mentioned this pull request Mar 10, 2017

Model persistence warnings across versions #7135

Closed

Przemo10 mentioned this pull request Mar 17, 2017

update fork (#1) #8606

Closed

amueller mentioned this pull request Nov 6, 2017

BaseEstimator does not support __slots__ #10079

Closed

WardLT mentioned this pull request Feb 17, 2025

Fix pickling for learners CitrineInformatics/lolo#336

Closed
