ENH ufunc called on memmap return a ndarray by lesteve · Pull Request #7406 · numpy/numpy · GitHub

ENH ufunc called on memmap return a ndarray #7406


Merged
merged 1 commit into from
Apr 4, 2016

Conversation

lesteve
Contributor
@lesteve lesteve commented Mar 10, 2016

Special case for reduction functions (e.g. np.sum with axis=None) that
return a numpy scalar.

Fix #7403. Implemented __array_wrap__ following @ahaldane's suggestion. Should fix #667 as well.

@seberg
Member
seberg commented Mar 10, 2016

Years ago, I had started something like this, but it had to be a bit more complex because of subclasses of memmap. You can find it here: #2932
Frankly, no idea why I closed it back then (did not check the posts), but it was a slow time for numpy IIRC, so I might just have closed it because of that.

assert_(unary_op(fp, axis=1).__class__ is ndarray)

for binary_op in [add, multiply, divide, add]:
assert_(binary_op(fp, self.data).__class__ is ndarray)
Member

I suspect that the test failure is because self.data starts with zero, which causes problems with division.

@charris
Member
charris commented Mar 11, 2016

Not sure if this counts as a bug/enhancement/change in behavior. If the latter, it might need to start with a FutureWarning.

@ahaldane
Member

Looks good, though I'm pretty sure in-place operations (e.g. m += 1) don't work properly, and I didn't immediately see a good fix for that. I'd also need to think a little more to make sure there are no other broken edge cases.

Also, I wonder if this is an opportunity to try out the new __numpy_ufunc__ instead of __array_wrap__. As I understand it, __numpy_ufunc__ is meant to replace __array_wrap__, although it's not totally done yet. Just another thing to ponder.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 7826a7e to 137afc3 Compare March 11, 2016 09:24
@ahaldane
Member

Well, looks like __numpy_ufunc__ was disabled for now, so we shouldn't try that.

Here's a possible fix for __array_wrap__ to make in-place operations retain the subtype:

def __array_wrap__(self, arr, context=None):
    if np.may_share_memory(self, arr):
        return arr
    if arr.shape == ():
        return arr[()]
    return arr.view(ndarray)

so now you can also do

>>> a = np.memmap('file', dtype='i4', shape=(4,4))
>>> a += 1
>>> a.flush()
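With that __array_wrap__ in place, the combined behaviour can be checked end to end. A small sketch, assuming a NumPy that includes this change (1.12 or later) and using a throwaway temporary file:

```python
import os
import tempfile

import numpy as np

# Throwaway file path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
fp = np.memmap(path, dtype="f8", mode="w+", shape=(4, 4))
fp[:] = 1.0

print(type(np.add(fp, 1)))  # ufunc result: plain ndarray, not memmap
print(type(np.sum(fp)))     # full reduction: numpy scalar, not 0-d memmap
fp += 1                     # in-place op: fp keeps its memmap type ...
fp.flush()                  # ... so flushing still works
```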

More generally, there are two things I am not yet 100% sure of:

  1. Are we sure that there are no ufunc uses which should return a 0-d array instead of a scalar?
  2. Are we sure that there are no cases where np.may_share_memory(self, arr) == False but we should actually be returning a memmap? That is, where arr's memory is actually a memmapped buffer?

@ahaldane
Member

Oh, I just read through @seberg's PR, which looks like it addresses more of these concerns. @lesteve @seberg maybe we should revive that one? Even though it does seem a little dangerous, I think it might be worth doing since we have a couple people now who would like the change.

@seberg
Member
seberg commented Mar 11, 2016

Dunno. I don't even know how my code compared to this (why did I have __array_prepare__ also?). I am relatively sure you don't really have to worry much about the mmapped buffers or so. If arr does not share memory with self, then the array should be freshly created with its own data. If an "out" array exists, the "out" array gets to do the __array_wrap__ (maybe that is why __array_prepare__, hmmm).

The only thing I would be worried about is that memmaps have some attributes that would still somewhat work when we now return arrays, and thus might break code that uses them for no good reason. We should check, but I could imagine it is fine, since a memmap object that (I guess?) errors on flush is pretty useless.

About 1: yes, we are sure ufuncs never return 0-d arrays (except for subclasses, to give subclasses a chance, and even that probably changed in 1.7 or so). That does not mean I agree with it, but NumPy doesn't consider 0-d arrays first-class citizens.

Anyway, if we are not sure about all of these things, we could possibly just change the 0-d case, and hope that since the behaviour was changed (maybe not so long ago?) it is fine.

@njsmith
Member
njsmith commented Mar 12, 2016

Don't have time to read through all of this in detail right now, but as a general note I'm +10 on making it so that memmaps don't propagate through ufuncs. I'd also vote against requiring a FutureWarning, because (a) this will be really noisy and annoying, because it would trigger every time you do anything with memmaps, even totally safe things, and there's no reasonable way to change existing code to avoid the warning, (b) the only behavioural difference between an np.memmap object and an np.ndarray is that the former has a ._mmap attribute that can be flushed, but even this is not true for the np.memmap objects that get allocated and returned from ufuncs -- they are literally just ndarray objects with a confusing subclass attached.

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 12, 2016
@charris
Member
charris commented Mar 12, 2016

Should get a mention in the 1.12.0-notes.

@charris
Member
charris commented Mar 15, 2016

@lesteve Ping.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 137afc3 to 6d7f1aa Compare March 15, 2016 17:22
@lesteve
Contributor Author
lesteve commented Mar 15, 2016

I added something in doc/release/1.12.0-notes.rst.

I haven't had time to work on this in the past few days. I will try to understand better what was done in #2932.

@ahaldane
Member

@lesteve My impression of the difference between this PR and #2932 is that #2932 tries to treat subclasses of memmap specially and handles __getitem__ (but is missing the 0d array case handled in this PR), while this PR treats 0d arrays properly but doesn't worry about inheritance or __getitem__.

I will admit I don't really understand a lot of #2932. It appears to be trying to take care of multiple inheritance involving memmap, but I'm not sure it's worth the effort to do that. Do we know of any projects that use multiple inheritance involving memmap? Note also that this PR, as is, properly takes care of single inheritance of memmap (since subclasses will inherit __array_wrap__).

However, I do think #2932 has the right idea in __getitem__, that we want fancy indexing like

>>> a = np.memmap('file', dtype='i4', shape=(4,4))
>>> a[[0,3]]

to return an ndarray, since the result is a new array rather than a view. However, a[1:3] should give back a memmap. I am not sure how to implement that yet.

In summary, my impression is that we can forget about trying to fix multiple inheritance like #2932, however we should somehow take care of fancy indexing, and I also suggest adding the if np.may_share_memory(self, arr): clause I pasted above.
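One way to tell the two indexing cases apart (an illustrative sketch of the idea, not necessarily what gets merged) is to check whether the indexing result still shares memory with the memmap, mirroring the may_share_memory logic used in __array_wrap__:

```python
import os
import tempfile

import numpy as np

class DemoMemmap(np.memmap):
    # Sketch: demote copies (fancy indexing) to ndarray, keep views
    # (basic slicing) as memmap. Illustrative only.
    def __getitem__(self, index):
        res = super().__getitem__(index)
        if isinstance(res, np.memmap) and not np.may_share_memory(self, res):
            return res.view(np.ndarray)  # fancy indexing copied the data
        return res

path = os.path.join(tempfile.mkdtemp(), "demo.dat")  # throwaway file
a = DemoMemmap(path, dtype="i4", mode="w+", shape=(4, 4))
print(type(a[1:3]))     # basic slice: still a memmap view
print(type(a[[0, 3]]))  # fancy index: a fresh ndarray copy
```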

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 6d7f1aa to 609d19a Compare March 15, 2016 18:53
@lesteve
Contributor Author
lesteve commented Mar 15, 2016

I have added the __getitem__ behaviour + tests from #2932 and also the may_share_memory in __array_wrap__. I think it does what @ahaldane wanted for view vs new array for __getitem__.

I haven't tried to understand further the subtleties involved with subclassing that #2932 was trying to tackle.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 609d19a to 838dab9 Compare March 15, 2016 23:30
@charris charris removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 18, 2016
@ahaldane
Member

I think this PR is ready to be merged as is, if there are no objections.

Although I have a little bit of a feeling there may be some "unknown unknowns" here, after sitting on it I haven't thought up any more cases to take care of, and I'm not too worried about the multiple subclass case. Plus, I think this is easy to undo later if there are problems, and it can get some testing if we merge.

Maybe we can wait a little for anyone else to speak up, and then merge.

def __array_wrap__(self, arr, context=None):
# Return scalar instead of 0d memmap, e.g. for np.sum with
# axis=None
if arr.shape == ():
Member

Just to be on the safe side, it might be good to move it after the may_share_memory call? I don't think we should ever return 0-d arrays from functions calling array_wrap, but I am not perfectly sure.

Member

Then again, maybe it doesn't matter.

Contributor Author

Just to be on the safe side, it might be good to move it after the may_share_memory call?

Done.

@seberg
Member
seberg commented Mar 19, 2016

If there may be problems with subclasses, I would feel better if there is a note to the mailing list.

@seberg
Member
seberg commented Mar 19, 2016

Otherwise, I trust your assessment :).

@homu
Contributor
homu commented Mar 22, 2016

☔ The latest upstream changes (presumably #7325) made this pull request unmergeable. Please resolve the merge conflicts.

@lesteve
Contributor Author
lesteve commented Mar 22, 2016 via email

@mhvk
Contributor
mhvk commented Mar 31, 2016

@lesteve - the problem I mentioned is that in your example, np.add returns an ndarray. Normally, when out is given, it is ensured that what is returned is out itself, i.e., if one does mm2 = np.add(mm0, 1, out=mm1), it should be the case that mm2 is mm1.

@lesteve
Contributor Author
lesteve commented Mar 31, 2016

mm2 = np.add(mm0, 1, out=mm1), it should be the case that mm2 is mm1

@mhvk this PR does have the behaviour you want. Thanks for the explanations, I naively expected np.add to return None when the out parameter was specified.

@mhvk
Contributor
mhvk commented Mar 31, 2016

OK, I now see why this does work (I should have seen it before): when the output is a memmap, it is out.__array_wrap__ that is called, so inside __array_wrap__ the condition may_share_memory is in fact true.
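That out= behaviour can be verified directly. A small sketch, assuming a NumPy that includes this change and using throwaway temporary files:

```python
import os
import tempfile

import numpy as np

d = tempfile.mkdtemp()  # throwaway directory for the demo files
mm0 = np.memmap(os.path.join(d, "in.dat"), dtype="f8", mode="w+", shape=(4,))
mm1 = np.memmap(os.path.join(d, "out.dat"), dtype="f8", mode="w+", shape=(4,))

mm2 = np.add(mm0, 1, out=mm1)
print(mm2 is mm1)  # out is returned as-is ...
print(type(mm2))   # ... and keeps its memmap type
```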


def __array_wrap__(self, arr, context=None):
# Return memmap if self and arr are likely to share memory
if np.may_share_memory(self, arr):
Contributor

So, a remaining suggestion is to replace this statement with the simpler (and faster)

# Return the object if it was given as an output and should thus preserve type
if arr is self:
    return arr

Contributor Author

Playing around a bit, I wasn't able to find an edge case in __array_wrap__ where np.may_share_memory(self, arr) and arr is self disagreed, so I made the advised change.

If someone else can see an edge case that I missed, please complain!

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 0d0c8e6 to cc8c8b6 Compare April 1, 2016 06:44
@ahaldane
Member
ahaldane commented Apr 1, 2016

Thanks for the suggestions @mhvk, it looks better now.

@lesteve I want to consider @mhvk's other suggestion too, to use super, but I am not totally happy with that solution because it breaks the single inheritance case (e.g. class B(np.memmap)). I want to think about it a little more.

@lesteve
Contributor Author
lesteve commented Apr 1, 2016

Let me know, I won't be able to work on it until Monday though.

@ahaldane
Member
ahaldane commented Apr 1, 2016

All right, from a brief look, astropy code doesn't seem to subclass memmap anywhere so @mhvk's example of class MyClass(Quantity, memmap) is only hypothetical.

In that case I vote to keep things simple and ignore the subclass issue, and merge this as is. Is that all right @mhvk?

Anyone in the future who wants to subclass memmap will be able to override our behavior if they want.

@mhvk
Contributor
mhvk commented Apr 1, 2016

@ahaldane - I'm half curious about what goes wrong with the inheritance class B(np.memmap), but as my example was indeed hypothetical, I'm quite happy to keep it simple.

@ahaldane
Member
ahaldane commented Apr 1, 2016

@mhvk, maybe I am mistaken. I was bothered by the fact that ufuncs and fancy slices of instances of B wouldn't return ndarrays, e.g. np.mean(B(file, dtype='i4', shape=4)) returns a B scalar. But actually that is probably what should happen for subclasses of ndarray.

@lesteve, sorry for going back and forth on this, but could you make the super change?

    def __array_wrap__(self, arr, context=None):
        arr = super(memmap, self).__array_wrap__(arr, context)
        # Return arr if it is not a memmap or if it was given as output.
        if arr is self or type(arr) is not memmap:
            return arr
        ...

and a unit test could be

class B(np.memmap):
    pass

a = B('file', dtype='i4', shape=4)
assert_equal(type(np.mean(a)), B)
assert_equal(type(a[[0,1]]), B)

Special case for reduction functions (e.g. np.sum with axis=None) that
return a numpy scalar.

Keep original memmap subclasses behavior to be on the safe side.
@lesteve
Contributor Author
lesteve commented Apr 4, 2016

I was bothered by the fact that ufuncs and fancy slices of instances of B wouldn't return ndarrays, eg. np.mean(B(file, dtype='i4', shape=4)) returns a B scalar. But actually that is probably what should happen for subclasses of ndarray.

I find that slightly bothersome too. In my mind, it is very unlikely that one would want a B 0d array as the result of np.mean, and for fancy slices I am a bit more split but not really convinced either. I'd be interested if someone has some kind of use case where this behaviour makes sense.

I made the requested change anyway.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from cc8c8b6 to 8f27d12 Compare April 4, 2016 08:01
@ogrisel
Contributor
ogrisel commented Apr 4, 2016

I haven't reviewed the change closely but it's a very welcome change in principle, so 👍 (in the long term I think the memmap class should go away entirely -- some of its code is worth saving, but I don't think it makes much sense as a separate ndarray subclass).

@embray I agree. To fix the memmap class consistency issue I had prototyped in the past an alternative that would not subclass ndarray but instead call np.frombuffer on a custom mmap buffer that would store the required metadata (filename, offset, mode...):

joblib/joblib@b68a868#diff-19004d02cec64e3d448b33565de338ebR94

I never finished that work but if people are interested I could contribute it to numpy as a saner alternative to np.memmap. Do you want me to extract that old code and make it a PR for numpy (with tests)?
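The core of that alternative, as a hypothetical sketch (not the joblib code itself): a plain ndarray built with np.frombuffer over an explicit mmap, with the metadata (filename, offset, mode, ...) kept alongside rather than on an ndarray subclass:

```python
import mmap
import os
import tempfile

import numpy as np

# Throwaway backing file, created just for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "buf.dat")
shape, dtype = (4, 4), np.dtype("i4")
with open(path, "wb") as fh:
    fh.write(b"\0" * (int(np.prod(shape)) * dtype.itemsize))

f = open(path, "r+b")
buf = mmap.mmap(f.fileno(), 0)  # writable mapping of the whole file
a = np.frombuffer(buf, dtype=dtype).reshape(shape)

print(type(a))  # a plain ndarray: no subclass surprises from ufuncs
a[0, 0] = 42    # writes go straight to the mapped buffer
buf.flush()     # flushing is done on the mmap object, not the array
```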

@embray
Contributor
embray commented Apr 4, 2016

@ogrisel Yes, that looks useful. For myself, I'm using np.memmap to create the underlying mmap in a convenient way, but then immediately passing it to np.frombuffer and throwing away the np.memmap object.

@mhvk
Contributor
mhvk commented Apr 4, 2016

@lesteve - for the subclass case, I do think it is best to default to returning the subclass. It is quite possible that the subclass stores information that is useful even for a 0-d array (e.g., a Quantity stores a unit, which should not get lost).

Overall, it seems to me this is a good addition as is, and with all the tests, etc., it seems ready to go in.

@lesteve
Contributor Author
lesteve commented Apr 4, 2016

@lesteve - for the subclass case, I do think it is best to default to returning the subclass. It is quite possible that the subclass stores information that is useful even for a 0-d array (e.g., a Quantity stores a unit, which should not get lost).

Fair enough, thanks for the explanation!

@ahaldane
Member
ahaldane commented Apr 4, 2016

Looks good, I'll merge in an hour or two.

@ahaldane ahaldane merged commit 2fb9599 into numpy:master Apr 4, 2016
@ahaldane
Member
ahaldane commented Apr 4, 2016

Merged, thanks @lesteve and the many reviewers.

@lesteve
Contributor Author
lesteve commented Apr 4, 2016

Great, thanks a lot!

@lesteve lesteve deleted the memmap-ufunc-return-ndarray branch April 4, 2016 17:40
matthew-brett added a commit to matthew-brett/nibabel that referenced this pull request Apr 7, 2016
memmap behavior is about to change with merge of
numpy/numpy#7406.

Test for behavior change so we can adapt expected mmap return values on
an image.