ENH ufunc called on memmap return a ndarray by lesteve · Pull Request #7406 · numpy/numpy · GitHub

ENH ufunc called on memmap return a ndarray #7406


Merged
merged 1 commit into from
Apr 4, 2016

Conversation

lesteve
Contributor
@lesteve lesteve commented Mar 10, 2016

Special case for reduction functions (e.g. np.sum with axis=None) that
return a numpy scalar.

Fix #7403. Implemented __array_wrap__ following @ahaldane's suggestion. Should fix #667 as well.

@seberg
Member
seberg commented Mar 10, 2016

Years ago, I had started something like this, but it had to be a bit more complex because of subclasses of memmap. You can find it here: #2932
Frankly, no idea why I closed it back then (did not check the posts), but it was a slow time for numpy IIRC, so I might just have closed it because of that.

assert_(unary_op(fp, axis=1).__class__ is ndarray)

for binary_op in [add, multiply, divide, add]:
assert_(binary_op(fp, self.data).__class__ is ndarray)
Member

I suspect that the test failure is because self.data starts with zero, which causes problems with division.

@charris
Member
charris commented Mar 11, 2016

Not sure if this counts as a bug/enhancement/change in behavior. If the latter, it might need to start with a FutureWarning.

@ahaldane
Member

Looks good, though I'm pretty sure in-place operations (e.g. m += 1) don't work properly, and I didn't immediately see a good fix for that. I'd also need to think a little more to make sure there are no other broken edge cases.

Also, I wonder if this is an opportunity to try out the new __numpy_ufunc__ instead of __array_wrap__. As I understand it, __numpy_ufunc__ is meant to replace __array_wrap__, although it's not totally done yet. Just another thing to ponder.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 7826a7e to 137afc3 Compare March 11, 2016 09:24
@ahaldane
Member

Well, looks like __numpy_ufunc__ was disabled for now, so we shouldn't try that.

Here's a possible fix for __array_wrap__ to make in-place operations retain the subtype:

def __array_wrap__(self, arr, context=None):
    if np.may_share_memory(self, arr):
        return arr
    if arr.shape == ():
        return arr[()]
    return arr.view(ndarray)

so now you can also do

>>> a = np.memmap('file', dtype='i4', shape=(4,4))
>>> a += 1
>>> a.flush()
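With that __array_wrap__ in place, the combined behaviour can be checked end to end. A small sketch, assuming a NumPy that includes this change (1.12 or later) and using a throwaway temporary file:

```python
import os
import tempfile

import numpy as np

# Throwaway file path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
fp = np.memmap(path, dtype="f8", mode="w+", shape=(4, 4))
fp[:] = 1.0

print(type(np.add(fp, 1)))  # ufunc result: plain ndarray, not memmap
print(type(np.sum(fp)))     # full reduction: numpy scalar, not 0-d memmap
fp += 1                     # in-place op: fp keeps its memmap type ...
fp.flush()                  # ... so flushing still works
```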

More generally, there are two things I am not yet 100% sure of:

  1. Are we sure that there are no ufunc uses which should return a 0-d array instead of a scalar?
  2. Are we sure that there are no cases where np.may_share_memory(self, arr) == False but we should actually be returning a memmap? That is, where arr's memory is actually a memmapped buffer?

@ahaldane
Member

Oh, I just read through @seberg's PR, which looks like it addresses more of these concerns. @lesteve @seberg maybe we should revive that one? Even though it does seem a little dangerous, I think it might be worth doing since we have a couple people now who would like the change.

@seberg
Member
seberg commented Mar 11, 2016

Dunno. I don't even know how my code compared to this (why did I have __array_prepare__ also?). I am relatively sure you don't really have to worry much about the mmapped buffers or so. If arr does not share memory with self, then the array should be freshly created with its own data. If an "out" array exists, the "out" array gets to do the __array_wrap__ (maybe that is why __array_prepare__, hmmm).

The only thing I would be worried about is that memmaps have some attributes that would still somewhat work when we now return arrays, and thus might break code that uses them for no good reason. We should check, but I could imagine it is fine, since a memmap object that (I guess?) errors on flush is pretty useless.

About 1: yes, we are sure ufuncs never return 0-d arrays (except for subclasses, to give subclasses a chance, and even that probably changed in 1.7 or so). That does not mean I agree with it, but NumPy doesn't consider 0-d arrays first-class citizens.

Anyway, if we are not sure about all of these things, we could possibly just change the 0-d case, and hope that since the behaviour was changed (maybe not so long ago?) it is fine.

@njsmith
Member
njsmith commented Mar 12, 2016

Don't have time to read through all of this in detail right now, but as a general note I'm +10 on making it so that memmaps don't propagate through ufuncs. I'd also vote against requiring a FutureWarning, because (a) this will be really noisy and annoying, because it would trigger every time you do anything with memmaps, even totally safe things, and there's no reasonable way to change existing code to avoid the warning, (b) the only behavioural difference between an np.memmap object and an np.ndarray is that the former has a ._mmap attribute that can be flushed, but even this is not true for the np.memmap objects that get allocated and returned from ufuncs -- they are literally just ndarray objects with a confusing subclass attached.

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 12, 2016
@charris
Member
charris commented Mar 12, 2016

Should get a mention in the 1.12.0-notes.

@charris
Member
charris commented Mar 15, 2016

@lesteve Ping.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 137afc3 to 6d7f1aa Compare March 15, 2016 17:22
@lesteve
Contributor Author
lesteve commented Mar 15, 2016

I added something in doc/release/1.12.0-notes.rst.

I haven't had time to work on this in the past few days. I will try to understand better what was done in #2932.

@ahaldane
Member

@lesteve My impression of the difference between this PR and #2932 is that #2932 tries to treat subclasses of memmap specially and handles __getitem__ (but is missing the 0d array case handled in this PR), while this PR treats 0d arrays properly but doesn't worry about inheritance or __getitem__.

I will admit I don't really understand a lot of #2932. It appears to be trying to take care of multiple inheritance involving memmap, but I'm not sure it's worth the effort to do that. Do we know of any projects that use multiple inheritance involving memmap? Note also that this PR, as is, properly takes care of single inheritance of memmap (since subclasses will inherit __array_wrap__).

However, I do think #2932 has the right idea in __getitem__, that we want fancy indexing like

>>> a = np.memmap('file', dtype='i4', shape=(4,4))
>>> a[[0,3]]

to return an ndarray, since the result is a new array rather than a view. However, a[1:3] should give back a memmap. I am not sure how to implement that yet.

In summary, my impression is that we can forget about trying to fix multiple inheritance like #2932, however we should somehow take care of fancy indexing, and I also suggest adding the if np.may_share_memory(self, arr): clause I pasted above.
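One way to tell the two indexing cases apart (an illustrative sketch of the idea, not necessarily what gets merged) is to check whether the indexing result still shares memory with the memmap, mirroring the may_share_memory logic used in __array_wrap__:

```python
import os
import tempfile

import numpy as np

class DemoMemmap(np.memmap):
    # Sketch: demote copies (fancy indexing) to ndarray, keep views
    # (basic slicing) as memmap. Illustrative only.
    def __getitem__(self, index):
        res = super().__getitem__(index)
        if isinstance(res, np.memmap) and not np.may_share_memory(self, res):
            return res.view(np.ndarray)  # fancy indexing copied the data
        return res

path = os.path.join(tempfile.mkdtemp(), "demo.dat")  # throwaway file
a = DemoMemmap(path, dtype="i4", mode="w+", shape=(4, 4))
print(type(a[1:3]))     # basic slice: still a memmap view
print(type(a[[0, 3]]))  # fancy index: a fresh ndarray copy
```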

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 6d7f1aa to 609d19a Compare March 15, 2016 18:53
@lesteve
Contributor Author
lesteve commented Mar 15, 2016

I have added the __getitem__ behaviour + tests from #2932 and also the may_share_memory in __array_wrap__. I think it does what @ahaldane wanted for view vs new array for __getitem__.

I haven't tried to understand further the subtleties involved with subclassing that #2932 was trying to tackle.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 609d19a to 838dab9 Compare March 15, 2016 23:30
@charris charris removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Mar 18, 2016
@ahaldane
Member

I think this PR is ready to be merged as is, if there are no objections.

Although I have a little bit of a feeling there may be some "unknown unknowns" here, after sitting on it I haven't thought up any more cases to take care of, and I'm not too worried about the multiple subclass case. Plus, I think this is easy to undo later if there are problems, and it can get some testing if we merge.

Maybe we can wait a little for anyone else to speak up, and then merge.

def __array_wrap__(self, arr, context=None):
# Return scalar instead of 0d memmap, e.g. for np.sum with
# axis=None
if arr.shape == ():
Member

Just to be on the safe side, it might be good to move it after the may_share_memory call? I don't think we should ever return 0-d arrays from functions calling array_wrap, but I am not perfectly sure.

Member

Then again, maybe it doesn't matter.

Contributor Author

Just to be on the safe side, it might be good to move it after the may_share_memory call?

Done.

@seberg
Member
seberg commented Mar 19, 2016

If there may be problems with subclasses, I would feel better if there is a note to the mailing list.

@seberg
Member
seberg commented Mar 19, 2016

Otherwise, I trust your assessment :).

@homu
Contributor
homu commented Mar 22, 2016

☔ The latest upstream changes (presumably #7325) made this pull request unmergeable. Please resolve the merge conflicts.

@lesteve
Contributor Author
lesteve commented Mar 22, 2016 via email

@mhvk
Contributor
mhvk commented Mar 31, 2016

@lesteve - the problem I mentioned is that in your example, np.add returns an ndarray. Normally, when out is given, it is ensured that what is returned is out itself, i.e., if one does mm2 = np.add(mm0, 1, out=mm1), it should be the case that mm2 is mm1.

@lesteve
Contributor Author
lesteve commented Mar 31, 2016

mm2 = np.add(mm0, 1, out=mm1), it should be the case that mm2 is mm1

@mhvk this PR does have the behaviour you want. Thanks for the explanations, I naively expected np.add to return None when the out parameter was specified.

@mhvk
Contributor
mhvk commented Mar 31, 2016

OK, I now see why this does work (I should have seen it before): when the output is a memmap, it is out.__array_wrap__ that is called, so inside __array_wrap__ the condition may_share_memory is in fact true.
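That out= behaviour can be verified directly. A small sketch, assuming a NumPy that includes this change and using throwaway temporary files:

```python
import os
import tempfile

import numpy as np

d = tempfile.mkdtemp()  # throwaway directory for the demo files
mm0 = np.memmap(os.path.join(d, "in.dat"), dtype="f8", mode="w+", shape=(4,))
mm1 = np.memmap(os.path.join(d, "out.dat"), dtype="f8", mode="w+", shape=(4,))

mm2 = np.add(mm0, 1, out=mm1)
print(mm2 is mm1)  # out is returned as-is ...
print(type(mm2))   # ... and keeps its memmap type
```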


def __array_wrap__(self, arr, context=None):
# Return memmap if self and arr are likely to share memory
if np.may_share_memory(self, arr):
Contributor

So, a remaining suggestion is to replace this statement with the simpler (and faster)

# Return the object if it was given as an output and should thus preserve type
if arr is self:
    return arr

Contributor Author

Playing around a bit, I wasn't able to find an edge case in __array_wrap__ where np.may_share_memory(self, arr) and arr is self disagreed, so I made the advised change.

If someone else can see an edge case that I missed, please complain!

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from 0d0c8e6 to cc8c8b6 Compare April 1, 2016 06:44
@ahaldane
Member
ahaldane commented Apr 1, 2016

Thanks for the suggestions @mhvk, it looks better now.

@lesteve I want to consider @mhvk's other suggestion too, to use super, but I am not totally happy with that solution because it breaks the single inheritance case (e.g. class B(np.memmap)). I want to think about it a little more.

@lesteve
Contributor Author
lesteve commented Apr 1, 2016

Let me know, I won't be able to work on it until Monday though.

@ahaldane
Member
ahaldane commented Apr 1, 2016

All right, from a brief look, astropy code doesn't seem to subclass memmap anywhere so @mhvk's example of class MyClass(Quantity, memmap) is only hypothetical.

In that case I vote to keep things simple and ignore the subclass issue, and merge this as is. Is that all right @mhvk?

Anyone in the future who wants to subclass memmap will be able to override our behavior if they want.

@mhvk
Contributor
mhvk commented Apr 1, 2016

@ahaldane - I'm half curious about what goes wrong with the inheritance class B(np.memmap), but as my example was indeed hypothetical, I'm quite happy to keep it simple.

@ahaldane
Member
ahaldane commented Apr 1, 2016

@mhvk, maybe I am mistaken. I was bothered by the fact that ufuncs and fancy slices of instances of B wouldn't return ndarrays, e.g. np.mean(B(file, dtype='i4', shape=4)) returns a B scalar. But actually that is probably what should happen for subclasses of ndarray.

@lesteve, sorry for going back and forth on this, but could you make the super change?

    def __array_wrap__(self, arr, context=None):
        arr = super(memmap, self).__array_wrap__(arr, context)
        # Return arr if it is not a memmap or if it was given as output.
        if arr is self or type(arr) is not memmap:
            return arr
        ...

and a unit test could be

class B(np.memmap):
    pass

a = B('file', dtype='i4', shape=4)
assert_equal(type(np.mean(a)), B)
assert_equal(type(a[[0,1]]), B)

Special case for reduction functions (e.g. np.sum with axis=None) that
return a numpy scalar.

Keep original memmap subclasses behavior to be on the safe side.
@lesteve
Contributor Author
lesteve commented Apr 4, 2016

I was bothered by the fact that ufuncs and fancy slices of instances of B wouldn't return ndarrays, eg. np.mean(B(file, dtype='i4', shape=4)) returns a B scalar. But actually that is probably what should happen for subclasses of ndarray.

I find that slightly bothersome too. In my mind, it is very unlikely that one would want a B 0d array as the result of np.mean, and for fancy slices I am a bit more split but not really convinced either. I'd be interested if someone has some kind of use case where this behaviour makes sense.

I made the requested change anyway.

@lesteve lesteve force-pushed the memmap-ufunc-return-ndarray branch from cc8c8b6 to 8f27d12 Compare April 4, 2016 08:01
@ogrisel
Contributor
ogrisel commented Apr 4, 2016

I haven't reviewed the change closely but it's a very welcome change in principle, so 👍 (in the long term I think the memmap class should go away entirely -- some of its code is worth saving, but I don't think it makes much sense as a separate ndarray subclass).

@embray I agree. To fix the memmap class consistency issue I had prototyped in the past an alternative that would not subclass ndarray but instead call np.frombuffer on a custom mmap buffer that would store the required metadata (filename, offset, mode...):

joblib/joblib@b68a868#diff-19004d02cec64e3d448b33565de338ebR94

I never finished that work but if people are interested I could contribute it to numpy as a saner alternative to np.memmap. Do you want me to extract that old code and make it a PR for numpy (with tests)?
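The core of that alternative, as a hypothetical sketch (not the joblib code itself): a plain ndarray built with np.frombuffer over an explicit mmap, with the metadata (filename, offset, mode, ...) kept alongside rather than on an ndarray subclass:

```python
import mmap
import os
import tempfile

import numpy as np

# Throwaway backing file, created just for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "buf.dat")
shape, dtype = (4, 4), np.dtype("i4")
with open(path, "wb") as fh:
    fh.write(b"\0" * (int(np.prod(shape)) * dtype.itemsize))

f = open(path, "r+b")
buf = mmap.mmap(f.fileno(), 0)  # writable mapping of the whole file
a = np.frombuffer(buf, dtype=dtype).reshape(shape)

print(type(a))  # a plain ndarray: no subclass surprises from ufuncs
a[0, 0] = 42    # writes go straight to the mapped buffer
buf.flush()     # flushing is done on the mmap object, not the array
```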

@embray
Contributor
embray commented Apr 4, 2016

@ogrisel Yes, that looks useful. For myself, I'm using np.memmap to create the underlying mmap in a convenient way, but then immediately passing it to np.frombuffer and throwing away the np.memmap object.

@mhvk
Contributor
mhvk commented Apr 4, 2016

@lesteve - for the subclass case, I do think it is best to default to returning the subclass. It is quite possible that the subclass stores information that is useful even for a 0-d array (e.g., a Quantity stores a unit, which should not get lost).

Overall, it seems to me this is a good addition as is, and with all the tests, etc., it seems ready to go in.

@lesteve
Contributor Author
lesteve commented Apr 4, 2016

@lesteve - for the subclass case, I do think it is best to default to returning the subclass. It is quite possible that the subclass stores information that is useful even for a 0-d array (e.g., a Quantity stores a unit, which should not get lost).

Fair enough, thanks for the explanation!

@ahaldane
Member
ahaldane commented Apr 4, 2016

Looks good, I'll merge in an hour or two.

@ahaldane ahaldane merged commit 2fb9599 into numpy:master Apr 4, 2016
@ahaldane
Member
ahaldane commented Apr 4, 2016

Merged, thanks @lesteve and the many reviewers.

@lesteve
Contributor Author
lesteve commented Apr 4, 2016

Great, thanks a lot!

@lesteve lesteve deleted the memmap-ufunc-return-ndarray branch April 4, 2016 17:40
matthew-brett added a commit to matthew-brett/nibabel that referenced this pull request Apr 7, 2016
memmap behavior is about to change with merge of
numpy/numpy#7406.

Test for behavior change so we can adapt expected mmap return values on
an image.