
Regression: Change in behaviour for np.array() on class that defines __array__ #13958


Closed
mhvk opened this issue Jul 10, 2019 · 31 comments · Fixed by #14995
Labels
06 - Regression · 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) · component: numpy._core

Comments

@mhvk
Contributor
mhvk commented Jul 10, 2019

An issue has been raised for numpy 1.17rc in astropy (astropy/astropy#8976), which after a bit of digging boils down to the change shown below in behaviour for classes that define __array__ as well as __len__ and __getitem__. I must admit I think the behaviour in both cases is strange, as it doesn't simply use __array__ but rather seems to iterate, but the behaviour in 1.17 definitely makes (even) less sense than that in 1.16.

import numpy as np
class A:
    """A recarray mimic in which one indexes items by order not name"""
    def __init__(self, item):
        self.item = item
    def __array__(self, dtype=None):
        return self.item
    def __len__(self):
        return len(self.item.dtype.names)
    def __getitem__(self, item):
        return self.item[self.item.dtype.names[item]]

dtype = np.dtype([('a', 'f'), ('b', 'i')])
rec1 = A(np.array((1., 1), dtype=dtype))
rec2 = A(np.array((2., 2), dtype=dtype))
np.array([rec1, rec2], dtype=object)
# numpy 1.16
# array([[array(1., dtype=float32), array(1, dtype=int32)],
#        [array(2., dtype=float32), array(2, dtype=int32)]], dtype=object)
# numpy 1.17rc
# array([[(1.0, 1), (1.0, 1)],
#        [(2.0, 2), (2.0, 2)]], dtype=object)

Note that the following do not depend on version:

len(rec1)
# 2
rec1[0]
# array(1., dtype=float32)
np.array(rec1)
# array((1., 1), dtype=[('a', '<f4'), ('b', '<i4')])
@seberg
Member

Hmm, we now switch between using the array interface (for filling the array) and using iteration only for discovery of dimensions; that must be the issue. We will have to see how much we have to revert for compatibility in this regard.
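
A minimal sketch (editorial, not from the thread) to observe the two code paths described above: instrument the A class from the original example and watch which protocol np.array actually touches. Traced is hypothetical, and it assumes A and dtype from the first code block are in scope.

class Traced(A):
    """Hypothetical subclass of A that logs protocol calls."""
    def __array__(self, dtype=None):
        print("__array__ called")
        return super().__array__(dtype)
    def __len__(self):
        print("__len__ called")
        return super().__len__()
    def __getitem__(self, item):
        print(f"__getitem__({item!r}) called")
        return super().__getitem__(item)

t = Traced(np.array((1., 1), dtype=dtype))
np.array([t, t], dtype=object)
# The printed trace shows whether dimension discovery iterates
# (__len__/__getitem__) or calls __array__; the mix differs
# between 1.16 and 1.17.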

@mhvk
Contributor Author
mhvk commented Jul 10, 2019

My expectation would have been that __array__() is called before anything else (if it is present), and that any checks on length are done on whatever it returns. But that expectation is broken both in the old and the new versions...

But perhaps if we break things, we might as well try to make it more logical?

@mattip
Member
mattip commented Jul 10, 2019

The documentation for __array__ is a bit silent on this, it only says

If a class (ndarray subclass or not) having the __array__ method is used as the output object of an ufunc, results will be written to the object returned by __array__. Similar conversion is done on input arrays.

So the interaction with non-ufunc things like concatenate might need to be addressed in the documentation.
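
For reference, the input-side conversion the quoted passage mentions looks roughly like this; Box is a hypothetical class written for illustration, not something from the issue.

import numpy as np

class Box:
    """Hypothetical array-like illustrating ufunc input conversion."""
    def __init__(self, data):
        self._data = np.asarray(data, dtype=float)
    def __array__(self, dtype=None):
        # Ufuncs call this to obtain a real ndarray for the input.
        return self._data if dtype is None else self._data.astype(dtype)

np.add(Box([1., 2., 3.]), 1)
# array([2., 3., 4.])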

@mhvk
Contributor Author
mhvk commented Jul 10, 2019

The documentation for np.array itself does mention __array__ as well:

array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

Create an array.

Parameters
----------
object : array_like
    An array, any object exposing the array interface, an object whose
    __array__ method returns an array, or any (nested) sequence.

This suggests __array__ would be tried first, before looking at whether the class is iterable. And that is also what happens for a straight np.array(rec1), just not if the classes are inside a list - which admittedly is a trickier situation.

@mhvk
Contributor Author
mhvk commented Jul 10, 2019

Just for reference, passing in arrays directly is still consistent between versions:

np.array([rec1.__array__(), rec2.__array__()])
# array([(1., 1), (2., 2)], dtype=[('a', '<f4'), ('b', '<i4')])
np.array([rec1.__array__(), rec2.__array__()], dtype=object)
# array([array((1., 1), dtype=[('a', '<f4'), ('b', '<i4')]),
#        array((2., 2), dtype=[('a', '<f4'), ('b', '<i4')])], dtype=object)

But neither 1.16 nor 1.17 reproduces the second result when the A instances themselves are passed in.

@charris
Member
charris commented Jul 21, 2019

@mhvk @seberg getting close to 1.17.0 final. What is the best way forward here?

@seberg
Member
seberg commented Jul 22, 2019

I suppose I will look at this tomorrow (unless someone beats me to it). I am not quite sure what the solution is, but I would like to move over to always using the array information, which however means replacing/caching the array-like, so it might not be quite trivial.

Worst case, we might have to revert that speed up change for 1.17.

@seberg seberg self-assigned this Jul 22, 2019
@seberg
Member
seberg commented Jul 22, 2019

OK, so basically, I think I can add correct code into discover_dimensions, at which point this code probably tries __array__ three times on the same object. Could try to add a cache for those things, but that seems like it might be a larger thing to do. Will dig a bit more.

@seberg
Member
seberg commented Jul 22, 2019

Just to list/summarize the issue:

I took a bit of a stab at preferring __array__; we will end up calling it a bit often (and something is still messy with respect to dtype discovery).

The other option is to just revert the ordering change from gh-13399 in setArrayFromSequence (gh-13663 is not needed then). Tensorflow might actually still be faster, because I am not sure it defines the sequence interface at all, but it likely means undoing the speed advantage.

@superbobry do you happen to have an insight into this?

@seberg
Member
seberg commented Jul 22, 2019

So even after fixing the dimension discovery to honor __array__, the result is a bit funny. For the case above with object dtype, it is the same oddity as for arrays (because an array is also not unpacked for object dtype). With no dtype given, it then breaks down because VOID_setitem cannot handle array-likes.

@seberg
Member
seberg commented Jul 22, 2019

@mhvk would a result of:

array([<__main__.A object at 0x7f9e5c2b5a18>,
       <__main__.A object at 0x7f9e5c2b59b0>], dtype=object)

in your example (with object dtype; a failure without dtype) be better? Maybe coercing __array__ yet another time is OK, at least for now.
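
For comparison, that object-array result can already be built explicitly today by sidestepping coercion altogether; a minimal sketch assuming rec1 and rec2 from the original example are in scope:

import numpy as np

out = np.empty(2, dtype=object)
out[0] = rec1
out[1] = rec2
out
# array([<__main__.A object at 0x...>,
#        <__main__.A object at 0x...>], dtype=object)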

@mhvk
Contributor Author
mhvk commented Jul 23, 2019

@seberg - for astropy, it doesn't really matter, we can fix it on our end, so I think this is really more a question of what the behaviour should be. Making an array of objects from a list of instances of A when the user passes in dtype=object certainly is reasonable!

The alternative would be to always immediately use __array__. I thought this was more consistent with how arrays themselves are treated, but then I noticed:

np.array([np.ones(3), np.ones(3)], dtype=object)
# Looks like array gets expanded
# array([[1.0, 1.0, 1.0],
#        [1.0, 1.0, 1.0]], dtype=object)
np.array([np.ones(4), np.ones(3)], dtype=object)
# Oh, but not if the shapes do not match...
# array([array([1., 1., 1., 1.]), array([1., 1., 1.])], dtype=object)

I guess the problem indeed boils down to not knowing whether to treat the list entries as the objects themselves, as duck arrays, or as further iterables.

Overall, your suggestion does seem the most logical: after all, without it, it would be impossible ever to make an array of A object instances, which seems strange.

One question is whether it would not be better to put this off to 1.18 (i.e., revert for now), so that one can also address the whole dtype=object question more generally (e.g., I think my example above should logically give an object array with 2 entries in both cases).

@charris
Member
charris commented Jul 23, 2019

Reversion until 1.18 seems reasonable given how tricky this seems to be.

@mhvk What do you mean by "address the whole dtype=object question?"

@mhvk
Contributor Author
mhvk commented Jul 23, 2019

@charris - good old #5353, the idea that one should never coerce to object unless explicitly asked to (though here of course the idea is to try to define what dtype=object implies)

@charris
Member
charris commented Jul 23, 2019

@mhvk Just wanted to make sure :) I don't think we have that scheduled; @seberg, this might be a good topic for discussion tomorrow.

@superbobry
Contributor

Tensorflow might actually still be faster, because I am not sure it defines the sequence interface at all, but it likely means undoing the speed advantage.

tf.Tensor does define a sequence interface, and that was where most of the time was spent when converting a sequence of tensors to a NumPy array (prior to gh-13399).
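
A rough sketch of why the sequence path is so much slower than the __array__ path; SeqLike is hypothetical and merely stands in for something like tf.Tensor.

import numpy as np

class SeqLike:
    """Hypothetical array-like exposing both protocols."""
    def __init__(self, data):
        self._data = np.asarray(data)
    def __array__(self, dtype=None):
        return self._data            # one bulk C-level conversion
    def __len__(self):
        return len(self._data)
    def __getitem__(self, i):
        return self._data[i]         # one Python call per element

# Coercing via __array__ copies the buffer once; coercing via the
# sequence interface makes len(obj) __getitem__ calls per object,
# which is the cost gh-13399 removed for tensor-like inputs.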

@seberg
Member
seberg commented Jul 24, 2019

Yeah, but I am sorry: if this is the last thing that is needed for the release, and we want to do that soon, I feel like it might be best to revert it, at least in the 1.17 branch. Unless someone steps up to advocate that such an object is so inherently broken and rare that we can get away with pretty much anything...

@superbobry
Contributor

I'm fine with reverting. Sadly, don't have time to investigate this further at the moment.

@charris charris modified the milestones: 1.17.0 release, 1.18.0 release Jul 25, 2019
@charris
Member
charris commented Jul 25, 2019

Reverted in 1.17, still in 1.18 while we search for a better solution.

@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jul 25, 2019
@mhvk
Contributor Author
mhvk commented Jul 26, 2019

Thanks all for looking into this!

@seberg
Member
seberg commented Oct 15, 2019

PR #14109 still needs to be backported (or a better fix found). Note that my new code will hopefully land (or a different version of it, of course), which does not matter much here, but right now I am personally not very enthusiastic about trying to find a solution because of that.

EDIT: I have to double check whether it is true that it needs backporting...

@mattip
Member
mattip commented Nov 12, 2019

In the alternatives to NEP 34 it was suggested to add a new array creation function that would add a depth keyword to determine how deeply to recurse into objects when creating an array from a sequence of sequences. What do we want to do for 1.18?
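
A toy sketch of what a depth keyword could mean, written only to make the proposal concrete; shallow_object_array is hypothetical and not part of any NumPy API.

import numpy as np

def shallow_object_array(seq):
    # depth=1 semantics: treat each element of seq as an opaque
    # object instead of recursing into its sequence/array interface.
    out = np.empty(len(seq), dtype=object)
    for i, item in enumerate(seq):
        out[i] = item
    return out

# shallow_object_array([rec1, rec2]) would yield a 1-d object array
# holding the A instances themselves.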

@seberg seberg modified the milestones: 1.19.0 release, 1.18.0 release Nov 25, 2019
@charris
Member
charris commented Nov 26, 2019

Now that this is pushed off, it sounds like we need to revert the change for 1.18.

@charris
Member
charris commented Nov 26, 2019

@seberg is it your suggestion that the reversion be in master?

@seberg
Member
seberg commented Nov 26, 2019

I guess we should put it into master, unless anyone argues that the regression does not matter much at all. There should be ways to make the speed-up happen without the regression, or at least while doing it on purpose (and consistently). It is easy enough to revert the revert as a basis for new work...

@charris
Member
charris commented Nov 26, 2019

@seberg Reverted in #14983, please check that.

@mattip
Member
mattip commented Nov 26, 2019

The original PR showed a 30x speedup. Perhaps we should reconsider reverting that?

@rgommers
Member

Reverting in master seems right; it looks like this needs someone to sit down and work out the desired behavior in detail for all reported cases, write some tests to enforce that behavior, and then look again at the original speedup to see if it can be adapted.

@seberg seberg removed their assignment Nov 27, 2019
@seberg seberg modified the milestones: 1.18.0 release, 1.19.0 release Dec 2, 2019
seberg added a commit to seberg/numpy that referenced this issue Feb 6, 2020
``__array__`` was previously not used during dimension discovery,
while being used during dtype discovery (if dtype is not given),
as well as during filling of the resulting array.

This would lead to inconsistencies with respect to array likes
that have a shape including a 0 (typically as first dimension).
Thus a shape of ``(0, 1, 1)`` would be found as ``(0,)`` because
a nested list/sequence cannot represent empty shapes, except 1-D.

This uses the `_array_from_array_like` function, which means that
some coercions may be a tiny bit slower, at the gain of removing
a lot of complex code.

(this also reverts commit d0d250a
or the related change).

This is a continuation of work by Sergei Lebedev in numpygh-13663
which had to be reverted due to problems with Pandas, and the
general inconsistency. This version may not resolve all issues
with pandas, but does resolve the inconsistency.

Closes numpygh-13958
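
A hedged reconstruction of the shape problem the commit message describes; EmptyLike is a hypothetical class written for illustration.

import numpy as np

class EmptyLike:
    """Hypothetical array-like whose __array__ has an empty leading axis."""
    def __array__(self, dtype=None):
        return np.empty((0, 1, 1))

# Dimension discovery that iterates instead of calling __array__ can
# only observe an empty sequence, i.e. shape (0,); once __array__ is
# honored, the full shape survives.
np.array(EmptyLike()).shape
# (0, 1, 1)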