
Regression: Change in behaviour for np.array() on class that defines __array__ #13958


Closed
mhvk opened this issue Jul 10, 2019 · 31 comments · Fixed by #14995
Labels
06 - Regression · 56 - Needs Release Note (needs an entry in doc/release/upcoming_changes) · component: numpy._core

Comments

@mhvk
Contributor
mhvk commented Jul 10, 2019

An issue has been raised for numpy 1.17rc in astropy (astropy/astropy#8976), which after a bit of digging boils down to the change shown below in behaviour for classes that define __array__ as well as __len__ and __getitem__. I must admit I think the behaviour in both cases is strange, as it doesn't simply use __array__ but rather seems to iterate, but the behaviour in 1.17 definitely makes (even) less sense than that in 1.16.

import numpy as np
class A:
    """A recarray mimic in which one indexes items by order not name"""
    def __init__(self, item):
        self.item = item
    def __array__(self, dtype=None):
        return self.item
    def __len__(self):
        return len(self.item.dtype.names)
    def __getitem__(self, item):
        return self.item[self.item.dtype.names[item]]

dtype = np.dtype([('a', 'f'), ('b', 'i')])
rec1 = A(np.array((1., 1), dtype=dtype))
rec2 = A(np.array((2., 2), dtype=dtype))
np.array([rec1, rec2], dtype=object)
# numpy 1.16
# array([[array(1., dtype=float32), array(1, dtype=int32)],
#        [array(2., dtype=float32), array(2, dtype=int32)]], dtype=object)
# numpy 1.17rc
# array([[(1.0, 1), (1.0, 1)],
#        [(2.0, 2), (2.0, 2)]], dtype=object)

Note that the following do not depend on version:

len(rec1)
# 2
rec1[0]
# array(1., dtype=float32)
np.array(rec1)
# array((1., 1), dtype=[('a', '<f4'), ('b', '<i4')])
@seberg
Member

Hmm, we now switch between using the array interface (for filling the array) and using iteration only for discovery of dimensions; that must be the issue. We will have to see how much we have to revert for compatibility in this regard.
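
A minimal sketch (editorial, not from the thread) to observe the two code paths described above: instrument the A class from the original example and watch which protocol np.array actually touches. Traced is hypothetical, and it assumes A and dtype from the first code block are in scope.

class Traced(A):
    """Hypothetical subclass of A that logs protocol calls."""
    def __array__(self, dtype=None):
        print("__array__ called")
        return super().__array__(dtype)
    def __len__(self):
        print("__len__ called")
        return super().__len__()
    def __getitem__(self, item):
        print(f"__getitem__({item!r}) called")
        return super().__getitem__(item)

t = Traced(np.array((1., 1), dtype=dtype))
np.array([t, t], dtype=object)
# The printed trace shows whether dimension discovery iterates
# (__len__/__getitem__) or calls __array__; the mix differs
# between 1.16 and 1.17.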

@mhvk
Contributor Author
mhvk commented Jul 10, 2019

My expectation would have been that __array__() is called before anything else (if it is present), and that any checks on length are done on whatever it returns. But that expectation is broken both in the old and the new versions...

But perhaps if we break things, we might as well try to make it more logical?

@mattip
Member
mattip commented Jul 10, 2019

The documentation for __array__ is a bit silent on this, it only says

If a class (ndarray subclass or not) having the __array__ method is used as the output object of an ufunc, results will be written to the object returned by __array__. Similar conversion is done on input arrays.

So the interaction with non-ufunc things like concatenate might need to be addressed in the documentation.
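
For reference, the input-side conversion the quoted passage mentions looks roughly like this; Box is a hypothetical class written for illustration, not something from the issue.

import numpy as np

class Box:
    """Hypothetical array-like illustrating ufunc input conversion."""
    def __init__(self, data):
        self._data = np.asarray(data, dtype=float)
    def __array__(self, dtype=None):
        # Ufuncs call this to obtain a real ndarray for the input.
        return self._data if dtype is None else self._data.astype(dtype)

np.add(Box([1., 2., 3.]), 1)
# array([2., 3., 4.])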

@mhvk
Contributor Author
mhvk commented Jul 10, 2019

The documentation for np.array itself does mention __array__ as well:

array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

Create an array.

Parameters
----------
object : array_like
    An array, any object exposing the array interface, an object whose
    __array__ method returns an array, or any (nested) sequence.

This suggests __array__ would be tried first, before looking at whether the class is iterable. And that is also what happens for a straight np.array(rec1), just not if the classes are inside a list - which admittedly is a trickier situation.

@mhvk
Contributor Author
mhvk commented Jul 10, 2019

Just for reference, passing in arrays directly is still consistent between versions:

np.array([rec1.__array__(), rec2.__array__()])
# array([(1., 1), (2., 2)], dtype=[('a', '<f4'), ('b', '<i4')])
np.array([rec1.__array__(), rec2.__array__()], dtype=object)
# array([array((1., 1), dtype=[('a', '<f4'), ('b', '<i4')]),
#        array((2., 2), dtype=[('a', '<f4'), ('b', '<i4')])], dtype=object)

But neither 1.16 nor 1.17 reproduces the second result when the A instances themselves are passed in.

@charris
Member
charris commented Jul 21, 2019

@mhvk @seberg getting close to 1.17.0 final. What is the best way forward here?

@seberg
Member
seberg commented Jul 22, 2019

I suppose I will look at this tomorrow (unless someone beats me to it). I am not quite sure what the solution is, but I would like to move over to always using the array information, which however means replacing/caching the array-like, so it might not be quite trivial.

Worst case, we might have to revert that speed up change for 1.17.

@seberg seberg self-assigned this Jul 22, 2019
@seberg
Member
seberg commented Jul 22, 2019

OK, so basically, I think I can add correct code into discover_dimensions, at which point this code probably tries __array__ three times on the same object. Could try to add a cache for those things, but that seems like it might be a larger thing to do. Will dig a bit more.

@seberg
Member
seberg commented Jul 22, 2019

Just to list/summarize the issue:

I took a bit of a stab at preferring __array__; we will end up calling it a bit often (and something is still messy with respect to dtype discovery).

The other option is to just revert the ordering change from gh-13399 in setArrayFromSequence (gh-13663 is not needed then). Tensorflow might actually still be faster, because I am not sure it defines the sequence interface at all, but it likely means undoing the speed advantage.

@superbobry do you happen to have an insight into this?

@seberg
Member
seberg commented Jul 22, 2019

So even after fixing the dimension discovery to honor __array__, the result is a bit funny. For the case above with object dtype, it is the same oddity as for arrays (because an array is also not unpacked for object dtype). With no dtype given, it then breaks down because VOID_setitem cannot handle array-likes.

@seberg
Member
seberg commented Jul 22, 2019

@mhvk would a result of:

array([<__main__.A object at 0x7f9e5c2b5a18>,
       <__main__.A object at 0x7f9e5c2b59b0>], dtype=object)

in your example (with object dtype; a failure without dtype) be better? Maybe coercing __array__ yet another time is OK, at least for now.
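
For comparison, that object-array result can already be built explicitly today by sidestepping coercion altogether; a minimal sketch assuming rec1 and rec2 from the original example are in scope:

import numpy as np

out = np.empty(2, dtype=object)
out[0] = rec1
out[1] = rec2
out
# array([<__main__.A object at 0x...>,
#        <__main__.A object at 0x...>], dtype=object)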

@mhvk
Contributor Author
mhvk commented Jul 23, 2019

@seberg - for astropy, it doesn't really matter, we can fix it on our end, so I think this is really more a question of what the behaviour should be. Making an array of objects from a list of instances of A when the user passes in dtype=object certainly is reasonable!

The alternative would be to always immediately use __array__. I thought this was more consistent with how arrays themselves are treated, but then I noticed:

np.array([np.ones(3), np.ones(3)], dtype=object)
# Looks like array gets expanded
# array([[1.0, 1.0, 1.0],
#        [1.0, 1.0, 1.0]], dtype=object)
np.array([np.ones(4), np.ones(3)], dtype=object)
# Oh, but not if the shapes do not match...
# array([array([1., 1., 1., 1.]), array([1., 1., 1.])], dtype=object)

I guess the problem indeed boils down to not knowing whether to treat the list entries as the objects themselves, as duck arrays, or as further iterables.

Overall, your suggestion does seem the most logical: after all, without it, it would be impossible ever to make an array of A object instances, which seems strange.

One question is whether it would not be better to put this off to 1.18 (i.e., revert for now), so that one can also address the whole dtype=object question more generally (e.g., I think my example above should logically give an object array with 2 entries in both cases).

@charris
Member
charris commented Jul 23, 2019

Reversion until 1.18 seems reasonable given how tricky this seems to be.

@mhvk What do you mean by "address the whole dtype=object question?"

@mhvk
Contributor Author
mhvk commented Jul 23, 2019

@charris - good old #5353, the idea that one should never coerce to object unless explicitly asked to (though here of course the idea is to try to define what dtype=object implies)

@charris
Member
charris commented Jul 23, 2019

@mhvk Just wanted to make sure :) I don't think we have that scheduled; @seberg, this might be a good topic for discussion tomorrow.

@superbobry
Contributor

Tensorflow might actually still be faster, because I am not sure it defines the sequence interface at all, but it likely means undoing the speed advantage.

tf.Tensor does define a sequence interface, and that was where most of the time was spent when converting a sequence of tensors to a NumPy array (prior to gh-13399).
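
A rough sketch of why the sequence path is so much slower than the __array__ path; SeqLike is hypothetical and merely stands in for something like tf.Tensor.

import numpy as np

class SeqLike:
    """Hypothetical array-like exposing both protocols."""
    def __init__(self, data):
        self._data = np.asarray(data)
    def __array__(self, dtype=None):
        return self._data            # one bulk C-level conversion
    def __len__(self):
        return len(self._data)
    def __getitem__(self, i):
        return self._data[i]         # one Python call per element

# Coercing via __array__ copies the buffer once; coercing via the
# sequence interface makes len(obj) __getitem__ calls per object,
# which is the cost gh-13399 removed for tensor-like inputs.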

@seberg
Member
seberg commented Jul 24, 2019

Yeah, but I am sorry: if this is the last thing that is needed for the release, and we want to do that soon, I feel like it might be best to revert it, at least in the 1.17 branch. Unless someone steps up to advocate that such an object is so inherently broken and rare that we can get away with pretty much anything...

@superbobry
Contributor

I'm fine with reverting. Sadly, don't have time to investigate this further at the moment.

@charris charris modified the milestones: 1.17.0 release, 1.18.0 release Jul 25, 2019
@charris
Member
charris commented Jul 25, 2019

Reverted in 1.17, still in 1.18 while we search for a better solution.

@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jul 25, 2019
@mhvk
Contributor Author
mhvk commented Jul 26, 2019

Thanks all for looking into this!

@seberg
Member
seberg commented Oct 15, 2019

PR #14109 still needs to be backported (or a better fix found). Note that my new code will hopefully land (or a different version of it, of course), which does not matter much here, but right now I am personally not very enthusiastic about trying to find a solution because of that.

EDIT: I have to double check whether it is true that it needs backporting...

@mattip
Member
mattip commented Nov 12, 2019

In the alternatives to NEP 34 it was suggested to add a new array creation function that would add a depth keyword to determine how deeply to recurse into objects when creating an array from a sequence of sequences. What do we want to do for 1.18?
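
A toy sketch of what a depth keyword could mean, written only to make the proposal concrete; shallow_object_array is hypothetical and not part of any NumPy API.

import numpy as np

def shallow_object_array(seq):
    # depth=1 semantics: treat each element of seq as an opaque
    # object instead of recursing into its sequence/array interface.
    out = np.empty(len(seq), dtype=object)
    for i, item in enumerate(seq):
        out[i] = item
    return out

# shallow_object_array([rec1, rec2]) would yield a 1-d object array
# holding the A instances themselves.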

@seberg seberg modified the milestones: 1.19.0 release, 1.18.0 release Nov 25, 2019
@charris
Member
charris commented Nov 26, 2019

Now that this is pushed off, it sounds like we need to revert the change for 1.18.

@charris
Member
charris commented Nov 26, 2019

@seberg is it your suggestion that the reversion be in master?

@seberg
Member
seberg commented Nov 26, 2019

I guess we should put it into master, unless anyone argues that the regression does not matter much at all. There should be ways to make the speed-up happen without the regression, or at least while doing it on purpose (and consistently). It is easy enough to revert the revert as a basis for new work...

@charris
Member
charris commented Nov 26, 2019

@seberg Reverted in #14983, please check that.

@mattip
Member
mattip commented Nov 26, 2019

The original PR showed a 30x speedup. Perhaps we should reconsider reverting that?

@rgommers
Member

Reverting in master seems right; it looks like this needs someone to sit down and work out the desired behavior in detail for all reported cases, write some tests to enforce that behavior, and then look again at the original speedup to see if it can be adapted.

@seberg seberg removed their assignment Nov 27, 2019
@seberg seberg modified the milestones: 1.18.0 release, 1.19.0 release Dec 2, 2019
seberg added a commit to seberg/numpy that referenced this issue Feb 6, 2020
``__array__`` was previously not used during dimension discovery,
while being used during dtype discovery (if dtype is not given),
as well as during filling of the resulting array.

This would lead to inconsistencies with respect to array likes
that have a shape including a 0 (typically as first dimension).
Thus a shape of ``(0, 1, 1)`` would be found as ``(0,)`` because
a nested list/sequence cannot represent empty shapes, except 1-D.

This uses the `_array_from_array_like` function, which means that
some coercions may be a tiny bit slower, at the gain of removing
a lot of complex code.

(this also reverts commit d0d250a
or the related change).

This is a continuation of work by Sergei Lebedev in numpygh-13663
which had to be reverted due to problems with Pandas, and the
general inconsistency. This version may not resolve all issues
with pandas, but does resolve the inconsistency.

Closes numpygh-13958
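
A hedged reconstruction of the shape problem the commit message describes; EmptyLike is a hypothetical class written for illustration.

import numpy as np

class EmptyLike:
    """Hypothetical array-like whose __array__ has an empty leading axis."""
    def __array__(self, dtype=None):
        return np.empty((0, 1, 1))

# Dimension discovery that iterates instead of calling __array__ can
# only observe an empty sequence, i.e. shape (0,); once __array__ is
# honored, the full shape survives.
np.array(EmptyLike()).shape
# (0, 1, 1)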