BUG: Use `array` during dimension discovery #14995

seberg · 2019-11-27T22:09:04Z

__array__ was previously not used during dimension discovery,
while bein gused during dtype discovery (if dtype is not given),
as well as during filling of the resulting array.

This would lead to inconsistencies with respect to array likes
that have a shape including a 0 (typically as first dimension).
Thus a shape of (0, 1, 1) would be found as (0,) because
a nested list/sequence cannot represent empty shapes, except 1-D.

This uses the _array_from_array_like function, which means that
some coercions may be tiny bit slower, at the gain of removing
a lot of complex code.

(this also reverts commit d0d250a)

Closes gh-13958

I am a bit confused as to why I thought before that just adding __array__ would not simply solve this. I think it does. I could add the use of _array_from_array_like in common.h during dtype discovery.

Note that this changes behaviour, so testing against pandas may be in order. And I suppose it needs a release note, I will probably add those tonight.

eric-wieser · 2019-11-27T22:11:45Z

numpy/core/src/multiarray/ctors.c

-    if (slen < 0) {
+    /* Try __array__ before using s as a sequence */
+    PyObject *tmp = _array_from_array_like(
+            s, /*dtype*/ NULL, /*writeable*/ 0, /*context*/ NULL);


nit: Tempting to change this to match the other call exactly, either as _array_from_array_like(obj, NULL, NPY_FALSE, NULL); here or this way there.

Sure, presonally I now use an IDE which actually adds hints for the names here, giving it double...

I've seen a former coworker use that before, it looked pretty cool

numpy/core/src/multiarray/ctors.c

eric-wieser · 2019-11-27T22:23:18Z

Is a backport intended here?

charris · 2019-11-27T22:33:21Z

Is a backport intended here?

There was already a reversion in 1.17, but not in master. I had a reversion for the 1.18 release, but it was defective. This fixes the failed reversion with a new approach for 1.18.

numpy/core/src/multiarray/ctors.c

numpy/core/tests/test_regression.py

eric-wieser · 2019-11-28T01:12:56Z

Overall looks like some nice cleanup, especially if you can get it to pass CI

seberg · 2019-11-28T01:55:09Z

Passing our CI should not be a problem. Pandas seems to be acting up a bit though, I have to investigate if its us or them (or just bad tests).

seberg · 2019-11-28T17:39:22Z

@jbrockmendel, @superbobry maybe you can help me out understand pandas a bit better here? On the new branch pandas tests fail which do:

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
df2 = pd.DataFrame([df1, df1])

The code seems to call np.asarray([df1, df1]) (so far so good). Which now would return a 3-D array, and that trips pandas (I have difficulty saying how bad that actually is, a DeprecationWarning would be nice, but tricky with the project interplay maybe).

The part that I would like to understand is why it currently works (in some weird twisted way), when np.asarray([df1, df1]) actually raises an error. I cannot see where that error is caught and replaced by an dtype=object kind coercion path?

eric-wieser · 2019-11-28T18:12:27Z

I think it's here:

https://github.com/pandas-dev/pandas/blob/ea2e26ae7d700d7fd363ea5bfc05d2fe3fb8a5ee/pandas/core/internals/construction.py#L249-L283

seberg · 2019-11-28T18:31:11Z

Yeah, but I think it should call np.asarray([df1, df1]) which should raise an error. And I was too stupid to see why that error would not be propagated to the user...

eric-wieser · 2019-11-28T18:46:39Z

These lines also look relevant: https://github.com/pandas-dev/pandas/blob/ea2e26ae7d700d7fd363ea5bfc05d2fe3fb8a5ee/pandas/core/dtypes/cast.py#L1381-L1385

Can you paste the traceback of the failure?

seberg · 2019-11-28T18:51:38Z

Oh man, thanks yeah. I saw that function being used, but somehow did not expect it to be hit in one branch vs. the other...

Error with this branch

In [1]: import pandas as pd
^[[A
In [2]: df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

In [3]: pd.DataFrame([df1, df1])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-91cf2bad0f2e> in <module>()
----> 1 pd.DataFrame([df1, df1])

/home/sebastian/.local/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    462                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    463                 else:
--> 464                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    465             else:
    466                 mgr = init_dict({}, index, columns, dtype=dtype)

/home/sebastian/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    169     # by definition an array here
    170     # the dtypes will be coerced to a single dtype
--> 171     values = prep_ndarray(values, copy=copy)
    172 
    173     if dtype is not None:

/home/sebastian/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    293         values = values.reshape((values.shape[0], 1))
    294     elif values.ndim != 2:
--> 295         raise ValueError("Must pass 2-d input")
    296 
    297     ret
8000
urn values

ValueError: Must pass 2-d input

Without this branch it returns a Dataframe with dataframes as entries, which is silly, but... Anyway, I will have a closer look if pandas can do something. But could be nice to have feedback from pandas whether they care about this or think its OK to error out...

eric-wieser · 2019-11-28T18:59:41Z

Can you run %debug pd.core.internals.construction.prep_ndarray([df1, df1]) in ipython and step through to see the code path that initialized values?

seberg · 2019-11-28T19:14:11Z

whoops, I think I misread the code at the very start, its not isinstance(...), so it goes into the try/except branch of prep_ndarray of course.

EDIT: To be clear:

        try:
            if is_list_like(values[0]) or hasattr(values[0], 'len'):  # <-- is hit
                # following convert does nothing; `np.array()` than raises Error...
                values = np.array([convert(v) for v in values])
            elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
                # GH#21861
                values = np.array([convert(v) for v in values])
            else:
                values = convert(values)
        except (ValueError, TypeError):
            values = convert(values)  # <-- Ends up getting called and forces object array.

charris · 2019-12-02T17:38:52Z

Pushing this off. Will fix up the reversion which will also probably mess up the merge.

seberg · 2019-12-02T17:43:08Z

Yeah, I was starting to figure the same. I can do the fixup if you prefer.

seberg · 2019-12-02T17:46:00Z

I would like to try this early after the split then, so that we can see if pandas is OK/can keep up with it.

eric-wieser · 2019-12-02T17:47:54Z

# following convert does nothing

@seberg, the way I read it convert called construct_1d_object_array_from_listlike

seberg · 2019-12-02T18:03:58Z

@eric-wieser yeah, but the new path that is taken with this never tries calling convert on the(now 3D) array (which was a ValueError before). In any case, while I would prefer this change, and I think pandas should probably be OK with it. I do not feel like pushing something that might cause an avalanche a few days before the release.

I have opened pandas issue at pandas-dev/pandas#29978

mattip · 2020-01-29T07:59:23Z

@seberg ping

seberg · 2020-01-29T23:06:41Z

OK, updated and release notes added. Maybe we should just try and see where it brings us with respect to pandas? I assume they are the largest "users" of this behaviour, and it seemed like all of the test failures they have is nonsense behaviour that they should be deprecating anyway.

In the end, it is hard to test unless someone from the pandas side is willing to get this branch, so I would be for moving forward and reverting if necessary?

numpy/core/src/multiarray/ctors.c

jbrockmendel · 2020-01-29T23:23:28Z

it is hard to test unless someone from the pandas side is willing to get this branch

I'll pull this and run our npdev build against it now.

numpy/core/src/multiarray/ctors.c

seberg · 2020-01-30T00:20:50Z

Did some style fixups and comment expansion (sorry, I did not mean to, but my muscle reflexes force pushed). The refcounting was correct, since it actually is a borrowed reference until the end of that function (public API, cannot be changed, I added a comment, which may be a bit long but...).

@jbrockmendel thanks for checking! Good to know that you think those pandas tests can be simply fixed up and should not be a problem.

seberg · 2020-01-30T18:30:06Z

Release notes pushed (forgot to add it to the commit).

``__array__`` was previously not used during dimension discovery, while bein gused during dtype discovery (if dtype is not given), as well as during filling of the resulting array. This would lead to inconsistencies with respect to array likes that have a shape including a 0 (typically as first dimension). Thus a shape of ``(0, 1, 1)`` would be found as ``(0,)`` because a nested list/sequence cannot represent empty shapes, except 1-D. This uses the `_array_from_array_like` function, which means that some coercions may be tiny bit slower, at the gain of removing a lot of complex code. (this also reverts commit d0d250a or the related change). This is a continuation of work by Sergei Lebedev in numpygh-13663 which had to be reverted due to problems with Pandas, and the general inconsistency. This version may not resolve all issues with pandas, but does resolve the inconsistency. Closes numpygh-13958

seberg · 2020-02-06T01:48:59Z

Fixed the merge conflict, and tweaked the release note a bit. Do we have to discuss whether this should go in as a bug fix. If this is changes behaviour for some (arguably buggy code), I feel that in almost all cases it should lead to quick crashes, or at least very obvious wrong results.

mattip · 2020-02-26T21:40:18Z

Thanks @seberg

seberg added this to the 1.18.0 release milestone Nov 27, 2019

seberg requested a review from mattip November 27, 2019 22:09

eric-wieser reviewed Nov 27, 2019

View reviewed changes

seberg added 00 - Bug 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes labels Nov 27, 2019

eric-wieser reviewed Nov 27, 2019

View reviewed changes

numpy/core/src/multiarray/ctors.c Show resolved Hide resolved

eric-wieser reviewed Nov 27, 2019

View reviewed changes

numpy/core/src/multiarray/ctors.c Outdated Show resolved Hide resolved

seberg force-pushed the array-like-coercion branch from 1e51a72 to 7ca95cd Compare November 28, 2019 00:05

eric-wieser reviewed Nov 28, 2019

View reviewed changes

numpy/core/src/multiarray/ctors.c Outdated Show resolved Hide resolved

eric-wieser reviewed Nov 28, 2019

View reviewed changes

numpy/core/tests/test_regression.py Outdated Show resolved Hide resolved

charris removed this from the 1.18.0 release milestone Dec 2, 2019

seberg mentioned this pull request Dec 2, 2019

Pandas Tests rely on inconsistent array coercion pandas-dev/pandas#29978

Open

seberg force-pushed the array-like-coercion branch from 7ca95cd to 8a1a80f Compare December 2, 2019 19:43

mattip mentioned this pull request Dec 3, 2019

Adding an empty range() to an empty slice throws "Cannot cast ufunc 'add' output ..." #14317

Open

seberg force-pushed the array-like-coercion branch 2 times, most recently from 497c36a to 50612a5 Compare January 29, 2020 23:01

seberg removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jan 29, 2020

seberg force-pushed the array-like-coercion branch from 50612a5 to 766b046 Compare January 29, 2020 23:16

seberg added the 58 - Ready for review label Jan 29, 2020

eric-wieser reviewed Jan 29, 2020

View reviewed changes

numpy/core/src/multiarray/ctors.c Outdated Show resolved Hide resolved

eric-wieser reviewed Jan 29, 2020

View reviewed changes

numpy/core/src/multiarray/ctors.c Outdated Show resolved Hide resolved

eric-wieser reviewed F438 Jan 29, 2020

View reviewed changes