8000 BUG: Use ``__array__`` during dimension discovery by seberg · Pull Request #14995 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: Use __array__ during dimension discovery #14995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 26, 2020

Conversation

seberg
Copy link
Member
@seberg seberg commented Nov 27, 2019

__array__ was previously not used during dimension discovery,
while bein gused during dtype discovery (if dtype is not given),
as well as during filling of the resulting array.

This would lead to inconsistencies with respect to array likes
that have a shape including a 0 (typically as first dimension).
Thus a shape of (0, 1, 1) would be found as (0,) because
a nested list/sequence cannot represent empty shapes, except 1-D.

This uses the _array_from_array_like function, which means that
some coercions may be tiny bit slower, at the gain of removing
a lot of complex code.

(this also reverts commit d0d250a)

Closes gh-13958


I am a bit confused as to why I thought before that just adding __array__ would not simply solve this. I think it does. I could add the use of _array_from_array_like in common.h during dtype discovery.

Note that this changes behaviour, so testing against pandas may be in order. And I suppose it needs a release note, I will probably add those tonight.

@seberg seberg added this to the 1.18.0 release milestone Nov 27, 2019
@seberg seberg requested a review from mattip November 27, 2019 22:09
if (slen < 0) {
/* Try __array__ before using s as a sequence */
PyObject *tmp = _array_from_array_like(
s, /*dtype*/ NULL, /*writeable*/ 0, /*context*/ NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Tempting to change this to match the other call exactly, either as _array_from_array_like(obj, NULL, NPY_FALSE, NULL); here or this way there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, presonally I now use an IDE which actually adds hints for the names here, giving it double...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen a former coworker use that before, it looked pretty cool

@seberg seberg added 00 - Bug 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes labels Nov 27, 2019
@eric-wieser
Copy link
Member

Is a backport intended here?

@charris
Copy link
Member
charris commented Nov 27, 2019

Is a backport intended here?

There was already a reversion in 1.17, but not in master. I had a reversion for the 1.18 release, but it was defective. This fixes the failed reversion with a new approach for 1.18.

@seberg seberg force-pushed the array-like-coercion branch from 1e51a72 to 7ca95cd Compare November 28, 2019 00:05
@eric-wieser
Copy link
Member

Overall looks like some nice cleanup, especially if you can get it to pass CI

@seberg
Copy link
Member Author
seberg commented Nov 28, 2019

Passing our CI should not be a problem. Pandas seems to be acting up a bit though, I have to investigate if its us or them (or just bad tests).

@seberg
Copy link
Member Author
seberg commented Nov 28, 2019

@jbrockmendel, @superbobry maybe you can help me out understand pandas a bit better here? On the new branch pandas tests fail which do:

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
df2 = pd.DataFrame([df1, df1])

The code seems to call np.asarray([df1, df1]) (so far so good). Which now would return a 3-D array, and that trips pandas (I have difficulty saying how bad that actually is, a DeprecationWarning would be nice, but tricky with the project interplay maybe).

The part that I would like to understand is why it currently works (in some weird twisted way), when np.asarray([df1, df1]) actually raises an error. I cannot see where that error is caught and replaced by an dtype=object kind coercion path?

@eric-wieser
Copy link
Member

@seberg
Copy link
Member Author
seberg commented Nov 28, 2019

Yeah, but I think it should call np.asarray([df1, df1]) which should raise an error. And I was too stupid to see why that error would not be propagated to the user...

@eric-wieser
Copy link
Member
eric-wieser commented Nov 28, 2019

These lines also look relevant: https://github.com/pandas-dev/pandas/blob/ea2e26ae7d700d7fd363ea5bfc05d2fe3fb8a5ee/pandas/core/dtypes/cast.py#L1381-L1385

Can you paste the traceback of the failure?

@seberg
Copy link
Member Author
seberg commented Nov 28, 2019

Oh man, thanks yeah. I saw that function being used, but somehow did not expect it to be hit in one branch vs. the other...

Error with this branch
In [1]: import pandas as pd
^[[A
In [2]: df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

In [3]: pd.DataFrame([df1, df1])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-91cf2bad0f2e> in <module>()
----> 1 pd.DataFrame([df1, df1])

/home/sebastian/.local/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    462                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    463                 else:
--> 464                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    465             else:
    466                 mgr = init_dict({}, index, columns, dtype=dtype)

/home/sebastian/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    169     # by definition an array here
    170     # the dtypes will be coerced to a single dtype
--> 171     values = prep_ndarray(values, copy=copy)
    172 
    173     if dtype is not None:

/home/sebastian/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    293         values = values.reshape((values.shape[0], 1))
    294     elif values.ndim != 2:
--> 295         raise ValueError("Must pass 2-d input")
    296 
    297     ret
8000
urn values

ValueError: Must pass 2-d input

Without this branch it returns a Dataframe with dataframes as entries, which is silly, but... Anyway, I will have a closer look if pandas can do something. But could be nice to have feedback from pandas whether they care about this or think its OK to error out...

@eric-wieser
Copy link
Member
eric-wieser commented Nov 28, 2019

Can you run %debug pd.core.internals.construction.prep_ndarray([df1, df1]) in ipython and step through to see the code path that initialized values?

@seberg
Copy link
Member Author
seberg commented Nov 28, 2019

whoops, I think I misread the code at the very start, its not isinstance(...), so it goes into the try/except branch of prep_ndarray of course.

EDIT: To be clear:

        try:
            if is_list_like(values[0]) or hasattr(values[0], 'len'):  # <-- is hit
                # following convert does nothing; `np.array()` than raises Error...
                values = np.array([convert(v) for v in values])
            elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
                # GH#21861
                values = np.array([convert(v) for v in values])
            else:
                values = convert(values)
        except (ValueError, TypeError):
            values = convert(values)  # <-- Ends up getting called and forces object array.

@charris charris removed this from the 1.18.0 release milestone Dec 2, 2019
@charris
Copy link
Member
charris commented Dec 2, 2019

Pushing this off. Will fix up the reversion which will also probably mess up the merge.

@seberg
Copy link
Member Author
seberg commented Dec 2, 2019

Yeah, I was starting to figure the same. I can do the fixup if you prefer.

@seberg
Copy link
Member Author
seberg commented Dec 2, 2019

I would like to try this early after the split then, so that we can see if pandas is OK/can keep up with it.

@eric-wieser
Copy link
Member
eric-wieser commented Dec 2, 2019

# following convert does nothing

@seberg, the way I read it convert called construct_1d_object_array_from_listlike

@seberg
Copy link
Member Author
seberg commented Dec 2, 2019

@eric-wieser yeah, but the new path that is taken with this never tries calling convert on the(now 3D) array (which was a ValueError before). In any case, while I would prefer this change, and I think pandas should probably be OK with it. I do not feel like pushing something that might cause an avalanche a few days before the release.

I have opened pandas issue at pandas-dev/pandas#29978

@mattip
Copy link
Member
mattip commented Jan 29, 2020

@seberg ping

@seberg seberg force-pushed the array-like-coercion branch 2 times, most recently from 497c36a to 50612a5 Compare January 29, 2020 23:01
@seberg seberg removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jan 29, 2020
@seberg
Copy link
Member Author
seberg commented Jan 29, 2020

OK, updated and release notes added. Maybe we should just try and see where it brings us with respect to pandas? I assume they are the largest "users" of this behaviour, and it seemed like all of the test failures they have is nonsense behaviour that they should be deprecating anyway.

In the end, it is hard to test unless someone from the pandas side is willing to get this branch, so I would be for moving forward and reverting if necessary?

@jbrockmendel
Copy link
Contributor

it is hard to test unless someone from the pandas side is willing to get this branch

I'll pull this and run our npdev build against it now.

@seberg seberg force-pushed the array-like-coercion branch from 766b046 to d6ba4fb Compare January 30, 2020 00:18
@seberg
Copy link
Member Author
seberg commented Jan 30, 2020

Did some style fixups and comment expansion (sorry, I did not mean to, but my muscle reflexes force pushed). The refcounting was correct, since it actually is a borrowed reference until the end of that function (public API, cannot be changed, I added a comment, which may be a bit long but...).

@jbrockmendel thanks for checking! Good to know that you think those pandas tests can be simply fixed up and should not be a problem.

@seberg seberg force-pushed the array-like-coercion branch from d6ba4fb to 17cff3b Compare January 30, 2020 18:29
@seberg
Copy link
Member Author
seberg commented Jan 30, 2020

Release notes pushed (forgot to add it to the commit).

@mattip mattip requested a review from eric-wieser February 1, 2020 19:46
@seberg seberg force-pushed the array-like-coercion branch from 17cff3b to 9373cf3 Compare February 6, 2020 01:40
``__array__`` was previously not used during dimension discovery,
while bein gused during dtype discovery (if dtype is not given),
as well as during filling of the resulting array.

This would lead to inconsistencies with respect to array likes
that have a shape including a 0 (typically as first dimension).
Thus a shape of ``(0, 1, 1)`` would be found as ``(0,)`` because
a nested list/sequence cannot represent empty shapes, except 1-D.

This uses the `_array_from_array_like` function, which means that
some coercions may be tiny bit slower, at the gain of removing
a lot of complex code.

(this also reverts commit d0d250a
or the related change).

This is a continuation of work by Sergei Lebedev in numpygh-13663
which had to be reverted due to problems with Pandas, and the
general inconsistency. This version may not resolve all issues
with pandas, but does resolve the inconsistency.

Closes numpygh-13958
@seberg seberg force-pushed the array-like-coercion branch from 9373cf3 to 48dbe84 Compare February 6, 2020 01:46
@seberg
Copy link
Member Author
seberg commented Feb 6, 2020

Fixed the merge conflict, and tweaked the release note a bit. Do we have to discuss whether this should go in as a bug fix. If this is changes behaviour for some (arguably buggy code), I feel that in almost all cases it should lead to quick crashes, or at least very obvious wrong results.

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Feb 26, 2020
@mattip mattip added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Feb 26, 2020
@mattip mattip merged commit 40fad32 into numpy:master Feb 26, 2020
@mattip
Copy link
Member
mattip commented Feb 26, 2020

Thanks @seberg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 58 - Ready for review triaged Issue/PR that was discussed in a triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: Change in behaviour for np.array() on class that defines __array__
5 participants
0