ENH: Rewrite of array-coercion to support new dtypes #16200
Conversation
Force-pushed from 9e95bc2 to 60fc68d.
Would it be possible to somehow diagram the overall view of how coercion works?
```diff
-    'PyArrayDescr_Type': (3,),
+    # Internally, PyArrayDescr_Type is a PyArray_DTypeMeta,
+    # the following also defines PyArrayDescr_TypeFull (Full appended)
+    'PyArrayDescr_Type': (3, "PyArray_DTypeMeta"),
```
Does this change the function signature? The comment about PyArrayDescr_TypeFull seems out of place: that symbol does not seem to be exported.
Sorry, this change is from a merge conflict, or accidental.
```c
#endif  /* NPY_INTERNAL_BUILD */
```
Can this be moved to a non-public header?
Hmm, this is hopefully the one/only thing still missing. Those abstract dtypes are fairly obvious; I forgot to mention them in the chat, although it's mentioned in the NEP.
Do you want to pull more than those?
@mattip, I tried a bit. The problem is that PyArray_Descr is a DTypeMeta internally (and has to be, I think), so basically this mirrors what happens in the auto-generated headers (I believe), and I do not see how to get around it.
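For intuition, the relationship is observable from Python. A minimal probe, assuming NumPy ≥ 1.20 (where the DTypeMeta machinery landed):

```python
import numpy as np

descr = np.dtype(np.float64)
print(isinstance(descr, np.dtype))        # True: descriptors are np.dtype instances
print(issubclass(type(descr), np.dtype))  # True: the descriptor's class is (a subclass
                                          # of) np.dtype, mirroring PyArray_DTypeMeta in C
```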
Force-pushed from b04278c to c5d3d5b.
@mattip, @anirudh2290 and anyone else interested. Would Monday 9am PDT be good to discuss/walk through some of the changes/code here? I can send out an invite later.
That works for me. Could you point to the relevant parts of NEPs or other design documents? Don't go to any great effort to create/edit them.
I would be interested. :)
That works for me too. Thanks @seberg !
@seberg count me in too
OK, it's set then. I have to check which link is best to use, so I will send that out later. @mattip https://numpy.org/neps/nep-0042-new-dtypes.html#coercion-to-and-from-python-objects and the section after that apply; a quick read of it or the pseudo-code tries may make sense.
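As a companion to the NEP's pseudo-code, here is a rough, illustrative Python sketch of the recursive dtype-and-shape discovery this PR implements in C. The function name and structure are illustrative only, not NumPy API:

```python
import numpy as np

def discover_dtype_and_shape(obj, max_depth=32):
    """Recursively find a common dtype and shape for a nested sequence.

    Illustrative only: the real C code asks each DType to recognize its
    "known scalars" and caches PySequence_Fast results instead of re-listing.
    """
    if max_depth == 0 or not isinstance(obj, (list, tuple)):
        # Treat as a scalar; np.result_type stands in for the per-DType
        # scalar-discovery step.
        return np.result_type(obj), ()
    if len(obj) == 0:
        return np.dtype(np.float64), (0,)  # empty input gets the default dtype
    dtypes, shapes = zip(*(discover_dtype_and_shape(item, max_depth - 1)
                           for item in obj))
    if len(set(shapes)) != 1:
        # Ragged nesting: fall back to object dtype at this level.
        return np.dtype(object), (len(obj),)
    return np.result_type(*dtypes), (len(obj),) + shapes[0]

print(discover_dtype_and_shape([[1, 2], [3.0, 4]]))
# (dtype('float64'), (2, 2))
```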
Just FYI, I am battling with a segfault found by the pandas test suite (possibly datetime related, but I am not sure yet), and on the way there found a few smaller things. The tests should run through once I push again... Once I find the pandas issue I will remove the draft marker; I do think it's fine to look at it anyway. UPDATE: OK, I think I found the pandas issue; it seemed like the tests had another problem right now, and if that is so I have to track it down tomorrow (by that time hopefully the pandas test suite is done, and a new leak-check run may have found some other smaller issues). Still need to look into simplifying the "requested descriptor" logic into a flag.
Force-pushed from 406a173 to afd99f6.
OK, marking as ready. New tests are mainly in the separate PR; I will look into whether some new benchmarks make sense, just for documentation basically. We may or may not need release notes here. There may be a reference leak left somewhere, but my partial run indicates that it was fixed with the last set of changes.
I ran the astropy test suite (pandas seemed pretty much fine); the (new) failures it found are:
The first failure makes me wonder whether the old code converted [...]. The next two "index error" ones are because a [...]. @bsipocz do you happen to know quickly whether these are problematic? Full details below:
@mhvk, I thought I could avoid pinging you, but there is an actual issue here around quantities. Quantities raise an error for [...], i.e. [...].
This change-set currently does not affect assignment. So, I guess the question is: how much of an issue do you think this is? I will try to think of work-arounds, but right now I am not sure there are any easy ones. I mean, the float dtype could recognize 0-d arrays as "known scalars", but that is even more of a band-aid to catch some of the more common cases :/.
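To make the failure mode concrete, here is a hypothetical stand-in for a unit-carrying Quantity. This class and its behavior are illustrative, not astropy's implementation:

```python
import numpy as np

class Quantity(np.ndarray):
    """Toy ndarray subclass that refuses plain-float conversion."""
    def __float__(self):
        raise TypeError("only dimensionless quantities can be converted to float")

q = np.array(1.5).view(Quantity)   # a 0-d "quantity"
print(np.asarray(q))               # fine: goes through the array protocol
float(q)                           # raises TypeError here: any coercion path that
                                   # calls __float__ on such objects will fail
```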
@seberg - I can try to have a look at this later today, but indeed Marten is the one on the astropy side whom I would ping when stuck with this.
@seberg - don't worry about pinging me about astropy failures. Often, they may be related to work-arounds that we would gladly drop! (and we do numpy version checks if needed). Just to be sure, you are wondering about the last test failure, right?
Yes, that is the one that concerns me most; the other ones seemed similar to general issues around [...].
Indeed, the first test is because we're trying to assign [...]: [...] no longer gives a warning and returns 0 instead of [...].
Ohh, the masked scalar is a float64 subclass of an array, making it weird, and running into the same issue as the last quantity one, I guess. Hmmmmm...
Could also just delete this path entirely, but thought for now to keep it around.
Unfortunately one test had to be adapted (I missed the difference between timedelta64 and datetime64), and one test which I thought failed only for one reason also failed for another reason which is not fully gone in master.
Also add some more documentation for PyArray_DiscoverDTypeAndShape_Recursive
Co-authored-by: Anirudh Subramanian <anirudh2290@apache.org>
…hen string is too short this retains (and slightly expands) the old behaviour. It is inconsistent and I think we should fix that inconsistency one way or the other (I honestly do not care *which* way).
Co-authored-by: Anirudh Subramanian <anirudh2290@apache.org>
The rational test still fails, but the xfail message was wrong or confusing.
Co-authored-by: Hameer Abbasi <einstein.edison@gmail.com>
As asked for by Hameer: the comment was outdated, so I had to change it completely instead of fixing one word.
Hmmm, the - are improved speeds; most slowdowns should specifically not come from that. A bit more impact than I thought on the casting, but 🤷, at least half of that should go away again eventually; whether or not we want to optimize the cast again is something we can discuss... EDIT: The first version had a typo, leading to no actual casting in the cast case.
astype is similar to array coercion, and is also compared in many of the tests below.
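For a rough sense of what such a comparison measures, a quick timeit sketch (not the PR's actual asv benchmarks; sizes and repeat counts are arbitrary):

```python
import timeit
import numpy as np

lst = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# Coercion: discover dtype/shape from a Python list, then copy elements.
print(timeit.timeit(lambda: np.array(lst), number=10_000))
# astype: similar element-wise cast work, but dtype and shape are already known.
print(timeit.timeit(lambda: arr.astype(np.float64), number=10_000))
```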
Force-pushed from f4a75e6 to 22ee971.
> Hmmm, the - are improved speeds; most slowdowns should specifically not come from that.

Oops, I was misinterpreting the ratio column. I think if the performance hit is only for the scalar indexing and assignment case, and it fixes inconsistent behavior, it makes sense to move ahead with this PR.
LGTM. Should we just merge this as 57 commits or squash it into a logical set of commits?
Normally I would squash down liberally; here, I dunno. Guess I will cut it down... I will probably swing to the other extreme of a few commits, but then the history is not super interesting here.
Hmmmpf, I tried to recommit it by file rather than time; that doesn't really work, there are too many deletions and slight modifications. So it would be an extremely intricate job to do it by both time and file... But even that will hardly get something compiling without problems. And then the tests will only pass with the full chunk anyway, so it's not like there are clear intermediate steps which are cleanly bisectable... So I can squash a few commits with sillier commit messages together if we want some notion of history. If we want "functional chunks", I give up; might as well do a single big commit.
Thanks @seberg, let's just put it in as-is.
This seems to have introduced #17173.
This is in a draft state and probably still needs a lot of cleanup. It is based on top of the creation of DTypeMeta classes. There are a bunch of choices, and the non-trivial ones are listed below. This is a draft/prototype; I am putting this out there more in case anyone is interested in what I am experimenting with, or has some feedback on general design thoughts! There are (probably) a lot of places that should be refactored and comments that need updating. (That said, I do see many of the basic ideas – e.g. what to ask from the DType and the use of a cache – as sound.)

It mostly tries to push off more/all logic to the DType (except special cases to support structured-void, object, and characters). It also introduces a cache... It's not really optimized much yet, and I have some open questions about how to deal with subclasses (I am tending to refuse to deal with subclasses).

Right now the scalar/python class -> DType mapping is done with a global dict. I could push this off to a subclass check on the scalar. Right now it does not matter much, since the decision is not really final. I am not sure it is the main overhead right now though...

Some notes:

- `__array_function__` overhead when dealing with a single array input as of this time, because this is not short-circuited, and thus a lot of work/overhead is incurred.
- [Caching of] `PySequence_FAST` results and `__array__` calls. This means that generally the memory overhead can be much larger, but for example `np.array(range(10))` does not have to create `list(range(10))` three times, but only once.
- `SETITEM` is nice, but not very fast when casting is involved in general. But maybe there is nothing to do about it, or casting should just get a `cast-scalar` slot to speed this up when necessary.

This is a draft; I am open to fully changing it.

Some bugs that are fixed, or potentially fixed, with this (see the runnable snippets after this list):

- `np.array([np.array("asdf", object), ], dtype="S")` will do the right thing. On the flip side, I did not (yet) implement the retry-with-string nonsense. So `np.array(["string", np.float64(12.34)])` will result in a string large enough to fit any float64. I think that is good. (Right now it depends on the order.)
- [...] `SETITEM` in my code (which float128 has at least partially). I am not sure I am willing to make it work generally; rather, I think float128 may want to error more often, and not call `__float__` unless it knows it is dealing with a Python float (or maybe a subclass of it).
- `np.array(..., dtype=np.int64)` will raise an error when converting from `np.nan`, but not when converting from `np.float64(np.nan)`...
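The bullet points above as runnable snippets (behavior as described in this PR; exact values and warnings vary across NumPy versions):

```python
import numpy as np

print(np.array([np.array("asdf", dtype=object), ], dtype="S"))
# -> array([b'asdf'], dtype='|S4'): the object scalar is converted correctly

print(np.array(["string", np.float64(12.34)]))
# -> a string dtype wide enough to hold any float64 (e.g. '<U32')

try:
    np.array(np.nan, dtype=np.int64)        # Python float NaN -> error
except ValueError as e:
    print("raised:", e)

# The inconsistency: the same value as a np.float64 scalar casts without error
# (newer NumPy versions emit a RuntimeWarning; the resulting value is undefined).
print(np.array(np.float64(np.nan), dtype=np.int64))
```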