-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH: create and use indexed inner loops #23136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice to make bincount
obsolete! I was intrigued how you went about it, so went through and left some comments along the way. Mostly just nitpicks, since this is all quite clear. Very nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, I do somewhat suspect we may want to morph the exact way we do this API wise. For example, this can capture item setting arr[indx] = values
but not item getting arr[indx]
.
OTOH, half the point of the whole organization now, is that we can just drop the slot and replace it with something else relatively easily in the future. So this seems straight forward enough to accept that we may want to morph it. (Unless we have a clear idea right now.)
These issues #11156, #7998, #5922, #8495 #9397 are related. I am a bit hesitant to close them since some things are missing here:
Edit: I misunderstood the request for |
Yeah, as you edited, this isn't really useful, and effectively covered considering that
There are two use-cases here, I think:
Once this goes in, I suspect it makes sense to close all those issues and maybe opening a single new one listing a few improvements that could be made around |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more comments, probably OK to ignore all...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one issue I spotted.
Coming from zero it would be a little confusing to me that the indexes are inside the strides passed into the loop, so some sort of documentation for that would be nice, but can wait on a bigger docs cleanup for dtype authors.
For
Does that help clear up the idea or does it make it more obscure? |
Yes, much clearer! |
@ngoldbaum - you probably have seen it already, and it doesn't take away from your comment that it is good to have things documented in the code, but for me the write-your-own-ufunc documentation was super helpful: https://numpy.org/doc/stable/user/c-info.ufunc-tutorial.html (@seberg - I guess it would be good to update that once the new API becomes more set in stone...) |
OK, convinced me :) |
ed2e7ac
to
3802b16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, finally had a bit of a closer look through the new code. Two things I would like to see fixed:
-
The current code looks to have two smaller (but one maybe a bit more dangerous) issues if it was ever used with a single operand. It would be nice to fix or at least assert them (but if we prepare for making it public API it probably should be fixed).
The worry here is an N-D index array, while the trivial loop assumes the iteration is 1-D (implicitly, since it assumes there is no outer iteration). It might make sense to check that the outer iteration size is 1. In principle, we could actually do the outer iteration here! That would even set the stage for many more improvements. -
That INCREF there is deeply fishy. It isn't your issue as such, but I would like it to be cleared up, and I would not be surprised if the new changes didn't make things work (i.e. that we are even leaking the whole array now in the fast path).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicks on the release note, it looks OK to me though. I think you were planning to adding a single operand test? That would be nice.
But, overall, I think it is in a good state now, and I would be happy to merge it.
I added a unary ufunc test, which exposed a logic failure when selecting the indexed loop. Note this requires the experimental API, so I turned it on in the pytest conftest.py. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small suggestions, I am OK with adding the test like this and running our tests with the experimental version enabled.
I hope the suggestion will also fix the test failure caused by non-exact dtype match probably (no promotion kicks in here).
04746b0
to
2bbd233
Compare
Hmmmpf, what's going on on azure windows here? Did we get overrun by some environment change? |
2bbd233
to
790426e
Compare
MSVC does not like this form of struct initialization in C code
But designated initialization works
The first form I think only works in MSVC C++ |
Ohh, OK. I guess it might work if you just give a single element (or even nothing, you are setting anything relevant anyway). But all good then, the error looked a bit strange :). |
Ok, thanks Matti, lets give it a shot. I will note that I might consider morphing the API, but I doubt I will actually get to thinking about it ;). (mainly because I would really love to see this approach to be adopted towards advanced indexing.) The speedup is certainly amazing! |
Add a low level loop for add/subtract/multiply/divide that takes three arguments:
a
indx
val
It will calculate
a[index] = a[indx] <op> value
Add a new slot for indexed loops on the
ArrayMethodObject
, and allow the loop to appear in the slots in the spec. This updates theEXPERIMENTAL_DTYPE_API
valueAdd codegen for the loops. This extends the current codegen with a new
indexed
kwarg, and generates C code to search theufunc
for the correctinfo
tuple, pulls out theArrayMethodObject
and adds the loop to the slot.Add a "fastest" path to
ufunc_at
for trivial 1d arguments to use the new loop, much like the regular contiguous loops.Update the
__EXPERIMENTAL_DTYPE_VERSION
.No tests were addedAppropriate tests were added. Thebench_ufunc.At.time_sum_at
benchmarks got about 7x faster. Some other benchmarks' times changed, I am not sure why. The parameterizedbench_strings.StringComparisons
with3 * 2 * 2 * 6 = 74
benchmarks went all over: some are 1.8x slower, some unchanged, some 2x faster. Same thing with the parameterizedbench_ufunc_strides.Unary.time_ufunc
. When I run the benchmarks again, these two dominate the long list of changes (see mattip#89 for the complete list), but the parameters at the top and bottom change.