10000 ENH: create and use indexed inner loops by mattip · Pull Request #23136 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: create and use indexed inner loops #23136

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Feb 7, 2023
Merged

Conversation

mattip
Copy link
Member
@mattip mattip commented Jan 31, 2023
  • Add a low level loop for add/subtract/multiply/divide that takes three arguments:

    • in/out array a
    • index array indx
    • values val
      It will calculate a[index] = a[indx] <op> value
  • Add a new slot for indexed loops on the ArrayMethodObject, and allow the loop to appear in the slots in the spec. This updates the EXPERIMENTAL_DTYPE_API value

  • Add codegen for the loops. This extends the current codegen with a new indexed kwarg, and generates C code to search the ufunc for the correct info tuple, pulls out the ArrayMethodObject and adds the loop to the slot.

  • Add a "fastest" path to ufunc_at for trivial 1d arguments to use the new loop, much like the regular contiguous loops.

  • Update the __EXPERIMENTAL_DTYPE_VERSION.

No tests were added Appropriate tests were added. The bench_ufunc.At.time_sum_at benchmarks got about 7x faster. Some other benchmarks' times changed, I am not sure why. The parameterized bench_strings.StringComparisons with 3 * 2 * 2 * 6 = 74 benchmarks went all over: some are 1.8x slower, some unchanged, some 2x faster. Same thing with the parameterized bench_ufunc_strides.Unary.time_ufunc. When I run the benchmarks again, these two dominate the long list of changes (see mattip#89 for the complete list), but the parameters at the top and bottom change.

Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice to make bincount obsolete! I was intrigued how you went about it, so went through and left some comments along the way. Mostly just nitpicks, since this is all quite clear. Very nice!

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, I do somewhat suspect we may want to morph the exact way we do this API wise. For example, this can capture item setting arr[indx] = values but not item getting arr[indx].

OTOH, half the point of the whole organization now, is that we can just drop the slot and replace it with something else relatively easily in the future. So this seems straight forward enough to accept that we may want to morph it. (Unless we have a clear idea right now.)

@mattip mattip mentioned this pull request Feb 1, 2023
@mattip
Copy link
Member Author
mattip commented Feb 1, 2023

These issues #11156, #7998, #5922, #8495 #9397 are related. I am a bit hesitant to close them since some things are missing here:

  • you cannot use out with ufunc.at. I think it would be not too hard to extend ufunc.at to use out, but then behavior would change when index values are repeated: think of a[indx[i]] += v[i] vs. b = a[indx[i]] + v
  • 2d and more contiguous arrays could be sped up by calling the indexed loop on the fastest dimension. Currently the indexed loop will only be called for 1d arrays
  • add more indexed loops, for instance maximum, minimum might be candidates.
  • Add documentation on how user-defined loops can utilize this

Edit: I misunderstood the request for out, and as far as I can tell ufunc.add.at covers that use case.

@seberg
Copy link
Member
seberg commented Feb 1, 2023
  • you cannot use out with ufunc.at. I think it would be not too hard to extend ufunc.at to use out, but then behavior would change when index values are repeated: think of a[indx[i]] += v[i] vs. b = a[indx[i]] + v

Yeah, as you edited, this isn't really useful, and effectively covered considering that ufunc.at is always in-place.

  • 2d and more contiguous arrays could be sped up by calling the indexed loop on the fastest dimension. Currently the indexed loop will only be called for 1d arrays

There are two use-cases here, I think:

  1. There are multiple dimensions and as many indices -> calculate a single indexing array that is valid for all (should we actually use byte offsets?)
  2. 2-D case, but only a 1-D index: A normal ufunc call can be used on each of the slices. arr[i, ...] += values[i, ...]

Once this goes in, I suspect it makes sense to close all those issues and maybe opening a single new one listing a few improvements that could be made around ufunc.at. For me by far the most interesting would be unification with advanced indexing (which means casts/copies).

Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments, probably OK to ignore all...

Copy link
Member
@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one issue I spotted.

Coming from zero it would be a little confusing to me that the indexes are inside the strides passed into the loop, so some sort of documentation for that would be nice, but can wait on a bigger docs cleanup for dtype authors.

@mattip
Copy link
Member Author
mattip commented Feb 1, 2023

the indexes are inside the strides passed into the loop

For np.add.at(a, indx, val), the inner loop is called with (pseudocode)

loop({a.data, indx.data, val.data}, //args
     {indx.shape[0]}, //dimensions 
     {a.strides[0], indx.strides[0], val.strides[0]}) //steps

Does that help clear up the idea or does it make it more obscure?

@ngoldbaum
Copy link
Member

Does that help clear up the idea or does it make it more obscure?

Yes, much clearer!

@mhvk
Copy link
Contributor
mhvk commented Feb 1, 2023

@ngoldbaum - you probably have seen it already, and it doesn't take away from your comment that it is good to have things documented in the code, but for me the write-your-own-ufunc documentation was super helpful: https://numpy.org/doc/stable/user/c-info.ufunc-tutorial.html

(@seberg - I guess it would be good to update that once the new API becomes more set in stone...)

@mattip
Copy link
Member Author
mattip commented Feb 5, 2023

There is something to be said for remaining consistent

OK, convinced me :)

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, finally had a bit of a closer look through the new code. Two things I would like to see fixed:

  1. The current code looks to have two smaller (but one maybe a bit more dangerous) issues if it was ever used with a single operand. It would be nice to fix or at least assert them (but if we prepare for making it public API it probably should be fixed).
    The worry here is an N-D index array, while the trivial loop assumes the iteration is 1-D (implicitly, since it assumes there is no outer iteration). It might make sense to check that the outer iteration size is 1. In principle, we could actually do the outer iteration here! That would even set the stage for many more improvements.

  2. That INCREF there is deeply fishy. It isn't your issue as such, but I would like it to be cleared up, and I would not be surprised if the new changes didn't make things work (i.e. that we are even leaking the whole array now in the fast path).

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks on the release note, it looks OK to me though. I think you were planning to adding a single operand test? That would be nice.

But, overall, I think it is in a good state now, and I would be happy to merge it.

@mattip
Copy link
Member Author
mattip commented Feb 7, 2023

I added a unary ufunc test, which exposed a logic failure when selecting the indexed loop. Note this requires the experimental API, so I turned it on in the pytest conftest.py.

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small suggestions, I am OK with adding the test like this and running our tests with the experimental version enabled.

I hope the suggestion will also fix the test failure caused by non-exact dtype match probably (no promotion kicks in here).

@mattip mattip force-pushed the indexed-loopos branch 3 times, most recently from 04746b0 to 2bbd233 Compare February 7, 2023 18:52
@seberg
Copy link
Member
seberg commented Feb 7, 2023

Hmmmpf, what's going on on azure windows here? Did we get overrun by some environment change?

@mattip
Copy link
Member Author
mattip commented Feb 7, 2023

MSVC does not like this form of struct initialization in C code

    PyArrayMethod_Spec spec = {};
    spec.name = "negative_indexed_loop";
    spec.nin = 1;
    spec.nout = 1;
    spec.dtypes = dtypes;
    spec.slots = slots;
    spec.flags = NPY_METH_NO_FLOATINGPOINT_ERRORS;

But designated initialization works

    PyArrayMethod_Spec spec = {
        .name = "negative_indexed_loop",
        .nin = 1,
        .nout = 1,
        .dtypes = dtypes,
        .slots = slots,
        .flags = NPY_METH_NO_FLOATINGPOINT_ERRORS
    };

The first form I think only works in MSVC C++

@seberg
Copy link
Member
seberg commented Feb 7, 2023

Ohh, OK. I guess it might work if you just give a single element (or even nothing, you are setting anything relevant anyway). But all good then, the error looked a bit strange :).

@seberg
Copy link
Member
seberg commented Feb 7, 2023

Ok, thanks Matti, lets give it a shot. I will note that I might consider morphing the API, but I doubt I will actually get to thinking about it ;). (mainly because I would really love to see this approach to be adopted towards advanced indexing.)

The speedup is certainly amazing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0