ENH: make ufunc.at faster for more cases · Issue #23176 · numpy/numpy · GitHub
Open
mattip opened this issue Feb 8, 2023 · 6 comments
@mattip
Member
mattip commented Feb 8, 2023

PRs #23136 and #22889 created fast paths through ufunc.at for 1d aligned arrays with no casting. I closed issues #11156, #7998, #5922, and #8495, as mentioned in this comment, and opened this issue to continue the discussion around additional speedups and concerns:

  • 2d and higher-dimensional contiguous arrays could be sped up by calling the indexed loop on the fastest-varying dimension. Currently the indexed loop is only called for 1d arrays
  • Add documentation on how user-defined loops can utilize this. The PR added a test that uses the experimental DType API to create a ufunc type with both a strided loop and an indexed loop. Is that the desired API?
  • https://github.com/ml31415/numpy-groupies has a number of specific ufunc-at loops. Is there more we could integrate from there? @nschloe thoughts?
  • Are there still cases where bincount is faster than ufunc.add.at?
  • Feature request: Incremental bincount and axis argument #9397 asks for an axis argument to bincount. ufunc.at does not support the axis kwarg, so I left that issue open.
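The bincount question in the list above can be explored empirically. A minimal sketch (the sizes and seed here are arbitrary choices, not from the issue): both np.bincount with weights and np.add.at perform the same scatter-add, so they can be timed against each other and cross-checked for correctness.

```python
import numpy as np

# np.bincount(idx, weights=...) and np.add.at(out, idx, ...) compute the
# same histogram-style accumulation; comparing them answers the
# "are there still cases where bincount is faster?" question.
rng = np.random.default_rng(0)
idx = rng.integers(0, 1000, size=100_000)
weights = rng.random(100_000)

out = np.zeros(1000)
np.add.at(out, idx, weights)  # scatter-add via ufunc.at

via_bincount = np.bincount(idx, weights=weights, minlength=1000)

assert np.allclose(out, via_bincount)  # both produce the same result
```

Wrapping the two calls in %timeit (as in the comments below) then gives the speed comparison for any shape/dtype combination of interest.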
@seberg
Member
seberg commented Feb 8, 2023

Another thing that I would find absolutely amazing is if we could make "indexed" ufunc loops and advanced indexing be basically the same thing (i.e. reuse this machinery there).

I do feel that should be plausible. Some more ideas of how things might be handled generally:

  1. advanced indexing uses an "extra op" for the second array (if it exists), from which it assigns. We could use that machinery to normalize the second operand generally (this might add a bit of overhead when the second op is already 1-D and contiguous).
  2. I could imagine adding a buffering layer to deal with multiple indices. I.e. we add one call that (buffered) checks that the indices are valid and calculates a single new offset to be passed to the "indexed" loop. This would disentangle the two advanced-indexing implementations and, in principle, make fast paths available for more complicated DTypes.
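The attraction of unifying the two mechanisms is visible from the user side. A small sketch of the semantic difference today: advanced-indexing assignment is a get/operate/set round trip that silently drops repeated indices, while np.add.at accumulates every occurrence.

```python
import numpy as np

# `a[idx] += 1` reads a[idx], adds 1, and writes back, so a repeated
# index is only incremented once; np.add.at applies the ufunc once
# per index entry, accumulating repeats.
a = np.zeros(4)
idx = np.array([1, 1, 2])

a[idx] += 1           # buffered round trip: index 1 incremented once
assert list(a) == [0.0, 1.0, 1.0, 0.0]

b = np.zeros(4)
np.add.at(b, idx, 1)  # unbuffered accumulation: index 1 incremented twice
assert list(b) == [0.0, 2.0, 1.0, 0.0]
```

Sharing the indexed-loop machinery between the two would let the fancy-indexing fast paths and the ufunc.at fast paths be maintained in one place.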

@mattip
Member Author
mattip commented Feb 8, 2023

More suggestions:

  • extend the dtypes covered to complex. The currently implemented indexed loops cover only integer and non-complex float types
  • use SIMD in the indexed loops. The current implementation uses no SIMD intrinsics
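For the complex case, correctness is already there; per the bullet above only the indexed fast path is missing, so such calls fall back to the generic (slower) inner loop. A quick sketch of the behavior being discussed:

```python
import numpy as np

# np.add.at works on complex arrays today; it just cannot yet take the
# indexed fast path, so it goes through the generic inner loop.
r = np.zeros(4, dtype=np.complex128)
np.add.at(r, [0, 0, 3], 1 + 2j)  # index 0 hit twice

assert r[0] == 2 + 4j
assert r[3] == 1 + 2j
assert r[1] == 0 and r[2] == 0
```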

@mhvk
Contributor
mhvk commented Feb 8, 2023

Also speed up the case where the values are scalar (currently, a 1D single-element array is fast, but a Python scalar or 0-d array is not):

r = np.zeros(2000)
i = np.repeat(np.arange(999, 1, -1, dtype=np.intp), 100)

In [23]: %timeit np.add.at(r, i, 1)
8.37 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %timeit np.add.at(r, i, 1.)
1.5 ms ± 5.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [25]: %timeit np.add.at(r, i, np.ones(()))
1.51 ms ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [26]: %timeit np.add.at(r, i, np.ones(1))
<magic-timeit>:1: RuntimeWarning: overflow encountered in add
321 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(Also not sure where the overflow warning comes from!)

@mattip
Member Author
mattip commented Feb 8, 2023

Maybe not obvious from the examples, but the "fastest" optimization only kicks in when:

  • the values are 1d (which rules out np.ones(()), 1, and 1., all of which are 0d)
  • no casting is needed (which rules out the integer 1)

The "fast" optimization does not require 1d values, but does require no casting, which is why the second and third cases take 1.5 ms. The first case requires casting, so it takes the slower path.

The overflow in addition is suspicious: r.dtype should be float64 and i has 100 repeats of the integers 999 to 2, so each run of np.add.at should increment r[2:1000] by 100. According to the reports, there are (7 * 100 + 7 * 1000 + 7 * 1000 + 7 * 1000) or 21_700 runs, so I would expect r[2:1000] to be 2_170_000 which should not overflow. Maybe you did something to r beforehand?
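The arithmetic in that paragraph can be sanity-checked directly: i repeats each of the integers 999 down to 2 a hundred times, so a single np.add.at call increments each element of r[2:1000] by exactly 100, and 21,700 such runs reach only 2,170,000, nowhere near float64 overflow.

```python
import numpy as np

# One np.add.at run with mhvk's setup: each index in 2..999 appears
# 100 times, so each of r[2:1000] gains exactly 100.
r = np.zeros(2000)
i = np.repeat(np.arange(999, 1, -1, dtype=np.intp), 100)

np.add.at(r, i, 1.)
assert r[0] == 0 and r[1] == 0       # indices 0 and 1 never touched
assert np.all(r[2:1000] == 100.0)    # every hit index incremented 100x
assert np.all(r[1000:] == 0)
```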

@seberg
Member
seberg commented Feb 8, 2023

1d (which eliminates np.ones(()) and 1 and 1. which are 0d

To be fair, you could generalize it to 0-D. We probably should do that; it seems rather surprising that 0-D is slow there. The stride is just 0 then. Eliminating the no-casting requirement is harder, of course.

The overflow in addition is suspicious:

This is a pretty serious bug (and an oddly quiet one; it would be nicer if it just crashed)! We need to fix it. The problem is that we do allow broadcasting of the values to the index, but that doesn't work out here (it could, but we would have to set the stride to 0).
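A minimal sketch of the broadcasting case being discussed, as it should behave (and does with the fix in #23179): a length-1 values array broadcasts against the indices, conceptually a stride-0 second operand.

```python
import numpy as np

# A single-element values array should broadcast against the indices:
# every index entry receives the one value.
r = np.zeros(5)
np.add.at(r, [1, 1, 2], np.ones(1))  # one value, three index entries

assert r[1] == 2.0   # index 1 hit twice
assert r[2] == 1.0
assert r[0] == 0.0 and r[3] == 0.0 and r[4] == 0.0
```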

@mhvk
Contributor
mhvk commented Feb 8, 2023

See #23179 for the fix for scalar values (and allowing them all) and #23181 for complex indexed loops.
