ENH: make ufunc.at faster for more cases · Issue #23176 · numpy/numpy · GitHub
Open
mattip opened this issue Feb 8, 2023 · 6 comments
@mattip
Member
mattip commented Feb 8, 2023

PRs #23136 and #22889 created fast paths through ufunc.at for 1d aligned arrays with no casting. I closed issues #11156, #7998, #5922, and #8495, as mentioned in this comment, and opened this issue to continue the discussion around additional speedups and concerns:

  • 2d and higher-dimensional contiguous arrays could be sped up by calling the indexed loop on the fastest-varying dimension. Currently the indexed loop is only called for 1d arrays
  • Add documentation on how user-defined loops can utilize this. The PR added a test that uses the experimental DType API to create a ufunc type with both a strided loop and an indexed loop. Is that the desired API?
  • https://github.com/ml31415/numpy-groupies has a number of specific ufunc-at loops. Is there more we could integrate from there? @nschloe thoughts?
  • Are there still cases where bincount is faster than ufunc.add.at?
  • Feature request: Incremental bincount and axis argument #9397 asks for an axis argument to bincount. ufunc.at does not support the axis kwarg, so I left that issue open.
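The bincount question in the list above can be explored empirically. A minimal sketch (the sizes and seed here are arbitrary choices, not from the issue): both np.bincount with weights and np.add.at perform the same scatter-add, so they can be timed against each other and cross-checked for correctness.

```python
import numpy as np

# np.bincount(idx, weights=...) and np.add.at(out, idx, ...) compute the
# same histogram-style accumulation; comparing them answers the
# "are there still cases where bincount is faster?" question.
rng = np.random.default_rng(0)
idx = rng.integers(0, 1000, size=100_000)
weights = rng.random(100_000)

out = np.zeros(1000)
np.add.at(out, idx, weights)  # scatter-add via ufunc.at

via_bincount = np.bincount(idx, weights=weights, minlength=1000)

assert np.allclose(out, via_bincount)  # both produce the same result
```

Wrapping the two calls in %timeit (as in the comments below) then gives the speed comparison for any shape/dtype combination of interest.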
@seberg
Member
seberg commented Feb 8, 2023

Another thing that I would find absolutely amazing is if we could make "indexed" ufunc loops and advanced indexing be basically the same thing (i.e. reuse this machinery there).

I do feel that should be plausible. Some more ideas of how things might be handled generally:

  1. advanced indexing uses an "extra op" for the second array (if it exists), from which it assigns. We could use that machinery to normalize the second operand generally (this might add a bit of overhead when the second op is already 1-D and contiguous).
  2. I could imagine adding a buffering layer to deal with multiple indices. I.e. we add one call that (buffered) checks that the indices are valid and calculates a single new offset to be passed to the "indexed" loop. This would disentangle the two advanced-indexing implementations and, in principle, make fast paths available for more complicated DTypes.
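The attraction of unifying the two mechanisms is visible from the user side. A small sketch of the semantic difference today: advanced-indexing assignment is a get/operate/set round trip that silently drops repeated indices, while np.add.at accumulates every occurrence.

```python
import numpy as np

# `a[idx] += 1` reads a[idx], adds 1, and writes back, so a repeated
# index is only incremented once; np.add.at applies the ufunc once
# per index entry, accumulating repeats.
a = np.zeros(4)
idx = np.array([1, 1, 2])

a[idx] += 1           # buffered round trip: index 1 incremented once
assert list(a) == [0.0, 1.0, 1.0, 0.0]

b = np.zeros(4)
np.add.at(b, idx, 1)  # unbuffered accumulation: index 1 incremented twice
assert list(b) == [0.0, 2.0, 1.0, 0.0]
```

Sharing the indexed-loop machinery between the two would let the fancy-indexing fast paths and the ufunc.at fast paths be maintained in one place.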

@mattip
Member Author
mattip commented Feb 8, 2023

More suggestions:

  • extend the dtypes covered to complex. The currently implemented indexed loops cover only integer and non-complex float types
  • use SIMD in the indexed loops. The current implementation uses no SIMD intrinsics
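For the complex case, correctness is already there; per the bullet above only the indexed fast path is missing, so such calls fall back to the generic (slower) inner loop. A quick sketch of the behavior being discussed:

```python
import numpy as np

# np.add.at works on complex arrays today; it just cannot yet take the
# indexed fast path, so it goes through the generic inner loop.
r = np.zeros(4, dtype=np.complex128)
np.add.at(r, [0, 0, 3], 1 + 2j)  # index 0 hit twice

assert r[0] == 2 + 4j
assert r[3] == 1 + 2j
assert r[1] == 0 and r[2] == 0
```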

@mhvk
Contributor
mhvk commented Feb 8, 2023

Also speed up the case where the values are scalar (currently, a 1D single-element array is fast, but a Python scalar or 0-d array is not):

r = np.zeros(2000)
i = np.repeat(np.arange(999, 1, -1, dtype=np.intp), 100)

In [23]: %timeit np.add.at(r, i, 1)
8.37 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %timeit np.add.at(r, i, 1.)
1.5 ms ± 5.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [25]: %timeit np.add.at(r, i, np.ones(()))
1.51 ms ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [26]: %timeit np.add.at(r, i, np.ones(1))
<magic-timeit>:1: RuntimeWarning: overflow encountered in add
321 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(Also not sure where the overflow warning comes from!)

@mattip
Member Author
mattip commented Feb 8, 2023

Maybe not obvious from the examples, but the "fastest" optimization only kicks in when:

  • the values are 1d (which rules out np.ones(()), 1, and 1., all of which are 0d)
  • no casting is needed (which rules out the integer 1)

The "fast" optimization does not require 1d values, but does require no casting, which is why the second and third cases take 1.5 ms. The first case requires casting, so it takes the slower path.

The overflow in addition is suspicious: r.dtype should be float64 and i has 100 repeats of the integers 999 to 2, so each run of np.add.at should increment r[2:1000] by 100. According to the reports, there are (7 * 100 + 7 * 1000 + 7 * 1000 + 7 * 1000) or 21_700 runs, so I would expect r[2:1000] to be 2_170_000 which should not overflow. Maybe you did something to r beforehand?
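The arithmetic in that paragraph can be sanity-checked directly: i repeats each of the integers 999 down to 2 a hundred times, so a single np.add.at call increments each element of r[2:1000] by exactly 100, and 21,700 such runs reach only 2,170,000, nowhere near float64 overflow.

```python
import numpy as np

# One np.add.at run with mhvk's setup: each index in 2..999 appears
# 100 times, so each of r[2:1000] gains exactly 100.
r = np.zeros(2000)
i = np.repeat(np.arange(999, 1, -1, dtype=np.intp), 100)

np.add.at(r, i, 1.)
assert r[0] == 0 and r[1] == 0       # indices 0 and 1 never touched
assert np.all(r[2:1000] == 100.0)    # every hit index incremented 100x
assert np.all(r[1000:] == 0)
```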

@seberg
Member
seberg commented Feb 8, 2023

1d (which eliminates np.ones(()) and 1 and 1. which are 0d

To be fair, you could generalize it to 0-D. We probably should do that; it seems rather surprising that 0-D is slow there. The stride is just 0 then. Eliminating the no-casting requirement is harder, of course.

The overflow in addition is suspicious:

This is a pretty serious bug (and an oddly quiet one; it would be nicer if it just crashed)! We need to fix it. The problem is that we do allow broadcasting of the values to the index, but that doesn't work out here (it could, but we would have to set the stride to 0).
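A minimal sketch of the broadcasting case being discussed, as it should behave (and does with the fix in #23179): a length-1 values array broadcasts against the indices, conceptually a stride-0 second operand.

```python
import numpy as np

# A single-element values array should broadcast against the indices:
# every index entry receives the one value.
r = np.zeros(5)
np.add.at(r, [1, 1, 2], np.ones(1))  # one value, three index entries

assert r[1] == 2.0   # index 1 hit twice
assert r[2] == 1.0
assert r[0] == 0.0 and r[3] == 0.0 and r[4] == 0.0
```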

@mhvk
Contributor
mhvk commented Feb 8, 2023

See #23179 for the fix for scalar values (and allowing them all) and #23181 for complex indexed loops.
