ufunc.at performance >10x too slow #5922
Intrigued, I tried direct numpy examples, which confirm that […]
p.s. I confirmed both give identical results.
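The direct examples themselves were elided above; the following is only a hedged reconstruction of the kind of comparison being described (the array sizes and the use of `np.bincount` are assumptions based on the rest of the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.integers(0, 1000, size=100_000)
vals = rng.random(100_000)

# Fancy-indexed in-place add: repeated indices are applied only once,
# so this is fast but NOT a true accumulation.
a1 = np.zeros(1000)
a1[idx] += vals

# ufunc.at: accumulates every occurrence of each index (correct, but slow).
a2 = np.zeros(1000)
np.add.at(a2, idx, vals)

# np.bincount: same result as np.add.at for plain addition, typically much faster.
a3 = np.bincount(idx, weights=vals, minlength=1000)

print(np.allclose(a2, a3))  # True: these two give identical results
print(np.allclose(a1, a2))  # generally False when indices repeat
```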
+1 for fixing this. Would be great to have ufunc.at working with decent speed!
It's what you pay for flexibility: I looked into this some time ago and ended up giving up: in all its generality, there's no easy way out of the current slowness of `ufunc.at`. I can try to look into it once more, and report back with what, if anything, would be easy to speed up through specialization, e.g. if all inputs are 1-D, then we may be able to streamline things...
It may make sense to implement some specialized loops for common cases like mean, std etc. (if not done already). My C implementation is basically also just dispatching on the function's name and then choosing the corresponding inner loop. The generalization over different data types is done by scipy.weave there, but it should also be possible to precompile it for all relevant combinations.
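A rough Python sketch of the dispatch-on-function-name idea (names and structure are illustrative only, not the commenter's C/weave code):

```python
import numpy as np

def group_reduce(values, idx, func="sum", size=None):
    """Dispatch on the function name and pick a specialized inner loop."""
    size = int(idx.max()) + 1 if size is None else size
    counts = np.bincount(idx, minlength=size)
    sums = np.bincount(idx, weights=values, minlength=size)
    if func == "sum":
        return sums
    if func == "mean":
        return sums / np.maximum(counts, 1)
    if func == "std":
        sq = np.bincount(idx, weights=values * values, minlength=size)
        mean = sums / np.maximum(counts, 1)
        return np.sqrt(sq / np.maximum(counts, 1) - mean**2)
    raise NotImplementedError(f"no specialized loop for {func!r}")
```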
Surely it's easy to go from 1d to nd, because you can […]. As to the question of types, would it not be worth upgrading the types if it makes the computation go a lot faster? Then maybe in the long term, aim to reduce the extent of redundant upgrading. The reality of implementing this may be messier than I imagine, but the potential benefits are clearly pretty high.
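The truncated sentence above presumably refers to flattening n-D indices down to the 1-D case; a hedged sketch of that reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
out_shape = (50, 40)
rows = rng.integers(0, 50, size=10_000)
cols = rng.integers(0, 40, size=10_000)
vals = rng.random(10_000)

# Flatten the n-D indices so the accumulation becomes a 1-D problem.
flat_idx = np.ravel_multi_index((rows, cols), out_shape)
out = np.bincount(flat_idx, weights=vals, minlength=50 * 40).reshape(out_shape)

# Same result as the n-D ufunc.at call.
check = np.zeros(out_shape)
np.add.at(check, (rows, cols), vals)
assert np.allclose(out, check)
```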
I revisited this last night, here are my findings... The way […]
So in summary, we could maybe squeeze out a 5x performance improvement for some of the more typical use cases by being smart about making copies of full arrays to avoid one-at-a-time casting into buffers. I have a few other ideas I will not go into right now that perhaps could squeeze out up to an additional 2x. The point is that the resulting function is still going to be some 10x slower than what one would expect from using […].

If anyone wants to try it out, this is the branch I have been playing around with: https://github.com/jaimefrio/numpy/tree/ufunc_at_fast

Thoughts are welcome, but I am not 100% sure that the improvements justify the extra complexity. They probably do, but...
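A user-level illustration of the "cast whole operands up front instead of element-at-a-time into buffers" idea (the linked branch does this inside the C iterator; this is only an analogy, and the observed speedup will vary):

```python
import numpy as np

a = np.zeros(1000, dtype=np.float64)
idx = np.random.randint(0, 1000, 100_000)
vals = np.random.rand(100_000).astype(np.float32)  # dtype mismatch forces casting

# Element-at-a-time casting into buffers happens inside this call:
np.add.at(a, idx, vals)

# Casting the whole operand once up front avoids the per-element buffered casts:
np.add.at(a, idx, vals.astype(a.dtype))
```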
Sounds good, I always knew it was slow due to the buffering. The other optimization would be not to use the MapIter functions interface, but to do the iteration at a low level (and possibly specialized).

EDIT: There is one-at-a-time casting, but it is done in a very inefficient way and would likely be relatively easy to speed up.
Yes, one possibility that is very easy to specialize is the "all three items are 1-D arrays" case. Although even without any of the iterator overhead, we are still going to have two function calls (the one to […]).

Related to this, the other possible optimization I had in mind, if the arrays are aligned, is handing 2 items at a time to the innerloop function, by passing it the first data pointer, and strides calculated to send it to the next one.

What do you have in mind for the casting speed-up?
The casting is currently done using an NpyIter-based hack; I think you could write it using the dtype casting functions, which should be much faster. I think I would have to look again at the new mapiter structure to say more. In the case of a non-empty subspace (i.e. slicing involved), the casting can possibly be moved into the mapiter part (if you do not use the public API).

I am not sure how you would approach that "all three items are 1-D" optimization. I think one method would be to look at the 3 or so branches that MapIter uses and basically do all of those branches as well.
Update: @ml31415 and I have created a package called numpy-groupies. The full list of benchmarks now looks like this: […]

Here we are using […]

Note that some of the […]
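For reference, a minimal usage sketch of the package (assuming it is the numpy-groupies package linked in the cross-reference below; the exact call signature may differ between versions):

```python
import numpy as np
import numpy_groupies as npg  # pip install numpy_groupies

group_idx = np.array([0, 0, 2, 2, 2, 1])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Group-wise sum; other funcs such as 'mean', 'max', 'std' are also supported.
print(npg.aggregate(group_idx, vals, func="sum"))  # -> [ 3.  6. 12.]
```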
…tween `a[indices] += 1` and `np.add.at(a,indices,1)` and the trick of using `np.bincount`, which is faster than `ufunc.at`. See: ml31415/numpy-groupies#1 and numpy/numpy#5922
@jaimefrio any news on this? I guess your 5x improvements would be well worth integrating!
@ml31415 I don't think doing this is impossibly complicated, but doing it does require quite a bit of time and spare brain cycles, which unfortunately a lot of us are notoriously short on... And then there are often other things higher on our (personal) priority lists.
Just for reference, fastfunc (a small pybind11 project of mine) speeds things up by a factor of about 40.
Only if […]
Another hint: if you use `np.bincount` instead of `np.add.at`, it is considerably faster:

```python
import perfplot
import numpy as np

np.random.seed(0)


def np_add_at(a, i):
    out0 = np.zeros(1000)
    np.add.at(out0, i, a)
    return out0


def np_bincount(a, i):
    return np.bincount(i, weights=a, minlength=1000)


b = perfplot.bench(
    setup=lambda n: (np.random.rand(n), np.random.randint(0, 1000, n)),
    kernels=[np_add_at, np_bincount],
    n_range=[2**k for k in range(24)],
    title=f"NumPy {np.__version__}",
)
b.save("out.png")
b.show()
```
Anecdotally I've found it ~6x faster to manually loop rather than use add.at (not a completely elementwise loop).
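One common shape of such a "not completely elementwise" manual approach is to sort by index and reduce each contiguous run; this is only a guess at what the commenter meant, not their actual code:

```python
import numpy as np

def grouped_sum(values, idx, size):
    """Sort by index, then sum each contiguous run with a single reduceat call."""
    order = np.argsort(idx, kind="stable")
    sorted_idx = idx[order]
    sorted_vals = values[order]
    # Start of every run of equal indices (assumes values is non-empty).
    starts = np.flatnonzero(np.r_[True, sorted_idx[1:] != sorted_idx[:-1]])
    out = np.zeros(size, dtype=values.dtype)
    out[sorted_idx[starts]] = np.add.reduceat(sorted_vals, starts)
    return out
```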
Closing: #23136 was merged. Please rerun the groupies benchmarks.
Redid the benchmark from here with latest main:
Wow, nice! That we have a decent amount of additional overhead is to be expected, but very cool that it levels off at the same place. (Although probably not too surprising, since this should be heavily memory-bound at large sizes.)
I have created a Matlab-like `accumarray` function which tries to squeeze as much performance from numpy as possible for a specific list of functions: `sum`, `any`, `all`, `max`, `min`, `mean`, etc. The function is available as a gist here.

There is another `accumarray` implementation available on github here, which was originally created by @ml31415, but over the last couple of days has had many of my suggestions incorporated (I have also updated my own gist in line with my recent suggestions). The original purpose of this other implementation was to use `scipy.weave` to get massively improved performance over what was assumed to be the best-possible raw-numpy version. However it appears that most of the functions can be fairly heavily optimized without `scipy.weave`. [It's not really important here, but for reference, the remaining functions which are difficult to optimize are `min`, `max`, and `prod`.]

The main point, however, is that it should be simple to optimize by just using `ufunc.at` for the relevant `ufunc` (as this is exactly what `ufunc.at` is intended for), yet `ufunc.at` gets miserable performance: about 15x slower than `scipy.weave` and 10-25x slower than a bunch of carefully written alternative numpy algorithms (where such optimized algorithms exist). Surely `ufunc.at` could be improved?

Also, and on a separate note, would there be any interest in including `accumarray` itself in numpy?

Here are some benchmarking stats produced by my function (testing and benchmarking code is included at the bottom of the gist). Note that the `baseline` times are obtained by sorting-splitting-and-looping, using the named numpy function for each group; whereas the optimised functions do some kind of handcrafted vectorised operation in most cases, except `max`, `min`, and `prod`, which use `ufunc.at`. Note also that the actual observed speedup depends on a variety of properties of the input. Here we are using `100 000` indices uniformly picked from the interval `[0, 1000)`. Specifically, about 25% of the values are `0` (for use with the bool tests), the remainder are uniformly distributed on `[-50, 25)`.
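For concreteness, a minimal sketch of the "baseline" sort-split-and-loop strategy described above (the real gist handles many more functions and edge cases):

```python
import numpy as np

def accumarray_baseline(idx, vals, func=np.sum, size=None):
    """Sort by group index, split into groups, and apply `func` to each group."""
    size = int(idx.max()) + 1 if size is None else size
    order = np.argsort(idx, kind="stable")
    sorted_idx, sorted_vals = idx[order], vals[order]
    # Boundaries between consecutive groups of equal indices.
    cuts = np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1
    groups = np.split(sorted_vals, cuts)
    out = np.zeros(size)
    out[np.unique(idx)] = [func(g) for g in groups]
    return out

# e.g. accumarray_baseline(idx, vals, func=np.max)
```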