ENH: Speedup ufunc.at when casting is not needed #22889
Going back to the original issue and repeating the test code, I found another two optimizations; the speedup is now about 6.3x.
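For reference, the kind of workload this PR targets looks roughly like the following (a sketch, since the original test snippet is not reproduced here; the array sizes and index distribution are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
out = np.zeros(1000)                       # accumulation target
idx = rng.integers(0, 1000, size=100_000)  # many repeated indices
vals = np.ones(100_000)

# np.add.at performs unbuffered in-place addition, so repeated
# indices each contribute, unlike the buffered out[idx] += vals
np.add.at(out, idx, vals)
print(out.sum())  # 100000.0
```

Because `ufunc.at` cannot use the normal buffered inner loop, per-element overhead dominates, which is why avoiding the casting iterator pays off so much here.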
@@ -537,7 +537,7 @@ NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@kind@)
         *((@type@ *)iop1) = io1;
 #endif
     }
-    else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
+    else if (dimensions[0] < 4 || !run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
short circuit the SIMD optimizations for small counts
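The rationale can be illustrated from Python (a sketch; the cutoff of 4 elements comes from the diff above, and the timings below mostly measure fixed per-call overhead, which is the same kind of cost the short circuit avoids for tiny inputs):

```python
import numpy as np
from timeit import timeit

small = np.ones(3)          # below the SIMD cutoff
large = np.ones(1_000_000)  # SIMD pays off here

t_small = timeit(lambda: small + small, number=10_000)
t_large = timeit(lambda: large + large, number=100)

# Fixed setup costs dominate for tiny arrays, so the per-element
# cost is orders of magnitude higher than for large arrays.
print(f"small per-element: {t_small / (10_000 * 3):.2e} s")
print(f"large per-element: {t_large / (100 * 1_000_000):.2e} s")
```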
Here is a complete benchmark run. Is there a way to make asv runs more stable on Ubuntu with an AMD processor? Pyperf has some directions, but are they applicable to asv? There is also this project with some hints; has anyone tried them? benchmark results
That is the good news. The bad news is that the slow-down in some benchmarks seems real:
cfa2b17 seems significant and maybe worth exploring for all the loops with SIMD logic.
Lowering the count limit to 2 improved the situation for the float add/subtract/multiply/divide template:
Turn off boost and isolate at least one full core (both logical CPUs, due to hyper-threading); that is all you need. Follow the pyperf instructions, they should fit asv as well. After isolating the CPU core during kernel load (e.g. through grub), you can run:

sudo pyperf system tune

# disable AMD boost
echo "0" | sudo tee /sys/devices/system/cpu/cpufreq/boost

Then use the asv test option:

python runtests.py -n --bench-compare parent/main bench_io -- --cpu-affinity=7
Signed-off-by: mattip <matti.picus@gmail.com>
I will have a closer look at it later; in general it looks good, since special-casing casting vs. no-casting does seem OK to me (although there may be a way to do it in one loop, see below). Eliding the wrapper seems like a very moderate speed-up even after the SIMD change? I can't say I like it and would much prefer not to do it, since it means only old-style loops get the speed-up. There could be a way to do both casting and no-cast in the same loop. The one thing is that casting of the second argument should be done in chunks (a larger buffer) rather than individually. It may be that MapIter can actually do that already with more direct access, but digging into MapIter a bit more seems like a next, additional step.
…he hot loop" This reverts commit 154c293.
Backed out that change, which on my machine made the new benchmark time go from 8.4 to 9.8 msec (+16%). The fast-path is only activated for old-style loops. Adding more logic to the loop will slow it down, so I wouldn't want to merge a casting loop with the non-casting one.
The idea of "merging" them is to move the casting into the strided loop function. So you would use the current "no cast" loop also for the casting path without modifying it at all. I do think that would be much better than what we have now, but I am happy to consider it a followup. Why is the fast-path only activated for old-style loops?
That will slow things down. The speed-ups come from:

If we add back casting, we will slow things down again. I am still not convinced I should remove the function eliding (commit 333d012), since removing it led to a 16% slowdown, which I think is significant.
What I mean would be very very roughly, something like:
That is annoying to create and would be similar to here (although that is hopefully a lot more complex than required here). I am fine with keeping the optimization, 20% is nice! I am just not a big fan of optimizing the old-style loops, when I would hope we can eventually start removing them all.
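The difference between the two paths can be seen from Python (a sketch; the dtype choices are assumptions): when the value's dtype already matches the target, no casting is needed, while a mismatched dtype currently forces the slower casting path, with identical results either way:

```python
import numpy as np

target = np.zeros(4)  # float64

# Same dtype everywhere: eligible for the no-cast fast path.
np.add.at(target, [0, 0, 1], np.float64(1.0))

# float32 values must be cast to float64 first, so this takes
# the casting (slow) path, but the result is the same.
np.add.at(target, [0, 0, 1], np.float32(1.0))

print(target)  # [4. 2. 0. 0.]
```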
The 15% comes from realizing the
Thanks Matti, generally looks good to me. I didn't quite parse all the paths for correct cleanup.
I would like to suggest not doing that "unpack the wrapped legacy loop" thing in this PR and then discuss it. There is still a lot that can be done, but right now I think the small cleanups I commented on and the 2-3 additional tests (and maybe one error check) would be good.
This code is a mess, and besides having two paths (that we need anyway) the PR seems cleaner than what we had, so I would like to move this on and then follow up with a bit more checks (e.g. for reference leaks).
int
is_generic_wrapped_legacy_loop(PyArrayMethod_StridedLoop *strided_loop) {
    return strided_loop == generic_wrapped_legacy_loop;
}
Unless you want to re-add the fast path, please remove this.
numpy/core/src/umath/ufunc_object.c
if (res != 0 && err_msg) {
    PyErr_SetString(PyExc_ValueError, err_msg);
}
(This is also in the other path; err_msg is always NULL, so this is dead code.)
numpy/core/src/umath/ufunc_object.c
int buffersize;
int errormask = 0;
int res = 0;
char *err_msg = NULL;
Seems that this was never used to begin with.
}
else {
    Py_INCREF(PyArray_DESCR(op1_array));
    array_operands[1] = new_array_op(op1_array, iter->dataptr);
    array_operands[2] = NULL;
    nop = 2;
}
Could you add a test to cover this path? I am not sure this should even happen or be allowed, but:
arr = np.ones(10, dtype=int)
np.log.at(arr, [1, 2, 0])
should do the trick (yes, it is a terrible example). I am not sure there are neat ones; it seems entirely plausible that we should just reject this path entirely (assuming we make sure the fast path is always taken when it should be).
Another (maybe saner) option would be:
import numpy as np
from numpy.testing import assert_array_equal

unaligned_arr = np.zeros(1 + 8*4, dtype="b")[1:].view(np.int64)
unaligned_arr[...] = 1
np.negative.at(unaligned_arr, [0, 1, 2, 2])
# index 2 is negated twice (unbuffered), so it returns to 1
assert_array_equal(unaligned_arr, [-1, -1, 1, 1])
int res = 0;
int nop = 0;
NpyIter_IterNextFunc *iternext;
char *err_msg = NULL;
numpy/core/src/umath/ufunc_object.c
        fast_path = 0;
    }
}
if (PyArray_DESCR(op1_array) != operation_descrs[0]) {
This can be done more precisely, but it is a bit tedious and I think it should be done together with removing the "nditer for casting" hack from the casting path.
@@ -6309,29 +6462,15 @@ ufunc_at(PyUFuncObject *ufunc, PyObject *args)
     * (e.g. `power` released the GIL but manually set an Exception).
     */
    if (res != 0 || PyErr_Occurred()) {
We shouldn't need the PyErr_Occurred() anymore, but we can follow up on that.
@@ -537,7 +538,7 @@ NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@kind@)
         *((@type@ *)iop1) = io1;
 #endif
     }
-    else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
+    else if (dimensions[0] < @count@ || !run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
I am happy to add this, if @seiko2plus doesn't like it, we can follow up also.
(force-pushed from 83fbdc6 to 11fa976)
Thanks for the careful review. I made changes as suggested.
Two more small things I noticed, but I am happy to make these follow-ups. It would be great to improve test coverage a bit more as well, but that can also be a follow-up.
Did you have a check for output dtype matching? E.g.:
arr = np.arange(3)
np.equal.at(arr, [0, 0], [0, 1])
assert arr[0] == 1
which is nonsense, but... (this test was failing already)
The new tests look great, thanks! I think a few more (and maybe splitting it up a bit) would be awesome, but that also seems like better for a follow-up.
For once the Travis error looks legitimate:

This is on s390x; is the byte order different there, and is the output being set to 1 in the wrong byte order?
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@mattip ohh! That was the test I was just asking about. The problem is probably just that we need to use the casting path here and fail to do so: we are missing the check that the output also doesn't need casts.
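The byte-order case can be exercised directly (a sketch; the point is that a non-native-endian array cannot take the no-cast fast path and must go through the casting path, which handles the byte swap):

```python
import numpy as np

# ">i8" is big-endian: non-native on x86, native on s390x.
be = np.arange(3).astype(">i8")
np.add.at(be, [0, 0, 1], 1)
print(be.tolist())  # [2, 2, 2]
```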
# reset a |
Might be nice to follow up and split this test up. It now runs all of the rest twice with the same values, since it discards the parametrization.
Let's just put this in and follow up after that. Thanks for finally starting to clean up (and speed up) these code paths! Some notes on what would be nice as follow-ups:

For larger changes there are basically two angles (the last two points): improving the casting in general, and making use of the private MapIter API to handle the additional array and possibly "unroll" the inner-loop.
We also should do a valgrind/refcount leak check eventually to test the cleanup paths more carefully.
I've updated the analysis at #5922 (comment).
Towards solving #5922. This refactors the looping code in ufunc_at into ufunc_at__slow_iter, which is the same as now, and ufunc_at__fast_iter, which avoids creating the NpyIter_AdvancedNew iterator to do casting. The exact conditions to hit the fast path are:

This speeds up the new benchmark by about 3x.

The changes are a bit messy. I started off by moving code bits around in order to isolate the code that became the "slow" function, and then copy-pasted and removed code to make the "fast" function.
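A minimal way to time the affected operation looks like this (a sketch; the benchmark added in the PR may use different sizes, dtypes, and repeat counts, and the `bench` helper is hypothetical):

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
vals = rng.random(100_000)
idx = rng.integers(0, 1000, size=100_000)

def bench():
    out = np.zeros(1000)
    np.add.at(out, idx, vals)  # the scatter-add being optimized
    return out

elapsed = timeit(bench, number=20)
print(f"20 runs of np.add.at: {elapsed:.3f} s")
```

Comparing this number before and after the PR is how the ~3x figure above was measured.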