ENH: Use AVX for float32 implementation of np.sin & np.cos #13368
Conversation
I realize the first 6 commits are exactly the same as in PR #13134. The only new commit is the last one, which implements sine and cosine (but it relies on the previous commits). |
For reviewers: once the other PR is merged, you can switch the base of this PR so that only the new commit shows up. |
#13134 is merged. Could you fix the conflicts/rebase/switch the base? |
(force-pushed from bcb4e61 to 283b8ca)
Thanks! It's updated now and has just the one commit. |
How do current benchmarks compare before/after this PR? |
As mentioned in the commit message, the performance gains are leveraged only if the input arguments to sine/cosine are within [-71476.0625f, 71476.0625f] for cosine and [-117435.992f, 117435.992f] for sine. I measured the benefit in three ways:
(1) micro-benchmarks of the float32 sine/cosine ufuncs;
(2) package ai.cs, which provides an API to transform spherical coordinates to Cartesian coordinates;
(3) package photutils, which provides an API to find the best fit of first and second harmonics.
|
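For context, a minimal sketch of what a float32 micro-benchmark along these lines might look like (array size, value ranges, and repeat counts here are illustrative, not the ones behind the numbers reported in this PR):

```python
# Illustrative micro-benchmark (hypothetical, not the one used in this PR):
# compares np.sin on float32 inputs inside vs. outside the range the AVX
# fast path can handle.
import timeit
import numpy as np

N = 10000
in_range = np.random.uniform(-71000.0, 71000.0, N).astype(np.float32)
out_of_range = np.random.uniform(1e6, 1e7, N).astype(np.float32)

t_in = timeit.timeit(lambda: np.sin(in_range), number=1000)
t_out = timeit.timeit(lambda: np.sin(out_of_range), number=1000)
print(f"in-range: {t_in:.3f} s, out-of-range: {t_out:.3f} s")
```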
Sorry for not being clear. I meant that I would expect a performance PR to come with changes to the benchmarks in the benchmarks/ directory.
|
As far as I could tell, those benchmarks were running the float64 version of sine and cosine. Is there a way I could run the float32 version of it? Or do they already do that? |
I will take a look at this today and let you know if I am able to run the benchmarks for float32. |
I ran the ufunc benchmarks. The thing is, the ufunc benchmarks seem to report just one metric (the sum of the total time, I assume) for a whole bunch of dtypes (int16, int32, float16, float32, float128, complex64, complex256), whereas my patch only improves performance for float32, so I am not sure those numbers are representative of this patch. I added my own benchmarks, which measure computing sine and cosine for just float32, and this is what I measured:
|
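For reference, a dedicated float32 benchmark in NumPy's asv suite could look roughly like this sketch (the class name, file placement, array size, and value range are illustrative; the actual benchmark added in this patch may differ):

```python
# Sketch of an asv benchmark limited to float32 sin/cos; names here are
# hypothetical, not necessarily those used in the actual patch.
import numpy as np

class AVXTrigFloat32:
    params = ['sin', 'cos']
    param_names = ['ufunc']

    def setup(self, ufunc):
        np.seterr(all='ignore')
        self.f = getattr(np, ufunc)
        # values inside the range the AVX fast path can handle
        self.arr = np.random.uniform(-1000.0, 1000.0, 200000).astype(np.float32)

    def time_ufunc(self, ufunc):
        self.f(self.arr)
```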
This makes |
(force-pushed from 283b8ca to 1571b0c)
Made some edits to handle NaN and added some tests as well. |
Added a commit which fixes the bug reported in #13493 for sin and cos. |
Updated with the following fixes:
|
Unused variable warnings. |
yeah, thanks :) It was in cpuid.c, just fixed it. |
Seems like a network issue caused the one failing test; it shouldn't be related to the patch itself. |
Just checking: is there any more feedback on this patch? |
@mattip @juliantaylor ping .. |
Hi @mattip, just wanted to know what you are thinking with respect to this patch. I have more performance patches I am working on but they rely on some of the commits in this patch. |
Without FMA, the outputs of the AVX2 and AVX512 versions differ. This change ensures the output remains exactly the same across implementations.
assert_array_max_ulp returns only an int here, since it compares the ULP difference between two float32 numbers.
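For illustration, the helper can be used like this to bound the float32 ULP error against a float64 reference (the maxulp bound here is just an example, not the one enforced in the patch):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

x = np.random.uniform(-100.0, 100.0, 1000).astype(np.float32)
# Compare the float32 result against the float64 reference rounded back to
# float32; raises AssertionError if any element differs by more than maxulp.
assert_array_max_ulp(np.sin(x), np.float32(np.sin(np.float64(x))), maxulp=2)
```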
Rebased onto master. The AVX functions for sin and cos pass the tests on my local Skylake machine (though they are currently disabled in the CI). |
@r-devulap what would it take to enable the tests in CI? Do we need to find a skylake machine? |
If you have a CPU newer than Haswell, then the AVX2/FMA algorithms should already be tested. SkylakeX will be needed for AVX512F. I also need a way to enable the test only on these platforms and skip it on the rest. |
@mattip ping, how do you want to proceed with this PR? Is there a way to enable the tests only on AVX machines? |
My inclination would be to put this in as-is, since you made sure the tests pass on the relevant systems. I am not sure I understand the problem: I see all the checks pass successfully. Which tests are enabled for AVX only? |
The test |
I guess you could add a function to detect the CPU capabilities at runtime. Would you like to do that as part of this PR or in a separate, follow-on one? |
that might take me some time to implement, perhaps a separate follow up PR is better? |
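For reference, such a runtime gate could look something like the sketch below (the cpu_has_avx512f helper is hypothetical; NumPy did not expose such a query at the time, and a real implementation would query CPUID from C):

```python
import pytest

def cpu_has_avx512f():
    # Hypothetical Linux-only detection via /proc/cpuinfo; a real version
    # would query CPUID from C code instead of parsing this file.
    try:
        with open('/proc/cpuinfo') as f:
            return 'avx512f' in f.read()
    except OSError:
        return False

@pytest.mark.skipif(not cpu_has_avx512f(), reason='AVX512F not available')
def test_sincos_float32_avx512():
    ...
```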
OK, let's put this in as-is, in the hope we will improve #14048's testing over time.
thanks @r-devulap |
sounds good, thanks! |
I've done the SkylakeX CI emulation testing upstream in OpenBLAS for AVX-related testing purposes: OpenMathLib/OpenBLAS#2134 Not sure that NumPy would really want to go that far though. |
This commit implements vectorized single-precision sine and cosine using
AVX2 and AVX512. Both sine and cosine are computed using polynomial
approximations which are accurate for values in [-PI/4, PI/4]. The
original input is reduced to this range using a 3-step Cody-Waite range
reduction. This method is only accurate for values in [-71476.0625f,
71476.0625f] for cosine and [-117435.992f, 117435.992f] for sine. The
algorithm identifies elements outside this range and calls glibc in a
scalar loop to compute their output.
The algorithm is a vectorized version of the methods presented
here: https://stackoverflow.com/questions/30463616/payne-hanek-algorithm-implementation-in-c/30465751#30465751
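To illustrate the range-reduction idea in scalar form, here is a minimal sketch (Python rather than the commit's vectorized C; the split constants are standard float32 pieces of pi/2, not necessarily the exact constants used in the commit):

```python
import numpy as np

# pi/2 split into three float32 pieces with trailing zero bits in the
# significand, so each product q * piece incurs little or no rounding
# error (the Cody-Waite technique).
C1 = np.float32(1.5703125)
C2 = np.float32(4.8375129699707031e-04)
C3 = np.float32(7.5497899548918821e-08)

def reduce_arg(x):
    """Reduce x to r in ~[-pi/4, pi/4] plus a quadrant index."""
    x = np.float32(x)
    q = np.float32(np.round(x * np.float32(2.0 / np.pi)))
    # 3-step Cody-Waite: subtract q*(pi/2) one piece at a time so that
    # cancellation error stays small even for large x
    r = ((x - q * C1) - q * C2) - q * C3
    return np.float32(r), int(q) % 4
```

The quadrant index then selects which polynomial (sine or cosine) to evaluate on r and the sign of the result.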
Accuracy: maximum ULP error = 1.49
Max ULP error comparison to GLIBC's implementation:
Performance: the speed-up this implementation provides depends on the
values of the input array. The AVX512 version performs nearly 22x faster
when all the input values are within the range specified above; its worst
case, when all the array elements fall outside the range, is about a 1%
reduction in performance. Details of the performance benefits are
provided in the commit message.