8000 ENH: Use AVX for float32 implementation of np.sin & np.cos by r-devulap · Pull Request #13368 · numpy/numpy · GitHub

Merged — 11 commits merged into numpy:master on Aug 18, 2019

Conversation

@r-devulap (Member)

This commit implements vectorized single-precision sine and cosine using
AVX2 and AVX512. Both sine and cosine are computed using a polynomial
approximation that is accurate for values in [-PI/4, PI/4]. The original
input is reduced to this range using a 3-step Cody-Waite range reduction.
This method is only accurate for values in [-71476.0625f, 71476.0625f]
for cosine and [-117435.992f, 117435.992f] for sine. The algorithm
identifies elements outside this range and calls glibc in a scalar loop
to compute their output.

The algorithm is a vectorized version of the methods presented
here: https://stackoverflow.com/questions/30463616/payne-hanek-algorithm-implementation-in-c/30465751#30465751
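The Cody-Waite idea is to split the reduction constant pi/2 into several parts so that each multiply-and-subtract stays nearly exact and cancellation does not destroy the remainder's low bits. A scalar Python sketch of the 3-step version, using fdlibm's double-precision split constants as illustrative stand-ins (the PR itself uses float32 constants):

```python
import math

# fdlibm's double-precision split of pi/2; illustrative only — the
# PR's float32 implementation uses different constants.
PIO2_1 = 1.57079632673412561417e+00   # first ~33 bits of pi/2
PIO2_2 = 6.07710050650619224932e-11   # next ~33 bits
PIO2_3 = 2.02226624879595063154e-21   # remainder

def reduce_range(x):
    """Reduce x to r in [-pi/4, pi/4] plus a quadrant index k."""
    k = round(x / (math.pi / 2))
    # Subtract k*pi/2 in three steps; each product keeps enough exact
    # bits that the final remainder is accurate despite cancellation.
    r = ((x - k * PIO2_1) - k * PIO2_2) - k * PIO2_3
    return r, k  # k % 4 selects the sin/cos polynomial and the sign
```

With `r` in [-pi/4, pi/4], a short polynomial evaluates sin(r) or cos(r), and `k % 4` maps the result back to the original quadrant.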

Accuracy: maximum ULP error = 1.49

Max ULP error comparison to GLIBC's implementation:

| Function | GLIBC scalar | GLIBC vector (libmvec) | this patch |
|----------|--------------|------------------------|------------|
| sine     | 1.19         | 2.47                   | 1.49       |
| cosine   | 1.19         | 2.39                   | 1.49       |

Performance: the speed-up this implementation provides depends on the
values of the input array. The AVX512 version performs nearly 22x faster
when all the input values are within the range specified above; details
of the performance benefits are provided in the commit message. The worst
case, when all the array elements are outside the range, is about a 1%
reduction in performance.

@r-devulap (Member, Author)

I realize the first 6 commits are exactly the same as in PR #13134. The only new commit is the last one, which implements sine and cosine (but it relies on the previous commits).

@eric-wieser (Member) commented Apr 20, 2019

For reviewers - if the other PR is merged, you can switch the base of this PR between <anything> and back to master, and it will recompute the diff and throw out all the merged commits.

@mattip (Member) commented Apr 20, 2019

#13134 is merged. Could you fix the conflicts/rebase/switch the base?

@r-devulap force-pushed the sincos-simd branch 2 times, most recently from bcb4e61 to 283b8ca on Apr 20, 2019
@r-devulap (Member, Author)

Thanks! It's updated now and has just the one commit.

@mattip (Member) commented Apr 23, 2019

How do current benchmarks compare before/after this PR?

@r-devulap (Member, Author)

As mentioned in the commit message, the performance gains apply only if the input arguments to sine/cosine are within [-71476.0625f, 71476.0625f] for cosine and [-117435.992f, 117435.992f] for sine. I measured 3 different benchmarks.

(1) For micro-benchmarks:
Array size = 10000, Command = "%timeit np.cos([myarr])"

| Function | NumPy 1.16 | AVX2     | AVX512    | AVX2 speed-up | AVX512 speed-up |
|----------|------------|----------|-----------|---------------|-----------------|
| np.cos   | 1.5174 s   | 0.1553 s | 0.06634 s | 9.77x         | 22.87x          |
| np.sin   | 1.4738 s   | 0.1517 s | 0.06406 s | 9.71x         | 23.00x          |

(2) The package ai.cs provides an API to transform spherical coordinates to
the Cartesian system:
Array size = 10000, Command = "%timeit ai.cs.sp2cart(r,theta,phi)"

| Function      | NumPy 1.16 | AVX2     | AVX512   | AVX2 speed-up | AVX512 speed-up |
|---------------|------------|----------|----------|---------------|-----------------|
| ai.cs.sp2cart | 0.6371 s   | 0.1066 s | 0.0605 s | 5.97x         | 10.53x          |
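The spherical-to-Cartesian transform is dominated by sin/cos calls, which is why it benefits so directly from this patch. A minimal NumPy sketch of what such a transform computes (hypothetical — ai.cs.sp2cart's actual angle conventions may differ; here theta is elevation from the xy-plane and phi is azimuth):

```python
import numpy as np

def sp2cart(r, theta, phi):
    # Hypothetical sketch, not ai.cs's implementation: four of the five
    # trig evaluations below hit np.sin/np.cos on float32 input.
    x = r * np.cos(theta) * np.cos(phi)
    y = r * np.cos(theta) * np.sin(phi)
    z = r * np.sin(theta)
    return x, y, z
```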

(3) Package photutils provides an API to find the best fit of first and
second harmonic functions to a set of (angle, intensity) pairs:
Array size = 1000, Command = "%timeit fit_first_and_second_harmonics(E, data)"

| Function                       | NumPy 1.16 | AVX2     | AVX512   | AVX2 speed-up | AVX512 speed-up |
|--------------------------------|------------|----------|----------|---------------|-----------------|
| fit_first_and_second_harmonics | 1.598 s    | 0.8709 s | 0.7761 s | 1.83x         | 2.05x           |

@mattip (Member) commented Apr 23, 2019

Sorry for not being clear. I meant that I would expect a performance PR to come with changes to the benchmarks in numpy/benchmarks, run by

python runtests.py --bench

@r-devulap (Member, Author)

As far as I can tell, those benchmarks run the float64 version of sine and cosine. Is there a way I could run the float32 version? Or do they already cover it?

@r-devulap (Member, Author)

I will take a look at this today and let you know if I am able to run the benchmarks for float32.

@r-devulap (Member, Author)

I ran the ufunc benchmarks (`python runtests.py --bench ufunc`) and saw this:

| Function | Before this patch | With this patch |
|----------|-------------------|-----------------|
| sin      | 7.51±0.01 ms      | 7.15±0 ms       |
| cos      | 7.73±0.02 ms      | 7.36±0 ms       |

But the thing is, the ufunc benchmarks seem to report just one metric (the sum of the total time, I assume) across a whole bunch of dtypes (int16, int32, float16, float32, float128, complex64, complex256), whereas my patch only improves performance for float32. So I am not sure these numbers are representative of this patch's performance.

I added my own benchmark, which measures computing sine and cosine for just float32:

class Trigonometry(Benchmark):
    def setup(self):
        self.f = np.ones(10000, dtype=np.float32)

    def time_sine(self):
        np.sin(self.f)

    def time_cosine(self):
        np.cos(self.f)

and this is what I measured:

| Function | Before this patch | With this patch |
|----------|-------------------|-----------------|
| sin      | 117±0.04 us       | 12.2±1 us       |
| cos      | 115±0.02 us       | 12.9±1 us       |

@mattip (Member) commented Apr 30, 2019

This makes _multiarray_umath.so on Ubuntu 18.04 with gcc-8 grow from 17271072 to 17299296, which is 27.5 kB or 0.16%.

@r-devulap force-pushed the sincos-simd branch from 283b8ca to 1571b0c on May 1, 2019
@r-devulap (Member, Author)

Made some edits to handle NaN and added some tests as well.

@r-devulap (Member, Author)

Added a commit which fixes the bug reported in #13493 for sin and cos.

@r-devulap (Member, Author) commented May 23, 2019

Updated with the following fixes:

  • Added tests to validate sin/cos
  • Use AVX for strided arrays (fixes the same issue reported for exp/log)
  • sin/cos now require FMA in addition to AVX2; otherwise the results are inconsistent across the AVX2 and AVX512 implementations
  • Rounded up the max tolerable ULP error to an int (assert_array_max_ulp returns an int)
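For illustration, this kind of ULP validation can be done against a float64 reference rounded to float32 (a sketch only — the PR's actual test, test_validate_transcendentals, compares against stored reference values; maxulp=3 here pads the ~1.49 ULP bound with a small margin):

```python
import numpy as np

# Treat float64 sin, rounded to float32, as a near-correctly-rounded
# reference, then bound the float32 implementation's ULP error.
x = np.linspace(-100.0, 100.0, 10000, dtype=np.float32)
ref = np.sin(x.astype(np.float64)).astype(np.float32)

# Raises AssertionError if any element differs by more than 3 ULP.
np.testing.assert_array_max_ulp(np.sin(x), ref, maxulp=3)
```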

@charris (Member) commented May 23, 2019

Unused variable warnings.

@r-devulap (Member, Author)

Yeah, thanks :) It was in cpuid.c; just fixed it.

@r-devulap (Member, Author)

Seems like some network issue caused the one test failure; it shouldn't be related to the patch itself.

@r-devulap (Member, Author)

Just checking: is there any more feedback on this patch?

@r-devulap (Member, Author)

@mattip @juliantaylor ping ..

@r-devulap (Member, Author)

Hi @mattip, just wanted to know what you are thinking with respect to this patch. I have more performance patches I am working on but they rely on some of the commits in this patch.

@r-devulap (Member, Author)

Rebased onto master. The AVX functions for sin and cos pass the tests on my local Skylake machine (though the test is currently disabled in the CI).

@mattip (Member) commented Aug 8, 2019

@r-devulap what would it take to enable the tests in CI? Do we need to find a Skylake machine?

@r-devulap (Member, Author)

If you have a CPU newer than Haswell, then the AVX2/FMA algorithms are already being tested. SkylakeX will be needed for AVX512F. Additionally, I also need a way to enable the test only on these platforms and skip it on the rest.

@r-devulap (Member, Author)

@mattip ping, how do you want to proceed with this PR? Is there a way to enable the tests only on AVX machines?

@mattip (Member) commented Aug 17, 2019

My inclination would be to put this in as-is, since you made sure the tests pass on the relevant systems. I am not sure I understand the problem: I see all the checks pass successfully. Which tests are enabled only for AVX?

@r-devulap (Member, Author)

The test test_validate_transcendentals in the file numpy/core/tests/test_umath_accuracy.py is currently disabled completely in the CI. The reason is that different platforms/compilers seem to generate different outputs for the same functions, and the accuracy varies significantly. I would like to know if there is a way to enable this test in the CI only on platforms where NumPy uses the AVX version of sin/cos/exp/log, and skip it on other platforms.

@mattip (Member) commented Aug 17, 2019

I guess you could add a function to numpy/core/src/umath/_umath_tests.c which would export a C-level detection of AVX, then call it as part of the setup for those tests.

Would you like to do that as part of this PR or in a separate, follow-on one?
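At the Python level, the gating could look something like the sketch below, assuming a hypothetical Linux-only feature check standing in for the proposed C-level detection in _umath_tests.c:

```python
import unittest

def cpu_has_feature(flag):
    # Hypothetical stand-in for a C-level check exported from
    # _umath_tests.c; this parses /proc/cpuinfo, so it is Linux-only
    # and simply returns False elsewhere.
    try:
        with open("/proc/cpuinfo") as f:
            return flag in f.read()
    except OSError:
        return False

# Skip the accuracy test entirely on machines without the AVX paths.
@unittest.skipUnless(cpu_has_feature("avx2"), "requires AVX2/FMA sin/cos")
class TestTranscendentalAccuracy(unittest.TestCase):
    def test_sin_accuracy(self):
        pass  # placeholder for test_validate_transcendentals-style checks
```

A real version would detect the exact feature set (AVX2+FMA vs AVX512F) that NumPy's dispatch actually selected, not just what the CPU supports.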

@r-devulap (Member, Author)

That might take me some time to implement; perhaps a separate follow-up PR is better?

@mattip (Member) commented Aug 18, 2019

OK, let's put this in as-is, in the hope that we will improve #14048's testing over time.

@mattip mattip merged commit a051f5e into numpy:master Aug 18, 2019
@mattip (Member) commented Aug 18, 2019

Thanks @r-devulap

@r-devulap (Member, Author)

> OK, let's put this in as-is, in the hope we will improve #14048's testing over time

Sounds good, thanks!

@tylerjereddy (Contributor)

I've done SkylakeX CI emulation testing upstream in OpenBLAS for AVX-related testing purposes: OpenMathLib/OpenBLAS#2134

Not sure that NumPy would really want to go that far, though.

Labels: 01 - Enhancement, component: SIMD
7 participants