ENH: Use AVX for float32 implementation of np.sin & np.cos #13368
Conversation
I realize the first 6 commits are exactly the same as in PR #13134. The only new commit is the last one, which implements sine and cosine (but it relies on the previous commits). |
For reviewers: once the other PR is merged, you can switch the base of this PR so that only the new commit shows up. |
#13134 is merged. Could you fix the conflicts/rebase/switch the base? |
(force-pushed from bcb4e61 to 283b8ca)
Thanks! It's updated now and has just the one commit. |
How do current benchmarks compare before/after this PR? |
As mentioned in the commit message, the performance gains are leveraged only if the input arguments to sine/cosine are within [-71476.0625f, 71476.0625f] for cosine and [-117435.992f, 117435.992f] for sine. I measured the benefit in three ways:
(1) micro-benchmarks of the float32 sine/cosine ufuncs;
(2) package ai.cs, which provides an API to transform spherical coordinates to Cartesian coordinates;
(3) package photutils, which provides an API to find the best fit of first and second harmonics.
|
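For context, a minimal sketch of what a float32 micro-benchmark along these lines might look like (array size, value ranges, and repeat counts here are illustrative, not the ones behind the numbers reported in this PR):

```python
# Illustrative micro-benchmark (hypothetical, not the one used in this PR):
# compares np.sin on float32 inputs inside vs. outside the range the AVX
# fast path can handle.
import timeit
import numpy as np

N = 10000
in_range = np.random.uniform(-71000.0, 71000.0, N).astype(np.float32)
out_of_range = np.random.uniform(1e6, 1e7, N).astype(np.float32)

t_in = timeit.timeit(lambda: np.sin(in_range), number=1000)
t_out = timeit.timeit(lambda: np.sin(out_of_range), number=1000)
print(f"in-range: {t_in:.3f} s, out-of-range: {t_out:.3f} s")
```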
Sorry for not being clear. I meant that I would expect a performance PR to come with changes to the benchmarks in the benchmarks/ directory.
|
As far as I could tell, those benchmarks were running the float64 version of sine and cosine. Is there a way I could run the float32 version of it? Or do they already do that? |
I will take a look at this today and let you know if I am able to run the benchmarks for float32. |
I ran the ufunc benchmarks. The thing is, the ufunc benchmarks seem to report just one metric (the sum of the total time, I assume) for a whole bunch of dtypes (int16, int32, float16, float32, float128, complex64, complex256), whereas my patch only improves performance for float32, so I am not sure those numbers are representative of this patch. I added my own benchmarks, which measure computing sine and cosine for just float32, and this is what I measured:
|
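For reference, a dedicated float32 benchmark in NumPy's asv suite could look roughly like this sketch (the class name, file placement, array size, and value range are illustrative; the actual benchmark added in this patch may differ):

```python
# Sketch of an asv benchmark limited to float32 sin/cos; names here are
# hypothetical, not necessarily those used in the actual patch.
import numpy as np

class AVXTrigFloat32:
    params = ['sin', 'cos']
    param_names = ['ufunc']

    def setup(self, ufunc):
        np.seterr(all='ignore')
        self.f = getattr(np, ufunc)
        # values inside the range the AVX fast path can handle
        self.arr = np.random.uniform(-1000.0, 1000.0, 200000).astype(np.float32)

    def time_ufunc(self, ufunc):
        self.f(self.arr)
```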
This makes |
(force-pushed from 283b8ca to 1571b0c)
Made some edits to handle NaN and added some tests as well. |
Added a commit which fixes the bug reported in #13493 for sin and cos. |
Updated with the following fixes:
|
Unused variable warnings. |
yeah, thanks :) It was in cpuid.c, just fixed it. |
Seems like a network issue caused the one failing test; it shouldn't be related to the patch itself. |
Just checking: is there any more feedback on this patch? |
@mattip @juliantaylor ping .. |
Hi @mattip, just wanted to know what you are thinking with respect to this patch. I have more performance patches I am working on but they rely on some of the commits in this patch. |
Without FMA, the outputs of the AVX2 and AVX512 versions differ. This change ensures the output remains exactly the same across implementations.
assert_array_max_ulp returns only an int here, since it compares the ULP difference between two float32 numbers.
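For illustration, the helper can be used like this to bound the float32 ULP error against a float64 reference (the maxulp bound here is just an example, not the one enforced in the patch):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

x = np.random.uniform(-100.0, 100.0, 1000).astype(np.float32)
# Compare the float32 result against the float64 reference rounded back to
# float32; raises AssertionError if any element differs by more than maxulp.
assert_array_max_ulp(np.sin(x), np.float32(np.sin(np.float64(x))), maxulp=2)
```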
Rebased onto master. The AVX functions for sin and cos pass the tests on my local Skylake machine (though they are currently disabled in the CI). |
@r-devulap what would it take to enable the tests in CI? Do we need to find a skylake machine? |
If you have a CPU newer than Haswell, then the AVX2/FMA algorithms should already be tested. SkylakeX will be needed for AVX512F. I also need a way to enable the test only on these platforms and skip it on the rest. |
@mattip ping, how do you want to proceed with this PR? Is there a way to enable the tests only on AVX machines? |
My inclination would be to put this in as-is, since you made sure the tests pass on the relevant systems. I am not sure I understand the problem: I see all the checks pass successfully. Which tests are enabled for AVX only? |
The test |
I guess you could add a function to detect the CPU capabilities at runtime. Would you like to do that as part of this PR or in a separate, follow-on one? |
that might take me some time to implement, perhaps a separate follow up PR is better? |
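For reference, such a runtime gate could look something like the sketch below (the cpu_has_avx512f helper is hypothetical; NumPy did not expose such a query at the time, and a real implementation would query CPUID from C):

```python
import pytest

def cpu_has_avx512f():
    # Hypothetical Linux-only detection via /proc/cpuinfo; a real version
    # would query CPUID from C code instead of parsing this file.
    try:
        with open('/proc/cpuinfo') as f:
            return 'avx512f' in f.read()
    except OSError:
        return False

@pytest.mark.skipif(not cpu_has_avx512f(), reason='AVX512F not available')
def test_sincos_float32_avx512():
    ...
```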
OK, let's put this in as-is, in the hope we will improve #14048's testing over time.
thanks @r-devulap |
sounds good, thanks! |
I've done the SkylakeX CI emulation testing upstream in OpenBLAS for AVX-related testing purposes: OpenMathLib/OpenBLAS#2134 Not sure that NumPy would really want to go that far though. |
This commit implements vectorized single-precision sine and cosine using
AVX2 and AVX512. Both sine and cosine are computed using polynomial
approximations which are accurate for values in [-PI/4, PI/4]. The
original input is reduced to this range using a 3-step Cody-Waite range
reduction. This method is only accurate for values in [-71476.0625f,
71476.0625f] for cosine and [-117435.992f, 117435.992f] for sine. The
algorithm identifies elements outside this range and calls glibc in a
scalar loop to compute their output.
The algorithm is a vectorized version of the methods presented
here: https://stackoverflow.com/questions/30463616/payne-hanek-algorithm-implementation-in-c/30465751#30465751
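To illustrate the range-reduction idea in scalar form, here is a minimal sketch (Python rather than the commit's vectorized C; the split constants are standard float32 pieces of pi/2, not necessarily the exact constants used in the commit):

```python
import numpy as np

# pi/2 split into three float32 pieces with trailing zero bits in the
# significand, so each product q * piece incurs little or no rounding
# error (the Cody-Waite technique).
C1 = np.float32(1.5703125)
C2 = np.float32(4.8375129699707031e-04)
C3 = np.float32(7.5497899548918821e-08)

def reduce_arg(x):
    """Reduce x to r in ~[-pi/4, pi/4] plus a quadrant index."""
    x = np.float32(x)
    q = np.float32(np.round(x * np.float32(2.0 / np.pi)))
    # 3-step Cody-Waite: subtract q*(pi/2) one piece at a time so that
    # cancellation error stays small even for large x
    r = ((x - q * C1) - q * C2) - q * C3
    return np.float32(r), int(q) % 4
```

The quadrant index then selects which polynomial (sine or cosine) to evaluate on r and the sign of the result.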
Accuracy: maximum ULP error = 1.49
Max ULP error comparison to GLIBC's implementation:
Performance: the speed-up this implementation provides depends on the
values of the input array. The AVX512 version performs nearly 22x faster
when all the input values are within the range specified above; its worst
case, when all the array elements fall outside the range, is about a 1%
reduction in performance. Details of the performance benefits are
provided in the commit message.