ENH: add neon intrinsics to fft radix2&4 by DumbMice · Pull Request #17231 · numpy/numpy

Status: Closed · wants to merge 6 commits

Conversation

@DumbMice (Contributor) commented Sep 3, 2020

pocketfft

NumPy's fft uses pocketfft, and I added NEON intrinsics to the radix-2 and radix-4 components (complex input) for better performance on ARM machines. Although SciPy's fft is often recommended as the more developed module, there may still be value in improving NumPy's own fft.
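
For illustration only (a minimal sketch, not the PR's actual code): vld2q_f64 loads two consecutive complex doubles and de-interleaves them into a vector of real parts and a vector of imaginary parts, so a radix-2 butterfly becomes plain vector adds and subtracts, and vst2q_f64 re-interleaves on the way back to memory.

#include <arm_neon.h>

// Two independent radix-2 butterflies on complex<double> data.
// a and b each point to two consecutive complex values (re, im pairs).
static inline void radix2_butterfly_x2(const double *a, const double *b,
                                       double *out_sum, double *out_diff)
{
    float64x2x2_t va = vld2q_f64(a);  // va.val[0] = reals, va.val[1] = imags
    float64x2x2_t vb = vld2q_f64(b);
    float64x2x2_t s, d;
    s.val[0] = vaddq_f64(va.val[0], vb.val[0]);  // re(a + b)
    s.val[1] = vaddq_f64(va.val[1], vb.val[1]);  // im(a + b)
    d.val[0] = vsubq_f64(va.val[0], vb.val[0]);  // re(a - b)
    d.val[1] = vsubq_f64(va.val[1], vb.val[1]);  // im(a - b)
    vst2q_f64(out_sum,  s);  // re-interleave and store
    vst2q_f64(out_diff, d);
}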

Benchmarks

Simple benchmarks were done on a 1-D array of size N and a 2-D array of shape (N, 4), both filled with 1.+1j. Time ratios of this commit relative to the master branch are given below; the optimization is promising for large arrays.

| log2 N | 6 | 10 | 15 | 20 | 25 |
| --- | --- | --- | --- | --- | --- |
| fft | 99% | 100% | 83% | 71% | 62% |
| ifft | 99% | 100% | 94% | 78% | 63% |
| fft2 | 98% | 100% | 100% | 71% | 64% |
| ifft2 | 99% | 100% | 99% | 71% | 64% |

Comments

If you think this is worth doing, I could proceed to the other components (radix-3, 5, ...) as well.

@mattip (Member) commented Sep 3, 2020

It would be nice if this could be done with the universal intrinsics rather than redefining new macros.

@Qiyu8 added the "01 - Enhancement" and "component: SIMD" (Issues in SIMD (fast instruction sets) code or machinery) labels on Sep 3, 2020
@DumbMice (Contributor, Author) commented Sep 4, 2020

It would be nice if this could be done with the universal intrinsics rather than redefining new macros.

I could try to use universal intrinsics in the way I learned in #16969 and compare the benchmark results. However, there are intrinsics, specifically vld2q_f64, vst2q_f64, and vmlaq_f64, that have no equivalent universal intrinsics. Should I build them upon existing universal intrinsics only inside pocketfft?

@mattip (Member) commented Sep 4, 2020

@seiko2plus do we have a plan for intrinsics that are CPU-specific? There are two alternatives (you probably already thought about this):

  • locally check if an intrinsic is supported and provide a local fallback in an #else
  • provide a universal intrinsic, and a slow-path alternative in the simd headers

@Qiyu8 (Member) commented Sep 4, 2020

The best way is to provide a generic intrinsic when we meet a CPU-specific one. As we can see in #17049, _sum_of_products_contig_contig_outstride0_two uses _mm_shuffle_ps, which is x86-specific. If we simulate this intrinsic on ARM/VSX, the performance is not good even compared to plain C code, but we found out that _mm_shuffle_ps was being used for a higher-level reduce-sum purpose, so we wrote a generic intrinsic named reduce_sum, which can use better architecture-specific intrinsics on ARM/VSX without simulation, and the performance came back. @DumbMice, can you do the same thing for vld2q_f64, vst2q_f64, and vmlaq_f64? We are willing to help if you get stuck.
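
As a rough illustration of that idea (hypothetical helper name reduce_sum_f32x4, not the exact code from #17049): the intent "horizontal sum of a float vector" is written once per architecture, so each one can use its best native sequence instead of emulating _mm_shuffle_ps.

#if defined(__aarch64__)
    #include <arm_neon.h>
    // AArch64 provides a single horizontal-add instruction.
    static inline float reduce_sum_f32x4(float32x4_t a)
    { return vaddvq_f32(a); }
#elif defined(__SSE__)
    #include <xmmintrin.h>
    // SSE builds the horizontal sum out of shuffles and adds.
    static inline float reduce_sum_f32x4(__m128 a)
    {
        __m128 t = _mm_add_ps(a, _mm_movehl_ps(a, a));  // lanes 0+2, 1+3
        t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));     // + lane 1
        return _mm_cvtss_f32(t);
    }
#endif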

@DumbMice (Contributor, Author) commented Sep 4, 2020

The best way is to provide a generic intrinsic when we meet a CPU-specific one. As we can see in #17049, _sum_of_products_contig_contig_outstride0_two uses _mm_shuffle_ps, which is x86-specific. If we simulate this intrinsic on ARM/VSX, the performance is not good even compared to plain C code, but we found out that _mm_shuffle_ps was being used for a higher-level reduce-sum purpose, so we wrote a generic intrinsic named reduce_sum, which can use better architecture-specific intrinsics on ARM/VSX without simulation, and the performance came back. @DumbMice, can you do the same thing for vld2q_f64, vst2q_f64, and vmlaq_f64? We are willing to help if you get stuck.

Sorry, I forgot that vmlsq_f64 was used as well. @Qiyu8, after checking #17049, I think I could easily add npyv_muladd_f64 and npyv_mulsub_f64 for vmlaq_f64 and vmlsq_f64, but it might be hard for me to do vld2q_f64 and vst2q_f64 with my limited knowledge of other intrinsics. Any help is welcome.

BTW @mattip, these new macros wrapping NEON intrinsics with the prefix V are used only to look consistent with the original macros used in the scalar loops; I could explicitly expand them if you think they are obscure.

@seiko2plus (Member) commented Sep 4, 2020

@mattip, sure, we have special data types npyv_{type}x2 and npyv_{type}x3 to be used with de/interleaving intrinsics, and they're pretty compatible with NEON intrinsics. We may later need to add a data type npyv_{type}x4 to handle four channels.
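
Roughly speaking (simplified, not the exact NumPy definitions), the NEON backend can map these multi-vector types straight onto the native tuple types, while other backends declare a small struct exposing the same .val[] member:

// NEON: two f64 vectors, accessed as x.val[0] / x.val[1]
typedef float64x2x2_t npyv_f64x2;

// a non-NEON backend would use something like (hypothetical, simplified):
// typedef struct { __m128d val[2]; } npyv_f64x2;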

@DumbMice,

but it might be hard for me to do vld2q_f64 and vst2q_f64 with my limited knowledge of other intrinsics. Any help is welcome.

Here are some tips for implementing de/interleaving intrinsics for two 64-bit channels:

  • Create a new header named 'interleave.h' inside the NPYV directory for each SIMD extension,
    since this header is expected to grow to over 1000-2000 lines. Then include the new header inside the
    main 'simd.h' header, the same as we do with the other internal headers.

  • The new intrinsics should be named npyv_load_deinterleave_f64x2 and npyv_store_interleave_f64x2;
    the names are long, but they increase readability.

  • The final look of the NEON header should be as follows:

#ifndef NPY_SIMD
    #error "Not a standalone header"
#endif

#ifndef _NPY_SIMD_NEON_INTERLEAVE_H
#define _NPY_SIMD_NEON_INTERLEAVE_H

#if NPY_SIMD_F64
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{ return vld2q_f64(ptr); }

NPY_FINLINE void npyv_store_interleave_f64x2(double *ptr, npyv_f64x2 a)
{ vst2q_f64(ptr, a); }
#endif // NPY_SIMD_F64

#endif
  • NPYV has intrinsics for zip operations that can be used for both operations (interleave and deinterleave)
    on SSE and VSX. Check the following example:
#ifndef NPY_SIMD
    #error "Not a standalone header"
#endif

#include "memory.h" // load*
#include "reorder.h" // npyv_zip*

#ifndef _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H
#define _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H

NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 a = npyv_load_f64(ptr);
    npyv_f64 b = npyv_load_f64(ptr + npyv_nlanes_f64);
    return npyv_zip_f64(a, b);
}

NPY_FINLINE void npyv_store_interleave_f64x2(double *ptr, npyv_f64x2 a)
{
    npyv_f64x2 zip = npyv_zip_f64(a.val[0], a.val[1]);
    npyv_store_f64(ptr, zip.val[0]);
    npyv_store_f64(ptr + npyv_nlanes_f64, zip.val[1]);
}
#endif // _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H
  • For AVX2 and AVX512 you can still use npyv_zip_f64, but only for the "interleave" intrinsic, the same as in the previous code.

  • Deinterleave intrinsic implementation for AVX2

NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 ab0 = npyv_load_f64(ptr);
    npyv_f64 ab1 = npyv_load_f64(ptr + npyv_nlanes_f64);
    npyv_f64x2 ab, swap_halves = npyv_combine_f64(ab0, ab1);
    ab.val[0] = _mm256_unpacklo_pd(swap_halves.val[0], swap_halves.val[1]);
    ab.val[1] = _mm256_unpackhi_pd(swap_halves.val[0], swap_halves.val[1]);
    return ab;
}
  • Deinterleave intrinsic implementation for AVX512
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 ab0 = npyv_load_f64(ptr);
    npyv_f64 ab1 = npyv_load_f64(ptr + npyv_nlanes_f64);
    npyv_f64x2 ab;
    ab.val[0] = _mm512_permutex2var_pd(ab0, _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14), ab1);
    ab.val[1] = _mm512_permutex2var_pd(ab0, _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15), ab1);
    return ab;
}

EDIT: I forgot adding .val :), thanks @DumbMice!

Comment on lines 24 to 40
#ifdef NPY_HAVE_NEON
#include "lowlevel_strided_loops.h"
#include "arm_neon.h"

/* align var to alignment */
#define LOOP_BLOCK_ALIGN_VAR(var, type, alignment)\
npy_intp i, peel = npy_aligned_block_offset(var, sizeof(type),\
alignment, n);\
for(i = 0; i < peel; i++)

#define LOOP_BLOCKED(type, vsize)\
for(; i < npy_blocked_end(peel, sizeof(type), vsize, n);\
i += (vsize / sizeof(type)))

#define LOOP_BLOCKED_END\
for (; i < n; i++)
#endif
Review comment (Member):

Suggested change
(The suggested change removes the block above.)

No need to handle the offset of the non-aligned part, for the following reasons:

  • You're dealing with 64-bit loads; it is not going to make a difference even on old machines, and it is
    not worth adding new npyv intrinsics to handle aligned memory access for interleaving/deinterleaving.
  • NEON/ASIMD doesn't have intrinsics for aligned loads.
  • Except for stream writes (non-temporal) in the case of x86, modern machines and compilers can effectively handle non-aligned
    memory access even with the lowest memory alignments (1/2/4).

@DumbMice (Contributor, Author) commented Sep 6, 2020

@seiko2plus Thank you for your tips! I added the de/interleaving intrinsics to the universal intrinsics and used them in pocketfft. I only needed to change a[0] to a.val[0] to make npyv_f64x2 work, just like in NEON.

There are some subtleties in adding multiply-subtract intrinsics. For NEON intrinsics, vmlsq_f64(c, a, b) means c[i] - a[i]*b[i], where the negated product is accumulated; for other intrinsics, mul-sub(a, b, c) generally means a[i]*b[i] - c[i]. I assumed the second convention was to be adopted and made the corresponding alterations.
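
In scalar form the two conventions differ only in argument order and sign; a tiny sketch with hypothetical helper names (per-element semantics only):

// NEON-style accumulate form: vmlsq_f64(c, a, b) -> c - a*b per lane
static inline double neon_style_mls(double c, double a, double b)
{ return c - a * b; }

// "mul-sub" form assumed in this PR: mulsub(a, b, c) -> a*b - c per lane
static inline double mulsub_style(double a, double b, double c)
{ return a * b - c; }

// The two differ only by a sign: mulsub_style(a, b, c) == -neon_style_mls(c, a, b)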

@seiko2plus (Member) left a review comment:

add new intrinsic for negating multiply add

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice,

@seiko2plus Thank you for your tips! I added the de/interleaving intrinsics to the universal intrinsics and used them in pocketfft. I only needed to change a[0] to a.val[0] to make npyv_f64x2 work, just like in NEON.

My bad, I wrote that comment in a hurry; thanks for correcting me.

There are some subtleties in adding multiply-subtract intrinsics. For NEON intrinsics, vmlsq_f64(c, a, b) means c[i] - a[i]*b[i], where the negated product is accumulated; for other intrinsics, mul-sub generally means a[i]*b[i] - c[i]. I assumed the second convention was to be adopted and made the corresponding alterations.

You must add an intrinsic for negating the intermediate result, as I suggested.

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice, please wait, let me review my suggestion.
EDIT: I removed my suggestion in favor of implementing the fma3 intrinsics in a separate PR.

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice, I'm going to open a pull request implementing the fma3 intrinsics to remove any confusion; just give me several hours.

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice, #17258 implements all the fused intrinsics.
After #17258 gets merged, you can use the intrinsic npyv_nmuladd_f64 (or its synonym npyv_submul_f64) without the need to add an extra negate.
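
As a sketch of how the fused intrinsics remove the extra negate in a complex twiddle multiply (illustrative only, assuming the post-#17258 semantics npyv_muladd_f64(a, b, c) = a*b + c and npyv_nmuladd_f64(a, b, c) = c - a*b, with the hypothetical helper name cmul_f64x2):

// (ar + i*ai) * (br + i*bi) on de-interleaved vectors
NPY_FINLINE npyv_f64x2 cmul_f64x2(npyv_f64x2 a, npyv_f64x2 b)
{
    npyv_f64x2 r;
    // real part: ar*br - ai*bi
    r.val[0] = npyv_nmuladd_f64(a.val[1], b.val[1], npyv_mul_f64(a.val[0], b.val[0]));
    // imaginary part: ar*bi + ai*br
    r.val[1] = npyv_muladd_f64(a.val[0], b.val[1], npyv_mul_f64(a.val[1], b.val[0]));
    return r;
}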

@DumbMice (Contributor, Author) commented Sep 8, 2020

@seiko2plus Thanks so much for your help. All macros are now expanded as raw universal intrinsics, as required.

Benchmark Comparison

[benchmark graphs: fft-compare-neon, usimd]

Similar performance is achieved. I might update with benchmarks on x86 systems later.

@DumbMice (Contributor, Author) commented Sep 9, 2020

Benchmark on x86

[benchmark graph: fft-compare-x86]

Compared with the results on the ARM machine, performance changes insignificantly on the x86 machine. Not sure if this is supposed to happen. The build and benchmark process on both systems was:

  1. python setup.py develop
  2. python runtests.py -v -j 8 --bench-compare master bench_fft
    where the uncommitted file bench_fft.py is used for benchmarking on both machines.

System Info

|  | Arm | x86 |
| --- | --- | --- |
| Hardware | KunPeng | Lenovo T470P |
| Processor | ARMv8 2.6 GHz, 8 processors | Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz |
| OS | Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64 | 18.04.1-Ubuntu |
| Compiler | gcc (GCC) 7.3.0 | gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |

Size Change

| File | before | after | change |
| --- | --- | --- | --- |
| _pocketfft_internal.cpython-37m-x86_64-linux-gnu.so | 392K | 408K | +16K |

@mattip (Member) commented Sep 9, 2020

If I understand the graphs correctly, SIMD on NEON only begins to really pay off for n>1_000_000 (log2n>20) or so?

x86 already used intrinsics before this PR, right? So I would hope there is no change.

@DumbMice (Contributor, Author) commented Sep 9, 2020

If I understand the graphs correctly, SIMD on NEON only begins to really pay off for n>1_000_000 (log2n>20) or so?

Yes. I suppose this is expected. The whole array is iterated over log(n) times, but each time the array is dissected into smaller and smaller subarrays, which is where the SIMD intrinsics operate. I suppose this is the nature of FFT's n·log(n) complexity.

x86 already used intrinsics before this PR, right? So I would hope there is no change.

I believe pocketfft is implemented in pure C; no intrinsics were used before this commit. I suppose an improvement is expected on x86?

@mattip (Member) commented Sep 9, 2020

Perhaps SIMD is competing with the compiler.

@charris (Member) commented Sep 9, 2020

Cache usage was historically the dominant factor in FFT timings; not sure how that holds up on modern architectures. NumPy's fft is also float64; float32 might benefit more.

@seiko2plus (Member) commented Sep 9, 2020

@DumbMice,

where the uncommitted file bench_fft.py is used for benchmarking on both machines.

Could you please provide bench_fft.py?

I believe pocketfft is implemented in pure C; no intrinsics were used before this commit. I suppose an improvement is expected on x86?

The compiler can easily auto-vectorize these kinds of loops; plus, you're not using fused intrinsics, since you're using NPYV within the domain of the default baseline features (SSE, SSE2, SSE3 on x86_64).

Could you please re-run your benchmarks on x86 with --cpu-baseline="avx2 fma3" to activate AVX2 with FMA3?

@seiko2plus (Member) commented Sep 9, 2020

@charris,

NumPy's fft is also float64; float32 might benefit more.

Especially complex float32, since the compiler will struggle to auto-vectorize interleaving/deinterleaving of single precision.

@DumbMice (Contributor, Author) commented:

Added bench_fft.py as requested.

Could you please re-run your benchmarks on x86 with --cpu-baseline="avx2 fma3" to activate AVX2 with FMA3?

I used the following command for the benchmark: python runtests.py --cpu-baseline="avx2 fma3" --bench-compare master bench_fft. Still, no significant improvement has been observed.

Maybe you are right: compilers are smart enough to auto-vectorize operations on f64 arrays using x86 instructions?

@Qiyu8 (Member) commented Sep 10, 2020

I think the correct commands are:

  1. Build with expected features: python setup.py build --cpu-baseline="avx2 fma3"
  2. Make sure that NPY_SIMD is 256: python runtests.py -v -t numpy/core/tests/test_fft.py -- -s
  3. Install to override the local: pip install .
  4. Run the benchmark: python runtests.py --bench-compare master bench_fft

@DumbMice (Contributor, Author) commented:

@Qiyu8 Thanks for your advice, but the benchmark still shows no improvement on x86.

In the build log, both the EXT and CLIB optimizations have SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 enabled in the CPU baseline.

Not sure how to check NPY_SIMD, but the 2nd step outputs: NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CNL?

By comparing _pocketfft.o.d in the build directory, I can confirm that the header files in the simd/avx folder are included when building the fft-neon branch and are not included on the master branch. So I think this might confirm the compiler's efficiency in optimizing the x86 code.

@seiko2plus (Member) commented:

@DumbMice,

Not sure how to check NPY_SIMD, but the 2nd step outputs: NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CNL?

Thank you for your report, this issue should be fixed by #17284.

To test AVX2 with FMA3 as the baseline, run:

python runtests.py --cpu-baseline="avx2 fma3" -t numpy/core/tests/test_fft.py

To benchmark AVX2 with FMA3 as the baseline through asv, run (after #17284):

python3 runtests.py --cpu-baseline="avx2 fma3" --bench-compare master bench_fft

@DumbMice (Contributor, Author) commented Sep 29, 2020

To see the results faster, I've applied the changes from #17284 in order to benchmark properly on my x86 machine locally. A time reduction of ~10% is shown when log2(N) is between 10 and 14; no improvement is seen elsewhere.
[benchmark graph: final2]
I cannot explain this. If others are willing to experiment on their x86 machines, I look forward to seeing their results.

Note: benchmarking bench_fft.py can be memory-consuming, and my laptop has 2x8 GB of memory.

@Qiyu8 added the "triage review" (Issue/PR to be discussed at the next triage meeting) label on Nov 20, 2020
@mattip (Member) commented Dec 16, 2020

We discussed this in the latest triage meeting and the feeling was to close this in favor of SciPy's implementation. NumPy has FFT as a "textbook implementation"; if people want a faster FFT they should use SciPy. If you feel strongly that this position is wrong, let us know. One interesting thing to try would be to use the NumPy 1.20 universal SIMD framework in SciPy; maybe it would provide a good reason to break the USIMD code out into a separate library.

@mattip closed this on Dec 16, 2020
@rossbar (Contributor) commented Dec 16, 2020

There are some loosely related discussions in #17839 and #17801.

Labels: 01 - Enhancement, component: SIMD, triage review