ENH: add neon intrinsics to fft radix2&4 by DumbMice · Pull Request #17231 · numpy/numpy

Status: Closed · wants to merge 6 commits

Conversation

@DumbMice (Contributor) commented Sep 3, 2020

pocketfft

NumPy's fft uses pocketfft, and I added NEON intrinsics to the radix-2 and radix-4 components (complex input) for better performance on ARM machines. Although SciPy's fft is often recommended as the more developed module, there may still be value in improving NumPy's own fft.
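
For illustration only (a minimal sketch, not the PR's actual code): vld2q_f64 loads two consecutive complex doubles and de-interleaves them into a vector of real parts and a vector of imaginary parts, so a radix-2 butterfly becomes plain vector adds and subtracts, and vst2q_f64 re-interleaves on the way back to memory.

#include <arm_neon.h>

// Two independent radix-2 butterflies on complex<double> data.
// a and b each point to two consecutive complex values (re, im pairs).
static inline void radix2_butterfly_x2(const double *a, const double *b,
                                       double *out_sum, double *out_diff)
{
    float64x2x2_t va = vld2q_f64(a);  // va.val[0] = reals, va.val[1] = imags
    float64x2x2_t vb = vld2q_f64(b);
    float64x2x2_t s, d;
    s.val[0] = vaddq_f64(va.val[0], vb.val[0]);  // re(a + b)
    s.val[1] = vaddq_f64(va.val[1], vb.val[1]);  // im(a + b)
    d.val[0] = vsubq_f64(va.val[0], vb.val[0]);  // re(a - b)
    d.val[1] = vsubq_f64(va.val[1], vb.val[1]);  // im(a - b)
    vst2q_f64(out_sum,  s);  // re-interleave and store
    vst2q_f64(out_diff, d);
}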

Benchmarks

Simple benchmarks were done on a 1-D array of size N and a 2-D array of shape (N, 4), both filled with 1.+1j. Time ratios of this commit relative to the master branch are given below; the optimization is promising for large arrays.

| log2 N | 6 | 10 | 15 | 20 | 25 |
| --- | --- | --- | --- | --- | --- |
| fft | 99% | 100% | 83% | 71% | 62% |
| ifft | 99% | 100% | 94% | 78% | 63% |
| fft2 | 98% | 100% | 100% | 71% | 64% |
| ifft2 | 99% | 100% | 99% | 71% | 64% |

Comments

If you think this is worth doing, I could proceed to the other components (radix-3, 5, ...) as well.

@mattip (Member) commented Sep 3, 2020

It would be nice if this could be done with the universal intrinsics rather than redefining new macros.

@Qiyu8 added the "01 - Enhancement" and "component: SIMD" (Issues in SIMD (fast instruction sets) code or machinery) labels on Sep 3, 2020
@DumbMice (Contributor, Author) commented Sep 4, 2020

It would be nice if this could be done with the universal intrinsics rather than redefining new macros.

I could try to use universal intrinsics in the way I learned in #16969 and compare the benchmark results. However, there are intrinsics, specifically vld2q_f64, vst2q_f64, and vmlaq_f64, that have no equivalent universal intrinsics. Should I build them upon existing universal intrinsics only inside pocketfft?

@mattip (Member) commented Sep 4, 2020

@seiko2plus do we have a plan for intrinsics that are CPU-specific? There are two alternatives (you probably already thought about this):

  • locally check if an intrinsic is supported and provide a local fallback in an #else
  • provide a universal intrinsic, and a slow-path alternative in the simd headers

@Qiyu8 (Member) commented Sep 4, 2020

The best way is to provide a generic intrinsic when we meet a CPU-specific one. As we can see in #17049, _sum_of_products_contig_contig_outstride0_two uses _mm_shuffle_ps, which is x86-specific. If we simulate this intrinsic on ARM/VSX, the performance is not good even compared to plain C code, but we found out that _mm_shuffle_ps was being used for a higher-level reduce-sum purpose, so we wrote a generic intrinsic named reduce_sum, which can use better architecture-specific intrinsics on ARM/VSX without simulation, and the performance came back. @DumbMice, can you do the same thing for vld2q_f64, vst2q_f64, and vmlaq_f64? We are willing to help if you get stuck.
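
As a rough illustration of that idea (hypothetical helper name reduce_sum_f32x4, not the exact code from #17049): the intent "horizontal sum of a float vector" is written once per architecture, so each one can use its best native sequence instead of emulating _mm_shuffle_ps.

#if defined(__aarch64__)
    #include <arm_neon.h>
    // AArch64 provides a single horizontal-add instruction.
    static inline float reduce_sum_f32x4(float32x4_t a)
    { return vaddvq_f32(a); }
#elif defined(__SSE__)
    #include <xmmintrin.h>
    // SSE builds the horizontal sum out of shuffles and adds.
    static inline float reduce_sum_f32x4(__m128 a)
    {
        __m128 t = _mm_add_ps(a, _mm_movehl_ps(a, a));  // lanes 0+2, 1+3
        t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));     // + lane 1
        return _mm_cvtss_f32(t);
    }
#endif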

@DumbMice (Contributor, Author) commented Sep 4, 2020

The best way is to provide a generic intrinsic when we meet a CPU-specific one. As we can see in #17049, _sum_of_products_contig_contig_outstride0_two uses _mm_shuffle_ps, which is x86-specific. If we simulate this intrinsic on ARM/VSX, the performance is not good even compared to plain C code, but we found out that _mm_shuffle_ps was being used for a higher-level reduce-sum purpose, so we wrote a generic intrinsic named reduce_sum, which can use better architecture-specific intrinsics on ARM/VSX without simulation, and the performance came back. @DumbMice, can you do the same thing for vld2q_f64, vst2q_f64, and vmlaq_f64? We are willing to help if you get stuck.

Sorry, I forgot that vmlsq_f64 was used as well. @Qiyu8, after checking #17049, I think I could easily add npyv_muladd_f64 and npyv_mulsub_f64 for vmlaq_f64 and vmlsq_f64, but it might be hard for me to do vld2q_f64 and vst2q_f64 with my limited knowledge of other intrinsics. Any help is welcome.

BTW @mattip, these new macros wrapping NEON intrinsics with the prefix V are used only to look consistent with the original macros used in the scalar loops; I could explicitly expand them if you think they are obscure.

@seiko2plus (Member) commented Sep 4, 2020

@mattip, sure, we have special data types npyv_{type}x2 and npyv_{type}x3 to be used with de/interleaving intrinsics, and they're pretty compatible with NEON intrinsics. We may later need to add a data type npyv_{type}x4 to handle four channels.
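
Roughly speaking (simplified, not the exact NumPy definitions), the NEON backend can map these multi-vector types straight onto the native tuple types, while other backends declare a small struct exposing the same .val[] member:

// NEON: two f64 vectors, accessed as x.val[0] / x.val[1]
typedef float64x2x2_t npyv_f64x2;

// a non-NEON backend would use something like (hypothetical, simplified):
// typedef struct { __m128d val[2]; } npyv_f64x2;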

@DumbMice,

but it might be hard for me to do vld2q_f64 and vst2q_f64 with my limited knowledge of other intrinsics. Any help is welcome.

Here are some tips for implementing de/interleaving intrinsics for two 64-bit channels:

  • Create a new header named 'interleave.h' inside the NPYV directory for each SIMD extension,
    since this header is expected to grow to over 1000-2000 lines. Then include the new header inside the
    main 'simd.h' header, the same as we do with the other internal headers.

  • The new intrinsics should be named npyv_load_deinterleave_f64x2 and npyv_store_interleave_f64x2;
    the names are long, but they increase readability.

  • The final look of the NEON header should be as follows:

#ifndef NPY_SIMD
    #error "Not a standalone header"
#endif

#ifndef _NPY_SIMD_NEON_INTERLEAVE_H
#define _NPY_SIMD_NEON_INTERLEAVE_H

#if NPY_SIMD_F64
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{ return vld2q_f64(ptr); }

NPY_FINLINE void npyv_store_interleave_f64x2(double *ptr, npyv_f64x2 a)
{ vst2q_f64(ptr, a); }
#endif // NPY_SIMD_F64

#endif
  • NPYV has intrinsics for zip operations that can be used for both operations (interleave and deinterleave)
    on SSE and VSX. Check the following example:
#ifndef NPY_SIMD
    #error "Not a standalone header"
#endif

#include "memory.h" // load*
#include "reorder.h" // npyv_zip*

#ifndef _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H
#define _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H

NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 a = npyv_load_f64(ptr);
    npyv_f64 b = npyv_load_f64(ptr + npyv_nlanes_f64);
    return npyv_zip_f64(a, b);
}

NPY_FINLINE void npyv_store_interleave_f64x2(double *ptr, npyv_f64x2 a)
{
    npyv_f64x2 zip = npyv_zip_f64(a.val[0], a.val[1]);
    npyv_store_f64(ptr, zip.val[0]);
    npyv_store_f64(ptr + npyv_nlanes_f64, zip.val[1]);
}
#endif // _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H
  • For AVX2 and AVX512 you can still use npyv_zip_f64, but only for the "interleave" intrinsic, the same as in the previous code.

  • Deinterleave intrinsic implementation for AVX2

NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 ab0 = npyv_load_f64(ptr);
    npyv_f64 ab1 = npyv_load_f64(ptr + npyv_nlanes_f64);
    npyv_f64x2 ab, swap_halves = npyv_combine_f64(ab0, ab1);
    ab.val[0] = _mm256_unpacklo_pd(swap_halves.val[0], swap_halves.val[1]);
    ab.val[1] = _mm256_unpackhi_pd(swap_halves.val[0], swap_halves.val[1]);
    return ab;
}
  • Deinterleave intrinsic implementation for AVX512
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 ab0 = npyv_load_f64(ptr);
    npyv_f64 ab1 = npyv_load_f64(ptr + npyv_nlanes_f64);
    npyv_f64x2 ab;
    ab.val[0] = _mm512_permutex2var_pd(ab0, _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14), ab1);
    ab.val[1] = _mm512_permutex2var_pd(ab0, _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15), ab1);
    return ab;
}

EDIT: I forgot adding .val :), thanks @DumbMice!

Comment on lines 24 to 40
#ifdef NPY_HAVE_NEON
#include "lowlevel_strided_loops.h"
#include "arm_neon.h"

/* align var to alignment */
#define LOOP_BLOCK_ALIGN_VAR(var, type, alignment)\
npy_intp i, peel = npy_aligned_block_offset(var, sizeof(type),\
alignment, n);\
for(i = 0; i < peel; i++)

#define LOOP_BLOCKED(type, vsize)\
for(; i < npy_blocked_end(peel, sizeof(type), vsize, n);\
i += (vsize / sizeof(type)))

#define LOOP_BLOCKED_END\
for (; i < n; i++)
#endif
Review comment (Member):

Suggested change
(The suggested change removes the block above.)

No need to handle the offset of the non-aligned part, for the following reasons:

  • You're dealing with 64-bit loads; it is not going to make a difference even on old machines, and it is
    not worth adding new npyv intrinsics to handle aligned memory access for interleaving/deinterleaving.
  • NEON/ASIMD doesn't have intrinsics for aligned loads.
  • Except for stream writes (non-temporal) in the case of x86, modern machines and compilers can effectively handle non-aligned
    memory access even with the lowest memory alignments (1/2/4).

@DumbMice (Contributor, Author) commented Sep 6, 2020

@seiko2plus Thank you for your tips! I added the de/interleaving intrinsics to the universal intrinsics and used them in pocketfft. I only needed to change a[0] to a.val[0] to make npyv_f64x2 work, just like in NEON.

There are some subtleties in adding multiply-subtract intrinsics. For NEON intrinsics, vmlsq_f64(c, a, b) means c[i] - a[i]*b[i], where the negated product is accumulated; for other intrinsics, mul-sub(a, b, c) generally means a[i]*b[i] - c[i]. I assumed the second convention was to be adopted and made the corresponding alterations.
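
In scalar form the two conventions differ only in argument order and sign; a tiny sketch with hypothetical helper names (per-element semantics only):

// NEON-style accumulate form: vmlsq_f64(c, a, b) -> c - a*b per lane
static inline double neon_style_mls(double c, double a, double b)
{ return c - a * b; }

// "mul-sub" form assumed in this PR: mulsub(a, b, c) -> a*b - c per lane
static inline double mulsub_style(double a, double b, double c)
{ return a * b - c; }

// The two differ only by a sign: mulsub_style(a, b, c) == -neon_style_mls(c, a, b)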

@seiko2plus (Member) left a review comment:

add new intrinsic for negating multiply add

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice,

@seiko2plus Thank you for your tips! I added the de/interleaving intrinsics to the universal intrinsics and used them in pocketfft. I only needed to change a[0] to a.val[0] to make npyv_f64x2 work, just like in NEON.

My bad, I wrote that comment in a hurry; thanks for correcting me.

There are some subtleties in adding multiply-subtract intrinsics. For NEON intrinsics, vmlsq_f64(c, a, b) means c[i] - a[i]*b[i], where the negated product is accumulated; for other intrinsics, mul-sub generally means a[i]*b[i] - c[i]. I assumed the second convention was to be adopted and made the corresponding alterations.

You must add an intrinsic for negating the intermediate result, as I suggested.

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice, please wait, let me review my suggestion.
EDIT: I removed my suggestion in favor of implementing the fma3 intrinsics in a separate PR.

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice, I'm going to open a pull request implementing the fma3 intrinsics to remove any confusion; just give me several hours.

@seiko2plus (Member) commented Sep 6, 2020

@DumbMice, #17258 implements all the fused intrinsics.
After #17258 gets merged, you can use the intrinsic npyv_nmuladd_f64 (or its synonym npyv_submul_f64) without the need to add an extra negate.
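
As a sketch of how the fused intrinsics remove the extra negate in a complex twiddle multiply (illustrative only, assuming the post-#17258 semantics npyv_muladd_f64(a, b, c) = a*b + c and npyv_nmuladd_f64(a, b, c) = c - a*b, with the hypothetical helper name cmul_f64x2):

// (ar + i*ai) * (br + i*bi) on de-interleaved vectors
NPY_FINLINE npyv_f64x2 cmul_f64x2(npyv_f64x2 a, npyv_f64x2 b)
{
    npyv_f64x2 r;
    // real part: ar*br - ai*bi
    r.val[0] = npyv_nmuladd_f64(a.val[1], b.val[1], npyv_mul_f64(a.val[0], b.val[0]));
    // imaginary part: ar*bi + ai*br
    r.val[1] = npyv_muladd_f64(a.val[0], b.val[1], npyv_mul_f64(a.val[1], b.val[0]));
    return r;
}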

@DumbMice (Contributor, Author) commented Sep 8, 2020

@seiko2plus Thanks so much for your help. All macros are now expanded as raw universal intrinsics, as required.

Benchmark Comparison

[benchmark graphs: fft-compare-neon, usimd]

Similar performance is achieved. I might update with benchmarks on x86 systems later.

@DumbMice (Contributor, Author) commented Sep 9, 2020

Benchmark on x86

[benchmark graph: fft-compare-x86]

Compared with the results on the ARM machine, performance changes insignificantly on the x86 machine. Not sure if this is supposed to happen. The build and benchmark process on both systems was:

  1. python setup.py develop
  2. python runtests.py -v -j 8 --bench-compare master bench_fft
    where the uncommitted file bench_fft.py is used for benchmarking on both machines.

System Info

|  | Arm | x86 |
| --- | --- | --- |
| Hardware | KunPeng | Lenovo T470P |
| Processor | ARMv8 2.6 GHz, 8 processors | Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz |
| OS | Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64 | 18.04.1-Ubuntu |
| Compiler | gcc (GCC) 7.3.0 | gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |

Size Change

| File | before | after | change |
| --- | --- | --- | --- |
| _pocketfft_internal.cpython-37m-x86_64-linux-gnu.so | 392K | 408K | +16K |

@mattip (Member) commented Sep 9, 2020

If I understand the graphs correctly, SIMD on NEON only begins to really pay off for n>1_000_000 (log2n>20) or so?

x86 already used intrinsics before this PR, right? So I would hope there is no change.

@DumbMice (Contributor, Author) commented Sep 9, 2020

If I understand the graphs correctly, SIMD on NEON only begins to really pay off for n>1_000_000 (log2n>20) or so?

Yes. I suppose this is expected. The whole array is iterated over log(n) times, but each time the array is dissected into smaller and smaller subarrays, which is where the SIMD intrinsics operate. I suppose this is the nature of FFT's n·log(n) complexity.

x86 already used intrinsics before this PR, right? So I would hope there is no change.

I believe pocketfft is implemented in pure C; no intrinsics were used before this commit. I suppose an improvement is expected on x86?

@mattip (Member) commented Sep 9, 2020

Perhaps SIMD is competing with the compiler.

@charris (Member) commented Sep 9, 2020

Cache usage was historically the dominant factor in FFT timings; not sure how that holds up on modern architectures. NumPy's fft is also float64; float32 might benefit more.

@seiko2plus (Member) commented Sep 9, 2020

@DumbMice,

where the uncommitted file bench_fft.py is used for benchmarking on both machines.

Could you please provide bench_fft.py?

I believe pocketfft is implemented in pure C; no intrinsics were used before this commit. I suppose an improvement is expected on x86?

The compiler can easily auto-vectorize these kinds of loops; plus, you're not using fused intrinsics, since you're using NPYV within the domain of the default baseline features (SSE, SSE2, SSE3 on x86_64).

Could you please re-run your benchmarks on x86 with --cpu-baseline="avx2 fma3" to activate AVX2 with FMA3?

@seiko2plus (Member) commented Sep 9, 2020

@charris,

NumPy's fft is also float64; float32 might benefit more.

Especially complex float32, since the compiler will struggle to auto-vectorize interleaving/deinterleaving of single precision.

@DumbMice (Contributor, Author) commented:

Added bench_fft.py as requested.

Could you please re-run your benchmarks on x86 with --cpu-baseline="avx2 fma3" to activate AVX2 with FMA3?

I used the following command for the benchmark: python runtests.py --cpu-baseline="avx2 fma3" --bench-compare master bench_fft. Still, no significant improvement has been observed.

Maybe you are right: compilers are smart enough to auto-vectorize operations on f64 arrays using x86 instructions?

@Qiyu8 (Member) commented Sep 10, 2020

I think the correct commands are:

  1. Build with expected features: python setup.py build --cpu-baseline="avx2 fma3"
  2. Make sure that NPY_SIMD is 256: python runtests.py -v -t numpy/core/tests/test_fft.py -- -s
  3. Install to override the local: pip install .
  4. Run the benchmark: python runtests.py --bench-compare master bench_fft

@DumbMice (Contributor, Author) commented:

@Qiyu8 Thanks for your advice, but the benchmark still shows no improvement on x86.

In the build log, both the EXT and CLIB optimizations have SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 enabled in the CPU baseline.

Not sure how to check NPY_SIMD, but the 2nd step outputs: NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CNL?

By comparing _pocketfft.o.d in the build directory, I can confirm that the header files in the simd/avx folder are included when building the fft-neon branch and are not included on the master branch. So I think this might confirm the compiler's efficiency in optimizing the x86 code.

@seiko2plus (Member) commented:

@DumbMice,

Not sure how to check NPY_SIMD, but the 2nd step outputs: NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CNL?

Thank you for your report, this issue should be fixed by #17284.

To test AVX2 with FMA3 as the baseline, run:

python runtests.py --cpu-baseline="avx2 fma3" -t numpy/core/tests/test_fft.py

To benchmark AVX2 with FMA3 as the baseline through asv, run (after #17284):

python3 runtests.py --cpu-baseline="avx2 fma3" --bench-compare master bench_fft

@DumbMice (Contributor, Author) commented Sep 29, 2020

To see the results faster, I've applied the changes from #17284 in order to benchmark properly on my x86 machine locally. A time reduction of ~10% is shown when log2(N) is between 10 and 14; no improvement is seen elsewhere.
[benchmark graph: final2]
I cannot explain this. If others are willing to experiment on their x86 machines, I look forward to seeing their results.

Note: benchmarking bench_fft.py can be memory-consuming, and my laptop has 2x8 GB of memory.

@Qiyu8 added the "triage review" (Issue/PR to be discussed at the next triage meeting) label on Nov 20, 2020
@mattip (Member) commented Dec 16, 2020

We discussed this in the latest triage meeting and the feeling was to close this in favor of SciPy's implementation. NumPy has FFT as a "textbook implementation"; if people want a faster FFT they should use SciPy. If you feel strongly that this position is wrong, let us know. One interesting thing to try would be to use the NumPy 1.20 universal SIMD framework in SciPy; maybe it would provide a good reason to break the USIMD code out into a separate library.

@mattip closed this on Dec 16, 2020
@rossbar (Contributor) commented Dec 16, 2020

There are some loosely related discussions in #17839 and #17801.

Labels: 01 - Enhancement, component: SIMD, triage review