ENH: Add Neon SIMD implementations for add, sub, mul, and div #16969
Thanks for asking. It looks like that'd be the way to go here. Cc'ing @seiko2plus and @Qiyu8 as the universal intrinsics and Neon experts.
Thank you for your contribution.
We are now at the beginning of a new phase called "the universal intrinsics": a big name, but it is easy to use.
It may be hard to remove the existing x86 SIMD code, but at least new code should be written against the new interface.
We don't have documentation yet, but I think you can still find your way around it.
The following changes should enable SIMD on both ARM (ARMv7/ARMv8) and PPC64LE.
Note that ARMv7 has no double-precision SIMD support.
Also, I made these changes on the fly, so I may have made some mistakes.
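To illustrate the idea of one loop body serving several architectures, here is a minimal sketch. It is not NumPy's actual universal intrinsics interface: the `uv_` names, `UV_NLANES_F32`, and `mul_f32_loop` are hypothetical, made up for this example. Only the underlying Neon and SSE intrinsics are real.

```c
#include <stddef.h>

/* Hypothetical "universal" layer: the uv_* names are illustrative only
 * and are NOT part of NumPy. Each branch maps them to real intrinsics. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>
  typedef float32x4_t uv_f32;
  #define UV_NLANES_F32 4
  #define uv_load_f32(p)     vld1q_f32(p)
  #define uv_store_f32(p, v) vst1q_f32((p), (v))
  #define uv_mul_f32(a, b)   vmulq_f32((a), (b))
#elif defined(__SSE2__)
  #include <emmintrin.h>
  typedef __m128 uv_f32;
  #define UV_NLANES_F32 4
  #define uv_load_f32(p)     _mm_loadu_ps(p)
  #define uv_store_f32(p, v) _mm_storeu_ps((p), (v))
  #define uv_mul_f32(a, b)   _mm_mul_ps((a), (b))
#else
  /* Scalar fallback: a "vector" with a single lane. */
  typedef float uv_f32;
  #define UV_NLANES_F32 1
  #define uv_load_f32(p)     (*(p))
  #define uv_store_f32(p, v) (*(p) = (v))
  #define uv_mul_f32(a, b)   ((a) * (b))
#endif

/* Hypothetical loop written once against the uv_* layer:
 * out[i] = a[i] * b[i] for contiguous float32 arrays. */
static void mul_f32_loop(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    /* Full vectors: UV_NLANES_F32 elements per iteration. */
    for (; i + UV_NLANES_F32 <= n; i += UV_NLANES_F32) {
        uv_f32 va = uv_load_f32(a + i);
        uv_f32 vb = uv_load_f32(b + i);
        uv_store_f32(out + i, uv_mul_f32(va, vb));
    }
    /* Scalar tail for the elements that do not fill a vector. */
    for (; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

A float64 variant would need an extra guard in this sketch, because the Neon `float64x2_t` type is only available on AArch64, which matches the ARMv7 caveat above.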
LGTM, thank you!
Thank you for your detailed guidance! @seiko2plus
CI failure does not seem to be related to this PR. Could you run benchmarks on the final version of this PR, and also compare the size of |
Benchmark of the Last Commit
Results showed the current performance close to that of the initial commit.
File Size Change
File: numpy/core/_multiarray_umath.cpython-37m-aarch64-linux-gnu.so
On x86
Compare neon_intrinsics with master:
x86 System Info
Hardware: Lenovo T470P
Looks good. I think we should put this in, as a first use case. Correct me if I am wrong: this PR does not yet use the dispatch mechanism, nor does it replace the x86 loops with universal SIMD ones. It does set the baseline loops for add, sub, mul, div on arm64, power, and s390x to ones that use appropriate intrinsics for float32 and float64 loops.
@mattip, yes, you're right, except that we don't support z/Architecture (s390x) yet. The current universal intrinsics code is only enabled for non-x86 CPU features and only within the baseline features ('--cpu-baseline').
Thanks @DumbMice
Background:
There are already AVX and SSE2 SIMD implementations of add, subtract, multiply and divide. This PR reimplements the existing SIMD code for float32/64 using Neon intrinsics in order to add support for the ARM platform. As a result, the optimisation reduces computation time by 40%-70% on this specific ARM machine.
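For readers unfamiliar with Neon, the shape of such a loop can be sketched as follows. This is a simplified illustration for contiguous float32 input, not the code in this PR; `add_f32` and `HAVE_NEON` are hypothetical names, and the scalar fallback is there only so the example builds on non-ARM machines.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not a NumPy function): out[i] = a[i] + b[i]
 * for contiguous float32 arrays of length n. */
static void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Four float32 lanes per iteration in 128-bit Neon registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail; this is the whole loop when Neon is unavailable. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Subtract, multiply and divide would follow the same pattern with `vsubq_f32`, `vmulq_f32` and, on AArch64 only, `vdivq_f32`.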
System Info:
Benchmark
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Note
I have noticed a recent trend of replacing newly added platform-specific SIMD intrinsics in existing implementations with universal intrinsics, so I am not sure whether this is suitable for merging. If someone could help and guide me in making that change, I would really appreciate it.