ENH: Add Neon SIMD implementations for add, sub, mul, and div by DumbMice · Pull Request #16969 · numpy/numpy · GitHub

ENH: Add Neon SIMD implementations for add, sub, mul, and div #16969

Merged
merged 11 commits into from
Jul 31, 2020

Contributor
@DumbMice DumbMice commented Jul 28, 2020

Background:

AVX and SSE2 SIMD implementations already exist for add, subtract, multiply and divide. This PR reimplements the existing float32/float64 SIMD code with Neon intrinsics in order to add support for the ARM platform. As a result, the optimisation reduces computation time by 40%-70% on this specific ARM machine.
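
For readers unfamiliar with Neon, the following is a minimal sketch of what a float32 add loop written with Neon intrinsics can look like. It is not the code from this PR: the function name `add_f32` and the loop structure are assumptions made for illustration, and a scalar fallback is included so the sketch also builds on non-ARM machines.

```c
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not NumPy's actual loop: dst[i] = a[i] + b[i]. */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__aarch64__)
    /* Process four float32 lanes per iteration in 128-bit Neon registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail; on non-ARM builds this handles the whole array. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Calling `add_f32(out, a, b, n)` gives the same result as a plain scalar loop. The vector path only changes how many elements are handled per iteration.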

System Info:

  • Hardware: Kunpeng
  • Processor: ARMv8, 2.6 GHz, 8 processors
  • OS: Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64

Benchmark

before       after        ratio  bench
197±2μs      117±0.8μs    0.59   bench_ufunc.CustomInplace.time_double_add_temp
51.6±0.03μs  28.8±0.06μs  0.56   bench_ufunc.CustomScalar.time_divide_scalar2_inplace(<class 'numpy.float32'>)
50.5±0.06μs  27.3±0.03μs  0.54   bench_ufunc.CustomScalar.time_divide_scalar2(<class 'numpy.float32'>)
24.5±0.07μs  13.0±0.03μs  0.53   bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float64'>)
175±8μs      91.1±0.7μs   0.52   bench_ufunc.CustomInplace.time_double_add
346±2μs      124±0.7μs    0.36   bench_ufunc.CustomInplace.time_float_add_temp
24.5±0.4μs   8.32±0.04μs  0.34   bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float32'>)
321±1μs      96.2±0.6μs   0.30   bench_ufunc.CustomInplace.time_float_add

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Note

I have noticed a recent trend of replacing newly added SIMD intrinsics in existing implementations with universal intrinsics, so I am not sure whether this is suitable for merging. If someone could guide me through that conversion, I would really appreciate it.

@rgommers rgommers added the "01 - Enhancement" and "component: SIMD" labels Jul 28, 2020
@rgommers
Member

> I have noticed a recent trend of replacing newly added SIMD intrinsics in existing implementations with universal intrinsics, so I am not sure whether this is suitable for merging. If someone could guide me through that conversion, I would really appreciate it.

Thanks for asking. It looks like that'd be the way to go here. Cc'ing @seiko2plus and @Qiyu8 as the universal intrinsics and Neon experts.

Member
@seiko2plus seiko2plus left a comment

Thank you for your contribution.
We are now at the beginning of a new phase called "universal intrinsics": a big name, but very easy to use.
It may be hard to remove the existing x86 SIMD code, but new code should at least be written via the new interface.
We don't have documentation yet, but I guess you can still find your way around it.
The following changes should enable SIMD on both ARM (v7/v8) and PPC64LE.
Note that ARMv7 has no SIMD support for double precision.
Also, the changes were made on the fly, so I may have made some mistakes.
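
To illustrate the general idea of a universal intrinsics layer, here is a rough sketch in which one set of portable names maps onto Neon, SSE2, or plain scalar code at compile time. The `uv_` names and the `mul_f32` function are hypothetical stand-ins invented for this example (NumPy's own layer uses an `npyv_` prefix), and the sketch covers float32 only. A float64 variant on ARM would additionally need an AArch64 guard, since ARMv7 Neon has no double-precision vectors.

```c
#include <stddef.h>

/* Hypothetical "uv_" layer for illustration only; not NumPy's API. */
#if defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t uv_f32;
  #define UV_NLANES_F32 4
  #define uv_load_f32(p)      vld1q_f32(p)
  #define uv_store_f32(p, v)  vst1q_f32((p), (v))
  #define uv_mul_f32(a, b)    vmulq_f32((a), (b))
#elif defined(__SSE2__)
  #include <emmintrin.h>
  typedef __m128 uv_f32;
  #define UV_NLANES_F32 4
  #define uv_load_f32(p)      _mm_loadu_ps(p)
  #define uv_store_f32(p, v)  _mm_storeu_ps((p), (v))
  #define uv_mul_f32(a, b)    _mm_mul_ps((a), (b))
#else
  /* Scalar fallback: a "vector" of one lane. */
  typedef float uv_f32;
  #define UV_NLANES_F32 1
  #define uv_load_f32(p)      (*(p))
  #define uv_store_f32(p, v)  (*(p) = (v))
  #define uv_mul_f32(a, b)    ((a) * (b))
#endif

/* The loop is written once against the portable names above. */
static void mul_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + UV_NLANES_F32 <= n; i += UV_NLANES_F32) {
        uv_f32 va = uv_load_f32(a + i);
        uv_f32 vb = uv_load_f32(b + i);
        uv_store_f32(dst + i, uv_mul_f32(va, vb));
    }
    /* Remaining elements that do not fill a whole vector. */
    for (; i < n; ++i) {
        dst[i] = a[i] * b[i];
    }
}
```

The point of the design is that the loop body never names a platform: supporting a new architecture means adding one more branch to the mapping, not another copy of every loop.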

Member
@seiko2plus seiko2plus left a comment

LGTM, thank you!

@DumbMice
Contributor Author

Thank you for your detailed guidance! @seiko2plus

@mattip
Member
mattip commented Jul 30, 2020

CI failure does not seem to be related to this PR.

Could you run benchmarks on the final version of this PR, and also compare the size of _multiarray_umath*.so before and after? Extra points for benchmarking x86 as well as arm64.

@DumbMice
Contributor Author
DumbMice commented Jul 30, 2020

Benchmark of the Last Commit

before       after        ratio  bench
196±0.8μs    117±0.7μs    0.59   bench_ufunc.CustomInplace.time_double_add_temp
51.5±0.03μs  28.8±0.02μs  0.56   bench_ufunc.CustomScalar.time_divide_scalar2_inplace(<class 'numpy.float32'>)
166±2μs      91.5±2μs     0.55   bench_ufunc.CustomInplace.time_double_add
24.0±0.07μs  13.2±0.4μs   0.55   bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float64'>)
50.4±0.07μs  27.3±0.05μs  0.54   bench_ufunc.CustomScalar.time_divide_scalar2(<class 'numpy.float32'>)
347±2μs      126±0.6μs    0.36   bench_ufunc.CustomInplace.time_float_add_temp
24.4±0.6μs   8.50±0.07μs  0.35   bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float32'>)
321±0.4μs    95.0±0.4μs   0.30   bench_ufunc.CustomInplace.time_float_add

The results show that current performance is close to that of the initial commit.

File Size Change

File: numpy/core/_multiarray_umath.cpython-37m-aarch64-linux-gnu.so

      before   after    change
size  16528K   16589K   +61K

On x86

Compare neon_intrinsics with master:
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

x86 System Info

Hardware: Lenovo T470P
Processor: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
OS: 18.04.1-Ubuntu

@mattip
Member
mattip commented Jul 30, 2020

Looks good. I think we should put this in, as a first use case. Correct me if I am wrong: this PR does not yet use the dispatch mechanism, nor does it replace the x86 loops with universal simd ones. It does set the baseline loops for add, sub, mul, div on arm64, power, and s390x to ones that use appropriate intrinsics for float32 and float64 loops.

@eric-wieser eric-wieser changed the title from "ENH: Add Neon SIMD implmentation for add, sub, mul, div" to "ENH: Add Neon SIMD implementations for add, sub, mul, and div" Jul 30, 2020
@seiko2plus
Member

@mattip, yes, you're right, except that we don't support z/Architecture (s390x) yet. The current universal intrinsics code is only enabled for non-x86 CPU features, and only within the baseline features ('--cpu-baseline').

@mattip mattip merged commit 6f0436d into numpy:master Jul 31, 2020
@mattip
Member
mattip commented Jul 31, 2020

Thanks @DumbMice
