ENH: Add Neon SIMD implementations for add, sub, mul, and div #16969

DumbMice · 2020-07-28T14:20:47Z

Background:

SIMD implementations of add, subtract, multiply, and divide already exist for AVX and SSE2. This PR reimplements the existing float32/float64 SIMD code with Neon intrinsics in order to add support for ARM platforms. On the ARM machine described below, the optimisation reduces computation time by 40%-70%.
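To make the idea concrete, here is a minimal sketch of a float32 add loop written with Neon intrinsics. It is not the code from this PR: `add_f32` and `HAVE_NEON` are hypothetical names chosen for illustration, and the scalar fallback exists only so the snippet also compiles on non-ARM machines.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not NumPy's actual loop): out[i] = a[i] + b[i]. */
static void add_f32(float *out, const float *a, const float *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if HAVE_NEON
    /* Process four float32 lanes per iteration in 128-bit Neon registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail; this is also the whole loop on builds without Neon. */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Calling `add_f32(out, a, b, 7)` on seven-element arrays would run one vector iteration plus a three-element scalar tail on a Neon build.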

System Info:

Hardware: Kunpeng
Processor: ARMv8, 2.6 GHz, 8 processors
OS: Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64

Benchmark

before	after	ratio	bench
197±2μs	117±0.8μs	0.59	bench_ufunc.CustomInplace.time_double_add_temp
51.6±0.03μs	28.8±0.06μs	0.56	bench_ufunc.CustomScalar.time_divide_scalar2_inplace(<class 'numpy.float32'>)
50.5±0.06μs	27.3±0.03μs	0.54	bench_ufunc.CustomScalar.time_divide_scalar2(<class 'numpy.float32'>)
24.5±0.07μs	13.0±0.03μs	0.53	bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float64'>)
175±8μs	91.1±0.7μs	0.52	bench_ufunc.CustomInplace.time_double_add
346±2μs	124±0.7μs	0.36	bench_ufunc.CustomInplace.time_float_add_temp
24.5±0.4μs	8.32±0.04μs	0.34	bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float32'>)
321±1μs	96.2±0.6μs	0.30	bench_ufunc.CustomInplace.time_float_add

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
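As a quick check of how the ratio column relates to the 40%-70% figure, the arithmetic can be sketched as below. The function names are hypothetical; the only assumption is that the ratio is the "after" time divided by the "before" time, which is consistent with the rows above.

```c
#include <assert.h>

/* asv-style ratio: time after the change divided by time before it. */
static double timing_ratio(double before_us, double after_us)
{
    assert(before_us > 0.0);
    return after_us / before_us;
}

/* Percentage reduction in run time implied by the two timings. */
static double reduction_percent(double before_us, double after_us)
{
    return (1.0 - timing_ratio(before_us, after_us)) * 100.0;
}
```

For the row with the smallest gain (197 μs to 117 μs) this gives a reduction of about 41%, and for the row with the largest gain (321 μs to 96.2 μs) about 70%.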

Note

I have noticed a recent trend of replacing architecture-specific SIMD intrinsics in existing implementations with universal intrinsics, so I am not sure whether this PR is suitable for merging as is. If someone could guide me through that conversion, I would really appreciate it.

rgommers · 2020-07-28T14:25:22Z

> I have noticed a recent tendency of replacing newly added SIMD intrinsics of existing implementation with a universal intrinsics. So I am not sure if this is suitable for merging, but if some could help and guide me to do that I will really appreciate that.

Thanks for asking. It looks like that'd be the way to go here. Cc'ing @seiko2plus and @Qiyu8 as the universal intrinsics and Neon experts.

numpy/core/src/umath/simd.inc.src

seiko2plus

Thank you for your contribution.
We are now at the beginning of a new phase called "universal intrinsics": a big name, but very easy to use.
It may be hard to remove the existing x86 SIMD code, but at least new code should be written against the new interface.
We don't have documentation yet, but I guess you can still find your way around it.
The following changes should enable SIMD on both ARM (ARMv7/ARMv8) and PPC64LE.
Note that on ARMv7 there is no double-precision support.
Also, these changes were written on the fly, so I may have made some mistakes.
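For readers unfamiliar with the idea, the following is a rough illustration of what a "universal intrinsics" layer does; it is not the change suggested in this review and not NumPy's actual interface. All `uv_*` and `UV_*` names are hypothetical stand-ins: one set of macros is mapped onto Neon or SSE2 at compile time, so the loop body is written once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mini "universal" layer: same names, different backends. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t uv_f32;
#define UV_NLANES_F32 4
#define uv_load_f32(p)     vld1q_f32(p)
#define uv_store_f32(p, v) vst1q_f32((p), (v))
#define uv_mul_f32(a, b)   vmulq_f32((a), (b))
#elif defined(__SSE2__)
#include <emmintrin.h>
typedef __m128 uv_f32;
#define UV_NLANES_F32 4
#define uv_load_f32(p)     _mm_loadu_ps(p)
#define uv_store_f32(p, v) _mm_storeu_ps((p), (v))
#define uv_mul_f32(a, b)   _mm_mul_ps((a), (b))
#else
#define UV_NLANES_F32 0 /* no SIMD backend: scalar loop only */
#endif

/* Backend-independent loop: out[i] = a[i] * b[i]. */
static void mul_f32(float *out, const float *a, const float *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if UV_NLANES_F32 > 0
    for (; i + UV_NLANES_F32 <= n; i += UV_NLANES_F32) {
        uv_f32 va = uv_load_f32(a + i);
        uv_f32 vb = uv_load_f32(b + i);
        uv_store_f32(out + i, uv_mul_f32(va, vb));
    }
#endif
    /* Scalar tail, or the full loop when no backend is available. */
    for (; i < n; i++) {
        out[i] = a[i] * b[i];
    }
}
```

The design point is that `mul_f32` never names an architecture; supporting another instruction set would only mean adding another branch to the macro block.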

numpy/core/src/umath/simd.inc.src

seiko2plus

LGTM, thank you!

numpy/core/src/umath/simd.inc.src

DumbMice · 2020-07-30T02:38:19Z

Thank you for your detailed guidance! @seiko2plus

mattip · 2020-07-30T09:44:44Z

CI failure does not seem to be related to this PR.

Could you run benchmarks on the final version of this PR, and also compare the size of _multiarray_umath*.so before and after? Extra points for benchmarking x86 as well as arm64.

DumbMice · 2020-07-30T17:29:39Z

Benchmark of the Last Commit

before	after	ratio	bench
196±0.8μs	117±0.7μs	0.59	bench_ufunc.CustomInplace.time_double_add_temp
51.5±0.03μs	28.8±0.02μs	0.56	bench_ufunc.CustomScalar.time_divide_scalar2_inplace(<class 'numpy.float32'>)
166±2μs	91.5±2μs	0.55	bench_ufunc.CustomInplace.time_double_add
24.0±0.07μs	13.2±0.4μs	0.55	bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float64'>)
50.4±0.07μs	27.3±0.05μs	0.54	bench_ufunc.CustomScalar.time_divide_scalar2(<class 'numpy.float32'>)
347±2μs	126±0.6μs	0.36	bench_ufunc.CustomInplace.time_float_add_temp
24.4±0.6μs	8.50±0.07μs	0.35	bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float32'>)
321±0.4μs	95.0±0.4μs	0.30	bench_ufunc.CustomInplace.time_float_add

The results show that the current performance is close to that of the initial commit.

File Size Change

File : numpy/core/_multiarray_umath.cpython-37m-aarch64-linux-gnu.so

	before	after	change
size	16528K	16589K	+61K

On x86

Compare neon_intrinsics with master:
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

x86 System Info

Hardware: Lenovo T470P
Processor: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
OS: 18.04.1-Ubuntu

mattip · 2020-07-30T18:07:12Z

Looks good. I think we should put this in, as a first use case. Correct me if I am wrong: this PR does not yet use the dispatch mechanism, nor does it replace the x86 loops with universal simd ones. It does set the baseline loops for add, sub, mul, div on arm64, power, and s390x to ones that use appropriate intrinsics for float32 and float64 loops.

seiko2plus · 2020-07-30T21:21:24Z

@mattip, yes, you're right except we don't support z/architecture(s390x) yet. The current universal intrinsics code is only enabled for non-x86 CPU features and under the domain of the baseline features('--cpu-baseline').

mattip · 2020-07-31T05:25:41Z

Thanks @DumbMice

DumbMice added 2 commits July 28, 2020 18:03

ENH: Add Neon implmentation for add, sub, mul, div

ae244d7

Update simd.inc.src

ad20bab

rgommers added the "01 - Enhancement" and "component: SIMD" labels Jul 28, 2020

eric-wieser reviewed Jul 28, 2020

DumbMice added 2 commits July 29, 2020 10:15

delete the unused

6c32cd0

Update simd.inc.src

380029d

Qiyu8 requested changes Jul 29, 2020

DumbMice added 4 commits July 29, 2020 14:56

update; extract from scalar1&2

df7b199

update

867b2b9

update

8d1d95c

fix macros

708bf27

eric-wieser reviewed Jul 29, 2020

seiko2plus requested changes Jul 29, 2020

DumbMice added 2 commits July 30, 2020 01:09

transfer neon into universal intrinsics

96b6b13

avoid defining simd_binary functions for sse2-enabled machines

1286dc4

seiko2plus approved these changes Jul 29, 2020

eric-wieser reviewed Jul 29, 2020

ENH: Add Neon SIMD implmentation for add, sub, mul, div

a36153f

eric-wieser changed the title ~~ENH: Add Neon SIMD implmentation for add, sub, mul, div~~ ENH: Add Neon SIMD implementations for add, sub, mul, and div Jul 30, 2020

mattip merged commit 6f0436d into numpy:master Jul 31, 2020

DumbMice mentioned this pull request Sep 4, 2020

ENH: add neon intrinsics to fft radix2&4 #17231

Closed
