ENH: Improve Floating Point Cast Performance on ARM #28769
Conversation
@@ -11,6 +11,7 @@
 #define PY_SSIZE_T_CLEAN
 #include <Python.h>

+#include <arm_neon.h>
#include <arm_neon.h>

Have you tried using __fp16 and letting the compiler auto-vectorize the code? This would also be beneficial for non-contiguous access and supports both single/double conversions. Here's a proposed implementation:
diff --git a/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src b/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
index 1299e55b42..5b03e39ce2 100644
--- a/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
+++ b/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
@@ -708,6 +708,14 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
/************* STRIDED CASTING SPECIALIZED FUNCTIONS *************/
+#ifdef __ARM_FP16_FORMAT_IEEE
+ #define EMULATED_FP16 0
+ typedef __fp16 _npy_half;
+#else
+ #define EMULATED_FP16 1
+ typedef npy_half _npy_half;
+#endif
+
/**begin repeat
*
* #NAME1 = BOOL,
@@ -723,15 +731,15 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
* #type1 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_cfloat, npy_cdouble, npy_clongdouble#
* #rtype1 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_float, npy_double, npy_longdouble#
* #is_bool1 = 1, 0*17#
- * #is_half1 = 0*11, 1, 0*6#
+ * #is_half1 = 0*11, EMULATED_FP16, 0*6#
* #is_float1 = 0*12, 1, 0, 0, 1, 0, 0#
* #is_double1 = 0*13, 1, 0, 0, 1, 0#
* #is_complex1 = 0*15, 1*3#
@@ -752,15 +760,15 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
* #type2 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_cfloat, npy_cdouble, npy_clongdouble#
* #rtype2 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_float, npy_double, npy_longdouble#
* #is_bool2 = 1, 0*17#
- * #is_half2 = 0*11, 1, 0*6#
+ * #is_half2 = 0*11, EMULATED_FP16, 0*6#
* #is_float2 = 0*12, 1, 0, 0, 1, 0, 0#
* #is_double2 = 0*13, 1, 0, 0, 1, 0#
* #is_complex2 = 0*15, 1*3#
I haven't tested this yet, but it should leverage hardware FP16 support on ARM platforms when available while falling back to the emulated version elsewhere.
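For illustration, here is a minimal standalone sketch of what the __fp16 approach buys (the function names are illustrative, not NumPy's templated loop): with __fp16 as the element type, plain scalar casts compile to hardware half<->single conversions, and the loop becomes a straightforward auto-vectorization candidate.

/* Sketch only: scalar loops over __fp16 that the compiler can
 * auto-vectorize into fcvtl/fcvtn sequences on AArch64. */
#ifdef __ARM_FP16_FORMAT_IEEE
#include <stddef.h>

static void
half_to_float_contig(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];    /* hardware half -> single */
    }
}

static void
float_to_half_contig(const float *src, __fp16 *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (__fp16)src[i];   /* hardware single -> half */
    }
}
#endif  /* __ARM_FP16_FORMAT_IEEE */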
I had tried this - it is still 2x+ slower than the SIMD implementation for fp16->fp32. (See the linked issue for the baseline and SIMD results.)
Agreed, we can use the scalar path (native or emulated) for the other casts, non-contiguous access, and hardware that does not support Neon SIMD.
Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250418.6c7e63a
Timeit settings: repeat=100, number=1
Size (Elements) | Min Time (ms) | Median Time (ms)
1 | 0.000 | 0.000
10 | 0.000 | 0.000
100 | 0.000 | 0.000
1,000 | 0.001 | 0.001
10,000 | 0.005 | 0.005
100,000 | 0.045 | 0.049
1,000,000 | 0.545 | 0.586
10,000,000 | 5.870 | 6.087
100,000,000 | 71.549 | 73.567
Clang should auto-vectorize this even under -O2 flags: https://godbolt.org/z/zv8nqG9h9. Are you using GCC? If so, try using PR #28789 with the patch above - it should re-enable NPY_GCC_OPT_3 and NPY_GCC_UNROLL_LOOPS, as I just discovered they were disabled.
Try adding NPY_GCC_UNROLL_LOOPS alongside the current NPY_GCC_OPT_3 macro on the @prefix@_cast_@name1@_to_@name2@ function. This should help GCC better auto-vectorize the conversion loop.
I don't think we need to write raw SIMD for such a fundamental operation; with a few hints the compiler should probably handle it. If we find we do need explicit SIMD control, we should consider using Google Highway for a more generic and maintainable solution that can properly dispatch these functions across different architectures.
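As a rough sketch of what the macro suggestion amounts to (assuming GCC; the helper names below are illustrative and NumPy's macros may be defined slightly differently), the hints are per-function optimize attributes attached to the conversion loop:

#include <stddef.h>

/* Per-function hints similar to what NPY_GCC_OPT_3 / NPY_GCC_UNROLL_LOOPS
 * expand to on GCC; no-ops on other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
    #define OPT_O3       __attribute__((optimize("O3")))
    #define UNROLL_LOOPS __attribute__((optimize("unroll-loops")))
#else
    #define OPT_O3
    #define UNROLL_LOOPS
#endif

#ifdef __ARM_FP16_FORMAT_IEEE
static OPT_O3 UNROLL_LOOPS void
cast_half_to_float(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];
    }
}
#endif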
No, I am using Clang. It does attempt to vectorize it, but it ends up producing double the number of vector instructions compared to the SIMD version - that may be why it is 2x slower: https://godbolt.org/z/8986hv4hv. Need to dig deeper though ...
GCC does better than clang, though it is also not optimal:
Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250421.f223a15
Timeit settings: repeat=100, number=1
Size (Elements) | Min Time (ms) | Median Time (ms)
1 | 0.000 | 0.000
10 | 0.000 | 0.000
100 | 0.000 | 0.000
1,000 | 0.000 | 0.001
10,000 | 0.003 | 0.003
100,000 | 0.023 | 0.025
1,000,000 | 0.316 | 0.330
10,000,000 | 3.668 | 3.760
100,000,000 | 46.452 | 49.443
I haven't figured out how to make GCC/Clang optimize this better - let me know if you have any ideas. The current code performs much better than depending on either GCC or Clang.
NPY_GCC_UNROLL_LOOPS
This had no noticeable effect.
but ends up producing double the number of vector instructions than the SIMD version
You're referring to pair loading? That should actually provide better performance. Clang produces the same code for both kernels on -O3 with one exception: the raw SIMD version preserves 16-lane iteration (unnecessary overhead):
ldp q0, q1, [x9, #-16]
add x9, x9, #32
subs x11, x11, #16
fcvtl2 v2.4s, v0.8h
fcvtl v0.4s, v0.4h
fcvtl2 v3.4s, v1.8h
fcvtl v1.4s, v1.4h
stp q0, q2, [x10, #-32]
stp q1, q3, [x10], #64
On -O2, which is the default for NumPy sources, the auto-vectorized version on Clang is better due to pair loading.
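For comparison with the assembly above, the raw Neon kernel under discussion boils down to something like the following sketch (illustrative names, not the PR's exact code): each 8-lane half vector is widened with vcvt_f32_f16 / vcvt_high_f32_f16, which correspond to the fcvtl / fcvtl2 pairs shown above.

/* Sketch of a contiguous fp16 -> fp32 Neon kernel; the tail falls back
 * to scalar conversion. */
#if defined(__aarch64__) && defined(__ARM_FP16_FORMAT_IEEE)
#include <arm_neon.h>
#include <stddef.h>

static void
neon_half_to_float(const __fp16 *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        float16x8_t h = vld1q_f16(src + i);
        vst1q_f32(dst + i,     vcvt_f32_f16(vget_low_f16(h)));  /* fcvtl  */
        vst1q_f32(dst + i + 4, vcvt_high_f32_f16(h));           /* fcvtl2 */
    }
    for (; i < n; i++) {
        dst[i] = (float)src[i];
    }
}
#endif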
that may be why it is 2x slower: https://godbolt.org/z/8986hv4hv
After a second look, I realized I forgot to pass -ftrapping-math to Godbolt, which is enabled by NumPy's meson build for newer versions of Clang. GCC enables this by default; however, under -O3, GCC auto-vectorizes it, while Clang makes no changes at either the -O2 or -O3 optimization level.
By disabling strict FP exceptions per function, I was able to produce the expected auto-vectorized code. See: https://godbolt.org/z/3edTGezM1
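One way to express that per-function relaxation in source is sketched below, assuming Clang's fp pragma is acceptable here (the function name is illustrative and the actual PR may gate this differently). Restricting the relaxation to the cast kernel avoids changing FP-exception behavior for unrelated code in the same translation unit.

#include <stddef.h>

#ifdef __ARM_FP16_FORMAT_IEEE
static void
half_to_float_noexc(const __fp16 *src, float *dst, size_t n)
{
    /* Tell Clang it may ignore FP-exception side effects in this block,
     * overriding -ftrapping-math so the loop can vectorize. */
#if defined(__clang__)
    #pragma clang fp exceptions(ignore)
#endif
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];
    }
}
#endif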
GCC does better than clang, though it is also not optimal:
This is because Clang unrolls by x2 (pair loading) while GCC does not, which affects both the current raw SIMD and auto-vectorization implementations; manually unrolling the scalar loop might give GCC a better hint, I suppose.
Nice! Yes, the -ftrapping-math was inhibiting vectorization.
@seiko2plus: With your suggestion to use a pragma to enable auto-vectorization, we see an improvement of up to ~2x even for casts between fp32<->fp64 on M4. So overall floating point cast performance will be improved with this patch!
There are still two issues that we are seeing:
What is the difference between the native and baseline/asimd targets (both of which pass), and how can we detect this in the code so we can fall back to the emulated path if needed?
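On the detection question, here is a compile-time sketch of the kind of gating one could use (an assumption, not the PR's final logic; only the ACLE/x86 feature macros are real, the HAVE_* names are illustrative):

/* Hardware half<->float conversion is available with IEEE __fp16 on ARM,
 * or with F16C on x86; full FP16 vector arithmetic is a separate ARMv8.2-A
 * extension (present on "native" M-series builds but not the baseline
 * asimd target). Anything else falls back to the emulated npy_half path. */
#if defined(__ARM_FP16_FORMAT_IEEE) || defined(__F16C__)
    #define HAVE_HW_FP16_CAST 1
#else
    #define HAVE_HW_FP16_CAST 0
#endif

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    #define HAVE_HW_FP16_VECTOR_ARITH 1
#else
    #define HAVE_HW_FP16_VECTOR_ARITH 0
#endif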
we should file a bug with Clang. The compiler should be able to auto-vectorize as long as the enabled SIMD extension provides native instructions that respect IEEE semantics for these operations, even when strict FP exceptions (-ftrapping-math) are enabled.
This is another bug that needs to be reported. However, we should only disable floating point exceptions when we can guarantee that the enabled SIMD extension provides native conversion instructions for this operation. Emscripten would be challenging since the generated WebAssembly is cross-architecture with no guarantees about available hardware instructions.
I confirm
Here are some benchmarks which I ran locally (MacBook Air M4 + clang 19.1.7) to measure the performance improvement with this change:

Without Patch: Due to emulation, float16 performance is significantly worse compared to float32 and float64. We can do much better by taking advantage of native float16 support on the hardware.

With Patch: By enabling native float16 and vectorization, we see a huge improvement in float16 cast performance. The float16<->float32 path now outperforms all other paths by a good margin. Vectorization improves float32/float64 performance as well.

Here are the maximum speedups we can achieve with these changes (best case over 1000 runs):
float64 -> float32: 2.56x (at size 100,000)
@seiko2plus: PTAL, thanks
Nice performance improvements. The benchmarks are thorough and convincing. Thanks for your effort!
Wow, cool! Can this get a release note?
A quick grep indicates our existing asv benchmarks don't have great coverage for casting operations. If you're interested, it might be worth adding benchmarks too. Not necessary to merge this.
Sure, I will add a release note.
Sounds like a good idea. Will consider a follow-up issue for this.
I added the 2.3 milestone to ensure this doesn't get dropped before doing the release. I'm not enough of a SIMD expert to feel confident hitting the merge button on this one.
There's no raw SIMD involved.
No worries, I'll follow up if anything shows up. We've discovered several compiler bugs that need a deeper dig before being filed upstream, so I will update this PR later. Thank you Krishna and Nathan!
* WIP,Prototype: Use Neon SIMD to improve half->float cast performance [ci skip] [skip ci]
* Support Neon SIMD float32->float16 cast and update scalar path to use hardware cast
* Add missing header
* Relax VECTOR_ARITHMETIC check and add comment on need for SIMD routines
* Enable hardware cast on x86 when F16C is available
* Relax fp exceptions in Clang to enable vectorization for cast
* Ignore fp exceptions only for float casts
* Fix build
* Attempt to fix test failure on ARM64 native
* Work around gcc bug for double->half casts
* Add release note
This is meant as a prototype for initial performance analysis.