ENH: Improve Floating Point Cast Performance on ARM by f2013519 · Pull Request #28769 · numpy/numpy

ENH: Improve Floating Point Cast Performance on ARM #28769


Merged: 15 commits merged into numpy:main on Apr 29, 2025

Conversation

f2013519 (Contributor)

This is meant as a prototype for initial performance analysis.

@@ -11,6 +11,7 @@
#define PY_SSIZE_T_CLEAN
#include <Python.h>

#include <arm_neon.h>
@seiko2plus (Member), Apr 20, 2025

Suggested change (remove this line):
#include <arm_neon.h>

Have you tried using __fp16 and letting the compiler auto-vectorize the code? This would also be beneficial for non-contiguous access and supports both single/double conversions. Here's a proposed implementation:

diff --git a/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src b/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
index 1299e55b42..5b03e39ce2 100644
--- a/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
+++ b/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
@@ -708,6 +708,14 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
 
 /************* STRIDED CASTING SPECIALIZED FUNCTIONS *************/
 
+#ifdef __ARM_FP16_FORMAT_IEEE
+    #define EMULATED_FP16 0
+    typedef __fp16 _npy_half;
+#else
+    #define EMULATED_FP16 1
+    typedef npy_half _npy_half;
+#endif
+
 /**begin repeat
  *
  * #NAME1 = BOOL,
@@ -723,15 +731,15 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
  * #type1 = npy_bool,
  *          npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
  *          npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- *          npy_half, npy_float, npy_double, npy_longdouble,
+ *          _npy_half, npy_float, npy_double, npy_longdouble,
  *          npy_cfloat, npy_cdouble, npy_clongdouble#
  * #rtype1 = npy_bool,
  *           npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
  *           npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- *           npy_half, npy_float, npy_double, npy_longdouble,
+ *           _npy_half, npy_float, npy_double, npy_longdouble,
  *           npy_float, npy_double, npy_longdouble#
  * #is_bool1 = 1, 0*17#
- * #is_half1 = 0*11, 1, 0*6#
+ * #is_half1 = 0*11, EMULATED_FP16, 0*6#
  * #is_float1 = 0*12, 1, 0, 0, 1, 0, 0#
  * #is_double1 = 0*13, 1, 0, 0, 1, 0#
  * #is_complex1 = 0*15, 1*3#
@@ -752,15 +760,15 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
  * #type2 = npy_bool,
  *          npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
  *          npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- *          npy_half, npy_float, npy_double, npy_longdouble,
+ *          _npy_half, npy_float, npy_double, npy_longdouble,
  *          npy_cfloat, npy_cdouble, npy_clongdouble#
  * #rtype2 = npy_bool,
  *          npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
  *          npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- *          npy_half, npy_float, npy_double, npy_longdouble,
+ *          _npy_half, npy_float, npy_double, npy_longdouble,
  *          npy_float, npy_double, npy_longdouble#
  * #is_bool2 = 1, 0*17#
- * #is_half2 = 0*11, 1, 0*6#
+ * #is_half2 = 0*11, EMULATED_FP16, 0*6#
  * #is_float2 = 0*12, 1, 0, 0, 1, 0, 0#
  * #is_double2 = 0*13, 1, 0, 0, 1, 0#
  * #is_complex2 = 0*15, 1*3#

I haven't tested this yet, but it should leverage hardware FP16 support on ARM platforms when available while falling back to the emulated version elsewhere.
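For concreteness, a minimal sketch of the idea (the function below is illustrative, not the actual NumPy template code): on compilers that define __ARM_FP16_FORMAT_IEEE, a plain cast loop over __fp16 is something the auto-vectorizer can lower to FCVTL/FCVTN, while the emulated npy_half path remains the fallback elsewhere.

#include <stddef.h>

#ifdef __ARM_FP16_FORMAT_IEEE
/* Contiguous half->float conversion written with the native __fp16 type, so the
 * compiler is free to lower the scalar casts to FCVTL instructions on AArch64. */
static void
half_to_float_contig(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];   /* one hardware convert per element, no bit tricks */
    }
}
#endif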

f2013519 (Contributor, Author)

I had tried this - it is still 2x+ slower than the SIMD implementation for fp16->fp32. (See the linked issue for the baseline and SIMD results.)

Agreed - we can use the scalar path (native or emulated) for the other casts, for non-contiguous access, and when the hardware does not support Neon SIMD.

Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250418.6c7e63a
Timeit settings: repeat=100, number=1

| Size (Elements) | Min Time (ms) | Median Time (ms) |
|----------------:|--------------:|-----------------:|
|               1 |         0.000 |            0.000 |
|              10 |         0.000 |            0.000 |
|             100 |         0.000 |            0.000 |
|           1,000 |         0.001 |            0.001 |
|          10,000 |         0.005 |            0.005 |
|         100,000 |         0.045 |            0.049 |
|       1,000,000 |         0.545 |            0.586 |
|      10,000,000 |         5.870 |            6.087 |
|     100,000,000 |        71.549 |           73.567 |

@seiko2plus (Member), Apr 21, 2025

Clang should auto-vectorize this even at -O2: https://godbolt.org/z/zv8nqG9h9. Are you using GCC? If so, try PR #28789 with the patch above - it should re-enable NPY_GCC_OPT_3 and NPY_GCC_UNROLL_LOOPS, as I just discovered they were disabled.

Try to add NPY_GCC_UNROLL_LOOPS alongside the current NPY_GCC_OPT_3 macro to the @prefix@_cast_@name1@_to_@name2@ function. This should help GCC better auto-vectorize the conversion loop.

I don't think we need to write raw SIMD for such a fundamental operation; with a few hints the compiler should handle it. If we find we do need explicit SIMD control, we should consider using Google Highway for a more generic and maintainable solution that can properly dispatch these functions across different architectures.
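As a rough illustration of that suggestion (the kernel name and loop body below are placeholders, not the actual templated cast function), both macros expand to GCC function attributes and can be applied together to the conversion loop:

#include "numpy/npy_common.h"   /* assumed home of NPY_GCC_OPT_3 / NPY_GCC_UNROLL_LOOPS */

/* Illustrative cast kernel: the attributes ask GCC to optimize this function at
 * -O3 and unroll its loops, which gives the auto-vectorizer a better chance. */
static NPY_GCC_OPT_3 NPY_GCC_UNROLL_LOOPS void
cast_float_to_double(const float *src, double *dst, npy_intp n)
{
    for (npy_intp i = 0; i < n; i++) {
        dst[i] = (double)src[i];
    }
}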

@f2013519 (Contributor, Author), Apr 21, 2025

No, I am using Clang. It does attempt to vectorize it, but ends up producing double the number of vector instructions compared to the SIMD version - that may be why it is 2x slower: https://godbolt.org/z/8986hv4hv. Need to dig deeper though...

@f2013519 (Contributor, Author), Apr 21, 2025

GCC does better than Clang, though it is also not optimal:

Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250421.f223a15
Timeit settings: repeat=100, number=1

| Size (Elements) | Min Time (ms) | Median Time (ms) |
|----------------:|--------------:|-----------------:|
|               1 |         0.000 |            0.000 |
|              10 |         0.000 |            0.000 |
|             100 |         0.000 |            0.000 |
|           1,000 |         0.000 |            0.001 |
|          10,000 |         0.003 |            0.003 |
|         100,000 |         0.023 |            0.025 |
|       1,000,000 |         0.316 |            0.330 |
|      10,000,000 |         3.668 |            3.760 |
|     100,000,000 |        46.452 |           49.443 |

f2013519 (Contributor, Author)

I haven't figured out how to make GCC/Clang optimize this better; let me know if you have any ideas. The current code performs much better than relying on either compiler's auto-vectorization.

f2013519 (Contributor, Author)

NPY_GCC_UNROLL_LOOPS

This had no noticeable effect.

@seiko2plus (Member), Apr 22, 2025

but ends up producing double the number of vector instructions than the SIMD version

You're referring to pair loading? That should actually provide better performance. Clang produces the same code for both kernels on -O3 with one exception: the raw SIMD version preserves 16-lane iteration (unnecessary overhead):

  ldp    q0, q1, [x9, #-16]     // pair-load 16 fp16 lanes (2 x 128-bit registers)
  add    x9, x9, #32
  subs   x11, x11, #16          // 16 elements consumed per iteration
  fcvtl2 v2.4s, v0.8h           // widen upper 4 halves of v0 to fp32
  fcvtl  v0.4s, v0.4h           // widen lower 4 halves of v0 to fp32
  fcvtl2 v3.4s, v1.8h
  fcvtl  v1.4s, v1.4h
  stp    q0, q2, [x10, #-32]    // pair-store the converted fp32 lanes
  stp    q1, q3, [x10], #64

On -O2, which is the default for NumPy sources, the auto-vectorized version on Clang is better due to pair loading.

that may be why it is 2x slower: https://godbolt.org/z/8986hv4hv

After a second look, I realized I forgot to pass -ftrapping-math to Godbolt, which is enabled by numpy's meson build for newer versions of Clang. GCC enables this by default; however, under -O3, GCC auto-vectorizes it, while Clang makes no changes at either -O2 or -O3 optimization levels.

By disabling strict FP exceptions per function, I was able to produce the expected auto-vectorized code. See: https://godbolt.org/z/3edTGezM1

GCC does better than clang, though it is also not optimal:

This is because Clang unrolls by 2x (pair loading) while GCC does not, which affects both the current raw SIMD and the auto-vectorized implementations. Unrolling the scalar loop manually might give GCC a better hint, I suppose.
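A minimal sketch of the per-function FP-exception relaxation described above, assuming Clang's "#pragma clang fp exceptions(...)" is the mechanism (the actual patch may guard this differently):

/* Sketch only: relax strict FP-exception semantics inside this one function so
 * Clang's auto-vectorizer can emit FCVTL/FCVTN even though the translation unit
 * is built with -ftrapping-math. */
static void
half_to_float_loop(const __fp16 *src, float *dst, long n)
{
#if defined(__clang__)
    #pragma clang fp exceptions(ignore)
#endif
    for (long i = 0; i < n; i++) {
        dst[i] = (float)src[i];
    }
}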

f2013519 (Contributor, Author)

Nice! Yes, the -ftrapping-math was inhibiting vectorization.

@f2013519 changed the title from "Prototype: Use Neon SIMD for better fp16 -> fp32 cast performance" to "WIP: Use Neon SIMD for better fp16 -> fp32 cast performance" on Apr 20, 2025
@f2013519 changed the title from "WIP: Use Neon SIMD for better fp16 -> fp32 cast performance" to "WIP: Use Neon SIMD for better fp16 <-> fp32 cast performance" on Apr 20, 2025
@f2013519 changed the title from "WIP: Use Neon SIMD for better fp16 <-> fp32 cast performance" to "ENH: Use Hardware Cast for better fp16 <-> fp32 cast performance" on Apr 22, 2025
@f2013519 changed the title from "ENH: Use Hardware Cast for better fp16 <-> fp32 cast performance" to "ENH: Use Hardware Cast for better fp16 cast performance" on Apr 22, 2025
@f2013519 force-pushed the main branch 3 times, most recently from 15a7520 to ff8922d, April 22, 2025 08:51
@f2013519 changed the title from "ENH: Use Hardware Cast for better fp16 cast performance" to "ENH: Improve Floating Point Cast Performance" on Apr 22, 2025
@f2013519 (Contributor, Author)

@seiko2plus: With your suggestion to use a pragma to enable auto-vectorization, we see an improvement of up to ~2x even for casts between fp32<->fp64 on M4. So overall floating-point cast performance will be improved by this patch!

@f2013519 (Contributor, Author)

There are two issues that we are still seeing:

  1. Build error with Emscripten: The Emscripten compiler crashes when it encounters this pragma - this looks like a compiler bug in Emscripten. I have disabled the pragma for that compiler for now (see the guard sketch below).

  2. Test failure on Linux ARM64 SIMD (native target) - There is an fp16 test failure on this platform (which does not occur on other platforms or in local testing). From what I can tell, the test tries to lock down the rounding behavior of the fp32->fp16 and fp64->fp16 casts. For some reason, on this target there is a 1-bit roundoff error between the actual and expected results for the fp64->fp16 cast. This suggests that the hardware cast was likely not used and the conversion may have gone through fp64->fp32->fp16, introducing a rounding error.

What is the difference between the native and baseline/asimd targets (both of which pass), and how can we detect this in the code so that we can fall back to the emulated path if needed?
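For issue 1, a hypothetical guard (names and placement are illustrative, not the PR's actual code) that keeps the pragma for regular Clang targets but skips it under Emscripten, which crashes on it:

/* Illustrative guard only: Emscripten defines __EMSCRIPTEN__, so the pragma that
 * crashes its compiler can be excluded while other Clang targets keep it. */
#if defined(__clang__) && !defined(__EMSCRIPTEN__)
    #pragma clang fp exceptions(ignore)
#endif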

@f2013519 force-pushed the main branch 2 times, most recently from f547352 to 9545a50, April 23, 2025 05:14
@seiko2plus (Member) commented on Apr 23, 2025

With your suggestion to use pragma to enable auto-vectorization, we see improvement of up to ~2x even for casts between fp32<->fp64 on M4. So overall floating point cast performance will be improved with this patch!

We should file a bug with Clang. The compiler should be able to auto-vectorize as long as the enabled SIMD extension provides native instructions that respect IEEE semantics for these operations, even when -ftrapping-math is enabled.

Build error with Emscripten: The emscripten compiler crashes when it encounters this pragma - this looks like a compiler bug with emscripten. I have disabled the pragma for that compiler for now.

This is another bug that needs to be reported. However, we should only disable floating point exceptions when we can guarantee that the enabled SIMD extension provides native conversion instructions for this operation. Emscripten would be challenging since the generated WebAssembly is cross-architecture with no guarantees about available hardware instructions.

Test failure on Linux ARM64 SIMD (native target )- There is a fp16 test failure on this platform (which does not occur on other platforms/local testing). From what I can tell, the tests tries to lock down rounding behavior between fp32->fp16 and fp64->fp16 casts. For some reason on this target, there is a roundoff error of 1 bit between the actual and expected results for fp64->fp16 cast. This suggests that hardware cast was likely not used and the conversion may have happened through fp64->fp32->fp16 introducing a rounding error.

I can confirm the fp64->fp32->fp16 conversion; this appears to be a GCC bug that needs to be reported, specifically with -O3 optimization. You will need to disable NPY_GCC_OPT_3 when NPY_HAVE_SVE is defined. See https://godbolt.org/z/n6KEa3K4v for reference.
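A sketch of that workaround, assuming a helper macro along these lines (the macro name is illustrative, not the PR's actual code):

/* When GCC targets SVE, drop the per-function -O3 attribute to avoid the
 * mis-rounded fp64->fp16 auto-vectorization; otherwise keep the O3 hint. */
#if defined(NPY_HAVE_SVE) && defined(__GNUC__) && !defined(__clang__)
    #define CAST_LOOP_OPT              /* plain -O2 */
#else
    #define CAST_LOOP_OPT NPY_GCC_OPT_3
#endif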

What is the difference between native & baseline/asimd target (both of which pass) and how to detect this in the code so we can fall back to emulated path if needed?

test_native enables all CPU features supported by the host as part of the baseline (static dispatch). test_asimd sets asimd as the minimum baseline feature, which is actually the default. test_baseline_only disables any dynamic dispatching and keeps only static dispatching for the default baseline features.

@f2013519 (Contributor, Author)

Here are some benchmarks which I ran locally (MacBook Air M4 + clang 19.1.7) to measure the performance improvement with this change:

Without Patch:

[Benchmark chart: unoptimized, 1000 runs]

Due to emulation, float16 performance is significantly worse compared to float32 and float64. We can do much better by taking advantage of native float16 support on the hardware.

With Patch:

[Benchmark chart: optimized, 1000 runs]

By enabling native float16 and vectorization, we see a huge improvement in float16 cast performance. The float16<->float32 path now outperforms all other paths by a good margin. Vectorization improves float32/float64 performance as well.

Here are the maximum speedups we can achieve with these changes (best case over 1000 runs):

float64 -> float32: 2.56x (at size 100,000)
float64 -> float16: 15.63x (at size 100,000)
float32 -> float64: 3.00x (at size 10,000)
float32 -> float16: 24.80x (at size 100,000)
float16 -> float64: 10.00x (at size 10,000)
float16 -> float32: 19.80x (at size 100,000)

To summarize,

  1. We are able to achieve up to 24.8x better cast performance with float16
  2. float32/float64 performance is improved up to 3x

@f2013519 (Contributor, Author)

@seiko2plus: PTAL, thanks

@seiko2plus (Member) left a comment

Nice performance improvements. The benchmarks are thorough and convincing. Thanks for your effort!

@ngoldbaum added the "56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes" label on Apr 28, 2025
@ngoldbaum (Member)

Wow, cool!

Can this get a release note?

@ngoldbaum (Member)

A quick grep indicates our existing asv benchmarks don't have great coverage for casting operations. If you're interested, it might be worth adding benchmarks too. Not necessary to merge this.

@f2013519 (Contributor, Author)

Sure, I will add a release note.

@f2013519 (Contributor, Author)

A quick grep indicates our existing asv benchmarks don't have great coverage for casting operations. If you're interested, it might be worth adding benchmarks too. Not necessary to merge this.

Sounds like a good idea. Will consider a follow-up issue for this.

@ngoldbaum removed the "56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes" label on Apr 29, 2025
@ngoldbaum added this to the 2.3.0 release milestone on Apr 29, 2025
@ngoldbaum (Member)

I added the 2.3 milestone to ensure this doesn't get dropped before doing the release. I'm not enough of a SIMD expert to feel confident hitting the merge button on this one.

@seiko2plus changed the title from "ENH: Improve Floating Point Cast Performance" to "ENH: Improve Floating Point Cast Performance on ARM" on Apr 29, 2025
@seiko2plus merged commit d692fbc into numpy:main on Apr 29, 2025
73 checks passed
@seiko2plus (Member)

I'm not enough of a SIMD expert to feel confident hitting the merge button on this one.

There's no raw SIMD involved.

No worries, I'll follow up if anything shows up. We've discovered several compiler bugs that need a deeper dig before they can be filed upstream, so I will update this PR later.

Thank you, Krishna and Nathan!

MaanasArora pushed a commit to MaanasArora/numpy that referenced this pull request May 8, 2025
* WIP,Prototype: Use Neon SIMD to improve half->float cast performance
[ci skip] [skip ci]

* Support Neon SIMD float32->float16 cast and update scalar path to use hardware cast

* Add missing header

* Relax VECTOR_ARITHMETIC check and add comment on need for SIMD routines

* Enable hardware cast on x86 when F16C is available

* Relax fp exceptions in Clang to enable vectorization for cast

* Ignore fp exceptions only for float casts

* Fix build

* Attempt to fix test failure on ARM64 native

* Work around gcc bug for double->half casts

* Add release note
3 participants