ENH: Improve Floating Point Cast Performance on ARM #28769
Conversation
@@ -11,6 +11,7 @@
 #define PY_SSIZE_T_CLEAN
 #include <Python.h>

+#include <arm_neon.h>
#include <arm_neon.h>

Have you tried using __fp16 and letting the compiler auto-vectorize the code? This would also be beneficial for non-contiguous access and supports both single/double conversions. Here's a proposed implementation:
diff --git a/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src b/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
index 1299e55b42..5b03e39ce2 100644
--- a/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
+++ b/numpy/_core/src/multiarray/lowlevel_strided_loops.c.src
@@ -708,6 +708,14 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
/************* STRIDED CASTING SPECIALIZED FUNCTIONS *************/
+#ifdef __ARM_FP16_FORMAT_IEEE
+ #define EMULATED_FP16 0
+ typedef __fp16 _npy_half;
+#else
+ #define EMULATED_FP16 1
+ typedef npy_half _npy_half;
+#endif
+
/**begin repeat
*
* #NAME1 = BOOL,
@@ -723,15 +731,15 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
* #type1 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_cfloat, npy_cdouble, npy_clongdouble#
* #rtype1 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_float, npy_double, npy_longdouble#
* #is_bool1 = 1, 0*17#
- * #is_half1 = 0*11, 1, 0*6#
+ * #is_half1 = 0*11, EMULATED_FP16, 0*6#
* #is_float1 = 0*12, 1, 0, 0, 1, 0, 0#
* #is_double1 = 0*13, 1, 0, 0, 1, 0#
* #is_complex1 = 0*15, 1*3#
@@ -752,15 +760,15 @@ NPY_NO_EXPORT PyArrayMethod_StridedLoop *
* #type2 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_cfloat, npy_cdouble, npy_clongdouble#
* #rtype2 = npy_bool,
* npy_ubyte, npy_ushort, npy_uint, npy_ulong, npy_ulonglong,
* npy_byte, npy_short, npy_int, npy_long, npy_longlong,
- * npy_half, npy_float, npy_double, npy_longdouble,
+ * _npy_half, npy_float, npy_double, npy_longdouble,
* npy_float, npy_double, npy_longdouble#
* #is_bool2 = 1, 0*17#
- * #is_half2 = 0*11, 1, 0*6#
+ * #is_half2 = 0*11, EMULATED_FP16, 0*6#
* #is_float2 = 0*12, 1, 0, 0, 1, 0, 0#
* #is_double2 = 0*13, 1, 0, 0, 1, 0#
* #is_complex2 = 0*15, 1*3#
I haven't tested this yet, but it should leverage hardware FP16 support on ARM platforms when available while falling back to the emulated version elsewhere.
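For illustration, here is a minimal standalone sketch of what the __fp16 approach buys (the function names are illustrative, not NumPy's templated loop): with __fp16 as the element type, plain scalar casts compile to hardware half<->single conversions, and the loop becomes a straightforward auto-vectorization candidate.

/* Sketch only: scalar loops over __fp16 that the compiler can
 * auto-vectorize into fcvtl/fcvtn sequences on AArch64. */
#ifdef __ARM_FP16_FORMAT_IEEE
#include <stddef.h>

static void
half_to_float_contig(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];    /* hardware half -> single */
    }
}

static void
float_to_half_contig(const float *src, __fp16 *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (__fp16)src[i];   /* hardware single -> half */
    }
}
#endif  /* __ARM_FP16_FORMAT_IEEE */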
I had tried this - it is still 2x+ slower than the SIMD implementation for fp16->fp32. (See the linked issue for the baseline and SIMD results.)
Agreed, we can use the scalar path (native or emulated) for the other casts, non-contiguous access, and hardware that does not support Neon SIMD.
Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250418.6c7e63a
Timeit settings: repeat=100, number=1
Size (Elements) | Min Time (ms) | Median Time (ms)
1 | 0.000 | 0.000
10 | 0.000 | 0.000
100 | 0.000 | 0.000
1,000 | 0.001 | 0.001
10,000 | 0.005 | 0.005
100,000 | 0.045 | 0.049
1,000,000 | 0.545 | 0.586
10,000,000 | 5.870 | 6.087
100,000,000 | 71.549 | 73.567
Clang should auto-vectorize this even under -O2 flags: https://godbolt.org/z/zv8nqG9h9. Are you using GCC? If so, try using PR #28789 with the patch above - it should re-enable NPY_GCC_OPT_3 and NPY_GCC_UNROLL_LOOPS, as I just discovered they were disabled.
Try adding NPY_GCC_UNROLL_LOOPS alongside the current NPY_GCC_OPT_3 macro on the @prefix@_cast_@name1@_to_@name2@ function. This should help GCC better auto-vectorize the conversion loop.
I don't think we need to write raw SIMD for such a fundamental operation; with a few hints the compiler should probably handle it. If we find we do need explicit SIMD control, we should consider using Google Highway for a more generic and maintainable solution that can properly dispatch these functions across different architectures.
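As a rough sketch of what the macro suggestion amounts to (assuming GCC; the helper names below are illustrative and NumPy's macros may be defined slightly differently), the hints are per-function optimize attributes attached to the conversion loop:

#include <stddef.h>

/* Per-function hints similar to what NPY_GCC_OPT_3 / NPY_GCC_UNROLL_LOOPS
 * expand to on GCC; no-ops on other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
    #define OPT_O3       __attribute__((optimize("O3")))
    #define UNROLL_LOOPS __attribute__((optimize("unroll-loops")))
#else
    #define OPT_O3
    #define UNROLL_LOOPS
#endif

#ifdef __ARM_FP16_FORMAT_IEEE
static OPT_O3 UNROLL_LOOPS void
cast_half_to_float(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];
    }
}
#endif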
No, I am using Clang. It does attempt to vectorize it, but it ends up producing double the number of vector instructions compared to the SIMD version - that may be why it is 2x slower: https://godbolt.org/z/8986hv4hv. Need to dig deeper though ...
GCC does better than clang, though it is also not optimal:
Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250421.f223a15
Timeit settings: repeat=100, number=1
Size (Elements) | Min Time (ms) | Median Time (ms)
1 | 0.000 | 0.000
10 | 0.000 | 0.000
100 | 0.000 | 0.000
1,000 | 0.000 | 0.001
10,000 | 0.003 | 0.003
100,000 | 0.023 | 0.025
1,000,000 | 0.316 | 0.330
10,000,000 | 3.668 | 3.760
100,000,000 | 46.452 | 49.443
I haven't figured out how to make GCC/Clang optimize this better - let me know if you have any ideas. The current code performs much better than depending on either GCC or Clang.
NPY_GCC_UNROLL_LOOPS
This had no noticeable effect.
but ends up producing double the number of vector instructions than the SIMD version
You're referring to pair loading? That should actually provide better performance. Clang produces the same code for both kernels on -O3 with one exception: the raw SIMD version preserves 16-lane iteration (unnecessary overhead):
ldp q0, q1, [x9, #-16]
add x9, x9, #32
subs x11, x11, #16
fcvtl2 v2.4s, v0.8h
fcvtl v0.4s, v0.4h
fcvtl2 v3.4s, v1.8h
fcvtl v1.4s, v1.4h
stp q0, q2, [x10, #-32]
stp q1, q3, [x10], #64
On -O2, which is the default for NumPy sources, the auto-vectorized version on Clang is better due to pair loading.
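For comparison with the assembly above, the raw Neon kernel under discussion boils down to something like the following sketch (illustrative names, not the PR's exact code): each 8-lane half vector is widened with vcvt_f32_f16 / vcvt_high_f32_f16, which correspond to the fcvtl / fcvtl2 pairs shown above.

/* Sketch of a contiguous fp16 -> fp32 Neon kernel; the tail falls back
 * to scalar conversion. */
#if defined(__aarch64__) && defined(__ARM_FP16_FORMAT_IEEE)
#include <arm_neon.h>
#include <stddef.h>

static void
neon_half_to_float(const __fp16 *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        float16x8_t h = vld1q_f16(src + i);
        vst1q_f32(dst + i,     vcvt_f32_f16(vget_low_f16(h)));  /* fcvtl  */
        vst1q_f32(dst + i + 4, vcvt_high_f32_f16(h));           /* fcvtl2 */
    }
    for (; i < n; i++) {
        dst[i] = (float)src[i];
    }
}
#endif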
that may be why it is 2x slower: https://godbolt.org/z/8986hv4hv
After a second look, I realized I forgot to pass -ftrapping-math to Godbolt, which is enabled by NumPy's meson build for newer versions of Clang. GCC enables this by default; however, under -O3, GCC auto-vectorizes it, while Clang makes no changes at either the -O2 or -O3 optimization level.
By disabling strict FP exceptions per function, I was able to produce the expected auto-vectorized code. See: https://godbolt.org/z/3edTGezM1
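One way to express that per-function relaxation in source is sketched below, assuming Clang's fp pragma is acceptable here (the function name is illustrative and the actual PR may gate this differently). Restricting the relaxation to the cast kernel avoids changing FP-exception behavior for unrelated code in the same translation unit.

#include <stddef.h>

#ifdef __ARM_FP16_FORMAT_IEEE
static void
half_to_float_noexc(const __fp16 *src, float *dst, size_t n)
{
    /* Tell Clang it may ignore FP-exception side effects in this block,
     * overriding -ftrapping-math so the loop can vectorize. */
#if defined(__clang__)
    #pragma clang fp exceptions(ignore)
#endif
    for (size_t i = 0; i < n; i++) {
        dst[i] = (float)src[i];
    }
}
#endif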
GCC does better than clang, though it is also not optimal:
This is because Clang unrolls by x2 (pair loading) while GCC does not, which affects both the current raw SIMD and auto-vectorization implementations; manually unrolling the scalar loop might give GCC a better hint, I suppose.
Nice! Yes, the -ftrapping-math was inhibiting vectorization.
@seiko2plus: With your suggestion to use a pragma to enable auto-vectorization, we see an improvement of up to ~2x even for casts between fp32<->fp64 on M4. So overall floating point cast performance will be improved with this patch!
There are still two issues that we are seeing:
What is the difference between the native and baseline/asimd targets (both of which pass), and how can we detect this in the code so we can fall back to the emulated path if needed?
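On the detection question, here is a compile-time sketch of the kind of gating one could use (an assumption, not the PR's final logic; only the ACLE/x86 feature macros are real, the HAVE_* names are illustrative):

/* Hardware half<->float conversion is available with IEEE __fp16 on ARM,
 * or with F16C on x86; full FP16 vector arithmetic is a separate ARMv8.2-A
 * extension (present on "native" M-series builds but not the baseline
 * asimd target). Anything else falls back to the emulated npy_half path. */
#if defined(__ARM_FP16_FORMAT_IEEE) || defined(__F16C__)
    #define HAVE_HW_FP16_CAST 1
#else
    #define HAVE_HW_FP16_CAST 0
#endif

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    #define HAVE_HW_FP16_VECTOR_ARITH 1
#else
    #define HAVE_HW_FP16_VECTOR_ARITH 0
#endif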
we should file a bug with Clang. The compiler should be able to auto-vectorize as long as the enabled SIMD extension provides native instructions that respect IEEE semantics for these operations, even when strict FP exceptions (-ftrapping-math) are enabled.
This is another bug that needs to be reported. However, we should only disable floating point exceptions when we can guarantee that the enabled SIMD extension provides native conversion instructions for this operation. Emscripten would be challenging since the generated WebAssembly is cross-architecture with no guarantees about available hardware instructions.
I confirm
Here are some benchmarks which I ran locally (MacBook Air M4 + clang 19.1.7) to measure the performance improvement with this change:

Without Patch: Due to emulation, float16 performance is significantly worse compared to float32 and float64. We can do much better by taking advantage of native float16 support on the hardware.

With Patch: By enabling native float16 and vectorization, we see a huge improvement in float16 cast performance. The float16<->float32 path now outperforms all other paths by a good margin. Vectorization improves float32/float64 performance as well.

Here are the maximum speedups we can achieve with these changes (best case over 1000 runs):
float64 -> float32: 2.56x (at size 100,000)
@seiko2plus: PTAL, thanks
Nice performance improvements. The benchmarks are thorough and convincing. Thanks for your effort!
Wow, cool! Can this get a release note?
A quick grep indicates our existing asv benchmarks don't have great coverage for casting operations. If you're interested, it might be worth adding benchmarks too. Not necessary to merge this.
Sure, I will add a release note.
Sounds like a good idea. Will consider a follow-up issue for this.
I added the 2.3 milestone to ensure this doesn't get dropped before doing the release. I'm not enough of a SIMD expert to feel confident hitting the merge button on this one.
There's no raw SIMD involved.
No worries, I'll follow up if anything shows up. We've discovered several compiler bugs that need a deeper dig before being filed upstream, so I will update this PR later. Thank you Krishna and Nathan!
* WIP,Prototype: Use Neon SIMD to improve half->float cast performance [ci skip] [skip ci]
* Support Neon SIMD float32->float16 cast and update scalar path to use hardware cast
* Add missing header
* Relax VECTOR_ARITHMETIC check and add comment on need for SIMD routines
* Enable hardware cast on x86 when F16C is available
* Relax fp exceptions in Clang to enable vectorization for cast
* Ignore fp exceptions only for float casts
* Fix build
* Attempt to fix test failure on ARM64 native
* Work around gcc bug for double->half casts
* Add release note
This is meant as a prototype for initial performance analysis.