ENH: Improve poor np.float16 performance #28753
Comments
Please! PRs welcome :)
Please feel free to submit the prototype you have as a pull request, and could you please share the benchmarks you care about? That would help narrow down the future to-do list.
As a quick example, consider casting fp16 arrays to fp32 arrays of various sizes. Here is the performance I get on a MacBook Air M4:

Platform: Darwin / arm64 / arm
Starting NumPy Float16 -> Float32 Cast Benchmark...
Size (Elements) | Min Time (ms) | Median Time (ms)
With an optimized implementation, we see ~4x-10x improvements:

Size (Elements) | Min Time (ms) | Median Time (ms)
Benchmark Script (convert to .py before running):
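The attached script is not reproduced here, but a minimal sketch of this kind of benchmark (assuming timeit-based timing and an illustrative set of sizes, not the exact ones used above) might look like:

```python
# Hypothetical sketch of a fp16 -> fp32 cast benchmark (not the attached script).
import platform
import timeit

import numpy as np

SIZES = [10**3, 10**4, 10**5, 10**6, 10**7]  # assumed sizes, for illustration
REPEATS = 50     # number of timing repeats per size
CALLS = 10       # casts per timing sample

print(f"Platform: {platform.system()} / {platform.machine()} / {platform.processor()}")
print("Size (Elements) | Min Time (ms) | Median Time (ms)")

for n in SIZES:
    a = np.random.rand(n).astype(np.float16)
    # Each sample times CALLS casts; convert to per-cast milliseconds.
    samples = timeit.repeat(lambda: a.astype(np.float32), number=CALLS, repeat=REPEATS)
    per_call_ms = sorted(t / CALLS * 1e3 for t in samples)
    print(f"{n:>15} | {per_call_ms[0]:13.4f} | {per_call_ms[len(per_call_ms) // 2]:16.4f}")
```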
See #28769 for a prototype implementation ...
Proposed new feature or change:
float16 is an IEEE 754 floating point type that enjoys wide support across hardware platforms, including CPUs (x86, ARM) and GPUs. It has applications in many fields (ML, signal processing, computer vision).
It can be used either as a storage type or a compute type, depending on hardware capability.
The use as a storage type is of particular interest, since it halves the data size compared to fp32, freeing up storage and memory bandwidth. Data can be converted to fp32 for computation and stored back in fp16 once the computation is complete; this is a common pattern.
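As a rough illustration of that pattern (a minimal sketch; the workload here is just a placeholder):

```python
import numpy as np

# fp16 in memory / on disk: half the footprint of fp32.
stored = np.random.rand(1_000_000).astype(np.float16)

# Widen to fp32 for computation (this cast is the hot path discussed below),
# do the work in fp32, then narrow back to fp16 for storage.
work = stored.astype(np.float32)
work = np.sqrt(work) * 2.0 + 1.0          # placeholder computation
stored = work.astype(np.float16)

print(stored.dtype, stored.nbytes, "bytes")   # float16, 2 bytes per element
```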
In quick benchmarks against other implementations (PyTorch, Julia), NumPy is consistently 4x-6x slower or more across the board for fp16<=>fp32 conversions on Apple silicon (M4).
A quick look at the NumPy implementation shows that fp16<=>fp32 casts do not take advantage of the underlying hardware support or SIMD intrinsics; instead, the cast is emulated by manipulating the underlying bit pattern.
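For context, that emulation boils down to rebiasing the exponent and widening the mantissa with integer operations. A simplified Python sketch of the idea (not NumPy's actual C implementation; subnormal inputs are flushed to zero here for brevity):

```python
import numpy as np

def half_bits_to_float32(h16):
    """Illustrative fp16 -> fp32 conversion via integer bit manipulation.
    h16: array of uint16 half-precision bit patterns."""
    h = h16.astype(np.uint32)
    sign = (h & 0x8000) << 16            # sign bit moves from bit 15 to bit 31
    exp = (h >> 10) & 0x1F               # 5-bit exponent
    man = h & 0x03FF                     # 10-bit mantissa

    # Rebias the exponent (15 -> 127): e + 112; map inf/nan (0x1F) to 0xFF,
    # and flush zeros/subnormals (exp == 0) to zero for brevity.
    f_exp = np.where(exp == 0x1F, np.uint32(0xFF), exp + np.uint32(112))
    f_exp = np.where(exp == 0, np.uint32(0), f_exp)
    f_man = np.where(exp == 0, np.uint32(0), man << np.uint32(13))

    bits = sign | (f_exp << np.uint32(23)) | f_man
    return bits.view(np.float32)

x = np.array([0.0, 1.5, -2.0, 65504.0, np.inf], dtype=np.float16)
print(half_bits_to_float32(x.view(np.uint16)))   # matches x.astype(np.float32)
```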
By making some quick prototype changes to use ARM NEON intrinsics, it is possible to get much better performance (on par with other implementations).
Given the wide availability of hardware with native fp16 support these days, can we add better fp16 support across the platforms that provide it? Improving cast performance would be the first priority.