ENH: Improve poor np.float16 performance · Issue #28753 · numpy/numpy

Open
f2013519 opened this issue Apr 17, 2025 · 4 comments

Comments

@f2013519
Contributor

Proposed new feature or change:

float16 is an IEEE 754 floating-point type that enjoys wide support across a variety of hardware platforms, including CPUs (x86, ARM) and GPUs. It has applications in many fields (ML, signal processing, computer vision).

It can be used either as a storage type or a compute type depending on the hardware capability.

Use as a storage type is of particular interest, since it halves the data size compared to fp32, freeing up storage and memory bandwidth. Data can be converted to fp32 for computation and stored back in fp16 once the computation is complete; this is a common pattern.
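
For illustration, here is a minimal sketch of that pattern at the C level (assuming a compiler that provides the __fp16 extension, e.g. GCC/Clang targeting ARM; the function name is hypothetical):

#include <stddef.h>

/* Sketch: data stays in fp16 for storage; each element is widened to fp32
 * for the computation and narrowed back to fp16 afterwards. */
void scale_fp16(__fp16 *data, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++) {
        float x = (float)data[i];   /* fp16 -> fp32 for the computation */
        x *= factor;
        data[i] = (__fp16)x;        /* result stored back as fp16       */
    }
}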

In some quick benchmarks against other implementations (PyTorch, Julia), NumPy is consistently slower by a factor of 4x-6x or more across the board on Apple silicon (M4) for fp16<=>fp32 conversions.

Looking quickly at the NumPy implementation, the fp16<=>fp32 casts do not take advantage of the underlying hardware support and SIMD intrinsics. Instead, the cast is emulated by manipulating the underlying bit patterns.
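
For reference, a purely scalar bit-pattern conversion looks roughly like the sketch below (illustrative only, not NumPy's actual code; the reverse fp32->fp16 direction additionally has to handle rounding and overflow):

#include <stdint.h>
#include <string.h>

/* Sketch of a half -> float conversion done entirely on bit patterns. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x03FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);           /* inf / NaN */
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (man << 13);  /* re-bias 15 -> 127 */
    } else if (man == 0) {
        bits = sign;                                       /* signed zero */
    } else {
        /* subnormal half: renormalize into a normal float */
        int shift = 0;
        while (!(man & 0x0400u)) {
            man <<= 1;
            shift++;
        }
        man &= 0x03FFu;
        bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}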

By making some quick prototype changes to use ARM NEON intrinsics, it is possible to get much better performance (on par with other implementations).
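
As a rough illustration, the hardware path boils down to something like the sketch below (a hypothetical helper, not the actual prototype; it assumes an AArch64 toolchain where <arm_neon.h> provides the fp16 conversion intrinsics):

#include <arm_neon.h>
#include <stddef.h>

/* Sketch: convert half-precision values to single precision four at a time. */
static void half_to_float_neon(const float16_t *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float16x4_t h = vld1_f16(src + i);  /* load 4 fp16 values             */
        float32x4_t f = vcvt_f32_f16(h);    /* hardware fp16 -> fp32 widening */
        vst1q_f32(dst + i, f);              /* store 4 fp32 values            */
    }
    for (; i < n; i++) {
        dst[i] = (float)src[i];             /* scalar tail                    */
    }
}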

Given the wide availability of hardware with native fp16 these days, can we add better support for fp16 across the different platforms that provide native hardware support? The priority would be improving cast performance first.

@ngoldbaum
Member

can we add better support for fp16 across the different platforms that provide native hardware support?

Please! PRs welcome :)

@r-devulap
Member

In some quick benchmarks against other implementations (PyTorch, Julia), NumPy is consistently slower by a factor of 4x-6x or more across the board on Apple silicon (M4) for fp16<=>fp32 conversions.

By making some quick prototype changes to use ARM NEON intrinsics, it is possible to get much better performance (on par with other implementations).

Please feel free to submit your prototype as a pull request, and could you also share the benchmarks you care about? That would help narrow down the future to-do list.

@f2013519
Contributor Author

As a quick example, consider casting fp16 arrays of various sizes to fp32.

Here is the performance I get on a MacBook Air (M4):

Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250418.6c7e63a
Timeit settings: repeat=100, number=1

Starting NumPy Float16 -> Float32 Cast Benchmark...

Size (Elements) | Min Time (ms) | Median Time (ms)
              1 |         0.000 |            0.000
             10 |         0.000 |            0.000
            100 |         0.000 |            0.000
          1,000 |         0.001 |            0.001
         10,000 |         0.011 |            0.011
        100,000 |         0.090 |            0.098
      1,000,000 |         0.989 |            1.030
     10,000,000 |        10.058 |           10.476
    100,000,000 |       116.131 |          121.298

With an optimized implementation, we see ~4x-10x improvements:


Size (Elements) | Min Time (ms) | Median Time (ms)
              1 |         0.000 |            0.000
             10 |         0.000 |            0.000
            100 |         0.000 |            0.000
          1,000 |         0.000 |            0.000
         10,000 |         0.001 |            0.001
        100,000 |         0.005 |            0.005
      1,000,000 |         0.156 |            0.177
     10,000,000 |         1.980 |            2.108
    100,000,000 |        29.158 |           33.682

Benchmark Script (convert to .py before running):

cast_bench.txt

@f2013519
Contributor Author

See #28769 for a prototype implementation ...
