ENH: Improve poor np.float16 performance · Issue #28753 · numpy/numpy

Open
f2013519 opened this issue Apr 17, 2025 · 4 comments

Comments

@f2013519
Contributor

Proposed new feature or change:

float16 is an IEEE 754 floating-point type that enjoys wide support across a variety of hardware platforms, including CPUs (x86, ARM) and GPUs. It has applications in many fields (ML, signal processing, computer vision).

It can be used either as a storage type or a compute type depending on the hardware capability.

Use as a storage type is of particular interest, since it halves the data size compared to fp32, freeing up storage and memory bandwidth. Data can be converted to fp32 for computation and stored back in fp16 once the computation is complete; this is a common pattern.
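
For illustration, here is a minimal sketch of that pattern at the C level (assuming a compiler that provides the __fp16 extension, e.g. GCC/Clang targeting ARM; the function name is hypothetical):

#include <stddef.h>

/* Sketch: data stays in fp16 for storage; each element is widened to fp32
 * for the computation and narrowed back to fp16 afterwards. */
void scale_fp16(__fp16 *data, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++) {
        float x = (float)data[i];   /* fp16 -> fp32 for the computation */
        x *= factor;
        data[i] = (__fp16)x;        /* result stored back as fp16       */
    }
}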

In some quick benchmarks against other implementations (PyTorch, Julia), NumPy is consistently slower by a factor of 4x-6x or more across the board on Apple silicon (M4) for fp16<=>fp32 conversions.

Looking quickly at the NumPy implementation, the fp16<=>fp32 casts do not take advantage of the underlying hardware support and SIMD intrinsics. Instead, the cast is emulated by manipulating the underlying bit patterns.
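
For reference, a purely scalar bit-pattern conversion looks roughly like the sketch below (illustrative only, not NumPy's actual code; the reverse fp32->fp16 direction additionally has to handle rounding and overflow):

#include <stdint.h>
#include <string.h>

/* Sketch of a half -> float conversion done entirely on bit patterns. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x03FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);           /* inf / NaN */
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (man << 13);  /* re-bias 15 -> 127 */
    } else if (man == 0) {
        bits = sign;                                       /* signed zero */
    } else {
        /* subnormal half: renormalize into a normal float */
        int shift = 0;
        while (!(man & 0x0400u)) {
            man <<= 1;
            shift++;
        }
        man &= 0x03FFu;
        bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}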

By making some quick prototype changes to use ARM NEON intrinsics, it is possible to get much better performance (on par with other implementations).
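
As a rough illustration, the hardware path boils down to something like the sketch below (a hypothetical helper, not the actual prototype; it assumes an AArch64 toolchain where <arm_neon.h> provides the fp16 conversion intrinsics):

#include <arm_neon.h>
#include <stddef.h>

/* Sketch: convert half-precision values to single precision four at a time. */
static void half_to_float_neon(const float16_t *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float16x4_t h = vld1_f16(src + i);  /* load 4 fp16 values             */
        float32x4_t f = vcvt_f32_f16(h);    /* hardware fp16 -> fp32 widening */
        vst1q_f32(dst + i, f);              /* store 4 fp32 values            */
    }
    for (; i < n; i++) {
        dst[i] = (float)src[i];             /* scalar tail                    */
    }
}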

Given the wide availability of hardware with native fp16 these days, can we add better support for fp16 across the different platforms that provide native hardware support? The priority would be improving cast performance first.

@ngoldbaum
Member

can we add better support for fp16 across the different platforms that provide native hardware support?

Please! PRs welcome :)

@r-devulap
Member

In some quick benchmarks against other implementations (PyTorch, Julia), NumPy is consistently slower by a factor of 4x-6x or more across the board on Apple silicon (M4) for fp16<=>fp32 conversions.

By making some quick prototype changes to use ARM NEON intrinsics, it is possible to get much better performance (on par with other implementations).

Please feel free to submit your prototype as a pull request, and could you also share the benchmarks you care about? That would help narrow down the future to-do list.

@f2013519
Contributor Author

As a quick example, consider casting fp16 arrays of various sizes to fp32.

Here is the performance I get on a MacBook Air (M4):

Platform: Darwin / arm64 / arm
NumPy version: 2.3.0.dev0+git20250418.6c7e63a
Timeit settings: repeat=100, number=1

Starting NumPy Float16 -> Float32 Cast Benchmark...

Size (Elements) | Min Time (ms) | Median Time (ms)
              1 |         0.000 |            0.000
             10 |         0.000 |            0.000
            100 |         0.000 |            0.000
          1,000 |         0.001 |            0.001
         10,000 |         0.011 |            0.011
        100,000 |         0.090 |            0.098
      1,000,000 |         0.989 |            1.030
     10,000,000 |        10.058 |           10.476
    100,000,000 |       116.131 |          121.298

With an optimized implementation, we see ~4x-10x improvements:


Size (Elements) | Min Time (ms) | Median Time (ms)
              1 |         0.000 |            0.000
             10 |         0.000 |            0.000
            100 |         0.000 |            0.000
          1,000 |         0.000 |            0.000
         10,000 |         0.001 |            0.001
        100,000 |         0.005 |            0.005
      1,000,000 |         0.156 |            0.177
     10,000,000 |         1.980 |            2.108
    100,000,000 |        29.158 |           33.682

Benchmark Script (convert to .py before running):

cast_bench.txt

@f2013519
Contributor Author

See #28769 for a prototype implementation ...
