DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

rgommers · 2023-07-07T10:25:49Z

This NEP compares the current plan of moving the NumPy universal intrinsics framework to C++ with the adoption of Google Highway, as discussed in gh-24084. There is no proposed decision yet, this is a write-up of the discussion so far (which makes it slightly unusual as a NEP).

My current recommendation is to continue the conversation and improve this summary as needed. As of right now, the large amount of effort and the impact on licensing and MSVC usage mean that adoption of Google Highway should at a minimum be postponed until after we've dealt with the 1.26.0 release over the next couple of months. That work has a hard deadline after all, due to the Python 3.12 release cycle.

I've added @seiko2plus and @jan-wassenberg as co-authors already, because quite a few technical arguments are verbatim from what they wrote in gh-24084. And @mattip because we drafted the first version of this document together. Happy to add more authors as needed when there is more discussion.

mattip · 2023-07-07T10:49:44Z

here is the rendered artifact

doc/neps/nep-0054-simd-cpp-highway.rst

jan-wassenberg

Nice writeup 👍

jan-wassenberg · 2023-07-07T12:50:35Z

doc/neps/nep-0054-simd-cpp-highway.rst

+- A second concern was whether it's possible with Highway to allow the user at
+  runtime to select or disable dispatching to certain instruction sets. This is
+  possible.
+- Use of tags in the C++ implementation (Sayed's concern). TODO: describe in


The issue here is that Highway code looks like this:

const hn::ScalableTag<T> d; for (size_t i = 0; i < size; i += hn::Lanes(d)) { const auto mul = hn::Load(d, mul_array + i); const auto add = hn::Load(d, add_array + i); auto x = hn::Load(d, x_array + i); x = hn::MulAdd(mul, x, add); hn::Store(x, d, x_array + i); }

The concern is about these extra d arguments. SVE and RVV actually have them in (almost) every intrinsic, Highway mainly only in ops involving memory or tied to the vector size. Looking beyond the verbosity, there is actually an important advantage: allowing users to switch to half-vectors by changing only one line. This can help performance with AVX-512, RVV, and older Arm.

Allowing flexibility is the upside. What is the downside? Is the resulting code slower?

The downside is that the extra d arguments use a template argument, and the testing unit appears to want to export single (non-template) functions, so exporting them all would involve many functions.
A reasonable workaround would be to only export a subset of the tags (mostly just full vectors) for the testing unit.

allowing users to switch to half-vectors by changing only one line. This can help performance with AVX-512, RVV, and older Arm.

Tags can be avoided by providing intrinsics to load/store half or quad of vector instead e.g. LoadHalf(ptr). Also its important to identify the values of the higher parts, whether undefined or set to zero.

What is the downside

Increase the complexity of the code while there are other options to make much simpler, easy to trace, debug and to test.

important to identify the values of the higher parts, whether undefined or set to zero.

Good point. Would it be OK to define it as either zero or a broadcast load? It's never random garbage.

Can you quantify the 'much simpler'? It seems to me that we are proposing to add half/quarter etc variants of all the various load operations (masked, partial, unaligned etc), that would mean more ops to test.

I will edit this to not dive into details. Is this adequate:

"Use of tags in the C++ implementation reduces code duplication but the added templating makes C-level testing and tracing more complicated"

doc/neps/nep-0054-simd-cpp-highway.rst

seiko2plus · 2023-07-07T20:13:06Z

doc/neps/nep-0054-simd-cpp-highway.rst

+  complex kernels that require 64-bit/32-bit stride alignment. Identified
+  intrinsics related to this:
+
+  - Highway's ``LoadInterleaved2`` would have to be extended with ``*Till``


Suggested change

- Highway's ``LoadInterleaved2`` would have to be extended with ``*Till``

Highway needs to implement the following memory operations, which mainly deal with non-contiguous memory operations and partial load/store.

| Name |u8 |i8 |u16|i16|u32|i32|u64|i64|f32|f64|

|:-----------------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|

| LoadPairTill | - | - | - | - | x | x | x | x | x | x |

| LoadTill | - | - | - | - | x | x | x | x | x | x |

| IsLoadable | - | - | - | - | x | x | x | x | x | x |

| IsStorable | - | - | - | - | x | x | x | x | x | x |

| Loadn | - | - | - | - | x | x | x | x | x | x |

| LoadnTill | - | - | - | - | x | x | x | x | x | x |

| LoadnPair | - | - | - | - | x | x | x | x | x | x |

| LoadnPairTill | - | - | - | - | x | x | x | x | x | x |

| Storen | - | - | - | - | x | x | x | x | x | x |

| StorenTill | - | - | - | - | x | x | x | x | x | x |

| StorenPair | - | - | - | - | x | x | x | x | x | x |

| StorenPairTill | - | - | - | - | x | x | x | x | x | x |

| Lookup128 | - | - | - | - | x | x | x | x | x | x |

These operations are described in the new C++ reference section memory.

Furthermore, Highway needs to identify the behavior of comparisons, including whether they are signaling/non-signaling or ordered and unordered. This information is fully described in the new reference. Please refer to the comparison reference for detailed information.

Another aspect that seems to be missing is the integer division by scalar. This operation is fully described here as well.

Looks like the CI output link no longer works. Is there some other way to see your nicely formatted documentation?

Regarding comparisons, Highway uses _OQ, so ordered and quiet/non-signalling. Would that work for our purposes?

FYI float16 support is mostly done but not yet tested and requires some more work (casts) for GCC.

We didn't add integer division to Highway because it is generally inefficient and we haven't yet needed it. Is it commonly used in numpy? One option if the division is by a few constant values is to use libdivide, or a compiler, to translate that into vectorized mul by the inverse.

Looks like the CI output link no longer works. Is there some other way to see your nicely formatted documentation?

On each PR, the ci/circleci: build artifact job links directly to the html docs (assuming CI didn't fail, which in this case it did - at least, I think this is from gh-23096 which is currently red).

But the nep link on the top line of the rendered artifacts point to the absolute NEP path, not the relative one. here is the latest version, that I got from the CircleCI workflow artifact listing

I would prefer a high-level summary of missing functionality. How about:

Highway is missing non-contiguous memory operations and partial load/store operations.

Highway needs to identify the behavior of comparisons, including whether they are signaling/non-signaling or ordered and unordered.

Highway is missing integer division by a scalar.

Updates:

we have non-contiguous ops, but they use Gather/Scatter and are thus slow. What's missing is special-cases such as pairs of doubles using 128-bit loads.

as of last week we have partial Load/Store and also LoadNOr.

Comparisons, reductions and integer div are indeed missing.

I would prefer a high-level summary of missing functionality. How about:

Here's a list of missing intrinsics:

Vec LoadPairTill(const TLane *ptr, size_t plen, TLane fill0, TLane fill1)
Partially pair loads N-lane vector contents from memory.

Vec Loadn(const TLane *ptr, intptr_t stride)
Load N-lane vector contents from non-contiguous memory block.

Vec LoadnTill(const TLane *ptr, intptr_t stride, size_t len, TLane fill)
Partially loads N-lane vector contents from a non-contiguous memory block.

Vec LoadnPair(const TLane *ptr, intptr_t stride)
Pair loads N-lane vector contents from non-contiguous memory block.

Vec LoadnPairTill(const TLane *ptr, intptr_t stride, size_t plen, TLane fill0, TLane fill1)
Partially pair loads N-lane vector contents from non-contiguous memory block.

void StorePairTill(TLane *ptr, size_t plen, const Vec &a)
Partially pair store N-lane vector contents to memory.

void Storen(TLane *ptr, intptr_t stride, const Vec &a)
Store N-lane vector contents to non-contiguous memory.

void StorenTill(TLane *ptr, intptr_t stride, size_t len, const Vec &a)
Partially store N-lane vector contents to non-contiguous memory.

void StorenPair(TLane *ptr, intptr_t stride, const Vec &a)
Pair store N-lane vector contents to non-contiguous memory.

void StorenPairTill(TLane *ptr, intptr_t stride, size_t plen, const Vec &a)
Partially pair store N-lane vector contents to non-contiguous memory.

Vec Lookup128(const TLane *table, const Vec<MakeUnsigned> &idx)
Perform a lookup operation on a 128-byte table using lane indices.

TVec MaxProp(const TVec &a, const TVec &b)
Maximum with Floating-Point with Non-NaN Propagation.

TVec MinProp(const TVec &a, const TVec &b)
Minimum with Floating-Point with Non-NaN Propagation.

TVec MaxPropNan(const TVec &a, const TVec &b)
Maximum with Floating-Point with NaN Propagation.

TVec MinPropNan(const TVec &a, const TVec &b)
Minimum with with NaN Propagation.

TLane ReduceMax(const TVec &a)
Native Reduce Maximum.

TLane ReduceMin(const TVec &a)
Native Reduce Minimum.

TLane ReduceMaxProp(const TVec &a)
Reduce Maximum Floating-Point with Non-NaN Propagation.

TLane ReduceMinProp(const TVec &a)
Reduce Minimum Floating-Point with Non-NaN Propagation.

TLane ReduceMaxPropNan(const TVec &a)
Reduce Maximum Floating-Point with NaN Propagation.

TLane ReduceMinPropNan(const TVec &a)
Reduce Minimum Floating-Point with NaN Propagation.

TLane ReduceMaxProp(const TVec &a)
Reduce Maximum Floating-Point with Non-NaN Propagation.

TLane ReduceMinProp(const TVec &a)
Reduce Minimum Floating-Point with Non-NaN Propagation.

TLane ReduceMaxPropNan(const TVec &a)
Reduce Maximum Floating-Point with NaN Propagation.

TLane ReduceMinPropNan(const TVec &a)
Reduce Minimum Floating-Point with NaN Propagation.

Vec3 Divisor(TLane d)
Generate the precomputed parameters of the divisor.

TVec Div(const TVec &a, const Vec3 &b)
Performs element-wise integer division between the lanes of the input vector ‘a’ and the scalar value b.

TVec MulAddSub(const TVec &a, const TVec &b, const TVec &c)
Multiply the lanes of two vectors, add the lanes of a third vector for odd lanes, and subtract for even lanes.

TVec IfAdd/IfSub/IfDiv(const TMask &m, const TVec &a, const TVec &b, const TVec &c)
Conditional add/sub/div of vectors based on a mask.

In addition(not supported yet), I was working in a patch to support the following AVX512 operations which is required in order to replace SVML kernels:

_mm512_getmant_ps/_pd

_mm512_scalef_ps/_pd

_mm512_getexp_ps/_pd

as of last week we have partial Load/Store and also LoadNOr.

Current Highway partial Load/Store only supports contiguous memory operations according to the following testing unite:
LoadN:
https://github.com/google/highway/blob/ba0900a4957b929390ab73827235557959234fea/hwy/tests/memory_test.cc#L350-L404
StoreN:
https://github.com/google/highway/blob/ba0900a4957b929390ab73827235557959234fea/hwy/tests/memory_test.cc#L572-L634

Thank you @seiko2plus for making the list :) I am back in the office and adding it to our list of pending ops.

FYI ReduceMax/ReduceMin can already be expressed as GetLane(Max/MinOfLanes()).

Current Highway partial Load/Store only supports contiguous memory operations

We also have MaskedGather/Scatter. The mask can be obtained for partial N via FirstN, right?

We also have MaskedGather/Scatter. The mask can be obtained for partial N via FirstN, right?

if performance regression isn't matter then yes, sure.

I think it is too much of a moving target to publish this table in a permanent document. Maybe it belongs in a Highway issue instead, and then we can refer to the issue?

doc/neps/nep-0054-simd-cpp-highway.rst

[skip cirrus] [skip actions]

…] [skip travis] [skip actions]

mattip · 2023-08-28T20:05:19Z

rebased and force-pushed to fix CI errors

Mousius · 2023-09-22T10:10:37Z

I am trying to understand how this will move forward constructively. There are several recurring topics:

Highway uses tags. @seiko2plus has stated that he's OK with tags, so I don't understand why this keeps returning. Should we try to improve Highway to avoid using tags? That is entirely independent work, which I'm sure @jan-wassenberg would be interested in.
Highway doesn't have the ops. @jan-wassenberg has been receptive to patches as we introduce ops, and we will only do some of the routines in one go due to the bandwidth available in NumPy itself.
Highway might be less performant; @jan-wassenberg has been receptive, and as we land individual routines, we can optimize the ops we need to. Trying to do everything all at once is beyond at least my cognitive capacity.

(If I've missed anything, please feel free to add.)

These all seem fixable in the long term. If we assume the best intentions, I'm confident that any improvements we can add to improve the developer experience of Highway will be received positively.

Can we decouple these issues and focus on the high-level decision rather than continuing to get bogged down in specific technical details which are easier to fix independently?

At the moment, I can't justify committing resources to support the porting of routines due to the ambiguity in effort and technical direction. As demonstrated by #24018, I'm happy to head in a clear direction.

ngoldbaum · 2023-11-24T15:37:05Z

I haven’t been following this discussion and I noticed that some initial support for highway is getting merged. As a spectator who would like to learn more, it would be nice to have a version of the NEP merged.

Mousius · 2023-11-24T16:56:49Z

I haven’t been following this discussion and I noticed that some initial support for highway is getting merged. As a spectator who would like to learn more, it would be nice to have a version of the NEP merged.

Agree, we can merge this to indicate that Highway is available as an option for future work, which we will continue to assess. As highlighted in the last optimisation team meeting, the C++ machinery is an immense step forward for NumPy.

[skip travis] [skip azp] [skip cirrus]

mattip · 2023-12-21T06:22:08Z

I have tried to resolve the discussions, mainly by removing details. I also removed the mentions of meson since that has been resolved. The adoption of Highway intrinsics has also been resolved since we incorporated them into sorting.

jan-wassenberg

Looks good to me, with two minor comments :)

doc/neps/nep-0054-simd-cpp-highway.rst

jan-wassenberg · 2023-12-21T12:30:02Z

doc/neps/nep-0054-simd-cpp-highway.rst

+use Highway primitives internally. We could still use Highway sorting routines.
+If we do accept lower-precision routines (via a user-supplied choice, i.e.
+extending ``errstate`` to allow a precision option), we could use
+Highway-native routines.


We are also open to adding new higher-precision function versions to Highway, e.g. as part of Sayed's work to port from SVML.

…zp][skip cirrus]

mattip · 2023-12-23T19:39:27Z

Any more comments or editing?

rgommers

Thanks Matti and Jan. One more suggestion from me about the dynamic dispatch section. Then I think it's time to merge this; perfect is the enemy of good here. We're further along now, so getting this NEP in in its current state and learning from the 2.0 release where we'll have Highway as a git submodule and are using it for sorting-related functionality at least may give us more insights.

doc/neps/nep-0054-simd-cpp-highway.rst

…irrus]

Mousius

I suggest some minor text fixes and some added context; otherwise, I echo the sentiment that this seems fine.

doc/neps/nep-0054-simd-cpp-highway.rst

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>

doc/neps/nep-0054-simd-cpp-highway.rst

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>

mattip · 2023-12-31T07:09:11Z

The rendered NEP is here. It would be nice to get this merged.

rgommers · 2023-12-31T10:05:50Z

Interesting to see that SLEEF is being actively worked on again - it seems like a couple from hardware vendors are active, and at least one Arm engineer got commit rights. Would be great to see that activity continue.

rgommers · 2023-12-31T10:08:48Z

The latest textual updates all LGTM and seem fairly minor. Let's get this merged in 2023 still - I'll hit the green button here and will mention it on the mailing list. Thanks everyone!

rgommers · 2023-12-31T10:10:11Z

Actually, let me add Chris as a co-author as well before doing that, that seems very much justified.

[skip ci]

mattip · 2023-12-31T10:47:24Z

Thanks @rgommers

rgommers added component: NEP component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Jul 7, 2023

rgommers marked this pull request as draft July 7, 2023 10:26

github-actions bot added the 04 - Documentation label Jul 7, 2023

rgommers mentioned this pull request Jul 7, 2023

ENH: C++ SIMD wrapper #24084

Open

Mousius reviewed Jul 7, 2023

View reviewed changes

jan-wassenberg reviewed Jul 7, 2023

View reviewed changes

charris reviewed Jul 7, 2023

View reviewed changes

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

seiko2plus reviewed Jul 7, 2023

View reviewed changes

mattip mentioned this pull request Jul 11, 2023

License inconsistency betweent the LICENSE file and the setup.cfg metadata bigcat88/pillow_heif#111

Closed

mattip reviewed Aug 28, 2023

View reviewed changes

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

rgommers and others added 3 commits August 28, 2023 23:04

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not)

789a0d8

[skip cirrus] [skip actions]

DOC: incorporate comments into NEP 54

f09ab71

DOC: add a comment about the licensing change [skip cirrus] [skip azp…

439d627

…] [skip travis] [skip actions]

mattip force-pushed the nep-simd-cpp-highway branch from 6105bfa to 439d627 Compare August 28, 2023 20:04

mattip mentioned this pull request Aug 29, 2023

DOC, CI: Add a "skip" option to only run docs-related CI checks #24574

Open

Mousius mentioned this pull request Nov 19, 2023

ENH: Use Highway's VQSort on AArch64 #24018

Merged

mattip added 2 commits December 21, 2023 08:02

resolve conversations, mainly by removing controversial topics

3f0461e

remove meson considerations from the NEP [skip actions]

20d4941

[skip travis] [skip azp] [skip cirrus]

mattip marked this pull request as ready for review December 21, 2023 06:22

jan-wassenberg approved these changes Dec 21, 2023

View reviewed changes

update Highway support for Z14, Z15 [skp actions][skip travis][skip a…

5d70906

…zp][skip cirrus]

rgommers commented Dec 24, 2023

View reviewed changes

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

NEP: review from rgommers [skp actions][skip travis][skip azp][skip c…

541ad6d

…irrus]

Mousius requested changes Dec 27 E377 , 2023

View reviewed changes

Clarify alternatives [skip actions][skip azp][skip cirrus]

8ddb3f0

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>

mattip reviewed Dec 27, 2023

View reviewed changes

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

Sleef has seen some work [skip actions][skip azp][skip cirrus]

0cee9ce

Mousius requested changes Dec 27, 2023

View reviewed changes

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

Mousius requested changes Dec 27, 2023

View reviewed changes

doc/neps/nep-0054-simd-cpp-highway.rst Outdated Show resolved Hide resolved

mattip and others added 3 commits December 31, 2023 08:35

clarify effort needed for intrinsics

c85f6bf

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>

explain history

45bc9ad

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>

refactor naming convensions

59068cd

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>

DOC: add Chris as a co-author

bc1ee6b

[skip ci]

rgommers merged commit f1fac82 into numpy:main Dec 31, 2023

rgommers deleted the nep-simd-cpp-highway branch December 31, 2023 10:16

abhishek-iitmadras mentioned this pull request Sep 16, 2024

ENH: Convert arithmetic from C universal intrinsics to C++ using Highway #27402

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

-  - Highway's ``LoadInterleaved2`` would have to be extended with ``*Till``
+Highway needs to implement the following memory operations, which mainly deal with non-contiguous memory operations and partial load/store.
+     | Name             |u8 |i8 |u16|i16|u32|i32|u64|i64|f32|f64|
+     |:-----------------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+     | LoadPairTill     | - | - | - | - | x | x | x | x | x | x |
+     | LoadTill         | - | - | - | - | x | x | x | x | x | x |
+     | IsLoadable       | - | - | - | - | x | x | x | x | x | x |
+     | IsStorable       | - | - | - | - | x | x | x | x | x | x |
+     | Loadn            | - | - | - | - | x | x | x | x | x | x |
+     | LoadnTill        | - | - | - | - | x | x | x | x | x | x |
+     | LoadnPair        | - | - | - | - | x | x | x | x | x | x |
+     | LoadnPairTill    | - | - | - | - | x | x | x | x | x | x |
+     | Storen           | - | - | - | - | x | x | x | x | x | x |
+     | StorenTill       | - | - | - | - | x | x | x | x | x | x |
+     | StorenPair       | - | - | - | - | x | x | x | x | x | x |
+     | StorenPairTill   | - | - | - | - | x | x | x | x | x | x |
+     | Lookup128        | - | - | - | - | x | x | x | x | x | x |

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment