8000 DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) by rgommers · Pull Request #24138 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Dec 31, 2023

Conversation

rgommers
Copy link
Member
@rgommers rgommers commented Jul 7, 2023

This NEP compares the current plan of moving the NumPy universal intrinsics framework to C++ with the adoption of Google Highway, as discussed in gh-24084. There is no proposed decision yet, this is a write-up of the discussion so far (which makes it slightly unusual as a NEP).

My current recommendation is to continue the conversation and improve this summary as needed. As of right now, the large amount of effort and the impact on licensing and MSVC usage mean that adoption of Google Highway should at a minimum be postponed until after we've dealt with the 1.26.0 release over the next couple of months. That work has a hard deadline after all, due to the Python 3.12 release cycle.

I've added @seiko2plus and @jan-wassenberg as co-authors already, because quite a few technical arguments are verbatim from what they wrote in gh-24084. And @mattip because we drafted the first version of this document together. Happy to add more authors as needed when there is more discussion.

@rgommers rgommers added component: NEP component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Jul 7, 2023
@rgommers rgommers marked this pull request as draft July 7, 2023 10:26
@mattip
Copy link
Member
mattip commented Jul 7, 2023

here is the rendered artifact

Copy link
Contributor
@jan-wassenberg jan-wassenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice writeup 👍

- A second concern was whether it's possible with Highway to allow the user at
runtime to select or disable dispatching to certain instruction sets. This is
possible.
- Use of tags in the C++ implementation (Sayed's concern). TODO: describe in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that Highway code looks like this:

  const hn::ScalableTag<T> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto mul = hn::Load(d, mul_array + i);
    const auto add = hn::Load(d, add_array + i);
    auto x = hn::Load(d, x_array + i);
    x = hn::MulAdd(mul, x, add);
    hn::Store(x, d, x_array + i);
  }

The concern is about these extra d arguments. SVE and RVV actually have them in (almost) every intrinsic, Highway mainly only in ops involving memory or tied to the vector size. Looking beyond the verbosity, there is actually an important advantage: allowing users to switch to half-vectors by changing only one line. This can help performance with AVX-512, RVV, and older Arm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing flexibility is the upside. What is the downside? Is the resulting code slower?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside is that the extra d arguments use a template argument, and the testing unit appears to want to export single (non-template) functions, so exporting them all would involve many functions.
A reasonable workaround would be to only export a subset of the tags (mostly just full vectors) for the testing unit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allowing users to switch to half-vectors by changing only one line. This can help performance with AVX-512, RVV, and older Arm.

Tags can be avoided by providing intrinsics to load/store half or quad of vector instead e.g. LoadHalf(ptr). Also its important to identify the values of the higher parts, whether undefined or set to zero.

What is the downside

Increase the complexity of the code while there are other options to make much simpler, easy to trace, debug and to test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important to identify the values of the higher parts, whether undefined or set to zero.

Good point. Would it be OK to define it as either zero or a broadcast load? It's never random garbage.

Can you quantify the 'much simpler'? It seems to me that we are proposing to add half/quarter etc variants of all the various load operations (masked, partial, unaligned etc), that would mean more ops to test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will edit this to not dive into details. Is this adequate:

"Use of tags in the C++ implementation reduces code duplication but the added templating makes C-level testing and tracing more complicated"

complex kernels that require 64-bit/32-bit stride alignment. Identified
intrinsics related to this:

- Highway's ``LoadInterleaved2`` would have to be extended with ``*Till``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Highway's ``LoadInterleaved2`` would have to be extended with ``*Till``
Highway needs to implement the following memory operations, which mainly deal with non-contiguous memory operations and partial load/store.
| Name |u8 |i8 |u16|i16|u32|i32|u64|i64|f32|f64|
|:-----------------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| LoadPairTill | - | - | - | - | x | x | x | x | x | x |
| LoadTill | - | - | - | - | x | x | x | x | x | x |
| IsLoadable | - | - | - | - | x | x | x | x | x | x |
| IsStorable | - | - | - | - | x | x | x | x | x | x |
| Loadn | - | - | - | - | x | x | x | x | x | x |
| LoadnTill | - | - | - | - | x | x | x | x | x | x |
| LoadnPair | - | - | - | - | x | x | x | x | x | x |
| LoadnPairTill | - | - | - | - | x | x | x | x | x | x |
| Storen | - | - | - | - | x | x | x | x | x | x |
| StorenTill | - | - | - | - | x | x | x | x | x | x |
| StorenPair | - | - | - | - | x | x | x | x | x | x |
| StorenPairTill | - | - | - | - | x | x | x | x | x | x |
| Lookup128 | - | - | - | - | x | x | x | x | x | x |

These operations are described in the new C++ reference section memory.

Furthermore, Highway needs to identify the behavior of comparisons, including whether they are signaling/non-signaling or ordered and unordered. This information is fully described in the new reference. Please refer to the comparison reference for detailed information.

Another aspect that seems to be missing is the integer division by scalar. This operation is fully described here as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the CI output link no longer works. Is there some other way to see your nicely formatted documentation?

Regarding comparisons, Highway uses _OQ, so ordered and quiet/non-signalling. Would that work for our purposes?

FYI float16 support is mostly done but not yet tested and requires some more work (casts) for GCC.

We didn't add integer division to Highway because it is generally inefficient and we haven't yet needed it. Is it commonly used in numpy? One option if the division is by a few constant values is to use libdivide, or a compiler, to translate that into vectorized mul by the inverse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the CI output link no longer works. Is there some other way to see your nicely formatted documentation?

On each PR, the ci/circleci: build artifact job links directly to the html docs (assuming CI didn't fail, which in this case it did - at least, I think this is from gh-23096 which is currently red).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the nep link on the top line of the rendered artifacts point to the absolute NEP path, not the relative one. here is the latest version, that I got from the CircleCI workflow artifact listing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a high-level summary of missing functionality. How about:

  • Highway is missing non-contiguous memory operations and partial load/store operations.
  • Highway needs to identify the behavior of comparisons, including whether they are signaling/non-signaling or ordered and unordered.
  • Highway is missing integer division by a scalar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates:

  • we have non-contiguous ops, but they use Gather/Scatter and are thus slow. What's missing is special-cases such as pairs of doubles using 128-bit loads.
  • as of last week we have partial Load/Store and also LoadNOr.

Comparisons, reductions and integer div are indeed missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a high-level summary of missing functionality. How about:

Here's a list of missing intrinsics:

  • Vec LoadPairTill(const TLane *ptr, size_t plen, TLane fill0, TLane fill1)
    Partially pair loads N-lane vector contents from memory.

  • Vec Loadn(const TLane *ptr, intptr_t stride)
    Load N-lane vector contents from non-contiguous memory block.

  • Vec LoadnTill(const TLane *ptr, intptr_t stride, size_t len, TLane fill)
    Partially loads N-lane vector contents from a non-contiguous memory block.

  • Vec LoadnPair(const TLane *ptr, intptr_t stride)
    Pair loads N-lane vector contents from non-contiguous memory block.

  • Vec LoadnPairTill(const TLane *ptr, intptr_t stride, size_t plen, TLane fill0, TLane fill1)
    Partially pair loads N-lane vector contents from non-contiguous memory block.

  • void StorePairTill(TLane *ptr, size_t plen, const Vec &a)
    Partially pair store N-lane vector contents to memory.

  • void Storen(TLane *ptr, intptr_t stride, const Vec &a)
    Store N-lane vector contents to non-contiguous memory.

  • void StorenTill(TLane *ptr, intptr_t stride, size_t len, const Vec &a)
    Partially store N-lane vector contents to non-contiguous memory.

  • void StorenPair(TLane *ptr, intptr_t stride, const Vec &a)
    Pair store N-lane vector contents to non-contiguous memory.

  • void StorenPairTill(TLane *ptr, intptr_t stride, size_t plen, const Vec &a)
    Partially pair store N-lane vector contents to non-contiguous memory.

  • Vec Lookup128(const TLane *table, const Vec<MakeUnsigned> &idx)
    Perform a lookup operation on a 128-byte table using lane indices.

  • TVec MaxProp(const TVec &a, const TVec &b)
    Maximum with Floating-Point with Non-NaN Propagation.

  • TVec MinProp(const TVec &a, const TVec &b)
    Minimum with Floating-Point with Non-NaN Propagation.

  • TVec MaxPropNan(const TVec &a, const TVec &b)
    Maximum with Floating-Point with NaN Propagation.

  • TVec MinPropNan(const TVec &a, const TVec &b)
    Minimum with with NaN Propagation.

  • TLane ReduceMax(const TVec &a)
    Native Reduce Maximum.

  • TLane ReduceMin(const TVec &a)
    Native Reduce Minimum.

  • TLane ReduceMaxProp(const TVec &a)
    Reduce Maximum Floating-Point with Non-NaN Propagation.

  • TLane ReduceMinProp(const TVec &a)
    Reduce Minimum Floating-Point with Non-NaN Propagation.

  • TLane ReduceMaxPropNan(const TVec &a)
    Reduce Maximum Floating-Point with NaN Propagation.

  • TLane ReduceMinPropNan(const TVec &a)
    Reduce Minimum Floating-Point with NaN Propagation.

  • TLane ReduceMaxProp(const TVec &a)
    Reduce Maximum Floating-Point with Non-NaN Propagation.

  • TLane ReduceMinProp(const TVec &a)
    Reduce Minimum Floating-Point with Non-NaN Propagation.

  • TLane ReduceMaxPropNan(const TVec &a)
    Reduce Maximum Floating-Point with NaN Propagation.

  • TLane ReduceMinPropNan(const TVec &a)
    Reduce Minimum Floating-Point with NaN Propagation.

  • Vec3 Divisor(TLane d)
    Generate the precomputed parameters of the divisor.

  • TVec Div(const TVec &a, const Vec3 &b)
    Performs element-wise integer division between the lanes of the input vector ‘a’ and the scalar value b.

  • TVec MulAddSub(const TVec &a, const TVec &b, const TVec &c)
    Multiply the lanes of two vectors, add the lanes of a third vector for odd lanes, and subtract for even lanes.

  • TVec IfAdd/IfSub/IfDiv(const TMask &m, const TVec &a, const TVec &b, const TVec &c)
    Conditional add/sub/div of vectors based on a mask.

In addition(not supported yet), I was working in a patch to support the following AVX512 operations which is required in order to replace SVML kernels:

  • _mm512_getmant_ps/_pd
  • _mm512_scalef_ps/_pd
  • _mm512_getexp_ps/_pd

as of last week we have partial Load/Store and also LoadNOr.

Current Highway partial Load/Store only supports contiguous memory operations according to the following testing unite:
LoadN:
https://github.com/google/highway/blob/ba0900a4957b929390ab73827235557959234fea/hwy/tests/memory_test.cc#L350-L404
StoreN:
https://github.com/google/highway/blob/ba0900a4957b929390ab73827235557959234fea/hwy/tests/memory_test.cc#L572-L634

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @seiko2plus for making the list :) I am back in the office and adding it to our list of pending ops.

FYI ReduceMax/ReduceMin can already be expressed as GetLane(Max/MinOfLanes()).

Current Highway partial Load/Store only supports contiguous memory operations

We also have MaskedGather/Scatter. The mask can be obtained for partial N via FirstN, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have MaskedGather/Scatter. The mask can be obtained for partial N via FirstN, right?

if performance regression isn't matter then yes, sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is too much of a moving target to publish this table in a permanent document. Maybe it belongs in a Highway issue instead, and then we can refer to the issue?

@mattip mattip force-pushed the nep-simd-cpp-highway branch from 6105bfa to 439d627 Compare August 28, 2023 20:04
@mattip
F438 Copy link
Member
mattip commented Aug 28, 2023

rebased and force-pushed to fix CI errors

@Mousius
Copy link
Member
Mousius commented Sep 22, 2023

I am trying to understand how this will move forward constructively. There are several recurring topics:

  1. Highway uses tags. @seiko2plus has stated that he's OK with tags, so I don't understand why this keeps returning. Should we try to improve Highway to avoid using tags? That is entirely independent work, which I'm sure @jan-wassenberg would be interested in.
  2. Highway doesn't have the ops. @jan-wassenberg has been receptive to patches as we introduce ops, and we will only do some of the routines in one go due to the bandwidth available in NumPy itself.
  3. Highway might be less performant; @jan-wassenberg has been receptive, and as we land individual routines, we can optimize the ops we need to. Trying to do everything all at once is beyond at least my cognitive capacity.

(If I've missed anything, please feel free to add.)

These all seem fixable in the long term. If we assume the best intentions, I'm confident that any improvements we can add to improve the developer experience of Highway will be received positively.

Can we decouple these issues and focus on the high-level decision rather than continuing to get bogged down in specific technical details which are easier to fix independently?

At the moment, I can't justify committing resources to support the porting of routines due to the ambiguity in effort and technical direction. As demonstrated by #24018, I'm happy to head in a clear direction.

@ngoldbaum
Copy link
Member

I haven’t been following this discussion and I noticed that some initial support for highway is getting merged. As a spectator who would like to learn more, it would be nice to have a version of the NEP merged.

@Mousius
Copy link
Member
Mousius commented Nov 24, 2023

I haven’t been following this discussion and I noticed that some initial support for highway is getting merged. As a spectator who would like to learn more, it would be nice to have a version of the NEP merged.

Agree, we can merge this to indicate that Highway is available as an option for future work, which we will continue to assess. As highlighted in the last optimisation team meeting, the C++ machinery is an immense step forward for NumPy.

@mattip
Copy link
Member
mattip commented Dec 21, 2023

I have tried to resolve the discussions, mainly by removing details. I also removed the mentions of meson since that has been resolved. The adoption of Highway intrinsics has also been resolved since we incorporated them into sorting.

@mattip mattip marked this pull request as ready for review December 21, 2023 06:22
Copy link
Contributor
@jan-wassenberg jan-wassenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, with two minor comments :)

use Highway primitives internally. We could still use Highway sorting routines.
If we do accept lower-precision routines (via a user-supplied choice, i.e.
extending ``errstate`` to allow a precision option), we could use
Highway-native routines.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are also open to adding new higher-precision function versions to Highway, e.g. as part of Sayed's work to port from SVML.

@mattip
Copy link
Member
mattip commented Dec 23, 2023

Any more comments or editing?

Copy link
Member Author
@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matti and Jan. One more suggestion from me about the dynamic dispatch section. Then I think it's time to merge this; perfect is the enemy of good here. We're further along now, so getting this NEP in in its current state and learning from the 2.0 release where we'll have Highway as a git submodule and are using it for sorting-related functionality at least may give us more insights.

Copy link
Member
@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest some minor text fixes and some added context; otherwise, I echo the sentiment that this seems fine.

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
mattip and others added 3 commits December 31, 2023 08:35
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
@mattip
Copy link
Member
mattip commented Dec 31, 2023

The rendered NEP is here. It would be nice to get this merged.

@rgommers
Copy link
Member Author

Interesting to see that SLEEF is being actively worked on again - it seems like a couple from hardware vendors are active, and at least one Arm engineer got commit rights. Would be great to see that activity continue.

@rgommers
Copy link
Member Author

The latest textual updates all LGTM and seem fairly minor. Let's get this merged in 2023 still - I'll hit the green button here and will mention it on the mailing list. Thanks everyone!

@rgommers
Copy link
Member Author

Actually, let me add Chris as a co-author as well before doing that, that seems very much justified.

@rgommers rgommers merged commit f1fac82 into numpy:main Dec 31, 2023
@rgommers rgommers deleted the nep-simd-cpp-highway branch December 31, 2023 10:16
@mattip
Copy link
Member
mattip commented Dec 31, 2023

Thanks @rgommers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
04 - Documentation component: NEP component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants
0