-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
here is the rendered artifact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice writeup 👍
- A second concern was whether it's possible with Highway to allow the user at | ||
runtime to select or disable dispatching to certain instruction sets. This is | ||
possible. | ||
- Use of tags in the C++ implementation (Sayed's concern). TODO: describe in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here is that Highway code looks like this:
const hn::ScalableTag<T> d;
for (size_t i = 0; i < size; i += hn::Lanes(d)) {
const auto mul = hn::Load(d, mul_array + i);
const auto add = hn::Load(d, add_array + i);
auto x = hn::Load(d, x_array + i);
x = hn::MulAdd(mul, x, add);
hn::Store(x, d, x_array + i);
}
The concern is about these extra d
arguments. SVE and RVV actually have them in (almost) every intrinsic, Highway mainly only in ops involving memory or tied to the vector size. Looking beyond the verbosity, there is actually an important advantage: allowing users to switch to half-vectors by changing only one line. This can help performance with AVX-512, RVV, and older Arm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing flexibility is the upside. What is the downside? Is the resulting code slower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside is that the extra d arguments use a template argument, and the testing unit appears to want to export single (non-template) functions, so exporting them all would involve many functions.
A reasonable workaround would be to only export a subset of the tags (mostly just full vectors) for the testing unit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allowing users to switch to half-vectors by changing only one line. This can help performance with AVX-512, RVV, and older Arm.
Tags can be avoided by providing intrinsics to load/store half or quad of vector instead e.g. LoadHalf(ptr)
. Also its important to identify the values of the higher parts, whether undefined or set to zero.
What is the downside
Increase the complexity of the code while there are other options to make much simpler, easy to trace, debug and to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important to identify the values of the higher parts, whether undefined or set to zero.
Good point. Would it be OK to define it as either zero or a broadcast load? It's never random garbage.
Can you quantify the 'much simpler'? It seems to me that we are proposing to add half/quarter etc variants of all the various load operations (masked, partial, unaligned etc), that would mean more ops to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will edit this to not dive into details. Is this adequate:
"Use of tags in the C++ implementation reduces code duplication but the added templating makes C-level testing and tracing more complicated"
complex kernels that require 64-bit/32-bit stride alignment. Identified | ||
intrinsics related to this: | ||
|
||
- Highway's ``LoadInterleaved2`` would have to be extended with ``*Till`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Highway's ``LoadInterleaved2`` would have to be extended with ``*Till`` | |
Highway needs to implement the following memory operations, which mainly deal with non-contiguous memory operations and partial load/store. | |
| Name |u8 |i8 |u16|i16|u32|i32|u64|i64|f32|f64| | |
|:-----------------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:| | |
| LoadPairTill | - | - | - | - | x | x | x | x | x | x | | |
| LoadTill | - | - | - | - | x | x | x | x | x | x | | |
| IsLoadable | - | - | - | - | x | x | x | x | x | x | | |
| IsStorable | - | - | - | - | x | x | x | x | x | x | | |
| Loadn | - | - | - | - | x | x | x | x | x | x | | |
| LoadnTill | - | - | - | - | x | x | x | x | x | x | | |
| LoadnPair | - | - | - | - | x | x | x | x | x | x | | |
| LoadnPairTill | - | - | - | - | x | x | x | x | x | x | | |
| Storen | - | - | - | - | x | x | x | x | x | x | | |
| StorenTill | - | - | - | - | x | x | x | x | x | x | | |
| StorenPair | - | - | - | - | x | x | x | x | x | x | | |
| StorenPairTill | - | - | - | - | x | x | x | x | x | x | | |
| Lookup128 | - | - | - | - | x | x | x | x | x | x | |
These operations are described in the new C++ reference section memory.
Furthermore, Highway needs to identify the behavior of comparisons, including whether they are signaling/non-signaling or ordered and unordered. This information is fully described in the new reference. Please refer to the comparison reference for detailed information.
Another aspect that seems to be missing is the integer division by scalar. This operation is fully described here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the CI output link no longer works. Is there some other way to see your nicely formatted documentation?
Regarding comparisons, Highway uses _OQ, so ordered and quiet/non-signalling. Would that work for our purposes?
FYI float16 support is mostly done but not yet tested and requires some more work (casts) for GCC.
We didn't add integer division to Highway because it is generally inefficient and we haven't yet needed it. Is it commonly used in numpy? One option if the division is by a few constant values is to use libdivide, or a compiler, to translate that into vectorized mul by the inverse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the CI output link no longer works. Is there some other way to see your nicely formatted documentation?
On each PR, the ci/circleci: build artifact
job links directly to the html docs (assuming CI didn't fail, which in this case it did - at least, I think this is from gh-23096 which is currently red).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the nep link on the top line of the rendered artifacts point to the absolute NEP path, not the relative one. here is the latest version, that I got from the CircleCI workflow artifact listing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a high-level summary of missing functionality. How about:
- Highway is missing non-contiguous memory operations and partial load/store operations.
- Highway needs to identify the behavior of comparisons, including whether they are signaling/non-signaling or ordered and unordered.
- Highway is missing integer division by a scalar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates:
- we have non-contiguous ops, but they use Gather/Scatter and are thus slow. What's missing is special-cases such as pairs of doubles using 128-bit loads.
- as of last week we have partial Load/Store and also LoadNOr.
Comparisons, reductions and integer div are indeed missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a high-level summary of missing functionality. How about:
Here's a list of missing intrinsics:
-
Vec LoadPairTill(const TLane *ptr, size_t plen, TLane fill0, TLane fill1)
Partially pair loads N-lane vector contents from memory. -
Vec Loadn(const TLane *ptr, intptr_t stride)
Load N-lane vector contents from non-contiguous memory block. -
Vec LoadnTill(const TLane *ptr, intptr_t stride, size_t len, TLane fill)
Partially loads N-lane vector contents from a non-contiguous memory block. -
Vec LoadnPair(const TLane *ptr, intptr_t stride)
Pair loads N-lane vector contents from non-contiguous memory block. -
Vec LoadnPairTill(const TLane *ptr, intptr_t stride, size_t plen, TLane fill0, TLane fill1)
Partially pair loads N-lane vector contents from non-contiguous memory block. -
void StorePairTill(TLane *ptr, size_t plen, const Vec &a)
Partially pair store N-lane vector contents to memory. -
void Storen(TLane *ptr, intptr_t stride, const Vec &a)
Store N-lane vector contents to non-contiguous memory. -
void StorenTill(TLane *ptr, intptr_t stride, size_t len, const Vec &a)
Partially store N-lane vector contents to non-contiguous memory. -
void StorenPair(TLane *ptr, intptr_t stride, const Vec &a)
Pair store N-lane vector contents to non-contiguous memory. -
void StorenPairTill(TLane *ptr, intptr_t stride, size_t plen, const Vec &a)
Partially pair store N-lane vector contents to non-contiguous memory. -
Vec Lookup128(const TLane *table, const Vec<MakeUnsigned> &idx)
Perform a lookup operation on a 128-byte table using lane indices. -
TVec MaxProp(const TVec &a, const TVec &b)
Maximum with Floating-Point with Non-NaN Propagation. -
TVec MinProp(const TVec &a, const TVec &b)
Minimum with Floating-Point with Non-NaN Propagation. -
TVec MaxPropNan(const TVec &a, const TVec &b)
Maximum with Floating-Point with NaN Propagation. -
TVec MinPropNan(const TVec &a, const TVec &b)
Minimum with with NaN Propagation. -
TLane ReduceMax(const TVec &a)
Native Reduce Maximum. -
TLane ReduceMin(const TVec &a)
Native Reduce Minimum. -
TLane ReduceMaxProp(const TVec &a)
Reduce Maximum Floating-Point with Non-NaN Propagation. -
TLane ReduceMinProp(const TVec &a)
Reduce Minimum Floating-Point with Non-NaN Propagation. -
TLane ReduceMaxPropNan(const TVec &a)
Reduce Maximum Floating-Point with NaN Propagation. -
TLane ReduceMinPropNan(const TVec &a)
Reduce Minimum Floating-Point with NaN Propagation. -
TLane ReduceMaxProp(const TVec &a)
Reduce Maximum Floating-Point with Non-NaN Propagation. -
TLane ReduceMinProp(const TVec &a)
Reduce Minimum Floating-Point with Non-NaN Propagation. -
TLane ReduceMaxPropNan(const TVec &a)
Reduce Maximum Floating-Point with NaN Propagation. -
TLane ReduceMinPropNan(const TVec &a)
Reduce Minimum Floating-Point with NaN Propagation. -
Vec3 Divisor(TLane d)
Generate the precomputed parameters of the divisor. -
TVec Div(const TVec &a, const Vec3 &b)
Performs element-wise integer division between the lanes of the input vector ‘a’ and the scalar value b. -
TVec MulAddSub(const TVec &a, const TVec &b, const TVec &c)
Multiply the lanes of two vectors, add the lanes of a third vector for odd lanes, and subtract for even lanes. -
TVec IfAdd/IfSub/IfDiv(const TMask &m, const TVec &a, const TVec &b, const TVec &c)
Conditional add/sub/div of vectors based on a mask.
In addition(not supported yet), I was working in a patch to support the following AVX512 operations which is required in order to replace SVML kernels:
- _mm512_getmant_ps/_pd
- _mm512_scalef_ps/_pd
- _mm512_getexp_ps/_pd
as of last week we have partial Load/Store and also LoadNOr.
Current Highway partial Load/Store only supports contiguous memory operations according to the following testing unite:
LoadN:
https://github.com/google/highway/blob/ba0900a4957b929390ab73827235557959234fea/hwy/tests/memory_test.cc#L350-L404
StoreN:
https://github.com/google/highway/blob/ba0900a4957b929390ab73827235557959234fea/hwy/tests/memory_test.cc#L572-L634
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @seiko2plus for making the list :) I am back in the office and adding it to our list of pending ops.
FYI ReduceMax/ReduceMin
can already be expressed as GetLane(Max/MinOfLanes())
.
Current Highway partial Load/Store only supports contiguous memory operations
We also have MaskedGather/Scatter. The mask can be obtained for partial N via FirstN, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have MaskedGather/Scatter. The mask can be obtained for partial N via FirstN, right?
if performance regression isn't matter then yes, sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is too much of a moving target to publish this table in a permanent document. Maybe it belongs in a Highway issue instead, and then we can refer to the issue?
[skip cirrus] [skip actions]
…] [skip travis] [skip actions]
6105bfa
to
439d627
Compare
rebased and force-pushed to fix CI errors |
I am trying to understand how this will move forward constructively. There are several recurring topics:
(If I've missed anything, please feel free to add.) These all seem fixable in the long term. If we assume the best intentions, I'm confident that any improvements we can add to improve the developer experience of Highway will be received positively. Can we decouple these issues and focus on the high-level decision rather than continuing to get bogged down in specific technical details which are easier to fix independently? At the moment, I can't justify committing resources to support the porting of routines due to the ambiguity in effort and technical direction. As demonstrated by #24018, I'm happy to head in a clear direction. |
I haven’t been following this discussion and I noticed that some initial support for highway is getting merged. As a spectator who would like to learn more, it would be nice to have a version of the NEP merged. |
Agree, we can merge this to indicate that Highway is available as an option for future work, which we will continue to assess. As highlighted in the last optimisation team meeting, the C++ machinery is an immense step forward for NumPy. |
[skip travis] [skip azp] [skip cirrus]
I have tried to resolve the discussions, mainly by removing details. I also removed the mentions of meson since that has been resolved. The adoption of Highway intrinsics has also been resolved since we incorporated them into sorting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, with two minor comments :)
use Highway primitives internally. We could still use Highway sorting routines. | ||
If we do accept lower-precision routines (via a user-supplied choice, i.e. | ||
extending ``errstate`` to allow a precision option), we could use | ||
Highway-native routines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are also open to adding new higher-precision function versions to Highway, e.g. as part of Sayed's work to port from SVML.
Any more comments or editing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Matti and Jan. One more suggestion from me about the dynamic dispatch section. Then I think it's time to merge this; perfect is the enemy of good here. We're further along now, so getting this NEP in in its current state and learning from the 2.0 release where we'll have Highway as a git submodule and are using it for sorting-related functionality at least may give us more insights.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest some minor text fixes and some added context; otherwise, I echo the sentiment that this seems fine.
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
The rendered NEP is here. It would be nice to get this merged. |
Interesting to see that SLEEF is being actively worked on again - it seems like a couple from hardware vendors are active, and at least one Arm engineer got commit rights. Would be great to see that activity continue. |
The latest textual updates all LGTM and seem fairly minor. Let's get this merged in 2023 still - I'll hit the green button here and will mention it on the mailing list. Thanks everyone! |
Actually, let me add Chris as a co-author as well before doing that, that seems very much justified. |
[skip ci]
Thanks @rgommers |
This NEP compares the current plan of moving the NumPy universal intrinsics framework to C++ with the adoption of Google Highway, as discussed in gh-24084. There is no proposed decision yet, this is a write-up of the discussion so far (which makes it slightly unusual as a NEP).
My current recommendation is to continue the conversation and improve this summary as needed. As of right now, the large amount of effort and the impact on licensing and MSVC usage mean that adoption of Google Highway should at a minimum be postponed until after we've dealt with the 1.26.0 release over the next couple of months. That work has a hard deadline after all, due to the Python 3.12 release cycle.
I've added @seiko2plus and @jan-wassenberg as co-authors already, because quite a few technical arguments are verbatim from what they wrote in gh-24084. And @mattip because we drafted the first version of this document together. Happy to add more authors as needed when there is more discussion.