ENH: Use Highway's VQSort on AArch64 #24018
Conversation
numpy/core/setup.py
for sort_file in [
    "vqsort.cc",
    "vqsort_128a.cc",
    "vqsort_128d.cc",
    "vqsort_f32a.cc",
    "vqsort_f32d.cc",
    "vqsort_f64a.cc",
    "vqsort_f64d.cc",
    "vqsort_i16a.cc",
    "vqsort_i16d.cc",
    "vqsort_i32a.cc",
    "vqsort_i32d.cc",
    "vqsort_i64a.cc",
    "vqsort_i64d.cc",
    "vqsort_kv64a.cc",
    "vqsort_kv64d.cc",
    "vqsort_kv128a.cc",
    "vqsort_kv128d.cc",
    "vqsort_u16a.cc",
    "vqsort_u16d.cc",
    "vqsort_u32a.cc",
    "vqsort_u32d.cc",
    "vqsort_u64a.cc",
    "vqsort_u64d.cc",
]
This isn't ideal: when compiling without these files I got missing-symbol issues, but we definitely don't use them all, and they add ~100M to the .so.
@jan-wassenberg any thoughts on how to minimize the number of files we need here?
I think we're only using {u/i/f}{32/64}a in your dispatcher and we ought to be able to remove 128*, 16*, and *d.cc (the latter for descending order).
If there are still linker errors after removing only those, then it's likely due to the dllexport. What we could then do is add a macro bitfield that you could define to replace the implementation of the unused/unwanted types with a stub.
Something like VQSORT_TYPES which contains multiple groups (one each for uint/int/float/KV), where each group has one bit encoding whether the element size equal to the bit's value is enabled.
For example, enabling only the unsigned/float types: #define VQSORT_TYPES (((4 | 8) << 6) | (4 | 8)).
Would that be useful?
As an example, when I try to run tests:
Original error was: /numpy/build/testenv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-aarch64-linux-gnu.so: undefined symbol: _ZN3hwy6VQSortEPlmNS_14SortDescendingE
The other option is I could use vqsort-inl.h and copy the implementations across from the various vqsort_X.cc files? That'd also mean I can do float16 without adding a vqsort_f16a.cc.
Could you add something like this to vqsort-inl.h? I notice there's a lot of detail namespace code which would be bad to include in the NumPy source.
template <typename T>
void VQSort(T* HWY_RESTRICT keys, size_t n, SortAscending) {
  SortTag<T> d;
  detail::SharedTraits<detail::TraitsLane<detail::OrderAscending<T>>> st;
  Sort(d, st, keys, n);
}
If not, the bitfield approach is also good 😸
Sure, we can add that. As written, this will not support the 128-bit nor key-value types - is that OK?
Sure, we can add that. As written, this will not support the 128-bit nor key-value types - is that OK?
That'd be specialising around uint128_t with the same uint64*2 logic as in the .cc file? I think we can add it later if necessary; there's no sort dispatch for it atm.
Right. And KV types also use different Traits.
Sounds good, we can indeed add that later.
@@ -0,0 +1,44 @@
/*@targets
I split this out as Highway's sort doesn't have an implementation of argsort, unless I'm wrong @jan-wassenberg?
FYI argsort can be implemented by copying an array to hwy::K32V32 or K64V64 types (putting the key in .key and 0..N-1 into .value) and then copying only the .value back to the array.
I think this is worthy of wider discussion. So far we have
Perhaps we should stop and think where this all is leading. Maybe NumPy should not be in the business of providing highly specialized inner loop functions, and we should instead provide an easily pluggable interface for others to experiment with, together with a base implementation and test cases for accuracy (known good answers) / precision (same answers across various array shapes).
Yip, I've been struggling with the current set of NumPy variations on the theme of performance optimisation; it's not clear what the current best practice is or where it's headed. Landing these vendored libraries (such as

My interpretation of the proposal for using SLEEF (alongside my reasoning for backing something like this) is that people want NumPy to perform better out-of-the-box, and SLEEF already has a bunch of cross-platform routines for this; indicating that there is a desire for things to be faster, but these efforts haven't succeeded so far. That demand helps me understand that we should be taking a more off-the-shelf solution, which means doing less work in NumPy but still meeting the original goals of the universal intrinsics work - which is what I see when I stumble across Highway.

We could take a lot of complexity out of NumPy itself by leveraging Highway's dynamic dispatch, optimised routines and existing universal intrinsics solution (which already supports scalable vectors).

I don't think performance should be passed back to users as plug-ins; if I wanted performance without using NumPy itself, I could easily move to something like Numba, one of the other optimisation libraries, or stop using NumPy (as the de facto standard, such a stance seems rather out of place). I did like the errstate proposal, which allows users to opt into the lower accuracy, higher throughput routines, but the high/low accuracy problem is only applicable to specific routines.

Returning to this case: it's a sort routine, and if it puts things in the wrong order it's wrong - these optimisations can benefit all users without much controversy, which seems like something everyone would want? Please don't lose sight of the benefits this specific PR brings whilst considering whether to integrate more of Highway in the future.
We discussed this a bit yesterday; please correct me if I am wrong (or just disagree). The way I interpreted this, there was also a hope to use Highway more generally. Sentiment was a bit towards:
Overall, there wasn't much enthusiasm for this yesterday.
Hi, Highway/vqsort main author here :)
Not sure this is the case. Our intrinsics are just fully-inlined wrapper functions. Only the sort implementation itself would contribute to binary size.

VQSort also has some advantages concerning speed on AVX-512 (see https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md#updated-vqsort-results) and also robustness vs skewed input distributions; see other parts of that article.

Happy to discuss further, including via video call, if you are interested.
@jan-wassenberg, I think this would just be something like
I think the larger point is that we are uncomfortable adding yet another submodule with inner loop implementations to NumPy. We would prefer to generalize the routines we already have for x86_64, using NumPy's universal intrinsics, to the other architectures.
Maybe we can split this up a bit, since I probably munched it too much yesterday.

I am not sure how much we gain by adopting highway more broadly; there were arguments against it, and maybe in part that is because Sayed has momentum for the "custom" universal intrinsics, but there were more arguments. (I honestly would prefer to just steal something, but SIMD is already hairy with compiler bugs, and I don't know it well enough to judge whether we would just run into more issues.)

So this is about sorting: I think right now it just switches things on AArch64? But highway supports also AVX512, so in the end it would really be swapping out one implementation for another. This doesn't actually stop us from reversing it out again once we can, so the question is if there is a downside? (if there are issues, too bad, we will revert)

I probably misunderstood that this had some impact on binary size. I have to admit, if it doesn't, I am not sure I see a reason for not swapping in highway for sorting (completely) until Raghuveer's work supports SIMD more broadly.
@mattip, that seems to disadvantage those who didn't author the x86_64 code, whilst there is an equivalent solution available for many architectures. I'm personally reluctant to invest further effort in porting to universal intrinsics; there are solutions already available to users, as proven by the SVML and x86-simd-sort integrations. Previous universal intrinsic work was already a wasted effort now that the math routines have been reverted and accuracy changed in SVML.
Where is this momentum? It'd be great to see things like #22265 landed, but it has stalled waiting for the C++ intrinsic work. As far as compiler bugs, I think you'll get them either way; they're bugs after all 😸
Sorry @seberg, that's likely down to me; originally this was pretty bloated thanks to including many separate compilation units, but through @jan-wassenberg's efforts on
Thanks for bringing this back to the benefits of this contribution itself. I did limit this to AArch64 to minimise the overall impact, mirroring the accepted x86-simd-sort integration - it wouldn't be hard to enable this for AVX512 as well and potentially further. One limitation is it doesn't quite drop in for FP16 (@jan-wassenberg can maybe advise), nor argsort, but we can probably implement that ourselves based on the suggestion above.
I am sorry you feel that way. I seem to remember having similar feelings when SVML was proposed, and that the goal all along was to replace SVML with universal intrinsics. Unfortunately we did not adequately address accuracy when adding the tests that were a condition for the SVML acceptance, and that led to the mess around sin/cos. At the end of the day mine is just one opinion of many maintainers. I think the place to push for a revamped approach to SIMD intrinsics is to discuss replacing NEP 38 on the mailing list, or to create a PR amending that NEP.
There are two aspects to this PR:
(1) is tied with (2): you can't port sorting routines without using highway as the SIMD framework, which duplicates NumPy universal intrinsics. If NumPy wants to use highway as the SIMD framework, that's a whole new discussion, and as @mattip says it would require modifying NEP 38.

If the concern is just about sorting routines on AArch64, we could address that by porting over x86-simd-sort using NumPy universal intrinsics once we merge #21057. If this is the direction we want to go, I will be happy to collaborate with @Mousius to port it. Note that x86-simd-sort also supports O(1) space
I don't think we should be considering the implementation details of the sort algorithm, just as

From a technical point of view, this is exactly the same as implementing sort using

I'm also getting a distinct impression from these conversations that not many are in favour of replacing NEP 38, as only @seberg has spoken in favour of using something already written.
I can't guarantee I'll be able to resource porting anything further; if/when C++ intrinsics are mature enough to do so, we'll have to re-assess then. It seems to be creating more work given VQSort already supports multiple platforms, but it is already the chosen solution, so maybe we'll have to.
I agree there are two discussions here; let's move the intrinsics part to a separate issue.
Yes, I happen to be working on fp16 at the moment :) It's a larger undertaking, so will take several days.
Understandable. With VQSort author hat on, I'd point out that it is also considerably faster and

It might be possible to copy over some of the techniques from VQSort while porting, though. Have we estimated the number of new universal intrinsics that would be required? I imagine that more than a few new ones will be required for the shuffles used in sorting networks.
To adopting VQSort as suggested in this PR? Adding any code increases compile time and binary size, but I do not believe that is problematic here, so no, I do not see downsides for numpy.
numpy/core/setup.py
# Set NPY_DISABLE_HIGHWAY=1 in the environment to disable the Highway
# library. This option only has significance on AArch64 and is most
# useful to avoid improperly requiring Highway when cross compiling.
This seems copied from the SVML snippet; however, the reason it exists for SVML is that that library ships as object files rather than as source files. There shouldn't be an issue with Highway when cross-compiling, I hope?
This is a bit messier now that #24201 has landed and mixed select in with sort. It appears that
Thank you for working on this important enhancement by integrating the Google Highway library. I just have one reservation about decoupling from the Intel sort library: I suggest moving the Google Highway-specific code to a separate source file to avoid conflicts and facilitate future modifications, and I also suggest taking advantage of the CPU features supported within Highway by enabling more CPU targets.
[
  'simd_argsort.dispatch.h',
  'src/npysort/simd_argsort.dispatch.cpp',
  [AVX512_SKX]
],
Suggested change:
[
  'simd_argsort.dispatch.h',
  'src/npysort/simd_argsort.dispatch.cpp',
  [AVX512_SKX]
],
[
  'highway_qsort.dispatch.h',
  'src/npysort/highway_qsort.dispatch.cpp',
  [[AVX2, FMA3], SSE42, ASIMD, VSX2]
],
Create an independent source and also a separate configuration header for the CPU targets. Please note that this will require creating a separate header, similar to simd_qsort.hpp, containing the forward declarations of the defined highway sort functions, while considering the use of a different namespace, for example, highway_qsort.
Another note here: Are there any reasons to limit the use to the ARM architecture?
@@ -779,13 +782,18 @@ foreach gen_mtargets : [
[
  'simd_qsort.dispatch.h',
  'src/npysort/simd_qsort.dispatch.cpp',
-  [AVX512_SKX]
+  [AVX512_SKX, ASIMD]
Suggested change:
- [AVX512_SKX, ASIMD]
+ [AVX512_SKX]

Based on the presented suggestion, there would be no purpose in adding this target.
#if !DISABLE_HIGHWAY_OPTIMIZATION
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "simd_qsort.dispatch.h"
#endif
    NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
}
#endif
Suggested change:
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "simd_qsort.dispatch.h"
#endif
    NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
    if (dispfunc == nullptr) {
        // Priority is given to the Intel sort library due to its efficient
        // support for AVX512. For other CPU targets, we fall back to
        // Google's Highway sort.
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "highway_qsort.dispatch.h"
#endif
        NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::highway_qsort::template QSort, <TF>);
    }
}
This uses Google Highway after confirming the unavailability of the Intel library, whether due to architectural differences or the AVX512 features not being available at runtime. Also, note that there is no need for the guard DISABLE_HIGHWAY_OPTIMIZATION, as the targeted features are controlled through Meson.
@@ -89,31 +71,32 @@ template<> void NPY_CPU_DISPATCH_CURFX(QSort)(double *arr, intptr_t size)
{
    avx512_qsort(arr, size);
}
-template<> void NPY_CPU_DISPATCH_CURFX(ArgQSort)(int32_t *arr, npy_intp *arg, npy_intp size)
+#elif USE_HIGHWAY
Suggested change:
- #elif USE_HIGHWAY

Based on the proposed suggestion, the following functions in this modification should be defined within a separate new source, highway_qsort.dispatch.cpp.
@@ -28,6 +28,7 @@
#include "simd_qsort.hpp"

#define NOT_USED NPY_UNUSED(unused)
#define DISABLE_HIGHWAY_OPTIMIZATION (defined(__arm__) || defined(__aarch64__))
Suggested change:
- #define DISABLE_HIGHWAY_OPTIMIZATION (defined(__arm__) || defined(__aarch64__))

Based on the presented suggestion, there would be no purpose in keeping this definition.
Hi @seiko2plus,

The reason this patch targets AArch64 specifically is that I don't want to have one large unwieldy change; I want to break the change down and work incrementally. This patch puts in place the underlying infrastructure and enables a single target without impacting other users. By keeping it minimal, it's easier to maintain and rebase, which I've had to do for nearly 6 months.

This patch does leave the code in an intermediary state, which I'd prefer to keep temporarily whilst I work on the final state for x86. The way forwards, in my opinion, would be to do as @r-devulap suggested in the Highway discussion and leverage VQSort instead of x86-simd-sort. Working incrementally, I can focus solely on enabling further platforms and algorithms in each follow-up patch. We should at least reach parity for x86, even if that requires work in Highway itself before removing
Hi @Mousius, IMHO, I see it the other way around: the requested changes should make it easier to maintain and avoid conflicts with
As I've highlighted, this would be the start of a series of changes, which would eventually clean up those objects. Changing this patch would introduce more complexity in the already complex Meson build infrastructure, which I'd prefer to avoid in favour of simply deleting a few lines later.
I don't think Highway will have issues with other architectures; please don't interpret my response so negatively. I don't want to create a more difficult-to-maintain patch, in both code and cognitive load, and I have no concerns about using Highway in the future (as highlighted by my support of using Highway in NEP 54).
Architecturally, trying to stitch together several libraries for the same functionality increases maintenance in NumPy rather than having a single cross-platform library with reasonable cross-platform performance. Given that both myself and the maintainers of Highway have indicated that we would do work to enable more architectures, please could you refrain from making negative comments regarding a lack of interest which doesn't exist.

I do understand I don't have a say in how NumPy is architected, and if I have time, I'll re-do this patch again. Unfortunately, I'm not working many days at the moment. This patch has been open for a long time without realising any user value, and the reviews have been sparse, so you can understand why I wouldn't prioritise this over other work which would deliver value earlier.
Introduces Highway, a SIMD library from Google. Highway has its own dispatch mechanism and SIMD flavour (which adds padding), which may be a good replacement for NumPy intrinsics. Highway also has its own set of vectorised math routines we could bolt on.

For this patch, I've integrated the VQSort algorithm only on AArch64 so as not to impact other architectures. The VQSort algorithm's performance is comparable to the AVX512 algorithms, according to this report: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

Performance improvements in the NumPy benchmark suite are as follows:

```
       before           after         ratio
     [f4b52d53]       [f08058d4]
   <main-setuppy>   <highway-sort>
+ 53.0±0.08μs  221±0.4μs    4.18  bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))
+ 84.8±0.3μs   222±0.3μs    2.62  bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
+ 89.0±0.2μs   197±0.2μs    2.21  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
+ 51.5±0.01μs  80.0±0.1μs   1.55  bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
+ 52.0±0.07μs  80.3±0.2μs   1.55  bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))
+ 148±0.3μs    197±0.2μs    1.33  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
+ 471±0.2μs    598±0.5μs    1.27  bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))
+ 536±0.7μs    644±0.8μs    1.20  bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))
+ 39.8±0.02μs  47.4±0.05μs  1.19  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
+ 39.8±0.01μs  47.4±0.02μs  1.19  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+ 40.1±0.01μs  47.6±0.04μs  1.19  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
+ 27.9±0.03μs  32.6±0.03μs  1.17  bench_function_base.Sort.time_argsort('heap', 'uint32', ('uniform',))
+ 685±0.5μs    796±0.8μs    1.16  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))
+ 68.7±0.05μs  78.4±0.04μs  1.14  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+ 68.7±0.05μs  78.3±0.1μs   1.14  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
+ 69.4±0.03μs  78.8±0.05μs  1.14  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
+ 22.0±0.01μs  24.7±0.01μs  1.12  bench_function_base.Sort.time_sort('heap', 'uint32', ('uniform',))
+ 674±0.7μs    743±0.4μs    1.10  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 1000))
+ 730±0.6μs    794±0.8μs    1.09  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 100))
+ 464±2μs      492±10μs     1.06  bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))
+ 95.5±0.1μs   101±0.1μs    1.06  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
+ 839±1μs      886±2μs      1.06  bench_function_base.Sort.time_argsort('heap', 'float64', ('random',))
+ 95.7±0.1μs   101±0.05μs   1.06  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
- 516±0.6μs    488±0.9μs    0.95  bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))
- 532±1μs      499±5μs      0.94  bench_function_base.Sort.time_argsort('heap', 'int16', ('reversed',))
- 34.5±0.05μs  32.2±0.03μs  0.93  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 1000))
- 715±1μs      666±0.4μs    0.93  bench_function_base.Sort.time_sort('heap', 'float64', ('random',))
- 254±2μs      231±0.2μs    0.91  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
- 655±0.5μs    592±0.3μs    0.90  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))
- 24.7±0.02μs  22.0±0.07μs  0.89  bench_function_base.Sort.time_sort('heap', 'int32', ('uniform',))
- 621±0.7μs    549±0.3μs    0.88  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
- 659±0.9μs    555±0.5μs    0.84  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))
- 545±0.4μs    444±0.5μs    0.81  bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))
- 511±0.5μs    396±0.4μs    0.78  bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))
- 316±0.6μs    225±0.2μs    0.71  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))
- 333±0.4μs    233±0.3μs    0.70  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))
- 323±0.2μs    206±0.2μs    0.64  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
- 134±0.9μs    85.0±0.07μs  0.64  bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))
- 376±0.9μs    228±0.3μs    0.61  bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
- 418±0.4μs    207±0.4μs    0.50  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
- 413±2μs      199±0.2μs    0.48  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))
- 64.4±0.3ms   28.6±0.3ms   0.44  bench_function_base.Sort.time_sort_worst
- 501±2μs      204±0.3μs    0.41  bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
- 248±1μs      86.9±0.2μs   0.35  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))
- 251±0.7μs    87.0±0.2μs   0.35  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))
- 329±0.9μs    91.4±0.1μs   0.28  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000))
- 329±1μs      85.2±0.06μs  0.26  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
- 329±0.7μs    85.0±0.06μs  0.26  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))
- 315±0.5μs    80.4±0.05μs  0.26  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))
- 320±1μs      80.3±0.09μs  0.25  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))
- 372±0.4μs    82.1±0.06μs  0.22  bench_function_base.Sort.time_sort('quick', 'int32', ('random',))
- 414±0.4μs    89.6±0.1μs   0.22  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))
- 384±1μs      81.8±0.02μs  0.21  bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))
- 415±0.3μs    84.8±0.2μs   0.20  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))
- 491±2μs      86.3±0.1μs   0.18  bench_function_base.Sort.time_sort('quick', 'float32', ('random',))
- 111±0.4μs    13.0±0.01μs  0.12  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
- 50.9±0.03μs  4.93±0.01μs  0.10  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
- 108±0.3μs    7.25±0.02μs  0.07  bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',))
- 51.7±0.05μs  3.43±0.01μs  0.07  bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))
- 51.3±0.07μs  3.35±0.06μs  0.07  bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',))
```
Not sure about the complexity you're facing. My suggestion was similar to your current approach, but instead of creating a new source
I didn't mean to come off as negative at all. I just thought that adding a few extra CPU targets to your patch might not change much, maybe just a line or two. My conclusion was based on your response and my understanding that enabling a target like
Offering multiple choices should be considered if there's a performance benefit. If both solutions are sufficiently isolated, it shouldn't lead to a maintenance burden.
This is the reason I suggested leveraging the broad support and enabling additional CPU features of the architectures that are already supported by Highway.
Once again, I don't mean to be negative at all, and I will try to make my points more clear to avoid any kind of misunderstanding.
During our last optimization meeting, we agreed to accept your work as it is, without any further changes, to expedite progress. Later, we may focus on separating the two libraries.
We can just agree to disagree here 😸. I've been out sick for several weeks and am looking at a phased return, so actually getting to do the work is difficult. @r-devulap has expressed willingness to do the follow-up, which I really appreciated!
Aha, that would be my preference, and then I could focus on enabling AVX2 if needed in a single patch - dependent on whether we agree that using Highway rather than x86-simd-sort is worth it.
Every branch in code or extra dependency is a maintenance burden, but I'm not going to worry too much about this one, given @r-devulap is keen to continue maintenance.
Yip, this comment was before the meeting. I believe we did reach a compromise where this patch would land, and @r-devulap would support me in doing a follow-up (which, again, I really appreciate!). I'm excited to see it land after such long discussions 😸
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Mousius!
I would prefer to have #25045 for AVX2 versions of sort and partition. It will soon be followed up with AVX2 versions of argsort and argpartition.
@Mousius, you are welcome! Once this is merged, I will update #24924 to split x86-simd-sort into its own dispatch module, like how @seiko2plus prefers it.
@Mousius, I think we should have a release note for this PR.