ENH: Use Highway's VQSort on AArch64 #24018


Merged: 8 commits into numpy:main, Nov 24, 2023

Conversation

@Mousius (Member) commented Jun 22, 2023

Introduces Highway, a SIMD library from Google. Highway has its own dispatch mechanism and SIMD flavour (which adds padding), which may be a good replacement for NumPy intrinsics. Highway also has its own set of vectorised math routines we could bolt on.

For this patch, I've integrated the VQSort algorithm only on AArch64 so as not to impact other architectures. The VQSort algorithm's performance is comparable to the AVX512 algorithms, according to this report: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

Performance changes in the NumPy benchmark suite are as follows (ratio = after/before, so values below 1.0 are improvements):

```
       before           after         ratio
     [f4b52d53]       [f08058d4]
     <main-setuppy>       <highway-sort>
+     53.0±0.08μs        221±0.4μs     4.18  bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))
+      84.8±0.3μs        222±0.3μs     2.62  bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
+      89.0±0.2μs        197±0.2μs     2.21  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
+     51.5±0.01μs       80.0±0.1μs     1.55  bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
+     52.0±0.07μs       80.3±0.2μs     1.55  bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))
+       148±0.3μs        197±0.2μs     1.33  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
+       471±0.2μs        598±0.5μs     1.27  bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))
+       536±0.7μs        644±0.8μs     1.20  bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))
+     39.8±0.02μs      47.4±0.05μs     1.19  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
+     39.8±0.01μs      47.4±0.02μs     1.19  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+     40.1±0.01μs      47.6±0.04μs     1.19  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
+     27.9±0.03μs      32.6±0.03μs     1.17  bench_function_base.Sort.time_argsort('heap', 'uint32', ('uniform',))
+       685±0.5μs        796±0.8μs     1.16  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))
+     68.7±0.05μs      78.4±0.04μs     1.14  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+     68.7±0.05μs       78.3±0.1μs     1.14  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
+     69.4±0.03μs      78.8±0.05μs     1.14  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
+     22.0±0.01μs      24.7±0.01μs     1.12  bench_function_base.Sort.time_sort('heap', 'uint32', ('uniform',))
+       674±0.7μs        743±0.4μs     1.10  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 1000))
+       730±0.6μs        794±0.8μs     1.09  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 100))
+         464±2μs         492±10μs     1.06  bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))
+      95.5±0.1μs        101±0.1μs     1.06  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
+         839±1μs          886±2μs     1.06  bench_function_base.Sort.time_argsort('heap', 'float64', ('random',))
+      95.7±0.1μs       101±0.05μs     1.06  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
-       516±0.6μs        488±0.9μs     0.95  bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))
-         532±1μs          499±5μs     0.94  bench_function_base.Sort.time_argsort('heap', 'int16', ('reversed',))
-     34.5±0.05μs      32.2±0.03μs     0.93  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 1000))
-         715±1μs        666±0.4μs     0.93  bench_function_base.Sort.time_sort('heap', 'float64', ('random',))
-         254±2μs        231±0.2μs     0.91  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
-       655±0.5μs        592±0.3μs     0.90  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))
-     24.7±0.02μs      22.0±0.07μs     0.89  bench_function_base.Sort.time_sort('heap', 'int32', ('uniform',))
-       621±0.7μs        549±0.3μs     0.88  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
-       659±0.9μs        555±0.5μs     0.84  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))
-       545±0.4μs        444±0.5μs     0.81  bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))
-       511±0.5μs        396±0.4μs     0.78  bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))
-       316±0.6μs        225±0.2μs     0.71  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))
-       333±0.4μs        233±0.3μs     0.70  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))
-       323±0.2μs        206±0.2μs     0.64  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
-       134±0.9μs      85.0±0.07μs     0.64  bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))
-       376±0.9μs        228±0.3μs     0.61  bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
-       418±0.4μs        207±0.4μs     0.50  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
-         413±2μs        199±0.2μs     0.48  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))
-      64.4±0.3ms       28.6±0.3ms     0.44  bench_function_base.Sort.time_sort_worst
-         501±2μs        204±0.3μs     0.41  bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
-         248±1μs       86.9±0.2μs     0.35  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))
-       251±0.7μs       87.0±0.2μs     0.35  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))
-       329±0.9μs       91.4±0.1μs     0.28  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000))
-         329±1μs      85.2±0.06μs     0.26  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
-       329±0.7μs      85.0±0.06μs     0.26  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))
-       315±0.5μs      80.4±0.05μs     0.26  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))
-         320±1μs      80.3±0.09μs     0.25  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))
-       372±0.4μs      82.1±0.06μs     0.22  bench_function_base.Sort.time_sort('quick', 'int32', ('random',))
-       414±0.4μs       89.6±0.1μs     0.22  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))
-         384±1μs      81.8±0.02μs     0.21  bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))
-       415±0.3μs       84.8±0.2μs     0.20  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))
-         491±2μs       86.3±0.1μs     0.18  bench_function_base.Sort.time_sort('quick', 'float32', ('random',))
-       111±0.4μs      13.0±0.01μs     0.12  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
-     50.9±0.03μs      4.93±0.01μs     0.10  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
-       108±0.3μs      7.25±0.02μs     0.07  bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',))
-     51.7±0.05μs      3.43±0.01μs     0.07  bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))
-     51.3±0.07μs      3.35±0.06μs     0.07  bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',))
```

Comment on lines 1042 to 1066
```
for sort_file in [
    "vqsort.cc",
    "vqsort_128a.cc",
    "vqsort_128d.cc",
    "vqsort_f32a.cc",
    "vqsort_f32d.cc",
    "vqsort_f64a.cc",
    "vqsort_f64d.cc",
    "vqsort_i16a.cc",
    "vqsort_i16d.cc",
    "vqsort_i32a.cc",
    "vqsort_i32d.cc",
    "vqsort_i64a.cc",
    "vqsort_i64d.cc",
    "vqsort_kv64a.cc",
    "vqsort_kv64d.cc",
    "vqsort_kv128a.cc",
    "vqsort_kv128d.cc",
    "vqsort_u16a.cc",
    "vqsort_u16d.cc",
    "vqsort_u32a.cc",
    "vqsort_u32d.cc",
    "vqsort_u64a.cc",
    "vqsort_u64d.cc",
]
```
Member Author

This isn't ideal: when compiling without these files I got missing-symbol issues, but we definitely don't use them all, and they add ~100M to the .so.

@jan-wassenberg any thoughts on how to minimize the number of files we need here?

Contributor

I think we're only using {u/i/f}{32/64}a in your dispatcher and we ought to be able to remove 128*, 16*, and *d.cc (the latter for descending order).

If there are still linker errors after removing only those, then it's likely due to the dllexport. What we could then do is add a macro bitfield that you could define to replace the implementation of the unused/unwanted types with a stub.

Something like VQSORT_TYPES which contains multiple groups (one each for uint/int/float/KV), where each group has one bit encoding whether the element size equal to the bit's value is enabled.

For example, enabling only the unsigned/float types: #define VQSORT_TYPES (((4 | 8) << 6) | (4 | 8)).
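
For concreteness, a sketch of how such a bitfield might be decoded; the group layout and macro names below are hypothetical, not an existing Highway API:

```
/* One group per type class; a set bit enables elements whose size in bytes
   equals the bit's value. Mirroring the example above: low group = float,
   next group (shifted by 6) = unsigned. */
#define VQSORT_GROUP_FLOAT 0
#define VQSORT_GROUP_UINT  6

/* Enable only the 4- and 8-byte float and unsigned types: */
#define VQSORT_TYPES (((4 | 8) << VQSORT_GROUP_UINT) | ((4 | 8) << VQSORT_GROUP_FLOAT))

/* A translation unit such as vqsort_u32a.cc could then stub itself out: */
#if !(VQSORT_TYPES & (4 << VQSORT_GROUP_UINT))
/* ...define VQSort(uint32_t*, ...) as an empty stub... */
#endif
```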

Would that be useful?

Member Author

As an example, when I try to run tests:

Original error was: /numpy/build/testenv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-aarch64-linux-gnu.so: undefined symbol: _ZN3hwy6VQSortEPlmNS_14SortDescendingE

The other option is that I could use vqsort-inl.h and copy the implementations across from the various vqsort_X.cc files? That'd also mean I can do float16 without adding a vqsort_f16a.cc.

Member Author

Could you add something like this to vqsort-inl.h? I notice there's a lot of detail which would be bad to include in the NumPy source.

```
template <typename T>
void VQSort(T* HWY_RESTRICT keys, size_t n, SortAscending) {
  SortTag<T> d;
  detail::SharedTraits<detail::TraitsLane<detail::OrderAscending<T>>> st;
  Sort(d, st, keys, n);
}
```

If not, the bitfield approach is also good 😸

Contributor

Sure, we can add that. As written, this will not support the 128-bit or key-value types - is that OK?

Member Author

Sure, we can add that. As written, this will not support the 128-bit or key-value types - is that OK?

That'd be specialising around uint128_t with the same uint64*2 logic as in the .cc file? I think we can add it later if necessary; there's no sort dispatch for it atm.

Contributor

Right. And KV types also use different Traits.
Sounds good, we can indeed add that later.

@@ -0,0 +1,44 @@
/*@targets
Member Author

I split this out as Highway's sort doesn't have an implementation of argsort, unless I'm wrong, @jan-wassenberg?

Contributor

FYI argsort can be implemented as copying an array to hwy::K32V32 or K64V64 types (putting the key in .key and 0..N-1 into .value) and then copying only the .value back to the array.
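
For illustration, a minimal sketch of that approach, assuming the hwy::K64V64 type (with .key/.value members) and the K64V64 overload of hwy::VQSort from hwy/contrib/sort/vqsort.h; unsigned keys keep the ordering trivially correct:

```
#include <cstdint>
#include <vector>
#include <hwy/contrib/sort/vqsort.h>

// Fill `indices` with the permutation that sorts `keys` ascending.
void ArgSort(const uint64_t* keys, uint64_t* indices, size_t n) {
  std::vector<hwy::K64V64> kv(n);
  for (size_t i = 0; i < n; ++i) {
    kv[i].key = keys[i];  // sort key
    kv[i].value = i;      // original position, 0..N-1
  }
  hwy::VQSort(kv.data(), n, hwy::SortAscending());
  for (size_t i = 0; i < n; ++i) {
    indices[i] = kv[i].value;  // copy only .value back
  }
}
```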

copybara-service bot pushed a commit to google/highway that referenced this pull request Jun 22, 2023 (PiperOrigin-RevId: 542571742)
copybara-service bot pushed a commit to google/highway that referenced this pull request Jun 22, 2023 (PiperOrigin-RevId: 542579202)
@Mousius force-pushed the highway-vqsort branch 5 times, most recently from 4a6edf9 to 50bd401 on June 26, 2023
@seberg added the triage review label (Issue/PR to be discussed at the next triage meeting) on Jun 26, 2023
@mattip (Member) commented Jun 27, 2023

Introduces Highway, a SIMD library from Google. Highway has its own dispatch mechanism and SIMD flavour (which adds padding), which may be a good replacement for NumPy intrinsics. Highway also has its own set of vectorised math routines we could bolt on.

I think this is worthy of wider discussion. So far we have

  • legacy routines
  • universal intrinsics
  • svml and x86-simd-sort for AVX512
  • proposals to use SLEEF

Perhaps we should stop and think where this all is leading. Maybe NumPy should not be in the business of providing highly specialized inner loop functions, and we should instead provide an easily pluggable interface for others to experiment with, together with a base implementation and test cases for accuracy (known good answers) / precision (same answers across various array shapes).

@Mousius (Member Author) commented Jun 27, 2023

Introduces Highway, a SIMD library from Google. Highway has its own dispatch mechanism and SIMD flavour (which adds padding), which may be a good replacement for NumPy intrinsics. Highway also has its own set of vectorised math routines we could bolt on.

I think this is worthy of wider discussion. So far we have

  • legacy routines
  • universal intrinsics
  • svml and x86-simd-sort for AVX512
  • proposals to use SLEEF

Perhaps we should stop and think where this all is leading. Maybe NumPy should not be in the business of providing highly specialized inner loop functions, and we should instead provide an easily pluggable interface for others to experiment with, together with a base implementation and test cases for accuracy (known good answers) / precision (same answers across various array shapes).

Yip, I've been struggling with the current set of NumPy variations on the theme of performance optimisation; it's not clear what the current best practice is or where it's headed. Landing these vendored libraries (such as SVML/x86-simd-sort) seems to be the lowest bar, but doesn't seem in line with any overall strategy. Having vendored only AVX512 optimisations does mean there's a clear favourite for which platform you should be using NumPy on in terms of performance.

My interpretation of the proposal for using SLEEF (alongside my reasoning for backing something like this) is that people want NumPy to perform better out-of-the-box, and SLEEF already has a bunch of cross-platform routines for this. This indicates a desire for things to be faster, even though these efforts haven't succeeded so far. That demand suggests we should take a more off-the-shelf solution, which means doing less work in NumPy while still meeting the original goals of the universal intrinsics work - which is what I saw when I stumbled across Highway. We could take a lot of complexity out of NumPy itself by leveraging Highway's dynamic dispatch, optimised routines and existing universal intrinsics solution (which already supports scalable vectors).

I don't think performance should be passed back to users as plug-ins; if I wanted performance without using NumPy itself, I could easily move to something like Numba, one of the other optimisation libraries, or stop using NumPy (for the de facto standard, such a stance seems rather out of place). I did like the errstate proposal, which allows users to opt into the lower-accuracy, higher-throughput routines, but the high/low accuracy problem only applies to specific routines.

Returning to this case: it's a sort routine, and if it puts things in the wrong order, it's wrong. These optimisations can benefit all users without much controversy, which seems like something everyone would want. Please don't lose sight of the benefits this specific PR brings whilst considering whether to integrate more of Highway in the future.

@seberg (Member) commented Jun 29, 2023

We discussed this a bit yesterday; please correct me if I am wrong (or just disagree). The way I interpreted it, there was also a hope to use Highway more generally.

Sentiment was a bit towards:

  • We probably cannot adopt Highway's math functions, so the only point would likely be to adopt some of their "universal intrinsics" implementation. That wasn't met with a lot of enthusiasm, and they apparently don't have great machinery around striding (I may have butchered that).
    • Forgot: the math functions seem to be within 3 ULP, but only within a certain range (it might have been a specific one). That makes them seem more tuned for their own use-case, so a straight-up use of them in NumPy is likely not to work in practice, unfortunately.
  • This part is cool, since it's much faster. Raghuveer said that if we did have the C++ version of our universal intrinsics, getting the AArch64 speedups should be easier. This pulls in quite a bit (if I understand it right), due to using their universal intrinsics framework (leading to the wheel being slightly bloated, although I am not sure we should care).

Overall, there wasn't much enthusiasm for this yesterday.

@jan-wassenberg (Contributor)

Hi, Highway/vqsort main author here :)
I'm curious what striding functionality you are looking for?

This pulls in quite a bit (if I understand it right), due to using their universal intrinsics framework

Not sure this is the case. Our intrinsics are just fully-inlined wrapper functions. Only the sort implementation itself would contribute to binary size.

VQSort also has some advantages concerning speed on AVX-512 (see https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md#updated-vqsort-results) and also robustness vs skewed input distributions, see other parts of that article.

Happy to discuss further, including via video call, if you are interested.

@Mousius (Member Author) commented Jun 29, 2023

I'm curious what striding functionality you are looking for?

@jan-wassenberg, I think this would just be something like MaskedGatherLoad and MaskedBlendedStore; as NumPy arrays aren't padded, you need to load the final partial vectors whilst also striding. From my naive perspective this doesn't seem too difficult?
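
For reference, a minimal sketch of handling an unpadded tail with ops Highway already provides (hn::MaskedLoad and hn::BlendedStore exist today; the masked gather/strided variants mentioned above are the part being requested). Static-dispatch style is used here for brevity:

```
#include <hwy/highway.h>
namespace hn = hwy::HWY_NAMESPACE;

// Add 1.0f to every element of an unpadded array of length n.
void AddOne(float* HWY_RESTRICT p, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  size_t i = 0;
  for (; i + N <= n; i += N) {  // full vectors
    hn::StoreU(hn::Add(hn::LoadU(d, p + i), hn::Set(d, 1.0f)), d, p + i);
  }
  if (i < n) {  // final partial vector: mask off lanes past the end
    const auto m = hn::FirstN(d, n - i);
    const auto v = hn::MaskedLoad(m, d, p + i);
    hn::BlendedStore(hn::Add(v, hn::Set(d, 1.0f)), m, d, p + i);
  }
}
```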

@mattip (Member) commented Jun 29, 2023

I think the larger point is that we are uncomfortable adding yet another submodule with inner loop implementations to NumPy. We would prefer to generalize the routines we already have for x86_64, using NumPy's universal intrinsics, to the other architectures.

@seberg (Member) commented Jun 29, 2023

Maybe we can split this up a bit, since I probably munched it too much yesterday. I am not sure how much we gain by adopting Highway more broadly; there were arguments against it, and maybe in part that is because Sayed has momentum for the "custom" universal intrinsics, but there were more arguments. (I honestly would prefer to just steal something, but SIMD is already hairy with compiler bugs, and I don't know it well enough to judge whether we would just run into more issues.)

So this is about sorting: I think right now it just switches things on AArch64? But Highway also supports AVX512, so in the end it would really be swapping out one implementation for another. This doesn't actually stop us from reversing it out again once we can, so the question is if there is a downside? (if there are issues, too bad, we will revert)

I probably misunderstood that this had some impact on binary size, I have to admit. If it doesn't, I am not sure I see a reason for not swapping in Highway for sorting (completely) until Raghuveer's work supports SIMD more broadly.

@Mousius (Member Author) commented Jun 29, 2023

I think the larger point is that we are uncomfortable adding yet another submodule with inner loop implementations to NumPy. We would prefer to generalize the routines we already have for x86_64, using NumPy's universal intrinsics, to the other architectures.

@mattip, that seems to disadvantage those who didn't author the x86_64 code, whilst there is an equivalent solution available for many architectures. I'm personally reluctant to invest further effort in porting to universal intrinsics; there are solutions already available to users, as proven by the SVML and x86-simd-sort integrations. Previous universal intrinsics work was already a wasted effort now that the math routines have been reverted and the accuracy changed in SVML.

Maybe we can split this up a bit, since I probably munched it too much yesterday. I am not sure how much we gain by adopting Highway more broadly; there were arguments against it, and maybe in part that is because Sayed has momentum for the "custom" universal intrinsics, but there were more arguments. (I honestly would prefer to just steal something, but SIMD is already hairy with compiler bugs, and I don't know it well enough to judge whether we would just run into more issues.)

Where is this momentum? It'd be great to see things like #22265 land, but it has stalled waiting for the C++ intrinsics work. As for compiler bugs, I think you'll get them either way; they're bugs after all 😸

I probably misunderstood that this had some impact on binary size, I have to admit. If it doesn't, I am not sure I see a reason for not swapping in Highway for sorting (completely) until Raghuveer's work supports SIMD more broadly.

Sorry @seberg, that's likely down to me; originally this was pretty bloated thanks to including many separate compilation units, but through @jan-wassenberg's efforts on VQSortStatic it's now pretty slim and inlined - although you do increase binary size with more code added.
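
As a sketch of what that inlined path looks like at a call site (assuming Highway's vqsort-inl.h and its VQSORT_ONLY_STATIC mode, which compiles only the statically selected target):

```
#define VQSORT_ONLY_STATIC 1  // assumption: skip Highway's dynamic-dispatch machinery
#include <hwy/contrib/sort/vqsort-inl.h>

// Sort `arr` ascending using the SIMD target selected at compile time.
void QSortInt32(int32_t* arr, size_t size) {
  hwy::HWY_NAMESPACE::VQSortStatic(arr, size, hwy::SortAscending());
}
```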

So this is about sorting: I think right now it just switches things on AArch64? But Highway also supports AVX512, so in the end it would really be swapping out one implementation for another. This doesn't actually stop us from reversing it out again once we can, so the question is if there is a downside? (if there are issues, too bad, we will revert)

Thanks for bringing this back to the benefits of this contribution itself. I did limit this to AArch64 to minimise the overall impact, mirroring the accepted x86-simd-sort integration - it wouldn't be hard to enable this for AVX512 as well, and potentially further. One limitation is that it doesn't quite drop in for FP16 (@jan-wassenberg can maybe advise), nor argsort, but we can probably implement that ourselves based on the suggestion above.

@mattip (Member) commented Jun 29, 2023

Previous universal intrinsics work was already a wasted effort now that the math routines have been reverted and the accuracy changed in SVML.

I am sorry you feel that way. I seem to remember having similar feelings when SVML was proposed, and that the goal all along was to replace SVML with universal intrinsics. Unfortunately we did not adequately address accuracy when adding the tests that were a condition for the SVML acceptance, and that led to the mess around sin/cos.

At the end of the day, mine is just one opinion among many maintainers. I think the place to push for a revamped approach to SIMD intrinsics is to discuss replacing NEP 38 on the mailing list, or to create a PR amending that NEP.

@r-devulap (Member)

There are two aspects to this PR:

  1. Sorting routines on AArch64
  2. Adopting Highway as the SIMD framework

(1) is tied to (2): you can't port the sorting routines without using Highway as the SIMD framework, which duplicates NumPy's universal intrinsics. If NumPy wants to use Highway as the SIMD framework, that's a whole new discussion, and as @mattip says it would require modifying NEP 38.

If the concern is just about sorting routines on AArch64, we could address that by porting over x86-simd-sort using NumPy universal intrinsics once we merge #21057. If this is the direction we want to go, I will be happy to collaborate with @Mousius to port it. Note that x86-simd-sort also supports O(1) space np.argsort, np.partition and np.argpartition (WIP). We could port these over too once that work is completed.

@Mousius (Member Author) commented Jun 29, 2023

There are two aspects to this PR:

  1. Sorting routines on AArch64
  2. Adopting Highway as the SIMD framework

(1) is tied to (2): you can't port the sorting routines without using Highway as the SIMD framework, which duplicates NumPy's universal intrinsics. If NumPy wants to use Highway as the SIMD framework, that's a whole new discussion, and as @mattip says it would require modifying NEP 38.

I don't think we should be considering the implementation details of the sort algorithm, just as x86-simd-sort is a separate module and has its own understanding of intrinsics, which are not NumPy intrinsics. Policing repos outside of NumPy seems beyond the NumPy maintainers' scope (at least to me), especially as it compiles down to code without any notion of the intrinsics.

From a technical point of view, this is exactly the same as implementing sort using x86-simd-sort, yet it does not receive the same treatment due to the SIMD implementation (I think this was my mistake, by over-explaining Highway when opening this PR). It is very possible to merge this PR, gain the user value, and never use the Highway SIMD library.

I'm also getting a distinct impression from these conversations that not many are in favour of replacing NEP 38, as only @seberg has spoken in favour of using something already written.

If the concern is just about sorting routines on AArch64, we could address that by porting over x86-simd-sort using NumPy universal intrinsics once we merge #21057. If this is the direction we want to go, I will be happy to collaborate with @Mousius to port it. Note that x86-simd-sort also supports O(1) space np.argsort, np.partition and np.argpartition (WIP). We could port these over too once that work is completed.

I can't guarantee I'll be able to resource porting anything further; if/when the C++ intrinsics are mature enough to do so, we'll have to re-assess. It seems to create more work given VQSort already supports multiple platforms, but it is already the chosen solution, so maybe we'll have to.


@jan-wassenberg (Contributor)

I agree there are two discussions here; let's move the intrinsics part to a separate issue.

One limitation is that it doesn't quite drop in for FP16

Yes, I happen to be working on fp16 at the moment :) It's a larger undertaking, so it will take several days.

We would prefer to generalize the routines we already have for x86_64, using NumPy's universal intrinsics, to the other architectures.

Understandable. With my VQSort author hat on, I'd point out that it is also considerably faster and more robust: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md shows 5-26x(!) performance differences for Zipf and 5-unique-value (random_p5) distributions.
And that is even before our performance bugfix (we were getting random bits from the OS on each sort call).

It might be possible to copy over some of the techniques from VQSort while porting, though.

Have we estimated the number of new universal intrinsics that would be required? I imagine that more than a few new ones will be required for the shuffles used in sorting networks.

so the question is if there is a downside?

To adopting VQSort as suggested in this PR? Adding any code increases compile time and binary size, but I do not believe that is problematic here, so no, I do not see downsides for NumPy.


# Set NPY_DISABLE_HIGHWAY=1 in the environment to disable the Highway
# library. This option only has significance on AArch64 and is most
# useful to avoid improperly requiring Highway when cross compiling.
Member

This seems copied from the SVML snippet; however, the reason it is there for SVML is that that library gets shipped as object files rather than as source files. There shouldn't be an issue with Highway when cross-compiling, I hope?

@Mousius force-pushed the highway-vqsort branch 3 times, most recently from b08dcd6 to f4e1583 on October 2, 2023
@Mousius
Copy link
Member Author
Mousius commented Oct 2, 2023

This is a bit messier now that #24201 has landed and select is mixed in with sort. It appears that x86-simd-sort doesn't mark many functions as static; compiling it in separate select and sort files causes duplicate symbols to be produced. Hopefully, after google/highway#1710 has landed, I can unpick some of this, as there'll be a comparable select algorithm.

@seiko2plus (Member) left a comment

Thank you for working on this important enhancement by integrating the Google Highway library. I just have one reservation about decoupling from the Intel sort library: I suggest moving the Google Highway-specific code to a separate source to avoid conflicts and facilitate future modifications, and also suggest taking advantage of the CPU features supported within Highway by enabling more CPU targets.

Comment on lines +792 to +778
[
'simd_argsort.dispatch.h',
'src/npysort/simd_argsort.dispatch.cpp',
[AVX512_SKX]
],
Member

Suggested change
[
'simd_argsort.dispatch.h',
'src/npysort/simd_argsort.dispatch.cpp',
[AVX512_SKX]
],
[
'highway_qsort.dispatch.h',
'src/npysort/highway_qsort.dispatch.cpp',
[[AVX2, FMA3], SSE42, ASIMD, VSX2]
],

Create an independent source and also a separate configuration header for the CPU targets. Please note that this will require creating a separate header, similar to simd_qsort.hpp, containing the forward declarations of the defined Highway sort functions, while considering the use of a different namespace, for example highway_qsort.

Another note here: Are there any reasons to limit the use to the ARM architecture?

@@ -779,13 +782,18 @@ foreach gen_mtargets : [
[
'simd_qsort.dispatch.h',
'src/npysort/simd_qsort.dispatch.cpp',
[AVX512_SKX]
[AVX512_SKX, ASIMD]
Member

Suggested change
[AVX512_SKX, ASIMD]
[AVX512_SKX]

Based on the presented suggestion, there would be no purpose in adding this target.

Comment on lines +88 to +95
#if !DISABLE_HIGHWAY_OPTIMIZATION
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
#include "simd_qsort.dispatch.h"
#endif
NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
}
#endif
@seiko2plus (Member) commented Nov 19, 2023

Suggested change
#if !DISABLE_HIGHWAY_OPTIMIZATION
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
#include "simd_qsort.dispatch.h"
#endif
NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
}
#endif
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
#include "simd_qsort.dispatch.h"
#endif
NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
if (dispfunc == nullptr) {
// Priority is given to the Intel sort library due to its efficient support for AVX512.
// For other CPU targets, we fall back to Google's Highway sort.
#ifndef NPY_DISABLE_OPTIMIZATION
#include "highway_qsort.dispatch.h"
#endif
NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::highway_qsort::template QSort, <TF>);
}
}
#endif

This uses Google Highway only after confirming the unavailability of the Intel library, either due to architectural differences or because the AVX512 features are not available at runtime.
Also, note that there is no need for the guard DISABLE_HIGHWAY_OPTIMIZATION, as the targeted features are controlled through Meson.

@@ -89,31 +71,32 @@ template<> void NPY_CPU_DISPATCH_CURFX(QSort)(double *arr, intptr_t size)
{
avx512_qsort(arr, size);
}
template<> void NPY_CPU_DISPATCH_CURFX(ArgQSort)(int32_t *arr, npy_intp *arg, npy_intp size)
#elif USE_HIGHWAY
Member

Suggested change
#elif USE_HIGHWAY

Based on the proposed suggestion, the following functions in this modification should be defined within a separate new source highway_qsort.dispatch.cpp.

@@ -28,6 +28,7 @@
#include "simd_qsort.hpp"

#define NOT_USED NPY_UNUSED(unused)
#define DISABLE_HIGHWAY_OPTIMIZATION (defined(__arm__) || defined(__aarch64__))
Member

Suggested change
#define DISABLE_HIGHWAY_OPTIMIZATION (defined(__arm__) || defined(__aarch64__))

Based on the presented suggestion, there would be no purpose in keeping this #define.

@Mousius (Member Author) commented Nov 19, 2023

Thank you for working on this important enhancement by integrating the Google Highway library. I just have one reservation about decoupling from the Intel sort library: I suggest moving the Google Highway-specific code to a separate source to avoid conflicts and facilitate future modifications, and also suggest taking advantage of the CPU features supported within Highway by enabling more CPU targets.

Hi @seiko2plus,

The reason this patch targets AArch64 specifically is that I don't want one large, unwieldy change; I want to break the change down and work incrementally. This patch puts in place the underlying infrastructure and enables a single target without impacting other users. By keeping it minimal, it's easier to maintain and rebase, which I've had to do for nearly 6 months.

This patch does leave the code in an intermediary state, which I'd prefer to keep temporarily whilst I work on the final state for x86. The way forward, in my opinion, would be to do as @r-devulap suggested in the Highway discussion and leverage VQSort instead of x86-simd-sort. Working incrementally, I can focus solely on enabling further platforms and algorithms in each follow-up patch. We should at least reach parity for x86, even if that requires work in Highway itself, before removing x86-simd-sort.

@seiko2plus (Member)
Hi @Mousius,

IMHO, I see it the other way around: the requested changes should make it easier to maintain and avoid conflicts with x86-simd-sort by separating the generated objects. Regarding support for other architectures, it is fine to keep it to ASIMD for now if you think that Highway may have issues with other architectures. Regarding x86-simd-sort, having both solutions, one targeting AVX512 and the other targeting the rest of the CPU features, looks like an acceptable scenario to me. It may also be a permanent solution if there is a purpose in keeping the Intel library written in raw intrinsics to avoid any performance regression, or if the aim is to reduce maintenance effort, or there is a lack of interest in supporting other architectures.

@Mousius (Member Author) commented Nov 20, 2023

Hi @Mousius,

IMHO, I see it the other way around: the requested changes should make it easier to maintain and avoid conflicts with x86-simd-sort by separating the generated objects.

As I've highlighted, this would be the start of a series of changes, which would eventually clean up those objects. Changing this patch would introduce more complexity in the already complex meson build infrastructure, which I'd prefer to avoid in favour of simply deleting a few lines later.

Regarding support for other architectures, it is fine to keep it to ASIMD for now if you think that Highway may have issues with other architectures.

I don't think Highway will have issues with other architectures; please don't interpret my response so negatively. I don't want to create a patch that is more difficult to maintain, in code and in cognitive load, and I have no concerns about using Highway in the future (as highlighted by my support for using Highway in NEP 54).

Regarding x86-simd-sort, having both solutions, one targeting AVX512 and the other targeting the rest of the CPU features, looks like an acceptable scenario to me. It may also be a permanent solution if there is a purpose in keeping the Intel library written in raw intrinsics to avoid any performance regression, or if the aim is to reduce maintenance effort, or there is a lack of interest in supporting other architectures.

Architecturally, trying to stitch together several libraries for the same functionality increases maintenance in NumPy rather than having a single cross-platform library with reasonable cross-platform performance. Given that both myself and the maintainers of Highway have indicated that we would do work to enable more architectures, please could you refrain from making negative comments regarding a lack of interest which doesn't exist.

I do understand I don't have a say in how NumPy is architected, and if I have time, I'll re-do this patch again. Unfortunately, I'm not working many days at the moment. This patch has been open for a long time without realising any user value, and the reviews have been sparse, so you can understand why I wouldn't prioritise this over other work which would deliver value earlier.

@seiko2plus (Member)

As I've highlighted, this would be the start of a series of changes, which would eventually clean up those objects. Changing this patch would introduce more complexity in the already complex meson build infrastructure, which I'd prefer to avoid in favour of simply deleting a few lines later.

Not sure about the complexity you're facing. My suggestion was similar to your current approach, but instead of creating a new source simd_argsort.dispatch.cpp to isolate the definitions of argsort functions of x86-simd-sort, it might be better to keep it as-is and instead we could have a separate source for the functions that Highway supports. This would also help avoid conflicts with pull requests #25045, #24924. To me, it seemed like a win-win situation.

I don't think Highway will have issues with other architectures; please don't interpret my response so negatively. I don't want to create a patch that is more difficult to maintain, in code and in cognitive load, and I have no concerns about using Highway in the future (as highlighted by my support for using Highway in NEP 54).

I didn't mean to come off as negative at all. I just thought that adding a few extra CPU targets to your patch might not change much, maybe just a line or two. My conclusion was based on your response and my understanding that enabling a target like AVX2 would initiate a path for MSVC on both 32-bit and 64-bit, and that Highway might face issues with this path, possibly requiring extra effort to disable it. That's why I said it's okay for 'now' to stick with only ASIMD.

Architecturally, trying to stitch together several libraries for the same functionality increases maintenance in NumPy rather than having a single cross-platform library with reasonable cross-platform performance.

Offering multiple choices should be considered if there's a performance benefit. If both solutions are sufficiently isolated, it shouldn't lead to a maintenance burden.

Given that both myself and the maintainers of Highway have indicated that we would do work to enable more architectures

This is the reason I suggested leveraging the broad support and enabling additional CPU features of the architectures that are already supported by Highway.

please could you refrain from making negative comments regarding a lack of interest which doesn't exist.

Once again, I don't mean to be negative at all, and I will try to make my points more clear to avoid any kind of misunderstanding.

I do understand I don't have a say in how NumPy is architected, and if I have time, I'll re-do this patch again. Unfortunately, I'm not working many days at the moment. This patch has been open for a long time without realising any user value, and the reviews have been sparse, so you can understand why I wouldn't prioritise this over other work which would deliver value earlier.

During our last optimization meeting, we agreed to accept your work as it is, without any further changes, to expedite progress. Later, we may focus on separating the two libraries.

@Mousius (Member Author) commented Nov 22, 2023

As I've highlighted, this would be the start of a series of changes, which would eventually clean up those objects. Changing this patch would introduce more complexity in the already complex meson build infrastructure, which I'd prefer to avoid in favour of simply deleting a few lines later.

Not sure about the complexity you're facing. My suggestion was similar to your current approach, but instead of creating a new source simd_argsort.dispatch.cpp to isolate the definitions of argsort functions of x86-simd-sort, it might be better to keep it as-is and instead we could have a separate source for the functions that Highway supports. This would also help avoid conflicts with pull requests #25045, #24924. To me, it seemed like a win-win situation.

We can just agree to disagree here 😸. I've been out sick for several weeks and am looking at a phased return, so actually getting to do the work is difficult. @r-devulap has expressed willingness to do the follow-up, which I really appreciate!

I don't think Highway will have issues with other architectures; please don't interpret my response so negatively. I don't want to create a patch that is more difficult to maintain, in code and in cognitive load, and I have no concerns about using Highway in the future (as highlighted by my support for using Highway in NEP 54).

I didn't mean to come off as negative at all. I just thought that adding a few extra CPU targets to your patch might not change much, maybe just a line or two. My conclusion was based on your response and my understanding that enabling a target like AVX2 would initiate a path for MSVC on both 32-bit and 64-bit, and that Highway might face issues with this path, possibly requiring extra effort to disable it. That's why I said it's okay for 'now' to stick with only ASIMD.

Aha, that would be my preference, and then I could focus on enabling AVX2, if needed, in a single patch - dependent on whether we agree that using Highway rather than x86-simd-sort is worth it.

Architecturally, trying to stitch together several libraries for the same functionality increases maintenance in NumPy rather than having a single cross-platform library with reasonable cross-platform performance.

Offering multiple choices should be considered if there's a performance benefit. If both solutions are sufficiently isolated, it shouldn't lead to a maintenance burden.

Every branch in code or extra dependency is a maintenance burden, but I'm not going to worry too much about this one, given @r-devulap is keen to continue maintenance.

Given that both myself and the maintainers of Highway have indicated that we would do work to enable more architectures

This is the reason I suggested leveraging the broad support and enabling additional CPU features of the architectures that are already supported by Highway.

please could you refrain from making negative comments regarding a lack of interest which doesn't exist.

Once again, I don't mean to be negative at all, and I will try to make my points more clear to avoid any kind of misunderstanding.

I do understand I don't have a say in how NumPy is architected, and if I have time, I'll re-do this patch again. Unfortunately, I'm not working many days at the moment. This patch has been open for a long time without realising any user value, and the reviews have been sparse, so you can understand why I wouldn't prioritise this over other work which would deliver value earlier.

During our last optimization meeting, we agreed to accept your work as it is, without any further changes, to expedite progress. Later, we may focus on separating the two libraries.

Yip, this comment was before the meeting. I believe we did reach a compromise where this patch would land, and @r-devulap would support me in doing a follow-up (which, again, I really appreciate!). I'm excited to see it land after such long discussions 😸

@seiko2plus (Member) left a comment

Thank you @Mousius!

@r-devulap (Member)

I would prefer to have #25045 for AVX2 versions of sort and partition. It will soon be followed up with AVX2 versions of argsort and argpartition.

@r-devulap (Member)

@Mousius, you are welcome! Once this is merged, I will update #24924 to split x86-simd-sort into its own dispatch module like how @seiko2plus prefers it.

@seiko2plus seiko2plus merged commit 65c9f82 into numpy:main Nov 24, 2023
@seiko2plus (Member)

@Mousius, I think we should have a release note for this PR.

Labels: 01 - Enhancement, triage review
7 participants