ENH: Use Highway's VQSort on AArch64 #24018
Conversation
numpy/core/setup.py
for sort_file in [
    "vqsort.cc",
    "vqsort_128a.cc",
    "vqsort_128d.cc",
    "vqsort_f32a.cc",
    "vqsort_f32d.cc",
    "vqsort_f64a.cc",
    "vqsort_f64d.cc",
    "vqsort_i16a.cc",
    "vqsort_i16d.cc",
    "vqsort_i32a.cc",
    "vqsort_i32d.cc",
    "vqsort_i64a.cc",
    "vqsort_i64d.cc",
    "vqsort_kv64a.cc",
    "vqsort_kv64d.cc",
    "vqsort_kv128a.cc",
    "vqsort_kv128d.cc",
    "vqsort_u16a.cc",
    "vqsort_u16d.cc",
    "vqsort_u32a.cc",
    "vqsort_u32d.cc",
    "vqsort_u64a.cc",
    "vqsort_u64d.cc",
]
This isn't ideal: when compiling without these files I got missing-symbol issues, but we definitely don't use them all, and they add ~100M to the .so.
@jan-wassenberg any thoughts on how to minimize the number of files we need here?
I think we're only using {u/i/f}{32/64}a in your dispatcher and we ought to be able to remove 128*, 16*, and *d.cc (the latter for descending order).
If there are still linker errors after removing only those, then it's likely due to the dllexport. What we could then do is add a macro bitfield that you could define to replace the implementation of the unused/unwanted types with a stub.
Something like VQSORT_TYPES which contains multiple groups (one each for uint/int/float/KV), where each group has one bit encoding whether the element size equal to the bit's value is enabled.
For example, enabling only the unsigned/float types: #define VQSORT_TYPES (((4 | 8) << 6) | (4 | 8)).
Would that be useful?
As an example, when I try to run tests:
Original error was: /numpy/build/testenv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-aarch64-linux-gnu.so: undefined symbol: _ZN3hwy6VQSortEPlmNS_14SortDescendingE
The other option is I could use vqsort-inl.h and copy the implementations across from the various vqsort_X.cc files? That'd also mean I can do float16 without adding a vqsort_f16a.cc.
Could you add something like this to vqsort-inl.h? I notice there's a lot of detail namespace code which would be bad to include in the NumPy source.
template <typename T>
void VQSort(T* HWY_RESTRICT keys, size_t n, SortAscending) {
  SortTag<T> d;
  detail::SharedTraits<detail::TraitsLane<detail::OrderAscending<T>>> st;
  Sort(d, st, keys, n);
}
If not, the bitfield approach is also good 😸
Sure, we can add that. As written, this will not support the 128-bit nor key-value types - is that OK?
Sure, we can add that. As written, this will not support the 128-bit nor key-value types - is that OK?
That'd be specialising around uint128_t with the same uint64*2 logic as in the .cc file? I think we can add it later if necessary; there's no sort dispatch for it atm.
Right. And KV types also use different Traits.
Sounds good, we can indeed add that later.
@@ -0,0 +1,44 @@
/*@targets
I split this out as Highway's sort doesn't have an implementation of argsort, unless I'm wrong @jan-wassenberg?
FYI argsort can be implemented by copying an array to hwy::K32V32 or K64V64 types (putting the key in .key and 0..N-1 into .value) and then copying only the .value back to the array.
I think this is worthy of wider discussion. So far we have
Perhaps we should stop and think where this all is leading. Maybe NumPy should not be in the business of providing highly specialized inner loop functions, and we should instead provide an easily pluggable interface for others to experiment with, together with a base implementation and test cases for accuracy (known good answers) / precision (same answers across various array shapes).
Yip, I've been struggling with the current set of NumPy variations on the theme of performance optimisation; it's not clear what the current best practice is or where it's headed. Landing these vendored libraries (such as

My interpretation of the proposal for using SLEEF (alongside my reasoning for backing something like this) is that people want NumPy to perform better out-of-the-box, and SLEEF already has a bunch of cross-platform routines for this; indicating that there is a desire for things to be faster, but these efforts haven't succeeded so far. That demand helps me understand that we should be taking a more off-the-shelf solution, which means doing less work in NumPy but still meeting the original goals of the universal intrinsics work - which is what I see when I stumble across Highway.

We could take a lot of complexity out of NumPy itself by leveraging Highway's dynamic dispatch, optimised routines and existing universal intrinsics solution (which already supports scalable vectors).

I don't think performance should be passed back to users as plug-ins; if I wanted performance without using NumPy itself, I could easily move to something like Numba, one of the other optimisation libraries, or stop using NumPy (as the de facto standard, such a stance seems rather out of place). I did like the errstate proposal, which allows users to opt into the lower accuracy, higher throughput routines, but the high/low accuracy problem is only applicable to specific routines.

Returning to this case: it's a sort routine, and if it puts things in the wrong order it's wrong - these optimisations can benefit all users without much controversy, which seems like something everyone would want? Please don't lose sight of the benefits this specific PR brings whilst considering whether to integrate more of Highway in the future.
We discussed this a bit yesterday; please correct me if I am wrong (or just disagree). The way I interpreted this, there was also a hope to use Highway more generally. Sentiment was a bit towards:
Overall, there wasn't much enthusiasm for this yesterday.
Hi, Highway/vqsort main author here :)
Not sure this is the case. Our intrinsics are just fully-inlined wrapper functions. Only the sort implementation itself would contribute to binary size.

VQSort also has some advantages concerning speed on AVX-512 (see https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md#updated-vqsort-results) and also robustness vs skewed input distributions; see other parts of that article.

Happy to discuss further, including via video call, if you are interested.
@jan-wassenberg, I think this would just be something like
I think the larger point is that we are uncomfortable adding yet another submodule with inner loop implementations to NumPy. We would prefer to generalize the routines we already have for x86_64, using NumPy's universal intrinsics, to the other architectures.
Maybe we can split this up a bit, since I probably munched it too much yesterday.

I am not sure how much we gain by adopting highway more broadly; there were arguments against it, and maybe in part that is because Sayed has momentum for the "custom" universal intrinsics, but there were more arguments. (I honestly would prefer to just steal something, but SIMD is already hairy with compiler bugs, and I don't know it well enough to judge whether we would just run into more issues.)

So this is about sorting: I think right now it just switches things on AArch64? But highway supports also AVX512, so in the end it would really be swapping out one implementation for another. This doesn't actually stop us from reversing it out again once we can, so the question is if there is a downside? (if there are issues, too bad, we will revert)

I probably misunderstood that this had some impact on binary size. I have to admit, if it doesn't, I am not sure I see a reason for not swapping in highway for sorting (completely) until Raghuveer's work supports SIMD more broadly.
@mattip, that seems to disadvantage those who didn't author the x86_64 code, whilst there is an equivalent solution available for many architectures. I'm personally reluctant to invest further effort in porting to universal intrinsics; there are solutions already available to users, as proven by the SVML and x86-simd-sort integrations. Previous universal intrinsic work was already a wasted effort now that the math routines have been reverted and accuracy changed in SVML.
Where is this momentum? It'd be great to see things like #22265 landed, but it has stalled waiting for the C++ intrinsic work. As far as compiler bugs, I think you'll get them either way; they're bugs after all 😸
Sorry @seberg, that's likely down to me; originally this was pretty bloated thanks to including many separate compilation units, but through @jan-wassenberg's efforts on
Thanks for bringing this back to the benefits of this contribution itself. I did limit this to AArch64 to minimise the overall impact, mirroring the accepted x86-simd-sort integration - it wouldn't be hard to enable this for AVX512 as well and potentially further. One limitation is it doesn't quite drop in for FP16 (@jan-wassenberg can maybe advise), nor argsort, but we can probably implement that ourselves based on the suggestion above.
I am sorry you feel that way. I seem to remember having similar feelings when SVML was proposed, and that the goal all along was to replace SVML with universal intrinsics. Unfortunately we did not adequately address accuracy when adding the tests that were a condition for the SVML acceptance, and that led to the mess around sin/cos. At the end of the day mine is just one opinion of many maintainers. I think the place to push for a revamped approach to SIMD intrinsics is to discuss replacing NEP 38 on the mailing list, or to create a PR amending that NEP.
There are two aspects to this PR:
(1) is tied with (2): you can't port sorting routines without using highway as the SIMD framework, which duplicates NumPy universal intrinsics. If NumPy wants to use highway as the SIMD framework, that's a whole new discussion, and as @mattip says it would require modifying NEP 38.

If the concern is just about sorting routines on AArch64, we could address that by porting over x86-simd-sort using NumPy universal intrinsics once we merge #21057. If this is the direction we want to go, I will be happy to collaborate with @Mousius to port it. Note that x86-simd-sort also supports O(1) space
I don't think we should be considering the implementation details of the sort algorithm, just as

From a technical point of view, this is exactly the same as implementing sort using

I'm also getting a distinct impression from these conversations that not many are in favour of replacing NEP 38, as only @seberg has spoken in favour of using something already written.
I can't guarantee I'll be able to resource porting anything further; if/when C++ intrinsics are mature enough to do so, we'll have to re-assess then. It seems to be creating more work given VQSort already supports multiple platforms, but it is already the chosen solution, so maybe we'll have to.
I agree there are two discussions here; let's move the intrinsics part to a separate issue.
Yes, I happen to be working on fp16 at the moment :) It's a larger undertaking, so will take several days.
Understandable. With VQSort author hat on, I'd point out that it is also considerably faster and

It might be possible to copy over some of the techniques from VQSort while porting, though. Have we estimated the number of new universal intrinsics that would be required? I imagine that more than a few new ones will be required for the shuffles used in sorting networks.
To adopting VQSort as suggested in this PR? Adding any code increases compile time and binary size, but I do not believe that is problematic here, so no, I do not see downsides for numpy.
numpy/core/setup.py
# Set NPY_DISABLE_HIGHWAY=1 in the environment to disable the Highway
# library. This option only has significance on AArch64 and is most
# useful to avoid improperly requiring Highway when cross compiling.
This seems copied from the SVML snippet; however, the reason it exists for SVML is that that library ships as object files rather than as source files. There shouldn't be an issue with Highway when cross-compiling, I hope?
This is a bit messier now that #24201 has landed and mixed select in with sort. It appears that
Thank you for working on this important enhancement by integrating the Google Highway library. I just have one reservation about decoupling from the Intel sort library: I suggest moving the Google Highway-specific code to a separate source file to avoid conflicts and facilitate future modifications, and I also suggest taking advantage of the CPU features supported within Highway by enabling more CPU targets.
[
  'simd_argsort.dispatch.h',
  'src/npysort/simd_argsort.dispatch.cpp',
  [AVX512_SKX]
],
Suggested change:
[
  'simd_argsort.dispatch.h',
  'src/npysort/simd_argsort.dispatch.cpp',
  [AVX512_SKX]
],
[
  'highway_qsort.dispatch.h',
  'src/npysort/highway_qsort.dispatch.cpp',
  [[AVX2, FMA3], SSE42, ASIMD, VSX2]
],
Create an independent source and also a separate configuration header for the CPU targets. Please note that this will require creating a separate header, similar to simd_qsort.hpp, containing the forward declarations of the defined highway sort functions, while considering the use of a different namespace, for example, highway_qsort.
Another note here: Are there any reasons to limit the use to the ARM architecture?
@@ -779,13 +782,18 @@ foreach gen_mtargets : [
[
  'simd_qsort.dispatch.h',
  'src/npysort/simd_qsort.dispatch.cpp',
-  [AVX512_SKX]
+  [AVX512_SKX, ASIMD]
Suggested change:
- [AVX512_SKX, ASIMD]
+ [AVX512_SKX]

Based on the presented suggestion, there would be no purpose in adding this target.
#if !DISABLE_HIGHWAY_OPTIMIZATION
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "simd_qsort.dispatch.h"
#endif
    NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
}
#endif
Suggested change:
else if (sizeof(T) == sizeof(uint32_t) || sizeof(T) == sizeof(uint64_t)) {
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "simd_qsort.dispatch.h"
#endif
    NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::qsort_simd::template QSort, <TF>);
    if (dispfunc == nullptr) {
        // Priority is given to the Intel sort library due to its efficient
        // support for AVX512. For other CPU targets, we fall back to
        // Google's Highway sort.
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "highway_qsort.dispatch.h"
#endif
        NPY_CPU_DISPATCH_CALL_XB(dispfunc = np::highway_qsort::template QSort, <TF>);
    }
}
This uses Google Highway after confirming the unavailability of the Intel library, whether due to architectural differences or the AVX512 features not being available at runtime. Also, note that there is no need for the guard DISABLE_HIGHWAY_OPTIMIZATION, as the targeted features are controlled through Meson.
@@ -89,31 +71,32 @@ template<> void NPY_CPU_DISPATCH_CURFX(QSort)(double *arr, intptr_t size)
{
    avx512_qsort(arr, size);
}
-template<> void NPY_CPU_DISPATCH_CURFX(ArgQSort)(int32_t *arr, npy_intp *arg, npy_intp size)
+#elif USE_HIGHWAY
Suggested change:
- #elif USE_HIGHWAY

Based on the proposed suggestion, the following functions in this modification should be defined within a separate new source, highway_qsort.dispatch.cpp.
@@ -28,6 +28,7 @@
#include "simd_qsort.hpp"

#define NOT_USED NPY_UNUSED(unused)
#define DISABLE_HIGHWAY_OPTIMIZATION (defined(__arm__) || defined(__aarch64__))
Suggested change:
- #define DISABLE_HIGHWAY_OPTIMIZATION (defined(__arm__) || defined(__aarch64__))

Based on the presented suggestion, there would be no purpose in keeping this definition.
Hi @seiko2plus,

The reason this patch targets AArch64 specifically is that I don't want to have one large unwieldy change; I want to break the change down and work incrementally. This patch puts in place the underlying infrastructure and enables a single target without impacting other users. By keeping it minimal, it's easier to maintain and rebase, which I've had to do for nearly 6 months.

This patch does leave the code in an intermediary state, which I'd prefer to keep temporarily whilst I work on the final state for x86. The way forwards, in my opinion, would be to do as @r-devulap suggested in the Highway discussion and leverage VQSort instead of x86-simd-sort. Working incrementally, I can focus solely on enabling further platforms and algorithms in each follow-up patch. We should at least reach parity for x86, even if that requires work in Highway itself before removing
Hi @Mousius, IMHO, I see it the other way around: the requested changes should make it easier to maintain and avoid conflicts with
As I've highlighted, this would be the start of a series of changes, which would eventually clean up those objects. Changing this patch would introduce more complexity in the already complex Meson build infrastructure, which I'd prefer to avoid in favour of simply deleting a few lines later.
I don't think Highway will have issues with other architectures; please don't interpret my response so negatively. I don't want to create a more difficult-to-maintain patch, in both code and cognitive load, and I have no concerns about using Highway in the future (as highlighted by my support of using Highway in NEP 54).
Architecturally, trying to stitch together several libraries for the same functionality increases maintenance in NumPy rather than having a single cross-platform library with reasonable cross-platform performance. Given that both myself and the maintainers of Highway have indicated that we would do work to enable more architectures, please could you refrain from making negative comments regarding a lack of interest which doesn't exist.

I do understand I don't have a say in how NumPy is architected, and if I have time, I'll re-do this patch again. Unfortunately, I'm not working many days at the moment. This patch has been open for a long time without realising any user value, and the reviews have been sparse, so you can understand why I wouldn't prioritise this over other work which would deliver value earlier.
Introduces Highway, a SIMD library from Google. Highway has its own dispatch mechanism and SIMD flavour (which adds padding), which may be a good replacement for NumPy intrinsics. Highway also has its own set of vectorised math routines we could bolt on.

For this patch, I've integrated the VQSort algorithm only on AArch64 so as not to impact other architectures. The VQSort algorithm's performance is comparable to the AVX512 algorithms, according to this report: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

Performance improvements in the NumPy benchmark suite are as follows:

```
       before           after         ratio
     [f4b52d53]       [f08058d4]
   <main-setuppy>   <highway-sort>
+ 53.0±0.08μs  221±0.4μs    4.18  bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))
+ 84.8±0.3μs   222±0.3μs    2.62  bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
+ 89.0±0.2μs   197±0.2μs    2.21  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
+ 51.5±0.01μs  80.0±0.1μs   1.55  bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
+ 52.0±0.07μs  80.3±0.2μs   1.55  bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))
+ 148±0.3μs    197±0.2μs    1.33  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
+ 471±0.2μs    598±0.5μs    1.27  bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))
+ 536±0.7μs    644±0.8μs    1.20  bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))
+ 39.8±0.02μs  47.4±0.05μs  1.19  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
+ 39.8±0.01μs  47.4±0.02μs  1.19  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+ 40.1±0.01μs  47.6±0.04μs  1.19  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
+ 27.9±0.03μs  32.6±0.03μs  1.17  bench_function_base.Sort.time_argsort('heap', 'uint32', ('uniform',))
+ 685±0.5μs    796±0.8μs    1.16  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))
+ 68.7±0.05μs  78.4±0.04μs  1.14  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+ 68.7±0.05μs  78.3±0.1μs   1.14  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
+ 69.4±0.03μs  78.8±0.05μs  1.14  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
+ 22.0±0.01μs  24.7±0.01μs  1.12  bench_function_base.Sort.time_sort('heap', 'uint32', ('uniform',))
+ 674±0.7μs    743±0.4μs    1.10  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 1000))
+ 730±0.6μs    794±0.8μs    1.09  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 100))
+ 464±2μs      492±10μs     1.06  bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))
+ 95.5±0.1μs   101±0.1μs    1.06  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
+ 839±1μs      886±2μs      1.06  bench_function_base.Sort.time_argsort('heap', 'float64', ('random',))
+ 95.7±0.1μs   101±0.05μs   1.06  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
- 516±0.6μs    488±0.9μs    0.95  bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))
- 532±1μs      499±5μs      0.94  bench_function_base.Sort.time_argsort('heap', 'int16', ('reversed',))
- 34.5±0.05μs  32.2±0.03μs  0.93  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 1000))
- 715±1μs      666±0.4μs    0.93  bench_function_base.Sort.time_sort('heap', 'float64', ('random',))
- 254±2μs      231±0.2μs    0.91  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
- 655±0.5μs    592±0.3μs    0.90  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))
- 24.7±0.02μs  22.0±0.07μs  0.89  bench_function_base.Sort.time_sort('heap', 'int32', ('uniform',))
- 621±0.7μs    549±0.3μs    0.88  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
- 659±0.9μs    555±0.5μs    0.84  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))
- 545±0.4μs    444±0.5μs    0.81  bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))
- 511±0.5μs    396±0.4μs    0.78  bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))
- 316±0.6μs    225±0.2μs    0.71  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))
- 333±0.4μs    233±0.3μs    0.70  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))
- 323±0.2μs    206±0.2μs    0.64  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
- 134±0.9μs    85.0±0.07μs  0.64  bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))
- 376±0.9μs    228±0.3μs    0.61  bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
- 418±0.4μs    207±0.4μs    0.50  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
- 413±2μs      199±0.2μs    0.48  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))
- 64.4±0.3ms   28.6±0.3ms   0.44  bench_function_base.Sort.time_sort_worst
- 501±2μs      204±0.3μs    0.41  bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
- 248±1μs      86.9±0.2μs   0.35  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))
- 251±0.7μs    87.0±0.2μs   0.35  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))
- 329±0.9μs    91.4±0.1μs   0.28  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000))
- 329±1μs      85.2±0.06μs  0.26  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
- 329±0.7μs    85.0±0.06μs  0.26  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))
- 315±0.5μs    80.4±0.05μs  0.26  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))
- 320±1μs      80.3±0.09μs  0.25  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))
- 372±0.4μs    82.1±0.06μs  0.22  bench_function_base.Sort.time_sort('quick', 'int32', ('random',))
- 414±0.4μs    89.6±0.1μs   0.22  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))
- 384±1μs      81.8±0.02μs  0.21  bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))
- 415±0.3μs    84.8±0.2μs   0.20  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))
- 491±2μs      86.3±0.1μs   0.18  bench_function_base.Sort.time_sort('quick', 'float32', ('random',))
- 111±0.4μs    13.0±0.01μs  0.12  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
- 50.9±0.03μs  4.93±0.01μs  0.10  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
- 108±0.3μs    7.25±0.02μs  0.07  bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',))
- 51.7±0.05μs  3.43±0.01μs  0.07  bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))
- 51.3±0.07μs  3.35±0.06μs  0.07  bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',))
```
Not sure about the complexity you're facing. My suggestion was similar to your current approach, but instead of creating a new source
I didn't mean to come off as negative at all. I just thought that adding a few extra CPU targets to your patch might not change much, maybe just a line or two. My conclusion was based on your response and my understanding that enabling a target like
Offering multiple choices should be considered if there's a performance benefit. If both solutions are sufficiently isolated, it shouldn't lead to a maintenance burden.
This is the reason I suggested leveraging the broad support and enabling additional CPU features of the architectures that are already supported by Highway.
Once again, I don't mean to be negative at all, and I will try to make my points more clear to avoid any kind of misunderstanding.
During our last optimization meeting, we agreed to accept your work as it is, without any further changes, to expedite progress. Later, we may focus on separating the two libraries.
We can just agree to disagree here 😸. I've been out sick for several weeks and am looking at a phased return, so actually getting to do the work is difficult. @r-devulap has expressed willingness to do the follow-up, which I really appreciated!
Aha, that would be my preference, and then I could focus on enabling AVX2 if needed in a single patch - dependent on whether we agree that using Highway rather than x86-simd-sort is worth it.
Every branch in code or extra dependency is a maintenance burden, but I'm not going to worry too much about this one, given @r-devulap is keen to continue maintenance.
Yip, this comment was before the meeting. I believe we did reach a compromise where this patch would land, and @r-devulap would support me in doing a follow-up (which, again, I really appreciate!). I'm excited to see it land after such long discussions 😸
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Mousius!
I would prefer to have #25045 for AVX2 versions of sort and partition. It will soon be followed up with AVX2 versions of argsort and argpartition.
@Mousius, you are welcome! Once this is merged, I will update #24924 to split x86-simd-sort into its own dispatch module, like how @seiko2plus prefers it.
@Mousius, I think we should have a release note for this PR.