-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Enable SVE detection for Highway VQSort #25247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@rdevulap don't panic; this was just so easy to do that I wanted to see if it would work 😸 I'll wait for your patches to be in and rebase on top of them. @seiko2plus SVE features! Woo! |
0d56dbb
to
20b5f2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you. Just a few changes more.
This pr doesn't change any C/C++ code so it should be fine to merge it. |
4c05be1
to
027cce0
Compare
This will conflict with https://github.com/numpy/numpy/pull/25045/files#diff-1df2f2cfde074ed7de97ebae8a7ef48ad000241256ea3f480f81b2598d03b4f8R784-R788 |
027cce0
to
f0b936d
Compare
Leveraging the meson infrastructure to selectively enable SVE specifically for Highway, which already supports SVE. ``` | Change | Before [94bc564] <main> | After [1ffedb85] <sve-sort> | Ratio | Benchmark (Parameter) | |----------|----------------------------|-------------------------------|---------|----------------------------------------------------------------------------------| | + | 551±0.8μs | 654±5μs | 1.19 | bench_function_base.Sort.time_sort('merge', 'float32', ('random',)) | | + | 72.9±0.1μs | 80.5±0.08μs | 1.11 | bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',)) | | + | 553±1μs | 606±2μs | 1.1 | bench_function_base.Sort.time_sort('merge', 'float64', ('random',)) | | + | 41.9±0.3μs | 45.6±0.04μs | 1.09 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000)) | | + | 493±1μs | 532±2μs | 1.08 | bench_function_base.Sort.time_sort('merge', 'float16', ('random',)) | | + | 73.7±0.2μs | 78.8±0.04μs | 1.07 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100)) | | + | 565±1μs | 600±1μs | 1.06 | bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',)) | | + | 465±1μs | 491±1μs | 1.06 | bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',)) | | + | 696±3μs | 739±2μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',)) | | + | 645±3μs | 684±3μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10)) | | + | 651±1μs | 691±1μs 8000 | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100)) | | + | 627±3μs | 665±1μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000)) | | + | 467±1μs | 494±0.7μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',)) | | + | 166±0.1μs | 174±1μs | 1.05 | bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 10)) | | - | 77.4±0.08μs | 73.2±0.2μs | 0.95 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100)) | | - | 379±1μs | 359±0.1μs | 0.95 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',)) | | - | 341±0.5μs | 324±0.5μs | 0.95 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) | | - | 590±0.5μs | 554±1μs | 0.94 | bench_function_base.Sort.time_argsort('heap', 'float32', ('ordered',)) | | - | 239±5μs | 226±0.6μs | 0.94 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 10)) | | - | 195±2μs | 184±1μs | 0.94 | bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10)) | | - | 692±2μs | 637±2μs | 0.92 | bench_function_base.Sort.time_argsort('merge', 'float32', ('random',)) | | - | 45.5±0.03μs | 42.0±0.2μs | 0.92 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000)) | | - | 80.5±0.07μs | 73.0±0.1μs | 0.91 | bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',)) | | - | 78.9±0.2μs | 71.7±0.2μs | 0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',)) | | - | 79.2±0.1μs | 72.1±0.2μs | 0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',)) | | - | 131±2μs | 118±0.8μs | 0.9 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 100)) | | - | 82.8±0.2μs | 73.7±0.3μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',)) | | - | 83.4±0.07μs | 74.1±0.2μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',)) | | - | 78.6±0.2μs | 70.3±0.2μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',)) | | - | 79.2±0.09μs | 70.8±0.08μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',)) | | - | 3.22±0.02μs | 2.86±0μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',)) | | - | 3.26±0.04μs | 2.84±0μs | 0.87 | bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',)) | | - | 82.6±0.06μs | 71.1±0.08μs | 0.86 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10)) | | - | 4.91±0.01μs | 4.22±0μs | 0.86 | bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',)) | | - | 79.0±0.2μs | 66.8±0.05μs | 0.85 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000)) | | - | 78.8±0.05μs | 67.0±0.2μs | 0.85 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10)) | | - | 84.2±0.07μs | 70.8±0.1μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',)) | | - | 89.4±0.1μs | 75.5±0.05μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000)) | | - | 78.9±0.04μs | 65.9±0.1μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10)) | | - | 85.4±0.06μs | 71.9±0.05μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000)) | | - | 85.3±0.08μs | 70.5±0.1μs | 0.83 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000)) | | - | 80.5±0.03μs | 66.4±0.1μs | 0.83 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',)) | | - | 87.5±0.05μs | 71.6±0.1μs | 0.82 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100)) | | - | 80.4±0.05μs | 65.4±0.07μs | 0.81 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',)) | | - | 83.6±0.05μs | 66.9±0.1μs | 0.8 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100)) | | - | 83.5±0.05μs | 65.8±0.08μs | 0.79 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100)) | | - | 6.87±0.01μs | 5.13±0.08μs | 0.75 | bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',)) | | - | 12.2±0.02μs | 8.79±0.1μs | 0.72 | bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',)) | | - | 193±0.5μs | 124±0.5μs | 0.65 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',)) | | - | 27.7±0.2ms | 18.0±0.2ms | 0.65 | bench_function_base.Sort.time_sort_worst | | - | 192±0.4μs | 123±0.2μs | 0.64 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',)) | | - | 202±0.2μs | 128±0.04μs | 0.63 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000)) | | - | 203±0.5μs | 125±0.09μs | 0.62 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100)) | | - | 199±0.4μs | 122±0.07μs | 0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',)) | | - | 195±0.4μs | 120±0.09μs | 0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10)) | | - | 215±0.3μs | 121±0.3μs | 0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',)) | | - | 216±0.3μs | 121±0.7μs | 0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',)) | | - | 225±0.3μs | 126±0.3μs | 0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000)) | | - | 223±0.2μs | 119±0.09μs | 0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',)) | | - | 219±0.06μs | 118±0.08μs | 0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10)) | | - | 227±0.3μs | 123±0.2μs | 0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100)) | ```
f0b936d
to
682d33c
Compare
@seiko2plus I've rebased this on top of @rdevulap's changes, I think this is good to go? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you!
Leveraging the meson infrastructure to selectively enable SVE specifically for Highway, which already supports SVE.