8000 ENH: Enable SVE detection for Highway VQSort by Mousius · Pull Request #25247 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Enable SVE detection for Highway VQSort #25247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 11, 2023

Conversation

Mousius
Copy link
Member
@Mousius Mousius commented Nov 24, 2023

Leveraging the meson infrastructure to selectively enable SVE specifically for Highway, which already supports SVE.

| Change   | Before [94bc5640] <main>   | After [1ffedb85] <sve-sort>   |   Ratio | Benchmark (Parameter)                                                            |
|----------|----------------------------|-------------------------------|---------|----------------------------------------------------------------------------------|
| +        | 551±0.8μs                  | 654±5μs                       |    1.19 | bench_function_base.Sort.time_sort('merge', 'float32', ('random',))              |
| +        | 72.9±0.1μs                 | 80.5±0.08μs                   |    1.11 | bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))          |
| +        | 553±1μs                    | 606±2μs                       |    1.1  | bench_function_base.Sort.time_sort('merge', 'float64', ('random',))              |
| +        | 41.9±0.3μs                 | 45.6±0.04μs                   |    1.09 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))  |
| +        | 493±1μs                    | 532±2μs                       |    1.08 | bench_function_base.Sort.time_sort('merge', 'float16', ('random',))              |
| +        | 73.7±0.2μs                 | 78.8±0.04μs                   |    1.07 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))   |
| +        | 565±1μs                    | 600±1μs                       |    1.06 | bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))           |
| +        | 465±1μs                    | 491±1μs                       |    1.06 | bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))            |
| +        | 696±3μs                    | 739±2μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))               |
| +        | 645±3μs                    | 684±3μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))      |
| +        | 651±1μs                    | 691±1μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))     |
| +        | 627±3μs                    | 665±1μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))    |
| +        | 467±1μs                    | 494±0.7μs                     |    1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))              |
| +        | 166±0.1μs                  | 174±1μs                       |    1.05 | bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 10))     |
| -        | 77.4±0.08μs                | 73.2±0.2μs                    |    0.95 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))  |
| -        | 379±1μs                    | 359±0.1μs                     |    0.95 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',))                |
| -        | 341±0.5μs                  | 324±0.5μs                     |    0.95 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))   |
| -        | 590±0.5μs                  | 554±1μs                       |    0.
8000
94 | bench_function_base.Sort.time_argsort('heap', 'float32', ('ordered',))           |
| -        | 239±5μs                    | 226±0.6μs                     |    0.94 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 10))  |
| -        | 195±2μs                    | 184±1μs                       |    0.94 | bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10))  |
| -        | 692±2μs                    | 637±2μs                       |    0.92 | bench_function_base.Sort.time_argsort('merge', 'float32', ('random',))           |
| -        | 45.5±0.03μs                | 42.0±0.2μs                    |    0.92 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000)) |
| -        | 80.5±0.07μs                | 73.0±0.1μs                    |    0.91 | bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))          |
| -        | 78.9±0.2μs                 | 71.7±0.2μs                    |    0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))              |
| -        | 79.2±0.1μs                 | 72.1±0.2μs                    |    0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))             |
| -        | 131±2μs                    | 118±0.8μs                     |    0.9  | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 100))    |
| -        | 82.8±0.2μs                 | 73.7±0.3μs                    |    0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))             |
| -        | 83.4±0.07μs                | 74.1±0.2μs                    |    0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))            |
| -        | 78.6±0.2μs                 | 70.3±0.2μs                    |    0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))               |
| -        | 79.2±0.09μs                | 70.8±0.08μs                   |    0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))              |
| -        | 3.22±0.02μs                | 2.86±0μs                      |    0.89 | bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',))              |
| -        | 3.26±0.04μs                | 2.84±0μs                      |    0.87 | bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))               |
| -        | 82.6±0.06μs                | 71.1±0.08μs                   |    0.86 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))     |
| -        | 4.91±0.01μs                | 4.22±0μs                      |    0.86 | bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))               |
| -        | 79.0±0.2μs                 | 66.8±0.05μs                   |    0.85 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000))   |
| -        | 78.8±0.05μs                | 67.0±0.2μs                    |    0.85 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))      |
| -        | 84.2±0.07μs                | 70.8±0.1μs                    |    0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))              |
| -        | 89.4±0.1μs                 | 75.5±0.05μs                   |    0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000))   |
| -        | 78.9±0.04μs                | 65.9±0.1μs                    |    0.84 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))       |
| -        | 85.4±0.06μs                | 71.9±0.05μs                   |    0.84 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))    |
| -        | 85.3±0.08μs                | 70.5±0.1μs                    |    0.83 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))     |
| -        | 80.5±0.03μs                | 66.4±0.1μs                    |    0.83 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))               |
| -        | 87.5±0.05μs                | 71.6±0.1μs                    |    0.82 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))    |
| -        | 80.4±0.05μs                | 65.4±0.07μs                   |    0.81 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                |
| -        | 83.6±0.05μs                | 66.9±0.1μs                    |    0.8  | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))     |
| -        | 83.5±0.05μs                | 65.8±0.08μs                   |    0.79 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))      |
| -        | 6.87±0.01μs                | 5.13±0.08μs                   |    0.75 | bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',))             |
| -        | 12.2±0.02μs                | 8.79±0.1μs                    |    0.72 | bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))             |
| -        | 193±0.5μs                  | 124±0.5μs                     |    0.65 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))            |
| -        | 27.7±0.2ms                 | 18.0±0.2ms                    |    0.65 | bench_function_base.Sort.time_sort_worst                                         |
| -        | 192±0.4μs                  | 123±0.2μs                     |    0.64 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))             |
| -        | 202±0.2μs                  | 128±0.04μs                    |    0.63 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))   |
| -        | 203±0.5μs                  | 125±0.09μs                    |    0.62 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))    |
| -        | 199±0.4μs                  | 122±0.07μs                    |    0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))              |
| -        | 195±0.4μs                  | 120±0.09μs                    |    0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))     |
| -        | 215±0.3μs                  | 121±0.3μs                     |    0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))               |
| -        | 216±0.3μs                  | 121±0.7μs                     |    0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))              |
| -        | 225±0.3μs                  | 126±0.3μs                     |    0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))     |
| -        | 223±0.2μs                  | 119±0.09μs                    |    0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                |
| -        | 219±0.06μs                 | 118±0.08μs                    |    0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))       |
| -        | 227±0.3μs                  | 123±0.2μs                     |    0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))      |

@Mousius
Copy link
Member Author
Mousius commented Nov 24, 2023

@rdevulap don't panic; this was just so easy to do that I wanted to see if it would work 😸 I'll wait for your patches to be in and rebase on top of them.

@seiko2plus SVE features! Woo!

Copy link
Member
@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you. Just a few changes more.

@seiko2plus
Copy link
Member

this was just so easy to do that I wanted to see if it would work 😸 I'll wait for your patches to be in and rebase on top of them.

This pr doesn't change any C/C++ code so it should be fine to merge it.

@Mousius
Copy link
Member Author
Mousius commented Nov 29, 2023

this was just so easy to do that I wanted to see if it would work 😸 I'll wait for your patches to be in and rebase on top of them.

This pr doesn't change any C/C++ code so it should be fine to merge it.

This will conflict with https://github.com/numpy/numpy/pull/25045/files#diff-1df2f2cfde074ed7de97ebae8a7ef48ad000241256ea3f480f81b2598d03b4f8R784-R788

Leveraging the meson infrastructure to selectively enable SVE
specifically for Highway, which already supports SVE.

```
| Change   | Before [94bc564] <main>   | After [1ffedb85] <sve-sort>   |   Ratio | Benchmark (Parameter)                                                            |
|----------|----------------------------|-------------------------------|---------|----------------------------------------------------------------------------------|
| +        | 551±0.8μs                  | 654±5μs                       |    1.19 | bench_function_base.Sort.time_sort('merge', 'float32', ('random',))              |
| +        | 72.9±0.1μs                 | 80.5±0.08μs                   |    1.11 | bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))          |
| +        | 553±1μs                    | 606±2μs                       |    1.1  | bench_function_base.Sort.time_sort('merge', 'float64', ('random',))              |
| +        | 41.9±0.3μs                 | 45.6±0.04μs                   |    1.09 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))  |
| +        | 493±1μs                    | 532±2μs                       |    1.08 | bench_function_base.Sort.time_sort('merge', 'float16', ('random',))              |
| +        | 73.7±0.2μs                 | 78.8±0.04μs                   |    1.07 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))   |
| +        | 565±1μs                    | 600±1μs                       |    1.06 | bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))           |
| +        | 465±1μs                    | 491±1μs                       |    1.06 | bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))            |
| +        | 696±3μs                    | 739±2μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))               |
| +        | 645±3μs                    | 684±3μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))      |
| +        | 651±1μs                    | 691±1μs                  
8000
     |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))     |
| +        | 627±3μs                    | 665±1μs                       |    1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))    |
| +        | 467±1μs                    | 494±0.7μs                     |    1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))              |
| +        | 166±0.1μs                  | 174±1μs                       |    1.05 | bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 10))     |
| -        | 77.4±0.08μs                | 73.2±0.2μs                    |    0.95 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))  |
| -        | 379±1μs                    | 359±0.1μs                     |    0.95 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',))                |
| -        | 341±0.5μs                  | 324±0.5μs                     |    0.95 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))   |
| -        | 590±0.5μs                  | 554±1μs                       |    0.94 | bench_function_base.Sort.time_argsort('heap', 'float32', ('ordered',))           |
| -        | 239±5μs                    | 226±0.6μs                     |    0.94 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 10))  |
| -        | 195±2μs                    | 184±1μs                       |    0.94 | bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10))  |
| -        | 692±2μs                    | 637±2μs                       |    0.92 | bench_function_base.Sort.time_argsort('merge', 'float32', ('random',))           |
| -        | 45.5±0.03μs                | 42.0±0.2μs                    |    0.92 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000)) |
| -        | 80.5±0.07μs                | 73.0±0.1μs                    |    0.91 | bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))          |
| -        | 78.9±0.2μs                 | 71.7±0.2μs                    |    0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))              |
| -        | 79.2±0.1μs                 | 72.1±0.2μs                    |    0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))             |
| -        | 131±2μs                    | 118±0.8μs                     |    0.9  | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 100))    |
| -        | 82.8±0.2μs                 | 73.7±0.3μs                    |    0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))             |
| -        | 83.4±0.07μs                | 74.1±0.2μs                    |    0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))            |
| -        | 78.6±0.2μs                 | 70.3±0.2μs                    |    0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))               |
| -        | 79.2±0.09μs                | 70.8±0.08μs                   |    0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))              |
| -        | 3.22±0.02μs                | 2.86±0μs                      |    0.89 | bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',))              |
| -        | 3.26±0.04μs                | 2.84±0μs                      |    0.87 | bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))               |
| -        | 82.6±0.06μs                | 71.1±0.08μs                   |    0.86 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))     |
| -        | 4.91±0.01μs                | 4.22±0μs                      |    0.86 | bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))               |
| -        | 79.0±0.2μs                 | 66.8±0.05μs                   |    0.85 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000))   |
| -        | 78.8±0.05μs                | 67.0±0.2μs                    |    0.85 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))      |
| -        | 84.2±0.07μs                | 70.8±0.1μs                    |    0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))              |
| -        | 89.4±0.1μs                 | 75.5±0.05μs                   |    0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000))   |
| -        | 78.9±0.04μs                | 65.9±0.1μs                    |    0.84 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))       |
| -        | 85.4±0.06μs                | 71.9±0.05μs                   |    0.84 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))    |
| -        | 85.3±0.08μs                | 70.5±0.1μs                    |    0.83 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))     |
| -        | 80.5±0.03μs                | 66.4±0.1μs                    |    0.83 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))               |
| -        | 87.5±0.05μs                | 71.6±0.1μs                    |    0.82 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))    |
| -        | 80.4±0.05μs                | 65.4±0.07μs                   |    0.81 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                |
| -        | 83.6±0.05μs                | 66.9±0.1μs                    |    0.8  | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))     |
| -        | 83.5±0.05μs                | 65.8±0.08μs                   |    0.79 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))      |
| -        | 6.87±0.01μs                | 5.13±0.08μs                   |    0.75 | bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',))             |
| -        | 12.2±0.02μs                | 8.79±0.1μs                    |    0.72 | bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))             |
| -        | 193±0.5μs                  | 124±0.5μs                     |    0.65 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))            |
| -        | 27.7±0.2ms                 | 18.0±0.2ms                    |    0.65 | bench_function_base.Sort.time_sort_worst                                         |
| -        | 192±0.4μs                  | 123±0.2μs                     |    0.64 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))             |
| -        | 202±0.2μs                  | 128±0.04μs                    |    0.63 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))   |
| -        | 203±0.5μs                  | 125±0.09μs                    |    0.62 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))    |
| -        | 199±0.4μs                  | 122±0.07μs                    |    0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))              |
| -        | 195±0.4μs                  | 120±0.09μs                    |    0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))     |
| -        | 215±0.3μs                  | 121±0.3μs                     |    0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))               |
| -        | 216±0.3μs                  | 121±0.7μs                     |    0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))              |
| -        | 225±0.3μs                  | 126±0.3μs                     |    0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))     |
| -        | 223±0.2μs                  | 119±0.09μs                    |    0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                |
| -        | 219±0.06μs                 | 118±0.08μs                    |    0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))       |
| -        | 227±0.3μs                  | 123±0.2μs                     |    0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))      |
```
@Mousius Mousius force-pushed the highway-vqsort-sve branch from f0b936d to 682d33c Compare December 8, 2023 13:42
@Mousius
Copy link
Member Author
Mousius commented Dec 8, 2023

@seiko2plus I've rebased this on top of @rdevulap's changes, I think this is good to go?

Copy link
Member
@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0