ENH: Vectorize np.sort and np.partition with AVX2 #25045

r-devulap · 2023-10-31T21:45:45Z

Adds AVX2 implementations of np.sort and np.partition for 32-bit and 64-bit dtypes. Speeds up 32-bit sort by up to 13x and 64-bit sort by up to 7x.
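Whether the AVX2 path is taken depends on the CPU NumPy is running on and on how NumPy was built. As a rough sketch (not part of this patch), the snippet below reads the CPU features NumPy detected at import time. It assumes the private `__cpu_features__` dictionary, whose module path differs between NumPy 1.x and 2.x and is not a stable API.

```python
# Sketch only: __cpu_features__ is a private NumPy attribute (an assumption
# of this example, not something this PR adds); its location may change.
try:
    # NumPy 2.x layout
    from numpy._core._multiarray_umath import __cpu_features__
except ImportError:
    # NumPy 1.x layout
    from numpy.core._multiarray_umath import __cpu_features__

# Each key is a feature name; the value says whether the running CPU has it.
has_avx2 = bool(__cpu_features__.get("AVX2", False))
has_avx512_skx = bool(__cpu_features__.get("AVX512_SKX", False))

print(f"AVX2 available: {has_avx2}")
print(f"AVX512_SKX available: {has_avx512_skx}")
```

A `True` here only says the CPU supports the instruction set; the build must also have the corresponding dispatch target enabled.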

This patch includes commits from #24924 (makes it easier to rebase later).

EDIT: updated benchmark numbers after we made a few more improvements to 64-bit AVX2 sorting in intel/x86-simd-sort#99.

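| Change   | Before [eae0b8bc] <main>   | After [33ab3498] <avx2-sort>   |   Ratio | Benchmark (Parameter)                                                                 |
|----------|----------------------------|--------------------------------|---------|---------------------------------------------------------------------------------------|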
| -        | 837±1μs                    | 249±0.5μs                      |    0.3  | bench_function_base.Partition.time_partition('int64', ('random',), 100)               |
| -        | 839±1μs                    | 246±0.6μs                      |    0.29 | bench_function_base.Partition.time_partition('int64', ('random',), 10)                |
| -        | 841±1μs                    | 246±2μs                        |    0.29 | bench_function_base.Partition.time_partition('int64', ('random',), 1000)              |
| -        | 1.03±0ms                   | 264±0.5μs                      |    0.26 | bench_function_base.Partition.time_partition('float64', ('random',), 10)              |
| -        | 1.03±0ms                   | 264±1μs                        |    0.26 | bench_function_base.Partition.time_partition('float64', ('random',), 100)             |
| -        | 1.03±0ms                   | 263±1μs                        |    0.25 | bench_function_base.Partition.time_partition('float64', ('random',), 1000)            |
| -        | 498±0.1μs                  | 98.0±0.05μs                    |    0.2  | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                     |
| -        | 949±4μs                    | 154±0.8μs                      |    0.16 | bench_function_base.Partition.time_partition('float32', ('random',), 10)              |
| -        | 959±2μs                    | 152±2μs                        |    0.16 | bench_function_base.Partition.time_partition('float32', ('random',), 100)             |
| -        | 949±9μs                    | 152±1μs                        |    0.16 | bench_function_base.Partition.time_partition('float32', ('random',), 1000)            |
| -        | 815±3μs                    | 110±2μs                        |    0.14 | bench_function_base.Partition.time_partition('int32', ('random',), 10)                |
| -        | 820±4μs                    | 111±2μs                        |    0.14 | bench_function_base.Partition.time_partition('int32', ('random',), 100)               |
| -        | 818±4μs                    | 111±2μs                        |    0.14 | bench_function_base.Partition.time_partition('int32', ('random',), 1000)              |
| -        | 550±0.1μs                  | 74.9±0.2μs                     |    0.14 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                   |
| -        | 520±0.2μs                  | 40.1±0.03μs                    |    0.08 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))                   |
| -        | 470±0.2μs                  | 39.9±0.1μs                     |    0.08 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                     |
| -        | 498±0.2μs                  | 39.0±0.02μs                    |    0.08 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                    |

Detailed benchmarks are here. https://gist.github.com/r-devulap/4d55b5c1909a7ed0743746cf719bf9d2
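To relate the Ratio column to the "up to 13x" and "up to 7x" figures, here is a small arithmetic sketch. It only uses the mean `Sort.time_sort` timings from the table above (the ± spread is ignored), and the dictionary keys are labels made up for this example.

```python
# Mean timings in microseconds, copied from the Sort.time_sort rows above:
# (before on main, after on avx2-sort). Keys are illustrative labels only.
timings_us = {
    "int64": (498.0, 98.0),
    "float64": (550.0, 74.9),
    "float32": (520.0, 40.1),
    "int32": (470.0, 39.9),
    "uint32": (498.0, 39.0),
}

# Speedup is before / after, i.e. the inverse of the Ratio column.
speedups = {dtype: before / after for dtype, (before, after) in timings_us.items()}

best_32 = max(s for dtype, s in speedups.items() if dtype.endswith("32"))
best_64 = max(s for dtype, s in speedups.items() if dtype.endswith("64"))

print(f"best 32-bit sort speedup: {best_32:.1f}x")
print(f"best 64-bit sort speedup: {best_64:.1f}x")
```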

r-devulap · 2023-11-02T21:17:00Z

numpy/lib/tests/test_shape_base.py

@@ -38,7 +38,7 @@ def test_argequivalent(self):
            (np.sort, np.argsort, dict()),
            (_add_keepdims(np.min), _add_keepdims(np.argmin), dict()),
            (_add_keepdims(np.max), _add_keepdims(np.argmax), dict()),
-            (np.partition, np.argpartition, dict(kth=2)),
+            #(np.partition, np.argpartition, dict(kth=2)),


I wasn't sure what to do with this test. When there is no uniquely defined output for np.partition or np.argpartition, we obviously cannot expect their outputs to match with each other, right?

we obviously cannot expect their outputs to match with each other, right?

I agree since the order is supposed to be undefined.
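One way to test np.partition without comparing it element by element against np.argpartition is to check the partition property itself. The following is a minimal sketch of that idea for 1-D arrays, not the test that was committed; `is_valid_partition` is a hypothetical helper name.

```python
import numpy as np


def is_valid_partition(original, partitioned, kth):
    """Check the partition property without assuming any element order.

    Hypothetical helper for illustration; handles 1-D arrays only.
    """
    original = np.asarray(original)
    partitioned = np.asarray(partitioned)
    expected_sorted = np.sort(original)
    pivot = partitioned[kth]
    # Same multiset of elements as the input.
    same_elements = np.array_equal(expected_sorted, np.sort(partitioned))
    # The kth element is the one a full sort would put there.
    pivot_in_place = pivot == expected_sorted[kth]
    # Everything before kth is <= pivot, everything from kth on is >= pivot.
    left_ok = np.all(partitioned[:kth] <= pivot)
    right_ok = np.all(partitioned[kth:] >= pivot)
    return bool(same_elements and pivot_in_place and left_ok and right_ok)


rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=64).astype(np.int32)  # many duplicates on purpose
kth = 2

direct = np.partition(a, kth)
via_arg = a[np.argpartition(a, kth)]

# Both results must be valid partitions, even if their element order differs.
direct_ok = is_valid_partition(a, direct, kth)
via_arg_ok = is_valid_partition(a, via_arg, kth)
print(direct_ok, via_arg_ok)
```

Only the value at position `kth` is guaranteed to agree between the two results; the order on either side of it is left to the implementation.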

r-devulap · 2023-11-02T21:27:31Z

Commit f92e3b3 updates the test numpy/_core/tests/test_multiarray.py::TestMethods::test_partition so that it no longer uses the results of np.argpartition to validate the results of np.partition.

r-devulap · 2023-11-27T19:10:51Z

Rebased after #24018. Also split highway dispatch to a separate file.

Mousius

Just scrolled over this, think the split looks good but maybe we can avoid the last pointer check.

numpy/_core/src/npysort/highway_qsort.dispatch.cpp

numpy/_core/src/npysort/quicksort.cpp

numpy/_core/src/npysort/x86_simd_qsort.dispatch.cpp

r-devulap · 2023-11-30T20:32:28Z

Cygwin failure seems unrelated.

r-devulap · 2023-11-30T21:02:47Z

Rebased with main to fix the cygwin CI failure.

Mousius

This looks good to me, but it'd be good to restart CI with this month's credits. @ngoldbaum, do you have buttons for that?

seiko2plus

LGTM, thank you, Raghuveer! Could you please make a few changes as suggested?

numpy/_core/src/npysort/highway_qsort.dispatch.cpp

numpy/_core/src/npysort/quicksort.cpp

numpy/_core/src/npysort/selection.cpp

numpy/_core/src/npysort/x86_simd_qsort.dispatch.cpp

numpy/_core/src/npysort/x86_simd_qsort.dispatch.cpp

numpy/_core/meson.build

seiko2plus

Well done, thank you!

seiko2plus · 2023-12-04T18:38:04Z

Thank you @r-devulap.

r-devulap · 2023-12-20T19:53:28Z

Does this need a release note?

charris · 2023-12-20T20:34:11Z

Does this need a release note?

You could add an improvement release note, won't hurt.

github-actions bot added the 01 - Enhancement label Oct 31, 2023

r-devulap mentioned this pull request Nov 2, 2023

fix numpy CI failures intel/x86-simd-sort#100

Merged

r-devulap commented Nov 2, 2023

seiko2plus mentioned this pull request Nov 20, 2023

ENH: Use Highway's VQSort on AArch64 #24018

Merged

r-devulap force-pushed the avx2-sort branch from 1b77997 to e2a076b Compare November 27, 2023 19:09

r-devulap mentioned this pull request Nov 27, 2023

ENH: Perf improvements to np.sort, np.argsort, np.partition and np.argpartition #24924

Closed

Mousius requested changes Nov 28, 2023

r-devulap force-pushed the avx2-sort branch from 4570077 to a3ca84b Compare November 30, 2023 21:02

Raghuveer Devulapalli added 18 commits November 30, 2023 13:02

Update x86-simd-sort to latest

95f6158

Dont include arg methods on 32-bit platforms

125017e

Update x86-simd-sort to latest

65d6506

Revert "Dont include arg methods on 32-bit platforms"

2766e9b

This reverts commit 76d5534.

Enable np.partition and np.argpartition on 32-bit

77278fe

Update x86-simd-sort to latest

d729a49

Update x86-simd-sort to latest

05000e1

Update x86-simd-sort to latest

c2bdc0e

Update x86-simd-sort to latest

a303128

avx512 qsort for fp16, fp32 and fp64 require an explicit flag to sort…

9a7e109

… NAN

Dispatch AVX2 qsort and partition

d4169da

Separate testing np.partition and np.argpartition

9f1faa1

update x86-simd-sort to latest

7c39472

Disable test_shape_base.py::TestTakeAlongAxis::test_argequivalent for…

3b6643d

… partition

Linter fixes

fb69b5b

Fix rebase bug in meson.build

190e80e

Split highway and x86-simd-sort dispatch to separate files

c9588ca

Update x86-simd-sort to latest

0cb29a2

Perf improvements to AVX2 sorting: see intel/x86-simd-sort#104

Include highway/x86-simd-sort at compile time

a3ca84b

Mousius approved these changes Dec 1, 2023

seiko2plus requested changes Dec 1, 2023

Raghuveer Devulapalli added 2 commits December 1, 2023 10:19

Remove distutils related dispatch code and create new namespace for h…

9bde4ae

…ighway

Add optional hasnan arguement to avx sorting

b28ed78

seiko2plus reviewed Dec 1, 2023

numpy/_core/src/npysort/x86_simd_qsort.dispatch.cpp

seiko2plus reviewed Dec 1, 2023

numpy/_core/meson.build

seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Dec 1, 2023

Add x86_qsort/qselect functions inside anonymous namespace

675cb07

seiko2plus approved these changes Dec 2, 2023

seiko2plus merged commit 794f474 into numpy:main Dec 4, 2023

AndresGuzman-Ballen mentioned this pull request Dec 11, 2023

BUG: Occasional failure to sort complex numbers by absolute value on M2 macs #24842

Open

seiko2plus mentioned this pull request Dec 13, 2023

BUG: Build failure (of 1.26.2) on SapphireRapids (avx512_spr) due to multiple definition of avx512_qsort and avx512_qselect #25274

Closed

seiko2plus mentioned this pull request Jan 11, 2024

BUG: crashes in sort/argsort on macOS arm64 #25464

Closed

dcherian mentioned this pull request Feb 8, 2024

Vectorized grouped (nan)quantile xarray-contrib/flox#329

Merged
