8000 ENH: Vectorize np.sort and np.partition with AVX2 by r-devulap · Pull Request #25045 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Vectorize np.sort and np.partition with AVX2 #25045

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Dec 4, 2023

Conversation

r-devulap
Copy link
Member
@r-devulap r-devulap commented Oct 31, 2023

Adds AVX2 implementations of np.sort and np.partition for 32-bit and 64-bit dtypes. Speeds up 32-bit sort by up to 13x and 64-bit sort by up to 7x.

This patch includes commits from #24924 (makes it easier to rebase later).

EDIT: updated benchmark numbers after we made a fewmore improvments to 64-bit AVX2 sorting in intel/x86-simd-sort#99.

| Change   | Before [eae0b8bc] <main>   | After [33ab3498] <avx2-sort>   |   Ratio | Benchmark (Parameter)
| -        | 837±1μs                    | 249±0.5μs                      |    0.3  | bench_function_base.Partition.time_partition('int64', ('random',), 100)               |
| -        | 839±1μs                    | 246±0.6μs                      |    0.29 | bench_function_base.Partition.time_partition('int64', ('random',), 10)                |
| -        | 841±1μs                    | 246±2μs                        |    0.29 | bench_function_base.Partition.time_partition('int64', ('random',), 1000)              |
| -        | 1.03±0ms                   | 264±0.5μs                      |    0.26 | bench_function_base.Partition.time_partition('float64', ('random',), 10)              |
| -        | 1.03±0ms                   | 264±1μs                        |    0.26 | bench_function_base.Partition.time_partition('float64', ('random',), 100)             |
| -        | 1.03±0ms                   | 263±1μs                        |    0.25 | bench_function_base.Partition.time_partition('float64', ('random',), 1000)            |
| -        | 498±0.1μs                  | 98.0±0.05μs                    |    0.2  | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                     |
| -        | 949±4μs                    | 154±0.8μs                      |    0.16 | bench_function_base.Partition.time_partition('float32', ('random',), 10)              |
| -        | 959±2μs                    | 152±2μs                        |    0.16 | bench_function_base.Partition.time_partition('float32', ('random',), 100)             |
| -        | 949±9μs                    | 152±1μs                        |    0.16 | bench_function_base.Partition.time_partition('float32', ('random',), 1000)            |
| -        | 815±3μs                    | 110±2μs                        |    0.14 | bench_function_base.Partition.time_partition('int32', ('random',), 10)                |
| -        | 820±4μs                    | 111±2μs                        |    0.14 | bench_function_base.Partition.time_partition('int32', ('random',), 100)               |
| -        | 818±4μs                    | 111±2μs                        |    0.14 | bench_function_base.Partition.time_partition('int32', ('random',), 1000)              |
| -        | 550±0.1μs                  | 74.9±0.2μs                     |    0.14 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                   |
| -        | 520±0.2μs                  | 40.1±0.03μs                    |    0.08 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))                   |
| -        | 470±0.2μs                  | 39.9±0.1μs                     |    0.08 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                     |
| -        | 498±0.2μs                  | 39.0±0.02μs                    |    0.08 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                    |

Detailed benchmarks are here. https://gist.github.com/r-devulap/4d55b5c1909a7ed0743746cf719bf9d2

@@ -38,7 +38,7 @@ def test_argequivalent(self):
(np.sort, np.argsort, dict()),
(_add_keepdims(np.min), _add_keepdims(np.argmin), dict()),
(_add_keepdims(np.max), _add_keepdims(np.argmax), dict()),
(np.partition, np.argpartition, dict(kth=2)),
#(np.partition, np.argpartition, dict(kth=2)),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what to do with this test. When there is no unique defined output to np.partition or np.argpartition, we obviously cannot expect their outputs to match with each other, right?

Copy link
Member
@seiko2plus seiko2plus Dec 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we obviously cannot expect their outputs to match with each other, right?

I agree since the order is supposed to be undefined.

@r-devulap
Copy link
Member Author

Commit f92e3b3 updates testnumpy/_core/tests/test_multiarray.py::TestMethods::test_partition to not use results of np.argpartition to validate results of np.partition.

@r-devulap
Copy link
Member Author

Rebased after #24018. Also split highway dispatch to a separate file.

Copy link
Member
@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just scrolled over this, think the split looks good but maybe we can avoid the last pointer check.

@r-devulap
Copy link
Member Author

Cygwin failure seems unrelated.

@r-devulap
Copy link
Member Author

Rebased with main to fix the cygwin CI failure.

Copy link
Member
@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but it'd be good to restart CI with this month's credits. @ngoldbaum, do you have buttons for that?

Copy link
Member
@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you Raghuveer!. Could you please make a few changes as suggested?

< 9E88 div id="pullrequestreview-1760608080" class="js-comment js-updatable-content js-socket-channel js-targetable-element js-minimize-container" data-gid="PRR_kwDOAA3dP85o8L9Q" data-channel="eyJjIjoicHVsbF9yZXF1ZXN0X3JldmlldzoxNzYwNjA4MDgwIiwidCI6MTc0NzU3MzczNX0=--d23be029c16bc2cc2f8ec28a991a0f3bab2d6a838197ac539cafc700eac6b1d2" data-url="/numpy/numpy/pull/25045/partials/reviews/1760608080" >
@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Dec 1, 2023
Copy link
Member
@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, Thank you!

@seiko2plus seiko2plus merged commit 794f474 into numpy:main Dec 4, 2023
@seiko2plus
Copy link
Member

Thank you @r-devulap.

@r-devulap
Copy link
Member Author

Does this need a release note?

@charris
Copy link
Member
charris commented Dec 20, 2023

Does this need a release note?

You could add an improvement release note, won't hurt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0