ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort #28619
Conversation
Nice to see multi-threading support - well done. This PR should also include a release note and documentation mentioning that sort operations now support multi-threading on x86, and that the number of threads can be controlled via the environment variable `OMP_NUM_THREADS`. Additionally, the OpenMP flags should be disabled if the meson option `disable-threading` is enabled.
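For illustration, a minimal sketch of how a user might cap the thread count (this assumes a NumPy build with the OpenMP-enabled sort; the thread count shown is arbitrary):

```python
# Hypothetical usage sketch: OMP_NUM_THREADS is read by the OpenMP runtime,
# so it needs to be set before the first parallel region runs - setting it in
# the environment before importing NumPy is the safe option.
import os
os.environ["OMP_NUM_THREADS"] = "4"  # illustrative value

import numpy as np

arr = np.random.default_rng(0).random(5_000_000)
out = np.sort(arr)  # uses at most 4 OpenMP threads in an OpenMP-enabled build
```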
```meson
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
  endif
endif
```
Are we "all good" to use OpenMP in NumPy directly? I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues, etc. Maybe this is just for a custom local build rather than for activation in wheels.
If we are actually "ok" for that, I guess that in addition to the env variable Sayed mentioned there is also threadpoolctl
where one might need to modulate both with something like with controller.limit(limits={"openblas": 2, "openmp": 4})
?
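For concreteness, a minimal sketch of what such thread capping could look like with `threadpoolctl` (assuming the package is installed; the specific limits are illustrative):

```python
# Sketch: cap the OpenBLAS and OpenMP thread pools around a NumPy call.
import numpy as np
from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()
arr = np.random.default_rng(0).random(5_000_000)

# Limit OpenBLAS to 2 threads and any OpenMP runtime to 4 threads for the
# duration of the block; the previous limits are restored on exit.
with controller.limit(limits={"openblas": 2, "openmp": 4}):
    out = np.sort(arr)
```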
Are we "all good" to use OpenMP in NumPy directly?
I am not familiar with implications of using openMP and how it could potentially interact with other modules. I was hoping to get an answer to that via the pull request and everyone's input.
OpenBLAS usually manages OpenMP itself, which creates a bit of confusion when nesting it. OpenBLAS now has
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/blas_server_callback.c
which we could use if we have a central thread pool we want to reuse across multiple things?
@ogrisel I was wondering if you know anything about this (i.e. whether it is safe to use OpenMP in NumPy, or whether it would create issues).
> Which we can use if we have a central thread pool we want to re-use across multiple things?

The `libopenblas` we ship inside our wheels is always built with pthreads, not OpenMP. Build scripts live at https://github.com/MacPython/openblas-libs/tree/main/tools.
scikit-learn does run into issues with running OpenMP's thread pool together with OpenBLAS: scikit-learn/scikit-learn#28883

There is a draft PR to force OpenBLAS to use OpenMP (away from pthreads): scikit-learn/scikit-learn#29403
> I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues

Yes, there are some quite ugly issues with multiple installed OpenMP libraries and segfaults depending on import order; see microsoft/LightGBM#6595.
numpy/_core/tests/test_multiarray.py (outdated diff):
```python
@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    arr = np.random.rand(N)
```
Maybe we should pin the values down with the modern `default_rng`? One other thing I checked was how slow the test might be (do we want a `slow` marker for the large array handling?). It didn't seem too bad (< 1 s for each case locally on an ARM Mac), though it is, for example, in the top 10 slowest tests for this module.
I did pin down the values with a fixed seed.
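A minimal sketch of what such a seeded test could look like (the seed and the exact assertions are illustrative, not necessarily the code merged in this PR):

```python
# Sketch: seeding with a fixed value makes the large-array sort test
# deterministic across runs while still exercising the OpenMP code paths.
import numpy as np
import pytest


@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    rng = np.random.default_rng(12345)  # illustrative seed
    arr = rng.random(N).astype(dtype)
    # The SIMD/OpenMP quicksort and the stable sort must agree on the result.
    assert np.array_equal(np.sort(arr, kind="quicksort"),
                          np.sort(arr, kind="stable"))
```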
@rgommers pointed out this discussion in SciPy that details the complications of using OpenMP: scipy/scipy#10239 (comment)
Also xref https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp, which details a bunch of issues.
I suspect that if we start using OpenMP code, we should disable it in wheels and only let distro packagers enable it.
@rgommers How does the interaction with PyTorch or scikit-learn work? Don't they vendor their own version of libgomp, which can potentially conflict with what the distro provides?
That is already the case - the answer for that one is: (a) please don't mix distro packages with wheels, and (b) the

If you have PyTorch and scikit-learn both installed in the same environment and then imported, that usually works just fine, but it has given rise to a host of hard to debug issues in the past. Also

You get the gist - this is a little painful. There's only one way to really do OpenMP right: build everything in a coherent fashion against the same OpenMP library. Distros usually get this right. With wheels you can't. And it gets worse when users do

This PR looks nice and simple though, so if the performance gains are really large, maybe making it opt-in rather than use-if-detected could work.
For the design question of whether NumPy et al. should enable parallelism by default, please see https://thomasjpfan.github.io/parallelism-python-libraries-design/ for a good discussion. Cc @thomasjpfan for visibility.
Updated patch:
Thanks for the update @r-devulap. The CI additions look fine to me; a few comments on build support.
numpy/_core/meson.build (outdated diff):
```meson
if use_intel_sort and use_openmp
  omp = dependency('openmp', required : false)
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
```
Let's not have a separate variable for this. It's more idiomatic to use a dependency object:
```diff
- omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
+ omp_dep = declare_dependency(dependencies: omp, compile_args: ['-DXSS_USE_OPENMP'])
```
The `-fopenmp` flag shouldn't need to be added explicitly; it's already present in the `omp` dependency (e.g., see here).
Sounds good to me. I have also updated the dependency's `required` flag to be true. If someone is explicitly using `enable-openmp=true`, then it's reasonable to expect a build failure if OpenMP isn't available.
This will also need a release note in
Thanks for reviewing this! I have added two release notes: one for the performance improvements and another highlighting general OpenMP build support.
@thomasjpfan Thanks for the reference. Correct me if I am wrong, but from what I understand, the performance problem occurs when calling two functions back to back in a loop, where one uses pthreads and the other uses OpenMP to manage their respective threads. This causes resource contention when both functions try to use the available CPU cores with no visibility into what the other one is doing. Beyond standardizing the thread management library across the entire Python ecosystem, are there alternative approaches to solve this?
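For concreteness, a minimal sketch of the pattern being described (illustrative only; it assumes a NumPy build where `np.sort` uses OpenMP while the bundled OpenBLAS uses pthreads):

```python
# Sketch: alternating a pthreads-backed call (OpenBLAS matmul) with an
# OpenMP-backed call (sort) makes the two thread pools repeatedly compete
# for the same cores, with neither aware of the other.
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2000, 2000))
arr = rng.random(5_000_000)

for _ in range(10):
    b = a @ a          # OpenBLAS pthreads pool
    s = np.sort(arr)   # OpenMP pool (in this hypothetical build)
```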
Yes, this is the underlying issue.

Standardizing the thread management layer is the only long-term solution I see. The hard part is figuring out a way to standardize that works for most projects. Top-of-mind projects and how they multi-thread:

Some workarounds for threadpool-specific issues:
Adding the 2.3.0 label. Would like to have this in for that release, with or without the OpenMP support.
Pulls in 2 major changes: (1) fixes a performance regression on 16-bit dtype sorting (see intel/x86-simd-sort#190), and (2) adds OpenMP support for quicksort, which speeds up sorting arrays larger than 100,000 elements by up to 3x (see intel/x86-simd-sort#179).

Also adds a simple unit test to stress the OpenMP code paths.
Rebased with main.
LGTM, thank you! Massive performance gains for 16-bit sorting. Since OpenMP is disabled by default, I think it's fine to merge.
Thanks @r-devulap.
Update the x86-simd-sort module to latest to pull in 4 major changes, including a fix for `np.argsort` perf regressions on sorted data (as reported in: BUG: Performance regression in argsort on sorted data #28714).

Benchmark numbers for `np.sort` on a TGL (sorting an array of 5 million numbers):

Benchmark numbers for `np.argsort` on a TGL (sorting an array of 500,000 numbers):