ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort by r-devulap · Pull Request #28619 · numpy/numpy · GitHub

ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort #28619

Merged · 10 commits · May 14, 2025
4 changes: 2 additions & 2 deletions .github/workflows/linux_simd.yml
@@ -212,7 +212,7 @@ jobs:
python -m pip install pytest pytest-xdist hypothesis typing_extensions

- name: Build
run: CC=gcc-13 CXX=g++-13 spin build -- -Dallow-noblas=true -Dcpu-baseline=avx512_skx -Dtest-simd='BASELINE,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL,AVX512_SPR'
run: CC=gcc-13 CXX=g++-13 spin build -- -Denable-openmp=true -Dallow-noblas=true -Dcpu-baseline=avx512_skx -Dtest-simd='BASELINE,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL,AVX512_SPR'

- name: Meson Log
if: always()
@@ -263,7 +263,7 @@ jobs:
python -m pip install pytest pytest-xdist hypothesis typing_extensions

- name: Build
run: CC=gcc-13 CXX=g++-13 spin build -- -Dallow-noblas=true -Dcpu-baseline=avx512_spr
run: CC=gcc-13 CXX=g++-13 spin build -- -Denable-openmp=true -Dallow-noblas=true -Dcpu-baseline=avx512_spr

- name: Meson Log
if: always()
6 changes: 6 additions & 0 deletions doc/release/upcoming_changes/28619.highlight.rst
@@ -0,0 +1,6 @@
Building NumPy with OpenMP Parallelization
-------------------------------------------
NumPy now supports OpenMP parallel processing capabilities when built with the
``-Denable-openmp=true`` Meson build flag. This feature is disabled by default.
When enabled, ``np.sort`` and ``np.argsort`` can use OpenMP to run across
multiple threads, improving performance for these operations.
7 changes: 7 additions & 0 deletions doc/release/upcoming_changes/28619.performance.rst
@@ -0,0 +1,7 @@
Performance improvements to ``np.sort`` and ``np.argsort``
----------------------------------------------------------
``np.sort`` and ``np.argsort`` can now leverage OpenMP for parallel
thread execution, resulting in up to 3.5x speedups on x86 architectures
with AVX2 or AVX-512 instructions. This is an opt-in feature that requires
NumPy to be built with the ``-Denable-openmp=true`` Meson flag. Users can
control the number of threads used by setting the ``OMP_NUM_THREADS``
environment variable.
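
As a rough usage sketch (not part of this PR: it assumes a local NumPy build configured with ``-Denable-openmp=true``, e.g. via the ``spin build`` invocations in the CI workflow above, and the array size and timing harness are arbitrary), the thread count is controlled through the environment before the OpenMP runtime starts:

```python
# Illustrative only. Set OMP_NUM_THREADS before launching Python so the
# OpenMP runtime picks it up, e.g.:  OMP_NUM_THREADS=4 python bench_sort.py
import time

import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(10_000_000, dtype=np.float64)  # large input so the parallel path is exercised

t0 = time.perf_counter()
np.sort(arr, kind="quicksort")     # dispatches to x86-simd-sort on AVX2/AVX-512 CPUs
t1 = time.perf_counter()
np.argsort(arr, kind="quicksort")
t2 = time.perf_counter()

print(f"sort:    {t1 - t0:.3f} s")
print(f"argsort: {t2 - t1:.3f} s")
```

Running the same script with ``OMP_NUM_THREADS=1`` versus a larger value gives a quick sense of the scaling on a given machine.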
2 changes: 2 additions & 0 deletions meson.options
@@ -22,6 +22,8 @@ option('disable-intel-sort', type: 'boolean', value: false,
description: 'Disables SIMD-optimized operations related to Intel x86-simd-sort')
option('disable-threading', type: 'boolean', value: false,
description: 'Disable threading support (see `NPY_ALLOW_THREADS` docs)')
option('enable-openmp', type: 'boolean', value: false,
description: 'Enable building NumPy with openmp support')
option('disable-optimization', type: 'boolean', value: false,
description: 'Disable CPU optimized code (dispatch,simd,unroll...)')
option('cpu-baseline', type: 'string', value: 'min',
22 changes: 20 additions & 2 deletions numpy/_core/meson.build
@@ -128,6 +128,21 @@ if use_intel_sort and not fs.exists('src/npysort/x86-simd-sort/README.md')
error('Missing the `x86-simd-sort` git submodule! Run `git submodule update --init` to fix this.')
endif

# openMP related settings:
if get_option('disable-threading') and get_option('enable-openmp')
error('Build options `disable-threading` and `enable-openmp` are conflicting. Please set at most one to true.')
endif

use_openmp = get_option('enable-openmp') and not get_option('disable-threading')

# Setup openmp flags for x86-simd-sort:
omp = []
omp_dep = []
if use_intel_sort and use_openmp
omp = dependency('openmp', required : true)
omp_dep = declare_dependency(dependencies: omp, compile_args: ['-DXSS_USE_OPENMP'])
endif
Contributor:

Are we "all good" to use OpenMP in NumPy directly? I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues, etc. Maybe this is just for a custom local build rather than for activation in wheels.

If we are actually "ok" with that, I guess that in addition to the env variable Sayed mentioned there is also threadpoolctl, where one might need to modulate both with something like controller.limit(limits={"openblas": 2, "openmp": 4})?
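
(For reference, a minimal sketch of that threadpoolctl pattern, assuming threadpoolctl >= 3 is installed; the specific limits are illustrative, not a recommendation:)

```python
import numpy as np
from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()

# Temporarily cap OpenBLAS at 2 threads and any detected OpenMP runtime at 4
# threads for everything executed inside the block.
with controller.limit(limits={"openblas": 2, "openmp": 4}):
    a = np.random.rand(1_000_000)
    np.sort(a, kind="quicksort")
```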

Member Author:

Are we "all good" to use OpenMP in NumPy directly?

I am not familiar with the implications of using OpenMP and how it could potentially interact with other modules. I was hoping to get an answer to that via the pull request and everyone's input.

Member:

OpenBLAS usually manages OpenMP itself, which creates a bit of confusion when nesting it.

OpenBLAS now has:
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/blas_server_callback.c

Which we can use if we have a central thread pool we want to re-use across multiple things?

Member:

@ogrisel I was wondering if you know anything about this (i.e. whether it is safe to use OpenMP in NumPy, or whether it would create issues).

Member:

> Which we can use if we have a central thread pool we want to re-use across multiple things?

The libopenblas we ship inside our wheels is always built with pthreads, not openmp. Build scripts live at https://github.com/MacPython/openblas-libs/tree/main/tools.

Contributor:

scikit-learn does run into issues with running OpenMP's threadpool together with OpenBLAS: scikit-learn/scikit-learn#28883

There is a draft PR here to force OpenBLAS to use OpenMP (away from pthreads): scikit-learn/scikit-learn#29403

Contributor:

> I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues

Yes, there are some quite ugly issues with multiple installed openmp libraries and segfaults depending on import order, see
microsoft/LightGBM#6595
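
(As a side note, one way to check which OpenMP/BLAS runtimes a process has actually loaded, which helps when debugging these duplicate-runtime problems, is threadpoolctl's introspection; a minimal sketch, assuming threadpoolctl is installed:)

```python
import numpy as np  # importing NumPy loads its BLAS/OpenMP runtimes, if any
from threadpoolctl import threadpool_info

# Each entry describes one detected threadpool runtime (OpenBLAS, libgomp, ...).
for pool in threadpool_info():
    print(pool.get("user_api"), pool.get("prefix"),
          pool.get("filepath"), "threads:", pool.get("num_threads"))
```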


if not fs.exists('src/common/pythoncapi-compat')
error('Missing the `pythoncapi-compat` git submodule! ' +
'Run `git submodule update --init` to fix this.')
@@ -867,12 +882,15 @@ foreach gen_mtargets : [
] : []
],
]



mtargets = mod_features.multi_targets(
gen_mtargets[0], multiarray_gen_headers + gen_mtargets[1],
dispatch: gen_mtargets[2],
# baseline: CPU_BASELINE, it doesn't provide baseline fallback
prefix: 'NPY_',
dependencies: [py_dep, np_core_dep],
dependencies: [py_dep, np_core_dep, omp_dep],
c_args: c_args_common + max_opt,
cpp_args: cpp_args_common + max_opt,
include_directories: [
@@ -1286,7 +1304,7 @@ py.extension_module('_multiarray_umath',
'src/umath',
'src/highway'
],
dependencies: [blas_dep],
dependencies: [blas_dep, omp],
link_with: [
npymath_lib,
unique_hash_so,
15 changes: 15 additions & 0 deletions numpy/_core/tests/test_multiarray.py
@@ -10292,6 +10292,21 @@ def test_argsort_int(N, dtype):
arr[N - 1] = maxv
assert_arg_sorted(arr, np.argsort(arr, kind='quick'))

# Test large arrays that leverage openMP implementations from x86-simd-sort:
@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
N = 1000000
rnd = np.random.RandomState(1100710816)
arr = -0.5 + rnd.random(N).astype(dtype)
assert_equal(np.sort(arr, kind='quick'), np.sort(arr, kind='heap'))

# Test large arrays that leverage openMP implementations from x86-simd-sort:
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_argsort_largearrays(dtype):
N = 1000000
rnd = np.random.RandomState(1100710816)
arr = -0.5 + rnd.random(N).astype(dtype)
assert_arg_sorted(arr, np.argsort(arr, kind='quick'))

@pytest.mark.skipif(not HAS_REFCOUNT, reason="Python lacks refcounts")
def test_gh_22683():