ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort by r-devulap · Pull Request #28619 · numpy/numpy · GitHub

ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort #28619


Merged · 10 commits merged into numpy:main on May 14, 2025

Conversation

@r-devulap (Member) commented Apr 1, 2025

Update the x86-simd-sort module to the latest version to pull in 4 major changes:

Benchmark numbers for `np.sort` on a TGL (sorting an array of 5 million numbers):

| Change   | Before [93898621] <main>   | After [19f94d3c] <xss-openmp>   |   Ratio | Benchmark (Parameter)                                                          |
|----------|----------------------------|---------------------------------|---------|--------------------------------------------------------------------------------|
| -        | 4.57±0.08ms                | 3.61±0.04ms                     |    0.79 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))           |
| -        | 4.51±0.04ms                | 3.57±0.04ms                     |    0.79 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))   |
| -        | 4.53±0.03ms                | 3.57±0.02ms                     |    0.79 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))  |
| -        | 4.54±0.03ms                | 3.58±0.03ms                     |    0.79 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) |
| -        | 54.8±0.2ms                 | 27.0±0.05ms                     |    0.49 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))           |
| -        | 60.6±0.3ms                 | 29.8±0.3ms                      |    0.49 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000)) |
| -        | 53.0±0.1ms                 | 25.2±0.1ms                      |    0.48 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))            |
| -        | 55.0±0.3ms                 | 26.2±0.2ms                      |    0.48 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))  |
| -        | 55.1±0.2ms                 | 25.7±0.06ms                     |    0.47 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))          |
| -        | 54.8±0.1ms                 | 25.9±0.3ms                      |    0.47 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))   |
| -        | 64.7±0.3ms                 | 29.0±0.2ms                      |    0.45 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))   |
| -        | 28.4±0.1ms                 | 12.4±0.2ms                      |    0.44 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000)) |
| -        | 26.3±0.1ms                 | 11.6±0.7ms                      |    0.44 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))   |
| -        | 59.3±0.2ms                 | 26.1±0.2ms                      |    0.44 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))             |
| -        | 26.8±1ms                   | 11.6±0.4ms                      |    0.43 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))           |
| -        | 57.3±0.2ms                 | 24.7±0.3ms                      |    0.43 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))              |
| -        | 58.9±0.09ms                | 25.2±1ms                        |    0.43 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))     |
| -        | 59.4±0.09ms                | 25.4±0.09ms                     |    0.43 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))    |
| -        | 60.1±0.1ms                 | 25.2±0.1ms                      |    0.42 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))            |
| -        | 26.0±0.2ms                 | 10.9±0.1ms                      |    0.42 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))  |
| -        | 25.8±1ms                   | 10.3±0.2ms                      |    0.4  | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))            |
| -        | 26.5±0.08ms                | 10.7±0.1ms                      |    0.4  | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))   |
| -        | 9.64±0.05ms                | 3.89±0.1ms                      |    0.4  | bench_function_base.Sort.time_sort_worst                                       |
| -        | 26.8±0.4ms                 | 10.5±0.2ms                      |    0.39 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))          |
| -        | 26.8±0.1ms                 | 10.4±0.7ms                      |    0.39 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))  |
| -        | 24.9±0.6ms                 | 9.80±0.2ms                      |    0.39 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))             |
| -        | 24.6±0.7ms                 | 9.60±0.02ms                     |    0.39 | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))            |
| -        | 24.5±0.06ms                | 9.25±0.03ms                     |    0.38 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))     |
| -        | 24.0±0.08ms                | 9.22±0.02ms                     |    0.38 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))    |
| -        | 24.0±1ms                   | 8.85±0.08ms                     |    0.37 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))              |
| -        | 24.6±0.2ms                 | 9.06±0.05ms                     |    0.37 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))    |
| -        | 23.7±0.07ms                | 8.78±0.07ms                     |    0.37 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))             |
| -        | 24.2±0.2ms                 | 8.88±0.3ms                      |    0.37 | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))           |
| -        | 24.4±0.2ms                 | 9.02±0.09ms                     |    0.37 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))   |
| -        | 24.6±0.2ms                 | 8.95±0.08ms                     |    0.36 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))            |
| -        | 89.0±0.3ms                 | 7.42±0.07ms                     |    0.08 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))             |
| -        | 87.7±0.5ms                 | 6.67±0.06ms                     |    0.08 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))              |
| -        | 88.3±0.2ms                 | 6.81±0.04ms                     |    0.08 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))     |
| -        | 87.2±0.07ms                | 6.54±0.02ms                     |    0.08 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))    |

Benchmark numbers for `np.argsort` on a TGL (sorting an array of 500,000 numbers):

| Change   | Before [93898621] <main>   | After [41eb9481] <xss-openmp>   |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|---------------------------------|---------|-----------------------------------------------------------------------------------|
| +        | 530±20μs                   | 788±50μs                        |    1.49 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',))            |
| +        | 528±30μs                   | 759±30μs                        |    1.44 | bench_function_base.Sort.time_argsort('quick', 'int32', ('uniform',))             |
| +        | 608±30μs                   | 787±20μs                        |    1.29 | bench_function_base.Sort.time_argsort('quick', 'float32', ('uniform',))           |
| -        | 10.8±0.02ms                | 4.12±0.02ms                     |    0.38 | bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 1000)) |
| -        | 11.3±0.03ms                | 4.28±0.03ms                     |    0.38 | bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 1000)) |
| -        | 10.5±0.03ms                | 3.92±0.03ms                     |    0.37 | bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 1000))   |
| -        | 11.0±0.03ms                | 3.95±0.02ms                     |    0.36 | bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 100))  |
| -        | 11.7±0.04ms                | 4.23±0.03ms                     |    0.36 | bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 100))  |
| -        | 10.7±0.02ms                | 3.84±0.05ms                     |    0.36 | bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 100))    |
| -        | 11.8±0.04ms                | 4.23±0.02ms                     |    0.36 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))   |
| -        | 10.4±0.02ms                | 3.78±0.02ms                     |    0.36 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 1000))  |
| -        | 8.28±0.2ms                 | 2.92±0.1ms                      |    0.35 | bench_function_base.Sort.time_argsort('quick', 'int32', ('reversed',))            |
| -        | 10.8±0.08ms                | 3.72±0.06ms                     |    0.35 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 100))   |
| -        | 8.62±0.2ms                 | 2.96±0.09ms                     |    0.34 | bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))           |
| -        | 8.71±0.2ms                 | 2.92±0.09ms                     |    0.34 | bench_function_base.Sort.time_argsort('quick', 'float32', ('reversed',))          |
| -        | 8.70±0.02ms                | 2.95±0.01ms                     |    0.34 | bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))           |
| -        | 8.74±0.03ms                | 3.00±0.02ms                     |    0.34 | bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))          |
| -        | 12.3±0.02ms                | 4.21±0.03ms                     |    0.34 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))    |
| -        | 8.35±0.2ms                 | 2.81±0.08ms                     |    0.34 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('ordered',))            |
| -        | 8.34±0.1ms                 | 2.79±0.1ms                      |    0.33 | bench_function_base.Sort.time_argsort('quick', 'int32', ('ordered',))             |
| -        | 8.30±0.2ms                 | 2.78±0.09ms                     |    0.33 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('reversed',))           |
| -        | 12.0±0.3ms                 | 3.87±0.1ms                      |    0.32 | bench_function_base.Sort.time_argsort('quick', 'float32', ('random',))            |
| -        | 10.3±0.04ms                | 3.31±0.03ms                     |    0.32 | bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 10))   |
| -        | 12.5±0.04ms                | 3.98±0.03ms                     |    0.32 | bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 10))   |
| -        | 9.50±0.03ms                | 3.08±0.06ms                     |    0.32 | bench_function_base.Sort.time_argsort('quick', 'int64', ('ordered',))             |
| -        | 9.48±0.04ms                | 3.07±0.04ms                     |    0.32 | bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',))            |
| -        | 13.5±0.04ms                | 4.18±0.04ms                     |    0.31 | bench_function_base.Sort.time_argsort('quick', 'float64', ('random',))            |
| -        | 11.7±0.3ms                 | 3.66±0.1ms                      |    0.31 | bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))              |
| -        | 9.97±0.03ms                | 3.12±0.03ms                     |    0.31 | bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 10))     |
| -        | 13.0±0.02ms                | 4.03±0.05ms                     |    0.31 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))     |
| -        | 11.7±0.3ms                 | 3.58±0.08ms                     |    0.31 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('random',))             |
| -        | 9.96±0.01ms                | 3.08±0.03ms                     |    0.31 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 10))    |
| -        | 14.2±0.06ms                | 4.11±0.07ms                     |    0.29 | bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))              |

@seiko2plus added the labels "56 - Needs Release Note" and "component: SIMD" on Apr 1, 2025
@seiko2plus (Member) left a comment (edited):

Nice to see multi-threading support, well done. This PR should also include a release note and add documentation mentioning that sort operations now support multi-threading on x86, and that the number of threads can be controlled via the environment variable OMP_NUM_THREADS. Additionally, OpenMP flags should be disabled if the meson option disable-threading is enabled.

  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
  endif
endif
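As a concrete sketch of the thread-count control mentioned above (an illustration, not code from this PR; it assumes a NumPy build with the OpenMP-enabled sort kernels, and on any other build setting the variable is simply a no-op for sorting):

```python
import os

# OMP_NUM_THREADS is the standard OpenMP control knob. It must be set
# before the OpenMP runtime initializes, so set it before importing NumPy.
# Assumption: NumPy was built with the OpenMP-enabled x86-simd-sort;
# otherwise this environment variable has no effect on np.sort.
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np

a = np.random.default_rng(42).random(1_000_000)
s = np.sort(a)
# The result is identical regardless of thread count; only wall time changes.
assert (s[:-1] <= s[1:]).all()
```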
Contributor:
Are we "all good" to use OpenMP in NumPy directly? I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues, etc. Maybe this is just for a custom local build rather than for activation in wheels.

If we are actually "ok" with that, I guess that in addition to the env variable Sayed mentioned there is also threadpoolctl, where one might need to modulate both with something like `with controller.limit(limits={"openblas": 2, "openmp": 4})`?

@r-devulap (Member, Author):
Are we "all good" to use OpenMP in NumPy directly?

I am not familiar with the implications of using OpenMP and how it could potentially interact with other modules. I was hoping to get an answer to that via this pull request and everyone's input.

Member:

OpenBLAS usually manages OpenMP itself, which creates a bit of confusion when nesting it.

OpenBLAS now has:
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/blas_server_callback.c

which we could use if we have a central thread pool we want to reuse across multiple things?

Member:

@ogrisel I was wondering if you know anything about this (i.e. whether it is safe to use OpenMP in NumPy, or it would create issues).

Member:

Which we can use if we have a central thread pool we want to re-use across multiple things?

The libopenblas we ship inside our wheels is always built with pthreads, not openmp. Build scripts live at https://github.com/MacPython/openblas-libs/tree/main/tools.

@thomasjpfan (Contributor):

scikit-learn does run into issues with running OpenMP's threadpool together with OpenBLAS: scikit-learn/scikit-learn#28883

There is a draft PR here to force OpenBLAS to use OpenMP (away from pthreads): scikit-learn/scikit-learn#29403

Contributor:

I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues

Yes, there are some quite ugly issues with multiple installed openmp libraries and segfaults depending on import order, see
microsoft/LightGBM#6595

@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    arr = np.random.rand(N)
Contributor:

maybe we should pin the values down with modern default_rng?

One other thing I checked was how slow the test might be (do we want a slow marker for the large array handling?). It didn't seem too bad (< 1 s for each case locally on an ARM Mac), though it is in the top 10 slowest for this module, for example.

@r-devulap (Member, Author):

I did pin down the values with a fixed seed.
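For reference, a minimal sketch of a pinned-seed version of such a check (the name, seed, and sizes are illustrative, not necessarily those in the merged PR; the actual test uses a pytest parametrize over dtypes):

```python
import numpy as np

def check_sort_largearray(dtype, seed=42, n=1_000_000):
    # Illustrative pinned-seed check: a fixed seed makes any failure
    # reproducible across runs and machines, while a large n still
    # exercises the multi-threaded large-array code path.
    rng = np.random.default_rng(seed)
    arr = rng.random(n).astype(dtype)
    out = np.sort(arr)
    assert out.shape == arr.shape
    assert (out[:-1] <= out[1:]).all()

for dt in (np.float16, np.float32, np.float64):
    check_sort_largearray(dt)
```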

@r-devulap (Member, Author):

@rgommers pointed out this discussion in SciPy that details the complications of using OpenMP: scipy/scipy#10239 (comment)

@rgommers (Member) commented Apr 2, 2025:

Also xref https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp, which details a bunch of issues.

@rgommers (Member) commented Apr 2, 2025:

I suspect that if we start using OpenMP code, we should disable it in wheels and only let distro packagers enable it.

@r-devulap (Member, Author):

I suspect that if we start using OpenMP code, we should disable it in wheels and only let distro packagers enable it.

@rgommers How does the interaction with PyTorch or scikit-learn work? Don't they vendor their own version of libgomp, which can potentially conflict with what the distro provides?

@rgommers (Member) commented Apr 9, 2025:

@rgommers How does the interaction with PyTorch or Sciki-learn work? Don't they vendor their own version of libgomp which can potentially conflict with what the distro provides?

That is already the case - the answer for that one is: (a) please don't mix distro packages with wheels, and (b) the libomp/libiomp (not libgomp, which interacts badly with multiprocessing) inside wheels will have its symbols mangled with auditwheel, so there's no clash.

If you have PyTorch and scikit-learn both installed in the same environment and then imported, that usually works just fine, but it has given rise to a host of hard to debug issues in the past. Also numpy from conda defaults, which pulls in OpenMP via MKL. In wheels you just get two OpenMP runtimes loaded that are isolated from each other (bad for performance but should be robust against symbol conflicts); in conda envs you should get only one but can get two if dependency trees don't work out well. E.g., PyTorch relies on MKL which pulls in Intel OpenMP (libiomp); scikit-learn uses LLVM OpenMP (libomp). There has to be code that handles this correctly, to avoid conflicts. IIRC mkl-service does this, so it uses the other OpenMP runtime if that's already loaded in memory.

You get the gist - this is a little painful. There's only one way to really do OpenMP right - build everything in a coherent fashion against the same OpenMP library. Distros usually get this right. With wheels you can't. And it gets worse when users do pip install . or pip install pkgs-that-uses-openmp from source - because then you don't get auditwheel symbol mangling, and things may go more wrong.


This PR looks nice and simple though, so if the performance gains are really large, maybe making it opt-in rather than use-if-detected could work.

@rgommers (Member) commented Apr 9, 2025:

For the design question of whether NumPy et al. should enable parallelism by default, please see https://thomasjpfan.github.io/parallelism-python-libraries-design/ for a good discussion. Cc @thomasjpfan for visibility.

@r-devulap changed the title from "ENH: Use openmp on x86-simd-sort to speed up sorting large arrays" to "ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort" on Apr 14, 2025
@r-devulap (Member, Author):

updated patch:

  1. Ported OpenMP support for np.argsort from intel/x86-simd-sort#195 ("Add OpenMP support to argsort"), which speeds up sorting arrays larger than 10,000 elements by up to 3.5x on both AVX-512 and AVX2. Updated the title and added benchmark numbers in the first comment.
  2. Added a new meson option enable-openmp, which is false by default. Meson only builds with OpenMP when both disable-threading == false and enable-openmp == true.
  3. Added CI coverage to test the OpenMP code paths.
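For illustration, the declaration of such an option could look roughly like this in the meson options file (a sketch only; the exact file name, option spelling, and description in the PR may differ):

```meson
# Illustrative only: a boolean build option, off by default, matching the
# opt-in behavior described above.
option('enable-openmp', type: 'boolean', value: false,
       description: 'Build x86-simd-sort kernels with OpenMP parallelism')
```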

@rgommers (Member) left a comment:
Thanks for the update @r-devulap. The CI additions look fine to me; a few comments on build support.

if use_intel_sort and use_openmp
  omp = dependency('openmp', required : false)
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
@rgommers (Member):
Let's not have a separate variable for this. It's more idiomatic to use a dependency object:

Suggested change:
- omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
+ omp_dep = declare_dependency(dependencies: omp, compile_args: ['-DXSS_USE_OPENMP'])

The -fopenmp flag shouldn't need to be added explicitly; it's already present in the omp dependency (e.g., see here).

@r-devulap (Member, Author):

Sounds good to me. I have also updated the dependency required flag to be true. If someone explicitly uses enable-openmp=true, then it's reasonable to expect a build failure if OpenMP isn't available.
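Putting the dependency-object suggestion and the required flag together, the build logic could end up roughly as follows (a sketch, not the merged diff; variable names follow the quoted hunk above):

```meson
# Sketch: fail the configure step if the user explicitly requested OpenMP
# but no OpenMP runtime is available, and carry the define via a
# dependency object rather than a separate cflags variable.
if use_intel_sort and use_openmp
  omp = dependency('openmp', required: true)
  omp_dep = declare_dependency(dependencies: omp,
                               compile_args: ['-DXSS_USE_OPENMP'])
endif
```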

@rgommers (Member):

This will also need a release note in doc/release/upcoming_changes/

@r-devulap (Member, Author):

This will also need a release note in doc/release/upcoming_changes/

Thanks for reviewing this! I have added two release notes: one for the performance improvements and another highlighting general OpenMP build support.

@r-devulap (Member, Author):

scikit-learn does run into issues with running OpenMP's threadpool together with OpenBLAS: scikit-learn/scikit-learn#28883

@thomasjpfan Thanks for the reference. Correct me if I am wrong, but from what I understand, the performance problem occurs when calling two functions back to back in a loop, where one uses pthreads and the other uses OpenMP to manage their respective threads. This causes resource contention when both functions try to use the available CPU cores with no visibility into what the other one is doing.

Beyond standardizing the thread management library across the entire Python ecosystem, are there alternative approaches to solve this?

@thomasjpfan (Contributor) commented Apr 17, 2025:

This causes resource contention when both functions try to use available CPU cores with no visibility into what the other one is doing.

Yes, this is the underlying issue.

Beyond standardizing the thread management library across the entire Python ecosystem, are there alternative approaches to solve this?

Standardizing the thread management layer is the only long term solution I see. The hard part is figuring out a way to standardize that works for most projects. Top of mind projects and how they multi-thread:

  • SciPy's fft uses C++ threads & scipy.linalg uses OpenBLAS
  • NumPy uses OpenBLAS
  • PyTorch uses MKL
  • Polars uses Rust+rayon
  • Scikit-learn uses OpenMP and Python multi-threading
  • Python multi-threading uses pthreads, which will become more common with free-threading

Some workarounds for threadpool-specific issues:

@r-devulap added this to the 2.3.0 release milestone on Apr 23, 2025
@r-devulap (Member, Author):

Adding the 2.3.0 milestone. Would like to have this in for that release, with or without the OpenMP support.

Pulls in 2 major changes:

(1) Fixes a performance regression on 16-bit dtype sorting (see
intel/x86-simd-sort#190)

(2) Adds openmp support for quicksort which speeds up sorting arrays >
100,000 by up to 3x. See: intel/x86-simd-sort#179
Also adds a simple unit test to stress the openmp code paths
@r-devulap (Member, Author):

Rebased with main.

@seiko2plus (Member) left a comment:
LGTM, thank you! Massive performance gains for 16-bit sorting. Since OpenMP is disabled by default, I think it's fine to merge.

@charris merged commit 25d26e5 into numpy:main on May 14, 2025
72 of 73 checks passed
@charris (Member) commented May 14, 2025:

Thanks @r-devulap .

Labels: 01 - Enhancement · 56 - Needs Release Note · component: SIMD
9 participants