ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort #28619
Conversation
Nice to see multi-threading support - well done. This PR should also include a release note and documentation mentioning that sort operations now support multi-threading on x86, and that the number of threads can be controlled via the environment variable `OMP_NUM_THREADS`. Additionally, the OpenMP flags should be disabled if the meson option `disable-threading` is enabled.
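For illustration, a minimal sketch of how a user might cap the thread count (this assumes a NumPy build with the OpenMP-enabled sort; the thread count shown is arbitrary):

```python
# Hypothetical usage sketch: OMP_NUM_THREADS is read by the OpenMP runtime,
# so it needs to be set before the first parallel region runs - setting it in
# the environment before importing NumPy is the safe option.
import os
os.environ["OMP_NUM_THREADS"] = "4"  # illustrative value

import numpy as np

arr = np.random.default_rng(0).random(5_000_000)
out = np.sort(arr)  # uses at most 4 OpenMP threads in an OpenMP-enabled build
```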
```meson
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
  endif
endif
```
Are we "all good" to use OpenMP in NumPy directly? I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues, etc. Maybe this is just for a custom local build rather than for activation in wheels.
If we are actually "ok" for that, I guess that in addition to the env variable Sayed mentioned there is also threadpoolctl
where one might need to modulate both with something like with controller.limit(limits={"openblas": 2, "openmp": 4})
?
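For concreteness, a minimal sketch of what such thread capping could look like with `threadpoolctl` (assuming the package is installed; the specific limits are illustrative):

```python
# Sketch: cap the OpenBLAS and OpenMP thread pools around a NumPy call.
import numpy as np
from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()
arr = np.random.default_rng(0).random(5_000_000)

# Limit OpenBLAS to 2 threads and any OpenMP runtime to 4 threads for the
# duration of the block; the previous limits are restored on exit.
with controller.limit(limits={"openblas": 2, "openmp": 4}):
    out = np.sort(arr)
```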
Are we "all good" to use OpenMP in NumPy directly?
I am not familiar with implications of using openMP and how it could potentially interact with other modules. I was hoping to get an answer to that via the pull request and everyone's input.
OpenBLAS usually manages OpenMP itself, which creates a bit of confusion when nesting it. OpenBLAS now has
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/blas_server_callback.c
which we could use if we have a central thread pool we want to reuse across multiple things?
@ogrisel I was wondering if you know anything about this (i.e. whether it is safe to use OpenMP in NumPy, or whether it would create issues).
> Which we can use if we have a central thread pool we want to re-use across multiple things?

The `libopenblas` we ship inside our wheels is always built with pthreads, not OpenMP. Build scripts live at https://github.com/MacPython/openblas-libs/tree/main/tools.
scikit-learn does run into issues with running OpenMP's thread pool together with OpenBLAS: scikit-learn/scikit-learn#28883

There is a draft PR to force OpenBLAS to use OpenMP (away from pthreads): scikit-learn/scikit-learn#29403
> I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues

Yes, there are some quite ugly issues with multiple installed OpenMP libraries and segfaults depending on import order; see microsoft/LightGBM#6595.
numpy/_core/tests/test_multiarray.py (outdated diff):
```python
@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    arr = np.random.rand(N)
```
Maybe we should pin the values down with the modern `default_rng`? One other thing I checked was how slow the test might be (do we want a `slow` marker for the large array handling?). It didn't seem too bad (< 1 s for each case locally on an ARM Mac), though it is, for example, in the top 10 slowest tests for this module.
I did pin down the values with a fixed seed.
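A minimal sketch of what such a seeded test could look like (the seed and the exact assertions are illustrative, not necessarily the code merged in this PR):

```python
# Sketch: seeding with a fixed value makes the large-array sort test
# deterministic across runs while still exercising the OpenMP code paths.
import numpy as np
import pytest


@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    rng = np.random.default_rng(12345)  # illustrative seed
    arr = rng.random(N).astype(dtype)
    # The SIMD/OpenMP quicksort and the stable sort must agree on the result.
    assert np.array_equal(np.sort(arr, kind="quicksort"),
                          np.sort(arr, kind="stable"))
```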
@rgommers pointed out this discussion in SciPy that details the complications of using OpenMP: scipy/scipy#10239 (comment)
Also xref https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp, which details a bunch of issues.
I suspect that if we start using OpenMP code, we should disable it in wheels and only let distro packagers enable it.
@rgommers How does the interaction with PyTorch or scikit-learn work? Don't they vendor their own version of libgomp, which can potentially conflict with what the distro provides?
That is already the case - the answer for that one is: (a) please don't mix distro packages with wheels, and (b) the

If you have PyTorch and scikit-learn both installed in the same environment and then imported, that usually works just fine, but it has given rise to a host of hard to debug issues in the past. Also

You get the gist - this is a little painful. There's only one way to really do OpenMP right: build everything in a coherent fashion against the same OpenMP library. Distros usually get this right. With wheels you can't. And it gets worse when users do

This PR looks nice and simple though, so if the performance gains are really large, maybe making it opt-in rather than use-if-detected could work.
For the design question of whether NumPy et al. should enable parallelism by default, please see https://thomasjpfan.github.io/parallelism-python-libraries-design/ for a good discussion. Cc @thomasjpfan for visibility.
Updated patch:
Thanks for the update @r-devulap. The CI additions look fine to me; a few comments on build support.
numpy/_core/meson.build (outdated diff):
```meson
if use_intel_sort and use_openmp
  omp = dependency('openmp', required : false)
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
```
Let's not have a separate variable for this. It's more idiomatic to use a dependency object:
```diff
- omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
+ omp_dep = declare_dependency(dependencies: omp, compile_args: ['-DXSS_USE_OPENMP'])
```
The `-fopenmp` flag shouldn't need to be added explicitly; it's already present in the `omp` dependency (e.g., see here).
Sounds good to me. I have also updated the dependency's `required` flag to be true. If someone is explicitly using `enable-openmp=true`, then it's reasonable to expect a build failure if OpenMP isn't available.
This will also need a release note in
Thanks for reviewing this! I have added two release notes: one for the performance improvements and another highlighting general OpenMP build support.
@thomasjpfan Thanks for the reference. Correct me if I am wrong, but from what I understand, the performance problem occurs when calling two functions back to back in a loop, where one uses pthreads and the other uses OpenMP to manage their respective threads. This causes resource contention when both functions try to use the available CPU cores with no visibility into what the other one is doing. Beyond standardizing the thread management library across the entire Python ecosystem, are there alternative approaches to solve this?
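For concreteness, a minimal sketch of the pattern being described (illustrative only; it assumes a NumPy build where `np.sort` uses OpenMP while the bundled OpenBLAS uses pthreads):

```python
# Sketch: alternating a pthreads-backed call (OpenBLAS matmul) with an
# OpenMP-backed call (sort) makes the two thread pools repeatedly compete
# for the same cores, with neither aware of the other.
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2000, 2000))
arr = rng.random(5_000_000)

for _ in range(10):
    b = a @ a          # OpenBLAS pthreads pool
    s = np.sort(arr)   # OpenMP pool (in this hypothetical build)
```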
Yes, this is the underlying issue.

Standardizing the thread management layer is the only long-term solution I see. The hard part is figuring out a way to standardize that works for most projects. Top-of-mind projects and how they multi-thread:

Some workarounds for threadpool-specific issues:
Adding the 2.3.0 label. Would like to have this in for that release, with or without the OpenMP support.
Pulls in 2 major changes: (1) fixes a performance regression on 16-bit dtype sorting (see intel/x86-simd-sort#190), and (2) adds OpenMP support for quicksort, which speeds up sorting arrays larger than 100,000 elements by up to 3x (see intel/x86-simd-sort#179).

Also adds a simple unit test to stress the OpenMP code paths.
Rebased with main.
LGTM, thank you! Massive performance gains for 16-bit sorting. Since OpenMP is disabled by default, I think it's fine to merge.
Thanks @r-devulap.
Update the x86-simd-sort module to latest to pull in 4 major changes, including a fix for `np.argsort` perf regressions on sorted data (as reported in: BUG: Performance regression in argsort on sorted data #28714).

Benchmark numbers for `np.sort` on a TGL (sorting an array of 5 million numbers):

Benchmark numbers for `np.argsort` on a TGL (sorting an array of 500,000 numbers):