-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
PERF: improve multithreaded ufunc scaling #27896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Darn, this seems to break building on Windows:
If anyone happens to know what's causing that let me know, I'll try debugging on a Windows machine in the meantime... |
numpy/_core/src/common/npy_hashtable.cpp
8000
Outdated
if (res->buckets == NULL) { | ||
PyErr_NoMemory(); | ||
PyMem_Free(res); | ||
return NULL; | ||
} | ||
|
||
#ifdef Py_GIL_DISABLED | ||
res->mutex = new std::shared_mutex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using a heap allocation because otherwise I'd need to convert ufunc_object.c
- which includes npy_hashtable.h
and sees the layout of PyArrayIdentityHash
- to C++ and that seemed like too much to take on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... unfortunately, until ufunc_object.c
is converted and all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.
(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)
I think the Windows linker errors I ran into earlier were caused by missing |
/* Returns a borrowed ref of the second value in the matching info tuple */ | ||
PyObject * | ||
get_info_no_cast(PyUFuncObject *ufunc, PyArray_DTypeMeta *op_dtype, | ||
int ndtypes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not totally clear to me if originally not including this function in dispatching.h
and forward-declaring it here was an intentional decision for some reason
@charris do you think this is OK to backport to 2.1? IMO moving stuff to C++ like this feels risky for a bugfix release. |
Python multithreading is still in its early stages, so 2.2.0 is probably soon enough for those wanting to experiment with that feature, and it will come out at about the same time as 2.1.4. I would prefer to limit the 2.1 backports to clear bug fixes. |
Ah I see you added the backport-candi 8000 date label because it's a backport to 2.2. I was confused about the target branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was curious and had a look, thinking it might be too much, but it is nice to see this actually is a small change! My review may not mean too much, but it seems all OK, and the only thing I have is a suggested name change...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _with_locking
is a much better name, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, modulo the exception thing. I do wonder slightly whether allows re-entry from within the same thread?
Not sure that it matters, but I think it may be nice to have a recrusive mutex here. It doesn't seem implausible for a promotion function to call the promoter again to figure things out.
(But, it should also be rare enough that I don't care if we defer that to when someone runs into it.)
if (res->buckets == NULL) { | ||
PyErr_NoMemory(); | ||
PyMem_Free(res); | ||
return NULL; | ||
} | ||
|
||
#ifdef Py_GIL_DISABLED | ||
res->mutex = new std::shared_mutex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... unfortunately, until ufunc_object.c
is converted and all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.
(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)
@@ -163,7 +167,9 @@ check_and_adjust_axis(int *axis, int ndim) | |||
* <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52023>. | |||
* clang versions < 8.0.0 have the same bug. | |||
*/ | |||
#if (!defined __STDC_VERSION__ || __STDC_VERSION__ < 201112 \ | |||
#ifdef __cplusplus | |||
#define NPY_ALIGNOF(type) alignof(type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we shouldn't just use alignof
, but let's go with this.
@@ -115,6 +122,9 @@ PyArrayIdentityHash_Dealloc(PyArrayIdentityHash *tb) | |||
{ | |||
PyMem_Free(tb->buckets); | |||
PyMem_Free(tb); | |||
#ifdef Py_GIL_DISABLED | |||
delete (std::shared_mutex *)tb->mutex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems OK to assume that delete
will never throw an exception.
"index %"NPY_INTP_FMT" is out of bounds " | ||
"for size %"NPY_INTP_FMT, *index, max_item); | ||
"index %" NPY_INTP_FMT " is out of bounds " | ||
"for size %" NPY_INTP_FMT, *index, max_item); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just replace all NPY_INTP_FMT
with %zd
one of these days...
Unfortunately then we run into the issue of finding a portable implementation. My brief research indicates that there isn't a standardized shared recursive mutex because of the performance trade-offs, although the posix shared read-write lock is recursive. If someone does try to use this reentrantly, it will deadlock. Do you think that's problematic in practice? Thanks for the pointer about exceptions. I found |
Not right now. I could imagine it being meaningful, for example if you had a parametric dtype like
Yeah, that seems good, nice. (No idea if it would catch non memory errors, but I don't think those are possible). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ngoldbaum looks like this is ready.
* PERF: add a fast path to ufunc type resolution * MAINT: move dispatching.c to C++ * MAINT: move npy_hashtable to C++ and use std::shared_mutex * MAINT: fix windows linking * MAINT: remove outdated comment * MAINT: only call try_promote on free-threaded build Converts dispatching to cpp in order to use `std::shared_mutex` to improve free-threaded scaling. * MAINT: try to give new function a name indicating it uses a mutex * MAINT: only do complicated casting to get a mutex pointer once * MAINT: use std::nothrow to avoid dealing with exceptions * DOC: add changelog
PR numpy#27896 was backported to NumPy 2.2. [skip ci]
return NULL; | ||
} | ||
return info; | ||
} | ||
|
||
#ifdef Py_GIL_DISABLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the drive-by comment, but Py_GIL_DISABLED
means you're running a on free-threading enabled Python, not necessarily that the GIL is concretely disabled, right?
If you're calling mutex->lock
or mutex->lock_shared
with the GIL held, then there is a risk of deadlock if, for example, promote_and_get_info_and_ufuncimpl
released the GIL in another thread and is waiting to take the GIL back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you may want to wrap calls to mutex->lock[_shared]
in a Py_BEGIN_ALLOW_THREADS
/Py_END_ALLOW_THREADS
pair.
Fixes #27786 and re-do of #27859.
Thanks to @seiko2plus for the suggestion to convert this code to C++ and use C++17 features. Let's see if any of the platforms we run CI on object to using this, it passes the tests on my Macbook Pro.
There is an earlier commit included in this PR that keeps things in C and uses the CPython-internal
_PyRWMutex
, which also seems to work, but using C++ lets us avoid the need to write a PEP or make an argument to the C API workgroup that making_PyRWMutex
public would be nice for C extensions.With the test script in the issue, I see the following scaling using
std::shared_mutex
: