PERF: improve multithreaded ufunc scaling by ngoldbaum · Pull Request #27896 · numpy/numpy · GitHub

PERF: improve multithreaded ufunc scaling #27896

Merged
10 commits merged into numpy:main on Dec 5, 2024

Conversation

@ngoldbaum (Member) commented on Dec 3, 2024

Fixes #27786 and re-do of #27859.

Thanks to @seiko2plus for the suggestion to convert this code to C++ and use C++17 features. Let's see if any of the platforms we run CI on object to using this; it passes the tests on my MacBook Pro.

There is an earlier commit included in this PR that keeps things in C and uses the CPython-internal _PyRWMutex, which also seems to work, but using C++ lets us avoid the need to write a PEP or make an argument to the C API workgroup that making _PyRWMutex public would be nice for C extensions.
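For readers unfamiliar with the primitive: std::shared_mutex is a reader/writer lock, so many threads can hold it in shared mode while lookups hit the cache, and only an insertion needs the exclusive lock. Below is a minimal sketch of that pattern; the cache, key type, and function names are illustrative, not the PR's actual code.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Illustrative cache guarded by a reader/writer lock.
static std::shared_mutex cache_mutex;
static std::unordered_map<std::string, int> cache;

int lookup_or_insert(const std::string &key, int fallback)
{
    {
        // Shared (read) lock: concurrent cache hits do not serialize each other.
        std::shared_lock<std::shared_mutex> read_lock(cache_mutex);
        auto it = cache.find(key);
        if (it != cache.end()) {
            return it->second;
        }
    }
    // Exclusive (write) lock only on a miss, to insert the new entry.
    std::unique_lock<std::shared_mutex> write_lock(cache_mutex);
    return cache.emplace(key, fallback).first->second;
}
```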

With the test script in the issue, I see the following scaling using std::shared_mutex:

[Plot: MFLOPS scaling for array length 1000]

@ngoldbaum (Member, Author) commented on Dec 3, 2024

Darn, this seems to break building on Windows:

FAILED: numpy/_core/_multiarray_umath.cp311-win_amd64.pyd numpy/_core/_multiarray_umath.cp311-win_amd64.pdb 
"link" @numpy/_core/_multiarray_umath.cp311-win_amd64.pyd.rsp
   Creating library numpy\_core\_multiarray_umath.cp311-win_amd64.lib and object numpy\_core\_multiarray_umath.cp311-win_amd64.exp
src_umath_dispatching.cpp.obj : error LNK2001: unresolved external symbol "struct _typeobject PyArrayDTypeMeta_Type" (?PyArrayDTypeMeta_Type@@3U_typeobject@@A)
  Hint on symbols that are defined and could potentially match:
    PyArrayDTypeMeta_Type
src_umath_dispatching.cpp.obj : error LNK2001: unresolved external symbol "struct npy_static_pydata_struct npy_static_pydata" (?npy_static_pydata@@3Unpy_static_pydata_struct@@A)
  Hint on symbols that are defined and could potentially match:

    npy_static_pydata

numpy\_core\_multiarray_umath.cp311-win_amd64.pyd : fatal error LNK1120: 2 unresolved externals

If anyone happens to know what's causing that, let me know; I'll try debugging on a Windows machine in the meantime...

    if (res->buckets == NULL) {
        PyErr_NoMemory();
        PyMem_Free(res);
        return NULL;
    }

#ifdef Py_GIL_DISABLED
    res->mutex = new std::shared_mutex();
@ngoldbaum (Member, Author) commented on Dec 3, 2024

I'm using a heap allocation because otherwise I'd need to convert ufunc_object.c - which includes npy_hashtable.h and sees the layout of PyArrayIdentityHash - to C++ and that seemed like too much to take on.
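A minimal sketch of that layout trick, with illustrative names: the struct definition stays valid C, and only the C++ file that implements the hash table casts the void * back to a std::shared_mutex *.

```cpp
#include <shared_mutex>

// C-visible struct: the mutex is stored as an opaque pointer so C
// translation units that include the header never see a C++ type.
typedef struct {
    void *buckets;  /* hash-table storage, as in the real struct */
    void *mutex;    /* actually a std::shared_mutex *, opaque to C */
} identity_hash_sketch;

static int
sketch_init_mutex(identity_hash_sketch *tb)
{
    // Heap-allocate the mutex; only C++ code ever dereferences it.
    // Note: plain new throws std::bad_alloc on failure (see the
    // exception discussion below).
    tb->mutex = new std::shared_mutex();
    return 0;
}

static void
sketch_free_mutex(identity_hash_sketch *tb)
{
    delete static_cast<std::shared_mutex *>(tb->mutex);
    tb->mutex = nullptr;
}
```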

Member commented

Well... unfortunately, until ufunc_object.c is converted and all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.

(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)

@ngoldbaum added the '39 - free-threading' label (PRs and issues related to support for free-threading CPython, a.k.a. no-GIL, PEP 703) on Dec 3, 2024
@ngoldbaum added this to the 2.2.0 release milestone on Dec 3, 2024
@ngoldbaum (Member, Author) commented on Dec 3, 2024

> If anyone happens to know what's causing that let me know

I think the Windows linker errors I ran into earlier were caused by missing extern "C" blocks in some headers; it looks like that's fixed now.
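For anyone hitting the same thing: when a header that declares globals is included from a .cpp file without an extern "C" guard, the C++ side looks for C++-mangled names while the symbols were compiled with C linkage, which produces exactly the LNK2001 errors shown above. A sketch of the usual guard; the symbol name here is hypothetical.

```cpp
/* Header shared between C and C++ translation units. Without the guard, a
 * .cpp file that includes this header looks up a C++-mangled symbol and the
 * linker reports "unresolved external symbol". */
#ifdef __cplusplus
extern "C" {
#endif

/* hypothetical global defined in a C translation unit */
extern int npy_example_global;

#ifdef __cplusplus
}  /* extern "C" */
#endif
```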

@ngoldbaum added the '56 - Needs Release Note' label (needs an entry in doc/release/upcoming_changes) on Dec 3, 2024
/* Returns a borrowed ref of the second value in the matching info tuple */
PyObject *
get_info_no_cast(PyUFuncObject *ufunc, PyArray_DTypeMeta *op_dtype,
                 int ndtypes);
@ngoldbaum (Member, Author) commented

It's not totally clear to me whether leaving this function out of dispatching.h and forward-declaring it here was originally an intentional decision for some reason.

@charris added the '09 - Backport-Candidate' label (PRs tagged should be backported) on Dec 3, 2024
@ngoldbaum (Member, Author) commented

@charris do you think this is OK to backport to 2.1? IMO moving stuff to C++ like this feels risky for a bugfix release.

@charris (Member) commented on Dec 3, 2024

> do you think this is OK to backport to 2.1?

Python multithreading is still in its early stages, so 2.2.0 is probably soon enough for those wanting to experiment with that feature, and it will come out at about the same time as 2.1.4. I would prefer to limit the 2.1 backports to clear bug fixes.

@ngoldbaum (Member, Author) commented

Ah, I see you added the backport-candidate label because it's a backport to 2.2. I was confused about the target branch.

@mhvk (Contributor) left a comment

I was curious and had a look, thinking it might be too much, but it is nice to see this actually is a small change! My review may not mean too much, but it seems all OK, and the only thing I have is a suggested name change...

@mhvk (Contributor) left a comment

The _with_locking is a much better name, thanks!

@seberg (Member) left a comment

LGTM, modulo the exception thing. I do wonder slightly whether this allows re-entry from within the same thread?

Not sure that it matters, but I think it may be nice to have a recursive mutex here. It doesn't seem implausible for a promotion function to call the promoter again to figure things out.
(But, it should also be rare enough that I don't care if we defer that to when someone runs into it.)


@@ -163,7 +167,9 @@ check_and_adjust_axis(int *axis, int ndim)
* <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52023>.
* clang versions < 8.0.0 have the same bug.
*/
#if (!defined __STDC_VERSION__ || __STDC_VERSION__ < 201112 \
#ifdef __cplusplus
#define NPY_ALIGNOF(type) alignof(type)
Member commented

Wondering if we shouldn't just use alignof, but let's go with this.
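For reference, a minimal sketch of the conditional definition being discussed; the real header also carries a pre-C11 / buggy-compiler fallback (per the GCC and clang bug mentioned in the context above), which is elided here.

```cpp
/* C++ has alignof as a keyword; C11 spells it _Alignof. */
#ifdef __cplusplus
    #define NPY_ALIGNOF(type) alignof(type)
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    #define NPY_ALIGNOF(type) _Alignof(type)
#endif
```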

@@ -115,6 +122,9 @@ PyArrayIdentityHash_Dealloc(PyArrayIdentityHash *tb)
{
    PyMem_Free(tb->buckets);
    PyMem_Free(tb);
#ifdef Py_GIL_DISABLED
    delete (std::shared_mutex *)tb->mutex;
Member commented

Seems OK to assume that delete will never throw an exception.

"index %"NPY_INTP_FMT" is out of bounds "
"for size %"NPY_INTP_FMT, *index, max_item);
"index %" NPY_INTP_FMT " is out of bounds "
"for size %" NPY_INTP_FMT, *index, max_item);
Member commented

We should just replace all NPY_INTP_FMT with %zd one of these days...
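What the %zd version would look like, assuming the surrounding call is PyErr_Format (npy_intp has the same width as Py_ssize_t on all supported platforms, so the casts are no-ops):

```cpp
PyErr_Format(PyExc_IndexError,
             "index %zd is out of bounds "
             "for size %zd",
             (Py_ssize_t)*index, (Py_ssize_t)max_item);
```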

@ngoldbaum (Member, Author) commented

> Not sure that it matters, but I think it may be nice to have a recursive mutex here.

Unfortunately then we run into the issue of finding a portable implementation. My brief research indicates that there isn't a standardized shared recursive mutex because of the performance trade-offs, although the POSIX shared read-write lock is recursive.

If someone does try to use this reentrantly, it will deadlock. Do you think that's problematic in practice?

Thanks for the pointer about exceptions. I found std::nothrow which seems to be doing what you wanted with a little less boilerplate - is that OK? Do I need to worry about exceptions elsewhere in NumPy?
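A minimal sketch of the nothrow pattern being referred to: new (std::nothrow) returns nullptr on allocation failure instead of throwing std::bad_alloc, so the failure can be turned into a normal PyErr_NoMemory() at the call site and no try/catch is needed at the C/C++ boundary. The function name here is illustrative.

```cpp
#include <new>           // std::nothrow
#include <shared_mutex>  // std::shared_mutex

static void *
allocate_shared_mutex_sketch(void)
{
    // Returns nullptr instead of throwing on out-of-memory.
    std::shared_mutex *m = new (std::nothrow) std::shared_mutex();
    // The caller reports nullptr via PyErr_NoMemory() and cleans up.
    return static_cast<void *>(m);
}
```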

@ngoldbaum removed the '56 - Needs Release Note' label on Dec 4, 2024
@seberg (Member) commented on Dec 4, 2024

> Do you think that's problematic in practice?

Not right now. I could imagine it being meaningful, for example if you had a parametric dtype like categorical[dtype]. Such a dtype could wish to call ufunc.resolve_dtypes to figure out the right result categorical[dtype].

> which seems to be doing what you wanted

Yeah, that seems good, nice. (No idea if it would catch non-memory errors, but I don't think those are possible.)

@seberg (Member) left a comment

Thanks @ngoldbaum, looks like this is ready.

@seberg merged commit 7124091 into numpy:main on Dec 5, 2024
69 checks passed
charris pushed a commit to charris/numpy that referenced this pull request Dec 5, 2024
* PERF: add a fast path to ufunc type resolution

* MAINT: move dispatching.c to C++

* MAINT: move npy_hashtable to C++ and use std::shared_mutex

* MAINT: fix windows linking

* MAINT: remove outdated comment

* MAINT: only call try_promote on free-threaded build

Converts dispatching to cpp in order to use `std::shared_mutex` to improve free-threaded scaling.

* MAINT: try to give new function a name indicating it uses a mutex

* MAINT: only do complicated casting to get a mutex pointer once

* MAINT: use std::nothrow to avoid dealing with exceptions

* DOC: add changelog
@charris removed the '09 - Backport-Candidate' label on Dec 5, 2024
charris added a commit to charris/numpy that referenced this pull request Dec 5, 2024
PR numpy#27896 was backported to NumPy 2.2.

[skip ci]
        return NULL;
    }
    return info;
}

#ifdef Py_GIL_DISABLED
Member commented

Sorry for the drive-by comment, but Py_GIL_DISABLED means you're running on a free-threading-enabled Python, not necessarily that the GIL is concretely disabled, right?

If you're calling mutex->lock or mutex->lock_shared with the GIL held, then there is a risk of deadlock if, for example, promote_and_get_info_and_ufuncimpl released the GIL in another thread and is waiting to take the GIL back.

Member commented

So you may want to wrap calls to mutex->lock[_shared] in a Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS pair.
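A minimal sketch of that suggestion (the helper name is illustrative): on a GIL-enabled build, the GIL is dropped while the thread blocks on the mutex, so a thread holding the GIL cannot deadlock against a thread that holds the mutex and is waiting to reacquire the GIL.

```cpp
#include <Python.h>
#include <shared_mutex>

// Illustrative helper: release the GIL around a potentially blocking
// lock acquisition, then reacquire it once the lock is held.
static void
lock_shared_releasing_gil(std::shared_mutex *mutex)
{
    Py_BEGIN_ALLOW_THREADS
    mutex->lock_shared();
    Py_END_ALLOW_THREADS
}
```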

Labels
03 - Maintenance · 39 - free-threading (PRs and issues related to support for free-threading CPython, a.k.a. no-GIL, PEP 703)
Development

Successfully merging this pull request may close these issues.

BUG: poor multithreaded performance scaling for small arrays (nogil)
5 participants