MAINT: use an atomic load/store and a mutex to initialize the argparse and runtime import caches #26780
Conversation
Hmm, it looks like whichever version of gcc is available for the manylinux2014 32-bit build doesn't have C11 atomics support. I have no idea if I should go out of my way to support that, although I think I can add some code from CPython's atomic header.
There is a later manylinux than manylinux2014, but it doesn't support 32 bits at all. We don't release 32-bit Linux wheels; I wonder if anyone does?
#include "numpy/npy_common.h" | ||
|
||
#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_ATOMICS__) | ||
// TODO: support C++ atomics as well if this header is ever needed in C++ |
The code to do this is in the CPython header this is cribbed from. It would be dead code if I included it.
I'm not worried about 32-bit manylinux2014 specifically. However, C11 atomics are optional, right? So do we have a sense of what other platforms/compilers not tested in our CI don't support them?
That's right.
gcc 4.9 was the first version with support; 4.9.0 came out in April 2014. llvm/clang fully supported atomics as of version 4.0, released March 2017. Intel fully supported atomics with version 18 of icc, released November 2017. Any other compilers worth considering?
It looks like CPython doesn't account for any compilers besides clang, gcc, and MSVC. It supports older clang and gcc versions using the gcc builtin atomics that were available before gcc 4.9.
Emscripten (which we have in CI, so should be fine) and the newer Intel compilers should also be fine. However, the 32-bit manylinux2014 build is the remaining question.
That uses GCC 10: https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux2014-centos-7-based-glibc-217. And from the CI log, the problem is probably not linking libatomic.
I re-added the libatomic test in the meson configuration and also added some uint8 operations to the linking test, since @colesbury pointed out that GCC on RISC-V has a similar issue where 8-bit atomic operations are only available when linking with libatomic but 64-bit ops are always available. I don't have a RISC-V machine to test with, but I don't think it hurts to add the check given that we know it's an issue, and we do specifically need 8-bit atomic operations right now. Unfortunately that wasn't sufficient, so I added some preprocessor logic to detect the gcc/clang builtin atomics as well if stdc atomics aren't detected. That seems to be enough to get numpy to build on the manylinux2014 image.
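For reference, a minimal sketch of the kind of dispatch this ends up doing, assuming a hypothetical `npy_atomic_load_uint8` helper (the real `npy_atomic.h` may differ in detail):

```c
#include <stdint.h>

#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_ATOMICS__)
/* C11 atomics detected */
#include <stdatomic.h>

static inline uint8_t
npy_atomic_load_uint8(uint8_t *obj)
{
    return atomic_load((_Atomic(uint8_t) *)obj);
}
#elif defined(__GNUC__) || defined(__clang__)
/* no stdatomic.h: fall back to the gcc/clang __atomic builtins */
static inline uint8_t
npy_atomic_load_uint8(uint8_t *obj)
{
    return __atomic_load_n(obj, __ATOMIC_SEQ_CST);
}
#elif defined(_MSC_VER)
#include <intrin.h>

static inline uint8_t
npy_atomic_load_uint8(uint8_t *obj)
{
    /* there is no plain 8-bit load intrinsic; OR with 0 acts as a load */
    return (uint8_t)_InterlockedOr8((volatile char *)obj, 0);
}
#else
#error "no supported atomics implementation found"
#endif
```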
@ngoldbaum did you see that PyMutex was made public in CPython 3.13?
Sounds good! Re the second commit, can we avoid
I saw that! It does make some things easier (no need to initialize the lock at runtime) and I'm planning to switch over to it wholesale and require 3.13b3 or newer within the next week or two. It looks like deadsnakes hasn't added 3.13b3 yet, so I need to wait for that at least. For this PR I think we'd need a public version of
The build changes LGTM. For reference: mesonbuild/meson#11445 adds a built-in
The non-atomic loads and stores are not thread-safe. You need to use atomic loads/stores to ensure the correct ordering. Otherwise, things like `cache->initialized = 1` may be visible before the other important parts of the cache.
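A sketch of the hazardous pattern being flagged, with plain (non-atomic) loads and stores; the struct and function names here are illustrative, not the actual NumPy code:

```c
#include <Python.h>
#include <pythread.h>

typedef struct {
    int initialized;
    PyObject *data;   /* the "other important parts" of the cache */
} cache_t;

static PyObject *
get_cached(cache_t *cache, PyThread_type_lock lock)
{
    if (!cache->initialized) {              /* plain load */
        PyThread_acquire_lock(lock, WAIT_LOCK);
        if (!cache->initialized) {
            cache->data = PyDict_New();     /* plain store...           */
            cache->initialized = 1;         /* ...but this store may    */
        }                                   /* become visible first     */
        PyThread_release_lock(lock);
    }
    /* a fast-path reader can see initialized == 1 yet read stale data */
    return cache->data;
}
```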
@colesbury can you explain more why it's not thread-safe with the lock? I thought I only needed one atomic load to serialize acquiring the lock.
The purpose of the atomic loads and stores is to serialize a thread going through the fast path (no lock acquisition) with a thread that may concurrently be finishing up the initialization. The store to the `initialized` flag needs to be atomic so that it can't be reordered before the stores that fill in the rest of the cache. That's a problem because the reader might see `initialized == 1` while the rest of the cache is still unwritten from its point of view. (Or, on certain CPUs, the stores to the other cache fields may become visible only after the store to `initialized`.)

Similarly, the load (outside the lock) needs to be atomic so that it's not reordered with subsequent operations. For example, the code essentially does: check `initialized`, then read the cached values. If the reads of the cached values are hoisted before the read of `initialized`, the fast-path thread can observe a partially initialized cache.

So the load and store of `initialized` both need to be atomic.

See also: https://en.wikipedia.org/wiki/Double-checked_locking#cite_ref-8
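A minimal sketch of the safe pattern described above, assuming hypothetical `npy_atomic_load_uint8`/`npy_atomic_store_uint8` helpers and the same illustrative cache struct as before (not the actual NumPy code):

```c
#include <Python.h>
#include <pythread.h>
#include <stdint.h>

typedef struct {
    uint8_t initialized;   /* only touched through the atomic helpers */
    PyObject *data;        /* the "important parts" of the cache */
} cache_t;

static PyObject *
get_cached(cache_t *cache, PyThread_type_lock lock)
{
    /* atomic load: the read of cache->data below can't be hoisted above it */
    if (!npy_atomic_load_uint8(&cache->initialized)) {
        PyThread_acquire_lock(lock, WAIT_LOCK);
        if (!npy_atomic_load_uint8(&cache->initialized)) {
            cache->data = PyDict_New();   /* fill the cache first... */
            /* ...then publish: this atomic store can't be reordered
               before the store to cache->data */
            npy_atomic_store_uint8(&cache->initialized, 1);
        }
        PyThread_release_lock(lock);
    }
    return cache->data;
}
```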
OK, the last commit reworks the import cache to use atomic loads and stores on the pointer directly. I also added the necessary atomic loads and stores for the argparse cache, which I decided was easier to deal with using an integer flag.
Thanks, LGTM. One small issue to fix and a small cleanup nit that would be nice.
```c
}
PyThread_release_lock(npy_runtime_imports.import_mutex);
```
Move the release before the error check, or we will deadlock on error!
Good catch! Also, the `obj == NULL` check doesn't make sense -- it would need to be `*obj == NULL` to check the result.

But I think it would be better not to hold the lock around the `npy_import` call:

- Imports in general can run arbitrary code, which may reentrantly trigger `npy_cache_import_runtime`.
- Imports (i.e., `PyImport_ImportModule`) are already thread-safe internally.
So I'd suggest writing this as:
```c
if (!npy_atomic_load_ptr(obj)) {
    PyObject *value = npy_import(module, attr);
    if (value == NULL) {
        return -1;
    }
    PyThread_acquire_lock(npy_runtime_imports.import_mutex, WAIT_LOCK);
    if (!npy_atomic_load_ptr(obj)) {
        npy_atomic_store_ptr(obj, Py_NewRef(value));
    }
    PyThread_release_lock(npy_runtime_imports.import_mutex);
    Py_DECREF(value);
}
```
Or you can get rid of the lock if you implement compare-exchange:
```c
if (!npy_atomic_load_ptr(obj)) {
    PyObject *value = npy_import(module, attr);
    if (value == NULL) {
        return -1;
    }
    PyObject *expected = NULL;
    if (!npy_atomic_compare_exchange_ptr(obj, &expected, value)) {
        Py_DECREF(value);
    }
}
```
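For completeness, one way the `npy_atomic_compare_exchange_ptr` helper assumed above could look with C11 atomics (an MSVC build would need `_InterlockedCompareExchangePointer` instead); a sketch, not the PR's code:

```c
#include <stdatomic.h>
#include <stdbool.h>

static inline bool
npy_atomic_compare_exchange_ptr(void *obj, void **expected, void *desired)
{
    /* atomically: if *obj == *expected, store desired and return true;
       otherwise copy the current value into *expected and return false */
    return atomic_compare_exchange_strong((_Atomic(void *) *)obj,
                                          expected, desired);
}
```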
Thanks! I kept the lock since the next iteration will use `PyMutex` and I'd like to make sure this gets in for NumPy 2.1. Thankfully deadsnakes was updated over the weekend, so we can test against a version with a public `PyMutex` now.
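A rough sketch of what that next iteration could look like once 3.13b3 can be required; `PyMutex` needs no runtime initialization, and the atomic pointer helpers are the ones used in the suggestions above:

```c
#include <Python.h>  /* PyMutex, PyMutex_Lock, PyMutex_Unlock (3.13+) */

static PyMutex import_mutex;  /* zero-initialized == unlocked, no setup call */

static int
cache_runtime_import(PyObject **obj, PyObject *value)
{
    PyMutex_Lock(&import_mutex);
    if (!npy_atomic_load_ptr((void **)obj)) {
        npy_atomic_store_ptr((void **)obj, Py_NewRef(value));
    }
    PyMutex_Unlock(&import_mutex);
    return 0;
}
```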
The last review comments were addressed, and CI is happy too. It looks like it's time to give this a go. Thanks a lot @ngoldbaum, and @colesbury and @seberg for the reviews!
This is a re-do of #26430.
It adds locking to the argparse cache and the cached runtime imports, making both thread safe.
Both caches leverage a new `npy_atomic.h` header that wraps C11 atomics and MSVC intrinsics to implement a uint8 atomic load. I think this is sufficient to work on all supported platforms but we'll see what CI says.

First, we do a regular load to see if the cache is initialized. If it is, we read from the cache and go on our way. If it isn't, we do an atomic load to serialize access and then immediately acquire a lock. Whichever thread wins the race to acquire the lock fills in the cache and the other threads wait until the lock is released, acquire the lock, see the cache is filled, and continue. Since the lock is only used before the cache is filled in, a global lock is OK I think.
I could probably refactor both uses to share an abstract single-initialization API that takes a function pointer, but given that there are only two usages, I don't think the extra abstraction and making all the cache initialization functions share the same signature are worth the effort.
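For illustration, the abstract single-initialization API mentioned above might look something like the sketch below (hypothetical, not implemented in this PR); every cache would have to expose an `init` function with this signature:

```c
#include <Python.h>
#include <pythread.h>
#include <stdint.h>

typedef int (*npy_init_fn)(void *ctx);

static int
npy_call_once(uint8_t *initialized, PyThread_type_lock lock,
              npy_init_fn init, void *ctx)
{
    if (npy_atomic_load_uint8(initialized)) {
        return 0;  /* fast path: already initialized */
    }
    PyThread_acquire_lock(lock, WAIT_LOCK);
    int res = 0;
    if (!npy_atomic_load_uint8(initialized)) {
        res = init(ctx);
        if (res == 0) {
            /* publish only after a successful init */
            npy_atomic_store_uint8(initialized, 1);
        }
    }
    PyThread_release_lock(lock);
    return res;
}
```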