MAINT: use an atomic load/store and a mutex to initialize the argparse and runtime import caches #26780
Conversation
Hmm, it looks like whichever version of gcc is available for the manylinux2014 32-bit build doesn't have C11 atomics support. I have no idea if I should go out of my way to support that, although I think I can add some code from CPython's atomic header.
There is a later manylinux than manylinux2014, but it doesn't support 32 bits at all. We don't release 32-bit Linux wheels; I wonder if anyone does?
#include "numpy/npy_common.h" | ||
|
||
#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_ATOMICS__) | ||
// TODO: support C++ atomics as well if this header is ever needed in C++ |
The code to do this is in the CPython header this is cribbed from. It would be dead code if I included it.
I'm not worried about 32-bit manylinux2014 specifically. However, C11 atomics are optional, right? So do we have a sense of what other platforms/compilers not tested in our CI don't support them?
That's right.
gcc 4.9 was the first version with support; 4.9.0 came out in April 2014. llvm/clang fully supported atomics as of version 4.0, released March 2017. Intel fully supported atomics with version 18 of icc, released November 2017. Any other compilers worth considering?
It looks like CPython doesn't account for any compilers besides clang, gcc, and MSVC. It supports older clang and gcc versions using the gcc builtin atomics that were available before gcc 4.9.
Emscripten (which we have in CI, so should be fine) and the newer Intel compilers should also be fine. However, the 32-bit manylinux2014 build is the remaining question.
That uses GCC 10: https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux2014-centos-7-based-glibc-217. And from the CI log, the problem is probably not linking libatomic.
I re-added the libatomic test in the meson configuration and also added some uint8 operations to the linking test, since @colesbury pointed out that GCC on RISC-V has a similar issue where 8-bit atomic operations are only available when linking with libatomic but 64-bit ops are always available. I don't have a RISC-V machine to test with, but I don't think it hurts to add the check given that we know it's an issue, and we do specifically need 8-bit atomic operations right now. Unfortunately that wasn't sufficient, so I added some preprocessor logic to detect the gcc/clang builtin atomics as well if stdc atomics aren't detected. That seems to be enough to get numpy to build on the manylinux2014 image.
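For reference, a minimal sketch of the kind of dispatch this ends up doing, assuming a hypothetical `npy_atomic_load_uint8` helper (the real `npy_atomic.h` may differ in detail):

```c
#include <stdint.h>

#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_ATOMICS__)
/* C11 atomics detected */
#include <stdatomic.h>

static inline uint8_t
npy_atomic_load_uint8(uint8_t *obj)
{
    return atomic_load((_Atomic(uint8_t) *)obj);
}
#elif defined(__GNUC__) || defined(__clang__)
/* no stdatomic.h: fall back to the gcc/clang __atomic builtins */
static inline uint8_t
npy_atomic_load_uint8(uint8_t *obj)
{
    return __atomic_load_n(obj, __ATOMIC_SEQ_CST);
}
#elif defined(_MSC_VER)
#include <intrin.h>

static inline uint8_t
npy_atomic_load_uint8(uint8_t *obj)
{
    /* there is no plain 8-bit load intrinsic; OR with 0 acts as a load */
    return (uint8_t)_InterlockedOr8((volatile char *)obj, 0);
}
#else
#error "no supported atomics implementation found"
#endif
```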
@ngoldbaum did you see that PyMutex was made public in CPython 3.13?
Sounds good! Re the second commit, can we avoid
I saw that! It does make some things easier (no need to initialize the lock at runtime) and I'm planning to switch over to it wholesale and require 3.13b3 or newer within the next week or two. It looks like deadsnakes hasn't added 3.13b3 yet, so I need to wait for that at least. For this PR I think we'd need a public version of
The build changes LGTM. For reference: mesonbuild/meson#11445 adds a built-in
The non-atomic loads and stores are not thread-safe. You need to use atomic loads/stores to ensure the correct ordering. Otherwise, things like `cache->initialized = 1` may be visible before the other important parts of the cache.
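A sketch of the hazardous pattern being flagged, with plain (non-atomic) loads and stores; the struct and function names here are illustrative, not the actual NumPy code:

```c
#include <Python.h>
#include <pythread.h>

typedef struct {
    int initialized;
    PyObject *data;   /* the "other important parts" of the cache */
} cache_t;

static PyObject *
get_cached(cache_t *cache, PyThread_type_lock lock)
{
    if (!cache->initialized) {              /* plain load */
        PyThread_acquire_lock(lock, WAIT_LOCK);
        if (!cache->initialized) {
            cache->data = PyDict_New();     /* plain store...           */
            cache->initialized = 1;         /* ...but this store may    */
        }                                   /* become visible first     */
        PyThread_release_lock(lock);
    }
    /* a fast-path reader can see initialized == 1 yet read stale data */
    return cache->data;
}
```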
@colesbury can you explain more why it's not thread-safe with the lock? I thought I only needed one atomic load to serialize acquiring the lock.
The purpose of the atomic loads and stores is to serialize a thread going through the fast path (no lock acquisition) with a thread that may concurrently be finishing up the initialization. The store to the `initialized` flag needs to be atomic so that it can't be reordered before the stores that fill in the rest of the cache. That's a problem because the reader might see `initialized == 1` while the rest of the cache is still unwritten from its point of view. (Or, on certain CPUs, the stores to the other cache fields may become visible only after the store to `initialized`.)

Similarly, the load (outside the lock) needs to be atomic so that it's not reordered with subsequent operations. For example, the code essentially does: check `initialized`, then read the cached values. If the reads of the cached values are hoisted before the read of `initialized`, the fast-path thread can observe a partially initialized cache.

So the load and store of `initialized` both need to be atomic.

See also: https://en.wikipedia.org/wiki/Double-checked_locking#cite_ref-8
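A minimal sketch of the safe pattern described above, assuming hypothetical `npy_atomic_load_uint8`/`npy_atomic_store_uint8` helpers and the same illustrative cache struct as before (not the actual NumPy code):

```c
#include <Python.h>
#include <pythread.h>
#include <stdint.h>

typedef struct {
    uint8_t initialized;   /* only touched through the atomic helpers */
    PyObject *data;        /* the "important parts" of the cache */
} cache_t;

static PyObject *
get_cached(cache_t *cache, PyThread_type_lock lock)
{
    /* atomic load: the read of cache->data below can't be hoisted above it */
    if (!npy_atomic_load_uint8(&cache->initialized)) {
        PyThread_acquire_lock(lock, WAIT_LOCK);
        if (!npy_atomic_load_uint8(&cache->initialized)) {
            cache->data = PyDict_New();   /* fill the cache first... */
            /* ...then publish: this atomic store can't be reordered
               before the store to cache->data */
            npy_atomic_store_uint8(&cache->initialized, 1);
        }
        PyThread_release_lock(lock);
    }
    return cache->data;
}
```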
OK, the last commit reworks the import cache to use atomic loads and stores on the pointer directly. I also added the necessary atomic loads and stores for the argparse cache, which I decided was easier to deal with using an integer flag.
Thanks, LGTM. One small issue to fix and a small cleanup nit that would be nice.
```c
}
PyThread_release_lock(npy_runtime_imports.import_mutex);
```
Move the release before the error check, or we will deadlock on error!
Good catch! Also, the `obj == NULL` check doesn't make sense -- it would need to be `*obj == NULL` to check the result.

But I think it would be better not to hold the lock around the `npy_import` call:

- Imports in general can run arbitrary code, which may reentrantly trigger `npy_cache_import_runtime`.
- Imports (i.e., `PyImport_ImportModule`) are already thread-safe internally.
So I'd suggest writing this as:
```c
if (!npy_atomic_load_ptr(obj)) {
    PyObject *value = npy_import(module, attr);
    if (value == NULL) {
        return -1;
    }
    PyThread_acquire_lock(npy_runtime_imports.import_mutex, WAIT_LOCK);
    if (!npy_atomic_load_ptr(obj)) {
        npy_atomic_store_ptr(obj, Py_NewRef(value));
    }
    PyThread_release_lock(npy_runtime_imports.import_mutex);
    Py_DECREF(value);
}
```
Or you can get rid of the lock if you implement compare-exchange:
```c
if (!npy_atomic_load_ptr(obj)) {
    PyObject *value = npy_import(module, attr);
    if (value == NULL) {
        return -1;
    }
    PyObject *expected = NULL;
    if (!npy_atomic_compare_exchange_ptr(obj, &expected, value)) {
        Py_DECREF(value);
    }
}
```
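For completeness, one way the `npy_atomic_compare_exchange_ptr` helper assumed above could look with C11 atomics (an MSVC build would need `_InterlockedCompareExchangePointer` instead); a sketch, not the PR's code:

```c
#include <stdatomic.h>
#include <stdbool.h>

static inline bool
npy_atomic_compare_exchange_ptr(void *obj, void **expected, void *desired)
{
    /* atomically: if *obj == *expected, store desired and return true;
       otherwise copy the current value into *expected and return false */
    return atomic_compare_exchange_strong((_Atomic(void *) *)obj,
                                          expected, desired);
}
```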
Thanks! I kept the lock since the next iteration will use `PyMutex` and I'd like to make sure this gets in for NumPy 2.1. Thankfully deadsnakes was updated over the weekend, so we can test against a version with a public `PyMutex` now.
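A rough sketch of what that next iteration could look like once 3.13b3 can be required; `PyMutex` needs no runtime initialization, and the atomic pointer helpers are the ones used in the suggestions above:

```c
#include <Python.h>  /* PyMutex, PyMutex_Lock, PyMutex_Unlock (3.13+) */

static PyMutex import_mutex;  /* zero-initialized == unlocked, no setup call */

static int
cache_runtime_import(PyObject **obj, PyObject *value)
{
    PyMutex_Lock(&import_mutex);
    if (!npy_atomic_load_ptr((void **)obj)) {
        npy_atomic_store_ptr((void **)obj, Py_NewRef(value));
    }
    PyMutex_Unlock(&import_mutex);
    return 0;
}
```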
The last review comments were addressed, and CI is happy too. It looks like it's time to give this a go. Thanks a lot @ngoldbaum, and @colesbury and @seberg for the reviews!
This is a re-do of #26430.
It adds locking to the argparse cache and the cached runtime imports, making both thread safe.
Both caches leverage a new `npy_atomic.h` header that wraps C11 atomics and MSVC intrinsics to implement a uint8 atomic load. I think this is sufficient to work on all supported platforms but we'll see what CI says.

First, we do a regular load to see if the cache is initialized. If it is, we read from the cache and go on our way. If it isn't, we do an atomic load to serialize access and then immediately acquire a lock. Whichever thread wins the race to acquire the lock fills in the cache and the other threads wait until the lock is released, acquire the lock, see the cache is filled, and continue. Since the lock is only used before the cache is filled in, a global lock is OK I think.
I could probably refactor both uses to share an abstract single-initialization API that takes a function pointer, but given that there are only two usages, I don't think the extra abstraction and making all the cache initialization functions share the same signature are worth the effort.
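For illustration, the abstract single-initialization API mentioned above might look something like the sketch below (hypothetical, not implemented in this PR); every cache would have to expose an `init` function with this signature:

```c
#include <Python.h>
#include <pythread.h>
#include <stdint.h>

typedef int (*npy_init_fn)(void *ctx);

static int
npy_call_once(uint8_t *initialized, PyThread_type_lock lock,
              npy_init_fn init, void *ctx)
{
    if (npy_atomic_load_uint8(initialized)) {
        return 0;  /* fast path: already initialized */
    }
    PyThread_acquire_lock(lock, WAIT_LOCK);
    int res = 0;
    if (!npy_atomic_load_uint8(initialized)) {
        res = init(ctx);
        if (res == 0) {
            /* publish only after a successful init */
            npy_atomic_store_uint8(initialized, 1);
        }
    }
    PyThread_release_lock(lock);
    return res;
}
```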