8000 MAINT: use an atomic load/store and a mutex to initialize the argparse and runtime import caches by ngoldbaum · Pull Request #26780 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

MAINT: use an atomic load/store and a mutex to initialize the argparse and runtime import caches #26780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 12, 2024

Conversation

ngoldbaum
Copy link
Member

This is a re-do of #26430.

It adds locking to the argparse cache and the cached runtime imports, making both thread safe.

Both caches leverage a new npy_atomic.h header that wraps C11 atomics and MSVC intrinsics to implement a uint8 atomic load. I think this is sufficient to work on all supported platforms but we'll see what CI says.

First, we do a regular load to see if the cache is initialized. If it is, we read from the cache and go on our way. If it isn't, we do an atomic load to serialize access and then immediately acquire a lock. Whichever thread wins the race to acquire the lock fills in the cache and the other threads wait until the lock is released, acquire the lock, see the cache is filled, and continue. Since the lock is only used before the cache is filled in, a global lock is OK I think.

I could probably refactor both uses to make use of an abstract single-initialization API that takes a function pointer, but given that there are only two usages I don't think the extra abstraction and needing to make all the cache initialization functions have the same function signature is worth the effort.

@ngoldbaum ngoldbaum added the 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) label Jun 21, 2024
@ngoldbaum ngoldbaum changed the title MNT: use an atomic load and mutex to initialize the argparse and runtime import caches MAINT: use an atomic load and mutex to initialize the argparse and runtime import caches Jun 21, 2024
@ngoldbaum
Copy link
Member Author

Hmm, it looks like whichever version of gcc is available for the manylinux2014 32bit build doesn't have C11 atomics support. I have no idea if I should go out of my way to support that, although I think I can add some code from CPython's pyatomic_gcc.h to cover this case.

@charris
Copy link
Member
charris commented Jun 21, 2024

There is a later manylinux than manylinux2014, but is doesn't support 32 bits at all. We don't release 32 bit linux wheels, I wonder if anyone does?

#include "numpy/npy_common.h"

#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_ATOMICS__)
// TODO: support C++ atomics as well if this header is ever needed in C++
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to d 8000 escribe this comment to others. Learn more.

The code to do this is in the CPython header this is cribbed from. It would be dead code if I included it.

@rgommers
Copy link
Member

I'm not worried about 32-bit manylinux2014 specifically, however C11 atomics are optional right? So do we have a sense what other platforms/compilers not tested in our CI don't support it?

@ngoldbaum
Copy link
Member Author

however C11 atomics are optional right?

That's right.

So do we have a sense what other platforms/compilers not tested in our CI don't support it?

gcc 4.9 was the first version to have support, 4.9.0 came out April 2014.

llvm/clang 4.0 fully supported atomics, that came out March 2017.

Intel fully supported atomics with version 18 of icc, released Nov 2017.

Any other compilers worth considering?

@ngoldbaum
Copy link
Member Author

It looks like CPython doesn't account for any compilers besides clang, gcc, and MSVC. It supports older clang and gcc versions using the gcc builtin atomics support provided prior to gcc 4.9.

@rgommers
Copy link
Member

Any other compilers worth considering?

Emscripten (which we have in CI, so should be fine), the newer Intel compilers (icx rather than icc; LLVM-based, should be fine), ARM compilers, IBM XL compilers, commercial compilers (NAG, Fujitsu, etc.). Most of those are infrequently used, especially in combination with numpy dev versions, but an issue may show up at some point. As long as there's enough context here to not make it super confusing once that happens, it should be okay probably.

However, the 32-bit manylinux2014 failure is a bit like a canary - if that already doesn't work with GCC 10, it's almost certainly going to not work elsewhere too.

Hmm, it looks like whichever version of gcc is available for the manylinux2014 32bit build doesn't have C11 atomics support. I have no idea if I should go out of my way to support that, although I think I can add some code from CPython's pyatomic_gcc.h to cover this case.

That uses GCC 10: https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux2014-centos-7-based-glibc-217. And from the CI log:

C compiler for the host machine: cc (gcc 10.2.1 "cc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)")

The problem is probably not linking libatomic, rather than GCC not supporting the API. We discussed that in gh-26430 and that PR has code to support linking libatomic, but you lost that in this PR. I think if you put it back, the problem will go away.

@ngoldbaum
Copy link
Member Author

I re-added the libatomic test in the meson configuration and also added some uint8 operations to the linking test as well since @colesbury pointed out that GCC under RISC-V has a similar issue with 8 bit atomic operations are only available when linking with libatomic but 64 bit ops are always available. I don't have a RISC-V machine to test with, but I don't think it hurts to add the check given that we know it's an issue, and we do specifically need 8 bit atomic operations right now.

Unfortunately it looks like that wasn't sufficient, so I added some preprocessor logic to detect the gcc/clang builtin atomics as well if stdc atomics aren't detected. That seems to be enough to get numpy to build on the manylinux2014 image.

@colesbury
Copy link

@ngoldbaum did you see that PyMutex is now exposed publicly? I'm not sure if that makes any of this easier.

@rgommers
Copy link
Member

Sounds good! Re the second commit, can we avoid __GNUC_MINOR__? xref scipy/scipy#20479 for why. You can safely set the minimum major version to 5, because we require a way newer version than that anyway.

@ngoldbaum
Copy link
Member Author

did you see that PyMutex is now exposed publicly? I'm not sure if that makes any of this easier.

I saw that! It does make some things easier (no need to initialize the lock at runtime) and I'm planning to switch over to it wholesale and require 3.13b3 or newer within the next week or two. It looks like deadsnakes hasn't yet added 3.13b3 so I need to wait for that at least.

For this PR I think we'd need a public version of _PyOnceFlag to avoid all the platform-specific stuff.

@rgommers
Copy link
Member

The build changes LGTM. For reference: mesonbuild/meson#11445 adds a built-in atomic dependency to Meson, and that PR originally had a similar approach as in this PR - now changed to unconditionally link libatomic for robustness (since it's assumed to never be harmful, and the detection code is a bit fragile). I think we can go with this PR in its current state; we may want to switch to dependency('atomic') once it's available in the next Meson version.

Copy link
@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-atomic loads and stores are not thread-safe. You need to use atomic loads/stores to ensure the correct ordering.

Otherwise, things like cache->initialized=1 may be visible before the other important parts of the cache.

@ngoldbaum
Copy link
Member Author

@colesbury can you explain more why it's not thread safe with the lock? I thought I only needed one atomic load to serialize acquiring the lock.

@colesbury
Copy link

The purpose of the atomic loads and stores are to serialize a thread going through the fast-path (no lock acquisition) with a thread that may concurrently be finishing up the initialization.

The cache->initialized = 1; needs to be atomic to ensure that it's not reordered with the assignment to cache->obj (or similar). For example, if it's not atomic, the compiler could generate code like:

PyObject *tmp = npy_import(module, attr);
cache->initialized = 1;
cache->obj = tmp;

That's a problem because the reader might see cache->initialized == 1, but still see a NULL obj.

(Or, on certain CPUs, the stores to cache->initialized and cache->obj could equivalently get re-ordered).

Similarly, the load (outside the lock) needs to be atomic so that it's not reordered with subsequent operations. For example, the code essentially does:

if (cache->initialized) {
  ...
  PyObject *xyz = cache->obj;
  ...
}

If cache->initialized is not atomic, then some CPUs may execute the loads out of order, like:

PyObject *tmp = cache->obj; // speculative load
if (cache->initialized) {
    PyObject *xyz = tmp;
}

So the load and store of cache->initialized needs to have at least "acquire" and "release" semantics, but it's simpler to use the strong "sequential consistency" semantics, which are the default in C11/C++11 if you don't specify a memory ordering.

See also: https://en.wikipedia.org/wiki/Double-checked_locking#cite_ref-8

@ngoldbaum
Copy link
Member Author

OK, the last commit reworks the import cache to use atomic loads and stores on the pointer directly. I also added the necessary atomic loads and stores for the argparse cache, which I decided was easier to deal with using an integer flag.

@ngoldbaum ngoldbaum changed the title MAINT: use an atomic load and mutex to initialize the argparse and runtime import caches MAINT: use an atomic load/store and a mutex to initialize the argparse and runtime import caches Jul 4, 2024
Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM. One small issue to fix and a small cleanup nit that would be nice.

}
PyThread_release_lock(npy_runtime_imports.import_mutex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the release before the error check, or we will dead-lock on error!

Copy link
@colesbury colesbury Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Also, the obj == NULL check doesn't make sense -- it would need to be *obj == NULL to check the result.

But I think it would be better not to hold the lock around the npy_import:

  • Imports in general can run arbitrary code, which may reentrantly trigger npy_cache_import_runtime.
  • Imports (i.e., PyImport_ImportModule) are already thread-safe internally

So I'd suggest writing this as:

if (!npy_atomic_load_ptr(obj)) {
    PyObject *value = npy_import(module, attr);
    if (value == NULL) {
        return -1;
    }
    PyThread_acquire_lock(npy_runtime_imports.import_mutex, WAIT_LOCK);
    if (!npy_atomic_load_ptr(obj)) {
        npy_atomic_store_ptr(obj, Py_NewRef(value));
    }
    PyThread_release_lock(npy_runtime_imports.import_mutex);
    Py_DECREF(value);
}

Or you can get rid of the lock if you implement compare-exchange:

if (!npy_atomic_load_ptr(obj)) {
    PyObject *value = npy_import(module, attr);
    if (value == NULL) {
        return -1;
    }
    PyObject *exepected = NULL;
    if (!npy_atomic_compare_exchange_ptr(obj, &expected, value)) {
        Py_DECREF(value);
    }
}

Copy link
Member Author
@ngoldbaum ngoldbaum Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I kept the lock since the next iteration will use PyMutex and I'd like to make sure this gets in for NumPy 2.1. Thankfully deadsnakes was updated over the weekend so we can test against a version with a public PyMutex now.

< 9E88 circle cx="8" cy="8" r="7" stroke="currentColor" stroke-opacity="0.25" stroke-width="2" vector-effect="non-scaling-stroke" fill="none" />
Copy link
Member
@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last review comments were addressed, and CI is happy too. It looks like it's time to give this a go. Thanks a lot @ngoldbaum, and @colesbury and @seberg for the reviews!

@rgommers rgommers merged commit 7896d73 into numpy:main Jul 12, 2024
67 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
03 - Maintenance 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0