8000 TSAN failures seen running PyO3 tests with the free-threaded build · Issue #130421 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

TSAN failures seen running PyO3 tests with the free-threaded build #130421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ngoldbaum opened this issue Feb 21, 2025 · 4 comments
Open

TSAN failures seen running PyO3 tests with the free-threaded build #130421

ngoldbaum opened this issue Feb 21, 2025 · 4 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading type-bug An unexpected behavior, bug, or error

Comments

@ngoldbaum
Copy link
Contributor
ngoldbaum commented Feb 21, 2025

I'm seeing TSAN warnings running the PyO3 tests using CPython commit 38642bf

I've done this on an M3 Macbook Pro running MacOS Sequoia as well as @nascheme's cpython_sanity docker image which has LLVM 20 installed (as well as Python3.13 with TSAN and some packages, but I didn't use that). I think it's only possible to run TSAN on both the rust code and CPython using LLVM 20 and I can't easily install that on my Mac right now since it's not yet packaged on homebrew.

See this comment in the PyO3 repo if you want to use the docker image, there are some small tweaks you need to do before it will work correctly.

On an ARM Mac, I installed llvm from homebrew and then did CONFIGURE_OPTS="--with-thread-sanitizer" pyenv install 3.14t-dev to get a TSAN CPython build. You'll also need to install a rust toolchain.

You'll also need a copy of PyO3 checked out to this branch.

Because homebrew doesn't have LLVM 19, I had to resort to just running the cargo tests as normal using a CPython with TSAN. I think this should still detect races happening inside CPython.

pyenv global 3.14t-dev
pip install nox
cd pyo3
nox -s test

Here is the full output from one invocation on my Mac: https://gist.github.com/ngoldbaum/e198d87149617ecdaf881f29a03b8126

Here are a sampling of the warning summaries:

data race pytime.c:1163 in py_get_monotonic_clock
data race typeobject.c:2235 in _PyType_AllocNoTrack
data race weakrefobject.c:413 in get_or_create_weakref
data race pytime.c:1180 in py_get_monotonic_clock
data race pytime.c:1162 in py_get_monotonic_clock
data race object.c:343 in _Py_IncRef

You can ignore all the test_compiler_error messages - nightly rust always has compiler error message failures.

Another way to trigger these failures is with cargo stress, which runs the tests in a loop to try to trigger safety issues like this. I have a hacked together version of cargo stress on this branch that makes it so that instead of crashing if a thread writes to stderr, it prints the stderr to the terminal and continues. If you run TSAN with TSAN_OPTIONS=exit_code=0, my version of cargo stress will happily continue running after the first TSAN warning. This is a good way to generate lots of warnings quickly without waiting to rerun the full test suite manually.

Here are some additional summaries that I see in a cargo stress run:

data race tupleobject.c:173 in PyTuple_Pack
data race typeobject.c:3368 in best_base

And here is the full terminal output (this ran for about 10 seconds before I killed it with ctrl-c): https://gist.github.com/ngoldbaum/1d1e29c8e10f0ac979ef27a95c73d39f

When I try to do the same tests in the docker container using a version of 3.14t-dev I built on the container, I don't see any of the TSAN reports seen above. Maybe they don't happen on x86_64?

Also note that there is a race inside PyO3 triggered by the PyO3 testtest_thread_safety_2, you may see that if you are running the tests inside the docker container. There are also two test failures due to unexpected panics that I only see under TSAN in the docker container. I'm not sure what's happening with the failures yet.

Linked PRs

@picnixz picnixz added topic-free-threading type-bug An unexpected behavior, bug, or error labels Feb 21, 2025
@colesbury
Copy link
Contributor
colesbury commented Feb 25, 2025

For future reference:

py_get_monotonic_clock (macOS only)

The initialization of the global base isn't thread-saf 8000 e. My preference would be to try to initialize this during runtime initialization, before any other threads are created.

cpython/Python/pytime.c

Lines 1160 to 1166 in 38642bf

#elif defined(__APPLE__)
static _PyTimeFraction base = {0, 0};
if (base.denom == 0) {
if (py_mach_timebase_info(&base, raise_exc) < 0) {
return -1;
}
}

long_from_non_binary_base

Non thread-safe initialization of global state. Either initialize this during runtime initialization (if it's fast enough) or use _PyOnceFlag.

cpython/Objects/longobject.c

Lines 2835 to 2839 in f963239

static double log_base_BASE[37] = {0.0e0,};
static int convwidth_base[37] = {0,};
static twodigits convmultmax_base[37] = {0,};
if (log_base_BASE[base] == 0.0) {

_PyType_AllocNoTrack

_PyType_PreHeaderSize() reads tp_flags and PyType_Ready() modifies it. I think we should be ensuring that PyType_Ready() is called before any instances of objects are allocated, but:

  • I'm not sure how this is happening in the PyO3 tests
  • Python historically has been fairly forgiving for not calling PyType_Ready()

const size_t presize = _PyType_PreHeaderSize(type);

type->tp_flags |= Py_TPFLAGS_READY;

get_or_create_weakref and PyType_FromMetaclass

Seems pretty similar to the _PyType_AllocNoTrack race above. Again, I'm not sure how we are creating a weak reference to an object before we finish initializing that object's type.

if (!_PyType_SUPPORTS_WEAKREFS(Py_TYPE(obj))) {

type->tp_weaklistoffset = weaklistoffset;

  Read of size 8 at 0x0003000600e0 by thread T725 (mutexes: write M0):
    #0 get_or_create_weakref weakrefobject.c:413 (libpython3.14t.dylib:arm64+0x200024)
    #1 PyWeakref_NewRef weakrefobject.c:921 (libpython3.14t.dylib:arm64+0x1fffc4)
    #2 pyo3::types::weakref::reference::PyWeakrefReference::new::h021cf5fda7b820b3 reference.rs:87 (pyo3-f6efbb17135478d3:arm64+0x10001e630)

  Previous write of size 8 at 0x0003000600e0 by thread T724 (mutexes: write M1):
    #0 PyType_FromMetaclass typeobject.c:5055 (libpython3.14t.dylib:arm64+0x176988)
    #1 PyType_FromSpec typeobject.c:5156 (libpython3.14t.dylib:arm64+0x177b14)
    #2 pyo3::pyclass::create_type_object::PyTypeBuilder::build::h21354c33afccbc76 create_type_object.rs:493 (pyo3-f6efbb17135478d3:arm64+0x1000833ec)

best_base / type_ready

Not sure, but seems related to the above two issues.

  Read of size 8 at 0x00010e0400b8 by thread T5 (mutexes: write M0):
    #0 best_base typeobject.c:3368 (libpython3.14t.dylib:arm64+0x176c60)
    #1 PyType_FromMetaclass typeobject.c:4893 (libpython3.14t.dylib:arm64+0x176284)
    #2 PyType_FromSpec typeobject.c:5156 (libpython3.14t.dylib:arm64+0x177b14)
    #3 pyo3::pyclass::create_type_object::PyTypeBuilder::build::h4dc4cb20c59c0b38 <null>:140509280 (test_class_conversion-25e69c9142759d40:arm64+0x1000483d8)

  Previous write of size 8 at 0x00010e0400b8 by thread T3 (mutexes: write M1):
    #0 type_ready typeobject.c:8667 (libpython3.14t.dylib:arm64+0x183350)
    #1 PyType_Ready typeobject.c:8697 (libpython3.14t.dylib:arm64+0x1777f4)
    #2 PyType_FromMetaclass typeobject.c:5069 (libpython3.14t.dylib:arm64+0x1769bc)
    #3 PyType_FromSpec typeobject.c:5156 (libpython3.14t.dylib:arm64+0x177b14)
    #4 pyo3::pyclass::create_type_object::PyTypeBuilder::build::h4dc4cb20c59c0b38 <null>:140509280 (test_class_conversion-25e69c9142759d40:arm64+0x1000483d8)

_Py_IncRef and _PyObject_SetDeferredRefcount

_PyObject_SetDeferredRefcount modifies ob_ref_shared in a way that's not thread-safe if other threads may be modifying the reference count of that object. In general, we don't want to be calling _PyObject_SetDeferredRefcount() on objects that are accessible to other threads.

(We could also consider trying to make _PyObject_SetDeferredRefcount more thread-safe, but that would also make it more complicated and slower.)

The type shouldn't be exposed to other threads, but it might be getting accessed via a gc.get_objects() or gc.get_referrers call.

op->ob_ref_shared = _Py_REF_SHARED(_Py_REF_DEFERRED, 0);

  Previous write of size 8 at 0x00010e040010 by thread T11 (mutexes: write M1):
    #0 _PyObject_SetDeferredRefcount object.c:2549 (libpython3.14t.dylib:arm64+0x127768)
    #1 _PyObject_AssignUniqueId uniqueid.c:96 (libpython3.14t.dylib:arm64+0x345a6c)
    #2 PyType_FromMetaclass typeobject.c:5060 (libpython3.14t.dylib:arm64+0x16ed24)
    #3 PyType_FromSpec typeobject.c:5156 (libpython3.14t.dylib:arm64+0x16fde0)
    #4 pyo3::pyclass::create_type_object::PyTypeBuilder::build::h4dc4cb20c59c0b38 <null> (test_enum-eedee5d138326bdc:arm64+0x10007d098)

PyTuple_Pack / mi_page_free_list_extend

Possible duplicate of:

@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 25, 2025
@colesbury
Copy link
Contributor

@ngoldbaum - I think a few of these races, such as _Py_IncRef/ _PyObject_SetDeferredRefcount, may be due to the use of gc.get_objects() in test_gc.rs:

https://github.com/PyO3/pyo3/blob/b07871d962d56e66243d2c52a700e978545ff417/tests/test_gc.rs#L185

gc.get_objects() has the same problem as gc.get_referrers, which is that it can expose partially initialized objects. In this case, I think the gc.get_objects() temporarily adds some partially initialized types to the object's list concurrently with the _PyObject_SetDeferredRefcount() call.

Is it possible to remove the gc.get_objects() call from that test, at least in the free threaded build?

8000

@ngoldbaum
Copy link
Contributor Author

Is it possible to remove the gc.get_objects() call from that test, at least in the free threaded build?

Likely yes, I'll take a look at that.

@ngoldbaum
Copy link
Contributor Author

I tried disabling the call to gc.get_objects() but I'm still seeing the same race. It looks like I can trigger it intermittently just running test_mapping, so I don't think it has anything to do with test_gc:

goldbaum at Nathans-MBP in ~/Documents/pyo3 on 3.14-testing!
± while TSAN_OPTIONS="suppressions=$PWD/suppression.txt" cargo test --test test_mapping; do :; done
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.03s
     Running tests/test_mapping.rs (target/debug/deps/test_mapping-8e2f8c9757276f6e)
test_mapping-8e2f8c9757276f6e(29015,0x1eed6c840) malloc: nano zone abandoned due to inability to reserve vm space.

running 4 tests
==================
WARNING: ThreadSanitizer: data race (pid=29015)
  Atomic write of size 8 at 0x00010888b810 by thread T3 (mutexes: write M0):
    #0 _Py_IncRef object.c:343 (libpython3.14t.dylib:arm64+0x120814)
    #1 pyo3::instance::Py$LT$T$GT$::clone_ref::h1172372761686055 <null> (test_mapping-8e2f8c9757276f6e:arm64+0x1000525fc)

  Previous write of size 8 at 0x00010888b810 by thread T1 (mutexes: write M1):
    #0 _PyObject_SetDeferredRefcount object.c:2549 (libpython3.14t.dylib:arm64+0x12ab00)
    #1 _PyObject_AssignUniqueId uniqueid.c:96 (libpython3.14t.dylib:arm64+0x3491ec)
    #2 PyType_FromMetaclass typeobject.c:5080 (libpython3.14t.dylib:arm64+0x1720bc)
    #3 PyType_FromSpec typeobject.c:5176 (libpython3.14t.dylib:arm64+0x173178)
    #4 pyo3::pyclass::create_type_object::PyTypeBuilder::build::h9c2637c9285e8dd3 <null> (test_mapping-8e2f8c9757276f6e:arm64+0x10006a878)

  Mutex M0 (0x000105900040) created at:
    #0 pthread_mutex_init <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64+0x329f0)
    #1 std::sys::pal::unix::sync::mutex::Mutex::init::h72bc101bcffec752 <null> (test_mapping-8e2f8c9757276f6e:arm64+0x1000a51d4)

  Mutex M1 (0x0001059001c0) created at:
    #0 pthread_mutex_init <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64+0x329f0)
    #1 std::sys::pal::unix::sync::mutex::Mutex::init::h72bc101bcffec752 <null> (test_mapping-8e2f8c9757276f6e:arm64+0x1000a51d4)

  Thread T3 (tid=124485, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64+0x31050)
    #1 std::sys::pal::unix::thread::Thread::new::hae4c12895803408a <null> (test_mapping-8e2f8c9757276f6e:arm64+0x1000a540c)

  Thread T1 (tid=124483, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64+0x31050)
    #1 std::sys::pal::unix::thread::Thread::new::hae4c12895803408a <null> (test_mapping-8e2f8c9757276f6e:arm64+0x1000a540c)

SUMMARY: ThreadSanitizer: data race object.c:343 in _Py_IncRef
==================
test test_setitem ... ok
test test_delitem ... ok
test test_getitem ... ok
test mapping_is_not_sequence ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.38s

ThreadSanitizer: reported 1 warnings
error: test failed, to rerun pass `--test test_mapping`

Caused by:
  process didn't exit successfully: `/Users/goldbaum/Documents/pyo3/target/debug/deps/test_mapping-8e2f8c9757276f6e` (signal: 6, SIGABRT: process abort signal)

Here I'm using a suppression file with the following content:

race:py_get_monotonic_clock
race:_PyType_AllocNoTrack

colesbury added a commit to colesbury/cpython that referenced this issue Feb 26, 2025
Windows and macOS require precomputing a "timebase" in order to convert
OS timestamps into nanoseconds. Retrieve and compute this value during
runtime initialization to avoid data races when accessing the time.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 26, 2025
colesbury added a commit that referenced this issue Feb 27, 2025
Windows and macOS require precomputing a "timebase" in order to convert
OS timestamps into nanoseconds. Retrieve and compute this value during
runtime initialization to avoid data races when accessing the time.
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…30592)

Windows and macOS require precomputing a "timebase" in order to convert
OS timestamps into nanoseconds. Retrieve and compute this value during
runtime initialization to avoid data races when accessing the time.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants
0