8000 [3.11] gh-106883 Fix deadlock in threaded application by diegorusso · Pull Request #117332 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

[3.11] gh-106883 Fix deadlock in threaded application #117332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 11, 2025

Conversation

diegorusso
Copy link
Contributor
@diegorusso diegorusso commented Mar 28, 2024

When using threaded applications, there is a high risk of a deadlock in the intepreter.
It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL.

It has been suggested to disable GC during the
_PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.

When using threaded applications, there is a high risk of a deadlock in
the intepreter.
It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL.

It has been suggested to disable GC during the
_PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
@pablogsal
Copy link
Member

This will break users that have disabled the gc, we need to do the dance only if it has not been disabled

@pablogsal
Copy link
Member

Also there are return paths for errors where this leaves the gc disabled

@diegorusso
Copy link
Contributor Author
diegorusso commented Mar 28, 2024

@pablogsal thanks for your comments! Of course this was more an "attempt" to check if the approach was in the right direction and to check if the C API I used are the correct ones. Next week I will produce a more complete patch.

@colesbury colesbury self-requested a review March 28, 2024 19:35
Copy link
Contributor
@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is the right general approach. I left some inline comments to expand on what @pablogsal wrote.

diegorusso and others added 3 commits April 5, 2024 15:03
When using threaded applications, there is a high risk of a deadlock in
the intepreter.
It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL.

By disabling the GC during the _PyThread_CurrentFrames() and
_PyThread_CurrentExceptions() calls fixes the issue.
@diegorusso
Copy link
Contributor Author

@pablogsal @colesbury I've updated the PR addressing your comments and adding a test to verify the patch. Please let me know what you think.

@diegorusso
Copy link
Contributor Author

It is currently failing on the Address Sanitiser build, I'm investigating what the problem is.

@diegorusso diegorusso marked this pull request as draft April 5, 2024 17:13
@diegorusso diegorusso marked this pull request as ready for review April 9, 2024 10:31
@diegorusso
Copy link
Contributor Author

@pablogsal @colesbury

I have pushed a new test for it. It is slower compare to the existent tests: it takes about 10 seconds on my machine (Parallels on MBP) when it succeeds. If it fails (it shouldn't) then it's the timeout of 40 seconds.
When used in the build with the address sanitizer enabled, it is skipped because it will be way slower.
My questions to you are: do we really need the test? if yes, can we mark it as slow?

Thoughts?

@pablogsal
Copy link
Member
pablogsal commented Apr 9, 2024
8000

We have a thing to consider which is that 3.11.9 was the last bug fix release of 3.11 and that is already released. I could make an extra bug fix release to include a fix for this PR but technically this is too late. I will check with the other release managers to see what they think

@diegorusso
Copy link
Contributor Author

Sure, no problem. Whatever the outcome is, it has been entertaining debugging it :)

@arhadthedev arhadthedev added the pending The issue will be closed if no feedback is provided label Apr 10, 2024
@diegorusso
Copy link
Contributor Author

I just want to give a gentle ping on this PR. What will be its fate? Merge or close? :) Joking apart, it would be nice to have a closure.

@pablogsal
Copy link
Member
pablogsal commented May 28, 2024

Unfortunately unless you want to argue that this is a security fix, we cannot backport bug fixes anymore to 3.11.

If this issue is something that's affecting a considerable number of users I am willing to reconsider, though. But in any case the release will be folded with the next security release of 3.11.

@diegorusso
Copy link
Contributor Author

Hello, I don't think I can argue that this is a security fix (it isn't :) ) but also I left a comment on dask/distributed#8616 issue as they bumped into the same problem. I requested them to comment directly here: if they have found a way to workaround the issue, we can just close the PR otherwise we need to understand what the blockers are and what's the magnitude of the issue.

@pablogsal
Copy link
Member

Let's check what the other RMs think just to get some consensus on this.

CC @ambv @Yhg1s @hugovk

@diegorusso diegorusso deleted the branch python:3.11 June 25, 2024 15:34
@diegorusso diegorusso closed this Jun 25, 2024
@tvalentyn
Copy link
tvalentyn commented Feb 12, 2025

If this issue is something that's affecting a considerable number of users I am willing to reconsider, though. But in any case the release will be folded with the next security release of 3.11.

I work on a python library, and in the last ~2 months at least 3 of our customers have reached out for help debugging a deadlock situation. after much of very involved debugging, we attributed the issue to: #106883 .

@diegorusso diegorusso restored the 3.11 branch March 6, 2025 11:43
@diegorusso diegorusso reopened this Mar 6, 2025
@pablogsal
Copy link
Member

CC @ambv @Yhg1s @hugovk

Ping @ambv @Yhg1s @hugovk

@Yhg1s
Copy link
Member
Yhg1s commented Mar 10, 2025

3.11 is yours, Pablo, so it's really up to you. I recall working around this issue at Google, and while I think fixing it is worth while, as RM I'm not sure fixing it in 3.11, at this point is worth the risk...

That said, the risk is very small, and it is the case that we're pushing for more use of threads (and testing with threads) with free-threaded Python on the horizon, and it would definitely be nice if threads were less flaky in 3.11... But that's thinking more with my free-threaded contributor hat than my RM hat :P

@pablogsal
Copy link
Member

I just wanted to double check with you and other RMs. But I am convinced to backport this

@pablogsal pablogsal merged commit 6b37486 into python:3.11 Mar 11, 2025
21 of 22 checks passed
@pablogsal
Copy link
Member

merged, this will go in the next security release

@diegorusso diegorusso deleted the 3.11 branch March 11, 2025 16:01
@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot PPC64LE Fedora Stable 3.11 (tier-2) has failed when building commit 6b37486.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/933/builds/1172) and take a look at the build logs.
  4. Check if the failure is related to this commit (6b37486) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/933/builds/1172

Failed tests:

  • test_sys

Failed subtests:

  • test_current_frames_exceptions_deadlock - test.test_sys.SysModuleTest.test_current_frames_exceptions_deadlock

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 2699152 is still running after 40.3 seconds


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 2676978 is still running after 40.3 seconds

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot PPC64LE Fedora Stable Clang 3.11 (tier-3) has failed when building commit 6b37486.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/967/builds/1171) and take a look at the build logs.
  4. Check if the failure is related to this commit (6b37486) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/967/builds/1171

Failed tests:

  • test_sys

Failed subtests:

  • test_current_frames_exceptions_deadlock - test.test_sys.SysModuleTest.test_current_frames_exceptions_deadlock

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.clang/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.clang/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.clang/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 2710960 is still running after 46.9 seconds


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.clang/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.clang/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.clang/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 2732847 is still running after 40.3 seconds

@pablogsal
Copy link
Member

Ugh, this is not good. @diegorusso do you mind checking? Otherwise we may need to revert :(

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot PPC64LE Fedora Stable Refleaks 3.11 (tier-2) has failed when building commit 6b37486.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/966/builds/843) and take a look at the build logs.
  4. Check if the failure is related to this commit (6b37486) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/966/builds/843

Failed tests:

  • test_sys

Failed subtests:

  • test_current_frames_exceptions_deadlock - test.test_sys.SysModuleTest.test_current_frames_exceptions_deadlock

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 3111838 is still running after 40.3 seconds


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-fedora-stable-ppc64le.refleak/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 3114961 is still running after 40.3 seconds

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot PPC64LE RHEL8 3.11 (tier-2) has failed when building commit 6b37486.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/997/builds/1182) and take a look at the build logs.
  4. Check if the failure is related to this commit (6b37486) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/997/builds/1182

Failed tests:

  • test_sys

Failed subtests:

  • test_current_frames_exceptions_deadlock - test.test_sys.SysModuleTest.test_current_frames_exceptions_deadlock

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 3527836 is still running after 40.3 seconds


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 3511003 is still running after 40.5 seconds

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot PPC64LE RHEL8 Refleaks 3.11 (tier-2) has failed when building commit 6b37486.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/978/builds/854) and take a look at the build logs.
  4. Check if the failure is related to this commit (6b37486) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/978/builds/854

Failed tests:

  • test_sys

Failed subtests:

  • test_current_frames_exceptions_deadlock - test.test_sys.SysModuleTest.test_current_frames_exceptions_deadlock

Test leaking resources:

  • test_functools: memory blocks

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le.refleak/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le.refleak/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le.refleak/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 3696628 is still running after 40.5 seconds


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le.refleak/build/Lib/test/support/threading_helper.py", line 66, in decorator
    return func(*args)
           ^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le.refleak/build/Lib/test/test_sys.py", line 530, in test_current_frames_exceptions_deadlock
    support.wait_process(pid, exitcode=0, timeout=TIMEOUT)
  File "/home/buildbot/buildarea/3.11.cstratak-RHEL8-ppc64le.refleak/build/Lib/test/support/__init__.py", line 1963, in wait_process
    raise AssertionError(f"process {pid} is still running "
AssertionError: process 3726771 is still running after 40.3 seconds

@diegorusso
Copy link
Contributor Author

Ugh, this is not good. @diegorusso do you mind checking? Otherwise we may need to revert :(

I'm looking into it.

@pablogsal
Copy link
Member

Technically by our policy we will need to revert if this is not fixed in 24h but given that we don't have a release soon, let's wait until Friday EOD max

@diegorusso
Copy link
Contributor Author

I plan to make a new PR today.

@diegorusso
Copy link
Contributor Author
diegorusso commented Mar 13, 2025

Upon an initial investigation, the culprit is the low timeout set in the test: increasing the timeout makes the test passing.

If we look at the builds related to this commit https://buildbot.python.org/#/changes/40264 we notice that all the failing ones related to this test have been built with --with-pydebug.
If we look at the file gcmodule.c (https://github.com/python/cpython/blob/6b37486184590d19c6f24e620545ec8f8f65e4c7/Modules/gcmodule.c) we see that in the file there are different code paths depending if the debug is enabled or not.
This has an impact on the execution time: with the debug enabled it takes more time execute the test case and it cannot finish by the timeout set of 40 seconds. The fact that happens on PPC64LE is because very likely these machines don't have a high clock speed and it might take some time to execute the test.
We can observe a similar failure on SPARCv9 Oracle Solaris as well https://buildbot.python.org/#/builders/988/builds/1187

On GitHub tests are passing because we build CPython without debug.

As a reference on my AArch64 Linux VM the execution time difference of test_sys between a build with and without debug is:

  • without debug: ~1.5 seconds
  • with debug: ~7 seconds

Another data point. On a software emulated environment:

  • without debug: ~17.3 seconds
  • with debug: ~68 seconds

There is about a 4x difference between the build with debug and without.

I suggest to increase the timeout to a level that it is enough for all the platforms, even the older ones.

@diegorusso
Copy link
Contributor Author

PR has been raised: #131182

@AA-Turner AA-Turner removed the pending The issue will be closed if no feedback is provided label Apr 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0