-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
[3.11] gh-106883 Fix deadlock in threaded application #117332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. It has been suggested to disable GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
This will break users that have disabled the gc, we need to do the dance only if it has not been disabled |
Also there are return paths for errors where this leaves the gc disabled |
@pablogsal thanks for your comments! Of course this was more an "attempt" to check if the approach was in the right direction and to check if the C API I used are the correct ones. Next week I will produce a more complete patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this is the right general approach. I left some inline comments to expand on what @pablogsal wrote.
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. By disabling the GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls fixes the issue.
@pablogsal @colesbury I've updated the PR addressing your comments and adding a test to verify the patch. Please let me know what you think. |
It is currently failing on the Address Sanitiser build, I'm investigating what the problem is. |
I have pushed a new test for it. It is slower compare to the existent tests: it takes about 10 seconds on my machine (Parallels on MBP) when it succeeds. If it fails (it shouldn't) then it's the timeout of 40 seconds. Thoughts? |
We have a thing to consider which is that 3.11.9 was the last bug fix release of 3.11 and that is already released. I could make an extra bug fix release to include a fix for this PR but technically this is too late. I will check with the other release managers to see what they think |
Sure, no problem. Whatever the outcome is, it has been entertaining debugging it :) |
I just want to give a gentle ping on this PR. What will be its fate? Merge or close? :) Joking apart, it would be nice to have a closure. |
Unfortunately unless you want to argue that this is a security fix, we cannot backport bug fixes anymore to 3.11. If this issue is something that's affecting a considerable number of users I am willing to reconsider, though. But in any case the release will be folded with the next security release of 3.11. |
Hello, I don't think I can argue that this is a security fix (it isn't :) ) but also I left a comment on dask/distributed#8616 issue as they bumped into the same problem. I requested them to comment directly here: if they have found a way to workaround the issue, we can just close the PR otherwise we need to understand what the blockers are and what's the magnitude of the issue. |
I work on a python library, and in the last ~2 months at least 3 of our customers have reached out for help debugging a deadlock situation. after much of very involved debugging, we attributed the issue to: #106883 . |
3.11 is yours, Pablo, so it's really up to you. I recall working around this issue at Google, and while I think fixing it is worth while, as RM I'm not sure fixing it in 3.11, at this point is worth the risk... That said, the risk is very small, and it is the case that we're pushing for more use of threads (and testing with threads) with free-threaded Python on the horizon, and it would definitely be nice if threads were less flaky in 3.11... But that's thinking more with my free-threaded contributor hat than my RM hat :P |
I just wanted to double check with you and other RMs. But I am convinced to backport this |
merged, this will go in the next security release |
|
|
Ugh, this is not good. @diegorusso do you mind checking? Otherwise we may need to revert :( |
|
|
|
I'm looking into it. |
Technically by our policy we will need to revert if this is not fixed in 24h but given that we don't have a release soon, let's wait until Friday EOD max |
I plan to make a new PR today. |
Upon an initial investigation, the culprit is the low timeout set in the test: increasing the timeout makes the test passing. If we look at the builds related to this commit https://buildbot.python.org/#/changes/40264 we notice that all the failing ones related to this test have been built with On GitHub tests are passing because we build CPython without debug. As a reference on my AArch64 Linux VM the execution time difference of
Another data point. On a software emulated environment:
There is about a 4x difference between the build with debug and without. I suggest to increase the timeout to a level that it is enough for all the platforms, even the older ones. |
PR has been raised: #131182 |
When using threaded applications, there is a high risk of a deadlock in the intepreter.
It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL.
It has been suggested to disable GC during the
_PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
sys._current_frames
#106883