bpo-42130: Fix for explicit suppressing of cancellations in wait_for() #28149
Status: Closed
Commits (20, all by Dreamsorcerer):
9c2942a  Create test for suppressing cancellations
7ff2c04  Update test_tasks.py
cdaa9c8  Update test_tasks.py
342222f  Update test_tasks.py
01d9884  Use a sentinel
af96f10  Fix test for issue37658
1e5233d  Update test_tasks.py
743177e  Fix original issue
d7c0965  Remove test
80f027b  Merge branch 'main' into patch-4
4eb5de2  Create 2021-11-29-17-17-45.bpo-42130.Ho8dUh.rst
a8f5c1f  Compare to tuple
5de6b79  Use enum for sentinel
5a79cf2  Missing import
11e1214  Update tasks.py
3a35126  Update tasks.py
df15ac1  Update test_tasks.py
7374366  Update tasks.py
c694173  Revert mistaken changes.
cb59377  Update 2021-11-29-17-17-45.bpo-42130.Ho8dUh.rst
Changes shown from commit 11e12141c20e8efb2387a5e19145359b2d6c748a (Update tasks.py).
Conversation
OK, I've gone through this in depth now. This is basically just reverting #21894, which was an incorrect fix for the issue I fixed in my previous PR.

I think someone mixed up this `except` with the timeout. This `except` will only happen if a `task.cancel()` comes from the outer code. If the future we are waiting on is completed, or cancelled, or the timeout happens, then the code continues without entering this block. Therefore, this code is only handling an explicit cancellation, and should always reraise that exception.
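For context, a simplified sketch of the `wait_for()` control flow under discussion (an approximation, not the exact CPython source; the timeout path is abbreviated and helper names are illustrative):

```python
import asyncio
import functools

def _release_waiter(waiter, *args):
    # Completes the waiter; used both as the timeout callback and as
    # the inner future's done-callback.
    if not waiter.done():
        waiter.set_result(None)

async def wait_for_sketch(fut, timeout):
    loop = asyncio.get_running_loop()
    waiter = loop.create_future()
    timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
    cb = functools.partial(_release_waiter, waiter)
    fut = asyncio.ensure_future(fut)
    fut.add_done_callback(cb)
    try:
        try:
            # fut completing, fut being cancelled, or the timeout firing
            # all resolve `waiter` normally; only a task.cancel() from
            # outer code raises CancelledError at this await.
            await waiter
        except asyncio.CancelledError:
            # The block under discussion: reached only for an explicit
            # outer cancellation, hence the argument that it should
            # always cancel fut and reraise.
            fut.remove_done_callback(cb)
            fut.cancel()
            raise
        if fut.done():
            return fut.result()
        # Timeout path (abbreviated): cancel the inner future.
        fut.remove_done_callback(cb)
        fut.cancel()
        raise asyncio.TimeoutError()
    finally:
        timeout_handle.cancel()
```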
I don't think this is a safe assumption; that PR wasn't merely a mistake of mixing up blocks. The whole source of this issue is that there appears to be a small time window after `fut` completes during which a cancellation can still arrive on `await waiter`, which breaks the assumption you suggest that we can never get into this block once `fut` is done.

Currently we seem to be bouncing between two implementations, with PRs and counter-PRs, on the behaviour during this race: either the cancellation is guaranteed to propagate but a completed result leaks, or the result is returned but the cancellation is swallowed.

I have yet to see a solution that solves both, and I am concerned it is not possible with the current async primitives, since we keep going back and forth. The PR here as it stands just moves us back to leaky-but-cancellation-guaranteeing.
The library I made can handle the race condition, though it does not do so implicitly. The caller is required to know that the wrapped future returns some resource that must be freed explicitly, and to provide a race-condition handler callback that frees it in such a case.

I'm not sure whether this change would be acceptable upstream in the `wait_for()` implementation. However, as you've pointed out, the issue stems from the current primitives not providing a reliable (atomic?) synchronization of running states between the different tasks. I'm not sure that can be solved efficiently, which makes me doubt it will ever be solved at that level.

If my solution were considered for upstream, all calling code would have to be evaluated to add a proper cleanup callback where necessary. That does not sound appealing, but it is better than leaving the problem unresolved. Right now I use the `wait_for2` library in most of my applications with great success, but unfortunately I cannot monkey-patch it in for all the dependencies I'm using, because its effective use may require explicit changes in those libraries.
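A rough sketch of the pattern described, built here on `asyncio.shield()`; the `race_handler` name and signature are illustrative guesses at the approach, not wait_for2's documented API:

```python
import asyncio

async def wait_for_with_handler(coro, timeout, race_handler=None):
    # Illustrative sketch only: race_handler is an assumed callback
    # that frees a resource produced by the inner future when its
    # result would otherwise leak during a cancellation race.
    fut = asyncio.ensure_future(coro)
    try:
        return await asyncio.wait_for(asyncio.shield(fut), timeout)
    except asyncio.CancelledError:
        # Outer cancellation arrived; the shielded future may have
        # completed in the same window and produced a resource.
        if fut.done() and not fut.cancelled() and fut.exception() is None:
            if race_handler is not None:
                race_handler(fut.result())  # e.g. release a connection
        else:
            fut.cancel()
        raise
    except asyncio.TimeoutError:
        # shield() kept fut running past the timeout; cancel it now.
        fut.cancel()
        raise
```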
I'm not sure which part you think is an assumption. But I see the issue you are talking about: after the explicit cancellation, the inner future could be completed before control is resumed in `wait_for()`. I think I can write a test case for this, although I can't think of any reasonable solution yet. It's also likely a very rare occurrence, while the bug reports describe common, easily reproducible situations, which have been fixed by these PRs.

I think the tests in `wait_for2` are probably not even hitting this condition; they fail because the cancellation is caught at the bottom of this function and a `TimeoutError` is raised instead. As already mentioned, we should only change this to a `TimeoutError` if the cancellation came from our timeout, but the cancellation's `msg` argument seems to keep disappearing, which is how we would fix that.

I'll look into both of these later.
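A hedged sketch of how the `msg` argument could tell the two cancellation sources apart, assuming the message actually propagates through the awaits (exactly the propagation reported as unreliable here); the sentinel value and function name are illustrative:

```python
import asyncio

_TIMEOUT_MSG = "_wait_for_timeout"  # illustrative sentinel

async def wait_for_msg_sketch(coro, timeout):
    # Cancel the inner task with a sentinel message so the except
    # block can distinguish a timeout from an outer task.cancel().
    task = asyncio.ensure_future(coro)
    loop = asyncio.get_running_loop()
    handle = loop.call_later(timeout, task.cancel, _TIMEOUT_MSG)
    try:
        return await task
    except asyncio.CancelledError as exc:
        if exc.args == (_TIMEOUT_MSG,):
            # Our own timeout fired: report it as TimeoutError.
            raise asyncio.TimeoutError() from exc
        task.cancel()  # genuine outer cancellation: pass it on
        raise
    finally:
        handle.cancel()
```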
Another case might be if the inner future wants to suppress the cancellation; then this should presumably be suppressed in `wait_for()` too. I'll see if I can produce tests for all three of these conditions.
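A rough sketch of what a test for that third condition might look like (names are illustrative; whether the result is returned or the cancellation propagates is exactly the behaviour in question):

```python
import asyncio

async def suppressing_inner():
    # Deliberately swallow the cancellation and return a value anyway.
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        pass
    return "done"

async def main():
    task = asyncio.ensure_future(asyncio.wait_for(suppressing_inner(), 60))
    await asyncio.sleep(0)  # let wait_for() start and schedule the inner task
    await asyncio.sleep(0)  # let the inner coroutine reach its sleep()
    task.cancel()           # external cancellation of wait_for()
    try:
        print("cancellation suppressed, got:", await task)
    except asyncio.CancelledError:
        print("cancellation propagated")

asyncio.run(main())
```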
The removed test that I commented on in your other PR was the specific test for this race. Perhaps it wasn't working right, but this is what it was trying to test, rather than the replacement test that checks internal timeout vs. completion. If you run that test against the reverted implementation, do you see failures?

It's rare, but it was still pretty easy to hit; it was sufficiently reproducible to cause a suite of ~1,000 tests to have at least one failure per CI run. Effectively, an external cancellation would come from the test teardown, which would occasionally hang the test waiting for a pool to empty that had leaked a connection during the cancelled acquire. So a rough approximation would be that one in 1,000 runs of such a server may deadlock during shutdown.
Some tracking issues that led up to the current implementation:
That's fair; I can't recall why the timeout value is so low in the tests. I've adjusted it further to make sure it avoids the timeout and focuses on the cancellation. I've pushed it here if you're interested: https://github.com/Traktormaster/wait-for2/tree/pr28149
OK, so my understanding is that the second race condition I was thinking of can't actually happen. The last block of code checks that the future is not done and then cancels it, so any time the code comes back to the future, it will raise the `CancelledError`. There is nowhere in between the `done()` check and the `cancel()` call that allows another task to run.

I've changed this PR to fix the last condition I mentioned, about deliberate suppression of cancellation (and tidied up some tests).

As for the original issue of `wait_for()` suppressing explicit cancellations by accident, this requires a more complex change to asyncio.
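A small runnable illustration of that reasoning: the loop is single-threaded, so nothing can run between a `done()` check and a `cancel()` call unless there is an `await` in between (names here are illustrative):

```python
import asyncio

def _complete(f):
    # Guarded setter, as a done-callback or call_soon target would be.
    if not f.done():
        f.set_result("value")

async def demo():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    loop.call_soon(_complete, fut)  # queued for the next loop iteration

    # The check and the cancel run back to back with no await between
    # them, so the queued _complete() cannot slip in between.
    if not fut.done():
        fut.cancel()

    await asyncio.sleep(0)  # the loop now runs _complete(); too late
    assert fut.cancelled()

asyncio.run(demo())
```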
Taking this test as an example:
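(The quoted test did not survive the page capture; the following is an approximate reconstruction from the description below, with illustrative names and timings:)

```python
import asyncio

async def coroutine_function():
    # Completes immediately once it gets to run.
    return "ok"

async def with_for_coro():
    return await asyncio.wait_for(coroutine_function(), timeout=10)

async def main():
    task = asyncio.ensure_future(with_for_coro())
    await asyncio.sleep(0)  # wait_for() schedules coroutine_function()
    task.cancel()           # cancellation lands in the same iteration
    try:
        print("result:", await task)
    except asyncio.CancelledError:
        print("cancelled")

asyncio.run(main())
```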
What happens here is that `with_for_coro()` is started, and the call to `wait_for()` results in `coroutine_function()` being appended to the list (`loop._ready`) to run in the next loop iteration.

Then we cancel `with_for_coro()`, which appends it to the list to also run in the next iteration.

Then the `await` yields back to the loop, which runs `coroutine_function()` followed by `with_for_coro()`.

When `with_for_coro()` runs, it raises the `CancelledError` and finds the future is complete.

The function can then either raise the `CancelledError` even though the future is complete, or return the result even though it received a cancellation. Neither of these is the expected behaviour.

The expected behaviour would be that the cancellation interrupts `coroutine_function()`. In other words, after calling `task.cancel()`, `coroutine_function()` should not continue to run without a cancellation.

To achieve this, we would need to establish some kind of dependency between the `with_for_coro()` future and the `coroutine_function()` future, such that if `with_for_coro()` is cancelled, it is always scheduled to run ahead of `coroutine_function()`. This obviously requires some changes to the event loop. Maybe a function that says fut1 depends on fut2, so that if fut1 and fut2 are in `loop._ready` at the same time, fut2 will always be called before fut1?
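No such ordering hook exists today; `loop._ready` is a plain FIFO, as this small `call_soon` demonstration shows:

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    order = []
    # call_soon callbacks run strictly in FIFO order; there is
    # currently no way to express "run B before A if both are queued".
    loop.call_soon(order.append, "coroutine_function step")
    loop.call_soon(order.append, "with_for_coro cancellation")
    await asyncio.sleep(0)
    print(order)  # ['coroutine_function step', 'with_for_coro cancellation']

asyncio.run(main())
```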