bpo-42130: Fix for explicit suppressing of cancellations in wait_for() by Dreamsorcerer · Pull Request #28149 · python/cpython · GitHub

bpo-42130: Fix for explicit suppressing of cancellations in wait_for() #28149


Closed · wants to merge 20 commits
Changes from 1 commit
Update tasks.py
Dreamsorcerer authored Dec 6, 2021
commit 11e12141c20e8efb2387a5e19145359b2d6c748a
21 changes: 7 additions & 14 deletions Lib/asyncio/tasks.py
@@ -11,7 +11,6 @@

 import concurrent.futures
 import contextvars
-import enum
 import functools
 import inspect
 import itertools
@@ -26,9 +25,6 @@
 from . import futures
 from .coroutines import _is_coroutine

-_SENTINEL = enum.Enum("_SENTINEL", "sentinel")
-sentinel = _SENTINEL.sentinel
-
 # Helper to generate new task names
 # This uses itertools.count() instead of a "+= 1" operation because the latter
 # is not thread safe. See bpo-11866 for a longer explanation.
@@ -435,15 +431,12 @@ async def wait_for(fut, timeout):
         try:
             await waiter
         except exceptions.CancelledError as e:
-            if fut.done() and e.args == (sentinel,):
-                return fut.result()
-            else:
-                fut.remove_done_callback(cb)
-                # We must ensure that the task is not running
-                # after wait_for() returns.
-                # See https://bugs.python.org/issue32751
-                await _cancel_and_wait(fut, loop=loop)
-                raise
+            fut.remove_done_callback(cb)
+            # We must ensure that the task is not running
+            # after wait_for() returns.
+            # See https://bugs.python.org/issue32751
+            await _cancel_and_wait(fut, loop=loop)
+            raise
Dreamsorcerer (Contributor Author) commented:
OK, I've gone through this in depth now. This is basically just reverting #21894, which was an incorrect fix for the issue I fixed in my previous PR.

I think someone mixed up this except with the timeout. This except will only run if a task.cancel() comes from the outer code: if the future we are waiting on completes, is cancelled, or the timeout fires, the code continues without entering this block. So this block only handles an explicit outer cancellation and should always re-raise that exception.
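
To make the intended behaviour concrete, here is a minimal, self-contained sketch (not part of this PR or the test suite) of the case being described: an explicit cancel() from the outer code while the inner future is still pending, which wait_for() should always re-raise:

    import asyncio

    async def slow():
        await asyncio.sleep(60)

    async def main():
        # Wrap wait_for() in a task so we can cancel it from the outside.
        task = asyncio.create_task(asyncio.wait_for(slow(), timeout=5))
        await asyncio.sleep(0)   # let wait_for() start waiting
        task.cancel()            # explicit cancellation from the outer code
        try:
            await task
        except asyncio.CancelledError:
            print("cancellation was re-raised, as expected")

    asyncio.run(main())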

@aaliddell commented (Dec 7, 2021):

I don’t think this is a safe assumption; that PR wasn’t merely a mistake of mixing up blocks. The whole source of this issue is that there is a small window after fut completes during which a cancellation can still arrive at ‘await waiter’, which breaks the assumption that we can never get into this block once fut is done.

Currently we seem to be bouncing between two implementations with PRs and counter PRs on the behaviour during this race:

  • One that ignores the cancellation but shouldn’t leak allocated resources.
  • One that ensures it raises cancellation but does leak allocated resources.

I have yet to see a solution that achieves both, and I am concerned it is not possible with the current async primitives, given how we keep going back and forth. The PR as it stands just moves us back to the leaky-but-cancellation-guaranteeing variant.
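
To make the race concrete, here is a small, self-contained reproduction (illustrative only; the Resource class is made up): the inner coroutine completes in the same event-loop iteration in which the external cancellation lands, so wait_for() has to either swallow the cancellation or drop the already-created resource. The exact output depends on the Python version:

    import asyncio

    created = []   # resources handed out by acquire()
    closed = []    # resources that were properly released

    class Resource:
        def close(self):
            closed.append(self)

    async def acquire():
        res = Resource()       # completes on its first scheduling
        created.append(res)
        return res

    async def use_it():
        res = await asyncio.wait_for(acquire(), timeout=1)
        res.close()

    async def main():
        task = asyncio.create_task(use_it())
        await asyncio.sleep(0)   # wait_for() has now scheduled acquire() and suspended
        task.cancel()            # external cancellation lands in the same iteration
        try:
            await task
            print("cancellation swallowed")
        except asyncio.CancelledError:
            print("cancellation propagated")
        print("leaked resources:", len(created) - len(closed))

    asyncio.run(main())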

Reply:

The library I made can handle the race condition, although it does not do so implicitly. The caller is required to know that the wrapped future returns a resource that must be freed explicitly, and to provide a race-condition handler callback that frees it in that case.

I'm not sure whether such a change would be acceptable upstream in the wait_for() implementation. However, as you've pointed out, the issue stems from the current primitives not providing reliable (atomic?) synchronization of running state between the different tasks. I'm not sure that can be solved efficiently, which makes me doubt it will ever be solved at that level.

If my solution were considered for upstream, all calling code would have to be reviewed and given a proper cleanup callback where necessary. That does not sound appealing, but it is better than leaving the problem unresolved. Right now I use the wait_for2 library in most of my applications with great success, but unfortunately I cannot monkey-patch it in for all the dependencies I'm using, because using it effectively may require explicit changes in those libraries.
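
For illustration only, here is a rough sketch of the kind of handler-based API being described. The names (wait_for_with_cleanup, on_lost_result) are invented for this example and are not the actual wait-for2 interface; the point is that when an external cancellation races with completion of the inner future, the already-produced result is handed to a caller-supplied callback instead of being silently dropped:

    import asyncio

    async def wait_for_with_cleanup(awaitable, timeout, *, on_lost_result=None):
        fut = asyncio.ensure_future(awaitable)
        try:
            # asyncio.wait() never cancels fut itself and never raises TimeoutError.
            await asyncio.wait({fut}, timeout=timeout)
        except asyncio.CancelledError:
            # External cancellation.  If fut completed in the race window, hand
            # its result to the cleanup callback instead of dropping it, then
            # propagate the cancellation as usual.
            if fut.done() and not fut.cancelled() and fut.exception() is None:
                if on_lost_result is not None:
                    on_lost_result(fut.result())
            else:
                fut.cancel()
            raise
        if fut.done():
            return fut.result()
        # Timed out: cancel the inner future and report the timeout.
        fut.cancel()
        raise asyncio.TimeoutError()

A caller acquiring a connection would then pass something like on_lost_result=lambda conn: conn.close().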

Dreamsorcerer (Contributor Author) replied:

> I don’t think this is a safe assumption

I'm not sure which part you think is an assumption. But I see the issue you are talking about. After the explicit cancellation, the inner future could complete before control is resumed in wait_for(). I think I can write a test case for this, although I can't think of any reasonable solution yet. It's also likely a very rare occurrence, while the bug reports are talking about common, easily reproducible situations, which have been fixed by these PRs.

I think the tests in wait_for2 are probably not even hitting this condition; they fail because the cancellation is caught at the bottom of this function and a TimeoutError is raised instead. As already mentioned, we should only convert this to a TimeoutError if the cancellation came from our own timeout, but the cancellation's msg argument (which is how we would tell the two apart) seems to keep disappearing.

I'll look into both of these later.
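
For reference, here is a standalone illustration (not taken from this PR) of the mechanism being referred to: the msg passed to Task.cancel() (available since Python 3.9) arrives as the args of the CancelledError raised inside the task, which is what would let wait_for() tell its own timeout-driven cancellation apart from an external one:

    import asyncio

    async def worker():
        try:
            await asyncio.sleep(10)
        except asyncio.CancelledError as e:
            # The message passed to cancel() shows up in the exception's args.
            print("cancelled with args:", e.args)   # ('my-timeout-sentinel',)
            raise

    async def main():
        task = asyncio.create_task(worker())
        await asyncio.sleep(0)
        task.cancel("my-timeout-sentinel")
        try:
            await task
        except asyncio.CancelledError:
            pass

    asyncio.run(main())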

Dreamsorcerer (Contributor Author) replied:

Another case might be the inner future wanting to suppress the cancellation; presumably wait_for() should then suppress it too. I'll see if I can produce tests for all three of these conditions.

Reply:

The removed test that I commented on in your other PR was the specific test for this race. Perhaps it wasn't working right, but this is what it was trying to test, rather than the replacement test, which checks the internal timeout vs. completion case. If you run that test against the reverted implementation, do you see failures?

> It's also likely a very rare occurrence

It's rare, but it was still pretty easy to hit; it was sufficiently reproducible to cause a suite of ~1,000 tests to have at least one failure per CI run. Effectively, an external cancellation would come from the test teardown, which would occasionally hang the test waiting for a pool to empty after a connection had been leaked during the cancelled acquire. So, as a rough approximation, one in 1,000 runs of such a server may deadlock during shutdown.

Some tracking issues that led up to the current implementation:

Reply:

> I think the tests in wait_for2 are probably not even hitting this condition; they fail because the cancellation is caught at the bottom of this function and a TimeoutError is raised instead.

That's fair; I can't recall why the timeout value is so low in the tests. I've adjusted it so the timeout can't fire and the test focuses on the cancellation. I've pushed it here if you're interested: https://github.com/Traktormaster/wait-for2/tree/pr28149

@Dreamsorcerer (Contributor Author) commented (Dec 15, 2021):

OK, so my understanding is that the second race condition I was thinking of can't actually happen. The last block of code checks that the future is not done and then cancels it, so whenever control comes back to the future it will raise the CancelledError. There is nowhere between the done() check and the cancel() call where another task is allowed to run.

I've changed this PR to fix the last condition I mentioned, about deliberate suppression of cancellation (and to tidy up some tests).

As for the original issue of wait_for() suppressing explicit cancellations by accident, this requires a more complex change to asyncio.

@Dreamsorcerer (Contributor Author) commented (Dec 15, 2021):

Taking this test as an example:

    def test_wait_for_does_not_suppress_cancellation(self):
        async def with_for_coro():
            await asyncio.wait_for(coroutine_function(), timeout=1)
            assert False, 'End of with_for_coro. Should not be reached!'

        async def main():
            task = asyncio.create_task(with_for_coro())
            await asyncio.sleep(0)
            self.assertFalse(task.done())
            task.cancel()
            with self.assertRaises(asyncio.CancelledError):
                await task

        asyncio.run(main())

What happens here is that with_for_coro() is started, and the call to wait_for() results in coroutine_function() being appended to the list (loop._ready) to run in the next loop iteration.
Then we cancel with_for_coro(), which appends it to the list to run in the next iteration as well.
Then the await yields back to the loop, which runs coroutine_function() followed by with_for_coro().
When with_for_coro() runs, the CancelledError is raised and it finds the future is already complete.
The function can then either raise the CancelledError even though the future is complete, or return the result even though it received a cancellation, neither of which is the expected behaviour.

The expected behaviour would be that the cancellation interrupts coroutine_function(). In other words, after calling task.cancel(), coroutine_function() should not continue to run without a cancellation.

To achieve this, we would need to establish some kind of dependency between the with_for_coro() future and the coroutine_function() future, such that if with_for_coro() is cancelled, it is always scheduled to run ahead of coroutine_function(). This obviously requires some changes to the event loop. Maybe a function that says fut1 depends on fut2, so that if fut1 and fut2 are in loop._ready at the same time, fut2 always gets called before fut1?
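
A standalone reproduction (not part of this PR) of the scheduling sequence described above, for the wait_for() implementation being discussed here (later Python versions rework wait_for() and may behave differently): the trace shows that coroutine_function() runs to completion after task.cancel(), because its entry in loop._ready sits ahead of the wakeup for with_for_coro(); the final entry then shows which way the race was resolved:

    import asyncio

    order = []

    async def coroutine_function():
        order.append("coroutine_function ran to completion")

    async def with_for_coro():
        await asyncio.wait_for(coroutine_function(), timeout=1)
        order.append("with_for_coro resumed with the result")

    async def main():
        task = asyncio.create_task(with_for_coro())
        await asyncio.sleep(0)   # with_for_coro() has scheduled coroutine_function() and suspended
        task.cancel()            # the wakeup for with_for_coro() queues behind coroutine_function()
        try:
            await task
        except asyncio.CancelledError:
            order.append("with_for_coro ended with CancelledError")
        print(order)

    asyncio.run(main())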


         if fut.done():
             return fut.result()
@@ -516,7 +509,7 @@ async def _cancel_and_wait(fut, loop):
     fut.add_done_callback(cb)

     try:
-        fut.cancel(sentinel)
+        fut.cancel()
         # We cannot wait on *fut* directly to make
         # sure _cancel_and_wait itself is reliably cancellable.
         await waiter