Possible race condition failure in uasyncio wait_for/wait_for_ms #7386
What is the exact error printed out? Might be related to #7289. If you can get a reproduction (even a large amount of code), that would allow it to be investigated.
OK, I have a reliable repro for this.

```python
import socket
import uasyncio as asyncio
import network


async def async_recvfrom(sock, count):
    yield asyncio.core._io_queue.queue_read(sock)
    return sock.recvfrom(count)


async def receive_loop(ssid, password, listen_port=49000):
    wlan = network.WLAN(network.STA_IF)
    wlan.active(True)
    wlan.connect(ssid, password)
    while wlan.status() != network.STAT_GOT_IP:
        await asyncio.sleep_ms(500)
    print("IP address: {}".format(wlan.ifconfig()[0]))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', listen_port))
    while True:
        try:
            data, (host, port) = await asyncio.wait_for(async_recvfrom(sock, 1500), 1)
            print("Received {} bytes from {}:{}".format(len(data), host, port))
        except asyncio.TimeoutError:
            pass


def run(ssid, password):
    asyncio.run(receive_loop(ssid, password))
```

And:

```python
import socket
import time


def run(host, port=49000, data=b'Hello world!'):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    while True:
        sock.send(data)
        print("Sent {} bytes to {}:{}".format(len(data), host, port))
        time.sleep(1)
```

Log from the device:
You can see that it doesn't typically take long to fail. I have deliberately made the wait the same on both ends, as I'm pretty sure it's a race condition. In this case the scheduler has exited after realising that there are no runnable tasks; in my real code I have another task running, so the networking loop just hangs instead. This is running with a bunch of patches (my outstanding PRs and Mike Teachman's I2S work) on top of da8aad1. It doesn't feel like it's related to #7289.
For good measure, I've also reproduced it on current HEAD (66115a3) with IDF 4.2.
Out of curiosity, I tried taking current HEAD and reverting b505971, which looks to be the last change that touched this area. I made a small tweak to my test code above to print out a message on timeout, to confirm that both branches are being exercised, and I've had this build running that code for a few minutes now without it failing. So I suspect that the fix for #5797 introduced another bug. I'd like to dig deeper into this code, but I really need to get back to the actual work I'm supposed to be doing today… 😀
Thanks for the reproduction. I can get the above code to crash on an ESP32 (a TinyPICO using the GENERIC build) as of the latest commit 66115a3, but it did not crash a PYBD-SF2.

There is something wrong with this output: it should be printing an exception instance, not a class. That's a hint as to what is wrong (I have seen this before on an unrelated project, but it was not possible to reproduce it there).

The fix for #5797 could also have uncovered another bug. Investigation will tell.
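As an illustration of that hint (a trivial sketch, not from the thread): printing an exception class and an exception instance produce visibly different output, which is why the log line is informative.

```python
import uasyncio as asyncio

# A class prints as "<class 'CancelledError'>" -- the suspicious form seen
# in the log -- while a constructed instance has a normal exception repr.
print(asyncio.CancelledError)          # -> <class 'CancelledError'>
print(repr(asyncio.CancelledError()))  # -> CancelledError()
```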
I can reproduce the same issue with the following code that does not use sockets:

```python
import uasyncio as asyncio
import machine


async def wait_flag(flag):
    await flag.wait()


async def receive_loop():
    flag = asyncio.ThreadSafeFlag()
    machine.Timer(0).init(freq=99.9, callback=lambda t: flag.set())
    while True:
        try:
            await asyncio.wait_for(wait_flag(flag), 0.01)
            print("Flag was set")
        except asyncio.TimeoutError:
            print("TimeoutError")


def run():
    asyncio.run(receive_loop())
```

But I only see the error if WLAN is connected to an access point...
Curiouser and curiouser… I'd love to get stuck in and help, but I'm aware that a) you know the code, and b) I really need to finish this proposal…
I can confirm that we see a very closely related issue on an STM32F777-based system (no WLAN). We had not seen this issue before we updated from 1.13 to the current 1.15 HEAD. We haven't (yet) tried to revert b505971. On our device the fault shows up during integration testing, usually under higher communication loads (there are several UARTs and a USB VCP).
I found a simple reproduction that does not use sockets or hardware timers:

```python
import time
import uasyncio as asyncio

TIMEOUT = 2


async def sleep_task():
    print(time.ticks_us(), "sleep_task sleep")
    await asyncio.sleep(0)


async def wait_task(t):
    print(time.ticks_us(), "wait_task wait")
    await t


async def main():
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))
    print(time.ticks_us(), "main sleep")
    await asyncio.sleep_ms(TIMEOUT)
    waiting_task.cancel()
    print(time.ticks_us(), "main wait")
    await waiting_task


asyncio.run(main())
```

It errors out most of the time, though not always. It will error on an ESP32 without WLAN active, and on a PYBD-SF2 running at 48MHz, but not when the PYBD-SF2 runs faster.
And here's a deterministic reproduction that also fails on the unix port:

```python
import uasyncio as asyncio


async def sleep_task():
    print("sleep_task sleep")
    await asyncio.sleep(0)
    print("sleep_task wake")


async def wait_task(t):
    print("wait_task wait")
    await t
    print("wait_task wake")


async def main():
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))
    print("main sleep")
    await asyncio.sleep(0)
    await asyncio.sleep(0)
    waiting_task.cancel()
    print("main wait")
    await waiting_task
    print("main wake")


asyncio.run(main())
```
The issue seems to be when task A waits on task B and there's a race between the cancellation of task A and the completion of task B. If task B completes just before task A is cancelled, then the cancellation of A does not work; instead the "Task exception wasn't retrieved" message is printed.
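For comparison, here is a rough CPython equivalent of the deterministic repro above (a sketch, not from the thread): CPython's asyncio is expected to deliver the CancelledError to the awaiting task even when the inner task has just finished, so the final `await` raises rather than hanging.

```python
import asyncio

async def sleep_task():
    await asyncio.sleep(0)

async def wait_task(t):
    await t  # task A suspends here, waiting on task B

async def main():
    # Task A (wait_task) waits on task B (sleep_task).
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))
    await asyncio.sleep(0)
    await asyncio.sleep(0)   # by now B has completed; A's wake-up is pending
    waiting_task.cancel()    # cancel A inside the race window
    try:
        await waiting_task
    except asyncio.CancelledError:
        print("cancellation delivered as expected")

asyncio.run(main())
```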
See #7399 for a proposed fix. Summary of the bug, from the commit message:

> This commit fixes a problem with a race between cancellation of task A and completion of task B, when A waits on B. If task B completes just before task A is cancelled then the cancellation of A does not work. Instead, the CancelledError meant to cancel A gets passed through to B (that's expected behaviour) but B handles it as a "Task exception wasn't retrieved" scenario, printing out such a message (this is because finished tasks point their "coro" attribute to themselves to indicate they are done, and implement the throw() method, but that method inadvertently catches the CancelledError). The correct behaviour is for B to bounce that CancelledError back out.
>
> This bug is mainly seen when wait_for() is used, and in that context the symptoms are:
>
> - it occurs when using wait_for(T, S), if the task T being waited on finishes at exactly the same time as the wait-for timeout S expires
> - task T will have run to completion
> - the "Task exception wasn't retrieved" message is printed with "<class 'CancelledError'>" as the error (i.e. no traceback)
> - the wait_for(T, S) call never returns (it's never put back on the uasyncio run queue) and all tasks waiting on it are blocked from running forever
> - uasyncio otherwise continues to function and other tasks continue to be scheduled as normal
>
> The fix here reworks the "waiting" attribute of Task to be called "state" and uses it to indicate whether a task is: running and not awaited on, running and awaited on, finished and not awaited on, or finished and awaited on. This means the task does not need to point "coro" to itself to indicate it is finished, and it also allows removal of the throw() method. A benefit of this is that "Task exception wasn't retrieved" messages can go back to printing the name of the coroutine function.
>
> Fixes issue micropython#7386.
>
> Signed-off-by: Damien George <damien@micropython.org>
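To make the state rework concrete, here is a hypothetical sketch of the bookkeeping it describes (the `_STATE_*` names and class layout below are invented for illustration and are not the actual uasyncio source):

```python
# Four states replace the old trick of pointing task.coro at the task
# itself (plus a throw() method) to mark completion.
_STATE_RUNNING = 0           # running, no other task awaits on it
_STATE_RUNNING_AWAITED = 1   # running, another task awaits on it
_STATE_DONE = 2              # finished, result/exception not yet retrieved
_STATE_DONE_AWAITED = 3      # finished, result/exception retrieved

class Task:
    def __init__(self, coro):
        self.coro = coro             # always the real coroutine, even when done,
        self.state = _STATE_RUNNING  # ...so its name stays printable in messages
        self.data = None             # holds the result or pending exception

    def done(self):
        return self.state in (_STATE_DONE, _STATE_DONE_AWAITED)
```

Because a finished task no longer masquerades as a coroutine with its own throw() method, a CancelledError thrown at it can no longer be accidentally swallowed.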
Can confirm that the fix in #7399 works on my repro test case. I've sucked it into my branch and have switched my code back to the original IOQueue-based approach.
Have had this patch running all day with my networking code without any obvious failures, so it looks good to me.
(Updated to the 514bf1a patch and that also seems solid today.)
Thanks @jonathanhogg for testing. Now merged in 514bf1a.
I think I'm still facing this issue using MicroPython v1.17 on 2021-09-02; ESP32 module with ESP32. The code for reproducing the issue is the following:

```python
import network
import struct
import socket
import uasyncio
import gc


# Helper function: parse IPv4 data
def parseudppacket(data):
    strLen = len(data) - 28
    # https://en.wikipedia.org/wiki/IPv4#Header
    unpacked = struct.unpack(f"!BB3HBBHII4H{strLen}s", data)
    return unpacked


# Helper function: convert an int to an IPv4 string
def intToIPv4(int):
    ipData = struct.pack("I", int)
    components = struct.unpack("4B", ipData)
    return f"{ components[3] }.{ components[2] }.{ components[1] }.{ components[0] }"


# Return the first packet that has the device's IP address as destination
async def _getNatPunchResponse(infoClientSocket):
    try:
        foundPacket = False
        while not foundPacket:
            message = infoClientSocket.recv(2048)
            dataExtracted = parseudppacket(message)
            ipDest = intToIPv4(dataExtracted[9])
            currentWlanIp = network.WLAN(network.STA_IF).ifconfig()[0]
            print(f"Received packet for { ipDest }")
            if (ipDest == currentWlanIp):
                foundPacket = True
        return dataExtracted
    except uasyncio.CancelledError:  # Task sees CancelledError
        print('Trapped cancelled error.')
        return None
    except Exception as sendUdpError:
        print(f"Error receiving connection info from server: { sendUdpError }")


async def natPunchDaemon():
    while True:
        infoClientSocket = None
        infoClientSocket = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_UDP)
        infoClientSocket.settimeout(3)
        dataExtracted = None
        try:
            # _getNatPunchResponse is supposed to be "cancelled" after a second,
            # but it continues its execution
            dataExtracted = await uasyncio.wait_for(_getNatPunchResponse(infoClientSocket), 1)
        except uasyncio.TimeoutError:  # Mandatory error trapping
            print('Response maximum time exceeded (timeout)')
        if dataExtracted is not None:
            print(f"Received string data: { dataExtracted }")
        if infoClientSocket is not None:
            infoClientSocket.close()
        await uasyncio.sleep(6)


def startNatPunch():
    uasyncio.run(natPunchDaemon())
```
I think the loop beginning `while not foundPacket:` is hogging the scheduler. Try adding `await uasyncio.sleep_ms(0)` inside it.
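A sketch of that change (assuming the zero-delay sleep goes at the end of each iteration; the rest of `_getNatPunchResponse` above is unchanged):

```python
while not foundPacket:
    message = infoClientSocket.recv(2048)
    dataExtracted = parseudppacket(message)
    ipDest = intToIPv4(dataExtracted[9])
    currentWlanIp = network.WLAN(network.STA_IF).ifconfig()[0]
    print(f"Received packet for { ipDest }")
    if (ipDest == currentWlanIp):
        foundPacket = True
    # Yield back to the scheduler on every iteration so that wait_for()'s
    # timeout (and the cancellation it triggers) gets a chance to run.
    await uasyncio.sleep_ms(0)
```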
Indeed, that fixed it.
@alessionossa: if you look at my repro code way up in this thread, you'll see a simple mechanism for doing an async `recvfrom()`.
I'm seeing an issue using `wait_for_ms()` on reads from a socket, on ESP32 devices with recent HEAD, where occasionally I get an uncaught task CancelledError logged to the terminal and the `wait_for_ms()` never returns, locking up my code.

I'm using UDP sockets, so I'm not using the built-in `Stream` stuff; instead I'm rolling my own using the same underlying `yield core._io_queue.queue_read()` trick.

I suspect that it's triggered by a packet arriving at the same time as the wait ending, but I've had a gander at the `uasyncio` code and I can't see any obvious logic errors.

I'm using an ugly workaround at the moment (ditching the IOQueue and just repeatedly attempting to read from a non-blocking socket) as I have to have this code working this week, but I'll try to get back to it next week and see if I can reduce it to a smaller failing case and work out what's going wrong.