Possible race condition failure in uasyncio wait_for/wait_for_ms #7386
What is the exact error printed out? Might be related to #7289. If you can get a reproduction (even a large amount of code), that would allow it to be investigated.
OK, I have a reliable repro for this.

```python
import socket
import uasyncio as asyncio
import network


async def async_recvfrom(sock, count):
    yield asyncio.core._io_queue.queue_read(sock)
    return sock.recvfrom(count)


async def receive_loop(ssid, password, listen_port=49000):
    wlan = network.WLAN(network.STA_IF)
    wlan.active(True)
    wlan.connect(ssid, password)
    while wlan.status() != network.STAT_GOT_IP:
        await asyncio.sleep_ms(500)
    print("IP address: {}".format(wlan.ifconfig()[0]))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', listen_port))
    while True:
        try:
            data, (host, port) = await asyncio.wait_for(async_recvfrom(sock, 1500), 1)
            print("Received {} bytes from {}:{}".format(len(data), host, port))
        except asyncio.TimeoutError:
            pass


def run(ssid, password):
    asyncio.run(receive_loop(ssid, password))
```

And:

```python
import socket
import time


def run(host, port=49000, data=b'Hello world!'):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    while True:
        sock.send(data)
        print("Sent {} bytes to {}:{}".format(len(data), host, port))
        time.sleep(1)
```

Log from the device:
You can see that it doesn't typically take long to fail. I have deliberately made the wait the same on both ends, as I'm pretty sure it's a race condition. In this case the scheduler has exited after realising that there are no runnable tasks; in my real code I have another task running, so the networking loop just hangs instead. This is running with a bunch of patches (my outstanding PRs and Mike Teachman's I2S work) on top of da8aad1. It doesn't feel like it's related to #7289.
For good measure, I've also reproduced it on current HEAD (66115a3) with IDF 4.2.
Out of curiosity, I tried taking current HEAD and reverting b505971, which looks to be the last change that touched this area. I made a small tweak to my test code above to print out a message on timeout, to confirm that both branches are being exercised, and I've had this build running that code for a few minutes now without it failing. So I suspect that the fix for #5797 introduced another bug. I'd like to dig deeper into this code, but I really need to get back to the actual work I'm supposed to be doing today… 😀
Thanks for the reproduction. I can get the above code to crash on an ESP32 (a TinyPICO using the GENERIC build) as of the latest commit 66115a3, but it did not crash a PYBD-SF2.

There is something wrong with this output: it should be printing an exception instance, not a class. That's a hint as to what is wrong (I have seen this before on an unrelated project, but it was not possible to reproduce it there).

The fix for #5797 could also have uncovered another bug. Investigation will tell.
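As an illustration of that hint (a trivial sketch, not from the thread): printing an exception class and an exception instance produce visibly different output, which is why the log line is informative.

```python
import uasyncio as asyncio

# A class prints as "<class 'CancelledError'>" -- the suspicious form seen
# in the log -- while a constructed instance has a normal exception repr.
print(asyncio.CancelledError)          # -> <class 'CancelledError'>
print(repr(asyncio.CancelledError()))  # -> CancelledError()
```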
I can reproduce the same issue with the following code that does not use sockets:

```python
import uasyncio as asyncio
import machine


async def wait_flag(flag):
    await flag.wait()


async def receive_loop():
    flag = asyncio.ThreadSafeFlag()
    machine.Timer(0).init(freq=99.9, callback=lambda t: flag.set())
    while True:
        try:
            await asyncio.wait_for(wait_flag(flag), 0.01)
            print("Flag was set")
        except asyncio.TimeoutError:
            print("TimeoutError")


def run():
    asyncio.run(receive_loop())
```

But I only see the error if WLAN is connected to an access point...
Curiouser and curiouser… I'd love to get stuck in and help, but I'm aware that a) you know the code, and b) I really need to finish this proposal…
I can confirm that we see a very closely related issue on an STM32F777-based system (no WLAN). We had not seen this issue before we updated from 1.13 to the current 1.15 HEAD. We haven't (yet) tried to revert b505971. On our device the fault shows up during integration testing, usually under higher communication loads (there are several UARTs and a USB VCP).
I found a simple reproduction that does not use sockets or hardware timers:

```python
import time
import uasyncio as asyncio

TIMEOUT = 2


async def sleep_task():
    print(time.ticks_us(), "sleep_task sleep")
    await asyncio.sleep(0)


async def wait_task(t):
    print(time.ticks_us(), "wait_task wait")
    await t


async def main():
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))
    print(time.ticks_us(), "main sleep")
    await asyncio.sleep_ms(TIMEOUT)
    waiting_task.cancel()
    print(time.ticks_us(), "main wait")
    await waiting_task


asyncio.run(main())
```

It errors out most of the time, though not always. It will error on an ESP32 without WLAN active, and on a PYBD-SF2 running at 48MHz, but not when the PYBD-SF2 runs faster.
And here's a deterministic reproduction that also fails on the unix port:

```python
import uasyncio as asyncio


async def sleep_task():
    print("sleep_task sleep")
    await asyncio.sleep(0)
    print("sleep_task wake")


async def wait_task(t):
    print("wait_task wait")
    await t
    print("wait_task wake")


async def main():
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))
    print("main sleep")
    await asyncio.sleep(0)
    await asyncio.sleep(0)
    waiting_task.cancel()
    print("main wait")
    await waiting_task
    print("main wake")


asyncio.run(main())
```
The issue seems to be when task A waits on task B and there's a race between the cancellation of task A and the completion of task B. If task B completes just before task A is cancelled, then the cancellation of A does not work; instead the "Task exception wasn't retrieved" message is printed.
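For comparison, here is a rough CPython equivalent of the deterministic repro above (a sketch, not from the thread): CPython's asyncio is expected to deliver the CancelledError to the awaiting task even when the inner task has just finished, so the final `await` raises rather than hanging.

```python
import asyncio

async def sleep_task():
    await asyncio.sleep(0)

async def wait_task(t):
    await t  # task A suspends here, waiting on task B

async def main():
    # Task A (wait_task) waits on task B (sleep_task).
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))
    await asyncio.sleep(0)
    await asyncio.sleep(0)   # by now B has completed; A's wake-up is pending
    waiting_task.cancel()    # cancel A inside the race window
    try:
        await waiting_task
    except asyncio.CancelledError:
        print("cancellation delivered as expected")

asyncio.run(main())
```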
See #7399 for a proposed fix. Summary of the bug, from the commit message:

> This commit fixes a problem with a race between cancellation of task A and completion of task B, when A waits on B. If task B completes just before task A is cancelled then the cancellation of A does not work. Instead, the CancelledError meant to cancel A gets passed through to B (that's expected behaviour) but B handles it as a "Task exception wasn't retrieved" scenario, printing out such a message (this is because finished tasks point their "coro" attribute to themselves to indicate they are done, and implement the throw() method, but that method inadvertently catches the CancelledError). The correct behaviour is for B to bounce that CancelledError back out.
>
> This bug is mainly seen when wait_for() is used, and in that context the symptoms are:
>
> - it occurs when using wait_for(T, S), if the task T being waited on finishes at exactly the same time as the wait-for timeout S expires
> - task T will have run to completion
> - the "Task exception wasn't retrieved" message is printed with "<class 'CancelledError'>" as the error (i.e. no traceback)
> - the wait_for(T, S) call never returns (it's never put back on the uasyncio run queue) and all tasks waiting on it are blocked from running forever
> - uasyncio otherwise continues to function and other tasks continue to be scheduled as normal
>
> The fix here reworks the "waiting" attribute of Task to be called "state" and uses it to indicate whether a task is: running and not awaited on, running and awaited on, finished and not awaited on, or finished and awaited on. This means the task does not need to point "coro" to itself to indicate it is finished, and it also allows removal of the throw() method. A benefit of this is that "Task exception wasn't retrieved" messages can go back to printing the name of the coroutine function.
>
> Fixes issue micropython#7386.
>
> Signed-off-by: Damien George <damien@micropython.org>
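To make the state rework concrete, here is a hypothetical sketch of the bookkeeping it describes (the `_STATE_*` names and class layout below are invented for illustration and are not the actual uasyncio source):

```python
# Four states replace the old trick of pointing task.coro at the task
# itself (plus a throw() method) to mark completion.
_STATE_RUNNING = 0           # running, no other task awaits on it
_STATE_RUNNING_AWAITED = 1   # running, another task awaits on it
_STATE_DONE = 2              # finished, result/exception not yet retrieved
_STATE_DONE_AWAITED = 3      # finished, result/exception retrieved

class Task:
    def __init__(self, coro):
        self.coro = coro             # always the real coroutine, even when done,
        self.state = _STATE_RUNNING  # ...so its name stays printable in messages
        self.data = None             # holds the result or pending exception

    def done(self):
        return self.state in (_STATE_DONE, _STATE_DONE_AWAITED)
```

Because a finished task no longer masquerades as a coroutine with its own throw() method, a CancelledError thrown at it can no longer be accidentally swallowed.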
Can confirm that the fix in #7399 works on my repro test case. I've sucked it into my branch and have switched my code back to the original IOQueue-based approach.
Have had this patch running all day with my networking code without any obvious failures, so it looks good to me.
(Updated to the 514bf1a patch and that also seems solid today.)
Thanks @jonathanhogg for testing. Now merged in 514bf1a.
I think I'm still facing this issue using MicroPython v1.17 on 2021-09-02; ESP32 module with ESP32. The code for reproducing the issue is the following:

```python
import network
import struct
import socket
import uasyncio
import gc


# Helper function: parse IPv4 data
def parseudppacket(data):
    strLen = len(data) - 28
    # https://en.wikipedia.org/wiki/IPv4#Header
    unpacked = struct.unpack(f"!BB3HBBHII4H{strLen}s", data)
    return unpacked


# Helper function: convert an int to an IPv4 string
def intToIPv4(int):
    ipData = struct.pack("I", int)
    components = struct.unpack("4B", ipData)
    return f"{ components[3] }.{ components[2] }.{ components[1] }.{ components[0] }"


# Return the first packet that has the device's IP address as destination
async def _getNatPunchResponse(infoClientSocket):
    try:
        foundPacket = False
        while not foundPacket:
            message = infoClientSocket.recv(2048)
            dataExtracted = parseudppacket(message)
            ipDest = intToIPv4(dataExtracted[9])
            currentWlanIp = network.WLAN(network.STA_IF).ifconfig()[0]
            print(f"Received packet for { ipDest }")
            if (ipDest == currentWlanIp):
                foundPacket = True
        return dataExtracted
    except uasyncio.CancelledError:  # Task sees CancelledError
        print('Trapped cancelled error.')
        return None
    except Exception as sendUdpError:
        print(f"Error receiving connection info from server: { sendUdpError }")


async def natPunchDaemon():
    while True:
        infoClientSocket = None
        infoClientSocket = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_UDP)
        infoClientSocket.settimeout(3)
        dataExtracted = None
        try:
            # _getNatPunchResponse is supposed to be "cancelled" after a second,
            # but it continues its execution
            dataExtracted = await uasyncio.wait_for(_getNatPunchResponse(infoClientSocket), 1)
        except uasyncio.TimeoutError:  # Mandatory error trapping
            print('Response maximum time exceeded (timeout)')
        if dataExtracted is not None:
            print(f"Received string data: { dataExtracted }")
        if infoClientSocket is not None:
            infoClientSocket.close()
        await uasyncio.sleep(6)


def startNatPunch():
    uasyncio.run(natPunchDaemon())
```
I think the loop beginning `while not foundPacket:` is hogging the scheduler. Try adding `await uasyncio.sleep_ms(0)` inside it.
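A sketch of that change (assuming the zero-delay sleep goes at the end of each iteration; the rest of `_getNatPunchResponse` above is unchanged):

```python
while not foundPacket:
    message = infoClientSocket.recv(2048)
    dataExtracted = parseudppacket(message)
    ipDest = intToIPv4(dataExtracted[9])
    currentWlanIp = network.WLAN(network.STA_IF).ifconfig()[0]
    print(f"Received packet for { ipDest }")
    if (ipDest == currentWlanIp):
        foundPacket = True
    # Yield back to the scheduler on every iteration so that wait_for()'s
    # timeout (and the cancellation it triggers) gets a chance to run.
    await uasyncio.sleep_ms(0)
```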
Indeed, that fixed it.
@alessionossa: if you look at my repro code way up in this thread, you'll see a simple mechanism for doing an async `recvfrom()`.
I'm seeing an issue using `wait_for_ms()` on reads from a socket, on ESP32 devices with recent HEAD, where occasionally I get an uncaught task CancelledError logged to the terminal and the `wait_for_ms()` never returns, locking up my code.

I'm using UDP sockets, so I'm not using the built-in `Stream` stuff; instead I'm rolling my own using the same underlying `yield core._io_queue.queue_read()` trick.

I suspect that it's triggered by a packet arriving at the same time as the wait ending, but I've had a gander at the `uasyncio` code and I can't see any obvious logic errors.

I'm using an ugly workaround at the moment (ditching the IOQueue and just repeatedly attempting to read from a non-blocking socket) as I have to have this code working this week, but I'll try to get back to it next week and see if I can reduce it to a smaller failing case and work out what's going wrong.