Possible race condition failure in uasyncio wait_for/wait_for_ms · Issue #7386 · micropython/micropython · GitHub
Closed

jonathanhogg opened this issue Jun 12, 2021 · 21 comments
Labels
bug extmod Relates to extmod/ directory in source

Comments

@jonathanhogg
Contributor

I'm seeing an issue using wait_for_ms() on reads from a socket, on ESP32 devices with recent HEAD, where occasionally I get an uncaught task CancelledError logged to the terminal and the wait_for_ms() never returns, locking up my code.

I'm using UDP sockets, so I'm not using the built-in Stream stuff, instead I'm rolling my own using the same underlying yield core._io_queue.queue_read() trick.

I suspect that it's triggered by a packet arriving at the same time as the wait ending, but I've had a gander at the uasyncio code and I can't see any obvious logic errors.

I'm using an ugly workaround at the moment (ditching the IOQueue and just repeatedly attempting to read from a non-blocking socket) as I have to have this code working this week, but I'll try to get back to it next week and see if I can reduce it to a smaller failing case and work out what's going wrong.
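As a rough sketch of that workaround (names here are my own, not from the thread), it amounts to retrying a non-blocking `recvfrom()` and yielding to the scheduler between attempts instead of going through the IOQueue. On MicroPython the same code would use `import uasyncio as asyncio`:

```python
# Sketch of the non-blocking polling workaround (illustrative names only;
# on MicroPython, substitute `import uasyncio as asyncio`).
import asyncio
import socket

async def recvfrom_poll(sock, count, poll_s=0.005):
    """Retry a non-blocking recvfrom(), yielding between attempts."""
    sock.setblocking(False)
    while True:
        try:
            # Returns (data, addr) as soon as a packet is queued.
            return sock.recvfrom(count)
        except OSError:
            # EAGAIN/EWOULDBLOCK: nothing queued yet; let other tasks run.
            await asyncio.sleep(poll_s)
```

Because the coroutine yields on every retry, it can still be wrapped in `wait_for()` and cancelled cleanly, at the cost of polling latency.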

@dpgeorge dpgeorge added the extmod Relates to extmod/ directory in source label Jun 13, 2021
@dpgeorge
Member

What is the exact error printed out? Might be related to #7289.

If you can get a reproduction (even a large amount of code) that would allow it to be investigated.

@jonathanhogg
Contributor Author
jonathanhogg commented Jun 13, 2021

OK, I have a reliable repro for this.

receiver.py to run on the (ESP32) device:

import socket
import uasyncio as asyncio
import network

async def async_recvfrom(sock, count):
    yield asyncio.core._io_queue.queue_read(sock)
    return sock.recvfrom(count)

async def receive_loop(ssid, password, listen_port=49000):
    wlan = network.WLAN(network.STA_IF)
    wlan.active(True)
    wlan.connect(ssid, password)
    while wlan.status() != network.STAT_GOT_IP:
        await asyncio.sleep_ms(500)
    print("IP address: {}".format(wlan.ifconfig()[0]))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', listen_port))
    while True:
        try:
            data, (host, port) = await asyncio.wait_for(async_recvfrom(sock, 1500), 1)
            print("Received {} bytes from {}:{}".format(len(data), host, port))
        except asyncio.TimeoutError:
            pass

def run(ssid, password):
    asyncio.run(receive_loop(ssid, password))

And sender.py to run on something else:

import socket
import time

def run(host, port=49000, data=b'Hello world!'):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    while True:
        sock.send(data)
        print("Sent {} bytes to {}:{}".format(len(data), host, port))
        time.sleep(1)

Log from the device:

>>> import receiver
>>> receiver.run('[***]', '[***]')
IP address: 192.168.178.47
Received 12 bytes from 192.168.178.20:49450
Received 12 bytes from 192.168.178.20:49450
Received 12 bytes from 192.168.178.20:49450
Task exception wasn't retrieved
future: <Task> coro= <Task>
<class 'CancelledError'>
>>> 

You can see that it doesn't typically take long to fail. I have deliberately made the wait on both ends the same as I'm pretty sure it's a race condition. In this case the scheduler has exited, realising that there are no runnable tasks. In my real code I have another task running so the networking loop just hangs up.

This is running with a bunch of patches (my outstanding PRs and Mike Teachman's i2s) on top of da8aad1.

It doesn't feel like it's related to #7289.

@jonathanhogg
Contributor Author

For good measure, I've also reproduced it on current HEAD (66115a3) with IDF 4.2 and with the esp32-20210418-v1.15.bin release firmware.

@jonathanhogg
Contributor Author

Out of curiosity I tried taking current HEAD and reverting out b505971, which looks to be the last thing that touched wait_for().

I made a small tweak in my test code above to print out a message on timeout to confirm that both branches are being exercised and I've had this build running that code for a few minutes now without it failing.

So I suspect that the fix to #5797 introduced another bug.

I'd like to dig deeper into this code, but I really need to get back to the actual work I'm supposed to be doing today… 😀

@dpgeorge
Member

Thanks for the reproduction. I can get the above code to crash on ESP32 (TinyPICO using the GENERIC build) as of the latest commit 66115a3. But it did not crash a PYBD-SF2.

Task exception wasn't retrieved
future: <Task> coro= <Task>
<class 'CancelledError'>

There is something wrong with this output: it should be printing an exception instance not a class. So that's a hint as to what is wrong (and I have seen this before on an unrelated project, but it was not possible to reproduce it).

So I suspect that the fix to #5797 introduced another bug.

The fix could have also uncovered another bug. Investigation will tell.

@dpgeorge
Member

I can reproduce the same issue with the following code that does not use sockets:

import uasyncio as asyncio
import machine

async def wait_flag(flag):
    await flag.wait()

async def receive_loop():
    flag = asyncio.ThreadSafeFlag()
    machine.Timer(0).init(freq=99.9, callback=lambda t:flag.set())
    while True:
        try:
            await asyncio.wait_for(wait_flag(flag), 0.01)
            print("Flag was set")
        except asyncio.TimeoutError:
            print("TimeoutError")

def run():
    asyncio.run(receive_loop())

But I only see the error if WLAN is connected to an access point...

@jonathanhogg
Contributor Author

But I only see the error if WLAN is connected to an access point...

Curiouser and curiouser…

I'd love to get stuck in and help, but I'm aware that a) you know the code, and b) I really need to finish this proposal…

@hoihu
Contributor
hoihu commented Jun 13, 2021

I can confirm that we are seeing a closely related issue (on an STM32F777-based system, no WLAN), specifically the <class 'CancelledError'> error.

We had not seen this issue before updating from 1.13 to the current 1.15 HEAD. We haven't (yet) tried reverting b505971.

On our device the fault shows up during integration testing, usually under higher communication loads (there are several UARTs and a USB VCP).

@dpgeorge
Member

I found a simple reproduction that does not use wait_for() (but is a reduction of the wait_for code):

import time
import uasyncio as asyncio

TIMEOUT = 2


async def sleep_task():
    print(time.ticks_us(), "sleep_task sleep")
    await asyncio.sleep(0)


async def wait_task(t):
    print(time.ticks_us(), "wait_task wait")
    await t


async def main():
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))

    print(time.ticks_us(), "main sleep")
    await asyncio.sleep_ms(TIMEOUT)

    waiting_task.cancel()
    print(time.ticks_us(), "main wait")
    await waiting_task


asyncio.run(main())

It errors out (most of the time, not always) with:

153758444 main sleep
153759156 sleep_task sleep
153759933 wait_task wait
153761033 main wait
Task exception wasn't retrieved
future: <Task> coro= <Task>
<class 'CancelledError'>

It will error on ESP32 without WLAN active. It will error on PYBD-SF2 running at 48MHz, but not when running faster.

@dpgeorge
Member

And here's a deterministic reproduction that also fails on the unix port:

import uasyncio as asyncio


async def sleep_task():
    print("sleep_task sleep")
    await asyncio.sleep(0)
    print("sleep_task wake")


async def wait_task(t):
    print("wait_task wait")
    await t
    print("wait_task wake")


async def main():
    waiting_task = asyncio.create_task(wait_task(asyncio.create_task(sleep_task())))

    print("main sleep")
    await asyncio.sleep(0)
    await asyncio.sleep(0)

    waiting_task.cancel()
    print("main wait")
    await waiting_task
    print("main wake")


asyncio.run(main())

@dpgeorge
Member

The issue seems to be when task A waits on task B, and there's a race between cancellation of task A and completion of task B. If task B completes just before task A is cancelled then the cancellation of A does not work but rather prints the "Task exception wasn't retrieved" message.

dpgeorge added a commit to dpgeorge/micropython that referenced this issue Jun 14, 2021
This commit fixes a problem with a race between cancellation of task A and
completion of task B, when A waits on B.  If task B completes just before
task A is cancelled then the cancellation of A does not work.  Instead,
the CancelledError meant to cancel A gets passed through to B (that's
expected behaviour) but B handles it as a "Task exception wasn't retrieved"
scenario, printing out such a message (this is because finished tasks point
their "coro" attribute to themselves to indicate they are done, and
implement the throw() method, but that method inadvertently catches the
CancelledError).  The correct behaviour is for B to bounce that
CancelledError back out.

The fix here reworks the "waiting" attribute of Task to be called "state"
and uses it to indicate whether a task is: running and not awaited on,
running and awaited on, finished and not awaited on, or finished and
awaited on.  This means the task does not need to point "coro" to itself to
indicate finished, and also allows removal of the throw() method.

A benefit of this is that "Task exception wasn't retrieved" messages can go
back to being able to print the name of the coroutine function.

Fixes issue micropython#7386.

Signed-off-by: Damien George <damien@micropython.org>
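To make the mechanism in that commit message concrete, here is a toy model (my own illustration, not the actual uasyncio Task code): a finished task pointed `coro` at itself and implemented `throw()`, so an exception thrown into it was stored rather than propagated:

```python
# Toy model of the pre-fix behaviour described in the commit message above.
# Illustrative only: this is NOT the real uasyncio Task implementation.

class FinishedTaskModel:
    """Stands in for a task that has already run to completion."""

    def __init__(self, result):
        self.coro = self   # finished tasks pointed "coro" at themselves
        self.data = result

    def throw(self, exc):
        # The scheduler calls coro.throw(exc) to deliver a CancelledError
        # to a waiting task; here the exception is absorbed instead of
        # re-raised, which later surfaces as the
        # "Task exception wasn't retrieved" message.
        self.data = exc


done = FinishedTaskModel(result=42)
done.coro.throw(Exception("CancelledError stand-in"))
# The exception was swallowed rather than bounced back out:
assert isinstance(done.data, Exception)
```

The fix replaces this self-referential `coro` trick with an explicit task state, so there is no `throw()` left to inadvertently catch the CancelledError.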
@dpgeorge
Member

See #7399 for a proposed fix.

@dpgeorge
Member

Summary of the bug:

  • occurs when using wait_for(T, S), if the task T being waited on finishes at exactly the same time as the wait-for timeout S expires
  • task T will have run to completion
  • the Task exception wasn't retrieved message is printed with <class 'CancelledError'> as the error (ie no traceback)
  • the wait_for(T, S) call never returns (it's never put back on the uasyncio run queue) and all tasks waiting on this are blocked forever from running
  • uasyncio otherwise continues to function and other tasks continue to be scheduled as normal

dpgeorge added a commit to dpgeorge/micropython that referenced this issue Jun 15, 2021
This commit fixes a problem with a race between cancellation of task A and
completion of task B, when A waits on B.  If task B completes just before
task A is cancelled then the cancellation of A does not work.  Instead,
the CancelledError meant to cancel A gets passed through to B (that's
expected behaviour) but B handles it as a "Task exception wasn't retrieved"
scenario, printing out such a message (this is because finished tasks point
their "coro" attribute to themselves to indicate they are done, and
implement the throw() method, but that method inadvertently catches the
CancelledError).  The correct behaviour is for B to bounce that
CancelledError back out.

This bug is mainly seen when wait_for() is used, and in that context the
symptoms are:
- occurs when using wait_for(T, S), if the task T being waited on finishes
  at exactly the same time as the wait-for timeout S expires
- task T will have run to completion
- the "Task exception wasn't retrieved message" is printed with
  "<class 'CancelledError'>" as the error (ie no traceback)
- the wait_for(T, S) call never returns (it's never put back on the
  uasyncio run queue) and all tasks waiting on this are blocked forever
  from running
- uasyncio otherwise continues to function and other tasks continue to be
  scheduled as normal

The fix here reworks the "waiting" attribute of Task to be called "state"
and uses it to indicate whether a task is: running and not awaited on,
running and awaited on, finished and not awaited on, or finished and
awaited on.  This means the task does not need to point "coro" to itself to
indicate finished, and also allows removal of the throw() method.

A benefit of this is that "Task exception wasn't retrieved" messages can go
back to being able to print the name of the coroutine function.

Fixes issue micropython#7386.

Signed-off-by: Damien George <damien@micropython.org>
@jonathanhogg
Contributor Author

Can confirm that fix in #7399 works on my repro test case. I've sucked it into my branch and have switched my code back to the original wait_for_ms() and yield ...queue_read() combo and will run this today to see if it remains stable.

@jonathanhogg
Contributor Author

Have had this patch running all day with my networking code without any obvious failures, so it looks good to me.

dpgeorge added a commit to dpgeorge/micropython that referenced this issue Jun 16, 2021
@jonathanhogg
Contributor Author

(Updated to 514bf1a patch and that also seems solid today)

@dpgeorge
Member

Thanks @jonathanhogg for testing. Now merged in 514bf1a

@alessionossa
alessionossa commented Jan 15, 2022

I think I'm still facing this issue using MicroPython v1.17 on 2021-09-02; ESP32 module with ESP32.

The code for reproducing the issue is the following:

import network
import struct
import socket
import uasyncio
import gc

# Helper function, parse IPv4 data
def parseudppacket(data):
    strLen = len(data) - 28
    # https://en.wikipedia.org/wiki/IPv4#Header
    unpacked = struct.unpack(f"!BB3HBBHII4H{strLen}s", data)
    return unpacked

# Helper function, converts int to IPv4 string
def intToIPv4(int):
    ipData = struct.pack("I", int)
    components = struct.unpack("4B", ipData)
    return f"{ components[3] }.{ components[2] }.{ components[1] }.{ components[0] }"

# Return the first packet that has device's IP Address as destination
async def _getNatPunchResponse(infoClientSocket):
    try:
        foundPacket = False

        while not foundPacket:
            message = infoClientSocket.recv(2048)
            dataExtracted = parseudppacket(message)
            ipDest = intToIPv4(dataExtracted[9])
            currentWlanIp = network.WLAN(network.STA_IF).ifconfig()[0]
            print(f"Received packet for { ipDest }")
            if (ipDest == currentWlanIp):
                foundPacket = True
        
        return dataExtracted
    except uasyncio.CancelledError:  # Task sees CancelledError
        print('Trapped cancelled error.')
        return None
    except Exception as sendUdpError:
        print(f"Error receiving connection info from server: { sendUdpError }")

async def natPunchDaemon():

    while True:
        infoClientSocket = None

        infoClientSocket = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_UDP)
        infoClientSocket.settimeout(3)

        dataExtracted = None
        try:
            # _getNatPunchResponse is supposed to be "cancelled" after a second, but it continues its execution
            dataExtracted = await uasyncio.wait_for(_getNatPunchResponse(infoClientSocket), 1)
        except uasyncio.TimeoutError:  # Mandatory error trapping
            print('Response maximum time exceeded (timeout)')

        if dataExtracted is not None:
            print(f"Received string data: { dataExtracted }")

        if infoClientSocket is not None:
            infoClientSocket.close()
        
        await uasyncio.sleep(6)

def startNatPunch():
    uasyncio.run(natPunchDaemon())

@peterhinch
Contributor

I think the loop beginning

        while not foundPacket:

is hogging the scheduler. Try adding await uasyncio.sleep_ms(0) in the loop.
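A minimal sketch of that suggestion (illustrative names; on MicroPython, use `uasyncio` and `await uasyncio.sleep_ms(0)`): yielding once per iteration gives `wait_for()` a point at which it can cancel the loop:

```python
# Sketch of the suggestion above: yield inside the busy loop so the
# scheduler can run (and cancel) other tasks. Names are illustrative.
import asyncio

async def poll_until(predicate):
    while not predicate():
        # On MicroPython: await uasyncio.sleep_ms(0)
        await asyncio.sleep(0)

async def demo():
    try:
        # Without the sleep above, this wait_for could never time out,
        # because the loop would never hand control back to the scheduler.
        await asyncio.wait_for(poll_until(lambda: False), 0.05)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(demo()))
```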

@dpgeorge
Member

Indeed, _getNatPunchResponse never awaits.

@jonathanhogg
Contributor Author

@alessionossa: if you look at my repro code way up in this thread you'll see a simple mechanism for doing async recvs from a UDP socket. Something similar to that should solve your problem.

5 participants