8000 idf v5.1 `listen`/`accept` race condition ends up losing clients · Issue #8443 · adafruit/circuitpython · GitHub
[go: up one dir, main page]

Skip to content
idf v5.1 listen/accept race condition ends up losing clients #8443
@bill88t

Description

@bill88t

CircuitPython version

Adafruit CircuitPython 9.0.0-alpha.1-50-gaa0d7aad83 on 2023-09-28; M5Stack Timer Camera X with ESP32

Code/REPL

https://github.com/bill88t/CircuitPython_FTP_Server, current master 53224fdd20302d4005054011b51a45591f3ac0fd loaded onto the board

code.py:

import wifi
from socketpool import SocketPool
from ftp_server import ftp
from supervisor import reload

if not wifi.radio.connected:
    print("No wifi")
    reload()

pool = SocketPool(wifi.radio)
ftps = ftp(pool, str(wifi.radio.ipv4_address), verbose=True)
try:
    ftps.serve_till_quit()
except KeyboardInterrupt:
    pass
finally:
    ftps.deinit()
reload()

Behavior

When making a full filesystem dump (237 files), PASV fails with the client never connecting to the data socket.

Well what does that have to do with the core?
Well, first of all, it does work on 8.x without a single error or retry, same settings, same everything.
Second, the failure has to do with the sockets and listen.

Debugging the failure led me to the conclusion that if a client connects very quickly after .listen(..) is run, the client may be accepted by the networking stack, but stored nowhere and so .accept times out.

When this bug happens, .accept() will fail by timeout, raising it's regular OSError, as if no client is waiting.
However the client is connected and waiting.
It is not rejected like when the socket is closed or no more connections are permitted.

Description

So this is some sort of race condition where a client connection comes in at just the wrong time to be accepted by the code but not stored anywhere.

FTP relies on making a new socket for every transfer and directory listing. For that reason we end up making ~300 sockets/connections for a full dump of 237 files.

It would probably be possible to make a simplier example, since we only need to spam (yes, around 7 connections / second) and testing if the connection object is not None.

This issue has already received a workaround on the current (ftp) master, so it's not really a world ending bug.
However it's still an issue that could conceivably appear in prod for a single connection, assuming bad enough luck (timing).

FTP just happens to roll a broken dice enough times to make it break almost 100%.

Additional information

If anybody intends on actually going through the ftp code:

line 645: self._data_socket, self._client_pasv = self._pasv_sock.accept() ends up returning None forever while this bug happens.
The timeout will close the socket and tell the client the file transfer failed (message 25).
The client will then re-request a new socket which will work.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions

    0