8000 esp32: Webserver Socket stops responding after a minute. · Issue #15844 · micropython/micropython · GitHub
[go: up one dir, main page]

Skip to content

esp32: Webserver Socket stops responding after a minute. #15844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mispacek opened this issue Sep 13, 2024 · 18 comments · Fixed by #15952
Closed

esp32: Webserver Socket stops responding after a minute. #15844

mispacek opened this issue Sep 13, 2024 · 18 comments · Fixed by #15952

Comments

@mispacek
Copy link
mispacek commented Sep 13, 2024

Port, board and/or hardware

esp32 port, ESP32 and ESP32C3

ESP32 (lolin32-Lite, esp32 devkit v1) and ESP32C3 (esp32c3 supermini)
both board have same issues

MicroPython version

MicroPython v1.23.0 on 2024-06-02; ESP32C3 module with ESP32C3
in older versions than MicroPython v1.20.0 code works fine withou issues

MicroPython v1.19.1 works OK
MicroPython v1.20.0 works OK
MicroPython v1.21.0 NOT working
MicroPython v1.22.0 NOT working
MicroPython v1.23.0 NOT working

Reproduction

Copy content of zip file to your ESP device. Set correct SSID and Password and reset your board.
issue.zip

The web server will offer you index.html, which cyclically send requests to server and server answers back.
in terminal you can see positions of virtual joysticks on webpage.
image

After about a minute, the server stops responding. There is enough free RAM, I don't know where the problem is, everything works correctly on older versions.

There is another version of the web server available in the Zip archive, without the use of threads, but the problem is exactly the same.

Expected behaviour

I expect that the server will behave the same in all versions. It should respond indefinitely to web browser requests.

Observed behaviour

In Chrome, using developer tools, I check the responses from the server. After 1-5 minutes, the server stops responding and sometimes restarts.
image

Additional Information

Web server must run in background, without blocking of REPL and WebRepl

Code of webserver with threads:

import network
import usocket
import _thread
from os import stat, listdir
from time import sleep_ms
import gc


class WebServer:
    def __init__(self, web_folder='/www', port=80):
        self.WEB_FOLDER = web_folder
        self.MIMETYPES = {
            "txt"   : "text/plain",
            "htm"   : "text/html",
            "html"  : "text/html",
            "css"   : "text/css",
            "csv"   : "text/csv",
            "js"    : "application/javascript",
            "xml"   : "application/xml",
            "xhtml" : "application/xhtml+xml",
            "json"  : "application/json",
            "zip"   : "application/zip",
            "pdf"   : "application/pdf",
            "ts"    : "application/typescript",
            "ttf"   : "font/ttf",
            "jpg"   : "image/jpeg",
            "jpeg"  : "image/jpeg",
            "png"   : "image/png",
            "gif"   : "image/gif",
            "svg"   : "image/svg+xml",
            "ico"   : "image/x-icon",
            "cur"   : "application/octet-stream",
            "tar"   : "application/tar",
            "tar.gz": "application/tar+gzip",
            "gz"    : "application/gzip",
            "mp3"   : "audio/mpeg",
            "wav"   : "audio/wav",
            "ogg"   : "audio/ogg"
        }
        self.webserv_sock = None
        self.url_handlers = {}
        self.port = port

    def _file_exists(self, path):
        try:
            stat(path)
            return True
        except:
            return False

    def get_mime_type(self, filename):
        try:
            _, ext = filename.rsplit(".", 1)
            return self.MIMETYPES.get(ext, "application/octet-stream")
        except:
            return "application/octet-stream"
        
    def read_in_chunks(self, file_object, chunk_size=1024):
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    def serve_file(self, client, path):
        try:
            
            if path.startswith("/*GET_FILE"):
                file_path = path.replace("/*GET_FILE", "")
            else:
                if path == "/":
                    path = "/index.html"
                if '?' in path:
                    path = path.split('?')[0]
                file_path = self.WEB_FOLDER + path
            
            mime_type = self.get_mime_type(file_path)
            filestatus = 0 # 0=Not found  1=Found  2=found in GZip

            if self._file_exists(file_path + '.gz'):
                filestatus = 2
                file_path += '.gz'
            elif self._file_exists(file_path):
                filestatus = 1
                        
            if filestatus > 0:
                with open(file_path, 'rb') as file:
                    client.write(b'HTTP/1.1 200 OK\r\n')
                    client.write(b"Content-Type: " + mime_type.encode() + b"\r\n")
                    if filestatus == 2:
                        client.write(b'Content-Encoding: gzip\r\n')
                    client.write(b'\r\n')
                    for piece in self.read_in_chunks(file):
                        client.write(piece)
            else:
                client.write(b"HTTP/1.0 404 Not Found\r\n\r\nFile not found.")
        except OSError as e:
            print("OSError:", e)
            client.write(b"HTTP/1.0 500 Internal Server Error\r\n\r\nInternal error.")
        except Exception as e:
            print("Exception:", e)
            client.write(b"HTTP/1.0 500 Internal Server Error\r\n\r\nInternal error.")

    def handle(self, pattern):
        """Decorator to register a handler for a specific URL pattern."""
        def decorator(func):
            self.url_handlers[pattern] = func
            return func
        return decorator

    def client_handler(self, client):
        try:
            gc.collect()
            sleep_ms(0)
            request = client.recv(2048)
            if request:
                _, path, _ = request.decode("utf-8").split(" ", 2)
                for pattern, handler in self.url_handlers.items():
                    if path.startswith(pattern):
                        try:
                            gc.collect()
                            handler(client, path, request)
                        except Exception as e:
                            print("Handler Exception:", e)
                        client.close()
                        return
                # Default file serving if no handler matches
                self.serve_file(client, path)
        except Exception as e:
            sleep_ms(0)
            #print("Webserver Exception:", e)
        finally:
            client.close()

    def web_thread(self):
        while True:
            try:
                cl, addr = self.webserv_sock.accept()
                cl.settimeout(2)  # time in seconds
                self.client_handler(cl)
            except Exception as ex:
                sleep_ms(0)

    def start(self):
        addr = usocket.getaddrinfo('0.0.0.0', self.port)[0][-1]
        self.webserv_sock = usocket.socket()
        self.webserv_sock.setsockopt(usocket.SOL_SOCKET, usocket.SO_REUSEADDR, 1)
        self.webserv_sock.bind(addr)
        self.webserv_sock.listen(5)
        _thread.start_new_thread(self.web_thread, ())
        for interface in [network.AP_IF, network.STA_IF]:
            wlan = network.WLAN(interface)
            if not wlan.active():
                continue
            ifconfig = wlan.ifconfig()
            print("Web server spusten na adrese {}:{}".format(ifconfig[0], self.port))

    def stop(self):
        if self.webserv_sock:
            self.webserv_sock.close()
 

Code of webserver without threads:

import network
import socket
from os import stat, listdir
from time import sleep_ms
import gc
from micropython import alloc_emergency_exception_buf

# Define constant for registering the handler function
_SO_REGISTER_HANDLER = 20

class WebServer:
    def __init__(self, web_folder='/www', port=80):
        self.WEB_FOLDER = web_folder
        self.MIMETYPES = {
            "txt"   : "text/plain",
            "htm"   : "text/html",
            "html"  : "text/html",
            "css"   : "text/css",
            "csv"   : "text/csv",
            "js"    : "application/javascript",
            "xml"   : "application/xml",
            "xhtml" : "application/xhtml+xml",
            "json"  : "application/json",
            "zip"   : "application/zip",
            "pdf"   : "application/pdf",
            "ts"    : "application/typescript",
            "ttf"   : "font/ttf",
            "jpg"   : "image/jpeg",
            "jpeg"  : "image/jpeg",
            "png"   : "image/png",
            "gif"   : "image/gif",
            "svg"   : "image/svg+xml",
            "ico"   : "image/x-icon",
            "cur"   : "application/octet-stream",
            "tar"   : "application/tar",
            "tar.gz": "application/tar+gzip",
            "gz"    : "application/gzip",
            "mp3"   : "audio/mpeg",
            "wav"   : "audio/wav",
            "ogg"   : "audio/ogg"
        }
        self.webserv_sock = None
        self.url_handlers = {}
        self.port = port
        self.client_sockets = []

    def _file_exists(self, path):
        try:
            stat(path)
            return True
        except:
            return False

    def get_mime_type(self, filename):
        try:
            _, ext = filename.rsplit(".", 1)
            return self.MIMETYPES.get(ext, "application/octet-stream")
        except:
            return "application/octet-stream"
        
    def read_in_chunks(self, file_object, chunk_size=1024):
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    def serve_file(self, client, path):
        try:
            
            if path.startswith("/*GET_FILE"):
                file_path = path.replace("/*GET_FILE", "")
            else:
                if path == "/":
                    path = "/index.html"
                if '?' in path:
                    path = path.split('?')[0]
                file_path = self.WEB_FOLDER + path
            
            mime_type = self.get_mime_type(file_path)
            filestatus = 0 # 0=Not found  1=Found  2=found in GZip

            if self._file_exists(file_path + '.gz'):
                filestatus = 2
                file_path += '.gz'
            elif self._file_exists(file_path):
                filestatus = 1
                        
            if filestatus > 0:
                with open(file_path, 'rb') as file:
                    client.write(b'HTTP/1.1 200 OK\r\n')
                    client.write(b"Content-Type: " + mime_type.encode() + b"\r\n")
                    if filestatus == 2:
                        client.write(b'Content-Encoding: gzip\r\n')
                    client.write(b'\r\n')
                    for piece in self.read_in_chunks(file):
                        client.write(piece)
            else:
                client.write(b"HTTP/1.0 404 Not Found\r\n\r\nFile not found.")
        except OSError as e:
            print("OSError:", e)
            client.write(b"HTTP/1.0 500 Internal Server Error\r\n\r\nInternal error.")
        except Exception as e:
            print("Exception:", e)
            client.write(b"HTTP/1.0 500 Internal Server Error\r\n\r\nInternal error.")

    def handle(self, pattern):
        """Decorator to register a handler for a specific URL pattern."""
        def decorator(func):
            self.url_handlers[pattern] = func
            return func
        return decorator

    
    def accept_websocket(self, sock):
        try:
            client_sock, client_addr = sock.accept()
            #client_sock.setblocking(False)
            client_sock.settimeout(2)
            client_sock.setsockopt(socket.SOL_SOCKET, _SO_REGISTER_HANDLER, self.client_handler)
            self.client_sockets.append(client_sock)
            # Debug output
            #print("Received new connection from:", client_addr)
            #print(self.client_sockets)
        except Exception as e:
            print("Error accepting connection:", e)
    
    
    
    
    def client_handler(self, client_sock):
        try:
            request = client_sock.recv(2048)
            #print(request)
            if request:
                _, path, _ = request.decode("utf-8").split(" ", 2)
                for pattern, handler in self.url_handlers.items():
                    if path.startswith(pattern):
                        try:
                            gc.collect()
                            handler(client_sock, path, request)
                        except Exception as e:
                            print("Handler Exception:", e)
                        self.close_client(client_sock)
                        return
                # Default file serving if no handler matches
                self.serve_file(client_sock, path)
            self.close_client(client_sock)
        except Exception as e:
            print("Error in client_handler:", e)
            self.close_client(client_sock)

    def close_client(self, client_sock):
        try:
            client_sock.setsockopt(socket.SOL_SOCKET, _SO_REGISTER_HANDLER, None)
            client_sock.close()
        except Exception as e:
            print("Error closing client socket:", e)
        finally:
            if client_sock in self.client_sockets:
                self.client_sockets.remove(client_sock) 

    def start(self):
        alloc_emergency_exception_buf(100)
        addr = socket.getaddrinfo('0.0.0.0', self.port)[0][-1]
        self.webserv_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.webserv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.webserv_sock.bind(addr)
        self.webserv_sock.listen(1)
        self.webserv_sock.settimeout(2)
        self.webserv_sock.setsockopt(socket.SOL_SOCKET, _SO_REGISTER_HANDLER, self.accept_websocket)
        for interface in [network.AP_IF, network.STA_IF]:
            wlan = network.WLAN(interface)
            if not wlan.active():
                continue
            ifconfig = wlan.ifconfig()
            print("Web server started at address {}:{}".format(ifconfig[0], self.port))

    def stop(self):
        if self.webserv_sock:
            self.webserv_sock.setsockopt(socket.SOL_SOCKET, _SO_REGISTER_HANDLER, None)
            self.webserv_sock.close()
        for client_sock in self.client_sockets:
            self.close_client(client_sock)

Code of Conduct

Yes, I agree

@mispacek mispacek added the bug label Sep 13, 2024
@TRadigk
Copy link
TRadigk commented Sep 14, 2024

Hi,

I've had the same issue and no solution for that. But since I already was going for seed xiao ESP32C3, I was able to replace it with an ESP32S3 (also seed xiao), which is a little bit more expensive but has the same footprint and mostly same pin configuration.

In my investigations I also found out:
long reque 8000 sts (256 Bytes) lead to immediate "hang", the device stops responding to "ping" requests. With wireshark I found out that the device always answers the HTTP ACK, but might not send data. If helpful I could provide this additional info in more detail.

This is independent from Socket or asyncio.start-server(). Don't mind refactoring your code into different approaches to realize a server. It "feels" like there is a buffer mishandling between the ESP32(C3) and the wifi module built-in.

@xuancong84
Copy link
Contributor
xuancong84 commented Sep 15, 2024

I also encounter the same problem on esp32c3, except that instead of one minute, the web server stops responding after a random period of time between a few minutes and one day or so, and it will never resume responding, so I have to do a hard reset at least every day. However, ping continues to work.

The randomness of this bug makes it very difficult to solve.

@dpgeorge
Copy link
Member

Thanks for the clear report and the reproduction. Using the instructions in the original report, with ESP32_GENERIC firmware, IDF 5.2.2 and a recent master MicroPython v1.24.0-preview-309-gf1bdac375 I was able to reproduce the issue: after around 1 minute of having the webpage open and accessing the ESP32 server, TCP connections would stop. They would eventually resume after a few minutes.

This looks like the same root cause as with the issue #12819. It's because TCP sockets are remaining open on the ESP32 server for up to 2 minutes.

I tested the fix suggested in #12819 which is to add CONFIG_LWIP_TCP_MSL=6000 to ports/esp32/boards/sdkconfig.base That fixes the issue for me, both the threading and non-threading version of the webserver (v1 and v2) work fine with that adjusted MSL (at least, I can run for much longer and don't see any issues).

I also made a simpler client program in Python to show the same issue, so you don't need to test with a web browser. Just run this on the host PC:

import time
import requests

IP = '<ip of esp32>'

for i in range(1000):
    r = requests.get(f'http://{IP}/*JOY;{i};0;0;0;0;0')
    r.close()
    time.sleep(0.15)

Eventually that will stop sending requests.

@dpgeorge
Copy link
Member

The problem does not happen on a PYBD-SF6 (and presumably all other bare-metal lwIP MicroPython boards) because those boards have a very limited amount of static memory allocated to lwIP, and lwIP i 8000 s then forced to close and reuse sockets that are sitting in the TIME-WAIT state.

In particular, in the tcp_alloc(), the "tcp_alloc: killing off oldest TIME-WAIT connection" debug message is printed (and hence tcp_kill_timewait() run) on PYBD-SF6 but not on ESP32.

Ideally we need a way on ESP32 to limit the maximum number of opened PCBs, and force tcp_alloc() to reclaim PCBs when that limit is reached. But I'm not sure how to do that. Somehow we need to hook into the memp_malloc(MEMP_TCP_PCB) call and return NULL when the maximum number of PCBs is reached.

@dpgeorge dpgeorge added this to the release-1.24.0 milestone Sep 16, 2024
@TRadigk
Copy link
TRadigk commented Sep 16, 2024

Hi Damien,
I couldn't encounter the issue on ESP32 S3 (with 8 MB) of Ram in 1000 subsequent requests, where the ESP32 C3 hang up after like 20. Since you're changing something in the ESP32 port, should the issue also occur on the ESP32S3 but due to much more available RAM way later?

@dpgeorge
Copy link
Member

should the issue also occur on the ESP32S3 but due to much more available RAM way later?

Probably not, because the TCP socket waits in TIME-WAIT state for 2 minutes, after which it is reclaimed. So there's probably enough RAM to allocate a few thousand sockets before they need to be recycled, at which point the 2 minutes from the first sockets has passed and their memory is freed.

@dpgeorge dpgeorge changed the title Webserver Socket stops responding after a minute. esp32: Webserver Socket stops responding after a minute. Sep 19, 2024
@straga
Copy link
straga commented Sep 23, 2024

Probably that same or part only. #14421
Because on older firmware all works.

@projectgus

This comment was marked as duplicate.

@dpgeorge

This comment was marked as outdated.

@projectgus

This comment was marked as outdated.

@dpgeorge

This comment was marked as outdated.

@projectgus

This comment was marked as outdated.

@projectgus
Copy link
Contributor
projectgus commented Oct 1, 2024

I've hidden the comments above because I was reading the WireShark output wrong, my mistake!

The underlying problem is the same as described here. The conditions are:

  1. The server (MicroPython) closes the connection first. If the client closes the connection first, it doesn't enter TIME-WAIT on the server so the problem doesn't trigger. However if the server closes the connection first then it does enter TIME-WAIT.
  2. The client opens a new connection with the same source port, before the TIME-WAIT has expired (two minutes I think, by default).

What happens then is that the client sends TCP SYN, the server sends TCP ACK, the client recognises the ACK is invalid because it belongs to the old connection and sends TCP RST, but the server ignores this (required for RFC1337). It will stay stuck like this until after the 2 minutes expires, although the client usually times out first.

Reproduce without MicroPython

I can reproduce this in C if I modify the tcp_server example to close the socket after it sends the first response back to the client.

ESP-IDF server patch:

diff --git i/examples/protocols/sockets/tcp_server/main/tcp_server.c w/examples/protocols/sockets/tcp_server/main/tcp_server.c
index 5f1c284960..b4ac5c97b3 100644
--- i/examples/protocols/sockets/tcp_server/main/tcp_server.c
+++ w/examples/protocols/sockets/tcp_server/main/tcp_server.c
@@ -59,6 +59,7 @@ static void do_retransmit(const int sock)
                 to_write -= written;
             }
         }
+        break;
     } while (len > 0);
 }

Client program to repro with tcp_server:

#!/usr/bin/env python
import socket
import time

HOST="10.42.0.150"  # <-- Server IP goes here
PORT=3333

while True:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)  # so fails when bug triggers
    s.connect((HOST, PORT))

    s.send(b"ABCD")
    print(time.time(), s.recv(4))
    s.close()

Workarounds

I can't find a perfect workaround, but I think any of these would work:

  • Set a timeout on the client side and retry when the connect times out. For example, if doing this from JavaScript.
  • Slow down the rate of requests so that ports aren't reused within a two minute window. It'd be a big change, but a long lived WebSocket could be a good fit for this kind of frequent data request.
  • Implement HTTP/1.1 connection reuse for the server. (Another big change.)
  • Have the client close the connection, not the server. Unfortunately I think this is only possible by implementing proper HTTP/1.1. The kind of server attached to this issue is an HTTP/1.0 style "close-delimited response message", i.e. the server needs to Close the connection because this is how the client knows there is no more data. Doing it in the recommended HTTP/1.1 way means setting Connection: close and a Content-Length: header, so it's a lot more complex...

Fixes?

Linux and OpenBSD servers don't have this problem because if they see an out-of-sequence SYN for a TIME-WAIT socket then they allow it as a new connection. OpenBSD code, Linux code.

We could request LWIP support be added to do the same thing. The issues I can see are:

  • OpenBSD and Linux also check timestamps by default, which is recommended approach to avoid "TIME-WAIT assassination" (see RFC1337). LWIP doesn't implement the timestamp option.
  • The way LWIP is structured in the code then it's kind of fiddly to recycle a TIME-WAIT socket.
  • Due to the above, I suspect Espressif would only add it to their esp-lwip fork if upstream LWIP agreed it was a good idea. And the upstream LWIP is not particularly active.

A simpler/quicker "fix" would to be to immediately clean up the socket if an RST is received in TIME-WAIT (i.e. change this bit). However this does open the possibility of "TIME-WAIT assassination", trading one bug for a worse one in some cases (old data arrives on new socket 🙀).

@projectgus

This comment was marked as outdated.

@ma261065
Copy link
ma261065 commented Oct 1, 2024

Isn't rfc1337 saying that immediately cleaning up the socket if an RST is received in TIME-WAIT is a fix for TWA, rather than 'opening up the possibility of TWA' as you said?

Fixes for TWA Hazards

   We discuss three possible fixes to TCP to avoid these hazards.

   (F1) Ignore RST segments in TIME-WAIT state.

@projectgus
Copy link
Contributor
projectgus commented Oct 1, 2024

Isn't rfc1337 saying that immediately cleaning up the socket if an RST is received in TIME-WAIT is a fix for TWA, rather than 'opening up the possibility of TWA' as you said?

My understanding is that TIME-WAIT assassination is when the TCP stack cleans up the socket too early (the bad effect being that after this, additional out-of-band packets from the old socket might be delivered to a new socket). The fix of ignoring RST means ignoring the RST packet and doing nothing, rather than cleaning up the socket early.

@projectgus
Copy link
Contributor

One more contributing factor to this is that ESP-IDF doesn't limit the number of sockets in TIME-WAIT. The config option that should limit this is not effective (see espressif/esp-idf#9670).

If that bug was fixed then this problem would happen much less often because TIME-WAIT sockets would be cleaned up more rapidly when there are a lot of short lived connections. However, I think the problem might still happen sometimes - in my packet captures sometimes Linux reuses the same source port very quickly, for whatever reason.

@projectgus
Copy link
Contributor

Candidate fix in the linked PR, and I've posted some builds there if anyone is able to test: #15952 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants
0