FFFF Web Workflow Access Causes Program Pause And Board Freeze · Issue #9171 · adafruit/circuitpython · GitHub
[go: up one dir, main page]

Skip to content

Web Workflow Access Causes Program Pause And Board Freeze #9171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
safferyj opened this issue Apr 12, 2024 · 13 comments
Open

Web Workflow Access Causes Program Pause And Board Freeze #9171

safferyj opened this issue Apr 12, 2024 · 13 comments

Comments

@safferyj
Copy link

CircuitPython version

9.0.3 and 9.1.0-beta.0 on Unexpected Maker FeatherS2 Neo and Adafruit Feather ESP32-S2 Reverse TFT

Code/REPL

import board
import digitalio
import microcontroller
import time
import watchdog 
import wifi

#print(f"Start ({microcontroller.cpu.reset_reason})")
print(f"IP address: {wifi.radio.ipv4_address}")

# LED
led = digitalio.DigitalInOut(board.LED)
led.direction = digitalio.Direction.OUTPUT

# Watchdog
dog = microcontroller.watchdog
dog.timeout = 60
dog.mode = watchdog.WatchDogMode.RAISE

while True:

    for i in range(15):

        # Feed the watchdog
        dog.feed()

        led.value = True
        print(".", end="")
        time.sleep(1)

        led.value = False
        print(".", end="")
        time.sleep(1)

    print()

Behavior

Accessing the Welcome page of the web workflow can cause the executing program to pause. With the above code, the LED stops flashing. Clicking on the Full Code Editor link causes the program to resume.

welcome-pause

Description

This issue does not seem to happen in 8.2.10.

Additional information

With the above code, if the pause happens and you wait more than a minute, then resuming by entering the Full Code Editor leads to an immediate watchdog exception.

@safferyj safferyj added the bug label Apr 12, 2024
@tannewt tannewt added this to the 9.x.x milestone Apr 12, 2024
@Neradoc
Copy link
Neradoc commented Feb 26, 2025

I see a similar situation, where code is paused when fetching a URL from the web workflow, to the point of causing a complete freeze of the board, losing USB and everything. When the board's code is particularly busy, this happens easily.

The freeze seems to be triggered by some access by web workflow, including the single scanning done by the web workflow home page of another board or the recurrent scanning by discotool manager (both of which retrieve some information on the board after detecting it on MDNS).

When connected to USB, the frozen board does not respond to ctrl-C but usually still has the USB drive work, but sometimes also error and unmount after a little while, without coming back.

The web workflow might remain working when the code is frozen, and apparently using the web workflow might make the code run again for a bit.

On a board with a more complex code, including a neopixel strip, a dotstar strip and a webserver, the freezing happens easily during normal web workflow use, making it quite difficult to use. I usually don't see it recover after the board freezes.

Repro on:

  • Adafruit QTPY ESP32-S2
  • Adafruit Feather ESP32-S2
  • Unexpected Maker FeatherS2
  • Adafruit QT PY ESP32-S3

Repro: latest, 9.2.4, 9.0.0
No repro: 8.2.10

Here is a simple code that helps visualize the board freezing:

import board
import time
import neopixel

status = neopixel.NeoPixel(board.NEOPIXEL, 1)

while True:
    for color in [0x200020, 0x002020]:
        status.fill(color)
        time.sleep(0.25)

Here is some python code that connects to the board's web workflow in a loop to force trigger the freeze:

import requests, sys, time
from datetime import datetime as d

ADDRESS = "192.168.1.38"
if len(sys.argv) > 1: ADDRESS = sys.argv[1]
url = f"http://{ADDRESS}/cp/version.json"

was_ok = None
t0 = d.now()
try:
    while True:
        is_ok = True
        try:
            with requests.get(url, timeout=1) as response:
                is_ok &= True
        except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError):
            is_ok &= False
        print((f"{str(d.now()-t0)[:7]} " + ("ERROR","ok")[is_ok]).ljust(60), "\033[1G\033[1A")
        if is_ok != was_ok:
            print()
        was_ok = is_ok
        time.sleep(0.1)
except KeyboardInterrupt:
    print()

On a QTPY S2, this usually triggers the issue after approximately 30 seconds.
Sometimes the board ends up recovering for a little while.
When that happens I get outputs like this. The times indicate when the result of a connection change.
(So for example from 0:35 to, 2:10 it retries with a 1s timeout and errors every time)

0:00:00 ok   
0:00:32 ERROR
0:00:33 ok   
0:00:35 ERROR
0:02:10 ok   
0:02:42 ERROR
0:05:52 ok   
0:06:22 ERROR
0:07:57 ok   
0:08:29 ERROR
0:08:30 ok   
0:08:32 ERROR
0:11:42 ok   
0:12:16 ERROR

@Neradoc Neradoc changed the title Web Workflow Welcome Page Causes Program Pause Web Workflow Access Causes Program Pause And Board Freeze Feb 26, 2025
@tannewt
Copy link
Member
tannewt commented Feb 27, 2025

Web workflow responses are currently blocking. So, if they take a while, then everything else will be starved. I think the easiest way to fix this will be switching to Zephyr (or another RTOS). That way the web workflow can run in a separate thread and yield as it waits for sockets.

@Neradoc
Copy link
Neradoc commented Feb 28, 2025

Is blocking new to CP9 ?
The test code above with 8.2.10 on the same QTPY S2 goes without error for over 10 minutes.
So I hope there is some way to mitigate the issue to return to the previous behavior. I can live with the board stuttering when a file is uploaded or it's scanning MDNS, but the problem is the complete freeze for minutes or never recovering, plus drive corruption when USB fails.

On 9.x latest, the test starts failing within 30 seconds. In fact it's quite regular, the first error since reset happens after 140 to 150 requests, regardless of the sleep duration in the test script (tested 10ms, 50ms, 100ms).
With code like the following driving 2 strands of LED, it takes way less requests to freeze, around 10-20.
With Circuitpyton 8, this code just keeps doing its LED thing.

import board
import time
import neopixel
import adafruit_dotstar

pixel = neopixel.NeoPixel(board.NEOPIXEL, 90)
pidots = adafruit_dotstar.DotStar(board.SCK, board.MISO, 90)

while True:
    for color in [0x200020, 0x002020]:
        pixel.fill(color)
        pidots.fill(color)
        time.sleep(0.5)

@tannewt
Copy link
Member
tannewt commented Feb 28, 2025

No, it isn't new to CP9. CP9 did upgrade to IDF 5 though. It was a big step and the "wake circuitpython up from socket activity" is complicated.

static void socket_select_task(void *arg) {

The easiest way to hunt this down may be a git bisect. It'll be time consuming but also enlightening.

@dhalbert dhalbert modified the milestones: 9.x.x, 10.0.0 Mar 17, 2025
@eightycc eightycc self-assigned this May 8, 2025
@eightycc
Copy link
Collaborator
eightycc commented May 9, 2025

Adafruit CircuitPython 10.0.0-alpha.4-5-g07ec0c6d09 on 2025-05-08; Adafruit Metro ESP32S2 with ESP32S2

Attempted to reproduce using the Python program at the top of the issue. Using the various connection methods listed in the table, could not reproduce the problem. Dots appear on the serial console at a steady rate regardless of what I do with web workflow. Tried both file browser and full code editor.

It appears that ESP-IDF 5.4.1 resolved this issue.

To be thorough, I will test with 9.2.7 and will also try the method and program described by @Neradoc .

@eightycc
Copy link
Collaborator
eightycc commented May 9, 2025

Could not reproduce the problem (using the method and program at the top of the issue) at 9.2.7, so went back to 9.2.4 and then to 9.0.3, and could not reproduce it. Thinking it might have something to do with tinyuf2, reverted that to 0.18.2, and still could not reproduce the problem. I've been testing on a Metro ESP32S2, the only S2 I have on hand. Next up is Neradoc's program and method.

@RetiredWizard
Copy link

FWIW: I have been preoccupied so haven't had a great deal of use with recent builds but since Web Workflow was released I've seen issues similar to those reported here. In my case they seem to come and go, a few times I came close to working up a repeatable recipe only for the problem to go away. I eventually wrote it off to WiFi strength issues as my office is on the edge of WiFi coverage for most of the micro controller boards. I've thrown 10.x on the two continually running WiFi boards and I don't think I've seen the issue but I haven't really stressed them like I was with earlier versions.

@eightycc
Copy link
Collaborator

Thinking that a configuration difference between boards might be involved, I rebuilt CP 9.0.3 for the Metro ESP32S2 with these additional lines from unexpectedmaker_feathers2_neo added to its mpconfigboard.mk:

CIRCUITPY_BITBANG_NEOPIXEL = 1

# Include these Python libraries in firmware.
FROZEN_MPY_DIRS += $(TOP)/frozen/Adafruit_CircuitPython_NeoPixel

I was still unable to reproduce the original issue. It's possible that the WiFi signal, as described by @RetiredWizard, may have been a factor in the original issue. For my tests, the Metro is within 10 feet of the AP. Dumping the AP shows a strong -44 dBm signal with only 2 errors out of ~150K packets.


Moving on, I was able to reproduce the issue documented by @Neradoc using CP 9.0.3 after 4 seconds:

↪ python ../../../test.py
0:00:00 ok                                                   
0:00:04 ERROR                                                
0:00:05 ok                                                   
0:00:41 ERROR                                                
0:02:19 ok                                                   
0:03:02 ERROR                                                
0:04:37 ok                                                   
0:05:21 ERROR                                                
0:06:54 ok                                                   
0:07:31 ERROR                                                
0:07:33 ok                                                   
0:07:35 ERROR                                                
0:07:36 ok                                                   
0:07:38 ERROR                                                
0:09:12 ok                                                   
0:09:49 ERROR                                                
0:09:50 ok                                                   
0:09:53 ERROR                                                
0:11:32 ok                                                   
0:12:08 ERROR                                                
0:13:43 ok                                                   
0:14:20 ERROR                                                
0:15:54 ok                                                   
0:16:33 ERROR                                                
0:18:07 ok                                                   
0:18:50 ERROR                                                
0:21:26 ok                                                   
0:21:59 ERROR                                                
0:27:36 ok                                                   
0:28:21 ERROR                                                
0:33:01 ok                                                   

After updating to 10.0.0-alpha.4, the issue no longer occurs. I ran @Neradoc's test for 1 hour with no errors while displaying a continuous lightshow on the neopixel:

↪ python ../../../test.py
0:00:00 ok                                                   
1:00:15 ok                                                   

I'm concluding that the upgrade to ESP-IDF 5.4.1 resolved this issue.

@Neradoc
Copy link
Neradoc commented May 11, 2025

I'm running the test code I posted above on a Feather ESP32-S2 on latest.

Adafruit CircuitPython 10.0.0-alpha.4-12-g99d4ea099a on 2025-05-10; Adafruit Feather ESP32S2 with ESP32S2

0:00:00 ok                                                   
0:00:05 ERROR                                                
0:00:22 ERROR   

And at this point the board disconnected from USB.
Testing it connected to power only:

0:00:00 ok                                                   
0:00:15 ERROR

And it didn't come back (waited a few minutes).
I repro on QTPY S2.

It was hard to repro on QTPY S3 (with PSRAM), but then it got worse: while the test code is running, the code runs fine (looping the LED) and the web workflow responds. But then if I connect to the REPL and hit ctrl-C the board crashes.

This might be an unrelated 10 alpha issue, but it is triggered by the web workflow at least.

  • open the web workflow in a browser
  • connect to serial via USB
  • type ctrl-C in the REPL

The board freezes (after printing some or all of the normal messages) and crashes (goes away from USB) after a few seconds.
Reproduced on the Feather ESP32-S2.

@eightycc eightycc reopened this May 12, 2025
@eightycc
Copy link
Collaborator

@Neradoc is able to reproduce the issue with 10.0.0-alpha.4 on slightly different hardware. I've re-opened the issue and will investigate further once I have hardware on hand that can reproduce the issue.

@eightycc
Copy link
Collaborator

With a Feather ESP32-S2 running 10.0.0-alpha.5:

Adafruit CircuitPython 10.0.0-alpha.5-dirty on 2025-05-16; Adafruit Feather ESP32S2 with ESP32S2

Issue reproduces as described by @Neradoc:

↪ python test.py
0:00:00 ok                                                   
0:00:40 ERROR                                                
0:04:52 ERROR

Board freezes. With DEBUG=1 build, no informative messages logged by ESP-IDF on serial console.

@eightycc
Copy link
Collaborator

Created a full debug build that does not overflow dram0_0_seg by setting:

CONFIG_ESP_WIFI_IRAM_OPT=n

and that does not overflow flash by using a Makefile extract from #10265 to remove the second OTA partition from flash. Built with DEBUG=1.

Got an assert from socket_select_task:

assert failed: socket_select_task Socket.c:95 (num_triggered > 0)

Backtrace: 0x4002c921:0x3ffd6360 0x40035741:0x3ffd6380 0x40039f65:0x3ffd63a0 0x400b6c02:0x3ffd64c0 0x4003625e:0x3ffd6500

@eightycc
Copy link
Collaborator

An example of the hangup:

ESP-IDF verbose debug tracing with additional log messages:

D (124289) socket_select_task: select
D (124443) socket_select_task: num_triggered: 1
D (124443) socket_select_task: waking CP
D (124443) supervisor_web_workflow_background: workflow_background
D (124445) supervisor_web_workflow_background: workflow_background calls socketpool_socket_accept
D (124454) socket: socketpool_socket_accept
D (124459) socket: register_open_socket fd: 57
D (124464) socket: CP poll_resume
D (124468) socket: socketpool_socket_accept newsoc: 57

Backtrace for the socket_select task:

Thread 5 (Thread 1073568588 "socket_select" (Name: socket_select)):
#0  0x4001a8d4 in ?? ()
#1  0x40036494 in vPortClearInterruptMaskFromISR (prev_level=<optimized out>) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:560
#2  vPortExitCritical (mux=0x3ffcc228 <xKernelLock>) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:514
#3  vPortExitCritical (mux=0x3ffcc228 <xKernelLock>) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:495
#4  0x401c681c in xTaskGenericNotify (xTaskToNotify=<optimized out>, uxIndexToNotify=<optimized out>, ulValue=0, eAction=eIncrement, pulPreviousNotificationValue=0x0) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:6008
#5  0x400a2ada in port_wake_main_task () at supervisor/port.c:465
#6  0x400a2c92 in background_callback_add_core (cb=0x3ffd0e70 <workflow_background_cb>) at ../../supervisor/shared/background_callback.c:46
#7  background_callback_add_core (cb=0x3ffd0e70 <workflow_background_cb>) at ../../supervisor/shared/background_callback.c:29
#8  0x400a3bcc in supervisor_workflow_request_background () at ../../supervisor/shared/workflow.c:61
#9  0x400b6daf in socket_select_task (arg=<optimized out>) at common-hal/socketpool/Socket.c:128
#10 0x40036421 in vPortTaskWrapper (pxCode=0x400b6c1c <socket_select_task>, pvParameters=0x0) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:139

Backtrace for the main task (CircuitPython):

#0  0x40036471 in xPortEnterCriticalTimeout (mux=0x3ffe2e5c, timeout=-1) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:490
#1  0x401c54cc in vPortEnterCritical (mux=0x3ffe2e5c) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:567
#2  xQueueGenericSend (xQueue=0x3ffe2e10, pvItemToQueue=0x0, xTicksToWait=<optimized out>, xCopyPosition=0) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/queue.c:950
#3  0x40139c11 in sys_mutex_unlock (pxMutex=0x3ffd88b8 <g_lwip_protect_mutex>) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/port/freertos/sys_arch.c:77
#4  0x40139e65 in sys_arch_unprotect (pval=1) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/port/freertos/sys_arch.c:470
#5  0x4013b970 in netconn_err (conn=0x3ffed9f4) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/api_lib.c:1152
#6  0x4013bae1 in netconn_recv_data (conn=0x3ffed9f4, new_buf=0x3ffe1970, apiflags=8 '\b') at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/api_lib.c:606
#7  0x4013bc14 in netconn_recv_data_tcp (conn=0x3ffed9f4, new_buf=0x3ffe1970, apiflags=8 '\b') at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/api_lib.c:730
#8  0x4013bca6 in netconn_recv_tcp_pbuf_flags (conn=0x3ffed9f4, new_buf=0x3ffe1970, apiflags=8 '\b') at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/api_lib.c:811
#9  0x401279b9 in lwip_recv_tcp (sock=0x3ffd7a84 <sockets+20>, mem=0x3ffe1a64, len=<optimized out>, flags=0) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/sockets.c:990
#10 0x40128f1a in lwip_recvfrom (s=57, mem=0x3ffe1a64, len=1, flags=0, from=0x0, fromlen=0x0) at /home/rabeles/Development/cp-development/--Type <RET> for more, q to quit, c to continue without paging--
issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/sockets.c:1257
#11 0x40128ff5 in lwip_recv (s=57, mem=0x3ffe1a64, len=1, flags=0) at /home/rabeles/Development/cp-development/issue-9171/cp-repo-9171/ports/espressif/esp-idf/components/lwip/lwip/src/api/sockets.c:1322
#12 0x400b74e1 in socketpool_socket_recv_into (self=0x3ffd5748 <active>, buf=0x3ffe1a64 "", len=1) at common-hal/socketpool/Socket.c:582
#13 0x400ad639 in _process_request (socket=<optimized out>, request=<optimized out>) at ../../supervisor/shared/web_workflow/web_workflow.c:1413
#14 supervisor_web_workflow_background (data=<optimized out>) at ../../supervisor/shared/web_workflow/web_workflow.c:1589
#15 0x400a2d1e in background_callback_run_all () at ../../supervisor/shared/background_callback.c:83
#16 0x400a3953 in mp_hal_delay_ms (delay_ms=500) at ../../supervisor/shared/tick.c:116
#17 0x400d9f88 in common_hal_time_delay_ms (delay=500) at ../../shared-module/time/__init__.c:27
#18 0x400c70ca in time_sleep (seconds_o=<optimized out>) at ../../shared-bindings/time/__init__.c:70
#19 0x40093419 in fun_builtin_1_call (self_in=0x3f033c48 <time_sleep_obj>, n_args=1, n_kw=0, args=0x3f800c84) at ../../py/objfun.c:70
#20 0x4008db71 in mp_call_function_n_kw (fun_in=0x3f033c48 <time_sleep_obj>, n_args=1, n_kw=0, args=0x3f800c84) at ../../py/runtime.c:736
#21 0x4008dc3a in mp_call_method_n_kw (n_args=1, n_kw=0, args=0x3f800c7c) at ../../py/runtime.c:752
#22 0x4009cfe0 in mp_execute_bytecode (code_state=<optimized out>, inject_exc=<optimized out>) at ../../py/vm.c:1064
#23 0x40093354 in fun_bc_call (self_in=0x3f801820, n_args=0, n_kw=0, args=<optimized out>) at ../../py/objfun.c:271
#24 0x4008db71 in mp_call_function_n_kw (fun_in=0x3f801820, n_args=0, n_kw=0, args=0x0) at ../../py/runtime.c:736
#25 0x4008db89 in mp_call_function_0 (fun=0x3f801820) at ../../py/runtime.c:710
#26 0x400e0035 in parse_compile_execute (source=0x3f018d41, input_kind=MP_PARSE_FILE_INPUT, exec_flags=32, result=0x3ffd0dc0 <_exec_result>) at ../../shared/runtime/pyexec.c:163
#27 0x400e046d in pyexec_file (filename=0x3f018d41 "code.py", result=0x3ffd0dc0 <_exec_result>) at ../../shared/runtime/pyexec.c:783
#28 0x400a0c0a in maybe_run_list (filenames=<optimized out>, n_filenames=<optimized out>) at ../../main.c:305
#29 maybe_run_list (filenames=<optimized out>, n_filenames=<optimized out>) at ../../main.c:285
#30 0x400a0fe1 in run_code_py (safe_mode=SAFE_MODE_NONE, simulate_reset=0x3ffe1df0) at ../../main.c:484
#31 0x400a1582 in main () at ../../main.c:1119

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
0