Web Workflow Access Causes Program Pause And Board Freeze #9171
Comments
I see a similar situation: code pauses when a URL is fetched from the web workflow, to the point of freezing the board completely, losing USB and everything. This happens easily when the board's code is particularly busy. The freeze seems to be triggered by some web workflow accesses, including the single scan done by the web workflow home page of another board and the recurrent scanning by the discotool manager (both retrieve some information from the board after detecting it over mDNS).
When connected over USB, the frozen board does not respond to ctrl-C. The USB drive usually still works, but sometimes it also errors and unmounts after a little while, without coming back. The web workflow might keep working while the code is frozen, and using the web workflow can apparently make the code run again for a bit.
On a board with more complex code, including a neopixel strip, a dotstar strip and a webserver, the freeze happens easily during normal web workflow use, making the board quite difficult to use. I usually don't see the board recover after it freezes. Repro on:
Repro: latest, 9.2.4, 9.0.0

Here is some simple code that helps visualize the board freezing:

    import board
    import time
    import neopixel
    status = neopixel.NeoPixel(board.NEOPIXEL, 1)
    while True:
        for color in [0x200020, 0x002020]:
            status.fill(color)
            time.sleep(0.25)

Here is some Python code that connects to the board's web workflow in a loop to force the freeze:

    import requests, sys, time
    from datetime import datetime as d
    ADDRESS = "192.168.1.38"
    if len(sys.argv) > 1: ADDRESS = sys.argv[1]
    url = f"http://{ADDRESS}/cp/version.json"
    was_ok = None
    t0 = d.now()
    try:
        while True:
            is_ok = True
            try:
                with requests.get(url, timeout=1) as response:
                    is_ok &= True
            except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError):
                is_ok &= False
            print((f"{str(d.now()-t0)[:7]} " + ("ERROR","ok")[is_ok]).ljust(60), "\033[1G\033[1A")
            if is_ok != was_ok:
                print()
            was_ok = is_ok
            time.sleep(0.1)
    except KeyboardInterrupt:
        print()

On a QTPY S2, this usually triggers the issue after approximately 30 seconds.
|
Web workflow responses are currently blocking. So, if they take a while, then everything else will be starved. I think the easiest way to fix this will be switching to Zephyr (or another RTOS). That way the web workflow can run in a separate thread and yield as it waits for sockets. |
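To illustrate the starvation described above, here is a minimal host-side sketch in desktop Python. It is only an analogy, not CircuitPython's web workflow implementation: `slow_response`, `run_main_loop`, `blocking_case` and `threaded_case` are hypothetical names, and `time.sleep` stands in for a response waiting on a socket. It counts how many iterations a user-code-like loop completes when a slow response is served inline versus in a separate thread.

```python
import threading
import time


def slow_response(duration):
    """Stand-in (assumption) for a web workflow response taking `duration` seconds."""
    time.sleep(duration)


def run_main_loop(total, tick, handler=None):
    """Run a user-code-like loop for about `total` seconds and count iterations.

    `handler`, if given, is called once before the loop starts ticking, the way
    a request might arrive while user code is running.
    """
    ticks = 0
    deadline = time.monotonic() + total
    if handler is not None:
        handler()
    while time.monotonic() < deadline:
        ticks += 1
        time.sleep(tick)
    return ticks


def blocking_case(total=0.6, tick=0.01, response=0.4):
    # The response is served inline, so the loop cannot tick until it finishes.
    return run_main_loop(total, tick, handler=lambda: slow_response(response))


def threaded_case(total=0.6, tick=0.01, response=0.4):
    # The response is served in its own thread; while it sleeps, the loop keeps ticking.
    worker = threading.Thread(target=slow_response, args=(response,))
    ticks = run_main_loop(total, tick, handler=worker.start)
    worker.join()
    return ticks


if __name__ == "__main__":
    print("blocking ticks:", blocking_case())
    print("threaded ticks:", threaded_case())
```

With the inline handler, the loop loses the whole response time; with the thread, it keeps most of its iterations. The exact counts depend on the host's timer resolution.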
Is blocking new to CP9? On 9.x latest, the test starts failing within 30 seconds. In fact it's quite regular: the first error since reset happens after 140 to 150 requests, regardless of the sleep duration in the test script (tested 10ms, 50ms, 100ms).

    import board
    import time
    import neopixel
    import adafruit_dotstar
    pixel = neopixel.NeoPixel(board.NEOPIXEL, 90)
    pidots = adafruit_dotstar.DotStar(board.SCK, board.MISO, 90)
    while True:
        for color in [0x200020, 0x002020]:
            pixel.fill(color)
            pidots.fill(color)
            time.sleep(0.5)
|
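The "140 to 150 requests" observation can be checked by counting requests instead of watching elapsed time. Below is a hedged sketch of such a counter. `fetch_ok` and `requests_until_failure` are hypothetical helpers, not part of the original test script, and the standard library's `urllib` is swapped in for `requests` to keep the example dependency-free. The `/cp/version.json` path is the one used by the test script earlier in this thread.

```python
import http.client
import sys
import time
import urllib.request


def fetch_ok(url, timeout=1.0):
    """Return True if the URL answers within `timeout` seconds, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return True
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and connection errors all derive from OSError.
        return False


def requests_until_failure(fetch, delay=0.1, max_requests=1000, sleep=time.sleep):
    """Call `fetch()` repeatedly; return the 1-based index of the first failure.

    Returns None if `max_requests` calls all succeed.
    """
    for count in range(1, max_requests + 1):
        if not fetch():
            return count
        sleep(delay)
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python count_requests.py <board address> [delay in seconds]
    url = f"http://{sys.argv[1]}/cp/version.json"
    delay = float(sys.argv[2]) if len(sys.argv) > 2 else 0.1
    first = requests_until_failure(lambda: fetch_ok(url), delay)
    print("first failure at request:", first)
```

Since the counts are reported "since reset", the board should be reset before each run. Running it with delays of 0.01, 0.05 and 0.1 would show whether the first failure depends on the request count rather than on the elapsed time.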
No, it isn't new to CP9. CP9 did upgrade to IDF 5, though. That was a big step, and the "wake circuitpython up from socket activity" logic is complicated.
The easiest way to hunt this down may be a git bisect. It'll be time consuming but also enlightening. |
Attempted to reproduce using the Python program at the top of the issue. Using the various connection methods listed in the table, could not reproduce the problem. Dots appear on the serial console at a steady rate regardless of what I do with web workflow. Tried both
To be thorough, I will test with 9.2.7 and will also try the method and program described by @Neradoc . |
Could not reproduce the problem (using the method and program at the top of the issue) at 9.2.7, so went back to 9.2.4 and then to 9.0.3, and could not reproduce it. Thinking it might have something to do with |
FWIW: I have been preoccupied, so I haven't made much use of recent builds, but since Web Workflow was released I've seen issues similar to those reported here. In my case they seem to come and go; a few times I came close to working up a repeatable recipe, only for the problem to go away. I eventually wrote it off to WiFi strength issues, as my office is on the edge of WiFi coverage for most of the microcontroller boards. I've put 10.x on the two continually running WiFi boards and I don't think I've seen the issue, but I haven't really stressed them the way I did with earlier versions. |
Thinking that a configuration difference between boards might be involved, I rebuilt CP 9.0.3 for the Metro ESP32S2 with these additional lines from
I was still unable to reproduce the original issue. It's possible that the WiFi signal, as described by @RetiredWizard, may have been a factor in the original issue. For my tests, the Metro is within 10 feet of the AP. Dumping the AP shows a strong -44 dBm signal with only 2 errors out of ~150K packets. Moving on, I was able to reproduce the issue documented by @Neradoc using CP 9.0.3 after 4 seconds:
After updating to 10.0.0-alpha.4, the issue no longer occurs. I ran @Neradoc's test for 1 hour with no errors while displaying a continuous lightshow on the neopixel:
I'm concluding that the upgrade to ESP-IDF 5.4.1 resolved this issue. |
I'm running the test code I posted above on a Feather ESP32-S2 on latest.
And at this point the board disconnected from USB.
And it didn't come back (waited a few minutes). It was hard to repro on QTPY S3 (with PSRAM), but then it got worse: while the test code is running, the code runs fine (looping the LED) and the web workflow responds. But then if I connect to the REPL and hit ctrl-C the board crashes. This might be an unrelated 10 alpha issue, but it is triggered by the web workflow at least.
The board freezes (after printing some or all of the normal messages) and crashes (goes away from USB) after a few seconds. |
@Neradoc is able to reproduce the issue with 10.0.0-alpha.4 on slightly different hardware. I've re-opened the issue and will investigate further once I have hardware on hand that can reproduce the issue. |
With a Feather ESP32-S2 running 10.0.0-alpha.5:
Issue reproduces as described by @Neradoc:
Board freezes. With |
Created a full debug build that does not overflow
and that does not overflow flash by using a Got an assert from
|
An example of the hangup: ESP-IDF verbose debug tracing with additional log messages:
Backtrace for the
Backtrace for the
|
CircuitPython version
Code/REPL
Behavior
Accessing the Welcome page of the web workflow can cause the executing program to pause. With the above code, the LED stops flashing. Clicking on the Full Code Editor link causes the program to resume.
Description
This issue does not seem to happen in 8.2.10.
Additional information
With the above code, if the pause happens and you wait more than a minute, then resuming by entering the Full Code Editor leads to an immediate watchdog exception.
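To measure how long the program was paused before it resumes, the main loop can time its own iterations. The sketch below is one possible way to do that. `StallDetector` is a hypothetical helper, and it assumes that `time.monotonic()` keeps advancing while the user code is paused, which has not been verified here.

```python
import time


class StallDetector:
    """Report gaps between successive loop iterations that exceed a threshold."""

    def __init__(self, threshold=1.0, clock=time.monotonic):
        self.threshold = threshold
        self.clock = clock
        self.last = clock()

    def tick(self):
        """Call once per loop iteration.

        Returns the gap in seconds since the previous call if it exceeded
        the threshold, else None.
        """
        now = self.clock()
        gap = now - self.last
        self.last = now
        return gap if gap > self.threshold else None


# Usage inside a blink loop (the threshold must be larger than the loop's own sleep):
#
#     detector = StallDetector(threshold=1.0)
#     while True:
#         ...  # update the LED
#         time.sleep(0.25)
#         stalled = detector.tick()
#         if stalled is not None:
#             print("loop stalled for", stalled, "seconds")
```

A message is printed only once the loop runs again, so this reports pauses that end (for example after opening the Full Code Editor), not a permanent freeze.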