-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
TinyUSB: Schedule tud_task from the DCD interrupt, switch rp2 & samd ports #12846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TinyUSB: Schedule tud_task from the DCD interrupt, switch rp2 & samd ports #12846
Conversation
Codecov Report
@@ Coverage Diff @@
## master #12846 +/- ##
=======================================
Coverage 98.43% 98.43%
=======================================
Files 158 158
Lines 20978 20978
=======================================
Hits 20649 20649
Misses 329 329 |
This is really great, thank you! We would definitely want to switch all TinyUSB ports (except esp32) to use this mechanism. It also leaves open the possibility to run Overall this helps to clean up the scheduling behaviour of MicroPython on bare-metal ports: that events (hardware IRQ, like USB IRQ) do some low-level processing at a high priority, and then schedule more complex processing to happen at a lower priority. That scheduling can either be at PendSV (eg cyw43 driver) or at thread level, like done in this PR for USB. |
86bfb5a
to
6ca82b2
Compare
Agreed, this could be useful and I looked into this a bit as well while investigating potential ways for this to work. For my long-term purpose here (allowing "soft" Python callbacks in the USB tasks) then it's less suitable as my understanding is that you can't currently do heap operations from PendSV handlers. However that might be something we could change(?), it could even be possible to interrupt a GC pass to handle a PendSV level interrupt and have it restart afterwards(??). The latency for "normal" scheduling of C callbacks seems pretty good, on rp2 in my tests it was usually around 30us and nearly always sub-100us (although of course there are outliers.) |
Code size report:
|
6ca82b2
to
316b156
Compare
Correct. Although you could run
A simply way to make it work would be to lock out PendSV (but not other IRQs) when doing a GC pass. Then PendSV effectively becomes a low-priority/high-latency but preemptive level of execution. That's anyway it's main use. Eg the cyw43 driver doesn't mind if there's a bit of latency. But also see a very old #913 for ideas on making the GC reentrant. |
dcd_event_handler() is called from the IRQ when a new DCD event is queued for processing by the TinyUSB thread mode task. This lets us queue the handler to run immediately when MicroPython resumes. Currently this relies on a linker --wrap hack to work, but a PR has been submitted to TinyUSB to allow the function to be called inline from dcd_event_handler() itself. This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
This change: - Has a small code size reduction. - Should slightly improve overall performance. The old hook code seemed to use between 0.1% and 1.6% of the total CPU time doing no-op calls even when no USB work was required. - USB performance is mostly the same, there is a small increase in latency for some workloads that seems to because sometimes the hook usbd_task() is called at the right time to line up with the next USB host request. This only happened semi-randomly due to the timing of the hook. Improving the wakeup latency by switching rp2 to tickless WFE allows the usbd_task() to run in time for the next USB host request almost always, improving performance and more than offsetting this impact. This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
Previously the TinyUSB task was run in the ISR immediately after the interrupt handler. This approach gives very similar performance (no change in CDC throughput tests) but reduces the amount of time spent in the ISR, and allows TinyUSB callbacks to run in thread mode. This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
Running the performance benchmarks on a Pico board before and after this PR gives:
|
316b156
to
26d5032
Compare
@dpgeorge That's odd, I swear I did this and saw more marginal changes. +6-10% seems improbable... Did you make any cache modifications in the code for this run? |
Huh, unless having fewer cache evictions from not calling the hook actually helps the hot code paths stay in cache. 🤷 |
I did not change anything. I just ran master and then this PR on a Pico. There could be some general variability in running performance tests on rp2 due to external SPI flash and its caching. For really quiet/non-variable performance tests you need to use something like a PYBv1.x (Cortex M4 with fast internal flash. Note that Cortex M7 has too much speculative execution that makes the tests noisy.). But I would actually expect the performance to increase a bit, because it's removing hooks from deep within the VM. |
Interesting... I measured the "up to 1.6%" by adding hacky GPIO register writes around |
When |
TinyUSB device support consists of a main task function
tud_task_ext()
"upper layer" handler and a DCD ISR "lower layer" handler. Currently different MicroPython ports calltud_task_ext()
in one of three ways:tud_task_ext()
unnecessarily (measurements on rp2 suggest between 0.1% and 1.6%, although the latter is in a pathological tight loopwhile True: pass
). It also adds variable latency between a USB interrupt and the driver handling it.tud_int_handler
"low-level" handler runs. Ports include samd, renesas-ra, mimxrt. This has minimal latency but blocks other interrupts while USB processing is completed. It is also incompatible with running non-hard-ISR Python code in task-level USB callbacks (as implemented in tinyusb: Support USB device drivers written in Python #9497).This PR introduces a new approach trying to overcome the shortcomings of 1 & 2: hooking the
dcd_event_handler()
function in TinyUSB and using it to schedule execution oftud_task_ext()
in the runtime. Now whenever the ISR queues work for the upper layer handler then that handler is scheduled to run. There are commits that switch rp2 and samd to use this new method. Tested on an WeActStudio board (rp2040) and a Seeed XIAO (samd21).At the moment this approach uses linker wrapping which is a bit hacky. However, I've submitted hathach/tinyusb#2303 to TinyUSB. If accepted then this could later be changed to use the macro and a direct function call for the hook. The macro also avoids unnecessary scheduling of
tud_task_ext()
following any DCD events don't result in new work queued (SOF being the most obvious one).Testing with the
cdc_rate_benchmark.py
script from micropython/micropython-lib#558 shows USB CDC performance remains the same on samd with these changes. Initially reading from CDC on rp2 was a little slower, however the root cause was identified as #12837 (slow resume from WFE). With that patch added rp2 is now slightly faster on read tests, and no significant change on writes.I'm confident that given time this method can be applied to all ports, except possibly esp32.
This work was funded through GitHub Sponsors.