TinyUSB: Schedule tud_task from the DCD interrupt, switch rp2 & samd ports by projectgus · Pull Request #12846 · micropython/micropython

Merged · 3 commits merged into micropython:master on Nov 9, 2023

Conversation

projectgus (Contributor) commented Nov 1, 2023

TinyUSB device support consists of an "upper layer" main task function, tud_task_ext(), and a "lower layer" DCD ISR handler. Currently, different MicroPython ports call tud_task_ext() in one of three ways:

  1. From a VM hook that runs every N Python loop or return opcodes, and/or when the system is idle. Ports include rp2, nrf. This approach wastes some amount of CPU time calling tud_task_ext() unnecessarily (measurements on rp2 suggest between 0.1% and 1.6%, although the latter is in a pathological tight loop while True: pass). It also adds variable latency between a USB interrupt and the driver handling it.
  2. Direct from the USB peripheral ISR, after the tud_int_handler "low-level" handler runs. Ports include samd, renesas-ra, mimxrt. This has minimal latency but blocks other interrupts while USB processing is completed. It is also incompatible with running non-hard-ISR Python code in task-level USB callbacks (as implemented in tinyusb: Support USB device drivers written in Python #9497).
  3. From a dedicated RTOS task that blocks on the DCD event queue (esp32).

This PR introduces a new approach that tries to overcome the shortcomings of 1 & 2: hooking the dcd_event_handler() function in TinyUSB and using it to schedule execution of tud_task_ext() in the runtime. Now, whenever the ISR queues work for the upper-layer handler, that handler is scheduled to run. There are commits that switch rp2 and samd to use this new method. Tested on a WeActStudio board (rp2040) and a Seeed XIAO (samd21).

At the moment this approach uses linker wrapping, which is a bit hacky. However, I've submitted hathach/tinyusb#2303 to TinyUSB. If accepted, this could later be changed to use the macro and a direct function call for the hook. The macro also avoids unnecessary scheduling of tud_task_ext() following DCD events that don't result in new work being queued (SOF being the most obvious one).

Testing with the cdc_rate_benchmark.py script from micropython/micropython-lib#558 shows USB CDC performance remains the same on samd with these changes. Initially, reading from CDC on rp2 was a little slower; however, the root cause was identified as #12837 (slow resume from WFE). With that patch added, rp2 is now slightly faster on read tests, with no significant change on writes.

I'm confident that given time this method can be applied to all ports, except possibly esp32.

This work was funded through GitHub Sponsors.

codecov bot commented Nov 1, 2023

Codecov Report

Merging #12846 (26d5032) into master (bbc5a18) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #12846   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files         158      158           
  Lines       20978    20978           
=======================================
  Hits        20649    20649           
  Misses        329      329           

dpgeorge (Member) commented Nov 9, 2023

This is really great, thank you! We would definitely want to switch all TinyUSB ports (except esp32) to use this mechanism.

It also leaves open the possibility to run tud_task_ext() at PendSV level if we ever want to do that (by replacing the call to mp_sched_schedule_node with pendsv_schedule_dispatch).

Overall this helps to clean up the scheduling behaviour of MicroPython on bare-metal ports: events (hardware IRQs, like the USB IRQ) do some low-level processing at high priority, and then schedule more complex processing to happen at a lower priority. That scheduling can be either at PendSV level (eg the cyw43 driver) or at thread level, as done in this PR for USB.

projectgus force-pushed the feature/tinyusb_schedule branch from 86bfb5a to 6ca82b2 on November 9, 2023 00:33
projectgus (Contributor, Author) commented

It also leaves open the possibility to run tud_task_ext() at PendSV level if we ever want to do that (by replacing the call to mp_sched_schedule_node with pendsv_schedule_dispatch).

Agreed, this could be useful and I looked into this a bit as well while investigating potential ways for this to work. For my long-term purpose here (allowing "soft" Python callbacks in the USB tasks) then it's less suitable as my understanding is that you can't currently do heap operations from PendSV handlers. However that might be something we could change(?), it could even be possible to interrupt a GC pass to handle a PendSV level interrupt and have it restart afterwards(??).

The latency for "normal" scheduling of C callbacks seems pretty good, on rp2 in my tests it was usually around 30us and nearly always sub-100us (although of course there are outliers.)

github-actions bot commented Nov 9, 2023

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:  -168 -0.051% RPI_PICO[incl +8(bss)]
       samd:   +16 +0.006% ADAFRUIT_ITSYBITSY_M4_EXPRESS[incl +8(bss)]

projectgus force-pushed the feature/tinyusb_schedule branch from 6ca82b2 to 316b156 on November 9, 2023 00:42
dpgeorge (Member) commented Nov 9, 2023

For my long-term purpose here (allowing "soft" Python callbacks in the USB tasks) then it's less suitable as my understanding is that you can't currently do heap operations from PendSV handlers

Correct. Although you could run tud_task_ext at PendSV level when there are no Python callbacks registered, and at thread level otherwise. And/or the Python callbacks could declare whether they are hard or soft, with hard callbacks running at PendSV.

However that might be something we could change(?), it could even be possible to interrupt a GC pass to handle a PendSV level interrupt and have it restart afterwards(??).

A simple way to make it work would be to lock out PendSV (but not other IRQs) when doing a GC pass. Then PendSV effectively becomes a low-priority/high-latency but preemptive level of execution. That's its main use anyway. Eg the cyw43 driver doesn't mind if there's a bit of latency.

But also see a very old #913 for ideas on making the GC reentrant.

dcd_event_handler() is called from the IRQ when a new DCD event is queued
for processing by the TinyUSB thread mode task.  This lets us queue the
handler to run immediately when MicroPython resumes.

Currently this relies on a linker --wrap hack to work, but a PR has been
submitted to TinyUSB to allow the function to be called inline from
dcd_event_handler() itself.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
This change:
- Has a small code size reduction.
- Should slightly improve overall performance.  The old hook code
  seemed to use between 0.1% and 1.6% of the total CPU time doing no-op
  calls even when no USB work was required.
- USB performance is mostly the same; there is a small increase in
  latency for some workloads that seems to be because sometimes the old
  hook called usbd_task() at just the right time to line up with the next
  USB host request.  This only happened semi-randomly, due to the timing
  of the hook.  Improving the wakeup latency by switching rp2 to tickless
  WFE allows usbd_task() to run in time for the next USB host request
  almost always, improving performance and more than offsetting this
  impact.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
Previously the TinyUSB task was run in the ISR immediately after the
interrupt handler.  This approach gives very similar performance (no change
in CDC throughput tests) but reduces the amount of time spent in the ISR,
and allows TinyUSB callbacks to run in thread mode.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
dpgeorge (Member) commented Nov 9, 2023

Running the performance benchmarks on a Pico board before and after this PR gives:

diff of scores (higher is better)
N=100 M=100                     perf0 ->      perf1         diff      diff% (error%)
bm_chaos.py                    210.70 ->     213.40 :      +2.70 =  +1.281% (+/-0.07%)
bm_fannkuch.py                  52.43 ->      55.57 :      +3.14 =  +5.989% (+/-0.03%)
bm_fft.py                     1547.02 ->    1581.21 :     +34.19 =  +2.210% (+/-0.02%)
bm_float.py                   3784.03 ->    3769.89 :     -14.14 =  -0.374% (+/-0.07%)
bm_hexiom.py                    33.77 ->      37.17 :      +3.40 = +10.068% (+/-0.04%)
bm_nqueens.py                 2444.62 ->    2594.77 :    +150.15 =  +6.142% (+/-0.05%)
bm_pidigits.py                 362.22 ->     381.45 :     +19.23 =  +5.309% (+/-0.04%)
bm_wordcount.py                 39.19 ->      37.86 :      -1.33 =  -3.394% (+/-0.03%)
core_import_mpy_multi.py       312.96 ->     310.81 :      -2.15 =  -0.687% (+/-0.11%)
core_import_mpy_single.py       62.74 ->      59.40 :      -3.34 =  -5.324% (+/-0.24%)
core_locals.py                  27.75 ->      28.30 :      +0.55 =  +1.982% (+/-0.02%)
core_qstr.py                   127.46 ->     137.20 :      +9.74 =  +7.642% (+/-0.05%)
core_str.py                     18.94 ->      19.12 :      +0.18 =  +0.950% (+/-0.03%)
core_yield_from.py             217.03 ->     221.68 :      +4.65 =  +2.143% (+/-0.01%)
misc_aes.py                    283.11 ->     302.55 :     +19.44 =  +6.867% (+/-0.02%)
misc_mandel.py                1706.65 ->    1709.35 :      +2.70 =  +0.158% (+/-0.04%)
misc_pystone.py               1387.69 ->    1414.15 :     +26.46 =  +1.907% (+/-0.06%)
misc_raytrace.py               227.76 ->     223.51 :      -4.25 =  -1.866% (+/-0.06%)
viper_call0.py                 331.10 ->     331.09 :      -0.01 =  -0.003% (+/-0.01%)
viper_call1a.py                323.39 ->     323.38 :      -0.01 =  -0.003% (+/-0.01%)
viper_call1b.py                241.03 ->     241.03 :      +0.00 =  +0.000% (+/-0.01%)
viper_call1c.py                242.91 ->     242.91 :      +0.00 =  +0.000% (+/-0.01%)
viper_call2a.py                318.44 ->     318.43 :      -0.01 =  -0.003% (+/-0.01%)
viper_call2b.py                211.29 ->     211.28 :      -0.01 =  -0.005% (+/-0.01%)

dpgeorge force-pushed the feature/tinyusb_schedule branch from 316b156 to 26d5032 on November 9, 2023 01:42
dpgeorge merged commit 26d5032 into micropython:master on Nov 9, 2023
projectgus (Contributor, Author) commented

Running the performance benchmarks on a Pico board before and after this PR gives:

@dpgeorge That's odd, I swear I did this and saw more marginal changes. +6-10% seems improbable... Did you make any cache modifications in the code for this run?

projectgus (Contributor, Author) commented Nov 9, 2023

+6-10% seems improbable...

Huh, unless having fewer cache evictions from not calling the hook actually helps the hot code paths stay in cache. 🤷

projectgus deleted the feature/tinyusb_schedule branch on November 9, 2023 02:43
dpgeorge (Member) commented Nov 9, 2023

I did not change anything. I just ran master and then this PR on a Pico.

There could be some general variability in running performance tests on rp2 due to the external SPI flash and its caching. For really quiet/non-variable performance tests you need to use something like a PYBv1.x (Cortex-M4 with fast internal flash; note that Cortex-M7 has too much speculative execution, which makes the tests noisy).

But I would actually expect the performance to increase a bit, because it's removing hooks from deep within the VM.

projectgus (Contributor, Author) commented Nov 9, 2023

Interesting...

I measured the "up to 1.6%" by adding hacky GPIO register writes around tud_task_ext() and then measuring the duty cycle of those function calls on a logic analyzer. I'm pretty confident that's roughly the maximum amount of time spent in "no-op" calls to tud_task_ext() compared to executing other code. But I guess measuring this doesn't account for any additional savings from cache, maybe better pipelining, etc.

dpgeorge (Member) commented Nov 9, 2023

When MICROPY_VM_HOOK_LOOP is defined it executes that macro on each "jump" opcode, and that macro needs to do a bit of work even if it doesn't call tud_task_ext().
