-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
stmhal: USB CDC speed improvements, TIM3 release #1713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is a follow-up/replacement to #1652 - it only implements the change for TIM3 timer handling / callback. It passes all the test The speedup is app. 27% for the test case above (it involve a lot of non-comm parts so the real speedup is faster) Edit: Actually the speedup is more like 10-30x, see below for numbers I've commented out the USBD_CDC_HAL_TIM_PeriodElapsedCallback() call in timer.c (no more used) - please advice if it should be encapsulated in a compiler switch? Or completely removed? |
This is much simpler, thanks! I tested raw VCP speed and it went from 33 kbyte/sec up to 54 kbyte/sec with this patch, so that's a great improvement! But there is an issue: the serial driver does not seem to flush its buffer (call USBD_CDC_HAL_TIM_PeriodElapsedCallback) at the very start. To test this, reset the board (it'll print the uPy header to serial) then run (using normal Python on your PC): import time, serial
ser = serial.Serial('/dev/ttyACM0', baudrate=115200, timeout=1, interCharTimeout=1)
time.sleep(0.2)
while ser.inWaiting() > 0:
print(ser.read(ser.inWaiting())) That script should read whatever is waiting on the serial line. Before this patch it would dump the uPy header. After this patch there is no data to read. |
Another problem with this patch: if I write a simple pyboard script to write chars repeatedly to sys.stdout, then have another script running on the PC to read the chars (same as the one in my previous comment), then these chars are not received if the pyboard VCP buffer gets full. Without the patch it works fine, there will always be chars available to read by the PC. |
Thanks, yes I see this as well - should have noticed this, sorry. I'll have a look. Maybe some sort of periodic polling might still be necessary as backup if something is failing. Perhaps the USB stack allows for that as well (I still think there should be an endopoint polling hook somewhere). |
ok one way to solve this is to enable the SOF (start-of-frame) interrupt. There is a callback for this event as well and it seems to work (solves the buffer flush issue - haven't checked the other one). It's essentially a similar approach as before (periodic polling of the USB endpoint buffers) but does not need a timer. Thoughts / Things to consider:
|
another issue is how to best integrate the patch. There is already MICROPY_HW_USE_ALT_IRQ_FOR_CDC. Adding another compiler switch would bloat the code. I'd propose to rename/remove the code parts that have a naming convention related to TIM3 and also to remove the comiler switch. |
- releases TIM3 and uses SOF irqs to trigger CDC tx transfers - increases CDC tx transfer rate sigificantly (up to 10x, depending on usage) - uses a txComplete callback to check immediately for a next tx transfer
I made some quick measurements, using cdc_speed_2 against the current master. As @dhylands suggested I used his rshell tool to transmit a jpg image to the sd card. The image size was 2.09MBytes. The transfer times were:
The PC->Pyboard times are not pure transfer speeds as flash write sequences are necessary too. For pure VCP transfer this is an improvement of about 10x! The calculated rawdata speed is about 1.4-2 Mbits/sec. Note that the buffers are identical but the SOF polling is faster than the existing TIM3 approach. |
some more numbers. I extended @dpgeorge testprogram slightly: import time, serial
ser = serial.Serial('/dev/cu.usbmodem1452', baudrate=115200, timeout=1, interCharTimeout=1)
time.sleep(0.2)
start = last = time.time()
number_of_bytes = 0
while (1):
if ser.inWaiting() > 0:
data = ser.read(ser.inWaiting())
number_of_bytes = number_of_bytes + len(data)
now = time.time()
if (now-last > 1):
print("RX data length {}".format(number_of_bytes))
mbits = (number_of_bytes * 8 / (now-start)) / 1e6
print("Speed: {} Mbits/sec".format(mbits))
last = now on the pyboard I run: while (1):
print(help()) the resulting output prints the effective transfer speed every second. On my HW setup the number is 8.6MBits/sec with cdc_speed_2 and 0.28MBits/sec with master, a difference of factor 31. |
I have been running some extensive tests for this patch. I confirm that the pyboard to PC goes from about 0.28MBit/s to about 8.2MBit/s. But in running these tests I very occasionally get a hard crash on the pyboard. It's not very repeatable so hard to debug, and I don't even know if the bug is introduced by this patch or if it was there before. @hoihu I think there is an even simpler way to make this change: there is already a SOF callback here: https://github.com/micropython/micropython/blob/master/stmhal/usbdev/class/src/usbd_cdc_msc_hid.c#L1034 it just needs to be filled in! Then, without the txComplete callback (which I don't think is needed) you can make a very concise patch :) SOF at 1ms shouldn't be too much of an ov 10000 erhead. If needed we can put a simple counter in the SOF interrupt that only calls the CDC callback every, say, 8 interrupts. The MICROPY_HW_USE_ALT_IRQ_FOR_CDC setting is not needed if the SOF approach works properly. Looking at the bigger picture, I'd really like to understand more about the USB interrupts, because if bInterval is set to some value like 16ms then I'd assume there was a USB interrupt being called at around that rate (initiated by the PC). And this could be used to do the pyboard->PC transfer (ie instead of using TIM3). The SOF interrupt shouldn't be needed. But, anyway, that can be an optimisation for later if the SOF approach works reliably. |
@dpgeorge Any sense of when the hard crash happens? Is just after booting the pyboard? Or is it after the pyboard has been running for a while? |
Here is the test script that eventually crashes with this PR (but so far not with master): https://gist.github.com/dpgeorge/47118387260df646c690 @dhylands it needs to run for a bit of time before it crashes, sometimes it crashes soon, othertimes it takes a while (ie after many of the read/write tests have run). |
yes I saw that. My patch optimized the callback. In the original way (SOF callback as you have pointed out) "USBD_LL_SOF(hpcd->pData)" is called and then the callback is looked up. Because this is a 1msec task I wanted to optimized it a bit. I can revert it to the original version if you like. And I can also remove the txComplete task. It may make sense for future applications (e.g. if you want to have a direct notification of a transfer complete event).
that was my first thougth on the subject, and maybe there is a way to do that. I haven't figured that out yet. You'd want to look in the HAL_PCD_IRQHandler resp the HAL_PCD_DATAInStageCallback (which is defined in USBD_conf.c). Note that the SOF irq timing is very a precise 1msec clock (some other ARM processor use this to FLL an internal Oscillator and avoid an external crystal...), whereas the bInterval settings is not reliable.
Interesting, try to replicate this on my desk. |
Ok, I agree with optimising it. But if we go this route (and I suggest we do) then may as well just rename USBD_CDC_HAL_TIM_PeriodElapsedCallback to HAL_PCD_SOFCallback. Regarding crash. Running the vcprate.py script linked above on master branch has not yet given a crash (after running for more than 1.5 hours). Whereas running it on this branch gives a crash in about 5 minutes. |
I tried the miminal change, by just renaming USBD_CDC_HAL_TIM_PeriodElapsedCallback to HAL_PCD_SOFCallback, and setting Sof_enable=1. It works, and the speed up is the same as reported above, but also the bug is still there (hard crash after about 4-5 mins of running vcprate.py). |
Hm I'm having troubles running the script on OSX. On cdc_speed_2 It starts with
then it hangs in the ser.inWaiting() call loop. With master I get:
|
@hoihu: For the first issue, try replacing the busy wait loop with a simple read: |
still the same. Unfortunately.
I guess I do ;) well with your patch it worked.
Great - so in the end it really might be a very small patch. |
Maybe it's a pyserial issue. I'll try with the latest version from github. |
Didn't improve with newest github pyserial. Although sometimes it does go a step further if ser.flush() is moved and/or timeout options changed. But it's very shaky. I'm not sure if this is a problem with the handshaking of the PC->pyboard (script) or a real problem with the firmware. The handshake is more time critical as the response is much faster. I'll be on holidays during the next couple of days, so I can't investigate further at the moment. |
Tried to pinpoint the fault with a reduced test script. I experienced crashes also with master (hard fault) so I believe it's a general issue. I'm using the simplified test script: import time
import argparse
import serial
def drain_input(ser):
a = ser.inWaiting()
if a:
drained = ser.read(a)
print("DRAINED {}".format(drained))
write_test_script = """
import sys
b=bytearray(%u)
while 1:
n = sys.stdin.readinto(b)
sys.stdout.write(b)
"""
def test(dev,buf_size):
ser = serial.Serial(dev, baudrate=4000000)
ser.write(b'\x03\x03\x01\x04') # break, break, raw-repl, soft-reboot
ser.flush()
ser.write(bytes(write_test_script % (buf_size), 'ascii'))
# ser.write(bytes(write_test_script % (buf_size,buf_size,buf_size), 'ascii'))
ser.flush()
time.sleep(0.1)
drain_input(ser)
ser.write(b'\x04') # eof
time.sleep(0.1)
ser.flush()
drain_input(ser)
for i in range(0,10):
ser.write(buf_size*"A".encode('ascii'))
ser.flush()
time.sleep(0.05)
a = ser.inWaiting()
if a != buf_size:
print("STRANGE, expected length = {}, was {}".format(buf_size,a))
response = ser.read(a)
print(response)
else:
response = ser.read(a)
# print(response[:160])
port = """/dev/cu.usbmodem1452"""
def main():
for i in range(1000):
print("testing {}".format(i))
test(port,200+i)
return
if __name__ == "__main__":
main() It happes always if the board is soft-reseting (not during the 10x iteration). So I believe it has something to do with initializing the board while the communication is still active. On cdc_speed_2 I tried to deactivate the SOF irqs while soft-reseting but that didn't help either. It takes some time to crash, sometimes more than 10min. What is immediately obvious is that the "DRAINED" output is not always consistent. It sometimes prints the "raw REPL; CTRL-B to exit\r\n>OK\r\nPYB: sync filesystems\r\nPYB: soft reboot\r\nraw REPL; CTRL-B to exit\r\n>" part and sometimes it doesn't. Again this happens with master. |
For other things, like the timers, we actually de-initialize them because if they fire during soft-reset then thay can be running with callbacks that point into the heap which don't exist any more. For USB, we don't want to reinitialize things, so all of the USB stuff is supposed to be stored in global variables (i.e. not on the heap). So if there is anything during interrupt processing which references something on the heap, then that would be bad, |
Hmm. pyb_usb_init0 allocates an exception object and stores it in MP_STATE_PORT(mp_const_vcp_interrupt) So if an interrupt to process Control-C happens between gc_init and pyb_usb_dev_init0 the interrupt handler will be referencing uninitialized data (most likely previously initialized data). However, since that early initialization always happens in the same order it seems likely that the exception object will get allocated at the same location each time (having the object get trampled by something else is usually what causes the serious issues). |
I haven't seen any hard crashes yet, running @hoihu's script on
57E6
master, but I have seen the following (running on master): |
Beside of what @dhylands has already reported, I also see the following quite often:
It seems that sometimes, the 0x4 (CTRL-D) is not interpreted correctly or not even present in the receive buffer of the pyboard (see the line with and then the following chars Occassionally a whole RAM segment is printed out as well. If the problem is triggered by soft-reseting the board only it may not be severe. I can open another ticket and move a problem summary there, otherwise this is getting mixed up with the CDC changes. Also we could think of other test scripts without involving a lot of CTRL-D's so that it's possible to test the rx/tx path only. |
If the Control-C happens to arrive when mp_hal_set_interrupt_char(-1) happens to be in effect: https://github.com/micropython/micropython/blob/master/lib/utils/pyexec.c#L74 So that explains the beginnings of the sequence where it shows the code being printed out. Then it stays out of synce for a few iterations and gets back into sync. I'm alot more concerned about the cases where it seems to be reporting data from RAM beyond the end of the buffer. I think I'll add a marker just beyond the end of the buffer to confirm that. That may be that the length is being corrupted. I'm not sure how that causes a hard-fault yet - but understanding the length corruption may show how something else might get corrupted and cause the hard fault. |
Thanks guys for the debugging! Looks like this might take some time to fully resolve... |
In order to separate those different issues I've modified @dpgeorge testscript (see https://github.com/hoihu/projects/blob/master/scripts/vcp_testrun.py) The intention was to handshake the soft reset sequence and the uploaded testscript with the PC's test script using these rules:
Now what I did observe was that both By inserting a small delay between the CTRL-C ( I know that the small delay is hiding a possible bug that may be related to what @dhylands has described. I feel not competent to judge this issue, but I guess this should be fixed in FW rather by inserting a delay from the PC.Hence IMO the issue observed is not related to the patch but rather another issue caused by repeated CTRL-C's. I believe that cdc_speed_2 is actually quite stable. I'd be glad if somebody can verify the results. The testscript can be run easily (you only need to change the port settings). |
Having back-to-back Control-C's makes it much more likely that the pendsv_nlr_longjmp will actually do a longjmp via the interrupt handler. Its possible that this in turn leaves some code in an inconsistent state, depending on what was executing when the longjmp occurred. |
I was able to trigger the "repeated ctrl-c hard crash" with master with a simplified script (takes only a few seconds on my board to trigger a crash) I've placed an issue here: |
@hoihu I have made a minimal version of this PR that does just the bare minimum necessary to get SOF working (not public, only on my local PC right now). I'm running tests now with the test script using only one ctrl-C (\x03) to reset the board (using 2 ctrl-C's is actually not needed in this test script). And so far there are no hard crashes. I'll run it for a few hours and if there's nothing wrong then I'll merge my version. Of course I will credit you as originator of the idea in the commit message. Is this ok for you? Otherwise, if you want to be the author of the commit, I'll need to ask you to make this PR as minimal as possible. |
Yes it's fine for me - go ahead with your version. Nice kickstarter btw. Backed it yesterday😃. |
@hoihu ok, great! And thanks for the pledge! So my tests finished. I ran the loop 1000 times, each time reading and then writing 8000 bytes. There were no hard faults (and running the same test with 2x ctrl-C's I did hit a hard fault). |
Ok so I'm closing this now. I'll leave #1796 open. |
Previous to this patch the USB CDC driver used TIM3 to trigger the sending of outgoing data over USB serial. This patch changes the behaviour so that the USB SOF interrupt is used to trigger the processing of the sending. This reduces latency and increases bandwidth of outgoing data. Thanks to Martin Fischer, aka @hoihu, for the idea and initial prototype. See PR #1713.
…ilds turn on reversed() for all builds
This patch releases TIM3 from periodically polling the CDC-USB tx state