8000 stmhal: USB CDC speed improvements, TIM3 release by hoihu · Pull Request #1713 · micropython/micropython · GitHub
[go: up one dir, main page]

Skip to content

stmhal: USB CDC speed improvements, TIM3 release #1713

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

hoihu
Copy link
Contributor
@hoihu hoihu commented Dec 13, 2015

This patch releases TIM3 from periodically polling the CDC-USB tx state

  • no overhead of periodically polling tx state (e.g. if USB is not connected)
  • TIM3 is free to use
  • faster response and improved data rate because the callback is directly called

@hoihu
Copy link
Contributor Author
hoihu commented Dec 13, 2015

This is a follow-up/replacement to #1652 - it only implements the change for TIM3 timer handling / callback.

It passes all the test ./run-tests --target=pyboard --device=/dev/..

The speedup is app. 27% for the test case above (it involve a lot of non-comm parts so the real speedup is faster)

Edit: Actually the speedup is more like 10-30x, see below for numbers

I've commented out the USBD_CDC_HAL_TIM_PeriodElapsedCallback() call in timer.c (no more used) - please advice if it should be encapsulated in a compiler switch? Or completely removed?

@dpgeorge
Copy link
Member

This is much simpler, thanks!

I tested raw VCP speed and it went from 33 kbyte/sec up to 54 kbyte/sec with this patch, so that's a great improvement!

But there is an issue: the serial driver does not seem to flush its buffer (call USBD_CDC_HAL_TIM_PeriodElapsedCallback) at the very start. To test this, reset the board (it'll print the uPy header to serial) then run (using normal Python on your PC):

import time, serial
ser = serial.Serial('/dev/ttyACM0', baudrate=115200, timeout=1, interCharTimeout=1)
time.sleep(0.2)
while ser.inWaiting() > 0:
    print(ser.read(ser.inWaiting()))

That script should read whatever is waiting on the serial line. Before this patch it would dump the uPy header. After this patch there is no data to read.

@dpgeorge
Copy link
Member

Another problem with this patch: if I write a simple pyboard script to write chars repeatedly to sys.stdout, then have another script running on the PC to read the chars (same as the one in my previous comment), then these chars are not received if the pyboard VCP buffer gets full. Without the patch it works fine, there will always be chars available to read by the PC.

@hoihu
Copy link
Contributor Author
hoihu commented Dec 14, 2015

But there is an issue: the serial driver does not seem to flush its buffer (call USBD_CDC_HAL_TIM_PeriodElapsedCallback) at the very start.

Thanks, yes I see this as well - should have noticed this, sorry.

I'll have a look. Maybe some sort of periodic polling might still be necessary as backup if something is failing. Perhaps the USB stack allows for that as well (I still think there should be an endopoint polling hook somewhere).

@hoihu
Copy link
Contributor Author
hoihu commented Dec 17, 2015

ok one way to solve this is to enable the SOF (start-of-frame) interrupt. There is a callback for this event as well and it seems to work (solves the buffer flush issue - haven't checked the other one). It's essentially a similar approach as before (periodic polling of the USB endpoint buffers) but does not need a timer.

Thoughts / Things to consider:

  • the SOF events happen in 1msec intervals (I checked this with my logic analyser). So the IRQ callback is called at that rate (USBD_CDC_HAL_TIM_PeriodElapsedCallback). Is this acceptable?
  • The code changes for this is even less than the one I proposed above, so from this point of view, it's not a bad thing
  • We could leave the "txComplete" callback in place - it might even further increase data throughput but probably not a lot as the 1msec polling interval is already quite low. I can provide some measurement results if needed.
  • A "combined" approach is also possible, e.g. enable SOF irq if a transfer was triggered and disable it if transfer was completed and no more data is pending to send.That may be the best approach but would introduce a little more code changes.
  • What about the bInterval settings in the device descriptor? Reduce it to 5-10msec?

@hoihu
Copy link
Contributor Author
hoihu commented Dec 18, 2015

another issue is how to best integrate the patch.

There is already MICROPY_HW_USE_ALT_IRQ_FOR_CDC. Adding another compiler switch would bloat the code.

I'd propose to rename/remove the code parts that have a naming convention related to TIM3 and also to remove the comiler switch.

- releases TIM3 and uses SOF irqs to trigger CDC tx transfers
- increases CDC tx transfer rate sigificantly (up to 10x, depending on usage)
- uses a txComplete callback to check immediately for a next tx transfer
@hoihu
Copy link
Contributor Author
hoihu commented Dec 18, 2015

I made some quick measurements, using cdc_speed_2 against the current master. As @dhylands suggested I used his rshell tool to transmit a jpg image to the sd card. The image size was 2.09MBytes.

The transfer times were:

Branch Pyboard -> PC PC -> Pyboard
cdc_speed_2 8sec 13sec
master 82sec 41sec

The PC->Pyboard times are not pure transfer speeds as flash write sequences are necessary too.
I'm a bit puzzeld as to why the master branch is slower in transferring data from the pyboad to the pc because there is no flash interaction...

For pure VCP transfer this is an improvement of about 10x! The calculated rawdata speed is about 1.4-2 Mbits/sec. Note that the buffers are identical but the SOF polling is faster than the existing TIM3 approach.

@hoihu
Copy link
Contributor Author
hoihu commented Dec 18, 2015

some more numbers. I extended @dpgeorge testprogram slightly:

import time, serial
ser = serial.Serial('/dev/cu.usbmodem1452', baudrate=115200, timeout=1, interCharTimeout=1)
time.sleep(0.2)
start = last = time.time()
number_of_bytes = 0

while (1):
    if ser.inWaiting() > 0:
        data = ser.read(ser.inWaiting())
        number_of_bytes = number_of_bytes + len(data)
    now = time.time()
    if (now-last > 1):
        print("RX data length {}".format(number_of_bytes))
        mbits = (number_of_bytes * 8 / (now-start)) / 1e6
        print("Speed: {} Mbits/sec".format(mbits))
        last = now

on the pyboard I run:

while (1):
    print(help())

the resulting output prints the effective transfer speed every second. On my HW setup the number is 8.6MBits/sec with cdc_speed_2 and 0.28MBits/sec with master, a difference of factor 31.

@hoihu hoihu changed the title stmhal: callback for USB VCP instead of TIM3 stmhal: USB CDC speed improvements, TIM3 release Dec 18, 2015
@dpgeorge dpgeorge mentioned this pull request Dec 24, 2015
2 tasks
@dpgeorge
Copy link
Member

I have been running some extensive tests for this patch. I confirm that the pyboard to PC goes from about 0.28MBit/s to about 8.2MBit/s. But in running these tests I very occasionally get a hard crash on the pyboard. It's not very repeatable so hard to debug, and I don't even know if the bug is introduced by this patch or if it was there before.

@hoihu I think there is an even simpler way to make this change: there is already a SOF callback here: https://github.com/micropython/micropython/blob/master/stmhal/usbdev/class/src/usbd_cdc_msc_hid.c#L1034 it just needs to be filled in! Then, without the txComplete callback (which I don't think is needed) you can make a very concise patch :)

SOF at 1ms shouldn't be too much of an ov 10000 erhead. If needed we can put a simple counter in the SOF interrupt that only calls the CDC callback every, say, 8 interrupts.

The MICROPY_HW_USE_ALT_IRQ_FOR_CDC setting is not needed if the SOF approach works properly.

Looking at the bigger picture, I'd really like to understand more about the USB interrupts, because if bInterval is set to some value like 16ms then I'd assume there was a USB interrupt being called at around that rate (initiated by the PC). And this could be used to do the pyboard->PC transfer (ie instead of using TIM3). The SOF interrupt shouldn't be needed. But, anyway, that can be an optimisation for later if the SOF approach works reliably.

@dhylands
Copy link
Contributor

@dpgeorge Any sense of when the hard crash happens? Is just after booting the pyboard? Or is it after the pyboard has been running for a while?

@dpgeorge
Copy link
Member

Here is the test script that eventually crashes with this PR (but so far not with master): https://gist.github.com/dpgeorge/47118387260df646c690

@dhylands it needs to run for a bit of time before it crashes, sometimes it crashes soon, othertimes it takes a while (ie after many of the read/write tests have run).

@hoihu
Copy link
Contributor Author
hoihu commented Dec 27, 2015

I think there is an even simpler way to make this change: there is already a SOF callback here: https://github.com/micropython/micropython/blob/master/stmhal/usbdev/class/src/usbd_cdc_msc_hid.c#L1034 it just needs to be filled in!

yes I saw that. My patch optimized the callback. In the original way (SOF callback as you have pointed out) "USBD_LL_SOF(hpcd->pData)" is called and then the callback is looked up. Because this is a 1msec task I wanted to optimized it a bit. I can revert it to the original version if you like.

And I can also remove the txComplete task. It may make sense for future applications (e.g. if you want to have a direct notification of a transfer complete event).

if bInterval is set to some value like 16ms then I'd assume there was a USB interrupt being called at around that rate

that was my first thougth on the subject, and maybe there is a way to do that. I haven't figured that out yet. You'd want to look in the HAL_PCD_IRQHandler resp the HAL_PCD_DATAInStageCallback (which is defined in USBD_conf.c).

Note that the SOF irq timing is very a precise 1msec clock (some other ARM processor use this to FLL an internal Oscillator and avoid an external crystal...), whereas the bInterval settings is not reliable.

I very occasionally get a hard crash

Interesting, try to replicate this on my desk.

@dpgeorge
Copy link
Member

yes I saw that. My patch optimized the callback.

Ok, I agree with optimising it. But if we go this route (and I suggest we do) then may as well just rename USBD_CDC_HAL_TIM_PeriodElapsedCallback to HAL_PCD_SOFCallback.

Regarding crash. Running the vcprate.py script linked above on master branch has not yet given a crash (after running for more than 1.5 hours). Whereas running it on this branch gives a crash in about 5 minutes.

@dpgeorge
Copy link
Member

I tried the miminal change, by just renaming USBD_CDC_HAL_TIM_PeriodElapsedCallback to HAL_PCD_SOFCallback, and setting Sof_enable=1. It works, and the speed up is the same as reported above, but also the bug is still there (hard crash after about 4-5 mins of running vcprate.py).

@hoihu
Copy link
Contributor Author
hoihu commented Dec 27, 2015

Hm I'm having troubles running the script on OSX. On cdc_speed_2 It starts with

drain: b'MicroPython v1.5.1-1'
initial read: 1020
reading data...
using bufsize=8000, read 256020 bytes in 0.29 sec = 872.16 kibytes/sec = 7.14 MBits/sec
writing data...

then it hangs in the ser.inWaiting() call loop. With master I get:

drain: b'\x04Traceback (most rec'
initial read: 1020
reading data...
using bufsize=8000, read 256620 bytes in 6.63 sec = 37.81 kibytes/sec = 0.31 MBits/sec
drain: b'e\x04Traceback (most re'
drain: b'OK'
writing data...
Traceback (most recent call last):
  File "vcprate.py", line 117, in <module>
    main()
  File "vcprate.py", line 101, in main
    write_test(port, 8000, 32)
  File "vcprate.py", line 87, in write_test
    if response != b'OK%04u' % bufsize:
TypeError: unsupported operand type(s) for %: 'bytes' and 'int'

@dpgeorge
Copy link
Member

@hoihu: For the first issue, try replacing the busy wait loop with a simple read: response = ser.read(6). For the second issue, you need Python 3.5 :) Or just change it to bytes('OK%04u' % bufsize).

@hoihu
Copy link
Contributor Author
hoihu commented Dec 27, 2015

try replacing the busy wait loop with a simple read

still the same. Unfortunately.

you need Python 3.5 :)

I guess I do ;) well with your patch it worked.

It works, and the speed up is the same as reported above,

Great - so in the end it really might be a very small patch.

@hoihu
Copy link
Contributor Author
hoihu commented Dec 27, 2015

Maybe it's a pyserial issue. I'll try with the latest version from github.

@hoihu
Copy link
Contributor Author
hoihu commented Dec 27, 2015

Didn't improve with newest github pyserial.

Although sometimes it does go a step further if ser.flush() is moved and/or timeout options changed. But it's very shaky. I'm not sure if this is a problem with the handshaking of the PC->pyboard (script) or a real problem with the firmware. The handshake is more time critical as the response is much faster.

I'll be on holidays during the next couple of days, so I can't investigate further at the moment.

@hoihu
Copy link
Contributor Author
hoihu commented Dec 30, 2015

Tried to pinpoint the fault with a reduced test script.

I experienced crashes also with master (hard fault) so I believe it's a general issue. I'm using the simplified test script:

import time
import argparse
import serial

def drain_input(ser):
    a = ser.inWaiting()
    if a:
        drained = ser.read(a)
        print("DRAINED {}".format(drained))

write_test_script = """
import sys
b=bytearray(%u)
while 1:
    n = sys.stdin.readinto(b)
    sys.stdout.write(b)
    """

def test(dev,buf_size):
    ser = serial.Serial(dev, baudrate=4000000)
    ser.write(b'\x03\x03\x01\x04') # break, break, raw-repl, soft-reboot
    ser.flush()
    ser.write(bytes(write_test_script % (buf_size), 'ascii'))
    # ser.write(bytes(write_test_script % (buf_size,buf_size,buf_size), 'ascii'))
    ser.flush()
    time.sleep(0.1)
    drain_input(ser)
    ser.write(b'\x04') # eof
    time.sleep(0.1)
    ser.flush()
    drain_input(ser)

    for i in range(0,10):
        ser.write(buf_size*"A".encode('ascii'))
        ser.flush()
        time.sleep(0.05)
        a = ser.inWaiting()
        if a != buf_size:
            print("STRANGE, expected length = {}, was {}".format(buf_size,a))
            response = ser.read(a)  
            print(response)
        else:
            response = ser.read(a)  
            # print(response[:160])

port = """/dev/cu.usbmodem1452"""

def main():
    for i in range(1000):
        print("testing {}".format(i))
        test(port,200+i)
    return

if __name__ == "__main__":
    main()

It happes always if the board is soft-reseting (not during the 10x iteration). So I believe it has something to do with initializing the board while the communication is still active. On cdc_speed_2 I tried to deactivate the SOF irqs while soft-reseting but that didn't help either.

It takes some time to crash, sometimes more than 10min. What is immediately obvious is that the "DRAINED" output is not always consistent. It sometimes prints the "raw REPL; CTRL-B to exit\r\n>OK\r\nPYB: sync filesystems\r\nPYB: soft reboot\r\nraw REPL; CTRL-B to exit\r\n>" part and sometimes it doesn't. Again this happens with master.

@dhylands
Copy link
Contributor

For other things, like the timers, we actually de-initialize them because if they fire during soft-reset then thay can be running with callbacks that point into the heap which don't exist any more.

For USB, we don't want to reinitialize things, so all of the USB stuff is supposed to be stored in global variables (i.e. not on the heap). So if there is anything during interrupt processing which references something on the heap, then that would be bad,

@dhylands
Copy link
Contributor

Hmm. pyb_usb_init0 allocates an exception object and stores it in MP_STATE_PORT(mp_const_vcp_interrupt)

So if an interrupt to process Control-C happens between gc_init and pyb_usb_dev_init0 the interrupt handler will be referencing uninitialized data (most likely previously initialized data). However, since that early initialization always happens in the same order it seems likely that the exception object will get allocated at the same location each time (having the object get trampled by something else is usually what causes the serious issues).

@dhylands
Copy link
Contributor

I haven't seen any hard crashes yet, running @hoihu's script on 57E6 master, but I have seen the following (running on master):
https://pastebin.mozilla.org/8855520
which looks very disconcerting.

@hoihu
Copy link
Contributor Author
hoihu commented Dec 30, 2015

Beside of what @dhylands has already reported, I also see the following quite often:

DRAINED b'\x04Traceback (most recent call last):\r\n  File "<stdin>", line 5, in <module>\r\nKeyboardInterrupt: \r\n\x04>OK'
testing 60
STRANGE, expected length = 260, was 364
b'\x03\x03\x01\r\nimport sys\r\nb=bytearray(260)\r\nwhile 1:\r\n    n = sys.stdin.readinto(b)\r\n    sys.stdout.write(b)\r\n    \x04AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\x04Traceback (most recent call last):\r\n  File "<stdin>", line 6, in <module>\r\nKeyboardInterrupt: \r\n\x04>'
STRANGE, expected length = 260, was 0
b''
STRANGE, expected length = 260, was 0

It seems that sometimes, the 0x4 (CTRL-D) is not interpreted correctly or not even present in the receive buffer of the pyboard (see the line with and then the following chars x03\x03\x01\.. where it should read 0x4at the end and hence the following code (starting with "import sys" etc is not put into the raw repl). Then of course the following tests are also failing because the program submitted was invalid.

Occassionally a whole RAM segment is printed out as well.

If the problem is triggered by soft-reseting the board only it may not be severe. I can open another ticket and move a problem summary there, otherwise this is getting mixed up with the CDC changes.

Also we could think of other test scripts without involving a lot of CTRL-D's so that it's possible to test the rx/tx path only.

@dhylands
Copy link
Contributor

If the Control-C happens to arrive when mp_hal_set_interrupt_char(-1) happens to be in effect: https://github.com/micropython/micropython/blob/master/lib/utils/pyexec.c#L74
then it will be treated as data and not as a Control-C. I think that explains the above. Recognizing Control-A, B, (enter/exit raw repl) will only happen while in the REPL, so if the Control-C is missed, then it will still be inside the loop from the previous invocation and Control-A Control-D sequence will be treated as data.

So that explains the beginnings of the sequence where it shows the code being printed out. Then it stays out of synce for a few iterations and gets back into sync.

I'm alot more concerned about the cases where it seems to be reporting data from RAM beyond the end of the buffer. I think I'll add a marker just beyond the end of the buffer to confirm that. That may be that the length is being corrupted.

I'm not sure how that causes a hard-fault yet - but understanding the length corruption may show how something else might get corrupted and cause the hard fault.

@dpgeorge
Copy link
Member

Thanks guys for the debugging! Looks like this might take some time to fully resolve...

@hoihu
Copy link
Contributor Author
hoihu commented Jan 1, 2016

In order to separate those different issues I've modified @dpgeorge testscript (see https://github.com/hoihu/projects/blob/master/scripts/vcp_testrun.py)

The intention was to handshake the soft reset sequence and the uploaded testscript with the PC's test script using these rules:

  • check for the CTRL-B to exitsequence to handshake soft-resets
  • check for Ready to verify that the uploaded testscript is ready to run (the testscript sends this as a startup message).

Now what I did observe was that both master and cdc_speed_2were sensitive for repeated CTRL-C sequences. I got different messages back from the pyboard after soft-reseting (on master). Those eventually lead to inconsitent behaviour with responses and even complete hard crashes.

By inserting a small delay between the CTRL-C (\x03...\x03\x01\x04) I managed to get a consistent response from the pyboard. No more hard-crashed were observed (on both branches). I tested with blocksizes from 10Bytes - 10kBytes.

I know that the small delay is hiding a possible bug that may be related to what @dhylands has described. I feel not competent to judge this issue, but I guess this should be fixed in FW rather by inserting a delay from the PC.Hence IMO the issue observed is not related to the patch but rather another issue caused by repeated CTRL-C's. I believe that cdc_speed_2 is actually quite stable.

I'd be glad if somebody can verify the results. The testscript can be run easily (you only need to change the port settings).

@dhylands
Copy link
Contributor
dhylands commented Jan 2, 2016

Having back-to-back Control-C's makes it much more likely that the pendsv_nlr_longjmp will actually do a longjmp via the interrupt handler.

Its possible that this in turn leaves some code in an inconsistent state, depending on what was executing when the longjmp occurred.

@hoihu
Copy link
Contributor Author
hoihu commented Jan 22, 2016

I was able to trigger the "repeated ctrl-c hard crash" with master with a simplified script (takes only a few seconds on my board to trigger a crash)

I've placed an issue here:
#1796

@dpgeorge dpgeorge mentioned this pull request Jan 28, 2016
4 tasks
@dpgeorge
Copy link
Member

@hoihu I have made a minimal version of this PR that does just the bare minimum necessary to get SOF working (not public, only on my local PC right now). I'm running tests now with the test script using only one ctrl-C (\x03) to reset the board (using 2 ctrl-C's is actually not needed in this test script). And so far there are no hard crashes. I'll run it for a few hours and if there's nothing wrong then I'll merge my version. Of course I will credit you as originator of the idea in the commit message. Is this ok for you? Otherwise, if you want to be the author of the commit, I'll need to ask you to make this PR as minimal as possible.

@hoihu
Copy link
Contributor Author
hoihu commented Jan 29, 2016

Yes it's fine for me - go ahead with your version.

Nice kickstarter btw. Backed it yesterday😃.

@dpgeorge
Copy link
Member

@hoihu ok, great! And thanks for the pledge!

So my tests finished. I ran the loop 1000 times, each time reading and then writing 8000 bytes. There were no hard faults (and running the same test with 2x ctrl-C's I did hit a hard fault).

@hoihu
Copy link
Contributor Author
hoihu commented Jan 29, 2016

Ok so I'm closing this now. I'll leave #1796 open.

@hoihu hoihu closed this Jan 29, 2016
@hoihu hoihu deleted the cdc_speed_2 branch January 29, 2016 16:31
dpgeorge added a commit that referenced this pull request Jan 29, 2016
Previous to this patch the USB CDC driver used TIM3 to trigger the
sending of outgoing data over USB serial.  This patch changes the
behaviour so that the USB SOF interrupt is used to trigger the processing
of the sending.  This reduces latency and increases bandwidth of outgoing
data.

Thanks to Martin Fischer, aka @hoihu, for the idea and initial prototype.

See PR #1713.
@dpgeorge
Copy link
Member

My version of this was made in d363133. Thanks @hoihu! And thanks @dhylands!

tannewt added a commit to tannewt/circuitpython that referenced this pull request Mar 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0