Add support for cycle counter and other ideas for time module #1225
Comments
For anybody wishing to play with the CPU cycle counter, this is my hack that I threw into pyb.info() just to see if it worked:
diff --git a/stmhal/modpyb.c b/stmhal/modpyb.c
index 2c5c199..3ca3a44 100644
--- a/stmhal/modpyb.c
+++ b/stmhal/modpyb.c
@@ -155,6 +155,20 @@ STATIC mp_obj_t pyb_info(mp_uint_t n_args, const mp_obj_t *args) {
printf("LFS free: %u bytes\n", (uint)(nclst * fatfs->csize * 512));
}
+ {
+ static int enabled = 0;
+ if (!enabled) {
+ CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
+ DWT->CYCCNT = 0;
+ DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
+ enabled = 1;
+ }
+ printf("DWT->CTRL = %lu\n", DWT->CTRL);
+ uint32_t t1 = DWT->CYCCNT;
+ uint32_t t2 = DWT->CYCCNT;
+ printf("CYCCNT = %lu delta t = %lu\n", DWT->CYCCNT, t2 - t1);
+ }
+
if (n_args == 1) {
// arg given means dump gc allocation table
gc_dump_alloc_table();
And these are the relevant lines from executing: pyb.info(); pyb.info() ...
|
Well, every Cortex-M3 and higher CPU has a cycle counter. OK, the Cortex-M3 and up architecture has a cycle counter, though a particular implementation may omit it. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337h/BABJFFGJ.html
Yes, and to support cross-architectural things, the "machine" module was recently introduced. But thinking about it, cycle counting is generally more related to "timing" than to "hardware": it is just another method of measuring time. And Python has the "time" module for that, and it would be nice to think about how to extend it to support all the things we have found necessary for embedded systems/MCUs (in a Pythonic (not Arduinic) manner, with quality high enough for it to be submitted as a PEP). |
So yeah - every M3 or M4 may have a cycle counter. I was able to confirm that it actually exists on the STM32F405 chip. I don't see that the time module is relevant, since that's designed for wall time. This is more closely related to performance or profiling. What I'm really interested in, is adding the infrastructure for accounting so that we can separate interrupt time from main thread time, etc. This doesn't really need a cycle counter, but that gives the best results. Any timer (like the SysTick or other) can be used. |
Maybe it was so in python1.5, but since https://www.python.org/dev/peps/pep-0418/ it's all things (scalar) time: https://docs.python.org/3/library/time.html (well, as much as people sitting on desktops could do it).
So, you're interested in adhoc application profiling, rather than providing access to a generic high-resolution clock (what cycle counter is), on top of which any adhoc profiling can be made, is that right? |
It seems that any of the useful time functions require floating point? Exposing the cycle counter seems useful in and of itself. It means I can get access to a high resolution timer without having to tie up any additional timer hardware. We could expose time.perf_counter() but it's only meaningful for platforms which have floating point support. I'd also like to expose some profiling information. If there are existing profiling modules available in python for measuring things like time spent in interrupt handlers, then I'd like to hear about them. All of the python profiling stuff I've seen so far is for profiling your python code. I'm interested in profiling the pyboard as a system, which includes non-python stuff, and includes things like interrupt handlers, which I've not seen mentioned in any of the profiler things I've looked at (but I'm not that familiar with them). What I really want to know is how much headroom I've got before I run out of CPU cycles, and my program stops running in realtime. Or answer the question: can my program continue to run in realtime if I lower the clock frequency by 50%? Or answer the question: can I take this code that I've written on my 168MHz pyboard and expect it to run on the 84MHz Espruino Pico board? Or some other processor running micropython? I just did a scan of nucleo dev-boards and there are a number of them which have flash/RAM big enough to run micropython, and they have clocks ranging from 32-100 MHz. I'd like to know if any of those might be suitable to run a given program (assuming that there is a micropython port available for that chip). My gut tells me that none of the available python profilers will provide those answers. |
Yes, and not just float, but a double-precision float (or put another way, a float where the number of mantissa bits is not less than 32, to cover a "standard" int value). That's a problem, and something we (who else?) should find a good enough solution for. [more comments in thinking] |
The 3.4 documentation on
This might be perfect for an integer return value, as the output of clock is not defined by CPython. micropython could just define it as always being an integer output (the same as the C Of course, python may eventually remove Otherwise, I vote to support |
@dhylands I used your cycle counting code (CYCCNT register) to do some profiling of bytecodes (i.e. how many cycles each bytecode took, by reading CYCCNT at each dispatch point) and it worked very well. It's not something that would be easy (or perhaps even useful) for everyday use because it slows down the code a bit. But I think something like this is useful to have. Off topic a bit, but related to your questions about IRQ profiling: I also have some code which counts each time an IRQ is fired. There is an individual 32-bit count for each IRQ that is incremented when that IRQ is run. This adds a lot to the code size and makes all IRQs slightly less efficient, but is very useful to see exactly what is running in the "background". The uPy interface is |
On topic of adding stuff we need to "time" module, I finally posted to python-ideas to get wider feedback: https://mail.python.org/pipermail/python-ideas/2015-June/034241.html |
Ok, so python-ideas discussion was useful, here's my finalized proposal based on it:
|
There's an alternative proposal from Nick Coghlan: https://mail.python.org/pipermail/python-ideas/2015-June/034365.html . It's more consistent in its naming scheme, less constrained, and offers some reflection, all of which makes the API bigger and harder to write (but easier to read). |
To map original @dhylands' request to my proposal, it would be |
I like it. One thing:
Why would ticks_us be limited to 24 bits? On Sat, Jul 11, 2015 at 9:37 PM, Paul Sokolovsky notifications@github.com
|
Thanks.
Well, that's just an abstract example, however based on mix-up of real-world possibilities. For example, if ticks_us() is backed by hardware timer, and that timer is 24-bit, there would be that limitation. Cortex-M SysTick counter is an example of 24-bit counter (but it doesn't count microseconds). |
@dpgeorge : Any comments? |
Sorry, this one got lost :) I'm happy with sleep_ms and sleep_us, and also ticks_ms and ticks_us. I would rather ticks_elapsed than elapsed to emphasise the restricted use of this function (eg it can't be used to process values from time.time()). For the CPU cycle counter how about ticks_cpu()? It has a more precise meaning than perf_counter_raw. Or even ticks_raw.
Then already this is a rather big limitation (limiting ms and us to 24-bit): a 24-bit microsecond counter can only count to 16 seconds, compared with 71 minutes for a 32-bit counter. And ms is limited to 4.6 hours compared with around 50 days. So maybe we already need separate elapsed functions for the different counters. |
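The wrap-around figures quoted above can be checked with a few lines of arithmetic. This is only a sketch; the helper name `wrap_period_seconds` is made up for illustration and is not part of any proposed API.

```python
def wrap_period_seconds(bits, ticks_per_second):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / ticks_per_second


# Compare 24-bit and 32-bit counters at microsecond and millisecond resolution.
for name, rate in (("us", 1_000_000), ("ms", 1_000)):
    for bits in (24, 32):
        seconds = wrap_period_seconds(bits, rate)
        print(f"{bits}-bit {name} counter wraps after {seconds:.1f} s "
              f"({seconds / 60:.1f} min, {seconds / 3600:.2f} h, "
              f"{seconds / 86400:.2f} days)")
```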
I think that the 24 bit example was just an example. We should use the max resolution that we can. Our millisecond tick counter is already 32-bit. We use the systick counter by setting the reload value to (CPU frequency/1000) and it downcounts to zero and reloads (so it reloads every millisecond). So as long as we're running at less than (2^24) * 1000 Hz (which is 16 GHz) then our use of a 24-bit systick timer is fine. |
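The SysTick argument above can be written out as a small calculation. This is a sketch of the arithmetic only, not the stmhal code; `systick_ticks_per_ms` and `MAX_CPU_HZ` are hypothetical names, and it assumes SysTick is clocked at the CPU frequency as described in the comment.

```python
SYSTICK_BITS = 24  # width of the Cortex-M SysTick down-counter

# Highest CPU clock at which one millisecond still fits in the counter.
MAX_CPU_HZ = (1 << SYSTICK_BITS) * 1000


def systick_ticks_per_ms(cpu_hz):
    """Number of SysTick counts per millisecond for a given CPU clock.

    Assumption: the counter runs at the CPU frequency. Real hardware is
    programmed with this value minus one, which is why 2**24 counts fit.
    """
    ticks = cpu_hz // 1000
    if ticks > (1 << SYSTICK_BITS):
        raise ValueError("a 1 ms period does not fit in a 24-bit SysTick")
    return ticks


print(systick_ticks_per_ms(168_000_000))  # pyboard clock from the thread
print(MAX_CPU_HZ / 1e9, "GHz upper limit")
```

The 32-bit millisecond count is then kept in software, incremented once per reload, so the 24-bit hardware width only limits the CPU clock and not the range of the millisecond counter.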
Well, the ticks_ms and ticks_us functions should not be considered in isolation - there should still be the rest of Python's time functions, like time.time(), which counts in seconds. I wanted to speculate on what bit width ticks_us, ticks_ms should have, but then skipped it, as that would be an "informal addendum". So, common sense says that they should be not less than 10 bits (count to >=1000). But that's actually not enough, because periods like 1.5ms, 2.5ms could not be measured without big error. Then, 14 bits is the next threshold (16-bit arch, 1 bit for tagged pointer, 1 bit to keep values positive, which is a waste, but human-friendly). But if someone has an MCU with just 8-bit counters, and backs these functions by raw counters, does it mean one can't have ticks_us() (because of only 8-bit width)? Nope, they can. |
We cannot be sure that such a counter comes literally from the "CPU", but I guess this imprecision isn't a problem, and the name indeed sounds better than perf_counter_raw(), both because it's shorter and because the original perf_counter() is still defined in terms of seconds. |
Thanks for the review, I'm adding "unix" implementation to my queue then. |
The problem with having a common elapsed() function is that you need to decide from the very beginning what bit width to support. Example: say you have ticks_ms and ticks_us and they are 32-bit and so elapsed uses 32-bit wrapping arith. Then some time later you want to add ticks_ns and it has only 16-bit precision, then you need to change elapsed to use 16-bit wrapping, and then existing code may break (because now you can only measure up to 65s with ticks_ms). |
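To make the bit-width concern concrete, here is a minimal sketch of wrapping subtraction. The function name `wrapping_elapsed` and its explicit `bits` argument are assumptions for illustration; the API proposed in this thread does not take a width parameter.

```python
def wrapping_elapsed(start, end, bits):
    """Ticks from start to end on a counter that wraps at 2**bits.

    Correct only if fewer than 2**bits ticks really elapsed.
    """
    return (end - start) & ((1 << bits) - 1)


# A 16-bit counter that wrapped between the two readings still gives 10.
print(wrapping_elapsed(65530, 4, 16))

# A true interval of 70000 ms is fine with 32-bit wrapping...
print(wrapping_elapsed(1000, 71000, 32))

# ...but silently aliases once the same function uses 16-bit wrapping,
# because the low 16 bits of 71000 are all the counter can report.
print(wrapping_elapsed(1000, 71000 & 0xFFFF, 16))
```

The last call is the breakage described above: code that relied on measuring more than about 65 s of milliseconds gets a wrong but plausible-looking value instead of an error.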
Yes, and arguably, that's not rocket science to do.
Multi-second periods should be measured with second-precision time() in the first place. So, if it wasn't done right in the first place, no surprise it may break. Overall, that's a compromise as usual, but IMHO having a single "elapsed" function is not that bad a one. |
Another idea about "ticks_elapsed" - I'm still concerned about the length of identifier, what about "ticks_diff"? |
But then we should from the start specify the full set of
A function like
Saving 3 bytes is not much... elapsed does give a good feel for what the function is actually returning (instead of a simple difference which you'd think could be achieved by a subtraction). In datetime module they use timedelta, so it could be ticks_delta. |
Yes, that was the idea and I agree that common prefix is good.
So everyone is of course encouraged to give maximum width for all time resolutions. But not being able to do so is not a reason to e.g. invent one's own adhoc API, because the proposed API is general enough. How particular software will run with it is a matter for that particular software (just the same as software requiring a 100Mb heap won't run on 128K).
It's not 3 bytes, it's 3 keystrokes, and that's noticeable. So, I don't know which is better, I'd optimize for length. Please decide what it should be. |
Agree. Typing In addition, I think diff is anyway a better description of what the function does: it takes the difference of the 2 arguments. The word elapsed was better suited to the original API where the function (eg elapsed_millis) took 1 argument (being the start time) and returned the elapsed time since then. |
time.ticks_cpu() is now implemented in stmhal, and possibly other ports. |
It turns out that the STM32 M3 and M4 CPUs have a cycle counter. I've been able to enable it and get values back from it on the pyboard.
I'd like to integrate this into the codebase so that we can do some basic cycle accounting; i.e. cycles spent processing IRQs, cycles spent in the main thread, cycles spent waiting for interrupts (WFI), cycles spent sleeping (delay), cycles spent garbage collecting, cycles spent compiling, cycles spent executing bytecode, etc.
Ideally, I'd like to be able to present some notion of CPU idle time or CPU used time. This would allow you to determine, for example, some lower bound on the CPU MHz required for a given piece of code.
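As a rough illustration of the accounting idea, the sketch below attributes counter deltas to named categories and derives a busy fraction from them. Everything here is hypothetical: the class `CycleAccounting`, the category names, and the `read_counter` callable are stand-ins, and a real implementation would live in C behind the compile-time flag mentioned below. The MHz estimate also assumes that the work scales linearly with clock frequency, which ignores flash wait states and fixed-rate peripherals.

```python
from collections import defaultdict


class CycleAccounting:
    """Attribute cycle-counter deltas to whatever category is running."""

    def __init__(self, read_counter, counter_bits=32):
        self._read = read_counter              # callable returning raw count
        self._mask = (1 << counter_bits) - 1   # counter wraps at 2**bits
        self._last = read_counter()
        self._current = "main"
        self.cycles = defaultdict(int)

    def switch(self, category):
        """Charge cycles since the last switch to the outgoing category."""
        now = self._read()
        self.cycles[self._current] += (now - self._last) & self._mask
        self._last = now
        self._current = category

    def busy_fraction(self, idle=("wfi", "sleep")):
        """Share of accounted cycles not spent in the idle categories."""
        total = sum(self.cycles.values())
        if total == 0:
            return 0.0
        idle_cycles = sum(self.cycles.get(name, 0) for name in idle)
        return (total - idle_cycles) / total


def min_cpu_mhz(current_mhz, busy_fraction):
    """Naive lower bound on clock speed, assuming linear scaling."""
    return current_mhz * busy_fraction


# Fake counter readings stand in for DWT->CYCCNT in this example.
readings = iter([0, 100, 400, 1000])
acct = CycleAccounting(lambda: next(readings))
acct.switch("irq")    # 100 cycles charged to "main"
acct.switch("wfi")    # 300 cycles charged to "irq"
acct.switch("main")   # 600 cycles charged to "wfi"
print(dict(acct.cycles))
print(min_cpu_mhz(168, acct.busy_fraction()))
```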
This would obviously sit behind a compile time flag, but it could also sit behind a runtime flag as well (since it takes RAM to store the accounting information).
Ideally, we'd make the infrastructure available to any architecture which can get at the appropriate information (cycle counters).
So, before I started coding anything, I thought I would bring this up for discussion, and see what other people's thoughts are.