Add support for cycle counter and other ideas for time module #1225

Closed
dhylands opened this issue May 4, 2015 · 27 comments
Labels
rfc Request for Comment

Comments

@dhylands
Contributor
dhylands commented May 4, 2015

It turns out that the STM32 M3 and M4 CPUs have a cycle counter. I've been able to enable it and get values back from it on the pyboard.

I'd like to integrate this into the codebase so that we can do some basic cycle accounting; i.e. cycles spent processing IRQs, cycles spent in the main thread, cycles spent waiting for interrupts (WFI), cycles spent sleeping (delay), cycles spent garbage collecting, cycles spent compiling, cycles spent executing bytecode, etc.

Ideally, I'd like to be able to present some notion of CPU idle time or CPU used time. This would allow you to determine, for example, some lower bound on the CPU MHz required for a given piece of code.

This would obviously sit behind a compile time flag, but it could also sit behind a runtime flag as well (since it takes RAM to store the accounting information).

Ideally, we'd make the infrastructure available to any architecture which can get at the appropriate information (cycle counters).

So, before I started coding anything, I thought I would bring this up for discussion, and see what other people's thoughts are.

@dhylands
Contributor Author
dhylands commented May 4, 2015

For anybody wishing to play with the CPU cycle counter, this is my hack that I threw into pyb.info() just to see if it worked:

diff --git a/stmhal/modpyb.c b/stmhal/modpyb.c
index 2c5c199..3ca3a44 100644
--- a/stmhal/modpyb.c
+++ b/stmhal/modpyb.c
@@ -155,6 +155,20 @@ STATIC mp_obj_t pyb_info(mp_uint_t n_args, const mp_obj_t *args) {
         printf("LFS free: %u bytes\n", (uint)(nclst * fatfs->csize * 512));
     }

+    {
+        static int enabled = 0;
+        if (!enabled) {
+            CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
+            DWT->CYCCNT = 0;
+            DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
+            enabled = 1;
+        }
+        printf("DWT->CTRL = %lu\n", DWT->CTRL);
+        uint32_t t1 = DWT->CYCCNT;
+        uint32_t t2 = DWT->CYCCNT;
+        printf("CYCCNT = %lu delta t = %lu\n", DWT->CYCCNT, t2 - t1);
+    }
+
     if (n_args == 1) {
         // arg given means dump gc allocation table
         gc_dump_alloc_table();

and these are the relevant lines from executing:

pyb.info(); pyb.info()

...
DWT->CTRL = 1073741825
CYCCNT = 2996 delta t = 1
...
DWT->CTRL = 1073741825
CYCCNT = 271479 delta t = 1
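
The DWT->CTRL value printed above can be decoded by hand. This is a minimal sketch, not code from the thread, and it assumes the ARMv7-M layout of DWT_CTRL (CYCCNTENA in bit 0, NUMCOMP in bits 31:28); the helper name decode_dwt_ctrl is made up for illustration.

```python
# Decode the DWT->CTRL value printed by the hack above.
# Assumed field layout (ARMv7-M DWT_CTRL): bit 0 = CYCCNTENA,
# bits 31:28 = NUMCOMP (number of comparators implemented).
DWT_CTRL_CYCCNTENA = 1 << 0
DWT_CTRL_NUMCOMP_SHIFT = 28


def decode_dwt_ctrl(value):
    """Return (cycle counter enabled?, number of comparators)."""
    enabled = bool(value & DWT_CTRL_CYCCNTENA)
    numcomp = (value >> DWT_CTRL_NUMCOMP_SHIFT) & 0xF
    return enabled, numcomp


enabled, numcomp = decode_dwt_ctrl(1073741825)
print(hex(1073741825), enabled, numcomp)
```

Under that assumed layout, 1073741825 is 0x40000001, so the enable bit is set, which fits the counter advancing between the two calls.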

@pfalcon
Contributor
pfalcon commented May 4, 2015

It turns out that the STM32 M3 and M4 CPUs have a cycle counter.

Well, every Cortex-M3 and higher CPU has a cycle counter. OK, the Cortex-M3 and up architecture has a cycle counter, though a particular implementation may omit it. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337h/BABJFFGJ.html

Ideally, we'd make the infrastructure available to any architecture which can get at the appropriate information (cycle counters).

Yes, and to support cross-architectural things, the "machine" module was recently introduced. But thinking about it, cycle counting is generally more related to "timing" than to "hardware": it is just another method of measuring time. And Python has the "time" module for that, and it would be nice to think about how to extend it to support all the things we have found necessary for embedded systems/MCUs (in a Pythonic (not Arduinic) manner, with enough quality for it to be submitted as a PEP).

@dhylands
Contributor Author
dhylands commented May 4, 2015

So yeah - every M3 or M4 may have a cycle counter. I was able to confirm that it actually exists on the STM32F405 chip.

I don't see that the time module is relevant, since that's designed for wall time. This is more closely related to performance or profiling.

What I'm really interested in, is adding the infrastructure for accounting so that we can separate interrupt time from main thread time, etc. This doesn't really need a cycle counter, but that gives the best results. Any timer (like the SysTick or other) can be used.

@pfalcon
Contributor
pfalcon commented May 4, 2015

I don't see that the time module is relevant, since that's designed for wall time.

Maybe it was so in python1.5, but since https://www.python.org/dev/peps/pep-0418/ it's all things (scalar) time: https://docs.python.org/3/library/time.html (well, as much as people sitting on desktops could do it).

What I'm really interested in, is adding the infrastructure for accounting

So, you're interested in adhoc application profiling, rather than in providing access to a generic high-resolution clock (which is what a cycle counter is), on top of which any adhoc profiling can be built, is that right?

@dhylands
Contributor Author
dhylands commented May 4, 2015

It seems that any of the useful time functions require floating point?

Exposing the cycle counter seems useful in and of itself. It means I can get access to a high resolution timer without having to tie up any additional timer hardware. We could expose time.perf_counter() but it's only meaningful for platforms which have floating point support.

I'd also like to expose some profiling information. If there are existing profiling modules available in python for measuring things like time spent in interrupt handlers, then I'd like to hear about them.

All of the python profiling stuff I've seen so far is for profiling your python code. I'm interested in profiling the pyboard as a system, which includes non-python stuff, and includes things like interrupt handlers, which I've not seen mentioned in any of the profiler things I've looked at (but I'm not that familiar with them).

What I really want to know is how much headroom I've got before I run out of CPU cycles, and my program stops running in realtime. Or answer the question, can my program continue to run in realtime if I lower the clock frequency by 50%? Or answer the question, can I take this code that I've written on my 168MHz pyboard and expect it to run on the 84MHz Espruino Pico board? Or some other processor running micropython?

I just did a scan of nucleo dev-boards and there are a number of them which have flash/RAM big enough to run micropython, and they have clocks ranging from 32-100 MHz. I'd like to know if any of those might be suitable to run a given program (assuming that there is a micropython port available for that chip).

My gut tells me that none of the available python profilers will provide those answers.
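
The headroom question above reduces to simple arithmetic once busy and total cycle counts are available. The following is a rough sketch only: the function name and all the numbers are made up, and it assumes the same work costs the same number of cycles at any clock (ignoring flash wait states and peripheral timing).

```python
def min_clock_hz(busy_cycles, window_cycles, clock_hz):
    """Lower bound on the clock needed to do the same work in the same time.

    busy_cycles:   cycles spent doing real work during the window
    window_cycles: total cycles in the measurement window
    clock_hz:      clock the measurement was taken at
    Integer arithmetic only, rounded up.
    """
    return (busy_cycles * clock_hz + window_cycles - 1) // window_cycles


# Hypothetical measurement: a one-second window on a 168 MHz board,
# of which 60 million cycles were not spent idle.
needed = min_clock_hz(60_000_000, 168_000_000, 168_000_000)
print(needed)                   # estimated minimum clock in Hz
print(needed <= 84_000_000)     # would an 84 MHz part keep up?
```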

@pfalcon
Contributor
pfalcon commented May 6, 2015

It seems that any of the useful time functions require floating point?

Yes, and not just float, but a double-precision float (or put another way, a float whose number of mantissa bits is not less than 32, to cover a "standard" int value). That's a problem, and something we (who else?) should find a good enough solution for.

[more comments in thinking]
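
The precision problem with single-precision floats can be seen on a desktop. This is a small sketch using CPython's struct module to round-trip values through IEEE 754 single precision; the helper name as_float32 is made up for illustration.

```python
import struct


def as_float32(x):
    """Round-trip a value through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Single precision carries a 24-bit significand, so integers are only
# guaranteed to be exact up to 2**24.
exact_limit = 2 ** 24
print(as_float32(exact_limit) == exact_limit)
print(as_float32(exact_limit + 1) == exact_limit + 1)
```

The second comparison fails because 2**24 + 1 cannot be represented, which is why a 32-bit tick count does not survive conversion to a single-precision float.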

@pfalcon pfalcon added the rfc Request for Comment label May 26, 2015
@vitiral
Contributor
vitiral commented Jun 13, 2015

The 3.4 documentation on time.clock() says the following:

Deprecated since version 3.3: The behaviour of this function depends on the platform: use perf_counter() or process_time() instead, depending on your requirements, to have a well defined behaviour.

This might be perfect for an integer return value, as the output of clock is not defined by CPython. micropython could just define it as always being an integer output (the same as the C clock() function). You would also want to define time.CLOCKS_PER_SEC as a value that changes with clock frequency (would be altered when frequency is changed).

Of course, python may eventually remove time.clock, so I'm not sure what issue that would have.

Otherwise, I vote to support perf_counter and process_time for the time module.

@pfalcon pfalcon changed the title Add support for cycle counter Add support for cycle counter and other ideas for time module Jun 22, 2015
pfalcon referenced this issue Jun 22, 2015
This makes all common files "port-aware" using the .. only directive.
@dpgeorge
Member

@dhylands I used your cycle counting code (CYCCNT register) to do some profiling of bytecodes (ie how many cycles each bytecode took, by reading CYCCNT at each dispatch point) and it worked very well. It's not something that would be easy (or perhaps even useful) for every day use because it slows down the code a bit. But I think something like this is useful to have.

Off topic a bit, but related to your questions about IRQ profiling: I also have some code which counts each time an IRQ is fired. There is an individual 32-bit count for each IRQ that is incremented when that IRQ is run. This adds a lot to the code size and makes all IRQs slightly less efficient, but is very useful to see exactly what is running in the "background". The uPy interface is pyb.irq_stats() which just returns a memoryview to the counters, and so you can extract the data however you see fit.
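
To illustrate how a buffer of per-IRQ counters might be consumed, here is a sketch that runs on a desktop. The byte buffer is a stand-in for whatever pyb.irq_stats() would return; the little-endian 32-bit layout, the counter values and the helper names are all assumptions.

```python
import struct

# Stand-in for the memoryview behind pyb.irq_stats(): four little-endian
# 32-bit counters with made-up values, one per IRQ number.
raw = bytearray(struct.pack("<4I", 0, 1523, 7, 0))
stats = memoryview(raw)


def irq_counts(view):
    """Unpack every 32-bit counter in the buffer into a list of ints."""
    n = len(view) // 4
    return list(struct.unpack_from("<%dI" % n, view))


def active_irqs(view):
    """Map IRQ number -> count, keeping only IRQs that fired at least once."""
    return {irq: count for irq, count in enumerate(irq_counts(view)) if count}


print(active_irqs(stats))
```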

@pfalcon
Contributor
pfalcon commented Jun 23, 2015

On topic of adding stuff we need to "time" module, I finally posted to python-ideas to get wider feedback: https://mail.python.org/pipermail/python-ideas/2015-June/034241.html

@pfalcon
Contributor
pfalcon commented Jul 11, 2015

Ok, so python-ideas discussion was useful, here's my finalized proposal based on it:

utime.sleep_ms()
utime.sleep_us()

utime.ticks_ms()
utime.ticks_us()
utime.elapsed()

The sleep_*() ones are obvious. The ticks_*() ones use a new "term" because, based on the discussion, the closest thing CPython3's time module has, monotonic(), is confusing: by definition, the ticks_*() functions do wrap. With my proposal, there's a single "elapsed" function, which should work for both ticks_ms() and ticks_us() (and also for any other similar "wrapping" function added). That's of course a limitation, because if, say, ticks_us() is limited to 24 bits of value, ticks_ms() would need to be limited too. But IMHO, that's an acceptable constraint to avoid bloating the API.
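
A wrap-aware "elapsed" for counters of one common width can be sketched as a masked subtraction. This is only an illustration: the 24-bit width, the (start, end) argument order and the names are assumptions, not part of the proposal.

```python
TICKS_BITS = 24                      # hypothetical common width chosen by a port
TICKS_MASK = (1 << TICKS_BITS) - 1


def elapsed(start, end):
    """Ticks from start to end, valid as long as the counter wrapped at most once."""
    return (end - start) & TICKS_MASK


# A counter that wraps between the two readings still gives the right answer.
start = TICKS_MASK - 5               # 6 ticks before the wrap
end = (start + 20) & TICKS_MASK      # 20 ticks later, after wrapping to 14
print(end, elapsed(start, end))
```

Because the mask is shared, every ticks_*() function fed into this helper has to use the same bit width, which is the constraint described above.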

@pfalcon
Contributor
pfalcon commented Jul 11, 2015

There's an alternative proposal from Nick Coghlan: https://mail.python.org/pipermail/python-ideas/2015-June/034365.html . It's more consistent in its naming scheme, less constrained, and offers some reflection - all of which makes the API bigger and harder to write (but easier to read).

@pfalcon
Contributor
pfalcon commented Jul 11, 2015

To map @dhylands' original request to my proposal, it would be perf_counter_raw(), with "raw" signifying that it's not an SI time unit. The period will be measured by the same utime.elapsed() function, so the number of significant digits has to be the same as for ticks_*().

@danicampora
Member

I like it. One thing:

if, say, ticks_us() is limited to 24 bits of value

Why would ticks_us be limited to 24 bits?


@pfalcon
Contributor
pfalcon commented Jul 11, 2015

Thanks.

Why would ticks_us be limited to 24 bits?

Well, that's just an abstract example, though based on a mix of real-world possibilities. For example, if ticks_us() is backed by a hardware timer, and that timer is 24-bit, there would be that limitation. The Cortex-M SysTick counter is an example of a 24-bit counter (but it doesn't count microseconds).

@pfalcon
Contributor
pfalcon commented Jul 22, 2015

@dpgeorge : Any comments?

@dpgeorge
Member

Sorry, this one got lost :)

I'm happy with sleep_ms and sleep_us, and also ticks_ms and ticks_us. I would rather ticks_elapsed than elapsed to emphasise the restricted use of this function (eg it can't be used to process values from time.time()).

For the CPU cycle counter how about ticks_cpu()? It has a more precise meaning than perf_counter_raw. Or even ticks_raw.

Cortex-M SysTick counter is an example of 24-bit counter (but it doesn't count microseconds).

Then already this is a rather big limitation (limiting ms and us to 24-bit): a 24-bit microsecond counter can only count to 16 seconds, compared with 71 minutes for a 32-bit counter. And ms is limited to 4.6 hours compared with around 50 days.

So maybe we already need separate elapsed functions for the different counters.
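
The wrap periods quoted above follow directly from the counter width and the tick rate. A quick check, with a helper name chosen only for illustration:

```python
def wrap_period_seconds(bits, ticks_per_second):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / ticks_per_second


us24_s = wrap_period_seconds(24, 1_000_000)            # microseconds, 24-bit
us32_min = wrap_period_seconds(32, 1_000_000) / 60     # microseconds, 32-bit
ms24_h = wrap_period_seconds(24, 1_000) / 3600         # milliseconds, 24-bit
ms32_days = wrap_period_seconds(32, 1_000) / 86400     # milliseconds, 32-bit

print("24-bit us: %.1f s" % us24_s)
print("32-bit us: %.1f min" % us32_min)
print("24-bit ms: %.2f h" % ms24_h)
print("32-bit ms: %.1f days" % ms32_days)
```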

@dhylands
Contributor Author

Cortex-M SysTick counter is an example of 24-bit counter (but it doesn't count microseconds).

Then already this is a rather big limitation (limiting ms and us to 24-bit): a 24-bit microsecond counter can only count to 16 seconds, compared with 71 minutes for a 32-bit counter. And ms is limited to 4.6 hours compared with around 50 days.

So maybe we already need separate elapsed functions for the different counters.

I think that the 24 bit example was just an example. We should use the max resolution that we can.

Our millisecond tick counter is already 32-bit.

We use the systick counter by setting the reload value to (CPU frequency/1000) and it downcounts to zero and reloads (so it reloads every millisecond). So as long as we're running at less than (2^24) * 1000 Hz (which is 16 GHz) then our use of a 24-bit systick timer is fine.
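
The SysTick arithmetic above can be checked with a short sketch. The function name is made up, and the sketch glosses over register-level details (for instance, hardware reload registers are commonly programmed with the period minus one).

```python
SYSTICK_BITS = 24
SYSTICK_LIMIT = 1 << SYSTICK_BITS    # cycles per tick must stay below this


def systick_reload(cpu_hz, tick_hz=1000):
    """Cycles per tick for a down-counter that reloads tick_hz times a second."""
    cycles_per_tick = cpu_hz // tick_hz
    if cycles_per_tick >= SYSTICK_LIMIT:
        raise ValueError("tick period does not fit in a 24-bit counter")
    return cycles_per_tick


print(systick_reload(168_000_000))   # cycles per millisecond at 168 MHz
print(SYSTICK_LIMIT * 1000)          # clock at which a 1 ms tick no longer fits
```

At 168 MHz the reload value is 168000, far below the 24-bit limit, and the limit itself corresponds to roughly 16.8 GHz.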

@pfalcon
Contributor
pfalcon commented Jul 23, 2015

Then already this is a rather big limitation (limiting ms and us to 24-bit): a 24-bit microsecond counter can only count to 16 seconds, compared with 71 minutes for a 32-bit counter. And ms is limited to 4.6 hours compared with around 50 days.

Well, the ticks_ms and ticks_us functions should not be considered in isolation - there should still be the rest of Python's time functions, like time.time(), which counts in seconds. I wanted to speculate on what bit width ticks_us and ticks_ms should have, but then skipped it, as that would be an "informal addendum". So, common sense says that they should be not less than 10 bits (count to >=1000). But that's actually not enough, because periods like 1.5ms, 2.5ms could not be measured without big error. Then, 14 bits is the next threshold (16-bit arch, 1 bit for tagged pointer, 1 bit to keep values positive, which is a waste, but human-friendly). But if someone has an MCU with just 8-bit counters and backs these functions by raw counters, does it mean one can't have ticks_us() (because of only 8-bit width)? Nope, they can.

@pfalcon
Contributor
pfalcon commented Jul 23, 2015

For the CPU cycle counter how about ticks_cpu()?

We cannot be sure that such a counter comes literally from the "CPU", but I guess that imprecision isn't a problem, and the name indeed sounds better than perf_counter_raw(), both because it's shorter and because the original perf_counter() is still defined in terms of seconds.

@pfalcon
Contributor
pfalcon commented Jul 23, 2015

Thanks for the review, I'm adding "unix" implementation to my queue then.

@dpgeorge
Member

The problem with having a common elapsed() function is that you need to decide from the very beginning what bit width to support. Example: say you have ticks_ms and ticks_us and they are 32-bit and so elapsed uses 32-bit wrapping arith. Then some time later you want to add ticks_ns and it has only 16-bit precision, then you need to change elapsed to use 16-bit wrapping, and then existing code may break (because now you can only measure up to 65s with ticks_ms).
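
The breakage described above can be reproduced with two versions of a masked-subtraction helper. This is an illustrative sketch only; the names, the argument order and the 70-second interval are made up.

```python
def make_elapsed(bits):
    """Build a wrap-aware elapsed() for counters of the given bit width."""
    mask = (1 << bits) - 1

    def elapsed(start, end):
        return (end - start) & mask

    return elapsed


elapsed32 = make_elapsed(32)   # original common width
elapsed16 = make_elapsed(16)   # width after a hypothetical 16-bit ticks_ns is added

start_ms = 1000
end_ms = start_ms + 70000      # a 70 second interval measured with ticks_ms

print(elapsed32(start_ms, end_ms))   # full interval survives
print(elapsed16(start_ms, end_ms))   # interval is truncated modulo 65536
```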

@pfalcon
Contributor
pfalcon commented Jul 25, 2015

The problem with having a common elapsed() function is that you need to decide from the very beginning what bit width to support.

Yes, and arguably, that's not rocket science to do.

(because now you can only measure up to 65s with ticks_ms).

Multi-second periods should be measured with the second-precision time() in the first place. So, if it wasn't done right in the first place, no surprise it may break.

Overall, that's a compromise as usual, but IMHO having a single "elapsed" function is not that bad a one.

@pfalcon
Contributor
pfalcon commented Jul 25, 2015

Another idea about "ticks_elapsed" - I'm still concerned about the length of identifier, what about "ticks_diff"?

@dpgeorge
Member

The problem with having a common elapsed() function is that you need to decide from the very beginning what bit width to support.

Yes, and arguably, that's not rocket science to do.

But then we should specify from the start the full set of ticks_* functions that ticks_elapsed (or ticks_diff) is supposed to work with. And I'd say they should all begin with ticks_ to form a neat set. Eg there'll be ticks_ms, ticks_us and ticks_cpu and that's it. And then a given port will implement them and choose there and then what the common bit width is.

Multi-second periods should be measured with second-precisions time() in the first place.

A function like sleep_ms is useful for second-long delays because you often need sub-second precision, eg 2.5 seconds = 2500ms (without resorting to floating point). And sometimes you dynamically generate this delay (eg in a scheduler, asyncio) and don't know exactly what its size is. It could be 1ms or 10s (or more, or less) and you don't want to need to select the specific sleep function (eg s, ms, us) depending on the size of the delay. Rather you just want to sleep_ms because it gives enough precision (1ms) and large enough maximum delay (hopefully!). [If you did need/want to select between using sleep and sleep_ms, you'd need a function to tell you the maximum delay / bit-width.]

Another idea about "ticks_elapsed" - I'm still concerned about the length of identifier, what about "ticks_diff"?

Saving 3 bytes is not much... elapsed does give a good feel for what the function is actually returning (instead of a simple difference which you'd think could be achieved by a subtraction). In datetime module they use timedelta, so it could be ticks_delta.

@pfalcon
Contributor
pfalcon commented Jul 31, 2015

And I'd say they should all begin with ticks_ to form a neat set. Eg there'll be ticks_ms, ticks_us and ticks_cpu and that's it. And then a given port will implement them and choose there and then what the common bit width is.

Yes, that was the idea and I agree that common prefix is good.

It could be 1ms or 10s (or more, or less) and you don't want to need to select the specific sleep function (eg s, ms, us) depending on the size of the delay.

So everyone is of course encouraged to give the maximum width for all time resolutions. But not being able to do so is not a reason to e.g. invent one's own adhoc API, because the proposed API is general enough. How particular software will run with it is a matter for that particular software (just the same as software requiring a 100MB heap won't run on 128K).

Saving 3 bytes is not much... elapsed does give a good feel for what the function is actually returning (instead of a simple difference which you'd think could be achieved by a subtraction). In datetime module they use timedelta, so it could be ticks_delta.

It's not 3 bytes, it's 3 keystrokes, and that's noticeable. So, I don't know which is better, I'd optimize for length. Please decide what it should be.

@dpgeorge
Member

It's not 3 bytes, it's 3 keystrokes, and that's noticeable.

Agree. Typing diff is much easier than typing elapsed, both because diff is shorter (it's almost 3 chars since the f is repeated) and because fingers are used to diff from the unix utility, from typing difference, difficult, etc. Elapsed is an uncommon word (although more specific and related to time).

In addition, I think diff is anyway a better description of what the function does: takes the difference of the 2 arguments. The word elapsed was better suited to the original API where the function (eg elapsed_millis) took 1 argument (being the start time) and returned the elapsed time since then.

@dpgeorge
Member

time.ticks_cpu() is now implemented in stmhal, and possibly other ports.
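
A usage sketch for timing a call with time.ticks_cpu() follows. It assumes a MicroPython port that provides time.ticks_cpu() and time.ticks_diff(); on CPython, where they do not exist, it falls back to time.perf_counter_ns() as a stand-in, so the units there are nanoseconds rather than port-specific ticks. The measure helper is made up for illustration.

```python
import time

# Use MicroPython's ticks_cpu()/ticks_diff() when present; otherwise fall
# back to a CPython stand-in so the sketch still runs on a desktop.
if hasattr(time, "ticks_cpu"):
    ticks_cpu = time.ticks_cpu
    ticks_diff = time.ticks_diff
else:
    ticks_cpu = time.perf_counter_ns      # stand-in: nanoseconds, not CPU ticks

    def ticks_diff(end, start):
        return end - start


def measure(func, *args):
    """Call func(*args) and return (result, ticks spent in the call)."""
    start = ticks_cpu()
    result = func(*args)
    return result, ticks_diff(ticks_cpu(), start)


result, cost = measure(sum, range(1000))
print(result, cost)
```

Reading the counter adds its own overhead, so very short calls are better measured in a loop and averaged.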

tannewt added a commit to tannewt/circuitpython that referenced this issue Oct 3, 2018
String internationalization for Brazilian Portuguese