tests: Add performance benchmarking test suite. #4858
Conversation
Force-pushed from 4c278a5 to 4709fe2
Some more example outputs. Running the suite on PYBv1.0 the output is:
Note the percentage standard deviation is very small (3rd and 5th columns), meaning the results are accurate. Running on PYBD_SF2 @ 120MHz:
The percentage standard deviation is a bit larger here due to caching effects on the Cortex-M7. Comparing PYBv1.0 @ 168MHz and PYBD_SF2 @ 120MHz:
There is a 10%-20% speed improvement for the new board. Comparing PYBD_SF2 @ 120MHz and 216MHz:
Running at 216MHz (an 80% frequency increase over 120MHz) gives about a 70% performance increase. On a PC, comparing the standard unix executable with MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE disabled (left) vs enabled (right) gives:
Most tests are swamped by the noise (last column, in parentheses). Those that are not (bm_float, misc_pystone, misc_raytrace) show improvements with bytecode caching enabled.
A note on coverage: the test suite obviously doesn't test everything in the VM/runtime. Up until now the main tool used to determine VM performance was the pystone test. Running this test and then computing coverage statistics gives, for source in the
Then, running the entire suite in this PR and computing coverage statistics gives:
So a lot more of the code is exercised with this benchmark test suite, but it would still be good to add more tests to get the coverage up further.
Tried this for some scenarios on Windows, works well. For convenience I changed line 122 to
Ok, that's a good idea, to make it a bit more robust. Eventually the suite could do with a way to automatically skip those tests that won't work (eg due to no native emitter, or no complex numbers).
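For illustration, a guard along the following lines could detect a missing feature at the top of a test; the print-SKIP-and-exit convention used here is borrowed from the existing test runner and is only an assumption about how this suite might handle it.

```python
# Hypothetical feature-detection guard: if the target lacks a required
# feature (here, complex numbers), signal the runner to skip this test.
# The "print SKIP then SystemExit" convention is an assumption.
try:
    complex
except NameError:
    print("SKIP")
    raise SystemExit
```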
Supposing that it doesn't make sense to retrofit the original benchmark framework into this new one (it probably doesn't), its purpose was microbenchmarking, i.e. comparing the performance of individual statements or small snippets of code against each other, to find the most performant way to execute some simple operation. In that regard, it would make sense to rename it to "microbench"; "internal_bench", and especially "run-intbench.py", are rather unclear otherwise.
So, this follows a typical vendor fork/codedrop model. If there are reasons not to maintain a proper fork of https://github.com/python/pyperformance (e.g. because too many tests from other sources are to be added), then how about committing (in a separate commit from the runner scripts!) the original sources, with a proper reference to the exact upstream revision used, and then applying any further changes in separate commits. This would show examples of how original tests should be modified to suit this framework, and allow any updates/fixes from upstream to be propagated.
I think the word "micro" is rather overloaded in the context of this project, so best not to use it to describe a benchmark suite.
I agree it makes sense to have the (externally sourced) tests in a separate commit from the run script, but it's not worth the effort to have additional commits separating the unmodified code from the modifications. There's no intention to follow changes to the original pyperformance sources; they are just useful as a starting point, and proper credit is given at the top of those files.
There are some simple (short) benchmark tests included which show how to use the framework.
Force-pushed from 8b7bcf0 to a38d748
To emphasise: these benchmark tests compare the internal performance of features amongst themselves, rather than providing absolute performance measurements.
From https://github.com/python/pyperformance commit 6690642ddeda46fc5ee6e97c3ef4b2f292348ab8
misc_aes.py and misc_mandel.py are adapted from sources in this repository. misc_pystone.py is the standard Python pystone test. misc_raytrace.py is written from scratch.
To test raw viper function call overhead: function entry, exit and conversion of arguments to/from objects.
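As an illustration only, the core of such a test might look like the sketch below (the trivial body and loop are invented; the point is that the measured time is dominated by function call overhead rather than by the work in the body):

```python
import micropython

@micropython.viper
def f(a: int, b: int) -> int:
    # Trivial body, so the benchmark measures function entry/exit and
    # int<->object conversion rather than real computation.
    return a + b

def run(n):
    for i in range(n):
        f(i, i)
```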
Force-pushed from a38d748 to 9cebead
Really? Interesting. That's the best practice for any serious open-source project, and that's how it was done in this project previously, before it started to become a vendor silo.
So, vendor fork-and-forget silo, after all.
The real use of this framework is to integrate existing benchmarks as developed by the Python community, so ad hoc tests written just for it aren't exactly the material I was talking about.
This benchmarking test suite is intended to be run on any MicroPython target. As such all tests are parameterised with N and M: N is the approximate CPU frequency (in MHz) of the target and M is the approximate amount of heap memory (in kbytes) available on the target. When running the benchmark suite these parameters must be specified and then each test is tuned to run on that target in a reasonable time (<1 second).
The test scripts are not standalone: they require adding some extra code at the end to run the test with the appropriate parameters. This is done automatically by the run-perfbench.py script, in such a way that imports are minimised (so the tests can be run on targets without filesystem support).
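For example, the appended snippet might look roughly like the following. This is only a sketch of the idea, appended conceptually to a test script that defines bm_params and bm_setup (described in the next paragraph); the exact code generated by run-perfbench.py, and how it reports timings back to the host, will differ.

```python
# Hypothetical shape of the code the runner appends to a test script:
# pick the parameter set tuned for this target's N/M, time one run of the
# benchmark, then print the elapsed time and the result for the host to parse.
import utime

params = bm_params[(1000, 1000)]   # parameter selection based on N, M is up to the runner
run, result = bm_setup(params)

t0 = utime.ticks_us()
run()
dt = utime.ticks_diff(utime.ticks_us(), t0)

norm, out = result()               # assumed convention: (normalisation factor, output)
print(dt, norm, out)
```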
To interface with the benchmarking framework, each test provides a bm_params dict and a bm_setup function, with the latter taking a set of parameters (chosen based on N, M) and returning a pair of functions, one to run the test and one to get the results.
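A minimal sketch of what a test file could look like under this interface; the parameter keys and values and the toy workload are invented for illustration, and the convention that the result function returns a normalisation factor plus an output value is an assumption based on the description below.

```python
# Illustrative only: a toy benchmark in the bm_params/bm_setup form.
bm_params = {
    # (approx CPU MHz, approx heap kbytes) -> parameters tuned for that target
    (50, 25): (1000,),
    (1000, 1000): (100000,),
}

def bm_setup(params):
    (nloop,) = params
    acc = [0]

    def run():
        # The work being benchmarked.
        s = 0
        for i in range(nloop):
            s += i * i
        acc[0] = s

    def result():
        # Assumed convention: (normalisation factor, output checked against CPython).
        return nloop, acc[0]

    return run, result
```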
When running the test the number of microseconds taken by the test is recorded. This is then converted into a benchmark score by inverting it (so a higher number is faster) and normalising it with an appropriate factor (based roughly on the amount of work done by the test, eg number of iterations).
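In other words, something along these lines; the naming and scaling constant are illustrative, not taken from the script.

```python
# Turn an elapsed time in microseconds into a "higher is better" score,
# scaled by a per-test normalisation factor such as the iteration count.
def score(elapsed_us, norm):
    return 1e6 * norm / elapsed_us
```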
Test outputs are also compared against a "truth" value, computed by running the test with CPython. This provides a basic way of making sure the test actually ran correctly.
Each test is run multiple times, and the results are averaged and the standard deviation computed. This is output as a summary of the test.
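A sketch of the statistics involved, assuming a simple population standard deviation (the actual script may compute it slightly differently):

```python
import math

def avg_and_sd(values):
    # Mean and (population) standard deviation of repeated measurements.
    n = len(values)
    avg = sum(values) / n
    var = sum((v - avg) ** 2 for v in values) / n
    return avg, math.sqrt(var)
```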
To make comparisons of performance across different runs, the run-perfbench.py script also includes a diff mode that reads in the output of two previous runs and computes the difference in performance. Reports are given as a percentage change in performance, with a combined standard deviation to give an indication of whether the noise in the benchmarking is less than the effect being measured.
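One way such a combined noise figure could be computed is sketched below, expressing each run's standard deviation as a relative error and adding them in quadrature; the exact formula used by run-perfbench.py may differ.

```python
import math

def diff(avg_a, sd_a, avg_b, sd_b):
    # Percentage change from run A to run B, plus a combined noise estimate.
    change_pct = (avg_b - avg_a) / avg_a * 100
    noise_pct = 100 * math.sqrt((sd_a / avg_a) ** 2 + (sd_b / avg_b) ** 2)
    return change_pct, noise_pct
```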
Some tests are taken from the CPython performance suite. Others are taken from other sources, or written from scratch.
Example invocations for PC, pyboard and esp8266 targets respectively:
$ ./run-perfbench.py 1000 1000
$ ./run-perfbench.py --pyboard 100 100
$ ./run-perfbench.py --pyboard --device /dev/ttyUSB0 50 25
Example output from a single run on PC (columns are: microseconds-avg, microseconds-sd, score-avg, score-sd):
Example diff (the last column is the noise, which in this case is larger than the measured difference across the runs; the runs were actually made with the same unix executable, so this is expected since there should be no difference beyond the measurement error):
Note: as part of this PR the existing tests/bench directory was renamed to tests/internal_bench, and these tests added under tests/perf_bench.