tests: Add performance benchmarking test suite. by dpgeorge · Pull Request #4858 · micropython/micropython · GitHub


tests: Add performance benchmarking test suite. #4858


Merged: 6 commits merged into micropython:master from the tests-benchmarking branch on Jun 28, 2019

Conversation

@dpgeorge (Member) commented Jun 18, 2019

This benchmarking test suite is intended to be run on any MicroPython target. As such all tests are parameterised with N and M: N is the approximate CPU frequency (in MHz) of the target and M is the approximate amount of heap memory (in kbytes) available on the target. When running the benchmark suite these parameters must be specified and then each test is tuned to run on that target in a reasonable time (<1 second).

The test scripts are not standalone: they require adding some extra code at the end to run the test with the appropriate parameters. This is done automatically by the run-perfbench.py script, in such a way that imports are minimised (so the tests can be run on targets without filesystem support).

To interface with the benchmarking framework, each test provides a bm_params dict and a bm_setup function, with the latter taking a set of parameters (chosen based on N, M) and returning a pair of functions, one to run the test and one to get the results.
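
For reference, a minimal sketch of what a conforming test module might look like (the parameter keys, values and workload below are illustrative assumptions, not a test from this PR):

    # Hypothetical perf_bench-style test; all names and numbers are illustrative.
    # Keys of bm_params are (approx CPU MHz, approx heap kbytes) target classes,
    # and the values are the test-specific parameters chosen for that class.
    bm_params = {
        (50, 25): (500,),
        (100, 100): (2500,),
        (1000, 1000): (50000,),
    }

    def bm_setup(params):
        (niter,) = params
        state = []

        def run():
            # The workload being timed.
            total = 0
            for i in range(niter):
                total += i * i
            state.append(total)

        def get_result():
            # Return (normalisation factor, output value); the output is checked
            # against the "truth" value from running the same test on CPython.
            return niter, state[-1]

        return run, get_result

The run-perfbench.py script appends the code that selects the parameters for the given N, M and drives the returned run/get_result functions.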

When running the test the number of microseconds taken by the test is recorded. This is then converted into a benchmark score by inverting it (so a higher number is faster) and normalising it with an appropriate factor (based roughly on the amount of work done by the test, eg number of iterations).

Test outputs are also compared against a "truth" value, computed by running the test with CPython. This provides a basic way of making sure the test actually ran correctly.

Each test is run multiple times; the results are averaged and the standard deviation computed, and this is output as a summary of the test.
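
As a rough sketch of the scoring and aggregation just described (the exact normalisation formula and the use of a percentage standard deviation are assumptions inferred from the output format shown below; the real logic is in run-perfbench.py and the on-target harness):

    import math

    def summarise(times_us, norm):
        # Convert each measured time (microseconds) into a score: invert so a
        # higher number is faster, and scale by a per-test normalisation factor
        # reflecting the amount of work done (e.g. number of iterations).
        scores = [norm * 1e6 / t for t in times_us]

        def avg_sd(values):
            avg = sum(values) / len(values)
            sd = math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))
            return avg, 100 * sd / avg  # sd as a percentage of the average

        return avg_sd(times_us), avg_sd(scores)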

To make comparisons of performance across different runs the run-perfbench.py script also includes a diff mode that reads in the output of two previous runs and computes the difference in performance. Reports are given as a percentage change in performance with a combined standard deviation to give an indication of whether the noise in the benchmarking is less than the thing that is being measured.
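
A sketch of the comparison the diff mode performs for each test (combining the two error terms in quadrature is an assumption; the script's exact formula may differ):

    def diff_entry(score1, sd1_pct, score2, sd2_pct):
        # Percentage change in score between two runs, plus a combined error
        # estimate so the change can be judged against the measurement noise.
        delta = score2 - score1
        change_pct = 100 * delta / score1
        noise_pct = (sd1_pct ** 2 + sd2_pct ** 2) ** 0.5
        return delta, change_pct, noise_pct

A reported change smaller than the combined noise figure, as in the diff example further below, should not be read as a real difference.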

Some tests are taken from the CPython performance suite. Others are taken from other sources, or written from scratch.

Example invocations for PC, pyboard and esp8266 targets respectively:

$ ./run-perfbench.py 1000 1000
$ ./run-perfbench.py --pyboard 100 100 
$ ./run-perfbench.py --pyboard --device /dev/ttyUSB0 50 25

Example output from a single run on PC (columns are: microseconds-avg, microseconds-sd, score-avg, score-sd):

N=1000 M=1000
perf_bench/bm_chaos.py: 256584.25 2.4258 6824350.49 2.4035
perf_bench/bm_fannkuch.py: 63835.25 0.8947 125.33 0.8812
perf_bench/bm_float.py: 111214.88 0.2396 539499.27 0.2396
perf_bench/bm_hexiom.py: 241042.00 1.0713 1037.28 1.0598
perf_bench/bm_pidigits.py: 294052.38 0.3303 4080.95 0.3292
perf_bench/misc_aes.py: 162349.50 1.0903 31540.60 1.0759
perf_bench/misc_pystone.py: 127773.88 0.6347 156532.82 0.6344
perf_bench/misc_raytrace.py: 143539.25 0.3331 8360.18 0.3319
perf_bench/viper_call0.py: 24956.25 0.7986 40072.68 0.8003
perf_bench/viper_call1a.py: 24253.12 0.7807 41234.29 0.7746
perf_bench/viper_call1b.py: 29998.50 0.4197 33335.59 0.4186
perf_bench/viper_call1c.py: 29870.75 0.3258 33477.92 0.3267
perf_bench/viper_call2a.py: 24519.88 0.3549 40783.75 0.3536
perf_bench/viper_call2b.py: 34740.50 0.3293 28785.16 0.3291

Example diff (the last column is the noise, which in this case is larger than the measured difference between the runs; the two runs were actually with the same unix executable, so this makes sense because any difference should be within the measurement error):

N=100 M=100                        run1 ->       run2
perf_bench/bm_chaos.py         55046.98 ->   58727.78 :   +3680.80 =  +6.687% (+/-7.08%)
perf_bench/bm_fannkuch.py       6212.45 ->    6318.86 :    +106.41 =  +1.713% (+/-4.38%)
perf_bench/bm_float.py        496851.00 ->  498911.38 :   +2060.38 =  +0.415% (+/-1.51%)
perf_bench/bm_hexiom.py         3001.84 ->    3016.07 :     +14.23 =  +0.474% (+/-2.23%)
perf_bench/bm_pidigits.py      50180.12 ->   50480.82 :    +300.70 =  +0.599% (+/-3.17%)
perf_bench/misc_aes.py         33760.51 ->   33737.44 :     -23.07 =  -0.068% (+/-4.28%)
perf_bench/misc_pystone.py    143706.83 ->  143131.81 :    -575.02 =  -0.400% (+/-0.97%)
perf_bench/misc_raytrace.py     8913.56 ->    8971.39 :     +57.83 =  +0.649% (+/-2.18%)
perf_bench/viper_call0.py      37900.45 ->   37434.42 :    -466.03 =  -1.230% (+/-3.97%)
perf_bench/viper_call1a.py     39614.00 ->   38855.42 :    -758.58 =  -1.915% (+/-1.13%)
perf_bench/viper_call1b.py     31568.40 ->   31216.71 :    -351.69 =  -1.114% (+/-2.59%)
perf_bench/viper_call1c.py     31762.58 ->   31199.36 :    -563.22 =  -1.773% (+/-2.65%)
perf_bench/viper_call2a.py     37762.71 ->   38162.33 :    +399.62 =  +1.058% (+/-4.71%)
perf_bench/viper_call2b.py     26562.02 ->   26937.13 :    +375.11 =  +1.412% (+/-4.77%)

Note: as part of this PR the existing tests/bench directory was renamed to tests/internal_bench, and these tests added under tests/perf_bench.

@dpgeorge force-pushed the tests-benchmarking branch from 4c278a5 to 4709fe2 on June 19, 2019
@dpgeorge (Member, Author) commented:

Some more example outputs:

Running the suite on PYBv1.0 the output is:

N=100 M=100
perf_bench/bm_chaos.py: 172991.88 0.0006 722.58 0.0006
perf_bench/bm_fannkuch.py: 82412.38 0.0012 72.80 0.0012
perf_bench/bm_float.py: 53790.38 0.0009 4647.67 0.0009
perf_bench/bm_hexiom.py: 291988.25 0.0032 34.25 0.0032
perf_bench/bm_pidigits.py: 104099.38 0.0005 624.40 0.0005
perf_bench/misc_aes.py: 95260.00 0.0019 335.92 0.0019
perf_bench/misc_mandel.py: 138049.00 0.0050 2897.52 0.0050
perf_bench/misc_pystone.py: 162376.12 0.0004 1847.56 0.0004
perf_bench/misc_raytrace.py: 343353.88 0.0050 34.95 0.0050
perf_bench/viper_call0.py: 52008.62 0.0009 576.83 0.0009
perf_bench/viper_call1a.py: 54509.75 0.0015 550.36 0.0015
perf_bench/viper_call1b.py: 68981.25 0.0103 434.90 0.0103
perf_bench/viper_call1c.py: 68263.50 0.0007 439.47 0.0007
perf_bench/viper_call2a.py: 55939.00 0.0000 536.30 0.0000
perf_bench/viper_call2b.py: 79516.88 0.0008 377.28 0.0008

Note that the percentage standard deviation (3rd and 5th columns) is very small, meaning the results are highly repeatable.

Running on PYBD_SF2 @ 120MHz:

N=100 M=100
perf_bench/bm_chaos.py: 165495.12 0.0359 755.31 0.0359
perf_bench/bm_fannkuch.py: 74981.88 0.0378 80.02 0.0378
perf_bench/bm_float.py: 52771.38 0.0444 4737.42 0.0444
perf_bench/bm_hexiom.py: 255539.00 0.0114 39.13 0.0114
perf_bench/bm_pidigits.py: 86327.25 0.0216 752.95 0.0216
perf_bench/misc_aes.py: 83928.25 0.0485 381.28 0.0485
perf_bench/misc_mandel.py: 120719.00 0.0286 3313.48 0.0286
perf_bench/misc_pystone.py: 152907.50 0.0249 1961.97 0.0249
perf_bench/misc_raytrace.py: 331205.25 0.0121 36.23 0.0121
perf_bench/viper_call0.py: 46669.62 0.4340 642.83 0.4340
perf_bench/viper_call1a.py: 49006.12 0.2492 612.17 0.2492
perf_bench/viper_call1b.py: 69127.62 0.6512 434.00 0.6517
perf_bench/viper_call1c.py: 68088.88 0.8393 440.63 0.8382
perf_bench/viper_call2a.py: 49621.00 0.1955 604.59 0.1950
perf_bench/viper_call2b.py: 79325.62 0.3047 378.19 0.3043

The percentage standard deviation is a bit larger here due to caching effects on the Cortex-M7.

Comparing PYBv1.0 @ 168MHz and PYBD_SF2 @ 120MHz:

$ ./run-perfbench.py --diff pybv10 pybd_sf2_120 
N=100 M=100                   pybv10 -> pybd_sf2_120
bm_chaos.py                   722.58 ->     755.31 :     +32.73 =  +4.530% (+/-0.04%)
bm_fannkuch.py                 72.80 ->      80.02 :      +7.22 =  +9.918% (+/-0.04%)
bm_float.py                  4647.67 ->    4737.42 :     +89.75 =  +1.931% (+/-0.05%)
bm_hexiom.py                   34.25 ->      39.13 :      +4.88 = +14.248% (+/-0.01%)
bm_pidigits.py                624.40 ->     752.95 :    +128.55 = +20.588% (+/-0.03%)
misc_aes.py                   335.92 ->     381.28 :     +45.36 = +13.503% (+/-0.06%)
misc_mandel.py               2897.52 ->    3313.48 :    +415.96 = +14.356% (+/-0.03%)
misc_pystone.py              1847.56 ->    1961.97 :    +114.41 =  +6.192% (+/-0.03%)
misc_raytrace.py               34.95 ->      36.23 :      +1.28 =  +3.662% (+/-0.01%)
viper_call0.py                576.83 ->     642.83 :     +66.00 = +11.442% (+/-0.48%)
viper_call1a.py               550.36 ->     612.17 :     +61.81 = +11.231% (+/-0.28%)
viper_call1b.py               434.90 ->     434.00 :      -0.90 =  -0.207% (+/-0.65%)
viper_call1c.py               439.47 ->     440.63 :      +1.16 =  +0.264% (+/-0.84%)
viper_call2a.py               536.30 ->     604.59 :     +68.29 = +12.734% (+/-0.22%)
viper_call2b.py               377.28 ->     378.19 :      +0.91 =  +0.241% (+/-0.31%)

There is a 10%-20% speed improvement for the new board.

Comparing PYBD_SF2 @ 120MHz and 216MHz:

$ ./run-perfbench.py --diff pybd_sf2_120 pybd_sf2_216 
N=100 M=100               pybd_sf2_120 -> pybd_sf2_216
bm_chaos.py                   755.31 ->    1217.47 :    +462.16 = +61.188% (+/-0.06%)
bm_fannkuch.py                 80.02 ->     134.99 :     +54.97 = +68.695% (+/-0.06%)
bm_float.py                  4737.42 ->    7684.98 :   +2947.56 = +62.219% (+/-0.10%)
bm_hexiom.py                   39.13 ->      65.28 :     +26.15 = +66.829% (+/-0.04%)
bm_pidigits.py                752.95 ->    1289.58 :    +536.63 = +71.270% (+/-0.07%)
misc_aes.py                   381.28 ->     655.73 :    +274.45 = +71.981% (+/-0.13%)
misc_mandel.py               3313.48 ->    5402.80 :   +2089.32 = +63.055% (+/-0.04%)
misc_pystone.py              1961.97 ->    3245.67 :   +1283.70 = +65.429% (+/-0.05%)
misc_raytrace.py               36.23 ->      58.70 :     +22.47 = +62.020% (+/-0.03%)
viper_call0.py                642.83 ->    1119.36 :    +476.53 = +74.130% (+/-3.16%)
viper_call1a.py               612.17 ->    1123.09 :    +510.92 = +83.460% (+/-1.43%)
viper_call1b.py               434.00 ->     742.87 :    +308.87 = +71.168% (+/-0.96%)
viper_call1c.py               440.63 ->     749.34 :    +308.71 = +70.061% (+/-1.13%)
viper_call2a.py               604.59 ->    1128.30 :    +523.71 = +86.622% (+/-2.46%)
viper_call2b.py               378.19 ->     652.27 :    +274.08 = +72.472% (+/-1.63%)

Running at 216MHz (an 80% frequency increase over 120MHz) gives about a 70% performance increase.

On a PC, comparing the standard unix executable with MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE disabled (left) vs enabled (right) gives:

$ ./run-perfbench.py --diff pc_noopt pc_opt 
N=1000 M=1000               pc_noopt ->     pc_opt
bm_chaos.py               3423956.12 -> 3763209.60 : +339253.48 =  +9.908% (+/-9.38%)
bm_fannkuch.py                 61.63 ->      66.64 :      +5.01 =  +8.129% (+/-7.39%)
bm_float.py                256299.70 ->  294071.03 :  +37771.33 = +14.737% (+/-4.54%)
bm_hexiom.py                  641.70 ->     667.48 :     +25.78 =  +4.017% (+/-2.99%)
bm_pidigits.py               1460.91 ->    1462.31 :      +1.40 =  +0.096% (+/-1.08%)
misc_aes.py                 15548.79 ->   16811.59 :   +1262.80 =  +8.122% (+/-4.67%)
misc_mandel.py             148059.56 ->  137140.89 :  -10918.67 =  -7.375% (+/-4.27%)
misc_pystone.py             83406.46 ->   91272.20 :   +7865.74 =  +9.431% (+/-3.90%)
misc_raytrace.py             4413.91 ->    5164.98 :    +751.07 = +17.016% (+/-2.46%)
viper_call0.py              23515.27 ->   24250.12 :    +734.85 =  +3.125% (+/-17.15%)
viper_call1a.py             23047.67 ->   24971.43 :   +1923.76 =  +8.347% (+/-16.51%)
viper_call1b.py             19124.34 ->   19552.57 :    +428.23 =  +2.239% (+/-12.87%)
viper_call1c.py             20068.07 ->   20797.94 :    +729.87 =  +3.637% (+/-5.86%)
viper_call2a.py             24441.72 ->   23964.06 :    -477.66 =  -1.954% (+/-14.32%)
viper_call2b.py             17562.77 ->   17958.38 :    +395.61 =  +2.253% (+/-11.43%)

Most tests are swamped by the noise (last column, in parentheses). Those that are not (bm_float, misc_pystone, misc_raytrace) show improvements with bytecode caching enabled.

@dpgeorge (Member, Author) commented:

A note on coverage: the test suite obviously doesn't test everything in the VM/runtime. Up until now the main tool used to determine VM performance was the pystone test. Running this test and then computing coverage statistics gives, for source in the py/ directory (using lcov):

Overall coverage rate:
  lines......: 24.9% (3960 of 15926 lines)
  functions..: 29.6% (396 of 1338 functions)
  branches...: 19.9% (1880 of 9451 branches)

Then, running the entire suite in this PR and computing coverage statistics gives:

Overall coverage rate:
  lines......: 43.0% (6852 of 15926 lines)
  functions..: 50.9% (681 of 1338 functions)
  branches...: 35.2% (3328 of 9451 branches)

So a lot more of the code is exercised by this benchmark test suite, but it would still be good to add more tests to push the coverage up further.

@stinos (Contributor) commented Jun 19, 2019

Tried this for some scenarios on Windows, works well. For convenience I changed line 122 to
if l.find(': ') != -1 and l.find('CRASH:') == -1: to skip failed tests in diff mode. MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE disabled/enabled looks pretty good here:

N=1000 M=1000              .\nocache ->    .\cache
bm_fannkuch.py                 86.32 ->      83.03 :      -3.29 =  -3.811% (+/-7.12%)
bm_float.py                250800.82 ->  352531.40 : +101730.58 = +40.562% (+/-4.83%)
bm_hexiom.py                  776.48 ->     769.02 :      -7.46 =  -0.961% (+/-3.28%)
bm_pidigits.py               1235.98 ->    2564.10 :   +1328.12 = +107.455% (+/-1.45%)
misc_aes.py                 15628.79 ->   18984.39 :   +3355.60 = +21.471% (+/-3.14%)
misc_mandel.py             184151.42 ->  153845.95 :  -30305.47 = -16.457% (+/-6.57%)
misc_pystone.py             83333.20 ->   98516.40 :  +15183.20 = +18.220% (+/-6.65%)
misc_raytrace.py             5274.72 ->    5994.64 :    +719.92 = +13.648% (+/-5.49%)

@dpgeorge (Member, Author) commented:

For convenience I changed line 122 to
if l.find(': ') != -1 and l.find('CRASH:') == -1: to skip failed tests in diff mode.

OK, that's a good idea to make it a bit more robust.

Eventually the suite could do with a way to automatically skip those tests that won't work (eg due to no native emitter, or no complex numbers).
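
Purely as an illustration of what such a skip mechanism could look like (not part of this PR, and both probes below are assumptions), a test could detect a missing feature and report a skip in the same spirit as the main test suite:

    import sys

    def can_run():
        try:
            complex(1, 2)  # complex numbers may be compiled out of a port
        except NameError:
            return False
        try:
            # The viper/native emitter may not be enabled on this port, in which
            # case compiling a viper function is expected to fail (and CPython
            # has no micropython module, so this also guards the "truth" run).
            exec("@micropython.viper\ndef _f() -> int:\n    return 0\n")
        except (SyntaxError, NameError, AttributeError):
            return False
        return True

    if not can_run():
        print("SKIP")
        sys.exit()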

@pfalcon (Contributor) commented Jun 22, 2019

as part of this PR the existing tests/bench directory was renamed to tests/internal_bench, and these tests added under tests/perf_bench.

Supposing that it doesn't make sense to retrofit the original benchmark framework into this new one (it probably doesn't), its purpose was microbenchmarking, i.e. comparing the performance of individual statements or small snippets of code against each other, to find the most performant way to execute some simple operation. In that regard, it would make sense to rename it to "microbench"; "internal_bench", and especially "run-intbench.py", are rather unclear otherwise.

@pfalcon (Contributor) commented Jun 22, 2019

So, this follows a typical vendor fork/codedrop model. If there are reasons not to maintain a proper fork of https://github.com/python/pyperformance (e.g. because too many tests from other sources are to be added), then how about committing the original sources (in a separate commit from the runner scripts!) with a proper reference to the exact upstream revision used, and then applying any further changes in separate commits. This would show examples of how original tests should be modified to suit this framework, and allow any updates/fixes from upstream to be propagated.

@dpgeorge (Member, Author) commented:

In that regard, it would make sense to rename it to "microbench". "internal_bench", and especially "run-intbench.py" are rather unclear otherwise.

I think the word "micro" is rather overloaded in the context of this project, so best not to use it to describe a benchmark suite.

how about committing (in a separate commit from runner scripts!) original sources, with proper reference to the exact upstream revision used. Then applying any further changes in separate commits

I agree it makes sense to have the (externally sourced) tests in a separate commit to the run script, but it's not worth the effort to have additional commits to separate unmodified code from the modifications. There's no intention to follow changes of the original source of the tests from pyperformance; they are just useful as a starting point, and proper credit is given at the top of those files.

This will show examples of how original tests should be modified to suit this framework

There are some simple (short) benchmark tests included which show how to use the framework.

@dpgeorge force-pushed the tests-benchmarking branch from 8b7bcf0 to a38d748 on June 26, 2019
dpgeorge added 6 commits June 28, 2019 16:28
To emphasise that these benchmark tests compare the internal performance of
features amongst themselves, rather than measuring absolute performance.
misc_aes.py and misc_mandel.py are adapted from sources in this repository.
misc_pystone.py is the standard Python pystone test.  misc_raytrace.py is
written from scratch.
To test raw viper function call overhead: function entry, exit and
conversion of arguments to/from objects.
@dpgeorge force-pushed the tests-benchmarking branch from a38d748 to 9cebead on June 28, 2019
@dpgeorge merged commit 9cebead into micropython:master on Jun 28, 2019
@dpgeorge deleted the tests-benchmarking branch on June 28, 2019
@pfalcon (Contributor) commented Jun 28, 2019

but it's not worth the effort to have additional commits to separate unmodified code from the modifications

Really? Interesting. That's the best practice for any serious open-source project, and that's how it was done in this project previously, before it started to become a vendor silo.

There's no intention to follow changes of the original source of the tests from pyperformance

So, vendor fork-and-forget silo, after all.

There are some simple (short) benchmark tests included which show how to use the framework.

The real use of this framework is to integrate existing benchmarks as developed by the Python community, so ad hoc tests written just for it aren't exactly the material I was talking about.
