8000 py/obj.h: Optimise mp_obj_type_t storage (v2) by jimmo · Pull Request #8813 · micropython/micropython · GitHub
[go: up one dir, main page]

Skip to content

py/obj.h: Optimise mp_obj_type_t storage (v2) #8813

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from

Conversation

jimmo
Copy link
Member
@jimmo jimmo commented Jun 24, 2022

This is v2 of #7542, which itself is based on adafruit#4903

This version only includes the "slot" approach, and replaces the existing sparse representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) we save 12x4-12=36 bytes and in the common case (three fields used) we save 9x4-12=24 bytes. See #7542 and the spreadsheet linked there for more analysis.

The size difference is:

   bare-arm: -1204 -2.092% 
minimal x86: -3101 -1.885% [incl -3488(data)]
   unix x64: -5712 -1.093% [incl -6432(data)]
unix nanbox: -2836 -0.618% [incl -3392(data)]
      stm32: -2896 -0.732% PYBV10
     cc3200: -2104 -1.143% 
      esp32: -2280 -0.151% GENERIC[incl -3440(data)]
        nrf: -2276 -1.225% pca10040
        rp2: -2432 -0.480% PICO
       samd: -1660 -1.181% ADAFRUIT_ITSYBITSY_M4_EXPRESS

As discussed in #7542, with some changes to objtype.c:mp_obj_class_lookup we can reduce the slot size to 4-bits, and save a further 6 bytes per object.

The performance change on PYBV11 is

bm_chaos.py                    364.78 ->     354.80 :      -9.98 =  -2.736% (+/-0.01%)
bm_fannkuch.py                  77.61 ->      75.87 :      -1.74 =  -2.242% (+/-0.00%)
bm_fft.py                     2569.63 ->    2525.30 :     -44.33 =  -1.725% (+/-0.00%)
bm_float.py                   5954.45 ->    5750.53 :    -203.92 =  -3.425% (+/-0.02%)
bm_hexiom.py                    43.52 ->      43.14 :      -0.38 =  -0.873% (+/-0.01%)
bm_nqueens.py                 4404.68 ->    4356.57 :     -48.11 =  -1.092% (+/-0.00%)
bm_pidigits.py                 630.48 ->     630.86 :      +0.38 =  +0.060% (+/-0.23%)
misc_aes.py                    413.58 ->     406.96 :      -6.62 =  -1.601% (+/-0.00%)
misc_mandel.py                3536.21 ->    3446.73 :     -89.48 =  -2.530% (+/-0.01%)
misc_pystone.py               2337.25 ->    2325.08 :     -12.17 =  -0.521% (+/-0.01%)
misc_raytrace.py               377.76 ->     367.50 :     -10.26 =  -2.716% (+/-0.01%)

and Unix is

bm_chaos.py                   6075.93 ->    6132.92 :     +56.99 =  +0.938% (+/-3.40%)
bm_fannkuch.py                  26.53 ->      26.76 :      +0.23 =  +0.867% (+/-1.06%)
bm_fft.py                    29450.94 ->   30485.22 :   +1034.28 =  +3.512% (+/-1.26%)
bm_float.py                  66775.28 ->   68260.69 :   +1485.41 =  +2.224% (+/-3.14%)
bm_hexiom.py                   187.63 ->     186.86 :      -0.77 =  -0.410% (+/-0.70%)
bm_nqueens.py                97709.83 ->   97778.66 :     +68.83 =  +0.070% (+/-0.95%)
bm_pidigits.py                9038.39 ->    9061.83 :     +23.44 =  +0.259% (+/-0.30%)
misc_aes.py                    119.79 ->     119.91 :      +0.12 =  +0.100% (+/-3.24%)
misc_mandel.py               72987.13 ->   74526.80 :   +1539.67 =  +2.110% (+/-2.55%)
misc_pystone.py              26630.69 ->   26863.56 :    +232.87 =  +0.874% (+/-2.83%)
misc_raytrace.py             10729.21 ->   10660.26 :     -68.95 =  -0.643% (+/-1.12%)

This is marginally worse than the initial performance results I reported in #7542 (same test setup, but comparing to a different baseline). I'm not sure this is worth making configurable because it forms part of the .mpy ABI.

I have split the commits more than necessary, but in some cases just to make it easier to review. For example, I think the last two (1fe8dbd and a8cd503) should be combined, as well as combining 9dcd6c8, a30e9ef, a2dec50.

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Jun 24, 2022
@dpgeorge
Copy link
Member

I'm not sure this is worth making configurable because it forms part of the .mpy ABI.

Does the .mpy version need to be bumped with this change?

@jimmo jimmo force-pushed the obj-type-split-repr-v2 branch 6 times, most recently from e50ab43 to 09b2f54 Compare September 13, 2022 15:02
@jimmo
Copy link
Member Author
jimmo commented Sep 13, 2022

Rebased.

Current performance diff on PYBV11 is

N=100 M=100                /home/jimmo/obj-type-v2-rebase-pybv11-baseline -> /home/jimmo/obj-type-v2-rebase-pybv11-def0         diff      diff% (error%)
bm_chaos.py                    350.72 ->     346.16 :      -4.56 =  -1.300% (+/-0.00%)
bm_fannkuch.py                  76.62 ->      74.92 :      -1.70 =  -2.219% (+/-0.01%)
bm_fft.py                     2327.18 ->    2298.25 :     -28.93 =  -1.243% (+/-0.00%)
bm_float.py                   5680.06 ->    5634.47 :     -45.59 =  -0.803% (+/-0.01%)
bm_hexiom.py                    47.20 ->      46.26 :      -0.94 =  -1.992% (+/-0.00%)
bm_nqueens.py                 4345.23 ->    4246.35 :     -98.88 =  -2.276% (+/-0.01%)
bm_pidigits.py                 625.44 ->     622.29 :      -3.15 =  -0.504% (+/-0.20%)
misc_aes.py                    409.71 ->     406.47 :      -3.24 =  -0.791% (+/-0.01%)
misc_mandel.py                3083.09 ->    3036.62 :     -46.47 =  -1.507% (+/-0.01%)
misc_pystone.py               2419.22 ->    2404.86 :     -14.36 =  -0.594% (+/-0.01%)
misc_raytrace.py               358.06 ->     353.56 :      -4.50 =  -1.257% (+/-0.01%)

and the code size diff on PYBV11 is -2704.

I updated objtype.c and objnamedtuple.c to no longer always use the full 12-slot version, such that instance classes now still fit in 4 GC words (as long as the base doesn't have a protocol), and named tuple comes out smaller than it used to be.

Only TODO now is a way to split the "supported native ABI version from the supported bytecode ABI version". Current idea is that to run a native .mpy (i.e. one with an arch set) you need to exact match, whereas to run bytecode .mpy you need "at least this version". So this PR will move the current version to 7, but keeping the minimum at 6.

@jimmo
Copy link
Member Author
jimmo commented Sep 14, 2022

OK I think this is finally ready.

Because this will be a breaking change for people with custom types (e.g. user c modules), I thought that #9302 might be a nice way to help people update.

Does the .mpy version need to be bumped with this change?

See #9303

@jimmo jimmo force-pushed the obj-type-split-repr-v2 branch from d080882 to f5c2029 Compare September 14, 2022 03:19
The buffer protocol type only has a single member, and this
layout creates problems for the upcoming split/slot-index
mp_obj_type_t layout optimisations.

If we need to make the buffer protocol more sophisticated
in the future either we can rely on the mp_obj_type_t
optimisations to just add additional slots to mp_obj_type_t
or re-visit the buffer protocol then.

This change is a no-op in terms of generated code.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Remove setting unused slots.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo jimmo force-pushed the obj-type-split-repr-v2 branch 2 times, most recently from d11662c to 8c98491 Compare September 15, 2022 06:31
@dlech
Copy link
Contributor
dlech commented Sep 15, 2022

Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently

@jimmo
Copy link
Member Author
jimmo commented Sep 15, 2022

Have you done any profiling to see which slots are accessed the most?

That's a good question... All the analysis I did (see #7542) was entirely on code size and so only focused on which slots are used for given types, not how frequently they're accessed.

For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently.

locals_dict is only used in 66 (of 138) types defined in the PYBV11 firmware. But it's accessed very frequently.

The other problem is that the 12 slot indices work out perfectly to 3 words. Having one fewer slot wouldn't decrease the size of the index "table" while still adding a word to every type that doesn't use locals_dict. So +288 bytes. However this would be offset by replacing the MP_OBJ_TYPE_{GET,HAS}_SLOT invocations with direct access to locals_dict in about 10 places.

It's probably worth doing the experiment.

@dpgeorge
Copy link
Member

Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently

This is worth looking in to, but I suggest to get this PR in as-is first, to act as a baseline.

@jimmo
Copy link
Member Author
jimmo commented Sep 16, 2022

@dlech I tested this and the result was +232 bytes (+288 bytes as predicted for adding the slot and changing the definitions, then -56 bytes to use type->locals_dict directly) and mixed results on the performance, some tests improved slightly, some got worse. I suspect most of this is driven by changes in the code layout.

Here's the performance diff from the current PR.

$ ./run-perfbench.py -s ~/obj-type-v2-rebase2-pybv11-pr ~/obj-type-v2-rebase2-pybv11-locals 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/obj-type-v2-rebase2-pybv11-pr -> /home/jimmo/obj-type-v2-rebase2-pybv11-locals         diff      diff% (error%)
bm_chaos.py                    348.30 ->     347.37 :      -0.93 =  -0.267% (+/-0.01%)
bm_fannkuch.py                  74.95 ->      74.73 :      -0.22 =  -0.294% (+/-0.00%)
bm_fft.py                     2293.41 ->    2308.07 :     +14.66 =  +0.639% (+/-0.00%)
bm_float.py                   5660.72 ->    5635.56 :     -25.16 =  -0.444% (+/-0.00%)
bm_hexiom.py                    46.71 ->      46.16 :      -0.55 =  -1.177% (+/-0.00%)
bm_nqueens.py                 4262.27 ->    4255.89 :      -6.38 =  -0.150% (+/-0.00%)
bm_pidigits.py                 622.26 ->     620.58 :      -1.68 =  -0.270% (+/-0.24%)
misc_aes.py                    405.77 ->     405.82 :      +0.05 =  +0.012% (+/-0.00%)
misc_mandel.py                3035.08 ->    3065.90 :     +30.82 =  +1.015% (+/-0.01%)
misc_pystone.py               2401.84 ->    2386.13 :     -15.71 =  -0.654% (+/-0.01%)
misc_raytrace.py               356.15 ->     355.20 :      -0.95 =  -0.267% (+/-0.00%)

Damien and I talked about using some of the saved bytes from this PR to instead enable -O2 in more places, which preliminary testing shows a worthwhile improvement.

@jimmo jimmo force-pushed the obj-type-split-repr-v2 branch from 8c98491 to 1c86d0f Compare September 16, 2022 04:54
@stinos
Copy link
Contributor
stinos commented Sep 16, 2022

From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless.

@jimmo
Copy link
Member Author
jimmo commented Sep 16, 2022

From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless.

Yes, for sure, once the details are finalised. (We're still experimenting with another idea which might change the macro slightly).

I am a 8000 lso implementing #9302 (and the corresponding update to the wiki) so that people who aren't following the GitHub closely can see instructions too.

@jimmo jimmo force-pushed the obj-type-split-repr-v2 branch 3 times, most recently from 0e4da5d to afcfc0c Compare September 17, 2022 16:10
@jimmo
Copy link
Member Author
jimmo commented Sep 19, 2022

@robert-hh I promised @stinos too that I would write some instructions. Haven't had a chance to come back to this since the PR was finalised earlier this arvo. (At the time that @stinos asked, we hadn't decided on the getiter and make_new changes).

Earlier today I wrote this -- https://github.com/micropython/micropython/wiki/Build-Troubleshooting#define_const_obj_type
This is the page you now get a link to when you have a compile error (see #9302).

But there is more to add, and I will add some quick notes now, and then will write some more hopefully tonight otherwise tomorrow. In particular I need to write more context about why the change needs to be made and what the macro does.

@robert-hh
Copy link
Contributor

@jimmo Please forget my rant about this PR. I could update the mimxrt and samd PRs quite smoothly. Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run. To keep that, it would require a lot of editing the history.
But still the documentation will be helpful.

@jimmo
Copy link
Member Author
jimmo commented Sep 19, 2022

Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run.

@robert-hh Can you clarify what you mean here... every commit from this PR should have compiled and run (or at least that was my intention and I did a lot of rebasing while I was working to make this the case, into the hundreds of times as we revised it over and over... it's possible I missed something though).

Or is this more a matter for a rebased pending PR (i.e. your samd branch?). It should be possible to do the rebase in a way such that every commit still compiles?

@robert-hh
Copy link
Contributor

I have no doubt that you tested everything carefully.

Or is this more a matter for a rebased pending PR (i.e. your samd branch?).

That's it.

It should be possible to do the rebase in a way such that every commit still compiles?

I know. But then each commit, which introduces a new class, would have to be changed separately in the history. That is possible with git rebase -i, but I do that typically with fingers crossed (and a backup). I had cases where that damaged the repository.

@robert-hh
Copy link
Contributor

And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k.

@jimmo
Copy link
Member Author
jimmo commented Sep 19, 2022

And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k.

Thanks! And also huge shoutout to @jepler for the original idea and implementation.

2.5k might might be only 2% of the firmware size, but it offsets other useful features (or optimisations), and to a build that is only just missing the flash budget it makes all the difference. Hopefully worth the once-off pain of migration.

peterzuger added a commit to peterzuger/bitstruct-micropython that referenced this pull request Sep 20, 2022
peterzuger added a commit to peterzuger/hydrogen-micropython that referenced this pull request Sep 20, 2022
peterzuger added a commit to peterzuger/tlc5947-rgb-micropython that referenced this pull request Sep 20, 2022
@robert-hh
Copy link
Contributor

Indeed. At the SAM21 port I started fighting for almost every byte. Just one more question about the required chaneg. For that case in the getiter part:

  • If you had getiter as mp_identity_getiter and mp_stream_unbuffered_iter as iternext, then just set the MP_TYPE_FLAG_ITERNEXT_STREAM and do not set the getiter slot.

You mention the getiter slot, but what about the iternext slot in this case.

@jimmo
Copy link
Member Author
jimmo commented Sep 20, 2022

You mention the getiter slot, but what about the iternext slot in this case.

There is now only an iter slot and its behaviour depends on the four possible flag values.

@jimmo
Copy link
Member Author
jimmo commented Sep 20, 2022

So in that specific case, now just the flag is set and the iter slot should be unset. (There was about 20 instances of this in PYBV11, so 80 bytes saved)

@robert-hh
Copy link
Contributor

Thanks.- Just a note: the flag MP_TYPE_FLAG_ITERNEXT_STREAM does not exist. Is MP_TYPE_FLAG_ITER_IS_STREAM the proper choice?

@jimmo
Copy link
Member Author
jimmo commented Sep 20, 2022

Thanks.- Just a note: the flag MP_TYPE_FLAG_ITERNEXT_STREAM does not exist. Is MP_TYPE_FLAG_ITER_IS_STREAM the proper choice?

Ah thank you yes that's right. I will update.

We did a last minute rename after I wrote the first version of the wiki page.

@jimmo
Copy link
Member Author
jimmo commented Sep 20, 2022

Sorry, what I wrote on the wiki was out of date and clearly confusing. I have updated it

@stinos
Copy link
Contributor
stinos commented Sep 20, 2022

Is the PR reaching the final stage now? So I know when to test against it

@jimmo
Copy link
Member Author
jimmo commented Sep 20, 2022

Is the PR reaching the final stage now? So I know when to test against it

It's merged.

@stinos
Copy link
Contributor
stinos commented Sep 20, 2022

Oops missed that; then I better get started :)

@robert-hh
Copy link
Contributor

Sorry, what I wrote on the wiki was out of date and clearly confusing.

It's a good documentation, and any documentation is better than none.

@stinos
Copy link
Contributor
stinos commented Sep 20, 2022

Some comments:

  • I know it's an implementation detail, but wouldn't hurt to mention explicitly that slot indices are 1-based because 0 means 'no slot'
  • perhaps something for the wiki: use a regex like type->(?!base|flags|name)(\w+) to replace the old version with slot access like MP_OBJ_TYPE_GET_SLOT(type, \g<1>)
  • is there a reason MP_OBJ_TYPE_SET_SLOT(t, f, v, n) explicitly uses slots[(t)->slot_index_##f - 1] instead of just slots[(n) + 1]'?

@dpgeorge
Copy link
Member

Comments were added in d75c7e8, so I think this can be closed.

@dpgeorge dpgeorge closed this Oct 25, 2022
tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Jan 23, 2024
shared-module/usb_hid: fix keyboard descriptor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0