py/obj.h: Optimise mp_obj_type_t storage (v2) #8813

jimmo · 2022-06-24T11:16:52Z

This is v2 of #7542, which itself is based on adafruit#4903

This version only includes the "slot" approach, and replaces the existing sparse representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) we save 12x4-12=36 bytes and in the common case (three fields used) we save 9x4-12=24 bytes. See #7542 and the spreadsheet linked there for more analysis.

The size difference is:

   bare-arm: -1204 -2.092% 
minimal x86:
8000
 -3101 -1.885% [incl -3488(data)]
   unix x64: -5712 -1.093% [incl -6432(data)]
unix nanbox: -2836 -0.618% [incl -3392(data)]
      stm32: -2896 -0.732% PYBV10
     cc3200: -2104 -1.143% 
      esp32: -2280 -0.151% GENERIC[incl -3440(data)]
        nrf: -2276 -1.225% pca10040
        rp2: -2432 -0.480% PICO
       samd: -1660 -1.181% ADAFRUIT_ITSYBITSY_M4_EXPRESS

As discussed in #7542, with some changes to objtype.c:mp_obj_class_lookup we can reduce the slot size to 4-bits, and save a further 6 bytes per object.

The performance change on PYBV11 is

bm_chaos.py                    364.78 ->     354.80 :      -9.98 =  -2.736% (+/-0.01%)
bm_fannkuch.py                  77.61 ->      75.87 :      -1.74 =  -2.242% (+/-0.00%)
bm_fft.py                     2569.63 ->    2525.30 :     -44.33 =  -1.725% (+/-0.00%)
bm_float.py                   5954.45 ->    5750.53 :    -203.92 =  -3.425% (+/-0.02%)
bm_hexiom.py                    43.52 ->      43.14 :      -0.38 =  -0.873% (+/-0.01%)
bm_nqueens.py                 4404.68 ->    4356.57 :     -48.11 =  -1.092% (+/-0.00%)
bm_pidigits.py                 630.48 ->     630.86 :      +0.38 =  +0.060% (+/-0.23%)
misc_aes.py                    413.58 ->     406.96 :      -6.62 =  -1.601% (+/-0.00%)
misc_mandel.py                3536.21 ->    3446.73 :     -89.48 =  -2.530% (+/-0.01%)
misc_pystone.py               2337.25 ->    2325.08 :     -12.17 =  -0.521% (+/-0.01%)
misc_raytrace.py               377.76 ->     367.50 :     -10.26 =  -2.716% (+/-0.01%)

and Unix is

bm_chaos.py                   6075.93 ->    6132.92 :     +56.99 =  +0.938% (+/-3.40%)
bm_fannkuch.py                  26.53 ->      26.76 :      +0.23 =  +0.867% (+/-1.06%)
bm_fft.py                    29450.94 ->   30485.22 :   +1034.28 =  +3.512% (+/-1.26%)
bm_float.py                  66775.28 ->   68260.69 :   +1485.41 =  +2.224% (+/-3.14%)
bm_hexiom.py                   187.63 ->     186.86 :      -0.77 =  -0.410% (+/-0.70%)
bm_nqueens.py                97709.83 ->   97778.66 :     +68.83 =  +0.070% (+/-0.95%)
bm_pidigits.py                9038.39 ->    9061.83 :     +23.44 =  +0.259% (+/-0.30%)
misc_aes.py                    119.79 ->     119.91 :      +0.12 =  +0.100% (+/-3.24%)
misc_mandel.py               72987.13 ->   74526.80 :   +1539.67 =  +2.110% (+/-2.55%)
misc_pystone.py              26630.69 ->   26863.56 :    +232.87 =  +0.874% (+/-2.83%)
misc_raytrace.py             10729.21 ->   10660.26 :     -68.95 =  -0.643% (+/-1.12%)

This is marginally worse than the initial performance results I reported in #7542 (same test setup, but comparing to a different baseline). I'm not sure this is worth making configurable because it forms part of the .mpy ABI.

I have split the commits more than necessary, but in some cases just to make it easier to review. For example, I think the last two (1fe8dbd and a8cd503) should be combined, as well as combining 9dcd6c8, a30e9ef, a2dec50.

dpgeorge · 2022-06-24T12:36:42Z

I'm not sure this is worth making configurable because it forms part of the .mpy ABI.

Does the .mpy version need to be bumped with this change?

jimmo · 2022-09-13T15:03:31Z

Rebased.

Current performance diff on PYBV11 is

N=100 M=100                /home/jimmo/obj-type-v2-rebase-pybv11-baseline -> /home/jimmo/obj-type-v2-rebase-pybv11-def0         diff      diff% (error%)
bm_chaos.py                    350.72 ->     346.16 :      -4.56 =  -1.300% (+/-0.00%)
bm_fannkuch.py                  76.62 ->      74.92 :      -1.70 =  -2.219% (+/-0.01%)
bm_fft.py                     2327.18 ->    2298.25 :     -28.93 =  -1.243% (+/-0.00%)
bm_float.py                   5680.06 ->    5634.47 :     -45.59 =  -0.803% (+/-0.01%)
bm_hexiom.py                    47.20 ->      46.26 :      -0.94 =  -1.992% (+/-0.00%)
bm_nqueens.py                 4345.23 ->    4246.35 :     -98.88 =  -2.276% (+/-0.01%)
bm_pidigits.py                 625.44 ->     622.29 :      -3.15 =  -0.504% (+/-0.20%)
misc_aes.py                    409.71 ->     406.47 :      -3.24 =  -0.791% (+/-0.01%)
misc_mandel.py                3083.09 ->    3036.62 :     -46.47 =  -1.507% (+/-0.01%)
misc_pystone.py               2419.22 ->    2404.86 :     -14.36 =  -0.594% (+/-0.01%)
misc_raytrace.py               358.06 ->     353.56 :      -4.50 =  -1.257% (+/-0.01%)

and the code size diff on PYBV11 is -2704.

I updated objtype.c and objnamedtuple.c to no longer always use the full 12-slot version, such that instance classes now still fit in 4 GC words (as long as the base doesn't have a protocol), and named tuple comes out smaller than it used to be.

Only TODO now is a way to split the "supported native ABI version from the supported bytecode ABI version". Current idea is that to run a native .mpy (i.e. one with an arch set) you need to exact match, whereas to run bytecode .mpy you need "at least this version". So this PR will move the current version to 7, but keeping the minimum at 6.

jimmo · 2022-09-14T02:40:32Z

OK I think this is finally ready.

Because this will be a breaking change for people with custom types (e.g. user c modules), I thought that #9302 might be a nice way to help people update.

Does the .mpy version need to be bumped with this change?

See #9303

The buffer protocol type only has a single member, and this layout creates problems for the upcoming split/slot-index mp_obj_type_t layout optimisations. If we need to make the buffer protocol more sophisticated in the future either we can rely on the mp_obj_type_t optimisations to just add additional slots to mp_obj_type_t or re-visit the buffer protocol then. This change is a no-op in terms of generated code. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

Remove setting unused slots. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

dlech · 2022-09-15T16:12:03Z

Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently

jimmo · 2022-09-15T23:41:27Z

Have you done any profiling to see which slots are accessed the most?

That's a good question... All the analysis I did (see #7542) was entirely on code size and so only focused on which slots are used for given types, not how frequently they're accessed.

For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently.

locals_dict is only used in 66 (of 138) types defined in the PYBV11 firmware. But it's accessed very frequently.

The other problem is that the 12 slot indices work out perfectly to 3 words. Having one fewer slot wouldn't decrease the size of the index "table" while still adding a word to every type that doesn't use locals_dict. So +288 bytes. However this would be offset by replacing the MP_OBJ_TYPE_{GET,HAS}_SLOT invocations with direct access to locals_dict in about 10 places.

It's probably worth doing the experiment.

dpgeorge · 2022-09-16T00:56:08Z

8000

Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently

This is worth looking in to, but I suggest to get this PR in as-is first, to act as a baseline.

jimmo · 2022-09-16T04:42:08Z

@dlech I tested this and the result was +232 bytes (+288 bytes as predicted for adding the slot and changing the definitions, then -56 bytes to use type->locals_dict directly) and mixed results on the performance, some tests improved slightly, some got worse. I suspect most of this is driven by changes in the code layout.

Here's the performance diff from the current PR.

$ ./run-perfbench.py -s ~/obj-type-v2-rebase2-pybv11-pr ~/obj-type-v2-rebase2-pybv11-locals 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/obj-type-v2-rebase2-pybv11-pr -> /home/jimmo/obj-type-v2-rebase2-pybv11-locals         diff      diff% (error%)
bm_chaos.py                    348.30 ->     347.37 :      -0.93 =  -0.267% (+/-0.01%)
bm_fannkuch.py                  74.95 ->      74.73 :      -0.22 =  -0.294% (+/-0.00%)
bm_fft.py                     2293.41 ->    2308.07 :     +14.66 =  +0.639% (+/-0.00%)
bm_float.py                   5660.72 ->    5635.56 :     -25.16 =  -0.444% (+/-0.00%)
bm_hexiom.py                    46.71 ->      46.16 :      -0.55 =  -1.177% (+/-0.00%)
bm_nqueens.py                 4262.27 ->    4255.89 :      -6.38 =  -0.150% (+/-0.00%)
bm_pidigits.py                 622.26 ->     620.58 :      -1.68 =  -0.270% (+/-0.24%)
misc_aes.py                    405.77 ->     405.82 :      +0.05 =  +0.012% (+/-0.00%)
misc_mandel.py                3035.08 ->    3065.90 :     +30.82 =  +1.015% (+/-0.01%)
misc_pystone.py               2401.84 ->    2386.13 :     -15.71 =  -0.654% (+/-0.01%)
misc_raytrace.py               356.15 ->     355.20 :      -0.95 =  -0.267% (+/-0.00%)

Damien and I talked about using some of the saved bytes from this PR to instead enable -O2 in more places, which preliminary testing shows a worthwhile improvement.

stinos · 2022-09-16T06:31:40Z

From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless.

jimmo · 2022-09-16T08:03:50Z

From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless.

Yes, for sure, once the details are finalised. (We're still experimenting with another idea which might change the macro slightly).

I am also implementing #9302 (and the corresponding update to the wiki) so that people who aren't following the GitHub closely can see instructions too.

jimmo · 2022-09-19T11:20:38Z

@robert-hh I promised @stinos too that I would write some instructions. Haven't had a chance to come back to this since the PR was finalised earlier this arvo. (At the time that @stinos asked, we hadn't decided on the getiter and make_new changes).

Earlier today I wrote this -- https://github.com/micropython/micropython/wiki/Build-Troubleshooting#define_const_obj_type
This is the page you now get a link to when you have a compile error (see #9302).

But there is more to add, and I will add some quick notes now, and then will write some more hopefully tonight otherwise tomorrow. In particular I need to write more context about why the change needs to be made and what the macro does.

robert-hh · 2022-09-19T14:35:19Z

@jimmo Please forget my rant about this PR. I could update the mimxrt and samd PRs quite smoothly. Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run. To keep that, it would require a lot of editing the history.
But still the documentation will be helpful.

jimmo · 2022-09-19T14:42:20Z

Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run.

@robert-hh Can you clarify what you mean here... every commit from this PR should have compiled and run (or at least that was my intention and I did a lot of rebasing while I was working to make this the case, into the hundreds of times as we revised it over and over... it's possible I missed something though).

Or is this more a matter for a rebased pending PR (i.e. your samd branch?). It should be possible to do the rebase in a way such that every commit still compiles?

robert-hh · 2022-09-19T15:06:54Z

I have no doubt that you tested everything carefully.

Or is this more a matter for a rebased pending PR (i.e. your samd branch?).

That's it.

It should be possible to do the rebase in a way such that every commit still compiles?

I know. But then each commit, which introduces a new class, would have to be changed separately in the history. That is possible with git rebase -i, but I do that typically with fingers crossed (and a backup). I had cases where that damaged the repository.

robert-hh · 2022-09-19T15:14:43Z

And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k.

jimmo · 2022-09-19T23:36:06Z

And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k.

Thanks! And also huge shoutout to @jepler for the original idea and implementation.

2.5k might might be only 2% of the firmware size, but it offsets other useful features (or optimisations), and to a build that is only just missing the flash budget it makes all the difference. Hopefully worth the once-off pain of migration.

for micropython/micropython#8813

robert-hh · 2022-09-20T09:16:26Z

Indeed. At the SAM21 port I started fighting for almost every byte. Just one more question about the required chaneg. For that case in the getiter part:

If you had getiter as mp_identity_getiter and mp_stream_unbuffered_iter as iternext, then just set the MP_TYPE_FLAG_ITERNEXT_STREAM and do not set the getiter slot.

You mention the getiter slot, but what about the iternext slot in this case.

jimmo · 2022-09-20T09:21:15Z

You mention the getiter slot, but what about the iternext slot in this case.

There is now only an iter slot and its behaviour depends on the four possible flag values.

jimmo · 2022-09-20T09:23:39Z

So in that specific case, now just the flag is set and the iter slot should be unset. (There was about 20 instances of this in PYBV11, so 80 bytes saved)

robert-hh · 2022-09-20T09:43:52Z

Thanks.- Just a note: the flag MP_TYPE_FLAG_ITERNEXT_STREAM does not exist. Is MP_TYPE_FLAG_ITER_IS_STREAM the proper choice?

jimmo · 2022-09-20T09:55:50Z

Thanks.- Just a note: the flag MP_TYPE_FLAG_ITERNEXT_STREAM does not exist. Is MP_TYPE_FLAG_ITER_IS_STREAM the proper choice?

Ah thank you yes that's right. I will update.

We did a last minute rename after I wrote the first version of the wiki page.

jimmo · 2022-09-20T10:03:05Z

Sorry, what I wrote on the wiki was out of date and clearly confusing. I have updated it

stinos · 2022-09-20T10:13:55Z

Is the PR reaching the final stage now? So I know when to test against it

jimmo · 2022-09-20T10:14:52Z

Is the PR reaching the final stage now? So I know when to test against it

It's merged.

stinos · 2022-09-20T10:19:17Z

Oops missed that; then I better get started :)

robert-hh · 2022-09-20T11:44:12Z

Sorry, what I wrote on the wiki was out of date and clearly confusing.

It's a good documentation, and any documentation is better than none.

stinos · 2022-09-20T14:29:56Z

Some comments:

I know it's an implementation detail, but wouldn't hurt to mention explicitly that slot indices are 1-based because 0 means 'no slot'
perhaps something for the wiki: use a regex like type->(?!base|flags|name)(\w+) to replace the old version with slot access like MP_OBJ_TYPE_GET_SLOT(type, \g<1>)
is there a reason MP_OBJ_TYPE_SET_SLOT(t, f, v, n) explicitly uses slots[(t)->slot_index_##f - 1] instead of just slots[(n) + 1]'?

for micropython/micropython#8813

dpgeorge · 2022-10-25T13:09:48Z

Comments were added in d75c7e8, so I think this can be closed.

shared-module/usb_hid: fix keyboard descriptor

jimmo mentioned this pull request Jun 24, 2022

py/obj.h: Add optimised mp_obj_type_t representations. #7542

Closed

dpgeorge added the py-core Relates to py/ directory in source label Jun 24, 2022

jimmo force-pushed the obj-type-split-repr-v2 branch 6 times, most recently from e50ab43 to 09b2f54 Compare September 13, 2022 15:02

This was referenced Sep 14, 2022

py/mkrules.mk: Add link to build troubleshooting on failure. #9302

Merged

py/persistentcode.h: Split the native and bytecode ABI version. #9303

Closed

jimmo force-pushed the obj-type-split-repr-v2 branch 2 times, most recently from 02968a7 to d080882 Compare September 14, 2022 02:30

jimmo force-pushed the obj-type-split-repr-v2 branch from d080882 to f5c2029 Compare September 14, 2022 03:19

jimmo added 2 commits September 15, 2022 13:40

top: Standardise mp_obj_type_t initialisation.

b1ddddc

Remove setting unused slots. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

jimmo force-pushed the obj-type-split-repr-v2 branch 2 times, most recently from d11662c to 8c98491 Compare September 15, 2022 06:31

jimmo force-pushed the obj-type-split-repr-v2 branch from 8c98491 to 1c86d0f Compare September 16, 2022 04:54

jimmo force-pushed the obj-type-split-repr-v2 branch 3 times, most recently from 0e4da5d to afcfc0c Compare September 17, 2022 16:10

jimmo mentioned this pull request Sep 19, 2022

FYI: ulab needs to be updated to upstream MicroPython mp_obj_type_t changes. v923z/micropython-ulab#548

Closed

jimmo mentioned this pull request Sep 20, 2022

Update mp_obj_type_t definitions for latest MicroPython. v923z/micropython-ulab#549

Merged

peterzuger added a commit to peterzuger/bitstruct-micropython that referenced this pull request Sep 20, 2022

updated all mp_obj_type_t defs use MP_DEFINE_CONST_OBJ_TYPE

ef3268c

for micropython/micropython#8813

peterzuger added a commit to peterzuger/hydrogen-micropython that referenced this pull request Sep 20, 2022

updated all mp_obj_type_t defs use MP_DEFINE_CONST_OBJ_TYPE

6604e34

for micropython/micropython#8813

peterzuger added a commit to peterzuger/tlc5947-rgb-micropython that referenced this pull request Sep 20, 2022

updated all mp_obj_type_t defs use MP_DEFINE_CONST_OBJ_TYPE

653b43e

for micropython/micropython#8813

peterzuger added a commit to peterzuger/st7789_mpy that referenced this pull request Sep 20, 2022

updated all mp_obj_type_t defs use MP_DEFINE_CONST_OBJ_TYPE

8485d2a

for micropython/micropython#8813

peterzuger added a commit to peterzuger/.emacs.d that referenced this pull request Sep 20, 2022

updated mp-type snippet to use MP_DEFINE_CONST_OBJ_TYPE

6cc231b

for micropython/micropython#8813

jimmo mentioned this pull request Sep 20, 2022

py/obj: Add comments explaining the slot index scheme. #9373

Closed

dpgeorge closed this Oct 25, 2022

nord0296 mentioned this pull request Feb 10, 2023

Use of mp_obj_type_t Incompatible with latest micropython? lvgl/lv_binding_micropython#258

Closed

jimmo mentioned this pull request Oct 9, 2023

RPI PICO W Bluetooth not working with self-compiled firmware #12617

Closed

tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Jan 23, 2024

Merge pull request micropython#8813 from tlyu/fix-keyboard-descriptor

5f318c3

shared-module/usb_hid: fix keyboard descriptor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

py/obj.h: Optimise mp_obj_type_t storage (v2) #8813

py/obj.h: Optimise mp_obj_type_t storage (v2) #8813

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

py/obj.h: Optimise mp_obj_type_t storage (v2) #8813

py/obj.h: Optimise mp_obj_type_t storage (v2) #8813

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!