-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
py/obj.h: Optimise mp_obj_type_t storage (v2) #8813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Does the .mpy version need to be bumped with this change? |
e50ab43
to
09b2f54
Compare
Rebased. Current performance diff on PYBV11 is
and the code size diff on PYBV11 is -2704. I updated objtype.c and objnamedtuple.c to no longer always use the full 12-slot version, such that instance classes now still fit in 4 GC words (as long as the base doesn't have a protocol), and named tuple comes out smaller than it used to be. Only TODO now is a way to split the "supported native ABI version from the supported bytecode ABI version". Current idea is that to run a native .mpy (i.e. one with an arch set) you need to exact match, whereas to run bytecode .mpy you need "at least this version". So this PR will move the current version to 7, but keeping the minimum at 6. |
02968a7
to
d080882
Compare
d080882
to
f5c2029
Compare
The buffer protocol type only has a single member, and this layout creates problems for the upcoming split/slot-index mp_obj_type_t layout optimisations. If we need to make the buffer protocol more sophisticated in the future either we can rely on the mp_obj_type_t optimisations to just add additional slots to mp_obj_type_t or re-visit the buffer protocol then. This change is a no-op in terms of generated code. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Remove setting unused slots. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
d11662c
to
8c98491
Compare
Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making |
That's a good question... All the analysis I did (see #7542) was entirely on code size and so only focused on which slots are used for given types, not how frequently they're accessed.
locals_dict is only used in 66 (of 138) types defined in the PYBV11 firmware. But it's accessed very frequently. The other problem is that the 12 slot indices work out perfectly to 3 words. Having one fewer slot wouldn't decrease the size of the index "table" while still adding a word to every type that doesn't use locals_dict. So +288 bytes. However this would be offset by replacing the It's probably worth doing the experiment. |
This is worth looking in to, but I suggest to get this PR in as-is first, to act as a baseline. |
@dlech I tested this and the result was +232 bytes (+288 bytes as predicted for adding the slot and changing the definitions, then -56 bytes to use Here's the performance diff from the current PR.
Damien and I talked about using some of the saved bytes from this PR to instead enable -O2 in more places, which preliminary testing shows a worthwhile improvement. |
8c98491
to
1c86d0f
Compare
From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless. |
Yes, for sure, once the details are finalised. (We're still experimenting with another idea which might change the macro slightly). I am a 8000 lso implementing #9302 (and the corresponding update to the wiki) so that people who aren't following the GitHub closely can see instructions too. |
0e4da5d
to
afcfc0c
Compare
@robert-hh I promised @stinos too that I would write some instructions. Haven't had a chance to come back to this since the PR was finalised earlier this arvo. (At the time that @stinos asked, we hadn't decided on the getiter and make_new changes). Earlier today I wrote this -- https://github.com/micropython/micropython/wiki/Build-Troubleshooting#define_const_obj_type But there is more to add, and I will add some quick notes now, and then will write some more hopefully tonight otherwise tomorrow. In particular I need to write more context about why the change needs to be made and what the macro does. |
@jimmo Please forget my rant about this PR. I could update the mimxrt and samd PRs quite smoothly. Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run. To keep that, it would require a lot of editing the history. |
@robert-hh Can you clarify what you mean here... every commit from this PR should have compiled and run (or at least that was my intention and I did a lot of rebasing while I was working to make this the case, into the hundreds of times as we revised it over and over... it's possible I missed something though). Or is this more a matter for a rebased pending PR (i.e. your samd branch?). It should be possible to do the rebase in a way such that every commit still compiles? |
I have no doubt that you tested everything carefully.
That's it.
I know. But then each commit, which introduces a new class, would have to be changed separately in the history. That is possible with git rebase -i, but I do that typically with fingers crossed (and a backup). I had cases where that damaged the repository. |
And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k. |
Thanks! And also huge shoutout to @jepler for the original idea and implementation. 2.5k might might be only 2% of the firmware size, but it offsets other useful features (or optimisations), and to a build that is only just missing the flash budget it makes all the difference. Hopefully worth the once-off pain of migration. |
Indeed. At the SAM21 port I started fighting for almost every byte. Just one more question about the required chaneg. For that case in the getiter part:
You mention the getiter slot, but what about the iternext slot in this case. |
There is now only an |
So in that specific case, now just the flag is set and the iter slot should be unset. (There was about 20 instances of this in PYBV11, so 80 bytes saved) |
Thanks.- Just a note: the flag |
Ah thank you yes that's right. I will update. We did a last minute rename after I wrote the first version of the wiki page. |
Sorry, what I wrote on the wiki was out of date and clearly confusing. I have updated it |
Is the PR reaching the final stage now? So I know when to test against it |
It's merged. |
Oops missed that; then I better get started :) |
It's a good documentation, and any documentation is better than none. |
Some comments:
|
Comments were added in d75c7e8, so I think this can be closed. |
shared-module/usb_hid: fix keyboard descriptor
This is v2 of #7542, which itself is based on adafruit#4903
This version only includes the "slot" approach, and replaces the existing sparse representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) we save 12x4-12=36 bytes and in the common case (three fields used) we save 9x4-12=24 bytes. See #7542 and the spreadsheet linked there for more analysis.
The size difference is:
As discussed in #7542, with some changes to objtype.c:mp_obj_class_lookup we can reduce the slot size to 4-bits, and save a further 6 bytes per object.
The performance change on PYBV11 is
and Unix is
This is marginally worse than the initial performance results I reported in #7542 (same test setup, but comparing to a different baseline). I'm not sure this is worth making configurable because it forms part of the .mpy ABI.
I have split the commits more than necessary, but in some cases just to make it easier to review. For example, I think the last two (1fe8dbd and a8cd503) should be combined, as well as combining 9dcd6c8, a30e9ef, a2dec50.