-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
py/obj.h: Add optimised mp_obj_type_t representations. #7542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There's a couple of test cases still outstanding (objnamedtuple in particular needs some fixes) and it will require some thought for natmod etc. Also the trick that makes the split representation work with the macro is unsupported on clang and gcc raises a warning (although it does do the right thing). I think we'll probably remove the split approach before merging as the slots get better size for the same performance. |
Split:
Slot index (8-bit):
Slot index (4-bit):
|
88662c9
to
2aa4644
Compare
Failing on Windows due to e.g.
The workaround is to change these to
Fortunately there isn't too many of these. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you rediscovered the problem with directly initializing slots inside ext[0]. Having 3 representations is probably overkill anyway.
Based on the bits I scanned this looks great and I'm excited to see how much further you took my proof of concept. However, this is far from a complete review so I'm just commenting.
#define _MP_DEFINE_CONST_OBJ_TYPE_IMPL(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, N, ...) _MP_DEFINE_CONST_OBJ_TYPE_##N | ||
#define MP_DEFINE_CONST_OBJ_TYPE(...) _MP_DEFINE_CONST_OBJ_TYPE_EXPAND(_MP_DEFINE_CONST_OBJ_TYPE_IMPL(__VA_ARGS__, _INV, 12, _INV, 11, _INV, 10, _INV, 9, _INV, 8, _INV, 7, _INV, 6, _INV, 5, _INV, 4, _INV, 3, _INV, 2, _INV, 1, _INV, 0, _INV, _INV, _INV)(__VA_ARGS__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was curious how you would handle this. I looked into how boost preprocessor (the one I was familiar with) accomplishes the things it does, like dispatching based on the number of arguments, and nope'd right back out of there as quickly as I could. This isn't too bad, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you rediscovered the problem with directly initializing slots inside ext[0].
Yeah it's fortunate that GCC warns but does the right thing anyway. I also can't figure out exactly what the pattern to how it decides which fields to warn against is.
This isn't too bad, though.
Thanks. I was very relieved when this worked across gcc/clang/msvc, and TBH I don't think (other than msvc's quirk with expanding VA_ARGS) there's anything too controversial here. It also works on older gcc (6.4) and icc and tcc. It would be bad news if this ruled out another compiler though... iar is the main one I haven't tried.
2aa4644
to
92e8ba2
Compare
The buffer protocol type only has a single member, and this layout creates problems for the upcoming split/slot-index mp_obj_type_t layout optimisations. If we need to make the buffer protocol more sophisticated in the future either we can rely on the mp_obj_type_t optimisations to just add additional slots to mp_obj_type_t or re-visit the buffer protocol then. This change is a no-op in terms of generated code.
Remove setting unused slots. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This will allow the structure of mp_obj_type_t to change while keeping the definition code the same. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
20437c9
to
f0116aa
Compare
Rebased. |
f6554c5
to
29afbfd
Compare
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This will always have the maximum size of a mp_obj_type_t representation and can be used as a member in other structs. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This is a no-op, but sets the stage for changing the mp_obj_type_t representation.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
29afbfd
to
f6b78ec
Compare
For make_new, it was previously only checking the exact byte at the offset (because all other slots it's a uint8_t index). Handle make_new explictly and check the full pointer. On LE arch, it's very likely (but not guaranteed) that the first byte of the make_new pointer is non-zero, but on BE it's very likely to be zero. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
I have fixed the outstanding issues with namedtuple and dynamic native modules. The only thing left is how MSVC handles #if inside macro invocations (which is easy to fix as described above). If we decide to use either slots or split, then I think we should use it exclusively (i.e. remove the "full" representation) as it becomes part of the ABI to dynamic modules. I ended up having to abandon the 4-bit slot indices because objtype.c:mp_obj_class_lookup needs to be able to use offsetof. If we decide to go ahead with the slot approach we can modify objtype.c to work with this and get an additional ~300 bytes (on PYBV11). For the slot index approach, I see an overall saving of 2920 bytes on PYBV11 and 1224 bytes on minimal-arm. Here's the updated perf scores for PYBV11
If we want to use the split representation, I think we should use Adafruit's approach instead of this PR (see https://github.com/adafruit/circuitpython/blob/main/py/obj.h#L636 for obj.h and https://github.com/adafruit/circuitpython/blob/main/py/objlist.c#L471 for an example type). It works on clang and is overall much simpler and doesn't need fancy macros (although functionally the same). The split implementation in this PR is just there to get a size/perf comparison, which is 1772 bytes saved on PYBV11, and updated perf scores below:
|
I checked, and this current PR builds OK on the pic16bit port (which is based on a very old gcc version). |
Superceded by #8813 |
The existings mp_obj_type_t uses a sparse representation for slots for the capability methods of the type (eg print, make_new). This commit adds a compact slot-index representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) it saves 12x4-12=36 bytes (on a 32-bit machine) and in the common case (three fields used) it saves 9x4-12=24 bytes. Overall with all associated changes, this slot-index representation reduces code size by 1000 to 3000 bytes on bare-metal ports. Performance is marginally better on a few tests (eg about 1% better on misc_pystone.py and misc_raytrace.py on PYBv1.1), but overall marginally worse by a percent or so. See issue #7542 for further analysis and discussion. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
update flash chip for Metro M7 1011
The existings mp_obj_type_t uses a sparse representation for slots for the capability methods of the type (eg print, make_new). This commit adds a compact slot-index representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) it saves 12x4-12=36 bytes (on a 32-bit machine) and in the common case (three fields used) it saves 9x4-12=24 bytes. Overall with all associated changes, this slot-index representation reduces code size by 1000 to 3000 bytes on bare-metal ports. Performance is marginally better on a few tests (eg about 1% better on misc_pystone.py and misc_raytrace.py on PYBv1.1), but overall marginally worse by a percent or so. See issue micropython#7542 for further analysis and discussion. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
The existings mp_obj_type_t uses a sparse representation for slots for the capability methods of the type (eg print, make_new). This commit adds a compact slot-index representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) it saves 12x4-12=36 bytes (on a 32-bit machine) and in the common case (three fields used) it saves 9x4-12=24 bytes. Overall with all associated changes, this slot-index representation reduces code size by 1000 to 3000 bytes on bare-metal ports. Performance is marginally better on a few tests (eg about 1% better on misc_pystone.py and misc_raytrace.py on PYBv1.1), but overall marginally worse by a percent or so. See issue micropython#7542 for further analysis and discussion. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This is based on adafruit#4903 (@jepler @tannewt) -- see discussion there for more history.
The goal is to save flash space with a minimal performance hit. In summary: saves ~3k on PYBV11 for 1-2% performance hit.
There is an mp_obj_type_t structure in ROM for every type (with 138 types on PYBV11). The structure has 13 "slots" (e.g. make_new, print, protocol, etc). Most structures only use a small subset of these slots. See analysis for PYBV11 at https://docs.google.com/spreadsheets/d/1DQG7xZwCjo0DbFJbPwh92gtyjRfPiS-nJvG1kRSvG8k/edit?usp=sharing
This PR adds two different alternative representations for mp_obj_type_t (as well as keeping the original).
In the "slot index" approach, make_new is left out as most (100/138) types declare this, and gives two benefits:
To make it possible to use all three approaches without changing the object definitions, I've added a macro to define an mp_obj_type_t instance. Use MICROPY_OBJ_TYPE_REPR = (MICROPY_OBJ_TYPE_REPR_FULL, MICROPY_OBJ_TYPE_REPR_SLOT_INDEX, MICROPY_OBJ_TYPE_REPR_SPLIT) to select at compile time.
To make the slot index approach work I had to change the buffer protocol to no longer be a nested structure. I don't know the original history behind this but I imagine the plan was perhaps to have multiple buffer functions?
On PYBV11 I see a saving of 3128 bytes for the slot index approach with 4-bit indexes, and 2860 bytes for 8-bit indexes. On minimal-arm, 1184 and 1172 respectively. The split approach is 1768 on PYBV11 and 784 on minimal. The performance hit for three approaches (slots-4, slots-8, split) is very similar -- approx 0.5-2% across the tests on PYBV11. I'll post more detailed binary size & performance shortly.