8000 py/obj.h: Add optimised mp_obj_type_t representations. by jimmo · Pull Request #7542 · micropython/micropython · GitHub
[go: up one dir, main page]

Skip to content

py/obj.h: Add optimised mp_obj_type_t representations. #7542

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

jimmo
Copy link
Member
@jimmo jimmo commented Jul 15, 2021

This is based on adafruit#4903 (@jepler @tannewt) -- see discussion there for more history.

The goal is to save flash space with a minimal performance hit. In summary: saves ~3k on PYBV11 for 1-2% performance hit.

There is an mp_obj_type_t structure in ROM for every type (with 138 types on PYBV11). The structure has 13 "slots" (e.g. make_new, print, protocol, etc). Most structures only use a small subset of these slots. See analysis for PYBV11 at https://docs.google.com/spreadsheets/d/1DQG7xZwCjo0DbFJbPwh92gtyjRfPiS-nJvG1kRSvG8k/edit?usp=sharing

This PR adds two different alternative representations for mp_obj_type_t (as well as keeping the original).

  • "split" -- the minimal subset of slots (make_new, print, attr, parent, locals_dict) are on all types, but only "extended" types use a large mp_obj_type_t with all the fields. (This is the same as the CircuitPython implementation)
  • "slot index" -- mp_obj_type_t has a 8-bit (or 4-bit) "index" per slot, and a variable-length list of slots.

In the "slot index" approach, make_new is left out as most (100/138) types declare this, and gives two benefits:

  • 12 types means it aligns to three words (whereas a the 13th would make it use 16 bytes)
  • make_new is the most frequently-accessed field, so this keeps it fast and reduces code size.

To make it possible to use all three approaches without changing the object definitions, I've added a macro to define an mp_obj_type_t instance. Use MICROPY_OBJ_TYPE_REPR = (MICROPY_OBJ_TYPE_REPR_FULL, MICROPY_OBJ_TYPE_REPR_SLOT_INDEX, MICROPY_OBJ_TYPE_REPR_SPLIT) to select at compile time.

To make the slot index approach work I had to change the buffer protocol to no longer be a nested structure. I don't know the original history behind this but I imagine the plan was perhaps to have multiple buffer functions?

On PYBV11 I see a saving of 3128 bytes for the slot index approach with 4-bit indexes, and 2860 bytes for 8-bit indexes. On minimal-arm, 1184 and 1172 respectively. The split approach is 1768 on PYBV11 and 784 on minimal. The performance hit for three approaches (slots-4, slots-8, split) is very similar -- approx 0.5-2% across the tests on PYBV11. I'll post more detailed binary size & performance shortly.

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Jul 15, 2021
@jimmo
Copy link
Member Author
jimmo commented Jul 15, 2021

There's a couple of test cases still outstanding (objnamedtuple in particular needs some fixes) and it will require some thought for natmod etc.

Also the trick that makes the split representation work with the macro is unsupported on clang and gcc raises a warning (although it does do the right thing). I think we'll probably remove the split approach before merging as the slots get better size for the same performance.

@jimmo
Copy link
Member Author
jimmo commented Jul 15, 2021

Split:

$ ./tools/metrics.py diff ~/obj-type-size-baseline ~/obj-type-size-split 
   bare-arm:  -848 -1.453% 
minimal x86: -1208 -0.820% [incl -1372(data)]
   unix x64: -2736 -0.534% [incl -3000(data)]
unix nanbox: -2116 -0.468% [incl -2336(data)]
      stm32: -1768 -0.448% PYBV10
     cc3200: -1408 -0.768% 
      esp32: -1584 -0.106% GENERIC[incl -1920(data)]
        nrf:    +0 +0.000% pca10040
        rp2: -1376 -0.286% PICO
       samd:  -880 -0.846% ADAFRUIT_ITSYBITSY_M4_EXPRESS

diff of microsecond times (lower is better)
N=100 M=100              baseline -> split         diff      diff% (error%)
bm_chaos.py               159231.12 ->  161566.12 :   +2335.00 =  +1.466% (+/-0.02%)
bm_fannkuch.py             76576.12 ->   77441.88 :    +865.76 =  +1.131% (+/-0.01%)
bm_fft.py                 299206.88 ->  306512.62 :   +7305.74 =  +2.442% (+/-0.00%)
bm_float.py                50519.00 ->   50811.25 :    +292.25 =  +0.578% (+/-0.01%)
bm_hexiom.py              279316.38 ->  282219.25 :   +2902.87 =  +1.039% (+/-0.00%)
bm_nqueens.py             234611.12 ->  239366.75 :   +4755.63 =  +2.027% (+/-0.00%)
bm_pidigits.py            100635.50 ->  101507.75 :    +872.25 =  +0.867% (+/-0.34%)
misc_aes.py                83292.25 ->   84806.88 :   +1514.63 =  +1.818% (+/-0.01%)
misc_mandel.py            129385.88 ->  133236.12 :   +3850.24 =  +2.976% (+/-0.01%)
misc_pystone.py           153295.50 ->  154457.75 :   +1162.25 =  +0.758% (+/-0.01%)
misc_raytrace.py          160199.50 ->  161332.38 :   +1132.88 =  +0.707% (+/-0.01%)

Slot index (8-bit):

$ ./tools/metrics.py diff ~/obj-type-size-baseline ~/obj-type-size-slot-8
   bare-arm: -1196 -2.050% 
minimal x86: -1480 -1.004% [incl -1824(data)]
   unix x64: -5472 -1.068% [incl -6368(data)]
unix nanbox: -2784 -0.616% [incl -3332(data)]
      stm32: -2860 -0.725% PYBV10
     cc3200: -2096 -1.143% 
      esp32: -2364 -0.159% GENERIC[incl -3416(data)]
        nrf:    +0 +0.000% pca10040
        rp2: -2344 -0.487% PICO
       samd: -1320 -1.269% ADAFRUIT_ITSYBITSY_M4_EXPRESS

N=100 M=100              baseline -> slots-8         diff      diff% (error%)
bm_chaos.py               159231.12 ->  161050.12 :   +1819.00 =  +1.142% (+/-0.02%)
bm_fannkuch.py             76576.12 ->   77910.38 :   +1334.26 =  +1.742% (+/-0.01%)
bm_fft.py                 299206.88 ->  304822.00 :   +5615.12 =  +1.877% (+/-0.00%)
bm_float.py                50519.00 ->   50729.50 :    +210.50 =  +0.417% (+/-0.02%)
bm_hexiom.py              279316.38 ->  283041.12 :   +3724.74 =  +1.334% (+/-0.00%)
bm_nqueens.py             234611.12 ->  237916.88 :   +3305.76 =  +1.409% (+/-0.00%)
bm_pidigits.py            100635.50 ->  101725.75 :   +1090.25 =  +1.083% (+/-0.20%)
misc_aes.py                83292.25 ->   84892.00 :   +1599.75 =  +1.921% (+/-0.01%)
misc_mandel.py            129385.88 ->  131569.12 :   +2183.24 =  +1.687% (+/-0.00%)
misc_pystone.py           153295.50 ->  155462.75 :   +2167.25 =  +1.414% (+/-0.01%)
misc_raytrace.py          160199.50 ->  160901.75 :    +702.25 =  +0.438% (+/-0.01%)

Slot index (4-bit):

$ ./tools/metrics.py diff ~/obj-type-size-baseline ~/obj-type-size-slot-4
   bare-arm: -1216 -2.084% 
minimal x86: -1512 -1.026% [incl -2080(data)]
   unix x64: -5456 -1.065% [incl -6696(data)]
unix nanbox: -2840 -0.628% [incl -3800(data)]
      stm32: -3128 -0.793% PYBV10
     cc3200: -2216 -1.209% 
      esp32: -2504 -0.168% GENERIC[incl -3968(data)]
        nrf:    +0 +0.000% pca10040
        rp2: -2392 -0.497% PICO
       samd: -1368 -1.315% ADAFRUIT_ITSYBITSY_M4_EXPRESS

N=100 M=100              baseline -> slots-4         diff      diff% (error%)
bm_chaos.py               159231.12 ->  160950.12 :   +1719.00 =  +1.080% (+/-0.02%)
bm_fannkuch.py             76576.12 ->   76889.12 :    +313.00 =  +0.409% (+/-0.02%)
bm_fft.py                 299206.88 ->  303855.50 :   +4648.62 =  +1.554% (+/-0.00%)
bm_float.py                50519.00 ->   51105.38 :    +586.38 =  +1.161% (+/-0.01%)
bm_hexiom.py              279316.38 ->  283352.25 :   +4035.87 =  +1.445% (+/-0.00%)
bm_nqueens.py             234611.12 ->  236560.88 :   +1949.76 =  +0.831% (+/-0.00%)
bm_pidigits.py            100635.50 ->  101132.38 :    +496.88 =  +0.494% (+/-0.24%)
misc_aes.py                83292.25 ->   84467.62 :   +1175.37 =  +1.411% (+/-0.01%)
misc_mandel.py            129385.88 ->  132409.88 :   +3024.00 =  +2.337% (+/-0.01%)
misc_pystone.py           153295.50 ->  154405.25 :   +1109.75 =  +0.724% (+/-0.00%)
misc_raytrace.py          160199.50 ->  161836.25 :   +1636.75 =  +1.022% (+/-0.01%)

@jimmo jimmo force-pushed the obj-slots-split-index branch from 88662c9 to 2aa4644 Compare July 15, 2021 07:07
@jimmo
Copy link
Member Author
jimmo commented Jul 15, 2021

Failing on Windows due to #if inside a macro invocation (which I was surprise worked on GCC anyway).

e.g.

STATIC MP_DEFINE_CONST_OBJ_TYPE(
    mp_type_bound_meth, MP_QSTR_bound_method, MP_TYPE_FLAG_NONE, MP_TYPE_NULL_MAKE_NEW,
    #if MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_DETAILED
    print, bound_meth_print,
    #endif
    call, bound_meth_call
    #if MICROPY_PY_FUNCTION_ATTRS
    , attr, bound_meth_attr
    #endif
    );

The workaround is to change these to

#if MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_DETAILED
#define TYPE_BOUND_METH_PRINT print, bound_meth_print,
#endif

#if MICROPY_PY_FUNCTION_ATTRS
#define TYPE_BOUND_METH_ATTRS , attr, bound_meth_attr
#endif

STATIC MP_DEFINE_CONST_OBJ_TYPE(
    mp_type_bound_meth, MP_QSTR_bound_method, MP_TYPE_FLAG_NONE, MP_TYPE_NULL_MAKE_NEW,
    TYPE_BOUND_METH_PRINT
    call, bound_meth_call
    TYPE_BOUND_METH_ATTRS
    );

Fortunately there isn't too many of these.

Copy link
Contributor
@jepler jepler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you rediscovered the problem with directly initializing slots inside ext[0]. Having 3 representations is probably overkill anyway.

Based on the bits I scanned this looks great and I'm excited to see how much further you took my proof of concept. However, this is far from a complete review so I'm just commenting.

Comment on lines +759 to +847
#define _MP_DEFINE_CONST_OBJ_TYPE_IMPL(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, N, ...) _MP_DEFINE_CONST_OBJ_TYPE_##N
#define MP_DEFINE_CONST_OBJ_TYPE(...) _MP_DEFINE_CONST_OBJ_TYPE_EXPAND(_MP_DEFINE_CONST_OBJ_TYPE_IMPL(__VA_ARGS__, _INV, 12, _INV, 11, _INV, 10, _INV, 9, _INV, 8, _INV, 7, _INV, 6, _INV, 5, _INV, 4, _INV, 3, _INV, 2, _INV, 1, _INV, 0, _INV, _INV, _INV)(__VA_ARGS__))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious how you would handle this. I looked into how boost preprocessor (the one I was familiar with) accomplishes the things it does, like dispatching based on the number of arguments, and nope'd right back out of there as quickly as I could. This isn't too bad, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you rediscovered the problem with directly initializing slots inside ext[0].

Yeah it's fortunate that GCC warns but does the right thing anyway. I also can't figure out exactly what the pattern to how it decides which fields to warn against is.

This isn't too bad, though.

Thanks. I was very relieved when this worked across gcc/clang/msvc, and TBH I don't think (other than msvc's quirk with expanding VA_ARGS) there's anything too controversial here. It also works on older gcc (6.4) and icc and tcc. It would be bad news if this ruled out another compiler though... iar is the main one I haven't tried.

@jimmo jimmo force-pushed the obj-slots-split-index branch from 2aa4644 to 92e8ba2 Compare July 29, 2021 02:16
jimmo added 3 commits April 21, 2022 23:32
The buffer protocol type only has a single member, and this
layout creates problems for the upcoming split/slot-index
mp_obj_type_t layout optimisations.

If we need to make the buffer protocol more sophisticated
in the future either we can rely on the mp_obj_type_t
optimisations to just add additional slots to mp_obj_type_t
or re-visit the buffer protocol then.

This change is a no-op in terms of generated code.
Remove setting unused slots.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This will allow the structure of mp_obj_type_t to change while
keeping the definition code the same.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo jimmo force-pushed the obj-slots-split-index branch 2 times, most recently from 20437c9 to f0116aa Compare April 21, 2022 14:39
@jimmo
Copy link
Member Author
jimmo commented Apr 21, 2022

Rebased.

@jimmo jimmo force-pushed the obj-slots-split-index branch 2 times, most recently from f6554c5 to 29afbfd Compare April 22, 2022 01:50
jimmo added 9 commits April 22, 2022 12:58
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This will always have the maximum size of a mp_obj_type_t representation
and can be used as a member in other structs.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This is a no-op, but sets the stage for changing the mp_obj_type_t
representation.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo jimmo force-pushed the obj-slots-split-index branch from 29afbfd to f6b78ec Compare April 22, 2022 03:09
For make_new, it was previously only checking the exact byte
at the offset (because all other slots it's a uint8_t index).

Handle make_new explictly and check the full pointer.

On LE arch, it's very likely (but not guaranteed) that the first byte
of the make_new pointer is non-zero, but on BE it's very likely to be zero.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo
Copy link
Member Author
jimmo commented Apr 22, 2022

I have fixed the outstanding issues with namedtuple and dynamic native modules. The only thing left is how MSVC handles #if inside macro invocations (which is easy to fix as described above).

If we decide to use either slots or split, then I think we should use it exclusively (i.e. remove the "full" representation) as it becomes part of the ABI to dynamic modules.

I ended up having to abandon the 4-bit slot indices because objtype.c:mp_obj_class_lookup needs to be able to use offsetof. If we decide to go ahead with the slot approach we can modify objtype.c to work with this and get an additional ~300 bytes (on PYBV11).

For the slot index approach, I see an overall saving of 2920 bytes on PYBV11 and 1224 bytes on minimal-arm. Here's the updated perf scores for PYBV11

$ ./run-perfbench.py -s ~/slots-pybv11-baseline ~/slots-pybv11-slots 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/slots-pybv11-baseline -> /home/jimmo/slots-pybv11-slots         diff      diff% (error%)
bm_chaos.py                    358.07 ->     357.24 :      -0.83 =  -0.232% (+/-0.02%)
bm_fannkuch.py                  76.62 ->      76.15 :      -0.47 =  -0.613% (+/-0.01%)
bm_fft.py                     2542.13 ->    2495.32 :     -46.81 =  -1.841% (+/-0.00%)
bm_float.py                   5916.09 ->    5858.35 :     -57.74 =  -0.976% (+/-0.01%)
bm_hexiom.py                    43.71 ->      43.68 :      -0.03 =  -0.069% (+/-0.00%)
bm_nqueens.py                 4398.49 ->    4372.86 :     -25.63 =  -0.583% (+/-0.00%)
bm_pidigits.py                 640.04 ->     638.79 :      -1.25 =  -0.195% (+/-0.19%)
misc_aes.py                    412.97 ->     411.42 :      -1.55 =  -0.375% (+/-0.01%)
misc_mandel.py                3520.42 ->    3408.66 :    -111.76 =  -3.175% (+/-0.00%)
misc_pystone.py               2275.25 ->    2281.57 :      +6.32 =  +0.278% (+/-0.01%)
misc_raytrace.py               376.95 ->     377.46 :      +0.51 =  +0.135% (+/-0.00%)

If we want to use the split representation, I think we should use Adafruit's approach instead of this PR (see https://github.com/adafruit/circuitpython/blob/main/py/obj.h#L636 for obj.h and https://github.com/adafruit/circuitpython/blob/main/py/objlist.c#L471 for an example type). It works on clang and is overall much simpler and doesn't need fancy macros (although functionally the same). The split implementation in this PR is just there to get a size/perf comparison, which is 1772 bytes saved on PYBV11, and updated perf scores below:

$ ./run-perfbench.py -s ~/slots-pybv11-baseline ~/slots-pybv11-split 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/slots-pybv11-baseline -> /home/jimmo/slots-pybv11-split         diff      diff% (error%)
bm_chaos.py                    358.07 ->     355.38 :      -2.69 =  -0.751% (+/-0.01%)
bm_fannkuch.py                  76.62 ->      76.49 :      -0.13 =  -0.170% (+/-0.01%)
bm_fft.py                     2542.13 ->    2496.87 :     -45.26 =  -1.780% (+/-0.00%)
bm_float.py                   5916.09 ->    5847.29 :     -68.80 =  -1.163% (+/-0.02%)
bm_hexiom.py                    43.71 ->      43.48 :      -0.23 =  -0.526% (+/-0.00%)
bm_nqueens.py                 4398.49 ->    4381.44 :     -17.05 =  -0.388% (+/-0.00%)
bm_pidigits.py                 640.04 ->     638.84 :      -1.20 =  -0.187% (+/-0.34%)
misc_aes.py                    412.97 ->     417.16 :      +4.19 =  +1.015% (+/-0.00%)
misc_mandel.py                3520.42 ->    3440.50 :     -79.92 =  -2.270% (+/-0.00%)
misc_pystone.py               2275.25 ->    2275.56 :      +0.31 =  +0.014% (+/-0.00%)
misc_raytrace.py               376.95 ->     376.41 :      -0.54 =  -0.143% (+/-0.01%)

@dpgeorge
Copy link
Member
dpgeorge commented May 6, 2022

I checked, and this current PR builds OK on the pic16bit port (which is based on a very old gcc version).

@jimmo
Copy link
Member Author
jimmo commented Jun 24, 2022

Superceded by #8813

@jimmo jimmo closed this Jun 24, 2022
dpgeorge pushed a commit that referenced this pull request Sep 19, 2022
The existings mp_obj_type_t uses a sparse representation for slots for the
capability methods of the type (eg print, make_new).  This commit adds a
compact slot-index representation.  The basic idea is that where the
mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t
indices, and a variable-length array of pointers.  So in the best case (no
fields used) it saves 12x4-12=36 bytes (on a 32-bit machine) and in the
common case (three fields used) it saves 9x4-12=24 bytes.

Overall with all associated changes, this slot-index representation reduces
code size by 1000 to 3000 bytes on bare-metal ports.  Performance is
marginally better on a few tests (eg about 1% better on misc_pystone.py and
misc_raytrace.py on PYBv1.1), but overall marginally worse by a percent or
so.

See issue #7542 for further analysis and discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
RetiredWizard pushed a commit to RetiredWizard/micropython that referenced this pull request Feb 9, 2023
karfas pushed a commit to karfas/micropython that referenced this pull request Apr 23, 2023
The existings mp_obj_type_t uses a sparse representation for slots for the
capability methods of the type (eg print, make_new).  This commit adds a
compact slot-index representation.  The basic idea is that where the
mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t
indices, and a variable-length array of pointers.  So in the best case (no
fields used) it saves 12x4-12=36 bytes (on a 32-bit machine) and in the
common case (three fields used) it saves 9x4-12=24 bytes.

Overall with all associated changes, this slot-index representation reduces
code size by 1000 to 3000 bytes on bare-metal ports.  Performance is
marginally better on a few tests (eg about 1% better on misc_pystone.py and
misc_raytrace.py on PYBv1.1), but overall marginally worse by a percent or
so.

See issue micropython#7542 for further analysis and discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
alphonse82 pushed a commit to alphonse82/micropython-wch-ch32v307 that referenced this pull request May 8, 2023
The existings mp_obj_type_t uses a sparse representation for slots for the
capability methods of the type (eg print, make_new).  This commit adds a
compact slot-index representation.  The basic idea is that where the
mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t
indices, and a variable-length array of pointers.  So in the best case (no
fields used) it saves 12x4-12=36 bytes (on a 32-bit machine) and in the
common case (three fields used) it saves 9x4-12=24 bytes.

Overall with all associated changes, this slot-index representation reduces
code size by 1000 to 3000 bytes on bare-metal ports.  Performance is
marginally better on a few tests (eg about 1% better on misc_pystone.py and
misc_raytrace.py on PYBv1.1), but overall marginally worse by a percent or
so.

See issue micropython#7542 for further analysis and discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0