|
| 1 | +MicroPython string interning |
| 2 | +============================ |
| 3 | + |
| 4 | +MicroPython uses `string interning`_ to save both RAM and ROM. This avoids |
| 5 | +having to store duplicate copies of the same string. Primarily, this applies to |
| 6 | +identifiers in your code, as something like a function or variable name is very |
| 7 | +likely to appear in multiple places in the code. In MicroPython an interned |
| 8 | +string is called a QSTR (uniQue STRing). |
| 9 | + |
| 10 | +A QSTR value (with type ``qstr``) is a index into a linked list of QSTR pools. |
| 11 | +QSTRs store their length and a hash of their contents for fast comparison during |
| 12 | +the de-duplication process. All bytecode operations that work with strings use |
| 13 | +a QSTR argument. |
| 14 | + |
| 15 | +Compile-time QSTR generation |
| 16 | +---------------------------- |
| 17 | + |
| 18 | +In the MicroPython C code, any strings that should be interned in the final |
| 19 | +firmware are written as ``MP_QSTR_Foo``. At compile time this will evaluate to |
| 20 | +a ``qstr`` value that points to the index of ``"Foo"`` in the QSTR pool. |
| 21 | + |
| 22 | +A multi-step process in the ``Makefile`` makes this work. In summary this |
| 23 | +process has three parts: |
| 24 | + |
| 25 | +1. Find all ``MP_QSTR_Foo`` tokens in the code. |
| 26 | + |
| 27 | +2. Generate a static QSTR pool containing all the string data (including lengths |
| 28 | + and hashes). |
| 29 | + |
| 30 | +3. Replace all ``MP_QSTR_Foo`` (via the preprocessor) with their corresponding |
| 31 | + index. |
| 32 | + |
| 33 | +``MP_QSTR_Foo`` tokens are searched for in two sources: |
| 34 | + |
| 35 | +1. All files referenced in ``$(SRC_QSTR)``. This is all C code (i.e. ``py``, |
| 36 | + ``extmod``, ``ports/stm32``) but not including third-party code such as |
| 37 | + ``lib``. |
| 38 | + |
| 39 | +2. Additional ``$(QSTR_GLOBAL_DEPENDENCIES)`` (which includes ``mpconfig*.h``). |
| 40 | + |
| 41 | +*Note:* ``frozen_mpy.c`` (generated by mpy-tool.py) has its own QSTR generation |
| 42 | +and pool. |
| 43 | + |
| 44 | +Some additional strings that can't be expressed using the ``MP_QSTR_Foo`` syntax |
| 45 | +(e.g. they contain non-alphanumeric characters) are explicitly provided in |
| 46 | +``qstrdefs.h`` and ``qstrdefsport.h`` via the ``$(QSTR_DEFS)`` variable. |
| 47 | + |
| 48 | +Processing happens in the following stages: |
| 49 | + |
| 50 | +1. ``qstr.i.last`` is the concatenation of putting every single input file |
| 51 | + through the C pre-processor. This means that any conditionally disabled code |
| 52 | + will be removed, and macros expanded. This means we don't add strings to the |
| 53 | + pool that won't be used in the final firmware. Because at this stage (thanks |
| 54 | + to the ``NO_QSTR`` macro added by ``QSTR_GEN_EXTRA_CFLAGS``) there is no |
| 55 | + definition for ``MP_QSTR_Foo`` it passes through this stage unaffected. This |
| 56 | + file also includes comments from the preprocessor that include line number |
| 57 | + information. Note that this step only uses files that have changed, which |
| 58 | + means that ``qstr.i.last`` will only contain data from files that have |
| 59 | + changed since the last compile. |
| 60 | +2. ``qstr.split`` is an empty file created after running ``makeqstrdefs.py split`` |
| 61 | + on qstr.i.last. It's just used as a dependency to indicate that the step ran. |
| 62 | + This script outputs one file per input C file, ``genhdr/qstr/...file.c.qstr``, |
| 63 | + which contains only the matched QSTRs. Each QSTR is printed as ``Q(Foo)``. |
| 64 | + This step is necessary to combine the existing files with the new data |
| 65 | + generated from the incremental update in ``qstr.i.last``. |
| 66 | + |
| 67 | +3. ``qstrdefs.collected.h`` is the output of concatenating ``genhdr/qstr/*`` |
| 68 | + using ``makeqstrdefs.py cat``. This is now the full set of ``MP_QSTR_Foo``'s |
| 69 | + found in the code, now formatted as ``Q(Foo)``, one-per-line, with duplicates. |
| 70 | + This file is only updated if the set of qstrs has changed. A hash of the QSTR |
| 71 | + data is written to another file (``qstrdefs.collected.h.hash``) which allows |
| 72 | + it to track changes across builds. |
| 73 | + |
| 74 | +4. ``qstrdefs.preprocessed.h`` adds in the QSTRs from qstrdefs*. It |
| 75 | + concatenates ``qstrdefs.collected.h`` with ``qstrdefs*.h``, then it transforms |
| 76 | + each line from ``Q(Foo)`` to ``"Q(Foo)"`` so they pass through the preprocessor |
| 77 | + unchanged. Then the preprocessor is used to deal with any conditional |
| 78 | + compilation in ``qstrdefs*.h``. Then the transformation is undone back to |
| 79 | + ``Q(Foo)``, and saved as ``qstrdefs.preprocessed.h``. |
| 80 | + |
| 81 | +5. ``qstrdefs.generated.h`` is the output of ``makeqstrdata.py``. For each |
| 82 | + ``Q(Foo)`` in qstrdefs.preprocessed.h (plus some extra hard-coded ones), it outputs |
| 83 | + ``QDEF(MP_QSTR_Foo, (const byte*)"hash" "Foo")``. |
| 84 | + |
| 85 | +Then in the main compile, two things happen with ``qstrdefs.generated.h``: |
| 86 | + |
| 87 | +1. In qstr.h, each QDEF becomes an entry in an enum, which makes ``MP_QSTR_Foo`` |
| 88 | + available to code and equal to the index of that string in the QSTR table. |
| 89 | + |
| 90 | +2. In qstr.c, the actual QSTR data table is generated as elements of the |
| 91 | + ``mp_qstr_const_pool->qstrs``. |
| 92 | + |
| 93 | +.. _`string interning`: https://en.wikipedia.org/wiki/String_interning |
| 94 | + |
| 95 | +Run-time QSTR generation |
| 96 | +------------------------ |
| 97 | + |
| 98 | +Additional QSTR pools can be created at runtime so that strings can be added to |
| 99 | +them. For example, the code:: |
| 100 | + |
| 101 | + foo[x] = 3 |
| 102 | + |
| 103 | +Will need to create a QSTR for the value of ``x`` so it can be used by the |
| 104 | +"load attr" bytecode. |
| 105 | + |
| 106 | +Also, when compiling Python code, identifiers and literals need to have QSTRs |
| 107 | +created. Note: only literals shorter than 10 characters become QSTRs. This is |
| 108 | +because a regular string on the heap always takes up a minimum of 16 bytes (one |
| 109 | +GC block), whereas QSTRs allow them to be packed more efficiently into the pool. |
| 110 | + |
| 111 | +QSTR pools (and the underlying "chunks" that store the string data) are allocated |
| 112 | +on-demand on the heap with a minimum size. |
0 commit comments