Native modules in ELF format · Issue #3311 · micropython/micropython · GitHub

Native modules in ELF format #3311


Closed
aykevl opened this issue Sep 10, 2017 · 21 comments

@aykevl
Contributor
aykevl commented Sep 10, 2017

I have written a proof-of-concept (read: hacked-together) loader for .so files (in ELF format). This means it is possible to write code in C, package it as a shared object and let MicroPython import it like any .py or .mpy file. See Extending Python with C or C++ for the CPython tutorial on the equivalent feature.

TL;DR: should I continue working on it and send a pull request?

There are a few reasons why this might be useful:

  1. It is possible to write special algorithms in C that would be too slow in Python. I've tried @micropython.native and @micropython.viper but for some algorithms this is still way too slow (and consumes too much memory). Also consider that a big compiler like GCC can do a lot more optimization than MicroPython will ever be able to. That's why I wrote a new extmod module for LED animations, but I doubt it would be suitable for core MicroPython, being more of a kitchen-sink addition.
  2. Some special hardware requires native code to handle it properly, think about the more esoteric hardware features of any MCU or bitbanged drivers. Things like WS2812 and DHT are commonly used by the community and are built in, but there is much more hardware that is less commonly used and not suitable for core MicroPython.
  3. Extract currently built-in modules as shared objects for the people that really need them, reducing the MicroPython core size for the rest. No idea whether this would be desirable. These modules can be built together with the core and provided as separate downloads.

Of course you can manage your own port (with your own builds etc.) or write in assembly but it may be much easier to simply drop a .so file in the flash storage and import it from Python.

Currently it works only for the ESP8266 (that's the only MCU I have that can run MicroPython) and, being a hacked-together system, many features are not implemented (e.g. .data). But simple functions work fine and adding more shouldn't be too much effort. Porting it over to e.g. linux/x86_64 shouldn't be too difficult either. Maybe it's even possible to add some support for Arduino, as some people are probably more familiar with that C++ dialect and it opens possibilities for easily porting a wealth of Arduino libraries.

I've worked on this for about two days now and it turned out to be a bit more work than expected (mainly because of lacking documentation for Xtensa) so before I continue I'd like to know whether such a feature would be desirable.

@pfalcon
Contributor
pfalcon commented Sep 10, 2017

It was possible to load native modules dynamically ~2 years ago: #1627 . So, the suggestion is to switch to that patch instead and help with it (testing, etc.).

@stinos
Contributor
stinos commented Sep 10, 2017

I'm not really at home with the differences between ELF and other formats, so I'm wondering how your implementation compares to simply loading .so files using dlopen and then acquiring a function returning a uPy module, like in stinos@0715d6c?

@aykevl
Contributor Author
aykevl commented Sep 10, 2017

(sorry for the wall of text...)

@pfalcon interesting! I did search the issues/pull-requests but apparently missed that one. I don't fully understand how it works, though. It does something interesting with objcopy and linker scripts that I don't really follow but probably should investigate. It may be better anyway to have a tool on the host which preprocesses the .so file to shrink it (removing useless data) and make it easier for the MCU to parse.

The ELF format is used for basically all executables/object files on Linux and some other operating systems. .so files have basically two views (two sets of headers) over the same data. One is the list of sections (.text etc.) and the other is the program header, which tells the OS loader how to load the program file into memory and how locations in the file map to virtual memory addresses. As I understand it, a program linker (linking multiple .o files into an executable file) uses the sections and generates the program header, while an OS program loader and dynamic linker usually only look at the program header and ignore the sections. Oh, and ELF relies a lot on being able to seek in a file (or rather, it expects almost the whole file to be mapped to memory), which is kind of inconvenient with the current reader (I ended up patching the reader to allow seeking).

The implementation I have is more comparable to the one in Contiki OS, which I discovered after writing mine. What both do is that they look at the list of sections instead of at the program header, loading all sections that are relevant (like .text, .rodata, but not keeping things after load like .rela and .dynsym). This can save a lot of memory. What the current implementation also does (for Xtensa) is that it loads the .text section into program memory (using MP_PLAT_COMMIT_EXEC).
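
For reference, this is roughly what the section-header walk described above boils down to - a minimal sketch assuming a 32-bit ELF image that is already fully in memory (illustration only, not the actual loader code):

#include <elf.h>      // Elf32_Ehdr, Elf32_Shdr, SHF_ALLOC
#include <stdint.h>
#include <stdio.h>

// List the sections a section-based loader would keep: only those marked
// SHF_ALLOC (e.g. .text, .rodata), skipping .rela*, .dynsym and friends.
static void list_alloc_sections(const uint8_t *image) {
    const Elf32_Ehdr *ehdr = (const Elf32_Ehdr *)image;
    const Elf32_Shdr *shdr = (const Elf32_Shdr *)(image + ehdr->e_shoff);
    const char *names = (const char *)(image + shdr[ehdr->e_shstrndx].sh_offset);
    for (unsigned i = 0; i < ehdr->e_shnum; ++i) {
        if (shdr[i].sh_flags & SHF_ALLOC) {
            printf("%s: %u bytes\n", names + shdr[i].sh_name, (unsigned)shdr[i].sh_size);
        }
    }
}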

The current implementation takes a source like the following:

const int value = 42;

// wrapper function
mp_obj_t hsv2rgb_rainbow(size_t n_args, const mp_obj_t *args) {
    uint8_t h = mp_obj_get_int(args[0]);
    uint8_t s = mp_obj_get_int(args[1]);
    uint8_t v = mp_obj_get_int(args[2]);
    return mp_obj_new_int(mod_pixels_hsv2rgb_rainbow(h, s, v));
}

// list of entries in the Python module
const mp_dyn_module_t module_test[] = {
    { TYPE_FUNC_VAR, hsv2rgb_rainbow},
    { TYPE_CONST_INT, value},
};

Compiled using commands like this (removed all kinds of semi-optional flags):

gcc -fPIC -nostdlib -o test.o -c test.c
gcc -shared -nostdlib -o test.so test.o

That can be imported using Python:

import test
print('%06x' % test.hsv2rgb_rainbow(50, 255, 255))

Behind the scenes it reads the .dynsym section. Then it reads the relocation sections (.rela.dyn and .rela.plt) and adjusts all pointers (currently it ignores the PLT and GOT and inserts the final pointers at link time). When it discovers it is relocating a function or value within the module_test list, it inserts a MicroPython object in the module with the given name (as discovered via the .dynsym and .dynstr sections).
This means both the module_test variable and all referenced functions/values must be public (i.e. not static). It also means there is no initialization function. Of course, all of this can be changed, but the current way seems like the most minimal way and the way that uses the least amount of memory.
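
For illustration, the core of such a relocation pass looks roughly like the sketch below, shown here only for a simple absolute 32-bit relocation type; the resolve() callback is a placeholder for looking up a .dynsym symbol's final address:

#include <elf.h>      // Elf32_Rela, ELF32_R_SYM
#include <stddef.h>
#include <stdint.h>

// Patch absolute 32-bit relocations: write (symbol address + addend) at each
// r_offset within the loaded image. A real loader has to switch on
// ELF32_R_TYPE(r_info) and handle the architecture-specific types as well.
static void apply_abs32_relocs(uint8_t *base, const Elf32_Rela *rela, size_t n,
                               uint32_t (*resolve)(uint32_t sym_index)) {
    for (size_t i = 0; i < n; ++i) {
        uint32_t *where = (uint32_t *)(base + rela[i].r_offset);
        *where = resolve(ELF32_R_SYM(rela[i].r_info)) + rela[i].r_addend;
    }
}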

The difference with stinos/micropython@0715d6c and CPython (which work very similarly in that regard) is that they use the dynamic linker, while this approach does the linking itself.

@dpgeorge
Member

@aykevl thanks for sharing your work! Dynamically loadable native object code is definitely a feature that MicroPython wants to have, it would be very useful.

If you read through PR #1627 you'll see that there was a long discussion behind it (and previous PR's for it). That PR aims to keep the MCU-side of the loading/linking as minimal as possible by having the offline compiler essentially do the linking: all calls into the uPy runtime are done via a big lookup table eg mp_obj_new_int(val) becomes runtime_table->mp_obj_new_int(val).
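
Roughly, and only as a hypothetical illustration (the struct and names below are not the actual #1627 code), the lookup-table scheme looks like this:

#include "py/obj.h"   // mp_obj_t, mp_int_t

// Illustrative only: the loader passes the module a single pointer to a table
// of runtime entry points, so calls into MicroPython need no relocations.
typedef struct _runtime_table_t {
    mp_obj_t (*mp_obj_new_int)(mp_int_t value);
    mp_int_t (*mp_obj_get_int)(mp_obj_t obj);
    // ... one entry per runtime function the module is allowed to use ...
} runtime_table_t;

static const runtime_table_t *rt;   // saved once by the module's init function

static mp_obj_t double_it(mp_obj_t x) {
    // every call into the runtime goes through the table
    return rt->mp_obj_new_int(2 * rt->mp_obj_get_int(x));
}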

The other thing done in #1627 is that it uses a custom "object file" format which is based on the existing .mpy format for pre-compiled bytecode. The idea is that a .mpy file is a generic container for precompiled uPy code and can contain bytecode, object code, or a combination of both. Bytecode-only .mpy files would load into (almost) any uPy port, while files containing object code would only load on ports with the same arch as the code was compiled for. This scheme also allows for putting @micropython.native, .viper and .asm_thumb code in .mpy files (it would use the same linking scheme as object code generated from a C/C++ compiler).

@aykevl
Contributor Author
aykevl commented Sep 12, 2017

Nice work! For my reference, #222 (persistent bytecode v1) and #581 (mpy native code v1) are also interesting reads.

Thinking a bit more about it, I think it's best to follow your approach and do most of the work on the host. That means somehow putting the machine code in the .mpy file.
But it looks unavoidable to do some sort of linking on the MCU. That shouldn't be too expensive if the host does most of the preprocessing. I'm currently trying things out, but haven't got very far.

My attempts so far:

  • Simply including the machine code without relocations: lacks some local relocations (required for Xtensa) and makes it harder to call back into the MicroPython API.
  • Wrapping all parts of the .so file and doing some linking on the MCU: this worked by interpreting the .so on Xtensa so it shouldn't be too hard to do most of the processing on the host and including simple instructions for the MCU. I haven't gotten this working on amd64.
  • Interpreting .o files (effectively doing static linking ourselves). This avoids PLTs, which are unnecessary for our purpose, but requires a lot of understanding of the target files (e.g. for ARM we'd need an R_ARM_CALL relocation on a BL instruction, which only has 24 bits for the offset, so the result can only jump within +/- 32MB - they effectively want you to understand the machine code - see the manual). For some reason all my attempts (during a few hours) have failed to get this working on x86-64 (due to PC-relative addressing or maybe something else). I have already tried making the page executable and allocating a page in lower address space (<2GB). Calling the init function isn't hard but calling back appears impossible.

The current file layout is as follows (uint means varint encoding as used in the current .mpy; a hypothetical reader sketch follows the list):

  • 4-byte file header, similar to the current .mpy but with ['M', 2, 0x80, ELF e_machine] (e_machine indicates the ISA which is for example 0x3E for X86-64 and 0x28 for ARM).
  • uint text_size: size of machine code
  • machine code
  • uint index into code where the init function lies (I call it mpy_module_init). This is not always the first byte, e.g. on Xtensa it's usually the 4th or even the 8th byte.
  • uint number of relocations
  • relocations, array of (uint target, uint offset, sint addend) where target is a magic number for the function call (currently a fixed list but should probably be indexes into mp_fun_table), offset is the offset from the start of the machine code, and addend is what should be added to the address of target (required for x86-64). sint here is uint with the least-significant bit indicating the sign (to encode signed integers more efficiently). offset could be made more efficient by encoding the distance from the previous offset (or 0) so the varint encoding stays small for large .mpy files.
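
For illustration only, a reader for this layout could look roughly like the sketch below; read_uint/read_sint/read_bytes, alloc_executable, symbol_address and commit_exec are placeholder helpers, not existing MicroPython APIs:

#include <stddef.h>
#include <stdint.h>
#include "py/obj.h"                           // mp_obj_t

// Placeholders for this sketch only:
typedef struct _reader_t reader_t;            // some byte-stream reader over the .mpy file
extern void read_bytes(reader_t *r, uint8_t *buf, size_t len);
extern size_t read_uint(reader_t *r);         // varint, as in current .mpy files
extern int32_t read_sint(reader_t *r);        // varint with the sign in the LSB
extern uint8_t *alloc_executable(size_t len);
extern uint32_t symbol_address(size_t target);      // e.g. an index into mp_fun_table
extern void commit_exec(uint8_t *buf, size_t len);  // e.g. MP_PLAT_COMMIT_EXEC

static mp_obj_t load_native_mpy(reader_t *reader, mp_obj_t locals_dict) {
    uint8_t header[4];
    read_bytes(reader, header, 4);            // 'M', 2, 0x80, ELF e_machine
    size_t text_size = read_uint(reader);
    uint8_t *text = alloc_executable(text_size);
    read_bytes(reader, text, text_size);      // the machine code itself
    size_t init_offset = read_uint(reader);   // where mpy_module_init starts
    size_t n_relocs = read_uint(reader);
    for (size_t i = 0; i < n_relocs; ++i) {
        size_t target = read_uint(reader);    // which runtime symbol to link against
        size_t offset = read_uint(reader);    // patch location within the code
        int32_t addend = read_sint(reader);   // added to the symbol address
        *(uint32_t *)(text + offset) = symbol_address(target) + addend;
    }
    commit_exec(text, text_size);
    mp_obj_t (*init)(mp_obj_t) = (mp_obj_t (*)(mp_obj_t))(uintptr_t)(text + init_offset);
    return init(locals_dict);                 // init populates the module's dict
}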

All in all, it turns out to be a lot harder than expected. Suggestions are welcome.

@aykevl
Contributor Author
aykevl commented Sep 18, 2017

I have a very basic implementation that works on ARM (non-Thumb), x86-64 (using a hack to allocate memory) and Xtensa/ESP8266.
It consists of a script that converts a .o file into a .mpy file using roughly the same format as above, and a part that lives in MicroPython to import the file and execute the native mpy_module_init function. mpy_module_init gets the locals dict for the module, where it can add its own methods (and classes, but those aren't implemented yet). The MicroPython functions to relocate against currently use magic numbers; I intend to change those to indexes into mp_fun_table.

This is what a dynamically loadable module would look like. Currently it needs to add each member of the module manually, but maybe this can be changed to use an mp_rom_map_elem_t table, making it even more like native modules. Also, MP_QSTR_ isn't implemented yet.

#include "module.h"

STATIC mp_obj_t plus(mp_obj_t _a, mp_obj_t _b) {
    int a = mp_obj_get_int(_a);
    int b = mp_obj_get_int(_b);
    return mp_obj_new_int(a + b);
}
STATIC MP_DEFINE_CONST_FUN_OBJ_2(plus_obj, plus);

mp_obj_t mpy_module_init(const mp_obj_t locals) {
    mp_obj_dict_store(locals, MP_OBJ_NEW_QSTR(qstr_from_str("plus")), (mp_obj_t)MP_ROM_PTR(&plus_obj));
    return mp_const_none;
}

A lot needs to be cleaned up before I can even commit this feature, but I'm just asking for feedback now on the implementation. Does this seem workable?

The problem I had with x86-64 was that the printf function wasn't displaying full 64-bit pointers, so I thought they were in the lower 2GB address space. Turns out micropython lives far higher (0x0000555555....) so unfortunately I couldn't simply allocate a buffer in the lower 2GB to make relative loads/jumps work. As a workaround I just allocate right above where micropython lives. Actually I think on systems with an OS a dynamic linker should be used, just like CPython (and stinos@0715d6c). This small custom format makes more sense for MCUs.

@pfalcon
Contributor
pfalcon commented Sep 18, 2017

A lot needs to be cleaned up before I can even commit this feature, but I'm just asking for feedback now on the implementation.

Why do you duplicate work done 2 years ago by @dpgeorge in #1627 ?

@aykevl
Contributor Author
aykevl commented Sep 18, 2017

Mostly because I did it in a very different way.
Differences:

  • I use relocations, Dynamic native modules v2 #1627 uses an array of function pointers passed to the init function. Not using relocations is quite limiting. Using them is difficult (you basically have to write a small linker to turn a .o into a .mpy) but shouldn't give much overhead on the MCU.
  • I don't use mp_fun_table yet, but that's a TODO. Again, the implementation is so different I couldn't even copy anything (due to relocations).
  • I use a more complex .mpy file format (e.g. also specifying .data) - though that could be implemented in Dynamic native modules v2 #1627 too.
  • I try to keep modules as much as possible like native modules, one of the issues raised in Dynamic native modules v2 #1627.

By far the majority of the work went into relocations. For the parts that are shared (e.g. the changes in py/emitglue.c) I did take a look there.

If you don't agree with these differences, please tell me why. That's why I'm asking.

@pfalcon
Contributor
pfalcon commented Sep 18, 2017

If you don't agree with these differences, please tell me why. That's why I'm asking.

I don't agree with the process. There's code (a 2nd or 3rd iteration even) written by another developer. You seem to largely ignore it and go with your own code. But to have a feature implemented, developers need to cooperate and look at (work with!) each other's code. And why would the next guy want to work with your code, if you don't want to work with the previous guy's?

@aykevl
Contributor Author
aykevl commented Sep 19, 2017

Yeah OK you're right. I wasn't very cooperative in that regard. I didn't see much value in working with that code as it's an old branch (merge conflicts!) but mostly because lots needs to change for relocations, so much that I saw far more difference than overlap. And, admittedly, I was impatient and wanted to get something working.

But I'll be more cooperative. Also, I would like to have this feature implemented somehow. That means a decision needs to be made regarding how it's implemented. If function pointers are used, I'll go with #1627 and improve it, but if not I don't see the point of adapting rather different and old code. I think relocations are superior (I can elaborate on that if needed) but I don't want to continue working on it unless I have some sort of green light this might eventually be merged.

@dpgeorge what's your opinion on this? Would you still want to work on this? I can put the code I have in a branch if needed.

@dpgeorge
Member

That means a decision needs to be made regarding how it's implemented. If function pointers are used, I'll go with #1627 and improve it, but if not I don't see the point of adapting rather different and old code. I think relocations are superior

Please elaborate on this point about relocations in the code. As I see it, the main goals for native code in .mpy files are:

  1. should work seamlessly with any target arch (x86, Thumb2, xtensa, etc)
  2. should allow any arbitrary native code to run (eg compiled from C, C++, assembly, Rust, etc)
  3. it's a great feature to have for all ports, even those that are extremely limited (potentially more so for such tiny ports so they can have pluggable functionality in C) so there should be an absolute minimum of support code on the target that is running MicroPython
  4. potential to execute in-place so code can be executed from ROM/flash and doesn't need to be loaded into RAM (think about small ports like the micro:bit that have only 16k of RAM)

The micro:bit is a good real-world example of something that could benefit a lot from this feature, in the following way: it has a base, precompiled firmware with MicroPython and a user can customise this by including modules for their current application. Such modules can be .mpy files that are appended directly to the end of the precompiled firmware and form a "frozen filesystem" in flash. This doesn't require recompiling, just appending binary files and can be done in the browser very easily (this is what happens at the moment). The user could then import .mpy files from the frozen filesystem, including native code. This scheme would allow the base binary to be minimal yet allow easy extensibility for drivers like Neopixel and DHT22 which need to be written in C or some other low-level language.

@aykevl
Contributor Author
aykevl commented Sep 20, 2017
  4. potential to execute in-place so code can be executed from ROM/flash and doesn't need to be loaded into RAM (think about small ports like the micro:bit that have only 16k of RAM)

How would that work? I don't think we can execute from a filesystem directly as files might be fragmented. If it's possible it would be very nice, though. (Unfortunately not a solution for the ESP as that system uses a Harvard architecture).

I think we have different priorities here. My idea:

  • Source files for modules should look as much as possible like built-in modules, possibly with some extra boilerplate if needed. This way you can create a single source file (e.g. a DHT22 module) that can be built as a built-in module or as a standalone .mpy file, depending on the port (flash size etc.)
  • Ideally, these .mpy files are built as part of the standard build process, so they end up as separate .mpy files in the build directory (along firmware.bin etc.)

That means I really don't like all the preprocessor magic (especially the CONTEXT) that goes on in your modx.c example. But maybe we can find a midway that has most benefits of both ways.

So if we want dynamically loadable modules that look like built-in modules, in particular this case is very hard to do without relocations (expanded from MP_DEFINE_CONST_FUN_OBJ_1):

STATIC mp_obj_t add1(mp_obj_t n) {...}
mp_obj_fun_builtin_fixed_t add1_obj = {{&mp_type_fun_builtin_1}, .fun._1 = add1};

It stores a pointer into MicroPython (&mp_type_fun_builtin_1) in a value in .data. There is no way we can predict this pointer (as it depends on the build), and it can't be filled in at runtime. When all object files are linked together into an executable, a fixed pointer can be inserted (as long as the executable isn't PIE). For shared object files, a relocation entry is created to patch up the .data section (insert the pointer) after the file's segments have been loaded into the address space of the process.

There is a similar problem with function calls. In a .so file these are replaced with a PLT (stub functions which call the real functions), where the PLT is then patched up to include the destination pointers. The reason for using a PLT is that the .text section doesn't have to be touched, so it can be shared among different processes (reducing memory consumption). For our case, a PLT is just overhead as a module is only loaded once anyway on a MCU.

I have an idea how we might be able to fix this, avoiding relocations except for one. The idea is to start the file with an extern void __attribute__((section(".mpygot"))) *const mp_fun_table[MP_F_NUMBER_OF];. If my guess is correct, this will create a relocation to insert the mp_fun_table pointer. Inserting this pointer is very cheap on the MCU, just a 32-bit store. Code can then do a PC-relative load using that address to get the target pointer. In code this would look like mp_fun_table[MP_F_NEW_INT] and mp_fun_table[MP_F_CONST_NONE]. A preprocessor could then replace all instances of mp_obj_new_int and mp_const_none with those array lookups. For MP_DEFINE_CONST_FUN_OBJ_1 and the like we can do something similar. This way almost all of the code could be shared between built-in and dynamically loaded modules. I would have to test this method, though, to see whether it is actually possible.
EDIT: did a very minimal test, it works on ARM when compiled using -fPIE.
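
A hypothetical sketch of what that could look like in module source; the MP_F_* index names are the ones used above and the DYN_* wrapper macros are invented for illustration:

#include "py/obj.h"   // mp_obj_t, mp_int_t

// Hypothetical index names, mirroring the ones used in this comment
// (not a real MicroPython enum).
enum { MP_F_NEW_INT, MP_F_CONST_NONE, MP_F_NUMBER_OF };

// The single relocation: the loader patches the address of this table once.
extern void __attribute__((section(".mpygot"))) *const mp_fun_table[MP_F_NUMBER_OF];

// A preprocessor (or header) could hide the indirection behind the usual names:
#define DYN_mp_obj_new_int(x) \
    (((mp_obj_t (*)(mp_int_t))mp_fun_table[MP_F_NEW_INT])(x))
#define DYN_mp_const_none ((mp_obj_t)mp_fun_table[MP_F_CONST_NONE])

static mp_obj_t make_int(mp_int_t v) {
    return DYN_mp_obj_new_int(v);   // resolved via the table at run time
}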

While the MicroPython core itself is smaller when using a pointer table, the modules themselves get bigger. One advantage of relocations is that modules can stay mostly position-dependent with no indirection. Using a pointer table causes every call to a MicroPython function to have some overhead and produces a bit more code because of the indirection.
Relocations also don't need to cost a lot of code space. Currently they take up ~92 bytes of space on Xtensa (ESP8266), ~224 bytes on ARM (Raspberry Pi 3, non-Thumb) and ~224 on amd64. I don't know how much budget there is on systems like the nRF51822 (micro:bit) and I don't know how much the relocations would cost on Thumb2, but maybe I can try using qemu-arm.

@dpgeorge
Member

How would that work? I don't think we can execute from a filesystem directly as files might be fragmented.

Right. So it would need to be a special filesystem, and the simplest case is a "frozen filesystem" of file data that is concatenated into a big blob.

Source files for modules should look as much as possible like built-in modules, possibly with some extra boilerplate if needed. This way you can create a single source file (e.g. a DHT22 module) that can be built as a built-in module or as a standalone .mpy file

This was considered when designing my version of the native persistent code, and it can be done: you just have to provide other definitions for the macros (and is why CONTEXT is the way it is).

Ideally, these .mpy files are built as part of the standard build process

There's nothing stopping that from happening with my version. But one of my big priorities with the design was the ability to create native .mpy files in a stand-alone environment, only needing the particular mpconfigport.h file that you want to make the files work for.

So if we want dynamically loadable modules that look like built-in modules, in particular this case is very hard to do without relocations (expanded from MP_DEFINE_CONST_FUN_OBJ_1):

Yes this is impossible to do without relocations. That's why all function wrappers are created dynamically when the .mpy file is loaded (in my version). Yes this uses extra memory, but that memory is saved in cases where the native code in the .mpy file runs directly from ROM.

The biggest decision is really whether to go to the effort to support running from ROM, or accept that it's not worth it and require loading the code into RAM. The latter would allow relocations and C code that is much more similar to how existing code is written (ie minimal macro magic).

Bytecode in .mpy files is currently relocated (because qstrs need to be rewritten in the bytecode) and so must be copied to RAM when loaded/imported. The reason for this design decision is because it's too much overhead (in code size, RAM usage and execution speed) to have an indirection table to look up qstrs. And this would need to be applied to all bytecode, not just that loaded from a .mpy file. So it might be fair enough then to also require that native code be copied and relocated when it's loaded.

The ability to run .mpy files from ROM/flash would then come as a second step: .mpy files could be dynamically frozen into a set region of flash, ie they would be relocated into flash once for the particular target then executed from there next time they were referenced.

@pfalcon
Contributor
pfalcon commented Sep 22, 2017

This was considered when designing my version of the native persistent code, and it can be done: you just have to provide other definitions for the macros (and is why CONTEXT is the way it is)

And I actually went over it and considered how to make it even easier/more transparent to support both. I regret not commenting then on #1627 , and now it's all forgotten/lost.

The biggest decision is really whether to go to the effort to support running from ROM

Of course. There're 3 basic requirements for dynamically loadable modules:

  1. Machine-independent format/support (rules out relocations).
  2. Allow dynamic loadable modules to be "mmapped" (rules out relocations).
  3. Allow to build modules either builtin or dynaloaded from the same source (extra points for builtin version to be as efficient as "natively builtin" one).

Qstrs are the biggest problem with the dynamic modules, yeah. Maybe it's even worth considering not using them for the dynaload case (because the other alternative is to have them as variables in the .data section - and that's not counting the alternative of complicating the qstr code (killing performance) by adding "qstr symlinks").

@dpgeorge
Member

Machine-independent format/support (rules out relocations).

I don't agree with this as a basic requirement. It's a "nice to have" because it makes it easier (trivial) to support additional archs, but it doesn't bring any technical feature/advantage (like ability to run from ROM would).

@aykevl
Contributor Author
aykevl commented Sep 25, 2017

replying to @dpgeorge:

Right. So it would need to be a special filesystem, and the simplest case is a "frozen filesystem" of file data that is concatenated into a big blob.

It would be nice to have a clear use case, otherwise we're working on support for something (and limiting our options!) that might not even be implemented. Currently I see two cases: .mpy files on a FAT filesystem (which can be fragmented, and thus impossible to execute machine code from in place), or built into MicroPython, in which case it's better to make them built-in anyway.

Source files for modules should look as much as possible like built-in modules, possibly with some extra boilerplate if needed. This way you can create a single source file (e.g. a DHT22 module) that can be built as a built-in module or as a standalone .mpy file

This was considered when designing my version of the native persistent code, and it can be done: you just have to provide other definitions for the macros (and is why CONTEXT is the way it is).

So that's why there is for example both CONTEXT and CONTEXT_ALONE?
I see your point. It would change the format of these hybrid (builtin + dynamically loadable) modules. That's not nice, but could be done if required for performance / minimality.

Ideally, these .mpy files are built as part of the standard build process

There's nothing stopping thing from happening with my version. But one of my big priorities with the design was the ability to create native .mpy files in a stand-alone environment, only needing the particular mpconfigport.h file that you want to make the files work for.

Good point. My current design doesn't really work that way.

replying to @pfalcon:

  1. Machine-independent format/support (rules out relocations).

I agree with @dpgeorge here, it doesn't seem very fundamental. It would make it easier to provide support, yes. But while there are many different boards/chips, there is a much smaller number of archs (X64, ARM, Thumb2, Xtensa, PIC, maybe others?). I already did a few of those.

  2. Allow dynamic loadable modules to be "mmapped" (rules out relocations).

Do you mean executing them directly from flash, or actually mmapping them on an OS with MMU (like Linux)? In the case of executing them from flash, see the first quote/reply of this comment. In the latter case, can you elaborate?

  3. Allow to build modules either builtin or dynaloaded from the same source (extra points for builtin version to be as efficient as "natively builtin" one).

If I understand you correctly, that's the whole point of the changes I proposed.

@aykevl
Contributor Author
aykevl commented Sep 26, 2017

I just made a proof of concept to store pointers to fun_table and qstr_table at the start of .text. This means the CONTEXT macro isn't necessary anymore, but requires touching the machine code. It results in a code size reduction of 104 bytes on X64, 32 bytes on the stm32 port and 36 bytes on the esp8266 port (I haven't tested whether it actually works on anything besides unix/X64). Unfortunately the .mpy file itself from the modx example gets a bit bigger: 72 bytes on X64 and 44 bytes on Thumb2 (CROSS=1) because of the indirection (loading a pointer instead of using it from a function parameter which is already stored in a register).
To be clear: this is just a proof of concept. I made it to know whether it is actually possible to do this. The result is stored in the dyn-nat-no-context branch

EDIT: with the removal of CONTEXT, a lot can be shared between builtin and dynload native functions (b6ad76a). On X64, this results in an additional code size reduction of 240 bytes (together 344 bytes) and data size reduction of 128 bytes - this is a lot more than I expected.
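
Very roughly, and purely as a hypothetical sketch of the idea (all names invented): the loader writes the two table pointers into a small header at the start of .text, and the generated code reaches them PC-relatively instead of receiving a CONTEXT argument:

// Hypothetical header at the very start of .text; the loader fills in the two
// pointers after copying the code to RAM, which is why the code can no longer
// stay read-only / execute-in-place.
typedef struct _dyn_header_t {
    const void *const *fun_table;   // -> mp_fun_table in the firmware
    const void *qstr_table;         // -> the runtime's qstr data
} dyn_header_t;

__attribute__((section(".text.header"), used))
static dyn_header_t mpy_header;     // a linker script would keep this section first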

@dpgeorge
Member
dpgeorge commented Oct 5, 2017

I just made a proof of concept to store pointers to fun_table and qstr_table at the start of .text. This means the CONTEXT macro isn't necessary anymore, but requires touching the machine code

I checked out your code and it's clever! But as you say it requires writing to the first 2 words at the beginning of the native machine code. This means that all of the machine code needs to be in RAM. In that case one may as well relocate everything to shrink the native loaded code size.

A mechanism that uses relocation may actually be better in the long run than designing for in-place execution. Relocation is already used for bytecode (to rewrite qstrs to match those in the runtime) so relocating native code is consistent with this.

But more importantly it would be easier to "dynamically freeze" a relocated module rather than an in-place one. By "dynamically freeze" I mean doing what frozen bytecode does but at runtime (see discussion at #2709). And do this for dynamically loaded native code.

It would work something like this: instead of loading an .mpy file into RAM, it's loaded directly into (erased) flash, and then linked (relocated) against the VM/runtime and all its qstrs and symbols. This should be possible because the loaded data only needs to be written once (and flash only allows one write after an erase). On subsequent boots of the system it would check if the .mpy file was already loaded into flash, and if so just use the existing data (the relocations will not have changed if the firmware did not change).

A big benefit of doing relocations is that the module can define static function wrappers (eg MP_DEFINE_CONST_FUN_OBJ_0) and static dict tables. And if the modules are dynamically frozen then these wrappers and tables don't take up any RAM. This is in contrast to an in-place (ie non relocated) scheme which must allocate these wrappers and tables on the GC heap, and they would be much harder to dynamically freeze.

So relocated native code + dynamic freezing would allow for the most minimal RAM footprint for loaded native code. And for big modules on systems with little memory this would be advantageous.

@aykevl
Contributor Author
aykevl commented Oct 7, 2017

"Dynamically freezing" sounds interesting. In fact, I was thinking about exactly a feature like this (before I even knew what "freezing" meant for MicroPython) for the ESP8266. The problem with the ESP is that it has lots of flash (often 4MB), but only the first 1MB is executable. Additionally, code cannot be executed directly from the filesystem (unless we implement some sort of defragmentation, which seems way too complex). So my idea was to reserve a bit of storage at the start (a few 10s of kilobytes maybe) and store (copy+relocate) the executable scripts there.

Another approach I would like to try: store the .mpy file somewhere in flash (non-fragmented), and leave all fields-to-relocate at 0xff. On many flash chips, bits can be set to 0 but not back to 1. This means we can simply overwrite the still cleared (all ones) fields with the correct address, without needing to erase the block first.
This could be useful for systems like the micro:bit, which if I understand it correctly, have scripts (and in the future maybe .mpy files with native code) appended to the end in the online editor. On the first run, the necessary fields can be relocated with little effort.
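
A minimal sketch of that trick, assuming a hypothetical port-specific flash_write_word() routine:

#include <stdbool.h>
#include <stdint.h>

// flash_write_word() is a placeholder for whatever the port uses to program
// a single word of flash.
extern void flash_write_word(uint32_t *addr, uint32_t value);

// Patch one relocation field that was deliberately left erased (all ones):
// NOR flash can clear bits (1 -> 0) without an erase, so the field can be
// written exactly once, in place, on the first run.
static bool patch_reloc_in_flash(uint32_t *field, uint32_t address) {
    if (*field != 0xFFFFFFFFu) {
        return *field == address;    // already patched on an earlier boot
    }
    flash_write_word(field, address);
    return *field == address;
}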

I'm currently working on some missing features in the nrf port, so I can test there.

@aykevl
Contributor Author
aykevl commented Oct 18, 2017

I've started working on #2709 as a building block.

Meanwhile, maybe I can add @micropython.asm_thumb etc. support for the cross compiler? That should be relatively easy but requires a change in file format - the same change as required for storing native code in .mpy files. It depends on a subset of the changes in #1627.

@jonnor
Contributor
jonnor commented Sep 29, 2024

Based on this discussion (and many others), a version of native modules landed in 2019, initially with #5083

jonnor closed this as completed Sep 29, 2024