Native modules in ELF format #3311
It was possible to load native modules dynamically ~2 years ago: #1627. So, the suggestion is to switch to that patch instead and help with it (testing, etc.).
I'm not really at home in the differences between ELF and other formats, so I'm wondering how your implementation compares to simply loading .so files using
(sorry for the wall of text...)

@pfalcon interesting! I did search the issues/pull requests but apparently missed that one. I don't fully understand how it works, though. It does something interesting with objcopy and linker scripts that I don't really follow but probably should investigate. It may be better anyway to have a tool on the host which preprocesses the

The ELF format is used for basically all executables/object files on Linux and some other operating systems. The implementation I have is more comparable to the one in Contiki OS, which I discovered after writing mine. What both do is look at the list of sections instead of at the program header, loading all sections that are relevant (like .text and .rodata, but not keeping things like .rela and .dynsym after load). This can save a lot of memory. What the current implementation also does (for Xtensa) is load the .text section into program memory (using

The current implementation takes a source like the following:

```c
const int value = 42;

// wrapper function
mp_obj_t hsv2rgb_rainbow(size_t n_args, const mp_obj_t *args) {
    uint8_t h = mp_obj_get_int(args[0]);
    uint8_t s = mp_obj_get_int(args[1]);
    uint8_t v = mp_obj_get_int(args[2]);
    return mp_obj_new_int(mod_pixels_hsv2rgb_rainbow(h, s, v));
}

// list of entries in the Python module
const mp_dyn_module_t module_test[] = {
    { TYPE_FUNC_VAR, hsv2rgb_rainbow },
    { TYPE_CONST_INT, value },
};
```

Compiled using commands like this (removed all kinds of semi-optional flags):

That can be imported from Python:

```python
import test
print('%06x' % test.hsv2rgb_rainbow(50, 255, 255))
```

Behind the scenes it reads the .dynsym section. Then it reads the relocation sections (.rela.dyn and .rela.plt) and adjusts all pointers (currently it ignores the PLT and GOT and inserts the final pointers at link time). When it discovers it is relocating a function or value within the

The difference with stinos/micropython@0715d6c and CPython (which work very similarly in that regard) is that they use the dynamic linker, while this approach does the linking itself.
@aykevl thanks for sharing your work! Dynamically loadable native object code is definitely a feature that MicroPython wants to have; it would be very useful. If you read through PR #1627 you'll see that there was a long discussion behind it (and previous PRs for it). That PR aims to keep the MCU side of the loading/linking as minimal as possible by having the offline compiler essentially do the linking: all calls into the uPy runtime are done via a big lookup table, eg

The other thing done in #1627 is that it uses a custom "object file" format which is based on the existing .mpy format for pre-compiled bytecode. The idea is that a .mpy file is a generic container for precompiled uPy code and can contain bytecode, object code, or a combination of both. Bytecode-only .mpy files would load into (almost) any uPy port, while files containing object code would only load on ports with the same arch as the code was compiled for. This scheme also allows for putting
Nice work! For my reference, #222 (persistent bytecode v1) and #581 (mpy native code v1) are also interesting reads. Thinking a bit more about it, I think it's best to follow your approach and do most of the work on the host. That means somehow putting the machine code in the .mpy file. My attempts so far:

The current file layout is as follows (uint means the varint encoding as used in the current .mpy format):

All in all, it turns out to be a lot harder than expected. Suggestions are welcome.
I have a very basic implementation that works on ARM (non-Thumb), x86-64 (using a hack to allocate memory) and Xtensa/ESP8266. This is what a dynamically loadable module would look like. Currently each member of the module needs to be added manually, but maybe this can be changed to use a

```c
#include "module.h"

STATIC mp_obj_t plus(mp_obj_t _a, mp_obj_t _b) {
    int a = mp_obj_get_int(_a);
    int b = mp_obj_get_int(_b);
    return mp_obj_new_int(a + b);
}
STATIC MP_DEFINE_CONST_FUN_OBJ_2(plus_obj, plus);

mp_obj_t mpy_module_init(const mp_obj_t locals) {
    mp_obj_dict_store(locals, MP_OBJ_NEW_QSTR(qstr_from_str("plus")), (mp_obj_t)MP_ROM_PTR(&plus_obj));
    return mp_const_none;
}
```

A lot needs to be cleaned up before I can even commit this feature, but I'm just asking for feedback now on the implementation. Does this seem workable? The problem I had with x86-64 was that the
Mostly because I did it in a very different way.
By far the majority of the work went into relocations. For the parts that were shared (e.g. the changes in py/emitglue.c) I did take a look at that code. If you don't agree with these differences, please tell me why; that's why I'm asking.
I don't agree with the process. There's code (a 2nd or 3rd iteration, even) written by another developer. You seem to largely ignore it and go with your own code. But to have a feature implemented, developers need to cooperate and look at (work with!) each other's code. And why would the next person want to work with your code, if you don't want to work with the previous person's?
Yeah, OK, you're right. I wasn't very cooperative in that regard. I didn't see much value in working with that code as it's an old branch (merge conflicts!), but mostly because lots needs to change for relocations, so much that I saw far more difference than overlap. And, admittedly, I was impatient and wanted to get something working. But I'll be more cooperative.

Also, I would like to have this feature implemented somehow. That means a decision needs to be made regarding how it's implemented. If function pointers are used, I'll go with #1627 and improve it, but if not I don't see the point of adapting rather different and old code. I think relocations are superior (I can elaborate on that if needed) but I don't want to continue working on it unless I have some sort of green light that this might eventually be merged. @dpgeorge what's your opinion on this? Would you still want to work on this? I can put the code I have in a branch if needed.
Please elaborate on this point about relocations in the code. As I see it, the main goals for native code in .mpy files are:
The micro:bit is a good real-world example of something that could benefit a lot from this feature, in the following way: it has a base, precompiled firmware with MicroPython, and a user can customise this by including modules for their current application. Such modules can be .mpy files that are appended directly to the end of the precompiled firmware and form a "frozen filesystem" in flash. This doesn't require recompiling, just appending binary files, and can be done in the browser very easily (this is what happens at the moment). The user could then import .mpy files from the frozen filesystem, including native code. This scheme would allow the base binary to be minimal yet allow easy extensibility for drivers like Neopixel and DHT22 which need to be written in C or some other low-level language.
How would that work? I don't think we can execute from a filesystem directly, as files might be fragmented. If it's possible it would be very nice, though. (Unfortunately it's not a solution for the ESP, as that system uses a Harvard architecture.) I think we have different priorities here. My idea:
That means I really don't like all the preprocessor magic (especially the CONTEXT) that goes on in your modx.c example. But maybe we can find a midway that has most of the benefits of both ways.

So if we want dynamically loadable modules that look like built-in modules, this case in particular is very hard to do without relocations (expanded from

```c
STATIC mp_obj_t add1(mp_obj_t n) {...}
mp_obj_fun_builtin_fixed_t add1_obj = {{&mp_type_fun_builtin_1}, .fun._1 = add1};
```

It stores a MicroPython pointer (&mp_type_fun_builtin_1) in static data, so the object needs to be patched up at load time.

There is a similar problem with function calls. In a .so file these are replaced with a PLT (stub functions which call the real functions), where the PLT is then patched up to include the destination pointers. The reason for using a PLT is that the .text section doesn't have to be touched, so it can be shared among different processes (reducing memory consumption). For our case, a PLT is just overhead, as a module is only loaded once anyway on an MCU.

I have an idea how we might be able to fix this, avoiding relocations except for one. The idea is to start the file with a

While the MicroPython core itself is smaller when using a pointer table, the modules themselves get bigger. One advantage of relocations is that modules can stay mostly position-dependent with no indirection. Using a pointer table causes every call to a MicroPython function to have some overhead and produces a bit more code because of the indirection.
Right. So it would need to be a special filesystem, and the simplest case is a "frozen filesystem" of file data that is concatenated into a big blob.
This was considered when designing my version of the native persistent code, and it can be done: you just have to provide other definitions for the macros (and that is why CONTEXT is the way it is).

There's nothing stopping that from happening with my version. But one of my big priorities with the design was the ability to create native .mpy files in a stand-alone environment, only needing the particular mpconfigport.h file that you want to make the files work for.
Yes, this is impossible to do without relocations. That's why all function wrappers are created dynamically when the .mpy file is loaded (in my version). Yes, this uses extra memory, but that memory is saved in cases where the native code in the .mpy file runs directly from ROM.

The biggest decision is really whether to go to the effort to support running from ROM, or accept that it's not worth it and require the code to be loaded into RAM. The latter would allow relocations and C code that is much more similar to how existing code is written (ie minimal macro magic).

Bytecode in .mpy files is currently relocated (because qstrs need to be rewritten in the bytecode) and so must be copied to RAM when loaded/imported. The reason for this design decision is that it's too much overhead (in code size, RAM usage and execution speed) to have an indirection table to look up qstrs. And this would need to be applied to all bytecode, not just that loaded from a .mpy file. So it might be fair enough then to also require that native code be copied and relocated when it's loaded.

The ability to run .mpy files from ROM/flash would then come as a second step: .mpy files could be dynamically frozen into a set region of flash, ie they would be relocated into flash once for the particular target and then executed from there the next time they were referenced.
And I actually went over it and considered how to make it even easier/more transparent to support both. I regret not commenting on #1627 at the time, and now it's all forgotten/lost.

Of course. There are 3 basic requirements for dynamically loadable modules:

Qstrs are the biggest problem with dynamic modules, yeah. Maybe it's even worth considering not using them for the dynaload case (because the other alternative is to have them as variables in the .data section; that's not counting the alternative of complicating the qstr code (killing performance) by adding "qstr symlinks").
I don't agree with this as a basic requirement. It's a "nice to have" because it makes it easier (trivial) to support additional archs, but it doesn't bring any technical feature/advantage (like the ability to run from ROM would).
replying to @dpgeorge:

It would be nice to have a clear use case, otherwise we're working on support for something (and limiting our options!) when it might not even be implemented. Currently I see two cases: .mpy files on a FAT filesystem (which can be fragmented, thus impossible to store machine code for in-place execution), or built into MicroPython, in which case it's better to make them built-in anyway.

So that's why there is for example both

Good point. My current design doesn't really work that way.

replying to @pfalcon:

I agree with @dpgeorge here; it doesn't seem very fundamental. It would be easier to provide support, yes. But while there are many different boards/chips, there is a much smaller number of archs (X64, ARM, Thumb2, Xtensa, PIC, maybe others?). I already did a few of those.

Do you mean executing them directly from flash, or actually mmapping them on an OS with an MMU (like Linux)? In the case of executing them from flash, see the first quote/reply of this comment. In the latter case, can you elaborate?

If I understand you correctly, that's the whole point of the changes I proposed.
I just made a proof of concept to store pointers to

EDIT: with the removal of CONTEXT, a lot can be shared between builtin and dynload native functions (b6ad76a). On X64, this results in an additional code size reduction of 240 bytes (344 bytes together) and a data size reduction of 128 bytes; this is a lot more than I expected.
I checked out your code and it's clever! But as you say it requires writing to the first 2 words at the beginning of the native machine code. This means that all of the machine code needs to be in RAM. In that case one may as well relocate everything to shrink the loaded native code size.

A mechanism that uses relocation may actually be better in the long run than designing for in-place execution. Relocation is already used for bytecode (to rewrite qstrs to match those in the runtime), so relocating native code is consistent with this. But more importantly, it would be easier to "dynamically freeze" a relocated module than an in-place one. By "dynamically freeze" I mean doing what frozen bytecode does but at runtime (see the discussion at #2709), and doing this for dynamically loaded native code. It would work something like this: instead of loading an .mpy file into RAM, it's loaded directly into (erased) flash and then linked (relocated) against the VM/runtime and all its qstrs and symbols. This should be possible because the loaded data only needs to be written once (and flash only allows one write after an erase). On subsequent boots the system would check if the .mpy file was already loaded into flash, and if so just use the existing data (the relocations will not have changed if the firmware did not change).

A big benefit of doing relocations is that the module can define static function wrappers (eg MP_DEFINE_CONST_FUN_OBJ_0) and static dict tables. And if the modules are dynamically frozen then these wrappers and tables don't take up any RAM. This is in contrast to an in-place (ie non-relocated) scheme, which must allocate these wrappers and tables on the GC heap, where they would be much harder to dynamically freeze. So relocated native code + dynamic freezing would allow for the most minimal RAM footprint for loaded native code. And for big modules on systems with little memory this would be advantageous.
"Dynamically freezing" sounds interesting. In fact, I was thinking about exactly a feature like this (before I even knew what "freezing" meant for MicroPython) for the ESP8266. The problem with the ESP is that it has lots of flash (often 4MB), but only the first 1MB is executable. Additionally, code cannot be executed directly from the filesystem (unless we implement some sort of defragmentation, which seems way too complex). So my idea was to reserve a bit of storage at the start (a few 10s of kilobytes maybe) and store (copy+relocate) the executable scripts there. Another approach I would like to try: store the .mpy file somewhere in flash (non-fragmented), and leave all fields-to-relocate at 0xff. On many flash chips, bits can be set to 0 but not back to 1. This means we can simply overwrite the still cleared (all ones) fields with the correct address, without needing to erase the block first. I'm currently working on some missing features in the nrf port, so I can test there. |
I've started working on #2709 as a building block. Meanwhile, maybe I can add
Based on this discussion (and many others), a version of native modules landed in 2019, initially with #5083.
I have written a proof-of-concept (read: hacked-together) loader for .so files (in ELF format). This means it is possible to write code in C, package it as a shared object, and let MicroPython import it like any .py or .mpy file. Read "Extending Python with C or C++" for the CPython tutorial on this.

TL;DR: should I continue working on it and send a pull request?

There are a few reasons why this might be useful:

There is @micropython.native and @micropython.viper, but for some algorithms this is still way too slow (and consumes too much memory). Also consider that a big compiler like GCC can do a lot more optimization than MicroPython will ever be able to. That's why I wrote a new extmod module for LED animations, but I doubt it would be suitable for core MicroPython, being more of a kitchen-sink addition. Of course you can manage your own port (with your own builds etc.) or write in assembly, but it may be much easier to simply drop a .so file in the flash storage and import it from Python.

Currently it works only on the ESP8266 (that's the only MCU I have that can run MicroPython) and, being a hacked-together system, many features are not implemented (e.g. .data). But simple functions work fine and adding more shouldn't be too much effort. Porting to e.g. linux/x86_64 shouldn't be too difficult either. Maybe it's even possible to add some support for Arduino, as some people are probably more familiar with that C++ dialect and it opens possibilities for easily porting a wealth of Arduino libraries.

I've worked on this for about two days now and it turned out to be a bit more work than expected (mainly because of lacking documentation for Xtensa), so before I continue I'd like to know whether such a feature would be desirable.