-
Notifications
You must be signed in to change notification settings - Fork 165
Fix core panics: Don't use m_malloc and gc_malloc unintended #418
Conversation
MicroPython's "gc_alloc" and "m_malloc" functions should probably not be invoked from within C-level RTOS extensions. Otherwise, the garbage collector might want to prune memory which does not really hold MicroPython objects. This eventually leads to random core panics.
Hi @amotl . Thanks for the PR and the clear analysis! In some way, I am intuitively with you that micropython gc should probably not be used from "regular" C code. However, from looking so far, I can't yet see conclusively that it is wrong and will cause trouble. I haven't really understood the GC yet, but from all I read so far, it might just work to use it for "regular" C allocation. On the other hand I am not convinced that simply replacing the calls with normal malloc/free is correct in all cases. Are we sure this isn't leaking memory? I'm wondering whether some of the calls in the lfs code actually rely on GC. Also, there will be some other side effects like different parts of memory will be used. Ie, either memory inside the pool allocated for GC or outside. Both of which are limited resources. So, I don't want to switch this around without good reason. So, now I'm thinking, before spending more time on analyzing the impact of changing this, let's look again at the merit of doing it. You say things seem more stable to you. Can you elaborate? Is there some reproduction I can follow? Also you mention the hypothesis that resource pressure might lead to failed allocation and thus instability. Did I get you right? Do you have an idea how we can turn this into a reproduction? (like allocate a lot with gc_malloc in one thread and trigger a gc in another? Have you tried any such thing?) |
Dear Peter, thanks for following up on this.
While I can't provide a concise repro, I once had a memory layout in place, where I have been receiving core dumps like outlined below in a reproduceable manner - at most times when just invoking
Absolutely. It really depends on the alignment of data structures in memory, which is a mixture of different things like compiled bytecode vs. allocated heap memory from MicroPython data structures. This can be observed through When I had a certain state of a) the size of the code base accounting for a specific memory layout of byte-compiled Python code vs. b) characteristics of its runtime behavior through allocating heap memory on my workbench and while pushing code of our fairly large datalogger program through FTP while the code on the device was still being executed, I have been able to deterministically reproduce the core panics and mitigate them through the changes outlined within this PR.
While I am not 100% sure either, I am pretty much convinced that at least some of these mitigations are sound. If you follow the code, you will see that almost all the memory allocated is very much local to the C-level scope and used as intermediary memory for shuffling things around while not touching the MicroPython scope at all. Every allocation is complemented by an appropriate call to
I hear you. However, I believe the wrong kind of memory is being allocated within the utility functions touched within this PR, as none of these are using it for allocating MicroPython objects at all. We are running our custom Dragonfly and Squirrel builds since Nov' 2019 with these mitigations and have not been disappointed yet. As a consequence thereof, we tried to channel them to other members of the community observing similar problems and it actually seemed to help in a few cases already. Also, @robert-hh has da56836 in his homebrew branch. While I don't know yet about any improvements he might have observed, I believe it didn't do any harm either.
I had very much the same idea to produce a minimal and concise repro, but have not been able to work on that yet. However, I tried to get in touch with @Xykon to get some support for this. I believe it will comfort many people out there observing spurious problems even when uploading smaller code bases using Pymakr and when accessing the filesystem or NVRAM at runtime. With kind regards,
|
Some file use problems and core dumps when using ftp and littlefs file system definitely disappeared when using the suggested changes to lfs_util.h. Actually I just replaced gc_malloc() by malloc(). Using FAT these problem also were not present. So the actual state looks definitely like a BUG. It should be mentioned that the reference version of MicroPython at micropython.org uses simply malloc() and free() at this place. |
So I looked through all cases mentioned in the PR above. Here are my comments:
|
For a good reason with respect to ftp.c, and maybe others too. See #443 |
Dear Pycom team,
we investigated some random core panics we have been observing when working on our Terkin datalogger based on the FiPy. More details about this can be found within [1].
We will be happy if you could review this change. While we are not 100% sure we did the changes in the right manner, we are pretty much confident it improves accessing the filesystem at runtime through the FTP server and the LittleFS filesystem. We already verified this on our own devices and with other members of the community.
With kind regards,
Andreas.
[1] https://community.hiveeyes.org/t/dont-invoke-micropythons-m-malloc-within-c-level-rtos-extensions/2743