8000 Fix core panics: Don't use m_malloc and gc_malloc unintended by amotl · Pull Request #418 · pycom/pycom-micropython-sigfox · GitHub
[go: up one dir, main page]

Skip to content
This repository was archived by the owner on Sep 16, 2024. It is now read-only.

Fix core panics: Don't use m_malloc and gc_malloc unintended #418

Merged
merged 1 commit into from
Jun 10, 2020

Conversation

amotl
Copy link
Contributor
@amotl amotl commented Mar 5, 2020

Dear Pycom team,

we investigated some random core panics we have been observing when working on our Terkin datalogger based on the FiPy. More details about this can be found within [1].

We will be happy if you could review this change. While we are not 100% sure we did the changes in the right manner, we are pretty much confident it improves accessing the filesystem at runtime through the FTP server and the LittleFS filesystem. We already verified this on our own devices and with other members of the community.

With kind regards,
Andreas.

[1] https://community.hiveeyes.org/t/dont-invoke-micropythons-m-malloc-within-c-level-rtos-extensions/2743

MicroPython's "gc_alloc" and "m_malloc" functions should
probably not be invoked from within C-level RTOS extensions.
Otherwise, the garbage collector might want to prune memory
which does not really hold MicroPython objects.
This eventually leads to random core panics.
@peter-pycom
Copy link
Contributor

Hi @amotl . Thanks for the PR and the clear analysis!

In some way, I am intuitively with you that micropython gc should probably not be used from "regular" C code.

However, from looking so far, I can't yet see conclusively that it is wrong and will cause trouble. I haven't really understood the GC yet, but from all I read so far, it might just work to use it for "regular" C allocation. On the other hand I am not convinced that simply replacing the calls with normal malloc/free is correct in all cases. Are we sure this isn't leaking memory? I'm wondering whether some of the calls in the lfs code actually rely on GC. Also, there will be some other side effects like different parts of memory will be used. Ie, either memory inside the pool allocated for GC or outside. Both of which are limited resources. So, I don't want to switch this around without good reason.

So, now I'm thinking, before spending more time on analyzing the impact of changing this, let's look again at the merit of doing it. You say things seem more stable to you. Can you elaborate? Is there some reproduction I can follow? Also you mention the hypothesis that resource pressure might lead to failed allocation and thus instability. Did I get you right? Do you have an idea how we can turn this into a reproduction? (like allocate a lot with gc_malloc in one thread and trigger a gc in another? Have you tried any such thing?)

@amotl
Copy link
Contributor Author
amotl commented Mar 16, 2020

Dear Peter,

thanks for following up on this.

You say things seem more stable to you. Can you elaborate? Is there some reproduction I can follow?

While I can't provide a concise repro, I once had a memory layout in place, where I have been receiving core dumps like outlined below in a reproduceable manner - at most times when just invoking gc.collect(), to make it a bit more deterministic instead of just hitting the GC-run indeterministically.

Also you mention the hypothesis that resource pressure might lead to failed allocation and thus instability. Did I get you right?

Absolutely. It really depends on the alignment of data structures in memory, which is a mixture of different things like compiled bytecode vs. allocated heap memory from MicroPython data structures. This can be observed through micropython.mem_info(True), which prints a table of heap utilisation, see also utility functions to report on memory allocation.

When I had a certain state of a) the size of the code base accounting for a specific memory layout of byte-compiled Python code vs. b) characteristics of its runtime behavior through allocating heap memory on my workbench and while pushing code of our fairly large datalogger program through FTP while the code on the device was still being executed, I have been able to deterministically reproduce the core panics and mitigate them through the changes outlined within this PR.

On the other hand I am not convinced that simply replacing the calls with normal malloc/free is correct in all cases. Are we sure this isn't leaking memory?

While I am not 100% sure either, I am pretty much convinced that at least some of these mitigations are sound. If you follow the code, you will see that almost all the memory allocated is very much local to the C-level scope and used as intermediary memory for shuffling things around while not touching the MicroPython scope at all. Every allocation is complemented by an appropriate call to free(), so that should be safe.

I'm wondering whether some of the calls in the LittleFS code actually rely on GC.
Also, there will be some other side effects like different parts of memory will be used. Ie, either memory inside the pool allocated for GC or outside. Both of which are limited resources. So, I don't want to switch this around without good reason.

I hear you. However, I believe the wrong kind of memory is being allocated within the utility functions touched within this PR, as none of these are using it for allocating MicroPython objects at all.

We are running our custom Dragonfly and Squirrel builds since Nov' 2019 with these mitigations and have not been disappointed yet. As a consequence thereof, we tried to channel them to other members of the community observing similar problems and it actually seemed to help in a few cases already. Also, @robert-hh has da56836 in his homebrew branch. While I don't know yet about any improvements he might have observed, I believe it didn't do any harm either.

Do you have an idea how we can turn this into a reproduction? [I am thinking of] allocating a lot of memory with gc_malloc in one thread and invoke the garbage collector in another. Have you tried any such thing?

I had very much the same idea to produce a minimal and concise repro, but have not been able to work on that yet. However, I tried to get in touch with @Xykon to get some support for this. I believe it will comfort many people out there observing spurious problems even when uploading smaller code bases using Pymakr and when accessing the filesystem or NVRAM at runtime.

With kind regards,
Andreas.


#0  nlr_jump (val=0x3ffc92e8 <mp_state_ctx+24>) at ../py/nlrsetjmp.c:33
#1  0x400fe5ae in mp_raise_msg (exc_type=0x3f40ea14 <mp_type_MemoryError>, msg=0x3f40dc67 \"memory allocation failed, heap is locked\") at ../py/runtime.c:1465
#2  0x400fe5c5 in m_malloc_fail (num_bytes=28) at ../py/runtime.c:1454
#3  0x400f65ed in m_malloc (num_bytes=28) at ../py/malloc.c:88
#4  0x400f4fa7 in f_readdir_helper (fno=0x3ffdbeb0, dp=0x3ffc9204 <ftp_data+8>) at ftp/ftp.c:341
#5  0x400f51ff in ftp_list_dir (list=0x3ffdd960 <error: Cannot access memory at address 0x3ffdd960>, listsize=0x3ffdc310, maxlistsize=512) at ftp/ftp.c:1455
#6  0x400f5ee1 in ftp_run () at ftp/ftp.c:672
#7  0x400e08d3 in TASK_Servers (pvParameters=<optimized out>) at serverstask.c:105
#0  nlr_jump (val=0x3ffc92e8 <mp_state_ctx+24>) at ../py/nlrsetjmp.c:33
#1  0x400fe5c2 in mp_raise_msg (exc_type=0x3f40ea20 <mp_type_MemoryError>, msg=0x3f40dc73 \"memory allocation failed, heap is locked\") at ../py/runtime.c:1465
#2  0x400fe5d9 in m_malloc_fail (num_bytes=12) at ../py/runtime.c:1454
#3  0x400f6601 in m_malloc (num_bytes=12) at ../py/malloc.c:88
#4  0x400f3553 in littlefs_prepare_attributes (cfg=0x3ffc9258 <ftp_data+92>) at littlefs/vfs_littlefs.c:357
#5  0x400f35a7 in littlefs_open_common_helper (lfs=0x3ffc46a8 <sflash_vfs_flash+56>, file_path=0x3ffddb82 <error: Cannot access memory at address 0x3ffddb82>, fp=0x3ffc9204 <ftp_data+8>, mode=1282, cfg=0x3ffc9258 <ftp_data+92>, timestamp_update_ptr=0x3ffc9264 <ftp_data+104>) at littlefs/vfs_littlefs.c:283
#6  0x400f4ee8 in f_open_helper (fp=0x3ffc9204 <ftp_data+8>, mode=10 '\
', path=<optimized out>) at ftp/ftp.c:242
#7  ftp_open_file (path=<optimized out>, mode=10) at ftp/ftp.c:1392
#8  0x400f59db in ftp_process_cmd () at ftp/ftp.c:1186
#9  0x400f5ee3 in ftp_run () at ftp/ftp.c:659
#10 0x400e08d3 in TASK_Servers (pvParameters=<optimized out>) at serverstask.c:105

-- https://community.hiveeyes.org/t/dont-invoke-micropythons-m-malloc-within-c-level-rtos-extensions/2743 ff.

@robert-hh
8000 Copy link
Contributor
robert-hh commented Mar 29, 2020

Some file use problems and core dumps when using ftp and littlefs file system definitely disappeared when using the suggested changes to lfs_util.h. Actually I just replaced gc_malloc() by malloc(). Using FAT these problem also were not present. So the actual state looks definitely like a BUG. It should be mentioned that the reference version of MicroPython at micropython.org uses simply malloc() and free() at this place.
Edit: m_malloc() and m_free() use either malloc() and free() or gc_alloc() and gc_free(), depening on the flag MICROPY_ENABLE_GC. Here, gc_alloc() and gc_free() are used.
Edit2: If the memory provided by malloc() is to be used inside a function only, alloca() may be used too. Then freeing is not required.

@robert-hh
Copy link
Contributor
robert-hh commented Mar 29, 2020

So I looked through all cases mentioned in the PR above. Here are my comments:

ftp.c: Agree on the change. Allocation is purely function-local with no
       locks to protect gc_alloc()
lfs_util.h: Definitely good. After that change, my core dumps during 
            ftp use and strange file errors went away.
vfs_littlefs.c: In general I agree on the change. The allocation of memory is purely local.
                But I had no problems with the existing code, but some board
                freezing when using malloc/free. Maybe causality, maybe not.
machpwm.c: The use of gc_malloc seems ok here. But I did not fully trace it (yet).
machrmt.c: I agree to the change. The allocated buffer is only use locally 
           or in a function called by it. No need to keep the data.
machtimer_alarm.c: Use of gc_alloc() seems inappropriate, since it is called
                   by mptask.c once. The memory is not used by the Python VM.
                   So yes. The change looks OK.
modcoap.c: It looks to me as if the allocated memory is used beyond the scope
           of the function where it is allocated, but not in the Python VM.
modmdns.c: m_malloc (=gc_alloc) looks not ok. The data is used beyond the scope of
           the function, but in a espressif lib call, which is definitely not the
           Python VM.
modpycom.c The allocated object is use in a very tight context, so malloc/free
           seems appropriate. ANOTHER BUG HERE: m_free() is called twice with the
           same object (lines 312 and 314), but the result of m_malloc is not checked
           for validity. Maybe that is done by gc_alloc.
moducrypto.c: The allocated memory is just used locally. The object returned to
              the VM copies the content. So m_malloc is not needed, but the worst
              that could happend is wrong data sneaking in.
mptask.c:  The data is allocated here and later used in vfs_littlefs. Around line 134 
           it is free'd and new space is allocated. The use of alloc/free seems ok,
           And I could not tell yet and drawback of using gc_alloc(). I would expect
           strange values for the cwd.
mpthreadport.c: The lock is allocated and used in a Python object. So using gc_alloc
                looks fine.

@robert-hh
Copy link
Contributor
robert-hh commented May 10, 2020

For a good reason with respect to ftp.c, and maybe others too. See #443

@amotl
Copy link
Contributor Author
amotl commented Jun 10, 2020

Thanks for merging this and also thanks for the adjustments within 8495c7d, 6cb700a, 669dfe2 and 8e08fab, @salal-m.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0