-
Notifications
You must be signed in to change notification settings - Fork 165
Core dump during FTP file transfer, caused by a heap lock. #443
Comments
@robert-hh thanks for the detailed analysis and thanks for your (hopefully) patience. The GC alloc problem is being worked on right now. @salal-m please take this repro into account! |
@robert-hh we have incorporated your suggestions with our changes. These solve the memory related crashes but I was not able to produce the scenario where heap is locked. Could you please verify if the attached dev build is stable in your scenario as well? Thanks. |
Yes, this is hard to trigger, when you want to. I will test it. I do not know what exactly you changed. Below is the function as I have it now, using malloc. At least for the m_malloc() exception, it should be possible to use it, when before the call a return frame is pushed with nlr_push() (and of course later removed with nlr_pop(). But looking at all other places in ftp.c where memory is allocated, it is consistent NOT to use m_malloc().
|
You are right, m_malloc() should not be used here. Even if the crash doesn't occur due to locked heap, there are instances of crashes due to invalid memory access (which most likely happens because GC collects that memory). |
Dang. I was optimistic, since I ran my old test for an hour and it worked fine. Then I started a different test, and that crashed. Since I do not have the elf file of your image, I cannot decode the core dump. So you just get the numbers. The test setup is simple. Use Filezilla to upload a bunch of files. In my case it were ~100 of different sizes. Then I run in REPL, connected through USB/UART:
Core dump:
|
The decoded stack from this is: ==================== CURRENT THREAD STACK ===================== The .elf file is: So, the crash is happening at another point now. For some reason I am not able to reproduce this one either on LOPY4. I transfer ~120 small files (average size 1KB) and then run the above code through REPL. It keeps printing the filenames without any issue. I even tried doing FTP transfer while this was running, still no crash. Does the crash happen immediately and is it consistent? |
I will repeat the tests. And you have to do ftp file transfers while the listdir loop is running. |
One question about the changes: did you also change the definition of lfs_malloc in lfs_util.h? There m_malloc() was used, and lfs uses that for the allocation of a per-file cache buffer. |
Yes. That has also been changed to use malloc(). |
In the absence of your elf file, I had repeated the test with my local build, and could not produce the error again. Now it is running with your firmware again, and stable until now. |
Looking at the decoded stack from teh core dump, things are definitely wrong. The code compares data, and both pointer should point to RAM addresses. But one of them does not, for whatever reason. |
Were you able to reproduce this issue? |
I tried quite a while with both firmware versions, but could not replicate it. That's kind of sad. But a core dump which happens repeatable but only once in a while is not much better. |
Hi there, I'm glad to see you are working on this topic and will be happy to review the current state against #418. We are using FTP intensively with our Squirrel builds [1] and have not seen any issues so far. @salal-m and @peter-pycom: May I humbly ask which toolchain you are using to build the firmware images? I believe it is important to use 1.22.0-98 in order to mitigate random memory corruption issues on Espressif ESP32's rev.1 and rev.2 silicon [2,3]. Maybe you are already using rev.3? With kind regards, [1] https://community.hiveeyes.org/t/squirrel-firmware-for-pycom-esp32/2960 P.S.: When already working on this, please also take #413 into consideration. While this won't raise any core dumps, it would be cool to see it integrated within the upcoming release candidate as well. |
I should not that all kudos on this topic go to @amotl, who was the first to mention that and his work for fixing it. After we talked about it, I started to look for the malloc() uses and changed it at two places, ftp.c and lfs_util.h. @amotl male later his PR #418 with a more extensive list of potentially affected places. |
Uh oh!
There was an error while loading. Please reload this page.
Pycom MicroPython 1.20.2.rc7 [v1.20.1.r2-142-g6d01270-dirty] on 2020-05-10; LoPy4 with ESP32
This is the "a smoking gun" for a fault, which made me hate LittleFS. It happens when I do other activities in REPL or an active Python script during or shortly after making FTP file transfers. Replacing m_malloc() by malloc() in lfs_util.h solved the problem, and now I found the reason, but at another place than assumed. Below is the backtrace and the core dump. The reason is, that lfs tries to allocate memory for a buffer, but the heap is locked at that very moment. m_malloc() then raises an execption, which causes the core dump, since is happens outside the MP process. Since malloc() does not use the heap, the problem does not occur. This actual event happend in ftp.c which is calling m_malloc here, but at that place m_ malloc was also identified as critical by @amotl. It may be the general problem, that background processes must not use m_malloc.
Note: in my modified ftp.c I had replaced m_malloc() since long my malloc(). Still there is the omission in ftp.c, that malloc() is not checked for returning NULL. m_malloc() did not need that, because it would raise an exception, which for the ftpserver leads to a crash.
The text was updated successfully, but these errors were encountered: