Avoid InternalFileSystem corruption caused by simultaneous BLE operation #838

todd-herbert · 2025-01-17T13:18:18Z

Unexpected BLE disconnections (such as 0x8 BLE_HCI_CONNECTION_TIMEOUT) cause sd_flash_page_erase and sd_flash_write operations to fail. This failure is reported with NRF_EVT_FLASH_OPERATION_ERROR. Currently, the InternalFileSystem library doesn't detect these events.

This PR aims to detect NRF_EVT_FLASH_OPERATION_ERROR, allowing several reattempts of a failed write / erase operation.

I'm unsure whether this fix is overly crude, and am concerned that I may be missing some finer detail of the filesystem implementation. Of note is the change from a counting semaphore to a binary semaphore. Any input here would be greatly valued.
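As a rough illustration of the approach described above, here is a minimal host-side sketch of a retry loop around an asynchronous flash operation. It is not the code from this PR: `flash_evt_t`, `flash_op_fn`, `flash_op_with_retry`, `flaky_op`, and the value of `MAX_RETRY` are all hypothetical stand-ins, and the real implementation would wait on the SoftDevice's NRF_EVT_FLASH_OPERATION_SUCCESS / NRF_EVT_FLASH_OPERATION_ERROR events rather than calling a plain function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two SoftDevice flash events. */
typedef enum { EVT_FLASH_SUCCESS, EVT_FLASH_ERROR } flash_evt_t;

/* Assumed retry limit; the value used in the PR is not stated in this thread. */
#define MAX_RETRY 20

/* Hypothetical operation: starts one erase/write and returns the event
 * that was eventually reported for it. */
typedef flash_evt_t (*flash_op_fn)(void *ctx);

/* Repeat the operation until it reports success or the limit is reached.
 * attempts_out (optional) receives the number of attempts made. */
static bool flash_op_with_retry(flash_op_fn op, void *ctx, int *attempts_out)
{
  for (int attempt = 1; attempt <= MAX_RETRY; attempt++) {
    if (attempts_out != NULL) {
      *attempts_out = attempt;
    }
    if (op(ctx) == EVT_FLASH_SUCCESS) {
      return true;
    }
    /* Real firmware would likely delay here before trying again. */
  }
  return false;
}

/* Test double: reports an error a fixed number of times, then succeeds. */
typedef struct { int failures_left; } flaky_flash_t;

static flash_evt_t flaky_op(void *ctx)
{
  flaky_flash_t *flash = ctx;
  if (flash->failures_left > 0) {
    flash->failures_left--;
    return EVT_FLASH_ERROR;
  }
  return EVT_FLASH_SUCCESS;
}
```

With a `flaky_flash_t` that fails four times, the loop succeeds on the fifth attempt, which loosely mirrors the log later in this thread where several error events are followed by success events.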

henrygab

Wow! This was a tricky bug that you tracked down. To be clear, I don't have final say in this ... my advice here is free, and maybe you get what you pay for.

I think your solution is based on a solid read of the SDK. My comments primarily revolve around keeping the code easy to read / function naming / etc. I hope you find them useful.

libraries/InternalFileSytem/src/flash/flash_nrf5x.c

todd-herbert · 2025-01-19T10:10:18Z

@henrygab Thanks for the feedback! I'll get onto those changes tomorrow.

todd-herbert · 2025-01-20T15:13:40Z

630668a aims to respond to feedback from @henrygab. Hopefully I'm correctly interpreting what you're envisioning.

the event from the callback is now stored, rather than success / fail as a bool
wait_for_async_flash_op_completion method gathers result only. Delays / queuing now handled in fal_prog and fal_erase.

Please do let me know if you feel there's any room for further improvement here, or if some new logic flaw has appeared during the refactoring.

henrygab

Seems cleaner. Logic remains apparently sound. Nicely done. 🎉

One code style comment repeated a few times.

libraries/InternalFileSytem/src/flash/flash_nrf5x.c

esev · 2025-01-20T23:54:59Z

One thing I've been wondering about this PR: If it fails after retrying, what should the behavior be? From #325 (comment), I've seen this can lead to asserts in LittleFS. These don't necessarily happen right away and can happen some time later when LittleFS actually accesses one of the lost blocks.

I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

I've been debating this a bit myself, as it certainly isn't nice to crash if it can be avoided. But maybe it is useful in this case, to highlight the issue sooner, closer to the root cause of the failure?

Anyway, just adding this comment here in case there is a strong opinion one way or another. I'm happy with the current logic.

henrygab · 2025-01-21T00:46:06Z

... I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

Those are good questions to ask. If the corruption was guaranteed at this point, then you are right ... earlier is better, and here would prevent later-discovered inconsistencies, maybe even leave the file system in a valid state ... if never written to again.

What I'm not 100% sure of is whether a failed write is guaranteed to cause LFS corruption. If I understand correctly, LFS is generally designed to not trust that data was actually written, just because the write reported success. I have not dived into LFS internals for a while...

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?
Are there any situations where, with a similar configuration vs. physical flash, a write that fails would NOT cause LFS corruption?

Maybe, in addition to retries, since the flash is internal, changing the LFS configuration would be a worthwhile second PR for Adafruit's folks to consider? (of course, only if it would make LFS robust to failed writes)

later comments... and I'm not official reviewer.

todd-herbert · 2025-01-21T11:01:10Z

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

I have to be honest, @esev has looked into this in much more depth than me. I've only really narrowed in on this one particular BLE disconnection case.

I'm not actually sure which situations could trigger the loop to hit MAX_RETRY, but maybe that's the argument in favor of asserting in this situation: to uncover any elusive edge cases which could be better handled.

Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound that the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

geeksville · 2025-01-22T07:31:08Z

Btw I'm kinda afk for another week but I just have to say: great find Todd! Great work! (Sent from a phone - please ignore typos)


henrygab · 2025-01-23T00:08:39Z

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound that the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

I accept that reporting a failure, in at least some cases, will cause corruption in LFS.

If the assertion compiles to nothing in release builds, then sure ... this is a fine place to assert.

If intending release builds to lock up / crash...

Concern:

Cache layer does: Erase(physical == 32x logical sectors @ 128 bytes each)
Cache layer then re-writes the entire physical page
One of those writes fails

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.
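The concern above can be sketched with a small host-side simulation. It assumes, as the comment describes, that the cache layer erases a whole physical page and then rewrites it as 32 logical blocks of 128 bytes, and additionally assumes (hypothetically) that the rewrite stops at the first failed write. The names `flash_page`, `flush_page`, and `fail_at_block` are invented for illustration and are not part of the library.

```c
#include <stdint.h>
#include <string.h>

#define FLASH_PAGE_BYTES  4096u
#define LOGICAL_BLOCK     128u
#define BLOCKS_PER_PAGE   (FLASH_PAGE_BYTES / LOGICAL_BLOCK)  /* 32 */

/* Simulated physical flash page. */
static uint8_t flash_page[FLASH_PAGE_BYTES];

/* Erase the page, then rewrite it block by block from the cache.
 * fail_at_block simulates the first write that fails; pass
 * BLOCKS_PER_PAGE to simulate no failure.
 * Returns the number of logical blocks left in the erased state. */
static unsigned flush_page(const uint8_t *cache, unsigned fail_at_block)
{
  memset(flash_page, 0xFF, FLASH_PAGE_BYTES);  /* erase: all bytes 0xFF */

  unsigned written = 0;
  for (unsigned b = 0; b < BLOCKS_PER_PAGE; b++) {
    if (b == fail_at_block) {
      break;  /* assumption: this write and all later ones never land */
    }
    memcpy(&flash_page[b * LOGICAL_BLOCK], &cache[b * LOGICAL_BLOCK],
           LOGICAL_BLOCK);
    written++;
  }
  return BLOCKS_PER_PAGE - written;
}
```

In this model, a failure at block 10 leaves 22 logical blocks erased, including blocks the caller never intended to modify, which is why halting after the erase has already happened saves nothing.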

(+) The user experience is that the device simply hangs ... some device running tucked away in a hard-to-reach place now needs someone to go and pull the battery & power, or manually press the reset button (if exposed). Maybe ok, maybe not.

Because it's become clear that LFS was never designed to work as currently configured, I have little hope that any additional changes will improve the situation in a meaningful way. Bump the retry count to 100, double the delay every 20 attempts, and then crash if the operation fails.
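One possible reading of the suggested schedule (100 attempts, delay doubling every 20 attempts) is sketched below. The 1 ms base delay and the names `retry_delay_ms` and `worst_case_total_delay_ms` are assumptions made for illustration; nothing here is taken from the merged code.

```c
#include <stdint.h>

#define MAX_ATTEMPTS       100u
#define ATTEMPTS_PER_STEP  20u
#define BASE_DELAY_MS      1u   /* assumed starting delay */

/* Delay to apply after a failed attempt (attempt is 0-based).
 * The delay doubles once for every 20 attempts already made. */
static uint32_t retry_delay_ms(uint32_t attempt)
{
  return BASE_DELAY_MS << (attempt / ATTEMPTS_PER_STEP);
}

/* Total time spent delaying if every one of the attempts fails. */
static uint32_t worst_case_total_delay_ms(void)
{
  uint32_t total = 0;
  for (uint32_t attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    total += retry_delay_ms(attempt);
  }
  return total;
}
```

Under these assumptions the per-attempt delay steps through 1, 2, 4, 8, and 16 ms, and a fully failed operation spends 20 * (1 + 2 + 4 + 8 + 16) = 620 ms in delays before giving up.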

A correct fix...

A correct fix would be to update the LFS configuration to indicate the 4096-byte block size. LFS supports inline'd files, so small files can end up stored inline within a sector (not taking 4k per file). This would require testing, but something along the lines of changing:

Adafruit_nRF52_Arduino/libraries/InternalFileSytem/src/InternalFileSystem.cpp

Lines 34 to 35 in 4dcfa3b

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE)
#define LFS_BLOCK_SIZE        128

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE) 
#define LFS_BLOCK_SIZE        FLASH_NRF52_PAGE_SIZE // more correct

Adafruit_nRF52_Arduino/libraries/InternalFileSytem/src/InternalFileSystem.cpp

Lines 100 to 119 in 4dcfa3b

static struct lfs_config _InternalFSConfig =
{
  .context = NULL,
  .read = _internal_flash_read,
  .prog = _internal_flash_prog,
  .erase = _internal_flash_erase,
  .sync = _internal_flash_sync,
  .read_size = LFS_BLOCK_SIZE,
  .prog_size = LFS_BLOCK_SIZE,
  .block_size = LFS_BLOCK_SIZE,
  .block_count = LFS_FLASH_TOTAL_SIZE / LFS_BLOCK_SIZE,
  .lookahead = 128,
  .read_buffer = NULL,
  .prog_buffer = NULL,
  .lookahead_buffer = NULL,
  .file_buffer = NULL
};

  .read_size = 128, // ??? flash might support reading a single byte at a time...
  .prog_size = 128, // ??? smallest size of a write to the page ... see flash's datasheet, might be 512?
  .block_size = FLASH_NRF52_PAGE_SIZE, // This is the physical page size
  .block_count = LFS_FLASH_TOTAL_SIZE / FLASH_NRF52_PAGE_SIZE, // e.g., seven (7)

The above changes are an ESTIMATE / STRAWMAN and entirely UNTESTED, as they are intended for discussion only.
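To make the arithmetic behind the two configurations concrete, here is a small sketch that computes the block count and the number of LittleFS blocks sharing one physical page. It assumes FLASH_NRF52_PAGE_SIZE is 4096 bytes, as the comment above implies; `lfs_geometry_t` and `geometry_for` are hypothetical helpers, not part of LittleFS or the library.

```c
#include <stdint.h>

#define FLASH_NRF52_PAGE_SIZE  4096u  /* assumed physical page size */
#define LFS_FLASH_TOTAL_SIZE   (7u * FLASH_NRF52_PAGE_SIZE)

/* Hypothetical summary of a LittleFS block layout. */
typedef struct {
  uint32_t block_size;       /* bytes per LittleFS block */
  uint32_t block_count;      /* blocks in the whole filesystem region */
  uint32_t blocks_per_page;  /* LittleFS blocks sharing one erase unit */
} lfs_geometry_t;

/* Derive the layout for a given block size (must divide the page size). */
static lfs_geometry_t geometry_for(uint32_t block_size)
{
  lfs_geometry_t g;
  g.block_size = block_size;
  g.block_count = LFS_FLASH_TOTAL_SIZE / block_size;
  g.blocks_per_page = FLASH_NRF52_PAGE_SIZE / block_size;
  return g;
}
```

With the current 128-byte block size this gives 224 blocks at 32 per page; with the strawman 4096-byte block size it gives 7 blocks at 1 per page, so an erase never touches a block LittleFS did not ask to erase.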

todd-herbert · 2025-01-23T11:23:29Z

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

That makes sense to me.

Personally, I'd be inclined to leave this PR's scope targeting this one specific BLE disconnection issue. It seems like the fix is fairly non-controversial, and could be rolled out without too much fear of causing disruption.

The additional discussion going on here with further aims to improve the stability of InternalFileSystem is certainly very positive and not something I'd want to discourage though!

adafruit#838

hathach

Superb! Thank you very much for investigating and fixing this hard-to-reproduce issue. I am sure this will fix several confusing bugs when radio & flash are both highly active. Thank you @henrygab for the very thoughtful review, as usual.

I made an attempt to tidy up the code a bit; let me know if it looks OK and works for you all.

PS: it seems I couldn't push to the forked PR branch. @todd-herbert, could you either enable the permission for maintainers or just apply the code here
flash_nrf5x.c.txt

hathach · 2025-02-07T13:23:19Z

libraries/InternalFileSytem/src/flash/flash_nrf5x.c

-  {
-    _sem = xSemaphoreCreateCounting(10, 0);
+  if ( _sem == NULL ) {
+    _sem = xSemaphoreCreateBinary();


I had kind of forgotten, but I think sd_flash has an internal FIFO that lets us queue flash API calls. But I agree that using a binary semaphore would be better, since we wait and retry each call.

hathach · 2025-02-10T04:12:30Z

libraries/InternalFileSytem/src/flash/flash_nrf5x.c

+      break;
+    }
+    if (err == NRF_ERROR_BUSY) {
+      delay(1);


I think we should also delay when the flash operation fails, since the SoftDevice queue is probably full or the radio is too busy at the time.

adafruit#838 (review)

todd-herbert · 2025-02-13T08:26:00Z

Thanks for tidying it up! I've directly applied the changes from the linked text file, and it's still correctly handling BLE disconnection during flash write.

Log Output

DEBUG | 08:18:48 87 [Button] Opening /prefs/config.proto, fullAtomic=1
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[BLE   ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 0#
[GAP   ] Disconnect Reason: CONNECTION_TIMEOUT #
INFO  | 08:18:51 91 [Button] BLE Disconnected, reason = 0x8
DEBUG | 08:18:51 91 [Button] PhoneAPI::close()
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#

PS: it seems I couldn't push to the forked PR branch. @todd-herbert, could you either enable the permission for maintainers or just apply the code here

Ah you're not the first person I've heard this from actually, although I'm not sure why it seems to happen sometimes. The "Allow edits by maintainers" box is certainly ticked.

hathach

Perfect, thanks for re-testing the changes. I think your branch is forked from a fork, which complicates the GitHub PR. I think I could use the web editor but couldn't push directly to your fork. Anyway, it is all good now.

Reattempt failed flash operations

3ea7855

todd-herbert mentioned this pull request Jan 17, 2025

Reattempt failed flash operations meshtastic/Adafruit_nRF52_Arduino#1

Merged

esev mentioned this pull request Jan 19, 2025

nRF52832 frequently connect/disconnect occur assertion "head >= 2 && head <= lfs->cfg->block_count" #325

Closed

henrygab requested changes Jan 19, 2025

henrygab added the Bug label Jan 19, 2025

Refactor for maintainability

630668a

henrygab reviewed Jan 20, 2025

henrygab previously approved these changes Jan 21, 2025

braces

facf938

todd-herbert added a commit to todd-herbert/meshtastic-nrf52-arduino that referenced this pull request Jan 24, 2025

Align with pending upstream PR

92e936c

adafruit#838

hathach reviewed Feb 10, 2025

Apply patch from hathach

275da60

adafruit#838 (review)

hathach approved these changes Feb 13, 2025

hathach merged commit 53058a7 into adafruit:master Feb 13, 2025
8 checks passed
