8000 Avoid InternalFileSystem corruption caused by simultaneous BLE operation by todd-herbert · Pull Request #838 · adafruit/Adafruit_nRF52_Arduino · GitHub
[go: up one dir, main page]

Skip to content

Avoid InternalFileSystem corruption caused by simultaneous BLE operation #838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 13, 2025

Conversation

todd-herbert
Copy link
Contributor

Unexpected BLE disconnections (such as 0x8 BLE_HCI_CONNECTION_TIMEOUT) cause sd_flash_page_erase and sd_flash_write operations to fail. This failure is reported with NRF_EVT_FLASH_OPERATION_ERROR. Currently, the InternalFileSystem library doesn't detect these events.

This PR aims to detect NRF_EVT_FLASH_OPERATION_ERROR, allowing several reattempts of a failed write / erase operation.

I'm unsure whether this fix is overly crude, and am concerned that I may be missing some finer detail of the filesystem implementation. Of note is the change from a counting semaphore to a binary semaphore. Any input here would be greatly valued.

Copy link
Collaborator
@henrygab henrygab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! This was a tricky bug that you tracked down. To be clear, I don't have final say in this ... my advice here is free, and maybe you get what you pay for.

I think your solution is based on solid read of the SDK. My comments primarily revolve around keeping the code easy to read / function naming / etc. I hope you find it useful.

@todd-herbert
Copy link
Contributor Author

@henrygab Thanks for the feedback! I'll get onto those changes tomorrow.

@henrygab henrygab added the Bug label Jan 19, 2025
@todd-herbert
Copy link
Contributor Author

630668a aims to respond to feedback from @henrygab. Hopefully I'm correctly interpreting what you're envisioning.

  • the event from the callback is now stored, rather than success / fail as a bool
  • wait_for_async_flash_op_completion method gathers result only. Delays / queuing now handled in fal_prog and fal_erase.

Please do let me know if you feel there's any room for further improvement here, or if some new logic flaw has appeared during the refactoring.

8000

Copy link
Collaborator
@henrygab henrygab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems cleaner. Logic remains apparently sound. Nicely done. 🎉

One code style comment repeated a few times.

@esev
Copy link
esev commented Jan 20, 2025

One thing I've been wondering about this PR: If it fails after retrying, what should the behavior be? From #325 (comment), I've seen this can lead to asserts in LittleFS. These don't necessarily happen right away and can happen some time later when LittleFS actually accesses one of the lost blocks.

I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

I've been debating this a bit myself, as it certainly isn't nice to crash if it can be avoided. But maybe it is useful in this case, to highlight the issue sooner, closer to the root cause of the failure?

Anyway, just adding this comment here in case there is a strong opinion one way or another. I'm happy with the current logic.

@henrygab
Copy link
Collaborator

... I've been wondering if the code here should assert if the retries fail. That calls attention to a possible FS corruption issue sooner, at the point where we've first detected a failure.

Those are good questions to ask. If the corruption was guaranteed at this point, then you are right ... earlier is better, and here would prevent later-discovered inconsistencies, maybe even leave the file system in a valid state ... if never written to again.

What I'm not 100% sure of is whether a failed write is guaranteed to cause LFS corruption. If I understand correctly, LFS is generally designed to not trust that data was actually written, just because the write reported success. I have not dived into LFS internals for a while...

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

  1. Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?
  2. Are there any situations where, with a similar configuration vs. physical flash, a write that fails would NOT cause LFS corruption?

Maybe, in addition to retries, since the flash is internal, changing the LFS configuration would be a worthwhile second PR for Adafruit's folks to consider? (of course, only if it would make LFS robust to failed writes)

henrygab
henrygab previously approved these changes Jan 21, 2025
@henrygab henrygab dismissed their stale review January 21, 2025 00:48

later comments... and I'm not official reviewer.

@todd-herbert
Copy link
Contributor Author
todd-herbert commented Jan 21, 2025

@todd-herbert ... questions for you, as you're the one who most recently dived deep....

I have to be honest, @esev has looked into this in much more depth than me. I've only really narrowed in on this one particular BLE disconnection case.

I'm not actually sure which situations could trigger the loop to hit MAX_RETRY, but maybe that's the argument in favor of asserting in this situation: to uncover any elusive edge cases which could be better handled.

Is the corruption essentially an edge case caused by the configuration choice (vs. the physical flash properties)?

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound that the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

@geeksville
Copy link
Contributor
geeksville commented Jan 22, 2025 via email

@henrygab
Copy link
Collaborator

I'm no expert in the area, but reading @geeksville's thoughts in meshtastic/firmware#4447, it does sound that the "32 LittleFS blocks per page" situation creates opportunities for corruption to occur.

I accept that reporting a failure, in at least some cases, will cause corruption in LFS.

If the assertion compiles to nothing in release builds, then sure ... this is a fine place to assert.

If intending release builds to lock up / crash...

Concern:

  • Cache layer does: Erase(physical == 32x logical sectors @ 128 bytes each)
  • Cache layer then re-writes the entire physical page
  • One of those writes fails

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

(+) The user experience is that the device simply hangs ... some device running tucked away in a hard-to-reach place now needs someone to go an pull the battery & power, or manually press the reset button (if exposed). Maybe ok, maybe not.

Because it's become clear that LFS was never designed to work as currently configured, I have little hope that any additional changes will improve the situation in a meaningful way. Bump the retry count to 100, double the delay every 20 attempts, and then crash if the operation fails.

A correct fix...

A correct fix would be to update the LFS configuration to indicate the 4096-byte block size. LFS supports inline'd files, so small files can end up stored inline within a sector (not taking 4k per file). This would require testing, but something along the lines of changing:

#define LFS_FLASH_TOTAL_SIZE (7*FLASH_NRF52_PAGE_SIZE)
#define LFS_BLOCK_SIZE 128

#define LFS_FLASH_TOTAL_SIZE  (7*FLASH_NRF52_PAGE_SIZE) 
#define LFS_BLOCK_SIZE        FLASH_NRF52_PAGE_SIZE // more correct

static struct lfs_config _InternalFSConfig =
{
.context = NULL,
.read = _internal_flash_read,
.prog = _internal_flash_prog,
.erase = _internal_flash_erase,
.sync = _internal_flash_sync,
.read_size = LFS_BLOCK_SIZE,
.prog_size = LFS_BLOCK_SIZE,
.block_size = LFS_BLOCK_SIZE,
.block_count = LFS_FLASH_TOTAL_SIZE / LFS_BLOCK_SIZE,
.lookahead = 128,
.read_buffer = NULL,
.prog_buffer = NULL,
.lookahead_buffer = NULL,
.file_buffer = NULL
};

   .read_size = 128; // ??? flash might support reading single byte at a time... 
   .prog_size = 128; // ??? smallest size of a write to the page ... see flash's datasheet, might be 512?
   .block_size = FLASH_NRF52_PAGE_SIZE; //  This is the physical page size
   .block_count = LFS_FLASH_TOTAL_SIZE / FLASH_NRF52_PAGE_SIZE; // e.g., seven (7)

The above changes are an ESTIMATE / STRAWMAN and entirely UNTESTED, as they are intended for discussion only.

@todd-herbert
Copy link
Contributor Author

I would not lock up on retail. The erase has already been attempted, so nothing is saved as a result.

That makes sense to me.


Personally, I'd be inclined to leave this PRs scope targeting this one specific BLE disconnection issue. It seems like the fix is fairly non-controversial, and could be rolled out without too much fear of causing disruption.

The additional discussion going on here with further aims to improve the stability of InternalFileSystem is certainly very positive and not something I'd want to discourage though!

todd-herbert added a commit to todd-herbert/meshtastic-nrf52-arduino that referenced this pull request Jan 24, 2025
Copy link
Member
@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superb ! Thank you very much for investigating and fixing the hard-to-reproduce issue. I am sure this will fix several confusing bug when radio & flash are both higly active. Thank you @henrygab for very thoughful review as usual.

I made an attempt to tidy up the code a bit, let me knoww if it looks OK and work for you all.

PS: seems like I couldn't push to fokred PR branch, @todd-herbert could you either enable the perssion for maintainer or just apply the code here
flash_nrf5x.c.txt

{
_sem = xSemaphoreCreateCounting(10, 0);
if ( _sem == NULL ) {
_sem = xSemaphoreCreateBinary();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of forgot, but I think sd_flash has an internal FIFO that we can queue flashing API. But I agree using binary would be better since we wait and retry each call

break;
}
if (err == NRF_ERROR_BUSY) {
delay(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also delay in case on FLASH OP failed as well since the sd stack queue is probably full or radio is too busy at the time.

@todd-herbert
Copy link
Contributor Author

Thanks for tidying it up! I've directly applied the changes from the linked text file, and it's still correctly handling BLE disconnection during flash write.

Log Output
DEBUG | 08:18:48 87 [Button] Opening /prefs/config.proto, fullAtomic=1
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[BLE   ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 0#
[GAP   ] Disconnect Reason: CONNECTION_TIMEOUT #
INFO  | 08:18:51 91 [Button] BLE Disconnected, reason = 0x8
DEBUG | 08:18:51 91 [Button] PhoneAPI::close()
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#

PS: seems like I couldn't push to fokred PR branch, @todd-herbert could you either enable the perssion for maintainer or just apply the code here

Ah you're not the first person I've heard this from actually, although I'm not sure why it seems to happen sometimes. The "Allow edits by maintainers" box is certainly ticked.

Copy link
Member
@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, thanks for re-testing the changes. I think your branch is forked from a forked, which complicate gh PR. I think I could use the web editor but couldn't push directly to your fork. Anyway, it is all good now.

@hathach hathach merged commit 53058a7 into adafruit:master Feb 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0