v3-alpha: Rbyds and B-trees and gcksums, oh my! by geky · Pull Request #1111 · littlefs-project/littlefs · GitHub
Open · geky wants to merge 1,589 commits into master

@geky (Member) commented May 29, 2025
                  ####                                 
               ######## #####    #####                 
             ################# ########  ####          
       #### ### ############################## ####    
     #############-#### ##_#########################   
    ########'#########_\'|######_####################  
     ##### ##########_.. \ .'#_._#### #/#_##########   
   ################_    \ v .'_____#_.'.'_ ########### 
  ################_ "'. |  .""  _________ '"-##########
  ########"#####..--.  /    .-"'|  | |   '"-####'######
    #############    \     /  .---.|_|_    ########### 
     ###########     |    /  -|        |-   ########   
                     )    |  -|littlefs|-              
                     |    |  -|   v3   |-              
                     |    |   '--------'               
                     |   ||     ' ' '                  
       -------------/-/---\\----------------------     

Table of contents ^

  1. Hello!
  2. Wait, a disk breaking change?
  3. What's new?
    1. Implemented
      1. Efficient metadata compaction
      2. Efficient random writes
      3. Better logging: No more sync-padding issues
      4. Efficient inline files, no more RAM constraints
      5. Independent file caches
      6. Easier logging APIs: lfs3_file_fruncate
      7. Sparse files
      8. Efficient file name lookup
      9. A simpler/more robust metadata tree
      10. A well-defined sync model
      11. Stickynotes, no more 0-sized files
      12. A new and improved compat flag system
      13. Error detection! - Global-checksums
      14. Better traversal APIs
      15. Incremental GC
      16. Better recovery from runtime errors
      17. Standard custom attributes
      18. More tests!
    2. Planned
      1. Efficient block allocation, via optional on-disk block-map (bmap)
      2. Bad block tracking
      3. Pre-erased block tracking
      4. Error correction! - Metadata redundancy
      5. Error correction! - Data redundancy
      6. Transparent block deduplication
    3. Stretch goals
      1. 16-bit and 64-bit variants
      2. Config API rework
      3. Block device API rework
      4. Custom attr API rework
      5. Alternative (cheaper) write-strategies
      6. Advanced file tree operations
      7. Advanced file copy-on-write operations
      8. Simple key-value APIs
      9. Reserved blocks to prevent CoW lockups
      10. Metadata checks to prevent metadata lockups
      11. Integrated block-level ECC
      12. Disk-level RAID
    4. Out-of-scope (for now)
      1. Alternative checksums
      2. Feature-limited configurations for smaller code/stack sizes
      3. lfs3_file_openat for dir-relative APIs
      4. lfs3_file_openn for non-null-terminated-string APIs
      5. Transparent compression
      6. Filesystem shrinking
      7. High-level caches
      8. Symbolic links
      9. 100% line/branch coverage
  4. Code/stack size
    1. Runtime error recovery
    2. B-tree flexibility
    3. Traversal inversion
  5. Benchmarks
    1. Simulated benchmarks
      1. Linear writes
      2. Random writes
      3. Logging
  6. Funding
  7. Next steps

Hello! ^

Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.

There are still several chunks of planned work left, but now that this branch has reached feature parity with v2, there's nothing really stopping it from being merged eventually.

So I figured it's a good time to start calling this v3, and put together a public roadmap.

NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE

Here's a quick TODO list of planned work before stabilization. More details below:

  • Test framework rework (merged, mostly)
  • Rbyds
  • B-trees
  • M-tree
  • B-shrubs
  • Fruncate
  • Sync model rework
  • Stickynotes
  • Traversal API rework
  • Incremental GC
  • Global-checksums
  • On-disk block-map (in progress)
  • Metadata redundancy
  • Data redundancy
  • Document, document, document
  • Stretch goals

This work may continue to break the on-disk format.

That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.

To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When it is eventually released, v3 will reject this version and fail to mount.

Unfortunately, the API will be under heavy flux during this period.

A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.

Wait, a disk breaking change? ^

Yes. v3 breaks disk compatibility from v2.

I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:

  1. Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.

  2. Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.

    Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.

So I think it's best for both littlefs as a project and long-term users to break things here.

Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.

What's new ^

Ok, with that out of the way, what does breaking everything actually get us?

Implemented: ^

  • Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^

    v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.

    This speeds up most metadata operations, including metadata lookup ( $O(n) \rightarrow O(\log n)$ ), and, critically, metadata compaction ( $O(n^2) \rightarrow O(n \log n)$ ).

    This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.

  • Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^

    A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.

    This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.

    For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ( $O(n) \rightarrow O(\log n)$ ).

  • Better logging: No more sync-padding issues ^

    v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.

    In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.

  • Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^

    In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.

    Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.

  • Independent file caches ^

    littlefs's pcache, rcache, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.

  • Easier logging APIs: lfs3_file_fruncate ^

    Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new lfs3_file_fruncate API.

    Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:

    lfs3_file_write(&lfs, &file, entry, entry_size) => entry_size;
    lfs3_file_fruncate(&lfs, &file, log_size) => 0;
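    Expanded slightly, a bounded log might look like the following sketch. Note this is only a sketch: the v3 API is still in flux, and everything beyond lfs3_file_write/lfs3_file_fruncate themselves (the types, lfs3_file_sync's involvement, error conventions) is assumed here by analogy with v2:

    ```c
    // Hedged sketch of a bounded log/FIFO built on lfs3_file_fruncate.
    // Types and signatures are assumptions based on v2's API.
    static int log_append(lfs3_t *lfs, lfs3_file_t *file,
            const void *entry, lfs3_size_t entry_size,
            lfs3_off_t log_size) {
        // append the new entry to the end of the log
        lfs3_ssize_t d = lfs3_file_write(lfs, file, entry, entry_size);
        if (d < 0) {
            return (int)d;
        }

        // drop the oldest entries so the log never exceeds log_size,
        // truncating from the *front* of the file
        int err = lfs3_file_fruncate(lfs, file, log_size);
        if (err) {
            return err;
        }

        // make the new entry persistent
        return lfs3_file_sync(lfs, file);
    }
    ```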
  • Sparse files ^

    Another advantage of adopting B-trees: littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.

    Currently this is limited to a couple operations:

    • lfs3_file_truncate
    • lfs3_file_fruncate
    • lfs3_file_seek + lfs3_file_write past the end of the file

    But more advanced hole operations may be added in the future.
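    As a rough sketch of the third case above (hedged: LFS3_SEEK_SET and the exact signatures are assumptions by analogy with v2's LFS_SEEK_SET):

    ```c
    // create a ~1 MiB hole: the run of zeros is implied, not stored
    lfs3_file_seek(lfs, &file, 1024*1024, LFS3_SEEK_SET);
    lfs3_file_write(lfs, &file, "hi", 2);
    // the file now reports a size of 1 MiB + 2 bytes, but only the
    // 2 bytes (plus metadata) actually take up disk space
    ```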

  • Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^

    littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.

  • A simpler/more robust metadata tree ^

    As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.

    I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.

    In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.

  • A well-defined sync model ^

    One interesting thing about littlefs: it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.

    To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.

    I think this discussion captures most of the idea, but for a high-level overview:

    1. Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.

    2. Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.

    3. Files can be desynced, either explicitly via lfs3_file_desync, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.

    4. Calling lfs3_file_sync on a desynced file will atomically update the on-disk state, any other in-sync file handles, and mark the file as in-sync again.

    5. Calling lfs3_file_resync on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
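    A hedged sketch of these rules in action (only the function names come from the text above; the signatures are assumptions):

    ```c
    // writes are CoW: a second in-sync handle on the same file still
    // sees the last-synced snapshot at this point (rule 1)
    lfs3_file_write(lfs, &writer, buf, size);

    // sync atomically updates the on-disk state *and* any other
    // in-sync handles, which then see the new data (rule 2)
    int err = lfs3_file_sync(lfs, &writer);
    if (err) {
        // on error the file is desynced: it no longer receives sync
        // broadcasts, and closing it won't touch disk (rule 3)
        //
        // we can either retry the sync later (rule 4)...
        err = lfs3_file_sync(lfs, &writer);

        // ...or give up and drop back to the on-disk state (rule 5)
        if (err) {
            lfs3_file_resync(lfs, &writer);
        }
    }
    ```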

  • Stickynotes, no more 0-sized files ^

    As an extension of littlefs's new sync model, v3 introduces a new file type: LFS3_TYPE_STICKYNOTE.

    A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.

    This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.

  • A new and improved compat flag system ^

    v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.

    These are probably familiar to users of Linux filesystems, though I've given them slightly different names:

    • rcompat flags - Must understand to read the filesystem (incompat_flags)
    • wcompat flags - Must understand to write to the filesystem (ro_compat_flags)
    • ocompat flags - No understanding necessary (compat_flags)

    This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.

  • Error detection! - Global-checksums ^

    v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.

    With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via lfs3_fs_cksum. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.

    Funny thing about checksums. It's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?

    This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum-checking APIs (probably too many if I'm honest):

    • LFS3_M_CKMETA/CKDATA - Check checksums during mount
    • LFS3_O_CKMETA/CKDATA - Check checksums during file open
    • lfs3_fs_ckmeta/ckdata - Explicitly check all checksums in the filesystem
    • lfs3_file_ckmeta/ckdata - Explicitly check a file's checksums
    • LFS3_T_CKMETA/CKDATA - Check checksums incrementally during a traversal
    • LFS3_GC_CKMETA/CKDATA - Check checksums during GC operations
    • LFS3_M_CKPROGS - Closed checking of data during progs
    • LFS3_M_CKFETCHES - Optimistic (not closed) checking of data during fetches
    • LFS3_M_CKREADS (planned) - Closed checking of data during reads
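    These might be combined roughly as follows. A hedged sketch only: the flag and function names come from the list above, but the signatures, and in particular where the LFS3_M_* mount flags are passed, are guesses:

    ```c
    // pay the checking cost up-front: validate everything at mount
    // (whether flags are an argument or a cfg field is an assumption)
    int err = lfs3_mount(&lfs, LFS3_M_CKMETA | LFS3_M_CKDATA, &cfg);

    // or validate a single file's data before trusting its contents
    // (LFS3_O_RDONLY assumed by analogy with v2's LFS_O_RDONLY)
    err = lfs3_file_open(&lfs, &file, "important.bin",
            LFS3_O_RDONLY | LFS3_O_CKDATA);

    // or check everything explicitly, e.g. on a periodic schedule
    err = lfs3_fs_ckmeta(&lfs);
    err = lfs3_fs_ckdata(&lfs);
    ```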
  • Better traversal APIs ^

    The traversal API has been completely reworked to be easier to use (both externally and internally).

    No more callback needed, blocks can now be iterated over via the dir-like lfs3_traversal_read function.

    Traversals can also perform janitorial work and check checksums now, based on the flags provided to lfs3_traversal_open.
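    In other words, something like this hedged sketch (lfs3_traversal_open/read come from the text; the close function, info struct, and loop convention are assumptions):

    ```c
    // open a traversal that also checks metadata checksums as it goes
    lfs3_traversal_t t;
    int err = lfs3_traversal_open(&lfs, &t, LFS3_T_CKMETA);

    // iterate over blocks dir-style, no callback needed
    // (hypothetical info struct and termination convention)
    struct lfs3_tinfo tinfo;
    while (lfs3_traversal_read(&lfs, &t, &tinfo) == 0) {
        // inspect tinfo here: mark the block in-use, count usage, etc.
    }

    lfs3_traversal_close(&lfs, &t);
    ```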

  • Incremental GC ^

    GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by lfs3_fs_gc, cfg.gc_flags, and cfg.gc_steps.

    Internally, this just shoves one of the new traversal objects into lfs3_t. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.

    However, this does add a significant chunk of RAM to lfs3_t, so GC is now an opt-in feature behind the LFS3_GC ifdef.
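    A hedged sketch of how this might look in an application's idle loop (lfs3_fs_gc, gc_flags, and gc_steps are named in the text; the field placement and everything else is assumed):

    ```c
    // opt-in at compile time via the LFS3_GC ifdef
    struct lfs3_config cfg = {
        // ... usual block device config ...
        .gc_flags = LFS3_GC_CKMETA,  // also verify metadata checksums
        .gc_steps = 16,              // bound the work done per call
    };

    // then, whenever the system is idle, do one bounded chunk of work;
    // repeated calls make incremental progress through the traversal
    void on_idle(lfs3_t *lfs) {
        int err = lfs3_fs_gc(lfs);
        if (err) {
            // handle/report the error
        }
    }
    ```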

  • Better recovery from runtime errors ^

    Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.

    Most in-RAM filesystem state should now revert to the last known-good state on error.

    The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with lfs3_file_resync, I figured this cost just isn't worth it. So file data remains undefined after an error.

    In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.

  • Standard custom attributes ^

    Breaking disk gives us a chance to reserve attributes 0x80-0xbf for future standard custom attributes:

    • 0x00-0x7f - Free for user-attributes (uattr)
    • 0x80-0xbf - Reserved for standard-attributes (sattr)
    • 0xc0-0xff - Encouraged for system-attributes (yattr)

    In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.

    v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.

  • More tests! ^

    v3 comes with a couple more tests than v2 (+~6812.2%):

         suites  cases  permutations ;     pls   runtime
    v2:      22    198         11641 ;   28741    54.16s
    v3:      23    784        804655 ; 2228513  1323.18s
    

    You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.

    The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.

Planned: ^

  • Efficient block allocation, via optional on-disk block-map (bmap) ^

    The one remaining bottleneck in v3 is block allocation. This is a tricky problem for littlefs (and any CoW/CoBW filesystem), because we don't actually know when a block becomes free.

    This is in-progress work, but the solution I'm currently looking at involves 1. adding an optional on-disk block map (bmap) stored in gstate, and 2. updating it via tree diffing on sync. In theory this will drop huge file writes: $O(n^2 \log n) \rightarrow O(n \log_b^2 n)$

    There is also the option of using the bmap as a simple cache, which doesn't avoid the filesystem-wide scan but at least eliminates the RAM constraint of the lookahead buffer.

    As a plus, we should be able to leverage the self-counting property of B-trees to make the on-disk bmap compressible.

  • Bad block tracking ^

    This is a much requested feature, and adding the optional on-disk bmap finally gives us a place to track bad blocks.

  • Pre-erased block tracking ^

    Just like bad-blocks, the optional on-disk bmap gives us a place to track pre-erased blocks. Well, at least in theory.

    In practice it's a bit more of a nightmare. To avoid multiple progs, we need to mark erased blocks as unerased before progging. This introduces an unbounded number of catch-22s when trying to update the bmap itself.

    Fortunately, if instead we store a simple counter in the bmap's gstate, we can resolve things at the mrootanchor worst case.

  • Error correction! - Metadata redundancy ^

    Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.

    But I think there's potential for cross-block error-correction in addition to the in-block error-correction.

    The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.

    But we already need two blocks for every mdir, and one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.

  • Error correction! - Data redundancy ^

    For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.

  • Transparent block deduplication ^

    This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.

    If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?

    Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.

Stretch goals: ^

These may or may not be included in v3, depending on time and funding:

  • 16-bit and 64-bit variants ^

  • Config API rework ^

  • Block device API rework ^

  • Custom attr API rework ^

  • Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^

  • Advanced file tree operations (lfs3_file_punchhole, lfs3_file_insertrange, lfs3_file_collapserange, LFS3_SEEK_DATA, LFS3_SEEK_HOLE) ^

  • Advanced file copy-on-write operations (shallow lfs3_cowcopy + opportunistic lfs3_copy) ^

  • Simple key-value APIs (lfs3_get, lfs3_set, lfs3_remove, etc) ^

  • Reserved blocks to prevent CoW lockups ^

  • Metadata checks to prevent metadata lockups ^

  • Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^

  • Disk-level RAID (this is just data redund + a disk aware block allocator) ^

Out-of-scope (for now): ^

If we don't stop somewhere, v3 will never be released. But these may be added in the future:

  • Alternative checksums (crc16, crc64, sha256, etc) ^

  • Feature-limited configurations for smaller code/stack sizes (LFS3_NO_DIRS, LFS3_KV, LFS3_2BLOCK, etc) ^

  • lfs3_file_openat for dir-relative APIs ^

  • lfs3_file_openn for non-null-terminated-string APIs ^

  • Transparent compression ^

  • Filesystem shrinking ^

  • High-level caches (block cache, mdir cache, btree leaf cache, etc) ^

  • Symbolic links ^

  • 100% line/branch coverage ^

Code/stack size ^


littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)

Unfortunately, v3 is a little less little than v2:

     code            stack           ctx
v2: 17144             1440           580
v3: 37300 (+117.6%)   2280 (+58.3%)  636 (+9.7%)

On one hand, yes, more features generally means more code.

And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.

But I think it's worth discussing some of the other reasons for the code/stack increase:

  1. Runtime error recovery ^

    Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.

    But I think this is necessary for the maturing of littlefs as a filesystem.

    Maybe it will make sense to add a sort of LFS3_GLASS mode in the future, but this is out-of-scope for now.

  2. B-tree flexibility ^

    The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.

    B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.

    The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.

    The good news: The new B-tree files are extremely flexible.

    There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.

  3. Traversal inversion ^

    Inverting the traversal, i.e. moving from a callback to an incremental state machine, adds both code and stack as 1. all of the previous on-stack state needs to be tracked explicitly, and 2. we now need to worry about what happens if the filesystem is modified mid-traversal.

    In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.

Benchmarks ^

A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.

First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.

Simulated benchmarks ^

To test the math behind v3, I've put together some preliminary simulated benchmarks.

Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.

To find an estimate of runtime, I first measured the amount of bytes read, progged, and erased, and then scaled based on values found in relevant datasheets. The options here were a bit limited, but Winbond fortunately provides runtime estimates in the datasheets on their website:

  • NOR flash - w25q64jv

  • NAND flash - w25n01gv

  • SD/eMMC - Also w25n01gv, assuming a perfect FTL

    I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.

These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.

Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)

And here are the ones I think are the most interesting:

Note that SD/eMMC is heavily penalized by the lack of on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.

  1. Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^

    This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:

    (commit, reads, progs, erases, sim, usage)

  2. Random writes, note we start with a 1MiB file. ^

    As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:

    (commit, reads, progs, erases, sim, usage)

  3. Logging, write 4 MiB, but limit the file to 1 MiB. ^

    In v2 this is accomplished by renaming the file, in v3 we can leverage lfs3_file_fruncate.

    v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:

    (commit, reads, progs, erases, sim, usage)

Funding ^

If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:

  1. Being able to complain about v3 not being released yet
  2. Being able to complain about the disk breakage v2 -> v3

I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.

If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.

As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.

Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!

Next steps ^

For me, I think it's time to finally put together a website/wiki/discussions/blog. I'm not sure on the frequency quite yet, but I plan to write/publish the new DESIGN.md in chapters in tandem with the remaining work.

EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html

geky added 30 commits March 11, 2025 18:09
Unifying these complicated attr-assigning flags across all the scripts
is the main benefit of the new internal Attr system.

The only tricky bit is we need to somehow keep track of all input fields
in case % modifiers reference fields, when we could previously discard
non-data fields.

Tricky but doable.

Updated flags:

- -L/--label -> -L/--add-label
- --colors -> -C/--add-color
- --formats -> -F/--add-format
- --chars -> -*/--add-char/--chars
- --line-chars -> -_/--add-line-char/--line-chars

I've also tweaked Attr to accept glob matches when figuring out group
assignments. This is useful for matching slightly different, but
similarly named results in our benchmark scripts.

There's probably a clever way to do this by injecting new by fields with
csv.py, but just adding globbing is simpler and makes attr assignment
even more flexible.
This adopts the Attr rework for the --add-xticklabel and
--add-yticklabel flags.

Sort of.

These require a bit of special behavior to make work, but should at
least be externally consistent with the other Attr flags.

Instead of assigning to by-field groups, --add-xticklabel/yticklabel
assign to the relevant x/y coord:

  $ ./scripts/plotmpl.py \
          --add-xticklabel='0=zero' \
          --add-yticklabel='100=one-hundred'

The real power comes from our % modifiers. As a special case,
--add-xticklabel/yticklabel can reference the special x/y field, which
represents the current x/y coord:

  $ ./scripts/plotmpl.py --y2 --yticks=5 --add-yticklabel='%(y)d KiB'

Combined with format specifiers, this allows for quite a bit:

  $ ./scripts/plotmpl.py --y2 --yticks=5 --add-yticklabel='0x%(y)04x'

---

Note that plot.py only shows the min/max x/yticks, so plot.py only
accepts indexed --add-xticklabel/yticklabels, and will error if the
assigning variant is used.
In addition to providing more functionality for creating -b/--by fields,
this lets us remove strings from the expr parser. Strings had no
well-defined operations and could best be described as an "ugly wart".

Maybe we'll reintroduce string exprs in the future, but for now csv.py's
-f/--field fields will be limited to numeric values.

As an extra plus, no more excessive quoting when injecting new -b/--by
fields.

---

This also fixed sorting on non-field fields, which was apparently
broken. Or at least mostly useless since it was defaulting to string
sorting.
Guh

This may have been more work than I expected. The goal was to allow passing recursive results (callgraph info, structs, etc) between scripts, which is simply not possible with csv files.

Unfortunately, this raised a number of questions: What happens if a
script receives recursive results? -d/--diff with recursive results?
How to prevent folding of ordered results (structs, hot, etc) in piped
scripts? etc.

And ended up with a significant rewrite of most of the result scripts'
internals.

Key changes:

- Most result scripts now support -O/--output-json in addition to
  -o/--json, with -O/--output-json including any recursive results in
  the "children" field.

- Most result scripts now support both csv and json as input to relevant
  flags: -u/--use, -d/--diff, -p/--percent. This is accomplished by
  looking for a '[' as the first character to decide if an input file is
  json or csv.

  Technically this breaks if your json has leading whitespace, but why
  would you ever keep whitespace around in json? The human-editability
  of json was already ruined the moment comments were disallowed.

- csv.py requires all fields to be explicitly defined, so added
  -i/--enumerate, -Z/--children, and -N/--notes. At least we can provide
  some reasonable defaults so you shouldn't usually need to type out the
  whole field.

- Notably, the rendering scripts (plot.py, treemapd3.py, etc) and
  test/bench scripts do _not_ support json. csv.py can always convert
  to/from json when needed.

- The table renderer now supports diffing recursive results, which is
  nice for seeing how the hot path changed in stack.py/perf.py/etc.

- Moved the -r/--hot logic up into main, so it also affects the
  outputted results. Note it is impossible for -z/--depth to _not_
  affect the outputted results.

- We now sort in one pass, which is in theory more efficient.

- Renamed -t/--hot -> -r/--hot and -R/--reverse-hot, matching -s/-S.

- Fixed an issue with -S/--reverse-sort where only the short form was
  actually reversed (I misunderstood what argparse passes to Action
  classes).

- csv.py now supports json input/output, which is funny.
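The '['-based sniffing is simple enough to sketch (hypothetical helper, not the actual csv.py code):

```python
import csv
import json

def read_results(path):
    # peek at the first character to decide json vs csv; note this
    # assumes json files have no leading whitespace (hypothetical
    # helper, sketching the sniffing described above)
    with open(path) as f:
        first = f.read(1)
        f.seek(0)
        if first == '[':
            return json.load(f)
        return list(csv.DictReader(f))
```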
This removes most of the special behavior around how -r/--hot and
-i/--enumerate interact. This does mean -r/--hot risks folding results
if -i/--enumerate is not specified, but this is _technically_ a valid
operation.

For most of the recursive result scripts, I've replaced the "i" field
with separate "z" and "i" fields for depth and field number, which I
think is a bit more informative/useful.

I've also added a default-hidden "off" field to structs.py/ctx.py, since
we have that info available. I considered replacing "i" with this, but
decided against it since non-zero offsets for union members would risk
being confusing/mistake prone.
This gives csv.py access to a hidden feature in our table renderer used
by some of the other scripts: fields that affect by-field grouping, but
aren't actually printed.

For example, this prevents summing same named functions in different
files, but only shows the function name in the table render:

  $ ./scripts/csv.py lfs.code.csv -bfile -bfunction -lfunction
  function                                size
  lfs_alloc                                398
  lfs_alloc_discard                         31
  lfs_alloc_findfree                        77
  ...

This is especially useful when enumerating results. For example, this
prevents any summing without extra table noise:

  $ ./scripts/csv.py lfs.code.csv -i -bfunction -fsize -lfunction
  function                                size
  lfs_alloc                                398
  lfs_alloc_discard                         31
  lfs_alloc_findfree                        77
  ...

I also tweaked the -b/--by field defaults a bit to better account for
enumerate/label fields.
I guess in addition to its other utilities, csv.py is now also turning
into a sort of man database for some of the more complicated APIs in the
scripts:

  ./csv.py --help
  ./csv.py --help-exprs
  ./csv.py --help-mods

It's a bit minimal, but better than nothing.

Also dropped the %c modifier because this never actually worked.
There's an ordering issue with hotifying and folding when we have
multiple foldable results with children. This was hard to notice since
most of the recursive scripts have unique results, but it _is_ an issue
for perf.py/perfbd.py, which rely on result folding to merge samples.

The fix is to fold _before_ hotifying.

We could fold multiple times to avoid changing the behavior of the
result scripts, but instead I've just moved the folding in the table
renderer up into the relevant main functions. This means 1. we only fold
once, and 2. folding affects outputted csv/json files.
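For reference, the folding in question is just merging same-keyed results, something like this (hypothetical sketch; the real fold is more general):

```python
def fold(results):
    # merge results with the same name by summing, which is what
    # perf.py/perfbd.py rely on to merge samples
    folded = {}
    for name, count in results:
        folded[name] = folded.get(name, 0) + count
    return folded

samples = [('lfs_alloc', 3), ('lfs_mount', 1), ('lfs_alloc', 2)]
assert fold(samples) == {'lfs_alloc': 5, 'lfs_mount': 1}
```

Hotifying before this merge would pick the hottest child from un-merged counts, hence folding first.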

I'm a bit on the fence about this behavior change, but it does mean
-r/--hot, -z/--depth, etc, affect both table and csv/json results
consistently.

Maybe we should move towards the table render always reflecting the
csv/json results? Most csv/json usage is with -q/--quiet anyways...

---

This does create a new risk in that the table renderer can hide results
if they aren't folded first.

To hopefully avoid this I've added an assert in the table renderer if it
notices results being hidden.
Kind of a complicated corner case, but this shows up if you try to sort
by fields as numbers and not as strings. In theory this is possible by
creating a hidden sort field with a typed expr:

  $ ./scripts/csv.py test.csv -bi -bfunction -Si=i

But we weren't typechecking sort fields that already exist in the by
fields, since these are usually strings.

This fix is to make sure all exprs are in the typechecked fields, even
if they are already in by fields. There's no real cost to this.

---

Note this version does _not_ typecheck i, and sorts by string:

  $ ./scripts/csv.py test.csv -bi -bfunction -Si

This raises the question, should we always sort by string by default?

I don't think so. It's easy to miss the difference, and a typecheck
error is a lot safer than incorrect sorting.

So this will sort by number, with i as a hidden field:

  $ ./scripts/csv.py test.csv -bfunction -Si

If you want to sort by string with a hidden field, this is still
possible with -l/--label:

  $ ./scripts/csv.py test.csv -bi -lfunction -Si
It felt weird that adding hidden fields required changing existing
flags unrelated to the field you actually want to affect, and the
upper/lower flag thing seems to work well for -s/-S sooo...

- Replaced -l/--label with -B/--hidden-by for by fields that can
  be hidden from the table renderer.

- Added -F/--hidden-field as a similar thing for field fields.

- Better integrated -i/--enumerate into by fields, now these actually
  maintain related order. And of course added a matching
  -I/--hidden-enumerate flag.

The only downside is this is eating a lot of flag names... But one of
the nice things about limiting this complexity to csv.py is it avoids
these flag names cluttering up the other result scripts.

---

The -F/--hidden-field flag I'm not so sure about, since field exprs
can't really reference each other (single pass). But it does provide
symmetry with -B/--hidden-by, and reserves the name in case hidden field
fields are more useful in the future.

Unfortunately it _is_ annoyingly inconsistent with other hidden fields
(-S/--sort, -D/--define, etc) in that it does end up in output csvs...

But this script is already feeling way over-engineered as is.
This should either have checked diff_result==None, or we should be
mapping diff_result=None => diff_result_=None. To be safe I've done
both.

This was a nasty typo and I only noticed because ctx.py stopped printing
"cycle detected" for our linked-lists (which are expected to be cyclic).
It's just too unintuitive to filter after exprs.

Note this is consistent with how exprs/mods are evaluated. Exprs/mods
can't reference other exprs/mods because csv.py is only single-pass, so
allowing defines to reference exprs/mods is surprising.

And the solution to needing these sort of post-expr/mod references is
the same for defines: You can always chain multiple csv.py calls.

The reason defines were changed to evaluate after expr eval was that
this seemed inconsistent with other result scripts, but this is not
actually the case. Other result scripts simply don't have exprs/mods, so
filtering in fold is the same as filtering during collection. Note that
even in fold, filtering is done _before_ the actual fold/sum operation.

---

Also fixed a recursive-define regression when folding. Counter-
intuitively, we _don't_ want to recursively apply define filters. If we
do the results will just end up too confusing to be useful.
With this, we apply the same result modifiers (exprs/defines/hot/etc) to
both the input results and -d/--diff results. So if both start with the
same format, diffing/hotifying/etc should work as expected.

This is really the only way I can see -d/--diff results working with
result modifiers in a way that makes sense.

The downside of this is that you can't save results with some complex
operation applied, and then diff while applying the same operation,
since most of the newer operations (hotify) are _not_ idempotent.

Fortunately the two alternatives are not unreasonable:

1. Save results _without_ the operation applied, since the operation
   will be applied to both the input and diff results.

   This is a bit asymmetric, but should work.

2. Apply the operation to the input and then pipe to csv.py for diffing.

This used to "just work" when we did _not_ apply operations to output
csv/json, but that was really just equivalent to option 1.

I think the moral of the story is you can solve any problem with enough
chained csv.py calls.
This makes it so scripts with complex fields will still output all
fields to output csv/json files, while only showing a user-friendly
subset unless -f/--field is explicitly provided.

While internal fields are often too much information to show by default,
csv/json files are expected to go to other scripts, not humans. So more
information is more useful up until you actually hit a performance
bottleneck.

And if you _do_ somehow manage to hit a performance bottleneck, you can
always limit the output with explicit -f/--field flags.
So:

  $ ./scripts/code.py lfs.o -o- -q

Becomes:

  $ ./scripts/code.py lfs.o -o-

The original intention of -o/-O _not_ being exclusive (aka table is
still rendered unless disabled with -q/--quiet), was to allow results to
be written to csv files and rendered to tables in a single pass.

But this was never useful. Heck, we're not even using this in our
Makefile right now because it would make the rule dependencies more
complicated than it's worth. Even for long-running result scripts
(perf.py, perfbd.py, etc), most of the work is building that csv file,
the cost of rendering a table in a second pass is negligible.

In every case I've used -o/-O, I've also wanted -q/--quiet, and almost
always forget this on the first run. So might as well make the expected
behavior the actual behavior.

---

As a plus, this let us simplify some of the scripts a bit, by replacing
visibility filters with -o/-O dependent by-fields.
This was a bit broken when r was None, which is unusual, but happens
when rendering added/removed diff results.
Now that I'm looking into some higher-level scripts, being able to merge
results without first renaming everything is useful.

This gives most scripts an implicit prefix for field fields, but _not_
by fields, allowing easy merging of results from different scripts:

  $ ./scripts/stack.py lfs.ci -o-
  function,stack_frame,stack_limit
  lfs_alloc,288,1328
  lfs_alloc_discard,8,8
  lfs_alloc_findfree,16,32
  ...

These prefixes now have better support in the scripts with the addition
of the --prefix flag (this was tricky for csv.py), which allows explicit
control over field field prefixes:

  $ ./scripts/stack.py lfs.ci -o- --prefix=
  function,frame,limit
  lfs_alloc,288,1328
  lfs_alloc_discard,8,8
  lfs_alloc_findfree,16,32
  ...

  $ ./scripts/stack.py lfs.ci -o- --prefix=wonky_
  function,wonky_frame,wonky_limit
  lfs_alloc,288,1328
  lfs_alloc_discard,8,8
  lfs_alloc_findfree,16,32
  ...
This can be explicitly disabled with -x/--no-strip in the relevant
scripts, but stripping by default seems to be more useful for composing
results in higher-level scripts. It's better for the result names to be
consistent, even if they don't match the .o symbols exactly.

Note some scripts are unaffected:

- cov.py - gcov doesn't seem to have an option for getting the
  unstripped symbols, so we only output the stripped names.

- structs.py - structs.py deals with struct names, which are notably not
  symbols.

This adds -i/--internal to ctx.py and structs.py, which has proven
useful for introspection/debugging. Being able to view the ctx/args of
internal functions is nice, even if they don't actually contribute to
the high-level cost.

This also reverts structs.py to limit to .h files by default, to match
ctx.py, once again relying on dwarf file info. This has been a bit
unreliable in the past, but there's not much else that determines if a
struct is part of the "public interface" in C.

But that's what ctx.py is for.

---

Also fixed an issue where structs appearing in multiple files would have
their sizes added together, which ends up with some pretty confusing
results (sizeof(uint32_t) => 8?).
-!/--everything has been useful enough to warrant a short form flag,
and -! is unlikely to conflict with other flags while also getting the
point across that this is a bit of an unusual option.
I forgot that this is still useful for erroring scripts, such as
stack.py when checking for recursion.

Technically this is possible with -o/dev/null, but that's both
unnecessarily complicated and includes the csv encoding cost for no
reason.
The previous behavior of -N/--no-header still rendering a header when
--title is also provided was confusing. I think this is a better API,
at the minor cost of needing to pass one more flag if you don't want
stats in the header.
This replaces the default colors with colors with precomputed alpha.
They should look the same as long as you don't change the background
color.
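Precomputing alpha just means compositing each color onto the background once, per channel (standard alpha compositing, not treemapd3.py's exact code):

```python
def flatten(color, alpha, background):
    # composite an rgb color with the given alpha onto an opaque
    # background: c' = alpha*c + (1-alpha)*bg, per channel
    return tuple(
        round(alpha*c + (1-alpha)*b)
        for c, b in zip(color, background))

# a 50%-transparent red over white flattens to pink
assert flatten((255, 0, 0), 0.5, (255, 255, 255)) == (255, 128, 128)
```

Which is also why the result only matches as long as the background color doesn't change.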

The reason for this is I need non-transparent colors for a fork of
treemapd3.py that I am working on. I'm attempting to render some arrows
behind the tiles, and with transparency the result is just too noisy.

---

This then broke the nested rendering, which relied on opacity to make
each layer darker, so I've replaced that with an explicit svg
filter-effect.

The opacity hack was non-linear and kinda ugly past depth >~3, so it
should have eventually been replaced anyways.
Inspired heavily by d3 and brendangregg's flamegraphs, codemapd3.py is
intended to be a powerful high-level code exploring tool.

It's a visual tool, so probably best explained visually:

  $ CFLAGS='-DLFS_NO_LOG -DLFS_NO_ASSERT' make -j
  $ ./scripts/codemapd3.py \
          lfs.o lfs_util.o \
          lfs.ci lfs_util.ci \
          -otest.svg -W1500 -H700 --dark
  updated test.svg, code 35528 stack 2440 ctx 636

And open test.svg in a browser of your choice.

(TODO add a make rule for this)

---

Features include:

- Rendering of code cost in a treemap organized by subsystem (based on
  underscore-separated namespaces), making it relatively easy to see
  where the bulk of our code cost comes from.

- Rendering of the deepest stack/ctx cost as a set of tiles, making it
  relatively easy to see where the bulk of our stack cost comes from.

- Interactive (on mouseover) rendering of callgraph info, showing
  dependencies and relevant stack/ctx costs per-function.

  This currently includes 4 modes:

  1. mode-callgraph - This shows the full callgraph, including all
     children's children, which is effectively all dependencies of that
     function, i.e. the total code cost necessary for that _specific_
     function to work.

  2. mode-deepest - This shows the deepest/hot path of calls from that
     function, which is every child that contributes to the function's
     stack cost.

  3. mode-callees - This shows all functions the current function
     immediately calls.

  4. mode-callers - This shows all functions that call the current
     function.

  And yes, cycles are handled correctly: We show the deepest
  non-cyclical path, but display the measured stack usage as infinite.
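
The cycle handling can be sketched as a callgraph traversal that finds the deepest non-cyclical path while flagging reachable cycles, so the stack limit can be shown as infinite (hypothetical representation: name -> (frame, callees); not codemapd3.py's actual code):

```python
def deepest(callgraph, name, seen=frozenset()):
    # returns (cyclic?, finite stack depth, deepest non-cyclical path);
    # display the limit as infinite when cyclic is true
    if name in seen:
        return True, 0, []
    frame, callees = callgraph[name]
    cyclic, depth, path = False, 0, []
    for callee in callees:
        cyclic_, depth_, path_ = deepest(
            callgraph, callee, seen | {name})
        cyclic = cyclic or cyclic_
        if depth_ > depth:
            depth, path = depth_, path_
    return cyclic, frame + depth, [name] + path

# made-up frame sizes, for illustration only
callgraph = {
    'lfs_alloc': (288, ['lfs_alloc_findfree']),
    'lfs_alloc_findfree': (16, []),
}
print(deepest(callgraph, 'lfs_alloc'))
```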

For more details see ./scripts/codemapd3.py --help.

---

One particularly neat feature I'm happy about is -t/--tiny, which scales
the resulting image such that 1 pixel ~= 1 byte. This should be useful
for comparing littlefs to other filesystems in a way that is visually
interesting.

- d3 - https://d3js.org
- brendangregg's flamegraphs - https://github.com/brendangregg/FlameGraph
This just makes dat behave similarly to Python's getattr, etc:

- dat("bogus")       -> raises ValueError
- dat("bogus", 1234) -> returns 1234

This replaces try_dat, which is easy to forget about when copy-pasting
between scripts.

Though all of this wouldn't be necessary if only we could catch
exceptions in expressions...
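So something along these lines (hypothetical internals; the real dat() lives in the dbg*.py scripts and reads fields from disk):

```python
def dat(fields, name, *default):
    # behave like Python's getattr: raise without a default, return
    # the default otherwise (sketch with an explicit fields dict)
    if name in fields:
        return fields[name]
    if default:
        return default[0]
    raise ValueError('no field %r?' % name)

fields = {'size': 398}
assert dat(fields, 'size') == 398
assert dat(fields, 'bogus', 1234) == 1234
```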
This was a humorous name conflict that went unnoticed only because we
lazily import json in read_csv.
Even though I think this makes less sense for the ascii-rendering
scripts, it's useful to have this flag around when jumping between
treemap.py and treemapd3.py.

And it might actually make sense sometimes now that -t/--tiny does not
override --to-scale.
This turns out to be extremely useful, for the sole purpose of being
able to specify colors/formats/etc in csv fields (-C'%(fields)s' for
example, or -C'#%(field)06x' for a cooler example).

This is a bit tricky for --chars, but doable with a psplit helper
function.

Also fixed a bug in plot.py where we weren't using dataattrs_ correctly.
Just removing outdated/commented-out code, and TODOs that have now been
taken care of.
geky added 12 commits May 25, 2025 12:53
This adds lfsr_btree_commitroot_ and lfsr_bshrub_commitroot_, to contain
the root-specific commit logic such that it can be forced off the stack
hot-path if necessary.

---

Note we're not actually using LFS_NOINLINE yet, as the critical
function, lfsr_btree_commitroot_ is implicitly forced off the stack
hot-path via the multiple calls from lfsr_btree_commit and
lfsr_bshrub_commit.

And I'm not sure it makes sense to use LFS_NOINLINE here. It absolutely
wrecks lfsr_bshrub_commitroot_'s stack, which always ends up on the
stack hot-path because of the route through lfsr_mdir_commit.

Is this a big hack? Honestly yeah.

It doesn't even really save that much stack, but I figured it was worth
a try:

           code          stack          ctx
  before: 37260           2296          636
  after:  37300 (+0.1%)   2280 (-0.7%)  636 (+0.0%)

At least the code organization is a bit better, with lfsr_bshrub_commit
reusing lfsr_btree_commitroot_ for bshrub -> btree migration.
Life's too short to not use this flag.
Not sure what the point of this was, I think it was copied from a d3
example svg at some point. But it forces the svg to always fit in the
window, even if this makes the svg unreadable.

These svgs tend to end up questionably large in order to fit in the
most info, so the unreadability becomes a real problem at even modest
window sizes.
I don't think this hyphen broke anything, but this matches the naming
convention of other files in the codebase (lfs_util.h for example).
Removing the vestiges of v2 tests.
Kinda. It's actually only 3 scripts. These have been replaced with the
new dbg*.py scripts:

- readblock.py -> dbgblock.py
- readmdir.py -> dbgrbyd.py
- readtree.py -> dbglfs.py
Removing the final vestiges of v2.
This one was a bit more involved.

Removes utils that are no longer useful, and made sure some of the
name/API changes over time are adopted consistently:

- lfs_npw2 -> lfs_nlog2
- lfs_tole32_ -> lfs_tole32
- lfs_fromle32_ -> lfs_fromle32

Also did another pass for lfs_ prefixes on mem/str functions. The habit
to use the naked variants of these is hard to break!
Like the LFS_DISK_VERSION, the intention is to mark the current driver
as experimental.

Just in case...
@geky
Member Author
geky commented May 29, 2025

There is high risk this thread drowns in noise. So, please limit this thread to high-level v3-wide discussion.

For bugs, low-level questions, concerns, etc, feel free to create issues on littlefs with "v3" in the title. I will label relevant issues with the v3 label when I can.

@geky geky added needs major version breaking functionality only allowed in major versions v3 next major on-disk major WEEWOOWEEWOO labels May 29, 2025
@dpgeorge
Contributor

Fantastic! Great to see this v3 work making its way into the world.

The code size increase would be the biggest concern for MicroPython. Any optimisations to get that down would be welcome. In particular we put read-only littlefs2 inside a 32k bootloader. Do you have any idea how big the read-only version of v3 is?

@geky
Member Author
geky commented May 29, 2025

@dpgeorge, I don't know quite yet. The non-default builds aren't fully reimplemented, but I'll update this thread when they are. Read-only is the most interesting but also the most time consuming.

The good news is that rbyds (and red-black trees) and B-trees are much more complicated on the write side than read side, so the impact should be less than the default build, but I'll wait to stick my foot in my mouth until after I have actual numbers.

@Ryan-CW-Code

Looking forward to the Simple key-value APIs in v3

@geky
Member Author
geky commented May 29, 2025

@Ryan-CW-Code, it's worth noting the simple key-value APIs are stretch goals. They may or may not make it into v3 depending on how long things take and funding.

Or to put it another way, stretch goals won't block the release of v3. But it would be convenient to get them in while breaking the API, to minimize future API breaks.

Though if they don't make it into v3, I would like to introduce them in future releases, if possible. Note most stretch and out-of-scope features are mainly API related and unlikely to affect the on-disk format.

@Ryan-CW-Code

My English is not very good, please forgive me for using machine translation.
Thanks for your contribution!
Using kvdb and tsdb on top of a filesystem is a common need.
We are preparing to migrate from flashdb to littlefs, and have implemented kvdb and tsdb based on littlefs, but due to the long wait for metadata compaction, we run gc in a low-priority task after every file change to obtain constant write times.
I noticed that v3 is a good opportunity, with improved metadata compaction and a lot of performance improvements, and I am very willing to test v3.
