v3-alpha: Rbyds and B-trees and gcksums, oh my! #1111
Conversation
Unifying these complicated attr-assigning flags across all the scripts is the main benefit of the new internal Attr system. The only tricky bit is we need to somehow keep track of all input fields in case % modifiers reference fields, when we could previously discard non-data fields. Tricky but doable.

Updated flags:

- -L/--label -> -L/--add-label
- --colors -> -C/--add-color
- --formats -> -F/--add-format
- --chars -> -*/--add-char/--chars
- --line-chars -> -_/--add-line-char/--line-chars

I've also tweaked Attr to accept glob matches when figuring out group assignments. This is useful for matching slightly different, but similarly named results in our benchmark scripts. There's probably a clever way to do this by injecting new by fields with csv.py, but just adding globbing is simpler and makes attr assignment even more flexible.
This adopts the Attr rework for the --add-xticklabel and --add-yticklabel flags. Sort of. These require a bit of special behavior to make work, but should at least be externally consistent with the other Attr flags.

Instead of assigning to by-field groups, --add-xticklabel/yticklabel assign to the relevant x/y coord:

$ ./scripts/plotmpl.py \
    --add-xticklabel='0=zero' \
    --add-yticklabel='100=one-hundred'

The real power comes from our % modifiers. As a special case, --add-xticklabel/yticklabel can reference the special x/y field, which represents the current x/y coord:

$ ./scripts/plotmpl.py --y2 --yticks=5 --add-yticklabel='%(y)d KiB'

Combined with format specifiers, this allows for quite a bit:

$ ./scripts/plotmpl.py --y2 --yticks=5 --add-yticklabel='0x%(y)04x'

---

Note that plot.py only shows the min/max x/yticks, so plot.py only accepts indexed --add-xticklabel/yticklabels, and will error if the assigning variant is used.
In addition to providing more functionality for creating -b/--by fields, this lets us remove strings from the expr parser. Strings had no well-defined operations and could best be described as an "ugly wart". Maybe we'll reintroduce string exprs in the future, but for now csv.py's -f/--field fields will be limited to numeric values. As an extra plus, no more excessive quoting when injecting new -b/--by fields. --- This also fixed sorting on non-field fields, which was apparently broken. Or at least mostly useless since it was defaulting to string sorting.
Guh. This may have been more work than I expected. The goal was to allow passing recursive results (callgraph info, structs, etc) between scripts, which is simply not possible with csv files.

Unfortunately, this raised a number of questions: What happens if a script receives recursive results? -d/--diff with recursive results? How to prevent folding of ordered results (structs, hot, etc) in piped scripts? Etc. This ended up with a significant rewrite of most of the result scripts' internals.

Key changes:

- Most result scripts now support -O/--output-json in addition to -o/--json, with -O/--output-json including any recursive results in the "children" field.
- Most result scripts now support both csv and json as input to relevant flags: -u/--use, -d/--diff, -p/--percent. This is accomplished by looking for a '[' as the first character to decide if an input file is json or csv. Technically this breaks if your json has leading whitespace, but why would you ever keep whitespace around in json? The human-editability of json was already ruined the moment comments were disallowed.
- csv.py requires all fields to be explicitly defined, so added -i/--enumerate, -Z/--children, and -N/--notes. At least we can provide some reasonable defaults so you shouldn't usually need to type out the whole field.
- Notably, the rendering scripts (plot.py, treemapd3.py, etc) and test/bench scripts do _not_ support json. csv.py can always convert to/from json when needed.
- The table renderer now supports diffing recursive results, which is nice for seeing how the hot path changed in stack.py/perf.py/etc.
- Moved the -r/--hot logic up into main, so it also affects the outputted results. Note it is impossible for -z/--depth to _not_ affect the outputted results.
- We now sort in one pass, which is in theory more efficient.
- Renamed -t/--hot -> -r/--hot and -R/--reverse-hot, matching -s/-S.
- Fixed an issue with -S/--reverse-sort where only the short form was actually reversed (I misunderstood what argparse passes to Action classes).
- csv.py now supports json input/output, which is funny.
This removes most of the special behavior around how -r/--hot and -i/--enumerate interact. This does mean -r/--hot risks folding results if -i/--enumerate is not specified, but this is _technically_ a valid operation. For most of the recursive result scripts, I've replaced the "i" field with separate "z" and "i" fields for depth and field number, which I think is a bit more informative/useful. I've also added a default-hidden "off" field to structs.py/ctx.py, since we have that info available. I considered replacing "i" with this, but decided against it since non-zero offsets for union members would risk being confusing/mistake prone.
This gives csv.py access to a hidden feature in our table renderer used by some of the other scripts: fields that affect by-field grouping, but aren't actually printed.

For example, this prevents summing same-named functions in different files, but only shows the function name in the table render:

$ ./scripts/csv.py lfs.code.csv -bfile -bfunction -lfunction
function            size
lfs_alloc           398
lfs_alloc_discard   31
lfs_alloc_findfree  77
...

This is especially useful when enumerating results. For example, this prevents any summing without extra table noise:

$ ./scripts/csv.py lfs.code.csv -i -bfunction -fsize -lfunction
function            size
lfs_alloc           398
lfs_alloc_discard   31
lfs_alloc_findfree  77
...

I also tweaked -b/--by field defaults to account for enumerate/label fields a bit better.
I guess in addition to its other utilities, csv.py is now also turning into a sort of man database for some of the more complicated APIs in the scripts:

./csv.py --help
./csv.py --help-exprs
./csv.py --help-mods

It's a bit minimal, but better than nothing.

Also dropped the %c modifier because this never actually worked.
There's an ordering issue with hotifying and folding when we have multiple foldable results with children. This was hard to notice since most of the recursive scripts have unique results, but it _is_ an issue for perf.py/perfbd.py, which rely on result folding to merge samples. The fix is to fold _before_ hotifying. We could fold multiple times to avoid changing the behavior of the result scripts, but instead I've just moved the folding in the table renderer up into the relevant main functions. This means 1. we only fold once, and 2. folding affects outputted csv/json files. I'm a bit on the fence about this behavior change, but it is a bit more consistent with how -r/--hot, -z/--depth, etc, affect both table and csv/json results consistently. Maybe we should move towards the table render always reflecting the csv/json results? Most csv/json usage is with -q/--quiet anyways... --- This does create a new risk in that the table renderer can hide results if they aren't folded first. To hopefully avoid this I've added an assert in the table renderer if it notices results being hidden.
Kind of a complicated corner case, but this shows up if you try to sort by fields as numbers and not as strings. In theory this is possible by creating a hidden sort field with a typed expr: $ ./scripts/csv.py test.csv -bi -bfunction -Si=i But we weren't typechecking sort fields that already exist in the by fields, since these are usually strings. This fix is to make sure all exprs are in the typechecked fields, even if they are already in by fields. There's no real cost to this. --- Note this version does _not_ typecheck i, and sorts by string: $ ./scripts/csv.py test.csv -bi -bfunction -Si This raises the question, should we always sort by string by default? I don't think so. It's easy to miss the difference, and a typecheck error is a lot safer than incorrect sorting. So this will sort by number, with i as a hidden field: $ ./scripts/csv.py test.csv -bfunction -Si If you want to sort by string with a hidden field, this is still possible with -l/--label: $ ./scripts/csv.py test.csv -bi -lfunction -Si
It felt weird that adding hidden fields required changing existing flags unrelated to the field you actually want to affect, and the upper/lower flag thing seems to work well for -s/-S sooo...

- Replaced -l/--label with -B/--hidden-by for by fields that can be hidden from the table renderer.
- Added -F/--hidden-field as a similar thing for field fields.
- Better integrated -i/--enumerate into by fields, now these actually maintain related order. And of course added a matching -I/--hidden-enumerate flag.

The only downside is this is eating a lot of flag names... But one of the nice things about limiting this complexity to csv.py is it avoids these flag names cluttering up the other result scripts.

---

The -F/--hidden-field flag I'm not so sure about, since field exprs can't really reference each other (single pass). But it does provide symmetry with -B/--hidden-by, and reserves the name in case hidden field fields are more useful in the future.

Unfortunately it _is_ annoyingly inconsistent with other hidden fields (-S/--sort, -D/--define, etc) in that it does end up in output csvs... But this script is already feeling way over-engineered as is.
This should either have checked diff_result==None, or we should be mapping diff_result=None => diff_result_=None. To be safe I've done both. This was a nasty typo and I only noticed because ctx.py stopped printing "cycle detected" for our linked-lists (which are expected to be cyclic).
It's just too unintuitive to filter after exprs.

Note this is consistent with how exprs/mods are evaluated. Exprs/mods can't reference other exprs/mods because csv.py is only single-pass, so allowing defines to reference exprs/mods is surprising. And the solution to needing these sorts of post-expr/mod references is the same for defines: You can always chain multiple csv.py calls.

The reason defines were changed to evaluate after expr eval was that this seemed inconsistent with other result scripts, but this is not actually the case. Other result scripts simply don't have exprs/mods, so filtering in fold is the same as filtering during collection. Note that even in fold, filtering is done _before_ the actual fold/sum operation.

---

Also fixed a recursive-define regression when folding. Counterintuitively, we _don't_ want to recursively apply define filters. If we do, the results will just end up too confusing to be useful.
With this, we apply the same result modifiers (exprs/defines/hot/etc) to both the input results and -d/--diff results. So if both start with the same format, diffing/hotifying/etc should work as expected.

This is really the only way I can see -d/--diff results working with result modifiers in a way that makes sense.

The downside of this is that you can't save results with some complex operation applied, and then diff while applying the same operation, since most of the newer operations (hotify) are _not_ idempotent. Fortunately the two alternatives are not unreasonable:

1. Save results _without_ the operation applied, since the operation will be applied to both the input and diff results. This is a bit asymmetric, but should work.
2. Apply the operation to the input and then pipe to csv.py for diffing. This used to "just work" when we did _not_ apply operations to output csv/json, but this was really just equivalent to 1.

I think the moral of the story is you can solve any problem with enough chained csv.py calls.
This makes it so scripts with complex fields will still output all fields to output csv/json files, while only showing a user-friendly subset unless -f/--field is explicitly provided. While internal fields are often too much information to show by default, csv/json files are expected to go to other scripts, not humans. So more information is more useful up until you actually hit a performance bottleneck. And if you _do_ somehow manage to hit a performance bottleneck, you can always limit the output with explicit -f/--field flags.
So:

$ ./scripts/code.py lfs.o -o- -q

Becomes:

$ ./scripts/code.py lfs.o -o-

The original intention of -o/-O _not_ being exclusive (aka the table is still rendered unless disabled with -q/--quiet) was to allow results to be written to csv files and rendered to tables in a single pass. But this was never useful. Heck, we're not even using this in our Makefile right now because it would make the rule dependencies more complicated than it's worth. Even for long-running result scripts (perf.py, perfbd.py, etc), most of the work is building that csv file, the cost of rendering a table in a second pass is negligible.

In every case I've used -o/-O, I've also wanted -q/--quiet, and almost always forget this on the first run. So might as well make the expected behavior the actual behavior.

---

As a plus, this let us simplify some of the scripts a bit, by replacing visibility filters with -o/-O dependent by-fields.
This was a bit broken when r was None. Which is unusual, but happens when rendering added/removed diff results.
Now that I'm looking into some higher-level scripts, being able to merge results without first renaming everything is useful. This gives most scripts an implicit prefix for field fields, but _not_ by fields, allowing easy merging of results from different scripts:

$ ./scripts/stack.py lfs.ci -o-
function,stack_frame,stack_limit
lfs_alloc,288,1328
lfs_alloc_discard,8,8
lfs_alloc_findfree,16,32
...

At least now these have better support in scripts with the addition of the --prefix flag (this was tricky for csv.py), which allows explicit control over field field prefixes:

$ ./scripts/stack.py lfs.ci -o- --prefix=
function,frame,limit
lfs_alloc,288,1328
lfs_alloc_discard,8,8
lfs_alloc_findfree,16,32
...

$ ./scripts/stack.py lfs.ci -o- --prefix=wonky_
function,wonky_frame,wonky_limit
lfs_alloc,288,1328
lfs_alloc_discard,8,8
lfs_alloc_findfree,16,32
...
This can be explicitly disabled with -x/--no-strip in the relevant scripts, but stripping by default seems to be more useful for composing results in higher-level scripts. It's better for the result names to be consistent, even if they don't match the .o symbols exactly. Note some scripts are unaffected: - cov.py - gcov doesn't seem to have an option for getting the unstripped symbols, so we only output the stripped names. - structs.py - structs.py deals with struct names, which are notably not symbols.
…s.py This adds -i/--internal to ctx.py and structs.py, which has proven useful for introspection/debugging. Being able to view the ctx/args of internal functions is nice, even if they don't actually contribute to the high-level cost. This also reverts structs.py to limit to .h files by default, to match ctx.py, once again relying on dwarf file info. This has been a bit unreliable in the past, but there's not much else that determines if a struct is part of the "public interface" in C. But that's what ctx.py is for. --- Also fixed an issue where structs appearing in multiple files would have their sizes added together, which ends up with some pretty confusing results (sizeof(uint32_t) => 8?).
-!/--everything has been useful enough to warrant a short form flag, and -! is unlikely to conflict with other flags while also getting the point across that this is a bit of an unusual option.
I forgot that this is still useful for erroring scripts, such as stack.py when checking for recursion. Technically this is possible with -o/dev/null, but that's both unnecessarily complicated and includes the csv encoding cost for no reason.
The previous behavior of -N/--no-header still rendering a header when --title is also provided was confusing. I think this is a better API, at the minor cost of needing to pass one more flag if you don't want stats in the header.
This replaces the default colors with colors with precomputed alpha. They should look the same as long as you don't change the background color. The reason for this is I need non-transparent colors for a fork of treemapd3.py that I am working on. I'm attempting to render some arrows behind the tiles, and with transparency the result is just too noisy. --- This then broke the nested rendering, which relied on opacity to make each layer darker, so I've replaced that with an explicit svg filter-effect. The opacity hack was non-linear and kinda ugly past depth >~3, so it should have eventually been replaced anyways.
Inspired heavily by d3 and brendangregg's flamegraphs, codemapd3.py is intended to be a powerful high-level code exploring tool. It's a visual tool, so probably best explained visually:

$ CFLAGS='-DLFS_NO_LOG -DLFS_NO_ASSERT' make -j
$ ./scripts/codemapd3.py \
    lfs.o lfs_util.o \
    lfs.ci lfs_util.ci \
    -otest.svg -W1500 -H700 --dark
updated test.svg, code 35528 stack 2440 ctx 636

And open test.svg in a browser of your choice. (TODO add a make rule for this)

---

Features include:

- Rendering of code cost in a treemap organized by subsystem (based on underscore-separated namespaces), making it relatively easy to see where the bulk of our code cost comes from.
- Rendering of the deepest stack/ctx cost as a set of tiles, making it relatively easy to see where the bulk of our stack cost comes from.
- Interactive (on mouseover) rendering of callgraph info, showing dependencies and relevant stack/ctx costs per-function. This currently includes 4 modes:
  1. mode-callgraph - This shows the full callgraph, including all children's children, which is effectively all dependencies of that function, i.e. the total code cost necessary for that _specific_ function to work.
  2. mode-deepest - This shows the deepest/hot path of calls from that function, which is every child that contributes to the function's stack cost.
  3. mode-callees - This shows all functions the current function immediately calls.
  4. mode-callers - This shows all functions that call the current function.

And yes, cycles are handled correctly: We show the deepest non-cyclical path, but display the measured stack usage as infinite.

For more details see ./scripts/codemapd3.py --help.

---

One particularly neat feature I'm happy about is -t/--tiny, which scales the resulting image such that 1 pixel ~= 1 byte. This should be useful for comparing littlefs to other filesystems in a way that is visually interesting.

- d3 - https://d3js.org
- brendangregg's flamegraphs - https://github.com/brendangregg/FlameGraph
This just makes dat behave similarly to Python's getattr, etc: - dat("bogus") -> raises ValueError - dat("bogus", 1234) -> returns 1234 This replaces try_dat, which is easy to forget about when copy-pasting between scripts. Though all of this wouldn't be necessary if only we could catch exceptions in expressions...
This was a humorous name conflict that went unnoticed only because we lazily import json in read_csv.
Even though I think this makes less sense for the ascii-rendering scripts, it's useful to have this flag around when jumping between treemap.py and treemapd3.py. And it might actually make sense sometimes now that -t/--tiny does not override --to-scale.
This turns out to be extremely useful, for the sole purpose of being able to specify colors/formats/etc in csv fields (-C'%(fields)s' for example, or -C'#%(field)06x' for a cooler example). This is a bit tricky for --chars, but doable with a psplit helper function. Also fixed a bug in plot.py where we weren't using dataattrs_ correctly.
Just removing outdated/commented-out code, and TODOs that have now been taken care of.
This adds lfsr_btree_commitroot_ and lfsr_bshrub_commitroot_, to contain the root-specific commit logic such that it can be forced off the stack hot-path if necessary.

---

Note we're not actually using LFS_NOINLINE yet, as the critical function, lfsr_btree_commitroot_, is implicitly forced off the stack hot-path via the multiple calls from lfsr_btree_commit and lfsr_bshrub_commit.

And I'm not sure it makes sense to use LFS_NOINLINE here. It absolutely wrecks lfsr_bshrub_commitroot_'s stack, which always ends up on the stack hot-path because of the route through lfsr_mdir_commit.

Is this a big hack? Honestly yeah. It doesn't even really save that much stack, but I figured it was worth a try:

            code            stack           ctx
    before: 37260           2296            636
    after:  37300 (+0.1%)   2280 (-0.7%)    636 (+0.0%)

At least the code organization is a bit better, with lfsr_bshrub_commit reusing lfsr_btree_commitroot_ for bshrub -> btree migration.
Life's too short to not use this flag.
Not sure what the point of this was, I think it was copied from a d3 example svg at some point. But it forces the svg to always fit in the window, even if this makes the svg unreadable. These svgs tend to end up questionably large in order to fit in the most info, so the unreadability ends up being a real problem for even modest window sizes.
I don't think this hyphen broke anything, but this matches the naming convention of other files in the codebase (lfs_util.h for example).
Removing the vestiges of v2 tests.
Kinda. It's actually only 3 scripts. These have been replaced with the new dbg*.py scripts: - readblock.py -> dbgblock.py - readmdir.py -> dbgrbyd.py - readtree.py -> dbglfs.py
Removing the final vestiges of v2.
This one was a bit more involved. Removes utils that are no longer useful, and made sure some of the name/API changes over time are adopted consistently: - lfs_npw2 -> lfs_nlog2 - lfs_tole32_ -> lfs_tole32 - lfs_fromle32_ -> lfs_fromle32 Also did another pass for lfs_ prefixes on mem/str functions. The habit to use the naked variants of these is hard to break!
Like the LFS_DISK_VERSION, the intention is to mark the current driver as experimental. Just in case...
There is a high risk this thread drowns in noise. So, please limit this thread to high-level v3-wide discussion. For bugs, low-level questions, concerns, etc, feel free to create issues on littlefs with "v3" in the title. I will label relevant issues with the v3 label when I can.
Fantastic! Great to see this v3 work making its way into the world. The code size increase would be the biggest concern for MicroPython. Any optimisations to get that down would be welcome. In particular we put read-only littlefs2 inside a 32k bootloader. Do you have any idea how big the read-only version of v3 is?
@dpgeorge, I don't know quite yet. The non-default builds aren't fully reimplemented, but I'll update this thread when they are. Read-only is the most interesting but also the most time-consuming. The good news is that rbyds (and red-black trees) and B-trees are much more complicated on the write side than the read side, so the impact should be less than the default build, but I'll wait to stick my foot in my mouth until after I have actual numbers.
Looking forward to the Simple key-value APIs in v3
@Ryan-CW-Code, it's worth noting the simple key-value APIs are stretch goals. They may or may not make it into v3 depending on how long things take and funding. Or to put it another way, stretch goals won't block the release of v3. But it would be convenient to get them in while breaking the API, to minimize future API breaks. Though if they don't make it into v3, I would like to introduce them in future releases, if possible. Note most stretch and out-of-scope features are mainly API related and unlikely to affect the on-disk format.
My English is not very good, please forgive me for using machine translation. |
Table of contents ^
What's new?
Implemented
Planned
Stretch goals
Out-of-scope (for now)
Code/stack size
Benchmarks
Simulated benchmarks
Hello! ^
Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.
There are still several chunks of planned work left, but now that this branch has reached feature parity with v2, there's nothing really stopping it from being merged eventually.
So I figured it's a good time to start calling this v3, and put together a public roadmap.
NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE
Here's a quick TODO list of planned work before stabilization. More details below:
This work may continue to break the on-disk format.
That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.
To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When v3 is eventually released, it will reject this version and fail to mount.
Unfortunately, the API will be under heavy flux during this period.
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
Wait, a disk breaking change? ^
Yes. v3 breaks disk compatibility from v2.
I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:
Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.
Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.
Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.
So I think it's best for both littlefs as a project and long-term users to break things here.
Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.
What's new ^
Ok, with that out of the way, what does breaking everything actually get us?
Implemented: ^
Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^
v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.
This speeds up most metadata operations, including metadata lookup ($O(n) \rightarrow O(\log n)$ ), and, critically, metadata compaction ( $O(n^2) \rightarrow O(n \log n)$ ).
This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.
Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^
A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.
This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.
For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ($O(n) \rightarrow O(\log n)$ ).
Better logging: No more sync-padding issues ^
v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.
In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.
Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^
In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.
Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.
Independent file caches ^
littlefs's pcache, rcache, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.

Easier logging APIs: lfs3_file_fruncate ^

Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new lfs3_file_fruncate API.

Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:
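(a minimal sketch, assuming POSIX-like lfs3 file APIs; only lfs3_file_fruncate comes from the new API, the other signatures and the LOG_LIMIT constant are assumptions)

// a sketch of a persistent log/FIFO, signatures here are assumptions
int log_append(lfs3_t *lfs3, lfs3_file_t *file,
        const void *entry, lfs3_size_t size) {
    // append the new entry to the end of the log
    lfs3_ssize_t d = lfs3_file_write(lfs3, file, entry, size);
    if (d < 0) {
        return d;
    }

    // too big? truncate from the front, keeping the most recent
    // LOG_LIMIT bytes
    if (lfs3_file_size(lfs3, file) > LOG_LIMIT) {
        int err = lfs3_file_fruncate(lfs3, file, LOG_LIMIT);
        if (err) {
            return err;
        }
    }

    // make the new entry persistent
    return lfs3_file_sync(lfs3, file);
}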
Sparse files ^
Another advantage of adopting B-trees, littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.
Currently this is limited to a couple operations:
- lfs3_file_truncate
- lfs3_file_fruncate
- lfs3_file_seek + lfs3_file_write past the end of the file

But more advanced hole operations may be added in the future.
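For example, a hole can be created by seeking past the end of a file before writing (a sketch, with assumed signatures):

// seek well past the current end of the file...
lfs3_file_seek(&lfs3, &file, 1024*1024, LFS3_SEEK_SET);
// ...and write, the skipped-over range becomes a hole, a run of
// implied zeros that takes up no actual disk space
lfs3_file_write(&lfs3, &file, "hello", 5);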
Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^
littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.
A simpler/more robust metadata tree ^
As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.
I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.
In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.
A well-defined sync model ^
One interesting thing about littlefs: it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.
To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.
I think this discussion captures most of the idea, but for a high-level overview:
- Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.
- Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.
- Files can be desynced, either explicitly via lfs3_file_desync, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.
- Calling lfs3_file_sync on a desynced file will atomically update the on-disk state, any other in-sync file handles, and mark the file as in-sync again.
- Calling lfs3_file_resync on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
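Putting it together, a rough sketch of the intended flow (the open/write/close signatures here are assumptions; only the sync/desync/resync behavior comes from the model above):

lfs3_file_t file;
lfs3_file_open(&lfs3, &file, "state.bin",
        LFS3_O_WRONLY | LFS3_O_CREAT);

// writes are CoW, neither the on-disk state nor any other file
// handle sees anything yet
lfs3_file_write(&lfs3, &file, "hello", 5);

// sync atomically publishes the update to disk and to any other
// in-sync handles
int err = lfs3_file_sync(&lfs3, &file);
if (err) {
    // on error the file is desynced and the on-disk state is
    // untouched, resync discards our changes, equivalent to closing
    // and reopening the file
    lfs3_file_resync(&lfs3, &file);
}

lfs3_file_close(&lfs3, &file);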
Stickynotes, no more 0-sized files ^
As an extension of littlefs's new sync model, v3 introduces a new file type: LFS3_TYPE_STICKYNOTE.

A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.
This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.
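A sketch of the lifecycle (signatures assumed):

lfs3_file_t file;
lfs3_file_open(&lfs3, &file, "new.txt",
        LFS3_O_WRONLY | LFS3_O_CREAT);
lfs3_file_write(&lfs3, &file, "hello", 5);

// if we lose power here, "new.txt" is only a stickynote, it is
// hidden from the user and cleaned up on the next mount

lfs3_file_sync(&lfs3, &file);
// only after a successful sync does "new.txt" become a normal,
// visible file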
A new and improved compat flag system ^
v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.
These are probably familiar to users of Linux filesystems, though I've given them slightly different names:
- rcompat flags - Must understand to read the filesystem (incompat_flags)
- wcompat flags - Must understand to write to the filesystem (ro_compat_flags)
- ocompat flags - No understanding necessary (compat_flags)

This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.
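As a sketch of the intended semantics, a driver might gate mounting on these flags roughly like so (all names here are hypothetical, not the real v3 API):

// hypothetical compat-flag handling during mount
int mount_check_compat(uint32_t rcompat, uint32_t wcompat) {
    // unknown rcompat flags: we can't even read safely, refuse to
    // mount
    if (rcompat & ~KNOWN_RCOMPAT_FLAGS) {
        return LFS3_ERR_INVAL;
    }

    // unknown wcompat flags: reading is fine, but writing could
    // corrupt state we don't understand, so degrade to read-only
    if (wcompat & ~KNOWN_WCOMPAT_FLAGS) {
        return mount_readonly();
    }

    // ocompat flags require no understanding and can be ignored
    return mount_readwrite();
}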
Error detection! - Global-checksums ^
v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.
With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via lfs3_fs_cksum. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.

Funny thing about checksums: it's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?
This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum-checking APIs (probably too many if I'm honest):
- LFS3_M_CKMETA/CKDATA - Check checksums during mount
- LFS3_O_CKMETA/CKDATA - Check checksums during file open
- lfs3_fs_ckmeta/ckdata - Explicitly check all checksums in the filesystem
- lfs3_file_ckmeta/ckdata - Explicitly check a file's checksums
- LFS3_T_CKMETA/CKDATA - Check checksums incrementally during a traversal
- LFS3_GC_CKMETA/CKDATA - Check checksums during GC operations
- LFS3_M_CKPROGS - Closed checking of data during progs
- LFS3_M_CKFETCHES - Optimistic (not closed) checking of data during fetches
- LFS3_M_CKREADS (planned) - Closed checking of data during reads

Better traversal APIs ^
The traversal API has been completely reworked to be easier to use (both externally and internally).
No more callback needed, blocks can now be iterated over via the dir-like lfs3_traversal_read function.

Traversals can also perform janitorial work and check checksums now, based on the flags provided to lfs3_traversal_open.

Incremental GC ^

GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by lfs3_fs_gc, cfg.gc_flags, and cfg.gc_steps.

Internally, this just shoves one of the new traversal objects into lfs3_t. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.

However, this does add a significant chunk of RAM to lfs3_t, so GC is now an opt-in feature behind the LFS3_GC ifdef.
ifdef.Better recovery from runtime errors ^
Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.
Most in-RAM filesystem state should now revert to the last known-good state on error.
The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with lfs3_file_resync, I figured this cost just isn't worth it. So file data remains undefined after an error.

In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.
Standard custom attributes ^
Breaking disk gives us a chance to reserve attributes 0x80-0xbf for future standard custom attributes:

- 0x00-0x7f - Free for user-attributes (uattr)
- 0x80-0xbf - Reserved for standard-attributes (sattr)
- 0xc0-0xff - Encouraged for system-attributes (yattr)

In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.
v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.
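For illustration, an application might carve out its own uattrs like so (these ids and the lfs3_setattr signature are hypothetical):

// hypothetical user-attribute ids, only the ranges themselves come
// from the reservation scheme above
enum {
    // 0x00-0x7f is free for user-attributes (uattr)
    UATTR_TIMESTAMP = 0x01,
    UATTR_OWNER     = 0x02,
    // 0x80-0xbf (sattr) and 0xc0-0xff (yattr) should be left to
    // standard- and system-attributes
};

// e.g., attach a timestamp to a file (signature assumed)
uint64_t timestamp = 1700000000;
lfs3_setattr(&lfs3, "log.txt", UATTR_TIMESTAMP,
        &timestamp, sizeof(timestamp));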
More tests! ^
v3 comes with a couple more tests than v2 (+~6812.2%):
You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.
The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.
Planned: ^
Efficient block allocation, via optional on-disk block-map (bmap) ^
The one remaining bottleneck in v3 is block allocation. This is a tricky problem for littlefs (and any CoW/CoBW filesystem), because we don't actually know when a block becomes free.
This is in-progress work, but the solution I'm currently looking at involves 1. adding an optional on-disk block map (bmap) stored in gstate, and 2. updating it via tree diffing on sync. In theory this will drop huge file writes: $O(n^2 \log n) \rightarrow O(n \log_b^2 n)$
There is also the option of using the bmap as a simple cache, which doesn't avoid the filesystem-wide scan but at least eliminates the RAM constraint of the lookahead buffer.
As a plus, we should be able to leverage the self-counting property of B-trees to make the on-disk bmap compressible.
Bad block tracking ^
This is a much requested feature, and adding the optional on-disk bmap finally gives us a place to track bad blocks.
Pre-erased block tracking ^
Just like bad-blocks, the optional on-disk bmap gives us a place to track pre-erased blocks. Well, at least in theory.
In practice it's a bit more of a nightmare. To avoid multiple progs, we need to mark erased blocks as unerased before progging. This introduces an unbounded number of catch-22s when trying to update the bmap itself.
Fortunately, if instead we store a simple counter in the bmap's gstate, we can resolve things at the mrootanchor worst case.
Error correction! - Metadata redundancy ^
Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.
But I think there's potential for cross-block error-correction in addition to the in-block error-correction.
The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.
But we already need two blocks for every mdir, one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.
Error correction! - Data redundancy ^
For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.
Transparent block deduplication ^
This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.
If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?
Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.
Stretch goals: ^
These may or may not be included in v3, depending on time and funding:
16-bit and 64-bit variants ^
Config API rework ^
Block device API rework ^
Custom attr API rework ^
Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^
Advanced file tree operations (lfs3_file_punchhole, lfs3_file_insertrange, lfs3_file_collapserange, LFS3_SEEK_DATA, LFS3_SEEK_HOLE) ^

Advanced file copy-on-write operations (shallow lfs3_cowcopy + opportunistic lfs3_copy) ^

Simple key-value APIs (lfs3_get, lfs3_set, lfs3_remove, etc) ^

Reserved blocks to prevent CoW lockups ^
Metadata checks to prevent metadata lockups ^
Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^
Disk-level RAID (this is just data redund + a disk aware block allocator) ^
Out-of-scope (for now): ^
If we don't stop somewhere, v3 will never be released. But these may be added in the future:
Alternative checksums (crc16, crc64, sha256, etc) ^
Feature-limited configurations for smaller code/stack sizes (LFS3_NO_DIRS, LFS3_KV, LFS3_2BLOCK, etc) ^

lfs3_file_openat for dir-relative APIs ^

lfs3_file_openn for non-null-terminated-string APIs ^
Filesystem shrinking ^
High-level caches (block cache, mdir cache, btree leaf cache, etc) ^
Symbolic links ^
100% line/branch coverage ^
Code/stack size ^
littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
Unfortunately, v3 is a little less little than v2:
On one hand, yes, more features generally means more code.
And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.
But I think it's worth discussing some of the other reasons for the code/stack increase:
Runtime error recovery ^
Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.
But I think this is necessary for the maturing of littlefs as a filesystem.
Maybe it will make sense to add a sort of LFS3_GLASS mode in the future, but this is out-of-scope for now.

B-tree flexibility ^
The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.
B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.
The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.
The good news: The new B-tree files are extremely flexible.
There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.
Traversal inversion ^
Inverting the traversal, i.e. moving from a callback to an incremental state machine, adds both code and stack as 1. all of the previous on-stack state needs to be tracked explicitly, and 2. we now need to worry about what happens if the filesystem is modified mid-traversal.
In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.
Benchmarks ^
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.
Simulated benchmarks ^
To test the math behind v3, I've put together some preliminary simulated benchmarks.
Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.
To find an estimate of runtime, I first measured the amount of bytes read, progged, and erased, and then scaled based on values found in relevant datasheets. The options here were a bit limited, but WinBond fortunately provides runtime estimates in the datasheets on their website:
NOR flash - w25q64jv
NAND flash - w25n01gv
SD/eMMC - Also w25n01gv, assuming a perfect FTL
I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.
These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.
Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)
And here are the ones I think are the most interesting:
Note that SD/eMMC is heavily penalized by the lack of on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.
Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^
This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:
Random writes, note we start with a 1MiB file. ^
As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:
Logging, write 4 MiB, but limit the file to 1 MiB. ^
In v2 this is accomplished by renaming the file, in v3 we can leverage lfs3_file_fruncate.

v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:
Funding ^
If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:
I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.
If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.
As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.
Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!
Next steps ^
For me, I think it's time to finally put together a website/wiki/discussions/blog. I'm not sure on the frequency quite yet, but I plan to write/publish the new DESIGN.md in chapters in tandem with the remaining work.
EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html