-
Notifications
You must be signed in to change notification settings - Fork 666
Produce split FST files when using multiple threads #5806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Those are amazing speed ups, certainly a great justification for the work for the viewers (really for the fstReader) to support this. Feel free to make a pre-pull with the typo and trivial fixes that can be made now, or even add the refactoring to support multiple writers. Always good to get what you can in so that the patch doesn't go as stale.
Would we be better off instead having all of the definition information in a common file to reduce the duplication? Note VCD has code to do rollover, and this requires that you have a separate header, then data files that rollover, so you can combine a header with the rollover files of choice (older ones perhaps being deleted). As part of this FST change I think we should support rollover too. That probably means you need the thread name and the rollover number e.g.
Or maybe to help sorting we should add zeros?
If FST has some problems with doing rollover we can punt for now, but suggest we make sure that the naming would be forward compatible with this.
AFAIK everyone is using the fst reader code, so if this can be isolated to just reader improvements we maybe done for all viewers.
Yes, I suppose we should have a new flag. I'd suggest we always enable it with FST && threads. When we've made other FST writer enhancements (e.g. enums) we required a viewer upgrade. Those left behind can use --no-trace-split.
I think the API uses "Get" in the name of accessors, I'd suggest renaming With that I suggest make a pull request on gtkwave. We can then get this into that API, and into Verilator, and will be ready.
I wonder if that should be a separate support binary?
Pull welcome ;) |
Let me try to frame the discussion a bit more and guide the thought process. There are 2 different concepts to think about here:
I think you are thinking in terms of 1., while this PR in its current form is geared towards 2. Having an improved format sure would be nice. If an FST 2.0 comes out from the discussion in gtkwave/gtkwave#70, that would be fantastic. Given there is some interest in working towards that based on that discussion, it might not be justified to extend FST at this stage. This however is independent from the question in the 2nd point. There are more generic use cases for a waveform viewer to be able to display multiple trace files as one. For example, you could have SystemC + Verilated model, or some other cosim environment, where each component can generate a separate trace file. They could even be of different formats if you bolt different tools together. Being able to load all traces together, by merging hierarchies would be useful on it's own, and it would also enable the performance gain presented in this PR, to add to the argument for the viewer to able to display multiple traces as one. I believe currently neither GTKWave nor Surfer can do this. The nice thing about independent split traces like in this PR is that it enables parallelism for all formats, not just FST, though realistically we won't have many...
If you think about it in terms of the 1st point above, then yes, maybe, but even in that case the FST header for a design with 1.2M signals is less than 5MB, so duplicating it only saves constant space and is likely not significant. If you think about it in terms of point 2 above, then this is out of scope for the discussion.
With point 2 in mind, I am not proposing an FST change here, so this is out of scope. However I agree FST 2.0 should have rollover support.
I believe this is not true, in particular Surfer uses https://github.com/ekiwi/fst-reader which is a native Rust implementation.
Let me know if you think different after the above. I'm kind of happy either way. I'm more interested in whether we can drop
I thought about that, but decided against it. For our testing purposes, with small files of well defined shapes, the ~200 lines of noddy python I added works fine, but is does not scale to large files (it's in memory text wrangling!), nor can it handle generic, mismatched hierarchy, and that is something I do not want to commit to maintaining either at this point, so no standalone tool for now.
Or maybe just diff it in driver.py, now that I added a VCD parser there, and remove the external dependency on vcddiff? With that in mind, if you agree with the approach, then I will have the conversations with the waveform viewers, and if there is willingness, I will PR |
You're right I wasn't thinking about 1 vs 2, but rather how to get ultimately to the user seeing one file. As long as we retain a way to get a single trace file (with lower performance), I'm ok with dropping I prefer to keep vcddiff as use it on very large files separately, and think it's good to maintain some separation - ideally we'd put more there vs in driver.py. |
That would be available, via a runtime mechanism (
As discussed, that is probably a question for FST 2.0. I you object to emitting multiple files (sounds like you don't?), we could feed them through the moral equivalent of |
I think multiple files is the correct way, as it will perform best, allow rollover, and allows potential multi-system simulations. |
BTW I also like your directory idea (versus e.g. a prefix) - it's much easier to move a "trace-worth" of files around. I might change the rollover example tests to do similar (I don't think a verilator change is needed). |
Looking quickly I take that back, the rollover naming is part of verilated_vcd_c, so I'll avoid making a change at this time. |
include/gtkwave/fstapi.c
Outdated
return xc | ||
&& !xc->is_initial_time | ||
&& ((xc->vchg_siz >= xc->fst_break_size) || (xc->flush_context_pending)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it would be good to get this in the original gtkwave files as well, to help other simulators adapt.
Wha exactly do you call rollover? |
When the trace file gets large, start throwing away records from earlier simulation time (that is, keep only the tail end of the trace), to limit file size. Verilator supports this for VCD where you can give it a file size limit. See test_regress/t/t_trace_rollover.cpp |
2af1967
to
2929e50
Compare
This is a draft PR that implements splitting of FST trace files, with each piece being produces in parallel with the others. It is not expected to be merge this just yet, but is here to enable discussion around gtkwave/gtkwave#286, and ekiwi/wellen#23 (for Surfer).
With this patch, if you verilate with
--threads N --trace-fst
forN
>= 2, then instead of creating a single FST file, the simulation will instead create a directory with N independent trace files in it.That is, if you used to get a
trace.fst
file, then you will now get atrace.fst
directory with the following structure:Each of the numbered files are independent FST files (you can open them with GTKWave/Surfer just fine). They all have the exact same hierarchy and metadata (enum type tables in particular), but they only contain disjoint subset of the signals. That is, every signal is precisely in one file (and all aliases are in the same file), and if you merge the signals from all the files, you would get back the original (unsplit) FST file that is currently produced by Verilator without this patch.
With some decently sized benchmarks I have lying around, this gets over 2x average speedup over tracing into a single FST file (as does master) when using 4 threads.
The overhead of tracing into a single FST file, compared to disabling tracing at runtime for the same is about 5x, so Amdahl's law limit is approx 2.5x on 4 threads. The 2.04x average here is already close, so this shows the benefit of being able to produce trace files in parallel, though there is probably some opportunity for later fine tuning.
A: Single FST (master)
B: Split FST (this PR)
execute - Sim speed [kHz] - higher is better
The effect 8000 on total trace dump size is surprisingly positive on average as well, though most cases don't vary much. (I have checked the outlier and it looks correct).
execute - Trace dump size [MB] - lower is better
You do pay with a bit of extra memory, as each thread uses an ~128 MB buffer at run-time, but that should not be a big issue.
One nice property of splitting traces files like this is that this strategy works for every format, though I do not suggest implementing it for VCD, as Verilator can already do parallel VCD tracing efficiently (VCD is trivial enough to stitch fragments together at run-time), unless there is some special interest.
Before this can be merged, there should be:
--trace-split
option? default with threads? etc.), considering all the other existing --trace-* optionsWe need one trivial function added to the FST writer API, which tells us if a file buffer reached a size where it needs to be flushed (compressed). We use this to make sure all files are flushed at the same time, to avoid flushing of one file holding up execution.
Testing this required adding a
vcdmerge
function to the test driver, which can merge vcd files that are split this way. This is used to compare with the existing reference files (after feeding each subfile through fst2vcd).Note: the 3 test failures are due to a bug in
vcddiff
. They all fail with--vlt
and pass with--vltmt
. If you swap the last 2 lines of the reference files, they will pass with--vlt
, but fail with--vltmt
.The changes inside
src/
(that is, to Verilator the compiler) are preliminary pending a decision on how to expose this in the CLI.Otherwise the functionality is all there, feedback is appreciated.