Produce split FST files when using multiple threads #5806

gezalore · 2025-02-26T16:16:18Z

This is a draft PR that implements splitting of FST trace files, with each piece being produces in parallel with the others. It is not expected to be merge this just yet, but is here to enable discussion around gtkwave/gtkwave#286, and ekiwi/wellen#23 (for Surfer).

With this patch, if you verilate with --threads N --trace-fst for N >= 2, then instead of creating a single FST file, the simulation will instead create a directory with N independent trace files in it.

That is, if you used to get a trace.fst file, then you will now get a trace.fst directory with the following structure:

$ tree trace.fst
trace.fst
├── 0
├── 1
├── 2
└── 3

Each of the numbered files are independent FST files (you can open them with GTKWave/Surfer just fine). They all have the exact same hierarchy and metadata (enum type tables in particular), but they only contain disjoint subset of the signals. That is, every signal is precisely in one file (and all aliases are in the same file), and if you merge the signals from all the files, you would get back the original (unsplit) FST file that is currently produced by Verilator without this patch.

With some decently sized benchmarks I have lying around, this gets over 2x average speedup over tracing into a single FST file (as does master) when using 4 threads.

The overhead of tracing into a single FST file, compared to disabling tracing at runtime for the same is about 5x, so Amdahl's law limit is approx 2.5x on 4 threads. The 2.04x average here is already close, so this shows the benefit of being able to produce trace files in parallel, though there is probably some opportunity for later fine tuning.

A: Single FST (master)

B: Split FST (this PR)

execute - Sim speed [kHz] - higher is better

Case	#A	#B	Mean A	Mean B	Gain (B/A)	p-value
OpenTitan:default:hello	5	5	2.73 (± 1.00%)	6.12 (± 0.34%)	2.24x	0.00
Vortex:sane:saxpy	5	5	0.96 (± 0.54%)	1.66 (± 0.22%)	1.72x	0.00
XiangShan:mini-chisel3:microbench	5	5	0.85 (± 0.08%)	1.85 (± 0.22%)	2.17x	0.00
XiangShan:mini-chisel6:cmark	5	5	1.66 (± 0.39%)	3.21 (± 0.34%)	1.93x	0.00
XuanTie-C910:default:memcpy	5	5	1.15 (± 0.40%)	2.49 (± 0.73%)	2.17x	0.00

Geometric mean					2.04x
Geometric mean - pVal < 0.05					2.04x

The effect on total trace dump size is surprisingly positive on average as well, though most cases don't vary much. (I have checked the outlier and it looks correct).

execute - Trace dump size [MB] - lower is better

Case	#A	#B	Mean A	Mean B	Gain (A/B)	p-value
OpenTitan:default:hello	5	5	233.84 (± 0.00%)	238.47 (± 0.00%)	0.98x	0.00
Vortex:sane:saxpy	5	5	510.34 (± 0.00%)	468.44 (± 0.00%)	1.09x	0.00
XiangShan:mini-chisel3:microbench	5	5	1408.21 (± 0.00%)	1447.34 (± 0.00%)	0.97x	0.00
XiangShan:mini-chisel6:cmark	5	5	759.13 (± 0.00%)	744.12 (± 0.00%)	1.02x	0.00
XuanTie-C910:default:memcpy	5	5	184.40 (± 0.00%)	106.66 (± 0.00%)	1.73x	0.00

Geometric mean					1.13x
Geometric mean - pVal < 0.05					1.13x

You do pay with a bit of extra memory, as each thread uses an ~128 MB buffer at run-time, but that should not be a big issue.

One nice property of splitting traces files like this is that this strategy works for every format, though I do not suggest implementing it for VCD, as Verilator can already do parallel VCD tracing efficiently (VCD is trivial enough to stitch fragments together at run-time), unless there is some special interest.

Before this can be merged, there should be:

Agreement from at least one of the viewers to support opening such split files
A discussion about how to expose this through the Verilator CLI (new --trace-split option? default with threads? etc.), considering all the other existing --trace-* options
Add specific tests (in this PR all --vltmt runs of tracing tests use the split scheme).

We need one trivial function added to the FST writer API, which tells us if a file buffer reached a size where it needs to be flushed (compressed). We use this to make sure all files are flushed at the same time, to avoid flushing of one file holding up execution.

Testing this required adding a vcdmerge function to the test driver, which can merge vcd files that are split this way. This is used to compare with the existing reference files (after feeding each subfile through fst2vcd).

Note: the 3 test failures are due to a bug in vcddiff. They all fail with --vlt and pass with --vltmt. If you swap the last 2 lines of the reference files, they will pass with --vlt, but fail with --vltmt.

The changes inside src/ (that is, to Verilator the compiler) are preliminary pending a decision on how to expose this in the CLI.

Otherwise the functionality is all there, feedback is appreciated.

wsnyder · 2025-02-26T23:16:37Z

Those are amazing speed ups, certainly a great justification for the work for the viewers (really for the fstReader) to support this.

Feel free to make a pre-pull with the typo and trivial fixes that can be made now, or even add the refactoring to support multiple writers. Always good to get what you can in so that the patch doesn't go as stale.

They all have the exact same hierarchy and metadata

Would we be better off instead having all of the definition information in a common file to reduce the duplication?

Note VCD has code to do rollover, and this requires that you have a separate header, then data files that rollover, so you can combine a header with the rollover files of choice (older ones perhaps being deleted). As part of this FST change I think we should support rollover too. That probably means you need the thread name and the rollover number e.g.

  0_0  // Thread 0, header  (or only one header file?)
  0_10  // Thread 0, 10th data file (earlier's deleted), e.g. starts at time 100,000
  0_11  // Thread 0, 11th data file (earlier's deleted), e.g. starts at time 110,000
  1_0  // Thread 1, header  (or only one header file?)
  1_10  // Thread 1, 10th data file (earlier's deleted), e.g. starts at time 100,000
  1_11  // Thread 1, 11th data file (earlier's deleted), e.g. starts at time 110,000

Or maybe to help sorting we should add zeros?

  000000_000000  // Thread 0, header  (or only one header file?)
  000000_000010  // Thread 0, 10th data file (earlier's deleted), e.g. starts at time 100,000
  000000_000011  // Thread 0, 11th data file (earlier's deleted), e.g. starts at time 110,000

If FST has some problems with doing rollover we can punt for now, but suggest we make sure that the naming would be forward compatible with this.

Agreement from at least one of the viewers to support opening such split files

AFAIK everyone is using the fst reader code, so if this can be isolated to just reader improvements we maybe done for all viewers.

A discussion about how to expose this through the Verilator CLI (new --trace-split option? default with threads? etc.), considering all the other existing --trace-* options

Yes, I suppose we should have a new flag. I'd suggest we always enable it with FST && threads. When we've made other FST writer enhancements (e.g. enums) we required a viewer upgrade. Those left behind can use --no-trace-split.

We need one trivial function added to the FST writer API.

I think the API uses "Get" in the name of accessors, I'd suggest renaming fstWriterWillFlushContextOnTimeChange to fstWriterGetFlushContextPending.

With that I suggest make a pull request on gtkwave. We can then get this into that API, and into Verilator, and will be ready.

Testing this required adding a vcdmerge function to the test driver, which can merge vcd files that are split this way.

I wonder if that should be a separate support binary?

Note: the 3 test failures are due to a bug in vcddiff.

Pull welcome ;)

gezalore · 2025-02-27T11:58:17Z

Let me try to frame the discussion a bit more and guide the thought process. There are 2 different concepts to think about here:

Should we improve the FST format to have native support for split files, or whatever, that enables better writer performance through parallelism.
Should we improve waveform viewers to be able to display multiple trace files as one.

I think you are thinking in terms of 1., while this PR in its current form is geared towards 2.

Having an improved format sure would be nice. If an FST 2.0 comes out from the discussion in gtkwave/gtkwave#70, that would be fantastic. Given there is some interest in working towards that based on that discussion, it might not be justified to extend FST at this stage.

This however is independent from the question in the 2nd point. There are more generic use cases for a waveform viewer to be able to display multiple trace files as one. For example, you could have SystemC + Verilated model, or some other cosim environment, where each component can generate a separate trace file. They could even be of different formats if you bolt different tools together. Being able to load all traces together, by merging hierarchies would be useful on it's own, and it would also enable the performance gain presented in this PR, to add to the argument for the viewer to able to display multiple traces as one. I believe currently neither GTKWave nor Surfer can do this.

The nice thing about independent split traces like in this PR is that it enables parallelism for all formats, not just FST, though realistically we won't have many...

Would we be better off instead having all of the definition information in a common file to reduce the duplication?

If you think about it in terms of the 1st point above, then yes, maybe, but even in that case the FST header for a design with 1.2M signals is less than 5MB, so duplicating it only saves constant space and is likely not significant.

If you think about it in terms of point 2 above, then this is out of scope for the discussion.

As part of this FST change I think we should support rollover too.

With point 2 in mind, I am not proposing an FST change here, so this is out of scope. However I agree FST 2.0 should have rollover support.

AFAIK everyone is using the fst reader code

I believe this is not true, in particular Surfer uses https://github.com/ekiwi/fst-reader which is a native Rust implementation.

Yes, I suppose we should have a new flag. I'd suggest we always enable it with FST && threads.

Let me know if you think different after the above. I'm kind of happy either way. I'm more interested in whether we can drop --trace-threads (which is ignored by VCD tracing anyways), and remove the related 'offload' mode from include/verilated_trace*. Or if we must keep it for backward compatibility. Keep in mind there is no offline fstmerge tool available, so if you must have a single file, currently it must be written as one. However, how many files you write (and how many threads you use, up to the value of --threads) can be decided entirely at run-time, so no need for the compile time option.

I wonder if that should be a separate support binary? (vcdmerge)

I thought about that, but decided against it. For our testing purposes, with small files of well defined shapes, the ~200 lines of noddy python I added works fine, but is does not scale to large files (it's in memory text wrangling!), nor can it handle generic, mismatched hierarchy, and that is something I do not want to commit to maintaining either at this point, so no standalone tool for now.

Pull welcome ;) (vcddiff bug)

Or maybe just diff it in driver.py, now that I added a VCD parser there, and remove the external dependency on vcddiff?

With that in mind, if you agree with the approach, then I will have the conversations with the waveform viewers, and if there is willingness, I will PR fstWriterGetFlushContextPending in the upstream FST writer, then polish of this PR, but otherwise propose no other changes to FST right now.

wsnyder · 2025-02-27T12:44:45Z

You're right I wasn't thinking about 1 vs 2, but rather how to get ultimately to the user seeing one file.

As long as we retain a way to get a single trace file (with lower performance), I'm ok with dropping --trace-threads, but open to other's opinions.

I prefer to keep vcddiff as use it on very large files separately, and think it's good to maintain some separation - ideally we'd put more there vs in driver.py.

gezalore · 2025-02-27T13:12:47Z

As long as we retain a way to get a single trace file (with lower performance), I'm ok with dropping --trace-threads, but open to other's opinions.

That would be available, via a runtime mechanism (+verilator+no_split_trace, or VerilatedFst::noSplit(true)), as you said, with lower performance without offloading and without the FST writer thread. (Actually the offload and FST writer threads provide some interference with the verialted thread pool so their performance effect is not always clear.)

how to get ultimately to the user seeing one file.

As discussed, that is probably a question for FST 2.0. I you object to emitting multiple files (sounds like you don't?), we could feed them through the moral equivalent of tar, but I am somewhat reluctant to do that as it just adds another non-standard container format on top that the consumers would have to be aware of. For this reason I went with the current 'directory' method, which makes the dump easy to handle as a unit, but ultimately it's just multiple standard dump files. I'm open to alternatives here (trace.fst.0, trace.fst.1, ... or similar?) if there is a better idea that doesn't cause undue complexity in downstream tools (as in the argument above, being able to view multiple files together can be justified based on a cosim environment as well, and might be more generally useful).

wsnyder · 2025-02-27T13:22:23Z

I think multiple files is the correct way, as it will perform best, allow rollover, and allows potential multi-system simulations.

wsnyder · 2025-02-27T13:23:39Z

BTW I also like your directory idea (versus e.g. a prefix) - it's much easier to move a "trace-worth" of files around. I might change the rollover example tests to do similar (I don't think a verilator change is needed).

wsnyder · 2025-02-27T13:25:27Z

Looking quickly I take that back, the rollover naming is part of verilated_vcd_c, so I'll avoid making a change at this time.

oscargus · 2025-03-01T10:40:47Z

include/gtkwave/fstapi.c

+        return xc
+                && !xc->is_initial_time
+                && ((xc->vchg_siz >= xc->fst_break_size) || (xc->flush_context_pending));
+}


Probably it would be good to get this in the original gtkwave files as well, to help other simulators adapt.

ecstrema · 2025-03-02T16:33:28Z

Wha exactly do you call rollover?

gezalore · 2025-03-02T16:38:20Z

Wha exactly do you call rollover?

When the trace file gets large, start throwing away records from earlier simulation time (that is, keep only the tail end of the trace), to limit file size. Verilator supports this for VCD where you can give it a file size limit. See test_regress/t/t_trace_rollover.cpp

gezalore · 2025-03-04T12:19:52Z

Adding this diagram here as a reference on what this patch does, mostly so I can reference it elsewhere.

For verilator#5806 via veripool/vcddiff#5

For #5806 via veripool/vcddiff#5

gezalore marked this pull request as draft February 26, 2025 16:26

wsnyder mentioned this pull request Feb 28, 2025

Add --trace-saif for SAIF power traces #5812

Merged

gezalore mentioned this pull request Feb 28, 2025

Fix peaking at end of file veripool/vcddiff#5

Merged

gezalore force-pushed the splitfst branch from 5b6aa7d to bb145e5 Compare February 28, 2025 17:28

oscargus mentioned this pull request Mar 1, 2025

Support aggregating multiple FST files ekiwi/wellen#23

Open

oscargus reviewed Mar 1, 2025

View reviewed changes

gezalore mentioned this pull request Mar 4, 2025

Support opening multiple files gtkwave/gtkwave#418

Open

gezalore mentioned this pull request Mar 4, 2025

Add fstWriterGetFlushContextPending to writer API. gtkwave/libfst#8

Merged

gezalore added a commit to gezalore/verilator that referenced this pull request Mar 4, 2025

Bump vcddiff

ddc408f

For verilator#5806 via veripool/vcddiff#5

gezalore mentioned this pull request Mar 4, 2025

Bump vcddiff #5828

Merged

gezalore force-pushed the splitfst branch from bb145e5 to 6be0a9e Compare March 4, 2025 16:20

gezalore added a commit that referenced this pull request Mar 4, 2025

Bump vcddiff (#5828)

2cad858

For #5806 via veripool/vcddiff#5

gezalore force-pushed the splitfst branch 2 times, most recently from 2af1967 to 2929e50 Compare March 4, 2025 17:14

oscargus mentioned this pull request Mar 5, 2025

Parallel generation of fst-files nickg/nvc#1169

Open

gezalore mentioned this pull request Mar 13, 2025

Update libfst from upstream #5855

Merged

Produce split FST files when using multiple threads

34eecc9

gezalore force-pushed the splitfst branch from 2929e50 to 34eecc9 Compare March 13, 2025 13:01

Apply 'make format'

c73ee1b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Produce split FST files when using multiple threads #5806

Produce split FST files when using multiple threads #5806

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Produce split FST files when using multiple threads #5806

Are you sure you want to change the base?

Produce split FST files when using multiple threads #5806

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!