Sparkplug is designed to compile fast. Very fast. So fast, that we can pretty much compile whenever we want, allowing us to tier up to Sparkplug code much more aggressively than we can to TurboFan code.
There are a couple of tricks that make the Sparkplug compiler fast. First of all, it cheats; the functions it compiles have already been compiled to bytecode, and the bytecode compiler has already done most of the hard work like variable resolution, figuring out if parentheses are actually arrow functions, desugaring destructuring statements, and so on. Sparkplug compiles from bytecode rather than from JavaScript source, and so doesn’t have to worry about any of that.
The second trick is that Sparkplug doesn’t generate any intermediate representation (IR) like most compilers do. Instead, Sparkplug compiles directly to machine code in a single linear pass over the bytecode, emitting code that matches the execution of that bytecode. In fact, the entire compiler is a `switch` statement inside a `for` loop, dispatching to fixed per-bytecode machine code generation functions.
```cpp
// The Sparkplug compiler (abridged).
for (; !iterator.done(); iterator.Advance()) {
  VisitSingleBytecode();
}
```
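The dispatch inside `VisitSingleBytecode` can be imagined as a `switch` over the current bytecode. This toy sketch uses hypothetical bytecode names and emitters (not V8's actual ones), and just records which emitter would run:

```cpp
#include <cstdint>
#include <string>

// Hypothetical bytecodes; V8's real set is much larger.
enum class Bytecode : uint8_t { kLdaZero, kAdd, kReturn };

// In the real compiler each case emits fixed machine code for one
// bytecode; this toy just reports which emitter would be invoked.
std::string VisitSingleBytecode(Bytecode bytecode) {
  switch (bytecode) {
    case Bytecode::kLdaZero: return "EmitLdaZero";
    case Bytecode::kAdd:     return "EmitAdd";
    case Bytecode::kReturn:  return "EmitReturn";
  }
  return "unreachable";
}
```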
The lack of IR means that the compiler has limited optimisation opportunity, beyond very local peephole optimisations. It also means that we have to port the entire implementation separately to each architecture we support, since there’s no intermediate architecture-independent stage. But, it turns out that neither of these is a problem: a fast compiler is a simple compiler, so the code is pretty easy to port; and Sparkplug doesn’t need to do heavy optimisation, since we have a great optimising compiler later on in the pipeline anyway.
Technically, we currently do two passes over the bytecode — one to discover loops, and a second one to generate the actual code. We’re planning on getting rid of the first one eventually though.
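The loop-discovery pre-pass can be sketched on a toy bytecode encoding (entirely hypothetical, not Ignition's): any backwards jump is a loop back edge, and its target is a loop header that code generation needs to know about in advance:

```cpp
#include <set>
#include <vector>

// Toy instruction: a jump carries the index it jumps to.
struct Instr {
  bool is_jump;
  int target;  // jump target index, meaningful only if is_jump
};

std::set<int> FindLoopHeaders(const std::vector<Instr>& bytecode) {
  std::set<int> headers;
  for (int i = 0; i < static_cast<int>(bytecode.size()); ++i) {
    // A jump whose target is at or before it is a back edge,
    // so its target is a loop header.
    if (bytecode[i].is_jump && bytecode[i].target <= i) {
      headers.insert(bytecode[i].target);
    }
  }
  return headers;
}
```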
Adding a new compiler to an existing mature JavaScript VM is a daunting task. There are all sorts of things you have to support beyond just standard execution: V8 has a debugger, a stack-walking CPU profiler, stack traces for exceptions, integration into tier-up, on-stack replacement to optimized code for hot loops… it’s a lot.
Sparkplug does a neat sleight-of-hand that simplifies most of these problems away, which is that it maintains “interpreter-compatible stack frames”.
Let’s rewind a bit. Stack frames are how code execution stores function state; whenever you call a new function, it creates a new stack frame for that function’s local variables. A stack frame is defined by a frame pointer (marking its start) and a stack pointer (marking its end):
When a function is called, the return address is pushed to the stack; this is popped off by the function when it returns, to know where to return to. Then, when that function creates a new frame, it saves the old frame pointer on the stack, and sets the new frame pointer to the start of its own stack frame. Thus, the stack has a chain of frame pointers, each marking the start of a frame which points to the previous one:
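That chain of saved frame pointers is what makes stack walking cheap. Here is an illustrative toy (not V8 code; the `Frame` struct is purely hypothetical) that walks such a chain:

```cpp
#include <cstddef>

// Each frame begins with the caller's saved frame pointer, so walking
// the stack is just following that chain until a null sentinel.
struct Frame {
  Frame* caller_fp;  // the saved frame pointer of the calling function
};

int CountFrames(const Frame* fp) {
  int depth = 0;
  while (fp != nullptr) {
    ++depth;
    fp = fp->caller_fp;  // hop to the caller's frame
  }
  return depth;
}
```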
Strictly speaking, this is just a convention followed by the generated code, not a requirement. It’s a pretty universal one though; the only time it’s really broken is when stack frames are elided entirely, or when debugging side-tables can be used to walk stack frames instead.
This is the general stack layout for all types of function; there are then conventions on how arguments are passed, and how the function stores values in its frame. In V8, we have the convention for JavaScript frames that arguments (including the receiver) are pushed in reverse order on the stack before the function is called, and that the first few slots on the stack are: the current function being called; the context it is being called with; and the number of arguments that were passed. This is our “standard” JS frame layout:
This JS calling convention is shared between optimized and interpreted frames, and it’s what allows us to, for example, walk the stack with minimal overhead when profiling code in the performance panel of the debugger.
In the case of the Ignition interpreter, the convention gets more explicit. Ignition is a register-based interpreter, which means that there are virtual registers (not to be confused with machine registers!) which store the current state of the interpreter — this includes JavaScript function locals (var/let/const declarations), and temporary values. These registers are stored on the interpreter’s stack frame, along with a pointer to the bytecode array being executed, and the offset of the current bytecode within that array:
Sparkplug intentionally creates and maintains a frame layout which matches the interpreter’s frame; whenever the interpreter would have stored a register value, Sparkplug stores one too. It does this for several reasons: it simplifies Sparkplug compilation itself, since the compiler can simply mirror the interpreter’s behaviour without maintaining any mapping from interpreter registers to machine state; it makes tiering between the interpreter and Sparkplug almost free, since interpreter frames and Sparkplug frames are interchangeable; and it means that the machinery built around interpreter frames (the debugger, the profiler, stack walking for exceptions) works on Sparkplug frames with little or no change.
There is one small change we make to the interpreter stack frame, which is that we don’t keep the bytecode offset up-to-date during Sparkplug code execution. Instead, we store a two-way mapping from Sparkplug code address range to corresponding bytecode offset; a relatively simple mapping to encode, since the Sparkplug code is emitted directly from a linear walk over the bytecode. Whenever a stack frame access wants to know the “bytecode offset” for a Sparkplug frame, we look up the currently executing instruction in this mapping and return the corresponding bytecode offset. Similarly, whenever we want to OSR from the interpreter to Sparkplug, we can look up the current bytecode offset in the mapping, and jump to the corresponding Sparkplug instruction.
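A minimal sketch of such a two-way table, assuming a simple sorted-pair encoding (V8's actual encoding is more compact): because Sparkplug emits code in one linear walk over the bytecode, both columns are monotonically increasing, so a single sorted table serves lookups in both directions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct PositionEntry {
  uint32_t code_offset;      // start of the machine code emitted for...
  uint32_t bytecode_offset;  // ...this bytecode
};

// Given a PC offset into the Sparkplug code, find the bytecode offset
// of the currently executing instruction (e.g. for stack frame access).
uint32_t CodeToBytecode(const std::vector<PositionEntry>& table,
                        uint32_t code_offset) {
  auto it = std::upper_bound(
      table.begin(), table.end(), code_offset,
      [](uint32_t pc, const PositionEntry& e) { return pc < e.code_offset; });
  return std::prev(it)->bytecode_offset;
}

// Given a bytecode offset (e.g. for OSR from the interpreter), find
// the Sparkplug code offset to jump to.
uint32_t BytecodeToCode(const std::vector<PositionEntry>& table,
                        uint32_t bytecode_offset) {
  auto it = std::lower_bound(table.begin(), table.end(), bytecode_offset,
                             [](const PositionEntry& e, uint32_t bc) {
                               return e.bytecode_offset < bc;
                             });
  return it->code_offset;
}
```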
You may notice that we now have an unused slot on the stack frame, where the bytecode offset would be; one that we can’t get rid of since we want to keep the rest of the stack unchanged. We re-purpose this stack slot to instead cache the “feedback vector” for the currently executing function; this is the vector that stores object shape data, and needs to be loaded for most operations. All we have to do is be a bit careful around OSR to make sure that we swap in either the correct bytecode offset, or the correct feedback vector for this slot.
Thus the Sparkplug stack frame is the interpreter’s stack frame, with the bytecode offset slot replaced by the feedback vector.
Sparkplug actually generates very little of its own code. JavaScript semantics are complex, and it would take a lot of code to perform even the simplest operations. Forcing Sparkplug to regenerate this code inline on each compilation would be bad for multiple reasons: it would increase compile time, it would increase the memory consumption of Sparkplug code, and it would mean re-implementing a lot of JavaScript functionality that already exists elsewhere in V8, with more re-implementation meaning more potential for bugs.
So instead of all this, most Sparkplug code just calls into “builtins”, small snippets of machine code embedded in the binary, to do the actual dirty work. These builtins are either the same one that the interpreter uses, or at least share the majority of their code with the interpreter’s bytecode handlers.
In fact, Sparkplug code is basically just builtin calls and control flow:
You might now be thinking, “Well, what’s the point of all this then? Isn’t Sparkplug just doing the same work as the interpreter?” — and you wouldn’t be entirely wrong. In many ways, Sparkplug is “just” a serialization of interpreter execution, calling the same builtins and maintaining the same stack frame. Nevertheless, even just this is worth it, because it removes (or more precisely, pre-compiles) those unremovable interpreter overheads, like operand decoding and next-bytecode dispatch.
It turns out, interpreters defeat a lot of CPU optimisations: static operands are dynamically read from memory by the interpreter, forcing the CPU to either stall or speculate on what the values could be; dispatching to the next bytecode requires successful branch prediction to stay performant, and even if the speculations and predictions are correct, you’ve still had to execute all that decoding and dispatching code, and you’ve still used up valuable space in your various buffers and caches. A CPU is effectively an interpreter itself, albeit one for machine code; seen this way, Sparkplug is a “transpiler” from Ignition bytecode to CPU bytecode, moving your functions from running in an “emulator” to running “native”.
So, how well does Sparkplug work in real life? We ran Chrome 91 with a couple of benchmarks, on a couple of our performance bots, with and without Sparkplug, to see its impact.
Spoiler alert: we’re pretty pleased.
The below benchmarks list various bots running various operating systems. Although the operating system is prominent in the bot’s name, we don’t think it actually has much of an impact on the results. Rather, the different machines also have different CPU and memory configurations, which we believe are the main source of the differences.
Speedometer is a benchmark that tries to emulate real-world website framework usage, by building a TODO-list-tracking webapp using a couple of popular frameworks, and stress-testing that app’s performance when adding and deleting TODOs. We’ve found it to be a great reflection of real-world loading and interaction behaviours, and we’ve repeatedly found that improvements to Speedometer are reflected in our real-world metrics.
With Sparkplug, the Speedometer score improves by 5–10%, depending on which bot we’re looking at.
Speedometer is a great benchmark, but it only tells part of the story. We additionally have a set of “browsing benchmarks”, which are recordings of a set of real websites that we can replay, script a bit of interaction, and get a more realistic view of how our various metrics behave in the real world.
On these benchmarks, we chose to look at our “V8 main-thread time” metric, which measures the total amount of time spent in V8 (including compilation and execution) on the main thread (i.e. excluding streaming parsing or background optimized compilation). This is our best way of seeing how well Sparkplug pays for itself while excluding other sources of benchmark noise.
The results are varied, and very machine and website dependent, but on the whole they look great: we see improvements on the order of around 5–15%.
In conclusion: V8 has a new super-fast non-optimising compiler, which improves V8 performance on real-world benchmarks by 5–15%. It’s already available in V8 v9.1 behind the `--sparkplug` flag, and we’ll be rolling it out in Chrome 91.