Aggressive Unboxing in the Dart VM
Natal-RN
October 2020
Victor Agnez Lima
Advisor
Elizabeth Ferreira Gouvêa Goldbarg, PhD (DIMAp - UFRN)
Co-supervisor
Stefan Martin Kustermann, MSc (Google LLC)
I’d like to start by thanking my advisor Elizabeth Goldbarg for the guidance and
learning experiences she has given to me during my graduation course, and my co-supervisor
Martin Kustermann, who taught me and helped me throughout and after my internship.
They were always ready to provide me with the best advice whenever I needed it. I would
also like to thank professors Monica Pereira and Martin Musicante, for accepting to be part
of the examining board, and my former manager Samir Jindel, for the amazing experience
with my internship project.
In addition, I would like to thank Google LLC for the opportunity of doing such an
internship, and my university UFRN for the academic support and for making all of this
possible, especially the Programa Talento Metrópole, for empowering me to be here and
go further.
Finally, regarding aspects not directly related to this work, I would like to thank my
family and my dear Cecília, for all the support, encouragement and unconditional love.
Also, I want to thank my friends (classmates and professors) from UFRN, IFRN, ICPC
(especially the contestants, coaches and organizers that make ICPC possible at UFRN),
CEI and all the other friends I met along this journey, for all the lessons, laughs, good
talks and pleasant moments we had together, particularly my best friend for more than 20
years, Luiza Egito. I wouldn’t be here without all of you.
“What we know is a drop, what we don’t know is an ocean.”
Isaac Newton
Abstract

Dart is an open-source programming language developed by Google LLC and popular for the development of applications for mobile devices. This work presents a series of improvements made to the optimizing compiler of the Dart Virtual Machine, developed during an internship project with the goal of making applications smaller and faster. To this end, the project focused on keeping more numeric values unboxed, adding support for instances with unboxed fields and for methods to return and receive such values as parameters. As a result, the execution time of several Dart applications was significantly reduced, and important mobile applications became more than 2% smaller.
List of figures
List of tables
List of algorithms
List of abbreviations
1 Introduction p. 15
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 20
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 22
1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 22
1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 22
3.1.6 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 50
5 Final Remarks p. 67
References p. 68
CL — Changelist
GC — Garbage Collector
IL — Intermediate Language
OS — Operating System
PC — Program Counter
VM — Virtual Machine
1 Introduction
Originally designed with an unsound type system, Dart 1.x (or simply Dart 1) was an
optionally typed language. This means, in particular, that types had no effect on runtime
semantics, being used mainly as annotations to improve readability and for documentation.
However, in August 2018, Dart 2.0 was released, containing several updates to the language,
including a new and sound type system and turning Dart into a statically typed language.
Thanks to its sound type system, users of Dart 2 can trust that their programs will never reach a state where an expression evaluates to a value that is not permitted by the static type of that expression.
The new type system is also expected to improve even more: every type in Dart is
currently nullable, that is, the null value can be assigned to variables of any type. This will
not be true after a new feature, called Non-Nullability By Default (NNBD) and currently
available as a technology preview, is released. With this feature, type annotations that are
not marked as nullable will not accept null values anymore. These changes will ensure null
safety, avoiding bugs and allowing the compiled code to be further optimized.
The syntax of Dart is similar to some popular programming languages such as C++
(STROUSTRUP, 2000) and Java (ARNOLD et al., 2000). Users of Dart do not need to manage
memory: for this, it uses a generational garbage collector. Dart has single (multilevel)
inheritance, but it also supports implementing interfaces, which are implicitly defined for
each class, and using mixins, which are a way of reusing a given class body in multiple
class hierarchies. The following books present overviews of the language: (WALRATH;
LADD, 2012), (BALBAERT, 2014), (KOPEC, 2014), (BRACHA, 2015) and (SINHA, 2020).
Currently, one of the main uses of Dart is through the Flutter framework (official
website: https://flutter.dev/), a cross-platform toolkit for developing mobile, web and
desktop applications that has Dart as the development language. Web applications in Dart
can be compiled to the JavaScript language (GOODMAN; EICH, 2001), while mobile and
desktop applications can run using Dart Native compilation.
Dart Native compiles Dart code to machine code of the target architecture, supporting
ARMv7 (32-bit), ARMv8 (64-bit) and x86-64 (also called x64) architectures¹. It can run
in Just In Time (JIT) mode and Ahead Of Time (AOT) mode.
On the one hand, JIT mode enables a faster development cycle, as it allows the
binaries to change dynamically; performs runtime optimizations; and supports
hot reload, which injects changes in the source code into the running program, preserving
its current state — making it unnecessary to restart the application in order to run
the updated code. On the other hand, Flutter uses AOT for production as it results in
optimized and predictable code with a lower startup latency and smaller code size, which
is critical for mobile applications. This project focuses on the AOT compilation mode.
Since Dart 2 was released, AOT mode has been remarkably improved, aiming to
generate more efficient and smaller code. This was possible thanks to many of the new
changes, including the sound type system. Another change in the language that came
with Dart 2 was the semantics around integer values: in Dart 1, integers had arbitrary
precision, while Dart 2 introduced 64-bit integers with wrap-around semantics, that is, all
arithmetic operations are performed modulo 2⁶⁴ (arbitrary precision integers became a
separate class named BigInt). This change was convenient because, in practice, before
that, every integer operation had to check if the result overflowed and possibly allocate an
integer object with higher precision to store the result. There were three ways to represent
the primitive integers, which will be discussed in detail ahead: smis (small integers),
mints (medium integers) and big-ints. In JIT mode, the code for those allocations could
be generated lazily when overflows were detected; in AOT mode, however, each such
operation would have to contain the code for all integer types, resulting in much bigger
generated code and, in many cases, worse runtime performance. The 64-bit integers avoid
those issues and have a range of values that is suitable for most real world applications,
¹ JIT mode also supports the IA-32 architecture.
justifying the decision — in general, the cases that need big-ints are not only rare but
also easy to detect.
Integer objects (mints), like all other objects in Dart, have a uniform layout, which
starts with a one-word size header, containing information such as an integer representing
its class (the class ID), among other information. After the header, comes the data inside
the object: for most types, including all user-defined types, it contains the contents of
the fields enclosed by that object, which consisted only of tagged pointers when this
project started; the exceptions are some predefined classes, such as numeric primitives
(e.g. a floating-point object), which do not contain fields, but, instead, contain the bits
representing their values. Pointers are used because the type of each field might change at
runtime: although each field has a static type, it can hold values of any subtype of that
type, including null. Also, every Dart object has an alignment of at least one word.
The reason why all pointers in Dart are said to be tagged is related to a particular
way of handling integers: the aforementioned smi. It was previously noted that big-ints
are rarely used; in fact, most integers in real world programs fit in a much smaller range,
which is used by smis: the range of signed integers that can be represented with fewer bits
than one word (i.e., 31 bits on 32-bit architectures, and 63 bits on 64-bit architectures).
For this reason, the following optimization is performed: for each value that fits in the smi
range, instead of allocating an object to hold this integer, the value is shifted one bit to
the left, and the least significant bit is set to 0. Additionally, the least significant bit of
each pointer is set to 1, so that it is possible for the VM to distinguish between tagged
pointers and smis, by checking the value of the least significant bit — called tagging bit.
This is safe for two reasons. For pointers, the least significant bit of the address
of any object must be 0, due to the alignment of Dart objects. For smis, thanks to
the way the range of values was chosen, the bits discarded (by the left shift
and by truncating the value to one-word size) are all equal to the most significant
remaining bit (all of them 0 if the value is non-negative, or 1 if it is negative), so
the operation does not lose information either.
class Point {
var x, y;
}
storeValues(xValue, yValue) {
globalPoint.x = xValue;
globalPoint.y = yValue;
}
main() {
storeValues(50, 5000000000);
}
Program 1: Sample program that stores values 50 and 5000000000 into fields.
Program 1 is a Dart program that stores the values 50 and 5000000000 into fields x and y,
respectively, of globalPoint, a global variable of type Point. Figure 1 illustrates what
happens when those values are stored on a 32-bit architecture.
At the top of figure 1, we have the decimal notation of the value to store into each of
the two fields. Since this program is assumed to run on a 32-bit architecture,
the range of smi values is [−2³⁰, 2³⁰ − 1]. The example on the left shows that, since the value
50 is inside this range, it is represented as a smi, being truncated and shifted to the left.
However, on the right, the value does not fit in the smi range, therefore an allocation
needs to be performed: a mint object (which requires 1 word for the header and 2 words
for the data) is allocated, then the value 5000000000 is stored inside that object, and the
address of that object is used (in this example, the address was 0xFFAABB88), having its
tagging bit set to 1 to indicate that it is not a smi (so it would become 0xFFAABB89). The
output at the bottom of the figure contains the bits that would in fact be stored into the
field. Such an operation, which receives a scalar value and returns a tagged pointer (or smi), is
called a boxing operation (and its inverse is an unboxing operation). Additionally, figure 2
illustrates what would be the layout of objects of program 1 before and after calling the
method storeValues for that execution. For the sake of simplicity, this figure elides any
padding added to the objects due to their alignment.
Smis are useful to avoid pointer dereferencing when retrieving their scalar values and,
mainly, to avoid excessive object allocations. Retrieving the value from (i.e., untagging) a
smi can be done by simply performing a signed right shift on its bits; similarly, retrieving
the pointer address from a tagged pointer only requires the tagging bit to be reset. One
might wonder why the least significant bit is set to 1 for pointers and
0 for smis and not the opposite; in fact, some benefits come from this choice: it allows
arithmetic operations such as addition and subtraction to be performed directly on smis,
without having to untag them, as long as the operation does not overflow. Besides using
object alignment to tag pointers, there are many other approaches for tagging words that
could have been used instead, many of which are explained in (GUDEMAN, 1995). For
instance, the set of 64-bit floating-point values that are interpreted as Not a Number
(NaN) according to the IEEE standard (ZURAS et al., 2008) is large and would allow 51
bits to be encoded.
All information presented in this document about the Dart compiler refers to the latest
version of the Dart SDK repository on GitHub (https://github.com/dart-lang/sdk)
available on March 6, 2020.
1.2 Motivation
A value is said to be boxed (or tagged) if it is a smi or a tagged pointer to
an object, and unboxed when it is a scalar value (e.g. an int value is unboxed if it stores
the integer value itself). As noted previously, boxed values are important in order to allow
variables to hold values of different types. Even if variables are annotated to be of double
type, for instance, boxed values would be needed to distinguish the null value from the
others. However, when compiling in AOT mode, it is possible to prove in certain scenarios
that some variables will never hold the null value. This is done by a global static analysis
called Type Flow Analysis (TFA). In addition, as Dart is expected to enable NNBD in
the near future, users of Dart will be able to state whether their types are nullable or not.
In any case, when numeric values are known not to be nullable, keeping them unboxed
class Wrapper {
double value;
Wrapper(this.value);
increment(double toAdd) {
value += toAdd;
}
}
main() {
final w = Wrapper(3.0);
w.increment(1.5);
print(w.value);
}

Program 2: A class that wraps a floating-point value and has a method to increment it.
would allow better performance and memory management, since it would avoid object
allocations and pointer dereferencing when operating those values, and also result in a
smaller code size, which is currently one of the main goals of the Dart compiler. For this
reason, the project focused on adding features to allow more values to be unboxed. The
main changes performed in the Dart compiler were supporting classes to hold unboxed
fields and functions to receive and return unboxed values. Also, the boxing instructions of
integers on ARMv7 were optimized to reduce the generated code size.
Consider, for example, a class that wraps a floating-point value and has a method
to increment that value. Program 2 gives one possible implementation for it. Before the
changes presented in this project, the field value would always be boxed, even if it always
stores a non-nullable double. Similarly, the increment method would always expect the
parameter toAdd to be boxed. Then, whenever a call is performed to increment, it would
be necessary to unbox the variables value and toAdd, sum those values, perform a boxing
operation — which includes an allocation — and, finally, store the result into value.
Allowing parameters and fields to be unboxed would make this operation simpler: the
parameter and the field would already be unboxed, so it would be possible to sum them
directly, and then the result would be ready to be stored into the field without performing
any allocation. This example illustrates some of the improvements this project provides to
the Dart compiler. Here, we also ignore any optimization the compiler could have done
due to the simplicity of the program, such as inlining the method increment or replacing
its parameter by a constant (since there is only one value passed to it, 1.5).
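The saved work can be sketched in IL-like pseudocode (hypothetical instruction names modeled on the ones described in this document; not actual compiler output):

```
// Boxed (before this project):
v1 <- LoadField(this, value)         // loads a boxed double
v2 <- Unbox(v1)
v3 <- Unbox(toAdd)                   // parameter arrives boxed
v4 <- BinaryDoubleOp(+, v2, v3)
v5 <- Box(v4)                        // heap allocation
StoreInstanceField(this, value, v5)

// Unboxed (after this project):
v1 <- LoadField(this, value)         // already an unboxed double
v2 <- BinaryDoubleOp(+, v1, toAdd)   // parameter arrives unboxed
StoreInstanceField(this, value, v2)
```

The unbox/box pair, and with it the allocation on every call, disappears entirely.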
1.3 Objectives
1.4 Methodology
The impact of the project was assessed by running several benchmarks on the internal
system, which gives statistics about runtime and generated code size, among others. The
results compared a patch that disabled all changes of this project in the code base
against the master branch after the entire project was submitted.
1.5 Contribution
Chapter 2 presents the changes that were submitted while developing support for
unboxed fields; additionally, it includes technical information about Dart AOT compilation
(section 2.1), the internship's starting task of updating the boxing instructions (section
2.2) and changes made to add support for inlining SIMD arithmetic operations (section
2.8). Chapter 3 focuses on allowing parameter and return values to be unboxed; and it is
divided into two sections. Section 3.1 presents the changes for supporting unboxed values
when performing calls to static methods and constructors, while section 3.2 extends the
support to non-static methods as well. Chapter 4 presents and discusses the results this
work achieved on several benchmarks, responsible for assessing both Flutter applications
and Dart AOT compilation. Finally, chapter 5 offers some final considerations and possible
future work.
Firstly, AOT compilation of Dart is divided into three main parts: the Common Front
End (CFE), followed by several transformations, and then the Dart VM (or simply VM).
CFE is responsible for parsing Dart programs, building the type hierarchy, reporting
syntax and static semantic errors, and generating the augmented abstract syntax tree (AST),
which is also called the kernel. Then, a series of transformations is made on the kernel,
including a global static analysis called Type Flow Analysis (TFA), which identifies (and
removes) unused methods and classes, and obtains more precise information on the types
of fields and variables, including nullability. The kernel is annotated with this information
as metadata and, after all transformations are done, it is serialized. Finally, the Dart VM
loads all kernel information, builds the control-flow graph (CFG) of each function, provides
the execution environment for the Dart program and generates the binaries. In AOT mode,
the VM is also divided into two parts called precompiler, which runs before executing the
Dart code, and precompiled runtime, the runtime environment. Both CFE and the kernel
transformations are written in Dart, while the VM is written in C++. Figure 3 presents a
diagram of the AOT compilation.
The vertices of the control-flow graph are blocks of low level instructions of an
intermediate representation, called Intermediate Language (IL). This representation is
in Static Single Assignment (SSA) form, meaning that it always defines variables before
using them, and each variable is assigned only once. The IL is the same for every target,
but each IL instruction might generate different assembly instructions depending on the
target architecture.
The “static” part of SSA means that there is only one definition-site for each variable,
even though it could be inside a loop that is executed multiple times at runtime. SSA
form is commonly used because it brings many advantages to data-flow analysis and
optimization algorithms, as it allows each SSA variable to be associated with its unique
definition and a list of all uses this variable has. Practical examples of optimizations
that might take advantage of SSA form are: dead code elimination, detecting redundant
computations, and detecting equality of variables (ALPERN; WEGMAN; ZADECK, 1988),
(ROSEN; WEGMAN; ZADECK, 1988).
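As a small illustration (IL-like pseudocode with hypothetical instruction names, not actual compiler output), a source fragment that assigns twice to the same variable gets one SSA variable per assignment, so every use points to a unique definition:

```
// Source:            // SSA form:
x = 1;                v1 <- Constant(1)
x = x + 2;            v2 <- BinaryInt64Op(+, v1, Constant(2))
y = x * x;            v3 <- BinaryInt64Op(*, v2, v2)
```

Here v2 has exactly one definition and two uses (both operands of the multiplication), which is the use-def structure the optimizations above rely on.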
The list below provides examples of IL instructions that were relevant to the development
of this project:
• Unbox instruction: The UnboxInstr receives a tagged value and returns the unboxed
representation of it. It also has more specific subtypes such as UnboxInt64Instr.
• Load field instruction: Receives a tagged pointer to an object and returns the value
of a given field of that object. Implemented as LoadFieldInstr.
• Store field instruction: Receives two variables, the first one being a tagged pointer to
an object. It stores into a given field of that object the value of the second variable.
The C++ class for this instruction is StoreInstanceFieldInstr.
• Check null instruction: The CheckNullInstr receives a tagged value and checks if it
is null. It is used, for instance, before accessing a field or a method of a variable of
nullable type. If the value is null, it throws an exception; otherwise, the execution
simply proceeds to the next instruction.
• Call instructions: Used to perform a call to a target function. There are several
instructions for calls, such as: StaticCallInstr, used when the target function is
known at compile time; ClosureCallInstr, used for closures; NativeCallInstr,
used when the target is a native function; and instance call instructions, used for calling
a method of an object when the target needs to be looked up at runtime
(e.g. DispatchTableCallInstr, PolymorphicInstanceCallInstr, etc.). More details
about those instructions are presented in chapter 3.
• Parameter instruction: This instruction is used in entry blocks (e.g. function entry)
to define the SSA variable of an incoming parameter. It does not emit any native
code itself. The C++ class for this instruction is named ParameterInstr.
• Return instruction: The ReturnInstr is responsible for generating the native code to
return from a function. It receives a variable with the value that should be returned.
• Instructions for arithmetic operations: Those are responsible for performing arithmetic
operations on several types. Two examples are the BinaryInt64OpInstr, which
operates over 64-bit integers, and SimdOpInstr, used for operating values of SIMD
data types (e.g. Float32x4, which stores four single precision floating-point values).
• Checked smi operation instruction: The CheckedSmiOpInstr performs an arithmetic
operation whose operands are likely to be smis: the code for the smi operation is
inlined, and a runtime check is added to verify if the objects are indeed smis. If the
tagging bit of any of those objects shows that it is not a smi, then an instance call is
performed, which will look up the target method. When the operands are smis, no
call needs to be performed and the operation is very efficient.
Some IL instructions have a fast path and a slow path. For instance, the boxing
instruction of a 64-bit integer has a fast path for the case when the value is a smi, and a
slow path, which needs to allocate a new object, for when it is not. Similarly, the null check
instruction has a slow path that creates and throws an exception when it receives a null
value, and the slow path of CheckedSmiOpInstr happens when an instance call needs to
be performed.
Regarding memory management, most VM objects, just like Dart objects, are allocated
in the Dart heap. For this reason, they cannot be allocated directly using the C++ new
operator. Such objects are called heap objects (or Raw objects¹), and are always operated
on through pointers or, preferably, through object handles. An example is the RawClass object, which
holds information about a given class, such as the size of instances of that class, its library,
etc., and the handle for this object is Class. Similarly, the RawField object is handled by
Field, and contains the offset of this field relative to the enclosing object's address, among
other pieces of information. These objects are available at both compile time and runtime.
The Dart VM manages memory by using a tracing Garbage Collector (GC). Briefly,
tracing collection determines which objects should be collected based on a set of roots: it
traces the heap from those roots and considers every object referenced by them (either
directly or by a chain of references) as reachable; then, it collects only the unreachable
objects. This strategy has advantages when compared to others, such as reference counting:
for example, it allows cycles of objects that reference each other to be collected without
requiring any more sophisticated algorithm to be used for this. More details about garbage
collection strategies and implementations are presented in (JONES, 1996).
There are several implementation strategies for tracing collectors. One such example
is the generational garbage collection (LIEBERMAN; HEWITT, 1983), which is based on
the hypothesis that a large portion of the objects that are reclaimed by GCs have a short
lifetime. Generational GCs split the data into generations according to their creation
time: the most recently created objects are in the youngest generation, the oldest objects
¹ The heap objects were renamed after a big refactoring in April of 2020 and are no longer called Raw objects. This refactoring was done to avoid using C pointers to represent tagged pointers: RawClass, for instance, became ClassLayout, and tagged pointers to it became ClassPtr instead of RawClass*.
are in the oldest generation, and so on. Then, the GC runs collection cycles on younger
generations more frequently than older generations. If an object survives after many
collection cycles, it is promoted to an older generation. By doing this, when the hypothesis
holds (which is the case for most real world applications), the collection cycles become
much more efficient. For this reason, the Dart VM uses a generational GC.
The GC used by the Dart VM has two generations, called new space and old space.
The former is used to store objects that were created more recently, which are expected
to become unreachable sooner, while the latter contains objects that are not expected to
become “garbage”. The root set includes objects in the new space that are referenced by
the stack, by variables or by objects in the old space.
When the GC is triggered, it runs an algorithm based on (CHENEY, 1970): the younger
generation (the new space) is henceforth called from-space and a new block of memory,
named to-space, is allocated; then, for each reachable object in the from-space, this object
is either promoted to the old space or copied to the to-space; finally, the from-space is
deallocated, and the to-space becomes the younger generation. While visiting the objects,
this algorithm also updates the references to the ones that are moved.
There are three configuration modes for the compiler, with only the last two being
intended for users of Dart: debug, release, and product. The debug mode is used for
Dart VM development, running several assertions in order to identify bugs more easily. The
release mode is useful for end-user development, supporting hot-reloading of application
code (in JIT mode) and having an API to examine the running application (e.g. used for
examining CPU profile, memory profile and for debugging). Finally, product is the most
optimized mode.
The first issue to be solved in the internship, before starting to work on support for unboxed
fields, was to change the boxing instructions on ARMv7. The reason was that each boxing
operation of a 64-bit integer was generating an unnecessarily large 88 bytes of ARM
machine code.
As mentioned in section 2.1, boxing instructions have a fast path, for when the integer is
a smi, and a slow path, for when it is not and it is therefore necessary to allocate a box for
that integer (see figures 1 and 2). The issue was that most of the code belonged to the
slow path and did not need to be inlined, as the slow path is already expected to be slow.
For this reason, a shared stub for the slow path was implemented, and each boxing
instruction would jump to this stub only if the integer value to be boxed was not a smi.
If there is not enough free space in the new space to allocate the box, a runtime call is
performed for that allocation, possibly triggering the GC, so the state of the live registers
needs to be saved (stored) onto the stack, as they might be clobbered, and restored after
the call. Since shared stubs can be called from multiple places, each of them having
different live registers, all registers that are not reserved are saved. The exception is the
FPU registers: if they are not being used, it is much cheaper not to save them. Therefore,
two similar stubs were created: one that saves all non-reserved registers, including FPU,
and another one that only saves regular registers.
Another concern was that part of the stub was generated by a shared stub generator,
but it did not have support for the return value, so it had to be updated. In order to make
the saved registers be seen as part of the caller frame, the registers are saved before entering
the stub frame. However, the return value of runtime calls is always stored at the top of
the stack, and this position cannot be one where a register was saved, since the registers
are pushed before entering the stub frame; so the change was to reserve extra space on
the stack just before performing the runtime call. This space must not store a value that
could be seen as a pointer, otherwise the GC would try to use it as an address, so an even
value is pushed (it is seen as a smi due to the tagging bit). Finally, after the
call, the result has to be moved from the stack to a register; the R0 register was chosen for
this. Since the registers are saved on the stack, the return value is simply stored into the
position where R0 was pushed. Figure 5 illustrates this procedure. The boxing instruction
was updated to always use R0 as the output register, so no data is lost when the return
value is stored into it.
Figure 5: Values on the stack when the shared stub performs a runtime call.
Link 1 of the appendix provides the domain where changelists (CLs) on the Dart SDK
codebase, either submitted or undergoing code review, are available. Link 2 contains the
shared stub implementation. The green lines of code (on the right) are the lines added
after this change, while the red ones, on the left, were deleted.
After updating the boxing instruction to use the shared stub, it started to emit only
32 bytes of native code: four instructions for the fast path, one instruction loading the
stub address, one jump to this address, and two instructions storing the integer value into
the allocated box. Link 3 of the appendix contains the changes on the boxing instruction,
and the entire CL can be accessed through link 4.
Later, another improvement was made: since the stub is at a fixed address, it is
possible to make a PC-relative call to that address, instead of loading the stub address into
a register and then jumping to it. This change saves four more bytes, reducing
the initial size of 88 bytes to 28 bytes for each boxing instruction. Link 5 of the appendix
provides the changes for this last improvement.
Figures 6 and 7 show the generated code before and after making those changes,
respectively.
The first line of each figure shows the IL instruction. The values starting with v (v28,
v3, v4, ...) are the SSA variables. The following lines contain the assembly code: the first
column shows the address; the second one, the instruction in hexadecimal notation; and
The first main goal of the internship was to add support for unboxed fields. Fields in
Dart were always stored tagged, but numeric (either integer or floating-point) values that
are known not to be nullable do not need a box. Avoiding the box could reduce the overhead
caused by boxing and unboxing instructions, resulting in better performance and smaller
code size.
The first decision to be made for supporting unboxed fields is the way those fields will
be identified. The RawField class already had a bit field called is_unboxing_candidate,
because JIT compilation supports unboxed fields², so this bit was used. It is possible to
know that the VM is running in AOT mode by checking the flag FLAG_precompiled_mode,
so this flag guards most of the changes in this project.
The required conditions for a field to be an unboxing candidate are: (1) it cannot
ever store the null value; (2) it needs to be of a numeric type; and (3) the compiler
configuration and target architecture need to support values of its type to be unboxed.
Condition (1) invalidates nullable-type fields. Moreover, it also invalidates late and static
fields, because they store the null value before being initialized. Due to condition (2),
the only accepted types are int, double, and SIMD types, but it does not need to be
the declared type: it can be the type inferred by TFA, which is serialized in the kernel
as metadata. Condition (3) can be checked easily in the VM by calling methods such as
SupportsUnboxedSimd128(). Therefore, this checking is done and the unboxing candidate
bit is set in the kernel loader, and the changes for this are available through link 6 of the
appendix.
² Actually, the "unboxed" fields in JIT are not exactly unboxed: the enclosing object still stores a reference to the field; however, instead of holding a reference to the Dart object, which has a header, it stores a reference to the memory address where the scalar value is stored.
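The three conditions above can be summarized in a short predicate. The sketch below is illustrative C++ (the VM's compiler is written in C++), but the type names, struct layout and helpers are assumptions for this document, not the VM's actual API:

```cpp
#include <cassert>

// Hypothetical mirror of the unboxing-candidate checks; illustrative only.
enum class InferredType { kInt, kDouble, kSimd128, kObject };

struct FieldInfo {
  InferredType type;  // type inferred by TFA, read from kernel metadata
  bool is_nullable;   // may the field ever hold null?
  bool is_static;     // static fields start out null
  bool is_late;       // late fields start out null as well
};

// Condition (3): does this configuration / target support unboxing the type?
bool SupportsUnboxed(InferredType type, bool supports_simd128) {
  switch (type) {
    case InferredType::kInt:
    case InferredType::kDouble:
      return true;
    case InferredType::kSimd128:
      return supports_simd128;  // cf. SupportsUnboxedSimd128()
    default:
      return false;  // condition (2): only numeric types qualify
  }
}

bool IsUnboxingCandidate(const FieldInfo& f, bool supports_simd128) {
  // Condition (1): the field can never hold null.
  if (f.is_nullable || f.is_static || f.is_late) return false;
  // Conditions (2) and (3).
  return SupportsUnboxed(f.type, supports_simd128);
}
```
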
34
There are two IL instructions responsible for operating directly with non-static fields:
LoadFieldInstr and StoreInstanceFieldInstr, in charge of loading from and storing
to instance fields, respectively. Those instructions had to be updated to generate a different
machine code for the case when the field is unboxed. The main difference is that, in this
case, the values would be loaded or stored inside the enclosing object, instead of a box
referenced by that object. Links 7 to 9 of the appendix provide the changes made for each architecture.
The AOT compiler supports cross-compiling code for 32-bit architectures even when running on a 64-bit device. To make this possible, the compiler needs information, such as
offsets, about the layout of objects on both host and target platforms.
There are many kinds of offsets that are used by the VM, such as the instance size
of each class, the offset of each field and the type arguments offset. Type arguments are
an array of types used by generic classes to store the type used on each instance. Those
offsets were computed using the assumption that each field requires exactly one-word size,
as they were expected to store a memory address. However, after enabling unboxed fields,
this might not be true anymore: the size of those fields might be 128 bits, in the case of SIMD fields, or 64 bits, in the case of int and double, which corresponds to two words on 32-bit platforms. Additionally, most offsets were stored in words since this value was the
same for every architecture. Now, they might be different on different platforms and, due
to cross-compilation, there are places where the correct offset to be used is the host offset,
while in other scenarios it is the target offset. For this reason, it was necessary to update
the compiler to keep both values.
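The size computation that breaks the one-field-one-word assumption can be illustrated as follows; the representation names and helpers are hypothetical, but the byte sizes follow the text above:

```cpp
#include <cassert>

// Illustrative only: how many bytes/words an instance field occupies on a
// target whose word size is `target_word_size` bytes (4 on 32-bit, 8 on
// 64-bit targets).
enum class Rep { kTagged, kUnboxedInt64, kUnboxedDouble, kUnboxedSimd128 };

int FieldSizeInBytes(Rep rep, int target_word_size) {
  switch (rep) {
    case Rep::kTagged:
      return target_word_size;  // one pointer-sized slot
    case Rep::kUnboxedInt64:
    case Rep::kUnboxedDouble:
      return 8;   // always 64 bits, regardless of the word size
    case Rep::kUnboxedSimd128:
      return 16;  // always 128 bits
  }
  return target_word_size;
}

int FieldSizeInWords(Rep rep, int target_word_size) {
  return FieldSizeInBytes(rep, target_word_size) / target_word_size;
}
```

Note that the same field occupies a different number of words on host and target when their word sizes differ, which is why both offsets must be kept.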
Several offsets are precomputed for each architecture by a module called the offsets extractor; the AOT compiler then uses the offsets for the architecture it is targeting. Although this module is usually used to retrieve information regarding the target, many host offsets were being used in places where target offsets were expected, because they were known to be the same (in words) on both host and target. In some cases, the offsets extractor did not even compute those offsets; this was the case for many instance sizes, so the module had to be updated to include them. It was also noticed that many classes have fields that do not exist in product mode or in the precompiled runtime (in order to obtain performance and memory improvements), while the offsets extractor only computed the values for the release mode of JIT compilation. Therefore, the module had to be changed to compute the offsets for both release and product configuration modes, and for both the JIT and precompiled runtimes.
Links 10 and 11 of the appendix contain the CLs responsible for the updates on the
offsets extractor.
Those CLs fixed the offsets that are precomputed by the offsets extractor, but many
other offsets — such as field offsets for Dart fields — are calculated in the VM. Consequently,
those offsets had to be updated to take into account the new size of the fields. Furthermore,
some offsets in the RawField and RawClass were duplicated to store both target and host
values, both of them being calculated together. The Class::CalculateFieldOffsets
method was updated to do so, and link 12 of the appendix provides such updates. This
method also computes a bitmap of unboxed fields, which is explained in section 2.6. Most
of the other changes in that file were made to update the offsets to be used, from host to
target.
Most updates in the two files available through the following links were made to ensure
that the appropriate offset is used: (1) link 13 of the appendix; and (2) link 14.
The previously mentioned updates to duplicate offsets in the RawClass and RawField
can be found on lines 857 and 1238, respectively, of the file available through link 15 of
the appendix. An optimization was to keep those two offsets (one for the target and another for the host) only when not running the precompiled runtime, given that host information is not available in this scenario.
The GC visits the pointers of live objects to make sure that every object referenced
by them will survive, possibly being moved, and that the pointers will be updated to the
new addresses in those cases. When visiting the pointers of an object, the GC assumed that every word inside that object (except for the header, which has a fixed length at the beginning of each object) would be a tagged value, i.e., they would be either
pointers or smis. So, each value stored from the first word in the object (after the header)
to the last was visited as a separate tagged value. The location of the last word in the
object is calculated by the object address and its size, which is retrieved from a shared
class table using the class ID (that is stored in the header of the object). Now, as it is
not true anymore that every value in the object is tagged, the unboxed values need to be
skipped.
Since the GC is performance sensitive, the checking for unboxed fields has to be as
fast as possible. The solution was to store, for each class, a bitmap of unboxed fields. For
each word inside the instances of that class, the bitmap stores a bit to indicate whether
that word is an unboxed value (if the bit is set) or a tagged value (bit not set). Notice
that, since unboxed values might use more than one-word size, they might require more
than one corresponding bit in the bitmap.
As an optimization, given that the instance size is, in general, not larger than 64 words
(not including the header), the bitmap was set to have a fixed length of 64 bits, and for
every position after that, the bit is considered not set.
The bitmap is populated when calculating the field offsets (see section 2.5). If a field is an unboxing candidate whose offset is beyond what the bitmap supports, the field is no longer unboxed, and the unboxing candidate bit (described in section 2.3) is reset to false (zero).
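A minimal sketch of such a fixed-length bitmap, in illustrative C++ (the real implementation lives in the VM sources and differs in naming and detail):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a 64-bit bitmap of unboxed words: bit i set means word i of the
// instance (after the header) holds raw unboxed data, not a tagged value.
class UnboxedFieldBitmap {
 public:
  static constexpr int kLength = 64;  // one bit per word after the header

  bool Get(int word_index) const {
    if (word_index >= kLength) return false;  // past the end: treated as tagged
    return (bits_ >> word_index) & 1;
  }

  // Returns false if the field does not fit in the bitmap, so the caller can
  // clear the unboxing-candidate bit and keep the field boxed instead.
  bool TrySetRange(int first_word, int size_in_words) {
    if (first_word + size_in_words > kLength) return false;
    for (int i = 0; i < size_in_words; i++) {
      bits_ |= uint64_t{1} << (first_word + i);
    }
    return true;
  }

 private:
  uint64_t bits_ = 0;
};
```
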
As an example, program 3 contains two classes: Foo, which has an unboxed field, and Bar, which contains two unboxed fields and three boxed fields. The field value in class Foo can be unboxed because it is an integer (as it has the int type annotation) and because it is never set to null: it is assigned only in the constructor of the class, and the program only initializes it with the values 3 and 8 (in function main). The final modifier enforces that the field is never reassigned after it is initialized; it is good practice to use it when a variable is expected to be set only once.
Regarding class Bar, notice that it has a constructor with an optional parameter, enclosed in square brackets. When optional parameters are not passed, they receive a default value: if none is specified, it is null. The field unboxedInt is (as the name suggests) unboxed,
for the same reason as the field value in Foo; the difference is that unboxedInt was not
annotated as an integer: instead, its type is inferred by TFA, as is its nullability. In contrast, boxedNum is boxed, even though it is never null and only stores numeric values, because
when variable bar1 is created (inside main) it receives an integer value, while when bar2
is created, it receives a floating-point value; the inferred type from TFA is num, which is a
class Foo {
  final int value;
  Foo(this.value);
  // ...
}

class Bar {
  final unboxedInt;
  final boxedNum;
  final foo;
  final unboxedDouble;
  int nullableInt;
  Bar(this.unboxedInt, this.boxedNum, this.foo, this.unboxedDouble,
      [this.nullableInt]);
  setNullableInt(int newValue) {
    nullableInt = newValue;
  }
}

main() {
  final bar1 = Bar(/*unboxedInt=*/ 1,
      /*boxedNum=*/ 2,
      /*foo=*/ Foo(3),
      /*unboxedDouble=*/ 4.0,
      /*nullableInt=*/ 5);
  final bar2 = Bar(/*unboxedInt=*/ 6,
      /*boxedNum=*/ 7.0,
      /*foo=*/ Foo(8),
      /*unboxedDouble=*/ 9.0);
  bar2.setNullableInt(10);
}
supertype of both int and double, and the field is not unboxed in order to distinguish between those types. The field foo is boxed because it stores an instance of the Foo class, which is
not a numeric primitive. The field unboxedDouble is unboxed because it always stores a
double value. Finally, nullableInt is an integer field that is initialized with value 5 when
bar1 is created, but it is initialized with the default value, null, when bar2 is created. It
was not marked as final because the setNullableInt method can reassign a value to it:
just after creating bar2, the value 10 is stored into it. Although this field stores the values 5 in bar1 and 10 in bar2, it also holds the null value between the initialization of bar2 and the call to setNullableInt; therefore, it is nullable and cannot be unboxed.
Figure 8 shows the layout of instances of the classes Foo and Bar on a 32-bit platform.
It also includes the bitmap for each of them. To illustrate how the bitmap would be
retrieved, it assumes that the class IDs of Foo and Bar are 300 and 301, respectively. This
information is available in the header of the instances, so, when the GC visits them, it asks
the shared class table for the bitmap, passing the class ID as the index. It is important to
note that, since it is a 32-bit platform, each word contains only 32 bits, so each unboxed
field requires two words to store its 64-bit values. Similarly, if there were 128-bit fields
(e.g. of Float32x4 SIMD type), each of them would require four words and consequently
use four bits in the bitmap.
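To make the example concrete, the sketch below computes a possible bitmap for Bar on a 32-bit target, assuming the fields are laid out in declaration order and that bit i describes the i-th word after the header (both assumptions are for illustration; the VM chooses the actual layout):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative per-field layout description: size in target words and whether
// the words hold unboxed data.
struct Field {
  int size_in_words;
  bool unboxed;
};

uint64_t BuildBitmap(const Field* fields, int n) {
  uint64_t bitmap = 0;
  int word = 0;  // word index after the header
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < fields[i].size_in_words; j++, word++) {
      if (fields[i].unboxed) bitmap |= uint64_t{1} << word;
    }
  }
  return bitmap;
}

// Bar on a 32-bit target: unboxedInt (2 words), boxedNum (1), foo (1),
// unboxedDouble (2), nullableInt (1).
const Field kBarFields[] = {
    {2, true}, {1, false}, {1, false}, {2, true}, {1, false}};
```

Under these assumptions the bitmap for Bar would be 0b0110011: bits 0-1 for unboxedInt, bits 4-5 for unboxedDouble, and clear bits for the three boxed words.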
Link 16 of the appendix contains the bitmap implementation, as well as the interface
to access it through the shared class table. The bits of the bitmap are set while calculating
the offsets of the fields, inside the Class::CalculateFieldOffsets method, whose im-
plementation is available through link 12. Finally, methods used by the GC to visit object
pointers had to be updated to skip the words marked as unboxed values by the bitmap;
link 17 provides an example of such updates.
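The GC-side use of the bitmap can be modeled as follows: only words whose bit is not set may hold pointers, so only those are visited. This is an illustrative model, not the VM's actual visitor code:

```cpp
#include <cstdint>
#include <vector>

// Returns the indices of the words (after the header) that the GC would
// visit as tagged values; words marked in the bitmap are skipped.
std::vector<int> TaggedWordIndices(int num_words, uint64_t unboxed_bitmap) {
  std::vector<int> visited;
  for (int index = 0; index < num_words; index++) {
    // Past the 64-bit bitmap, every word is considered tagged.
    const bool is_unboxed = index < 64 && ((unboxed_bitmap >> index) & 1);
    if (!is_unboxed) visited.push_back(index);  // may hold a pointer or smi
  }
  return visited;
}
```

With the Bar bitmap from the example above (0b0110011), the visitor would touch only the words of boxedNum, foo and nullableInt.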
An important thing to note is that the bitmap for unboxed fields presented in section
2.6 is computed in the precompiler, but the GC might use it in the precompiled runtime.
This is possible because information such as the size and the bitmap of each class is
serialized in the precompiler to be deserialized and used at runtime.
The changes for serializing the bitmap can be seen in the ClassSerializationCluster
class. They are available through link 18 of the appendix (and, similarly, the updates in
the same file on the ClassDeserializationCluster are responsible for deserializing the
bitmap).
The previous file also contains important changes in the class responsible for serializing
instances that are computed statically — the class InstanceSerializationCluster. This
case is similar to the changes in the GC: instances being serialized need to ensure that all of their fields are also serialized, so the pointers to them need to be visited, but unboxed
fields should not be seen as pointers — instead, those bits should be preserved in the
instance —, so the bitmap was also used here to identify them. Since the bitmap does
not store information about the type of the unboxed fields, it was not possible to know
the size of each field — for instance, 16 bytes of unboxed data could correspond to one
SIMD field, or two double fields, or one int and one double field, and so forth — so
the data of the field is not serialized entirely at once. Instead, every 32 bits are serialized separately, because 32 is a divisor of all word sizes and all unboxed field sizes, which avoids serializing unnecessary extra bits.
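The chunking idea can be sketched like this (illustrative names; the real serializer streams into its own writer rather than into a vector):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Writes a run of unboxed bytes as 32-bit chunks. Since 32 divides every
// word size (32, 64) and every unboxed field size (64, 128), no padding bits
// are ever needed.
std::vector<uint32_t> SerializeUnboxedRun(const uint8_t* data,
                                          size_t size_in_bytes) {
  std::vector<uint32_t> out;
  for (size_t offset = 0; offset < size_in_bytes; offset += 4) {
    uint32_t chunk;
    std::memcpy(&chunk, data + offset, 4);  // avoids unaligned reads / UB
    out.push_back(chunk);
  }
  return out;
}
```

Deserialization is the mirror image: the same number of 32-bit chunks is read back and copied into the instance, preserving the raw bits.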
Another serializer that had to be updated in a similar way is responsible for runtime serialization of instances, e.g. when sending objects asynchronously through a port. Link 19 of the appendix contains those changes.
The last step finished the CL for unboxed double and SIMD fields. Link 20 of the
appendix contains the complete CL. Just after that, another similar CL was submitted to
allow integer fields to be unboxed as well (link 21).
However, before submitting those changes, some unexpected results were noticed when
running benchmarks to assess their impact on the VM. More specifically, benchmarks that
use SIMD values, such as NBodySIMD, were expected to have a better performance after
the SIMD fields became unboxed, but they took almost twice as long to execute. More
details about the algorithm this benchmark runs are available in chapter 4.
After investigating, we realized that the arithmetic operations for SIMD values were
not inlined. Instead, the SIMD values, initially unboxed, were being boxed to be passed as
arguments in a static call to the method that operates them, and, after that, the result was
being unboxed to be stored in the field. This overhead of boxing and unboxing operations
was responsible for those regressions. Before the changes, this overhead did not exist, since all values were boxed. Chapter 3 presents changes to the calling conventions for supporting
unboxed arguments. However, no calls should be performed in this case, as they are much
more expensive than simply operating those values inline, and that would avoid boxing
the operands. The reason why those calls were being performed was that the VM had not yet added support for inlining arithmetic SIMD operations. Therefore, this section presents
changes that were made to avoid those benchmark regressions.
The inliner had to be changed to include those operations. Also, to make them work
properly, null checks need to be performed when inlining them. The reason is that, when
an optimization is performed, IL instructions might be inserted as speculative, so that, if the optimization should not have been done (for instance, because the types mismatch), the compiler can bail out and run the unoptimized code again. This is how runtime optimizations are
performed in JIT mode. In AOT, if it bails out, the function is recompiled with some
optimizations disabled; since this is not intended to happen in this case, null checks are
performed so that any boxing or unboxing instruction later inserted is guaranteed to be
non-speculative (unnecessary checks usually are removed automatically).
Additionally, when the first operand (the receiver) of a call is null, the exception thrown is NoSuchMethod, since that operation is not a method of null; but, if the second operand is null, an ArgumentError exception is thrown instead.
In order to generate smaller code, shared stubs are used by the slow path of null
checks, which is executed when the exception needs to be created. Thus, another stub
was created, responsible for throwing ArgumentError exceptions, so that each null check
instruction would use the appropriate stub according to its enumeration value. Link 23 of
the appendix contains the implementation for this stub on ARMv8. The code is similar
for the other architectures.
The mentioned changes in the inliner are available through link 24. Finally, link 25 of
the appendix provides the complete CL. This change, along with the unboxed fields, not
only avoided regressions on those SIMD benchmarks, but also achieved runtime speedups
by at least a factor of 3 when compared to before any changes. Those results are discussed
in detail in chapter 4.
After unboxed instance fields were enabled, the next step of the internship was to
extend the calling conventions to support unboxed parameters across static and instance
calls.
The language supports required and optional parameters. The required parameters are
the first ones and they are positional, i.e., the position of each argument determines its
corresponding parameter. Optional parameters can be either positional or named (the latter require the name of the parameter to be specified along with the value being passed). Dart
also supports generic methods; in these cases, an array called type arguments provides
type information for each invocation of the method.
Before the changes, the calling conventions of Dart functions were as follows: if the function expects a type arguments array, it is pushed onto the stack before any arguments. Then, all
arguments of the call, starting with the receiver (if any), are tagged and pushed onto the
stack in the same order as they are passed in the source code. If the function needs more
information, e.g. it has optional parameters, expects type arguments, or is a closure, metadata about the arguments is passed in a register. This register holds a pointer to
an array-like object, named ArgumentsDescriptor. The arguments descriptor contains
information such as the length of the type arguments array, the names of the named
parameters that were passed and their position, and the total number of arguments. Finally,
when the function returns, the value is boxed and passed in a fixed register. To illustrate
this, consider the sample program 4. The bar method calls the updateVal method of
class Foo, to update the value of receiverFoo. The arguments of the call are another
instance of Foo, named multiplierFoo, and the result of the sum of integers toAdd and
dummyValue. The updateVal method updates the value inside the receiver, multiplying it by the value of the first parameter and then adding the value of the second parameter.
After the call, the value of dummyValue is subtracted from the field of receiverFoo, and
the result is returned from bar.
class Foo {
  int val;
  Foo(this.val);

  @pragma('vm:never-inline')
  void updateVal(Foo multiplier, int toAdd) {
    val = val * multiplier.val + toAdd;
  }
}

@pragma('vm:never-inline')
int bar(Foo receiverFoo, int toAdd,
    int dummyValue, Foo multiplierFoo) {
  receiverFoo.updateVal(multiplierFoo, toAdd + dummyValue);
  receiverFoo.val -= dummyValue;
  return receiverFoo.val;
}

main() {
  var result = bar(Foo(5), 3, 4, Foo(2));
  print(result); // prints 5 * 2 + 3 == 13
  result = bar(Foo(3), 4, 5, Foo(6));
  print(result); // prints 3 * 6 + 4 == 22
}
Figure 9 contains the CFG for the function bar before the changes presented in this
chapter were enabled. It was simplified to elide parallel moves of registers and other
extra pieces of information not important for this example. Notice that, before adding
the values of parameters toAdd and dummyValue, it needs to unbox them (lines 8 and
9). Then, to pass the result as an argument, it uses a boxing instruction (line 11). Lines
12, 13 and 14 push onto the stack the receiverFoo parameter (whose SSA variable is
v2), the multiplierFoo parameter (v5), and the result of the sum (variable v28, which
represents the result of the boxing instruction), respectively. After the call, the field val
of receiverFoo is loaded (line 16), operated (line 17), updated (line 18) and then the
result needs to be boxed (line 19) to be returned. Since this flow graph was built with the changes of chapter 2 enabled, there is no unboxing instruction after loading the field and no boxing instruction before storing into it, as the field is already unboxed.
In general, in scenarios where functions are just abstractions for operating on numeric values, it was necessary to box the arguments, make the call, unbox the arguments in the target function, operate on them, box the return value, return it and, possibly, unbox the result after the call to continue operating on it. The overhead of these boxing and unboxing operations would be avoided if those values were passed unboxed from the call site to the target function. That is the focus of this chapter.
Section 3.1 explains the initial change for updating static methods and constructors to
support unboxed values. Section 3.2 extends those updates to interface calls.
The reason for treating static methods and constructors separately from the other
calls in the initial change is that calls for those methods are easier to update, because they
have only one possible target, meaning that those methods can be treated individually. For
interface calls, since Dart is object-oriented, the target of that call might be any method
that is an implementation of that interface, but the arguments passed in the call site must
be compatible with what is expected by the callee, for every possible callee. In this section,
this scenario is avoided by the fact that the call site always specifies the exact target to be
called.
Similarly to the unboxed fields, it is necessary to identify the parameters (and return
values) that can be unboxed. Since there is no class in the Dart VM for parameters, this is
done in the RawFunction class. Again, the checking occurs in the kernel loader, retrieving
information from TFA to know the inferred type of the parameters and their nullability.
Although the language supports required and optional parameters, the latter are used less frequently than the required ones, and unboxing them would require more effort. Also, optional parameters are often nullable, with null frequently being their default value. In addition, a feature was being developed in parallel by the VM team to treat optional parameters as required if they are passed in every call to the function, reducing the number of optional parameters in the VM. For all those reasons, it was decided to unbox only the required parameters, including the required parameters of functions that
also have optional parameters. The verification of the type and nullability of parameters
is therefore performed only for the required ones. Link 26 of the appendix provides the
changes for it.
Another bitmap was used to store information about the unboxed parameters and
the return value. However, the RawFunction class does not store the inferred type of
each parameter, and this is important to determine the type of the unboxed values, for
instance, when allocating the appropriate register for those values. As creating an array of types and storing it for each instance of RawFunction would be expensive in terms of memory usage, it was decided to unbox only int and double values, as they are much
more commonly used than SIMD types. The bitmap was implemented to use two bits
for each value: the first one says whether the value should be boxed or unboxed, and the
second one says whether the value is integer or floating-point, in case of being unboxed.
The bitmap implementation can be found through link 28 of the appendix.
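An illustrative model of this two-bits-per-value encoding (the exact bit order, position numbering and class name in the VM may differ):

```cpp
#include <cstdint>

// Two bits per position: the low bit says "unboxed?", the high bit says
// "integer?" when unboxed (clear meaning floating-point). Illustrative only.
class UnboxedParameterBitmap {
 public:
  void SetUnboxedInteger(int i) { bits_ |= uint64_t{0b11} << (2 * i); }
  void SetUnboxedDouble(int i) { bits_ |= uint64_t{0b01} << (2 * i); }

  bool IsUnboxed(int i) const { return (bits_ >> (2 * i)) & 1; }
  bool IsUnboxedInteger(int i) const {
    return IsUnboxed(i) && ((bits_ >> (2 * i + 1)) & 1);
  }

 private:
  uint64_t bits_ = 0;  // default: everything boxed (tagged)
};
```

With 64 bits available, this encoding covers the return value plus the first few dozen parameters, which is more than enough in practice.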
The IL instructions related to calls had methods that always returned the kTagged representation, so those methods had to be updated to retrieve the appropriate representation using the bitmap of
the function. Those IL instructions are: ParameterInstr, used in the function entry to
represent the parameters; StaticCallInstr, used in the call site to perform the call;
ReturnInstr, for returning from the target to the call site; and PushArgumentInstr, used
for pushing a given argument before performing the call. Link 29 of the appendix shows
the changes on the methods of those instructions. It is also possible to see changes that
add the arguments size to the ArgumentsInfo: this class is used to create the arguments
descriptor, and section 3.1.5 explains these changes.
The representation and the offset of parameter instructions are used when processing
the entry of functions, during the register allocation step, which determines the location for
each SSA variable. It uses a linear-scan algorithm (POLETTO; SARKAR, 1999), estimating
the live ranges of the variables by live intervals and performing the allocation with linear
time complexity on the total number of variables. It was updated to create two live ranges
for the arguments that require a pair of registers to represent them (which is the case of
unboxed integers on 32-bit platforms), and to use the offset in the parameter instruction
and its representation (since it is not always kTagged anymore) to compute the location
of the parameter. Those changes are available through link 31 of the appendix.
The calling conventions had to be updated for supporting unboxed values: as they were
always boxed, regular registers were expected to hold the arguments before pushing them
onto the stack, and the return value was always stored in a specific register (R0 on ARM
architectures and RAX on x64), but unboxed values might require different registers. For the
case of floating-point values, FPU registers are needed, and the following ones were chosen
for the return value on each architecture: V0 on ARMv8, Q0 on ARMv7 and XMM0 on x64.
Also, since Dart uses 64-bit integers, on ARMv7 the result was decided to be stored in the
pair (R0, R1) of registers if the value is an unboxed integer. Those changes were reflected
in the generated code of the following IL instructions: LoadIndexedUnsafeInstr, used
for loading the arguments from the stack; PushArgumentInstr for pushing the arguments
from registers to the stack; ReturnInstr for storing the result in the appropriate register;
and call instructions, such as StaticCallInstr, which had to have the output register
updated. The following links provide those changes:
Stack maps are bitmaps used to identify pointers in the stack. They are used by the
GC, when visiting and updating stack pointers to objects that are moved.
The values stored on stack frames start (ignoring the return address, which is not represented in the stack maps) with spill slots, which hold values that had to be pushed onto the stack because there were not enough registers to store all live values. Then, there
might be outgoing arguments, i.e., arguments that were pushed by PushArgumentInstr IL
instructions, preceding a call. This is the default case when visiting stack frames; however, some IL instructions that have a fast path and a slow path might need to push more values
onto the stack (and pop them after that instruction) when running the slow path. This
is the case for the boxing instructions, which could trigger the garbage collector if there
is not enough space to allocate a box for the given value — as discussed in section 2.2
—, so the live registers need to be pushed onto the stack. Moreover, when running the
slow case of the CheckedSmiOpInstr IL instruction (which happens when at least one of
the operands is not a smi), not only do the live registers need to be saved, but a call also
needs to be performed, even though there were no PushArgumentInstr preceding it; in
this case, the slow path code pushes the arguments for this operation.
Since the outgoing arguments were always boxed, the stack maps did not have bits for
those values. Therefore, after supporting unboxed arguments, new bits had to be added to
that bitmap to describe them. The stack maps are serialized as read-only data in binaries,
so it is important to avoid serializing unnecessary bits in order to reduce the metadata for
generated code — this is the reason why the outgoing arguments were not represented
there. The algorithm for visiting a stack frame consists of using the stack map for the first spill_slot_count slots (the number of spill slots) of the frame, then using the remaining bits of the bitmap to visit the end of the frame, and finally visiting each slot of the range in between, assuming that it is tagged.
The algorithm described above did not have to change: the bitmap only had to
add new bits for the last outgoing arguments, starting from the first unboxed argu-
ment, as the arguments pushed before this one will continue to be considered tagged.
Link 35 of the appendix contains the implementation for this algorithm (inside the
StackFrame::VisitObjectPointers method).
Figure 10 illustrates the stack map of a frame that has five spill slots and five slots
for outgoing arguments, after the changes. Notice that, besides the spill slots, the bitmap
contains bits only for the last three slots of outgoing arguments, because this is the smallest
suffix of those slots containing the values of all unboxed arguments. The other two slots
for outgoing arguments are then considered tagged and visited, following the described
algorithm.
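The visiting order can be modeled as below; the function returns which slots are treated as tagged, with a set bit in the map meaning "contains a tagged value". This mirrors the description and figure 10, but the names and conventions are assumptions:

```cpp
#include <vector>

// Model of the frame-visiting order: the first spill_slot_count bits of the
// stack map cover the spill slots, the remaining bits cover a *suffix* of the
// outgoing-argument slots, and every slot in between is assumed tagged.
std::vector<int> TaggedSlotIndices(int frame_slots, int spill_slot_count,
                                   const std::vector<bool>& stack_map) {
  const int suffix_bits =
      static_cast<int>(stack_map.size()) - spill_slot_count;
  std::vector<int> tagged;
  for (int slot = 0; slot < frame_slots; slot++) {
    bool is_tagged;
    if (slot < spill_slot_count) {
      is_tagged = stack_map[slot];  // spill slots: covered by the map
    } else if (slot >= frame_slots - suffix_bits) {
      // Suffix of outgoing arguments: covered by the remaining map bits.
      is_tagged =
          stack_map[spill_slot_count + (slot - (frame_slots - suffix_bits))];
    } else {
      is_tagged = true;  // untracked middle range: assumed tagged
    }
    if (is_tagged) tagged.push_back(slot);
  }
  return tagged;
}
```

For the frame of figure 10 (five spill slots, five outgoing-argument slots, map bits for the last three arguments only), the two uncovered argument slots come out as tagged, exactly as described.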
For optional parameters, the prologue processes the arguments in the order they were
pushed and, in order to obtain the position of the first argument on the stack, the number
of arguments was used as an offset (in words) to jump from the last pushed argument to
the first one. Once again, this could be done because all arguments were tagged, so each
of them used to require exactly one-word size. The process for handling the type of the
arguments in generic functions is similar: for those cases, an array of types is passed as an extra argument before all the others, so the number of arguments was used to calculate the position of this array on the stack.
After supporting unboxed parameters, knowing the number of arguments was not
enough for calculating their size anymore. For this reason, the arguments descriptor was
updated to include this information. Link 37 of the appendix contains the changes for
including the arguments size in the ArgumentsDescriptor. The changes in the prologue
builder for calculating the offset of the arguments using information about the unboxed
parameters are available through link 38.
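The difference can be illustrated with a one-line computation: under the all-tagged assumption the total size of the arguments (in words) equals their count, which no longer holds once each argument contributes its own size. This is a hypothetical sketch, not the prologue builder's code:

```cpp
#include <numeric>
#include <vector>

// Total stack space occupied by the arguments, in words. Before the change,
// this was simply the argument count, because every argument was tagged and
// took exactly one word; now each argument contributes its own size.
int TotalArgumentsSizeInWords(const std::vector<int>& arg_sizes_in_words) {
  return std::accumulate(arg_sizes_in_words.begin(),
                         arg_sizes_in_words.end(), 0);
}
```

This total is what the updated ArgumentsDescriptor carries, so the prologue can still jump from the last pushed argument to the first one.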
3.1.6 Intrinsics
Some functions that belong to a list of recognized methods do not have part of their
code (or their entire code) generated automatically from Dart source code. This is done in order to make such functions more efficient, and examples of them are: the sqrt and
cos functions on numeric values; and the _allocate method of _OneByteString, used to
allocate strings. These optimized pieces of code are called intrinsics, of which there are two
kinds in the Dart VM: graph intrinsics, which are custom-built IL graphs; and assembly
intrinsics, hand-written snippets of assembly.
Most assembly intrinsics are used for native functions, which do not have unboxed
parameters and return values (as mentioned in section 3.1.1). Nevertheless, to avoid
validating all other assembly intrinsics, we have decided to disable parameter and return
value unboxing for functions that have intrinsics of this kind. Therefore, those functions
had their bitmaps of unboxed parameters and return value reset. This is done when
creating the initial state of recognized methods, just after the kernel is loaded. Link 39 of
the appendix provides the changes for this.
Graph intrinsics, however, were updated to add or remove IL instructions for boxing
and unboxing parameters and the return value according to the bitmap, to make them
compatible with the changes in this chapter. The changes for checking if those instructions
are needed and adding them are available through link 40.
This, along with some small tweaks that mainly reflect changes already discussed, completes the CL for unboxed parameters and return values when performing calls to static methods and constructors. Link 41 of the appendix provides the entire CL.
Aside from static calls, instance calls are also frequently used. Instance calls can be
either dynamic calls or interface calls. In Dart, type annotations are optional, so, when the
type of the variable that is the receiver of a call cannot be known statically, a runtime
check is performed to discover it and then the target function is looked up
— in this case, a dynamic call is performed. Conversely, if the variable is known to be of
a given type that has a method with that name, we know the interface of the
target, although the target itself might be any method of a derived class that overrides or
implements that interface. One important thing to note is that a method of a class can
be identified just by its name, because Dart does not have overloaded methods (if that were
the case, the entire signature of the method would be required).
This section focuses on interface calls. The reason is that, when performing dynamic
calls, we have no information at compile time about the interfaces or the possible targets
of the call, so the only way of unboxing parameters would be to unbox the same
parameters in all functions with that name, ensuring that the code at the
call site is compatible with every possible target; however, most of the parameters would
probably remain boxed, since it is unlikely that every method with a given name receives
a non-nullable parameter of the same type at a given position. For interface calls, it
is possible to avoid aggregating the methods merely by name and instead use the class
hierarchy to partition the methods into disjoint sets in a way that minimizes the size
of each set while ensuring that the targets of an interface call are always in the same set:
by doing so, the unboxed parameters of a given function only need to be compatible with
the other functions in its set.
The partitioning algorithm was implemented to run with the AOT transformations,
just after the types inferred by TFA are available. The approach used was the following:
suppose that each (possibly abstract) method in the Dart program is a vertex, and add
an undirected edge between two vertices if and only if one of their respective methods
overrides the other. Then, the connected components form a partition of the methods (i.e.,
the vertices) which ensures that targets of any interface call are always grouped together.
This was done using a Disjoint-Set Union (DSU) data structure (GALLER; FISHER, 1964),
using path compression and union by rank.
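The DSU structure mentioned above can be sketched as follows. This is a minimal illustration in Dart with path compression and union by rank; the class and method names are illustrative, not the ones used in the actual implementation.

```dart
// A minimal Disjoint-Set Union (union-find) sketch with path compression
// and union by rank. Set ids are plain integers.
class DisjointSetUnion {
  final List<int> _parent = [];
  final List<int> _rank = [];

  /// Creates a new singleton set and returns its id.
  int makeSet() {
    _parent.add(_parent.length);
    _rank.add(0);
    return _parent.length - 1;
  }

  /// Finds the representative of the set containing [x],
  /// compressing the path along the way.
  int find(int x) {
    if (_parent[x] != x) {
      _parent[x] = find(_parent[x]); // path compression
    }
    return _parent[x];
  }

  /// Merges the sets of [a] and [b], attaching the shallower
  /// tree under the deeper one (union by rank).
  void union(int a, int b) {
    final ra = find(a), rb = find(b);
    if (ra == rb) return;
    if (_rank[ra] < _rank[rb]) {
      _parent[ra] = rb;
    } else {
      _parent[rb] = ra;
      if (_rank[ra] == _rank[rb]) _rank[ra]++;
    }
  }
}
```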
Firstly, each method is assigned to a singleton set. Then, the methods are processed,
one by one. When processing a method s, the direct parent classes of its enclosing class
(including, in addition to the base class, any interface class that it implements) are visited
and, if any of them has a method with the same name as s, its set is merged with the set
of s. After that, the parents of those superclasses are visited recursively in the same way.
One optimization is performed: if a class c already has a method t overridden by s,
then the parents of c are not visited recursively, because s is already connected to t, and t
will be connected to the methods it overrides when it is processed itself, hence s will
already be connected to those methods through t.
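The merging walk described above can be sketched as follows, using a toy class model and a naive union-find; the real implementation operates on kernel Members and uses the DSU with path compression and union by rank, and every name here is illustrative.

```dart
// Naive union-find over integer set ids.
final Map<int, int> parentOf = {};

int makeSet() {
  final id = parentOf.length;
  parentOf[id] = id;
  return id;
}

int find(int x) {
  var r = x;
  while (parentOf[r] != r) {
    r = parentOf[r]!;
  }
  return r;
}

void union(int a, int b) => parentOf[find(a)] = find(b);

// Toy class model: supertypes are the base class plus implemented
// interfaces; methods maps a method name to its DSU set id.
class Cls {
  final List<Cls> supertypes;
  final Map<String, int> methods;
  Cls(this.supertypes, this.methods);
}

/// Links the method [name] (with set id [setId]) declared in [cls]
/// with every method it overrides in the class hierarchy.
void linkWithSuperClasses(Cls cls, String name, int setId) {
  for (final parent in cls.supertypes) {
    final overridden = parent.methods[name];
    if (overridden != null) {
      // The parent declares a matching method: merge the sets and stop
      // climbing here -- the overridden method will link to its own
      // ancestors when it is processed itself.
      union(setId, overridden);
    } else {
      // No matching method in this parent: keep climbing the hierarchy.
      linkWithSuperClasses(parent, name, setId);
    }
  }
}
```

Note how the walk only recurses past a parent class when that parent does not declare a matching method, which is exactly the optimization described above.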
After creating the partition, the statuses in the unboxing metadata of the parameters
and the return value of each method are updated using the types inferred by TFA. The
status is updated to indicate that the value should be boxed if any of the following
conditions is true: (1) the inferred type is neither double nor int; (2) the inferred type is
nullable; (3) the inferred type is double and the status indicates an unboxing candidate of
int type; or, analogously, (4) the inferred type is int and the status indicates an unboxing
candidate of double type. Otherwise, the status is updated in the following way: if the
status was boxed, it remains boxed; else, if the inferred type is int, it becomes an unboxing
candidate of int type; otherwise, it becomes an unboxing candidate of double type.
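These transition rules can be condensed into a small function. The enum and names below are illustrative sketches; the actual metadata representation in the VM differs.

```dart
// Possible states of a parameter or return value in the unboxing metadata.
enum UnboxingStatus { candidate, candidateInt, candidateDouble, boxed }

/// Applies one TFA-inferred type to the current [status], following
/// rules (1)-(4) above and the refinement rules otherwise.
UnboxingStatus applyType(
    UnboxingStatus status, String inferredType, bool nullable) {
  final isInt = inferredType == 'int';
  final isDouble = inferredType == 'double';
  // Rules (1)-(4): any mismatch forces the value to stay boxed.
  if ((!isInt && !isDouble) ||
      nullable ||
      (isDouble && status == UnboxingStatus.candidateInt) ||
      (isInt && status == UnboxingStatus.candidateDouble)) {
    return UnboxingStatus.boxed;
  }
  // Otherwise the status is refined, unless it was already boxed.
  if (status == UnboxingStatus.boxed) return UnboxingStatus.boxed;
  return isInt ? UnboxingStatus.candidateInt : UnboxingStatus.candidateDouble;
}
```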
To illustrate the partitioning algorithm, consider the class diagram in figure 11. Also,
consider that the types of the method signatures correspond to the respective inferred
types and that those types are not nullable. There would be four disjoint sets of methods
named bar:
• The first set would contain the methods of classes Base1, Foo1 and Foo2, since
both Foo1::bar and Foo2::bar override Base1::bar. Since all of them receive a
• The second disjoint set would contain the methods Base2::bar and Foo3::bar,
because the former is overridden by the latter, even though Base2 is not a direct
parent class of Foo3. Since Base2::bar expects a value of type Object, it needs to
be boxed.
• The third set contains only the Foo4::bar method, as it does not override any other
method and there is no derived class of Foo4. Since it expects a non-nullable double
value as parameter, it can be unboxed.
• Finally, the last disjoint set of this example includes the methods Base5::bar,
Foo5::bar and Foo6::bar, given that the last two override the first. In this case,
Base5 is not the base class of Foo5, but, instead, a class whose interface is imple-
mented by Foo5. Since Foo5::bar expects a double value and Foo6::bar expects
an int value, their parameters need to be boxed.
Some methods are marked to always have boxed values: native methods, as discussed
in section 3.1.1; methods whose enclosing classes are subtypes of num, e.g. int, because those
methods might be replaced by an optimized IL instruction called CheckedSmiOpInstr,
which always expects the arguments to be boxed; and methods that do not need dynamic
invocation forwarders (the reason for this is explained in section 3.2.2).
The implementation of the DSU data structure and its auxiliary functions can be found
through link 43 of the appendix. Notice that the algorithm processes each method
as a Member. This is done because the Member class includes Field, Constructor and
Procedure; fields, however, are ignored, since the changes in chapter 2 already handle
them. Getters and setters are also ignored.
The registerMember method in the file mentioned above creates a new singleton set for
each valid Member; linkWithSuperClasses links a method with all methods it overrides;
finishGraph clears auxiliary information, such as the ranks, since no more sets will be
merged, and makes the unboxing metadata of each set require boxed values if any
method in that set is marked to do so; applyToArg and applyToReturn update the status
in the unboxing metadata of a given parameter and of the return value, respectively,
according to the type inferred by TFA; getUnboxingInfoOfMember is used to access the
unboxing metadata of a given member, possibly returning null if the member was ignored,
e.g. if it is a field. The partitioning algorithm runs after TFA, and the metadata is
accessed when annotating the kernel — link 44 of the appendix contains the changes for
this.
In the file mentioned above, each method is annotated with its respective unboxing
metadata, which is serialized in the kernel. Then, in the VM, the kernel loader deserializes
this information to fill the bitmap (described in section 3.1.1). Link 45 of the appendix
provides the changes in the kernel loader. A C++ version of the unboxing metadata,
UnboxingInfoMetadata, is used, whose implementation is available through link 46.
When dynamic calls are performed, the target of the call needs to be found at runtime.
Consequently, static checks for the parameters are not possible. For this reason, there are
special automatically generated methods, called dynamic invocation forwarders, which are
called instead of the target itself. Those functions perform type checks on the arguments
before forwarding the call to the actual target, so that type errors are still caught.
class Foo {
  void bar(int arg) { /* .... */ }
}

main() {
  final foo = Foo();
  bar(foo);
}

bar(dynamic obj) {
  obj.bar(1);   // calls obj.dyn:bar(1), which
                // forwards the call to obj.bar(1)
  obj.bar("a"); // obj.dyn:bar("a") detects the type
                // mismatch and throws an exception
}
Algorithm 5 illustrates how dynamic invocation forwarders are used. The bar method
receives an object of dynamic type as argument. The dynamic type is a top type, just like
Object, i.e., it accepts values of any type; the difference between these two type annotations
is that the static analysis always accepts accesses to members of dynamic values (as long
as there are no errors in any subterm of the expression, e.g. in the arguments), and the
runtime checks ensure soundness in such cases.
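A minimal example of this difference between dynamic and Object, reusing a Foo class like the one from Algorithm 5:

```dart
class Foo {
  void bar(int arg) {}
}

void example(dynamic d, Object o) {
  d.bar(1);    // accepted statically; checked at runtime via dyn:bar
  // o.bar(1); // compile-time error: Object has no member `bar`
  if (o is Foo) {
    o.bar(1);  // fine after `o` is promoted to Foo
  }
}
```

If `d` turns out not to have a `bar` member at runtime, the call throws a NoSuchMethodError, which is how soundness is preserved for dynamic accesses.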
The code for the Foo::bar method expects one parameter of integer type, so a dynamic
invocation forwarder, named dyn:bar, is created to check if dynamic calls to this target
match its signature. When obj.bar(1) is executed, obj.dyn:bar is called, and it verifies
that the argument is indeed an integer; then, it forwards the call to Foo::bar. In contrast,
when obj.bar("a") is executed, the code for obj.dyn:bar detects that the call does not
match the interface of the function, therefore it throws an exception.
Dynamic calls always receive and return boxed values, so they were updated to unbox
the parameters using the bitmap before forwarding the call, and possibly box the return
value of the call, if it was unboxed, before returning it. By doing so, we ensure that dynamic
calls work for targets that expect unboxed values. Link 47 of the appendix contains the
changes in the CFG of dynamic invocation forwarders.
However, there is a special case for dynamic calls: some functions do not have dynamic
invocation forwarders, and they are called directly from the call site. This happens with
functions that have no parameters and functions whose parameters do not need to have
their types checked. This means that the dynamic invocation forwarder of a function is
created if at least one parameter needs to be checked before the call. A parameter does
not need to be checked if it is covariant (in those cases, the checks are performed in the
body of the target function) or if it is of a top type (because in this case values of any
type would be accepted). For this reason, methods that do not need dynamic invocation
forwarders always expect boxed values, as mentioned in section 3.2.1.
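The cases above can be illustrated with a small sketch; the class and method names are hypothetical:

```dart
class A {}

class B extends A {}

class C {
  // Needs a forwarder: on a dynamic call, `x` must be checked to be int.
  void f(int x) {}

  // No check needed: any value is a valid `Object?` (a top type),
  // so no forwarder is created for this method.
  void g(Object? x) {}

  // No parameters at all, so no forwarder is created either.
  void noArgs() {}
}

class Base {
  void h(covariant A x) {}
}

class Derived extends Base {
  // Covariant parameters are checked in the body of the target itself,
  // not by a forwarder, so `h` needs no forwarder-side check.
  @override
  void h(B x) {}
}
```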
Link 48 of the appendix provides the complete CL. It includes changes in many
expectation files, containing the expected kernel of sample Dart programs, since the kernel
has changed to include the unboxing information metadata. It also contains updates in
IL instructions analogous to the ones presented in section 3.1.2 for static calls, but now
regarding other types of calls: InstanceCallBaseInstr and DispatchTableCallInstr,
that can be used to perform interface calls. This ends the last CL submitted during the
internship.
After all CLs had been submitted, a new patch was created disabling all changes in
this project, to assess its total impact on the benchmarks. An internal system, managed
by a different team, is responsible for configuring and distributing the load among several
identical servers dedicated to running those benchmarks.
The runtime value is calculated after running the benchmarks 10 times and taking
the best run, in order to avoid noise (e.g. caused by swapping, by input and output
operations on disk, or by other running processes). The changes are calculated based on
the actual runtime of the applications (less is better). The internal system also computes
a significance value, so that noisy benchmarks, whose runtime results are unstable, have a
significance closer to 0.0 to avoid highlighting unreal improvements or regressions. The
results are sorted by significance, from the worst to the best result. The first column of
each table contains the benchmark name; the second one provides the percentage of change
in runtime; and the last column contains the significance value.
The benchmarks with the worst results had their significance near 0 in those cases,
even though some of them seemed to have a big change in the final score, such as
rrect_contains_bench (Moto G4) on ARMv7, which had a regression of 36.16%. Besides
the significance, other factors indicate that this was not a real regression, such as the fact
that this benchmark only regressed 15.21% on another device (Pixel 2), and even improved
4.07% on ARMv8 — both with 0.0 significance. In contrast, the improvements at the
bottom rows were much more significant and consistent on both ARMv7 and ARMv8.
Still regarding the Flutter benchmarks, the code size also improved considerably, the
most important results being presented in tables 3 and 4. The total size of all
applications decreased, with most of the regressions happening in the read-only data size,
mainly because of the extra bits that were added on the stack maps (described in section
3.1.4). The Flutter Gallery application is a collection of many of the features available
on Flutter, and it is the main application used for evaluating the impact on code size
(the other benchmarks are mainly used to assess runtime performance). The total size
of the gallery decreased 2.05% on ARMv7 and 1.06% on ARMv8. This difference was
caused mainly because of the updates to reduce the size of boxing instructions, described
in section 2.2, that only happened on ARMv7.
One important thing to highlight is that the changes for unboxed parameter and
return values did not have much impact on code size: the changes for unboxed values when
performing interface calls (section 3.2) only decreased the Gallery total size by 0.15% and,
before that, the changes for static calls had even increased the total size by 0.13%. The
reason for the small impact is probably that there are still many places that use
boxed values, so even though more values were unboxed, it is necessary to pay for
extra boxing and unboxing instructions. This being the case, it is expected that future
changes that enable more values to be unboxed, including NNBD, will make the positive
impact of unboxed parameters more significant. In any case, those changes had a great
impact, discussed next, on the runtime performance of many Dart programs.
Links 49 and 50 of the appendix provide the entire figures with all changes in the
Flutter benchmarks on ARMv7 and ARMv8, respectively.
Link 51 of the appendix contains a figure with the results of all Dart AOT benchmarks
on x64.
More than 75% of the results on table 5 were regressions with absolute value less
than 4% in runtime. Only three of those benchmarks had significance with absolute
value greater than 1.0. Most of the regressions happened on noisy benchmarks and micro
benchmarks, which are used to assess very specific language features or scenarios. Among
those results, one that is important to highlight is the Splay benchmark, which uses splay
trees (SLEATOR; TARJAN, 1985). This data manipulation benchmark, focused on object
creation and destruction, is responsible for stressing the GC: the regression of -3.8% was
mainly a consequence of the extra overhead caused by using the bitmap of unboxed fields
(section 2.6) when the GC is visiting object pointers.
In contrast, table 6 shows more significant results — here, none of the 40 improvements
had a significance less than 1.0, and only three improved by less than 4%. Whereas
the most significant regression was -5.54%, the most significant improvement was 1701%.
All the SIMD benchmarks improved at least 200%, due to the changes described in section
2.8; non-SIMD benchmarks also had great results, such as NBody, which improved
131.1% on Intel Xeon and 105.1% on Intel Core i5. Some of the main benchmarks and
their runtime results on x64 (Intel Xeon) are presented below:
Table 5: The 40 most significant regressions on the Dart AOT benchmarks (on x64
architecture).
Table 6: The 40 most significant improvements on the Dart AOT benchmarks (on x64
architecture).
implementation, there are also two other versions, using closures and iterators.
The DeltaBlue implementation improved 10.92%; DeltaBlueClosures, 7.58%; and
DeltaBlueIterators improved 9.74% (however with a lower significance of 0.6).
This is one of the benchmarks improved after submitting the CLs for unboxed
parameters, improving at least 4% after each of the two CLs (with the classical
implementation).
• Meteor: Also part of The Computer Language Benchmarks Game, this benchmark
computes all solutions for the Meteor Puzzle using a branch and bound algorithm.
The puzzle consists of completely covering a 10 × 5 grid of hexagonal cells with
ten given pieces, each of them made of five hexagonal cells. The benchmark had an
improvement of 7.36%, and it is another example of a benchmark with great results
after supporting unboxed parameters: it improved 4.64% after submitting the CL
for supporting unboxed parameters when performing interface calls (section 3.2).
• SHA256 and MD5: These benchmarks use, respectively, the Secure Hash Algorithm 2
with 256 bit digests (N.I.S.T, 2002) and the MD5 message-digest algorithm (RIVEST;
DUSSE, 1992), two widely used cryptographic hash functions. The former improved
2.41% and, the latter, 5%, both with significance of 0.4.
The former uses the Bench2D benchmark suite (WEBBER, 2014) and the latter is based
on Box2dWeb (HECHT, 2012), a JavaScript port of Box2D. Bench2D improved
11.96% and Box2DOctane, 9.74%.
5 Final Remarks
This work has achieved important results on several benchmarks, demonstrating the
positive impact it has on Dart applications, regarding both code size and runtime
performance. Most of the features developed in this project use nullability information,
currently inferred by TFA, to unbox numeric values. Therefore, the impact of those
changes is expected to become even more remarkable after NNBD is enabled.
The project added support for storing unboxed values into instance fields and for using
such values as arguments when performing calls to both static and non-static methods. It
has also updated the boxing operations of integers to reduce code size on ARMv7 by using
a shared stub for the box allocation and it has added support to inline SIMD arithmetic
operations.
Future work might involve passing more unboxed values as arguments when performing
calls. This could be carried out by supporting new kinds of calls with unboxed values, such
as calls to closures, native functions or tear-offs. Another way to achieve that goal could be
to improve the changes presented in this document, for instance: by implementing a smarter
partitioning algorithm that takes the call sites and possible targets into account when
creating the partition, in order to reduce the size of each disjoint set; by accepting new
types of parameters to be unboxed, such as SIMD types; or by ensuring that dynamic
invocation forwarders are created whenever a function expects unboxed values, so that the
check for the existence of such forwarders no longer has to be performed. Finally, there
is also much room for making currently boxed values become unboxed, for example,
by updating IL instructions that only accept boxed values to also support unboxed
ones.
References
ARNOLD, K. et al. The Java programming language. [S.l.]: Addison-Wesley Reading, 2000.
BRACHA, G.; BAK, L. Dart, a new programming language for structured web
programming. In: Presentation at GOTO conference. [S.l.: s.n.], 2011.
GOLDBERG, A.; ROBSON, D. Smalltalk-80: the language and its implementation. [S.l.]:
Addison-Wesley Longman Publishing Co., Inc., 1983.
GOODMAN, D.; EICH, B. JavaScript Bible. Hungry Minds, Inc., New York (USA), 2001.
GOUSIOS, G. et al. Lean GHTorrent: GitHub data on demand. In: Proceedings of the 11th
working conference on mining software repositories. [S.l.: s.n.], 2014. p. 384–387.
N.I.S.T. Secure hash standard (SHS). In: U.S. DOC. Federal Information Processing
Standards (FIPS) Publication 180-1. [S.l.], 2002.
RICHARDS, M. BCPL: A tool for compiler writing and system programming. In:
Proceedings of the May 14-16, 1969, spring joint computer conference. [S.l.: s.n.], 1969. p.
557–566.
RIVEST, R.; DUSSE, S. The MD5 message-digest algorithm. [S.l.]: MIT Laboratory for
Computer Science, 1992.
ROSEN, B. K.; WEGMAN, M. N.; ZADECK, F. K. Global value numbers and redundant
computations. In: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on
Principles of programming languages. [S.l.: s.n.], 1988. p. 12–27.
WALRATH, K.; LADD, S. Dart: Up and Running: A New, Tool-Friendly Language for
Structured Web Apps. [S.l.]: O'Reilly Media, Inc., 2012.
ZURAS, D. et al. IEEE standard for floating-point arithmetic. IEEE Std, v. 754, n. 2008, p.
1–70, 2008.
Link 1: https://dart-review.googlesource.com
Link 2: https://dart-review.googlesource.com/c/sdk/+/127146/12/runtime/
vm/compiler/stub_code_compiler_arm.cc#1146
Link 3: https://dart-review.googlesource.com/c/sdk/+/127146/12/runtime/
vm/compiler/backend/il_arm.cc#4652
Link 4: https://dart-review.googlesource.com/c/sdk/+/127146/
Link 5: https://dart-review.googlesource.com/c/sdk/+/128072
Link 6: https://dart-review.googlesource.com/c/sdk/+/133983/10/runtime/
vm/kernel_loader.cc
Link 7: https://dart-review.googlesource.com/c/sdk/+/133983/10/runtime/
vm/compiler/backend/il_arm.cc
Link 8: https://dart-review.googlesource.com/c/sdk/+/133983/10/runtime/
vm/compiler/backend/il_arm64.cc
Link 9: https://dart-review.googlesource.com/c/sdk/+/133983/10/runtime/
vm/compiler/backend/il_x64.cc