Porting QEMU To Plan 9: QEMU Internals and Port Strategy: Nathaniel Wesley Filardo September 11, 2007
Porting QEMU To Plan 9: QEMU Internals and Port Strategy: Nathaniel Wesley Filardo September 11, 2007
Abstract
This paper discusses the difficulties which must be overcome to port the QEMU processor emulator [2] to Plan 9.
It begins with a detailed look at QEMU’s internal machinery and outlines the difficulties encountered. For each, it
discusses the currently favored approach towards a workable solution. In many cases, alternatives and mechanisms
for subsequent improvements and optimizations are also presented. It is hoped that this paper also provides useful
groundwork for other, future ports of QEMU to novel platforms and compilers.
1
Contents
1 Overview 4
1.1 The Nature of QEMU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Simulated Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Portability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2
5 Other Porting Issues 22
5.1 Register Calling Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Translation Block Program Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Explicit Branch Prediction Overrides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.4 Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.5 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.6 Interacting With The Outside World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.7 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6 Summary 24
6.1 Earlier Work and Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3
1 Overview
QEMU is a fast, portable emulator of many architectures, including IA-32 (X86), PPC, and SPARC. It seems worth-
while to bring it to the Plan 9 environment so that certain Linux applications (e.g., firefox) can be made available and
to provide an environment for kernel testing.
its own vocabulary of micro-ops, the individual operations will vary across guests, but some names are common by convention.
4
1.2 Simulated Hardware
Both Bochs and QEMU simulate hardware at a very low level. Both have software representations of buses and
peripherals such as video cards, network cards, and disk controllers. Both Bochs and QEMU provide to the simulation
accurate models of a limited set of hardware, including interrupt controllers, bus drivers, disk controllers, disk drives,
keyboards, mice, video cards, and network cards. Over time, this set has grown to include a reasonable selection of
devices likely to be supported by guest operating systems. Both Bochs and QEMU use BIOSes run inside the simulation
to initialize certain parts of the hardware, a design decision which allows the device emulators to remain faithful to the
original hardware.
Outside the simulation, these device drivers (drivees?) make use of host features to provide the simulation and the
user with desired functionality. Some examples are
• Video frame-buffers are exposed via the user’s choice of UIs. For QEMU, options include a SDL window, a VNC
server, and no graphical output.
• Networks under QEMU can be disabled, bridged to the host kernel, created over a UNIX socket using a virtual
Ethernet protocol (allowing other QEMU and compatible simulators and virtualizers to participate), or entirely
simulated by QEMU.
1.3 Portability
QEMU is reasonably modular, with guest-specific parts of the emulator largely factored out into their own files and
directories.4 All guests speak the same interface to the core, drivers, and dynamic translator. Of the some 111,000
lines of code5 for the entire QEMU system, the guest-specific components constitute roughly a third. The X86 guest in
particular occupies under 8,000 lines of code. The relative compactness of the guest descriptions enables QEMU, unlike
Bochs, to simulate a large number of guests.6
QEMU’s dynamic translator requires that its platforms expose symbol and relocation information about their com-
piled executables. Fortunately, most of this information is desirable for more “respectable” reasons such as debuggers,
dynamic loaders, or support of separate compilation. Most modern platforms offer most of the needed infrastructure
or are a small patch away from doing so.
Further, QEMU is written almost entirely in C, creating a layer of isolation between host and guest environments.
Notably the dynamic translator is written entirely in C with some GNU extensions. This structural portability, coupled
with GCC’s large list of supported systems, endows QEMU with a large degree of portability across host systems. The
obstacles for porting to Plan 9 will be largely the use of GCC extensions, teaching QEMU about Plan 9’s a.out format,
and some UNIXisms in the host driver code.
1.4 Roadmap
The next section gives an in-depth look at the current state of QEMU’s universe, focusing on the dynamic translator
as both a provider of tools to other parts of QEMU and as a consumer of the larger system’s offered tools. Then, with
that groundwork, we discuss mechanisms for offering the same tools on a Plan 9 system. We then visit, briefly, some
less-explored, hopefully smaller issues that may be encountered in the port.
2.1 Translation
2.1.1 Translation of Basic Blocks
Figure 1 shows a small X86 basic block with a conditional jump at its end, and the micro-op representation of that
loop. Each op specifies a micro-op which is to be copied into the translation buffer. Those which appear to be given
4 Some per-host and per-target code remains in common files, using #ifdef to select the right one.
5 grep -c \;
6 Of interest in particular to the Plan 9 community are its X86, ALPHA, MIPS, PPC, and possibly SPARC targets.
5
tbstart:
/* ADDL $0x2, %EBX */
op_movl_T0_im(2)
op_movl_EBX_T1
op_addl_T0_T1
op_movl_T0_EBX
/* SUBL $0x1,%EAX */
op_movl_T0_im(1)
op_movl_EAX_T1
l1:
op_addl_T0_T1
ADDL $0x2, %EBX
op_movl_T0_EAX
SUBL $0x1, %EAX
JNZ l1
/* JNZ l1 */
endloop:
(a) A small X86 guest basic block’s in- op_jz_T0(l2) /* Transfer control to l2 if T0 == 0 */
struction stream. op_goto_tb0(tb)
op_jmp_im(tbstart)
op_movl_T0_im(tb)
op_exit_tb
l2:
op_goto_tb1(tb)
op_jmp_im(endloop)
op_movl_T0_im(tb | 1)
op_exit_tb
(b) A translation into micro-ops.
arguments are making use of a technique we call “constant folding” – Section 2.3.4 explains in more detail, but for now
it is sufficient to imagine that QEMU has a mechanism for parameterized micro-ops. In particular, tb refers to the
meta-data associated for the current translation buffer.
The code emitted in response to JNZ may seem especially odd. See Section 2.2.3 and 2.2.4 for details of control flow
handling. The translation given is not faithful to QEMU’s handling of condition code calculations; see Section 2.2.2 for
details.
6
2.2.1 Function Calls To Reduce Translation Size
The MMU operations implicit in most instructions of modern systems are themselves complex. Typically, a translation
cache is added to reduce the number of trips to memory, but at the expense of even greater code complexity. Placing
all of this complexity in each translation buffer at each memory operation site would be extremely expensive.
Further, some guest instructions are themselves remarkably complex; the X86 instruction CPUID is a good example8
and requires about 75 lines of C even in QEMU’s simplistic implementation.9 Unlike ARPL, CPUID has entirely different
behaviors based on the contents of registers; this implies that either the micro-op implementing CPUID will be very large
(CISC micro-op library approach) or that CPUID will be decomposed into a lengthy list of micro-ops (RISC approach).
Instead of the usual approach, where the translator would copy code to implement the MMU operation(s) or CPUID
instruction into the translation buffer, in these cases the micro-ops contain function calls to helper functions such as
ldl kernel, for a long read in kernel mode, and helper cpuid, which contains the complete implementation of CPUID.
7
tbstart: tbstart:
/* ADDL $0x2, %EBX */ /* ADDL $0x2, %EBX */
op_movl_T0_im(2) op_movl_T0_im(2)
op_movl_EBX_T1 op_movl_EBX_T1
op_addl_T0_T1 op_addl_T0_T1
op_movl_T0_EBX op_movl_T0_EBX
op_update2_cc op_nop /* Eliminated CC update */
op_add_EIP(5) op_nop /* Eliminated PC update */
op_set_cc_op(subl) op_set_cc_op(subl)
op_mov_T0_cc op_nop /* Eliminated CC calculations */
op_movl_eflags_T0 op_nop
/* JNZ l1 */ /* JNZ l1 */
op_jz(l2) op_jz_subl(l2) /* Specialized jump op */
op_goto_tb0(tb) op_goto_tb0(tb)
op_jmp_im(tbstart) op_jmp_im(tbstart)
op_movl_T0_im(tb) op_movl_T0_im(tb)
op_exit_tb op_exit_tb
l2: l2:
op_goto_tb1(tb) op_goto_tb1(tb)
op_jmp_im(endloop) op_jmp_im(endloop)
op_movl_T0_im(tb | 1) op_movl_T0_im(tb | 1)
op_exit_tb op_exit_tb
(a) A naı̈ve translation showing condition code and (b) A translation without unnecessary computation.
EIP updates.
Figure 2: A simplified view of translation into host code, showing optimizations on condition codes and instruction pointer
calculations. Translations are represented in C rather than micro-op names for clarity.
8
second, since the value of the condition codes is now unnecessary, avoid computing them but leave enough information
in case they are essential later.10
This lazy evaluation interacts with the synchronous fault mechanism. Upon a synchronous fault, the emulator core
reverse engineers the instruction pointer by looking at the host’s instruction pointer and divining which instruction
was in progress. It requires that all instructions are translated into a series of micro-ops which are as restartable as
guest instructions; thus most instructions translate into a series of loads into temporary registers, computation on those
temporaries, and a single transfer back to actual registers or store to memory.
It is worth noting that the condition code optimizations should also interact with the synchronous fault mechanism,
but they do not. Observationally, no guest depends on the condition codes during synchronous fault handling, so QEMU
does not ensure their accuracy. It does, however, ensure the accuracy of all condition codes that are live at the time of
the synchronous fault so that the guest code may be restarted.
a basic block.
11 One could imagine a more sophisticated translator that could place several conditional jumps into one translation unit so long as there
9
These changes in and of themselves are insufficient: despite parameterization, we have not provided a way for
the translation buffer to conditionally jump to one path or the other. QEMU’s micro-ops libraries provide micro-
ops that transfer control to another point in the translation buffer by jumping, using a mechanism QEMU calls
GOTO LABEL PARAM(), which may be thought of as simply a host JMP instruction with enough room in the instruc-
tion to jump to any address. As with GOTO TB() and EXIT TB(), GOTO LABEL PARAM() is exposed to the translator by
being contained within specialized micro-ops.
Recall that the translation procedure involves disassembling guest instructions, selecting an appropriate sequence
of micro-ops, and then translating that sequence into host instructions. During the selection phase, the translator can
look up the size of each micro-op’s host code. Thus it can readily learn the which offsets into the translation buffer
are between micro-ops. It can even bind these offsets to named “labels” during translation and pass them to the code
generator.
This somewhat tortured mechanism allows us to translate guest conditional jumps into host condition testing and
unconditional jumps. Specifically, a guest conditional jump, say JZ %EAX, with both destinations on the same page is,
in both micro-ops and pseudo-code:
op_movl_EAX_T0 T0 = EAX
op_jz_T0(l1) if(!T0)
GOTO_LABEL_PARAM(l1)
op_goto_tb0 GOTO_TB(0) /* Not taken branch */
op_jmp_im(not-taken) EIP = &(not-taken)
op_movl_T0_im(tb | 0) T0 = tb | 0
op_jmp_label(l2) GOTO_LABEL_PARAM(l2)
l1:
op_goto_tb1 GOTO_TB(1) /* Taken branch */
op_jmp_im(taken) EIP = &(taken)
op_movl_T0_im(tb | 1) T0 = tb | 1
l2:
op_exit_tb EXIT_TB()
That is, if the condition is true, control will follow the branch taken branch (the code after l1) and, the first time
through, will fall through the GOTO TB(), to return to the emulator loop. The emulator will then find in cache or
translate the branch taken successor into its own translation buffer and patch this one for next time. If, next time
through, the condition is still true, control will directly chain to the appropriate translation block. If not, control will
follow the branch not taken successor and the symmetric thing will happen.
2.3.1 Coring
Since C functions complete by returning, and most compilers produce code with prologues on entry and epilogues to
return, the existing dyngen design is to mechanically strip both to extract the core of the function. These cores can
then be pasted together and the entire mass given a prologue and epilogue to create a function at runtime. Assuming
that each core has a unique return point in its epilogue (that is, at its highest address), this will work exactly as desired:
instead of returning, each concatenated function will simply hand control off to the next by falling through.
10
2.3.2 Register Allocation
GCC has extended the C language to allow some control over the register allocator. QEMU’s targets, whenever possible,
assign host registers to hold a pointer to the emulator environment, the temporary registers, and a subset of the guest’s
registers. The goal is to reduce memory traffic and address computations for the common cases seen in the micro-ops
library.
However, as some hosts may have registers smaller than some guests, most guests provide code for the case where
registers are unavailable. This is fortunate, as it means platforms whose C compilers do not allow such fine-grained
(and non-portable) control over the compiler’s output can use these other pathways for their initial port. Section 4.4
discusses both how to eliminate explicit register allocation for the initial port and a mechanism for explicitly using
registers on Plan 9 in more detail.
11
2.3.6 Micro-Ops as Data Structures
Much as we could view translation buffers as odd data structures, micro-ops themselves are similarly odd data structures,
supporting a larger vocabulary of mind-bending operations statically, at compile time; dynamically, at the request of
the emulator; and at execution time.
• Static Operations:
– Extract core.
– Enumerate relocation requirements.
– Enumerate constant-folding parameters.
– Enumerate control-flow parameters and exports.
• Dynamic Operations At Translation Time:
– Copy core to target address.
– Relocate function calls.
– Fold a value in for a constant parameter.
– Set a jump’s destination label.
– Export GOTO TB() labels.
• Execution Operations Within A Micro-Op:
– Make a function call.
– Read from global store (and/or registers).
– Write to global store (and/or registers).
– Go to the next micro-op14 (“branch not taken successor”)
– Transfer control to another micro-op in this translation buffer(“near branch taken successor”).
– Transfer control to another translation buffer(“far branch taken successor”).
– Exit back to the emulator loop.
• Dynamic Operations After Translation Time:
– Set a GOTO TB() destination.
12
op.c op.o qemu
GCC ld
CPU−specific
Micro−ops
Library w/ relocation
dyngen
opc.h translate−op.c
op.h #include translate−all.c GCC & ld
gen−op.h
Translator
}
FORCE_RET();
}
Tn and CC Z are preprocessor macros for the temporary registers used by the translation and for the condition code zero
flag, respectively. In the absence of the mysterious FORCE_RET(), the control flow graph is as shown in Figure 4(a). In
this case, there may well be a “return” instruction in the middle of the host instruction stream, rendering the micro-op
unsuitable for concatenation, as in Figure 5(a).
FORCE RET() is a hack to overcome this problem, an attempt to force all paths through the function to a singular
ending. The desired effect on the control flow graph is shown in Figure 4(b). GCC’s code generator, then, when
presented with such a function, apparently always places an end at the highest address of the function (though this is
by no means strictly necessary). Specifically, FORCE RET() is defined (at dyngen-exec.h:197) to be
__asm__ __volatile__("" : : : "memory");
This is an empty inline asm block decorated (with __volatile__) so that block motion and lifting is disabled in GCC.
FORCE_RET()s are placed only where necessary, presumably determined by intuition or debugging. However, when it
works, it works: op arpl with FORCE_RET() compiles as in Figure 5(b), having only one return instruction at its highest
address.
13
000000000049e943 <op_arpl>: 000000000049e943 <op_arpl>:
49e943: mov %r15d,%edx 49e943: mov %r15d,%edx
49e946: mov %r12d,%eax 49e946: mov %r12d,%eax
49e949: and $0x3,%edx 49e949: and $0x3,%edx
49e94c: and $0x3,%eax 49e94c: and $0x3,%eax
49e94f: cmp %eax,%edx 49e94f: cmp %eax,%edx
49e951: jae 49e96c <op_arpl+0x29> 49e951: jae 49e96d <op_arpl+0x2a>
49e953: mov %r15d,%edx 49e953: mov %r15d,%edx
49e956: mov %r12d,%eax 49e956: mov %r12d,%eax
49e959: mov $0x40,%r12d 49e959: mov $0x40,%r12d
49e95f: and $0xfffffffffffffffc,%edx 49e95f: and $0xfffffffffffffffc,%edx
49e962: and $0x3,%eax 49e962: and $0x3,%eax
49e965: mov %edx,%r15d 49e965: mov %edx,%r15d
49e968: or %eax,%r15d 49e968: or %eax,%r15d
49e96b: retq 49e96b: jmp 49e970 <op_arpl+0x2d>
49e96c: xor %r12d,%r12d 49e96d: xor %r12d,%r12d
49e96f: retq 49e970: retq
(a) Without the use of FORCE RET(). (b) With use of FORCE RET().
3.5 Relocation
QEMU makes use of the host platform’s ability to carry out dynamic loading (or separate compilation) to allow its
micro-ops to make function calls and reference global variables. Further, the mechanism is used “abusively” to load
constants and for the implementation of non-local control flow.
14
Explicitly, the generated C part of the dynamic translator for emitting a op cpuid micro-op is16
extern void op_cpuid();
extern char helper_cpuid;
memcpy(gen_code_ptr, (void *)((char *)&op_cpuid+0), 13);
*(uint32_t *)(gen_code_ptr + 5) = (long)(&helper_cpuid) - (long)(gen_code_ptr + 5) + -4;
gen_code_ptr += 13;
Here, five bytes into the host instruction stream, the dynamic translator will land a computed expression such that at
runtime the call is correctly dispatched to helper cpuid.
15
Currently libdynld does not understand relocation on ARM, so this discussion is somewhat theoretical. However, it
seems worth a brief mention to save somebody else the trouble of discovering it anew.
QEMU currently assumes that the linker’s output is of the form fun1, pool1, fun2, pool2, .... Given a list of
micro-ops for a translation buffer, it will concatenate as many micro-ops as it can before the constant pools would be
too far away from the first, then place a constant pool, then go back to placing micro-ops. This reduces the number of
jumps needed around a translation buffer.
According to Pike’s manual on the Plan 9 C compilers [6], the Plan loader will not produce code of this nature, using
instead a single “static base” for the entire program. Since, in our case, the micro-ops library constitutes an “entire
program,” the presence of this static base table makes it impossible to extract a micro-op from the dynamic module. It
seems that it will be necessary to modify the loaders to either load registers in multiple instructions or emit a constant
pool for each function (and therefore cause each function to set the static base register). In the latter case, QEMU
must be able to distinguish the loads responsible for rewriting the static base register so that it can further abusively
set them.
16
• A X86-specific version.
• A PPC-specific version.
• A GNU C implementation making use of GNU C’s Labels as Values extension [3, Section 5.3].
While none of these implementations are suitable for use on Plan 9, it is instructive to consider at least one for
concreteness. Tragically, all of the mechanisms here are full of horrors: the host-specific versions “know” which type of
relocation records will be emitted by the linker in response to their code, and the GNU C implementation is remarkably
odd.
The X86-specific version is inline assembler but probably simpler to explain than the GNUisms in the GNU C
implementation. The inline assembler (from exec-all.h:333), with some manual preprocessing, is, approximately:
.section .data
__op_label##n#.op_goto_tb##n :
.long 1f
.section .text
jmp __op_jmp##n
1:
Here n is a parameter to GOTO TB(); it is either 0 or 1 depending on which meta-data slot is being used for this jump.
What this achieves is to place a jump instruction with a destination determined by relocation, using another class of
abusive symbols, op jmpN. Dyngen places the address of the patch location into an array for QEMU’s use. The symbol
placed in the .data section is yet another abusive class, op labelN, to which dyngen responds by producing code to
export the address of the relocation (the address of the instruction after the JMP) for the emulator’s use.
After code generation, the translation block’s chains are “reset”, meaning that for the host-specific versions, the
jump location is patched with the address of its next instruction. This is, in effect, creating a conditional without any
conditional instructions. After the micro-op containing GOTO TB(), the translator will have placed micro-ops to store the
instruction pointer and return from the translation unit back to the control loop; see section 2.2.4. Upon either finding
the next translation in cache or translating it anew, the jump location will be patched to be the head of that unit and
the translation units’ meta-data will be updated to show that they are linked. Thus, the next time this translation unit
runs, it will jump directly to its successor, rather than have to involve the main loop.
3.8 Conclusion
The current implementation of QEMU’s dynamic translator is riddled with platform and compiler-specific knowledge.
Any attempt to port QEMU to a new host will have to struggle with these difficulties.
17
op.c cc −T 8.op (dlm) dyngen op_body.c qemu
?l −x arrays
cc, ?l
CPU−specific
Micro−ops
Library w/ relocation
dyngen
opc.h translate−op.c
op.h #include translate−all.c
gen−op.h
Translator
Figure 7(a) compiles by 8c into the intermediate representation shown in Figure 7(b) but is loaded by 8l into the host
code shown in Figure 7(c).
Removing the statement global2 = 1; merely removes its corresponding instruction from the emitted code but
does not change the result’s structure. Thus defining FORCE RET() to be a statement with side-effects is insufficient
under cc. As these productions always end in the middle, they are not suitable for dyngen’s use.
For 8l, the relevant code motion is carried out in pass.c:/^xfol. Some investigatory effort towards modifying this
routine to produce functions which may be cored and concatenated, but no meaningful results have been achieved.
However, it has not been deemed impossible either, so this avenue of attack remains open. Also remaining is to
investigate other loaders.
(a) Example micro-op like code. (b) Intermediate output of 8c. (c) Final output produced by 8l.
4.3.2 Alternatives
Since the micro-ops are all built at once (per guest architecture), it is possible that we could add a loader flag to ensure
that all functions had only one return, placed at their highest address. This would allow us to define away FORCE_RET
and trust the loader to do the right thing, rather than scatter FORCE_RETs wherever necessary whenever the compiler
or loader changed behaviors. However, since we cannot load an already loaded program, an additional program would
have to extract the fully loaded, modulo relocation, micro-op bodies and generate C files containing the host code as
18
data to be compiled into QEMU.
It may also be possible to shim an intermediate program between the compiler and the loader, rewriting the
intermediate format so that the loader produces routines which may be cored and concatenated. This would be akin
to the syntactic “suggestions” attempted with the gotos in 7(a), but at the assembler level. From investigation of 8l
this seems to be more difficult than the loader flag above.
Additional discussion can be found in Section 4.6.
Solution
• For the initial port, we can avoid being fancy here. We can avoid coring micro-ops, and we can let them return
to enter the next micro-op. For more discussion, see Section 4.6.
Solution
• The simplest solution may well be to avoid explicit register allocation altogether. There is extant code in the
QEMU code base to do this.
• Since registerization is likely to provide some non-trivial speedup of guest code, we may avail ourselves of the cc’s
extern register storage class. However, the easiest way to meet the requirement of universal exposure to these
declarations will be to build our own libc, libdraw, and other libraries we link against.
4.5 Relocation
4.5.1 Relocation and Intermediate Formats of Compilation
cc’s intermediate format (the rough correspondence of a .o file) can still be relocated, as references are still by name,
but does not contain native instructions, as those are only selected in full by the loader. Conversely, the loader generally
fully specifies the layout of an executable and so discards the relocation data. However, the loader’s understanding
of dynamically loaded modules (from the delayed dynld(2) project) will be sufficient to emit the relocation data that
dyngen needs.
For the purposes of constant loading, dyngen needs to know which symbols a relocation record references. This
information is as readily available as anything is in the other executable formats dyngen understands, but dynld(2)
currently does not offer any real semantic interpretation of relocation records to its callers. We note that while dyngen
is the current motivation for exposure, such information should be useful for acid as well, once dynamically loaded
modules enter more widespread use.
As mentioned above, the micro-ops library includes static inline functions and constant lookup tables, which cc
will land as static objects in the dynamically loaded module. For the purposes of this port, we will move all constant
pools to other files (forcing the compiler to generate reloction records). static inline functions are a little more
problematic, as many of them are declared in header files. To overcome this problem, we will move all such functions to
header files and then generate, via awk, an alternate version of the header which lacks the keyword static and may be
toggeled via preprocessor macros between prototype and full function body mode. When compiling the micro-ops library
proper, we will process these includes in prototype mode, in liu of the original version. We will subsequently compile
only these headers, in export mode, to produce an object which exports these previously static inline functions.
This is a manual way of forcing the same situation as attaned by GCC; it is necessary as the Plan 9 loaders cannot
consume dynamically loadable modules for further loading.
20 Sorry.
21 At the moment QEMU does not support multiple processors in the guest, so while this would move the state pointer from per-CPU to
per-process storage, it is hoped that this move would not alter semantics or correctness of the program.
22 Rules of optimization: don’t do it, and, for experts only, don’t do it yet.
19
4.5.2 Plan 9’s Dynamic Load Facility
For this port, dynld(2) has been extended to increase transparency of the dynamically loaded modules (dlms) to other
programs. There are two additional API functions, dyn import table() and dyn reloc table(), which extract the
import and relocation tables respectively and present them as arrays to the caller. These functions are necessary as
these tables are “packed” in the dlm in a way that is intended to be parsed once, by dynld(2)’s dynloadgen(), the
dynamic loader function. It was not necessary to write a function for extracting the export table for two reasons:
QEMU did not need such a function, and the current semantics of dlms are that the export table is simply a chunk of
data identified by the symbol exporttab; libmach’s facilities allow it to be readily extracted.
In support of dyn reloc table()’s operation, a new per-architecture function crack dynrel() has been added.
This function investigates a given relocation tuple of (offset in dlm, mode of relocation) and computes (imported
symbol’s import index, offset from imported symbol). The mapping is currently defined in an architecture spe-
cific way, requiring crack dynrel() to be per-architecture. Sadly, crack dynrel() duplicates knowledge from dynreloc(),
dynld(2)’s relocator function, but as they are both short this is passable.
More verbose commentary about dynld(2) and the changes made have been pushed off to a separate document,
dynld.txt, distributed in parallel with this document, as these changes are unlikely to be interesting to future developers
of QEMU on Plan 9.
Solution
• cc’s relocation capabilities are sufficient for the task at hand. Some changes have been proposed to dynld(2) to
increase visibility into the dynamically loadable modules.
20
Solution
• We will (ab)use the stack to store the “not taken” successor micro-op’s address before entering a micro-op.
4.7.2 A Return To C
C proper (i.e., GNU extensions and inline assembler aside) lacks any non-local control flow transfer mechanism other
than a function call. It may also be useful to note that full functions written in assembler are available (and reasonably
portable) on Plan 9; into this latter category fall setjmp() and longjmp().
It should be noted that the use of jumps out of C function bodies is remarkably complex as it imposes many
requirements of the machine code at the jump site. In particular, the stack pointer must be back where it was at
function entry so as to avoid leaving trash around23 Further, all live variables must have been committed to backing
store as there will be no future point to do so; fortunately, this is required of global state at function call sites.
The semantics of GOTO TB() make it more complex than just the constant jumps used by GOTO LABEL PARAM(), as
it must be able to handle both the chained and unchained situations, and it must be easy for external code to toggle
which behavior is active. The simplest way to achieve this may be to define GOTO TB(whichSuccessor) using a host
state in the current translation buffer meta-data structure and a host conditional, as in:
void *next = curr_tb->successor[whichSuccessor];
if(next)
magic_jump_to(next);
This is similar in spirit to the extant GNU C implementation, but it avoids the tortured address export. It depends on
“normal” relocation (for env and whatever function serves the roll of magic jump to).
as a feature.
21
{ void *temp = curr_tb->successor[whichSuccessor];
if(temp) {
env->tempjbuf[JMPBUFPC] = temp;
longjmp(&env->tempjbuf,1);
}
}
It should be noted that we can abusively load in curr tb, possibly including the offset to the specific field we’re
requesting, rather than make a reference to a global store.
4.7.4 Alternatives
While the use of conditional dispatch inside GOTO TB() is unlikely to have measurable performance cost, it may be useful
to briefly mention an approach to its elimination. The current QEMU translator uses a new class of abusive relocation
to get the address of the next instruction after the jump so that the jump becomes an expensive no-operation. However,
we could push this problem one layer higher by observing that transfer to the next micro-op may be just as good as
a transfer to the next instruction. While the latter’s address is (almost?) impossible to get in proper C, the former is
readily calculated using the same mechanisms as for labels.
If the machine code emitted for all micro-ops is sufficiently nice, the stack restore capability of longjmp() may be
unnecessary. In this case a simpler function that simply set the instruction pointer would suffice. If micro-ops are still
using RET to enter their natural successor, the simpler function would still have to clean the stack slightly.
Solution
• longjmp() based solutions to EXIT TB(), GOTO LABEL PARAM(), and GOTO TB() have been proposed.25
Solution
This is an optimization used by X86-on-X86 simulation and may be considered premature optimization for the purposes
of the initial port.
Solution
It should be straightforward to replace this with getcallerpc.
Solution
This may also be viewed as premature optimization for the purposes of the initial port and so removing the annotation
(via #define) should suffice.
25 Sowhile not purely C, this is certainly closer than QEMU’s current approach.
26 The rather ugly attribute((regparm(N))) which specifies that the first N parameters should be passed as registers.
27 There have been discussions on GCC’s mailing list about using branch predictor hints for pointers that result from malloc(), which strikes
this author as remarkably silly. Hints also appear as decoration in Linux but this author is not aware of performance figures demonstrating a
non-decorative utility.
22
5.4 Memory Management
QEMU supports both a “softmmu” mode and a “user” mode emulation strategy. The former emulates a full memory
management unit (with translation cache), while the latter uses mmap and mprotect to host a system inside user usable
address space. This is intended for running executables compiled for one architecture on another, under the same
operating system.
The absence of mmap could be overcome by use of segattach, but no mechanism parallel to mprotect exists on Plan 9.
Fortunately, “user” mode emulation is not likely attractive to Plan 9 users and so may be considered unnecessary to
port28 .
Solution
It seems that system emulation mode uses no advanced memory tricks and so nothing beyond libc’s standard allocator
functions will be necessary.
5.5 Locking
QEMU uses some limited test-and-set locking techniques for threading support in “user” emulation mode (not yet in
“system” mode; SMP is implemented by round-robin emulation of the CPUs) and for CPU interrupt management.
Currently every architecture codes in inline asm the appropriate test-and-set mechanism for implementing locks.
Some locking is sprinkled around the code in what seems to be active development towards taking advantage of
multiple host processors. However, this code is incomplete which may pose problems; see Section 5.6.
Solution
Plan 9’s libc provides a tas() function which implements test-and-set.
upstream. The previous assertion that it was deprecated therefore seems misleading.
29 And similarly complex mechanisms on Windows hosts
30 See Appendix A
23
longjmp(&env->tempjbuf,1);
}
}
then all translations will return to the main loop after a fixed number of chains have been made. Note that because the
mechanism of return is to make GOTO TB() a no-op, the emulation state is not compromised in any way: it just acts as
if it were not chained. It is therefore safe to leave this mechanism engaged, even if chain break is set to a value larger
than the number of translation buffers encountered between timer ticks; doing so will only add the cost of a decrement
and a conditional to the fast path.
Solution
The Plan 9 note mechanism should suffice for timer management. The block device code appears to have some way of
avoiding use of AIO, but details are fuzzy.
6 Summary
This paper presents an initial attempt at a strategy map for porting QEMU to Plan 9. From reading the QEMU
paper [2], reading of QEMU code, some reading of the compiled binaries, and some hints as to where to begin, a series
of potential issues were identified. For each, at least one solution is herein proposed for review; whenever possible,
an effort has been made to identify other possible solutions as well. It should be noted that this is by no means an
exhaustive list. For quick reference, the favored solution for each identified problem is tabulated in Table 1.
References
[1] Qemu internals. URL http://fabrice.bellard.free.fr/qemu/qemu-tech.html.
[2] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator. USENIX, 2005. URL
http://www.usenix.org/publications/library/proceedings/usenix05/tech/freenix/full_papers/bellard/bellard.pdf.
[3] GCC 4.2 Manual. Free Software Foundation, Inc. URL http://gcc.gnu.org/onlinedocs/gcc-4.2.0/gcc/.
24
[4] Kevin Lawton. BOCHS, The Open-Source IA-32 Emulation Project. URL http://bochs.sourceforge.net/.
[5] Rob Pike. A Manual for the Plan 9 Assembler, . URL http://plan9.bell-labs.com/sys/doc/asm.html.
[6] Rob Pike. How to Use the Plan 9 C Compiler, . URL http://www.cs.bell-labs.com/sys/doc/comp.html.
25
A QEMU Interrupt Bug
The CPU interrupt dispatch mechanism’s goal is to unchain whatever translation block is running and force control
flow to return to the main CPU execution loop. Its basic structure is:
1. Mark an interrupt-specific flag.
2. Fetch the current translation block.
3. Remove the environment’s pointer to the current translation block.
4. Recursively unchain the current translation block.
The main CPU execution loop, to which we are trying to ensure control flow returns, has the basic structure:
1. Check for interrupts in the flag word and if any are set, handle them.
2. Find or generate the next translation block.
3. If we are chaining (i.e. not on an exception path and the just-executed translation block left behind patching
instructions), patch the current translation block with a chain to the next translation block.
4. Set the next translation block as the current.
5. Run the current translation block.
Note that there is a clear point, just after step 3 of the CPU execution loop, where all of the following conditions can
hold:
1. A commitment has been made to which translation block will run next, but it is not yet considered the current
translation block.
2. It might be the case that the current translation block is not chained to the new translation block. (More strictly,
it might be the case that the current translation block’s transitive closure under chaining does not include the
new translation block.)
3. The next translation block may have extant chaining patches, if it was pulled from cache. In the worst case, it
may be part of a cyclic chain.
The first condition ensures that we have already selected our next translation block and are not able to select a different
one. The second condition implies that the interrupt dispatch mechanism’s recursive unchaining may not affect the
already-selected translation block’s chaining. The addition of the third condition yields that there are cases where
interrupt dispatch is deferred until the next asynchronous interrupt (e.g., timer tick). If interrupts are not queued, but
merely use the flag bits to signal their pending status, then this bug may additionally imply loss of interrupts.
The fix seems to be either disabling asynchronous signals during parts of the CPU execution loop or to re-check for
interrupts after the next translation block has been set as the current. The former is slow, imposing two system calls
per pass through the CPU execution loop. The latter implies more code re-structuring than I wish to do, as the bigger
problem of the port remains.
26