OLP Notes
1. Introduction
When people make use of computers, they quickly consume all of the processing power available. It is the
nature of computers: their flexibility quickly leads to ideas for more automation, which then lead to more and
more ideas, until the available processing resources are exhausted. In addition, software development is
agglomerative: it is easier to add software, and hence CPU load, than it is to remove it. This is a fundamental
asymmetry in software development. As software developers have devised more sophisticated ways of
building and layering software, and as users have demanded more of the software infrastructure, ever greater
demands have been made of the underlying hardware.
A very simple equation quantifies the performance of hardware:

    instructions-per-second = instructions-per-cycle x cycles-per-second
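To make the equation concrete, here is a minimal sketch in Python; the clock rate and IPC figures are illustrative, not measurements of any particular processor:

```python
# instructions-per-second = instructions-per-cycle x cycles-per-second
def instructions_per_second(instructions_per_cycle, cycles_per_second):
    return instructions_per_cycle * cycles_per_second

# e.g. a hypothetical 2 GHz part sustaining 1.5 instructions per cycle:
print(instructions_per_second(1.5, 2e9))  # 3000000000.0
```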
Fortunately Moore's law has given us a formidable, non-linear improvement in performance for some time.
Hardware design has been intensely competitive between teams of engineers in different companies. For a long
time the primary metric used to measure performance was cycles-per-second (CPS), which gave rise to the
megahertz myth. Intel has done more to explore the boundaries of CPS in commodity CPUs than any other
company. Intel's customers found, however, that a metric other than CPU frequency had become increasingly
important in their buying decisions: instructions-per-second-per-watt.
In parallel with research into ever-increasing cycles-per-second, designers have also been improving the
instructions-per-cycle (IPC) that their hardware is capable of. When they crossed the boundary of more than
one instruction per cycle (IPC > 1) they entered the world of instruction-level parallelism (ILP). This paper
describes the primary techniques used by hardware designers to achieve and exploit instruction-level
parallelism.
This report is organised as follows: in the next section I shall quickly outline pipelining, which is required
background for understanding any instruction-level parallelism implementation. Section 3 discusses the several
hazards that instruction-level parallelism (ILP) uncovers, along with techniques for working around or otherwise
minimising the impact of these hazards; it also discusses extensions to ILP such as superscalar processors.
Section 4 describes the theoretical limit of ILP attainable by ideal and potentially realisable processors. In the
conclusion I discuss what impact these techniques have on operating system development.
In this document I shall be discussing RISC-based instruction architectures. This is not especially constraining,
as it is possible to map a register-memory/memory-memory CISC instruction set to a series of RISC partial
instructions. This can be done mechanically and is used by Intel and IBM in the Pentium and Power processor
families.
Instruction-Level Parallelism
2006-04-13
2. Background: pipelining
When a CPU executes an instruction, it transitions through a number of stages to complete. These basic stages (or
phases) are: instruction fetch, instruction decode, register fetch, execution/effective-address computation,
memory access and write back to registers.
Instruction fetch (IF): use the current program counter (PC) to index into memory and fetch the
instruction. The act of fetching from memory may cause an exception and other complicated
interactions with the hardware memory optimisations.
Instruction decode/register fetch (ID): this phase combines decoding of the next instruction with the
latching of the current register values from the register file and the interpretation of the operation to be
executed. In a suitably designed instruction-set architecture it is possible to determine the registers
operated on by an instruction independently of what the actual instruction is, which allows the register
fetch to occur in parallel with the instruction decode.
Execution/effective-address computation (EXE): the arithmetic logic unit (ALU) is central to
instruction execution. Given the decoded operation and latched operands, the ALU performs the
required operation, which can be either register arithmetic or an effective-address computation. The
result is then latched for the next phase of execution.
Memory access (MEM): either store a latched operand into memory indexed by the effective address
(EA) computed in the previous phase, or load and latch a value from the computed EA.
Write back to registers (WBR): transfer the latched output of the memory access or execution phase
to the required register. Some instructions do not need a WBR phase.
A simple CPU implementation would perform each phase in a clock cycle, and every instruction would take up
to 5 clock cycles to complete. Not all instructions require all phases; for instance a simple register-indirect store
instruction does not need to perform the execution or write-back phases and can complete in 3 cycles.
Early computer designers realised that this implementation leaves most of the CPU idle most of the time and
does not use its resources efficiently. In 1960 the designers of the IBM-7030 (Stretch, because it was supposed
to be a stretch in the state of the art) introduced the concept of an instruction pipeline. As Figure 1 below
shows, a pipeline can overlap the execution phases of an instruction with those of previous instructions.
Instruction\clock   1    2    3    4    5    6    7    8    9
i                   IF   ID   EXE  MEM  WBR
i+1                      IF   ID   EXE  MEM  WBR
i+2                           IF   ID   EXE  MEM  WBR
i+3                                IF   ID   EXE  MEM  WBR
i+4                                     IF   ID   EXE  MEM  WBR
Figure 1: Simplified instruction pipeline. With instruction overlap it is possible to complete one instruction per
clock cycle.
When a branch needs to be taken, the next instruction cannot be determined, in this simple design, until the
WBR phase has updated the instruction pointer (IP) register. This is known as a control hazard, though
in this case it is a variant of a data hazard, as the pipeline continues executing a simple linear instruction stream
for a brief period after the branch.
It is also possible that two instructions need to use the same hardware in the same cycle; this is called a
structural hazard. For instance, on hardware with a single memory interface, the MEM phase of a load
instruction will interfere with the IF phase of a later instruction.
Consider the fragment
add r1, r2, r3
sub r4, r1, r5
A data hazard exists between the computation of r1 in the add and the use of r1 in the sub instruction
following it. The problem is that the register file is not updated until the WBR phase, some two clock cycles
after the result was computed by the EXE phase of the add instruction. Using multiplexers it is possible to
forward the result of the add EXE phase into the next EXE phase for the subsequent sub instruction. This
technique recognises that, although the value has not yet been committed to the register file, it has been
computed and is available in a latch when an EXE phase completes.
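The forwarding multiplexer can be sketched in Python. The instruction encoding and register values below are hypothetical, chosen only to show the selection between the EXE latch and the register file:

```python
# Hypothetical sketch of EXE-to-EXE operand forwarding; the register
# names and values are illustrative, not a real ISA.
def read_operand(reg, regfile, exe_latch):
    # Take the operand from the EXE latch if the previous instruction
    # computed the register we need; otherwise read the register file.
    if exe_latch is not None and exe_latch["dest"] == reg:
        return exe_latch["value"]  # forwarded: WBR has not happened yet
    return regfile[reg]

regfile = {"r1": 0, "r2": 3, "r3": 4, "r4": 0, "r5": 10}
# An add producing r1 has just finished its EXE phase; its sum is latched.
exe_latch = {"dest": "r1", "value": regfile["r2"] + regfile["r3"]}
# A sub consuming r1 needs it now, two cycles before the add's WBR phase.
a = read_operand("r1", regfile, exe_latch)  # 7, forwarded from the latch
b = read_operand("r5", regfile, exe_latch)  # 10, read from the register file
print(a - b)  # prints -3
```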
Unfortunately not all data hazards can be resolved in this way. Consider the fragment
ld r1, [r2]
add r2, r1, r1
Remember, from Figure 1, that the memory phase of the i-th instruction and the execution phase of the next
instruction are on the same clock cycle. The ld's data will not be available until the end of the MEM phase's
clock cycle, but the add's EXE phase needs it at the beginning of that same cycle. We must stall the pipeline for
this hazard.
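The detection logic can be sketched as follows; the instruction representation is hypothetical, standing in for the decoded fields a real ID stage would compare:

```python
# Minimal sketch of load-use hazard detection as an ID stage might do it.
def must_stall(prev, curr):
    # Stall when the previous instruction is a load whose destination is
    # a source of the current one: the loaded value arrives at the end of
    # MEM, one cycle too late for the current instruction's EXE phase.
    return prev["op"] == "ld" and prev["dest"] in curr["srcs"]

ld_r1  = {"op": "ld",  "dest": "r1", "srcs": ["r2"]}
add_r2 = {"op": "add", "dest": "r2", "srcs": ["r1", "r1"]}
print(must_stall(ld_r1, add_r2))   # True: ld r1,[r2] then add r2,r1,r1
print(must_stall(add_r2, ld_r1))   # False: no load feeding the next EXE
```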
3. Instruction-level parallelism
Most processors since 1985 have used the pipeline method just described to overlap instructions ([Hennessy03],
p172). The background material discussed a simple pipeline and how to achieve instruction overlap; this section
discusses far more sophisticated approaches: not just instruction overlap, but the actual execution of more than
one instruction at a time through dynamic scheduling, and how to maximise the throughput of a processor. It
will be useful to revisit the various dependencies and hazards before discussing these more powerful techniques
for identifying and exploiting more ILP.
All of the subsequent techniques in this paper exploit parallelism among instructions. The amount of
parallelism available within a single basic block is really quite small; for typical RISC programs, without
extraordinary optimisation techniques by the compiler, the dynamic branch frequency is between 15% and 25%,
which means that the average basic block is between 3 and 7 instructions long (a 20% branch frequency, for
example, implies one branch per five instructions on average). These instructions usually depend on each other
too, and as a result the amount of overlap is further limited. The upshot of these observations is that some way
must be found to exploit parallelism across multiple basic blocks.
In this section I shall discuss a number of techniques that the processor can use alone. The compiler can also be
of assistance in generating larger basic blocks through various techniques, such as loop unrolling. These
compiler techniques are shared with the VLIW/EPIC approaches and will not be covered in this paper.
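As a sketch of what loop unrolling does to basic-block size, here is a hypothetical source-level example in Python; a real compiler performs the equivalent transformation on the generated instructions, not the source:

```python
# Loop unrolling by 4: the unrolled body is one larger basic block
# containing four independent updates, giving a scheduler more
# instructions to overlap between branches.
def scale_add(a, x, y):
    for i in range(len(x)):
        y[i] += a * x[i]

def scale_add_unrolled(a, x, y):
    n, i = len(x), 0
    while i + 4 <= n:          # unrolled body: a single larger basic block
        y[i]     += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
        i += 4
    while i < n:               # cleanup loop for the leftover iterations
        y[i] += a * x[i]
        i += 1
```

Both functions compute the same result; the point is only the shape of the loop body presented to the scheduler.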
There are three different types of dependencies: data dependencies (aka true dependencies), name dependencies
and control dependencies.
Consider the fragment
if (p1) { S1; }
if (p2) { S2; }
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general there
are two constraints imposed by control dependencies: an instruction that is control dependent on a branch cannot
be moved before the branch and, conversely, an instruction that is not control dependent on a branch must not be
moved after the branch in such a way that its execution would be controlled by the branch.
A control dependency does not in itself fundamentally limit performance. We can execute an instruction path
speculatively, provided we guarantee that speculatively executed instructions do not affect the program state
until the branch result is determined. This implies that the instructions executed speculatively must not raise an
exception or otherwise cause side effects.
branches; this 12-bit number is used to index a 4K-entry table of 2-bit predictors. The local predictor has two
levels: the top level is a 1024-entry local history table, indexed by 10 bits of the branch address (these bits are
the least significant bits of the branch address / SizeInst + 1). Each entry consists of the outcome of the 10 most
recent branches. This 10-bit history allows patterns of up to 10 branches to be discovered and predicted. The
history is used to select an entry in another table of 1024 3-bit saturating counters. Finally, the tournament
between the global and local predictors uses a table of 4096 2-bit predictors indexed by the branch address. For
the SPECfp95 benchmarks the predictions were wrong less than once per thousand completed instructions;
SPECint95 was slightly worse at 11.5 mispredictions per 1000 completed instructions.
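The 2-bit saturating counter that appears throughout these predictor tables can be modelled in a few lines; this is a minimal sketch, not any particular processor's hardware, and the initial counter value is an assumption:

```python
# 2-bit saturating counter: values 0-1 predict not-taken, 2-3 predict
# taken; each outcome moves the counter one step without wrapping.
class TwoBitPredictor:
    def __init__(self, counter=2):     # start weakly taken (an assumption)
        self.counter = counter

    def predict(self):
        return self.counter >= 2       # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, False, True, True]:  # a loop-like branch pattern
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 4: only the single not-taken outcome is mispredicted
```

The saturation is what gives the predictor hysteresis: a single atypical outcome (the loop exit above) does not flip the prediction for the next visit.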
3.2.4. Speculation
The use of superscalar techniques in combination with typical branch frequencies implies that a processor must
execute a branch every clock cycle to maintain efficient use of its hardware resources. We must overcome the
limitations imposed by control dependencies to exploit superscalar hardware efficiently. With the accuracy
achieved by modern branch prediction units, hardware can now execute instructions speculatively with near
certainty that the speculation will be used. This technique involves executing post-branch instructions as if the
prediction were correct, and unwinding the speculation if the prediction later turns out to be wrong. The higher
the accuracy of the branch predictors, the more leeway designers have in the number of cycles it takes to get the
instruction stream back on track after a prediction failure.
The hardware to support speculative execution is yet another extension to Tomasulo's basic design. The
extension separates the flow of results between instructions from the completion and writing back of those
results. While a series of instructions is still speculative, results are passed directly from instruction to
instruction; a result is only stored in the register file when its instruction is no longer speculative. This
additional phase is called the instruction commit phase.
The key ideas are to allow instructions to execute out of order but to force them to commit in order and also to
prevent any irrevocable action until they commit. Adding this commit phase to Tomasulo's approach uses a set
of hardware buffers to temporarily hold the results of finished but not yet committed instructions. These buffers
are called reorder buffers (ROB). They are also used to pass intermediate results among speculative
instructions as they execute. The updated four phases of instruction execution are as follows:
Phase 1, issue: get an instruction from the instruction queue. Issue it if there is a reservation station
available and there is a spare slot in the ROB. Cache the currently available operands from either the
register file or earlier ROB entries into the reservation station.
Phase 2, execute: wait until all of the operands are available for one of the reservation stations, then
execute the operation. Instructions may use multiple clock cycles in this stage, and hence may complete
in any order.
Phase 3, write results: when the result is available, write it on the bus tagged with the ROB index of
the destination register; the reservation stations that are monitoring the bus for operands so tagged
will snap up the operand and store it for execution.
Phase 4, commit: there are two possible actions at commit, depending on whether the
instruction is a branch or not. The commit phase takes the instruction from the head of the ROB queue;
if it is a regular instruction then the result is stored in either the register file or memory.
Alternatively, if the instruction is a branch, we must know by now whether it was predicted correctly.
If the branch was predicted correctly then no further action need be taken; otherwise the entire ROB
queue is flushed and the IF unit is re-started at the new IP.
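The in-order commit discipline can be sketched as follows; this is a highly simplified model (no reservation stations, and branches and stores are omitted) intended only to show results completing out of order but retiring in order:

```python
# Sketch of in-order commit from a reorder buffer. Entries may complete
# out of order, but results reach the register file in program order.
from collections import deque

regfile = {"r1": 0, "r2": 0}
rob = deque([
    {"dest": "r1", "value": None, "done": False},  # issued first
    {"dest": "r2", "value": None, "done": False},  # issued second
])

# Phase 3 (write results): the second instruction finishes first here.
rob[1].update(value=42, done=True)
rob[0].update(value=7, done=True)

# Phase 4 (commit): only the head of the queue may retire, so r1 is
# written back before r2 even though r2's result was ready earlier.
commit_order = []
while rob and rob[0]["done"]:
    entry = rob.popleft()
    regfile[entry["dest"]] = entry["value"]
    commit_order.append(entry["dest"])
print(commit_order)  # ['r1', 'r2']
```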
The use of reorder buffers and in-order commits vastly simplifies the generation of precise exceptions. While an
instruction is speculative, any exception it raises is temporarily stored in the ROB; when the instruction reaches
the head of the ROB queue, the processor can then raise the exception, flush the ROB and start the IF unit at the
exception handler.
The biggest advantage of speculation is its ability to uncover, early, events that would otherwise stall the
pipeline, such as cache misses. This advantage can work against us when the operation being speculated
implicitly takes a long time, for instance a costly exceptional event. To retain as much of the advantage as
possible while minimising the disadvantage, most processors will only speculate through low-cost exceptions
such as first-level cache misses. When a potentially expensive exception occurs, such as a higher-level cache
miss or a TLB miss, the processor will stall until the instruction causing the event is no longer speculative.
4. Limitations of ILP
To evaluate the maximum potential instruction-level parallelism possible, an instruction stream needs to be run
on an ideal processor with no significant limitations. The ideal processor always predicts branches correctly,
has no structural hazards and has an infinite number of ROB buffers and renaming registers. This eliminates all
control and name dependencies, leaving only true data dependencies. This model means that any instruction can
be scheduled on the cycle immediately following the execution of the predecessor on which it depends. It
further implies that it is possible for the last dynamically executed instruction in the program to be scheduled on
the first cycle.
To measure the available ILP, a set of programs (in this case the SPEC benchmarks) were compiled and
optimised with a standard optimising compiler. The programs were then instrumented and executed to produce
a trace of the instruction and data references. Every instruction in the trace is then scheduled as early as
possible given the assumptions of perfect branch prediction and no hardware limits. Table 1 shows the average
amount of parallelism available for 6 of the SPEC92 benchmarks. Three of these benchmarks are FP intensive
and the other three are integer programs.
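The scheduling rule described above, that with perfect prediction and unlimited hardware every instruction issues on the cycle after its last true dependence completes, can be sketched directly. The trace below is hypothetical; each entry lists the indices of the earlier instructions whose results it reads:

```python
# Dataflow-limit scheduling: an instruction's issue cycle is one more
# than the latest cycle of any instruction it truly depends on.
def dataflow_schedule(deps):
    cycle = []
    for producers in deps:
        cycle.append(1 + max((cycle[p] for p in producers), default=0))
    return cycle

# Instructions 0, 1 and 4 are independent; 2 reads 0 and 1; 3 reads 2.
trace = [[], [], [0, 1], [2], []]
cycles = dataflow_schedule(trace)
print(cycles)                     # [1, 1, 2, 3, 1]
print(len(trace) / max(cycles))  # average parallelism of this tiny trace
```

Note that instruction 4, the last in program order, issues on cycle 1; this is the sense in which the last dynamically executed instruction can be scheduled on the first cycle.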
SPEC benchmark    Instructions issued
gcc               55
espresso          63
li                18
fpppp             75
doduc             119
tomcatv           150

Table 1: SPEC benchmarks and average instructions issued for an ideal processor.
Obviously the ideal processor is not realisable; a realisable but ambitious processor achieves much less than
the ideal processor. Table 2 explores a processor which can issue 64 instructions per clock with no issue
restrictions, with a tournament branch predictor of substantial size, perfect disambiguation of memory
references, and 64 additional renaming registers for both integer and FP.
Window\SPEC   gcc   espresso   li    fpppp   doduc   tomcatv
Infinite      10    15         12    52      17      56
256           10    15         12    47      16      45
128           10    13         11    35      15      34
64            9     10         11    22      12      22
32            8     8          9     14      9       14
16            6     6          6     8       7
8             4     4          4     5       4
4             3     2          3     3       3

Table 2: Average instruction issue for the SPEC benchmarks versus the window size for instruction decode.
The most startling observation is that, with the realistic processor constraints listed above, the effect of window
size on the integer programs is not as severe as on the FP programs. This points to a key difference between the
two types of program: most FP programs are vector-based and can use loop-level parallelism to a much greater
extent than the typical integer program.
5. Conclusion
Li et al in [Li03] discovered, when instrumenting the power consumed by particular operating system routines,
that instructions-per-cycle statistics could be used to predict the power consumed by a particular routine, i.e.
IPC and power consumption are highly correlated. They explored this correlation a little using a power-accurate
simulator of the MIPS R10000 and found that some 50% of the power used by a routine could be directly
attributed to the data-path and pipeline structures in a processor. These structures are used to control the ILP
being exploited by the CPU. Unfortunately this interesting correlation was not explored further in their paper.
Making maximum use of the superscalar structures of a CPU is quite complicated, and a good compiler is
essential. One illustrative example can be found in [Gray05], which found that gcc's Itanium code generation
was not very good. Using the Itanium performance tools, such as perfmon, they hand-optimised a particularly
important portion of the IPC path. The basic development iteration was to measure the stalls in the processor
pipeline, develop a different instruction sequence to avoid the stalls, assemble, test and start measuring again.
In this way they reduced the cycles used by that portion of the IPC path from 508 cycles to an astounding 36
cycles. Although the Itanium is not a traditional RISC superscalar processor, many of the techniques used to
order operations on the Itanium are directly applicable to superscalar processors through a combination of static
and dynamic scheduling.
Interestingly, hardware designers also use the fundamentally iterative technique of measure, design, build, test
and measure again as a basic performance-improvement process. They try to find typical loads for the CPU,
called benchmarks, measure stalls and other problems, design a solution to the difficulties uncovered, then build
and test it. This leads to a cycle of ever-improving benchmark results; unfortunately what hasn't been
demonstrated is that benchmark improvements lead to improvements in real production software. Everything
depends on how representative of typical workloads the benchmarks are.
As operating system developers we can study the benchmarks used by hardware designers and classify them in
terms of how representative they are of a typical OS implementation. I would suggest the li Lisp-based
SPECint benchmark as a starting point. It is in the nature of operating systems to use tables of pointers to
functions to implement OS services. A classic example of this is Unix's cdevsw table, which maps system calls
to particular drivers as a table of function-pointer vectors. Lisp is essentially implemented in this way, with
tables of function pointers linking objects together. Unfortunately, Tables 1 and 2 indicate that modern
processors cannot exploit much ILP on these types of workload. It would be interesting to instrument an OS
and see how many pipeline stalls are generated in normal, typical operation. Perhaps Apple's Shark
performance tool would be appropriate.
Current research in microprocessor design seems to have reached the limit of ILP exploitation for a single
thread. This has led to the exploration of thread-level parallelism (TLP). TLP is easier to exploit using multiple
processors, though designers are now also taking advantage of idle superscalar hardware by using simultaneous
multi-threading. As the theoretical study demonstrated, it is very difficult to uncover significant ILP in any
given instruction stream. It is much easier to execute two independent instruction streams, and as they are
independent they cannot suffer from data and control hazards with respect to each other.
In practice it seems that superscaling a processor beyond a certain point leads to decreasing performance per
watt. In today's microprocessor marketplace, the laptop is the highest-margin computer being sold by
manufacturers. Since a laptop can run from batteries, the power consumed by CPUs has become a major
system-integration issue. Superscalar ILP increases the base power draw of a CPU by introducing a vast
number of gates to support it; this is now a very active field of research among the processor manufacturers,
and performance-per-watt seems to be emerging as the new metric against which a processor is judged. This
bodes well for the future performance versus power consumption of upcoming processors.
A. References
[Gray05] Gray, Chapman, Chubb, Mosberger-Tang, Heiser (2005). Itanium: A System Implementor's Tale.
In Proceedings of the 2005 USENIX Annual Technical Conference (Anaheim, CA, USA, April 2005).
[Hennessy03] Hennessy and Patterson (2003). Computer Architecture: A Quantitative Approach, 3rd Edition
(Morgan Kaufmann, San Francisco, CA, USA, 2003).
[Li03] Tao Li, Lizy Kurian John (2003). Run-time Modeling and Estimation of Operating System Power
Consumption. SIGMETRICS'03, June 10-14, 2003, San Diego, CA, USA.