Instruction-Level Parallelism (ILP)

The document discusses instruction-level parallelism and techniques for exploiting it, including dynamic scheduling, loop unrolling, and branch prediction. It covers hardware and software approaches and limitations on ILP that led to multicore processors. Key concepts are data dependences, hazards, and reducing branch costs through advanced prediction.

Uploaded by Ram Sudeep

Instruction-Level Parallelism: Concepts and Challenges

 All processors since about 1985 have used pipelining to overlap the execution of instructions and improve performance.
 This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.
Two approaches to exploiting ILP:
 An approach that relies on hardware to help
discover and exploit the parallelism dynamically
 An approach that relies on software technology to
find parallelism statically at compile time.
Instruction-Level Parallelism: Concepts and Challenges

 We will study both approaches.
 Limitations of ILP approaches are also discussed in this chapter; it was such limitations that directly led to the movement to multicore processors.

The CPI (cycles per instruction) of a pipelined processor is:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
What Is Instruction-Level Parallelism?

 The amount of parallelism available within a basic block is quite small.
 For MIPS programs, the average dynamic branch frequency is often between 15% and 25%, meaning that between three and six instructions execute between a pair of branches.
 To obtain substantial performance enhancements, we
must exploit ILP across multiple basic blocks.
 The simplest and most common way to increase the ILP
is to exploit parallelism among iterations of a loop.
What Is Instruction-Level Parallelism?

 This type of parallelism is often called loop-level parallelism.
Example:
for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];

 Loop-level parallelism can be exploited by unrolling the loop, either statically by the compiler or dynamically by the hardware.
 An alternative method for exploiting loop-level parallelism is the use of SIMD, in both vector processors and graphics processing units (GPUs).
Data Dependences and Hazards

 If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls.

 There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences.
 Data Dependences:
An instruction j is data dependent on instruction i if either of
the following holds:
 Instruction i produces a result that may be used by instruction j.
 Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i.
Data Dependences and Hazards

 A data dependence conveys three things:
(1) the possibility of a hazard,
(2) the order in which results must be calculated, and
(3) an upper bound on how much parallelism can possibly be exploited.
 A dependence can be overcome in two different ways:
(1) maintaining the dependence but avoiding a hazard, and
(2) eliminating a dependence by transforming the code
Data Dependences and Hazards

 Name Dependences:
A name dependence occurs when two instructions use the same
register or memory location, called a name.

 There are two types of name dependences between an instruction i that precedes instruction j in program order:
1. An antidependence between instruction i and instruction j
occurs when instruction j writes a register or memory
location that instruction i reads.
2. An output dependence occurs when instruction i and
instruction j write the same register or memory location.
Both can be resolved by register renaming.
Data Dependences and Hazards

 Data Hazards:
Three types of data hazards exist:
1. RAW (read after write)
2. WAR (write after read)
3. WAW (write after write)

Control Dependence
A control dependence determines the ordering of an instruction, i,
with respect to a branch instruction so that instruction i is executed in
correct program order and only when it should be.
if p1 {
S1;
}
if p2 {
S2;
}
Here S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Data Dependences and Hazards

Control Dependence
 Consider this code sequence:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
Moving the LW above the BEQZ could create a new exception (a memory protection exception) that the original program would never raise.
 Exception behavior must be preserved: reordering instructions must not change the exceptions raised in program order.

Data Dependences and Hazards

Control Dependence
 Consider this code sequence:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L: ...
OR R7,R1,R8

 The data flow of R1 must be preserved: the OR uses the value of R1 selected by the branch outcome, so the DSUBU can be moved above the branch only by speculating on the branch.

Basic Compiler techniques for Exposing ILP

• These techniques are crucial for processors that use static issue or static scheduling.

Basic Pipeline Schedule and Loop Unrolling
• Parallelism among instructions must be exploited by
finding sequences of unrelated instructions that can
be overlapped in the pipeline.
• To avoid a pipeline stall:
The execution of a dependent instruction must
be separated from the source instruction by a
distance in clock cycles equal to the pipeline
latency of that source instruction.
Basic Pipeline Schedule and Loop Unrolling
• We assume the standard five-stage integer pipeline, so that
branches have a delay of one clock cycle.
• We assume that the functional units are fully
pipelined and no structural hazards exist.
Basic Pipeline Schedule and Loop Unrolling
• We look at how the compiler can increase the amount of
available ILP by transforming loops.
for (i=999; i>=0; i=i–1)
x[i] = x[i] + s;
• We can see that this loop is parallel by noticing that the body of each iteration is independent.
• Translate the above segment to MIPS assembly language.
Basic Pipeline Schedule and Loop Unrolling
Example:
Show how the loop would look on MIPS, both
scheduled and unscheduled, including any stalls or idle
clock cycles. Schedule for delays from floating-point
operations, but remember that we are ignoring delayed
branches.
Basic Pipeline Schedule and Loop Unrolling
• In the above example, we need only three cycles
(load, add, store) to complete one iteration of loop
operation.
• Four of the seven cycles used to execute the single iteration are loop overhead: the DADDUI and BNE, and two stalls.
• A simple scheme for increasing the number of
instructions relative to the branch and overhead
instructions is loop unrolling.
• Unrolling simply replicates the loop body multiple
times, adjusting the loop termination code.
Basic Pipeline Schedule and Loop Unrolling
• Loop unrolling can also be used to improve
scheduling. Because it eliminates the branch, it
allows instructions from different iterations to be
scheduled together.
• We can eliminate the data use stalls by creating
additional independent instructions within the loop
body.
• The loop unrolling increases the required number of
registers.
Basic Pipeline Schedule and Loop Unrolling
Example:

 Show our loop unrolled so that there are four copies of the loop body, assuming R1 – R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.
Basic Pipeline Schedule and Loop Unrolling
Basic Pipeline Schedule and Loop Unrolling

 Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 27 clock cycles (each LD has 1 stall, each ADDD 2, the DADDUI 1, plus 14 instruction issue cycles), or 6.75 clock cycles for each of the four elements, but it can be scheduled to improve performance significantly.
Basic Pipeline Schedule and Loop Unrolling
Summary of the Loop Unrolling and Scheduling
Reducing Branch Costs with Advanced Branch
Prediction

• We can also reduce the performance losses of branches by predicting how they will behave. Consider:
if (aa==2)
aa=0;
if (bb==2)
bb=0;
if (aa!=bb) { ... }
The behavior of the last branch is correlated with the first two: if the first two conditions are true, aa and bb are both 0, so the third condition is false.
Reducing Branch Costs with Advanced Branch
Prediction
• Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
• In the general case, an (m,n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
Dynamic Scheduling: The Idea

We still use in-order instruction issue (i.e., instructions issued in program order), but we want an instruction to begin execution as soon as its data operands are available. Such a pipeline does out-of-order execution, which implies out-of-order completion.
Dynamic Scheduling: The Idea

There is an antidependence between the ADD.D and the SUB.D; if the pipeline executes the SUB.D before the ADD.D (which is waiting for the DIV.D), it will violate the antidependence, yielding a WAR hazard. Likewise, to avoid violating output dependences, such as the write of F6 by the MUL.D, WAW hazards must be handled.
Dynamic Scheduling: The Idea
Dynamic Scheduling: Tomasulo’s Approach
Hardware Based Speculation

• Using the bypassed value is like performing a speculative register read, since we do not know whether the instruction providing the source register value is providing the correct result until the instruction is no longer speculative.
• When an instruction is no longer speculative, we
allow it to update the register file or memory; we
call this additional step in the instruction
execution sequence instruction commit.
Hardware Based Speculation

• The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order, and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits.
• Adding this commit phase to the instruction
execution sequence requires an additional set of
hardware buffers that hold the results of
instructions that have finished execution but have
not committed.
Hardware Based Speculation

• This hardware buffer, which we call the reorder buffer (ROB), is also used to pass results among instructions that may be speculated.
• The ROB holds the result of an instruction
between the time the operation associated with
the instruction completes and the time the
instruction commits.
• Hence, the ROB is a source of operands for
instructions, just as the reservation stations
provide operands in Tomasulo’s algorithm.
Hardware Based Speculation
Difference b/w Tomasulo’s Approach and ROB:

• The key difference is that in Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file.
• With speculation, the register file is not updated until
the instruction commits (and we know definitively that
the instruction should execute); thus, the ROB supplies
operands in the interval between completion of
instruction execution and instruction commit.
• Each entry in the ROB contains four fields: the
instruction type, the destination field, the value field,
and the ready field.
The basic structure of a FP unit using Tomasulo’s algorithm and
extended to handle speculation
Operations: Extended speculation handling
Example 1
Example 2
Elimination of Hazards in ROB
• WAW and WAR hazards through memory are eliminated with
speculation because the actual updating of memory occurs in
order, when a store is at the head of the ROB, and, hence, no
earlier loads or stores can still be pending.
• RAW hazards through memory are maintained by two
restrictions:
1. Not allowing a load to initiate the second step of its
execution if any active ROB entry occupied by a store has a
Destination field that matches the value of the A field of the
load.
2. Maintaining the program order for the computation of an
effective address of a load with respect to all earlier stores.
VLIW
Multiple Issue
