Pipeline

The document discusses instruction pipelining, explaining how it works and its benefits. Pipelining allows multiple instructions to be processed simultaneously by overlapping their execution across different stages. While pipelining can improve throughput, hazards like structural, data, and control hazards can reduce its effectiveness if they cause stalls.

Uploaded by Nepal Malik

Introduction to Instruction Pipelining

Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution.

Here we will consider a RISC architecture:
Memory is accessed only by load/store instructions
All other instructions operate on registers

Pipelining is Natural!

Laundry Example
Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes


Sequential Laundry

[Figure: timeline from 6 PM to midnight; loads A-D run one after another, each taking 30 + 40 + 20 minutes]

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP

[Figure: timeline from 6 PM to 9:30 PM; loads A-D overlap, with each load entering a stage as soon as the previous load frees it]

Pipelined laundry takes 3.5 hours for 4 loads
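The two totals above can be checked with a quick sketch (stage times from the slide; the code and variable names are illustrative):

```python
# Stage times in minutes (from the slide).
wash, dry, fold = 30, 40, 20
loads = 4

# Sequential: each load finishes all three stages before the next begins.
sequential = loads * (wash + dry + fold)        # 360 min

# Pipelined: the dryer (40 min) is the slowest stage, so once the pipeline
# is full it finishes one load every 40 minutes.
bottleneck = max(wash, dry, fold)
pipelined = wash + loads * bottleneck + fold    # 30 + 160 + 20 = 210 min

print(sequential / 60, "hours")  # 6.0 hours
print(pipelined / 60, "hours")   # 3.5 hours
```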


Pipelining Lessons
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipeline stages
Unbalanced lengths of pipeline stages reduce speedup
Time to “fill” the pipeline and time to “drain” it reduces speedup
Stall for Dependencies


The Five Stages of Load

Load: Ifetch (Cycle 1) | Reg/Dec (Cycle 2) | Exec (Cycle 3) | Mem (Cycle 4) | Wr (Cycle 5)

Ifetch: Instruction Fetch - fetch the instruction from the Instruction Memory
Reg/Dec: Register fetch and instruction decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file

Branch instruction – 2 stages
Store instruction – 4 stages
ALU instruction – 5 stages (the 4th stage, Mem, is idle)
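The stage usage of each instruction class can be summarized in a small sketch (the dictionary below is illustrative, following the slide's stage counts):

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

# Stages each instruction class actually uses (per the slide; the ALU
# instruction occupies five stages but its Mem stage does no work).
USED = {
    "branch": ["Ifetch", "Reg/Dec"],
    "store":  ["Ifetch", "Reg/Dec", "Exec", "Mem"],
    "load":   ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "alu":    ["Ifetch", "Reg/Dec", "Exec", "Wr"],
}

for kind, used in USED.items():
    print(f"{kind}: uses {len(used)} of {len(STAGES)} stages")
```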
Pipelining
Improve performance by increasing throughput

Ideal speedup is the number of stages in the pipeline.
Do we achieve this? NO!
The pipeline stage time is limited by the slowest resource, either the ALU operation or the memory access
Fill and drain time
Single Cycle, Multiple Cycle, vs. Pipeline

Single cycle implementation: one clock cycle per instruction, long enough for the slowest instruction (Load); a faster instruction such as Store wastes the rest of its cycle.

Multiple cycle implementation: each instruction takes only the short cycles it needs (Load: Ifetch Reg Exec Mem Wr; Store: Ifetch Reg Exec Mem; R-type: Ifetch ...).

Pipeline implementation: a new instruction starts every cycle:
Load    Ifetch Reg Exec Mem Wr
Store          Ifetch Reg Exec Mem Wr
R-type                Ifetch Reg Exec Mem Wr
Why Pipeline?
Suppose we execute 100 instructions
Single cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle machine: 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 inst = 4600 ns
Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
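These three totals are simple arithmetic and can be verified directly (a sketch using the numbers from the slide):

```python
n_inst = 100

single_cycle = 45 * 1 * n_inst        # 45 ns/cycle, CPI 1
multicycle   = 10 * 4.6 * n_inst      # 10 ns/cycle, CPI 4.6 from the mix
pipelined    = 10 * (1 * n_inst + 4)  # CPI 1 plus 4 drain cycles

print(single_cycle, "ns")  # 4500 ns
print(multicycle, "ns")    # 4600.0 ns
print(pipelined, "ns")     # 1040 ns
```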
Why Pipeline? Because the resources are there!
[Figure: instructions Inst 0-4 flow through Im | Reg | ALU | Dm | Reg, with a new instruction entering each cycle so all resources stay busy]
Speedup and Efficiency

A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks.

Total time to process n tasks: Tk = [k + (n-1)] t

For the non-pipelined processor: T1 = n k t

Speedup factor: Sk = T1 / Tk = n k t / ([k + (n-1)] t) = n k / (k + (n-1))

If n is very large (n >> k), then Sk ≈ k
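The speedup formula can be sketched in a few lines (function name is illustrative):

```python
def speedup(k, n, t=1.0):
    """S_k = T1 / Tk = n*k*t / ([k + (n-1)] * t) = n*k / (k + n - 1)."""
    t1 = n * k * t            # non-pipelined time
    tk = (k + (n - 1)) * t    # pipelined time
    return t1 / tk

print(speedup(k=5, n=4))     # 2.5: far from ideal for few tasks
print(speedup(k=5, n=1000))  # ~4.98: approaches k when n >> k
```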
Efficiency and Throughput

Efficiency of the k-stage pipeline:

Ek = Sk / k = n / (k + (n-1))

Pipeline throughput (the number of tasks per unit time):

Hk = n / ([k + (n-1)] t) = n f / (k + (n-1))

where f = 1/t; Hk approaches f as n grows.
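Both formulas follow directly from Sk; a minimal sketch (function names are illustrative):

```python
def efficiency(k, n):
    # E_k = S_k / k = n / (k + (n - 1))
    return n / (k + (n - 1))

def throughput(k, n, t):
    # H_k = n / ([k + (n - 1)] * t); approaches f = 1/t for large n
    return n / ((k + (n - 1)) * t)

print(efficiency(5, 1000))        # ~0.996
print(throughput(5, 1000, 1e-9))  # ~9.96e8 tasks/s with a 1 ns cycle
```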
Can pipelining get us into trouble?
Yes: Pipeline Hazards
Structural hazards: attempt to use the same resource in two different ways at the same time
E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
A single memory causes structural hazards
Data hazards: attempt to use an item before it is ready
E.g., one sock of a pair is in the dryer and one in the washer; you can’t fold until you get the sock from the washer through the dryer
An instruction depends on the result of a prior instruction still in the pipeline
Control hazards: attempt to make a decision before the condition is evaluated
E.g., washing football uniforms and needing the proper detergent level; you need to see the result after the dryer before putting the next load in
Branch instructions
Can always resolve hazards by waiting
pipeline control must detect the hazard
take action (or delay action) to resolve hazards
Slow Down From Stalls

• Perfect pipelining with no hazards: an instruction completes every cycle (total cycles ~ number of instructions), since k + (n-1) ≈ n

• Speedup = increase in clock speed = number of pipeline stages:

Sk ≈ nk/n ≈ k

• With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes

• Total cycles = number of instructions + stall cycles

• Slowdown because of stalls = 1 / (1 + stall cycles per instr)
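The slowdown expression can be sketched directly (function name is illustrative):

```python
def slowdown_from_stalls(stall_cycles_per_instr):
    # Fraction of ideal (no-stall) performance that remains:
    # 1 / (1 + stall cycles per instruction).
    return 1 / (1 + stall_cycles_per_instr)

print(slowdown_from_stalls(0.0))  # 1.0: no stalls, full speed
print(slowdown_from_stalls(0.5))  # ~0.67: half a stall cycle per instruction
```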


Speedup Equation for Pipelining

Compared to unpipelined:

Speedup = Average instruction time unpipelined / Average instruction time pipelined

Now it is evident that

CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction = 1 + Pipeline stall cycles per instruction

Putting this into the equation for speedup (with equal clock cycle times and unpipelined CPI equal to the pipeline depth):

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

Thus, if there are no stalls, the speedup is equal to the number of pipeline stages, matching our intuition for the ideal case.
Single Memory is a Structural Hazard

[Figure: Load followed by Instr 1-4 in a Mem | Reg | ALU | Mem | Reg pipeline; in cycle 4, Load's data-memory access and Instr 3's instruction fetch both need the single memory]
[Figure: the same sequence with a bubble inserted; Instr 3's fetch is delayed one cycle so it no longer conflicts with Load's data-memory access]

One cycle stall for structural hazard


Example: Dual-port vs. Single-port

Machine A: dual-ported memory
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Data references are 40% of the instruction mix

SpeedUpA = [1/(1 + 0)] x (clockunpipe/clockpipe)
         = Pipeline Depth
SpeedUpB = [1/(1 + 0.4 x 1)] x (clockunpipe/(clockunpipe / 1.05))
         = (Pipeline Depth/1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster
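The comparison can be reproduced numerically (a sketch; the depth value is an arbitrary illustration, since it cancels out of the ratio):

```python
depth = 5  # illustrative pipeline depth

# Machine A: dual-ported memory, no structural stalls.
speedup_a = (1 / (1 + 0)) * depth

# Machine B: single-ported memory; 40% of instructions are data references,
# each adding one stall cycle, but the clock is 1.05x faster.
speedup_b = (1 / (1 + 0.4 * 1)) * depth * 1.05

print(speedup_a / speedup_b)  # ~1.33, independent of the depth chosen
```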


Control Hazard
When a branch is executed, it may or may not change the PC to something other than the incremented PC.

If a branch changes the PC to its target address, it is called a taken branch; if it falls through, it is not taken (untaken).

If instruction i is a taken branch, then the PC is normally not changed until the end of ID, after the completion of the address calculation and comparison.
Control Hazard Solution #1: Stall
[Figure: Add, Beq, then Load; Load's fetch is held for two cycles of lost potential until the branch outcome is known]

Stall: wait until the decision is clear
Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow

Move the decision to the end of decode by improving the hardware: saves 1 cycle per branch

If 20% of instructions are BEQ and all others have CPI 1, what is the average CPI?
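One way to work the question above, assuming each BEQ pays the full 2-cycle stall (3 cycles total), which the slide implies but does not state:

```python
branch_frac = 0.20
cpi_beq = 3  # 1 cycle plus the 2 lost cycles from the stall

# Weighted average over the instruction mix.
cpi = (1 - branch_frac) * 1 + branch_frac * cpi_beq
print(cpi)  # 1.4
```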
Control Hazard Solution #2: Predict
[Figure: Add, Beq, then Load fetched immediately after Beq; the predicted path continues with no stall]

Predict: guess one direction (taken/untaken), then back up if wrong

Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of the time)
Need to “squash” and restart the following instruction if wrong
Produces a CPI on branches of (1 x 0.5 + 2 x 0.5) = 1.5
Total CPI might then be: 1.5 x 0.2 + 1 x 0.8 = 1.1 (20% branches)
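The two CPI numbers above can be verified with a short sketch:

```python
p_right = 0.5  # predictions correct 50% of the time

cpi_branch = 1 * p_right + 2 * (1 - p_right)  # 1.5 cycles per branch
cpi_total = 0.2 * cpi_branch + 0.8 * 1        # with 20% branches

print(cpi_branch)  # 1.5
print(cpi_total)   # ~1.1
```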
Control Hazard Solution #3: Delayed Branch
[Figure: Add, Beq, a 'Misc' instruction in the delay slot, then Load; no cycles are lost]

Delayed Branch: redefine branch behavior so the branch takes effect after the next instruction

Impact: 0 extra clock cycles per branch instruction if the compiler can find an instruction to put in the “slot” (about 50% of the time)
The longer the pipeline, the harder the slot is to fill
Used by the MIPS architecture
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch:
    add $1,$2,$3
    if $2=0 then
        delay slot
becomes:
    if $2=0 then
        add $1,$2,$3

B. From branch target:
    sub $4,$5,$6
    ...
    add $1,$2,$3
    if $1=0 then
        delay slot
becomes:
    add $1,$2,$3
    if $1=0 then
        sub $4,$5,$6

C. From fall through:
    add $1,$2,$3
    if $1=0 then
        delay slot
    sub $4,$5,$6
becomes:
    add $1,$2,$3
    if $1=0 then
        sub $4,$5,$6

A is the best choice: it fills the delay slot and reduces instruction count (IC)
In B, the sub instruction may need to be copied, increasing IC
In B and C, it must be okay to execute sub when the branch goes the other way
More On Delayed Branch

Compiler effectiveness for a single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots are useful in computation
About 50% (60% x 80%) of slots are usefully filled
Evaluating Branch Alternatives
A simplified pipeline speedup equation for branches:

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Assume that in a deeper pipeline, it takes at least three pipeline stages before the branch-target address is known and an additional cycle before the branch condition is evaluated.

Assume the branch mix is:
4% unconditional branches
6% conditional branches, untaken
10% conditional branches, taken
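The simplified speedup equation can be sketched as a function (the depth and penalty values below are illustrative, not taken from the slide):

```python
def pipeline_speedup(depth, branch_freq, branch_penalty):
    # Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
    return depth / (1 + branch_freq * branch_penalty)

# Branch mix from the slide: 4% unconditional + 6% untaken + 10% taken.
freq = 0.04 + 0.06 + 0.10

# Illustrative: a 5-stage pipeline with an average 2-cycle branch penalty.
print(pipeline_speedup(depth=5, branch_freq=freq, branch_penalty=2))
```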
Data Hazard on r1
An instruction depends on the result of a previous instruction still in the pipeline:

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11


Data Hazard on r1:
• Dependencies backwards in time are hazards

[Figure: add writes r1 in WB during cycle 5, but sub, and, and or read r1 in ID/RF during cycles 3-5, before it is written]
Data Hazard Solution:
• “Forward” the result from one stage to another

[Figure: add's ALU result is forwarded from the later pipeline stages directly to the ALU inputs of sub, and, and or]
• “or” is OK if register reads and writes in the same cycle are defined properly (write in the first half of the cycle, read in the second)
• Forwarding can’t prevent all data hazards! What about a lw followed by an R-type?
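The forwarding condition ("does the next instruction read a register the previous one writes?") can be sketched as follows; the tuple encoding is purely illustrative:

```python
def needs_forwarding(producer, consumer):
    """producer/consumer are (op, dest, src1, src2) tuples (illustrative).
    Forward when the consumer reads a register the producer will write."""
    _, dest, _, _ = producer
    _, _, s1, s2 = consumer
    return dest in (s1, s2)

add_i = ("add", "r1", "r2", "r3")
sub_i = ("sub", "r4", "r1", "r3")
print(needs_forwarding(add_i, sub_i))  # True: sub reads r1 produced by add
```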
Forwarding (or Bypassing): What about Loads?
• Dependencies backwards in time are hazards

[Figure: lw r1,0(r2) produces r1 at the end of its Mem stage, but sub r4,r1,r3 needs it at the start of its ALU stage one cycle earlier]

• Can’t solve with forwarding alone:
• Must delay/stall the instruction dependent on the load
[Figure: with a one-cycle stall inserted, sub's ALU stage lines up with the cycle after lw's Mem stage, and forwarding can then supply r1]
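The load-use check that forces this stall can be sketched similarly (tuple encoding is illustrative):

```python
def load_use_stall(producer, consumer):
    """True when a load's result is needed by the very next instruction:
    the loaded value is not available until after Mem, one cycle too late
    for the consumer's ALU stage, so a one-cycle stall is required."""
    op, dest = producer[0], producer[1]
    _, _, s1, s2 = consumer
    return op == "lw" and dest in (s1, s2)

lw_i  = ("lw", "r1", "0(r2)")
sub_i = ("sub", "r4", "r1", "r3")
print(load_use_stall(lw_i, sub_i))  # True: stall one cycle, then forward
```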
Software Scheduling to Avoid Load Hazards

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd

The compiler optimizes for performance by reordering (scheduling) the instructions at compile time to avoid load-use stalls


Summary: Pipelining
What makes it easy?
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores; memory addresses are aligned
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction

We’ll talk about modern processors and what really makes it hard:
trying to improve performance with out-of-order execution, etc.
Summary
Pipelining is a fundamental concept
multiple steps using distinct resources
Utilize the capabilities of the datapath with pipelined instruction processing
start the next instruction while working on the current one
limited by the length of the longest stage (plus fill/flush)
detect and resolve hazards
