Instruction Level Parallelism
Pipeline Hazards and LL Pipeline
Indian Institute of Technology Tirupati
Jaynarayan T Tudu
[jtt@iittp.ac.in]
Computer System Architecture (CS5202)
19th March, 2020
Pipeline Architecture
Review of the basic pipeline architecture: the stages are shown with their corresponding resources.
Program Execution Scenario
DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11
BEQ  R1, R4, offset
XOR  R1, R4, R11
NOR  R10, R4, R11
JMP  100(offset)
.......
Dependencies (hazards): data, resource, control.
Data Dependency
The challenge of data sharing (hazards).
Data Hazards: RAW, WAR, RAR, WAW
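For example, in the program scenario above, DSUB R4, R1, R5 reads R1 immediately after DADD R1, R2, R3 writes it: a RAW hazard, the most common case. Two instructions that merely read the same register (RAR) never conflict, so only RAW, WAR, and WAW require handling.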
Data Dependency: Solutions
Data forwarding (also called bypassing or short-circuiting).
Data Hazards: RAW, WAR, RAR, WAW
More solutions will be discussed in detail during the LL (long latency) pipeline and superscalar lectures!
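As a concrete illustration, here is a minimal C sketch of the decision a forwarding (bypass) unit makes in the classic five-stage design; the latch and field names (Latch, ex_mem, mem_wb, regwrite, rd) are illustrative assumptions, not the lecture's exact hardware:

    #include <stdbool.h>

    /* One pipeline latch: does the instruction in it write a register, and which? */
    typedef struct {
        bool regwrite;  /* true if this instruction writes a destination register */
        int  rd;        /* destination register number */
    } Latch;

    /* Decide where ALU source operand rs should come from:
       0 = register file, 1 = forward from EX/MEM, 2 = forward from MEM/WB. */
    int forward_select(int rs, Latch ex_mem, Latch mem_wb)
    {
        if (ex_mem.regwrite && ex_mem.rd != 0 && ex_mem.rd == rs)
            return 1;   /* youngest producer wins: result computed last cycle */
        if (mem_wb.regwrite && mem_wb.rd != 0 && mem_wb.rd == rs)
            return 2;   /* producer two cycles ahead (ALU result or loaded data) */
        return 0;       /* no in-flight producer: read the register file */
    }

Note the priority: the EX/MEM latch is checked first because it holds the most recent value of rs.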
Resource Dependency
DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11
BEQ  R1, R4, offset
XOR  R1, R4, R11
NOR  R10, R4, R11
JMP  100(offset)
.......
Dependencies (hazards): resource, data, control.
Resource Dependency
DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11
BEQ  R1, R4, offset
XOR  R1, R4, R11
NOR  R10, R4, R11
JMP  100(offset)
.......
The solution is the effective deployment of additional resources: separate instruction and data memories (IM and DM), additional register-file ports, etc.
Control Dependency
PC = 00: DADD R1, R2, R3
PC = 04: DSUB R4, R1, R5
PC = 08: AND R6, R1, R7
PC = 12: OR R8, R1, R9
PC = 16: XOR R10, R1, R11
PC = 20: BEQ R1, R4, offset
PC = 24: XOR R1, R4, R11
PC = 28: NOR R10, R4, R11
PC = 32: JMP 100(offset)
.......
PC = 20 + offset: ....... (branch target of the BEQ)
Branch Prediction: Control Dependency
Control hazards break the normal pipeline flow.
Solution: predict the control path (branch prediction).
* Advanced branch predictors will be covered during superscalar architecture.
Branch Prediction: Control Dependency
This requires two things:
- Branching decision (taken or not taken)
- Branch target address (effective address)
Both computations have been moved to the ID stage; we just don't want to wait!
MIPS implementation.
* Advanced branch predictors will be covered during superscalar architecture.
Branch Prediction: Control Dependency
This requires two things:
- Branching decision (taken or not taken)
- Branch target address (effective address)
RISC-V implementation.
* Advanced branch predictors will be covered during superscalar architecture.
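Note: resolving the branch in ID rather than in MEM cuts the taken-branch penalty in the classic five-stage pipeline from three stall cycles to one, which is exactly why both computations are moved earlier.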
Performance with Hazards
Hazards cause the pipeline to stall: an extra cycle penalty!
In an ideal pipeline, cycles per instruction (CPI) = 1.
Therefore, with hazards: CPI = 1 + stall cycles per instruction.

Speedup = CPI_unpipelined / (1 + stall cycles per instruction)

The above equation can also be expressed as:

Speedup = Pipeline depth / (1 + stall cycles per instruction)
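A quick worked example: assuming a 5-stage pipeline with an average of 0.5 stall cycles per instruction, Speedup = 5 / (1 + 0.5) ≈ 3.33, well short of the ideal 5x.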
Deep Pipeline Architecture
Speedup = Pipeline depth / (1 + stall cycles per instruction)

Can the numerator (pipeline depth) be increased? The denominator (stall cycles) must be avoided as much as possible.
The question is how to increase pipeline depth without increasing stall cycles.
Let us analyze the Execution stage!
Long Latency Pipeline
Multiple units in the EX stage:
- Integer unit
- Floating point/Integer multiplier
- FP adder
- FP/Integer divider
Assumption: each of the execution units is non-pipelined.
How do we decide which unit needs a deeper pipeline, and where?
FP Pipeline Architecture
To decide which unit should be pipelined for performance gain, we use the following specification:
Latency: the number of cycles between the production and the consumption of a result (it helps the designer decide on reducing stalls).
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type (for a fully pipelined unit this is always 1; for a multicycle unit it may not be).
FP Pipeline Architecture
Block diagram of the deeper pipeline based on the given specification:
- Integer ALU: only one stage.
- FP/Integer multiply: fully pipelined with seven stages.
- FP adder: fully pipelined with four stages.
- FP divide: not fully pipelined but multicycle (24 cycles).
[Figure: the pipeline, marking where results are produced and where they are consumed]
FP Pipeline Architecture
Latency and initiation interval for each of the pipeline units. Note that for the FP divider the initiation interval is 25 instead of 1.

Functional unit          Latency         Initiation interval
Integer ALU              0  (= 1 - 1)    1
Data memory (load)       1  (= 2 - 1)    1
FP add                   3  (= 4 - 1)    1
FP/Integer multiply      6  (= 7 - 1)    1
FP/Integer divide        24 (= 25 - 1)   25

Latency: number of cycles between the production of a result and its consumption.
Initiation interval: number of cycles that must elapse between issuing two operations/instructions of a given type.
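For example, an instruction that consumes the result of an FP multiply (latency 6) must be separated from it by six cycles, stalling if the compiler cannot fill the gap; and because the divider's initiation interval is 25, a second divide issued immediately behind the first stalls for 24 cycles.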
FP Pipeline Architecture: Timing
Timing of an independent set of instructions.
[Figure: pipeline timing diagram; time progresses in clock cycles, marking the stage where data is required and the stage where results become available]
The figure is from the 6th edition of the textbook; for your reading you may use the 5th edition, as the 6th edition has some printing bugs.
FP Pipeline Architecture: Hazards
Hazards and forwarding in the long latency pipeline:
- Data dependency (the consumer must get the updated data)
- Control (unpredictable control path)
- Structural hazards (no two stages can access a single resource at a time)

1) Because the divide unit is not pipelined, structural hazards can occur. These need to be detected, and the issuing instructions need to be stalled.
2) Because the running times of instructions vary, the number of register writes required in a cycle can be more than one.
3) WAW hazards are possible, because instructions no longer reach WB in order. Note that WAR hazards can never occur, since reads happen in the ID stage and writes in the WB stage.
4) Instructions can complete in a different order than they were issued (issued in order but completed out of order); this may lead to imprecise exception handling.
5) Because of the longer latency of operations, stalls due to RAW hazards will be more frequent.
FP Pipeline Architecture: Hazards
Hazards and forwarding in the long latency pipeline:
- Write order (WAW) may not be maintained due to out-of-order completion.
- Registers are read in the ID stage and written in the WB stage.
- The divider is a single non-pipelined unit (a new divide instruction can be issued only after 25 cycles).
FP Pipeline Architecture: Hazards
Pipeline stalls due to data dependency hazards. Solution: stall and forwarding.
f4, f0, and f2 are the floating-point registers involved in the RAW dependences; additional stalls arise due to structural hazards.
FP Pipeline Architecture: Solutions
Solutions to hazards.
Two data hazards are of interest, RAW and WAW, along with the structural hazards.
Situation in which WAW causes a problem: if fld f2, 0(x2) had been issued one cycle earlier, fld f2, 0(x2) and fadd.d f2, f4, f6 would cause a WAW hazard.
FP Pipeline Architecture: Solutions
Solutions to hazards.
Structural hazards due to WB and MEM:
- detect hazards and stall
- detection can be done at the ID stage or at the MEM stage
- stall the issue at the ID stage, or stall before entering MEM or WB
Solving WAW hazards:
- delay the issue of the fld instruction until fadd.d enters the MEM stage, or
- prevent fadd.d from writing back its result, and then issue the fld as usual
- (a WAW hazard is a very rare situation in real code)
Hazards among FP and integer units:
- hazards can occur among FP instructions, or between an FP and an integer instruction
- having separate register files for the FP and integer units is a good solution
- detection of hazards among FP instructions (see the sketch below):
  - check for structural hazards
  - check for RAW hazards
  - check for WAW hazards
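A minimal C sketch of the three issue-time checks listed above, assuming a hypothetical scoreboard-style bookkeeping; names such as fu_busy and reg_pending are illustrative, not the textbook's implementation:

    #include <stdbool.h>

    #define NUM_FU   8   /* functional units (adder, multiplier, divider, ...) */
    #define NUM_REGS 32  /* FP registers */

    typedef struct {
        int src1, src2, dst;  /* register numbers */
        int fu;               /* functional unit this instruction needs */
    } Instr;

    bool fu_busy[NUM_FU];        /* structural state: unit currently occupied?    */
    bool reg_pending[NUM_REGS];  /* is a write to this register still in flight?  */

    /* Returns true if the instruction may issue this cycle; otherwise hold it in ID. */
    bool can_issue(const Instr *in)
    {
        if (fu_busy[in->fu])                                 /* structural hazard */
            return false;
        if (reg_pending[in->src1] || reg_pending[in->src2])  /* RAW hazard */
            return false;
        if (reg_pending[in->dst])                            /* WAW hazard */
            return false;
        return true;
    }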
Pipeline Architecture: Exception
The problem due to the longer pipeline and out-of-order completion:
DIV.D F0, F2, F4
ADD.D F10, F10, F8
SUB.D F12, F12, F14
Suppose an exception or interrupt occurs while these are in flight: where should the program return after handling the exception? To DIV.D, or to SUB.D?
How do we maintain a precise program state? We can't afford to lose the value of any register!
Pipeline Architecture: Exception
Different situations that raise exceptions:
- I/O device request
- System call (invoking the OS kernel from user mode)
- Tracing instruction execution
- Breakpoints (like a gdb breakpoint)
- Integer arithmetic overflow (when a number can't be represented)
- FP arithmetic anomaly (such as NaN)
- Divide by zero
- Page fault (OS intervention is called for)
- Misaligned memory access (memory alignment is required)
- Memory protection violation
- Use of an undefined instruction
- Hardware malfunction
- Power failure, reset, restart, etc.
Beyond Pipeline
Question of interest: how do we increase IPC?
Target: IPC >= 1, or equivalently CPI <= 1.0.
Limits of Pipeline
Processor performance revisited:
Performance = 1 / CPU_time
CPU_time = Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
Speedup = Performance of new / Performance of old = CPU time of old / CPU time of new

In the 1980s (the decade of pipelining): CPI from 5.0 down to 1.15.
In the 1990s (the decade of superscalar): CPI from 1.15 down to 0.5 (best case).
In the 2000s (the decade of multicore): focus on thread-level parallelism, CPI near 0.33 (best case).
Limits of Pipeline
Amdahl's Law: Speedup = P1 / P2
P1: performance for the entire task using the enhancement
P2: performance for the entire task without the enhancement
[Figure: execution timeline, 1 vs. N processors; serial fraction h runs on one processor, vectorizable fraction f runs on N]
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup:
Speedup = 1 / ((1 - f) + f/v)
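Worked example: with f = 0.8 of the work vectorizable and v = 4, Speedup = 1 / ((1 - 0.8) + 0.8/4) = 1 / 0.4 = 2.5.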
Limits of Pipeline
Amdahl's Law
[Figure: same execution timeline; the serial fraction forms the sequential bottleneck]
Sequential bottleneck: even if v is infinite, performance is limited by the non-vectorizable code, i.e. by 1 - f:

lim (v -> infinity) 1 / ((1 - f) + f/v) = 1 / (1 - f)
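For instance, with f = 0.9 the speedup can never exceed 1 / (1 - 0.9) = 10, however large v becomes.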
Limits of Pipeline
Pipeline performance model:
[Figure: execution time split between the pipeline running filled at its full depth and running at depth 1 while stalled]
g = fraction of time the pipeline is filled
1 - g = fraction of time the pipeline is not filled (stalled)
Beyond Scalar Pipeline
n = pipeline stages
f = fraction of vectorizable code
s = scalar size (for the base pipeline, s = 1)
Speedup jumps from 3 to 4.3 for N = 6 and f = 0.8 when s = 2 instead of s = 1 (scalar); these values follow from Speedup = 1 / ((1 - f)/s + f/N).
[Figure: speedup curves over the typical range of f]
Limits of Pipeline
Speedup(N) = 1 / ((1 - f) + f/N)
f = fraction vectorizable, N = number of processors
The challenge of Amdahl's Law: look at the curves for f = 90% and f = 95%.
[Figure: speedup vs. N for several values of f]
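For example, at N = 100: f = 0.90 gives 1 / (0.10 + 0.90/100) ≈ 9.2x, and f = 0.95 gives 1 / (0.05 + 0.95/100) ≈ 16.8x; even 95% vectorizable code gets nowhere near a 100x speedup.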
Limits of Pipeline
IBM RISC experience:
- Control and data dependences add 15%
- Best case CPI of 1.15, i.e. IPC of 0.87
- Deeper pipelines (higher frequency) magnify dependence penalties
- This analysis assumes 100% cache hit rates
- Hit rates approach 100% for some programs
- Many important programs have much worse hit rates
Classifying ILP Machines
Baseline scalar RISC:
- Issue parallelism: IP = 1 (only one instruction per cycle)
- Operation latency: OP = 1
- Peak IPC = 1
[Figure: successive instructions 1-6 flowing through IF, DE, EX, WB one cycle apart, over cycles 0-9 of the baseline machine]
- Jouppi, WRL Research Report 89/7, 1989
Classifying ILP Machines
Super-pipelined:
- Cycle time (minor cycle) = 1/m of the baseline
- Issue parallelism: IP = 1 instruction per minor cycle
- Operation latency: OP = m minor cycles
- Peak IPC = m instructions per major cycle (an m-fold speedup?)
[Figure: successive instructions 1-6 entering IF, DE, EX, WB staggered by one minor cycle]
- Jouppi, WRL Research Report 89/7, 1989
Limits on Instruction Level Parallelism (ILP)
Study                          IPC achieved
Weiss and Smith [1984]         1.58
Sohi and Vajapeyam [1987]      1.81
Tjaden and Flynn [1970]        1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]        1.96
Uht [1986]                     2.00
Smith et al. [1989]            2.00
Jouppi and Wall [1988]         2.40
Johnson [1991]                 2.50
Acosta et al. [1986]           2.79
Wedig [1982]                   3.00
Butler et al. [1991]           5.8
Melvin and Patt [1991]         6
Wall [1991]                    7 (Jouppi disagreed)
Kuck et al. [1972]             8
Riseman and Foster [1972]      51 (no control dependences)
Nicolau and Fisher [1984]      90 (Fisher's optimism)
Beyond Scalar Limit
- Go beyond the single-instruction pipeline; achieve IPC > 1
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared to sequential code that is hard to parallelize otherwise
- Exploit fine-grained, instruction-level parallelism (ILP)
Classifying ILP Machines
Super-scalar pipeline:
- Issue parallelism: IP = n instructions per cycle
- Operation latency: OP = 1 cycle
- Peak IPC = n instructions per cycle (an n-fold speedup?)
[Figure: successive instructions 1-9 flowing through IF, DE, EX, WB, n at a time per cycle]
- Jouppi, WRL Research Report 89/7, 1989
Classifying ILP Machines
VLIW: Very Long Instruction Word:
- Issue parallelism: IP = n instructions per cycle
- Operation latency: OP = 1 cycle
- Peak IPC = n instructions per cycle = 1 VLIW instruction per cycle
[Figure: one wide instruction going through IF and DE, its operations executing in parallel EX units, then WB]
- Jouppi, WRL Research Report 89/7, 1989
Very Long Instruction Word Processor
VLIW: Idea and Motivation
To overcome the difficulty of finding parallelism in machine-level object code, a VLIW processor packs multiple instructions together and issues them in parallel to an equal number of execution units.
The compiler (not the processor) ensures that only independent instructions are executed in parallel.
VLIW: Very Long Instruction Word
Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
Multiple operations are packed into one instruction.
Each operation slot is for a fixed function.
Constant operation latencies are specified.
The architecture requires a guarantee of:
- parallelism within an instruction => no cross-operation RAW check
- no data use before data is ready => no data interlocks
VLIW Compiler Responsibilities
The compiler:
- schedules to maximize parallel execution
- guarantees intra-instruction parallelism
- schedules to avoid data hazards (no interlocks)
- typically separates operations with explicit NOPs (see the sketch of the instruction word below)
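As a rough illustration of what the compiler is scheduling into, one instruction word of the six-slot format above could be modeled as a C struct; the field layout and NOP encoding are assumptions for illustration, not a real ISA:

    /* One very long instruction word: six fixed-function operation slots. */
    typedef struct {
        unsigned op;             /* opcode; OP_NOP marks an unfilled slot */
        unsigned dst, src1, src2;
    } Slot;

    enum { OP_NOP = 0 };         /* empty slots are padded with explicit NOPs */

    typedef struct {
        Slot int_op[2];          /* two integer units, 1-cycle latency    */
        Slot mem_op[2];          /* two load/store units, 3-cycle latency */
        Slot fp_op[2];           /* two FP units, 4-cycle latency         */
    } VLIWWord;

The scheduler's job is to fill as many of the six slots per word as dependences allow, inserting NOPs elsewhere.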
Early VLIW Machines
FPS AP120B (1976)
- scientific attached array processor
- first commercial wide-instruction machine
- hand-coded vector math libraries using software pipelining and loop unrolling
Multiflow Trace (1987)
- commercialization of ideas from Fisher's Yale group, including "trace scheduling"
- available in configurations with 7, 14, or 28 operations/instruction
- 28 operations packed into a 1024-bit instruction word
Cydrome Cydra-5 (1987)
- 7 operations encoded in a 256-bit instruction word
- rotating register file
Loop Execution
Source:
for (i = 0; i < N; i++)
    B[i] = A[i] + C;

Compiled code:
loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):
cycle 1: add r1 | ld f1
cycle 4: fadd f2              (waits out the 3-cycle load latency)
cycle 8: add r2, bne | sd f2  (waits out the 4-cycle fadd latency)

How many FP ops/cycle?
IPC = 1 fadd / 8 cycles = 0.125
Loop Unrolling
for (i = 0; i < N; i++)
    B[i] = A[i] + C;

Unroll the inner loop to perform 4 iterations at once:

for (i = 0; i < N; i += 4) {
    B[i]   = A[i]   + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}

Values of N that are not multiples of the unrolling factor must be handled with a final cleanup loop, as sketched below.
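A minimal C sketch of the unrolled loop together with that final cleanup loop; the function name add_const is just for illustration:

    /* B[i] = A[i] + C, unrolled by 4, with a cleanup loop for leftover iterations. */
    void add_const(double *B, const double *A, double C, int N)
    {
        int i;
        for (i = 0; i + 3 < N; i += 4) {  /* main unrolled body */
            B[i]   = A[i]   + C;
            B[i+1] = A[i+1] + C;
            B[i+2] = A[i+2] + C;
            B[i+3] = A[i+3] + C;
        }
        for (; i < N; i++)                /* handles the N % 4 leftover elements */
            B[i] = A[i] + C;
    }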
Scheduling Loop Unrolled Code
Unroll 4 ways:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):
cycle 1:  ld f1
cycle 2:  ld f2
cycle 3:  ld f3
cycle 4:  add r1 | ld f4 | fadd f5
cycle 5:  fadd f6
cycle 6:  fadd f7
cycle 7:  fadd f8
cycle 8:  sd f5
cycle 9:  sd f6
cycle 10: sd f7
cycle 11: add r2, bne | sd f8

How many FLOPs/cycle? IPC = 4 fadds / 11 cycles ≈ 0.36
Software Pipelining
Unroll 4 ways first:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

Software-pipelined schedule (slots: Int1, Int2, M1, M2, FP+, FPx):
- Prolog: the loads of the first iterations and the first fadds issue before any store can, filling the software pipeline.
- Steady-state loop (one line per cycle; the loads, fadds, and stores in a cycle belong to different iterations):
loop: ld f1          | sd f5 | fadd f5
      ld f2          | sd f6 | fadd f6
      add r2 | ld f3 | sd f7 | fadd f7
      add r1, bne | ld f4 | sd f8 | fadd f8
- Epilog: after the final loads, the remaining fadds and stores drain the pipeline.

How many FLOPs/cycle?
4 fadds / 4 cycles = 1
Software Pipelining vs. Loop Unrolling
[Figure: performance over time. The unrolled loop pays startup overhead and wind-down overhead within every iteration of the loop, so performance repeatedly ramps up and down; the software-pipelined loop ramps up once and holds steady performance for the whole loop.]
Software pipelining pays the startup and wind-down costs only once per loop, not once per iteration.
Reference:
- Jouppi, WRL Research Report 89/7, 1989.
- Appendix C, Computer Architecture: A Quantitative Approach, 5th Edition (Pipelining: Basic and Intermediate Concepts).
Next Lecture
Pipelining, to be continued...