Instruction-Level Parallelism (ILP)

The document discusses instruction-level parallelism and techniques for exploiting it, including dynamic scheduling, loop unrolling, and branch prediction. It covers hardware and software approaches and limitations on ILP that led to multicore processors. Key concepts are data dependences, hazards, and reducing branch costs through advanced prediction.

Uploaded by Ram Sudeep

Instruction-Level Parallelism: Concepts and Challenges

 All processors since about 1985 have used pipelining to overlap the execution of instructions and improve performance.
 This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.
Two approaches to exploiting ILP:
 An approach that relies on hardware to help
discover and exploit the parallelism dynamically
 An approach that relies on software technology to
find parallelism statically at compile time.
Instruction-Level Parallelism: Concepts and Challenges

 We will study both approaches.
 Limitations of ILP approaches are also discussed in this chapter; it was such limitations that directly led to the movement to multicore processors.

The CPI (cycles per instruction) of a pipelined processor is:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
What Is Instruction-Level Parallelism?

 The amount of parallelism available within a basic block is quite small.
 For MIPS programs, the average dynamic branch frequency is often between 15% and 25%, meaning that between three and six instructions execute between a pair of branches.
 To obtain substantial performance enhancements, we
must exploit ILP across multiple basic blocks.
 The simplest and most common way to increase the ILP
is to exploit parallelism among iterations of a loop.
What Is Instruction-Level Parallelism?

 This type of parallelism is often called loop-level parallelism.
Example:
for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];

 Loop-level parallelism can be exploited by unrolling the loop, either statically by the compiler or dynamically by the hardware.
 An alternative method for exploiting loop-level parallelism is the use of SIMD, in both vector processors and graphics processing units (GPUs).
Data Dependences and Hazards

 If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls.

 There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences.
 Data Dependences:
An instruction j is data dependent on instruction i if either of
the following holds:
 Instruction i produces a result that may be used by instruction j.
 Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i.
Data Dependences and Hazards

 A data dependence conveys three things:
(1) the possibility of a hazard,
(2) the order in which results must be calculated, and
(3) an upper bound on how much parallelism can possibly be exploited.
 A dependence can be overcome in two different ways:
(1) maintaining the dependence but avoiding a hazard, and
(2) eliminating a dependence by transforming the code
Data Dependences and Hazards

 Name Dependences:
A name dependence occurs when two instructions use the same
register or memory location, called a name.

 There are two types of name dependences between an instruction i that precedes instruction j in program order:
1. An antidependence between instruction i and instruction j
occurs when instruction j writes a register or memory
location that instruction i reads.
2. An output dependence occurs when instruction i and
instruction j write the same register or memory location.
Both can be resolved by register renaming.
Data Dependences and Hazards

 Data Hazards:
Three types of data hazards exist:
1. RAW (read after write)
2. WAR (write after read)
3. WAW (write after write)

Control Dependence
A control dependence determines the ordering of an instruction, i,
with respect to a branch instruction so that instruction i is executed in
correct program order and only when it should be.
if p1 {
S1;
}
if p2 {
S2;
}
Here S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Data Dependences and Hazards

Control Dependence
 Consider this code sequence:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
Moving the LW above the BEQZ could create a new exception (a memory protection exception) that the original program would never raise.
 Exception behavior must be preserved: reordering instructions must not change the exceptions raised in program order.

Data Dependences and Hazards

Control Dependence
 Consider this code sequence:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L: ...
OR R7,R1,R8

 The data flow of R1 must be preserved: the OR uses the value of R1 selected by the branch outcome, so the DSUBU can be moved above the branch only by speculating on the branch.

Basic Compiler techniques for Exposing ILP

• These techniques are crucial for processors that use static issue or static scheduling.

Basic Pipeline Schedule and Loop Unrolling
• Parallelism among instructions must be exploited by
finding sequences of unrelated instructions that can
be overlapped in the pipeline.
• To avoid a pipeline stall:
The execution of a dependent instruction must
be separated from the source instruction by a
distance in clock cycles equal to the pipeline
latency of that source instruction.
Basic Pipeline Schedule and Loop Unrolling
• We assume the standard five-stage integer pipeline, so that
branches have a delay of one clock cycle.
• We assume that the functional units are fully
pipelined and no structural hazards exist.
Basic Pipeline Schedule and Loop Unrolling
• We look at how the compiler can increase the amount of
available ILP by transforming loops.
for (i=999; i>=0; i=i–1)
x[i] = x[i] + s;
• We can see that this loop is parallel by noticing that the body of each iteration is independent.
• Translate the above segment to MIPS assembly language.
Basic Pipeline Schedule and Loop Unrolling
Example:
Show how the loop would look on MIPS, both
scheduled and unscheduled, including any stalls or idle
clock cycles. Schedule for delays from floating-point
operations, but remember that we are ignoring delayed
branches.
Basic Pipeline Schedule and Loop Unrolling
• In the above example, we need only three cycles
(load, add, store) to complete one iteration of loop
operation.
• Four of the seven cycles used to execute the single iteration are loop overhead: the DADDUI and BNE, and two stalls.
• A simple scheme for increasing the number of
instructions relative to the branch and overhead
instructions is loop unrolling.
• Unrolling simply replicates the loop body multiple
times, adjusting the loop termination code.
Basic Pipeline Schedule and Loop Unrolling
• Loop unrolling can also be used to improve
scheduling. Because it eliminates the branch, it
allows instructions from different iterations to be
scheduled together.
• We can eliminate the data use stalls by creating
additional independent instructions within the loop
body.
• The loop unrolling increases the required number of
registers.
Basic Pipeline Schedule and Loop Unrolling
Example:

 Show our loop unrolled so that there are four copies of the loop body, assuming R1 – R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.
Basic Pipeline Schedule and Loop Unrolling
Basic Pipeline Schedule and Loop Unrolling

 Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 27 clock cycles (each LD has 1 stall, each ADDD 2, the DADDUI 1, plus 14 instruction issue cycles), or 6.75 clock cycles for each of the four elements, but it can be scheduled to improve performance significantly.
Basic Pipeline Schedule and Loop Unrolling
Summary of the Loop Unrolling and Scheduling
Reducing Branch Costs with Advanced Branch
Prediction

• We can also reduce the performance losses of branches by predicting how they will behave. Consider:
if (aa==2)
aa=0;
if (bb==2)
bb=0;
if (aa!=bb) { ... }
The behavior of the last branch is correlated with the first two: if the first two conditions are true, aa and bb are both 0, so the third condition is false.
Reducing Branch Costs with Advanced Branch
Prediction
• Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
• In the general case, an (m,n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
Dynamic Scheduling: The Idea

We still use in-order instruction issue (i.e., instructions issued in program order), but we want an instruction to begin execution as soon as its data operands are available. Such a pipeline does out-of-order execution, which implies out-of-order completion.
Dynamic Scheduling: The Idea

There is an antidependence between the ADD.D and the SUB.D; if the pipeline executes the SUB.D before the ADD.D (which is waiting for the DIV.D), it will violate the antidependence, yielding a WAR hazard. Likewise, to avoid violating output dependences, such as the write of F6 by the MUL.D, WAW hazards must be handled.
Dynamic Scheduling: The Idea
Dynamic Scheduling: Tomasulo’s Approach
Hardware Based Speculation

• Using the bypassed value is like performing a speculative register read, since we do not know whether the instruction providing the source register value is providing the correct result until the instruction is no longer speculative.
• When an instruction is no longer speculative, we
allow it to update the register file or memory; we
call this additional step in the instruction
execution sequence instruction commit.
Hardware Based Speculation

• The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order, and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits.
• Adding this commit phase to the instruction
execution sequence requires an additional set of
hardware buffers that hold the results of
instructions that have finished execution but have
not committed.
Hardware Based Speculation

• This hardware buffer, which we call the reorder buffer (ROB), is also used to pass results among instructions that may be speculated.
• The ROB holds the result of an instruction
between the time the operation associated with
the instruction completes and the time the
instruction commits.
• Hence, the ROB is a source of operands for
instructions, just as the reservation stations
provide operands in Tomasulo’s algorithm.
Hardware Based Speculation
Difference b/w Tomasulo’s Approach and ROB:

• The key difference is that in Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file.
• With speculation, the register file is not updated until
the instruction commits (and we know definitively that
the instruction should execute); thus, the ROB supplies
operands in the interval between completion of
instruction execution and instruction commit.
• Each entry in the ROB contains four fields: the
instruction type, the destination field, the value field,
and the ready field.
The basic structure of a FP unit using Tomasulo’s algorithm and
extended to handle speculation
Operations: Extended speculation handling
Example 1
Example 2
Elimination of Hazards in ROB
• WAW and WAR hazards through memory are eliminated with
speculation because the actual updating of memory occurs in
order, when a store is at the head of the ROB, and, hence, no
earlier loads or stores can still be pending.
• RAW hazards through memory are maintained by two
restrictions:
1. Not allowing a load to initiate the second step of its
execution if any active ROB entry occupied by a store has a
Destination field that matches the value of the A field of the
load.
2. Maintaining the program order for the computation of an
effective address of a load with respect to all earlier stores.
VLIW
Multiple Issue
