
Computer Architecture Unit 5

Unit 5 Design Space of Pipelines


Structure:
5.1 Introduction
Objectives
5.2 Design Space of Pipelines
Basic layout of a pipeline
Dependency resolution
5.3 Pipeline Instruction Processing
5.4 Pipelined Execution of Integer and Boolean Instructions
The design space
Logical layout of FX pipelines
Implementation of FX pipelines
5.5 Pipelined Processing of Loads and Stores
Subtasks of load and store processing
The design space
Sequential consistency of instruction execution
Instruction issuing and parallel execution
5.6 Summary
5.7 Glossary
5.8 Terminal Questions
5.9 Answers

5.1 Introduction
In the previous unit, you studied pipelined processors in detail, with a
short review of pipelining and examples of pipelines in modern
processors. You also studied various kinds of pipeline hazards and the
techniques available to handle them.
In this unit, we will introduce you to the design space of pipelines. The
day-by-day increasing complexity of chips has led to higher operating
speeds. These speeds are achieved by overlapping instruction latencies,
that is, by implementing pipelining. Early models used a discrete
pipeline, which performs a task in stages such as fetch, decode, execute,
memory access, and write-back. Every pipeline stage requires one clock
cycle, and as there are five stages, the instruction latency is five
cycles. Longer pipelines, spread over more cycles, can hide instruction
latencies.

Manipal University Jaipur B1648 Page No. 102


This enables processors to attain higher clock speeds. Instruction
pipelining has significantly improved the performance of today’s
processors. In this
unit, you will study the design space of pipelines which is further divided into
basic layout of a pipeline and dependency resolution. We focus primarily on
pipelined execution of Integer and Boolean instructions and pipelined
processing of loads and stores.
Objectives:
After studying this unit, you should be able to:
•  explain design space of pipelines
•  describe pipeline instruction processing
•  identify pipelined execution of Integer and Boolean instructions
•  discuss pipelined processing of loads and stores

5.2 Design Space of Pipelines


In this section, we will learn about the design space of pipelines. The
design space of pipelines can be subdivided into two aspects, as shown in
figure 5.1.

Figure 5.1: Principle Aspects of Design Space of Pipelines

Let’s discuss each one of them in detail.


5.2.1 Basic Layout of a pipeline
To understand a pipeline in depth, it is necessary to know about those
decisions which are fundamental to the layout of a pipeline. Let’s discuss
them below:


1. Number of pipeline stages used to perform a given task,
2. Specification of the subtasks to be performed in each of the pipeline
   stages,
3. Layout of the stage sequence, that is, whether the stages are used in
   a strict sequential manner or some stages are recycled,
4. Use of bypassing, and
5. Timing of the pipeline operations, that is, whether pipeline
   operations are controlled synchronously or asynchronously.

Figure 5.2 depicts these stages diagrammatically.

Figure 5.2: Overall Stage Layout of a pipeline

5.2.2 Dependency resolution


Pipeline design has another aspect called the dependency resolution.
Earlier, some pipelined computers used the Microprocessor without
Interlocked Pipeline Stages (MIPS approach) and used a static dependency
resolution which is also called static scheduling or software interlock
resolution.
Here the detection and proper resolution of dependencies is done by the
compiler. Examples of static dependency resolution are:
•  Original MIPS designs (like the MIPS and the MIPS-X)
•  Some less famous RISC processors (like RCA, Spectrum)
•  The Intel i860 processor, which has both VLIW and scalar operation
   modes.


A more advanced resolution scheme is the combined static/dynamic
dependency resolution. This has been employed by MIPS R-series processors
such as the R2000, R3000, R4000, R4200 and R6000. In the first MIPS
processors (R2000, R3000), hardware interlocks were used for long-latency
operations, such as multiplication, division and conversion, while the
resolution of short-latency operations relied entirely on the compiler.
Newer R-series implementations have extended the range of hardware
interlocks further and further, first to the load/store hazards (R6000)
and then to other short-latency operations as well (R4000). In the R4000,
the only instructions which rely on a static dependency resolution are
the coprocessor control instructions.
In recent processors dependencies are resolved dynamically, by extra
hardware. Nevertheless, compilers for these processors are assumed to
perform a parallel optimisation by code reordering, in order to increase
performance. Figure 5.3 shows the various possibilities of resolving the
pipeline hazards.

Figure 5.3: Possibilities for Resolving Pipeline Hazards

Self Assessment Questions


1. The full form of MIPS is ___________________.
2. In recent processors dependencies are resolved ______________, by
extra hardware.


Activity 1:
Visit a library and find out the features of R2000, R3000, R4000, R4200
and R6000. Compare them in a chart.

5.3 Pipeline Instruction Processing


An instruction pipeline operates on a stream of instructions by
overlapping and decomposing the three phases (fetch, decode and execute)
of the instruction cycle. It has been used extensively in RISC machines
and many high-end mainframes as one of the major contributors to high
performance. A typical instruction execution in a pipelined architecture
consists of the following sequence of operations:
1. Fetch instruction: In this operation, the next expected instruction is
read from cache memory into a buffer.
2. Decode instruction/register fetch: The instruction is decoded: the
opcode and the operand specifiers are determined, and the register file
is accessed to read the source registers.
3. Calculate operand address: Now, the effective address of each source
operand is calculated.
4. Fetch operand/memory access: Then, the memory is accessed to fetch
each operand. For a load instruction, data returns from memory and is
placed in the Load Memory Data (LMD) register. If it is a store, then
data from a register is written into memory. In both cases, the operand
address computed in the prior cycle is used.
5. Execute instruction: In this operation, the ALU performs the indicated
operation on the operands prepared in the prior cycle and stores the
result in the specified destination operand location.
6. Write back operand: Finally, the result is written into the register
file or stored into memory.
These six stages of instruction pipeline are shown in a flowchart in
figure 5.4.


Figure 5.4: Flowchart of an Instruction Pipeline
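As an illustration (not part of the original unit), the overlap of these
six stages can be sketched with a few lines of Python. The stage
abbreviations and function names below are our own; the model assumes an
ideal pipeline with no stalls.

```python
# Sketch: an ideal 6-stage instruction pipeline with no stalls.
# Instruction i enters stage s in cycle i + s, so the stages of
# successive instructions overlap.

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]  # fetch instruction, decode,
# calculate operand address, fetch operand, execute, write back

def pipeline_schedule(num_instructions):
    """Return {instruction: {stage: cycle}} for an ideal pipeline."""
    return {
        i: {stage: i + s for s, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    }

def total_cycles(num_instructions):
    """Cycles needed to complete all instructions."""
    return len(STAGES) - 1 + num_instructions
```

With four instructions, the pipeline finishes in 9 cycles instead of the
24 a purely sequential execution would need, because instruction 3
already writes back in cycle 8 (counting cycles from 0).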

Self Assessment Questions


3. In ________________, the result is written into the register file or
stored into memory.
4. In the Decode Instruction/Register Fetch operation, the
_______________ and the _______________ are determined and the register
file is accessed to read the registers.


5.4 Pipelined Execution of Integer and Boolean Instructions


Now let us discuss Pipelined execution of integer and Boolean instructions
with respect to the design space.
5.4.1 The design space
In this section, we first give an overview of the salient aspects of the
pipelined execution of FX instructions. (In this section, the
abbreviation FX will be used to denote integer and Boolean.) With
reference to figure 5.5, we emphasise two basic aspects of the design
space: how FX pipelines are laid out logically and how they are
implemented.

Figure 5.5: Design Space of the Pipelined Execution of FX instructions

A logical layout of an FX pipeline consists, first, of the specification
of how many stages an FX pipeline has and what tasks are to be performed
in
these stages. These issues will be discussed in Section 5.4.2 for RISC and
CISC pipelines. The other key aspect of the design space is how FX
pipelines are implemented. In this respect we note that the term FX pipeline
can be interpreted in both a broader and a narrower sense. In the broader
sense, it covers the full task of instruction fetch, decode, execute
and, if required, write back. In this case, it is usually also employed
for the execution of load/store (LS) and branch instructions and is
termed a master pipeline.
By contrast, in the narrower sense, an FX pipeline is understood to deal
only with the execution and writeback phases of the processing of FX


instructions. Then, the preceding tasks of instruction fetch, decode
and, in the case of superscalar execution, instruction issue are
performed by a
separate part of the processor.
5.4.2 Logical layout of FX pipelines
Integer and Boolean instructions account for a considerable proportion of
programs. Together, they amount to 30-40% of all executed instructions.
Therefore, the layout of FX pipelines is fundamental to obtaining a high-
performance processor.
In the following topic, we discuss how FX pipelines are laid out. We
describe the FX pipelines for RISC and CISC processors separately, since
each type has a slightly different scope. While processing operate
instructions, RISC pipelines have to cope only with register operands. By
contrast, CISC pipelines must be able to deal with both register and
memory operands as well as destinations.
Pipeline in RISC architecture: Before discussing pipelines in RISC
machines, let us first discuss what is a RISC machine? The term RISC
stands for Reduced Instruction Set Computing. RISC computers reduce
chip complexity by using simpler instructions. As a result, RISC compilers
have to generate software routines to perform complex instructions that
would have been done in hardware by CISC (Complex Instruction Set
Computing) computers. The salient features of RISC architecture are as
follows:
•  RISC architecture has instructions of uniform length.
•  Instruction sets are streamlined to carry efficient and important
   instructions.
•  The memory addressing method is simplified; complex references are
   split up into several simpler instructions.
•  The number of registers is increased. RISC processors can have from 16
   up to 64 registers, which hold frequently used variables.
Pipelining is a standard feature in RISC processors. A typical RISC
processor pipeline operates in the following steps:
1. Fetch instructions from the memory
2. Read the registers and then decode instruction
3. Either execute instruction or compute the address
4. Access the operand stored at that memory location
5. Write the calculated result into the register
RISC instructions are simpler than the instructions used in CISC
processors, and this simplicity is what makes pipelining practical. CISC
instructions are of variable length, while RISC instructions are of the
same length and can be fetched in a single operation. Theoretically, each
stage in a RISC processor should take one clock cycle, so that the
processor completes execution of one instruction in one clock cycle.
Practically, however, RISC processors take more than one cycle per
instruction. The processor may sometimes stall due to branch instructions
and data dependencies. A data dependency takes place when an instruction
waits for the output of a previous instruction. Delay can also occur when
an instruction is waiting for data that is not yet available in a
register. In such cases the processor cannot finish an instruction in one
clock cycle.
Branch instructions are those that tell the processor to make a decision
about which instruction is to be executed next, generally based on the
result of another instruction. They can also create problems in a
pipeline when a branch is conditional on the result of an instruction
that has not yet finished its path through the pipeline. In that case
too, the processor takes more than one clock cycle to finish one
instruction.
Pipeline in CISC architecture: CISC is an acronym for Complex Instruction
Set Computer. The CISC machines are easy to program and make efficient
use of memory. Since the earliest machines were programmed in assembly
language and memory was slow and expensive, the CISC philosophy was
commonly implemented in large computers such as PDP-11. Most common
microprocessor designs such as the Intel 80x86 and Motorola 68K series
have followed the CISC philosophy. The CISC instructions sets have the
following main features:
•  Two-operand format, where instructions have both a source and a
   destination.
•  Register-to-register, memory-to-register and register-to-memory
   commands.
•  Multiple addressing modes for memory, including specialised modes for
   indexing through arrays.
•  Instruction length that varies depending on the addressing mode.
•  Multiple clock cycles required by instructions to execute.


The Intel 80486, a CISC machine, uses a 5-stage pipeline. Here the CPU
tries to maintain an execution rate of one instruction per clock cycle.
However, this architecture does not achieve its maximum potential
performance improvement, for the following reasons:
•  Sub-cycles occur between the initial fetch and the instruction
   execution.
•  An instruction may have to wait for the output of a previous
   instruction.
•  Branch instructions occur.
5.4.3 Implementation of FX pipelines
Most of today's arithmetic pipelines are designed to perform fixed
functions. These arithmetic/logic units (ALUs) perform fixed-point and
floating-point operations separately. The fixed-point unit is also called the
integer unit. The floating-point unit can be built either as part of the central
processor or on a separate coprocessor. These arithmetic units perform
scalar operations involving one pair of operands at a time. The pipelining in
scalar arithmetic pipelines is controlled by software loops. Vector arithmetic
units can be designed with pipeline hardware directly under firmware or
hardwired control.
Scalar and vector arithmetic pipelines differ mainly in the areas of register
files and control mechanisms involved. Vector hardware pipelines are often
built as add-on options to a scalar processor or as an attached processor
driven by a control processor. Both scalar and vector processors are used
in modern supercomputers.
Arithmetic pipeline stages: Depending on the function to be implemented,
different pipeline stages in an arithmetic unit require different hardware
logic. Since all arithmetic operations (such as add, subtract, multiply, divide,
squaring, square rooting, logarithm, etc.) can be implemented with the basic
add and shifting operations, the core arithmetic stages require some form of
hardware to add and to shift. For example, a typical three-stage floating-
point adder includes a first stage for exponent comparison and equalisation
which is implemented with an integer adder and some shifting logic; a
second stage for fraction addition using a high-speed carry look ahead
adder; and a third stage for fraction normalisation and exponent
readjustment using a shifter and another addition logic.


Arithmetic or logical shifts can be easily implemented with shift registers.


High-speed addition requires either the use of a carry-propagate adder
(CPA), which adds two numbers and produces an arithmetic sum as shown in
figure 5.6a, or the use of a carry-save adder (CSA) to "add" three input
numbers and produce one sum output and a carry output, as exemplified in
figure 5.6b.

Figure 5.6: Distinction between a Carry-propagate Adder (CPA) and a
Carry-save Adder (CSA)


In a CPA, the carries generated in successive digits are allowed to
propagate from the low end to the high end, using either ripple carry
propagation or some carry look-ahead technique. In a CSA, the carries are
not allowed to propagate but instead are saved in a carry vector. In
general, an n-bit CSA is specified as follows: let X, Y, and Z be three
n-bit input numbers, expressed as X = (xn-1, xn-2, ..., x1, x0) and so
on. The CSA performs bitwise operations simultaneously on all columns of
digits to produce two n-bit output numbers, denoted as
Sb = (0, Sn-1, Sn-2, ..., S1, S0) and C = (Cn, Cn-1, ..., C1, 0). Note
that the leading bit of the bitwise sum Sb is always a 0, and the tail
bit of the carry vector C is always a 0. The input-output relationships
are expressed below:

    Si = xi ⊕ yi ⊕ zi
    Ci+1 = xiyi ∨ yizi ∨ zixi                                    (5.1)

for i = 0, 1, 2, ..., n-1, where ⊕ is the exclusive OR and ∨ is the
logical OR operation. Note that the arithmetic sum of three input
numbers, i.e., S = X + Y + Z, is obtained by adding the two output
numbers, i.e., S = Sb + C, using a CPA. We use the CPA and CSAs to
implement the pipeline stages of a fixed-point multiply unit as follows.
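Equation 5.1 can be checked with a short Python sketch. This is
illustrative code, not part of the original unit; the function name and
the 8-bit width are our own choices.

```python
# Sketch: an n-bit carry-save adder per equation 5.1, checked against
# ordinary addition.

def csa(x, y, z, n=8):
    """Reduce three n-bit numbers to a bitwise sum Sb and a carry vector C."""
    sb = c = 0
    for i in range(n):
        xi, yi, zi = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        sb |= (xi ^ yi ^ zi) << i                             # Si
        c |= ((xi & yi) | (yi & zi) | (zi & xi)) << (i + 1)   # Ci+1
    return sb, c

# The arithmetic sum X + Y + Z equals Sb + C, added with a CPA
# (plain '+' stands in for the CPA here).
sb, c = csa(23, 42, 99)
print(sb + c == 23 + 42 + 99)  # prints True
```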
Multiply Pipeline Design: Consider as an example the multiplication of two
8-bit integers A x B = P, where P is the 16-bit product. This fixed-point
multiplication can be written as the summation of eight partial products as
shown below: P = A x B = P0 + P1+ P2 + ………. + P7, where x and + are
arithmetic multiply and add operations, respectively.


Note that the partial product Pj is obtained by multiplying the
multiplicand A by the jth bit of B and then shifting the result j bits to
the left, for j = 0, 1, 2, ..., 7. Thus Pj is (8 + j) bits long with j
trailing zeros. The
summation of the eight partial products is done with a Wallace tree of CSAs
plus a CPA at the final stage, as shown in figure 5.7.

Figure 5.7: A Pipeline Unit for Fixed-point Multiplication of 8-bit Integers

The first stage (S1) generates all eight partial products, ranging from 8 bits
to 15 bits, simultaneously. The second stage (S2) is made up of two levels
of four CSAs, and it essentially merges eight numbers into four numbers
ranging from 13 to 15 bits. The third stage (S3) consists of two CSAs, and it
merges four numbers from S2 into two 16-bit numbers. The final stage
(S4) is a CPA, which adds up the last two numbers to produce the final
product P.
For a maximum width of 16 bits, the CPA is estimated to need four gate
levels of delay. Each level of the CSA can be implemented with a two-gate-
level logic. The delay of the first stage (S1) also involves two gate levels.
Thus the entire pipeline stages have an approximately equal amount of
delay. The matching of stage delays is crucial to the determination of the
number of pipeline stages, as well as the clock period. If the delay of the
CPA stage can be further reduced to match that of a single CSA level, then
the pipeline can be divided into six stages with a clock rate twice as fast.
The basic concepts can be extended to operands with a larger number of
bits.
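The stages S1 to S4 can be modelled in Python as follows. This is an
illustrative behavioural sketch, not a gate-level design: csa3 applies
equation 5.1 to whole bit vectors, and the exact grouping of CSAs is one
possible arrangement.

```python
# Sketch: fixed-point multiplication of two 8-bit integers, modelling
# stages S1-S4: partial product generation, CSA merging, and a final CPA.

def csa3(x, y, z):
    """One carry-save addition: three numbers -> (bitwise sum, carry vector)."""
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1

def multiply_pipeline(a, b):
    # S1: generate the eight partial products Pj = A * bit_j(B), shifted j bits.
    pp = [(a << j) if (b >> j) & 1 else 0 for j in range(8)]
    # S2: two levels of CSAs merge the eight numbers into four.
    s0, c0 = csa3(pp[0], pp[1], pp[2])
    s1, c1 = csa3(pp[3], pp[4], pp[5])
    s2, c2 = csa3(s0, c0, pp[6])
    s3, c3 = csa3(s1, c1, pp[7])
    # S3: two more CSAs merge the four numbers into two.
    s4, c4 = csa3(s2, c2, s3)
    s5, c5 = csa3(s4, c4, c3)
    # S4: the final CPA adds the last two numbers to give the product P.
    return s5 + c5
```

Since every carry-save step preserves the arithmetic sum of its inputs,
the result always equals A x B; for example, multiply_pipeline(13, 11)
returns 143.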
Self Assessment Questions
5. While processing operate instructions, RISC pipelines have to cope
only with _______________.
6. In RISC architecture, instructions are of a uniform length (True/ False).
7. Name two microprocessors which follow the CISC philosophy.
8. ______________ adds two numbers and produces an arithmetic sum.

Activity 2:
Access the internet and find out more about the difference between fixed
point and floating point units.

5.5 Pipelined Processing of Loads and Stores


Now let us study pipelined processing of loads and stores in detail.
5.5.1 Subtasks of load and store processing
Loads and stores are frequent operations, especially in RISC code. While
executing RISC code we can expect to encounter about 25-35% load
instructions and about 10% store instructions. Thus, it is of great importance
to execute load and store instructions effectively. How this can be done is
the topic of this section.
To start with, we summarise the subtasks which have to be performed
during a load or store instruction.


Let us first consider a load instruction. Its execution begins with the
determination of the effective memory address (EA) from where data is to
be fetched. In straightforward cases, like RISC processors, this can be done
in two steps: fetching the referenced address register(s) and calculating the
effective address. However, for CISC processors address calculation may
be a difficult task, requiring multiple subsequent register fetches and
address calculations, as for instance in the case of indexed, post-
incremented, relative addresses. Once the effective address is available,
the next step is usually to forward the effective (virtual) address to
the MMU for translation and to access the data cache. Here, and in the
subsequent
discussion, we shall not go into details of whether the referenced cache is
physically or virtually addressed, and thus we neglect the corresponding
issues. Furthermore, we assume that the referenced data is available in the
cache and thus it is fetched in one or a few cycles. Usually, fetched data is
made directly available to the requesting unit, such as the FX or FP unit,
through bypassing. Finally, the last subtask to be performed is writing the
accessed data into the specified register.
For a store instruction, the address calculation phase is identical to that
already discussed for loads. However, subsequently both the virtual address
and the data to be stored can be sent out in parallel to the MMU and the
cache, respectively. This concludes the processing of the store instruction.
Figure 5.8 shows the subtasks involved in executing load and store
instructions.

Figure 5.8: Subtasks of Executing Load and Store Instructions


5.5.2 The design space


While considering the design space of pipelined load/store processing we
take into account only one aspect, namely whether load/store operations
are executed sequentially or in parallel with FX instructions (Figure 5.9).
In traditional pipeline implementations, load and store instructions are
processed by the master pipeline. Thus, loads and stores are executed
sequentially with other instructions (Figure 5.9).

Figure 5.9: Sequential vs. Parallel Execution of Load/Store Instructions

In this case, the required address calculation of a load/store
instruction can be performed by the adder of the execution stage.
However, one instruction
slot is needed for each load or store instruction.


A more effective technique for load/store instruction processing is to
do it in parallel with data manipulations (see again Figure 5.9).
Obviously, this
approach assumes the existence of an autonomous load/store unit which
can perform address calculations on its own.
Let’s discuss both these techniques in detail.
5.5.3 Sequential consistency of instruction execution
By operating a processor's multiple EUs (execution units) in parallel,
instruction execution can be finished very fast. However, instruction
execution must maintain sequential consistency, which has two aspects:
1. Processor consistency: the order of instruction execution;
2. Memory consistency: the order of accessing the memory.
Processor consistency: The phrase processor consistency refers to the
consistency of instruction completion with sequential instruction
execution. Superscalar processors exhibit two types of processor
consistency, namely weak and strong consistency.
Weak processor consistency allows instructions to complete out of order,
provided that no data dependencies are violated. Data dependencies must
be detected and resolved during execution.
Strong processor consistency forces instructions to complete in program
order. This can be attained through a reorder buffer (ROB), a storage
area through which all results are read and written.
Memory consistency: Another aspect of superscalar instruction execution
is whether memory accesses are performed in the same order as in a
sequential processor.
Memory consistency is weak if memory accesses may occur out of order
with respect to strict sequential program execution, provided that data
dependencies are not violated. Simply stated, weak consistency permits
load and store reordering, as long as memory data dependencies are
detected and resolved.
Memory consistency is strong if memory accesses occur strictly in
program order; load/store reordering is prohibited.


Load and store reordering

Load and store instructions affect both the processor and the memory.
First the ALU or an address unit computes the address, and then the load
or store instruction is executed. A load can then fetch data from the
data cache or memory; once the generated address is available, a store
instruction can send out its operands.
A processor that adopts weak memory consistency permits memory access
reordering. This is advantageous for three reasons:
1. It permits load/store bypassing,
2. It makes speculative loads or stores feasible, and
3. It allows cache misses to be hidden.
Load/store bypassing
Load/store bypassing means that either kind of instruction can bypass the
other: stores can bypass loads or vice versa, without violating memory
data dependencies. Allowing loads to bypass stores provides the advantage
of runtime overlapping of loop iterations: loads at the beginning of an
iteration can access memory without having to wait until stores at the
end of the former iteration are finished. To prevent fetching a false
data value, a load may bypass pending stores only if none of the previous
stores has the same target address as the load. Nevertheless, the
addresses of certain pending stores may not yet be available.
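The address check that guards load bypassing can be sketched as follows.
This is illustrative code (names are our own); it assumes the set of
pending store addresses is tracked explicitly.

```python
# Sketch: a load may bypass the pending stores only when every pending
# store address is known and none matches the load's target address.

def load_may_bypass(load_addr, pending_store_addrs):
    """pending_store_addrs holds one entry per pending store;
    None marks an address that has not been computed yet."""
    for addr in pending_store_addrs:
        if addr is None:        # unknown address: be conservative, wait
            return False
        if addr == load_addr:   # same target: the load must see that store
            return False
    return True
```

A speculative-load scheme would instead let the load proceed even past
the unknown address and check it later, as described below.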
Speculative loads
Speculative loads avoid memory access delay. This delay can be caused by
required addresses not yet being computed or by clashes among addresses.
Speculative loads must be checked for correctness; if a speculation turns
out to be wrong, corrective measures must be taken. Speculative loads are
similar in spirit to speculative branches.
To check the addresses, the computed target addresses of loads and stores
are written into the ROB (reorder buffer), where the address comparison
is carried out.
Reorder buffer (ROB)
The ROB was introduced in 1988 as a solution to the precise interrupt
problem. Today, the ROB serves as an assurance tool for sequentially
consistent execution where multiple EUs operate in parallel.
The ROB is a circular buffer with head and tail pointers. Instructions
enter the ROB in program order only. An instruction can retire only when
it has finished and all previous instructions have already retired.
Sequential consistency is maintained by directing instructions to update
the program state, writing their results into the memory or the
referenced architectural register(s) in proper program order. The ROB can
successfully support both interrupt handling and speculative execution.
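The in-order retirement discipline of the ROB can be sketched with a
simple FIFO model. The class and method names below are illustrative,
not from the text.

```python
# Sketch: a reorder buffer as a FIFO. Instructions enter in program
# order; an entry retires only when it has finished and has reached the
# head, so results update the program state strictly in program order.
from collections import deque

class ReorderBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()            # head = oldest instruction

    def dispatch(self, name):
        if len(self.entries) == self.capacity:
            raise RuntimeError("ROB full: dispatch must stall")
        self.entries.append({"name": name, "done": False})

    def finish(self, name):               # execution completes, possibly out of order
        for e in self.entries:
            if e["name"] == name:
                e["done"] = True

    def retire(self):
        """Retire finished instructions from the head; return their names."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired
```

For example, if i2 finishes before i1, it still cannot retire until i1
has finished, so the architectural state is updated in program order.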
5.5.4 Instruction issuing and parallel execution
In this phase, execution tuples are created, and it is then decided which
execution tuples can be issued. Checking the availability of data and
resources at run time is known as instruction issuing. In the
instruction-issuing area, many pipelines are processed.
In figure 5.10 you can see a reorder buffer which follows FIFO order.

Figure 5.10: A Reorder Buffer.

In this buffer, entries are received and sent in FIFO order. An
instruction can be executed once its input operands are present; other
instructions may still be waiting in the issue buffer.
Other constraints are associated with the buffers carrying the execution
tuples. In figure 5.11 you can see the Parallel Execution Schedule (PES)
of an iteration. The PES has hardware resources comprising one path to
the memory, two integer units and one branch unit.

Figure 5.11: Example of PES

The rows show the time steps and the columns show the operations
performed in each time step. In this PES, the branch "ble" is predicted
not taken, and instructions from the predicted path are executed
speculatively. In this example we have shown renaming only for register
r3, but other registers can also be renamed. The various values assigned
to register r3 are bound to different physical registers (R1, R2, R3,
R4).
There are several ways of organising the instruction issue buffer, in
increasing order of complexity.
Single queue method: Renaming is not needed in the single queue method
because there is one queue and no out-of-order issue. In this method,
operand availability can be managed through simple reservation bits
assigned to each register. An instruction that will modify a register
reserves the register when it issues, and the reservation is cleared when
the modification completes.
Multiple queue method: In the multiple queue method, each individual
queue issues instructions in order, but the queues may issue out of order
with respect to one another. The queues are organised by instruction
type.
Reservation stations: With reservation stations, instruction issue does not follow FIFO order. As a result, the reservation stations must monitor the availability of their source operands simultaneously. The conventional way of doing this is to hold operand data in the reservation station itself: when a reservation station receives an instruction, the operand values that are already available are read and placed in it.
The station then compares the operand designators of its unavailable data against the result designators of completing instructions. If there is a match, the result value is forwarded to the matching reservation station.
An instruction is issued once all its operands are ready in the reservation station. The reservation stations may be partitioned by instruction type, to reduce data paths, or may be organized as a single block.
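The operand-capture and matching behaviour described above can be sketched as follows. This is a simplified, Tomasulo-style software model; the class and its fields are assumptions for illustration, not the book's implementation:

```python
# Simplified reservation-station sketch (Tomasulo-style; illustrative only).
# A station captures the operand values that are ready at dispatch, then
# snoops completing results for the operands that were not yet available.

class Station:
    def __init__(self, op, srcs, regfile, busy):
        self.op = op
        # Capture available values now; remember designators of pending ones.
        self.vals = {r: regfile[r] for r in srcs if r not in busy}
        self.waiting = {r for r in srcs if r in busy}

    def snoop(self, dest, value):
        # Compare a completing result's designator against waiting operands;
        # on a match, forward the value into the station.
        if dest in self.waiting:
            self.waiting.remove(dest)
            self.vals[dest] = value

    def ready(self):
        # The instruction can issue once all operands are captured.
        return not self.waiting

regfile = {"r1": 10, "r2": 20}
busy = {"r2"}                 # r2 is being produced by an earlier instruction
s = Station("add", ["r1", "r2"], regfile, busy)
print(s.ready())              # False: still waiting on r2
s.snoop("r2", 99)             # r2's producer completes and broadcasts 99
print(s.ready(), s.vals)      # True {'r1': 10, 'r2': 99}
```

The key point the sketch illustrates is that the station, not the register file, buffers the operand values, so issue order is decoupled from program order.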
Self Assessment Questions
9. In traditional pipeline implementations, load and store instructions are
processed by the ____________________.
10. The consistency of instruction completion with that of sequential instruction execution is specified by _______________.
11. A processor that supports weak memory consistency does not allow reordering of memory accesses. (True/False)
12. _____________ is not needed in single queue method.
13. In reservation stations, the instruction issue does not follow the FIFO
order. (True/ False).

5.6 Summary
 The design space of pipelines can be subdivided into two aspects: the basic layout of a pipeline and dependency resolution.
 An instruction pipeline operates on a stream of instructions by overlapping and decomposing the three phases (fetch, decode and execute) of the instruction cycle.
 Two basic aspects of the design space are how FX pipelines are laid out
logically and how they are implemented.
 A logical layout of an FX pipeline consists, first, of the specification of
how many stages an FX pipeline has and what tasks are to be
performed in these stages.
 The other key aspect of the design space is how FX pipelines are implemented.


 In logical layout of FX pipelines, the FX pipelines for RISC and CISC processors have to be considered separately, since each type has a slightly different scope.
 Pipelined processing of loads and stores consists of sequential consistency of instruction execution and parallel execution.

5.7 Glossary
 CISC: It is an acronym for Complex Instruction Set Computer. The CISC
machines are easy to program and make efficient use of memory.
 CPA: It stands for carry-propagation adder which adds two numbers
and produces an arithmetic sum.
 CSA: It stands for carry-save adder, which adds three input numbers and produces a sum output and a carry output.
 LMD: Load Memory Data.
 Load/Store bypassing: It means that either loads can bypass stores, or vice versa, without violating the memory data dependencies.
 Memory consistency: It is used to find out whether memory access is
performed in the same order as in a sequential processor.
 Processor consistency: It is used to indicate the consistency of
instruction completion with that of sequential instruction execution.
 RISC: It stands for Reduced Instruction Set Computing. RISC
computers reduce chip complexity by using simpler instructions.
 ROB: It stands for Reorder Buffer. The ROB ensures sequentially consistent completion of instructions when multiple EUs operate in parallel.
 Speculative loads: They avoid memory access delay. This delay can be caused by required addresses not yet being computed, or by clashes among the addresses.
 Tomasulo’s algorithm: It allows the replacement of sequential order by
data-flow order.

5.8 Terminal Questions


1. Name the two subdivisions of the design space of pipelines and write short notes on them.
2. What do you mean by pipeline instruction processing?
3. Explain the concept of pipelined execution of Integer and Boolean
instructions.


4. Describe the logical layout of both RISC and CISC computers.


5. Write in brief the process of implementation of FX pipelines.
6. Explain the various subtasks involved in load and store processing.
7. Write short notes on:
a. Sequential Consistency of Instruction Execution
b. Instruction Issuing and Parallel Execution

5.9 Answers
Self Assessment Questions
1. Microprocessor without Interlocked Pipeline Stages
2. Dynamically
3. Write Back Operand
4. Opcode, operand specifiers
5. Register operands
6. True
7. Intel 80x86 and Motorola 68K series
8. Carry-propagation adder (CPA)
9. Master pipeline
10. Processor Consistency
11. False
12. Renaming
13. True

Terminal Questions
1. The design space of pipelines can be subdivided into two aspects: basic layout of a pipeline and dependency resolution. Refer Section 5.2.
2. A pipeline instruction processing technique is used to increase the instruction throughput. It is used in the design of modern CPUs, microcontrollers and microprocessors. Refer Section 5.3 for more details.
3. There are two basic aspects of the design space of pipelined execution
of Integer and Boolean instructions: how FX pipelines are laid out
logically and how they are implemented. Refer Section 5.4.
4. While processing operate instructions, RISC pipelines have to cope only with register operands. By contrast, CISC pipelines must be able to deal with both register and memory operands as well as destinations. Refer Section 5.4.


5. Depending on the function to be implemented, different pipeline stages in an arithmetic unit require different hardware logic. Refer Section 5.4.
6. The execution of load and store instructions begins with the
determination of the effective memory address (EA) from where data is
to be fetched. This can be broken down into subtasks. Refer
Section 5.5.
7. The overall instruction execution of a processor should mimic sequential execution, i.e. it should preserve sequential consistency. Refer Section 5.5. The first step is to create and buffer execution tuples, and then determine which tuples can be issued for parallel execution. Refer Section 5.5.

References:
 Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
 Godse D. A. & Godse A. P. (2010). Computer Organisation, Technical
Publications. pp. 3–9.
 Hennessy, John L., Patterson, David A. & Goldberg, David (2002)
Computer Architecture: A Quantitative Approach, (3rd edition), Morgan
Kaufmann.
 Sima, Dezsö, Fountain, Terry J. & Kacsuk, Péter (1997) Advanced
computer architectures - a design space approach, Addison-Wesley-
Longman: I-XXIII, 1-766.

E-references:
 http://www.eecg.toronto.edu/~moshovos/ACA06/readings/ieee-
proc.superscalar.pdf
 http://webcache.googleusercontent.com/search?q=cache:yU5nCVnju9
cJ:www.ic.uff.br/~vefr/teaching/lectnotes/AP1-topico3.5.ps.gz+load+
store+sequential+instructions&cd=2&hl=en&ct=clnk&gl=in

