Unit 05 Design Space of Pipelines
5.1 Introduction
In the previous unit, you studied pipelined processors in detail, with a
short review of pipelining and examples of some pipelines in modern
processors. You also studied various kinds of pipeline hazards and the
techniques available to handle them.
In this unit, we will introduce you to the design space of pipelines. The
steadily increasing complexity of chips has led to higher operating speeds.
These speeds are achieved by overlapping instruction latencies, that is, by
implementing pipelining. Early models used a discrete pipeline, which
performs its task in stages such as fetch, decode, execute, memory access,
and write-back. Each pipeline stage requires one clock cycle, and as there
are 5 stages, the instruction latency is five cycles. Longer pipelines,
spread over more cycles, can hide longer instruction latencies.
Activity 1:
Visit a library and find out the features of the R2000, R3000, R4000, R4200
and R6000 processors. Compare them in a chart.
Intel 80486, a CISC machine, uses a 5-stage pipeline. Here the CPU tries to
maintain the execution of one instruction per clock cycle. However, this
architecture does not deliver its maximum potential performance
improvement, due to the following reasons:
Sub-cycles occur between the initial fetch and the execution of the
instruction.
An instruction may have to wait for the output of a previous instruction.
Branch instructions occur and disrupt the sequential instruction flow.
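The effect of the second reason, waiting for a previous instruction's output, can be illustrated with a simple hypothetical stall model; the stall penalty and stage count here are assumptions for illustration, not measured 80486 figures:

```python
# Toy stall model: each instruction ideally issues one cycle after the
# previous one, but a data dependency on the immediately preceding
# instruction adds a fixed number of stall cycles.
def cycles_with_stalls(n_instr, depends_on_prev, stall_per_dep=2, stages=5):
    cycles = stages + (n_instr - 1)                 # ideal pipelined count
    cycles += stall_per_dep * sum(depends_on_prev)  # penalty per dependency
    return cycles

# Four independent instructions: ideal 5 + 3 = 8 cycles.
assert cycles_with_stalls(4, [False, False, False, False]) == 8
# The same four with two back-to-back dependencies: 8 + 2*2 = 12 cycles.
assert cycles_with_stalls(4, [False, True, True, False]) == 12
```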
5.4.3 Implementation of FX pipelines
Most of today's arithmetic pipelines are designed to perform fixed
functions. These arithmetic/logic units (ALUs) perform fixed-point and
floating-point operations separately. The fixed-point unit is also called the
integer unit. The floating-point unit can be built either as part of the central
processor or on a separate coprocessor. These arithmetic units perform
scalar operations involving one pair of operands at a time. The pipelining in
scalar arithmetic pipelines is controlled by software loops. Vector arithmetic
units can be designed with pipeline hardware directly under firmware or
hardwired control.
Scalar and vector arithmetic pipelines differ mainly in the areas of register
files and control mechanisms involved. Vector hardware pipelines are often
built as add-on options to a scalar processor or as an attached processor
driven by a control processor. Both scalar and vector processors are used in
modern supercomputers.
Arithmetic pipeline stages: Depending on the function to be implemented,
different pipeline stages in an arithmetic unit require different hardware
logic. Since all arithmetic operations (such as add, subtract, multiply, divide,
squaring, square rooting, logarithm, etc.) can be implemented with the basic
add and shifting operations, the core arithmetic stages require some form of
hardware to add and to shift. For example, a typical three-stage floating-
point adder includes a first stage for exponent comparison and equalisation
which is implemented with an integer adder and some shifting logic; a
second stage for fraction addition using a high-speed carry look ahead
adder; and a third stage for fraction normalisation and exponent
readjustment using a shifter and another addition logic.
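The three adder stages described above can be sketched in Python with a toy unsigned (fraction, exponent) representation, where a value equals fraction × 2^exponent; signs, field widths, and rounding are simplifications of our own:

```python
# Sketch of a three-stage floating-point add on a toy unsigned format.
def fp_add(a, b):
    fa, ea = a
    fb, eb = b
    # Stage 1: exponent comparison and equalisation
    # (an integer subtract plus shifting logic in hardware).
    if ea < eb:
        fa >>= (eb - ea); ea = eb
    else:
        fb >>= (ea - eb); eb = ea
    # Stage 2: fraction addition (a carry-lookahead adder in hardware).
    f = fa + fb
    e = ea
    # Stage 3: normalisation and exponent readjustment (shift + add).
    while f >= (1 << 8):        # keep the fraction within 8 bits
        f >>= 1
        e += 1
    return (f, e)

# (12 * 2^0) + (5 * 2^2) = 32, i.e. fraction 8 at exponent 2.
assert fp_add((12, 0), (5, 2)) == (8, 2)
# 200 + 100 = 300 overflows 8 bits, so it normalises to 150 * 2^1.
assert fp_add((200, 0), (100, 0)) == (150, 1)
```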
Each bit position of a carry-save adder is a full adder, whose sum and
carry outputs are:
Si = xi ⊕ yi ⊕ zi
Ci+1 = xi·yi + yi·zi + zi·xi                                  ..... (5.1)
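Applied bitwise across whole words, the full-adder relations of equation (5.1) give a carry-save adder: three inputs are reduced to a sum word and a carry word with no carry propagation. A minimal Python sketch:

```python
# Carry-save adder built from equation (5.1), applied bitwise to integers.
def csa(x: int, y: int, z: int):
    s = x ^ y ^ z                              # Si = xi XOR yi XOR zi
    c = ((x & y) | (y & z) | (z & x)) << 1     # Ci+1 = xi·yi + yi·zi + zi·xi
    return s, c

# Per-bit check: 1 + 1 + 1 = sum bit 1, carry bit 1 (shifted left).
assert csa(1, 1, 1) == (1, 2)
# The sum word plus the carry word equals the true three-operand sum.
s, c = csa(13, 7, 5)
assert s + c == 13 + 7 + 5
```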
Note that the partial product Pj is obtained by multiplying the multiplicand A
by the jth bit of B and then shifting the result j bits to the left, for
j = 0, 1, 2, ..., 7. Thus Pj is (8 + j) bits long with j trailing zeros. The
summation of the eight partial products is done with a Wallace tree of CSAs
plus a CPA at the final stage, as shown in figure 5.7.
The first stage (S1) generates all eight partial products, ranging from 8 bits
to 15 bits, simultaneously. The second stage (S2) is made up of two levels
of four CSAs, and it essentially merges eight numbers into four numbers
ranging from 13 to 15 bits. The third stage (S3) consists of two CSAs, and it
merges four numbers from S2 into two 16-bit numbers. The final stage (S4)
is a CPA, which adds up the last two numbers to produce the final
product P.
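The four stages S1 to S4 can be sketched with integer words standing in for hardware buses; the CSA grouping below is one valid arrangement of the tree (the exact wiring in figure 5.7 may differ), and the final Python `+` stands in for the CPA:

```python
# Carry-save adder: merges three numbers into a sum word and a carry word.
def csa(x, y, z):
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1

def multiply_8x8(a: int, b: int) -> int:
    # S1: generate all eight partial products Pj = A * bj, shifted left j.
    pp = [(a if (b >> j) & 1 else 0) << j for j in range(8)]
    # S2: two levels of four CSAs merge eight numbers into four.
    s0, c0 = csa(pp[0], pp[1], pp[2])
    s1, c1 = csa(pp[3], pp[4], pp[5])
    s2, c2 = csa(pp[6], pp[7], s0)
    s3, c3 = csa(c0, s1, c1)
    # S3: two CSAs merge the four numbers into two 16-bit numbers.
    s4, c4 = csa(s2, c2, s3)
    s5, c5 = csa(s4, c4, c3)
    # S4: the carry-propagation adder produces the final product P.
    return s5 + c5

assert multiply_8x8(23, 19) == 23 * 19
assert multiply_8x8(255, 255) == 255 * 255
```

Because each CSA preserves the invariant sum + carry = x + y + z, the two words reaching S4 always add up to the exact product.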
For a maximum width of 16 bits, the CPA is estimated to need four gate
levels of delay. Each level of the CSA can be implemented with a two-gate-
level logic. The delay of the first stage (S1) also involves two gate levels.
Thus all the pipeline stages have an approximately equal amount of delay.
The matching of stage delays is crucial to the determination of the
number of pipeline stages, as well as the clock period. If the delay of the
CPA stage can be further reduced to match that of a single CSA level, then
the pipeline can be divided into six stages with a clock rate twice as fast.
The basic concepts can be extended to operands with a larger number of
bits.
Self Assessment Questions
5. While processing operate instructions, RISC pipelines have to cope
only with _______________.
6. In RISC architecture, instructions are of a uniform length (True/ False).
7. Name two microprocessors which follow the CISC philosophy.
8. ______________ adds two numbers and produces an arithmetic sum.
Activity 2:
Access the internet and find out more about the differences between
fixed-point and floating-point units.
Let us first consider a load instruction. Its execution begins with the
determination of the effective memory address (EA) from where data is to
be fetched. In straightforward cases, like RISC processors, this can be done
in two steps: fetching the referenced address register(s) and calculating the
effective address. However, for CISC processors address calculation may
be a difficult task, requiring multiple subsequent register fetches and
address calculations, as for instance in the case of indexed, post-
incremented, relative addresses. Once the effective address is available,
the next step is usually, to forward the effective (virtual) address to the MMU
for translation and to access the data cache. Here, and in the subsequent
discussion, we shall not go into details of whether the referenced cache is
physically or virtually addressed, and thus we neglect the corresponding
issues. Furthermore, we assume that the referenced data is available in the
cache and thus it is fetched in one or a few cycles. Usually, fetched data is
made directly available to the requesting unit, such as the FX or FP unit,
through bypassing. Finally, the last subtask to be performed is writing the
accessed data into the specified register.
For a store instruction, the address calculation phase is identical to that
already discussed for loads. However, subsequently both the virtual address
and the data to be stored can be sent out in parallel to the MMU and the
cache, respectively. This concludes the processing of the store instruction.
Figure 5.8 shows the subtasks involved in executing load and store
instructions.
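The load and store subtasks above can be sketched with a toy MMU and a dictionary standing in for the data cache; the register names, identity address translation, and guaranteed cache hit are all illustrative assumptions, not a real processor interface:

```python
# Toy machine state: register file and data cache.
regs = {"r1": 0x100, "r2": 8, "r3": 0}
cache = {0x108: 42}

def translate(virtual_addr):
    return virtual_addr                     # assume identity translation

def execute_load(dest, base, offset_reg):
    ea = regs[base] + regs[offset_reg]      # 1. effective address (EA)
    pa = translate(ea)                      # 2. MMU translation
    data = cache[pa]                        # 3. data-cache access (hit)
    regs[dest] = data                       # 4. write into the register
    return data                             # (bypass to a waiting FX/FP unit)

def execute_store(src, base, offset_reg):
    ea = regs[base] + regs[offset_reg]      # address phase, same as a load
    cache[translate(ea)] = regs[src]        # address and data in parallel

execute_load("r3", "r1", "r2")
assert regs["r3"] == 42
execute_store("r3", "r1", "r2")
assert cache[0x108] == 42
```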
The ROB is a circular buffer with head and tail pointers. Instructions
enter the ROB in program order only. An instruction can be retired only if
it has finished and all previous instructions have already been retired.
Sequential consistency is maintained by directing instructions to update
the program state in proper program order, writing their results into
memory or into the referenced architectural register(s). The ROB can
successfully support both interrupt handling and speculative execution.
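A minimal sketch of the ROB's circular-buffer behaviour follows; the slot layout and method names are our own, not a specific design:

```python
# Reorder buffer: dispatch at the tail in program order, retire from the
# head only when the oldest instruction (and all before it) has finished.
class ROB:
    def __init__(self, size):
        self.slots = [None] * size
        self.head = self.tail = self.count = 0

    def dispatch(self, tag):
        assert self.count < len(self.slots), "ROB full"
        self.slots[self.tail] = {"tag": tag, "finished": False}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def finish(self, tag):                  # an EU reports completion
        for slot in self.slots:
            if slot and slot["tag"] == tag:
                slot["finished"] = True

    def retire(self):
        """Retire in order from the head while the oldest has finished."""
        retired = []
        while self.count and self.slots[self.head]["finished"]:
            retired.append(self.slots[self.head]["tag"])
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        return retired

rob = ROB(4)
for tag in ["i1", "i2", "i3"]:
    rob.dispatch(tag)
rob.finish("i2")                     # i2 finishes out of order...
assert rob.retire() == []            # ...but cannot retire before i1
rob.finish("i1")
assert rob.retire() == ["i1", "i2"]  # now i1 and i2 retire in order
```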
5.5.4 Instruction issuing and parallel execution
In this phase, execution tuples are created, and it is then decided which
execution tuples can be issued. Checking the availability of data and
resources at run time is known as instruction issuing. In the
instruction-issue area, many pipelines are processed.
Figure 5.10 shows a reorder buffer which follows FIFO order: entries are
received and sent in FIFO order. An instruction can be executed once its
input operands are present; other instructions remain waiting in the
instruction-issue buffer.
Other constraints are associated with the buffers carrying the execution
tuples. Figure 5.11 shows the parallel execution schedule (PES) of an
iteration. The PES assumes hardware resources comprising one path to
memory, two integer units, and one branch unit.
The rows show the time steps and the columns show the operations
performed in each time step. In this PES, the branch unit predicts the
“ble” branch as not taken and speculatively executes instructions from the
predicted path. In this example we have shown renaming only for register
r3, but other registers can be renamed as well. The successive values
assigned to register r3 are bound to different physical registers (R1, R2,
R3, R4).
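The renaming of successive r3 values onto R1 to R4 can be sketched with a simple mapping table and free list; the scheme below is illustrative, not the mechanism of any particular processor:

```python
# Register renaming: each new write to an architectural register is
# allocated a fresh physical register, removing false dependences.
free_list = ["R1", "R2", "R3", "R4"]
rename_map = {}                       # architectural -> physical

def rename_dest(arch_reg):
    phys = free_list.pop(0)           # allocate a fresh physical register
    rename_map[arch_reg] = phys       # later readers of arch_reg use it
    return phys

# Four writes to r3 get four different physical registers, so the four
# loop iterations no longer conflict over the single name r3.
allocated = [rename_dest("r3") for _ in range(4)]
assert allocated == ["R1", "R2", "R3", "R4"]
assert rename_map["r3"] == "R4"       # readers now see the newest value
```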
There are several ways of organising the instruction-issue buffer; the
methods below are listed in increasing order of complexity.
Single queue method: Renaming is not needed in the single queue method,
because there is one queue and no out-of-order issue. Operand availability
can be managed through simple reservation bits assigned to every register:
when an instruction that modifies a register issues, the register is
reserved, and when the modification finishes, the reservation is cleared.
Multiple queue method: In the multiple queue method, each queue issues
instructions in order, but the individual queues may issue out of order
with respect to one another. The queues are organised by instruction type.
Reservation stations: With reservation stations, instruction issue does
not follow FIFO order. As a result, the reservation stations must
continuously monitor the availability of their source operands. The
conventional way of doing this is to hold the operand data in the
reservation station itself.
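Reservation-station behaviour, in the spirit of Tomasulo's scheme, can be sketched as entries that watch a broadcast bus and become ready out of order; the structure and names below are our own illustration:

```python
# Each station waits for tagged source operands; once both arrive via a
# broadcast, the station is ready to issue, regardless of program order.
stations = [
    {"op": "add", "src": {"t1": None, "t2": None}},   # older instruction
    {"op": "mul", "src": {"t3": None, "t4": None}},   # younger instruction
]

def broadcast(tag, value):
    """A finishing unit broadcasts (tag, value); waiting stations grab it."""
    for st in stations:
        if tag in st["src"]:
            st["src"][tag] = value

def ready():
    return [st["op"] for st in stations
            if all(v is not None for v in st["src"].values())]

broadcast("t3", 6)
broadcast("t4", 7)
assert ready() == ["mul"]      # the younger mul may issue before the add
broadcast("t1", 1)
broadcast("t2", 2)
assert ready() == ["add", "mul"]
```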
5.6 Summary
The design space of pipelines can be subdivided into two aspects: the
basic layout of a pipeline and dependency resolution.
An instruction pipeline operates on a stream of instructions by
overlapping and decomposing the three phases (fetch, decode and
execute) of the instruction cycle.
Two basic aspects of the design space are how FX pipelines are laid out
logically and how they are implemented.
A logical layout of an FX pipeline consists, first, of the specification of
how many stages an FX pipeline has and what tasks are to be
performed in these stages.
The other key aspect of the design space is how FX pipelines are
implemented.
5.7 Glossary
CISC: It is an acronym for Complex Instruction Set Computer. The CISC
machines are easy to program and make efficient use of memory.
CPA: It stands for carry-propagation adder which adds two numbers
and produces an arithmetic sum.
CSA: It stands for carry-save adder which adds three input numbers
and produces one sum output.
LMD: Load Memory Data.
Load/Store bypassing: It defines that either loads can bypass stores
or vice versa, without violating the memory data dependencies.
Memory consistency: It is used to find out whether memory access is
performed in the same order as in a sequential processor.
Processor consistency: It is used to indicate the consistency of
instruction completion with that of sequential instruction execution.
RISC: It stands for Reduced Instruction Set Computing. RISC
computers reduce chip complexity by using simpler instructions.
ROB: It stands for Reorder Buffer. ROB is an assurance tool for
sequential consistency execution where multiple EUs operate in parallel.
Speculative loads: They avoid memory access delay. This delay can
be caused by the non-computation of required addresses or by
clashes among the addresses.
Tomasulo’s algorithm: It allows the replacement of sequential order by
data-flow order.
5.9 Answers
Self Assessment Questions
1. Microprocessor without Interlocked Pipeline Stages
2. Dynamically
3. Write Back Operand
4. Opcode, operand specifiers
5. Register operands
6. True
7. Intel 80x86 and Motorola 68K series
8. Carry-propagation adder (CPA)
9. Master pipeline
10. Processor Consistency
11. False
12. Renaming
13. True
Terminal Questions
1. The design space of pipelines can be subdivided into two aspects: the
basic layout of a pipeline and dependency resolution. Refer Section 5.2.
2. A pipeline instruction processing technique is used to increase the
instruction throughput. It is used in the design of modern CPUs,
microcontrollers and microprocessors. Refer Section 5.3 for more details.
3. There are two basic aspects of the design space of pipelined execution
of Integer and Boolean instructions: how FX pipelines are laid out
logically and how they are implemented. Refer Section 5.4.
4. While processing operate instructions, RISC pipelines have to cope
only with register operands. By contrast, CISC pipelines must be able to
deal with both register and memory operands as well as destinations.
Refer Section 5.4.
References:
Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
Godse D. A. & Godse A. P. (2010). Computer Organisation, Technical
Publications. pp. 3–9.
Hennessy, John L., Patterson, David A. & Goldberg, David (2002)
Computer Architecture: A Quantitative Approach, (3rd edition), Morgan
Kaufmann.
Sima, Dezsö, Fountain, Terry J. & Kacsuk, Péter (1997) Advanced
computer architectures - a design space approach, Addison-Wesley-
Longman: I-XXIII, 1-766.
E-references:
http://www.eecg.toronto.edu/~moshovos/ACA06/readings/ieee-
proc.superscalar.pdf
http://webcache.googleusercontent.com/search?q=cache:yU5nCVnju9
cJ:www.ic.uff.br/~vefr/teaching/lectnotes/AP1-topico3.5.ps.gz+load+
store+sequential+instructions&cd=2&hl=en&ct=clnk&gl=in