
Parallel Processing and Pipelining
Parallel Processing
• Parallel processing is a term used to denote a large class of
techniques that are used to provide simultaneous data-
processing tasks for the purpose of increasing the
computational speed of a computer system.
• Instead of processing each instruction sequentially as in a
conventional computer, a parallel processing system is able to
perform concurrent data processing to achieve faster
execution time.
• For example, while an instruction is being executed in the ALU, the next instruction
can be read from memory.
• The purpose of parallel processing is to speed up the
computer processing capability and increase its throughput,
that is, the amount of processing that can be accomplished
during a given interval of time.
• Parallel processing can be viewed from various levels of
complexity.
• At the lowest level, we distinguish between parallel and serial operations by the
type of registers used. Shift registers operate in serial fashion one bit at a time,
while registers with parallel load operate with all the bits of the word
simultaneously.
• Parallel processing at a higher level of complexity can be
achieved by having a multiplicity of functional units that
perform identical or different operations simultaneously.
• Parallel processing is established by distributing the data among the multiple
functional units.
Processor with multiple functional units
[Figure: a common set of processor registers feeding separate functional units (adder-subtractor, integer multiplier, logic unit, shifter, incrementer, and floating-point add-subtract, multiply, and divide circuits)]
• The operands in the registers are applied to one of the units
depending on the operation specified by the instruction
associated with the operands.
• The adder and integer multiplier perform the arithmetic
operations with integer numbers.
• The floating-point operations are separated into three circuits
operating in parallel.
• The logic, shift, and increment operations can be performed
concurrently on different data.
• All units are independent of each other.
• So one number can be shifted while another number is being
incremented.
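As a software analogy (a hypothetical sketch in C, not a description of the hardware), the dispatch of operands to a functional unit selected by the instruction's operation code can be pictured like this; the unit names and opcodes are illustrative:

#include <stdio.h>

/* Hypothetical sketch: each functional unit is modeled as a function,
   and the operation code of an instruction selects the unit that
   receives the operands. In hardware, all units could run at once. */
typedef double (*unit_fn)(double, double);

static double adder(double a, double b)      { return a + b; }
static double multiplier(double a, double b) { return a * b; }
static double shifter(double a, double b)    { return (double)((long)a << (int)b); }

int main(void) {
    unit_fn units[]      = { adder, multiplier, shifter };
    const char *name[]   = { "add", "mul", "shl" };
    double operands[][2] = { {3, 4}, {3, 4}, {3, 2} };

    for (int op = 0; op < 3; op++)   /* the opcode selects the unit */
        printf("%s(%g, %g) = %g\n", name[op],
               operands[op][0], operands[op][1],
               units[op](operands[op][0], operands[op][1]));
    return 0;
}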
• The amount of hardware increases with parallel processing,
and with it, the cost of the system increases.
However, technological developments have reduced hardware costs to the point
where parallel processing techniques are economically feasible.
• There are a variety of ways that parallel processing can be
classified. It can be considered from the internal organization
of the processors, from the interconnection structure
between processors, or from the flow of information through
the system.
• One classification introduced by M. J. Flynn considers the
organization of a computer system by the number of
instructions and data items that are manipulated
simultaneously (Flynn’s Classification).
Flynn’s Classification
• The Von Neumann architecture is characterized by a CPU and
central memory system, with instructions and data being read
from memory.
• The single data path between the CPU and memory over
which both instructions and data must pass, and the
sequential nature of instruction execution together limit the
performance possible from the computer. This is sometimes
known as the Von Neumann bottleneck.
• This bottleneck is alleviated in uniprocessors by pipelining and superscalar design enhancements.
• In 1972 Michael J Flynn introduced a classification of various
computer architectures based on notions of instruction and
data streams.
• Machines can thus be classified as SISD, MISD, SIMD and
MIMD.

 Single instruction stream, single data stream (SISD)


 A single-processor computer system is called SISD.
 A program executed by the processor constitutes the single
instruction stream, and the sequence of data items that it
operates on constitutes the single data stream.
 Instructions are executed sequentially and the system may or
may not have internal parallel processing capabilities.
 Parallel processing in this case may be achieved by means of
pipeline processing.
Flynn’s
• SISD – Single Instruction, Single Data
[Diagram: a single instruction stream (SI) drives one processing element (PE), which operates on a single data stream (SD)]
 Single instruction stream, multiple data stream (SIMD)
 A single stream of instructions is broadcast (or distributed) to
a number of processors.
 Each processor operates on its own data.
 All processors execute the same program but operate on
different data.
 The multiple data streams are the sequences of data items
accessed by the individual processors in their own memories.
Flynn’s
• SIMD – Single Instruction stream, Multiple Data stream
[Diagram: one instruction stream (SI) is broadcast to several processing elements (PEs), each operating on its own data stream (SD)]
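As a software illustration of the SIMD idea (a sketch, not any particular machine's instruction set): one "add" instruction is applied by every conceptual PE to its own pair of data items.

#include <stdio.h>

#define N 4   /* number of conceptual processing elements */

int main(void) {
    int a[N] = {1, 2, 3, 4};        /* each PE's first operand  */
    int b[N] = {10, 20, 30, 40};    /* each PE's second operand */
    int c[N];

    /* One instruction stream, many data streams: every PE
       executes the same add on its own elements. */
    for (int pe = 0; pe < N; pe++)
        c[pe] = a[pe] + b[pe];

    for (int pe = 0; pe < N; pe++)
        printf("PE %d: %d + %d = %d\n", pe, a[pe], b[pe], c[pe]);
    return 0;
}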
 Multiple instruction stream, multiple data stream (MIMD)
 Involves a number of independent processors, each executing
a different program and accessing its own sequence of data
items.
 Most multiprocessors fit into this category.
 A MIMD machine is usually a number of separate processors
connected together through some interconnection network.
Flynn’s
• MIMD – Multiple Instructions, Multiple Data
[Diagram: several processing elements (PEs), each driven by its own instruction stream (SI) and operating on its own data stream (SD)]
 Multiple instruction stream, single data stream (MISD)
 MISD architecture is the most uncommon one.
 It is chiefly of theoretical interest, since no practical system has
been constructed using this organization.
 In this architecture, the same data stream flows through a
linear array of processors, executing different instructions on
the stream.
Flynn’s
• MISD – Multiple Instructions / Single Data
[Diagram: a single data stream (SD) flows through several processing elements (PEs), each executing a different instruction stream (SI)]
Pipelining
1. Pipelining is a technique of decomposing a sequential process
into suboperations, with each subprocess being executed in a
special dedicated segment that operates concurrently with all
other segments. A pipeline can be visualized as a collection of
processing segments through which binary information flows.
2. Each segment performs partial processing dictated by the way
the task is partitioned. The result obtained from the
computation in each segment is transferred to the next
segment in the pipeline. The final result is obtained after the
data have passed through all segments.
3. It is characteristic of pipelines that several computations can be
in progress in distinct segments at the same time. The
overlapping of computation is made possible by associating a
register with each segment in the pipeline. The registers provide
isolation between each segment so that each can operate on
distinct data simultaneously.
• The simplest way of viewing the pipeline structure is to imagine
that each segment consists of an input register followed by a
combinational circuit. The register holds the data and the
combinational circuit performs the suboperation in the
particular segment. The output of the combinational circuit in a
given segment is applied to the input register of the next
segment. A clock is applied to all registers after enough time has
elapsed to perform all segment activity. In this way the
information flows through the pipeline one step at a time.
Suppose that we want to perform the combined multiply and add
operations Ai * Bi + Ci, for i = 1, 2, ..., 7, with a stream of numbers.
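A minimal C simulation of this pipeline, following the three-segment partition (segment 1: R1 <- Ai, R2 <- Bi; segment 2: R3 <- R1 * R2, R4 <- Ci; segment 3: R5 <- R3 + R4); the data values are illustrative:

#include <stdio.h>

#define N 7   /* number of tasks Ai * Bi + Ci */

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7};
    double B[N] = {7, 6, 5, 4, 3, 2, 1};
    double C[N] = {1, 1, 1, 1, 1, 1, 1};
    double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    /* A 3-segment pipe completes N tasks in 3 + (N - 1) clocks.
       Registers are updated back to front, so each segment sees
       the value its predecessor latched on the previous clock. */
    for (int clock = 0; clock < N + 2; clock++) {
        R5 = R3 + R4;                                    /* segment 3 */
        R3 = R1 * R2;                                    /* segment 2 */
        R4 = (clock >= 1 && clock <= N) ? C[clock - 1] : 0;
        if (clock < N) { R1 = A[clock]; R2 = B[clock]; } /* segment 1 */
        if (clock >= 2)   /* the first result emerges on the third clock */
            printf("clock %d: A[%d]*B[%d] + C[%d] = %g\n",
                   clock + 1, clock - 2, clock - 2, clock - 2, R5);
    }
    return 0;
}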
Now consider the case where a k-segment pipeline
with a clock cycle time tp is used to execute n tasks.
The first task T1 requires a time equal to ktp to
complete its operation since there are k segments in
the pipe. The remaining n - 1 tasks emerge from the
pipe at the rate of one task per clock cycle and they
will be completed after a time equal to (n - 1)tp.
Therefore, to complete n tasks using a k-segment
pipeline requires k + (n - 1) clock cycles. For example,
the diagram of Fig. 9-4 shows four segments and six
tasks. The time required to complete all the
operations is 4 + (6 - 1) = 9 clock cycles, as
indicated in the diagram.
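A quick check of this count in C, together with the resulting speedup over a nonpipelined unit (assuming, as is conventional, that a nonpipelined unit takes k * tp per task):

#include <stdio.h>

int main(void) {
    int k = 4, n = 6;     /* segments and tasks, as in Fig. 9-4 */
    double tp = 1.0;      /* clock cycle time (arbitrary units) */

    double pipelined    = (k + (n - 1)) * tp;  /* k + (n - 1) cycles */
    double nonpipelined = (double)n * k * tp;  /* assumes tn = k*tp  */

    printf("pipelined:    %g cycles\n", pipelined);      /* 9  */
    printf("nonpipelined: %g cycles\n", nonpipelined);   /* 24 */
    printf("speedup:      %g\n", nonpipelined / pipelined);
    return 0;
}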
Arithmetic Pipeline
• Pipeline arithmetic units are usually found in very high speed
computers. They are used to implement floating-point operations,
multiplication of fixed-point numbers, and similar computations
encountered in scientific problems.
• The inputs to the floating-point adder pipeline are two normalized
floating-point binary numbers, X = A × 2^a and Y = B × 2^b.

• A and B are two fractions that represent the mantissas and a and b
are the exponents. The floating-point addition and subtraction can
be performed in four segments, as shown in Fig. 9-6. The registers
labeled R are placed between the segments to store intermediate
results. The suboperations that are performed in the four segments
are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
The exponents are compared by subtracting them to determine
their difference. The larger exponent is chosen as the exponent of
the result. The exponent difference determines how many times
the mantissa associated with the smaller exponent must be shifted
to the right. This produces an alignment of the two mantissas. It
should be noted that the shifter must be designed as a combinational
circuit to reduce the shift time. The two mantissas are added or
subtracted in segment 3. The result is normalized in segment 4.
When an overflow occurs, the mantissa of the sum or difference is
shifted right and the exponent incremented by one. If an
underflow occurs, the number of leading zeros in the mantissa
determines the number of left shifts in the mantissa and the
number that must be subtracted from the exponent.
Consider the two normalized floating-point numbers:
X = 0.9504 × 10^3
Y = 0.8200 × 10^2
The exponent difference is 3 − 2 = 1, so the mantissa of Y is shifted right once to give 0.0820 × 10^3. Adding the aligned mantissas gives 0.9504 + 0.0820 = 1.0324, which overflows the mantissa range, so the sum is shifted right and the exponent incremented, producing the normalized result Z = 0.10324 × 10^4.
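A simplified C model of the four suboperations, using decimal mantissas and base-10 exponents so that it reproduces the example above (a sketch of the algorithm, not of the binary hardware):

#include <stdio.h>
#include <math.h>

typedef struct { double m; int e; } fpnum;   /* mantissa x 10^exponent */

fpnum fp_add(fpnum x, fpnum y) {
    /* Segment 1: compare exponents; the larger becomes the result exponent. */
    int d = x.e - y.e;
    int e = (d >= 0) ? x.e : y.e;
    /* Segment 2: align the mantissa that has the smaller exponent. */
    if (d >= 0) y.m /= pow(10, d);
    else        x.m /= pow(10, -d);
    /* Segment 3: add the mantissas. */
    double m = x.m + y.m;
    /* Segment 4: normalize so that 0.1 <= |m| < 1. */
    while (fabs(m) >= 1.0)          { m /= 10; e++; }   /* overflow  */
    while (m != 0 && fabs(m) < 0.1) { m *= 10; e--; }   /* underflow */
    return (fpnum){m, e};
}

int main(void) {
    fpnum x = {0.9504, 3}, y = {0.8200, 2};
    fpnum z = fp_add(x, y);
    printf("Z = %.5f x 10^%d\n", z.m, z.e);   /* 0.10324 x 10^4 */
    return 0;
}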
Instruction Pipeline
In the most general case, the computer needs to process each
instruction with the following sequence of steps:-
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
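In practice these six steps are often collapsed into four pipeline segments (FI: fetch instruction, DA: decode and calculate effective address, FO: fetch operand, EX: execute and store the result). The short C sketch below prints the resulting space-time diagram for six instructions; the four-segment grouping follows Mano's instruction-pipeline example.

#include <stdio.h>

int main(void) {
    const char *seg[] = {"FI", "DA", "FO", "EX"};
    int k = 4, n = 6;   /* pipeline segments, instructions */

    /* Instruction i occupies segment s during clock cycle i + s,
       so each row is shifted one cycle to the right of the last. */
    for (int i = 0; i < n; i++) {
        printf("I%d:", i + 1);
        for (int clock = 0; clock < k + n - 1; clock++) {
            int s = clock - i;
            printf(" %s", (s >= 0 && s < k) ? seg[s] : ". ");
        }
        printf("\n");
    }
    return 0;
}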
RISC Pipeline
CISC
Major characteristics of CISC (Complex instruction set computer)
architecture are:
1. A large number of instructions—typically from 100 to 250
instructions.
2. Some instructions that perform specialized tasks and are used
infrequently.
3. A large variety of addressing modes—typically from 5 to 20
different modes.
4. Variable-length instruction formats.
5. Instructions that manipulate operands in memory.
Reduced Instruction Set Computer
In the early 1980s, a number of computer designers
recommended that computers use fewer instructions with simple
constructs so they can be executed much faster within the CPU
without having to use memory as often. This RISC type of
computer is classified as a reduced instruction set computer or
RISC.
The concept of RISC architecture involves an attempt to reduce
execution time by simplifying the instruction set of the computer.
The major characteristics of a RISC processor are:
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
8. A relatively large number of registers in the processor unit
9. Use of overlapped register windows to speed-up procedure call
and return
10. Efficient instruction pipeline
11. Compiler support for efficient translation of high-level language
programs into machine language programs
Among the characteristics attributed to RISC is its ability to use
an efficient instruction pipeline. The simplicity of the
instruction set can be utilized to implement an instruction
pipeline using a small number of suboperations, with each
being executed in one clock cycle. Because of the fixed-length
instruction format, the decoding of the operation can occur at
the same time as the register selection. All data manipulation
instructions have register-to-register operations. Since all
operands are in registers, there is no need for calculating an
effective address or fetching of operands from memory.
Therefore, the instruction pipeline can be implemented with
two or three segments. One segment fetches the instruction
from program memory, and the other segment executes the
instruction in the ALU. A third segment may be used to store the
result of the ALU operation in a destination register.
The data transfer instructions in RISC are limited to load and
store instructions. These instructions use register indirect
addressing. They usually need three or four stages in the
pipeline. One of the major advantages of RISC is its ability to
execute instructions at the rate of one per clock cycle. The
advantage of RISC over CISC is that RISC can achieve pipeline
segments that each require just one clock cycle, while CISC uses
many segments in its pipeline, with the longest segment requiring
two or more clock cycles.
Another characteristic of RISC is the support given by the
compiler that translates the high-level language program into
machine language program. Instead of designing hardware to
handle the difficulties associated with data conflicts and branch
penalties, RISC processors rely on the efficiency of the compiler to
detect and minimize the delays encountered with these
problems.
Example: Three-Segment Instruction Pipeline
A typical three-segment RISC pipeline consists of:
1. I (Instruction fetch): the instruction is fetched from program memory.
2. A (ALU operation): the instruction is decoded and an arithmetic, logic, or address calculation is performed in the ALU.
3. E (Execute instruction): the result is routed to a destination register, to memory, or to the program counter in the case of a branch.
Delayed Branch: It was shown in Fig. 9-8 that a branch instruction
delays the pipeline operation until the instruction at the branch
address is fetched. Several techniques for reducing branch
penalties were discussed in the preceding section. The method
used in most RISC processors is to rely on the compiler to redefine
the branches so that they take effect at the proper time in the
pipeline. This method is referred to as delayed branch. For example,
the compiler can move an instruction that logically precedes the
branch into the slot immediately following it (the delay slot), so
that useful work is done while the branch target is being fetched;
if no such instruction can be found, a no-op is inserted instead.
Vector Processing
There is a class of computational problems that are beyond the
capabilities of a conventional computer. These problems are
characterized by the fact that they require a vast number of
computations that will take a conventional computer days or even
weeks to complete. In many science and engineering applications,
the problems can be formulated in terms of vectors and matrices
that lend themselves to vector processing. Computers with vector
processing capabilities are in demand in specialized applications.
The following are representative application areas where vector
processing is of the utmost importance:
Long-range weather forecasting
Petroleum explorations
Seismic data analysis
Medical diagnosis
Aerodynamics and space flight simulations
Artificial intelligence and expert systems
Mapping the human genome
Image processing.
Vector Operations:
A vector is an ordered set of data items arranged as a one-dimensional array. A conventional scalar processor adds two 100-element vectors with a program loop that fetches and executes the same instructions 100 times over, whereas a vector processor performs the entire operation with a single composite vector instruction of the form C(1:100) = A(1:100) + B(1:100), which specifies the operand addresses, the vector length, and the operation all at once and thus eliminates the loop-control overhead.
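A sketch of the contrast in C, where vector_add is a hypothetical stand-in for a single composite vector instruction (it carries the operand addresses, the length, and the operation in one call):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one vector instruction such as
   C(1:100) = A(1:100) + B(1:100): base addresses, length, and
   operation are all specified at once. */
void vector_add(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void) {
    double A[100], B[100], C[100];
    for (size_t i = 0; i < 100; i++) { A[i] = i; B[i] = 2.0 * i; }

    /* Scalar form: loop-control instructions are fetched and
       executed on every one of the 100 iterations. */
    for (size_t i = 0; i < 100; i++)
        C[i] = A[i] + B[i];

    /* Vector form: one "instruction" covers all 100 elements. */
    vector_add(A, B, C, 100);
    printf("C[99] = %g\n", C[99]);
    return 0;
}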
Memory Interleaving
• Pipeline and vector processors often require simultaneous
access to memory from two or more sources.
• An instruction pipeline may require the fetching of an
instruction and an operand at the same time from two
different segments.
• Similarly, an arithmetic pipeline usually requires two or more
operands to enter the pipeline at the same time.
• Instead of using two memory buses for simultaneous access,
the memory can be partitioned into a number of modules
connected to a common memory address and data buses.
• A memory module is a memory array together with its own
address and data registers.
Interleaving
 Divides the memory system into a number of
memory modules. Each module has its own address buffer register
(ABR) and data buffer register (DBR).
 Arranges addressing so that successive words in
the address space are placed in different
modules.
 When requests for memory access involve
consecutive addresses, the access will be to
different modules.
 Since parallel access to these modules is
possible, the average rate of fetching words from
the Main Memory can be increased.
Methods of address layouts
[Diagram: multiple-module memory organization, with 2^k modules, each having its own ABR and DBR, connected to common address and data buses]
 High-order interleaving: the MM address is laid out as [Module | Address in module]. The high-order k bits of a memory address determine the module, and the low-order m bits determine the word within the module. Consecutive words are placed in the same module, so when a block of words is transferred from main memory to cache, only one module is busy at a time.
 Low-order interleaving: the MM address is laid out as [Address in module | Module]. The low-order k bits of a memory address determine the module, so consecutive addresses are located in consecutive modules. While transferring a block of data, several memory modules can be kept busy at the same time.
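A small C sketch of the two layouts (here k = 2, so four modules, and m = 4 bits for the word within a module; the widths are illustrative):

#include <stdio.h>

#define K 2   /* module-number bits: 2^K = 4 modules       */
#define M 4   /* bits for the word address within a module */

int main(void) {
    for (unsigned addr = 0; addr < 8; addr++) {
        /* High-order interleaving: top K bits select the module,
           so consecutive addresses stay in the same module. */
        unsigned hi_module = addr >> M;
        unsigned hi_word   = addr & ((1u << M) - 1);
        /* Low-order interleaving: bottom K bits select the module,
           so consecutive addresses rotate through the modules. */
        unsigned lo_module = addr & ((1u << K) - 1);
        unsigned lo_word   = addr >> K;
        printf("addr %u: high-order (module %u, word %2u) | "
               "low-order (module %u, word %u)\n",
               addr, hi_module, hi_word, lo_module, lo_word);
    }
    return 0;
}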
