Parallelism
Why Parallel Architecture?
Parallel computer architecture adds a new dimension to the development of computer systems by using a greater number of processors.
In principle, the performance achieved by utilizing a large number of processors is higher than the performance of a single processor at a given point in time.
Parallelism
Parallel processing can be described as a class of techniques that enables a system to perform multiple data-processing tasks simultaneously in order to increase its computational speed.
A parallel processing system can carry out simultaneous data-processing to achieve faster execution time.
For instance, while an instruction is being processed in the ALU component of the CPU, the next instruction can be read from
memory.
The primary purpose of parallel processing is to enhance the computer's processing capability and increase its throughput.
A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or different operations
simultaneously.
The data can be distributed among the multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram.
Parallelism
The adder and the integer multiplier perform arithmetic operations on integer numbers.
The floating-point operations are separated into three circuits operating in
parallel.
The logic, shift, and increment operations can be performed concurrently on
different data.
All units are independent of each other, so one number can be shifted while
another number is being incremented
Parallelism
Advantages of Parallel Computing over Serial Computing are as follows:
1. It saves time and money as many resources working together will reduce
the time and cut potential costs.
2. It can be impractical to solve larger problems with Serial Computing.
3. It can take advantage of non-local resources when the local resources
are finite.
4. Serial Computing ‘wastes’ the potential computing power, whereas Parallel
Computing makes better use of the hardware.
Parallelism
Types of Parallelism:
1. Bit-level parallelism:
• It is the form of parallel computing that is based on increasing the processor’s word size. It reduces the
number of instructions that the system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It
must first add the 8 lower-order bits and then the 8 higher-order bits, thus requiring two instructions
to perform the operation. A 16-bit processor can perform the operation with just one instruction (see the sketch after this list).
2. Instruction-level parallelism:
• In each clock cycle phase, a processor can issue only a limited number of instructions. These instructions
can be re-ordered and grouped, and later executed concurrently, without affecting the result of the
program. This is called instruction-level parallelism.
3. Task Parallelism:
• Task parallelism employs the decomposition of a task into subtasks and the allocation of each
subtask for execution. The processors execute the subtasks concurrently.
4. Data-level parallelism (DLP)
• Instructions from a single stream operate concurrently on several data items. DLP is limited by non-regular
data-manipulation patterns and by memory bandwidth.
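To make the 8-bit versus 16-bit scenario above concrete, here is a minimal C sketch (illustrative only, not part of the original material): the helper add16_on_8bit_alu is a hypothetical name that mimics the two dependent steps an 8-bit datapath must perform, while a 16-bit datapath completes the same sum in a single addition.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical illustration: adding two 16-bit integers on an 8-bit-wide
     * datapath takes two dependent steps (low bytes, then high bytes plus the
     * carry), whereas a 16-bit datapath needs only one addition.             */
    static uint16_t add16_on_8bit_alu(uint16_t a, uint16_t b)
    {
        uint8_t lo    = (uint8_t)((a & 0xFF) + (b & 0xFF));     /* step 1: low bytes   */
        uint8_t carry = lo < (uint8_t)(a & 0xFF);               /* carry out of step 1 */
        uint8_t hi    = (uint8_t)((a >> 8) + (b >> 8) + carry); /* step 2: high bytes  */
        return (uint16_t)(((uint16_t)hi << 8) | lo);
    }

    int main(void)
    {
        uint16_t a = 0x1234, b = 0x0FCD;
        printf("two 8-bit steps: 0x%04X\n", (unsigned)add16_on_8bit_alu(a, b));
        printf("one 16-bit step: 0x%04X\n", (unsigned)(uint16_t)(a + b));  /* same result */
        return 0;
    }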
Parallelism-Applications
• Numerical Weather Prediction
• Socioeconomics
• Finite Element Analysis
• AI and Automation
• Genetic Engineering
• Weapon Research and Defense
• Medical Applications
• Remote Sensing Applications
• Energy Resource Exploration
Parallelism
• Architectural Trends
• When multiple operations are executed in parallel, the number of cycles needed to execute
the program is reduced.
• However, resources are needed to support each of the concurrent activities.
• Resources are also needed to allocate local storage.
• The best performance is achieved by an intermediate action plan that uses resources to
utilize a degree of parallelism and a degree of locality.
• Generally, the history of computer architecture has been divided into four generations
based on the following basic technologies: vacuum tubes, transistors, integrated circuits, and VLSI.
• Until 1985, this era was dominated by growth in bit-level parallelism:
• 4-bit microprocessors were followed by 8-bit, 16-bit, and so on.
• To reduce the number of cycles needed to perform a full 32-bit operation, the
width of the data path was doubled. Later on, 64-bit operations were introduced.
• The growth in instruction-level parallelism dominated the mid-80s to the mid-90s.
• The RISC approach showed that it was simple to pipeline the steps of instruction processing
so that, on average, an instruction is executed in almost every cycle.
Instruction Level Parallelism
• Almost all processors since 1985 use pipelining to overlap the
execution of instructions and improve performance. This potential
overlap among instructions is called instruction level parallelism
• First introduced in the IBM Stretch (Model 7030) in about 1959
• Later the CDC 6600 incorporated pipelining and the use of multiple
functional units
• The Intel i486 was the first pipelined implementation of the IA32
architecture
Instruction Level Parallelism
• Instruction level parallel processing is the concurrent processing of
multiple instructions
• Difficult to achieve within a basic code block
• Typical MIPS programs have a dynamic branch frequency of between 15% and
25%
• That is, between three and six instructions execute between a pair of
branches, and data hazards usually exist within these instructions as they are
likely to be dependent
• Given the small size of a basic code block in number of instructions, ILP must be
exploited across multiple blocks
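As a rough illustration (not from the original slides), the short C function below is a single basic block whose statements form a true-dependence chain; each value depends on the previous one, so the hardware finds little instruction-level parallelism inside the block and must look across blocks instead.

    /* A single basic block: no branches inside, one entry, one exit.
     * t2 depends on t1 and t3 depends on t2, so the three operations
     * cannot be overlapped; the dependence chain serializes them.    */
    double basic_block(double a, double b, double c)
    {
        double t1 = a * b;     /* depends only on the inputs */
        double t2 = t1 + c;    /* true dependence on t1      */
        double t3 = t2 * t2;   /* true dependence on t2      */
        return t3;
    }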
Instruction Level Parallelism
• The current trend is toward very deep pipelines, increasing from a
depth of < 10 to > 20.
• With more stages, each stage can be smaller and simpler, with less gate delay;
therefore very high clock rates are possible.
Loop Level Parallelism
Exploitation among Iterations of a Loop
• Loop adding two 1000 element arrays
• Code
for (i=1; i<= 1000; i=i+1)
x[i] = x[i] + y[i];
• If we look at the generated code, within a loop there may be little
opportunity for overlap of instructions, but each iteration of the loop
can overlap with any other iteration
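One way to exploit this overlap across iterations, sketched here under stated assumptions rather than as the slides' own solution, is to unroll the loop above by a factor of four: the four additions in each unrolled iteration touch distinct elements and are independent, so they can be overlapped, and three of every four loop branches disappear (1000 is a multiple of 4, so no cleanup loop is needed).

    /* Unrolled-by-4 version of the loop above; each statement is independent
     * of the others in the same iteration, exposing ILP across iterations.  */
    for (i = 1; i <= 1000; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }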
Concepts and Challenges
Approaches to Exploiting ILP
• Two major approaches
• Dynamic – these approaches depend upon the hardware to locate the
parallelism
• Static – fixed solutions generated by the compiler, and thus bound at compile
time
• These approaches are not totally disjoint, some requiring both
• Limitations are imposed by data and control hazards
Features Limiting Exploitation of Parallelism
• Program features
• Instruction sequences
• Processor features
• Pipeline stages and their functions
• Interrelationships
• How do program properties limit performance? Under what circumstances?
Approaches to Exploiting ILP
Dynamic Approach
• Hardware intensive approach
• Dominate desktop and server markets
• Pentium III, 4, Athlon
• MIPS R10000/12000
• Sun UltraSPARC III
• PowerPC 603, G3, G4
• Alpha 21264
Approaches to Exploiting ILP
Static Approach
• Compiler intensive approach
• Embedded market and IA-64
Terminology and Ideas
• Cycles Per Instruction
• Pipeline CPI = Ideal Pipeline CPI + Structural Stalls + Data Hazard Stalls +
Control Stalls
• Ideal Pipeline CPI is a measure of the maximum performance attainable by a given
architecture; stalls and/or their impacts must be minimized.
• During the 1980s, CPI = 1 was the target objective for single-chip
microprocessors
• 1990s objective: reduce CPI below 1
• Scalar processors are pipelined processors that are designed to fetch
and issue at most one instruction every machine cycle
• Superscalar processors are those that are designed to fetch and issue
multiple instructions every machine cycle
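To show how the Pipeline CPI equation is used, here is a small worked example in C with made-up stall contributions (the numbers are assumptions chosen only to exercise the formula):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-instruction stall contributions, in cycles. */
        double ideal_cpi      = 1.00;   /* one instruction per cycle, ideally */
        double structural     = 0.10;
        double data_hazard    = 0.25;
        double control_stalls = 0.15;

        /* Pipeline CPI = Ideal CPI + structural + data hazard + control stalls */
        double pipeline_cpi = ideal_cpi + structural + data_hazard + control_stalls;
        printf("Pipeline CPI = %.2f\n", pipeline_cpi);   /* 1.50 for these inputs */
        return 0;
    }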
Approaches to Exploiting ILP
That We Will Explore
Technique – Reduces
Forwarding and bypassing – Potential data hazards and stalls
Delayed branches and simple branch scheduling – Control hazard stalls
Basic dynamic scheduling (scoreboarding) – Data hazard stalls from true dependences
Dynamic scheduling with renaming – Data hazard stalls and stalls from antidependences and output dependences
Branch prediction – Control stalls
Issuing multiple instructions per cycle – Ideal CPI
Hardware speculation – Data hazard and control hazard stalls
Dynamic memory disambiguation – Data hazard stalls with memory
Loop unrolling – Control hazard stalls
Basic compiler pipeline scheduling – Data hazard stalls
Compiler dependence analysis, software pipelining, trace scheduling – Ideal CPI, data hazard stalls
Hardware support for compiler speculation – Ideal CPI, data hazard stalls, control stalls
Flynn’s Classification