PARALLEL COMPUTING
Parallel processing
• Parallel processing architecture refers to the method of
executing multiple instructions or operations
simultaneously to enhance computational speed and
efficiency. One of the fundamental classifications of
parallel architectures is Flynn’s Taxonomy, proposed
by Michael J. Flynn in 1966. It categorizes computer
architectures based on the number of instruction
streams and data streams being processed
simultaneously.
Flynn’s Taxonomy
• Flynn’s taxonomy classifies parallel computer architectures by the concurrency of their instruction and data streams, as seen from the perspective of an assembly-language programmer.
SISD - SINGLE INSTRUCTION SINGLE DATA
• An SISD computing system is a uniprocessor machine
which is capable of executing a single instruction,
operating on a single data stream. In SISD, machine
instructions are processed in a sequential manner and
computers adopting this model are popularly called
sequential computers. Most conventional computers
have SISD architecture. All the instructions and data to
be processed have to be stored in primary memory.
[Figure: SISD - a single processing unit (PU) draws one instruction stream and one data stream]
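As a small illustration (hypothetical code, not from the slides), a purely sequential SISD-style computation in C: a single instruction stream handles one data element at a time.

```c
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int b[4] = {10, 20, 30, 40};
    int c[4];

    /* SISD: one instruction stream processes one data element per step. */
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];   /* one add, one data pair, each iteration */
    }

    for (int i = 0; i < 4; i++) printf("%d ", c[i]);
    printf("\n");
    return 0;
}
```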
SIMD - SINGLE INSTRUCTION MULTIPLE DATA
[Figure: SIMD - several processing units (PUs) share a single instruction stream, each operating on its own data stream. Only one instruction (e.g., a + b) is issued from the instruction pool, yet four data elements are processed at once.]
• An SIMD system is a multiprocessor machine capable of executing the same instruction on all of its CPUs, with each operating on a different data stream. Machines based on the SIMD model are well suited to scientific computing, which involves many vector and matrix operations. So that data can be distributed to all the processing elements (PEs), the elements of a vector are divided into multiple sets (N sets for an N-PE system), and each PE processes one data set.
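A minimal sketch of the "one instruction, four data" idea from the figure, using x86 SSE intrinsics (an assumption: an x86 CPU, compiled with a modern GCC or Clang):

```c
#include <immintrin.h>   /* x86 SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE add instruction, FOUR data elements */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```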
MISD - MULTIPLE INSTRUCTION SINGLE DATA
[Figure: MISD - multiple PUs, each with its own instruction stream, all operating on a single data stream]
• An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, all of them operating on the same data set.
• Structure: Multiple processors execute different instructions but operate on the same data stream.
• Example: Rarely used in practice, but some fault-tolerant systems (e.g., certain pipeline architectures) fit this category.
• Use Case: Redundant computations for reliability in aerospace applications.
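As a hedged illustration of the redundancy use case (hypothetical code, not from the slides): three routines with different instruction sequences compute the same function on the SAME data, and a majority vote masks a faulty unit, in the spirit of MISD-style fault tolerance.

```c
#include <stdio.h>

/* Three independent "processing elements" compute the same result
 * on the SAME data, using different instruction sequences. */
static int pe1(int x) { return x * 2; }    /* multiply */
static int pe2(int x) { return x + x; }    /* add      */
static int pe3(int x) { return x << 1; }   /* shift    */

/* Majority vote masks a single faulty result. */
static int vote(int a, int b, int c) {
    if (a == b || a == c) return a;
    return b;   /* b == c, or no majority: fall back to b */
}

int main(void) {
    int data = 21;
    int r = vote(pe1(data), pe2(data), pe3(data));
    printf("voted result: %d\n", r);   /* prints 42 */
    return 0;
}
```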
MIMD - MULTIPLE INSTRUCTION MULTIPLE DATA
[Figure: MIMD - multiple PUs, each with its own instruction stream and its own data stream]
• An MIMD system is a multiprocessor machine capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has its own instruction stream and data stream; therefore machines built on this model are suited to any kind of application. Unlike SIMD and MISD machines, the PEs in MIMD machines work asynchronously.
• In the shared-memory MIMD model (tightly coupled multiprocessor systems), all PEs are connected to a single global memory and all have access to it. Communication between PEs takes place through the shared memory: a modification of data in global memory by one PE is visible to all other PEs. Representative shared-memory MIMD systems include Silicon Graphics machines and Sun/IBM SMP (Symmetric Multi-Processing) servers.
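As a concrete illustration (hypothetical code, not from the slides), a minimal shared-memory MIMD sketch in C with POSIX threads (compile with -pthread): two asynchronous threads communicate through one global variable, and a mutex makes each thread's updates safely visible to the other.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                   /* shared global memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);         /* coordinate via shared memory */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);    /* 200000: each PE sees the other's writes */
    return 0;
}
```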
In distributed-memory MIMD machines (loosely coupled multiprocessor systems), every PE has its own local memory. Communication between PEs takes place through the interconnection network (the inter-process communication channel, or IPC). The network connecting the PEs can be configured as a tree, mesh, or other topology, in accordance with the requirements.
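For the distributed-memory case, a minimal message-passing sketch in C using MPI (an assumption: an MPI implementation such as Open MPI or MPICH is installed; run with mpirun -np 2). The PEs share no memory and exchange data only through explicit messages.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* PE 0 sends over the interconnection network; no shared memory. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("PE 1 received %d from PE 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```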
What is Pipelining?
• Pipelining is an arrangement of the CPU’s hardware in which instruction execution is divided into steps called ‘stages’ that operate in parallel, so that several instructions are in different phases of execution at the same time, raising the CPU’s overall performance.
Performance of a Pipelined Processor
Consider a k-segment pipeline with clock cycle time Tp, and let there be n tasks to complete. The first instruction takes k cycles to come out of the pipeline, while each of the remaining n - 1 instructions completes in one additional cycle, i.e., a further n - 1 cycles. So the time taken to execute n instructions on the pipelined processor is:

ET_pipeline = (k + n - 1) cycles = (k + n - 1) * Tp

For a non-pipelined processor, each instruction must pass through all k segments before the next begins, so the execution time of n instructions is:

ET_non-pipeline = n * k * Tp

The speedup S of the pipelined processor over the non-pipelined processor, when n tasks are executed on the same processor, is the ratio of the execution times:

S = ET_non-pipeline / ET_pipeline = (n * k * Tp) / ((k + n - 1) * Tp) = (n * k) / (k + n - 1)
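As a quick sanity check of the speedup formula, a worked example (the numbers k = 4 and n = 100 are chosen purely for illustration):

```latex
% Worked example: a 4-stage pipeline executing 100 instructions.
\[
S \;=\; \frac{n \cdot k}{k + n - 1}
  \;=\; \frac{100 \cdot 4}{4 + 100 - 1}
  \;=\; \frac{400}{103}
  \;\approx\; 3.88
\]
% As n grows large, S approaches k, i.e. an ideal k-stage pipeline
% delivers close to a k-fold speedup.
```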
Instruction pipelining
• Instruction pipelining is a form of pipelining where the
CPU executes multiple instructions in different stages
concurrently.
A typical instruction pipeline has five stages:
1. Instruction Fetch (IF): fetch the instruction from memory.
2. Instruction Decode (ID): decode the instruction to determine the operation.
3. Execute (EX): perform the required operation (ALU operations, address calculation, etc.).
4. Memory Access (MEM): read or write data from/to memory (if needed).
5. Write Back (WB): store the result back in a register.
I1: ADD R1, R2, R3
I2: SUB R4, R5, R6
I3: MUL R7, R8, R9
Cycle   IF (Fetch)   ID (Decode)   EX (Execute)   MEM (Memory)   WB (Write Back)
1       I1
2       I2           I1
3       I3           I2            I1
4                    I3            I2             I1
5                                  I3             I2             I1
6                                                 I3             I2
7                                                                I3
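The fill-and-drain pattern above is mechanical: in an ideal hazard-free pipeline, instruction i (counting from 1) occupies stage s during cycle i + s - 1. A minimal C sketch (illustrative only) that prints the schedule:

```c
#include <stdio.h>

int main(void) {
    const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    const int n_stages = 5, n_instr = 3;

    printf("Cycle");
    for (int s = 0; s < n_stages; s++) printf("\t%s", stages[s]);
    printf("\n");

    /* Ideal pipeline: 0-based instruction i is in 0-based stage s at cycle i + s + 1. */
    for (int cycle = 1; cycle <= n_instr + n_stages - 1; cycle++) {
        printf("%d", cycle);
        for (int s = 0; s < n_stages; s++) {
            int i = cycle - 1 - s;          /* which instruction occupies stage s */
            if (i >= 0 && i < n_instr) printf("\tI%d", i + 1);
            else                       printf("\t-");
        }
        printf("\n");
    }
    return 0;
}
```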
Hazards in Pipelining
• Pipelining improves instruction execution efficiency, but
it also introduces potential issues called hazards.
Hazards are situations that stall or delay pipeline
execution, reducing its efficiency.
• There are three types of hazards in pipelining:
1.Structural Hazards
2.Data Hazards
3.Control Hazards
1. Structural Hazards 🚧
• A structural hazard occurs when multiple instructions
compete for the same hardware resource in the
pipeline, causing a conflict and stalling the pipeline.
This happens when the CPU does not have enough
resources to handle multiple instructions at the same
time.
With a single memory port, instruction fetch (IF) and the Mem stage compete for the same memory:

Instruction/Cycle   1         2         3         4         5
I1                  IF(Mem)   ID        EX        Mem
I2                            IF(Mem)   ID        EX
I3                                      IF(Mem)   ID        EX
I4                                                IF(Mem)   ID

In cycle 4, I1's Mem stage and I4's instruction fetch both need the memory port at the same time: a structural hazard.
HOW TO AVOID? One option is to stall: hold I4 back until the memory port is free.

Cycle   1         2         3         4         5         6         7         8
I1      IF(Mem)   ID        EX        Mem       WB
I2                IF(Mem)   ID        EX        Mem       WB
I3                          IF(Mem)   ID        EX        Mem       WB
I4                                    –         –         –         IF(Mem)

I4's fetch is delayed three cycles (bubbles in cycles 4-6, while I1, I2 and I3 use the memory port in turn) and finally issues in cycle 7.
SOLUTION: give instructions and data separate memories, so instruction fetch reads from code memory (CM) while the memory stage accesses data memory (DM), and the two never conflict:

Instruction/Cycle   1        2        3        4        5        6        7
I1                  IF(CM)   ID       EX       DM       WB
I2                           IF(CM)   ID       EX       DM       WB
I3                                    IF(CM)   ID       EX       DM       WB
I4                                             IF(CM)   ID       EX       DM
I5                                                      IF(CM)   ID       EX
I6                                                               IF(CM)   ID
I7                                                                        IF(CM)
2. Data hazards
• A data hazard exists when an instruction accesses a register that a preceding instruction writes in a later cycle, so the register accesses happen in the wrong order. Unless data hazards are resolved, pipelining will yield incorrect results.
• RAW (Read After Write): a later instruction needs a value that a previous instruction is still computing. Example: ADD R1, R2, R3 followed by SUB R4, R1, R5 (R1 not ready).
• WAR (Write After Read): a later instruction writes to a register before a previous instruction reads it. Rare in simple pipelines.
• WAW (Write After Write): two instructions try to write to the same register at the same time. Happens in out-of-order execution.
RAW (Read After Write) Hazard
Cause:
•An instruction reads a register before a previous instruction writes to it.
•The required data is not yet available, causing a stall.
Example:
I1: ADD R1, R2, R3 # R1 = R2 + R3
I2: SUB R4, R1, R5 # R4 = R1 - R5
🔴 Problem:
•I2 needs R1, but R1 is still being computed in I1.
•I2 cannot proceed until I1 writes R1.
Solution:
✅ Forwarding (Bypassing) – Directly pass R1 to I2 without waiting.
✅ Pipeline Stall (Bubble Insertion) – Temporarily delay I2 until R1 is
available.
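A hedged sketch of the forwarding idea in C (the struct layout and field names are hypothetical, loosely modeled on a classic five-stage pipeline): the consumer's operand is bypassed from the producer's EX/MEM latch instead of waiting for write-back.

```c
#include <stdio.h>

/* Simplified pipeline latches (hypothetical layout for illustration). */
struct ex_mem { int dest_reg; int alu_result; };   /* output of EX stage */
struct id_ex  { int src_reg;  int operand;    };   /* input to EX stage  */

/* Forwarding unit: if the previous instruction's destination matches
 * our source, bypass its ALU result instead of reading a stale register. */
static int read_operand(const struct id_ex *cur, const struct ex_mem *prev,
                        const int regfile[]) {
    if (prev->dest_reg == cur->src_reg)
        return prev->alu_result;          /* forwarded: no stall needed */
    return regfile[cur->src_reg];         /* normal register-file read  */
}

int main(void) {
    int regfile[8] = {0};
    struct ex_mem i1 = { .dest_reg = 1, .alu_result = 30 };  /* I1: ADD R1,R2,R3 */
    struct id_ex  i2 = { .src_reg  = 1 };                    /* I2: SUB R4,R1,R5 */
    printf("I2 reads R1 = %d (forwarded)\n", read_operand(&i2, &i1, regfile));
    return 0;
}
```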
WAR (Write After Read) Hazard ⚡
Cause:
•A later instruction writes to a register before an earlier instruction reads it.
•This can happen in out-of-order execution, but it is rare in simple pipelines.
Example:
I1: SUB R4, R1, R5 # Reads R1
I2: ADD R1, R2, R3 # Writes R1
🔴 Problem:
•I1 is reading R1, but I2 writes to R1 before I1 finishes reading it.
Solution:
✅ Register Renaming – Use a temporary register to avoid conflicts.
✅ Reorder Execution – Ensure I1 reads R1 before I2 writes to it.
WAW (Write After Write) Hazard ⚡
Cause:
•Two instructions try to write to the same register at the same time.
•This happens in out-of-order execution, where later instructions finish earlier.
Example:
I1: MUL R1, R2, R3 # Writes R1
I2: ADD R1, R4, R5 # Writes R1
🔴 Problem:
•I2 might overwrite R1 before I1 completes its computation.
•This leads to incorrect results.
Solution:
✅ Register Renaming – Use separate registers for different computations.
✅ Force In-Order Execution – Ensure I1 finishes writing before I2.
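Register renaming can be sketched in a few lines of C (an illustration of the concept, not any real CPU's renaming logic): each write to an architectural register is assigned a fresh physical register, so the two writes to R1 above can no longer collide.

```c
#include <stdio.h>

#define NUM_ARCH 8

static int rename_map[NUM_ARCH];   /* architectural -> physical register */
static int next_phys = NUM_ARCH;   /* next free physical register        */

/* Each write gets a brand-new physical register. */
static int rename_dest(int arch_reg) {
    rename_map[arch_reg] = next_phys++;
    return rename_map[arch_reg];
}

int main(void) {
    for (int r = 0; r < NUM_ARCH; r++) rename_map[r] = r;

    int p1 = rename_dest(1);   /* I1: MUL R1,R2,R3  -> writes P8 */
    int p2 = rename_dest(1);   /* I2: ADD R1,R4,R5  -> writes P9 */

    /* The two writes now target different physical registers,
     * so the WAW (and WAR) hazard on R1 disappears. */
    printf("I1 writes P%d, I2 writes P%d\n", p1, p2);
    return 0;
}
```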
Control hazard
A control hazard (also called a branch hazard) occurs when the pipeline does not
know the outcome of a branch instruction and fetches the wrong instruction.
This causes a pipeline stall while the processor waits to determine the correct path.
🔴 Problem: When a branch instruction (if, goto, loop, function call) is
encountered,
the CPU doesn’t know whether to follow the branch or continue sequential
execution.
🔴 Impact: The pipeline might fetch and execute incorrect instructions,
leading to wasted cycles and performance loss.
• I1: BEQ R1, R2, LABEL # If R1 == R2, jump to LABEL
• I2: ADD R3, R4, R5 # Might be skipped if branch is taken
• I3: MUL R6, R7, R8 # Might also be skipped
🔴 Issue:
•The CPU fetches I2 and I3 before knowing if the branch is taken.
•If the branch is taken, I2 and I3 must be discarded, leading to pipeline stalls.
Solutions to Control Hazards
✅ 1. Pipeline Stalling (Branch Bubble)
•The CPU waits until the branch instruction is resolved before
fetching new instructions.
•Disadvantage: This causes delays and lowers
performance.
✅ 2. Branch Prediction
•The processor guesses whether the branch will be taken or not.
•If the guess is correct, execution continues without delay.
•If the guess is wrong, the CPU flushes the incorrect instructions and starts again.
🔹 Static Branch Prediction:
•Always assumes a branch is taken or not taken.
•Works well for simple loops (for, while).
🔹 Dynamic Branch Prediction:
•Uses history-based techniques (e.g., Branch History Table) to predict branches.
•Modern CPUs use dynamic prediction for higher accuracy.
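As a minimal illustration of the history-based idea (not any specific CPU's predictor), a 2-bit saturating counter in C: two consecutive mispredictions are required before the prediction flips, so a loop branch that is almost always taken is predicted well.

```c
#include <stdio.h>
#include <stdbool.h>

/* 2-bit saturating counter: 0,1 = predict not taken; 2,3 = predict taken. */
static int counter = 2;   /* start in the "weakly taken" state */

static bool predict(void) { return counter >= 2; }

static void update(bool taken) {
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    /* Actual outcomes of a loop branch: taken 5 times, then falls through. */
    bool outcomes[] = {true, true, true, true, true, false, true};
    int correct = 0, n = sizeof outcomes / sizeof outcomes[0];

    for (int i = 0; i < n; i++) {
        bool guess = predict();
        if (guess == outcomes[i]) correct++;
        update(outcomes[i]);
    }
    printf("correct predictions: %d/%d\n", correct, n);   /* 6/7 */
    return 0;
}
```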
✅ 3. Branch Delay Slots
•The compiler rearranges instructions so that useful work is done before the branch is resolved.
🔹 Example (Before Optimization)
I1: BEQ R1, R2, LABEL
I2: ADD R3, R4, R5 # Might be wasted if branch is taken
🔹 Example (After Optimization with Branch Delay Slot)
I1: BEQ R1, R2, LABEL
I2: LOAD R6, 100(R7) # Useful work done before the branch is resolved
•The delay slot is filled with an instruction that does not depend on the branch.
•This reduces pipeline stalls.
✅ 4. Speculative Execution
•The CPU executes both paths of the branch in advance.
•Once the branch decision is known, the incorrect path is discarded.
•Used in modern processors (e.g., Intel, AMD).
Vector processing
• Vector processing is a high-performance computing
technique where a single instruction operates on
multiple data elements simultaneously (SIMD -
Single Instruction, Multiple Data).
🔹 Why is it important?
• It speeds up computations for scientific simulations,
graphics processing, AI, and large-scale
numerical calculations.
• Unlike scalar processing (which operates on one data
item at a time), vector processing handles arrays
(vectors) of data in one go.
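To make the scalar-versus-vector contrast concrete, a hedged SAXPY (y = a*x + y) sketch in C: the scalar loop handles one element per iteration, while the OpenMP SIMD pragma (an assumption: a compiler with OpenMP SIMD support, e.g. gcc with -fopenmp-simd) asks the compiler to map the same loop onto vector instructions.

```c
#include <stdio.h>

#define N 1024

/* Scalar processing: one element per loop iteration. */
void saxpy_scalar(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Vector processing: the compiler is asked to execute the loop
 * with SIMD/vector instructions, many elements per instruction. */
void saxpy_vector(float a, const float *x, float *y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy_vector(2.0f, x, y, N);
    printf("y[10] = %.1f\n", y[10]);   /* 2*10 + 1 = 21.0 */
    return 0;
}
```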
Vector Processor Classification
• Memory-to-Memory Architecture
In memory-to-memory architecture, source operands, intermediate results, and final results are fetched directly from main memory. A memory-to-memory vector instruction must specify the base address, offset, increment, and vector length to drive the data transfers between main memory and the pipelines. Notable processors employing the memory-to-memory format include the TI-ASC, CDC STAR-100, and Cyber-205.
• Main points about Memory-to-Memory Architecture
• No limitation on vector size.
• Comparatively slower, since every operand access goes to main memory.
Register to Register
Architecture
• In register-to-register architecture, operands and results
are retrieved indirectly from the main memory through
the use of a large number of vector or scalar registers.
Processors like Cray-1 and Fujitsu VP-200 utilize vector
instructions in register-to-register formats.
• Main points about Register-to-Register Architecture
• Vector length limited by the size of the register file.
• Higher speed compared to memory-to-memory
architecture.
• Increased hardware cost.
Hybrid Architecture
• Hybrid architecture combines memory-to-memory and register-to-register designs to gain the benefits of both. Its flexible operand-retrieval methods improve performance and efficiency across many computational tasks, providing a balanced solution that adapts to the requirements of the application while making better use of the available resources.
Loosely Coupled & Tightly Coupled
Systems
Loosely Coupled Systems 🖧
Definition: A loosely coupled system consists of multiple
independent processors that communicate through a network. Each
processor has its own memory and operating system, and they
coordinate tasks by exchanging messages.
• Characteristics
✔️Each processor has its own memory and OS.
✔️Communication happens via message passing (network-based).
✔️High scalability, meaning more processors can be added easily.
✔️Fault tolerance is better—if one system fails, others continue.
✔️Slower data transfer compared to tightly coupled systems.
• Examples of Loosely Coupled Systems
🔹 Distributed Systems – Google Cloud, Amazon AWS, Microsoft Azure
🔹 Grid Computing – SETI@home, Folding@home
🔹 Cluster Computing – Hadoop, Kubernetes clusters
Tightly Coupled Systems
• A tightly coupled system consists of multiple processors that
share the same memory and OS. These processors work
closely together and communicate via shared memory, making
data transfer faster than in loosely coupled systems.
• Characteristics
✔️Processors share a common memory space.
✔️Faster data exchange due to shared memory.
✔️Lower scalability compared to loosely coupled systems.
✔️Better synchronization between processors.
✔️If a processor fails, it may affect the entire system.
• Examples of Tightly Coupled Systems
🔹 Multiprocessor Systems – Intel Xeon, AMD EPYC servers
🔹 Supercomputers – Cray, IBM Blue Gene
Feature                   Loosely Coupled                                 Tightly Coupled
Processor Communication   Message passing via network                     Shared memory
Memory                    Each processor has its own memory               Shared between processors
Speed                     Slower (network latency)                        Faster (direct memory access)
Scalability               High                                            Limited
Fault Tolerance           High (failure of one does not affect others)    Low (failure may crash the system)
Example                   Cloud computing, Grid computing                 Supercomputers, Multi-core CPUs