Parallel Hardware
Some computer systems can do many things at the same time. This is called
parallel hardware.
In some systems (such as pipelining or multiple issue), the parallelism happens
inside the CPU and the programmer doesn't need to do anything special to use
it.
So, these are not counted as "true" parallel systems here.
We only call it parallel hardware if the programmer can or must change
the code to make the system run things in parallel.
Types of Parallel Computers
There are two main ways to group parallel computers:
1. Flynn’s Taxonomy
This method groups computers by how many instructions and data streams
they can handle at once:
SISD (Single Instruction, Single Data): A normal computer – it executes one
instruction on one piece of data at a time.
SIMD (Single Instruction, Multiple Data): The same instruction runs on many
pieces of data at once. (Used in GPUs.)
MIMD (Multiple Instruction, Multiple Data): Many processors run different
instructions on different data. (Used in multi-core CPUs and clusters.)
2. Memory Access Model
This method is about how the processors access memory.
Shared Memory:
All processors share the same memory. They can talk to each other
using that shared memory.
Distributed Memory:
Each processor has its own memory. To share data, they must send
messages to each other.
SIMD Systems (Single Instruction, Multiple Data)
SIMD is a type of parallel computing.
It means: one instruction is applied to many pieces of data at the same
time.
It is useful when you're doing the same thing repeatedly on different data
values.
Working of SIMD
A single control unit sends the same instruction to multiple datapaths.
Each datapath applies the instruction to its own data.
If the instruction can’t be used on some data, that datapath sits idle.
Example:
Suppose you want to add two arrays element-wise:
for (i = 0; i < n; i++)
x[i] += y[i];
In SIMD, if the system has n datapaths, all additions can be done at once.
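On real hardware the number of SIMD lanes is fixed, so longer arrays are
processed in chunks. A minimal sketch, assuming an x86 CPU with AVX (4 doubles
per register) and an n that is a multiple of 4:

#include <immintrin.h>   /* x86 AVX intrinsics */

/* Element-wise x[i] += y[i], four doubles per instruction.
   A real version would finish any leftover elements with a scalar loop. */
void add_arrays(double* x, const double* y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);             /* load 4 doubles of x */
        __m256d vy = _mm256_loadu_pd(&y[i]);             /* load 4 doubles of y */
        _mm256_storeu_pd(&x[i], _mm256_add_pd(vx, vy));  /* one add = 4 sums    */
    }
}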
Vector Processors
A vector processor handles vectors (arrays of data) instead of single
numbers.
It’s another way to do SIMD operations efficiently.
Features:
1. Vector Registers:
Can store a whole array (e.g. 4 to 256 numbers) in one register.
2. Vector Instructions:
Do operations like addition/multiplication on all elements in the vector
at once.
3. Pipelined Units:
Helps process multiple elements very quickly one after another.
Benefits:
Faster for large loops and big arrays.
Very efficient memory usage.
Compilers can automatically convert code to vector form (vectorization).
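As an example of the last point, a plain C loop like the sketch below is
usually auto-vectorized when compiled with optimization (e.g., gcc -O3); the
restrict qualifier simply tells the compiler the arrays do not overlap, which
makes vectorization easier to prove safe.

/* Compilers typically turn this scalar loop into vector instructions when
   optimization is enabled; no source changes are strictly required. */
void scale_add(double* restrict x, const double* restrict y,
               double a, int n) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i] + y[i];
}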
GPUs (Graphics Processing Units)
Originally built to render graphics.
Now used for general computing (AI, simulations, etc.)
Works like SIMD but more flexible.
Features:
1. Hardware Multithreading:
Can switch between many threads, so when one thread stalls (e.g., waiting on
memory) another keeps the hardware busy.
2. SIMD + MIMD (combined):
o Within one core → uses SIMD (same instruction on many data items).
o Across multiple cores → uses MIMD (different instructions on
different cores).
3. Memory Use:
Can use shared or distributed memory, or both.
Benefits:
Very fast for image processing, AI, and scientific computing.
Can handle hundreds or thousands of threads.
Modern GPUs are powerful and widely used.
Distributed-Memory Interconnects
In distributed-memory systems, each processor has its own private memory,
and processors communicate through an interconnection network.
(a) Crossbar Layout
We have 4 processors (P1–P4) and 4 memory modules (M1–M4).
Each processor is connected to every memory module using a grid of
switches (the small circles).
This means any processor can access any memory module directly.
(b) Internal Switches
Two types of switch configurations:
1. Connected (i) → Allows data to pass (like a “green light”).
2. Not connected (ii) → Blocks the path (like a “red light”).
These switches control traffic so processors don’t interfere with each
other.
Example
(a) Ring
Processors (P1, P2, P3) are connected in a circle (ring).
Each processor is connected to two neighbors.
Data travels in one direction (or sometimes both directions).
If one link fails, the whole ring may break.
Example: Think of it like a circular train track, where each station (processor)
connects to the next one.
(b) Toroidal Mesh
Processors are arranged in a two-dimensional grid whose rows and columns wrap
around, so each processor is connected to four neighbors.
Ring Bisection
Here, the ring network is divided into two halves (A-side and B-side).
(a) Only 2 Communications
With this split, only two messages can cross between A and B at the same time.
This creates a bottleneck.
(b) 4 Simultaneous Communications
With a different split, four messages can be exchanged at once.
The bisection width measures the worst case: the minimum number of links that
must be removed to split the network into two equal halves. For a ring this
is 2.
A bisection of a toroidal mesh
The network is divided into two halves (left side vs. right side).
Each square = a processor.
Connections cross the middle line, linking left half to right half.
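For reference, the usual bisection-width counts for these networks: a ring
with p processors has bisection width 2, and a square toroidal mesh with p
processors has bisection width 2√p. For example, with p = 16 the mesh cut is
crossed by 2 × √16 = 8 links.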
1. Fully Connected Network (Fig. 2.11)
Each processor is directly linked to every other processor.
Bisection Width: Very high (p²/4).
Advantage: Fastest communication, best possible.
Disadvantage: Too costly and impractical for many processors.
2. Hypercube (Fig. 2.12)
a) 1-D Hypercube
Just 2 processors connected by a link.
b) 2-D Hypercube
4 processors arranged like a square.
Each connected to 2 others.
c) 3-D Hypercube
8 processors arranged like a cube.
Each processor connects to 3 others.
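In general, a d-dimensional hypercube has p = 2^d processors, each connected
to d = log₂(p) neighbors, and its bisection width is p/2. For the 3-D cube
above: p = 8, each processor has 3 links, and the bisection width is 4.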
Indirect Network
In direct networks (like mesh, ring, hypercube), processors are directly
connected to each other.
In an indirect network, processors do not connect directly to every other
processor.
Instead, they send data through a switching network (a kind of “middle
box” or router).
Omega Network
The Omega Network is a type of multistage interconnection network.
Key Features:
Uses log₂(N) stages (if N = number of processors).
o Example: For 8 processors, it needs 3 stages (since log₂(8) = 3).
Each stage has small 2×2 switches (simple, low cost).
The path is decided by the binary address of the destination.
Squares on left = Processors (Inputs).
Circles = Switches (2×2).
Right side = Outputs (Memory modules).
Data flows step by step → Switch → Switch → Switch → Output.
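A quick count, assuming p = 8 processors: there are log₂(8) = 3 stages, each
stage contains p/2 = 4 of the 2×2 switches, so the whole network needs
3 × 4 = 12 switches. A full 8 × 8 crossbar would need 64 switch points, which
is why the Omega network is cheaper.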
Formula for the cost of sending a message:
If
Latency = l seconds
Bandwidth = b bytes/second
Message size = n bytes
Then:
Message transmission time = l + n / b seconds
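A small worked example with made-up numbers: if l = 2 × 10⁻³ seconds (2 ms),
b = 10⁹ bytes/second, and n = 10⁶ bytes, then the message takes
2 × 10⁻³ + 10⁶ / 10⁹ = 2 × 10⁻³ + 1 × 10⁻³ = 3 × 10⁻³ seconds (3 ms).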
Cache coherence
Directory-Based Cache Coherence
Problem
Many cores = many caches.
If one core changes a variable, other cores might still use the old value in
their cache.
We need a way to keep all caches in sync (cache coherence).
Snooping method (old way)
Every time a variable changes, all cores are told (broadcast).
Works for small systems but too slow when there are many cores.
Directory method (new way)
A directory keeps a list of which cores have a copy of each variable
(memory block).
When one core updates the variable:
1. It asks the directory.
2. The directory checks who has the copy.
3. Only those cores are told to update or delete their copy.
No need to tell everyone — only the ones that need it.
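A minimal sketch of what one directory entry might look like, assuming a
simple bit-vector scheme (the struct and field names here are made up for
illustration, not taken from any particular machine):

#include <stdint.h>
#include <stdbool.h>

/* One entry per memory block: the bit vector records which cores currently
   hold a cached copy, so a write only has to notify those cores. */
typedef struct {
    uint64_t sharers;   /* bit i set => core i has a copy of this block */
    bool     dirty;     /* true => some core holds a modified copy      */
    int      owner;     /* core that holds the modified copy (if dirty) */
} directory_entry;

/* Cores whose copies must be invalidated when `writer` updates the block. */
static inline uint64_t cores_to_invalidate(directory_entry e, int writer) {
    return e.sharers & ~((uint64_t)1 << writer);
}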
False Sharing
CPUs don’t store single variables in cache.
They load cache lines (blocks of memory, e.g., 64 bytes).
When any variable in a cache line is accessed, the whole line is loaded into
that core's cache.
Example
Array y[0..7], each element is a double (8 bytes).
Cache line = 64 bytes → all 8 elements (y[0] to y[7]) fit in one cache line.
Now:
Core 0 updates y[0], y[1], y[2], y[3].
Core 1 updates y[4], y[5], y[6], y[7].
The two cores never touch the same element, but because all eight elements sit
in the same cache line, every write by one core invalidates the other core's
copy of the line. The line keeps bouncing between the caches and the program
slows down even though no data is really shared – this is false sharing.
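A minimal sketch of this situation, assuming a 64-byte cache line and POSIX
threads (the iteration count is arbitrary); padding the array so that each
core's elements sit in their own cache line would remove the slowdown.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000

double y[8];   /* 8 doubles = 64 bytes, typically one cache line */

void* work(void* arg) {
    int first = *(int*)arg;                 /* 0 for one thread, 4 for the other */
    for (long i = 0; i < ITERS; i++)
        for (int j = first; j < first + 4; j++)
            y[j] += 1.0;                    /* repeated writes to the shared line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int a = 0, b = 4;
    pthread_create(&t0, NULL, work, &a);    /* "core 0" works on y[0..3] */
    pthread_create(&t1, NULL, work, &b);    /* "core 1" works on y[4..7] */
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("y[0] = %f, y[4] = %f\n", y[0], y[4]);
    return 0;
}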
Shared-memory vs. Distributed-memory
1. Shared-memory systems
All processors share the same memory.
Processors communicate by reading/writing shared variables.
Easy for programmers (just use common data).
Problem:
o Hard to scale → when many processors share the same bus or
interconnect, they contend for it and slow each other down.
o Expensive hardware (e.g., large crossbars).
Best for: small to medium systems (few processors).
2. Distributed-memory systems
Each processor has its own private memory.
Processors communicate by sending messages over a network (like
hypercube or toroidal mesh).
Harder for programmers (must manage message passing).
Advantage:
o Can scale to thousands of processors.
o Cheaper interconnects.
Best for: very large problems with lots of data and computation.
Parallel Software
Nowadays parallel hardware is everywhere: desktops, servers, laptops, and
phones all use multicore processors.
The hardware is ready (many cores are available), but most software is still
written for a single core – that is what has to change.
To get a speedup, software must be written to use multiple
cores/processors.
So programmers need to learn:
o Shared-memory programming.
o Distributed-memory programming.
o MIMD (Multiple Instruction, Multiple Data).
o SIMD (Single Instruction, Multiple Data).
Caveats
1. SPMD = Single Program, Multiple Data.
Instead of running different programs on each core, we run one program
on all cores.
The program behaves differently depending on the thread/process ID.
2. Data Parallelism with SPMD
It is easy to divide an array among processes/threads.
Each thread/process runs the same function on its own block of the array, so
different parts of the data are processed at the same time (see the sketch
below).
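A minimal sketch of this block partitioning in C (the names my_rank,
thread_count, and add_my_block are placeholders for whatever the real program
uses; n is assumed to be divisible by thread_count):

/* Each thread/process calls this with its own rank.  Every rank runs the
   same code but only touches its own block of the arrays. */
void add_my_block(double x[], const double y[], int n,
                  int my_rank, int thread_count) {
    int block    = n / thread_count;
    int my_first = my_rank * block;
    int my_last  = my_first + block;
    for (int i = my_first; i < my_last; i++)
        x[i] += y[i];   /* same program, different data on each rank */
}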
Shared-memory
1. Shared vs Private Variables
Shared variables → can be read/written by any thread. (Used for
communication).
Private variables → belong to one thread only.
2. Two Thread Paradigms
a. Dynamic Threads
There is one master thread and a pool of worker threads.
Workflow:
1. Master waits for a request (e.g., from user/network).
2. When work arrives → it creates (forks) a worker thread.
3. Worker thread does the task.
4. After finishing, worker joins back to master and terminates.
Advantage: Saves resources (threads exist only while working).
Disadvantage: Forking and joining threads repeatedly takes extra time.
b. Static Threads
Master thread starts all threads at the beginning.
Threads run until all work is finished.
After the work → they join back, cleanup happens, and then the program ends
(see the sketch below).
Advantage: Faster (no repeated forking and joining).
Disadvantage: Wastes resources if some threads are idle.
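A minimal static-thread sketch with POSIX threads: the master (main) thread
forks all workers at the start and joins them only when the work is finished.

#include <pthread.h>
#include <stdio.h>

#define THREAD_COUNT 4

void* worker(void* rank) {
    long my_rank = (long) rank;   /* private to this thread; globals would be shared */
    printf("Thread %ld is working\n", my_rank);   /* the real work goes here */
    return NULL;
}

int main(void) {
    pthread_t handles[THREAD_COUNT];
    for (long t = 0; t < THREAD_COUNT; t++)       /* fork all threads up front */
        pthread_create(&handles[t], NULL, worker, (void*) t);
    for (long t = 0; t < THREAD_COUNT; t++)       /* join them at the end */
        pthread_join(handles[t], NULL);
    return 0;
}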
Nondeterminism
Deterministic program → same input always gives the same output.
Nondeterministic program → same input may give different outputs in
different runs.
This usually happens in parallel programs, especially with MIMD systems.
Updating a shared variable (x)
Two threads want to update a shared variable x.
Example :
my_val = Compute_val(my_rank);
x += my_val; // both threads try to update x
Thread 0 loads x (0).
Thread 1 loads x (0).
Thread 0 adds its my_val (say 7).
Thread 0 stores result (7).
Thread 1 adds its my_val (say 19).
Thread 1 stores result (19).
Final x = 19 (Thread 0's update is lost; the correct result would have been
7 + 19 = 26).
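A minimal sketch of one common fix, assuming POSIX threads: a mutex makes the
load-add-store on x atomic, so the two updates can no longer interleave.

#include <pthread.h>

double x = 0.0;
pthread_mutex_t x_mutex = PTHREAD_MUTEX_INITIALIZER;

void update_x(double my_val) {
    pthread_mutex_lock(&x_mutex);     /* only one thread at a time past this point */
    x += my_val;                      /* both updates are now kept (7 + 19 = 26)   */
    pthread_mutex_unlock(&x_mutex);
}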
Thread Safety
Thread safe → works correctly when many threads use it at the same time.
Not thread safe → causes errors if many threads use it together.
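A classic C library example: strtok keeps its position in a hidden static
buffer, so two threads tokenizing different strings at the same time corrupt
each other's state (not thread safe). The POSIX variant strtok_r keeps that
state in a caller-supplied pointer, so each thread has its own (thread safe).

#include <string.h>
#include <stdio.h>

int main(void) {
    char line[] = "a,b,c";
    char* saveptr;                                    /* per-caller state */
    for (char* tok = strtok_r(line, ",", &saveptr);   /* thread-safe version */
         tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr))
        printf("%s\n", tok);
    return 0;
}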
Distributed-Memory
Each processor (core) has its own private memory.
One processor cannot directly access another processor’s memory.
To share information, processors must communicate with each other.
They use different models to communicate:
(a) Message Passing (most common → MPI)
Process A sends a message.
Process B receives that message.
Example:
o Process 1 → "Hello" → Process 0.
o Process 0 → prints: Hello.
Very powerful, but programmer must manage all details.
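A minimal MPI sketch of the exchange described above (process 1 sends a
greeting, process 0 receives and prints it); it assumes an MPI implementation
such as MPICH or Open MPI is installed and the program is started with at
least two processes.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
    int my_rank;
    char msg[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {                               /* Process 1 sends */
        strcpy(msg, "Hello");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {                        /* Process 0 receives */
        MPI_Recv(msg, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s\n", msg);
    }

    MPI_Finalize();
    return 0;
}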
(b) One-Sided Communication (Remote Memory Access)
Only one process does the action (no matching send/receive needed).
Example: Process 0 directly copies data into Process 1’s memory.
Advantage: simpler, less waiting.
(c) PGAS (Partitioned Global Address Space) Languages
Looks like shared memory programming, but actually runs on distributed
systems.
Memory is split into local + shared parts.
Programmer controls which data lives where, to avoid slow “remote”
memory access.