BCS702 – Module 1 Notes
Parallel Hardware and Parallel Software
Table of Contents
1. Parallel Hardware
1.1 Classifications of Parallel Computers
1.2 SIMD Systems
1.3 MIMD Systems
1.4 Interconnection Networks
1.5 Cache Coherence
1.6 Shared vs Distributed Memory
2. Parallel Software
2.1 Caveats
2.2 Coordinating Processes/Threads
2.3 Shared Memory
2.4 Distributed Memory
2.5 GPU Programming
1. Parallel Hardware
Parallel hardware enables multiple computations to run simultaneously. It is essential for
improving performance in modern computing, especially in data-heavy tasks such as
simulations and image processing.
1.1 Classifications of Parallel Computers
Flynn’s Taxonomy classifies computers based on the number of instruction and data
streams:
- SISD: Single instruction, single data
- SIMD: Single instruction, multiple data
- MIMD: Multiple instruction, multiple data
Another classification is based on memory organization:
- Shared memory: All cores can access a common memory (a shared address space)
- Distributed memory: Each processor has its own private memory; processors communicate
by sending messages over a network
1.2 SIMD Systems
SIMD (Single Instruction, Multiple Data) systems apply one instruction to many data points
simultaneously.
Used in image processing, simulations, and vector math operations.
Limitation: All ALUs must apply the same instruction to their own data items (or sit idle), so
code in which different items need different operations, such as conditional branches, is
handled inefficiently.
Vector processors and GPUs are examples of SIMD systems.
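As an illustration, a minimal C sketch of a loop that a vectorizing compiler can map onto SIMD
instructions (the function and array names are only illustrative):

    #include <stddef.h>

    /* Element-wise vector addition: the same "add" is applied to every pair of
     * elements, so a vectorizing compiler can emit SIMD instructions that
     * process several elements per instruction. */
    void vec_add(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }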
1.3 MIMD Systems
MIMD systems use independent processors with separate instruction streams and data sets.
Processors run asynchronously, suitable for complex and diverse tasks.
MIMD includes:
- Shared-memory systems (e.g., multicore CPUs)
- Distributed-memory systems (e.g., clusters of computers)
1.4 Interconnection Networks
These connect processors and memory in a parallel system.
Shared-memory systems use buses and crossbars.
Distributed-memory systems use rings, meshes, hypercubes, and omega networks.
Key terms:
- Latency: Delay before data starts transferring
- Bandwidth: Rate at which data transfers
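As a rough model (the numbers here are only illustrative), the time to send a message is about
latency + message size / bandwidth. For example, with a latency of 5 microseconds and a
bandwidth of 1 GB/s, a 1 MB message takes roughly 5 µs + 1 ms ≈ 1.005 ms, so latency
dominates the cost of small messages and bandwidth dominates the cost of large ones.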
1.5 Cache Coherence
Problem: When multiple processors cache the same variable, updates by one processor may
not be seen by others.
Solutions:
- Snooping: Cores monitor a shared bus for updates
- Directory-based: Each memory block tracks which cores have cached it
False Sharing: A performance problem that arises when threads repeatedly update distinct
variables that happen to lie in the same cache line, so the line bounces between caches even
though no value is actually shared.
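A minimal C sketch (using POSIX threads; the thread count, iteration count, and 64-byte cache
line size are assumptions for illustration) of how false sharing arises and one common fix,
padding each thread's data onto its own cache line:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000L

    /* Each thread updates its own counter, but the counters are packed into one
     * array, so several of them share a cache line: every update forces the line
     * to bounce between cores (false sharing). */
    long packed[NTHREADS];

    /* Fix: pad each counter so it occupies its own cache line
     * (64 bytes is a typical line size; this is an assumption, not portable). */
    struct padded { long value; char pad[64 - sizeof(long)]; };
    struct padded spread[NTHREADS];

    void *work_packed(void *arg) {
        long id = (long) arg;
        for (long i = 0; i < ITERS; i++) packed[id]++;
        return NULL;
    }

    void *work_spread(void *arg) {
        long id = (long) arg;
        for (long i = 0; i < ITERS; i++) spread[id].value++;
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];

        /* Run the falsely-sharing version, then the padded version; timing the
         * two loops (omitted here) typically shows a large difference. */
        for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work_packed, (void *) i);
        for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);

        for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work_spread, (void *) i);
        for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);

        printf("done: packed[0]=%ld spread[0]=%ld\n", packed[0], spread[0].value);
        return 0;
    }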
1.6 Shared vs Distributed Memory
Shared-memory: Easier to program, but harder to scale because contention on the shared
interconnect (e.g., a bus) grows as cores are added.
Distributed-memory: Harder to program, but scales to much larger numbers of processors.
2. Parallel Software
Writing software for parallel systems involves more complexity than serial programs.
Programmers need to manage synchronization, communication, and potential errors like
race conditions.
2.1 Caveats
Not all problems can be parallelized.
Some are 'embarrassingly parallel' (e.g., processing independent images), while others
require complex coordination.
2.2 Coordinating Processes/Threads
Threads/processes must be coordinated (synchronized) so that they produce correct results.
Two key concerns are load balancing (giving each thread/process roughly equal work) and
minimizing communication; a simple partitioning sketch follows below.
Parallelizing is the act of converting a serial program so that it runs in parallel.
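A minimal C sketch of block partitioning, one simple way to balance n loop iterations across p
threads or processes (the function and variable names are only illustrative):

    /* Compute the block of iterations [first, last) owned by thread `rank` out
     * of `p` threads, spreading the remainder of n/p over the first n % p
     * threads so that no thread gets more than one extra iteration. */
    void block_range(long n, int p, int rank, long *first, long *last) {
        long q = n / p;   /* base block size     */
        long r = n % p;   /* leftover iterations */
        *first = rank * q + (rank < r ? rank : r);
        *last  = *first + q + (rank < r ? 1 : 0);
    }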
2.3 Shared Memory
Uses threads that access a common memory space.
Thread types:
- Dynamic (worker threads are forked and joined as needed during the run)
- Static (all threads are created at start-up and kept until the work is done)
Issues:
- Nondeterminism: The output can vary from run to run because thread scheduling is unpredictable
- Race conditions: Two or more threads access the same variable, at least one access is a write,
and the result depends on the order in which the accesses happen
Solutions: Mutexes (locks), semaphores, and monitors.
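A minimal C sketch using POSIX threads (the thread and iteration counts are illustrative):
without the mutex, the updates to the shared counter form a race condition and the final value
can vary from run to run.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    100000

    long counter = 0;                                   /* shared variable  */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects counter */

    void *increment(void *arg) {
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);    /* only one thread updates at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, increment, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        /* With the mutex the result is always NTHREADS * ITERS; without it,
         * lost updates make the value vary from run to run. */
        printf("counter = %ld\n", counter);
        return 0;
    }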
2.4 Distributed Memory
Processes use message-passing to share data.
Message-passing APIs provide at least send and receive operations (e.g., MPI_Send and MPI_Recv).
MPI (Message Passing Interface) is the most widely used message-passing API.
Drawback: Requires significant program restructuring.
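A minimal MPI sketch in C (the message value and tag are illustrative): process 1 sends an
integer to process 0. Run it with at least two processes, e.g. mpiexec -n 2 ./a.out.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, x = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 1) {
            x = 42;                             /* data to send (illustrative) */
            MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 0 received %d from process 1\n", x);
        }

        MPI_Finalize();
        return 0;
    }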
2.5 GPU Programming
GPUs are used for parallel tasks like image processing and simulations.
GPU programming is heterogeneous:
- Code runs on both CPU (host) and GPU (device)
- Data must be transferred between host and device
Popular platforms: CUDA and OpenCL
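A minimal CUDA C sketch of the host/device pattern (the array size, block size, and kernel are
illustrative): the host allocates device memory, copies input data over, launches a kernel, and
copies the result back.

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 1024

    /* Kernel: runs on the GPU (device), one thread per array element. */
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        float h_a[N], h_b[N], h_c[N];           /* host (CPU) arrays   */
        float *d_a, *d_b, *d_c;                 /* device (GPU) arrays */

        for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2 * i; }

        cudaMalloc(&d_a, N * sizeof(float));    /* allocate on the GPU */
        cudaMalloc(&d_b, N * sizeof(float));
        cudaMalloc(&d_c, N * sizeof(float));

        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);  /* host -> device */
        cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

        vec_add<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);              /* launch kernel  */

        cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);  /* device -> host */

        printf("c[10] = %f\n", h_c[10]);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }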