You are an expert in the subject "Parallel Computing".
It is a subject in the
7th semester VTU syllabus. I am preparing for an exam on that subject, and I
am in need of some notes. I will give you the questions, and you need to give
me exam-ready answers for each question; each answer must be elaborate,
and it must be easy to understand as well.
1. Define Parallel Computing and differentiate between SIMD and MIMD systems.
2. Explain different types of shared memory architecture.
3. Classify parallel computers based on Flynn’s taxonomy.
4. Explain interconnection networks used in parallel computers.
5. Describe the key features of MIMD systems.
Got it 👍 I’ll prepare exam-ready answers for your 7th semester Parallel
Computing questions. The answers will be elaborate, easy to understand,
and in point form where needed so you can write them directly in the
exam.
1. Define Parallel Computing and differentiate
between SIMD and MIMD systems.
Definition of Parallel Computing:
Parallel computing is a form of computation in which many
calculations or processes are carried out simultaneously.
It uses multiple processing elements (processors/cores) to solve
a problem faster by dividing the work among them.
The main goal is to increase performance, reduce execution time,
and handle large-scale problems efficiently.
Parallel computing is widely used in scientific simulations, data
analysis, machine learning, and real-time systems.
Difference between SIMD and MIMD:
Aspect | SIMD (Single Instruction, Multiple Data) | MIMD (Multiple Instruction, Multiple Data)
Definition | Executes the same instruction on multiple data items simultaneously. | Executes different instructions on different data independently.
Processing | All processors follow the same control unit but operate on different data streams. | Each processor has its own control unit and can run different programs.
Example Systems | Vector processors, GPUs. | Multicore CPUs, distributed systems, clusters.
Applications | Image processing, matrix operations, simulations with repetitive operations. | Databases, web servers, scientific simulations, real-time systems.
Complexity | Simpler design, efficient for regular data operations. | More complex design, supports general-purpose parallelism.
2. Explain different types of shared memory
architecture.
In shared memory architecture, all processors access a common global
memory. The way this memory is organized determines the type of
architecture.
Types:
1. Uniform Memory Access (UMA):
o All processors access memory with equal latency and
bandwidth.
o Memory is shared uniformly.
o Easy to program but does not scale well for large systems.
o Example: Symmetric Multiprocessors (SMP).
2. Non-Uniform Memory Access (NUMA):
o Memory is physically distributed but logically shared.
o Accessing local memory is faster than remote memory.
o Provides better scalability than UMA.
o Example: Modern multiprocessor servers.
3. Cache-Only Memory Architecture (COMA):
o There is no main memory, and each node has a large cache.
o Data migrates dynamically to the cache of the processor that
needs it.
o Efficient for reducing remote memory access but complex to
implement.
3. Classify parallel computers based on Flynn’s
taxonomy.
Flynn’s taxonomy (1972) classifies parallel computers based on instruction
streams and data streams.
1. SISD (Single Instruction, Single Data):
o A single processor executes a single instruction on a single data
stream.
o Traditional sequential computers.
o Example: Old PCs, microprocessors.
2. SIMD (Single Instruction, Multiple Data):
o Multiple processors execute the same instruction on multiple
data simultaneously.
o Used in vector and array processors.
o Example: GPUs, Cray supercomputers.
3. MISD (Multiple Instruction, Single Data):
o Multiple instructions operate on the same data stream.
o Rare in practice, sometimes used in fault-tolerant systems.
4. MIMD (Multiple Instruction, Multiple Data):
o Multiple processors execute different instructions on different
data.
o Most powerful and widely used today.
o Example: Multicore processors, clusters, distributed systems.
4. Explain interconnection networks used in
parallel computers.
Interconnection networks define how processors and memory units are
connected for communication in a parallel system.
Types of Interconnection Networks:
1. Bus-based:
o All processors share a common communication bus.
o Simple and inexpensive but not scalable (bus contention
increases with processors).
2. Crossbar Switch:
o Every processor has a direct connection to every memory
module.
o Provides high performance but very expensive for large systems.
3. Multistage Networks (Indirect):
o Use switches arranged in multiple stages to connect processors
to memory.
o Examples: Omega network, Banyan network.
o Good balance between cost and performance.
4. Mesh and Torus Networks (Direct):
o Processors arranged in a 2D or 3D grid, connected to neighbors.
o Torus adds wraparound links for better connectivity.
o Scalable and widely used in supercomputers.
5. Hypercube:
o Processors connected in an n-dimensional cube structure.
o Each processor connects to log2(N) neighbors (a short code sketch listing a node's neighbors is given after this list).
o High performance and scalability.
6. Tree Networks:
o Processors connected in a hierarchical tree structure.
o Efficient for divide-and-conquer tasks but bottleneck at higher
levels.
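As a concrete illustration of the hypercube structure (point 5 above), the short C sketch below lists a node's log2(N) neighbors by flipping one bit of its binary address at a time. The function name and the sample node number are chosen only for this demo.

#include <stdio.h>

/* Print the neighbors of node `id` in a hypercube of N = 2^dim nodes.
   Two nodes are connected iff their binary addresses differ in exactly
   one bit, so each node has dim = log2(N) neighbors. */
void list_neighbors(unsigned id, unsigned dim) {
    for (unsigned bit = 0; bit < dim; bit++) {
        unsigned neighbor = id ^ (1u << bit);    /* flip one address bit */
        printf("node %u <-> node %u\n", id, neighbor);
    }
}

int main(void) {
    unsigned dim = 3;            /* 3-dimensional hypercube: N = 8 nodes */
    list_neighbors(5, dim);      /* node 5 (binary 101) has 3 neighbors  */
    return 0;
}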
5. Describe the key features of MIMD systems.
MIMD (Multiple Instruction, Multiple Data) systems are the most powerful
class of parallel computers.
Key Features:
1. Multiple Instruction Execution:
o Each processor executes its own instruction stream
independently.
2. Multiple Data Streams:
o Each processor works on different sets of data simultaneously.
3. Asynchronous Operation:
o Processors operate independently without strict synchronization,
allowing flexible execution.
4. Distributed or Shared Memory:
o Can use shared memory (SMP/NUMA) or distributed memory
(clusters, message passing).
5. High Scalability:
o Can be scaled from a few processors to thousands of processors.
6. General-purpose Parallelism:
o Suitable for a wide range of applications (scientific simulations,
databases, AI, servers).
7. Complex Interconnection Networks:
o Uses mesh, hypercube, or multistage networks for efficient
communication.
8. Examples:
o Supercomputers like IBM Blue Gene, multicore CPUs, and cloud
data centers.
✅ These answers are exam-ready, detailed, and structured for 10-mark
questions.
Would you like me to also prepare diagrams (for Flynn’s taxonomy, shared
memory, interconnection networks) so you can include them in your exam
answers for better marks?
6. Explain the characteristics of Vector Processors.
7. Design a simple algorithm for summing an array using shared-memory parallel programming.
8. Define and explain speedup and efficiency in parallel computing.
9. What is thread safety? What is cache coherence? Explain snooping cache coherence.
10. State and explain Amdahl’s Law.
Perfect 👍 Let’s continue with exam-ready answers for your next set of
Parallel Computing questions. I’ll make them clear, elaborate, and easy
to understand, with exam-appropriate depth.
6. Explain the characteristics of Vector Processors.
A vector processor is a CPU design that can execute a single instruction on
a large set of data (vector) at once, rather than processing one element
at a time. It is mainly used in scientific and engineering applications.
Characteristics:
1. Vector Registers:
o Store entire vectors (arrays) instead of individual scalar values.
o Allow simultaneous operations on multiple elements.
2. Vector Instructions:
o Specialized instructions operate on vectors (e.g., vector add,
vector multiply).
o Reduces the number of instruction fetches compared to scalar
processors.
3. Pipelining:
o Uses deeply pipelined functional units.
o Once the pipeline is filled, it can produce one result per clock
cycle.
4. High Throughput:
o Capable of handling multiple data elements simultaneously,
leading to very high performance in data-parallel tasks.
5. Memory Access Patterns:
o Supports strided memory access (fetch elements with regular
gaps).
o Reduces memory latency by prefetching data.
6. Applications:
o Scientific simulations, weather forecasting, graphics rendering,
matrix computations.
7. Examples:
o Cray-1 (classic vector supercomputer), modern GPUs implement
vector-like processing.
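To relate these characteristics to code: an element-wise loop like the sketch below is exactly the kind of operation a vector unit executes as a few vector instructions instead of n scalar ones. This is a minimal C illustration; the #pragma omp simd hint is just one common way to request compiler vectorization and is not specific to any particular vector processor.

#include <stddef.h>

/* Element-wise vector addition: c[i] = a[i] + b[i].
   On a vector processor (or a CPU/GPU with SIMD units), many iterations
   of this loop are carried out by a single vector instruction instead of
   n separate scalar adds. */
void vector_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}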
7. Design a simple algorithm for summing an array
using shared-memory parallel programming.
We want to compute the sum of all elements in an array using multiple
processors that share memory.
Algorithm: (Pseudocode)
Input: Array A[1...n], number of processors p
Output: Sum of all elements in A

1. Divide the array A into p chunks (nearly equal parts).
2. Each processor Pi computes the partial sum of its chunk in parallel:
       partial_sum[i] = sum(A[start_i : end_i])
3. Synchronize all processors (barrier).
4. A single processor (say P0) collects all partial sums:
       total_sum = 0
       for i = 1 to p:
           total_sum = total_sum + partial_sum[i]
5. Return total_sum
Explanation:
Step 2 is executed in parallel using shared-memory threads.
Synchronization ensures that all threads finish their work before
combining results.
This reduces execution time compared to sequential summation.
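A minimal shared-memory version of this algorithm, sketched in C with OpenMP (the reduction clause implements steps 2–4; the array size and contents are made up for the demo):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double A[N];
    for (int i = 0; i < N; i++) A[i] = 1.0;      /* demo data */

    double total_sum = 0.0;

    /* Each thread sums its own chunk of A into a private partial sum;
       the reduction clause combines the partial sums at the implicit
       barrier at the end of the loop (steps 3-4 of the algorithm). */
    #pragma omp parallel for reduction(+ : total_sum)
    for (int i = 0; i < N; i++) {
        total_sum += A[i];
    }

    printf("Sum = %f\n", total_sum);
    return 0;
}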
8. Define and explain speedup and efficiency in
parallel computing.
Speedup (S):
Speedup measures how much faster a parallel algorithm is compared
to a sequential one.
Formula: S = Ts / Tp
where
Ts = execution time of the sequential program,
Tp = execution time of the parallel program.
Ideal speedup = number of processors (linear speedup).
In practice, speedup is limited by communication, synchronization, and
sequential portions.
Efficiency (E):
Efficiency measures how well the processors are being utilized in
parallel execution.
Formula: E = S / P = Ts / (P × Tp)
where P = number of processors.
Efficiency ranges from 0 to 1 (or 0% to 100%).
High efficiency means processors are not idle and overhead is minimal.
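As a quick worked illustration (the timings below are made-up example values), the two formulas translate directly into code:

#include <stdio.h>

int main(void) {
    double Ts = 80.0;   /* sequential time in seconds (example value) */
    double Tp = 12.5;   /* parallel time in seconds (example value)   */
    int    P  = 8;      /* number of processors                       */

    double S = Ts / Tp;   /* speedup    S = Ts / Tp */
    double E = S / P;     /* efficiency E = S / P   */

    printf("Speedup    = %.2f\n", S);   /* 6.40                      */
    printf("Efficiency = %.2f\n", E);   /* 0.80, i.e. 80% utilised   */
    return 0;
}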
9. What is thread safety? What is cache coherence?
Explain snooping cache coherence.
Thread Safety:
A program or function is thread-safe if it can be safely executed by
multiple threads at the same time without causing data corruption or
unexpected results.
Achieved using synchronization mechanisms like locks,
semaphores, or atomic operations.
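A minimal C/pthreads sketch of the idea: without the mutex, the shared counter update would be a race condition; with it, the update is thread-safe. The counter and loop counts are illustrative only.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                                  /* shared data      */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread may enter */
        counter++;                    /* critical section          */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* always 400000 with the lock */
    return 0;
}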
Cache Coherence:
In multiprocessor systems, each processor may have its own cache.
When multiple processors cache the same memory location, updates
made by one processor must be visible to others.
Cache coherence ensures consistency of shared data across different
caches.
Snooping Cache Coherence:
In snooping protocols, all caches monitor (or “snoop” on) a common
bus to keep track of memory operations.
If one cache modifies a data block, others invalidate or update their
copies.
Example: MESI protocol (Modified, Exclusive, Shared, Invalid).
Steps in Snooping:
1. Processor issues a read/write request.
2. All caches listen (snoop) on the bus.
3. If another cache has a copy, it responds by invalidating/updating it.
4. Ensures all caches have a coherent view of memory.
10. State and explain Amdahl’s Law.
Statement:
Amdahl’s Law states that the potential speedup of a program using
multiple processors is limited by the sequential portion of the
program.
Formula: S(P) = 1 / (f + (1 − f)/P)
where:
S(P) = speedup with P processors,
f = fraction of the program that is sequential (cannot be parallelized),
1 − f = fraction that can be parallelized.
Explanation:
If 20% of the program is sequential (f = 0.2), then no matter how many
processors we use, the speedup cannot exceed 1/f = 1/0.2 = 5.
This shows the importance of minimizing sequential portions to gain
better performance in parallel computing.
Implications:
1. Parallelism gives diminishing returns when sequential parts dominate.
2. To improve speedup, reduce f by optimizing sequential parts.
3. Useful for predicting scalability of parallel programs.
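A small C sketch that tabulates the law for the 20% sequential example above; it shows the speedup flattening toward the 1/f = 5 limit as P grows (the processor counts are arbitrary sample values):

#include <stdio.h>

/* Amdahl's Law: speedup with P processors when a fraction f is sequential. */
double amdahl(double f, double P) {
    return 1.0 / (f + (1.0 - f) / P);
}

int main(void) {
    double f = 0.2;   /* 20% of the program is sequential */
    int procs[] = {1, 2, 4, 8, 16, 64, 1024};

    for (int i = 0; i < 7; i++) {
        printf("P = %4d  ->  speedup = %.2f\n", procs[i], amdahl(f, procs[i]));
    }
    /* The printed speedup approaches 5.00 but never reaches it, since 1/f = 5. */
    return 0;
}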
✅ These are well-elaborated exam-ready answers.
Would you like me to also add small diagrams for Amdahl’s Law
(speedup curve) and cache coherence (snooping mechanism)? They
will fetch you extra marks in the exam.
11. Explain the scalability in MIMD systems.
12. Explain the distributed memory programming model.
13. Differentiate between CPU and GPU architectures.
14. Define GPU programming. What makes GPUs suitable for parallel tasks?
15. Explain the following: i) Dynamic and Static Threads ii) Nondeterminism
Alright 👍 Let’s prepare exam-ready notes for your next Parallel Computing
questions. I’ll keep them elaborate, structured, and easy to write in
exams, with clear points and explanations.
11. Explain the scalability in MIMD systems.
Scalability in MIMD (Multiple Instruction, Multiple Data) systems refers to
how well the system’s performance improves as the number of processors
increases.
Key Points:
1. Definition:
o A scalable system continues to deliver near-linear performance
improvements as more processors are added.
2. Factors Affecting Scalability:
o Communication Overhead: More processors lead to increased
data exchange and synchronization costs.
o Memory Bottleneck: Shared memory access can cause
contention, reducing scalability.
o Load Balancing: Unequal distribution of work among
processors reduces overall efficiency.
o Interconnection Network: The topology and bandwidth of the
network play a major role in scalability.
3. Types of Scalability:
o Strong Scalability: Performance improves when the problem
size is fixed but processors increase.
o Weak Scalability: Performance improves when both the
problem size and processors increase proportionally.
4. Improving Scalability:
o Efficient parallel algorithms.
o Reducing sequential portions (Amdahl’s Law).
o Using high-speed interconnection networks.
o Optimizing memory hierarchy (NUMA/COMA).
✅ Conclusion: Scalability is a critical measure of the usefulness of MIMD
systems in solving large-scale scientific and commercial problems.
12. Explain the distributed memory programming
model.
In distributed memory systems, each processor has its own private
memory. Communication between processors occurs by passing messages.
Features:
1. Local Memory Access:
o Each processor directly accesses only its own memory.
o Remote data requires explicit communication.
2. Message Passing:
o Data exchange happens using message-passing libraries (e.g.,
MPI – Message Passing Interface).
3. Scalability:
o Highly scalable since adding processors also increases memory
capacity.
o Avoids memory contention of shared memory systems.
4. Programming Model:
o Programs are divided into tasks (processes).
o Each process runs independently and exchanges data via
messages.
o Example: MPI_Send(), MPI_Recv().
5. Applications:
o Used in supercomputers, clusters, and large-scale simulations.
✅ Conclusion: Distributed memory programming is suitable for large-scale
parallelism but requires careful management of communication to achieve
good performance.
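A minimal C/MPI sketch of the model: each process owns its data, and the only way to share it is an explicit MPI_Send()/MPI_Recv() pair (two processes assumed; the value sent is arbitrary). Compile with mpicc and run with, e.g., mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* data in process 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 cannot read process 0's memory directly; it must receive a message. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}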
13. Differentiate between CPU and GPU
architectures.
Aspect | CPU (Central Processing Unit) | GPU (Graphics Processing Unit)
Design Goal | Optimized for low-latency execution of sequential tasks. | Optimized for high-throughput parallel execution of many tasks.
Cores | Few powerful, complex cores. | Thousands of smaller, simpler cores.
Control Logic | Large portion devoted to control and caching. | Small portion for control, more for computation.
Memory Hierarchy | Large caches, optimized for irregular memory access. | High-bandwidth memory, optimized for regular, data-parallel access.
Best Suited For | Branch-heavy, sequential tasks (OS, databases). | Data-parallel tasks (graphics, ML, scientific simulations).
Parallelism | Exploits limited parallelism (thread-level). | Exploits massive parallelism (data-level, SIMD).
✅ Conclusion: CPUs are general-purpose processors, while GPUs are
specialized for tasks requiring massive parallelism.
14. Define GPU programming. What makes GPUs
suitable for parallel tasks?
Definition:
GPU Programming refers to writing code that executes on Graphics
Processing Units (GPUs) to exploit their massive parallelism.
Frameworks like CUDA (NVIDIA) and OpenCL are used for GPU
programming.
Why GPUs are suitable for parallel tasks:
1. Massive Parallelism:
o Thousands of cores execute many threads simultaneously.
2. SIMD Execution:
o Efficiently applies the same operation to large data sets (e.g.,
matrix operations, deep learning).
3. High Memory Bandwidth:
o GPUs are designed with fast memory to handle large datasets.
4. Throughput-Oriented:
o Focused on maximizing the number of operations per second
rather than minimizing latency.
5. Specialized Hardware:
o Streaming Multiprocessors (SMs) handle multiple threads in
parallel.
6. Applications:
o Machine Learning, Deep Learning, Image Processing,
Cryptography, Scientific Computing.
✅ Conclusion: GPUs are ideal for highly parallel tasks where the same
computation must be repeated across large datasets.
15. Explain the following:
i) Dynamic and Static Threads
Static Threads:
o Threads are created once and remain fixed throughout the
program.
o Work is divided at compile-time or initialization.
o Easier to manage but less flexible.
o Example: OpenMP with fixed number of threads.
Dynamic Threads:
o Threads are created and destroyed during program execution.
o Work assignment changes at runtime depending on load.
o Provides better load balancing but adds management
overhead.
o Example: Thread pools in Java/Python.
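A rough C/pthreads sketch of the two styles (illustrative only, not a standard pattern from any library): the static part creates a fixed team of workers once and divides the tasks up front, while the dynamic part spawns a thread per task as work appears.

#include <pthread.h>
#include <stdio.h>

#define NUM_STATIC_WORKERS 4
#define NUM_TASKS 8

/* One unit of work; the body is just a print for the demo. */
void *do_task(void *arg) {
    long id = (long)arg;
    printf("task %ld done\n", id);
    return NULL;
}

/* Static style: a fixed team is created once and the work split is decided
   up front (worker w handles tasks w, w + 4, w + 8, ...). */
void *static_worker(void *arg) {
    long w = (long)arg;
    for (long t = w; t < NUM_TASKS; t += NUM_STATIC_WORKERS)
        do_task((void *)t);
    return NULL;
}

int main(void) {
    /* --- Static threads: created once, live for the whole computation. --- */
    pthread_t team[NUM_STATIC_WORKERS];
    for (long w = 0; w < NUM_STATIC_WORKERS; w++)
        pthread_create(&team[w], NULL, static_worker, (void *)w);
    for (int w = 0; w < NUM_STATIC_WORKERS; w++)
        pthread_join(team[w], NULL);

    /* --- Dynamic threads: a new thread is created per task at runtime and
           destroyed when the task finishes (flexible, but more overhead). --- */
    for (long t = 0; t < NUM_TASKS; t++) {
        pthread_t worker;
        pthread_create(&worker, NULL, do_task, (void *)t);
        pthread_join(worker, NULL);   /* joined immediately to keep the demo short */
    }
    return 0;
}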
ii) Nondeterminism
Definition:
o Nondeterminism in parallel programming occurs when a program
produces different outputs for the same input due to variations in
execution order of threads.
Causes:
o Race conditions (two threads accessing shared data without
synchronization).
o Uncontrolled scheduling by the operating system.
Implications:
o Makes debugging difficult.
o Can lead to incorrect or inconsistent results.
Solution:
o Use synchronization mechanisms (locks, semaphores, atomic
operations).
o Avoid unnecessary shared state.
✅ Conclusion: Understanding thread management and avoiding
nondeterminism is critical for writing correct parallel programs.
✨ These answers are detailed, structured, and exam-ready.
Would you like me to also make a comparison table of shared vs
distributed memory models (Q12 extension)? It often comes as a follow-
up or an internal exam question.
16. Explain the role of synchronization in coordinating processes and threads.
17. Differentiate between parallel hardware and parallel software with suitable examples.
18. Describe MIMD systems with suitable examples.
19. Explain how GPUs are used in MIMD systems to accelerate computations.
20. Explain Amdahl’s law with an example showing its limitations in GPU systems.
Great 👍 Let’s continue with exam-ready answers for your Parallel
Computing questions (Q16–Q20). I’ll explain in a way that’s clear, elaborate,
and easy to write in exams.
16. Explain the role of synchronization in
coordinating processes and threads.
Definition:
Synchronization is the mechanism used to coordinate concurrent
processes or threads so they can safely access shared resources
(like memory, files, or variables) without conflict.
Role of Synchronization:
1. Mutual Exclusion:
o Ensures that only one process/thread accesses a shared resource
at a time.
o Prevents race conditions.
o Implemented using locks, semaphores, or monitors.
2. Order of Execution:
o Maintains correct execution order where processes depend on
each other’s results.
o Example: A producer thread must add items before a consumer
thread removes them.
3. Data Consistency:
o Prevents simultaneous updates to shared variables, ensuring
data correctness.
4. Avoiding Deadlocks and Starvation:
o Proper synchronization ensures processes don’t wait forever for
resources.
5. Coordination:
o Synchronization allows processes/threads to wait for events
(barrier synchronization).
o Example: All threads must finish partial sums before computing
the final sum.
✅ Conclusion: Without synchronization, parallel programs may lead to
unpredictable results, data corruption, and performance issues.
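A small C/pthreads sketch of barrier coordination as in point 5 above: each thread writes its partial sum, the barrier guarantees all writes have finished, and only then does one thread combine them. (pthread_barrier_t is POSIX but not available on every platform; the data values are made up.)

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

int A[N];
double partial[NTHREADS];
pthread_barrier_t barrier;

void *worker(void *arg) {
    long id = (long)arg;
    int chunk = N / NTHREADS;

    /* Phase 1: each thread sums its own disjoint chunk (no conflicts). */
    double s = 0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++) s += A[i];
    partial[id] = s;

    /* Barrier: no thread proceeds until every partial sum is written. */
    pthread_barrier_wait(&barrier);

    /* Phase 2: one designated thread combines the results. */
    if (id == 0) {
        double total = 0;
        for (int t = 0; t < NTHREADS; t++) total += partial[t];
        printf("total = %f\n", total);
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;   /* demo data */

    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}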
17. Differentiate between parallel hardware and
parallel software with suitable examples.
Aspect | Parallel Hardware | Parallel Software
Definition | Physical infrastructure enabling multiple computations at the same time. | Algorithms, programs, and libraries designed to exploit parallel hardware.
Components | Multicore CPUs, GPUs, clusters, interconnection networks. | Parallel programming models, compilers, libraries, and applications.
Focus | Providing resources for execution. | Efficiently using those resources.
Example | A multicore Intel i7 CPU with 8 cores. | OpenMP code that divides tasks across the 8 cores.
Dependency | Exists independently of software. | Requires hardware to execute.
Objective | Increase raw computational power. | Translate computational tasks into parallel workloads.
✅ Conclusion: Parallel hardware provides the foundation, and parallel
software ensures that the hardware is used effectively.
18. Describe MIMD systems with suitable
examples.
Definition:
MIMD (Multiple Instruction, Multiple Data) systems allow multiple
processors to execute different instructions on different data sets
independently.
Characteristics:
1. Each processor has its own control unit.
2. Supports asynchronous execution (processors may not be in
lockstep).
3. Can use shared memory (SMP, NUMA) or distributed memory
(clusters).
4. Highly scalable and supports general-purpose computing.
Examples:
Shared Memory MIMD:
o Symmetric Multiprocessors (SMP) like modern multicore
Intel/AMD CPUs.
Distributed Memory MIMD:
o Cluster systems like Beowulf clusters.
Supercomputers:
o IBM Blue Gene, Cray XT5.
Cloud Systems:
o Amazon AWS clusters running independent tasks.
✅ Conclusion: MIMD systems dominate modern computing because they
can execute diverse workloads efficiently.
19. Explain how GPUs are used in MIMD systems to
accelerate computations.
Role of GPUs in MIMD Systems:
1. Hybrid Architecture:
o MIMD systems typically consist of multicore CPUs (MIMD
processors).
o GPUs are attached as accelerators to offload massively parallel
tasks.
2. Execution Model:
o CPU executes control-heavy tasks (branching, sequential
parts).
o GPU executes data-parallel tasks (matrix multiplication, deep
learning).
3. Communication:
o Data is transferred between CPU and GPU via high-speed
interconnects (PCIe, NVLink).
4. Acceleration Examples:
o Machine Learning → GPUs speed up training by handling billions
of matrix operations.
o Physics Simulations → GPUs handle vector/matrix computations,
while CPUs coordinate tasks.
o Video Rendering → GPUs parallelize pixel rendering across
thousands of cores.
5. MIMD + GPU Parallelism:
o CPU cores (MIMD) run different instructions.
o Each GPU core executes SIMD-style instructions.
o Together, they form a heterogeneous parallel system.
✅ Conclusion: GPUs extend the power of MIMD systems by accelerating
compute-intensive, data-parallel workloads.
20. Explain Amdahl’s Law with an example showing
its limitations in GPU systems.
Amdahl’s Law (Recap):
S(P) = 1 / (f + (1 − f)/P)
f: sequential fraction of the program.
1 − f: parallel fraction of the program.
P: number of processors (or GPU cores).
Example:
Suppose a program is 90% parallelizable (f = 0.1).
With P = 1000 GPU cores:
S(1000) = 1 / (0.1 + 0.9/1000) = 1 / 0.1009 ≈ 9.91
➡️ Even with 1000 GPU cores, speedup is only about 10× because of the
sequential bottleneck.
Limitations in GPU Systems:
1. Sequential Overhead:
o Parts like memory transfer (CPU ↔ GPU) remain sequential,
reducing effective speedup.
2. Diminishing Returns:
o Increasing GPU cores does not help if sequential fraction f is
large.
3. Synchronization Costs:
o GPUs require thread synchronization, which adds extra overhead.
4. Amdahl’s Assumption:
o It assumes fixed problem size, but in practice, larger datasets
can scale better (addressed by Gustafson’s Law).
✅ Conclusion: Amdahl’s Law highlights that simply adding more GPU cores
cannot guarantee proportional speedup, as sequential bottlenecks and
overhead limit performance.
✨ These are well-structured, exam-ready answers with clear points and
examples.
Would you like me to also prepare a short note on Gustafson’s Law
(often asked as a follow-up to Amdahl’s Law in VTU exams)?
21. Write a short note on GPU performance metrics. And evaluate the performance difference between GPU-based and CPU-based matrix multiplication.
22. Explain how to measure the execution time of MIMD programs with examples.
Perfect 👍 Let’s tackle your next Parallel Computing questions (Q21–Q22). I’ll
make the answers exam-ready, clear, and elaborate, with both theory
and practical examples.
21. Write a short note on GPU performance
metrics. And evaluate the performance difference
between GPU-based and CPU-based matrix
multiplication.
GPU Performance Metrics:
To evaluate how well a GPU executes parallel programs, the following
metrics are commonly used:
1. Throughput (FLOPS):
o Number of floating-point operations per second.
o High FLOPS indicates better ability to handle large-scale
computations (TFLOPS for modern GPUs).
2. Memory Bandwidth:
o Rate at which data is transferred between GPU memory and
cores.
o GPUs often achieve hundreds of GB/s, essential for data-
parallel tasks.
3. Occupancy:
o Ratio of active warps (groups of threads) to maximum possible
warps on a Streaming Multiprocessor (SM).
o Higher occupancy leads to better resource utilization.
4. Latency Hiding:
o Ability of GPU to overlap memory operations with computation
by switching between threads.
o Important for performance in memory-bound applications.
5. Power Efficiency:
o Performance per watt of energy consumed.
o Crucial for large GPU clusters (data centers, HPC).
6. Scalability:
o Ability to scale performance with more cores or multiple GPUs.
Evaluating Matrix Multiplication (CPU vs GPU):
Case: Multiplying two matrices (size 1000 × 1000).
1. CPU-based Execution:
o CPUs have a few powerful cores.
o Execution is sequential or lightly parallelized using threads.
o Execution time is relatively high due to limited cores and cache-
based memory access.
2. GPU-based Execution:
o GPUs have thousands of cores optimized for SIMD operations.
o Each thread computes one element (or partial sum) of the output
matrix in parallel.
o Memory bandwidth is high, allowing large data transfers
efficiently.
o Execution is tens to hundreds of times faster than CPU.
Example Results (general trend):
CPU (8 cores): ~5–10 GFLOPS.
GPU (NVIDIA RTX 3090): >15 TFLOPS.
GPU speedup: two to three orders of magnitude in peak throughput; measured gains for large matrices are typically lower once data transfer and CPU-side optimization are accounted for.
✅ Conclusion: GPUs outperform CPUs for matrix multiplication due to
massive parallelism, higher memory bandwidth, and SIMD execution, making
them ideal for scientific computing and machine learning tasks.
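For the CPU side of such an evaluation, a rough OpenMP timing harness like the sketch below measures wall-clock time for a naive triple-loop multiplication and reports GFLOPS; the GPU counterpart would normally be written with CUDA or a library such as cuBLAS and is not shown here. The matrix size and initial values are arbitrary, and absolute numbers depend entirely on hardware and compiler flags.

#include <stdio.h>
#include <omp.h>

#define N 512   /* matrix dimension (kept small so the naive loop finishes quickly) */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
        }

    double start = omp_get_wtime();

    /* Naive parallel matrix multiplication: rows of C are divided among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++) sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    double elapsed = omp_get_wtime() - start;
    double flops = 2.0 * N * N * N;   /* one multiply + one add per inner iteration */
    printf("time   = %.3f s\n", elapsed);
    printf("GFLOPS = %.2f\n", flops / elapsed / 1e9);
    return 0;
}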
22. Explain how to measure the execution time of
MIMD programs with examples.
Why Execution Time Measurement is Needed:
In MIMD systems, multiple processors execute independently.
Measuring execution time helps evaluate performance, speedup, and
efficiency.
Methods of Measuring Execution Time:
1. Wall-Clock Time:
o The actual elapsed time from start to end of the program.
o Includes computation, communication, and synchronization
overhead.
o Measured using functions like time() in C, or chrono in C++.
double start = MPI_Wtime(); // start timer
parallel_function();
double end = MPI_Wtime(); // end timer
double elapsed = end - start;
2. CPU Time (Per Processor):
o Measures the time a processor spends on computation,
excluding idle or waiting time.
o Example in C: clock() function.
3. Parallel Execution Time (Tp):
o The maximum execution time across all processors.
o Since tasks may finish at different times, total execution time is
determined by the slowest processor.
Tp = max(T1, T2, …, TP)
4. Profiling Tools:
o Tools like gprof, MPI Profiler, or NVIDIA Nsight analyze
computation vs communication time.
Example – Array Summation with MPI (MIMD):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    double start, end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    start = MPI_Wtime();              // Start timing

    // Parallel computation (each processor sums part of array)
    // ...

    MPI_Barrier(MPI_COMM_WORLD);      // Synchronize all processes
    end = MPI_Wtime();                // End timing

    if (rank == 0) {
        printf("Execution Time: %f seconds\n", end - start);
    }

    MPI_Finalize();
    return 0;
}
Here, MPI_Wtime() gives wall-clock time for parallel execution.
Barrier ensures all processes finish before measuring total time.
✅ Conclusion:
Execution time in MIMD programs can be measured using wall-clock
timers, CPU timers, and profiling tools. The total parallel execution time
is determined by the slowest processor, including communication and
synchronization delays.
✨ These are detailed, exam-ready answers.
Would you like me to also add a table comparing GPU vs CPU
performance metrics (like FLOPS, bandwidth, cores) so you can use it as a
strong point in Q21?