You are an expert in the subject "Parallel Computing".
It is a subject in the
7th semester VTU syllabus. I am preparing for an exam on that subject, and I
am in need of some notes. I will give you the questions, and you need to give
me exam-ready answers for each question; each answer must be elaborate,
and it must be easy to understand as well.
1. Define Parallel Computing and differentiate between SIMD and MIMD systems.
2. Explain different types of shared memory architecture.
3. Classify parallel computers based on Flynn’s taxonomy.
4. Explain interconnection networks used in parallel computers.
5. Describe the key features of MIMD systems.
Got it 👍 I’ll prepare exam-ready answers for your 7th semester Parallel
Computing questions. The answers will be elaborate, easy to understand,
and in point form where needed so you can write them directly in the
exam.
1. Define Parallel Computing and differentiate
between SIMD and MIMD systems.
Definition of Parallel Computing:
Parallel computing is a form of computation in which many
calculations or processes are carried out simultaneously.
It uses multiple processing elements (processors/cores) to solve
a problem faster by dividing the work among them.
The main goal is to increase performance, reduce execution time,
and handle large-scale problems efficiently.
Parallel computing is widely used in scientific simulations, data
analysis, machine learning, and real-time systems.
Difference between SIMD and MIMD:
Aspect | SIMD (Single Instruction, Multiple Data) | MIMD (Multiple Instruction, Multiple Data)
Definition | Executes the same instruction on multiple data items simultaneously. | Executes different instructions on different data independently.
Processing | All processors follow the same control unit but operate on different data streams. | Each processor has its own control unit and can run different programs.
Example Systems | Vector processors, GPUs. | Multicore CPUs, distributed systems, clusters.
Applications | Image processing, matrix operations, simulations with repetitive operations. | Databases, web servers, scientific simulations, real-time systems.
Complexity | Simpler design, efficient for regular data operations. | More complex design, supports general-purpose parallelism.
2. Explain different types of shared memory
architecture.
In shared memory architecture, all processors access a common global
memory. The way this memory is organized determines the type of
architecture.
Types:
1. Uniform Memory Access (UMA):
o All processors access memory with equal latency and
bandwidth.
o Memory is shared uniformly.
o Easy to program but does not scale well for large systems.
o Example: Symmetric Multiprocessors (SMP).
2. Non-Uniform Memory Access (NUMA):
o Memory is physically distributed but logically shared.
o Accessing local memory is faster than remote memory.
o Provides better scalability than UMA.
o Example: Modern multiprocessor servers.
3. Cache-Only Memory Architecture (COMA):
o There is no main memory, and each node has a large cache.
o Data migrates dynamically to the cache of the processor that
needs it.
o Efficient for reducing remote memory access but complex to
implement.
3. Classify parallel computers based on Flynn’s
taxonomy.
Flynn’s taxonomy (1972) classifies parallel computers based on instruction
streams and data streams.
1. SISD (Single Instruction, Single Data):
o A single processor executes a single instruction on a single data
stream.
o Traditional sequential computers.
o Example: Old PCs, microprocessors.
2. SIMD (Single Instruction, Multiple Data):
o Multiple processors execute the same instruction on multiple
data simultaneously.
o Used in vector and array processors.
o Example: GPUs, Cray supercomputers.
3. MISD (Multiple Instruction, Single Data):
o Multiple instructions operate on the same data stream.
o Rare in practice, sometimes used in fault-tolerant systems.
4. MIMD (Multiple Instruction, Multiple Data):
o Multiple processors execute different instructions on different
data.
o Most powerful and widely used today.
o Example: Multicore processors, clusters, distributed systems.
4. Explain interconnection networks used in
parallel computers.
Interconnection networks define how processors and memory units are
connected for communication in a parallel system.
Types of Interconnection Networks:
1. Bus-based:
o All processors share a common communication bus.
o Simple and inexpensive but not scalable (bus contention
increases with processors).
2. Crossbar Switch:
o Every processor has a direct connection to every memory
module.
o Provides high performance but very expensive for large systems.
3. Multistage Networks (Indirect):
o Use switches arranged in multiple stages to connect processors
to memory.
o Examples: Omega network, Banyan network.
o Good balance between cost and performance.
4. Mesh and Torus Networks (Direct):
o Processors arranged in a 2D or 3D grid, connected to neighbors.
o Torus adds wraparound links for better connectivity.
o Scalable and widely used in supercomputers.
5. Hypercube:
o Processors connected in an n-dimensional cube structure.
o Each processor connects to log2(N) neighbors (a short code sketch listing a node's neighbors is given after this list).
o High performance and scalability.
6. Tree Networks:
o Processors connected in a hierarchical tree structure.
o Efficient for divide-and-conquer tasks but bottleneck at higher
levels.
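As a concrete illustration of the hypercube structure (point 5 above), the short C sketch below lists a node's log2(N) neighbors by flipping one bit of its binary address at a time. The function name and the sample node number are chosen only for this demo.

#include <stdio.h>

/* Print the neighbors of node `id` in a hypercube of N = 2^dim nodes.
   Two nodes are connected iff their binary addresses differ in exactly
   one bit, so each node has dim = log2(N) neighbors. */
void list_neighbors(unsigned id, unsigned dim) {
    for (unsigned bit = 0; bit < dim; bit++) {
        unsigned neighbor = id ^ (1u << bit);    /* flip one address bit */
        printf("node %u <-> node %u\n", id, neighbor);
    }
}

int main(void) {
    unsigned dim = 3;            /* 3-dimensional hypercube: N = 8 nodes */
    list_neighbors(5, dim);      /* node 5 (binary 101) has 3 neighbors  */
    return 0;
}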
5. Describe the key features of MIMD systems.
MIMD (Multiple Instruction, Multiple Data) systems are the most powerful
class of parallel computers.
Key Features:
1. Multiple Instruction Execution:
o Each processor executes its own instruction stream
independently.
2. Multiple Data Streams:
o Each processor works on different sets of data simultaneously.
3. Asynchronous Operation:
o Processors operate independently without strict synchronization,
allowing flexible execution.
4. Distributed or Shared Memory:
o Can use shared memory (SMP/NUMA) or distributed memory
(clusters, message passing).
5. High Scalability:
o Can be scaled from a few processors to thousands of processors.
6. General-purpose Parallelism:
o Suitable for a wide range of applications (scientific simulations,
databases, AI, servers).
7. Complex Interconnection Networks:
o Uses mesh, hypercube, or multistage networks for efficient
communication.
8. Examples:
o Supercomputers like IBM Blue Gene, multicore CPUs, and cloud
data centers.
✅ These answers are exam-ready, detailed, and structured for 10-mark
questions.
Would you like me to also prepare diagrams (for Flynn’s taxonomy, shared
memory, interconnection networks) so you can include them in your exam
answers for better marks?
6. Explain the characteristics of Vector Processors.
7. Design a simple algorithm for summing an array using shared-memory parallel programming.
8. Define and explain speedup and efficiency in parallel computing.
9. What is thread safety? What is cache coherence? Explain snooping cache coherence.
10. State and explain Amdahl’s Law.
Perfect 👍 Let’s continue with exam-ready answers for your next set of
Parallel Computing questions. I’ll make them clear, elaborate, and easy
to understand, with exam-appropriate depth.
6. Explain the characteristics of Vector Processors.
A vector processor is a CPU design that can execute a single instruction on
a large set of data (vector) at once, rather than processing one element
at a time. It is mainly used in scientific and engineering applications.
Characteristics:
1. Vector Registers:
o Store entire vectors (arrays) instead of individual scalar values.
o Allow simultaneous operations on multiple elements.
2. Vector Instructions:
o Specialized instructions operate on vectors (e.g., vector add,
vector multiply).
o Reduces the number of instruction fetches compared to scalar
processors.
3. Pipelining:
o Uses deeply pipelined functional units.
o Once the pipeline is filled, it can produce one result per clock
cycle.
4. High Throughput:
o Capable of handling multiple data elements simultaneously,
leading to very high performance in data-parallel tasks.
5. Memory Access Patterns:
o Supports strided memory access (fetch elements with regular
gaps).
o Reduces memory latency by prefetching data.
6. Applications:
o Scientific simulations, weather forecasting, graphics rendering,
matrix computations.
7. Examples:
o Cray-1 (classic vector supercomputer), modern GPUs implement
vector-like processing.
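To relate these characteristics to code: an element-wise loop like the sketch below is exactly the kind of operation a vector unit executes as a few vector instructions instead of n scalar ones. This is a minimal C illustration; the #pragma omp simd hint is just one common way to request compiler vectorization and is not specific to any particular vector processor.

#include <stddef.h>

/* Element-wise vector addition: c[i] = a[i] + b[i].
   On a vector processor (or a CPU/GPU with SIMD units), many iterations
   of this loop are carried out by a single vector instruction instead of
   n separate scalar adds. */
void vector_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}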
7. Design a simple algorithm for summing an array
using shared-memory parallel programming.
We want to compute the sum of all elements in an array using multiple
processors that share memory.
Algorithm: (Pseudocode)
Input: Array A[1...n], number of processors p
Output: Sum of all elements in A

1. Divide the array A into p chunks (nearly equal parts).
2. Each processor Pi computes the partial sum of its chunk in parallel:
       partial_sum[i] = sum(A[start_i : end_i])
3. Synchronize all processors (barrier).
4. A single processor (say P0) collects all partial sums:
       total_sum = 0
       for i = 1 to p:
           total_sum = total_sum + partial_sum[i]
5. Return total_sum
Explanation:
Step 2 is executed in parallel using shared-memory threads.
Synchronization ensures that all threads finish their work before
combining results.
This reduces execution time compared to sequential summation.
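A minimal shared-memory version of this algorithm, sketched in C with OpenMP (the reduction clause implements steps 2–4; the array size and contents are made up for the demo):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double A[N];
    for (int i = 0; i < N; i++) A[i] = 1.0;      /* demo data */

    double total_sum = 0.0;

    /* Each thread sums its own chunk of A into a private partial sum;
       the reduction clause combines the partial sums at the implicit
       barrier at the end of the loop (steps 3-4 of the algorithm). */
    #pragma omp parallel for reduction(+ : total_sum)
    for (int i = 0; i < N; i++) {
        total_sum += A[i];
    }

    printf("Sum = %f\n", total_sum);
    return 0;
}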
8. Define and explain speedup and efficiency in
parallel computing.
Speedup (S):
Speedup measures how much faster a parallel algorithm is compared
to a sequential one.
Formula: S = Ts / Tp
where
Ts = execution time of the sequential program,
Tp = execution time of the parallel program.
Ideal speedup = number of processors (linear speedup).
In practice, speedup is limited by communication, synchronization, and
sequential portions.
Efficiency (E):
Efficiency measures how well the processors are being utilized in
parallel execution.
Formula: E = S / P = Ts / (P × Tp)
where P = number of processors.
Efficiency ranges from 0 to 1 (or 0% to 100%).
High efficiency means processors are not idle and overhead is minimal.
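As a quick worked illustration (the timings below are made-up example values), the two formulas translate directly into code:

#include <stdio.h>

int main(void) {
    double Ts = 80.0;   /* sequential time in seconds (example value) */
    double Tp = 12.5;   /* parallel time in seconds (example value)   */
    int    P  = 8;      /* number of processors                       */

    double S = Ts / Tp;   /* speedup    S = Ts / Tp */
    double E = S / P;     /* efficiency E = S / P   */

    printf("Speedup    = %.2f\n", S);   /* 6.40                      */
    printf("Efficiency = %.2f\n", E);   /* 0.80, i.e. 80% utilised   */
    return 0;
}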
9. What is thread safety? What is cache coherence?
Explain snooping cache coherence.
Thread Safety:
A program or function is thread-safe if it can be safely executed by
multiple threads at the same time without causing data corruption or
unexpected results.
Achieved using synchronization mechanisms like locks,
semaphores, or atomic operations.
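A minimal C/pthreads sketch of the idea: without the mutex, the shared counter update would be a race condition; with it, the update is thread-safe. The counter and loop counts are illustrative only.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                                  /* shared data      */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread may enter */
        counter++;                    /* critical section          */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* always 400000 with the lock */
    return 0;
}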
Cache Coherence:
In multiprocessor systems, each processor may have its own cache.
When multiple processors cache the same memory location, updates
made by one processor must be visible to others.
Cache coherence ensures consistency of shared data across different
caches.
Snooping Cache Coherence:
In snooping protocols, all caches monitor (or “snoop” on) a common
bus to keep track of memory operations.
If one cache modifies a data block, others invalidate or update their
copies.
Example: MESI protocol (Modified, Exclusive, Shared, Invalid).
Steps in Snooping:
1. Processor issues a read/write request.
2. All caches listen (snoop) on the bus.
3. If another cache has a copy, it responds by invalidating/updating it.
4. Ensures all caches have a coherent view of memory.
10. State and explain Amdahl’s Law.
Statement:
Amdahl’s Law states that the potential speedup of a program using
multiple processors is limited by the sequential portion of the
program.
Formula: S(P) = 1 / (f + (1 − f)/P)
where:
S(P) = speedup with P processors,
f = fraction of the program that is sequential (cannot be parallelized),
1 − f = fraction that can be parallelized.
Explanation:
If 20% of the program is sequential (f = 0.2), then no matter how many
processors we use, the speedup cannot exceed 1/f = 1/0.2 = 5.
This shows the importance of minimizing sequential portions to gain
better performance in parallel computing.
Implications:
1. Parallelism gives diminishing returns when sequential parts dominate.
2. To improve speedup, reduce f by optimizing sequential parts.
3. Useful for predicting scalability of parallel programs.
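A small C sketch that tabulates the law for the 20% sequential example above; it shows the speedup flattening toward the 1/f = 5 limit as P grows (the processor counts are arbitrary sample values):

#include <stdio.h>

/* Amdahl's Law: speedup with P processors when a fraction f is sequential. */
double amdahl(double f, double P) {
    return 1.0 / (f + (1.0 - f) / P);
}

int main(void) {
    double f = 0.2;   /* 20% of the program is sequential */
    int procs[] = {1, 2, 4, 8, 16, 64, 1024};

    for (int i = 0; i < 7; i++) {
        printf("P = %4d  ->  speedup = %.2f\n", procs[i], amdahl(f, procs[i]));
    }
    /* The printed speedup approaches 5.00 but never reaches it, since 1/f = 5. */
    return 0;
}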
✅ These are well-elaborated exam-ready answers.
Would you like me to also add small diagrams for Amdahl’s Law
(speedup curve) and cache coherence (snooping mechanism)? They
will fetch you extra marks in the exam.
11. Explain the scalability in MIMD systems.
12. Explain the distributed memory programming model.
13. Differentiate between CPU and GPU architectures.
14. Define GPU programming. What makes GPUs suitable for parallel tasks?
15. Explain the following: i) Dynamic and Static Threads ii) Nondeterminism
Alright 👍 Let’s prepare exam-ready notes for your next Parallel Computing
questions. I’ll keep them elaborate, structured, and easy to write in
exams, with clear points and explanations.
11. Explain the scalability in MIMD systems.
Scalability in MIMD (Multiple Instruction, Multiple Data) systems refers to
how well the system’s performance improves as the number of processors
increases.
Key Points:
1. Definition:
o A scalable system continues to deliver near-linear performance
improvements as more processors are added.
2. Factors Affecting Scalability:
o Communication Overhead: More processors lead to increased
data exchange and synchronization costs.
o Memory Bottleneck: Shared memory access can cause
contention, reducing scalability.
o Load Balancing: Unequal distribution of work among
processors reduces overall efficiency.
o Interconnection Network: The topology and bandwidth of the
network play a major role in scalability.
3. Types of Scalability:
o Strong Scalability: Performance improves when the problem
size is fixed but processors increase.
o Weak Scalability: Performance improves when both the
problem size and processors increase proportionally.
4. Improving Scalability:
o Efficient parallel algorithms.
o Reducing sequential portions (Amdahl’s Law).
o Using high-speed interconnection networks.
o Optimizing memory hierarchy (NUMA/COMA).
✅ Conclusion: Scalability is a critical measure of the usefulness of MIMD
systems in solving large-scale scientific and commercial problems.
12. Explain the distributed memory programming
model.
In distributed memory systems, each processor has its own private
memory. Communication between processors occurs by passing messages.
Features:
1. Local Memory Access:
o Each processor directly accesses only its own memory.
o Remote data requires explicit communication.
2. Message Passing:
o Data exchange happens using message-passing libraries (e.g.,
MPI – Message Passing Interface).
3. Scalability:
o Highly scalable since adding processors also increases memory
capacity.
o Avoids memory contention of shared memory systems.
4. Programming Model:
o Programs are divided into tasks (processes).
o Each process runs independently and exchanges data via
messages.
o Example: MPI_Send(), MPI_Recv().
5. Applications:
o Used in supercomputers, clusters, and large-scale simulations.
✅ Conclusion: Distributed memory programming is suitable for large-scale
parallelism but requires careful management of communication to achieve
good performance.
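A minimal C/MPI sketch of the model: each process owns its data, and the only way to share it is an explicit MPI_Send()/MPI_Recv() pair (two processes assumed; the value sent is arbitrary). Compile with mpicc and run with, e.g., mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* data in process 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 cannot read process 0's memory directly; it must receive a message. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}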
13. Differentiate between CPU and GPU
architectures.
Aspect | CPU (Central Processing Unit) | GPU (Graphics Processing Unit)
Design Goal | Optimized for low-latency execution of sequential tasks. | Optimized for high-throughput parallel execution of many tasks.
Cores | Few powerful, complex cores. | Thousands of smaller, simpler cores.
Control Logic | Large portion devoted to control and caching. | Small portion for control, more for computation.
Memory Hierarchy | Large caches, optimized for irregular memory access. | High-bandwidth memory, optimized for regular, data-parallel access.
Best Suited For | Branch-heavy, sequential tasks (OS, databases). | Data-parallel tasks (graphics, ML, scientific simulations).
Parallelism | Exploits limited parallelism (thread-level). | Exploits massive parallelism (data-level, SIMD).
✅ Conclusion: CPUs are general-purpose processors, while GPUs are
specialized for tasks requiring massive parallelism.
14. Define GPU programming. What makes GPUs
suitable for parallel tasks?
Definition:
GPU Programming refers to writing code that executes on Graphics
Processing Units (GPUs) to exploit their massive parallelism.
Frameworks like CUDA (NVIDIA) and OpenCL are used for GPU
programming.
Why GPUs are suitable for parallel tasks:
1. Massive Parallelism:
o Thousands of cores execute many threads simultaneously.
2. SIMD Execution:
o Efficiently applies the same operation to large data sets (e.g.,
matrix operations, deep learning).
3. High Memory Bandwidth:
o GPUs are designed with fast memory to handle large datasets.
4. Throughput-Oriented:
o Focused on maximizing the number of operations per second
rather than minimizing latency.
5. Specialized Hardware:
o Streaming Multiprocessors (SMs) handle multiple threads in
parallel.
6. Applications:
o Machine Learning, Deep Learning, Image Processing,
Cryptography, Scientific Computing.
✅ Conclusion: GPUs are ideal for highly parallel tasks where the same
computation must be repeated across large datasets.
15. Explain the following:
i) Dynamic and Static Threads
Static Threads:
o Threads are created once and remain fixed throughout the
program.
o Work is divided at compile-time or initialization.
o Easier to manage but less flexible.
o Example: OpenMP with fixed number of threads.
Dynamic Threads:
o Threads are created and destroyed during program execution.
o Work assignment changes at runtime depending on load.
o Provides better load balancing but adds management
overhead.
o Example: Thread pools in Java/Python.
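A rough C/pthreads sketch of the two styles (illustrative only, not a standard pattern from any library): the static part creates a fixed team of workers once and divides the tasks up front, while the dynamic part spawns a thread per task as work appears.

#include <pthread.h>
#include <stdio.h>

#define NUM_STATIC_WORKERS 4
#define NUM_TASKS 8

/* One unit of work; the body is just a print for the demo. */
void *do_task(void *arg) {
    long id = (long)arg;
    printf("task %ld done\n", id);
    return NULL;
}

/* Static style: a fixed team is created once and the work split is decided
   up front (worker w handles tasks w, w + 4, w + 8, ...). */
void *static_worker(void *arg) {
    long w = (long)arg;
    for (long t = w; t < NUM_TASKS; t += NUM_STATIC_WORKERS)
        do_task((void *)t);
    return NULL;
}

int main(void) {
    /* --- Static threads: created once, live for the whole computation. --- */
    pthread_t team[NUM_STATIC_WORKERS];
    for (long w = 0; w < NUM_STATIC_WORKERS; w++)
        pthread_create(&team[w], NULL, static_worker, (void *)w);
    for (int w = 0; w < NUM_STATIC_WORKERS; w++)
        pthread_join(team[w], NULL);

    /* --- Dynamic threads: a new thread is created per task at runtime and
           destroyed when the task finishes (flexible, but more overhead). --- */
    for (long t = 0; t < NUM_TASKS; t++) {
        pthread_t worker;
        pthread_create(&worker, NULL, do_task, (void *)t);
        pthread_join(worker, NULL);   /* joined immediately to keep the demo short */
    }
    return 0;
}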
ii) Nondeterminism
Definition:
o Nondeterminism in parallel programming occurs when a program
produces different outputs for the same input due to variations in
execution order of threads.
Causes:
o Race conditions (two threads accessing shared data without
synchronization).
o Uncontrolled scheduling by the operating system.
Implications:
o Makes debugging difficult.
o Can lead to incorrect or inconsistent results.
Solution:
o Use synchronization mechanisms (locks, semaphores, atomic
operations).
o Avoid unnecessary shared state.
✅ Conclusion: Understanding thread management and avoiding
nondeterminism is critical for writing correct parallel programs.
✨ These answers are detailed, structured, and exam-ready.
Would you like me to also make a comparison table of shared vs
distributed memory models (Q12 extension)? It often comes as a follow-
up or an internal exam question.
16. Explain the role of synchronization in coordinating processes and threads.
17. Differentiate between parallel hardware and parallel software with suitable examples.
18. Describe MIMD systems with suitable examples.
19. Explain how GPUs are used in MIMD systems to accelerate computations.
20. Explain Amdahl’s law with an example showing its limitations in GPU systems.
Great 👍 Let’s continue with exam-ready answers for your Parallel
Computing questions (Q16–Q20). I’ll explain in a way that’s clear, elaborate,
and easy to write in exams.
16. Explain the role of synchronization in
coordinating processes and threads.
Definition:
Synchronization is the mechanism used to coordinate concurrent
processes or threads so they can safely access shared resources
(like memory, files, or variables) without conflict.
Role of Synchronization:
1. Mutual Exclusion:
o Ensures that only one process/thread accesses a shared resource
at a time.
o Prevents race conditions.
o Implemented using locks, semaphores, or monitors.
2. Order of Execution:
o Maintains correct execution order where processes depend on
each other’s results.
o Example: A producer thread must add items before a consumer
thread removes them.
3. Data Consistency:
o Prevents simultaneous updates to shared variables, ensuring
data correctness.
4. Avoiding Deadlocks and Starvation:
o Proper synchronization ensures processes don’t wait forever for
resources.
5. Coordination:
o Synchronization allows processes/threads to wait for events
(barrier synchronization).
o Example: All threads must finish partial sums before computing
the final sum.
✅ Conclusion: Without synchronization, parallel programs may lead to
unpredictable results, data corruption, and performance issues.
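A small C/pthreads sketch of barrier coordination as in point 5 above: each thread writes its partial sum, the barrier guarantees all writes have finished, and only then does one thread combine them. (pthread_barrier_t is POSIX but not available on every platform; the data values are made up.)

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

int A[N];
double partial[NTHREADS];
pthread_barrier_t barrier;

void *worker(void *arg) {
    long id = (long)arg;
    int chunk = N / NTHREADS;

    /* Phase 1: each thread sums its own disjoint chunk (no conflicts). */
    double s = 0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++) s += A[i];
    partial[id] = s;

    /* Barrier: no thread proceeds until every partial sum is written. */
    pthread_barrier_wait(&barrier);

    /* Phase 2: one designated thread combines the results. */
    if (id == 0) {
        double total = 0;
        for (int t = 0; t < NTHREADS; t++) total += partial[t];
        printf("total = %f\n", total);
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;   /* demo data */

    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}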
17. Differentiate between parallel hardware and
parallel software with suitable examples.
Aspect | Parallel Hardware | Parallel Software
Definition | Physical infrastructure enabling multiple computations at the same time. | Algorithms, programs, and libraries designed to exploit parallel hardware.
Components | Multicore CPUs, GPUs, clusters, interconnection networks. | Parallel programming models, compilers, libraries, and applications.
Focus | Providing resources for execution. | Efficiently using those resources.
Example | A multicore Intel i7 CPU with 8 cores. | OpenMP code that divides tasks across the 8 cores.
Dependency | Exists independently of software. | Requires hardware to execute.
Objective | Increase raw computational power. | Translate computational tasks into parallel workloads.
✅ Conclusion: Parallel hardware provides the foundation, and parallel
software ensures that the hardware is used effectively.
18. Describe MIMD systems with suitable
examples.
Definition:
MIMD (Multiple Instruction, Multiple Data) systems allow multiple
processors to execute different instructions on different data sets
independently.
Characteristics:
1. Each processor has its own control unit.
2. Supports asynchronous execution (processors may not be in
lockstep).
3. Can use shared memory (SMP, NUMA) or distributed memory
(clusters).
4. Highly scalable and supports general-purpose computing.
Examples:
Shared Memory MIMD:
o Symmetric Multiprocessors (SMP) like modern multicore
Intel/AMD CPUs.
Distributed Memory MIMD:
o Cluster systems like Beowulf clusters.
Supercomputers:
o IBM Blue Gene, Cray XT5.
Cloud Systems:
o Amazon AWS clusters running independent tasks.
✅ Conclusion: MIMD systems dominate modern computing because they
can execute diverse workloads efficiently.
19. Explain how GPUs are used in MIMD systems to
accelerate computations.
Role of GPUs in MIMD Systems:
1. Hybrid Architecture:
o MIMD systems typically consist of multicore CPUs (MIMD
processors).
o GPUs are attached as accelerators to offload massively parallel
tasks.
2. Execution Model:
o CPU executes control-heavy tasks (branching, sequential
parts).
o GPU executes data-parallel tasks (matrix multiplication, deep
learning).
3. Communication:
o Data is transferred between CPU and GPU via high-speed
interconnects (PCIe, NVLink).
4. Acceleration Examples:
o Machine Learning → GPUs speed up training by handling billions
of matrix operations.
o Physics Simulations → GPUs handle vector/matrix computations,
while CPUs coordinate tasks.
o Video Rendering → GPUs parallelize pixel rendering across
thousands of cores.
5. MIMD + GPU Parallelism:
o CPU cores (MIMD) run different instructions.
o Each GPU core executes SIMD-style instructions.
o Together, they form a heterogeneous parallel system.
✅ Conclusion: GPUs extend the power of MIMD systems by accelerating
compute-intensive, data-parallel workloads.
20. Explain Amdahl’s Law with an example showing
its limitations in GPU systems.
Amdahl’s Law (Recap):
S(P) = 1 / (f + (1 − f)/P)
f: sequential fraction of the program.
1 − f: parallel fraction of the program.
P: number of processors (or GPU cores).
Example:
Suppose a program is 90% parallelizable (f = 0.1).
With P = 1000 GPU cores:
S(1000) = 1 / (0.1 + 0.9/1000) = 1 / 0.1009 ≈ 9.91
➡️ Even with 1000 GPU cores, speedup is only about 10× because of the
sequential bottleneck.
Limitations in GPU Systems:
1. Sequential Overhead:
o Parts like memory transfer (CPU ↔ GPU) remain sequential,
reducing effective speedup.
2. Diminishing Returns:
o Increasing GPU cores does not help if sequential fraction f is
large.
3. Synchronization Costs:
o GPUs require thread synchronization, which adds extra overhead.
4. Amdahl’s Assumption:
o It assumes fixed problem size, but in practice, larger datasets
can scale better (addressed by Gustafson’s Law).
✅ Conclusion: Amdahl’s Law highlights that simply adding more GPU cores
cannot guarantee proportional speedup, as sequential bottlenecks and
overhead limit performance.
✨ These are well-structured, exam-ready answers with clear points and
examples.
Would you like me to also prepare a short note on Gustafson’s Law
(often asked as a follow-up to Amdahl’s Law in VTU exams)?
21. Write a short note on GPU performance metrics. And evaluate the performance difference between GPU-based and CPU-based matrix multiplication.
22. Explain how to measure the execution time of MIMD programs with examples.
Perfect 👍 Let’s tackle your next Parallel Computing questions (Q21–Q22). I’ll
make the answers exam-ready, clear, and elaborate, with both theory
and practical examples.
21. Write a short note on GPU performance
metrics. And evaluate the performance difference
between GPU-based and CPU-based matrix
multiplication.
GPU Performance Metrics:
To evaluate how well a GPU executes parallel programs, the following
metrics are commonly used:
1. Throughput (FLOPS):
o Number of floating-point operations per second.
o High FLOPS indicates better ability to handle large-scale
computations (TFLOPS for modern GPUs).
2. Memory Bandwidth:
o Rate at which data is transferred between GPU memory and
cores.
o GPUs often achieve hundreds of GB/s, essential for data-
parallel tasks.
3. Occupancy:
o Ratio of active warps (groups of threads) to maximum possible
warps on a Streaming Multiprocessor (SM).
o Higher occupancy leads to better resource utilization.
4. Latency Hiding:
o Ability of GPU to overlap memory operations with computation
by switching between threads.
o Important for performance in memory-bound applications.
5. Power Efficiency:
o Performance per watt of energy consumed.
o Crucial for large GPU clusters (data centers, HPC).
6. Scalability:
o Ability to scale performance with more cores or multiple GPUs.
Evaluating Matrix Multiplication (CPU vs GPU):
Case: Multiplying two matrices (size 1000 × 1000).
1. CPU-based Execution:
o CPUs have a few powerful cores.
o Execution is sequential or lightly parallelized using threads.
o Execution time is relatively high due to limited cores and cache-
based memory access.
2. GPU-based Execution:
o GPUs have thousands of cores optimized for SIMD operations.
o Each thread computes one element (or partial sum) of the output
matrix in parallel.
o Memory bandwidth is high, allowing large data transfers
efficiently.
o Execution is tens to hundreds of times faster than CPU.
Example Results (general trend):
CPU (8 cores): ~5–10 GFLOPS.
GPU (NVIDIA RTX 3090): >15 TFLOPS.
GPU speedup: two to three orders of magnitude in peak throughput; measured gains for large matrices are typically lower once data transfer and CPU-side optimization are accounted for.
✅ Conclusion: GPUs outperform CPUs for matrix multiplication due to
massive parallelism, higher memory bandwidth, and SIMD execution, making
them ideal for scientific computing and machine learning tasks.
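For the CPU side of such an evaluation, a rough OpenMP timing harness like the sketch below measures wall-clock time for a naive triple-loop multiplication and reports GFLOPS; the GPU counterpart would normally be written with CUDA or a library such as cuBLAS and is not shown here. The matrix size and initial values are arbitrary, and absolute numbers depend entirely on hardware and compiler flags.

#include <stdio.h>
#include <omp.h>

#define N 512   /* matrix dimension (kept small so the naive loop finishes quickly) */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
        }

    double start = omp_get_wtime();

    /* Naive parallel matrix multiplication: rows of C are divided among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++) sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    double elapsed = omp_get_wtime() - start;
    double flops = 2.0 * N * N * N;   /* one multiply + one add per inner iteration */
    printf("time   = %.3f s\n", elapsed);
    printf("GFLOPS = %.2f\n", flops / elapsed / 1e9);
    return 0;
}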
22. Explain how to measure the execution time of
MIMD programs with examples.
Why Execution Time Measurement is Needed:
In MIMD systems, multiple processors execute independently.
Measuring execution time helps evaluate performance, speedup, and
efficiency.
Methods of Measuring Execution Time:
1. Wall-Clock Time:
o The actual elapsed time from start to end of the program.
o Includes computation, communication, and synchronization
overhead.
o Measured using functions like time() in C, or chrono in C++.
double start = MPI_Wtime(); // start timer
parallel_function();
double end = MPI_Wtime(); // end timer
double elapsed = end - start;
2. CPU Time (Per Processor):
o Measures the time a processor spends on computation,
excluding idle or waiting time.
o Example in C: clock() function.
3. Parallel Execution Time (Tp):
o The maximum execution time across all processors.
o Since tasks may finish at different times, total execution time is
determined by the slowest processor.
Tp = max(T1, T2, …, TP)
4. Profiling Tools:
o Tools like gprof, MPI Profiler, or NVIDIA Nsight analyze
computation vs communication time.
Example – Array Summation with MPI (MIMD):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    double start, end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    start = MPI_Wtime();              // Start timing

    // Parallel computation (each processor sums part of array)
    // ...

    MPI_Barrier(MPI_COMM_WORLD);      // Synchronize all processes
    end = MPI_Wtime();                // End timing

    if (rank == 0) {
        printf("Execution Time: %f seconds\n", end - start);
    }

    MPI_Finalize();
    return 0;
}
Here, MPI_Wtime() gives wall-clock time for parallel execution.
Barrier ensures all processes finish before measuring total time.
✅ Conclusion:
Execution time in MIMD programs can be measured using wall-clock
timers, CPU timers, and profiling tools. The total parallel execution time
is determined by the slowest processor, including communication and
synchronization delays.
✨ These are detailed, exam-ready answers.
Would you like me to also add a table comparing GPU vs CPU
performance metrics (like FLOPS, bandwidth, cores) so you can use it as a
strong point in Q21?