HPC Book
Unit 1
Parallel computing is a method of computation in which many calculations or processes are carried out
simultaneously. Large problems are often divided into smaller ones, which are then solved concurrently
("in parallel"). There are several different forms of parallel computing: bit-level, instruction-level, data,
and task parallelism. Parallelism has long been employed in high-performance computing, but has gained
broader interest due to the physical constraints preventing frequency scaling. As power consumption (and
consequently heat generation) by computers has become a concern in recent years, parallel computing has
become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
Parallel computing is closely related to concurrent computing—they are frequently used together, and
often conflated, though the two are distinct: parallel computing is the simultaneous execution of the same
task (split up and distributed) on multiple cores, whereas concurrent computing means that several tasks
are in progress during overlapping time periods, not necessarily executing at the same instant.
Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism, with multi-core and multi-processor computers having multiple processing elements within a
single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.
Specialized parallel computer architectures are sometimes used for specific tasks, even in niche areas.
Some software is designed for parallel computing. Parallel programming languages and APIs, such as
Fortran with array syntax, OpenMP, and MPI, are used to develop parallel applications. Parallel algorithms are
designed to exploit the architecture of the parallel computers they will run on. Multi-core processors have
become commonplace even in personal computers.
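As a small illustration of this programming style, the sketch below uses OpenMP to split a loop across the cores of a multi-core processor (a minimal example, assuming a compiler with OpenMP support, e.g. built with -fopenmp):
C++
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static double a[1000000], b[1000000];

    // Each iteration is independent, so OpenMP can split the loop
    // across the available cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * b[i];
    }

    printf("up to %d threads available\n", omp_get_max_threads());
    return 0;
}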
1. Motivating Parallelism:
The primary motivation for using parallel computing stems from the limitations of traditional serial
computing and the increasing demands of modern computational tasks.
1.1 Solving Complex Problems Faster:
Many real-world problems are incredibly complex and require vast amounts of computation.
These problems can take days, weeks, or even months to solve on a single processor. Parallel
computing breaks down these problems into smaller parts, allowing multiple processors to work
on them simultaneously, drastically reducing the overall computation time.
Examples include:
o Scientific simulations: Weather forecasting, climate modeling, drug discovery, and
simulations of physical phenomena.
o Data analysis: Processing large datasets in fields like genomics, finance, and social
media.
o Engineering design: Simulating complex systems like aircraft or automobiles.
The amount of data being generated is growing exponentially. Traditional methods of data
processing are struggling to keep up.
Parallel computing provides the necessary tools to efficiently process and analyze these massive
datasets. By distributing the data and processing tasks across multiple processors, it becomes
possible to extract valuable insights from large datasets in a reasonable time.
1.4. Cost-Effectiveness:
While supercomputers with a large number of processors can be expensive, parallel computing
can also be implemented using clusters of commodity hardware, which can be a more cost-
effective solution for many problems.
Furthermore, by reducing the time required to solve a problem, parallel computing can save
money in terms of energy consumption and resource utilization.
Many real-world phenomena occur concurrently. Parallel computing is a natural way to model
these phenomena, as it allows for the simulation of multiple events happening simultaneously.
Examples include:
o Traffic flow: Simulating the movement of vehicles on a road network.
o Financial markets: Modeling the interactions of buyers and sellers in a market.
o Biological systems: Simulating the interactions of cells in a living organism.
A modern processor, also known as a microprocessor or CPU (Central Processing Unit), is a complex
integrated circuit that executes instructions in a computer. It's the "brain" of the computer, responsible for
performing all the calculations and operations needed to run programs.
Key components of a modern processor:
Cores: Modern processors have multiple cores, each capable of executing instructions
independently. This allows the processor to perform multiple tasks simultaneously, improving
performance.
Cache Memory: Cache is a small, high-speed memory that stores frequently accessed data and
instructions. This reduces the time it takes for the processor to access information, improving
performance. Modern processors have multiple levels of cache (L1, L2, L3), with L1 being the
fastest and smallest, and L3 being the largest and slowest.
Control Unit: The control unit fetches instructions from memory, decodes them, and
coordinates the execution of these instructions by other components of the processor.
Arithmetic Logic Unit (ALU): The ALU performs arithmetic and logical operations on data.
Floating Point Unit (FPU): The FPU performs operations on floating-point numbers, which
are used in scientific and graphical applications.
Memory Management Unit (MMU): The MMU manages the flow of data between the
processor and main memory.
Bus Interface: The bus interface connects the processor to the rest of the computer system,
including memory and peripherals.
Key features of modern processors:
High clock speeds: Clock speed is the rate at which the processor's clock cycles, measured in
gigahertz (GHz); it determines how many instruction-issue opportunities occur per second. Modern
processors have clock speeds of several GHz.
Multiple cores: As mentioned earlier, multiple cores allow the processor to perform multiple
tasks simultaneously.
Large cache sizes: Larger caches can store more data and instructions, improving performance.
Advanced instruction sets: Modern processors support advanced instruction sets that allow
them to perform complex operations more efficiently.
Integrated graphics: Some modern processors include integrated graphics processing units
(GPUs), which can handle basic graphics tasks.
Parallel programming platforms encompass the hardware and software infrastructure that supports the
development and execution of parallel programs. These platforms vary significantly in their architecture,
programming models, and intended applications. Here's an overview of common parallel programming
platforms with illustrative figures:
Distributed-Memory Systems:
Description: In distributed-memory systems, each processor has its own local memory.
Processors communicate with each other by sending messages over an interconnection network.
Examples: Clusters of workstations, massively parallel processors (MPPs)
Programming Models: Message Passing Interface (MPI)
Figure: distributed-memory system diagram
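A minimal sketch of the message-passing style used on such systems (assuming an MPI installation; built with mpic++ and launched with mpirun):
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        // Rank 0 sends a message to rank 1 over the interconnection network.
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}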
Heterogeneous Systems:
Description: Heterogeneous systems utilize different types of processors, such as CPUs and
GPUs, to accelerate computations. GPUs are particularly well-suited for data-parallel tasks.
Examples: Systems with GPUs, FPGAs, or other specialized processors
Programming Models: CUDA (for NVIDIA GPUs), OpenCL
Figure: heterogeneous system diagram
Cloud Computing Platforms:
Description: Cloud computing platforms provide access to scalable computing resources over the
internet. These platforms can be used to run parallel applications on virtual machines or
containers.
Examples: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform
Programming Models: MapReduce, Apache Spark
Factors in choosing a platform:
Nature of the problem: The characteristics of the problem being solved will influence the choice
of platform. Data-parallel problems may be well-suited for GPUs, while problems with complex
communication patterns may require a distributed-memory system.
Scalability: The ability of the platform to scale to a large number of processors is important for
solving large problems.
Cost: The cost of the hardware and software required for the platform is a factor to consider.
Programming effort: The ease of programming and debugging parallel applications on the
platform is also important.
The choice of a specific parallel programming platform depends on the requirements of the application
and the available resources. Each platform has its own strengths and weaknesses, and it's essential to
select the platform that best matches the needs of the task at hand.
1.7. Implicit Parallelism:
Implicit parallelism is a type of parallel computing where the compiler or runtime environment
automatically detects and exploits parallelism in a program without explicit instructions from the
programmer. This contrasts with explicit parallelism, where the programmer must explicitly specify how
the program should be parallelized.
1.7.1. Visualization:
Imagine you have a pile of dirty dishes to wash.
Sequential Approach (No Parallelism): You wash each dish one by one, stacking them
as you go. This is like a program running on a single processor, executing instructions
one after another.
Implicit Parallelism: You have a dishwasher. You load the dishes into the dishwasher,
press the button, and the machine washes multiple dishes simultaneously. You (the
programmer) didn't have to specify how the dishes are washed in parallel; the dishwasher
(the compiler/runtime) handles that automatically.
Vectorization: Compilers can automatically convert loops that perform the same operation on
multiple data elements into vector instructions, which can be executed in parallel on vector
processors (see the loop sketch after this list).
Superscalar Processors: Modern processors can execute multiple instructions simultaneously
using multiple execution units. This is a form of implicit parallelism at the hardware level.
Functional Programming Languages: Some functional programming languages, such as
Haskell, have features that make it easier for compilers to detect and exploit parallelism.
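As a sketch of the vectorization point above, the loop below applies the same operation to every element, so an optimizing compiler can usually turn it into vector instructions automatically; whether that actually happens depends on the compiler and flags (e.g. -O3), so this is illustrative only:
C++
#include <cstddef>

// Element-wise scaling: every iteration is independent, so the compiler
// is free to process several elements per instruction (implicit parallelism).
void scale(float *out, const float *in, float factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}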
Advantages:
The programmer does not have to specify parallelism explicitly, which simplifies development.
Existing sequential code can benefit from parallel hardware without being rewritten.
Disadvantages:
May not always be effective in detecting and exploiting all available parallelism.
Can be difficult to predict the performance of a program, as it depends on the capabilities of the
compiler or runtime environment.
1.8. Dichotomy of Parallel Computing Platforms:
The dichotomy of parallel computing platforms refers to the fundamental ways in which parallel
computers are organized, both logically (from the programmer's perspective) and physically (in terms of
hardware). This classification helps understand how different parallel architectures handle instructions
and data. The primary division is between SIMD (Single Instruction, Multiple Data) and MIMD
(Multiple Instruction, Multiple Data).
SIMD (Single Instruction, Multiple Data):
Concept: SIMD architectures execute the same instruction simultaneously on multiple data
elements. Think of it as an assembly line where each worker (processing unit) performs the same
task on different items (data).
Examples: Vector processors, GPUs (to some extent)
Figure: SIMD architecture diagram
Characteristics:
o One control unit dispatches instructions to all processing units.
o All processing units execute the same instruction synchronously.
o Well-suited for data-parallel problems where the same operation needs to be performed
on a large dataset.
o Examples of applications include image processing, multimedia, and scientific
simulations with regular data structures.
MIMD (Multiple Instruction, Multiple Data):
Concept: MIMD architectures allow each processing unit to execute different instructions on
different data. This provides greater flexibility compared to SIMD.
Examples: Multi-core processors, clusters of workstations, distributed systems
Figure: MIMD architecture diagram
Characteristics:
o Each processing unit has its own control unit and can execute different instructions.
o Processing units can operate asynchronously.
o Suitable for a wider range of applications, including those with complex control flow and
irregular data structures.
o Examples of applications include simulations with complex interactions, database
management, and web servers.
The logical organization (programmer's view) and physical organization (hardware) are closely
related. For example, a programmer might use threads to program a shared-memory MIMD
machine, while they might use message passing (MPI) to program a distributed-memory MIMD
machine.
1.9. Models:
These models describe how computations are organized and executed in parallel.
SIMD (Single Instruction, Multiple Data): Conceptual Example:
Analogy: Imagine multiple chefs in a kitchen, all following the exact same recipe (instruction)
but working on different ingredients (data).
Vector A: [1, 2, 3, 4]
Vector B: [5, 6, 7, 8]
Result: [6, 8, 10, 12]
In a SIMD processor, this addition would happen in a single instruction, with each PE adding the
corresponding elements of the vectors.
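As a concrete sketch of this example, x86 SSE intrinsics perform the four additions with a single vector instruction (assumes an x86 processor with SSE support):
C++
#include <immintrin.h>
#include <cstdio>

int main() {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {5, 6, 7, 8};
    float result[4];

    __m128 va = _mm_loadu_ps(a);       // load the 4 elements of vector A
    __m128 vb = _mm_loadu_ps(b);       // load the 4 elements of vector B
    __m128 vr = _mm_add_ps(va, vb);    // one instruction adds all 4 pairs
    _mm_storeu_ps(result, vr);

    printf("%g %g %g %g\n", result[0], result[1], result[2], result[3]);  // 6 8 10 12
    return 0;
}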
1.9.3. MIMD (Multiple Instruction, Multiple Data): Multiple instructions are executed
simultaneously on multiple data elements. Offers more flexibility than SIMD. Example: Multi-
core processors, clusters.
Processing Elements (PEs): In MIMD, each PE is a full-fledged processor with its own:
o Control Unit: Fetches and decodes instructions independently.
o Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations.
o Local Memory (Optional): Some MIMD architectures have local memory for
each PE, while others share a common memory space.
Interconnection Network: This network allows the PEs to communicate with each
other. The type of network can vary (e.g., bus, crossbar, network on a chip).
Key Characteristics Visualized:
o Multiple Instruction Streams: Each PE can execute a different program or a
different part of the same program. This is shown by the different "Instructions"
going to each PE.
o Multiple Data Streams: Each PE operates on its own set of data. This is
represented by the different "Data" going to each PE.
o Asynchronous Execution: The PEs operate independently and don't need to be
synchronized with each other.
Analogy:
Think of a team of researchers working on a large project. Each researcher (PE) has their own
expertise, tools, and data, and they work independently but communicate with each other to
share findings and collaborate.
Variations of MIMD:
Shared-memory MIMD: All processing units share a common memory space (e.g., multi-core processors).
Distributed-memory MIMD: Each processing unit has its own private memory, and processors communicate by message passing (e.g., clusters).
Where MIMD is used:
General-purpose computing: Most modern computers, from laptops to servers, are MIMD
architectures.
High-performance computing: Supercomputers often use MIMD architectures to solve
complex scientific and engineering problems.
Cloud computing: Cloud platforms use MIMD architectures to run a wide variety of
applications.
Advantages of MIMD:
Flexibility: Can handle a wide range of applications, including those with complex control flow
and irregular data structures.
Scalability: Can be scaled to a large number of processors.
1.10. SIMT (Single Instruction, Multiple Threads): This model is commonly used in
GPUs. It's a variation of SIMD where multiple threads execute the same instruction on
different data. However, unlike pure SIMD, threads in SIMT can diverge, meaning they can
take different execution paths based on conditional statements. This provides more
flexibility than SIMD while still benefiting from the data parallelism. Example: Modern
GPUs (Nvidia CUDA, AMD ROCm).
Explanation:
Threads: In SIMT, the fundamental unit of execution is a thread. Multiple threads are
grouped into a warp (Nvidia terminology) or a wavefront (AMD terminology). A warp is
a set of threads that execute in a SIMD-like fashion.
SIMD within a Warp: Within a warp, all threads execute the same instruction at the
same time, similar to SIMD.
Thread Divergence: This is the key difference from pure SIMD. Threads within a warp
can take different execution paths based on conditional branches (if/else statements). This
is illustrated in the diagram where some threads take the "if" path and others take the
"else" path.
Masking: When threads diverge, some threads become inactive (masked) while others
execute a particular branch. The inactive threads effectively "wait" until the other threads
have finished executing their branch.
Reconvergence: After the divergent branches have been executed, the threads
reconverge and execute the subsequent instructions together.
Imagine a group of soldiers (threads) marching in formation (warp). They are all following the
same order (instruction) to march forward. However, they encounter a fork in the road
(conditional branch). Some soldiers are ordered to go left, and others are ordered to go right. The
soldiers who went left wait for the soldiers who went right to catch up before they continue
marching together again.
Features of SIMT:
Combines SIMD and Multithreading: Leverages the data parallelism of SIMD while
providing the flexibility of multithreading.
Handles Divergent Control Flow: Can efficiently handle code with conditional branches,
which is common in many applications.
Hardware Efficiency: By grouping threads into warps and executing them in a SIMD-like
fashion, SIMT can achieve high hardware utilization.
Important Considerations:
Divergence Penalty: Thread divergence can lead to performance degradation because some
threads are idle while others are executing a different branch. Therefore, it's important to write
code that minimizes divergence as much as possible.
Warp Size: The warp size is a hardware-specific parameter (e.g., 32 threads in Nvidia GPUs).
Understanding the warp size is crucial for optimizing SIMT code.
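A minimal CUDA sketch of warp divergence (hypothetical kernel, data, and launch configuration): even-indexed and odd-indexed elements force the threads of each warp down different branches, which the hardware executes in separate masked passes before the warp reconverges.
C++
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent_kernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (data[idx] > 0.0f)
            data[idx] *= 2.0f;   // some threads of the warp take this path
        else
            data[idx] = -1.0f;   // the rest are masked until this path finishes
        // The threads of the warp reconverge here and continue together.
    }
}

int main() {
    const int n = 64;
    float host[64];
    for (int i = 0; i < n; ++i) host[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    divergent_kernel<<<2, 32>>>(dev, n);   // 2 blocks of one 32-thread warp each

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("first two results: %g %g\n", host[0], host[1]);
    return 0;
}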
1.11. SPMD (Single Program, Multiple Data): In this model, multiple processors execute
the same program concurrently, but each processor operates on its own set of data. This is a
common programming model for distributed-memory systems. Example: MPI programs
running on a cluster.
Figure: SPMD model diagram
Processors/Nodes (P1, P2, P3, P4): These represent the individual processing units. They
could be cores within a multi-core processor, separate CPUs in a shared-memory system, or
independent nodes in a distributed-memory cluster.
Single Program (Program Code): The same program code is loaded onto each processor.
This is the defining characteristic of SPMD.
Data Partitioning (Data 1, Data 2, Data 3, Data 4): The overall problem's data is divided
into distinct chunks. Each processor receives a different chunk of data to process.
Local Memory (Optional): Each processor typically has its own local memory where it stores
its assigned data and performs computations. In shared-memory systems, this might be a portion
of the shared memory space. In distributed-memory systems, it's the processor's own private
memory.
Communication Network (Optional): The processors might need to communicate with each
other during the computation. This is usually done through an interconnection network using
message passing. This is represented by the arrows between the processors.
Think of a team of farmers harvesting a large field of crops.
Single Program: They all follow the same instructions on how to harvest (e.g., how to pick the
crops, how to sort them).
Multiple Data: Each farmer is assigned a different section of the field to harvest.
Communication: They might need to communicate to coordinate their efforts, such as where to
deliver the harvested crops.
Typical SPMD execution flow:
1. Problem Decomposition: The problem is broken down into smaller tasks, and the data is
partitioned accordingly.
2. Program Distribution: The same program is copied or made accessible to all processors.
3. Process/Thread Launch: Each processor starts executing the program.
4. Unique ID/Rank Assignment: Each processor is assigned a unique identifier (rank or ID). This
ID is used by the program to determine which part of the data the processor should work on.
5. Local Computation: Each processor executes the program on its assigned data partition.
6. Inter-process Communication (if needed): Processors exchange data or synchronization signals
using message passing or shared memory.
7. Result Aggregation (if needed): The results from each processor are combined to form the final
solution.
Typical applications of SPMD include:
Scientific simulations
Data analysis
Image and video processing
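A minimal SPMD sketch in MPI following the steps above (assuming an MPI installation): every process runs the same program, uses its rank to select its data partition, and the partial results are combined at the end.
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // unique ID / rank assignment (step 4)
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Local computation on this process's data partition (step 5):
    // here, each process simply sums the block of integers assigned to it.
    const int block = 100;
    long local_sum = 0;
    for (int i = rank * block; i < (rank + 1) * block; ++i) {
        local_sum += i;
    }

    // Result aggregation at rank 0 (step 7).
    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %ld\n", total);

    MPI_Finalize();
    return 0;
}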
1.10.2. Data Flow Models: In data flow computing, the execution of instructions is determined
by the availability of data. An instruction is executed as soon as its operands are available.
This model can expose significant parallelism, as multiple instructions can be executed
concurrently as long as their data dependencies are met. Example: Static dataflow
architectures, dynamic dataflow architectures.
In traditional (control-flow) computing, instructions are executed in a specific order dictated by the
program's control flow (e.g., loops, conditional statements). In data flow computing:
Nodes: Represent operations or functions.
Arcs/Edges: Represent data dependencies between operations. An arc from node A to node B
means that the output of A is an input to B.
Execution Triggered by Data Availability: A node/operation executes as soon as all its input
data is available. There's no central control dictating the order of execution.
1. Data Arrival: When data arrives on all input arcs of a node (i.e., tokens are present on all
incoming arrows), the node becomes enabled or fireable.
2. Node Firing: The enabled node executes its operation, consumes the input tokens, and produces
output tokens on its output arcs.
3. Cascading Execution: The output tokens may enable other nodes, triggering a chain reaction of
executions.
Example:
In the figure, if data arrives at the inputs of the "+" node, it will execute. Its output will then provide input
to the "*" node. If data also arrives at the other input of the "*" node, it will then execute.
Static Data Flow: The data flow graph is fixed at compile time.
Dynamic Data Flow: The data flow graph can change during execution.
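A conceptual illustration of data-driven execution using C++ futures (this only imitates the idea at the program level; real dataflow machines schedule individual instructions in hardware): each "+" node fires as soon as its operands exist, and the "*" node fires only once both of its input tokens are available.
C++
#include <future>
#include <iostream>

int main() {
    // Two independent "+" nodes: they may execute concurrently.
    std::future<int> left  = std::async(std::launch::async, [] { return 1 + 2; });
    std::future<int> right = std::async(std::launch::async, [] { return 3 + 4; });

    // The "*" node is enabled only when both input tokens have arrived.
    int product = left.get() * right.get();
    std::cout << product << "\n";   // prints 21
    return 0;
}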
Demand-Driven (Lazy Evaluation) Models:
Explanation:
Expressions/Functions: In lazy evaluation, expressions or function calls are not evaluated
immediately. Instead, they are represented as promises or thunks – essentially, a record of the
expression and the environment needed to evaluate it.
Demand: Evaluation is triggered only when the result of an expression is actually needed
(demanded) by another part of the program.
Evaluation on Demand: When the result is demanded, the expression is evaluated, and the result
is memoized (cached) so that it doesn't need to be recomputed if it's needed again later.
Analogy: Think of cooking from a recipe book.
Eager Evaluation (Traditional): You prepare all the ingredients and cook every dish in the book
right away, even if you're only going to eat one dish.
Lazy Evaluation: You only prepare the ingredients and cook the dish when you're actually ready
to eat it. If someone asks you for a specific dish, you prepare only that dish. If they ask for the
same dish again later, you just serve them the already-prepared dish (memoization).
Advantages:
Avoids unnecessary computations: Expressions are only evaluated if their results are needed.
Supports infinite data structures: It's possible to define and work with infinite lists or other
data structures because only the parts that are actually used are computed.
Can improve performance: By avoiding unnecessary computations, lazy evaluation can
improve the overall performance of a program.
Functional programming languages: Haskell, Miranda, and other functional languages use lazy
evaluation as a core feature.
Some other languages: Some languages provide features or libraries to support lazy evaluation.
Important Considerations:
Evaluation order: In lazy evaluation, the order of evaluation is not always obvious, which can
make debugging more challenging.
Space leaks: If not implemented carefully, lazy evaluation can lead to space leaks, where
unevaluated expressions accumulate in memory.
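A minimal sketch of a thunk with memoization in C++ (the languages above provide this natively; this class only imitates the idea): the wrapped expression runs the first time its value is demanded, and the cached result is reused afterwards.
C++
#include <functional>
#include <iostream>
#include <optional>

// A minimal thunk: the wrapped expression is evaluated only on first demand,
// and the result is memoized for later uses.
template <typename T>
class Lazy {
    std::function<T()> expr;
    std::optional<T> cached;
public:
    explicit Lazy(std::function<T()> e) : expr(std::move(e)) {}
    T get() {
        if (!cached) cached = expr();   // evaluate on demand, then cache
        return *cached;
    }
};

int main() {
    Lazy<int> value([] {
        std::cout << "evaluating...\n";   // printed only once
        return 6 * 7;
    });
    std::cout << value.get() << "\n";  // triggers evaluation
    std::cout << value.get() << "\n";  // served from the cache
    return 0;
}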
N-wide Superscalar Architectures: These architectures can issue multiple instructions per clock
cycle. The "N" represents the number of instructions that can be issued simultaneously. This is a
form of instruction-level parallelism (ILP). Example: Modern CPUs with multiple execution
units.
Key Idea: Superscalar architectures exploit instruction-level parallelism (ILP) by executing multiple
independent instructions concurrently.
Analogy: Imagine a restaurant kitchen with multiple chefs (execution units). The head chef (dispatch
unit) assigns different tasks (instructions) to the chefs, and they work simultaneously.
Multi-core Processors:
Key Idea: Multi-core processors exploit thread-level or process-level parallelism by executing multiple
threads or processes concurrently on different cores.
Analogy: Imagine multiple separate kitchens (cores) within the same restaurant. Each kitchen can prepare
different dishes (execute different programs or threads) simultaneously.
Multi-threaded Processors (Simultaneous Multithreading, SMT):
Cores: Each core has multiple hardware threads (also called logical processors).
Shared Resources within a Core: The hardware threads share some resources within the core,
such as:
o Execution Units: The same ALUs and FPUs are shared between threads.
o Caches (L1, L2): The caches are also shared.
Private Resources per Thread: Each thread has its own:
o Registers: To store its own data.
o Program Counter: To track its own instruction execution.
Key Idea: Multi-threaded processors exploit thread-level parallelism by allowing multiple threads to
share the resources of a single core. This improves resource utilization and can hide memory latency.
Analogy: Imagine a single kitchen with multiple chefs (threads) sharing the same ovens, stoves, and other
equipment but having their own recipes and ingredients.
Each core is superscalar, capable of executing multiple instructions per clock cycle.
The chip has multiple cores, allowing for parallel execution of multiple threads/processes.
Each core is multi-threaded (SMT), allowing for even better resource utilization.
GSM System Architecture:
1. Mobile Station (MS): This is the mobile phone itself, consisting of:
o Mobile Equipment (ME): The physical device (handset).
o Subscriber Identity Module (SIM): A smart card that stores user-specific information
like IMSI (International Mobile Subscriber Identity), authentication keys, and
phonebook.
2. Base Station Subsystem (BSS): This part handles the radio interface between the mobile station
and the network. It consists of:
o Base Transceiver Station (BTS): Contains the radio transceivers and antennas that
communicate with mobile phones in a cell.
o Base Station Controller (BSC): Manages multiple BTSs, handling radio resource
allocation, handover control, and other functions.
3. Network Switching Subsystem (NSS): This is the core of the GSM network, responsible for call
routing, call control, and subscriber management. It includes:
o Mobile Switching Center (MSC): The main switching node in the GSM network. It
handles call setup, call routing, and other switching functions.
o Visitor Location Register (VLR): A database that stores temporary information about
subscribers who are roaming in the area controlled by the MSC.
o Home Location Register (HLR): A central database that stores permanent information
about all subscribers in the network.
o Authentication Center (AuC): Responsible for authenticating subscribers and
generating encryption keys.
o Equipment Identity Register (EIR): A database that stores information about mobile
equipment, such as IMEI (International Mobile Equipment Identity).
4. Operation and Support Subsystem (OSS): This part is responsible for the overall management
and maintenance of the GSM network. It includes:
o Network Management Center (NMC): Monitors and controls the network, handling
tasks such as fault management, performance management, and configuration
management.
Time Division Multiple Access (TDMA): Divides each frequency channel into time slots,
allowing multiple users to share the same frequency.
Frequency Division Multiple Access (FDMA): Divides the available frequency spectrum into
channels, each assigned to a different user or communication.
Cellular Network: The network is divided into cells, each served by a BTS. This allows for
efficient use of the frequency spectrum and provides wider coverage.
There is no single definitive "figure" for a Mobile Station (MS) in GSM because it encompasses both the physical
device and the user's identity. However, a combined visual and explanation can cover the key
aspects:
You can think of it like this: the ME is the "phone," and the SIM is the "key" that unlocks the network.
This diagram shows the MS communicating with the Base Transceiver Station (BTS) of the GSM
network.
Unique Identity: Each MS is uniquely identified by its IMEI (International Mobile Equipment
Identity), which is a serial number for the hardware.
User Identity: Each subscriber is uniquely identified by their IMSI, which is stored on the SIM
card.
Radio Interface: The MS communicates with the GSM network using radio waves.
Mobility: The MS is designed to be mobile, allowing users to move between different cells while
maintaining a connection to the network.
Here's a figure representing the Base Station Subsystem (BSS) in GSM, along with a breakdown of its
key components:
Key Interfaces:
Um Interface (Air Interface): This is the radio interface between the MS and the BTS.
Abis Interface: This is the interface between the BTS and the BSC.
Here is a figure of the Network Switching Subsystem (NSS) in GSM, along with an explanation to clarify
its components and functions:
Figure of Network Switching Subsystem (NSS)
The NSS is the core of the GSM network. It's responsible for managing calls, subscriber data, and
connections to other networks. Here are its main components:
Key Interfaces:
A Interface: The interface between the BSS (BSC) and the MSC.
Call Setup Flow (simplified):
1. When a mobile phone makes a call, the BSS forwards the call request to the MSC.
2. The MSC queries the VLR (if the subscriber is roaming) or the HLR (if the subscriber is in their
home network) to get the subscriber's information.
3. The AuC authenticates the subscriber.
4. The MSC sets up the call path to the destination.
It is difficult to create a single image that encapsulates all aspects of security and data services in GSM;
these are multifaceted topics. The following diagrams and explanations illustrate the key
concepts:
1. Security in GSM
Subscriber Identity Confidentiality: Protecting the subscriber's identity (IMSI) from being
intercepted.
Authentication: Verifying that the user is a legitimate subscriber.
Encryption: Protecting the communication between the mobile station and the network from
eavesdropping.
SIM Card: Stores the subscriber's secret key (Ki), which is used for authentication.
Authentication Center (AuC): Generates random numbers (RAND) and performs
authentication calculations.
Authentication Process:
1. The network sends a RAND to the mobile station.
2. The mobile station uses its Ki and the RAND to calculate a Signed Response (SRES).
3. The network performs the same calculation using the subscriber's Ki stored in the AuC.
4. If the SRES values match, the subscriber is authenticated.
Encryption Process:
1. After successful authentication, a Ciphering Key (Kc) is generated.
2. The Kc is used to encrypt the communication between the mobile station and the BTS
using an encryption algorithm (A5).
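A highly simplified sketch of this challenge-response flow (the real A3/A8 algorithms are operator-specific and confidential; a3 and a8 below are hypothetical stand-ins, and the key and RAND values are made up):
C++
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the operator-specific A3/A8 algorithms.
uint32_t a3(uint64_t ki, uint32_t rand_challenge) { return (uint32_t)(ki ^ rand_challenge) * 2654435761u; }
uint64_t a8(uint64_t ki, uint32_t rand_challenge) { return ki + rand_challenge; }

int main() {
    uint64_t ki = 0x1234ABCD5678EF00ULL;   // secret key shared by the SIM and the AuC
    uint32_t rand_challenge = 0xCAFEBABE;  // RAND sent by the network

    uint32_t sres_ms  = a3(ki, rand_challenge);   // computed on the SIM
    uint32_t sres_net = a3(ki, rand_challenge);   // computed in the AuC

    if (sres_ms == sres_net) {
        uint64_t kc = a8(ki, rand_challenge);     // ciphering key for A5 encryption
        printf("authenticated, Kc = %llx\n", (unsigned long long)kc);
    } else {
        printf("authentication failed\n");
    }
    return 0;
}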
GSM initially focused on voice calls, but it has evolved to support various data services, including SMS, circuit-switched data (CSD), HSCSD, GPRS, and EDGE.
Key Points:
GSM security mechanisms are designed to protect subscriber identity, authenticate users, and
encrypt communication.
GSM has evolved from primarily supporting voice calls to offering a wide range of data services.
The evolution of GSM has paved the way for more advanced mobile communication technologies
like UMTS and LTE.
2.2.5. HSCSD (High-Speed Circuit-Switched Data):
It is hard to create a single, universally recognized figure specifically for HSCSD. The core concept is the use of
multiple time slots, so a timing diagram is most illustrative. Here's how we can visualize it:
TDMA (Time Division Multiple Access): GSM divides each frequency channel into 8 time
slots. Each user is assigned one or more time slots to transmit and receive data.
Normal GSM Data (Circuit-Switched): In basic circuit-switched data (CSD) in GSM, a user is
typically allocated one time slot per frame.
Multiple Consecutive Time Slots: The key idea of HSCSD is to allocate multiple consecutive
time slots to a single user. In the figure, you can see a user being allocated two consecutive time
slots.
Increased Data Rate: By using multiple time slots, HSCSD can significantly increase the data
rate compared to basic CSD. For example, if a single time slot provides a data rate of 9.6 kbps,
using two time slots would provide a data rate of 19.2 kbps, and so on.
Symmetric vs. Asymmetric: HSCSD can be configured in two ways:
o Symmetric: The same number of time slots is allocated for both uplink (mobile to
network) and downlink (network to mobile) transmission.
o Asymmetric: Different numbers of time slots are allocated for uplink and downlink. This
is useful for applications where the downlink data rate is higher than the uplink data rate
(e.g., web browsing).
Benefits of HSCSD:
Higher Data Rates: The primary benefit of HSCSD is the increased data rate, which allows for
faster data transmission.
Improved User Experience: Higher data rates lead to a better user experience for data-intensive
applications.
Limitations of HSCSD:
Circuit-Switched Nature: Like basic CSD, HSCSD is a circuit-switched technology. This means
that a dedicated circuit is established for the duration of the data transmission, even if there is no
data being transmitted. This can be inefficient for applications with bursty traffic.
Limited Capacity: Allocating multiple time slots to a single user reduces the number of users
that can be supported simultaneously.
Transition to GPRS/EDGE:
HSCSD was an important step in the evolution of GSM data services, but it was eventually superseded by
packet-switched technologies like GPRS and EDGE, which offer more efficient use of network resources
and higher data rates.
2.2.6. GPRS: GPRS system and protocol architecture; 2.3. UTRAN and the UMTS core network
Improvements on Core Network:
This figure and explanation should give you a good overview of the GPRS and UMTS architectures,
focusing on UTRAN, the UMTS core network, and the improvements made in the core network.
This diagram illustrates the evolution from GPRS (2.5G) to UMTS (3G), highlighting the changes in the
core network.
1. GPRS (2.5G):
Radio Access Network: The existing GSM Base Station Subsystem (BSS) is used, with the
addition of a Packet Control Unit (PCU) at the BSC to handle packet data.
Core Network: Two new nodes are introduced:
o Serving GPRS Support Node (SGSN): Responsible for packet routing and mobility
management within the GPRS network.
o Gateway GPRS Support Node (GGSN): Acts as a gateway between the GPRS network
and external packet data networks (e.g., the Internet).
2. UMTS (3G):
Radio Access Network: The BSS is replaced by the UMTS Terrestrial Radio Access Network
(UTRAN), which uses Wideband Code Division Multiple Access (WCDMA) technology for
higher data rates.
o Node B: The equivalent of the BTS in GSM.
o Radio Network Controller (RNC): The equivalent of the BSC in GSM.
Core Network: The core network is significantly enhanced to support higher data rates and new
services. Key changes include:
o SGSN and GGSN Evolution: The SGSN and GGSN are upgraded to handle the higher
data rates and new protocols of UMTS.
o Introduction of new network elements:
Mobile Switching Center Server (MSC-S): Handles circuit-switched calls.
Media Gateway (MGW): Connects the circuit-switched and packet-switched
domains.
Increased Bandwidth: UMTS provides significantly higher data rates compared to GPRS,
enabling new services like video streaming and mobile broadband.
Improved Quality of Service (QoS): UMTS supports different QoS classes, allowing for
prioritization of different types of traffic.
Packet-Switched Focus: UMTS is primarily a packet-switched network, which is more efficient
for data transmission than the circuit-switched approach used in GSM.
New Protocols: UMTS introduces new protocols to support higher data rates and new services.
WCDMA Technology: UTRAN uses WCDMA, a radio access technology that provides higher
data rates and better spectral efficiency compared to GSM's TDMA.
Node B and RNC: The Node B and RNC work together to manage the radio resources and
provide connectivity to mobile devices.
It's hard to visualize modulation schemes precisely in a simple diagram, but here's a conceptual idea:
802.11b (DSSS): Imagine spreading the signal over a wider frequency range, like spreading paint
thinly over a large canvas. This makes it more resistant to noise but limits the data rate.
802.11a (OFDM): Imagine dividing the frequency band into multiple smaller sub-channels and
transmitting data in parallel on these sub-channels, like having multiple painters working on
smaller sections of the canvas simultaneously. This allows for higher data rates but is more
susceptible to signal degradation over distance.
It's challenging to represent the nuances of the 802.11 standards (like 802.11a and 802.11b) in a single
static figure. The key differences lie in their physical layer specifications (how they transmit data over the
air). However, I can provide a general 802.11 architecture diagram and then highlight the differences
between 802.11a and 802.11b.
Basic Service Set (BSS): The fundamental building block of an 802.11 network. It consists of:
o Access Point (AP): A central device that acts as a bridge between the wireless network
and a wired network.
o Stations (STAs): Wireless devices (laptops, smartphones, etc.) that connect to the AP.
Extended Service Set (ESS): Multiple BSSs connected together through a distribution system
(DS), typically a wired network. This allows for roaming between BSSs.
Distribution System (DS): The backbone network that connects APs in an ESS.
Independent Basic Service Set (IBSS) or Ad-hoc Network: A network without an AP, where
stations communicate directly with each other.
Physical Layer (PHY): Defines how data is transmitted over the air (modulation, encoding,
frequency bands). This is where 802.11a and 802.11b differ significantly.
Medium Access Control (MAC) Layer: Controls access to the wireless medium (e.g., using
CSMA/CA).
3. Basic Communication:
Basic communication is the process of conveying information, ideas, thoughts, feelings, and emotions
between individuals or groups. It's a fundamental aspect of human interaction and is essential for building
relationships, coordinating activities, and sharing knowledge.
Building relationships: Strong communication skills are essential for building and maintaining
healthy relationships.
Sharing information: Communication allows us to share knowledge and ideas.
Making decisions: Effective communication is essential for making informed decisions.
Resolving conflicts: Communication can help to resolve conflicts and disagreements.
Achieving goals: Communication is essential for coordinating activities and achieving common
goals.
3.2. One-to-All Broadcast:
3.2.1. Direct Broadcast (Star Topology):
Central Node (Sender): The node in the center represents the sender.
Peripheral Nodes (Receivers): The nodes around the center represent the receivers.
Arrows: The arrows from the sender to each receiver indicate the broadcast of the data.
Explanation: This is the most straightforward way to visualize a one-to-all broadcast. The central node
sends a copy of the data to each of the surrounding nodes.
3.2.2. Tree-based Broadcast (More Efficient for Larger Systems):
Explanation: This method is more efficient for larger systems. The sender sends the data to a few
intermediate nodes, which then forward it to their children, and so on. This reduces the load on the sender
and can significantly reduce the overall broadcast time.
Efficiency Considerations:
Different algorithms and techniques can be used to optimize the broadcast process, such as tree-based
broadcasts, which are particularly efficient for large systems.
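In MPI, a one-to-all broadcast is a single call; the library internally chooses an efficient (typically tree-based) algorithm. A minimal sketch, assuming an MPI installation:
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = (rank == 0) ? 99 : 0;    // only the root holds the value initially
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d now has %d\n", rank, data);   // every rank prints 99

    MPI_Finalize();
    return 0;
}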
3.3. All-to-One Reduction:
An all-to-one reduction is a fundamental communication pattern in parallel computing where data from
all processes (or nodes) in a system is combined using an associative operation (like sum, product, min,
max) and accumulated at a single destination process.
Peripheral Nodes (Senders): The nodes around the center represent the senders. Each holds a
piece of data.
Central Node (Receiver/Accumulator): The node in the center receives data from all other
nodes.
Arrows: The arrows pointing towards the center indicate the data being sent.
Reduction Operation: The central node performs the reduction operation (e.g., sum) on the
received data.
Explanation: This is a straightforward visualization. Each node sends its data to the central node, and the
central node combines the data using the specified operation.
Tree-based Reduction:
Leaf Nodes (Senders): The bottom nodes hold the initial data.
Intermediate Nodes: These nodes receive data from their children, perform a partial reduction,
and pass the result up the tree.
Root Node (Receiver/Accumulator): The top node receives the final partial results and performs
the final reduction.
Arrows: The arrows indicate the flow of data and partial results up the tree.
Explanation: This method is more efficient for larger systems. Data is combined in a hierarchical
manner, reducing the communication load and overall time.
Typical uses:
Calculating the sum, product, average, minimum, or maximum of values distributed across
multiple processors.
Gathering results from parallel computations.
Implementing global operations in parallel algorithms.
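In MPI, an all-to-one reduction maps onto MPI_Reduce. A minimal sketch (assuming an MPI installation): every rank contributes one value and only the root receives the combined result.
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank + 1;               // each process holds one piece of data
    int sum = 0;
    // Combine all local values with '+' and accumulate the result at rank 0.
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum at root = %d\n", sum);

    MPI_Finalize();
    return 0;
}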
Efficiency Considerations:
A tree-based reduction completes in O(log p) communication steps for p processes, whereas sending
everything directly to a single root requires the root to receive p - 1 separate messages.
3.4. All-to-All Broadcast and All-to-All Reduction:
Both all-to-all broadcast and all-to-all reduction are collective communication operations in parallel
computing where every process communicates with every other process. They are more complex than
one-to-all or all-to-one operations.
1. All-to-All Broadcast:
Concept: Each process starts with a piece of data. At the end of the operation, every process has
a copy of the data from all other processes.
Visualization (for 4 processes):
Explanation:
o Each process (P0, P1, P2, P3) has its initial data (represented by different colors).
o Each process sends its data to all other processes.
o After the broadcast, each process has received the data from all other processes.
Implementation: All-to-all broadcast can be implemented in various ways, such as:
o Direct communication: Each process sends its data directly to every other process. This
requires p(p-1) messages, where p is the number of processes.
o Ring-based communication: Processes are arranged in a logical ring, and data is passed
around the ring.
o Tree-based communication: A tree structure is used to distribute the data.
2. All-to-All Reduction:
Concept: Each process starts with a piece of data. A reduction operation (e.g., sum, product, min,
max) is performed across all data elements, and the result of the reduction is distributed to all
processes.
Visualization (for 4 processes with summation):
Figure: all-to-all reduction diagram
Explanation:
o Each process (P0, P1, P2, P3) has its initial data (e.g., numbers).
o The data from all processes is combined using the reduction operation (in this case,
summation).
o The result of the reduction (the sum) is distributed to all processes.
Implementation: Similar to all-to-all broadcast, all-to-all reduction can be implemented using
different communication patterns:
o Direct communication: Each process sends its data to every other process, and each
process performs the reduction locally.
o Ring-based or tree-based communication: Partial reductions are performed at
intermediate steps to reduce communication overhead.
All-to-all broadcast distributes the original data from each process to all processes.
All-to-all reduction combines data from all processes using a reduction operation and distributes
the result to all processes.
All-to-all reduction can be seen as a combination of an all-to-all communication phase followed
by local reductions at each process.
Use Cases:
All-to-all broadcast: Used in algorithms that require each process to have access to data from all
other processes, such as matrix transposition or certain graph algorithms.
All-to-all reduction: Used in algorithms that require a global reduction operation, such as
calculating a global sum or finding the global minimum/maximum.
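In MPI, an all-to-all broadcast corresponds to MPI_Allgather. A minimal sketch (assuming an MPI installation and at most 64 processes, as noted in the comment): afterwards every process holds the value contributed by every other process.
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * 10;                     // this process's piece of data
    int all[64];                              // assumes at most 64 processes
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d values, first = %d, last = %d\n",
           rank, size, all[0], all[size - 1]);

    MPI_Finalize();
    return 0;
}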
Efficiency Considerations:
With direct communication, every process exchanges messages with every other process, so the number
of messages grows as p(p-1); ring-based and tree-based schemes reduce this communication overhead.
3.5. All-Reduce and Prefix-Sum:
Both All-reduce and Prefix-sum (also known as scan) are important collective communication operations
in parallel computing. They involve combining data from multiple processes, but they do so in different
ways.
3.5.1. All-reduce:
Concept: Each process starts with a piece of data. An associative operation (e.g., sum, product,
max, min) is performed across all data elements, and the result of the reduction is distributed to
all processes.
Visualization (for 4 processes with summation):
Explanation:
o Each process (P0, P1, P2, P3) has its initial data (e.g., numbers).
o The data from all processes is combined using the reduction operation (in this case,
summation).
o The total sum is then sent back to every process.
Implementation: All-reduce can be implemented using different communication patterns:
o All-to-one reduction followed by one-to-all broadcast: First, the data is reduced to one
process, and then that process broadcasts the result to all other processes.
o Tree-based algorithms: More efficient for larger numbers of processes.
3.5.2. Prefix-Sum (Scan):
Concept: Each process starts with a piece of data. The prefix-sum operation computes the
cumulative sum (or another associative operation) of the data up to each process. Each process
receives the sum of all elements up to and including itself.
Visualization (for 4 processes with summation):
Figure: prefix-sum operation diagram
Explanation:
o Each process (P0, P1, P2, P3) has its initial data (e.g., numbers).
o P0 receives its own value (the prefix sum up to P0 is just its own value).
o P1 receives the sum of P0 and P1.
o P2 receives the sum of P0, P1, and P2.
o P3 receives the sum of P0, P1, P2, and P3 (which is the total sum).
Implementation: Prefix-sum can also be implemented using various algorithms, including tree-
based algorithms, which are efficient for large numbers of processes.
Key Differences:
All-reduce gives every process the same final value, whereas prefix-sum gives each process a
different partial result that depends on its position (rank).
Example:
P0: 2
P1: 4
P2: 1
P3: 3
All-reduce (sum): All processes would receive the value 10 (2 + 4 + 1 + 3).
Prefix-sum (sum):
o P0 would receive 2.
o P1 would receive 6 (2 + 4).
o P2 would receive 7 (2 + 4 + 1).
o P3 would receive 10 (2 + 4 + 1 + 3).
In Summary:
All-reduce computes a global reduction and distributes the result to all processes.
Prefix-sum computes partial reductions and distributes the intermediate results to each process,
where each process receives the reduction of all elements up to itself.
Both are essential collective operations in parallel programming with distinct use cases.
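A minimal MPI sketch of both operations using the example values above (assuming an MPI installation and a run with exactly 4 processes): MPI_Allreduce gives every rank the total 10, while MPI_Scan gives each rank the prefix sum up to and including itself.
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int values[4] = {2, 4, 1, 3};      // the example data, one value per rank
    int mine = values[rank % 4];

    int total = 0, prefix = 0;
    MPI_Allreduce(&mine, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    // With 4 ranks: total is 10 everywhere; prefix is 2, 6, 7, 10 on ranks 0..3.
    printf("rank %d: all-reduce = %d, prefix-sum = %d\n", rank, total, prefix);

    MPI_Finalize();
    return 0;
}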
3.6. Collective Communication in MPI:
It's challenging to create one figure that perfectly captures all collective communication operations in
MPI (Message Passing Interface). However, I can provide a set of diagrams illustrating some of the most
common and important ones, along with explanations:
3.6.1. MPI_Bcast:
Explanation: One process (the root) sends the same data to all other processes in the
communicator.
3.6.2. MPI_Gather:
Figure: MPI_Gather operation diagram
Explanation: Data from all processes is gathered to a single process (the root).
3.6.3. MPI_Scatter:
Explanation: Data from one process (the root) is distributed to all other processes.
3.6.4. MPI_Reduce:
Explanation: Data from all processes is combined using a reduction operation (e.g., sum), and the
result is stored at a single process (the root).
3.6.5. MPI_Allreduce:
Explanation: Data from all processes is combined using a reduction operation, and the result is
distributed to all processes.
3.6.6. MPI_Allgather:
Explanation: Data from all processes is gathered at all processes (each process receives a copy
of the data from all other processes).
3.6.7. MPI_Scan:
Explanation: Each process receives the reduction of all elements up to and including itself.
Advantages of Collective Communication:
Simplicity: Easier to use than point-to-point communication for many common communication
patterns.
Efficiency: MPI implementations often use optimized algorithms for collective operations, which
can be more efficient than implementing them using point-to-point communication.
Portability: MPI is a standard, so code that uses collective communication is portable across
different platforms.
Important Considerations:
Synchronization: Collective operations are typically blocking, meaning that all processes in the
communicator must reach the collective call before any of them can continue.
Communicator: It's important to use the correct communicator for collective operations.
These figures and explanations should give you a good overview of collective communication in MPI.
Remember that these are just some of the most common collective operations. MPI provides many other
collective functions for more specialized communication patterns.
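To round out the list above, here is a minimal sketch combining MPI_Scatter and MPI_Gather (assuming an MPI installation and at most 64 processes): the root spreads one value to each rank, each rank modifies its piece locally, and the root collects the results.
C++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int send[64], recv[64];                  // assumes at most 64 processes
    if (rank == 0) {
        for (int i = 0; i < size; ++i) send[i] = i;
    }

    int piece = 0;
    MPI_Scatter(send, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece *= piece;                          // local work on the scattered piece
    MPI_Gather(&piece, 1, MPI_INT, recv, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("gathered last value = %d\n", recv[size - 1]);

    MPI_Finalize();
    return 0;
}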
Unit IV Analytical Modeling of Parallel Programs
When you convert a sequential program to run in parallel, you hope to see a significant speedup.
However, there are inherent overheads in parallel execution that can limit the achievable performance
gain. Here are the main sources of overhead in parallel programs:
4.2. Communication Overhead:
Description: In most parallel programs, processes need to exchange data or synchronize their
actions. This communication takes time, which is overhead.
Factors affecting communication overhead:
o Communication volume: The amount of data being transferred.
o Communication frequency: How often communication occurs.
o Network latency and bandwidth: The speed and capacity of the communication
network.
o Communication protocol: The efficiency of the communication protocol used.
4.3. Idling:
Description: Processes may sit idle waiting for other processes to complete their tasks or
to receive data. This happens when the workload is not evenly distributed among the processes
(load imbalance) or when there are dependencies between tasks.
Causes of idling:
o Uneven workload distribution: Some processes have more work to do than others.
o Synchronization points: Processes may need to wait for each other at synchronization
points (e.g., barriers).
o Dependencies between tasks: Some tasks cannot start until others have completed.
4.4. Excess Computation:
Description: Sometimes, parallel algorithms require performing extra computations that are not
necessary in the sequential version. This can be due to:
o Algorithm design: Some parallel algorithms inherently require more computations than
their sequential counterparts.
o Data replication: Data may need to be replicated across multiple processes, leading to
redundant computations.
4.5. Contention:
Description: Occurs when multiple processes try to access the same shared resource (e.g., shared
memory, a critical section). This can lead to delays as processes wait for access to the resource.
Impact of Overheads:
These overheads can significantly impact the performance of parallel programs. They can reduce the
speedup achieved by parallelization and, in some cases, even lead to a slowdown compared to the
sequential version.
Minimizing Overheads:
Overheads can be reduced by balancing the workload across processes, overlapping communication with
computation, reducing the frequency and volume of communication, and minimizing synchronization and
contention for shared resources.
It's difficult to represent all aspects of performance measures and analysis in a single static figure.
However, I can provide a combination of diagrams and explanations to illustrate the key concepts and
metrics used in evaluating parallel program performance.
Speedup: The ratio of the execution time of the sequential program to the execution time of the
parallel program.
o Speedup (S) = T(sequential) / T(parallel)
Efficiency: The speedup divided by the number of processors.
o Efficiency (E) = S / P (where P is the number of processors)
Overhead: The total time spent in non-useful work in the parallel program (e.g., communication,
idling, extra computation).
o Overhead = (P * T(parallel)) - T(sequential)
o For example, if T(sequential) = 100 s and T(parallel) = 30 s on P = 4 processors, then
S ≈ 3.3, E ≈ 0.83, and Overhead = 4 × 30 - 100 = 20 s.
Scalability: The ability of a parallel program to achieve increasing speedup as the number of
processors is increased.
Explanation: This graph shows how speedup changes as the number of processors increases. Ideally,
doubling the number of processors would halve the execution time (linear speedup). However, in practice,
overheads prevent this ideal scaling.
Figure: efficiency graph
Explanation: This graph shows how efficiently the processors are being utilized. Ideally, efficiency
should remain constant at 100% as the number of processors increases. However, in practice, efficiency
decreases due to overheads.
4.1.8. Amdahl's Law:
Serial Fraction (s): The portion of the program that cannot be parallelized.
Parallel Fraction (p): The portion of the program that can be parallelized (p = 1 - s).
Speedup Limit: The maximum speedup that can be achieved is limited by the serial fraction,
regardless of the number of processors. With N processors:
o Speedup(N) = 1 / (s + (1 - s) / N)
o Maximum Speedup (as N grows without bound) = 1 / s
Explanation: This graph illustrates the impact of the serial fraction on the achievable speedup. Even with
an infinite number of processors, the speedup is limited by the portion of the program that must be
executed sequentially.
Gustafson's Law:
Explanation: Gustafson's Law argues that instead of fixing the problem size and trying to reduce the
execution time, we should increase the problem size as we add more processors.
These figures and explanations provide a basic understanding of the key performance measures and
analysis techniques used in parallel computing. By analyzing these metrics, you can identify performance
bottlenecks and optimize your parallel programs.
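As a small worked sketch of both laws (the formulas are the standard textbook forms; s is the serial fraction and n the number of processors):
C++
#include <cstdio>

// Amdahl's law: fixed problem size.  Speedup(n) = 1 / (s + (1 - s)/n)
double amdahl(double s, int n)    { return 1.0 / (s + (1.0 - s) / n); }

// Gustafson's law: problem size grows with n.  Scaled speedup = s + (1 - s)*n
double gustafson(double s, int n) { return s + (1.0 - s) * n; }

int main() {
    double s = 0.1;   // 10% of the work is inherently serial
    for (int n = 1; n <= 1024; n *= 4) {
        printf("n = %4d  Amdahl = %6.2f  Gustafson = %7.1f\n",
               n, amdahl(s, n), gustafson(s, n));
    }
    // Amdahl's speedup approaches 1/s = 10 no matter how large n gets,
    // while Gustafson's scaled speedup keeps growing with n.
    return 0;
}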
Two important concepts in parallel computing are execution rate and redundancy:
A. Execution Rate:
Concept: Execution rate refers to the number of operations or instructions that a processor or a
parallel system can execute per unit of time. It's a measure of how fast the system can perform
computations.
Units: Common units for execution rate include:
o Instructions per second (IPS)
o Floating-point operations per second (FLOPS) - particularly for scientific computing.
o Operations per clock cycle (IPC)
Factors Affecting Execution Rate:
o Clock speed: The frequency at which the processor operates.
o Instructions per cycle (IPC): The number of instructions a processor can execute in a
single clock cycle (influenced by factors like pipelining and superscalar execution).
o Number of processors/cores: In a parallel system, having more processors can increase
the overall execution rate.
o Memory bandwidth: The rate at which data can be transferred to and from memory.
o Interconnect bandwidth: The rate at which data can be transferred between processors
in a parallel system.
4.1.10. B. Redundancy:
Concept: In parallel computing, redundancy refers to performing the same computation multiple
times on different processors or storing the same data in multiple locations. This is done to
improve:
o Fault tolerance: If one processor fails, the computation can still continue on other
processors that have the same data.
o Performance (sometimes): In some cases, redundant computations can improve
performance by reducing communication overhead or simplifying the algorithm.
Types of Redundancy:
o Computational redundancy: Performing the same calculations on multiple processors.
o Data redundancy: Storing the same data in multiple locations.
Overhead of Redundancy: Redundancy introduces overhead in terms of:
o Increased computation: More computations are performed overall.
o Increased memory usage: More memory is required to store redundant data.
o Increased communication (sometimes): Redundant data may need to be kept consistent
across multiple locations.
Trade-off: There's often a trade-off between execution rate and redundancy. Increasing
redundancy can improve fault tolerance or simplify an algorithm, but it also increases the total
amount of computation that needs to be done, which can potentially lower the overall execution
rate if not carefully managed.
Example: Consider a parallel algorithm where each processor needs to access a large shared data
structure.
o No redundancy: Each processor accesses the shared data structure directly. This can
lead to contention and limit the execution rate.
o Data redundancy: Each processor has a local copy of the data structure. This eliminates
contention and can improve the execution rate, but it increases memory usage and may
require additional communication to keep the copies consistent.
Granularity has a crucial effect on the performance of parallel programs. Here's a combined graphical
and explanatory approach:
Concept of Granularity:
Granularity refers to the size of the computational tasks that are distributed to individual processors in a
parallel system. It's often described as fine-grained (many small tasks with frequent communication) or
coarse-grained (fewer, larger tasks with little communication).
X-axis: Granularity (Task Size): Moving from left to right represents increasing task size (from
fine-grained to coarse-grained).
Y-axis: Performance (e.g., Speedup, Efficiency): Represents the performance of the parallel
program.
Fine-grained: Many workers (processors) are each assigned a very small area to harvest. There's
a lot of communication and coordination needed to ensure everyone is working efficiently and the
harvested crops are collected.
Coarse-grained: A few workers are each assigned a very large area to harvest. There's little need
for communication, but the overall harvesting time may not be optimal if some workers are faster
than others (load imbalance).
Optimal granularity: A moderate number of workers are assigned appropriately sized areas.
This balances the workload and minimizes the need for excessive communication.
Finding the right granularity is essential for achieving good performance in parallel programs. The
optimal granularity depends on a variety of factors, and careful consideration must be given to the trade-
off between parallelism and communication overhead.
Explanation: This graph illustrates how a parallel program performs as the number of processors
increases.
Strong Scaling: How the execution time varies with the number of processors for a fixed total
problem size.
Weak Scaling: How the execution time varies with the number of processors for a fixed problem
size per processor.
Explanation: This graph illustrates the trade-off between execution time and cost. Increasing the number
of processors usually decreases execution time but increases cost. The goal is to find the optimal balance.
Optimal execution time is related to the minimum execution time but emphasizes finding the best balance
between computation and communication. It's often found by analyzing the algorithm's complexity and
the target architecture's characteristics. There's no single graph for this, as it's a result of analysis.
Asymptotic analysis describes how the performance of a parallel program scales with increasing input
size (n) and/or number of processors (p), as these values approach infinity. We use "Big O" notation (O),
"Big Omega" notation (Ω), and "Big Theta" notation (Θ) to express this.
Example: A parallel algorithm with a time complexity of O(n/p + log p) means that as n and p
become very large, the execution time will be dominated by terms proportional to n/p and log p.
Visualization (Conceptual):
Explanation: This graph shows how execution time grows as the input size or number of processors
increases. Asymptotic analysis helps us understand the fundamental scalability limits of a parallel
algorithm.
In Short: Asymptotic analysis characterizes how running time grows with problem size and processor
count, rather than giving exact execution times.
Matrix-vector and matrix-matrix multiplication are fundamental matrix operations. Here are figures and
explanations to illustrate them:
Matrix-Vector Multiplication (b = A·x):
How it Works:
Each element bᵢ in the resulting vector b is calculated as the dot product of the i-th row of matrix A and
the vector x: bᵢ = Σⱼ Aᵢⱼ·xⱼ.
Example:
[ 1 2 ] [ 3 ] [ (1*3) + (2*4) ] [ 11 ]
[ 4 5 ] x [ 4 ] = [ (4*3) + (5*4) ] = [ 32 ]
Matrix-Matrix Multiplication (C = A·B):
How it Works:
Each element Cᵢⱼ in the resulting matrix C is calculated as the dot product of the i-th row of matrix A and
the j-th column of matrix B: Cᵢⱼ = Σₖ Aᵢₖ·Bₖⱼ.
Example:
[ 1 2 ] [ 5 6 ] [ (1*5)+(2*7) (1*6)+(2*8) ] [ 19 22 ]
[ 3 4 ] x [ 7 8 ] = [ (3*5)+(4*7) (3*6)+(4*8) ] = [ 43 50 ]
Both matrix-vector and matrix-matrix multiplication are well-suited for parallelization. Here are some
common approaches:
Row-wise or Column-wise partitioning: The matrix A is divided into rows or columns, and
each processor is assigned a subset of rows or columns to compute a portion of the result.
Block partitioning: The matrices are divided into blocks, and each processor is assigned a subset
of blocks to compute a portion of the result. This is commonly used for matrix-matrix
multiplication.
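A minimal sketch of row-wise partitioning with OpenMP (assuming a compiler with OpenMP support): each thread computes the dot products for its own share of the rows.
C++
// y = A * x, where A is an m x n matrix stored row-major.
// Rows are split across threads; each y[i] is written by exactly one thread.
void matvec(const double *A, const double *x, double *y, int m, int n) {
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        double dot = 0.0;
        for (int j = 0; j < n; ++j) {
            dot += A[i * n + j] * x[j];
        }
        y[i] = dot;
    }
}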
GPU Architecture:
Streaming Multiprocessors (SMs): The core processing units of a GPU. Each SM contains:
o CUDA Cores (ALUs): Perform the actual computations.
o Shared Memory: Fast, on-chip memory shared by threads within a block.
o Registers: Fast, private memory for each thread.
o Special Function Units (SFUs): Handle transcendental functions (sin, cos, etc.).
o Load/Store Units: Handle memory access.
Global Memory (DRAM): Off-chip memory accessible by all SMs. Larger but slower than
shared memory.
Interconnect: Connects the SMs and memory controllers.
Key Idea: GPUs are designed for data-parallel computations, with many cores working simultaneously
on different parts of the data.
CUDA extends C/C++ to allow programmers to write code that runs on the GPU.
C++
__global__ void kernel_function(float *out, float *in) {
    // Compute this thread's global index from its block and thread coordinates.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] * 2.0f;   // each thread processes one element
}

int main() {
    // ... allocate memory on host and device ...
    // Launch the kernel: a grid of 'blocks' thread blocks with 'threads' threads each.
    kernel_function<<<blocks, threads>>>(dev_out, dev_in);
    // ... copy results back to host ...
}
Checking for errors after CUDA calls:
C++
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(error));
    return 1;
}
It's challenging to represent the broad "Scope of Parallel Computing" in a single figure. However, I can
provide a diagram that illustrates the key areas where parallel computing is applied, along with
visualizations of parallel search and sorting algorithms.
Parallel Search (Parallel Depth-First Search):
Explanation:
The search space (e.g., a tree or graph) is divided among multiple processors.
Each processor performs a DFS on its assigned portion.
Communication is needed to share information about visited nodes or to coordinate the search.
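A minimal sketch of partitioned parallel search with OpenMP (a flat array rather than a tree or graph, to keep the example short; the data and target value are made up): the search space is split across threads and each thread scans its own portion.
C++
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static int data[1000000];
    for (int i = 0; i < n; ++i) data[i] = i * 3;   // build some data to search

    int target = 999999, found_index = -1;

    // Each thread searches its own chunk of the array.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (data[i] == target) {
            #pragma omp critical
            found_index = i;
        }
    }

    printf("found at index %d\n", found_index);   // 333333 for this data
    return 0;
}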
These figures and explanations provide a basic understanding of the scope of parallel computing and
some common parallel algorithms for search and sorting. The actual implementation and performance of
these algorithms can vary depending on the specific parallel architecture and programming model used.
It's challenging to create a single figure that encompasses all aspects of distributed computing, document
classification, Kubernetes, GPU applications, and parallel computing for AI/ML. These are broad topics.
However, I can provide a set of diagrams and explanations to illustrate the key concepts and their
relationships:
Distributed Document Classification:
Data Partitioning: The large collection of documents is divided into smaller subsets.
Distributed Processing: Each node processes a subset of documents, performing tasks like
feature extraction and classification.
Central Aggregation (or Distributed Aggregation): The results from each node are combined
to produce the final classification results.
Parallel Computing for AI/ML Training:
Data Parallelism: Distributing the training data across multiple processors or GPUs.
Model Parallelism: Distributing the model itself across multiple processors or GPUs.
Task Parallelism: Dividing the training process into smaller independent tasks that can be
executed in parallel.
Distributed Computing: Provides the infrastructure for running parallel applications across
multiple machines.
Kubernetes: A container orchestration platform that simplifies the deployment and management
of distributed applications, including those using GPUs.
GPU Applications: GPUs are highly effective for accelerating parallel computations, especially
in AI/ML.
Parallel Computing for AI/ML: Uses parallel processing techniques (including GPUs and
distributed computing) to train large models and process massive datasets.