Parallel Computing Module 1 Notes
We care because the rate of performance improvement for single processors has slowed dramatically. From
1986 to 2003, performance increased by over 50% per year, but from 2015 to 2017, it increased by less than
4% per year.
This change is significant. At 50% per year, performance grows by a factor of 60 in 10 years, while at 4%, it
only grows by a factor of 1.5. This means waiting for the next generation of microprocessors is no longer an
effective strategy for performance gains.
While some applications may already be fast enough, many others need performance gains that single
processors can no longer deliver at the historical rate of improvement.
Physical Limits:
The text implies that manufacturers faced limitations in their ability to continue developing faster, single-
processor systems. By 2005, they determined that the path to rapid performance increases lay in a different
direction.
Shift to Parallelism:
The solution was to abandon the effort to create "ever-faster monolithic processors" and instead focus on
building parallel systems.
Multiple Processors:
The new design approach was to "put multiple complete processors on a single integrated circuit" to achieve
the desired performance gains.
Lack of Awareness:
The fundamental issue is that programs written for a single processor are "unaware of the existence of multiple
processors."
No Automatic Benefit:
Simply running a serial program on a system with multiple processors does not automatically improve its
performance. The performance will be "effectively the same as its performance on a single processor."
Incompatible Design:
The text highlights that parallel execution is not a "magic" addition. The program's design itself must be
written to take advantage of the multiple processors for any performance benefit to be realized.
Driving Major Advances: Increased computational power has been the core reason behind significant
advances in various fields, including science (e.g., decoding the human genome, accurate medical imaging),
the internet (e.g., fast web searches), and entertainment (e.g., realistic computer games).
Expanding the Scope of Solvable Problems: As our computational power grows, it allows us to tackle a
greater number of complex problems that were previously out of reach.
Climate Modeling: To better understand climate change, we need highly accurate computer models
that can simulate the complex interactions between the atmosphere, oceans, solid land, and ice caps.
Protein Folding: Misfolded proteins are believed to be linked to diseases like Huntington's,
Parkinson's, and Alzheimer's, but current computational power limits our ability to study their
configurations.
Drug Discovery: Increased computational power is needed to analyze genomes extensively to devise
personalized treatments for individuals who do not respond to existing drugs.
Energy Research: Greater computational power will enable the creation of more detailed models of
clean energy technologies like wind turbines, solar cells, and batteries, which could lead to far more
efficient designs.
Data Analysis: A tremendous amount of data is being generated (doubling every two years), and it is
largely useless without extensive analysis. This includes data from genomics, particle colliders,
medical imaging, and astronomical research.
The historical increase in single-processor speed came from making transistors smaller and faster. However,
faster transistors consume more power, which is dissipated as heat. By the early 21st century, air-cooled
integrated circuits reached the limit of their ability to dissipate this heat, making it impossible to continue
increasing their speed.
The increase in transistor density, a key driver of performance, has also slowed dramatically in recent years,
further limiting the potential for single-processor improvements.
A "Moral Imperative":
The text argues that there is a need to continue increasing computational power because of its potential to drive
significant advances and improve our existence.
Since building faster single processors is no longer a viable path, the industry has turned to parallelism as the
solution. Instead of building one faster, complex processor, manufacturers now put multiple, simpler, complete
processors (known as cores) on a single chip.
Based on the text, we need to write parallel programs because traditional serial programs cannot effectively
utilize modern multi-core systems, and automatic conversion from serial to parallel is not a viable solution.
The core reason is that the most efficient parallel algorithms are often fundamentally different from their serial
counterparts and are unlikely to be "discovered" by a simple translation program. The text illustrates this with
an example of summing a large number of values.
1. Serial Program
A serial program to sum n values would run on a single core and would look like this:

sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(. . .);
    sum += x;
}
2. A Simple Parallel Approach (Master/Slave)
A basic parallel version would divide the work among p cores, with each core calculating a "partial sum" of its
assigned values.

my_sum = 0;
my_first_i = . . . ;
my_last_i = . . . ;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(. . .);
    my_sum += my_x;
}
After computing their my_sum, all cores send their results to a designated "master" core (e.g., Core 0), which
then adds them all together:

if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}
This method is a simple generalization of the serial process, but it creates a bottleneck because the master core
does all the final work. For 1000 cores, the master would have to perform 999 receives and adds.
A far more efficient parallel algorithm for this problem is one that does not resemble the serial approach.
Instead of a single master, cores are paired up to sum their values. This process is repeated, with the number of
active cores being halved at each step, until only one core holds the final sum.
The text describes this as follows: "Instead of making the master core do all the work...we can pair the cores so
that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result
of core 5, and so on."
FIGURE 1.1 Multiple cores forming a global sum. (Image description: A diagram showing 8 cores (0-7). In
the first step, core 0 receives from core 1, core 2 from core 3, etc. In the next step, core 0 receives from core 2,
and core 4 from core 6.
In the final step, core 0 receives from core 4. The final result is in core 0. This shows a tree-like structure of
communication and summation.)
This second method is dramatically more efficient. With 8 cores, the master core only performs 3 receives and
adds (compared to 7).
With 1000 cores, the difference is even more stark: the master core would only need to perform 10 receives
and adds, an improvement of almost a factor of 100 over the first method.
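The text does not give code for this tree-structured sum, but a sketch of it, written in the same pseudocode style
as the earlier examples and assuming the number of cores p is a power of two, might look like this:

divisor = 2;
core_difference = 1;
sum = my_sum;   /* each core starts with its own partial sum */
while (divisor <= p) {
    if (my_rank % divisor == 0) {
        receive value from core (my_rank + core_difference);
        sum += value;
    } else if (my_rank % divisor == core_difference) {
        send sum to core (my_rank - core_difference);
        break;   /* this core has handed off its result */
    }
    divisor *= 2;
    core_difference *= 2;
}
/* When the loop finishes, core 0 holds the global sum. */

This matches the pattern of Figure 1.1: the number of cores still doing additions is halved at each step, so with
p cores the master performs only about log2(p) receives and adds.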
Based on the text provided, we write parallel programs by following the basic idea of partitioning the work
among the available cores. There are two primary approaches to this partitioning:
Task-Parallelism:
In this approach, you divide the problem into different tasks, and each core executes a different task. The cores
may be performing different operations.
Example: A professor and her TAs are grading an exam with five questions. To use task-parallelism,
they could each be assigned to grade all 100 student responses for a single question (e.g., the professor
grades all responses for Question 1, a TA grades all responses for Question 2, and so on). Each person
is performing a different grading task.
Data-Parallelism:
In this approach, you partition the data used in the problem among the cores, and each core performs more or
less the same operation on its assigned piece of the data.
Example: For the same exam grading scenario, data-parallelism would involve dividing the 100 exams
into five piles of 20 exams each. The professor and each of her TAs would then grade all questions on
every exam in their respective pile. Each person is performing the same task (grading a full exam) on a
different set of data.
Writing a parallel program becomes much more complex when the cores cannot work independently and need
to coordinate their efforts. This coordination involves three key issues:
Communication: Cores must be able to send data to each other. For example, in the global sum
problem, cores send their partial sums to a master core to be combined.
Load Balancing: The work must be divided roughly evenly among the cores. If one core is assigned
significantly more work than the others, the remaining cores finish early and sit idle, wasting their
computational power while they wait for the heavily loaded core to finish.
Synchronization: Cores must be able to wait for each other at specific points in the program. This
prevents some cores from starting work on data that has not yet been prepared by another core (e.g.,
waiting for a master core to read in all the data from a file before starting calculations).
In practice, the most powerful parallel programs are written using explicit parallel constructs (e.g., extensions
to C, C++, and Java).
These programs contain direct instructions that tell each core what task to execute and when to coordinate with
other cores. However, this level of explicit control makes the programs extremely complex to write.
Based on the text provided, this section outlines the approach to learning explicit parallel programming by
focusing on a few key APIs that correspond to different types of parallel systems.
The purpose is to learn how to write parallel programs in C using four specific APIs: MPI, Pthreads, OpenMP, and CUDA.
The need for multiple APIs stems from the different ways parallel systems are designed. These systems are
typically classified in two ways: by how they handle memory and by how they handle instructions.
Shared-Memory Systems: All cores can share access to the computer's memory. Cores coordinate by
examining and updating these shared memory locations.
Distributed-Memory Systems: Each core has its own private memory, and cores communicate by
explicitly sending and receiving messages to each other.
This classification is illustrated in the provided diagram:
FIGURE 1.2 (a) A shared memory system and (b) a distributed memory system. (Image Description: (a) A
schematic showing multiple cores connected to a single, shared memory block. (b) A schematic showing
multiple cores, each with its own memory block, connected by a network.)
Multiple-Instruction, Multiple-Data (MIMD) Systems: Each core has its own control unit and can
execute its own stream of instructions on its own data, independently of other cores. This is how
conventional processors work.
Single-Instruction, Multiple-Data (SIMD) Systems: Cores share a single control unit, so they all
execute the exact same instruction at the same time, but each on its own separate data. This is the
fundamental principle behind modern Graphics Processing Units (GPUs).
Concurrent Computing:
A program where multiple tasks can be in progress at the same time. This is a broad term. For example, a
multitasking operating system running on a single-core machine is concurrent, as it quickly switches between
tasks, giving the appearance that they are all in progress simultaneously. Both parallel and distributed
programs are considered concurrent.
Parallel Computing:
A program in which multiple tasks cooperate closely to solve a single problem. These tasks usually run
simultaneously on cores that are physically close to each other and are connected by a very high-speed
network or share the same memory. The two global-sum programs discussed earlier are examples of parallel
computing.
Distributed Computing:
A program that may need to cooperate with other programs to solve a problem. These programs are often
"loosely coupled" and can run on computers that are physically separated by large distances. The programs
themselves may have been created independently. A Web search program is provided as an example of
distributed computing.
Increased Complexity: Writing parallel programs is almost always more complex than writing serial
programs because it requires coordinating the actions of multiple cores.
Design is Crucial: All the rules for careful program design and development are even more important when
writing parallel programs than they are for serial programs.
Program Text: Code, whether displayed in a block or within a sentence, will use a specific typeface, like the
following example: printf("hello, world\n");
Definitions: When a term is being defined, it will be displayed in boldface type. For instance, "A parallel
program can make use of multiple cores."
UNIX Shell Commands: Commands intended to be typed at a shell prompt will be preceded by a dollar sign
($).
Function Syntax: The syntax of a function call will be specified by providing a sample argument list, such as
int abs(int x);.
This classification is based on the number of instruction streams and data streams a computer can manage
simultaneously.
SISD (Single Instruction, Single Data): A classical von Neumann system. It executes one instruction
at a time and computes a single data value at a time.
SIMD (Single Instruction, Multiple Data): These systems manage multiple data streams but support
only a single instruction stream. All cores execute the same instruction on their own data.
MIMD (Multiple Instruction, Multiple Data): These systems support both multiple instruction
streams and multiple data streams. Each core can execute its own set of instructions independently.
This classification is based on how the cores access the computer's memory.
Shared Memory Systems: In these systems, all cores can share access to the same memory locations.
They coordinate their work by modifying these shared memory locations.
Distributed Memory Systems: In these systems, each core has its own private memory. Cores
coordinate their work by communicating with each other across a network.
Single Instruction, Multiple Data (SIMD) systems are a type of parallel computer that operates on multiple
data streams using a single instruction stream. In this model, a central control unit broadcasts an instruction to
multiple datapaths. Each datapath then either executes the instruction on its own data or remains idle.
Key Principle: SIMD systems are ideal for problems with "data-parallelism," where the same
operation is applied to a large array of data. This allows for very efficient parallelization of simple
loops.
Synchronous Operation: In a classical SIMD system, all datapaths operate synchronously and must
wait for the next broadcasted instruction. They have no internal instruction storage.
History: SIMD systems had a checkered history, but their principles have re-emerged in modern vector
processors and GPUs.
Vector processors are a type of SIMD system designed to operate on arrays of data (vectors) rather than
individual data elements (scalars).
Vector Registers: These are specialized registers that can store and operate on a fixed-length array of
operands simultaneously.
Vector Instructions: A single instruction can perform an operation on an entire vector. For example, a
single instruction could replace a for loop that adds two arrays, which is a significant efficiency gain (see
the sketch after this list).
Interleaved Memory: The memory system is structured into multiple banks, allowing for faster,
simultaneous access to elements of a vector, as they are distributed across different banks.
Specialized Hardware: Vector processors include special hardware to accelerate strided memory
access (accessing elements at fixed intervals) and scatter/gather operations (accessing elements at
irregular intervals).
Strengths and Weaknesses: They are very fast and easy to use for applications that can be
"vectorized." However, they do not handle irregular data structures well and face scalability limits in
their vector length.
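As a concrete illustration of the "vector instructions" point above, consider array addition. The loop itself is
ordinary C; the description of how a vector processor executes it is a general sketch, not code from the text:

/* On a scalar processor, every iteration issues its own loads, add,
   and store: */
void vec_add(double z[], const double x[], const double y[], int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}
/* On a vector processor with (say) 64-element vector registers, the
   same work is done in blocks: load 64 elements of x and 64 of y into
   vector registers, issue one vector add, store 64 results, and
   repeat. A single vector instruction therefore replaces 64
   iterations of the scalar loop. */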
GPUs are powerful processors that leverage SIMD principles for real-time graphics and general-purpose
computing.
SIMD Foundation: GPUs use a graphics pipeline where programmable "shader functions" are applied
to thousands of elements (like vertices or pixels). Since the same operation is applied to many
elements, GPUs optimize performance by using SIMD parallelism with a large number of datapaths on
each core.
Hardware Multithreading: Due to the massive amount of data they process, GPUs use extensive
hardware multithreading to manage multiple tasks and avoid memory stalls.
Not Purely SIMD: Although they use SIMD parallelism, modern GPUs are not purely SIMD systems.
They can also execute multiple instruction streams on a single core and have dozens of cores running
independent instruction streams, making them both SIMD and MIMD.
Memory: GPUs can have both shared and distributed memory.
Growing Popularity: They are becoming increasingly popular for general high-performance
computing problems, as they are capable of very high computational power.
Multiple Instruction, Multiple Data (MIMD) systems support multiple independent instruction streams
operating on multiple data streams simultaneously.
They typically consist of a collection of fully autonomous processors, each with its own control unit and
datapath.
Unlike SIMD systems, MIMD systems are usually asynchronous, meaning the processors can operate at their
own pace without a global clock.
There are two principal types of MIMD systems, classified by how their processors access memory:
Shared-Memory Systems:
A collection of autonomous processors is connected to a single memory system. All processors can access any
memory location.
Communication and coordination are typically implicit, as processors communicate by accessing and updating
shared data structures.
Distributed-Memory Systems:
Each processor is paired with its own private memory, and these processor-memory pairs are connected by an
interconnection network.
Communication and coordination are explicit, as processors must send messages or use special functions to
access another processor's memory.
Shared-Memory Systems
The most common shared-memory systems today are multi-core processors. In systems with multiple multi-
core processors, memory access can be further classified:
UMA (Uniform Memory Access):
In a UMA system, all processors are connected directly to a shared main memory. This means all cores have
the same access time to all memory locations, which generally makes them easier to program.
NUMA (Nonuniform Memory Access):
In a NUMA system, each processor has a direct connection to a block of main memory. A core can access
memory in its directly connected block more quickly than memory that must be accessed through another chip.
While this can complicate programming, it can also lead to faster access times to local memory and support
larger amounts of total memory.
Distributed-Memory Systems
The most widely available distributed-memory systems are clusters. A cluster is a collection of commodity
systems (like PCs) connected by a commodity network (like Ethernet). The individual nodes within a cluster
are often themselves shared-memory systems, making modern clusters a type of hybrid system.
A grid is a larger-scale distributed-memory system whose nodes are geographically dispersed and may consist
of different types of hardware (making it heterogeneous), yet are joined into a unified system.
Bus:
An older type of interconnect that uses shared communication wires. It is inexpensive and flexible, but
performance degrades significantly as the number of connected devices increases due to contention for the
shared wires. For this reason, buses are being replaced by switched interconnects in larger systems.
Switched Interconnects (e.g., Crossbar):
These use switches to control data routing. A crossbar is a fast and powerful example that allows multiple
devices to communicate simultaneously without contention, provided they are not trying to access the same
memory module. However, crossbars are much more expensive to build than bus-based systems.
Figure 2.7 illustrates a crossbar switch, showing how multiple processors can access different memory
modules simultaneously.
These interconnects are often divided into two groups: direct and indirect.
In a direct interconnect, each switch is directly connected to a processor-memory pair, and the switches are
connected to each other.
Examples: Common examples include the ring and the two-dimensional toroidal mesh (Figure 2.8). A
mesh is more expensive than a ring but offers greater connectivity and more opportunities for simultaneous
communication.
Connectivity Measures:
o Bisection Width: This is a measure of the "worst-case" connectivity. It is defined as the minimum
number of links that must be removed to divide the system's nodes into two equal halves. For example, a
ring of p nodes has a bisection width of 2, while a p-node two-dimensional toroidal mesh has a bisection
width of 2√p. (Figures 2.9, 2.10)
o Bisection Bandwidth: A related measure that sums the bandwidths of the links crossing the bisection.
Theoretical Ideals:
o Fully Connected Network: The ideal direct interconnect where every switch is connected to every
other switch (Figure 2.11). It has the highest connectivity but is impractical for all but the smallest
systems.
o Hypercube: A highly connected and practical direct interconnect built inductively from simpler
structures (Figure 2.12). A hypercube with p = 2^d nodes has a bisection width of p/2, so it offers better
connectivity than a ring or mesh, but it is more expensive to construct.
In an indirect interconnect, some switches may not be directly connected to a processor (Figure 2.13).
Examples: Well-known indirect interconnects include the crossbar and the omega network.
Latency: The time it takes for data to start arriving at its destination after the source begins transmitting. It
is the time for the first byte to be received.
Bandwidth: The rate at which data is received after the first byte has arrived. It is usually measured in
bytes or bits per second.
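A standard way to combine the two (this formula is an addition to these notes, but it is the usual cost model):

    message transmission time ≈ latency + message size / bandwidth

For example, sending n bytes over a link with latency l seconds and bandwidth b bytes per second takes
roughly l + n/b seconds.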
The cache coherence problem arises in shared-memory systems where each core has its own private cache.
Since CPU caches are managed by hardware, the programmer has no direct control over when a cached value
is updated.
The Problem:
When multiple cores have a cached copy of a shared variable, and one core updates its copy, the other cores
may not see this update.
This can lead to unpredictable behavior and incorrect results. The text gives an example where a core might
use an outdated value of a shared variable x from its cache, even after another core has updated the value in
main memory. Figure 2.17 illustrates a shared-memory system with two cores and two caches.
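A minimal sketch of the scenario the text describes (the variable names and initial value are illustrative):
suppose x is a shared variable initialized to 2, y0 is private to core 0, y1 and z1 are private to core 1, and both
cores have x in their caches.

/* Core 0 */              /* Core 1 */
y0 = x;                   y1 = 3 * x;   /* both read x == 2 */
x  = 7;                   /* ... */
/* ... */                 z1 = 4 * x;   /* without cache coherence, core 1 may still
                                           see its cached x == 2 and compute z1 = 8
                                           instead of z1 = 28 */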
There are two main approaches to ensuring that all cores have a consistent view of shared data: snooping cache coherence, in which the cores monitor ("snoop on") the interconnect for writes to variables they have cached, and directory-based cache coherence, in which a data structure called the directory tracks the status of each cached line.
False sharing is a performance issue that arises because caches operate on entire cache lines, not individual
variables.
How it occurs:
It happens when two or more cores repeatedly access different, independent variables that are located within
the same cache line.
The Consequence:
When one core updates its variable, the entire cache line is invalidated for the other cores.
Even though they are accessing different variables, the other cores must then fetch a fresh copy of the entire
cache line from memory, causing a significant performance penalty.
The Fix:
The text suggests using temporary, local storage for each core to perform its calculations, copying the result to
shared memory only at the end, which reduces the number of updates to the shared cache line.
This avoids the performance degradation without affecting the correctness of the results.
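A minimal sketch of this fix, assuming each thread contributes repeatedly to one element of a shared array y
(Compute_contribution, my_first_i, and my_last_i are placeholders, not names from the text):

/* Instead of updating y[my_rank] inside the loop -- which would keep
   invalidating the cache line holding the neighbouring elements used
   by other threads -- accumulate into a private variable: */
double my_tmp = 0.0;
for (i = my_first_i; i < my_last_i; i++)
    my_tmp += Compute_contribution(i);
y[my_rank] = my_tmp;   /* a single write to the shared cache line */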
Definition: In a shared-memory system, multiple processors (or cores) are connected to a single memory
system. All processors can access any memory location, and they coordinate their work implicitly by reading
from and writing to shared data structures.
Advantages:
o Programmers often find them more appealing to program, as they can coordinate work implicitly
through shared data rather than explicit message passing.
Disadvantages:
o The interconnect connecting the processors to memory can become congested as processors are added,
which limits how far these systems scale.
Uses:
o Shared-memory systems are suitable for systems with only a few processors due to the interconnect
scaling limitations.
Definition: In a distributed-memory system, each processor has its own private memory. The processors are
connected by an interconnection network and must communicate with each other explicitly by sending
messages to coordinate their work.
Advantages:
o Interconnects like the hypercube and toroidal mesh are relatively inexpensive.
o They are highly scalable, allowing systems with thousands of processors to be built.
o They are well-suited for problems requiring vast amounts of data or computation.
Disadvantages:
o The concept of explicitly sending messages can be less appealing to programmers, which implies a more
complex programming model.
Uses:
o These systems are used for solving problems that require immense computational power and large
amounts of data.
Some problems are "embarrassingly parallel," meaning they can be solved by simply dividing the work among
processes or threads. A simple array addition is given as an example:
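The array-addition code itself is not reproduced in these notes; a reconstruction, in the same style as the earlier
partial-sum fragment, is:

/* Thread (or process) my_rank out of p handles its own block: */
my_first_i = my_rank * n / p;
my_last_i  = (my_rank + 1) * n / p;
for (my_i = my_first_i; my_i < my_last_i; my_i++)
    z[my_i] = x[my_i] + y[my_i];

No thread reads data written by another thread, so once the work is divided, nothing further is required. To
parallelize a problem like this, we only need to: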
Divide the work so that each process/thread gets roughly the same amount (a process called load
balancing).
Minimize the amount of communication required between processes/threads.
The vast majority of problems are more complex and require additional coordination between processes and
threads. For these problems, programmers also need to:
Arrange for synchronization: This involves ensuring that processes or threads wait for each other at
certain points in the program before proceeding.
Arrange for communication: This involves setting up the transfer of data between processes or
threads.
The text notes that synchronization and communication are often interrelated. For example, in distributed-
memory programs, synchronization can be achieved implicitly through communication, while in shared-
memory programs, communication can be achieved implicitly through synchronization.
Shared-memory
In shared-memory programs, variables can be either shared or private. Communication between threads is
done implicitly through shared variables, which can be accessed by any thread. Private variables are accessible
only by a single thread.
Dynamic Threads:
A master thread forks worker threads as needed to handle specific tasks. Once a worker thread completes its
task, it terminates. This approach efficiently uses system resources because threads are only active when they
are needed.
Static Threads:
All threads are forked at the beginning of the program and run until all work is completed. While this may be
less efficient in terms of resource use if threads are idle, it avoids the time overhead of repeatedly forking and
joining threads, potentially leading to better performance.
Nondeterminism:
This occurs in asynchronous MIMD systems, where a program's output can vary from one run to another even
with the same input. For example, a simple printf statement can have its output order change from run to run:
...
printf("Thread %d > my_x = %d\n", my_rank , my_x);
...
Race Conditions:
This is a serious type of nondeterminism that can lead to program errors. A race condition occurs when
multiple threads try to access and modify a shared resource simultaneously, and the outcome depends on which
thread "wins the race." A simple example is two threads trying to update a shared variable x:
my_val = Compute_val(my_rank);
x += my_val;
The x += my_val operation is not atomic: it is carried out as a load of x, an addition, and a store back to x. If the
two threads interleave these steps, one thread's update can be lost, so x can end up with an incorrect final value.
To prevent race conditions, a critical section is used. This is a block of code that can only be executed by one
thread at a time, ensuring mutual exclusion.
The most common mechanism for enforcing mutual exclusion. A thread must "lock" a mutex before entering a
critical section and "unlock" it upon exiting. For the example above, a mutex would be used like this:
my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
This ensures that only one thread at a time can execute the x += my_val statement.
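Lock and Unlock above are generic pseudocode; as an illustration (not code from the text), here is how the same
idea looks as a complete program using the Pthreads mutex API:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

int x = 0;                                    /* shared variable */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

int Compute_val(long my_rank) { return (int) my_rank + 1; }

void *Add_val(void *rank) {
    long my_rank = (long) rank;
    int my_val = Compute_val(my_rank);

    pthread_mutex_lock(&add_my_val_lock);     /* enter the critical section */
    x += my_val;
    pthread_mutex_unlock(&add_my_val_lock);   /* leave the critical section */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, Add_val, (void *) t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    printf("x = %d\n", x);                    /* always prints x = 10 (1+2+3+4) */
    return 0;
}

Compile with something like gcc -g -Wall -pthread.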
Busy-Waiting: A thread enters a loop to repeatedly check a condition until it is met. An example using a
shared variable ok_for_1 to control execution order is:
if (my_rank == 1)
    while (!ok_for_1); /* Busy-wait loop */
x += my_val;           /* Critical section */
if (my_rank == 0)
    ok_for_1 = true;
This method can be wasteful of resources, as the core is busy doing no useful work while waiting.
A function is thread-safe if it can be safely used in a multithreaded program without causing errors. Functions
written for serial programs, especially those that use static local variables, can be a source of errors in
multithreaded environments.
The C function strtok is a classic example of a function that is not thread-safe because its internal state (a static
variable) can be corrupted if multiple threads call it concurrently.
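The usual remedy (standard in POSIX, though not discussed in the text) is the reentrant variant strtok_r, which
keeps its parsing state in a caller-supplied pointer instead of a static variable, so each thread can safely tokenize
its own string:

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "one two three";
    char *saveptr;   /* per-caller state replaces strtok's static variable */

    char *token = strtok_r(line, " ", &saveptr);
    while (token != NULL) {
        printf("%s\n", token);
        token = strtok_r(NULL, " ", &saveptr);
    }
    return 0;
}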
In distributed-memory systems, cores can only directly access their own private memories. Communication
between processes is explicit and often handled through a message-passing API, most commonly the Message-
Passing Interface (MPI).
Message-passing provides functions (at a minimum, send and receive) for processes to communicate.
char message[100];
...
my_rank = Get_rank();
if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);
    printf("Process 0 > Received: %s\n", message);
}
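The Send and Receive calls above are generic pseudocode; an MPI version of the same exchange (an
illustration, since the text does not show MPI code here) might look like:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    char message[100];
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, (int) strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}

Run with, e.g., mpiexec -n 2 ./a.out.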
Characteristics: Programs are typically SPMD (Single Program, Multiple Data), where all processes run
the same executable but perform different actions based on their rank.
Pros: It is a powerful and versatile API used on the most powerful computers in the world.
Cons: It is very low-level and requires significant effort to parallelize a serial program, often described as the
"assembly language of parallel programming."
Also known as remote memory access, this paradigm allows a single process to update another process's
remote memory without the second process's explicit participation (i.e., without a matching receive call).
Advantages: This can simplify communication and reduce overhead by eliminating the need for two-sided
synchronization.
Challenges: It can be difficult to ensure the safety and correctness of memory access. For example, the
sending process needs a way to know it's safe to write to remote memory, and the receiving process needs a
way to know the data has been updated. This can introduce complex issues and errors that are hard to track
down.
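As an illustration of one-sided communication (this sketch uses MPI's one-sided interface, which the text
mentions only in general terms), process 1 puts data directly into a window of memory exposed by process 0,
with fences providing the "is it safe yet?" synchronization discussed above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank;
    double buf[4] = {0, 0, 0, 0};      /* memory exposed to remote access */
    double local[4] = {1, 2, 3, 4};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Every process exposes buf as a "window" for one-sided access. */
    MPI_Win_create(buf, 4 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* start an access epoch */
    if (my_rank == 1)
        /* Write local[] into process 0's buf; process 0 makes no
           matching receive call. */
        MPI_Put(local, 4, MPI_DOUBLE, 0, 0, 4, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);             /* end the epoch: the update is now visible */

    if (my_rank == 0)
        printf("Process 0 > buf[3] = %f\n", buf[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}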
PGAS languages provide a shared-memory programming model for distributed-memory hardware, aiming to
simplify programming while maintaining high performance.
Core Concept: The language creates a single logical address space that spans the private memories of all
processes.
How it Works: To prevent poor performance from accessing remote memory, PGAS languages give the
programmer tools to control how shared data is distributed. Programmers can ensure that a process primarily
accesses data that is local to its own memory, which is much faster.
Example: In a vector addition, a programmer can define the arrays as shared but then ensure that each process
only works on the portion of the array that is located in its own local memory:
shared int n = . . . ;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;
...
for (i = my_first_element; i <= my_last_element; i++)
    x[i] += y[i];
This approach allows the programmer to use a familiar shared-memory-like syntax while still manually
managing data locality for performance.