Lecture 12-MPI Collective Communication

The document covers the concepts of MPI (Message Passing Interface) blocking and nonblocking point-to-point communication, highlighting the differences between MPI_Send/MPI_Recv and their nonblocking counterparts MPI_Isend/MPI_Irecv. It discusses the importance of avoiding deadlocks, the use of request handles for nonblocking calls, and the rules for successful communication. Additionally, it introduces collective communication operations such as broadcast, scatter, gather, and reduction, along with their respective functions and usage in parallel programming.


Applied High-Performance Computing and Parallel Programming

Presenter: Liangqiong Qu

Assistant Professor

The University of Hong Kong


Review of Lecture 11: MPI Blocking Point-to-Point Communication
▪ MPI_Send
• MPI_Send blocks until the whole message has arrived at the receiver or the whole message has
been copied into a system buffer.
• After MPI_Send returns, the send buffer is safe for reuse.

▪ MPI_Recv
• MPI_Recv blocks until the message has been received into the buffer specified by the buffer
argument.
• If no message is available, the process remains blocked until a message becomes available.

[Figure: data path from the send buffer through the system buffer and network to the receive buffer]


Review of Lecture 11: MPI Blocking Point-to-Point Communication

▪ Both ranks wait for the matching receive to be called. A deadlock occurs when two or more
processes wait on the same set of resources and none can proceed.

▪ Possible solutions to avoid deadlock:
• Different ordering of send and receive, but this is not symmetric and does not scale
• Using nonblocking point-to-point communication
Review of Lecture 11: Nonblocking Point-to-Point Communication
▪ A nonblocking MPI call returns immediately to the next statement without waiting
for the operation to complete, whereas a blocking send returns only after the data has been
copied out of the sender's memory.
▪ Do not reuse the buffer before the nonblocking call has completed. Return of the call does not
imply communication completion; check for completion via MPI_Wait*/MPI_Test*.

▪ MPI_Request is a handle to a hidden request object that holds detailed information about
the transaction. The request handle can be used for subsequent Wait* and Test* calls.
▪ MPI_Irecv has no status argument.
Review of Lecture 11: Nonblocking Point-to-Point Communication
▪ All nonblocking calls in MPI return a request handle in lieu of a status variable.
▪ MPI provides two functions to complete a nonblocking communication call.
• MPI_Wait: Waiting forces the process into blocking mode. The sending
process simply waits for the request to finish. If your process waits right
after MPI_Isend, the send behaves the same as calling MPI_Send.

• MPI_Test: Testing checks whether the request can be completed. If it can, the request is
automatically completed and the data transferred.
Outline

▪ Point-to-Point and Collective Communication with MPI

▪ MPI Nonblocking Point-to-Point Communication

▪ Collective Communication
• Synchronization (barrier)
• Data movement (broadcast, scatter, gather, all to all)
• Global computation

▪ Examples
Example of Nonblocking Point-to-Point Communication

• Objective:
1) Do nonblocking point-to-point communication between rank 0 and rank 1.
2) Transfer the vector buffer in rank 0, with vector size BUFFER_COUNT, to rank 1.
3) The buffer in rank 0 is initialized as [0, 2, 4, …, 18].
Example of Nonblocking Point-to-Point Communication
Example of Nonblocking Point-to-Point Communication (Core Code)

• Rank 0 initiates the Isend. The Isend returns immediately and does not wait for
completion of the communication.
• The sending rank 0 starts working immediately upon the return of MPI_Isend.

• Work for 1 ms, then check whether the receiver is ready, using MPI_Test. If the returned
flag is nonzero, the request is completed.

• If the receiver is still not ready after rank 0 has worked for 6 ms, the program switches to
blocking mode using MPI_Wait.
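The core code itself appears only as a screenshot in the slides. Below is a minimal sketch of the pattern described above, assuming BUFFER_COUNT = 10 (to match the initialization [0, 2, 4, …, 18]) and using usleep(1000) as a stand-in for 1 ms of useful work; it is not the slide's exact code.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* usleep */

#define BUFFER_COUNT 10   /* assumed size; matches the initialization 0, 2, ..., 18 */

int main(int argc, char **argv) {
    int rank, buffer[BUFFER_COUNT];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < BUFFER_COUNT; i++) buffer[i] = 2 * i;

        MPI_Request request;
        MPI_Isend(buffer, BUFFER_COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);

        /* Overlap communication with computation: work in 1 ms slices,
           checking after each slice whether the send has completed. */
        int flag = 0;
        for (int slice = 0; slice < 6 && !flag; slice++) {
            usleep(1000);                                 /* stand-in for 1 ms of real work */
            MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
        }
        /* After 6 ms without completion, fall back to blocking mode. */
        if (!flag) MPI_Wait(&request, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buffer, BUFFER_COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received first element %d, last element %d\n",
               buffer[0], buffer[BUFFER_COUNT - 1]);
    }

    MPI_Finalize();
    return 0;
}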
Example of Nonblocking Point-to-Point Communication

Submit the job to Slurm

• Batch script
• Output results of the submitted jobs
Summarization of Nonblocking Point-to-Point Communication
• Blocking vs. nonblocking: MPI_Send()/MPI_Recv() blocks until the data is received or
copied out to the system buffer; a nonblocking MPI call returns immediately to the
next statement without waiting for the communication to complete.
• The standard nonblocking send and receive are MPI_Isend() and MPI_Irecv()
• Return of the call does not imply completion of the communication
• Use MPI_Wait*() / MPI_Test*() to check for completion using request handles
• Potential benefits
• Enabling overlap between communication & computation
• Avoiding certain deadlocks
• Avoiding synchronization and idle times

• Caveat: Compiler does not know about asynchronous modification of data


Blocking and Nonblocking Point-to-Point Communication
▪ For a communication to succeed:
• The sender must specify a valid destination.
• The receiver must specify a valid source rank (or MPI_ANY_SOURCE).
• The communicator used by the sender and receiver must be the same (e.g.,
MPI_COMM_WORLD).
• The tags specified by the sender and receiver must match (or MPI_ANY_TAG
for receiver).
• The data types of the messages being sent and received must match.
• The receiver's buffer must be large enough to hold the received message.
Collective Communication
Review of Lecture 10---Parallel Execution in MPI
• Processes run throughout program execution

• MPI start mechanism (program startup):
• Launches tasks/processes
• Establishes communication context (“communicator”)

• MPI point-to-point communication:
• between pairs of tasks/processes
• MPI collective communication:
• between all processes or a subgroup

• Clean shutdown by MPI (program shutdown)


Collective Communication in MPI
• Collective communication allows you to exchange data among a group of
processes

• It must involve all processes in the scope of a communicator

[Figure: six ranks (0–5) within a communicator; rank 2 acts as the source of an MPI_Bcast to all other ranks]
Collective Communication in MPI
• Collective communication allows you to exchange data among a group
of processes

• It must involve all processes in the scope of a communicator

• It consists of:
• Blocking variants: The call blocks until the message has arrived at
the receiver or been copied into a system buffer. The buffer can be reused
after return.

• Nonblocking variants: A nonblocking call returns immediately to the
next statement without waiting for the operation to complete. The buffer can only
be reused after completion (MPI_Wait*, MPI_Test*).
Collective Communication in MPI
▪ Rules for all collectives
• Data type matching
• Do not use tags
• Count must be exact, i.e., there is only one message length, and the buffer
must be large enough

▪ Different types of communication:
• Synchronization (barrier)
• Data movement (broadcast, scatter, gather)
• Global computation (reduction, scan)
• Combinations of data movement and computation (reduction +
broadcast)
Review of Barrier in OpenMP

#pragma omp barrier

• Each thread blocks upon reaching the barrier
until all threads have reached the barrier
• All accessible shared variables are flushed to
the memory hierarchy
• barrier may not appear within a work-sharing
construct → potential for deadlock or
unanticipated results
Synchronization: Barrier

▪ Explicit synchronization of all ranks from the
specified communicator
• int MPI_Barrier(MPI_Comm comm)

▪ Any process calling it will be blocked until all the
processes within the group have called it. Once all
the processes in the communicator group have
reached the barrier, the function will return, and
all processes in the group can continue.
From: https://sites.cs.ucsb.edu/~tyang/class/240a17/slides
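A minimal usage sketch (not from the slides): each rank finishes some local setup, and no rank proceeds past MPI_Barrier until all ranks in the communicator have reached it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Rank %d: finished local setup\n", rank);

    /* No rank passes this point until every rank in MPI_COMM_WORLD has reached it. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) printf("All ranks reached the barrier; starting next phase\n");

    MPI_Finalize();
    return 0;
}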
Collective Data Movements
▪ MPI provides three categories of collective data-movement routines in which one
process either sends to or receives from all processes: broadcast, gather, and scatter.

[Figure: broadcast, scatter, and gather communication patterns]
Collective Data Movements: MPI_Bcast
▪ Broadcasting happens when one process wants to send the same
information to every other process. It sends buffer contents from one
rank (called the “root”) to all ranks in the communicator.

▪ No restrictions on which rank is root

[Figure: broadcast from the root rank to all ranks]

MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

• buffer: [send/recv] the address of the buffer
• count: the number of elements sent to each process
• datatype: MPI datatype
• root: an integer indicating the rank of the broadcast root process
• comm: the communicator
Collective Data Movements: MPI_Bcast
▪ Broadcasting happens when one process wants to send the same
information to every other process. Send buffer contents from one
rank (“root”) to all ranks

▪ No restrictions on which rank is root (usually rank 0)


broadcast

MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Collective Data Movements: MPI_Bcast
▪ Broadcasting happens when one process wants to send the same
information to every other process. Send buffer contents from one
rank (“root”) to all ranks

▪ No restrictions on which rank is root (usually rank 0)

MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Common mistake: matching MPI_Bcast on the root with MPI_Recv on the other ranks.
MPI_Bcast is a collective call and must be made by every rank in the communicator;
it is not matched by point-to-point receives.

if (I am root) then
  MPI_Bcast (buff, …, 0, MPI_COMM_WORLD)
else
  MPI_Recv(buff, …, 0, MPI_COMM_WORLD)    ! WRONG: every rank must call MPI_Bcast
endif
Collective Data Movements: MPI_Bcast
• Defining variables: source indicates the
root rank that initiates the broadcast. For an
MPI collective function, no receiver
identity is required.

• Defining variables: We will send four
integers in the broadcast and need to
define a buffer for sending and a
buffer for receiving four integers (!)

• The MPI_Bcast() function broadcasts a
message from the process with rank root
to all processes of the group, itself
included (!)

• All processes should print out their
buffer, including those ranks that did not
initialize the buffer. Originally only
rank 0's buffer is initialized.
Collective Data Movements: MPI_Bcast
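The corresponding code is shown as a screenshot on the slide; the following is a minimal sketch of the described pattern, assuming rank 0 as the root ("source") and a four-integer buffer. It is not the slide's exact code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buffer[4] = {0, 0, 0, 0};
    const int source = 0;   /* root rank that initiates the broadcast */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only the root initializes the buffer. */
    if (rank == source) {
        for (int i = 0; i < 4; i++) buffer[i] = i + 1;
    }

    /* Every rank, including the root, calls MPI_Bcast with the same arguments. */
    MPI_Bcast(buffer, 4, MPI_INT, source, MPI_COMM_WORLD);

    /* After the call, all ranks hold the root's data. */
    printf("Rank %d: buffer = [%d, %d, %d, %d]\n",
           rank, buffer[0], buffer[1], buffer[2], buffer[3]);

    MPI_Finalize();
    return 0;
}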
Collective Data Movements: MPI_Scatter
▪ Scatter: Distributes distinct messages from a single root rank to each
ranks in the communicator. Given communicator with n ranks,
distribute the data into n equal segments, where the i-th segment is
sent to the i-th process in the communicator

scatter

Example: the scattering operation distributes evenly a set of data over all the processes of a
communicator. (From: https://www.codingame.com/playgrounds/349/introduction-to-mpi)
Collective Data Movements: MPI_Scatter
▪ Scatter: Distributes distinct messages from a single root rank to each rank in the
communicator.

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
            void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

• sendbuf is the address of the send buffer that ONE process will dispatch to all the
other processes
• recvbuf is the address of the receive buffer
• sendcount is the number of elements the root sends to each process
• root is the rank of the process that will be sending its data

▪ In general, sendcount = recvcount
• This is the length of the segment
• It is not the length of the message, but the length of each segment
Collective Data Movements: MPI_Scatter

MPI_Scatter(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD)

[Figure: sendbuf on the root holds one integer per rank (ranks 0–3); after the call, each rank's recvbuf holds its own segment]

Note the count here is not the length of the message, but the length of each segment
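A minimal sketch of this call pattern (not the slide's exact code), assuming four ranks and one integer per segment:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, recvbuf;
    int sendbuf[4] = {10, 20, 30, 40};   /* only meaningful on the root */
    const int root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with 4 ranks for this example */

    /* Each rank receives one int: rank i gets sendbuf[i] from the root. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("Rank %d of %d received %d\n", rank, size, recvbuf);

    MPI_Finalize();
    return 0;
}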
Collective Data Movements: MPI_Gather
▪ Receive a message from each rank and place i-th rank’s message at i-th position in
receive buffer

int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm)

• sendcount is the number of elements in the send buffer
• recvcount is the number of elements for any single receive
• root is the rank of the receiving process

▪ In general, sendcount = recvcount

▪ recvbuf is ignored on non-root ranks, because non-root ranks do not receive anything

[Figure: gather of per-rank segments into the root's receive buffer]
Collective Data Movements: MPI_Gather

MPI_Gather(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD)

[Figure: each rank's sendbuf holds one integer; after the call, the root's recvbuf holds the i-th rank's value at position i]
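A matching sketch for the gather direction (again not the slide's exact code), with each rank contributing one integer; the receive buffer here is sized for four ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    int recvbuf[4];          /* only used on the root; sized for 4 ranks in this example */
    const int root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = 100 * (rank + 1);   /* each rank contributes one value */

    /* Rank i's value ends up at position i of recvbuf on the root. */
    MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD);

    if (rank == root) {
        for (int i = 0; i < size; i++)
            printf("recvbuf[%d] = %d\n", i, recvbuf[i]);
    }

    MPI_Finalize();
    return 0;
}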
Example Usage of Collective Data Movements
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100. The initial data for matrix A and vector b reside on processor P, and the
program should utilize four processors, including processor P, to execute the computation
in parallel.

Fig. 1 Matrix-vector multiplication.


Thank you for taking this course!
Please help complete the SSCC survey on Moodle.
Please complete the online survey by 16 March 2025 (23:55) via Moodle or use the following link:

https://hku.au1.qualtrics.com/jfe/form/SV_6sQBDuNUKN8Gbbg
A file containing the responses in Excel/PDF format will be sent to the Class Representative. These responses
will be analyzed and discussed during the Staff-Student Consultative Committee meeting on 26 March 2025
(Wednesday).
Example Usage of Collective Data Movements
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100. The initial data for matrix A and vector b reside on processor P, and the
program should utilize four processors, including processor P, to execute the computation
in parallel.
Concept:
• Matrix is distributed by rows (i.e., row-major order)
• Product vector is needed in its entirety by every process
• MPI_Gather will be used to collect the product from the processes

• A: a matrix partitioned across rows and distributed to processes as Apart
• b: a vector present on all processes
• c: a partitioned vector updated by each process independently

Fig. 1 Matrix-vector multiplication.
Example Usage of Collective Data Movements: Code
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100.
Example Usage of Collective Data Movements: Code-Continued
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100.
Example Usage of Collective Data Movements: Code-Continued
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100.
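The program itself is shown as screenshots across these slides. The following condensed sketch follows the concept outlined above (scatter the rows of A, broadcast b, compute local products, gather the result at rank 0); names such as Apart, cpart, and local_n are illustrative, and the row count is assumed to divide evenly among the processes.

#include <mpi.h>
#include <stdio.h>

#define N 100   /* matrix dimension from the task description */

int main(int argc, char **argv) {
    int rank, size;
    static double A[N][N], b[N], c[N];      /* full data lives on rank 0 (processor P) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumed to divide N evenly, e.g. 4 */

    int local_n = N / size;                 /* number of rows per process */
    double Apart[local_n][N], cpart[local_n];

    if (rank == 0) {                        /* rank 0 owns the initial data */
        for (int i = 0; i < N; i++) {
            b[i] = 1.0;
            for (int j = 0; j < N; j++) A[i][j] = i + j;
        }
    }

    /* Distribute rows of A and replicate b on every process. */
    MPI_Scatter(A, local_n * N, MPI_DOUBLE, Apart, local_n * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process computes its block of the product c = A * b. */
    for (int i = 0; i < local_n; i++) {
        cpart[i] = 0.0;
        for (int j = 0; j < N; j++) cpart[i] += Apart[i][j] * b[j];
    }

    /* Collect the partial results back on rank 0. */
    MPI_Gather(cpart, local_n, MPI_DOUBLE, c, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("c[0] = %f, c[%d] = %f\n", c[0], N - 1, c[N - 1]);

    MPI_Finalize();
    return 0;
}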
Example Usage of Collective Data Movements
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100.
Global Computation
Global Computation: MPI_Reduce
▪ MPI_Reduce: Collective computation operation. Applies a reduction operation on all tasks
in communicator and places the result in root rank.

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
           MPI_Op op, int root, MPI_Comm comm);

▪ MPI_Op op here indicates the reduction operation (MPI predefined or your own)
▪ count indicates the number of elements in the send buffer (integer)
▪ The result in recvbuf is only available on the root process
▪ The operation is performed on all count elements of the array
▪ If all ranks require the result, use MPI_Allreduce(), which does not take a root rank.
MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
              MPI_Op op, MPI_Comm comm);
Global Computation: MPI_Reduce
▪ MPI_Reduce: Collective computation operation. Applies a reduction operation on all tasks
in communicator and places the result in root rank.
MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
           MPI_Op op, int root, MPI_Comm comm);

▪ MPI_Op op here indicates the reduction operation (MPI predefined or your own)
▪ count indicates the number of elements in the send buffer (integer)

[Figure: ranks 0–3 each contribute sendbuf; the reduction result is placed in recvbuf on rank 1]

MPI_Reduce(sendbuf, recvbuf, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);
Predefined operators in MPI

MPI OP          Operation
MPI_MAX         Maximum
MPI_MIN         Minimum
MPI_SUM         Sum
MPI_PROD        Product
MPI_LAND        Logical AND
MPI_BAND        Bitwise AND
MPI_LOR         Logical OR
MPI_BOR         Bitwise OR
MPI_LXOR        Logical exclusive OR
MPI_BXOR        Bitwise exclusive OR
MPI_MAXLOC      Max value and location
MPI_MINLOC      Min value and location

▪ If the 12 predefined ops are not enough, use MPI_Op_create/MPI_Op_free to
create your own ones.
Review of Lecture 11---Example 2. Parallel Integration in MPI
Task: calculate the integral ∫_a^b f(x) dx in parallel, using 4 processors with the (existing)
integrate(x, y). Let a = 0, b = 2, and f(x) = x².

• Prerequisite knowledge of Trapezoidal Rule in C language for integration


Review of Lecture 11 ---Example 2. Parallel Integration in MPI
Task: calculate ∫_a^b f(x) dx in parallel,
using 4 processors; let a = 0, b = 2
• Split up interval [a, b] into equal
disjoint chunks
• Compute partial results in parallel
• Collect global sum at rank 0
Parallel Integration in MPI
Task: calculate ∫_a^b f(x) dx in parallel,
using 4 processors; let a = 0, b = 2
• Now let’s simplify the
implementation using collective
communication
• Split up interval [a, b] into equal
disjoint chunks
• Compute partial results in parallel
• Collect global sum at rank 0 using
MPI_Reduce
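The slide's code is a screenshot; here is a minimal sketch of the collective version, with a hypothetical integrate(lo, hi) helper applying the trapezoidal rule to f(x) = x² on [lo, hi].

#include <mpi.h>
#include <stdio.h>

/* Hypothetical trapezoidal-rule helper for f(x) = x^2 on [lo, hi]. */
static double integrate(double lo, double hi) {
    const int n = 100000;
    double h = (hi - lo) / n, sum = 0.5 * (lo * lo + hi * hi);
    for (int i = 1; i < n; i++) {
        double x = lo + i * h;
        sum += x * x;
    }
    return sum * h;
}

int main(int argc, char **argv) {
    int rank, size;
    const double a = 0.0, b = 2.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* e.g. 4 processes */

    /* Split [a, b] into equal disjoint chunks, one per rank. */
    double chunk = (b - a) / size;
    double lo = a + rank * chunk;
    double partial = integrate(lo, lo + chunk);

    /* Collect the global sum at rank 0 with a single collective call. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("Integral of x^2 on [0,2] = %f (exact value 8/3)\n", total);

    MPI_Finalize();
    return 0;
}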
Global Computation: MPI_Scan
▪ MPI_Scan: Performs a prefix reduction of the data stored in sendbuf at each process:
process i receives in its recvbuf the reduction of the values from processes 0, 1, …, i (inclusive scan).

MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm )
Global Computation: MPI_Scan
▪ MPI_Scan: Performs a prefix reduction of the data stored in sendbuf at each process:
process i receives in its recvbuf the reduction of the values from processes 0, 1, …, i.

MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm)

[Figure: element-wise prefix sums of four-element buffers across ranks]

MPI_Scan(sendbuf, recvbuf, 4, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
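A minimal sketch (not from the slides): each rank contributes its own rank number, and MPI_Scan with MPI_SUM produces inclusive prefix sums.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, prefix;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inclusive prefix sum: rank i receives 0 + 1 + ... + i. */
    MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}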
Nonblocking Collective Communication
▪ A non-blocking call has the same syntax as its blocking counterpart, with two differences:
• The letter I (think of "initiate" or "immediate") appears in the name of the call,
immediately following the first underscore: e.g., MPI_Ibcast vs. MPI_Bcast
• The final argument is a handle to an opaque (or hidden) request object that holds
detailed information about the transaction. The request handle can be used for
subsequent Wait and Test calls.

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
            void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_Iscatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
             void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm,
             MPI_Request *request)
Nonblocking Collective Communication

What happened with this code?
Nonblocking Collective Communication
• A bad example of what we should not do

• We have induced some chaotic behavior by the second invocation of strcpy
• Do not use the buffer before the call has completed; check completion with MPI_Wait/MPI_Test

• Every nonblocking call in MPI should be completed with a matching call to MPI_Wait or MPI_Test
Nonblocking Collective Communication
• Every nonblocking call in
MPI should be completed
with a matching call to
MPI_Wait or MPI_Test
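The faulty code is only shown as a screenshot; the sketch below illustrates the correct pattern under similar assumptions (a string buffer broadcast with MPI_Ibcast): the buffer is not touched again until MPI_Wait reports completion.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    int rank;
    char buffer[64] = "";
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) strcpy(buffer, "first message");

    /* Start the nonblocking broadcast; the call returns immediately. */
    MPI_Ibcast(buffer, 64, MPI_CHAR, 0, MPI_COMM_WORLD, &request);

    /* Safe: work that does not touch 'buffer' may overlap with the broadcast here. */

    /* Complete the collective before reusing the buffer. */
    MPI_Wait(&request, MPI_STATUS_IGNORE);
    printf("Rank %d received: %s\n", rank, buffer);

    /* Only now is it safe to overwrite the buffer (the bad example did this too early). */
    if (rank == 0) strcpy(buffer, "second message");

    MPI_Finalize();
    return 0;
}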
Summarization of MPI Collective Communication

▪ MPI collectives
• All ranks in the communicator must call the function
▪ Types:
• Synchronization (barrier)
• Data movement (broadcast, scatter, gather, all to all)
• Global computation (reduction, scan)
• Combinations of data movement and computation (reduction + broadcast)

[Figure: scatter, broadcast, and gather patterns]
Thank you very much for choosing this course!

Give us your feedback!

https://forms.gle/zDdrPGCkN7ef3UG5A
