Lecture 12: MPI Collective Communication Programming
Presenter: Liangqiong Qu
Assistant Professor
▪ MPI_Recv
• MPI_Recv blocks until the message has been received into the buffer specified by the buffer
argument.
• If a message is not yet available, the process remains blocked until one becomes available.
[Figure: the message travels over the network and may be staged in a system buffer before it reaches the receive buffer]
▪ Both ranks wait for the matching receive to be posted. A deadlock occurs when two or more
processes each wait for an event that only the other can produce, so none of them can proceed.
▪ MPI_Request is a handle to an opaque request object that holds detailed information about
the transaction. The request handle can be used in subsequent Wait* and Test* calls.
▪ MPI_Irecv has no status argument; the status is obtained later from the matching MPI_Wait*/MPI_Test* call.
Review of Lecture 11: Nonblocking Point-to-Point Communication
▪ All nonblocking calls in MPI return a request handle in lieu of a status variable.
▪ MPI provides two functions to complete a nonblocking communication call.
• MPI_Wait: Waiting forces the process to go into "blocking mode". The sending
process simply waits for the request to finish. If your process waits right
after MPI_Isend, the send behaves the same as calling MPI_Send.
• MPI_Test: Testing checks if the request can be completed. If it can, the request is
automatically completed and the data transferred.
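As an illustration of the MPI_Test pattern, here is a minimal sketch (not the lecture's code) in which rank 1 polls an outstanding receive while it could be doing other work; the payload and tag values are illustrative:

/* Polling with MPI_Test: overlap local work with an outstanding receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0, flag = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                       /* the sketch needs at least two ranks */
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        while (!flag) {
            /* ... do useful local work here ... */
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* completes the request if it can */
        }
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}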
Outline
▪ Collective Communication
• Synchronization (barrier)
• Data movement (broadcast, scatter, gather, all to all)
• Global computation
▪ Examples
Example of Nonblocking Point-to-Point Communication
• Batch script
Output results of the submitted jobs
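A minimal sketch of a nonblocking exchange of this kind between two ranks, using MPI_Isend/MPI_Irecv followed by MPI_Waitall (the payload, tag, and compile/run commands are illustrative):

/* Compile and run, for example:
 *   mpicc isend_example.c -o isend_example
 *   mpirun -np 2 ./isend_example
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        int partner = 1 - rank;          /* rank 0 <-> rank 1 */
        sendval = rank * 100;

        /* Post both nonblocking calls; they return immediately. */
        MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... independent computation could overlap with communication here ... */

        /* Buffers are safe to reuse only after the requests complete. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("Rank %d received %d from rank %d\n", rank, recvval, partner);
    }

    MPI_Finalize();
    return 0;
}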
Summarization of Nonblocking Point-to-Point Communication
• Blocking vs. nonblocking: MPI_Send()/MPI_Recv() block until the data is received or
copied out to the system buffer; a nonblocking MPI call returns immediately to the
next statement without waiting for the communication to complete.
• Standard nonblocking for send and recv is MPI_Isend() and MPI_Irecv()
• Return of call does not imply completion of communication
• Use MPI_Wait*() / MPI_Test*() to check for completion using request handles
• Potential benefits
• Enabling overlapping between communication & computation
• Avoiding certain deadlocks
• Avoiding synchronization and idle times
Program startup
• MPI start mechanism:
• Launches tasks/processes
• Establishes communication context (“communicator”)
[Figure: a communicator containing ranks 0-5; one rank acts as the source of a collective operation, e.g. MPI_Bcast]
Collective Communication in MPI
• Collective communication allows you to exchange data among a group
of processes
• It consists of:
• Blocking variants: the call blocks until the message has arrived at
the receiver or been copied into a system buffer; the buffer can be reused
after the call returns
[Figure: collective data movements (broadcast, scatter, gather)]
Collective Data Movements: MPI_Bcast
▪ Broadcasting happens when one process wants to send the same
information to every other process. It sends buffer contents from one
rank (called the “root”) to all ranks in the communicator.
MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Collective Data Movements: MPI_Bcast
Common mistake
if (I am master) then
MPI_Bcast (buff,…,0,MPI_COMM_WORLD)
else
MPI_Recv(buff, … ,0, MPI_COMM_WORLD)
endif
This is wrong: MPI_Bcast is a collective call, so every rank in the communicator must call
MPI_Bcast itself; a broadcast is never matched by MPI_Recv, and the non-root ranks above would hang.
Collective Data Movements: MPI_Bcast
• Defining variables: source indicates the root rank that initiates the broadcast
• Because MPI_Bcast is a collective function, no receive call or receiver identity is required
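A minimal sketch of such a broadcast in C (the variable name source follows the slide; the buffer contents are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int source = 0;            /* root rank that initiates the broadcast */
    double params[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == source) {      /* only the root fills the buffer */
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0;
    }

    /* Every rank calls MPI_Bcast; no separate receive is needed. */
    MPI_Bcast(params, 3, MPI_DOUBLE, source, MPI_COMM_WORLD);

    printf("Rank %d now has params = %.1f %.1f %.1f\n",
           rank, params[0], params[1], params[2]);

    MPI_Finalize();
    return 0;
}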
Example: the scatter operation evenly distributes a set of data over all the processes of a
communicator. (From: https://www.codingame.com/playgrounds/349/introduction-to-mpi)
Collective Data Movements: MPI_Scatter
▪ Scatter: Distributes distinct messages from a single root rank to each rank in the
communicator.
• sendbuf is the address of the send buffer whose contents ONE process (the root) will dispatch
to all the processes
• recvbuf is the address of the receive buffer
• sendcount is the number of elements the root sends to each process
• root is the rank of the process that will be sending its data
[Figure: MPI_Scatter splits the root's sendbuf into equal segments and delivers one segment to each rank's recvbuf]
Note that sendcount here is not the total length of the message, but the length of each segment delivered to a single rank
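For reference, the C prototype of MPI_Scatter and a minimal usage sketch (one int per rank; the values are illustrative):

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    int *sendbuf = NULL;
    int recvval;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* only the root needs a send buffer */
        sendbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = 10 * i;
    }

    /* Each rank (including the root) receives exactly one segment of length 1. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d received %d\n", rank, recvval);

    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}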
Collective Data Movements: MPI_Gather
▪ Receive a message from each rank and place the i-th rank's message at the i-th position in the
receive buffer
[Figure: MPI_Gather collects each rank's sendbuf into consecutive positions of the root's recvbuf]
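For reference, the C prototype of MPI_Gather and a minimal usage sketch (one int per rank; the values are illustrative):

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    int sendval;
    int *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendval = rank * rank;                /* each rank contributes one value */

    if (rank == 0)                        /* only the root needs the receive buffer */
        recvbuf = malloc(size * sizeof(int));

    /* The value from rank i is placed at position i of recvbuf on the root. */
    MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("recvbuf[%d] = %d\n", i, recvbuf[i]);
        free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}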
Example Usage of Collective Data Movements
▪ Matrix-vector multiplication task: Develop a parallel MPI program using collective
functions to perform matrix-vector multiplication on a 100x100 matrix A and a vector b
of length 100. The initial data for matrix A and vector b reside on processor P, and the
program should utilize four processors, including processor P, to execute the computation
in parallel.
https://hku.au1.qualtrics.com/jfe/form/SV_6sQBDuNUKN8Gbbg
A file containing the responses in Excel/PDF format will be sent to the Class Representative. These responses
will be analyzed and discussed during the Staff-Student Consultative Committee meeting on 26 March 2025
(Wednesday).
Example Usage of Collective Data Movements
Concept:
• Matrix is distributed by rows (the row-major storage makes each block of rows contiguous)
• Product vector is needed in entirety by every process
• MPI_Gather will be used to collect the product from processes
[Fig. 1: Matrix-vector multiplication]
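A minimal sketch of one way to implement this task with collectives, assuming rank 0 plays the role of processor P, the fill values of A and b are illustrative, and b is broadcast so every process holds it in its entirety:

#include <mpi.h>
#include <stdio.h>

#define N 100            /* matrix is N x N, vector has length N      */
#define NPROCS 4         /* the task assumes 4 processes, N % 4 == 0  */

int main(int argc, char **argv)
{
    int rank, size;
    static double A[N][N], b[N], y[N];           /* full data lives on rank 0 only */
    double local_A[N / NPROCS][N], local_y[N / NPROCS];
    int rows = N / NPROCS;                       /* 25 rows per process */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != NPROCS) {
        if (rank == 0) printf("Run with exactly %d processes.\n", NPROCS);
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {                             /* processor P initializes A and b */
        for (int i = 0; i < N; i++) {
            b[i] = 1.0;
            for (int j = 0; j < N; j++)
                A[i][j] = i + j;
        }
    }

    /* Vector b is needed in its entirety by every process. */
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Distribute A by blocks of rows; row-major storage keeps each block contiguous. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE,
                local_A, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process computes its 25 entries of y = A * b. */
    for (int i = 0; i < rows; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < N; j++)
            local_y[i] += local_A[i][j] * b[j];
    }

    /* Collect the pieces of the product vector on the root. */
    MPI_Gather(local_y, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("y[0] = %f, y[%d] = %f\n", y[0], N - 1, y[N - 1]);

    MPI_Finalize();
    return 0;
}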
Global Computation: MPI_Reduce
▪ MPI_Reduce: Collective computation operation. Applies a reduction operation on all tasks
in the communicator and places the result in the root rank.
MPI_Reduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
MPI_Op op, int root, MPI_Comm comm );
▪ MPI_Op op here indicates the reduction operation (MPI predefined or your own)
▪ count indicates the number of elements in the send buffer (integer)
▪ The operation is performed on all count elements of the array
▪ The result in recvbuf is available only on the root process
▪ If all ranks require the result, use MPI_Allreduce(), which takes no root argument:
MPI_Allreduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm );
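A minimal sketch of a global sum with MPI_Reduce (each rank's contribution is illustrative); the commented line shows the MPI_Allreduce alternative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local_sum = rank + 1.0;               /* each rank contributes one value */

    /* Sum across all ranks; the result lands only on root rank 0. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d ranks = %f\n", size, global_sum);

    /* If every rank needs the result, use MPI_Allreduce (no root argument):
     * MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
     */

    MPI_Finalize();
    return 0;
}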
Global Computation: MPI_Scan
▪ MPI_Scan: Performs an inclusive prefix reduction on the data stored in sendbuf at each process and
returns, in the recvbuf of the process with rank i, the reduction of the values from the processes with ranks 0, ..., i.
MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm )
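A minimal sketch of an inclusive prefix sum with MPI_Scan (each rank's contribution is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = rank + 1;                 /* rank i contributes i + 1 */

    /* Inclusive prefix sum: rank i receives value(0) + ... + value(i). */
    MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}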
▪ MPI collectives
• All ranks in the communicator must call the function
▪ Types:
• Synchronization (barrier)
• Data movement (broadcast, scatter, gather, all to all)
• Global computation (reduce, scan)
https://forms.gle/zDdrPGCkN7ef3UG5A