Basic Communication Operations
Preliminaries
• A big problem is divided into smaller tasks (logical units)
• A process is an entity that executes tasks
• Mapping allocates tasks to processes
• Several processes execute at the same time and perform Inter-Process Communication (interaction)
• Interaction is performed to share data, work, and synchronization information
• There are various patterns for communication
Assumptions for the Operations
• Interconnections support cut-through routing
• The communication time between any pair of nodes in the network is the same (regardless of the number of intermediate nodes)
• Links are bi-directional
• Two directly connected nodes can simultaneously send and receive messages of m words without any congestion
• Single-port communication model
• A node can send on only one of its links at a time
• A node can receive on only one of its links at a time
• However, a node can receive a message while sending another message at the same time, on the same or a different link
Patterns
1. One-to-All Broadcast / All-to-One Reduction
2. All-to-All Broadcast / All-to-All Reduction
3. All-Reduce (All-to-One Reduction + One-to-All Broadcast)
4. Scatter (One-to-All Personalized Communication) / Gather
Topologies
1. Ring / Linear Array (One-Dimensional)
2. Mesh (Two-Dimensional)
3. Hypercube (Three-Dimensional)
One-to-All Broadcast and All-to-One Reduction
One-to-All Broadcast
• A single process sends identical data to all other processes.
• Initially, only one process holds the m-word data.
• After the broadcast operation, every process has its own copy of the m-word data.
All-to-One Reduction
• Dual of one-to-all broadcast
• The m-word data from all processes are combined through an associative operator
• The result is accumulated at a single destination process into one buffer of size m
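These two primitives map directly onto the MPI collectives MPI_Bcast and MPI_Reduce. The slides do not mention MPI; this is a minimal sketch for illustration, assuming rank 0 as the source and destination:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m = (rank == 0) ? 42 : 0;   /* only the source holds the data */
    MPI_Bcast(&m, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* one-to-all broadcast */

    int sum = 0;                    /* all-to-one reduction with SUM */
    MPI_Reduce(&m, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("reduced sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```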
One-to-All Broadcast and All-to-One Reduction
• Application: used in many parallel algorithms, including matrix-vector multiplication, shortest paths, and Gaussian elimination.
• Naive approach: the source sequentially sends p-1 messages, one to each of the other p-1 processes.
• Disadvantages:
• The source becomes a bottleneck
• The communication network is underutilized because only the connection between a single pair of nodes is used at a time
• Solution: Recursive Doubling
Recursive doubling (Linear Array or Ring)
Recursive Doubling Broadcast
• The source process sends the message to one other process
• In the next communication phase, both processes can simultaneously propagate the message
• The message “HI” from the source node P0 is passed to all other nodes in the ring in the following three steps:
1. P0 to P4 (Distance:4)
2. P0 to P2, P4 to P6, in parallel (Distance:2)
3. P0 to P1, P2 to P3, P4 to P5, P6 to P7, in parallel (Distance:1)
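A minimal sketch of these three steps in C with MPI point-to-point calls (MPI is not part of the slides, just a convenient vehicle; p is assumed to be a power of two and rank 0 the source):

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    char msg[3] = "";
    if (rank == 0) strcpy(msg, "HI");   /* only the source holds the data */

    /* distances p/2, p/4, ..., 1 mirror the three steps on the slide */
    for (int d = p / 2; d >= 1; d /= 2) {
        if (rank % (2 * d) == 0)        /* already has the message: send  */
            MPI_Send(msg, 3, MPI_CHAR, rank + d, 0, MPI_COMM_WORLD);
        else if (rank % d == 0)         /* partner at distance d: receive */
            MPI_Recv(msg, 3, MPI_CHAR, rank - d, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    printf("rank %d has \"%s\"\n", rank, msg);
    MPI_Finalize();
    return 0;
}
```

With p = 8 this reproduces exactly the three steps above: 0→4, then 0→2 and 4→6 in parallel, then all distance-1 transfers in parallel.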
Recursive doubling (Linear Array or Ring)
Recursive Doubling Reduction
Example: Sum of all numbers
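A sketch of this reduction, again in C with MPI as an assumed vehicle: the broadcast steps are run in reverse (distances 1, 2, ..., p/2) and partial sums are combined on the way to node 0. p is assumed to be a power of two.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int val = rank + 1;                      /* each node contributes one number */

    for (int d = 1; d < p; d *= 2) {
        if (rank % (2 * d) == d) {           /* hand my partial sum downward */
            MPI_Send(&val, 1, MPI_INT, rank - d, 0, MPI_COMM_WORLD);
            break;                           /* this node is done */
        } else if (rank % (2 * d) == 0) {
            int tmp;
            MPI_Recv(&tmp, 1, MPI_INT, rank + d, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            val += tmp;                      /* combine with SUM */
        }
    }
    if (rank == 0) printf("total = %d\n", val);   /* equals p(p+1)/2 */
    MPI_Finalize();
    return 0;
}
```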
Mesh
• We can regard each row and column of a square mesh of p nodes as a linear array of √p nodes
• Communication algorithms on the mesh are simple extensions of their linear array counterparts
Broadcast and Reduction
• Two-step breakdown (see the sketch below):
i. The operation is performed along one dimension, treating each row as a linear array
ii. Then all the columns are treated similarly
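One way to express this two-phase pattern is with MPI's Cartesian topology helpers; a sketch under assumptions not in the slides (MPI itself, rank 0 as the source, MPI_Dims_create choosing the grid shape, row-major MPI coordinates rather than the column-wise numbering of the figure below):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* arrange the p processes as a 2-D grid (run with p = 16 for 4 x 4) */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(p, 2, dims);
    MPI_Comm grid, row, col;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(grid, keep_row, &row);   /* communicator for my row    */
    MPI_Cart_sub(grid, keep_col, &col);   /* communicator for my column */

    int coords[2];
    MPI_Cart_coords(grid, rank, 2, coords);

    char msg[3] = "";
    if (rank == 0) { msg[0] = 'H'; msg[1] = 'I'; }

    /* phase i: broadcast along the source's row only */
    if (coords[0] == 0) MPI_Bcast(msg, 3, MPI_CHAR, 0, row);
    /* phase ii: every column broadcasts from its row-0 node */
    MPI_Bcast(msg, 3, MPI_CHAR, 0, col);

    printf("rank %d: \"%s\"\n", rank, msg);
    MPI_Finalize();
    return 0;
}
```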
One-to-All Broadcast on a 16-node mesh
[Figure: a 4 x 4 mesh with nodes numbered column-wise, node 0 at the bottom-left (columns 0-3, 4-7, 8-11, 12-15); the message “HI” starts at node 0.]
• Step 1 (0th-row recursive doubling): 0 → 8
• Step 2 (0th-row recursive doubling): 0 → 4 and 8 → 12; the whole bottom row now holds “HI”
• Step 3 (all-column recursive doubling): 0 → 2, 4 → 6, 8 → 10, 12 → 14, in parallel
• Step 4 (all-column recursive doubling): 0 → 1, 2 → 3, ..., 14 → 15, in parallel; every node now holds “HI”
Reduction
[Figure: the same steps in reverse perform an all-to-one reduction on the 4 x 4 mesh: starting with all 16 nodes holding data, messages are combined within each column (distance 1, then distance 2) until only the bottom row 0, 4, 8, 12 holds partial results, then along that row (12 → 8 and 4 → 0, then 8 → 0) until node 0 holds the accumulated result.]
Mesh (Broadcast and Reduction)
Hypercube
Broadcast
• The source node first sends the data to one node along the highest dimension
• The communication then proceeds successively along lower dimensions in subsequent steps
• The algorithm is the same as the one used for the linear array
• However, in the hypercube, changing the order of dimensions does not congest the network
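A sketch of this dimension-by-dimension broadcast in C with MPI (an assumption; the mask-based formulation below is the classic one for p = 2^d nodes with source 0):

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    char msg[3] = "";
    if (rank == 0) strcpy(msg, "HI");

    int d = 0;
    while ((1 << d) < p) d++;             /* d = log2 p */

    int mask = p - 1;                     /* all d bits set */
    for (int i = d - 1; i >= 0; i--) {    /* highest dimension first */
        mask ^= (1 << i);                 /* clear bit i */
        if ((rank & mask) == 0) {         /* lower i bits of rank are 0 */
            int partner = rank ^ (1 << i);
            if ((rank & (1 << i)) == 0)   /* holder of the data: send   */
                MPI_Send(msg, 3, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
            else                          /* partner across dim i: recv */
                MPI_Recv(msg, 3, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }
    printf("rank %d: \"%s\"\n", rank, msg);
    MPI_Finalize();
    return 0;
}
```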
Hypercube (Broadcast)
Matrix-Vector Multiplication (An Application)
All-to-All Broadcast and All-to-All Reduction
• All-to-All Broadcast
• A generalization of one-to-all broadcast.
• Every process broadcasts an m-word message.
• The broadcast message of each process can be different from the others'.
• All-to-All Reduction
• Dual of all-to-all broadcast.
• Each node is the destination of one all-to-one reduction out of p total reductions.
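In MPI terms (an illustration; the slides do not name MPI), all-to-all broadcast is MPI_Allgather and all-to-all reduction is MPI_Reduce_scatter_block. A minimal sketch with one-word blocks, assuming p <= 64:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* all-to-all broadcast: every process contributes one word,
       afterwards everyone holds all p words */
    int mine = 10 * rank, all[64];        /* assumes p <= 64 */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    /* all-to-all reduction: p concurrent reductions, one result per node;
       vec[k] is this process's contribution to the reduction rooted at k */
    int vec[64], out;
    for (int k = 0; k < p; k++) vec[k] = rank + k;
    MPI_Reduce_scatter_block(vec, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: out = %d\n", rank, out);
    MPI_Finalize();
    return 0;
}
```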
All-to-All Broadcast and All-to-All Reduction
Linear Ring Broadcast (All to All)
Linear Ring Reduction (All to All)
• Draw an all-to-all broadcast on a p-node linear ring (a sketch of the broadcast half follows)
• Reverse the direction of each step without changing the messages
• After each communication step, combine messages having the same broadcast destination with the associative operator.
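The broadcast half can be sketched in C with MPI point-to-point calls (MPI and the p <= 64 buffer cap are assumptions): in each of p-1 steps, every node forwards the block it received last to its right neighbour. Reversing the directions and combining, per the bullets above, gives the reduction.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int left = (rank - 1 + p) % p, right = (rank + 1) % p;
    int result[64];                  /* result[i] will hold node i's block */
    result[rank] = 10 * rank;        /* my own block (m = 1 word here)     */

    int cur = rank;                  /* index of the block forwarded next  */
    for (int step = 1; step < p; step++) {
        int incoming;
        MPI_Sendrecv(&result[cur], 1, MPI_INT, right, 0,
                     &incoming, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cur = (cur - 1 + p) % p;     /* the block that just arrived */
        result[cur] = incoming;
    }
    printf("rank %d: result[0] = %d\n", rank, result[0]);
    MPI_Finalize();
    return 0;
}
```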
Task
• Draw an All-to-All Broadcast on a 4-node linear ring
• Reverse the directions and combine the results using ‘SUM’
All-to-All Broadcast on 2D Mesh
• Based on the linear array algorithm, treating rows and columns of the mesh as linear arrays
• Communication takes place in two phases (see the sketch below):
• Row-wise all-to-all broadcast
• Column-wise all-to-all broadcast
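A sketch of the two phases using the same MPI Cartesian helpers as the earlier mesh sketch (all assumptions, not slide content): after the row phase each node holds its whole row's blocks; the column phase then exchanges those bundles.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(p, 2, dims);
    MPI_Comm grid, row, col;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(grid, keep_row, &row);
    MPI_Cart_sub(grid, keep_col, &col);

    int mine = 10 * rank;
    int rowbuf[64], all[64];              /* assumes p <= 64 */

    /* phase 1: all-to-all broadcast inside each row */
    MPI_Allgather(&mine, 1, MPI_INT, rowbuf, 1, MPI_INT, row);
    /* phase 2: exchange the row bundles inside each column */
    MPI_Allgather(rowbuf, dims[1], MPI_INT, all, dims[1], MPI_INT, col);

    printf("rank %d: gathered all %d blocks\n", rank, p);
    MPI_Finalize();
    return 0;
}
```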
All-to-All Broadcast on HyperCube
• The hypercube algorithm for all-to-all broadcast extends
the mesh algorithm to log p dimensions.
• Procedure: Requires log p steps.
• Communication: Occurs along a different dimension (x, y,
z) of the p-node hypercube in each step.
• Step Process: Pairs of nodes exchange data, doubling the
message size for the next step by concatenating received
messages with current data.
• The figure illustrates these steps for an eight-node hypercube with bidirectional communication channels.
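A sketch of the pairwise-exchange steps in C with MPI (an assumption; p = 2^d and p <= 64 assumed): in step i, partners across dimension i swap everything collected so far and concatenate.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int buf[64], tmp[64], n = 1;      /* n = current number of blocks */
    buf[0] = 10 * rank;               /* my own block                  */

    for (int bit = 1; bit < p; bit <<= 1) {
        int partner = rank ^ bit;     /* neighbour across this dimension */
        MPI_Sendrecv(buf, n, MPI_INT, partner, 0,
                     tmp, n, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        memcpy(buf + n, tmp, n * sizeof(int));   /* concatenate */
        n *= 2;                       /* message size doubles each step */
    }
    printf("rank %d holds %d blocks\n", rank, n);
    MPI_Finalize();
    return 0;
}
```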
All-Reduce
• All-Reduce: All-to-One Reduction + One-to-All Broadcast
• Can be implemented as an all-to-one reduction followed by a one-to-all broadcast
• Equivalently, it can use the all-to-all broadcast pattern with messages combined by the operator instead of concatenated; the result is the same, with less traffic congestion since the message size stays m
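In MPI this whole pattern is the single collective MPI_Allreduce (an illustration; the slides do not name MPI):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int mine = rank + 1, total;
    /* every process receives the combined result in one call */
    MPI_Allreduce(&mine, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: total = %d\n", rank, total);
    MPI_Finalize();
    return 0;
}
```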
Example
• All-to-All Broadcast
• All-to-One Reduction
• One-to-All Broadcast
Prefix-Sums
• Prefix-sums are also known as scan operations
• Given p numbers n_0, n_1, ..., n_{p-1} (one on each node), the problem is to compute the sums
• S_k = Σ_{i=0}^{k} n_i
• Here S_k is the prefix sum computed at the k-th node after the operation.
• Example:
• Original sequence: <3, 1, 4, 0, 2>
• Sequence of prefix sums: <3, 4, 8, 8, 10>
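In MPI this is the inclusive scan collective MPI_Scan (an illustration, not slide content); a minimal sketch that reproduces the example when run with five processes:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* node k holds the k-th number of the sequence <3, 1, 4, 0, 2> */
    int seq[5] = {3, 1, 4, 0, 2};
    int n_k = seq[rank % 5], s_k;

    MPI_Scan(&n_k, &s_k, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* inclusive */
    printf("node %d: prefix sum = %d\n", rank, s_k);  /* 3,4,8,8,10 for p=5 */
    MPI_Finalize();
    return 0;
}
```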
Rules
• Round bracket ( ): the message is sent to the other node in the next step
• Square bracket [ ]: the message is kept at that node
• The lower-index node keeps its square-bracket message as it is
• The higher-index node adds the message it received from the lower-index node to its square-bracket message
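A sketch of how these rules play out on a hypercube, in C with MPI (the slides give only the bracket rules; everything else here is an assumption). msg is the round-bracket value that is passed on, result the square-bracket value that is kept; p = 2^d is assumed.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n_k = rank + 1;               /* this node's number */
    int result = n_k, msg = n_k;      /* [square] and (round) values */

    for (int bit = 1; bit < p; bit <<= 1) {
        int partner = rank ^ bit, recv;
        MPI_Sendrecv(&msg, 1, MPI_INT, partner, 0,
                     &recv, 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        msg += recv;                  /* running total of this sub-cube   */
        if (partner < rank)           /* higher index adds what it got    */
            result += recv;           /* from the lower-index node        */
    }
    printf("node %d: S_%d = %d\n", rank, rank, result);
    MPI_Finalize();
    return 0;
}
```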
Scatter and Gather
• Scatter (one-to-all personalized communication): the source sends a distinct message to every process
• Gather (concatenation) is different from all-to-one reduction: it does not combine the incoming messages with an associative operator
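These correspond to the MPI collectives MPI_Scatter and MPI_Gather (an illustration; rank 0 as root and p <= 64 are assumptions):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int blocks[64], mine;                  /* assumes p <= 64 */
    if (rank == 0)                         /* source prepares one block each */
        for (int i = 0; i < p; i++) blocks[i] = 100 + i;

    /* scatter: a personalized message to every process */
    MPI_Scatter(blocks, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* gather: concatenation at the root, no operator applied */
    int collected[64];
    MPI_Gather(&mine, 1, MPI_INT, collected, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("root re-collected %d blocks\n", p);
    MPI_Finalize();
    return 0;
}
```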
The scatter operation on an eight-node hypercube
All-to-All personalized Communication
• Each node sends a distinct message of size m to every other node.
• Also known as total exchange.
Example (Transpose Matrix)
All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.
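The transpose example maps directly onto the MPI collective MPI_Alltoall (an illustration, not the slides' code): with one process per row, element j of each row is delivered to process j, which therefore ends up holding column j, i.e. row j of the transpose. Run with four processes.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int row[4], col[4];
    for (int j = 0; j < 4; j++) row[j] = 4 * rank + j;   /* my matrix row */

    /* send row[j] to process j; receive one element from every process */
    MPI_Alltoall(row, 1, MPI_INT, col, 1, MPI_INT, MPI_COMM_WORLD);

    printf("process %d holds transpose row: %d %d %d %d\n",
           rank, col[0], col[1], col[2], col[3]);
    MPI_Finalize();
    return 0;
}
```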
All-to-All personalized [Ring]
Cont.
• All-to-all personalized communication on a six-node
ring.
• The label of each message is of the form {x, y},
where x is the label of the node that originally owned
the message, and y is the label of the node that is the
final destination of the message.
• The label ({x1, y1}, {x2, y2}, ..., {xn, yn}) indicates a
message that is formed by concatenating n individual
messages.
All-to-All Personalized [Mesh]
• Two steps:
1. All-to-all personalized communication (row-wise)
2. All-to-all personalized communication (column-wise)
All-to-All Personalized [Hypercube]
• 0th Process
• 1st Step (x-axis) 0<->1
• (0,1), (0,3), (0,5), (0,7)
• 2nd step (y-axis) 0<->2
• (0,2), (0,6), (1,2), (1,6)
• 3rd Step (z-axis) 0<->4
• (0,4), (1,4), (2,4), (3,4)
• 2nd Process
• 1st Step (x-axis) 2<->3
• (2,3), (2,7), (2,5), (2,1)
• 2nd step (y-axis) 2<->0
• (2,0), (2,4), (3,0), (3,4)
• 3rd Step (z-axis) 2<->6
• (2,6), (3,6), (0,6), (1,6)
Circular Shift
• A circular q-shift is the operation in which node i sends a data packet to node (i + q) mod p in a p-node ensemble (0 < q < p).
Circular Shift [Linear/Ring]
• Route each message over the shorter direction: min(q, p - q) hops
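A minimal sketch of the data movement in C with MPI (an assumption; q = 5 is just an example value, and the min(q, p - q) routing concerns the underlying network in this model, not the code):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p, q = 5;                  /* hypothetical shift amount */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    q %= p;                              /* keep 0 <= q < p */

    int packet = rank;                   /* the data that travels */
    int dest = (rank + q) % p;
    int src  = (rank - q + p) % p;
    MPI_Sendrecv_replace(&packet, 1, MPI_INT, dest, 0, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("node %d now holds the packet from node %d\n", rank, packet);
    MPI_Finalize();
    return 0;
}
```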
Circular Shift [Mesh]
• A circular shift on the mesh topology is done in the following steps:
1. Communication along rows: a circular (q mod √p)-shift in each row
2. A compensatory column shift for the wrapped-around messages
3. Communication along columns: a circular ⌊q/√p⌋-shift in each column
The communication steps in a circular 5-shift on a 4 x 4 mesh
Circular Shift [Hypercube]
• A q-shift, e.g., a 5-shift
• First convert q to its binary representation (5 = 101)
• Write the powers of 2 for the set bits: 2^2 + 2^0
• i.e., 5 = 4 + 1
• 5-shift = 4-shift + 1-shift
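A sketch of the same decomposition in C with MPI (an assumption; q < p is required, so run the 5-shift with eight processes): every set bit of q contributes one power-of-2 shift.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p, q = 5;                  /* example value; assumes q < p */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int packet = rank;
    for (int bit = 1; bit < p; bit <<= 1)
        if (q & bit) {                   /* set bit => perform a 2^i-shift */
            int dest = (rank + bit) % p;
            int src  = (rank - bit + p) % p;
            MPI_Sendrecv_replace(&packet, 1, MPI_INT, dest, 0, src, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    /* after the 4-shift and the 1-shift, node i holds the packet
       that started at node (i - 5) mod p */
    printf("node %d holds the packet from node %d\n", rank, packet);
    MPI_Finalize();
    return 0;
}
```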
The mapping of an eight-node linear array onto a three-dimensional hypercube to perform a circular 5-shift as a combination of a 4-shift and a 1-shift.