Intro to communication
9/8/09
Hypercubes
• Advantages:
– Best diameter and bisection width we have seen so far.
– Practically, these are very useful bounds and multiple systems
use a hypercube network topology.
• Disadvantages:
– Number of links per node is not constant
– p = 2^d, so d links per node = Θ(log p)
Bit labeling
• Because the number of processors along
any given dimension is 2, we can
represent a single coordinate by a single
bit: 0 or 1
• Each processor, therefore, has an
identifier of d bits.
– For example, (0,1,1) in an 8-processor hypercube
Helpful facts
• Flipping any of the d bits returns a neighbor of a
given processor.
– Each flip corresponds to a link in the network.
• Because there are d bits that can be flipped, a processor has exactly d neighbors.
• This also gives us a quick algorithm to check if
any two processors are neighbors.
– Hamming distance of 1
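• A minimal sketch of that check in C (my own illustration; the function name is made up): two ids are neighbors exactly when their XOR has a single bit set.

    /* Sketch: are two hypercube processor ids neighbors?
       True exactly when the ids differ in one bit (Hamming distance 1). */
    int are_neighbors(unsigned int a, unsigned int b)
    {
        unsigned int diff = a ^ b;                    /* bits where the ids differ */
        return diff != 0 && (diff & (diff - 1)) == 0; /* exactly one bit set */
    }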
Communication paths
• First, find the bits in which the source and destination processor ids differ.
• Route the message to a neighbor that matches the destination in one more bit, and continue until it arrives.
• If the processor ids differ in k bits, there are k! possible shortest paths.
– High connectivity = good
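• A small routing sketch (my own illustration; it fixes the differing bits from lowest to highest, i.e., one of the k! shortest paths):

    #include <stdio.h>

    /* Sketch: print one shortest path from src to dst in a d-dimensional
       hypercube by flipping the differing bits in a fixed (low-to-high) order. */
    void print_route(unsigned int src, unsigned int dst, int d)
    {
        unsigned int cur = src;
        printf("%u", cur);
        for (int bit = 0; bit < d; bit++) {
            if ((cur ^ dst) & (1u << bit)) {  /* ids still differ at this bit */
                cur ^= (1u << bit);           /* hop to that neighbor */
                printf(" -> %u", cur);
            }
        }
        printf("\n");
    }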
Our model of parallel computation
• As described previously, our model will be:
– p processors connected by an interconnection
network
– Each processor has its own local memory
– Accessing remote memory requires using the network
• We want processors to be able to communicate simultaneously
– You will use MPI function calls for your assignments
– Communication is further complicated because accessing remote memory requires a matching send/receive pair
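• As a rough illustration (not assignment code), a remote access in MPI becomes an explicit send/receive pair; the tag and values below are arbitrary.

    #include <mpi.h>

    /* Sketch: rank 0 sends one double to rank 1 over the network. */
    int main(int argc, char **argv)
    {
        int rank;
        double x = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 3.14;
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }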
Details
• We will assume that only one message is being
routed per processor.
• Before a message is sent, it must be prepared: a destination is attached, error-correcting information is applied, and any other preprocessing is done.
• This is called setup time and we will use τ to
model this parameter.
More details
• The transmission of the message can be
performed at the rate of µ time per word.
• It follows that total communication time can be
modeled as:
– τ + µm, where m is the size of the message in words
• To be comparable to computation, we usually measure these costs in units of floating point operations, or flops.
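• As a tiny sketch of the model (parameter names are mine), with τ and µ already expressed in flop units:

    /* Sketch: cost of sending one m-word message under the tau + mu*m model. */
    double comm_time(double tau, double mu, double m)
    {
        return tau + mu * m;  /* setup cost plus per-word transmission cost */
    }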
Supporting simultaneous
communication
• A processor can send one message and receive one message in any given time step.
• It is clear that a processor cannot perform multiple sends or multiple receives at the same time.
• Therefore, we would like communication steps
such that no two destinations are the same:
– We will call this a permutation network
Example
• This permutation (2 3 1 0) implies:
– Processor 0 sends a message to 2 (0, 2)
– Processor 1 sends a message to 3 (1, 3)
– Processor 2 sends a message to 1 (2, 1)
– Processor 3 sends a message to 0 (3, 0)
• Note that any step where a processor is not
involved can be represented by communication
with itself, e.g., (0,0).
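• As a hedged sketch of one way to realize this step in MPI (it assumes exactly 4 ranks; computing the inverse permutation for the receive side is my own choice):

    #include <mpi.h>

    /* Sketch: carry out the permutation (2 3 1 0) in one communication step.
       Each rank sends to perm[rank] and receives from the rank that maps to it. */
    int main(int argc, char **argv)
    {
        int perm[4] = {2, 3, 1, 0};
        int rank, sendbuf, recvbuf, src = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 4; i++)
            if (perm[i] == rank) src = i;   /* inverse permutation: my sender */
        sendbuf = rank;
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, perm[rank], 0,
                     &recvbuf, 1, MPI_INT, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }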
Multistage communication networks
• Multistage interconnection networks
(details next class and in text) are
topologically equivalent.
• It can be shown that these can achieve a
permutation in at most 3 steps.
Take home messages
• We make the theoretical assumption that
any permutation can be supported in
parallel (with some constant overhead).
• We will use permutation networks as our
interconnection network in this course.
– Simplifies algorithm design and analysis
In class example
• In small groups, what are some
permutations that can be achieved using a
ring network?
• What about a hypercube? Come up with
one example for a 3D cube of 8
processors.
Often used permutations
• The permutation network, by definition,
allows any communication where no two
destinations are the same in one step
• Even so, we usually only use the following:
– Shift permutations
– Hypercubic permutations
Shift permutations
• A shift permutation is one such that each
processor communicates with its neighbor
• Left shift:
– Processor i communicates with processor (i - 1) mod p
• Right shift:
– Processor i communicates with processor (i + 1) mod p
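• In code, the two partner computations are one line each (a sketch; the wrap-around mod p is applied to both shifts):

    /* Sketch: partner ids for the shift permutations on p processors.
       Adding p before the mod keeps the left-shift result non-negative in C. */
    int left_shift_partner(int i, int p)  { return (i - 1 + p) % p; }
    int right_shift_partner(int i, int p) { return (i + 1) % p; }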
Hypercubic permutations
• Let d = ceiling (log p)
• Just as in a static hypercube, each processor’s
id can be viewed as a string of d bits
• Fix a bit position j. The communication in which each processor exchanges with the processor whose id differs only in bit j is a hypercubic permutation
• There are d hypercubic permutations
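• In code, the partner in the j-th hypercubic permutation is just the id with bit j flipped (a sketch; the function name is made up):

    /* Sketch: partner of processor id in the j-th hypercubic permutation,
       for 0 <= j < d. */
    int hypercubic_partner(int id, int j)
    {
        return id ^ (1 << j);
    }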
Another reason why
hypercubes are cool
• Obviously, a hypercubic permutation can be routed in a single step on a hypercube network, since each pair of partners is directly connected
• This is better than many multistage
networks, some of which we will discuss
next class.
Tips for parallel algorithm analysis
• In any algorithm we design, we will
separate its computation and
communication costs.
• For example:
– Computation time = O(f(n,p))
– Communication time = O(τg(n,p) + µh(n,p))
A simple algorithm example
• Suppose we’d like to add n numbers,
where n >> p.
• For simplicity, let's assume p evenly divides n.
• What is one algorithm for this?
An approach
• It is clear that computing the sum of n/p
integers on a single processor requires n/p
time.
• Adding the p resulting partial sums can be performed in log p steps.
• Computation time = O(n/p + log p)
Communication time
• Communication time = O((τ + µ) log p)
• Spoiler: Can this be done on a
hypercube?
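• One possible realization of both pieces is sketched below in MPI (my own illustration; it assumes p is a power of two and uses the hypercubic permutations from earlier for the log p combining steps):

    #include <mpi.h>

    /* Sketch: sum n numbers on p processors (p a power of 2, p divides n).
       Local sums take O(n/p) work; log p hypercubic exchange steps then
       combine the p partial sums, so every rank ends with the total. */
    double parallel_sum(const double *local, int n_local, MPI_Comm comm)
    {
        int rank, p;
        double sum = 0.0, other;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (int i = 0; i < n_local; i++)      /* O(n/p) local computation */
            sum += local[i];
        for (int j = 1; j < p; j <<= 1) {      /* log p hypercubic permutations */
            int partner = rank ^ j;
            MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                         &other, 1, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            sum += other;
        }
        return sum;
    }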
How do we know if this is good?
• The sequential runtime of adding n
integers is Θ(n).
• Therefore, a parallel algorithm must take
at least Ω(n/p); ours is O(n/p + log p)
• As a result, we are efficient as long as:
– n > p log p (in class)
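• A short written-out version of that in-class step (my own sketch of the argument):

    \text{Efficiency} \;=\; \frac{T(n,1)}{p\,T(n,p)}
    \;=\; \Theta\!\left(\frac{n}{p\,(n/p + \log p)}\right)
    \;=\; \Theta\!\left(\frac{n}{n + p\log p}\right)

which stays bounded below by a constant exactly when n grows at least as fast as p log p.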
But why?
• Note τ and µ are constants, and we are
using O notation here, not Θ
• Why isn’t the algorithm?
– Max{f(n,p), g(n,p), h(n,p)}
Older values of communication
constants (relative to one flop)
Machine         τ        µ        τ/µ
IBM BG/L        3000     50-100   15-30
Intel Delta     4650     87       54
Intel Paragon   7800     9        867
Cray T3D        27000    9        3000
IBM SP1         28000    50       560
CM5             450      3        113
An example
• Suppose our algorithm appears to achieve ideal speedup
• Further,
– f(n,p) = g(n,p) = h(n,p) = T(n, 1) / p
• If τ, however, is 1000, our speedup would be p/1000
– This implies 1000 processors would be only about as fast as a single processor
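• The arithmetic behind that claim, written out (a sketch; the approximation assumes µ is small compared with τ):

    T(n,p) \;=\; f + \tau g + \mu h \;=\; (1 + \tau + \mu)\,\frac{T(n,1)}{p},
    \qquad
    S(p) \;=\; \frac{T(n,1)}{T(n,p)} \;=\; \frac{p}{1 + \tau + \mu} \;\approx\; \frac{p}{1000}
    \;\text{ when } \tau = 1000,

so p = 1000 processors give a speedup of roughly 1.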
Rules of thumb
• We would like f(n,p) to be as close to
O(T(n,1)/p) as possible
• Ideally, h(n,p) grows more slowly than f(n,p), and g(n,p) grows more slowly still