Dna Book
Gopal Pandurangan
Department of Computer Science
University of Houston
September 17, 2021
Contents
1 Introduction
1.1 Two aspects of distributed computing
1.2 What this book is about?
1.3 Background needed
1.4 Worked Exercises and Exercises
1.5 Advanced Material
2 Model
2.1 The Distributed Network Model
2.2 Performance of distributed algorithms: Complexity measures
2.2.1 Time complexity
2.2.2 Message complexity
2.3 Exercises
5 Leader Election
5.1 Leader Election in a Ring Network
5.1.1 A basic algorithm
5.1.1.1 Analysis
5.1.2 An Improved Algorithm with O(n log n) Message Complexity
5.2 Randomized Distributed Algorithms
5.3 A Randomized LE Algorithm for Rings
5.3.1 Analysis
5.4 Leader Election in a Complete Network
5.4.1 A SimpleLE-CN Algorithm
5.4.2 A Randomized Algorithm
5.5 Leader Election in a General Network
5.5.1 BasicLE-Gen Algorithm
5.5.2 A Randomized Leader Algorithm with Improved Message Complexity
5.6 Estimating Network Size
5.7 Worked Exercises
5.8 Exercises
6.4.2.1 Analysis
6.5 Worked Exercises
6.6 Exercises
Appendices
D Probability
D.1 Events, Probability, Probability Space
D.2 Principle of Inclusion-Exclusion
D.3 Conditional Probability
D.4 The Birthday Paradox
D.5 Random Variable
D.6 Expectation of a random variable
D.7 Useful Types of Random Variables
D.7.1 Binomial Random variables
D.7.2 The Geometric Distribution
D.8 Bounding Deviation of a Random Variable from its Expectation*
D.8.1 Markov Inequality
D.8.2 Chernoff Bound for Sum of Indicator R.V.'s
Bibliography
List of Figures
4.1 The figure shows routing tables in each node in a graph. (If a slot is empty in a routing table, it means the destination is a neighbor.) For example, if node B wants to route a message to node E, then according to B's routing table the next hop is node D, and so it will forward the message to D (along with the information of the destination node, which is E). According to D's routing table, E is a neighbor, so it will forward to E directly.
4.2 Illustrating Dijkstra's algorithm (centralized) on a graph. The algorithm computes shortest paths from A to all nodes in the graph; the output can be represented as a shortest path tree (SPT). The final SPT is shown — the (unique) path in the tree from A to any other node gives the shortest path to that node.
4.3 Illustrating the Bellman-Ford distributed algorithm on a graph. The algorithm computes shortest paths from A to all nodes in the graph; the output can be represented as a shortest path tree (SPT). The final SPT is shown — the (unique) path in the tree from A to any other node gives the shortest path to that node.
4.4 An example network for the count-to-infinity problem.
C.1 A connected undirected graph and the same graph with weights.
C.2 A graph and an induced subgraph of the graph.
C.3 A disconnected graph.
C.4 An undirected graph and its BFS tree.
C.5 An undirected graph and its DFS tree.
List of Algorithms
1 Introduction
This book will cover the fundamentals of distributed network algorithms from a modern
perspective. With the rise of the Internet and the paradigm shift from single-processor
computing to multi-processor computing, it has become very important to understand
the fundamentals of distributed computing and networks. The examples of real-world
distributed networks are many: the Internet consists of billions of entities connected over
a communication backbone; a peer-to-peer network consists of computers communicating
via an overlay network built on top of the Internet; wireless phones can communicate
with each other or through a centralized access point using radio communication; a sensor
network consists of tiny devices that communicate using a wireless medium; a typical data
center network consists of thousands of distributed machines connected over a network
which can be used to distributively solve a large-scale computational problem; Bitcoin is
a peer-to-peer digital currency that is managed in a decentralized fashion; a network of
people communicating via a social network. In all these examples, we have the paradigm of
many distributed agents that communicate via some network to jointly perform some task.
The study of computation of such distributed systems is quite different from the traditional
computation theory of single processor (sequential) computation (which you study in
a typical algorithms course). While time and memory are the important complexity
measures in the sequential setting, the amount of communication between nodes plays a
key role in the distributed setting.
This book will focus on distributed network algorithms for fundamental distributed
computing problems. The goal is to explain the algorithms in an intuitive way, but without
sacrificing mathematical rigor. While algorithms and analysis will form a substantial
portion of the book, a good portion will be devoted to understanding the fundamental
limits of distributed network computation. The book will introduce some of the most
important developments in the field in the last decade or so. This is done in a manner that
is accessible to an advanced undergraduate or a beginning graduate student. The hope
is that it will serve not only as a reference but also will prepare the reader for pursuing
research in this vibrant area of algorithms.
Message passing vs. shared memory. In message passing, nodes (we will use this terminology throughout; it can denote machines, processors, sensors, people, or, in general, any set of agents) communicate via an underlying network by exchanging messages.
Whereas in shared memory, the communication is via shared access to a common memory.
While some fundamental problems in distributed computing are common to both models,
some others are more meaningful in one versus the other. In this book we will focus
on the message passing model. In a message passing model, we assume an underlying
communication network modeled as a graph. The vertices (or interchangeably nodes) of
the graph represent the nodes and the edges of the graph represent the communication
links between the nodes. Two neighboring nodes are nodes that are connected to each
other directly via a communication link; such nodes can communicate with each other by
sending and receiving messages. This book will exclusively focus on the message passing
model as it is suitable for distributed network computing.
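To make the synchronous message passing model concrete, the following is a minimal sketch in Python of a round-based simulator (the function and variable names are ours, for illustration only): in each round, every node reads the messages delivered to it, updates its local state, and hands back the messages to be delivered to its neighbors at the start of the next round.

# A minimal sketch of a synchronous message-passing simulator (illustrative names).
# adj: dict mapping each node to the list of its neighbors (an undirected graph).
# node_algorithm(v, r, inbox, state) -> (new_state, outbox), where inbox is a list of
# (sender, message) pairs delivered to v in round r, and outbox is a list of
# (neighbor, message) pairs that v sends in round r.
def run_synchronous(adj, node_algorithm, num_rounds):
    state = {v: None for v in adj}            # each node's local state
    inboxes = {v: [] for v in adj}            # messages delivered at the start of a round
    for r in range(num_rounds):
        next_inboxes = {v: [] for v in adj}
        for v in adj:
            state[v], outbox = node_algorithm(v, r, inboxes[v], state[v])
            for (u, msg) in outbox:           # messages sent in round r are delivered
                next_inboxes[u].append((v, msg))   # at the start of round r + 1
        inboxes = next_inboxes
    return state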
Efficient algorithms. Another aspect that will be stressed throughout is the efficiency
of distributed algorithms. The efficiency measures that we consider are time and message
complexity. Designing algorithms that finish as quickly as possible and use as little
communication (messages) as possible will be an important focus. We will also cover
some lower bounds that show that certain problems cannot be solved faster or using
less communication (messages). These are important as well, as they show the limits of
performance that a distributed algorithm designer can achieve.
What will you learn? The book aims to teach the fundamentals of distributed network
algorithms. It is expected that you will acquire a good knowledge of distributed algorithms,
in particular, algorithms for various fundamental problems, which form building blocks
for designing efficient algorithms in distributed networks and systems. You will also learn
about distributed algorithms for processing (large-scale) data which underlies distributed
Big Data processing. You will also appreciate the limits to the efficiency of distributed
algorithms, including what is possible and what is not. This will give insight into designing
algorithms that can operate close to the theoretical optimum. From a practical point
of view, as large-scale distributed systems and networks, and distributed processing of
Big Data become increasingly common, these algorithms will play an important role in
designing efficient distributed systems and applications.
Attempting the worked exercises before seeing their solutions is a good learning technique. The solutions to the worked exercises can provide insights into solving the exercises.
2 Model
The focus of this course is on distributed network algorithms which are algorithms that
are designed to run on many processors “distributed” over a communication network.
Typically, the processors are distributed over a geographical area. Examples include
the Internet, peer-to-peer networks, and ad hoc wireless and sensor networks. As mentioned in
the previous chapter, there are many distributed computing models. This course will
focus almost exclusively on the message passing model which is suitable for distributed
computation by a set of processors interconnected over a communication network.
Figure 2.1: A distributed network modeled as a graph: a connected undirected graph and the same graph with edge weights.
Local communication Nodes can communicate directly (only) with their neighbors through the edges. We can distinguish two types of local communication: (1) Local unicast: a node can send a different message to each of its neighbors in one step; (2) Local broadcast: a node sends the same message to all its neighbors in a step. Local broadcast is simpler and is a feature of wireless networks, which use radio broadcast. Local unicast, on the other hand, is more suitable for wired networks.
Unique IDs Nodes have unique processor identities (IDs). For example, in the Internet, each endhost (which may be an intermediate router or a processor at an endpoint) has a unique IP address. There are situations where node IDs may not be present or may not be unique; for the most part we will not consider this scenario, although such anonymous networks (or networks with homonyms) are also worth studying.
timing assumptions. First, we assume that messages arrive in the same order they
are sent (i.e., there is FIFO queuing). Second, we assume that if a processor has an
event enabling it to perform a task, the processor will eventually perform the task.
In particular, each message sent will eventually be delivered.
Between the two extreme models, we can define “intermediate” models that are
partially synchronous, where the processors have some partial information about
timing (e.g., almost synchronized clocks or upper bounds on message delivery time
etc.), but do not have complete information as they do in the synchronous model.
Although intermediate models can provide a more realistic model of real networks
such as the Internet, we will restrict our attention to synchronous and asynchronous
models in this course. Algorithms designed for the synchronous model can sometimes be translated to work for the asynchronous model (see below), and algorithms for the latter model will work for an intermediate model as well.
This book will mainly assume the synchronous model. In particular, most of the
algorithms will be designed for the synchronous model. However, we do devote
some chapters to the asynchronous model. In particular, we show how algorithms
designed for the synchronous model can be transformed in an automatic way into algorithms for the asynchronous model by means of a tool called the synchronizer. Such a tool allows the algorithm designer to focus on designing a distributed algorithm in the simpler synchronous model, where it is easier to prove correctness and perform complexity analysis. Then one can use a synchronizer to obtain a corresponding asynchronous algorithm. In contrast to this approach, directly designing algorithms for the asynchronous model is challenging, as it is typically difficult to prove correctness. On the downside, there is sometimes an overhead in using a synchronizer as opposed to designing an asynchronous algorithm directly.
Sometimes, we will assume that nodes have some limited global knowledge, e.g.,
about the network size (the number of nodes in the network) — denoted throughout
by n. A node has to communicate with its neighbors in order to know about their
identities; throughout the run of the algorithm, it can also learn about the neighbors
of its neighbors, by communicating with them (indirectly via its neighbours) etc.
Starting time How and when does a node start its execution? We will typically assume
that all nodes start the computation at round 0, are initially in the same starting
state, and proceed to execute the same algorithm. However, in some algorithms,
only some nodes start their computation at round 0; other nodes can join later in
the computation when they receive messages from their neighbors. These issues will
become clearer when we discuss various algorithms.
Figure 2.2: Distributed computing model classification.
For the most part, we will assume the synchronous CONGEST model unless otherwise
stated, since it is simpler and easier to design algorithms for this model. Another key plus
of this model is that it allows us to focus on certain core distributed algorithmic issues,
without the distraction of more complex issues that arise in the asynchronous setting.
In the fault-free setting, as mentioned earlier, we will see later that using a tool called
synchronizers one can transform a synchronous algorithm to work in an asynchronous
model with some overhead.
However, this does not mean that the asynchronous model or the LOCAL model is less
important. We will study these models when and where they are appropriate. For example,
to understand locality issues, i.e., how far (i.e., how many hops) a node must communicate
to achieve a certain task, it might be more appropriate to adopt the LOCAL model, which
abstracts out issues related to bandwidth and congestion. Another important issue that
we will ignore for now is failures and faults. Unless otherwise stated, we will assume
that nodes do not fail and are not malicious, i.e., they faithfully run the algorithm they
are supposed to execute. Moreover, messages are delivered correctly and are not lost or
otherwise corrupted. Later in the course, we will study models with failures whenever
appropriate.
A relevant question is whether the above synchronous model is too restrictive. One can think of the model as existing mainly to facilitate analysis — it provides a clean way to compare the performance of various distributed algorithms (this situation is similar to the RAM model in the sequential setting [4]). Algorithms themselves need not conform to this restricted model; they may work in more general or alternate models. The
synchronous/asynchronous models are the two extremes. If we have an algorithm that
assumes point-to-point communication (suitable for a wired setting), can we use it in a
wireless setting where radio broadcast communication is used? Some elements will have
to be changed, but the core algorithm may still work.
2.3 Exercises
Exercise 2.1. Show that Θ(log n) bits are both necessary and sufficient to uniquely
address a network of n nodes, when each node has a (unique) identifier chosen from the
integer range [1, n^2].
Exercise 2.2. Consider a distributed network of n nodes, where each node starts with
its own data value, say a number. For example, this can be a sensor network and the
value can represent the temperature reading of each sensor node. Describe a distributed
algorithm that computes the minimum value among all nodes of the network. When the
algorithm has terminated, every node should know the value of the minimum. You can
assume the synchronous CONGEST model.
Show that your algorithm correctly computes the global minimum. Does your algorithm detect termination, i.e., does each node know when all nodes have finished their computation? What is the time and message complexity of the algorithm? Does your algorithm and analysis also work in the asynchronous model?
Exercise 2.3. 1. What is the difference between the KT0 model and the KT1 model?
2. Suppose we have an algorithm that works in the KT0 model. Will it work in the
KT1 model without any changes?
3. Suppose we have an algorithm that works in the KT1 model. Will it work in the
KT0 model without any changes?
4. Suppose we show that the time complexity of solving some problem in the KT1
model is T , then what does it imply regarding the time complexity of the problem
in the KT0 model?
3 Broadcast and Tree Algorithms
We start our study of distributed algorithms with some of the most basic algorithms.
These basic “primitives” are commonly used and they also form building blocks for more
sophisticated algorithms. As discussed in Chapter 2, we will assume that the distributed
network is an undirected connected graph. We will consider the CONGEST model, i.e.,
only O(log n)-sized messages can be sent per edge per time step. Our algorithms will also
assume the synchronous model throughout.
This chapter uses basic concepts in graph theory; for a quick recap of these, see
Appendix C.
3.1 Broadcast
Broadcasting is an important communication mode: sending a message from a source
node to all other nodes of the network.
Definition 3.1 (Broadcast)
Given a network G = (V, E) and a source node s, the goal of broadcast is to send a
message from s to all nodes in V .
Figure 3.1: Illustrating the Flooding algorithm in a graph. s is the source node.
Proof Idea: It is important to first get an intuition why the algorithm works before
one formally shows it. In the case of Flooding, it is clear that every node will eventually
receive the message sent by s. Why? Consider any node v. There is a path from s to
v, in particular, consider a shortest path between s and v. Each node along this path
forwards the message to its neighbour on the path and eventually it reaches v. Clearly the
number of steps taken is bounded by the number of edges in the path, which is bounded by the diameter D of the network. Recall that the diameter is the maximum distance between
any two nodes in the network.
Typically, we will use mathematical induction to argue for formal correctness of many
algorithms. For a quick recap of induction, see Appendix B.
Proof. The formal correctness as well as time needed can be easily established via mathe-
matical induction. As in the proof idea, let v be any node. We use induction on t to show
that after t time units, the message has already reached every vertex at a distance of t or
less from the source. At time t = 0 (the base case), node s has the message, which is at
distance 0. Assuming that the hypothesis is true at time t, it follows that at time t + 1, every node at distance t + 1 from the source (each such node is a neighbor of some node at distance t) will receive the message by the flooding algorithm.
The message complexity follows from the fact that each edge delivers the message
at least once and at most twice (one in each direction). The time complexity is clearly
bounded by D, as discussed in the proof idea.
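To illustrate, here is a short, self-contained Python sketch (our own illustration, assuming the synchronous model) that simulates flooding from a source s and records the round in which each node first receives the message; as the proof shows, this round equals the node's distance from s.

def flood(adj, s):
    # Simulate flooding from source s on a graph given as an adjacency list.
    # Returns, for each node, the round in which it first receives the message.
    received_round = {s: 0}       # the source holds the message at round 0
    frontier = {s}                # nodes that received the message in the last round
    r = 0
    while frontier:
        r += 1
        newly = set()
        for v in frontier:        # each such node forwards the message to all neighbors
            for u in adj[v]:
                if u not in received_round:
                    received_round[u] = r
                    newly.add(u)
        frontier = newly
    return received_round

# Example: on the path a-b-c-d, node d first receives the message in round 3.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
assert flood(adj, "a") == {"a": 0, "b": 1, "c": 2, "d": 3}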
Proof. Every node has to receive the message sent by s, the source node. Since there are
n nodes in the network, and each node (except s) has to receive the message at least once, at least n − 1 messages are needed.
For the time lower bound, note that, for any source node s, there exists a node (say
u) which is at a distance of at least Ω(D) from s (Exercise 3.1). Since each round can
advance the distance (to u) by at most one, the time lower bound follows.
Note that while the time lower bound of Ω(D) is tight for broadcast (the flooding
algorithm achieves it), the message lower bound is not. Later we will show a fundamental
result: Ω(m) is indeed a lower bound on the message complexity of broadcast.
Assume that we are given a spanning tree T of G, rooted at s. Then such a tree can be
used to do efficient broadcast from s that uses fewer messages than flooding.
A spanning tree-based broadcast from a source node s proceeds as follows. The root s
sends the message to its neighbours; when a node v receives the message for the first time,
it forwards it along the tree edges only except to the node from which it just received the
message; if v receives the message again, it ignores it.
Essentially, tree-based broadcast is like flooding, but restricted to the tree edges only;
no messages are sent via non-tree edges.
The following theorem is immediate from the above description.
Theorem 3.3
Given an n-vertex graph G with a spanning tree T rooted at s, the message complexity
of broadcast from s is n − 1 and time complexity is depth(T ), where depth(T ) is
the depth (or height) of the tree T .
Note that the time complexity of spanning tree-based broadcast can be larger than the diameter of G, if the depth of the spanning tree used is larger than the diameter. If we use a breadth-first spanning tree (BFS tree), then the depth of such a tree is at most the diameter. Thus, using a breadth-first spanning tree, we get a message complexity of n − 1 and a time complexity
of D for broadcast. We will study distributed algorithms for constructing a BFS later.
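The following Python sketch (our own illustration, not the book's pseudocode) simulates spanning tree-based broadcast and reports the message and time counts, which match the n − 1 and depth(T) bounds of Theorem 3.3.

def tree_broadcast(children, root):
    # Broadcast along a rooted spanning tree given as children lists.
    # Returns (number of messages sent, number of rounds used).
    messages, rounds = 0, 0
    frontier = [root]                      # nodes that got the message last round
    while frontier:
        next_frontier = []
        for v in frontier:
            for c in children.get(v, []):  # forward only along tree edges
                messages += 1
                next_frontier.append(c)
        if next_frontier:
            rounds += 1
        frontier = next_frontier
    return messages, rounds

# Example: a tree rooted at s with 5 nodes and depth 2 uses 4 messages and 2 rounds.
children = {"s": ["a", "b"], "a": ["c", "d"]}
assert tree_broadcast(children, "s") == (4, 2)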
Figure 3.2: Illustrating the Convergecast algorithm in a graph. s is the source node.
3.4.1 Pipelining
Consider the following modification of the sum finding problem: each node has a list of k values, i.e., each node i has ⟨a_i(1), . . . , a_i(k)⟩, and the goal is to compute the list of (component-wise) sums
$$\left( \sum_{i=1}^{n} a_i(1),\ \ldots,\ \sum_{i=1}^{n} a_i(k) \right).$$
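One possible way to realize this with pipelining is sketched below in Python (our own illustration; the tree representation, value lists, and function name are hypothetical): a node sends the partial sum of component j to its parent as soon as all of its children have contributed to component j, so distinct components travel up the tree in a pipelined fashion and the root obtains all k sums within O(k + depth(T)) rounds.

def pipelined_convergecast(parent, values, root):
    # Sketch: pipelined convergecast of k component-wise sums up a rooted tree.
    # parent maps each non-root node to its parent; values maps each node to its list
    # of k numbers. Returns (list of k sums at the root, number of rounds used).
    k = len(values[root])
    num_children = {v: 0 for v in values}
    for p in parent.values():
        num_children[p] += 1
    received = {v: [0] * k for v in values}         # children's contributions received, per component
    partial = {v: list(values[v]) for v in values}  # running partial sums at each node
    next_comp = {v: 0 for v in parent}              # next component a non-root node will send up
    rounds = 0
    while any(received[root][j] < num_children[root] for j in range(k)):
        rounds += 1
        deliveries = []
        for v, j in list(next_comp.items()):
            # component j is ready at v once all of v's children have contributed to it
            if j < k and received[v][j] == num_children[v]:
                deliveries.append((parent[v], j, partial[v][j]))
                next_comp[v] = j + 1
        for (p, j, s) in deliveries:                # messages take effect at the end of the round
            partial[p][j] += s
            received[p][j] += 1
    return partial[root], rounds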
3.5 Upcast
We are given k items distributed arbitrarily on the nodes of a network G = (V, E). A
node can have zero, one, or more items. (We assume that each item is of small size, i.e.,
can be represented in O(log n) bits.) The goal in the upcast problem is that we would
like to collect all these items at a node s ∈ V . Assume that there is a spanning tree T
rooted at s. Note that, unlike in the convergecast setting, one cannot aggregate items;
all the items have to be individually collected at the root. However, one can use the
convergecast algorithm to solve the problem in a trivial way: convergecast each item
sequentially one by one to the root, waiting for depth(T ) time between convergecasts; this
takes O(k × depth(T )) time. We next describe a simple algorithm (Algorithm 2) that
uses pipelining to achieve an O(k + depth(T )) time.
Assume that each item has a unique number associated with it, say its ID (IDs as
usual, will be represented using O(log n) bits). At any round, each node does the following:
among the set of items that it currently has, send the item with the highest ID to its
parent (the sent item is then removed from its set). The main idea of this algorithm is to
use the ID to break ties; it gives priority to the item that is ranked highest. This kind of
priority is very useful in the analysis to show a bound on the running time. Figure 3.3 shows
an illustration of the algorithm. The following theorem establishes the correctness and
time complexity of this upcast algorithm.
Figure 3.3: Illustrating the Upcast Algorithm in a graph. Ties are broken by priority based on ID.
Theorem 3.4
Algorithm 2 upcasts all the items to the root and takes O(k + depth(T )) time.
Proof idea: The highest ranked item (i.e., the item with the highest ID) — call it z
— is never delayed by any other item, since it has the highest priority. Hence it reaches
the root in exactly dz steps, where dz is its distance from the root. The second-highest
ranked item — call it y— can be delayed by only the highest ranked item. Furthermore,
it is important to note that it can be delayed at most once by the highest ranked item.
Why? Consider the situation when y is delayed by z. This means both these items are in
the same node v at some round and z has to be forwarded to v’s parent before y. Once z
is forwarded, it is always “one-step” ahead of y and never again delays y. Thus y reaches
the root in at most dy + 1 steps, where dy is its distance from the root.
Proof. Consider the jth highest item (we refer to the item and its rank interchangeably). It can possibly be delayed only by items ranked higher than itself, i.e., by the j − 1 other items. Let j′ (1 ≤ j′ ≤ j − 1) be the rank of the lowest ranked item that delays j. Note that once this happens, j is never delayed again by j′; any further delay can only be attributed to an item that is ranked higher than j′ (say j″): this is because j″ can delay j′ by one round, which allows j to “catch” j′ and incurs an additional delay of one round for j; however, this delay can be “charged” (attributed) to j″. Thus every round of delay of j can be charged to a distinct higher ranked item, so j is delayed at most j − 1 ≤ k − 1 times and reaches the root within depth(T) + k − 1 = O(k + depth(T)) rounds.
It is left as an exercise to show that the above bound is optimal, i.e., every upcast
algorithm needs Ω(k + depth(T )) rounds in the worst case.
Exercise 3.5 asks you to show that an even simpler rule (no priority is needed) also
gives the same time bound for upcast. This rule thus applies even where there is no global
ordering of the items.
3.6 Downcast
The objective of downcast is the opposite of upcast: Given k distinct items at the root
(source node) s of a (spanning) tree T, the goal is to disseminate all the k items to all
the nodes in the tree. As in upcast, using pipelining, one can show that downcast can be
accomplished in O(k + depth(T )) time. See Worked Exercise 3.1.
Algorithm 2 An upcast Algorithm; code executed by node v. Initially there are k items distributed among the nodes in the network.
1: Let S(v) be the set of items that node v currently holds (initially, the items it possesses).
2: In every round do:
3: Add to S(v) any items received from v's children in this round.
4: if v is not the root and S(v) is non-empty then
5: Let a be the item with the highest ID in S(v).
6: Send a to the parent of v in the tree T.
7: S(v) = S(v) − {a}.
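The following Python sketch (our own illustration, with hypothetical names) simulates this upcast algorithm round by round and returns the number of rounds used, which one can check against the O(k + depth(T)) bound of Theorem 3.4.

def upcast(parent, items, root):
    # Sketch of the upcast algorithm: every round, each non-root node sends the
    # highest-ID item it currently holds to its parent.
    # parent: dict node -> parent (root excluded); items: dict node -> set of item IDs.
    total = sum(len(s) for s in items.values())
    holding = {v: set(s) for v, s in items.items()}
    rounds = 0
    while len(holding[root]) < total:
        rounds += 1
        sends = []
        for v in parent:                      # all non-root nodes act in parallel
            if holding[v]:
                a = max(holding[v])           # the item with the highest ID gets priority
                holding[v].discard(a)
                sends.append((parent[v], a))
        for (p, a) in sends:                  # items received join the parent's set
            holding[p].add(a)
    return holding[root], rounds

# Example: a root with two children holding items {1, 2} and {3}; all 3 items
# reach the root in 2 rounds (k = 3, depth(T) = 1).
root_items, r = upcast({"a": "s", "b": "s"}, {"s": set(), "a": {1, 2}, "b": {3}}, "s")
assert root_items == {1, 2, 3} and r == 2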
have a parent, and leaf nodes will not have any children. Note that each node (except the root) will have exactly one parent. Note that this information is all that is needed for a
node to “route” information in the tree, e.g., if each node repeatedly forwards information
to its respective parent, eventually the root will receive it. On the other hand, as we saw
in Tree broadcast, if each node sends its information to all its children, then broadcast
can be implemented.
Figure 3.4: BFS tree construction algorithm on a graph starting from source node S. Figures (1)–(6) show the steps of the algorithm. The BFS tree constructed is shown. The table shows the parent-child relationships and the invite-accept steps executed.
Figure 3.5: Illustrating termination detection. Steps (1)–(4) show the echo sending (convergecast) process. Once the echos reach the root, the source node broadcasts a “done” message to all the nodes (via flooding). The broadcasting is shown in steps (5)–(7).
Worked Exercise 3.1. Show that downcast of k distinct items given at the root s of a (spanning) tree T to all nodes of the tree can be accomplished in O(k + depth(T)) rounds.
Solution. Each item has a unique ID, and there exists a total order on these IDs. The items are sent down in increasing order of their ID. Any item, once sent from the root, takes at most depth(T) time to reach a leaf. Due to congestion, the item with the ith smallest ID must wait for i − 1 rounds before it can be sent from the root. At subsequent nodes, due to the pipelining, it is not further delayed by other items. Thus each item takes at most i − 1 + depth(T) ≤ k − 1 + depth(T) = O(k + depth(T)) rounds to reach all leaves from the root.
Worked Exercise 3.2. Consider the following algorithm for information spreading.
Assume that each token has a priority number. Each node simply floods each token
separately; if there are multiple tokens to send, then one with the highest priority is sent.
More precisely, when a node receives a token (from any of its neighbors) for the first time
it saves it in its buffer. (If the same token is received subsequently it is ignored.) In
every round, it picks a token with the highest priority among those in its buffer (if it is non-empty) and sends it to all its neighbors and then deletes it from its buffer. Show that
this algorithm will take O(D + k) rounds.
Solution. We show that a token starting at a node u will reach any node v in at most
D + k rounds.
More precisely, we show the following claim.
Claim: Let P be a shortest path between u and v. Let P = u, u1, . . . , ut = v and let it have t ≤ D hops. Then a token starting at u will reach node ui in at most i + r rounds (1 ≤ i ≤ t), where r is the priority of the token (the token with the highest priority is 0, and the one with the lowest priority is k).
The claim clearly holds for the highest priority token, i.e., the token with priority r = 0 (call it a1). This is because it is not delayed at all in P, hence it will reach ui in i rounds (1 ≤ i ≤ t).
Now consider the token with the second highest priority, i.e., r = 1 (call it a2); we argue that it can be delayed by at most one round, i.e., it will reach ui in at most i + 1 rounds.
Suppose a2 encounters the highest priority token, a1 , in some node uj , j < i in the
path P . (Otherwise it will not be delayed at all.) It will be delayed by a round if a2 and
a1 reach uj in the same round (i.e., both tokens are in the node’s buffer in this round);
otherwise, again, it will not be delayed at all. Since uj , uj+1 , . . . , ui is a shortest path
(and there is no shorter path), a1 which will be sent first, will reach every node along this
path before a2 (note that this may be due to the forwarding from uj or a1 was already
present in the node.) Hence a2 will never again be delayed on this path by a1 . Hence a2
will reach ui, for 1 ≤ i ≤ t, in at most i + 1 rounds.
The same argument can be applied to the third highest priority token and so on.
Hence the time bound of D + k holds for any token for any destination node.
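A Python sketch of this scheme (our own illustration; token priorities are taken to be integers, with a smaller number meaning higher priority) is given below; it records, for every node, the round in which each token first arrives.

def priority_flooding(adj, tokens_at):
    # Sketch of the information spreading scheme of Worked Exercise 3.2: each node keeps
    # a buffer of tokens it has not yet forwarded; in every round it floods the one with
    # the highest priority (smallest number) to all of its neighbors.
    seen = {v: {} for v in adj}          # node -> {token: round it was first received}
    buffer = {v: set() for v in adj}     # tokens waiting to be forwarded by each node
    for v, tokens in tokens_at.items():
        for t in tokens:
            seen[v][t] = 0
            buffer[v].add(t)
    r = 0
    while any(buffer[v] for v in adj):
        r += 1
        sends = []
        for v in adj:
            if buffer[v]:
                t = min(buffer[v])       # highest priority token in v's buffer
                buffer[v].discard(t)
                sends.append((v, t))
        for (v, t) in sends:             # local broadcast to all of v's neighbors
            for u in adj[v]:
                if t not in seen[u]:     # a token is buffered only on first receipt
                    seen[u][t] = r
                    buffer[u].add(t)
    return seen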
3.10 Exercises
Exercise 3.1. Show that, given a network G, for any source node s in G, there exists a node which is at a distance of at least Ω(D) from s, where D is the diameter of the network.
Exercise 3.3. Show that the convergecast algorithm described in Section 3.4 correctly calculates the sum $\sum_v a_v$ in time O(depth(T)).
Exercise 3.4. Show that the pipelined convergecast algorithm described in Section 3.4.1
correctly calculates component-wise sums in time O(k + depth(T )).
Exercise 3.5. Consider an upcast algorithm where each node (except the root)
maintains a list of items to be sent to its parent. There are a total of k items over all the
nodes. In every round, a node sends an arbitrary item in its current list to its parent and
deletes it from the list. Show that this upcast algorithm terminates in O(k + depth(T ))
time.
Exercise 3.6. Prove the correctness of the distributed BFS tree construction algorithm.
Exercise 3.7. Let h be the depth (or height) of the BFS tree constructed by the algorithm.
Show that the time complexity of the BFS algorithm is O(h). Show that h is related to
the diameter D of the network as follows: D/2 ≤ h ≤ D.
Exercise 3.9. Show that upcast of k items in a tree T needs at least Ω(k + depth(T ))
rounds.
Exercise 3.10. Consider the following algorithm for information spreading which is even
simpler than the one described in Worked Exercise 3.2. Each node simply floods each
token separately. More precisely, when a node receives a token (from any of its neighbors)
for the first time it saves the token in its buffer. In every round, a node picks a token from its buffer (if it is non-empty) and sends it to all its neighbors and then deletes the
token from its buffer (and hence each token is sent only once). Show that this algorithm
will take O(D + k) rounds.
Exercise 3.11. In this problem, we will design a distributed algorithm for constructing a
depth-first search (DFS) spanning tree (see Appendix C) in a given network G (as usual,
we will assume that G is connected and undirected) with n nodes and m edges. We will
assume the synchronous CONGEST model, as usual.
The sequential DFS algorithm is as follows. Given a starting node s, whenever the
search reaches a node v (initially, v is s itself), the next node to visit is decided as follows.
If v has a neighbor, say u, that has not been visited yet, then u is visited next. Otherwise,
the search returns to the node from which v was visited for the first time. If there is no
such vertex, i.e., v is s itself, then the search finishes. The DFS spanning tree is the tree
consisting only of those edges in the graph which are used to visit vertices for the first
time in the DFS algorithm.
(a) Describe a natural distributed implementation of the above sequential algorithm.
What is the time and message complexity? (Hint: Note that at any round, there is only
one “active” node in the algorithm, the one that is currently being visited.)
(b) Show how to get a distributed implementation with time complexity O(n).
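For reference, the sequential DFS described in Exercise 3.11 can be sketched in Python as follows (our own rendition, with hypothetical names); parts (a) and (b) ask for distributed implementations of this traversal.

def sequential_dfs_tree(adj, s):
    # Sketch of the sequential DFS of Exercise 3.11: from the current node, visit an
    # unvisited neighbor if one exists; otherwise return to the node from which the
    # current node was visited for the first time. Returns the DFS tree edges.
    visited_from = {s: None}        # each visited node -> node it was first visited from
    tree_edges = set()
    v = s
    while True:
        unvisited = [u for u in adj[v] if u not in visited_from]
        if unvisited:
            u = unvisited[0]        # visit a fresh neighbor next
            visited_from[u] = v
            tree_edges.add((v, u))
            v = u
        elif visited_from[v] is not None:
            v = visited_from[v]     # backtrack along the tree edge used to reach v
        else:
            return tree_edges       # back at s with no unvisited neighbors: done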
4 Distributed Shortest Paths Algorithms
Figure 4.1: The figure shows routing tables in each node in a graph. (If a slot is empty in a routing table, it means the destination is a neighbor.) For example, if node B wants to route a message to node E, then according to B's routing table the next hop is node D and so it will forward the message to D (along with the information of the destination node, which is E). According to D's routing table, E is a neighbor, so it will forward to E directly.
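To make the next-hop mechanism of Figure 4.1 concrete, here is a small Python sketch (our own illustration; the table contents are hypothetical and only mirror the B-to-E example in the caption): each node's table maps a destination either to the next hop or to None, meaning the destination is a direct neighbor.

def route(tables, src, dst):
    # Follow next-hop routing tables from src to dst and return the path taken.
    path = [src]
    v = src
    while v != dst:
        next_hop = tables[v].get(dst)      # None means dst is a direct neighbor of v
        v = next_hop if next_hop is not None else dst
        path.append(v)
    return path

# Hypothetical tables for the example in the caption: B forwards towards E via D,
# and D delivers directly to its neighbor E.
tables = {"B": {"E": "D"}, "D": {"E": None}}
assert route(tables, "B", "E") == ["B", "D", "E"]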
A routing algorithm’s goal is to find a “good” path or route to send data from a
source node to a destination node. Typically, a good path is one that has the least cost.
Consider a weighted network G = (V, E, c) with positive real-valued edge costs given by the cost function c. The cost of a path p = (e_1, . . . , e_k) is defined as $c(p) = \sum_{i=1}^{k} c(e_i)$.
The link state and distance vector algorithms are essentially the routing algorithms
used in the Internet today. There are two types of protocols based on whether the routing
is made within an Autonomous system (intra-AS routing) or between Autonomous systems
(inter-AS routing). Shortest path routing is typically used for intra-AS routing. A
protocol called Border Gateway Protocol (BGP) is used for inter-AS routing.
Figure 4.2: Illustrating Dijkstra's algorithm (centralized) on a graph. The algorithm computes shortest paths from A to all nodes in the graph; the output can be represented as a shortest path tree (SPT). The final SPT is shown — the (unique) path in the tree from A to any other node gives the shortest path to that node. (The accompanying table lists, step by step, the set of settled nodes and the tentative distance and parent labels of the remaining nodes.)
The message complexity and time complexity of the global routing algorithm are solely
determined by the broadcast algorithm. We note that each edge information constitutes a
message (the cost and the two endpoints of an edge can be represented using O(log n)
bits). Thus the goal is to broadcast all the edges (and their respective costs) to all
nodes in the network. Broadcast can be done by flooding (cf. Section 3.1.1); recall that, in flooding, a node never sends the same message twice. If every node's local information (i.e., the cost and identities of the endpoints of all its incident edges; nodes need to communicate with their neighbors first to obtain the identities of the nodes at the other end of their incident edges, which takes one round and O(|E|) messages) is broadcast by flooding, then the message complexity is O(|E|^2), since O(|E|) messages have to be broadcast, each of which causes O(|E|) messages to be sent by flooding. The time complexity is clearly upper bounded by O(|E|D) (where D is the diameter of the network), since flooding one message takes O(D) time (cf. Chapter 3) and there are at most O(|E|) messages. This is a pessimistic upper bound. Exercise 4.1 asks you to show a better time bound.
Internet’s Open Shortest Path First (OSPF) protocol uses a LS routing algorithm
as mentioned above. Since the LS algorithm is centralized, it may not scale well when
the networks become larger. This is also an issue when nodes have limited resources, e.g., computation power or memory, which impacts the power of local computation. In this book, we will generally ignore the complexity of “within-node” computation, but this is an issue in practice; in any case, it is desirable to design algorithms that have low “within-node” computation cost.
Exercise 4.4 discusses a distributed implementation of Dijkstra’s algorithm.
Figure 4.3: Illustrating the Bellman-Ford distributed algorithm on a graph. The algorithm computes shortest paths from A to all nodes in the graph; the output can be represented as a shortest path tree (SPT). The final SPT is shown — the (unique) path in the tree from A to any other node gives the shortest path to that node. The table below shows each node's distance estimate a(·) and parent P(·) after each round:

            A     B     C     D     E     F     G     H
round 0    0,A    ∞     ∞     ∞     ∞     ∞     ∞     ∞
round 1    0,A   2,A   3,A   5,A    ∞     ∞     ∞     ∞
round 2    0,A   2,A   3,A   4,C   4,C   8,B    ∞   10,D
round 3    0,A   2,A   3,A   4,C   4,C   8,B   6,E   8,E
rounds 4-7 0,A   2,A   3,A   4,C   4,C   8,B   6,E   8,E
Proof. Fix a node x ∈ V; we prove that the algorithm computes a shortest path from s to x.
Let P = v0, v1, . . . , vk, where v0 = s and vk = x, be a shortest path from s to x. Note that k < n; this is because a shortest path cannot have cycles, and hence has to be simple, i.e., it can have at most n nodes on the path.
We prove by induction on i that after the ith round, the algorithm has computed the shortest path from s to vi, i.e., a(vi) = d(s, vi).
The hypothesis holds for v0 = s in round zero.
Assume that it holds for all j ≤ i − 1. After the ith round,
$$a(v_i) \le a(v_{i-1}) + c(v_{i-1}, v_i) = d(s, v_{i-1}) + c(v_{i-1}, v_i),$$
which is the shortest path distance from s to vi, since the prefix of P ending at vi is a shortest path from s to vi, and the right hand side is the distance from s to vi on that path. Hence after i rounds, a(vi) has the correct shortest path distance from the source s.
Thus after n rounds, shortest paths from s to every other node have been computed. Since, in each round, O(|E|) messages are exchanged, the total message complexity is O(n|E|).
From the above proof, it can be seen that the algorithm will compute the correct
values even if the nodes operate asynchronously, i.e., delays don’t matter (we will look
at asynchronous algorithms in more detail later). An important observation is that the
algorithm is “self-terminating” — there is no signal that the computation should stop;
each node just stops when the correct shortest path values are reached (see Worked
Exercise 4.1).
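The computation carried out by the distributed Bellman-Ford algorithm can be sketched in Python as follows (a centralized simulation of the synchronous rounds, using our own names); after n − 1 rounds, every a(v) equals d(s, v), matching the proof above.

import math

def distributed_bellman_ford(adj, s):
    # adj: dict mapping each node v to a list of (neighbor u, weight of edge {u, v}).
    # a[v] is v's current distance estimate and parent[v] its current parent choice.
    a = {v: math.inf for v in adj}
    parent = {v: None for v in adj}
    a[s] = 0
    for _ in range(len(adj) - 1):            # n - 1 synchronous rounds
        estimates = dict(a)                  # estimates announced at the start of the round
        for v in adj:
            for (u, w) in adj[v]:            # v hears u's estimate over the edge {u, v}
                if estimates[u] + w < a[v]:
                    a[v] = estimates[u] + w
                    parent[v] = u
    return a, parent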
Figure 4.4: An example network for the count-to-infinity problem.
Solution. The answer depends on whether the graph is weighted or unweighted. We first show that the statement is true for unweighted graphs and then show it is false for weighted graphs.
For an unweighted graph, for a given non-source node v, its a(·) value is updated only
once when a path from the root to it of length < ∞ is found. This is because distance
here is equivalent to the number of nodes in the path from source to v. The first path
to v that is found will be of shortest distance from the root. Subsequent paths will only
be longer, and thus the distance to the root along those paths is correspondingly longer.
Thus v’s a(·) value is updated only once over the course of the algorithm and the claim is
true.
For a weighted graph, we construct a counterexample. Consider a graph with two
paths from the source to a given node v. Let the number of nodes in the first path be 2
and the number of nodes in the second path be 7. Further, let the total weight of edges
in the first path be 100 and the second path be 8. Now, v will update its a(·) value in the
2nd round and update its a(·) value again only in the 7th round, thus showing the claim
to be false.
4.4 Exercises
Exercise 4.1. Show that broadcast of the entire graph topology by flooding as discussed
above takes O(|E| + D) time. Hint: See Exercise 3.10.
Exercise 4.2. The shortest path diameter between two nodes u and v, denoted by S(u, v),
is defined as the number of edges in a shortest path between u and v; if there is more than
one shortest path, the one with the minimum number of edges is taken to be S(u, v). The
shortest path diameter of the network, denoted by S, is defined as S = maxu,v {S(u, v)}.
• Show that in an unweighted graph, S is the same as the diameter D of the network.
2. When a node in the tree receives a “new phase” message it sends it to all its
neighbors.
3. A non-tree node, when it receives a “new phase” message, replies “non-tree node” and also sends its d(v) value to the tree node(s) that sent the “new phase” message.
4. All tree nodes that receive replies from a non-tree neighbor convergecast the minimum value of d(v) and the ID of the non-tree node that sent it.
5. Once the convergecast terminates, the root has the minimum d(v) value and the ID
of the non-tree node that has this minimum value. The root broadcasts this value
and ID to all nodes in the tree.
6. Once the non-tree node hears its own ID, it adds itself to the tree (becomes a tree
node). It then sends “update” message to its neighbors.
7. A non-tree node u that receives an update message from its tree neighbor v updates
its d(u) value if d(v) + w(u, v) < d(u); in that case, u sets its d(u) value to be
d(v) + w(u, v). It also updates its parent to v. It then sends a “done” message to
its tree neighbor(s). The “done” messages are convergecast to the root.
5 Leader Election
Leader election is one of the fundamental tasks in distributed computing. It is one of the
fundamental “symmetry breaking” problems that we will study. As mentioned in Chapter
2 (also see Section 3.1.1), all nodes in the network will execute the same distributed
algorithm, starting from the same initial state. The only node-specific information that can influence the behaviour of each node is its (unique) ID and the random choices it makes (this will be the case in randomized algorithms, as explained below). In symmetry breaking,
the goal is to break the initial starting symmetry among all nodes (due to the execution of
the same algorithm, starting from the same initial state). Hence at the end of symmetry
breaking, different nodes will end up exhibiting more than one type of behavior or decision.
Leader election is one such problem, where the goal is to elect a unique leader of the entire network. That is, at the end of leader election, exactly one process knows that it is
the leader; all others know they are not.
Leader election is a basic primitive that serves in many applications. In many applica-
tions, a leader node is required that performs tasks on behalf of other nodes or the entire
network. In one of the first applications, leader election was used to decide who has the
right to own a token in a “token-ring” network (where the network is a cycle). Only one
node can have the token at any point in time, and that node has the right to transmit a
message, while other nodes should not. More generally, many nodes might contend to
access some shared resource, and to break this symmetry, they elect a leader, which alone has the right to access the shared resource at any point in time.
As usual, we consider a network of n nodes, represented as an undirected connected
(not necessarily complete) graph G = (V, E). We will assume the synchronous CONGEST
model. We assume that all nodes have unique identifiers; the identifiers can be arbitrary
numbers which can be represented by O(log n) bits. We will assume that all nodes are
awake initially and start executing the algorithm simultaneously — so called simultaneous
wakeup model. However, our algorithms and analysis can also be adapted to the adversarial
wakeup model where the nodes are awoken at arbitrary points in time, with the restriction
that nodes wake upon receiving a message and at least one node is initially awake.
We now define the leader election problem formally. Every node u has a special
variable status_u that can be set to a value in {⊥, non-elected, elected}; initially status_u = ⊥. An algorithm A solves leader
election in T rounds if, from round T on, exactly one node has its status set to elected
while all other nodes are in state non-elected. These nodes need not be aware of the
identity of the leader. In this book, we will focus on this implicit variant of leader election.
In another variant, all the non-leaders may change their status component to the value
non-leader, and moreover, every node must also know the identity of the unique leader.
This formulation may be necessary in problems where nodes coordinate and communicate
through a leader. This explicit leader election can be achieved by simply executing an
(implicit) leader election algorithm and then broadcasting the leader’s identity using an
additional O(n) messages and O(D) time (where D is the diameter of the graph).
The goal is to design efficient leader election algorithms — algorithms that have small
message and time complexity. We will first discuss algorithms for two special types of
networks: ring and the complete graph. Later we will focus on algorithms that work
on arbitrary (general) graphs. We will first talk about deterministic algorithms (i.e.,
algorithms that don’t make random choices) and then focus on randomized algorithms.
This chapter will thus introduce randomization in distributed algorithms which will be a
frequently occurring theme for the rest of the course.
Before we discuss algorithms, we make note of a few important assumptions. First, it
is assumed that nodes have unique identifiers — this is essential for deterministic leader
election. For deterministic algorithms, other than unique identifiers, there is no other
resource to break symmetry, since all nodes execute the same algorithm starting with the
same starting state. Thus in anonymous networks, i.e., in networks where nodes don’t
have IDs, deterministic leader election is not possible. (On the other hand, for randomized
algorithms, nodes can presumably break symmetry via random choices, as we will see
below.) Second, we assume that the identifiers, while being unique, are drawn from a large
enough set. In other words, our leader election algorithms should work for all IDs drawn
from this set. This is needed to avoid trivial algorithms. For example, if the set of IDs is
always drawn from the set {1, . . . , n} and there are n nodes, then a trivial algorithm can
simply assign the node with ID 1 as the leader.
Figure 5.1: Illustrating the BasicLE-Ring algorithm on a ring of six nodes with IDs 1–6. In each round a node forwards a received ID only if it is smaller than its own (setting its status to non-elected); the node with ID 1 eventually receives its own ID back and is elected.
We first present a basic leader election algorithm (called BasicLE-Ring) for the ring.
Although simple, the idea of this algorithm is used in subsequent leader election algorithms.
This algorithm takes O(n^2) messages and O(n) rounds, where n is the size of the ring. The pseudocode is presented in Algorithm 5 (as mentioned earlier, the pseudocode is written from the perspective of a single node v; all nodes execute instances of the same code). For the sake of simplicity, we assume that the ring is unidirectional, i.e., every node has only one outgoing edge to its (say) clockwise neighbour. The algorithm and its analysis can
be extended to apply even without this assumption. The idea of the algorithm is simple:
each node sends its ID around the ring; a node (say v) forwards a received ID to its
neighbor only if the received ID is smaller in value compared to its (i.e., v’s) ID. If a node
receives its own ID, then it elects itself as leader. Figure 5.1 illustrates the algorithm.
Algorithm 5 BasicLE-Ring: A Basic Leader Election Algorithm for a Ring; Code for
node v
1: Round 1: Send ID(v) to (clockwise) neighbor.
2: Round i ≥ 2:
3: Let ID(u) be the ID received by v at the beginning of this round.
4: if ID(u) < ID(v) then
5: Send ID(u) to neighbor; status_v = non-elected
6: else if ID(u) = ID(v) then
7: status_v = elected
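A Python sketch simulating Algorithm 5 on a unidirectional ring is given below (our own illustration; positions are indices 0, . . . , n − 1 and node i forwards clockwise to node (i + 1) mod n).

def basic_le_ring(ids):
    # Simulate BasicLE-Ring: ids[i] is the ID of the node at ring position i.
    # Returns (position of the elected leader, total number of messages sent).
    n = len(ids)
    tokens = {i: ids[i] for i in range(n)}   # in round 1 every node sends its own ID
    messages, leader = 0, None
    while leader is None:
        incoming = {}
        for i, tok in tokens.items():        # every held token is forwarded clockwise
            incoming[(i + 1) % n] = tok
            messages += 1
        tokens = {}
        for i, tok in incoming.items():
            if tok == ids[i]:
                leader = i                   # a node receiving its own ID is elected
            elif tok < ids[i]:
                tokens[i] = tok              # forward only IDs smaller than one's own
    return leader, messages

# Example: the node holding the smallest ID (here 2, at position 2) is elected.
leader, msgs = basic_le_ring([4, 7, 2, 9, 5])
assert leader == 2 and msgs <= 5 * 5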
5.1.1.1 Analysis
Theorem 5.1
Algorithm 5 is correct and has time complexity O(n) and message complexity O(n2 ).
Proof. Correctness. First we analyze the correctness of the algorithm, i.e., we show that
the algorithm elects a unique leader. It is clear that the node with the lowest ID will be
elected as a leader. We will argue as follows. Let’s call the message sent by a node as a
token. The token sent by the node with the lowest ID will not be stopped by any other
node. On the other hand, any other node v will not be elected as leader; its ID will be
stopped by some node whose ID is smaller than v's. Thus one and only one node will be
elected as leader.
Complexity. The time complexity is clearly O(n) rounds. This is because, in each round,
at most one token is sent across an edge (hence no congestion in any edge); furthermore,
the token from the lowest ID node goes round the entire ring once and comes back to it —
this requires n rounds.
The message complexity depends on the distribution of IDs along the ring. Suppose
the IDs are arranged in an increasing order in the clockwise direction, then the message
complexity can be calculated as follows: the token from the lowest ID goes around the
entire ring and incurs n messages, the token from the second lowest ID will be stopped
only by the node with the lowest ID and hence incurs n − 1 messages (since tokens go
around in a clockwise direction) and so on. Thus the total number of messages is
$$n + (n-1) + (n-2) + \cdots + 1 = \frac{n(n+1)}{2} = O(n^2).$$
We note that the message complexity of the above algorithm can be much smaller for other arrangements of the IDs. A good case is when the IDs are arranged in decreasing order in the clockwise direction. In this
case, except for the token from the lowest ID node which travels through the whole ring,
the rest of the tokens travel only one hop and then get stopped. Thus the total message
complexity is O(n).
Lemma 5.2
For each node v that is active in phase i of the ImprovedLE-Ring algorithm, there is no other active node within distance 2^{i−1} of v.
Proof. In phase i − 1, both of v's tokens were not killed during their trips of length 2^{i−1} from v. That means that v has the smallest ID among all nodes within distance 2^{i−1}. Thus, for any node within this distance, if it was at all active during phase i − 1, its token would have been stopped by v. Hence such a node cannot be active at the beginning of phase i.
Lemma 5.3
The number of active nodes in phase i is at most ⌈n/2^{i−1}⌉.
Proof. From Lemma 5.2, for each active node in phase i, there is no other active node within distance 2^{i−1} of it. In other words, there will be at most one active node per segment of length 2^{i−1} in the ring. The lemma follows.
From the above two lemmas, we can analyze the message and time complexity.
Theorem 5.4
The message complexity of the ImprovedLE-Ring algorithm is O(n log n) and the time complexity is O(n).
Proof. By Lemma 5.3, at most ⌈n/2^{i−1}⌉ nodes are active in phase i, and each active node sends two tokens, each of which travels at most 2^i hops out and 2^i hops back, for at most 4 · 2^i messages per active node. Hence the total number of messages over all phases is at most
$$\sum_{i=0}^{\log n} 4 \cdot 2^i \left\lceil \frac{n}{2^{i-1}} \right\rceil \;\le\; \sum_{i=0}^{\log n} 4 \cdot 2^i \cdot \frac{n}{2^{i-2}} \;\le\; \sum_{i=0}^{\log n} 16 \cdot 2^i \cdot \frac{n}{2^{i}} \;=\; \sum_{i=0}^{\log n} 16n \;=\; O(n \log n).$$
round, by symmetry, they will perform the same action — either they both communicate
or they both don’t. Thus, there is no progress.
On the other hand, if both users use randomization, then we can show that with
constant probability they will make progress in a round. To be more precise, we will
assume that each user has access to random coin tosses. In particular, they can flip a fair
(unbiased) coin independently and privately. This is typically referred to as an unbiased
private coin (as opposed to a public coin which is shared by both users). A fair coin has
probability 1/2 of getting HEADS or TAILS. How can having such a private unbiased coin
help? Here is a simple algorithm: Each user tosses his private random coin; if it comes up
HEADS, then he decides to send the message, otherwise he doesn’t (remains silent). See
Figure 5.2. Note that this algorithm is identical to both the users except the outcome of
the coin flips. Using basic probability laws, the probability that exactly one user sends a message in a round is (1/2)(1/2) + (1/2)(1/2) = 1/2. The first term of the sum corresponds to the probability that user A sends the message and B doesn't, and the second term corresponds to the probability that user B sends and user A doesn't. If we care about A's message being sent, then this happens with probability 1/4. In either case, the “success”
probability, i.e., the probability of message transmission, is a constant. By repeating this
algorithm a few times in every round, the success probability can be increased close to 1.
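One can check this probability empirically with a few lines of Python (a simulation sketch of our own):

import random

def estimate_success_probability(trials=100_000):
    # Estimate the probability that exactly one of the two users transmits in a round,
    # when each independently flips a fair coin and transmits on HEADS.
    successes = 0
    for _ in range(trials):
        a = random.random() < 0.5      # user A transmits on HEADS
        b = random.random() < 0.5      # user B transmits on HEADS
        if a != b:                     # exactly one of them transmitted
            successes += 1
    return successes / trials

print(estimate_success_probability())  # typically prints a value close to 0.5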
5. LEADER ELECTION 43
A B A B A B A B
(Head) (Head) (Head) (Tail) (Tail) (Head) (Tail) (Tail)
In this chapter and in later chapters we will use randomization extensively in the
design of distributed algorithms. In many cases, it gives additional power to the algorithm
designer, frequently leading to improved complexity bounds, and also typically simpler
algorithms. In some cases, as in the above simple setting, it helps in overcoming impossibility results, i.e., it allows solving problems that cannot be solved by purely deterministic means.
In randomized algorithms, we are usually interested in the average or mean or expected performance of the algorithm. The message and time complexity of a randomized algorithm is a random variable, with an associated probability distribution. We are interested in the expected value of this random variable, which is a single parameter that characterizes the distribution. In most cases, we will be able to show that the random variable is concentrated around its expected value, i.e., with probability very close to 1, its value is within a small range of the expected value. This implies that "almost always" the algorithm's performance will be close to the expected value. Thus, in the analysis of randomized algorithms, we usually focus on expected complexity measures.
• Each node initiates a token that contains its rank and also its ID. The token is sent
along the ring in one direction (say, in a clockwise direction).
• When a node v receives a token with a rank higher than its own, the token is killed. (If the tokens have the same rank, then ties are broken using IDs.) Otherwise, the token is allowed to progress further and v knows that it is not a leader.
Figure: An execution of the RandLE-Ring algorithm on a 6-node ring; only the token of the node with the lowest rank (0.1) survives its trip around the ring, and that node is elected leader.
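As an illustration, here is a small Python sketch (written for this discussion, not the book's pseudocode) that simulates the RandLE-Ring rule on a synchronous ring and counts the token-forwarding messages; averaged over many runs, the count grows roughly like n ln n, matching the analysis in the next subsection.

import random
import math

def randle_ring_messages(n):
    """Simulate one run of RandLE-Ring on an n-node ring and
    return the total number of token-forwarding messages."""
    ranks = [random.random() for _ in range(n)]
    messages = 0
    for v in range(n):
        # Node v's token travels clockwise until it reaches a node whose
        # rank is smaller than rank(v) (that node kills the token), or
        # until it returns to v (and then v is elected leader).
        for hop in range(1, n + 1):
            messages += 1
            u = (v + hop) % n
            if u == v or ranks[u] < ranks[v]:
                break
    return messages

if __name__ == "__main__":
    n, trials = 1024, 200
    avg = sum(randle_ring_messages(n) for _ in range(trials)) / trials
    print(avg, n * math.log(n))  # the two values should be of the same order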
5.3.1 Analysis
First we note that the correctness of the RandLE-Ring algorithm follows from that of the BasicLE-Ring algorithm. In the RandLE-Ring algorithm, the node with the lowest rank will be
elected leader (in the case when there is more than one node with the lowest rank value,
the one with the lowest ID will be elected as leader); others will be non-leaders.
Lemma 5.5
RandLE-Ring algorithm takes expected O(n log n) messages and O(n) (determinis-
tic) time.
Proof Idea: Since the algorithm is randomized, the message complexity is a random
variable (however, note that the time complexity is O(n), deterministically); let’s denote
this random variable by X. It is not difficult to see (from the analysis of the BasicLE-Ring algorithm) that this random variable will take values only in the range [n, n^2]. The goal is to compute the expected value of X. However, computing E[X] directly from the definition is difficult, since it depends on the random choices of all the nodes, which influence the actions of the intermediate nodes, which in turn determine whether a token is forwarded or not. To address this, we write X as a sum of X_v's whose expectations
are easier to compute. Xv is the random variable that denotes the number of messages
generated by the token initiated by node v. Thus, the main step in the proof is to compute
E[Xv ] and then E[X] follows easily by applying the linearity of expectation.
Proof. We compute E[X_v]. X_v is equal to the distance traveled by the token initiated by node v (in the clockwise direction). Let x_0 be the random number generated by node v and let x_i denote the random number generated by the ith nearest (clockwise) neighbor of v. The probability that v's token gets stopped by the ith nearest neighbor is equal to the probability that x_i and x_0 are the smallest and second smallest, respectively, among the (i + 1) random numbers x_0, x_1, . . . , x_i. This probability is 1/(i(i+1)) (Exercise 5.3 asks you to show this). Thus
\[
E[X_v] = \sum_{i=1}^{n} i \cdot \frac{1}{i(i+1)} = \sum_{i=1}^{n} \frac{1}{i+1} \le H_{n+1} = \Theta(\log n).
\]
In the above, H_n is the harmonic sum, defined to be the sum of the reciprocals of the first n positive integers, i.e., H_n = \sum_{i=1}^{n} 1/i; it is asymptotically \Theta(\log n).
To compute X = \sum_{v} X_v we use linearity of expectation:
\[
E[X] = E\Big[\sum_{v \in V} X_v\Big] = \sum_{v \in V} E[X_v] = \Theta(n \log n).
\]
possible. The clique is a dense network, which is the opposite of the ring, which is a sparse network. On the other hand, the clique is also a symmetric network. A complete network can model scenarios where each node can directly communicate with every other node. For example, to a crude approximation, in the Internet, a node can, in principle, talk to any other node if it knows the IP address of the other node. Distributed algorithms for complete networks are also important in that they serve as important special cases which provide lower bounds.
Figure: An illustration of the randomized leader election algorithm on a 6-node complete network with nodes A–F. Nodes A, C, and E are candidates, and B, D, and F act as referees. In round 1 each candidate sends its rank to the referees; the referees reply with "winner" notifications, and in round 2 node A, having won all of its competitions, enters the elected state.
Theorem 5.6
Consider a complete network G of n nodes and assume the CONGEST model of
communication. With high probability, Algorithm 7 solves leader election in O(1) rounds, while using O(√n log^{3/2} n) messages.
Proof Idea: We first show that there will be at least one candidate node after Step
1. We also show that there won’t be too many (more than O(log n)) candidate nodes
after Step 1. Both these statements hold with high probability, i.e., with probability at
least 1 − 1/n. The fact that there will be at least one candidate means that at least
one leader will be elected. Not having more than O(log n) candidates leads to reduced
message complexity, since only candidate nodes initiate messages. The rest of the proof shows that exactly one candidate, in particular the one with the largest rank, will be elected. This is made possible since each candidate node competes in about √n competitions. This is the right number of competitions to guarantee that, with good probability, every pair of candidate nodes takes part in at least one common competition. In particular, every candidate will compete with the candidate with the highest rank and lose.
Proof. Since all nodes enter either the elected or non-elected state after 2 rounds, at
the latest, we get the runtime bound of O(1).
We now argue the message complexity bound. In expectation, there are 2 log n candidate nodes. By using a Chernoff bound (cf. Appendix D), there are at most 7 log n candidate nodes with probability at least 1 − n^{−2}. By the description of the algorithm, each referee node only sends messages to the candidate nodes by which it has been contacted. Since we have O(log n) candidates, each of which contacts 2⌈√(n log n)⌉ referees, the total number of messages sent is bounded by O(√n log^{3/2} n) with high probability.
Finally, we show that the algorithm solves leader election with high probability. With probability (1 − 2 log n / n)^n ≈ exp(−2 log n) = n^{−2}, no node elects itself as a candidate. Hence the probability that at least one node is elected as leader is at least 1 − n^{−2}. Let ℓ be the node that generates the highest random rank r_ℓ among all candidate nodes; with high probability, ℓ is unique. Clearly, node ℓ enters the elected state, since it receives all "winner" notifications.
Now consider some other candidate node v. By the description of the algorithm, node v chooses its referees uniformly at random among all nodes. Therefore, the probability that an individual referee selected by v is among the referees chosen by ℓ is 2√(n log n)/n. Since each referee is chosen independently and uniformly at random with replacement, it follows that the probability that ℓ and v do not choose any common referee node is at most
\[
\left(1 - \frac{2\sqrt{n \log n}}{n}\right)^{2\sqrt{n \log n}} \le \exp(-4 \log n) = n^{-4},
\]
which means that with high probability, some node x serves as a common referee to ℓ and v.
By assumption, we have r_v < r_ℓ, which means that node v does not receive 2⌈√(n log n)⌉ "winner" notifications, and thus it subsequently enters the non-elected state. By taking
a union bound over all other candidate nodes, it follows that, with probability at least
1 − n1 , no other node except ` wins all of its competitions, and therefore, node ` is the
only node to become a leader.
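The common-referee argument can be sanity-checked numerically. The sketch below (a hypothetical experiment, not part of the algorithm) estimates the probability that two candidates, each picking 2⌈√(n log n)⌉ referees uniformly at random with replacement, fail to share a referee; it should be extremely small, consistent with the n^{−4} bound above.

import math
import random

def no_common_referee(n, trials=2000):
    """Estimate P(two candidates share no referee) when each picks
    2*ceil(sqrt(n log n)) referees uniformly at random with replacement."""
    k = 2 * math.ceil(math.sqrt(n * math.log(n)))
    misses = 0
    for _ in range(trials):
        a = {random.randrange(n) for _ in range(k)}
        b = {random.randrange(n) for _ in range(k)}
        if a.isdisjoint(b):
            misses += 1
    return misses / trials

if __name__ == "__main__":
    print(no_common_referee(10_000))  # should be very close to 0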
Figure: An execution of the BasicLE-Gen algorithm with termination detection on a network with nodes 2–6 (panels (a)–(f)). Nodes exchange their current minimum IDs and reply with accept/reject messages; echo messages propagate back up the BFS tree rooted at the minimum-ID node (node 2), which is eventually elected leader while all other nodes become non-elected.
Theorem 5.7
The BasicLE-Gen algorithm is correct and has time complexity O(D) and message
complexity O(D|E|).
Proof. Correctness. The correctness of the algorithm is easy to establish and is along the lines of the BasicLE-Ring algorithm. We first establish that the node with the minimum ID will be elected as leader. This is because the minimum ID overrides any other ID and is forwarded by every node; in D rounds it reaches all nodes in the network. For the same reason, no other node will be elected as leader.
Complexity Analysis. The time complexity is trivially D, since the algorithm goes for
D rounds. In each round, every node v may update its value (in the worst case) and
this causes d(v) update messages to be sent; thus a total of O(|E|) messages are sent per
round. Thus the total message complexity is O(D|E|).
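A minimal sketch of the flooding rule used by BasicLE-Gen, assuming every node knows D and the algorithm simply runs for D synchronous rounds, is given below (an illustration written for this discussion, not the book's pseudocode); in each round every node replaces its current value by the minimum over its closed neighborhood, which is equivalent to forwarding updated minimum IDs.

def basic_le_gen(adj, diameter):
    """adj: dict mapping node ID -> list of neighbor IDs.
    Every node floods the minimum ID it has seen for `diameter` rounds;
    the node whose own ID equals the final minimum is the leader."""
    m = {v: v for v in adj}          # m[v] = smallest ID v has seen so far
    for _ in range(diameter):
        m = {v: min(m[u] for u in adj[v] + [v]) for v in adj}
    leader = min(adj)                # the minimum ID in the network
    return {v: ("elected" if m[v] == v == leader else "non-elected") for v in adj}

if __name__ == "__main__":
    # A small example network: a path 3 - 1 - 4 - 2.
    adj = {3: [1], 1: [3, 4], 4: [1, 2], 2: [4]}
    print(basic_le_gen(adj, diameter=3))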
Termination detection. If nodes do not have knowledge of D, then the above algorithm obviously will not work. The main problem is detecting termination. It is clear that eventually (i.e., after D rounds) the minimum ID value will spread throughout the network, but nodes won't know whether the algorithm has finished or not. In particular, the node with the minimum ID, which will be the leader, will not know when to finally declare itself the leader. This issue can be addressed using echo messages in a way that is very similar to the approach of Section 3.7.2, where we discussed detecting termination of the BFS algorithm. Here, the algorithm also builds a BFS tree
rooted at the leader and will let the leader know when the tree building is finished. Since
each node may have different leader candidates during the course of the algorithm, there
may be several different (partially built) BFS trees during the course of the algorithm.
However, at any point in time, a node will belong to only one BFS tree and at the end of
the algorithm, it will belong to the BFS tree rooted at the leader.
BasicLE-Gen Algorithm with termination detection. We will describe the algo-
rithm in detail from a node v’s point of view. When a node v learns and updates a new
value m(u) received from a neighbour u, it chooses u as its parent in the BFS tree rooted
at the node with ID m(u) — call this tree BF S(m(u)). (v will be part of only one tree
at any point and will drop its participation in the previous tree.) Note that m(u) is
the current minimum ID learned by v (note that v’s status will be non-elected). v
will send an accept(m(u)) message to u; u upon receiving the accept(m(u)) from v will
designate v as its child in the tree BF S(m(u)). v will then send its updated m(v) value
to its neighbors. On the other hand, if v does not update its value based on the m(u)
value received from a neighbor u, it sends a reject(m(u)) message to u.
v will check if it is a leaf node in the current BFS tree by checking whether it receives any accept messages for an update that it sent in the previous round. If v receives only reject messages from all its neighbors (and no other message, e.g., a new m(.) value), then v decides that it is a leaf of the current tree in which it participates and sends an echo(m(v)) message back to its parent. If a node receives echo messages from all its children for the tree in which it currently participates, then it forwards its own echo message back to its parent (i.e., the parent of v, assuming v itself is not the root of the tree). Note that when v learns and updates a new m(.) value, it drops waiting for echo messages for older m(.) values. (Again, a node participates in maintaining only one BFS tree at any point, the one with the smallest ID value that it currently knows.) If v itself is the root of the tree and it receives echo messages from all its neighbors (for the tree BFS(m(v))), then this means that it is the leader (it sets its status to elected). At this point, the status of every node has been determined.
Theorem 5.8
The BasicLE-Gen algorithm with the termination detection as described above is
correct and has time complexity O(D) and message complexity O(D|E|).
Theorem 5.9
The RandLE-Gen algorithm is correct and has (deterministic) time complexity
O(D) and has expected message complexity O(|E| min{log n, D}).
Proof Idea: The non-trivial part of the proof is bounding the (expected) message
complexity. The key idea is to upper bound the number of update messages sent by some
fixed node and then use linearity of expectation to bound the total number of messages.
This saves the day, since there is a non-trivial dependence between the number of update
messages sent by different nodes, e.g., the number of messages sent by neighbors can
be correlated — this makes it difficult to bound the total number of messages directly.
Instead, we focus on bounding the expected number of messages sent by a single node and show that it is bounded by its degree times an O(log n) factor.
Proof. Correctness. First, we observe that the correctness of the RandLE-Gen algorithm immediately follows from the correctness of the BasicLE-Gen algorithm, since the only difference is that instead of ID(v), rank(v) is used as the starting value of m(v). Since ranks are
chosen randomly in [0, 1], with probability 1 they are distinct. Thus the unique node with
the minimum rank is elected as leader.
Complexity analysis. The time complexity is the same as BasicLE-Gen, and thus is
O(D).
We next bound the expected message complexity. Let random variable X_v denote the number of messages sent by node v over the entire algorithm. It is clear that the total message complexity, denoted by random variable X, is given by X = \sum_{v \in V} X_v. We focus on computing E[X_v], and then apply linearity of expectation to obtain the result.
We observe that except for the message sent by v in round 1, in any subsequent round,
v will send a message to all its neighbors only if it updates its m(v) value. The update
will happen if the new m(.) value received in some round is smaller than its (current)
m(v) value. We compute the expected number of times an update will happen at node v.
To compute this, we consider the m(.) values that v receives over time. If v receives more
than one m(.) value from its (different) neighbors in the same round, then it will update
at most once (if the minimum among all the received values is less than the current m(v)
value).
Thus, to upper bound the number of updates, it is fine to assume that v sees one distinct m(.) value in each round (if it does not see any new value in a round, then there is no update at all; or if it sees a value that it has already seen, then there is no update as well). Note that this is just for the sake of analysis; the algorithm itself need not operate in this way. Let the m(.) values seen by v starting from round 1 be m_1, m_2, . . . in that order. Each of these (distinct) values originated at some starting node (i.e., the node which first chose the rank). Hence, for the sake of analyzing the upper bound, we assume that all of these values reached v (on the other hand, in the actual algorithm some of these values would have been stopped even before they reach v). These values are independently and randomly chosen from [0, 1]. Thus, we have a sequence of n values, m_1, m_2, . . . , m_n, each of whose values are independently and randomly chosen from [0, 1]. Let the random variable Y denote the number of times the minimum value is updated. By Problem D.2
Note that if 2D < log n, then clearly the algorithm will terminate in at most 2D rounds, and hence uses only O(|E|D) messages.
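The key probabilistic fact used in the argument above — the number of times a running minimum of n independent uniform values changes is H_n = O(log n) in expectation (cf. Problem D.2) — can be checked with a short simulation; the code below is illustrative only.

import random

def expected_min_updates(n, trials=5000):
    """Average number of times the running minimum changes over a
    sequence of n independent uniform [0, 1] values."""
    total = 0
    for _ in range(trials):
        current_min = float("inf")
        for _ in range(n):
            x = random.random()
            if x < current_min:   # a new record minimum: one "update"
                current_min = x
                total += 1
    return total / trials

if __name__ == "__main__":
    n = 1000
    harmonic = sum(1 / i for i in range(1, n + 1))
    print(expected_min_updates(n), harmonic)  # the two values should be close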
Solution: Assume that the nodes know n, the number of nodes in the network (some
upper bound of it also works). Consider the following algorithm: Each node waits a
certain number of rounds to start broadcasting their ID. In fact, each node starts flooding
its ID at time ID × n. When a node receives an ID for the first time (including the node
that sends it first), it selects that ID to be its leader. We show that this algorithm is
correct and every node selects the minimum ID to be the leader.
Suppose (without loss of generality) that the IDs are {1, 2, . . . , n} (if the IDs are spaced farther apart, this is even better for the algorithm). Then node 1 starts at time n, node 2 starts at time 2n, node 3 starts at time 3n, and so on. Hence, in general, the minimum ID node will start first (at time equal to its ID times n) and finish its broadcast before the second minimum node starts the process. The reason is that there is a gap of at least n rounds between two consecutive starts. Therefore, after getting the minimum ID (denoted
by minID), all the nodes decide upon it as the leader and the algorithm stops. Hence,
the message complexity of the algorithm is O(n) (only one broadcast by the minimum ID
node) and the running time is O(minID · n + n) (the second term n is for completing the
broadcast). So the running time is O(minID · n), assuming minID > 1.
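A simplified synchronous simulation of this staggered-start idea is sketched below (illustrative only; it assumes IDs are positive integers and that every node knows n). Only the minimum-ID node's flood appears, since in the algorithm every other node adopts the minimum ID as leader before its own start time and therefore never initiates a flood.

def staggered_leader_election(adj, n):
    """adj: dict node ID -> list of neighbor IDs; IDs are positive integers.
    Node v would start flooding its ID at round v * n; the first ID a node
    hears (or its own, when it starts flooding) becomes its leader."""
    leader = {v: None for v in adj}
    frontier = []                      # nodes forwarding the flooded ID this round
    round_no = 0
    min_id = min(adj)
    while any(leader[v] is None for v in adj):
        round_no += 1
        if round_no == min_id * n:     # the minimum-ID node starts its flood
            leader[min_id] = min_id
            frontier = [min_id]
        new_frontier = []
        for u in frontier:
            for w in adj[u]:
                if leader[w] is None:  # w hears an ID for the first time
                    leader[w] = leader[u]
                    new_frontier.append(w)
        frontier = new_frontier
    return leader, round_no

if __name__ == "__main__":
    adj = {1: [3], 3: [1, 2], 2: [3]}
    print(staggered_leader_election(adj, n=3))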
Worked Exercise 5.2. Consider the following leader election algorithm for a complete
network. Assume that each node arranges its neighbors in some order, say from 1 to n − 1.
1. Each node independently chooses its rank, which is a random number in the interval [0, 1].
2. In each round i = 1, 2, . . ., every node that has not yet stopped does the following:
(a) Each node communicates with its neighbours numbered from 2^{i−1} to 2^i − 1 and finds their ranks.
(b) If a node finds a neighbor of higher rank among the neighbours it communicates with in round i, it sets its status to be non-elected and stops at this round. Otherwise, it proceeds to the next round (round i + 1).
Show that the above algorithm correctly elects a leader and takes expected O(n log n)
messages and O(log n) (deterministic) time.
Solution:
We prove that the above protocol takes expected O(n log n) messages and O(log n)
rounds.
It is clear that the protocol takes O(log n) time, since the neighbor ranges double in each round and hence there are at most ⌈log n⌉ rounds.
We now bound the expected number of messages used by the protocol.
The expected number of messages for a node is bounded by
\[
1 + \sum_{i=1}^{\log n} (\text{\# messages exchanged in round } i) \cdot \Pr(\text{not succeeding in the first } i-1 \text{ rounds})
\;\le\; 1 + \sum_{i=1}^{\log n} 2^{i+1} \cdot \frac{1}{2^{i-1}} = O(\log n).
\]
By linearity of expectation, summing over all n nodes gives an expected total of O(n log n) messages.
5.8 Exercises
Exercise 5.1. Show that the time complexity of the ImprovedLE-Ring algorithm is O(n).
Exercise 5.2. Give a deterministic leader election algorithm in rings that takes O(n)
messages in the synchronous CONGEST model (which is obviously optimal for rings).
Assume that nodes don’t know n, the number of nodes in the ring. Note that your
algorithm can take an arbitrary (but finite) time. (Hint: See Worked Exercise 5.1 that
gives such an algorithm, but assumes knowledge of n.)
Exercise 5.3. Show that in Algorithm RandLE-Ring the probability that a token generated by a node v gets stopped by the ith nearest neighbor is equal to 1/(i(i+1)).
Exercise 5.4. In Algorithm RandLE-Ring, show that choosing a rank randomly from [1, n^3] (instead of from [0, 1]) also gives a correct leader election algorithm with the same complexity bounds.
Exercise 5.5. Give a randomized leader election algorithm in rings that takes expected
O(n) messages and expected O(n) rounds (assume all nodes know n).
Exercise 5.6. In a complete network, give a deterministic leader election algorithm that
runs in O(log n) rounds and takes O(n log n) messages. (Hint: A natural idea is to try
something similar to the deterministic leader election algorithm on the ring: reduce the
number of candidates by a constant factor in every constant number of rounds. One
should do this as efficiently as possible, taking on average O(n) messages per round, giving
O(n log n) overall.)
Exercise 5.7. Show that the complexity bounds for the BasicLE-Gen algorithm are
asymptotically tight. That is, there is a network, with a given distribution of IDs such
that the algorithm takes Ω(D) time and Ω(D|E|) messages for this network.
Exercise 5.8. Give an example network where the expected message complexity of the RandLE-Gen algorithm is Θ(|E| log n).
Exercise 5.9. Give a deterministic leader election algorithm in the synchronous CON-
GEST model that is efficient with respect to both the message complexity (i.e., close to
O(|E|)) as well as the time complexity (i.e., close to D).
Exercise 5.10. In a general network, give an efficient randomized distributed algorithm
to estimate the size of the network, i.e. it should output a constant factor estimate of
log n with high probability. Your algorithm, with high probability, should have time
complexity O(D) and expected message complexity O(|E| min{log n, D}), i.e., the same
complexity bounds as those of RandLE-Gen. (Hint: The idea is to generate IDs that are
“random”-like but not too large. Consider the following process: Each node independently
tosses a fair coin till it gets TAILS. It sets its ID to be the number of tosses needed to get
TAILS. Show that, with high probability, the maximum ID over all nodes in the network
is Θ(log n).)
Exercise 5.11. Use the ideas in Exercise 5.10 to design and analyze a leader election
algorithm in a general network, where nodes don’t have any knowledge of n, the size of
the network.
6 Local Symmetry Breaking
In the last chapter, we focused on leader election, which can be considered as a “global”
symmetry breaking problem since the goal is to break symmetry among all the nodes in
the network in order to elect a unique leader. In contrast, in this chapter, we look at
“local” symmetry breaking, where the goal is to break symmetry among nodes that are
quite close, typically neighbors. A fundamental problem in this category is the Maximal
Independent Set (MIS) problem which is stated as follows: Given a graph G = (V, E),
find a subset S ⊆ V of nodes such that S is independent (i.e., no two nodes in S are
neighbors in G) and S is maximal (i.e., no other nodes can be added to S and still keep it
independent). MIS is a "local" problem in the sense that given a subset S ⊆ V, verifying whether S forms a maximal independent set can be done as follows: each node checks with its neighbors and decides whether the MIS property holds in its neighborhood. However, as we will see, computing a MIS in a distributed network is harder. In particular, we will present a MIS algorithm that takes O(log n) rounds (where n is the number of nodes in the network) with high probability. Another way to view this algorithm is as follows: To compute a MIS, each node (with high probability) needs only information about its O(log n)-neighborhood, i.e., nodes within distance O(log n). While this distance is not "fully local" (i.e., not restricted to looking at neighbors alone), it is still relatively small compared to the network diameter (which can be as large as n). A fundamental
open question in distributed computing is to fully characterize the locality of MIS. As
we shall see in a later chapter, we can show a highly non-trivial locality lower bound of
Ω(log∗ n) 1 for MIS, i.e., in the worst case, some node has to look at information from
nodes that are Ω(log∗ n) away.
In this chapter, besides MIS, we will focus on another fundamental local symmetry
breaking problem, namely, coloring. Both are related to each other in some way, yet have
their own characteristics. We will also study a distributed optimization problem, namely
finding small-sized dominating sets. We will show fast (i.e., running in polylog(n) rounds)
distributed algorithms for all these problems. A key feature of all the algorithms in this
chapter is the use of randomness — all the algorithms are randomized. This is not a
coincidence — currently the best known algorithms for these problems are randomized
and are significantly faster than the best known deterministic counterparts.
1. log∗ n is the iterated logarithm of n — the number of times one has to repeatedly apply the logarithm function to reduce n to 1.
Throughout this chapter, we will consider the synchronous LOCAL model, although
all the algorithms (for MIS, coloring, and dominating set) will work seamlessly in the
CONGEST model as well. As usual, we consider a network of n nodes, represented as
an undirected, connected (not necessarily complete) graph G = (V, E). We assume that
all nodes have unique identifiers; the identifiers can be arbitrary numbers which can be
represented by O(log n) bits. As in the previous chapter, we will assume that all nodes
are awake initially and start executing the algorithm simultaneously, i.e., the simultaneous
wakeup model.
Figure: (a) A graph on nodes A–F; (b) a maximal independent set of the same graph.
MIS is a basic primitive that arises in distributed algorithms. It has many applications,
e.g., resource allocation, dominating set construction etc. For example, in a wireless
network, neighboring nodes may not be able to communicate properly due to signal
interference. One way to solve this problem is by finding a MIS and allowing nodes that
belong to the MIS to communicate at the same time. Another useful property of MIS is
that it is also a dominating set. Given a graph G = (V, E), a dominating set D ⊆ V is a
subset of nodes such that every vertex in V either belongs to D or has a neighbor in D.
It is easy to see that a MIS is also a dominating set; in fact it is a minimal dominating
set (MDS). A MDS can be used in various applications. For example, a dominating set
can be used as a network backbone for routing — it is enough to find routes between the nodes in the dominating set; any other node can route by first sending its message to its dominator (if there is more than one dominator, a specific one can be chosen).
Fast Distributed MIS Algorithms Given an arbitrary network G = (V, E), the goal
is to design a distributed algorithm which will output a MIS as follows: every node will
know whether it is in the MIS or not.
Let Γ(v) be the set of vertices in V that are adjacent to v. Let N (v) denote the closed
neighborhood of v, i.e., the set consisting of v and Γ(v).
The goal is to design fast distributed algorithms, i.e., algorithms running in O(log n) (or, more generally, polylog(n)) rounds. The reason we call these algorithms "fast" is that their running time can be significantly smaller than the diameter D of the network. Note that D is an obvious time upper bound for any graph problem in the LOCAL model, since one can collect all the network information (including the topology) at a single node and compute the solution locally (within that node).
Intuition behind the Fast MIS Algorithms The algorithm proceeds in rounds; in
every round it finds an independent set S, and adds S to I (initially I is empty) and
deletes S ∪ Γ(S) from the graph. Note that it is very easy to implement this algorithm in a linear (i.e., O(n)) number of rounds (see Exercise 6.3). This is essentially the same as finding a
MIS in the sequential setting by iterating the following starting with the input graph G:
Choose an arbitrary node v in G and include it in the MIS, and delete v and its neighbors
in G to get a new graph G0 ; repeat the same in G0 till G0 becomes empty.
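For reference, the sequential greedy procedure just described can be written in a few lines of Python (an illustrative sketch, not distributed code):

def greedy_mis(adj):
    """adj: dict node -> set of neighbors. Returns a maximal independent set
    by repeatedly picking an arbitrary remaining node and deleting its
    closed neighborhood, exactly as in the sequential process above."""
    remaining = set(adj)
    mis = set()
    while remaining:
        v = next(iter(remaining))          # an arbitrary remaining node
        mis.add(v)
        remaining -= {v} | adj[v]          # delete v and its neighbors
    return mis

if __name__ == "__main__":
    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(greedy_mis(adj))                 # e.g., {1, 4}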
We give some intuition in designing an algorithm that runs in O(log n) rounds, which
is significantly faster than a linear time algorithm. To get a fast distributed algorithm, an
obvious strategy is to include as many nodes in the MIS in a round as possible. Consider
a node v whose degree is denoted by d(v). Clearly either v or at least one of its neighbors should be in the MIS. (In particular, if v is in the MIS, that precludes its neighbors from being in the MIS.) Thus, we have a basic symmetry breaking problem between v and its neighbors. As discussed earlier (see Chapter 5 in regard to the toy problem of channel usage), the symmetry can be broken by using randomization. One reasonable way is for v to choose itself with probability 1/(d(v)+1). Why? Assume, for simplicity, that all nodes in the graph have the same (or almost the same) degree. Then the above probability ensures that on average there will be one node chosen among v and its neighbors (this follows simply by linearity of expectation), which is correct "on average." But due to randomness two situations are possible: (1) two neighboring nodes can be chosen; (2) none of the nodes in the closed neighborhood of a node are chosen. To deal with the first situation, the algorithm applies a (deterministic) tie-breaking rule if two neighboring nodes are chosen: for example, one can break ties based on degrees (or, if the degrees are the same, based on IDs). There is an alternate way of probabilistic tie breaking that avoids the above situation. Each node chooses a random rank (similar to the randomized leader election algorithms!); the node with the lowest rank in its closed neighborhood goes into the independent set. If the random rank is chosen from a large enough set (say [1, n^3]), then the probability that two nodes choose the same rank is negligible (one can also implement this random choice without knowledge of n — this is explored in Exercise 6.1). To deal with the second situation, the algorithm is repeated many times till the statuses of all nodes are determined. We will show that, with high probability, the number of repetitions will be small, i.e., logarithmic in n.
Next we present two algorithms for MIS based on the two different symmetry breaking ideas discussed above. Both these ideas are important, as similar themes show up in other applications as well, such as coloring and dominating set.
Figure: One iteration of the marking-based MIS algorithm on a 6-node graph: nodes 1, 4, and 5 mark themselves, each with probability 1/(2d(v)); since d(4) ≤ d(5), node 4 unmarks itself; node 5 joins the MIS, and its neighbors 2, 4, and 6 are deleted along with it.
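The marking rule illustrated above can be prototyped as follows. This is a centralized simulation written for this discussion — it mimics the marking/unmarking idea, with higher degree (then higher ID) winning ties, and is not the book's Algorithm 9 verbatim.

import random

def marking_mis(adj):
    """adj: dict node -> set of neighbors. Each remaining node marks itself
    with probability 1/(2 d(v)); a marked node unmarks if a marked neighbor
    has larger degree (ties broken by larger ID); surviving marked nodes join
    the MIS and their closed neighborhoods are removed."""
    live = {v: set(adj[v]) for v in adj}
    mis = set()
    while live:
        deg = {v: max(len(nbrs), 1) for v, nbrs in live.items()}
        marked = {v for v in live if random.random() < 1.0 / (2 * deg[v])}
        winners = {v for v in marked
                   if all((deg[u], u) < (deg[v], v)
                          for u in live[v] if u in marked)}
        mis |= winners
        removed = winners | {u for v in winners for u in live[v]}
        live = {v: nbrs - removed for v, nbrs in live.items() if v not in removed}
    return mis

if __name__ == "__main__":
    adj = {1: {2, 6}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5, 1}}
    print(marking_mis(adj))  # a maximal independent set of the 6-cycle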
The main idea of the analysis is as follows. We divide the algorithm into phases (for the sake of analysis): in the first phase, we will consider nodes having degree in [∆, ∆/2); in phase i the nodes considered will have degree in [∆/2^{i−1}, ∆/2^i). Thus there will be O(log ∆) phases. At the end of phase i, the status of all nodes of degree higher than ∆/2^i would have been decided. We will show that each phase lasts for O(log n) rounds with high probability. The following is the key lemma.
Lemma 6.2
Consider a node, say v, in phase i (1 ≤ i ≤ log ∆). We will show that, with constant probability, one of the following two events will happen: (1) The status of v will be determined (and hence it will take no further part in the algorithm); (2) The degree of v will drop below ∆/2^i.
Figure 6.3: Illustration for the proof of Lemma 6.2: node v (marking probability 1/(2d(v))) and its highest-degree marked neighbor w (marking probability 1/(2d(w))).
Proof. Consider phase 1. In this phase, we only focus on nodes with degree in [∆, ∆/2); let v
be one such node. We lower bound the probability that the status of v will be determined
in one round. This can be done in two ways in a round: (1) v enters MIS or (2) a neighbor
of v enters MIS.
We lower bound the probability that a neighbor of v enters MIS as follows in two steps:
(1) A neighbor of v, say w, marks itself; and (2) At least one of v’s marked neighbors
remains marked after the tie-breaking step (line number 9 of Algorithm 9).
The probability that step (1) does not happen is at most
\[
\left(1 - \frac{1}{2\Delta}\right)^{\Delta/2} \le e^{-1/4}.
\]
This is because all the neighbors of v have degree at most ∆ and v's degree d(v) is at least ∆/2. Hence the probability that step (1) happens is at least 1 − e^{−1/4}.
We next bound the probability of step (2), given that step (1) has happened. Among all the marked neighbors of v, consider the one that has the highest degree, call it w (choose the one with the highest ID if there is more than one such node). (See Figure 6.3.) The probability of w remaining marked is at least the probability that none of its higher (or equal) degree neighbors are marked. Crucially, it is enough to focus on neighbors of w that are not in Γ(v). This is because w is the highest degree node in Γ(v) (or it has the
highest ID if there is another marked node in Γ(v)). Note that, we also do not consider
the event that v itself is of higher degree than w and is marked; this event will lead to v’s
deletion anyway.
Thus we can bound the probability of the following event independently of the nodes in N(v): the probability that at least one of the neighbors of w (excluding those in N(v)) is marked is at most
\[
\sum_{u \in \Gamma(w)} \frac{1}{2d(w)} \le \frac{1}{2}.
\]
Thus the probability that none of these neighbors is marked is at least 1/2.
We next lower bound the probability that both steps (1) and (2) happen. By the above discussion, due to the independence of the two steps, the probability is at least (1 − e^{−1/4})(1/2). Let's denote this constant probability by β.
After O(log n) iterations of the While loop, with high probability either v is deleted
or its degree will go below ∆/2, i.e., it will move out of phase 1. By a straightforward
union bound, the above statement will hold for all nodes considered in phase 1. Thus in
O(log n) rounds, with high probability, all nodes with degree in [∆, ∆/2) will be deleted
(i.e., their status will be determined) or their degrees will be reduced below ∆/2. We can
then have a similar argument for phase 2, and in general, phase i. By applying another
union bound over all the O(log ∆) phases, the status of all nodes will be determined with
high probability in O(log ∆ log n) rounds.
We show that a good vertex is likely to have one of its lower degree neighbors in the
MIS and thereby deleted from V .
Lemma 6.3
Let v ∈ V be a good vertex with degree d(v) > 0. Then, the probability that some
vertex w ∈ Γ(v) gets marked is at least 1 − e−1/6 .
Proof. Each vertex w ∈ Γ(v) is marked independently with probability 1/(2d(w)). The
The proofs of the next two lemmas are very similar to the argument in the proof of
Lemma 6.2 and are left as exercises.
Lemma 6.4
Consider an iteration of the While loop of Algorithm 9. Let v be a good vertex
and let w be a marked neighbor of v that has the highest degree (if there is a tie,
then w is the one with the highest ID). Then w is selected to be in the MIS with
probability at least 1/2.
Lemma 6.5
The probability that the status of a good vertex is determined in an iteration of the
While loop is at least (1 − e−1/6 )/2.
Proof. Direct the edges in E from the lower degree end-point to the higher degree end-point
breaking ties according to the ID (as done in the algorithm).
We will show a mapping that assigns each bad edge to two unique edges. This will
imply that the number of bad edges is at most |E|/2, proving the lemma. The mapping
is as follows. Consider a bad vertex v. At least two-thirds of its neighbors supersede v, and all these edges will be directed out of v; at most one-third will be directed into v.
Consider a bad edge e directed into v. Map e to two distinct outgoing edges of v. Since
there are twice as many outgoing edges as incoming edges, one can uniquely map every
incoming (bad) edge to two unique outgoing edges. This is the required mapping.
Lemma 6.7
The expected number of edges deleted in one iteration of the algorithm is at least
((1 − e−1/6 )/4)|E|.
Proof. By Lemma 6.5, a good edge is deleted with probability at least (1 − e^{−1/6})/2, since
at least one of its endpoints is good. Since the number of good edges is at least |E|/2, the
claim follows by linearity of expectation.
Lemma 6.8
MIS Algorithm 1 finishes in O(log n) rounds with high probability.
Proof. Let α = 1 − (1 − e^{−1/6})/4. Let X_i be the number of edges remaining after i iterations. We have X_0 = |E|. By Lemma 6.7, for i ≥ 1, E[X_i | X_{i−1}] ≤ αX_{i−1}. By taking expectations on both sides (see Appendix D.10), E[X_i] ≤ αE[X_{i−1}]. Hence
\[
E[X_i] \le \alpha^i E[X_0] = \alpha^i |E|.
\]
Choosing i = f = 4 log_{1/α} n, we obtain E[X_f] ≤ |E|/n^4 ≤ 1/n^2, since |E| < n^2. Using Markov's inequality, we bound the probability that any edge remains after f iterations: P(X_f ≥ 1) ≤ E[X_f] ≤ 1/n^2. Hence, with probability at least 1 − 1/n^2, all the edges are deleted and the algorithm finishes in O(log n) rounds.
Proof. Let F be the set of edges at the beginning of an iteration (at the beginning of the algorithm F = E, where E is the set of edges of the graph). We show that the expected number of edges deleted in the iteration is at least |F|/2. For the sake of analysis, replace each undirected edge (u, v) by two directed edges (u → v) and (v → u). Let N(u) denote the set consisting of the neighbors of u (in the undirected graph) and u itself. Call a node u "eligible" with respect to v if u is the smallest ranked node in N(u) ∪ N(v). Note that if u is eligible it will be deleted, since it is the lowest ranked node in N(u) and hence will get into the MIS. It is clear that
\[
P(u \text{ is eligible w.r.t. } v) \ge \frac{1}{d(u) + d(v)}.
\]
Let random variable X(u → v) denote the number of directed outgoing edges incident to
v that get deleted when u is eligible with respect to v. We note that X(u → v) > d(v) if u
is eligible with respect to v. Note that we deliberately undercount the number of directed
outgoing edges deleted — in fact, d(u) + d(v) edges are deleted when u is deleted. But
this is to avoid overcounting as explained below.
Let r.v. X denote the total number of directed edges deleted in the iteration. Then
\[
X \ge \sum_{(u\to v),\,(v\to u)\in F} \big( X(u \to v) + X(v \to u) \big).
\]
Note that when u deletes outgoing edges of v because u is eligible w.r.t. v, no other neighbor of v can be simultaneously eligible. Thus each deleted (directed outgoing) edge is counted at most once.
By linearity of expectation, the expected total number of (directed) edges deleted is
\[
E[X] \ge \sum_{(u\to v),\,(v\to u)\in F} E[X(u \to v)] + E[X(v \to u)]
\ge \sum_{(u\to v),\,(v\to u)\in F} \frac{d(v)}{d(u)+d(v)} + \frac{d(u)}{d(u)+d(v)}
= \sum_{(u\to v),\,(v\to u)\in F} \frac{d(v)+d(u)}{d(u)+d(v)} = |F|.
\]
Since we counted directed edges, the expected number of (undirected) edges deleted is at least |F|/2.
Lemma 6.10
The algorithm terminates in O(log n) iterations with high probability, i.e., with
probability at least 1 − 1/n.
Proof. The proof is similar to the proof of Lemma 6.8. Using conditional expectation, the expected number of edges remaining after k iterations is at most |E|/2^k. Plugging in k = 4 log n, the expected number of edges remaining will be at most |E|/n^4 ≤ 1/n^2, since |E| ≤ n^2. By Markov's inequality, the probability that at least one edge remains after 4 log n iterations is at most 1/n^2.
Choosing a random number Since we choose a random number in [0, 1], every node gets a unique rank with probability 1 (because of infinite precision), so the algorithm will correctly produce an MIS. However, it is not crucial that all ranks are distinct. If they are not, then there is a small probability that the algorithm will be incorrect.
Thus, one can simply choose a random number from the interval, say, [1, n^4], so that the probability that any two nodes have the same rank in any one iteration is at most \binom{n}{2} \cdot 1/n^4 = O(1/n^2). Thus, with probability at least 1 − O(1/n), during the entire course of the algorithm, all ranks will be distinct (why?).
Thus, we will have a Monte-Carlo algorithm that is correct with high probability. If
we want a Las Vegas algorithm, we can simply rerun the algorithm if we find that two
neighbors have been included in the independent set. This will still guarantee an expected
running time of O(log n). Hence we can assume that all ranks are distinct (thus nodes
have been assigned a random ordering of ranks).
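For concreteness, here is a small centralized simulation of the rank-based rule analyzed above (my own illustration, not the book's pseudocode): in each iteration every remaining node picks a fresh random rank, a node whose rank is the smallest in its closed neighborhood joins the MIS, and the closed neighborhoods of the new MIS nodes are deleted.

import random

def rank_based_mis(adj):
    """adj: dict node -> set of neighbors. Rank-based MIS:
    in each iteration, every live node draws a fresh rank in [0, 1];
    a node whose rank is smallest in its closed neighborhood joins the
    MIS, and it and its neighbors are removed from the graph."""
    live = {v: set(adj[v]) for v in adj}
    mis = set()
    iterations = 0
    while live:
        iterations += 1
        rank = {v: random.random() for v in live}
        joiners = {v for v in live
                   if all(rank[v] < rank[u] for u in live[v])}
        mis |= joiners
        removed = joiners | {u for v in joiners for u in live[v]}
        live = {v: nbrs - removed for v, nbrs in live.items() if v not in removed}
    return mis, iterations

if __name__ == "__main__":
    n = 8
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}  # a cycle on 8 nodes
    print(rank_based_mis(adj))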
6.2 Coloring
In this section, we consider another fundamental local symmetry breaking problem, namely
coloring a graph. In particular, we consider the problem of coloring a given graph using
∆ + 1 colors, where ∆ is the maximum degree of the graph. It is easy to show that every
graph can be colored in ∆ + 1 colors (see Exercise 6.4). In fact, it is easy to accomplish
this using a centralized algorithm in linear time. Here we will show how to achieve the
same using a distributed algorithm that runs in O(log n) rounds.
Given an undirected graph G, a vertex-coloring of G is simply assigning colors to
vertices. We are interested in a legal vertex-coloring, i.e., no two neighboring vertices
should be assigned the same color (in other words, if two vertices are connected by an
edge they should be assigned different colors).
colors to begin with.) The algorithm proceeds in iterations (Algorithm 11 gives the pseudocode). Initially all vertices are uncolored and all vertices are asleep. Each iteration consists of the following steps:
1. Every uncolored vertex independently wakes up with probability 1/2.
2. Every vertex that woke up picks a tentative color uniformly at random from its own list of colors.
3. If no awake neighbor of a vertex picked the same tentative color, the vertex permanently colors itself with that color and informs its neighbors.
4. The color list of each uncolored vertex in the graph is updated by removing all colors successfully used by neighbors.
5. All colored vertices are deleted from the graph (they don't take further part in the algorithm).
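A centralized round-by-round simulation of these steps (an illustrative sketch under the assumptions above, not the book's Algorithm 11) could look as follows.

import random

def random_delta_plus_one_coloring(adj):
    """adj: dict vertex -> set of neighbors. Each uncolored vertex keeps a
    palette of d(v)+1 colors; in every iteration it wakes with probability
    1/2, picks a tentative color from its palette, and keeps it if no awake
    neighbor picked the same color; used colors are removed from palettes."""
    color = {}
    palette = {v: set(range(len(adj[v]) + 1)) for v in adj}
    uncolored = set(adj)
    rounds = 0
    while uncolored:
        rounds += 1
        awake = {v for v in uncolored if random.random() < 0.5}
        tentative = {v: random.choice(sorted(palette[v])) for v in awake}
        for v in awake:
            # keep the tentative color only if no neighbor picked the same one
            if all(tentative.get(u) != tentative[v] for u in adj[v]):
                color[v] = tentative[v]
        uncolored -= set(color)
        for v in uncolored:
            palette[v] -= {color[u] for u in adj[v] if u in color}
    return color, rounds

if __name__ == "__main__":
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(random_delta_plus_one_coloring(adj))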
Analysis We show the correctness and analyze the running time of the algorithm.
Lemma 6.11
The algorithm (if it terminates) computes a legal vertex-coloring of the graph.
Proof. Note that in any step, a vertex u permanently colors itself only with a color that is not chosen by any of its neighbors (in this step). Since colors chosen by neighbors that were colored in previous steps are deleted from u's list, there is no conflict. Since every vertex starts with d_u + 1 colors, each uncolored vertex is left with at least one color to choose from at any step. Hence the coloring is legal.
Lemma 6.12
In every iteration, each uncolored vertex successfully colors itself with probability
at least 1/4.
Proof. Consider an uncolored vertex u in iteration t. Let L_u^t denote the list of colors that u has at the beginning of this round. Denote the (current) degree of u by d_u^t and the (current) set of neighbors of u by N_u^t. We will show that with constant probability u will color itself successfully in iteration t.
Let the event E_u denote that vertex u colors itself successfully. Let Ē_u denote the complement of the event E_u. Let the event W_u denote that u wakes up. Let the event E_u(c) denote that u chooses color c (in this round).
Since a node has to wake up to color itself, we have
\[
P(E_u) = P(E_u \mid W_u)\,P(W_u) = \frac{1}{2} P(E_u \mid W_u) = \frac{1}{2}\big(1 - P(\bar{E}_u \mid W_u)\big). \tag{6.1}
\]
We upper bound P(Ē_u | W_u). This event will happen if some neighbor of u wakes up and also chooses the same color that u chooses. Hence,
\[
P(\bar{E}_u \mid W_u) \le \sum_{v \in N_u^t,\, c \in L_u^t} P\big(E_u(c) \wedge W_v \wedge E_v(c) \mid W_u\big)
= \sum_{v \in N_u^t,\, c \in L_u^t} P\big(E_u(c) \wedge E_v(c) \mid W_u, W_v\big)\,P(W_v)
= \frac{1}{2} \sum_{v \in N_u^t,\, c \in L_u^t} P\big(E_u(c) \wedge E_v(c) \mid W_u, W_v\big).
\]
To bound the above probability, fix a particular neighbor v. Exercise 6.6 asks you to show that the probability that u and v choose the same color is at most 1/|L_u^t| = 1/(d_u^t + 1) < 1/d_u^t. Hence,
\[
P(\bar{E}_u \mid W_u) < \frac{1}{2} \sum_{v \in N_u^t} \frac{1}{d_u^t} = \frac{1}{2}.
\]
Hence, by Equation (6.1), P(E_u) ≥ 1/4.
Lemma 6.13
Within O(log n) rounds the graph will be legally colored with high probability (i.e.,
with probability at least 1 − 1/n), where n is the number of vertices in G.
Proof. By Lemma 6.12, in each iteration an uncolored vertex colors itself with probability at least 1/4; hence the probability that a vertex is not colored after 16 ln n rounds is at most (3/4)^{16 \ln n} ≤ 1/n^2. By a union bound, the probability that some vertex remains uncolored after so many rounds is at most 1/n.
Proof. We first argue that each iteration of the WHILE loop maintains a legal coloring.
At the beginning the coloring is obviously valid. Hence, assuming that the coloring is
valid at the beginning of an iteration of the WHILE loop, we show that it remains valid
at the end of the iteration. Consider any two adjacent nodes u and v, and let u be the
parent of v. For the moment, we will assume that u is not the root. In an iteration, if u
and v choose different indices (in Line 9 of Algorithm 12) at which they differ with respect to their respective parents, then the new colors of u and v are obviously different. On the other hand, if they both choose the same index (say i), then that means their ith bits are different; hence the new colors will be different again (because the ith bit is part of the new color as well). If u is the root, then a similar argument holds.
We now argue that each iteration reduces the size of the color of each node by (essentially) a logarithmic factor. This is because if ℓ is the size of the color at the beginning of the iteration, then the size of the color at the end of the iteration is at most ⌈log ℓ⌉ + 1, since the new color is the concatenation of the representation of an index into the color (which has at most ⌈log ℓ⌉ bits) and one more bit.
Thus after O(log∗ n) iterations the number of colors will be reduced to a constant. The algorithm continues till the size of the color does not reduce any further. It is easy to check that this means that each node ends up with one of the following six colors (represented as bits): 00, 01, 10, 11, 100, 101.
Algorithm 12 O(log∗ n) Coloring Algorithm for Rooted Trees (code for node v).
1: c(v) = ID(v)   ▷ Initialize the color of v to be its ID.
2: send color to all children (if any)
3: flag = "true"
4: while flag == "true" do
5:     oldsize = |⟨c(v)⟩|   ▷ Number of bits in the bit representation of c(v)
6:     if v == root then
7:         i = 0   ▷ The root picks index 0.
8:     else
9:         Let i be the first (least significant) index where ⟨c(v)⟩ differs from ⟨c(parent(v))⟩
10:    ⟨c(v)⟩ = ⟨i⟩ ◦ c_i(v)   ▷ Set the new color of v to the concatenation of the bit representation of i and the ith bit of c(v).
11:    v notifies its color to all its children
12:    if oldsize == |⟨c(v)⟩| then
13:        flag = "false"   ▷ Finish the algorithm.
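The bit manipulation in Lines 9–10 is easy to get wrong, so here is a small Python sketch of the same color-reduction step (my own illustration; it assumes colors are non-negative integers and that adjacent colors are distinct, as guaranteed by a proper coloring).

def reduce_color(my_color, parent_color):
    """One reduction step in the style of Lines 9-10: find the least
    significant bit position i where my_color differs from parent_color and
    return the new color <i, i-th bit of my_color> encoded as an integer.
    Assumes my_color != parent_color."""
    diff = my_color ^ parent_color
    i = (diff & -diff).bit_length() - 1       # index of least significant differing bit
    my_bit = (my_color >> i) & 1
    return (i << 1) | my_bit                  # bits of i followed by my_bit

if __name__ == "__main__":
    # Adjacent nodes with distinct colors always map to distinct new colors:
    # either the chosen indices differ, or they agree and the appended bits differ.
    child, parent, grandparent = 0b101101, 0b100101, 0b100111
    print(reduce_color(child, parent), reduce_color(parent, grandparent))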
repeat this process to eliminate color 5: again shifting down, and then nodes with color 5
choosing a free color in {1, 2, 3}. We again repeat to eliminate color 6.
Corollary 6.16
Any bounded degree graph with maximum degree ∆, where ∆ = O(1), can be colored in O(log∗ n) rounds using at most ∆ + 1 colors.
appears to be inherently sequential. We will see shortly that it can be distributed with
only a small loss in performance.
We give two proofs for the above theorem. The first is conceptually simple and
straightforward. The second proof uses a charging argument and is useful in designing a
distributed algorithm.
First Proof
Proof. Let r_i be the number of elements that are uncovered at the beginning of iteration i of the greedy algorithm. This means that r_0 = n. Let OPT be an optimum set cover and let its size be |OPT|. At the start of iteration i, since the optimum covers all the elements with |OPT| sets, there is at least one set which covers at least r_i/|OPT| of the still-uncovered elements. Hence, greedy will cover at least r_i/|OPT| new elements in iteration i. Thus the number of elements that remain uncovered at the beginning of iteration i + 1 is
\[
r_{i+1} \le r_i \left(1 - \frac{1}{|OPT|}\right).
\]
From the above recurrence, we have r_i ≤ r_0(1 − 1/|OPT|)^i = n(1 − 1/|OPT|)^i. Plugging in f = |OPT| ln(n/|OPT|) (where ln is the natural logarithm), we have
\[
r_f \le n\left(1 - \frac{1}{|OPT|}\right)^{|OPT| \ln(n/|OPT|)} \le n\, e^{-\frac{1}{|OPT|}\,|OPT| \ln(n/|OPT|)} = |OPT|.
\]
Hence after picking f sets, there are only |OPT| elements left that are (still) uncovered; these can be covered by picking at most |OPT| more sets. Hence the number of sets picked by greedy is at most |OPT| ln(n/|OPT|) + |OPT| = O(|OPT| log(n/|OPT|)). Since |OPT| ≥ n/∆ (Exercise 6.14), we have that the approximation ratio of greedy is O(log(n/|OPT|)) = O(log ∆).
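The greedy rule itself is short; the following Python sketch (illustrative, centralized) implements it and can be used to observe the logarithmic approximation behavior on small instances.

def greedy_set_cover(universe, sets):
    """universe: set of elements; sets: dict name -> set of elements.
    Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(sets[s] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

if __name__ == "__main__":
    universe = set(range(1, 8))
    sets = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6, 7}, "D": {1, 6}}
    print(greedy_set_cover(universe, sets))  # e.g., ['A', 'C']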
Second Proof To bound the performance of a set cover algorithm we do the following accounting (charging): when the algorithm picks a set, its price (which is 1) is distributed equally among all the new (i.e., previously uncovered) elements it covers, i.e., each new element covered gets a price of 1/k, where k is the number of new elements covered. Each element is assigned a price only once, at the time it is covered by the algorithm. Note that greedy chooses a set S at each iteration that realizes the minimum unit price. For any subset A of the ground set X, let g(A) = \sum_{e \in A} p(e). By our accounting, g(X), the sum of the unit prices, is the total cost incurred by greedy.
Proof. For any set S in the given collection, we will show that g(S) ≤ H_{|S|}, where H_{|S|} = 1 + 1/2 + 1/3 + · · · + 1/|S| is the harmonic sum. Sort the elements of S according to the iteration in which they are covered by greedy, breaking ties arbitrarily. Let e_1, . . . , e_k be this numbering. When greedy covers e_i, p(e_i) ≤ 1/(k − i + 1). This is because, since e_i belongs to S, greedy would have chosen to cover e_i by choosing a set (in the collection) that has at least k − i + 1 uncovered elements, since S has at least these many elements still uncovered (it could have chosen S itself, but never a set with a smaller number of uncovered elements). Hence, g(S) = \sum_{i=1}^{k} p(e_i) \le \sum_{i=1}^{k} 1/(k - i + 1) = H_k = H_{|S|}.
We have, for any two sets A and B, g(A ∪ B) ≤ g(A) + g(B). Denoting by OPT an optimal cover and by GREEDY the collection of sets output by the greedy algorithm, we have
\[
|GREEDY| = g(X) = g\Big(\bigcup_{S \in OPT} S\Big) \le \sum_{S \in OPT} g(S)
\le \sum_{S \in OPT} H_{|S|}
\le \max_{S}\{H_{|S|}\} \sum_{S \in OPT} 1
= |OPT| \cdot \max_{S} H_{|S|} = O(|OPT| \log \Delta).
\]
degree; this is what the modified accounting rule ensures. We show that one can process
all nodes in a category in O(log n) rounds, giving an overall time bound of O(log n log ∆) rounds.
2. Among all the candidate sets that contain it, each voter votes for that set which
has the lowest rank.
3. A candidate is elected if it obtains at least 1/4 of the votes of its electorate (electorate
is the set of nodes that can vote for it — they will be a subset of its closed
neighborhood).
4. Elected candidates (nodes) enter the dominating set and they (and their incident
edges) are deleted from the graph. (Nodes that have degree 0 are also deleted from
the graph).
We note that the cost of the set can now be distributed equally among the elements that
voted for it; thus the above steps implement the (modified) accounting scheme mentioned
above. It is easy to see that all the above steps can be implemented in O(1) rounds (even
in the CONGEST model).
6.4.2.1 Analysis
We analyze one iteration of the election in an (arbitrary) phase i and show that the phase finishes in O(log n) rounds whp. For the sake of analysis, it is convenient to construct a bipartite graph: candidates on one side and their neighbors on the other. Let d(v) denote the degree of node v in the bipartite graph. Note that the candidates during phase i are all those vertices whose degree is in the interval (∆/2^i, ∆/2^{i−1}]; however, all non-dominated neighbors of candidates (with no restriction on degree) participate during the phase. The neighbors of a candidate c form the electorate of c. The neighbors of a voter v form the pool of v.
We will show that, in each iteration, the expected number of edges removed from the bipartite graph is a constant fraction of the edges. This will imply that O(log n) elections are enough to end a phase whp.
Definition 6.3
A voter v is influential for a candidate c if at least 3/4 of the voters in c's electorate have degree no greater than that of v.
Lemma 6.18
For any two voters v and w with d(v) ≥ d(w) in c's electorate, P(w votes for c | v votes for c) ≥ 1/2.
Lemma 6.19
Let v be an influential voter for candidate c. Then P(c is elected | v votes for c) ≥ 1/6.
Proof. Let random variable X denote the number of votes for c. Let Y = C − X, where C is the size of the electorate of c; random variable Y denotes the number of voters that did not vote for c. We have
\[
E[X \mid v \text{ votes for } c] \ge \sum_{w : d(w) \le d(v)} P(w \text{ votes for } c \mid v \text{ votes for } c) \ge \frac{1}{2} \cdot \frac{3C}{4} = \frac{3C}{8}.
\]
The above follows because voter v is influential for candidate c, and thus at least 3/4 of the voters in c's electorate have degree no greater than that of v, and we use Lemma 6.18. Hence,
\[
P(c \text{ not elected} \mid v \text{ votes for } c) = P(X < C/4 \mid v \text{ votes for } c) = P(Y > 3C/4 \mid v \text{ votes for } c).
\]
Applying Markov's inequality we have
\[
P(Y > 3C/4 \mid v \text{ votes for } c) \le \frac{E[Y \mid v \text{ votes for } c]}{3C/4} = \frac{4\big(C - E[X \mid v \text{ votes for } c]\big)}{3C} \le \frac{4(C - 3C/8)}{3C} = \frac{5}{6}.
\]
Lemma 6.20
In a phase, let m0 denote the total number of edges in the bipartite graph at the
beginning of the iteration (in this phase). Let X denote the number of edges
removed from the bipartite graph after one election. Then E[X] > m0 /24.
Proof. An edge (v, c) is good if v is an influential voter for candidate c. By the definition of an influential voter, any candidate has at least 1/4 of its electorate influential; hence at least a 1/4 fraction of its edges are good. Thus overall, at least a 1/4 fraction of the edges in the bipartite graph are good.
We have
\[
E[X] \ge \sum_{(v,c)} P(c \text{ is elected and } v \text{ votes for } c)\, d(v)
\ge \sum_{(v,c) \text{ good}} P(c \text{ is elected and } v \text{ votes for } c)\, d(v)
= \sum_{(v,c) \text{ good}} P(v \text{ votes for } c)\, P(c \text{ is elected} \mid v \text{ votes for } c)\, d(v).
\]
Since P(v votes for c) = 1/d(v), the above is equal to
\[
\sum_{(v,c) \text{ good}} P(c \text{ is elected} \mid v \text{ votes for } c).
\]
By applying Lemma 6.19, and since at least a 1/4 fraction of the edges are good, we have E[X] ≥ (1/6)(m_0/4) = m_0/24.
The above lemma shows that, on average, a constant fraction of the edges are deleted from the bipartite graph. Using techniques similar to those used in Section 6.1 (in particular, in the proof of Lemma 6.8), this leads to the following lemma.
Lemma 6.21
A phase terminates in O(log n) rounds whp.
Solution: First run the coloring algorithm for a bounded degree graph (Section 6.3.2). The algorithm colors the graph G with at most ∆ + 1 colors and finishes in O(log∗ n) rounds (cf. Corollary 6.16).
Now we use the relation between independent set and node coloring. Notice that
the set of nodes with the same color is an independent set. However, that set is not
necessarily a MIS. Nonetheless, starting with a coloring, one can easily derive a MIS
algorithm: Suppose the colors are numbered from 1, 2, . . . , ∆ + 1. In the first round all
nodes of the first color join the MIS and notify their neighbors. Then, all nodes of the
second color which do not have a neighbor that is already in the MIS join the MIS and
inform their neighbors. This process is repeated for all colors. Hence, it takes at most
O(∆) rounds to compute a MIS from the ∆ + 1 coloring. Therefore the total time required
is O(log∗ n + ∆) rounds. Since ∆ is bounded by a constant, the required time bound
follows.
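The color-class sweep described above is easy to express in code; the following sketch (an illustrative, centralized version written for this discussion) turns any proper coloring into an MIS by processing one color class per round.

def mis_from_coloring(adj, color):
    """adj: dict node -> set of neighbors; color: dict node -> color in 1..k.
    Process color classes in increasing order; a node joins the MIS if no
    neighbor has joined already. Takes one 'round' per color class."""
    mis = set()
    for c in sorted(set(color.values())):
        for v in (v for v in adj if color[v] == c):
            if not adj[v] & mis:      # no neighbor already in the MIS
                mis.add(v)
    return mis

if __name__ == "__main__":
    # A path 1-2-3-4 properly colored with 2 colors.
    adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    color = {1: 1, 2: 2, 3: 1, 4: 2}
    print(mis_from_coloring(adj, color))  # {1, 3}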
Worked Exercise 6.2. Show that MIS Algorithm 2, when run on a graph that is a path of length n, terminates in O(√log n) rounds with high probability.
Solution: MIS Algorithm 2 terminates for a given node when it decides whether it is in the MIS or not. Define an undecided node as one which has not yet decided if it is in the MIS or not. We show that the maximum size of any path of consecutive undecided nodes drastically reduces after 2√log n rounds, and then we bound the time taken for any node in any of these paths to decide.
Lemma 6.23
After 2√log n rounds, the maximum size of any path of consecutive undecided nodes is at most 2√log n w.h.p.
Proof. Consider a path of nodes of size k + 1. Let the ranks of the nodes from one end of
the path to the other end be denoted by n1 , n2 , . . . , nk+1 . Now, after one round of MIS
Algorithm 2, k nodes on this path will remain undecided iff either n1 > n2 > . . . > nk+1
or n_1 < n_2 < . . . < n_{k+1} (this is simple to prove). Thus the probability that a path of k consecutive undecided nodes remains among this path is 2/(k+1)!. The probability that the given path remains after k rounds is (2/(k+1)!)^k (because in each round, every node picks a new rank, so this condition must be met with the given probability in all k rounds). There are n − (k + 1) possible such paths in a path of n nodes (consider the starting position of the path). Thus by the union bound, the probability that there exists a path of k consecutive undecided nodes remaining after k rounds is upper bounded by (n − k − 1)(2/(k+1)!)^k ≤ n(1/k!)^k. If we set k = 2√log n, then the probability that there exists a path of 2√log n consecutive undecided nodes after 2√log n rounds is
\[
\le n \left( \frac{1}{(2\sqrt{\log n})!} \right)^{2\sqrt{\log n}}
\le n \left( \frac{1}{(\sqrt{\log n})^{\sqrt{\log n}}} \right)^{2\sqrt{\log n}}
= \frac{n}{n^{\log \log n}}
\le \frac{1}{n^{1+c}}
\]
for a positive constant c when n is sufficiently large.
Therefore the probability that such a path of size 2√log n exists after 2√log n rounds is at most 1/n^{1+c}. The probability of maintaining a path of length > k for k rounds is smaller than this value. Thus the probability that there exists any path of size 2√log n or larger is upper bounded by n · (1/n^{1+c}) = 1/n^c, which is still very low. Thus the lemma is proved.
Now, consider any such path of consecutive undecided nodes. In every round, at least one node in the path will decide to join the MIS. Thus in at most 2√log n rounds, all nodes in every path of undecided nodes will decide. Thus, after O(√log n) rounds MIS Algorithm 2 terminates w.h.p.
6.6 Exercises
Exercise 6.1. MIS Algorithm 2 requires the knowledge of n, the network size. Can you modify the algorithm so that it works without knowledge of n?
Exercise 6.2. Compute the expected message complexity of the MIS algorithms 1 and 2.
Also show a high probability bound on the message complexity.
Exercise 6.3. Give an O(n)-round deterministic algorithm for MIS in the synchronous
CONGEST model.
Exercise 6.4. Consider the following algorithm for coloring a graph using ∆ + 1 colors,
where ∆ is the maximum degree of the graph. Let the colors be numbered from 1 to ∆ + 1.
Start with color 1 and an arbitrary node. Color that node with color 1, skip its neighbors,
choose an arbitrary node from the remaining set of nodes and color that node with color
1 and so on. Continue till no more nodes can be colored with color 1 (note that the color 1
nodes form a MIS). Then choose color 2 and repeat the above process on nodes that are
not yet colored and so on. Show that the above algorithm gives a valid ∆ + 1 coloring.
Exercise 6.5. Can we color every graph using ∆ colors? Justify your answer.
Exercise 6.6. In the proof of Lemma 6.12, show that the probability that u and v
choose the same color is at most 1/|L^t_u| = 1/(d^t_u + 1).
Exercise 6.7. Consider the (∆ + 1)-coloring algorithm discussed above without the sleeping
step, i.e., all vertices are awake in every step and pick a tentative color uniformly at
random from their lists of colors. Show that this algorithm also finds a (∆ + 1)-coloring in
O(log n) rounds with high probability.
Exercise 6.8. The input is a 6-node oriented tree. The root r has two children u and v.
Node u has one child w and node v has two children x and y. The IDs of these nodes
(in decimal) are as follows: IDr = 104, IDu = 110, IDv = 51, IDw = 170, IDx = 35, and
IDy = 15. Show the execution of the deterministic 6-coloring algorithm on oriented trees
on this input. Make sure to use the correct number of bits in the representation of colors
in each round of the algorithm and also make sure to terminate the algorithm after the
appropriate number of rounds. Your answer should be a sequence of appropriately labeled
illustrations of the given oriented tree.
Exercise 6.9. In a graph with maximum degree ∆, show that the deterministic symmetry
breaking rule reduces the size of the color of a node from ℓ to at most ∆(⌈log ℓ⌉ + 1) and,
furthermore, by applying the rule for O(log∗ n) rounds, the size of the color of every node
reduces to at most 3∆ bits.
Exercise 6.10. The input is a graph G = (V, E) whose vertices have already been
assigned a proper vertex coloring c : V → {1, 2, 3, 4}. In other words, each node knows
its color, denoted by the local variable c(v). Furthermore, c(v) ∈ {1, 2, 3, 4} for all v ∈ V
and c(u) ≠ c(v) for all {u, v} ∈ E. Design and present a distributed algorithm in the
CONGEST model running in 4 rounds for computing an MIS of G. Use pseudocode to
describe your algorithm.
distance at most two from v. Note that N2 (v) contains v and all its neighbors. Let ∆2
denote the size of the largest 2-neighborhood in the graph.
Describe a randomized algorithm, running in the Local model in O(log n) rounds
(in expectation), that produces a distance-2 coloring of G using ∆2 colors. Write your
algorithm using pseudocode executed by a node v. Make sure that your pseudocode is
clear and well-commented.
Hint: Mimic the (∆ + 1)-coloring algorithm presented here.
while (V ≠ ∅) do
    S ← all nodes of degree at most 2
    for each edge e ∈ E do
        if e has exactly one endpoint in S then
            orient e from its endpoint in S to its endpoint in V \ S
        if e has both endpoints in S then
            orient e arbitrarily
    V ← V \ S
    E ← edges in G[V]
(a) Prove that if G is a tree, then this algorithm can be implemented (in the CONGEST
model) to run in O(log n) rounds, yielding an orientation of edges such that every
node has at most 2 out-neighbors.
(b) Show that there is a deterministic algorithm in the CONGEST model to compute a
3-coloring of an unoriented tree in O(log n) rounds.
Exercise 6.14. Show that in the set cover problem, |OPT| ≥ n/∆.
Exercise 6.15. Adapt the second proof of Theorem 6.17 to show that the modified
accounting scheme described in Section 6.4.1 gives an O(log ∆) approximation.
Exercise 6.16. Time division multiple access (TDMA) is a method for different nodes
in a distributed system to access a shared physical medium. For example, for nodes in a
wireless network, their radio frequency channel is the shared medium that they all need
to access to be able to communicate. TDMA is also used in other settings such as bus
networks, where different nodes need to access a shared bus in order to communicate. In
TDMA, time is partitioned into time slots and each node has its own time slot during
which it uses the shared medium.
Consider the situation in which there are n identical (here “identical” means that
nodes do not have IDs) nodes in a wireless network, each of which needs to send a message
to a base station. The base station is within the transmission range of the wireless nodes,
but it is possible that not all pairs of wireless nodes are in each others’ transmission ranges.
The wireless nodes have access to a single radio frequency channel and therefore if two or
more of these nodes transmit their message at the same time, the base station hears a
“collision,” but does not receive any of the transmitted messages. More precisely, a base
station has the ability to distinguish among three situations: (i) it hears no transmission,
(ii) it hears a “collision”, and (iii) it hears a message. The base station can also transmit,
but it also uses the same radio frequency channel as the wireless nodes. Therefore, if the
base station is transmitting at the same time as a wireless node, it will hear a collision.
Now notice that n nodes have to send messages to the base station and no two nodes
can transmit at the same time. So at least n time slots are needed for all the messages to
reach the base station. The problem is how should the wireless nodes and the base station
coordinate transmissions so that it does not take too many time slots for all messages to
reach the base station.
(a) Design a randomized algorithm that uses O(n) time slots in expectation to ensure
that the messages from all n wireless nodes reach the base station. You can assume
that nodes initially know n.
(b) Prove that your algorithm runs in expected O(n) time slots.
(c) How many time slots will your algorithm take with high probability?
7
Minimum Spanning Tree
In this chapter, we focus on one of the most central problems in distributed computing,
the minimum spanning tree (MST) problem. The MST along with leader election are
the most fundamental of the “global” distributed network problems. They are “global”
because they both need at least Ω(D) rounds in a network of diameter D (lower bounds
are the focus of a later chapter), i.e., their computation needs the entire graph to be
traversed.
The MST is an important and commonly occurring primitive in the design and
operation of communication networks. Formally, the MST problem is: given an n-node
connected graph with edge weights, compute a spanning tree of minimum total weight,
i.e., minimum total weight of the n − 1 spanning tree edges.1 In practice, the weights
can be used to model delay, congestion, etc., and hence an MST gives a spanning tree that
minimizes total delay, congestion, etc. One of the most common applications of an MST is
that it can serve as a backbone for efficient communication, e.g., it can be used naturally for
broadcast. Any node that wishes to broadcast simply sends messages along the spanning
tree. The advantage of this method over flooding is that redundant messages are avoided.
The message complexity is O(n), which is optimal, and since the spanning tree is one
of minimum weight, the total cost (assuming weights model delay, etc.) is minimized. A
distributed MST algorithm can also be used for leader election if a rooted tree (where
parent-child relationships are known to each node) is constructed, since the root can serve
as the leader. The MST algorithms described in this chapter naturally construct a rooted
tree.
In this chapter, we present distributed algorithms for the MST problem. As usual,
we assume the clean network model and the synchronous CONGEST model. (Recall
that in the clean network model each node has knowledge only about itself and the
weights of its incident edges; it does not know any information about its neighboring
nodes. The CONGEST model means that only O(log n)-sized messages are allowed to travel per
round per edge.) Let’s discuss the output of a distributed MST algorithm. At the end of
the distributed MST algorithm, each node will know which of its incident edges belong to
the MST and which do not, i.e., each node needs to know (only) the status of its incident
edges (and not other edges).
1. Note that all spanning trees of a connected n-node graph will have n − 1 edges.
We make an assumption that simplifies our algorithms and analysis: we assume that
all edge weights in the graph are distinct. It is easy to show that this implies that the
MST is unique (Exercise 7.1). This assumption is without loss of generality, because one
can tag each edge weight (additionally) with the IDs of the endpoints of the edge (which
are unique as a pair).2 This tagging can be used to break ties between edges having the
same weight.
As in centralized MST algorithms, distributed MST algorithms also rely on two
important properties of a MST: (1) cut property and (2) cycle property.
1. Cut property: A cut in a graph is a partition of the vertex set into two disjoint
sets. The cut property states that, given any cut in a graph, the lightest (minimum
weight) edge crossing the cut belongs to the MST (due to the assumption of unique
edge weights, there is a unique lightest edge crossing the cut.)
2. Cycle property: Consider any cycle in the graph. The heaviest (maximum weight)
edge in the cycle will not be in the MST.
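The cut property is easy to sanity-check against any sequential MST routine. The following Python sketch (illustrative; the graph, the Kruskal helper, and the checking loop are all assumptions made for the example) builds a random complete graph with distinct weights and verifies that, for many random cuts, the lightest crossing edge is indeed an MST edge.

import random
from itertools import combinations

def kruskal(n, edges):
    # edges: list of (w, u, v) with distinct weights; returns the set of MST edges.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = set()
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.add((w, u, v))
    return mst

random.seed(1)
n = 8
weights = random.sample(range(1000), n * (n - 1) // 2)        # distinct edge weights
edges = [(w, u, v) for w, (u, v) in zip(weights, combinations(range(n), 2))]
mst = kruskal(n, edges)
for _ in range(100):
    S = set(random.sample(range(n), random.randint(1, n - 1)))
    crossing = [e for e in edges if (e[1] in S) != (e[2] in S)]
    assert min(crossing) in mst      # lightest edge crossing the cut is an MST edge
print("cut property verified on 100 random cuts")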
each fragment will have one root (leader) node. Each fragment is identified by the identifier
of its root — called the fragment ID — and each node in the fragment knows its fragment
ID.
[Figure: a fragment graph with fragments F1, . . . , F5 and the MWOE (minimum-weight
outgoing edge) of each fragment; Fv and Fw denote the fragments containing v and w.]
checking can be done (in increasing weight order) starting from the neighbor that was
checked last in the previous phase. This is because all edges that were checked earlier
would belong to the same fragment and will continue to be in the same fragment till
the end of the algorithm. Then, each node sends its minimum outgoing incident edge
to the root by convergecasting the minimum; the root then finds the MOE, which is the
minimum among all the edges convergecast. Note that the convergecast process uses the
fragment (tree) edges only.
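The convergecast itself is a simple bottom-up minimum computation over the fragment tree. The sketch below (Python; the parent/children representation of the fragment and each node's locally known lightest outgoing edge are assumed data structures for the illustration) shows how the root ends up with the fragment's MOE.

def convergecast_moe(parent, children, local_best):
    # parent: dict node -> parent (the root maps to itself); children: dict node -> list.
    # local_best: dict node -> lightest outgoing incident edge (weight, u, v), or None.
    root = next(v for v in parent if parent[v] == v)

    def upcast(v):
        # combine v's own candidate with the minima reported by its subtrees
        candidates = [] if local_best[v] is None else [local_best[v]]
        candidates += [c for c in (upcast(u) for u in children[v]) if c is not None]
        return min(candidates) if candidates else None

    return upcast(root)

# fragment tree: root 0 with children 1 and 2; node 3 hangs off node 2
parent = {0: 0, 1: 0, 2: 0, 3: 2}
children = {0: [1, 2], 1: [], 2: [3], 3: []}
local_best = {0: (9, 0, 7), 1: (4, 1, 5), 2: None, 3: (6, 3, 8)}
print(convergecast_moe(parent, children, local_best))   # (4, 1, 5) is the fragment's MOE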
plus the messages needed to find the MOE. The latter takes a total of O(m + n log n) messages
because in each phase a node checks its neighbors in increasing order of weight starting
from the last checked node. Thus, except for the last checked node (which takes one
message per phase) all other neighbors are checked at most once. Hence the total message
complexity is

    ∑_{v∈V} 2d(v) + ∑_{i=1}^{log n} ∑_{v∈V} 1 = O(m + n log n).
7.2.1 Analysis
Before we go to the formal proof, we give the high-level idea behind the correctness and
running time.
Correctness. The algorithm’s correctness follows from the cycle property. Using the
cycle property, since only non-MST edges are filtered (note that an edge at node v is not
sent upward if it closes a cycle with edges in U, i.e., the already sent edges — since edges
are sent in nondecreasing order, a filtered edge is the heaviest edge in a cycle) it
can be shown that the root receives all the MST edges (plus possibly additional edges)
required to compute the MST correctly.
Running time. It is easy to show that the edges reported by each node to its parent in
the tree are sent in nondecreasing weight order, and each node sends at most n − 1 edges
upward to its parent. This is because if more than n − 1 edges are sent through a node,
then at least one edge will form a cycle with the edges sent previously and will be filtered
by the cycle property. To build the BFS tree, it takes O(D) time. Since the depth of the
BFS tree is D and each node sends at most n − 1 edges upward, the pipeline algorithm
takes O(D + n) = O(n) time. The analysis shows that there is not too much delay (no more
than n rounds) overall before the root gets all the edges that it needs to compute the MST.
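The filtering step at an intermediate node is exactly a union-find test on the edges it has already forwarded: an arriving edge that would close a cycle with them is the heaviest edge of that cycle (edges arrive in nondecreasing weight order) and can be dropped. A minimal Python sketch, with assumed names, of the per-node filter:

class CycleFilter:
    # Tracks the edges already sent upward; rejects any edge closing a cycle with them.
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def accept(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru == rv:
            return False      # closes a cycle with already-sent edges: filter it out
        self.parent[ru] = rv
        return True           # forward the edge to the parent in the BFS tree

f = CycleFilter()
for e in [(1, 2), (2, 3), (1, 3), (3, 4)]:    # edges arriving in nondecreasing weight order
    print(e, "forward" if f.accept(*e) else "filtered")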
We now formally argue the correctness and running time. We make two simple
observations:
1. The edges reported by each intermediate vertex to its parent in the tree are cycle-free.
2. Every vertex starts sending messages upwards at round L(v) = Depth(T (v)).
Lemma 7.2
(a) For each child u of v that is still active at round t, Q(v) at the beginning of
round t contains at least one edge.
(b) If v sends to its parent an edge of weight w′ at round t, then all of the elements
v was informed of at round t − 1 by its active children were of weight w′ or
larger.
(c) If v sends to its parent an element of weight w′ at round t, then any later
element it will learn of is of weight w′ or larger.
(d) Any non-root node v sends elements in nondecreasing weight order to its parent;
it sends the elements in a continuous fashion till it terminates.
Proof. The proof is by induction on the height of the tree. The base case trivially holds
for leaves.
We consider the induction step. Consider an intermediate vertex v at height h and
assume that the claims hold for each of its children.
Claim (a): Let Av be the set of k elements sent by v to its parent during the first t − h
rounds (i.e., rounds h, . . . , t − 1). Note that the set of edges in Av is cycle-free, by Line
4 of the Pipeline algorithm. Consider an active child u of v. Let Au be the set of
elements sent by u to v up to round t − 1. Similarly, the set Au is cycle-free. u has
transmitted continuously to v since round L(u) ≤ h − 1. Hence, |Au| ≥ k + 1. Thus
there exists some edge f ∈ Au − Av such that Av ∪ {f} is cycle-free (see Exercise 7.1).
This element belongs to Q(v).
Claim (b): Consider any active child u of v. Let f be the element sent by u in round
t − 1. Let f′ be some element sent by u at some round t′ ≤ t − 1 that is still in Q(v)
at round t. Then: w(f) ≥ w(f′) ≥ w′.
Claim (d): This claim follows from Claim (c) and the element selection rule, i.e.,
selecting the minimum weight edge in Q(v). Claim (a) shows that v sends the elements
continuously.
Proof. The proof is by induction. The base case is true for leaves. The induction step
follows from Claim (a) of Lemma 7.2 as follows. Consider an intermediate node v. By
Claim (a), it is not possible to have v terminate at round t (i.e., no more edges in Q(v))
unless all of its children have terminated by time t.
Running time We are now ready to finish the time analysis. The root starts getting
messages at time Depth(T ). The root receives at most n − 1 elements from each of its
children. By Lemma 7.2, these elements are sent in a fully pipelined fashion. The time to
know all the edges is O(n − 1 + Depth(T )) = O(n). The root can broadcast the entire set
of MST edges in an additional O(Depth(T ) + n − 1) = O(n) time.
Message analysis We analyze the message complexity of the algorithm. In the worst
case, each node can send Θ(n) edges upward, and hence the overall message complexity is
O(n2 ).
Hence we can state the following theorem.
Theorem 7.4
The Pipeline algorithm correctly computes the MST in O(n) time and using O(n2 )
messages.
First part: Controlled-GHS The first part, called the controlled-GHS algorithm, is
similar to the GHS algorithm, with the crucial property of ensuring that the diameter of
fragments does not become too big. (Note that in the GHS algorithm, even after the first
phase, there can be fragments with diameter as large as n.) The controlled-GHS
begins with each node as a singleton fragment. In every phase, as in the GHS algorithm,
each fragment finds its MOE. However, not all fragments are merged along their MOE
edges. Only a subset of MOE edges are selected for merging. A crucial property of the
merging is the following: in every phase, the number of fragments is reduced by at least a
factor of two, while the diameter is not increased by more than a constant factor (the
details are given below).
Second part: Pipeline algorithm The second part of the algorithm uses the Pipeline
algorithm to find the remaining √n − 1 MST edges (since there are only √n fragments
left). As in the Pipeline algorithm, a breadth-first tree B is built on G. Let r(B) be
the root of B. Using the edges in B, r(B) collects the weights of the interfragment edges and
computes the minimum spanning tree, T′, of the fragments by considering each fragment as
a super node. It then broadcasts the edges in T′ to the other nodes using the breadth-first
tree B. Since the depth of B is D and each node sends at most √n edges upward, the
Pipeline algorithm takes O(D + √n) time; this analysis is exactly similar to that of the Pipeline
algorithm (with √n replacing n). Thus the overall time complexity of the GKP algorithm is
O(D + √n log∗ n). We will show that the overall message complexity of this algorithm is
O(m + n^1.5).
Let Fi be the set of fragments at the beginning of phase i. (In the first phase, F1
consists of n singleton fragments.) Only fragments that have diameter at most 2^i will
find MOE edges (unlike all fragments as in the GHS algorithm); the intuition is that this
keeps the diameter of the merged fragments under control. Each fragment F ∈ Fi of
diameter at most 2^i determines the MOE of F (finding the MOE follows the same procedure
as explained in the GHS algorithm — Section 7.1.1) and adds it to a candidate set Mi of
MOE edges. Consider the fragment graph Hi = (Fi, Mi) defined as follows. The fragment
graph consists of vertices {F1, . . . , Fk}, where each Fj (1 ≤ j ≤ k) is a fragment at the
start of phase i of the algorithm. The edges of Hi are obtained by contracting the vertices
of each fragment Fj ∈ Fi to a single vertex in Hi and removing all resulting self-loops of
Hi, leaving only the MOE edges in set Mi. Note that the fragment graph is a forest, in
fact a collection of rooted trees: direct each MOE edge of a fragment as an outgoing edge.
We further note that the fragment graph is not explicitly constructed in the algorithm; it is
just a construct used to explain it (see Figure 7.2).
As mentioned earlier, not all MOE edges are used for merging (as in the GHS
algorithm); only a subset of them are used, as described below. The algorithm first finds a
maximal matching in the fragment graph Hi (the intuition for using a maximal matching
is mentioned below). Since the fragment graph is a forest of rooted trees, this can be accomplished
by running the deterministic symmetry breaking algorithm of Section 6.3, which runs in
O(log∗ n) rounds on the fragment graph (Exercise 7.3 asks you to show how this can be
adapted to find a maximal matching instead of a coloring). However, note that since
Hi is obtained by contracting the fragments, the symmetry breaking algorithm has to be
simulated by the leader of each fragment. Hence simulating one round on the fragment
graph will take O(√n) rounds, since the diameter of each fragment is bounded by O(√n)
(as shown below). Hence in the (original) graph, the algorithm takes O(√n log∗ n) rounds.
The set of MOE edges chosen in the maximal matching are used for merging. Let’s
call this set M′i (Line 4 of Algorithm 14). The intuition for using only these matched edges
is that there are no long chains. Indeed, only two fragments are merged by merging along
a matched edge and hence the diameter increases only by a factor of 2. This is what
we want, but we also want the number of fragments to go down by at least a factor of 2.
If we use only the matched edges to join fragments, this may not happen. For example,
let the fragment graph be a star graph. In this case, there is only one matched edge and
only two fragments are merged. To avoid this kind of situation, the algorithm merges
more edges: if a fragment of diameter at most 2^i has no incident matching edge (i.e., it
has no matching edge touching it) it adds its MOE edge to the set of chosen edges M′i
(Line 5 of Algorithm 14). Merging now takes place along the matching edges plus these
additional added edges. Note that this increases the diameter of the merged fragment,
since more fragments are merged; however, since only fragments that have no incident
matching edges are merged in this way, the increase in diameter is not too much. Thus we need two
properties to hold: (1) the diameter of the merged fragment does not increase by too
much; and (2) the number of fragments remaining at phase i is at most n/2^i. We show
these in the following lemmas.
Lemma 7.5
At the beginning of phase i, each fragment has diameter O(2^i). In particular, at
the end of the controlled-GHS algorithm each fragment has diameter O(√n).
Proof. We show that at the beginning of phase j, the diameter of each fragment is at
most 5 · 2^j. The proof is by induction on j. The base case, i.e., at the beginning of phase
0, is trivially true, since 5 · 2^0 = 5, which is bigger than 0, the
diameter of a singleton fragment.
By the induction hypothesis, the diameter of each fragment at the beginning of phase i is
at most 5 · 2^i. We show that at the end of phase i (i.e., at the beginning of phase i + 1),
the diameter of each fragment is at most 5 · 2^(i+1).
In phase i, we compute an upper bound on the diameter of any resulting fragment.
Note that a fragment consists of joining two fragments via a matching edge in the fragment
graph. Note that at least one of these two has diameter at most 2^i since only fragments
with diameter at most 2^i find MOE edges; the MOE edge may lead to a fragment with
larger diameter, i.e., at most 5 · 2^i. In addition to joining two fragments via a matching
edge, other fragments (with diameter at most 2^i) can be joined to either fragment of
the matching edge (see Figure 7.3). Hence the resulting diameter of the fragment at
the end of phase i is at most 5 · 2^i + 3 · 2^i + 3, since the diameter of the combined
fragment is determined by at most 4 fragments, out of which at most one has diameter
5 · 2^i and the other three have diameter 2^i, and these are joined by 3 MOE edges (which
contributes the constant 3). Hence the diameter at the end of phase i is at most
5 · 2^i + 3 · 2^i + 3 ≤ 8 · 2^i + 3 ≤ 5 · 2^(i+1), for i ≥ 1.
Since the controlled-GHS algorithm runs for ⌈log √n⌉ phases, the diameter of each
fragment at the end of the algorithm is at most O(2^⌈log √n⌉) = O(√n).
Lemma 7.6
At the end of phase i, each fragment has size at least 2^i.

Proof. We prove the above statement via induction on i. The base case (i = 0) is trivially
true, since at the end of phase 0, each fragment is of size at least 2^0 = 1.
Assume the statement is true for phase j. We show that it holds for phase j + 1 as
well, i.e., at the end of phase j + 1, the size of each fragment is at least 2^(j+1). During phase
j + 1, categorize all fragments into two categories: (1) fragments with diameter greater
than 2^(j+1) and (2) fragments with diameter at most 2^(j+1). Each fragment in category 1 has
at least 2^(j+1) nodes (since the diameter is at least this much), hence the induction step
holds trivially for these fragments. By the algorithm, in phase j + 1, each fragment in
category 2 merges with some other fragment (either category 1 or category 2). Since by
the induction hypothesis every fragment has size at least 2^j, the size of the resulting fragment
at least doubles, i.e., becomes at least 2^(j+1). Thus all fragments have size at least 2^(j+1) at
the end of phase j + 1.
Corollary 7.7
At the end of the controlled-GHS algorithm, the number of fragments is at most √n.

Proof. From Lemma 7.6, at the end of phase i, each fragment has size at least 2^i. Since
fragments are disjoint, this implies that the number of fragments at the end of
phase i is at most n/2^i. Thus, after ⌈log √n⌉ phases, the number of fragments is at most
n/2^⌈log √n⌉ ≤ √n.
Figure 7.3: Controlled-GHS algorithm: Merging fragments. The figure shows a fragment
graph consisting of 8 fragments — a, b, c, d, e, f, g, h. In the fragment graph each fragment
is represented as a node and the edges between fragments are the MOE edges — the
outgoing edges are directed away from the respective fragments. The red edges are the
edges of a maximal matching. The fragments are merged via the matched edges plus the
added green edges; a green edge is the MOE edge of a fragment which has no incident
matching edge — in the above example, these are fragments a and c. The figure shows
the merged fragments after merging via the red and green edges.
Thus we are ready to show the final lemma, which establishes the correctness and complexity
of the controlled-GHS algorithm.
Lemma 7.8
Algorithm 14 outputs at most √n MST fragments, each of diameter O(√n), in
O(√n log∗ n) rounds and incurs O(m + n log∗ n log n) messages.
Proof. The correctness of the algorithm is established in Lemma 7.5 and Corollary 7.7.
We next show the complexity bounds. Consider one of the O(log n) phases of
the algorithm. Similar to the analysis of the GHS algorithm (Section 7.1.2), the total
message complexity for finding the MOEs is O(m + n log n). The time complexity for
finding the MOEs in phase i is O(2^i), since the diameter of all fragments is O(2^i), by
Lemma 7.5. Then a matching is built using the O(log∗ n)-round deterministic symmetry-breaking
algorithm. The symmetry breaking algorithm is simulated by the leaders of neighboring
fragments by communicating with each other; since the diameter of each fragment is
bounded by O(2^i), the time needed to simulate one round of the symmetry breaking
algorithm in phase i is O(2^i) rounds. Since only MST edges are used in communication,
the total number of messages needed is O(n) per round of simulation. Since there are
O(log∗ n) rounds of simulation, the total time and message complexity for building the maximal
matchings are O(∑_{i=0}^{⌈log √n⌉} 2^i log∗ n) = O(√n log∗ n) and O(n log∗ n) per phase, respectively. Afterwards,
adding selected edges into M′i (Line 5 of the Controlled-GHS algorithm) can be done with
an additional O(n) message complexity and O(2^i) time complexity in phase i. Since there
are ⌈log √n⌉ = O(log n) phases in the controlled-GHS algorithm, the overall message
complexity of the algorithm is O(m + n log∗ n log n). The overall time complexity is
O(√n log∗ n).
7.3.2 Second Part: Combining the Remaining √n Fragments
The second part of the GKP algorithm combines the remaining at most √n fragments
using the Pipeline algorithm (Section 7.2) as follows. Recall that in the Pipeline algorithm
(Algorithm 13) we build a BFS tree first. (If a root or leader is known, then BFS
construction takes O(D) rounds and O(m) messages — Section 3.7. Exercise 7.4 deals
with the situation if there is no leader). In the algorithm described in Section 7.2, leaves
of the BFS tree send their incident edges to their respective parents in increasing order of
weight. Intermediate nodes forward the edges up the tree in increasing order of weight; the
edges forwarded are cycle-free. In the present setting, since only at most √n fragments
remain, at most √n − 1 MST edges need to be discovered (unlike the case of the original
Pipeline algorithm where all the n − 1 MST edges needed to be discovered). Furthermore,
not all incident edges need to be flooded; only inter-fragment edges need to be forwarded.
Each node can identify its inter-fragment incident edges by checking the fragment ID of
its neighbors. Only these edges are forwarded, in increasing order of weight. The filtering rule (via
the cycle property) ensures that no node forwards more than √n − 1 edges, since any
more would close a cycle. Hence, the message complexity is O(n^1.5). The Pipeline algorithm
guarantees that the edges are sent continuously by each node in a pipelined fashion, hence
the time complexity is O(D + √n), since the depth of the tree is O(D) and O(√n) edges
are sent in a pipelined fashion. The correctness follows from the correctness proof of the
Pipeline algorithm (Lemma 7.2).
Solution: Suppose, to the contrary, that there are two different MSTs, T and T′. Let e
be the minimum-weight edge that is in T but not in T′. The graph {e} ∪ T′ must contain
a cycle, and at least one edge in this cycle, say e′, is not in T, as T contains no cycles.
Since the edge weights are all distinct and e′ is in one but not both of the trees, the weight of
e is strictly less than the weight of e′. Thus {e} ∪ T′ − {e′} is a spanning tree of smaller
weight than T′; this is a contradiction.
Worked Exercise 7.2. Given an n-node complete network G with edge weights, give
a distributed algorithm to compute the MST of G in O(log n) rounds. (Hint: Try to
efficiently simulate the GHS algorithm.)
Solution: We implement the GHS algorithm which takes O(log n) phases. We will show
how to implement one phase in O(1) rounds, hence overall time is O(log n) rounds, as
desired.
Before the start of the first phase, we do a pre-processing step: elect a leader for the
entire graph — since we have a complete graph, this can be done in O(1) rounds using
O(n2 ) messages — each node broadcasts its ID to all other nodes and the minimum ID
learnt is the leader.
Each phase consists of two parts: (1) Finding MOE and (2) Merging fragments. Note
that each fragment is identified by all nodes in that fragment having the same fragment
ID. Finding MOE and merging can be done as follows. Each node finds its minimum
weight incident edge (which can be done by contacting all its neighbors to learn their
fragment IDs) and sends that edge to the leader (along with its fragment ID and its own
ID; note that the edge will be of the form (f, g), where f and g are fragment IDs). The
leader then aggregates the MOE edges (fragment by fragment) and finds the MOE edge
per fragment (this can be done since the leader gets to know all the minimum incident
edges from all nodes and their respective fragment IDs). Since the leader knows all the
MOEs, it can merge locally, i.e., it can find out which fragments merge with which, and
send the merged fragment IDs back to the respective nodes. The nodes rename their
fragment IDs (if they got merged). This takes O(1) rounds.
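A centralized mock-up of a single phase helps to see the bookkeeping. In the Python sketch below (the data layout and function name are assumptions made for the illustration, and the leader's work is done in one function rather than by message exchange), every node reports its lightest inter-fragment incident edge, the leader picks one MOE per fragment, merges fragment IDs, and returns the new labelling together with the MST edges found in this phase.

def one_ghs_phase(frag_id, edges):
    # frag_id: dict node -> fragment ID; edges: dict (u, v) -> weight, with u < v.
    report = {}                              # each node's lightest inter-fragment edge
    for (u, v), w in edges.items():
        if frag_id[u] != frag_id[v]:
            for x in (u, v):
                if x not in report or w < report[x][0]:
                    report[x] = (w, u, v)
    moe = {}                                 # leader: minimum reported edge per fragment
    for x, (w, u, v) in report.items():
        f = frag_id[x]
        if f not in moe or (w, u, v) < moe[f]:
            moe[f] = (w, u, v)
    new_id = dict(frag_id)                   # leader merges fragments along the MOEs
    for f, (w, u, v) in sorted(moe.items()):
        keep, drop = min(new_id[u], new_id[v]), max(new_id[u], new_id[v])
        for x in new_id:
            if new_id[x] == drop:
                new_id[x] = keep
    return new_id, set(moe.values())

frag = {0: 0, 1: 1, 2: 2, 3: 3}
edges = {(0, 1): 5, (0, 2): 3, (0, 3): 9, (1, 2): 4, (1, 3): 2, (2, 3): 7}
print(one_ghs_phase(frag, edges))            # two fragments remain; (0,2) and (1,3) are MST edges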
7.5 Exercises
Exercise 7.1. Given a weighted graph G = (V, E, w), let S be a collection of cycle-free
subsets of E (i.e., subsets of edges which don’t contain a cycle) closed under inclusion
(i.e., if A ∈ S and B ⊆ A then also B ∈ S). Note that the MST is the maximal cycle-free
subset (i.e., a spanning tree) of smallest total weight.
Show that S satisfies the following important replacement property: If A, B ∈ S and
|B| = |A| + 1, then there exists some element (edge) f ∈ B − A such that A ∪ {f } ∈ S.
In other words, there exists an edge f belonging to B (but not belonging to A) such that
f and the edges of A are cycle-free. (Hint: Use the fact that if there are ` cycle-free edges
in a graph with n nodes, then the number of connected components is exactly n − `.)
Exercise 7.2. Give an input graph G with n nodes and m = O(n) edges for which the
GHS algorithm is tight in the number of messages used, i.e., it takes Ω(n log n) messages.
Show your bound.
Exercise 7.3. Given a rooted tree, show how to find a maximal matching in O(log∗ n)
rounds.
Exercise 7.4. Show how to deterministically elect a leader at the beginning of the second
part of the GKP algorithm. Note that the time and message complexity of leader election
should not exceed the overall time and message complexity of the GKP algorithm. (Once a
leader is elected, a BFS tree can be built, which is used in the Pipeline phase.)
Exercise 7.5. Show that in the GKP algorithm, the Pipeline phase can be implemented
in O(D + √n) rounds using O(m + n^1.5) messages.
Exercise 7.6. Consider the following spanning tree verification problem: Given a graph
G = (V, E) with n nodes and m edges, and a subgraph H of G, the goal is to verify
whether H forms a spanning tree of G. The goal is to design a distributed algorithm that
runs on G (as usual, each node has only local knowledge: which of its incident edges
belong to H or not) that outputs “yes” (i.e., all nodes will be in state “yes” at the end) if
H is a spanning tree of G and “no” otherwise (i.e., all nodes will be in state “no” at the
end). Design an O(D + √n log∗ n)-round algorithm for the problem, where D is the diameter of G
(and not the diameter of H). Show that your algorithm works. (Hint: Use the GKP algorithm.)
8
MapReduce Algorithms
focus only on the problem, not on the platform. In particular, the framework shields the
programmer from all the low-level details of parallel programming such as inter-machine
communication, handling machine failures, performing the shuffling step, data distribution,
scheduling the program’s execution across a set of machines, etc., which are all addressed
by the MapReduce environment. The programmer only needs to specify the map and
reduce functions; the system-level issues are handled by the underlying implementation.
The canonical example of a MapReduce program is counting word frequencies in a text
file with a one-round MapReduce protocol.
Reduce:
1: method Reduce(word w, counts [c1, c2, . . . ])
2:     sum ← 0
3:     for all count c ∈ counts [c1, c2, . . . ] do
4:         sum ← sum + c
5:     Emit(word w, count sum)
Notice that the above algorithm consists of only one round. However, in general several
rounds are needed to solve more complicated problems, as discussed in the next section.
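For completeness, here is a small Python rendering of the word-count example. The Map half is not reproduced above because of the page break; the standard formulation, assumed here, emits (w, 1) for every word occurrence, and the shuffle groups all pairs with the same key before the Reduce runs.

from collections import defaultdict

def word_count_map(doc_id, text):
    # Map: emit (word, 1) for every word occurrence (assumed standard formulation).
    for w in text.split():
        yield (w, 1)

def word_count_reduce(word, counts):
    # Reduce: sum the counts for one word, mirroring the pseudocode above.
    return (word, sum(counts))

def run_one_round(docs):
    grouped = defaultdict(list)              # shuffle: group intermediate values by key
    for doc_id, text in docs:
        for key, value in word_count_map(doc_id, text):
            grouped[key].append(value)
    return [word_count_reduce(k, vs) for k, vs in grouped.items()]

docs = [(1, "to be or not to be"), (2, "to do is to be")]
print(run_one_round(docs))                   # ('to', 4), ('be', 3), ...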
Exercise 8.1. Give a MapReduce algorithm that computes the degree of each vertex of
a given graph G = (V, E) represented as a list of edges (u, v). How many rounds does it
comprise?
Exercise 8.2. Give a MapReduce algorithm that enumerates all the triangles of a given
graph G = (V, E) represented as a list of edges (u, v). How many rounds does it comprise?
has a memory of size O(n^(1−ε)), where n is the size of the input and ε is a constant greater
than zero. That is, the memory size of each machine is sublinear in the input size.
For the same reason as before (namely, that MapReduce is aimed for processing large
data sets), in order for the model to have practical relevance, the number of available
machines is also limited to be substantially smaller (i.e., sublinear) in the input size.
Specifically, the model assumes that O(n^(1−ε)) machines are available for the computation,
where n is the size of the input and ε is a constant greater than zero. Also, each machine
runs in time polynomial in n.
The third and last component of the model is the cost function. As it usually involves
the movement of tera- or petabytes of data across the network of machines, shuffling
turns out to be the time-consuming operation. Hence, the cost function is the number of
rounds of the computation. One can also define different complexity classes of MapReduce
computations: an algorithm A ∈ MRC^i if it runs in O(log^i n) rounds.
In particular, for an algorithm to be practical we aim to design MapReduce algorithms
that terminate in a constant number of rounds, that is, that belong to class MRC^0. The
fundamental idea behind this is that in large-scale computations no machine gets to see
the whole input during the computation.
Comment: Notice that this model is actually an execution model, which differs from
the MapReduce specification model described in the preceding section. The algorithm
does not use the number of available machines at the specification level, and thus the
degree of parallelism exposed by an algorithm is decoupled from the one of the system
where the computation will be executed.
In order to partially bridge the two models, we can view the reduce phase and
the map phase of the subsequent round as a single computation phase. Looking at a
MapReduce computation through this lens, in every round each machine performs some
computation on the set of (key; value) pairs assigned to it (reduce phase), and then
designates which machine each output value should be sent to in the next round (map
phase). The shuffle ensures that the data is moved to the right machine, after which the
next computation can begin. This way, one can reason about machines as opposed to
mappers and reducers, defining what each machine does during each round and specifying
which machine each output (key; value) pair goes to. Syntactically, we have that at each
round i, the computation performed is φ_i = m_{i+1} ◦ r_i(X), where m_{i+1} is the map function of
round i + 1, r_i is the reduce function of round i, X is a set of (key; value) pairs, and ◦ is the
operation of feeding the output of r_i(X) to m_{i+1}.
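The machine-centric view can be simulated in a few lines. The Python sketch below (all names are illustrative) models one round as "apply the reduce function to the pairs a machine holds, then route each output pair to the machine that will process it in the next round", i.e., the composition φ_i = m_{i+1} ◦ r_i described above.

from collections import defaultdict

def mapreduce_round(machines, reduce_fn, route_fn):
    # machines: dict machine_id -> list of (key, value) pairs held at that machine.
    # One round: reduce locally, then shuffle each output pair to route_fn(key).
    next_round = defaultdict(list)
    for mid, pairs in machines.items():
        for key, value in reduce_fn(pairs):
            next_round[route_fn(key)].append((key, value))
    return dict(next_round)

def sum_per_key(pairs):                      # a toy reduce function
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return totals.items()

machines = {0: [("a", 1), ("b", 2), ("a", 3)], 1: [("b", 4), ("c", 5)]}
print(mapreduce_round(machines, sum_per_key, route_fn=lambda k: ord(k) % 2))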
Theorem 8.1
For any φ > 0, Algorithm 16 returns a 2(1 + φ)-approximate solution for the densest
subgraph problem in O(log n / log(1 + φ)) MapReduce rounds.
Proof. We begin by proving the correctness of the algorithm, that is, Algorithm 16 returns
a 2(1 + φ)-approximation to the densest subgraph problem. First of all, notice that at
least one node must be removed in every iteration of the while loop, that is, there will
always be a node i in line 4 of the algorithm. In order to argue this, observe that the
average degree of nodes in S is

    (∑_{i∈S} deg_S(i)) / |S| = 2|E(S)| / |S| = 2ρ(S).

Since it is not possible that all nodes in S have degree bigger than their average, this proves
that at least one node must be removed in every pass, hence the algorithm terminates (in
at most n rounds, but soon we will see it actually needs much fewer than n rounds). Now,
consider the first time in the pass when a node i from the optimal solution S∗ is removed.
(This moment is guaranteed to exist, since as we just argued S eventually becomes empty.)
Clearly S∗ ⊆ S (before removing nodes from S). Let i ∈ A(S) ∩ S∗. If ρ(S∗) ≤ deg_{S∗}(i)
holds, we have

    ρ(S∗) ≤ deg_{S∗}(i) ≤ deg_S(i) ≤ 2(1 + φ)ρ(S),

where the last inequality holds because i ∈ A(S) is removed in this pass,
which gives the claimed quality of the output solution. It is easy to see that ρ(S∗) ≤
deg_{S∗}(i) holds: by the optimality of S∗, removing i from S∗ cannot increase the density,
i.e., ρ(S∗ \ {i}) = (|E(S∗)| − deg_{S∗}(i)) / (|S∗| − 1) ≤ ρ(S∗), which rearranges to deg_{S∗}(i) ≥ ρ(S∗).
We finally note that the algorithm returns the set S̄, which is the set with the largest
ρ encountered (lines 6 and 7 of Algorithm 16). Thus, its ρ value will be at least as
large as that of S.
We now prove the running time of the algorithm. Observe that at each round we have

    2|E(S)| = ∑_{i∈S} deg_S(i) = ∑_{i∈A(S)} deg_S(i) + ∑_{i∈S\A(S)} deg_S(i) > 2(1 + φ)ρ(S) |S \ A(S)|,

where the inequality follows by considering only the second addendum (every node not in
A(S) has degree greater than 2(1 + φ)ρ(S)). Since 2|E(S)| = 2ρ(S)|S|, this gives

    |A(S)| > (φ / (1 + φ)) |S|.

Equivalently,

    |S \ A(S)| < (1 / (1 + φ)) |S|.

Therefore, the cardinality of the remaining set S decreases to less than a 1/(1 + φ) fraction
of its size in each round. Hence, the algorithm terminates in O(log_{1+φ} n) = O(log n / log(1 + φ)) rounds.
Finally, it is easy to see that the constraint on the memory used by each machine is
never violated.
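Algorithm 16 itself is not reproduced here, but the peeling process the proof analyzes is easy to sketch sequentially: repeatedly remove every node of degree at most 2(1 + φ)ρ(S) and remember the densest S seen. The Python below is only an illustration of that process under these assumptions (it is not the MapReduce implementation).

def peel_densest(adj, phi):
    # adj: dict node -> set of neighbors. Remove all nodes of degree <= 2(1+phi)*rho(S)
    # in each pass and keep the densest subgraph S encountered.
    S = {v: set(adj[v]) for v in adj}
    def density(g):
        return sum(len(nb) for nb in g.values()) / (2 * len(g)) if g else 0.0
    best_rho, best_set = density(S), set(S)
    while S:
        rho = density(S)
        A = {v for v, nb in S.items() if len(nb) <= 2 * (1 + phi) * rho}
        for v in A:                                   # remove A(S) and its incident edges
            for u in S[v]:
                if u in S:
                    S[u].discard(v)
            del S[v]
        if S and density(S) > best_rho:
            best_rho, best_set = density(S), set(S)
    return best_rho, best_set

# a 4-clique with a short path attached: the clique (density 1.5) is the densest part
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
print(peel_densest(adj, phi=0.1))                     # (1.5, {0, 1, 2, 3})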
Exercise 8.3. Explain how the map and reduce steps for the densest subgraph problem
are implemented. Why is the memory constraint of each machine never violated?
1. Typical large graphs have c ∈ [0.08, 0.5].
Algorithm 17 MST(V, E)
1: if |E| < η then
2:     compute T∗ = MST(V, E)
3:     return T∗
4: ℓ ← Θ(|E|/η)    ▷ ℓ = number of machines to be used in the current round
5: partition E into E1, E2, . . . , Eℓ, where |Ei| = Θ(η), using a universal hash function h : E → {1, 2, . . . , ℓ}
6: in parallel, compute Ti, the minimum spanning forest of (V, Ei)
7: return MST(V, ∪i Ti)
Proof. We have to prove the correctness and provide an upper bound on the round
complexity of the algorithm. To claim correctness we have to show that the algorithm
returns a minimum spanning tree of the input graph G. Clearly, the algorithm returns a
spanning tree of G. We prove that it is a spanning tree of minimum cost by arguing that
no edge pruned before the last round is part of a spanning tree of G of cost smaller than
the cost of the spanning tree returned by algorithm M ST (V, E). This is ensured by the
cycle property of minimum spanning trees.
Cycle property of minimum spanning trees: Let T be a minimum spanning tree
of a weighted graph G. Let a be an edge of G that is not in T, and let C be the cycle
formed by a with T. Then, for every edge b of C, weight(b) ≤ weight(a). Thus, each
edge thrown out during the algorithm was part of a cycle, and had weight no smaller than
all other edges of the cycle. Therefore, any such edge cannot appear in the MST. Hence,
the edges computed at the end of the algorithm form a minimum spanning tree of G.
We now analyze the performance of the algorithm. Specifically, we will show that (1)
w.h.p. the memory constraint of each machine is never violated, and (2) the algorithm
terminates in O(1) MapReduce rounds. We first prove (1). Since the partition of the
edges among the machines is done randomly, (1) is ensured by a simple application of a
Chernoff bound:
Chernoff bound: Let X1, X2, . . . , Xj be independent random variables, with 0 ≤
Xi ≤ 1. Let X = ∑_{i=1}^{j} Xi and µ = E[X]. Then,

    Pr[|X − µ| ≥ α] ≤ 2e^{−2α²/j}.
The application of this bound for the first round is as follows (for subsequent rounds it is
similar):
• j = n^{1+c}
• α = n^{1+ε}
that is, w.h.p. any machine y receives no more than 2n^{1+ε} edges. The claim that all
the machines get no more than 2n^{1+ε} edges follows by applying the union bound to the
O(n^{c−ε}) machines.
It remains to prove (2), that is, that the algorithm terminates in O(1) MapReduce
rounds. In fact, by construction every iteration reduces the input size by a factor of
Θ(n^ε), as each machine receives as input Θ(n^{1+ε}) edges and returns at most n − 1 edges.
Therefore, after ⌈c/ε⌉ − 1 = O(1) iterations the input is small enough (Θ(n^{1+ε}) edges) to
fit into a single machine, which will return a MST of the original graph G.
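A sequential mock-up of Algorithm 17 makes the filtering idea concrete. In the Python sketch below (illustrative; kruskal_forest is an assumed helper, and the parallel step of line 6 is replaced by a loop over the parts), edges are hashed into parts of size about η, only each part's spanning forest survives, and the procedure recurses until one "machine" can finish the job.

import random

def kruskal_forest(edges):
    # Minimum spanning forest of weighted edges (w, u, v), via Kruskal with union-find.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def filtering_mst(edges, eta):
    # Mock-up of Algorithm 17: solve directly if the edge set fits on one machine;
    # otherwise partition, keep only each part's spanning forest, and recurse.
    if len(edges) < eta:
        return kruskal_forest(edges)
    num_parts = max(2, len(edges) // eta)
    parts = [[] for _ in range(num_parts)]
    for e in edges:
        parts[hash(e) % num_parts].append(e)          # stand-in for the universal hash h
    surviving = [e for part in parts for e in kruskal_forest(part)]
    return filtering_mst(surviving, eta)

random.seed(0)
n = 50
edges = [(random.random(), u, v) for u in range(n) for v in range(u + 1, n)]
print(len(filtering_mst(edges, eta=200)))             # 49 edges: a spanning tree of K_50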
Exercise 8.4. Consider the problem of finding a maximal matching of a graph. Give an
example that shows that the same strategy used for MST does not work for this problem,
i.e., show that it is impossible to build a maximal matching using only the non-filtered
edges.
Exercise 8.5. The prefix sum problem asks, given a list a1, a2, . . . , an of n values, to
return a list b1, b2, . . . , bn where b_i = ∑_{j=1}^{i} a_j for each i ∈ {1, 2, . . . , n}. Describe and
analyze an algorithm for prefix sum assuming that the input is given in pairs (i; a_i).
9
k-Machine Model and Algorithms
9.1 Introduction
Till now we have focused on studying fundamental distributed network algorithms. In
this chapter, we will take a different view and study distributed algorithms for graphs.
What are the differences between the two?
Distributed network algorithms have been studied over the last three decades, mainly
in the context of distributed communication networks (e.g., the Internet, peer-to-peer
networks, and ad hoc wireless networks), where they crucially enable fundamental network
operations such as broadcast, multicast, routing, search, etc. (We have till now focused on
distributed algorithms from this perspective.) At its core, as we have seen, distributed
network algorithms are graph algorithms, but there is a big difference in the way these
algorithms are modeled, designed, and analyzed compared to the centralized setting. As
we have assumed till now, each node (which represents a processor in a communication
network) computes in a decentralized and localized manner; nodes also can communicate
with their neighbors by exchanging messages. We have studied distributed algorithms for
important graph problems such as spanning tree, shortest paths etc., which are widely
used in modern communication networks. In distributed computation, communication
is at least as important as computation within a node; in particular, communication
between nodes is typically the costly operation, and dominates the overall cost of the
algorithm. On the other hand, the emergence of “Big Data” over the last decade has led to
many new computing platforms for distributed processing of large-scale data, exemplified
by MapReduce and Hadoop, and more recently systems such as Pregel, Giraph, GPS,
GraphLab, Spark, etc. In these platforms, the data—which is typically too large to fit
into a single machine—is distributed across a group of machines that are connected via
a communication network, and the machines jointly process the data in a distributed
fashion.
In the last chapter we studied MapReduce model which is a model for distributed
processing of large-scale data. In this chapter, we will study another model called the
k-machine model, a general model that captures distributed computation of data over a
network of machines. MapReduce (developed at Google) has become a very successful
distributed computing platform for a wide variety of large-scale computing applications
108
and also has been used for processing graphs. However, as pointed out by the developers
of Pregel (which was also developed at Google), MapReduce may sometimes be ill-suited
for implementing distributed algorithms, especially, graph algorithms. One reason is that
many graph algorithms may take a lot of (MapReduce) communication rounds which
might be too costly. Indeed the MapReduce model is stateless, i.e., the data is not really
associated with any particular machine and can be moved around (from one machine to
another) from one round to the next. This can lead to a lot of communication per round;
hence MapReduce algorithms typically work best when there are very few rounds, say,
O(1) rounds.
On the other hand, graph algorithms are better suited to a message-passing distributed
computing model and this is the main design principle behind Pregel (and other systems
that followed it such as Giraph and Spark). Hence, the algorithms and theory for the
message-passing distributed computing model can be leveraged to study distributed graph
processing systems.
In this chapter, we introduce the k-machine model as a model for distributed processing
of data and present algorithms for several problems. We give a high-level overview of the
model and the ideas behind it; more details are in Section 9.2.
The k-machine model consists of a point-to-point communication network of k machines
interconnected by bandwidth-restricted links; the machines communicate by message
passing over the links. The network is used to process some data (typically large) that is
partitioned across the machines in a balanced fashion. The data can be a (large) graph or
other types of data. For example, the data can be set of n numbers partitioned across the
machines with each machine getting n/k numbers. Or the input can be an arbitrary n-node
graph G. Vertices and edges of G are partitioned across the machines in a (approximately)
balanced manner; in particular, one can assume that the vertices (and their incident
edges) are partitioned in a random fashion, which is a common implementation in many
real-world systems. As usual, we assume that the distributed computation proceeds
in a sequence of rounds. In a round, each machine does some local computation and
can send/receive messages. We use the receive-compute-send model that we had assumed
throughout. Local computation within a machine is considered free, while communicating
messages between the machines is the costly operation. This assumption is reasonable in
the context of large-scale data — indeed, typically in practice, even assuming the links
have a bandwidth on the order of gigabytes of data per second, the amount of data that has
to be communicated can be on the order of tera- or petabytes, which generally dominates
the overall computation cost.
Let’s compare and contrast the k-machine model with the usual message-passing dis-
tributed computing model that we have studied till now, i.e., the synchronous CONGEST
model operating on a distributed network. As we know, in this model, each node in the
network represents a processor (or a computing element) and the edges represent the
communication links between the nodes. Let’s call this the vertex-centric model, since here
the vertices do the computation and send/receive messages. One can obtain algorithms in
the k-machine model by simulating vertex-centric algorithms on the k-machine model (in
fact, this is the main method1 that we will use to design k-machine algorithms in this
Chapter). In contrast, in the k-machine model the vertices of the input graph don’t really
perform any computation and the simulation of a vertex is done by the machine that
holds the vertex (although, in principle, any machine can do the simulation). On the
other hand, the k-machine model itself is simply a distributed network whose topology is
a k-clique. This clique network operates as a synchronous CONGEST model, where each
link (there are k(k − 1)/2 links) has a bandwidth constraint.
A fundamental issue that we would like to study is how the number of communication
rounds scales with the number of machines used: more precisely, if we use k machines,
does the number of rounds scale down linearly (or even super-linearly) in k?
9.2 Model
We consider a network of k > 1 (distinct) machines N = {p1 , . . . , pk } that are pairwise
interconnected by bidirectional point-to-point communication links — henceforth called
the k-machine model. Each machine executes an instance of a distributed algorithm
A. The computation advances in synchronous rounds where, in each round, machines
can exchange messages over their communication links. Each link is assumed to have a
bandwidth of B, i.e., B bits can be transmitted over the link in one round. As usual, for
convenience, we will assume throughout that bandwidth B = O(log n), where n is the size
of the input data; in any case, it is easy to rewrite our upper bounds to scale in terms of
parameter B. Note that, as before, machines have no other means of communication and
do not share any memory; all communication is via message-passing. As we have assumed
before, local computation within a machine is considered free, while communicating
messages between the machines is the costly operation. This assumption is also reasonable
in the context of large-scale data. Indeed, typically in practice, even assuming the links
have a bandwidth of order of gigabytes of data per second, the amount of data that have
been to be communicated can be in order of tera or peta bytes which generally dominates
the overall computation cost.
Although the k-machine model can be used to distributively solve any problem, say,
e.g., sorting (Exercise 9.1), in this Chapter, we are mainly interested in solving graph
problems where we are given an input graph G of n vertices (assume that each vertex has
a unique label) and m edges. To avoid trivialities, we will assume that n > k (typically, in
practice, n ≫ k). Unless otherwise stated, we will consider G to be undirected. Initially,
the entire graph G is not known by a single machine, but rather partitioned among the k
machines in a “balanced” fashion, i.e., the nodes and/or edges of G must be partitioned
approximately evenly among the machines. We will assume a vertex-partition model,
where vertices (and their incident edges) are partitioned across machines. In particular,
we will assume the random (vertex) partition, i.e., vertices (and their incident edges) of the
input graph are assigned randomly to machines.2 In general, we need only a “balanced”
partition of the input graph among the machines. It can be shown that (cf. Lemma 9.1)
a random partition gives rise to an (approximately) balanced partition.
1. Many real-world systems such as Pregel, Giraph, and Spark use vertex-centric algorithms, where
vertices are the objects which do computation and send/receive messages to their respective neighbors,
much like our standard message-passing model. However, note that one can also directly design algorithms
for the k-machine model, and this can be more efficient than simulating corresponding algorithms designed
for the vertex-centric model; we won’t focus on this aspect in this chapter.
Formally, in the random vertex partition (RVP) model, each vertex of G is assigned
independently and randomly to one of the k machines. If a vertex v is assigned to machine
pi we call pi the home machine of v. Note that when a vertex is assigned to a machine, all
its incident edges are assigned to that machine as well; i.e., the home machine will know
the labels of neighbors of that vertex as well as the identity of the home machines of the
neighboring vertices. A convenient way to implement the above assignment is via hashing:
each vertex (label) is hashed to one of the k machines. Hence, if a machine knows a vertex
label, it also knows where it is hashed to.
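The hashing convention can be made concrete with a few lines of Python (illustrative; the specific hash function and data layout are assumptions): every machine evaluates the same hash on a vertex label, so the home machine of any vertex can be computed locally.

import hashlib

def home_machine(label, k):
    # Home machine of a vertex label: hash the label and reduce modulo k.
    digest = hashlib.sha256(str(label).encode()).hexdigest()
    return int(digest, 16) % k

def partition_graph(adj, k):
    # adj: dict vertex -> set of neighbors. Each machine hosts the vertices hashed to it,
    # together with their incident edges (an edge is known at both endpoints' machines).
    hosted = {p: {} for p in range(k)}
    for v, nbrs in adj.items():
        hosted[home_machine(v, k)][v] = set(nbrs)
    return hosted

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for p, verts in partition_graph(adj, k=2).items():
    print("machine", p, "hosts", verts)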
Depending on the problem P, the vertices and/or edges of G have labels chosen from
a set of polynomial (in n) size. Eventually, each machine pi (1 ≤ i ≤ k) must output a
variable oi (which may depend on the set of vertices assigned to machine pi ) and the output
o = ⟨o1, . . . , ok⟩ must satisfy certain feasibility conditions w.r.t. problem P. For example,
when considering the minimum spanning tree (MST) problem, each oi corresponds to a
set of edges, and the edges in the union of the sets oi must form an MST of the input
graph G. Note that it is not required that a particular edge should be output by a specific
machine (although, one can design an algorithm where each machine outputs the MST
edges incident to the vertices assigned to that machine — note that this is a natural
generalization of the analogous assumption in the standard distributed message passing
model, where each vertex knows which of its incident edges belong to the MST.)
We say that algorithm A solves problem P if A, for each input instance of the problem
gives an output that is feasible for P. The round complexity of A is the maximum number
of communication rounds until termination, over all input instances. The round complexity
captures the communication cost of the algorithm, since links can transmit only
a limited amount of bits per round. Hence, to minimize communication — a key goal —
we would prefer algorithms with low round complexity.
model. The methodology is summarized in the Conversion Theorem. Thus using this
theorem one can convert distributed graph algorithms in the vertex-centric model to apply
in the k-machine model.
We note that fast distributed algorithms in the standard model may not directly imply
fast algorithms in the k-machine model. The main reason for this is that in the
standard vertex-centric model the topology is arbitrary and not a complete network (i.e.,
a clique), whereas the k-machine model is a clique. To achieve faster algorithms, we
consider vertex-centric algorithms in an intermediate clique model and then show two
ways — parts (a) and (b) respectively of the Conversion Theorem — to efficiently convert
algorithms in the clique model to the k-machine model. Part (b) applies to converting
distributed algorithms (in the clique model) that only use broadcast (note that in a
broadcast algorithm, in a round, every node sends the same message to a subset (or all)
of its neighbors; the rest of the neighbors don’t get any message), while part (a) applies to
any algorithm. Part (a) will sometimes give better time bounds compared to part (b) and
vice versa — this depends on the problem at hand and the type of distributed algorithm
considered, as well as on the graph parameters. (The latter can be especially useful in
applications where we might have some information on the graph parameters/topology
as explained below.) Using this theorem, one can design algorithms for various graph
problems, e.g., PageRank, minimum spanning tree (MST), connectivity, spanning tree
(ST) verification, shortest paths, triangle enumeration, etc. Problems such as PageRank,
MST, connectivity, graph covering, etc. can be solved in Õ(n/k) time (note that the Õ
notation hides a polylog(n) multiplicative and additive factor); this shows that one can
achieve almost linear (in k) scaling. Graph connectivity, BFS tree construction, and
ST verification can be solved within an Õ(min(n/k, m/k² + D∆/k)) bound — note that the
second part of the above bound may be better in some cases; e.g., if the graph is sparse
(i.e., m = O(n)) and D and ∆ are small (e.g., bounded by O(log n)), then we get a
bound of Õ(n/k²).
3. The number of edges mapped to any link of the network is Õ(m/k² + n/k).
Proof. (1) This follows easily from a direct Chernoff bound application. Since each vertex
of G is mapped independently and uniformly to the set of k machines, the expected number
of vertices mapped to a machine (by linearity of expectation) is n/k. The concentration
follows from a Chernoff bound (see Appendix).
(2) We use Bernstein’s inequality (see Appendix). Fix a machine p. Let the random variable
X_i^p be defined as follows: X_i^p = d(v_i) (d(v_i) is the degree of v_i) if vertex v_i is assigned
to machine p, otherwise X_i^p = 0. Let X^p = ∑_{i=1}^{n} X_i^p denote the total degree of the
vertices assigned to machine p; in other words, it is the total number of edges that
have at least one endvertex assigned to machine p. We have E[X_i^p] = d(v_i)/k and
E[X^p] = ∑_{i=1}^{n} E[X_i^p] = ∑_{i=1}^{n} d(v_i)/k = 2m/k. Furthermore, Var(X_i^p) = E[(X_i^p)²] −
E[X_i^p]² = (1/k)(d(v_i))² − (d(v_i)/k)² = ((d(v_i))²/k)(1 − 1/k), and hence Var(X^p) = ∑_{i=1}^{n} Var(X_i^p) =
(1/k)(1 − 1/k) ∑_{i=1}^{n} (d(v_i))² ≤ (1/k)(1 − 1/k) ∑_{i=1}^{n} ∆ d(v_i) = (1/k)(1 − 1/k)∆m.
Using Bernstein’s inequality, we have (for some t > 0):
t2
P(X p > E[X p ] + t) 6 e− 2V ar(X p )+(2/3)bt
where b = max16i6n |Xip − E[Xip ]|. Now, |Xip − E[Xip ]| 6 d(vi )(1 − 1/k) 6 ∆(1 − 1/k) = b.
Let γ > 0 and let A be the event that X p > 2m/k + γ(2m/k + ∆). Hence, for any
γ > 0 and letting t = γ(2m/k + ∆), we have:
t2 t2
P(A) 6 e− 2(1/k)(1−1/k)∆m+(2/3)∆(1−1/k)t 6 e− Θ(m∆/k+∆t)
γ 2 (2m/k+∆)2
− 2 Θ( m + ∆k +1)
6e Θ(m∆/k+∆2 ) 6 e−γ ∆k m = O(1/n3α ),
√
if γ = Θ(α log n).
The above tail bound applies to a single machine p; applying a union bound over all
the k machines, we have X p = Õ(m/k + ∆) whp for every machine p ∈ N .
(3) To prove the third bound, we utilize the following concentration bound:
Proposition 9.1 ([5, 6]). Let G = (V, E) be a graph with m < ηn^2, and let R be a random subset of V of size |R| = t such that t > 1/3η. Let e(G[R]) denote the number of edges in
The Clique Model Consider a complete n-node network C and a spanning subgraph
G of C determined by a set of (possibly weighted) edges E(G). The nodes of C execute a
distributed algorithm and each node u is aware of the edges that are incident to u in G.
Each node can send a message of at most B = O(log n) bits over each incident link per
round. For a graph problem P , we are interested in distributed algorithms that run on
the network C and, given input graph G, compute a feasible solution of P . In addition
to round complexity (the number of rounds in the worst case), we are interested in the
message complexity of an algorithm in this model which is the number of messages (in the
worst case) sent over all links. Additionally, we are also interested in communication degree
complexity, which is the maximum number of messages sent or received by any node in any round; i.e., it is the minimum integer M′ such that every node sends a message to at most M′ other nodes in each round. Note that we can simulate any vertex-centric algorithm
running on a network G of an arbitrary topology that uses messages of O(log n) size in the
clique model by simply restricting the communication to edges in E(G) ⊂ E(C). In this
case, the time and message complexities remain the same while the communication degree
complexity can be bounded by the maximum degree of G. We say that an algorithm is a
broadcast algorithm if, in every round and for every node u, it holds that u broadcasts the
same message to a subset (or all) of its neighbors (it does not send any message at all to
the rest of the neighbors). We define the broadcast complexity of an algorithm as the number
of times nodes broadcast messages.
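The following Python sketch (ours; the per-round message representation is an assumption, not part of the model's definition) computes, for a single round of a vertex-centric algorithm restricted to the edges of E(G), the quantities just defined: the number of messages, the communication degree, whether the round qualifies as a broadcast round, and how many broadcasts it contributes.

def round_complexities(outgoing):
    # outgoing: dict node -> dict {neighbor: message} describing one round;
    # every (node, neighbor) pair is assumed to be an edge of E(G).
    messages = sum(len(msgs) for msgs in outgoing.values())
    received = {}
    for u, msgs in outgoing.items():
        for v in msgs:
            received[v] = received.get(v, 0) + 1
    sent_counts = [len(msgs) for msgs in outgoing.values()]
    comm_degree = max(sent_counts + list(received.values()) + [0])
    # a broadcast round: every node sends the same message to all of its chosen neighbors
    is_broadcast = all(len(set(msgs.values())) <= 1 for msgs in outgoing.values())
    broadcasts = sum(1 for msgs in outgoing.values() if msgs)
    return messages, comm_degree, is_broadcast, broadcasts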
Proof idea We present the main ideas of the proof of Theorem 9.2. To obtain algorithm
A for the k-machine model, each machine locally simulates the execution of AC at each
hosted vertex. If algorithm AC requires a message to be sent from a node u1 ∈ C hosted
at machine p1 to some node u2 ∈ C hosted at p2 , then p1 sends this message directly to
p2 via the links of the network N . We will now bound the necessary number of rounds for
simulating one round of algorithm AC in the k-machine model: We observe that we can
bound the number of messages sent in a round of AC through each machine link using
Lemma 9.1(2). Let Gi be the graph that captures the communication happening in round
i of AC , i.e., there exists an edge (u, v) ∈ E(Gi ) if u and v communicated in round i. By
Lemma 9.1(2), each communication link of N is mapped to at most Õ(|E(Gi)|/k^2 + ∆i/k) edges of Gi (whp), where ∆i is the maximum degree of Gi. Summing up over all TC(n)
Proof. Consider any n-node input graph G with m edges and suppose that nodes in G
are assigned to the k machines of the network N according to the vertex partitioning
process (cf. Section 9.2).
We now describe how to obtain algorithm A for the k-machine model from the clique
model algorithm AC : Each machine locally simulates the execution of AC at each hosted
vertex. First of all, we only need to consider inter-machine communication, since local
computation at each machine happens instantaneously at zero cost. If algorithm AC
requires a message to be sent from a node u1 ∈ C hosted at machine p1 to some node
u2 ∈ C hosted at p2 , then p1 sends this message directly to p2 via the links of the network
N . (Recall that a machine p1 knows the hosting machines of all endpoints of all edges (in
G) that are incident to a node hosted at p1 .) Moreover, p1 adds a header containing the
IDs of u1 and u2 to ensure that p2 can correctly deliver the message to the simulation of
AC at u2 .
Proof of (a): We will bound the number of messages sent in each round through each link
using Lemma 9.1(2). Let Gi be the graph whose node set is the same as the input graph
(as well as the clique model), and there is an edge between nodes u and v if and only if a
message is sent between u and v in round i of the algorithm; in other words, Gi captures
the communications happening in round i. From Lemma 9.1(2), we know that (w.h.p.)
each communication link of N is mapped to at most Õ(|E(Gi)|/k^2 + ∆i/k) edges of Gi, where ∆i is the maximum degree of Gi. This means that each machine needs to send at most Õ(|E(Gi)|/k^2 + ∆i/k) messages over a specific communication link with high probability. In other words, the ith round of AC can be simulated in Õ(|E(Gi)|/k^2 + ∆i/k) rounds, and, by taking a union bound, the same is true for all rounds in [1, TC(n)]. By summing up over all rounds of AC, we can conclude that the number of rounds needed to simulate AC is
Õ( ∑_{i=1}^{TC(n)} ( |E(Gi)|/k^2 + ⌈∆i/k⌉ ) ) = Õ( M/k^2 + TC(n)·⌈∆′/k⌉ ),
where the equality holds because of the following facts: (1) ∑_{i=1}^{TC(n)} |E(Gi)| = O(M), since |E(Gi)| is at most two times the number of messages sent by all nodes in the ith round, and (2) ∆i ≤ ∆′. This proves (a).
Proof of (b): We first slightly modify the previous simulation to simulate broadcast
algorithms: Note that if AC is a broadcast algorithm, then for the ith round (i > 1) of
algorithm AC , if a node u belonging to machine p1 sends messages to nodes v1 , . . . , vj
We can use an MST algorithm for verifying graph connectivity, which in turn can be used for ST verification. We assign weight 1 to all edges of the input graph G and then add an edge with infinite weight between any pair of nodes u, v where (u, v) ∉ E(G), yielding a modified graph G′. Clearly, G is disconnected iff an MST of G′ contains an edge with infinite weight. This yields the same bound that we have for MST, namely Õ(n/k).
We now describe how to verify whether an edge set S is an ST, by employing a given
algorithm A for Conn. Note that, for ST verification, each machine p initially knows the
assumed status of the edges incident to its nodes wrt. being part of the ST, and eventually
p has to output either yes or no. First, we run A on the graph induced by S and then we
compute the size of S as follows: Each machine p locally adds 1 to its count for each edge (u, v) ∈ S if p is the home machine of both u and v; otherwise, if one of u or v resides on a different machine, then p adds 1/2. Then, all machines exchange their counts via
broadcast, which takes 1 round (since each count is at most n and W ∈ Θ(log n)) and
determine the final count by summing up over all received counts including their own.
Each machine outputs yes iff (1) the output of the Conn algorithm A returned yes and
(2) the final count is n − 1. Thus we get the same bounds for ST verification as for graph
connectivity.
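As a small illustration of the counting step above (a sketch; the data layout and names are ours), each machine computes its fractional count locally from the edges of S incident to its hosted vertices, and after the broadcast the machines only need to add the counts and compare with n − 1.

def local_st_count(p, edges_of_S, home):
    # edges_of_S: edges of the candidate ST with at least one endpoint hosted at machine p
    count = 0.0
    for (u, v) in edges_of_S:
        if home[u] == p and home[v] == p:
            count += 1.0     # an edge fully inside machine p is counted once
        else:
            count += 0.5     # a split edge contributes 1/2 at each of its two machines
    return count

def verify_st(per_machine_counts, n, conn_says_connected):
    # S is a spanning tree iff the graph induced by S is connected and |S| = n - 1
    return conn_says_connected and abs(sum(per_machine_counts) - (n - 1)) < 1e-9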
Recalling that we can compute a BFS in Õ(m/k^2 + D⌈∆/k⌉) rounds, it is straightforward to see that the same bound holds for Conn (and thus also ST verification): First, we run a leader election algorithm among the k machines. This can be done in O(1) rounds and using Õ(√k) messages using Algorithm 7. The designated leader machine
then chooses an arbitrary node s as the source node and executes a BFS algorithm. Once
this algorithm has terminated, each machine locally computes the number of its vertices
that are part of the BFS and then computes the total number of vertices in the BFS by
exchanging its count (similarly to the ST verification above). The input graph is connected
iff the BFS contains all vertices.
9.4.3 PageRank
The PageRank problem is to compute the PageRank distribution of a given graph (which may be directed or undirected). A distributed PageRank algorithm based on the distributed
random walk algorithm is as follows: Initially, each node generates Θ(log n) random walk
tokens. A node forwards each token with probability 1 − δ and terminates the token
with probability δ (called the reset probability). Clearly, every token will take at most
O(log n/δ) steps with high probability before being terminated. We can show that these
steps can be implemented in O(log^2 n) rounds in the clique model. Since this requires O(n log^2 n / δ) messages to be sent in total, Theorem 9.2(a) yields that, for any δ > 0, there is a randomized algorithm for computing PageRank in the k-machine model such that C^k_{1/n}(PageRank) ∈ Õ(n/(δk)).
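A minimal sequential sketch (ours) of the random-walk-based PageRank estimate described above: every node starts a few tokens, each token is forwarded to a random neighbor with probability 1 − δ and terminated with probability δ, and a node's PageRank is estimated as its share of all token visits. The number of tokens per node and the normalization are our choices for illustration.

import math
import random
from collections import defaultdict

def pagerank_by_walks(adj, delta=0.15, walks_per_node=None):
    # adj: dict node -> list of (out-)neighbors
    n = len(adj)
    if walks_per_node is None:
        walks_per_node = 3 * max(1, int(math.log(n + 1)))   # Theta(log n) tokens per node
    visits = defaultdict(int)
    for u in adj:
        for _ in range(walks_per_node):
            v = u
            visits[v] += 1
            # forward the token with probability 1 - delta, terminate with probability delta
            while random.random() > delta and adj[v]:
                v = random.choice(adj[v])
                visits[v] += 1
    total = sum(visits.values())
    # a node's estimated PageRank is proportional to the number of visits it receives
    return {v: visits[v] / total for v in adj}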
9.5 Exercises
Exercise 9.1. You are given n numbers and the goal is to sort them using a distributed
algorithm in the k-machine model. The n numbers are partitioned across the k machines
(n > k) in the following way: each machine gets an (arbitrary) subset of n/k numbers
(assume k divides n). Give an Õ(n/k 2 )-round (note that Õ notation can hide polylog(n)
terms) distributed algorithm in the k-machine model for sorting the n numbers; at the
end, for every number, some machine should know the sorted index of that number (e.g.,
it may be the machine that initially contains that number, but can be any other machine
as well). Note that an O(n/k)-round algorithm is trivial: simply send all the n numbers to
one machine and that machine will sort all the numbers locally.
Hint: One can use randomized quicksort. Recall that in randomized quicksort, you
choose a random element in the set as pivot, partition the elements into two parts based
on the pivot (one part containing all elements smaller than the pivot and the other larger),
and then recursively sort the two parts.
Exercise 9.2. Give an efficient algorithm to solve the single-source shortest path problem
in the k-machine model. You can assume that you are given an undirected weighted graph
G with a source node s and the goal is to find the shortest path distance from every node
to s (just finding the distance is enough, the path itself is not needed). Show correctness
of your algorithm and analyze the round complexity in the k-machine model.
10 Basics of Asynchronous Systems
So far, we have assumed that every message sent in a round is delivered by the same
round and that all nodes compute in lock-step synchrony. While the synchronous model
is a convenient abstraction that lets us study distributed algorithms in a round by round
fashion, many real networks such as the internet are inherently asynchronous: instead of
being delivered instantaneously, a message might be delayed for some arbitrary amount of
time and nodes might perform their computing steps at vastly different speeds.1
In an asynchronous system nodes take steps according to some given schedule that is a
priori unknown to the algorithm. If such a step s is triggered by the receipt of a message,
we say that s is a receive step. On the other hand, we say that a step is a send step if
node u sends out a set of messages. For convenience, we assume that a node can perform
local computation in both, send and receive steps. From the above description, it is clear
that each message is sent in exactly one send step and received in exactly one receive step.
Figure 10.1 shows a graphical representation of an asynchronous algorithm. The space-time diagram simply lists the steps that each node takes over the passage of time (from left to
right) where send steps and receive steps of messages are connected by a directed edge.
Figure 10.1: A Space-Time Diagram (left) and the corresponding Space-Time Graph
(right).
¹ To fully capture the harsh reality of distributed networks in the real world, we would also need to
account for messages being lost and the failure of nodes. We defer the treatment of fault-tolerance to a
later chapter.
(Figure 10.2: A run of an asynchronous algorithm where each step is labeled with its logical clock value.)
As we have seen above, logical time provides nodes a simple way to measure each other's progress. Logical clocks, however, have the shortcoming that a step s that has clock value i does not necessarily “happen before” a step s′ with a larger clock value j if s and s′ occur at different nodes. For example, in Figure 10.2, the first step at node u has a logical clock value of 2 and the first step at node w has a clock value of 1, but neither step precedes the other according to our notion of “happens before”. The problem lies in the 1-dimensionality of logical time: we are trying to use a single integer value to represent the progress of up to n independent nodes. That is, we are losing some information by mapping a partial order (the happens before relation) into a linear order (logical time).
Proof. The lemma is trivially true if u = v, thus we focus on the case where u and v are distinct. How does node u update its entry Vu(s)[v]? Recall that each node advances its own entry (i.e., Vu[u]) in each step and updates entries associated with other nodes by using the maximum rule upon receiving a message. Hence there must be a step su where node u receives a message with a vector clock value of Vw1[v] = λ, sent by some node w1 in step sw1. We consider two cases:
1. v ≠ w: Here we can apply the same argument to w1 that we have applied to u above. Namely, there must be an earliest step s′w1 where node w1 has received a message with an attached vector clock value of Vw2[v] = λ, which was sent by some node w2 in some step sw2, and also sw2 →∗ s′w1 →∗ sw1 →∗ su →∗ s. If w2 = v we proceed to case 2. Otherwise we repeat the same argument for w2 and construct the causal chain sw3 →∗ s′w2 →∗ sw2 →∗ s′w1 →∗ sw1 →∗ su →∗ s and so forth, yielding a sequence of steps at nodes w1, w2, . . .. At each node wi (i ≥ 1), we focus on the earliest event where wi has set its vector clock entry of v to λ by using the maximum rule. This ensures that, after repeating this argument at most n − 1 times, we reach case 2.
2. v = w: Note that v must have increased Vv[v] exactly λ times by taking λ steps. By the definition of the happens before relation, each one of these steps happens before step sw, and since sw →∗ s we are done.
Figure 10.3: A run of an asynchronous algorithm where each step is labeled with a vector
clock reading.
We will now show a one-to-one correspondence between vector clocks and the happens before relation. Recall that, when comparing two vectors V1 and V2, it holds that V1 < V2 if and only if (a) V1[i] ≤ V2[i] for every index i, and (b) there exists an index j such that V1[j] < V2[j].
Let V(s1) and V(s2) be the vector clock readings at step s1 at node u and at step s2 at node v, respectively. It is easy to see that V(s1) ≠ V(s2) if s1 ≠ s2. The following theorem motivates us to say that steps s1 and s2 are concurrent if and only if neither V(s1) < V(s2) nor V(s1) > V(s2) holds, i.e., if the two vector clock values are incomparable.
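A small Python sketch (ours) of the vector clock mechanics just described: each node advances its own entry in every step, applies the maximum rule on receipt, and two readings are compared component-wise exactly as stated above.

def vc_less(V1, V2):
    # V1 < V2 iff V1[i] <= V2[i] for all i and V1[j] < V2[j] for some j
    return all(a <= b for a, b in zip(V1, V2)) and any(a < b for a, b in zip(V1, V2))

def concurrent(V1, V2):
    # two steps are concurrent iff their vector clock readings are incomparable
    return not vc_less(V1, V2) and not vc_less(V2, V1)

class Node:
    def __init__(self, node_id, n):
        self.id = node_id
        self.V = [0] * n              # vector clock, one entry per node
    def send_step(self):
        self.V[self.id] += 1          # advance own entry in every step
        return list(self.V)           # clock value attached to the outgoing message
    def receive_step(self, attached):
        self.V[self.id] += 1
        # maximum rule: take the component-wise max with the attached clock
        self.V = [max(a, b) for a, b in zip(self.V, attached)]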
Theorem 10.3
Let V(sα) be the vector clock reading of step sα at node u and let V(sβ) be the vector clock reading of step sβ at node v, where sα ≠ sβ. Then sα happens before sβ if and only if V(sα) < V(sβ).
Proof. First, suppose that sα happens before sβ . We will show that V (sα ) < V (sβ ).
According to the definition of “happens before”, this means that there is a causal chain
10. BASICS OF ASYNCHRONOUS SYSTEMS 124
sα → · · · → sβ , starting at sα and ending at sβ such that, for any pair of subsequent steps
sk , sk+1 in this chain, exactly one of the following 2 statements is true:
(a) sk and sk+1 happen at the same node w;
(b) sk is the send step of a message M and sk+1 is the receive step of M .
In either case, the update rule of vector clocks ensures that V (sk ) < V (sk+1 ) and since <
is transitive, we have V (sα ) < V (sβ ).
Now, consider the case where V (sα ) < V (sβ ) and, for the sake of a contradiction,
assume that sα does not happen before sβ . Since the first part of our proof readily yields
a contradiction to sβ happening before sα , we only need to handle the case where sα
and sβ are concurrent steps. Applying Lemma 10.2 to step sβ tells us that the first
Vv(sβ)[u] steps of node u happen before step sβ at node v. By assumption, it holds that Vu(sα)[u] ≤ Vv(sβ)[u], and hence sα happens before sβ, contradicting the assumption that sα and sβ are concurrent. This completes the proof.
s7 . But according to the cut E, step s9 lies in the past whereas step s7 is in the future,
contradicting the fact that messages cannot travel backwards in time. In other words,
the cuts C and D are both “consistent” with our intuitive notion of snapshot whereas
E is inconsistent. Fortunately, the happens-before relation provides us with an intuitive
criterion to identify whether a cut is consistent. To this end, we define the causal past of
a step si to be the set of all steps sj such that sj happens before si . We now formally
define a cut C = (s1 , . . . , sn ) to be consistent, if and only if, for each si , the causal past of
si is in the past of C.
Observation 10.1. In the space-time diagram, a cut is consistent, if and only if, there
is a rubber band transformation of the time-axis that does not violate the happens before
relation.
Note that Observation 10.1 excludes messages that travel backwards in time such as
in cut E of Figure 10.4.
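The consistency criterion can also be checked mechanically. In the sketch below (our own representation), a cut is given by the index of the last step of each node included in its past, and the cut is consistent exactly when no message received in the past was sent in the future, which is the situation ruled out by Observation 10.1.

def is_consistent_cut(cut, messages):
    # cut: dict node -> index of that node's last step included in the cut (0 = none)
    # messages: list of (sender, send_step_index, receiver, receive_step_index)
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_past = recv_idx <= cut.get(receiver, 0)
        sent_in_past = send_idx <= cut.get(sender, 0)
        if received_in_past and not sent_in_past:
            return False      # such a message would travel backwards in time
    return True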
Figure 10.4: Two different space-time visualizations of the same asynchronous computation.
Note that the consistent cuts C and D represent equivalent global snapshots, whereas E
is an inconsistent cut.
Appendices
Appendix A
Asymptotic Notation
Algorithmic analysis would not be as elegant as it is were it not for the fact that we emphasize asymptotic and approximate analysis, rather than exact analysis for a given input size. In other words, we mainly care about how the algorithm's performance scales with the problem size, in a manner that is approximate up to a constant factor (i.e., a constant that is independent of the problem size). This allows the analyst to focus on quantifying the dominant function that determines the algorithm's performance, while avoiding cumbersome details.
Consider two positive functions g(n) and f (n) on natural numbers. Think of n as the
input size of the algorithm; and f (and g) as denoting the runtime (of an algorithm) as a
function of the input size.
(Figure: (a) f(n) ≤ g(n) for all n ≥ n0; the behavior for n < n0 is irrelevant. (b) f(n) > g(n), but f(n) ≤ 1.5g(n) for all n.)
Informally, we will say that if g(n) = O(f(n)), then f(n) (possibly multiplied by a large-enough constant factor) “dominates” or “upper bounds” g(n). Formally, we have the following definition.
Definition A.1 I Big-O
Given two (positive) functions f(n) and g(n), we say g(n) = O(f(n)) if there are positive constants c and n0 (independent of n) such that, for all n ≥ n0, g(n) ≤ c·f(n).
The above definition is usually cumbersome to use for showing how two functions are
related. An alternate, but easier way, is to define Big-O using limits as follows:
Definition A.2
Given two (positive) functions f(n) and g(n), we say g(n) = O(f(n)) if
lim_{n→∞} g(n)/f(n) ≤ c,
where c > 0 is a fixed constant (assuming the limit exists). In other words, the limit is bounded above by a fixed constant.
Definition A.4
A (positive) function g(n) is “Big-Ω” of (another positive) f(n) if
lim_{n→∞} g(n)/f(n) ≥ c,
where c > 0 is a fixed positive constant (assuming the limit exists). In other words, the limit is bounded below by a fixed constant.
The above definition assumes that the limit exists (otherwise we can modify the
definition to take the lim inf).
A.3 Big-Θ Notation
Big-Theta notation is used to compare functions that are both Big-O and Big-Ω of each other, i.e., that have the same asymptotic behavior. More precisely, two
positive functions g(n) and f (n) are said to be “Big-Theta” of each other if their growth
is asymptotically within a constant factor of each other. We can formally define it as
follows.
Definition A.5
A (positive) function g(n) is “Big-Theta” of (another positive) f(n) if there are some fixed positive constants c1 and c2 such that c1·f(n) ≤ g(n) ≤ c2·f(n) for all n ≥ n0. Equivalently, g(n) = Θ(f(n)) if g(n) = O(f(n)) AND g(n) = Ω(f(n)).
Informally, we will say that if g(n) = Θ(f (n)), then both f (n) and g(n) are of the
“same order”, i.e., they are essentially the same function (as far as asymptotic growth is
concerned).
Alternatively, using limits we have the definition:
Definition A.6
A (positive) function g(n) is “Big-Theta” of (another positive) f(n) if
lim_{n→∞} g(n)/f(n) = c
for some fixed constant c > 0 (assuming the limit exists).
Definition A.7
Given two functions, g(n) and f (n), we say that g(n) = o(f (n)) if for any positive
constant c there is a constant nc such that for any n > nc , g(n) < cf (n).
The above definition says that if g(n) is o(f (n)), i.e., g(n) grows strictly slower than
f (n) or f (n) grows strictly faster than g(n), then no matter what constant c you choose
(however small it may be), when n becomes large enough (i.e., n > nc , where nc is a
constant that depends on c), g(n) is dominated by cf (n). In other words, cf (n) always
starts dominating g(n), however small c is, after a certain point (when n becomes large).
Alternatively, using limits we have the definition:
Definition A.8
A (positive) function g(n) is “Little-o” of (another positive) f(n) if
lim_{n→∞} g(n)/f(n) = 0.
Similarly, we would like asymptotic notation to capture a function that grows strictly
faster than some other function. This is the Little-ω notation which can be considered as
a complement of the Little-o notation. That is if f (n) = o(g(n)), then g(n) = ω(f (n)).
Definition A.9
Given two functions, g(n) and f (n), we say that g(n) = ω(f (n)) if for any positive
constant c there is a constant nc such that for any n > nc , g(n) > cf (n).
Alternatively, using limits:
lim_{n→∞} g(n)/f(n) = ∞.
A.5 Examples
We next illustrate the above notations with some examples.
Example A.1. 1. 2n = O(n) = Θ(n). Constants are ignored! Big-O, Big-Omega, and
Big-Theta notation “hides” constant factors.
7. 1/n = o(1)
8. n^1000 = o(1.1^n)
9. n^2 = ω(n)
10. How do n and n^{1+sin n} compare? We have
limsup_{n→∞} n/n^{1+sin n} = ∞ and liminf_{n→∞} n/n^{1+sin n} = 0.
Hence the two functions are not asymptotically comparable via little-o or little-ω
notation.
Appendix B
Mathematical Induction
is not difficult to see if we iterate through the numbers. The base step shows that the
statement is true for n = 0. Applying the induction step for k = 0 (since we show it for
any number k), it holds for k = 1. Since it holds for k = 1, we can again use the induction
step to show the statement for k = 2. Again, since it holds for k = 2, we can again use
the induction step to show the statement for k = 3 and so on. Repeating this argument,
it is clear that the statement holds for every natural number. This is the essence of a
proof by mathematical induction.
The power of mathematical induction — which naturally underlies recursive algorithms
(and iterative algorithms as well) — comes from the induction hypothesis which simply
assumes that the statement is true for k (somehow by “magic” !). This gives us a lot of
“ammunition” to prove that the statement is true for k + 1.
0 + 1 + 2 + ··· + k = k(k + 1)/2.
Using the above, we have to prove that the following statement holds:
0 + 1 + 2 + ··· + k + (k + 1) = (k + 1)(k + 2)/2.   (B.1)
How can we show that? We use the induction hypothesis that already tells us the value
of the sum up to k. Thus the left hand side of Equation B.1 becomes
k(k + 1)/2 + k + 1 = (k^2 + 3k + 2)/2 = (k + 1)(k + 2)/2,
which is exactly the right hand side of Equation B.1, completing the induction step.
Algorithm via Induction Proof The above induction proof also suggests a natural
algorithm that gives a valid coloring of the regions formed by n lines. The algorithm is
naturally incremental. Start with one line and use two colors to color the two regions
on either side — this is the base case. Then add one line at a time. After adding a line,
simply reverse the colors of the regions on one side of the added line, leaving the other
side unchanged — this is the strategy used in the induction step. Clearly this algorithm
gives a valid coloring, and the proof of its correctness is indeed the induction proof!
This example illustrates how algorithms can be designed by using induction proofs; as
an added bonus the proof of correctness of the algorithm follows from the induction proof.
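One compact way to implement the coloring that this incremental argument produces (a sketch under our own conventions for representing lines): since each added line flips the colors of exactly the regions on one of its sides, the final color of a region is the parity of the number of lines on whose positive side it lies.

def region_color(point, lines):
    # lines: list of (a, b, c) representing the line a*x + b*y + c = 0;
    # point: (x, y), assumed not to lie on any of the lines.
    # The color is the parity of the number of lines whose positive side contains the
    # point, which is exactly what the incremental color-flipping strategy produces.
    x, y = point
    above = sum(1 for (a, b, c) in lines if a * x + b * y + c > 0)
    return above % 2   # two regions sharing an edge differ in exactly one line, so colors differ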
Weak Induction The above form of mathematical induction is sometimes called the
“weak form of induction” as the induction hypothesis assumes only the validity of the
statement for value k (and not for values less than k — as done in the “strong form”
as explained in the next section). This is already useful in proving correctness of many
algorithms, especially iterative algorithms, which operate incrementally on the input
size. In particular, it is useful in proving a “loop invariant” of a for or while loop. A loop invariant is a property that holds at the start of every iteration of the loop. Correctness proofs of many algorithms with for and while loops involve establishing suitable loop invariants. Such examples are shown throughout the book.
a product of two numbers a and b, i.e., k + 1 = ab, where 1 < a < k + 1 and 1 < b < k + 1
(in fact they are both strictly smaller than k). Since a and b are strictly smaller than k + 1
and strictly greater than 1, the induction hypothesis applies to both of them, and hence each
of them can be written as a product of prime numbers. Hence k + 1 which is a product of
a and b can be written as a product of prime numbers.
Note that the strong form was needed for the above proof, since a and b are numbers
that are smaller than k.
0 + 1 + 2 + ··· + n ≤ cn^a,
where c and a are fixed constants. We want to find the smallest a for which the above statement is true (c can be any fixed constant).
Base step: n = 0. The statement is clearly true for any fixed a and c.
Induction step: Assume the statement is true for k. We will prove that it is true for
k + 1 for a suitable a.
By the induction hypothesis, ∑_{i=0}^{k} i ≤ ck^a, and thus
∑_{i=0}^{k+1} i = ∑_{i=0}^{k} i + (k + 1) ≤ ck^a + k + 1.
Hence, to complete the induction step, it suffices to have
ck^a + k + 1 ≤ c(k + 1)^a.   (B.2)
Inequality (B.2) will not be satisfied for a = 1, since that would imply k + 1 ≤ c, which is clearly not true (the left hand side increases with k, while the right hand side is a constant). Let's try a = 2. Plugging a = 2 into inequality (B.2) and simplifying yields (k + 1)/(2k + 1) ≤ c, which is true for all k if c = 1. Hence if we take c = 1 and a = 2, the induction step is satisfied. Thus we have shown, in fact derived, that ∑_{i=0}^{n} i ≤ n^2.
B.5 Exercises
Exercise B.1. Prove that the sum of the squares of the first n natural numbers is n(n + 1)(2n + 1)/6.
Exercise B.2. Using induction, give a tight bound on the following sum: ∑_{i=1}^{n} i^d, where d is a fixed constant.
Exercise B.6. Consider the set of first 2n positive integers, i.e., A = {1, 2, . . . , 2n}. Take
any subset S of n + 1 distinct numbers from set A. Show that there are two numbers in S
such that one divides the other.
Appendix C
Graph-theoretic Concepts
This appendix reviews basic concepts in graph theory. For more details on these concepts
see [4].
C.1 Definition
We formally define a graph as follows.
Definition C.1
An undirected graph G is a pair (V, E) where
In a directed graph the edges have directions, i.e., ordered pairs — (u, v) means
that u points to v and (v, u) means that v points to u. A weighted graph includes
a weight function w : E → Q attaching a (rational) number (weight) to each edge.
Figure C.1: A connected undirected graph and the same graph with weights
C.3 Distance
Given a graph G = (V, E), two vertices u and v belonging to V, and a path P between u and v, the length of P is the number of edges on P. The distance between u and v is the length of a shortest path between u and v, i.e., a path of minimum length among all paths between the two vertices.
Proof. Consider an edge (u, v) ∈ E. This contributes one to deg(u) and one to deg(v).
C.6 Connectivity
Connectivity is a basic concept in graph theory.
Definition C.4
An undirected graph is connected if there is a path between every pair of vertices.
A directed graph is connected if there is a directed path from u to v or a directed
path from v to u, for every pair of vertices u, v.
A directed graph is strongly connected if there is a directed path from u to v and a
directed path from v to u, for every pair of vertices u, v.
A directed graph is weakly connected if the undirected graph obtained by ignoring
the directions of the edges is connected.
If the graph is not connected, then we can define a spanning tree on each connected
component. A (disjoint) collection of spanning trees is called a spanning forest.
Proof. The proof is by induction on |V |. Base case (|V | = 1) is trivial. Assume that the
lemma is true for all graphs of size < |V |. Consider a connected graph G of size |V | and
a spanning tree T of G. Remove an edge (u, v) of T . This breaks T into two components
— T1 and T2 . T1 (T2 ) is a spanning tree of the subgraph induced by the vertices of T1 (T2 ).
The number of edges in T is |T1 | − 1 + |T2 | − 1 + 1 = |T | − 1 = |V | − 1.
the most recently visited vertex, BFS explores nodes in increasing distances from a given
source node.
Figure C.5: An undirected graph and its DFS tree rooted at vertex a where the DFS is
started. Thick lines show the tree edges and the dotted lines show the back edges. The
num values of the vertices — which show the order in which the vertices are visited —
are also shown.
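For concreteness, here are minimal Python sketches (ours) of the two traversals: BFS visits vertices in increasing distance from the source using a queue, while DFS assigns the num values in the order vertices are first visited by always extending from the most recently visited vertex.

from collections import deque

def bfs_tree(adj, s):
    # adj: dict vertex -> list of neighbors; returns BFS tree parents and distances
    parent, dist = {s: None}, {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v], dist[v] = u, dist[u] + 1
                q.append(v)
    return parent, dist

def dfs_num(adj, s):
    # returns the order (num value) in which each vertex is first visited
    num, counter = {}, [1]
    def visit(u):
        num[u] = counter[0]
        counter[0] += 1
        for v in adj[u]:
            if v not in num:
                visit(v)
    visit(s)
    return num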
Appendix D
Probability
We recall basic definitions and concepts of probability theory. We also present various
probabilistic inequalities and bounds that are used throughout the book. These prove
very useful in the analysis of randomized algorithms. We refer to [7, 2] for more details
on the concepts presented here.
• P r{S} = 1.
The probability of any event E can be computed as the sum of the probabilities of the
simple events that it is composed of; this follows from the third axiom:
Pr(E) = ∑_{s∈E} Pr(s).
Statement D.2 (Boole’s inequality (union bound)). For any arbitrary sequence of events
E1 , E2 , . . . , En :
P(∪_{i=1}^{n} Ei) ≤ ∑_{i} P(Ei).
The union bound is often used to upper bound the probability of the union of a set of “bad” events. If this upper bound is small, then it implies that, with high probability, none of the bad events occur.
Pr(A | B) = Pr(A ∩ B)/Pr(B).
In the above definition, by conditioning on B we restrict the sample space to the set B.
Thus the conditional probability can be considered as P r(A ∩ B) “normalized” by P r(B).
Given two events E1 and E2 , Bayes’ rule relates the conditional probability of the
first given the second, to the conditional probability of the second given the first. This is
useful to infer one conditional probability from the other.
The following are some useful identities that involve computing the probability of the
intersection of many events.
Pr(En) = Pr(An | E_{n−1}) Pr(E_{n−1}) = Pr(An | E_{n−1}) Pr(A_{n−1} | E_{n−2}) ··· Pr(A2 | E1) Pr(A1).
Pr(A | B) = Pr(A ∩ B)/Pr(B) = Pr(A).
Pr(E) = (N·(N − 1)·(N − 2) ··· (N − m + 1)) / N^m
= ∏_{i=0}^{m−1} (1 − i/N)
≤ ∏_{i=0}^{m−1} e^{−i/N} = e^{−∑_{i=0}^{m−1} i/N} = e^{−m(m−1)/2N}.
For m = √(2N) + 1 (about 28 when N = 365), P(E) < 1/e < 1/2.
The apparent “paradox” in this problem is that a significantly smaller number of people, i.e., only about √N people (which is much smaller than N), are needed to have a good chance (about 50%) that two people have a common birthday.
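The bound above is easy to check numerically; the short Python sketch below (ours) computes the exact probability that m birthdays out of N possible days are all distinct and compares it with the e^{−m(m−1)/2N} bound.

import math

def prob_all_distinct(m, N=365):
    # probability that m birthdays, chosen uniformly from N days, are all distinct
    p = 1.0
    for i in range(m):
        p *= (N - i) / N
    return p

m = int(math.sqrt(2 * 365)) + 1   # about 28 people
print(m, prob_all_distinct(m), math.exp(-m * (m - 1) / (2 * 365)))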
Alternate Analysis Assume that we choose one birthday after the other independently
and uniformly at random from [1 . . . N ]. Let the event Ei denote: ”the ith choice is
different from the first i − 1 choices”. Then, we compute:
X : S → V.
The random variable X allows one to “transfer” the probability function that is defined
on the set of events of S to one that is defined on the set of values V , the range of X in
the following manner. Let E(r) = {s ∈ S | X(s) = r}. Then we can define
Pr(X = r) = Pr(E(r)) = ∑_{s∈E(r)} Pr(s).
Two random variables X and Y (defined on the same sample space) are called
independent if for all x and y
applications, we are less interested in the entire distribution of the probability over the values, and more interested in a single parameter (or a few parameters) that characterizes the distribution. A standard such parameter is the expectation, also called the mean or the average of the random variable.
Definition D.4 I Expectation of a random variable
The expectation of a discrete random variable X is
E[X] = ∑_{i∈range(X)} i · Pr(X = i).
Thus, the expectation is a weighted sum over all possible values of the random variable,
weighted by the corresponding probabilities of the values.
A very useful property that comes in handy in computing the expected value of a
random variable is the linearity property of expectation stated as follows.
Theorem D.1 I Linearity of Expectation
Given a sequence of random variables, X1 , X2 , . . . , Xn , we have
" n # n
X X
E Xi = E[Xi ].
i=1 i=1
The linearity property says that the expectation of the sum of random variables is
equal to the sum of their individual expectations. Note that this is true regardless of whether the random variables are independent or not; in fact, it is the latter case that makes this
property so useful in many applications. In many applications, this is commonly used to
compute the expectation of a random variable (which is not directly easy to compute by
applying the Definition D.4) by decomposing it into a sum of random variables whose
expectations are easier to compute. A typical example is decomposing the random variable
into a set of indicator (also called Bernoulli or 0-1) random variables, which take only
two values 0 and 1. The following example illustrates this paradigm.
Problem D.1
Assume that N customers checked their jackets in a restaurant. The jackets get
mixed and while leaving the restaurant, each customer gets a random jacket, i.e.,
probability of getting any particular jacket is 1/N . The goal is to compute the
expected number of customers who get their own jacket.
E[X] = ∑_{i=1}^{N} E[Xi] = 1.
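Although part of the solution is elided above, the conclusion is easy to check by simulation (a sketch, ours): the average number of customers who get their own jacket stays close to 1 regardless of N, exactly as linearity of expectation predicts.

import random

def expected_own_jacket(N, trials=100000):
    total = 0
    for _ in range(trials):
        perm = list(range(N))
        random.shuffle(perm)   # a uniformly random assignment of jackets to customers
        total += sum(1 for i in range(N) if perm[i] == i)   # fixed points = own jacket
    return total / trials

print(expected_own_jacket(10))   # close to 1.0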
Problem D.2
Let a1 , a2 , . . . , an be a sequence of n values. Each value ai is independently and
randomly chosen from a fixed distribution D (e.g., uniform distribution from the
real interval [0, 1]). Let mi = min{a1 , a2 , . . . , ai }, i.e., the minimum of the first i
values. Let r.v. Y denote the number of times the minimum value is updated, i.e., the number of times mi ≠ m_{i+1}. Then E[Y] = O(log n).
Solution: First, without loss of generality we can assume that all the values are distinct, since we are upper bounding Y. Indeed, when a value repeats, there won't be any update. Let the indicator random variable Yi denote the event that mi is updated. The value mi will be updated if a_{i+1} is the minimum value among the first i + 1 values. The probability that this event happens is 1/(i + 1). This is because each of the first i + 1 values is chosen independently from the same distribution, and, assuming that the values are distinct, the probability that a particular value (i.e., a_{i+1}) is the minimum is 1/(i + 1). Hence E[Yi] = P(Yi = 1) = 1/(i + 1).
We have Y = ∑_{i=1}^{n−1} Yi. Thus, E[Y] = ∑_{i=1}^{n−1} E[Yi] = ∑_{i=1}^{n−1} 1/(i + 1) ≤ Hn = Θ(log n).
Note that Hn is the nth harmonic number, defined as ∑_{i=1}^{n} 1/i, and is Θ(log n).
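A quick simulation (ours) of Problem D.2: draw n values uniformly from [0, 1], count how often the running minimum changes, and compare the average with ln n. (Counting the very first value as an update shifts the count by one, which does not affect the Θ(log n) behavior.)

import math
import random

def count_min_updates(n):
    m = float('inf')
    updates = 0
    for _ in range(n):
        a = random.random()    # value drawn uniformly from [0, 1]
        if a < m:
            m = a
            updates += 1
    return updates

n, trials = 1000, 2000
avg = sum(count_min_updates(n) for _ in range(trials)) / trials
print(avg, math.log(n))   # the average number of updates is Theta(log n)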
P r(X = 1) = p, P r(X = 0) = 1 − p.
Then:
E[X] = 1 · p + 0 · (1 − p) = p.
Consider a sequence of n independent Bernoulli trials X1, . . . , Xn. Let X = ∑_{i=1}^{n} Xi.
Then we say that the random variable X has a Binomial distribution with parameters
n and p denoted as X ∼ B(n, p). Equivalently, X denotes the number of successes obtained
when doing n independent trials where each trial has a probability of success p. The
probability of getting k successes is given by:
Pr(X = k) = (n choose k) · p^k · (1 − p)^{n−k}.
Pr(X = i) = (1 − p)^{i−1} p.
E[X] = ∑_{i≥1} Pr(X ≥ i) = ∑_{i≥1} (1 − p)^{i−1} = 1/(1 − (1 − p)) = 1/p.
Pr(X > k | X > r) = (1 − p)^k / (1 − p)^r.
Pr(X ≥ a) ≤ E[X]/a.
Proof. For any a > 0, let I be an indicator r.v. for the event X ≥ a. Then I ≤ X/a. Taking expectations on both sides, we get E[I] = P(X ≥ a) ≤ E[X]/a.
Chebyshev’s inequality gives a stronger bound, which assumes that the variance (or
the “second moment”) or standard deviation of the random variable is known. These are
defined below.
Definition D.5
The variance of a random variable X is Var[X] = E[(X − E[X])^2] = E[X^2] − (E[X])^2.
Chebyshev's inequality states that
Pr(|X − E[X]| ≥ a) ≤ Var[X]/a^2.
Proof.
Pr(|X − E[X]| ≥ a) = Pr((X − E[X])^2 ≥ a^2).
By Markov's inequality,
A useful special case of the Markov and Chebyshev’s inequalities is given by the
following theorem. This is typically useful in showing whether a non-negative integer-
valued random variable takes 0 value or otherwise.
P(X = 0) ≤ Var(X)/(E(X))^2 = E(X^2)/(E(X))^2 − 1.
All moments of X can be obtained by differentiating M(t) and then evaluating the result at t = 0; hence the name.
M′(t) = d/dt E[e^{tX}] = E[d/dt (e^{tX})] = E[X e^{tX}].
Thus, M′(0) = E[X], and, more generally, the nth derivative of M(t) is given by M^{(n)}(t) = E[X^n e^{tX}], and
M^{(n)}(0) = E[X^n], n ≥ 1.
1. The moment generating function of the sum of independent random variables equals
the product of the individual moment generating functions.
Chernoff Bounds via MGF Let X be a random variable and M be its MGF. Then,
for all t > 0, via Markov’s inequality, we have:
P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}]/e^{ta} = e^{−ta} M(t).
Similarly, for all t < 0,
P(X ≤ a) ≤ e^{−ta} M(t).
The above bounds are called Chernoff bounds. We obtain the best bound by using the t
that minimizes the right hand side.
Example. Let X be the standard normal r.v. Then P(X ≥ a) ≤ e^{−ta} e^{t^2/2} for all t > 0. The right hand side is minimized for t = a. Thus, for a > 0,
P(X ≥ a) ≤ e^{−a^2/2}.
By Markov's inequality,
P(X ≥ (1 + δ)µ) < E[e^{tX}] / e^{t(1+δ)µ}
= E[e^{t ∑_{i=1}^{n} Xi}] / e^{t(1+δ)µ} = E[∏_{i=1}^{n} e^{tXi}] / e^{t(1+δ)µ} = ∏_{i=1}^{n} E[e^{tXi}] / e^{t(1+δ)µ}
= ∏_{i=1}^{n} (pi e^t + 1 − pi) / e^{t(1+δ)µ} = ∏_{i=1}^{n} (1 + pi(e^t − 1)) / e^{t(1+δ)µ}
< ∏_{i=1}^{n} e^{pi(e^t − 1)} / e^{t(1+δ)µ} = e^{∑_{i=1}^{n} pi(e^t − 1)} / e^{t(1+δ)µ} = e^{(e^t − 1)µ} / e^{t(1+δ)µ}
≤ ( e^δ / (1 + δ)^{1+δ} )^µ
for t = ln(1 + δ).
Using δ − (1 + δ) ln(1 + δ) ≤ −δ^2/3 for 0 < δ < 1, we get
P(X ≥ (1 + δ)µ) ≤ e^{−µδ^2/3}.
Lower tail:
P(X < (1 − δ)µ) = P(e^{−tX} > e^{−t(1−δ)µ}).
By Markov's inequality,
P(X < (1 − δ)µ) < E[e^{−tX}] / e^{−t(1−δ)µ}.
Similar calculations yield
E[e^{−tX}] / e^{−t(1−δ)µ} < e^{(e^{−t} − 1)µ} / e^{−t(1−δ)µ}.
For t = ln(1/(1 − δ)), the right hand side of the above is
≤ ( e^{−δ} / (1 − δ)^{1−δ} )^µ.
Since (1 − δ)^{1−δ} > e^{−δ+δ^2/2}, we have
P(X < (1 − δ)µ) < e^{−µδ^2/2}.
As an application, let X be the number of heads in n independent fair coin flips. Then
P(|X − n/2| ≥ (1/2)√(6n log n)) ≤ 2/n.
Proof. E[X] = n/2. We need
n/2 − (1/2)√(6n log n) ≤ X ≤ n/2 + (1/2)√(6n log n),
or
X = (n/2)(1 ± √(6 log n / n)).
Fixing δ = √(6 log n / n), we get
P(X < (1 − δ)n/2) ≤ e^{−(n/2)(δ^2/2)} ≤ 1/n,
P(X > (1 + δ)n/2) ≤ e^{−(n/2)(δ^2/3)} ≤ 1/n.
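The coin-flip bound above is also easy to test empirically; the sketch below (ours) flips n fair coins repeatedly and estimates how often the number of heads deviates from n/2 by more than (1/2)√(6 n log n). The empirical failure rate should be well below the 2/n guarantee.

import math
import random

def deviation_exceeded(n):
    heads = sum(random.random() < 0.5 for _ in range(n))
    return abs(heads - n / 2) > 0.5 * math.sqrt(6 * n * math.log(n))

n, trials = 10000, 1000
failures = sum(deviation_exceeded(n) for _ in range(trials))
print(failures / trials)   # should be well below 2/n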
We see that Chernoff gives a stronger bound compared to Chebyshev, which in turn is stronger compared to Markov. The reason is that Markov's inequality uses only the first moment (i.e., the expectation) of the random variable in its bound, whereas Chebyshev takes the first and second moments (equivalently, the variance) into account, while Chernoff uses all the moments (via the MGF).
The expression E[Y |Z] is a random variable f (Z) that takes the value E[Y |Z = z]
when Z = z.
Theorem D.8
We collect math formulas commonly used in algorithm analysis. For an elaborate list, see the theoretical computer science cheat sheet by Steve Seiden (https://www.tug.org/texshowcase/cheat.pdf).
Bibliography
[2] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[3] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[6] V. Rödl and A. Ruciński. Random graphs with monochromatic triangles in every edge
coloring. Random Struct. Algorithms, 5(2):253–270, 1994.
[7] E. Upfal and M. Mitzenmacher. Probability and Computing: Randomized algorithms and
Probabilistic Analysis. Cambridge University Press, 2005.