Book
Sariel Har-Peled¬
¬ Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801,
USA; sariel@illinois.edu; http://sarielhp.org/. Work on this paper was partially supported by a NSF
CAREER award CCR-0132901.
Contents
Contents 3
3 Min Cut 27
3.1 Branching processes – Galton-Watson Process . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.2 On coloring trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Min Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1.1 The probability of success. . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1.2 Running time analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 A faster algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.4 A special case of Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.4.1 Some technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.5 Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8 Martingales 73
8.1 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.1.2.1 Examples of martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.1.2.2 Azuma’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
9 Martingales II 79
9.1 Filters and Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
9.2.1 Martingales – an alternative definition . . . . . . . . . . . . . . . . . . . . . . . . 81
9.3 Occupancy Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.3.1 Lets verify this is indeed an improvement . . . . . . . . . . . . . . . . . . . . . . . 83
9.4 Some useful estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
14.1.1.2 Walking on two dimensional grid . . . . . . . . . . . . . . . . . . . . . . 106
14.1.1.3 Walking on three dimensional grid . . . . . . . . . . . . . . . . . . . . . 106
20.3.1 ε-nets and ε-samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
20.3.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.3.2.1 Range searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.3.2.2 Learning a concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.3.2.3 A naive proof of the ε-sample theorem. . . . . . . . . . . . . . . . . . . . 157
20.3.3 A quicky proof of the ε-net theorem (Theorem 20.3.4) . . . . . . . . . . . . . . . 158
20.4 Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
20.4.1 Building ε-sample via discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
20.4.1.1 Faster deterministic construction of ε-samples. . . . . . . . . . . . . . . 162
20.4.2 Building ε-net via discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.5 Proof of the ε-net theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
20.6 A better bound on the growth function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
20.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
20.7.1 Variants and extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
20.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
22.1.2.5 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
22.1.2.6 ZZ∗p is cyclic for prime numbers . . . . . . . . . . . . . . . . . . . . . . . 192
22.1.2.7 ZZ∗n is cyclic for powers of a prime . . . . . . . . . . . . . . . . . . . . . . 193
22.1.3 Quadratic residues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
22.1.3.1 Quadratic residue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
22.1.3.2 Legendre symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
22.1.3.3 Jacobi symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
22.1.3.4 Jacobi(a, n): Computing the Jacobi symbol . . . . . . . . . . . . . . . . 198
22.1.3.5 Subgroups induced by the Jacobi symbol . . . . . . . . . . . . . . . . . . 198
22.2 Primality testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
22.2.1 Distribution of primes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
22.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
26 Entropy II 223
26.1 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
26.2 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
27 Entropy III - Shannon’s Theorem 225
27.1 Coding: Shannon’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
27.2 Proof of Shannon’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.1 How to encode and decode efficiently . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.1.1 The scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.1.2 The proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.2 Lower bound on the message size . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
27.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
29 Expanders I 237
29.1 Preliminaries on expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
29.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
29.2 Tension and expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
30 Expanders II 241
30.1 Bi-tension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
30.2 Explicit construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
30.2.1 Explicit construction of a small expander . . . . . . . . . . . . . . . . . . . . . . . 243
30.2.1.1 A quicky reminder of fields . . . . . . . . . . . . . . . . . . . . . . . . . 243
30.2.1.2 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Bibliography 257
Index 263
Chapter 1
A randomized algorithm. The randomized algorithm in this case is easy – the player randomly
chooses a number among 1, 2, 3 at every round. Since, at every point in time, there are two coins that
have the same side up, and the other coin is the other side up, a random choice hits the lonely coin, and
thus finishes the game, with probability 1/3 at each step. In particular, the number of rounds until the
game terminates is a geometric random variable with parameter 1/3 (and thus the expected number of
rounds is 3). Clearly, the probability that the game continues for more than i rounds, when the player
uses this randomized algorithm, is (2/3)^i. In particular, this probability goes to zero quickly.
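As a sanity check, here is a tiny simulation sketch (not part of the notes); it only models the 1/3 success probability per round, under the assumption that rounds are independent, and verifies that the average number of rounds is about 3.

import random

def rounds_until_win(p=1/3):
    # Count rounds until the first success, where each round succeeds with probability p.
    rounds = 1
    while random.random() >= p:
        rounds += 1
    return rounds

trials = 100_000
print(sum(rounds_until_win() for _ in range(trials)) / trials)   # prints roughly 3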
A deterministic algorithm. The surprise here is that there is no deterministic algorithm that can
generate a winning sequence. Indeed, if the player uses a deterministic algorithm, then the adversary
can simulate the algorithm herself, and know at every stage what coin the player would ask to flip (it
is easy to verify that flipping two coins in a step is equivalent to flipping the other coin – so we can
restrict ourselves to a single coin flip at each step). In particular, the adversary can rotate the board in
the end of the round, such that the player (in the next round) flips one of the two coins that are in the
same state. Namely, the player never wins.
The shocker. One can play the same game with a board of size 4 (i.e., a square), where at each stage
the player can flip one or two coins, and the adversary can rotate the board by 0, 90, 180, 270 degrees
after each round. Surprisingly, there is a deterministic winning strategy for this case. The interested
reader can think what it is (this is one of these brain teasers that are not immediate, and might take
you 15 minutes to solve, or longer [or much longer]).
The unfair game of the analysis of algorithms. The underlying problem with analyzing algorithms
is the inherent unfairness of worst case analysis. We are given a problem, we propose an algorithm, and then
an all-powerful adversary chooses the worst input for our algorithm. Using randomness gives the player
(i.e., the algorithm designer) some power to fight the adversary by being unpredictable.
1.2. Basic probability
Here we recall some definitions about probability. The reader already familiar with these definitions can
happily skip this section.
As a concrete example, if we are rolling a die, then Ω = {1, 2, 3, 4, 5, 6} and F would be the power
set of Ω (i.e., the set of all possible subsets of Ω).
(ii) Pr[Ω] = 1.
Definition 1.2.4. A probability space is a triple (Ω, F , Pr), where Ω is a sample space, F is a σ-algebra
defined over Ω, and Pr is a probability measure.
Definition 1.2.5. A random variable f is a mapping from Ω into some set G. We require that the
probability of the random variable to take on any value in a given subset of values is well defined.
Formally, for any subset U ⊆ G, we have that f^{−1}(U) ∈ F. That is, Pr[f ∈ U] = Pr[f^{−1}(U)] is defined.
Going back to the dice example, the number on the top of the dice when we roll it is a random
variable. Similarly, let X be one if the number rolled is larger than 3, and zero otherwise. Clearly X is
a random variable.
We denote the probability of a random variable X to get the value x, by Pr[X = x] (or sometime
Pr[x], if we are lazy).
Definition 1.2.7 (Conditional Probability.). The conditional probability of X given Y , is the probability
that X = x given that Y = y. We denote this quantity by Pr[X = x | Y = y].
One useful way to think about the conditional probability Pr[X | Y] is as a function that maps the
given value of Y (i.e., y) to the probability that X takes a specific value x in this case. Since in many
cases x and y are omitted from the notation, this can be somewhat confusing.
The conditional probability can be computed using the formula
Pr[X = x | Y = y] = Pr[(X = x) ∩ (Y = y)] / Pr[Y = y].
For example, let us roll a die and let X be the number we got. Let Y be the random variable that
is true if the number we get is even. Then, we have that
Pr[X = 2 | Y = true] = 1/3.
Definition 1.2.8. Two random variables X and Y are independent if Pr[X = x | Y = y] = Pr[X = x], for
all x and y. Equivalently, X and Y are independent if, for all x and y,
Pr[(X = x) ∩ (Y = y)] = Pr[X = x] · Pr[Y = y].
Remark. Informally, and not quite correctly, one possible way to think about conditional probability
Pr[X = x | Y = y] is as measuring the benefit of having more information. If we know that Y = y, do we
have any change in the probability of X = x?
Lemma 1.2.11 (Linearity of expectation). Linearity of expectation is the property that for any
two random variables X and Y, we have that E[X + Y] = E[X] + E[Y].

Proof: E[X + Y] = Σ_{ω∈Ω} Pr[ω] (X(ω) + Y(ω)) = Σ_{ω∈Ω} Pr[ω] X(ω) + Σ_{ω∈Ω} Pr[ω] Y(ω) = E[X] + E[Y].
1.3. QuickSort
Let the input be a set T = {t_1, . . . , t_n} of n items to be sorted. We remind the reader that the QuickSort
algorithm picks a pivot element uniformly at random, splits the input into two subarrays (all the
elements smaller than the pivot, and all the elements larger than the pivot), and then recurses on
these two subarrays (the pivot is not included in either of the two subproblems). Here we will show that the
expected running time of QuickSort is O(n log n).
Definition 1.3.1. For an event E, let X be a random variable which is 1 if E occurred and 0 otherwise.
The random variable X is an indicator variable.
Let S_1, . . . , S_n be the elements in their sorted order (i.e., the output order). Let X_{ij} be the
indicator variable which is one iff QuickSort compares S_i to S_j, and let p_{ij} denote the probability that
this happens. Clearly, the number of comparisons performed by the algorithm is C = Σ_{i<j} X_{ij}. By
linearity of expectations, we have
E[C] = E[Σ_{i<j} X_{ij}] = Σ_{i<j} E[X_{ij}] = Σ_{i<j} p_{ij}.
We want to bound p_{ij}, the probability that S_i is compared to S_j. Consider the last recursive
call involving both S_i and S_j. Clearly, the pivot at this step must be one of S_i, . . . , S_j, all equally likely
(indeed, S_i and S_j are separated in the next recursive call).
Observe that S_i and S_j get compared if and only if the pivot is S_i or S_j. Thus, the probability for that
is 2/(j − i + 1). Indeed,
p_{ij} = Pr[S_i or S_j is picked | pivot is picked from S_i, . . . , S_j] = 2/(j − i + 1).
Thus,
Σ_{i=1}^{n} Σ_{j>i} p_{ij} = Σ_{i=1}^{n} Σ_{j>i} 2/(j − i + 1) = Σ_{i=1}^{n} Σ_{k=2}^{n−i+1} 2/k ≤ 2n(H_n − 1) ≤ 2n ln n ≤ n + 2n ln n,
where H_n = Σ_{i=1}^{n} 1/i ≤ 1 + ln n is the nth harmonic number. We thus proved the following result.
Lemma 1.3.3. QuickSort performs in expectation at most n + 2n ln n comparisons, when sorting n
elements.
Note that this holds for all inputs – no assumption on the input is made. Similar bounds hold not
only in expectation, but also with high probability.
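The bound of Lemma 1.3.3 is easy to check empirically. The following is a minimal sketch (not part of the notes) that counts the comparisons made by a randomized QuickSort and compares the average to n + 2n ln n; the function names and parameters are illustrative.

import math
import random

def quicksort_comparisons(arr):
    # Return the number of element comparisons made by randomized QuickSort on arr.
    if len(arr) <= 1:
        return 0
    pivot = random.choice(arr)
    smaller = [x for x in arr if x < pivot]
    larger = [x for x in arr if x > pivot]
    # len(arr) - 1 comparisons against the pivot, plus the two recursive calls.
    return len(arr) - 1 + quicksort_comparisons(smaller) + quicksort_comparisons(larger)

n, trials = 1000, 100
avg = sum(quicksort_comparisons(list(range(n))) for _ in range(trials)) / trials
print(f"average comparisons: {avg:.0f}, bound n + 2n ln n = {n + 2 * n * math.log(n):.0f}")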
This raises the question, of how does the algorithm pick a random element? We assume we have
access to a random source that can get us number between 1 and n uniformly.
Note, that the algorithm always works, but it might take quadratic time in the worst case.
Remark 1.3.4 (Wait, wait, wait). Let us do the key argument in the above more slowly, and more carefully.
Imagine that before running QuickSort we choose for every element a random priority, which is a real
number in the range [0, 1]. Now, we reimplement QuickSort such that it always picks the element with
the lowest random priority (in the given subproblem) to be the pivot. One can verify that this variant
and the standard implementation have the same running time. Now, a_i gets compared to a_j if and
only if all the elements a_{i+1}, . . . , a_{j−1} have random priority larger than both the random priority of a_i
and the random priority of a_j. But the probability that one of two specific elements has the lowest
random priority out of j − i + 1 elements is 2/(j − i + 1), as claimed.
The problem is that it is not always possible to order the objects in three dimensions. This ordering
might have cycles. So, one possible solution is to build a binary space partition. We build a binary
tree. In the root, we place a polygon P. Let h be the plane containing P. Next, we partition the input
polygons into two sets, depending on which side of h they fall into. We recursively construct a BSP for
each set, and we hang it from the root node. If a polygon intersects h then we cut it into two polygons
as split by h. We continue the construction recursively on the objects on one side of h, and the objects
on the other side. What we get, is a binary tree that splits space into cells, and furthermore, one can
use the painter algorithm on these objects. The natural question is how big is the resulting partition.
We will study the easiest case, of disjoint segments in the plane.
We pick a random permutation σ of ⟨1, . . . , n⟩, and in the ith step we insert s_{σ(i)}, splitting all the cells
that s_{σ(i)} intersects.
Observe that if a segment crosses a cell completely, it just splits it into two and no new fragments are created.
As such, the bad case is when a segment s is being inserted, and its line intersects some other segment t.
So, let E(s, t) denote the event that when s was inserted it split t. In particular, let index(s, t) denote
the number of segments on the line of s between the endpoint of s closer to t and t (including t itself). If the line of s
does not intersect t, then index(s, t) = ∞.
We have that
Pr[E(s, t)] = 1/(1 + index(s, t)).
Let X_{s,t} be the indicator variable that is 1 if E(s, t) happens. We have that
S = number of fragments = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} X_{s_i, s_j}.
As such, by linearity of expectations, we have
E[S] = E[Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} X_{s_i, s_j}] = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E[X_{s_i, s_j}] = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} Pr[E(s_i, s_j)]
= Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} 1/(1 + index(s_i, s_j)) ≤ Σ_{i=1}^{n} Σ_{j=1}^{n} 2/(1 + j) ≤ 2nH_n.
Since the size of the BSP is proportional to the number of fragments created, we have the following
result.
Theorem 1.4.1. Given n disjoint segments in the plane, one can build a BSP for them of size O(n log n).
Csaba Tóth [Tót03] showed that a BSP for segments in the plane, in the worst case, has complexity
Ω(n log n / log log n).
(i) If i < j < m: The elements S_i and S_j get compared if and only if the first pivot picked from
S_i, . . . , S_m is either S_i or S_j, which happens with probability 2/(m − i + 1). As such,
α_1 = E[Σ_{i<j<m} X_{ij}] = Σ_{i=1}^{m−2} Σ_{j=i+1}^{m−1} E[X_{ij}] = Σ_{i=1}^{m−2} Σ_{j=i+1}^{m−1} 2/(m − i + 1) = Σ_{i=1}^{m−2} 2(m − i − 1)/(m − i + 1) ≤ 2(m − 2).
(ii) If m < i < j: Using the same analysis as above, we have that Pr[X_{ij} = 1] = 2/(j − m + 1). As such,
α_2 = E[Σ_{m<i<j} X_{ij}] = Σ_{j=m+1}^{n} Σ_{i=m+1}^{j−1} 2/(j − m + 1) = Σ_{j=m+1}^{n} 2(j − m − 1)/(j − m + 1) ≤ 2(n − m).
(iii) If i < m < j: Here, we compare S_i to S_j if and only if the first pivot picked in the range
S_i, . . . , S_j is either S_i or S_j. As such, E[X_{ij}] = Pr[X_{ij} = 1] = 2/(j − i + 1), and we have
α_3 = E[Σ_{i=1}^{m−1} Σ_{j=m+1}^{n} X_{ij}] = Σ_{i=1}^{m−1} Σ_{j=m+1}^{n} 2/(j − i + 1).
Observe that, for a fixed value of ∆ = j − i + 1, the gap ∆ appears in the above summation
at most ∆ − 2 times. As such, α_3 ≤ Σ_{∆=3}^{n} 2(∆ − 2)/∆ ≤ 2n.
(iv) If i = m: We have α_4 = Σ_{j=m+1}^{n} E[X_{mj}] = Σ_{j=m+1}^{n} 2/(j − m + 1) ≤ 2 ln n.
(v) If j = m: We have α_5 = Σ_{i=1}^{m−1} E[X_{im}] = Σ_{i=1}^{m−1} 2/(m − i + 1) ≤ 2 ln m.
Summing up, the expected number of comparisons performed by this algorithm is at most
α_1 + α_2 + α_3 + α_4 + α_5 ≤ 4n + O(log n). A different approach can reduce the number of comparisons
(in expectation) to 1.5n + o(n). More on that later in the course.
Chapter 2
Assume we are given two binary vectors v = (v_1, . . . , v_n), u = (u_1, . . . , u_n) ∈ {0, 1}^n, and we would like
to decide if they are equal or not. Unfortunately, the only access you have to the two vectors is via a black-box
that enables you to compute the dot-product of two binary vectors over ZZ_2. Formally, given two binary
vectors as above, their dot-product is ⟨v, u⟩ = Σ_{i=1}^{n} v_i u_i (which is a non-negative integer number). Their
dot-product modulo 2 is ⟨v, u⟩ mod 2 (i.e., it is 1 if ⟨v, u⟩ is odd and 0 otherwise).
Naturally, we could use the black-box to read the vectors (using 2n calls), but since we are
interested only in deciding whether they are equal, this should require fewer calls to the black-box (which
is expensive).
Lemma 2.1.1. Given two binary vectors v, u ∈ {0, 1}^n, a randomized algorithm can, using two computations
of dot-product modulo 2, decide if v is equal to u or not. The algorithm may return one of the
following two values:
≠: Then v ≠ u.
=: Then the probability that the algorithm made a mistake (i.e., the vectors are in fact different) is at most
1/2.
The running time of the algorithm is O(n + B(n)), where B(n) is the time to compute a single dot-product
of vectors of length n.
Proof: Pick a random vector r = (r_1, . . . , r_n) ∈ {0, 1}^n by picking each coordinate independently with
probability 1/2. Compute the two dot-products ⟨v, r⟩ and ⟨u, r⟩.
• If ⟨v, r⟩ ≡ ⟨u, r⟩ (mod 2) ⇒ the algorithm returns ‘=’.
• If ⟨v, r⟩ ≢ ⟨u, r⟩ (mod 2) ⇒ the algorithm returns ‘≠’.
Clearly, if ‘≠’ is returned then v ≠ u.
So, assume that the algorithm returned ‘=’ but v ≠ u. For the sake of simplicity of exposition,
assume that they differ on the nth bit: u_n ≠ v_n. We then have that
α = ⟨v, r⟩ = Σ_{i=1}^{n−1} v_i r_i + v_n r_n = α′ + v_n r_n   and   β = ⟨u, r⟩ = Σ_{i=1}^{n−1} u_i r_i + u_n r_n = β′ + u_n r_n.
Since u_n ≠ v_n, whatever the values of α′ and β′ are, exactly one of the two possible values of r_n makes
α ≡ β (mod 2). As r_n is chosen uniformly and independently of α′ and β′, the algorithm errs in this case
with probability at most 1/2.
2.1.1.1. Amplification
Of course, this is not a satisfying algorithm – it returns the correct answer only with probability half if
the vectors are different. So, let us run the algorithm t times. Let T1, . . . , Tt be the returned values from
all these executions. If any of the t executions returns that the vectors are different, then we know that
they are different.
Pr[Algorithm fails] = Pr[v ≠ u, but all t executions return ‘=’]
= Pr[T_1 = ‘=’] · Pr[T_2 = ‘=’] · · · Pr[T_t = ‘=’] ≤ ∏_{i=1}^{t} 1/2 = 1/2^t.
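The following is a minimal sketch (not from the notes) of this tester and its amplification; the function names and the use of plain Python lists for the vectors are illustrative assumptions.

import random

def dot_mod2(v, u):
    # The "black-box": dot-product of two 0/1 vectors modulo 2.
    return sum(vi * ui for vi, ui in zip(v, u)) % 2

def probably_equal(v, u, t=20):
    # Return False only if v != u for sure; return True if all t random tests agree.
    # If v != u, each test misses with probability at most 1/2, so a wrong
    # 'True' answer happens with probability at most 1/2**t.
    n = len(v)
    for _ in range(t):
        r = [random.randint(0, 1) for _ in range(n)]
        if dot_mod2(v, r) != dot_mod2(u, r):
            return False          # a witness: the vectors are certainly different
    return True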
2.1.2. Matrices
Given three binary matrices B, C, D of size n × n, we are interested in deciding if BC = D. Computing
BC is expensive – the fastest known (theoretical!) algorithm has running time (roughly) O(n^{2.37}). On
the other hand, multiplying such a matrix with a vector r (modulo 2, as usual) takes only O(n²) time
(and this algorithm is simple).
Lemma 2.1.3. Given three binary matrices B, C, D ∈ {0, 1}^{n×n} and a confidence parameter δ > 0, a
randomized algorithm can decide if BC = D or not. More precisely, the algorithm can return one of the
following two results:
≠: Then BC ≠ D.
=: Then BC = D with probability ≥ 1 − δ.
The running time of the algorithm is O(n² log δ^{−1}).
Proof: Pick a random vector r = (r_1, . . . , r_n) ∈ {0, 1}^n, and compute the quantity x = BCr = B(Cr) in O(n²)
time, using the associative property of matrix multiplication. Similarly, compute y = Dr. Now, if x ≠ y
then return ‘≠’ – in this case BC ≠ D for sure. Otherwise, return ‘=’; if BC ≠ D, then some row of BC − D is
non-zero, and by the argument of Lemma 2.1.1 the corresponding coordinates of x and y agree with probability
at most 1/2.
Now, we execute this algorithm t = ⌈lg δ^{−1}⌉ times. If all of these independent runs return ‘=’, then we
return ‘=’, and the probability that this answer is wrong is at most 1/2^t ≤ δ.
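A minimal sketch of this verification idea in Python (Freivalds' technique); names and the use of nested lists for matrices are illustrative assumptions, and arithmetic is done modulo 2 as in the text.

import random

def mat_vec_mod2(M, r):
    # Multiply the 0/1 matrix M by the 0/1 vector r, modulo 2, in O(n^2) time.
    return [sum(mij * rj for mij, rj in zip(row, r)) % 2 for row in M]

def probably_product(B, C, D, t=20):
    # Return False only if BC != D for sure; otherwise return True,
    # which is wrong with probability at most 1/2**t.
    n = len(B)
    for _ in range(t):
        r = [random.randint(0, 1) for _ in range(n)]
        x = mat_vec_mod2(B, mat_vec_mod2(C, r))   # B(Cr), two O(n^2) products
        y = mat_vec_mod2(D, r)
        if x != y:
            return False
    return True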
Lemma 2.2.1. In expectation, the number of times the minimum of a prefix of n randomly permuted
numbers changes is O(log n). That is, E[X] = O(log n).
Proof: Consider the indicator variable X_i, such that X_i = 1 if c_i ≠ c_{i−1}. The probability for that is ≤ 1/i,
since this is the probability that the smallest number among b_1, . . . , b_i is b_i. (Why is this probability not
simply equal to 1/i?) As such, we have X = Σ_i X_i, and E[X] = Σ_i E[X_i] ≤ Σ_{i=1}^{n} 1/i = O(log n).
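A quick sketch (not from the notes) that estimates this expectation empirically; it matches the harmonic-number bound H_n ≈ ln n.

import math
import random

def prefix_min_changes(n):
    # Count how many times the running minimum changes over a random permutation of n numbers.
    perm = list(range(n))
    random.shuffle(perm)
    changes, current_min = 0, float("inf")
    for b in perm:
        if b < current_min:
            current_min = b
            changes += 1
    return changes

n, trials = 10_000, 200
avg = sum(prefix_min_changes(n) for _ in range(trials)) / trials
print(f"average changes: {avg:.2f}, H_n ≈ {math.log(n) + 0.5772:.2f}")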
given a point p, compute its id(p). We associate with each unique id a data-structure that stores all the
points falling into this grid cell (of course, we do not maintain such data-structures for grid cells which
are empty). For our purposes here, the grid-cell data-structure can simply be a linked list of points. So,
once we computed id(p), we fetch the data structure for this cell, by using hashing. Namely, we store
pointers to all those data-structures in a hash table, where each such data-structure is indexed by its
unique id. Since the ids are integer numbers, we can do the hashing in constant time.
Lemma 2.3.4. Given a set P of n points in the plane, and a distance r, one can verify in linear time,
whether or not CP(P) < r or CP(P) ≥ r.
Proof: Indeed, store the points of P in the grid G_r. For every non-empty grid cell, we maintain a linked
list of the points inside it. Thus, adding a new point p takes constant time. Indeed, compute id(p), and
check if id(p) already appears in the hash table; if not, create a new linked list for the cell with this ID
number, and store p in it. If a data-structure already exists for id(p), just add p to it.
This takes O(n) time overall. Now, if any grid cell in G_r(P) contains more than four points of P, then, by
Lemma 2.3.3, it must be that CP(P) < r.
Thus, when inserting a point p, the algorithm fetches all the points of P that were already inserted
into the cell of p and the 8 adjacent cells. All those cells must contain at most 4 points of P each (otherwise,
we would have already stopped, since the CP(·) of the inserted points is smaller than r). Let S be the
set of all those points, and observe that |S| ≤ 4 · 9 = O(1). Thus, we can compute by brute force the
closest point to p in S, where d(p, S) = min_{s∈S} ||ps||. This takes O(1) time. If d(p, S) < r, we stop and
return this distance (together with the two points realizing d(p, S) as a proof that the distance is too short).
Otherwise, we continue to the next point.
Overall, this takes O(n) time. As for correctness, first observe that if CP(P) > r then the algorithm
would never make a mistake, since it returns ‘CP(P) < r’ only after finding a pair of points of P with
distance smaller than r. Thus, assume that p, q are the pair of points of P realizing the closest pair, and
kpqk = CP(P) < r. Clearly, when the later of them, say p, is being inserted, the set S would contain q,
and as such the algorithm would stop and return “CP(P) < r”.
Lemma 2.3.4 hints at a natural way to compute CP(P). Indeed, permute the points of P in an
arbitrary fashion, and let P = ⟨p_1, . . . , p_n⟩. Next, let r_i = CP({p_1, . . . , p_i}). We can check if r_{i+1} < r_i by
just calling the algorithm of Lemma 2.3.4 on P_{i+1} and r_i. If r_{i+1} < r_i, the algorithm of Lemma 2.3.4
would give us back the distance r_{i+1} (with the other point realizing this distance).
So, consider the “good” case where ri+1 = ri = ri−1 . Namely, the length of the shortest pair does not
change. In this case we do not need to rebuild the data structure of Lemma 2.3.4 for each point. We
can just reuse it from the previous iteration. Thus, inserting a single point takes constant time as long
as the closest pair (distance) does not change.
Things become bad, when ri < ri−1 . Because then we need to rebuild the grid, and reinsert all the
points of Pi = hp1, . . . , pi i into the new grid Gri (Pi ). This takes O(i) time.
So, if the closest pair radius, in the sequence r1, . . . , rn , changes only k times, then the running time
of the algorithm would be O(nk). But we can do even better!
Theorem 2.3.5. Let P be a set of n points in the plane. One can compute the closest pair of points of
P in expected linear time.
Proof: Pick a random permutation of the points of P, and let hp1, . . . , pn i be this permutation. Let
r2 = kp1 p2 k, and start inserting the points into the data structure of Lemma 2.3.4. In the ith iteration,
if ri = ri−1 , then this insertion takes constant time. If ri < ri−1 , then we rebuild the grid and reinsert the
points. Namely, we recompute Gri (Pi ).
To analyze the running time of this algorithm, let X_i be the indicator variable which is 1 if r_i ≠ r_{i−1},
and 0 otherwise. Clearly, the running time is proportional to
R = 1 + Σ_{i=2}^{n} (1 + X_i · i).
Thus, the expected running time is
E[R] = 1 + E[Σ_{i=2}^{n} (1 + X_i · i)] = n + Σ_{i=2}^{n} E[X_i] · i = n + Σ_{i=2}^{n} i · Pr[X_i = 1],
by linearity of expectation and since, for an indicator variable X_i, we have that E[X_i] = Pr[X_i = 1].
Thus, we need to bound Pr[X_i = 1] = Pr[r_i < r_{i−1}]. To bound this quantity, fix the points of P_i, and
randomly permute them. A point q ∈ P_i is critical if CP(P_i \ {q}) > CP(P_i).
• If there are no critical points, then r_{i−1} = r_i and Pr[X_i = 1] = 0.
• If there is one critical point, then Pr[X_i = 1] = 1/i, as this is the probability that this critical point
would be the last point in a random permutation of P_i.
• If there are two critical points, let p, q be this unique pair of points of P_i realizing CP(P_i).
The quantity r_i is smaller than r_{i−1} only if either p or q is p_i. The probability for that is 2/i (i.e.,
the probability, in a random permutation of i objects, that one of two marked objects is the last
element in the permutation).
Observe that there cannot be more than two critical points. Indeed, if p and q are two points that
realize the closest distance, then for any third point r we have CP(P_i \ {r}) = ||pq||, and r is
not critical.
We conclude that
E[R] = n + Σ_{i=2}^{n} i · Pr[X_i = 1] ≤ n + Σ_{i=2}^{n} i · (2/i) ≤ 3n.
As such, the expected running time of this algorithm is O(E[R]) = O(n).
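Below is a minimal sketch (not the notes' code) of the rebuilding scheme from the proof of Theorem 2.3.5; the helper names and the dictionary-as-hash-table are illustrative assumptions, and ties and degenerate inputs (e.g., duplicate points) are not handled carefully.

import math
import random
from collections import defaultdict

def closest_pair(points):
    # Expected linear-time closest-pair distance for a list of (x, y) pairs,
    # using a grid that is rebuilt whenever the current closest-pair distance shrinks.
    pts = points[:]
    random.shuffle(pts)
    r = math.dist(pts[0], pts[1])

    def cell(p):
        return (int(p[0] // r), int(p[1] // r))

    def rebuild(prefix):
        grid = defaultdict(list)
        for q in prefix:
            grid[cell(q)].append(q)
        return grid

    grid = rebuild(pts[:2])
    for i in range(2, len(pts)):
        p = pts[i]
        cx, cy = cell(p)
        # Only the 3x3 block of cells around p can contain a point at distance < r.
        neighbors = [q for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     for q in grid[(cx + dx, cy + dy)]]
        d = min((math.dist(p, q) for q in neighbors), default=float("inf"))
        if d < r:
            r = d
            grid = rebuild(pts[:i + 1])   # the expensive O(i) step, rare in expectation
        else:
            grid[(cx, cy)].append(p)
    return r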
Theorem 2.3.5 is a surprising result, since it implies that uniqueness (i.e., deciding if n real numbers
are all distinct) can be solved in linear time. However, there is a lower bound of Ω(n log n) on uniqueness
in the comparison tree model. This reality dysfunction can be easily explained once one realizes
that the model of computation of Theorem 2.3.5 is considerably stronger, using hashing, randomization,
and the floor function.
2.4. Las Vegas and Monte Carlo algorithms
Definition 2.4.1. A Las Vegas algorithm is a randomized algorithm that always returns the correct
result. The only variation is that its running time might change between executions.
Definition 2.4.2. A Monte Carlo algorithm is a randomized algorithm that might output an incorrect
result. However, the probability of error can be diminished by repeated executions of the algorithm.
Definition 2.4.3. The class P consists of all languages L that have a polynomial time algorithm Alg,
such that for any input x ∈ Σ*, we have
• x ∈ L ⇒ Alg(x) accepts,
• x ∉ L ⇒ Alg(x) rejects.
Definition 2.4.4. The class NP consists of all languages L that have a polynomial time algorithm Alg,
such that for any input x ∈ Σ*, we have:
(i) If x ∈ L, then ∃y ∈ Σ* such that Alg(x, y) accepts, where |y| (i.e., the length of y) is bounded by a
polynomial in |x|.
(ii) If x ∉ L, then ∀y ∈ Σ*, Alg(x, y) rejects.
Definition 2.4.5. For a complexity class C, we define the complementary class co-C as the set of languages
whose complement is in the class C. That is,
co−C = { L | L̄ ∈ C },
where L̄ = Σ* \ L.
Definition 2.4.6. The class RP (for Randomized Polynomial time) consists of all languages L that have
a randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then Pr[Alg(x) accepts] ≥ 1/2.
(ii) If x ∉ L then Pr[Alg(x) accepts] = 0.
¬ There is also the internet.
An RP algorithm is a Monte Carlo algorithm that can make a mistake only if x ∈ L.
As such, co−RP contains all the languages that have a Monte Carlo algorithm that makes a mistake only if
x ∉ L. A problem which is in RP ∩ co−RP has an algorithm that does not make a mistake, namely a
Las Vegas algorithm.
Definition 2.4.7. The class ZPP (for Zero-error Probabilistic Polynomial time) is the class of languages
that have a Las Vegas algorithm that runs in expected polynomial time.
Definition 2.4.8. The class PP (for Probabilistic Polynomial time) is the class of languages that have a
randomized algorithm Alg, with worst case polynomial running time, such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then Pr[Alg(x) accepts] > 1/2.
(ii) If x ∉ L then Pr[Alg(x) accepts] < 1/2.
Definition 2.4.9. The class BPP (for Bounded-error Probabilistic Polynomial time) is the class of lan-
guages that have a randomized algorithm Alg with worst case polynomial running time such that for
any input x ∈ Σ∗ , we have
(i) If x ∈ L then Pr[Alg(x) accepts] ≥ 3/4.
(ii) If x ∉ L then Pr[Alg(x) accepts] ≤ 1/4.
Chapter 3
Min Cut
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
To acknowledge the corn - This purely American expression means to admit the losing of an argument,
especially in regard to a detail; to retract; to admit defeat. It is over a hundred years old. Andrew
Stewart, a member of Congress, is said to have mentioned it in a speech in 1828. He said that haystacks
and cornfields were sent by Indiana, Ohio and Kentucky to Philadelphia and New York. Charles A.
Wickliffe, a member from Kentucky, questioned the statement by commenting that haystacks and
cornfields could not walk. Stewart then pointed out that he did not mean literal haystacks and cornfields,
but the horses, mules, and hogs for which the hay and corn were raised. Wickliffe then rose to his feet,
and said, "Mr. Speaker, I acknowledge the corn".
Of course, since infant mortality is dramatically down (as is the number of aristocrat males dying to
maintain the British empire), the probability of family names disappearing is now much lower than it was
in the 19th century. Interestingly, countries where family names were introduced long ago have
very few surnames (e.g., Koreans have about 250 surnames, and three surnames cover 45% of the population).
On the other hand, countries that introduced surnames more recently have dramatically more surnames
(for example, the Dutch have had surnames only for the last 200 years, and there are 68,000 different family
names).
Here we are going to look at a very specific variant of this problem. Imagine that we start with a
single male. A male has exactly two children, and each of them is a male with probability half (i.e., the
Y-chromosome is passed only to male children). The natural question is: what is the
probability that h generations down there is a male descendant all of whose ancestors are male (i.e., he
carries the original family name, and the original Y-chromosome)?
1/h − 1/(4h²) ≥ 1/(h + 1) ⟺ 4h(h + 1) − (h + 1) ≥ 4h² ⟺ 4h² + 4h − h − 1 ≥ 4h² ⟺ 3h ≥ 1,
which trivially holds.
Proof: The claim trivially holds for small values of h. Let h_j be the minimal index such that ρ_{h_j} ≤ 1/2^j.
It is easy to verify that ρ_{h_j} ≥ 1/2^{j+1}. As such,
(S, V \ S) = { uv | u ∈ S, v ∈ V \ S, and uv ∈ E },
Lemma 3.2.2. Let E_1, . . . , E_n be n events which are not necessarily independent. Then,
Pr[∩_{i=1}^{n} E_i] = Pr[E_1] · Pr[E_2 | E_1] · Pr[E_3 | E_1 ∩ E_2] · · · Pr[E_n | E_1 ∩ . . . ∩ E_{n−1}].
3.3. The Algorithm
Observation 3.3.1. A set of vertices in G/xy corresponds to a set of vertices in the graph G. Thus a
cut in G/xy always corresponds to a valid cut in G. However, there are cuts in G that do not exist in
G/xy. For example, the cut S = {x} does not exist in G/xy. As such, the size of the minimum cut in
G/xy is at least as large as the minimum cut in G (as long as G/xy has at least one edge), since any
cut in G/xy has a corresponding cut of the same cardinality in G.
Our algorithm works by repeatedly performing edge contractions. This is beneficial as this shrinks
the underlying graph, and we would compute the cut in the resulting (smaller) graph. An “extreme”
example of this, is shown in Figure 3.3, where we contract the graph into a single edge, which (in turn)
corresponds to a cut in the original graph. (It might help the reader to think about each vertex in the
contracted graph, as corresponding to a connected component in the original graph.)
Figure 3.3 also demonstrates the problem with taking this approach. Indeed, the resulting cut is not
the minimum cut in the graph.
So, why did the algorithm fail to find the minimum cut in this case?¬ The failure occurs because
of the contraction at Figure 3.3 (e), as we had contracted an edge in the minimum cut. In the new
graph, depicted in Figure 3.3 (f), there is no longer a cut of size 3, and all cuts are of size 4 or more.
Specifically, the algorithm succeeds only if it does not contract an edge in the minimum cut.
Observation 3.3.2. Let e1, . . . , en−2 be a sequence of edges in G, such that none of them is in the min-
imum cut, and such that G0 = G/{e1, . . . , en−2 } is a single multi-edge. Then, this multi-edge corresponds
to a minimum cut in G.
¬ Naturally, if the algorithm had succeeded in finding the minimum cut, this would have been our success.
Figure 3.3: (a) Original graph. (b)–(j) A sequence of contractions in the graph, and (h) the cut in the
original graph, corresponding to the single edge in (h). Note that the cut of (h) is not a mincut in the
original graph.
3.3.1. Analysis
3.3.1.1. The probability of success.
Naturally, if we are extremely lucky, the algorithm would never pick an edge in the mincut, and the
algorithm would succeed. The ultimate question here is what is the probability of success. If it is
relatively “large” then this algorithm is useful since we can run it several times, and return the best
result computed. If on the other hand, this probability is tiny, then we are working in vain since this
approach would not work.
Lemma 3.3.3. If a graph G has a minimum cut of size k and G has n vertices, then |E(G)| ≥ kn/2.
Proof: Each vertex degree is at least k, otherwise the vertex itself would form a cut of size
smaller than k. As such, there are at least Σ_{v∈V} degree(v)/2 ≥ nk/2 edges in the graph.
Lemma 3.3.4. If we pick an edge e uniformly at random from a graph G, then with probability at most 2/n it
belongs to the minimum cut.
Proof: There are at least nk/2 edges in the graph and exactly k edges in the minimum cut. Thus, the
probability of picking an edge from the minimum cut is at most k/(nk/2) = 2/n.
The following lemma shows (surprisingly) that MinCut succeeds with reasonable probability.
Lemma 3.3.5. MinCut outputs the mincut with probability ≥ 2/(n(n − 1)).
Proof: Let Ei be the event that ei is not in the minimum cut of Gi . By Observation 3.3.2, MinCut
outputs the minimum cut if the events E0, . . . , En−3 all happen (namely, all edges picked are outside the
minimum cut).
By Lemma 3.3.4, it holds that Pr[E_i | E_0 ∩ E_1 ∩ . . . ∩ E_{i−1}] ≥ 1 − 2/|V(G_i)| = 1 − 2/(n − i). This implies that
∆ = Pr[E_0 ∩ . . . ∩ E_{n−3}] = Pr[E_0] · Pr[E_1 | E_0] · Pr[E_2 | E_0 ∩ E_1] · · · Pr[E_{n−3} | E_0 ∩ . . . ∩ E_{n−4}].
As such, we have
∆ ≥ ∏_{i=0}^{n−3} (1 − 2/(n − i)) = ∏_{i=0}^{n−3} (n − i − 2)/(n − i) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · · · 2/4 · 1/3 = 2/(n(n − 1)).
Contract(G, t)
begin
    while |V(G)| > t do
        Pick a random edge e in G.
        G ← G/e
    return G
end

FastCut(G = (V, E))        /* G – multi-graph */
begin
    n ← |V(G)|
    if n ≤ 6 then
        Compute (via brute force) the minimum cut of G and return it.
    t ← ⌈1 + n/√2⌉
    H_1 ← Contract(G, t)
    H_2 ← Contract(G, t)        /* Contract is randomized!!! */
    X_1 ← FastCut(H_1)
    X_2 ← FastCut(H_2)
    return the minimum cut out of X_1 and X_2.
end

Figure 3.5: Contract(G, t) shrinks G till it has only t vertices. FastCut computes the minimum cut
using Contract.
Namely, this probability deteriorates very quickly toward the end of the execution, when the graph
becomes small. (To see this, observe that for ν = n/2 the probability of success is roughly 1/4,
but for ν = n − √n the probability of success is roughly 1/n.)
So, the key observation is that as the graph gets smaller, the probability of making a bad choice increases.
So, instead of doing the amplification from the outside of the algorithm, we will run the new algorithm
more times when the graph is smaller. Namely, we put the amplification directly into the algorithm.
The basic new operation we use is Contract, depicted in Figure 3.5, which also depicts the new
algorithm FastCut.
¬ This would require a more involved algorithm, that's life.
Lemma 3.4.1. The running time of FastCut(G) is O(n² log n), where n = |V(G)|.
Proof: Well, we perform two calls to Contract(G, t), which take O(n²) time. And then we perform two
recursive calls on the resulting graphs. We have the recurrence
T(n) = O(n²) + 2T(n/√2).
The solution to this recurrence is O(n² log n), as one can easily (and should) verify.
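To verify this quickly (a sketch of the standard unrolling, not spelled out in the original text): at depth i of the recursion there are 2^i subproblems, each on roughly n/2^{i/2} vertices, so

T(n) = Σ_{i=0}^{O(log n)} 2^i · O((n/2^{i/2})²) = Σ_{i=0}^{O(log n)} O(n²) = O(n² log n),

since each level of the recursion contributes O(n²) in total, and there are O(log n) levels (the number of vertices drops by a factor of √2 per level).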
Exercise 3.4.2. Show that one can modify FastCut so that it uses only O(n2 ) space.
Lemma 3.4.3. The probability that Contract(G, ⌈n/√2⌉) had not contracted an edge of the minimum cut is at least
1/2.
Namely, the probability that the minimum cut in the contracted graph is still a minimum cut in the
original graph is at least 1/2.
Proof: Just plug in ν = n − t = n − ⌈1 + n/√2⌉ into Eq. (3.2). We have
Pr[E_0 ∩ . . . ∩ E_{n−t}] ≥ t(t − 1)/(n(n − 1)) = ⌈1 + n/√2⌉ (⌈1 + n/√2⌉ − 1)/(n(n − 1)) ≥ 1/2.
Lemma 3.4.4. FastCut finds the minimum cut with probability Ω(1/log n).
Proof: Let T_h be the recursion tree of the algorithm, of depth h = Θ(log n). Color an edge of the recursion
tree black if the corresponding contraction succeeded (i.e., did not contract an edge of the minimum cut).
Clearly, the algorithm succeeds if there is a path from the root to a leaf that is all black. This is exactly
the setting of Lemma 3.1.1, and we conclude that the probability of success is at least 1/(h + 1) = Θ(1/log n),
as desired.
Exercise 3.4.5. Prove that running FastCut repeatedly c · log² n times guarantees that the algorithm
outputs the minimum cut with probability ≥ 1 − 1/n², say, for c a large enough constant.
Theorem 3.4.6. One can compute the minimum cut in a graph G with n vertices in O(n² log³ n) time.
The algorithm succeeds with probability ≥ 1 − 1/n².
Proof: We do amplification on FastCut by running it O(log² n) times. The running time bound follows
from Lemma 3.4.1. The bound on the probability follows from Lemma 3.4.4, and using the amplification
analysis as done in Lemma 3.3.9 for MinCutRep.
3.5.0.0.1. Galton-Watson process. The idea of using coloring of the edges of a tree to analyze
FastCut might be new (i.e., Section 3.1.2).
Chapter 4
4.1. Preliminaries
Definition 4.1.1 (Variance and Standard Deviation). For a random variable X, let V[X] = E[(X − µ_X)²] =
E[X²] − µ_X² denote the variance of X, where µ_X = E[X]. Intuitively, this tells us how concentrated the
distribution of X is.
The standard deviation of X, denoted by σ_X, is the quantity √(V[X]).
Observation 4.1.2. (i) For any constant c ≥ 0, we have V[cX] = c² V[X].
(ii) For X and Y independent variables, we have V[X + Y] = V[X] + V[Y].
Definition 4.1.3 (Bernoulli distribution). Assume that one flips a coin and gets 1 (head) with probability
p, and 0 (i.e., tail) with probability q = 1 − p. Let X be this random variable. The variable X has a
Bernoulli distribution with parameter p.
We have that E[X] = 1 · p + 0 · (1 − p) = p, and
V[X] = E[X²] − µ_X² = E[X²] − p² = p − p² = p(1 − p) = pq.
Definition 4.1.4 (Binomial distribution). Assume that we repeat a Bernoulli experiment n times (independently!).
Let X_1, . . . , X_n be the resulting random variables, and let X = X_1 + · · · + X_n. The variable X has
the binomial distribution with parameters n and p. We denote this fact by X ∼ Bin(n, p). We have
b(k; n, p) = Pr[X = k] = \binom{n}{k} p^k q^{n−k}.
Also, E[X] = np, and V[X] = V[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} V[X_i] = npq.
Observation 4.1.5. Let C_1, . . . , C_n be random events (not necessarily independent). Then
Pr[∪_{i=1}^{n} C_i] ≤ Σ_{i=1}^{n} Pr[C_i].
(This is usually referred to as the union bound.) If C_1, . . . , C_n are disjoint events then
Pr[∪_{i=1}^{n} C_i] = Σ_{i=1}^{n} Pr[C_i].
For a geometric random variable X with parameter p (i.e., the number of independent coin flips until the
first head, where each flip is head with probability p), we have
E[X] = Σ_{i=1}^{∞} i (1 − p)^{i−1} p = p f′(1 − p) = p/(1 − (1 − p))² = 1/p,
where f(x) = Σ_{i=0}^{∞} x^i = 1/(1 − x), and thus f′(x) = Σ_{i=1}^{∞} i x^{i−1} = 1/(1 − x)². Similarly,
V[X] = E[X²] − 1/p² = Σ_{i=1}^{∞} i² (1 − p)^{i−1} p − 1/p² = p + p(1 − p) Σ_{i=2}^{∞} i² (1 − p)^{i−2} − 1/p².
We need to do a similar trick to what we did before. To this end, we observe that
f″(x) = Σ_{i=2}^{∞} i(i − 1) x^{i−2} = ((1 − x)^{−1})″ = 2/(1 − x)³.
As such, we have that
∆(x) = Σ_{i=2}^{∞} i² x^{i−2} = Σ_{i=2}^{∞} i(i − 1) x^{i−2} + Σ_{i=2}^{∞} i x^{i−2} = f″(x) + (1/x) Σ_{i=2}^{∞} i x^{i−1} = f″(x) + (1/x)(f′(x) − 1)
= 2/(1 − x)³ + (1/x)(1/(1 − x)² − 1) = 2/(1 − x)³ + (1 − (1 − x)²)/(x(1 − x)²) = 2/(1 − x)³ + x(2 − x)/(x(1 − x)²)
= 2/(1 − x)³ + (2 − x)/(1 − x)².
As such, we have that
V[X] = p + p(1 − p)∆(1 − p) − 1/p² = p + p(1 − p)(2/p³ + (1 + p)/p²) − 1/p² = p + 2(1 − p)/p² + (1 − p²)/p − 1/p²
= (p³ + 2(1 − p) + p − p³ − 1)/p² = (1 − p)/p².
4.1.2. Some needed math
Lemma 4.1.8. For any positive integer n, we have:
(i) (1 + 1/n)^n ≤ e.
(ii) (1 − 1/n)^{n−1} ≥ e^{−1}.
(iii) n! ≥ (n/e)^n.
(iv) For any k ≤ n, we have: (n/k)^k ≤ \binom{n}{k} ≤ (ne/k)^k.

Proof: (i) Indeed, 1 + 1/n ≤ exp(1/n), since 1 + x ≤ e^x for x ≥ 0. As such, (1 + 1/n)^n ≤ exp(n(1/n)) = e.
(ii) Rewriting the inequality, we need to prove ((n − 1)/n)^{n−1} ≥ 1/e. This is equivalent to proving
e ≥ (n/(n − 1))^{n−1} = (1 + 1/(n − 1))^{n−1}, which is our friend from (i).
(iii) Indeed,
n^n/n! ≤ Σ_{i=0}^{∞} n^i/i! = e^n,
by the Taylor expansion of e^x = Σ_{i=0}^{∞} x^i/i!. This implies that (n/e)^n ≤ n!, as required.
(iv) Indeed, for any k ≤ n, we have n/k ≤ (n − 1)/(k − 1), since kn − n = n(k − 1) ≤ k(n − 1) = kn − k. As such,
n/k ≤ (n − i)/(k − i), for 1 ≤ i ≤ k − 1, and thus (n/k)^k ≤ ∏_{i=0}^{k−1} (n − i)/(k − i) = \binom{n}{k}. Also,
\binom{n}{k} ≤ n^k/k! ≤ n^k/(k/e)^k = (ne/k)^k,
by (iii).
Let X_i be the number of balls in the ith bin, when we throw n balls into n bins (i.e., m = n). Clearly,
E[X_i] = Σ_{j=1}^{n} Pr[the jth ball falls in the ith bin] = n · (1/n) = 1,
by linearity of expectation. The probability that the first bin has exactly i balls is
\binom{n}{i} (1/n)^i (1 − 1/n)^{n−i} ≤ \binom{n}{i} (1/n)^i ≤ (ne/i)^i (1/n)^i = (e/i)^i.
This follows by Lemma 4.1.8 (iv).
Let C_j(k) be the event that the jth bin has k or more balls in it. Then,
Pr[C_1(k)] ≤ Σ_{i=k}^{n} (e/i)^i ≤ (e/k)^k (1 + e/k + e²/k² + · · ·) = (e/k)^k · 1/(1 − e/k).
Let k* = ⌈(3 ln n)/ln ln n⌉. Then,
Pr[C_1(k*)] ≤ (e/k*)^{k*} · 1/(1 − e/k*) ≤ 2 ((e ln ln n)/(3 ln n))^{k*} = 2 exp((1 − ln 3 − ln ln n + ln ln ln n) k*)
≤ 2 (exp(−ln ln n + ln ln ln n))^{k*} ≤ 2 exp(−3 ln n + 6 ln n · (ln ln ln n)/(ln ln n)) ≤ 2 exp(−2.5 ln n) ≤ 1/n²,
for n large enough. We conclude, since there are n bins and they have identical distributions, that
Pr[any bin contains more than k* balls] ≤ Σ_{i=1}^{n} Pr[C_i(k*)] ≤ 1/n.

Theorem 4.2.2. With probability at least 1 − 1/n, no bin has more than k* = ⌈(3 ln n)/(ln ln n)⌉ balls in it.
Exercise 4.2.3. Show that for m = n ln n, with probability 1 − o(1), every bin has O(log n) balls.
It is interesting to note, that if at each iteration we randomly pick d bins, and throw the ball into
the bin with the smallest number of balls, then one can do much better. We currently do not have the
machinery to prove the following theorem, but hopefully we would prove it later in the course.
Theorem 4.2.4. Suppose that n balls are sequentially placed into n bins in the following manner. For
each ball, d ≥ 2 bins are chosen independently and uniformly at random (with replacement). Each ball
is placed in the least full of the d bins at the time of placement, with ties broken randomly. After all the
balls are placed, the maximum load of any bin is at most ln ln n/(ln d) + O(1), with probability at least
1 − o(1/n).
Note, even by setting d = 2, we get considerable improvement. A proof of this theorem can be found in
the work by Azar et al. [ABKU00].
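A small simulation sketch (not from the notes) contrasting one random choice against d = 2 choices; the constants, tie-breaking by list order, and the printed comparison are illustrative.

import random

def max_load(n, d):
    # Throw n balls into n bins; each ball goes to the least loaded of d randomly chosen bins.
    bins = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda b: bins[b])   # ties broken by list order here
        bins[target] += 1
    return max(bins)

n = 100_000
print("one choice :", max_load(n, 1))   # about 3 ln n / ln ln n
print("two choices:", max_load(n, 2))   # about ln ln n / ln 2 + O(1)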
thus for m = ⌈√(2n) + 1⌉, the probability that all the m balls fall into different bins is smaller than 1/e.
This is sometimes referred to as the birthday paradox. You have m = 30 people in the room, and
you ask them for the date (day and month) of their birthday (i.e., n = 365). The above shows that the
probability of all birthdays being distinct is at most exp(−30 · 29/730) ≤ 1/e. Namely, there is more than a
50% chance for a birthday collision, a simple but counterintuitive phenomenon.
Theorem 4.3.1 (Markov's Inequality). Let Y be a random variable assuming only non-negative values.
Then for all t > 0, we have
Pr[Y ≥ t] ≤ E[Y]/t.
Proof: Indeed,
E[Y] = Σ_{y≥t} y Pr[Y = y] + Σ_{y<t} y Pr[Y = y] ≥ Σ_{y≥t} y Pr[Y = y] ≥ Σ_{y≥t} t Pr[Y = y] = t Pr[Y ≥ t].
Exercise 4.3.2. For any (integer) k > 1, define a random positive variable X_k such that
Pr[X_k ≥ k E[X_k]] = 1/k.
Theorem 4.3.3 (Chebyshev's inequality). Pr[|X − µ_X| ≥ tσ_X] ≤ 1/t², where µ_X = E[X] and σ_X = √(V[X]).
Proof: Set Y = (X − µ_X)². Clearly, E[Y] = σ_X². Now, apply Markov's inequality to Y.
Let C_i ∈ {1, . . . , n} be the coupon picked in the ith trial. The jth trial is a success if C_j was not
picked before in the first j − 1 trials. Let X_i denote the number of trials from the ith success until after
the (i + 1)th success. Clearly, the number of trials performed is
X = Σ_{i=0}^{n−1} X_i.
Since lim_{n→∞} Σ_{i=1}^{n} 1/i² = π²/6, we have lim_{n→∞} V[X]/n² = π²/6.
Corollary 4.4.1. Let X be the number of rounds till we collect all n coupons. Then V[X] ≈ (π²/6) n²,
and its standard deviation is σ_X ≈ (π/√6) n.
This implies a weak bound on the concentration of X, using Chebyshev's inequality, but it is quite a bit
weaker than what we implied we can do. Indeed, we have
Pr[X ≥ n log n + n + t · n · π/√6] ≤ Pr[|X − E[X]| ≥ tσ_X] ≤ 1/t².
Note that this is somewhat approximate, and holds for n sufficiently large.
4.5. Notes
The material in this note covers parts of [MR95, sections 3.1,3.2,3.6]
Chapter 5
Similarly, we have that Pr[E[Y] − t√m/2 ≤ Y ≤ E[Y] + t√m/2] ≥ 1 − 1/t².
Proof: Let Y_i = ψ(r_i) be an indicator variable that is 1 if the ith sample r_i has the property ψ. Now,
Y = Σ_i Y_i has a binomial distribution with probability p = α/n and m samples; that is, Y ∼ Bin(m, p). We
saw in the previous lecture that E[Y] = mp, V[Y] = mp(1 − p), and as such its standard deviation is
σ_Y = √(mp(1 − p)) ≤ √m/2, as p(1 − p) is maximized for p = 1/2. We have
∆ = tσ_Y (n/m) ≤ t (n/m)(√m/2) = t n/(2√m),
since p(1 − p) = (α/n)(1 − α/n) is maximized for α = n/2. As such,
Pr[|Z − α| ≥ t n/(2√m)] ≤ Pr[|Z − α| ≥ ∆] = Pr[|Y − (m/n)α| ≥ (m/n)∆] = Pr[|Y − E[Y]| ≥ tσ_Y] ≤ 1/t²,
by Chebyshev's inequality.
in R is larger than s k .
One can conceptually think about the interval I(k) = [r−, r+] as a confidence interval – we know that
s_k ∈ I(k) with probability ≥ 1 − 1/t². But how big is this interval? Namely, how many elements are
there in I(k) ∩ Sample?
To this end, consider the interval of ranks in the sample that might contain the kth element. By the
above, this is
I(k, t) = [(m/n)k − t√m/2 − 1, (m/n)k + t√m/2 + 1].
In particular, consider the maximum ν ≤ k such that I(ν, t) and I(k, t) are disjoint. We have the
condition that
(m/n)ν + t√m/2 + 1 ≤ (m/n)k − t√m/2 − 1  ⟹  ν ≤ k − t n/√m − 2n/m.
Setting g = k − t n/√m − 2n/m and h = k + t n/√m + 2n/m, we have that I(g, t), I(k, t) and I(h, t) are all disjoint
Func LazySelect(S, k)
Input: S – set of n elements, k – index of element to be output.
begin
    repeat
        R ← Sample (with replacement) of n^{3/4} elements from S, together with {−∞, +∞}.
        Sort R.
        ℓ ← max(1, ⌊k n^{−1/4} − √n⌋),   h ← min(n^{3/4}, ⌈k n^{−1/4} + √n⌉)
        a ← R_(ℓ), b ← R_(h).
        Compute the ranks r_S(a) and r_S(b) of a and b in S   /* using 2n comparisons */
        P ← { y ∈ S | a ≤ y ≤ b }   /* done while computing the ranks of a and b */
    until (r_S(a) ≤ k ≤ r_S(b)) and |P| ≤ 8n^{3/4} + 2
    Sort P in O(n^{3/4} log n) time.
    return P_{k − r_S(a) + 1}
end LazySelect
Lemma 5.1.2. Given a set U of n numbers, a number k, and parameters t and m, one can compute,
in O(m log m) time, two numbers r−, r+ ∈ U, such that:
(A) The element of rank k in U is in the interval I = [r−, r+].
(B) There are at most O(tn/√m) numbers of U in I.
The algorithm succeeds with probability ≥ 1 − 3/t².
Proof: Compute the sample in O(m) time (assuming the input numbers are stored in an array, say). Next,
sort the numbers of R in O(m log m) time, and return the two elements of rank ℓ− and ℓ+ in the sorted
sample as the boundaries of the interval. The correctness follows from the above discussion.
We next use the above observation to get a fast algorithm for selection.
where α(j) = j · n^{−1/4} − √n and β(j) = j · n^{−1/4} + √n.
Lemma 5.1.4. For a fixed j, we have that Pr[S_(j) ∈ I(j)] ≥ 1 − 1/(4n^{1/4}).
Proof: There are two possible bad events: (i) S_(j) < R_(α(j)), and (ii) R_(β(j)) < S_(j). Let X_i be an indicator
variable which is 1 if the ith sample is smaller than or equal to S_(j), and 0 otherwise. We have
p = Pr[X_i = 1] = j/n and q = 1 − j/n. The random variable X = Σ_{i=1}^{n^{3/4}} X_i is the rank of S_(j) in the
random sample. Clearly, X ∼ Bin(n^{3/4}, j/n) (i.e., X has a binomial distribution with p = j/n and n^{3/4}
trials). As such, we have E[X] = p n^{3/4} and V[X] = n^{3/4} pq.
Now, by Chebyshev's inequality,
Pr[|X − p n^{3/4}| ≥ t √(n^{3/4} pq)] ≤ 1/t².
Since p n^{3/4} = j n^{−1/4} and √(n^{3/4}(j/n)(1 − j/n)) ≤ n^{3/8}/2, we have that the probability of either bad
event is
Pr[S_(j) < R_(α(j)) or R_(β(j)) < S_(j)] = Pr[X < j n^{−1/4} − √n or X > j n^{−1/4} + √n]
= Pr[|X − j n^{−1/4}| ≥ 2n^{1/8} · n^{3/8}/2] ≤ 1/(2n^{1/8})² = 1/(4 n^{1/4}).
Lemma 5.1.5. LazySelect succeeds with probability ≥ 1 − O(n^{−1/4}) in the first iteration, and it performs
only 2n + o(n) comparisons.
Proof: By Lemma 5.1.4, we know that S_(k) ∈ I(k) with probability ≥ 1 − 1/(4n^{1/4}). This in turn implies
that S_(k) ∈ P. Thus, the only possible bad event is that the set P is too large. To this end, set
k⁻ = k − 3n^{3/4} and k⁺ = k + 3n^{3/4}, and observe that, by definition, it holds that I(k⁻) ∩ I(k) = ∅ and
I(k) ∩ I(k⁺) = ∅. As such, we know by Lemma 5.1.4 that S_(k⁻) ∈ I(k⁻) and S_(k⁺) ∈ I(k⁺), and this holds
with probability ≥ 1 − 2/(4n^{1/4}). As such, the set P, which is by definition contained in the range I(k), has
only elements that are larger than S_(k⁻) and smaller than S_(k⁺). As such, the size of P is bounded by
k⁺ − k⁻ = 6n^{3/4}. Thus, the algorithm succeeds in the first iteration with probability ≥ 1 − 3/(4n^{1/4}).
As for the number of comparisons, an iteration requires
O(n^{3/4} log n) + 2n + O(n^{3/4} log n) = 2n + o(n)
comparisons.
Any deterministic selection algorithm requires 2n comparisons in the worst case, and LazySelect can be
changed to require only 1.5n + o(n) comparisons (expected).
Lemma 5.2.1. For x ≥ 0, we have 1− x ≤ exp(−x) and 1+ x ≤ e x . Namely, for all x, we have 1+ x ≤ e x .
Proof: For x = 0 we have equality. Next, computing the derivative on both sides, we have that we need
to prove that −1 ≤ − exp(−x) ⇐⇒ 1 ≥ exp(−x) ⇐⇒ e x ≥ 1, which clearly holds for x ≥ 0.
A similar argument works for the second inequality.
Lemma 5.2.2. For any y ≥ 1 and |x| ≤ 1, we have (1 − x²)^y ≥ 1 − yx².
Proof: Observe that the inequality holds with equality for x = 0. So compute the derivative with respect to x of both
sides of the inequality. We need to prove that
y(−2x)(1 − x²)^{y−1} ≥ −2yx  ⟺  (1 − x²)^{y−1} ≤ 1,
which holds for x ∈ [0, 1] (both sides are even in x, so this case suffices).
Lemma 5.2.3. For any y ≥ 1 and |x| ≤ 1, we have
(1 − x²y) e^{xy} ≤ (1 + x)^y ≤ e^{xy}.
Proof: The right side of the inequality is standard by now. As for the left side, observe that
(1 − x²) e^x ≤ 1 + x,
since dividing both sides by (1 + x)e^x, we get 1 − x ≤ e^{−x}, which we know holds for any x. By Lemma 5.2.2,
we have
(1 − x²y) e^{xy} ≤ (1 − x²)^y e^{xy} = ((1 − x²) e^x)^y ≤ (1 + x)^y ≤ e^{xy}.
for any t.
A stronger bound follows from the following observation. Let Z_i^r denote the event that the ith
coupon was not picked in the first r trials. Clearly,
Pr[Z_i^r] = (1 − 1/n)^r ≤ exp(−r/n).
Thus, for r = βn log n, we have Pr[Z_i^r] ≤ exp(−(βn log n)/n) = n^{−β}. Thus,
Pr[X > βn log n] ≤ Pr[∪_i Z_i^{βn log n}] ≤ n · Pr[Z_1^{βn log n}] ≤ n^{−β+1}.
Lemma 5.2.4. Let the random variable X denote the number of trials for collecting each of the n types
of coupons. Then, we have Pr[X > n ln n + cn] ≤ e^{−c}.
Proof: The probability we fail to pick the first type of coupon in m = n ln n + cn trials is
α = (1 − 1/n)^m ≤ exp(−m/n) = exp(−ln n − c) = exp(−c)/n. As such, using the union bound, the probability
we fail to collect all n types of coupons is bounded by nα = exp(−c), as claimed.
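A tiny simulation sketch (not from the notes) of this tail bound; it estimates Pr[X > n ln n + cn] and compares it with e^{−c} and with the limit 1 − exp(−e^{−c}) discussed next. The parameters are illustrative.

import math
import random

def coupon_rounds(n):
    # Number of uniform draws from {0,...,n-1} until every value has appeared.
    seen, rounds = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        rounds += 1
    return rounds

n, c, trials = 1000, 1.0, 2000
threshold = n * math.log(n) + c * n
tail = sum(coupon_rounds(n) > threshold for _ in range(trials)) / trials
print(f"empirical tail: {tail:.3f}, upper bound e^-c: {math.exp(-c):.3f}, "
      f"limit 1-exp(-e^-c): {1 - math.exp(-math.exp(-c)):.3f}")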
In the following, we show a slightly stronger bound on the probability, which is 1 − exp(−e−c ). To
see that it is indeed stronger, observe that e−c ≥ 1 − exp(−e−c ).
Observe also that lim_{n→∞} (1 − k²/n²)^m = 1, and exp(−km/n) = n^{−k} exp(−ck). Also,
lim_{n→∞} \binom{n}{k} k!/n^k = lim_{n→∞} n(n − 1) · · · (n − k + 1)/n^k = 1.
Thus,
lim_{n→∞} \binom{n}{k} (1 − k/n)^m = lim_{n→∞} (n^k/k!) exp(−km/n) = lim_{n→∞} (n^k/k!) n^{−k} exp(−ck) = exp(−ck)/k!.
Theorem 5.2.6. Let the random variable X denote the number of trials for collecting each of the n
types of coupons. Then, for any constant c ∈ R, and m = n ln n + cn, we have lim_{n→∞} Pr[X > m] =
1 − exp(−e^{−c}).
Before delving into the proof, observe that 1 − exp(−e^{−c}) ≈ 1 − (1 − e^{−c}) = e^{−c}. Namely, in the limit,
the upper bound of Lemma 5.2.4 is tight.
Proof: We have Pr[X > m] = Pr[∪_i Z_i^m]. By inclusion-exclusion, we have
Pr[∪_i Z_i^m] = Σ_{i=1}^{n} (−1)^{i+1} P_i^n,
where P_j^n = Σ_{1 ≤ i_1 < i_2 < ... < i_j ≤ n} Pr[∩_{v=1}^{j} Z_{i_v}^m]. Let S_k^n = Σ_{i=1}^{k} (−1)^{i+1} P_i^n. We know that
S_{2k}^n ≤ Pr[∪_i Z_i^m] ≤ S_{2k+1}^n.
46
By symmetry,
P_k^n = \binom{n}{k} Pr[∩_{v=1}^{k} Z_v^m] = \binom{n}{k} (1 − k/n)^m,
and, by the limits computed above, P_j = lim_{n→∞} P_j^n = exp(−cj)/j!. As such,
S_k = Σ_{j=1}^{k} (−1)^{j+1} P_j = Σ_{j=1}^{k} (−1)^{j+1} · exp(−cj)/j!.
Observe that lim_{k→∞} S_k = 1 − exp(−e^{−c}), by the Taylor expansion of exp(x) (for x = −e^{−c}). Indeed,
Σ_{j=1}^{∞} (−1)^{j+1} exp(−cj)/j! = 1 − Σ_{j=0}^{∞} (−e^{−c})^j/j! = 1 − exp(−e^{−c}).
Clearly, lim_{n→∞} S_k^n = S_k and lim_{k→∞} S_k = 1 − exp(−e^{−c}). Thus (using fluffy math), we have
lim_{n→∞} Pr[X > m] = lim_{n→∞} Pr[∪_{i=1}^{n} Z_i^m] = lim_{n→∞} lim_{k→∞} S_k^n = lim_{k→∞} S_k = 1 − exp(−e^{−c}).
Chapter 6
Namely, pairwise independent variables behave like independent random variables as long as you
look only at pairs.
Example 6.1.2. Consider the probability space shown in the following table, where the triple of
variables X, Y, Z is assigned any of the rows with equal probability (i.e., 1/4):
X Y Z
0 0 0
0 1 1
1 0 1
1 1 0
Clearly, for any α, β ∈ {0, 1} we have Pr[(X = α) ∩ (Y = β)] = Pr[X = α] Pr[Y = β] = 1/4 (this also holds
for the pairs X, Z and Y, Z). Namely, X, Y, Z are all pairwise independent. However, they are not 3-wise
independent (or just independent). Indeed, we have Pr[(X = 1) ∩ (Y = 1) ∩ (Z = 1)] = 0, while it should have
been 1/8 if they were truly independent, or even just 3-wise independent.
Lemma 6.1.3. Let p be a prime number, and let i, y ∈ ZZ_p. If a and b are chosen uniformly and
independently at random from ZZ_p, then Pr[ai + b ≡ y (mod p)] = 1/p.
Proof: Imagine that we first choose a; then the required probability is that we choose b such that
b ≡ y − ai (mod p). The probability for that is 1/p, as we choose b uniformly.
Lemma 6.1.4. Let p be a prime, and fix a ∈ {1, . . . , p − 1}. Then {ai (mod p) | i = 0, . . . , p − 1} = ZZ_p.
Putting it differently, for any non-zero a ∈ ZZ_p, there is a unique inverse b ∈ ZZ_p such that ab ≡ 1 (mod p).
Proof: Assume, for the sake of contradiction, that the claim is false. Then, by the pigeonhole principle, there must exist 1 ≤ j < i ≤ p − 1 such that ai (mod p) = aj (mod p). Namely, there are k, k', u such that
ai = u + kp and aj = u + k'p.
(Here, we know that 0 ≤ k < p, 0 ≤ k' < p and 0 ≤ u < p.) Since i > j it must be that k > k'. Subtracting the two equalities, we get that a(i − j) = (k − k')p > 0. Now, i − j must be larger than one, since if i − j = 1 then a = (k − k')p ≥ p, which is impossible. Similarly, i − j < p. Also, i − j cannot divide p, since p is a prime. Thus, it must be that i − j divides k − k'. So, let us set β = (k − k')/(i − j) ≥ 1. This implies that a = βp ≥ p, which is impossible. Thus, our assumption is false.
Lemma 6.1.5. Given y, z, x, w ∈ ZZ_p, such that x ≠ w, and choosing a and b randomly and uniformly from ZZ_p, the probability that y ≡ ax + b (mod p) and z ≡ aw + b (mod p) is 1/p².

Proof: This is equivalent to claiming that the system of equations y ≡ ax + b (mod p) and z ≡ aw + b (mod p) has a unique solution in a and b.
To see why this is true, subtract one equation from the other. We get y − z ≡ a(x − w) (mod p). Since x − w ≢ 0 (mod p), there is a unique value of a such that the equation holds. This, in turn, implies a specific value for b. The probability that a and b get those two specific values is 1/p².
Lemma 6.1.6. Let i and j be two distinct elements of ZZ_p, and choose a and b randomly and independently from ZZ_p. Then, the two random variables Y_i = ai + b (mod p) and Y_j = aj + b (mod p) are uniformly distributed on ZZ_p, and are pairwise independent.

Proof: The claim about the uniform distribution follows from Lemma 6.1.3, as Pr[Y_i = α] = 1/p, for any α ∈ ZZ_p. As for being pairwise independent, observe that
Pr[Y_i = α | Y_j = β] = Pr[(Y_i = α) ∩ (Y_j = β)] / Pr[Y_j = β] = (1/p²)/(1/p) = 1/p = Pr[Y_i = α],
by Lemma 6.1.3 and Lemma 6.1.5. Thus, Y_i and Y_j are pairwise independent.
Remark 6.1.7. It is important to understand what independence between random variables means: having information about the value of X gives you no information about Y. But this is only pairwise independence. Indeed, consider the variables Y_1, Y_2, Y_3, Y_4 defined above. Every pair of them is independent. But, given the values of Y_1 and Y_2, one can compute the values of Y_3 and Y_4 immediately. Indeed, knowing the values of Y_1 and Y_2 is enough to figure out the values of a and b. Once we know a and b, we can immediately compute all the Y_i's.
Thus, the notion of independence can be extended to k-wise independence of n random variables, where every k of the variables are mutually independent, even though knowing the values of k of them might determine all the others. More on that later in the course.
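The construction of Lemma 6.1.6 is easy to check experimentally. The following Python sketch (the helper name is ours) enumerates all choices of a and b for a small prime p and verifies that each pair (Y_i, Y_j) takes every value pair exactly once, i.e., with probability 1/p² — exactly pairwise independence.

from itertools import product
from collections import Counter

p = 7  # a small prime

def joint_counts(i: int, j: int):
    """Joint distribution of (Y_i, Y_j) over uniformly random a, b in Z_p."""
    counts = Counter()
    for a, b in product(range(p), repeat=2):
        counts[((a * i + b) % p, (a * j + b) % p)] += 1
    return counts

if __name__ == "__main__":
    counts = joint_counts(2, 5)
    # Pairwise independence: every value pair shows up exactly once out of p^2
    # choices of (a, b), i.e., Pr[(Y_2, Y_5) = (alpha, beta)] = 1/p^2.
    assert all(counts[(x, y)] == 1 for x, y in product(range(p), repeat=2))
    print("all", p * p, "value pairs are equally likely -> pairwise independent")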
Lemma 6.1.8. If X and Y are pairwise independent, then E[XY] = E[X] E[Y].

Proof: By definition,
E[XY] = Σ_{x,y} xy Pr[(X = x) ∩ (Y = y)] = Σ_{x,y} xy Pr[X = x] Pr[Y = y]
= (Σ_x x Pr[X = x]) (Σ_y y Pr[Y = y]) = E[X] E[Y].

Lemma 6.1.9. Let X_1, X_2, . . . , X_n be pairwise independent random variables, and let X = Σ_{i=1}^n X_i. Then V[X] = Σ_{i=1}^n V[X_i].
Proof: Observe that V[X] = E[(X − E[X])²] = E[X²] − (E[X])². Let X and Y be pairwise independent variables. Observe that E[XY] = E[X] E[Y], as can be easily verified. Thus,
V[X + Y] = E[(X + Y − E[X] − E[Y])²]
= E[(X + Y)² − 2(X + Y)(E[X] + E[Y]) + (E[X] + E[Y])²]
= E[(X + Y)²] − (E[X] + E[Y])²
= E[X² + 2XY + Y²] − (E[X])² − 2 E[X] E[Y] − (E[Y])²
= E[X²] − (E[X])² + E[Y²] − (E[Y])² + 2 E[XY] − 2 E[X] E[Y]
= V[X] + V[Y] + 2 E[X] E[Y] − 2 E[X] E[Y]
= V[X] + V[Y],
by Lemma 6.1.8. Using the above argumentation for several variables, instead of just two, implies the lemma.
Definition 6.1.10. The class RP (for Randomized Polynomial time) consists of all languages L that have a deterministic algorithm Alg(x, r) with worst case polynomial running time such that for any input x ∈ Σ*,
• x ∈ L =⇒ Alg(x, r) = 1 for at least half of the possible values of r.
• x ∉ L =⇒ Alg(x, r) = 0 for all values of r.
Let us assume that we now want to minimize the number of random bits we use in the execution of the algorithm (why?). If we run the algorithm t times independently, the probability that all runs fail is at most 2^{−t}, while using t log n random bits (assuming our random algorithm needs only log n bits in each execution). Alternatively, we could choose two independent random numbers a, b from ZZ_n, and run Alg(x, a) and Alg(x, b); this drives the failure probability down only to 1/4, while requiring 2 log n bits.
Can we do better? Let us define r_i = ai + b mod n, where a, b are random values as above (note that we assume that n is prime), for i = 1, . . . , t. Thus Y = Σ_{i=1}^t Alg(x, r_i) is a sum of random variables which are pairwise independent, as the r_i are pairwise independent. Assume that x ∈ L. Then we have E[Y] = t/2, V[Y] = Σ_{i=1}^t V[Alg(x, r_i)] ≤ t/4, and σ_Y ≤ √t/2. The probability that all those runs fail (i.e., Y = 0) is
Pr[Y = 0] ≤ Pr[|Y − E[Y]| ≥ t/2] ≤ V[Y]/(t/2)² ≤ (t/4)/(t²/4) = 1/t,
by the Chebyshev inequality. Thus we were able to “extract” from our random bits much more than one would naturally suspect is possible. We thus get the following result.

Lemma 6.1.11. Given an algorithm Alg in RP that uses lg n random bits, one can run it t times, using only 2 lg n random bits overall, such that the resulting algorithm fails with probability at most 1/t.
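A sketch of this amplification in Python, assuming a black-box one-sided-error test alg(x, r) (a placeholder of ours, not a routine from the text): we draw only two random seeds a, b ∈ ZZ_n, run the test on r_i = ai + b mod n for i = 1, . . . , t, and accept if any run accepts. By the Chebyshev argument above, on a YES instance the failure probability is at most 1/t, while only 2 log n random bits are consumed.

import random

def amplify_pairwise(alg, x, n: int, t: int) -> bool:
    """Run a one-sided-error test alg(x, r) on t pairwise independent seeds.

    n is assumed to be prime; alg never accepts a NO instance, so a single
    accepting run is conclusive.
    """
    a = random.randrange(n)
    b = random.randrange(n)
    seeds = ((a * i + b) % n for i in range(1, t + 1))
    return any(alg(x, r) for r in seeds)

# Toy stand-in for such a test: "accepts" roughly half the seeds,
# mimicking the behavior on a YES instance.
def toy_alg(x, r):
    return r % 2 == 0

if __name__ == "__main__":
    n, t, runs = 101, 20, 10_000
    failures = sum(not amplify_pairwise(toy_alg, "yes-instance", n, t)
                   for _ in range(runs))
    print("empirical failure rate:", failures / runs, " guaranteed bound 1/t =", 1 / t)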
Pr[α participates in more than c ln n levels of the recursion] = Pr[X_{c ln n} ≥ 1] ≤ E[X_{c ln n}]/1 ≤ (7/8)^{c ln n} · n ≤ 1/n^{β+1},
if c ln(8/7) ln n ≥ β ln n ⇐⇒ c ≥ β/ln(8/7). We conclude the following.
Theorem 6.2.1. For any β ≥ 1, we have that the running time of QuickSort sorting n elements is
O(βn log n), with probability ≥ 1 − 1/n β .
Proof: For c = β/ln(8/7), the probability that an element participates in more than c ln n levels of the recursion is at most 1/n^{β+1}. Since there are n elements, by the union bound, this bounds the probability that any input number participates in more than c ln n recursive calls. But that implies that the recursion depth of QuickSort is ≤ c ln n with probability ≥ 1 − 1/n^β, which immediately implies the claim.
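A quick experiment (a sketch of ours, not part of the text) that tracks how many levels of randomized QuickSort each element participates in, and compares the maximum observed depth with a small multiple of ln n.

import math
import random

def quicksort_depths(items, depth=0, depths=None):
    """Randomized QuickSort that records the deepest recursion level of each element."""
    if depths is None:
        depths = {}
    if len(items) <= 1:
        for x in items:
            depths[x] = depth
        return depths
    pivot = random.choice(items)
    for x in items:
        depths[x] = depth
    left = [x for x in items if x < pivot]
    right = [x for x in items if x > pivot]
    quicksort_depths(left, depth + 1, depths)
    quicksort_depths(right, depth + 1, depths)
    return depths

if __name__ == "__main__":
    n = 10_000
    depths = quicksort_depths(list(range(n)))
    print("max participation depth:", max(depths.values()))
    print("ln n =", round(math.log(n), 2), "  4 ln n =", round(4 * math.log(n), 2))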
What the above proof shows is that an element can not be too unlucky – if it participates in enough
rounds, then, with high probability, the subproblem containing it would shrink significantly. This fairness
of luck is one of the most important principles in randomized algorithms, and we next formalize it by
proving a rather general theorem on the “concentration” of luck.
Chapter 7
Consider the binomial distribution Bin(n, 1/2) for various values of n as depicted in Figure 7.1 – here we think about the value of the variable as the number of heads in flipping a fair coin n times. Clearly, as the value of n increases, the probability of getting a number of heads that is significantly smaller or larger than n/2 is tiny. Here we are interested in quantifying exactly how far we can deviate from this expected value. Specifically, if X ∼ Bin(n, 1/2), then we would be interested in bounding the probability Pr[X > n/2 + ∆], where ∆ = tσ_X = t√n/2 (i.e., we are t standard deviations away from the expectation). For t > 2, this probability is roughly 2^{−t}, which is what we prove here.
More surprisingly, if you look only at the middle of the distribution, it looks the same after clipping away the uninteresting tails; see Figure 7.2 – that is, it looks more and more like the normal distribution. This is a universal phenomenon known as the central limit theorem – every sum of nicely behaved random variables behaves like the normal distribution. We unfortunately need a more precise quantification of this behavior, thus the following.
7.1.2.1.1. The game. Consider the game where a player starts with Y0 = 1 dollars. At every round,
the player can bet a certain amount x (fractions are fine). With probability half she loses her bet, and
with probability half she gains an amount equal to her bet. The player is not allowed to go all in –
because if she loses then the game is over. So it is natural to ask what her optimal betting strategy is,
such that in the end of the game she has as much money as possible.
Figure 7.1: The binomial distribution for different values of n (panels for n = 8, 16, 32, 64, and larger). It pretty quickly concentrates around its expectation. [Plots omitted.]
Figure 7.2: The “middle” of the binomial distribution for different values of n (panels for n = 16, 32, 64, 128, and larger). It very quickly converges to the normal distribution (under appropriate rescaling and translation). [Plots omitted.]
Values −1, +1, with Pr[X_i = −1] = Pr[X_i = 1] = 1/2:
  Pr[Y ≥ ∆] ≤ exp(−∆²/2n)            (Theorem 7.1.7)
  Pr[Y ≤ −∆] ≤ exp(−∆²/2n)           (Theorem 7.1.7)
  Pr[|Y| ≥ ∆] ≤ 2 exp(−∆²/2n)        (Corollary 7.1.8)
Values 0, 1, with Pr[X_i = 0] = Pr[X_i = 1] = 1/2:
  Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n)  (Corollary 7.1.9)
Values 0, 1, with Pr[X_i = 0] = 1 − p_i and Pr[X_i = 1] = p_i:
  Pr[Y > (1 + δ)µ] < (e^δ/(1 + δ)^{1+δ})^µ   (Theorem 7.3.2)

Table 7.1: Summary of the Chernoff type inequalities covered. Here we have n independent random variables X_1, . . . , X_n, Y = Σ_i X_i and µ = E[Y].
7.1.2.1.2. Is the game pointless? So, let Y_{i−1} be the money the player has at the end of the (i − 1)th round, and suppose she bets an amount ψ_i ≤ Y_{i−1} in the ith round. As such, at the end of the ith round, she has
Y_i = Y_{i−1} − ψ_i (lose: probability half), or Y_i = Y_{i−1} + ψ_i (win: probability half)
dollars. This game, in expectation, does not change the amount of money the player has. Indeed, we have
E[Y_i | Y_{i−1}] = (1/2)(Y_{i−1} − ψ_i) + (1/2)(Y_{i−1} + ψ_i) = Y_{i−1}.
And as such, we have that E[Y_i] = E[E[Y_i | Y_{i−1}]] = E[Y_{i−1}] = · · · = E[Y_0] = 1. In particular, E[Y_n] = 1 – namely, on average, independently of the player's strategy, she is not going to make any money in this game (and she is allowed to change her bets after every round). Unless she is lucky¬ ...
7.1.2.1.3. What about a lucky player? The player believes she will get lucky and wants to develop
a strategy to take advantage of it. Formally, she believes that she can win, say, at least (1 + δ)/2 fraction
of her bets (instead of the predicted 1/2) – for example, if the bets are in the stock market, she can
improve her chances by doing more research on the companies she is investing in . Unfortunately, the
player does not know which rounds she is going to be lucky in – so she still needs to be careful.
7.1.2.1.4. In search of a good strategy. Of course, there are many safe strategies the player can
use, from not playing at all, to risking only a tiny fraction of her money at each round. In other words,
our quest here is to find the best strategy that extracts the maximum benefit for the player out of her
inherent luck.
Here, we restrict ourselves to a simple strategy – at every round, the player bets a β fraction of her money, where β is a parameter to be determined. Specifically, at the end of the ith round, the player has
Y_i = (1 − β)Y_{i−1} if she loses, and Y_i = (1 + β)Y_{i−1} if she wins.
By our assumption, the player is going to win in at least M = (1 + δ)n/2 rounds. Our purpose here is to figure out what the value of β should be so that the player gets as rich as possible®. Now, if the player is successful in ≥ M rounds, out of the n rounds of the game, then the amount of money the player has at the end of the game is
Y_n ≥ (1 − β)^{n−M} (1 + β)^M = (1 − β)^{n/2−(δ/2)n} (1 + β)^{n/2+(δ/2)n} = ((1 − β)(1 + β))^{n/2−(δ/2)n} (1 + β)^{δn}
= (1 − β²)^{n/2−(δ/2)n} (1 + β)^{δn} ≥ exp(−2β²)^{n/2−(δ/2)n} exp(β/2)^{δn} = exp((−β² + β²δ + βδ/2) n).
To maximize this quantity, we choose β = δ/4 (there is a better choice, see Lemma 7.1.6, but we use this value for the simplicity of exposition). Thus, we have that
Y_n ≥ exp((−δ²/16 + δ³/16 + δ²/8) n) ≥ exp((δ²/16) n),
proving the following.
¬ “I would rather have a general who was lucky than one who was good.” – Napoleon Bonaparte.
 “I am a great believer in luck, and I find the harder I work, the more I have of it.” – Thomas Jefferson.
® This optimal choice is known as Kelly criterion, see Remark 7.1.3.
Lemma 7.1.1. Consider a Chernoff game with n rounds, starting with one dollar, where the player wins in ≥ (1 + δ)n/2 of the rounds. If the player bets a δ/4 fraction of her current money, at all rounds, then at the end of the game the player would have at least exp(nδ²/16) dollars.
Remark 7.1.2. Note, that Lemma 7.1.1 holds if the player wins any ≥ (1 + δ)n/2 rounds. In particular,
the statement does not require randomness by itself – for our application, however, it is more natural
and interesting to think about the player wins as being randomly distributed.
Remark 7.1.3. Interestingly, the idea of choosing the best fraction to bet is an old and natural question
arising in investments strategies, and the right fraction to use is known as Kelly criterion, going back
to Kelly’s work from 1956 [Kel56].
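The following Python sketch (parameters ours) simulates the betting game for a player who wins each round independently with probability (1 + δ)/2, and compares the final wealth with the exp(nδ²/16) bound that Lemma 7.1.1 guarantees whenever the player indeed wins at least (1 + δ)n/2 rounds; typically the number of wins is close to its expectation, so the bound is comfortably met.

import math
import random

def play(n: int, delta: float, beta: float) -> float:
    """Wealth after n rounds, betting a beta-fraction of current money each round,
    when the player wins each round independently with probability (1 + delta)/2."""
    y = 1.0
    p_win = (1 + delta) / 2
    for _ in range(n):
        y *= (1 + beta) if random.random() < p_win else (1 - beta)
    return y

if __name__ == "__main__":
    n, delta = 10_000, 0.1
    wealth = play(n, delta, beta=delta / 4)
    print("final wealth:         %.3g" % wealth)
    print("exp(n delta^2 / 16):  %.3g" % math.exp(n * delta * delta / 16))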
The above implies that if a player is lucky, then she is going to become filthy rich¯ . Intuitively, this
should be a pretty rare event – because if the player is rich, then (on average) many other people have
to be poor. We are thus ready for the kill.
Theorem 7.1.4 (Chernoff's inequality). Let X_1, . . . , X_n be n independent random variables, where X_i = 0 or X_i = 1 with equal probability. Then, for any δ ∈ (0, 1/2), we have that
Pr[Σ_i X_i ≥ (1 + δ)n/2] ≤ exp(−(δ²/16) n).

Proof: Imagine that we are playing the Chernoff game above, with β = δ/4, starting with 1 dollar, and let Y_i be the amount of money at the end of the ith round. Here X_i = 1 indicates that the player won the ith round. We have, by Lemma 7.1.1 and Markov's inequality, that
Pr[Σ_i X_i ≥ (1 + δ)n/2] ≤ Pr[Y_n ≥ exp(nδ²/16)] ≤ E[Y_n]/exp(nδ²/16) = 1/exp(nδ²/16) = exp(−(δ²/16) n),
as claimed.
7.1.2.2.1. This is crazy – so intuition maybe? If the player is (1 + δ)/2-lucky then she can make a lot of money; specifically, at least f(δ) = exp(nδ²/16) dollars by the end of the game. Namely, beating the odds has significant monetary value, and this value grows quickly with δ. Since we are in a “zero-sum” game setting, this event should be very rare indeed. Under this interpretation, of course, the
player needs to know in advance the value of δ – so imagine that she guesses it somehow in advance,
or she plays the game in parallel with all the possible values of δ, and she settles on the instance that
maximizes her profit.
7.1.2.2.2. Can one do better? No, not really. Chernoff inequality is tight (this is a challenging
homework exercise) up to the constant in the exponent. The best bound I know for this version of the
inequality has 1/2 instead of 1/16 in the exponent. Note, however, that no real effort was taken to
optimize the constants – this is not the purpose of this write-up.
¯ Not that there is anything wrong with that – many of my friends are filthy,
7.1.2.3. Some low level boring calculations
Above, we used the following well known facts.
Lemma 7.1.5. (A) Markov's inequality. For any positive random variable X and t > 0, we have Pr[X ≥ t] ≤ E[X]/t.
(B) For any two random variables X and Y, we have that E[X] = E[E[X | Y]].
(C) For x ∈ (0, 1), 1 + x ≥ e^{x/2}.
(D) For x ∈ (0, 1/2), 1 − x ≥ e^{−2x}.
Lemma 7.1.6. The quantity exp((−β² + β²δ + βδ/2) n) is maximal for β = δ/(4(1 − δ)).

Proof: We have to maximize f(β) = −β² + β²δ + βδ/2 by choosing the correct value of β (as a function of δ, naturally). We have f'(β) = −2β + 2βδ + δ/2 = 0 ⟺ 2(δ − 1)β = −δ/2 ⟺ β = δ/(4(1 − δ)).
Theorem 7.1.7. Let X_1, . . . , X_n be n independent random variables, such that Pr[X_i = 1] = Pr[X_i = −1] = 1/2, for i = 1, . . . , n. Let Y = Σ_{i=1}^n X_i. Then, for any ∆ > 0, we have
Pr[Y ≥ ∆] ≤ exp(−∆²/2n).

Proof: Observe that E[exp(tX_i)] = (e^t + e^{−t})/2 = Σ_{i=0}^∞ t^{2i}/(2i)!, by the Taylor expansion of exp(·). Note that (2i)! ≥ (i!) 2^i, and thus
E[exp(tX_i)] = Σ_{i=0}^∞ t^{2i}/(2i)! ≤ Σ_{i=0}^∞ t^{2i}/(2^i (i!)) = Σ_{i=0}^∞ (1/i!)(t²/2)^i = exp(t²/2),
again, by the Taylor expansion of exp(·). Next, by the independence of the X_i's, we have
E[exp(tY)] = E[exp(t Σ_i X_i)] = E[Π_i exp(tX_i)] = Π_{i=1}^n E[exp(tX_i)] ≤ Π_{i=1}^n e^{t²/2} = e^{nt²/2}.
By Markov's inequality, we have
Pr[Y ≥ ∆] ≤ E[exp(tY)]/exp(t∆) ≤ exp(nt²/2)/exp(t∆) = exp(nt²/2 − t∆).
Next, minimizing the above quantity as a function of t, we set t = ∆/n. We conclude
Pr[Y ≥ ∆] ≤ exp((n/2)(∆/n)² − (∆/n)∆) = exp(−∆²/2n).
Corollary 7.1.8. Let X_1, . . . , X_n be n independent random variables, such that Pr[X_i = 1] = Pr[X_i = −1] = 1/2, for i = 1, . . . , n. Let Y = Σ_{i=1}^n X_i. Then, for any ∆ > 0, we have
Pr[|Y| ≥ ∆] ≤ 2e^{−∆²/2n}.

Corollary 7.1.9. Let X_1, . . . , X_n be n independent coin flips, such that Pr[X_i = 0] = Pr[X_i = 1] = 1/2, for i = 1, . . . , n. Let Y = Σ_{i=1}^n X_i. Then, for any ∆ > 0, we have
Pr[|Y − n/2| ≥ ∆] ≤ 2e^{−2∆²/n}.
Remark 7.1.10. Before going any further, it might be instructive to understand what these inequalities imply. Consider the case where X_i is either zero or one with probability half. In this case µ = E[Y] = n/2. Set ∆ = t√n (up to a constant factor, this is t standard deviations, as the standard deviation of Y is √n/2). We have, by Corollary 7.1.9, that
Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n) = 2 exp(−2(t√n)²/n) = 2 exp(−2t²).
Thus, the Chernoff inequality implies exponential decay (i.e., ≤ 2^{−t}) with t standard deviations, instead of just the polynomial decay (i.e., ≤ 1/t²) given by Chebyshev's inequality.
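The gap is easy to see numerically. The Python sketch below (parameters ours) computes, for Y ∼ Bin(n, 1/2), the exact two-sided tail at t standard deviations and compares it with the Chebyshev bound 1/t² and with the Chernoff-type bound 2 exp(−2∆²/n) of Corollary 7.1.9.

import math

def exact_two_sided_tail(n: int, delta: float) -> float:
    """Pr[|Y - n/2| >= delta] for Y ~ Bin(n, 1/2), computed exactly."""
    total = 0
    for k in range(n + 1):
        if abs(k - n / 2) >= delta:
            total += math.comb(n, k)
    return total / 2 ** n

if __name__ == "__main__":
    n = 400
    sigma = math.sqrt(n) / 2
    for t in (1, 2, 3, 4):
        delta = t * sigma
        exact = exact_two_sided_tail(n, delta)
        chebyshev = 1 / t ** 2
        chernoff = 2 * math.exp(-2 * delta ** 2 / n)
        print(f"t={t}: exact={exact:.2e}  Chebyshev<={chebyshev:.2e}  Chernoff<={chernoff:.2e}")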
Thus, consider the indicator variable X_i which is 1 if u is successful in the ith level, and zero otherwise. Note that the X_i's are independent, and Pr[X_i = 1] = 1/2.
If u participates in v levels, then we have the random variables X_1, X_2, . . . , X_v. To make things simpler, we will extend this series by adding independent random variables, such that Pr[X_i = 1] = 1/2, for i ≥ v. Thus, we have an infinite sequence of independent 0/1 random variables, each being 1 with probability 1/2. The question is how many elements in the sequence we need to read, till we get the required number of ones.
Lemma 7.2.1. Let X_1, X_2, . . . be an infinite sequence of independent random 0/1 variables, each being 1 with probability 1/2. Let M be an arbitrary parameter. Then the probability that we need to read more than 2M + 4t√M variables of this sequence till we collect M ones is at most 2 exp(−t²), for t ≤ √M. If t ≥ √M then this probability is at most 2 exp(−t√M).
Proof: Consider the random variable Y = Σ_{i=1}^L X_i, where L = 2M + 4t√M. Its expectation is L/2, and using the Chernoff inequality, we get
α = Pr[Y ≤ M] ≤ Pr[|Y − L/2| ≥ L/2 − M] ≤ 2 exp(−(2/L)(L/2 − M)²)
≤ 2 exp(−(2/L)(M + 2t√M − M)²) = 2 exp(−(2/L)(2t√M)²) = 2 exp(−8t²M/L),
by Corollary 7.1.9. For t ≤ √M we have that L = 2M + 4t√M ≤ 8M, and as such in this case Pr[Y ≤ M] ≤ 2 exp(−t²).
If t ≥ √M, then
α = 2 exp(−8t²M/(2M + 4t√M)) ≤ 2 exp(−8t²M/(6t√M)) ≤ 2 exp(−t√M).
Going back to the QuickSort problem, we have that if we sort n elements, then, applying Lemma 7.2.1 with M = ⌈lg n⌉, the probability that u participates in more than L = (4 + c)⌈lg n⌉ recursive calls is smaller than 2 exp(−c√(lg n) · √(lg n)) = 2 exp(−c lg n) ≤ 1/n^c. There are n elements being sorted, and as such, by the union bound, the probability that any element participates in more than (4 + c + 1)⌈lg n⌉ recursive calls is smaller than 1/n^c.
Lemma 7.2.2. For any c > 0, the probability that QuickSort performs more than (6 + c)n lg n comparisons is smaller than 1/n^c.
Lemma 7.2.3. The events E_1, . . . , E_n are independent (as such, the variables X_1, . . . , X_n are independent).

Proof: The trick is to think about the sampling process in a different way, and then the result readily follows. Indeed, we randomly pick a permutation of the given numbers, and set the first number to be π_n. We then, again, pick a random permutation of the remaining numbers and set its first number as the penultimate number (i.e., π_{n−1}) in the output permutation. We repeat this process till we generate the whole permutation.
Now, consider 1 ≤ i_1 < i_2 < . . . < i_k ≤ n, and observe that Pr[E_{i_k} | E_{i_1} ∩ . . . ∩ E_{i_{k−1}}] = Pr[E_{i_k}], since by our thought experiment, E_{i_k} is determined before all the other events E_{i_{k−1}}, . . . , E_{i_1}, and these events are inherently not affected by whether E_{i_k} happens or not. As such, we have
Pr[E_{i_1} ∩ E_{i_2} ∩ . . . ∩ E_{i_k}] = Pr[E_{i_k} | E_{i_1} ∩ . . . ∩ E_{i_{k−1}}] Pr[E_{i_1} ∩ . . . ∩ E_{i_{k−1}}]
= Pr[E_{i_k}] Pr[E_{i_1} ∩ E_{i_2} ∩ . . . ∩ E_{i_{k−1}}] = Π_{j=1}^k Pr[E_{i_j}] = Π_{j=1}^k 1/i_j,
by induction.

° The answer, my friend, is blowing in the permutation.
Proof: Follows readily from Chernoff's inequality, as Z = Σ_i X_i is a sum of independent indicator variables, and since, by linearity of expectation, we have
µ = E[Z] = Σ_i E[X_i] = Σ_{i=1}^n 1/i ≥ ∫_{x=1}^{n+1} dx/x = ln(n + 1) ≥ ln n.
RandomRoute( v0, . . . , vN−1 )
// vi : Packet at node i to be routed to node d(i).
(i) Pick a random intermediate destination σ(i) from [1, . . . , N]. Packet vi travels to
σ(i).
// Here random sampling is done with replacement.
// Several packets might travel to the same destination.
(ii) Wait till all the packets arrive to their intermediate destination.
(iii) Packet vi travels from σ(i) to its destination d(i).
Figure 7.3: The routing algorithm
Theorem 7.2.5 ([KKT91]). For any deterministic oblivious permutation routing algorithm on a network of N nodes each of out-degree n, there is a permutation for which the routing of the permutation takes Ω(√(N/n)) units of time (i.e., ticks).
Proof: (Sketch.) The above is implied by a nice averaging argument – construct, for every possible
destination, the routing tree of all packets to this specific node. Argue that there must be many edges
in this tree that are highly congested in this tree (which is NOT the permutation routing we are looking
for!). Now, by averaging, there must be a single edge that is congested in “many” of these trees. Pick
a source-destination pair from each one of these trees that uses this edge, and complete it into a full
permutation in the natural way. Clearly, the congestion of the resulting permutation is high. For the
exact details see [KKT91].
7.2.3.0.1. How do we send a packet? We use bit fixing. Namely, the packet from node i scans the destination string d(i) from left to right, and in each step moves to the adjacent node that fixes the first bit in which the current node differs from d(i). For example, a packet from (0000) going to (1101) would pass through (1000), (1100), (1101).
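A small Python sketch of bit fixing on the hypercube (nodes written as bit strings; the function name is ours). Calling it on the example above reproduces the route (0000) → (1000) → (1100) → (1101).

def bit_fixing_route(src: str, dst: str):
    """Nodes visited when routing from src to dst by bit fixing:
    scan the bits left to right and fix the first bit that disagrees with dst."""
    assert len(src) == len(dst)
    path = [src]
    cur = list(src)
    for i in range(len(src)):
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append("".join(cur))
    return path

if __name__ == "__main__":
    print(bit_fixing_route("0000", "1101"))
    # ['0000', '1000', '1100', '1101']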
7.2.3.0.2. The routing algorithm. We assume each edge has a FIFO queue. The routing algorithm is depicted in Figure 7.3.
7.2.3.1. Analysis
We analyze only (i) as (iii) follows from the same analysis. In the following, let ρi denote the route
taken by vi in (i).
Exercise 7.2.6. Show that a packet v_j traveling along a path ρ_j cannot leave the path ρ_i and then join it again later. Namely, ρ_i ∩ ρ_j is a (possibly empty) subpath.
Lemma 7.2.7. Let the route of a message c follow the sequence of edges π = (e1, e2, . . . , e k ). Let S be
the set of packets whose routes pass through at least one of (e1, . . . , e k ). Then, the delay incurred by c is
at most |S|.
Proof: A packet in S is said to leave π at that time step at which it traverses an edge of π for the last
time. If a packet is ready to follow edge e j at time t, we define its lag at time t to be t − j. The lag of c
is initially zero, and the delay incurred by c is its lag when it traverses e_k. We will show that each step
at which the lag of c increases by one can be charged to a distinct member of S.
We argue that if the lag of c reaches ℓ + 1, some packet in S leaves π with lag ℓ. When the lag of c increases from ℓ to ℓ + 1, there must be at least one packet (from S) that wishes to traverse the same edge as c at that time step, since otherwise c would be permitted to traverse this edge and its lag would not increase. Thus, S contains at least one packet whose lag reaches the value ℓ.
Let τ be the last time step at which any packet in S has lag ℓ. Thus there is a packet d ready to follow edge e_µ at τ, such that τ − µ = ℓ. We argue that some packet of S leaves π at τ; this establishes the lemma since once a packet leaves π, it would never join it again and as such will never again delay c.
Since d is ready to follow e_µ at τ, some packet ω (which may be d itself) in S follows e_µ at time τ. Now ω leaves π at time τ; if not, some packet would be ready to follow e_{µ+1} at step τ + 1 with lag still ℓ, violating the maximality of τ. We charge to ω the increase in the lag of c from ℓ to ℓ + 1; since ω leaves π, it will never be charged again. Thus, each member of S whose route intersects π is charged for at most one delay, establishing the lemma.
Let H_{ij} be an indicator variable that is 1 if ρ_i and ρ_j share an edge, and 0 otherwise. The total delay for v_i is at most Σ_j H_{ij}.
Crucially, for a fixed i, the variables H_{i1}, . . . , H_{iN} are independent. Indeed, imagine first picking the destination of v_i, and let the associated path be ρ_i. Now, pick the destinations of all the other packets in the network. Since the sampling of destinations is done with replacement, whether or not the path of v_j intersects ρ_i is independent of whether the path of v_k intersects ρ_i. Of course, the probabilities Pr[H_{ij} = 1] and Pr[H_{ik} = 1] are probably different. Confusingly, however, H_{11}, . . . , H_{NN} are not independent. Indeed, imagine k and j being close vertices on the hypercube. If H_{ij} = 1 then intuitively it means that ρ_i is traveling close to the vertex v_j, and as such there is a higher probability that H_{ik} = 1.
Let ρ_i = (e_1, . . . , e_k), and let T(e) be the number of packets (i.e., paths) that pass through e. We have that
Σ_{j=1}^N H_{ij} ≤ Σ_{j=1}^k T(e_j), and thus E[Σ_{j=1}^N H_{ij}] ≤ E[Σ_{j=1}^k T(e_j)].
Because of symmetry, the variables T(e) have the same distribution for all the edges of G. On the other hand, the expected length of a path is n/2, there are N packets, and there are Nn/2 edges. We conclude E[T(e)] = 1. Thus
µ = E[Σ_{j=1}^N H_{ij}] ≤ E[Σ_{j=1}^k T(e_j)] = E[|ρ_i|] ≤ n/2.
Since there are N = 2^n packets, it follows, by applying the Chernoff inequality to Σ_j H_{ij} together with the union bound over the packets, that with probability at least 1 − 2^{−5n} all packets arrive to their intermediate destination with a delay of at most 7n.
Theorem 7.2.8. Each packet arrives to its destination in ≤ 14n stages, with probability at least 1 − 1/N (note that this is very conservative).
7.2.4. Faraway Strings
Consider the Hamming distance between binary strings. It is natural to ask how many strings of
length n can one have, such that any pair of them, is of Hamming distance at least t from each other.
Consider two random strings, generated by picking each bit randomly and independently. Thus, E[d_H(x, y)] = n/2, where d_H(x, y) denotes the Hamming distance between x and y. In particular, using the Chernoff inequality (Corollary 7.1.9), we have that
Pr[d_H(x, y) ≤ n/2 − ∆] ≤ exp(−2∆²/n).
Next, consider generating M such strings, where the value of M would be determined shortly. The probability that some pair of strings is at distance at most n/2 − ∆ is
α ≤ (M choose 2) exp(−2∆²/n) < M² exp(−2∆²/n).
If this probability is smaller than one, then there is some probability that all the M strings are at distance more than n/2 − ∆ from each other. Namely, there exists a set of M strings such that every pair of them is far. We used here the fact that if an event has probability larger than zero, then it exists. Thus, set ∆ = n/4, and observe that for M = exp(n/16) we have α ≤ (M²/2) exp(−2(n/4)²/n) = (1/2) exp(n/8) exp(−n/8) < 1. We conclude:
Lemma 7.2.9. There exists a set of exp(n/16) binary strings of length n, such that any pair of them is
at Hamming distance at least n/4 from each other.
This is our first introduction to the beautiful technique known as the probabilistic method — we
will hear more about it later in the course.
This result also has an interesting interpretation in the Euclidean setting. Indeed, consider the sphere S of radius √n/2 centered at (1/2, 1/2, . . . , 1/2) ∈ R^n. Clearly, all the vertices of the binary hypercube {0, 1}^n lie on this sphere. As such, let P be the set of points on S that exists according to Lemma 7.2.9. A pair p, q of points of P has Euclidean distance at least √(d_H(p, q)) = √(n/4) = √n/2 from each other. We conclude:
Lemma 7.2.10. Consider the unit hypersphere S in Rn . The sphere S contains a set Q of points, such
that each pair of points is at (Euclidean) distance at least one from each other, and |Q| ≥ exp(n/16).
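The existence argument above translates directly into a randomized construction. The Python sketch below (parameters ours) samples roughly exp(n/16) random n-bit strings for a small n and reports the minimum pairwise Hamming distance, which is typically at least n/4 as Lemma 7.2.9 predicts.

import math
import random
from itertools import combinations

def min_pairwise_hamming(n: int, m: int) -> int:
    """Sample m random n-bit strings; return the minimum pairwise Hamming distance."""
    strings = [random.getrandbits(n) for _ in range(m)]
    return min(bin(x ^ y).count("1") for x, y in combinations(strings, 2))

if __name__ == "__main__":
    n = 64
    m = int(math.exp(n / 16))      # the set size promised by Lemma 7.2.9
    print("m =", m, " min distance =", min_pairwise_hamming(n, m), " n/4 =", n // 4)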
Theorem 7.3.2. For any δ > 0, we have Pr[X > (1 + δ)µ] < (e^δ/(1 + δ)^{1+δ})^µ.

Or in a more simplified form, we have:
Pr[X > (1 + δ)µ] < exp(−µδ²/4),   for δ ≤ 2e − 1,   (7.1)
Pr[X > (1 + δ)µ] < 2^{−µ(1+δ)},   for δ > 2e − 1,   (7.2)
and
Pr[X > (1 + δ)µ] < exp(−µδ ln δ / 2),   for δ ≥ e².   (7.3)

Proof: We have Pr[X > (1 + δ)µ] = Pr[e^{tX} > e^{t(1+δ)µ}]. By the Markov inequality, we have:
Pr[X > (1 + δ)µ] < E[e^{tX}]/e^{t(1+δ)µ}.
On the other hand,
E[e^{tX}] = E[e^{t(X_1+X_2+...+X_n)}] = E[e^{tX_1}] · · · E[e^{tX_n}].
Namely,
Pr[X > (1 + δ)µ] < (Π_{i=1}^n E[e^{tX_i}])/e^{t(1+δ)µ} = (Π_{i=1}^n ((1 − p_i)e^0 + p_i e^t))/e^{t(1+δ)µ} = (Π_{i=1}^n (1 + p_i(e^t − 1)))/e^{t(1+δ)µ}.
Let y = p_i(e^t − 1). We know that 1 + y < e^y (since y > 0). Thus,
Pr[X > (1 + δ)µ] < (Π_{i=1}^n exp(p_i(e^t − 1)))/e^{t(1+δ)µ} = exp(Σ_{i=1}^n p_i(e^t − 1))/e^{t(1+δ)µ}
= exp((e^t − 1) Σ_{i=1}^n p_i)/e^{t(1+δ)µ} = exp((e^t − 1)µ)/e^{t(1+δ)µ} = (exp(e^t − 1)/e^{t(1+δ)})^µ
= (exp(δ)/(1 + δ)^{1+δ})^µ,
if we set t = log(1 + δ).
For the proof of the simplified form, see Section 7.3.1.
Definition 7.3.3. F⁺(µ, δ) = (e^δ/(1 + δ)^{1+δ})^µ.
Example 7.3.4. Arkansas Aardvarks win a game with probability 1/3. What is their probability to have a winning season with n games? By the Chernoff inequality, this probability is smaller than
F⁺(n/3, 1/2) = (e^{1/2}/1.5^{1.5})^{n/3} = 0.89745^{n/3} = 0.964577^n.
For n = 40, this probability is smaller than 0.236307. For n = 100 this is less than 0.027145. For n = 1000, this is smaller than 2.17221 · 10^{−16} (which is pretty slim and shady). Namely, as the number of experiments increases, the distribution concentrates around its expectation, and this convergence is exponential.
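The numbers in the example are easy to reproduce; the Python sketch below evaluates F⁺(n/3, 1/2) for the quoted values of n (the function name is ours).

import math

def F_plus(mu: float, delta: float) -> float:
    """The Chernoff bound F^+(mu, delta) = (e^delta / (1+delta)^(1+delta))^mu."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

if __name__ == "__main__":
    for n in (40, 100, 1000):
        print(n, F_plus(n / 3, 0.5))
    # roughly 0.236, 0.027, and 2.2e-16, matching the numbers in the example.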
Theorem 7.3.5. Under the same assumptions as Theorem 7.3.2, we have Pr[X < (1 − δ)µ] < exp(−µδ²/2).

Definition 7.3.6. Let F⁻(µ, δ) = e^{−µδ²/2}, and let ∆⁻(µ, ε) denote the value of δ for which this bound equals ε. We have that
∆⁻(µ, ε) = √(2 log(1/ε)/µ).
And for large δ we have ∆⁺(µ, ε) < log₂(1/ε)/µ − 1.
since δ > 2e − 1. For the stronger version, Eq. (7.3), observe that
Pr[X > (1 + δ)µ] < (e^δ/(1 + δ)^{1+δ})^µ = exp(µδ − µ(1 + δ) ln(1 + δ)).   (7.4)
As such, we have
Pr[X > (1 + δ)µ] < exp(−µ(1 + δ)(ln(1 + δ) − 1)) ≤ exp(−µδ ln((1 + δ)/e)) ≤ exp(−µδ ln δ / 2),
since for x ≥ e² we have that (1 + x)/e ≥ √x ⟺ ln((1 + x)/e) ≥ (ln x)/2.
As for Eq. (7.1), we prove this only for δ ≤ 1/2. For details about the case 1/2 ≤ δ ≤ 2e − 1, see [MR95]. The Taylor expansion of ln(1 + δ) is
δ − δ²/2 + δ³/3 − δ⁴/4 + · · · ≥ δ − δ²/2,
for δ ≤ 1/2.
7.4. A special case of Hoeffding’s inequality
In this section, we prove yet another version of Chernoff inequality, where each variable is randomly
picked according to its own distribution in the range [0, 1]. We prove a more general version of this
inequality in Section 7.5, but the version presented here does not follow from this generalization.
Theorem 7.4.1. Let X_1, . . . , X_n ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^n X_i, and let µ = E[X]. We have that
Pr[X − µ ≥ η] ≤ (µ/(µ + η))^{µ+η} ((n − µ)/(n − µ − η))^{n−µ−η}.

By calculations, see Lemma 7.4.6 below, one can show that E[s^{X_i}] ≤ 1 + (s − 1) E[X_i]. As such, by the inequality between the arithmetic and geometric means², we can bound Π_i E[s^{X_i}] ≤ (1 + (s − 1)µ/n)^n. Setting
s = (µ + η)(n − µ)/(µ(n − µ − η)) = (µn − µ² + ηn − ηµ)/(µn − µ² − ηµ),
we have that
1 + (s − 1)(µ/n) = 1 + (µ/n) · ηn/(µn − µ² − ηµ) = 1 + η/(n − µ − η) = (n − µ)/(n − µ − η).
Corollary 7.4.3. Let X_1, . . . , X_n ∈ [0, 1] be n independent random variables, let X = (Σ_{i=1}^n X_i)/n, p = E[X] = µ/n and q = 1 − p. Then, we have that Pr[X − p ≥ t] ≤ exp(n f(t)), for
f(t) = (p + t) ln(p/(p + t)) + (q − t) ln(q/(q − t)).   (7.5)
Theorem 7.4.4. Let X_1, . . . , X_n ∈ [0, 1] be n independent random variables, let X = (Σ_{i=1}^n X_i)/n, and let p = E[X]. We have that Pr[X − p ≥ t] ≤ exp(−2nt²) and Pr[X − p ≤ −t] ≤ exp(−2nt²).
² The inequality between arithmetic and geometric means: (Σ_{i=1}^n x_i)/n ≥ (x_1 · · · x_n)^{1/n}.
Proof: Let p = µ/n, q = 1 − p, and let f(t) be the function from Eq. (7.5), for t ∈ (−p, q). Now, we have that
f'(t) = (ln(p/(p + t)) − 1) + (1 − ln(q/(q − t))) = ln(p/(p + t)) − ln(q/(q − t)) = ln(p(q − t)/(q(p + t))).
As for the second derivative, we have
f''(t) = −1/(q − t) − 1/(p + t) = −((q − t) + (p + t))/((q − t)(p + t)) = −1/((q − t)(p + t)) ≤ −4.
Indeed, t ∈ (−p, q) and the denominator is maximized for t = (q − p)/2, and as such
(q − t)(p + t) ≤ ((2q − (q − p))/2)((2p + (q − p))/2) = (p + q)²/4 = 1/4.
Now, f(0) = 0 and f'(0) = 0, and by Taylor's expansion, we have that
f(t) = f(0) + f'(0) t + (f''(x)/2) t² ≤ −2t²,
where x is between 0 and t.
The first bound now readily follows from plugging this bound into Corollary 7.4.3. The second bound follows by considering the random variables Y_i = 1 − X_i, for all i, and plugging this into the first bound. Indeed, for Y = 1 − X, we have that q = E[Y], and then X − p ≤ −t ⟺ t ≤ p − X ⟺ t ≤ 1 − q − (1 − Y) = Y − q. Thus, Pr[X − p ≤ −t] = Pr[Y − q ≥ t] ≤ exp(−2nt²).
Theorem 7.4.5. Let X_1, . . . , X_n ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^n X_i, and let µ = E[X]. We have that Pr[X − µ ≥ εµ] ≤ exp(−ε²µ/4) and Pr[X − µ ≤ −εµ] ≤ exp(−ε²µ/2).

Proof: Let p = µ/n, and let g(x) = f(px), for x ∈ [0, 1] and xp < q. As before, computing the derivative of g, we have
g'(x) = p f'(xp) = p ln(p(q − xp)/(q(p + xp))) = p ln((q − xp)/(q(1 + x))) ≤ p ln(1/(1 + x)) ≤ −px/2,
since (q − xp)/q is maximized for x = 0, and ln(1/(1 + x)) ≤ −x/2, for x ∈ [0, 1], as can be easily verified³. Now, g(0) = f(0) = 0, and by integration, we have that
g(x) = ∫_{y=0}^x g'(y) dy ≤ ∫_{y=0}^x (−py/2) dy = −px²/4.
Now, plugging into Corollary 7.4.3, we get that the desired probability Pr[X − µ ≥ εµ] is
Pr[X/n − p ≥ εp] ≤ exp(n f(εp)) = exp(n g(ε)) ≤ exp(−pnε²/4) = exp(−µε²/4).
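A quick Monte Carlo sanity check of the upper-tail bound of Theorem 7.4.5 (a sketch; the choice of 0/1 fair coin flips for the X_i is ours).

import math
import random

def upper_tail_estimate(n: int, eps: float, runs: int = 20_000) -> float:
    """Estimate Pr[X - mu >= eps*mu] where X is a sum of n fair 0/1 coin flips."""
    mu = n / 2
    hits = 0
    for _ in range(runs):
        x = sum(random.getrandbits(1) for _ in range(n))
        if x - mu >= eps * mu:
            hits += 1
    return hits / runs

if __name__ == "__main__":
    n, eps = 400, 0.1
    print("empirical tail:      ", upper_tail_estimate(n, eps))
    print("Theorem 7.4.5 bound: ", math.exp(-eps ** 2 * (n / 2) / 4))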
7.4.1. Some technical lemmas
Lemma 7.4.6. Let X ∈ [0, 1] be a random variable, and let s ≥ 1. Then E[s^X] ≤ 1 + (s − 1) E[X].

Proof: For the sake of simplicity of exposition, assume that X is a discrete random variable, and that there is a value α ∈ (0, 1/2), such that β = Pr[X = α] > 0. Consider the modified random variable X', such that Pr[X' = 0] = Pr[X = 0] + β/2, Pr[X' = 2α] = Pr[X = 2α] + β/2, and Pr[X' = α] = 0. Clearly, E[X] = E[X'].
Next, observe that E[s^{X'}] − E[s^X] = (β/2)(s^{2α} + s^0) − βs^α ≥ 0, by the convexity of s^x. We conclude that E[s^X] achieves its maximum if X takes only the values 0 and 1. But then, we have that
E[s^X] = (1 − E[X]) s^0 + E[X] s^1 = 1 + (s − 1) E[X],
as claimed.

Lemma 7.5.1. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have E[e^{sX}] ≤ exp(s²(b − a)²/8).
Proof: Let a ≤ x ≤ b and observe that x can be written as a convex combination of a and b. In particular, we have
x = λa + (1 − λ)b for λ = (b − x)/(b − a) ∈ [0, 1].
Since s > 0, the function exp(sx) is convex, and as such
e^{sx} ≤ ((b − x)/(b − a)) e^{sa} + ((x − a)/(b − a)) e^{sb},
since we have that f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) if f(·) is a convex function. Thus, for a random variable X, by linearity of expectation, we have
E[e^{sX}] ≤ E[((b − X)/(b − a)) e^{sa} + ((X − a)/(b − a)) e^{sb}] = ((b − E[X])/(b − a)) e^{sa} + ((E[X] − a)/(b − a)) e^{sb}
= (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb},
since E[X] = 0.
Next, set p = −a/(b − a) and observe that 1 − p = 1 + a/(b − a) = b/(b − a) and
−ps(b − a) = (a/(b − a)) s(b − a) = sa.
As such, we have
E[e^{sX}] ≤ (1 − p)e^{sa} + pe^{sb} = (1 − p + pe^{s(b−a)})e^{sa} = (1 − p + pe^{s(b−a)})e^{−ps(b−a)}
= exp(−ps(b − a) + ln(1 − p + pe^{s(b−a)})) = exp(−pu + ln(1 − p + pe^u)),
for u = s(b − a). Setting
φ(u) = −pu + ln(1 − p + pe^u),
we thus have E[e^{sX}] ≤ exp(φ(u)). To prove the claim, we will show that φ(u) ≤ u²/8 = s²(b − a)²/8.
To see that, expand φ(u) about zero using Taylor's expansion. We have
φ(u) = φ(0) + uφ'(0) + (1/2)u²φ''(θ),   (7.6)
where θ ∈ [0, u], and notice that φ(0) = 0. Furthermore, we have
φ'(u) = −p + pe^u/(1 − p + pe^u),
and as such φ'(0) = −p + p/(1 − p + p) = 0. Now,
φ''(u) = pe^u(1 − p)/(1 − p + pe^u)² ≤ 1/4,
since for any y, z ≥ 0 we have yz ≤ ((y + z)/2)² (here with y = pe^u and z = 1 − p). Plugging this into Eq. (7.6), we get φ(u) ≤ u²/8,
as claimed.
Lemma 7.5.2. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have
Pr[X > t] ≤ exp(s²(b − a)²/8)/e^{st}.

Proof: Using the same technique we used in proving Chernoff's inequality, we have that
Pr[X > t] = Pr[e^{sX} > e^{st}] ≤ E[e^{sX}]/e^{st} ≤ exp(s²(b − a)²/8)/e^{st}.
Theorem 7.5.3 (Hoeffding's inequality). Let X_1, . . . , X_n be independent random variables, where X_i ∈ [a_i, b_i], for i = 1, . . . , n. Then, for the random variable S = X_1 + · · · + X_n and any η > 0, we have
Pr[|S − E[S]| ≥ η] ≤ 2 exp(−2η²/Σ_{i=1}^n (b_i − a_i)²).

Proof: Let Z_i = X_i − E[X_i], for i = 1, . . . , n. Set Z = Σ_{i=1}^n Z_i, and observe that
Pr[Z ≥ η] = Pr[e^{sZ} ≥ e^{sη}] ≤ E[exp(sZ)]/exp(sη),
since the Z_i's are independent and by Lemma 7.5.1. This implies that
Pr[Z ≥ η] ≤ exp(−sη) Π_{i=1}^n e^{s²(b_i−a_i)²/8} = exp((s²/8) Σ_{i=1}^n (b_i − a_i)² − sη).
Setting s = 4η/Σ_{i=1}^n (b_i − a_i)² (which minimizes the exponent), we get
Pr[Z ≥ η] ≤ exp(−2η²/Σ_i (b_i − a_i)²).
The claim now follows by the symmetry of the upper bound (i.e., apply the same proof to −Z).
7.7. Exercises

Exercise 7.7.1 (Chernoff inequality is tight.). Let S = Σ_{i=1}^n S_i be a sum of n independent random variables, each attaining values +1 and −1 with equal probability. Let P(n, ∆) = Pr[S > ∆]. Prove that for ∆ ≤ n/C,
P(n, ∆) ≥ (1/C) exp(−∆²/(Cn)),
where C is a suitable constant. That is, the well-known Chernoff bound P(n, ∆) ≤ exp(−∆²/2n) is close to the truth.
Exercise 7.7.2 (Chernoff inequality is tight by direct calculations.). For this question use only basic argu-
mentation – do not use Stirling’s formula, Chernoff inequality or any similar “heavy” machinery.
(A) Prove that Σ_{i=0}^{n−k} (2n choose i) ≤ (n/(4k²)) 2^{2n}.
Hint: Consider flipping a coin 2n times. Write down explicitly the probability of this coin to have at most n − k heads, and use the Chebyshev inequality.
(B) Using (A), prove that (2n choose n) ≥ 2^{2n}/(4√n) (which is a pretty good estimate).
(C) Prove that (2n choose n+i+1) = (1 − (2i + 1)/(n + i + 1)) (2n choose n+i).
(D) Prove that (2n choose n+i) ≤ exp(−i(i − 1)/(2n)) (2n choose n).
(E) Prove that (2n choose n+i) ≥ exp(−8i²/n) (2n choose n).
(F) Using the above, prove that (2n choose n) ≤ c 2^{2n}/√n for some constant c (I got c = 0.824... but any reasonable constant will do).
(G) Using the above, prove that
Σ_{i=t√n+1}^{(t+1)√n} (2n choose n−i) ≤ c 2^{2n} exp(−t²/2).
In particular, conclude that when flipping a fair coin 2n times, the probability to get less than n − t√n heads (for t an integer) is smaller than c' exp(−t²/2), for some constant c'.
(H) Let X be the number of heads in 2n coin flips. Prove that for any integer t > 0 and any δ > 0 sufficiently small, it holds that Pr[X < (1 − δ)n] ≥ exp(−c''δ²n), where c'' is some constant. Namely, the Chernoff inequality is tight in the worst case.
Exercise 7.7.3 (More binary strings. More!). To some extent, Lemma 7.2.9 is somewhat silly, as one can
prove a better bound by direct argumentation. Indeed, for a fixed binary string x of length n, show
a bound on the number of strings in the Hamming ball around x of radius n/4 (i.e., binary strings of
distance at most n/4 from x). (Hint: interpret the special case of the Chernoff inequality as an inequality
over binomial coefficients.)
Next, argue that the greedy algorithm, which repeatedly picks a string at distance ≥ n/4 from all strings picked so far, stops only after picking at least exp(n/8) strings.
Exercise 7.7.4 (Tail inequality for geometric variables). Let X_1, . . . , X_m be m independent random variables with geometric distribution with probability p (i.e., Pr[X_i = j] = (1 − p)^{j−1} p). Let Y = Σ_i X_i, and let µ = E[Y] = m/p. Prove that Pr[Y ≥ (1 + δ)µ] ≤ exp(−mδ²/8).
Chapter 8
Martingales
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
‘After that he always chose out a “dog command” and sent them ahead. It had the task of informing
the inhabitants in the village where we were going to stay overnight that no dog must be allowed
to bark in the night otherwise it would be liquidated. I was also on one of those commands and
when we came to a village in the region of Milevsko I got mixed up and told the mayor that every
dog-owner whose dog barked in the night would be liquidated for strategic reasons. The mayor got
frightened, immediately harnessed his horses and rode to headquarters to beg mercy for the whole
village. They didn’t let him in, the sentries nearly shot him and so he returned home, but before we
got to the village everybody on his advice had tied rags round the dogs muzzles with the result that
three of them went mad.’
– The good soldier Svejk, Jaroslav Hasek
8.1. Martingales
8.1.1. Preliminaries
Let X and Y be two random variables. Let ρ(x, y) = Pr[(X = x) ∩ (Y = y)]. Then,
Pr[X = x | Y = y] = ρ(x, y)/Pr[Y = y] = ρ(x, y)/Σ_z ρ(z, y)
and
E[X | Y = y] = Σ_x x Pr[X = x | Y = y] = Σ_x x ρ(x, y)/Σ_z ρ(z, y) = (Σ_x x ρ(x, y))/Pr[Y = y].

Lemma 8.1.2. For any two random variables X and Y, we have E[E[X | Y]] = E[X].
Proof: We have
E[E[X | Y]] = E_y[E[X | Y = y]] = Σ_y Pr[Y = y] E[X | Y = y]
= Σ_y Pr[Y = y] (Σ_x x Pr[X = x ∩ Y = y])/Pr[Y = y]
= Σ_y Σ_x x Pr[X = x ∩ Y = y] = Σ_x x Σ_y Pr[X = x ∩ Y = y]
= Σ_x x Pr[X = x] = E[X].

Lemma 8.1.3. For any two random variables X and Y, we have E[Y · E[X | Y]] = E[XY].

Proof: We have that
E[Y · E[X | Y]] = Σ_y Pr[Y = y] · y · E[X | Y = y]
= Σ_y Pr[Y = y] · y · (Σ_x x Pr[X = x ∩ Y = y])/Pr[Y = y]
= Σ_x Σ_y xy · Pr[X = x ∩ Y = y] = E[XY].
8.1.2. Martingales
Intuitively, martingales are a sequence of random variables describing a process, where the only thing
that matters at the beginning of the ith step is where the process was in the end of the (i − 1)th step.
That is, it does not matter how the process arrived to a certain state, only that it is currently at this
state.
Definition 8.1.4. A sequence of random variables X_0, X_1, . . . , is said to be a martingale sequence if for all i > 0, we have E[X_i | X_0, . . . , X_{i−1}] = X_{i−1}.

Lemma 8.1.5. Let X_0, X_1, . . . , be a martingale sequence. Then, for all i ≥ 0, we have E[X_i] = E[X_0].
Example 8.1.7. Let Y_i = X_i² − i, where X_i is as defined in the above example. We claim that Y_0, Y_1, . . . is a martingale. Let us verify that this is true. Given Y_{i−1}, we have Y_{i−1} = X_{i−1}² − (i − 1). We have that
E[Y_i | Y_{i−1}] = E[X_i² − i | X_{i−1}² − (i − 1)] = (1/2)((X_{i−1} + 1)² − i) + (1/2)((X_{i−1} − 1)² − i)
= X_{i−1}² + 1 − i = X_{i−1}² − (i − 1) = Y_{i−1},
as required.
Example 8.1.8. Let U be an urn with b black balls, and w white balls. We repeatedly select a ball and replace it by c balls having the same color. Let X_i be the fraction of black balls after the first i trials. We claim that the sequence X_0, X_1, . . . is a martingale.
Indeed, let n_i = b + w + i(c − 1) be the number of balls in the urn after the ith trial. Clearly,
E[X_i | X_{i−1}] = X_{i−1} · (X_{i−1} n_{i−1} + c − 1)/n_i + (1 − X_{i−1}) · (X_{i−1} n_{i−1})/n_i = X_{i−1} (n_{i−1} + c − 1)/n_i = X_{i−1}.
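The urn process is easy to simulate. The Python sketch below (parameters ours) runs many copies of the process and checks that the average fraction of black balls after a fixed number of draws stays (approximately) equal to its starting value b/(b + w), as the martingale property predicts.

import random

def black_fraction_after(b: int, w: int, c: int, steps: int) -> float:
    """Fraction of black balls after `steps` draws, replacing each drawn ball by c of its color."""
    black, white = b, w
    for _ in range(steps):
        if random.random() < black / (black + white):
            black += c - 1
        else:
            white += c - 1
    return black / (black + white)

if __name__ == "__main__":
    b, w, c, steps, runs = 3, 7, 2, 50, 20_000
    avg = sum(black_fraction_after(b, w, c, steps) for _ in range(runs)) / runs
    print("average fraction:", round(avg, 4), "  initial fraction b/(b+w):", b / (b + w))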
Example 8.1.9. Let G be a random graph on the vertex set V = {1, . . . , n} obtained by independently
choosing to include each possible edge with probability p. The underlying probability space is called
Gn,p . Arbitrarily label the m = n(n − 1)/2 possible edges with the sequence 1, . . . , m. For 1 ≤ j ≤ m,
define the indicator random variable I j , which takes values 1 if the edge j is present in G, and has value
0 otherwise. These indicator variables are independent and each takes value 1 with probability p.
Consider any real valued function f defined over the space of all graphs, e.g., the clique number,
which is defined as being the size of the largest complete subgraph. The edge exposure martingale
is defined to be the sequence of random variables X0, . . . , Xm such that
X_i = E[f(G) | I_1, . . . , I_i],
while X_0 = E[f(G)] and X_m = f(G). That this sequence of random variables is a martingale follows immediately from a theorem that will be described in the next lecture.
One can define similarly a vertex exposure martingale, where the graph Gi is the graph induced
on the first i vertices of the random graph G.
Example 8.1.10 (The sheep of Mabinogion). The following is taken from medieval Welsh manuscript based
on Celtic mythology:
“And he came towards a valley, through which ran a river; and the borders of the valley were
wooded, and on each side of the river were level meadows. And on one side of the river he
saw a flock of white sheep, and on the other a flock of black sheep. And whenever one of the
white sheep bleated, one of the black sheep would cross over and become white; and when
one of the black sheep bleated, one of the white sheep would cross over and become black.”
– Peredur the son of Evrawk, from the Mabinogion.
More concretely, we start at time 0 with w_0 white sheep, and b_0 black sheep. At every iteration, a random sheep is picked; it bleats, and a sheep of the other color turns to this color. The game stops as soon as all the sheep have the same color. No sheep dies or gets born during the game. Let X_i be the expected number of black sheep at the end of the game, given the state after the ith iteration. For reasons that we would see later on, this sequence is a martingale.
The original question is somewhat more interesting – if we are allowed to take away sheep at the end of each iteration, what is the optimal strategy to maximize X_i?
Theorem 8.1.11 (Azuma's Inequality). Let X_0, . . . , X_m be a martingale with X_0 = 0, and |X_{i+1} − X_i| ≤ 1 for all 0 ≤ i < m. Let λ > 0 be arbitrary. Then Pr[X_m > λ√m] < exp(−λ²/2).

Proof: Let α = λ/√m. Let Y_i = X_i − X_{i−1}, so that |Y_i| ≤ 1 and E[Y_i | X_0, . . . , X_{i−1}] = 0.
We are interested in bounding E[e^{αY_i} | X_0, . . . , X_{i−1}]. Note that, for −1 ≤ x ≤ 1, we have
e^{αx} ≤ h(x) = (e^α + e^{−α})/2 + ((e^α − e^{−α})/2) x,
as e^{αx} is a convex function, h(−1) = e^{−α}, h(1) = e^α, and h(x) is a linear function. Thus,
E[e^{αY_i} | X_0, . . . , X_{i−1}] ≤ E[h(Y_i) | X_0, . . . , X_{i−1}] = (e^α + e^{−α})/2 + ((e^α − e^{−α})/2) E[Y_i | X_0, . . . , X_{i−1}]
= (e^α + e^{−α})/2 = Σ_{i≥0} α^{2i}/(2i)! ≤ Σ_{i≥0} α^{2i}/(2^i i!) = e^{α²/2},
as (2i)! ≥ 2^i i!.
Hence, by Lemma 8.1.3, we have that
E[e^{αX_m}] = E[Π_{i=1}^m e^{αY_i}] = E[(Π_{i=1}^{m−1} e^{αY_i}) e^{αY_m}]
= E[(Π_{i=1}^{m−1} e^{αY_i}) E[e^{αY_m} | X_0, . . . , X_{m−1}]] ≤ e^{α²/2} E[Π_{i=1}^{m−1} e^{αY_i}]
≤ exp(mα²/2),
by induction. Therefore, by Markov's inequality,
Pr[X_m > λ√m] = Pr[e^{αX_m} > e^{αλ√m}] ≤ E[e^{αX_m}]/e^{αλ√m} = e^{mα²/2 − αλ√m}
= exp(m(λ/√m)²/2 − (λ/√m)λ√m) = e^{−λ²/2},
as claimed.
Theorem 8.1.12 (Azuma's Inequality). Let X_0, . . . , X_m be a martingale sequence such that |X_{i+1} − X_i| ≤ 1 for all 0 ≤ i < m. Let λ > 0 be arbitrary. Then Pr[|X_m − X_0| > λ√m] < 2 exp(−λ²/2).
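As a sanity check (a sketch of ours), take the simplest martingale with X_0 = 0 and unit-bounded differences — a symmetric ±1 random walk — and compare the empirical tail Pr[|X_m| > λ√m] with the Azuma bound 2 exp(−λ²/2).

import math
import random

def walk(m: int) -> int:
    """Position of a symmetric +/-1 random walk after m steps."""
    return sum(random.choice((-1, 1)) for _ in range(m))

def tail(m: int, lam: float, runs: int = 10_000) -> float:
    thr = lam * math.sqrt(m)
    return sum(abs(walk(m)) > thr for _ in range(runs)) / runs

if __name__ == "__main__":
    m = 400
    for lam in (1.0, 2.0, 3.0):
        print(f"lambda={lam}: empirical={tail(m, lam):.4f}  Azuma bound={2 * math.exp(-lam ** 2 / 2):.4f}")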
Example 8.1.13. Let χ(H) be the chromatic number of a graph H. What is the chromatic number of a random graph? How does this random variable behave?
Consider the vertex exposure martingale, and let X_i = E[χ(G) | G_i]. Again, without proving it, we claim that X_0, . . . , X_n = X is a martingale, and as such, we have Pr[|X_n − X_0| > λ√n] ≤ 2e^{−λ²/2}. However, X_0 = E[χ(G)], and X_n = E[χ(G) | G_n] = χ(G). Thus,
Pr[|χ(G) − E[χ(G)]| > λ√n] ≤ 2e^{−λ²/2}.
Namely, the chromatic number of a random graph is highly concentrated! And we do not even know what the expectation of this variable is!
Chapter 9
Martingales II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“The Electric Monk was a labor-saving device, like a dishwasher or a video recorder. Dishwashers washed
tedious dishes for you, thus saving you the bother of washing them yourself, video recorders watched tedious
television for you, thus saving you the bother of looking at it yourself; Electric Monks believed things for
you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the
world expected you to believe.”
– Dirk Gently's Holistic Detective Agency, Douglas Adams.
Definition 9.1.2. Given a σ-field (Ω, F ), a probability measure Pr : F → R+ is a function that satisfies
the following conditions.
(A) For all A ∈ F, 0 ≤ Pr[A] ≤ 1.
(B) Pr[Ω] = 1.
(C) For mutually disjoint events C_1, C_2, . . . , we have Pr[∪_i C_i] = Σ_i Pr[C_i].
Definition 9.1.3. A probability space (Ω, F , Pr) consists of a σ-field (Ω, F ) with a probability measure
Pr defined on it.
Definition 9.1.4. Given a σ-field (Ω, F ) with F = 2Ω , a filter (also filtration) is a nested sequence
F0 ⊆ F1 ⊆ · · · ⊆ Fn of subsets of 2Ω , such that:
(A) F0 = {∅, Ω}.
(B) Fn = 2Ω .
(C) For 0 ≤ i ≤ n, (Ω, Fi ) is a σ-field.
Definition 9.1.5. An elementary event or atomic event is a subset of a sample space that contains
only one element of Ω.
Intuitively, when we consider a probability space, we usually consider a random variable X. The
value of X is a function of the elementary event that happens in the probability space. Formally, a
random variable is a mapping X : Ω → R. Thus, each Fi defines a partition of Ω into atomic events.
This partition is getting more and more refined as we progress down the filter.
Example 9.1.6. Consider an algorithm Alg that uses n random bits. As such, the underlying sample space is Ω = {b_1 b_2 . . . b_n | b_1, . . . , b_n ∈ {0, 1}}; that is, the set of all binary strings of length n. Next, let F_i be the σ-field generated by the partition of Ω into the atomic events B_w, where w ∈ {0, 1}^i; here w is the string encoding the first i random bits used by the algorithm. Specifically,
B_w = {wx | x ∈ {0, 1}^{n−i}},
and the set of atomic events in F_i is {B_w | w ∈ {0, 1}^i}. The set F_i is the closure of this set of atomic events under complement and union. In particular, we conclude that F_0, F_1, . . . , F_n form a filter.
Example 9.1.8. Let F0, . . . , Fn be the filter defined in Example 9.1.6. Let X be the parity of the n bits.
Clearly, X = 1 is a valid event only in Fn (why?). Namely, it is only measurable in Fn , but not in Fi , for
i < n.
9.2. Martingales
Definition 9.2.1. A sequence of random variables Y_1, Y_2, . . . , is said to be a martingale difference sequence if for all i ≥ 0, we have E[Y_i | Y_1, . . . , Y_{i−1}] = 0.
Clearly, X_1, . . . , is a martingale sequence if and only if Y_1, Y_2, . . . , is a martingale difference sequence, where Y_i = X_i − X_{i−1}.
9.2.1. Martingales – an alternative definition
Definition 9.2.3. Let (Ω, F , Pr) be a probability space with a filter F0, F1 , . . . . Suppose that X0, X1, . . ., are
random variables such that, for all i ≥ 0, Xi is Fi -measurable. The sequence X0, . . . , Xn is a martingale
provided that, for all i ≥ 0, we have E Xi+1 | Fi = Xi .
Lemma 9.2.4. Let (Ω, F) and (Ω, G) be two σ-fields such that F ⊆ G. Then, for any random variable X, we have E[E[X | G] | F] = E[X | F].

Proof: For an atomic event f of F, we have
E[E[X | G] | F = f] = Σ_{g∈G} E[X | G = g] Pr[G = g | F = f]
= Σ_{g∈G} ((Σ_x x Pr[X = x ∩ G = g])/Pr[G = g]) · Pr[G = g ∩ F = f]/Pr[F = f]
= Σ_{g∈G, g⊆f} ((Σ_x x Pr[X = x ∩ G = g])/Pr[G = g]) · Pr[G = g ∩ F = f]/Pr[F = f]
= Σ_{g∈G, g⊆f} ((Σ_x x Pr[X = x ∩ G = g])/Pr[G = g]) · Pr[G = g]/Pr[F = f]
= Σ_{g∈G, g⊆f} (Σ_x x Pr[X = x ∩ G = g])/Pr[F = f]
= (Σ_x x Σ_{g∈G, g⊆f} Pr[X = x ∩ G = g])/Pr[F = f]
= (Σ_x x Pr[X = x ∩ F = f])/Pr[F = f]
= E[X | F = f],
where g ranges over the atomic events of G (since F ⊆ G, each atomic event f of F is the disjoint union of the atomic events g ⊆ f).
Theorem 9.2.5. Let (Ω, F, Pr) be a probability space, and let F_0, . . . , F_n be a filter with respect to it. Let X be any random variable over this probability space, and define X_i = E[X | F_i]. Then, the sequence X_0, . . . , X_n is a martingale.

Specifically, a function is c-Lipschitz, if the inequality holds with a constant c (instead of 1).

Definition 9.2.7. Let X_1, . . . , X_n be a sequence of independent random variables, and let f(X_1, . . . , X_n) be a function defined over them that satisfies the Lipschitz condition. The Doob martingale sequence Y_0, . . . , Y_n is defined by Y_0 = E[f(X_1, . . . , X_n)] and Y_i = E[f(X_1, . . . , X_n) | X_1, . . . , X_i], for i = 1, . . . , n.
Clearly, a Doob martingale Y_0, . . . , Y_n is a martingale, by Theorem 9.2.5. Furthermore, since f satisfies the Lipschitz condition, we have |Y_i − Y_{i−1}| ≤ 1, for i = 1, . . . , n, and we can use Azuma's inequality on such a sequence.
9.3. Occupancy Revisited
We have m balls thrown independently and uniformly into n bins. Let Z denote the number of bins that remain empty at the end of the process. Let X_i be the bin chosen in the ith trial, and let Z = F(X_1, . . . , X_m), where F returns the number of empty bins given that the m balls were thrown into bins X_1, . . . , X_m. Clearly, we have by Azuma's inequality that Pr[|Z − E[Z]| > λ√m] ≤ 2e^{−λ²/2}.
The following is an extension of Azuma’s inequality shown in class. We do not provide a proof but
it is similar to what we saw.
Theorem 9.3.1 (Azuma's Inequality – Stronger Form). Let X_0, X_1, . . . , be a martingale sequence such that for each k, |X_k − X_{k−1}| ≤ c_k, where c_k may depend on k. Then, for all t ≥ 0, and any λ > 0,
Pr[|X_t − X_0| ≥ λ] ≤ 2 exp(−λ²/(2 Σ_{k=1}^t c_k²)).
Theorem 9.3.2. Let r = m/n, and let Z_end be the number of empty bins when m balls are thrown randomly into n bins. Then µ = E[Z_end] = n(1 − 1/n)^m ≈ ne^{−r}, and for any λ > 0, we have
Pr[|Z_end − µ| ≥ λ] ≤ 2 exp(−λ²(n − 1/2)/(n² − µ²)).
Proof: Let z(Y, t) be the expected number of empty bins at the end of the process, if there are Y empty bins at time t. Clearly,
z(Y, t) = Y (1 − 1/n)^{m−t}.
In particular, µ = z(n, 0) = n(1 − 1/n)^m.
Let F_t be the σ-field generated by the bins chosen in the first t steps. Let Z_end be the number of empty bins at time m, and let Z_t = E[Z_end | F_t]. Namely, Z_t is the expected number of empty bins after we know where the first t balls had been placed. The random variables Z_0, Z_1, . . . , Z_m form a martingale. Let Y_t be the number of empty bins after t balls were thrown. We have Z_{t−1} = z(Y_{t−1}, t − 1). Consider the ball thrown in the tth step. Clearly:
(A) With probability 1 − Y_{t−1}/n the ball falls into a non-empty bin. Then Y_t = Y_{t−1}, and Z_t = z(Y_{t−1}, t). Thus,
∆_t = Z_t − Z_{t−1} = z(Y_{t−1}, t) − z(Y_{t−1}, t − 1) = Y_{t−1}((1 − 1/n)^{m−t} − (1 − 1/n)^{m−t+1}) = (Y_{t−1}/n)(1 − 1/n)^{m−t} ≤ (1 − 1/n)^{m−t}.
(B) Otherwise, with probability Y_{t−1}/n the ball falls into an empty bin, and Y_t = Y_{t−1} − 1. Namely, Z_t = z(Y_{t−1} − 1, t). And we have that
∆_t = Z_t − Z_{t−1} = z(Y_{t−1} − 1, t) − z(Y_{t−1}, t − 1) = (Y_{t−1} − 1)(1 − 1/n)^{m−t} − Y_{t−1}(1 − 1/n)^{m−t+1}
= (1 − 1/n)^{m−t}(Y_{t−1} − 1 − Y_{t−1}(1 − 1/n)) = −(1 − 1/n)^{m−t}(1 − Y_{t−1}/n) ≥ −(1 − 1/n)^{m−t}.
Thus, Z_0, . . . , Z_m is a martingale sequence, where |Z_t − Z_{t−1}| = |∆_t| ≤ c_t, for c_t = (1 − 1/n)^{m−t}. We have
Σ_{t=1}^m c_t² = (1 − (1 − 1/n)^{2m})/(1 − (1 − 1/n)²) = n²(1 − (1 − 1/n)^{2m})/(2n − 1) = (n² − µ²)/(2n − 1).
Plugging this into Theorem 9.3.1 implies the claim.

To see that this is indeed an improvement, consider the case m = n ln n. The generic bound from the beginning of this section gives
Pr[|Z_end − µ| ≥ λ√n] = Pr[|Z_end − µ| ≥ (λ√(n/m))√m] ≤ 2 exp(−λ²n/(2m)) = 2 exp(−λ²/(2 ln n)),
which is interesting only if λ > √(2 ln n). On the other hand, Theorem 9.3.2 implies that
Pr[|Z_end − µ| ≥ λ√n] ≤ 2 exp(−λ²n(n − 1/2)/(n² − µ²)) ≤ 2 exp(−λ²),
which is a much stronger bound.
Proof: Follows by induction. Indeed, for m = 1 the claim is immediate. For m ≥ 2, we have
(1 − 1/n)^m = (1 − 1/n)(1 − 1/n)^{m−1} ≥ (1 − 1/n)(1 − (m − 1)/n) ≥ 1 − m/n.
Chapter 10
10.1. Introduction
The probabilistic method is a combinatorial technique that uses probabilistic algorithms to create objects having desirable properties and, furthermore, to prove that such objects exist. The technique is based on two basic observations:
1. If a random variable X has expectation µ = E[X], then there exists an elementary event for which X ≥ µ (and one for which X ≤ µ).
2. If the probability of an event E is larger than zero, then E is not empty; that is, such an object exists.
The surprising thing is that despite the elementary nature of those two observations, they lead to a powerful technique that yields numerous nice and strong results, including some elementary proofs of theorems that previously had very complicated and involved proofs.
The main proponent of the probabilistic method, was Paul Erdős. An excellent text on the topic is
the book by Noga Alon and Joel Spencer [AS00].
This topic is worthy of its own course. The interested student is referred to the course “Math 475 — The Probabilistic Method”.
10.1.1. Examples
Theorem 10.1.1. For any undirected graph G(V, E) with n vertices and m edges, there is a partition of
the vertex set V into two sets A and B such that
m
uv ∈ E u ∈ A and v ∈ B ≥ .
2
85
Proof: Consider the following experiment: randomly assign each vertex to A or B, independently and
equal probability.
For an edge e = uv, the probability that one endpoint is in A, and the other in B is 1/2, and let Xe
be the indicator variable with value 1 if this happens. Clearly,
Õ Õ 1 m
uv ∈ E (u, v) ∈ (A × B) ∪ (B × A)
E = E[Xe ] = = .
2 2
e∈E(G) e∈E(G)
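The argument is constructive in a weak sense — a uniformly random partition already cuts m/2 edges in expectation. A Python sketch (the graph is a random example of ours, not from the text):

import random

def random_cut_size(edges):
    """Assign each vertex to A or B uniformly at random; count crossing edges."""
    vertices = {v for e in edges for v in e}
    side = {v: random.randrange(2) for v in vertices}
    return sum(side[u] != side[v] for u, v in edges)

if __name__ == "__main__":
    n, m = 100, 600
    edges = [tuple(random.sample(range(n), 2)) for _ in range(m)]
    avg = sum(random_cut_size(edges) for _ in range(2000)) / 2000
    print("average cut size:", round(avg, 1), "  m/2 =", len(edges) / 2)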
Theorem 10.1.3. Let M be an n × n binary matrix (i.e., each entry is either 0 or 1). Then there always exists a vector b ∈ {−1, +1}^n such that ||Mb||_∞ ≤ 4√(n log n).

Proof: Let v = (v_1, . . . , v_n) be a row of M. Choose a random b = (b_1, . . . , b_n) ∈ {−1, +1}^n. Let i_1, . . . , i_m be the indices such that v_{i_j} = 1, and let
Y = ⟨v, b⟩ = Σ_{i=1}^n v_i b_i = Σ_{j=1}^m v_{i_j} b_{i_j} = Σ_{j=1}^m b_{i_j}.
As such, Y is the sum of m independent random variables that take values in {−1, +1}. Clearly,
E[Y] = E[⟨v, b⟩] = E[Σ_i v_i b_i] = Σ_i E[v_i b_i] = Σ_i v_i E[b_i] = 0.
By the Chernoff inequality (Theorem 7.1.7) and the symmetry of Y, we have that, for ∆ = 4√(n ln n), it holds
Pr[|Y| ≥ ∆] = 2 Pr[⟨v, b⟩ ≥ ∆] = 2 Pr[Σ_{j=1}^m b_{i_j} ≥ ∆] ≤ 2 exp(−∆²/(2m)) = 2 exp(−8(n ln n)/m) ≤ 2/n^8.
Thus, the probability that some entry in Mb exceeds 4√(n ln n) in absolute value is smaller than 2/n^7. Thus, with probability at least 1 − 2/n^7, all the entries of Mb have absolute value smaller than 4√(n ln n).
Theorem 10.2.1. For any set of m clauses, there is a truth assignment of variables that satisfies at
least m/2 clauses.
86
Proof: Assign every variable a random value. Clearly, a clause with k variables, has probability 1 − 2−k
to be satisfied. Using linearity of expectation, and the fact that every clause has at least one variable, it
follows, that E[X] = m/2, where X is the random variable counting the number of clauses being satisfied.
In particular, there exists an assignment for which X ≥ m/2.
For an instant I, let mopt (I), denote the maximum number of clauses that can be satisfied by the
“best” assignment. For an algorithm Alg, let mAlg (I) denote the number of clauses satisfied computed
by the algorithm Alg. The approximation factor of Alg, is mAlg (I)/mopt (I). Clearly, the algorithm
of Theorem 10.2.1 provides us with 1/2-approximation algorithm.
For every clause, C j in the given instance, let z j ∈ {0, 1} be a variable indicating whether C j is
satisfied or not. Similarly, let xi = 1 if the ith variable is being assigned the value TRUE. Let C +j be
indices of the variables that appear in C j in the positive, and C −j the indices of the variables that appear
in the negative. Clearly, to solve MAX-SAT, we need to solve:
m
Õ
maximize zj
j=1
subject to xi, z j ∈ {0, 1} for all i, j
Õ Õ
xi + (1 − xi ) ≥ z j for all j.
i∈C j+ i∈C j−
m
Õ
maximize zj
j=1
subject to 0 ≤ yi, z j ≤ 1 for all i, j
Õ Õ
yi + (1 − yi ) ≥ z j for all j.
i∈C j+ i∈C j−
Which can be solved in polynomial time. Let b t denote the values assigned to the variable t by the
Ím
linear-programming solution. Clearly, j=1 zbj is an upper bound on the number of clauses of I that can
be satisfied.
yi . This is randomized rounding.
We set the variable yi to 1 with probability b
Lemma 10.2.2. Let C j be a clause with k literals. The probability that it is satisfied by randomized
rounding is at least βk zbj ≥ (1 − 1/e)b
z j , where
k
1
βk = 1 − 1 − .
k
Proof: Assume C j = y1 ∨ v2 . . . ∨ vk . By the LP, we have yb1 + · · · + ybk ≥ zbj . Furthermore, the probability
Îk Îk
that C j is not satisfied is i=1 (1 − b
yi ). Note that 1 − i=1 (1 − b
yi ) is minimized when all the b yi ’s are equal
(by symmetry). Namely, when b yi = zbj /k. Consider the function f (x) = 1 − (1 − x/k) k . This is a concave
87
function, which is larger than g(x) = βk x for all 0 ≤ x ≤ 1, as can be easily verified, by checking the
inequality at x = 0 and x = 1.
Thus,
k
Ö
Pr C j is satisfied = 1 − yi ) ≥ f zbj ≥ βk zbj .
(1 − b
i=1
The second part of the inequality, follows from the fact that βk ≥ 1 − 1/e, for all k ≥ 0. Indeed, for
k = 1, 2 the claim trivially holds. Furthermore,
k k
1 1 1 1
1− 1− ≥ 1− ⇔ 1− ≤ ,
k e k e
1 k
but this holds since 1 − x ≤ e−x implies that 1 − 1
≤ e−1/k , and as such 1 − ≤ e−k/k = 1/e.
k k
Theorem 10.2.3. Given an instance I of MAX-SAT, the expected number of clauses satisfied by linear
programming and randomized rounding is at least (1−1/e) ≈ 0.632mopt (I), where mopt (I) is the maximum
number of clauses that can be satisfied on that instance.
Theorem 10.2.4. Given an instance I of MAX-SAT, let n1 be the expected number of clauses satisfied
by randomized assignment, and let n2 be the expected number of clauses satisfied by linear programming
followed by randomized rounding. Then, max(n1, n2 ) ≥ (3/4) j zbj ≥ (3/4)mopt (I).
Í
Proof: It is enough to show that (n1 + n2 )/2 ≥ 34 j zbj . Let Sk denote the set of clauses that contain k
Í
literals. We know that
Õ Õ Õ Õ
n1 = 1 − 2−k ≥ 1 − 2−k zbj .
k C j ∈Sk k C j ∈Sk
n1 + n2 Õ Õ 1 − 2−k + βk
≥ zbj .
2 k C ∈S
2
j k
n1 + n2 3 Õ Õ 3Õ
≥ zbj = zbj .
2 4 k C ∈S 4 j
j k
¬ Indeed,by the proof of Lemma 10.2.2, we have that βk ≥ 1 − 1/e. Thus, 1 − 2−k + βk ≥ 2 − 1/e − 2−k ≥ 3/2 for k ≥ 3.
Thus, we only need to check the inequality for k = 1 and k = 2, which can be done directly.
88
Chapter 11
89
as c = 2 and d = 18. Thus,
h i Õ
(0.4)s < 1.
Õ
Pr E s ≤
s≥1 s≥1
It thus follows that the random graph we generated has the required properties with positive probabil-
ity.
Theorem 11.2.1. For n large enough, there exists a bipartite graph G(V, R, E) with |V | = n, |R| = 2lg
2
n
such that:
(i) Every subset of n/2 vertices of V has at least 2lg n − n neighbors in R.
2
Proof: Each vertex of R chooses d = 2lg n (4 lg2 n)/n neighbors independently in R. We show that the
2
resulting graph violate the required properties with probability less than half.®
The probability for a set of n/2 vertices on the left to fail to have enough neighbors, is
lg2 n dn/2 !n
n n lg2 n e dn n
2 n 2
τ≤ 1− 2 ≤2 exp −
n/2 n 2lg n n 2 2lg2 n
!n !
2lg n e 2lg n (4 lg2 n)/n lg2 n e
2 2
n2
© ª
n 2
≤ expn + n ln −2n lg n®®,
2
®
≤2 exp −
n 2 2lg
2
n n ®
| {z }
| {z } ∗
∗
« ¬
2n 2n y
n 2lg 2lg x xe
≤ 2n and ¯.
since n/2 lg2 n
2 −n
= n , and y ≤ y Now, we have
2lg n e 2
lg2 n
ρ = n ln = n ln 2 + ln e − ln n ≤ (ln 2)n lg2 n ≤ 0.7n lg2 n,
n
for n ≥ 3. As such, we have τ ≤ exp n + (0.7 − 2)n lg2 n 1/4.
Everybody knows that lg n = log2 n. Everybody knows that the captain lied.
® Here, we keep parallel edges if they happen – which is unlikely. The reader can ignore this minor technicality, on her
way to ignore this whole write-up.
¯ The reader might want to verify that one can use significantly weaker upper bounds and the result still follows – we
are using the tighter bounds here for educational reasons, and because we can.
90
As for the second property, note that the expected number of neighbors of a vertex v ∈ R is 4 lg2 n.
Indeed, the probability of a vertex on R to become adjacent to a random edge is ρ = 1/|R|, and
this “experiment” is repeated independently dn times. As such, the expected degree of a vertex is
µ E Y = dn/|R| = 4 lg2 n. The Chernoff bound (Theorem 7.3.2p65 ) implies that
h i h i
α = Pr Y > 12 lg n = Pr Y > (1 + 2)µ < exp −µ22 /4 = exp −4 lg2 n .
2
Since there are 2lg n vertices in R, we have that the probability that any vertex in R has a degree that
2
exceeds 12 lg2 n, is, by the union bound, at most |R| α ≤ 2lg n exp −4 lg2 n ≤ exp −3 lg2 n 1/4,
2
concluding our tedious calculations° .
Thus, with constant positive probability, the random graph has the required property, as the union
of the two bad events has probability 1/2.
We assume that given a vertex (of the above graph) we can compute its neighbors, without computing
the whole graph.
So, we are given an input x. Use lg2 n bits to pick a vertex v ∈ R. We next identify the neighbors
of v in V: r1, . . . , rk . We then compute Alg(x, ri ), for i = 1, . . . k. Note that k = O lg2 n . If all k calls
return 0, then we return that Alg is not in the language. Otherwise, we return that x belongs to V.
If x is in the language, then consider the subset U ⊆ V, such that running Alg on any of the strings
of U returns TRUE.
2 We know that |U| ≥ n/2. The set U is connected to all the vertices of R except for
at most |R| − 2lg n − n = n of them. As such, the probability of a failure in this case, is
h i h i n n
Pr x ∈ L but r1, r2, . . . , rk < U = Pr v not connected to U ≤ ≤ 2 .
|R| 2lg n
Lemma 11.2.2. Given an algorithm Alg in RP that uses lg n random bits, and an access explicit access
to the graph of Theorem 11.2.1, one can decide if an input word is in the language of Alg using lg2 n
bits, and the probability of failure is at most lgn2 n .
2
Let us compare the various results we now have about running an algorithm in RP using lg2 n bits.
We have three options:
(A) Randomly run the algorithm lg n times independently. The probability of failure is at most
1/2lg n = 1/n.
(B) Lemma 11.2.2, which as probability of failure at most 1/2lg n = 1/n.
(C) The third option is to use pairwise independent sampling (see Lemma 6.1.11p52 ). While it is
not directly comparable to the above two options, it is clearly inferior, and is thus less useful.
Unfortunately, there is no explicit construction of the expanders used here. However, there are
alternative techniques that achieve a similar result.
° Once again, our verbosity in applying the Chernoff inequality is for educational reasons – usually such calculations
would be swept under the rag. No wonder than that everybody is afraid to look under the rag.
91
11.3. Oblivious routing revisited
Theorem 11.3.1. Consider any randomized oblivious algorithm for permutation routing on the hy-
n
p with N = 2 nodes. If this algorithm uses k random bits, then its expected running time is
percube
Ω 2−k N/n .
Corollary 11.3.2. Any randomized oblivious algorithm for permutation routing on the hypercube with
N = 2n nodes must use Ω(n) random bits in order to achieve expected running time O(n).
Theorem 11.3.3. For every n, there exists a randomized oblivious scheme for permutation routing on
a hypercube with n = 2n nodes that uses 3n random bits and runs in expected time at most 15n.
92
Chapter 12
Pr A ∩ B C
Pr[A ∩ B ∩ C] Pr[B ∩ C] Pr[A ∩ B ∩ C]
= Pr A B ∩ C .
= =
Pr B C Pr[B ∩ C]
Pr[C] Pr[C]
As for (ii), we already saw it and used it in the minimum cut algorithm lecture.
Definition 12.1.2. An event E is mutually independent of a set of events C, if for any subset U ⊆ C, we
have that Pr[E ∩ ( E 0 ∈U E 0)] = Pr[E] Pr[ E 0 ∈U E 0].
Ñ Ñ
Let E1, . . . , En be events. A dependency graph for these events is a directed graph G = (V, E),
where {1, . . . , n}, such that Ei is mutually independent of all the events in E j (i, j) < E .
Intuitively, an edge (i, j) in a dependency graph indicates that Ei and E j have (maybe) some depen-
dency between them. We are interested in settings where this dependency is limited enough, that we
can claim something about the probability of all these events happening simultaneously.
93
Lemma 12.1.3 (Lovász Local Lemma). Let G(V, E) be a dependency Ö graph for events Eh1, . . . , iEn .
n
Suppose that there exist xi ∈ [0, 1], for 1 ≤ i ≤ n such that Pr[Ei ] ≤ xi 1 − x j . Then Pr ∩i=1
Ei ≥
(i, j)∈E
n
Ö
(1 − xi ).
i=1
Lemma 12.1.4. Let G(V, E) be a dependency Ö graph for events E1, . . . , En . Suppose that there exist
xi ∈ [0, 1], for 1 ≤ i ≤ n such that Pr[Ei ] ≤ xi 1 − x j . Now, let S be a subset of the vertices from
(i, j)∈E
{1, . . . , n}, and let i be an index not in S. We have that
h i
Pr Ei ∩ j∈S E j ≤ xi . (12.1)
xi , by arguing as above.
By Lemma 12.1.1 (i), we have that
h i
Ù Pr E i ∩ ∩ j∈N jE ∩ E
m∈R m
Pr Ei E j = h i .
j∈S
Pr ∩ j∈N E j ∩ m∈R E m
since Ei is mutually independent of C(R). As for the denominator, let N = { j1, . . . , jr }. We have, by
Lemma 12.1.1 (ii), that
h i h i h i
Pr E j1 ∩ . . . ∩ E jr ∩m∈R Em = Pr E j1 ∩m∈R Em Pr E j2 E j1 ∩ ∩m∈R Em
h i
· · · Pr E jr E j1 ∩ . . . ∩ E jr−1 ∩ ∩m∈R Em
h i h i
= 1 − Pr E j1 ∩m∈R Em 1 − Pr E j2 E j1 ∩ ∩m∈R Em
h i
· · · 1 − Pr E jr E j1 ∩ . . . ∩ E jr−1 ∩ ∩m∈R Em
Ö
≥ 1 − x j1 · · · 1 − x jr ≥ 1 − xj ,
(i, j)∈E
94
Proof of Lovász local lemma (Lemma 12.1.3): Using Lemma 12.1.4, we have that
h i h i h i n
n n−1
Ö
Pr ∩i=1 Ei = (1 − Pr[E1 ]) 1 − Pr E2 E1 · · · 1 − Pr En ∩i=1 Ei ≥ (1 − xi ).
i=1
Corollary 12.1.5. Let E1, . . . , En be events, with Pr[Ei ] ≤ p for all i. If eachh event is
i mutually inde-
n
pendent of all other events except for at most d, and if ep(d + 1) ≤ 1, then Pr ∩i=1 Ei > 0.
Proof: If d = 0 the result is trivial, as the events are independent. Otherwise, there is a dependency
graph, with every vertex having degree at most d. Apply Lemma 12.1.3 with xi = d+1 1
. Observe that
d
d 1 1 1 1
xi (1 − xi ) = 1− > · ≥ p,
d+1 d+1 d+1 e
1 d
by assumption and the since 1 − d+1 > 1/e, see Lemma 12.1.6 below.
The following is standard by now, and we include it only for the sake of completeness.
n
1 1
Lemma 12.1.6. For any n ≥ 1, we have 1 − > .
n+1 e
n n n+1 n
> 1e . Namely, we need to prove e >
Proof: This is equivalent to n+1 n . But this obvious, since
n+1 n n
= 1 + 1n < exp(n(1/n)) = e.
n
95
(i) k/2 literals of Ei have been fixed.
After assigning each value, we discover all the dangerous clauses, and we defer (“freeze”) all the
unassigned variables participating in such a clause. We continue in this fashion till all the unspecified
variables are frozen. This completes the first stage of the algorithm.
At the second stage of the algorithm, we will compute a satisfying assignment to the variables using
brute force. This would be done by taking the surviving formula I 0 and breaking it into fragments, so
that each fragment does not share any variable with any other fragment (naively, it might be that all of
I 0 is one fragment). We can find a satisfying assignment to each fragment separately, and if each such
fragment is “small” the resulting algorithm would be “fast”.
We need to show that I 0 has a satisfying assignment and that the fragments are indeed small.
12.2.1.1. Analysis
A clause had survived if it is not satisfied by the variables fixed in the first stage. Note, that a clause
that survived must have a dangerous clause as a neighbor in the dependency graph G. Not that I 0,
the instance remaining from I after the first stage, has at least k/2 unspecified variables in each clause.
Furthermore, every clause of I 0 has at most d = k2 k/50 neighbors in G0, where G0 is the dependency
graph for I 0. It follows, that again, we can apply Lovász local lemma to conclude that I 0 has a satisfying
assignment.
Definition 12.2.2. Two connected graphs G1 = (V1, E1 ) and G2 = (V2, E2 ), where V1, V2 ⊆ {1, . . . , n} are
unique if V1 , V2 .
Lemma 12.2.3. Let G be a graph with degree at most d and with n vertices. Then, the number of
unique subgraphs of G having r vertices is at most nd 2r .
Lemma 12.2.4. With probability 1 − o(1), all connected components of G0 have size at most O(log m),
where G0 denote the dependency graph for I 0.
Proof: Let G4 be a graph formed from G by connecting any pair of vertices of G of distance exactly 4
from each other. The degree of a vertex of G4 is at most O(d 4 ).
Let U be a set of r vertices of G, such that every pair is in distance at least 4 from each other in G.
We are interested in bounding the probability that all the clauses of U survive the first stage.
The probability of a clause to be dangerous is at most 2−k/2 , as we assign (random) values to half
of the variables of this clause. Now, a clause survive only if it is dangerous or one of its neighbors is
dangerous. Thus, the probability that a clause survive is bounded by 2−k/2 (d + 1).
96
Furthermore, the survival of two clauses Ei and E j in U is an independent event, as no neighbor of
Ei shares a variable with a neighbor of E j (because of the distance 4 requirement). We conclude, that
the probability that all the vertices of U to appear in G0 is bounded by
r
−k/2
2 (d + 1) .
In fact, we are interested in sets U that induce a connected subgraphs of G4 . The number of unique
such sets of size r is bounded by the number of unique subgraphs of G4 of size r, which is bounded by
md 8r , by Lemma 12.2.3. Thus, the probability of any connected subgraph of G4 of size r = log2 m to
survive in G0 is smaller than
r 8r r
md 8r 2−k/2 (d + 1) = m k2 k/50 2−k/2 (k2 k/50 + 1) ≤ m2 kr/5 · 2−kr/4 = m2−kr/20 = o(1),
since k ≥ 50. (Here, a subgraph survive of G4 survive, if all its vertices appear in G0.) Note, however, that
if a connected component of G0 has more than L vertices, than there must be a connected component
having L/d 3 vertices in G4 that had survived in G0. We conclude, that with probability o(1), no connected
component of G0 has more than O(d 3 log m) = O(log m) vertices (note, that we consider k to be a constant,
and thus, also d).
Thus, after the first stage, we are left with fragments of (k/2)-SAT, where every fragment has size
at most O(log m), and thus having at most O(log m) variables. Thus, we can by brute force find the
satisfying assignment to each such fragment in time polynomial in m. We conclude:
Theorem 12.2.5. The above algorithm finds a satisfying truth assignment for any instance of k-SAT
containing m clauses, which each variable is contained in at most 2 k/50 clauses, in expected time poly-
nomial in m.
97
98
Chapter 13
Problem 13.1.1 (Set Balancing). Given a binary matrix A of size n × n, find a vector v ∈ {−1, +1}n , such
that k Avk ∞ is minimized.
Using random
√ assignment and the Chernoff inequality, we showed that there exists v, such that
k Avk ∞ ≤ 4 n ln n. Can we derandomize this algorithm? Namely, can we come up with an efficient
deterministic algorithm that has low discrepancy?
To derandomize our algorithm, construct a computation tree of depth n, where in the ith level we
expose the ith coordinate of v. This tree T has depth n. The root represents all possible random choices,
while a node at depth i, represents all computations when the first i bits are fixed. For a node v ∈ T,
let P(v) be the probability that a random computation starting from v succeeds. Let vl and vr be the
two children of v. Clearly, P(v) = (P(vl ) + P(vr ))/2. In particular, max(P(vl ), P(vr )) ≥ P(v). Thus, if we
could compute P(·) quickly (and deterministically), then we could derandomize the algorithm.
Let Cm be the bad eventp that rm · v > 4 n log n, where rm is the mth row of A. Similarly, Cm− is the
p
+
bad event that rm · v < −4 n log n, and let Cm = Cm+ ∪ Cm− . Consider the probability, Pr Cm+ v1, . . . , v k
(namely, the first k coordinates of v are specified). Let rm = (α1, . . . , αn ). We have that
n k
" # " # " #
Õ p Õ Õ Õ
Pr Cm+ v1, . . . , v k = Pr vi αi > 4 n log n − vi αi > L = Pr vi > L ,
vi αi = Pr
i=k+1 i=1 i≥k+1,αi ,0 i≥k+1,αi =1
99
Ík
where L = 4 n log n − i=1 vi αi is a known quantity (since v1, . . . , v k are known). Let V = i≥k+1,αi =1 1.
p Í
We have,
Õ Õ v +1 L +V
i
Pr Ci+ v1, . . . , v k (vi + 1) > L + V = Pr
= Pr > ,
2 2
i≥k+1 i≥k+1
αi =1 αi =1
The last probability, is the probability that in V flips of a fair coin we will get more than (L + V)/2
heads. Thus,
V V
Õ V 1 1© Õ V ª
Pm+ Cm+
= Pr v1, . . . , v k = n
= n ®.
i 2 2 i
i=d(L+V)/2e «i=d(L+V)/2e ¬
This implies, that we can compute Pm+ in polynomial time! Indeed, we are adding V ≤ n numbers,
each one of them is a binomial coefficient that has polynomial size representation in n, and can be
computed in polynomial time (why?). One can define in similar fashion Pm− , and let Pm = Pm+ + Pm− .
Clearly, Pm can be computed in polynomial time, by applying a similar argument to the computation
of Pm− = Pr Cm− v1, . . . , v k .
For a node v ∈ T, let vv denote the portion of v that was fixed when traversing from the root of T
Ín
to v. Let P(v) = m=1 Pr Cm vv . By the above discussion P(v) can be computed in polynomial time.
Furthermore, we know, by the previous result on set balancing that P(r) < 1 (that was the bound used
to show that there exist a good assignment).
As before, for any v ∈ T, we have P(v) ≥ min(P(vl ), P(vr )). Thus, wephave a polynomial deterministic
algorithm for computing a set balancing with discrepancy smaller than 4 n log n. Indeed, set v = root(T).
And start traversing down the tree. At each stage, compute P(vl ) and P(vr ) (in polynomial time), and
p of P(·). Clearly, after n steps, we reach a leaf, that corresponds to a
set v to the child with lower value
vector v such that k Av k ∞ ≤ 4 n log n.
0 0
Theorem 13.1.2. Using the method of conditional p probabilities, one can compute in polynomial time
n
in n, a vector v ∈ {−1, 1} , such that k Avk ∞ ≤ 4 n log n.
Note, that this method might fail to find the best assignment.
Theorem 13.2.2. For all K, L there exists a graph G with girth(G) > L and χ(G) > K.
100
Proof: Fix µ < 1/L, and let G ≈ G(n, p) with p = n µ−1 ; namely, G is a random graph on n vertices chosen
by picking each pair of vertices to be an edge in G, randomly and independently with probability p. Let
X be the number of cycles of size at most L. Then
L L L
Õ n! 1 i Õ ni µ−1 i
Õ n µi
E[X] = · ·p ≤ · n ≤ = o(n),
i=3
(n − i)! 2i i=3
2i i=3
2i
n!
as µL < 1, and since the number of different sequence of i vertices is (n−i)! , and every cycle is being
counted in this sequence 2i times.
In particular,
l Pr[X
m ≥ n/2] = o(1).
Let x = 3p ln n + 1. We remind the reader that α(G) denotes the size of the largest independent set
in G. We have that
i n x x
p(x − 1) x
h
( x
) 3
Pr α(G) ≥ x ≤ (1 − p) < n exp −
2 < n exp − ln n < o(1) = o(1).
x 2 2
Let n be sufficiently large so that both these events have probability less than 1/2. Then there is a
specific G with less than n/2 cycles of length at most L and with α(G) < 3n1−µ ln n + 1.
Remove from G a vertex from each cycle of length at most L. This gives a graph G∗ with at least n/2
vertices. G∗ has girth greater than L and α(G∗ ) ≤ α(G) (any independent set in G∗ is also an independent
set in G). Thus
∗ |V(G∗ )| n/2 nµ
χ(G ) ≥ ≥ 1−µ ≥ .
α(G∗ ) 3n ln n 12 ln n
To complete the proof, let n be sufficiently large so that this is greater than K.
|E| 3
Theorem 13.2.3. The crossing number of any simple graph G = (V, E) with |E| ≥ 4 |V| is ≥ .
64 |V| 2
Proof: By Euler’s formula any simple planar graph with n vertices has at most 3n − 6 edges. (Indeed,
f − e + v = 2 in the case with maximum number of edges, we have that every face, has 3 edges around
it. Namely, 3 f = 2e. Thus, (2/3)e − e + v = 2 in this case. Namely, e = 3v − 6.) This implies that
the crossing number of any simple graph with n vertices and m edges is at least m − 3n + 6 > m − 3n.
Let G = (V, E) be a graph with |E| ≥ 4 |V| embedded in the plane with t = cr(G) crossings. Let H be
the random induced subgraph of G obtained by picking each vertex of G randomly and independently,
to be a vertex of H with probabilistic p (where P will be specified shortly). The expected number of
vertices of H is p |V|, the expected number of its edges is p2 |E|, and the expected number of crossings
101
in the given embedding is p4 t, implying that the expected value of its crossing number is at most p4 t.
Therefore, we have p4 t ≥ p2 |E| − 3p |V|, implying that
|E| 3 |V|
cr(G) ≥ − 3 ,
p2 p
let p = 4 |V| /|E| < 1, and we have cr(G) ≥ (1/16 − 3/64) |E| 3 /|V| 2 = |E| 3 /(64 |V| 2 ).
Theorem 13.2.4. Let P be a set of n distinct points in the plane, and let L be a set of m distinct lines.
Then, the number of incidences between the points of P and the lines
of L (that is, the number of pairs
(p, `) with p ∈ P, ` ∈ L, and p ∈ `) is at most c m2/3 n2/3 + m + n , for some absolute constant c.
Proof: Let I denote the number of such incidences. Let G = (V, E) be the graph whose vertices are all
the points of P, where two are adjacent if and only if they are consecutive points of P on some line in L.
Clearly |V| = n, and |E| = I − m. Note that G is already given embedded in the plane, where the edges
are presented by segments of the corresponding lines of L.
Either, we can not apply Theorem 13.2.3, implying that I − m = |E| < 4 |V| = 4n. Namely, I ≤ m + 4n.
Or alliteratively,
|E| 3 (I − m)3 m m2
= ≤ cr(G) ≤ ≤ .
64 |V| 2 64n2 2 2
Implying that I ≤ (32)1/3 m2/3 n2/3 + m. In both cases, I ≤ 4(m2/3 n2/3 + m + n).
This technique has interesting and surprising results, as the following theorem shows.
Theorem 13.2.5. For any three sets A, B and C of s real numbers each, we have
| A · B + C| = ab + c a ∈ A, b ∈ B, mc ∈ C ≥ Ω s3/2 .
Clearly n = |P| =
sr, and m = |L| = s . Furthermore, a line y = bx + c of L is incident with s points
2
For r < s3 , we have that sr ≤ s2r 2/3 . Thus, for r < s3 , we have s3 ≤ 12s2r 2/3 , implying that s3/2 ≤ 12r.
Namely, |R| = Ω(s3/2 ), as claimed.
Among other things, the crossing number technique implies a better bounds for k-sets in the plane
than what was previously known. The k-set problem had attracted a lot of research, and remains till
this day one of the major open problems in discrete geometry.
102
13.2.3. Bounding the at most k-level
Let L be a set of n lines in the plane. Assume, without loss of generality, that no three lines of L pass
through a common point, and none of them is vertical. The complement of union of lines L break the
plane into regions known as faces. An intersection of two lines, is a vertex, and the maximum interval
on a line between two vertices is am edge. The whole structure of vertices, edges and faces induced by
L is known as arrangement of L, denoted by A(L).
Let L be a set of n lines in the plane. A point p ∈ `∈L ` is of level k if there are k lines of L strictly
Ð
below it. The k-level is the closure of the set of points of level k. Namely, the k-level is an x-monotone
curve along the lines of L.t
The 0-level is the boundary of the “bottom” face of the arrangement of
L (i.e., the face containing the negative y-axis). It is easy to verify that the
0-level has at most n − 1 vertices, as each line might contribute at most one
segment to the 0-level (which is an unbounded convex polygon). 3-level
It is natural to ask what the number of vertices at the k-level is (i.e.,
what the combinatorial complexity of the polygonal chain forming the k- 1-level
0-level
level is). This is a surprisingly hard question, but the same question on the
complexity of the at most k-level is considerably easier.
Theorem 13.2.6. The number of vertices of level at most k in an arrangement of n lines in the plane
is O(nk).
Proof: Pick a random sample R of L, by picking each line to be in the sample with probability 1/k.
Observe that
n
E[|R|] = .
k
Let L≤k = L≤k (L) be the set of all vertices of A(L) of level at most k, for k > 1. For a vertex p ∈ L≤k ,
let Xp be an indicator variable which is 1 if p is a vertex of the 0-level of A(R). The probability that p is
in the 0-level of A(R) is the probability that none of the j lines below it are picked to be in the sample,
and the two lines that define it do get selected to be in the sample. Namely,
j 2 k
k 1
1 1 1 1 1
Pr Xp = 1 = 1 −
≥ 1− ≥ exp −2 = 2 2
k k k k 2 k k 2 e k
since j ≤ k and 1 − x ≥ e−2x , for 0 < x ≤ 1/2.
On the other hand, the number of vertices on the 0-level of R is at most |R| − 1. As such,
Õ
Xp ≤ |R| − 1.
p∈L ≤k
|L≤k | n
Putting these two inequalities together, we get that ≤ . Namely, |L≤k | ≤ e2 nk.
e k
2 2 k
103
104
Chapter 14
Random Walks I
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“A drunk man will find his way home; a drunk bird may wander forever.”
– Anonymous.
14.1. Definitions
Let G = G(V, E) be an undirected connected graph. For v ∈ V, let Γ(v) denote the set of neighbors of
v in G; that is, Γ(v) = u vu ∈ E(G) . A random walk on G is the following process: Starting from
a vertex v0 , we randomly choose one of the neighbors of v0 , and set it to be v1 . We continue in this
fashion, in the ith step choosing vi , such that vi ∈ Γ(vi−1 ). It would be interesting to investigate the
random walk process. Questions of interest include:
(A) How long does it take to arrive from a vertex v to a vertex u in G?
(B) How long does it take to visit all the vertices in the graph.
(C) If we start from an arbitrary vertex v0 , how long the random walk has to be such that the location
of the random walk in the ith step is uniformly (or near uniformly) distributed on V(G)?
Example 14.1.1. In the complete graph Kn , visiting all the vertices takes in expectation O(n log n) time,
as this is the coupon collector problem with n − 1 coupons. Indeed, the probability we did not visit a
specific vertex v by the ith step of the random walk is ≤ (1−1/n)i−1 ≤ e−(i−1)/n ≤ 1/n10 , for i = Ω(n log n).
As such, with high probability, the random walk visited all the vertex of Kn . Similarly, arriving from u
to v, takes in expectation n − 1 steps of a random walk, as the probability of visiting v at every step of
the walk is p = 1/(n − 1), and the length of the walk till we visit v is a geometric random variable with
expectation 1/p.
105
1 2i
Proof: The probability that in the 2ith step we visit 0 is 22i i
, As such, the expected number of times
we visit the origin is
∞ Õ ∞
Õ 1 2i 1
≥ √ = ∞,
i=1
2 i
2i
i=1 2 i
22i 22i
2i
since √ ≤ ≤ √ [MN98, p. 84]. This can also be verified using the Stirling formula, and the
2 i i 2i
resulting sequence diverges.
A random walk on the integer grid ZZd , starts from a point of this integer grid, and at each step if it is
at point (i1, i2, . . . , id ), it chooses a coordinate and either increases it by one, or decreases it by one, with
equal probability.
Lemma 14.1.4. Consider the infinite random walk on the two dimensional integer grid ZZ2 , starting
from (0, 0). The expected number of times that such a walk visits the origin is unbounded.
Proof: Rotate the grid by 45 degrees, and consider the two new axises X 0 and Y 0. Let xi be the projection
of the location of the ith step√ of the random walk on the X -axis, and define
0
√ yi in a similar fashion.
√ j is an integer. By scaling by a factor of 2, consider the resulting
Clearly, xi are of the √form j/ 2, where
random walks xi0 = 2xi and yi0 = 2yi . Clearly, xi and yi are random walks on the integer grid, and
furthermore, they are independent. As such, the probability that we visit the origin at the 2ith step is
2 1 2i 2
Pr x2i = 0 ∩ y2i = 0 = Pr x2i = 0 = 22i i
0 0
0
≥ 1/4i. We conclude, that the infinite random walk on
the grid ZZ2 visits the origin in expectation
∞ ∞
Õ Õ 1
xi0 yi0
Pr =0∩ =0 ≥ = ∞,
i=0 i=0
4i
Lemma 14.1.5. Consider the infinite random walk on the three dimensional integer grid ZZ3 , starting
from (0, 0, 0). The expected number of times that such a walk visits the origin is bounded.
Proof: The probability of a neighbor of a point (x, y, z) to be the next point in the walk is 1/6. Assume
that we performed a walk for 2i steps, and decided to perform 2a steps parallel to the x-axis, 2b steps
parallel to the y-axis, and 2c steps parallel to the z-axis, where a + b + c = i. Furthermore, the walk on
each dimension is balanced, that is we perform a steps to the left on the x-axis, and a steps to the right
on the x-axis. Clearly, this corresponds to the only walks in 2i steps that arrives to the origin.
106
(2i)!
Next, the number of different ways we can perform such a walk is a!a!b!b!c!c! , and the probability to
perform such a walk, summing over all possible values of a, b and c, is
2 2i i !2
i! i
Õ
Õ (2i)! 1 2i 1 1 2i 1 Õ 1
αi = = =
a+b+c=i
a!a!b!b!c!c! 62i i 22i a+b+c=i
a! b! c! 3 i 22i a+b+c=i
a b c 3
a,b,c≥0 a,b,c≥0 a,b,c≥0
i i
Consider the case where i = 3m. We have that
a b c ≤ m m m . As such,
i i i
i i i
Õ
2i 1 1 1 2i 1 1
αi ≤ = .
i 22i 3 m m m a+b+c=i a b c 3 i 22i 3 m m m
a,b,c≥0
i !
1 1 3i
1
for some constant c. As such, αi = O √ = O 3/2 . Thus,
i 3 i i
∞
Õ Õ 1
α6m = O 3/2 = O(1).
m=1 i
i
Finally, observe that α6m ≥ (1/6)2 α6m−2 and α6m ≥ (1/6)4 α6m−4 . Thus,
∞
Õ
αm = O(1).
m=1
Notes
The presentation here follows [Nor98].
107
108
Chapter 15
Random Walks II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled “Then you must begin a reading program
immediately so that you man understand the
January 24, 2018
crises of our age," Ignatius said solemnly. "Begin
with the late Romans, including Boethius, of
course. Then you should dip rather extensively
into early Medieval. You may skip the
Renaissance and the Enlightenment. That is
mostly dangerous propaganda. Now, that I
think about of it, you had better skip the
Romantics and the Victorians, too. For the
contemporary period, you should study some
selected comic books.”
“You’re fantastic.”
“I recommend Batman especially, for he tends
to transcend the abysmal society in which he’s
found himself. His morality is rather rigid, also.
I rather respect Batman.”
109
Thus, we can think about this algorithm as performing a random walk on the numbers 0, 1, . . . , n,
where at each step, we go to the right probability at least half. The question is, how long does it take
to arrive to n in such a settings.
Theorem 15.1.1. The expected number of steps to arrive to a satisfying assignment is O(n2 ).
Proof: Consider the random walk on the integer line, starting from zero, where we go to the left with
probability 1/2, and to the right probability 1/2. Let Yi be the location of the walk at the i step. Clearly,
E[Yi ] ≥ E[Xi ]. In fact, by defining the random walk on the integer line more carefully, one can ensure
that Yi ≤ Xi . Thus, the expected number of steps till Yi is equal to n is an upper bound on the required
quantity.
To this end, observe that the probability that in the ith step we have Yi ≥ n is
i/2
i
Õ 1
i i/2 + k
> 1/3,
k=n/2
2
√ √
by Lemma 15.1.2 below. Here we need that k = i/6, and k ≥ n/2. That is, we need that i/6 ≥ n/2,
which in turns implies that this holds for i > µ = 9n2 . To see that, observe that if we get i/2 + k times
+1, and i − (i/2 + k) = i/2 − k times −1, then we have that Yi = (i/2 + k) − ((i/k) − m) = 2k ≥ n.
Next, if Xi fails to arrive to n at the first µ steps, we will reset Yµ = Xµ and continue the random
walk, repeating this process as many phases as necessary. The probability that the number of phases
exceeds i is ≤ (2/3)i . As such, the expected number of steps in the walk is at most
i
0 2 2
Õ
cn i = O(n2 ),
i
3
as claimed.
2i
Õ 1 2i 1
Lemma 15.1.2. We have ≥ .
√ 2 k
2i 3
k=i+ i/6
√
Proof: It is known¬ that 2ii ≤ 22i / i (better constants are known). As such, since 2i 2i
for all m,
i ≥ m ,
we have by symmetry that
2i 2i √ 1 √ 1 22i 1
Õ 1 2i Õ 1 2i 1 2i
≥ − i/6 ≥ − i/6 ·√ = .
√ 22i k k=i+1
22i k 22i i 2 22i i 3
k=i+ i/6
prove it yourself.
110
The Markov chain start at an initial state X0 , and at each point in time moves according to the
transition probabilities. This form a sequence of states {Xt }. We have a distribution over those sequences.
Such a sequence would be referred to as a history.
Similar to Martingales, the behavior of a Markov chain in the future, depends only on its location
Xt at time t, and does not depends on the earlier stages that the Markov chain went through. This
is the memorylessness property of the Markov chain, and it follows as Pi j is independent of time.
Formally, the memorylessness property is
Pr Xt+1 = j X0 = i0, X1 = i1, . . . , Xt−1 = it−1, Xt = i = Pr Xt+1 = j Xt = i = Pi j .
The initial state of the Markov chain might also be chosen randomly.
For states i, j ∈ S, the t-step transition probability is Pi(t)j = Pr Xt = j X0 = i . The probability
that we visit j for the first time, starting from i after t steps, is denoted by
Let fi j = t>0 ri(t)j denote the probability that the Markov chain visits state j, at any point in time,
Í
starting from state i. The expected number of steps to arrive to state j starting from i is
Õ
hi j = t · ri(t)j .
t>0
Of course, if fi j < 1, then there is a positive probability that the Markov chain never arrives to j, and
as such hi j = ∞ in this case.
Definition 15.2.1. A state i ∈ S for which fii < 1 (i.e., the chain has positive probability of never visiting
i again), is a transient state. If fii = 1 then the state is persistent.
A state i that is persistent but hii = ∞ is null persistent. A state i that is persistent and hii , ∞
is non null persistent.
Example 15.2.2. Consider the state 0 in the random walk on the integers. We already know that in
expectation the random walk visits the origin infinite number of times, so this hints that this is a
persistent state. Let figure out the probability r(2n)00 . To this end, consider a walk X0, X1, . . . , X2n that
starts at 0 and return to 0 only in the 2n step. Let Si = Xi − Xi−1 , for all i. Clearly, we have Si ∈ −1, +1
(i.e., move left or move right). Assume the walk starts by S1 = +1 (the case −1 is handled similarly).
Clearly, the walk S2, . . . , S2n−1 must be prefix balanced; that is, the number of 1s is always bigger (or
equal) for any prefix of this sequence.
Strings with this property are known
1 2m
as Dyck words, and the number of such words of length 2m
is the Catalan number Cm = m+1 m . As such, the probability of the random walk to visit 0 for the
first time (starting from 0 after 2n steps, is
(2n) 1 2n − 2 1 1 1 1
r00 = 2 = Θ · √ = Θ 3/2 .
n n − 1 22n n n n
(the 2 here is because the other option is that the sequence starts with −1), using that 2n 2n √
n =Θ 2 / n .
It is not hard to show that f00 = 1 (this requires a trick). On the other hand, we have that
Õ ∞
Õ ∞
Õ √
h00 = t· r(t)
00 ≥ 2nr(2n)
00 = Θ 1/ n = ∞.
t>0 n=1 n=1
111
In finite Markov chains there are no null persistent states (this requires a proof, which is left as an
exercise). There is a natural directed graph associated with a Markov chain. The states are the vertices,
and the transition probability Pi j is the weight assigned to the edge (i → j). Note that we include only
edges with Pi j > 0.
Definition 15.2.3. A strong component (or a strong connected component) of a directed graph G is a
maximal subgraph C of G such that for any pair of vertices i and j in the vertex set of C, there is a
directed path from i to j, as well as a directed path from j to i.
Definition 15.2.4. A strong component C is said to be a final strong component if there is no edge
going from a vertex in C to a vertex that is not in C.
In a finite Markov chain, there is positive probability to arrive from any vertex on C to any other
vertex of C in a finite number of steps. If C is a final strong component, then this probability is 1, since
the Markov chain can never leave C once it enters it . It follows that a state is persistent if and only if
it lies in a final strong component.
Definition 15.2.5. A Markov chain is irreducible if its underlying graph consists of a single strong
component.
Definition 15.2.7. A stationary distribution for a Markov chain with the transition matrix P is a
probability distribution π such that π = πP.
In general, stationary distribution does not necessarily exist. We will mostly be interested in Markov
chains that have stationary distribution. Intuitively it is clear that if a stationary distribution exists,
then the Markov chain, given enough time, will converge to the stationary distribution.
Definition 15.2.8. The periodicity of a state i is the maximum integer T for which there exists an initial
distribution q (0) and positive integer a such that, for all t if at time t we have qi(t) > 0 then t belongs
to the arithmetic progression a + ti i ≥ 0 . A state is said to be periodic if it has periodicity greater
than 1, and is aperiodic otherwise. A Markov chain in which every state is aperiodic is aperiodic.
Example 15.2.9. The easiest example maybe of a periodic Markov chain is a directed cycle.
112
v2
For example, the Markov chain on the right, has periodicity of three. In particular,
v1 the initial state probability vector q (0) = (1, 0, 0) leads to the following sequence of state
v3 probability vectors
Note, that this chain still has a stationary distribution, that is (1/3, 1/3, 1/3), but unless you start from
this distribution, you are going to converge to it.
A neat trick that forces a Markov chain to be aperiodic, is to shrink all the probabilities by a factor
of 2, and make every state to have a transition probability to itself equal to 1/2. Clearly, the resulting
Markov chain is aperiodic.
The following theorem is the fundamental fact about Markov chains that we will need. The interested
reader, should check the proof in [Nor98] (the proof is not hard).
Theorem 15.2.11 (Fundamental theorem of Markov chains). Any irreducible, finite, and aperi-
odic Markov chain has the following properties.
(i) All states are ergodic.
(ii) There is a unique stationary distribution π such that, for 1 ≤ i ≤ n, we have πi > 0.
(iii) For 1 ≤ i ≤ n, we have fii = 1 and hii = 1/πi .
(iv) Let N(i, t) be the number of times the Markov chain visits state i in t steps. Then
N(i, t)
lim = πi .
t→∞ t
Namely, independent of the starting distribution, the process converges to the stationary dis-
tribution.
113
114
Chapter 16
where d(w) is the degree of vertex w. Clearly, the resulting Markov chain MG is irreducible. Note, that
the graph must have an odd cycle, and it has a cycle of length 2. Thus, the gcd of the lengths of its
115
cycles is 1. Namely, MG is aperiodic. Now, by the Fundamental theorem of Markov chains, MG has a
unique stationary distribution π.
Lemma 16.1.1. For all v ∈ V, we have πv = d(v)/2m.
Proof: Since π is stationary, and the definition of Puv , we get
Õ
πv = πP v = πu Puv,
uv
and this holds for all v. We only need to verify the claimed solution, since there is a unique stationary
distribution. Indeed,
d(v) Õ d(u) 1 d(v)
= πv = [πP]v = = ,
2m uv
2m d(u) 2m
as claimed.
Lemma 16.1.2. For all v ∈ V, we have hvv = 1/πv = 2m/d(v).
Definition 16.1.3. The hitting time huv is the expected number of steps in a random walk that starts
at u and ends upon first reaching v.
The commute time between u and v is denoted by CTuv = huv + hvu .
Let Cu (G) denote the expected length of a walk that starts at u and ends upon visiting every vertex
in G at least once. The cover time of G denotes by C(G) is defined by C(G) = maxu Cu (G).
Example 16.1.4 (Lollipop). Let L2n be the 2n-vertex lollipop graph, this graph con-
sists of a clique on n vertices, and a path on the remaining n vertices. There is a
vertex u in the clique which is where the path is attached to it. Let v denote the
end of the path, see figure on the right. n vertices
Taking a random walk from u to v requires in expectation O(n2 ) steps, as we
already saw in class. This ignores the probability of escape – that is, with probability u
(n − 1)/n when at u we enter the clique Kn (instead of the path). As such, it turns x1
3 2
out that huv = Θ(n ), and hvu = Θ(n ). (Thus, hitting times are not symmetric!) x2
Note, that the cover time is not monotone decreasing with the number of edges.
Indeed, the path of length n, has cover time O(n2 ), but the larger graph Ln has cover v = xn
time Ω(n3 ).
Example 16.1.5 (More on walking on the Lollipop). To see why huv = Θ n3 , number the vertices on the
stem x1, . . . , xn . Let Ti be the expected time to arrive to the vertex xi when starting a walk from u.
Observe, that surprisingly, T1 = Θ(n2 ). Indeed, the walk has to visit the vertex u about n times in
expectation, till the walk would decide to go to x1 instead of falling back into the clique. The time
between visits to u is in expectation O(n) (assuming the walk is inside the clique).
Now, observe that T2i = Ti + Θ(i 2 ) + 12 T2i . Indeed, starting with xi , it takes in expectation Θ(i 2 ) steps
of the walk to either arrive (with equal probability) at x2i (good), or to get back to u (oopsi). In the
later case, the game begins from scratch. As such, we have that
!
T2i = 2Ti + Θ i 2 = 2 2Ti/2 + Θ (i/2)2 + Θ i 2 = · · · = 2iT1 + Θ i 2 ,
assuming i is a power of two (why not?). As such, Tn = nT1 + Θ(n2 ). Since T1 = Θ(n2 ), we have that
Tn = Θ(n3 ).
116
Definition 16.1.6. A n × n matrix M is stochastic if all its entries are non-negative and for each row i,
it holds k Mik = 1. It is doubly stochastic if in addition, for any i, it holds k M ki = 1.
Í Í
Lemma 16.1.7. Let MC be a Markov chain, such that transition probability matrix P is doubly stochas-
tic. Then, the distribution u = (1/n, 1/n, . . . , 1/n) is stationary for MC.
n
Õ P ki 1
Proof: [uP]i = = .
k=1
n n
(Note, that (u → v) being an edge in the graph is crucial. Indeed, without it a significantly worst
case bound holds, see Theorem 16.2.1.)
Proof: Consider a new Markov chain defined by the edges of the graph (where every edge is taken twice
as two directed edges), where the current state is the last (directed) edge visited. There are 2m edges
in the new Markov chain, and the new transition matrix, has Q(u→v),(v→w) = Pvw = d(v) 1
. This matrix is
doubly stochastic, meaning that not only do the rows sum to one, but the columns sum to one as well.
Indeed, for the (v → w) we have
Õ Õ Õ 1
Q(x→y),(v→w) = Q(u→v),(v→w) = Pvw = d(v) × = 1.
d(v)
x∈V,y∈Γ(x) u∈Γ(v) u∈Γ(v)
Thus, the stationary distribution for this Markov chain is uniform, by Lemma 16.1.7. Namely, the
stationary distribution of e = (u → v) is hee = πe = 1/(2m). Thus, the expected time between successive
traversals of e is 1/πe = 2m, by Theorem 15.2.11 (iii).
Consider huv + hvu and interpret this as the time to go from u to v and then return to u. Conditioned
on the event that the initial entry into u was via the (v → u) , we conclude that the expected time to
go from there to v and then finally use (v → u) is 2m. The memorylessness property of a Markov chains
now allows us to remove the conditioning: since how we arrived to u is not relevant. Thus, the expected
time to travel from u to v and back is at most 2m.
The effective resistance between nodes u and v is the voltage difference between u and v when
one ampere is injected into u and removed from v (or injected into v and removed from u). The effective
resistance is always bounded by the branch resistance, but it can be much lower.
Given an undirected graph G, let N(G) be the electrical network defined over G, associating one ohm
resistance on the edges of N(G).
You might now see the connection between a random walk on a graph and electrical network. In-
tuitively (used in the most unscientific way possible), the electricity, is made out of electrons each one
117
of them is doing a random walk on the electric network. The resistance of an edge, corresponds to the
probability of taking the edge. The higher the resistance, the lower the probability that we will travel on
this edge. Thus, if the effective resistance Ruv between u and v is low, then there is a good probability
that travel from u to v in a random walk, and huv would be small.
Theorem 16.2.1. For any two vertices u and v in G, the commute time CTuv = 2mRuv , where Ruv is
the effective resistance between u and v.
Proof: Let φuv denote the voltage at u in N(G) with respected to v, where d(x) amperes of current are
injected into each node x ∈ V, and 2m amperes are removed from v. We claim that
huv = φuv .
Note, that the voltage on an edge x y is φ xy = φ xv − φ yv . Thus, using Kirchhoff’s Law and Ohm’s Law,
we obtain that
Õ Õ φ xw Õ
x ∈ V \ {v} d(x) = current(xw) = = (φ xv − φwv ), (16.1)
resistance(xw)
w∈Γ(x) w∈Γ(x) w∈Γ(x)
since the resistance of every edge is 1 ohm. (We also have the “trivial” equality that φvv = 0.) Further-
more, we have only n variables in this system; that is, for every x ∈ V, we have the variable φ xv .
Now, for the random walk interpretation – by the definition of expectation, we have
1 Õ Õ Õ
x ∈ V \ {v} h xv = (1 + hwv ) ⇐⇒ d(x) h xv = 1+ hwv
d(x)
w∈Γ(x) w∈Γ(x) w∈Γ(x)
Õ Õ Õ
⇐⇒ 1 = d(x) h xv − hwv = (h xv − hwv ).
w∈Γ(x) w∈Γ(x) w∈Γ(x)
Since d(x) =
Í
w∈Γ(x) 1, this is equivalent to
Õ
x ∈ V \ {v} d(x) = (h xv − hwv ). (16.2)
w∈Γ(x)
Again, we also have the trivial equality hvv = 0.¬ Note, that this system also has n equalities and n
variables.
Eq. (16.1) and Eq. (16.2) show two systems of linear equalities. Furthermore, if we identify huv with
φ xv then they are exactly the same system of equalities. Furthermore, since Eq. (16.1) represents a
physical system, we know that it has a unique solution. This implies that φ xv = h xv , for all x ∈ V.
Imagine the network where u is injected with 2m amperes, and for all nodes w remove d(w) units
from w. In this new network, hvu = −φ0vu = φ0uv . Now, since flows behaves linearly, we can superimpose
them (i.e., add them up). We have that in this new network 2m unites are being injected at u, and
2m units are being extracted at v, all other nodes the charge cancel itself out. The voltage difference
between u and v in the new network is φb = φuv + φ0uv = huv + hvu = CTuv . Now, in the new network there
are 2m amperes going from u to v, and by Ohm’s law, we have
as claimed.
¬ In previous lectures, we interpreted hvv as the expected length of a walk starting at v and coming back to v.
118
Example 16.2.2. Recall the lollipop Ln from Exercise 16.1.4. Let u be the connecting vertex between the
clique and the stem (i.e., the path). We inject d(x) units of flow for each vertex x of Ln , and collect 2m
units at u. Next, let u = x0, x1, . . . , xn = v be the vertices of the stem. Clearly, there are 2(n − i) − 1
units of electricity flowing on the edge (xi+1 → xi ). Thus, the voltage on this edge is 2(n − i), by Ohm’s
law (every edge has resistance one). The effective resistance from v to u is as such Θ(n2 ), which implies
that hvu = Θ(n2 ).
Similarly, it is easy to show huv = Θ(n3 ).
A similar analysis works for the random walk on the integer line in the range 1 to n.
Lemma 16.2.3. For any n vertex connected graph G, and for all u, v ∈ V(G), we have CTuv < n3 .
Proof: The effective resistance between any two nodes in the network is bounded by the length of the
shortest path between the two nodes, which is at most n − 1. As such, plugging this into Theorem 16.2.1,
yields the bound, since m < n2 .
119
120
Chapter 17
Random Walks IV
598 - Class notes for Randomized Algorithms
Sariel Har-Peled “Do not imagine, comrades, that leadership is a
January 24, 2018 pleasure! On the contrary, it is a deep and
heavy responsibility. No one believes more
firmly than Comrade Napoleon that all animals
are equal. He would be only too happy to let
you make your decisions for yourselves. But
sometimes you might make the wrong decisions,
comrades, and then where should we be?
Suppose you had decided to follow Snowball,
with his moonshine of windmills-Snowball, who,
as we now know, was no better than a
criminal?”
Theorem 17.1.1. Let G be an undirected connected graph, then C(G) ≤ 2m(n − 1), where n = |V(G)| and
m = |E(G)|.
Proof: (Sketch.) Construct a spanning tree T of G, and consider the time to walk around T. The
expected time to travel on this edge on both directions is CTuv = huv + hvu , which is smaller than 2m,
by Lemma 16.1.8. Now, just connect up those bounds, to get the expected time to travel around the
spanning tree. Note, that the bound is independent of the starting vertex.
Definition 17.1.2. The resistance of G is R(G) = maxu,v∈V(G) Ruv ; namely, it is the maximum effective
resistance in G.
Proof: Consider the vertices u and v realizing R(G), and observe that max(huv, hvu ) ≥ CTuv /2, and
CTuv = 2mRuv by Theorem 16.2.1. Thus, C(G) ≥ CTuv /2 ≥ mR(G).
As for the upper bound. Consider a random walk, and divide it into epochs, where a epoch is
a random walk of length 2e3 mR(G). For any vertex v, the expected time to hit u is hvu ≤ 2mR(G),
by Theorem 16.2.1. Thus, the probability that u is not visited in a epoch is 1/e3 by the Markov
121
inequality. Consider a random walk with ln n epochs. We have that the probability of not visiting u is
≤ (1/e3 )ln n ≤ 1/n3 . Thus, all vertices are visited after ln n epochs, with probability ≥ 1−1/n3 . Otherwise,
after this walk, we perform a random walk till we visit all vertices. The length of this (fix-up) random
walk is ≤ 2n3 , by Theorem 17.1.1. Thus, expected length of the walk is ≤ 2e3 mR(G) ln n + 2n3 (1/n2 ).
Lemma 17.1.5. Suppose that G contains p edge-disjoint paths of length at most ` from s to t. Then
Rst ≤ `/p.
Theorem 17.2.2. Let USTCON denote the problem of deciding if a vertex s is connected to a vertex
t in an undirected graph. Then USTCON ∈ RLP.
Proof: Perform a random walk of length 2n3 in the input graph G, starting from s. Stop as soon as the
random walk hit t. If u and v are in the same connected component, then hst ≤ n3 . Thus, by the Markov
inequality, the algorithm works. It is easy to verify that it can be implemented in O(log n) space.
Given such a universal traversal sequence, we can construct (a non-uniform) Turing machine that
can solve USTCON for such d-regular graphs, by encoding the sequence in the machine.
Let F denote a family of graphs, and let U(F ) denote the length of the shortest universal traversal
sequence for all the labeled graphs in F . Let R(F ) denote the maximum resistance of graphs in this
family.
Theorem 17.2.4. U(F ) ≤ 5mR(F ) lg(n |F |).
122
Proof: Same old, same old. Break the string into epochs, each of length L = 2mR(G). Now, start random
walks from all the possible vertices, from all possible graphs. Continue the walks till all vertices are
being visited. Initially, there are n2 |F | vertices that need to visited. In expectation, in each epoch half
the vertices get visited. As such, after 1 + lg2 (n |F |) epochs, the expected number of vertices still need
visiting is ≤ 1/2. Namely, with constant probability we are done.
Let U(d, n) denote the length of the shortest universal traversal sequence of connected, labeled n-
vertex, d-regular graphs.
Lemma 17.2.5. The number of labeled n-vertex graphs that are d-regular is (nd)O(nd) .
Proof: Such a graph has dn/2 edges overall. Specifically, we encode this by listing for every vertex its
d neighbors – there are n−1
d ≤ n d possibilities. As such, there are at most nnd choices for edges in
the graph¬ Every vertex has d! possible labeling of the edges adjacent to it, thus there are (d!)n ≤ d nd
possible labelings.
Proof: The diameter of every connected n-vertex, d-regular graph is O(n/d). Indeed, consider the path
realizing the diameter of the graph, and assume it has t vertices. Number the vertices along the path
consecutively, and consider all the vertices that their number is a multiple of three. There are α ≥ bt/3c
such vertices. No pair of these vertices can share a neighbor, and as such, the graph has at least (d + 1)α
vertices. We conclude that n ≥ (d + 1)α = (d + 1)(t/3 − 1). We conclude that t ≤ d+1 3
(n + 1) ≤ 3n/d.
And so, this also bounds the resistance of such a graph. The number of edges is m = nd/2. Now,
combine Lemma 17.2.5 and Theorem 17.2.4.
This is, as mentioned before, not uniform solution. There is by now a known log-space deterministic
algorithm for this problem, which is uniform.
123
Theorem 17.3.1 (Fundamental theorem of algebraic graph theory.). Let G = G(V, E) be an n-
vertex, undirected (multi)graph with maximum degree d. Let λ1 ≥ λ2 ≥ · · · ≥ λn be the eigenvalues of
M(G) and the corresponding orthonormal eigenvectors are e1, . . . , en . The following holds.
(i) If G is connected then λ2 < λ1 .
(ii) For i = 1, . . . , n, we have |λi | ≤ d.
(iii) d is an eigenvalue if and only if G is regular.
(iv) If G is d-regular then the eigenvalue λ1 = d has the eigenvector e1 = √1n (1, 1, 1, . . . , 1).
(v) The graph G is bipartite if and only if for every eigenvalue λ there is an eigenvalue −λ of the
same multiplicity.
(vi) Suppose that G is connected. Then G is bipartite if and only if −λ1 is an eigenvalue.
(vii) If G is d-regular and bipartite, then λn = d and en = √1n (1, 1, . . . , 1, −1, . . . , −1), where there are
equal numbers of 1s and −1s in en .
124
Chapter 18
Random Walks V
598 - Class notes for Randomized Algorithms
Sariel Har-Peled “Is there anything in the Geneva Convention
January 24, 2018 about the rules of war in peacetime?” Stanko
wanted to know, crawling back toward the
truck. “Absolutely nothing,” Caulec assured
him. “The rules of war apply only in wartime.
In peacetime, anything goes.”
Definition 18.1.1. Let G = (V, E) be an undirected d-regular graph. The graph G is a (n, d, c)-expander
(or just c-expander), forevery set S ⊆ V of size at most |V | /2, there are at least cd |S| edges connecting
S and S = V \ S; that is e S, S ≥ cd |S|,
Guaranteeing aperiodicity Let G be a (n, d, c)-expander. We would like to perform a random walk
on G. The graph G is connected, but it might be periodic (i.e., bipartite). To overcome this, consider
the random walk on G that either stay in the current state with probability 1/2 or traverse one of the
edges. Clearly, the resulting Markov Chain (MC) is aperiodic. The resulting transition matrix is
Q = M/2d + I/2,
where M is the adjacency matrix of G and I is the identity n × n matrix. Clearly Q is doubly stochastic.
Furthermore, if λbi is an eigenvalue of M, with eigenvector vi , then
!
1 M 1 λbi
Qvi = + I vi = + 1 vi .
2 d 2 d
As such, λc λi /d + 1 /2 is an eigenvalue of Q. Namely, if there is a spectral gap in the graph G, there
would also be a similar spectral gap in the resulting MC. This MC can be generated by adding to each
vertex d self loops, ending up with a 2d-regular graph. Clearly, this graph is still an expander if the
original graph is an expander, and the random walk on it is aperiodic.
From this point on, we would just assume our expander is aperiodic.
125
18.1.1. Bounding the mixing time
For a MC with n states, we denote by π = π1, . . . , πn its stationary distribution. We consider only
nicely behave MC that fall under Theorem 15.2.11p113 . As such, no state in the MC has zero stationary
probability.
Definition 18.1.2. Let q(t) denote the state probability vector of a Markov chain defined by a transition
matrix Q at time t ≥ 0, given an initial distribution q(0) . The relative pairwise distance of the
Markov chain at time t is
qi(t) − πi
∆(t) = max .
i πi
Namely, if ∆(t) approaches zero then q(t) approaches π.
We remind the reader that we saw a construction of a constant degree expander with constant
expansion. In its transition matrix Q, we have that λb1 = 1, and −1 ≤ λb2 < 1, and furthermore
the spectral gap λb1 − λb2 was a constant (the two properties are equivalent, but we proved only one
direction of this).
We need a slightly stronger property (that does hold for our expander construction). We have that
n
λb2 ≥ maxi=2 λbi .
Theorem 18.1.3. Let Q be the transition matrix of an aperiodic (n, d, c)-expander. Then, for any initial distribution q(0), we have that
\[
\Delta(t) \le n^{3/2}\, \widehat{\lambda}_2^{\,t} .
\]
Namely, since $\widehat{\lambda}_2$ is a constant smaller than 1, the distance ∆(t) drops exponentially with t.
Proof: We have that q(t) = q(0) Qt. Let B(Q) = ⟨v1, . . . , vn⟩ denote the orthonormal eigenvector basis of Q (see Definition 29.2.3p239), and write $q^{(0)} = \sum_{i=1}^{n} \alpha_i v_i$. Since $\widehat{\lambda}_1 = 1$, we have that
\[
q^{(t)} = q^{(0)} \mathsf{Q}^t
= \Bigl( \sum_{i=1}^{n} \alpha_i v_i \Bigr) \mathsf{Q}^t
= \sum_{i=1}^{n} \alpha_i \widehat{\lambda}_i^{\,t} v_i
= \alpha_1 v_1 + \sum_{i=2}^{n} \alpha_i \widehat{\lambda}_i^{\,t} v_i .
\]
Since $v_1 = \bigl(1/\sqrt{n}, 1/\sqrt{n}, \ldots, 1/\sqrt{n}\bigr)$ and $\bigl|\widehat{\lambda}_i\bigr| \le \widehat{\lambda}_2 < 1$, for i > 1, we have that $\lim_{t\to\infty} \widehat{\lambda}_i^{\,t} = 0$, and thus
\[
\pi = \lim_{t \to \infty} q^{(t)}
= \alpha_1 v_1 + \sum_{i=2}^{n} \alpha_i \lim_{t\to\infty} \widehat{\lambda}_i^{\,t} v_i
= \alpha_1 v_1 .
\]
Now, since v1, . . . , vn is an orthonormal basis, and $q^{(0)} = \sum_{i=1}^{n} \alpha_i v_i$, we have that $\bigl\| q^{(0)} \bigr\|_2 = \sqrt{\sum_{i=1}^{n} \alpha_i^2}$. This implies that
\[
\bigl\| q^{(t)} - \pi \bigr\|_1
= \bigl\| q^{(t)} - \alpha_1 v_1 \bigr\|_1
= \Bigl\| \sum_{i=2}^{n} \alpha_i \widehat{\lambda}_i^{\,t} v_i \Bigr\|_1
\le \sqrt{n}\, \Bigl\| \sum_{i=2}^{n} \alpha_i \widehat{\lambda}_i^{\,t} v_i \Bigr\|_2
= \sqrt{n} \sqrt{ \sum_{i=2}^{n} \bigl( \alpha_i \widehat{\lambda}_i^{\,t} \bigr)^2 }
\le \sqrt{n}\, \widehat{\lambda}_2^{\,t} \sqrt{ \sum_{i=2}^{n} \alpha_i^2 }
\le \sqrt{n}\, \widehat{\lambda}_2^{\,t} \bigl\| q^{(0)} \bigr\|_2
\le \sqrt{n}\, \widehat{\lambda}_2^{\,t} \bigl\| q^{(0)} \bigr\|_1
= \sqrt{n}\, \widehat{\lambda}_2^{\,t} ,
\]
since q(0) is a distribution. Now, since πi = 1/n, we have
\[
\Delta(t) = \max_i \frac{\bigl| q_i^{(t)} - \pi_i \bigr|}{\pi_i}
= \max_i n \bigl| q_i^{(t)} - \pi_i \bigr|
\le n \bigl\| q^{(t)} - \pi \bigr\|_1
\le n \sqrt{n}\, \widehat{\lambda}_2^{\,t} .
\]
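The bound of Theorem 18.1.3 only uses the spectrum of Q, so it can be checked numerically on any symmetric lazy walk. The following sketch (not part of the original notes; the 4-cube is just a stand-in test graph) iterates q(t) = q(0) Qt from a point mass and compares ∆(t) with n^{3/2} λ̂2^t.

```python
# Sketch: empirically compare Delta(t) with n^{3/2} * lhat2^t on a toy lazy walk.
import numpy as np

n, d = 16, 4
M = np.zeros((n, n))
for u in range(n):
    for b in range(d):
        M[u, u ^ (1 << b)] = 1          # 4-dimensional hypercube, 4-regular
Q = M / (2 * d) + np.eye(n) / 2         # lazy (aperiodic) walk

lam = np.sort(np.linalg.eigvalsh(Q))[::-1]
lhat2 = max(abs(lam[1]), abs(lam[-1]))  # largest nontrivial eigenvalue in absolute value

pi = np.full(n, 1.0 / n)
q = np.zeros(n); q[0] = 1.0             # start at a single vertex
for t in range(1, 21):
    q = q @ Q
    delta = np.max(np.abs(q - pi) / pi)         # relative pairwise distance Delta(t)
    assert delta <= n ** 1.5 * lhat2 ** t + 1e-9
print("mixing bound holds for t = 1..20")
```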
U = |V(G)|,
and since our expander construction grows exponentially in size (but the base of the exponent is a constant), we have that U = O(2^n). (Translation: We cannot quite get an expander with a specific number of vertices. Rather, we can guarantee an expander that has more vertices than we need, but not many more.)
We label the vertices of G with all the binary strings of length n, in a round-robin fashion (thus, each binary string of length n appears either $\lceil |V(\mathsf{G})| / 2^n \rceil$ or $\lfloor |V(\mathsf{G})| / 2^n \rfloor$ times). For a vertex v ∈ V(G), let s(v) denote the binary string associated with v.
Consider a string x that we would like to decide if it is in L or not. We know that at least (99/100)U vertices of G are labeled with “random” strings that would yield the right result if we feed them into Alg (the constant here deteriorated from 199/200 to 99/100 because the number of times a string appears is not exactly the same for all strings).
The algorithm. We perform a random walk of length µ = αβk on G, where α and β are constants to be determined shortly, and k is a parameter. To this end, we randomly choose a starting vertex X0 (this would require n + O(1) bits). Every step of the random walk requires O(1) random bits, as the expander is a constant degree expander; as such, overall, this requires n + O(k) random bits.
Now, let X0, X1, . . . , Xµ be the resulting random walk. We compute the results
\[
Y_i = \mathrm{Alg}(x, r_i), \qquad \text{for } i = 0, 1, \ldots, \alpha k,
\]
where $r_i = s\bigl( X_{i \cdot \beta} \bigr)$. Specifically, we use the strings associated with nodes that are in distance β from each other along the path of the random walk. We return the majority of the bits Y0, . . . , Yαk as the decision of whether x ∈ L or not.
We assume here that we have a fully explicit construction of an expander. That is, given a vertex of the expander, we can compute all its neighbors in polynomial time (in the length of the index of the vertex). While the expander construction shown is only explicit, it can be made fully explicit with more effort.
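The following is a schematic Python sketch of the procedure just described; it is not from the notes. The callables alg(x, r) (the assumed constant-error randomized algorithm using an n-bit random string), neighbors(v) (the assumed fully explicit constant-degree neighbor oracle of the expander), and label(v) (the string s(v)), as well as the constant beta, are placeholders supplied by the caller.

```python
# Hedged sketch of amplification via a random walk on an expander.
import random

def amplify(x, n, k, alg, neighbors, label, alpha=7, beta=4):
    """Decide x using n + O(k) random bits (alg, neighbors, label are assumed oracles)."""
    v = random.getrandbits(n)              # starting vertex X_0: n random bits
    votes = []
    for i in range(alpha * beta * k + 1):  # walk of length mu = alpha * beta * k
        if i % beta == 0:                  # use every beta-th vertex of the walk
            votes.append(alg(x, label(v))) # Y_i = Alg(x, s(X_{i*beta}))
        v = random.choice(neighbors(v))    # constant degree => O(1) random bits per step
    return sum(votes) > len(votes) / 2     # majority vote over Y_0, ..., Y_{alpha*k}
```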
18.2.1. The analysis
Intuition. Skipping every β nodes in the random walk corresponds to performing a random walk on the graph G^β; that is, we raise the graph to the power β. This new graph is a much better expander (but the degree has deteriorated). Now, consider a specific input x, and mark the bad vertices for it in the graph G. Clearly, we mark at most a 1/100 fraction of the vertices. Conceptually, think about these vertices as being uniformly spread in the graph and far apart. For the execution of the algorithm to fail, the random walk needs to visit αk/2 bad vertices in the random walk in G^β. However, the probability for that is extremely small - why would the random walk keep stumbling into bad vertices, when they are so infrequent?
The real thing. Let Q be the transition matrix of G. We assume, as usual, that the random walk on G is aperiodic (if not, we can easily fix it using standard tricks), and thus ergodic. Let B = Q^β be the transition matrix of the random walk of the states we use in the algorithm. Note that the eigenvalues (except the first one) of B “shrink”. In particular, by picking β to be a sufficiently large constant, we have that
\[
\widehat{\lambda}_1(\mathsf{B}) = 1
\qquad\text{and}\qquad
\bigl| \widehat{\lambda}_i(\mathsf{B}) \bigr| \le \frac{1}{10}, \quad \text{for } i = 2, \ldots, U.
\]
For the input string x, let W be the matrix that has 1 in the diagonal entry Wii if and only if Alg(x, s(i)) returns the right answer, for i = 1, . . . , U. (We remind the reader that s(i) is the string associated with the ith vertex, and U = |V(G)|.) The matrix W is zero everywhere else. Similarly, let $\overline{\mathsf{W}} = \mathsf{I} - \mathsf{W}$ be the “complement” matrix having 1 at $\overline{\mathsf{W}}_{ii}$ iff Alg(x, s(i)) is incorrect. We know that W is a U × U matrix that has at least (99/100)U ones on its diagonal.
Lemma 18.2.1. Let Q be a symmetric transition matrix. Then all the eigenvalues of Q are in the range [−1, 1].
Proof: Let p ∈ R^n be an eigenvector with eigenvalue λ. Let pi be the coordinate with the maximum absolute value in p. We have that
\[
|\lambda|\, |p_i| = \bigl| (p \mathsf{Q})_i \bigr|
= \Bigl| \sum_{j=1}^{n} p_j \mathsf{Q}_{ji} \Bigr|
\le \sum_{j=1}^{n} |p_j|\, \mathsf{Q}_{ji}
\le |p_i| \sum_{j=1}^{n} \mathsf{Q}_{ji}
= |p_i| ,
\]
since the columns of Q sum up to one. As such, |λ| ≤ 1.
Lemma 18.2.2. Let Q be a symmetric transition matrix. Then, for any p ∈ R^n, we have that ‖pQ‖2 ≤ ‖p‖2.
Proof: Let B(Q) = ⟨v1, . . . , vn⟩ denote the orthonormal eigenvector basis of Q, with eigenvalues 1 = λ1, . . . , λn. Write $p = \sum_i \alpha_i v_i$, and observe that
\[
\bigl\| p \mathsf{Q} \bigr\|_2
= \Bigl\| \sum_i \alpha_i v_i \mathsf{Q} \Bigr\|_2
= \Bigl\| \sum_i \alpha_i \lambda_i v_i \Bigr\|_2
= \sqrt{ \sum_i \alpha_i^2 \lambda_i^2 }
\le \sqrt{ \sum_i \alpha_i^2 }
= \| p \|_2 ,
\]
since |λi| ≤ 1 for all i, by Lemma 18.2.1.
Lemma 18.2.3. Let B = Q^β be the transition matrix of the graph G^β. For all vectors p ∈ R^U, we have: (i) ‖pBW‖2 ≤ ‖p‖2, and (ii) $\bigl\| p \mathsf{B} \overline{\mathsf{W}} \bigr\|_2 \le \|p\|_2 / 5$.
Proof: (i) Since multiplying a vector by W has the effect of zeroing out some of its coordinates, it is clear that it cannot enlarge the norm of the vector. As such, ‖pBW‖2 ≤ ‖pB‖2 ≤ ‖p‖2, by Lemma 18.2.2.
(ii) Write $p = \sum_i \alpha_i v_i$, where v1, . . . , vU is the orthonormal eigenvector basis of Q (and thus also of B), with eigenvalues $1 = \widehat{\lambda}_1, \ldots, \widehat{\lambda}_U$. We remind the reader that $v_1 = (1, 1, \ldots, 1)/\sqrt{U}$. Since $\overline{\mathsf{W}}$ zeroes out at least 99/100 of the entries of a vector it is multiplied by (and copies the rest as they are), we have that $\bigl\| v_1 \overline{\mathsf{W}} \bigr\|_2 \le \sqrt{(U/100)\bigl(1/\sqrt{U}\bigr)^2} \le 1/10 = \|v_1\|_2 / 10$. Now, for any x ∈ R^U, we have $\bigl\| x \overline{\mathsf{W}} \bigr\|_2 \le \|x\|_2$. As such, we have that
\[
\bigl\| p \mathsf{B} \overline{\mathsf{W}} \bigr\|_2
= \Bigl\| \sum_i \alpha_i v_i \mathsf{B} \overline{\mathsf{W}} \Bigr\|_2
\le \bigl\| \alpha_1 v_1 \mathsf{B} \overline{\mathsf{W}} \bigr\|_2 + \Bigl\| \sum_{i=2}^{U} \alpha_i v_i \mathsf{B} \overline{\mathsf{W}} \Bigr\|_2
\le |\alpha_1|\, \bigl\| v_1 \overline{\mathsf{W}} \bigr\|_2 + \Bigl\| \sum_{i=2}^{U} \alpha_i \widehat{\lambda}_i^{\,\beta} v_i \overline{\mathsf{W}} \Bigr\|_2
\]
\[
\le \frac{|\alpha_1|}{10} + \Bigl\| \sum_{i=2}^{U} \alpha_i \widehat{\lambda}_i^{\,\beta} v_i \Bigr\|_2
\le \frac{|\alpha_1|}{10} + \sqrt{ \sum_{i=2}^{U} \Bigl( \alpha_i \widehat{\lambda}_i^{\,\beta} \Bigr)^2 }
\le \frac{|\alpha_1|}{10} + \frac{1}{10} \sqrt{ \sum_{i=2}^{U} \alpha_i^2 }
\le \frac{\|p\|_2}{10} + \frac{\|p\|_2}{10} \le \frac{\|p\|_2}{5} ,
\]
since $\bigl| \widehat{\lambda}_i^{\,\beta} \bigr| \le 1/10$, for i = 2, . . . , U.
Consider the strings r0, . . . , rν. For each one of these strings, we can write down whether it is a “good” string (i.e., Alg returns the correct result) or a bad string. This results in a binary pattern b0, . . . , bν. Given a distribution p ∈ R^U on the states of the graph, it is natural to ask what is the probability of being in a “good” state. Clearly, this is the quantity ‖pW‖1. Thus, if we are interested in the probability of a specific pattern, then we should start with the initial distribution p0, truncate away the coordinates that represent an invalid state, apply the transition matrix, again truncate away forbidden coordinates, and repeat in this fashion till we exhaust the pattern. Clearly, the ℓ1-norm of the resulting vector is the probability of this pattern. To this end, given a pattern b0, . . . , bν, let S = ⟨S0, . . . , Sν⟩ denote the corresponding sequence of “truncating” matrices (i.e., Si is either W or $\overline{\mathsf{W}}$). Formally, we set Si = W if Alg(x, ri) returns the correct answer, and set $S_i = \overline{\mathsf{W}}$ otherwise.
The above argument implies the following lemma.
Lemma 18.2.4. For any fixed pattern b0, . . . , bν the probability of the random walk to generate this pattern of random strings is $\bigl\| p^{(0)} S_0 \mathsf{B} S_1 \cdots \mathsf{B} S_\nu \bigr\|_1$, where S = ⟨S0, . . . , Sν⟩ is the sequence of W and $\overline{\mathsf{W}}$ encoded by this pattern.
Theorem 18.2.5. The probability that the majority of the outputs Alg(x, r0), Alg(x, r1), . . . , Alg(x, rν) is incorrect is at most 1/2^k.
Proof: The majority is wrong only if (at least) half the elements of the sequence S = ⟨S0, . . . , Sν⟩ are $\overline{\mathsf{W}}$. Fix such a “bad” sequence S, and observe that the distributions we work with are vectors in R^U. As such, if p(0) is the initial distribution, then we have that
\[
\Pr[\mathsf{S}]
= \bigl\| p^{(0)} S_0 \mathsf{B} S_1 \cdots \mathsf{B} S_\nu \bigr\|_1
\le \sqrt{U}\, \bigl\| p^{(0)} S_0 \mathsf{B} S_1 \cdots \mathsf{B} S_\nu \bigr\|_2
\le \sqrt{U}\, \frac{1}{5^{\nu/2}}\, \bigl\| p^{(0)} \bigr\|_2 ,
\]
by Lemma 18.3.1 below (i.e., the Cauchy-Schwarz inequality) and by repeatedly applying Lemma 18.2.3, since half of the matrices in the sequence S are $\overline{\mathsf{W}}$, and the rest are W. The distribution p(0) is uniform, which implies that $\bigl\| p^{(0)} \bigr\|_2 = 1/\sqrt{U}$. As such, summing over all bad patterns (there are at most 2^ν of them), we have
\[
\Pr\bigl[ \text{majority is bad} \bigr]
\le 2^{\nu}\, \sqrt{U}\, \frac{1}{5^{\nu/2}}\, \bigl\| p^{(0)} \bigr\|_2
= \Bigl( \frac{4}{5} \Bigr)^{\nu/2}
= \Bigl( \frac{4}{5} \Bigr)^{\alpha k / 2}
\le \frac{1}{2^k} ,
\]
for α = 7.
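As a quick arithmetic check (not in the original text) that the choice α = 7 indeed suffices in the last inequality:
```latex
% We need $(4/5)^{\alpha k/2} \le 2^{-k}$, i.e., $(4/5)^{\alpha/2} \le 1/2$:
\[
\Bigl(\tfrac{4}{5}\Bigr)^{\alpha/2} \le \tfrac{1}{2}
\iff \frac{\alpha}{2} \ln\frac{5}{4} \ge \ln 2
\iff \alpha \ge \frac{2\ln 2}{\ln(5/4)} \approx 6.21 ,
\]
% so any integer $\alpha \ge 7$ works.
```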
Lemma 18.3.1. For any vector v ∈ R^d, we have ‖v‖1 ≤ √d ‖v‖2.
Proof: We can safely assume all the coordinates of v are non-negative. Now,
\[
\|v\|_1 = \sum_{i=1}^{d} v_i = \sum_{i=1}^{d} v_i \cdot 1
= \bigl| v \cdot (1, 1, \ldots, 1) \bigr|
\le \sqrt{ \sum_{i=1}^{d} v_i^2 } \sqrt{ \sum_{i=1}^{d} 1^2 }
= \sqrt{d}\, \|v\|_2 ,
\]
by the Cauchy-Schwarz inequality.
Chapter 19
Lucky Jim, Kingsley Amis
In this chapter, we will prove that given a set P of n points in R^d, one can reduce the dimension of the points to k = O(ε^{−2} log n) such that distances are 1 ± ε preserved. Surprisingly, this reduction is done by randomly picking a subspace of k dimensions and projecting the points into this random subspace. One way of thinking about this result is that we are “compressing” the input of size nd (i.e., n points with d coordinates) into size O(nε^{−2} log n), while (approximately) preserving distances.
Remark 19.1.2. It is easy to verify that if A′ and B′ are translated copies of A and B (that is, A′ = A + p and B′ = B + q, for some points p, q ∈ R^d), respectively, then A′ + B′ is a translated copy of A + B. In particular, since volume is preserved under translation, we have that vol(A′ + B′) = vol((A + B) + p + q) = vol(A + B), where vol(X) is the volume (i.e., measure) of the set X.
Theorem 19.1.3 (Brunn-Minkowski inequality). Let A and B be two non-empty compact sets in R^n. Then
\[
\mathrm{vol}(A + B)^{1/n} \ge \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n} .
\]
Definition 19.1.4. A set A ⊆ R^n is a brick set if it is the union of finitely many (closed) axis-parallel boxes with disjoint interiors.
It is intuitively clear, by limit arguments, that proving Theorem 19.1.3 for brick sets will imply it
for the general case.
Lemma 19.1.5 (Brunn-Minkowski inequality for Brick Sets). Let A and B be two non-empty brick sets in R^n. Then
\[
\mathrm{vol}(A + B)^{1/n} \ge \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n} .
\]
Proof: By induction on the number k of bricks in A and B. If k = 2 then A and B are just bricks, with dimensions a1, . . . , an and b1, . . . , bn, respectively. In this case, the dimensions of A + B are a1 + b1, . . . , an + bn, as can be easily verified. Thus, we need to prove that $\bigl( \prod_{i=1}^{n} a_i \bigr)^{1/n} + \bigl( \prod_{i=1}^{n} b_i \bigr)^{1/n} \le \bigl( \prod_{i=1}^{n} (a_i + b_i) \bigr)^{1/n}$. Dividing the left side by the right side, we have
\[
\Biggl( \prod_{i=1}^{n} \frac{a_i}{a_i + b_i} \Biggr)^{1/n} + \Biggl( \prod_{i=1}^{n} \frac{b_i}{a_i + b_i} \Biggr)^{1/n}
\le \frac{1}{n} \sum_{i=1}^{n} \frac{a_i}{a_i + b_i} + \frac{1}{n} \sum_{i=1}^{n} \frac{b_i}{a_i + b_i} = 1 ,
\]
by the generalized arithmetic-geometric mean inequality¬ , and the claim follows for this case.
Now let k > 2 and suppose that the Brunn-Minkowski inequality holds for any pair of brick sets with fewer than k bricks (together). Let A and B be a pair of sets having k bricks together, where A has at least two (disjoint) bricks. However, this implies that there is an axis parallel hyperplane h that separates
two (disjoint) bricks. However, this implies that there is an axis parallel hyperplane h that separates
the interior of one brick of A from the interior of another brick of A (the hyperplane h might intersect
other bricks of A). Assume that h is the hyperplane x1 = 0 (this can be achieved by translation and
renaming of coordinates).
Let A+ = A ∩ h+ and A− = A ∩ h−, where h+ and h− are the two open half spaces induced by h, and (slightly abusing notation) let A+ and A− denote the closures of these two sets, respectively. Clearly, A+ and A− are both brick sets with (at least) one fewer brick than A.
Next, observe that the claim is translation invariant (see Remark 19.1.2), and as such, let us translate
B so that its volume is split by h in the same ratio A’s volume is being split. Denote the two parts of
¬ Here is a proof of the generalized form: Let x1, . . . , xn be n positive real numbers. Consider the quantity R = x1 x2 · · · xn. If we fix the sum of the n numbers to be equal to α, then R is maximized when all the xi s are equal. Thus, $\sqrt[n]{x_1 x_2 \cdots x_n} \le \sqrt[n]{(\alpha/n)^n} = \alpha/n = (x_1 + \cdots + x_n)/n$.
B by B+ and B− , respectively. Let ρ = vol(A+ )/vol(A) = vol(B+ )/vol(B) (if vol(A) = 0 or vol(B) = 0 the
claim trivially holds).
Observe that A+ + B+ ⊆ A + B, and it lies on one side of h (since h ≡ (x1 = 0)), and similarly A− + B− ⊆ A + B and it lies on the other side of h. Thus, by induction and since A+ + B+ and A− + B− are interior disjoint, we have
\[
\mathrm{vol}(A + B)
\ge \mathrm{vol}\bigl( A^+ + B^+ \bigr) + \mathrm{vol}\bigl( A^- + B^- \bigr)
\ge \Bigl( \mathrm{vol}(A^+)^{1/n} + \mathrm{vol}(B^+)^{1/n} \Bigr)^n + \Bigl( \mathrm{vol}(A^-)^{1/n} + \mathrm{vol}(B^-)^{1/n} \Bigr)^n
\]
\[
= \rho \Bigl( \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n} \Bigr)^n + (1 - \rho) \Bigl( \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n} \Bigr)^n
= \Bigl( \mathrm{vol}(A)^{1/n} + \mathrm{vol}(B)^{1/n} \Bigr)^n ,
\]
establishing the claim.
Proof of Theorem 19.1.3: Let A1 ⊆ A2 ⊆ · · · ⊆ Ai ⊆ · · · be a sequence of finite brick sets, such that $\bigcup_i A_i = A$, and similarly let B1 ⊆ B2 ⊆ · · · ⊆ Bi ⊆ · · · be a sequence of finite brick sets, such that $\bigcup_i B_i = B$. By the definition of volume, we have that limi→∞ vol(Ai) = vol(A) and limi→∞ vol(Bi) = vol(B).
We claim that limi→∞ vol(Ai + Bi) = vol(A + B). Indeed, consider any point z ∈ A + B, and let u ∈ A and v ∈ B be such that u + v = z. By definition, there exists an i, such that for all j > i we have u ∈ Aj, v ∈ Bj, and as such z ∈ Aj + Bj. Thus, A + B ⊆ ∪j(Aj + Bj) and ∪j(Aj + Bj) ⊆ ∪j(A + B) ⊆ A + B; namely, ∪j(Aj + Bj) = A + B.
Furthermore, for any i > 0, since Ai and Bi are brick sets, we have
\[
\mathrm{vol}(A_i + B_i)^{1/n} \ge \mathrm{vol}(A_i)^{1/n} + \mathrm{vol}(B_i)^{1/n} ,
\]
by Lemma 19.1.5. Letting i → ∞ now implies the theorem.
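The base case of Lemma 19.1.5 is easy to verify numerically: for two axis-parallel boxes, A + B is the box whose side lengths are the coordinate-wise sums. The following small sketch (not from the notes; the side lengths are arbitrary random choices) checks the inequality for that case.

```python
# Sketch: Brunn-Minkowski for two boxes, vol(A+B)^{1/n} >= vol(A)^{1/n} + vol(B)^{1/n}.
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.uniform(0.1, 3.0, n)                 # side lengths of the box A
b = rng.uniform(0.1, 3.0, n)                 # side lengths of the box B

lhs = np.prod(a + b) ** (1 / n)              # vol(A + B)^{1/n}
rhs = np.prod(a) ** (1 / n) + np.prod(b) ** (1 / n)
print(lhs >= rhs - 1e-12)                    # True: the base case of Lemma 19.1.5
```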
Theorem 19.1.6 (Brunn-Minkowski for slice volumes). Let P be a convex set in R^{n+1}, and let A = P ∩ (x1 = a), B = P ∩ (x1 = b) and C = P ∩ (x1 = c) be three slices of P, for a < b < c. We have vol(B) ≥ min(vol(A), vol(C)). Specifically, consider the function
\[
v(t) = \Bigl( \mathrm{vol}\bigl( P \cap (x_1 = t) \bigr) \Bigr)^{1/n} ,
\]
and let I = [tmin, tmax] be the interval where the hyperplane x1 = t intersects P. Then, v(t) is concave on I.
This is the standard definition in measure theory of volume. The reader unfamiliar with this fanfare can either consult
a standard text on the topic, or take it for granted as this is intuitively clear.
Proof: If a or c are outside I, then vol(A) = 0 or vol(C) = 0, respectively, and then the claim trivially
holds.
Otherwise, let α = (b − a)/(c − a). We have that b = (1 − α) · a + α · c, and by the convexity of P, we have (1 − α)A + αC ⊆ B. Thus, by Theorem 19.1.3 we have
\[
v(b) = \mathrm{vol}(B)^{1/n}
\ge \mathrm{vol}\bigl( (1-\alpha) A + \alpha C \bigr)^{1/n}
\ge \mathrm{vol}\bigl( (1-\alpha) A \bigr)^{1/n} + \mathrm{vol}\bigl( \alpha C \bigr)^{1/n}
= (1-\alpha)\, v(a) + \alpha\, v(c) .
\]
Namely, v(·) is concave on I, and in particular v(b) ≥ min(v(a), v(c)), which in turn implies that vol(B) = v(b)^n ≥ (min(v(a), v(c)))^n = min(vol(A), vol(C)), as claimed.
Corollary 19.1.7. For A and B compact sets in R^n, the following holds: $\mathrm{vol}\bigl( (A + B)/2 \bigr) \ge \sqrt{ \mathrm{vol}(A)\, \mathrm{vol}(B) }$.
Namely, the left side is the radius of a ball having the same volume as K, and the right side is the
radius of a sphere having the same surface area as K. In particular, if we scale K so that its surface area
is the same as b, then the above inequality implies that vol(K) ≤ vol(b).
To prove Eq. (19.1), observe that vol(b) = S(b)/n®. Also, observe that K + εb is the body K together with a small “atmosphere” around it of thickness ε. In particular, the volume of this “atmosphere” is (roughly) ε S(K) (in fact, Minkowski defined the surface area of a convex body to be the limit stated next). Formally, we have
\[
\mathsf{S}(K) = \lim_{\varepsilon \to 0^+} \frac{ \mathrm{vol}(K + \varepsilon b) - \mathrm{vol}(K) }{ \varepsilon }
\ge \lim_{\varepsilon \to 0^+} \frac{ \Bigl( \mathrm{vol}(K)^{1/n} + \mathrm{vol}(\varepsilon b)^{1/n} \Bigr)^n - \mathrm{vol}(K) }{ \varepsilon } ,
\]
by the Brunn-Minkowski inequality. Now $\mathrm{vol}(\varepsilon b)^{1/n} = \varepsilon\, \mathrm{vol}(b)^{1/n}$, and as such
19.2.1. The strange and curious life of the hypersphere
Consider the ball of radius r in R^n, denoted by r b^n, where b^n is the unit radius ball centered at the origin. Clearly, vol(r b^n) = r^n vol(b^n). Now, even if r is very close to 1, the quantity r^n might be very close to zero if n is sufficiently large. Indeed, if r = 1 − δ, then r^n = (1 − δ)^n ≤ exp(−δn), which is very small if δ ≫ 1/n. (Here, we used the fact that 1 − x ≤ e^{−x}, for x ≥ 0.) Namely, for the ball in high dimensions, its mass is concentrated in a very thin shell close to its surface.
The volume of a ball and the surface area of a hypersphere. Let vol(r b^n) denote the volume of the ball of radius r in R^n, and let Area(r S^{(n−1)}) denote the surface area of its bounding sphere (i.e., the surface area of r S^{(n−1)}). It is known that
\[
\mathrm{vol}\bigl( r\, \mathsf{b}^n \bigr) = \frac{ \pi^{n/2} r^n }{ \Gamma(n/2 + 1) }
\qquad\text{and}\qquad
\mathrm{Area}\bigl( r\, \mathsf{S}^{(n-1)} \bigr) = \frac{ 2 \pi^{n/2} r^{n-1} }{ \Gamma(n/2) } ,
\]
where the gamma function, Γ(·), is an extension of the factorial function. Specifically, if n is even then Γ(n/2 + 1) = (n/2)!, and for n odd Γ(n/2 + 1) = √π (n!!)/2^{(n+1)/2}, where n!! = 1 · 3 · 5 · · · n is the double factorial. The most surprising implication of these two formulas is that, as n increases, the volume of the unit ball first increases (till dimension 5 in fact) and then starts decreasing to zero.
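A quick sketch (not from the notes) that simply evaluates the standard volume formula and confirms the peak at dimension 5:

```python
# vol(b^n) = pi^{n/2} / Gamma(n/2 + 1); the sequence peaks at n = 5 and then decays.
import math

vols = [math.pi ** (n / 2) / math.gamma(n / 2 + 1) for n in range(1, 16)]
for n, v in enumerate(vols, start=1):
    print(n, round(v, 4))
print("maximum at n =", max(range(1, 16), key=lambda n: vols[n - 1]))  # prints 5
```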
Similarly, the surface area of the unit sphere S^{(n−1)} in R^n tends to zero as the dimension increases. To see this, compute the volume of the unit ball using an integral of its slice volume, when it is being sliced by hyperplanes perpendicular to the nth coordinate. We have (see the figure on the right) that
\[
\mathrm{vol}\bigl( \mathsf{b}^n \bigr)
= \int_{x_n = -1}^{1} \mathrm{vol}\Bigl( \sqrt{1 - x_n^2}\; \mathsf{b}^{n-1} \Bigr) \mathrm{d}x_n
= \mathrm{vol}\bigl( \mathsf{b}^{n-1} \bigr) \int_{x_n = -1}^{1} \bigl( 1 - x_n^2 \bigr)^{(n-1)/2} \mathrm{d}x_n .
\]
Now, the integral on the right side tends to zero as n increases. In fact, for n very large, the term $(1 - x_n^2)^{(n-1)/2}$ is very close to 0 everywhere except for a small interval around 0. This implies that the main contribution to the volume of the ball happens when we consider slices of the ball by hyperplanes of the form xn = δ, where δ is small.
If one has to visualize what such a ball in high dimensions looks like, it might be best to think about it as a star-like creature: It has very little mass close to the tips of any set of orthogonal directions we pick, and most of its mass somehow lies on hyperplanes close to its center.¯
Proof: We will prove a slightly weaker bound, with −nt^2/4 in the exponent. Let $\widehat{A} = T(A)$, where $T(X) = \bigl\{ \alpha x \bigm| x \in X,\ \alpha \in [0, 1] \bigr\}$,
[Figure 19.1: the sets A and B on the sphere, their cones Â = T(A) and B̂ = T(B), and the set (Â + B̂)/2, which lies inside a ball of radius at most 1 − t²/8.]
and b^n is the unit ball in R^n. We have that $\Pr[A] = \mu\bigl(\widehat{A}\bigr)$, where $\mu\bigl(\widehat{A}\bigr) = \mathrm{vol}\bigl(\widehat{A}\bigr) / \mathrm{vol}(\mathsf{b}^n)$°.
Let B = S^{(n−1)} \ A_t and $\widehat{B} = T(B)$; see Figure 19.1. We have that ‖a − b‖ ≥ t for all a ∈ A and b ∈ B. By Lemma 19.2.2 below, the set $\bigl( \widehat{A} + \widehat{B} \bigr)/2$ is contained in the ball r b^n centered at the origin, where r = 1 − t^2/8. Observe that $\mu(r \mathsf{b}^n) = \mathrm{vol}(r \mathsf{b}^n)/\mathrm{vol}(\mathsf{b}^n) = r^n = \bigl( 1 - t^2/8 \bigr)^n$. As such, applying the Brunn-Minkowski inequality in the form of Corollary 19.1.7, we have
\[
\Bigl( 1 - \frac{t^2}{8} \Bigr)^n
= \mu\bigl( r \mathsf{b}^n \bigr)
\ge \mu\biggl( \frac{\widehat{A} + \widehat{B}}{2} \biggr)
\ge \sqrt{ \mu\bigl(\widehat{A}\bigr)\, \mu\bigl(\widehat{B}\bigr) }
= \sqrt{ \Pr[A]\, \Pr[B] }
\ge \sqrt{ \Pr[B] / 2 } .
\]
Thus, $\Pr[B] \le 2 \bigl( 1 - t^2/8 \bigr)^{2n} \le 2 \exp\bigl( -n t^2 / 4 \bigr)$, which is the claimed (weaker) bound, since B = S^{(n−1)} \ A_t.
Lemma 19.2.2. For any $\widehat{a} \in \widehat{A}$ and $\widehat{b} \in \widehat{B}$, we have $\Bigl\| \dfrac{\widehat{a} + \widehat{b}}{2} \Bigr\| \le 1 - \dfrac{t^2}{8}$.
° This is one of these “trivial” claims that might give the reader a pause, so here is a formal proof. Pick a random point p uniformly inside the ball b^n. Let ψ be the probability that $p \in \widehat{A}$. Clearly, $\mathrm{vol}\bigl(\widehat{A}\bigr) = \psi\, \mathrm{vol}(\mathsf{b}^n)$. So, consider the normalized point q = p/‖p‖. Clearly, $p \in \widehat{A}$ if and only if q ∈ A, by the definition of $\widehat{A}$. Thus, $\mu\bigl(\widehat{A}\bigr) = \mathrm{vol}\bigl(\widehat{A}\bigr)/\mathrm{vol}(\mathsf{b}^n) = \psi = \Pr\bigl[ p \in \widehat{A} \bigr] = \Pr[q \in A] = \Pr[A]$, since q has a uniform distribution on the hypersphere by assumption.
Proof: Let $\widehat{a} = \alpha a$ and $\widehat{b} = \beta b$, where a ∈ A and b ∈ B, and α, β ∈ [0, 1]. We have
\[
\|u\| = \Bigl\| \frac{a + b}{2} \Bigr\|
= \sqrt{ 1^2 - \Bigl\| \frac{a - b}{2} \Bigr\|^2 }
\le \sqrt{ 1 - \frac{t^2}{4} }
\le 1 - \frac{t^2}{8} ,
\tag{19.2}
\]
since ‖a − b‖ ≥ t. As for $\widehat{a}$ and $\widehat{b}$, assume that α ≤ β, and observe that the quantity $\bigl\| \widehat{a} + \widehat{b} \bigr\|$ is maximized when β = 1. As such, by the triangle inequality, we have
\[
\Bigl\| \frac{\widehat{a} + \widehat{b}}{2} \Bigr\|
\le \Bigl\| \frac{\alpha a + b}{2} \Bigr\|
= \Bigl\| \alpha \frac{a + b}{2} + (1 - \alpha) \frac{b}{2} \Bigr\|
\le \alpha \Bigl\| \frac{a + b}{2} \Bigr\| + (1 - \alpha) \frac{1}{2}
\le \alpha \Bigl( 1 - \frac{t^2}{8} \Bigr) + (1 - \alpha) \frac{1}{2} = \tau ,
\]
by Eq. (19.2) and since ‖b‖ = 1. Now, τ is a convex combination of the two numbers 1/2 and 1 − t^2/8. In particular, we conclude that τ ≤ max(1/2, 1 − t^2/8) ≤ 1 − t^2/8, since t ≤ 2.
Proof: We prove only the first inequality; the second follows by symmetry. Let
\[
A = \Bigl\{ x \in \mathsf{S}^{(n-1)} \Bigm| f(x) \le \mathrm{med}(f) \Bigr\} .
\]
By Lemma 19.3.1, we have Pr[A] ≥ 1/2. Consider a point x ∈ A_t, where A_t is as defined in Theorem 19.2.1. Let nn(x) be the nearest point in A to x. We have by definition that ‖x − nn(x)‖ ≤ t. As such, since f is 1-Lipschitz and nn(x) ∈ A, we have that
\[
f(x) \le f(\mathrm{nn}(x)) + \| \mathrm{nn}(x) - x \| \le \mathrm{med}(f) + t .
\]
Thus, by Theorem 19.2.1, we get $\Pr[ f > \mathrm{med}(f) + t ] \le 1 - \Pr[A_t] \le 2 \exp\bigl( -t^2 n / 2 \bigr)$.
19.4. The Johnson-Lindenstrauss Lemma
Lemma 19.4.1. For a unit vector x ∈ S^{(n−1)}, let
\[
f(x) = \sqrt{ x_1^2 + x_2^2 + \cdots + x_k^2 }
\]
be the length of the projection of x into the subspace formed by the first k coordinates. Let x be a vector randomly chosen with uniform distribution from S^{(n−1)}. Then f(x) is sharply concentrated. Namely, there exists m = m(n, k) such that
\[
\Pr[ f(x) \ge m + t ] \le 2 \exp\bigl( -t^2 n / 2 \bigr)
\qquad\text{and}\qquad
\Pr[ f(x) \le m - t ] \le 2 \exp\bigl( -t^2 n / 2 \bigr) ,
\]
for any t ∈ [0, 1]. Furthermore, for k ≥ 10 ln n, we have $m \ge \frac{1}{2}\sqrt{k/n}$.
Proof: The function f is 1-Lipschitz. Indeed, writing $P(x) = (x_1, \ldots, x_k, 0, \ldots, 0)$ for the projection, we have, for any x, y ∈ S^{(n−1)}, that $|f(x) - f(y)| = \bigl| \|P(x)\| - \|P(y)\| \bigr| \le \|P(x) - P(y)\| = \|P(x - y)\| \le \|x - y\|$, by the triangle inequality and since P is a contraction (i.e., 1-Lipschitz). Theorem 19.3.3 (i.e., Lévy's lemma) gives the required tail estimate with m = med(f).
Thus, we only need to prove the lower bound on m. For a random x = (x1, . . . , xn) ∈ S^{(n−1)}, we have $\mathbb{E}\bigl[ \|x\|^2 \bigr] = 1$. By linearity of expectations, and symmetry, we have
\[
1 = \mathbb{E}\bigl[ \|x\|^2 \bigr] = \mathbb{E}\Bigl[ \sum_{i=1}^{n} x_i^2 \Bigr] = \sum_{i=1}^{n} \mathbb{E}\bigl[ x_i^2 \bigr] = n\, \mathbb{E}\bigl[ x_j^2 \bigr] ,
\]
for any 1 ≤ j ≤ n. Thus, $\mathbb{E}\bigl[ x_j^2 \bigr] = 1/n$, for j = 1, . . . , n. Thus,
\[
\mathbb{E}\bigl[ (f(x))^2 \bigr] = \mathbb{E}\Bigl[ \sum_{i=1}^{k} x_i^2 \Bigr] = \sum_{i=1}^{k} \mathbb{E}\bigl[ x_i^2 \bigr] = \frac{k}{n} ,
\]
by linearity of expectation.
We next use that f is concentrated, to show that f^2 is also relatively concentrated. For any t ≥ 0, we have
\[
\frac{k}{n} = \mathbb{E}\bigl[ f^2 \bigr]
\le \Pr[ f \le m + t ]\, (m + t)^2 + \Pr[ f \ge m + t ] \cdot 1
\le 1 \cdot (m + t)^2 + 2 \exp\bigl( -t^2 n / 2 \bigr) ,
\]
since f(x) ≤ 1, for any x ∈ S^{(n−1)}. Let $t = \sqrt{k/5n}$. Since k ≥ 10 ln n, we have that $2 \exp\bigl( -t^2 n/2 \bigr) \le 2/n$. We get that
\[
\frac{k}{n} \le \Bigl( m + \sqrt{k/5n} \Bigr)^2 + \frac{2}{n} .
\]
This implies that $\sqrt{(k-2)/n} \le m + \sqrt{k/5n}$, which in turn implies that $m \ge \sqrt{(k-2)/n} - \sqrt{k/5n} \ge \frac{1}{2}\sqrt{k/n}$.
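A quick empirical sketch (not from the notes; the parameter values are arbitrary) that the projection length of a random unit vector indeed concentrates around roughly √(k/n):

```python
# Project random points of S^{n-1} onto the first k coordinates and look at f(x).
import numpy as np

rng = np.random.default_rng(5)
n, k, trials = 1000, 50, 2000
X = rng.normal(size=(trials, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # uniform points on S^{n-1}
f = np.linalg.norm(X[:, :k], axis=1)              # f(x) = sqrt(x_1^2 + ... + x_k^2)
print(round(np.median(f), 3), round(np.sqrt(k / n), 3))   # both approximately 0.224
```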
Next, we would like to argue that given a fixed vector, projecting it down into a random k-dimensional
subspace results in a random vector such that its length is highly concentrated. This would imply that
we can do dimension reduction and still preserve distances between points that we care about.
To this end, we would like to flip Lemma 19.4.1 around. Instead of randomly picking a point and
projecting it down to the first k-dimensional space, we would like x to be fixed, and randomly pick the
k-dimensional subspace we project into. However, we need to pick this random k-dimensional space
carefully. Indeed, if we rotate this random subspace, by a transformation T, so that it occupies the first
k dimensions, then the point T(x) needs to be uniformly distributed on the hypersphere if we want to
use Lemma 19.4.1.
As such, we would like to randomly pick a rotation of Rn . This maps the standard orthonormal basis
into a randomly rotated orthonormal space. Taking the subspace spanned by the first k vectors of the
rotated basis results in a k-dimensional random subspace. Such a rotation is an orthonormal matrix
with determinant 1. We can generate such a matrix, by randomly picking a vector e1 ∈ S(n−1) . Next, we
set e1 as the first column of our rotation matrix, and generate the other n − 1 columns, by generating
recursively n − 1 orthonormal vectors in the space orthogonal to e1 .
Remark 19.4.2 (Generating a random point on the sphere.). At this point, the reader might wonder how to pick a point uniformly from the unit hypersphere. The idea is to pick a point from the multi-dimensional normal distribution N^n(0, 1), and normalize it to have length 1. Since the multi-dimensional normal distribution has the density function
\[
(2\pi)^{-n/2} \exp\bigl( -( x_1^2 + x_2^2 + \cdots + x_n^2 ) / 2 \bigr) ,
\]
which is symmetric (i.e., all the points at distance r from the origin have the same density), it follows that this indeed generates a point randomly and uniformly on S^{(n−1)}.
Generating a vector with multi-dimensional normal distribution is no more than picking each coordinate according to the normal distribution; see Lemma 19.7.1p143. Given a source of random numbers according to the uniform distribution, this can be done using O(1) computations per coordinate, using the Box-Muller transformation [BM58]. Overall, each random vector can be generated in O(n) time.
Since projecting down n-dimensional normal distribution to the lower dimensional space yields a
normal distribution, it follows that generating a random projection is no more than randomly picking n vectors v1, . . . , vn according to the multidimensional normal distribution. Then, we orthonormalize them, using Gram-Schmidt, where $\widehat{v}_1 = v_1/\|v_1\|$, and $\widehat{v}_i$ is the normalized vector of $v_i - w_i$, where wi is the projection of vi to the space spanned by v1, . . . , vi−1.
Taking those vectors as columns of a matrix, generates a matrix A, with determinant either 1 or
−1. We multiply one of the vectors by −1 if the determinant is −1. The resulting matrix is a random
rotation matrix.
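A minimal sketch of this construction (not from the notes): Gaussian columns orthonormalized (here via QR, which performs the Gram-Schmidt step), then sign-fixed to have determinant +1. Up to the usual sign conventions this gives a random rotation of the kind described above.

```python
# Sketch: random rotation matrix from Gaussian vectors.
import numpy as np

def random_rotation(n, rng=np.random.default_rng()):
    A = rng.normal(size=(n, n))          # n columns drawn from N^n(0, 1)
    Q, _ = np.linalg.qr(A)               # orthonormalize the columns
    if np.linalg.det(Q) < 0:             # if the determinant is -1, flip one column
        Q[:, 0] *= -1
    return Q

M = random_rotation(6)
print(np.allclose(M.T @ M, np.eye(6)), np.isclose(np.linalg.det(M), 1.0))  # True True
```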
We can now restate Lemma 19.4.1 in the setting where the vector is fixed and the projection is into
a random subspace.
Lemma 19.4.3. Let x ∈ S^{(n−1)} be an arbitrary unit vector, and consider a random k-dimensional subspace F, and let f(x) be the length of the projection of x into F. Then, there exists m = m(n, k) such that
\[
\Pr[ f(x) \ge m + t ] \le 2 \exp\bigl( -t^2 n / 2 \bigr)
\qquad\text{and}\qquad
\Pr[ f(x) \le m - t ] \le 2 \exp\bigl( -t^2 n / 2 \bigr) ,
\]
for any t ∈ [0, 1]. Furthermore, for k ≥ 10 ln n, we have $m \ge \frac{1}{2}\sqrt{k/n}$.
Proof: Let vi be the ith standard orthonormal vector, having 1 at the ith coordinate. Let M be a random rotation of space generated as described above. Clearly, for an arbitrary fixed unit vector x, the vector Mx is distributed uniformly on the sphere. Now, the ith column of the matrix M is the random vector ei, and
MT vi = ei . As such, we have
hMx, vi i = (Mx)T vi = xT MT vi = xT ei = hx, ei i .
In particular, treating Mx as a random vector and projecting it on the first k coordinates, we have that
\[
f(x) = \sqrt{ \sum_{i=1}^{k} \langle \mathsf{M}x, v_i \rangle^2 } = \sqrt{ \sum_{i=1}^{k} \langle x, e_i \rangle^2 } .
\]
But e1, . . . , e k is just an orthonormal basis of a random k-dimensional subspace. As such, the expression
on the right is the length of the projection of x into a k-dimensional random subspace. As such, the
length of the projection of x into a random k-dimensional subspace has exactly the same distribution
as the length of the projection of a random vector into the first k coordinates. The claim now follows
by Lemma 19.4.1.
Definition 19.4.4. The mapping f : Rn → R k is called K-bi-Lipschitz for a subset X ⊆ Rn if there exists
a constant c > 0 such that
cK −1 · kp − qk ≤ k f (p) − f (q)k ≤ c · kp − qk ,
for all p, q ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f , and is denoted dist( f ). We
will refer to f as a K-embedding of X.
Remark 19.4.5. Let X ⊆ Rm be a set of n points, where m potentially might be much larger than n.
Observe, that in this case, since we only care about the inter-point distances of points in X, we can
consider X to be a set of points lying in the affine subspace F spanned by the points of X. Note that this subspace has dimension at most n − 1. As such, each point of X can be interpreted as an (n − 1)-dimensional point in F. Namely, we can assume, for our purposes, that the set of n points in Euclidean space we care about lies in R^n (in fact, R^{n−1}).
Note, that if m < n we can always pad all the coordinates of the points of X by zeros, such that the
resulting point set lies in Rn .
Proof: By Remark 19.4.5, we can assume that X ⊆ Rn . Let k = 200ε −2 ln n. Assume k < n, and let
F be a random k-dimensional linear subspace of Rn . Let PF : Rn → F be the orthogonal projection
operator of Rn into F. Let m be the number around which kPF (x)k is concentrated, for x ∈ S(n−1) , as in
Lemma 19.4.3.
Fix two points x, y ∈ R^n; we prove that
\[
\Bigl( 1 - \frac{\varepsilon}{3} \Bigr) m\, \| x - y \|
\;\le\; \| P_F(x) - P_F(y) \|
\;\le\; \Bigl( 1 + \frac{\varepsilon}{3} \Bigr) m\, \| x - y \|
\]
holds with probability ≥ 1 − n^{−2}. Since there are $\binom{n}{2}$ pairs of points in X, it follows that with constant probability (say > 1/3) this holds for all pairs of points of X. In such a case, the mapping $P_F$ is a D-embedding of X into R^k with $D \le \frac{1 + \varepsilon/3}{1 - \varepsilon/3} \le 1 + \varepsilon$, for ε ≤ 1.
Let u = x − y. We have $P_F(u) = P_F(x) - P_F(y)$ since $P_F(\cdot)$ is a linear operator. Thus, the condition becomes $\bigl( 1 - \frac{\varepsilon}{3} \bigr) m \|u\| \le \|P_F(u)\| \le \bigl( 1 + \frac{\varepsilon}{3} \bigr) m \|u\|$. Again, since projection is a linear operator, for any α > 0, the condition is equivalent to the same inequality for the vector αu.
As such, we can assume that ‖u‖ = 1 by picking α = 1/‖u‖. Namely, we need to show that
\[
\bigl|\, \| P_F(u) \| - m \,\bigr| \le \frac{\varepsilon}{3}\, m .
\]
Let f(u) = ‖P_F(u)‖. By Lemma 19.4.1 (exchanging the random space with the random vector), for t = εm/3, we have that the probability that this does not hold is bounded by
\[
\Pr\bigl[ | f(u) - m | \ge t \bigr]
\le 4 \exp\Bigl( -\frac{t^2 n}{2} \Bigr)
= 4 \exp\Bigl( -\frac{\varepsilon^2 m^2 n}{18} \Bigr)
\le 4 \exp\Bigl( -\frac{\varepsilon^2 k}{72} \Bigr)
< n^{-2} ,
\]
since $m \ge \frac{1}{2}\sqrt{k/n}$ and k = 200 ε^{−2} ln n.
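The following empirical sketch (not from the notes) projects a point set into a random k-dimensional subspace and measures the worst pairwise distortion. The constant 200 used in the proof is conservative; a smaller constant is used here only to keep the toy experiment small, and the rescaling factor √(d/k) plays the role of 1/m.

```python
# Sketch: random-subspace projection roughly preserves pairwise distances.
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 50, 500, 0.5
k = int(np.ceil(8 * np.log(n) / eps ** 2))       # target dimension, k << d (toy constant)

X = rng.normal(size=(n, d))                      # n arbitrary points in R^d
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))     # orthonormal basis of a random k-dim subspace
Y = (X @ Q) * np.sqrt(d / k)                     # project, then rescale by roughly 1/m

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
        worst = max(worst, abs(ratio - 1.0))
print("worst relative distortion:", round(worst, 3))   # typically well below eps
```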
19.6. Exercises
Exercise 19.6.1 (Boxes can be separated.). (Easy.) Let A and B be two axis-parallel boxes that are interior
disjoint. Prove that there is always an axis-parallel hyperplane that separates the interior of the two
boxes.
Corollary 19.6.3. For A and B compact sets in R^n, we have for any λ ∈ [0, 1] that $\mathrm{vol}\bigl( \lambda A + (1 - \lambda) B \bigr) \ge \mathrm{vol}(A)^{\lambda}\, \mathrm{vol}(B)^{1 - \lambda}$.
Exercise 19.6.4 (Projections are contractions.). (Easy.) Let F be a k-dimensional affine subspace, and let PF : R^d → F be the projection that maps every point x ∈ R^d to its nearest neighbor on F. Prove that PF is a contraction (i.e., 1-Lipschitz). Namely, for any p, q ∈ R^d, it holds that ‖PF(p) − PF(q)‖ ≤ ‖p − q‖.
Exercise 19.6.5 (JL Lemma works for angles.). Show that the Johnson-Lindenstrauss lemma also (1 ± ε)-preserves angles among triples of points of P (you might need to increase the target dimension, however, by a constant factor). [Hint: For every angle, construct an equilateral triangle whose edges are preserved by the projection (add the vertices of those triangles [conceptually] to the point set being embedded). Argue that this implies that the angle is preserved.]
19.7. Miscellaneous
Lemma 19.7.1. (A) The multidimensional normal distribution is symmetric; that is, for any two points
p, q ∈ Rd such that kpk = kqk we have that g(p) = g(q), where g(·) is the density function of the
multidimensional normal distribution Nd .
(B) The projection of the normal distribution on any direction is a one dimensional normal distri-
bution.
(C) Picking d variables X1, . . . , Xd using one dimensional normal distribution N results in a point
(X1, . . . , Xd ) that has multidimensional normal distribution Nd .
Chapter 20
20.1. VC dimension
Definition 20.1.1. A range space S is a pair (X, R), where X is a ground set (finite or infinite) and R
is a (finite or infinite) family of subsets of X. The elements of X are points and the elements of R are
ranges.
Our interest is in the size/weight of the ranges in the range space. For technical reasons, it will be easier to consider a finite subset x as the underlying ground set.
Definition 20.1.2. Let S = (X, R) be a range space, and let x be a finite (fixed) subset of X. For a range r ∈ R, its measure is the quantity
\[
\mathsf{m}(r) = \frac{ | r \cap x | }{ | x | } .
\]
While x is finite, it might be very large. As such, we are interested in getting a good estimate to
m(r) by using a more compact set to represent the range space.
Definition 20.1.3. Let S = (X, R) be a range space. For a subset N (which might be a multi-set) of x, its estimate of the measure of m(r), for r ∈ R, is the quantity
\[
\mathsf{s}(r) = \frac{ | r \cap N | }{ | N | } .
\]
The main purpose of this chapter is to come up with methods to generate a sample N, such that
m(r) ≈ s(r), for all the ranges r ∈ R.
It is easy to see that in the worst case, no sample can capture the measure of all ranges. Indeed,
given a sample N, consider the range x \ N that is being completely missed by N. As such, we need
to concentrate on range spaces that are “low dimensional”, where not all subsets are allowable ranges.
The notion of VC dimension (named after Vapnik and Chervonenkis [VC71]) is one way to limit the
complexity of a range space.
Definition 20.1.4. Let S = (X, R) be a range space. For Y ⊆ X, let
\[
\mathcal{R}_{|Y} = \bigl\{ r \cap Y \bigm| r \in \mathcal{R} \bigr\}
\tag{20.1}
\]
denote the projection of R on Y. If R|Y contains all subsets of Y (i.e., if Y is finite, we have |R|Y| = 2^{|Y|}), then Y is shattered by R (or by S). The VC dimension of S, denoted by dimVC(S), is the maximum cardinality of a shattered subset of X. If there are arbitrarily large shattered subsets, then dimVC(S) = ∞.
20.1.1. Examples
Intervals. Consider the set X to be the real line, and consider R to be the set of all intervals on the real line. Consider the set Y = {1, 2}. Clearly, one can find four intervals that contain all possible subsets of Y. Formally, the projection R|Y = {{ }, {1}, {2}, {1, 2}}. The intervals realizing each of these subsets are depicted on the right.
p q s
However, this is false for a set of three points B = {p, q, r}, since there is no interval
that can contain the two extreme points p and r without also containing q. Namely, the subset {p, r} is
not realizable for intervals, implying that the largest shattered set by the range space (real line, intervals)
is of size two. We conclude that the VC dimension of this space is two.
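A brute-force sketch (not from the notes) of the same fact: for points on the line, every realizable subset is a contiguous run, so a set of three points induces only 7 of the 8 possible subsets, and in particular the subset of the two extremes is never realized.

```python
# All subsets of a point set realizable by intervals [a, b] on the real line.
def interval_projection(points):
    pts = sorted(points)
    subsets = {frozenset()}                       # the empty range
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))  # contiguous runs only
    return subsets

P = [0.5, 1.7, 3.2]
proj = interval_projection(P)
print(len(proj))                                  # 7 < 2^3: the three points are not shattered
print(frozenset([0.5, 3.2]) in proj)              # False: extremes without the middle point
```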
Disks. Let X = R^2, and let R be the set of disks in the plane. Clearly, for any three points in the plane (in general position), denoted by p, q, and r, one can find eight disks that realize all possible 2^3 different subsets. See the figure on the right.
But can disks shatter a set with four points? Consider such a set P of four points. If the convex hull of P has only three points on its boundary, then the subset consisting of only those three hull vertices (i.e., not including the middle point) is impossible, by convexity. Namely, there is no disk that contains only those three points without also containing the middle point.
Alternatively, if all four points are vertices of the convex hull and they are a, b, c, d along the boundary of the convex hull, then either the set {a, c} or the set {b, d} is not realizable. Indeed, if both options are realizable, then consider the two disks D1 and D2 that realize those assignments. Clearly, ∂D1 and ∂D2 must intersect in four points, but this is not possible, since two circles have at most two intersection points. See the figure on the left. Hence the VC dimension of this range space is 3.
Convex sets. Consider the range space S = (R^2, R), where R is the set of all (closed) convex sets in the plane. We claim that dimVC(S) = ∞. Indeed, consider a set U of n points p1, . . . , pn all lying on the boundary of the unit circle in the plane. Let V be any subset of U, and consider the convex hull CH(V). Clearly, CH(V) ∈ R, and furthermore, CH(V) ∩ U = V. Namely, any subset of U is realizable by S. Thus, S can shatter sets of arbitrary size, and its VC dimension is unbounded.
Complement. Consider the range space S = (X, R) with δ = dimVC(S). Next, consider the complement space, $\overline{\mathsf{S}} = \bigl( X, \overline{\mathcal{R}} \bigr)$, where
\[
\overline{\mathcal{R}} = \bigl\{ X \setminus r \bigm| r \in \mathcal{R} \bigr\} ;
\]
namely, the ranges of $\overline{\mathsf{S}}$ are the complements of the ranges in S. What is the VC dimension of $\overline{\mathsf{S}}$? Well, a set B ⊆ X is shattered by $\overline{\mathsf{S}}$ if and only if it is shattered by S. Indeed, if S shatters B, then for any Z ⊆ B, we have that $(B \setminus Z) \in \mathcal{R}_{|B}$, which implies that $Z = B \setminus (B \setminus Z) \in \overline{\mathcal{R}}_{|B}$. Namely, $\overline{\mathcal{R}}_{|B}$ contains all the subsets of B, and $\overline{\mathsf{S}}$ shatters B. Thus, $\dim_{\mathrm{VC}}\bigl( \overline{\mathsf{S}} \bigr) = \dim_{\mathrm{VC}}(\mathsf{S})$.
Lemma 20.1.5. For a range space S = (X, R) we have that $\dim_{\mathrm{VC}}(\mathsf{S}) = \dim_{\mathrm{VC}}\bigl( \overline{\mathsf{S}} \bigr)$, where $\overline{\mathsf{S}}$ is the complement range space.
20.1.1.1. Halfspaces
Let S = (X, R), where X = Rd and R is the set of all (closed) halfspaces in Rd . We need the following
technical claim.
Claim 20.1.6. Let P = {p1, . . . , pd+2} be a set of d + 2 points in R^d. There are real numbers β1, . . . , βd+2, not all of them zero, such that $\sum_i \beta_i p_i = 0$ and $\sum_i \beta_i = 0$.
Proof: Indeed, set qi = (pi, 1), for i = 1, . . . , d + 2. Now, the points q1, . . . , qd+2 ∈ R^{d+1} are linearly dependent, and there are coefficients β1, . . . , βd+2, not all of them zero, such that $\sum_{i=1}^{d+2} \beta_i q_i = 0$. Considering only the first d coordinates of these points implies that $\sum_{i=1}^{d+2} \beta_i p_i = 0$. Similarly, by considering only the (d + 1)st coordinate of these points, we have that $\sum_{i=1}^{d+2} \beta_i = 0$.
To see what the VC dimension of halfspaces in Rd is, we need the following result of Radon. (For a
reminder of the formal definition of convex hulls, see Definition 32.1.1p253 .)
Theorem 20.1.7 (Radon's theorem). Let P = {p1, . . . , pd+2} be a set of d + 2 points in R^d. Then, there exist two disjoint subsets C and D of P, such that CH(C) ∩ CH(D) ≠ ∅ and C ∪ D = P.
Proof: By Claim 20.1.6 there are real numbers β1, . . . , βd+2, not all of them zero, such that $\sum_i \beta_i p_i = 0$ and $\sum_i \beta_i = 0$.
Assume, for the sake of simplicity of exposition, that β1, . . . , βk ≥ 0 and βk+1, . . . , βd+2 < 0. Furthermore, let $\mu = \sum_{i=1}^{k} \beta_i = -\sum_{i=k+1}^{d+2} \beta_i$. We have that
\[
\sum_{i=1}^{k} \beta_i p_i = - \sum_{i=k+1}^{d+2} \beta_i p_i .
\]
In particular, $v = \sum_{i=1}^{k} (\beta_i/\mu) p_i$ is a point in CH({p1, . . . , pk}). Furthermore, for the same point v we have $v = \sum_{i=k+1}^{d+2} -(\beta_i/\mu) p_i \in \mathrm{CH}(\{p_{k+1}, \ldots, p_{d+2}\})$. We conclude that v is in the intersection of the two convex hulls, as required.
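The proof of Claim 20.1.6 and Radon's theorem is entirely constructive, and the following sketch (not from the notes) carries it out numerically: a null-space vector β of the lifted points gives the Radon partition, and both convex combinations produce the same witness point.

```python
# Sketch: compute a Radon point for d+2 random points in R^d.
import numpy as np

rng = np.random.default_rng(3)
d = 3
P = rng.normal(size=(d + 2, d))                  # d+2 points in R^d

A = np.vstack([P.T, np.ones(d + 2)])             # rows: coordinates, plus the all-ones row
_, _, Vt = np.linalg.svd(A)
beta = Vt[-1]                                    # null-space vector: sum(beta)=0, sum(beta_i p_i)=0

pos, neg = beta > 0, beta <= 0                   # the two sides of the Radon partition
mu = beta[pos].sum()
v_pos = (beta[pos][:, None] * P[pos]).sum(axis=0) / mu      # point in CH(C)
v_neg = (-beta[neg][:, None] * P[neg]).sum(axis=0) / mu     # the same point, in CH(D)
print(np.allclose(v_pos, v_neg))                 # True: the two hulls intersect
```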
The following is a trivial observation, and yet we provide a proof to demonstrate it is true.
Lemma 20.1.8. Let P ⊆ Rd be a finite set, let r be any point in CH (P), and let h+ be a halfspace of
Rd containing r. Then there exists a point of P contained inside h+ .
Proof: The halfspace h+ can be written as $h^+ = \bigl\{ t \in \mathbb{R}^d \bigm| \langle t, v \rangle \le c \bigr\}$. Now r ∈ CH(P) ∩ h+, and as such there are numbers α1, . . . , αm ≥ 0 and points p1, . . . , pm ∈ P, such that $\sum_i \alpha_i = 1$ and $\sum_i \alpha_i p_i = r$. By the linearity of the dot product, we have that
\[
r \in h^+
\implies \langle r, v \rangle \le c
\implies \Bigl\langle \sum_{i=1}^{m} \alpha_i p_i ,\, v \Bigr\rangle \le c
\implies \beta = \sum_{i=1}^{m} \alpha_i \langle p_i, v \rangle \le c .
\]
Setting βi = ⟨pi, v⟩, for i = 1, . . . , m, the above implies that β is a weighted average of β1, . . . , βm. In particular, there must be a βi that is no larger than the average. That is, βi ≤ c. This implies that ⟨pi, v⟩ ≤ c. Namely, pi ∈ h+ as claimed.
Let S be the range space having R^d as the ground set and all the closed halfspaces as ranges. Radon's theorem implies that if a set Q of d + 2 points is being shattered by S, then we can partition this set Q into two disjoint sets Y and Z such that CH(Y) ∩ CH(Z) ≠ ∅. In particular, let r be a point in CH(Y) ∩ CH(Z). If a halfspace h+ contains all the points of Y, then CH(Y) ⊆ h+, since a halfspace is a convex set. Thus, any halfspace h+ containing all the points of Y will contain the point r ∈ CH(Y). But r ∈ CH(Z) ∩ h+, and this implies that a point of Z must lie in h+, by Lemma 20.1.8. Namely, the subset Y ⊆ Q cannot be realized by a halfspace, which implies that Q cannot be shattered. Thus dimVC(S) < d + 2. It is also easy to verify that the regular simplex with d + 1 vertices is shattered by S. Thus, dimVC(S) = d + 1.
Lemma 20.2.1 (Sauer’s lemma). If (X, R) is a range space of VC dimension δ with |X| = n, then
|R| ≤ Gδ (n).
Proof: Let x be any element of X, and consider the two range sets
\[
\mathcal{R}_x = \bigl\{ r \setminus \{x\} \bigm| r \cup \{x\} \in \mathcal{R} \text{ and } r \setminus \{x\} \in \mathcal{R} \bigr\}
\qquad\text{and}\qquad
\mathcal{R} \setminus x = \bigl\{ r \setminus \{x\} \bigm| r \in \mathcal{R} \bigr\} .
\]
Observe that |R| = |Rx| + |R \ x|. Indeed, we charge the elements of R to their corresponding element in R \ x. The only bad case is when there is a range r such that both r ∪ {x} ∈ R and r \ {x} ∈ R, because then these two distinct ranges get mapped to the same range in R \ x. But such ranges contribute exactly one element to Rx. Similarly, every element of Rx corresponds to two such “twin” ranges in R.
Observe that (X \ {x}, Rx) has VC dimension δ − 1, as the largest set that can be shattered is of size δ − 1. Indeed, any set B ⊂ X \ {x} shattered by Rx implies that B ∪ {x} is shattered in R. Similarly, the VC dimension of (X \ {x}, R \ x) is at most δ. Thus, we have
\[
|\mathcal{R}| = |\mathcal{R}_x| + |\mathcal{R} \setminus x| \le \mathsf{G}_{\delta-1}(n-1) + \mathsf{G}_{\delta}(n-1) = \mathsf{G}_{\delta}(n) ,
\]
by induction.
Definition 20.2.3 (Shatter function). Given a range space S = (X, R), its shatter function πS(m) is the maximum number of sets that might be created by S when restricted to subsets of size m. Formally,
\[
\pi_{\mathsf{S}}(m) = \max_{\substack{ B \subset X \\ |B| = m }} \bigl| \mathcal{R}_{|B} \bigr| .
\]
Proof: Let n = |B|, and observe that $\bigl| \mathcal{R}_{|B} \bigr| \le \mathsf{G}_{\delta}(n) \le n^{\delta}$, by Eq. (20.2). As such, $\bigl| \mathcal{R}_{|B} \bigr| \le n^{\delta}$, and, by definition, the shattering dimension of S is at most δ; namely, the shattering dimension is bounded by the VC dimension.
Our arch-nemesis in the following is the function x/ln x. The following lemma states some properties
of this function, and its proof is delegated to Exercise 20.8.2.
Lemma 20.2.5. For the function f (x) = x/ln x the following hold.
(A) f(x) is monotonically increasing for x ≥ e.
(B) f(x) ≥ e, for x > 1.
(C) For u ≥ √e, if f(x) ≤ u, then x ≤ 2u ln u.
(D) For u ≥ √e, if x > 2u ln u, then f(x) > u.
(E) For u ≥ e, if f(x) ≥ u, then x ≥ u ln u.
The next lemma introduces a standard argument which is useful in bounding the VC dimension of a
range space by its shattering dimension. It is easy to see that the bound is tight in the worst case.
Lemma 20.2.6. If S = (X, R) is a range space with shattering dimension d, then its VC dimension is
bounded by O(d log d).
Proof: Let N ⊆ X be the largest set shattered by S, and let δ denote its cardinality. We have that $2^{\delta} = \bigl| \mathcal{R}_{|N} \bigr| \le \pi_{\mathsf{S}}(|N|) \le c \delta^d$, where c is a fixed constant. As such, we have that δ ≤ lg c + d lg δ, which in turn implies that $\frac{\delta - \lg c}{\lg \delta} \le d$. Assuming δ ≥ max(2, 2 lg c), we have that
\[
\frac{\delta}{2 \lg \delta} \le d
\implies \frac{\delta}{\ln \delta} \le \frac{2d}{\ln 2} \le 6d
\implies \delta \le 2 (6d) \ln(6d) ,
\]
by Lemma 20.2.5(C).
Disks revisited. To see why the shattering dimension is more convenient to work with than the VC
dimension, consider the range space S = (X, R), where X = R2 and R is the set of disks in the plane. We
know that the VC dimension of S is 3 (see Section 20.1.1).
We next use a standard continuous deformation argument to argue that the shattering dimension of
this range space is also 3.
Lemma 20.2.7. Consider the range space S = (X, R), where X = R2 and R is the set of disks in the
plane. The shattering dimension of S is 3.
Proof: Consider any set P of n points in the plane, and consider the set F = R|P. We claim that |F| ≤ 4n^3.
The set F contains only n sets with a single point in them and at most $\binom{n}{2}$ sets with two points in them. So, fix Q ∈ F such that |Q| ≥ 3.
There is a disk D that realizes this subset; that is, P ∩ D = Q. For the sake of simplicity of exposition, assume that P is in general position. Shrink D till its boundary passes through a point p of P.
Now, continue shrinking the new disk D′ in such a way that its boundary passes through the point p (this can be done by moving the center of D′ towards p). Continue in this continuous deformation till the new boundary hits another point q of P. Let D″ denote this disk.
Next, we continuously deform D″ so that it has both p ∈ Q and q ∈ Q on its boundary. This can be done by moving the center of D″ along the bisector line between p and q. Stop as soon as the boundary of the disk hits a third point r ∈ P. (We have freedom in choosing in which direction to move the center. As such, move in the direction that causes the disk boundary to hit a new point r.) Let $\widehat{D}$ be the resulting disk. The boundary of $\widehat{D}$ is the unique circle passing through p, q, and r. Furthermore, observe that
\[
D \cap \bigl( P \setminus \{r\} \bigr) = \widehat{D} \cap \bigl( P \setminus \{r\} \bigr) .
\]
That is, we can specify the point set P ∩ D by specifying the three points p, q, r (and thus specifying the disk $\widehat{D}$) and the status of the three special points; that is, we specify for each point p, q, r whether or not it is inside the generated subset.
As such, there are at most $8\binom{n}{3}$ different subsets in F containing more than three points, as each such subset maps to a “canonical” disk, there are at most $\binom{n}{3}$ different such disks, and each such disk defines at most eight different subsets.
Similar argumentation implies that there are at most $4\binom{n}{2}$ subsets that are defined by a pair of points that realizes the diameter of the resulting disk. Overall, we have that
\[
| \mathcal{F} | = 1 + n + 4 \binom{n}{2} + 8 \binom{n}{3} \le 4 n^3 ,
\]
since there is one empty set in F, n sets of size 1, and the rest of the sets are counted as described above.
The proof of Lemma 20.2.7 might not seem like a great simplification over the same bound we got by
arguing about the VC dimension. However, the above argumentation gives us a very powerful tool – the
shattering dimension of a range space defined by a family of shapes is always bounded by the number
of points that determine a shape in the family.
Thus, the shattering dimension of, say, arbitrarily oriented rectangles in the plane
is bounded by (and in this case, equal to) five, since such a rectangle is uniquely
determined by five points. To see that, observe that if a rectangle has only four
points on its boundary, then there is one degree of freedom left, since we can rotate
the rectangle “around” these points; see the figure on the right.
Definition 20.2.8. The dual range space to a range space S = (X, R) is the space S* = (R, X*), where $X^{\star} = \bigl\{ \mathcal{R}_p \bigm| p \in X \bigr\}$ and $\mathcal{R}_p = \bigl\{ r \in \mathcal{R} \bigm| p \in r \bigr\}$ is the set of ranges containing the point p.
[Figure 20.1 shows three disks D1, D2, D3, points p1, p1′, p2, . . . , p6, and the corresponding incidence matrices described in the caption.]
Figure 20.1: (A) $\mathcal{R}_{p_1} = \mathcal{R}_{p_1'}$. (B) Writing the set system as an incidence matrix where a point is a column and a set is a row. For example, D2 contains p4, and as such the column of p4 has a 1 in the row corresponding to D2. (C) The dual set system is represented by a matrix which is the transpose of the original incidence matrix.
Naturally, the dual range space to S* is the original S, which is thus sometimes referred to as the primal range space. (In other words, the dual to the dual is the primal.) The easiest way to see this is to think about it as an abstract set system realized as an incidence matrix, where each point is a column and a set is a row in the matrix having 1 in an entry if and only if it contains the corresponding point; see Figure 20.1. Now, it is easy to verify that the dual range space is the transposed matrix.
To understand what the dual space is, consider X to be the plane and R to be a set of m disks. Then, in the dual range space S* = (R, X*), every point p in the plane has a set associated with it in X*, which is the set of disks of R that contains p. In particular, if we consider the arrangement formed by the m disks of R, then all the points lying inside a single face of this arrangement correspond to the same set of X*. The number of ranges in X* is bounded by the complexity of the arrangement of these disks, which is O(m^2); see Figure 20.1.
Let the dual shatter function of the range space S be $\pi^{\star}_{\mathsf{S}}(m) = \pi_{\mathsf{S}^{\star}}(m)$, where S* is the dual range space to S.
Definition 20.2.9. The dual shattering dimension of S is the shattering dimension of the dual range space S*.
Note that the dual shattering dimension might be smaller than the shattering dimension and hence
also smaller than the VC dimension of the range space. Indeed, in the case of disks in the plane, the
dual shattering dimension is just 2, while the VC dimension and the shattering dimension of this range
space is 3. Note, also, that in geometric settings bounding the dual shattering dimension is relatively
easy, as all you have to do is bound the complexity of the arrangement of m ranges of this space.
The following lemma shows a connection between the VC dimension of a space and its dual. The
interested reader® might find the proof amusing.
® The author is quite aware that the interest of the reader in this issue might not be the result of free choice. Neverthe-
less, one might draw some comfort from the realization that the existence of the interested reader is as much an illusion
as the existence of free choice. Both are convenient to assume, and both are probably false. Or maybe not.
Lemma 20.2.10. Consider a range space S = (X, R) with VC dimension δ. The dual range space S* = (R, X*) has VC dimension bounded by $2^{\delta+1}$.
Proof: Assume that S* shatters a set F = {r1, . . . , rk} ⊆ R of k ranges. Then, there is a set P ⊆ X of m = 2^k points that shatters F. Formally, for every subset V ⊆ F, there exists a point p ∈ P, such that $\mathcal{F}_p = V$.
So, consider the matrix M (of dimensions k × 2^k) having the points $p_1, \ldots, p_{2^k}$ of P as the columns, and every row is a set of F, where the entry in the matrix corresponding to a point p ∈ P and a range r ∈ F is 1 if and only if p ∈ r and zero otherwise. Since P shatters F, we know that this matrix has all possible 2^k binary vectors as columns.
Next, let $\kappa' = 2^{\lfloor \lg k \rfloor} \le k$, and consider the matrix M′ of size κ′ × lg κ′, where the ith row is the binary representation of the number i − 1 (formally, the jth entry in the ith row is 1 if the jth bit in the binary representation of i − 1 is 1), where i = 1, . . . , κ′. Clearly, the lg κ′ columns of M′ are all different, and we can find lg κ′ columns of M that are identical to the columns of M′ (in the first κ′ entries starting from the top of the columns).
Each such column corresponds to a point p ∈ P, and let Q ⊂ P be this set of lg κ′ points. Note that for any subset Z ⊆ Q, there is a row t in M′ that encodes this subset. Consider the corresponding row in M; that is, the range rt ∈ F. Since M and M′ are identical in the relevant lg κ′ columns of M (on the first κ′ rows), we have that rt ∩ Q = Z. Namely, the set of ranges F shatters Q. But since the original range space has VC dimension δ, it follows that |Q| ≤ δ. Namely, |Q| = lg κ′ = ⌊lg k⌋ ≤ δ, which implies that lg k ≤ δ + 1, which in turn implies that $k \le 2^{\delta+1}$.
Lemma 20.2.11. If a range space S = (X, R) has dual shattering dimension δ, then its VC dimension is bounded by $\delta^{O(\delta)}$.
Proof: The shattering dimension of the dual range space S* is bounded by δ, and as such, by Lemma 20.2.6, its VC dimension is bounded by δ′ = O(δ log δ). Since the dual range space to S* is S, we have by Lemma 20.2.10 that the VC dimension of S is bounded by $2^{\delta'+1} = \delta^{O(\delta)}$.
The bound of Lemma 20.2.11 might not be pretty, but it is sufficient in a lot of cases to bound the
VC dimension when the shapes involved are simple.
Example 20.2.12. Consider the range space S = (R^2, R), where R is a set of shapes in the plane, such that the boundary of any pair of them intersects at most s times. Then, the VC dimension of S is O(1). Indeed, the dual shattering dimension of S is O(1), since the complexity of the arrangement of n such shapes is O(sn^2). As such, by Lemma 20.2.11, the VC dimension of S is O(1).
Proof: As a warm-up exercise, we prove a somewhat weaker bound here of O((δ + δ′) log(δ + δ′)). The stronger bound follows from Theorem 20.2.14 below. Let B be a set of n points in X that are shattered by $\widehat{\mathsf{S}}$. There are at most Gδ(n) and Gδ′(n) different ranges of B in the range sets R|B and R′|B, respectively, by Lemma 20.2.1. Every subset C of B realized by $\widehat{r} \in \widehat{\mathcal{R}}$ is a union of two subsets B ∩ r and B ∩ r′, where r ∈ R and r′ ∈ R′, respectively. Thus, the number of different subsets of B realized by $\widehat{\mathsf{S}}$ is bounded by Gδ(n) Gδ′(n). Thus, $2^n \le n^{\delta} n^{\delta'}$, for δ, δ′ > 1. We conclude that n ≤ (δ + δ′) lg n, which implies that n = O((δ + δ′) log(δ + δ′)), by Lemma 20.2.5.
Interestingly, one can prove a considerably more general result with tighter bounds. The required
computations are somewhat more painful.
$\mathcal{R}' = \bigl\{ f(r_1, \ldots, r_k) \bigm| r_1 \in \mathcal{R}_1, \ldots, r_k \in \mathcal{R}_k \bigr\}$,
and the associated range space T = (X, R′). Then, the VC dimension of T is bounded by O(kδ lg k), where δ = maxi δi.
by Lemma 20.2.1 and Lemma 20.2.2. On the other hand, since Y is being shattered by R′, this implies that $\bigl| \mathcal{R}'_{|Y} \bigr| = 2^t$. Thus, we have the inequality $2^t \le \bigl( 2 (te/\delta)^{\delta} \bigr)^k$, which implies t ≤ k(1 + δ lg(te/δ)). Assume that t ≥ e and δ lg(te/δ) ≥ 1, since otherwise the claim is trivial, and observe that t ≤ k(1 + δ lg(te/δ)) ≤ 3kδ lg(t/δ). Setting x = t/δ, we have
\[
\frac{t}{\delta} \le 3k \frac{\ln(t/\delta)}{\ln 2} \le 6k \ln\frac{t}{\delta}
\implies \frac{x}{\ln x} \le 6k
\implies x \le 2 \cdot 6k \ln(6k)
\implies x \le 12 k \ln(6k) ,
\]
by Lemma 20.2.5(C). Namely, $t \le 12 k \delta \ln(6k) = O(k \delta \lg k)$, as claimed.
Corollary 20.2.15. Let S = (X, R) and T = (X, R′) be two range spaces of VC dimension δ and δ′, respectively, where δ, δ′ > 1. Let $\widehat{\mathcal{R}} = \bigl\{ r \cap r' \bigm| r \in \mathcal{R},\ r' \in \mathcal{R}' \bigr\}$. Then, for the range space $\widehat{\mathsf{S}} = (X, \widehat{\mathcal{R}})$, we have that $\dim_{\mathrm{VC}}\bigl( \widehat{\mathsf{S}} \bigr) = O(\delta + \delta')$.
Corollary 20.2.16. Any finite sequence of combining range spaces with finite VC dimension (by inter-
secting, complementing, or taking their union) results in a range space with a finite VC dimension.
20.3. On ε-nets and ε-sampling
20.3.1. ε-nets and ε-samples
Definition 20.3.1 (ε-sample). Let S = (X, R) be a range space, and let x be a finite subset of X. For
0 ≤ ε ≤ 1, a subset C ⊆ x is an ε-sample for x if for any range r ∈ R, we have
| m(r) − s(r)| ≤ ε,
where m(r) = |x ∩ r| / |x| is the measure of r (see Definition 20.1.2) and s(r) = |C ∩ r| / |C| is the estimate
of r (see Definition 20.1.3). (Here C might be a multi-set, and as such |C ∩ r| is counted with multiplicity.)
As such, an ε-sample is a subset of the ground set x that “captures” the range space up to an error
of ε. Specifically, to estimate the fraction of the ground set covered by a range r, it is sufficient to count
the points of C that fall inside r.
If X is a finite set, we will abuse notation slightly and refer to C as an ε-sample for S.
To see the usage of such a sample, consider x = X to be, say, the population of a country (i.e., an
element of X is a citizen). A range in R is the set of all people in the country that answer yes to a
question (i.e., would you vote for party Y?, would you buy a bridge from me?, questions like that). An
ε-sample of this range space enables us to estimate reliably (up to an error of ε) the answers for all
these questions, by just asking the people in the sample.
The natural question of course is how to find such a subset of small (or minimal) size.
Theorem 20.3.2 (ε-sample theorem, [VC71]). There is a positive constant c such that if (X, R) is any range space with VC dimension at most δ, x ⊆ X is a finite subset and ε, ϕ > 0, then a random subset C ⊆ x of cardinality
\[
s = \frac{c}{\varepsilon^2} \Bigl( \delta \log\frac{\delta}{\varepsilon} + \log\frac{1}{\varphi} \Bigr)
\]
is an ε-sample for x with probability at least 1 − ϕ.
(In the above theorem, if s > |x|, then we can just take all of x to be the ε-sample.)
A strengthened version of the above theorem, with slightly better bounds, is known [Har11].
Sometimes it is sufficient to have (hopefully smaller) samples with a weaker property – if a range is
“heavy”, then there is an element in our sample that is in this range.
Definition 20.3.3 (ε-net). A set N ⊆ x is an ε-net for x if for any range r ∈ R, if m(r) ≥ ε (i.e., |r ∩ x| ≥ ε|x|), then r contains at least one point of N (i.e., r ∩ N ≠ ∅).
Theorem 20.3.4 (ε-net theorem, [HW87]). Let (X, R) be a range space of VC dimension δ, let x be a finite subset of X, and suppose that 0 < ε ≤ 1 and ϕ < 1. Let N be a set obtained by m random independent draws from x, where
\[
m \ge \max\Bigl( \frac{4}{\varepsilon} \lg\frac{4}{\varphi},\ \frac{8\delta}{\varepsilon} \lg\frac{16}{\varepsilon} \Bigr) .
\tag{20.3}
\]
Then N is an ε-net for x with probability at least 1 − ϕ.
(We remind the reader that lg = log2 .)
The proofs of the above theorems are somewhat involved and we first turn our attention to some
applications before presenting the proofs.
Remark 20.3.5. The above two theorems also hold for spaces with shattering dimension at most δ, in which case the sample size is slightly larger. Specifically, for Theorem 20.3.4, the sample size needed is $O\Bigl( \frac{1}{\varepsilon} \lg\frac{1}{\varphi} + \frac{\delta}{\varepsilon} \lg\frac{\delta}{\varepsilon} \Bigr)$.
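The ε-net guarantee is easy to probe empirically. The following sketch (not from the notes; the parameter values are arbitrary choices) samples m points per Eq. (20.3) for intervals on the line (VC dimension 2) and checks that every interval containing at least ε|x| ground points is hit; it suffices to check the canonical intervals spanned by exactly ⌈ε|x|⌉ consecutive points.

```python
# Sketch: empirical check of the eps-net property for intervals on a finite ground set.
import bisect
import math
import random

random.seed(4)
x = [random.random() for _ in range(10_000)]          # finite ground set
eps, phi, delta = 0.1, 0.1, 2
m = math.ceil(max(4 / eps * math.log2(4 / phi),
                  8 * delta / eps * math.log2(16 / eps)))
N = sorted(random.choices(x, k=m))                    # m independent draws from x

pts = sorted(x)
heavy = math.ceil(eps * len(pts))                     # a "heavy" range has >= eps*|x| points
missed = 0
for i in range(len(pts) - heavy + 1):
    a, b = pts[i], pts[i + heavy - 1]                 # canonical interval with exactly `heavy` points
    j = bisect.bisect_left(N, a)                      # does the sample hit [a, b]?
    if j == len(N) or N[j] > b:
        missed += 1
print("heavy canonical intervals missed:", missed)    # 0, except with probability <= phi
```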
20.3.2.2. Learning a concept
Assume that we have a function f defined in the plane that returns ‘1’ inside an (unknown) disk Dunknown and ‘0’ outside it. There is some distribution D defined over the plane, and we pick points from this distribution. Furthermore, we can compute the function for these points (i.e., we can compute f for certain values, but it is expensive). For a mystery value
ε > 0, to be explained shortly, Theorem 20.3.4 tells us to pick (roughly)
O((1/ε) log(1/ε)) random points in a sample R from this distribution and
to compute the labels for the samples. This is demonstrated in the figure
on the right, where black dots are the sample points for which f (·) returned 1.
So, now we have positive examples and negative examples. We would like to find a hypothesis that agrees with all the samples we have and that hopefully is close to the true unknown disk underlying the function f. To this end, compute the smallest disk D that contains the sample points labeled by ‘1’ and does not contain any of the ‘0’ points, and let g : R^2 → {0, 1} be the function that returns ‘1’ inside the disk and ‘0’ otherwise. We claim that g classifies correctly all but an ε-fraction of the points (i.e., the probability of misclassifying a point picked according to the given distribution is smaller than ε); that is, $\Pr_{p \in \mathcal{D}}\bigl[ f(p) \neq g(p) \bigr] \le \varepsilon$.
Geometrically, the region where g and f disagree is all the points in the symmetric difference between the two disks. That is, E = D ⊕ Dunknown; see the figure on the right.
Thus, consider the range space S having the plane as the ground set and the symmetric difference between any two disks as its ranges. By Corollary 20.2.16, this range space has finite VC dimension. Now, consider the (unknown) disk Dunknown that induces f and the region r = Dunknown ⊕ D. Clearly, the learned classifier g returns incorrect answers only for points picked inside r. Thus, the probability of a mistake in the classification is the measure of r under the distribution D.
So, if $\Pr_{\mathcal{D}}[r] > \varepsilon$ (i.e., the probability that a sample point falls inside r), then by the ε-net theorem (i.e., Theorem 20.3.4) the set R is an ε-net for S (ignore for the time being the possibility that the random sample fails to be an ε-net) and as such, R contains a point q inside r. But it is not possible for g (which classifies correctly all the sampled points of R) to make a mistake on q, a contradiction, because by construction, the range r is where g misclassifies points. We conclude that $\Pr_{\mathcal{D}}[r] \le \varepsilon$, as desired.
Little lies. The careful reader might be tearing his or her hair out because of the above description.
First, Theorem 20.3.4 might fail, and the above conclusion might not hold. This is of course true, and
in real applications one might use a much larger sample to guarantee that the probability of failure is so
small that it can be practically ignored. A more serious issue is that Theorem 20.3.4 is defined only for
finite sets. Nowhere does it speak about a continuous distribution. Intuitively, one can approximate a
continuous distribution to an arbitrary precision using a huge sample and apply the theorem to this sam-
ple as our ground set. A formal proof is more tedious and requires extending the proof of Theorem 20.3.4
to continuous distributions. This is straightforward and we will ignore this topic altogether.
Here is a naive attempt at proving the ε-sample theorem. Fix a range of measure p, take a sample of size r, and let X₁, . . . , X_r be the indicator variables of the sample points falling into this range; setting µ = pr and φ = ε/p, Chernoff's inequality yields

    Pr[ | Σ_{i=1}^{r} X_i − pr | ≥ (ε/p)·pr ] = Pr[ | Σ_{i=1}^{r} X_i − µ | ≥ φµ ] ≤ exp(−µφ²/2) + exp(−µφ²/4)
        ≤ 2 exp(−µφ²/4) = 2 exp( −(ε²/(4p)) r ) ≤ ϕ,

for r ≥ (4/ε²) ln(2/ϕ) ≥ (4p/ε²) ln(2/ϕ).
Voilà! We proved the ε-sample theorem. Well, not quite. We proved that the sample works correctly for a single range. Namely, we proved that for a specific range r ∈ R, we have that Pr[ |m(r) − s(r)| > ε ] ≤ ϕ. However, we need to prove that ∀r ∈ R, Pr[ |m(r) − s(r)| > ε ] ≤ ϕ.
Now, naively, we can overcome this by using a union bound on the bad probability. Indeed, if there are k different ranges under consideration, then we can use a sample that is large enough such that the probability that it fails for each specific range is at most ϕ/k. In particular, let E_i be the bad event that the sample fails for the ith range. We have that Pr[E_i] ≤ ϕ/k, which implies that

    Pr[ sample fails for any range ] ≤ Pr[ ⋃_{i=1}^{k} E_i ] ≤ Σ_{i=1}^{k} Pr[E_i] ≤ k(ϕ/k) ≤ ϕ,
by the union bound; that is, the sample works for all ranges with good probability.
However, the number of ranges that we need to prove the theorem for is πS(|x|) (see Definition 20.2.3). In particular, if we plug in confidence ϕ/πS(|x|) to the above analysis and use the union bound, we get that for

    r ≥ (4/ε²) ln( πS(|x|) / ϕ )

the sample estimates correctly (up to ±ε) the size of all ranges with confidence ≥ 1 − ϕ. Bounding πS(|x|) by O(|x|^δ) (using Eq. (20.2)p148 for a space with VC dimension δ), we can bound the required size of r by O( δ ε^{−2} log(|x|/ϕ) ). We summarize the result.
Lemma 20.3.6. Let (x, R) be a finite range space with VC dimension at most δ, and let ε, ϕ > 0 be parameters. Then a random subset C ⊆ x of cardinality O( δ ε^{−2} log(|x|/ϕ) ) is an ε-sample for x with probability at least 1 − ϕ.
Namely, the “naive” argumentation gives us a sample bound which depends on the underlying size
of the ground set. However, the sample size in the ε-sample theorem (Theorem 20.3.2) is independent
of the size of the ground set. This is the magical property of the ε-sample theorem¯ .
Interestingly, using a chaining argument on Lemma 20.3.6, one can prove the ε-sample theorem for
the finite case; see Exercise 20.8.3. We provide a similar proof when using discrepancy, in Section 20.4.
However, the original proof uses a clever double sampling idea that is both interesting and insightful, and it makes the proof work for the infinite case as well.
The idea of the double sampling is roughly as follows. Let N be a sample of m points picked independently from x, and let E1 be the bad event that N fails to be an ε-net; that is,

    E1 = { ∃r ∈ R : |r ∩ x| ≥ εn and r ∩ N = ∅ }.

To bound Pr[E1], one draws a second sample T of the same size m and checks how well it performs on the heavy ranges.
¯ The notion of magic is used here in the sense of Arthur C. Clarke’s statement that “any sufficiently advanced technology is indistinguishable from magic.”
Indeed, if m is sufficiently large, we expect the random variable |r ∩ T| to concentrate around εm, and one can argue this formally using Chernoff's inequality. Namely, intuitively, for a heavy range r we have that

    Pr[ r ∩ N = ∅ ] ≈ Pr[ r ∩ N = ∅ and |r ∩ T| ≥ εm/2 ].

Inspired by this, let E2 be the event that N fails for some range r but T “works” for r; formally,

    E2 = { ∃r ∈ R : |r ∩ x| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 }.

Intuitively, since E[ |r ∩ T| ] ≥ εm, for the range r that N fails for we have with “good” probability that |r ∩ T| ≥ εm/2. Namely, Pr[E1] ≈ Pr[E2].
Next, let

    E2′ = { ∃r ∈ R : r ∩ N = ∅ and |r ∩ T| ≥ εm/2 }.

Clearly, E2 ⊆ E2′ and as such Pr[E2] ≤ Pr[E2′]. Now, fix Z = N ∪ T, and observe that |Z| = 2m. Next, fix a range r, and observe that the bad probability of E2′ is maximized if |r ∩ Z| = εm/2. Now, the probability that all the elements of r ∩ Z fall only into the second half of the sample is at most 2^{−εm/2}, as a careful calculation shows. Now, there are at most |R|Z| ≤ Gd(2m) different ranges that one has to consider. As such, Pr[E1] ≈ Pr[E2] ≤ Pr[E2′] ≤ Gd(2m) 2^{−εm/2}, and this is smaller than ϕ, as a careful calculation shows by just plugging the value of m into the right-hand side; see Eq. (20.3)p155.
20.4. Discrepancy
The proof of the ε-sample/net theorem is somewhat complicated. It turns out that one can get a
somewhat similar result by attacking the problem from the other direction; namely, let us assume that
we would like to take a truly large sample of a finite range space S = (X, R) defined over n elements with
m ranges. We would like this sample to be as representative as possible as far as S is concerned. In
fact, let us decide that we would like to pick exactly half of the points of X in our sample (assume that
n = |X| is even).
To this end, let us color half of the points of X by −1 (i.e., black) and the other half by 1 (i.e., white).
If for every range, r ∈ R, the number of black points inside it is equal to the number of white points,
then doubling the number of black points inside a range gives us the exact number of points inside the
range. Of course, such a perfect coloring is unachievable in almost all situations. To see this, consider
the complete graph K3 – clearly, in any coloring (by two colors) of its vertices, there must be an edge
with two endpoints having the same color (i.e., the edges are the ranges).
Formally, let χ : X → {−1, 1} be a coloring. The discrepancy of χ over a range r is the amount of imbalance in the coloring inside r. Namely,

    |χ(r)| = | Σ_{p∈r} χ(p) |.

The overall discrepancy of χ is disc(χ) = max_{r∈R} |χ(r)|. The discrepancy of a (finite) range space S = (X, R) is the discrepancy of the best possible coloring; namely,

    disc(S) = min_{χ:X→{−1,+1}} disc(χ).
The natural question is, of course, how to compute the coloring χ of minimum discrepancy. This
seems like a very challenging question, but when you do not know what to do, you might as well do
something random. So, let us pick a random coloring χ of X. To this end, let Π be an arbitrary
partition of X into pairs (i.e., a perfect matching). For a pair {p, q} ∈ Π, we will either color χ(p) = −1
and χ(q) = 1 or the other way around; namely, χ(p) = 1 and χ(q) = −1. We will decide how to color this
pair using a single coin flip. Thus, our coloring would be induced by making such a decision for every
pair of Π, and let χ be the resulting coloring. We will refer to χ as compatible with the partition Π if, for all {p, q} ∈ Π, we have that χ({p, q}) = χ(p) + χ(q) = 0.
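As a quick illustration of this random experiment (a toy sketch with a made-up range space; the function names are ours), the following Python snippet draws a random compatible coloring and evaluates its discrepancy.

```python
import random

def random_compatible_coloring(points):
    """Pair up the points arbitrarily and color each pair with one coin flip."""
    pts = list(points)
    random.shuffle(pts)
    chi = {}
    for p, q in zip(pts[0::2], pts[1::2]):    # an arbitrary perfect matching Pi
        s = random.choice((-1, 1))
        chi[p], chi[q] = s, -s                # the pair is colored by a single coin flip
    return chi

def discrepancy(chi, ranges):
    """disc(chi) = max over ranges of |sum of colors inside the range|."""
    return max(abs(sum(chi[p] for p in r)) for r in ranges)

# toy example: points 0..n-1, ranges are all intervals {i, ..., j-1}
n = 64
points = range(n)
ranges = [frozenset(range(i, j)) for i in range(n) for j in range(i + 1, n + 1)]
chi = random_compatible_coloring(points)
print("discrepancy of the random coloring:", discrepancy(chi, ranges))
```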
20.4.1. Building ε-sample via discrepancy
Let S = (X, R) be a range space with shattering dimension δ. Let P ⊆ X be a set of n points, and consider the induced range space S|P = (P, R|P); see Definition 20.1.4p146. Here, by the definition of shattering dimension, we have that m = |R|P| = O(n^δ). Without loss of generality, we assume that n is a power of 2. Consider a coloring χ of P with discrepancy bounded by Corollary 20.4.2. In particular, let Q be the points of P colored by, say, −1. We know that |Q| = n/2, and for any range r ∈ R, we have that
    |χ(r)| = | |(P \ Q) ∩ r| − |Q ∩ r| | < √( n ln(4m) ) = √( n ln O(n^δ) ) ≤ c √( n ln(n^δ) ),

for some absolute constant c. Observe that |(P \ Q) ∩ r| = |P ∩ r| − |Q ∩ r|. In particular, we have that for any range r,

    | |P ∩ r| − 2|Q ∩ r| | ≤ c √( n ln(n^δ) ).                          (20.4)

Dividing both sides by n = |P| = 2|Q|, we have that

    | |P ∩ r|/|P| − |Q ∩ r|/|Q| | ≤ τ(n),   for   τ(n) = c √( δ ln n / n ).          (20.5)
Namely, a coloring with discrepancy bounded by Corollary 20.4.2 yields a τ(n)-sample. Intuitively, if n is
very large, then Q provides a good approximation to P. However, we want an ε-sample for a prespecified
ε > 0. Conceptually, ε is a fixed constant while τ(n) is considerably smaller. Namely, Q is a sample
which is too tight for our purposes (and thus too big). As such, we will coarsen (and shrink) Q till
we get the desired ε-sample by repeated application of Corollary 20.4.2. Specifically, we can “chain”
together several approximations generated by Corollary 20.4.2. This is sometimes referred to as the
sketch property of samples. Informally, as testified by the following lemma, a sketch of a sketch is a
sketch° .
Lemma 20.4.3. Let Q ⊆ P be a ρ-sample for P (in some underlying range space S), and let R ⊆ Q be a ρ′-sample for Q. Then R is a (ρ + ρ′)-sample for P.

Proof: By definition, we have that, for every range r,

    | |r ∩ P|/|P| − |r ∩ Q|/|Q| | ≤ ρ    and    | |r ∩ Q|/|Q| − |r ∩ R|/|R| | ≤ ρ′.

By adding the two inequalities together, we get

    | |r ∩ P|/|P| − |r ∩ R|/|R| | = | ( |r ∩ P|/|P| − |r ∩ Q|/|Q| ) + ( |r ∩ Q|/|Q| − |r ∩ R|/|R| ) | ≤ ρ + ρ′.
Thus, let P₀ = P and P₁ = Q. Now, in the ith iteration, we will compute a coloring χ_{i−1} of P_{i−1} with low discrepancy, as guaranteed by Corollary 20.4.2, and let P_i be the points of P_{i−1} colored white by χ_{i−1}. Let δ_i = τ(n_{i−1}), where n_{i−1} = |P_{i−1}| = n/2^{i−1}. By Lemma 20.4.3, we have that P_k is a (Σ_{i=1}^{k} δ_i)-sample for P. Since we would like the smallest set in the sequence P₁, P₂, . . . that is still an ε-sample, we would like to find the maximal k, such that Σ_{i=1}^{k} δ_i ≤ ε. Plugging in the value of δ_i and τ(·), see Eq. (20.5), it is sufficient for our purposes that
it is sufficient for our purposes that
k k k
s s s
Õ Õ Õ δ ln(n/2i−1 ) δ ln(n/2 k−1 ) δ ln nk−1
δi = τ(ni−1 ) = c i−1
≤ c1 k−1
= c1 ≤ ε,
i=1 i=1 i=1
n/2 n/2 nk−1
° Try saying this quickly 100 times.
since the above series behaves like a geometric series, and as such its total sum is proportional to its largest element±, where c₁ is a sufficiently large constant. This holds for

    c₁ √( δ ln n_{k−1} / n_{k−1} ) ≤ ε   ⟺   c₁² δ ln n_{k−1} / n_{k−1} ≤ ε²   ⟺   c₁² δ / ε² ≤ n_{k−1} / ln n_{k−1}.

The last inequality holds for n_{k−1} ≥ (2c₁²δ/ε²) ln(2c₁²δ/ε), by Lemma 20.2.5(D). In particular, taking the largest k for which this holds results in a set P_k of size O( (δ/ε²) ln(δ/ε) ) which is an ε-sample for P.
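The construction above is easy to phrase as code. The following Python sketch is illustrative only: it assumes the range space is given explicitly as a finite list of ranges, uses a retry-until-good random compatible coloring as the halving step (standing in for Corollary 20.4.2), and keeps halving while the accumulated error of Lemma 20.4.3 stays below ε; all names and the toy example are ours.

```python
import math, random

def low_discrepancy_halving(points, ranges):
    """Return half of the points: retry random compatible colorings until every
    range has discrepancy below the sqrt(n ln(4m)) bound."""
    n, m = len(points), max(len(ranges), 2)
    bound = math.sqrt(n * math.log(4 * m))
    while True:
        pts = list(points)
        random.shuffle(pts)                       # an arbitrary perfect matching
        chi = {}
        for p, q in zip(pts[0::2], pts[1::2]):
            s = random.choice((-1, 1))            # one coin flip per pair
            chi[p], chi[q] = s, -s
        if all(abs(sum(chi.get(p, 0) for p in r)) <= bound for r in ranges):
            return [p for p in points if chi.get(p, 1) == -1]

def eps_sample_by_halving(points, ranges, eps):
    """Coarsen P by repeated halving while the accumulated error stays below eps;
    by Lemma 20.4.3 the per-halving errors simply add up."""
    P, error = list(points), 0.0
    while len(P) >= 4:
        # error added by halving a set of size |P|: sqrt(|P| ln(4m)) / |P|
        tau = math.sqrt(math.log(4 * max(len(ranges), 2)) / len(P))
        if error + tau > eps:
            break
        P = low_discrepancy_halving(P, ranges)
        error += tau
    return P

# toy example: points on a line, ranges = some prefixes; made-up parameters
pts = list(range(1024))
rngs = [frozenset(range(j)) for j in range(1, 1025, 16)]
sample = eps_sample_by_halving(pts, rngs, eps=0.2)
print(len(sample), "points in the 0.2-sample")
```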
Theorem 20.4.4 (ε-sample via discrepancy). For a range space (X, R) with shattering dimension at most δ, a finite subset B ⊆ X, and ε > 0, there exists a subset C ⊆ B, of cardinality O( (δ/ε²) ln(δ/ε) ), such that C is an ε-sample for B.
Note that it is not obvious how to turn Theorem 20.4.4 into an efficient construction algorithm
of such an ε-sample. Nevertheless, this theorem can be turned into a relatively efficient deterministic
algorithm using conditional probabilities. In particular, there is a deterministic O(n^{δ+1}) time algorithm
for computing an ε-sample for a range space of VC dimension δ and with n points in its ground set using
the above approach (see the bibliographical notes in Section 20.7 for details). Inherently, however, it is a
far cry from the simplicity of Theorem 20.3.2 that just requires us to take a random sample. Interestingly,
there are cases where using discrepancy leads to smaller ε-samples; again see bibliographical notes for
details.
Lemma 20.4.5. Consider the sets R ⊆ P and R′ ⊆ P′. Assume that P and P′ are disjoint, |P| = |P′|, and |R| = |R′|. Then, if R is an ε-sample of P and R′ is an ε-sample of P′, then R ∪ R′ is an ε-sample of P ∪ P′.

Proof: For any range r, we have

    | |r ∩ (P ∪ P′)|/|P ∪ P′| − |r ∩ (R ∪ R′)|/|R ∪ R′| |
        = | |r ∩ P|/|P ∪ P′| + |r ∩ P′|/|P ∪ P′| − |r ∩ R|/|R ∪ R′| − |r ∩ R′|/|R ∪ R′| |
        = | |r ∩ P|/(2|P|) + |r ∩ P′|/(2|P′|) − |r ∩ R|/(2|R|) − |r ∩ R′|/(2|R′|) |
        = (1/2) | ( |r ∩ P|/|P| − |r ∩ R|/|R| ) + ( |r ∩ P′|/|P′| − |r ∩ R′|/|R′| ) |
        ≤ (1/2) | |r ∩ P|/|P| − |r ∩ R|/|R| | + (1/2) | |r ∩ P′|/|P′| − |r ∩ R′|/|R′| |
        ≤ ε/2 + ε/2 = ε.
± Formally, one needs to show that the ratio between two consecutive elements in the series is larger than some constant, say 1.1. This is easy but tedious; the well-motivated reader (of little faith) might want to do this calculation.
Interestingly, by breaking the given ground sets into sets of equal size and building a balanced binary
tree over these sets, one can speed up the deterministic algorithm for building ε-samples. The idea is to
compute the sample bottom-up, where at every node we merge the samples provided by the children (i.e.,
using Lemma 20.4.5), and then we sketch the resulting set using Lemma 20.4.3. By carefully fine-tuning
this construction, one can get an algorithm for computing ε-samples in time which is near linear in n
(assuming ε and δ are small constants). We delegate the details of this construction to Exercise 20.8.6.
This algorithmic idea is quite useful and we will refer to it as sketch-and-merge.
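A minimal sketch of the sketch-and-merge idea follows (assuming some halving routine `sketch` is available, e.g. the low-discrepancy halving from the previous sketch; here a random half is used only as a placeholder so the snippet runs).

```python
import random

def sketch_and_merge(blocks, sketch):
    """Bottom-up over a balanced binary tree of blocks: merge sibling samples
    (Lemma 20.4.5) and then sketch (halve) the merged set (Lemma 20.4.3)."""
    level = [list(b) for b in blocks]
    while len(level) > 1:
        nxt = [sketch(left + right) for left, right in zip(level[0::2], level[1::2])]
        if len(level) % 2 == 1:          # an odd block is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# usage with a placeholder sketch; in the construction above one would plug in a
# halving routine based on a low-discrepancy coloring instead of random.sample
halve = lambda pts: random.sample(pts, len(pts) // 2)
blocks = [list(range(i, i + 64)) for i in range(0, 1024, 64)]
print(len(sketch_and_merge(blocks, halve)))      # a sample of 64 of the 1024 points
```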
By assumption, we have that ν₀/(c₄ 2^k) ≥ d ln n_k. This implies that

    ν_k ≤ ν₀/2^k + c₃ √( (ν₀/2^k) · d ln n_k ) ≤ (ν₀/2^k) ( 1 + c₃/√c₄ ) ≤ 2 · ν₀/2^k.
So consider a “heavy” range r that contains at least ν₀ ≥ εn points of P. To show that P_k is an ε-net, we need to show that P_k ∩ r ≠ ∅. To apply Claim 20.4.6, we need a k such that εn/2^k ≥ c₄ d ln n_{k−1}, or equivalently, such that

    2n_k / ln(2n_k) ≥ 2c₄ d / ε,

which holds for n_k = Ω( (d/ε) ln(d/ε) ), by Lemma 20.2.5(D). But then, by Claim 20.4.6, we have that

    ν_k = |P_k ∩ r| ≥ |P ∩ r| / (2·2^k) ≥ (1/2) · (εn/2^k) = (ε/2) n_k = Ω( d ln(d/ε) ) > 0.

We conclude that the set P_k, which is of size Ω( (d/ε) ln(d/ε) ), is an ε-net for P.
Theorem 20.4.7 (ε-net via discrepancy). For any range space (X, R) with shattering dimension at
most d, a finite subset B ⊆ X, and ε > 0, there exists a subset C ⊆ B, of cardinality O((d/ε) ln(d/ε)),
such that C is an ε-net for B.
20.5. Proof of the ε-net theorem

To prove Theorem 20.3.4, let N = (x₁, . . . , x_m) be the sample obtained by m independent draws from x, and let E1 be the event that N fails to be an ε-net; formally,

    E1 = { ∃r ∈ R : |r ∩ x| ≥ εn and r ∩ N = ∅ }.

(Namely, there exists a “heavy” range r that does not contain any point of N.) To complete the proof, we must show that Pr[E1] ≤ ϕ. Let T = (y₁, . . . , y_m) be another random sample generated in a similar fashion to N. Let E2 be the event that N fails but T “works”; formally,

    E2 = { ∃r ∈ R : |r ∩ x| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 }.

Intuitively, since E[ |r ∩ T| ] ≥ εm, we have that for the range r that N fails for, it follows with “good” probability that |r ∩ T| ≥ εm/2. Namely, E1 and E2 have more or less the same probability.
Claim 20.5.1. Pr[E2] ≤ Pr[E1] ≤ 2 Pr[E2].
Proof: Clearly, E2 ⊆ E1, and thus Pr[E2] ≤ Pr[E1]. As for the other part, note that by the definition of conditional probability, we have

    Pr[ E2 | E1 ] = Pr[ E2 ∩ E1 ] / Pr[E1] = Pr[E2] / Pr[E1].

It is thus enough to show that Pr[E2 | E1] ≥ 1/2.
Assume that E1 occurs. There is r ∈ R, such that |r ∩ x| > εn and r ∩ N = ∅. The required probability is at least the probability that for this specific r, we have |r ∩ T| ≥ εm/2. However, X = |r ∩ T| is a binomial variable with expectation E[X] = pm and variance V[X] = p(1 − p)m ≤ pm, where p = |r ∩ x|/n ≥ ε. As such, by Chebyshev's inequality,

    Pr[E2] / Pr[E1] ≥ Pr[ |r ∩ T| ≥ εm/2 ] = 1 − Pr[ |r ∩ T| < εm/2 ] ≥ 1/2.
Claim 20.5.1 implies that to bound the probability of E1, it is enough to bound the probability of E2. Let

    E2′ = { ∃r ∈ R : r ∩ N = ∅ and |r ∩ T| ≥ εm/2 }.
Clearly, E2 ⊆ E2′. Thus, bounding the probability of E2′ is enough to prove Theorem 20.3.4. Note,
however, that a shocking thing happened! We no longer have x participating in our event. Namely, we
turned bounding an event that depends on a global quantity (i.e., the ground set x) into bounding a
quantity that depends only on a local quantity/experiment (involving only N and T). This is the crucial
idea in this proof.
Claim 20.5.2. Pr[E2′] ≤ Gδ(2m) 2^{−εm/2}.

Proof: We imagine that we sample the elements of N ∪ T together, by picking Z = (z₁, . . . , z_{2m}) independently from x. Next, we randomly decide the m elements of Z that go into N, and the remaining elements go into T. Clearly,

    Pr[E2′] = Σ_{z∈x^{2m}} Pr[ E2′ ∩ (Z = z) ] = Σ_{z∈x^{2m}} ( Pr[ E2′ ∩ (Z = z) ] / Pr[Z = z] ) · Pr[Z = z]
            = Σ_z Pr[ E2′ | Z = z ] · Pr[Z = z] = E[ Pr[ E2′ | Z = z ] ].
Thus, from this point on, we fix the set Z, and we bound Pr[E2′ | Z]. Note that Pr[E2′] is a weighted average of Pr[E2′ | Z = z], and as such a bound on this quantity would imply the same bound on Pr[E2′]. It is now enough to consider the ranges in the projection space (Z, R|Z) (which has VC dimension δ). By Lemma 20.2.1, we have |R|Z| ≤ Gδ(2m).
Let us fix any r ∈ R|Z, and consider the event

    E_r = { r ∩ N = ∅ and |r ∩ T| > εm/2 }.
We claim that Pr[E_r] ≤ 2^{−εm/2}. Observe that if k = |r ∩ (N ∪ T)| ≤ εm/2, then the event is empty, and this claim trivially holds. Otherwise, Pr[E_r] = Pr[r ∩ N = ∅]. To bound this probability, observe that we have the 2m elements of Z, and we can choose any m of them to be N, as long as none of them is one of the k “forbidden” elements of r ∩ (N ∪ T). The probability of that is \binom{2m−k}{m} / \binom{2m}{m}. We thus have

    Pr[E_r] ≤ Pr[r ∩ N = ∅] = \binom{2m−k}{m} / \binom{2m}{m} = ( (2m−k)(2m−k−1) ··· (m−k+1) ) / ( 2m(2m−1) ··· (m+1) )
            = ( m(m−1) ··· (m−k+1) ) / ( 2m(2m−1) ··· (2m−k+1) ) ≤ 2^{−k} ≤ 2^{−εm/2}.
Thus,

    Pr[ E2′ | Z ] = Pr[ ⋃_{r∈R|Z} E_r ] ≤ Σ_{r∈R|Z} Pr[E_r] ≤ |R|Z| · 2^{−εm/2} ≤ Gδ(2m) 2^{−εm/2},

implying that Pr[E2′] ≤ Gδ(2m) 2^{−εm/2}.
Proof of Theorem 20.3.4. By Claim 20.5.1 and Claim 20.5.2, we have that Pr[E1 ] ≤ 2Gδ (2m)2−εm/2 .
It thus remains to verify that if m satisfies Eq. (20.3), then 2Gδ (2m)2−εm/2 ≤ ϕ.
Indeed, we know that 2m ≥ 8δ (by Eq. (20.3)p155) and, by Lemma 20.2.2, Gδ(2m) ≤ 2(2em/δ)^δ, for δ ≥ 1. Thus, it is sufficient to show that the inequality 4(2em/δ)^δ 2^{−εm/2} ≤ ϕ holds. By rearranging and taking lg of both sides, we have that this is equivalent to

    2^{εm/2} ≥ (4/ϕ)(2em/δ)^δ   ⟹   εm/2 ≥ δ lg(2em/δ) + lg(4/ϕ).

By our choice of m (see Eq. (20.3)), we have that εm/4 ≥ lg(4/ϕ). Thus, we need to show that

    εm/4 ≥ δ lg(2em/δ).

We verify this inequality for m = (8δ/ε) lg(16/ε) (this would also hold for bigger values, as can be easily verified). Indeed, in this case the inequality becomes

    2δ lg(16/ε) ≥ δ lg( (16e/ε) lg(16/ε) ).

This is equivalent to (16/ε)² ≥ (16e/ε) lg(16/ε), which is equivalent to 16/(eε) ≥ lg(16/ε), which is certainly true for 0 < ε ≤ 1.
This completes the proof of the theorem.
20.6. A better bound on the growth function
In this section, we prove Lemma 20.2.2p149 . Since the proof is straightforward but tedious, the reader
can safely skip reading this section.
Lemma 20.6.1. For any positive integers k ≤ n, the following hold: (i) (1 + 1/n)^n ≤ e; (ii) ( (n−1)/n )^{n−1} ≥ 1/e; (iii) n! ≥ (n/e)^n; (iv) ( n/k )^k ≤ \binom{n}{k} ≤ ( ne/k )^k.

Proof: (i) Indeed, 1 + 1/n ≤ exp(1/n), since 1 + x ≤ e^x, for x ≥ 0. As such, (1 + 1/n)^n ≤ exp(n(1/n)) = e.
(ii) Rewriting the inequality, we have that we need to prove ( (n−1)/n )^{n−1} ≥ 1/e. This is equivalent to proving e ≥ ( n/(n−1) )^{n−1} = ( 1 + 1/(n−1) )^{n−1}, which is our friend from (i).
(iii) Indeed,

    n^n / n! ≤ Σ_{i=0}^{∞} n^i / i! = e^n,

by the Taylor expansion of e^x = Σ_{i=0}^{∞} x^i / i!. This implies that (n/e)^n ≤ n!, as required.
(iv) Indeed, for any k ≤ n, we have n/k ≤ (n−1)/(k−1), as can be easily verified. As such, n/k ≤ (n−i)/(k−i), for 1 ≤ i ≤ k − 1. As such,

    ( n/k )^k ≤ (n/k) · ( (n−1)/(k−1) ) ··· ( (n−k+1)/1 ) = \binom{n}{k}.

As for the other direction, by (iii), we have

    \binom{n}{k} ≤ n^k / k! ≤ n^k / (k/e)^k = ( ne/k )^k.
Lemma 20.2.2 restated. For n ≥ 2δ and δ ≥ 1, we have ( n/δ )^δ ≤ Gδ(n) ≤ 2( ne/δ )^δ, where Gδ(n) = Σ_{i=0}^{δ} \binom{n}{i}.

Proof: Note that by Lemma 20.6.1(iv), we have

    Gδ(n) = Σ_{i=0}^{δ} \binom{n}{i} ≤ 1 + Σ_{i=1}^{δ} ( ne/i )^i.

This series behaves like a geometric series with constant larger than 2, since

    ( ne/i )^i / ( ne/(i−1) )^{i−1} = ( ne/i ) · ( (i−1)/i )^{i−1} = ( ne/i ) ( 1 − 1/i )^{i−1} ≥ ( ne/i ) · (1/e) = n/i ≥ n/δ ≥ 2,

by Lemma 20.6.1. As such, this series is bounded by twice the largest element in the series, implying the claim.
sample of size (δ/ε) ln(1/ε) is an ε-net with constant probability. For a proof that shows that in general ε-
nets cannot be much smaller in the worst case, see [PA95]. The original proof of the ε-net theorem is due
to Haussler and Welzl [HW87]. The proof of the ε-sample theorem is due to Vapnik and Chervonenkis
[VC71]. The bound in Theorem 20.3.2 can be improved to O( δ/ε² + (1/ε²) log(1/ϕ) ) [AB99].
An alternative proof of the ε-net theorem proceeds by first computing an (ε/4)-sample of sufficient size, using the ε-sample theorem (Theorem 20.3.2p155), and then computing an (ε/4)-net for this sample using a direct sample of the right size. It is easy to verify that the resulting set is an ε-net. Furthermore, using
the “naive” argument (see Section 20.3.2.3) then implies that this holds with the right probability, thus
implying the ε-net theorem (the resulting constants might be slightly worse). Exercise 20.8.3 deploys
similar ideas.
The beautiful alternative proof of both theorems via the usage of discrepancy is due to Chazelle and
Matoušek [CM96]. The discrepancy method is a beautiful topic which is quite deep mathematically, and
we have just skimmed the thin layer of melted water on top of the tip of the iceberg² . Two nice books
on the topic are the books by Chazelle [Cha01] and Matoušek [Mat99]. The book by Chazelle [Cha01]
is currently available online for free from Chazelle’s webpage.
We will revisit discrepancy since in some geometric cases it yields better results than the ε-sample
theorem. In particular, the random coloring of Theorem 20.4.1 can be derandomized using conditional
probabilities. One can then use it to get an ε-sample/net by applying it repeatedly. A faster algorithm
results from a careful implementation of the sketch-and-merge approach. The disappointing feature of
all the deterministic constructions of ε-samples/nets is that their running time is exponential in the
dimension δ, since the number of ranges is usually exponential in δ.
A similar result to the one derived by Haussler and Welzl [HW87], using a more geometric approach,
was done independently by Clarkson at the same time [Cla87], exposing the fact that VC dimension is
not necessary if we are interested only in geometric applications. This was later refined by Clarkson
[Cla88], leading to a general technique that, in geometric settings, yields stronger results than the ε-net
theorem. This technique has numerous applications in discrete and computational geometry and leads
to several “proofs from the book” in discrete geometry.
Exercise 20.8.5 is from Anthony and Bartlett [AB99].
Theorem 20.7.1 ([LLS01]). Let (X, R) be a range space with shattering dimension d, where |X| = n, and let 0 < ε < 1 and 0 < p < 1 be given parameters. Then, consider a random sample A ⊆ X of size

    ( c / (ε²p) ) ( d log(1/p) + log(1/ϕ) ),

where c is a constant. Then, it holds that for each range r ∈ R of at least pn points, we have

    | |r ∩ A|/|A| − |r ∩ X|/|X| | ≤ ε · |r ∩ X|/|X|.

In other words, A is a (p, ε)-sample for (X, R). The probability of success is ≥ 1 − ϕ.
20.8. Exercises
Exercise 20.8.1 (Compute clustering radius). Let C and P be two given sets of points in the plane, such
that k = |C| and n = |P|. Let r = max p∈P minc∈C kc − pk be the covering radius of P by C (i.e., if we
place a disk of radius r centered at each point of C, all those disks cover the points of P).
(A) Give an O(n + k log n) expected time algorithm that outputs a number α, such that r ≤ α ≤ 10r.
(B) For ε > 0 a prescribed parameter, give an O(n + kε^{−2} log n) expected time algorithm that outputs
a number α, such that r ≤ α ≤ (1 + ε)r.
Exercise 20.8.3 (A direct proof of the ε-sample theorem). For the case that the given range space is finite,
one can prove the ε-sample theorem (Theorem 20.3.2p155 ) directly. So, we are given a range space
S = (x, R) with VC dimension δ, where x is a finite set.
(A) Show that there exists an ε-sample of S of size O( δ ε^{−2} log( (log|x|)/ε ) ), by extracting an ε/3-sample from an ε/9-sample of the original space (i.e., apply Lemma 20.3.6 twice and use Lemma 20.4.3).
(B) Show that for any k, there exists an ε-sample of S of size O( δ ε^{−2} log( (log^{(k)}|x|)/ε ) ).
(C) Show that there exists an ε-sample of S of size O( δ ε^{−2} log(1/ε) ).
Exercise 20.8.4 (Sauer’s lemma is tight). Show that Sauer’s lemma (Lemma 20.2.1) is tight. Specifically,
provide a finite range space that has the number of ranges as claimed by Lemma 20.2.1.
Exercise 20.8.5 (Flip and flop). (A) Let b₁, . . . , b_{2m} be 2m binary bits. Let Ψ be the set of all permutations of 1, . . . , 2m, such that for any σ ∈ Ψ, we have σ(i) = i or σ(i) = m + i, for 1 ≤ i ≤ m, and similarly, σ(m + i) = i or σ(m + i) = m + i. Namely, σ ∈ Ψ either leaves the pair i, i + m in their positions or it exchanges them, for 1 ≤ i ≤ m. As such |Ψ| = 2^m.
Prove that for a random σ ∈ Ψ, we have

    Pr[ | (1/m) Σ_{i=1}^{m} b_{σ(i)} − (1/m) Σ_{i=1}^{m} b_{σ(i+m)} | ≥ ε ] ≤ 2e^{−ε²m/2}.

(B) Let Ψ′ be the set of all permutations of 1, . . . , 2m. Prove that for a random σ ∈ Ψ′, we have

    Pr[ | (1/m) Σ_{i=1}^{m} b_{σ(i)} − (1/m) Σ_{i=1}^{m} b_{σ(i+m)} | ≥ ε ] ≤ 2e^{−Cε²m/2},

where C is an appropriate constant. [Use (A), but be careful.]
(C) Prove Theorem 20.3.2 using (B).
Exercise 20.8.6 (Sketch and merge). Assume that you are given a deterministic algorithm that can com-
pute the discrepancy of Theorem 20.4.1 in O(nm) time, where n is the size of the ground set and m is
the number of induced ranges. We are assuming that the VC dimension δ of the given range space is
small and that the algorithm input is only the ground set X (i.e., the algorithm can figure out on its
own what the relevant ranges are).
(A) For a prespecified ε > 0, using the ideas described in Section 20.4.1.1, show how to compute a small ε-sample of X quickly. The running time of your algorithm should be (roughly) O( (n/ε^{O(δ)}) polylog ). What is the exact bound on the running time of your algorithm?
(B) One can slightly improve the running time of the above algorithm by more aggressively sketching the sets used. That is, one can add additional sketch layers in the tree. Show how, by using such an approach, one can improve the running time of the above algorithm by a logarithmic factor.
Exercise 20.8.7 (Building relative approximations). Prove the following theorem using discrepancy.
Theorem 20.8.8. Let (X, R) be a range space with shattering dimension δ, where |X| = n, and let 0 < ε < 1 and 0 < p < 1 be given parameters. Then one can construct a set N ⊆ X of size O( (δ/(ε²p)) ln( δ/(εp) ) ), such that, for each range r ∈ R of at least pn points, we have

    | |r ∩ N|/|N| − |r ∩ X|/|X| | ≤ ε · |r ∩ X|/|X|.
Chapter 21
We will be interested in computing the arrangement A S and a representation of it that makes it
easy to manipulate. In particular, we would like to be able to quickly resolve questions of the type (i)
are two points in the same face?, (ii) can one traverse from one point to the other without crossing
any segment?, etc. The naive representation of each face as polygons (potentially with holes) is not
conducive to carrying out such tasks, since a polygon might be arbitrarily complicated. Instead, we will
prefer to break the arrangement into smaller canonical tiles.
To this end, a vertical trapezoid is a quadrangle with two vertical sides. The breaking of the faces
into such trapezoids is the vertical decomposition of the arrangement A S .
This technique of building the arrangement by inserting the segments one by one is called random-
ized incremental construction.
Who needs these pesky arrangements anyway? The reader might wonder who needs arrangements? As a concrete example, consider a situation where you are given several maps of a city containing different layers of information (i.e., streets map, sewer map, electric lines map, train lines map, etc).
We would like to compute the overlay map formed by putting all these maps on top of each other. For
example, we might be interested in figuring out if there are any buildings lying on a planned train line,
etc.
More generally, think about a set of general constraints in Rd . Each constraint is bounded by a
surface, or a patch of a surface. The decomposition of Rd formed by the arrangement of these surfaces
gives us a description of the parametric space in a way that is algorithmically useful. For example, finding
if there is a point inside all the constraints, when all the constraints are induced by linear inequalities,
is linear programming. Namely, arrangements are a useful way to think about any parametric space
partitioned by various constraints.
21.1.1. Randomized incremental construction (RIC)
Imagine that we had computed the arrangement Bi−1 = A| Si−1 . In the ith iteration we compute Bi
by inserting si into the arrangement Bi−1 . This involves splitting some trapezoids (and merging some
others).
As a concrete example, consider the figure on the right. Here we insert s into the arrangement. To this end we split the “vertical trapezoids” △pqt and △bqt, each into three trapezoids. The two trapezoids σ′ and σ′′ now need to be merged together to form the new trapezoid which appears in the vertical decomposition of the new arrangement. (Note that the figure does not show all the trapezoids in the vertical decomposition.)
To facilitate this, we need to compute the trapezoids of Bi−1 that intersect si . This is done by
maintaining a conflict graph. Each trapezoid σ ∈ A| Si−1 maintains a conflict list cl(σ) of all the
segments of S that intersect its interior. In particular, the conflict list of σ cannot contain any segment
of Si−1, and as such it contains only the segments of S \ Si−1 that intersect its interior. We also maintain a similar structure for each segment, listing all the trapezoids of A|Si−1 that it currently intersects (in its interior). We maintain those lists with cross pointers, so that given an entry (σ, s) in the conflict list of σ, we can find the entry (s, σ) in the conflict list of s in constant time.
Thus, given si , we know what trapezoids need to be split (i.e., all the trapezoids in
cl(si )). Splitting a trapezoid σ by a segment si is the operation of computing a set of (at
most) four trapezoids that cover σ and have si on their boundary. We compute those new
trapezoids, and next we need to compute the conflict lists of the new trapezoids. This
can be easily done by taking the conflict list of a trapezoid σ ∈ cl(si) and distributing its segments among the O(1) new trapezoids that cover σ. Using a careful implementation, this requires linear time in the size of the conflict list of σ.
Note that only trapezoids that intersect si in their interior get split. Also, we need to update the
conflict lists for the segments (that were not inserted yet).
We next sketch the low-level details involved in maintaining these conflict lists. For a segment s that
intersects the interior of a trapezoid σ, we maintain the pair (s, σ). For every trapezoid σ, in the current
vertical decomposition, we maintain a doubly linked list of all such pairs that contain σ. Similarly, for
each segment s we maintain the doubly linked list of all such pairs that contain s. Finally, each such
pair contains two pointers to the location in the two respective lists where the pair is being stored.
It is now straightforward to verify that using this data-structure we can implement the required
operations in linear time in the size of the relevant conflict lists.
In the above description, we ignored the need to merge adjacent trapezoids if they have identical
floor and ceiling – this can be done by a somewhat straightforward and tedious implementation of the
vertical decomposition data-structure, by providing pointers between adjacent vertical trapezoids and
maintaining the conflict list sorted (or by using hashing) so that merge operations can be done quickly.
In any case, this can be done in linear time in the input/output size involved, as can be verified.
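As a concrete (hypothetical) illustration of this bookkeeping, here is a minimal Python sketch; the names and structure are ours, not the book's. Each conflict is a pair object stored both in its segment's list and in its trapezoid's list, with back-references playing the role of the cross pointers, so creating a conflict, destroying a trapezoid, and redistributing a conflict list upon a split take time linear in the lists touched.

```python
class Conflict:
    """A pair (segment, trapezoid): the segment crosses the trapezoid's interior."""
    def __init__(self, seg, trap):
        self.seg, self.trap = seg, trap

class Node:
    """A segment or a trapezoid; `conflicts` plays the role of its doubly linked list
    (a Python set gives the same O(1) insert/delete for this sketch)."""
    def __init__(self, name):
        self.name = name
        self.conflicts = set()

def add_conflict(seg, trap):
    c = Conflict(seg, trap)
    seg.conflicts.add(c)        # entry in the segment's list
    trap.conflicts.add(c)       # entry in the trapezoid's list (the "cross pointer")
    return c

def remove_trapezoid(trap):
    """Destroy a trapezoid: detach each of its conflicts from the segment's list in O(1)."""
    for c in list(trap.conflicts):
        c.seg.conflicts.discard(c)
    trap.conflicts.clear()

def split_trapezoid(trap, new_traps, crosses):
    """Split `trap` by a newly inserted segment into `new_traps`, and redistribute its
    conflict list using a predicate crosses(segment, trapezoid)."""
    old_segments = [c.seg for c in trap.conflicts]
    remove_trapezoid(trap)
    for t in new_traps:
        for seg in old_segments:
            if crosses(seg, t):
                add_conflict(seg, t)
```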
21.1.1.1. Analysis
Claim 21.1.1. The (amortized) running time of constructing Bi from Bi−1 is proportional to the size
of the conflict lists of the vertical trapezoids in Bi \ Bi−1 (and the number of such new trapezoids).
Proof: Observe that we can charge all the work involved in the ith iteration to either the conflict lists of
the newly created trapezoids or the deleted conflict lists. Clearly, the running time of the algorithm in
the ith iteration is linear in the total size of these conflict lists. Observe that every conflict gets charged
twice – when it is being created and when it is being deleted. As such, the (amortized) running time in
the ith iteration is proportional to the total length of the newly created conflict lists.
Thus, to bound the running time of the algorithm, it is enough to bound the expected size of the destroyed conflict lists in the ith iteration (and sum this bound over the n iterations carried out by the algorithm). Alternatively, one can bound the expected size of the conflict lists created in the ith iteration.
Lemma 21.1.2. Let S be a set of n segments (in general position¬) with k intersection points. Let Si be the first i segments in a random permutation of S. The expected size of Bi = A|Si, denoted by τ(i) (i.e., the number of trapezoids in Bi), is O( i + k (i/n)² ).
Proof: Consider an intersection point p = s ∩ s′, where s, s′ ∈ S. The probability that p is present in A|Si is equivalent to the probability that both s and s′ are in Si. This probability is

    α = \binom{n−2}{i−2} / \binom{n}{i} = ( (n−2)! / ((i−2)! (n−i)!) ) · ( i! (n−i)! / n! ) = i(i−1) / (n(n−1)).

Thus, the expected number of intersection points of A(S) that appear in A(Si) is Σ_{p∈V} α = α k, where V is the set of k intersection points of A(S). Also, every segment of Si contributes its two endpoints to the arrangement A(Si). Thus, we have that the expected number of vertices in A(Si) is

    2i + ( i(i−1) / (n(n−1)) ) k.
Now, the number of trapezoids in A| Si is proportional to the number of vertices of A(Si ), which implies
the claim.
for this pain, but it is a minor trifle, not to be mentioned, when compared to the other offenses in this book.
So, imagine that the overall size of the conflict lists of the trapezoids of Bi is Wi and the total size
of the conflict lists created only in the ith iteration is Ci .
We are interested in bounding the expected size of Ci , since this is (essentially) the amount of work
done by the algorithm in this iteration. Observe that the structure of Bi is defined independently of
the permutation Si and depends only on the (unordered) set Si = {s1, . . . , si }. So, fix Si . What is the
probability that si is a specific segment s of Si ? Clearly, this is 1/i since this is the probability of s being
the last element in a permutation of the i elements of Si (i.e., we consider a random permutation of Si ).
Now, consider a trapezoid σ ∈ Bi . If σ was created in the ith iteration, then si must be one of
the (at most four) segments that define it. Indeed, if si is not one of the segments that define σ, then
σ existed in the vertical decomposition before si was inserted. Since Bi is independent of the internal
ordering of Si , it follows that Pr[σ ∈ (Bi \ Bi−1 )] ≤ 4/i. In particular, the overall size of the conflict lists
in the end of the ith iteration is Õ
Wi = | cl(σ)|.
σ∈Bi
As such, the expected overall size of the conflict lists created in the ith iteration is

    E[ C_i | B_i ] ≤ Σ_{σ∈B_i} (4/i) |cl(σ)| ≤ (4/i) W_i.
By Lemma 21.1.2, the expected size of Bi is O( i + k i²/n² ). Let us guess (for the time being) that on average the size of the conflict list of a trapezoid of Bi is about O(n/i). In particular, assume that we know that

    E[ W_i ] = O( ( i + (i²/n²) k ) · (n/i) ) = O( n + k (i/n) ),
by Lemma 21.1.2, implying

    E[ C_i ] = E[ E[ C_i | B_i ] ] ≤ E[ (4/i) W_i ] = (4/i) E[ W_i ] = O( (1/i)( n + k (i/n) ) ) = O( n/i + k/n ),        (21.1)
using Lemma 8.1.2p73. In particular, the expected (amortized) amount of work in the ith iteration is proportional to E[C_i]. Thus, the overall expected running time of the algorithm is

    E[ Σ_{i=1}^{n} C_i ] = Σ_{i=1}^{n} O( n/i + k/n ) = O( n log n + k ).
Theorem 21.1.3. Given a set S of n segments in the plane with k intersections, one can compute the
vertical decomposition of A S in expected O(n log n + k) time.
Intuition and discussion. What remains to be seen is how we came up with the guess that the
average size of a conflict list of a trapezoid of Bi is about O(n/i). Note that using ε-nets implies that
the bound O((n/i) log i) holds with constant probability (see Theorem 20.3.4p155 ) for all trapezoids in
this arrangement. As such, this result is only slightly surprising. To prove this, we present in the next
section a “strengthening” of ε-nets to geometric settings.
To get some intuition on how we came up with this guess, consider a set P of n points on the line and a random sample R of i points from P. Let Î be the partition of the real line into (maximal) open intervals by the endpoints of R, such that these intervals do not contain points of R in their interior. Consider an interval (i.e., a one-dimensional trapezoid) of Î. It is intuitively clear that this interval (in expectation) would contain O(n/i) points. Indeed, fix a point x on the real line, and imagine that we pick each point with probability i/n to be in the random sample. The random variable which is the number of points of P we have to scan starting from x and going to the right of x till we “hit” a point that is in the random sample behaves like a geometric variable with probability i/n, and as such its expected value is n/i. The same argument works if we scan P to the left of x. We conclude that the number of points of P in the interval of Î that contains x but does not contain any point of R is O(n/i) in expectation.
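A tiny simulation (illustrative only, with made-up parameters) matches this intuition: the gap of the random sample that contains a fixed query point holds, on average, a number of points of P proportional to n/i (roughly 2n/i, since the scanning argument contributes about n/i on each side).

```python
import random

def gap_size(P, R, x):
    """Points of P (not in R) lying in the open interval induced by R that contains x."""
    Rs = set(R)
    left = max((p for p in Rs if p <= x), default=float("-inf"))
    right = min((p for p in Rs if p > x), default=float("inf"))
    return sum(1 for p in P if left < p < right and p not in Rs)

n, i, trials, x = 10_000, 100, 200, 0.5
P = [random.random() for _ in range(n)]
avg = sum(gap_size(P, random.sample(P, i), x) for _ in range(trials)) / trials
print(f"average gap size = {avg:.1f}; scanning left and right predicts about 2n/i = {2 * n // i}")
```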
Of course, the vertical decomposition case is more involved, as each vertical trapezoid is defined
by four input segments. Furthermore, the number of possible vertical trapezoids is larger. Instead of
proving the required result for this special case, we will prove a more general result which can be applied
in a lot of other settings.
Axioms. Let S, F (R), D(σ), and K(σ) be such that for any subset R ⊆ S, the set F (R) satisfies the
following axioms:
(i) For any σ ∈ F (R), we have D(σ) ⊆ R and R ∩ K(σ) = ∅.
(ii) If D(σ) ⊆ R and K(σ) ∩ R = ∅, then σ ∈ F (R).
21.2.1.1. Examples of the general framework
21.2.2. Analysis
In the following, S is a set of n objects complying with axioms (i) and (ii).
The challenge. What makes the analysis not easy is that there are dependencies between the defining
set of a region and its stopping set (i.e., conflict list). In particular, we have the following difficulties
(A) The defining set might be of different sizes depending on the region σ being considered.
(B) Even if all the regions have a defining set of the same size d (say, 4 as in the case of vertical
trapezoids), it is not true that every d objects define a valid region. For example, for the case
of segments, the four segments might be vertically separated from each other (i.e., think about
them as being four disjoint intervals on the real line), and they do not define a vertical trapezoid
together. Thus, our analysis is going to be a bit loopy loop – we are going to assume we know
how many regions exists (in expectation) for a random sample of certain size, and use this to
derive the desired bounds.
21.2.2.2. On exponential decay
For any natural number r and a number t > 0, consider R to be a random sample of size r from S without repetition. We will refer to a region σ ∈ F(R) as being t-heavy if ω(σ) ≥ t·(n/r). Let F_{≥t}(R) denote all the t-heavy regions of F(R).®
Intuitively, and somewhat incorrectly, we expect the average weight of a region of F(R) to be roughly n/r. We thus expect the size of this set to drop fast as t increases. Indeed, Lemma 21.2.1 tells us that a trapezoid of weight t(n/r) has probability

    ρ_{r,n}( d, t·(n/r) ) ≈ (1 − r/n)^{t(n/r)} (r/n)^d ≈ exp(−t) · (r/n)^d ≈ exp(−t + 1) · (1 − r/n)^{n/r} (r/n)^d
                          ≈ exp(−t + 1) · ρ_{r,n}(d, n/r)

to be created, since (1 − r/n)^{n/r} ≈ 1/e. Namely, a t-heavy region has exponentially lower probability to be created than a region of weight n/r. We next formalize this argument.
Lemma 21.2.3. Let r ≤ n and let t be parameters, such that 1 ≤ t ≤ r/d. Furthermore, let R be a sample of size r, and let R′ be a sample of size r′ = ⌊r/t⌋, both from S. Let σ ∈ T be a region with weight ω(σ) ≥ t(n/r). Then, Pr[σ ∈ F(R)] = O( t^d exp(−t/2) Pr[σ ∈ F(R′)] ).

Proof: For the sake of simplicity of exposition, assume that k = ω(σ) = t(n/r). By Lemma 21.2.1 (i.e., Eq. (21.2)) we have

    Pr[σ ∈ F(R)] / Pr[σ ∈ F(R′)] = ρ_{r,n}(d, k) / ρ_{r′,n}(d, k)
        ≤ ( 2^{2d} (1 − (1/2)(r/n))^k (r/n)^d ) / ( (1/2^{2d}) (1 − 4r′/n)^k (r′/n)^d )
        ≤ 2^{4d} exp( −kr/(2n) ) (1 + 8r′/n)^k (r/r′)^d ≤ 2^{4d} exp( 8kr′/n − kr/(2n) ) (r/r′)^d
        = 2^{4d} exp( 8t⌊r/t⌋/r − t/2 ) ( r/⌊r/t⌋ )^d = O( exp(−t/2) t^d ),
since 1/(1 − x) ≤ 1 + 2x for x ≤ 1/2 and 1 + y ≤ exp(y), for all y. (The constant in the above O(·) depends
exponentially on d.)
Let
Ef (r) = E[|F (R)|] and Ef≥t (r) = E[|F≥t (R)|] ,
where the expectation is over random subsets R ⊆ S of size r. Note that Ef (r) = Ef≥0 (r) is the expected
number of regions created by a random sample of size r. In words, Ef≥t (r) is the expected number of
regions in a structure created by a sample of r random objects, such that these regions have weight
which is t times larger than the “expected” weight (i.e., n/r). In the following, we assume that Ef (r) is
a monotone increasing function.
Lemma 21.2.4 (The exponential decay lemma). Given a set S of n objects and parameters r ≤ n and 1 ≤ t ≤ r/d, where d = max_{σ∈T(S)} |D(σ)|, if axioms (i) and (ii) above hold for any subset of S, then

    Ef_{≥t}(r) = O( t^d exp(−t/2) Ef(r) ).                          (21.3)
® These are the regions that are at least t times overweight. Speak about an obesity problem.
Proof: Let R be a random sample of size r from S and let R′ be a random sample of size r′ = ⌊r/t⌋ from S. Let H = ⋃_{X⊆S, |X|=r} F_{≥t}(X) denote the set of all t-heavy regions that might be created by a sample of size r. In the following, the expectation is taken over the content of the random samples R and R′.
For a region σ, let X_σ be the indicator variable that is 1 if and only if σ ∈ F(R). By linearity of expectation and since E[X_σ] = Pr[σ ∈ F(R)], we have

    Ef_{≥t}(r) = E[ |F_{≥t}(R)| ] = E[ Σ_{σ∈H} X_σ ] = Σ_{σ∈H} E[X_σ] = Σ_{σ∈H} Pr[σ ∈ F(R)]
               = Σ_{σ∈H} O( t^d exp(−t/2) Pr[σ ∈ F(R′)] ) = O( t^d exp(−t/2) ) Σ_{σ∈T} Pr[σ ∈ F(R′)]
               = O( t^d exp(−t/2) Ef(r′) ) = O( t^d exp(−t/2) Ef(r) ),

by Lemma 21.2.3 and since Ef(r) is a monotone increasing function.
21.3. Applications
21.3.1. Analyzing the RIC algorithm for vertical decomposition
We remind the reader that the input of the algorithm of Section 21.1.2 is a set S of n segments with k
intersections, and it uses randomized incremental construction to compute the vertical decomposition
of the arrangement A S .
Lemma 21.1.2 shows that the number of vertical trapezoids in the randomized incremental construction is in expectation Ef(i) = O( i + k (i/n)² ). Thus, by Theorem 21.2.5 (used with c = 1), we have that the total expected size of the conflict lists of the vertical decomposition computed in the ith step is

    E[ W_i ] = E[ Σ_{σ∈B_i} ω(σ) ] = O( (n/i) Ef(i) ) = O( n + k (i/n) ).
This is the missing piece in the analysis of Section 21.1.2. Indeed, the amortized work in the ith step
of the algorithm is O(Wi /i) (see Eq. (21.1)p175 ), and as such, the expected running time of this algorithm
is
    E[ Σ_{i=1}^{n} O(W_i/i) ] = O( Σ_{i=1}^{n} (1/i)( n + k (i/n) ) ) = O( n log n + k ).
21.3.2. Cuttings
Let S be a set of n lines in the plane, and let r be an arbitrary parameter. A (1/r)-cutting of S is a
partition of the plane into constant complexity regions such that each region intersects at most n/r lines
of S. It is natural to try to minimize the number of regions in the cutting, as cuttings are a natural tool
for performing “divide and conquer”.
Consider the range space having S as its ground set and vertical trapezoids as its ranges (i.e., given
a vertical trapezoid σ, its corresponding range is the set of all lines of S that intersect the interior of
σ). This range space has a VC dimension which is a constant as can be easily verified. Let X ⊆ S be
an ε-net for this range space, for ε = 1/r. By Theorem 20.3.4p155 (ε-net theorem), there exists such an
ε-net X of this range space, of size O((1/ε) log(1/ε)) = O(r log r). In fact, Theorem 20.3.4p155 states that
an appropriate random sample is an ε-net with non-zero probability, which implies, by the probabilistic
method, that such a net (of this size) exists.
Lemma 21.3.1. There exists a (1/r)-cutting of a set of lines S in the plane of size O( (r log r)² ).
Proof: Consider the vertical decomposition A| X , where X is as above. We claim that this collection
of trapezoids is the desired cutting.
The bound on the size is immediate, as the complexity of A|X is O(|X|²) and |X| = O(r log r).
As for correctness, consider a vertical trapezoid σ in the arrangement A| X . It does not intersect
any of the lines of X in its interior, since it is a trapezoid in the vertical decomposition A| X . Now, if
σ intersected more than n/r lines of S in its interior, where n = |S|, then it must be that the interior of
σ intersects one of the lines of X, since X is an ε-net for S, a contradiction.
It follows that σ intersects at most εn = n/r lines of S in its interior.
Claim 21.3.2. Any (1/r)-cutting in the plane of n lines contains at least Ω(r²) regions.

Proof: An arrangement of n lines (in general position) has M = \binom{n}{2} intersections. However, the number of intersections of the lines intersecting a single region in the cutting is at most m = \binom{n/r}{2}. This implies that any cutting must be of size at least M/m = Ω( n²/(n/r)² ) = Ω(r²).
We can get cuttings of size matching the above lower bound using the moments technique.
Theorem 21.3.3. Let S be a set of n lines in the plane, and let r be a parameter. One can compute a
(1/r)-cutting of S of size O(r²).

Proof: Let R ⊆ S be a random sample of size r, and consider its vertical decomposition A|R. If a vertical trapezoid σ ∈ A|R intersects at most n/r lines of S, then we can add it to the output cutting. The other possibility is that σ intersects t(n/r) lines of S, for some t > 1; let cl(σ) ⊂ S be the conflict list of σ (i.e., the list of lines of S that intersect the interior of σ). Clearly, a (1/t)-cutting for the set cl(σ) forms a vertical decomposition (clipped inside σ) such that each trapezoid in this cutting intersects at most n/r lines of S. Thus, we compute such a cutting inside each such “heavy” trapezoid using the algorithm (implicit in the proof) of Lemma 21.3.1, and add these subtrapezoids to the resulting cutting. Clearly, the size of the resulting cutting inside σ is O((t log t)²) = O(t⁴). The resulting two-level partition is clearly the required cutting. By Theorem 21.2.5, the expected size of the cutting is
    O( Ef(r) + E[ Σ_{σ∈F(R)} ( ω(σ)/(n/r) )⁴ ] ) = O( Ef(r) + (r/n)⁴ E[ Σ_{σ∈F(R)} (ω(σ))⁴ ] )
        = O( Ef(r) + (r/n)⁴ · (n/r)⁴ Ef(r) ) = O( Ef(r) ) = O( r² ),

since Ef(r) = O(r²) here (an arrangement of r lines has at most \binom{r}{2} vertices).
Proof: So, consider a region σ with d defining objects in D(σ) and k detractors in K(σ). We have to
pick the d defining objects of D(σ) to be in the random sample R of size r but avoid picking any of the
k objects of K(σ) to be in R.
The second part follows since \binom{n}{r} = \binom{n}{r−d} \binom{n−(r−d)}{d} / \binom{r}{d}. Indeed, for the right-hand side, first pick a sample of size r − d and then a sample of size d from the remaining objects. Merging the two random samples, we get a random sample of size r. However, since we do not care if an object is in the first sample or the second sample, we observe that every such random sample is being counted \binom{r}{d} times.
The third part is easier, as it follows from \binom{n}{r−d} \binom{n−(r−d)}{d} = \binom{n}{d} \binom{n−d}{r−d}. The two sides count the different ways to pick two subsets from a set of size n, the first one of size d and the second one of size r − d.
Lemma 21.4.2. For M ≥ m ≥ t ≥ 0, we have ( (m − t)/(M − t) )^t ≤ \binom{m}{t} / \binom{M}{t} ≤ ( m/M )^t.

Proof: We have that

    α = \binom{m}{t} / \binom{M}{t} = ( m! / ((m−t)! t!) ) · ( (M−t)! t! / M! ) = (m/M) · ( (m−1)/(M−1) ) ··· ( (m−t+1)/(M−t+1) ).

Now, since M ≥ m, we have that (m−i)/(M−i) ≤ m/M, for all i ≥ 0. As such, the maximum (resp. minimum) fraction on the right-hand side is m/M (resp. (m−t+1)/(M−t+1)). As such, we have ( (m−t)/(M−t) )^t ≤ ( (m−t+1)/(M−t+1) )^t ≤ α ≤ ( m/M )^t.
Lemma 21.4.3. Let 0 ≤ X, Y ≤ N. We have that ( 1 − X/N )^Y ≤ ( 1 − Y/(2N) )^X.
by Lemma 21.4.3 (setting N = n/4, X = r, and Y = d + k) and since r ≥ 2d and 4r/n ≤ 1/2.
21.5. Bibliographical notes
The technique described in this chapter is generally attributed to the work by Clarkson and Shor [CS89],
which is historically inaccurate as the technique was developed by Clarkson [Cla88]. Instead of mildly
confusing the matter by referring to it as the Clarkson technique, we decided to make sure to really
confuse the reader and refer to it as the moments technique. The Clarkson technique [Cla88] is in
fact more general and implies a connection between the number of “heavy” regions and “light” regions.
The general framework can be traced back to the earlier paper [Cla87]. This implies several beautiful
results, some of which we cover later in the book.
For the full details of the algorithm of Section 21.1, the interested reader is referred to the books
[dBCKO08, BY98]. Interestingly, in some cases the merging stage can be skipped; see [Har00a].
Agarwal et al. [AMS98] presented a slightly stronger variant than the original version of Clarkson
[Cla88] that allows a region to disappear even if none of the members of its stopping set are in the
random sample. This stronger setting is used in computing the vertical decomposition of a single face
in an arrangement (instead of the whole arrangement). Here an insertion of a faraway segment of the
random sample might cut off a portion of the face of interest. In particular, in the settings of Agarwal
et al. Axiom (ii) is replaced by the following:
Interestingly, Clarkson [Cla88] did not prove Theorem 21.2.5 using the exponential decay lemma but
gave a direct proof. In fact, his proof implicitly contains the exponential decay lemma. We chose the
current exposition since it is more modular and provides a better intuition of what is really going on
and is hopefully slightly simpler. In particular, Lemma 21.2.1 is inspired by the work of Sharir [Sha03].
The exponential decay lemma (Lemma 21.2.4) was proved by Chazelle and Friedman [CF90]. The
work of Agarwal et al. [AMS98] is a further extension of this result. Another analysis was provided by
Clarkson et al. [CMS93].
Another way to reach similar results is using the technique of Mulmuley [Mul94], which relies on
a direct analysis on ‘stoppers’ and ‘triggers’. This technique is somewhat less convenient to use but is
applicable to some settings where the moments technique does not apply directly. Also, his concept of
the omega function might explain why randomized incremental algorithms perform better in practice
than their worst-case analysis predicts [Mul89].
Backwards analysis in geometric settings was first used by Chew [Che86] and was formalized by
Seidel [Sei93]. It is similar to the “leave one out” argument used in statistics for cross validation. The
basic idea was probably known to the Greeks (or Russians or French) at some point in time.
(Naturally, our summary of the development is cursory at best and not necessarily accurate, and all
possible disclaimers apply. A good summary is provided in the introduction of [Sei93].)
Sampling model. As a rule of thumb all the different sampling approaches are similar and yield similar
results. For example, we used such an alternative sampling approach in the “proof” of Lemma 21.2.1.
It is a good idea to use whichever sampling scheme is the easiest to analyze in figuring out what’s going
on. Of course, a formal proof requires analyzing the algorithm in the sampling model it uses.
Lazy randomized incremental construction. If one wants to compute a single face that contains a
marking point in an arrangement of curves, then the problem in using randomized incremental construc-
tion is that as you add curves, the region of interest shrinks, and regions that were maintained should be
ignored. One option is to perform flooding in the vertical decomposition to figure out what trapezoids
are still reachable from the marking point and maintaining only these trapezoids in the conflict graph.
Doing it in each iteration is way too expensive, but luckily one can use a lazy strategy that performs this
cleanup only a logarithmic number of times (i.e., you perform a cleanup in an iteration if the iteration
number is, say, a power of 2). This strategy complicates the analysis a bit; see [dBDS95] for more de-
tails on this lazy randomized incremental construction technique. An alternative technique was
suggested by the author for the (more restricted) case of planar arrangements; see [Har00b]. The idea
is to compute only what the algorithm really needs to compute the output, by computing the vertical
decomposition in an exploratory online fashion. The details are unfortunately overwhelming although
the algorithm seems to perform quite well in practice.
Cuttings. The concept of cuttings was introduced by Clarkson. The first optimal size cuttings were
constructed by Chazelle and Friedman [CF90], who proved the exponential decay lemma to this end.
Our elegant proof follows the presentation by de Berg and Schwarzkopf [dBS95]. The problem with this
approach is that the constant involved in the cutting size is awful¯ . Matoušek [Mat98] showed that
there are (1/r)-cuttings with 8r² + 6r + 4 trapezoids, by using level approximation. A different approach
was taken by the author [Har00a], who showed how to get cuttings which seem to be quite small (i.e.,
constant-wise) in practice. The basic idea is to do randomized incremental construction but at each
iteration greedily add all the trapezoids with conflict list small enough to the cutting being output.
One can prove that this algorithm also generates O(r 2 ) cuttings, but the details are not trivial as the
framework described in this chapter is not applicable for analyzing this algorithm.
Cuttings also can be computed in higher dimensions for hyperplanes. In the plane, cuttings can also
be computed for well-behaved curves; see [SA95].
Another fascinating concept is shallow cuttings. These are cuttings covering only portions of the
arrangement that are in the “bottom” of the arrangement. Matoušek came up with the concept [Mat92].
See [AES99, CCH09] for extensions and applications of shallow cuttings.
Even more on randomized algorithms in geometry. We have only scratched the surface of this
fascinating topic, which is one of the cornerstones of “modern” computational geometry. The interested
reader should have a look at the books by Mulmuley [Mul94], Sharir and Agarwal [SA95], Matoušek
[Mat02], and Boissonnat and Yvinec [BY98].
21.6. Exercises
Exercise 21.6.1 (Convex hulls incrementally). Let P be a set of n points in the plane.
(A) Describe a randomized incremental algorithm for computing the convex hull CH (P). Bound the
expected running time of your algorithm.
(B) Assume that for any subset of P, its convex hull has complexity t (i.e., the convex hull of the subset
has t edges). What is the expected running time of your algorithm in this case? If your algorithm
is not faster for this case (for example, think about the case where t = O(log n)), describe a variant
of your algorithm which is faster for this case.
Exercise 21.6.2 (Compressed quadtree made incremental). Given a set P of n points in Rd , describe a
randomized incremental algorithm for building a compressed quadtree for P that works in expected
O(dn log n) time. Prove the bound on the running time of your algorithm.
¯ This is why all computations related to cuttings should be done on a waiter’s bill pad. As Douglas Adams put it:
“On a waiter’s bill pad, reality and unreality collide on such a fundamental level that each becomes the other and anything
is possible, within certain parameters.”
Chapter 22
Primality testing
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“The world is what it is; men who are nothing, who allow themselves to become nothing, have
no place in it.”
— Bend in the river, V.S. Naipaul
For integer numbers x and y, let x | y denote that x divides y. The greatest common divisor (gcd) of
two numbers x and y, denoted by gcd(x, y), is the largest integer that divides both x and y. The least
common multiple (lcm) of x and y, denoted by lcm(x, y) = x y/gcd(x, y), is the smallest integer α, such
that x | α and y | α. An integer number p > 0 is prime if it is divisible only by 1 and itself (we will
consider 1 not to be prime).
Some standard definitions:
Exercise 22.1.1. Show that gcd(Fn, Fn−1 ) = 1, where Fi is the ith Fibonacci number. Argue that for two
consecutive Fibonacci numbers EuclidGCD(Fn, Fn−1 ) takes O(n) time, if every operation takes O(1)
time.
Lemma 22.1.2. For all integers α, β > 0, there are integer numbers x and y, such that gcd(α, β) = αx + βy, and they can be computed in polynomial time; that is, in O( poly( log α + log β ) ) time.

Proof: If α = β then the claim trivially holds. Otherwise, assume that α > β (otherwise, swap them), and observe that gcd(α, β) = gcd(α mod β, β). In particular, by induction, there are integers x′, y′, such that gcd(α mod β, β) = x′(α mod β) + y′β. However, α mod β = α − β⌊α/β⌋. As such, we have

    gcd(α, β) = gcd(α mod β, β) = x′( α − β⌊α/β⌋ ) + y′β = x′α + ( y′ − x′⌊α/β⌋ )β,

as claimed. The running time follows immediately by modifying EuclidGCD to compute these numbers.
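The proof is constructive; a minimal iterative Python sketch of the resulting extended Euclid algorithm (the function name is ours) is:

```python
def extended_gcd(a, b):
    """Return (g, x, y) with g = gcd(a, b) = a*x + b*y, for positive integers a, b."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b != 0:
        q = a // b
        a, b = b, a - q * b                  # the Euclidean step
        x0, x1 = x1, x0 - q * x1             # carry the Bezout coefficients along
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0

g, x, y = extended_gcd(127, 52)
print(g, x, y, 127 * x + 52 * y)             # gcd(127, 52) = 1 = 127*x + 52*y
```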
We use α ≡ β (mod n) or α ≡_n β to denote that α and β are congruent modulo n; that is, α mod n = β mod n. Put differently, we have n | (α − β). The set ZZn = { 0, . . . , n − 1 } forms a group under addition modulo n (see Definition 22.1.9p190 for a formal definition of a group). The more interesting creature is ZZ∗n = { x | x ∈ {1, . . . , n}, x > 0, and gcd(x, n) = 1 }, which is a group under multiplication modulo n.
Remark 22.1.3. Observe that ZZ∗1 = {1}, while for n > 1, ZZ∗n does not contain n.
Lemma 22.1.4. For any element α ∈ ZZ∗n , there exists a unique inverse element β = α−1 ∈ ZZ∗n such
that α ∗ β ≡n 1. Furthermore, the inverse can be computed in polynomial time¬ .
¬ Again, as is everywhere in this chapter, the polynomial time is in the number of bits needed to specify the input.
Proof: Since α ∈ ZZ∗n , we have that gcd(α, n) = 1. As such, by Lemma 22.1.2, there exist integers x and y, such that xα + yn = 1. That is, xα ≡ 1 (mod n), and clearly β := x mod n is the desired inverse, and it can be computed in polynomial time by Lemma 22.1.2.
As for uniqueness, assume that there are two inverses β, β′ to α, with β < β′ < n. But then βα ≡n β′α ≡n 1, which implies that n | (β′ − β)α, which implies that n | β′ − β, which is impossible as 0 < β′ − β < n.
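A short sketch of the inverse computation of Lemma 22.1.4, using the ext_gcd routine sketched above (names are illustrative):

def mod_inverse(a, n):
    # Returns b in Z*_n with a*b = 1 (mod n); assumes gcd(a, n) = 1.
    g, x, _ = ext_gcd(a, n)
    assert g == 1, "a must be coprime with n"
    return x % n

assert (17 * mod_inverse(17, 3120)) % 3120 == 1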
It is now straightforward, but somewhat tedious, to verify the following (the interested reader who has not encountered this material before may spend some time proving it).
Lemma 22.1.5. The set ZZn under the + operation modulo n is a group, as is ZZ∗n under multiplication
modulo n. More importantly, for a prime number p, ZZ p forms a field with the +, ∗ operations modulo p
(see Definition 22.1.17p192 ).
The following is the Chinese remainder theorem (Theorem 22.1.6): if n₁, . . . , n_t are pairwise coprime, n = n₁ ⋯ n_t , and r₁, . . . , r_t are given residues, then there is a unique r ∈ {0, . . . , n − 1} such that r ≡ rᵢ (mod nᵢ) for all i.
Proof: By the coprimality of the nᵢ's it follows that gcd(nᵢ, n/nᵢ) = 1. As such, n/nᵢ ∈ ZZ∗_{nᵢ}, and it has a unique inverse mᵢ modulo nᵢ; that is, (n/nᵢ)mᵢ ≡ 1 (mod nᵢ). So set r = Σᵢ rᵢ mᵢ (n/nᵢ). Observe that for i ≠ j, we have that n_j | (n/nᵢ), and as such rᵢ mᵢ (n/nᵢ) ≡ 0 (mod n_j). As such, we have
r mod n_j = (Σᵢ rᵢ mᵢ (n/nᵢ)) mod n_j = (r_j m_j (n/n_j)) mod n_j = (r_j · 1) mod n_j = r_j .
As for uniqueness, if there is another such number r′, with r < r′ < n, then r′ − r ≡ 0 (mod nᵢ), implying that nᵢ | r′ − r, for all i. Since all the nᵢ's are coprime, this implies that n | r′ − r, which is of course impossible.
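A sketch of the reconstruction in the proof, following the formula r = Σᵢ rᵢ mᵢ (n/nᵢ), using the mod_inverse helper sketched above (names are illustrative):

def crt(residues, moduli):
    # moduli must be pairwise coprime; returns the unique r in {0, ..., n-1}
    # with r = residues[i] (mod moduli[i]) for all i.
    n = 1
    for ni in moduli:
        n *= ni
    r = 0
    for ri, ni in zip(residues, moduli):
        qi = n // ni                 # n / n_i
        mi = mod_inverse(qi, ni)     # inverse of n / n_i modulo n_i
        r += ri * mi * qi
    return r % n

assert crt([2, 3, 2], [3, 5, 7]) == 23   # 23 = 2 (mod 3) = 3 (mod 5) = 2 (mod 7)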
Lemma 22.1.7 (Fast exponentiation). Given numbers b, c, n, one can compute b^c mod n in polynomial time.
Namely, computing b^c mod n can be reduced to recursively computing b^⌊c/2⌋ mod n, followed by a constant number of operations (on numbers that are smaller than n). Clearly, the depth of the recursion is O(log c).
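A sketch of this recursion in Python (the built-in pow(b, c, n) performs the same computation):

def power_mod(b, c, n):
    # Computes b**c mod n by recursing on floor(c/2).
    if c == 0:
        return 1 % n
    half = power_mod(b, c // 2, n)
    result = (half * half) % n
    if c % 2 == 1:
        result = (result * b) % n
    return result

assert power_mod(7, 128, 13) == pow(7, 128, 13)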
22.1.1.4. Euler totient function
The Euler totient function φ(n) = |ZZ∗n| is the number of positive integers that are at most n and coprime with n. If n is prime then φ(n) = n − 1.
Lemma 22.1.8. Let n = p₁^{k₁} ⋯ p_t^{k_t}, where the pᵢ's are prime numbers and the kᵢ's are positive integers (this is the prime factorization of n). Then
φ(n) = ∏_{i=1}^{t} pᵢ^{kᵢ−1}(pᵢ − 1),
and this quantity can be computed in polynomial time if the factorization is given.
Proof: Observe that φ(1) = 1 (see Remark 22.1.3), and for a prime number p, we have that φ(p) = p − 1. Now, for k > 1 and p prime, we have that φ(p^k) = p^{k−1}(p − 1), as a number x ≤ p^k is coprime with p^k if and only if x mod p ≠ 0, and a (p − 1)/p fraction of the numbers in this range have this property.
Now, if n and m are coprime, then gcd(x, nm) = 1 ⟺ gcd(x, n) = 1 and gcd(x, m) = 1. In particular, there are φ(n)φ(m) pairs (α, β) ∈ ZZ∗n × ZZ∗m , such that gcd(α, n) = 1 and gcd(β, m) = 1. By the Chinese remainder theorem (Theorem 22.1.6), each such pair represents a unique number in the range 1, . . . , nm that is coprime with nm, as desired.
Now, the claim follows by easy induction on the prime factorization of the given number.
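A sketch of the formula of Lemma 22.1.8 in Python, computing φ(n) from a given factorization (the input format, a list of (prime, exponent) pairs, is an assumption made here for illustration):

def euler_phi(factorization):
    # factorization: list of (p_i, k_i) pairs with n = prod p_i**k_i.
    phi = 1
    for p, k in factorization:
        phi *= p ** (k - 1) * (p - 1)
    return phi

assert euler_phi([(2, 2), (3, 1)]) == 4    # phi(12) = 4
assert euler_phi([(7, 1)]) == 6            # phi(7) = 6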
In the following we restrict our attention to abelian groups, since it makes the discussion somewhat simpler; some of the claims below hold even without the restriction to abelian groups.
The identity element is unique. Indeed, if both f , g ∈ G are identity elements, then f = f × g = g.
Similarly, for every element x ∈ G there exists a unique inverse y = x −1 . Indeed, if there was another
inverse z, then y = y × i = y × (x × z) = (y × x) × z = i × z = z.
22.1.2.2. Subgroups
For a group G, a subset H ⊆ G that is also a group (under the same operation) is a subgroup.
For x, y ∈ G, let us define x ∼ y if x/y ∈ H. Here x/y = xy⁻¹ and y⁻¹ is the inverse of y in G. Observe that (y/x)(x/y) = yx⁻¹xy⁻¹ = i. That is, y/x is the inverse of x/y, and it is in H. But that implies that x ∼ y ⟹ y ∼ x. Now, if x ∼ y and y ∼ z, then x/y, y/z ∈ H. But then x/y × y/z ∈ H, and furthermore x/y × y/z = xy⁻¹yz⁻¹ = xz⁻¹ = x/z; that is, x ∼ z. Together, this implies that ∼ is an equivalence relation.
Furthermore, observe that if x/y = x/z then y⁻¹ = x⁻¹(x/y) = x⁻¹(x/z) = z⁻¹; that is, y = z. In particular, the equivalence class of x ∈ G is [x] = { z ∈ G | x ∼ z }. Observe that if x ∈ H then i/x = ix⁻¹ = x⁻¹ ∈ H, and thus i ∼ x. That is, H = [x]. The following is now easy.
Lemma 22.1.10. Let G be an abelian group, and let H ⊆ G be a subgroup. Consider the set G/H = { [x] | x ∈ G }. We claim that |[x]| = |[y]| for any x, y ∈ G. Furthermore, G/H is a group (that is, the quotient group of G by H).
Lemma 22.1.12. For a finite group G and any g ∈ G, the set ⟨g⟩ = { gⁱ | i ≥ 1 } is a subgroup of G.
Proof: Since G is finite, there are integers i > j ≥ 1, such that i ≠ j and gⁱ = gʲ, but then gʲ × g^{i−j} = gⁱ = gʲ. That is, g^{i−j} = i (the identity) and, by definition, we have g^{i−j} ∈ ⟨g⟩. It is now straightforward to verify that the other properties of a group hold for ⟨g⟩.
In particular, for an element g ∈ G, we define its order as ord(g) = |⟨g⟩|, which clearly is the minimum positive integer m, such that gᵐ = i. Indeed, for j > m, observe that gʲ = g^{j mod m} ∈ X = {i, g, g², . . . , g^{m−1}}, which implies that ⟨g⟩ = X.
A group G is cyclic, if there is an element g ∈ G, such that hgi = G. In such a case g is a generator
of G.
Lemma 22.1.13. For any finite abelian group G, and any g ∈ G, we have that ord(g) divides |G|, and g^{|G|} = i.
Proof: By Lemma 22.1.12, the set ⟨g⟩ is a subgroup of G. By Lemma 22.1.11, we have that ord(g) = |⟨g⟩| divides |G|. As such, g^{|G|} = (g^{ord(g)})^{|G|/ord(g)} = i^{|G|/ord(g)} = i.
Theorem 22.1.15. (Euler's theorem) For all n and x ∈ ZZ∗n , we have x^{φ(n)} ≡ 1 (mod n).
(Fermat's theorem) If p is a prime then ∀x ∈ ZZ∗p : x^{p−1} ≡ 1 (mod p).
Proof: The group ZZ∗n is abelian and has φ(n) elements, with 1 being the identity element (duh!). As such, by Lemma 22.1.13, we have that x^{φ(n)} = x^{|ZZ∗n|} ≡ 1 (mod n), as claimed.
The second claim follows by setting n = p, and recalling that φ(p) = p − 1 if p is a prime.
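A quick numerical sanity check of Euler's and Fermat's theorems (pure brute force, for illustration only):

from math import gcd

def check_euler(n):
    # Verify x**phi(n) = 1 (mod n) for every x in Z*_n.
    units = [x for x in range(1, n + 1) if gcd(x, n) == 1]
    phi = len(units)
    return all(pow(x, phi, n) == 1 for x in units)

assert check_euler(15) and check_euler(97)   # 97 is prime: Fermat's case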
One might be tempted to think that Lemma 22.1.14 implies that if p is a prime then ZZ∗p is a cyclic group, but this does not follow, as the cardinality of ZZ∗p is φ(p) = p − 1, which is not a prime number (for p > 3). To prove that ZZ∗p is cyclic, let us return briefly to the totient function.
Lemma 22.1.16. For any n > 0, we have Σ_{d|n} φ(d) = n.
Proof: For any g > 0, let V_g = { x | x ∈ {1, . . . , n} and gcd(x, n) = g }. Now, x ∈ V_g ⟺ gcd(x, n) = g ⟺ gcd(x/g, n/g) = 1 ⟺ x/g ∈ ZZ∗_{n/g}. Since V₁, V₂, . . . , V_n form a partition of {1, . . . , n}, it follows that
n = Σ_g |V_g| = Σ_{g|n} |ZZ∗_{n/g}| = Σ_{g|n} φ(n/g) = Σ_{d|n} φ(d).
22.1.2.5. Fields
Definition 22.1.17. A field is an algebraic structure hF, +, ∗, 0, 1i consisting of two abelian groups:
(A) F under +, with 0 being the identity element.
(B) F \ {0} under ∗, with 1 as the identity element (here 0 ≠ 1).
Also, the following property (distributivity of multiplication over addition) holds:
∀a, b, c ∈ F a ∗ (b + c) = (a ∗ b) + (a ∗ c).
We need the following: a polynomial p of degree k over a field F has at most k roots. Indeed, if p has the root α then it can be written as p(x) = (x − α)q(x), where q(x) is a polynomial of degree one lower. To see this, we divide p(x) by the polynomial (x − α), and observe that p(x) = (x − α)q(x) + β; but clearly β = 0 since p(α) = 0. As such, if p had t roots α₁, . . . , α_t , then p(x) = q(x) ∏_{i=1}^{t} (x − αᵢ), which implies that p would have degree at least t.
Lemma 22.1.19. For any prime p, the group ZZ∗p is cyclic.
Proof: For p = 2 the claim trivially holds, so assume p > 2. If the set R_{p−1}, from Lemma 22.1.18, is not empty, then there is g ∈ R_{p−1}; it has order p − 1, and it is a generator of ZZ∗p , as |ZZ∗p| = p − 1, implying that ZZ∗p = ⟨g⟩ and this group is cyclic.
Now, by Lemma 22.1.13, for any y ∈ ZZ∗p we have that ord(y) | p − 1 = |ZZ∗p|. This implies that R_k is empty if k does not divide p − 1. On the other hand, R₁, . . . , R_{p−1} form a partition of ZZ∗p . As such, we have that
p − 1 = |ZZ∗p| = Σ_{k | p−1} |R_k| ≤ Σ_{k | p−1} φ(k) = p − 1,
by Lemma 22.1.18 and Lemma 22.1.16, implying that the inequality in the above display is an equality, and for all k | p − 1, we have that |R_k| = φ(k). In particular, |R_{p−1}| = φ(p − 1) > 0, and by the above the claim follows.
Lemma 22.1.20. For an odd prime p and an integer c ≥ 1, the group ZZ∗n is cyclic, where n = p^c.
Proof: Let g be a generator of ZZ∗p . Observe that g^{p−1} ≡ 1 (mod p). The number g is smaller than p, and as such p does not divide g, p does not divide g^{p−2}, and p does not divide p − 1. As such, p² does not divide ∆ = (p − 1)g^{p−2}p; that is, ∆ ≢ 0 (mod p²). As such, we have that
(g + p)^{p−1} ≡ g^{p−1} + (p − 1)g^{p−2}p ≡ g^{p−1} + ∆ ≢ g^{p−1} (mod p²)
⟹ (g + p)^{p−1} ≢ 1 (mod p²) or g^{p−1} ≢ 1 (mod p²).
In either case, we obtain a generator of ZZ∗p — call it g, replacing g by g + p if needed — such that g^{p−1} ≢ 1 (mod p²); that is, g^{p−1} = 1 + βp, where p does not divide β. As such,
g^{p(p−1)} = (1 + βp)^p = 1 + βp² + p³<whatever> = 1 + γ₁p²,
where γ₁ is an integer (the p³ is not a typo – the binomial coefficient contributes at least one factor of p – here we are using that p > 2). In particular, as p does not divide β, it follows that p does not divide γ₁ either. Let us apply this argumentation again to
g^{p²(p−1)} = (1 + γ₁p²)^p = 1 + γ₁p³ + p⁴<whatever> = 1 + γ₂p³,
where again p does not divide γ₂. Repeating this argument, for i = 1, . . . , c − 2, we have
αᵢ = g^{pⁱ(p−1)} = (g^{p^{i−1}(p−1)})^p = (1 + γ_{i−1}pⁱ)^p = 1 + γ_{i−1}p^{i+1} + p^{i+2}<whatever> = 1 + γᵢp^{i+1},
where p does not divide γᵢ. In particular, this implies that α_{c−2} = 1 + γ_{c−2}p^{c−1} and p does not divide γ_{c−2}. This in turn implies that α_{c−2} ≢ 1 (mod p^c).
Now, the order of g in ZZ∗n , denoted by k, must divide |ZZ∗n| by Lemma 22.1.13. Now, |ZZ∗n| = φ(n) = p^{c−1}(p − 1), see Lemma 22.1.8. So, k | p^{c−1}(p − 1). Also, α_{c−2} ≢ 1 (mod p^c) implies that k does not divide p^{c−2}(p − 1). It follows that p^{c−1} | k. So, let us write k = p^{c−1}k′, where k′ ≤ p − 1. This, by definition, implies that g^k ≡ 1 (mod p^c). Now, g^p ≡ g (mod p), because g ∈ ZZ∗p and by Fermat's theorem. As such, setting δ = c − 1, we have that
g^k = g^{p^δ k′} ≡_p (g^p)^{p^{δ−1}k′} ≡_p g^{p^{δ−1}k′} ≡_p ⋯ ≡_p g^{k′}, while g^k ≡ 1 (mod p^c) implies g^k ≡_p 1.
Namely, g^{k′} ≡ 1 (mod p), which implies, as g is a generator of ZZ∗p of order p − 1, that (p − 1) | k′; since 1 ≤ k′ ≤ p − 1, we conclude that k′ = p − 1. Therefore k = p^{c−1}(p − 1); that is, ZZ∗n is cyclic.
Theorem 22.1.22 (Euler's criterion). Let p be an odd prime, and α ∈ ZZ∗p . We have that
(A) α^{(p−1)/2} ≡_p ±1.
(B) If α is a quadratic residue, then α^{(p−1)/2} ≡_p 1.
(C) If α is not a quadratic residue, then α^{(p−1)/2} ≡_p −1.
Proof: (A) Let γ = α^{(p−1)/2}, and observe that γ² ≡_p α^{p−1} ≡_p 1, by Fermat's theorem (Theorem 22.1.15), which implies that γ is either +1 or −1, as the polynomial x² − 1 has at most two roots over a field.
(B) Let α ≡_p β², and again by Fermat's theorem, we have α^{(p−1)/2} ≡_p β^{p−1} ≡_p 1.
(C) Let X be the set of elements in ZZ∗p that are not quadratic residues, and consider α ∈ X. Since ZZ∗p is a group, for any x ∈ ZZ∗p there is a unique y ∈ ZZ∗p such that xy ≡_p α (and y ≠ x, as α is not a quadratic residue). As such, we partition ZZ∗p into pairs C = { {x, y} | x, y ∈ ZZ∗p and xy ≡_p α }. We have that
τ ≡_p ∏_{β ∈ ZZ∗p} β ≡_p ∏_{{x,y} ∈ C} xy ≡_p ∏_{{x,y} ∈ C} α ≡_p α^{(p−1)/2}.
Let us consider a similar set of pairs, but this time for 1: D = { {x, y} | x, y ∈ ZZ∗p , x ≠ y and xy ≡_p 1 }. Clearly, D does not pair up −1 and 1 (each is its own inverse), but all other elements of ZZ∗p are covered by D. As such,
τ ≡_p ∏_{β ∈ ZZ∗p} β ≡_p (−1) · 1 · ∏_{{x,y} ∈ D} xy ≡_p − ∏_{{x,y} ∈ D} 1 ≡_p −1.
Combining the two expressions for τ yields α^{(p−1)/2} ≡_p −1.
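A small sketch of Euler's criterion used as a quadratic-residuosity test for an odd prime p with gcd(a, p) = 1 (this computes the Legendre symbol (a | p); names are illustrative):

def legendre(a, p):
    # Euler's criterion: a^((p-1)/2) mod p is 1 for quadratic residues
    # and p - 1 (i.e., -1) for non-residues; 0 if p divides a.
    t = pow(a, (p - 1) // 2, p)
    return -1 if t == p - 1 else t

assert legendre(2, 7) == 1     # 3*3 = 9 = 2 (mod 7)
assert legendre(3, 7) == -1    # 3 is not a square modulo 7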
The following is easy to verify.
Lemma 22.1.24. Let p be an odd prime, and let a, b be integer numbers. We have:
(i) (−1 | p) = (−1)(p−1)/2 .
(ii) (a | p) (b | p) = (ab | p).
(iii) If a ≡ p b then (a | p) = (b | p).
Lemma 22.1.25 (Gauss' lemma). Let p be an odd prime and let a be an integer that is not divisible by p. Let X = { α_j = ja mod p | j = 1, . . . , (p − 1)/2 }, and L = { x ∈ X | x > p/2 } ⊆ X. Then (a | p) = (−1)^n , where n = |L|.
Proof: Observe that for any distinct i, j with 1 ≤ i < j ≤ (p − 1)/2, having ja ≡ ia (mod p) would imply that (j − i)a ≡ 0 (mod p), which is impossible as j − i < p and gcd(a, p) = 1. As such, all the elements of X are distinct, and |X| = (p − 1)/2. We have a somewhat stronger property: ja ≡ −ia (mod p) implies (j + i)a ≡ 0 (mod p), which is impossible. That is, S = X \ L and L̄ = { p − ℓ | ℓ ∈ L } are disjoint, and S ∪ L̄ = {1, . . . , (p − 1)/2}. As such,
((p−1)/2)! ≡ ∏_{x∈S} x · ∏_{y∈L} (p − y) ≡ (−1)^n ∏_{x∈S} x · ∏_{y∈L} y ≡ (−1)^n ∏_{j=1}^{(p−1)/2} ja ≡ (−1)^n a^{(p−1)/2} ((p−1)/2)! (mod p).
Dividing both sides by (−1)^n ((p − 1)/2)!, we have that (a | p) ≡ a^{(p−1)/2} ≡ (−1)^n (mod p), as claimed.
Lemma 22.1.26. If p is an odd prime, a is odd, and gcd(a, p) = 1, then (a | p) = (−1)^∆, where
∆ = Σ_{j=1}^{(p−1)/2} ⌊ja/p⌋.
Furthermore, we have (2 | p) = (−1)^{(p²−1)/8}.
Rearranging, and observing that Σ_{j=1}^{(p−1)/2} j = (1/2) · ((p−1)/2) · ((p−1)/2 + 1) = (p² − 1)/8, we have that
(a − 1)(p² − 1)/8 = (∆ + n)p − 2 Σ_{y∈L} y  ⟹  (a − 1)(p² − 1)/8 ≡ (∆ + n)p (mod 2). (22.1)
Observe that p ≡ 1 (mod 2), and for any x we have that x ≡ −x (mod 2). As such, if a is odd, then the above implies that n ≡ ∆ (mod 2). Now the claim readily follows from Lemma 22.1.25.
As for (2 | p), setting a = 2, observe that ⌊ja/p⌋ = 0, for j = 1, . . . , (p − 1)/2, and as such ∆ = 0. Now, Eq. (22.1) implies that (p² − 1)/8 ≡ n (mod 2), and the claim follows from Lemma 22.1.25.
Theorem 22.1.27 (Law of quadratic reciprocity). If p and q are distinct odd primes, then
(p | q) = (−1)^{((p−1)/2)·((q−1)/2)} (q | p).
Proof: Let S = { (x, y) | 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ (q − 1)/2 }. As lcm(p, q) = pq, it follows that there is no (x, y) ∈ S such that qx = py, as all such numbers are strictly smaller than pq. Now, split S into S₁ = { (x, y) ∈ S | py < qx } and S₂ = { (x, y) ∈ S | py > qx }, and let
τ = ((p−1)/2) · ((q−1)/2) = |S| = |S₁| + |S₂| = Σ_{x=1}^{(p−1)/2} ⌊qx/p⌋ + Σ_{y=1}^{(q−1)/2} ⌊py/q⌋ = τ₁ + τ₂.
The claim now readily follows by Lemma 22.1.26, as (−1)^τ = (−1)^{τ₁}(−1)^{τ₂} = (p | q)(q | p).
Definition 22.1.28. For any integer a, and an odd number n with prime factorization n = p₁^{k₁} ⋯ p_t^{k_t}, its Jacobi symbol is
⟦a | n⟧ = ∏_{i=1}^{t} (a | pᵢ)^{kᵢ}.
Claim 22.1.29. For odd integers n₁, . . . , n_k , we have that Σ_{i=1}^{k} (nᵢ − 1)/2 ≡ (∏_{i=1}^{k} nᵢ − 1)/2 (mod 2).
Proof: We prove the claim for two odd integers x and y, and apply it repeatedly to get the claim. Indeed, we have
(x − 1)/2 + (y − 1)/2 ≡ (xy − 1)/2 (mod 2) ⟺ 0 ≡ (xy − x + 1 − y + 1 − 1)/2 = (xy − x − y + 1)/2 (mod 2) ⟺ 0 ≡ (x − 1)(y − 1)/2 (mod 2),
which is obviously true, as 4 divides (x − 1)(y − 1) for odd x and y.
Lemma 22.1.30 (Law of quadratic reciprocity). For n and m positive odd integers, we have that
⟦n | m⟧ = (−1)^{((n−1)/2)·((m−1)/2)} ⟦m | n⟧.
Proof: Let n = ∏_{i=1}^{ν} pᵢ and m = ∏_{j=1}^{µ} q_j be the prime factorizations of the two numbers (allowing repeated factors). If they share a common factor p, then both ⟦n | m⟧ and ⟦m | n⟧ contain a zero term when expanded, as (n | p) = (m | p) = 0. Otherwise, we have
⟦n | m⟧ = ∏_{i=1}^{ν} ∏_{j=1}^{µ} (pᵢ | q_j) = ∏_{i=1}^{ν} ∏_{j=1}^{µ} (−1)^{((q_j−1)/2)·((pᵢ−1)/2)} (q_j | pᵢ)
= ( ∏_{i=1}^{ν} ∏_{j=1}^{µ} (−1)^{((q_j−1)/2)·((pᵢ−1)/2)} ) · ∏_{i=1}^{ν} ∏_{j=1}^{µ} (q_j | pᵢ) = s ⟦m | n⟧,
by Theorem 22.1.27. As for the value of s, observe that
s = ∏_{i=1}^{ν} ( ∏_{j=1}^{µ} (−1)^{(q_j−1)/2} )^{(pᵢ−1)/2} = ∏_{i=1}^{ν} ( (−1)^{(m−1)/2} )^{(pᵢ−1)/2} = ( (−1)^{(m−1)/2} )^{(n−1)/2} = (−1)^{((n−1)/2)·((m−1)/2)},
by Claim 22.1.29.
Lemma 22.1.31. For odd integers n and m, we have that (n² − 1)/8 + (m² − 1)/8 ≡ (n²m² − 1)/8 (mod 2).
Proof: For an odd integer n, we have that either (i) 2 | n − 1 and 4 | n + 1, or (ii) 4 | n − 1 and 2 | n + 1. As such, 8 | n² − 1 = (n − 1)(n + 1). In particular, 64 | (n² − 1)(m² − 1). We thus have that
(n² − 1)(m² − 1)/8 ≡ 0 (mod 2) ⟺ (n²m² − n² − m² + 1)/8 ≡ 0 (mod 2)
⟺ (n²m² − 1)/8 ≡ (n² + m² − 2)/8 (mod 2)
⟺ (n² − 1)/8 + (m² − 1)/8 ≡ (n²m² − 1)/8 (mod 2).
Lemma 22.1.32. Let m, n be odd integers, and a, b be any integers. We have the following:
(A) ⟦ab | n⟧ = ⟦a | n⟧ ⟦b | n⟧.
(B) ⟦a | nm⟧ = ⟦a | n⟧ ⟦a | m⟧.
(C) If a ≡ b (mod n) then ⟦a | n⟧ = ⟦b | n⟧.
(D) If gcd(a, n) > 1 then ⟦a | n⟧ = 0.
(E) ⟦1 | n⟧ = 1.
(F) ⟦2 | n⟧ = (−1)^{(n²−1)/8}.
(G) ⟦n | m⟧ = (−1)^{((n−1)/2)·((m−1)/2)} ⟦m | n⟧.
22.1.3.4. Jacobi(a, n): Computing the Jacobi symbol
Given a and n (n is an odd number), we are interested in computing (in polynomial time) the Jacobi symbol ⟦a | n⟧. The algorithm Jacobi(a, n) works as follows:
(A) If a = 0 then return 0. // Since ⟦0 | n⟧ = 0.
(B) If a > n then return Jacobi(a mod n, n). // Lemma 22.1.32 (C)
(C) If gcd(a, n) > 1 then return 0. // Lemma 22.1.32 (D)
(D) If a = 2 then
(I) compute ∆ = n² − 1 (mod 16), and
(II) return (−1)^{∆/8}. // As (n² − 1)/8 ≡ ∆/8 (mod 2), and by Lemma 22.1.32 (F)
(E) If 2 | a then return Jacobi(2, n) · Jacobi(a/2, n). // Lemma 22.1.32 (A)
// At this point a and n are both odd, a < n, and they are coprime.
(F) Set a′ := a mod 4, n′ := n mod 4, β := (a′ − 1)(n′ − 1)/4, and
return (−1)^β · Jacobi(n, a). // By Lemma 22.1.32 (G)
Ignoring the recursive calls, all the operations take polynomial time. Clearly, computing Jacobi(2, n) takes polynomial time. Otherwise, observe that Jacobi reduces its input size by, say, one bit at least every two recursive calls, and except for the a = 2 case, it always performs only a single recursive call. Thus, it follows that its running time is polynomial. We thus get the following.
Lemma 22.1.33. Given integers a and n, where n is odd, one can compute ⟦a | n⟧ in polynomial time.
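A sketch of the algorithm Jacobi(a, n) described above (written iteratively here for brevity; the behavior matches the case analysis, under the standing assumption that n is a positive odd integer):

def jacobi(a, n):
    # n must be a positive odd integer.
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:           # pull out factors of 2, rule (F)
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                 # reciprocity, rule (G)
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0  # gcd(a, n) > 1 gives 0, rule (D)

assert jacobi(2, 7) == 1 and jacobi(3, 7) == -1   # matches the Legendre symbol for a prime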
Here Jn denotes the set { a ∈ ZZ∗n | ⟦a | n⟧ ≡ a^{(n−1)/2} (mod n) }; that is, the elements of ZZ∗n on which the Euler-criterion formula computes the Jacobi symbol correctly. The set Jn is a subgroup of ZZ∗n .
Proof: For a, b ∈ Jn , we have that ⟦ab | n⟧ ≡ ⟦a | n⟧ ⟦b | n⟧ ≡ a^{(n−1)/2} b^{(n−1)/2} ≡ (ab)^{(n−1)/2} (mod n), implying that ab ∈ Jn . Now, ⟦1 | n⟧ = 1, so 1 ∈ Jn . Next, for a ∈ Jn , let a⁻¹ be the inverse of a (which is a number in ZZ∗n ). Observe that a · a⁻¹ = kn + 1, for some integer k, and as such, we have
1 = ⟦1 | n⟧ = ⟦kn + 1 | n⟧ = ⟦a a⁻¹ | n⟧ = ⟦a | n⟧ ⟦a⁻¹ | n⟧.
Lemma 22.1.35. Let n be an odd composite integer. Then |Jn | ≤ |ZZ∗n |/2.
Proof: Let n have the prime factorization n = ∏_{i=1}^{t} pᵢ^{kᵢ}. Let q = p₁^{k₁}, and m = n/q. By Lemma 22.1.20, the group ZZ∗q is cyclic, and let g be its generator. Consider the element a ∈ ZZ∗n such that a ≡ g (mod q) and a ≡ 1 (mod m). Such a number a exists and is unique, by the Chinese remainder theorem (Theorem 22.1.6). In particular, since m = ∏_{i=2}^{t} pᵢ^{kᵢ}, we have, for all i ≥ 2, that a ≡ 1 (mod pᵢ), as pᵢ | m. As such, writing the Jacobi symbol explicitly, we have
⟦a | n⟧ = ⟦a | q⟧ ∏_{i=2}^{t} (a | pᵢ)^{kᵢ} = ⟦a | q⟧ ∏_{i=2}^{t} (1 | pᵢ)^{kᵢ} = ⟦a | q⟧ ∏_{i=2}^{t} 1 = ⟦a | q⟧ = ⟦g | q⟧,
since a ≡ g (mod q), by Lemma 22.1.32 (C). At this point there are two possibilities:
(A) If k₁ = 1, then q = p₁, and ⟦g | q⟧ = (g | q) ≡ g^{(q−1)/2} (mod q). But g is a generator of ZZ∗q , and its order is q − 1. As such g^{(q−1)/2} ≡ −1 (mod q), see Definition 22.1.23. We conclude that ⟦a | n⟧ = −1. If we assume that Jn = ZZ∗n , then ⟦a | n⟧ ≡ a^{(n−1)/2} ≡ −1 (mod n). Now, as m | n, we have
a^{(n−1)/2} ≡_m (a^{(n−1)/2} mod n) mod m ≡_m −1.
Theorem 22.2.1. Given a number n and a parameter δ > 0, there is a randomized algorithm that decides if the given number is prime or composite. The running time of the algorithm is O((log n)^c log(1/δ)), where c is some constant. If the algorithm returns that n is composite then n is indeed composite. If the algorithm returns that n is prime, then it is wrong with probability at most δ.
One could even say “trivial” with heavy Russian accent.
Proof: Run the above algorithm m = O(log(1/δ)) times. If any of the runs returns that n is composite then the algorithm returns that n is composite; otherwise, the algorithm returns that it is prime.
The algorithm can fail only if n is composite; let r₁, . . . , r_m be the random numbers the algorithm picked. The algorithm fails only if r₁, . . . , r_m ∈ Jn , but since |Jn | ≤ |ZZ∗n |/2, by Lemma 22.1.35, it follows that this happens with probability at most (|Jn | / |ZZ∗n |)^m ≤ 1/2^m ≤ δ, as claimed.
Lemma 22.2.2. Let ∆ be the product of all the prime numbers p with m < p < 2m. Then ∆ divides binom(2m, m); in particular, ∆ ≤ binom(2m, m).
Proof: Let X be the product of all the composite numbers between m and 2m. We have
binom(2m, m) = (2m · (2m − 1) ⋯ (m + 2)(m + 1)) / (m · (m − 1) ⋯ 2 · 1) = (X · ∆) / (m · (m − 1) ⋯ 2 · 1).
Since none of the numbers between 2 and m divides any of the factors of ∆, it must be that the number X/(m · (m − 1) ⋯ 2 · 1) is an integer, as binom(2m, m) is an integer. Therefore, binom(2m, m) = c · ∆, for some integer c > 0, implying the claim.
Lemma 22.2.3. The number of prime numbers between m and 2m is O(m/ln m).
Proof: Let us denote the primes between m and 2m by p₁ < p₂ < ⋯ < p_k . Since p₁ ≥ m, it follows from Lemma 22.2.2 that
m^k ≤ ∏_{i=1}^{k} pᵢ ≤ binom(2m, m) ≤ 2^{2m}.
Now, taking the log of both sides, we have k lg m ≤ 2m. Namely, k ≤ 2m/lg m.
Proof: Let the number of primes smaller than n be Π(n). By Lemma 22.2.3, there exists a positive constant C, such that for all n ≥ N, we have Π(2n) − Π(n) ≤ C · n/ln n; namely, Π(2n) ≤ C · n/ln n + Π(n). Thus,
Π(2n) = Σ_{i=0}^{⌈lg n⌉} ( Π(2n/2ⁱ) − Π(2n/2^{i+1}) ) + Π(2n/2^{⌈lg n⌉+1}) ≤ Σ_{i=0}^{⌈lg n⌉} C · (n/2ⁱ)/ln(n/2ⁱ) + O(1) = O(n/ln n),
by observing that the summation behaves like a decreasing geometric series.
We next bound the largest power of a prime p that divides binom(2m, m): we claim that if p^α is that power, then p^α ≤ 2m.
Proof: Let T(p, m) be the number of times p appears in the prime factorization of m!. Formally, T(p, m) is the highest number k such that p^k divides m!. We claim that T(p, m) = Σ_{i=1}^{∞} ⌊m/pⁱ⌋. Indeed, consider an integer β ≤ m, such that β = p^t γ, where γ is an integer that is not divisible by p. Observe that β contributes exactly to the first t terms of the summation of T(p, m) – namely, its contribution to m! as far as powers of p is counted correctly.
Let α be the maximum number such that p^α divides binom(2m, m) = (2m)!/(m! m!). Clearly,
α = T(p, 2m) − 2T(p, m) = Σ_{i=1}^{∞} ( ⌊2m/pⁱ⌋ − 2⌊m/pⁱ⌋ ).
It is easy to verify that for any integers x, y, we have that 0 ≤ ⌊2x/y⌋ − 2⌊x/y⌋ ≤ 1. In particular, let k be the largest number such that ⌊2m/p^k⌋ − 2⌊m/p^k⌋ = 1, and observe that α ≤ k, as only the first k terms in the above summation might be nonzero. But ⌊2m/p^k⌋ − 2⌊m/p^k⌋ = 1 implies that 2m/p^k ≥ 1, which implies that p^α ≤ p^k ≤ 2m.
Theorem 22.2.7. Let π(n) be the number of distinct prime numbers between 1 and n. We have that
π(n) = Θ(n/ln n).
Chapter 23
Definition 23.1.2. Let (X, d) be an n-point metric space. We denote the open ball of radius r about x ∈ X by b(x, r) = { y ∈ X | d(x, y) < r }.
Underlying our discussion of metric spaces are algorithmic applications. The hardness of various
computational problems depends heavily on the structure of the finite metric space. Thus, given a finite
metric space, and a computational task, it is natural to try to map the given metric space into a new
metric where the task at hand becomes easy.
Example 23.1.3. For example, computing the diameter is not trivial in two dimensions, but is easy in
one dimension. Thus, if we could map points in two dimensions into points in one dimension, such that
the diameter is preserved, then computing the diameter becomes easy. In fact, this approach yields an
efficient approximation algorithm, see Exercise 23.7.3 below.
Of course, this mapping from one metric space to another, is going to introduce error. We would be
interested in minimizing the error introduced by such a mapping.
Definition 23.1.4. Let (X, d_X) and (Y, d_Y) be metric spaces. A mapping f : X → Y is called an embedding, and is C-Lipschitz if d_Y(f(x), f(y)) ≤ C · d_X(x, y) for all x, y ∈ X. The mapping f is called K-bi-Lipschitz if there exists a C > 0 such that
C·K⁻¹ · d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ C · d_X(x, y),
for all x, y ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f , and is denoted dist(f). The least distortion with which X may be embedded in Y is denoted c_Y(X).
There are several powerful results in this vein, showing the existence of embeddings with low distortion, which will be presented:
1. Probabilistic trees – every finite metric can be randomly embedded into a tree such that the “expected” distortion for a specific pair of points is O(log n).
2. Bourgain embedding – shows that any n-point metric space can be embedded into (finite dimensional) Euclidean space with O(log n) distortion.
3. Johnson–Lindenstrauss lemma – shows that any n-point set in Euclidean space with the regular Euclidean distance can be embedded into R^k with distortion (1 + ε), where k = O(ε⁻² log n).
23.2. Examples
23.2.0.0.1. What is distortion? When considering a mapping f : X → R^d of a metric space (X, d) to R^d, it is useful to observe that since R^d can be scaled, we can consider f to be an expansion (i.e., no distances shrink). Furthermore, we can in fact assume that there is at least one pair of points x, y ∈ X, such that d(x, y) = ‖f(x) − f(y)‖. As such, we have dist(f) = max_{x,y} ‖f(x) − f(y)‖ / d(x, y).
23.2.0.0.2. Why is distortion necessary? Consider the graph G = (V, E) with one vertex s connected to three other vertices a, b, c, where the weights on the edges are all one (i.e., G is the star graph with three leaves). We claim that G cannot be embedded into Euclidean space with distortion smaller than 2/√3. Indeed, consider the associated metric space (V, d_G) and an (expansive) embedding f : V → R^d. Consider the triangle formed by △ = a′b′c′, where a′ = f(a), b′ = f(b) and c′ = f(c). Next, consider the quantity max(‖a′ − s′‖, ‖b′ − s′‖, ‖c′ − s′‖), which lower bounds the distortion of f. This quantity is minimized when r = ‖a′ − s′‖ = ‖b′ − s′‖ = ‖c′ − s′‖; namely, s′ is the center of the smallest enclosing circle of △. However, r is minimized when all the edges of △ are of equal length, and are in fact of length d_G(a, b) = 2. It follows that dist(f) ≥ r ≥ 2/√3.
It is known that Ω(log n) distortion is necessary in the worst case. This is shown using expanders
[Mat02].
Definition 23.2.1. Hierarchically well-separated tree (HST) is a metric space defined on the leaves
of a rooted tree T. To each vertex u ∈ T there is associated a label ∆u ≥ 0 such that ∆u = 0 if and only if
u is a leaf of T. The labels are such that if a vertex u is a child of a vertex v then ∆u ≤ ∆v . The distance
between two leaves x, y ∈ T is defined as ∆lca(x,y) , where lca(x, y) is the least common ancestor of x and
y in T.
A HST T is a k-HST if for a vertex v ∈ T, we have that ∆v ≤ ∆p(v) /k, where p(v) is the parent of v
in T.
Note that a HST is a very limited metric. For example, consider the cycle G = Cn of n vertices, with
weight one on the edges, and consider an expansive embedding f of G into a HST HST. It is easy to
verify, that there must be two consecutive nodes of the cycle, which are mapped to two different subtrees
of the root r of HST. Since HST is expansive, it follows that ∆r ≥ n/2. As such, dist( f ) ≥ n/2. Namely,
HSTs fail to faithfully represent even very simple metrics.
23.2.2. Clustering
One natural problem we might want to solve on a graph (i.e., finite metric space) (X, d) is to partition it into clusters. One such natural clustering is the k-median clustering, where we would like to choose a set C ⊆ X of k centers, such that ν_C(X, d) = Σ_{q∈X} d(q, C) is minimized, where d(q, C) = min_{c∈C} d(q, c) is the distance of q to its closest center in C.
It is known that finding the optimal k-median clustering in a (general weighted) graph is NP-
complete. As such, the best we can hope for is an approximation algorithm. However, if the structure
of the finite metric space (X, d) is simple, then the problem can be solved efficiently. For example, if the
points of X are on the real line (and the distance between a and b is just |a − b|), then k-median can be
solved using dynamic programming.
Another interesting case is when the metric space (X, d) is a HST. It is not too hard to prove the following lemma; see Exercise 23.7.1.
Lemma 23.2.2. Let (X, d) be a HST defined over n points, and let k > 0 be an integer. One can
compute the optimal k-median clustering of X in O(k 2 n) time.
Thus, if we can embed a general graph G into a HST HST, with low distortion, then we could
approximate the k-median clustering on G by clustering the resulting HST, and “importing” the resulting
partition to the original space. The quality of approximation, would be bounded by the distortion of
the embedding of G into HST.
The partition is now defined as follows: a point x ∈ X is assigned to the cluster C_y of y, where y is the first point in the permutation within distance ≤ R from x. Formally,
C_y = { x ∈ X | x ∈ b(y, R) and π(y) ≤ π(z) for all z ∈ X with x ∈ b(z, R) }.
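A sketch in Python of the partition just defined, assuming, as in the analysis below, that R is drawn uniformly from [∆/4, ∆/2] and that π is a uniformly random permutation of X (names are illustrative):

import random

def random_partition(points, dist, delta):
    # points: list of point identifiers; dist(x, y): the metric.
    R = random.uniform(delta / 4.0, delta / 2.0)
    pi = list(points)
    random.shuffle(pi)                  # a uniformly random permutation of X
    cluster_of = {}
    for x in points:
        for y in pi:                    # the first y in the permutation ...
            if dist(x, y) <= R:         # ... with x inside b(y, R)
                cluster_of[x] = y
                break
    return cluster_of                   # maps each point to its cluster's "center"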
23.3.2. Properties
Lemma 23.3.1. Let (X, d) be a finite metric space, ∆ = 2^u a prescribed parameter, and let P be the partition of X generated by the above random partition. Then the following holds:
(i) For any C ∈ P, we have diam(C) ≤ ∆.
(ii) Let x be any point of X, and t a parameter ≤ ∆/8. Then,
Pr[b(x, t) ⊄ P(x)] ≤ (8t/∆) ln(b/a),
where a = |b(x, ∆/8)| and b = |b(x, ∆)|.
Proof: Since Cy ⊆ b(y, R), we have that diam(Cy ) ≤ ∆, and thus the first claim holds.
Let U be the set of points of b(x, ∆), such that w ∈ U iff b(w, R) ∩ b(x, t) , ∅. Arrange the points
of U in increasing distance from x, and let w1, . . . , wb0 denote the resulting order, where b0 = |U|.
Let I_k = [d(x, w_k) − t, d(x, w_k) + t], and write E_k for the event that w_k is the first point in π such that b(x, t) ∩ C_{w_k} ≠ ∅, and yet b(x, t) ⊄ C_{w_k}. Note that if w_k ∈ b(x, ∆/8), then Pr[E_k] = 0 since
b(x, t) ⊆ b(x, ∆/8) ⊆ b(w k , ∆/4) ⊆ b(w k , R).
In particular, w1, . . . , wa ∈ b(x, ∆/8) and as such Pr[E1 ] = · · · = Pr[Ea ] = 0. Also, note that if
d(x, w k ) < R −t then b(w k , R) contains b(x, t) and as such E k can not happen. Similarly, if d(x, w k ) > R +t
then b(w k , R) ∩ b(x, t) = ∅ and E k can not happen. As such, if E k happen then R − t ≤ d(x, w k ) ≤ R + t.
Namely, if E k happen then R ∈ Ik . Namely, Pr[E k ] = Pr[E k ∩ (R ∈ Ik )] = Pr[R ∈ Ik ] · Pr[E k | R ∈ Ik ].
Now, R is uniformly distributed in the interval [∆/4, ∆/2], and Ik is an interval of length 2t. Thus,
Pr[R ∈ Ik ] ≤ 2t/(∆/4) = 8t/∆.
Next, to bound Pr[E k | R ∈ Ik ], we observe that w1, . . . , w k−1 are closer to x than w k and their distance
to b(x, t) is smaller than R. Thus, if any of them appear before w k in π then E k does not happen. Thus,
Pr[E k | R ∈ Ik ] is bounded by the probability that w k is the first to appear in π out of w1, . . . , w k . But
this probability is 1/k, and thus Pr[E k | R ∈ Ik ] ≤ 1/k.
We are now ready for the kill. Indeed,
Pr[b(x, t) ⊄ P(x)] = Σ_{k=1}^{b′} Pr[E_k] = Σ_{k=a+1}^{b′} Pr[E_k] = Σ_{k=a+1}^{b′} Pr[R ∈ I_k] · Pr[E_k | R ∈ I_k]
≤ Σ_{k=a+1}^{b′} (8t/∆) · (1/k) ≤ (8t/∆) ln(b′/a) ≤ (8t/∆) ln(b/a),
since Σ_{k=a+1}^{b′} 1/k ≤ ∫_a^{b′} dx/x = ln(b′/a) and b′ ≤ b.
23.4. Probabilistic embedding into trees
In this section, given an n-point finite metric (X, d), we would like to embed it into a HST. As mentioned above, one can verify that for any embedding into a HST, the distortion in the worst case is Ω(n). Thus, we define a randomized algorithm that embeds (X, d) into a tree. Let T be the resulting tree, and consider two points x, y ∈ X and the random variable d_T(x, y). We construct the tree T such that distances never shrink; i.e., d(x, y) ≤ d_T(x, y). The probabilistic distortion of this embedding is max_{x,y} E[d_T(x, y)/d(x, y)]. Somewhat surprisingly, one can find such an embedding with logarithmic probabilistic distortion.
Theorem 23.4.1. Given n-point metric (X, d) one can randomly embed it into a 2-HST with probabilis-
tic distortion ≤ 24 ln n.
Proof: The construction is recursive. Compute a random partition of X with cluster diameter diam(P)/2, using the construction of Section 23.3.1. We recursively construct a 2-HST for each cluster, and hang the resulting trees on the root node v, which is marked by ∆_v = diam(P). Clearly, the resulting tree is a 2-HST.
For a node v ∈ T, let X(v) be the set of points of X contained in the subtree of v.
For the analysis, assume diam(P) = 1, and consider two points x, y ∈ X. We consider a node v ∈ T to be in level i if level(v) = ⌈lg ∆_v⌉ = i. The two points x and y correspond to two leaves in T; let û be their least common ancestor in T. We have d_T(x, y) ≤ 2^{level(û)}. Furthermore, note that along a path from a leaf to the root the levels are strictly monotonically increasing.
In fact, we are going to be conservative, and let w be the first ancestor of x, such that b = b(x, d(x, y)) is not completely contained in X(u₁), . . . , X(u_m), where u₁, . . . , u_m are the children of w. Clearly, level(w) ≥ level(û). Thus, d_T(x, y) ≤ 2^{level(w)}.
Consider the path σ from the root of T to x, and let E_i be the event that b is not fully contained in X(v_i), where v_i is the node of σ of level i (if such a node exists). Furthermore, let Y_i be the indicator variable which is 1 if E_i is the first to happen out of the sequence of events E₀, E₋₁, . . .. Clearly, d_T(x, y) ≤ Σ_i Y_i 2^i.
Let t = d(x, y) and j = ⌊lg d(x, y)⌋, and let n_i = |b(x, 2^i)| for i = 0, −1, −2, . . .. We have
E[d_T(x, y)] ≤ Σ_{i=j}^{0} E[Y_i] 2^i ≤ Σ_{i=j}^{0} 2^i Pr[E_i ∩ Ē_{i+1} ∩ ⋯ ∩ Ē_0] ≤ Σ_{i=j}^{0} 2^i · (8t/2^i) · ln(n_i/n_{i−3}),
Theorem 23.4.2. Let (X, d) be an n-point metric space. One can compute in polynomial time a k-
median clustering of X which has expected price O(α log n), where α is the price of the optimal k-median
clustering of (X, d).
Proof: The algorithm is described above, and the fact that its running time is polynomial can be easily
be verified. To prove the bound on the quality of the clustering, for any point p ∈ X, let center(p)
denote the closest point in Copt to p according to d, where Copt is the set of k-medians in the optimal
clustering. Let C be the set of k-medians returned by the algorithm, and let HST be the HST used by
the algorithm. We have
β = ν_C(X, d) ≤ ν_C(X, d_HST) ≤ ν_{C_opt}(X, d_HST) = Σ_{p∈X} d_HST(p, C_opt) ≤ Σ_{p∈X} d_HST(p, center(p)).
Thus, E[β] ≤ Σ_{p∈X} E[d_HST(p, center(p))] ≤ Σ_{p∈X} O(log n) · d(p, center(p)) = O(α log n), by linearity of expectation and Theorem 23.4.1.
Proof: Indeed, let x′ and y′ be the closest points of Y to x and y, respectively. Observe that f(x) = d(x, x′) ≤ d(x, y′) ≤ d(x, y) + d(y, y′) = d(x, y) + f(y), by the triangle inequality. Thus, f(x) − f(y) ≤ d(x, y). By symmetry, we have f(y) − f(x) ≤ d(x, y). Thus, |f(x) − f(y)| ≤ d(x, y).
Proof: Assume that diam(Y) = Φ (i.e., the smallest distance in Y is 1), and let rᵢ = 2^{i−2}, for i = 1, . . . , α, where α = ⌈lg Φ⌉. Let P_{i,j} be a random partition of P with diameter rᵢ, using Theorem 23.4.1, for i = 1, . . . , α and j = 1, . . . , β, where β = ⌈c log n⌉ and c is a large enough constant to be determined shortly.
For each cluster of P_{i,j} randomly toss a coin, and let V_{i,j} be all the points of X that belong to clusters in P_{i,j} that got ’T’ in their coin toss. For a point x ∈ X, let f_{i,j}(x) = d(x, X \ V_{i,j}) = min_{v ∈ X\V_{i,j}} d(x, v), for i = 0, . . . , m and j = 1, . . . , β. Let F : X → R^{(m+1)·β} be the embedding, such that
F(x) = ( f_{0,1}(x), f_{0,2}(x), . . . , f_{0,β}(x), f_{1,1}(x), f_{1,2}(x), . . . , f_{1,β}(x), . . . , f_{m,1}(x), f_{m,2}(x), . . . , f_{m,β}(x) ).
Next, consider two points x, y ∈ X, with distance φ = d(x, y). Let u be an integer such that r_u ≤ φ/2 ≤ r_{u+1}. Clearly, in any of the partitions P_{u,1}, . . . , P_{u,β} the points x and y belong to different clusters. Furthermore, with probability half, x ∈ V_{u,j} and y ∉ V_{u,j}, or x ∉ V_{u,j} and y ∈ V_{u,j}, for 1 ≤ j ≤ β.
Let E_j denote the event that b(x, ρ) ⊆ V_{u,j} and y ∉ V_{u,j}, for j = 1, . . . , β, where ρ = φ/(64 ln n). By Lemma 23.3.1, we have
Pr[b(x, ρ) ⊄ P_{u,j}(x)] ≤ (8ρ/r_u) ln n ≤ φ/(8 r_u) ≤ 1/2.
Thus,
Pr[E_j] ≥ Pr[b(x, ρ) ⊆ P_{u,j}(x)] · Pr[the cluster of x is colored ’T’] · Pr[the cluster of y is colored ’H’] ≥ (1/2) · (1/2) · (1/2) = 1/8,
since those three events are independent. Notice that if E_j happens, then f_{u,j}(x) ≥ ρ and f_{u,j}(y) = 0.
Let X_j be an indicator variable which is 1 if E_j happens, for j = 1, . . . , β. Let Z = Σ_j X_j, and we have µ = E[Z] = E[Σ_j X_j] ≥ β/8. Thus, the probability that only β/16 of E₁, . . . , E_β happen is at most Pr[Z < (1 − 1/2) E[Z]]. By the Chernoff inequality, we have
Pr[Z < (1 − 1/2) E[Z]] ≤ exp(−µ (1/2)²/2) = exp(−µ/8) ≤ exp(−β/64) ≤ 1/n^{10},
if we set c = 640.
On the other hand, |f_{i,j}(x) − f_{i,j}(y)| ≤ d(x, y) = φ = 64ρ ln n. Thus,
‖F(x) − F(y)‖ ≤ √(αβ (64ρ ln n)²) = 64 √(αβ) ρ ln n = √(αβ) · φ.
Thus, setting G(x) = F(x) · (256 ln n)/√β, we get a mapping that maps two points of distance φ from each other to two points with distance in the range [φ, φ · √(αβ) · (256 ln n)/√β] = [φ, 256 √α · φ · ln n]. Namely, G(·) is an embedding with distortion O(√α · ln n) = O(√(ln Φ) · ln n).
The probability that G fails on one of the pairs is smaller than (1/n^{10}) · binom(n, 2) < 1/n^8. In particular, we can check the distortion of G for all binom(n, 2) pairs, and if any of them fail (i.e., the distortion is too big), we restart the process.
Both problems can be overcome with careful tinkering. Indeed, for a resolution ri , we are going to
modify the metric, so that it ignores short distances (i.e., distances ≤ ri /n2 ). Formally, for each resolution
ri , let Gi = (X, E bi ) be the graph where two points x and y are connected if d(x, y) ≤ ri /n2 . Consider a
connected component C ∈ Gi . For any two points x, y ∈ C, we have d(x, y) ≤ n(ri /n2 ) ≤ ri /n. Let Xi
be the set of connected components of Gi , and define the distances between two connected components
C, C 0 ∈ Xi , to be di (C, C 0) = d(C, C 0) = minc∈C,c 0 ∈C 0 d(c, c0).
It is easy to verify that (Xi, di ) is a metric space (see Exercise 23.7.2). Furthermore, we can naturally
embed (X, d) into (Xi, di ) by mapping a point x ∈ X to its connected components in Xi . Essentially (Xi, di )
is a snapped version of the metric (X, d), with the advantage that Φ((X, di )) = O(n2 ). We now embed Xi
into β = O(log n) coordinates. Next, for any point of X we embed it into those β coordinates, by using
the embedding of its connected component in Xi . Let Ei be the embedding for resolution ri . Namely,
Ei (x) = ( fi,1 (x), fi,2 (x), . . . , fi,β (x)), where fi, j (x) = min(di (x, X \ Vi, j ), 2ri ). The resulting embedding is
F(x) = ⊕Ei (x) = (E1 (x), E2 (x), . . . , ).
Since we slightly modified the definition of fi, j (·), we have to show that fi, j (·) is nonexpansive. Indeed,
consider two points x, y ∈ Xi , and observe that
fi, j (x) − fi, j (y) ≤ di (x, Vi, j ) − di (y, Vi, j ) ≤ di (x, y) ≤ d(x, y),
We still have to handle the infinite number of coordinates problem. However, the above proof shows
that we care about a resolution ri (i.e., it contributes to the estimates in the above proof) only if there
is a pair x and y such that ri /n2 ≤ d(x, y) ≤ ri n2 . Thus, for every pair of distances there are O(log n)
relevant resolutions. Thus, there are at most η = O(n2 β log n) = O(n2 log2 n) relevant coordinates, and
we can ignore all the other coordinates. Next, consider the affine subspace h that spans F(P). Clearly,
it is n − 1 dimensional, and consider the projection G : Rη → Rn−1 that projects a point to its closest
¬ Indeed, if f_{i,j}(x) < d_i(x, X \ V_{i,j}) and f_{i,j}(y) < d_i(y, X \ V_{i,j}), then f_{i,j}(x) = 2rᵢ and f_{i,j}(y) = 2rᵢ, which implies the above inequality. If f_{i,j}(x) = d_i(x, X \ V_{i,j}) and f_{i,j}(y) = d_i(y, X \ V_{i,j}), then the inequality trivially holds. The other option is handled in a similar fashion.
point in h. Clearly, G(F(·)) is an embedding with the same distortion for P, and the target space is of
dimension n − 1.
Note, that all this process succeeds with high probability. If it fails, we try again. We conclude:
Theorem 23.5.3 (Low quality Bourgain theorem). Given an n-point metric M, one can embed it into Euclidean space of dimension n − 1, such that the distortion of the embedding is at most O(log^{3/2} n).
Using the Johnson-Lindenstrauss lemma, the dimension can be further reduced to O(log n). In fact,
being more careful in the proof, it is possible to reduce the dimension to O(log n) directly.
23.7. Exercises
Exercise 23.7.1 (Clustering for HST.). Let (X, d) be a HST defined over n points, and let k > 0 be an
integer. Provide an algorithm that computes the optimal k-median clustering of X in O(k 2 n) time.
[Transform the HST into a tree where every node has only two children. Next, run a dynamic
programming algorithm on this tree.]
(a) Give a counter example to the following claim: Let (X, d) be a metric space, and let P be a partition
of X. Then, the pair (P, d0) is a metric, where d0(C, C 0) = d(C, C 0) = min x∈C,y∈C 0 d(x, y) and C, C 0 ∈ P.
(b) Let (X, d) be an n-point metric space, and consider the set U = { i | 2^i ≤ d(x, y) ≤ 2^{i+1} for some x, y ∈ X }. Prove that |U| = O(n). Namely, there are only O(n) different resolutions that “matter” for a finite metric space.
Exercise 23.7.3 (Computing the diameter via embeddings.).
(a) (h:1) Let ` be a line in the plane, and consider the embedding f : R2 → `, which is the projection
of the plane into `. Prove that f is 1-Lipschitz, but it is not K-bi-Lipschitz for any constant K.
(b) (h:3) Prove that one can find a family of projections F of size O(1/√ε), such that for any two points x, y ∈ R², for one of the projections f ∈ F we have d(f(x), f(y)) ≥ (1 − ε)d(x, y).
(c) (h:1) Given a set P of n points in the plane, give an O(n/√ε) time algorithm that outputs two points x, y ∈ P, such that d(x, y) ≥ (1 − ε)diam(P), where diam(P) = max_{z,w∈P} d(z, w) is the diameter of P.
(d) (h:2) Given P, show how to extract, in O(n) time, a set Q ⊆ P of size O(ε −2 ), such that diam(Q) ≥
(1 − ε/2)diam(P). (Hint: Construct a grid of appropriate resolution.)
In particular, give an (1 − ε)-approximation algorithm to the diameter of P that works in O(n + ε −2.5 )
time. (There are slightly faster approximation algorithms known for approximating the diameter.)
Acknowledgments
The presentation in this write-up follows closely the insightful suggestions of Manor Mendel.
Chapter 24
where S^{(n)} is the n-dimensional unit sphere in R^{n+1}. This is an instance of semi-definite programming, which is a special case of convex programming, and can be solved in polynomial time (solved here means approximated within an arbitrary constant in polynomial time). Observe that (P) is a relaxation of (Q), and as such the optimal solution of (P) has value larger than the optimal value of (Q).
The intuition is that vectors that correspond to vertices that should be on one side of the cut, and vertices on the other side, would be mapped to vectors that are far away from each other in (P). Thus, we compute the optimal solution for (P), and we uniformly generate a random vector r on the unit sphere S^{(n)}. This induces a hyperplane h which passes through the origin and is orthogonal to r. We next assign all the vectors that are on one side of h to S, and the rest to the complement of S.
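A hedged sketch of this rounding step in Python (the SDP solving itself is not shown; the vector format and names are assumptions for illustration — a uniformly random direction is obtained from independent Gaussian coordinates):

import random

def round_hyperplane(vectors):
    # vectors: list of unit vectors (tuples) from the SDP relaxation, one per vertex.
    d = len(vectors[0])
    r = [random.gauss(0.0, 1.0) for _ in range(d)]   # random direction
    side = []
    for v in vectors:
        dot = sum(vi * ri for vi, ri in zip(v, r))
        side.append(dot >= 0)        # True: vertex goes to S, False: to its complement
    return side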
24.1.1. Analysis
The intuition of the above rounding procedure, is that with good probability, vectors that have big angle
between them would be separated by this cut.
Lemma 24.1.1. We have Pr[sign(⟨vᵢ, r⟩) ≠ sign(⟨v_j, r⟩)] = (1/π) arccos(⟨vᵢ, v_j⟩).
Proof: Let us think about the vectors vᵢ, v_j and r as being in the plane. To see why this is a reasonable assumption, consider the plane g spanned by vᵢ and v_j, and observe that for the random events we consider, only the direction of r matters, which can be decided by projecting r on g and normalizing it to have length 1. Now, the sphere is symmetric, and as such, sampling r randomly from S^{(n)}, projecting it down to g, and then normalizing it, is equivalent to just choosing uniformly a vector from the unit circle.
Now, sign(⟨vᵢ, r⟩) ≠ sign(⟨v_j, r⟩) happens only if r falls in the double wedge formed by the lines perpendicular to vᵢ and v_j. The angle of this double wedge is exactly the angle between vᵢ and v_j. Now, since vᵢ and v_j are unit vectors, we have ⟨vᵢ, v_j⟩ = cos(τ), where τ = ∠vᵢv_j. Thus,
Pr[sign(⟨vᵢ, r⟩) ≠ sign(⟨v_j, r⟩)] = 2τ/(2π) = (1/π) arccos(⟨vᵢ, v_j⟩),
as claimed.
Theorem 24.1.2. Let W be the random variable which is the weight of the cut generated by the algorithm. We have
E[W] = (1/π) Σ_{i<j} w_{ij} arccos(⟨vᵢ, v_j⟩).
Proof: Let X_{ij} be an indicator variable which is 1 if the edge ij is in the cut. We have
E[X_{ij}] = Pr[sign(⟨vᵢ, r⟩) ≠ sign(⟨v_j, r⟩)] = (1/π) arccos(⟨vᵢ, v_j⟩),
by Lemma 24.1.1. Clearly, W = Σ_{i<j} w_{ij} X_{ij}, and by linearity of expectation, we have
E[W] = Σ_{i<j} w_{ij} E[X_{ij}] = (1/π) Σ_{i<j} w_{ij} arccos(⟨vᵢ, v_j⟩).
Lemma 24.1.3. For −1 ≤ y ≤ 1, we have arccos(y)/π ≥ α · (1 − y)/2, where α = min_{0≤ψ≤π} (2/π) · ψ/(1 − cos(ψ)).
Proof: Set y = cos(ψ). The inequality now becomes ψ/π ≥ α (1 − cos ψ)/2. Reorganizing, the inequality becomes (2/π) · ψ/(1 − cos ψ) ≥ α, which trivially holds by the definition of α.
Lemma 24.1.4. α > 0.87856.
Proof: Using simple calculus, one can see that α achieves its minimum at ψ = 2.331122 . . ., the nonzero root of cos ψ + ψ sin ψ = 1.
Theorem 24.1.5. The above algorithm computes, in expectation, a cut of size α·Opt ≥ 0.87856·Opt, where Opt is the weight of the maximum cut.
Proof: Consider the optimal solution to (P), and let its value be γ ≥ Opt. We have
E[W] = (1/π) Σ_{i<j} w_{ij} arccos(⟨vᵢ, v_j⟩) ≥ Σ_{i<j} w_{ij} α (1 − ⟨vᵢ, v_j⟩)/2 = αγ ≥ α·Opt,
by Lemma 24.1.3.
24.2. Semi-definite programming
Let us define a variable x_{ij} = ⟨vᵢ, v_j⟩, and consider the n by n matrix M formed by those variables, where x_{ii} = 1 for i = 1, . . . , n. Let V be the matrix having v₁, . . . , v_n as its columns. Clearly, M = V^T V. In particular, this implies that for any non-zero vector v ∈ R^n, we have v^T M v = v^T V^T V v = (Vv)^T(Vv) ≥ 0. A matrix that has this property is called positive semidefinite. The interesting thing is that any positive semidefinite matrix P can be represented as a product of a matrix with its transpose; namely, P = B^T B. It is easy to observe that if such a semidefinite matrix has ones on its diagonal, then B has columns which are unit vectors. Thus, if we solve (P) and get back a semidefinite matrix, then we can recover the vectors realizing the solution, and use them for the rounding.
In particular, (P) can now be restated as
(SD) maximize (1/2) Σ_{i<j} w_{ij} (1 − x_{ij})
subject to: x_{ii} = 1 for i = 1, . . . , n,
and (x_{ij})_{i=1,...,n, j=1,...,n} is positive semidefinite.
We are trying to find the optimal value of a linear function over a set which is the intersection of linear constraints and the set of semi-definite matrices.
Lemma 24.2.1. Let U be the set of n × n semidefinite matrices. The set U is convex.
Proof: Consider A, B ∈ U, and observe that for any t ∈ [0, 1], and vector v ∈ Rn , we have: vT (t A + (1 −
t)B)v = tvT Av + (1 − t)vT Bv ≥ 0 + 0 ≥ 0, since A and B are semidefinite.
Positive semidefinite matrices correspond to ellipsoids. Indeed, consider the set x^T A x = 1: the set of
vectors that solve this equation is an ellipsoid. Also, the eigenvalues of a positive semidefinite matrix are
all non-negative real numbers. Thus, given a matrix, we can in polynomial time decide if it is positive
semidefinite or not.
Thus, we are trying to optimize a linear function over a convex domain. There is by now machinery
to approximately solve those problems to within any additive error in polynomial time. This is done by
using interior point method, or the ellipsoid method. See [BV04, GLS93] for more details.
Chapter 25
25.1. Entropy
Definition 25.1.1. The entropy in bits of a discrete random variable X is given by
H(X) = − Σ_x Pr[X = x] lg Pr[X = x].
Equivalently, H(X) = E[ lg (1/Pr[X]) ].
The binary entropy function H(p), for a random binary variable that is 1 with probability p, is H(p) = −p lg p − (1 − p) lg(1 − p). We define H(0) = H(1) = 0.
The function H(p) is concave, symmetric around 1/2 on the interval [0, 1], and achieves its maximum at 1/2. For a concrete example, consider H(3/4) ≈ 0.8113 and H(7/8) ≈ 0.5436. Namely, a coin that has probability 3/4 to be heads has a higher amount of “randomness” in it than a coin that has probability 7/8 for heads.
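A quick numerical check of the values just quoted, as a minimal Python sketch:

from math import log2

def binary_entropy(p):
    # H(p) = -p lg p - (1 - p) lg (1 - p), with H(0) = H(1) = 0.
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

assert abs(binary_entropy(0.75) - 0.8113) < 1e-4
assert abs(binary_entropy(0.875) - 0.5436) < 1e-4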
We have that
H(p) = (1/ln 2) (−p ln p − (1 − p) ln(1 − p)),
and
H′(p) = (1/ln 2) ( −ln p − p/p + ln(1 − p) + (1 − p)/(1 − p) ) = lg ((1 − p)/p).
Deploying our amazing ability to compute derivatives of simple functions once more, we get that
H″(p) = (1/ln 2) · (p/(1 − p)) · ( (p(−1) − (1 − p))/p² ) = −1/(p(1 − p) ln 2).
Since ln 2 ≈ 0.693 > 0, we have that H″(p) ≤ 0, for all p ∈ (0, 1), and H(·) is concave in this range. Also, H′(1/2) = 0, which implies that H(1/2) = 1 is the maximum of the binary entropy. Namely, a balanced coin has the largest amount of randomness in it.
Example 25.1.2. A random variable X that has probability 1/n to be i, for i = 1, . . . , n, has entropy H(X) = − Σ_{i=1}^{n} (1/n) lg (1/n) = lg n.
Note that the entropy is oblivious to the exact values that the random variable can take, and it is sensitive only to the probability distribution. Thus, a random variable that takes the values −1, +1 with equal probability has the same entropy (i.e., 1) as a fair coin.
Lemma 25.1.3. Let X and Y be two independent random variables, and let Z be the random variable (X, Y). Then H(Z) = H(X) + H(Y).
Proof: In the following, summations are over all possible values that the variables can take. By the independence of X and Y we have
H(Z) = Σ_{x,y} Pr[(X, Y) = (x, y)] lg (1/Pr[(X, Y) = (x, y)])
= Σ_{x,y} Pr[X = x] Pr[Y = y] lg (1/(Pr[X = x] Pr[Y = y]))
= Σ_x Σ_y Pr[X = x] Pr[Y = y] lg (1/Pr[X = x]) + Σ_y Σ_x Pr[X = x] Pr[Y = y] lg (1/Pr[Y = y])
= Σ_x Pr[X = x] lg (1/Pr[X = x]) + Σ_y Pr[Y = y] lg (1/Pr[Y = y]) = H(X) + H(Y).
Lemma 25.1.4. Suppose that nq is an integer in the range [0, n]. Then
2^{nH(q)}/(n + 1) ≤ binom(n, nq) ≤ 2^{nH(q)}.
Proof: This trivially holds if q = 0 or q = 1, so assume 0 < q < 1. We know that binom(n, nq) q^{nq}(1 − q)^{n−nq} ≤ (q + (1 − q))^n = 1. As such, since q^{−nq}(1 − q)^{−(1−q)n} = 2^{n(−q lg q − (1−q) lg(1−q))} = 2^{nH(q)}, we have
binom(n, nq) ≤ q^{−nq}(1 − q)^{−(1−q)n} = 2^{nH(q)}.
As for the other direction, let µ(k) = binom(n, k) q^k (1 − q)^{n−k}. We claim that µ(nq) = binom(n, nq) q^{nq}(1 − q)^{n−nq} is the largest term in Σ_{k=0}^{n} µ(k) = 1. Indeed,
∆_k = µ(k) − µ(k + 1) = binom(n, k) q^k (1 − q)^{n−k} ( 1 − ((n − k)/(k + 1)) · (q/(1 − q)) ),
and the sign of this quantity is the sign of (k + 1)(1 − q) − (n − k)q = k + 1 − kq − q − nq + kq = 1 + k − q − nq. Namely, ∆_k ≥ 0 when k ≥ nq + q − 1, and ∆_k < 0 otherwise. Namely, µ(k) < µ(k + 1) for k < nq, and µ(k) ≥ µ(k + 1) for k ≥ nq. Namely, µ(nq) is the largest term in Σ_{k=0}^{n} µ(k) = 1, and as such it is larger than the average. We have µ(nq) = binom(n, nq) q^{nq}(1 − q)^{n−nq} ≥ 1/(n + 1), which implies
binom(n, nq) ≥ (1/(n + 1)) q^{−nq}(1 − q)^{−(n−nq)} = 2^{nH(q)}/(n + 1).
Lemma 25.1.4 can be extended to handle non-integer values of q. This is straightforward, and we
omit the easy details.
Corollary 25.1.5. We have:
(i) For q ∈ [0, 1/2], binom(n, ⌊nq⌋) ≤ 2^{nH(q)}. (ii) For q ∈ [1/2, 1], binom(n, ⌈nq⌉) ≤ 2^{nH(q)}.
(iii) For q ∈ [1/2, 1], 2^{nH(q)}/(n + 1) ≤ binom(n, ⌊nq⌋). (iv) For q ∈ [0, 1/2], 2^{nH(q)}/(n + 1) ≤ binom(n, ⌈nq⌉).
The bounds of Lemma 25.1.4 and Corollary 25.1.5 are loose but sufficient for our purposes. As a sanity check, consider the case when we generate a sequence of n bits using a coin with probability q for heads. By the Chernoff inequality, we will get roughly nq heads in this sequence. As such, the generated sequence Y belongs to a set of roughly binom(n, nq) ≈ 2^{nH(q)} possible sequences that have similar probability. As such, H(Y) ≈ lg binom(n, nq) = nH(q), by Example 25.1.2, a fact that we already know from Lemma 25.1.3.
Theorem 25.1.7. Suppose that the value of a random variable X is chosen uniformly at random from the integers {0, . . . , m − 1}. Then there is an extraction function for X that outputs on average (i.e., in expectation) at least ⌊lg m⌋ − 1 = ⌊H(X)⌋ − 1 independent and unbiased bits.
Proof: We represent m as a sum of unique powers of 2, namely m = Σ_i a_i 2^i, where a_i ∈ {0, 1}. Thus, we decomposed {0, . . . , m − 1} into a disjoint union of blocks that have sizes which are distinct powers of 2. If a number falls inside such a block, we output its relative location in the block, using a binary representation of the appropriate length (i.e., k bits if the block is of size 2^k). The fact that this is an extraction function, fulfilling Definition 25.1.6, is obvious.
Now, observe that the claim holds trivially if m is a power of two. Thus, for the inductive proof, assume that m is not a power of 2, let 2^k be the size of the largest block in the decomposition (that is, 2^k < m < 2^{k+1}), and let u = ⌊lg(m − 2^k)⌋ < k. It is easy to verify that, for any integer α > 2^k, we have (α − 2^k)/α ≤ (α + 1 − 2^k)/(α + 1). Furthermore, m ≤ 2^{u+1} + 2^k. As such, (m − 2^k)/m ≤ 2^{u+1}/(2^{u+1} + 2^k).
Let Y be the random variable which is the number of random bits extracted. We have that
E[Y] ≥ (2^k/m) · k + ((m − 2^k)/m) · (⌊lg(m − 2^k)⌋ − 1) = k + ((m − 2^k)/m)(u − k − 1)
≥ k + (2^{u+1}/(2^{u+1} + 2^k))(u − k − 1) = k − (2^{u+1}/(2^{u+1} + 2^k))(1 + k − u).
If u = k − 1, then E[Y] ≥ k − (1/2) · 2 = k − 1, as required. If u = k − 2 then E[Y] ≥ k − (1/3) · 3 = k − 1. Finally, if u < k − 2 then
E[Y] ≥ k − (2^{u+1}/2^k)(1 + k − u) = k − (k − u + 1)/2^{k−u−1} ≥ k − 1,
since (2 + i)/2^i ≤ 1 for i ≥ 2.
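A minimal Python sketch of the extraction function used in the proof above (decompose {0, . . . , m − 1} into blocks whose sizes are the powers of two in the binary representation of m, and output the offset of X inside its block); the function name is illustrative:

def extract_bits(x, m):
    # x is uniform in {0, ..., m-1}; returns a list of unbiased bits.
    # Blocks are scanned from the highest power of 2 in m downwards.
    offset = 0
    k = m.bit_length() - 1
    while k >= 0:
        if (m >> k) & 1:                   # a block of size 2**k
            if x < offset + (1 << k):
                pos = x - offset           # position inside the block
                return [(pos >> i) & 1 for i in range(k)]   # k unbiased bits
            offset += 1 << k
        k -= 1
    return []

For example, extract_bits(5, 6) returns [1]: the value 5 falls in the block of size 2, yielding a single unbiased bit.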
Theorem 25.1.8. Consider a coin that comes up heads with probability p > 1/2. For any constant
δ > 0 and for n sufficiently large:
1. One can extract, from an input of a sequence of n flips, an output sequence of (1−δ)nH(p) (unbiased)
independent random bits.
2. One can not extract more than nH(p) bits from such a sequence.
Proof: There are binom(n, j) input sequences with exactly j heads, and each has probability p^j (1 − p)^{n−j}. We map such a sequence to the corresponding number in the set {0, . . . , binom(n, j) − 1}. Note that this distribution, conditioned on j, is uniform on this set, and we can apply the extraction algorithm of Theorem 25.1.7. Let Z be the random variable which is the number of heads in the input, and let B be the number of random bits extracted. We have
E[B] = Σ_{k=0}^{n} Pr[Z = k] E[B | Z = k],
and by Theorem 25.1.7, we have E[B | Z = k] ≥ ⌊lg binom(n, k)⌋ − 1 ≥ lg binom(n, k) − 2. Let ε < p − 1/2 be a constant to be determined shortly. For n(p − ε) ≤ k ≤ n(p + ε), we have
binom(n, k) ≥ binom(n, ⌊n(p + ε)⌋) ≥ 2^{nH(p+ε)}/(n + 1),
by Corollary 25.1.5 (iii). We have
E[B] ≥ Σ_{k=⌈n(p−ε)⌉}^{⌊n(p+ε)⌋} Pr[Z = k] E[B | Z = k] ≥ Σ_{k=⌈n(p−ε)⌉}^{⌊n(p+ε)⌋} Pr[Z = k] ( lg binom(n, k) − 2 )
≥ Σ_{k=⌈n(p−ε)⌉}^{⌊n(p+ε)⌋} Pr[Z = k] ( lg (2^{nH(p+ε)}/(n + 1)) − 2 )
= (nH(p + ε) − lg(n + 1) − 2) Pr[|Z − np| ≤ εn]
≥ (nH(p + ε) − lg(n + 1) − 2) ( 1 − 2 exp(−nε²/(4p)) ),
since µ = E[Z] = np and Pr[|Z − np| ≥ (ε/p) pn] ≤ 2 exp(−np(ε/p)²/4) = 2 exp(−nε²/(4p)), by the Chernoff inequality. In particular, fix ε > 0 such that H(p + ε) > (1 − δ/4)H(p). Since p is fixed, nH(p) = Ω(n); in particular, for n sufficiently large, we have −lg(n + 1) − 2 ≥ −(δ/10) nH(p). Also, for n sufficiently large, we have 2 exp(−nε²/(4p)) ≤ δ/10. Putting it together, we have that for n large enough,
E[B] ≥ (1 − δ/4 − δ/10) nH(p) (1 − δ/10) ≥ (1 − δ)nH(p),
as claimed.
As for the upper bound, observe that if an input sequence x has probability q, then the output sequence y = Ext(x) has probability to be generated which is at least q. Now, all sequences of length |y| have equal probability to be generated. Thus, we have the following (trivial) inequality 2^{|Ext(x)|} q ≤ 2^{|Ext(x)|} Pr[y = Ext(X)] ≤ 1, implying that |Ext(x)| ≤ lg(1/q). Thus,
E[B] = Σ_x Pr[X = x] |Ext(x)| ≤ Σ_x Pr[X = x] lg (1/Pr[X = x]) = H(X).
Chapter 26
Entropy II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
The memory of my father is wrapped up in white paper, like sandwiches taken for a day at work. Just as
a magician takes towers and rabbits out of his hat, he drew love from his small body, and the rivers of his
hands overflowed with good deeds.
– – Yehuda Amichai, My Father..
26.1. Compression
In this section, we will consider the problem of how to compress a binary string. We will map each binary
string, into a new string (which is hopefully shorter). In general, by using a simple counting argument,
one can show that no such mapping can achieve real compression (when the inputs are adversarial).
However, the hope is that there is an underlying distribution on the inputs, such that some strings are
considerably more common than others.
Definition 26.1.1. A compression function Compress takes as input a sequence of n coin flips, given as
an element of {H, T }n , and outputs a sequence of bits such that each input sequence of n flips yields a
distinct output sequence.
Note, that this is very weak. Usually, we would like the function to output a prefix code, like the
Huffman code.
Theorem 26.1.3. Consider a coin that comes up heads with probability p > 1/2. For any constant
δ > 0, when n is sufficiently large, the following holds.
(i) There exists a compression function Compress such that the expected number of bits output by
Compress on an input sequence of n independent coin flips (each flip gets heads with probability p)
is at most (1 + δ)nH(p); and
(ii) The expected number of bits output by any compression function on an input sequence of n inde-
pendent coin flips is at least (1 − δ)nH(p).
Proof: Let ε > 0 be a constant such that p − ε > 1/2. The first bit output by the compression procedure is ’1’ if the output string is just a copy of the input (using n + 1 bits overall in the output), and ’0’ if it is compressed. We compress only if the number of ones in the input sequence, denoted by X, is larger than (p − ε)n. By the Chernoff inequality, we know that Pr[X < (p − ε)n] ≤ exp( −nε²/(2p) ).
If there are more than (p − ε)n ones in the input, then, since p − ε > 1/2, we have that

  Σ_{j=⌈n(p−ε)⌉}^{n} \binom{n}{j} ≤ Σ_{j=⌈n(p−ε)⌉}^{n} \binom{n}{⌈n(p − ε)⌉} ≤ (n/2) · 2^{nH(p−ε)},

by Corollary 25.1.5. As such, we can assign each such input sequence a number in the range 0 . . . (n/2) · 2^{nH(p−ε)}, and this requires (with the flag bit) at most 1 + ⌊lg n + nH(p − ε)⌋ bits.
Thus, the expected number of bits output is bounded by

  (n + 1) exp( −nε²/(2p) ) + ( 1 + ⌊lg n + nH(p − ε)⌋ ) ≤ (1 + δ) nH(p),

by setting ε small enough (so that H(p − ε) ≤ (1 + δ/2)H(p)) and taking n sufficiently large. This establishes the upper bound.
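To make the compression scheme of the proof concrete, here is a minimal Python sketch of it. The function names and the particular lexicographic ranking of the high-weight strings are assumptions made for this illustration only; the proof merely needs some injective numbering of those strings. An input with at least ⌈(p − ε)n⌉ ones is encoded by a flag bit ’0’ followed by its rank written in binary; any other input is copied verbatim after a flag bit ’1’.

from math import comb, ceil
import random

def rank_high(bits, thresh):
    # Lexicographic rank of `bits` (a list of 0/1, with 0 < 1) among all
    # strings of the same length that contain at least `thresh` ones.
    n, rank, ones = len(bits), 0, 0
    for i, b in enumerate(bits):
        if b == 1:
            need = max(thresh - ones, 0)      # ones still needed if a 0 were placed here
            rank += sum(comb(n - i - 1, k) for k in range(need, n - i))
            ones += 1
    return rank

def compress(flips, p, eps):
    # Flag bit '1': raw copy of the input; flag bit '0': compressed encoding.
    n = len(flips)
    thresh = ceil((p - eps) * n)
    if sum(flips) < thresh:                   # rare case, by the Chernoff bound
        return '1' + ''.join(map(str, flips))
    total = sum(comb(n, j) for j in range(thresh, n + 1))
    width = max((total - 1).bit_length(), 1)  # roughly lg n + n H(p - eps) bits
    return '0' + format(rank_high(flips, thresh), '0{}b'.format(width))

# Example: a biased sequence compresses to well below n + 1 bits.
random.seed(0)
n, p, eps = 64, 0.9, 0.1
flips = [1 if random.random() < p else 0 for _ in range(n)]
print(len(compress(flips, p, eps)), "bits for", n, "flips")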
As for the lower bound, observe that at least one of the sequences having exactly τ = ⌊(p + ε)n⌋ heads must be compressed into a sequence of length at least

  lg \binom{n}{⌊(p + ε)n⌋} − 1 ≥ lg ( 2^{nH(p+ε)}/(n + 1) ) − 1 = nH(p + ε) − lg(n + 1) − 1 = µ,

by Corollary 25.1.5. Now, any input string with fewer than τ heads has lower probability to be generated. Indeed, for a specific string with α < τ ones, the probability to generate it is p^α (1 − p)^{n−α}, while for a string with exactly τ ones it is p^τ (1 − p)^{n−τ}. Now, observe that

  p^α (1 − p)^{n−α} = p^τ (1 − p)^{n−τ} · ( (1 − p)/p )^{τ−α} < p^τ (1 − p)^{n−τ},

since (1 − p)/p < 1 (as p > 1/2) and τ − α > 0.
Chapter 27
Definition 27.1.1. The input to a binary symmetric channel with parameter p is a sequence of bits
x1, x2, . . . , and the output is a sequence of bits y1, y2, . . . , such that Pr[xi = yi ] = 1 − p independently for
each i.
Translation: every transmitted bit has the same probability of being flipped by the channel. The question is how much information we can send on the channel with this level of noise. Naturally, a channel would have some capacity constraints (say, at most 4,000 bits per second can be sent on the channel), and the question is how to send the largest amount of information so that the receiver can recover the original information sent.
Now, it is important to realize that noise handling is unavoidable in the real world. Furthermore, there are tradeoffs between channel capacity and noise levels (i.e., we might be able to send considerably more bits on the channel, but the probability of flipping a bit (i.e., p) might be much larger). In designing a communication protocol over this channel, we need to figure out the optimal choice as far as the amount of information sent.
Definition 27.1.2. A (k, n) encoding function Enc : {0, 1} k → {0, 1}n takes as input a sequence of k
bits and outputs a sequence of n bits. A (k, n) decoding function Dec : {0, 1}n → {0, 1} k takes as
input a sequence of n bits and outputs a sequence of k bits.
Thus, the sender would use the encoding function to send its message, and the decoder would use the received string (with the noise in it) to recover the sent message. That is, the sender starts with a message of k bits, blows it up to n bits using the encoding function (to get some robustness to noise), and sends it over the (noisy) channel to the receiver. The receiver takes the given (noisy) message of n bits, and uses the decoding function to recover the original k bits of the message.
Naturally, we would like k to be as large as possible (for a fixed n), so that we can send as much
information as possible on the channel. Naturally, there might be some failure probability; that is, the
receiver might be unable to recover the original string, or recover an incorrect string.
The following celebrated result of Shannon¬ in 1948 states exactly how much information can be
sent on such a channel.
Theorem 27.1.3 (Shannon’s theorem.). For a binary symmetric channel with parameter p < 1/2 and for any constants δ, γ > 0, where n is sufficiently large, the following holds:
(i) For any k ≤ n(1 − H(p) − δ) there exist (k, n) encoding and decoding functions such that the probability that the receiver fails to obtain the correct message is at most γ, for every possible k-bit input message.
(ii) There are no (k, n) encoding and decoding functions with k ≥ n(1 − H(p) + δ) such that the probability of decoding correctly is at least γ for a k-bit input message chosen uniformly at random.
Our scheme would be simple. Pick k ≤ n(1 − H(p) − δ). For every number i = 0, . . . , K̂, where K̂ = 2^{k+1} − 1, randomly generate a binary string Yi made out of n bits, each one chosen independently and uniformly. Let Y₀, . . . , Y_{K̂} denote these codewords.
For each of these codewords we will compute the probability that, if we send this codeword, the receiver fails. Let X₀, . . . , X_K, where K = 2^k − 1, be the 2^k codewords with the lowest probability of failure. We assign these words to the 2^k messages we need to encode in an arbitrary fashion. Specifically, for i = 0, . . . , 2^k − 1, we encode i as the string Xi.
The decoding of a received message w is done by going over all the codewords, and finding all the codewords that are within (Hamming) distance in the range [(1 − ε)pn, (1 + ε)pn] from w. If there is exactly one codeword Xi with this property, we return i as the decoded word. Otherwise, if there is no such codeword or there is more than one, then the decoder stops and reports an error.
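The encoding/decoding scheme just described is easy to simulate by brute force for tiny parameters. The sketch below is only an illustration: it omits the pruning step (keeping the 2^k best of the 2^{k+1} random codewords) and simply uses 2^k random codewords directly, and all names and parameter values are choices made for this example.

import random

def make_code(k, n, rng):
    # 2^k random codewords of n uniform, independent bits each.
    return [tuple(rng.randrange(2) for _ in range(n)) for _ in range(2 ** k)]

def bsc(word, p, rng):
    # Binary symmetric channel: flip each bit independently with probability p.
    return tuple(b ^ (rng.random() < p) for b in word)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def decode(received, code, p, eps):
    # Return i if exactly one codeword lies in the "ring" around the received word.
    n = len(received)
    lo, hi = (1 - eps) * p * n, (1 + eps) * p * n
    candidates = [i for i, c in enumerate(code) if lo <= hamming(received, c) <= hi]
    return candidates[0] if len(candidates) == 1 else None   # None = decoder reports an error

rng = random.Random(0)
k, n, p, eps = 4, 200, 0.1, 0.5
code = make_code(k, n, rng)
msg = 7
received = bsc(code[msg], p, rng)
print(decode(received, code, p, eps) == msg)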
¬ Claude Elwood Shannon (April 30, 1916 – February 24, 2001), an American electrical engineer and mathematician, known as “the father of information theory”.
Intuition. Each codeword Yi corresponds to a region that looks like a ring. The “ring” of Yi is the set of all strings within Hamming distance between (1 − ε)r and (1 + ε)r from Yi, where r = pn. Clearly, if we transmit a string Yi and the receiver gets a string inside the ring of Yi, it is natural to try to recover the received string to the original code corresponding to Yi. Naturally, there are two possible bad events here:
(A) The received string is outside the ring of Yi, and
(B) The received string is contained in several rings of different Y’s, and it is not clear which one the receiver should decode the string to.
These bad regions are depicted as the darker regions in the figure on the right.
[Figure: the codewords Y₀, Y₁, Y₂ with their rings of inner radius (1 − ε)pn, outer radius (1 + ε)pn, and width 2εpn.]
Let Si = S(Yi) be the set of all binary strings (of length n) such that, if the receiver gets such a word, it deciphers it to be the original string assigned to Yi (here we are still using the extended set of codewords Y₀, . . . , Y_{K̂}). Note, that if we remove some codewords from consideration, the set S(Yi) just increases in size (i.e., the bad region in the ring of Yi that is covered multiple times shrinks). Let Wi be the probability that Yi was sent, but it was not deciphered correctly. Formally, let r denote the received word. We have that

  Wi = Σ_{r ∉ Si} Pr[ r was received when Yi was sent ].     (27.1)
To bound this quantity, let ∆(x, y) denote the Hamming distance between the binary strings x and y. Clearly, if x was sent, the probability that y was received is

  w(x, y) = p^{∆(x,y)} (1 − p)^{n−∆(x,y)}.

As such, we have

  Wi = Σ_r S_{i,r} w(Yi, r),     (27.2)

where S_{i,r} is an indicator variable that is one if r ∉ Si, and zero otherwise.
The value of Wi is a random variable over the choice of Y₀, . . . , Y_{K̂}. As such, it is natural to ask what the expected value of Wi is.
Consider the ring

  ring(x) = { y ∈ {0, 1}^n : (1 − ε)pn ≤ ∆(x, y) ≤ (1 + ε)pn },

where ε > 0 is a small enough constant. Observe that x ∈ ring(y) if and only if y ∈ ring(x). Suppose that the codeword Yi was sent, and r was received. The decoder returns the original code associated with Yi if Yi is the only codeword that falls inside ring(r).
Lemma 27.2.1. Given that Yi was sent and r was received, and furthermore r ∈ ring(Yi), the probability of the decoder failing is

  τ = Pr[ r ∉ Si | r ∈ ring(Yi) ] ≤ γ/8,

where γ is the parameter of Theorem 27.1.3.
Proof: The decoder fails here only if ring(r) contains some other codeword Yj (j ≠ i) in it. As such,

  τ = Pr[ r ∉ Si | r ∈ ring(Yi) ] ≤ Pr[ Yj ∈ ring(r), for some j ≠ i ] ≤ Σ_{j≠i} Pr[ Yj ∈ ring(r) ].

Now, we remind the reader that the Yj’s are generated by picking each bit randomly and independently, with probability 1/2. As such, we have

  Pr[ Yj ∈ ring(r) ] = |ring(r)| / |{0, 1}^n| = (1/2^n) Σ_{m=⌈(1−ε)np⌉}^{⌊(1+ε)np⌋} \binom{n}{m} ≤ (n/2^n) \binom{n}{⌊(1 + ε)np⌋},

since (1 + ε)p < 1/2 (for ε sufficiently small), and as such the last binomial coefficient in this summation is the largest. By Corollary 25.1.5 (i), we have

  Pr[ Yj ∈ ring(r) ] ≤ (n/2^n) \binom{n}{⌊(1 + ε)np⌋} ≤ (n/2^n) 2^{nH((1+ε)p)} = n 2^{n(H((1+ε)p) − 1)}.
As such, we have

  τ ≤ Σ_{j≠i} Pr[ Yj ∈ ring(r) ] ≤ 2^{k+1} · n 2^{n(H((1+ε)p) − 1)} ≤ n 2^{n(H((1+ε)p) − H(p) − δ) + 1},

since k ≤ n(1 − H(p) − δ). Now, we choose ε to be a small enough constant, so that the quantity H((1 + ε)p) − H(p) − δ is equal to some (absolute) negative constant, say −β, where β > 0. Then, τ ≤ n 2^{−βn+1}, and choosing n large enough, we can make τ smaller than γ/8, as desired. As such, we just proved that

  τ = Pr[ r ∉ Si | r ∈ ring(Yi) ] ≤ γ/8.
Lemma 27.2.2. Consider the situation where Yi is sent, and the received string is r. For n sufficiently large, we have that

  Pr[ r ∉ ring(Yi) ] = Σ_{r ∉ ring(Yi)} w(Yi, r) ≤ γ/8.
Proof: This quantity is the probability of sending Yi, with every bit flipped independently with probability p, and receiving a string r such that more than pn + εpn bits were flipped (or fewer than pn − εpn). This quantity can be bounded using the Chernoff inequality. Indeed, let Z = ∆(Yi, r), and observe that E[Z] = pn and that Z is the sum of n independent indicator variables. As such,

  Σ_{r ∉ ring(Yi)} w(Yi, r) = Pr[ |Z − E[Z]| > εpn ] ≤ 2 exp( −(ε²/4) pn ) ≤ γ/8,

for n sufficiently large.
Lemma 27.2.3. We have that f(Yi) = Σ_{r ∉ ring(Yi)} E[ S_{i,r} w(Yi, r) ] ≤ γ/8 (the expectation is over the choices of the Y’s excluding Yi).
Proof: Observe that S_{i,r} w(Yi, r) ≤ w(Yi, r), and that for fixed Yi and r the quantity w(Yi, r) is not random, so E[w(Yi, r)] = w(Yi, r). As such, we have that

  f(Yi) = Σ_{r ∉ ring(Yi)} E[ S_{i,r} w(Yi, r) ] ≤ Σ_{r ∉ ring(Yi)} E[ w(Yi, r) ] = Σ_{r ∉ ring(Yi)} w(Yi, r) ≤ γ/8,

by Lemma 27.2.2.
Lemma 27.2.4. We have that g(Yi) = Σ_{r ∈ ring(Yi)} E[ S_{i,r} w(Yi, r) ] ≤ γ/8 (the expectation is over all the choices of the Y’s excluding Yi).
Proof: We have that S_{i,r} w(Yi, r) ≤ S_{i,r}, as 0 ≤ w(Yi, r) ≤ 1. As such, we have that

  g(Yi) = Σ_{r ∈ ring(Yi)} E[ S_{i,r} w(Yi, r) ] ≤ Σ_{r ∈ ring(Yi)} E[ S_{i,r} ] = Σ_{r ∈ ring(Yi)} Pr[ r ∉ Si ]
        = Σ_r Pr[ (r ∉ Si) ∩ (r ∈ ring(Yi)) ]
        = Σ_r Pr[ r ∉ Si | r ∈ ring(Yi) ] Pr[ r ∈ ring(Yi) ]
        ≤ Σ_r (γ/8) Pr[ r ∈ ring(Yi) ] ≤ γ/8,

by Lemma 27.2.1.
Lemma 27.2.5. For any i, we have µ = E[Wi] ≤ γ/4, where γ is the parameter of Theorem 27.1.3 and Wi is the probability of failing to recover Yi if it was sent; see Eq. (27.1).
Proof: We have, by Eq. (27.2), that Wi = Σ_r S_{i,r} w(Yi, r). For a fixed value of Yi, we have, by linearity of expectation, that

  E[ Wi | Yi ] = E[ Σ_r S_{i,r} w(Yi, r) | Yi ] = Σ_r E[ S_{i,r} w(Yi, r) | Yi ]
             = Σ_{r ∈ ring(Yi)} E[ S_{i,r} w(Yi, r) | Yi ] + Σ_{r ∉ ring(Yi)} E[ S_{i,r} w(Yi, r) | Yi ] = g(Yi) + f(Yi) ≤ γ/8 + γ/8 = γ/4,

by Lemma 27.2.3 and Lemma 27.2.4. Taking the expectation over the choice of Yi, we conclude that µ = E[Wi] ≤ γ/4.
In the following, we need the following trivial (but surprisingly deep) observation.
Observation 27.2.6. For a random variable X, if E[X] ≤ ψ, then there exists an event in the probability
space, that assigns X a value ≤ ψ.
Lemma 27.2.7. For the codewords X0, . . . , XK , the probability of failure in recovering them when sending
them over the noisy channel is at most γ.
Proof: We just proved that when using Y₀, . . . , Y_{K̂}, the expected probability of failure when sending Yi is E[Wi] ≤ γ/4, where K̂ = 2^{k+1} − 1. As such, the expected total probability of failure is

  E[ Σ_{i=0}^{K̂} Wi ] = Σ_{i=0}^{K̂} E[Wi] ≤ 2^{k+1} · (γ/4) ≤ γ 2^k,

by Lemma 27.2.5. As such, by Observation 27.2.6, there exists a choice of the Yi’s such that

  Σ_{i=0}^{K̂} Wi ≤ 2^k γ.

Now, we use an argument similar to the one used in proving Markov’s inequality. Indeed, the Wi are always positive, and it cannot be that 2^k of them have value larger than γ, because then the summation would give

  Σ_{i=0}^{K̂} Wi > 2^k γ,

which is a contradiction. As such, there are 2^k codewords with failure probability smaller than γ. We set the 2^k codewords X₀, . . . , X_K to be these words, where K = 2^k − 1. Since we picked only a subset of the codewords for our code, the probability of failure for each codeword only shrinks, and is thus at most γ.
Lemma 27.2.7 concludes the proof of the constructive part of Shannon’s theorem.
Chapter 28
LP in d dimensions: (H, c⃗)
H – a set of n closed half spaces in ℝ^d.
c⃗ – a vector in d dimensions.
Find p ∈ ℝ^d s.t. ∀h ∈ H we have p ∈ h and f(p) is maximized, where f(p) = ⟨p, c⃗⟩. Each half space is defined by a linear inequality of the form a₁x₁ + a₂x₂ + · · · + a_d x_d ≤ b.
One difficulty that we ignored earlier, is that the optimal solution for the LP might be unbounded,
see Figure 28.1.
Namely, we can find a solution with value ∞ to the target function.
For a half space h let η(h) denote the normal of h directed into the feasible region. Let µ(h) denote
the closed half space, resulting from h by translating it so that it passes through the origin. Let µ(H)
be the resulting set of half spaces from H. See Figure 28.1 (b).
The new set of constraints µ(H) is depicted in Figure 28.1 (c).
[Figure 28.1: the half spaces h, h₁, h₂, their translates µ(h), µ(h₁), µ(h₂) through the origin, the feasible region of µ(H), the objective direction c⃗, the ray ρ₀, and the hyperplane g.]
Lemma 28.1.1. The LP (H, c⃗) is unbounded if and only if (µ(H), c⃗) is unbounded.
Proof: Consider ρ₀, an unbounded ray in the feasible region of (H, c⃗), and translate it so that the line containing it passes through the origin. Clearly, the translated ray is feasible for µ(H) and is unbounded in the direction of c⃗, and the argument also works in the other direction. See Figure 28.1.
Lemma 28.1.2. Deciding if (µ(H), c⃗) is bounded can be done by solving a (d − 1)-dimensional LP. Furthermore, if it is bounded, then we get a set of d constraints whose intersection proves this. Furthermore, the corresponding set of d constraints in H testifies that (H, c⃗) is bounded.
Proof: Rotate space such that c⃗ is the vector (0, 0, . . . , 0, 1), and consider the hyperplane g ≡ (x_d = 1). Clearly, (µ(H), c⃗) is unbounded if and only if the region g ∩ ⋂_{h∈µ(H)} h is non-empty. Deciding if this region is non-empty is equivalent to solving the following LP: L′ = (H′, (1, 0, . . . , 0)), where

  H′ = { g ∩ h | h ∈ µ(H) }.
[Figure: two examples of the constraints µ(h₁) ∩ g and µ(h₂) ∩ g on the hyperplane g, together with the points p, v_i and v_{i+1}.]
(In the above example, µ(H) ∩ g is infeasible, because the intersection of µ(h₂) ∩ g and µ(h₁) ∩ g is empty, which implies that h₁ ∩ h₂ is bounded in the direction c⃗ which we care about – the positive y direction in this figure.)
We are now ready to describe the algorithm for the LP L = (H, c⃗). By solving a (d − 1)-dimensional LP we decide whether L is unbounded. If it is unbounded, we are done (we also found the unbounded solution, if you go carefully through the details). See Figure 28.3 (a) (in that figure, we computed p). In fact, we just computed a set h₁, . . . , h_d of constraints such that their intersection is bounded in the direction of c⃗ (that is what the boundedness check returned).
Let us randomly permute the remaining half spaces of H, and let h₁, h₂, . . . , h_d, h_{d+1}, . . . , h_n be the resulting permutation. Let v_i be the vertex realizing the optimal solution for the LP

  L_i = ( {h₁, . . . , h_i}, c⃗ ).

There are two possibilities:
1. v_i = v_{i+1}. This means that v_i ∈ h_{i+1}, and this can be checked in constant time.
2. v_i ≠ v_{i+1}. It must be that v_i ∉ h_{i+1}, but then v_{i+1} must lie on the boundary of h_{i+1}; this is what is depicted in Figure 28.3 (b).
Let B be the set of d constraints that define v_{i+1}. If h_{i+1} ∉ B then v_i = v_{i+1}. As such, the probability of v_i ≠ v_{i+1} is roughly d/i, because this is the probability that one of the elements of B is h_{i+1}. Indeed, fix the first i + 1 elements, and observe that there are d elements that are marked (those are the elements of B). Thus, we are asking for the probability that one of the d marked elements is the last one in a random permutation of h_{d+1}, . . . , h_{i+1}, which is exactly d/(i + 1 − d).
Note that if some of the elements of B are among h₁, . . . , h_d, then the above expression only decreases (as there are fewer marked elements).
Well, let us restrict our attention to ∂h_{i+1}. Clearly, the optimal solution to L_{i+1} on ∂h_{i+1} is the required v_{i+1}. Namely, we solve the LP L_{i+1} restricted to ∂h_{i+1} using recursion. This takes T(i + 1, d − 1) time. What is the probability that v_{i+1} ≠ v_i?
Well, one of the d constraints defining v_{i+1} has to be h_{i+1}. The probability for that is ≤ 1 for i ≤ 2d − 1, and it is

  ≤ d/(i + 1 − d)

otherwise.
Summarizing everything, we have:

  T(n, d) = O(n) + T(n, d − 1) + Σ_{i=d+1}^{2d} T(i, d − 1) + Σ_{i=2d+1}^{n} ( d/(i + 1 − d) ) T(i, d − 1).
What is the solution of this monster? Well, one essentially has to guess the solution and verify it. To guess the solution, let us “simplify” (incorrectly) the recursion to:

  T(n, d) = O(n) + T(n, d − 1) + d Σ_{i=2d+1}^{n} T(i, d − 1)/(i + 1 − d).
So think about the recursion tree. Now, every element in the sum is going to contribute a near constant factor, because we divide it by (roughly) i + 1 − d, and also because we are guessing that the optimal solution is linear/near linear.
In every level of the recursion we are going to be penalized by a multiplicative factor of d. Thus, it is natural to conjecture that T(n, d) ≤ (3d)^{3d} n. This can be verified by a tedious substitution into the recurrence, and is left as an exercise.
By the way, we are being a bit conservative about the constant. In fact, one can prove that the running time is O(d! n), which is still exponential in d.
SolveLP((H, c⃗))
    /* initialization */
    Rotate (H, c⃗) s.t. c⃗ = (0, . . . , 0, 1)
    Solve recursively the (d − 1)-dim LP:
        L′ ≡ µ(H) ∩ (x_d = 1)
    if L′ has a solution then
        return “Unbounded”
    return v_n
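The pseudo-code above shows the initialization and the unboundedness check, and then returns v_n; the incremental phase that computes v_n is sketched below for the planar case (d = 2) in Python. This is only a sketch under simplifying assumptions that are not part of the algorithm as described above: a large bounding box is added so that the LP is always bounded and feasible, and degeneracies are ignored. The recursion bottoms out in a one-dimensional LP, exactly as in the analysis.

import random

def solve_1d(intervals, obj):
    # Maximize obj * t subject to constraints alpha * t <= beta.
    lo, hi = float("-inf"), float("inf")
    for alpha, beta in intervals:
        if alpha > 1e-12:
            hi = min(hi, beta / alpha)
        elif alpha < -1e-12:
            lo = max(lo, beta / alpha)
        elif beta < -1e-9:
            return None                       # infeasible
    if lo > hi:
        return None
    return hi if obj >= 0 else lo

def solve_2d(halfplanes, cx, cy, box=1e6):
    # Constraints are half planes a*x + b*y <= c; maximize cx*x + cy*y.
    H = [(1.0, 0.0, box), (-1.0, 0.0, box), (0.0, 1.0, box), (0.0, -1.0, box)]
    x = box if cx >= 0 else -box              # optimum for the bounding box alone
    y = box if cy >= 0 else -box
    hs = list(halfplanes)
    random.shuffle(hs)                        # the random permutation h_{d+1}, ..., h_n
    for a, b, c in hs:
        if a * x + b * y <= c + 1e-9:         # v_i satisfies the new constraint: v_{i+1} = v_i
            H.append((a, b, c))
            continue
        # Otherwise the new optimum lies on the line a*x + b*y = c;
        # solve the induced one-dimensional LP on that line (the recursive call).
        p0 = (0.0, c / b) if abs(b) > abs(a) else (c / a, 0.0)
        dx, dy = -b, a                        # a direction of the line
        intervals = [(a2 * dx + b2 * dy, c2 - (a2 * p0[0] + b2 * p0[1]))
                     for a2, b2, c2 in H]
        t = solve_1d(intervals, cx * dx + cy * dy)
        if t is None:
            return None                       # the LP is infeasible
        x, y = p0[0] + t * dx, p0[1] + t * dy
        H.append((a, b, c))
    return x, y

# Example: maximize x + y inside the unit square; the optimum is (1, 1).
print(solve_2d([(1, 0, 1), (0, 1, 1), (-1, 0, 0), (0, -1, 0)], 1.0, 1.0))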
28.3. References
The description in these class notes is loosely based on the description of low dimensional LP in the book of de Berg et al. [dBCKO08].
Chapter 29
Expanders I
598 - Class notes for Randomized Algorithms
Sariel Har-Peled “Mr. Matzerath has just seen fit to inform me
January 24, 2018 that this partisan, unlike so many of them, was
an authentic partisan. For - to quote the rest of
my patient’s lecture - there is no such thing as a
part-time partisan. Real partisans are partisans
always and as long as they live. They put fallen
governments back in power and overthrow
governments that have just been put in power
with the help of partisans. Mr. Matzerath
contended - and this thesis struck me as
perfectly plausible - that among all those who
go in for politics your incorrigible partisan, who
undermines what he has just set up, is closest to
the artist because he consistently rejects what
he has just created.”
A d-regular graph G = (V, E) with n vertices is a δ-expander if for every set S ⊆ V with |S| ≤ n/2 we have

  e(S, S̄) ≥ δ d |S|,     (29.1)

where

  e(X, Y) = |{ uv ∈ E | u ∈ X, v ∈ Y }|.

A graph is an [n, d, δ]-expander if it is an n-vertex, d-regular δ-expander.
An (n, d)-graph G is a connected d-regular undirected (multi) graph. We will consider the set of vertices of such a graph to be the set ⟦n⟧ = {1, . . . , n}.
For a (multi) graph G with n nodes, its adjacency matrix is an n × n matrix M, where M_{ij} is the number of edges between i and j. It would be convenient to work with the transition matrix Q associated with the random walk on G. If G is d-regular then Q = M(G)/d, and it is doubly stochastic.
A vector x is an eigenvector of a matrix M with eigenvalue µ if xM = µx. In particular, by taking the dot product of both sides with x, we get ⟨xM, x⟩ = ⟨µx, x⟩, which implies µ = ⟨xM, x⟩/⟨x, x⟩. Since the adjacency matrix M of G is symmetric, all its eigenvalues are real numbers (this is a special case of the spectral theorem from linear algebra). Two eigenvectors with different eigenvalues are orthogonal to each other.
We denote the eigenvalues of M by λ₁ ≥ λ₂ ≥ · · · ≥ λ_n, and the eigenvalues of Q by λ̂₁ ≥ λ̂₂ ≥ · · · ≥ λ̂_n. Note, that for a d-regular graph, the eigenvalues of Q are the eigenvalues of M scaled down by a factor of 1/d; that is, λ̂_i = λ_i/d.
Lemma 29.1.1. Let G be an undirected graph, and let ∆ denote the maximum degree in G. Then, λ₁(G) = λ₁(M) = ∆ if and only if some connected component of G is ∆-regular. The multiplicity of ∆ as an eigenvalue is the number of ∆-regular connected components. Furthermore, we have |λ_i(G)| ≤ ∆, for all i.
Proof: The ith entry of M1_n is the degree of the ith vertex v_i of G (i.e., (M1_n)_i = d(v_i)), where 1_n = (1, 1, . . . , 1) ∈ ℝ^n. So, let x be an eigenvector of M with eigenvalue λ, and let x_j ≠ 0 be the coordinate with the largest absolute value among all coordinates of x corresponding to a connected component H of G. We have that

  |λ| |x_j| = |(Mx)_j| = | Σ_{v_i ∈ N(v_j)} x_i | ≤ ∆ |x_j|,

where N(v_j) is the set of neighbors of v_j in G. Thus, all the eigenvalues of G satisfy |λ_i| ≤ ∆, for i = 1, . . . , n.
If λ = ∆, then this implies that x_i = x_j for all v_i ∈ N(v_j), and d(v_j) = ∆. Applying this argument to the vertices of N(v_j) implies that H must be ∆-regular, and furthermore x_i = x_j for all v_i ∈ V(H). Clearly, the dimension of the subspace of eigenvectors with eigenvalue (in absolute value) ∆ is exactly the number of such connected components.
The following is also known. We do not provide a proof since we do not need it in our argumentation.
Lemma 29.1.2. If G is bipartite and λ is an eigenvalue of M(G) with multiplicity k, then −λ is also an eigenvalue of M(G) with multiplicity k.
The tension γ(G) of a graph G = (V, E) is the smallest constant γ such that, for every function f : V → ℝ,

  E_{x,y∈V}[ |f(x) − f(y)|² ] ≤ γ · E_{xy∈E}[ |f(x) − f(y)|² ].     (29.2)

Intuitively, the tension captures how well the variance of a function defined over the vertices of G can be estimated by considering only the edges of G. Note, that a disconnected graph has infinite tension, and the clique has tension 1.
Surprisingly, tension is directly related to expansion, as the following lemma testifies.
Lemma 29.2.2. Let G = (V, E) be a given connected d-regular graph with n vertices. Then, G is a δ-expander, where δ ≥ 1/(2γ(G)) and γ(G) is the tension of G.
Proof: Consider a set S ⊆ V, where |S| ≤ n/2. Let f_S(v) be the function assigning 1 if v ∈ S, and zero otherwise. Observe that if (u, v) ∈ (S × S̄) ∪ (S̄ × S) then |f_S(u) − f_S(v)| = 1, and |f_S(u) − f_S(v)| = 0 otherwise. As such, we have

  2|S|(n − |S|)/n² = E_{x,y∈V}[ |f_S(x) − f_S(y)|² ] ≤ γ(G) E_{xy∈E}[ |f_S(x) − f_S(y)|² ] = γ(G) e(S, S̄)/|E|,

by the definition of tension, Eq. (29.2). Now, since G is d-regular, we have that |E| = nd/2. Furthermore, n − |S| ≥ n/2, which implies that

  e(S, S̄) ≥ 2|E| · |S|(n − |S|) / ( γ(G) n² ) ≥ 2(nd/2)(n/2)|S| / ( γ(G) n² ) = (1/(2γ(G))) d |S|,

which implies the claim (see Eq. (29.1)).
Now, a clique has tension 1, and it has the best expansion possible. As such, the smaller the tension
of a graph, the better expander it is.
Definition 29.2.3. Given a random walk matrix Q associated with a d-regular graph, let B(Q) = ⟨v₁, . . . , v_n⟩ denote the orthonormal eigenvector basis defined by Q. That is, v₁, . . . , v_n is an orthonormal basis for ℝ^n, where all these vectors are eigenvectors of Q and v₁ = 1_n/√n. Furthermore, let λ̂_i denote the ith eigenvalue of Q, associated with the eigenvector v_i, such that λ̂₁ ≥ λ̂₂ ≥ · · · ≥ λ̂_n.
Lemma 29.2.4. Let G = (V, E) be a given connected d-regular graph with n vertices. Then γ(G) = 1/(1 − λ̂₂), where λ̂₂ = λ₂/d is the second largest eigenvalue of Q.
Proof: Let f : V → ℝ. Since in Eq. (29.2) we only look at the difference between two values of f, we can add a constant to f without changing the quantities involved in Eq. (29.2). As such, we assume that E[f(x)] = 0. Then we have that

  E_{x,y∈V}[ |f(x) − f(y)|² ] = E_{x,y∈V}[ (f(x) − f(y))² ] = E_{x,y∈V}[ (f(x))² − 2 f(x) f(y) + (f(y))² ] = 2 E_{x∈V}[ (f(x))² ],     (29.3)

since E_{x,y∈V}[f(x) f(y)] = E[f(x)] · E[f(y)] = 0.
Now, let I be the n × n identity matrix (i.e., one on its diagonal, and zero everywhere else). We have that

  ρ = (1/d) Σ_{xy∈E} (f(x) − f(y))² = (1/d) ( Σ_{x∈V} d (f(x))² − 2 Σ_{xy∈E} f(x) f(y) ) = Σ_{x∈V} (f(x))² − (2/d) Σ_{xy∈E} f(x) f(y)
    = Σ_{x,y∈V} (I − Q)_{xy} f(x) f(y).
Note, that 1_n is an eigenvector of Q with eigenvalue 1, and this is the largest eigenvalue of Q. Let B(Q) = ⟨v₁, . . . , v_n⟩ be the orthonormal eigenvector basis defined by Q, with eigenvalues λ̂₁ ≥ λ̂₂ ≥ · · · ≥ λ̂_n, respectively. Write f = Σ_{i=1}^n α_i v_i, and observe that

  0 = E[f(x)] = Σ_i f(i)/n = ⟨ f, v₁/√n ⟩ = ⟨ Σ_i α_i v_i, v₁/√n ⟩ = (1/√n) ⟨ α₁ v₁, v₁ ⟩ = α₁/√n,

since v_i ⊥ v₁ for i ≥ 2. Hence α₁ = 0, and we have

  ρ = Σ_{x,y∈V} (I − Q)_{xy} f(x) f(y) = Σ_{x,y∈V} (I − Q)_{xy} ( Σ_{i=2}^n α_i v_i(x) ) ( Σ_{j=1}^n α_j v_j(y) )
    = Σ_{i,j} α_i α_j Σ_{x∈V} v_i(x) Σ_{y∈V} (I − Q)_{xy} v_j(y).

Now,

  Σ_{y∈V} (I − Q)_{xy} v_j(y) = ⟨ xth row of (I − Q), v_j ⟩ = ( (I − Q) v_j )(x) = (1 − λ̂_j) v_j(x),
since v_j is an eigenvector of Q with eigenvalue λ̂_j. Since v₁, . . . , v_n is an orthonormal basis, and f = Σ_{i=1}^n α_i v_i, we have that ‖f‖² = Σ_j α_j². Going back to ρ, we have that

  ρ = Σ_{i,j} α_i α_j Σ_{x∈V} v_i(x) (1 − λ̂_j) v_j(x) = Σ_{i,j} α_i α_j (1 − λ̂_j) Σ_{x∈V} v_i(x) v_j(x)
    = Σ_{i,j} α_i α_j (1 − λ̂_j) ⟨ v_i, v_j ⟩ = Σ_{j=1}^n α_j² (1 − λ̂_j) ⟨ v_j, v_j ⟩
    ≥ (1 − λ̂₂) Σ_{j=2}^n α_j² Σ_{x∈V} (v_j(x))² = (1 − λ̂₂) Σ_{j=2}^n α_j² = (1 − λ̂₂) ‖f‖² = (1 − λ̂₂) Σ_{x∈V} (f(x))²     (29.4)
    = n (1 − λ̂₂) E_{x∈V}[ (f(x))² ].
As such, using Eq. (29.3), we have

  E_{x,y∈V}[ |f(x) − f(y)|² ] = 2 E_{x∈V}[ (f(x))² ] ≤ 2ρ / ( n (1 − λ̂₂) ) = (1/(1 − λ̂₂)) · (1/|E|) Σ_{xy∈E} (f(x) − f(y))² = (1/(1 − λ̂₂)) E_{xy∈E}[ |f(x) − f(y)|² ],

since |E| = nd/2. This implies that γ(G) ≤ 1/(1 − λ̂₂). Observe that the only inequality in our analysis arose in Eq. (29.4); if we take f = v₂, then the inequality there holds with equality, which implies that γ(G) ≥ 1/(1 − λ̂₂), which implies the claim.
Lemma 29.2.2, together with the above lemma, implies that the expansion δ of a d-regular graph G is at least 1/(2γ(G)) = (1 − λ₂/d)/2, where λ₂ is the second eigenvalue of the adjacency matrix of G. Since the tension of a graph is a direct function of its second eigenvalue, we can argue either about the tension of a graph or about its second eigenvalue when bounding the graph expansion.
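The bound δ ≥ (1 − λ₂/d)/2 is easy to evaluate numerically. The snippet below is only an illustration: the example graph (a cycle with an antipodal perfect matching, making it 3-regular) is an arbitrary choice and is not claimed to be a good expander.

import numpy as np

n = 20                                            # must be even for the matching below
M = np.zeros((n, n))
for i in range(n):
    M[i, (i + 1) % n] = M[(i + 1) % n, i] = 1     # cycle edges
    M[i, (i + n // 2) % n] = 1                    # antipodal perfect matching
d = int(M.sum(axis=1)[0])                         # the graph is d-regular (d = 3)
eigs = np.sort(np.linalg.eigvalsh(M))[::-1]       # real eigenvalues, in decreasing order
lam2 = eigs[1]
print("lambda_1 =", eigs[0], " lambda_2 =", lam2)
print("expansion lower bound (1 - lambda_2/d)/2 =", (1 - lam2 / d) / 2)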
Chapter 30
Expanders II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled Be that as it may, it is to night school that I owe
January 24, 2018 what education I possess; I am the first to own
that it doesn’t amount to much, though there is
something rather grandiose about the gaps in it.
The proof of the following lemma is similar to the proof of Lemma 29.2.4. The proof is provided for
the sake of completeness, but there is little new in it.
Lemma 30.1.2. Let G = (V, E) be a connected d-regular graph with n vertices. Then γ₂(G) = 1/(1 − λ̂), where λ̂ = λ̂(G) = max( λ̂₂, −λ̂_n ), and λ̂_i is the ith largest eigenvalue of the random walk matrix associated with G.
Proof: Let Q be the matrix associated with the random walk on G (each entry is either zero or 1/d). We have

  ρ = E_{(x→y)∈Ê}[ |f(x) − g(y)|² ] = (1/(nd)) Σ_{(x→y)∈Ê} (f(x) − g(y))² = (1/n) Σ_{x,y∈V} Q_{xy} (f(x) − g(y))²
    = (1/n) Σ_{x∈V} ( (f(x))² + (g(x))² ) − (2/n) Σ_{x,y∈V} Q_{xy} f(x) g(y).
Let B(Q) = ⟨v₁, . . . , v_n⟩ be the orthonormal eigenvector basis defined by Q (see Definition 29.2.3), with eigenvalues λ̂₁ ≥ λ̂₂ ≥ · · · ≥ λ̂_n, respectively. Write f = Σ_{i=1}^n α_i v_i and g = Σ_{i=1}^n β_i v_i. Since E[f(x)] = 0, we have that α₁ = 0. Now, Q_{xy} = Q_{yx}, and we have

  Σ_{x,y∈V} Q_{xy} f(x) g(y) = Σ_{x,y∈V} Q_{yx} ( Σ_i α_i v_i(x) ) ( Σ_j β_j v_j(y) ) = Σ_{i,j} α_i β_j Σ_{y∈V} v_j(y) Σ_{x∈V} Q_{yx} v_i(x)
    = Σ_{i,j} α_i β_j Σ_{y∈V} v_j(y) λ̂_i v_i(y) = Σ_{i,j} α_i β_j λ̂_i ⟨ v_j, v_i ⟩ = Σ_{i=2}^n α_i β_i λ̂_i Σ_{y∈V} (v_i(y))²
    ≤ λ̂ Σ_{i=2}^n ( (α_i² + β_i²)/2 ) Σ_{y∈V} (v_i(y))² ≤ (λ̂/2) Σ_{i=1}^n Σ_{y∈V} ( (α_i v_i(y))² + (β_i v_i(y))² )
    = (λ̂/2) Σ_{y∈V} ( (f(y))² + (g(y))² ).
As such,

  E_{(x→y)∈Ê}[ |f(x) − g(y)|² ] = (1/(nd)) Σ_{(x→y)∈Ê} |f(x) − g(y)|² = (1/n) Σ_{y∈V} ( (f(y))² + (g(y))² ) − (2/n) Σ_{x,y∈V} Q_{xy} f(x) g(y)
    ≥ (1/n) Σ_{y∈V} ( (f(y))² + (g(y))² ) − (2/n) · (λ̂/2) Σ_{y∈V} ( (f(y))² + (g(y))² )
    = (1 − λ̂) ( E_{y∈V}[ (f(y))² ] + E_{y∈V}[ (g(y))² ] )
    = (1 − λ̂) E_{x,y∈V}[ |f(x) − g(y)|² ],

by Eq. (30.2). This implies that γ₂(G) ≤ 1/(1 − λ̂). Again, by trying either f = g = v₂, or f = v_n and g = −v_n, we get that the inequality above holds with equality, which implies γ₂(G) ≥ 1/(1 − λ̂). Together, the claim now follows.
Our main interest would be in the second largest eigenvalue of M. Formally, let

  λ₂(G) = max_{ x ⊥ 1_n, x ≠ 0 } ⟨xM, x⟩ / ⟨x, x⟩.
We state the following result, but do not prove it, since we do not need it for our nefarious purposes (however, we did prove the left side of the inequality).
Theorem 30.2.2. Let G be a δ-expander with adjacency matrix M, and let λ₂ = λ₂(G) be the second-largest eigenvalue of M. Then

  (1/2) (1 − λ₂/d) ≤ δ ≤ √( 2 (1 − λ₂/d) ).
What the above theorem says is that the expansion of an [n, d, δ]-expander is a function of how far its second eigenvalue (i.e., λ₂) is from its first eigenvalue (i.e., d). This is usually referred to as the spectral gap.
We will start by explicitly constructing an expander that has “many” edges, and then we will show how to reduce its degree till it becomes a constant degree expander.
(iv) Identity: There exist two distinct special elements 0, 1 ∈ F, such that ∀x ∈ F it holds that x + 0 = x and x · 1 = x.
(v) Inverse: For these two distinct special elements 0, 1 ∈ F, we have that ∀x ∈ F there exists an element −x ∈ F, such that x + (−x) = 0.
Similarly, ∀x ∈ F, x ≠ 0, there exists an element y = x⁻¹ = 1/x ∈ F such that x · y = 1.
Let q = 2^t, and let r > 0 be an integer. Consider the finite field F_q. It is the field of polynomials of degree at most t − 1, where the coefficients are over ℤ₂ (i.e., all calculations are done modulo 2). Formally, consider the polynomial

  p(x) = x^t + x + 1,

and assume that t is such that p(x) is irreducible over F₂ = {0, 1} (note that having no roots in F₂, i.e., p(0) ≠ 0 and p(1) ≠ 0, is necessary but not sufficient for this; any irreducible polynomial of degree t would do). We can now do polynomial arithmetic over polynomials (with coefficients from F₂), where we do the calculations modulo p(x). Note, that any irreducible polynomial of degree t yields the same field, up to isomorphism. Intuitively, we are introducing the t distinct roots of p(x) into F₂ by creating an extension field of F₂ with those roots.
An element of F_q = F_{2^t} can be interpreted as a binary string b = b₀b₁ . . . b_{t−1} of length t, where the corresponding polynomial is

  poly(b) = Σ_{i=0}^{t−1} b_i x^i.
The nice property of F_q is that addition can be interpreted as a xor operation. That is, for any x, y ∈ F_q, we have that x + y + y = x and x − y − y = x. The key properties of F_q that we need are that multiplication and addition can be computed in time polynomial in t, and that it is a field (i.e., each non-zero element has a unique inverse).
For more details on this field, see any standard text on abstract algebra.
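To illustrate how cheap this arithmetic is, here is a sketch of multiplication in F_{2^t}, with elements represented as t-bit integers whose bits are the polynomial coefficients. The concrete modulus below (t = 4 and p(x) = x⁴ + x + 1, which is irreducible over F₂) is a choice made for this example only.

T = 4
MOD = 0b10011            # x^4 + x + 1

def gf_add(a, b):
    return a ^ b         # addition in F_{2^t} is coefficient-wise xor

def gf_mul(a, b):
    # Carry-less multiplication of the two polynomials.
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # Reduce modulo MOD (degree T), from the highest coefficient down.
    for deg in range(prod.bit_length() - 1, T - 1, -1):
        if prod & (1 << deg):
            prod ^= MOD << (deg - T)
    return prod

# Example: x * (x^3 + 1) = x^4 + x = (x + 1) + x = 1 modulo x^4 + x + 1.
print(gf_mul(0b0010, 0b1001))   # prints 1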
The vertex set of the graph LD(q, r) is G = F_q^{r+1}. For α = (α₀, . . . , α_r) ∈ G and x, y ∈ F_q, let

  ρ(α, x, y) = α + y · (1, x, x², . . . , x^r) = (α₀ + y, α₁ + yx, α₂ + yx², . . . , α_r + yx^r) ∈ G,

and let the edge set be

  E = { αβ | α ∈ G, x, y ∈ F_q, β = ρ(α, x, y) }.

Note, that this graph is well defined, as ρ(β, x, y) = α (since −y = y in F_q). The degree of a vertex of LD(q, r) is |F_q|² = q², and LD(q, r) has N = |G| = q^{r+1} = 2^{t(r+1)} = 2^n vertices.
Theorem 30.2.3. For any t > 0, r > 0 and q = 2^t, where r < q, we have that LD(q, r) is a graph with q^{r+1} vertices. Furthermore, λ₁(LD(q, r)) = q², and λ_i(LD(q, r)) ≤ rq, for i = 2, . . . , N.
In particular, if r ≤ q/2, then LD(q, r) is a [q^{r+1}, q², 1/4]-expander.
Proof: Let M be the N × N adjacency matrix of LD(q, r). Let L : F_q → {0, 1} be a linear map which is onto. It is easy to verify that |L⁻¹(0)| = |L⁻¹(1)|.¬
We are interested in the eigenvalues of the matrix M. To this end, we consider vectors in ℝ^N. The ith row and ith column of M are associated with a unique element b_i ∈ G. As such, for a vector v ∈ ℝ^N, we denote by v[b_i] the ith coordinate of v. In particular, for α = (α₀, . . . , α_r) ∈ G, let v_α ∈ ℝ^N denote the vector whose β = (β₀, . . . , β_r) ∈ G coordinate is

  v_α[β] = (−1)^{ L( Σ_{i=0}^r α_i β_i ) }.

¬ Indeed, if Z = L⁻¹(0) and L(x) = 1, then L(y) = 1 for all y ∈ U = { x + z | z ∈ Z }. Now, it is clear that |Z| = |U|.
Let V = { v_α | α ∈ G }. For α ≠ α′ ∈ G, observe that

  ⟨ v_α, v_{α′} ⟩ = Σ_{β∈G} v_α[β] v_{α′}[β] = Σ_{β∈G} (−1)^{ L( Σ_i α_i β_i ) + L( Σ_i α′_i β_i ) } = Σ_{β∈G} (−1)^{ L( Σ_i (α_i + α′_i) β_i ) }.

So, consider ψ = α + α′ ≠ 0. Assume, for the simplicity of exposition, that all the coordinates of ψ are non-zero. We have, by the linearity of L, that

  ⟨ v_α, v_{α′} ⟩ = Σ_{β∈G} (−1)^{ L( Σ_i ψ_i β_i ) } = Π_{i=0}^{r} Σ_{β_i∈F_q} (−1)^{ L( ψ_i β_i ) }.

However, since ψ_r ≠ 0, we have { ψ_r β_r | β_r ∈ F_q } = F_q. Thus, the summation Σ_{β_r∈F_q} (−1)^{L(ψ_r β_r)} has |L⁻¹(0)| terms that are +1 and |L⁻¹(1)| terms that are −1. As such, this summation is zero, implying that ⟨v_α, v_{α′}⟩ = 0. Namely, the vectors of V are orthogonal.
Observe, that for α, β, ψ ∈ G, we have v_α[β + ψ] = v_α[β] v_α[ψ]. For α ∈ G, consider the vector M v_α. We have, for β ∈ G, that

  (M v_α)[β] = Σ_{ψ∈G} M_{βψ} · v_α[ψ] = Σ_{x,y∈F_q} v_α[ ρ(β, x, y) ] = Σ_{x,y∈F_q} v_α[ β + y(1, x, . . . , x^r) ] = v_α[β] ( Σ_{x,y∈F_q} v_α[ y(1, x, . . . , x^r) ] ).

Thus, setting λ(α) = Σ_{x,y∈F_q} v_α[ y(1, x, . . . , x^r) ] ∈ ℝ, we have that M v_α = λ(α) · v_α. Namely, v_α is an eigenvector, with eigenvalue λ(α).
Let p_α(x) = Σ_{i=0}^r α_i x^i, and let

  λ(α) = Σ_{x∈F_q} Σ_{y∈F_q} (−1)^{ L( y p_α(x) ) }.

If p_α(x) = 0 then (−1)^{L(y p_α(x))} = 1, for all y. As such, each such x contributes q to λ(α).
If p_α(x) ≠ 0 then y p_α(x) takes all the values of F_q as y varies, and as such L(y p_α(x)) is 0 for half of these values and 1 for the other half, implying that such terms contribute 0 to λ(α). But p_α(x) is a polynomial of degree at most r, and as such there can be at most r values of x for which the first case happens. As such, if α ≠ 0 then λ(α) ≤ rq. If α = 0 then λ(α) = q², which implies the theorem.
This construction provides an expander with constant degree only if the number of vertices is a constant. Indeed, if we want an expander with constant degree, we have to take q to be as small as possible. We get the relation N = q^{r+1} ≤ q^q, since r < q, which implies that q = Ω(log N/log log N). Now, the expander of Theorem 30.2.3 is q²-regular, which means that it is not going to provide us with a constant degree expander.
However, we are going to use it as our building block in a construction that starts with this expander and inflates it up to the desired size.
Chapter 31
It is easy to verify that the low quality expander of Theorem 30.2.3 has this property. It is also easy to verify that the complete graph can easily be made into having a consistent labeling (exercise). These two graphs will be sufficient for our construction.
One can think about e₁ as a small “zig” step in H, e₂ as a long “zag” step in G, and finally e₃ as a “zig” step in H.
Another way of representing a zig-zag-zig path v₁v₂v₃v₄ starting at the vertex v₁ = (i, v) ∈ V(F) is to parameterize it by two integers ℓ, ℓ′ ∈ ⟦d⟧, where

  v₁ = (i, v),  v₂ = (i, v_H[ℓ]),  v₃ = (i_G[v_H[ℓ]], v_H[ℓ]),  v₄ = (i_G[v_H[ℓ]], (v_H[ℓ])_H[ℓ′]).
Let Z be the set of all (unordered) pairs of vertices of K connected by such a zig-zag-zig path. Note, that every vertex (i, v) of K has d² such paths having (i, v) as an end point. Consider the graph F = (V(K), Z). The graph F has nD vertices, and it is d²-regular. Furthermore, since we shortcut all these zig-zag-zig paths in K, the graph F is (intuitively) a much better expander than K. We will refer to the graph F as the zig-zag product of G and H.
Definition 31.1.1. The zig-zag product of an (n, D)-graph G and a (D, d)-graph H is the (nD, d²)-graph F = G z H, whose set of vertices is ⟦n⟧ × ⟦D⟧, and where for any i ∈ ⟦n⟧, v ∈ ⟦D⟧, and ℓ, ℓ′ ∈ ⟦d⟧ we have in F the edge connecting the vertex (i, v) with the vertex (i_G[v_H[ℓ]], (v_H[ℓ])_H[ℓ′]).
Remark 31.1.2. We need the resulting zig-zag graph to have consistent labeling. For the sake of simplicity
of exposition, we are just going to assume this property.
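Definition 31.1.1 translates almost verbatim into code, assuming (as the remark above does) that both graphs come with a consistent labeling; here this is represented by neighbor tables G_nbr and H_nbr with G_nbr[G_nbr[u][j]][j] = u (and similarly for H). The representation and names are assumptions of this sketch.

def zigzag_product(G_nbr, H_nbr):
    # Vertices of the product are pairs (u, p): u a vertex of the (n, D)-graph G,
    # p a "port" in [D], i.e. a vertex of the (D, d)-graph H.
    n, D, d = len(G_nbr), len(H_nbr), len(H_nbr[0])
    edges = []                         # multigraph: keep multiplicities
    for u in range(n):
        for p in range(D):
            for l in range(d):         # zig: short step inside the cloud of u
                q = H_nbr[p][l]
                w = G_nbr[u][q]        # zag: long step in G along port q
                for l2 in range(d):    # zig: short step inside the cloud of w
                    r = H_nbr[q][l2]
                    edges.append(((u, p), (w, r)))
    # Each edge appears in this list once from each of its endpoints, so the
    # list has n*D*d^2 entries and the product graph is d^2-regular on n*D vertices.
    return edges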
Proof: Let G = (⟦n⟧, E) be an (n, D)-graph and H = (⟦D⟧, E′) be a (D, d)-graph. Fix any function f : ⟦n⟧ × ⟦D⟧ → ℝ, and observe that

  ψ = E_{u,v∈⟦n⟧, k,ℓ∈⟦D⟧}[ |f(u, k) − f(v, ℓ)|² ] = E_{k,ℓ∈⟦D⟧}[ E_{u,v∈⟦n⟧}[ |f(u, k) − f(v, ℓ)|² ] ]
    ≤ γ₂(G) E_{k,ℓ∈⟦D⟧}[ E_{uv∈E(G)}[ |f(u, k) − f(v, ℓ)|² ] ] = γ₂(G) E_{k,ℓ∈⟦D⟧}[ E_{u∈⟦n⟧, p∈⟦D⟧}[ |f(u, k) − f(u[p], ℓ)|² ] ] = γ₂(G) ∆₁,

where ∆₁ denotes the last expectation.
Now,

  ∆₁ = E_{u∈⟦n⟧}[ E_{k,p∈⟦D⟧, ℓ∈⟦D⟧}[ |f(u, k) − f(u[p], ℓ)|² ] ] ≤ E_{u∈⟦n⟧}[ γ₂(H) E_{kp∈E(H), ℓ∈⟦D⟧}[ |f(u, k) − f(u[p], ℓ)|² ] ]
     = γ₂(H) E_{u∈⟦n⟧, p∈⟦D⟧, ℓ∈⟦D⟧, j∈⟦d⟧}[ |f(u, p[j]) − f(u[p], ℓ)|² ] = γ₂(H) ∆₂,

where ∆₂ denotes the last expectation.
Now,

  ∆₂ = E_{j∈⟦d⟧, ℓ∈⟦D⟧}[ E_{u∈⟦n⟧, p∈⟦D⟧}[ |f(u, p[j]) − f(u[p], ℓ)|² ] ] = E_{j∈⟦d⟧, ℓ∈⟦D⟧}[ E_{v∈⟦n⟧, p∈⟦D⟧}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ]
     = E_{j∈⟦d⟧, v∈⟦n⟧}[ E_{p,ℓ∈⟦D⟧}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ]
     ≤ γ₂(H) E_{j∈⟦d⟧, v∈⟦n⟧}[ E_{pℓ∈E(H)}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ] = γ₂(H) ∆₃,

where ∆₃ denotes the last expectation (the substitution v = u[p] is valid since, by the consistent labeling, u = v[p]).
Now, we have

  ∆₃ = E_{j∈⟦d⟧, v∈⟦n⟧}[ E_{p∈⟦D⟧, i∈⟦d⟧}[ |f(v[p], p[j]) − f(v, p[i])|² ] ] = E_{(u,k)(v,ℓ)∈E(G z H)}[ |f(u, k) − f(v, ℓ)|² ],

as (v[p], p[j]) is adjacent to (v[p], p) (a short edge), which is in turn adjacent to (v, p) (a long edge), which is adjacent to (v, p[i]) (a short edge). Namely, (v[p], p[j]) and (v, p[i]) form the endpoints of a zig-zag path in the replacement product of G and H. That is, these two endpoints are connected by an edge in the zig-zag product graph. Furthermore, it is easy to verify that each zig-zag edge gets accounted for in this representation exactly once from each of its endpoints, implying the above equality. Thus, we have ψ ≤ γ₂(G) (γ₂(H))² ∆₃, which implies the claim.
The second claim follows by similar argumentation.
31.1.3. Squaring
The last component in our construction is squaring a graph. Given an (n, d)-graph G, consider the multigraph G² formed by connecting any two vertices connected in G by a path of length 2. Clearly, if M is the adjacency matrix of G, then the adjacency matrix of G² is the matrix M². Note, that (M²)_{ij} is the number of distinct paths of length 2 in G from i to j. Note, that the new graph might have self loops, which does not affect our analysis, so we keep them in.
Lemma 31.1.4. Let G be an (n, d)-graph. The graph G² is an (n, d²)-graph. Furthermore, γ₂(G²) = (γ₂(G))² / (2γ₂(G) − 1).
Proof: The matrix of G² is Q², and its eigenvalues are (λ̂₁(G))², . . . , (λ̂_n(G))². As such, we have that

  λ̂(G²) = max( λ̂₂(G²), −λ̂_n(G²) ).

Now, λ̂₁(G²) = 1. If |λ̂₂(G)| ≥ |λ̂_n(G)| then λ̂(G²) = λ̂₂(G²) = (λ̂₂(G))² = (λ̂(G))². If |λ̂₂(G)| < |λ̂_n(G)| then λ̂(G²) = λ̂₂(G²) = (λ̂_n(G))² = (λ̂(G))².
Thus, in either case, λ̂(G²) = (λ̂(G))². Now, by Lemma 30.1.2, γ₂(G) = 1/(1 − λ̂(G)), which implies that λ̂(G) = 1 − 1/γ₂(G), and thus

  γ₂(G²) = 1/(1 − λ̂(G²)) = 1/(1 − (λ̂(G))²) = 1/( (1 − λ̂(G)) (1 + λ̂(G)) ) = γ₂(G)/( 2 − 1/γ₂(G) ) = (γ₂(G))²/( 2γ₂(G) − 1 ).
Let G₀ be any graph whose square is the complete graph over n₀ = N + 1 vertices. Observe that G₀² is d⁴-regular. Set G_i = (G_{i−1})² z H. Clearly, the graph G_i has

  n_i = n_{i−1} N

vertices. The graph (G_{i−1})² z H is d²-regular. As for the bi-tension, let α_i = γ₂(G_i). We have that

  α_i ≤ (γ₂(H))² (α_{i−1})²/(2α_{i−1} − 1) = (32/25)² (α_{i−1})²/(2α_{i−1} − 1) ≤ 1.64 (α_{i−1})²/(2α_{i−1} − 1).

In particular, as long as α_{i−1} ≤ 5, we have α_i ≤ 1.64 · 25/9 < 5; as such, by induction (the base case being easily verified), α_i ≤ 5 for all i.
Theorem 31.1.5. For any i ≥ 0, one can compute deterministically a graph G_i with n_i = (d⁴ + 1) d^{4i} vertices, which is d²-regular, where d = 256. The graph G_i is a (1/10)-expander.
Proof: The construction is described above. As for the expansion, since the bi-tension bounds the tension of a graph, we have that γ(G_i) ≤ γ₂(G_i) ≤ 5. Now, by Lemma 29.2.2, we have that G_i is a δ-expander, where δ ≥ 1/(2γ(G_i)) ≥ 1/10.
31.3. Exercises
Exercise 31.3.1 (Expanders made easy.). By considering a random bipartite three-regular graph on 2n vertices, obtained by picking three random permutations between the two sides of the bipartite graph, prove that there is a c > 0 such that for every n there exists a (2n, 3, c)-expander. (What is the value of c in your construction?)
Exercise 31.3.2 (Is your consistency in vain?). In the construction, we assumed that the graphs we
are dealing with when building expanders have consistent labeling. This can be enforced by working
with bipartite graphs, which implies modifying the construction slightly.
(A) Prove that a d-regular bipartite graph always has a consistent labeling (hint: consider matchings
in this graph).
(C) Let G be an (n, D)-graph and let H be a (D, d)-graph. Prove that if G is bipartite then G z H is bipartite.
(D) Describe in detail a construction of an expander that is: (i) bipartite, and (ii) has consistent labeling
at every stage of the construction (prove this property if necessary). For the ith graph in your series,
what is its vertex degree, how many vertices it has, and what is the quality of expansion it provides?
Acknowledgements
Much of the presentation follows suggestions by Manor Mendel. He also contributed some of the figures.
Chapter 32
Miscellaneous Prerequisite
598 - Class notes for Randomized Algorithms
Sariel Har-Peled Be that as it may, it is to night school that I owe
January 24, 2018 what education I possess; I am the first to own
that it doesn’t amount to much, though there is
something rather grandiose about the gaps in it.
– The tin drum, Gunter Grass

The purpose of this chapter is to remind the reader (and the author) about some basic definitions and results in mathematics used in the text. The reader should refer to standard texts for further details.
32.1. Linear algebra
In the following, we cover some material from linear algebra. Proofs of these facts can be found in any text on linear algebra, for example [Leo98].
For a matrix M, let MT denote the transposed matrix. We remind the reader that for two matrices
M and B, we have (MB)T = BT MT . Furthermore, for any three matrices M, B, and C, we have (MB)C =
M(BC).
A matrix M ∈ ℝ^{n×n} is symmetric if Mᵀ = M. All the eigenvalues of a symmetric matrix are real numbers. A symmetric matrix M is positive definite if xᵀMx > 0, for all x ∈ ℝⁿ, x ≠ 0. Among other things, this implies that M is non-singular. If M is symmetric, then it is positive definite if and only if all its eigenvalues are positive numbers.
In particular, if M is symmetric positive definite, then det(M) > 0. Since all the eigenvalues of a positive definite matrix are positive real numbers, the following holds, as can be easily verified.
Claim 32.1.2. A symmetric matrix M is positive definite if and only if there exists a matrix B such
that M = BT B and B is not singular.
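Claim 32.1.2 is easy to probe numerically; the snippet below (illustrative only) builds M = BᵀB from a random matrix B, which is non-singular with probability one, and checks that all eigenvalues of M are positive.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))        # with probability 1 this matrix is non-singular
M = B.T @ B                            # symmetric positive definite, by Claim 32.1.2
eigs = np.linalg.eigvalsh(M)           # real eigenvalues of a symmetric matrix
print("symmetric:", np.allclose(M, M.T))
print("eigenvalues:", eigs)
print("positive definite:", np.all(eigs > 0))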
Lemma 32.1.3. Given a simplex △ in ℝ^d with vertices v₁, . . . , v_d, v_{d+1} (or equivalently △ = CH(v₁, . . . , v_{d+1})), the volume of this simplex is the absolute value of (1/d!) C, where C is the value of the determinant

  C = det( 1    1    · · ·   1
           v₁   v₂   · · ·   v_{d+1} ).

In particular, for a triangle with vertices at (x, y), (x′, y′), and (x″, y″), its area is the absolute value of

  (1/2) det( 1   x    y
             1   x′   y′
             1   x″   y″ ).
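For instance, the triangle-area formula can be evaluated directly; the following snippet is only an illustration of Lemma 32.1.3 in the plane.

import numpy as np

def triangle_area(p, q, r):
    # Area of the triangle pqr via the determinant formula of Lemma 32.1.3.
    C = np.array([[1.0, p[0], p[1]],
                  [1.0, q[0], q[1]],
                  [1.0, r[0], r[1]]])
    return abs(np.linalg.det(C)) / 2.0

print(triangle_area((0, 0), (1, 0), (0, 1)))   # prints 0.5 (up to roundoff)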
For any vector v⃗ ∈ V, we have that affine(V) = v⃗ + linear(V − v⃗), where V − v⃗ = { v⃗′ − v⃗ | v⃗′ ∈ V }.
Lemma 32.1.6. The convex hull of n points in the plane can be computed in O(n log n) time.
Lemma 32.1.7. The lower and upper envelopes of n lines in the plane can be computed in O(n log n)
time.
32.2. Calculus
Lemma 32.2.1. For x ∈ (−1, 1), we have ln(1 + x) = x − x²/2 + x³/3 − x⁴/4 + · · · = Σ_{i=1}^{∞} (−1)^{i+1} x^i/i.
Proof: (A) Let f(x) = 1 + x and g(x) = exp(x). Observe that f(0) = g(0) = 1. Now, for x ≥ 0, we have that f′(x) = 1 and g′(x) = exp(x) ≥ 1. As such, f(x) ≤ g(x) for x ≥ 0. Similarly, for x < 0, we have g′(x) = exp(x) < 1, which implies that f(x) ≤ g(x).
(B) This is immediate from (A).
(C) Observe that exp(1) ≤ 1 + 2 · 1 and exp(0) = 1 + 2 · 0. By the convexity of exp(x), it follows that exp(x) ≤ 1 + 2x for all x ∈ [0, 1].
(D) Observe that (i) exp(−2(1/2)) = 1/e ≤ 1/2 = 1 − (1/2), (ii) exp(−2 · 0) = 1 ≤ 1 − 0, (iii) (exp(−2x))′ = −2 exp(−2x), and (iv) (exp(−2x))″ = 4 exp(−2x) ≥ 0 for all x. As such, exp(−2x) is a convex function and the claim follows.
Lemma 32.2.3. For 1 > ε > 0 and y ≥ 1, we have that (ln y)/ε ≤ log_{1+ε} y ≤ 2 (ln y)/ε.
Proof: By Lemma 32.2.2, 1 + x ≤ exp(x) ≤ 1 + 2x for x ∈ [0, 1]. This implies that ln(1 + x) ≤ x ≤ ln(1 + 2x). As such,

  log_{1+ε} y = ln y / ln(1 + ε) = ln y / ln(1 + 2(ε/2)) ≤ ln y / (ε/2) = 2 (ln y)/ε.

The other inequality follows in a similar fashion.
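A quick numerical check of Lemma 32.2.3 (illustrative only; the sampled values of ε and y are arbitrary):

import math

for eps in (0.01, 0.1, 0.5):
    for y in (2.0, 10.0, 1e6):
        val = math.log(y) / math.log(1 + eps)      # log base (1 + eps) of y
        assert math.log(y) / eps <= val <= 2 * math.log(y) / eps
print("bounds hold for the sampled values")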
Bibliography
[AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cam-
bridge, 1999.
[ABKU00] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. SIAM J. Comput.,
29(1):180–200, 2000.
[Ach01] D. Achlioptas. Database-friendly random projections. In Proc. 20th ACM Sympos. Princi-
ples Database Syst. (PODS), pages 274–281, 2001.
[AHY07] P. Agarwal, S. Har-Peled, and H. Yu. Embeddings of surfaces, curves, and moving points
in Euclidean space. In Proc. 23rd Annu. Sympos. Comput. Geom. (SoCG), pages 381–389,
2007.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application
to the k-server problem. SIAM J. Comput., 24(1):78–100, February 1995.
[AN04] N. Alon and A. Naor. Approximating the cut-norm via Grothendieck's inequality. In Proc.
36th Annu. ACM Sympos. Theory Comput. (STOC), pages 72–80, 2004.
[AR94] N. Alon and Y. Roichman. Random Cayley graphs and expanders. Random Struct. Algo-
rithms, 5(2):271–285, 1994.
[Aro98] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric
problems. J. Assoc. Comput. Mach., 45(5):753–782, September 1998.
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley InterScience, 2nd edition,
2000.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application.
In Proc. 37th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 183–193, October
1996.
[Bar98] Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th Annu. ACM
Sympos. Theory Comput. (STOC), pages 161–168, 1998.
[BM58] G. E.P. Box and M. E. Muller. A note on the generation of random normal deviates. Ann.
Math. Stat., 28:610–611, 1958.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.
[CCH09] C. Chekuri, K. L. Clarkson., and S. Har-Peled. On the set multi-cover problem in geometric
settings. In Proc. 25th Annu. Sympos. Comput. Geom. (SoCG), pages 341–350, 2009.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in
geometry. Combinatorica, 10(3):229–249, 1990.
[Cha01] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University
Press, New York, 2001.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Tech-
nical Report PCS-TR90-147, Dept. Math. Comput. Sci., Dartmouth College, Hanover, NH,
1986.
[CKR01] G. Calinescu, H. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension
problem. In Proc. 12th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 8–16, 2001.
[dBS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Internat. J. Comput. Geom.
Appl., 5:343–355, 1995.
[DG03] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss.
Rand. Struct. Alg., 22(3):60–65, 2003.
[DP09] Devdatt P. Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis
of Randomized Algorithms. Cambridge University Press, 2009.
[DS00] P. G. Doyle and J. L. Snell. Random walks and electric networks. ArXiv Mathematics
e-prints, 2000.
[Gar02] R. J. Gardner. The Brunn-Minkowski inequality. Bull. Amer. Math. Soc., 39:355–405,
2002.
[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Opti-
mization, volume 2 of Algorithms and Combinatorics. Springer-Verlag, Berlin Heidelberg,
2nd edition, 1993.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest
pair problems. Nordic J. Comput., 2:3–27, 1995.
[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis, University of California, Berkeley,
2000.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput.,
29(6):2016–2039, 2000.
[Har11] S. Har-Peled. Geometric Approximation Algorithms, volume 173 of Math. Surveys & Mono-
graphs. Amer. Math. Soc., Boston, MA, USA, 2011.
[Hås01] J. Håstad. Some optimal inapproximability results. J. Assoc. Comput. Mach., 48(4):798–
859, 2001.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin
Amer. Math. Soc., 43:439–561, 2006.
[HR15] Sariel Har-Peled and Benjamin Raichel. Net and prune: A linear time algorithm for Eu-
clidean distance problems. J. Assoc. Comput. Mach., 62(6):44:1–44:35, December 2015.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom.,
2:127–151, 1987.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse
of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), pages
604–613, 1998.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proc. 42nd
Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 10–31, 2001. Tutorial.
[Kel56] J. L. Kelly. A new interpretation of information rate. Bell Sys. Tech. J., 35(4):917–926, jul
1956.
[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for
max cut and other 2-variable csps. In Proc. 45th Annu. IEEE Sympos. Found. Comput.
Sci. (FOCS), pages 146–154, 2004. To appear in SICOMP.
[KKT91] C. Kaklamanis, D. Krizanc, and T. Tsantilas. Tight bounds for oblivious routing in the
hypercube. Math. sys. theory, 24(1):223–232, 1991.
[KLMN05] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: A new embedding
method for finite metric spaces. Geom. funct. anal. (GAFA), 15(4):839–858, 2005.
[KPW92] J. Komlós, J. Pach, and G. Woeginger. Almost tight bounds for ε-nets. Discrete Comput.
Geom., 7:163–173, 1992.
[Leo98] S. J. Leon. Linear Algebra with Applications. Prentice Hall, 5th edition, 1998.
[LLS01] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of
learning. J. Comput. Syst. Sci., 62(3):516–527, 2001.
[Mag07] A. Magen. Dimensionality reductions in `2 that preserve volumes and distance to affine
spaces. Discrete Comput. Geom., 38(1):139–153, 2007.
[Mat90] J. Matoušek. Bi-Lipschitz embeddings into low-dimensional Euclidean spaces. Comment.
Math. Univ. Carolinae, 31:589–600, 1990.
[Mat92] J. Matoušek. Reporting points in halfspaces. Comput. Geom. Theory Appl., 2(3):169–186,
1992.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20:427–448,
1998.
[Mat99] J. Matoušek. Geometric Discrepancy. Springer, 1999.
[Mat02] J. Matoušek. Lectures on Discrete Geometry, volume 212 of Grad. Text in Math. Springer,
2002.
[McD89] C. McDiarmid. Surveys in Combinatorics, chapter On the method of bounded differences.
Cambridge University Press, 1989.
[Mil76] G. L. Miller. Riemann’s hypothesis and tests for primality. J. Comput. Sys. Sci., 13(3):300–
317, 1976.
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Press, 1998.
[MN08] M. Mendel and A. Naor. Towards a calculus for non-linear spectral gaps. manuscript, 2008.
[MOO05] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influ-
ences invariance and optimality. In Proc. 46th Annu. IEEE Sympos. Found. Comput. Sci.
(FOCS), pages 21–30, 2005.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, Cam-
bridge, UK, 1995.
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and
probabilistic analysis. Cambridge, 2005.
[Mul89] K. Mulmuley. An efficient algorithm for hidden surface removal. Comput. Graph.,
23(3):379–388, 1989.
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press,
1998.
[PA95] J. Pach and P. K. Agarwal. Combinatorial Geometry. John Wiley & Sons, 1995.
[Rab80] M. O. Rabin. Probabilistic algorithm for testing primality. J. Number Theory, 12(1):128–
138, 1980.
[RVW02] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph product,
and new constant-degree expanders and extractors. Annals Math., 155(1):157–187, 2002.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Appli-
cations. Cambridge University Press, New York, 1995.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. In J. Pach, editor, New
Trends in Discrete and Computational Geometry, volume 10 of Algorithms and Combina-
torics, pages 37–68. Springer-Verlag, 1993.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput.,
12(2):191–201, 2003.
[Smi00] M. Smid. Closest-point problems in computational geometry. In J.-R. Sack and J. Urrutia,
editors, Handbook of Computational Geometry, pages 877–935. Elsevier, Amsterdam, The
Netherlands, 2000.
[Ste12] E. Steinlight. Why novels are redundant: Sensation fiction and the overpopulation of
literature. ELH, 79(2):501–535, 2012.
[Tót03] C. D. Tóth. A note on binary plane partitions. Discrete Comput. Geom., 30(1):3–16, 2003.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies
of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.
[Wes01] D. B. West. Introduction to Graph Theory. Prentice Hall, 2nd edition, 2001.
[WG75] H. W. Watson and F. Galton. On the probability of the extinction of families. J. Anthrop.
Inst. Great Britain, 4:138–144, 1875.
Index
P, 24 edge, 103, 171
RP, 24 effective resistance, 117
ZPP, 25 eigenvalue, 237
conditional expectation, 73 eigenvector, 237
conditional probability, 13, 29 electrical network, 117
confidence, 20 elementary event, 13, 79
conflict graph, 173 embedding, 101, 141, 204
conflict list, 173 entropy, 217
congruent modulo n, 188 binary, 217
consistent labeling, 247 epochs, 121
contraction estimate, 146
edge, 30 Euler totient function, 190
convex hull, 253 event, 13
coprime, 188 expander
cover time, 116 [n, d, δ]-expander, 237
covering radius, 169 [n, d, c]-expander, 125
critical, 23 c, 125
crossing number, 101, 160 expectation, 13, 39
cut, 29
face, 171
minimum, 29
faces, 103
cuts, 29
field, 192, 243
cutting, 181
filter, 79
cyclic, 191
filtration, 79
defining set, 176 final strong component, 112
Delaunay fragment, 16
circle, 177 fully explicit, 127
triangle, 177
generator, 191
dependency graph, 93
geometric distribution, 36
dimension
graph
combinatorial, 176 d-regular, 237
dual shattering, 152 labeled, 122
shattering, 149 lollipop, 116
dual, 152 grid, 21
discrepancy, 159 grid cell, 21
compatible, 160 grid cluster, 21
cross, 160 ground set, 145
distortion, 141, 204 group, 188, 190
distributivity of multiplication over addition, 192 growth function, 148
Doob martingale, 81 gcd, 187
double factorial, 136
doubly stochastic, 117 harmonic number, 15
dual heavy
range space, 152 t-heavy, 179
shatter function, 152 Hierarchically well-separated tree, 205
shattering dimension, 152 history, 111
Dyck words, 111 hitting time, 116
Hoeffding’s inequality, 70 median, 138
HST, 205 memorylessness property, 111
HST, 205, 207, 208 merge, 162
hypercube, 61 metric, 203
metric space, 203–211
identity element, 190 mincut, 29
independent, 14 Minkowski sum, 131
pairwise, 49 modulo
wise equivalent, 49
k, 49 moments technique, 184
indicator variable, 14 all regions, 176
inequality
Hoeffding, 70 net
isoperimetric, 134 ε-net, 155
irreducible, 112 ε-net theorem, 155
isoperimetric inequality, 134 NP
complete, 86
Jacobi symbol, 196
oblivious, 61
Kelly criterion, 57 Ohm’s law, 117
Kirchhoff’s law, 117 open ball, 203
OR-concentrator, 89
Law of quadratic reciprocity, 196 order, 191
lazy randomized incremental construction, 185 orthonormal eigenvector basis, 239
Legendre symbol, 194
level, 103 periodicity, 112
k-level, 103 prime, 187
linear subspace, 254 prime factorization, 190
Linearity of expectation, 14 probabilistic distortion, 207
Lipschitz, 138, 204 probabilities, 13
bi-Lipschitz, 204 Probability
Lipschitz condition, 81 Amplification, 32
lollipop graph, 116 probability, 13
long, 248 probability measure, 13, 79
lcm, 187 probability space, 13, 79
problem
Markov chain, 110 MAX-SAT, 86–88
aperiodic, 112
ergodic, 113 quadratic residue, 194
martingale, 81 quotation
edge exposure, 75 – From Gustible’s Planet, Cordwainer Smith,
vertex exposure, 75 85
martingale difference, 80 – The Glass Bead Game, Hermann Hesse, 93
martingale sequence, 74 – Yehuda Amichai, My Father., 223, 225
matrix — Dirk Gently’s Holistic Detective Agency, Dou-
positive definite, 253 glas Adams., 79
symmetric, 253 — Yehuda Amichai, Tourists, 99
measure, 145 –Romain Gary, The talent scout., 217
Anonymous, 105 sphere
Carl XIV Johan, King of Sweden, 231 surface area, 136
quotient, 188 spread, 208
quotient group, 191 squaring, 250
standard deviation, 35
Radon’s theorem, 147 state
random incremental construction, 172, 173, 176, aperiodic, 112
181 ergodic, 113
lazy, 185 non null, 111
random sample, 103, 155–158, 162, 164, 168, 174, null persistent, 111
175, 178–180, 182, 184 periodic, 112
relative (p, ε)-approximation, 168 persistent, 111
via discrepancy, 170 transient, 111
ε-sample, 155 state probability vector, 112
random variable, 13, 207 stationary distribution, 112
random walk, 105 stochastic, 117
randomized rounding, 87 stopping set, 176
range, 145 strong component, 112
range space, 145 sub martingale, 80
dual, 152 subgraph
primal, 152 unique, 96
projection, 146 subgroup, 190
rank, 42 super martingale, 80
relative pairwise distance, 126
remainder, 188 tension, 238
replacement product, 248 theorem
residue, 188 ε-net, 155
resistance, 117, 121 Radon’s, 147
ε-sample, 155
sample transition matrix, 125, 237
ε-sample, 155 transition probabilities matrix, 110
ε-sample theorem, 155 transition probability, 110
ε-sample, 155 traverse, 122
sample space, 13 Turing machine
semidefinite, 215 log space, 122
set
union bound, 36
defining, 176 uniqueness, 23
stopping, 176 universal traversal sequence, 122
shallow cuttings, 185
shatter function, 149 variance, 35
dual, 152 VC
shattered, 146 dimension, 146
shattering dimension, 149 vertex, 103
short, 248 vertical decomposition, 172
sketch, 161 vertex, 172
sketch and merge, 162, 163, 170 vertical trapezoid, 172
spectral gap, 126, 243 vertices, 171
volume, 132
ball, 136
simplex, 254
walk, 122
weight
region, 176
width, 21
zig-zag, 249
zig-zag product, 249
zig-zag-zig path, 248