
Class notes for Randomized Algorithms

Sariel Har-Peled*

January 24, 2018

* Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801,
USA; sariel@illinois.edu; http://sarielhp.org/. Work on this paper was partially supported by a NSF
CAREER award CCR-0132901.
Contents

Contents 3

1 Introduction, Quick Sort and BSP 11


1.1 What are randomized algorithms? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.1 The benefits of unpredictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.2 Back to randomized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.3 Randomized vs average-case analysis . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Basic probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Formal basic definitions: Sample space, σ-algebra, and probability . . . . . . . . . 13
1.2.2 Expectation and conditional probability . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 QuickSort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Binary space partition (BSP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 BSP for disjoint segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Extra: QuickSelect running time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Verifying Identities, Changing Minimum, Closest Pair and Some Complexity 19


2.1 Verifying equality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1.1 Amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 How many times can a minimum change? . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Closest Pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Las Vegas and Monte Carlo algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Complexity Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Min Cut 27
3.1 Branching processes – Galton-Watson Process . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.2 On coloring trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Min Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1.1 The probability of success. . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.1.2 Running time analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 A faster algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 The Occupancy and Coupon Collector problems 35


4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Geometric distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.2 Some needed math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Occupancy Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 The Probability of all bins to have exactly one ball . . . . . . . . . . . . . . . . . 38
4.3 The Markov and Chebyshev’s inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 The Coupon Collector’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5 Sampling, Estimation, and More on the Coupon’s Collector Problems II 41


5.1 Randomized selection – Using sampling to learn the world . . . . . . . . . . . . . . . . . 41
5.1.1 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1.1 Inverse estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.2 Randomized selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 The Coupon Collector’s Problem Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Some technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.2 Back to the coupon collector’s problem . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.3 An asymptotically tight bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Sampling and other Stuff 49


6.1 Two-Point Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.1 About Modulo Rings and Pairwise Independence . . . . . . . . . . . . . . . . . . 49
6.1.1.1 Generating k-wise independent variable . . . . . . . . . . . . . . . . . . 51
6.1.2 Application: Using less randomization for a randomized algorithm . . . . . . . . . 51
6.2 QuickSort is quick via direct argumentation . . . . . . . . . . . . . . . . . . . . . . . . . 52

7 Concentration of Random Variables – Chernoff’s Inequality 53


7.1 Concentration of mass and Chernoff’s inequality . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.1 Example: Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.2 A restricted case of Chernoff inequality via games . . . . . . . . . . . . . . . . . . 53
7.1.2.1 Chernoff games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.2.2 Chernoff’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1.2.3 Some low level boring calculations . . . . . . . . . . . . . . . . . . . . . 58
7.1.3 Chernoff Inequality - A Special Case – the classical proof . . . . . . . . . . . . . . 58
7.2 Applications of Chernoff’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2.1 QuickSort is Quick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2.2 How many times can the minimum change? . . . . . . . . . . . . . . . . . . . . . 60
7.2.3 Routing in a Parallel Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.2.3.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.2.4 Faraway Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3 The Chernoff Bound — General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3.1 A More Convenient Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7.4 A special case of Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.4.1 Some technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.5 Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

8 Martingales 73
8.1 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.1.2.1 Examples of martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.1.2.2 Azuma’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

9 Martingales II 79
9.1 Filters and Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
9.2.1 Martingales – an alternative definition . . . . . . . . . . . . . . . . . . . . . . . . 81
9.3 Occupancy Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.3.1 Lets verify this is indeed an improvement . . . . . . . . . . . . . . . . . . . . . . . 83
9.4 Some useful estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

10 The Probabilistic Method 85


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10.2 Maximum Satisfiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

11 The Probabilistic Method II 89


11.1 Expanding Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
11.2 Probability Amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
11.3 Oblivious routing revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

12 The Probabilistic Method III 93


12.1 The Lovász Local Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.2 Application to k-SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.2.1 An efficient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
12.2.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

13 The Probabilistic Method IV 99


13.1 The Method of Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
13.2 A Short Excursion into Combinatorics via the Probabilistic Method . . . . . . . . . . . . 100
13.2.1 High Girth and High Chromatic Number . . . . . . . . . . . . . . . . . . . . . . . 100
13.2.2 Crossing Numbers and Incidences . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
13.2.3 Bounding the at most k-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

14 Random Walks I 105


14.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
14.1.1 Walking on grids and lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
14.1.1.1 Walking on the line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

14.1.1.2 Walking on two dimensional grid . . . . . . . . . . . . . . . . . . . . . . 106
14.1.1.3 Walking on three dimensional grid . . . . . . . . . . . . . . . . . . . . . 106

15 Random Walks II 109


15.1 The 2SAT example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
15.1.1 Solving 2SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
15.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

16 Random Walks III 115


16.1 Random Walks on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
16.2 Electrical networks and random walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
16.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

17 Random Walks IV 121


17.1 Cover times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
17.1.1 Rayleigh’s Short-cut Principle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
17.2 Graph Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
17.2.1 Directed graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
17.3 Graphs and Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
17.4 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

18 Random Walks V 125


18.1 Rapid mixing for expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
18.1.1 Bounding the mixing time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
18.2 Probability amplification by random walks on expanders . . . . . . . . . . . . . . . . . . 127
18.2.1 The analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
18.3 Some standard inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

19 The Johnson-Lindenstrauss Lemma 131


19.1 The Brunn-Minkowski inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.1 The Isoperimetric Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
19.2 Measure Concentration on the Sphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
19.2.1 The strange and curious life of the hypersphere . . . . . . . . . . . . . . . . . . . 136
19.2.2 Measure Concentration on the Sphere . . . . . . . . . . . . . . . . . . . . . . . . . 136
19.3 Concentration of Lipschitz Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
19.4 The Johnson-Lindenstrauss Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
19.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
19.7 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

20 On Complexity, Sampling, and ε-Nets and ε-Samples 145


20.1 VC dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
20.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
20.1.1.1 Halfspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
20.2 Shattering dimension and the dual shattering dimension . . . . . . . . . . . . . . . . . . 148
20.2.1 The dual shattering dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
20.2.1.1 Mixing range spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
20.3 On ε-nets and ε-sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

20.3.1 ε-nets and ε-samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
20.3.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.3.2.1 Range searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.3.2.2 Learning a concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20.3.2.3 A naive proof of the ε-sample theorem. . . . . . . . . . . . . . . . . . . . 157
20.3.3 A quicky proof of the ε-net theorem (Theorem 20.3.4) . . . . . . . . . . . . . . . 158
20.4 Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
20.4.1 Building ε-sample via discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
20.4.1.1 Faster deterministic construction of ε-samples. . . . . . . . . . . . . . . 162
20.4.2 Building ε-net via discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.5 Proof of the ε-net theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
20.6 A better bound on the growth function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
20.7 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
20.7.1 Variants and extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
20.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

21 Sampling and the Moments Technique 171


21.1 Vertical decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
21.1.1 Randomized incremental construction (RIC) . . . . . . . . . . . . . . . . . . . . . 173
21.1.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
21.1.2 Backward analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
21.2 General settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
21.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
21.2.1.1 Examples of the general framework . . . . . . . . . . . . . . . . . . . . . 177
21.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
21.2.2.1 On the probability of a region to be created . . . . . . . . . . . . . . . . 178
21.2.2.2 On exponential decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
21.2.2.3 Bounding the moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
21.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
21.3.1 Analyzing the RIC algorithm for vertical decomposition . . . . . . . . . . . . . . . 181
21.3.2 Cuttings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
21.4 Bounds on the probability of a region to be created . . . . . . . . . . . . . . . . . . . . . 182
21.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
21.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

22 Primality testing 187


22.1 Number theory background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
22.1.1 Modulo arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
22.1.1.1 Prime and coprime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
22.1.1.2 Computing gcd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
22.1.1.3 The Chinese remainder theorem . . . . . . . . . . . . . . . . . . . . . . . 189
22.1.1.4 Euler totient function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
22.1.2 Structure of the modulo group ZZn . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
22.1.2.1 Some basic group theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
22.1.2.2 Subgroups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
22.1.2.3 Cyclic groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
22.1.2.4 Modulo group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

22.1.2.5 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
22.1.2.6 ZZ∗p is cyclic for prime numbers . . . . . . . . . . . . . . . . . . . . . . . 192
22.1.2.7 ZZ∗n is cyclic for powers of a prime . . . . . . . . . . . . . . . . . . . . . . 193
22.1.3 Quadratic residues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
22.1.3.1 Quadratic residue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
22.1.3.2 Legendre symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
22.1.3.3 Jacobi symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
22.1.3.4 Jacobi(a, n): Computing the Jacobi symbol . . . . . . . . . . . . . . . . 198
22.1.3.5 Subgroups induced by the Jacobi symbol . . . . . . . . . . . . . . . . . . 198
22.2 Primality testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
22.2.1 Distribution of primes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
22.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

23 Finite Metric Spaces and Partitions 203


23.1 Finite Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
23.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
23.2.1 Hierarchical Tree Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
23.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
23.3 Random Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
23.3.1 Constructing the partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
23.3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
23.4 Probabilistic embedding into trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
23.4.1 Application: approximation algorithm for k-median clustering . . . . . . . . . . . 207
23.5 Embedding any metric space into Euclidean space . . . . . . . . . . . . . . . . . . . . . . 208
23.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
23.5.2 The unbounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
23.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
23.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

24 Approximate Max Cut 213


24.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
24.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
24.2 Semi-definite programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
24.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

25 Entropy, Randomness, and Information 217


25.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
25.1.1 Extracting randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
25.2 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

26 Entropy II 223
26.1 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
26.2 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

27 Entropy III - Shannon’s Theorem 225
27.1 Coding: Shannon’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
27.2 Proof of Shannon’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.1 How to encode and decode efficiently . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.1.1 The scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.1.2 The proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
27.2.2 Lower bound on the message size . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
27.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

28 Low Dimensional Linear Programming 231


28.1 Linear programming in constant dimension (d > 2) . . . . . . . . . . . . . . . . . . . . . 231
28.2 Handling Infeasible Linear Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
28.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

29 Expanders I 237
29.1 Preliminaries on expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
29.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
29.2 Tension and expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

30 Expanders II 241
30.1 Bi-tension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
30.2 Explicit construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
30.2.1 Explicit construction of a small expander . . . . . . . . . . . . . . . . . . . . . . . 243
30.2.1.1 A quicky reminder of fields . . . . . . . . . . . . . . . . . . . . . . . . . 243
30.2.1.2 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

31 Expanders III - The Zig Zag Product 247


31.1 Building a large expander with constant degree . . . . . . . . . . . . . . . . . . . . . . . 247
31.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
31.1.2 The Zig-Zag product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
31.1.3 Squaring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
31.1.4 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
31.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
31.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

32 Miscellaneous Prerequisite 253


32.1 Geometry and linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
32.1.1 Linear and affine subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
32.1.2 Computational geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
32.2 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

Bibliography 257

Index 263

Chapter 1

Introduction, Quick Sort and BSP


    Everybody knows that the dice are loaded
    Everybody rolls with their fingers crossed
    Everybody knows the war is over
    Everybody knows the good guys lost
    Everybody knows the fight was fixed
    The poor stay poor, the rich get rich
    That’s how it goes
    Everybody knows

        Everybody knows, Leonard Cohen

1.1. What are randomized algorithms?


Randomized algorithms are algorithms that make random decisions during their execution. Specifically,
they are allowed to use variables whose values are drawn from some random distribution. It is not
immediately clear why adding the ability to use randomness helps an algorithm, but it turns out that
the benefits are quite substantial. Before listing them, let us start with an example.

1.1.1. The benefits of unpredictability


Consider the following game. The adversary has an equilateral triangle, with three coins on the vertices
of the triangle (which are numbered, say, 1, 2, 3). Initially, the adversary sets each of the three coins to
be either heads or tails, as she sees fit.
At each round of the game, the player can ask to flip certain coins (say, flip the coins at vertices 1 and 3).
If after the flips all three coins have the same side up, then the game stops. Otherwise, the adversary is
allowed to rotate the board by 0, 120 or −120 degrees, as she sees fit, and the game continues from
this point on. To make things interesting, the player does not see the board at all, and does not know
the initial configuration of the coins.

A randomized algorithm. The randomized algorithm in this case is easy – the player chooses a
number uniformly at random among 1, 2, 3 at every round. Since, at every point in time, there are two
coins that have the same side up while the remaining coin shows the other side, a random choice hits the
lonely coin, and thus finishes the game, with probability 1/3 at each step. In particular, the number of
rounds until the game terminates is a geometric random variable with parameter 1/3 (and thus the
expected number of rounds is 3). Clearly, the probability that the game continues for more than i rounds,
when the player uses this randomized algorithm, is (2/3)^i. In particular, it goes to zero quite quickly.
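
To make this concrete, here is a small simulation sketch in Python (our illustration, not part of the original notes; all names are ours). It models the adversary as rotating the board arbitrarily – here, randomly – between rounds, and the average number of rounds comes out close to 3, as the geometric distribution with parameter 1/3 predicts.

import random

def play_coin_game(initial):
    """Simulate the 3-coin game: the player flips a uniformly random vertex,
    the adversary then rotates the board; return the number of rounds played."""
    coins = list(initial)                 # e.g. [0, 0, 1]: not all equal
    rounds = 0
    while True:
        rounds += 1
        coins[random.randrange(3)] ^= 1   # the player's random flip
        if coins[0] == coins[1] == coins[2]:
            return rounds
        shift = random.randrange(3)       # the adversary rotates the board
        coins = coins[shift:] + coins[:shift]

trials = [play_coin_game([0, 0, 1]) for _ in range(100_000)]
print(sum(trials) / len(trials))          # close to 3 in expectation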

A deterministic algorithm. The surprise here is that there is no deterministic algorithm that can
generate a winning sequence. Indeed, if the player uses a deterministic algorithm, then the adversary
can simulate the algorithm herself, and know at every stage which coin the player would ask to flip (it
is easy to verify that flipping two coins in a step is equivalent to flipping the remaining coin – so we can
restrict ourselves to a single coin flip at each step). In particular, the adversary can rotate the board at
the end of the round so that the player (in the next round) flips one of the two coins that are in the
same state. Namely, the player never wins.

The shocker. One can play the same game with a board of size 4 (i.e., a square), where at each stage
the player can flip one or two coins, and the adversary can rotate the board by 0, 90, 180 or 270 degrees
after each round. Surprisingly, there is a deterministic winning strategy for this case. The interested
reader can try to work out what it is (this is one of those brain teasers that are not immediate, and might
take you 15 minutes to solve, or longer [or much longer]).

The unfair game of the analysis of algorithms. The underlying problem with analyzing algorithms
is the inherent unfairness of worst case analysis. We are given a problem, we propose an algorithm, and
then an all-powerful adversary chooses the worst input for our algorithm. Using randomness gives the
player (i.e., the algorithm designer) some power to fight the adversary by being unpredictable.

1.1.2. Back to randomized algorithms


1. Best. There are cases where only randomized algorithms are known or possible, especially for
games. For example, consider the 3 coins example given above.
2. Speed. In some cases randomized algorithms are considerably faster than any deterministic
algorithm.
3. Simplicity. Even if a randomized algorithm is not faster, often it is considerably simpler than its
deterministic counterpart.
4. Derandomization. Some deterministic algorithms arise from derandomizing randomized
algorithms, and these are the only algorithms we know for these problems (e.g., discrepancy).
5. Adversary arguments and lower bounds. The standard worst case analysis relies on the idea
that the adversary can select the input on which the algorithm performs worst. Inherently, the
adversary is more powerful than the algorithm, since the algorithm is completely predictable. By
using a randomized algorithm, we can make the algorithm unpredictable and break the adversary
lower bound.
Namely, randomness makes the algorithm vs. adversary game a more balanced game, by giving
the algorithm additional power against the adversary.

1.1.3. Randomized vs average-case analysis


Randomized algorithms are not the same as average-case analysis. In average-case analysis, one
assumes that one is given some distribution on the inputs, and one tries to analyze the execution of an
algorithm on such an input.
On the other hand, randomized algorithms do not assume random inputs – the input can be arbitrary.
As such, randomized algorithm analysis is more widely applicable, and more general.
While there is a lot of average-case analysis in the literature, the problem is that it is hard to find
distributions on inputs that are meaningful in comparison to real world inputs. In particular, in numerous
cases the average-case analysis exposes structure that does not exist in real world inputs.

1.2. Basic probability
Here we recall some definitions from probability. The reader already familiar with these definitions can
happily skip this section.

1.2.1. Formal basic definitions: Sample space, σ-algebra, and probability


A sample space Ω is a set of all possible outcomes of an experiment. We also have a set of events F ,
where every member of F is a subset of Ω. Formally, we require that F is a σ-algebra.

Definition 1.2.1. A single element of Ω is an elementary event or an atomic event.

Definition 1.2.2. A set F of subsets of Ω is a σ-algebra if:


(i) F is not empty,
(ii) if X ∈ F then its complement Ω \ X ∈ F , and
(iii) if X, Y ∈ F then X ∪ Y ∈ F .
More generally, we require that if Xi ∈ F , for i ∈ ZZ, then ∪i Xi ∈ F . A member of F is an event.

As a concrete example, if we are rolling a dice, then Ω = {1, 2, 3, 4, 5, 6} and F would be the power
set of Ω (i.e., the set of all possible subsets of Ω).

Definition 1.2.3. A probability measure is a mapping Pr : F → [0, 1] assigning probabilities to


events. The function Pr needs to have the following properties:
(i) Additivity: for disjoint sets X, Y ∈ F , we have that Pr[X ∪ Y] = Pr[X] + Pr[Y], and

(ii) Pr[Ω] = 1.

Definition 1.2.4. A probability space is a triple (Ω, F , Pr), where Ω is a sample space, F is a σ-algebra
defined over Ω, and Pr is a probability measure.

Definition 1.2.5. A random variable f is a mapping from Ω into some set G. We require that the
probability of the random variable to take on any value in a given subset of values is well defined.
Formally, for any subset U ⊆ G, we have that f^{-1}(U) ∈ F . That is, Pr[f ∈ U] = Pr[f^{-1}(U)] is defined.

Going back to the dice example, the number on the top of the dice when we roll it is a random
variable. Similarly, let X be one if the number rolled is larger than 3, and zero otherwise. Clearly X is
a random variable.
We denote the probability of a random variable X to take the value x by Pr[X = x] (or sometimes
Pr[x], if we are lazy).

1.2.2. Expectation and conditional probability


Definition 1.2.6 (Expectation). The expectation of a random variable X is its average. Formally, the
expectation of X is
    E[X] = ∑_x x · Pr[X = x].

Definition 1.2.7 (Conditional probability). The conditional probability of X given Y is the probability
that X = x given that Y = y. We denote this quantity by Pr[X = x | Y = y].

One useful way to think about the conditional probability Pr[X | Y] is as a function, mapping the
given value of Y (i.e., y) to the probability of X (being equal to x) in this case. Since in many cases x
and y are omitted from the notation, this can be somewhat confusing.
The conditional probability can be computed using the formula
    Pr[X = x | Y = y] = Pr[(X = x) ∩ (Y = y)] / Pr[Y = y].

For example, let us roll a dice and let X be the number we got. Let Y be the random variable that
is true if the number we get is even. Then, we have that
    Pr[X = 2 | Y = true] = 1/3.
Definition 1.2.8. Two random variables X and Y are independent if Pr[X = x | Y = y] = Pr[X = x], for
all x and y.

Observation 1.2.9. If X and Y are independent then Pr[X = x | Y = y] = Pr[X = x], which is equivalent
to Pr[(X = x) ∩ (Y = y)] / Pr[Y = y] = Pr[X = x]. That is, X and Y are independent, if for all x and y,
we have that
    Pr[(X = x) ∩ (Y = y)] = Pr[X = x] · Pr[Y = y].

Remark. Informally, and not quite correctly, one possible way to think about the conditional probability
Pr[X = x | Y = y] is as measuring the benefit of having more information: if we know that Y = y, does
the probability that X = x change?

Lemma 1.2.11 (Linearity of expectation). Linearity of expectation is the property that for any
two random variables X and Y, we have that E[X + Y] = E[X] + E[Y].

Proof: E[X + Y] = ∑_{ω∈Ω} Pr[ω] (X(ω) + Y(ω)) = ∑_{ω∈Ω} Pr[ω] X(ω) + ∑_{ω∈Ω} Pr[ω] Y(ω) = E[X] + E[Y]. ∎

1.3. QuickSort
Let the input be a set T = {t1, . . . , tn} of n items to be sorted. We remind the reader that the QuickSort
algorithm randomly picks a pivot element (uniformly), splits the input into two subarrays – the elements
smaller than the pivot and the elements larger than the pivot – and then recurses on these two subarrays
(the pivot is not included in these two subproblems). Here we will show that the expected running time
of QuickSort is O(n log n).
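
To make the description concrete, here is a short Python sketch of this randomized QuickSort (our illustration, not part of the original notes):

import random

def quicksort(items):
    """Randomized QuickSort: pick a uniformly random pivot, split the input into
    the elements smaller and larger than it, and recurse on the two subproblems."""
    if len(items) <= 1:
        return list(items)
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    equal = [x for x in items if x == pivot]   # the pivot is not part of the recursion
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([9, 3, 7, 1, 8, 2, 5]))        # [1, 2, 3, 5, 7, 8, 9]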

Definition 1.3.1. For an event E, let X be a random variable which is 1 if E occurred and 0 otherwise.
The random variable X is an indicator variable.

Observation 1.3.2. For an indicator variable X of an event E, we have
    E[X] = 0 · Pr[X = 0] + 1 · Pr[X = 1] = Pr[X = 1] = Pr[E].

Let S1, . . . , Sn be the elements in their sorted order (i.e., the output order). Let Xij be the indicator
variable which is one iff QuickSort compares Si to Sj, and let pij denote the probability that this happens.
Clearly, the number of comparisons performed by the algorithm is C = ∑_{i<j} Xij. By linearity of
expectation, we have
    E[C] = E[∑_{i<j} Xij] = ∑_{i<j} E[Xij] = ∑_{i<j} pij.

We want to bound pij, the probability that Si is compared to Sj. Consider the last recursive call
involving both Si and Sj. Clearly, the pivot at this step must be one of Si, . . . , Sj, all equally likely;
indeed, Si and Sj are separated in the next recursive call, so this pivot lies in their range.
Observe that Si and Sj get compared if and only if the pivot is Si or Sj. Thus, the probability for that
is 2/(j − i + 1). Indeed,
    pij = Pr[Si or Sj is picked | pivot picked from Si, . . . , Sj] = 2/(j − i + 1).
Thus,
    ∑_{i=1}^{n} ∑_{j>i} pij = ∑_{i=1}^{n} ∑_{j>i} 2/(j − i + 1) = ∑_{i=1}^{n} ∑_{k=2}^{n−i+1} 2/k ≤ 2n(Hn − 1) ≤ 2n ln n ≤ n + 2n ln n,
where Hn = ∑_{i=1}^{n} 1/i is the nth harmonic number. (Using integration to bound the summation, we have
Hn ≤ 1 + ∫_1^n dx/x ≤ 1 + ln n; similarly, Hn ≥ ∫_1^n dx/x = ln n.) We thus proved the following result.
Lemma 1.3.3. QuickSort performs in expectation at most n + 2n ln n comparisons, when sorting n
elements.

Note that this holds for all inputs – no assumption on the input is made. Similar bounds hold not
only in expectation, but also with high probability.
This raises the question of how the algorithm picks a random element. We assume we have access
to a random source that can give us a number between 1 and n uniformly at random.
Note that the algorithm always sorts correctly, but it might take quadratic time in the worst case.
Remark 1.3.4 (Wait, wait, wait). Let us do the key argument in the above more slowly, and more carefully.
Imagine that, before running QuickSort, we choose for every element a random priority, which is a real
number in the range [0, 1]. Now, we reimplement QuickSort such that it always picks the element with
the lowest random priority (in the given subproblem) to be the pivot. One can verify that this variant
and the standard implementation have the same running time. Now, ai gets compared to aj if and
only if all the elements ai+1, . . . , aj−1 have random priority larger than both the random priority of ai
and the random priority of aj. But the probability that one of two specific elements has the lowest
random priority out of j − i + 1 elements is 2/(j − i + 1), as claimed.
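
The priority-based variant described in the remark is easy to implement directly; here is a minimal Python sketch (our illustration, assuming the items are distinct so that a priority can be attached to each value):

import random

def quicksort_by_priority(items, priority=None):
    """QuickSort variant from Remark 1.3.4: the pivot of every subproblem is
    the element with the smallest pre-assigned random priority."""
    if priority is None:
        priority = {x: random.random() for x in items}   # assumes distinct items
    if len(items) <= 1:
        return list(items)
    pivot = min(items, key=lambda x: priority[x])
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    return (quicksort_by_priority(smaller, priority) + [pivot]
            + quicksort_by_priority(larger, priority))

print(quicksort_by_priority([9, 3, 7, 1, 8, 2, 5]))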

1.4. Binary space partition (BSP)


Let us assume that we would like to render an image of a three dimensional scene on the computer screen.
The input is, in general, a collection of polygons in three dimensions. The painter's algorithm renders the
scene by drawing things from back to front, letting the stuff in front overwrite what was painted before.
The problem is that it is not always possible to order the objects in three dimensions – this ordering
might have cycles. So, one possible solution is to build a binary space partition. We build a binary
tree. In the root, we place a polygon P. Let h be the plane containing P. Next, we partition the input
polygons into two sets, depending on which side of h they fall into. We recursively construct a BSP for
each set, and we hang it from the root node. If a polygon intersects h then we cut it into two polygons,
as split by h. We continue the construction recursively on the objects on one side of h, and the objects
on the other side. What we get is a binary tree that splits space into cells, and furthermore, one can
use the painter's algorithm on these objects. The natural question is how big the resulting partition is.
We will study the easiest case: disjoint segments in the plane.

1.4.1. BSP for disjoint segments


Let P = {s1, . . . , sn} be n disjoint segments in the plane. We will build the BSP by using the lines defined
by these segments. This kind of BSP is called an autopartition.
To recap, the BSP is a binary tree; at every internal node we store a segment of P, and the line
associated with it splits the node's region into its two children. Finally, each leaf of the BSP stores a single
segment. A fragment is a subsegment formed by this splitting. Clearly, every internal node stores a
fragment that defines its split. As such, the size of the BSP is proportional to the number of fragments
generated when building the BSP.
One application of such a BSP is ray shooting – given a ray, you would like to determine the first
segment it hits. Start from the root, figure out which child contains the apex of the ray, and first
(recursively) compute the first segment stored in this child that the ray intersects. Continue into the
second child only if the first subtree does not contain any segment that intersects the ray.

1.4.1.1. The algorithm

We pick a random permutation σ of 1, . . . , n, and in the ith step we insert sσ(i), splitting all the cells
that it intersects.
Observe that if a segment crosses a cell completely, it just splits the cell into two and no new fragments
are created. As such, the bad case is when a segment s is being inserted, and its line intersects some other
segment t. So, let E(s, t) denote the event that when s was inserted it split t. In particular, let index(s, t)
denote the number of segments on the line of s between the (closer) endpoint of s and t (including t).
If the line of s does not intersect t, then index(s, t) = ∞.
We have that
    Pr[E(s, t)] = 1/(1 + index(s, t)).

Let Xs,t be the indicator variable that is 1 if E(s, t) happens. We have that
    S = number of fragments = ∑_{i=1}^{n} ∑_{j≠i} Xsi,sj.
As such, by linearity of expectation, we have
    E[S] = E[∑_{i=1}^{n} ∑_{j≠i} Xsi,sj] = ∑_{i=1}^{n} ∑_{j≠i} E[Xsi,sj] = ∑_{i=1}^{n} ∑_{j≠i} Pr[E(si, sj)]
         = ∑_{i=1}^{n} ∑_{j≠i} 1/(1 + index(si, sj))
         ≤ ∑_{i=1}^{n} ∑_{j=1}^{n} 2/(1 + j) ≤ 2nHn,
since, for a fixed si and a fixed value j, there are at most two segments t (one on each side along the
line of si) with index(si, t) = j.

Since the size of the BSP is proportional to the number of fragments created, we have the following
result.

Theorem 1.4.1. Given n disjoint segments in the plane, one can build a BSP for them of expected size O(n log n).

Csaba Tóth [Tót03] showed that a BSP for segments in the plane, in the worst case, has complexity
Ω(n log n / log log n).

1.5. Extra: QuickSelect running time


We remind the reader that QuickSelect receives an array t[1 . . . n] of n real numbers, and a number m,
and returns the element of rank m in the sorted order of the elements of t. We could, of course, use
QuickSort and just return the mth element in the sorted array, but a more efficient approach is to
modify QuickSort so that it recurses only on the subproblem that contains the element we are
interested in. Formally, QuickSelect chooses a random pivot and splits the array according to the pivot.
This tells us the rank of the pivot; if it is equal to m, we return the pivot. Otherwise, we recurse on the
subproblem containing the required element (modifying m as we go down the recursion). Namely,
QuickSelect is a modification of QuickSort performing only a single recursive call (instead of two).
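
Here is a short Python sketch of QuickSelect as just described (our illustration, not part of the original notes):

import random

def quickselect(arr, m):
    """Return the element of rank m (1-based) in arr, recursing only on the side
    of a uniformly random pivot that contains the sought rank."""
    assert 1 <= m <= len(arr)
    pivot = random.choice(arr)
    smaller = [x for x in arr if x < pivot]
    larger = [x for x in arr if x > pivot]
    if m <= len(smaller):                        # the sought element is below the pivot
        return quickselect(smaller, m)
    if m > len(arr) - len(larger):               # the sought element is above the pivot
        return quickselect(larger, m - (len(arr) - len(larger)))
    return pivot                                 # the pivot itself has rank m

print(quickselect([9, 3, 7, 1, 8, 2, 5], 4))     # 5, the element of rank 4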
As before, to bound the expected running time, we will bound the expected number of comparisons.
As before, let S1, . . . , Sn be the elements of t in their sorted order. Now, for i < j, let Xi j be the indicator
variable that is one if Si is being compared to S j during the execution of QuickSelect. There are several
possibilities to consider:
(i) If i < j < m: Here, Si is compared to Sj if and only if the first pivot picked in the range Si, . . . , Sm
is either Si or Sj. The probability for that is 2/(m − i + 1). As such, we have that
    α1 = E[∑_{i<j<m} Xij] = ∑_{i=1}^{m−2} ∑_{j=i+1}^{m−1} E[Xij] = ∑_{i=1}^{m−2} ∑_{j=i+1}^{m−1} 2/(m − i + 1) = ∑_{i=1}^{m−2} 2(m − i − 1)/(m − i + 1) ≤ 2(m − 2).
(ii) If m < i < j: Using the same analysis as above, we have that Pr[Xij = 1] = 2/(j − m + 1). As such,
    α2 = E[∑_{j=m+1}^{n} ∑_{i=m+1}^{j−1} Xij] = ∑_{j=m+1}^{n} ∑_{i=m+1}^{j−1} 2/(j − m + 1) = ∑_{j=m+1}^{n} 2(j − m − 1)/(j − m + 1) ≤ 2(n − m).
(iii) If i < m < j: Here, we compare Si to Sj if and only if the first pivot picked in the range Si, . . . , Sj is
either Si or Sj. As such, E[Xij] = Pr[Xij = 1] = 2/(j − i + 1). As such, we have
    α3 = E[∑_{i=1}^{m−1} ∑_{j=m+1}^{n} Xij] = ∑_{i=1}^{m−1} ∑_{j=m+1}^{n} 2/(j − i + 1).
Observe that a fixed value ∆ = j − i + 1 appears in the above summation at most ∆ − 2 times. As such,
α3 ≤ ∑_{∆=3}^{n} 2(∆ − 2)/∆ ≤ 2n.
(iv) If i = m: We have α4 = ∑_{j=m+1}^{n} E[Xmj] = ∑_{j=m+1}^{n} 2/(j − m + 1) ≤ 2 ln n.

(v) If j = m: Similarly, α5 = ∑_{i=1}^{m−1} E[Xim] = ∑_{i=1}^{m−1} 2/(m − i + 1) ≤ 2 ln m.

Thus, the expected number of comparisons performed by QuickSelect is bounded by
    ∑_i αi ≤ 2(m − 2) + 2(n − m) + 2n + 2 ln n + 2 ln m ≤ 4n + 2 ln n + 2 ln m.

Theorem 1.5.1. In expectation, QuickSelect performs at most 4n + 2 ln n + 2 ln m comparisons, when
selecting the mth element out of n elements.

A different approach can reduce the number of comparisons (in expectation) to 1.5n + o(n). More on
that later in the course.

Chapter 2

Verifying Identities, Changing Minimum, Closest Pair and Some Complexity

    The events of September 8 prompted Foch to draft the later legendary signal: “My centre is giving way, my right is in retreat, situation excellent. I attack.” It was probably never sent.

        John Keegan, The first world war

2.1. Verifying equality


2.1.1. Vectors
You are given two binary vectors v = (v1, . . . , vn), u = (u1, . . . , un) ∈ {0, 1}^n and you would like to decide
if they are equal or not. Unfortunately, the only access you have to the two vectors is via a black-box
that enables you to compute the dot-product of two binary vectors over ZZ2. Formally, given two binary
vectors as above, their dot-product is ⟨v, u⟩ = ∑_{i=1}^{n} vi·ui (which is a non-negative integer number). Their
dot-product modulo 2 is ⟨v, u⟩ mod 2 (i.e., it is 1 if ⟨v, u⟩ is odd and 0 otherwise).
Naturally, we could use the black-box to read the vectors (using 2n calls), but since we are interested
only in deciding whether they are equal, this should require fewer calls to the black-box (which is
expensive).
interested only in deciding if they are equal or not, this should require less calls to the black-box (which
is expensive).
Lemma 2.1.1. Given two binary vectors v, u ∈ {0, 1}^n, a randomized algorithm can, using two computations
of dot-product modulo 2, decide if v is equal to u or not. The algorithm may return one of the
following two values:
≠: Then v ≠ u.
=: Then the probability that the algorithm made a mistake (i.e., the vectors are in fact different) is at most
1/2.
The running time of the algorithm is O(n + B(n)), where B(n) is the time to compute a single dot-product
of vectors of length n.

Proof: Pick a random vector r = (r1, . . . , rn) ∈ {0, 1}^n by picking each coordinate independently with
probability 1/2. Compute the two dot-products ⟨v, r⟩ and ⟨u, r⟩.
• If ⟨v, r⟩ ≡ ⟨u, r⟩ (mod 2) ⇒ the algorithm returns ‘=’.
• If ⟨v, r⟩ ≢ ⟨u, r⟩ (mod 2) ⇒ the algorithm returns ‘≠’.
Clearly, if ‘≠’ is returned then v ≠ u.
So, assume that the algorithm returned ‘=’ but v ≠ u. For the sake of simplicity of exposition,
assume that they differ in the nth bit: un ≠ vn. We then have that
    α = ⟨v, r⟩ = ∑_{i=1}^{n−1} vi·ri + vn·rn = α′ + vn·rn    and    β = ⟨u, r⟩ = ∑_{i=1}^{n−1} ui·ri + un·rn = β′ + un·rn.
Now, there are two possibilities:
• If α′ ≢ β′ (mod 2), then, with probability half, we have rn = 0, and as such α ≢ β (mod 2).
• If α′ ≡ β′ (mod 2), then, with probability half, we have rn = 1, and as such α ≢ β (mod 2).
As such, with probability at most half, the algorithm would fail to discover that the two vectors are
different. ∎
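
A small Python sketch of this test (our illustration; dot_mod2 stands in for the black-box of the lemma, and repeating the test t times is exactly the amplification discussed next):

import random

def dot_mod2(v, u):
    """Stand-in for the black-box: dot-product of two binary vectors, modulo 2."""
    return sum(vi * ui for vi, ui in zip(v, u)) % 2

def probably_equal(v, u, t=20):
    """Return False only if v != u for sure; return True otherwise, which is
    wrong with probability at most 1/2**t when the vectors actually differ."""
    for _ in range(t):
        r = [random.randint(0, 1) for _ in range(len(v))]
        if dot_mod2(v, r) != dot_mod2(u, r):
            return False                       # a witness that v != u
    return True

v = [1, 0, 1, 1]
print(probably_equal(v, v), probably_equal(v, [1, 0, 1, 0]))   # True False (w.h.p.)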

2.1.1.1. Amplification
Of course, this is not a satisfying algorithm – it returns the correct answer only with probability half if
the vectors are different. So, let us run the algorithm t times, using independent random vectors. Let
T1, . . . , Tt be the values returned by these executions. If any of the t executions returns that the vectors
are different, then we know that they are different. Otherwise,
    Pr[algorithm fails] = Pr[v ≠ u, but all t executions return ‘=’]
                        = Pr[T1 = ‘=’ ∩ T2 = ‘=’ ∩ · · · ∩ Tt = ‘=’]
                        = Pr[T1 = ‘=’] · Pr[T2 = ‘=’] · · · Pr[Tt = ‘=’] ≤ ∏_{i=1}^{t} 1/2 = 1/2^t,
by the independence of the executions.

We thus get the following result.


Lemma 2.1.2. Given two binary vectors v, u ∈ {0, 1}^n and a confidence parameter δ > 0, a randomized
algorithm can decide if v is equal to u or not. More precisely, the algorithm may return one of the
two following results:
≠: Then v ≠ u.
=: Then, with probability ≥ 1 − δ, we have v = u.
The running time of the algorithm is O((n + B(n)) ln δ^{-1}), where B(n) is the time to compute a single
dot-product of two vectors of length n.

Proof: Follows from the above by setting t = ⌈lg(1/δ)⌉. ∎

2.1.2. Matrices
Given three binary matrices B, C, D of size n × n, we are interested in deciding if BC = D. Computing
BC is expensive – the fastest known (theoretical!) algorithm has running time (roughly) O(n^2.37). On
the other hand, multiplying such a matrix with a vector r (modulo 2, as usual) takes only O(n^2) time
(and this algorithm is simple).
Lemma 2.1.3. Given three binary matrices B, C, D ∈ {0, 1}^{n×n} and a confidence parameter δ > 0, a
randomized algorithm can decide if BC = D or not. More precisely, the algorithm can return one of the
following two results:
≠: Then BC ≠ D.
=: Then BC = D with probability ≥ 1 − δ.
The running time of the algorithm is O(n^2 log δ^{-1}).

Proof: Compute a random vector r = (r1, . . . , rn), and compute the quantity x = BCr = B(Cr) in O(n^2)
time, using the associative property of matrix multiplication. Similarly, compute y = Dr. Now, if x ≠ y
then return ‘≠’.
We execute this test t = ⌈lg δ^{-1}⌉ times, with independent random vectors. If all of these independent
runs return that the matrices are equal, then return ‘=’.
The algorithm fails only if BC ≠ D but all runs returned ‘=’. So assume BC ≠ D, and that the ith rows
of the two matrices BC and D differ. The probability that a single run does not detect that these rows
are different is at most 1/2, by Lemma 2.1.1. As such, the probability that all t runs failed is at most
1/2^t ≤ δ, as desired. ∎
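
This verification scheme (known as Freivalds' algorithm) is easy to implement; here is a Python sketch over ZZ2 (our illustration, not part of the original notes):

import random

def matvec_mod2(M, r):
    """Multiply an n x n 0/1 matrix by a 0/1 vector, modulo 2 -- O(n^2) time."""
    return [sum(row[j] * r[j] for j in range(len(r))) % 2 for row in M]

def verify_product(B, C, D, t=20):
    """Check whether B*C == D over ZZ2. A False answer is always correct; a True
    answer is wrong with probability at most 1/2**t when B*C != D."""
    n = len(B)
    for _ in range(t):
        r = [random.randint(0, 1) for _ in range(n)]
        # x = B(Cr) and y = Dr, each computed in O(n^2) time
        if matvec_mod2(B, matvec_mod2(C, r)) != matvec_mod2(D, r):
            return False
    return True

B = [[1, 0], [1, 1]]
C = [[1, 1], [0, 1]]
D = [[1, 1], [1, 0]]             # equals B*C modulo 2
print(verify_product(B, C, D))   # True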

2.2. How many times can a minimum change?


Let a1, . . . , an be a set of n numbers, and let us randomly permute them into the sequence b1, . . . , bn.
Next, let ci = min_{k=1}^{i} bk, and let X be the random variable which is the number of distinct values that
appear in the sequence c1, . . . , cn. What is the expectation of X?

Lemma 2.2.1. In expectation, the number of times the minimum of a prefix of n randomly permuted
numbers changes is O(log n). That is, E[X] = O(log n).

Proof: Consider the indicator variable Xi, such that Xi = 1 if ci ≠ ci−1. The probability for that is ≤ 1/i,
since this is the probability that the smallest number among b1, . . . , bi is bi. (Why is this probability not
simply equal to 1/i?) As such, we have X = ∑_i Xi, and
    E[X] = ∑_i E[Xi] ≤ ∑_{i=1}^{n} 1/i = O(log n). ∎
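
A quick simulation confirms the logarithmic behavior (our illustration; the average below is close to the harmonic number Hn, about 7.5 for n = 1000):

import random

def minimum_changes(n):
    """Count how many times the prefix minimum changes in a random permutation."""
    perm = list(range(n))
    random.shuffle(perm)
    changes, current_min = 0, float("inf")
    for b in perm:
        if b < current_min:          # the prefix minimum changed
            current_min = b
            changes += 1
    return changes

n, trials = 1000, 2000
print(sum(minimum_changes(n) for _ in range(trials)) / trials)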

2.3. Closest Pair


Assumption 2.3.1. Throughout the discourse, we are going to assume that every hashing operation takes
(worst case) constant time. This is quite a reasonable assumption when true randomness is available
(using for example perfect hashing [CLRS01]). We will revisit this issue later in the course.

For a real positive number r and a point p = (x, y) in R^2, define
    Gr(p) := (r⌊x/r⌋, r⌊y/r⌋) ∈ R^2.
The number r is the width of the grid Gr . Observe that Gr partitions the plane into square regions,
which are grid cells. Formally, for any i, j ∈ Z, the intersection of the half-planes x ≥ ri, x < r(i + 1),
y ≥ r j and y < r( j + 1) is a grid cell. Further a grid cluster is a block of 3 × 3 contiguous grid cells.
For a point set P, and a parameter r, the partition of P into subsets by the grid Gr , is denoted by
Gr (P). More formally, two points p, q ∈ P belong to the same set in the partition Gr (P), if both points
are being mapped to the same grid point or equivalently belong to the same grid cell.
Note that every grid cell C of Gr has a unique ID; indeed, let p = (x, y) be any point in C, and
consider the pair of integer numbers idC = id(p) = (⌊x/r⌋, ⌊y/r⌋). Clearly, only points inside C are going
to be mapped to idC. This is useful, as one can store a set P of points inside a grid efficiently. Indeed,
given a point p, compute its id(p). We associate with each unique id a data-structure that stores all the
points falling into this grid cell (of course, we do not maintain such data-structures for grid cells which
are empty). For our purposes here, the grid-cell data-structure can simply be a linked list of points. So,
once we have computed id(p), we fetch the data-structure for this cell by using hashing. Namely, we store
pointers to all those data-structures in a hash table, where each such data-structure is indexed by its
unique id. Since the ids are pairs of integer numbers, we can do the hashing in constant time.

We are interested in solving the following problem.


Problem 2.3.2. Given a set P of n points in the plane, find the pair of points closest to each other.
Formally, return the pair of points realizing CP(P) = min_{p≠q∈P} ‖pq‖.

We need the following easy packing lemma.


Lemma 2.3.3. Let P be a set of points contained inside a square □, such that the sidelength of □ is
α ≤ CP(P). Then |P| ≤ 4.

Proof: Partition □ into four equal squares □1, . . . , □4, and observe that each of these squares has diameter
√2·α/2 < α, and as such each can contain at most one point of P; that is, the disk of radius α centered
at a point p ∈ P completely covers the subsquare containing it; see the figure on the right.
Note that the set P can have four points if it is the four corners of □. ∎

Lemma 2.3.4. Given a set P of n points in the plane, and a distance r, one can verify in linear time
whether CP(P) < r or CP(P) ≥ r.

Proof: Indeed, store the points of P in the grid Gr. For every non-empty grid cell, we maintain a linked
list of the points inside it. Thus, adding a new point p takes constant time. Indeed, compute id(p),
check if id(p) already appears in the hash table; if not, create a new linked list for the cell with this ID
number, and store p in it. If a data-structure already exists for id(p), just add p to it.
This takes O(n) time overall. Now, if any grid cell in Gr(P) contains more than four points of P, then,
by Lemma 2.3.3, it must be that CP(P) < r.
Thus, when inserting a point p, the algorithm fetches all the points of P that were already inserted
into the cell of p and the 8 adjacent cells. Each of those cells contains at most 4 points of P (otherwise,
we would already have stopped, since the CP(·) of the inserted points is smaller than r). Let S be the
set of all those points, and observe that |S| ≤ 4 · 9 = O(1). Thus, we can compute by brute force the
closest point to p in S, that is, d(p, S) = min_{s∈S} ‖ps‖. This takes O(1) time. If d(p, S) < r, we stop and
return this distance (together with the two points realizing d(p, S) as a proof that the distance is too
short). Otherwise, we continue to the next point.
Overall, this takes O(n) time. As for correctness, first observe that if CP(P) > r then the algorithm
would never make a mistake, since it returns ‘CP(P) < r’ only after finding a pair of points of P with
distance smaller than r. Thus, assume that p, q are the pair of points of P realizing the closest pair, and
‖pq‖ = CP(P) < r. Clearly, when the later of the two, say p, is being inserted, the set S would contain q,
and as such the algorithm would stop and return “CP(P) < r”. ∎

Lemma 2.3.4 hints at a natural way to compute CP(P). Indeed, permute the points of P in an arbitrary fashion, and let P = ⟨p1, . . . , pn⟩. Next, let ri = CP({p1, . . . , pi}). We can check if ri+1 < ri by just calling the algorithm of Lemma 2.3.4 on Pi+1 and ri. If ri+1 < ri, the algorithm of Lemma 2.3.4 would give us back the distance ri+1 (with the pair of points realizing this distance).

So, consider the “good” case where ri+1 = ri = ri−1 . Namely, the length of the shortest pair does not
change. In this case we do not need to rebuild the data structure of Lemma 2.3.4 for each point. We
can just reuse it from the previous iteration. Thus, inserting a single point takes constant time as long
as the closest pair (distance) does not change.
Things become bad, when ri < ri−1 . Because then we need to rebuild the grid, and reinsert all the
points of Pi = hp1, . . . , pi i into the new grid Gri (Pi ). This takes O(i) time.
So, if the closest pair radius, in the sequence r1, . . . , rn , changes only k times, then the running time
of the algorithm would be O(nk). But we can do even better!
Theorem 2.3.5. Let P be a set of n points in the plane. One can compute the closest pair of points of
P in expected linear time.
Proof: Pick a random permutation of the points of P, and let hp1, . . . , pn i be this permutation. Let
r2 = kp1 p2 k, and start inserting the points into the data structure of Lemma 2.3.4. In the ith iteration,
if ri = ri−1 , then this insertion takes constant time. If ri < ri−1 , then we rebuild the grid and reinsert the
points. Namely, we recompute Gri (Pi ).
To analyze the running time of this algorithm, let Xi be the indicator variable which is 1 if ri ≠ ri−1, and 0 otherwise. Clearly, the running time is proportional to

R = 1 + ∑_{i=2}^{n} (1 + Xi · i).

Thus, the expected running time is

E[R] = 1 + ∑_{i=2}^{n} (1 + i · E[Xi]) = n + ∑_{i=2}^{n} i · Pr[Xi = 1],

by linearity of expectation and since for an indicator variable Xi, we have that E[Xi] = Pr[Xi = 1].
Thus, we need to bound Pr[Xi = 1] = Pr[ri < ri−1 ]. To bound this quantity, fix the points of Pi , and
randomly permute them. A point q ∈ Pi is critical if CP(Pi \ {q}) > CP(Pi ).
• If there are no critical points, then ri−1 = ri and then Pr[Xi = 1] = 0.
• If there is one critical point, then Pr[Xi = 1] = 1/i, as this is the probability that this critical point would be the last point in a random permutation of Pi.
• If there are two critical points, then let p, q be this unique pair of points of Pi realizing CP(Pi). The quantity ri is smaller than ri−1 only if either p or q is pi. But the probability for that is 2/i (i.e., the probability, in a random permutation of i objects, that one of two marked objects would be the last element in the permutation).
Observe that there can not be more than two critical points. Indeed, if p and q are two points that realize the closest distance, and there is a third critical point r, then CP(Pi \ {r}) = ‖pq‖, and r is not critical, a contradiction.
We conclude that

E[R] = n + ∑_{i=2}^{n} i · Pr[Xi = 1] ≤ n + ∑_{i=2}^{n} i · (2/i) ≤ 3n.
As such, the expected running time of this algorithm is O(E[R]) = O(n). 
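The following Python sketch (ours, not code from the text; it assumes the input points are distinct) implements the algorithm of Theorem 2.3.5: insert the points in a random order, and rebuild the grid whenever the closest-pair distance shrinks.

import random
from math import dist, floor

def closest_pair(points):
    # expected linear-time closest pair (sketch); returns the minimum pairwise distance
    pts = points[:]
    random.shuffle(pts)
    r = dist(pts[0], pts[1])                     # r_2 = ||p1 p2||
    grid = {}

    def cid(p):
        return (floor(p[0] / r), floor(p[1] / r))

    def rebuild(upto):
        nonlocal grid
        grid = {}
        for q in pts[:upto]:
            grid.setdefault(cid(q), []).append(q)

    rebuild(2)
    for i in range(2, len(pts)):
        p = pts[i]
        cx, cy = cid(p)
        best = r
        for dx in (-1, 0, 1):                    # the cell of p and its 8 neighbors
            for dy in (-1, 0, 1):
                for q in grid.get((cx + dx, cy + dy), []):
                    best = min(best, dist(p, q))
        if best < r:                             # the closest pair changed: rebuild the grid
            r = best
            rebuild(i + 1)
        else:
            grid.setdefault((cx, cy), []).append(p)
    return r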
Theorem 2.3.5 is a surprising result, since it implies that uniqueness (i.e., deciding if n real numbers
are all distinct) can be solved in linear time. However, there is a lower bound of Ω(n log n) on uniqueness,
using the comparison tree model. This reality dysfunction can be easily explained once one realizes
that the model of computation of Theorem 2.3.5 is considerably stronger, using hashing, randomization,
and the floor function.
2.4. Las Vegas and Monte Carlo algorithms
Definition 2.4.1. A Las Vegas algorithm is a randomized algorithm that always returns the correct result. The only variation is that its running time might change between executions.

An example for a Las Vegas algorithm is the QuickSort algorithm.

Definition 2.4.2. A Monte Carlo algorithm is a randomized algorithm that might output an incorrect
result. However, the probability of error can be diminished by repeated executions of the algorithm.

The matrix multiplication algorithm is an example of a Monte Carlo algorithm.

2.4.1. Complexity Classes


I assume people know what Turing machines, NP, NPC, RAM machines, the uniform model, the logarithmic model, PSPACE, and EXP are. If you do not know what those things are, you should read about them.
Some of that is covered in the randomized algorithms book, and some other stuff is covered in any basic
text on complexity theory¬ .

Definition 2.4.3. The class P consists of all languages L that have a polynomial time algorithm Alg, such that for any input x ∈ Σ∗, we have
• x ∈ L ⇒ Alg(x) accepts,
• x ∉ L ⇒ Alg(x) rejects.

Definition 2.4.4. The class NP consists of all languages L that have a polynomial time algorithm Alg, such that for any input x ∈ Σ∗, we have:
(i) If x ∈ L, then ∃y ∈ Σ∗ such that Alg(x, y) accepts, where |y| (i.e., the length of y) is bounded by a polynomial in |x|.
(ii) If x ∉ L, then ∀y ∈ Σ∗, Alg(x, y) rejects.

Definition 2.4.5. For a complexity class C, we define the complementary class co−C as the set of languages whose complement is in the class C. That is

co−C = { L | Σ∗ \ L ∈ C }.

It is obvious that P = co−P and P ⊆ NP ∩ co−NP. (It is currently unknown if P = NP ∩ co−NP or whether NP = co−NP, although both statements are believed to be false.)

Definition 2.4.6. The class RP (for Randomized Polynomial time) consists of all languages L that have
a randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then Pr[Alg(x) accepts] ≥ 1/2.
(ii) If x ∉ L then Pr[Alg(x) accepts] = 0.
¬ There is also the internet.

An RP algorithm is a Monte Carlo algorithm, but this algorithm can make a mistake only if x ∈ L. As such, co−RP is the class of all languages that have a Monte Carlo algorithm that makes a mistake only if x ∉ L. A problem which is in RP ∩ co−RP has an algorithm that does not make a mistake, namely a Las Vegas algorithm.

Definition 2.4.7. The class ZPP (for Zero-error Probabilistic Polynomial time) is the class of languages
that have a Las Vegas algorithm that runs in expected polynomial time.

Definition 2.4.8. The class PP (for Probabilistic Polynomial time) is the class of languages that have a
randomized algorithm Alg, with worst case polynomial running time, such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then Pr[Alg(x) accepts] > 1/2.
(ii) If x ∉ L then Pr[Alg(x) accepts] < 1/2.

The class PP is not very useful. Why?


Consider the mind-bogglingly stupid randomized algorithm that returns either yes or no with probability half. This algorithm is almost in PP, as it returns the correct answer with probability half. An algorithm in PP needs to be slightly better, and be correct with probability better than half. However, this advantage over half can be arbitrarily small. In particular, there is no way to do effective amplification with such an algorithm.

Definition 2.4.9. The class BPP (for Bounded-error Probabilistic Polynomial time) is the class of lan-
guages that have a randomized algorithm Alg with worst case polynomial running time such that for
any input x ∈ Σ∗ , we have
(i) If x ∈ L then Pr[Alg(x) accepts] ≥ 3/4.
(ii) If x ∉ L then Pr[Alg(x) accepts] ≤ 1/4.

2.5. Bibliographical notes


Section 2.4 follows [MR95, Section 1.5]. The closest-pair algorithm follows Golin et al. [GRSS95], which is in turn a simplification of the celebrated result of Rabin [Rab76]. Smid provides a survey of such algorithms [Smi00]. A generalization of the closest pair algorithm was provided by Har-Peled and Raichel [HR15].

Chapter 3

Min Cut
To acknowledge the corn - This purely American expression means to admit the losing of an argument, especially in regard to a detail; to retract; to admit defeat. It is over a hundred years old. Andrew Stewart, a member of Congress, is said to have mentioned it in a speech in 1828. He said that haystacks and cornfields were sent by Indiana, Ohio and Kentucky to Philadelphia and New York. Charles A. Wickliffe, a member from Kentucky questioned the statement by commenting that haystacks and cornfields could not walk. Stewart then pointed out that he did not mean literal haystacks and cornfields, but the horses, mules, and hogs for which the hay and corn were raised. Wickliffe then rose to his feet, and said, "Mr. Speaker, I acknowledge the corn".

Funk, Earle, A Hog on Ice and Other Curious Expressions

3.1. Branching processes – Galton-Watson Process


3.1.1. The problem
In the 19th century, Victorians were worried that aristocratic surnames were disappearing, as family
names passed on only through the male children. As such, a family with no male children had its family
name disappear. So, imagine the number of male children of a person is an independent random variable
X ∈ {0, 1, 2, . . .}. Starting with a single person, its family (as far as male children are concerned) is a
random tree with the degree of a node being distributed according to X. We continue recursively in
constructing this tree, again, sampling the number of children for each current leaf according to the
distribution of X. It is not hard to see that a family disappears if E[X] ≤ 1, and it has a constant
probability of surviving if E[X] > 1.
Francis Galton asked the question of what is the probability of such a blue-blood family name to
survive, and this question was answered by Henry William Watson [WG75]. The Victorians were worried
about strange things, see [Gre69] for a provocatively titled article from the period, and [Ste12] for a
more recent take on this issue.

Of course, since infant mortality is dramatically down (as is the number of aristocrat males dying to
maintain the British empire), the probability of family names to disappear is now much lower than it was
in the 19th century. Interestingly, countries with family names that were introduced long time ago have
very few surnames (i.e., Korean have 250 surnames, and three surnames form 45% of the population).
On the other hand, countries that introduced surnames more recently have dramatically more surnames
(for example, the Dutch have surnames only for the last 200 years, and there are 68, 000 different family
names).
Here we are going to look at a very specific variant of this problem. Imagine that we start with a single male. A male has exactly two children, and each of them is a male with probability half, independently (i.e., the Y-chromosome is being passed only to his male children). As such, the natural question is what is the probability that, h generations down, there is a male descendant all of whose ancestors are male (i.e., he carries the original family name, and the original Y-chromosome).

3.1.2. On coloring trees


Let T_h be a complete binary tree of height h. We randomly color its edges by black and white. Namely, for each edge we independently choose its color to be either black or white, with equal probability (say, black indicates the child is male). We are interested in the event that there exists a path from the root of T_h to one of its leaves that is all black. Let E_h denote this event, and let ρ_h = Pr[E_h]. Observe that ρ0 = 1 and ρ1 = 3/4 (see below).
To bound this probability, consider the root u of T_h and its two children u_l and u_r. The probability that there is a black path from u_l to one of the leaves below it is ρ_{h−1}, and as such, the probability that there is a black path from u through u_l to a leaf of the subtree of u_l is Pr[the edge uu_l is colored black] · ρ_{h−1} = ρ_{h−1}/2. As such, the probability that there is no black path through u_l is 1 − ρ_{h−1}/2. As such, the probability of not having a black path from u to a leaf (through either child) is (1 − ρ_{h−1}/2)². In particular, the desired probability is the complement; that is,

ρ_h = 1 − (1 − ρ_{h−1}/2)² = (ρ_{h−1}/2)(2 − ρ_{h−1}/2) = ρ_{h−1} − ρ_{h−1}²/4.

In particular, ρ0 = 1, and ρ1 = 3/4.

Lemma 3.1.1. We have that ρh ≥ 1/(h + 1).

Proof: The proof is by induction. For h = 1, we have ρ1 = 3/4 ≥ 1/(1 + 1).
Observe that ρ_h = f(ρ_{h−1}) for f(x) = x − x²/4, and f′(x) = 1 − x/2. As such, f′(x) > 0 for x ∈ [0, 1], and f(x) is increasing in the range [0, 1]. As such, by induction, we have that

ρ_h = f(ρ_{h−1}) ≥ f(1/((h − 1) + 1)) = f(1/h) = 1/h − 1/(4h²).

We need to prove that ρ_h ≥ 1/(h + 1), which is implied by the above if

1/h − 1/(4h²) ≥ 1/(h + 1)  ⟺  4h(h + 1) − (h + 1) ≥ 4h²  ⟺  4h² + 4h − h − 1 ≥ 4h²  ⟺  3h ≥ 1,

which trivially holds. 

Lemma 3.1.2. We have that ρh = O(1/h).

Proof: The claim trivially holds for small values of h. Let h_j be the minimal index such that ρ_{h_j} ≤ 1/2^j. It is easy to verify that ρ_{h_j} ≥ 1/2^{j+1}. Since the sequence ρ_0, ρ_1, . . . decreases by ρ²/4 at each step, we have

h_{j+1} − h_j ≤ (ρ_{h_j} − ρ_{h_{j+1}}) / (ρ_{h_{j+1}}²/4) ≤ (1/2^j − 1/2^{j+2}) / (1/2^{2(j+2)+2}) = 2^{j+6} − 2^{j+4} = O(2^j).

Arguing similarly, we have

h_{j+2} − h_j ≥ (ρ_{h_j} − ρ_{h_{j+2}}) / (ρ_{h_j}²/4) ≥ (1/2^{j+1} − 1/2^{j+2}) / (1/2^{2j+2}) = 2^{j+1} − 2^j = Ω(2^j).

We conclude that h_j = (h_j − h_{j−2}) + (h_{j−2} − h_{j−4}) + · · · = Ω(2^j), implying the claim. 

3.2. Min Cut


3.2.1. Problem Definition

Let G = (V, E) be an undirected graph with n vertices and m edges. We are


interested in cuts in G.
Definition 3.2.1. A cut in G is a partition of the vertices of V into two sets S and V \ S, where the edges of the cut are

(S, V \ S) = { uv | u ∈ S, v ∈ V \ S, and uv ∈ E },

where S ≠ ∅ and V \ S ≠ ∅. We will refer to the number of edges in the cut (S, V \ S) as the size of the cut. For an example of a cut, see the figure on the right.
We are interested in the problem of computing the minimum cut (i.e., mincut), that is, the cut in the graph with minimum cardinality. Specifically, we would like to find the set S ⊆ V such that (S, V \ S) is as small as possible, and S is neither empty nor V \ S is empty.

3.2.2. Some Definitions


We remind the reader of the following concepts. The conditional probability of X given Y is Pr[X = x | Y = y] = Pr[(X = x) ∩ (Y = y)]/Pr[Y = y]. An equivalent, useful restatement of this is that

Pr[(X = x) ∩ (Y = y)] = Pr[X = x | Y = y] · Pr[Y = y].   (3.1)

The following is easy to prove by induction using Eq. (3.1).

Lemma 3.2.2. Let E1, . . . , En be n events which are not necessarily independent. Then,

Pr[∩_{i=1}^{n} Ei] = Pr[E1] · Pr[E2 | E1] · Pr[E3 | E1 ∩ E2] · . . . · Pr[En | E1 ∩ . . . ∩ E_{n−1}].
3.3. The Algorithm

The basic operation used by the algorithm is edge contraction, depicted in Figure 3.1. We take an edge e = xy in G and merge the two vertices into a single vertex. The new resulting graph is denoted by G/xy. Note, that we remove self loops created by the contraction. However, since the resulting graph is no longer a regular graph, it has parallel edges – namely, it is a multi-graph. We represent a multi-graph as a regular graph with multiplicities on the edges. See Figure 3.2.

Figure 3.1: (a) A contraction of the edge xy. (b) The resulting graph.

The edge contraction operation can be implemented in O(n) time for a graph with n vertices. This is done by merging the adjacency lists of the two vertices being contracted, and then using hashing to do the fix-ups (i.e., we need to fix the adjacency lists of the vertices that are connected to the two vertices).
Note, that the cut is now computed counting multiplicities (i.e., if e is in the cut and it has weight w, then the contribution of e to the cut weight is w).

Figure 3.2: (a) A multi-graph. (b) A minimum cut in the resulting multi-graph.

Observation 3.3.1. A set of vertices in G/xy corresponds to a set of vertices in the graph G. Thus a cut in G/xy always corresponds to a valid cut in G. However, there are cuts in G that do not exist in G/xy. For example, the cut S = {x} does not exist in G/xy. As such, the size of the minimum cut in G/xy is at least as large as the minimum cut in G (as long as G/xy has at least one edge), since any cut in G/xy has a corresponding cut of the same cardinality in G.

Our algorithm works by repeatedly performing edge contractions. This is beneficial as this shrinks
the underlying graph, and we would compute the cut in the resulting (smaller) graph. An “extreme”
example of this, is shown in Figure 3.3, where we contract the graph into a single edge, which (in turn)
corresponds to a cut in the original graph. (It might help the reader to think about each vertex in the
contracted graph, as corresponding to a connected component in the original graph.)
Figure 3.3 also demonstrates the problem with taking this approach. Indeed, the resulting cut is not
the minimum cut in the graph.
So, why did the algorithm fail to find the minimum cut in this case?¬ The failure occurs because
of the contraction at Figure 3.3 (e), as we had contracted an edge in the minimum cut. In the new
graph, depicted in Figure 3.3 (f), there is no longer a cut of size 3, and all cuts are of size 4 or more.
Specifically, the algorithm succeeds only if it does not contract an edge in the minimum cut.

Observation 3.3.2. Let e1, . . . , en−2 be a sequence of edges in G, such that none of them is in the min-
imum cut, and such that G0 = G/{e1, . . . , en−2 } is a single multi-edge. Then, this multi-edge corresponds
to a minimum cut in G.

¬ Naturally, if the algorithm had succeeded in finding the minimum cut, this would have been our success.

Figure 3.3: (a) Original graph. (b)–(j) a sequence of contractions in the graph, and (h) the cut in the
original graph, corresponding to the single edge in (h). Note that the cut of (h) is not a mincut in the
original graph.

Note, that the claim in the above observation is only in one direction. We might be able to still compute a minimum cut, even if we contract an edge in a minimum cut, the reason being that a minimum cut is not unique. In particular, another minimum cut might have survived the sequence of contractions that destroyed other minimum cuts.
Using Observation 3.3.2 in an algorithm is problematic, since the argumentation is circular: how can we find a sequence of edges that are not in the cut without knowing what the cut is? The way to slice the Gordian knot here is to randomly select an edge at each stage, and contract this random edge. See Figure 3.4 for the resulting algorithm MinCut.

Algorithm MinCut(G)
    G0 ← G
    i ← 0
    while Gi has more than two vertices do
        Pick randomly an edge ei from the edges of Gi
        Gi+1 ← Gi/ei
        i ← i + 1
    Let (S, V \ S) be the cut in the original graph corresponding to the single edge in Gi
    return (S, V \ S).

Figure 3.4: The minimum cut algorithm.
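As a concrete illustration (ours; the function name and representation are not from the text), here is a Python sketch of a single run of MinCut on a multi-graph given as an edge list; parallel edges are simply repeated pairs in the list, and the input is assumed to be connected.

import random

def karger_min_cut(n, edges):
    # one run of MinCut on a connected multi-graph with vertices 0..n-1;
    # returns the size of the cut found (always >= the true minimum cut)
    comp = list(range(n))                 # comp[v] = super-vertex currently containing v

    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]       # path halving
            v = comp[v]
        return v

    alive, work = n, list(edges)
    while alive > 2:
        u, v = random.choice(work)        # a uniformly random surviving edge
        comp[find(u)] = find(v)           # contract it
        alive -= 1
        work = [(a, b) for (a, b) in work if find(a) != find(b)]   # drop self loops
    return len(work)                      # edges crossing the cut between the two super-vertices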

3.3.1. Analysis
3.3.1.1. The probability of success.
Naturally, if we are extremely lucky, the algorithm would never pick an edge in the mincut, and the
algorithm would succeed. The ultimate question here is what is the probability of success. If it is
relatively “large” then this algorithm is useful since we can run it several times, and return the best
result computed. If on the other hand, this probability is tiny, then we are working in vain since this

approach would not work.
Lemma 3.3.3. If a graph G has a minimum cut of size k and G has n vertices, then |E(G)| ≥ kn/2.

Proof: Each vertex degree is at least k, otherwise the vertex itself would form a cut of size smaller than k. As such, there are at least ∑_{v∈V} degree(v)/2 ≥ nk/2 edges in the graph. 

Lemma 3.3.4. If we pick an edge e at random from a graph G, then with probability at most 2/n it belongs to the minimum cut.

Proof: There are at least nk/2 edges in the graph and exactly k edges in the minimum cut. Thus, the probability of picking an edge from the minimum cut is at most k/(nk/2) = 2/n. 

The following lemma shows (surprisingly) that MinCut succeeds with reasonable probability.

Lemma 3.3.5. MinCut outputs the mincut with probability ≥ 2/(n(n − 1)).

Proof: Let Ei be the event that ei is not in the minimum cut of Gi. By Observation 3.3.2, MinCut outputs the minimum cut if the events E0, . . . , E_{n−3} all happen (namely, all edges picked are outside the minimum cut).
By Lemma 3.3.4, it holds that Pr[Ei | E0 ∩ E1 ∩ . . . ∩ E_{i−1}] ≥ 1 − 2/|V(Gi)| = 1 − 2/(n − i). Implying that

∆ = Pr[E0 ∩ . . . ∩ E_{n−3}] = Pr[E0] · Pr[E1 | E0] · Pr[E2 | E0 ∩ E1] · . . . · Pr[E_{n−3} | E0 ∩ . . . ∩ E_{n−4}].

As such, we have

∆ ≥ ∏_{i=0}^{n−3} (1 − 2/(n − i)) = ∏_{i=0}^{n−3} (n − i − 2)/(n − i) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · . . . · (2/4) · (1/3) = 2/(n · (n − 1)). 

3.3.1.2. Running time analysis.


Observation 3.3.6. MinCut runs in O(n2 ) time.
Observation 3.3.7. The algorithm always outputs a cut, and the cut is not smaller than the minimum
cut.
Definition 3.3.8. (informal) Amplification is the process of running an experiment again and again till
the things we want to happen, with good probability, do happen.
Let MinCutRep be the algorithm that runs MinCut n(n − 1) times and return the minimum cut
computed in all those independent executions of MinCut.
Lemma 3.3.9. The probability that MinCutRep fails to return the minimum cut is < 0.14.

Proof: The probability of failure of MinCut to output the mincut in a single execution is at most 1 − 2/(n(n−1)), by Lemma 3.3.5. Now, MinCutRep fails only if all the n(n − 1) executions of MinCut fail. But these executions are independent, as such, the probability of this happening is at most

(1 − 2/(n(n−1)))^{n(n−1)} ≤ exp(−(2/(n(n−1))) · n(n−1)) = exp(−2) < 0.14,

since 1 − x ≤ e^{−x} for 0 ≤ x ≤ 1. 
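As a sketch (ours), the amplification is just a loop over independent runs of the contraction algorithm, reusing the karger_min_cut sketch above:

def min_cut_rep(n, edges, repetitions=None):
    # run MinCut repeatedly and keep the best cut found (MinCutRep)
    if repetitions is None:
        repetitions = n * (n - 1)
    return min(karger_min_cut(n, edges) for _ in range(repetitions))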
Theorem 3.3.10. One can compute the minimum cut in O(n⁴) time, with constant probability of getting a correct result. In O(n⁴ log n) time the minimum cut is returned with high probability.

Contract(G, t)
begin
    while |V(G)| > t do
        Pick a random edge e in G.
        G ← G/e
    return G
end

FastCut(G = (V, E))
    G – multi-graph
begin
    n ← |V(G)|
    if n ≤ 6 then
        Compute (via brute force) minimum cut of G and return cut.
    t ← ⌈1 + n/√2⌉
    H1 ← Contract(G, t)
    H2 ← Contract(G, t)
    /* Contract is randomized!!! */
    X1 ← FastCut(H1), X2 ← FastCut(H2)
    return minimum cut out of X1 and X2.
end

Figure 3.5: Contract(G, t) shrinks G till it has only t vertices. FastCut computes the minimum cut using Contract.
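A Python sketch (ours) of Contract and FastCut; vertices are arbitrary hashable labels, and a multi-graph is an edge list with repetitions.

import math, random

def contract(vertices, edges, t):
    # shrink the multi-graph by random edge contractions until only t super-vertices remain
    comp = {v: v for v in vertices}
    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]
            v = comp[v]
        return v
    alive, work = len(vertices), list(edges)
    while alive > t and work:
        u, v = random.choice(work)
        comp[find(u)] = find(v)
        alive -= 1
        work = [(a, b) for (a, b) in work if find(a) != find(b)]
    return {find(v) for v in vertices}, [(find(a), find(b)) for (a, b) in work]

def fast_cut(vertices, edges):
    # FastCut sketch: contract twice to about n/sqrt(2) vertices, recurse, keep the better cut
    if not edges:
        return 0                                  # edgeless graph: empty cut
    n = len(vertices)
    if n <= 6:                                    # brute force on tiny graphs
        vs, best = list(vertices), len(edges)
        for mask in range(1, 2 ** n - 1):
            side = {vs[i] for i in range(n) if (mask >> i) & 1}
            best = min(best, sum((a in side) != (b in side) for a, b in edges))
        return best
    t = 1 + math.ceil(n / math.sqrt(2))
    return min(fast_cut(*contract(vertices, edges, t)),
               fast_cut(*contract(vertices, edges, t)))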

3.4. A faster algorithm


The algorithm presented in the previous section is extremely simple, which raises the question of whether we can get a faster algorithm­?
So, why does MinCutRep need so many executions? Well, the probability of success in the first ν iterations is

Pr[E0 ∩ . . . ∩ E_{ν−1}] ≥ ∏_{i=0}^{ν−1} (1 − 2/(n − i)) = ∏_{i=0}^{ν−1} (n − i − 2)/(n − i) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · . . . = (n − ν)(n − ν − 1)/(n · (n − 1)).   (3.2)

Namely, this probability deteriorates very quickly toward the end of the execution, when the graph becomes small enough. (To see this, observe that for ν = n/2 the probability of success is roughly 1/4, but for ν = n − √n the probability of success is roughly 1/n.)
So, the key observation is that as the graph get smaller the probability to make a bad choice increases.
So, instead of doing the amplification from the outside of the algorithm, we will run the new algorithm
more times when the graph is smaller. Namely, we put the amplification directly into the algorithm.
The basic new operation we use is Contract, depicted in Figure 3.5, which also depict the new
algorithm FastCut.

Lemma 3.4.1. The running time of FastCut(G) is O(n² log n), where n = |V(G)|.

Proof: Well, we perform two calls to Contract(G, t), which takes O(n²) time. And then we perform two recursive calls on the resulting graphs. We have:

T(n) = O(n²) + 2T(n/√2).

The solution to this recurrence is O(n² log n), as one can easily (and should) verify.

­ This would require a more involved algorithm, that's life.



Exercise 3.4.2. Show that one can modify FastCut so that it uses only O(n²) space.

Lemma 3.4.3. The probability that Contract(G, ⌈1 + n/√2⌉) had not contracted the minimum cut is at least 1/2.
Namely, the probability that the minimum cut in the contracted graph is still a minimum cut in the original graph is at least 1/2.

Proof: Just plug ν = n − t = n − ⌈1 + n/√2⌉ into Eq. (3.2). We have

Pr[E0 ∩ . . . ∩ E_{ν−1}] ≥ t(t − 1)/(n · (n − 1)) = ⌈1 + n/√2⌉ (⌈1 + n/√2⌉ − 1)/(n(n − 1)) ≥ 1/2. 

The following lemma bounds the probability of success.

Lemma 3.4.4. FastCut finds the minimum cut with probability larger than Ω(1/log n).

Proof: Let Th be the recursion tree of the algorithm of depth h = Θ(log n). Color an edge of recursion
tree by black if the contraction succeeded. Clearly, the algorithm succeeds if there is a path from the
root to a leaf that is all black. This is exactly the settings of Lemma 3.1.1, and we conclude that the
probability of success is at least 1/(h + 1) = Θ(1/log n), as desired. 

Exercise 3.4.5. Prove, that running FastCut repeatedly c · log2 n times, guarantee that the algorithm
outputs the minimum cut with probability ≥ 1 − 1/n2 , say, for c a constant large enough.

Theorem 3.4.6. One can compute the minimum cut in a graph G with n vertices in O(n2 log3 n) time.
The algorithm succeeds with probability ≥ 1 − 1/n2 .

Proof: We do amplification on FastCut by running it O(log2 n) times. The running time bound follows
from Lemma 3.4.1. The bound on the probability follows from Lemma 3.4.4, and using the amplification
analysis as done in Lemma 3.3.9 for MinCutRep. 

3.5. Bibliographical Notes


The MinCut algorithm was developed by David Karger during his PhD thesis in Stanford. The fast
algorithm is a joint work with Clifford Stein. The basic algorithm of the mincut is described in [MR95,
pages 7–9], the faster algorithm is described in [MR95, pages 289–295].

3.5.0.0.1. Galton-Watson process. The idea of using coloring of the edges of a tree to analyze
FastCut might be new (i.e., Section 3.1.2).

Chapter 4

The Occupancy and Coupon Collector


problems

4.1. Preliminaries
Definition 4.1.1 (Variance and Standard Deviation). For a random variable X, let V[X] = E[(X − µ_X)²] = E[X²] − µ_X² denote the variance of X, where µ_X = E[X]. Intuitively, this tells us how concentrated the distribution of X is.
The standard deviation of X, denoted by σ_X, is the quantity √(V[X]).

Observation 4.1.2. (i) For any constant c ≥ 0, we have V[cX] = c² V[X].
(ii) For X and Y independent variables, we have V[X + Y] = V[X] + V[Y].

Definition 4.1.3 (Bernoulli distribution). Assume that one flips a coin and gets 1 (heads) with probability p, and 0 (i.e., tails) with probability q = 1 − p. Let X be this random variable. The variable X has Bernoulli distribution with parameter p.
We have that E[X] = 1 · p + 0 · (1 − p) = p, and

V[X] = E[X²] − µ_X² = E[X²] − p² = p − p² = p(1 − p) = pq.

Definition 4.1.4 (Binomial distribution). Assume that we repeat a Bernoulli experiment n times (independently!). Let X1, . . . , Xn be the resulting random variables, and let X = X1 + · · · + Xn. The variable X has the binomial distribution with parameters n and p. We denote this fact by X ∼ Bin(n, p). We have

b(k; n, p) = Pr[X = k] = C(n, k) p^k q^{n−k},

where C(n, k) = n!/(k!(n − k)!) denotes the binomial coefficient. Also, E[X] = np, and V[X] = V[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} V[Xi] = npq.

Observation 4.1.5. Let C1, . . . , Cn be random events (not necessarily independent). Then

Pr[∪_{i=1}^{n} Ci] ≤ ∑_{i=1}^{n} Pr[Ci].

(This is usually referred to as the union bound.) If C1, . . . , Cn are disjoint events then

Pr[∪_{i=1}^{n} Ci] = ∑_{i=1}^{n} Pr[Ci].

4.1.1. Geometric distribution


Definition 4.1.6. Consider a sequence X1, X2, . . . of independent Bernoulli trials with probability p for
success. Let X be the number of trials one has to perform till encountering the first success. The
distribution of X is geometric distribution with parameter p. We denote this by X ∼ Geom(p).
Lemma 4.1.7. For a variable X ∼ Geom(p), we have, for all i, that Pr[X = i] = (1 − p)^{i−1} p. Furthermore, E[X] = 1/p and V[X] = (1 − p)/p².

Proof: The proof of the expectation and variance is included for the sake of completeness, and the reader is of course encouraged to skip (reading) this proof. So, let f(x) = ∑_{i=0}^{∞} x^i = 1/(1 − x), and observe that f′(x) = ∑_{i=1}^{∞} i x^{i−1} = (1 − x)^{−2}. As such, we have

E[X] = ∑_{i=1}^{∞} i (1 − p)^{i−1} p = p f′(1 − p) = p/(1 − (1 − p))² = 1/p.

V[X] = E[X²] − 1/p² = ∑_{i=1}^{∞} i² (1 − p)^{i−1} p − 1/p² = p + p(1 − p) ∑_{i=2}^{∞} i² (1 − p)^{i−2} − 1/p².

We need to do a similar trick to what we did before. To this end, we observe that

f″(x) = ∑_{i=2}^{∞} i(i − 1) x^{i−2} = ((1 − x)^{−1})″ = 2/(1 − x)³.

As such, we have that

∆(x) = ∑_{i=2}^{∞} i² x^{i−2} = ∑_{i=2}^{∞} i(i − 1) x^{i−2} + ∑_{i=2}^{∞} i x^{i−2} = f″(x) + (1/x) ∑_{i=2}^{∞} i x^{i−1} = f″(x) + (1/x)(f′(x) − 1)
= 2/(1 − x)³ + (1/x)(1/(1 − x)² − 1) = 2/(1 − x)³ + (1/x) · (1 − (1 − x)²)/(1 − x)² = 2/(1 − x)³ + (1/x) · x(2 − x)/(1 − x)²
= 2/(1 − x)³ + (2 − x)/(1 − x)².

As such, we have that

V[X] = p + p(1 − p) ∆(1 − p) − 1/p² = p + p(1 − p)(2/p³ + (1 + p)/p²) − 1/p² = p + 2(1 − p)/p² + (1 − p²)/p − 1/p²
= (p³ + 2(1 − p) + p − p³ − 1)/p² = (1 − p)/p². 

4.1.2. Some needed math
Lemma 4.1.8. For any positive integer n, we have:
(i) (1 + 1/n)^n ≤ e.
(ii) (1 − 1/n)^{n−1} ≥ e^{−1}.
(iii) n! ≥ (n/e)^n.
(iv) For any k ≤ n, we have: (n/k)^k ≤ C(n, k) ≤ (ne/k)^k.

Proof: (i) Indeed, 1 + 1/n ≤ exp(1/n), since 1 + x ≤ e^x for x ≥ 0. As such (1 + 1/n)^n ≤ exp(n(1/n)) = e.
(ii) Rewriting the inequality, we have that we need to prove ((n−1)/n)^{n−1} ≥ 1/e. This is equivalent to proving e ≥ (n/(n−1))^{n−1} = (1 + 1/(n−1))^{n−1}, which is our friend from (i).
(iii) Indeed,

n^n/n! ≤ ∑_{i=0}^{∞} n^i/i! = e^n,

by the Taylor expansion of e^x = ∑_{i=0}^{∞} x^i/i!. This implies that (n/e)^n ≤ n!, as required.
(iv) Indeed, for any k ≤ n, we have n/k ≤ (n−1)/(k−1), since kn − n = n(k − 1) ≤ k(n − 1) = kn − k. As such, n/k ≤ (n−i)/(k−i), for 1 ≤ i ≤ k − 1. As such,

(n/k)^k ≤ (n/k) · ((n−1)/(k−1)) · · · ((n−i)/(k−i)) · · · ((n−k+1)/1) = n!/((n − k)! k!) = C(n, k).

As for the other direction, we have

C(n, k) ≤ n^k/k! ≤ n^k/(k/e)^k = (ne/k)^k,

by (iii). 

4.2. Occupancy Problems


Problem 4.2.1. We are throwing m balls into n bins randomly (i.e., for every ball we randomly and
uniformly pick a bin from the n available bins, and place the ball in the bin picked). There are many
natural questions one can ask here:
(A) What is the maximum number of balls in any bin?
(B) What is the number of bins which are empty?
(C) How many balls do we have to throw, such that all the bins are non-empty, with reasonable
probability?

Let Xi be the number of balls in the ith bin, when we throw n balls into n bins (i.e., m = n). Clearly,

E[Xi] = ∑_{j=1}^{n} Pr[the jth ball falls in the ith bin] = n · (1/n) = 1,

by linearity of expectation. The probability that the first bin has exactly i balls is

C(n, i) (1/n)^i (1 − 1/n)^{n−i} ≤ C(n, i) (1/n)^i ≤ (ne/i)^i (1/n)^i = (e/i)^i.

This follows by Lemma 4.1.8 (iv).
Let Cj(k) be the event that the jth bin has k or more balls in it. Then,

Pr[C1(k)] ≤ ∑_{i=k}^{n} (e/i)^i ≤ (e/k)^k (1 + e/k + e²/k² + . . .) = (e/k)^k · 1/(1 − e/k).

Let k* = ⌈(3 ln n)/ln ln n⌉. Then,

Pr[C1(k*)] ≤ (e/k*)^{k*} · 1/(1 − e/k*) ≤ 2 (e ln ln n/(3 ln n))^{k*} = 2 (exp(1 − ln 3 − ln ln n + ln ln ln n))^{k*}
≤ 2 (exp(−ln ln n + ln ln ln n))^{k*}
≤ 2 exp(−3 ln n + 6 ln n · (ln ln ln n)/(ln ln n)) ≤ 2 exp(−2.5 ln n) ≤ 2/n^{2.5},

for n large enough. We conclude, since there are n bins and they have identical distributions, that

Pr[any bin contains more than k* balls] ≤ ∑_{i=1}^{n} Pr[Ci(k*)] ≤ n · 2/n^{2.5} ≤ 1/n.

Theorem 4.2.2. With probability at least 1 − 1/n, no bin has more than k = ⌈(3 ln n)/ln ln n⌉ balls in it.
Exercise 4.2.3. Show that for m = n ln n, with probability 1 − o(1), every bin has O(log n) balls.
It is interesting to note, that if at each iteration we randomly pick d bins, and throw the ball into
the bin with the smallest number of balls, then one can do much better. We currently do not have the
machinery to prove the following theorem, but hopefully we would prove it later in the course.
Theorem 4.2.4. Suppose that n balls are sequentially placed into n bins in the following manner. For each ball, d ≥ 2 bins are chosen independently and uniformly at random (with replacement). Each ball is placed in the least full of the d bins at the time of placement, with ties broken randomly. After all the balls are placed, the maximum load of any bin is at most ln ln n/ln d + O(1), with probability at least 1 − o(1/n).
Note, even by setting d = 2, we get considerable improvement. A proof of this theorem can be found in
the work by Azar et al. [ABKU00].
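A tiny simulation (ours) makes the contrast tangible: with d = 1 the maximum load behaves like ln n/ln ln n, while with d = 2 it drops to roughly ln ln n/ln 2 + O(1).

import random

def max_load(n, d=1):
    # throw n balls into n bins, placing each ball in the least loaded of d random bins
    bins = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda b: bins[b])
        bins[target] += 1
    return max(bins)

if __name__ == "__main__":
    n = 100_000
    print("d = 1:", max_load(n, 1), "   d = 2:", max_load(n, 2))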

4.2.1. The Probability of all bins to have exactly one ball


Next, we are interested in the probability that all m balls fall in distinct bins. Let Xi be the event that the ith ball fell in a bin distinct from those of the first i − 1 balls. We have:

Pr[∩_{i=2}^{m} Xi] = Pr[X2] ∏_{i=3}^{m} Pr[Xi | ∩_{j=2}^{i−1} Xj] ≤ ∏_{i=2}^{m} (n − i + 1)/n = ∏_{i=2}^{m} (1 − (i − 1)/n) ≤ ∏_{i=2}^{m} e^{−(i−1)/n} ≤ exp(−m(m − 1)/(2n)),

thus for m = ⌈√(2n)⌉ + 1, the probability that all the m balls fall in different bins is smaller than 1/e.
This is sometimes referred to as the birthday paradox. You have m = 30 people in the room, and you ask them for the date (day and month) of their birthday (i.e., n = 365). The above shows that the probability of all birthdays being distinct is at most exp(−30 · 29/730) ≤ 1/e. Namely, there is more than a 50% chance for a birthday collision, a simple but counterintuitive phenomenon.
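A quick numerical check (ours) of this bound, comparing the exact probability with the exponential upper bound used above:

from math import exp

def all_distinct_prob(m, n):
    # exact probability that m balls land in m distinct bins out of n
    p = 1.0
    for i in range(1, m):
        p *= (n - i) / n
    return p

print(all_distinct_prob(30, 365), "<=", exp(-30 * 29 / 730))   # about 0.294 <= 0.304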

4.3. The Markov and Chebyshev’s inequalities


We remind the reader that for a random variable Y assuming real values, its expectation is E[Y] = ∑_y y · Pr[Y = y]. Similarly, for a function f(·), we have E[f(Y)] = ∑_y f(y) · Pr[Y = y].

Theorem 4.3.1 (Markov's Inequality). Let Y be a random variable assuming only non-negative values. Then for all t > 0, we have

Pr[Y ≥ t] ≤ E[Y]/t.

Proof: Indeed,

E[Y] = ∑_{y ≥ t} y Pr[Y = y] + ∑_{y < t} y Pr[Y = y] ≥ ∑_{y ≥ t} y Pr[Y = y] ≥ ∑_{y ≥ t} t Pr[Y = y] = t Pr[Y ≥ t]. 

Markov inequality is tight, as the following exercise testifies.

Exercise 4.3.2. For any (integer) k > 1, define a random positive variable X_k such that Pr[X_k ≥ k E[X_k]] = 1/k.

Theorem 4.3.3 (Chebyshev's inequality). Pr[|X − µ_X| ≥ t σ_X] ≤ 1/t², where µ_X = E[X] and σ_X = √(V[X]).

Proof: Note that

Pr[|X − µ_X| ≥ t σ_X] = Pr[(X − µ_X)² ≥ t² σ_X²].

Set Y = (X − µ_X)². Clearly, E[Y] = σ_X². Now, apply Markov's inequality to Y. 

4.4. The Coupon Collector’s Problem


There are n types of coupons, and at each trial one coupon is picked in random. How many trials one
has to perform before picking all coupons? Let m be the number of trials performed. We would like to
bound the probability that m exceeds a certain number, and we still did not pick all coupons.

Let Ci ∈ {1, . . . , n} be the coupon picked in the ith trial. The jth trial is a success if Cj was not picked before in the first j − 1 trials. Let Xi denote the number of trials performed after the ith success, up to and including the (i + 1)th success. Clearly, the number of trials performed is

X = ∑_{i=0}^{n−1} Xi.

Now, the probability of Xi succeeding in a trial is pi = (n − i)/n, and Xi has the geometric distribution with probability pi. As such E[Xi] = 1/pi, and V[Xi] = (1 − pi)/pi².
Thus,

E[X] = ∑_{i=0}^{n−1} E[Xi] = ∑_{i=0}^{n−1} n/(n − i) = n H_n = n(ln n + Θ(1)) = n ln n + O(n),

where H_n = ∑_{i=1}^{n} 1/i is the nth Harmonic number.
As for the variance, using the independence of X0, . . . , X_{n−1}, we have

V[X] = ∑_{i=0}^{n−1} V[Xi] = ∑_{i=0}^{n−1} (1 − pi)/pi² = ∑_{i=0}^{n−1} (i/n) (n/(n − i))² = n ∑_{i=0}^{n−1} i/(n − i)²
= n ∑_{i=1}^{n} (n − i)/i² = n (∑_{i=1}^{n} n/i² − ∑_{i=1}^{n} 1/i) = n² ∑_{i=1}^{n} 1/i² − n H_n.

Since lim_{n→∞} ∑_{i=1}^{n} 1/i² = π²/6, we have lim_{n→∞} V[X]/n² = π²/6.

Corollary 4.4.1. Let X be the number of rounds till we collect all n coupons. Then, V[X] ≈ (π²/6) n² and its standard deviation is σ_X ≈ (π/√6) n.

This implies a weak bound on the concentration of X, using Chebyshev's inequality, but this is considerably weaker than what we will be able to prove later. Indeed, we have

Pr[X ≥ n log n + n + t · n · π/√6] ≤ Pr[|X − E[X]| ≥ t σ_X] ≤ 1/t².

Note that this is somewhat approximate, and holds for n sufficiently large.
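A short simulation (ours) of the coupon collector's process; for moderate n the sample mean is close to n ln n + Θ(n), and the sample standard deviation is close to (π/√6) n ≈ 1.28 n.

import random, statistics
from math import log, pi, sqrt

def collect(n):
    # number of uniform draws from {0, ..., n-1} until every coupon has been seen
    seen, trials = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        trials += 1
    return trials

n, runs = 1000, 200
samples = [collect(n) for _ in range(runs)]
print("mean  :", statistics.mean(samples), " vs n ln n =", round(n * log(n)))
print("stdev :", statistics.stdev(samples), " vs (pi/sqrt(6)) n =", round(pi / sqrt(6) * n))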

4.5. Notes
The material in this note covers parts of [MR95, sections 3.1,3.2,3.6]

Chapter 5

Sampling, Estimation, and More on the


Coupon’s Collector Problems II
There is not much talking now. A silence falls upon them all. This is no time to talk of hedges and fields, or the beauties of any country. Sadness and fear and hate, how they well up in the heart and mind, whenever one opens the pages of these messengers of doom. Cry for the broken tribe, for the law and custom that is gone. Aye, and cry aloud for the man who is dead, for the woman and children bereaved. Cry, the beloved country, these things are not yet at an end. The sun pours down on the earth, on the lovely land that man cannot enjoy. He knows only the fear of his heart.

Alan Paton, Cry, the beloved country


5.1. Randomized selection – Using sampling to learn the world
5.1.1. Sampling
One of the big advantages of randomized algorithms is that they sample the world; that is, they learn what the input looks like without reading all of it. For example, consider the following problem: We are given a set U of n objects u1, . . . , un, and we want to compute the number of elements of U that have some property. Assume that one can check if this property holds, in constant time, for a single object, and let ψ(u) be the function that returns 1 if the property holds for the element u, and zero otherwise.
Now, let α be the number of objects in U that have this property. We want to reliably estimate α without computing the property for all the elements of U.
A natural approach would be to pick a random sample R of m objects, r1, . . . , rm, from U (with repetition), and compute Y = ∑_{i=1}^{m} ψ(ri); our estimate for α is β = (n/m)Y. It is natural to ask how far β is from the true value.

Lemma 5.1.1. Let U be a set of n elements, with α of them having a certain property ψ. Let R be a uniform random sample of size m from U (with repetition), let Y be the number of elements in R that have the property ψ, and let Z = (n/m)Y be the estimate for α. Then, for any t ≥ 1, we have that

Pr[α − t n/(2√m) ≤ Z ≤ α + t n/(2√m)] ≥ 1 − 1/t².

Similarly, we have that Pr[E[Y] − t√m/2 ≤ Y ≤ E[Y] + t√m/2] ≥ 1 − 1/t².
Proof: Let Yi = ψ(ri) be an indicator variable that is 1 if the ith sample ri has the property ψ. Now, Y = ∑_i Yi has binomial distribution with probability p = α/n and m samples; that is, Y ∼ Bin(m, p). We saw in the previous lecture that E[Y] = mp, V[Y] = mp(1 − p), and its standard deviation is σ_Y = √(mp(1 − p)) ≤ √m/2, as p(1 − p) is maximized for p = 1/2. We have

∆ = (n/m) t σ_Y ≤ t (n/m)(√m/2) = t n/(2√m).

As such,

Pr[|Z − α| ≥ t n/(2√m)] ≤ Pr[|Z − α| ≥ ∆] = Pr[(m/n)|Z − α| ≥ (m/n)∆] = Pr[|Y − E[Y]| ≥ t σ_Y] ≤ 1/t²,

by Chebyshev's inequality. 
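A minimal sketch (ours) of the estimator of Lemma 5.1.1: sample m elements with repetition, count the hits, and scale by n/m.

import random

def estimate_count(universe, has_property, m):
    # estimate |{u in universe : has_property(u)}| from m samples drawn with repetition
    n = len(universe)
    hits = sum(has_property(random.choice(universe)) for _ in range(m))
    return (n / m) * hits

# toy usage: estimate how many of 0..9999 are divisible by 7 (true answer: 1429)
print(estimate_count(list(range(10_000)), lambda u: u % 7 == 0, m=400))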

5.1.1.1. Inverse estimation


We are given a set U = {u1, . . . , un} of n distinct numbers. Let s_i denote the ith smallest number in U – that is, s_i is the number of rank i in U. We are interested in estimating s_k quickly. So, let us take a sample R of size m (with repetition). Let R_{≤ s_k} be the set of all the numbers in R that are ≤ s_k. For Y = |R_{≤ s_k}|, we have that µ = E[Y] = km/n. Furthermore, for any t ≥ 1, Lemma 5.1.1 implies that Pr[µ − t√m/2 ≤ Y ≤ µ + t√m/2] ≥ 1 − 1/t². In particular, with probability ≥ 1 − 1/t², the number r− of rank ℓ− = ⌊µ − t√m/2⌋ − 1 in R is smaller than s_k, and similarly, the number r+ of rank ℓ+ = ⌈µ + t√m/2⌉ + 1 in R is larger than s_k.
One can conceptually think about the interval I(k) = [r−, r+] as a confidence interval – we know that s_k ∈ I(k) with probability ≥ 1 − 1/t². But how big is this interval? Namely, how many elements of U fall in I(k)?
To this end, consider the interval of ranks in U whose elements might show up in I(k). By the above, this is

I(k, t) = k + (n/m) · [−t√m/2 − 1, t√m/2 + 1].

In particular, consider the maximum ν ≤ k such that I(ν, t) and I(k, t) are disjoint. We have the condition that

ν + (n/m)(t√m/2 + 1) ≤ k − (n/m)(t√m/2 + 1)  ⟹  ν ≤ k − t n/√m − 2n/m.

To this end, let g = ⌈k − 2⌈t n/(2√m)⌉⌉ and h = ⌈k + 2⌈t n/(2√m)⌉⌉. It is easy to verify (using the same argumentation as above) that, with probability at least 1 − 3/t², the three confidence intervals I(g), I(k) and I(h) do not intersect. As such, every element of U that lies in I(k) has rank strictly between g and h, and we have

(number of elements of U in I(k)) ≤ h − g = O(t n/√m).

Func LazySelect(S, k)
Input: S – set of n elements, k – index of element to be output.
begin
    repeat
        R ← Sample with replacement of n^{3/4} elements from S ∪ {−∞, +∞}.
        Sort R.
        ℓ ← max(1, ⌊k n^{−1/4} − √n⌋),  h ← min(n^{3/4}, ⌈k n^{−1/4} + √n⌉)
        a ← R(ℓ), b ← R(h).
        Compute the ranks rS(a) and rS(b) of a and b in S
            /* using 2n comparisons */
        P ← { y ∈ S | a ≤ y ≤ b }
            /* computed while computing the ranks of a and b */
    until (rS(a) ≤ k ≤ rS(b)) and |P| ≤ 8n^{3/4} + 2
    Sort P in O(n^{3/4} log n) time.
    return P_{k − rS(a) + 1}
end LazySelect

Figure 5.1: The LazySelect algorithm.
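The following Python sketch (ours) mirrors this pseudocode; it is written for clarity rather than for matching the exact comparison counts of the analysis, and it assumes the elements of S are distinct.

import random

def lazy_select(S, k):
    # return the element of rank k (1-based) in S, following the LazySelect outline
    n = len(S)
    while True:
        m = max(2, round(n ** 0.75))
        R = sorted(random.choices(S, k=m))           # sample with replacement
        lo = max(0, int(k * m / n - m ** 0.5) - 1)   # rank about k n^{-1/4} - sqrt(n), clipped
        hi = min(m - 1, int(k * m / n + m ** 0.5) + 1)
        a, b = R[lo], R[hi]
        rank_a = sum(1 for y in S if y < a)          # number of elements smaller than a
        P = sorted(y for y in S if a <= y <= b)
        if rank_a < k <= rank_a + len(P) and len(P) <= 8 * m + 2:
            return P[k - rank_a - 1]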

Lemma 5.1.2. Given a set U of n numbers, a number k, and parameters t and m, one can compute, in O(m log m) time, two numbers r−, r+ ∈ U, such that:
(A) The number of rank k in U is in the interval I = [r−, r+].
(B) There are at most O(tn/√m) numbers of U in I.
The algorithm succeeds with probability ≥ 1 − 3/t².

Proof: Compute the sample in O(m) time (assuming the input numbers are in an array, say). Next, sort the numbers of R in O(m log m) time, and return the two elements of rank ℓ− and ℓ+ in the sorted set as the boundaries of the interval. The correctness follows from the above discussion. 
We next use the above observation to get a fast algorithm for selection.

5.1.2. Randomized selection


We are given a set S of n distinct elements, with an associated ordering. For t ∈ S, let rS (t) denote the
rank of t (the smallest element in S has rank 1). Let S(i) denote the ith element in the sorted list of S.
Given k, we would like to compute Sk (i.e., select the kth element). The code of LazySelect is
depicted in Figure 5.1.
Exercise 5.1.3. Show how to compute the ranks of rS (a) and rS (b), such that the expected number of
comparisons performed is 1.5n.
Consider the element S(k) and where it is mapped to in the random sample R. Consider the interval of values

I(j) = [R(α(j)), R(β(j))] = { R(k) | α(j) ≤ k ≤ β(j) },

where α(j) = j · n^{−1/4} − √n and β(j) = j · n^{−1/4} + √n.

Lemma 5.1.4. For a fixed j, we have that Pr[S(j) ∈ I(j)] ≥ 1 − 1/(4n^{1/4}).

Proof: There are two possible bad events: (i) S(j) < R(α(j)), and (ii) R(β(j)) < S(j). Let Xi be an indicator variable which is 1 if the ith sample is smaller than or equal to S(j), and 0 otherwise. We have p = Pr[Xi = 1] = j/n and q = 1 − j/n. The random variable X = ∑_{i=1}^{n^{3/4}} Xi is the rank of S(j) in the random sample. Clearly, X ∼ Bin(n^{3/4}, j/n) (i.e., X has a binomial distribution with p = j/n and n^{3/4} trials). As such, we have E[X] = p n^{3/4} and V[X] = n^{3/4} p q.
Now, by Chebyshev's inequality,

Pr[|X − p n^{3/4}| ≥ t √(n^{3/4} p q)] ≤ 1/t².

Since p n^{3/4} = j n^{−1/4} and √(n^{3/4} (j/n)(1 − j/n)) ≤ n^{3/8}/2, we have that the probability of one of the bad events happening is

Pr[S(j) < R(α(j)) or R(β(j)) < S(j)] = Pr[X < j n^{−1/4} − √n or X > j n^{−1/4} + √n]
≤ Pr[|X − j n^{−1/4}| ≥ 2n^{1/8} · n^{3/8}/2] ≤ 1/(2n^{1/8})² = 1/(4n^{1/4}). 

Lemma 5.1.5. LazySelect succeeds with probability ≥ 1 − O(n^{−1/4}) in the first iteration, and it performs only 2n + o(n) comparisons.

Proof: By Lemma 5.1.4, we know that S(k) ∈ I(k) with probability ≥ 1 − 1/(4n^{1/4}). This in turn implies that S(k) ∈ P. Thus, the only possible bad event is that the set P is too large. To this end, set k− = k − 3n^{3/4} and k+ = k + 3n^{3/4}, and observe that, by definition, it holds that I(k−) ∩ I(k) = ∅ and I(k) ∩ I(k+) = ∅. Furthermore, by Lemma 5.1.4, S(k−) ∈ I(k−) and S(k+) ∈ I(k+), and both of these events hold with probability ≥ 1 − 2/(4n^{1/4}). As such, the set P, which is by definition contained in the range I(k), has only elements that are larger than S(k−) and smaller than S(k+). As such, the size of P is bounded by k+ − k− = 6n^{3/4}. Thus, the algorithm succeeds in the first iteration with probability ≥ 1 − 3/(4n^{1/4}).
As for the number of comparisons, an iteration requires

O(n^{3/4} log n) + 2n + O(n^{3/4} log n) = 2n + o(n)

comparisons. 
Any deterministic selection algorithm requires 2n comparisons, and LazySelect can be changed to
require only 1.5n + o(n) comparisons (expected).

5.2. The Coupon Collector’s Problem Revisited


5.2.1. Some technical lemmas
Unfortunately, in Randomized Algorithms, many of the calculations are awful¬ . As such, one has to be
dexterous in approximating such calculations. We present quickly a few of these estimates.
¬ "In space travel," repeated Slartibartfast, "all the numbers are awful." – Life, the Universe, and Everything Else,

Douglas Adams.

Lemma 5.2.1. For x ≥ 0, we have 1− x ≤ exp(−x) and 1+ x ≤ e x . Namely, for all x, we have 1+ x ≤ e x .

Proof: For x = 0 we have equality. Next, computing the derivative on both sides, we have that we need
to prove that −1 ≤ − exp(−x) ⇐⇒ 1 ≥ exp(−x) ⇐⇒ e x ≥ 1, which clearly holds for x ≥ 0.
A similar argument works for the second inequality. 
Lemma 5.2.2. For any y ≥ 1 and |x| ≤ 1, we have (1 − x²)^y ≥ 1 − y x².

Proof: Observe that the inequality holds with equality for x = 0. So compute the derivative with respect to x of both sides of the inequality. We need to prove that

y(−2x)(1 − x²)^{y−1} ≥ −2yx  ⟺  (1 − x²)^{y−1} ≤ 1,

which holds since 1 − x² ≤ 1 and y − 1 ≥ 0. 

Lemma 5.2.3. For any y ≥ 1 and |x| ≤ 1, we have (1 − x² y) e^{xy} ≤ (1 + x)^y ≤ e^{xy}.

Proof: The right side of the inequality is standard by now. As for the left side, observe that

(1 − x²) e^x ≤ 1 + x,

since dividing both sides by (1 + x) e^x, we get 1 − x ≤ e^{−x}, which we know holds for any x. By Lemma 5.2.2, we have

(1 − x² y) e^{xy} ≤ (1 − x²)^y e^{xy} = ((1 − x²) e^x)^y ≤ (1 + x)^y ≤ e^{xy}. 



5.2.2. Back to the coupon collector’s problem


There are n types of coupons, and at each trial one coupon is picked in random. How many trials one
has to perform before picking all coupons? Let m be the number of trials performed. We would like to
bound the probability that m exceeds a certain number, and we still did not pick all coupons.
In the previous lecture, we showed that

Pr[# of trials ≥ n log n + n + t · n · π/√6] ≤ 1/t²,

for any t.
A stronger bound follows from the following observation. Let Z_i^r denote the event that the ith coupon was not picked in the first r trials. Clearly,

Pr[Z_i^r] = (1 − 1/n)^r ≤ exp(−r/n).

Thus, for r = βn log n, we have Pr[Z_i^r] ≤ exp(−βn log n/n) = n^{−β}. Thus,

Pr[X > βn log n] ≤ Pr[∪_i Z_i^{βn log n}] ≤ n · Pr[Z_1^{βn log n}] ≤ n^{−β+1}.

Lemma 5.2.4. Let the random variable X denote the number of trials for collecting each of the n types of coupons. Then, we have Pr[X > n ln n + cn] ≤ e^{−c}.

Proof: Set m = n ln n + cn. The probability that we fail to pick the first type of coupon in m trials is α = (1 − 1/n)^m ≤ exp(−(n ln n + cn)/n) = exp(−c)/n. As such, using the union bound, the probability that we fail to pick some type of coupon is bounded by nα = exp(−c), as claimed. 

In the following, we show a slightly stronger bound on the probability, which is 1 − exp(−e−c ). To
see that it is indeed stronger, observe that e−c ≥ 1 − exp(−e−c ).

5.2.3. An asymptotically tight bound


Lemma 5.2.5. Let c > 0 be a constant, and let m = n ln n + cn for a positive integer n. Then for any constant k, we have

lim_{n→∞} C(n, k) (1 − k/n)^m = exp(−ck)/k!.

Proof: By Lemma 5.2.3, we have

(1 − k²m/n²) exp(−km/n) ≤ (1 − k/n)^m ≤ exp(−km/n).

Observe also that lim_{n→∞} (1 − k²m/n²) = 1, and exp(−km/n) = n^{−k} exp(−ck). Also,

lim_{n→∞} C(n, k) k!/n^k = lim_{n→∞} n(n − 1) · · · (n − k + 1)/n^k = 1.

Thus,

lim_{n→∞} C(n, k) (1 − k/n)^m = lim_{n→∞} (n^k/k!) exp(−km/n) = lim_{n→∞} (n^k/k!) n^{−k} exp(−ck) = exp(−ck)/k!. 

Theorem 5.2.6. Let the random variable X denote the number of trials for collecting each of the n types of coupons. Then, for any constant c ∈ ℝ, and m = n ln n + cn, we have lim_{n→∞} Pr[X > m] = 1 − exp(−e^{−c}).

Before delving into the proof, observe that 1 − exp(−e^{−c}) ≈ 1 − (1 − e^{−c}) = e^{−c}. Namely, in the limit, the upper bound of Lemma 5.2.4 is tight.

Proof: We have Pr[X > m] = Pr[∪_i Z_i^m]. By inclusion-exclusion, we have

Pr[∪_i Z_i^m] = ∑_{i=1}^{n} (−1)^{i+1} P_i^n,

where P_j^n = ∑_{1 ≤ i1 < i2 < . . . < ij ≤ n} Pr[∩_{v=1}^{j} Z_{iv}^m]. Let S_k^n = ∑_{i=1}^{k} (−1)^{i+1} P_i^n. We know that S_{2k}^n ≤ Pr[∪_i Z_i^m] ≤ S_{2k+1}^n.
By symmetry,

P_k^n = C(n, k) Pr[∩_{v=1}^{k} Z_v^m] = C(n, k) (1 − k/n)^m.

Thus, P_k = lim_{n→∞} P_k^n = exp(−ck)/k!, by Lemma 5.2.5. Thus, we have

S_k = ∑_{j=1}^{k} (−1)^{j+1} P_j = ∑_{j=1}^{k} (−1)^{j+1} exp(−cj)/j!.

Observe that lim_{k→∞} S_k = 1 − exp(−e^{−c}) by the Taylor expansion of exp(x) (for x = −e^{−c}). Indeed,

exp(x) = ∑_{j=0}^{∞} x^j/j! = ∑_{j=0}^{∞} (−e^{−c})^j/j! = 1 + ∑_{j=1}^{∞} (−1)^j exp(−cj)/j!.

Clearly, lim_{n→∞} S_k^n = S_k and lim_{k→∞} S_k = 1 − exp(−e^{−c}). Thus, (using fluffy math), we have

lim_{n→∞} Pr[X > m] = lim_{n→∞} Pr[∪_{i=1}^{n} Z_i^m] = lim_{n→∞} lim_{k→∞} S_k^n = lim_{k→∞} S_k = 1 − exp(−e^{−c}). 
Chapter 6

Sampling and other Stuff



6.1. Two-Point Sampling


Definition 6.1.1. A collection of random variables X1, . . . , Xn is pairwise independent, if for any pair of variables Xi and Xj, and any pair of values α and β, we have that Pr[(Xi = α) ∩ (Xj = β)] = Pr[Xi = α] Pr[Xj = β].
Similarly, this collection is k-wise independent, if for any t ≤ k variables Xi1, . . . , Xit in this collection, and any set of t values α1, . . . , αt, we have that

Pr[(Xi1 = α1) ∩ . . . ∩ (Xit = αt)] = ∏_{j=1}^{t} Pr[Xij = αj].

Namely, pairwise independent variables behave like independent random variables as long as you look only at pairs.

Example 6.1.2. Consider the probability space given by the following table, where the triple of variables X, Y, Z can be assigned any of the rows with equal probability (i.e., 1/4).

    X  Y  Z
    0  0  0
    0  1  1
    1  0  1
    1  1  0

Clearly, for any α, β ∈ {0, 1} we have Pr[(X = α) ∩ (Y = β)] = Pr[X = α] Pr[Y = β] = 1/4 (this also holds for X, Z and Y, Z). Namely, X, Y, Z are all pairwise independent. However, they are not 3-wise independent (or just independent). Indeed, we have Pr[(X = 1) ∩ (Y = 1) ∩ (Z = 1)] = 0, while it should have been 1/8 if they were truly independent, or even just 3-wise independent.

6.1.1. About Modulo Rings and Pairwise Independence


n o
Let p be a prime number, and let ZZ p = 0, 1, . . . , p − 1 denote the ring of integers modules p. Two
integers x and y are equivalent modulo p, if x ≡ y mod p; namely, the reminder of dividing x and y
by p is the same.
Lemma 6.1.3. Given y, i ∈ ZZ p , and choosing a and b randomly, independently and uniformly from ZZ p ,
the probability of y ≡ ai + b (mod p) is 1/p.

Proof: Imagine that we first choose a, then the required probability, is that we choose b such that
y − ai ≡ b (mod p). And the probability for that is 1/p, as we choose b uniformly. 

Lemma 6.1.4. Let p be a prime, and fix a ∈ {1, . . . , p − 1}. Then, {ai (mod p) | i = 0, . . . , p − 1} = ZZ_p.
Putting it differently, for any non-zero a ∈ ZZ_p, there is a unique inverse b ∈ ZZ_p such that ab (mod p) = 1.

Proof: Assume, for the sake of contradiction, that the claim is false. Then, by the pigeonhole principle, there must exist 0 ≤ j < i ≤ p − 1 such that ai (mod p) = aj (mod p). Namely, there are k, k′, u such that

ai = u + kp and aj = u + k′p.

(Here, we know that 0 ≤ u < p.) Since i > j, it must be that k > k′. Subtracting the two equalities, we get that a(i − j) = (k − k′)p > 0. Namely, p divides the product a(i − j). But p is a prime, and both a and i − j lie in the range 1, . . . , p − 1, so p divides neither factor; since a prime dividing a product must divide one of its factors, this is impossible. Thus, our assumption is false. 

Lemma 6.1.5. Given y, z, x, w ∈ ZZ p , such that x , w, and choosing a and b randomly and uniformly
from ZZ p , the probability that y ≡ ax + b (mod p) and z = aw + b (mod p) is 1/p2 .

Proof: This equivalent to claiming that the system of equalities y ≡ ax + b (mod p) and z = aw + b have
a unique solution in a and b.
To see why this is true, subtract one equation from the other. We get y − z ≡ a(x − w) (mod p). Since
x − w . 0 (mod p), it must be that there is a unique value of a such that the equation holds. This in
turns, imply a specific value for b. The probability that a and b get those two specific values is 1/p2 . 

Lemma 6.1.6. Let i and j be two distinct elements of ZZ p . And choose a and b randomly and inde-
pendently from ZZ p . Then, the two random variables Yi = ai + b (mod p) and Yj = a j + b (mod p) are
uniformly distributed on ZZ p , and are pairwise independent.

Proof: The claim about the uniform distribution follows from Lemma 6.1.3, as Pr[Yi = α] = 1/p, for any
α ∈ ZZ p . As for being pairwise independent, observe that

Pr[Yi = α | Yj = β] = Pr[(Yi = α) ∩ (Yj = β)] / Pr[Yj = β] = (1/p²)/(1/p) = 1/p = Pr[Yi = α],

by Lemma 6.1.3 and Lemma 6.1.5. Thus, Yi and Yj are pairwise independent. 

Remark 6.1.7. It is important to understand what independence between random variables mean: hav-
ing information about the value of X, gives you no information about Y . But this is only pairwise
independence. Indeed, consider the variables Y1, Y2, Y3, Y4 defined above. Every pair of them are pairwise
independent. But, given the values of Y1 and Y2 , one can compute the value of Y3 and Y4 immediately.
Indeed, giving the value of Y1 and Y2 is enough to figure out the value of a and b. Once we know a and
b, we immediately can compute all the Yi s.
Thus, the notion of independence can be extended to k-pairwise independence of n random variables,
where only if you know the value of k variables, you can compute the value of all the other variables.
More on that later in the course.
Lemma 6.1.8. If X and Y are pairwise independent then E[XY] = E[X] E[Y].

Proof: By definition,

E[XY] = ∑_{x,y} x y Pr[(X = x) ∩ (Y = y)] = ∑_{x,y} x y Pr[X = x] Pr[Y = y] = (∑_x x Pr[X = x]) (∑_y y Pr[Y = y]) = E[X] E[Y]. 
Lemma 6.1.9. Let X1, X2, . . . , Xn be pairwise independent random variables, and X = ∑_{i=1}^{n} Xi. Then V[X] = ∑_{i=1}^{n} V[Xi].

Proof: Observe that V[X] = E[(X − E[X])²] = E[X²] − (E[X])². Let X and Y be pairwise independent variables. Observe that E[XY] = E[X] E[Y], by Lemma 6.1.8. Thus,

V[X + Y] = E[(X + Y − E[X] − E[Y])²]
= E[(X + Y)² − 2(X + Y)(E[X] + E[Y]) + (E[X] + E[Y])²]
= E[(X + Y)²] − (E[X] + E[Y])²
= E[X² + 2XY + Y²] − (E[X])² − 2 E[X] E[Y] − (E[Y])²
= E[X²] − (E[X])² + E[Y²] − (E[Y])² + 2 E[XY] − 2 E[X] E[Y]
= V[X] + V[Y] + 2 E[X] E[Y] − 2 E[X] E[Y]
= V[X] + V[Y],

by Lemma 6.1.8. Using the above argumentation for several variables, instead of just two, implies the lemma. 

6.1.1.1. Generating k-wise independent variable


Consider the polynomial f(x) = ∑_{i=0}^{k−1} αi x^i evaluated modulo p, where the coefficients α0, . . . , α_{k−1} are chosen uniformly and independently from ZZ_p. We claim that f(0), f(1), . . . , f(p − 1) are k-wise independent. Indeed, for any k indices i1, . . . , ik ∈ ZZ_p, and k values v1, . . . , vk ∈ ZZ_p, the event that f(i1) = v1 and . . . and f(ik) = vk happens for exactly one choice of the αs (a polynomial of degree at most k − 1 is determined by its values at k points), which implies that this probability is 1/p^k, which is what we need.
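A sketch (ours) of this construction: draw a random polynomial of degree at most k − 1 over ZZ_p, and let its value at i be the ith output.

import random

def kwise_generator(p, k):
    # returns f, where f(0), f(1), ..., f(p-1) are k-wise independent and uniform over Z_p (p prime)
    coeffs = [random.randrange(p) for _ in range(k)]      # alpha_0, ..., alpha_{k-1}
    def f(i):
        value = 0
        for a in reversed(coeffs):                        # Horner's rule, all arithmetic mod p
            value = (value * i + a) % p
        return value
    return f

f = kwise_generator(p=101, k=4)
print([f(i) for i in range(10)])                          # 4-wise independent samples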

6.1.2. Application: Using less randomization for a randomized algorithm


We can consider a randomized algorithm, to be a deterministic algorithm Alg(x, r) that receives together
with the input x, a random string r of bits, that it uses to read random bits from. Let us redefine RP:

Definition 6.1.10. The class RP (for Randomized Polynomial time) consists of all languages L that have
a deterministic algorithm Alg(x, r) with worst case polynomial running time such that for any input
x ∈ Σ∗ ,
• x ∈ L =⇒ Alg(x, r) = 1 for half the possible values of r.
• x ∉ L =⇒ Alg(x, r) = 0 for all values of r.

Let assume that we now want to minimize the number of random bits we use in the execution of the
algorithm (Why?). If we run the algorithm t times, we have confidence 2−t in our result, while using
t log n random bits (assuming our random algorithm needs only log n bits in each execution). Similarly,
let us choose two random numbers from ZZn , and run Alg(x, a) and Alg(x, b), gaining us only confidence
1/4 in the correctness of our results, while requiring 2 log n bits.
Can we do better? Let us define ri = a·i + b mod n, where a, b are random values as above (note
that we assume that n is prime), for i = 1, . . . , t. Thus Y = Σ_{i=1}^{t} Alg(x, ri) is a sum of random variables
which are pairwise independent, as the ri are pairwise independent. Assume that x ∈ L. Then we have
E[Y] = t/2, σ²_Y = V[Y] = Σ_{i=1}^{t} V[Alg(x, ri)] ≤ t/4, and σ_Y ≤ √t/2. The probability that all these
executions failed corresponds to the event that Y = 0, and

Pr[Y = 0] ≤ Pr[|Y − E[Y]| ≥ t/2] = Pr[|Y − E[Y]| ≥ √t · (√t/2)] ≤ 1/t,

by the Chebyshev inequality. Thus we were able to “extract” from our random bits much more than
one would naturally suspect is possible. We thus get the following result.
Lemma 6.1.11. Given an algorithm Alg in RP that uses lg n random bits, one can run it t times using
only 2 lg n random bits, such that the resulting algorithm fails with probability at most 1/t.
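The following Python sketch (mine, not part of the notes; the names amplify and toy_alg are hypothetical, with a toy predicate standing in for Alg) illustrates the construction: two random seeds a, b in Z_p generate the pairwise independent values ri = a·i + b mod p, which are fed to the algorithm in place of t independent random strings.

```python
import random

def amplify(alg, x, p, t, rng=random.Random(1)):
    """Run the one-sided-error algorithm `alg` t times using only two random
    seeds a, b in Z_p; the pairwise independent values r_i = a*i + b (mod p)
    replace t independent random strings. Accept if any run accepts."""
    a, b = rng.randrange(p), rng.randrange(p)
    return any(alg(x, (a * i + b) % p) for i in range(1, t + 1))

if __name__ == "__main__":
    p = 101  # prime, so Z_p is a field
    # Toy stand-in for Alg(x, r): accepts roughly half of the random strings.
    toy_alg = lambda x, r: (x ^ r) % 2 == 0
    # By the Chebyshev argument above, the amplified algorithm errs with
    # probability at most 1/t while using only 2*log(p) random bits.
    print(amplify(toy_alg, 17, p, t=50))
```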

6.2. QuickSort is quick via direct argumentation


Consider a specific element x in the input array of n elements that is being sorted by QuickSort, and
let Xi be the size of the recursive subproblem in the ith level of the recursion that contains x (if x does
not participate in any subproblem at this level, then Xi = 0). It is easy to verify that

X0 = n   and   E[Xi | Xi−1] ≤ (1/2) · (3/4) Xi−1 + (1/2) Xi−1 ≤ (7/8) Xi−1.

As such, E[Xi] = E[E[Xi | Xi−1]] ≤ (7/8)^i n. In particular, we have by Markov's inequality that

Pr[x participates in more than c ln n levels of the recursion] = Pr[X_{c ln n} ≥ 1] ≤ E[X_{c ln n}]/1
  ≤ (7/8)^{c ln n} n = n^{1 − c ln(8/7)} ≤ 1/n^{β+1},

provided c ln(8/7) ≥ β + 2, that is c ≥ (β + 2)/ln(8/7). We conclude the following.

Theorem 6.2.1. For any β ≥ 1, we have that the running time of QuickSort on n elements is
O(βn log n), with probability ≥ 1 − 1/n^β.

Proof: For c = (β + 2)/ln(8/7), the probability that a specific element participates in more than c ln n levels
of the recursion is at most 1/n^{β+1}. Since there are n elements, by the union bound, the probability
that any input number participates in more than c ln n recursive calls is at most 1/n^β. But that implies
that, with probability ≥ 1 − 1/n^β, the recursion depth of QuickSort is ≤ c ln n, which immediately
implies the claim.

What the above proof shows is that an element can not be too unlucky – if it participates in enough
rounds, then, with high probability, the subproblem containing it would shrink significantly. This fairness
of luck is one of the most important principles in randomized algorithms, and we next formalize it by
proving a rather general theorem on the “concentration” of luck.

Chapter 7

Concentration of Random Variables –


Chernoff’s Inequality

7.1. Concentration of mass and Chernoff’s inequality

7.1.1. Example: Binomial distribution

Consider the binomial distribution Bin(n, 1/2) for various values of n, as depicted in Figure 7.1 – here
we think about the value of the variable as the number of heads in flipping a fair coin n times. Clearly,
as the value of n increases, the probability of getting a number of heads that is significantly smaller or
larger than n/2 is tiny. Here we are interested in quantifying exactly how far we can deviate from this
expected value. Specifically, if X ∼ Bin(n, 1/2), then we would be interested in bounding the probability
Pr[X > n/2 + ∆], where ∆ = tσ_X = t√n/2 (i.e., we are t standard deviations away from the expectation).
For t > 2, this probability is roughly 2^{−t²}, which is what we prove here.
More surprisingly, if you look only at the middle of the distribution, it looks the same after clipping
away the uninteresting tails, see Figure 7.2; that is, it looks more and more like the normal distribution.
This is a universal phenomenon known as the central limit theorem – every sum of nicely behaved random
variables behaves like the normal distribution. We unfortunately need a more precise quantification of
this behavior, thus the following.

7.1.2. A restricted case of Chernoff inequality via games

7.1.2.1. Chernoff games

7.1.2.1.1. The game. Consider the game where a player starts with Y0 = 1 dollars. At every round,
the player can bet a certain amount x (fractions are fine). With probability half she loses her bet, and
with probability half she gains an amount equal to her bet. The player is not allowed to go all in –
because if she loses then the game is over. So it is natural to ask what her optimal betting strategy is,
such that at the end of the game she has as much money as possible.

[Figure 7.1: plots of the binomial distribution Bin(n, 1/2) for n = 8, 16, 32, 64, 128, 256, 512, 8192.]

Figure 7.1: The binomial distribution for different values of n. It pretty quickly concentrates around its
expectation.

[Figure 7.2: the middle of the binomial distribution Bin(n, 1/2) for n = 16, 32, 64, 128, 256, 512, 1024, 8192.]

Figure 7.2: The “middle” of the binomial distribution for different values of n. It very quickly converges
to the normal distribution (under appropriate rescaling and translation).

Values        | Probabilities                        | Inequality                                              | Ref
−1, +1        | Pr[Xi = −1] = Pr[Xi = 1] = 1/2       | Pr[Y ≥ ∆] ≤ exp(−∆²/2n)                                 | Theorem 7.1.7
              |                                      | Pr[Y ≤ −∆] ≤ exp(−∆²/2n)                                | Theorem 7.1.7
              |                                      | Pr[|Y| ≥ ∆] ≤ 2 exp(−∆²/2n)                             | Corollary 7.1.8
0, 1          | Pr[Xi = 0] = Pr[Xi = 1] = 1/2        | Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n)                       | Corollary 7.1.9
0, 1          | Pr[Xi = 0] = 1 − pi, Pr[Xi = 1] = pi | Pr[Y > (1 + δ)µ] < (e^δ/(1 + δ)^{1+δ})^µ                | Theorem 7.3.2
              |   For δ ≤ 2e − 1                     | Pr[Y > (1 + δ)µ] < exp(−µδ²/4)                          | Theorem 7.3.2
              |   δ ≥ 2e − 1                         | Pr[Y > (1 + δ)µ] < 2^{−µ(1+δ)}                          |
              |   δ ≥ e²                             | Pr[Y > (1 + δ)µ] < exp(−(µδ/2) ln δ)                    |
              |   For δ ≥ 0                          | Pr[Y < (1 − δ)µ] < exp(−µδ²/2)                          | Theorem 7.3.5
Xi ∈ [0, 1]   | arbitrary independent distributions  | Pr[Y − µ ≥ εµ] ≤ exp(−ε²µ/4)                            | Theorem 7.4.5
              |                                      | Pr[Y − µ ≤ −εµ] ≤ exp(−ε²µ/2)                           |
Xi ∈ [ai, bi] | arbitrary independent distributions  | Pr[|Y − µ| ≥ η] ≤ 2 exp(−2η²/Σ_{i=1}^{n}(bi − ai)²)     | Theorem 7.5.3

Table 7.1: Summary of Chernoff type inequalities covered. Here we have n independent random variables
X1, . . . , Xn, Y = Σ_i Xi and µ = E[Y].
7.1.2.1.2. Is the game pointless? So, let Yi−1 be the money the player has at the end of the (i − 1)th
round, and suppose she bets an amount ψi ≤ Yi−1 in the ith round. As such, at the end of the ith round, she has

Yi = Yi−1 − ψi   (lose: probability half)
Yi = Yi−1 + ψi   (win: probability half)

dollars. This game, in expectation, does not change the amount of money the player has. Indeed, we have

E[Yi | Yi−1] = (1/2)(Yi−1 − ψi) + (1/2)(Yi−1 + ψi) = Yi−1.

And as such, we have that E[Yi] = E[E[Yi | Yi−1]] = E[Yi−1] = · · · = E[Y0] = 1. In particular, E[Yn] = 1
– namely, on average, independent of the player's strategy, she is not going to make any money in this
game (and she is allowed to change her bets after every round). Unless, she is lucky¬ ...

7.1.2.1.3. What about a lucky player? The player believes she will get lucky and wants to develop
a strategy to take advantage of it. Formally, she believes that she can win, say, at least (1 + δ)/2 fraction
of her bets (instead of the predicted 1/2) – for example, if the bets are in the stock market, she can
improve her chances by doing more research on the companies she is investing in­ . Unfortunately, the
player does not know which rounds she is going to be lucky in – so she still needs to be careful.

7.1.2.1.4. In search of a good strategy. Of course, there are many safe strategies the player can
use, from not playing at all, to risking only a tiny fraction of her money at each round. In other words,
our quest here is to find the best strategy that extracts the maximum benefit for the player out of her
inherent luck.
Here, we restrict ourselves to a simple strategy – at every round, the player bets a β fraction
of her money, where β is a parameter to be determined. Specifically, at the end of the ith round, the
player has

Yi = (1 − β)Yi−1   (lose)
Yi = (1 + β)Yi−1   (win).

By our assumption, the player is going to win in at least M = (1 + δ)n/2 rounds. Our purpose here is to
figure out what the value of β should be so that the player gets as rich as possible®. Now, if the player is
successful in ≥ M rounds, out of the n rounds of the game, then the amount of money the player has,
at the end of the game, is

Yn ≥ (1 − β)^{n−M} (1 + β)^{M} = (1 − β)^{n/2−(δ/2)n} (1 + β)^{n/2+(δ/2)n} = ((1 − β)(1 + β))^{n/2−(δ/2)n} (1 + β)^{δn}
   = (1 − β²)^{n/2−(δ/2)n} (1 + β)^{δn} ≥ (exp(−2β²))^{n/2−(δ/2)n} (exp(β/2))^{δn} = exp((−β² + β²δ + βδ/2) n).

To maximize this quantity, we choose β = δ/4 (there is a better choice, see Lemma 7.1.6, but we use this
value for the simplicity of exposition). Thus, we have that

Yn ≥ exp((−δ²/16 + δ³/16 + δ²/8) n) ≥ exp((δ²/16) n),
proving the following.
¬ “I would rather have a general who was lucky than one who was good.” – Napoleon Bonaparte.
­ “I am a great believer in luck, and I find the harder I work, the more I have of it.” – Thomas Jefferson.
® This optimal choice is known as the Kelly criterion, see Remark 7.1.3.

Lemma 7.1.1. Consider a Chernoff game with n rounds, starting with one dollar, where the player
wins in ≥ (1 + δ)n/2 of the rounds. If the player bets a δ/4 fraction of her current money, at all rounds,
then at the end of the game the player has at least exp(nδ²/16) dollars.

Remark 7.1.2. Note, that Lemma 7.1.1 holds if the player wins any ≥ (1 + δ)n/2 rounds. In particular,
the statement does not require randomness by itself – for our application, however, it is more natural
and interesting to think about the player wins as being randomly distributed.

Remark 7.1.3. Interestingly, the idea of choosing the best fraction to bet is an old and natural question
arising in investment strategies, and the right fraction to use is known as the Kelly criterion, going back
to Kelly's work from 1956 [Kel56].
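A short Python illustration (mine, not the notes') of Lemma 7.1.1: a player who wins exactly ⌈(1 + δ)n/2⌉ of the rounds and bets a β = δ/4 fraction each round indeed ends with at least exp(nδ²/16) dollars, regardless of the order of wins and losses.

```python
import math

def final_wealth(n, delta, beta=None):
    """Wealth after n rounds of the Chernoff game, starting with 1 dollar,
    betting a fixed fraction beta each round, when the player wins exactly
    ceil((1 + delta) * n / 2) of the rounds (the order does not matter)."""
    if beta is None:
        beta = delta / 4                      # the simple choice used in the text
    wins = math.ceil((1 + delta) * n / 2)
    return (1 + beta) ** wins * (1 - beta) ** (n - wins)

if __name__ == "__main__":
    n, delta = 10_000, 0.1
    print(final_wealth(n, delta))             # actual final wealth
    print(math.exp(n * delta ** 2 / 16))      # the exp(n * delta^2 / 16) guarantee
```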

7.1.2.2. Chernoff’s inequality

The above implies that if a player is lucky, then she is going to become filthy rich¯ . Intuitively, this
should be a pretty rare event – because if the player is rich, then (on average) many other people have
to be poor. We are thus ready for the kill.

Theorem 7.1.4 (Chernoff's inequality). Let X1, . . . , Xn be n independent random variables, where
Xi = 0 or Xi = 1 with equal probability. Then, for any δ ∈ (0, 1/2), we have that

Pr[Σ_i Xi ≥ (1 + δ) n/2] ≤ exp(−(δ²/16) n).

Proof: Imagine that we are playing the Chernoff game above, with β = δ/4, starting with 1 dollar, and
let Yi be the amount of money at the end of the ith round. Here Xi = 1 indicates that the player won
the ith round. We have, by Lemma 7.1.1 and Markov's inequality, that

Pr[Σ_i Xi ≥ (1 + δ) n/2] ≤ Pr[Yn ≥ exp(nδ²/16)] ≤ E[Yn]/exp(nδ²/16) = 1/exp(nδ²/16) = exp(−(δ²/16) n),

as claimed.

7.1.2.2.1. This is crazy – so intuition maybe? If the player is (1 + δ)/2-lucky then she can make
a lot of money; specifically, at least f(δ) = exp(nδ²/16) dollars by the end of the game. Namely, beating
the odds has significant monetary value, and this value grows quickly with δ. Since we are in a “zero-sum”
game setting, this event should be very rare indeed. Under this interpretation, of course, the
player needs to know in advance the value of δ – so imagine that she guesses it somehow in advance,
or she plays the game in parallel with all the possible values of δ, and she settles on the instance that
maximizes her profit.

7.1.2.2.2. Can one do better? No, not really. Chernoff inequality is tight (this is a challenging
homework exercise) up to the constant in the exponent. The best bound I know for this version of the
inequality has 1/2 instead of 1/16 in the exponent. Note, however, that no real effort was taken to
optimize the constants – this is not the purpose of this write-up.
¯ Not that there is anything wrong with that – many of my friends are filthy rich.

7.1.2.3. Some low level boring calculations

Above, we used the following well known facts.

Lemma 7.1.5. (A) Markov's inequality. For any positive random variable X and t > 0, we have Pr[X ≥ t] ≤ E[X]/t.
(B) For any two random variables X and Y, we have that E[X] = E[E[X | Y]].
(C) For x ∈ (0, 1), 1 + x ≥ e^{x/2}.
(D) For x ∈ (0, 1/2), 1 − x ≥ e^{−2x}.

Lemma 7.1.6. The quantity exp((−β² + β²δ + βδ/2) n) is maximal for β = δ/(4(1 − δ)).

Proof: We have to maximize f(β) = −β² + β²δ + βδ/2 by choosing the correct value of β (as a function
of δ, naturally). We have f′(β) = −2β + 2βδ + δ/2 = 0 ⇐⇒ 2(δ − 1)β = −δ/2 ⇐⇒ β = δ/(4(1 − δ)).

7.1.3. Chernoff Inequality – A Special Case – the classical proof

Theorem 7.1.7. Let X1, . . . , Xn be n independent random variables, such that Pr[Xi = 1] = Pr[Xi = −1] = 1/2,
for i = 1, . . . , n. Let Y = Σ_{i=1}^{n} Xi. Then, for any ∆ > 0, we have

Pr[Y ≥ ∆] ≤ exp(−∆²/2n).

Proof: Clearly, for an arbitrary t, to be specified shortly, we have

Pr[Y ≥ ∆] = Pr[exp(tY) ≥ exp(t∆)] ≤ E[exp(tY)]/exp(t∆),

where the first part follows by the fact that exp(·) preserves ordering, and the second part follows by the
Markov inequality.
Observe that

E[exp(tXi)] = (1/2) e^t + (1/2) e^{−t} = (e^t + e^{−t})/2
            = (1/2)(1 + t/1! + t²/2! + t³/3! + · · ·) + (1/2)(1 − t/1! + t²/2! − t³/3! + · · ·)
            = 1 + t²/2! + t⁴/4! + · · · + t^{2k}/(2k)! + · · · ,

by the Taylor expansion of exp(·). Note that (2k)! ≥ (k!) 2^k, and thus

E[exp(tXi)] = Σ_{i=0}^{∞} t^{2i}/(2i)! ≤ Σ_{i=0}^{∞} t^{2i}/(2^i (i!)) = Σ_{i=0}^{∞} (1/i!)(t²/2)^i = exp(t²/2),

again, by the Taylor expansion of exp(·). Next, by the independence of the Xi s, we have

E[exp(tY)] = E[exp(t Σ_i Xi)] = E[Π_i exp(tXi)] = Π_{i=1}^{n} E[exp(tXi)] ≤ Π_{i=1}^{n} e^{t²/2} = e^{nt²/2}.

We have

Pr[Y ≥ ∆] ≤ exp(nt²/2)/exp(t∆) = exp(nt²/2 − t∆).

Next, minimizing the above quantity over t, we set t = ∆/n. We conclude

Pr[Y ≥ ∆] ≤ exp((n/2)(∆/n)² − (∆/n)∆) = exp(−∆²/2n).

By the symmetry of Y , we get the following:

Corollary 7.1.8. Let X1, . . . , Xn be n independent random variables, such that Pr[Xi = 1] = Pr[Xi = −1] = 1/2,
for i = 1, . . . , n. Let Y = Σ_{i=1}^{n} Xi. Then, for any ∆ > 0, we have

Pr[|Y| ≥ ∆] ≤ 2 exp(−∆²/2n).

Corollary 7.1.9. Let X1, . . . , Xn be n independent coin flips, such that Pr[Xi = 0] = Pr[Xi = 1] = 1/2, for
i = 1, . . . , n. Let Y = Σ_{i=1}^{n} Xi. Then, for any ∆ > 0, we have

Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n).

Remark 7.1.10. Before going any further, it might be instrumental to understand what these inequalities
imply. Consider the case where Xi is either zero or one with probability half. In this case µ = E[Y] = n/2,
and the standard deviation of Y is √n/2. Set ∆ = t√n/2 (i.e., we are t standard deviations away from the
expectation). We have

Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n) = 2 exp(−2(t√n/2)²/n) = 2 exp(−t²/2).

Thus, Chernoff's inequality implies exponential decay (i.e., ≤ 2 exp(−t²/2)) after t standard deviations,
instead of just polynomial decay (i.e., ≤ 1/t²) implied by Chebyshev's inequality.
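To see the difference numerically, the following snippet (illustrative, not part of the notes) compares the exact tail of Bin(n, 1/2) with the Chernoff bound of Corollary 7.1.9 and with the Chebyshev bound, at t standard deviations.

```python
import math
from math import comb, exp, sqrt

def binomial_upper_tail(n, k):
    """Exact Pr[Y >= k] for Y ~ Bin(n, 1/2)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

if __name__ == "__main__":
    n = 400
    sigma = sqrt(n) / 2                           # standard deviation of Bin(n, 1/2)
    for t in (1, 2, 3, 4):
        delta = t * sigma
        exact = binomial_upper_tail(n, math.ceil(n / 2 + delta))
        chernoff = 2 * exp(-2 * delta ** 2 / n)   # Corollary 7.1.9 (two-sided)
        chebyshev = 1 / t ** 2                    # Chebyshev's inequality (two-sided)
        print(t, exact, chernoff, chebyshev)
```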

7.2. Applications of Chernoff's inequality

There is a zoo of Chernoff type inequalities, and we prove some of them later in the chapter – while being
very useful and technically interesting, they tend to numb the reader into boredom and submission. As
such, we discuss applications of Chernoff's inequality here, and the interested reader can read the proofs
of the more general forms only if they are interested in them.

7.2.1. QuickSort is Quick


We revisit QuickSort. We remind the reader that the running time of QuickSort is proportional to the
number of comparisons performed by the algorithm. Next, consider an arbitrary element u being sorted.
Consider the ith level recursive subproblem that contains u, and let Si be the set of elements in this
subproblem. We consider u to be successful in the ith level if |Si+1| ≤ |Si|/2. Namely, if u is successful,
then the next level in the recursion involving u would involve a considerably smaller subproblem. Let
Xi be the indicator variable which is 1 if u is successful.
We first observe that if QuickSort is applied to an array with n elements, then u can be successful
at most T = ⌈lg n⌉ times before the subproblem it participates in is of size one, and the recursion stops.
Thus, consider the indicator variable Xi which is 1 if u is successful in the ith level, and zero otherwise.
Note that the Xi s are independent, and Pr[Xi = 1] = 1/2.
If u participates in v levels, then we have the random variables X1, X2, . . . , Xv. To make things simpler,
we will extend this series by adding independent random variables, such that Pr[Xi = 1] = 1/2, for
i > v. Thus, we have an infinite sequence of independent random variables, that are 0/1 and get 1 with
probability 1/2. The question is how many elements of the sequence we need to read till we collect T ones.

Lemma 7.2.1. Let X1, X2, . . . be an infinite sequence of independent random 0/1 variables, each being 1
with probability 1/2. Let M be an arbitrary parameter. Then the probability that we need to read more
than 2M + 4t√M variables of this sequence till we collect M ones is at most 2 exp(−t²), for t ≤ √M. If
t ≥ √M then this probability is at most 2 exp(−t√M).

Proof: Consider the random variable Y = Σ_{i=1}^{L} Xi, where L = 2M + 4t√M. Its expectation is L/2, and
using the Chernoff inequality, we get

α = Pr[Y ≤ M] ≤ Pr[|Y − L/2| ≥ L/2 − M] ≤ 2 exp(−(2/L)(L/2 − M)²)
  = 2 exp(−(2/L)(M + 2t√M − M)²) = 2 exp(−(2/L)(2t√M)²) = 2 exp(−8t²M/L),

by Corollary 7.1.9. For t ≤ √M we have that L = 2M + 4t√M ≤ 8M, and as such in this case
Pr[Y ≤ M] ≤ 2 exp(−t²).
If t ≥ √M, then α = 2 exp(−8t²M/(2M + 4t√M)) ≤ 2 exp(−8t²M/(6t√M)) ≤ 2 exp(−t√M).
Going back to the QuickSort problem, fix an element u out of the n elements being sorted. Applying
Lemma 7.2.1 with M = ⌈lg n⌉ and t = c√(⌈lg n⌉), the probability that u participates in more than
L = 2⌈lg n⌉ + 4c⌈lg n⌉ = (4c + 2)⌈lg n⌉ levels of the recursion is at most 2 exp(−c⌈lg n⌉) ≤ 1/n^c, for c ≥ 1
and n sufficiently large. There are n elements being sorted, and as such, applying this bound with c + 1
and using the union bound, the probability that any element participates in more than (4c + 6)⌈lg n⌉
recursive calls is smaller than 1/n^c. Since every comparison charges an element once per level it
participates in, we conclude the following.

Lemma 7.2.2. For any c ≥ 1, the probability that QuickSort performs more than (4c + 6) n ⌈lg n⌉
comparisons is smaller than 1/n^c.

7.2.2. How many times can the minimum change?

Let Π = π1 . . . πn be a random permutation of {1, . . . , n}. Let Ei be the event that πi is the minimum
number seen so far as we read Π; that is, Ei is the event that πi = min_{k=1}^{i} πk. Let Xi be the indicator
variable that is one if Ei happens. We have already seen, and it is easy to verify, that E[Xi] = 1/i. We are
interested in how many times the minimum might change°; that is, in Z = Σ_i Xi, and in how concentrated
the distribution of Z is. The following is maybe surprising.
° The answer, my friend, is blowing in the permutation.

Lemma 7.2.3. The events E1, . . . , En are independent (as such, the variables X1, . . . , Xn are independent).

Proof: The trick is to think about the sampling process in a different way, and then the result readily
follows. Indeed, we randomly pick a permutation of the given numbers, and set the first number to be
πn. We then, again, pick a random permutation of the remaining numbers and set the first number as
the penultimate number (i.e., πn−1) in the output permutation. We repeat this process till we generate
the whole permutation.
Now, consider 1 ≤ i1 < i2 < . . . < ik ≤ n, and observe that Pr[E_{ik} | E_{i1} ∩ . . . ∩ E_{ik−1}] = Pr[E_{ik}],
since by our thought experiment, E_{ik} is determined before all the other events E_{i1}, . . . , E_{ik−1}, and these
events are inherently not affected by whether it happens or not. As such, we have

Pr[E_{i1} ∩ E_{i2} ∩ . . . ∩ E_{ik}] = Pr[E_{ik} | E_{i1} ∩ . . . ∩ E_{ik−1}] · Pr[E_{i1} ∩ . . . ∩ E_{ik−1}]
  = Pr[E_{ik}] · Pr[E_{i1} ∩ E_{i2} ∩ . . . ∩ E_{ik−1}] = Π_{j=1}^{k} Pr[E_{ij}] = Π_{j=1}^{k} 1/i_j,

by induction.

Theorem 7.2.4. Let Π = π1 . . . πn be a random permutation of {1, . . . , n}, and let Z be the number of
times that πi is the smallest number among π1, . . . , πi, for i = 1, . . . , n. Then, for t ≥ 2e, we have that
Pr[Z > t ln n] ≤ 1/n^{t ln 2}, and for t ∈ (1, 2e], we have that Pr[Z > t ln n] ≤ 1/n^{(t−1)²/4}.

Proof: Follows readily from Chernoff's inequality, as Z = Σ_i Xi is a sum of independent indicator
variables, and, by linearity of expectation, we have

µ = E[Z] = Σ_i E[Xi] = Σ_{i=1}^{n} 1/i ≥ ∫_{x=1}^{n+1} dx/x = ln(n + 1) ≥ ln n.

Next, we set δ = t − 1, and use Theorem 7.3.2.
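As a quick illustration (my own, not in the notes), one can simulate Z on random permutations, compare its average with ln n, and estimate the probability of exceeding the t ln n threshold of the theorem.

```python
import math
import random

def min_changes(n, rng):
    """Number of times the prefix minimum changes while reading a random
    permutation of {1, ..., n} (counting the first element)."""
    perm = list(range(1, n + 1))
    rng.shuffle(perm)
    changes, cur_min = 0, math.inf
    for v in perm:
        if v < cur_min:
            changes, cur_min = changes + 1, v
    return changes

if __name__ == "__main__":
    rng = random.Random(3)
    n, trials, t = 10_000, 1000, 4
    samples = [min_changes(n, rng) for _ in range(trials)]
    print(sum(samples) / trials, math.log(n))                  # average of Z vs. ln n
    print(sum(s > t * math.log(n) for s in samples) / trials)  # empirical Pr[Z > t ln n]
```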

7.2.3. Routing in a Parallel Computer

Let G be a graph of a network, where every node is a processor. The processors communicate by sending
packets on the edges. Let [0, . . . , N − 1] denote the vertices (i.e., processors) of G, where N = 2^n, and G is
the hypercube. As such, each processor is identified with a binary string b1b2 . . . bn ∈ {0, 1}^n. Two nodes
are connected if their binary strings differ in exactly one bit. Namely, G is the binary hypercube over
n bits.
We want to investigate the best routing strategy for this network topology. We assume that every
processor needs to send a message to a single other processor. This is represented by a permutation
π, and we would like to figure out how to send the messages encoded by the permutation while creating
a minimum of delay/congestion.
Specifically, in our model, every edge has a FIFO queue± of the packets it has to transmit. At every
clock tick, one message gets sent along each edge. All the processors start sending the packets in their
permutation at the same time.
A routing scheme is oblivious if every node that has to forward a packet inspects the packet, and
depending only on the content of the packet decides how to forward it. That is, such a routing scheme
is local in nature, and does not take into account other considerations. Oblivious routing is of course a
bad idea – it ignores congestion in the network, and might insist on routing packets through regions of the
hypercube that are “gridlocked”.
± First in, first out queue. I sure hope you already knew that.

RandomRoute( v0, . . . , vN−1 )
// vi : Packet at node i to be routed to node d(i).
(i) Pick a random intermediate destination σ(i) from [1, . . . , N]. Packet vi travels to
σ(i).
// Here random sampling is done with replacement.
// Several packets might travel to the same destination.
(ii) Wait till all the packets arrive to their intermediate destination.
(iii) Packet vi travels from σ(i) to its destination d(i).
Figure 7.3: The routing algorithm

Theorem 7.2.5 ([KKT91]). For any deterministic oblivious permutation routing algorithm on a network
of N nodes, each of out-degree n, there is a permutation for which the routing of the permutation
takes Ω(√(N/n)) units of time (i.e., ticks).

Proof: (Sketch.) The above is implied by a nice averaging argument – construct, for every possible
destination, the routing tree of all packets to this specific node. Argue that there must be many edges
that are highly congested in this tree (which is NOT the permutation routing we are looking
for!). Now, by averaging, there must be a single edge that is congested in “many” of these trees. Pick
a source-destination pair from each one of these trees that uses this edge, and complete it into a full
permutation in the natural way. Clearly, the congestion of the resulting permutation is high. For the
exact details see [KKT91].

7.2.3.0.1. How do we send a packet? We use bit fixing. Namely, the packet from the ith node
always goes to the adjacent node that fixes the first bit (scanning left to right) in which the current node
differs from the destination string d(i). For example, a packet going from (0000) to (1101) would pass
through (1000), (1100), (1101).
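A concrete sketch of bit fixing (Python, my own illustration, assuming nodes are encoded as integers): the route from node s to node d flips, scanning from the most significant bit down, every bit in which the current node differs from d.

```python
def bit_fixing_route(s, d, n):
    """Sequence of nodes visited when routing from s to d on the n-dimensional
    hypercube by fixing differing bits from the most significant bit down."""
    route, cur = [s], s
    for bit in range(n - 1, -1, -1):           # scan the bits left to right
        mask = 1 << bit
        if (cur ^ d) & mask:                   # this bit still differs
            cur ^= mask                        # traverse the edge flipping it
            route.append(cur)
    return route

if __name__ == "__main__":
    n = 4
    path = bit_fixing_route(0b0000, 0b1101, n)
    print([format(v, "04b") for v in path])    # 0000 -> 1000 -> 1100 -> 1101
```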

7.2.3.0.2. The routing algorithm. We assume each edge has a FIFO queue. The routing algorithm
is depicted in Figure 7.3.

7.2.3.1. Analysis
We analyze only (i) as (iii) follows from the same analysis. In the following, let ρi denote the route
taken by vi in (i).

Exercise 7.2.6. Once a packet vj that travels along a path ρj leaves the path ρi, it can not join ρi again
later. Namely, ρi ∩ ρj is a (possibly empty) contiguous path.

Lemma 7.2.7. Let the route of a message c follow the sequence of edges π = (e1, e2, . . . , e k ). Let S be
the set of packets whose routes pass through at least one of (e1, . . . , e k ). Then, the delay incurred by c is
at most |S|.

Proof: A packet in S is said to leave π at that time step at which it traverses an edge of π for the last
time. If a packet is ready to follow edge e j at time t, we define its lag at time t to be t − j. The lag of c
is initially zero, and the delay incurred by c is its lag when it traverses ek. We will show that each step
at which the lag of c increases by one can be charged to a distinct member of S.

We argue that if the lag of c reaches ℓ + 1, some packet in S leaves π with lag ℓ. When the lag of c
increases from ℓ to ℓ + 1, there must be at least one packet (from S) that wishes to traverse the same
edge as c at that time step, since otherwise c would be permitted to traverse this edge and its lag would
not increase. Thus, S contains at least one packet whose lag reaches the value ℓ.
Let τ be the last time step at which any packet in S has lag ℓ. Thus there is a packet d ready to
follow edge e_µ at τ, such that τ − µ = ℓ. We argue that some packet of S leaves π at τ; this establishes
the lemma since once a packet leaves π, it would never join it again and as such will never again delay
c.
Since d is ready to follow e_µ at τ, some packet ω (which may be d itself) in S follows e_µ at time τ.
Now ω leaves π at time τ; if not, some packet would follow e_{µ+1} at step τ + 1 with lag still ℓ, violating
the maximality of τ. We charge to ω the increase in the lag of c from ℓ to ℓ + 1; since ω leaves π, it
will never be charged again. Thus, each member of S whose route intersects π is charged for at most one
delay, establishing the lemma.

Let Hij be an indicator variable that is 1 if ρi and ρj share an edge, and 0 otherwise. The total delay
for vi is at most Σ_j Hij.
Crucially, for a fixed i, the variables Hi1, . . . , HiN are independent. Indeed, imagine first picking
the destination of vi, and let the associated path be ρi. Now, pick the destinations of all the other
packets in the network. Since the sampling of destinations is done with replacement, whether or
not the path of vj intersects ρi is independent of whether the path of vk intersects ρi. Of course, the
probabilities Pr[Hij = 1] and Pr[Hik = 1] are probably different. Confusingly, however, H11, . . . , HNN
are not independent. Indeed, imagine k and j being close vertices on the hypercube. If Hij = 1 then
intuitively it means that ρi travels close to the vertex vj, and as such there is a higher probability
that Hik = 1.
Let ρi = (e1, . . . , ek), and let T(e) be the number of packets (i.e., paths) that pass through e. We
have that

Σ_{j=1}^{N} Hij ≤ Σ_{j=1}^{k} T(ej)   and thus   E[Σ_{j=1}^{N} Hij] ≤ E[Σ_{j=1}^{k} T(ej)].

Because of symmetry, the variables T(e) have the same distribution for all the edges of G. On the other
hand, the expected length of a path is n/2, there are N packets, and there are Nn/2 edges. We conclude
E[T(e)] = 1. Thus

µ = E[Σ_{j=1}^{N} Hij] ≤ E[Σ_{j=1}^{k} T(ej)] = E[|ρi|] ≤ n/2.

By the Chernoff inequality (Eq. (7.2)), picking δ such that (1 + δ)µ = 7n (so that δ ≥ 13, as µ ≤ n/2), we have

Pr[Σ_j Hij > 7n] ≤ Pr[Σ_j Hij > (1 + δ)µ] < 2^{−(1+δ)µ} = 2^{−7n} ≤ 2^{−6n}.

Since there are N = 2^n packets, the probability that some packet fails to arrive to its temporary
destination within a delay of 7n is at most 2^n · 2^{−6n} = 2^{−5n}.

Theorem 7.2.8. Each packet arrives to its destination in ≤ 14n stages, with probability at least 1 − 1/N
(note that this is very conservative).
7.2.4. Faraway Strings
Consider the Hamming distance between binary strings. It is natural to ask how many strings of
length n can one have, such that any pair of them, is of Hamming distance at least t from each other.
Consider two random strings, generated by picking at each bit randomly and independently. Thus,
E[dH(x, y)] = n/2, where dH(x, y) denotes the Hamming distance between x and y. In particular, using
the Chernoff inequality, we have that

Pr[dH(x, y) ≤ n/2 − ∆] ≤ exp(−2∆²/n).




Next, consider generating M such strings, where the value of M would be determined shortly. Clearly,
the probability that some pair of strings is at distance at most n/2 − ∆ is

α ≤ (M choose 2) exp(−2∆²/n) < M² exp(−2∆²/n).

If this probability is smaller than one, then there is some probability that all the M strings are of
distance at least n/2 − ∆ from each other. Namely, there exists a set of M strings such that every pair
of them is far. We used here the fact that if an event has probability larger than zero, then it exists.
Thus, set ∆ = n/4, and observe that

α < M² exp(−2n²/16n) = M² exp(−n/8).

Thus, for M = exp(n/16), we have that α < 1. We conclude:

Lemma 7.2.9. There exists a set of exp(n/16) binary strings of length n, such that any pair of them is
at Hamming distance at least n/4 from each other.

This is our first introduction to the beautiful technique known as the probabilistic method — we
will hear more about it later in the course.
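The calculation is easy to check experimentally. The following sketch (illustrative, for small n) draws M = ⌈exp(n/16)⌉ random n-bit strings and reports how often all pairwise Hamming distances are at least n/4.

```python
import math
import random
from itertools import combinations

def hamming(x, y):
    return bin(x ^ y).count("1")

def far_strings_exist(n, rng):
    """Draw ceil(exp(n/16)) random n-bit strings and report whether every
    pair is at Hamming distance at least n/4."""
    m = math.ceil(math.exp(n / 16))
    strings = [rng.getrandbits(n) for _ in range(m)]
    return all(hamming(x, y) >= n / 4 for x, y in combinations(strings, 2))

if __name__ == "__main__":
    rng = random.Random(4)
    n, trials = 64, 200
    successes = sum(far_strings_exist(n, rng) for _ in range(trials))
    print(successes, "/", trials)   # most trials succeed, so such a set exists
```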
This result also has an interesting interpretation in the Euclidean setting. Indeed, consider the sphere
S of radius √n/2 centered at (1/2, 1/2, . . . , 1/2) ∈ R^n. Clearly, all the vertices of the binary hypercube
{0, 1}^n lie on this sphere. As such, let P be the set of points on S that exists according to Lemma 7.2.9.
A pair p, q of points of P has Euclidean distance at least √(dH(p, q)) = √(n/4) = √n/2 from each other. We
conclude:

Lemma 7.2.10. Consider the unit hypersphere S in Rn . The sphere S contains a set Q of points, such
that each pair of points is at (Euclidean) distance at least one from each other, and |Q| ≥ exp(n/16).

7.3. The Chernoff Bound — General Case


Here we present the Chernoff bound in a more general settings.

Question 7.3.1. Let X1, . . . , Xn be n independent Bernoulli trials, where

Pr[Xi = 1] = pi   and   Pr[Xi = 0] = qi = 1 − pi.

(Each Xi is known as a Poisson trial.) Let X = Σ_{i=1}^{n} Xi and µ = E[X] = Σ_i pi. We are interested in
the question of what is the probability that X > (1 + δ)µ?



Theorem 7.3.2. For any δ > 0, we have

Pr[X > (1 + δ)µ] < (e^δ/(1 + δ)^{1+δ})^µ.

Or, in a more simplified form, we have:

δ ≤ 2e − 1:   Pr[X > (1 + δ)µ] < exp(−µδ²/4),        (7.1)
δ > 2e − 1:   Pr[X > (1 + δ)µ] < 2^{−µ(1+δ)},          (7.2)
δ ≥ e²:       Pr[X > (1 + δ)µ] < exp(−(µδ ln δ)/2).    (7.3)
Proof: We have Pr[X > (1 + δ)µ] = Pr[e^{tX} > e^{t(1+δ)µ}]. By the Markov inequality, we have:

Pr[X > (1 + δ)µ] < E[e^{tX}]/e^{t(1+δ)µ}.

On the other hand,

E[e^{tX}] = E[e^{t(X1 + X2 + ... + Xn)}] = E[e^{tX1}] · · · E[e^{tXn}].

Namely,

Pr[X > (1 + δ)µ] < (Π_{i=1}^{n} E[e^{tXi}])/e^{t(1+δ)µ} = (Π_{i=1}^{n} ((1 − pi)e^0 + pi e^t))/e^{t(1+δ)µ}
  = (Π_{i=1}^{n} (1 + pi(e^t − 1)))/e^{t(1+δ)µ}.

Let y = pi(e^t − 1). We know that 1 + y < e^y (since y > 0). Thus,

Pr[X > (1 + δ)µ] < (Π_{i=1}^{n} exp(pi(e^t − 1)))/e^{t(1+δ)µ} = exp(Σ_{i=1}^{n} pi(e^t − 1))/e^{t(1+δ)µ}
  = exp((e^t − 1) Σ_{i=1}^{n} pi)/e^{t(1+δ)µ} = exp((e^t − 1)µ)/e^{t(1+δ)µ}
  = (exp(e^t − 1)/e^{t(1+δ)})^µ = (exp(δ)/(1 + δ)^{1+δ})^µ,

if we set t = ln(1 + δ).
For the proof of the simplified form, see Section 7.3.1.



Definition 7.3.3. F⁺(µ, δ) = (e^δ/(1 + δ)^{1+δ})^µ.

Example 7.3.4. The Arkansas Aardvarks win each game with probability 1/3. What is their probability of having
a winning season, in a season with n games? By the Chernoff inequality (with µ = n/3 and δ = 1/2),
this probability is smaller than

F⁺(n/3, 1/2) = (e^{1/2}/1.5^{1.5})^{n/3} = 0.89745^{n/3} = 0.964577^n.

For n = 40, this probability is smaller than 0.236307. For n = 100 this is less than 0.027145. For
n = 1000, this is smaller than 2.17221 · 10^{−16} (which is pretty slim and shady). Namely, as the number of
experiments increases, the distribution concentrates around its expectation, and this convergence is exponential.
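The numbers above are easy to reproduce. The snippet below (an illustration, not part of the notes) evaluates the bound F⁺(n/3, 1/2) and compares it with the exact probability of winning more than half of the n games.

```python
from math import comb, exp

def chernoff_winning_season(n):
    """The bound F^+(n/3, 1/2) = (e^0.5 / 1.5^1.5)^(n/3) on the probability of
    winning more than n/2 games, each won independently with probability 1/3."""
    return (exp(0.5) / 1.5 ** 1.5) ** (n / 3)

def exact_winning_season(n):
    """Exact Pr[Bin(n, 1/3) > n/2]."""
    p, q = 1 / 3, 2 / 3
    return sum(comb(n, k) * p ** k * q ** (n - k) for k in range(n // 2 + 1, n + 1))

if __name__ == "__main__":
    for n in (40, 100, 1000):
        print(n, chernoff_winning_season(n), exact_winning_season(n))
```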

Theorem 7.3.5. Under the same assumptions as Theorem 7.3.2, we have Pr[X < (1 − δ)µ] < exp(−µδ²/2).

Definition 7.3.6. Let F⁻(µ, δ) = e^{−µδ²/2}, and let ∆⁻(µ, ε) denote the value of δ for which this bound
becomes smaller than ε. We have that

∆⁻(µ, ε) = √(2 ln(1/ε)/µ).

And for large δ we have ∆⁺(µ, ε) < log₂(1/ε)/µ − 1.

7.3.1. A More Convenient Form

Proof: (of the simplified form of Theorem 7.3.2) Eq. (7.2) is easy. Indeed, we have

(e/(1 + δ))^{(1+δ)µ} ≤ (e/(1 + 2e − 1))^{(1+δ)µ} ≤ 2^{−(1+δ)µ},

since δ > 2e − 1. For the stronger version, Eq. (7.3), observe that

Pr[X > (1 + δ)µ] < (e^δ/(1 + δ)^{1+δ})^µ = exp(µδ − µ(1 + δ) ln(1 + δ)).      (7.4)

As such, we have

Pr[X > (1 + δ)µ] < exp(−µ(1 + δ)[ln(1 + δ) − 1]) ≤ exp(−µδ ln((1 + δ)/e)) ≤ exp(−(µδ ln δ)/2),

since for x ≥ e² we have that (1 + x)/e ≥ √x ⇐⇒ ln((1 + x)/e) ≥ (ln x)/2.
As for Eq. (7.1), we prove this only for δ ≤ 1/2. For details about the case 1/2 ≤ δ ≤ 2e − 1, see
[MR95]. The Taylor expansion of ln(1 + δ) is

δ − δ²/2 + δ³/3 − δ⁴/4 + · · ·  ≥  δ − δ²/2,

for δ ≤ 1. Thus, plugging into Eq. (7.4), we have

Pr[X > (1 + δ)µ] < exp(µ(δ − (1 + δ)(δ − δ²/2))) = exp(µ(δ − δ + δ²/2 − δ² + δ³/2))
  ≤ exp(µ(−δ²/2 + δ³/2)) ≤ exp(−µδ²/4),

for δ ≤ 1/2.

7.4. A special case of Hoeffding's inequality

In this section, we prove yet another version of Chernoff's inequality, where each variable is randomly
picked according to its own distribution in the range [0, 1]. We prove a more general version of this
inequality in Section 7.5, but the version presented here does not follow from this generalization.

Theorem 7.4.1. Let X1, . . . , Xn ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^{n} Xi, and let
µ = E[X]. We have that

Pr[X − µ ≥ η] ≤ (µ/(µ + η))^{µ+η} · ((n − µ)/(n − µ − η))^{n−µ−η}.

Proof: Let s ≥ 1 be some arbitrary parameter. By the standard arguments, we have

γ = Pr[X ≥ µ + η] = Pr[s^X ≥ s^{µ+η}] ≤ E[s^X]/s^{µ+η} = s^{−µ−η} Π_{i=1}^{n} E[s^{Xi}].

By calculations, see Lemma 7.4.6 below, one can show that E[s^{Xi}] ≤ 1 + (s − 1) E[Xi]. As such, by the
AM-GM inequality², we have that

Π_{i=1}^{n} E[s^{Xi}] ≤ Π_{i=1}^{n} (1 + (s − 1) E[Xi]) ≤ ((1/n) Σ_{i=1}^{n} (1 + (s − 1) E[Xi]))^n = (1 + (s − 1)µ/n)^n.

Setting

s = (µ + η)(n − µ)/(µ(n − µ − η)) = (µn − µ² + ηn − ηµ)/(µn − µ² − ηµ),

we have that

1 + (s − 1)µ/n = 1 + (µ/n) · ηn/(µn − µ² − ηµ) = 1 + η/(n − µ − η) = (n − µ)/(n − µ − η).

As such, we have that

γ ≤ s^{−µ−η} Π_{i=1}^{n} E[s^{Xi}] ≤ (µ(n − µ − η)/((µ + η)(n − µ)))^{µ+η} ((n − µ)/(n − µ − η))^n
  = (µ/(µ + η))^{µ+η} ((n − µ)/(n − µ − η))^{n−µ−η}.

² The inequality between arithmetic and geometric means: (Σ_{i=1}^{n} xi)/n ≥ (x1 · · · xn)^{1/n}.

Remark 7.4.2. Setting s = (µ + η)/µ in the proof of Theorem 7.4.1, we have

Pr[X − µ ≥ η] ≤ (µ/(µ + η))^{µ+η} (1 + η/n)^n.

Corollary 7.4.3. Let X1, . . . , Xn ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^{n} Xi/n,
p = E[X] = µ/n and q = 1 − p. Then, we have that Pr[X − p ≥ t] ≤ exp(n f(t)), for

f(t) = (p + t) ln(p/(p + t)) + (q − t) ln(q/(q − t)).      (7.5)
Theorem 7.4.4. Let X1, . . . , Xn ∈ [0, 1] be n independent random variables, let X = (Σ_{i=1}^{n} Xi)/n, and let
p = E[X]. We have that Pr[X − p ≥ t] ≤ exp(−2nt²) and Pr[X − p ≤ −t] ≤ exp(−2nt²).

Proof: Let p = µ/n, q = 1 − p, and let f(t) be the function from Eq. (7.5), for t ∈ (−p, q). Now, we have
that

f′(t) = (ln(p/(p + t)) − 1) + (−ln(q/(q − t)) + 1) = ln(p/(p + t)) − ln(q/(q − t)) = ln(p(q − t)/(q(p + t))).

As for the second derivative, we have

f′′(t) = d/dt [ln p + ln(q − t) − ln q − ln(p + t)] = −1/(q − t) − 1/(p + t) = −(p + q)/((q − t)(p + t))
       = −1/((q − t)(p + t)) ≤ −4.

Indeed, t ∈ (−p, q) and the denominator (q − t)(p + t) is maximized for t = (q − p)/2, and as such
(q − t)(p + t) ≤ ((q + p)/2)² = (p + q)²/4 = 1/4.
Now, f(0) = 0 and f′(0) = 0, and by Taylor's expansion, we have that

f(t) = f(0) + f′(0) t + (f′′(x)/2) t² ≤ −2t²,

where x is between 0 and t.
The first bound now readily follows from plugging this bound into Corollary 7.4.3. The second bound
follows by considering the random variables Yi = 1 − Xi, for all i, and plugging this into the first bound.
Indeed, for Y = 1 − X, we have that q = E[Y], and then X − p ≤ −t ⇐⇒ t ≤ p − X ⇐⇒ t ≤
1 − q − (1 − Y) = Y − q. Thus, Pr[X − p ≤ −t] = Pr[Y − q ≥ t] ≤ exp(−2nt²).


Theorem 7.4.5. Let X1, . . . , Xn ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^{n} Xi, and let
µ = E[X]. We have that Pr[X − µ ≥ εµ] ≤ exp(−ε²µ/4) and Pr[X − µ ≤ −εµ] ≤ exp(−ε²µ/2).

Proof: Let p = µ/n, and let g(x) = f(px), for x ∈ [0, 1] and xp < q. As before, computing the derivative
of g, we have

g′(x) = p f′(xp) = p ln(p(q − xp)/(q(p + xp))) = p ln(((q − xp)/q) · (1/(1 + x))) ≤ p ln(1/(1 + x)) ≤ −px/2,

since (q − xp)/q is maximized for x = 0, and ln(1/(1 + x)) ≤ −x/2, for x ∈ [0, 1], as can be easily verified³.
Now, g(0) = f(0) = 0, and by integration, we have that

g(x) = ∫_{y=0}^{x} g′(y) dy ≤ ∫_{y=0}^{x} (−py/2) dy = −px²/4.

Now, plugging into Corollary 7.4.3, we get that the desired probability Pr[X − µ ≥ εµ] is

Pr[X/n − p ≥ εp] ≤ exp(n f(εp)) = exp(n g(ε)) ≤ exp(−pnε²/4) = exp(−µε²/4).

As for the other inequality, set h(x) = g(−x) = f(−xp). Then

h′(x) = −p f′(−xp) = −p ln(p(q + xp)/(q(p − xp))) = p ln(q(1 − x)/(q + xp)) = p ln((q − xq)/(q + xp))
      = p ln(1 − x(p + q)/(q + xp)) = p ln(1 − x/(q + xp)) ≤ p ln(1 − x) ≤ −px,

since 1 − x ≤ e^{−x}. By integration, as before, we conclude that h(x) ≤ −px²/2. Now, plugging
into Corollary 7.4.3, we get

Pr[X − µ ≤ −εµ] = Pr[X/n − p ≤ −εp] ≤ exp(n f(−εp)) = exp(n h(ε)) ≤ exp(−npε²/2) = exp(−µε²/2).

³ Indeed, this is equivalent to 1/(1 + x) ≤ e^{−x/2} ⇐⇒ e^{x/2} ≤ 1 + x, which readily holds for x ∈ [0, 1].

7.4.1. Some technical lemmas

Lemma 7.4.6. Let X ∈ [0, 1] be a random variable, and let s ≥ 1. Then E[s^X] ≤ 1 + (s − 1) E[X].

Proof: For the sake of simplicity of exposition, assume that X is a discrete random variable, and that
there is a value α ∈ (0, 1/2), such that β = Pr[X = α] > 0. Consider the modified random variable X′,
such that Pr[X′ = 0] = Pr[X = 0] + β/2, and Pr[X′ = 2α] = Pr[X = 2α] + β/2. Clearly, E[X] = E[X′].
Next, observe that E[s^{X′}] − E[s^X] = (β/2)(s^{2α} + s^0) − β s^α ≥ 0, by the convexity of s^x. We conclude
that E[s^X] achieves its maximum if X takes only the values 0 and 1. But then, we have that
E[s^X] = Pr[X = 0] s^0 + Pr[X = 1] s^1 = (1 − E[X]) + E[X] s = 1 + (s − 1) E[X], as claimed.

7.5. Hoeffding's inequality

In this section, we prove a generalization of Chernoff's inequality. The proof is considerably more
tedious, and it is included here for the sake of completeness.

Lemma 7.5.1. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have
E[e^{sX}] ≤ exp(s²(b − a)²/8).

Proof: Let a ≤ x ≤ b and observe that x can be written as a convex combination of a and b. In
particular, we have

x = λa + (1 − λ)b    for    λ = (b − x)/(b − a) ∈ [0, 1].

Since s > 0, the function exp(sx) is convex, and as such

e^{sx} ≤ ((b − x)/(b − a)) e^{sa} + ((x − a)/(b − a)) e^{sb},

since we have that f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y) if f(·) is a convex function. Thus, for a random
variable X, by linearity of expectation, we have

E[e^{sX}] ≤ E[((b − X)/(b − a)) e^{sa} + ((X − a)/(b − a)) e^{sb}] = ((b − E[X])/(b − a)) e^{sa} + ((E[X] − a)/(b − a)) e^{sb}
         = (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb},

since E[X] = 0.
Next, set p = −a/(b − a), and observe that 1 − p = 1 + a/(b − a) = b/(b − a) and

−ps(b − a) = −(−a/(b − a)) s(b − a) = sa.

As such, we have

E[e^{sX}] ≤ (1 − p) e^{sa} + p e^{sb} = (1 − p + p e^{s(b−a)}) e^{sa} = (1 − p + p e^{s(b−a)}) e^{−ps(b−a)}
         = exp(−ps(b − a) + ln(1 − p + p e^{s(b−a)})) = exp(−pu + ln(1 − p + p e^u)),

for u = s(b − a). Setting

φ(u) = −pu + ln(1 − p + p e^u),

we thus have E[e^{sX}] ≤ exp(φ(u)). To prove the claim, we will show that φ(u) ≤ u²/8 = s²(b − a)²/8.
To see that, expand φ(u) about zero using Taylor's expansion. We have

φ(u) = φ(0) + uφ′(0) + (u²/2) φ′′(θ),      (7.6)

where θ ∈ [0, u], and notice that φ(0) = 0. Furthermore, we have

φ′(u) = −p + p e^u/(1 − p + p e^u),

and as such φ′(0) = −p + p/(1 − p + p) = 0. Now,

φ′′(u) = ((1 − p + p e^u) p e^u − (p e^u)²)/(1 − p + p e^u)² = (1 − p) p e^u/(1 − p + p e^u)².

For any x, y ≥ 0, we have (x + y)² ≥ 4xy, as this is equivalent to (x − y)² ≥ 0. Setting x = 1 − p and
y = p e^u, we have that

φ′′(u) = (1 − p) p e^u/(1 − p + p e^u)² ≤ (1 − p) p e^u/(4(1 − p) p e^u) = 1/4.

Plugging this into Eq. (7.6), we get that

φ(u) ≤ (1/2) u² (1/4) = u²/8 = (s(b − a))²/8   and   E[e^{sX}] ≤ exp(φ(u)) ≤ exp((s(b − a))²/8),

as claimed.

Lemma 7.5.2. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s, t > 0, we have

Pr[X > t] ≤ exp(s²(b − a)²/8)/e^{st}.

Proof: Using the same technique we used in proving Chernoff's inequality, we have that

Pr[X > t] = Pr[e^{sX} > e^{st}] ≤ E[e^{sX}]/e^{st} ≤ exp(s²(b − a)²/8)/e^{st}.

Theorem 7.5.3 (Hoeffding's inequality). Let X1, . . . , Xn be independent random variables, where
Xi ∈ [ai, bi], for i = 1, . . . , n. Then, for the random variable S = X1 + · · · + Xn and any η > 0, we
have

Pr[|S − E[S]| ≥ η] ≤ 2 exp(−2η²/Σ_{i=1}^{n}(bi − ai)²).

Proof: Let Zi = Xi − E[Xi], for i = 1, . . . , n. Set Z = Σ_{i=1}^{n} Zi, and observe that

Pr[Z ≥ η] = Pr[e^{sZ} ≥ e^{sη}] ≤ E[exp(sZ)]/exp(sη),

by Markov's inequality. Arguing as in the proof of Chernoff's inequality, we have

E[exp(sZ)] = E[Π_{i=1}^{n} exp(sZi)] = Π_{i=1}^{n} E[exp(sZi)] ≤ Π_{i=1}^{n} exp(s²(bi − ai)²/8),

since the Zi s are independent and by Lemma 7.5.1. This implies that

Pr[Z ≥ η] ≤ exp(−sη) Π_{i=1}^{n} e^{s²(bi−ai)²/8} = exp((s²/8) Σ_{i=1}^{n} (bi − ai)² − sη).

The upper bound is minimized for s = 4η/(Σ_i (bi − ai)²), implying

Pr[Z ≥ η] ≤ exp(−2η²/Σ_i (bi − ai)²).

The claim now follows by the symmetry of the upper bound (i.e., apply the same proof to −Z).
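As a sanity check (illustrative code, not from the notes), one can compare the empirical deviation probability of a sum of bounded variables with the bound of Theorem 7.5.3; here the Xi are Uniform[0, 1], so ai = 0 and bi = 1.

```python
import math
import random

def empirical_tail(n, eta, trials, rng):
    """Empirical Pr[|S - E[S]| >= eta] for S a sum of n independent
    Uniform[0, 1] variables (so a_i = 0 and b_i = 1)."""
    mean = n / 2
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        hits += abs(s - mean) >= eta
    return hits / trials

if __name__ == "__main__":
    rng = random.Random(6)
    n, eta = 100, 10.0
    hoeffding = 2 * math.exp(-2 * eta ** 2 / n)   # here sum of (b_i - a_i)^2 = n
    print(empirical_tail(n, eta, 50_000, rng), hoeffding)
```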

7.6. Bibliographical notes


Some of the exposition here follows more or less the exposition in [MR95]. Exercise 7.7.1 (without
the hint) is from [Mat99]. McDiarmid [McD89] provides a survey of Chernoff type inequalities, and
Theorem 7.4.5 and Section 7.4 is taken from there (our proof has somewhat weaker constants).
Section 7.2.3 is based on Section 4.2 in [MR95]. A similar result to Theorem 7.2.8 is known for the
case of the wrapped butterfly topology (which is similar to the hypercube topology but every node has
a constant degree, and there is no clear symmetry). The interested reader is referred to [MU05].
A more general treatment of such inequalities and tools is provided by Dubhashi and Panconesi
[DP09].

7.7. Exercises

Exercise 7.7.1 (Chernoff inequality is tight). Let S = Σ_{i=1}^{n} Si be a sum of n independent random variables,
each attaining values +1 and −1 with equal probability. Let P(n, ∆) = Pr[S > ∆]. Prove that for ∆ ≤ n/C,

P(n, ∆) ≥ (1/C) exp(−∆²/Cn),

where C is a suitable constant. That is, the well-known Chernoff bound P(n, ∆) ≤ exp(−∆²/2n) is close
to the truth.

Exercise 7.7.2 (Chernoff inequality is tight by direct calculations.). For this question use only basic argu-
mentation – do not use Stirling’s formula, Chernoff inequality or any similar “heavy” machinery.

(A) Prove that Σ_{i=0}^{n−k} (2n choose i) ≤ 2^{2n} · n/(4k²).
    Hint: Consider flipping a coin 2n times. Write down explicitly the probability of this coin to
    have at most n − k heads, and use Chebyshev's inequality.
(B) Using (A), prove that (2n choose n) ≥ 2^{2n}/(4√n) (which is a pretty good estimate).
(C) Prove that (2n choose n + i + 1) = (1 − (2i + 1)/(n + i + 1)) · (2n choose n + i).
(D) Prove that (2n choose n + i) ≤ exp(−i(i − 1)/(2n)) · (2n choose n).
(E) Prove that (2n choose n + i) ≥ exp(−8i²/(2n)) · (2n choose n).
(F) Using the above, prove that (2n choose n) ≤ c · 2^{2n}/√n for some constant c (I got c = 0.824...
    but any reasonable constant will do).
(G) Using the above, prove that

    Σ_{i = t√n + 1}^{(t+1)√n} (2n choose n − i) ≤ c 2^{2n} exp(−t²/2).

    In particular, conclude that when flipping a fair coin 2n times, the probability to get less than
    n − t√n heads (for t an integer) is smaller than c′ exp(−t²/2), for some constant c′.
(H) Let X be the number of heads in 2n coin flips. Prove that for any integer t > 0 and any δ > 0
    sufficiently small, it holds that Pr[X < (1 − δ)n] ≥ exp(−c′′δ²n), where c′′ is some constant.
    Namely, the Chernoff inequality is tight in the worst case.

Exercise 7.7.3 (More binary strings. More!). To some extent, Lemma 7.2.9 is somewhat silly, as one can
prove a better bound by direct argumentation. Indeed, for a fixed binary string x of length n, show
a bound on the number of strings in the Hamming ball around x of radius n/4 (i.e., binary strings of
distance at most n/4 from x). (Hint: interpret the special case of the Chernoff inequality as an inequality
over binomial coefficients.)
Next, argue that the greedy algorithm, which repeatedly picks a string at distance ≥ n/4 from
all the strings picked so far, stops only after picking at least exp(n/8) strings.

Exercise 7.7.4 (Tail inequality for geometric variables). Let X1, . . . , Xm be m independent random variables
with geometric distribution with probability p (i.e., Pr[Xi = j] = (1 − p)^{j−1} p). Let Y = Σ_i Xi, and let
µ = E[Y] = m/p. Prove that Pr[Y ≥ (1 + δ)µ] ≤ exp(−mδ²/8).
Chapter 8

Martingales

‘After that he always chose out a “dog command” and sent them ahead. It had the task of informing
the inhabitants in the village where we were going to stay overnight that no dog must be allowed
to bark in the night otherwise it would be liquidated. I was also on one of those commands and
when we came to a village in the region of Milevsko I got mixed up and told the mayor that every
dog-owner whose dog barked in the night would be liquidated for strategic reasons. The mayor got
frightened, immediately harnessed his horses and rode to headquarters to beg mercy for the whole
village. They didn’t let him in, the sentries nearly shot him and so he returned home, but before we
got to the village everybody on his advice had tied rags round the dogs muzzles with the result that
three of them went mad.’
– The good soldier Svejk, Jaroslav Hasek

8.1. Martingales

8.1.1. Preliminaries

Let X and Y be two random variables. Let ρ(x, y) = Pr[(X = x) ∩ (Y = y)]. Then,

Pr[X = x | Y = y] = ρ(x, y)/Pr[Y = y] = ρ(x, y)/Σ_z ρ(z, y)

and

E[X | Y = y] = Σ_x x Pr[X = x | Y = y] = Σ_x x ρ(x, y)/Σ_z ρ(z, y) = (Σ_x x ρ(x, y))/Pr[Y = y].

Definition 8.1.1. The conditional expectation of X given Y, denoted E[X | Y], is the random variable
f(Y), where f(y) = E[X | Y = y].

Lemma 8.1.2. For any two random variables X and Y, we have E[E[X | Y]] = E[X].
 

Proof: We have

E[E[X | Y]] = E_Y[E[X | Y = y]] = Σ_y Pr[Y = y] E[X | Y = y]
  = Σ_y Pr[Y = y] · (Σ_x x Pr[X = x ∩ Y = y])/Pr[Y = y]
  = Σ_y Σ_x x Pr[X = x ∩ Y = y] = Σ_x x Σ_y Pr[X = x ∩ Y = y]
  = Σ_x x Pr[X = x] = E[X].

Lemma 8.1.3. For any two random variables X and Y, we have E[Y · E[X | Y]] = E[XY].

Proof: We have that

E[Y · E[X | Y]] = Σ_y Pr[Y = y] · y · E[X | Y = y] = Σ_y Pr[Y = y] · y · (Σ_x x Pr[X = x ∩ Y = y])/Pr[Y = y]
  = Σ_x Σ_y x y · Pr[X = x ∩ Y = y] = E[XY].

8.1.2. Martingales

Intuitively, martingales are a sequence of random variables describing a process, where the only thing
that matters at the beginning of the ith step is where the process was at the end of the (i − 1)th step.
That is, it does not matter how the process arrived at a certain state, only that it is currently at this
state.

Definition 8.1.4. A sequence of random variables X0, X1, . . . , is said to be a martingale sequence if for
all i > 0, we have E[Xi | X0, . . . , Xi−1] = Xi−1.

Lemma 8.1.5. Let X0, X1, . . . , be a martingale sequence. Then, for all i ≥ 0, we have E[Xi] = E[X0].

8.1.2.1. Examples of martingales

Example 8.1.6. An example of a martingale is the sum of money after participating in a sequence of fair
bets. That is, let Xi be the amount of money a gambler has after playing i rounds. In each round she
either gains one dollar or loses one dollar, with equal probability. Clearly, we have
E[Xi | X0, . . . , Xi−1] = E[Xi | Xi−1] = Xi−1.


Example 8.1.7. Let Yi = Xi² − i, where Xi is as defined in the above example. We claim that Y0, Y1, . . . is
a martingale. Let us verify that this is true. Given Yi−1, we have Yi−1 = X²_{i−1} − (i − 1). We have that

E[Yi | Yi−1] = E[Xi² − i | X²_{i−1} − (i − 1)] = (1/2)((Xi−1 + 1)² − i) + (1/2)((Xi−1 − 1)² − i)
            = X²_{i−1} + 1 − i = X²_{i−1} − (i − 1) = Yi−1,

which implies that indeed it is a martingale.

Example 8.1.8. Let U be an urn with b black balls and w white balls. We repeatedly select a ball and
replace it by c balls having the same color. Let Xi be the fraction of black balls after the first i trials.
We claim that the sequence X0, X1, . . . is a martingale.
Indeed, let ni = b + w + i(c − 1) be the number of balls in the urn after the ith trial. Clearly,

E[Xi | Xi−1, . . . , X0] = Xi−1 · ((c − 1) + Xi−1 ni−1)/ni + (1 − Xi−1) · (Xi−1 ni−1)/ni
  = (Xi−1(c − 1) + Xi−1 ni−1)/ni = Xi−1 · (c − 1 + ni−1)/ni = Xi−1 · ni/ni = Xi−1.

Example 8.1.9. Let G be a random graph on the vertex set V = {1, . . . , n} obtained by independently
choosing to include each possible edge with probability p. The underlying probability space is called
Gn,p . Arbitrarily label the m = n(n − 1)/2 possible edges with the sequence 1, . . . , m. For 1 ≤ j ≤ m,
define the indicator random variable I j , which takes values 1 if the edge j is present in G, and has value
0 otherwise. These indicator variables are independent and each takes value 1 with probability p.
Consider any real valued function f defined over the space of all graphs, e.g., the clique number,
which is defined as being the size of the largest complete subgraph. The edge exposure martingale
is defined to be the sequence of random variables X0, . . . , Xm such that

Xi = E[f(G) | I1, . . . , Ii],

while X0 = E[f(G)] and Xm = f(G). That this sequence of random variables is indeed a martingale follows
immediately from a theorem that would be described in the next lecture.
One can define similarly a vertex exposure martingale, where the graph Gi is the graph induced
on the first i vertices of the random graph G.

Example 8.1.10 (The sheep of Mabinogion). The following is taken from medieval Welsh manuscript based
on Celtic mythology:

“And he came towards a valley, through which ran a river; and the borders of the valley were
wooded, and on each side of the river were level meadows. And on one side of the river he
saw a flock of white sheep, and on the other a flock of black sheep. And whenever one of the
white sheep bleated, one of the black sheep would cross over and become white; and when
one of the black sheep bleated, one of the white sheep would cross over and become black.”
– Peredur the son of Evrawk, from the Mabinogion.

More concretely, we start at time 0 with w0 white sheep and b0 black sheep. At every iteration, a
random sheep is picked, it bleats, and a sheep of the other color turns to this color. The game stops as
soon as all the sheep have the same color. No sheep dies or gets born during the game. Let Xi be the
expected number of black sheep at the end of the game, conditioned on the state after the ith iteration.
For reasons that we would see later on, this sequence is a martingale.
The original question is somewhat more interesting – if we are allowed to take away sheep at the
end of each iteration, what is the optimal strategy to maximize Xi?

8.1.2.2. Azuma’s inequality


A sequence of random variables X0, X1, . . . has bounded differences if |Xi − Xi−1 | ≤ ∆, for some fixed
∆.

Theorem 8.1.11 (Azuma's Inequality). Let X0, . . . , Xm be a martingale with X0 = 0, and |Xi+1 − Xi| ≤ 1
for all 0 ≤ i < m. Let λ > 0 be arbitrary. Then

Pr[Xm > λ√m] < exp(−λ²/2).

Proof: Let α = λ/√m. Let Yi = Xi − Xi−1, so that |Yi| ≤ 1 and E[Yi | X0, . . . , Xi−1] = 0.
We are interested in bounding E[e^{αYi} | X0, . . . , Xi−1]. Note that, for −1 ≤ x ≤ 1, we have

e^{αx} ≤ h(x) = (e^α + e^{−α})/2 + ((e^α − e^{−α})/2) x,

as e^{αx} is a convex function, h(−1) = e^{−α}, h(1) = e^α, and h(x) is a linear function. Thus,

E[e^{αYi} | X0, . . . , Xi−1] ≤ E[h(Yi) | X0, . . . , Xi−1] = h(E[Yi | X0, . . . , Xi−1]) = h(0) = (e^α + e^{−α})/2
  = ((1 + α/1! + α²/2! + α³/3! + · · ·) + (1 − α/1! + α²/2! − α³/3! + · · ·))/2
  = 1 + α²/2! + α⁴/4! + α⁶/6! + · · ·
  ≤ 1 + (α²/2)/1! + (α²/2)²/2! + (α²/2)³/3! + · · · = exp(α²/2),

as (2i)! ≥ 2^i i!.
Hence, by Lemma 8.1.3, we have that

E[e^{αXm}] = E[Π_{i=1}^{m} e^{αYi}] = E[(Π_{i=1}^{m−1} e^{αYi}) e^{αYm}]
  = E[(Π_{i=1}^{m−1} e^{αYi}) E[e^{αYm} | X0, . . . , Xm−1]] ≤ e^{α²/2} E[Π_{i=1}^{m−1} e^{αYi}] ≤ · · · ≤ exp(mα²/2).

Therefore, by Markov's inequality, we have

Pr[Xm > λ√m] = Pr[e^{αXm} > e^{αλ√m}] ≤ E[e^{αXm}]/e^{αλ√m} = e^{mα²/2 − αλ√m}
  = exp(m(λ/√m)²/2 − (λ/√m)λ√m) = e^{−λ²/2},

implying the result.

Here is an alternative form.

Theorem 8.1.12 (Azuma's Inequality). Let X0, . . . , Xm be a martingale sequence such that |Xi+1 − Xi| ≤ 1
for all 0 ≤ i < m. Let λ > 0 be arbitrary. Then

Pr[|Xm − X0| > λ√m] < 2 exp(−λ²/2).
 

Example 8.1.13. Let χ(H) be the chromatic number of a graph H. What is the chromatic number of a
random graph? How does this random variable behave?
Consider the vertex exposure martingale, and let Xi = E[χ(G) | Gi]. Again, without proving it,
we claim that X0, . . . , Xn is a martingale, and as such, we have Pr[|Xn − X0| > λ√n] ≤ 2e^{−λ²/2}.
However, X0 = E[χ(G)], and Xn = E[χ(G) | Gn] = χ(G). Thus,

Pr[|χ(G) − E[χ(G)]| > λ√n] ≤ 2e^{−λ²/2}.

Namely, the chromatic number of a random graph is highly concentrated! And we do not even know
what the expectation of this variable is!

Chapter 9

Martingales II
“The Electric Monk was a labor-saving device, like a dishwasher or a video recorder. Dishwashers washed
tedious dishes for you, thus saving you the bother of washing them yourself, video recorders watched tedious
television for you, thus saving you the bother of looking at it yourself; Electric Monks believed things for
you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the
world expected you to believe.”
– Dirk Gently's Holistic Detective Agency, Douglas Adams.

9.1. Filters and Martingales


Definition 9.1.1. A σ-field (Ω, F) consists of a sample space Ω (i.e., the atomic events) and a collection
of subsets F satisfying the following conditions:
(A) ∅ ∈ F.
(B) C ∈ F ⇒ (Ω \ C) ∈ F.
(C) C1, C2, . . . ∈ F ⇒ C1 ∪ C2 ∪ . . . ∈ F.

Definition 9.1.2. Given a σ-field (Ω, F), a probability measure Pr : F → R⁺ is a function that satisfies
the following conditions.
(A) ∀A ∈ F, 0 ≤ Pr[A] ≤ 1.
(B) Pr[Ω] = 1.
(C) For mutually disjoint events C1, C2, . . . , we have Pr[∪i Ci] = Σ_i Pr[Ci].

Definition 9.1.3. A probability space (Ω, F , Pr) consists of a σ-field (Ω, F ) with a probability measure
Pr defined on it.

Definition 9.1.4. Given a σ-field (Ω, F ) with F = 2Ω , a filter (also filtration) is a nested sequence
F0 ⊆ F1 ⊆ · · · ⊆ Fn of subsets of 2Ω , such that:
(A) F0 = {∅, Ω}.
(B) Fn = 2Ω .
(C) For 0 ≤ i ≤ n, (Ω, Fi ) is a σ-field.

Definition 9.1.5. An elementary event or atomic event is a subset of a sample space that contains
only one element of Ω.

Intuitively, when we consider a probability space, we usually consider a random variable X. The
value of X is a function of the elementary event that happens in the probability space. Formally, a
random variable is a mapping X : Ω → R. Thus, each Fi defines a partition of Ω into atomic events.
This partition is getting more and more refined as we progress down the filter.

Example 9.1.6. Consider an algorithm Alg that uses n random bits. As such, the underlying sample
space is Ω = { b1b2 . . . bn | b1, . . . , bn ∈ {0, 1} }; that is, the set of all binary strings of length n. Next, let
Fi be the σ-field generated by the partition of Ω into the atomic events Bw, where w ∈ {0, 1}^i; here w is
the string encoding the first i random bits used by the algorithm. Specifically,

Bw = { wx | x ∈ {0, 1}^{n−i} },

and the set of atomic events in Fi is { Bw | w ∈ {0, 1}^i }. The set Fi is the closure of this set of atomic
events under complement and union. In particular, we conclude that F0, F1, . . . , Fn form a filter.

Definition 9.1.7. A random variable X is said to be Fi-measurable if for each x ∈ R, the event {X ≤ x}
is in Fi; that is, the set { ω ∈ Ω | X(ω) ≤ x } is in Fi.

Example 9.1.8. Let F0, . . . , Fn be the filter defined in Example 9.1.6. Let X be the parity of the n bits.
Clearly, {X = 1} is a valid event only in Fn (why?). Namely, X is measurable only in Fn, but not in Fi, for
i < n.

As such, a random variable X is Fi-measurable only if it is a constant on each of the atomic events of
Fi. This gives us a new interpretation of what a filter is – it is a sequence of refinements of the underlying
probability space, achieved by splitting the atomic events of Fi into smaller atomic events in Fi+1.
Putting it explicitly, an atomic event E of Fi is a subset of Ω. As we move to Fi+1 the event E might
be split into several atomic (and disjoint) events E1, . . . , Ek. Now, naturally, the atomic event that
really happens is an atomic event of Fn. As we progress down the filter, we “zoom in” onto this event.

 andY any random variable


Definition 9.1.9 (Conditional expectation in a filter). Let (Ω, F ) be any σ-field,
that takes on distinct values on the elementary events in F . Then E X | F = E X | Y .


9.2. Martingales
Definition 9.2.1. A sequence of random variables Y1, Y2, . . . , is said to be a martingale difference
sequence if for all i ≥ 0, we have E Yi Y1, . . . , Yi−1 = 0.
 

Clearly, X1, . . . , is a martingale sequence if and only if Y1, Y2, . . . , is a martingale difference sequence
where Yi = Xi − Xi−1 .

Definition 9.2.2. A sequence of random variables Y1, Y2, . . . , is

E Yi Y1, . . . , Yi−1 ≤ Yi−1,


 
a super martingale sequence if ∀i
E Yi Y1, . . . , Yi−1 ≥ Yi−1 .
 
and a sub martingale sequence if ∀i

80
9.2.1. Martingales – an alternative definition
Definition 9.2.3. Let (Ω, F , Pr) be a probability space with a filter F0, F1 , . . . . Suppose that X0, X1, . . ., are
random variables such that, for all i ≥ 0, Xi is Fi -measurable. The sequence X0, . . . , Xn is a martingale
provided that, for all i ≥ 0, we have E Xi+1 | Fi = Xi .

Lemma
h  9.2.4. i (Ω, F ) and
Let (Ω, G) be two σ-fields such that F ⊆ G. Then, for any random variable
X, E E X G F = E X F .
 

Proof: E E X G F = E E X G = g F = f
       

x Pr[X=x∩G=g]
Í

x x Pr[X = x ∩ G = g]
 Õ x
Pr[G=g] · Pr[G = g ∩ F = f ]
=E F= f =
Pr[G = g] g∈G
Pr[F = f ]
x Pr[X=x∩G=g] x Pr[X=x∩G=g]
Í Í
Õ x
Pr[G=g] · Pr[G = g ∩ F = f ] Õ x
Pr[G=g] · Pr[G = g]
= =
g∈G,g⊆ f
Pr[F = f ] g∈G,g⊆ f
Pr[F = f ]
Í 
x Pr[X = x ∩ G = g]
Í
x Pr[X = x ∩ G = g] x g∈G,g⊆ f
Í
x
Õ
= =
g∈G,g⊆ f
Pr[F = f ] Pr[F = f ]
x Pr[X = x ∩ F = f ]
Í
x
=E X F .
 
= 
Pr[F = f ]

Theorem 9.2.5. Let (Ω, F , Pr) be a probability space, and let F0, . . . , Fn be a filter
 with respect to it.
Let X be any random variable over this probability space and define Xi = E X Fi then, the sequence
X0, . . . , Xn is a martingale.

Proof: We need to show that E Xi+1 Fi = Xi . Namely,


 

E Xi+1 Fi = E E X Fi+1 Fi = E X Fi = Xi,


       

by Lemma 9.2.4 and by definition of Xi . 

Definition 9.2.6. Let f : D1 × · · · × Dn → R be a real-valued function with a arguments from possibly


distinct domains. The function f is said to satisfy the Lipschitz condition if for any x1 ∈ D1, . . . , xn ∈
Dn , and i ∈ {1, . . . , n} and any yi ∈ Di ,

f (x1, . . . , xi−1, xi, xi+1, . . . , xn ) − f (x1, . . . , xi−1, yi, xi+1, . . . , xn ) ≤ 1.

Specifically, a function is c-Lipschitz, if the inequality holds with a constant c (instead of 1).

Definition 9.2.7. Let X1, . . . , Xn be a sequence of independent random variables, and a function f (X1, . . . , Xn )
defined over them that such that f satisfies the Lipschitz condition. The Doob martingale sequence
h i
Y0, . . . , Ym is defined by Y0 = E f (X1, . . . , Xn ) and Yi = E f (X1, . . . , Xn ) X1, . . . , Xi , for i = 1, . . . , n.
 

Clearly, a Doob martingale Y0, . . . , Yn is a martingale, by Theorem 9.2.5. Furthermore, if |Xi − Xi−1 | ≤
1, for i = 1, . . . , n, then |Xi − Xi−1 | ≤ 1. and we can use Azuma’s inequality on such a sequence.

81
9.3. Occupancy Revisited
We have m balls thrown independently and uniformly into n bins. Let Z denote the number of bins
that remains empty in the end of the process. Let Xi be the bin chosen in the ith trial, and let
Z = F(X1, . . . , Xm ), where F returns the number of empty bins given that m balls had thrown into bins
√  2
X1, . . . , Xm . Clearly, we have by Azuma’s inequality that Pr Z − E[Z] > λ m ≤ 2e−λ /2 .


The following is an extension of Azuma’s inequality shown in class. We do not provide a proof but
it is similar to what we saw.
Theorem 9.3.1 (Azuma’s Inequality - Stronger Form). Let X0, X1, . . . , be a martingale sequence
such that for each k, |Xk − Xk−1 | ≤ ck , where ck may depend on k. Then, for all t ≥ 0, and any λ > 0,
!
h i λ2
Pr |Xt − X0 | ≥ λ ≤ 2 exp − Ít .
2 k=1 ck2

Theorem 9.3.2. Let r = m/n, and Zend be the number of empty bins when m balls are thrown randomly
h i m
into n bins. Then µ = E Zend = n 1 − 1n ≈ ne−r , and for any λ > 0, we have
   2 
λ (n − 1/2)
Pr Zend − µ ≥ λ ≤ 2 exp − 2 .
n − µ2
Proof: Let z(Y, t) be the expected number of empty bins, if there are Y empty bins in time t. Clearly,
  m−t
1
z(Y, t) = Y 1 − .
n
m
In particular, µ = z(n, 0) = n 1 − 1n .
Let Ft be the σ-field generated by  the bins chosen in the first t steps. Let Zend be the number of
empty bins at time m, and let Zt = E Zend Ft . Namely, Zt is the expected number of empty bins after
we know where the first t balls had been placed. The random variables Z0, Z1, . . . , Zm form a martingale.
Let Yt be the number of empty bins after t balls where thrown. We have Zt−1 = z(Yt−1, t − 1). Consider
the ball thrown in the t-step. Clearly:
(A) With probability 1 − Yt−1 /n the ball falls into a non-empty bin. Then Yt = Yt−1 , and Zt = z(Yt−1, t).
Thus,
 m−t   m−t+1 !  m−t   m−t
Yt−1
 
1 1 1 1
∆t = Zt − Zt−1 = z(Yt−1, t) − z(Yt−1, t − 1) = Yt−1 1 − − 1− = 1− ≤ 1− .
n n n n n

(B) Otherwise, with probability Yt−1 /n the ball falls into an empty bin, and Yt = Yt−1 − 1. Namely,
Zt = z(Yt − 1, t). And we have that
  m−t   m−t+1
1 1
∆t = Zt − Zt−1 = z(Yt−1 − 1, t) − z(Yt−1, t − 1) = (Yt−1 − 1) 1 − − Yt−1 1 −
n n
 m−t   m−t   m−t 
Yt−1 Yt−1
      
1 1 1 1
= 1− Yt−1 − 1 − Yt−1 1 − = 1− −1 + =− 1− 1−
n n n n n n
  m−t
1
≥ − 1− .
n

82
1 m−t
Thus, Z0, . . . , Zm is a martingale sequence, where |Zt − Zt−1 | ≤ |∆t | ≤ ct , where ct = 1 −

n . We
have
n
n 2 1 − (1 − 1/n)2m
n2 − µ2
2m

Õ 1 − (1 − 1/n)
ct2 = = = .
t=1
1 − (1 − 1/n)2 2n − 1 2n − 1

Now, deploying Azuma’s inequality, yield the result. 

9.3.1. Lets verify this is indeed an improvement


m
Consider the case where m = n ln n. Then, µ = n 1 − 1n ≤ 1. And using the “weak” Azuma’s inequality
implies that

n√ λ n
 2 
λ2
r

     
Pr Zend − µ ≥ λ n = Pr Zend − µ ≥ λ m ≤ 2 exp − = 2 exp − ,
m 2m 2 ln n

which is interesting only if λ > 2 ln n. On the other hand, Theorem 9.3.2 implies that

λ n(n − 1/2)
 2

  
Pr Zend − µ ≥ λ n ≤ 2 exp − ≤ 2 exp −λ 2
,
n2 − µ2

which is interesting for any λ ≥ 1 (say).

9.4. Some useful estimates


Lemma 9.4.1. For any n ≥ 2, and m ≥ 1, we have that (1 − 1/n)m ≥ 1 − m/n.

Proof: Follows by induction. Indeed, for m = 1 the claim is immediate. For m ≥ 2, we have
m  m−1
m−1 m
     
1 1 1 1
1− = 1− 1− ≥ 1− 1− ≥ 1− . 
n n n n n n

This implies the following.

Lemma 9.4.2. For any m ≤ n, we have that 1 − m/n ≤ (1 − 1/n)m ≤ exp(−m/n).

83
84
Chapter 10

The Probabilistic Method


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“Shortly after the celebration of the four thousandth anniversary of the opening of space, Angary J. Gustible
discovered Gustible’s planet. The discovery turned out to be a tragic mistake.
Gustible’s planet was inhabited by highly intelligent life forms. They had moderate telepathic powers.
They immediately mind-read Angary J. Gustible’s entire mind and life history, and embarrassed him very
deeply by making up an opera concerning his recent divorce.”
– – From Gustible’s Planet, Cordwainer Smith.

10.1. Introduction
The probabilistic method is a combinatorial technique to use probabilistic algorithms to create objects
having desirable properties, and furthermore, prove that such objects exist. The basic technique is based
on two basic observations:

1. If E[X] = µ, then there exists a value x of X, such that x ≥ E[X].

2. If the probability of event E is larger than zero, then E exists and it is not empty.

The surprising thing is that despite the elementary nature of those two observations, they lead to a
powerful technique that leads to numerous nice and strong results. Including some elementary proofs
of theorems that previously had very complicated and involved proofs.
The main proponent of the probabilistic method, was Paul Erdős. An excellent text on the topic is
the book by Noga Alon and Joel Spencer [AS00].
This topic is worthy of its own course. The interested student is refereed to the course “Math 475
— The Probabilistic Method”.

10.1.1. Examples
Theorem 10.1.1. For any undirected graph G(V, E) with n vertices and m edges, there is a partition of
the vertex set V into two sets A and B such that
m
uv ∈ E u ∈ A and v ∈ B ≥ .

2
85
Proof: Consider the following experiment: randomly assign each vertex to A or B, independently and
equal probability.
For an edge e = uv, the probability that one endpoint is in A, and the other in B is 1/2, and let Xe
be the indicator variable with value 1 if this happens. Clearly,
Õ Õ 1 m
uv ∈ E (u, v) ∈ (A × B) ∪ (B × A)
 
E = E[Xe ] = = .
2 2
e∈E(G) e∈E(G)

Thus, there must be a partition of V that satisfies the theorem. 

Definition 10.1.2. For a vector v = (v1, . . . , vn ) ∈ Rn , kvk ∞ = max |vi |.


i

Theorem 10.1.3. Let M be an n × n binary matrix p (i.e., each entry is either 0 or 1), then there always
n
exists a vector b ∈ {−1, +1} such that kMbk ∞ ≤ 4 n log n.

Proof: Let v = (v1, . . . , vn ) be a row of M. Chose a random b = (b1, . . . , bn ) ∈ {−1, +1}n . Let i1, . . . , im be
the indices such that vi j = 1, and let
n
Õ m
Õ m
Õ
Y = hv, bi = vi bi = vi j bi j = bi j .
i=1 j=1 j=1

As such Y is the sum of m independent random variables that accept values in {−1, +1}. Clearly,
" #
Õ Õ Õ
E [Y ] = E [hv, bi] = E v b
i i = [v b
E i i ] = vi E[bi ] = 0.
i i i

By Chernoff inequality (Theorem 7.1.7) and the symmetry of Y , we have that, for ∆ = 4 n ln n, it
holds
" m #
n ln n
 2  
Õ ∆ 2
Pr[|Y | ≥ ∆] = 2 Pr[v · b ≥ ∆] = 2 Pr bi j ≥ ∆ ≤ 2 exp − = 2 exp −8 ≤ 8.
j=1
2m m n

Thus, the probability that any entry in Mb exceeds 4 n ln n is smaller
√ than 2/n7 . Thus, with probability
at least 1 − 2/n , all the entries of Mb have value smaller than 4 n ln n. √
7

In particular, there exists a vector b ∈ {−1, +1}n such that k Mb k ∞ ≤ 4 n ln n. 

10.2. Maximum Satisfiability


In the MAX-SAT problem, we are given a binary formula F in [CNF] (Conjunctive normal form),
and we would like to find an assignment that satisfies as many clauses as possible of F, for example
F = (x ∨ y) ∧ (x ∨ z). Of course, an assignment satisfying all the clauses of the formula, and thus F itself,
would be even better – but this problem is of course NPC. As such, we are looking for how well can be
we do when we relax the problem to maximizing the number of clauses to be satisfied..

Theorem 10.2.1. For any set of m clauses, there is a truth assignment of variables that satisfies at
least m/2 clauses.

86
Proof: Assign every variable a random value. Clearly, a clause with k variables, has probability 1 − 2−k
to be satisfied. Using linearity of expectation, and the fact that every clause has at least one variable, it
follows, that E[X] = m/2, where X is the random variable counting the number of clauses being satisfied.
In particular, there exists an assignment for which X ≥ m/2. 

For an instant I, let mopt (I), denote the maximum number of clauses that can be satisfied by the
“best” assignment. For an algorithm Alg, let mAlg (I) denote the number of clauses satisfied computed
by the algorithm Alg. The approximation factor of Alg, is mAlg (I)/mopt (I). Clearly, the algorithm
of Theorem 10.2.1 provides us with 1/2-approximation algorithm.
For every clause, C j in the given instance, let z j ∈ {0, 1} be a variable indicating whether C j is
satisfied or not. Similarly, let xi = 1 if the ith variable is being assigned the value TRUE. Let C +j be
indices of the variables that appear in C j in the positive, and C −j the indices of the variables that appear
in the negative. Clearly, to solve MAX-SAT, we need to solve:

m
Õ
maximize zj
j=1
subject to xi, z j ∈ {0, 1} for all i, j
Õ Õ
xi + (1 − xi ) ≥ z j for all j.
i∈C j+ i∈C j−

We relax this into the following linear program:

m
Õ
maximize zj
j=1
subject to 0 ≤ yi, z j ≤ 1 for all i, j
Õ Õ
yi + (1 − yi ) ≥ z j for all j.
i∈C j+ i∈C j−

Which can be solved in polynomial time. Let b t denote the values assigned to the variable t by the
Ím
linear-programming solution. Clearly, j=1 zbj is an upper bound on the number of clauses of I that can
be satisfied.
yi . This is randomized rounding.
We set the variable yi to 1 with probability b

Lemma 10.2.2. Let C j be a clause with k literals. The probability that it is satisfied by randomized
rounding is at least βk zbj ≥ (1 − 1/e)b
z j , where
 k
1
βk = 1 − 1 − .
k

Proof: Assume C j = y1 ∨ v2 . . . ∨ vk . By the LP, we have yb1 + · · · + ybk ≥ zbj . Furthermore, the probability
Îk Îk
that C j is not satisfied is i=1 (1 − b
yi ). Note that 1 − i=1 (1 − b
yi ) is minimized when all the b yi ’s are equal
(by symmetry). Namely, when b yi = zbj /k. Consider the function f (x) = 1 − (1 − x/k) k . This is a concave

87
function, which is larger than g(x) = βk x for all 0 ≤ x ≤ 1, as can be easily verified, by checking the
inequality at x = 0 and x = 1.
Thus,
k
Ö
Pr C j is satisfied = 1 − yi ) ≥ f zbj ≥ βk zbj .
  
(1 − b
i=1

The second part of the inequality, follows from the fact that βk ≥ 1 − 1/e, for all k ≥ 0. Indeed, for
k = 1, 2 the claim trivially holds. Furthermore,
 k  k
1 1 1 1
1− 1− ≥ 1− ⇔ 1− ≤ ,
k e k e

1 k
but this holds since 1 − x ≤ e−x implies that 1 − 1
≤ e−1/k , and as such 1 − ≤ e−k/k = 1/e.

k k 

Theorem 10.2.3. Given an instance I of MAX-SAT, the expected number of clauses satisfied by linear
programming and randomized rounding is at least (1−1/e) ≈ 0.632mopt (I), where mopt (I) is the maximum
number of clauses that can be satisfied on that instance.

Theorem 10.2.4. Given an instance I of MAX-SAT, let n1 be the expected number of clauses satisfied
by randomized assignment, and let n2 be the expected number of clauses satisfied by linear programming
followed by randomized rounding. Then, max(n1, n2 ) ≥ (3/4) j zbj ≥ (3/4)mopt (I).
Í

Proof: It is enough to show that (n1 + n2 )/2 ≥ 34 j zbj . Let Sk denote the set of clauses that contain k
Í
literals. We know that
Õ Õ  Õ Õ 
n1 = 1 − 2−k ≥ 1 − 2−k zbj .
k C j ∈Sk k C j ∈Sk

By Lemma 10.2.2 we have n2 ≥ βk zbj . Thus,


Í Í
k C j ∈Sk

n1 + n2 Õ Õ 1 − 2−k + βk
≥ zbj .
2 k C ∈S
2
j k

One can verify that 1 − 2−k + βk ≥ 3/2, for all k. ¬



Thus, we have

n1 + n2 3 Õ Õ 3Õ
≥ zbj = zbj . 
2 4 k C ∈S 4 j
j k

¬ Indeed,by the proof of Lemma 10.2.2, we have that βk ≥ 1 − 1/e. Thus, 1 − 2−k + βk ≥ 2 − 1/e − 2−k ≥ 3/2 for k ≥ 3.

Thus, we only need to check the inequality for k = 1 and k = 2, which can be done directly.

88
Chapter 11

The Probabilistic Method II


598 - Class notes for Randomized Algorithms
Sariel Har-Peled “Today I know that everything watches, that
January 24, 2018 nothing goes unseen, and that even wallpaper
has a better memory than ours. It isn’t God in
His heaven that sees all. A kitchen chair, a
coat-hanger a half-filled ash tray, or the wood
replica of a woman name Niobe, can perfectly
well serve as an unforgetting witness to every
one of our acts.”

Gunter Grass, The tin drum


11.1. Expanding Graphs
In this lecture, we are going to discuss expanding graphs.
Definition 11.1.1. An (n, d, α, c) OR-concentrator is a bipartite multigraph G(L, R, E), with the inde-
pendent sets of vertices L and R each of cardinality n, such that
(i) Every vertex in L has degree at most d.
(ii) Any subset S of vertices of L, with |S| ≤ αn has at least c |S| neighbors in R.
A good (n, d, α, c) OR-concentrator should have d as small as possible¬ , and c as large as possible.
Theorem 11.1.2. There is an integer n0 , such that for all n ≥ n0 , there is an (n, 18, 1/3, 2) OR-
concentrator.
Proof: Let every vertex of L choose neighbors by sampling (with replacement) d vertices independently
and uniformly from R. We discard multiple parallel edges in the resulting graph.
Let E s be the event that a subset of s vertices of L has fewer than cs neighbors in R. Clearly,
h i n  n   cs  ds  ne  s  ne  cs  cs  ds   s  d−c−1 s
d−c
Pr E s ≤ ≤ = exp(1 + c)c ,
s cs n s cs n n
k
since nk ≤ ne . Setting α = 1/3 using s ≤ αn, and c = 2, we have

k
  d−c−1 !s  d !s  d !s
h i 1 1 1
Pr E s ≤ e1+c c d−c ≤ 31+c e1+c c d−c ≤ 31+c e1+c c d
3 3 3
s   18 !s
c d  s
 
2
≤ (3e)1+c ≤ (3e)1+2 ≤ 0.4 ,
3 3
¬ Or smaller!

89
as c = 2 and d = 18. Thus,
h i Õ
(0.4)s < 1.
Õ
Pr E s ≤
s≥1 s≥1

It thus follows that the random graph we generated has the required properties with positive probabil-
ity. 

11.2. Probability Amplification


Let Alg be an algorithm in RP, such that given x, Alg picks a random number r from the range
ZZn = {0, . . . , n − 1}, for a suitable choice of a prime n, and computes a binary value Alg(x, r) with the
following properties:
(A) If x ∈ L, then Alg(x, r) = 1 for at least half the possible values of r.
(B) If x < L, then Alg(x, r) = 0 for all possible choices of r.
Next, we show that using lg2 n bits­ one can achieve 1/nlg n confidence, compared with the naive 1/n,
and the 1/t confidence achieved by t (dependent) executions of the algorithm using two-point sampling.

Theorem 11.2.1. For n large enough, there exists a bipartite graph G(V, R, E) with |V | = n, |R| = 2lg
2
n

such that:
(i) Every subset of n/2 vertices of V has at least 2lg n − n neighbors in R.
2

(ii) No vertex of R has more than 12 lg2 n neighbors.

Proof: Each vertex of R chooses d = 2lg n (4 lg2 n)/n neighbors independently in R. We show that the
2

resulting graph violate the required properties with probability less than half.®
The probability for a set of n/2 vertices on the left to fail to have enough neighbors, is
   lg2 n    dn/2 !n
n n lg2 n e dn n
 
2 n 2
τ≤ 1− 2 ≤2 exp −
n/2 n 2lg n n 2 2lg2 n

!n !
2lg n e 2lg n (4 lg2 n)/n lg2 n e
2 2
n2
© ª
n 2
≤ exp­­n + n ln −2n lg n®®,
­ 2
®
≤2 exp −
n 2 2lg
2
n ­ n ®
| {z }
| {z } ∗

« ¬
2n 2n  y
n  2lg 2lg x xe
≤ 2n and ¯.
 
since n/2 lg2 n
2 −n
= n , and y ≤ y Now, we have

2lg n e  2 
lg2 n
ρ = n ln = n ln 2 + ln e − ln n ≤ (ln 2)n lg2 n ≤ 0.7n lg2 n,
n
for n ≥ 3. As such, we have τ ≤ exp n + (0.7 − 2)n lg2 n  1/4.


­ Everybody knows that lg n = log2 n. Everybody knows that the captain lied.
® Here, we keep parallel edges if they happen – which is unlikely. The reader can ignore this minor technicality, on her
way to ignore this whole write-up.
¯ The reader might want to verify that one can use significantly weaker upper bounds and the result still follows – we

are using the tighter bounds here for educational reasons, and because we can.

90
As for the second property, note that the expected number of neighbors of a vertex v ∈ R is 4 lg2 n.
Indeed, the probability of a vertex on R to become adjacent to a random edge is ρ = 1/|R|, and
this “experiment” is repeated independently dn times. As such, the expected degree of a vertex is
µ E Y = dn/|R| = 4 lg2 n. The Chernoff bound (Theorem 7.3.2p65 ) implies that


h i h i
α = Pr Y > 12 lg n = Pr Y > (1 + 2)µ < exp −µ22 /4 = exp −4 lg2 n .
2  

Since there are 2lg n vertices in R, we have that the probability that any vertex in R has a degree that
2

exceeds 12 lg2 n, is, by the union bound, at most |R| α ≤ 2lg n exp −4 lg2 n ≤ exp −3 lg2 n  1/4,
2  
concluding our tedious calculations° .
Thus, with constant positive probability, the random graph has the required property, as the union
of the two bad events has probability  1/2. 

We assume that given a vertex (of the above graph) we can compute its neighbors, without computing
the whole graph.
So, we are given an input x. Use lg2 n bits to pick a vertex v ∈ R. We next identify the neighbors
of v in V: r1, . . . , rk . We then compute Alg(x, ri ), for i = 1, . . . k. Note that k = O lg2 n . If all k calls
return 0, then we return that Alg is not in the language. Otherwise, we return that x belongs to V.
If x is in the language, then consider the subset U ⊆ V, such that running Alg on any of the strings
of U returns TRUE.
 2 We know that |U| ≥ n/2. The set U is connected to all the vertices of R except for
at most |R| − 2lg n − n = n of them. As such, the probability of a failure in this case, is

h i h i n n
Pr x ∈ L but r1, r2, . . . , rk < U = Pr v not connected to U ≤ ≤ 2 .
|R| 2lg n

We summarize the result.

Lemma 11.2.2. Given an algorithm Alg in RP that uses lg n random bits, and an access explicit access
to the graph of Theorem 11.2.1, one can decide if an input word is in the language of Alg using lg2 n
bits, and the probability of failure is at most lgn2 n .
2

Let us compare the various results we now have about running an algorithm in RP using lg2 n bits.
We have three options:
(A) Randomly run the algorithm lg n times independently. The probability of failure is at most
1/2lg n = 1/n.
(B) Lemma 11.2.2, which as probability of failure at most 1/2lg n = 1/n.
(C) The third option is to use pairwise independent sampling (see Lemma 6.1.11p52 ). While it is
not directly comparable to the above two options, it is clearly inferior, and is thus less useful.

Unfortunately, there is no explicit construction of the expanders used here. However, there are
alternative techniques that achieve a similar result.
° Once again, our verbosity in applying the Chernoff inequality is for educational reasons – usually such calculations

would be swept under the rag. No wonder than that everybody is afraid to look under the rag.

91
11.3. Oblivious routing revisited
Theorem 11.3.1. Consider any randomized oblivious algorithm for permutation routing on the hy-
n
 p with N = 2 nodes. If this algorithm uses k random bits, then its expected running time is
percube
Ω 2−k N/n .

Corollary 11.3.2. Any randomized oblivious algorithm for permutation routing on the hypercube with
N = 2n nodes must use Ω(n) random bits in order to achieve expected running time O(n).

Theorem 11.3.3. For every n, there exists a randomized oblivious scheme for permutation routing on
a hypercube with n = 2n nodes that uses 3n random bits and runs in expected time at most 15n.

92
Chapter 12

The Probabilistic Method III


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
At other times you seemed to me either pitiable or contemptible, eunuchs, artificially confined to an eternal
childhood, childlike and childish in your cool, tightly fenced, neatly tidied playground and kindergarten,
where every nose is carefully wiped and every troublesome emotion is soothed, every dangerous thought
repressed, where everyone plays nice, safe, bloodless games for a lifetime and every jagged stirring of life,
every strong feeling, every genuine passion, every rapture is promptly checked, deflected and neutralized by
meditation therapy.
– – The Glass Bead Game, Hermann Hesse .

12.1. The Lovász Local Lemma


Pr A ∩ B C
 
Lemma 12.1.1. (i) Pr A B ∩ C =
 
Pr B C
 
(ii) Let η1, . . . , ηn be n events which are not necessarily independent. Then,
h i h i h i h i h i
n
Pr ∩i=1 ηi = Pr η1 ∗ Pr η2 η1 Pr η3 η1 ∩ η2 ∗ . . . ∗ Pr ηn η1 ∩ . . . ∩ ηn−1 .

Proof: (i) We have that

Pr A ∩ B C
 
Pr[A ∩ B ∩ C] Pr[B ∩ C] Pr[A ∩ B ∩ C]

= Pr A B ∩ C .
 
 = =
Pr B C Pr[B ∩ C]

Pr[C] Pr[C]

As for (ii), we already saw it and used it in the minimum cut algorithm lecture. 

Definition 12.1.2. An event E is mutually independent of a set of events C, if for any subset U ⊆ C, we
have that Pr[E ∩ ( E 0 ∈U E 0)] = Pr[E] Pr[ E 0 ∈U E 0].
Ñ Ñ
Let E1, . . . , En be events. A dependency graph for these events is a directed graph G = (V, E),
where {1, . . . , n}, such that Ei is mutually independent of all the events in E j (i, j) < E .

Intuitively, an edge (i, j) in a dependency graph indicates that Ei and E j have (maybe) some depen-
dency between them. We are interested in settings where this dependency is limited enough, that we
can claim something about the probability of all these events happening simultaneously.

93
Lemma 12.1.3 (Lovász Local Lemma). Let G(V, E) be a dependency Ö graph for events Eh1, . . . , iEn .
n
Suppose that there exist xi ∈ [0, 1], for 1 ≤ i ≤ n such that Pr[Ei ] ≤ xi 1 − x j . Then Pr ∩i=1

Ei ≥
(i, j)∈E
n
Ö
(1 − xi ).
i=1

We need the following technical lemma.

Lemma 12.1.4. Let G(V, E) be a dependency Ö graph for events E1, . . . , En . Suppose that there exist
xi ∈ [0, 1], for 1 ≤ i ≤ n such that Pr[Ei ] ≤ xi 1 − x j . Now, let S be a subset of the vertices from

(i, j)∈E
{1, . . . , n}, and let i be an index not in S. We have that
h i
Pr Ei ∩ j∈S E j ≤ xi . (12.1)

Proof: The proof is by induction on k = |S|. h i


For k = 0, we have by assumption that Pr Ei ∩ j∈S E j = Pr[Ei ] ≤ xi (i, j)∈E 1 − x j ≤ xi .
Î 

Thus, let N = j ∈ S (i, j) ∈ E , and let R = S \ N. Ifh N = ∅, then



i we have
h that Eii is mutually
independent of the events of C(R) = E j j ∈ R . Thus, Pr Ei ∩ j∈S E j = Pr Ei ∩ j∈R E j = Pr[Ei ] ≤


xi , by arguing as above.
By Lemma 12.1.1 (i), we have that
h   i

 Ù   Pr E i ∩ ∩ j∈N jE ∩ E
m∈R m
Pr Ei E j  = h i .

 j∈S 
 Pr ∩ j∈N E j ∩ m∈R E m

We bound the numerator by


h   i h i Ö
Pr Ei ∩ ∩ j∈N E j ∩m∈R Em ≤ Pr Ei ∩m∈R Em = Pr[Ei ] ≤ xi 1 − xj ,

(i, j)∈E

since Ei is mutually independent of C(R). As for the denominator, let N = { j1, . . . , jr }. We have, by
Lemma 12.1.1 (ii), that
h i h i h  i
Pr E j1 ∩ . . . ∩ E jr ∩m∈R Em = Pr E j1 ∩m∈R Em Pr E j2 E j1 ∩ ∩m∈R Em
h  i
· · · Pr E jr E j1 ∩ . . . ∩ E jr−1 ∩ ∩m∈R Em
 h i  h   i
= 1 − Pr E j1 ∩m∈R Em 1 − Pr E j2 E j1 ∩ ∩m∈R Em
 h  i 
· · · 1 − Pr E jr E j1 ∩ . . . ∩ E jr−1 ∩ ∩m∈R Em
Ö
≥ 1 − x j1 · · · 1 − x jr ≥ 1 − xj ,
  
(i, j)∈E

by Eq. (12.1) and induction, as every


h probability
i term in the above expression has less than |S| items
j∈S E j ≤ xi .
Ñ
involved. It thus follows, that Pr Ei 

94
Proof of Lovász local lemma (Lemma 12.1.3): Using Lemma 12.1.4, we have that
h i  h i  h i n
n n−1
Ö
Pr ∩i=1 Ei = (1 − Pr[E1 ]) 1 − Pr E2 E1 · · · 1 − Pr En ∩i=1 Ei ≥ (1 − xi ).
i=1

Corollary 12.1.5. Let E1, . . . , En be events, with Pr[Ei ] ≤ p for all i. If eachh event is
i mutually inde-
n
pendent of all other events except for at most d, and if ep(d + 1) ≤ 1, then Pr ∩i=1 Ei > 0.

Proof: If d = 0 the result is trivial, as the events are independent. Otherwise, there is a dependency
graph, with every vertex having degree at most d. Apply Lemma 12.1.3 with xi = d+1 1
. Observe that
 d
d 1 1 1 1
xi (1 − xi ) = 1− > · ≥ p,
d+1 d+1 d+1 e
1 d

by assumption and the since 1 − d+1 > 1/e, see Lemma 12.1.6 below. 

The following is standard by now, and we include it only for the sake of completeness.
 n
1 1
Lemma 12.1.6. For any n ≥ 1, we have 1 − > .
n+1 e
n n n+1 n
> 1e . Namely, we need to prove e >
 
Proof: This is equivalent to n+1 n . But this obvious, since
n+1 n n
= 1 + 1n < exp(n(1/n)) = e.
 
n 

12.2. Application to k-SAT


We are given a instance I of k-SAT, where every clause contains k literals, there are m clauses, and
every one of the n variables, appears in at most 2 k/50 clauses.
Consider a random assignment, and let Ei be the event that the ith clause was not satisfied. We
know that p = Pr[Ei ] = 2 −k , and furthermore, Ei depends on at most d = k2 k/50 other events. Since
ep(d + 1) = e k · 2 k/50 + 1 2−k < 1, for k ≥ 4, we conclude that by Corollary 12.1.5, that
h i h i
Pr I have a satisfying assignment = Pr ∪i Ei > 0.

12.2.1. An efficient algorithm


The above just proves that a satisfying assignment exists. We next show a polynomial algorithm (in m)
for the computation of such an assignment (the algorithm will not be polynomial in k).
Let G be the dependency graph for I, where the vertices are the clauses of I, and two clauses are
connected if they share a variable. In the first stage of the algorithm, we assign values to the variables
one by one, in an arbitrary order. In the beginning of this process all variables are unspecified, at each
step, we randomly assign a variable either 0 or 1 with equal probability.
Definition 12.2.1. A clause Ei is dangerous if both the following conditions hold:

95
(i) k/2 literals of Ei have been fixed.

(ii) Ei is still unsatisfied.

After assigning each value, we discover all the dangerous clauses, and we defer (“freeze”) all the
unassigned variables participating in such a clause. We continue in this fashion till all the unspecified
variables are frozen. This completes the first stage of the algorithm.
At the second stage of the algorithm, we will compute a satisfying assignment to the variables using
brute force. This would be done by taking the surviving formula I 0 and breaking it into fragments, so
that each fragment does not share any variable with any other fragment (naively, it might be that all of
I 0 is one fragment). We can find a satisfying assignment to each fragment separately, and if each such
fragment is “small” the resulting algorithm would be “fast”.
We need to show that I 0 has a satisfying assignment and that the fragments are indeed small.

12.2.1.1. Analysis
A clause had survived if it is not satisfied by the variables fixed in the first stage. Note, that a clause
that survived must have a dangerous clause as a neighbor in the dependency graph G. Not that I 0,
the instance remaining from I after the first stage, has at least k/2 unspecified variables in each clause.
Furthermore, every clause of I 0 has at most d = k2 k/50 neighbors in G0, where G0 is the dependency
graph for I 0. It follows, that again, we can apply Lovász local lemma to conclude that I 0 has a satisfying
assignment.

Definition 12.2.2. Two connected graphs G1 = (V1, E1 ) and G2 = (V2, E2 ), where V1, V2 ⊆ {1, . . . , n} are
unique if V1 , V2 .

Lemma 12.2.3. Let G be a graph with degree at most d and with n vertices. Then, the number of
unique subgraphs of G having r vertices is at most nd 2r .

Proof: Consider a unique subgraph G b of G, which by definition is connected. Let H be a connected


subtree of G spanning G.b Duplicate every edge of H, and let H 0 denote the resulting graph. Clearly, H 0
is Eulerian, and as such posses a Eulerian path π of length at most 2(r − 1), which can be specified, by
picking a starting vertex v, and writing down for the i-th vertex of π which of the d possible neighbors,
is the next vertex in π. Thus, there are st most nd 2(r−1) ways of specifying π, and thus, there are at
most nd 2(r−1) unique subgraphs in G of size r. 

Lemma 12.2.4. With probability 1 − o(1), all connected components of G0 have size at most O(log m),
where G0 denote the dependency graph for I 0.

Proof: Let G4 be a graph formed from G by connecting any pair of vertices of G of distance exactly 4
from each other. The degree of a vertex of G4 is at most O(d 4 ).
Let U be a set of r vertices of G, such that every pair is in distance at least 4 from each other in G.
We are interested in bounding the probability that all the clauses of U survive the first stage.
The probability of a clause to be dangerous is at most 2−k/2 , as we assign (random) values to half
of the variables of this clause. Now, a clause survive only if it is dangerous or one of its neighbors is
dangerous. Thus, the probability that a clause survive is bounded by 2−k/2 (d + 1).

96
Furthermore, the survival of two clauses Ei and E j in U is an independent event, as no neighbor of
Ei shares a variable with a neighbor of E j (because of the distance 4 requirement). We conclude, that
the probability that all the vertices of U to appear in G0 is bounded by
 r
−k/2
2 (d + 1) .

In fact, we are interested in sets U that induce a connected subgraphs of G4 . The number of unique
such sets of size r is bounded by the number of unique subgraphs of G4 of size r, which is bounded by
md 8r , by Lemma 12.2.3. Thus, the probability of any connected subgraph of G4 of size r = log2 m to
survive in G0 is smaller than
 r   8r  r
md 8r 2−k/2 (d + 1) = m k2 k/50 2−k/2 (k2 k/50 + 1) ≤ m2 kr/5 · 2−kr/4 = m2−kr/20 = o(1),

since k ≥ 50. (Here, a subgraph survive of G4 survive, if all its vertices appear in G0.) Note, however, that
if a connected component of G0 has more than L vertices, than there must be a connected component
having L/d 3 vertices in G4 that had survived in G0. We conclude, that with probability o(1), no connected
component of G0 has more than O(d 3 log m) = O(log m) vertices (note, that we consider k to be a constant,
and thus, also d). 

Thus, after the first stage, we are left with fragments of (k/2)-SAT, where every fragment has size
at most O(log m), and thus having at most O(log m) variables. Thus, we can by brute force find the
satisfying assignment to each such fragment in time polynomial in m. We conclude:

Theorem 12.2.5. The above algorithm finds a satisfying truth assignment for any instance of k-SAT
containing m clauses, which each variable is contained in at most 2 k/50 clauses, in expected time poly-
nomial in m.

97
98
Chapter 13

The Probabilistic Method IV


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
Once I sat on the steps by a gate of David’s Tower, I placed my two heavy baskets at my side. A group
of tourists was standing around their guide and I became their target marker. “You see that man with the
baskets? Just right of his head there’s an arch from the Roman period. Just right of his head.” “But he’s
moving, he’s moving!” I said to myself: redemption will come only if their guide tells them, “You see that
arch from the Roman period? It’s not important: but next to it, left and down a bit, there sits a man who’s
bought fruit and vegetables for his family.”
– — Yehuda Amichai, Tourists .

13.1. The Method of Conditional Probabilities


In previous lectures, we encountered the following problem.

Problem 13.1.1 (Set Balancing). Given a binary matrix A of size n × n, find a vector v ∈ {−1, +1}n , such
that k Avk ∞ is minimized.

Using random
√ assignment and the Chernoff inequality, we showed that there exists v, such that
k Avk ∞ ≤ 4 n ln n. Can we derandomize this algorithm? Namely, can we come up with an efficient
deterministic algorithm that has low discrepancy?
To derandomize our algorithm, construct a computation tree of depth n, where in the ith level we
expose the ith coordinate of v. This tree T has depth n. The root represents all possible random choices,
while a node at depth i, represents all computations when the first i bits are fixed. For a node v ∈ T,
let P(v) be the probability that a random computation starting from v succeeds. Let vl and vr be the
two children of v. Clearly, P(v) = (P(vl ) + P(vr ))/2. In particular, max(P(vl ), P(vr )) ≥ P(v). Thus, if we
could compute P(·) quickly (and deterministically), then we could derandomize the algorithm.
Let Cm be the bad eventp that rm · v > 4 n log n, where rm is the mth row of A. Similarly, Cm− is the
p
+

bad event that rm · v < −4 n log n, and let Cm = Cm+ ∪ Cm− . Consider the probability, Pr Cm+ v1, . . . , v k
 

(namely, the first k coordinates of v are specified). Let rm = (α1, . . . , αn ). We have that

n k
" # " # " #
Õ p Õ Õ Õ
Pr Cm+ v1, . . . , v k = Pr vi αi > 4 n log n − vi αi > L = Pr vi > L ,
 
vi αi = Pr
i=k+1 i=1 i≥k+1,αi ,0 i≥k+1,αi =1

99
Ík
where L = 4 n log n − i=1 vi αi is a known quantity (since v1, . . . , v k are known). Let V = i≥k+1,αi =1 1.
p Í
We have,
   
Õ   Õ v +1 L +V
i
Pr Ci+ v1, . . . , v k (vi + 1) > L + V  = Pr 
 
= Pr  > ,
   
2 2 
i≥k+1 i≥k+1
  
 
 αi =1   αi =1 
The last probability, is the probability that in V flips of a fair coin we will get more than (L + V)/2
heads. Thus,

V   V  
Õ V 1 1© Õ V ª
Pm+ Cm+
 
= Pr v1, . . . , v k = n
= n­ ®.
i 2 2 i
i=d(L+V)/2e «i=d(L+V)/2e ¬
This implies, that we can compute Pm+ in polynomial time! Indeed, we are adding V ≤ n numbers,
each one of them is a binomial coefficient that has polynomial size representation in n, and can be
computed in polynomial time (why?). One can define in similar fashion Pm− , and let Pm = Pm+ + Pm− .
Clearly, Pm can be computed  in polynomial time, by applying a similar argument to the computation
of Pm− = Pr Cm− v1, . . . , v k .
For a node v ∈ T, let vv denote the portion of v that was fixed when traversing from the root of T
Ín
to v. Let P(v) = m=1 Pr Cm vv . By the above discussion P(v) can be computed in polynomial time.
Furthermore, we know, by the previous result on set balancing that P(r) < 1 (that was the bound used
to show that there exist a good assignment).
As before, for any v ∈ T, we have P(v) ≥ min(P(vl ), P(vr )). Thus, wephave a polynomial deterministic
algorithm for computing a set balancing with discrepancy smaller than 4 n log n. Indeed, set v = root(T).
And start traversing down the tree. At each stage, compute P(vl ) and P(vr ) (in polynomial time), and
p of P(·). Clearly, after n steps, we reach a leaf, that corresponds to a
set v to the child with lower value
vector v such that k Av k ∞ ≤ 4 n log n.
0 0

Theorem 13.1.2. Using the method of conditional p probabilities, one can compute in polynomial time
n
in n, a vector v ∈ {−1, 1} , such that k Avk ∞ ≤ 4 n log n.

Note, that this method might fail to find the best assignment.

13.2. A Short Excursion into Combinatorics via the Probabilis-


tic Method
In this section, we provide some additional examples of the Probabilistic Method to prove some results
in combinatorics and discrete geometry. While the results are not directly related to our main course,
their beauty, hopefully, will speak for itself.

13.2.1. High Girth and High Chromatic Number


Definition 13.2.1. For a graph G, let α(G) be the cardinality of the largest independent set in G, χ(G)
denote the chromatic number of G, and let girth(G) denote the length of the shortest circle in G.

Theorem 13.2.2. For all K, L there exists a graph G with girth(G) > L and χ(G) > K.

100
Proof: Fix µ < 1/L, and let G ≈ G(n, p) with p = n µ−1 ; namely, G is a random graph on n vertices chosen
by picking each pair of vertices to be an edge in G, randomly and independently with probability p. Let
X be the number of cycles of size at most L. Then
L L L
Õ n! 1 i Õ ni µ−1  i
Õ n µi
E[X] = · ·p ≤ · n ≤ = o(n),
i=3
(n − i)! 2i i=3
2i i=3
2i

n!
as µL < 1, and since the number of different sequence of i vertices is (n−i)! , and every cycle is being
counted in this sequence 2i times.
In particular,
l Pr[X
m ≥ n/2] = o(1).
Let x = 3p ln n + 1. We remind the reader that α(G) denotes the size of the largest independent set
in G. We have that
i n  x   x 
p(x − 1) x
  
h
( x
) 3
Pr α(G) ≥ x ≤ (1 − p) < n exp −
2 < n exp − ln n < o(1) = o(1).
x 2 2

Let n be sufficiently large so that both these events have probability less than 1/2. Then there is a
specific G with less than n/2 cycles of length at most L and with α(G) < 3n1−µ ln n + 1.
Remove from G a vertex from each cycle of length at most L. This gives a graph G∗ with at least n/2
vertices. G∗ has girth greater than L and α(G∗ ) ≤ α(G) (any independent set in G∗ is also an independent
set in G). Thus

∗ |V(G∗ )| n/2 nµ
χ(G ) ≥ ≥ 1−µ ≥ .
α(G∗ ) 3n ln n 12 ln n
To complete the proof, let n be sufficiently large so that this is greater than K. 

13.2.2. Crossing Numbers and Incidences


The following problem has a long and very painful history. It is truly amazing that it can be solved by
such a short and elegant proof.
And embedding of a graph G = (V, E) in the plane is a planar representation of it, where each vertex
is represented by a point in the plane, and each edge uv is represented by a curve connecting the points
corresponding to the vertices u and v. The crossing number of such an embedding is the number of
pairs of intersecting curves that correspond to pairs of edges with no common endpoints. The crossing
number cr(G) of G is the minimum possible crossing number in an embedding of it in the plane.

|E| 3
Theorem 13.2.3. The crossing number of any simple graph G = (V, E) with |E| ≥ 4 |V| is ≥ .
64 |V| 2

Proof: By Euler’s formula any simple planar graph with n vertices has at most 3n − 6 edges. (Indeed,
f − e + v = 2 in the case with maximum number of edges, we have that every face, has 3 edges around
it. Namely, 3 f = 2e. Thus, (2/3)e − e + v = 2 in this case. Namely, e = 3v − 6.) This implies that
the crossing number of any simple graph with n vertices and m edges is at least m − 3n + 6 > m − 3n.
Let G = (V, E) be a graph with |E| ≥ 4 |V| embedded in the plane with t = cr(G) crossings. Let H be
the random induced subgraph of G obtained by picking each vertex of G randomly and independently,
to be a vertex of H with probabilistic p (where P will be specified shortly). The expected number of
vertices of H is p |V|, the expected number of its edges is p2 |E|, and the expected number of crossings

101
in the given embedding is p4 t, implying that the expected value of its crossing number is at most p4 t.
Therefore, we have p4 t ≥ p2 |E| − 3p |V|, implying that

|E| 3 |V|
cr(G) ≥ − 3 ,
p2 p

let p = 4 |V| /|E| < 1, and we have cr(G) ≥ (1/16 − 3/64) |E| 3 /|V| 2 = |E| 3 /(64 |V| 2 ). 

Theorem 13.2.4. Let P be a set of n distinct points in the plane, and let L be a set of m distinct lines.
Then, the number of incidences between the points of P and the lines
 of L (that is, the number of pairs
(p, `) with p ∈ P, ` ∈ L, and p ∈ `) is at most c m2/3 n2/3 + m + n , for some absolute constant c.

Proof: Let I denote the number of such incidences. Let G = (V, E) be the graph whose vertices are all
the points of P, where two are adjacent if and only if they are consecutive points of P on some line in L.
Clearly |V| = n, and |E| = I − m. Note that G is already given embedded in the plane, where the edges
are presented by segments of the corresponding lines of L.
Either, we can not apply Theorem 13.2.3, implying that I − m = |E| < 4 |V| = 4n. Namely, I ≤ m + 4n.
Or alliteratively,

|E| 3 (I − m)3 m m2
 
= ≤ cr(G) ≤ ≤ .
64 |V| 2 64n2 2 2

Implying that I ≤ (32)1/3 m2/3 n2/3 + m. In both cases, I ≤ 4(m2/3 n2/3 + m + n). 

This technique has interesting and surprising results, as the following theorem shows.

Theorem 13.2.5. For any three sets A, B and C of s real numbers each, we have
 
| A · B + C| = ab + c a ∈ A, b ∈ B, mc ∈ C ≥ Ω s3/2 .


Proof: Let R = A· B+C, |R| = r and define P = (a, t) a ∈ A, t ∈ R , and L = y = bx + c b ∈ B, c ∈ C .


 

Clearly n = |P| =
 sr, and m = |L| = s . Furthermore, a line y = bx + c of L is incident with s points
2

of R, namely with (a, t) a ∈ A, t = ab + c . Thus, the overall number of incidences is at least s3 . By


Theorem 13.2.4, we have
      
2/3
s3 ≤ 4 m2/3 n2/3 + m + n = 4 s2 (sr)2/3 + s2 + sr = 4 s2r 2/3 + s2 + sr .

For r < s3 , we have that sr ≤ s2r 2/3 . Thus, for r < s3 , we have s3 ≤ 12s2r 2/3 , implying that s3/2 ≤ 12r.
Namely, |R| = Ω(s3/2 ), as claimed. 

Among other things, the crossing number technique implies a better bounds for k-sets in the plane
than what was previously known. The k-set problem had attracted a lot of research, and remains till
this day one of the major open problems in discrete geometry.

102
13.2.3. Bounding the at most k-level
Let L be a set of n lines in the plane. Assume, without loss of generality, that no three lines of L pass
through a common point, and none of them is vertical. The complement of union of lines L break the
plane into regions known as faces. An intersection of two lines, is a vertex, and the maximum interval
on a line between two vertices is am edge. The whole structure of vertices, edges and faces induced by
L is known as arrangement of L, denoted by A(L).
Let L be a set of n lines in the plane. A point p ∈ `∈L ` is of level k if there are k lines of L strictly
Ð
below it. The k-level is the closure of the set of points of level k. Namely, the k-level is an x-monotone
curve along the lines of L.t
The 0-level is the boundary of the “bottom” face of the arrangement of
L (i.e., the face containing the negative y-axis). It is easy to verify that the
0-level has at most n − 1 vertices, as each line might contribute at most one
segment to the 0-level (which is an unbounded convex polygon). 3-level
It is natural to ask what the number of vertices at the k-level is (i.e.,
what the combinatorial complexity of the polygonal chain forming the k- 1-level
0-level
level is). This is a surprisingly hard question, but the same question on the
complexity of the at most k-level is considerably easier.
Theorem 13.2.6. The number of vertices of level at most k in an arrangement of n lines in the plane
is O(nk).
Proof: Pick a random sample R of L, by picking each line to be in the sample with probability 1/k.
Observe that
n
E[|R|] = .
k
Let L≤k = L≤k (L) be the set of all vertices of A(L) of level at most k, for k > 1. For a vertex p ∈ L≤k ,
let Xp be an indicator variable which is 1 if p is a vertex of the 0-level of A(R). The probability that p is
in the 0-level of A(R) is the probability that none of the j lines below it are picked to be in the sample,
and the two lines that define it do get selected to be in the sample. Namely,
 j  2  k
k 1
  
1 1 1 1 1
Pr Xp = 1 = 1 −
 
≥ 1− ≥ exp −2 = 2 2
k k k k 2 k k 2 e k
since j ≤ k and 1 − x ≥ e−2x , for 0 < x ≤ 1/2.
On the other hand, the number of vertices on the 0-level of R is at most |R| − 1. As such,
Õ
Xp ≤ |R| − 1.
p∈L ≤k

Moreover this, of course, also holds in expectation, implying


" #
Õ h i n
E Xp ≤ E |R| − 1 ≤ .
p∈L
k
≤k

On the other hand, by linearity of expectation, we have


" #
Õ Õ   |L≤k |
E Xp = E Xp ≥ 2 2 .
p∈L p∈L
e k
≤k ≤k

|L≤k | n
Putting these two inequalities together, we get that ≤ . Namely, |L≤k | ≤ e2 nk. 
e k
2 2 k

103
104
Chapter 14

Random Walks I
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“A drunk man will find his way home; a drunk bird may wander forever.”
– Anonymous.

14.1. Definitions
Let G = G(V, E) be an undirected connected graph. For v ∈ V, let Γ(v) denote the set of neighbors of
v in G; that is, Γ(v) = u vu ∈ E(G) . A random walk on G is the following process: Starting from


a vertex v0 , we randomly choose one of the neighbors of v0 , and set it to be v1 . We continue in this
fashion, in the ith step choosing vi , such that vi ∈ Γ(vi−1 ). It would be interesting to investigate the
random walk process. Questions of interest include:
(A) How long does it take to arrive from a vertex v to a vertex u in G?
(B) How long does it take to visit all the vertices in the graph.
(C) If we start from an arbitrary vertex v0 , how long the random walk has to be such that the location
of the random walk in the ith step is uniformly (or near uniformly) distributed on V(G)?
Example 14.1.1. In the complete graph Kn , visiting all the vertices takes in expectation O(n log n) time,
as this is the coupon collector problem with n − 1 coupons. Indeed, the probability we did not visit a
specific vertex v by the ith step of the random walk is ≤ (1−1/n)i−1 ≤ e−(i−1)/n ≤ 1/n10 , for i = Ω(n log n).
As such, with high probability, the random walk visited all the vertex of Kn . Similarly, arriving from u
to v, takes in expectation n − 1 steps of a random walk, as the probability of visiting v at every step of
the walk is p = 1/(n − 1), and the length of the walk till we visit v is a geometric random variable with
expectation 1/p.

14.1.1. Walking on grids and lines



Lemma 14.1.2 (Stirling’s formula). For any integer n ≥ 1, it holds n! ≈ 2πn (n/e)n .

14.1.1.1. Walking on the line


Lemma 14.1.3. Consider the infinite random walk on the integer line, starting from 0. Here, the
vertices are the integer numbers, and from a vertex k, one walks with probability 1/2 either to k − 1 or
k + 1. The expected number of times that such a walk visits 0 is unbounded.

105
1 2i

Proof: The probability that in the 2ith step we visit 0 is 22i i
, As such, the expected number of times
we visit the origin is
∞   Õ ∞
Õ 1 2i 1
≥ √ = ∞,
i=1
2 i
2i
i=1 2 i

22i 22i
 
2i
since √ ≤ ≤ √ [MN98, p. 84]. This can also be verified using the Stirling formula, and the
2 i i 2i
resulting sequence diverges. 

14.1.1.2. Walking on two dimensional grid

A random walk on the integer grid ZZd , starts from a point of this integer grid, and at each step if it is
at point (i1, i2, . . . , id ), it chooses a coordinate and either increases it by one, or decreases it by one, with
equal probability.

Lemma 14.1.4. Consider the infinite random walk on the two dimensional integer grid ZZ2 , starting
from (0, 0). The expected number of times that such a walk visits the origin is unbounded.

Proof: Rotate the grid by 45 degrees, and consider the two new axises X 0 and Y 0. Let xi be the projection
of the location of the ith step√ of the random walk on the X -axis, and define
0
√ yi in a similar fashion.
√ j is an integer. By scaling by a factor of 2, consider the resulting
Clearly, xi are of the √form j/ 2, where
random walks xi0 = 2xi and yi0 = 2yi . Clearly, xi and yi are random walks on the integer grid, and
furthermore, they are independent. As such, the probability that we visit the origin at the 2ith step is
 2  1 2i   2
Pr x2i = 0 ∩ y2i = 0 = Pr x2i = 0 = 22i i
 0 0
  0
≥ 1/4i. We conclude, that the infinite random walk on
the grid ZZ2 visits the origin in expectation
∞ ∞
Õ Õ 1
xi0 yi0
 
Pr =0∩ =0 ≥ = ∞,
i=0 i=0
4i

as this sequence diverges. 

14.1.1.3. Walking on three dimensional grid


i i!
 
In the following, let = .
a b c a! b! c!

Lemma 14.1.5. Consider the infinite random walk on the three dimensional integer grid ZZ3 , starting
from (0, 0, 0). The expected number of times that such a walk visits the origin is bounded.

Proof: The probability of a neighbor of a point (x, y, z) to be the next point in the walk is 1/6. Assume
that we performed a walk for 2i steps, and decided to perform 2a steps parallel to the x-axis, 2b steps
parallel to the y-axis, and 2c steps parallel to the z-axis, where a + b + c = i. Furthermore, the walk on
each dimension is balanced, that is we perform a steps to the left on the x-axis, and a steps to the right
on the x-axis. Clearly, this corresponds to the only walks in 2i steps that arrives to the origin.

106
(2i)!
Next, the number of different ways we can perform such a walk is a!a!b!b!c!c! , and the probability to
perform such a walk, summing over all possible values of a, b and c, is
 2   2i   i !2
i! i
  Õ    
Õ (2i)! 1 2i 1 1 2i 1 Õ 1
αi = = =
a+b+c=i
a!a!b!b!c!c! 62i i 22i a+b+c=i
a! b! c! 3 i 22i a+b+c=i
a b c 3
a,b,c≥0 a,b,c≥0 a,b,c≥0

i  i
Consider the case where i = 3m. We have that

a b c ≤ m m m . As such,
   i    i    i 
i i i
 Õ  
2i 1 1 1 2i 1 1
αi ≤ = .
i 22i 3 m m m a+b+c=i a b c 3 i 22i 3 m m m
a,b,c≥0

By the Stirling formula, we have



i 2πi(i/e)i 3i
 
≈ p = c ,
m m m i i/3
3 i
 
2πi/3 3e

 i !
1 1 3i
 
1
for some constant c. As such, αi = O √ = O 3/2 . Thus,
i 3 i i


Õ Õ  1 
α6m = O 3/2 = O(1).
m=1 i
i

Finally, observe that α6m ≥ (1/6)2 α6m−2 and α6m ≥ (1/6)4 α6m−4 . Thus,

Õ
αm = O(1). 
m=1

Notes
The presentation here follows [Nor98].

107
108
Chapter 15

Random Walks II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled “Then you must begin a reading program
immediately so that you man understand the
January 24, 2018
crises of our age," Ignatius said solemnly. "Begin
with the late Romans, including Boethius, of
course. Then you should dip rather extensively
into early Medieval. You may skip the
Renaissance and the Enlightenment. That is
mostly dangerous propaganda. Now, that I
think about of it, you had better skip the
Romantics and the Victorians, too. For the
contemporary period, you should study some
selected comic books.”
“You’re fantastic.”
“I recommend Batman especially, for he tends
to transcend the abysmal society in which he’s
found himself. His morality is rather rigid, also.
I rather respect Batman.”

John Kennedy Toole, A confederacy of


Dunces

15.1. The 2SAT example


Let G = G(V, E) be a undirected connected graph. For v ∈ V, let Γ(v) denote the neighbors of v in G.
A random walk on G is the following process: Starting from a vertex v0 , we randomly choose one of
the neighbors of v0 , and set it to be v1 . We continue in this fashion, such that vi ∈ Γ(vi−1 ). It would
be interesting to investigate the process of the random walk. For example, questions like: (i) how long
does it take to arrive from a vertex v to a vertex u in G? and (ii) how long does it take to visit all the
vertices in the graph.

15.1.1. Solving 2SAT


Consider a 2SAT formula F with m clauses defined over n variables. Start from an arbitrary assignment
to the variables, and consider a non-satisfied clause in F. Randomly pick one of the clause variables,
and change its value. Repeat this till you arrive to a satisfying assignment.
Consider the random variable Xi , which is the number of variables assigned the correct value (ac-
cording to the satisfying assignment) in the current assignment. Clearly, with probability (at least) half
Xi = Xi−1 + 1.

109
Thus, we can think about this algorithm as performing a random walk on the numbers 0, 1, . . . , n,
where at each step, we go to the right probability at least half. The question is, how long does it take
to arrive to n in such a settings.
Theorem 15.1.1. The expected number of steps to arrive to a satisfying assignment is O(n2 ).

Proof: Consider the random walk on the integer line, starting from zero, where we go to the left with
probability 1/2, and to the right probability 1/2. Let Yi be the location of the walk at the i step. Clearly,
E[Yi ] ≥ E[Xi ]. In fact, by defining the random walk on the integer line more carefully, one can ensure
that Yi ≤ Xi . Thus, the expected number of steps till Yi is equal to n is an upper bound on the required
quantity.
To this end, observe that the probability that in the ith step we have Yi ≥ n is
i/2
i
 
Õ 1
i i/2 + k
> 1/3,
k=n/2
2
√ √
by Lemma 15.1.2 below. Here we need that k = i/6, and k ≥ n/2. That is, we need that i/6 ≥ n/2,
which in turns implies that this holds for i > µ = 9n2 . To see that, observe that if we get i/2 + k times
+1, and i − (i/2 + k) = i/2 − k times −1, then we have that Yi = (i/2 + k) − ((i/k) − m) = 2k ≥ n.
Next, if Xi fails to arrive to n at the first µ steps, we will reset Yµ = Xµ and continue the random
walk, repeating this process as many phases as necessary. The probability that the number of phases
exceeds i is ≤ (2/3)i . As such, the expected number of steps in the walk is at most
 i
0 2 2
Õ
cn i = O(n2 ),
i
3

as claimed. 
2i  
Õ 1 2i 1
Lemma 15.1.2. We have ≥ .
√ 2 k
2i 3
k=i+ i/6

Proof: It is known¬ that 2ii ≤ 22i / i (better constants are known). As such, since 2i  2i 
for all m,

i ≥ m ,
we have by symmetry that
2i 2i √ 1 √ 1 22i 1
     
Õ 1 2i Õ 1 2i 1 2i
≥ − i/6 ≥ − i/6 ·√ = . 
√ 22i k k=i+1
22i k 22i i 2 22i i 3
k=i+ i/6

15.2. Markov Chains


Let S denote a state space, which is either finite or countable. A Markov chain is at one state at any
given time. There is a transition probability Pi j , which is the probability to move to the state j, if
the Markov chain is currently at state i. As such, j Pi j = 1 and ∀i, j we have 0 ≤ Pi j ≤ 1. The matrix
Í

P = Pi j i j is the transition probabilities matrix.
¬ Probably because you got it as a homework problem, if not wikipedia knows, and if you are bored you can try and

prove it yourself.

110
The Markov chain start at an initial state X0 , and at each point in time moves according to the
transition probabilities. This form a sequence of states {Xt }. We have a distribution over those sequences.
Such a sequence would be referred to as a history.
Similar to Martingales, the behavior of a Markov chain in the future, depends only on its location
Xt at time t, and does not depends on the earlier stages that the Markov chain went through. This
is the memorylessness property of the Markov chain, and it follows as Pi j is independent of time.
Formally, the memorylessness property is
Pr Xt+1 = j X0 = i0, X1 = i1, . . . , Xt−1 = it−1, Xt = i = Pr Xt+1 = j Xt = i = Pi j .
   

The initial state of the Markov chain might also be chosen randomly. 
For states i, j ∈ S, the t-step transition probability is Pi(t)j = Pr Xt = j X0 = i . The probability


that we visit j for the first time, starting from i after t steps, is denoted by

ri(t)j = Pr Xt = j and X1 , j, X2 , j, . . . , Xt−1 , j X0 = i .


 

Let fi j = t>0 ri(t)j denote the probability that the Markov chain visits state j, at any point in time,
Í
starting from state i. The expected number of steps to arrive to state j starting from i is
Õ
hi j = t · ri(t)j .
t>0

Of course, if fi j < 1, then there is a positive probability that the Markov chain never arrives to j, and
as such hi j = ∞ in this case.
Definition 15.2.1. A state i ∈ S for which fii < 1 (i.e., the chain has positive probability of never visiting
i again), is a transient state. If fii = 1 then the state is persistent.
A state i that is persistent but hii = ∞ is null persistent. A state i that is persistent and hii , ∞
is non null persistent.
Example 15.2.2. Consider the state 0 in the random walk on the integers. We already know that in expectation the random walk visits the origin an infinite number of times, so this hints that this is a persistent state. Let us figure out the probability $r_{00}^{(2n)}$. To this end, consider a walk $X_0, X_1, \ldots, X_{2n}$ that starts at 0 and returns to 0 for the first time at step 2n. Let $S_i = X_i - X_{i-1}$, for all i. Clearly, we have $S_i \in \{-1, +1\}$ (i.e., move left or move right). Assume the walk starts with $S_1 = +1$ (the case $-1$ is handled similarly). Clearly, the walk $S_2, \ldots, S_{2n-1}$ must be prefix balanced; that is, in any prefix of this sequence the number of $+1$s is at least the number of $-1$s.
Strings with this property are known as Dyck words, and the number of such words of length 2m is the Catalan number $C_m = \frac{1}{m+1}\binom{2m}{m}$. As such, the probability of the random walk visiting 0 for the first time (starting from 0) after 2n steps is
\[
r_{00}^{(2n)} = 2 \cdot \frac{1}{n}\binom{2n-2}{n-1}\frac{1}{2^{2n}} = \Theta\left(\frac{1}{n}\cdot\frac{1}{\sqrt{n}}\right) = \Theta\left(\frac{1}{n^{3/2}}\right)
\]
(the 2 here is because the other option is that the sequence starts with $-1$), using that $\binom{2n}{n} = \Theta\left(2^{2n}/\sqrt{n}\right)$.
It is not hard to show that $f_{00} = 1$ (this requires a trick). On the other hand, we have that
\[
h_{00} = \sum_{t>0} t \cdot r_{00}^{(t)} \geq \sum_{n=1}^{\infty} 2n \, r_{00}^{(2n)} = \sum_{n=1}^{\infty} \Theta\left(1/\sqrt{n}\right) = \infty.
\]
Namely, 0 (and in fact all the integers) are null persistent.
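The following short simulation is not part of the notes, but it is a quick sanity check of the first-return probability formula above: it compares the exact expression $2C_{n-1}/2^{2n}$ with an empirical estimate obtained by running many truncated random walks (a sketch; the number of trials is an arbitrary choice).

    import math, random

    def first_return_prob_exact(n):
        # r_00^(2n) = 2 * C_{n-1} / 2^(2n), where C_{n-1} = (1/n) * binom(2n-2, n-1)
        return 2 * math.comb(2 * n - 2, n - 1) / n / 2 ** (2 * n)

    def first_return_prob_simulated(n, trials=100_000):
        # fraction of walks whose first return to 0 happens exactly at step 2n
        hits = 0
        for _ in range(trials):
            pos, t = 0, 0
            while True:
                pos += random.choice((-1, 1))
                t += 1
                if pos == 0:          # first return happened now, at time t
                    hits += (t == 2 * n)
                    break
                if t >= 2 * n:        # first return (if any) is already later than 2n
                    break
        return hits / trials

    for n in (1, 2, 3, 5):
        print(n, first_return_prob_exact(n), first_return_prob_simulated(n))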

In finite Markov chains there are no null persistent states (this requires a proof, which is left as an
exercise). There is a natural directed graph associated with a Markov chain. The states are the vertices,
and the transition probability Pi j is the weight assigned to the edge (i → j). Note that we include only
edges with Pi j > 0.

Definition 15.2.3. A strong component (or strongly connected component) of a directed graph G is a maximal subgraph C of G such that for any pair of vertices i and j in the vertex set of C, there is a directed path from i to j, as well as a directed path from j to i.

Definition 15.2.4. A strong component C is said to be a final strong component if there is no edge
going from a vertex in C to a vertex that is not in C.

In a finite Markov chain, there is a positive probability to arrive from any vertex of C at any other vertex of C in a finite number of steps. If C is a final strong component, then this probability is 1, since the Markov chain can never leave C once it enters it­. It follows that a state is persistent if and only if it lies in a final strong component.

Definition 15.2.5. A Markov chain is irreducible if its underlying graph consists of a single strong
component.

Clearly, if a Markov chain is irreducible, then all states are persistent.


 
Definition 15.2.6. Let $q^{(t)} = \left(q_1^{(t)}, q_2^{(t)}, \ldots, q_n^{(t)}\right)$ denote the state probability vector (also called the distribution of the chain at time t); that is, the row vector whose ith component is the probability that the chain is in state i at time t.

The key observation is that

\[
q^{(t)} = q^{(t-1)} P = q^{(0)} P^t.
\]

Namely, a Markov chain is fully defined by q (0) and P.
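As an illustration (not from the notes), the following minimal Python sketch evolves a state probability vector by repeated multiplication with the transition matrix, which is all that is needed to compute $q^{(t)} = q^{(0)}P^t$; the particular 3-state chain is made up for the example.

    import numpy as np

    # A made-up 3-state transition matrix: rows sum to 1.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.0, 0.4, 0.6]])

    q = np.array([1.0, 0.0, 0.0])   # q^(0): start in state 0 with probability 1
    for t in range(50):             # q^(t) = q^(t-1) P
        q = q @ P
    print("q^(50) =", q)            # close to the stationary distribution of P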

Definition 15.2.7. A stationary distribution for a Markov chain with the transition matrix P is a
probability distribution π such that π = πP.

In general, a stationary distribution does not necessarily exist. We will mostly be interested in Markov chains that have a stationary distribution. Intuitively, it is clear that if a stationary distribution exists, then the Markov chain, given enough time, will converge to the stationary distribution.

Definition 15.2.8. The periodicity of a state i is the maximum integer T for which there exists an initial distribution $q^{(0)}$ and a positive integer a such that, for all t, if $q_i^{(t)} > 0$ then t belongs to the arithmetic progression $\left\{a + Tj \mid j \geq 0\right\}$. A state is said to be periodic if it has periodicity greater than 1, and is aperiodic otherwise. A Markov chain in which every state is aperiodic is aperiodic.

Example 15.2.9. Perhaps the easiest example of a periodic Markov chain is a directed cycle.

­ Think about it as hotel California.

For example, the Markov chain on the directed cycle $v_1 \to v_2 \to v_3 \to v_1$ has periodicity three. In particular, the initial state probability vector $q^{(0)} = (1, 0, 0)$ leads to the following sequence of state probability vectors:
\[
q^{(0)} = (1, 0, 0) \implies q^{(1)} = (0, 1, 0) \implies q^{(2)} = (0, 0, 1) \implies q^{(3)} = (1, 0, 0) \implies \ldots
\]
Note that this chain still has a stationary distribution, namely $(1/3, 1/3, 1/3)$, but unless you start from this distribution, you are not going to converge to it.

A neat trick that forces a Markov chain to be aperiodic is to shrink all the probabilities by a factor of 2, and make every state have a transition probability to itself equal to 1/2. Clearly, the resulting Markov chain is aperiodic.

Definition 15.2.10. An ergodic state is aperiodic and (non-null) persistent.


An ergodic Markov chain is one in which all states are ergodic.

The following theorem is the fundamental fact about Markov chains that we will need. The interested reader should check the proof in [Nor98] (the proof is not hard).

Theorem 15.2.11 (Fundamental theorem of Markov chains). Any irreducible, finite, and aperi-
odic Markov chain has the following properties.
(i) All states are ergodic.
(ii) There is a unique stationary distribution π such that, for 1 ≤ i ≤ n, we have πi > 0.
(iii) For 1 ≤ i ≤ n, we have fii = 1 and hii = 1/πi .
(iv) Let N(i, t) be the number of times the Markov chain visits state i in t steps. Then

\[
\lim_{t\to\infty} \frac{N(i, t)}{t} = \pi_i.
\]
Namely, independent of the starting distribution, the process converges to the stationary dis-
tribution.

Chapter 16

Random Walks III


“I gave the girl my protection, offering in my equivocal way to be her father. But I came too
late, after she had ceased to believe in fathers. I
wanted to do what was right, I wanted to make
reparation: I will not deny this decent impulse,
however mixed with more questionable motives:
there must always be a place for penance and
reparation. Nevertheless, I should never have
allowed the gates of the town to be opened to
people who assert that there are higher
considerations that those of decency. They
exposed her father to her naked and made him
gibber with pain, they hurt her and he could
not stop them (on a day I spent occupied with
the ledgers in my office). Thereafter she was no
longer fully human, sister to all of us. Certain
sympathies died, certain movements of the heart
became no longer possible to her. I too, if I live
longer enough in this cell with its ghost not only
of the father and the daughter but of the man
who even by lamplight did not remove the black
discs from his eyes and the subordinate whose
work it was to keep the brazier fed, will be
touched with the contagion and turned into a
creature that believes in nothing.”

J. M. Coetzee, Waiting for the Barbarians

16.1. Random Walks on Graphs


Let G = (V, E) be a connected, non-bipartite, undirected graph with n vertices. We define the natural Markov chain on G, where the transition probability is
\[
P_{uv} = \begin{cases} \dfrac{1}{d(u)} & uv \in E\\ 0 & \text{otherwise,}\end{cases}
\]
where d(w) is the degree of vertex w. Clearly, the resulting Markov chain $M_G$ is irreducible. Note that, since G is non-bipartite, the graph must have an odd cycle, and it also has a closed walk of length 2 (back and forth along an edge). Thus, the gcd of the lengths of its cycles is 1; namely, $M_G$ is aperiodic. Now, by the Fundamental theorem of Markov chains, $M_G$ has a unique stationary distribution π.
Lemma 16.1.1. For all v ∈ V, we have πv = d(v)/2m.
Proof: Since π is stationary, by the definition of $P_{uv}$ we have
\[
\pi_v = \left[\pi P\right]_v = \sum_{uv \in E} \pi_u P_{uv},
\]
and this holds for all v. Since there is a unique stationary distribution, we only need to verify the claimed solution. Indeed,
\[
\left[\pi P\right]_v = \sum_{uv \in E} \frac{d(u)}{2m}\cdot\frac{1}{d(u)} = \frac{d(v)}{2m} = \pi_v,
\]
as there are exactly d(v) edges incident to v, as claimed.
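The following small Python sketch (not from the notes) builds the random-walk transition matrix of a little example graph and checks numerically that iterating the chain converges to $\pi_v = d(v)/2m$; the graph used is just an arbitrary connected, non-bipartite example.

    import numpy as np

    # A small connected, non-bipartite graph given by its edge list.
    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
    n = 5
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1

    deg = A.sum(axis=1)
    P = A / deg[:, None]              # P[u][v] = 1/d(u) for uv an edge

    q = np.full(n, 1.0 / n)           # start from the uniform distribution
    for _ in range(10_000):
        q = q @ P

    print("empirical :", np.round(q, 4))
    print("d(v)/(2m) :", deg / (2 * len(edges)))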
Lemma 16.1.2. For all v ∈ V, we have hvv = 1/πv = 2m/d(v).
Definition 16.1.3. The hitting time huv is the expected number of steps in a random walk that starts
at u and ends upon first reaching v.
The commute time between u and v is denoted by CTuv = huv + hvu .
Let $C_u(G)$ denote the expected length of a walk that starts at u and ends upon visiting every vertex in G at least once. The cover time of G, denoted by C(G), is defined by $C(G) = \max_u C_u(G)$.
Example 16.1.4 (Lollipop). Let $L_{2n}$ be the 2n-vertex lollipop graph; this graph consists of a clique on n vertices and a path on the remaining n vertices. There is a vertex u in the clique to which the path is attached; let v denote the other end of the path, so the path is $u = x_0, x_1, \ldots, x_n = v$ (this is the graph depicted in the original figure).
Taking a random walk from u to v requires in expectation $O(n^2)$ steps on the path alone, as we already saw in class. This ignores the probability of escape – that is, with probability $(n-1)/n$, when at u we enter the clique $K_n$ instead of the path. As such, it turns out that $h_{uv} = \Theta(n^3)$ and $h_{vu} = \Theta(n^2)$. (Thus, hitting times are not symmetric!)
Note that the cover time is not monotone decreasing in the number of edges. Indeed, the path of length n has cover time $O(n^2)$, but the larger lollipop graph has cover time $\Omega(n^3)$.
Example 16.1.5 (More on walking on the Lollipop). To see why $h_{uv} = \Theta\left(n^3\right)$, number the vertices on the stem $x_1, \ldots, x_n$. Let $T_i$ be the expected time to arrive at the vertex $x_i$ when starting a walk from u.
Observe that, perhaps surprisingly, $T_1 = \Theta(n^2)$. Indeed, the walk has to visit the vertex u about n times in expectation before it decides to go to $x_1$ instead of falling back into the clique, and the time between consecutive visits to u is in expectation $O(n)$ (assuming the walk is inside the clique).
Now, observe that $T_{2i} = T_i + \Theta(i^2) + \frac{1}{2}T_{2i}$. Indeed, starting at $x_i$, it takes in expectation $\Theta(i^2)$ steps of the walk to either arrive (with equal probability) at $x_{2i}$ (good), or to get back to u (oops); in the latter case, the game begins from scratch. As such, we have that
\[
T_{2i} = 2T_i + \Theta\left(i^2\right) = 2\left(2T_{i/2} + \Theta\left((i/2)^2\right)\right) + \Theta\left(i^2\right) = \cdots = 2i\,T_1 + \Theta\left(i^2\right),
\]
assuming i is a power of two (why not?). As such, $T_n = nT_1 + \Theta(n^2)$. Since $T_1 = \Theta(n^2)$, we have that
$T_n = \Theta(n^3)$.
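To make the asymmetry of the hitting times concrete, here is a small simulation (my own illustration, not part of the notes) that estimates $h_{uv}$ and $h_{vu}$ on a lollipop-style graph; the sizes are kept tiny so it runs quickly.

    import random

    def lollipop(n):
        """Clique on {0..n-1}, with a path of n-1 extra vertices n..2n-2 attached at u = n-1."""
        adj = {i: [j for j in range(n) if j != i] for i in range(n)}
        prev = n - 1
        for x in range(n, 2 * n - 1):
            adj[prev].append(x)
            adj[x] = [prev]
            prev = x
        return adj

    def hitting_time(adj, s, t, trials=200):
        total = 0
        for _ in range(trials):
            cur, steps = s, 0
            while cur != t:
                cur = random.choice(adj[cur])
                steps += 1
            total += steps
        return total / trials

    for n in (8, 16):
        adj = lollipop(n)
        u, v = n - 1, 2 * n - 2
        print(n, "h_uv ~", hitting_time(adj, u, v), " h_vu ~", hitting_time(adj, v, u))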

Definition 16.1.6. An n × n matrix M is stochastic if all its entries are non-negative and for each row i it holds that $\sum_k M_{ik} = 1$. It is doubly stochastic if, in addition, for each column i it holds that $\sum_k M_{ki} = 1$.

Lemma 16.1.7. Let MC be a Markov chain such that its transition probability matrix P is doubly stochastic. Then the distribution $u = (1/n, 1/n, \ldots, 1/n)$ is stationary for MC.

Proof: $\left[uP\right]_i = \sum_{k=1}^{n} \frac{P_{ki}}{n} = \frac{1}{n}$.

Lemma 16.1.8. For any edge (u → v) ∈ E, we have huv + hvu ≤ 2m.

(Note that (u → v) being an edge in the graph is crucial. Indeed, without it only a significantly weaker bound holds, see Theorem 16.2.1.)
Proof: Consider a new Markov chain defined by the edges of the graph (where every edge is taken twice, as two directed edges), where the current state is the last (directed) edge visited. There are 2m edges in the new Markov chain, and the new transition matrix has $Q_{(u\to v),(v\to w)} = P_{vw} = \frac{1}{d(v)}$. This matrix is doubly stochastic, meaning that not only do the rows sum to one, but the columns sum to one as well. Indeed, for the edge (v → w) we have
\[
\sum_{x\in V,\, y\in\Gamma(x)} Q_{(x\to y),(v\to w)} = \sum_{u\in\Gamma(v)} Q_{(u\to v),(v\to w)} = \sum_{u\in\Gamma(v)} P_{vw} = d(v)\cdot\frac{1}{d(v)} = 1.
\]
Thus, the stationary distribution for this Markov chain is uniform, by Lemma 16.1.7. Namely, the stationary probability of the edge e = (u → v) is $\pi_e = 1/(2m)$, and thus the expected time between successive traversals of e is $1/\pi_e = 2m$, by Theorem 15.2.11 (iii).
Consider $h_{uv} + h_{vu}$, and interpret it as the expected time to go from u to v and then return to u. Conditioned on the event that the initial entry into u was via the edge (v → u), we conclude that the expected time to go from there to v and then finally use (v → u) again is 2m. The memorylessness property of a Markov chain now allows us to remove the conditioning, since how we arrived at u is irrelevant. Thus, the expected time to travel from u to v and back is at most 2m.

16.2. Electrical networks and random walks


A resistive electrical network is an undirected graph; each edge has a branch resistance associated with it. The electrical flow is determined by two laws: Kirchhoff's law (preservation of flow – all the flow coming into a node leaves it) and Ohm's law (the voltage across a resistor equals the product of the resistance and the current through it). Explicitly, Ohm's law states

voltage = resistance ∗ current.

The effective resistance between nodes u and v is the voltage difference between u and v when
one ampere is injected into u and removed from v (or injected into v and removed from u). The effective
resistance is always bounded by the branch resistance, but it can be much lower.
Given an undirected graph G, let N(G) be the electrical network defined over G, associating one ohm
resistance on the edges of N(G).
You might now see the connection between a random walk on a graph and an electrical network. Intuitively (speaking in the most unscientific way possible), the electricity is made out of electrons, each of which performs a random walk on the electrical network. The resistance of an edge corresponds to the probability of taking the edge: the higher the resistance, the lower the probability that we travel on this edge. Thus, if the effective resistance $R_{uv}$ between u and v is low, then there is a good probability that a random walk travels from u to v quickly, and $h_{uv}$ would be small.
Theorem 16.2.1. For any two vertices u and v in G, the commute time CTuv = 2mRuv , where Ruv is
the effective resistance between u and v.

Proof: Let $\phi_{uv}$ denote the voltage at u in N(G) with respect to v, where d(x) amperes of current are injected into each node $x \in V$, and 2m amperes are removed from v. We claim that
\[
h_{uv} = \phi_{uv}.
\]
Note that the voltage on an edge xy is $\phi_{xy} = \phi_{xv} - \phi_{yv}$. Thus, using Kirchhoff's Law and Ohm's Law, we obtain that
\[
\forall x \in V\setminus\{v\}\qquad d(x) = \sum_{w\in\Gamma(x)} \mathrm{current}(xw) = \sum_{w\in\Gamma(x)} \frac{\phi_{xw}}{\mathrm{resistance}(xw)} = \sum_{w\in\Gamma(x)} \left(\phi_{xv} - \phi_{wv}\right), \tag{16.1}
\]
since the resistance of every edge is 1 ohm. (We also have the “trivial” equality that $\phi_{vv} = 0$.) Furthermore, we have only n variables in this system; that is, for every $x\in V$ we have the variable $\phi_{xv}$.
Now, for the random walk interpretation – by the definition of expectation, we have
\[
\forall x \in V\setminus\{v\}\qquad h_{xv} = \frac{1}{d(x)}\sum_{w\in\Gamma(x)}\left(1 + h_{wv}\right)
\iff d(x)\, h_{xv} = \sum_{w\in\Gamma(x)} 1 + \sum_{w\in\Gamma(x)} h_{wv}
\iff \sum_{w\in\Gamma(x)} 1 = d(x)\,h_{xv} - \sum_{w\in\Gamma(x)} h_{wv} = \sum_{w\in\Gamma(x)}\left(h_{xv} - h_{wv}\right).
\]
Since $d(x) = \sum_{w\in\Gamma(x)} 1$, this is equivalent to
\[
\forall x \in V\setminus\{v\}\qquad d(x) = \sum_{w\in\Gamma(x)}\left(h_{xv} - h_{wv}\right). \tag{16.2}
\]
Again, we also have the trivial equality $h_{vv} = 0$.¬ Note that this system also has n equalities and n variables.
Eq. (16.1) and Eq. (16.2) are two systems of linear equalities. Furthermore, if we identify $h_{xv}$ with $\phi_{xv}$ then they are exactly the same system of equalities. Since Eq. (16.1) represents a physical system, we know that it has a unique solution. This implies that $\phi_{xv} = h_{xv}$, for all $x\in V$.
Imagine now the network where u is injected with 2m amperes, and d(w) units are removed from every node w, and let $\phi'$ denote the voltages in this network. By the same argument, with the roles of u and v swapped and the currents reversed, we have $h_{vu} = -\phi'_{vu} = \phi'_{uv}$. Now, since flows behave linearly, we can superimpose the two networks (i.e., add them up). In the resulting network 2m units are being injected at u and 2m units are being extracted at v; at all other nodes the injected and extracted charges cancel out. The voltage difference between u and v in the new network is $\widehat{\phi} = \phi_{uv} + \phi'_{uv} = h_{uv} + h_{vu} = \mathrm{CT}_{uv}$. Now, in the new network there are 2m amperes going from u to v, and by Ohm's law we have
\[
\widehat{\phi} = \text{voltage} = \text{resistance} \times \text{current} = 2m R_{uv},
\]
as claimed.
¬ In previous lectures, we interpreted hvv as the expected length of a walk starting at v and coming back to v.

Example 16.2.2. Recall the lollipop graph from Example 16.1.4. Let u be the connecting vertex between the clique and the stem (i.e., the path). We inject d(x) units of flow at each vertex x of the graph, and collect 2m units at u. Next, let $u = x_0, x_1, \ldots, x_n = v$ be the vertices of the stem. Clearly, there are $2(n-i) - 1$ units of electricity flowing on the edge $(x_{i+1} \to x_i)$, and thus the voltage drop on this edge is $2(n-i)-1$, by Ohm's law (every edge has resistance one). Summing the drops along the stem, the voltage difference between v and u is $\Theta(n^2)$, which implies that $h_{vu} = \Theta(n^2)$.
Similarly, it is easy to show $h_{uv} = \Theta(n^3)$.
A similar analysis works for the random walk on the integer line in the range 1 to n.
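As an aside (my own sketch, not part of the notes), effective resistances are easy to compute numerically via the pseudo-inverse of the graph Laplacian, which makes it possible to check Theorem 16.2.1 on small graphs: the simulated commute time should be close to $2mR_{uv}$.

    import random
    import numpy as np

    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]   # a small example graph
    n, m = 5, len(edges)

    # Effective resistance via the Moore-Penrose pseudo-inverse of the Laplacian.
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    Lp = np.linalg.pinv(L)
    def eff_res(u, v):
        e = np.zeros(n); e[u], e[v] = 1, -1
        return float(e @ Lp @ e)

    # Estimate the commute time u -> v -> u by direct simulation of the random walk.
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    def commute_time(u, v, trials=2000):
        total = 0
        for _ in range(trials):
            cur, steps = u, 0
            for target in (v, u):              # go to v, then come back to u
                while cur != target:
                    cur = random.choice(adj[cur]); steps += 1
            total += steps
        return total / trials

    u, v = 0, 4
    print("2m * R_uv =", 2 * m * eff_res(u, v), "  simulated CT_uv ~", commute_time(u, v))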

Lemma 16.2.3. For any n vertex connected graph G, and for all u, v ∈ V(G), we have CTuv < n3 .

Proof: The effective resistance between any two nodes in the network is bounded by the length of the shortest path between the two nodes, which is at most n − 1. As such, plugging this into Theorem 16.2.1 yields the bound, since $2m \leq n(n-1)$.

16.3. Bibliographical Notes


A nice survey of the material covered here is available online at http://arxiv.org/abs/math.PR/0001057 [DS00].

Chapter 17

Random Walks IV
“Do not imagine, comrades, that leadership is a pleasure! On the contrary, it is a deep and
heavy responsibility. No one believes more
firmly than Comrade Napoleon that all animals
are equal. He would be only too happy to let
you make your decisions for yourselves. But
sometimes you might make the wrong decisions,
comrades, and then where should we be?
Suppose you had decided to follow Snowball,
with his moonshine of windmills-Snowball, who,
as we now know, was no better than a
criminal?”

Animal Farm, George Orwell

17.1. Cover times


We remind the reader that the cover time of a graph is the expected time to visit all the vertices in the
graph, starting from an arbitrary vertex (i.e., worst vertex). The cover time is denoted by C(G).

Theorem 17.1.1. Let G be an undirected connected graph, then C(G) ≤ 2m(n − 1), where n = |V(G)| and
m = |E(G)|.

Proof: (Sketch.) Construct a spanning tree T of G, and consider a walk that traverses T, visiting each edge of T in both directions. For each edge uv of T, the expected time to traverse it in both directions is $\mathrm{CT}_{uv} = h_{uv} + h_{vu}$, which is smaller than 2m, by Lemma 16.1.8. Now, just add up these bounds over the n − 1 edges of the spanning tree to bound the expected time to travel around it. Note that the bound is independent of the starting vertex.
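Here is a small simulation (an illustration of mine, not from the notes) that estimates the cover time of a graph by running random walks until all vertices have been seen, and compares it with the $2m(n-1)$ bound; the cycle graph is an arbitrary choice.

    import random

    def cover_time(adj, start, trials=500):
        n, total = len(adj), 0
        for _ in range(trials):
            cur, seen, steps = start, {start}, 0
            while len(seen) < n:
                cur = random.choice(adj[cur])
                seen.add(cur)
                steps += 1
            total += steps
        return total / trials

    n = 30                                        # a cycle on n vertices
    adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    m = n
    print("estimated cover time:", cover_time(adj, 0))
    print("bound 2m(n-1)       :", 2 * m * (n - 1))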

Definition 17.1.2. The resistance of G is R(G) = maxu,v∈V(G) Ruv ; namely, it is the maximum effective
resistance in G.

Theorem 17.1.3. mR(G) ≤ C(G) ≤ 2e3 mR(G) ln n + 2n.

Proof: Consider the vertices u and v realizing R(G), and observe that $\max(h_{uv}, h_{vu}) \geq \mathrm{CT}_{uv}/2$, where $\mathrm{CT}_{uv} = 2mR_{uv}$ by Theorem 16.2.1. Thus, $C(G) \geq \mathrm{CT}_{uv}/2 \geq mR(G)$.
As for the upper bound, consider a random walk, and divide it into epochs, where an epoch is a random walk of length $2e^3 mR(G)$. For any vertex u, the expected time to hit u, from any starting vertex, is at most $2mR(G)$, by Theorem 16.2.1. Thus, the probability that u is not visited during an epoch is at most $1/e^3$, by the Markov inequality. Consider a random walk consisting of $\ln n$ epochs. The probability that a specific vertex is not visited is $\leq \left(1/e^3\right)^{\ln n} \leq 1/n^3$, and thus, by the union bound, all vertices are visited with probability $\geq 1 - 1/n^2$. Otherwise, after this walk, we perform a random walk till we visit all the vertices; the expected length of this (fix-up) random walk is $\leq 2n^3$, by Theorem 17.1.1. Thus, the expected length of the walk is $\leq 2e^3 mR(G)\ln n + 2n^3(1/n^2) = 2e^3 mR(G)\ln n + 2n$.

17.1.1. Rayleigh’s Short-cut Principle.


Observe that the effective resistance is never raised by lowering the resistance on an edge, and it is never lowered by raising the resistance on an edge. Similarly, the effective resistance is never lowered by removing a vertex. Interestingly, effective resistance complies with the triangle inequality.

Observation 17.1.4. For a graph with minimum degree d, we have R(G) ≥ 1/d (collapse all vertices except the minimum-degree vertex into a single vertex).

Lemma 17.1.5. Suppose that G contains p edge-disjoint paths of length at most ` from s to t. Then
Rst ≤ `/p.

17.2. Graph Connectivity


Definition 17.2.1. A probabilistic log-space Turing machine for a language L is a Turing machine using space O(log n) and running in time O(poly(n)), where n is the input size. A problem A is in RLP if there exists a probabilistic log-space Turing machine M such that M accepts $x \in L(A)$ with probability larger than 1/2, and if $x \notin L(A)$ then M(x) always rejects.

Theorem 17.2.2. Let USTCON denote the problem of deciding if a vertex s is connected to a vertex
t in an undirected graph. Then USTCON ∈ RLP.

Proof: Perform a random walk of length $2n^3$ in the input graph G, starting from s. Stop as soon as the random walk hits t. If s and t are in the same connected component, then $h_{st} \leq n^3$. Thus, by the Markov inequality, the walk reaches t with probability at least 1/2, and the algorithm works. It is easy to verify that it can be implemented in O(log n) space.
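The following Python fragment (a sketch of mine, not the notes' pseudocode) mimics this randomized connectivity test: it walks for $2n^3$ steps from s and reports whether t was hit. It only illustrates the random-walk idea; it makes no attempt to respect the log-space restriction.

    import random

    def ustcon_random_walk(adj, s, t):
        """One-sided test: returns True only if a walk from s reaches t within 2n^3 steps."""
        n = len(adj)
        cur = s
        for _ in range(2 * n ** 3):
            if cur == t:
                return True
            cur = random.choice(adj[cur])
        return cur == t

    # Example: a path 0-1-2-3 plus a separate component {4, 5}.
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
    print(ustcon_random_walk(adj, 0, 3))   # almost surely True
    print(ustcon_random_walk(adj, 0, 5))   # always False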

Definition 17.2.3. A graph is d-regular, if all its vertices are of degree d.


A d-regular graph is labeled if at each vertex of the graph, each of the d edges incident on that
vertex has a unique label in {1, . . . , d}.
Any sequence of symbols σ = (σ1, σ2, . . .) from {1, . . . , d} together with a starting vertex s in a
labeled graph describes a walk in the graph. For our purposes, such a walk would almost always be
finite.
A sequence σ is said to traverse a labeled graph G if the corresponding walk visits every vertex of G regardless of the starting vertex. A sequence σ is said to be a universal traversal sequence for a class of labeled graphs if it traverses all the graphs in this class.

Given such a universal traversal sequence, we can construct (a non-uniform) Turing machine that
can solve USTCON for such d-regular graphs, by encoding the sequence in the machine.
Let F denote a family of graphs, and let U(F ) denote the length of the shortest universal traversal
sequence for all the labeled graphs in F . Let R(F ) denote the maximum resistance of graphs in this
family.
Theorem 17.2.4. U(F ) ≤ 5mR(F ) lg(n |F |).

Proof: Same old, same old. Break the sequence into epochs, each of length $L = 2mR(\mathcal{F})$. Now, start random walks from all the possible starting vertices, in all the possible graphs of $\mathcal{F}$, and continue the walks till all the vertices are visited. Initially, there are at most $n^2|\mathcal{F}|$ (graph, starting vertex, target vertex) triples that need to be covered. In expectation, in each epoch at least half of the remaining ones get covered. As such, after $1 + \lg\left(n^2|\mathcal{F}|\right)$ epochs, the expected number of triples still not covered is ≤ 1/2. Namely, with constant probability we are done.

Let U(d, n) denote the length of the shortest universal traversal sequence of connected, labeled n-
vertex, d-regular graphs.

Lemma 17.2.5. The number of labeled n-vertex graphs that are d-regular is (nd)O(nd) .

Proof: Such a graph has dn/2 edges overall. Specifically, we can encode it by listing, for every vertex, its d neighbors – there are $\binom{n-1}{d} \leq n^d$ possibilities per vertex. As such, there are at most $n^{nd}$ choices for the edges of the graph.¬ Every vertex has d! possible labelings of the edges adjacent to it, thus there are $(d!)^n \leq d^{nd}$ possible labelings.

¬ This is a callous upper bound – better analysis is possible. But never analyze things better than you have to – it is usually a waste of time.

Lemma 17.2.6. $U(d, n) = O\left(n^3 d \log n\right)$.




Proof: The diameter of every connected n-vertex, d-regular graph is O(n/d). Indeed, consider a path realizing the diameter of the graph, and assume it has t vertices. Number the vertices along the path consecutively, and consider all the vertices whose number is a multiple of three. There are $\alpha \geq \lfloor t/3\rfloor$ such vertices. No pair of these vertices can share a neighbor, and as such the graph has at least $(d+1)\alpha$ vertices. We conclude that $n \geq (d+1)\alpha \geq (d+1)(t/3 - 1)$, and thus $t \leq 3\left(\frac{n}{d+1} + 1\right) = O(n/d)$.
This also bounds the resistance of such a graph, since R(G) is at most the diameter. The number of edges is m = nd/2. Now, combine Lemma 17.2.5 and Theorem 17.2.4.

This is, as mentioned before, not a uniform solution. There is by now a known deterministic log-space algorithm for this problem, which is uniform.

17.2.1. Directed graphs


Theorem 17.2.7. One can solve the directed s-t connectivity problem ($\overrightarrow{\mathrm{STCON}}$) with a log-space randomized algorithm that always outputs NO if there is no path from s to t, and outputs YES with probability at least 1/2 if there is a path from s to t.

17.3. Graphs and Eigenvalues


Consider an undirected graph G = G(V, E) with n vertices. The adjacency matrix M(G) of G is the n × n symmetric matrix where $M_{ij} = M_{ji}$ is the number of edges between the vertices $v_i$ and $v_j$. If G is bipartite, we assume that V is made out of two independent sets X and Y; in this case the matrix M(G) can be written in block form. Since M(G) is symmetric, all its eigenvalues exist and are real, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and their corresponding orthonormal basis vectors are $e_1, \ldots, e_n$. We will need the following theorem.

Theorem 17.3.1 (Fundamental theorem of algebraic graph theory). Let G = G(V, E) be an n-vertex, undirected (multi)graph with maximum degree d. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ be the eigenvalues of M(G), and let $e_1, \ldots, e_n$ be the corresponding orthonormal eigenvectors. The following holds.
(i) If G is connected then $\lambda_2 < \lambda_1$.
(ii) For i = 1, . . . , n, we have $|\lambda_i| \leq d$.
(iii) d is an eigenvalue if and only if G is regular.
(iv) If G is d-regular then the eigenvalue $\lambda_1 = d$ has the eigenvector $e_1 = \frac{1}{\sqrt{n}}(1, 1, 1, \ldots, 1)$.
(v) The graph G is bipartite if and only if for every eigenvalue λ there is an eigenvalue −λ of the same multiplicity.
(vi) Suppose that G is connected. Then G is bipartite if and only if $-\lambda_1$ is an eigenvalue.
(vii) If G is d-regular and bipartite, then $\lambda_n = -d$ and $e_n = \frac{1}{\sqrt{n}}(1, 1, \ldots, 1, -1, \ldots, -1)$, where there are equal numbers of 1s and −1s in $e_n$.

17.4. Bibliographical Notes


A nice survey of algebraic graph theory appears in [Wes01] and in [Bol98].

Chapter 18

Random Walks V
“Is there anything in the Geneva Convention about the rules of war in peacetime?” Stanko
wanted to know, crawling back toward the
truck. “Absolutely nothing,” Caulec assured
him. “The rules of war apply only in wartime.
In peacetime, anything goes.”

Romain Gary, Gasp

18.1. Rapid mixing for expanders


We remind the reader of the following definition of expander.

Definition 18.1.1. Let G = (V, E) be an undirected d-regular graph. The graph G is an (n, d, c)-expander (or just c-expander) if, for every set $S \subseteq V$ of size at most |V|/2, there are at least cd|S| edges connecting S and $\overline{S} = V \setminus S$; that is, $e\left(S, \overline{S}\right) \geq cd|S|$.

Guaranteeing aperiodicity. Let G be an (n, d, c)-expander. We would like to perform a random walk on G. The graph G is connected, but it might be periodic (i.e., bipartite). To overcome this, consider the lazy random walk on G that either stays in the current state with probability 1/2 or traverses one of the edges. Clearly, the resulting Markov Chain (MC) is aperiodic. The resulting transition matrix is
\[
Q = M/2d + I/2,
\]
where M is the adjacency matrix of G and I is the n × n identity matrix. Clearly Q is doubly stochastic. Furthermore, if $\widehat{\lambda}_i$ is an eigenvalue of M, with eigenvector $v_i$, then
\[
Q v_i = \frac{1}{2}\left(\frac{M}{d} + I\right)v_i = \frac{1}{2}\left(\frac{\widehat{\lambda}_i}{d} + 1\right)v_i.
\]
As such, $\frac{1}{2}\left(\widehat{\lambda}_i/d + 1\right)$ is an eigenvalue of Q. Namely, if there is a spectral gap in the graph G, there would also be a similar spectral gap in the resulting MC. This MC can be generated by adding to each vertex d self loops, ending up with a 2d-regular graph. Clearly, this graph is still an expander if the original graph is an expander, and the random walk on it is aperiodic.
From this point on, we just assume that our expander is aperiodic.
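The eigenvalue map above is easy to verify numerically. The following sketch (mine, for illustration only) builds the lazy transition matrix Q = M/(2d) + I/2 for a small d-regular graph and checks that its spectrum is exactly $\left(\widehat{\lambda}_i/d + 1\right)/2$.

    import numpy as np

    # A 3-regular example graph: the complete graph K4.
    M = np.ones((4, 4)) - np.eye(4)
    d = 3
    Q = M / (2 * d) + np.eye(4) / 2

    lam_M = np.sort(np.linalg.eigvalsh(M))
    lam_Q = np.sort(np.linalg.eigvalsh(Q))
    print(np.round(lam_Q, 6))
    print(np.round((lam_M / d + 1) / 2, 6))   # same numbers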

18.1.1. Bounding the mixing time
For an MC with n states, we denote by $\pi = (\pi_1, \ldots, \pi_n)$ its stationary distribution. We consider only nicely behaved MCs that fall under Theorem 15.2.11. As such, no state in the MC has zero stationary probability.

Definition 18.1.2. Let $q^{(t)}$ denote the state probability vector of a Markov chain defined by a transition matrix Q at time $t \geq 0$, given an initial distribution $q^{(0)}$. The relative pairwise distance of the Markov chain at time t is
\[
\Delta(t) = \max_i \frac{\left|q_i^{(t)} - \pi_i\right|}{\pi_i}.
\]
Namely, if ∆(t) approaches zero then $q^{(t)}$ approaches π.

We remind the reader that we saw a construction of a constant-degree expander with constant expansion. In its transition matrix Q, we have that $\widehat{\lambda}_1 = 1$ and $-1 \leq \widehat{\lambda}_2 < 1$, and furthermore the spectral gap $\widehat{\lambda}_1 - \widehat{\lambda}_2$ is a constant (the two properties are equivalent, but we proved only one direction of this).
We need a slightly stronger property (that does hold for our expander construction): we have that $\widehat{\lambda}_2 \geq \max_{i=2}^{n} \left|\widehat{\lambda}_i\right|$.
Theorem 18.1.3. Let Q be the transition matrix of an aperiodic (n, d, c)-expander. Then, for any initial distribution $q^{(0)}$, we have that
\[
\Delta(t) \leq n^{3/2}\left(\widehat{\lambda}_2\right)^t.
\]
Namely, since λb2 is a constant smaller than 1, the distance ∆(t) drops exponentially with t.
Proof: We have that $q^{(t)} = q^{(0)} Q^t$. Let $\mathcal{B}(Q) = \langle v_1, \ldots, v_n\rangle$ denote the orthonormal eigenvector basis of Q (see Definition 29.2.3), and write $q^{(0)} = \sum_{i=1}^{n} \alpha_i v_i$. Since $\widehat{\lambda}_1 = 1$, we have that
\[
q^{(t)} = q^{(0)} Q^t = \left(\sum_{i=1}^{n}\alpha_i v_i\right) Q^t = \sum_{i=1}^{n}\alpha_i \left(\widehat{\lambda}_i\right)^t v_i = \alpha_1 v_1 + \sum_{i=2}^{n}\alpha_i\left(\widehat{\lambda}_i\right)^t v_i.
\]
Since $v_1 = \left(1/\sqrt{n}, 1/\sqrt{n}, \ldots, 1/\sqrt{n}\right)$, and $\left|\widehat{\lambda}_i\right| \leq \widehat{\lambda}_2 < 1$ for i > 1, we have that $\lim_{t\to\infty}\left(\widehat{\lambda}_i\right)^t = 0$, and thus
\[
\pi = \lim_{t\to\infty} q^{(t)} = \alpha_1 v_1 + \sum_{i=2}^{n}\alpha_i\left(\lim_{t\to\infty}\left(\widehat{\lambda}_i\right)^t\right)v_i = \alpha_1 v_1.
\]
Now, since $v_1, \ldots, v_n$ is an orthonormal basis, and $q^{(0)} = \sum_i \alpha_i v_i$, we have that $\left\|q^{(0)}\right\|_2 = \sqrt{\sum_{i=1}^{n}\alpha_i^2}$. This implies that
\[
\left\|q^{(t)} - \pi\right\|_1 = \left\|\sum_{i=2}^{n}\alpha_i\left(\widehat{\lambda}_i\right)^t v_i\right\|_1 \leq \sqrt{n}\left\|\sum_{i=2}^{n}\alpha_i\left(\widehat{\lambda}_i\right)^t v_i\right\|_2 = \sqrt{n}\sqrt{\sum_{i=2}^{n}\left(\alpha_i\left(\widehat{\lambda}_i\right)^t\right)^2} \leq \sqrt{n}\left(\widehat{\lambda}_2\right)^t\sqrt{\sum_{i=2}^{n}\alpha_i^2} \leq \sqrt{n}\left(\widehat{\lambda}_2\right)^t\left\|q^{(0)}\right\|_2 \leq \sqrt{n}\left(\widehat{\lambda}_2\right)^t\left\|q^{(0)}\right\|_1 = \sqrt{n}\left(\widehat{\lambda}_2\right)^t,
\]
since $q^{(0)}$ is a distribution. Now, since $\pi_i = 1/n$, we have
\[
\Delta(t) = \max_i \frac{\left|q_i^{(t)} - \pi_i\right|}{\pi_i} = \max_i n\left|q_i^{(t)} - \pi_i\right| \leq n\left\|q^{(t)} - \pi\right\|_1 \leq n\sqrt{n}\left(\widehat{\lambda}_2\right)^t.
\]

18.2. Probability amplification by random walks on expanders


We are interested in performing probability amplification for an algorithm that is a BPP algorithm (see Definition 2.4.9). It would be convenient to work with an algorithm which is already somewhat amplified. That is, we assume that we are given a BPP algorithm Alg for a language L, such that
(A) If $x \in L$ then $\Pr\left[\text{Alg}(x) \text{ accepts}\right] \geq 199/200$.
(B) If $x \notin L$ then $\Pr\left[\text{Alg}(x) \text{ accepts}\right] \leq 1/200$.
We assume that Alg requires a random bit string of length n. So, we have a constant-degree expander G (say of degree d) that has at least $200\cdot 2^n$ vertices. In particular, let U = |V(G)|, and since our expander construction grows exponentially in size (but the base of the exponent is a constant), we have that $U = O(2^n)$. (Translation: we can not quite get an expander with a specific number of vertices. Rather, we can guarantee an expander that has more vertices than we need, but not many more.)
We label the vertices of G with all the binary strings of length n, in a round-robin fashion (thus, each binary string of length n appears either $\left\lceil|V(G)|/2^n\right\rceil$ or $\left\lfloor|V(G)|/2^n\right\rfloor$ times). For a vertex $v \in V(G)$, let s(v) denote the binary string associated with v.
Consider a string x for which we would like to decide whether it is in L or not. We know that at least (99/100)U of the vertices of G are labeled with “random” strings that would yield the right result if we feed them into Alg (the constant here deteriorated from 199/200 to 99/100 because the number of times a string appears is not exactly the same for all strings).

The algorithm. We perform a random walk of length µ = αβk on G, where α and β are constants to be determined shortly, and k is a parameter. To this end, we randomly choose a starting vertex $X_0$ (this requires n + O(1) random bits). Every step of the random walk requires O(1) random bits, as the expander has constant degree; as such, overall, this requires n + O(k) random bits. Now, let $X_0, X_1, \ldots, X_\mu$ be the resulting random walk. We compute the results
\[
Y_i = \text{Alg}(x, r_i), \quad \text{for } i = 0, \ldots, \nu, \text{ where } \nu = \alpha k
\]
and $r_i = s\left(X_{i\cdot\beta}\right)$. Specifically, we use the strings associated with nodes that are at distance β from each other along the path of the random walk. We return the majority of the bits $Y_0, \ldots, Y_\nu$ as the decision of whether $x \in L$ or not.
We assume here that we have a fully explicit construction of an expander; that is, given a vertex of the expander, we can compute all its neighbors in polynomial time (in the length of the index of the vertex). While the expander construction we saw is only explicit, it can be made fully explicit with more effort.
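To make the driver concrete, here is a Python sketch (my own, with several simplifications): it uses a randomly generated small d-regular-ish graph in place of a real expander family, a stand-in alg(x, r) procedure, and majority voting over every βth vertex of the walk. None of these names come from the notes.

    import random

    def amplify(alg, x, graph, s, alpha=7, beta=4, k=10):
        """Run alg on the labels visited every beta steps of a walk of length alpha*beta*k."""
        votes = []
        cur = random.randrange(len(graph))          # X_0: n + O(1) random bits in the real scheme
        for i in range(alpha * k + 1):
            votes.append(alg(x, s(cur)))            # Y_i = Alg(x, s(X_{i*beta}))
            for _ in range(beta):                   # advance beta steps; O(1) bits per step
                cur = random.choice(graph[cur])
        return sum(votes) > len(votes) / 2          # majority vote

    # Toy stand-ins: a random 4-out graph (not a real expander), labels = vertex ids,
    # and a noisy "algorithm" that is correct on about 99.5% of the random strings.
    U = 512
    graph = [[random.randrange(U) for _ in range(4)] for _ in range(U)]
    s = lambda v: v
    alg = lambda x, r: (hash((x, r)) % 200) != 0
    print(amplify(alg, "some input", graph, s))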

18.2.1. The analysis
Intuition. Using only every βth node of the random walk corresponds to performing a random walk on the graph $G^\beta$; that is, we raise the graph to the power β. This new graph is a much better expander (but the degree has deteriorated). Now, consider a specific input x, and mark the vertices that are bad for it in the graph G. Clearly, we mark at most a 1/100 fraction of the vertices. Conceptually, think about these vertices as being uniformly spread in the graph and far apart. For the execution of the algorithm to fail, the random walk needs to visit at least αk/2 bad vertices among the steps of the walk in $G^\beta$. However, the probability of that is extremely small – why would the random walk keep stumbling into bad vertices, when they are so infrequent?

The real thing. Let Q be the transition matrix of G. We assume, as usual, that the random walk on G is aperiodic (if not, we can easily fix it using the standard tricks), and thus ergodic. Let $B = Q^\beta$ be the transition matrix of the random walk between the states that we actually use in the algorithm. Note that the eigenvalues of B (except the first one) “shrink”. In particular, by picking β to be a sufficiently large constant, we have that
\[
\widehat{\lambda}_1\left(B\right) = 1 \quad\text{and}\quad \left|\widehat{\lambda}_i\left(B\right)\right| \leq \frac{1}{10}, \quad\text{for } i = 2, \ldots, U.
\]
For the input string x, let W be the U × U matrix that has 1 in the diagonal entry $W_{ii}$ if and only if Alg(x, s(i)) returns the right answer, for i = 1, . . . , U, and is zero everywhere else. (We remind the reader that s(i) is the string associated with the ith vertex, and U = |V(G)|.) Similarly, let $\overline{W} = I - W$ be the “complement” matrix, having 1 at $\overline{W}_{ii}$ iff Alg(x, s(i)) is incorrect. We know that W has at least (99/100)U ones on its diagonal.

Lemma 18.2.1. Let Q be a symmetric transition matrix. Then all the eigenvalues of Q are in the range [−1, 1].

Proof: Let $p \in \mathbb{R}^U$ be an eigenvector with eigenvalue λ, and let $p_i$ be the coordinate of p with the maximum absolute value. We have that
\[
\left|\lambda p_i\right| = \left|\left[pQ\right]_i\right| = \left|\sum_{j=1}^{U} p_j Q_{ji}\right| \leq \sum_{j=1}^{U}\left|p_j\right| Q_{ji} \leq \left|p_i\right|\sum_{j=1}^{U} Q_{ji} = \left|p_i\right|.
\]
This implies that |λ| ≤ 1.
(We used the symmetry of the matrix in asserting that the eigenvalues of Q are all real numbers.)

Lemma 18.2.2. Let Q be a symmetric transition matrix. Then, for any $p \in \mathbb{R}^n$, we have that $\left\|pQ\right\|_2 \leq \left\|p\right\|_2$.

Proof: Let $\mathcal{B}(Q) = \langle v_1, \ldots, v_n\rangle$ denote the orthonormal eigenvector basis of Q, with eigenvalues $1 = \lambda_1, \lambda_2, \ldots, \lambda_n$. Write $p = \sum_i \alpha_i v_i$, and observe that
\[
\left\|pQ\right\|_2 = \left\|\sum_i \alpha_i v_i Q\right\|_2 = \left\|\sum_i \alpha_i \lambda_i v_i\right\|_2 = \sqrt{\sum_i \alpha_i^2\lambda_i^2} \leq \sqrt{\sum_i \alpha_i^2} = \left\|p\right\|_2,
\]
since $|\lambda_i| \leq 1$, for i = 1, . . . , n, by Lemma 18.2.1.

Lemma 18.2.3. Let $B = Q^\beta$ be the transition matrix of the graph $G^\beta$. For all vectors $p \in \mathbb{R}^U$, we have: (i) $\left\|pBW\right\|_2 \leq \left\|p\right\|_2$, and (ii) $\left\|pB\overline{W}\right\|_2 \leq \left\|p\right\|_2/5$.

Proof: (i) Since multiplying a vector by W has the effect of zeroing out some of its coordinates, it is clear that this cannot enlarge the norm of the vector. As such, $\left\|pBW\right\|_2 \leq \left\|pB\right\|_2 \leq \left\|p\right\|_2$, by Lemma 18.2.2.
(ii) Write $p = \sum_i \alpha_i v_i$, where $v_1, \ldots, v_U$ is the orthonormal eigenvector basis of Q (and thus also of B), with eigenvalues $1 = \widehat{\lambda}_1, \widehat{\lambda}_2, \ldots, \widehat{\lambda}_U$. We remind the reader that $v_1 = (1, 1, \ldots, 1)/\sqrt{U}$. Since $\overline{W}$ zeroes out at least 99/100 of the entries of a vector it is multiplied by (and copies the rest as they are), we have that $\left\|v_1\overline{W}\right\|_2 \leq \sqrt{(U/100)\left(1/\sqrt{U}\right)^2} \leq 1/10 = \left\|v_1\right\|_2/10$. Also, for any $x \in \mathbb{R}^U$, we have $\left\|x\overline{W}\right\|_2 \leq \left\|x\right\|_2$. As such, using that $v_1 B = v_1$, we have
\[
\left\|pB\overline{W}\right\|_2 = \left\|\sum_i \alpha_i v_i B\overline{W}\right\|_2 \leq \left\|\alpha_1 v_1 B\overline{W}\right\|_2 + \left\|\sum_{i=2}^{U}\alpha_i v_i B\overline{W}\right\|_2 \leq \left|\alpha_1\right|\left\|v_1\overline{W}\right\|_2 + \left\|\sum_{i=2}^{U}\alpha_i\left(\widehat{\lambda}_i\right)^\beta v_i\right\|_2 \leq \frac{\left|\alpha_1\right|}{10} + \sqrt{\sum_{i=2}^{U}\left(\alpha_i\left(\widehat{\lambda}_i\right)^\beta\right)^2} \leq \frac{\left|\alpha_1\right|}{10} + \frac{1}{10}\sqrt{\sum_{i=2}^{U}\alpha_i^2} \leq \frac{\left\|p\right\|_2}{10} + \frac{\left\|p\right\|_2}{10} \leq \frac{\left\|p\right\|_2}{5},
\]
since $\left|\widehat{\lambda}_i\right|^\beta \leq 1/10$, for i = 2, . . . , U.

Consider the strings $r_0, \ldots, r_\nu$. For each one of these strings we can write down whether it is a “good” string (i.e., Alg returns the correct result) or a bad string. This results in a binary pattern $b_0, \ldots, b_\nu$. Given a distribution $p \in \mathbb{R}^U$ on the states of the graph, it is natural to ask what is the probability of being in a “good” state; clearly, this is the quantity $\left\|pW\right\|_1$. Thus, if we are interested in the probability of a specific pattern, then we should start with the initial distribution $p^{(0)}$, truncate away the coordinates that represent an invalid state, apply the transition matrix, again truncate away the forbidden coordinates, and repeat in this fashion till we exhaust the pattern. Clearly, the $\ell_1$-norm of the resulting vector is the probability of this pattern. To this end, given a pattern $b_0, \ldots, b_\nu$, let $\mathcal{S} = \langle S_0, \ldots, S_\nu\rangle$ denote the corresponding sequence of “truncating” matrices (i.e., $S_i$ is either W or $\overline{W}$). Formally, we set $S_i = W$ if Alg(x, r_i) returns the correct answer, and set $S_i = \overline{W}$ otherwise.
The above argument implies the following lemma.

Lemma 18.2.4. For any fixed pattern $b_0, \ldots, b_\nu$, the probability of the random walk generating this pattern of random strings is $\left\|p^{(0)} S_0 B S_1 \cdots B S_\nu\right\|_1$, where $\mathcal{S} = \langle S_0, \ldots, S_\nu\rangle$ is the sequence of W and $\overline{W}$ matrices encoded by this pattern.

Theorem 18.2.5. The probability that the majority of the outputs $\text{Alg}(x, r_0), \text{Alg}(x, r_1), \ldots, \text{Alg}(x, r_\nu)$ is incorrect is at most $1/2^k$.

Proof: The majority is wrong only if (at least) half of the elements of the sequence $\mathcal{S} = \langle S_0, \ldots, S_\nu\rangle$ are $\overline{W}$. Fix such a “bad” sequence $\mathcal{S}$, and observe that the distributions we work with are vectors in $\mathbb{R}^U$. As such, if $p^{(0)}$ is the initial distribution, then we have that
\[
\Pr\left[\mathcal{S}\right] = \left\|p^{(0)} S_0 B S_1 \cdots B S_\nu\right\|_1 \leq \sqrt{U}\left\|p^{(0)} S_0 B S_1 \cdots B S_\nu\right\|_2 \leq \sqrt{U}\,\frac{1}{5^{\nu/2}}\left\|p^{(0)}\right\|_2,
\]
by Lemma 18.3.1 below (i.e., the Cauchy-Schwarz inequality) and by repeatedly applying Lemma 18.2.3, since half of the sequence $\mathcal{S}$ are $\overline{W}$, and the rest are W. The distribution $p^{(0)}$ is uniform, which implies that $\left\|p^{(0)}\right\|_2 = 1/\sqrt{U}$. As such, summing over the set of all bad patterns (there are at most $2^\nu$ such “bad” patterns), we have
\[
\Pr\left[\text{majority is bad}\right] \leq 2^\nu\sqrt{U}\,\frac{1}{5^{\nu/2}}\left\|p^{(0)}\right\|_2 = \left(\frac{4}{5}\right)^{\nu/2} = \left(\frac{4}{5}\right)^{\alpha k/2} \leq \frac{1}{2^k},
\]
for α = 7.

18.3. Some standard inequalities



Lemma 18.3.1. For any vector $v = (v_1, \ldots, v_d) \in \mathbb{R}^d$, we have that $\left\|v\right\|_1 \leq \sqrt{d}\left\|v\right\|_2$.

Proof: We can safely assume all the coordinates of v are non-negative. Now,
\[
\left\|v\right\|_1 = \sum_{i=1}^{d} v_i = \sum_{i=1}^{d} v_i\cdot 1 = \left|v\cdot(1, 1, \ldots, 1)\right| \leq \sqrt{\sum_{i=1}^{d} v_i^2}\sqrt{\sum_{i=1}^{d} 1^2} = \sqrt{d}\left\|v\right\|_2,
\]
by the Cauchy-Schwarz inequality.

Chapter 19

The Johnson-Lindenstrauss Lemma


Dixon was alive again. Consciousness was upon him before he could get out of the way; not for
him the slow, gracious wandering from the halls
of sleep, but a summary, forcible ejection. He
lay sprawled, too wicked to move, spewed up
like a broken spider-crab on the tarry shingle of
the morning. The light did him harm, but not
as much as looking at things did; he resolved,
having done it once, never to move his eyeballs
again. A dusty thudding in his head made the
scene before him beat like a pulse. His mouth
had been used as a latrine by some small
creature of the night, and then as its
mausoleum. During the night, too, he’d
somehow been on a cross-country run and then
been expertly beaten up by secret police. He
felt bad.

Lucky Jim, Kingsley Amis

In this chapter, we will prove that given a set P of n points in $\mathbb{R}^d$, one can reduce the dimension of
reduceJim,
theKingsley
dimension of
Amis
the points to k = O(ε −2 log n) such that distances are 1 ± ε preserved. Surprisingly, this reduction is done
by randomly picking a subspace of k dimensions and projecting the points into this random subspace.
One way of thinking about this result is that we are “compressing” the input of size nd (i.e., n points
with d coordinates) into size O(nε −2 log n), while (approximately) preserving distances.

19.1. The Brunn-Minkowski inequality


d d
 a set A ⊆ R , an a point p ∈ R , let A + p denote the translation of A by p. Formally, A + p =
For
q+p q ∈ A .

Definition 19.1.1. For two sets A and B in Rn , let A + B denote


the Minkowski sum of A and B. Formally,
Ø + =
A + B = a + b a ∈ A, b ∈ B = (p + B).

p∈A

Remark 19.1.2. It is easy to verify that if A0 and B0 are translated copies of A and B (that is, A0 =
A + p and B = B + q, for some points p, q ∈ Rd ), respectively, then A0 + B0 is a translated copy

131
 
of A + B. In particular, since volume is preserved under translation, we have that vol A0 + B0 =
   
vol (A + B) + p + q = vol A + B , where vol(X) is the volume (i.e., measure) of the set X.

Our purpose here is to prove the following theorem.

Theorem 19.1.3 (Brunn-Minkowski inequality). Let A and B be two non-empty compact sets in
Rn . Then
 1/n  1/n  1/n
vol A + B ≥ vol A + vol B .

Definition 19.1.4. A set A ⊆ Rn is a brick set if it is the union of finitely many (close) axis parallel
boxes with disjoint interiors.

It is intuitively clear, by limit arguments, that proving Theorem 19.1.3 for brick sets will imply it
for the general case.

Lemma 19.1.5 (Brunn-Minkowski inequality for Brick Sets). Let A and B be two non-empty
brick sets in Rn . Then
   1/n
vol A + B ≥ vol(A)1/n + vol(B)1/n .

Proof: By induction on the number k of bricks in A and B. If k = 2 then A and B are just bricks,
with dimensions a1, . . . , an and b1, . . . , bn , respectively. In this case, the dimensions of A + B are a1 +
În  1/n În  1/n
b1, . . . , an + bn , as can be easily verified. Thus, we need to prove that i=1 ai + i=1 bi ≤
Î  1/n
n
i=1 (ai + bi ) .Dividing the left side by the right side, we have

n
! 1/n n
! 1/n n n
Ö ai Ö bi 1 Õ ai 1 Õ bi
+ ≤ + = 1,
i=1
ai + bi i=1
ai + bi n i=1 ai + bi n i=1 ai + bi

by the generalized arithmetic-geometric mean inequality¬ , and the claim follows for this case.
Now let k > 2 and suppose that the Brunn-Minkowski inequality holds for any pair of brick sets with
fewer than k bricks (together). Let A and B be a pair of sets having k bricks together, the A has at least
two (disjoint) bricks. However, this implies that there is an axis parallel hyperplane h that separates
the interior of one brick of A from the interior of another brick of A (the hyperplane h might intersect
other bricks of A). Assume that h is the hyperplane x1 = 0 (this can be achieved by translation and
renaming of coordinates).
Let A+ = A ∩ h+ and A− = A ∩ h− , where h+ and h− are the two open half spaces induced by h. Let
A+ and A− be the closure of A+ and A− , respectively. Clearly, A+ and A− are both brick sets with (at
least) one fewer brick than A.
Next, observe that the claim is translation invariant (see Remark 19.1.2), and as such, let us translate
B so that its volume is split by h in the same ratio A’s volume is being split. Denote the two parts of
¬ Here is a proof of the generalized form: Let x , . . . , x be n positive real numbers. Consider the quantity R =
1 n
x1 x2 · · · xn . If we fix the sum of the n numbers to be equal to α, then R is maximized when all the xi s are equal. Thus,
√ p
n
x1 x2 · · · xn ≤ n (α/n)n = α/n = (x1 + · · · + xn )/n.

132
B by B+ and B− , respectively. Let ρ = vol(A+ )/vol(A) = vol(B+ )/vol(B) (if vol(A) = 0 or vol(B) = 0 the
claim trivially holds).
Observe, that A+ + B+ ⊆ A + B, and it lies on one side of h (since h ≡ (x1 = 0)), and similarly
A− + B− ⊆ A + B and it lies on the other side of h. Thus, by induction and since A+ + B+ and A− + B−
are interior disjoint, we have

vol A + B ≥ vol A+ + B+ + vol A− + B−


  
  1/n  1/n  n  n
≥ vol A+ + vol B+ + vol(A− )1/n + vol(B− )1/n
h in
= ρ1/n vol(A)1/n + ρ1/n vol(B)1/n
h in
(1 − ρ)1/n vol(A)1/n + (1 − ρ)1/n vol(B)1/n
h in
= (ρ + (1 − ρ)) vol(A)1/n + vol(B)1/n
h in
1/n 1/n
= vol(A) + vol(B) ,

establishing the claim. 

Proof of Theorem 19.1.3: Let A1 ⊆ A2 ⊆ · · · ⊆ Ai ⊆ · · · be a sequence of finite brick sets, such that
i Ai = A, and similarly let B1 ⊆ B2 ⊆ · · · ⊆ Bi ⊆ · · · be a sequence of finite brick sets, such that i Bi =
Ð Ð
B. By the definition of volume ,we have that limi→∞ vol(Ai ) = vol(A) and limi→∞ vol(Bi ) = vol(B).
­

We claim that limi→∞ vol(Ai + Bi ) = vol(A + B). Indeed, consider any point z ∈ A + B, and let u ∈ A
and v ∈ B be such that u + v = z. By definition, there exists an i, such that for all j > i we have u ∈ A j ,
v ∈ B j , and as such z ∈ A j + B j . Thus, A + B ⊆ ∪ j (A j + B j ) and ∪ j (A j + B j ) ⊆ ∪ j (A + B) ⊆ A + B; namely,
∪ j (A j + B j ) = A + B.
Furthermore, for any i > 0, since Ai and Bi are brick sets, we have

vol(Ai + Bi )1/n ≥ vol(Ai )1/n + vol(Bi )1/n,

by Lemma 19.1.5. Thus,


  1/n  
vol A + B = lim vol(Ai + Bi )1/n ≥ lim vol(Ai )1/n + vol(Bi )1/n
i→∞ i→∞
1/n 1/n
= vol(A) + vol(B) .


Theorem 19.1.6 (Brunn-Minkowski for slice volumes.). Let P be a convex set in Rn+1 , and let
A = P ∩ (x1 = a), B = P ∩ (x1 = b) and C = P ∩ (x1 = c) be three slices of A, for a < b < c. We have
vol(B) ≥ min(vol(A), vol(C)). Specifically, consider the function

 1/n
!

v(t) = vol P ∩ (x1 = t) ,

and let I = tmin, tmax be the interval where the hyperplane x1 = t intersects P. Then, v(t) is concave on
 

I.
­ This is the standard definition in measure theory of volume. The reader unfamiliar with this fanfare can either consult

a standard text on the topic, or take it for granted as this is intuitively clear.

133
Proof: If a or c are outside I, then vol(A) = 0 or vol(C) = 0, respectively, and then the claim trivially
holds.
Otherwise, let α = (b − a)/(c − a). We have that b = (1 − α) · a + α · c, and by the convexity of P, we
have (1 − α)A + αC ⊆ B. Thus, by Theorem 19.1.3 we have

v(b) = vol(B)1/n ≥ vol((1 − α)A + αC)1/n ≥ vol((1 − α)A)1/n + vol(αC)1/n


= ((1 − α)n vol(A))1/n + (αn vol(C))1/n
= (1 − α) · vol(A)1/n + α · vol(C)1/n
= (1 − α)v(a) + αv(c).

Namely, v(·) is concave on I, and in particular v(b) ≥ min(v(a), v(c)), which in turn implies that vol(B) =
v(b)n ≥ (min(v(a), v(c)))n = min(vol(A), vol(C)), as claimed. 

Corollary 19.1.7. For A and B compact sets in Rn , the following holds vol((A + B)/2) ≥
p
vol(A)vol(B).

Proof: We have that


  1/n   1/n   1/n   1/n  
1/n 1/n
vol (A + B)/2 = vol A/2 + B/2 ≥ vol A/2 + vol B/2 = vol(A) + vol(B) /2
q
≥ vol(A)1/n vol(B)1/n

by Theorem 19.1.3, and since (a + b)/2 ≥ ab for any a, b ≥ 0. The claim now follows by raising this
inequality to the power n. 

19.1.1. The Isoperimetric Inequality


The following is not used anywhere else and is provided because of its mathematical elegance. The
skip-able reader can thus employ their special gift and move on to Section 19.2.
The isoperimetric inequality states that among all convex bodies of a fixed surface area, the ball
has the largest volume (in particular, the unit circle is the largest area planar region with perimeter 2π).
This problem can be traced back to antiquity, in particular Zenodorus (200–140 BC) wrote a monograph
(which was lost) that seemed to have proved the claim in the plane for some special cases. The first
formal proof for the planar case was done by Steiner in 1841. Interestingly, the more general claim is
an easy consequence of the Brunn-Minkowski inequality.
Let K be a convex body in Rn and b be the n dimensional ball of radius one centered at the origin.
Let S(X) denote the surface area of a compact set X ⊆ Rn . The isoperimetric inequality states that
  1/n   1/(n−1)
vol(K) S(K)
≤ . (19.1)
vol(b) S(b)

Namely, the left side is the radius of a ball having the same volume as K, and the right side is the
radius of a sphere having the same surface area as K. In particular, if we scale K so that its surface area
is the same as b, then the above inequality implies that vol(K) ≤ vol(b).

134
To prove Eq. (19.1), observe that vol(b) = S(b)/n® . Also, observe that
K + ε b is the body K together with a small “atmosphere” around it of thickness
ε. In particular, the volume of this “atmosphere” is (roughly) ε S(K) (in fact,
Minkowski defined the surface area of a convex body to be the limit stated next).
Formally, we have
vol(K + ε b) − vol(K)
S(K) = lim
ε→0+
 ε n
vol(K) + vol(ε b)1/n − vol(K)
1/n

≥ lim ,
ε→0+ ε
by the Brunn-Minkowski inequality. Now vol(ε b)1/n = εvol(b)1/n , and as such

vol(K) + 1n εvol(K)(n−1)/n vol(b)1/n + 2n ε 2 h· · ·i + · · · + ε n vol(b) − vol(K)


 
S(K)≥ lim
ε→0+ ε
(n−1)/n 1/n
nεvol(K) vol(b)
= lim = nvol(K)(n−1)/n vol(b)1/n .
ε→0+ ε
Dividing both sides by S(b) = nvol(b), we have
 1/(n−1)  1/n
S(K) vol(K)(n−1)/n
 
S(K) vol(K)
≥ =⇒ ≥ ,
S(b) vol(b)(n−1)/n S(b) vol(b)

establishing the isoperimetric inequality.

19.2. Measure Concentration on the Sphere


Let S(n−1) be the unit sphere in Rn . We assume there is a uniform
probability measure defined over S(n−1) , such that its total measure is 1.
Surprisingly, most of the mass of this measure is near the equator. Indeed,
consider an arbitrary equator π on S(n−1) (that it, it is the intersection of
the sphere with a hyperplane passing through the center of ball inducing
the sphere). Next, consider all the points that are in distance ≈ `(n) = T
c/n1/3 from π. The question we are interested in is what fraction of the π
sphere is covered by this strip T (depicted on the right).
Notice, that as the dimension increases the width `(n) of this strip
decreases. But surprisingly, despite its width becoming smaller, as the dimension increases, this strip
contains a larger and larger fraction of the sphere. In particular, the total fraction of the sphere not
covered by this (shrinking!) strip converges to zero.
Furthermore, counter intuitively, this is true for any equator. We are going to show that even a
stronger result holds: The mass of the sphere is concentrated close to the boundary of any set A ⊆ S(n−1)
such that Pr[A] = 1/2.
Before proving this somewhat surprising theorem, we will first try to get an intuition about the
behavior of the hypersphere in high dimensions.
∫1
® Indeed, vol(b) = r=0
S(b)r n−1 dr = S(b)/n.

135
19.2.1. The strange and curious life of the hypersphere
Consider the ball of radius r in Rn denoted by r bn , where bn is the unit radius ball centered at the
origin. Clearly, vol(r bn ) = r n vol(bn ). Now, even if r is very close to 1, the quantity r n might be very
close to zero if n is sufficiently large. Indeed, if r = 1 − δ, then r n = (1 − δ)n ≤ exp(−δn), which is very
small if δ  1/n. (Here, we used the fact that 1 − x ≤ e x , for x ≥ 0.) Namely, for the ball in high
dimensions, its mass is concentrated in a very thin shell close to its surface.
The volume of a ball and the surface  area
 of hypersphere. Let vol(rbn ) denote the volume of
the ball of radius r in Rn , and Area rS(n−1) denote the surface area of its bounding sphere (i.e., the
surface area of rS(n−1) ). It is known that

π n/2r n   2π n/2r n−1


vol(rbn ) = and Area rS(n−1) = ,
Γ(n/2 + 1) Γ(n/2)

where the gamma function, Γ(·), is an extension√of the factorial function. Specifically, if n is even then
Γ(n/2 + 1) = (n/2)!, and for n odd Γ(n/2 + 1) = π(n!!)/2(n+1)/2 , where n!! = 1 · 3 · 5 · · · n is the double
factorial. The most surprising implication of these two formulas is that, as n increases, the volume of
the unit ball first increases (till dimension 5 in fact) and then starts decreasing to zero.
Similarly, the surface area of the unit sphere S(n−1) in Rn tends to zero 2
p 1 − xn
as the dimension increases. To see this, compute the volume of the unit ball
xn 1
using an integral of its slice volume, when it is being sliced by a hyperplanes
1
perpendicular to the nth coordinate.
We have, see figure on the right, that
  ∫ 1 q  ∫ 1
 (n−1)/2
n n−1 n−1 
vol b = vol 1 − xn b
2 dxn = vol b 1 − xn2 dxn,
xn =−1 xn =−1

Now, the integral on the right side tends to zero as n increases. In fact, for n very large, the term
 (n−1)/2
1 − xn2 is very close to 0 everywhere except for a small interval around 0. This implies that the
main contribution of the volume of the ball happens when we consider slices of the ball by hyperplanes
of the form xn = δ, where δ is small.
If one has to visualize how such a ball in high dimensions looks like, it might be best to think about
it as a star-like creature: It has very little mass close to the tips of any set of orthogonal directions we
pick, and most of its mass somehow lies on hyperplanes close to its center.¯

19.2.2. Measure Concentration on the Sphere


Theorem 19.2.1 (Measure concentration on the sphere.). Let A ⊆ S(n−1) be a measurable set
with Pr[A] ≥ 1/2, and let At denote the set of points of S(n−1) in distance at most t from A, where
t ≤ 2. Then 1 − Pr[At ] ≤ 2 exp −nt 2 /2 .

Proof: We will prove a slightly weaker bound, with −nt 2 /4 in the exponent. Let A
b = T(A), where

T(X) = αx x ∈ X, α ∈ [0, 1] ⊆ bn,




¯ In short, it looks like a Boojum [Car76].

136
b+B
A b

A A A
b
A b
A

b
B b
B

B B B

b+B
A b b+B
A b

A A
b
A b
A
b B
A+ b b B
A+ b
2 2

b
B b
B

B B

t2
≤1− 8

Figure 19.1: Illustration of the proof of Theorem 19.2.1.

     
and bn is the unit ball in Rn . We have that Pr[A] = µ A b , where µ A b = vol A b /vol(bn )° .
Let B = S(n−1) \ At and B b = T(B), see Figure 19.1. We have that ka − bk ≥ t for all a ∈ A and
 
b ∈ B. By Lemma 19.2.2 below, the set A + B /2 is contained in the ball rbn centered at the origin,
b b
n
where r = 1 − t 2 /8. Observe that µ(rbn ) = vol(rbn )/vol(bn ) = r n = 1 − t 2 /8 . As such, applying the
Brunn-Minkowski inequality in the form of Corollary 19.1.7, we have
n ! r
t2 A B
   +     p
= µ rbn ≥ µ
b b p
1− ≥ µ A b µ Bb = Pr[A] Pr[B] ≥ Pr[B] /2 .
8 2

Thus, Pr[B] ≤ 2(1 − t 2 /8)2n ≤ 2 exp(−2nt 2 /8), since 1 − x ≤ exp(−x), for x ≥ 0. 

a +bb t2
a∈A b ∈ B,
b
Lemma 19.2.2. For any b b and b b we have ≤ 1− .
2 8

° This is one of these “trivial” claims that might give the reader a pause, so here is a formal proof. Pick a random
point p uniformly inside the ball bn . Let ψ be the probability that p ∈ A. b Clearly, vol A
b = ψvol(bn ). So, consider the
   
normalized point q = p/kpk. Clearly, p ∈ A b if and only if q ∈ A, by the definition of A.
b Thus, µ Ab = vol Ab /vol(bn ) = ψ =
h i
Pr p ∈ A
b = Pr[q ∈ A] = Pr[A], since q has a uniform distribution on the hypersphere by assumption.

137
a = αa and b
Proof: Let b b = βb, where a ∈ A and b ∈ B. We have
s a

{
2
r
a+b a − b t2 t2
kuk = = 12 − ≤ 1 − ≤ 1− , (19.2)


2 2 4 8 u

2
t/
since ka − bk ≥ t. As for b
a and b
b, assume that α ≤ β, and observe that the b

h
quantity b a + b is maximized when β = 1. As such, by the triangle inequality,
b o
we have
a +bb αa + b α(a + b) b
+ (1 − α)
b
= ≤
2 2 2 2
t2
 
1
≤ α 1− + (1 − α) = τ,
8 2
by Eq. (19.2) and since kbk = 1. Now, τ is a convex combination of the two numbers 1/2 and 1 − t 2 /8.
In particular, we conclude that τ ≤ max(1/2, 1 − t 2 /8) ≤ 1 − t 2 /8, since t ≤ 2. 

19.3. Concentration of Lipschitz Functions


Consider a function f : S(n−1) → R,  and imagine that we have a probability density function defined
over the sphere. Let Pr[ f ≤ t] = Pr x ∈ S n−1 f (x) ≤ t . We define the median of f , denoted by
med( f ), to be the sup t, such that Pr[ f ≤ t] ≤ 1/2.
We define Pr[ f < med( f )] = sup x<med( f ) Pr[ f ≤ x]. The following is obvious but (in fact) requires a
formal proof.
Lemma 19.3.1. We have Pr[ f < med( f )] ≤ 1/2 and Pr[ f > med( f )] ≤ 1/2.
Proof: Since k ≥1 (−∞, med( f ) − 1/k] = (−∞, med( f )), we have
Ð
 
1 1 1
Pr[ f < med( f )] = sup Pr f ≤ med( f ) − ≤ sup = .
k ≥1 k k ≥1 2 2
The second claim follows by a symmetric argument. 
Definition 19.3.2 (c-Lipschitz). A function f : A → B is c-Lipschitz if, for any x, y ∈ A, we have
k f (x) − f (y)k ≤ c k x − yk.
Theorem 19.3.3 (Lévy’s Lemma). Let f : S(n−1) → R be 1-Lipschitz. Then for all t ∈ [0, 1],
Pr[ f > med( f ) + t] ≤ 2 exp −t 2 n/2 and Pr[ f < med( f ) − t] ≤ 2 exp −t 2 n/2 .
 

Proof: We prove only the first inequality, the second follows by symmetry. Let
n o
A = x ∈ S(n−1) f (x) ≤ med( f ) .

By Lemma 19.3.1, we have Pr[A] ≥ 1/2. Consider a point x ∈ At , where At is as defined in Theo-
rem 19.2.1. Let nn(x) be the nearest point in A to x. We have by definition that k x − nn(x)k ≤ t. As
such, since f is 1-Lipschitz and nn(x) ∈ A, we have that
f (x) ≤ f (nn(x)) + knn(x) − xk ≤ med( f ) + t.
Thus, by Theorem 19.2.1, we get Pr[ f > med( f ) + t] ≤ 1 − Pr[At ] ≤ 2 exp −t 2 n/2 .



138
19.4. The Johnson-Lindenstrauss Lemma
Lemma 19.4.1. For a unit vector x ∈ S(n−1) , let
q
f (x) = x12 + x22 + · · · + x k2

be the length of the projection of x into the subspace formed by the first k coordinates. Let x be a vector
randomly chosen with uniform distribution from S(n−1) . Then f (x) is sharply concentrated. Namely,
there exists m = m(n, k) such that

Pr[ f (x) ≥ m + t] ≤ 2 exp(−t 2 n/2) andPr[ f (x) ≤ m − t] ≤ 2 exp(−t 2 n/2),


p
for any t ∈ [0, 1]. Furthermore, for k ≥ 10 ln n, we have m ≥ 12 k/n.

Proof: The orthogonal projection p : Rⁿ → R^k given by p(x₁, . . . , xₙ) = (x₁, . . . , x_k) is 1-Lipschitz (since projections can only shrink distances, see Exercise 19.6.4). As such, f(x) = ‖p(x)‖ is 1-Lipschitz, since for any x, y we have

|f(x) − f(y)| = |‖p(x)‖ − ‖p(y)‖| ≤ ‖p(x) − p(y)‖ ≤ ‖x − y‖,

by the triangle inequality and since p is 1-Lipschitz. Theorem 19.3.3 (i.e., Lévy’s lemma) gives the required tail estimate with m = med(f).
Thus, we only need to prove the lower bound on m. For a random x = (x₁, . . . , xₙ) ∈ S^(n−1), we have E[‖x‖²] = 1. By linearity of expectations, and symmetry, we have 1 = E[‖x‖²] = E[Σ_{i=1}^{n} xᵢ²] = Σ_{i=1}^{n} E[xᵢ²] = n·E[x_j²], for any 1 ≤ j ≤ n. Thus, E[x_j²] = 1/n, for j = 1, . . . , n. Thus,

E[(f(x))²] = E[Σ_{i=1}^{k} xᵢ²] = Σ_{i=1}^{k} E[xᵢ²] = k/n,

by linearity of expectation.
We next use that f is concentrated, to show that f² is also relatively concentrated. For any t ≥ 0, we have

k/n = E[f²] ≤ Pr[f ≤ m + t]·(m + t)² + Pr[f ≥ m + t]·1 ≤ 1·(m + t)² + 2 exp(−t²n/2),

since f(x) ≤ 1, for any x ∈ S^(n−1). Let t = √(k/5n). Since k ≥ 10 ln n, we have that 2 exp(−t²n/2) ≤ 2/n. We get that

k/n ≤ (m + √(k/5n))² + 2/n.

Implying that √((k − 2)/n) ≤ m + √(k/5n), which in turn implies that m ≥ √((k − 2)/n) − √(k/5n) ≥ (1/2)√(k/n).

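One can see the concentration claimed in Lemma 19.4.1 empirically with a few lines of code. The sketch below (parameter choices are illustrative, not prescribed by the lemma) samples uniform points on S^(n−1) and looks at the length of their projection onto the first k coordinates; the values cluster tightly around roughly √(k/n).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 1000, 64, 20_000

# Uniform points on S^(n-1): normalize Gaussian vectors.
x = rng.standard_normal((trials, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)

proj_len = np.linalg.norm(x[:, :k], axis=1)   # f(x) = length of the first-k projection
print("sqrt(k/n)    =", np.sqrt(k / n))
print("mean of f(x) =", proj_len.mean())
print("std  of f(x) =", proj_len.std())       # small: f is sharply concentrated
```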
Next, we would like to argue that given a fixed vector, projecting it down into a random k-dimensional
subspace results in a random vector such that its length is highly concentrated. This would imply that
we can do dimension reduction and still preserve distances between points that we care about.
To this end, we would like to flip Lemma 19.4.1 around. Instead of randomly picking a point and
projecting it down to the first k-dimensional space, we would like x to be fixed, and randomly pick the

k-dimensional subspace we project into. However, we need to pick this random k-dimensional space
carefully. Indeed, if we rotate this random subspace, by a transformation T, so that it occupies the first
k dimensions, then the point T(x) needs to be uniformly distributed on the hypersphere if we want to
use Lemma 19.4.1.
As such, we would like to randomly pick a rotation of Rn . This maps the standard orthonormal basis
into a randomly rotated orthonormal space. Taking the subspace spanned by the first k vectors of the
rotated basis results in a k-dimensional random subspace. Such a rotation is an orthonormal matrix
with determinant 1. We can generate such a matrix, by randomly picking a vector e1 ∈ S(n−1) . Next, we
set e1 as the first column of our rotation matrix, and generate the other n − 1 columns, by generating
recursively n − 1 orthonormal vectors in the space orthogonal to e1 .
Remark 19.4.2 (Generating a random point on the sphere.). At this point, the reader might wonder how
do we pick a point uniformly from the unit hypersphere. The idea is to pick a point from the
multi-dimensional normal distribution Nⁿ(0, 1), and normalizing it to have length 1. Since the multi-dimensional normal distribution has the density function

(2π)^(−n/2) exp(−(x₁² + x₂² + · · · + xₙ²)/2),

which is symmetric (i.e., all the points at distance r from the origin have the same density), it follows that this indeed generates a point randomly and uniformly on S^(n−1).
Generating a vector with multi-dimensional normal distribution is no more than picking each coordinate according to the one-dimensional normal distribution; see Lemma 19.7.1p143. Given a source of random numbers according to the uniform distribution, this can be done using O(1) computations per coordinate, using the Box-Muller transformation [BM58]. Overall, each random vector can be generated in O(n) time.
Since projecting the n-dimensional normal distribution down to a lower-dimensional space yields a normal distribution, it follows that generating a random projection is no more than randomly picking n vectors v₁, . . . , vₙ according to the multidimensional normal distribution. Then, we orthonormalize them using Gram-Schmidt, where v̂₁ = v₁/‖v₁‖, and v̂ᵢ is the normalized vector of vᵢ − wᵢ, where wᵢ is the projection of vᵢ to the space spanned by v₁, . . . , vᵢ₋₁.
Taking those vectors as the columns of a matrix generates a matrix A with determinant either 1 or −1. We multiply one of the vectors by −1 if the determinant is −1. The resulting matrix is a random rotation matrix.
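Below is a minimal sketch of the two constructions described in this remark: a uniform point on S^(n−1) via a normalized Gaussian vector, and a random rotation matrix via orthonormalization of Gaussian vectors (here a QR factorization plays the role of Gram-Schmidt, with a sign fix so the determinant is +1). The function names are illustrative only.

```python
import numpy as np

def random_sphere_point(n, rng):
    # Normalize an N^n(0,1) sample; by symmetry the result is uniform on S^(n-1).
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

def random_rotation(n, rng):
    # Orthonormalize n Gaussian vectors (QR is Gram-Schmidt in disguise).
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))   # fix column signs so the distribution is well defined
    if np.linalg.det(Q) < 0:      # flip one column if the determinant is -1
        Q[:, 0] = -Q[:, 0]
    return Q

rng = np.random.default_rng(2)
M = random_rotation(4, rng)
print(np.allclose(M @ M.T, np.eye(4)), round(float(np.linalg.det(M)), 6))
```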

We can now restate Lemma 19.4.1 in the setting where the vector is fixed and the projection is into
a random subspace.
Lemma 19.4.3. Let x ∈ S^(n−1) be an arbitrary unit vector, and consider a random k-dimensional subspace F, and let f(x) be the length of the projection of x into F. Then, there exists m = m(n, k) such that

Pr[f(x) ≥ m + t] ≤ 2 exp(−t²n/2)   and   Pr[f(x) ≤ m − t] ≤ 2 exp(−t²n/2),

for any t ∈ [0, 1]. Furthermore, for k ≥ 10 ln n, we have m ≥ (1/2)√(k/n).

Proof: Let vᵢ be the ith standard orthonormal vector, having 1 at the ith coordinate. Let M be a random rotation of space generated as described above. Clearly, for an arbitrary fixed unit vector x, the vector Mx is distributed uniformly on the sphere. Now, let eᵢ = Mᵀvᵢ (that is, eᵢ is the ith row of M written as a vector); since Mᵀ is also a random rotation, e₁, . . . , eₙ form a random orthonormal basis. As such, we have

⟨Mx, vᵢ⟩ = (Mx)ᵀvᵢ = xᵀMᵀvᵢ = xᵀeᵢ = ⟨x, eᵢ⟩.

In particular, treating Mx as a random vector, and projecting it on the first k coordinates, we have that

f(x) = √( Σ_{i=1}^{k} ⟨Mx, vᵢ⟩² ) = √( Σ_{i=1}^{k} ⟨x, eᵢ⟩² ).

But e1, . . . , e k is just an orthonormal basis of a random k-dimensional subspace. As such, the expression
on the right is the length of the projection of x into a k-dimensional random subspace. As such, the
length of the projection of x into a random k-dimensional subspace has exactly the same distribution
as the length of the projection of a random vector into the first k coordinates. The claim now follows
by Lemma 19.4.1. 

Definition 19.4.4. The mapping f : Rⁿ → R^k is called K-bi-Lipschitz for a subset X ⊆ Rⁿ if there exists a constant c > 0 such that

cK⁻¹ · ‖p − q‖ ≤ ‖f(p) − f(q)‖ ≤ c · ‖p − q‖,

for all p, q ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f , and is denoted dist( f ). We
will refer to f as a K-embedding of X.

Remark 19.4.5. Let X ⊆ Rm be a set of n points, where m potentially might be much larger than n.
Observe, that in this case, since we only care about the inter-point distances of points in X, we can
consider X to be a set of points lying in the affine subspace F spanned by the points of X. Note, that
this subspace has dimension n − 1. As such, each point of X can be interpreted as an (n − 1)-dimensional point in
F. Namely, we can assume, for our purposes, that the set of n points in Euclidean space we care about
lies in Rn (in fact, Rn−1 ).
Note, that if m < n we can always pad all the coordinates of the points of X by zeros, such that the
resulting point set lies in Rn .

Theorem 19.4.6 (Johnson-Lindenstrauss lemma.). Let X be an n-point set in a Euclidean space,


and let ε ∈ (0, 1] be given. Then there exists a (1 + ε)-embedding of X into R k , where k = O(ε −2 log n).

Proof: By Remark 19.4.5, we can assume that X ⊆ Rⁿ. Let k = 200ε⁻² ln n. Assume k < n, and let F be a random k-dimensional linear subspace of Rⁿ. Let PF : Rⁿ → F be the orthogonal projection operator of Rⁿ into F. Let m be the number around which ‖PF(x)‖ is concentrated, for x ∈ S^(n−1), as in Lemma 19.4.3.
Fix two points x, y ∈ Rⁿ; we prove that

(1 − ε/3) m ‖x − y‖ ≤ ‖PF(x) − PF(y)‖ ≤ (1 + ε/3) m ‖x − y‖

holds with probability ≥ 1 − n⁻². Since there are n(n − 1)/2 < n² pairs of points in X, it follows that with constant probability (say > 1/3) this holds for all pairs of points of X. In such a case, the mapping PF is a D-embedding of X into R^k with D ≤ (1 + ε/3)/(1 − ε/3) ≤ 1 + ε, for ε ≤ 1.
Let u = x − y; we have PF(u) = PF(x) − PF(y) since PF(·) is a linear operator. Thus, the condition becomes

(1 − ε/3) m ‖u‖ ≤ ‖PF(u)‖ ≤ (1 + ε/3) m ‖u‖.

Again, since projection is a linear operator, for any α > 0, the condition is equivalent to

(1 − ε/3) m ‖αu‖ ≤ ‖PF(αu)‖ ≤ (1 + ε/3) m ‖αu‖.
As such, we can assume that ‖u‖ = 1 by picking α = 1/‖u‖. Namely, we need to show that

|‖PF(u)‖ − m| ≤ (ε/3) m.

Let f(u) = ‖PF(u)‖. By Lemma 19.4.3 (i.e., Lemma 19.4.1 with the roles of the random subspace and the random vector exchanged), for t = εm/3, we have that the probability that this does not hold is bounded by

Pr[|f(u) − m| ≥ t] ≤ 4 exp(−t²n/2) = 4 exp(−ε²m²n/18) ≤ 4 exp(−ε²k/72) < n⁻²,

since m ≥ (1/2)√(k/n) and k = 200ε⁻² ln n.
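The proof suggests a very simple procedure: project the points onto a random k-dimensional subspace and rescale. The sketch below measures the worst pairwise distortion for a random point set; the parameters, names, and the specific rescaling constant are illustrative choices, not the constants of the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
n_points, dim, eps = 50, 2000, 0.5
k = int(np.ceil(8 * np.log(n_points) / eps**2))   # illustrative target dimension

P = rng.standard_normal((n_points, dim))

# Random k-dimensional projection: orthonormal columns of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
proj = P @ Q * np.sqrt(dim / k)   # rescale so lengths are preserved in expectation

worst = 1.0
for i in range(n_points):
    for j in range(i + 1, n_points):
        d_orig = np.linalg.norm(P[i] - P[j])
        d_proj = np.linalg.norm(proj[i] - proj[j])
        worst = max(worst, d_proj / d_orig, d_orig / d_proj)
print("worst pairwise distortion:", worst)   # typically close to 1 + eps
```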

19.5. Bibliographical notes


Our presentation follows Matoušek [Mat02]. The Brunn-Minkowski inequality is a powerful inequality
which is widely used in mathematics. A nice survey of this inequality and its applications is provided
by Gardner [Gar02]. Gardner says: “In a sea of mathematics, the Brunn-Minkowski inequality appears
like an octopus, tentacles reaching far and wide, its shape and color changing as it roams from one area
to the next.” However, Gardner is careful in claiming that the Brunn-Minkowski inequality is one of the
most powerful inequalities in mathematics since as a wit put it “the most powerful inequality is x² ≥ 0,
since all inequalities are in some sense equivalent to it.”
A striking application of the Brunn-Minkowski inequality is the proof that in any partial ordering of
n elements, there is a single comparison that knowing its result, reduces the number of linear extensions
that are consistent with the partial ordering, by a constant fraction. This immediately implies (the
uninteresting result) that one can sort n elements in O(n log n) comparisons. More interestingly, it
implies that if there are m linear extensions of the current partial ordering, we can always sort it using
O(log m) comparisons. A nice exposition of this surprising result is provided by Matoušek [Mat02,
Section 12.3].
There are several alternative proofs of the JL lemma, see [IM98] and [DG03]. Interestingly, it is
enough to pick each entry in the dimension reducing matrix randomly out of {−1, 0, 1}. This requires
a more involved proof [Ach01]. This is useful when one cares about storing this dimension reduction
transformation efficiently.
Magen [Mag07] observed that the JL lemma preserves angles, and in fact can be used to preserve any
“k dimensional angle”, by projecting down to dimension O(kε −2 log n). In particular, Exercise 19.6.5 is
taken from there.
In fact, the random embedding preserves much more structure than just distances between points. It
preserves the structure and distances of surfaces as long as they are low dimensional and “well behaved”,
see [AHY07] for some results in this direction.
Dimension reduction is crucial in learning, AI, databases, etc. One common technique that is being
used in practice is to do PCA (i.e., principal component analysis) and take the first few main axes.
Other techniques include independent component analysis, and MDS (multidimensional scaling). MDS
tries to embed points from high dimensions into low dimension (d = 2 or 3), while preserving some
properties. Theoretically, dimension reduction into really low dimensions is hopeless, as the distortion
in the worst case is Ω(n^(1/(k−1))), if k is the target dimension [Mat90].

19.6. Exercises
Exercise 19.6.1 (Boxes can be separated.). (Easy.) Let A and B be two axis-parallel boxes that are interior
disjoint. Prove that there is always an axis-parallel hyperplane that separates the interior of the two
boxes.

Exercise 19.6.2 (Brunn-Minkowski inequality slight extension.). Prove the following.

Corollary 19.6.3. For A and B compact sets in Rn , we have for any λ ∈ [0, 1] that
vol(λA + (1 − λ)B) ≥ vol(A)^λ · vol(B)^(1−λ).

Exercise 19.6.4 (Projections are contractions.). (Easy.) Let F be a k-dimensional affine subspace, and let
PF : R^d → F be the projection that maps every point x ∈ R^d to its nearest neighbor on F. Prove that PF is a contraction (i.e., 1-Lipschitz). Namely, for any p, q ∈ R^d, it holds that ‖PF(p) − PF(q)‖ ≤ ‖p − q‖.

Exercise 19.6.5 (JL Lemma works for angles.). Show that the Johnson-Lindenstrauss lemma also (1 ± ε)-
preserves angles among triples of points of P (you might need to increase the target dimension however
by a constant factor). [For every angle, construct an equilateral triangle whose edges are preserved by the projection (add the vertices of those triangles [conceptually] to the point set being embedded). Argue that this implies that the angle is preserved.]

19.7. Miscellaneous
Lemma 19.7.1. (A) The multidimensional normal distribution is symmetric; that is, for any two points p, q ∈ R^d such that ‖p‖ = ‖q‖ we have that g(p) = g(q), where g(·) is the density function of the multidimensional normal distribution N_d.
(B) The projection of the normal distribution on any direction is a one dimensional normal distri-
bution.
(C) Picking d variables X1, . . . , Xd using one dimensional normal distribution N results in a point
(X1, . . . , Xd ) that has multidimensional normal distribution Nd .

Chapter 20

On Complexity, Sampling, and ε-Nets and ε-Samples

“I’ve never touched the hard stuff, only smoked grass a few times with the boys to be polite, and that’s all, though ten is the age when the big guys come around teaching you all sorts of things. But happiness doesn’t mean much to me, I still think life is better. Happiness is a mean son of a bitch and needs to be put in his place. Him and me aren’t on the same team, and I’m cutting him dead. I’ve never gone in for politics, because somebody always stands to gain by it, but happiness is an even crummier racket, and there ought to be laws to put it out of business.”

Momo, Emile Ajar

In this chapter we will try to quantify the notion of geometric complexity. It is intuitively clear that a disk is a simpler shape than an ellipse, which is in turn simpler than a smiley. This becomes even more important when we consider several such shapes and how they interact with each other. As these examples might demonstrate, this notion of complexity is somewhat elusive.
To this end, we show that one can capture the structure of a distribution/point set by a small subset. The size here would depend on the complexity of the shapes/ranges we care about, but surprisingly it would be independent of the size of the point set.

20.1. VC dimension
Definition 20.1.1. A range space S is a pair (X, R), where X is a ground set (finite or infinite) and R
is a (finite or infinite) family of subsets of X. The elements of X are points and the elements of R are
ranges.

Our interest is in the size/weight of the ranges in the range space. For technical reasons, it will be
easier to consider a finite subset x as the underlying ground set.
Definition 20.1.2. Let S = (X, R) be a range space, and let x be a finite (fixed) subset of X. For a range r ∈ R, its measure is the quantity

m(r) = |r ∩ x| / |x|.

While x is finite, it might be very large. As such, we are interested in getting a good estimate to
m(r) by using a more compact set to represent the range space.
Definition 20.1.3. Let S = (X, R) be a range space. For a subset N (which might be a multi-set) of x, its
estimate of the measure of m(r), for r ∈ R, is the quantity
s(r) = |r ∩ N| / |N|.
The main purpose of this chapter is to come up with methods to generate a sample N, such that
m(r) ≈ s(r), for all the ranges r ∈ R.
It is easy to see that in the worst case, no sample can capture the measure of all ranges. Indeed,
given a sample N, consider the range x \ N that is being completely missed by N. As such, we need
to concentrate on range spaces that are “low dimensional”, where not all subsets are allowable ranges.
The notion of VC dimension (named after Vapnik and Chervonenkis [VC71]) is one way to limit the
complexity of a range space.
Definition 20.1.4. Let S = (X, R) be a range space. For Y ⊆ X, let

R |Y = { r ∩ Y | r ∈ R }     (20.1)

denote the projection of R on Y. The range space S projected to Y is S|Y = (Y, R |Y).
If R |Y contains all subsets of Y (i.e., if Y is finite, we have |R |Y| = 2^|Y|), then Y is shattered by R (or
equivalently Y is shattered by S).
The Vapnik-Chervonenkis dimension (or VC dimension) of S, denoted by dimVC (S), is the
maximum cardinality of a shattered subset of X. If there are arbitrarily large shattered subsets, then
dimVC (S) = ∞.

20.1.1. Examples

Intervals. Consider the set X to be the real line, and consider R to be the set of all intervals on the real line. Consider the set Y = {1, 2}. Clearly, one can find four intervals that contain all possible subsets of Y. Formally, the projection R |Y = {{ }, {1}, {2}, {1, 2}}. The intervals realizing each of these subsets are depicted on the right.
However, this is false for a set of three points B = {p, q, r}, since there is no interval that can contain the two extreme points p and r without also containing q. Namely, the subset {p, r} is not realizable for intervals, implying that the largest shattered set by the range space (real line, intervals) is of size two. We conclude that the VC dimension of this space is two.
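Such small claims can also be verified mechanically. The brute-force check below (a sketch with hypothetical helper names) tests whether a given point set on the real line is shattered by intervals, by enumerating all subsets realizable by intervals.

```python
from itertools import combinations

def shattered_by_intervals(points):
    """Check whether every subset of `points` equals `points` ∩ [a, b] for some interval."""
    pts = sorted(points)
    realizable = {frozenset()}                        # the empty interval
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            realizable.add(frozenset(pts[i:j + 1]))   # the interval [pts[i], pts[j]]
    for r in range(len(pts) + 1):
        for subset in combinations(pts, r):
            if frozenset(subset) not in realizable:
                return False
    return True

print(shattered_by_intervals([1, 2]))      # True  -> VC dimension >= 2
print(shattered_by_intervals([1, 2, 3]))   # False -> {1, 3} cannot be realized
```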
Disks. Let X = R², and let R be the set of disks in the plane. Clearly, for any three points in the plane (in general position), denoted by p, q, and r, one can find eight disks that realize all possible 2³ different subsets. See the figure on the right.
But can disks shatter a set with four points? Consider such a set P of four points. If the convex hull of P has only three points on its boundary, then the subset X having only those three vertices (i.e., it does not include the middle point) is impossible, by convexity. Namely, there is no disk that contains only the points of X without the middle point.

Alternatively, if all four points are vertices of the convex hull and they are a, b, c, d along the boundary of the convex hull, either the set {a, c} or the set {b, d} is not realizable. Indeed, if both options are realizable, then consider the two disks D₁ and D₂ that realize those assignments. Clearly, ∂D₁ and ∂D₂ must intersect in four points, but this is not possible, since two circles have at most two intersection points. See the figure on the left. Hence the VC dimension of this range space is 3.

Convex sets. Consider the range space S = (R2, R), where R is the set of all
(closed) convex sets in the plane. We claim that dimVC (S) = ∞. Indeed, consider
a set U of n points p₁, . . . , pₙ all lying on the boundary of the unit circle in the
plane. Let V be any subset of U, and consider the convex hull CH (V). Clearly,
CH (V) ∈ R, and furthermore, CH (V) ∩ U = V. Namely, any subset of U is
realizable by S. Thus, S can shatter sets of arbitrary size, and its VC dimension is unbounded.
Complement. Consider the range space S = (X, R) with δ = dimVC(S). Next, consider the complement space, S̄ = (X, R̄), where

R̄ = { X \ r | r ∈ R };

namely, the ranges of S̄ are the complements of the ranges in S. What is the VC dimension of S̄? Well, a set B ⊆ X is shattered by S̄ if and only if it is shattered by S. Indeed, if S shatters B, then for any Z ⊆ B, we have that (B \ Z) ∈ R |B, which implies that Z = B \ (B \ Z) ∈ R̄ |B. Namely, R̄ |B contains all the subsets of B, and S̄ shatters B. Thus, dimVC(S̄) = dimVC(S).
 
Lemma 20.1.5. For a range space S = (X, R) we have that dimVC(S) = dimVC(S̄), where S̄ is the complement range space.

20.1.1.1. Halfspaces
Let S = (X, R), where X = Rd and R is the set of all (closed) halfspaces in Rd . We need the following
technical claim.

Claim 20.1.6. Let P = {p1, . . . , pd+2 } be a set of d+2 points in Rd . There are real numbers β1, . . . , βd+2 ,
not all of them zero, such that Σᵢ βᵢpᵢ = 0 and Σᵢ βᵢ = 0.

Proof: Indeed, set qᵢ = (pᵢ, 1), for i = 1, . . . , d + 2. Now, the points q₁, . . . , q_{d+2} ∈ R^{d+1} are linearly dependent, and there are coefficients β₁, . . . , β_{d+2}, not all of them zero, such that Σ_{i=1}^{d+2} βᵢqᵢ = 0. Considering only the first d coordinates of these points implies that Σ_{i=1}^{d+2} βᵢpᵢ = 0. Similarly, by considering only the (d + 1)st coordinate of these points, we have that Σ_{i=1}^{d+2} βᵢ = 0.

To see what the VC dimension of halfspaces in Rd is, we need the following result of Radon. (For a
reminder of the formal definition of convex hulls, see Definition 32.1.1p253 .)

Theorem 20.1.7 (Radon’s theorem). Let P = {p1, . . . , pd+2 } be a set of d + 2 points in Rd . Then,
there exist two disjoint subsets C and D of P, such that CH(C) ∩ CH(D) ≠ ∅ and C ∪ D = P.

Proof: By Claim 20.1.6 there are real numbers β₁, . . . , β_{d+2}, not all of them zero, such that Σᵢ βᵢpᵢ = 0 and Σᵢ βᵢ = 0.

Assume, for the sake of simplicity of exposition, that β₁, . . . , β_k ≥ 0 and β_{k+1}, . . . , β_{d+2} < 0. Furthermore, let µ = Σ_{i=1}^{k} βᵢ = −Σ_{i=k+1}^{d+2} βᵢ. We have that

Σ_{i=1}^{k} βᵢpᵢ = −Σ_{i=k+1}^{d+2} βᵢpᵢ.

In particular, v = Σ_{i=1}^{k} (βᵢ/µ)pᵢ is a point in CH({p₁, . . . , p_k}). Furthermore, for the same point v we have v = Σ_{i=k+1}^{d+2} −(βᵢ/µ)pᵢ ∈ CH({p_{k+1}, . . . , p_{d+2}}). We conclude that v is in the intersection of the two convex hulls, as required.
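The proof is constructive and translates directly into a few lines of linear algebra: solve for the coefficients β of Claim 20.1.6 and split the points by the sign of βᵢ. The following sketch computes a Radon partition for d + 2 points in R^d; the function name and the example input are illustrative.

```python
import numpy as np

def radon_partition(points):
    """points: (d+2) x d array. Returns (C, D, radon_point) as in Radon's theorem."""
    P = np.asarray(points, dtype=float)
    m, d = P.shape                        # m = d + 2
    # Find beta != 0 with sum_i beta_i * (p_i, 1) = 0, i.e. a null vector of Q^T.
    Q = np.hstack([P, np.ones((m, 1))])   # lift: q_i = (p_i, 1)
    _, _, Vt = np.linalg.svd(Q.T)
    beta = Vt[-1]                         # null-space vector (rank of Q^T is <= d+1 < m)
    pos = beta > 0
    mu = beta[pos].sum()
    radon_point = (beta[pos, None] * P[pos]).sum(axis=0) / mu
    return P[pos], P[~pos], radon_point

C, D, v = radon_partition([[0, 0], [4, 0], [0, 4], [1, 1]])
print("Radon point:", v)   # lies in CH(C) ∩ CH(D)
```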
The following is a trivial observation, and yet we provide a proof to demonstrate it is true.
Lemma 20.1.8. Let P ⊆ Rd be a finite set, let r be any point in CH (P), and let h+ be a halfspace of
Rd containing r. Then there exists a point of P contained inside h+ .
Proof: The halfspace h⁺ can be written as h⁺ = { t ∈ R^d | ⟨t, v⟩ ≤ c }. Now r ∈ CH(P) ∩ h⁺, and as such there are numbers α₁, . . . , α_m ≥ 0 and points p₁, . . . , p_m ∈ P, such that Σᵢ αᵢ = 1 and Σᵢ αᵢpᵢ = r. By the linearity of the dot product, we have that

r ∈ h⁺  ⟹  ⟨r, v⟩ ≤ c  ⟹  ⟨Σ_{i=1}^{m} αᵢpᵢ, v⟩ ≤ c  ⟹  β = Σ_{i=1}^{m} αᵢ⟨pᵢ, v⟩ ≤ c.

Setting βᵢ = ⟨pᵢ, v⟩, for i = 1, . . . , m, the above implies that β is a weighted average of β₁, . . . , β_m. In particular, there must be a βᵢ that is no larger than the average. That is, βᵢ ≤ c. This implies that ⟨pᵢ, v⟩ ≤ c. Namely, pᵢ ∈ h⁺ as claimed.
Let S be the range space having R^d as the ground set and all the closed halfspaces as ranges. Radon’s theorem implies that if a set Q of d + 2 points is being shattered by S, then we can partition this set Q into two disjoint sets Y and Z such that CH(Y) ∩ CH(Z) ≠ ∅. In particular, let r be a point in CH(Y) ∩ CH(Z). If a halfspace h⁺ contains all the points of Y, then CH(Y) ⊆ h⁺, since a halfspace is a convex set. Thus, any halfspace h⁺ containing all the points of Y will contain the point r ∈ CH(Y). But r ∈ CH(Z) ∩ h⁺, and this implies that a point of Z must lie in h⁺, by Lemma 20.1.8. Namely, the subset Y ⊆ Q cannot be realized by a halfspace, which implies that Q cannot be shattered. Thus dimVC(S) < d + 2. It is also easy to verify that the regular simplex with d + 1 vertices is shattered by S. Thus, dimVC(S) = d + 1.

20.2. Shattering dimension and the dual shattering dimension


The main property of a range space with bounded VC dimension is that the number of ranges for a set
of n elements grows polynomially in n (with the power being the dimension) instead of exponentially.
Formally, let the growth function be

G_δ(n) = Σ_{i=0}^{δ} (n choose i) ≤ Σ_{i=0}^{δ} nⁱ/i! ≤ n^δ,     (20.2)
for δ > 1 (the cases where δ = 0 or δ = 1 are not interesting and we will just ignore them). Note that
for all n, δ ≥ 1, we have Gδ (n) = Gδ (n − 1) + Gδ−1 (n − 1)¬ .
¬ Here is a cute (and standard) counting argument: G_δ(n) is just the number of different subsets of size at most δ out of n elements. Now, we either decide to not include the first element in these subsets (i.e., G_δ(n − 1)) or, alternatively, we include the first element in these subsets, but then there are only δ − 1 elements left to pick (i.e., G_{δ−1}(n − 1)).

Lemma 20.2.1 (Sauer’s lemma). If (X, R) is a range space of VC dimension δ with |X| = n, then
|R| ≤ Gδ (n).

Proof: The claim trivially holds for δ = 0 or n = 0.


Let x be any element of X, and consider the sets

R_x = { r \ {x} | r ∪ {x} ∈ R and r \ {x} ∈ R }   and   R \ x = { r \ {x} | r ∈ R }.
Observe that |R| = |R x | + |R \ x|. Indeed, we charge the elements of R to their corresponding element in
R \ x. The only bad case is when there is a range r such that both r ∪ {x} ∈ R and r \ {x} ∈ R, because
then these two distinct ranges get mapped to the same range in R \ x. But such ranges contribute
exactly one element to R x . Similarly, every element of R x corresponds to two such “twin” ranges in R.
Observe that (X \ {x} , R x ) has VC dimension δ − 1, as the largest set that can be shattered is of size
δ − 1. Indeed, any set B ⊂ X \ {x} shattered by R x implies that B ∪ {x} is shattered in R.
Thus, we have

|R| = |R x | + |R \ x| ≤ Gδ−1 (n − 1) + Gδ (n − 1) = Gδ (n),

by induction. 

Interestingly, Lemma 20.2.1 is tight. See Exercise 20.8.4.


Next, we show pretty tight bounds on Gδ (n). The proof is technical and not very interesting, and it
is delegated to Section 20.6.
Lemma 20.2.2. For n ≥ 2δ and δ ≥ 1, we have (n/δ)^δ ≤ G_δ(n) ≤ 2(ne/δ)^δ, where G_δ(n) = Σ_{i=0}^{δ} (n choose i).
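The growth function and the bounds of Lemma 20.2.2 are easy to tabulate, which gives a handy sanity check; the snippet below is an illustrative sketch only.

```python
from math import comb, e

def growth(delta, n):
    # G_delta(n) = number of subsets of size at most delta out of n elements.
    return sum(comb(n, i) for i in range(delta + 1))

for delta, n in [(2, 10), (3, 20), (5, 100)]:
    g = growth(delta, n)
    lower = (n / delta) ** delta
    upper = 2 * (n * e / delta) ** delta
    print(delta, n, lower <= g <= upper, g)
```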

Definition 20.2.3 (Shatter function). Given a range space S = (X, R), its shatter function πS (m) is the
maximum number of sets that might be created by S when restricted to subsets of size m. Formally,

πS(m) = max_{B ⊂ X, |B| = m} |R |B| ;

see Eq. (20.1).


The shattering dimension of S is the smallest d such that πS (m) = O(m d ), for all m.

By applying Lemma 20.2.1 to a finite subset of X, we get:


Corollary 20.2.4. If S = (X, R) is a range space of VC dimension δ, then for every finite subset B of X, we have |R |B| ≤ πS(|B|) ≤ G_δ(|B|). That is, the VC dimension of a range space always bounds its shattering dimension.

Proof: Let n = |B|, and observe that |R |B| ≤ G_δ(n) ≤ n^δ, by Eq. (20.2). As such, |R |B| ≤ n^δ, and, by definition, the shattering dimension of S is at most δ; namely, the shattering dimension is bounded by the VC dimension.

Our arch-nemesis in the following is the function x/ln x. The following lemma states some properties
of this function, and its proof is delegated to Exercise 20.8.2.
Lemma 20.2.5. For the function f (x) = x/ln x the following hold.

(A) f (x) is monotonically increasing for x ≥ e.
(B) f (x) ≥ e, for x > 1.
(C) For u ≥ √e, if f (x) ≤ u, then x ≤ 2u ln u.
(D) For u ≥ e, if x > 2u ln u, then f (x) > u.
(E) For u ≥ e, if f (x) ≥ u, then x ≥ u ln u.

The next lemma introduces a standard argument which is useful in bounding the VC dimension of a
range space by its shattering dimension. It is easy to see that the bound is tight in the worst case.

Lemma 20.2.6. If S = (X, R) is a range space with shattering dimension d, then its VC dimension is
bounded by O(d log d).

Proof: Let N ⊆ X be the largest set shattered by S, and let δ denote its cardinality. We have that 2^δ = |R |N| ≤ πS(|N|) ≤ cδ^d, where c is a fixed constant. As such, we have that δ ≤ lg c + d lg δ, which in turn implies that (δ − lg c)/lg δ ≤ d.­ Assuming δ ≥ max(2, 2 lg c), we have that

δ/(2 lg δ) ≤ d   ⟹   δ/ln δ ≤ 2d/ln 2 ≤ 6d   ⟹   δ ≤ 2(6d) ln(6d),

by Lemma 20.2.5(C).

Disks revisited. To see why the shattering dimension is more convenient to work with than the VC
dimension, consider the range space S = (X, R), where X = R2 and R is the set of disks in the plane. We
know that the VC dimension of S is 3 (see Section 20.1.1).
We next use a standard continuous deformation argument to argue that the shattering dimension of
this range space is also 3.

Lemma 20.2.7. Consider the range space S = (X, R), where X = R2 and R is the set of disks in the
plane. The shattering dimension of S is 3.

Proof: Consider any set P of n points in the plane, and consider the set F = R |P. We claim that |F| ≤ 4n³.
The set F contains only n sets with a single point in them and only (n choose 2) sets with two points in them. So, fix Q ∈ F such that |Q| ≥ 3.

­ We remind the reader that lg = log2 .

There is a disk D that realizes this subset; that is, P ∩ D = Q. For the sake of simplicity of exposition, assume that P is in general position. Shrink D till its boundary passes through a point p of P.
Now, continue shrinking the new disk D′ in such a way that its boundary passes through the point p (this can be done by moving the center of D′ towards p). Continue in this continuous deformation till the new boundary hits another point q of P. Let D″ denote this disk.
Next, we continuously deform D″ so that it has both p ∈ Q and q ∈ Q on its boundary. This can be done by moving the center of D″ along the bisector line between p and q. Stop as soon as the boundary of the disk hits a third point r ∈ P. (We have freedom in choosing in which direction to move the center. As such, move in the direction that causes the disk boundary to hit a new point r.) Let D̂ be the resulting disk. The boundary of D̂ is the unique circle passing through p, q, and r. Furthermore, observe that

D ∩ (P \ {r}) = D̂ ∩ (P \ {r}).

That is, we can specify the point set P ∩ D by specifying the three points p, q, r (and thus specifying the disk D̂) and the status of the three special points; that is, we specify for each point p, q, r whether or not it is inside the generated subset.
As such, there are at most 8·(n choose 3) different subsets in F containing more than three points, as each such subset maps to a “canonical” disk, there are at most (n choose 3) different such disks, and each such disk defines at most eight different subsets.
Similar argumentation implies that there are at most 4·(n choose 2) subsets that are defined by a pair of points that realizes the diameter of the resulting disk. Overall, we have that

|F| = 1 + n + 4·(n choose 2) + 8·(n choose 3) ≤ 4n³,

since there is one empty set in F, n sets of size 1, and the rest of the sets are counted as described above.
The proof of Lemma 20.2.7 might not seem like a great simplification over the same bound we got by
arguing about the VC dimension. However, the above argumentation gives us a very powerful tool – the
shattering dimension of a range space defined by a family of shapes is always bounded by the number
of points that determine a shape in the family.
Thus, the shattering dimension of, say, arbitrarily oriented rectangles in the plane
is bounded by (and in this case, equal to) five, since such a rectangle is uniquely
determined by five points. To see that, observe that if a rectangle has only four
points on its boundary, then there is one degree of freedom left, since we can rotate
the rectangle “around” these points; see the figure on the right.

20.2.1. The dual shattering dimension


Given a range space S = (X, R), consider a point p ∈ X. There is a set of ranges of R associated with p,
namely, the set of all ranges of R that contains p which we denote by

R_p = { r | r ∈ R, the range r contains p }.
This gives rise to a natural dual range space to S.

Definition 20.2.8. The dual range space to a range space S = (X, R) is the space S? = (R, X?), where
X? = { R_p | p ∈ X }.


[Figure 20.1: (A) R_{p1} = R_{p1′}. (B) Writing the set system as an incidence matrix where a point is a column and a set is a row. For example, D2 contains p4, and as such the column of p4 has a 1 in the row corresponding to D2. (C) The dual set system is represented by a matrix which is the transpose of the original incidence matrix.]

Naturally, the dual range space to S? is the original S, which is thus sometimes referred to as the
primal range space. (In other words, the dual to the dual is the primal.) The easiest way to see
this, is to think about it as an abstract set system realized as an incidence matrix, where each point is a
column and a set is a row in the matrix having 1 in an entry if and only if it contains the corresponding
point; see Figure 20.1. Now, it is easy to verify that the dual range space is the transposed matrix.
To understand what the dual space is, consider X to be the plane and R to be a set of m disks. Then,
in the dual range space S? = (R, X?), every point p in the plane has a set associated with it in X?, which
is the set of disks of R that contains p. In particular, if we consider the arrangement formed by the m
disks of R, then all the points lying inside a single face of this arrangement correspond to the same set
of X?. The number of ranges in X? is bounded by the complexity of the arrangement of these disks,
which is O(m2 ); see Figure 20.1.
Let the dual shatter function of the range space S be π?_S(m) = π_{S?}(m), where S? is the dual range space to S.
Definition 20.2.9. The dual shattering dimension of S is the shattering dimension of the dual range
space S?.
Note that the dual shattering dimension might be smaller than the shattering dimension and hence
also smaller than the VC dimension of the range space. Indeed, in the case of disks in the plane, the
dual shattering dimension is just 2, while the VC dimension and the shattering dimension of this range
space is 3. Note, also, that in geometric settings bounding the dual shattering dimension is relatively
easy, as all you have to do is bound the complexity of the arrangement of m ranges of this space.
The following lemma shows a connection between the VC dimension of a space and its dual. The
interested reader® might find the proof amusing.
® The author is quite aware that the interest of the reader in this issue might not be the result of free choice. Neverthe-

less, one might draw some comfort from the realization that the existence of the interested reader is as much an illusion
as the existence of free choice. Both are convenient to assume, and both are probably false. Or maybe not.

Lemma 20.2.10. Consider a range space S = (X, R) with VC dimension δ. The dual range space S? = (R, X?) has VC dimension bounded by 2^(δ+1).

Proof: Assume that S? shatters a set F = {r₁, . . . , r_k} ⊆ R of k ranges. Then, there is a set P ⊆ X of m = 2^k points that shatters F. Formally, for every subset V ⊆ F, there exists a point p ∈ P, such that F_p = V.
So, consider the matrix M (of dimensions k × 2^k) having the points p₁, . . . , p_{2^k} of P as the columns, and every row is a set of F, where the entry in the matrix corresponding to a point p ∈ P and a range r ∈ F is 1 if and only if p ∈ r and zero otherwise. Since P shatters F, we know that this matrix has all possible 2^k binary vectors as columns.
Next, let κ′ = 2^⌊lg k⌋ ≤ k, and consider the matrix M′ of size κ′ × lg κ′, where the ith row is the binary representation of the number i − 1 (formally, the jth entry in the ith row is 1 if the jth bit in the binary representation of i − 1 is 1), where i = 1, . . . , κ′. See the figure on the right. Clearly, the lg κ′ columns of M′ are all different, and we can find lg κ′ columns of M that are identical to the columns of M′ (in the first κ′ entries starting from the top of the columns).
Each such column corresponds to a point p ∈ P, and let Q ⊂ P be this set of lg κ′ points. Note that for any subset Z ⊆ Q, there is a row t in M′ that encodes this subset. Consider the corresponding row in M; that is, the range r_t ∈ F. Since M and M′ are identical (in the relevant lg κ′ columns of M) on the first κ′ entries, we have that r_t ∩ Q = Z. Namely, the set of ranges F shatters Q. But since the original range space has VC dimension δ, it follows that |Q| ≤ δ. Namely, |Q| = lg κ′ = ⌊lg k⌋ ≤ δ, which implies that lg k ≤ δ + 1, which in turn implies that k ≤ 2^(δ+1).

Lemma 20.2.11. If a range space S = (X, R) has dual shattering dimension δ, then its VC dimension is bounded by δ^(O(δ)).

Proof: The shattering dimension of the dual range space S? is bounded by δ, and as such, by Lemma 20.2.6, its VC dimension is bounded by δ′ = O(δ log δ). Since the dual range space to S? is S, we have by Lemma 20.2.10 that the VC dimension of S is bounded by 2^(δ′+1) = δ^(O(δ)).


The bound of Lemma 20.2.11 might not be pretty, but it is sufficient in a lot of cases to bound the
VC dimension when the shapes involved are simple.

Example 20.2.12. Consider the range space S = (R², R), where R is a set of shapes in the plane, so that the boundary of any pair of them intersects at most s times. Then, the VC dimension of S is O(1). Indeed, the dual shattering dimension of S is O(1), since the complexity of the arrangement of n such shapes is O(sn²). As such, by Lemma 20.2.11, the VC dimension of S is O(1).

20.2.1.1. Mixing range spaces


Lemma 20.2.13. Let S = (X, R) and T = (X, R′) be two range spaces of VC dimension δ and δ′, respectively, where δ, δ′ > 1. Let R̂ = { r ∪ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that dimVC(Ŝ) = O(δ + δ′).

Proof: As a warm-up exercise, we prove a somewhat weaker bound here of O((δ + δ′) log(δ + δ′)). The stronger bound follows from Theorem 20.2.14 below. Let B be a set of n points in X that are shattered by Ŝ. There are at most G_δ(n) and G_{δ′}(n) different ranges of B in the range sets R |B and R′ |B, respectively, by Lemma 20.2.1. Every subset C of B realized by r̂ ∈ R̂ is a union of two subsets B ∩ r and B ∩ r′, where r ∈ R and r′ ∈ R′, respectively. Thus, the number of different subsets of B realized by Ŝ is bounded by G_δ(n)·G_{δ′}(n). Thus, 2ⁿ ≤ n^δ · n^{δ′}, for δ, δ′ > 1. We conclude that n ≤ (δ + δ′) lg n, which implies that n = O((δ + δ′) log(δ + δ′)), by Lemma 20.2.5(C).

Interestingly, one can prove a considerably more general result with tighter bounds. The required
computations are somewhat more painful.

Theorem 20.2.14. Let S₁ = (X, R¹), . . . , S_k = (X, R^k) be range spaces with VC dimension δ₁, . . . , δ_k, respectively. Next, let f(r₁, . . . , r_k) be a function that maps any k-tuple of sets r₁ ∈ R¹, . . . , r_k ∈ R^k into a subset of X. Consider the range set

R′ = { f(r₁, . . . , r_k) | r₁ ∈ R¹, . . . , r_k ∈ R^k }

and the associated range space T = (X, R′). Then, the VC dimension of T is bounded by O(kδ lg k), where δ = maxᵢ δᵢ.

Proof: Assume a set Y ⊆ X of size t is being shattered by R′, and observe that

|R′ |Y| ≤ |{ (r₁, . . . , r_k) | r₁ ∈ R¹|Y, . . . , r_k ∈ R^k|Y }| ≤ |R¹|Y| · · · |R^k|Y| ≤ G_{δ₁}(t)·G_{δ₂}(t) · · · G_{δ_k}(t) ≤ (G_δ(t))^k ≤ (2(te/δ)^δ)^k,

by Lemma 20.2.1 and Lemma 20.2.2. On the other hand, since Y is being shattered by R′, this implies that |R′ |Y| = 2^t. Thus, we have the inequality 2^t ≤ (2(te/δ)^δ)^k, which implies t ≤ k(1 + δ lg(te/δ)). Assume that t ≥ e and δ lg(te/δ) ≥ 1 since otherwise the claim is trivial, and observe that t ≤ k(1 + δ lg(te/δ)) ≤ 3kδ lg(t/δ). Setting x = t/δ, we have

t/δ ≤ 3k · ln(t/δ)/ln 2 ≤ 6k ln(t/δ)   ⟹   x/ln x ≤ 6k   ⟹   x ≤ 2 · 6k ln(6k)   ⟹   x ≤ 12k ln(6k),

by Lemma 20.2.5(C). We conclude that t ≤ 12δk ln(6k), as claimed.

Corollary 20.2.15. Let S = (X, R) and T = (X, R′) be two range spaces of VC dimension δ and δ′, respectively, where δ, δ′ > 1. Let R̂ = { r ∩ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that dimVC(Ŝ) = O(δ + δ′).

Corollary 20.2.16. Any finite sequence of combining range spaces with finite VC dimension (by inter-
secting, complementing, or taking their union) results in a range space with a finite VC dimension.

20.3. On ε-nets and ε-sampling
20.3.1. ε-nets and ε-samples
Definition 20.3.1 (ε-sample). Let S = (X, R) be a range space, and let x be a finite subset of X. For
0 ≤ ε ≤ 1, a subset C ⊆ x is an ε-sample for x if for any range r ∈ R, we have

| m(r) − s(r)| ≤ ε,

where m(r) = |x ∩ r| / |x| is the measure of r (see Definition 20.1.2) and s(r) = |C ∩ r| / |C| is the estimate
of r (see Definition 20.1.3). (Here C might be a multi-set, and as such |C ∩ r| is counted with multiplicity.)

As such, an ε-sample is a subset of the ground set x that “captures” the range space up to an error
of ε. Specifically, to estimate the fraction of the ground set covered by a range r, it is sufficient to count
the points of C that fall inside r.
If X is a finite set, we will abuse notation slightly and refer to C as an ε-sample for S.
To see the usage of such a sample, consider x = X to be, say, the population of a country (i.e., an
element of X is a citizen). A range in R is the set of all people in the country that answer yes to a
question (i.e., would you vote for party Y?, would you buy a bridge from me?, questions like that). An
ε-sample of this range space enables us to estimate reliably (up to an error of ε) the answers for all
these questions, by just asking the people in the sample.
The natural question of course is how to find such a subset of small (or minimal) size.

Theorem 20.3.2 (ε-sample theorem, [VC71]). There is a positive constant c such that if (X, R) is
any range space with VC dimension at most δ, x ⊆ X is a finite subset and ε, ϕ > 0, then a random
subset C ⊆ x of cardinality

s = (c/ε²)·( δ log(δ/ε) + log(1/ϕ) )

is an ε-sample for x with probability at least 1 − ϕ.

(In the above theorem, if s > |x|, then we can just take all of x to be the ε-sample.)
For a strengthened version of the above theorem with slightly better bounds is known [Har11].
Sometimes it is sufficient to have (hopefully smaller) samples with a weaker property – if a range is
“heavy”, then there is an element in our sample that is in this range.

Definition 20.3.3 (ε-net). A set N ⊆ x is an ε-net for x if for any range r ∈ R, if m(r) ≥ ε (i.e., |r ∩ x| ≥ ε|x|), then r contains at least one point of N (i.e., r ∩ N ≠ ∅).

Theorem 20.3.4 (ε-net theorem, [HW87]). Let (X, R) be a range space of VC dimension δ, let x
be a finite subset of X, and suppose that 0 < ε ≤ 1 and ϕ < 1. Let N be a set obtained by m random
independent draws from x, where
 
m ≥ max( (4/ε) lg(4/ϕ), (8δ/ε) lg(16/ε) ).     (20.3)

Then N is an ε-net for x with probability at least 1 − ϕ.

(We remind the reader that lg = log2 .)
The proofs of the above theorems are somewhat involved and we first turn our attention to some
applications before presenting the proofs.

Remark 20.3.5. The above two theorems also hold for spaces with shattering dimension at most δ, in which case the sample size is slightly larger. Specifically, for Theorem 20.3.4, the sample size needed is O( (1/ε) lg(1/ϕ) + (δ/ε) lg(δ/ε) ).

20.3.2. Some applications


We mention two (easy) applications of these theorems, which (hopefully) demonstrate their power.

20.3.2.1. Range searching


So, consider a (very large) set of points P in the plane. We would like to be able to quickly decide how
many points are included inside a query rectangle. Let us assume that we allow ourselves 1% error.
What Theorem 20.3.2 tells us is that there is a subset of constant size (that depends only on ε) that
can be used to perform this estimation, and it works for all query rectangles (we used here the fact
that rectangles in the plane have finite VC dimension). In fact, a random sample of this size works with
constant probability.
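To make this concrete, here is a small simulation; all parameters are illustrative and the sample size is not computed from the theorem’s constants. It estimates the fraction of points inside axis-parallel query rectangles from one fixed random sample, and reports the largest error over many random queries.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sample_size, queries = 100_000, 4_000, 500

P = rng.random((n, 2))                                  # the (very large) point set
S = P[rng.choice(n, size=sample_size, replace=False)]   # the random sample

worst_error = 0.0
for _ in range(queries):
    lo, hi = np.sort(rng.random((2, 2)), axis=0)        # a random query rectangle
    inside = lambda A: np.all((A >= lo) & (A <= hi), axis=1).mean()
    worst_error = max(worst_error, abs(inside(P) - inside(S)))
print("largest estimation error over all queries:", worst_error)
```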

20.3.2.2. Learning a concept
Assume that we have a function f defined in the plane that returns ‘1’
inside an (unknown) disk Dunknown and ‘0’ outside it. There is some distri-
bution D defined over the plane, and we pick points from this distribution.
Furthermore, we can compute the function for these labels (i.e., we can
compute f for certain values, but it is expensive). For a mystery value
ε > 0, to be explained shortly, Theorem 20.3.4 tells us to pick (roughly)
O((1/ε) log(1/ε)) random points in a sample R from this distribution and
to compute the labels for the samples. This is demonstrated in the figure
on the right, where black dots are the sample points for which f (·) returned 1.
So, now we have positive examples and negative examples. We would like
to find a hypothesis that agrees with all the samples we have and that hopefully
is close to the true unknown disk underlying the function f . To this end,
D
compute the smallest disk D that contains the sample labeled by ‘1’ and does
not contain any of the ‘0’ points, and let g : R2 → {0, 1} be the function g that
returns ‘1’ inside the disk and ‘0’ otherwise. We claim that g classifies correctly
all but an ε-fraction of the points (i.e., the probability of misclassifying a point
picked according to the given distribution is smaller than ε); that is, Pr_{p∈D}[f(p) ≠ g(p)] ≤ ε.

Geometrically, the region where g and f disagree is all the points in the symmetric difference between the two disks. That is, E = D ⊕ D_unknown; see the figure on the right.
Thus, consider the range space S having the plane as the ground set and the symmetric difference between any two disks as its ranges. By Corollary 20.2.16, this range space has finite VC dimension. Now, consider the (unknown) disk D_unknown that induces f and the region r = D_unknown ⊕ D. Clearly, the learned classifier g returns incorrect answers only for points picked inside r.
Thus, the probability of a mistake in the classification is the measure of r under the distribution D. So, if Pr_D[r] > ε (i.e., the probability that a sample point falls inside r), then by the ε-net theorem (i.e., Theorem 20.3.4) the set R is an ε-net for S (ignore for the time being the possibility that the random sample fails to be an ε-net) and as such, R contains a point q inside r. But, it is not possible for g (which classifies correctly all the sampled points of R) to make a mistake on q, a contradiction, because by construction, the range r is where g misclassifies points. We conclude that Pr_D[r] ≤ ε, as desired.
Little lies. The careful reader might be tearing his or her hair out because of the above description.
First, Theorem 20.3.4 might fail, and the above conclusion might not hold. This is of course true, and
in real applications one might use a much larger sample to guarantee that the probability of failure is so
small that it can be practically ignored. A more serious issue is that Theorem 20.3.4 is defined only for
finite sets. Nowhere does it speak about a continuous distribution. Intuitively, one can approximate a
continuous distribution to an arbitrary precision using a huge sample and apply the theorem to this sam-
ple as our ground set. A formal proof is more tedious and requires extending the proof of Theorem 20.3.4
to continuous distributions. This is straightforward and we will ignore this topic altogether.

20.3.2.3. A naive proof of the ε-sample theorem.


To demonstrate why the ε-sample/net theorems are interesting, let us try to prove the ε-sample theorem
in the natural naive way. Thus, consider a finite range space S = (x, R) with shattering dimension δ.
Also, consider a range r that contains, say, a p fraction of the points of x, where p ≥ ε. Consider a
random sample R of r points from x, picked with replacement.
Let pi be the ith sample point, and let Xi be an indicator variable which is one if and only if pi ∈ r.
Clearly, (Σᵢ Xᵢ)/r is an estimate for p = |r ∩ x|/|x|. We would like this estimate to be within ±ε of p and with confidence ≥ 1 − ϕ.
As such, the sample failed if |Σ_{i=1}^{r} Xᵢ − pr| ≥ εr = (ε/p)pr. Set φ = ε/p and µ = E[Σᵢ Xᵢ] = pr. Using Chernoff’s inequality (Theorem 7.3.2p65), we have

Pr[ |Σ_{i=1}^{r} Xᵢ − pr| ≥ (ε/p)pr ] = Pr[ |Σ_{i=1}^{r} Xᵢ − µ| ≥ φµ ] ≤ exp(−µφ²/2) + exp(−µφ²/4) ≤ 2 exp(−µφ²/4) = 2 exp(−(ε²/4p)·r) ≤ ϕ,

for r ≥ (4/ε²) ln(2/ϕ) ≥ (4p/ε²) ln(2/ϕ).
Voila! We proved the ε-sample theorem. Well, not quite. We proved that the sample works correctly for a single range. Namely, we proved that for a specific range r ∈ R, we have that Pr[|m(r) − s(r)| > ε] ≤ ϕ. However, we need to prove that Pr[∃r ∈ R : |m(r) − s(r)| > ε] ≤ ϕ.

Now, naively, we can overcome this by using a union bound on the bad probability. Indeed, if there
are k different ranges under consideration, then we can use a sample that is large enough such that the
probability of it to fail for each range is at most ϕ/k. In particular, let Ei be the bad event that the
sample fails for the ith range. We have that Pr[Eᵢ] ≤ ϕ/k, which implies that

Pr[sample fails for any range] ≤ Pr[ ∪_{i=1}^{k} Eᵢ ] ≤ Σ_{i=1}^{k} Pr[Eᵢ] ≤ k(ϕ/k) ≤ ϕ,

by the union bound; that is, the sample works for all ranges with good probability.
However, the number of ranges that we need to prove the theorem for is πS (|x|) (see Definition 20.2.3).
In particular, if we plug in confidence ϕ/πS (|x|) to the above analysis and use the union bound, we get
that for
 
r ≥ (4/ε²) ln( πS(|x|) / ϕ )

the sample estimates correctly (up to ±ε) the size of all ranges with confidence ≥ 1 − ϕ. Bounding πS(|x|) by O(|x|^δ) (using Eq. (20.2)p148 for a space with VC dimension δ), we can bound the required size of r by O(δε⁻² log(|x|/ϕ)). We summarize the result.


Lemma 20.3.6. Let (x, R) be a finite range space with VC dimension at most δ, and let ε, ϕ > 0 be
parameters. Then a random subset C ⊆ x of cardinality O(δε⁻² log(|x|/ϕ)) is an ε-sample for x with probability at least 1 − ϕ.

Namely, the “naive” argumentation gives us a sample bound which depends on the underlying size
of the ground set. However, the sample size in the ε-sample theorem (Theorem 20.3.2) is independent
of the size of the ground set. This is the magical property of the ε-sample theorem¯ .
Interestingly, using a chaining argument on Lemma 20.3.6, one can prove the ε-sample theorem for
the finite case; see Exercise 20.8.3. We provide a similar proof when using discrepancy, in Section 20.4.
However, the original proof uses a clever double sampling idea that is both interesting and insightful
that makes the proof work for the infinite case also.

20.3.3. A quicky proof of the ε-net theorem (Theorem 20.3.4)


Here we provide a sketchy proof of Theorem 20.3.4, which conveys the main ideas. The full proof in all
its glory and details is provided in Section 20.5.
Let N = (x1, . . . , xm ) be the sample obtained by m independent samples from x (observe that N might
contain the same element several times, and as such it is a multi-set). Let E1 be the event that N
fails to be an ε-net. Namely, for n = |x|, let

E1 = ( ∃r ∈ R : |r ∩ x| ≥ εn and r ∩ N = ∅ ).


To complete the proof, we must show that Pr[E1 ] ≤ ϕ.


Let T = (y1, . . . , ym ) be another random sample generated in a similar fashion to N. It might be that
N fails for a certain range r, but then since T is an independent sample, we still expect that |r ∩ T | = εm.
In particular, the probability Pr[|r ∩ T| ≥ εm/2] is a large constant close to 1, regardless of how N
¯ The notion of magic is used here in the sense of Arthur C. Clarke’s statement that “any sufficiently advanced

technology is indistinguishable from magic.”

performs. Indeed, if m is sufficiently large, we expect the random variable |r ∩ T | to concentrate around
εm, and one can argue this formally using Chernoff’s inequality. Namely, intuitively, for a heavy range
r we have that
Pr[r ∩ N = ∅] ≈ Pr[ r ∩ N = ∅ and |r ∩ T| ≥ εm/2 ].
Inspired by this, let E2 be the event that N fails for some range r but T “works” for r; formally
E2 = { ∃r ∈ R | |r ∩ x| ≥ εn, r ∩ N = ∅ and |r ∩ T| ≥ εm/2 }.
Intuitively, since E[|r ∩ T |] ≥ εm, then for the range r that N fails for, we have with “good” probability
that |r ∩ T | ≥ εm/2. Namely, Pr[E1 ] ≈ Pr[E2 ].
Next, let
E2′ = { ∃r ∈ R | r ∩ N = ∅ and |r ∩ T| ≥ εm/2 }.
Clearly, E2 ⊆ E2′ and as such Pr[E2] ≤ Pr[E2′]. Now, fix Z = N ∪ T, and observe that |Z| = 2m. Next, fix a range r, and observe that the bad probability of E2′ is maximized if |r ∩ Z| = εm/2. Now, the probability that all the elements of r ∩ Z fall only into the second half of the sample is at most 2^(−εm/2), as a careful calculation shows. Now, there are at most |R |Z| ≤ G_δ(2m) different ranges that one has to consider. As such, Pr[E1] ≈ Pr[E2] ≤ Pr[E2′] ≤ G_δ(2m)·2^(−εm/2), and this is smaller than ϕ, as a careful calculation shows by just plugging the value of m into the right-hand side; see Eq. (20.3)p155.

20.4. Discrepancy
The proof of the ε-sample/net theorem is somewhat complicated. It turns out that one can get a
somewhat similar result by attacking the problem from the other direction; namely, let us assume that
we would like to take a truly large sample of a finite range space S = (X, R) defined over n elements with
m ranges. We would like this sample to be as representative as possible as far as S is concerned. In
fact, let us decide that we would like to pick exactly half of the points of X in our sample (assume that
n = |X| is even).
To this end, let us color half of the points of X by −1 (i.e., black) and the other half by 1 (i.e., white).
If for every range, r ∈ R, the number of black points inside it is equal to the number of white points,
then doubling the number of black points inside a range gives us the exact number of points inside the
range. Of course, such a perfect coloring is unachievable in almost all situations. To see this, consider
the complete graph K3 – clearly, in any coloring (by two colors) of its vertices, there must be an edge
with two endpoints having the same color (i.e., the edges are the ranges).
Formally, let χ : X → {−1, 1} be a coloring. The discrepancy of χ over a range r is the amount of imbalance in the coloring inside r. Namely,

|χ(r)| = | Σ_{p∈r} χ(p) |.

The overall discrepancy of χ is disc( χ) = maxr∈R | χ(r)|. The discrepancy of a (finite) range space
S = (X, R) is the discrepancy of the best possible coloring; namely,
disc(S) = min_{χ : X→{−1,+1}} disc(χ).

The natural question is, of course, how to compute the coloring χ of minimum discrepancy. This
seems like a very challenging question, but when you do not know what to do, you might as well do
something random. So, let us pick a random coloring χ of X. To this end, let Π be an arbitrary
partition of X into pairs (i.e., a perfect matching). For a pair {p, q} ∈ Π, we will either color χ(p) = −1
and χ(q) = 1 or the other way around; namely, χ(p) = 1 and χ(q) = −1. We will decide how to color this
pair using a single coin flip. Thus, our coloring would be induced by making such a decision for every
pair of Π, and let χ be the resulting coloring. We will refer to χ as compatible with the partition Π
if, for all {p, q} ∈ Π, we have that χ({p, q}) = 0; namely,

∀ {p, q} ∈ Π:   (χ(p) = +1 and χ(q) = −1)   or   (χ(p) = −1 and χ(q) = +1).
Consider a range r and a coloring χ compatible with Π. If a pair {p, q} ∈ Π
falls completely inside r or completely outside r, then it does not contribute anything to the discrepancy
of r. Thus, the only pairs that contribute to the discrepancy of r are the ones that cross it. Namely,
{p, q} ∩ r ≠ ∅ and {p, q} ∩ (X \ r) ≠ ∅.
As such, let #r denote the crossing number of r, that is, the number of pairs that cross r. Next, let Xᵢ ∈ {−1, +1} be the indicator variable which is the contribution of the ith crossing pair to the discrepancy of r. For ∆_r = √(2·#r·ln(4m)), we have by Chernoff’s inequality (Theorem 7.1.7p58), that

Pr[|χ(r)| ≥ ∆_r] = Pr[χ(r) ≥ ∆_r] + Pr[χ(r) ≤ −∆_r] = 2 Pr[ Σᵢ Xᵢ ≥ ∆_r ] ≤ 2 exp(−∆_r²/(2·#r)) = 1/(2m).
Since there are m ranges in R, it follows that with good probability (i.e., at least half) for all r ∈ R the
discrepancy of r is at most ∆r .
Theorem 20.4.1. Let S = (X, R) be a range space defined over n = |X| elements with m = |R| ranges.
Consider any partition Π of the elements of X into pairs. Then, with probability ≥ 1/2, for any range
r ∈ R, a random coloring χ : X → {−1, +1} that is compatible with the partition Π has discrepancy at
most
|χ(r)| < ∆_r = √(2·#r·ln(4m)),

where #r denotes the number of pairs of Π that cross r. In particular, since #r ≤ |r|, we have |χ(r)| ≤ √(2|r| ln(4m)).
Observe that for every range r we have that #r ≤ n/2, since 2#r ≤ |X|. As such, we have:
Corollary 20.4.2. Let S = (X, R) be a range space defined over n elements with m ranges. Let Π be an arbitrary partition of X into pairs. Then a random coloring which is compatible with Π has disc(χ) < √(n ln(4m)), with probability ≥ 1/2.
One can easily amplify the probability of success of the coloring by increasing the threshold. In particular, for any constant c ≥ 1, one has that

∀r ∈ R   |χ(r)| ≤ √(2c·#r·ln(4m)),

with probability ≥ 1 − 2/(4m)^c.
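The random compatible coloring is trivial to implement. The sketch below (illustrative parameters only) pairs up n points, flips one coin per pair, and compares the worst discrepancy over a family of random ranges against the √(n ln(4m)) bound of Corollary 20.4.2.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1024, 2000                                   # n points, m (random) ranges
ranges = [rng.random(n) < 0.5 for _ in range(m)]    # each range: a random subset mask

# A coloring compatible with the pairing {0,1}, {2,3}, ...: one coin flip per pair.
flips = rng.integers(0, 2, size=n // 2) * 2 - 1
chi = np.empty(n, dtype=int)
chi[0::2], chi[1::2] = flips, -flips

disc = max(abs(chi[r].sum()) for r in ranges)
print("discrepancy:", disc, " bound:", np.sqrt(n * np.log(4 * m)))
```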

20.4.1. Building ε-sample via discrepancy
Let S = (X, R) be a range space with shattering dimension δ. Let P ⊆ X be a set of n points, and consider the induced range space S|P = (P, R |P); see Definition 20.1.4p146. Here, by the definition of shattering dimension, we have that m = |R |P| = O(n^δ). Without loss of generality, we assume that n is a power of 2. Consider a coloring χ of P with discrepancy bounded by Corollary 20.4.2. In particular, let Q be the points of P colored by, say, −1. We know that |Q| = n/2, and for any range r ∈ R, we have that

|χ(r)| = ||(P \ Q) ∩ r| − |Q ∩ r|| < √(n ln(4m)) = √(n ln O(n^δ)) ≤ c√(n ln(n^δ)),

for some absolute constant c. Observe that |(P \ Q) ∩ r| = |P ∩ r| − |Q ∩ r|. In particular, we have that for any range r,

||P ∩ r| − 2|Q ∩ r|| ≤ c√(n ln(n^δ)).     (20.4)

Dividing both sides by n = |P| = 2|Q|, we have that

| |P ∩ r|/|P| − |Q ∩ r|/|Q| | ≤ τ(n)   for   τ(n) = c√(δ ln n / n).     (20.5)
Namely, a coloring with discrepancy bounded by Corollary 20.4.2 yields a τ(n)-sample. Intuitively, if n is
very large, then Q provides a good approximation to P. However, we want an ε-sample for a prespecified
ε > 0. Conceptually, ε is a fixed constant while τ(n) is considerably smaller. Namely, Q is a sample
which is too tight for our purposes (and thus too big). As such, we will coarsen (and shrink) Q till
we get the desired ε-sample by repeated application of Corollary 20.4.2. Specifically, we can “chain”
together several approximations generated by Corollary 20.4.2. This is sometimes referred to as the
sketch property of samples. Informally, as testified by the following lemma, a sketch of a sketch is a
sketch° .
Lemma 20.4.3. Let Q ⊆ P be a ρ-sample for P (in some underlying range space S), and let R ⊆ Q be
a ρ′-sample for Q. Then R is a (ρ + ρ′)-sample for P.

Proof: By definition, we have that, for every range r,

| |r ∩ P|/|P| − |r ∩ Q|/|Q| | ≤ ρ    and    | |r ∩ Q|/|Q| − |r ∩ R|/|R| | ≤ ρ′.

Adding the two inequalities together (via the triangle inequality), we get

| |r ∩ P|/|P| − |r ∩ R|/|R| | ≤ | |r ∩ P|/|P| − |r ∩ Q|/|Q| | + | |r ∩ Q|/|Q| − |r ∩ R|/|R| | ≤ ρ + ρ′.  ∎
Thus, let P0 = P and P1 = Q. Now, in the ith iteration, we will compute a coloring χi−1 of Pi−1 with
low discrepancy, as guaranteed by Corollary 20.4.2, and let Pi be the points of Pi−1 colored −1 by χi−1
(the same side used to define Q above).
Let δi = τ(ni−1), where ni−1 = |Pi−1| = n/2^{i−1}. By Lemma 20.4.3, we have that Pk is a (Σ_{i=1}^{k} δi)-sample
for P. Since we would like the smallest set in the sequence P1, P2, . . . that is still an ε-sample, we would
like to find the maximal k, such that Σ_{i=1}^{k} δi ≤ ε. Plugging in the value of δi and τ(·), see Eq. (20.5),
it is sufficient for our purposes that

Σ_{i=1}^{k} δi = Σ_{i=1}^{k} τ(ni−1) = c Σ_{i=1}^{k} √( δ ln(n/2^{i−1}) / (n/2^{i−1}) ) ≤ c1 √( δ ln(n/2^{k−1}) / (n/2^{k−1}) ) = c1 √( δ ln(nk−1) / nk−1 ) ≤ ε,

° Try saying this quickly 100 times.
since the above series behaves like a geometric series, and as such its total sum is proportional to its
largest element± , where c1 is a sufficiently large constant. This holds for
c1 √( δ ln(nk−1) / nk−1 ) ≤ ε  ⟺  c1² δ ln(nk−1) / nk−1 ≤ ε²  ⟺  c1² δ / ε² ≤ nk−1 / ln(nk−1).

The last inequality holds for nk−1 ≥ 2 (c1² δ/ε²) ln(c1² δ/ε²), by Lemma 20.2.5(D). In particular, taking the largest
k for which this holds results in a set Pk of size O( (δ/ε²) ln(δ/ε) ) which is an ε-sample for P.


Theorem 20.4.4 (ε-sample via discrepancy). For a range space (X, R) with shattering dimension at
most δ, a finite subset B ⊆ X, and ε > 0, there exists a subset C ⊆ B, of cardinality O( (δ/ε²) ln(δ/ε) ),
such that C is an ε-sample for B.

Note that it is not obvious how to turn Theorem 20.4.4 into an efficient construction algorithm
of such an ε-sample. Nevertheless, this theorem can be turned into a relatively efficient  deterministic
algorithm using conditional probabilities. In particular, there is a deterministic O(n^{δ+1}) time algorithm
for computing an ε-sample for a range space of VC dimension δ and with n points in its ground set using
the above approach (see the bibliographical notes in Section 20.7 for details). Inherently, however, it is a
far cry from the simplicity of Theorem 20.3.2 that just requires us to take a random sample. Interestingly,
there are cases where using discrepancy leads to smaller ε-samples; again see bibliographical notes for
details.
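To make the repeated-halving construction above more tangible, here is a toy sketch (not part of the original text). It replaces the low-discrepancy coloring by a plain random coloring of an arbitrary pairing, treats the shattering dimension δ as 1, uses an arbitrary constant in τ(·), and tests the result only against hypothetical prefix ranges; the helper names err, halve, and eps_sample are ours.

import math
import random

def err(sz, delta=1.0, c=1.0):
    # tau(n) of Eq. (20.5), with an arbitrary constant c
    return c * math.sqrt(delta * math.log(max(sz, 2)) / sz)

def halve(points):
    # one halving round: pair the points arbitrarily and keep one point per pair,
    # chosen by a coin flip (a random coloring compatible with the pairing)
    pts = points[:]
    random.shuffle(pts)
    return [pts[i] if random.random() < 0.5 else pts[i + 1]
            for i in range(0, len(pts) - 1, 2)]

def eps_sample(points, eps):
    sample, total_err = points[:], 0.0
    while len(sample) >= 2 and total_err + err(len(sample)) <= eps:
        total_err += err(len(sample))       # Lemma 20.4.3: the errors accumulate
        sample = halve(sample)
    return sample

P = list(range(1 << 14))
Q = eps_sample(P, eps=0.1)
r = set(range(5000))                        # one hypothetical prefix range
print(len(Q), abs(len(r & set(P)) / len(P) - len(r & set(Q)) / len(Q)))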

20.4.1.1. Faster deterministic construction of ε-samples.


One can speed up the deterministic construction mentioned above by using a sketch-and-merge approach.
To this end, we need the following merge property of ε-samples. (The proof of the following lemma is
quite easy. Nevertheless, we provide the proof in excruciating detail for the sake of completeness.)

Lemma 20.4.5. Consider the sets R ⊆ P and R′ ⊆ P′. Assume that P and P′ are disjoint, |P| = |P′|,
and |R| = |R′|. Then, if R is an ε-sample of P and R′ is an ε-sample of P′, then R ∪ R′ is an ε-sample
of P ∪ P′.

Proof: We have for any range r that

| |r ∩ (P ∪ P′)| / |P ∪ P′| − |r ∩ (R ∪ R′)| / |R ∪ R′| |
   = | |r ∩ P| / |P ∪ P′| + |r ∩ P′| / |P ∪ P′| − |r ∩ R| / |R ∪ R′| − |r ∩ R′| / |R ∪ R′| |
   = | |r ∩ P| / (2|P|) + |r ∩ P′| / (2|P′|) − |r ∩ R| / (2|R|) − |r ∩ R′| / (2|R′|) |
   = (1/2) | ( |r ∩ P|/|P| − |r ∩ R|/|R| ) + ( |r ∩ P′|/|P′| − |r ∩ R′|/|R′| ) |
   ≤ (1/2) | |r ∩ P|/|P| − |r ∩ R|/|R| | + (1/2) | |r ∩ P′|/|P′| − |r ∩ R′|/|R′| |
   ≤ ε/2 + ε/2 = ε.  ∎
± Formally, one needs to show that the ratio between two consecutive elements in the series is larger than some constant,

say 1.1. This is easy but tedious, but the well-motivated reader (of little faith) might want to do this calculation.

Interestingly, by breaking the given ground sets into sets of equal size and building a balanced binary
tree over these sets, one can speed up the deterministic algorithm for building ε-samples. The idea is to
compute the sample bottom-up, where at every node we merge the samples provided by the children (i.e.,
using Lemma 20.4.5), and then we sketch the resulting set using Lemma 20.4.3. By carefully fine-tuning
this construction, one can get an algorithm for computing ε-samples in time which is near linear in n
(assuming ε and δ are small constants). We delegate the details of this construction to Exercise 20.8.6.
This algorithmic idea is quite useful and we will refer to it as sketch-and-merge.
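The following fragment (again only a sketch, and not from the original text) shows the shape of the sketch-and-merge recursion, reusing the hypothetical halve() helper from the previous snippet; it does not budget the accumulated error across the O(log n) levels, which a real implementation would have to do via Lemma 20.4.3.

def sketch_and_merge(points, leaf_size=64):
    # build the sample bottom-up over an (implicit) balanced binary tree
    if len(points) <= leaf_size:
        return points[:]                    # leaves are used as-is
    mid = len(points) // 2
    left = sketch_and_merge(points[:mid], leaf_size)
    right = sketch_and_merge(points[mid:], leaf_size)
    merged = left + right                   # merge step (Lemma 20.4.5)
    return halve(merged)                    # sketch step (one halving round)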

20.4.2. Building ε-net via discrepancy


We are given range space (X, R) with shattering dimension d and ε > 0 and the target is to compute an
ε-net for this range space.
We need to be slightly more careful if we want to use discrepancy to build ε-nets, and we will use
Theorem 20.4.1 instead of Corollary 20.4.2 in the analysis.
The construction is as before – we set P0 = P, and Pi is all the points colored +1 in the coloring of
Pi−1 by Theorem 20.4.1. We repeat this till we get a set that is the required net.
To analyze this construction (and decide when it should stop), let r be a range in a given range space
(X, R) with shattering dimension d, and let
νi = |Pi ∩ r|
denote the size of the range r in the ith set Pi and let ni = |Pi|, for i ≥ 0. Observe that the number of
points in r colored by +1 and −1 when coloring Pi−1 is
αi = |Pi ∩ r| = νi and βi = |Pi−1 ∩ r| − |Pi ∩ r| = νi−1 − νi,
respectively. As such, setting mi = |R|Pi| = O(ni^d), we have, by Theorem 20.4.1, that the discrepancy of
r in this coloring of Pi−1 is

|αi − βi| = |2νi − νi−1| ≤ √( 2 νi−1 ln(4 mi−1) ) ≤ c √( d νi−1 ln ni−1 ),

for some constant c, since the crossing number #r of a range r ∩ Pi−1 is always bounded by its size. This
is equivalent to

| 2^i νi − 2^{i−1} νi−1 | ≤ c 2^{i−1} √( d νi−1 ln ni−1 ).    (20.6)
We need the following technical claim that states that the size of νk behaves as we expect; as long
as the set P k is large enough, the size of νk is roughly ν0 /2 k .
Claim 20.4.6. There is a constant c4 (independent of d), such that for all k with ν0/2^k ≥ c4 d ln nk, we have
(ν0/2^k)/2 ≤ νk ≤ 2(ν0/2^k).
Proof: The proof is by induction. For k = 0 the claim trivially holds. Assume that it holds for i < k.
Adding up the inequalities of Eq. (20.6), for i = 1, . . . , k, we have that

| ν0 − 2^k νk | ≤ Σ_{i=1}^{k} c 2^{i−1} √( d νi−1 ln ni−1 ) ≤ Σ_{i=1}^{k} c 2^{i−1} √( 2d (ν0/2^{i−1}) ln ni−1 ) ≤ c3 2^k √( d (ν0/2^k) ln nk ),

for some constant c3, since this summation behaves like an increasing geometric series and the last term
dominates the summation. Thus,

ν0/2^k − c3 √( d (ν0/2^k) ln nk ) ≤ νk ≤ ν0/2^k + c3 √( d (ν0/2^k) ln nk ).

By assumption, we have that √( ν0/(c4 2^k) ) ≥ √( d ln nk ). This implies that

νk ≤ ν0/2^k + c3 √( (ν0/2^k) · (ν0/(c4 2^k)) ) = (ν0/2^k) ( 1 + c3/√c4 ) ≤ 2 (ν0/2^k),

by selecting c4 ≥ 4c3². Similarly, we have

νk ≥ (ν0/2^k) ( 1 − c3 √(d ln nk) / √(ν0/2^k) ) ≥ (ν0/2^k) ( 1 − c3 √(ν0/(c4 2^k)) / √(ν0/2^k) ) = (ν0/2^k) ( 1 − c3/√c4 ) ≥ (ν0/2^k)/2.  ∎

So consider a “heavy” range r that contains at least ν0 ≥ εn points of P. To show that Pk is an ε-net,
we need to show that Pk ∩ r ≠ ∅. To apply Claim 20.4.6, we need a k such that εn/2^k ≥ c4 d ln nk−1, or
equivalently, such that

2nk / ln(2nk) ≥ 2c4 d / ε,

which holds for nk = Ω( (d/ε) ln(d/ε) ), by Lemma 20.2.5(D). But then, by Claim 20.4.6, we have that

νk = |Pk ∩ r| ≥ |P ∩ r| / (2 · 2^k) ≥ (1/2) · (εn/2^k) = (ε/2) nk = Ω( d ln(d/ε) ) > 0.

We conclude that the set Pk, which is of size Ω( (d/ε) ln(d/ε) ), is an ε-net for P.

Theorem 20.4.7 (ε-net via discrepancy). For any range space (X, R) with shattering dimension at
most d, a finite subset B ⊆ X, and ε > 0, there exists a subset C ⊆ B, of cardinality O((d/ε) ln(d/ε)),
such that C is an ε-net for B.
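As an illustration of this construction, here is a toy sketch (not part of the original text): it keeps halving, via the hypothetical halve() helper from the ε-sample snippet above (so a random rather than a guaranteed low-discrepancy coloring), until the set reaches a target size of roughly (d/ε) ln(d/ε), and then checks that some hypothetical heavy prefix ranges were not missed. The constant c and the handling of d are arbitrary.

import math

def eps_net(points, eps, d=1.0, c=4.0):
    target = c * (d / eps) * math.log(d / eps + 2)   # rough Theorem 20.4.7 size
    sample = points[:]
    while len(sample) / 2 >= target:
        sample = halve(sample)              # keep one side of a random coloring
    return sample

P = list(range(1 << 14))
N = set(eps_net(P, eps=0.05))
heavy = [set(range(k)) for k in range(int(0.05 * len(P)), len(P), 1000)]
print(len(N), all(r & N for r in heavy))    # every heavy range should be hit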

20.5. Proof of the ε-net theorem


In this section, we finally prove Theorem 20.3.4.
Let (X, R) be a range space of VC dimension δ, and let x be a subset of X of cardinality n. Suppose
that m satisfies Eq. (20.3)p155 . Let N = (x1, . . . , xm ) be the sample obtained by m independent samples
from x (the elements of N are not necessarily distinct, and we treat N as an ordered set). Let E1 be the
event that N fails to be an ε-net. Namely,

E1 = { ∃r ∈ R : |r ∩ x| ≥ εn and r ∩ N = ∅ }.

(Namely, there exists a “heavy” range r that does not contain any point of N.) To complete the proof,
we must show that Pr[E1] ≤ ϕ. Let T = (y1, . . . , ym) be another random sample generated in a similar
fashion to N. Let E2 be the event that N fails but T “works”; formally

E2 = { ∃r ∈ R : |r ∩ x| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 }.

Intuitively, since E[|r ∩ T|] ≥ εm, we have that for the range r that N fails for, it follows with “good”
probability that |r ∩ T| ≥ εm/2. Namely, E1 and E2 have more or less the same probability.

Claim 20.5.1. Pr[E2] ≤ Pr[E1] ≤ 2 Pr[E2].

Proof: Clearly, E2 ⊆ E1, and thus Pr[E2] ≤ Pr[E1]. As for the other part, note that by the definition
of conditional probability, we have

Pr[E2 | E1] = Pr[E2 ∩ E1] / Pr[E1] = Pr[E2] / Pr[E1].

It is thus enough to show that Pr[E2 | E1] ≥ 1/2.
Assume that E1 occurs. There is r ∈ R, such that |r ∩ x| > εn and r ∩ N = ∅. The required probability
is at least the probability that for this specific r, we have |r ∩ T| ≥ εm/2. However, X = |r ∩ T| is a binomial
variable with expectation E[X] = pm and variance V[X] = p(1 − p)m ≤ pm, where p = |r ∩ x|/n ≥ ε.
Thus, by Chebychev’s inequality (Theorem 4.3.3p39 ),

Pr[ X < εm/2 ] ≤ Pr[ X < pm/2 ] ≤ Pr[ |X − pm| > pm/2 ] = Pr[ |X − pm| > (√(pm)/2) · √(pm) ]
   ≤ Pr[ |X − E[X]| > (√(pm)/2) · √(V[X]) ] ≤ ( 2/√(pm) )² ≤ 1/2,

since m ≥ 8/ε ≥ 8/p; see Eq. (20.3)p155 . Thus, if E1 occurs, we have

Pr[E2] / Pr[E1] ≥ Pr[ |r ∩ T| ≥ εm/2 ] = 1 − Pr[ |r ∩ T| < εm/2 ] ≥ 1/2.  ∎

Claim 20.5.1 implies that to bound the probability of E1 , it is enough to bound the probability of
E2. Let

E2′ = { ∃r ∈ R : r ∩ N = ∅ and |r ∩ T| ≥ εm/2 }.

Clearly, E2 ⊆ E2′. Thus, bounding the probability of E2′ is enough to prove Theorem 20.3.4. Note,
however, that a shocking thing happened! We no longer have x participating in our event. Namely, we
turned bounding an event that depends on a global quantity (i.e., the ground set x) into bounding a
quantity that depends only on a local quantity/experiment (involving only N and T). This is the crucial
idea in this proof.

Claim 20.5.2. Pr[E2] ≤ Pr[E2′] ≤ Gδ(2m) 2^{−εm/2}.

Proof: We imagine that we sample the elements of N ∪ T together, by picking Z = (z1, . . . , z2m) inde-
pendently from x. Next, we randomly decide the m elements of Z that go into N, and the remaining
elements go into T. Clearly,

Pr[E2′] = Σ_{z ∈ x^{2m}} Pr[ E2′ ∩ (Z = z) ] = Σ_{z ∈ x^{2m}} ( Pr[ E2′ ∩ (Z = z) ] / Pr[Z = z] ) · Pr[Z = z]
        = Σ_{z} Pr[ E2′ | Z = z ] Pr[Z = z] = E[ Pr[ E2′ | Z = z ] ].
Thus, from this point on, we fix the set Z, and we bound Pr[E2′ | Z]. Note that Pr[E2′] is a weighted
average of Pr[E2′ | Z = z], and as such a bound on this quantity would imply the same bound on Pr[E2′].
It is now enough to consider the ranges in the projection space (Z, R|Z) (which has VC dimension δ).
By Lemma 20.2.1, we have |R|Z| ≤ Gδ(2m).
Let us fix any r ∈ R|Z, and consider the event

Er = { r ∩ N = ∅ and |r ∩ T| > εm/2 }.

We claim that Pr[Er] ≤ 2^{−εm/2}. Observe that if k = |r ∩ (N ∪ T)| ≤ εm/2, then the event is empty, and
this claim trivially holds. Otherwise, Pr[Er] = Pr[r ∩ N = ∅]. To bound this probability, observe that
we have the 2m elements of Z, and we can choose any m of them to be N, as long as none of them is
one of the k “forbidden” elements of r ∩ (N ∪ T). The probability of that is C(2m−k, m)/C(2m, m). We thus have

Pr[Er] ≤ Pr[r ∩ N = ∅] = C(2m−k, m) / C(2m, m) = (2m−k)(2m−k−1) · · · (m−k+1) / ( 2m(2m−1) · · · (m+1) )
       = m(m−1) · · · (m−k+1) / ( 2m(2m−1) · · · (2m−k+1) ) ≤ 2^{−k} ≤ 2^{−εm/2}.

Thus,

Pr[E2′ | Z] = Pr[ ∪_{r ∈ R|Z} Er ] ≤ Σ_{r ∈ R|Z} Pr[Er] ≤ |R|Z| · 2^{−εm/2} ≤ Gδ(2m) 2^{−εm/2},

implying that Pr[E2′] ≤ Gδ(2m) 2^{−εm/2}.  ∎
 


Proof of Theorem 20.3.4. By Claim 20.5.1 and Claim 20.5.2, we have that Pr[E1 ] ≤ 2Gδ (2m)2−εm/2 .
It thus remains to verify that if m satisfies Eq. (20.3), then 2Gδ (2m)2−εm/2 ≤ ϕ.
Indeed, we know that 2m ≥ 8δ (by Eq. (20.3)p155 ) and by Lemma 20.2.2, Gδ (2m) ≤ 2(2em/δ)δ , for
δ ≥ 1. Thus, it is sufficient to show that the inequality 4(2em/δ)δ 2−εm/2 ≤ ϕ holds. By rearranging and
taking lg of both sides, we have that this is equivalent to

2^{εm/2} ≥ (4/ϕ) (2em/δ)^δ   ⟹   εm/2 ≥ δ lg(2em/δ) + lg(4/ϕ).

By our choice of m (see Eq. (20.3)), we have that εm/4 ≥ lg(4/ϕ). Thus, we need to show that

εm/4 ≥ δ lg(2em/δ).

We verify this inequality for m = (8δ/ε) lg(16/ε) (this would also hold for bigger values, as can be easily
verified). Indeed,

2δ lg(16/ε) ≥ δ lg( (16e/ε) lg(16/ε) ).

This is equivalent to (16/ε)² ≥ (16e/ε) lg(16/ε), which is equivalent to 16/(eε) ≥ lg(16/ε), which is certainly true for
0 < ε ≤ 1.
This completes the proof of the theorem. 

20.6. A better bound on the growth function
In this section, we prove Lemma 20.2.2p149 . Since the proof is straightforward but tedious, the reader
can safely skip reading this section.

Lemma 20.6.1. For any positive integer n, the following hold.

(i) (1 + 1/n)^n ≤ e.    (ii) (1 − 1/n)^{n−1} ≥ e^{−1}.
(iii) n! ≥ (n/e)^n.    (iv) For any k ≤ n, we have (n/k)^k ≤ C(n, k) ≤ (ne/k)^k.

Proof: (i) Indeed, 1 + 1/n ≤ exp(1/n), since 1 + x ≤ e^x, for x ≥ 0. As such (1 + 1/n)^n ≤ exp(n(1/n)) = e.
(ii) Rewriting the inequality, we have that we need to prove ((n−1)/n)^{n−1} ≥ 1/e. This is equivalent to
proving e ≥ (n/(n−1))^{n−1} = (1 + 1/(n−1))^{n−1}, which is our friend from (i).
(iii) Indeed,

n^n / n! ≤ Σ_{i=0}^{∞} n^i / i! = e^n,

by the Taylor expansion of e^x = Σ_{i=0}^{∞} x^i / i!. This implies that (n/e)^n ≤ n!, as required.
(iv) Indeed, for any k ≤ n, we have n/k ≤ (n−1)/(k−1), as can be easily verified. As such, n/k ≤ (n−i)/(k−i), for
1 ≤ i ≤ k − 1. As such,

(n/k)^k ≤ (n/k) · ((n−1)/(k−1)) · · · ((n−k+1)/1) = C(n, k).

As for the other direction, by (iii), we have C(n, k) ≤ n^k / k! ≤ n^k / (k/e)^k = (ne/k)^k.  ∎

Lemma 20.2.2 restated. For n ≥ 2δ and δ ≥ 1, we have (n/δ)^δ ≤ Gδ(n) ≤ 2 (ne/δ)^δ, where Gδ(n) = Σ_{i=0}^{δ} C(n, i).

Proof: Note that by Lemma 20.6.1(iv), we have Gδ(n) = Σ_{i=0}^{δ} C(n, i) ≤ 1 + Σ_{i=1}^{δ} (ne/i)^i. This series behaves like
a geometric series with constant larger than 2, since

(ne/i)^i / (ne/(i−1))^{i−1} = (ne/i) · ((i−1)/i)^{i−1} = (ne/i) (1 − 1/i)^{i−1} ≥ (ne/i) · (1/e) = n/i ≥ n/δ ≥ 2,

by Lemma 20.6.1. As such, this series is bounded by twice the largest element in the series, implying
the claim.  ∎

20.7. Bibliographical notes


The exposition of the ε-net and ε-sample theorems is roughly based on Alon and Spencer [AS00] and
Komlós et al. [KPW92]. In fact, Komlós et al. proved a somewhat stronger bound; that is, a random

sample of size (δ/ε) ln(1/ε) is an ε-net with constant probability. For a proof that shows that in general ε-
nets cannot be much smaller in the worst case, see [PA95]. The original proof of the ε-net theorem is due
to Haussler and Welzl [HW87]. The proof of the ε-sample theorem  is due to Vapnik and Chervonenkis
[VC71]. The bound in Theorem 20.3.2 can be improved to O( δ/ε² + (1/ε²) log(1/ϕ) ) [AB99].
An alternative proof of the ε-net theorem proceeds by first computing an (ε/4)-sample of sufficient
size, using the ε-sample theorem (Theorem 20.3.2p155 ), and then computing an (ε/4)-net for this sample
using a direct sample of the right size. It is easy to verify that the resulting set is an ε-net. Furthermore, using
the “naive” argument (see Section 20.3.2.3) then implies that this holds with the right probability, thus
implying the ε-net theorem (the resulting constants might be slightly worse). Exercise 20.8.3 deploys
similar ideas.
The beautiful alternative proof of both theorems via the usage of discrepancy is due to Chazelle and
Matoušek [CM96]. The discrepancy method is a beautiful topic which is quite deep mathematically, and
we have just skimmed the thin layer of melted water on top of the tip of the iceberg² . Two nice books
on the topic are the books by Chazelle [Cha01] and Matoušek [Mat99]. The book by Chazelle [Cha01]
is currently available online for free from Chazelle’s webpage.
We will revisit discrepancy since in some geometric cases it yields better results than the ε-sample
theorem. In particular, the random coloring of Theorem 20.4.1 can be derandomized using conditional
probabilities. One can then use it to get an ε-sample/net by applying it repeatedly. A faster algorithm
results from a careful implementation of the sketch-and-merge approach. The disappointing feature of
all the deterministic constructions of ε-samples/nets is that their running time is exponential in the
dimension δ, since the number of ranges is usually exponential in δ.
A similar result to the one derived by Haussler and Welzl [HW87], using a more geometric approach,
was done independently by Clarkson at the same time [Cla87], exposing the fact that VC dimension is
not necessary if we are interested only in geometric applications. This was later refined by Clarkson
[Cla88], leading to a general technique that, in geometric settings, yields stronger results than the ε-net
theorem. This technique has numerous applications in discrete and computational geometry and leads
to several “proofs from the book” in discrete geometry.
Exercise 20.8.5 is from Anthony and Bartlett [AB99].

20.7.1. Variants and extensions


A natural application of the ε-sample theorem is to use it to estimate the weights of ranges. In particular,
given a finite range space (X, R), we would like to build a data-structure such that we can decide quickly,
given a query range r, what the number of points of X inside r is. We could always use a sample of size
(roughly) O(ε −2 ) to get an estimate of the weight of a range, using the ε-sample theorem. The error
of the estimate of the size |r ∩ X| is ≤ εn, where n = |X|; namely, the error is additive. The natural
question is whether one can get a multiplicative estimate ρ, such that |r ∩ X| ≤ ρ ≤ (1 + ε) |r ∩ X|.
In particular, a subset A ⊂ X is a (relative) (ε, p)-sample if, for each r ∈ R of weight ≥ pn,

| |r ∩ A| / |A| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.

Of course, one can simply generate an εp-sample of size (roughly) O(1/(εp)²) by the ε-sample theorem.
This is not very interesting when p = 1/√n. Interestingly, the dependency on p can be improved.
² The iceberg is melting because of global warming; so sorry, climate change.

Theorem 20.7.1 ([LLS01]). Let (X, R) be a range space with shattering dimension d, where |X| = n,
and let 0 < ε < 1 and 0 < p < 1 be given parameters. Then, consider a random sample A ⊆ X of size

(c/(ε² p)) ( d log(1/p) + log(1/ϕ) ),

where c is a constant. Then, it holds that for each range r ∈ R of at least pn points, we have

| |r ∩ A| / |A| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.

In other words, A is a (p, ε)-sample for (X, R). The probability of success is ≥ 1 − ϕ.

A similar result is achievable by using discrepancy; see Exercise 20.8.7.

20.8. Exercises
Exercise 20.8.1 (Compute clustering radius). Let C and P be two given sets of points in the plane, such
that k = |C| and n = |P|. Let r = max p∈P minc∈C kc − pk be the covering radius of P by C (i.e., if we
place a disk of radius r centered at each point of C, all those disks cover the points of P).
(A) Give an O(n + k log n) expected time algorithm that outputs a number α, such that r ≤ α ≤ 10r.
(B) For ε > 0 a prescribed parameter, give an O(n + kε −2 log n) expected time algorithm that outputs
a number α, such that r ≤ α ≤ (1 + ε)r.

Exercise 20.8.2 (Some calculus required). Prove Lemma 20.2.5.

Exercise 20.8.3 (A direct proof of the ε-sample theorem). For the case that the given range space is finite,
one can prove the ε-sample theorem (Theorem 20.3.2p155 ) directly. So, we are given a range space
S = (x, R) with VC dimension δ, where x is a finite set. 
(A) Show that there exists an ε-sample of S of size O( δ ε^{−2} log( log(|x|)/ε ) ) by extracting an ε/3-sample from
an ε/9-sample of the original space (i.e., apply Lemma 20.3.6 twice and use Lemma 20.4.3).
(B) Show that for any k, there exists an ε-sample of S of size O( δ ε^{−2} log( log^{(k)}(|x|)/ε ) ).
(C) Show that there exists an ε-sample of S of size O( δ ε^{−2} log(1/ε) ).


Exercise 20.8.4 (Sauer’s lemma is tight). Show that Sauer’s lemma (Lemma 20.2.1) is tight. Specifically,
provide a finite range space that has the number of ranges as claimed by Lemma 20.2.1.

Exercise 20.8.5 (Flip and flop). (A) Let b1, . . . , b2m be 2m binary bits. Let Ψ be the set of all permutations
of 1, . . . , 2m, such that for any σ ∈ Ψ, we have σ(i) = i or σ(i) = m + i, for 1 ≤ i ≤ m, and similarly,
σ(m + i) = i or σ(m + i) = m + i. Namely, σ ∈ Ψ either leaves the pair i, i + m in their positions or
it exchanges them, for 1 ≤ i ≤ m. As such |Ψ| = 2^m.
Prove that for a random σ ∈ Ψ, we have

Pr[ | (1/m) Σ_{i=1}^{m} b_{σ(i)} − (1/m) Σ_{i=1}^{m} b_{σ(i+m)} | ≥ ε ] ≤ 2 e^{−ε² m/2}.

(B) Let Ψ′ be the set of all permutations of 1, . . . , 2m. Prove that for a random σ ∈ Ψ′, we have

Pr[ | (1/m) Σ_{i=1}^{m} b_{σ(i)} − (1/m) Σ_{i=1}^{m} b_{σ(i+m)} | ≥ ε ] ≤ 2 e^{−Cε² m/2},

where C is an appropriate constant. [Use (A), but be careful.]

(C) Prove Theorem 20.3.2 using (B).

Exercise 20.8.6 (Sketch and merge). Assume that you are given a deterministic algorithm that can com-
pute the discrepancy of Theorem 20.4.1 in O(nm) time, where n is the size of the ground set and m is
the number of induced ranges. We are assuming that the VC dimension δ of the given range space is
small and that the algorithm input is only the ground set X (i.e., the algorithm can figure out on its
own what the relevant ranges are).
(A) For a prespecified ε > 0, using the ideas described in Section 20.4.1.1, show how to compute a small

ε-sample of X quickly. The running time of your algorithm should be (roughly) O n/εO(δ) polylog .
What is the exact bound on the running time of your algorithm?
(B) One can slightly improve the running of the above algorithm by more aggressively sketching the
sets used. That is, one can add additional sketch layers in the tree. Show how by using such an
approach one can improve the running time of the above algorithm by a logarithmic factor.

Exercise 20.8.7 (Building relative approximations). Prove the following theorem using discrepancy.
Theorem 20.8.8. Let (X, R) be a range space with shattering dimension δ, where
|X| = n, and let 0 < ε < 1 and 0 < p < 1 be given parameters. Then one can
construct a set N ⊆ X of size O( (δ/(ε² p)) ln(δ/(ε p)) ), such that, for each range r ∈ R of at
least pn points, we have

| |r ∩ N| / |N| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.

In other words, N is a relative (p, ε)-approximation for (X, R).

Chapter 21

Sampling and the Moments Technique


Sun and rain and bush had made the site look
old, like the site of a dead civilization. The
ruins, spreading over so many acres, seemed to
speak of a final catastrophe. But the civilization
wasn’t dead. It was the civilization I existed in
and in fact was still working towards. And that
could make for an odd feeling: to be among the
ruins was to have your time-sense unsettled.
You felt like a ghost, not from the past, but
from the future. You felt that your life and
ambition had already been lived out for you and
you were looking at the relics of that life. You
were in a place where the future had come and
gone.

A bend in the river, V. S. Naipaul

21.1. Vertical decomposition


Given a set S of n segments in the plane, its arrangement, denoted
by A(S), is the decomposition of the plane into faces, edges, and vertices.
The vertices of A(S) are the endpoints and the intersection points of the
segments of S, the edges are the maximal connected portions of the segments
not containing any vertex, and the faces are the connected components of
the complement of the union of the segments of S. These definitions are
depicted in the figure on the right (labeling a vertex, an edge, and a face).
For numerical reasons (and also conceptually), a symbolic representation
would be better than a numerical one. Thus, an intersection vertex would be represented by two pointers
to the segments whose intersection is this vertex. Similarly, an edge would be represented as a pointer
to the segment that contains it, and two pointers to the vertices forming its endpoints.
Naturally, we are assuming here that we have geometric primitives that can resolve any decision
problem of interest that involve a few geometric entities. For example, for a given segment s and a point
p, we would be interested in deciding if p lies vertically below s. From a theoretical point of view, all
these primitives require a constant amount of computation, and are “easy”. In the real world, numerical
issues and degeneracies make implementing these primitives surprisingly challenging. We are going to
ignore this major headache here, but the reader should be aware of it.

We will be interested in computing the arrangement A S and a representation of it that makes it

easy to manipulate. In particular, we would like to be able to quickly resolve questions of the type (i)
are two points in the same face?, (ii) can one traverse from one point to the other without crossing
any segment?, etc. The naive representation of each face as polygons (potentially with holes) is not
conducive to carrying out such tasks, since a polygon might be arbitrarily complicated. Instead, we will
prefer to break the arrangement into smaller canonical tiles.
To this end, a vertical trapezoid is a quadrangle with two vertical sides. The breaking of the faces
into such trapezoids is the vertical decomposition of the arrangement A S .

Formally, for a subset R ⊆ S, let A|(R) denote the vertical decomposition
of the plane formed by the arrangement A(R) of the segments of R. This is
the partition of the plane into interior disjoint vertical trapezoids formed by
erecting vertical walls through each vertex of A(R). Formally, a vertex of
A|(R) is either an endpoint of a segment of R or an intersection point of two of
its segments. From each such vertex we shoot up (similarly, down) a vertical
ray till it hits a segment of R or it continues all the way to infinity. See the
figure on the right (showing one such trapezoid σ).
Note that a vertical trapezoid is defined by at most four segments: two segments defining its ceiling
and floor and two segments defining the two intersection points that induce the two vertical walls on its
boundary. Of course, a vertical trapezoid might be degenerate and thus be defined by fewer segments
(i.e., an unbounded vertical trapezoid or a triangle with a vertical segment as one of its sides).
Vertical decomposition breaks the faces of the arrangement that might be arbitrarily complicated
into entities (i.e., vertical trapezoids) of constant complexity. This makes handling arrangements (de-
composed into vertical trapezoid) much easier computationally.
In the following, we assume that the n segments of S have k pairwise intersection points overall, and
we want to compute the arrangement A = A(S); namely, compute the edges, vertices, and faces of A(S).
One possible way is the following: Compute a random permutation of the segments of S: S = ⟨s1, . . . , sn⟩.
Let Si = ⟨s1, . . . , si⟩ be the prefix of length i of S. Compute A|(Si) from A|(Si−1), for i = 1, . . . , n. Clearly,
A|(Sn) = A|(S), and we can extract A(S) from it. Namely, in the ith iteration, we insert the segment si
into the arrangement A|(Si−1).


This technique of building the arrangement by inserting the segments one by one is called random-
ized incremental construction.

Who needs these pesky arrangements anyway? The reader might wonder who needs arrangements.
As a concrete example, consider a situation where you are given several maps of a city containing
different layers of information (i.e., a street map, a sewer map, an electric lines map, a train lines map, etc.).
We would like to compute the overlay map formed by putting all these maps on top of each other. For
example, we might be interested in figuring out if there are any buildings lying on a planned train line,
etc.
More generally, think about a set of general constraints in Rd . Each constraint is bounded by a
surface, or a patch of a surface. The decomposition of Rd formed by the arrangement of these surfaces
gives us a description of the parametric space in a way that is algorithmically useful. For example, finding
if there is a point inside all the constraints, when all the constraints are induced by linear inequalities,
is linear programming. Namely, arrangements are a useful way to think about any parametric space
partitioned by various constraints.

21.1.1. Randomized incremental construction (RIC)
Imagine that we had computed the arrangement Bi−1 = A| Si−1 . In the ith iteration we compute Bi

by inserting si into the arrangement Bi−1 . This involves splitting some trapezoids (and merging some
others).
As a concrete example, consider the figure on the right. Here
we insert s into the arrangement. To this end we split the “vertical
trapezoids” △pqt and △bqt, each into three trapezoids. The two
trapezoids σ′ and σ′′ now need to be merged together to form the
new trapezoid which appears in the vertical decomposition of the
new arrangement. (Note that the figure does not show all the trape-
zoids in the vertical decomposition.)
To facilitate this, we need to compute the trapezoids of  Bi−1 that intersect si . This is done by
maintaining a conflict graph. Each trapezoid σ ∈ A| Si−1 maintains a conflict list cl(σ) of all the
segments of S that intersect its interior. In particular, the conflict list of σ cannot contain any segment
of Si−1 , and as such it contains only the segments of S \ Si−1 that intersect its interior. We also maintain
|
a similar structure for each segment, listing all the trapezoids of A Si−1 that it currently intersects (in
its interior). We maintain those lists with cross pointers, so that given an entry (σ, s) in the conflict list
of σ, we can find the entry (s, σ) in the conflict list of s in constant time.
Thus, given si , we know what trapezoids need to be split (i.e., all the trapezoids in
cl(si )). Splitting a trapezoid σ by a segment si is the operation of computing a set of (at
most) four trapezoids that cover σ and have si on their boundary. We compute those new
trapezoids, and next we need to compute the conflict lists of the new trapezoids. This
can be easily done by taking the conflict list of a trapezoid σ ∈ cl(si) and distributing its
segments among the O(1) new trapezoids that cover σ. Using a careful implementation,
this requires time linear in the size of the conflict list of σ.
Note that only trapezoids that intersect si in their interior get split. Also, we need to update the
conflict lists for the segments (that were not inserted yet).
We next sketch the low-level details involved in maintaining these conflict lists. For a segment s that
intersects the interior of a trapezoid σ, we maintain the pair (s, σ). For every trapezoid σ, in the current
vertical decomposition, we maintain a doubly linked list of all such pairs that contain σ. Similarly, for
each segment s we maintain the doubly linked list of all such pairs that contain s. Finally, each such
pair contains two pointers to the location in the two respective lists where the pair is being stored.
It is now straightforward to verify that using this data-structure we can implement the required
operations in linear time in the size of the relevant conflict lists.
In the above description, we ignored the need to merge adjacent trapezoids if they have identical
floor and ceiling – this can be done by a somewhat straightforward and tedious implementation of the
vertical decomposition data-structure, by providing pointers between adjacent vertical trapezoids and
maintaining the conflict list sorted (or by using hashing) so that merge operations can be done quickly.
In any case, this can be done in linear time in the input/output size involved, as can be verified.
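The conflict-list bookkeeping is easiest to see in a much-simplified setting. The sketch below (not part of the original text) runs the incremental construction for a one-dimensional analogue in which the regions are atomic intervals of the line; a real implementation would locate the regions to split via the cross pointers just described, rather than by scanning all regions.

import random

def ric_intervals(points):
    pts = points[:]
    random.shuffle(pts)                     # random insertion order
    # a single initial region covering everything, conflicting with every point
    regions = {(float('-inf'), float('inf')): set(pts)}
    for p in pts:
        # find the region whose conflict list contains p (the region being split)
        key = next(k for k, v in regions.items() if p in v)
        conflict = regions.pop(key)
        lo, hi = key
        # split the region and redistribute its conflict list (linear in its size)
        regions[(lo, p)] = {q for q in conflict if q < p}
        regions[(p, hi)] = {q for q in conflict if q > p}
    return regions

print(len(ric_intervals([random.random() for _ in range(10)])))   # 11 regions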

21.1.1.1. Analysis
Claim 21.1.1. The (amortized) running time of constructing Bi from Bi−1 is proportional to the size
of the conflict lists of the vertical trapezoids in Bi \ Bi−1 (and the number of such new trapezoids).

Proof: Observe that we can charge all the work involved in the ith iteration to either the conflict lists of
the newly created trapezoids or the deleted conflict lists. Clearly, the running time of the algorithm in
the ith iteration is linear in the total size of these conflict lists. Observe that every conflict gets charged
twice – when it is being created and when it is being deleted. As such, the (amortized) running time in
the ith iteration is proportional to the total length of the newly created conflict lists. 

Thus, to bound the running time of the algorithm, it is enough to bound the expected size of the
destroyed conflict lists in ith iteration (and sum this bound on the n iterations carried out by the
algorithm). Or alternatively, bound the expected size of the conflict lists created in the ith iteration.
Lemma 21.1.2. Let S be a set of n segments (in general position¬ ) with k intersection points. Let Si
be the first i segments in a random permutation of S. The expected size of Bi = A|(Si), denoted by τ(i)
(i.e., the number of trapezoids in Bi), is O( i + k (i/n)² ).

Proof: Consider­ an intersection point p = s ∩ s′, where s, s′ ∈ S. The probability that p is present in
A|(Si) is equivalent to the probability that both s and s′ are in Si. This probability is

α = C(n−2, i−2) / C(n, i) = ( (n−2)! / ((i−2)! (n−i)!) ) · ( i! (n−i)! / n! ) = i(i−1) / (n(n−1)).

For each intersection point p in A(S) define an indicator variable Xp, which is 1 if the two segments
defining p are in the random sample Si and 0 otherwise. We have that E[Xp] = α, and as such, by
linearity of expectation, the expected number of intersection points in the arrangement A(Si) is

E[ Σ_{p∈V} Xp ] = Σ_{p∈V} E[Xp] = Σ_{p∈V} α = kα,

where V is the set of k intersection points of A(S). Also, every segment of Si contributes its two endpoints
to the arrangement A(Si). Thus, we have that the expected number of vertices in A(Si) is

2i + ( i(i−1) / (n(n−1)) ) k.

Now, the number of trapezoids in A|(Si) is proportional to the number of vertices of A(Si), which implies
the claim.  ∎
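A quick Monte Carlo sanity check of the quantity α in the proof (not part of the original text): a fixed pair of segments, represented here simply by the labels 0 and 1, is contained in a random i-subset of the n segments with probability i(i − 1)/(n(n − 1)).

import random

n, i, trials = 50, 10, 200_000
hits = sum(1 for _ in range(trials)
           if {0, 1} <= set(random.sample(range(n), i)))
print(hits / trials, "vs", i * (i - 1) / (n * (n - 1)))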

21.1.2. Backward analysis


In the following, we would like to consider the total amount of work involved in the ith iteration of
the algorithm. The way to analyze these iterations is (conceptually) to run the algorithm for the first i
iterations and then run “backward” the last iteration.
¬ In this case, no two intersection points of input segments are the same, no two intersection points (or vertices) have
the same x-coordinate, no two segments lie on the same line, etc. Making the geometric algorithm work correctly for all
degenerate inputs is a huge task that can usually be handled by tedious and careful implementation. Thus, we will always
assume general position of the input. In other words, in theory all geometric inputs are inherently good, while in practice
they are all evil (as anybody who tried to implement geometric algorithms can testify). The reader is encouraged not to
use this to draw any conclusions on the human condition.
­ The proof is provided in excruciating detail to get the reader used to this kind of argumentation. I would apologize

for this pain, but it is a minor trifle, not to be mentioned, when compared to the other offenses in this book.

So, imagine that the overall size of the conflict lists of the trapezoids of Bi is Wi and the total size
of the conflict lists created only in the ith iteration is Ci .
We are interested in bounding the expected size of Ci , since this is (essentially) the amount of work
done by the algorithm in this iteration. Observe that the structure of Bi is defined independently of
the permutation Si and depends only on the (unordered) set Si = {s1, . . . , si }. So, fix Si . What is the
probability that si is a specific segment s of Si ? Clearly, this is 1/i since this is the probability of s being
the last element in a permutation of the i elements of Si (i.e., we consider a random permutation of Si ).
Now, consider a trapezoid σ ∈ Bi . If σ was created in the ith iteration, then si must be one of
the (at most four) segments that define it. Indeed, if si is not one of the segments that define σ, then
σ existed in the vertical decomposition before si was inserted. Since Bi is independent of the internal
ordering of Si, it follows that Pr[σ ∈ (Bi \ Bi−1)] ≤ 4/i. In particular, the overall size of the conflict lists
in the end of the ith iteration is

Wi = Σ_{σ∈Bi} |cl(σ)|.

As such, the expected overall size of the conflict lists created in the ith iteration is

E[ Ci | Bi ] ≤ Σ_{σ∈Bi} (4/i) |cl(σ)| ≤ (4/i) Wi.

By Lemma 21.1.2, the expected size of Bi is O( i + k i²/n² ). Let us guess (for the time being) that on
average the size of the conflict list of a trapezoid of Bi is about O(n/i). In particular, assume that we
know that

E[Wi] = O( ( i + (i²/n²) k ) · (n/i) ) = O( n + k (i/n) ),

by Lemma 21.1.2, implying

E[Ci] = E[ E[ Ci | Bi ] ] ≤ E[ (4/i) Wi ] = (4/i) E[Wi] = O( (4/i) ( n + k (i/n) ) ) = O( n/i + k/n ),    (21.1)

using Lemma 8.1.2p73 . In particular, the expected (amortized) amount of work in the ith iteration is
proportional to E[Ci]. Thus, the overall expected running time of the algorithm is

E[ Σ_{i=1}^{n} Ci ] = O( Σ_{i=1}^{n} ( n/i + k/n ) ) = O( n log n + k ).

Theorem 21.1.3. Given a set  S of n segments in the plane with k intersections, one can compute the
vertical decomposition of A S in expected O(n log n + k) time.

Intuition and discussion. What remains to be seen is how we came up with the guess that the
average size of a conflict list of a trapezoid of Bi is about O(n/i). Note that using ε-nets implies that
the bound O((n/i) log i) holds with constant probability (see Theorem 20.3.4p155 ) for all trapezoids in
this arrangement. As such, this result is only slightly surprising. To prove this, we present in the next
section a “strengthening” of ε-nets to geometric settings.
To get some intuition on how we came up with this guess, consider a set P of n points on the line
and a random sample R of i points from P. Let Î be the partition of the real line into (maximal) open
intervals by the endpoints of R, such that these intervals do not contain points of R in their interior.
Consider an interval (i.e., a one-dimensional trapezoid) of Î. It is intuitively clear that this interval
(in expectation) would contain O(n/i) points. Indeed, fix a point x on the real line, and imagine that

we pick each point with probability i/n to be in the random sample. The random variable which is the
number of points of P we have to scan starting from x and going to the right of x till we “hit” a point
that is in the random sample behaves like a geometric variable with probability i/n, and as such its
expected value is n/i. The same argument works if we scan P to the left of x. We conclude that the
number of points of P in the interval of Î that contains x but does not contain any point of R is O(n/i)
in expectation.
Of course, the vertical decomposition case is more involved, as each vertical trapezoid is defined
by four input segments. Furthermore, the number of possible vertical trapezoids is larger. Instead of
proving the required result for this special case, we will prove a more general result which can be applied
in a lot of other settings.

21.2. General settings


21.2.1. Notation
Let S be a set of objects. For a subset R ⊆ S, we define a collection of ‘regions’ called F (R). For the
case of vertical decomposition of segments (i.e., Theorem 21.1.3), the objects are segments, the regions
are trapezoids, and F(R) is the set of vertical trapezoids in A|(R). Let

T = T(S) = ∪_{R ⊆ S} F(R)

denote the set of all possible regions defined by subsets of S.
In the vertical trapezoids case, the set T is the set of all vertical trapezoids that can be defined by
any subset of the given input segments.
We associate two subsets D(σ), K(σ) ⊆ S with each region σ ∈ T.
The defining set D(σ) of σ is the subset of S defining the region
σ (the precise requirements from this set are specified in the axioms
below). We assume that for every σ ∈ T, |D(σ)| ≤ d for a (small)
constant d. The constant d is sometimes referred to as the combinatorial dimension. In the case of Theorem 21.1.3, each trapezoid σ is defined by at most four segments
(or lines) of S that define the region covered by the trapezoid σ, and this set of segments is D(σ). See
Figure 21.1.
[Figure 21.1: a trapezoid σ in an arrangement of segments a, . . . , f, with D(σ) = {b, c, d, e} and K(σ) = {f}.]
The stopping set K(σ) of σ is the set of objects of S such that including any object of K(σ) in R
prevents σ from appearing in F (R). In many applications K(σ) is just the set of objects intersecting
the cell σ; this is also the case in Theorem 21.1.3, where K(σ) is the set of segments of S intersecting
the interior of the trapezoid σ (see Figure 21.1). Thus, the stopping set of a region σ, in many cases,
is just the conflict list of this region, when it is being created by an RIC algorithm. The weight of σ is
ω(σ) = |K(σ)|.

Axioms. Let S, F (R), D(σ), and K(σ) be such that for any subset R ⊆ S, the set F (R) satisfies the
following axioms:
(i) For any σ ∈ F (R), we have D(σ) ⊆ R and R ∩ K(σ) = ∅.
(ii) If D(σ) ⊆ R and K(σ) ∩ R = ∅, then σ ∈ F (R).

21.2.1.1. Examples of the general framework

(A) Vertical decomposition. Discussed above.


(B) Points on a line. Let S be a set of n points on the real line. For a set R ⊆ S, let F (R) be the set
of atomic intervals of the real line formed by R; that is, the partition of the real line into maximal
connected sets (i.e., intervals and rays) that do not contain a point of R in their interior.
Clearly, in this case, for an interval I ∈ F (R), the defining set D(I) is the set containing the
(one or two) endpoints of I in R. The stopping set of I is the set K(I), which is the set of all
points of S contained in I. (A code sketch of this example appears right after this list.)
(C) Vertices of the convex-hull in 2d. Consider a set S of n points in the plane. A vertex on the
convex hull is defined by the point defining the vertex, and the two edges before and after it on the
convex hull. To this end, a certified vertex of the convex hull (say this vertex is q) is a triplet
(p, q, r), such that p, q and r are consecutive vertices of CH(S) (say, in clockwise order). Observe
that computing the convex-hull of S is equivalent to computing the set of certified vertices of S.
For a set R ⊆ S, let F (R) denote the set of certified vertices of R (i.e., this is equivalent to the set of
vertices of the convex-hull of R). For a certified vertex σ ∈ F (R), its defining set is the set of three
vertices p, q, r that (surprise, surprise) define it. Its stopping set is the set of all points in S that lie
either on the “wrong” side of the line spanning pq, or on the “wrong” side of the line spanning qr.
Equivalently, K(σ) is the set of all points t ∈ S \ R, such that the convex-hull of p, q, r, and t is
not a convex quadrilateral.
(D) Edges of the convex-hull in 3d.
Let S be a set of points in three dimensions. An edge e of the convex-hull of a set R ⊆ S of points
in R³ is defined by two vertices of S, and it can be certified as being on the convex hull
CH(R) by the two faces f, f′ adjacent to e. If all the points of R are on the “right” side of both
these two faces then e is an edge of the convex hull of R. Computing all the certified edges of S is
equivalent to computing the convex-hull of S.
In the following, assume that each face of any convex-hull of a subset of points of S is a triangle.
As such, a face of the convex-hull would be defined by three points. Formally, the butterfly of an
edge e of CH(R) is (e, p, q), where p, q ∈ R, such that all the points of R are on the same side
as q of the plane spanned by e and p (and, symmetrically, all the points of R are on the same side
as p of the plane spanned by e and q).
For a set R ⊆ S, let F (R) be its set of butterflies. Clearly, computing all the butterflies of S (i.e.,
F (S)) is equivalent to computing the convex-hull of S.
For a butterfly σ = (e, p, q) ∈ F (R), its defining set (i.e., D(σ)) is a set of four points (i.e., the two
points defining its edge e, and the two additional vertices p and q defining the two adjacent faces).
Its stopping set K(σ) is the set of all the points of S \ R that lie on the opposite side of the plane
spanned by e and p (resp. e and q) from q (resp. p) [here, the stopping set is the union of these
two sets].
(E) Delaunay triangles in 2d.
For a set S of n points in the plane, consider a subset R ⊆ S. A Delaunay circle of R is a disc
D that has three points p1, p2, p3 of R on its boundary, and no points of R in its interior. Naturally,
these three points define a Delaunay triangle 4 = 4p1 p2 p3 . The defining set is D(4) = {p1, p2, p3 },
and the stopping set K(4) is the set of all points in S that are contained in the interior of the disk
D.

21.2.2. Analysis
In the following, S is a set of n objects complying with axioms (i) and (ii).
The challenge. What makes the analysis not easy is that there are dependencies between the defining
set of a region and its stopping set (i.e., conflict list). In particular, we have the following difficulties
(A) The defining set might be of different sizes depending on the region σ being considered.
(B) Even if all the regions have a defining set of the same size d (say, 4 as in the case of vertical
trapezoids), it is not true that every d objects define a valid region. For example, for the case
of segments, the four segments might be vertically separated from each other (i.e., think about
them as being four disjoint intervals on the real line), and they do not define a vertical trapezoid
together. Thus, our analysis is going to be a bit loopy loop – we are going to assume we know
how many regions exists (in expectation) for a random sample of certain size, and use this to
derive the desired bounds.

21.2.2.1. On the probability of a region to be created


Inherently, to analyze a randomized algorithm using this framework, we will be interested in the prob-
ability that a certain region would be created. Thus, let
ρr,n (d, k)
denote the probability that a region σ ∈ T appears in F (R), where its defining set is of size d, its
stopping set is of size k, R is a random sample of size r from a set S, and n = |S|. Specifically, σ is a
feasible region that might be created by an algorithm computing F (R).
The sampling model. For describing algorithms it is usually easier to work with samples created
by picking a subset of a certain size (without repetition) from the original set of objects. Usually, in
the algorithmic applications this would be done by randomly permuting the objects and interpreting
a prefix of this permutation as a random sample. Insisting on analyzing this framework in the “right”
sampling model creates some non-trivial technical pain.
Lemma 21.2.1. We have that ρr,n(d, k) ≈ (1 − r/n)^k (r/n)^d. Formally,

(1/2^{2d}) · (1 − 4·(r/n))^k · (r/n)^d ≤ ρr,n(d, k) ≤ 2^{2d} · (1 − (1/2)·(r/n))^k · (r/n)^d.    (21.2)
Proof: Let σ be the region under consideration that is defined by d objects and having k stoppers (i.e.,
k = |K(σ)|). We are interested in the probability of σ being created when taking a sample of size r
(without repetition) from a set S of n objects. Clearly, this probability is ρr,n(d, k) = C(n−d−k, r−d) / C(n, r), as we
tedious but careful calculation, delegated to Section 21.4, implies Eq. (21.2).
Instead, here is an elegant argument for why this estimate is correct in a slightly different sampling
model. We pick every element of S into the sample R with probability r/n, and this is done independently
for each object. In expectation, the random sample is of size r, and clearly the probability that σ is
created is the probability that we pick its d defining objects (that is, (r/n)d ) multiplied by the probability
that we did not pick any of its k stoppers (that is, (1 − r/n) k ). 
Remark 21.2.2. The bounds of Eq. (21.2) hold only when r, d, and k are in certain (reasonable) ranges.
For the sake of simplicity of exposition we ignore this minor issue. With care, all our arguments work
when one pays careful attention to this minor technicality.
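The exact formula can be checked numerically against the heuristic estimate. The snippet below (not part of the original text) compares ρr,n(d, k) = C(n−d−k, r−d)/C(n, r) with (r/n)^d (1 − r/n)^k for one arbitrary choice of parameters.

from math import comb

def rho(r, n, d, k):
    return comb(n - d - k, r - d) / comb(n, r)

n, r, d, k = 1000, 50, 4, 30
exact = rho(r, n, d, k)
estimate = (r / n) ** d * (1 - r / n) ** k
print(exact, estimate, exact / estimate)    # for these parameters the two agree closely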

21.2.2.2. On exponential decay
For any natural number r and a number t > 0, consider R to be a random sample of size r from S
without repetition. We will refer to a region σ ∈ F (R) as being t-heavy if ω(σ) ≥ t · (n/r). Let F≥t(R)
denote all the t-heavy regions of F (R).®
Intuitively, and somewhat incorrectly, we expect the average weight of a region of F (R) to be roughly
n/r. We thus expect the size of this set to drop fast as t increases. Indeed, Lemma 21.2.1 tells us that
a trapezoid of weight t(n/r) has probability

ρr,n( d, t·(n/r) ) ≈ (1 − r/n)^{t(n/r)} (r/n)^d ≈ exp(−t) · (r/n)^d ≈ exp(−t + 1) · (1 − r/n)^{n/r} (r/n)^d ≈ exp(−t + 1) · ρr,n(d, n/r)

to be created, since (1 − r/n)^{n/r} ≈ 1/e. Namely, a t-heavy region has exponentially lower probability to
be created than a region of weight n/r. We next formalize this argument.
Lemma 21.2.3. Let r ≤ n and let t be parameters, such that 1 ≤ t ≤ r/d. Furthermore, let R be a
sample of size r, and let R′ be a sample of size r′ = ⌊r/t⌋, both from S. Let σ ∈ T be a region with
weight ω(σ) ≥ t(n/r). Then, Pr[σ ∈ F (R)] = O( t^d exp(−t/2) · Pr[σ ∈ F (R′)] ).
Proof: For the sake of simplicity of exposition, assume that k = ω(σ) = t(n/r). By Lemma 21.2.1 (i.e.,
Eq. (21.2)) we have

Pr[σ ∈ F (R)] / Pr[σ ∈ F (R′)] = ρr,n(d, k) / ρr′,n(d, k) ≤ ( 2^{2d} (1 − (1/2)(r/n))^k (r/n)^d ) / ( (1/2^{2d}) (1 − 4 r′/n)^k (r′/n)^d )
   ≤ 2^{4d} exp( −kr/(2n) ) ( 1 + 8 r′/n )^k ( r/r′ )^d ≤ 2^{4d} exp( 8 k r′/n − kr/(2n) ) ( r/r′ )^d
   = 2^{4d} exp( 8 (t n/r) ⌊r/t⌋ / n − t/2 ) ( r/⌊r/t⌋ )^d = O( exp(−t/2) t^d ),

since 1/(1 − x) ≤ 1 + 2x for x ≤ 1/2 and 1 + y ≤ exp(y), for all y. (The constant in the above O(·) depends
exponentially on d.)  ∎

Let
Ef (r) = E[|F (R)|] and Ef≥t (r) = E[|F≥t (R)|] ,
where the expectation is over random subsets R ⊆ S of size r. Note that Ef (r) = Ef≥0 (r) is the expected
number of regions created by a random sample of size r. In words, Ef≥t (r) is the expected number of
regions in a structure created by a sample of r random objects, such that these regions have weight
which is t times larger than the “expected” weight (i.e., n/r). In the following, we assume that Ef (r) is
a monotone increasing function.
Lemma 21.2.4 (The exponential decay lemma). Given a set S of n objects and parameters r ≤ n
and 1 ≤ t ≤ r/d, where d = maxσ∈T (S) |D(σ)|, if axioms (i) and (ii) above hold for any subset of S, then
 
Ef≥t(r) = O( t^d exp(−t/2) Ef(r) ).    (21.3)
® These are the regions that are at least t times overweight. Speak about an obesity problem.

Proof: Let R be a random sample of size r from S and let R′ be a random sample of size r′ = ⌊r/t⌋ from
S. Let H = ∪_{X ⊆ S, |X| = r} F≥t(X) denote the set of all t-heavy regions that might be created by a sample of
size r. In the following, the expectation is taken over the content of the random samples R and R′.
For a region σ, let Xσ be the indicator variable that is 1 if and only if σ ∈ F (R). By linearity of
expectation and since E[Xσ] = Pr[σ ∈ F (R)], we have

Ef≥t(r) = E[ |F≥t(R)| ] = E[ Σ_{σ∈H} Xσ ] = Σ_{σ∈H} E[Xσ] = Σ_{σ∈H} Pr[σ ∈ F (R)]
   = O( t^d exp(−t/2) Σ_{σ∈H} Pr[σ ∈ F (R′)] ) = O( t^d exp(−t/2) Σ_{σ∈T} Pr[σ ∈ F (R′)] )
   = O( t^d exp(−t/2) Ef(r′) ) = O( t^d exp(−t/2) Ef(r) ),

by Lemma 21.2.3 and since Ef(r) is a monotone increasing function.  ∎

21.2.2.3. Bounding the moments


Consider a different randomized algorithm that in a first round samples r objects, R ⊆ S (say, segments),
computes the arrangement induced by these r objects (i.e., A|(R)), and then inside each region σ it
computes the arrangement of the ω(σ) objects intersecting the interior of this region, using an algorithm
that takes O((ω(σ))^c) time, where c > 0 is some fixed constant. The overall expected running time of
this algorithm is

E[ Σ_{σ∈F (R)} (ω(σ))^c ].
We are now able to bound this quantity.
Theorem 21.2.5 (Bounded moments theorem). Let R ⊆ S be a random subset of size r. Let
Ef(r) = E[|F (R)|] and let c ≥ 1 be an arbitrary constant. Then,

E[ Σ_{σ∈F (R)} (ω(σ))^c ] = O( (n/r)^c Ef(r) ).
 
Proof: Let R ⊆ S be a random sample of size r. Observe that all the regions with weight in the range
[ (t−1)(n/r), t·(n/r) ) are in the set F≥t−1(R) \ F≥t(R). As such, we have by Lemma 21.2.4 that

E[ Σ_{σ∈F (R)} ω(σ)^c ] ≤ E[ Σ_{t≥1} ( t·(n/r) )^c ( |F≥t−1(R)| − |F≥t(R)| ) ] ≤ E[ Σ_{t≥1} ( t·(n/r) )^c |F≥t−1(R)| ]
   ≤ (n/r)^c Σ_{t≥0} (t+1)^c · E[ |F≥t(R)| ]
   = (n/r)^c Σ_{t≥0} (t+1)^c Ef≥t(r) = O( (n/r)^c Σ_{t≥0} (t+1)^{c+d} exp(−t/2) Ef(r) )
   = O( (n/r)^c Ef(r) Σ_{t≥0} (t+1)^{c+d} exp(−t/2) ) = O( (n/r)^c Ef(r) ),

since c and d are both constants.  ∎
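As a quick empirical illustration (not part of the original text), the one-dimensional interval setting gives a feel for the theorem: for a random r-subset of n points on a line, the sum over atomic intervals of the c-th power of the conflict-list size is, on average, within a constant factor of (n/r)^c times the number of intervals.

import random

def moment(n, r, c, trials=2000):
    pts = list(range(n))
    total = 0.0
    for _ in range(trials):
        R = sorted(random.sample(pts, r))
        cuts = [-1] + R + [n]
        # conflict-list size of an atomic interval = points strictly between cuts
        sizes = [cuts[i + 1] - cuts[i] - 1 for i in range(len(cuts) - 1)]
        total += sum(s ** c for s in sizes)
    return total / trials

n, r, c = 1000, 20, 2
print(moment(n, r, c), "vs", (n / r) ** c * (r + 1))   # same order of magnitude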

21.3. Applications
21.3.1. Analyzing the RIC algorithm for vertical decomposition
We remind the reader that the input of the algorithm of Section 21.1.2 is a set S of n segments with k
intersections, and it uses randomized incremental construction to compute the vertical decomposition
of the arrangement A(S).
Lemma 21.1.2 shows that the number of vertical trapezoids in the randomized incremental construc-
tion is in expectation Ef(i) = O( i + k (i/n)² ). Thus, by Theorem 21.2.5 (used with c = 1), we have that
the total expected size of the conflict lists of the vertical decomposition computed in the ith step is

E[Wi] = E[ Σ_{σ∈Bi} ω(σ) ] = O( (n/i) Ef(i) ) = O( n + k (i/n) ).

This is the missing piece in the analysis of Section 21.1.2. Indeed, the amortized work in the ith step
of the algorithm is O(Wi /i) (see Eq. (21.1)p175 ), and as such, the expected running time of this algorithm
is
E[ Σ_{i=1}^{n} O(Wi/i) ] = O( Σ_{i=1}^{n} (1/i) ( n + k (i/n) ) ) = O( n log n + k ).
i n

This implies Theorem 21.1.3.

21.3.2. Cuttings
Let S be a set of n lines in the plane, and let r be an arbitrary parameter. A (1/r)-cutting of S is a
partition of the plane into constant complexity regions such that each region intersects at most n/r lines
of S. It is natural to try to minimize the number of regions in the cutting, as cuttings are a natural tool
for performing “divide and conquer”.
Consider the range space having S as its ground set and vertical trapezoids as its ranges (i.e., given
a vertical trapezoid σ, its corresponding range is the set of all lines of S that intersect the interior of
σ). This range space has a VC dimension which is a constant as can be easily verified. Let X ⊆ S be
an ε-net for this range space, for ε = 1/r. By Theorem 20.3.4p155 (ε-net theorem), there exists such an
ε-net X of this range space, of size O((1/ε) log(1/ε)) = O(r log r). In fact, Theorem 20.3.4p155 states that
an appropriate random sample is an ε-net with non-zero probability, which implies, by the probabilistic
method, that such a net (of this size) exists.

Lemma 21.3.1. There exists a (1/r)-cutting of a set of lines S in the plane of size O( (r log r)² ).


Proof: Consider the vertical decomposition A|(X), where X is as above. We claim that this collection
of trapezoids is the desired cutting.
The bound on the size is immediate, as the complexity of A|(X) is O(|X|²) and |X| = O(r log r).


As for correctness, consider a vertical trapezoid σ in the arrangement A| X . It does not intersect

any of the lines of X in its interior, since it is a trapezoid in the vertical decomposition A| X . Now, if
σ intersected more than n/r lines of S in its interior, where n = |S|, then it must be that the interior of
σ intersects one of the lines of X, since X is an ε-net for S, a contradiction.
It follows that σ intersects at most εn = n/r lines of S in its interior. 

Claim 21.3.2. Any (1/r)-cutting in the plane of n lines contains at least Ω(r²) regions.

Proof: An arrangement of n lines (in general position) has M = C(n, 2) intersections. However, the number
of intersections of the lines intersecting a single region in the cutting is at most m = C(n/r, 2). This implies
that any cutting must be of size at least M/m = Ω( n²/(n/r)² ) = Ω(r²).  ∎

We can get cuttings of size matching the above lower bound using the moments technique.

Theorem 21.3.3. Let S be a set of n lines in the plane, and let r be a parameter. One can compute a (1/r)-cutting of S of size O(r^2).

Proof: Let R ⊆ S be a random sample of size r, and consider its vertical decomposition A|R. If a vertical trapezoid σ ∈ A|R intersects at most n/r lines of S, then we can add it to the output cutting.
The other possibility is that σ intersects t(n/r) lines of S, for some t > 1. Let cl(σ) ⊂ S be the conflict list of σ (i.e., the list of lines of S that intersect the interior of σ). Clearly, a (1/t)-cutting for the set cl(σ) forms a vertical decomposition (clipped inside σ) such that each trapezoid in this cutting intersects at most n/r lines of S. Thus, we compute such a cutting inside each such “heavy” trapezoid using the algorithm (implicit in the proof) of Lemma 21.3.1, and add these subtrapezoids to the resulting cutting. Clearly, the size of the resulting cutting inside σ is O((t log t)^2) = O(t^4). The resulting two-level partition is clearly the required cutting. By Theorem 21.2.5, the expected size of the cutting is
$$O\Biggl(\mathbf{E}f(r) + \mathbf{E}\Biggl[\sum_{\sigma\in\mathcal{F}(R)}\Bigl(\frac{\omega(\sigma)}{n/r}\Bigr)^{4}\Biggr]\Biggr) = O\Biggl(\mathbf{E}f(r) + \Bigl(\frac{r}{n}\Bigr)^{4}\,\mathbf{E}\Biggl[\sum_{\sigma\in\mathcal{F}(R)}\bigl(\omega(\sigma)\bigr)^{4}\Biggr]\Biggr) = O\Biggl(\mathbf{E}f(r) + \Bigl(\frac{r}{n}\Bigr)^{4}\cdot\Bigl(\frac{n}{r}\Bigr)^{4}\,\mathbf{E}f(r)\Biggr) = O\bigl(\mathbf{E}f(r)\bigr) = O(r^2),$$
since Ef(r) is proportional to the complexity of A(R), which is O(r^2). □

21.4. Bounds on the probability of a region to be created


Here we prove Lemma 21.2.1p178 in the “right” sampling model. The casual reader is encouraged to skip
this section, as it contains mostly tedious (and not very insightful) calculations.
Let S be a given set of n objects. Let ρr,n (d, k) be the probability that a region σ ∈ T whose defining
set is of size d and whose stopping set is of size k appears in F (R), where R is a random sample from S
of size r (without repetition).
Lemma 21.4.1. We have
$$\rho_{r,n}(d,k) = \frac{\binom{n-d-k}{r-d}}{\binom{n}{r}} = \frac{\binom{n-d-k}{r-d}}{\binom{n}{r-d}}\cdot\frac{\binom{r}{d}}{\binom{n-(r-d)}{d}} = \frac{\binom{n-d-k}{r-d}}{\binom{n-d}{r-d}}\cdot\frac{\binom{r}{d}}{\binom{n}{d}}.$$

Proof: So, consider a region σ with d defining objects in D(σ) and k detractors in K(σ). We have to
pick the d defining objects of D(σ) to be in the random sample R of size r but avoid picking any of the
k objects of K(σ) to be in R.
The second part follows since $\binom{n}{r} = \binom{n}{r-d}\binom{n-(r-d)}{d}\big/\binom{r}{d}$. Indeed, for the right-hand side first pick a sample of size r − d and then a sample of size d from the remaining objects. Merging the two random samples, we get a random sample of size r. However, since we do not care if an object is in the first sample or the second sample, we observe that every such random sample is being counted $\binom{r}{d}$ times.
The third part is easier, as it follows from $\binom{n}{r-d}\binom{n-(r-d)}{d} = \binom{n}{d}\binom{n-d}{r-d}$. The two sides count the different ways to pick two subsets from a set of size n, the first one of size d and the second one of size r − d. □
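For concreteness, the equalities of Lemma 21.4.1 are easy to check numerically. The following short Python sketch (ours, not part of the argument) evaluates the three forms with exact binomial coefficients for arbitrary test values:

    from math import comb
    from fractions import Fraction

    def rho(n, r, d, k):
        # probability of Lemma 21.4.1, first form
        return Fraction(comb(n - d - k, r - d), comb(n, r))

    def rho_second(n, r, d, k):
        return Fraction(comb(n - d - k, r - d), comb(n, r - d)) * Fraction(comb(r, d), comb(n - (r - d), d))

    def rho_third(n, r, d, k):
        return Fraction(comb(n - d - k, r - d), comb(n - d, r - d)) * Fraction(comb(r, d), comb(n, d))

    n, r, d, k = 100, 12, 3, 7   # arbitrary values with 2d <= r <= n/8 and k <= n/2
    assert rho(n, r, d, k) == rho_second(n, r, d, k) == rho_third(n, r, d, k)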
Lemma 21.4.2. For M ≥ m ≥ t ≥ 0, we have
$$\Bigl(\frac{m-t}{M-t}\Bigr)^{t} \le \frac{\binom{m}{t}}{\binom{M}{t}} \le \Bigl(\frac{m}{M}\Bigr)^{t}.$$

Proof: We have that
$$\alpha = \frac{\binom{m}{t}}{\binom{M}{t}} = \frac{m!}{(m-t)!\,t!}\cdot\frac{(M-t)!\,t!}{M!} = \frac{m}{M}\cdot\frac{m-1}{M-1}\cdots\frac{m-t+1}{M-t+1}.$$
Now, since M ≥ m, we have that $\frac{m-i}{M-i} \le \frac{m}{M}$, for all i ≥ 0. As such, the maximum (resp. minimum) fraction on the right-hand side is m/M (resp. $\frac{m-t+1}{M-t+1}$). As such, we have $\bigl(\frac{m-t}{M-t}\bigr)^{t} \le \bigl(\frac{m-t+1}{M-t+1}\bigr)^{t} \le \alpha \le (m/M)^{t}$. □
Lemma 21.4.3. Let 0 ≤ X, Y ≤ N. We have that $\bigl(1 - \frac{X}{N}\bigr)^{Y} \le \bigl(1 - \frac{Y}{2N}\bigr)^{X}$.

Proof: Since 1 − α ≤ exp(−α) ≤ 1 − α/2, for 0 ≤ α ≤ 1, it follows that
$$\Bigl(1-\frac{X}{N}\Bigr)^{Y} \le \exp\Bigl(-\frac{XY}{N}\Bigr) = \Bigl(\exp\Bigl(-\frac{Y}{N}\Bigr)\Bigr)^{X} \le \Bigl(1-\frac{Y}{2N}\Bigr)^{X}. \qquad\square$$

Lemma 21.4.4. For 2d ≤ r ≤ n/8 and k ≤ n/2, we have that
$$\frac{1}{2^{2d}}\Bigl(1-4\cdot\frac{r}{n}\Bigr)^{k}\Bigl(\frac{r}{n}\Bigr)^{d} \;\le\; \rho_{r,n}(d,k) \;\le\; 2^{2d}\Bigl(1-\frac{1}{2}\cdot\frac{r}{n}\Bigr)^{k}\Bigl(\frac{r}{n}\Bigr)^{d}.$$

Proof: By Lemma 21.4.1, Lemma 21.4.2, and Lemma 21.4.3 we have
$$\rho_{r,n}(d,k) = \frac{\binom{n-d-k}{r-d}}{\binom{n-d}{r-d}}\cdot\frac{\binom{r}{d}}{\binom{n}{d}} \le \Bigl(\frac{n-d-k}{n-d}\Bigr)^{r-d}\Bigl(\frac{r}{n}\Bigr)^{d} \le \Bigl(1-\frac{k}{n}\Bigr)^{r-d}\Bigl(\frac{r}{n}\Bigr)^{d} \le 2^{d}\Bigl(1-\frac{k}{n}\Bigr)^{r}\Bigl(\frac{r}{n}\Bigr)^{d} \le 2^{d}\Bigl(1-\frac{r}{2n}\Bigr)^{k}\Bigl(\frac{r}{n}\Bigr)^{d},$$
since k ≤ n/2. As for the other direction, by similar argumentation, we have
$$\rho_{r,n}(d,k) = \frac{\binom{n-d-k}{r-d}}{\binom{n}{r-d}}\cdot\frac{\binom{r}{d}}{\binom{n-(r-d)}{d}} \ge \Bigl(\frac{n-d-k-(r-d)}{n-(r-d)}\Bigr)^{r-d}\Bigl(\frac{r-d}{n-(r-d)-d}\Bigr)^{d}$$
$$= \Bigl(1-\frac{d+k}{n-(r-d)}\Bigr)^{r-d}\Bigl(\frac{r-d}{n-r}\Bigr)^{d} \ge \Bigl(1-\frac{d+k}{n/2}\Bigr)^{r}\Bigl(\frac{r/2}{n}\Bigr)^{d} \ge \frac{1}{2^{d}}\Bigl(1-\frac{4r}{n}\Bigr)^{d+k}\Bigl(\frac{r}{n}\Bigr)^{d} \ge \frac{1}{2^{2d}}\Bigl(1-\frac{4r}{n}\Bigr)^{k}\Bigl(\frac{r}{n}\Bigr)^{d},$$
by Lemma 21.4.3 (setting N = n/4, X = r, and Y = d + k) and since r ≥ 2d and 4r/n ≤ 1/2. □

21.5. Bibliographical notes
The technique described in this chapter is generally attributed to the work by Clarkson and Shor [CS89],
which is historically inaccurate as the technique was developed by Clarkson [Cla88]. Instead of mildly
confusing the matter by referring to it as the Clarkson technique, we decided to make sure to really
confuse the reader and refer to it as the moments technique. The Clarkson technique [Cla88] is in
fact more general and implies a connection between the number of “heavy” regions and “light” regions.
The general framework can be traced back to the earlier paper [Cla87]. This implies several beautiful
results, some of which we cover later in the book.
For the full details of the algorithm of Section 21.1, the interested reader is referred to the books [dBCKO08, BY98]. Interestingly, in some cases the merging stage can be skipped; see [Har00a].
Agarwal et al. [AMS98] presented a slightly stronger variant than the original version of Clarkson
[Cla88] that allows a region to disappear even if none of the members of its stopping set are in the
random sample. This stronger setting is used in computing the vertical decomposition of a single face
in an arrangement (instead of the whole arrangement). Here an insertion of a faraway segment of the
random sample might cut off a portion of the face of interest. In particular, in the settings of Agarwal
et al. Axiom (ii) is replaced by the following:

(ii) If σ ∈ F(R) and R′ is a subset of R with D(σ) ⊆ R′, then σ ∈ F(R′).

Interestingly, Clarkson [Cla88] did not prove Theorem 21.2.5 using the exponential decay lemma but
gave a direct proof. In fact, his proof implicitly contains the exponential decay lemma. We chose the
current exposition since it is more modular and provides a better intuition of what is really going on
and is hopefully slightly simpler. In particular, Lemma 21.2.1 is inspired by the work of Sharir [Sha03].
The exponential decay lemma (Lemma 21.2.4) was proved by Chazelle and Friedman [CF90]. The
work of Agarwal et al. [AMS98] is a further extension of this result. Another analysis was provided by
Clarkson et al. [CMS93].
Another way to reach similar results is using the technique of Mulmuley [Mul94], which relies on
a direct analysis on ‘stoppers’ and ‘triggers’. This technique is somewhat less convenient to use but is
applicable to some settings where the moments technique does not apply directly. Also, his concept of
the omega function might explain why randomized incremental algorithms perform better in practice
than their worst case analysis [Mul89].
Backwards analysis in geometric settings was first used by Chew [Che86] and was formalized by
Seidel [Sei93]. It is similar to the “leave one out” argument used in statistics for cross validation. The
basic idea was probably known to the Greeks (or Russians or French) at some point in time.
(Naturally, our summary of the development is cursory at best and not necessarily accurate, and all
possible disclaimers apply. A good summary is provided in the introduction of [Sei93].)
Sampling model. As a rule of thumb all the different sampling approaches are similar and yield similar
results. For example, we used such an alternative sampling approach in the “proof” of Lemma 21.2.1.
It is a good idea to use whichever sampling scheme is the easiest to analyze when figuring out what is going on. Of course, a formal proof requires analyzing the algorithm in the sampling model it uses.
Lazy randomized incremental construction. If one wants to compute a single face that contains a
marking point in an arrangement of curves, then the problem in using randomized incremental construc-
tion is that as you add curves, the region of interest shrinks, and regions that were maintained should be
ignored. One option is to perform flooding in the vertical decomposition to figure out what trapezoids
are still reachable from the marking point and maintaining only these trapezoids in the conflict graph.
Doing it in each iteration is way too expensive, but luckily one can use a lazy strategy that performs this

cleanup only a logarithmic number of times (i.e., you perform a cleanup in an iteration if the iteration
number is, say, a power of 2). This strategy complicates the analysis a bit; see [dBDS95] for more de-
tails on this lazy randomized incremental construction technique. An alternative technique was
suggested by the author for the (more restricted) case of planar arrangements; see [Har00b]. The idea
is to compute only what the algorithm really needs to compute the output, by computing the vertical
decomposition in an exploratory online fashion. The details are unfortunately overwhelming although
the algorithm seems to perform quite well in practice.
Cuttings. The concept of cuttings was introduced by Clarkson. The first optimal size cuttings were
constructed by Chazelle and Friedman [CF90], who proved the exponential decay lemma to this end.
Our elegant proof follows the presentation by de Berg and Schwarzkopf [dBS95]. The problem with this
approach is that the constant involved in the cutting size is awful¯. Matoušek [Mat98] showed that there are (1/r)-cuttings with 8r^2 + 6r + 4 trapezoids, by using level approximation. A different approach
was taken by the author [Har00a], who showed how to get cuttings which seem to be quite small (i.e.,
constant-wise) in practice. The basic idea is to do randomized incremental construction but at each
iteration greedily add all the trapezoids with conflict list small enough to the cutting being output.
One can prove that this algorithm also generates O(r 2 ) cuttings, but the details are not trivial as the
framework described in this chapter is not applicable for analyzing this algorithm.
Cuttings also can be computed in higher dimensions for hyperplanes. In the plane, cuttings can also
be computed for well-behaved curves; see [SA95].
Another fascinating concept is shallow cuttings. These are cuttings covering only portions of the
arrangement that are in the “bottom” of the arrangement. Matoušek came up with the concept [Mat92].
See [AES99, CCH09] for extensions and applications of shallow cuttings.
Even more on randomized algorithms in geometry. We have only scratched the surface of this
fascinating topic, which is one of the cornerstones of “modern” computational geometry. The interested
reader should have a look at the books by Mulmuley [Mul94], Sharir and Agarwal [SA95], Matoušek
[Mat02], and Boissonnat and Yvinec [BY98].

21.6. Exercises
Exercise 21.6.1 (Convex hulls incrementally). Let P be a set of n points in the plane.
(A) Describe a randomized incremental algorithm for computing the convex hull CH (P). Bound the
expected running time of your algorithm.
(B) Assume that for any subset of P, its convex hull has complexity t (i.e., the convex hull of the subset
has t edges). What is the expected running time of your algorithm in this case? If your algorithm
is not faster for this case (for example, think about the case where t = O(log n)), describe a variant
of your algorithm which is faster for this case.

Exercise 21.6.2 (Compressed quadtree made incremental). Given a set P of n points in Rd , describe a
randomized incremental algorithm for building a compressed quadtree for P that works in expected
O(dn log n) time. Prove the bound on the running time of your algorithm.

¯ This is why all computations related to cuttings should be done on a waiter’s bill pad. As Douglas Adams put it:

“On a waiter’s bill pad, reality and unreality collide on such a fundamental level that each becomes the other and anything
is possible, within certain parameters.”

Chapter 22

Primality testing

“The world is what it is; men who are nothing, who allow themselves to become nothing, have
no place in it.”
— Bend in the river, V.S. Naipaul

Introduction – how to read this write-up


In this note, we present a simple randomized algorithm for primality testing. The challenge is that it requires a non-trivial amount of number theory, which is not the purpose of this course. Nevertheless, this note is more or less self contained, and all necessary background is provided (assuming some basic mathematical familiarity with groups, fields and modulo arithmetic). It is however not really necessary to understand all the number theory material, and the reader can take it as given. In particular, I recommend reading the number theory background part without reading all of the proofs (at least on a first reading). Naturally, for a complete and total understanding of this material one needs to read everything carefully.
The description of the primality testing algorithm in this write-up is not minimal – there are shorter
descriptions out there. However, it is modular – assuming the number theory machinery used is correct,
the algorithm description is relatively straightforward.

22.1. Number theory background

22.1.1. Modulo arithmetic


22.1.1.1. Prime and coprime

For integer numbers x and y, let x | y denote that x divides y. The greatest common divisor (gcd) of
two numbers x and y, denoted by gcd(x, y), is the largest integer that divides both x and y. The least
common multiple (lcm) of x and y, denoted by lcm(x, y) = x y/gcd(x, y), is the smallest integer α, such
that x | α and y | α. An integer number p > 0 is prime if it is divisible only by 1 and itself (we will
consider 1 not to be prime).

Some standard definitions:

x, y are coprime ⇐⇒ gcd(x, y) = 1,


quotient of x/y ⇐⇒ x div y = bx/yc ,
remainder of x/y ⇐⇒ x mod y = x − y bx/yc .

The remainder x mod y is sometimes referred to as residue.

22.1.1.2. Computing gcd

Computing the gcd of two numbers is a classical algorithm, see the code below – proving that it indeed returns the right result follows by an easy induction.

    EuclidGCD(a, b):
        if (b = 0)
            return a
        else
            return EuclidGCD(b, a mod b)

It is easy to verify that if the input is made out of log n bits, then this algorithm takes O(poly(log n)) time (i.e., it is polynomial in the input size). Indeed, doing basic operations on numbers (i.e., multiplication, division, addition, subtraction, etc) with a total of ℓ bits takes O(ℓ^2) time (naively – faster algorithms are known).

Exercise 22.1.1. Show that gcd(Fn, Fn−1 ) = 1, where Fi is the ith Fibonacci number. Argue that for two
consecutive Fibonacci numbers EuclidGCD(Fn, Fn−1 ) takes O(n) time, if every operation takes O(1)
time.

Lemma 22.1.2. For all integers α, β > 0, there are integers x and y, such that gcd(α, β) = αx + βy, and they can be computed in polynomial time; that is, in O(poly(log α + log β)) time.

Proof: If α = β then the claim trivially holds. Otherwise, assume that α > β (otherwise, swap them), and observe that gcd(α, β) = gcd(α mod β, β). In particular, by induction, there are integers x′, y′, such that gcd(α mod β, β) = x′(α mod β) + y′β. However, α mod β = α − β⌊α/β⌋. As such, we have
    gcd(α, β) = gcd(α mod β, β) = x′(α − β⌊α/β⌋) + y′β = x′α + (y′ − x′⌊α/β⌋)β,
as claimed. The running time follows immediately by modifying EuclidGCD to compute these numbers. □
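That modification of EuclidGCD is the extended Euclidean algorithm. A minimal Python sketch (the name ext_gcd is ours): it returns a triple (g, x, y) with g = gcd(a, b) = xa + yb.

    def ext_gcd(a, b):
        # returns (g, x, y) with g = gcd(a, b) = x*a + y*b
        if b == 0:
            return a, 1, 0
        g, x, y = ext_gcd(b, a % b)
        # g = x*b + y*(a mod b) = y*a + (x - (a // b)*y)*b
        return g, y, x - (a // b) * y

    g, x, y = ext_gcd(91, 35)
    assert g == 7 and 91 * x + 35 * y == 7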

We use α ≡ β (mod n) or α ≡n β to denote that α and β are congruent modulo n; that is, α mod n = β mod n. Put differently, we have n | (α − β). The set ZZn = {0, . . . , n − 1} forms a group under addition modulo n (see Definition 22.1.9p190 for a formal definition of a group). The more interesting creature is ZZ∗n = {x | x ∈ {1, . . . , n} and gcd(x, n) = 1}, which is a group under multiplication modulo n.

Remark 22.1.3. Observe that ZZ∗1 = {1}, while for n > 1, ZZ∗n does not contain n.

Lemma 22.1.4. For any element α ∈ ZZ∗n , there exists a unique inverse element β = α−1 ∈ ZZ∗n such
that α ∗ β ≡n 1. Furthermore, the inverse can be computed in polynomial time¬ .
¬ Again, as is everywhere in this chapter, the polynomial time is in the number of bits needed to specify the input.

Proof: Since α ∈ ZZ∗n, we have that gcd(α, n) = 1. As such, by Lemma 22.1.2, there exist integers x and y, such that xα + yn = 1. That is xα ≡ 1 (mod n), and clearly β := x mod n is the desired inverse, and it can be computed in polynomial time by Lemma 22.1.2.
As for uniqueness, assume that there are two inverses β, β′ of α, such that β < β′ < n. But then βα ≡n β′α ≡n 1, which implies that n | (β′ − β)α, which implies that n | β′ − β, which is impossible as 0 < β′ − β < n. □
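As an illustration of Lemma 22.1.4, Python (3.8 or newer) exposes exactly this computation through the built-in pow; the brute-force check below is only a sanity test on a toy value.

    n, a = 35, 12                       # gcd(12, 35) = 1, so 12 is in ZZ*_35
    inv = pow(a, -1, n)                 # the inverse promised by Lemma 22.1.4
    assert (a * inv) % n == 1
    assert [b for b in range(1, n) if (a * b) % n == 1] == [inv]   # uniqueness, by brute force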

It is now straightforward, but somewhat tedious, to verify the following (the interested reader that
had not encountered this stuff before can spend some time proving this).
Lemma 22.1.5. The set ZZn under the + operation modulo n is a group, as is ZZ∗n under multiplication
modulo n. More importantly, for a prime number p, ZZ p forms a field with the +, ∗ operations modulo p
(see Definition 22.1.17p192 ).

22.1.1.3. The Chinese remainder theorem


Theorem 22.1.6 (Chinese remainder theorem). Let n1, . . . , nk be coprime numbers, and let n =
n1 n2 · · · nk . For any residues r1 ∈ ZZn1, . . . , rk ∈ ZZnk , there is a unique r ∈ ZZn , which can be computed in
polynomial time, such that r ≡ ri (mod ni ), for i = 1, . . . , k.

Proof: By the coprime property of the ni's it follows that gcd(ni, n/ni) = 1. As such, n/ni ∈ ZZ∗ni, and it has a unique inverse mi modulo ni; that is (n/ni)mi ≡ 1 (mod ni). So set $r = \sum_i r_i m_i\, n/n_i$. Observe that for i ≠ j, we have that nj | (n/ni), and as such ri mi n/ni ≡ 0 (mod nj). As such, we have
$$r \bmod n_j = \Bigl(\sum_i r_i m_i \frac{n}{n_i}\Bigr) \bmod n_j = \Bigl(r_j m_j \frac{n}{n_j} \bmod n_j\Bigr) \bmod n_j = (r_j \cdot 1) \bmod n_j = r_j.$$
As for uniqueness, if there is another such number r′, such that r < r′ < n, then (r′ − r) mod ni = 0, implying that ni | r′ − r, for all i. Since all the ni's are coprime, this implies that n | r′ − r, which is of course impossible. □
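The construction in the proof is easy to code. A short Python sketch (names ours), following the formula r = Σ_i r_i m_i (n/n_i):

    from math import prod

    def crt(residues, moduli):
        # moduli are assumed pairwise coprime; returns the unique r in ZZ_n, n = prod(moduli)
        n = prod(moduli)
        r = 0
        for r_i, n_i in zip(residues, moduli):
            m_i = pow(n // n_i, -1, n_i)          # inverse of n/n_i modulo n_i
            r += r_i * m_i * (n // n_i)
        return r % n

    r = crt([2, 3, 2], [3, 5, 7])
    assert r == 23 and r % 3 == 2 and r % 5 == 3 and r % 7 == 2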

Lemma 22.1.7 (Fast exponentiation). Given numbers b, c, n, one can compute bc mod n in polyno-
mial time.

Proof: The key property we need is that
$$xy \bmod n = \bigl((x \bmod n)\,(y \bmod n)\bigr) \bmod n.$$
Now, if c is even, then we can compute
$$b^{c} \bmod n = \bigl(b^{c/2}\bigr)^{2} \bmod n = \bigl(b^{c/2} \bmod n\bigr)^{2} \bmod n.$$
Similarly, if c is odd, we have
$$b^{c} \bmod n = b\,\bigl(b^{(c-1)/2}\bigr)^{2} \bmod n = \Bigl((b \bmod n)\,\bigl(b^{(c-1)/2} \bmod n\bigr)^{2}\Bigr) \bmod n.$$
Namely, computing b^c mod n can be reduced to recursively computing b^⌊c/2⌋ mod n, plus a constant number of operations (on numbers that are smaller than n). Clearly, the depth of the recursion is
O(log c). 
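The recursion of Lemma 22.1.7 in Python (a sketch; Python's built-in pow(b, c, n) computes the same quantity and is used here only to check the answer):

    def mod_exp(b, c, n):
        # computes b**c mod n using O(log c) multiplications
        if c == 0:
            return 1 % n
        half = mod_exp(b, c // 2, n)
        result = (half * half) % n
        if c % 2 == 1:
            result = (result * (b % n)) % n
        return result

    assert mod_exp(7, 560, 561) == pow(7, 560, 561)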

22.1.1.4. Euler totient function
The Euler totient function φ(n) = |ZZ∗n| is the number of positive integers that are at most n and are coprime with n. If n is prime then φ(n) = n − 1.

Lemma 22.1.8. Let $n = p_1^{k_1}\cdots p_t^{k_t}$, where the pi's are prime numbers and the ki's are positive integers (this is the prime factorization of n). Then $\varphi(n) = \prod_{i=1}^{t} p_i^{k_i-1}(p_i - 1)$, and this quantity can be computed in polynomial time if the factorization is given.

Proof: Observe that φ(1) = 1 (see Remark 22.1.3), and for a prime number p, we have that φ(p) = p − 1. Now, for k > 1, and p prime, we have that φ(p^k) = p^{k−1}(p − 1), as a number x ≤ p^k is coprime with p^k if and only if x mod p ≠ 0, and a (p − 1)/p fraction of the numbers in this range have this property.
Now, if n and m are relatively prime, then gcd(x, nm) = 1 ⟺ gcd(x, n) = 1 and gcd(x, m) = 1. In
particular, there are φ(n)φ(m) pairs (α, β) ∈ ZZ∗n × ZZ∗m , such that gcd(α, n) = 1 and gcd(β, m) = 1. By the
Chinese remainder theorem (Theorem 22.1.6), each such pair represents a unique number in the range
1, . . . , nm, as desired.
Now, the claim follows by easy induction on the prime factorization of the given number. 
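A short Python sketch of Lemma 22.1.8, with a brute-force check on a toy value (the factorization is supplied by hand, since factoring is not assumed to be easy):

    from math import gcd

    def phi_from_factorization(factors):
        # factors: list of (p, k) pairs with n = product of p**k
        result = 1
        for p, k in factors:
            result *= p ** (k - 1) * (p - 1)
        return result

    n = 360                              # 360 = 2^3 * 3^2 * 5
    assert phi_from_factorization([(2, 3), (3, 2), (5, 1)]) == \
           sum(1 for x in range(1, n + 1) if gcd(x, n) == 1)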

22.1.2. Structure of the modulo group ZZn


22.1.2.1. Some basic group theory
Definition 22.1.9. A group is a set, G, together with an operation × that combines any two elements a
and b to form another element, denoted a × b or ab. To qualify as a group, the set and operation, (G, ×),
must satisfy the following:
(A) (Closure) For all a, b ∈ G, the result of the operation, a × b ∈ G.
(B) (Associativity) For all a, b, c ∈ G, we have (a × b) × c = a × (b × c).
(C) (Identity element) There exists an element i ∈ G, called the identity element, such that
for every element a ∈ G, the equation i × a = a × i = a holds.
(D) (Inverse element) For each a ∈ G, there exists an element b ∈ G such that a × b = b × a = i.
A group is abelian (aka, commutative group) if for all a, b ∈ G, we have that a × b = b × a.

In the following we restrict our attention to abelian groups since it makes the discussion somewhat
simpler. In particular, some of the claims below holds even without the restriction to abelian groups.
The identity element is unique. Indeed, if both f , g ∈ G are identity elements, then f = f × g = g.
Similarly, for every element x ∈ G there exists a unique inverse y = x −1 . Indeed, if there was another
inverse z, then y = y × i = y × (x × z) = (y × x) × z = i × z = z.

22.1.2.2. Subgroups
For a group G, a subset H ⊆ G that is also a group (under the same operation) is a subgroup.
For x, y ∈ G, let us define x  ∼ y if x/y ∈ H . Here x/y = x y −1 and y −1 is the inverse of y in G.
Observe that (y/x)(x/y) = yx −1 x y −1 = i. That is y/x is the inverse of x/y, and it is in H . But that
implies that x ∼ y =⇒ y ∼ x. Now, if x ∼ y and y ∼ z, then x/y, y/z ∈ H . But then x/y × y/z ∈ H ,
and furthermore x/y × y/z = x y −1 yz−1 = xz −1 = x/z. that is x ∼ z. Together, this implies that ∼ is an
equivalence relationship.

Furthermore, observe that if x/y = x/z then y −1 = x −1 (x/y) = x −1 (x/z) = z−1, that is y = z.
In particular, the equivalence class of x ∈ G, is [x] = z ∈ G x ∼ z . Observe that if x ∈ H then
i/x = ix −1 = x −1 ∈ H , and thus i ∼ x. That is H = [x]. The following is now easy.
Lemma 22.1.10. Let G be an abelian group, and let H ⊆ G be a subgroup. Consider the set G/H = {[x] | x ∈ G}. We claim that |[x]| = |[y]| for any x, y ∈ G. Furthermore G/H is a group (that is, the quotient group), with [x] × [y] = [x × y].

Proof: Pick an element α ∈ [x] and β ∈ [y], and consider the mapping f(γ) = γα^{−1}β. We claim that f is one to one and onto from [x] to [y]. For any γ ∈ [x], we have that γα^{−1} = γ/α ∈ H. As such, f(γ) = γα^{−1}β ∈ [β] = [y]. Now, for any γ, γ′ ∈ [x] such that γ ≠ γ′, if f(γ) = γα^{−1}β = γ′α^{−1}β = f(γ′), then by multiplying by β^{−1}α we get γ = γ′. That is, f is one to one, implying that |[x]| = |[y]|.
The second claim follows by carefully but tediously checking that the conditions in the definition of a group hold. □
Lemma 22.1.11. For a finite abelian group G and a subgroup H ⊆ G, we have that |H | divides |G|.
Proof: By Lemma 22.1.10, we have that |G| = |H | · |G/H |, as H = [i]. 

22.1.2.3. Cyclic groups


Lemma 22.1.12. For a finite group G, and any element g ∈ G, the set hgi = gi i ≥ 0 is a group.


Proof: Since G is finite, there are integers i > j ≥ 1, such that i , j and gi = g j , but then g j × gi− j =
gi = g j . That is gi− j = i and, by definition, we have gi− j ∈ hgi. It is now straightforward to verify that
the other properties of a group hold for hgi. 
In particular, for an element g ∈ G, we define its order as ord(g) = hgi , which clearly is the
minimum positive integer m, such that g m = i. Indeed, for j > m, observe that g j = g j mod m ∈ X =
i, g, g 2, . . . , g m−1 , which implies that hgi = X.


A group G is cyclic, if there is an element g ∈ G, such that hgi = G. In such a case g is a generator
of G.
Lemma 22.1.13. For any finite abelian group G, and any g ∈ G, we have that ord(g) divides |G|, and
g |G| = i.
Proof: By Lemma 22.1.12, the set hgi is a subgroup of G. By Lemma 22.1.11, we have that ord(g) = |hgi| divides |G|. As such, g^{|G|} = (g^{ord(g)})^{|G|/ord(g)} = i^{|G|/ord(g)} = i. □
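As a quick illustration of Lemma 22.1.13 for the multiplicative group ZZ∗n, the following Python snippet (ours) computes ord(g) by brute force and checks that it divides |ZZ∗n| = φ(n); the modulus is arbitrary.

    from math import gcd

    n = 45
    group = [x for x in range(1, n) if gcd(x, n) == 1]    # the elements of ZZ*_n
    for g in group:
        order, power = 1, g % n
        while power != 1:
            power = (power * g) % n
            order += 1
        assert len(group) % order == 0                    # ord(g) divides |ZZ*_n|
        assert pow(g, len(group), n) == 1                 # g^|G| is the identity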

22.1.2.4. Modulo group


Lemma 22.1.14. For any integer n, consider the additive group ZZn . Then, for any x ∈ ZZn , we have
lcm(n, x) n
that x · ord(x) = lcm(x, n). In particular, ord(x) = = . If n is prime, and x , 0 then
x gcd(n, x)
ord(x) = |ZZn | = n, and ZZn is a cyclic group.
Proof: We are working modulo n here under additions, and the identity element is 0. As such, x·ord(x) ≡n
0, which implies that n | x ord(x). By definition, ord(x) is the minimal number that has this property,
lcm(n, x)
implying that ord(x) = . Now, lcm(n, x) = nx/gcd(n, x). The second claim is now easy. 
x

Theorem 22.1.15. (Euler’s theorem) For all n and x ∈ ZZ∗n , we have x φ(n) ≡ 1 (mod n).
(Fermat’s theorem) If p is a prime then ∀x ∈ ZZ∗p x p−1 ≡ 1 (mod p).
Proof: The group ZZ∗n is abelian and has φ(n) elements, with 1 being the identity element (duh!). As

such, by Lemma 22.1.13, we have that x φ(n) = x | ZZn | ≡ 1 (mod n), as claimed.
The second claim follows by setting n = p, and recalling that φ(p) = p − 1, if p is a prime. 
One might be tempted to think that Lemma 22.1.14 implies that if p is a prime then ZZ∗p is a cyclic
group, but this does not follow, as the cardinality of ZZ∗p is φ(p) = p − 1, which is not a prime number
(for p > 2). To prove that ZZ∗p is cyclic, let us go back shortly to the totient function.
Lemma 22.1.16. For any n > 0, we have Σ_{d|n} φ(d) = n.

Proof: For any g > 0, let Vg = {x | x ∈ {1, . . . , n} and gcd(x, n) = g}. Now, x ∈ Vg ⟺ gcd(x, n) = g ⟺ gcd(x/g, n/g) = 1 ⟺ x/g ∈ ZZ∗n/g. Since V1, V2, . . . , Vn form a partition of {1, . . . , n}, it follows that n = Σ_g |Vg| = Σ_{g|n} |ZZ∗n/g| = Σ_{g|n} φ(n/g) = Σ_{d|n} φ(d). □

22.1.2.5. Fields
Definition 22.1.17. A field is an algebraic structure hF, +, ∗, 0, 1i consisting of two abelian groups:
(A) F under +, with 0 being the identity element.
(B) F \ {0} under ∗, with 1 as the identity element (here 0 , 1).
Also, the following property (distributivity of multiplication over addition) holds:
∀a, b, c ∈ F a ∗ (b + c) = (a ∗ b) + (a ∗ c).
We need the following: a polynomial p of degree k over a field F has at most k roots. Indeed, if p has the root α then it can be written as p(x) = (x − α)q(x), where q(x) is a polynomial of degree one lower. To see this, we divide p(x) by the polynomial (x − α), and observe that p(x) = (x − α)q(x) + β, but clearly β = 0 since p(α) = 0. As such, if p had t roots α1, . . . , αt, then p(x) = q(x)∏_{i=1}^{t}(x − αi), which implies that p would have degree at least t.

22.1.2.6. ZZ∗p is cyclic for prime numbers


For a prime number p, the group ZZ∗p has size φ(p) = p − 1, which is not a prime number for p > 2. As
such, Lemma 22.1.13 does not imply that there must be an element in ZZ∗p that has order p − 1 (and
thus ZZ∗p is cyclic). Instead, our argument is going to be more involved and less direct.
Lemma 22.1.18. For k < p, let Rk = {x ∈ ZZ∗p | ord(x) = k} be the set of all numbers in ZZ∗p that are of order k. We have that |Rk| ≤ φ(k).

Proof: Clearly, all the elements of Rk are roots of the polynomial x^k − 1 = 0 (mod p). By the above, this polynomial has at most k roots. Now, if Rk is not empty, then it contains an element x ∈ Rk of order k, which implies that for all i < j ≤ k, we have that x^i ≢ x^j (mod p), as the order of x is the size of hxi and the minimum k such that x^k ≡ 1 (mod p). In particular, we have that Rk ⊆ hxi, as for y = x^j, we have that y^k ≡p x^{jk} ≡p 1^j ≡p 1 (so hxi consists of k distinct roots of this polynomial).
Observe that for y = x^i, if g = gcd(k, i) > 1, then y^{k/g} ≡p x^{i(k/g)} ≡p x^{lcm(i,k)} ≡p 1; that is, ord(y) ≤ k/g < k, and y ∉ Rk. As such, Rk contains only elements x^i such that gcd(i, k) = 1. That is |Rk| ≤ |ZZ∗k|. The claim now readily follows as |ZZ∗k| = φ(k). □

Lemma 22.1.19. For any prime p, the group ZZ∗p is cyclic.

Proof: For p = 2 the claim trivially holds, so assume p > 2. If the set Rp−1, from Lemma 22.1.18, is not empty, then there is g ∈ Rp−1; it has order p − 1, and it is a generator of ZZ∗p, as |ZZ∗p| = p − 1, implying that ZZ∗p = hgi and this group is cyclic.
Now, by Lemma 22.1.13, we have that for any y ∈ ZZ∗p, ord(y) | p − 1 = |ZZ∗p|. This implies that Rk is empty if k does not divide p − 1. On the other hand, R1, . . . , Rp−1 form a partition of ZZ∗p. As such, we have that
    p − 1 = |ZZ∗p| = Σ_{k|p−1} |Rk| ≤ Σ_{k|p−1} φ(k) = p − 1,
by Lemma 22.1.18 and Lemma 22.1.16p192, implying that the inequality in the above display is an equality, and for all k | p − 1, we have that |Rk| = φ(k). In particular, |Rp−1| = φ(p − 1) > 0, and by the above the claim follows. □
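Lemma 22.1.19 guarantees that a generator exists; for a small prime one can simply search for it. A brute-force Python sketch (for illustration only – this is not how one would do it for large primes):

    def find_generator(p):
        # returns some generator of ZZ*_p, for a (small) prime p
        elements = set(range(1, p))
        for g in range(2, p):
            powers, x = set(), 1
            for _ in range(p - 1):
                x = (x * g) % p
                powers.add(x)
            if powers == elements:
                return g

    assert find_generator(23) == 5       # 5 is the smallest generator of ZZ*_23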

22.1.2.7. ZZ∗n is cyclic for powers of a prime


Lemma 22.1.20. Consider any odd prime p, and any integer c ≥ 1, then the group ZZ∗n is cyclic, where
n = pc .

Proof: Let g be a generator of ZZ∗p. Observe that g^{p−1} ≡ 1 (mod p). The number g < p, and as such p does not divide g, and also p does not divide g^{p−2}, and also p does not divide p − 1. As such, p^2 does not divide ∆ = (p − 1)g^{p−2}p; that is, ∆ ≢ 0 (mod p^2). As such, we have that
$$(g + p)^{p-1} \equiv g^{p-1} + \binom{p-1}{1}g^{p-2}p \equiv g^{p-1} + \Delta \not\equiv g^{p-1} \pmod{p^2}$$
$$\implies (g+p)^{p-1} \not\equiv 1 \pmod{p^2} \quad\text{or}\quad g^{p-1} \not\equiv 1 \pmod{p^2}.$$
Renaming g + p to be g, if necessary, we have that g^{p−1} ≢ 1 (mod p^2), but by Theorem 22.1.15p192, g^{p−1} ≡ 1 (mod p). As such, g^{p−1} = 1 + βp, where p does not divide β. Now, we have
$$g^{p(p-1)} = (1 + \beta p)^{p} = 1 + \binom{p}{1}\beta p + p^{3}\langle\text{whatever}\rangle = 1 + \gamma_1 p^{2},$$
where γ1 is an integer (the p^3 is not a typo – the binomial coefficients contribute at least one more factor of p – here we are using that p > 2). In particular, as p does not divide β, it follows that p does not divide γ1 either. Let us apply this argumentation again to
$$g^{p^{2}(p-1)} = \bigl(1 + \gamma_1 p^{2}\bigr)^{p} = 1 + \binom{p}{1}\gamma_1 p^{2} + p^{4}\langle\text{whatever}\rangle = 1 + \gamma_2 p^{3},$$
where again p does not divide γ2. Repeating this argument, for i = 1, . . . , c − 2, we have
$$\alpha_i = g^{p^{i}(p-1)} = \bigl(g^{p^{i-1}(p-1)}\bigr)^{p} = \bigl(1 + \gamma_{i-1}p^{i}\bigr)^{p} = 1 + \gamma_{i-1}p^{i+1} + p^{i+2}\langle\text{whatever}\rangle = 1 + \gamma_i p^{i+1},$$
where p does not divide γi. In particular, this implies that α_{c−2} = 1 + γ_{c−2}p^{c−1} and p does not divide γ_{c−2}. This in turn implies that α_{c−2} ≢ 1 (mod p^c).
Now, the order of g in ZZ∗n, denoted by k, must divide |ZZ∗n| by Lemma 22.1.13p191. Now |ZZ∗n| = φ(n) = p^{c−1}(p − 1), see Lemma 22.1.8p190. So, k | p^{c−1}(p − 1). Also, α_{c−2} ≢ 1 (mod p^c) implies that k does not divide p^{c−2}(p − 1). It follows that p^{c−1} | k. So, let us write k = p^{c−1}k′, where k′ ≤ p − 1. This, by definition, implies that g^k ≡ 1 (mod p^c). Now, g^p ≡ g (mod p), because g is a generator of ZZ∗p. As such, we have that
$$g^{k} = g^{p^{c-1}k'} \equiv_p \bigl(g^{p^{c-2}k'}\bigr)^{p} \equiv_p g^{p^{c-2}k'} \equiv_p \cdots \equiv_p g^{k'},$$
and also g^k ≡p (g^k mod p^c) mod p ≡p 1. Namely, g^{k′} ≡ 1 (mod p), which implies, as g is a generator of ZZ∗p, that either k′ = 1 or k′ = p − 1. The case k′ = 1 is impossible, as this implies that g ≡p 1, and it can not be the generator of ZZ∗p. We conclude that k = p^{c−1}(p − 1); that is, ZZ∗n is cyclic. □

22.1.3. Quadratic residues


22.1.3.1. Quadratic residue
Definition 22.1.21. An integer α is a quadratic residue modulo a positive integer n, if gcd(α, n) = 1
and for some integer β, we have α ≡ β2 (mod n).

Theorem 22.1.22 (Euler’s criterion). Let p be an odd prime, and α ∈ ZZ∗p . We have that
(A) α(p−1)/2 ≡ p ±1.
(B) If α is a quadratic residue, then α(p−1)/2 ≡ p 1.
(C) If α is not a quadratic residue, then α(p−1)/2 ≡ p −1.

Proof: (A) Let γ = α^{(p−1)/2}, and observe that γ^2 ≡p α^{p−1} ≡p 1, by Fermat's theorem (Theorem 22.1.15p192), which implies that γ is either +1 or −1, as the polynomial x^2 − 1 has at most two roots over a field.
(B) Let α ≡p β^2, and again by Fermat's theorem, we have α^{(p−1)/2} ≡p β^{p−1} ≡p 1.
(C) Let X be the set of elements in ZZ∗p that are not quadratic residues, and consider α ∈ X. Since ZZ∗p is a group, for any x ∈ ZZ∗p there is a unique y ∈ ZZ∗p such that xy ≡p α. As such, we partition ZZ∗p into pairs C = {{x, y} | x, y ∈ ZZ∗p and xy ≡p α}. We have that
$$\tau \equiv_p \prod_{\beta\in\mathbb{Z}^*_p}\beta \equiv_p \prod_{\{x,y\}\in C} xy \equiv_p \prod_{\{x,y\}\in C}\alpha \equiv_p \alpha^{(p-1)/2}.$$
Let us consider a similar set of pairs, but this time for 1: D = {{x, y} | x, y ∈ ZZ∗p, x ≠ y and xy ≡p 1}. Clearly, D does not contain −1 and 1, but all other elements of ZZ∗p are covered by D. As such,
$$\tau \equiv_p \prod_{\beta\in\mathbb{Z}^*_p}\beta \equiv_p (-1)\cdot 1\cdot\prod_{\{x,y\}\in D} xy \equiv_p -1.$$
We conclude that α^{(p−1)/2} ≡p τ ≡p −1. □

22.1.3.2. Legendre symbol


For an odd prime p, and an integer a with gcd(a, n) = 1, the Legendre symbol (a | p) is one if a
is a quadratic residue modulo p, and −1 otherwise (if p | a, we define (a | p) = 0). Euler’s criterion
(Theorem 22.1.22) implies the following equivalent definition.
Definition 22.1.23. The Legendre symbol, for a prime number p, and a ∈ ZZ∗p , is

(a | p) = a(p−1)/2 (mod p).

The following is easy to verify.

Lemma 22.1.24. Let p be an odd prime, and let a, b be integer numbers. We have:
(i) (−1 | p) = (−1)(p−1)/2 .
(ii) (a | p) (b | p) = (ab | p).
(iii) If a ≡ p b then (a | p) = (b | p).

Lemma 22.1.25 (Gauss' lemma). Let p be an odd prime and let a be an integer that is not divisible by p. Let X = {α_j = ja mod p | j = 1, . . . , (p − 1)/2}, and L = {x ∈ X | x > p/2} ⊆ X. Then (a | p) = (−1)^n, where n = |L|.

Proof: Observe that for any distinct i, j, such that 1 ≤ i ≤ j ≤ (p − 1)/2, we have that ja ≡ ia (mod p) implies that (j − i)a ≡ 0 (mod p), which is impossible as j − i < p and gcd(a, p) = 1. As such, all the elements of X are distinct, and |X| = (p − 1)/2. We have a somewhat stronger property: ja ≡ −ia (mod p) implies (j + i)a ≡ 0 (mod p), which is impossible. That is, S = X \ L and L̄ = {p − ℓ | ℓ ∈ L} are disjoint, and S ∪ L̄ = {1, . . . , (p − 1)/2}. As such,
$$\Bigl(\frac{p-1}{2}\Bigr)! \;\equiv\; \prod_{x\in S}x\cdot\prod_{y\in L}(p-y) \;\equiv\; (-1)^{n}\prod_{x\in S}x\cdot\prod_{y\in L}y \;\equiv\; (-1)^{n}\prod_{j=1}^{(p-1)/2} ja \;\equiv\; (-1)^{n}a^{(p-1)/2}\Bigl(\frac{p-1}{2}\Bigr)! \pmod p.$$
Dividing both sides by (−1)^n ((p − 1)/2)!, we have that (a | p) ≡ a^{(p−1)/2} ≡ (−1)^n (mod p), as claimed. □

Lemma 22.1.26. If p is an odd prime, a is odd, and gcd(a, p) = 1, then (a | p) = (−1)^∆, where $\Delta = \sum_{j=1}^{(p-1)/2}\lfloor ja/p\rfloor$. Furthermore, we have $(2 \mid p) = (-1)^{(p^2-1)/8}$.

Proof: Using the notation of Lemma 22.1.25, we have
$$\sum_{j=1}^{(p-1)/2} ja = \sum_{j=1}^{(p-1)/2}\Bigl(\lfloor ja/p\rfloor\,p + (ja \bmod p)\Bigr) = \Delta p + \sum_{x\in S}x + \sum_{y\in L}y = (\Delta + n)p + \sum_{x\in S}x - \sum_{y\in L}(p - y) = (\Delta + n)p + \sum_{j=1}^{(p-1)/2} j - 2\sum_{y\in L}(p - y).$$
Rearranging, and observing that $\sum_{j=1}^{(p-1)/2} j = \frac{1}{2}\cdot\frac{p-1}{2}\bigl(\frac{p-1}{2}+1\bigr) = \frac{p^2-1}{8}$, we have that
$$(a-1)\,\frac{p^{2}-1}{8} = (\Delta + n)p - 2\sum_{y\in L}(p-y) \;\implies\; (a-1)\,\frac{p^{2}-1}{8} \equiv (\Delta + n)p \pmod 2. \tag{22.1}$$
Observe that p ≡ 1 (mod 2), and for any x we have that x ≡ −x (mod 2). As such, if a is odd, then the above implies that n ≡ ∆ (mod 2). Now the claim readily follows from Lemma 22.1.25.
As for (2 | p), setting a = 2, observe that ⌊ja/p⌋ = 0, for j = 1, . . . , (p − 1)/2, and as such ∆ = 0. Now, Eq. (22.1) implies that $\frac{p^2-1}{8} \equiv n \pmod 2$, and the claim follows from Lemma 22.1.25. □


Theorem 22.1.27 (Law of quadratic reciprocity). If p and q are distinct odd primes, then
$$(p \mid q) = (-1)^{\frac{p-1}{2}\cdot\frac{q-1}{2}}\,(q \mid p).$$

Proof: Let S = {(x, y) | 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ (q − 1)/2}. As lcm(p, q) = pq, it follows that there are no (x, y) ∈ S such that qx = py, as all such numbers are strictly smaller than pq. Now, let
    S1 = {(x, y) ∈ S | qx > py}   and   S2 = {(x, y) ∈ S | qx < py}.
Now, (x, y) ∈ S1 ⟺ 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ ⌊qx/p⌋. As such, we have $|S_1| = \sum_{x=1}^{(p-1)/2}\lfloor qx/p\rfloor$, and similarly $|S_2| = \sum_{y=1}^{(q-1)/2}\lfloor py/q\rfloor$. We have
$$\tau = \frac{p-1}{2}\cdot\frac{q-1}{2} = |S| = |S_1| + |S_2| = \underbrace{\sum_{x=1}^{(p-1)/2}\lfloor qx/p\rfloor}_{\tau_1} + \underbrace{\sum_{y=1}^{(q-1)/2}\lfloor py/q\rfloor}_{\tau_2}.$$
The claim now readily follows by Lemma 22.1.26, as $(-1)^{\tau} = (-1)^{\tau_1}(-1)^{\tau_2} = (p\mid q)\,(q\mid p)$. □

22.1.3.3. Jacobi symbol

Definition 22.1.28. For any integer a, and an odd number n with prime factorization n = p_1^{k_1} · · · p_t^{k_t}, its Jacobi symbol is
    na | no = ∏_{i=1}^{t} (a | p_i)^{k_i}.

Claim 22.1.29. For odd integers n1, . . . , nk, we have that $\sum_{i=1}^{k}\frac{n_i - 1}{2} \equiv \frac{\bigl(\prod_{i=1}^{k} n_i\bigr) - 1}{2} \pmod 2$.

Proof: We prove it for two odd integers x and y, and apply this repeatedly to get the claim. Indeed, we have
$$\frac{x-1}{2} + \frac{y-1}{2} \equiv \frac{xy-1}{2} \pmod 2 \iff 0 \equiv \frac{xy - x - y + 1}{2} \pmod 2 \iff 0 \equiv \frac{(x-1)(y-1)}{2} \pmod 2,$$
which is obviously true, as (x − 1)(y − 1) is divisible by 4. □
Lemma 22.1.30 (Law of quadratic reciprocity). For n and m positive odd integers, we have that
n−1 m−1
nn | mo = (−1) 2 2 nm | no .
Îν ε
Proof: Let n = i=1 pi and Let m = j=1 q j be the prime factorization of the two numbers (allowing
repeated factors). If they share a common factor p, then both nn | mo and nm | no contain a zero term
when expanded, as (n | p) = (m | p) = 0. Otherwise, we have
µ
ν Ö µ
ν Ö µ
ν Ö
Ö   Ö Ö
nn | mo = pi | q j = pi | q j = (−1)(q j −1)/2·(pi −1)/2 q j | pi
 
i=1 j=1 i=1 j=1 i=1 j=1
ν Ö µ µ
ν Ö
!
Ö Ö
(−1)(q j −1)/2·(pi −1)/2 · q j | pi = s nm | no .

=
i=1 j=1 i=1 j=1
| {z }
s

by Theorem 22.1.27. As for the value of s, observe that
ν Ö µ
! (pi −1)/2 ν  ν
! (m−1)/2
Ö Ö  (pi −1)/2 Ö
(q j −1)/2 (m−1)/2 (pi −1)/2
s= (−1) = (−1) = (−1) = (−1)(n−1)/2·(m−1)/2,
i=1 j=1 i=1 i=1

by repeated usage of Claim 22.1.29. 

n2 − 1 m 2 − 1 n2 m 2 − 1
Lemma 22.1.31. For odd integers n and m, we have that + ≡ (mod 2).
8 8 8
Proof: For an odd integer n, we have that either (i) 2 | n − 1 and 4 | n + 1, or (ii) 4 | n − 1 and 2 | n + 1.
As such, 8 | n2 − 1 = (n − 1)(n + 1). In particular, 64 | n2 − 1 m2 − 1 . We thus have that

n2 − 1 m 2 − 1 n2 m 2 − n2 − m 2 + 1
 
≡ 0 (mod 2) ⇐⇒ ≡ 0 (mod 2)
8 8
n2 m 2 − 1 n2 − m 2 − 2
⇐⇒ ≡ (mod 2)
8 8
n2 − 1 m 2 − 1 n2 m 2 − 1
⇐⇒ + ≡ (mod 2). 
8 8 8

Lemma 22.1.32. Let m, n be odd integers, and a, b be any integers. We have the following:
(A) nab | no = na | no nb | no.
(B) na | nmo = na | no na | mo.
(C) If a ≡ b (mod n) then na | no = nb | no.
(D) If gcd(a, n) > 1 then na | no = 0.
(E) n1 | no = 1.
(F) n2 | no = (−1)^{(n^2−1)/8}.
(G) nn | mo = (−1)^{((n−1)/2)·((m−1)/2)} nm | no.

Proof: (A) Follows immediately, as (ab | pi ) = (a | pi ) (b | pi ), see Lemma 22.1.24p195 .


(B) Immediate from definition.
(C) Follows readily from Lemma 22.1.24p195 (iii).
(D) Indeed, if p | gcd(a, n) and p > 1, then (a | p) k = (0 | p) k = 0 appears as a term in na | no.
(E) Obvious by definition.
2 Ît
(F) By Lemma 22.1.26p195 , for a prime p, we have (2 | p) = (−1)(p −1)/8 . As such, writing n = i=1 pi
as a product of primes (allowing repeated primes), we have
t
Ö t
Ö 2
n2 | no = (2 | pi ) = (−1)(pi −1)/8 = (−1)∆,
i=1 i=1
Ít 2
where ∆ = i=1 (pi − 1)/8. As such, we need to compute the ∆ (mod 2), which by Lemma 22.1.31, is
t Ît
Õ pi2 − 1 i=1 pi − 1
2
n2 − 1
∆≡ ≡ ≡ (mod 2),
i=1
8 8 8
2 −1)/8
and as such n2 | no = (−1)∆ = (−1)(n .
(G) This is Lemma 22.1.30. 

22.1.3.4. Jacobi(a, n): Computing the Jacobi symbol
Given a and n (n is an odd number), we are interested in computing (in polynomial time) the Jacobi
symbol na | no. The algorithm Jacobi(a, n) works as follows:
(A) If a = 0 then return 0 // Since n0 | no = 0.
(B) If a > n then return Jacobi(a (mod n), n) // Lemma 22.1.32 (C)
(C) If gcd(a, n) > 1 then return 0 // Lemma 22.1.32 (D)
(D) If a = 2 then
    (I) Compute ∆ = n^2 − 1 (mod 16),
    (II) Return (−1)^{∆/8} // As (n^2 − 1)/8 ≡ ∆/8 (mod 2), and by Lemma 22.1.32 (F)
(E) If 2 | a then return Jacobi(2, n) * Jacobi(a/2, n) // Lemma 22.1.32 (A)
// Must be that a and n are both odd, a < n, and they are coprime
(F) a′ := a (mod 4), n′ := n (mod 4), β = (a′ − 1)(n′ − 1)/4.
    return (−1)^β · Jacobi(n, a) // By Lemma 22.1.32 (G)

Ignoring the recursive calls, all the operations take polynomial time. Clearly, computing Jacobi(2, n) takes polynomial time. Otherwise, observe that Jacobi reduces its input size by, say, one bit at least every two recursive calls, and except for the a = 2 case, it always performs only a single recursive call. Thus, it follows that its running time is polynomial. We thus get the following.

Lemma 22.1.33. Given integers a and n, where n is odd, then na | no can be computed in polynomial
time.
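For concreteness, here is a compact iterative Python version of this computation (a sketch; it folds the reduction rules of Lemma 22.1.32 into a loop instead of recursing). The check at the end compares it, for a prime modulus, against Euler's criterion (Definition 22.1.23).

    def jacobi(a, n):
        # Jacobi symbol na | no, for odd n > 0
        assert n > 0 and n % 2 == 1
        a %= n
        t = 1
        while a != 0:
            while a % 2 == 0:            # factor out twos, using n2 | no = (-1)^((n^2-1)/8)
                a //= 2
                if n % 8 in (3, 5):
                    t = -t
            a, n = n, a                  # quadratic reciprocity, Lemma 22.1.32 (G)
            if a % 4 == 3 and n % 4 == 3:
                t = -t
            a %= n
        return t if n == 1 else 0

    p = 101                              # for a prime modulus the Jacobi symbol is the Legendre symbol
    for a in range(1, p):
        assert jacobi(a, p) == (1 if pow(a, (p - 1) // 2, p) == 1 else -1)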

22.1.3.5. Subgroups induced by the Jacobi symbol


For an odd number n, consider the set
    Jn = {a ∈ ZZ∗n | na | no ≡ a^{(n−1)/2} mod n}. (22.2)

Claim 22.1.34. The set Jn is a subgroup of ZZ∗n .

Proof: For a, b ∈ Jn , we have that nab | no ≡ na | no nb | no ≡ a(n−1)/2 b(n−1)/2 ≡ (ab)(n−1)/2 mod n, implying
that ab ∈ Jn . Now, n1 | no = 1, so 1 ∈ Jn . Now, for a ∈ Jn , let a−1 the inverse of a (which is a number
in ZZ∗n ). Observe that a(a−1 ) = kn + 1, for some k, and as such, we have
   
1 = n1 | no = nkn + 1 | no = aa−1 | n = nkn + 1 | no = na | no a−1 | n .

And modulo n, we have


   
1 ≡ na | no a−1 | n ≡ a(n−1)/2 a−1 | n mod n.
 (n−1)/2  −1 
Which implies that a−1 ≡ a | n mod n. That is a−1 ∈ Jn .
Namely, Jn contains the identity, it is closed under inverse and multiplication, and it is now easy to
verify that fulfill the other requirements to be a group. 

Lemma 22.1.35. Let n be an odd integer that is composite. Then |Jn| ≤ |ZZ∗n| /2.

piki . Let q = p1k1 , and m = n/q. By Lemma 22.1.20p193 ,
Ît
Proof: Let has the prime factorization n = i=1
the group ZZ∗q is cyclic, and let g be its generator. Consider the element a ∈ ZZ∗n such that

a ≡ g mod q and a ≡ 1 mod m.

Such a number a exists and its unique, by the Chinese remainder theorem (Theorem 22.1.6p189 ). In
piki , and observe that, for all i, we have a ≡ 1 (mod pi ), as pi | m. As such,
Ît
particular, let m = i=2
writing the Jacobi symbol explicitly, we have
t t t
ki ki
Ö Ö Ö
na | no = na | qo (a | pi ) = na | qo (1 | pi ) = na | qo 1 = na | qo = ng | qo .
i=2 i=2 i=2

since a ≡ g (mod q), and Lemma 22.1.32p197 (C). At this point there are two possibilities:
(A) If k 1 = 1, then q = p1 , and ng | qo = (g | q) = g (q−1)/2 (mod q). But g is a generator of ZZ∗q ,
and its order is q − 1. As such g (q−1)/2 ≡ −1 (mod q), see Definition 22.1.23p194 . We conclude
that na | no = −1. If we assume that Jn = ZZ∗n , then na | no ≡ a(n−1)/2 ≡ −1 (mod n). Now, as
m | n, we have
 
a(n−1)/2 ≡m a(n−1)/2 mod n mod m ≡m −1.

But this contradicts the choice of a as a ≡ 1 (mod m).


(B) If k 1 > 1 then q = p1k1 . Arguing as above, we have that na | no = (−1) k1 . Thus, if we assume
that Jn = ZZ∗n , then a(n−1)/2 ≡ −1 (mod n) or a(n−1)/2 ≡ 1 (mod n). This implies that an−1 ≡ 1
(mod n). Thus, an−1 ≡ 1 (mod q).
Now a ≡ g mod q, and thus g n−1 ≡ 1 (mod q). This implies that the order of g in ZZ∗q must
  n − 1. That is ord(g) = φ(q) | n − 1. Now, since k 1 ≥ 2, we have that p1 | φ(q) =
divide
p1k1 (p1 − 1), see Lemma 22.1.8p190 . We conclude that p1 | n − 1 and p1 | n, which is of course
impossible, as p1 > 1.
We conclude that Jn must be a proper subgroup of ZZ∗n , but, by Lemma 22.1.11p191 , it must be that
|Jn | | ZZ∗n . But this implies that |Jn | ≤ ZZ∗n /2. 

22.2. Primality testing


The primality test is now easy­. Indeed, given a number n, first check if it is even (duh!). Otherwise, randomly pick a number r ∈ {2, . . . , n − 1}. If gcd(r, n) > 1 then the number is composite. Otherwise, check if r ∈ Jn (see Eq. (22.2)p198), by computing x = nr | no in polynomial time, see Section 22.1.3.4p198, and x′ = r^{(n−1)/2} mod n (see Lemma 22.1.7p189). If x ≡ x′ (mod n) then the algorithm returns that n is prime, otherwise it returns that it is composite.
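This test is the Solovay–Strassen primality test. A self-contained Python sketch of it (the jacobi routine of Section 22.1.3.4 is repeated so the snippet runs on its own; the asserts are sanity checks on well-known values):

    import random
    from math import gcd

    def jacobi(a, n):
        a %= n
        t = 1
        while a != 0:
            while a % 2 == 0:
                a //= 2
                if n % 8 in (3, 5):
                    t = -t
            a, n = n, a
            if a % 4 == 3 and n % 4 == 3:
                t = -t
            a %= n
        return t if n == 1 else 0

    def is_probably_prime(n, rounds=20):
        # error probability at most 2**(-rounds) when n is composite
        if n < 2 or n % 2 == 0:
            return n == 2
        for _ in range(rounds):
            r = random.randrange(2, n)
            if gcd(r, n) > 1:
                return False
            if jacobi(r, n) % n != pow(r, (n - 1) // 2, n):
                return False
        return True

    assert not is_probably_prime(561)          # a Carmichael number (caught with overwhelming probability)
    assert is_probably_prime(2**61 - 1)        # a Mersenne prime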

Theorem 22.2.1. Given a number n and a parameter δ > 0, there is a randomized algorithm that decides if the given number is prime or composite. The running time of the algorithm is O((log n)^c log(1/δ)), where c is some constant. If the algorithm returns that n is composite then it is. If the algorithm returns that n is prime, then it is wrong with probability at most δ.
­ One could even say “trivial” with heavy Russian accent.

Proof: Run the above algorithm m = O(log(1/δ)) times. If any of the runs returns that n is composite, then the algorithm returns that n is composite; otherwise the algorithm returns that it is a prime.
The algorithm can fail only if n is composite, so let r1, . . . , rm be the random numbers the algorithm picked. The algorithm fails only if r1, . . . , rm ∈ Jn, but since |Jn| ≤ |ZZ∗n| /2, by Lemma 22.1.35p198, it follows that this happens with probability at most (|Jn| / |ZZ∗n|)^m ≤ 1/2^m ≤ δ, as claimed. □

22.2.1. Distribution of primes


In the following, let π(n) denote the number of primes between 1 and n. Here, we prove that π(n) =
Θ(n/log n).
Lemma 22.2.2. Let ∆ be the product of all the prime numbers p, where m < p ≤ 2m. We have that $\Delta \le \binom{2m}{m}$.

Proof: Let X be the product of all the composite numbers between m and 2m. We have
$$\binom{2m}{m} = \frac{2m\cdot(2m-1)\cdots(m+2)\cdot(m+1)}{m\cdot(m-1)\cdots 2\cdot 1} = \frac{X\cdot\Delta}{m\cdot(m-1)\cdots 2\cdot 1}.$$
Since none of the numbers between 2 and m divides any of the factors of ∆, it must be that $\frac{X}{m\cdot(m-1)\cdots 2\cdot 1}$ is an integer, as $\binom{2m}{m}$ is an integer. Therefore, $\binom{2m}{m} = c\cdot\Delta$, for some integer c > 0, implying the claim. □

Lemma 22.2.3. The number of prime numbers between m and 2m is O(m/ln m).

Proof: Let us denote all the primes between m and 2m by p1 < p2 < · · · < pk. Since p1 ≥ m, it follows from Lemma 22.2.2 that $m^{k} \le \prod_{i=1}^{k} p_i \le \binom{2m}{m} \le 2^{2m}$. Now, taking the logarithm of both sides, we have k lg m ≤ 2m. Namely, k ≤ 2m/lg m. □

Lemma 22.2.4. π(n) = O(n/ln n).

Proof: Let the number of primes less than n be Π(n). By Lemma 22.2.3, there exists some positive constant C, such that for all n ≥ N, we have Π(2n) − Π(n) ≤ C · n/ln n. Namely, Π(2n) ≤ C · n/ln n + Π(n). Thus,
$$\Pi(2n) \le \sum_{i=0}^{\lceil\lg n\rceil}\Bigl(\Pi\bigl(2n/2^{i}\bigr) - \Pi\bigl(2n/2^{i+1}\bigr)\Bigr) \le \sum_{i=0}^{\lceil\lg n\rceil} C\cdot\frac{n/2^{i}}{\ln(n/2^{i})} = O\Bigl(\frac{n}{\ln n}\Bigr),$$
by observing that the summation behaves like a decreasing geometric series. □

Lemma 22.2.5. For integers m, k and a prime p, if $p^{k} \mid \binom{2m}{m}$, then $p^{k} \le 2m$.


Proof: Let T(p, m) be the number of times p appears in the prime factorization of m!. Formally, T(p, m) is the highest number k such that p^k divides m!. We claim that $T(p, m) = \sum_{i=1}^{\infty}\lfloor m/p^{i}\rfloor$. Indeed, consider an integer β ≤ m, such that β = p^t γ, where γ is an integer that is not divisible by p. Observe that β contributes exactly to the first t terms of the summation of T(p, m) – namely, its contribution to m!, as far as powers of p are concerned, is counted correctly.
Let α be the maximum number such that p^α divides $\binom{2m}{m} = \frac{(2m)!}{m!\,m!}$. Clearly,
$$\alpha = T(p, 2m) - 2T(p, m) = \sum_{i=1}^{\infty}\Bigl(\Bigl\lfloor\frac{2m}{p^{i}}\Bigr\rfloor - 2\Bigl\lfloor\frac{m}{p^{i}}\Bigr\rfloor\Bigr).$$
It is easy to verify that for any integers x, y, we have that $0 \le \lfloor 2x/y\rfloor - 2\lfloor x/y\rfloor \le 1$. In particular, let k be the largest number such that $\lfloor 2m/p^{k}\rfloor - 2\lfloor m/p^{k}\rfloor = 1$, and observe that α ≤ k, as only the first k terms in the above summation might be non-zero. But $\lfloor 2m/p^{k}\rfloor - 2\lfloor m/p^{k}\rfloor = 1$ implies that $2m/p^{k} \ge 1$, which implies in turn that $p^{\alpha} \le p^{k} \le 2m$, as desired. □

Lemma 22.2.6. π(n) = Ω(n/ln n).


Proof: Assume $\binom{2m}{m}$ has k prime factors, and thus can be written as $\binom{2m}{m} = \prod_{i=1}^{k} p_i^{n_i}$. By Lemma 22.2.5, we have $p_i^{n_i} \le 2m$. Of course, the above product might not include some prime numbers between 1 and 2m, and as such k is a lower bound on the number of primes in this range; that is, k ≤ π(2m). This implies
$$\frac{2^{2m}}{2m} \le \binom{2m}{m} \le \prod_{i=1}^{k} p_i^{n_i} \le (2m)^{k}.$$
By taking lg of both sides, we have $\frac{2m - \lg(2m)}{\lg(2m)} \le k \le \pi(2m)$. □
lg(2m)

We summarize the result.

Theorem 22.2.7. Let π(n) be the number of distinct prime numbers between 1 and n. We have that
π(n) = Θ(n/ln n).

22.3. Bibliographical notes


Miller [Mil76] presented the primality testing algorithm which runs in deterministic polynomial time but
relies on Riemann’s Hypothesis (which is still open). Later on, Rabin [Rab80] showed how to convert
this algorithm to a randomized algorithm, without relying on the Riemann’s hypothesis.
This write-up is based on various sources – starting with the description in [MR95], and then filling
in some details from various sources on the web.
What is currently missing from the write-up is a description of the RSA encryption system. This
would hopefully be added in the future. There are of course typos in these notes – let me know if you
find any.

Chapter 23

Finite Metric Spaces and Partitions



23.1. Finite Metric Spaces


Definition 23.1.1. A metric space is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric,
satisfying the following axioms: (i) d(x, y) = 0 iff x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) + d(y, z) ≥
d(x, z) (triangle inequality).

For example, R2 with the regular Euclidean distance is a metric space.


It is usually of interest to consider the finite case, where X is an n-point set. Then, the function d can be specified by $\binom{n}{2}$ real numbers. Alternatively, one can think about (X, d) as a weighted complete graph, where we specify positive weights on the edges, and the resulting weights on the edges comply with the triangle inequality.
In fact, finite metric spaces rise naturally from (sparser) graphs. Indeed, let G = (X, E) be an
undirected weighted graph defined over X, and let dG (x, y) be the length of the shortest path between x
and y in G. It is easy to verify that (X, dG ) is a finite metric space. As such if the graph G is sparse, it
provides a compact representation to the finite space (X, dG ).

Definition 23.1.2. Let (X, d) be an n-point metric space. We denote the open ball of radius r about x ∈ X by b(x, r) = {y ∈ X | d(x, y) < r}.

Underling our discussion of metric spaces are algorithmic applications. The hardness of various
computational problems depends heavily on the structure of the finite metric space. Thus, given a finite
metric space, and a computational task, it is natural to try to map the given metric space into a new
metric where the task at hand becomes easy.

Example 23.1.3. For example, computing the diameter is not trivial in two dimensions, but is easy in
one dimension. Thus, if we could map points in two dimensions into points in one dimension, such that
the diameter is preserved, then computing the diameter becomes easy. In fact, this approach yields an
efficient approximation algorithm, see Exercise 23.7.3 below.

Of course, this mapping from one metric space to another, is going to introduce error. We would be
interested in minimizing the error introduced by such a mapping.

Definition 23.1.4. Let (X, dX ) and (Y, dY ) be metric spaces. A mapping f : X → Y is called an embed-
ding, and is C-Lipschitz if dY ( f (x), f (y)) ≤ C · dX (x, y) for all x, y ∈ X. The mapping f is called
K-bi-Lipschitz if there exists a C > 0 such that

CK −1 · dX (x, y) ≤ dY ( f (x), f (y)) ≤ C · dX (x, y),

for all x, y ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f , and is denoted dist( f ). The
least distortion with which X may be embedded in Y is denoted cY (X).

There are several powerful results in this vain, that show the existence of embeddings with low
distortion that would be presented:

1. Probabilistic trees - every finite metric can be randomly embedded into a tree such that the
“expected” distortion for a specific pair of points is O(log n).

2. Bourgain embedding - shows that any n-point metric space can be embedded into (finite dimen-
sional) metric space with O(log n) distortion.

3. Johnson-Lindenstrauss lemma - shows that any n-point set in Euclidean space with the regular
Euclidean distance can be embedded into R k with distortion (1 + ε), where k = O(ε −2 log n).

23.2. Examples
23.2.0.0.1. What is distortion? When considering a mapping f : X → R^d of a metric space (X, d) to R^d, it is useful to observe that since R^d can be scaled, we can consider f to be an expansion (i.e., no distances shrink). Furthermore, we can in fact assume that there is at least one pair of points x, y ∈ X, such that d(x, y) = ‖f(x) − f(y)‖. As such, we have dist(f) = max_{x,y} ‖f(x) − f(y)‖ / d(x, y).

23.2.0.0.2. Why is distortion necessary? Consider the graph G = (V, E) with one vertex s connected to three other vertices a, b, c, where the weights on the edges are all one (i.e., G is the star graph with three leaves). We claim that G can not be embedded into Euclidean space with distortion smaller than 2/√3. Indeed, consider the associated metric space (V, dG) and an (expansive) embedding f : V → R^d. Consider the triangle formed by △ = a′b′c′, where a′ = f(a), b′ = f(b) and c′ = f(c). Next, consider the quantity max(‖a′ − s′‖, ‖b′ − s′‖, ‖c′ − s′‖), which lower bounds the distortion of f. This quantity is minimized when r = ‖a′ − s′‖ = ‖b′ − s′‖ = ‖c′ − s′‖. Namely, s′ is the center of the smallest enclosing circle of △. However, r is minimized when all the edges of △ are of equal length, and are in fact of length dG(a, b) = 2. It follows that dist(f) ≥ r ≥ 2/√3.
It is known that Ω(log n) distortion is necessary in the worst case. This is shown using expanders
[Mat02].

23.2.1. Hierarchical Tree Metrics


The following metric is quite useful in practice, and nicely demonstrate why algorithmically finite metric
spaces are useful.

Definition 23.2.1. Hierarchically well-separated tree (HST) is a metric space defined on the leaves
of a rooted tree T. To each vertex u ∈ T there is associated a label ∆u ≥ 0 such that ∆u = 0 if and only if
u is a leaf of T. The labels are such that if a vertex u is a child of a vertex v then ∆u ≤ ∆v . The distance
between two leaves x, y ∈ T is defined as ∆lca(x,y) , where lca(x, y) is the least common ancestor of x and
y in T.
A HST T is a k-HST if for a vertex v ∈ T, we have that ∆v ≤ ∆p(v) /k, where p(v) is the parent of v
in T.
Note that a HST is a very limited metric. For example, consider the cycle G = Cn of n vertices, with
weight one on the edges, and consider an expansive embedding f of G into a HST HST. It is easy to
verify, that there must be two consecutive nodes of the cycle, which are mapped to two different subtrees
of the root r of HST. Since HST is expansive, it follows that ∆r ≥ n/2. As such, dist( f ) ≥ n/2. Namely,
HSTs fail to faithfully represent even very simple metrics.

23.2.2. Clustering
One natural problem we might want to solve on a graph (i.e., finite metric space) (X, d) is to partition it
into clusters. One such natural clustering is the k-median clustering, where we would like to choose a set C ⊆ X of k centers, such that ν_C(X, d) = Σ_{q∈X} d(q, C) is minimized, where d(q, C) = min_{c∈C} d(q, c) is the distance of q to its closest center in C.
It is known that finding the optimal k-median clustering in a (general weighted) graph is NP-
complete. As such, the best we can hope for is an approximation algorithm. However, if the structure
of the finite metric space (X, d) is simple, then the problem can be solved efficiently. For example, if the
points of X are on the real line (and the distance between a and b is just |a − b|), then k-median can be
solved using dynamic programming.
Another interesting case is when the metric space (X, d) is a HST. It is not too hard to prove the following lemma. See Exercise 23.7.1.
Lemma 23.2.2. Let (X, d) be a HST defined over n points, and let k > 0 be an integer. One can
compute the optimal k-median clustering of X in O(k 2 n) time.
Thus, if we can embed a general graph G into a HST HST, with low distortion, then we could
approximate the k-median clustering on G by clustering the resulting HST, and “importing” the resulting
partition to the original space. The quality of approximation, would be bounded by the distortion of
the embedding of G into HST.

23.3. Random Partitions


Let (X, d) be a finite metric space. Given a partition P = {C1, . . . , Cm } of X, we refer to the sets Ci as
clusters. We write PX for the set of all partitions of X. For x ∈ X and a partition P ∈ PX we denote
by P(x) the unique cluster of P containing x. Finally, the set of all probability distributions on PX is
denoted DX .

23.3.1. Constructing the partition


Let ∆ = 2u be a prescribed parameter, which is the required diameter of the resulting clusters. Choose,
uniformly at random, a permutation π of X and a random value α ∈ [1/4, 1/2]. Let R = α∆, and observe
that it is uniformly distributed in the interval [∆/4, ∆/2].

The partition is now defined as follows: A point x ∈ X is assigned to the cluster Cy of y, where y is the first point in the permutation within distance ≤ R from x. Formally,
    Cy = {x ∈ X | x ∈ b(y, R) and π(y) ≤ π(z) for all z ∈ X with x ∈ b(z, R)}.
Let P = {Cy}_{y∈X} denote the resulting partition.


Here is a somewhat more intuitive explanation: Once we fix the radius of the clusters R, we start scooping out balls of radius R centered at the points of the random permutation π. At the ith stage, we scoop out only the remaining mass of the ball of radius R centered at xi, where xi is the ith point in the random permutation.
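A short Python sketch of this construction (names ours; the metric is passed in as a function d):

    import random

    def random_partition(points, d, Delta):
        # returns a dict mapping each point to the center of its cluster
        R = random.uniform(Delta / 4, Delta / 2)
        pi = list(points)
        random.shuffle(pi)                     # the random permutation of X
        center = {}
        for x in points:
            for y in pi:                       # first point of the permutation within distance R of x
                if d(x, y) <= R:
                    center[x] = y
                    break
        return center

    # toy example: points on a line with the usual distance
    pts = list(range(10))
    part = random_partition(pts, lambda a, b: abs(a - b), Delta=4)
    assert all(abs(x - part[x]) <= 2 for x in pts)   # every point is within Delta/2 of its center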

23.3.2. Properties
Lemma 23.3.1. Let (X, d) be a finite metric space, ∆ = 2u a prescribed parameter, and let P be the
partition of X generated by the above random partition. Then the following holds:
(i) For any C ∈ P, we have diam(C) ≤ ∆.
(ii) Let x be any point of X, and t a parameter ≤ ∆/8. Then,
8t b
Pr[b(x, t) * P(x)] ≤ ln ,
∆ a
where a = |b(x, ∆/8)|, b = |b(x, ∆)|.

Proof: Since Cy ⊆ b(y, R), we have that diam(Cy ) ≤ ∆, and thus the first claim holds.
Let U be the set of points of b(x, ∆), such that w ∈ U iff b(w, R) ∩ b(x, t) , ∅. Arrange the points
of U in increasing distance from x, and let w1, . . . , wb0 denote the resulting order, where b0 = |U|.
Let Ik = [d(x, w k ) − t, d(x, w k ) + t] and write E k for the event that w k is the first point in π such
that b(x, t) ∩ Cwk , ∅, and yet b(x, t) * Cwk . Note that if w k ∈ b(x, ∆/8), then Pr[E k ] = 0 since
b(x, t) ⊆ b(x, ∆/8) ⊆ b(w k , ∆/4) ⊆ b(w k , R).
In particular, w1, . . . , wa ∈ b(x, ∆/8) and as such Pr[E1 ] = · · · = Pr[Ea ] = 0. Also, note that if
d(x, w k ) < R −t then b(w k , R) contains b(x, t) and as such E k can not happen. Similarly, if d(x, w k ) > R +t
then b(w k , R) ∩ b(x, t) = ∅ and E k can not happen. As such, if E k happen then R − t ≤ d(x, w k ) ≤ R + t.
Namely, if E k happen then R ∈ Ik . Namely, Pr[E k ] = Pr[E k ∩ (R ∈ Ik )] = Pr[R ∈ Ik ] · Pr[E k | R ∈ Ik ].
Now, R is uniformly distributed in the interval [∆/4, ∆/2], and Ik is an interval of length 2t. Thus,
Pr[R ∈ Ik ] ≤ 2t/(∆/4) = 8t/∆.
Next, to bound Pr[E k | R ∈ Ik ], we observe that w1, . . . , w k−1 are closer to x than w k and their distance
to b(x, t) is smaller than R. Thus, if any of them appear before w k in π then E k does not happen. Thus,
Pr[E k | R ∈ Ik ] is bounded by the probability that w k is the first to appear in π out of w1, . . . , w k . But
this probability is 1/k, and thus Pr[E k | R ∈ Ik ] ≤ 1/k.
We are now ready for the kill. Indeed,
b0
Õ b0
Õ b0
Õ
Pr[b(x, t) * P(x)] = Pr[E k ] = Pr[E k ] = Pr[R ∈ Ik ] · Pr[E k | R ∈ Ik ]
k=1 k=a+1 k=a+1
b0
Õ 8t 1 8t b0 8t b
≤ · ≤ ln ≤ ln ,
k=a+1
∆ k ∆ a ∆ a
Íb ∫b dx
since 1
k=a+1 k ≤ a x = ln ab and b0 ≤ b. 

206
23.4. Probabilistic embedding into trees
In this section, given n-point finite metric (X, d). we would like to embed it into a HST. As mentioned
above, one can verify that for any embedding into HST, the distortion in the worst case is Ω(n). Thus,
we define a randomized algorithm that embed (X, d) into a tree. Let T be the resulting tree, and
consider two points x, y ∈ X. Consider the random variable dT (x, y). We constructed the tree T such
that distances
h never
i shrink; i.e. d(x, y) ≤ dT (x, y). The probabilistic distortion of this embedding is
dT (x,y)
max x,y E d(x,y) . Somewhat surprisingly, one can find such an embedding with logarithmic probabilistic
distortion.
Theorem 23.4.1. Given n-point metric (X, d) one can randomly embed it into a 2-HST with probabilis-
tic distortion ≤ 24 ln n.
Proof: The construction is recursive. Let diam(P), and compute a random partition of X with cluster
diameter diam(P)/2, using the construction of Section 23.3.1. We recursively construct a 2-HST for each
cluster, and hang the resulting clusters on the root node v, which is marked by ∆v = diam(P). Clearly,
the resulting tree is a 2-HST.
For a node v ∈ T, let X(v) be the set of points of X contained in the subtree of v.
For the analysis, assume diam(P) = 1, and consider two points x, y ∈ X. We consider a node v ∈ T
to be in level i if level(v) = dlg ∆v e = i. The two points x and y correspond to two leaves in T, and let b u
be the least common ancestor of x and y in t. We have dT (x, y) ≤ 2 level(v) . Furthermore, note that along
a path the levels are strictly monotonically increasing.
In fact, we are going to be conservative, and let w be the first ancestor of x, such that b = b(x, d(x, y))
is not completely contained in X(u1 ), . . . , X(um ), where u1, . . . , um are the children of w. Clearly, level(w) >
level(bu). Thus, dT (x, y) ≤ 2level(w) .
Consider the path σ from the root of T to x, and let Ei be the event that b is not fully contained
in X(vi ), where vi is the node of σ of level i (if such a node exists). Furthermore, let Yi be the indicator
variable which is 1 if Ei is the first to happened out of the sequence of events E0, E−1, . . .. Clearly,
dT (x, y) ≤ Yi 2i .
Í
Let t = d(x, y) and j = blg d(x, y)c, and ni = b(x, 2i ) for i = 0, . . . , −∞. We have
0 0 0
h i Õ 8t ni
E[Yi ] 2i ≤ 2i Pr Ei ∩ Ei−1 ∩ Ei−1 · · · E0 ≤ 2i · i ln
Õ Õ
E[dT (x, y)] ≤ ,
i= j i= j i= j
2 ni−3

by Lemma 23.3.1. Thus,


0
!
Ö ni
E[dT (x, y)] ≤ 8t ln ≤ 8t ln(n0 · n1 · n2 ) ≤ 24t ln n.
i= j
ni−3

It thus follows, that the expected distortion for x and y is ≤ 24 ln n. 

23.4.1. Application: approximation algorithm for k-median clustering


Let (X, d) be a n-point metric space, and let k be an integer number. We would like to compute the
optimal k-median clustering. Number, find a subset Copt ⊆ X, such that νCopt (X, d) is minimized, see
Section 23.2.2. To this end, we randomly embed (X, d) into a HST HST using Theorem 23.4.1. Next,
using Lemma 23.2.2, we compute the optimal k-median clustering of HST. Let C be the set of centers
computed. We return C together with the partition of X it induces as the required clustering.

207
Theorem 23.4.2. Let (X, d) be a n-point metric space. One can compute in polynomial time a k-
median clustering of X which has expected price O(α log n), where α is the price of the optimal k-median
clustering of (X, d).

Proof: The algorithm is described above, and the fact that its running time is polynomial can be easily
be verified. To prove the bound on the quality of the clustering, for any point p ∈ X, let center(p)
denote the closest point in Copt to p according to d, where Copt is the set of k-medians in the optimal
clustering. Let C be the set of k-medians returned by the algorithm, and let HST be the HST used by
the algorithm. We have
Õ Õ
β = νC (X, d) ≤ νC (X, dHST ) ≤ νCopt (X, dHST ) ≤ dHST (p, Copt ) ≤ dHST (p, center(p)).
p∈X p∈X

Thus, in expectation we have


Õ  Õ Õ
dHST (p, center(p)) = E[dHST (p, center(p))] = O(d(p, center(p)) log n)
 
E[β] = E 

 p∈X  p∈X p∈X
 
Õ  
= O ­(log n) d(p, center(p))® = O νCopt (X, d) log n ,
© ª

« p∈X ¬
by linearity of expectation and Theorem 23.4.1. 

23.5. Embedding any metric space into Euclidean space


Lemma 23.5.1. Let (X, d) be a metric, and let Y ⊂ X. Consider the mapping f : X → R, where
f (x) = d(x, Y ) = min y∈Y d(x, y). Then for any x, y ∈ X, we have | f (x) − f (y)| ≤ d(x, y). Namely f is
nonexpansive.

Proof: Indeed, let x 0 and y0 be the closet points of Y , to x and y, respectively. Observe that f (x) =
d(x, x 0) ≤ d(x, y0) ≤ d(x, y) + d(y, y0) = d(x, y) + f (y) by the triangle inequality. Thus, f (x) − f (y) ≤ d(x, y).
By symmetry, we have f (y) − f (x) ≤ d(x, y). Thus, | f (x) − f (y)| ≤ d(x, y). 

23.5.1. The bounded spread case


Let (X, d) be a n-point metric. The spread of X, denoted by Φ(X) = minx,ydiam(X)
∈X, x,y d(x,y)
, is the ratio between
the diameter of X and the distance between the closest pair of points.
Theorem 23.5.2. Given a√n-point metric Y = (X, d), with spread Φ, one can embed it into Euclidean
space R k with distortion O( ln Φ ln n), where k = O(ln Φ ln n).

Proof: Assume that diam(Y) = Φ (i.e., the smallest distance in Y is 1), and let ri = 2i−2 , for i = 1, . . . , α,
where α = dlg Φe. Let Pi, j be a random partition of P with diameter ri , using Theorem 23.4.1, for
i = 1, . . . , α and j = 1, . . . , β, where β = dc log ne and c is a large enough constant to be determined
shortly.
For each cluster of Pi, j randomly toss a coin, and let Vi, j be the all the points of X that belong
to clusters in Pi, j that got ’T’ in their coin toss. For a point u ∈ x, let fi, j (x) = d(x, X \ Vi, j ) =

208
minv∈X\Vi, j d(x, v), for i = 0, . . . , m and j = 1, . . . , β. Let F : X → R(m+1)·β be the embedding, such that
F(x) = f0,1 (x), f0,2 (x), . . . , f0,β (x), f1,1 (x), f0,2 (x), . . . , f1,β (x), . . . , fm,1 (x), fm,2 (x), . . . , fm,β (x) .

Next, consider two points x, y ∈ X, with distance φ = d(x, y). Let k be an integer such that
ru ≤ φ/2 ≤ ru+1 . Clearly, in any partition of Pu,1, . . . , Pu,β the points x and y belong to different clusters.
Furthermore, with probability half x ∈ Vu, j and y < Vu, j or x < Vu, j and y ∈ Vu, j , for 1 ≤ j ≤ β.
Let E j denote the event that b(x, ρ) ⊆ Vu, j and y < Vu, j , for j = 1, . . . , β, where ρ = φ/(64 ln n). By
Lemma 23.3.1, we have
 8ρ φ
Pr b(x, ρ) * Pu, j (x) ≤ ln n ≤

≤ 1/2.
ru 8ru
Thus,

Pr E j = Pr b(x, ρ) ⊆ Pu, j (x) ∩ x ∈ Vu, j ∩ y < Vu, j


     

= Pr b(x, ρ) ⊆ Pu, j (x) · Pr x ∈ Vu, j · Pr y < Vu, j ≥ 1/8,


     

since those three events are independent. Notice, that if E j happens, than fu, j (x) ≥ ρ and fu, j (y) = 0.
Let X j be an indicator variable which is 1 if Ei happens, for j = 1, . . . , β. Let Z = j Xj ,
Í

and we have µ = E[Z] = E j X j ≥ β/8. Thus, the probability that only β/16 of E1, . . . , E β
Í 

happens, is Pr[Z  < (1 − 1/2) E[Z]]. 10By the Chernoff inequality, we have Pr[Z < (1 − 1/2) E[Z]] ≤
exp −µ1/(2 · 2 ) = exp(−β/64) ≤ 1/n , if we set c = 640.
2

Thus, with high probability


v

u
β
u
tÕ r
2 β p ρ β
kF(x) − F(y)k ≥ fu, j (x) − fu, j (y) ≥ ρ 2 = β =φ· .
j=1
16 4 256 ln n

On the other hand, fi, j (x) − fi, j (y) ≤ d(x, y) = φ ≤ 64ρ ln n. Thus,
q p p
kF(x) − F(y)k ≤ αβ(64ρ ln n)2 ≤ 64 αβρ ln n = αβ · φ.

n
Thus, setting G(x) = F(x) 256√ln
β
, we get a mapping that maps two points of distance φ from each
h √ n
i
other to two points with distance in the range φ, φ · αβ · √ β . Namely, G(·) is an embedding with
256 ln
√ √
distortion O( α ln n) = O( ln Φ ln n).
The probability that G fails on one of the pairs, is smaller than (1/n10 ) · 2n < 1/n8 . In particular,

we can check the distortion of G for all 2n pairs, and if any of them fail (i.e., the distortion is too big),
we restart the process. 

23.5.2. The unbounded spread case


Our next task, is to extend Theorem 23.5.2 to the case of unbounded spread. Indeed, let (X, d) be a
n-point metric, such that diam(X) ≤ 1/2. Again, we look on the different resolutions r1, r2, . . ., where
ri = 1/2i−1 . For each one of those resolutions ri , we can embed this resolution into β coordinates, as
done for the bounded case. Then we concatenate the coordinates together.
There are two problems with this approach: (i) the number of resulting coordinates is infinite, and (ii)
a pair x, y, might be distorted a “lot” because it contributes to all resolutions, not only to its “relevant”
resolutions.

209
Both problems can be overcome with careful tinkering. Indeed, for a resolution ri , we are going to
modify the metric, so that it ignores short distances (i.e., distances ≤ ri /n2 ). Formally, for each resolution
ri , let Gi = (X, E bi ) be the graph where two points x and y are connected if d(x, y) ≤ ri /n2 . Consider a
connected component C ∈ Gi . For any two points x, y ∈ C, we have d(x, y) ≤ n(ri /n2 ) ≤ ri /n. Let Xi
be the set of connected components of Gi , and define the distances between two connected components
C, C 0 ∈ Xi , to be di (C, C 0) = d(C, C 0) = minc∈C,c 0 ∈C 0 d(c, c0).
It is easy to verify that (Xi, di ) is a metric space (see Exercise 23.7.2). Furthermore, we can naturally
embed (X, d) into (Xi, di ) by mapping a point x ∈ X to its connected components in Xi . Essentially (Xi, di )
is a snapped version of the metric (X, d), with the advantage that Φ((X, di )) = O(n2 ). We now embed Xi
into β = O(log n) coordinates. Next, for any point of X we embed it into those β coordinates, by using
the embedding of its connected component in Xi . Let Ei be the embedding for resolution ri . Namely,
Ei (x) = ( fi,1 (x), fi,2 (x), . . . , fi,β (x)), where fi, j (x) = min(di (x, X \ Vi, j ), 2ri ). The resulting embedding is
F(x) = ⊕Ei (x) = (E1 (x), E2 (x), . . . , ).
Since we slightly modified the definition of fi, j (·), we have to show that fi, j (·) is nonexpansive. Indeed,
consider two points x, y ∈ Xi , and observe that

fi, j (x) − fi, j (y) ≤ di (x, Vi, j ) − di (y, Vi, j ) ≤ di (x, y) ≤ d(x, y),

as a simple case analysis¬ shows.


For a pair x, y ∈ X, and let φ = d(x, y). To see that F(·) is the required embedding (up to scaling),
observe that, by the same argumentation of Theorem 23.5.2, we have that with high probability

β
kF(x) − F(y)k ≥ φ · .
256 ln n
To get an upper bound on this distance, observe that for i such that ri > φn2 , we have Ei (x) = Ei (y).
Thus,
Õ Õ
kF(x) − F(y)k 2 = kEi (x) − Ei (y)k 2 = kEi (x) − Ei (y)k 2
i i,ri <φn2
Õ Õ
= kEi (x) − Ei (y)k 2 + kEi (x) − Ei (y)k 2
i,φ/n2 <ri <φn2 i,ri <φ/n2
Õ 4φ2 β
= βφ2 lg n4 + (2ri )2 β ≤ 4βφ2 lg n + 4 ≤ 5βφ2 lg n.

n
2 i,ri <φ/n

Thus, kF(x) − F(y)k ≤ φ 5β lg n. We conclude, that probability, F(·) is an embedding of X


p
 with√
high

β
into Euclidean space with distortion φ 5β lg n / φ · 256 ln n = O(log3/2 n).
p

We still have to handle the infinite number of coordinates problem. However, the above proof shows
that we care about a resolution ri (i.e., it contributes to the estimates in the above proof) only if there
is a pair x and y such that ri /n2 ≤ d(x, y) ≤ ri n2 . Thus, for every pair of distances there are O(log n)
relevant resolutions. Thus, there are at most η = O(n2 β log n) = O(n2 log2 n) relevant coordinates, and
we can ignore all the other coordinates. Next, consider the affine subspace h that spans F(P). Clearly,
it is n − 1 dimensional, and consider the projection G : Rη → Rn−1 that projects a point to its closest
¬ Indeed, if f (x) < d (x, V ) and f (y) < d (x, V ) then f (x) = 2r and f (y) = 2r , which implies the above
i, j i i, j i, j i i, j i, j i i, j i
inequality. If fi, j (x) = di (x, Vi, j ) and fi, j (y) = di (x, Vi, j ) then the inequality trivially holds. The other option is handled in a
similar fashion.

210
point in h. Clearly, G(F(·)) is an embedding with the same distortion for P, and the target space is of
dimension n − 1.
Note, that all this process succeeds with high probability. If it fails, we try again. We conclude:

Theorem 23.5.3 (Low quality Bourgain theorem.). Given a n-point metric M, one can embed it
into Euclidean space of dimension n − 1, such that the distortion of the embedding is at most O(log3/2 n).

Using the Johnson-Lindenstrauss lemma, the dimension can be further reduced to O(log n). In fact,
being more careful in the proof, it is possible to reduce the dimension to O(log n) directly.

23.6. Bibliographical notes


The partitions we use are due to Calinescu et al. [CKR01]. The idea of embedding into spanning
trees

is due to Alon et al. [AKPW95], which showed that one can get a probabilistic distortion of
O ( log n log log n)
2 . Yair Bartal realized that by allowing trees with additional vertices, one can get a
considerably better result. In particular, he showed [Bar96] that probabilistic embedding into trees can
be done with polylogarithmic average distortion. He later improved the distortion to O(log n log log n) in
[Bar98]. Improving this result was an open question, culminating in the work of Fakcharoenphol et al.
[FRT04] which achieve the optimal O(log n) distortion.
Interestingly, if one does not care about the optimal distortion, one can get similar result (for
embedding into probabilistic trees), by first embedding the metric into Euclidean space, then reduce
the dimension by the Johnson-Lindenstrauss lemma, and finally, construct an HST by constructing a
quadtree over the points. The “trick” is to randomly translate the quadtree. It is easy to verify that
this yields O(log4 n) distortion. See the survey by Indyk [Ind01] for more details. This random shifting
of quadtrees is a powerful technique that was used in getting several result, and it is a crucial ingredient
in Arora [Aro98] approximation algorithm for Euclidean TSP.
Our proof of Lemma 23.3.1 (which is originally from [FRT04]) is taken from [KLMN05]. The proof
of Theorem 23.5.3 is by Gupta [Gup00].
A good exposition of metric spaces is available in Matoušek [Mat02].

23.7. Exercises
Exercise 23.7.1 (Clustering for HST.). Let (X, d) be a HST defined over n points, and let k > 0 be an
integer. Provide an algorithm that computes the optimal k-median clustering of X in O(k 2 n) time.
[Transform the HST into a tree where every node has only two children. Next, run a dynamic
programming algorithm on this tree.]

Exercise 23.7.2 (Partition induced metric.).

(a) Give a counter example to the following claim: Let (X, d) be a metric space, and let P be a partition
of X. Then, the pair (P, d0) is a metric, where d0(C, C 0) = d(C, C 0) = min x∈C,y∈C 0 d(x, y) and C, C 0 ∈ P.

(b) Let (X, d) be a n-point metric space, and consider the set U = i 2i ≤ d(x, y) ≤ 2i+1, for x, y ∈ X .


Prove that |U| = O(n). Namely, there are only n different resolutions that “matter” for a finite
metric space.

211
Exercise 23.7.3 (Computing the diameter via embeddings.).

(a) (h:1) Let ` be a line in the plane, and consider the embedding f : R2 → `, which is the projection
of the plane into `. Prove that f is 1-Lipschitz, but it is not K-bi-Lipschitz for any constant K.

(b) (h:3) Prove that one can find a family of projections F of size O(1/ ε), such that for any two points
x, y ∈ R2 , for one of the projections f ∈ F we have d( f (x), f (y)) ≥ (1 − ε)d(x, y).

(c) (h:1) Given a set P of n in the plane, given a O(n/ ε) time algorithm that outputs two points
x, y ∈ P, such that d(x, y) ≥ (1 − ε)diam(P), where diam(P) = max z,w∈P d(z, w) is the diameter of P.

(d) (h:2) Given P, show how to extract, in O(n) time, a set Q ⊆ P of size O(ε −2 ), such that diam(Q) ≥
(1 − ε/2)diam(P). (Hint: Construct a grid of appropriate resolution.)
In particular, give an (1 − ε)-approximation algorithm to the diameter of P that works in O(n + ε −2.5 )
time. (There are slightly faster approximation algorithms known for approximating the diameter.)

Acknowledgments
The presentation in this write-up follows closely the insightful suggestions of Manor Mendel.
b
s
a

212
Chapter 24

Approximate Max Cut


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018

24.1. Problem Statement


Given an undirected graph G = (V, E) and nonnegative weights wi j on the edge i j ∈ E, the maximum cut
problem (MAX CUT) is that of finding the set of vertices S that maximizes the weight of the edges in
the cut (S, S); that is, the weight of the edges with one endpoint in S and the other inÕS. For simplicity,
we usually set wi j = O for i j < E and denote the weight of a cut (S, S) by w(S, S) = wi j .
i∈S, j∈ j
This problem is NP-Complete, and hard to approximate within a certain constant.
Given a graph with vertex set V = 1, . . . , n and nonnegative weights Wi j , the weight of the maximum
cut w(S, S) N given by the following integer quadratic program:

(Q) Maximize , wi j (1 − yi y j )
2 i< j
subject to: yi ∈ {−1, 1} ∀i ∈ V .

Indeed, set S = i yi = 1 . Clearly, w(S, S) = 12 i< j , wi j (1 − yi y j ).


 Í
Solving quadratic integer programming is of course NP-Hard. Thus, we we will relax it, by thinking
about the the numbers yi as unit vectors in higher dimensional space. If so, the multiplication of the
two vectors, is now replaced by dot product. We have:

(P) Maximize 12 i< j wi j (1 − vi, v j )


Í

subject to: vi ∈ S(n) ∀i ∈ V,

where S(n) is the n dimensional unit sphere in Rn+1 . This is an instance of semi-definite programming,
which is a special case of convex programming, which can be solved in polynomial time (solved here
means approximated within arbitrary constant in polynomial time). Observe that (P) is a relaxation of
(Q), and as such the optimal solution of (P) has value larger than the optimal value of (Q).
The intuition is that vectors that correspond to vertices that should be on one side of the cut, and
vertices on the other sides, would have vectors which are faraway from each other in (P). Thus, we
compute the optimal solution for (P), and we uniformly generate a random vector r on the unit sphere
S(n) . This induces a hyperplane h which passes through the origin and is orthogonal to r. We next
assign all the vectors that are on one side of h to S, and the rest to S.

213
24.1.1. Analysis
The intuition of the above rounding procedure, is that with good probability, vectors that have big angle
between them would be separated by this cut.
  1 
Lemma 24.1.1. We have Pr sign(hvi, r i) , sign v j , r = arccos vi, v j .
π
j v
Proof: Let us think about the vectors vi, v j and r as being in the plane. To vi
see why this is a reasonable assumption, consider the plane g spanned by vi
and v j , and observe that for the random events we consider, only the direction
of r matter, which can be decided by projecting r on g, and normalizing it to τ

have length 1. Now, the sphere is symmetric, and as such, sampling r randomly
from S(n) , projecting it down to g, and then normalizing it, is equivalent to just
choosing uniformly a vector from the  unit circle.
Now, sign(hvi, r i) , sign v j , r happens only if r falls in the double wedge
formed by the lines perpendicular to vi and v j . The angle of this double wedge is exactly the angle
between vi and v j . Now, since vi and   v j are unit vectors, we have vi, v j = cos(τ), where τ = ∠vi v j .
1
Thus, Pr sign(hvi, r i) , sign v j , r = 2τ/(2π) = π · arccos vi, v j , as claimed. 
Theorem 24.1.2. Let W be the random variable which is the weight of the cut generated by the algo-
rithm. We have
1Õ 
E[W] = wi j arccos vi, v j .
π i< j

Proof: Let Xi j be i j X
  
an indicator variable which is 1 if is in the cut. We have E i j = Pr sign(hvi, r i) , sign v j , r
1

π arccos v , v
i j Í , by Lemma 24.1.1.
Clearly, W = i< j wi j Xi j , and by linearity of expectation, we have
Õ   Õ
i j E Xi j =

E [W] = w arccos vi, v j . 
i< j i< j

arccos(y) 1 2 ψ
Lemma 24.1.3. For −1 ≤ y ≤ 1, we have ≥ α · (1 − y), where α = min .
π 2 0≤ψ≤π π 1 − cos(ψ)
Proof: Set y = cos(ψ). The inequality now becomes ψπ ≥ α 12 (1 − cos ψ). Reorganizing, the inequality
ψ
becomes π2 1−cos ψ ≥ α, which trivially holds by the definition of α. 
Lemma 24.1.4. α > 0.87856.
Proof: Using simple calculus, one can see that α achieves its value for ψ = 2.331122..., the nonzero root
of cosψ + ψ sin ψ = 1. 
Theorem 24.1.5. The above algorithm computes in expectation a cut of size αOpt ≥ 0.87856Opt,
where Opt is the weight of the maximal cut.
Proof: Consider the optimal solution to (P), and lets its value be γ ≥ Opt. We have
1Õ  Õ 1 
E[W] = wi j arccos vi, v j ≥ wi j α 1 − vi, v j = αγ ≥ αOpt,
π i< j i< j
2

by Lemma 24.1.3. 

214
24.2. Semi-definite programming
Let us define a variable xi j = vi, v j , and consider the n by n matrix M formed by those variables, where
xii = 1 for i = 1, . . . , n. Let V be the matrix having v1, . . . , vn as its columns. Clearly, M = V T V. In
particular, this implies that for any non-zero vector v ∈ Rn , we have vT Mv = vT AT Av = (Av)T (Av) ≥ 0.
A matrix that has this property, is called semidefinite. The interesting thing is that any semi-definite
matrix P can be represented as a product of a matrix with its transpose; namely, P = BT B. It is easy
to observe that if this semi-definite matrix has a diagonal one, then B has rows which are unit vectors.
Thus, if we solve (P) and get back a semi-definite matrix, then we can recover the vectors realizing the
solution, and use them for the rounding.
In particular, (P) can now be restated as
1
wi j (1 − xi j )
Í
(SD) Maximize 2 i< j
xii = 1 for i = 1, . . . , n
xi j i=1,...,n, j=1,...,n is semi-definite.

subject to:

We are trying to find the optimal value of a linear function over a set which is the intersection of linear
constraints and the set of semi-definite matrices.

Lemma 24.2.1. Let U be the set of n × n semidefinite matrices. The set U is convex.

Proof: Consider A, B ∈ U, and observe that for any t ∈ [0, 1], and vector v ∈ Rn , we have: vT (t A + (1 −
t)B)v = tvT Av + (1 − t)vT Bv ≥ 0 + 0 ≥ 0, since A and B are semidefinite. 

Positive semidefinite matrices corresponds to ellipsoids. Indeed, consider the set xT Ax = 1: the set of
vectors that solve this equation is an ellipsoid. Also, the eigenvalues of a positive semidefinite matrix are
all non-negative real numbers. Thus, given a matrix, we can in polynomial time decide if it is positive
semidefinite or not.
Thus, we are trying to optimize a linear function over a convex domain. There is by now machinery
to approximately solve those problems to within any additive error in polynomial time. This is done by
using interior point method, or the ellipsoid method. See [BV04, GLS93] for more details.

24.3. Bibliographical Notes


The approximation algorithm presented is from the work of Goemans and Williamson [GW95]. Håstad
[Hås01] showed that MAX CUT can not be approximated within a factor of 16/17 ≈ 0.941176. Recently,
Khot et al. [KKMO04] showed a hardness result that matches the constant of Goemans and Williamson
(i.e., one can not approximate it better than φ, unless P = NP). However, this relies on two conjectures,
the first one is the “Unique Games Conjecture”, and the other one is “Majority is Stablest”. The
“Majority is Stablest” conjecture was recently proved by Mossel et al. [MOO05]. However, it is not
clear if the “Unique Games Conjecture” is true, see the discussion in [KKMO04].
The work of Goemans and Williamson was very influential and spurred wide research on using SDP
for approximation algorithms. For an extension of the MAX CUT problem where negative weights are
allowed and relevant references, see the work by Alon and Naor [AN04].

215
216
Chapter 25

Entropy, Randomness, and Information


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“If only once - only once - no matter where, no matter before what audience - I could better the record of
the great Rastelli and juggle with thirteen balls, instead of my usual twelve, I would feel that I had truly
accomplished something for my country. But I am not getting any younger, and although I am still at the
peak of my powers there are moments - why deny it? - when I begin to doubt - and there is a time limit on
all of us.”
– –Romain Gary, The talent scout..

25.1. Entropy
Definition 25.1.1. The entropy in bits of a discrete random variable X is given by
Õ
H(X) = − Pr[X = x] lg Pr[X = x] .
x
h i
1
Equivalently, H(X) = E lg Pr[X] .
The binary entropy function H(p) for a random binary variable that is 1 with probability p, is
H(p) = −p lg p − (1 − p) lg(1 − p). We define H(0) = H(1) = 0.

The function H(p) is a concave symmetric around 1/2 on the interval [0, 1] and achieves its maximum
at 1/2. For a concrete example, consider H(3/4) ≈ 0.8113 and H(7/8) ≈ 0.5436. Namely, a coin that has
3/4 probably to be heads have higher amount of “randomness” in it than a coin that has probability
7/8 for heads.
We have that
1
H(p) = (−p ln p − (1 − p) ln(1 − p))
ln 2 
p 1−p 1−p

0 1
and H (p) = − ln p − − (−1) ln(1 − p) − (−1) = lg .
ln 2 p 1−p p
Deploying our amazing ability to compute derivative of simple functions once more, we get that
p p(−1) − (1 − p)
 
00 1 1
H (p) = =− .
ln 2 1 − p p 2 p(1 − p) ln 2

217
Since ln 2 ≈ 0.693, we have that H00(p) ≤ 0, for all p ∈ (0, 1), and the H(·) is concave in this range. Also,
H0(1/2) = 0, which implies that H(1/2) = 1 is a maximum of the binary entropy. Namely, a balanced
coin has the largest amount of randomness in it.
Example 25.1.2. A random variable X that has probability 1/n to be i, for i = 1, . . . , n, has entropy
Ín 1 1
H(X) = − i=1 n lg n = lg n.

Note, that the entropy is oblivious to the exact values that the random variable can have, and it is
sensitive only to the probability distribution. Thus, a random variables that accepts −1, +1 with equal
probability has the same entropy (i.e., 1) as a fair coin.
Lemma 25.1.3. Let X and Y be two independent random variables, and let Z be the random variable
(X, T). Then H(Z) = H(X) + H(Y ).
Proof: In the following, summation are over all possible values that the variables can have. By the
independence of X and Y we have
Õ 1
H(Z) = Pr[(X, Y ) = (x, y)] lg
x,y
Pr[(X, Y ) = (x, y)]
Õ 1
= Pr[X = x] Pr[Y = y] lg
x,y
Pr[X = x] Pr[Y = y]
ÕÕ 1
= Pr[X = x] Pr[Y = y] lg
x y
Pr[X = x]
ÕÕ 1
+ Pr[X = x] Pr[Y = y] lg
y x
Pr[Y = y]
Õ 1 Õ 1
= Pr[X = x] lg + Pr[Y = y] lg = H(X) + H(Y ). 
x
Pr[X = x] y
Pr[Y = y]

2nH(q) n
 
Lemma 25.1.4. Suppose that nq is integer in the range [0, n]. Then ≤ ≤ 2nH(q) .
n+1 nq
n  nq
Proof: This trivially holds if q = 0 or q = 1, so assume 0 < q < 1. We know that nq q (1 − q)n−nq ≤
(q + (1 − q))n = 1. As such, since q−nq (1 − q)−(1−q)n = 2n (−q lg q−(1−q) lg(1−q)) = 2nH(q) , we have
n
 
≤ q−nq (1 − q)−(1−q)n = 2nH(q) .
nq
n  nq
As for the other direction, we claim that µ(nq) = nq q (1 − q)n−nq is the largest term in nk=0 µ(k) = 1,
Í

where µ(k) = nk q k (1 − q)n−k . Indeed,




n k n−k q
   
n−k
∆ k = µ(k) − µ(k + 1) = q (1 − q) 1− ,
k k +11−q
and the sign of this quantity is the sign of (k + 1)(1 − q) − (n − k)q = k + 1 − kq − q − nq + kq = 1 + k − q − nq.
Namely, ∆ k ≥ 0 when k ≥ nq + q − 1, and ∆ k < 0 otherwise. Namely, µ(k) < µ(k + 1), for k < nq, and
Ín
µ(k) ≥ µ(k + 1) for k ≥ nq. Namely, µ(nq) is the largest term in k=0 µ(k) = 1, and as such it is larger
n  nq
than the average. We have µ(nq) = nq q (1 − q)n−nq ≥ n+1 1
, which implies
n
 
1 −nq 1 nH(q)
≥ q (1 − q)−(n−nq) = 2 . 
nq n+1 n+1

218
Lemma 25.1.4 can be extended to handle non-integer values of q. This is straightforward, and we
omit the easy details.
n  nH(q) . n  nH(q) .
(i) q ∈ [0, 1/2] ⇒ bnqc ≤ 2 (ii) q ∈ [1/2, 1] dnqe ≤ 2
Corollary 25.1.5. We have: nH(q) n  nH(q) n 
(iii) q ∈ [1/2, 1] ⇒ 2n+1 ≤ bnqc . (iv) q ∈ [0, 1/2] ⇒ 2n+1 ≤ dnqe .

The bounds of Lemma 25.1.4 and Corollary 25.1.5 are loose but sufficient for our purposes. As a
sanity check, consider the case when we generate a sequence of n bits using a coin with probability q
for head, then by the Chernoff inequality, we will get roughly nq heads in this sequence. As such, the
n nH(q)
generated sequence Y belongs to nq ≈ 2 possible sequences that have similar probability. As such,
n
H(Y ) ≈ lg nq = nH(q), by Example 25.1.2, a fact that we already know from Lemma 25.1.3.

25.1.1. Extracting randomness


Entropy can be interpreted as the amount of unbiased random coin flips can be extracted from a random
variable.

Definition 25.1.6. An extraction function Ext takes as input the


 value of a random variable X and
outputs a sequence of bits y, such that Pr Ext(X) = y |y| = k = 2k , whenever Pr[|y| = k] ≥ 0, where
1


|y| denotes the length of y.

As a concrete (easy) example, consider X to be a uniform random integer variable out of 0, . . . , 7.


All that Ext(x) has to do in this case, is just to compute the binary representation of x. However, note
that Definition 25.1.6 is somewhat more subtle, as it requires that all extracted sequence of the same
length would have the same probability.
Thus, for X a uniform random integer variable in the range 0, . . . , 11, the function Ext(x) can output
the binary representation for x if 0 ≤ x ≤ 7. However, what do we do if x is between 8 and 11? The
idea is to output the binary representation
 of x − 8 as a two bit number.
 1 Clearly, Definition 25.1.6 holds
for this extraction function, since Pr Ext(X) = 00 |Ext(X)| = 2 = 4 , as required. This scheme can be
of course extracted for any range.

Theorem 25.1.7. Suppose that the value of a random variable X is chosen uniformly at random from
the integers {0, . . . , m − 1}. Then there is an extraction function for X that outputs on average (i.e., in
expectation) at least blg mc − 1 = bH(X)c − 1 independent and unbiased bits.

Proof: We represent m as a sum of unique powers of 2, namely m = i ai 2i , where ai ∈ {0, 1}. Thus,
Í
we decomposed {0, . . . , m − 1} into a disjoint union of blocks that have sizes which are distinct powers
of 2. If a number falls inside such a block, we output its relative location in the block, using binary
representation of the appropriate length (i.e., k if the block is of size 2 k ). The fact that this is an
extraction function, fulfilling Definition 25.1.6, is obvious.
Now, observe that the claim holds trivially if m is a power of two. Thus, if m is not a power of 2,
then in the decomposition if there is a block of size 2 k , and the X falls inside this block, then the entropy
is k. Thus, for the inductive proof, assume
 that are looking at the largest block in the decomposition,
that is m < 2 , and let u = lg(m − 2 ) < k. It is easy to verify that, for any integer α > 2 k , we have
k+1 k

α−2k α+1−2k u+1 + 2 k . As such, m−2k ≤ 2u+1 .
α ≤ α+1 . Furthermore, m ≤ 2 m 2u+1 +2k

219
Let Y be the random variable which is the number of random bits extract. We have that
2k m − 2k   k
 m − 2k
k+ lg(m − 2 ) − 1 = k + (u − k − 1)

E[Y ] ≥
m m m
2u+1 2u+1
≥ k + u+1 (u − k − 1) = k − (1 + k − u).
2 + 2k 2u+1 + 2 k
If u = k − 1, then H(X) ≥ k − 12 · 2 = k − 1, as required. If u = k − 2 then H(X) ≥ k − 13 · 3 = k − 1. Finally,
if u < k − 2 then
2u+1 k −u+1
E[Y ] ≥ k − k
(1 + k − u) ≥ k − k−u−1 ≥ k − 1,
2 2
since 2+i
2i
≤ 1 for i ≥ 2. 

Theorem 25.1.8. Consider a coin that comes up heads with probability p > 1/2. For any constant
δ > 0 and for n sufficiently large:
1. One can extract, from an input of a sequence of n flips, an output sequence of (1−δ)nH(p) (unbiased)
independent random bits.

2. One can not extract more than nH(p) bits from such a sequence.

Proof: There are nj input sequences with exactly j heads, and each has probability p j (1 − p)n− j . We

n o
map this sequence to the corresponding number in the set 0, . . . , nj − 1 . Note, that this, conditional


distribution on j, is uniform on this set, and we can apply the extraction algorithm of Theorem 25.1.7.
Let Z be the random variables which is the number of heads in the input, and let B be the number of
random bits extracted. We have
n
Õ
Pr[Z = k] E B Z = k ,
 
E[B] =
k=0

n
  
and by Theorem 25.1.7, we have E B Z = k ≥ lg − 1. Let ε < p − 1/2 be a constant to be
 
k
determined shortly. For n(p − ε) ≤ k ≤ n(p + ε), we have

n n 2nH(p+ε)
   
≥ ≥ ,
k bn(p + ε)c n+1
by Corollary 25.1.5 (iii). We have
dn(p−ε)e dn(p−ε)e
n
Õ Õ    
Pr[Z = k] E B Z = k ≥ Pr[Z = k]
 
E[B] ≥ lg −1
k
k=bn(p−ε)c k=bn(p−ε)c
dn(p−ε)e
2nH(p+ε)
Õ  
≥ Pr[Z = k] lg −2
n+1
k=bn(p−ε)c
= (nH(p + ε) − lg(n + 1)) Pr[|Z − np| ≤ εn]
nε 2
  
≥ (nH(p + ε) − lg(n + 1)) 1 − 2 exp − ,
4p

220
  2
− np
h i  
− nε
ε ε 2
since µ = E[Z] = np and Pr |Z − np| ≥ p pn ≤ 2 exp 4 p = 2 exp 4p , by the Chernoff inequal-
ity. In particular, fix ε > 0, such that H(p + ε) > (1 − δ/4)H(p), and since p is fixed nH(p) = Ω(n), in
δ
particular,
 for n sufficiently large, we have − lg(n + 1) ≥ − 10 nH(p). Also, for n sufficiently large, we have
2 exp − nε
2 δ
4p ≤ 10 . Putting it together, we have that for n large enough, we have
   
δ δ δ
E[B] ≥ 1− − nH(p) 1 − ≥ (1 − δ)nH(p),
4 10 10

as claimed.
As for the upper bound, observe that if an input sequence x has probability q, then the output
sequence y = Ext(x) has probability to be generated which is at least q. Now, all sequences of length
|y| have equal probability to be generated. Thus, we have the following (trivial) inequality 2|Ext(x)| q ≤
2|Ext(x)| Pr[y = Ext(X)] ≤ 1, implying that |Ext(x)| ≤ lg(1/q). Thus,
Õ Õ 1
E[B] = Pr[X = x] |Ext(x)| ≤ Pr[X = x] lg = H(X). 
x x
Pr[X = x]

25.2. Bibliographical Notes


The presentation here follows [MU05, Sec. 9.1-Sec 9.3].

221
222
Chapter 26

Entropy II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
The memory of my father is wrapped up in white paper, like sandwiches taken for a day at work. Just as
a magician takes towers and rabbits out of his hat, he drew love from his small body, and the rivers of his
hands overflowed with good deeds.
– – Yehuda Amichai, My Father..

26.1. Compression
In this section, we will consider the problem of how to compress a binary string. We will map each binary
string, into a new string (which is hopefully shorter). In general, by using a simple counting argument,
one can show that no such mapping can achieve real compression (when the inputs are adversarial).
However, the hope is that there is an underling distribution on the inputs, such that some strings are
considerably more common than others.
Definition 26.1.1. A compression function Compress takes as input a sequence of n coin flips, given as
an element of {H, T }n , and outputs a sequence of bits such that each input sequence of n flips yields a
distinct output sequence.

The following is easy to verify.


Lemma 26.1.2. If a sequence S1 is more likely than S2 then the compression function that minimizes
the expected number of bits in the output assigns a bit sequence to S2 which is at least as long as S1 .

Note, that this is very weak. Usually, we would like the function to output a prefix code, like the
Huffman code.
Theorem 26.1.3. Consider a coin that comes up heads with probability p > 1/2. For any constant
δ > 0, when n is sufficiently large, the following holds.
(i) There exists a compression function Compress such that the expected number of bits output by
Compress on an input sequence of n independent coin flips (each flip gets heads with probability p)
is at most (1 + δ)nH(p); and
(ii) The expected number of bits output by any compression function on an input sequence of n inde-
pendent coin flips is at least (1 − δ)nH(p).

223
Proof: Let ε > 0 be a constant such that p − ε > 1/2. The first bit output by the compression procedure
is ’1’ if the output string is just a copy of the input (using n + 1 bits overall in the output), and ’0’ if it
is compressed. We compress only if the number of ones in the input sequence, denoted by
2
 X is larger
than (p − ε)n. By the Chernoff inequality, we know that Pr[X < (p − ε)n] ≤ exp −nε /2p .
If there are more than (p − ε)n ones in the input, and since p − ε > 1/2, we have that
n n
n n n
   
≤ 2nH(p−ε),
Õ Õ

j dn(p − ε)e 2
j=dn(p−ε)e j=dn(p−ε)e

by Corollary 25.1.5. As such, we can assign each such input sequence a number in the range 0 . . . 2n 2nH(p−ε) ,
and this requires (with the flag bit) 1 + blg n + nH(p − ε)c random bits.
Thus, the expected number of bits output is bounded by

(n + 1) exp −nε 2 /2p + (1 + blg n + nH(p − ε)c) ≤ (1 + δ)nH(p),




by carefully setting ε and n being sufficiently large. Establishing the upper bound.
As for the lower bound, observe that at least one of the sequences having exactly τ = b(p + ε)nc
heads, must be compressed into a sequence having

n 2nH(p+ε)
 
lg − 1 ≥ lg − 1 = nH(p − ε) − lg(n + 1) − 1 = µ,
b(p + ε)nc n+1

by Corollary 25.1.5. Now, any input string with less than τ heads has lower probability to be generated.
Indeed, for a specific strings with α < τ ones the probability to generate them is pα (1 − p)n−α and
pτ (1 − p)n−τ , respectively. Now, observe that
 τ−α
n−τ (1 − p) n−τ 1 − p
τ−α

n−α
α
p (1 − p) τ
= p (1 − p) · τ
= p (1 − p) < pτ (1 − p)n−τ,
p τ−α p

as 1 − p < 1/2 < p implies that (1 − p)/p < 1.


As such, Lemma 26.1.2 implies that all the input strings with less than τ ones, must be compressed
into strings of length at least µ, by an optimal compresser. Now, the Chenroff inequality implies that
Pr[X ≤ τ] ≥ 1 − exp −nε 2 /12p . Implying that an optimal compresser outputs on average at least
1 − exp −nε 2 /12p µ. Again, by carefully choosing ε and n sufficiently large, we have that the average

output length of an optimal compressor is at least (1 − δ)nH(p). 

26.2. Bibliographical Notes


The presentation here follows [MU05, Sec. 9.1-Sec 9.3].

224
Chapter 27

Entropy III - Shannon’s Theorem


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
The memory of my father is wrapped up in
white paper, like sandwiches taken for a day at work.
Just as a magician takes towers and rabbits
out of his hat, he drew love from his small body,
and the rivers of his hands
overflowed with good deeds.
– – Yehuda Amichai, My Father..

27.1. Coding: Shannon’s Theorem


We are interested in the problem sending messages over a noisy channel. We will assume that the
channel noise is “nicely” behaved.

Definition 27.1.1. The input to a binary symmetric channel with parameter p is a sequence of bits
x1, x2, . . . , and the output is a sequence of bits y1, y2, . . . , such that Pr[xi = yi ] = 1 − p independently for
each i.

Translation: Every bit transmitted have the same probability to be flipped by the channel. The
question is how much information can we send on the channel with this level of noise. Naturally, a
channel would have some capacity constraints (say, at most 4,000 bits per second can be sent on the
channel), and the question is how to send the largest amount of information, so that the receiver can
recover the original information sent.
Now, its important to realize that noise handling is unavoidable in the real world. Furthermore,
there are tradeoffs between channel capacity and noise levels (i.e., we might be able to send considerably
more bits on the channel but the probability of flipping (i.e., p) might be much larger). In designing a
communication protocol over this channel, we need to figure out where is the optimal choice as far as
the amount of information sent.

Definition 27.1.2. A (k, n) encoding function Enc : {0, 1} k → {0, 1}n takes as input a sequence of k
bits and outputs a sequence of n bits. A (k, n) decoding function Dec : {0, 1}n → {0, 1} k takes as
input a sequence of n bits and outputs a sequence of k bits.

225
Thus, the sender would use the encoding function to send its message, and the decoder would use
the received string (with the noise in it), to recover the sent message. Thus, the sender starts with a
message with k bits, it blow it up to n bits, using the encoding function, to get some robustness to noise,
it send it over the (noisy) channel to the receiver. The receiver, takes the given (noisy) message with n
bits, and use the decoding function to recover the original k bits of the message.
Naturally, we would like k to be as large as possible (for a fixed n), so that we can send as much
information as possible on the channel. Naturally, there might be some failure probability; that is, the
receiver might be unable to recover the original string, or recover an incorrect string.
The following celebrated result of Shannon¬ in 1948 states exactly how much information can be
sent on such a channel.

Theorem 27.1.3 (Shannon’s theorem.). For a binary symmetric channel with parameter p < 1/2
and for any constants δ, γ > 0, where n is sufficiently large, the following holds:

(i) For an k ≤ n(1−H(p)−δ) there exists (k, n) encoding and decoding functions such that the probability
the receiver fails to obtain the correct message is at most γ for every possible k-bit input messages.

(ii) There are no (k, n) encoding and decoding functions with k ≥ n(1−H(p)+δ) such that the probability
of decoding correctly is at least γ for a k-bit input message chosen uniformly at random.

27.2. Proof of Shannon’s theorem


The proof is not hard, but requires some care, and we will break it into parts.

27.2.1. How to encode and decode efficiently


27.2.1.1. The scheme

Our scheme would be simple. Pick k ≤ n(1 − H(p) − δ). For any number i = 0, . . . , K b = 2 k+1 − 1,
randomly generate a binary string Yi made out of n bits, each one chosen independently and uniformly.
Let Y0, . . . , YKb denote these codewords.
For each of these codewords we will compute the probability that if we send this codeword, the
receiver would fail. Let X0, . . . , XK , where K = 2 k − 1, be the K codewords with the lowest probability of
failure. We assign these words to the 2 k messages we need to encode in an arbitrary fashion. Specifically,
for i = 0, . . . , 2 k − 1, we encode i as the string Xi .
The decoding of a message w is done by going over all the codewords, and finding all the codewords
that are in (Hamming) distance in the range [p(1 − ε)n, p(1 + ε)n] from w. If there is only a single word
Xi with this property, we return i as the decoded word. Otherwise, if there are no such word or there is
more than one word then the decoder stops and report an error.

27.2.1.2. The proof

¬ Claude Elwood Shannon (April 30, 1916 - February 24, 2001), an American electrical engineer and mathematician,

has been called “the father of information theory”.

226
Intuition. Each code Yi corresponds to a region that looks like a ring. The r = pn
“ring” for Yi is all the strings in Hamming distance between (1 − ε)r and Y2
(1 + ε)r from Yi , where r = pn. Clearly, if we transmit a string Yi , and the
receiver gets a string inside the ring of Yi , it is natural to try to recover the Y0
received string to the original code corresponding to Yi . Naturally, there are
two possible bad events here: 2εpn
Y1
(A) The received string is outside the ring of Yi , and
(B) The received string is contained in several rings of different Y s, and
it is not clear which one should the receiver decode the string to. These bad
regions are depicted as the darker regions in the figure on the right.
Let Si = S(Yi ) be all the binary strings (of length n) such that if the receiver gets this word, it would
decipher it to be the original string assigned to Yi (here are still using the extended set of codewords
Y0, . . . , YKb). Note, that if we remove some codewords from consideration, the set S(Yi ) just increases
in size (i.e., the bad region in the ring of Yi that is covered multiple times shrinks). Let Wi be the
probability that Yi was sent, but it was not deciphered correctly. Formally, let r denote the received
word. We have that Õ
Wi = Pr[r was received when Yi was sent] . (27.1)
r<Si

To bound this quantity, let ∆(x, y) denote the Hamming distance between the binary strings x and y.
Clearly, if x was sent the probability that y was received is

w(x, y) = p∆(x,y) (1 − p)n−∆(x,y) .

As such, we have

Pr[r received when Yi was sent] = w(Yi, r).

Let Si,r be an indicator variable which is 1 if r < Si . We have that


Õ Õ Õ
Wi = Pr[r received when Yi was sent] = w(Yi, r) = Si,r w(Yi, r). (27.2)
r<Si r<Si r

The value of Wi is a random variable over the choice of Y0, . . . , YKb. As such, its natural to ask what
is the expected value of Wi .
Consider the ring

ring(r) = x ∈ {0, 1}n (1 − ε)np ≤ ∆(x, r) ≤ (1 + ε)np ,




where ε > 0 is a small enough constant. Observe that x ∈ ring(y) if and only if y ∈ ring(x). Suppose,
that the code word Yi was sent, and r was received. The decoder returns the original code associated
with Yi , if Yi is the only codeword that falls inside ring(r).

Lemma 27.2.1. Given that Yi was sent, and r was received and furthermore r ∈ ring(Yi ), then the
probability of the decoder failing, is
 γ
τ = Pr r < Si r ∈ ring(Yi ) ≤ ,

8
where γ is the parameter of Theorem 27.1.3.

227
Proof: The decoder fails here, only if ring(r) contains some other codeword Yj ( j , i) in it. As such,
 Õ 
τ = Pr r < Si r ∈ ring(Yi ) ≤ Pr Yj ∈ ring(r), for any j , i ≤ Pr Yj ∈ ring(r) .
   
j,i

Now, we remind the reader that the Yj s are generated by picking each bit randomly and independently,
with probability 1/2. As such, we have
(1+ε)np n
n n
 
| ring(r) | Õ
m
Pr Yj ∈ ring(r) =
 
= ≤ n ,
|{0, 1}n | 2n 2 b(1 + ε)npc
m=(1−ε)np

since (1 + ε)p < 1/2 (for ε sufficiently small), and as such the last binomial coefficient in this summation
is the largest. By Corollary 25.1.5 (i), we have
n n n
 
Pr Yj ∈ ring(r) ≤ n ≤ n 2nH((1+ε)p) = n2n(H((1+ε)p)−1) .
 
2 b(1 + ε)npc 2
As such, we have

b Pr[Y1 ∈ ring(r)] ≤ 2 k+1 n2n(H((1+ε)p)−1)


 Õ 
τ = Pr r < Si r ∈ ring(Yi ) ≤ Pr Yj ∈ ring(r) ≤ K
 
j,i

≤ n2n (1−H(p)−δ) + 1 + n (H((1+ε)p)−1) ≤ n2n (H((1+ε)p)−H(p)−δ)+1

since k ≤ n(1 − H(p) − δ). Now, we choose ε to be a small enough constant, so that the quantity
H((1 + ε)p) − H(p) − δ is equal to some (absolute) negative (constant), say −β, where β > 0. Then,
τ ≤ n2−βn+1 , and choosing n large enough, we can make τ smaller than γ/8, as desired. As such, we just
proved that
 γ
τ = Pr r < Si r ∈ ring(Yi ) ≤ .


8

Lemma 27.2.2. Consider the situation where Yi is sent, and the received string is r. We have that
Õ γ
Pr[r < ring(Yi )] = w(Yi, r) ≤ ,
8
r < ring(Yi )

where γ is the parameter of Theorem 27.1.3.

Proof: This quantity, is the probability of sending Yi when every bit is flipped with probability p, and
receiving a string r such that more than pn + εpn bits where flipped (or less than pn − εpn). But
this quantity can be bounded using the Chernoff inequality. Indeed, let Z = ∆(Yi, r), and observe that
E[Z] = pn, and it is the sum of n independent indicator variables. As such
 2 
Õ ε γ
w(Yi, r) = Pr[|Z − E[Z]| > εpn] ≤ 2 exp − pn < ,
4 4
r < ring(Yi )

since ε is a constant, and for n sufficiently large. 


h i
Lemma 27.2.3. We have that f (Yi ) = S , r) ≤ γ/8 (the expectation is over all the
Í
r < ring(Yi ) E i,r w(Yi
choices of the Y s excluding Yi ).

228
Proof: Observe that Si,r w(Yi, r) ≤ w(Yi, r) and for fixed Yi and r we have that E[w(Yi, r)] = w(Yi, r). As
such, we have that
Õ h i Õ Õ γ
f (Yi ) = S
E i,r w(Y i , r) ≤ E [w(Yi , r)] = w(Yi, r) ≤ ,
8
r < ring(Yi ) r < ring(Yi ) r < ring(Yi )

by Lemma 27.2.2. 
Õ h i
Lemma 27.2.4. We have that g(Yi ) = E Si,r w(Yi, r) ≤ γ/8 (the expectation is over all the
r ∈ ring(Yi )
choices of the Y s excluding Yi ).

Proof: We have that Si,r w(Yi, r) ≤ Si,r , as 0 ≤ w(Yi, r) ≤ 1. As such, we have that
Õ h i Õ h i Õ
g(Yi ) = S
E i,r w(Yi , r) ≤ S
E i,r = Pr[r < Si ]
r ∈ ring(Yi ) r ∈ ring(Yi ) r ∈ ring(Yi )
Õ
= Pr[r < Si ∩ (r ∈ ring(Yi ))]
r
Õ
Pr r < Si r ∈ ring(Yi ) Pr[r ∈ ring(Yi )]
 
=
r
Õγ γ
≤ Pr[r ∈ ring(Yi )] ≤ ,
r
8 8

by Lemma 27.2.1. 

Lemma 27.2.5. For any i, we have µ = E[Wi ] ≤ γ/4, where γ is the parameter of Theorem 27.1.3,
where Wi is the probability of failure to recover Yi if it was sent, see Eq. (27.1).

Proof: We have by Eq. (27.2) that Wi = r Si,r w(Yi, r). For a fixed value of Yi , we have by linearity of
Í
expectation, that
" #
Õ Õ h i
E Wi Yi = E Si,r w(Yi, r) Yi = E Si,r w(Yi, r) Yi
 
r r
Õ h i Õ h i γ γ γ
= E Si,r w(Yi, r) Yi + S
E i,r w(Yi , r) Yi = g(Yi ) + f (Yi ) ≤ + = ,
8 8 4
r ∈ ring(Yi ) r < ring(Yi )

by Lemma 27.2.3 and Lemma 27.2.4. Now E[Wi ] = E E Wi Yi ≤ E[γ/4] ≤ γ/4.


  


In the following, we need the following trivial (but surprisingly deep) observation.

Observation 27.2.6. For a random variable X, if E[X] ≤ ψ, then there exists an event in the probability
space, that assigns X a value ≤ ψ.

Lemma 27.2.7. For the codewords X0, . . . , XK , the probability of failure in recovering them when sending
them over the noisy channel is at most γ.

229
Proof: We just proved that when using Y0, . . . , YKb, the expected probability of failure when sending Yi ,
b = 2 k+1 − 1. As such, the expected total probability of failure is
is E[Wi ] ≤ γ/4, where K

K K
Õ  Õ
γ
b b
E  Wi  = E[Wi ] ≤ 2 k+1 ≤ γ2 k ,

 i=0  i=0 4
 
by Lemma 27.2.5. As such, by Observation 27.2.6, there exist a choice of Yi s, such that

K
b
Wi ≤ 2 k γ.
Õ

i=0

Now, we use a similar argument used in proving Markov’s inequality. Indeed, the Wi are always positive,
and it can not be that 2 k of them have value larger than γ, because in the summation, we will get that

K
b
Wi > 2 k γ.
Õ

i=0

Which is a contradiction. As such, there are 2 k codewords with failure probability smaller than γ. We
set the 2 k codewords X0, . . . , XK to be these words, where K = 2 k − 1. Since we picked only a subset of
the codewords for our code, the probability of failure for each codeword shrinks, and is at most γ. 

Lemma 27.2.7 concludes the proof of the constructive part of Shannon’s theorem.

27.2.2. Lower bound on the message size


We omit the proof of this part. It follows similar argumentation showing that for every ring associated
with a codewords it must be that most of it is covered only by this ring (otherwise, there is no hope for
recovery). Then an easy packing argument implies the claim.

27.3. Bibliographical Notes


The presentation here follows [MU05, Sec. 9.1-Sec 9.3].

230
Chapter 28

Low Dimensional Linear Programming


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
January 24, 2018
“Napoleon has not been conquered by man. He was greater than all of us. But god punished him because
he relied on his own intelligence alone, until that prodigious instrument was strained to breaking point.
Everything breaks in the end.”
– Carl XIV Johan, King of Sweden.

28.1. Linear programming in constant dimension (d > 2)


Let assume that we have a set H of n linear inequalities defined over d (d is a small constant) variables.
Every inequality in H defines a closed half space in Rd . Given a vector → −c = (c , . . . , c ) we want to find
1 d
p = (p1, . . . , pd ) ∈ Rd which is in all the half spaces h ∈ H and f (p) = i ci pi is maximized. Formally:
Í

LP in d dimensions:(H, → −c )
H - set of n closed half spaces in Rd

−c - vector in d dimensions
Find p ∈ Rd s.t. ∀h ∈ H we have p ∈ h and f (p) is maximized.
Where f (p) = p, → −c .

A closed half space in d dimensions is defined by an inequality of the form

a1 x1 + a2 x2 + · · · + an xn ≤ bn .

One difficulty that we ignored earlier, is that the optimal solution for the LP might be unbounded,
see Figure 28.1.
Namely, we can find a solution with value ∞ to the target function.
For a half space h let η(h) denote the normal of h directed into the feasible region. Let µ(h) denote
the closed half space, resulting from h by translating it so that it passes through the origin. Let µ(H)
be the resulting set of half spaces from H. See Figure 28.1 (b).
The new set of constraints µ(H) is depicted in Figure 28.1 (c).

Lemma 28.1.1. (H, →


−c ) is unbounded if and only if (µ(H), →
−c ) is unbounded.

231
µ(H) feasible region

µ(h)

h µ(h)

−c

(a) (b) (c)

Figure 28.1: (a) Unbounded LP. (b). (c).

µ(H) feasible region


g
µ(h2) ∩ g µ(h1) ∩ g

µ(h) ρ0 µ(h)

h g
h1
h2
µ(h1)
µ(h2)

feasible region of µ(H)

(a) (b) (c)

Figure 28.2: (a). (b). (c).

Proof: Consider the ρ0 the unbounded ray in the feasible region of (H, →
−c ) such that the line that contain
→−
it passes through the origin. Clearly, ρ is unbounded also in (H, c ), and this is if and only if. See
0

Figure 28.2 (a). 

Lemma 28.1.2. Deciding if (µ(H), → −c ) is bounded can be done by solving a d − 1 dimensional LP.
Furthermore, if it is bounded, then we have a set of d constraints, such that their intersection prove this.
Furthermore, the corresponding set of d constraints in H testify that (H, → −c ) is bounded.

Proof: Rotate space, such that → −c is the vector (0, 0, . . . , 0, 1). And consider the hyperplane g ≡ x = 1.
d
Clearly, (µ(H), →
−c ) is unbounded if and only if the region g ∩ Ñ
h∈µ(H) h is non-empty. By deciding if this
region is unbounded, is equivalent to solving the following LP: L = (H , (1, 0, . . . , 0)) where
0 0

H 0 = g ∩ h h ∈ µ(H) .


Let h ≡ a1 x1 + . . . + ad xd ≤ 0, the region corresponding to g ∩ h is a1 x1 + · · · + ad−1 xd−1 ≤ −ad which


is a d − 1 dimensional hyperplane. See Figure 28.2 (b).
But this is a d − 1 dimensional LP, because everything happens on the hyperplane xd = 1.
Notice that if (µ(H), → −c ) is bounded (which happens if and only if (H, → −c ) is bounded), then L 0 is
infeasible, and the LP L 0 would return us a set d constraints that their intersection is empty. Interpreting
those constraints in the original LP, results in a set of constraints that their intersection is bounded in
the direction of →−c . See Figure 28.2 (c).

232
vi
p g vi+1 g
µ(h2) ∩ g µ(h1) ∩ g µ(h2) ∩ g µ(h1) ∩ g

h1 h1
h2 h2
µ(h1) µ(h1)
µ(h2) µ(h2)

feasible region of µ(H) feasible region of µ(H)


−c

(a) (b) (c)

Figure 28.3: (a). (b). (c).

(In the above example, µ(H) ∩ g is infeasible because the intersection of µ(h2 ) ∩ g and µ(h1 ) ∩ g is
empty, which implies that h1 ∩ h2 is bounded in the direction →−c which we care about. The positive y
direction in this figure. ) 

We are now ready to show the algorithm for the LP for L = (H, → −c ). By solving a d − 1 dimensional
LP we decide whether L is unbounded. If it is unbounded, we are done (we also found the unbounded
solution, if you go carefully through the details).
See Figure 28.3 (a).
(in the above figure, we computed p.)
In fact, we just computed a set h1, . . . , hd s.t. their intersection is bounded in the direction of → −c
(thats what the boundness check returned).
Let us randomly permute the remaining half spaces of H, and let h1, h2, . . . , hd, hd+1, . . . , hn be the
resulting permutation.

Let vi be the vertex realizing the optimal solution for the LP:
 
Li = {h1, . . . , hi } , →
−c

There are two possibilities:

1. vi+1 = vi. This happens if and only if vi ∈ hi+1, which can be checked in constant time.

2. vi+1 ≠ vi. Then it must be that vi ∉ hi+1, and the new optimal vertex must lie on ∂hi+1, as depicted in Figure 28.3 (b).

Let B be the set of d constraints that define vi+1. If hi+1 ∉ B then vi = vi+1. As such, the probability that
vi ≠ vi+1 is roughly d/i, because this is the probability that one of the elements of B is hi+1. Indeed, fix
the first i + 1 elements, and observe that there are d elements that are marked (those are the elements
of B). Thus, we are asking for the probability that one of the d marked elements is the last one in a
random permutation of hd+1, . . . , hi+1, which is exactly d/(i + 1 − d).
Note that if some of the elements of B are among h1, . . . , hd, then the above probability only decreases (as there
are fewer marked elements among the permuted constraints).
So, let us restrict our attention to ∂hi+1. Clearly, the optimal solution to Li+1 on ∂hi+1 is the required
vi+1. Namely, we solve the LP Li+1 restricted to ∂hi+1 using recursion.
This takes T(i + 1, d − 1) time. What is the probability that vi+1 ≠ vi?

Well, one of the d constraints defining vi+1 has to be hi+1. The probability for that is at most 1 for i ≤ 2d − 1,
and it is at most d/(i + 1 − d) otherwise.
Summarizing everything, we have:

    T(n, d) = O(n) + T(n, d − 1) + Σ_{i=d+1}^{2d} T(i, d − 1) + Σ_{i=2d+1}^{n} (d / (i + 1 − d)) · T(i, d − 1).

What is the solution of this monster? Well, one essentially has to guess the solution and verify it. To guess
the solution, let us “simplify” (incorrectly) the recursion to

    T(n, d) = O(n) + T(n, d − 1) + d Σ_{i=2d+1}^{n} T(i, d − 1) / (i + 1 − d).

So think about the recursion tree. Every element in the sum is going to contribute a near
constant factor, because we divide it by (roughly) i + 1 − d, and also because we are guessing that the optimal
solution is linear/near linear.
In every level of the recursion we are going to be penalized by a multiplicative factor of d. Thus, it is
natural to conjecture that T(n, d) ≤ (3d)^{3d} n, which can be verified by tedious substitution into the
recurrence, and is left as an exercise.

Theorem 28.1.3. Given a d-dimensional LP (H, →−c ), it can be solved in expected O((3d)^{3d} n) time (the
constant in the O(·) is independent of the dimension).

By the way, we are being a bit conservative about the constant. In fact, one can prove that the running time
is O(d! · n), which is still exponential in d.

SolveLP((H, →−c ))
    /* initialization */
    Rotate (H, →−c ) s.t. →−c = (0, . . . , 0, 1)
    Solve recursively the (d − 1)-dimensional LP
        L′ ≡ µ(H) ∩ (xd = 1)
    if L′ has a solution then
        return “Unbounded”
    Let g1, . . . , gd be the set of constraints of L′ that testify that L′ is infeasible
    Let h1, . . . , hd be the hyperplanes of H corresponding to g1, . . . , gd
    Permute H s.t. h1, . . . , hd are first.
    vd ← ∂h1 ∩ ∂h2 ∩ · · · ∩ ∂hd
    /* vd is a vertex that testifies that (H, →−c ) is bounded */

    /* the algorithm itself */
    for i ← d + 1 to n do
        if vi−1 ∈ hi then
            vi ← vi−1
        else
            vi ← SolveLP((Hi−1 ∩ ∂hi , →−c ))      (*)
                 where Hi−1 = {h1, . . . , hi−1}
    return vn
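As a concrete illustration, the following Python sketch carries out the incremental step in the plane (d = 2), under simplifying assumptions: the LP is feasible and bounded, the first two constraints already define the starting vertex v0 (playing the role of the boundedness certificate above), and degeneracies such as parallel constraints are ignored. The function name solve_lp_2d and the tolerance are chosen only for this sketch.

    import random

    def solve_lp_2d(constraints, c, v0):
        """Randomized incremental LP in the plane (a sketch, not robust code).

        constraints: list of (a, b), with a = (a1, a2), meaning a1*x + a2*y <= b.
        c: objective direction; we maximize c . (x, y).
        v0: a vertex optimal for the first two constraints (assumed to exist).
        """
        def dot(u, w):
            return u[0] * w[0] + u[1] * w[1]

        H = constraints[:2]            # the two constraints defining v0
        rest = constraints[2:]
        random.shuffle(rest)           # random insertion order of the remaining constraints
        v = v0
        for (a, b) in rest:
            if dot(a, v) <= b + 1e-12: # v still feasible: the optimum does not change
                H.append((a, b))
                continue
            # Otherwise the new optimum lies on the line a . x = b: solve a
            # one-dimensional LP on that line, clipped by the old constraints.
            d = (-a[1], a[0])                                  # direction of the line
            p = (a[0] * b / dot(a, a), a[1] * b / dot(a, a))   # a point on the line
            lo, hi = float("-inf"), float("inf")
            for (a2, b2) in H:
                coeff = dot(a2, d)
                rhs = b2 - dot(a2, p)
                if abs(coeff) < 1e-12:
                    continue           # (parallel constraint; degeneracy ignored here)
                t = rhs / coeff
                if coeff > 0:
                    hi = min(hi, t)
                else:
                    lo = max(lo, t)
            # Pick the endpoint of [lo, hi] maximizing c . (p + t d).
            t = hi if dot(c, d) > 0 else lo
            v = (p[0] + t * d[0], p[1] + t * d[1])
            H.append((a, b))
        return v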

28.2. Handling Infeasible Linear Programs


In the above discussion, we glossed over the question of how to handle LPs which are infeasible. This
requires slightly modifying our algorithm to handle this case, and I am only describing the required
modifications.
First, consider the simplest case, where we are given an LP L which is one dimensional (i.e., defined over one
variable). Clearly, we can solve this LP in linear time (verify!), and furthermore, if there is no solution,
we can return two input inequalities ax ≤ b and cx ≥ d for which there is no common solution (i.e., those
two inequalities [i.e., constraints] testify that the LP is not satisfiable).
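A one-dimensional LP of this form is just an intersection of halflines, and the following Python sketch (with a made-up function name, and with every constraint written as a·x ≤ b, so a constraint cx ≥ d is passed as (−c, −d)) solves it in linear time and returns a pair of witness constraints when the LP is infeasible.

    def solve_lp_1d(constraints):
        """Solve a one-variable LP given as constraints a*x <= b (a sketch).

        Returns ("feasible", lo, hi) with the feasible interval, or
        ("infeasible", c1, c2) with two witness constraints whose
        intersection is already empty.
        """
        lo, lo_wit = float("-inf"), None   # tightest lower bound, from a < 0
        hi, hi_wit = float("inf"), None    # tightest upper bound, from a > 0
        for (a, b) in constraints:
            if a > 0 and b / a < hi:
                hi, hi_wit = b / a, (a, b)
            elif a < 0 and b / a > lo:
                lo, lo_wit = b / a, (a, b)
            # a == 0: the constraint is trivially true or false; ignored here.
        if lo > hi:
            return ("infeasible", lo_wit, hi_wit)
        return ("feasible", lo, hi)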
Next, assume that when the algorithm SolveLP is called on a (d − 1)-dimensional LP L′ that is not
feasible, it returns the d constraints of L′ that together have an empty intersection. Namely, those
constraints are the witnesses that L′ is infeasible.
The only place where we can get such an answer is when computing vi (in the line (*) of the
algorithm). Let h′1, . . . , h′d be the corresponding set of d constraints of Hi−1 that testify that (Hi−1 ∩ ∂hi ,
→−c ) is an infeasible LP. Clearly, h′1, . . . , h′d, hi must be a set of d + 1 constraints that are together
infeasible, and that is what SolveLP returns.

28.3. References
The description in these class notes is loosely based on the description of low-dimensional LP in the book
of de Berg et al. [dBCKO08].

Chapter 29

Expanders I
“Mr. Matzerath has just seen fit to inform me
that this partisan, unlike so many of them, was
an authentic partisan. For - to quote the rest of
my patient’s lecture - there is no such thing as a
part-time partisan. Real partisans are partisans
always and as long as they live. They put fallen
governments back in power and over throw
governments that have just been put in power
with the help of partisans. Mr. Matzerath
contended - and this thesis struck me as
perfectly plausible - that among all those who
go in for politics your incorrigible partisan, who
undermines what he has just set up, is closest to
the artist because he consistently rejects what
he has just created.”

Gunter Grass, The tin drum


29.1. Preliminaries on expanders
29.1.1. Definitions
Let G = (V, E) be an undirected graph, where V = {1, . . . , n}. A d-regular graph is a graph where all
vertices have degree d. A d-regular graph G = (V, E) is a δ-edge expander (or just δ-expander) if for
every set S ⊆ V of size at most |V|/2, there are at least δd|S| edges connecting S and S̄ = V \ S; that is

    e(S, S̄) ≥ δd |S| ,        (29.1)

where e(X, Y) = |{ uv ∈ E | u ∈ X, v ∈ Y }|.
A graph is an [n, d, δ]-expander if it is an n-vertex, d-regular, δ-expander.
An (n, d)-graph G is a connected d-regular undirected (multi) graph. We will consider the set of
vertices of such a graph to be the set ⟦n⟧ = {1, . . . , n}.
For a (multi) graph G with n nodes, its adjacency matrix is an n × n matrix M, where Mij is the
number of edges between i and j. It would be convenient to work with the transition matrix Q associated
with the random walk on G. If G is d-regular then Q = M(G)/d and it is doubly stochastic.
A vector x is an eigenvector of a matrix M with eigenvalue µ if xM = µx. In particular, by taking
the dot product of both sides with x, we get ⟨xM, x⟩ = ⟨µx, x⟩, which implies µ = ⟨xM, x⟩/⟨x, x⟩. Since the
adjacency matrix M of G is symmetric, all its eigenvalues are real numbers (this is a special case of the
spectral theorem from linear algebra). Two eigenvectors with different eigenvalues are orthogonal to
each other.
We denote the eigenvalues of M by λ1 ≥ λ2 ≥ · · · ≥ λn, and the eigenvalues of Q by λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n.
Note that for a d-regular graph, the eigenvalues of Q are the eigenvalues of M scaled down by a factor
of 1/d; that is, λ̂i = λi/d.
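As a quick numeric illustration of these definitions (a sketch assuming numpy is available; the cycle graph used here is just an example, not part of the construction), one can build M and Q for a small d-regular graph and inspect the eigenvalues of Q:

    import numpy as np

    # Sanity check of the definitions on the cycle C_n, a 2-regular graph.
    n, d = 8, 2
    M = np.zeros((n, n))
    for i in range(n):
        M[i, (i + 1) % n] = M[i, (i - 1) % n] = 1   # adjacency matrix of the cycle
    Q = M / d                                        # random walk transition matrix

    eig = np.sort(np.linalg.eigvalsh(Q))[::-1]       # Q is symmetric, so eigvalsh applies
    print(eig[0])   # largest eigenvalue: 1.0, with eigenvector (1, ..., 1)/sqrt(n)
    print(eig[1])   # second eigenvalue: cos(2*pi/n) for the cycle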
Lemma 29.1.1. Let G be an undirected graph, and let ∆ denote the maximum degree in G. Then,
λ1(G) = λ1(M) = ∆ if and only if at least one connected component of G is ∆-regular. The multiplicity of ∆ as
an eigenvalue is the number of ∆-regular connected components. Furthermore, we have |λi(G)| ≤ ∆, for
all i.

Proof: The ith entry of M1n is the degree of the ith vertex vi of G; that is, (M1n)_i = d(vi), where 1n =
(1, 1, . . . , 1) ∈ Rn. So, let x be an eigenvector of M with eigenvalue λ, and let xj ≠ 0 be the coordinate
with the largest absolute value among all coordinates of x corresponding to a connected component H
of G. We have that

    |λ| |xj| = |(Mx)_j| = | Σ_{vi ∈ N(vj)} xi | ≤ ∆ |xj| ,

where N(vj) is the set of neighbors of vj in G. Thus, all the eigenvalues of G satisfy |λ| ≤ ∆.
If λ = ∆, then this implies that xi = xj if vi ∈ N(vj), and d(vj) = ∆. Applying this argument to the
vertices of N(vj) implies that H must be ∆-regular, and furthermore, xi = xj for all vi ∈ V(H). Clearly, the
dimension of the subspace with eigenvalue (in absolute value) ∆ is exactly the number of such connected
components.

The following is also known. We do not provide a proof since we do not need it in our argumentation.
Lemma 29.1.2. If G is bipartite and λ is an eigenvalue of M(G) with multiplicity k, then −λ is also
an eigenvalue of M(G) with multiplicity k.

29.2. Tension and expansion


Let G = (V, E), where V = {1, . . . , n} and G is a d-regular graph.
Definition 29.2.1. For a graph G, let γ(G) denote the tension of G; that is, the smallest constant such
that for any function f : V(G) → R, we have that

    E_{x,y∈V}[ |f(x) − f(y)|² ] ≤ γ(G) E_{xy∈E}[ |f(x) − f(y)|² ].        (29.2)

Intuitively, the tension captures how well one can estimate the variance of a function defined over the
vertices of G by considering only the edges of G. Note that a disconnected graph would have infinite
tension, and the clique has tension 1.
Surprisingly, tension is directly related to expansion as the following lemma testifies.
Lemma 29.2.2. Let G = (V, E) be a given connected d-regular graph with n vertices. Then, G is a
δ-expander, where δ ≥ 1/(2γ(G)) and γ(G) is the tension of G.

Proof: Consider a set S ⊆ V, where |S| ≤ n/2. Let fS(v) be the function assigning 1 if v ∈ S, and zero
otherwise. Observe that if (u, v) ∈ (S × S̄) ∪ (S̄ × S) then |fS(u) − fS(v)| = 1, and |fS(u) − fS(v)| = 0
otherwise. As such, we have

    2|S|(n − |S|)/n² = E_{x,y∈V}[ |fS(x) − fS(y)|² ] ≤ γ(G) E_{xy∈E}[ |fS(x) − fS(y)|² ] = γ(G) e(S, S̄)/|E| ,

by Lemma 29.2.4. Now, since G is d-regular, we have that |E| = nd/2. Furthermore, n − |S| ≥ n/2, which
implies that

    e(S, S̄) ≥ 2|E| · |S|(n − |S|) / (γ(G) n²) ≥ 2(nd/2)(n/2)|S| / (γ(G) n²) = d|S| / (2γ(G)),

which implies the claim (see Eq. (29.1)).

Now, a clique has tension 1, and it has the best expansion possible. As such, the smaller the tension
of a graph, the better expander it is.
Definition 29.2.3. Given a random walk matrix Q associated with a d-regular graph, let B(Q) = ⟨v1, . . . , vn⟩
denote the orthonormal eigenvector basis defined by Q. That is, v1, . . . , vn is an orthonormal basis
for Rⁿ, where all these vectors are eigenvectors of Q and v1 = 1n/√n. Furthermore, let λ̂i denote the ith
eigenvalue of Q, associated with the eigenvector vi, such that λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n.

Lemma 29.2.4. Let G = (V, E) be a given connected d-regular graph with n vertices. Then γ(G) = 1/(1 − λ̂2),
where λ̂2 = λ2/d is the second largest eigenvalue of Q.
Proof: Let f : V → R. Since Eq. (29.2) only looks at differences between two values of f, we can add a
constant to f without changing the quantities involved in Eq. (29.2). As such, we assume that E[f(x)] = 0.
We then have

    E_{x,y∈V}[ |f(x) − f(y)|² ] = E_{x,y∈V}[ (f(x))² − 2 f(x) f(y) + (f(y))² ]
      = E_{x∈V}[ (f(x))² ] − 2 E_{x∈V}[f(x)] E_{y∈V}[f(y)] + E_{y∈V}[ (f(y))² ] = 2 E_{x∈V}[ (f(x))² ].        (29.3)

Now, let I be the n × n identity matrix (i.e., one on its diagonal, and zero everywhere else). We have that

    ρ = (1/d) Σ_{xy∈E} (f(x) − f(y))² = (1/d) ( Σ_{x∈V} d (f(x))² − 2 Σ_{xy∈E} f(x) f(y) )
      = Σ_{x∈V} (f(x))² − (2/d) Σ_{xy∈E} f(x) f(y) = Σ_{x,y∈V} (I − Q)_{xy} f(x) f(y).

Note that 1n is an eigenvector of Q with eigenvalue 1, and this is the largest eigenvalue of Q. Let B(Q) =
⟨v1, . . . , vn⟩ be the orthonormal eigenvector basis defined by Q, with eigenvalues λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n,
respectively. Write f = Σ_{i=1}^n αi vi, and observe that

    0 = E[f(x)] = (1/n) Σ_i f(i) = ⟨f, v1⟩/√n = ⟨Σ_i αi vi, v1⟩/√n = ⟨α1 v1, v1⟩/√n = α1/√n,

since vi ⊥ v1 for i ≥ 2. Hence α1 = 0, and we have

    ρ = Σ_{x,y∈V} (I − Q)_{xy} f(x) f(y) = Σ_{x,y∈V} (I − Q)_{xy} ( Σ_{i=2}^n αi vi(x) ) ( Σ_{j=1}^n αj vj(y) )
      = Σ_{i,j} αi αj Σ_{x∈V} vi(x) Σ_{y∈V} (I − Q)_{xy} vj(y).

Now, we have that

    Σ_{y∈V} (I − Q)_{xy} vj(y) = ⟨ xth row of (I − Q), vj ⟩ = ((I − Q) vj)(x) = (1 − λ̂j) vj(x),

since vj is an eigenvector of Q with eigenvalue λ̂j. Since v1, . . . , vn is an orthonormal basis, and f = Σ_{i=1}^n αi vi,
we have that ‖f‖² = Σ_j αj². Going back to ρ, we have that

    ρ = Σ_{i,j} αi αj (1 − λ̂j) Σ_{x∈V} vi(x) vj(x) = Σ_{i,j} αi αj (1 − λ̂j) ⟨vi, vj⟩ = Σ_{j=1}^n αj² (1 − λ̂j) ⟨vj, vj⟩
      ≥ (1 − λ̂2) Σ_{j=2}^n αj² Σ_{x∈V} (vj(x))² = (1 − λ̂2) Σ_{j=1}^n αj² = (1 − λ̂2) ‖f‖² = (1 − λ̂2) Σ_{x∈V} (f(x))²        (29.4)
      = n (1 − λ̂2) E_{x∈V}[ (f(x))² ],

since α1 = 0 and λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n.


We are now ready for the kill. Indeed, by Eq. (29.3) and the above, we have that

    E_{x,y∈V}[ |f(x) − f(y)|² ] = 2 E_{x∈V}[ (f(x))² ] ≤ (2 / (n(1 − λ̂2))) ρ = (2 / (d n (1 − λ̂2))) Σ_{xy∈E} (f(x) − f(y))²
      = (1/(1 − λ̂2)) · (1/|E|) Σ_{xy∈E} (f(x) − f(y))² = (1/(1 − λ̂2)) E_{xy∈E}[ |f(x) − f(y)|² ].

This implies that γ(G) ≤ 1/(1 − λ̂2). Observe that the only inequality in our analysis arose in Eq. (29.4),
but if we take f = v2, then the inequality there holds with equality, which implies that γ(G) ≥ 1/(1 − λ̂2),
which implies the claim.

Lemma 29.2.2 together with the above lemma implies that the expansion δ of a d-regular graph G
is at least δ ≥ 1/(2γ(G)) = (1 − λ2/d)/2, where λ2 is the second eigenvalue of the adjacency matrix of G.
Since the tension of a graph is a direct function of its second eigenvalue, we could argue either about the
tension of a graph or about its second eigenvalue when bounding the graph expansion.
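The following short computation (a sketch, using numpy and brute force over all subsets of a small example graph) compares the spectral lower bound (1 − λ2/d)/2 with the true edge expansion:

    import numpy as np
    from itertools import combinations

    # Compare the spectral bound with the true edge expansion on the cycle C_n (2-regular).
    n, d = 8, 2
    M = np.zeros((n, n))
    for i in range(n):
        M[i, (i + 1) % n] = M[i, (i - 1) % n] = 1
    lam2 = np.sort(np.linalg.eigvalsh(M))[-2]        # second largest adjacency eigenvalue

    def e(S):  # number of edges leaving the set S
        S = set(S)
        return sum(1 for u in range(n) for v in range(u + 1, n)
                   if M[u, v] and ((u in S) != (v in S)))

    delta_true = min(e(S) / (d * len(S))
                     for k in range(1, n // 2 + 1)
                     for S in combinations(range(n), k))
    print((1 - lam2 / d) / 2, delta_true)            # the bound should not exceed the truth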

Chapter 30

Expanders II
Be that as it may, it is to night school that I owe
what education I possess; I am the first to own
that it doesn’t amount to much, though there is
something rather grandiose about the gaps in it.

Gunter Grass, The tin drum


30.1. Bi-tension
Our construction of good expanders would use the idea of composing graphs together. To this end, in
our analysis, we will need the notion of bi-tension. Let Ẽ(G) be the set of directed edges of G; that is,
every edge xy ∈ E(G) appears twice, as (x → y) and (y → x), in Ẽ.
Definition 30.1.1. For a graph G, let γ2(G) denote the bi-tension of G; that is, the smallest constant
such that for any two functions f, g : V(G) → R, we have that

    E_{x,y∈V}[ |f(x) − g(y)|² ] ≤ γ2(G) E_{(x→y)∈Ẽ}[ |f(x) − g(y)|² ].        (30.1)

The proof of the following lemma is similar to the proof of Lemma 29.2.4. The proof is provided for
the sake of completeness, but there is little new in it.
Lemma 30.1.2. Let G = (V, E) be a connected d-regular graph with n vertices. Then γ2(G) = 1/(1 − λ̂),
where λ̂ = λ̂(G) = max( λ̂2, −λ̂n ), and λ̂i is the ith largest eigenvalue of the random walk
matrix associated with G.

Proof: We can assume that E[f(x)] = 0. As such, we have that

    E_{x,y∈V}[ |f(x) − g(y)|² ] = E_{x∈V}[ (f(x))² ] − 2 E_{x∈V}[f(x)] E_{y∈V}[g(y)] + E_{y∈V}[ (g(y))² ]
      = E_{x∈V}[ (f(x))² ] + E_{y∈V}[ (g(y))² ].        (30.2)

Let Q be the matrix associated with the random walk on G (each entry is either zero or 1/d). We have

    ρ = E_{(x→y)∈Ẽ}[ |f(x) − g(y)|² ] = (1/(nd)) Σ_{(x→y)∈Ẽ} (f(x) − g(y))² = (1/n) Σ_{x,y∈V} Q_{xy} (f(x) − g(y))²
      = (1/n) Σ_{x∈V} ( (f(x))² + (g(x))² ) − (2/n) Σ_{x,y∈V} Q_{xy} f(x) g(y).

Let B(Q) = ⟨v1, . . . , vn⟩ be the orthonormal eigenvector basis defined by Q (see Definition 29.2.3), with
eigenvalues λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n, respectively. Write f = Σ_{i=1}^n αi vi and g = Σ_{i=1}^n βi vi. Since E[f(x)] = 0,
we have that α1 = 0. Now, Q_{xy} = Q_{yx}, and we have

    Σ_{x,y∈V} Q_{xy} f(x) g(y) = Σ_{i,j} αi βj Σ_{y∈V} vj(y) Σ_{x∈V} Q_{yx} vi(x) = Σ_{i,j} αi βj Σ_{y∈V} vj(y) λ̂i vi(y)
      = Σ_{i,j} αi βj λ̂i ⟨vj, vi⟩ = Σ_{i=2}^n αi βi λ̂i Σ_{y∈V} (vi(y))²
      ≤ λ̂ Σ_{i=2}^n ((αi² + βi²)/2) Σ_{y∈V} (vi(y))² ≤ (λ̂/2) Σ_{i=1}^n Σ_{y∈V} ( (αi vi(y))² + (βi vi(y))² )
      = (λ̂/2) Σ_{y∈V} ( (f(y))² + (g(y))² ).

As such,

    E_{(x→y)∈Ẽ}[ |f(x) − g(y)|² ] = (1/n) Σ_{y∈V} ( (f(y))² + (g(y))² ) − (2/n) Σ_{x,y∈V} Q_{xy} f(x) g(y)
      ≥ ( (1/n) − (λ̂/n) ) Σ_{y∈V} ( (f(y))² + (g(y))² ) = (1 − λ̂) ( E_{y∈V}[ (f(y))² ] + E_{y∈V}[ (g(y))² ] )
      = (1 − λ̂) E_{x,y∈V}[ |f(x) − g(y)|² ],

by Eq. (30.2). This implies that γ2(G) ≤ 1/(1 − λ̂). Again, by trying either f = g = v2, or f = vn
and g = −vn, we get that the inequality above holds with equality, which implies γ2(G) ≥ 1/(1 − λ̂).
Together, the claim now follows.

30.2. Explicit construction


For a set U ⊆ V of vertices, its characteristic vector, denoted by x = χU , is the n dimensional vector,
where xi = 1 if and only if i ∈ U.
The following is an easy consequence of Lemma 29.1.1.
Lemma 30.2.1. For a d-regular graph G, the vector 1n = (1, 1, . . . , 1) is the only eigenvector with eigen-
value d (of the adjacency matrix M(G)) if and only if G is connected. Furthermore, we have |λi| ≤ d, for
all i.

Our main interest would be in the second largest eigenvalue of M. Formally, let

    λ2(G) = max_{x ⊥ 1n, x ≠ 0} ⟨xM, x⟩ / ⟨x, x⟩.

We state the following result but do not prove it, since we do not need it for our nefarious purposes
(however, we did prove the left side of the inequality).

Theorem 30.2.2. Let G be a δ-expander with adjacency matrix M and let λ2 = λ2(G) be the second-
largest eigenvalue of M. Then

    (1/2)(1 − λ2/d) ≤ δ ≤ √( 2(1 − λ2/d) ).

What the above theorem says is that the expansion of an [n, d, δ]-expander is a function of how far
its second eigenvalue (i.e., λ2) is from its first eigenvalue (i.e., d). This is usually referred to as the
spectral gap.
We will start by explicitly constructing an expander that has “many” edges, and then we will show how
to reduce its degree until it becomes a constant degree expander.

30.2.1. Explicit construction of a small expander


30.2.1.1. A quick reminder of fields
A field is a set F together with two operations, called addition and multiplication, and denoted by +
and ·, respectively, such that the following axioms hold:

(i) Closure: ∀x, y ∈ F, we have x + y ∈ F and x · y ∈ F.

(ii) Associativity: ∀x, y, z ∈ F, we have x + (y + z) = (x + y) + z and (x · y) · z = x · (y · z).

(iii) Commutativity: ∀x, y ∈ F, we have x + y = y + x and x · y = y · x.

(iv) Identity: There exist two distinct special elements 0, 1 ∈ F, such that ∀x ∈ F it holds that x + 0 = x
and x · 1 = x.

(v) Inverse: For every x ∈ F there exists an element −x ∈ F, such that x + (−x) = 0.
Similarly, ∀x ∈ F, x ≠ 0, there exists an element y = x⁻¹ = 1/x ∈ F such that x · y = 1.

(vi) Distributivity: ∀x, y, z ∈ F we have x · (y + z) = x · y + x · z.

Let q = 2^t, and let r > 0 be an integer. Consider the finite field Fq. It is the field of polynomials
of degree at most t − 1, where the coefficients are over Z2 (i.e., all calculations are done modulo 2).
Formally, consider the polynomial
p(x) = x^t + x + 1.
It is irreducible over F2 = {0, 1} (i.e., p(0) = p(1) ≠ 0). We can now do arithmetic over
polynomials (with coefficients from F2), where we do the calculations modulo p(x). Note that any
irreducible polynomial of degree t yields the same field up to isomorphism. Intuitively, we are introducing
the t distinct roots of p(x) into F by creating an extension field of F with those roots.
An element of Fq = F_{2^t} can be interpreted as a binary string b = b0 b1 . . . bt−1 of length t, where the
corresponding polynomial is

    poly(b) = Σ_{i=0}^{t−1} bi x^i.

The nice property of Fq is that addition can be interpreted as an xor operation. That is, for any x, y ∈ Fq,
we have that x + y + y = x and x − y − y = x. The key properties of Fq that we need are that multiplication
and addition can be computed in time polynomial in t, and that it is a field (i.e., each non-zero element
has a unique inverse).

30.2.1.1.1. Computing multiplication in Fq. Consider two elements α, β ∈ Fq. Multiply the two
polynomials poly(α) and poly(β), and let poly(γ) be the resulting polynomial (of degree at most 2t − 2).
Compute the remainder of poly(γ) when dividing it by the irreducible polynomial p(x). For this remainder
polynomial, normalize the coefficients by reducing them modulo 2. The resulting polynomial is
the product of α and β.

For more details on this field, see any standard text on abstract algebra.
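The following Python sketch implements the above multiplication procedure, encoding an element of Fq as a t-bit integer; it assumes that p(x) = x^t + x + 1 is indeed irreducible for the chosen t (this holds, e.g., for t = 4, giving F16):

    def gf_mul(a, b, t=4):
        """Multiplication in F_{2^t}, elements encoded as t-bit integers (a sketch).

        Addition in this field is simply a ^ b (xor); this routine handles products.
        """
        # Carry-less (polynomial) multiplication over F_2.
        prod = 0
        for i in range(t):
            if (b >> i) & 1:
                prod ^= a << i
        # Reduce modulo p(x) = x^t + x + 1, i.e. replace x^i by x^(i-t+1) + x^(i-t) for i >= t.
        for i in range(2 * t - 2, t - 1, -1):
            if (prod >> i) & 1:
                prod ^= (1 << i) | (1 << (i - t)) | (1 << (i - t + 1))
        return prod

For example, with t = 4 one gets gf_mul(0b1000, 0b0010) == 0b0011, reflecting the identity x⁴ ≡ x + 1 (mod p(x)).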

30.2.1.2. The construction


Let q = 2^t, and let r > 0 be an integer. Consider the linear space G = F_q^{r+1}. Here, a member α = (α0, . . . , αr) ∈
G can be thought of as a string (of length r + 1) over Fq, or alternatively, as a binary string of
length n = t(r + 1).
For α = (α0, . . . , αr) ∈ G and x, y ∈ Fq, define the operator

    ρ(α, x, y) = α + y · (1, x, x², . . . , x^r) = (α0 + y, α1 + yx, α2 + yx², . . . , αr + yx^r) ∈ G.

Since addition over Fq is equivalent to a xor operation, we have that

    ρ(ρ(α, x, y), x, y) = (α0 + y + y, α1 + yx + yx, α2 + yx² + yx², . . . , αr + yx^r + yx^r) = (α0, α1, α2, . . . , αr) = α.

Furthermore, if (x, y) ≠ (x′, y′) then ρ(α, x, y) ≠ ρ(α, x′, y′).


We now define a graph LD(q, r) = (G, E), where

    E = { αβ | α ∈ G, x, y ∈ Fq, β = ρ(α, x, y) }.

Note that this graph is well defined, as ρ(β, x, y) = α. The degree of a vertex of LD(q, r) is |Fq|² = q²,
and LD(q, r) has N = |G| = q^{r+1} = 2^{t(r+1)} = 2^n vertices.
Theorem 30.2.3. For any t > 0, r > 0 and q = 2^t, where r < q, we have that LD(q, r) is a graph with
q^{r+1} vertices. Furthermore, λ1(LD(q, r)) = q², and λi(LD(q, r)) ≤ rq, for all i ≥ 2.
In particular, if r ≤ q/2, then LD(q, r) is a [q^{r+1}, q², 1/4]-expander.

Proof: Let M be the N × N adjacency matrix of LD(q, r). Let L : Fq → {0, 1} be a linear map which is
onto. It is easy to verify that |L⁻¹(0)| = |L⁻¹(1)|.¬
We are interested in the eigenvalues of the matrix M. To this end, we consider vectors in R^N. The
ith row and the ith column of M are associated with a unique element bi ∈ G. As such, for a vector v ∈ R^N,
we denote by v[bi] the ith coordinate of v. In particular, for α = (α0, . . . , αr) ∈ G, let vα ∈ R^N denote
the vector whose β = (β0, . . . , βr) ∈ G coordinate is

    vα[β] = (−1)^{L( Σ_{i=0}^r αi βi )}.

¬ Indeed, if Z = L⁻¹(0) and L(x) = 1, then L(y) = 1 for all y ∈ U = { x + z | z ∈ Z }. Now, it is clear that |Z| = |U|.
Let V = { vα | α ∈ G }. For α ≠ α′ ∈ G, observe that

    ⟨vα, vα′⟩ = Σ_{β∈G} (−1)^{L(Σ_{i=0}^r αi βi)} · (−1)^{L(Σ_{i=0}^r α′i βi)} = Σ_{β∈G} (−1)^{L(Σ_{i=0}^r (αi + α′i) βi)} = Σ_{β∈G} vα+α′[β].

So, consider ψ = α + α′ ≠ 0. Assume, for the simplicity of exposition, that all the coordinates of ψ are
non-zero. We have, by the linearity of L, that

    ⟨vα, vα′⟩ = Σ_{β∈G} (−1)^{L(Σ_{i=0}^r ψi βi)} = Σ_{β0,...,βr−1 ∈ Fq} (−1)^{L(ψ0 β0 + ··· + ψr−1 βr−1)} Σ_{βr ∈ Fq} (−1)^{L(ψr βr)}.

However, since ψr ≠ 0, the set { ψr βr | βr ∈ Fq } = Fq. Thus, the summation Σ_{βr∈Fq} (−1)^{L(ψr βr)} has
|L⁻¹(0)| terms that are 1, and |L⁻¹(1)| = |L⁻¹(0)| terms that are −1. As such, this summation is zero, implying that
⟨vα, vα′⟩ = 0. Namely, the vectors of V are orthogonal.
Observe that for α, β, ψ ∈ G, we have vα[β + ψ] = vα[β] · vα[ψ]. For α ∈ G, consider the vector Mvα.
We have, for β ∈ G, that

    (Mvα)[β] = Σ_{ψ∈G} Mβψ · vα[ψ] = Σ_{x,y∈Fq} vα[ρ(β, x, y)] = Σ_{x,y∈Fq} vα[β + y(1, x, . . . , x^r)]
      = ( Σ_{x,y∈Fq} vα[y(1, x, . . . , x^r)] ) · vα[β].

Thus, setting λ(α) = Σ_{x,y∈Fq} vα[y(1, x, . . . , x^r)] ∈ R, we have that Mvα = λ(α) · vα. Namely, vα is an
eigenvector, with eigenvalue λ(α).
Let pα(x) = Σ_{i=0}^r αi x^i, and let

    λ(α) = Σ_{x,y∈Fq} vα[y(1, x, . . . , x^r)] = Σ_{x,y∈Fq} (−1)^{L(y pα(x))}
      = Σ_{x,y∈Fq : pα(x)=0} (−1)^{L(y pα(x))} + Σ_{x,y∈Fq : pα(x)≠0} (−1)^{L(y pα(x))}.

If pα(x) = 0 then (−1)^{L(y pα(x))} = 1, for all y. As such, each such x contributes q to λ(α).
If pα(x) ≠ 0 then y pα(x) takes all the values of Fq, and as such, L(y pα(x)) is 0 for half of these
values, and 1 for the other half. This implies that these kinds of terms contribute 0 to λ(α). But pα(x) is a
polynomial of degree at most r, and as such there can be at most r values of x for which the first case happens.
As such, if α ≠ 0 then λ(α) ≤ rq. If α = 0 then λ(α) = q², which implies the theorem.
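For intuition, the theorem can be verified numerically on tiny parameters (a sketch assuming numpy; here q = 4, i.e., t = 2 with p(x) = x² + x + 1, and r = 1):

    import numpy as np
    from itertools import product

    def mul(a, b, t=2):
        # Multiplication in F_4, elements as 2-bit integers, modulo x^2 + x + 1.
        prod = 0
        for i in range(t):
            if (b >> i) & 1:
                prod ^= a << i
        for i in range(2 * t - 2, t - 1, -1):
            if (prod >> i) & 1:
                prod ^= (1 << i) | (1 << (i - t)) | (1 << (i - t + 1))
        return prod

    q, r = 4, 1
    verts = list(product(range(q), repeat=r + 1))          # the vertex set G = F_q^{r+1}
    idx = {v: i for i, v in enumerate(verts)}
    M = np.zeros((len(verts), len(verts)))
    for a in verts:
        for x in range(q):
            powers = [1]
            for _ in range(r):
                powers.append(mul(powers[-1], x))          # (1, x, ..., x^r)
            for y in range(q):
                b = tuple(ai ^ mul(y, p) for ai, p in zip(a, powers))   # rho(a, x, y)
                M[idx[a], idx[b]] += 1
    ev = np.sort(np.linalg.eigvalsh(M))[::-1]
    print(ev[0], ev[1])        # expect q^2 = 16, and second eigenvalue <= r*q = 4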

This construction provides an expander with constant degree only if the number of vertices is a
constant. Indeed, if we want an expander with constant degree, we have to take q to be as small as
possible. We get the relation n = q^{r+1} ≤ q^q, since r < q, which implies that q = Ω(log n/log log n). Now,
the expander of Theorem 30.2.3 is q²-regular, which means that it is not going to provide us with a
constant degree expander.
However, we are going to use it as our building block in a construction that would start with this
expander and would inflate it up to the desired size.

Chapter 31

Expanders III - The Zig Zag Product


Gradually, but not as gradually as it seemed to
some parts of his brain, he began to infuse his
tones with a sarcastic wounding bitterness.
Nobody outside a madhouse, he tried to imply,
could take seriously a single phrase of this
conjectural, nugatory, deluded, tedious rubbish.
Within quite a short time he was contriving to
sound like an unusually fanatical Nazi trooper
in charge of a book-burning reading out to the
crowd excerpts from a pamphlet written by a
pacifist, Jewish, literate Communist. A growing
mutter, half-amused, half-indignant, arose
about him, but he closed his ears to it and read
on. Almost unconsciously he began to adopt an
unnameable foreign accent and to read faster
and faster, his head spinning. As if in a dream
he heard Welch stirring, then whispering, then
talking at his side. he began punctuating his
discourse with smothered snorts of derision. He
read on, spitting out the syllables like curses,
leaving mispronunciations, omissions,
spoonerisms uncorrected, turning over the pages
of his script like a score-reader following a presto
movement, raising his voice higher and higher.
At last he found his final paragraph confronting
him, stopped, and look at his audience.

Kingsley Amis, Lucky Jim

31.1. Building a large expander with constant degree


31.1.1. Notations
For a vertex v ∈ V(G), we will denote by vG [i] = v[i] the ith neighbor of v in the graph G (we order the
neighbors of a vertex in an arbitrary order).
The regular graphs we next discuss have a consistent labeling. That is, for a regular graph G, if u is the
ith neighbor of v, then v is the ith neighbor of u. Formally, this means that v[i][i] = v, for all v and i.
This is a non-trivial property, but it is easy to verify that the low quality expander of Theorem 30.2.3
has this property. It is also easy to verify that the complete graph can easily be made into having a
consistent labeling (exercise). These two graphs would be sufficient for our construction.

31.1.2. The Zig-Zag product


At this point, we know how to construct a good “small” expander. The question is how to build a large
expander (i.e., large number of vertices) and with constant degree.
The intuition of the construction is the following: It is easy to improve the expansion qualities of a
graph by squaring it. The problem is that the resulting graph G has a degree which is too large. To
overcome this, we will replace every vertex in G by a copy of a small graph that is connected and has low
degree. For example, we could replace every vertex of degree d in G by a path having d vertices. Every
such vertex is now in charge of one original edge of the graph. Naturally, such a replacement operation
reduces the quality of the expansion of the resulting graph. In this case, replacing a vertex with a path
is a potential “disaster”, since every such subpath increases the lengths of the paths of the original graph
by a factor of d (and intuitively, a good expander has “short” paths between any pair of vertices).
Consider a “large” (n, D)-graph G and a “small” (D, d)-graph H. As a first stage, we replace every
vertex of G by a copy of H. The new graph K has ⟦n⟧ × ⟦D⟧ as a vertex set. Here, the edge vu ∈ E(G), where
u = v[i] and v = u[j], is replaced by the edge connecting (v, i) ∈ V(K) with (u, j) ∈ V(K). We will refer to
this resulting edge (v, i)(u, j) as a long edge. Also, we copy all the edges of the small graph to each one
of its copies. That is, for each i ∈ ⟦n⟧ and uv ∈ E(H), we add the edge (i, u)(i, v) to K, which is a short
edge. We will refer to K, which is an (nD, d + 1)-graph, as the replacement product of G and H, denoted by
G r H. (The accompanying figure, showing G, H and G r H, is not reproduced here.)
Again, intuitively, we are losing because the expansion of the resulting graph has deteriorated too
much. To overcome this problem, we will perform local shortcuts to shorten the paths in the resulting
graph (and thus improve its expansion properties). A zig-zag-zig path in the replacement product
graph K is a three edge path e1 e2 e3, where e1 and e3 are short edges, and the middle edge e2 is a
long edge. That is, if e1 = (i, u)(i, v), e2 = (i, v)(j, v′), and e3 = (j, v′)(j, u′), then e1, e2, e3 ∈ E(K), ij ∈ E(G),
uv ∈ E(H) and v′u′ ∈ E(H). Intuitively, you can think about e1 as a small “zig” step in H, e2 as a long
“zag” step in G, and finally e3 as a “zig” step in H. (The accompanying figure, showing a zig-zag-zig path
e1 e2 e3 in G r H, is not reproduced here.)
Another way of representing a zig-zag-zig path v1 v2 v3 v4 starting at the vertex v1 = (i, v) ∈ V(F), is
to parameterize it by two integers ℓ, ℓ′ ∈ ⟦d⟧, where

    v1 = (i, v),   v2 = (i, vH[ℓ]),   v3 = (iG[vH[ℓ]], vH[ℓ]),   v4 = (iG[vH[ℓ]], (vH[ℓ])H[ℓ′]).

Let Z be the set of all (unordered) pairs of vertices of K connected by such a zig-zag-zig path. Note
that every vertex (i, v) of K has d² such paths having (i, v) as an endpoint. Consider the graph F = (V(K), Z).
The graph F has nD vertices, and it is d²-regular. Furthermore, since we shortcut all these zig-zag-zig
paths in K, the graph F is a much better expander (intuitively) than K. We will refer to the graph F as
the zig-zag product of G and H.
Definition 31.1.1. The zig-zag product of an (n, D)-graph G and a (D, d)-graph H is the (nD, d²)-graph
F = G z H, where the set of vertices is ⟦n⟧ × ⟦D⟧, and for any i ∈ ⟦n⟧, v ∈ ⟦D⟧, and ℓ, ℓ′ ∈ ⟦d⟧ we have
in F the edge connecting the vertex (i, v) with the vertex (iG[vH[ℓ]], (vH[ℓ])H[ℓ′]).

Remark 31.1.2. We need the resulting zig-zag graph to have consistent labeling. For the sake of simplicity
of exposition, we are just going to assume this property.
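To make Definition 31.1.1 concrete, here is a short Python sketch that generates the (ordered) edges of the zig-zag product from adjacency lists of G and H; the representation G[i][p] = pth neighbor of i is an assumption of this sketch, and consistent labeling of G is assumed as in the remark above.

    def zigzag_edges(G, H):
        """Ordered edge list of the zig-zag product G (z) H (a sketch of Definition 31.1.1).

        G[i][p]: the p-th neighbor of vertex i in the (n, D)-graph G, p in {0, ..., D-1}.
        H[v][l]: the l-th neighbor of vertex v in the (D, d)-graph H, v in {0, ..., D-1}.
        A vertex of the product is a pair (i, v): copy v of the cloud sitting at i.
        """
        d = len(H[0])
        edges = []
        for i in range(len(G)):
            for v in range(len(H)):
                for l in range(d):           # zig: short step inside the cloud of i
                    w = H[v][l]
                    j = G[i][w]              # zag: long step in G along port w
                    for lp in range(d):      # zig: short step inside the cloud of j
                        edges.append(((i, v), (j, H[w][lp])))
        return edges                         # each vertex is the left endpoint of d*d edges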

We next bound the tension of the zig-zag product graph.


Theorem 31.1.3. We have γ(G z H) ≤ γ2(G)(γ2(H))², and γ2(G z H) ≤ γ2(G)(γ2(H))².

Proof: Let G = (⟦n⟧, E) be an (n, D)-graph and H = (⟦D⟧, E′) be a (D, d)-graph. Fix any function
f : ⟦n⟧ × ⟦D⟧ → R, and observe that

    ψ = E_{u,v∈⟦n⟧, k,ℓ∈⟦D⟧}[ |f(u, k) − f(v, ℓ)|² ] = E_{k,ℓ∈⟦D⟧}[ E_{u,v∈⟦n⟧}[ |f(u, k) − f(v, ℓ)|² ] ]
      ≤ γ2(G) E_{k,ℓ∈⟦D⟧}[ E_{uv∈E(G)}[ |f(u, k) − f(v, ℓ)|² ] ]
      = γ2(G) E_{u∈⟦n⟧, k,ℓ,p∈⟦D⟧}[ |f(u, k) − f(u[p], ℓ)|² ] = γ2(G) · Δ1,

where Δ1 denotes the last expectation (a random edge of the D-regular graph G is sampled by picking a
random vertex u and a random port p ∈ ⟦D⟧).
Now,

    Δ1 = E_{u∈⟦n⟧, ℓ∈⟦D⟧}[ E_{k,p∈⟦D⟧}[ |f(u, k) − f(u[p], ℓ)|² ] ] ≤ E_{u∈⟦n⟧, ℓ∈⟦D⟧}[ γ2(H) E_{kp∈E(H)}[ |f(u, k) − f(u[p], ℓ)|² ] ]
      = γ2(H) E_{u∈⟦n⟧, ℓ∈⟦D⟧}[ E_{p∈⟦D⟧, j∈⟦d⟧}[ |f(u, p[j]) − f(u[p], ℓ)|² ] ] = γ2(H) · Δ2.

Now, using the consistent labeling of G (i.e., substituting v = u[p], so that u = v[p]),

    Δ2 = E_{j∈⟦d⟧, ℓ∈⟦D⟧}[ E_{u∈⟦n⟧, p∈⟦D⟧}[ |f(u, p[j]) − f(u[p], ℓ)|² ] ] = E_{j∈⟦d⟧, ℓ∈⟦D⟧}[ E_{v∈⟦n⟧, p∈⟦D⟧}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ]
      = E_{j∈⟦d⟧, v∈⟦n⟧}[ E_{p,ℓ∈⟦D⟧}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ]
      ≤ γ2(H) E_{j∈⟦d⟧, v∈⟦n⟧}[ E_{pℓ∈E(H)}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ] = γ2(H) · Δ3.

Now, we have

    Δ3 = E_{j∈⟦d⟧, v∈⟦n⟧}[ E_{p∈⟦D⟧, i∈⟦d⟧}[ |f(v[p], p[j]) − f(v, p[i])|² ] ] = E_{(u,k)(ℓ,v)∈E(G z H)}[ |f(u, k) − f(ℓ, v)|² ],

as (v[p], p[j]) is adjacent to (v[p], p) (a short edge), which is in turn adjacent to (v, p) (a long edge),
which is adjacent to (v, p[i]) (a short edge). Namely, (v[p], p[j]) and (v, p[i]) form the endpoints of a zig-
zag path in the replacement product of G and H. That is, these two endpoints are connected by an edge
in the zig-zag product graph. Furthermore, it is easy to verify that each zig-zag edge gets accounted for
in this representation exactly once, implying the above equality. Thus, we have ψ ≤ γ2(G)(γ2(H))² Δ3,
which implies the claim.
The second claim follows by similar argumentation.

31.1.3. Squaring
The last component in our construction is squaring a graph. Given an (n, d)-graph G,
consider the multigraph G² formed by connecting any two vertices connected in G by a path of length 2.
Clearly, if M is the adjacency matrix of G, then the adjacency matrix of G² is the matrix M². Note that
(M²)ij is the number of distinct paths of length 2 in G from i to j. Note that the new graph might have
self loops, which does not affect our analysis, so we keep them in.
Lemma 31.1.4. Let G be an (n, d)-graph. The graph G² is an (n, d²)-graph. Furthermore,

    γ2(G²) = (γ2(G))² / (2γ2(G) − 1).
Proof: The graph G² has eigenvalues (λ̂1(G))², . . . , (λ̂n(G))² for its transition matrix Q². As such, we have that

    λ̂(G²) = max( λ̂2(G²), −λ̂n(G²) ).

Now, λ̂1(G²) = 1. If λ̂2(G) ≥ |λ̂n(G)| then λ̂(G²) = λ̂2(G²) = (λ̂2(G))² = (λ̂(G))².
If λ̂2(G) < |λ̂n(G)| then λ̂(G²) = λ̂2(G²) = (λ̂n(G))² = (λ̂(G))².
Thus, in either case λ̂(G²) = (λ̂(G))². Now, by Lemma 30.1.2, γ2(G) = 1/(1 − λ̂(G)), which implies that
λ̂(G) = 1 − 1/γ2(G), and thus

    γ2(G²) = 1 / (1 − λ̂(G²)) = 1 / (1 − (λ̂(G))²) = 1 / (1 − (1 − 1/γ2(G))²) = γ2(G) / (2 − 1/γ2(G)) = (γ2(G))² / (2γ2(G) − 1).
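As a numeric sanity check of this lemma (a sketch using numpy, on the odd cycle C9, which is non-bipartite so its bi-tension is finite):

    import numpy as np

    n = 9
    M = np.zeros((n, n))
    for i in range(n):
        M[i, (i + 1) % n] = M[i, (i - 1) % n] = 1
    Q = M / 2
    ev = np.linalg.eigvalsh(Q)                 # ascending eigenvalues of Q
    lam = max(ev[-2], -ev[0])                  # lambda-hat of G
    gamma2 = 1 / (1 - lam)                     # bi-tension of G (Lemma 30.1.2)

    ev2 = np.linalg.eigvalsh(Q @ Q)            # transition matrix of G^2 is Q^2
    lam_sq = max(ev2[-2], -ev2[0])
    print(1 / (1 - lam_sq), gamma2**2 / (2 * gamma2 - 1))   # the two values should agree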

31.1.4. The construction


So, let us build an expander using Theorem 30.2.3, with parameters r = 7 and q = 2⁵ = 32. Let d = q² = 1024.
The resulting graph H has N = q^{r+1} = d⁴ vertices, and it is d = q² regular. Furthermore, λ̂i ≤ r/q = 7/32,
for all i ≥ 2. As such, we have

    γ(H) = γ2(H) = 1/(1 − 7/32) = 32/25.
Let G0 be any graph whose square is the complete graph over n0 = N + 1 vertices. Observe that G0² is
d⁴-regular. Set Gi = G²_{i−1} z H. Clearly, the graph Gi has

    ni = ni−1 N

vertices. The graph G²_{i−1} z H is d²-regular. As for the bi-tension, let αi = γ2(Gi). We have that

    αi ≤ ( α²_{i−1} / (2αi−1 − 1) ) (γ2(H))² = ( α²_{i−1} / (2αi−1 − 1) ) (32/25)² ≤ 1.64 α²_{i−1} / (2αi−1 − 1).

It is now easy to verify that αi cannot be bigger than 5.

Theorem 31.1.5. For any i ≥ 0, one can compute deterministically a graph Gi with ni = (d⁴ + 1)d^{4i}
vertices, which is d²-regular, where d = 1024. The graph Gi is a (1/10)-expander.

Proof: The construction is described above. As for the expansion, since the bi-tension bounds the
tension of a graph, we have that γ(Gi ) ≤ γ2 (Gi ) ≤ 5. Now, by Lemma 29.2.2, we have that Gi is a
δ-expander, where δ ≥ 1/(2γ(Gi )) ≥ 1/10. 

31.2. Bibliographical notes


A good survey on expanders is the monograph by Hoory et al. [HLW06]. The small expander construc-
tion is from the paper by Alon et al. [ASS08] (but it is originally from the work by Alon and Roichman
[AR94]). The work by Alon et al. [ASS08] contains a construction of an expander that is constant de-
gree, which is of similar complexity to the one we presented here. Instead, we used the zig-zag expander
construction from the influential work of Reingold et al. [RVW02]. Our analysis, however, is from an
upcoming paper by Mendel and Naor [MN08]. This analysis is arguably reasonably simple (as simplicity
is in the eye of the beholder, we will avoid claiming that it is the simplest), and (even better) provides a good
intuition and a systematic approach to analyzing the expansion.
We took some creative freedom in naming things; the names tension and bi-tension are the author's
own invention.

31.3. Exercises
Exercise 31.3.1 (Expanders made easy.). By considering a random bipartite three-regular graph on
2n vertices obtained by picking three random permutations between the two sides of the bipartite graph,
prove that there is a c > 0 such that for every n there exists a (2n, 3, c)-expander. (What is the value of
c in your construction?)

Exercise 31.3.2 (Is your consistency in vain?). In the construction, we assumed that the graphs we
are dealing with when building expanders have consistent labeling. This can be enforced by working
with bipartite graphs, which implies modifying the construction slightly.

(A) Prove that a d-regular bipartite graph always has a consistent labeling (hint: consider matchings
in this graph).

(B) Prove that if G is bipartite so is the graph G3 (the cubed graph).

(C) Let G be an (n, D)-graph and let H be a (D, d)-graph. Prove that if G is bipartite then G z H is
bipartite.

(D) Describe in detail a construction of an expander that is: (i) bipartite, and (ii) has consistent labeling
at every stage of the construction (prove this property if necessary). For the ith graph in your series,
what is its vertex degree, how many vertices it has, and what is the quality of expansion it provides?

Exercise 31.3.3 (Tension and bi-tension.). [30 points]


Disprove (i.e., give a counter example) that there exists a universal constant c, such that for any
connected graph G, we have that γ(G) ≤ γ2 (G) ≤ cγ(G).

Acknowledgements
Much of the presentation above follows suggestions by Manor Mendel. He also contributed some of the
figures.

Chapter 32

Miscellaneous Prerequisite
Be that as it may, it is to night school that I owe
what education I possess; I am the first to own
that it doesn’t amount to much, though there is
something rather grandiose about the gaps in it.

The tin drum, Gunter Grass

The purpose of this chapter is to remind the reader (and the author) about some basic definitions
and results in mathematics used in the text. The reader should refer to standard texts for further details.

32.1. Geometry and linear algebra


A set X in Rd is closed if the limit of any converging sequence of points of X is itself a point of
X. A set X ⊆ Rd is compact if it is closed and bounded; namely, there exists a constant c, such that
for all p ∈ X, ‖p‖ ≤ c.
Definition 32.1.1 (Convex hull). The convex hull of a set R ⊆ Rd is the set of all convex combinations
of points of R; that is,

    CH(R) = { Σ_{i=1}^m αi ri | m ≥ 1, ri ∈ R, αi ≥ 0 for all i, and Σ_{i=1}^m αi = 1 }.
 

In the following, we cover some material from linear algebra. Proofs of these facts can be found in
any text on linear algebra, for example [Leo98].
For a matrix M, let MT denote the transposed matrix. We remind the reader that for two matrices
M and B, we have (MB)T = BT MT . Furthermore, for any three matrices M, B, and C, we have (MB)C =
M(BC).
A matrix M ∈ Rn×n is symmetric if MT = M. All the eigenvalues of a symmetric matrix are real
numbers. A symmetric matrix M is positive definite if xᵀMx > 0 for all x ∈ Rⁿ, x ≠ 0. Among other things
this implies that M is non-singular. If M is symmetric, then it is positive definite if and only if all its
eigenvalues are positive numbers.
In particular, if M is symmetric positive definite, then det(M) > 0. Since all the eigenvalues of a
positive definite matrix are positive real numbers, the following holds, as can be easily verified.
Claim 32.1.2. A symmetric matrix M is positive definite if and only if there exists a matrix B such
that M = BT B and B is not singular.

For two vectors u, v ∈ Rn , let hu, vi = uT v denote their dot product.

Lemma 32.1.3. Given a simplex △ in Rd with vertices v1, . . . , vd, vd+1 (or equivalently △ = CH(v1, . . . , vd+1)),
the volume of this simplex is (1/d!)·|C|, where C is the determinant of the (d + 1) × (d + 1) matrix whose
first row is (1, 1, . . . , 1) and whose remaining rows list the coordinates of v1, v2, . . . , vd+1 in the corresponding
columns. In particular, for a triangle with vertices (x, y), (x′, y′), and (x″, y″), its area is the absolute value of

    (1/2) det [ 1 x y ; 1 x′ y′ ; 1 x″ y″ ].
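For instance, the triangle area formula can be evaluated numerically (a small sketch using numpy):

    import numpy as np

    # Area of the triangle with vertices (0,0), (1,0), (0,2), via Lemma 32.1.3;
    # the expected value is (1/2) * base * height = 1.
    C = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [1, 0, 2]], dtype=float)   # rows: (1, x, y) for each vertex
    print(abs(np.linalg.det(C)) / 2)          # -> 1.0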

32.1.1. Linear and affine subspaces


Definition 32.1.4. The linear subspace spanned by a set of vectors V ⊆ Rd is the set

    linear(V) = { Σ_i αi vi | αi ∈ R, vi ∈ V }.

An affine combination of vectors v1, . . . , vn is a linear combination Σ_{i=1}^n αi·vi = α1 v1 + α2 v2 +
· · · + αn vn in which the sum of the coefficients is 1; thus, Σ_{i=1}^n αi = 1. The maximum dimension of the
affine subspace spanned in such a case is n − 1.

Definition 32.1.5. The affine subspace spanned by a set V ⊆ Rd is

    affine(V) = { Σ_i αi vi | αi ∈ R, vi ∈ V, and Σ_i αi = 1 }.

For any vector v ∈ V, we have that affine(V) = v + linear(V − v), where V − v = { v′ − v | v′ ∈ V }.

32.1.2. Computational geometry


The following are standard results in computational geometry; see [dBCKO08] for more details.

Lemma 32.1.6. The convex hull of n points in the plane can be computed in O(n log n) time.

Lemma 32.1.7. The lower and upper envelopes of n lines in the plane can be computed in O(n log n)
time.

Proof: Use duality and the algorithm of Lemma 32.1.6. 

32.2. Calculus
Lemma 32.2.1. For x ∈ (−1, 1), we have ln(1 + x) = x − x²/2 + x³/3 − x⁴/4 + · · · = Σ_{i=1}^∞ (−1)^{i+1} x^i / i.

Lemma 32.2.2. The following hold:


(A) For all x ∈ R, 1 + x ≤ exp(x).
(B) For x ≥ 0, 1 − x ≤ exp(−x).
(C) For 0 ≤ x ≤ 1, exp(x) ≤ 1 + 2x.
(D) For x ∈ [0, 1/2], exp(−2x) ≤ 1 − x.

254
Proof: (A) Let f (x) = 1 + x and g(x) = exp(x). Observe that f (0) = g(0) = 1. Now, for x ≥ 0, we have
that f 0(x) = 1 and g0(x) = exp(x) ≥ 1. As such f (x) ≤ g(x) for x ≥ 0. Similarly, for x < 0, we have
g0(x) = exp(x) < 1, which implies that f (x) ≤ g(x).
(B) This is immediate from (A).
(C) Observe that exp(1) ≤ 1 + 2·1 and exp(0) = 1 + 2·0. By the convexity of exp(x), it follows that
exp(x) ≤ 1 + 2x for all x ∈ [0, 1].
(D) Observe that (i) exp(−2(1/2)) = 1/e ≤ 1/2 = 1 − (1/2), (ii) exp(−2·0) = 1 ≤ 1 − 0, (iii) (exp(−2x))′ =
−2 exp(−2x), and (iv) (exp(−2x))″ = 4 exp(−2x) ≥ 0 for all x. As such, exp(−2x) is a convex function and
the claim follows.
Lemma 32.2.3. For 1 > ε > 0 and y ≥ 1, we have that (ln y)/ε ≤ log_{1+ε} y ≤ 2(ln y)/ε.
Proof: By Lemma 32.2.2, 1 + x ≤ exp(x) ≤ 1 + 2x for x ∈ [0, 1]. This implies that ln(1 + x) ≤ x ≤ ln(1 + 2x).
As such, log_{1+ε} y = ln y / ln(1 + ε) = ln y / ln(1 + 2(ε/2)) ≤ (ln y)/(ε/2). The other inequality follows in a similar fashion.
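A quick numeric spot-check of the lemma (a Python sketch):

    import math

    for eps in (0.1, 0.5, 0.9):
        for y in (2.0, 100.0, 1e6):
            val = math.log(y, 1 + eps)                       # log base (1 + eps) of y
            assert math.log(y) / eps <= val <= 2 * math.log(y) / eps
    print("ok")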

Bibliography

[AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cam-
bridge, 1999.

[ABKU00] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. SIAM J. Comput.,
29(1):180–200, 2000.

[Ach01] D. Achlioptas. Database-friendly random projections. In Proc. 20th ACM Sympos. Princi-
ples Database Syst. (PODS), pages 274–281, 2001.

[AES99] P. K. Agarwal, A. Efrat, and M. Sharir. Vertical decomposition of shallow levels in 3-


dimensional arrangements and its applications. SIAM J. Comput., 29:912–953, 1999.

[AHY07] P. Agarwal, S. Har-Peled, and H. Yu. Embeddings of surfaces, curves, and moving points
in Euclidean space. In Proc. 23rd Annu. Sympos. Comput. Geom. (SoCG), pages 381–389,
2007.

[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application
to the k-server problem. SIAM J. Comput., 24(1):78–100, February 1995.

[AMS98] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements


of lines and segments. SIAM J. Comput., 27(2):491–505, 1998.

[AN04] N. Alon and A. Naor. Approximating the cut-norm via grothendieck’s inequality. In Proc.
36th Annu. ACM Sympos. Theory Comput. (STOC), pages 72–80, 2004.

[AR94] N. Alon and Y. Roichman. Random cayley graphs and expanders. Random Struct. Algo-
rithms, 5(2):271–285, 1994.

[Aro98] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric
problems. J. Assoc. Comput. Mach., 45(5):753–782, September 1998.

[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley InterScience, 2nd edition,
2000.

[ASS08] N. Alon, O. Schwartz, and A. Shapira. An elementary construction of constant-degree


expanders. Combin. Probab. Comput., 17(3):319–327, 2008.

[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application.
In Proc. 37th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 183–193, October
1996.

[Bar98] Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th Annu. ACM
Sympos. Theory Comput. (STOC), pages 161–168, 1998.

[BM58] G. E.P. Box and M. E. Muller. A note on the generation of random normal deviates. Ann.
Math. Stat., 28:610–611, 1958.

[Bol98] B. Bollobas. Modern Graph Theory. Springer-Verlag, 1998.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.

[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.

[Car76] L. Carroll. The hunting of the snark, 1876.

[CCH09] C. Chekuri, K. L. Clarkson., and S. Har-Peled. On the set multi-cover problem in geometric
settings. In Proc. 25th Annu. Sympos. Comput. Geom. (SoCG), pages 341–350, 2009.

[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in
geometry. Combinatorica, 10(3):229–249, 1990.

[Cha01] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University
Press, New York, 2001.

[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Tech-
nical Report PCS-TR90-147, Dept. Math. Comput. Sci., Dartmouth College, Hanover, NH,
1986.

[CKR01] G. Calinescu, H. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension
problem. In Proc. 12th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 8–16, 2001.

[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete


Comput. Geom., 2:195–222, 1987.

[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. In Proc.


4th Annu. Sympos. Comput. Geom. (SoCG), pages 1–11, New York, NY, USA, 1988. ACM.

[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms.


MIT Press / McGraw-Hill, 2001.

[CM96] B. Chazelle and J. Matoušek. On linear-time deterministic algorithms for optimization


problems in fixed dimension. J. Algorithms, 21:579–597, 1996.

[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental


constructions. Comput. Geom. Theory Appl., 3(4):185–212, 1993.

[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geom-


etry, II. Discrete Comput. Geom., 4:387–421, 1989.

[dBCKO08] M. de Berg, O. Cheong, M. van Kreveld, and M. H. Overmars. Computational Geometry:


Algorithms and Applications. Springer-Verlag, Santa Clara, CA, USA, 3rd edition, 2008.

[dBDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construc-


tion. Discrete Comput. Geom., 14:261–286, 1995.

[dBS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Internat. J. Comput. Geom.
Appl., 5:343–355, 1995.

[DG03] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss.
Rand. Struct. Alg., 22(3):60–65, 2003.

[DP09] Devdatt P. Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis
of Randomized Algorithms. Cambridge University Press, 2009.

[DS00] P. G. Doyle and J. L. Snell. Random walks and electric networks. ArXiv Mathematics
e-prints, 2000.

[FRT04] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary


metrics by tree metrics. J. Comput. Sys. Sci., 69(3):485–497, 2004.

[Gar02] R. J. Gardner. The Brunn-Minkowski inequality. Bull. Amer. Math. Soc., 39:355–405,
2002.

[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Opti-
mization, volume 2 of Algorithms and Combinatorics. Springer-Verlag, Berlin Heidelberg,
2nd edition, 1993.

[Gre69] W.R. Greg. Why are Women Redundant? Trübner, 1869.

[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest
pair problems. Nordic J. Comput., 2:3–27, 1995.

[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis, University of California, Berkeley,
2000.

[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum


cut and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach.,
42(6):1115–1145, November 1995.

[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput.,
29(6):2016–2039, 2000.

[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4):1341–1367,


2000.

[Har11] S. Har-Peled. Geometric Approximation Algorithms, volume 173 of Math. Surveys & Mono-
graphs. Amer. Math. Soc., Boston, MA, USA, 2011.

[Hås01] J. Håstad. Some optimal inapproximability results. J. Assoc. Comput. Mach., 48(4):798–
859, 2001.

[HLW06] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin
Amer. Math. Soc., 43:439–561, 2006.

[HR15] Sariel Har-Peled and Benjamin Raichel. Net and prune: A linear time algorithm for Eu-
clidean distance problems. J. Assoc. Comput. Mach., 62(6):44:1–44:35, December 2015.

[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom.,
2:127–151, 1987.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse
of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), pages
604–613, 1998.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proc. 42nd
Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 10–31, 2001. Tutorial.
[Kel56] J. L. Kelly. A new interpretation of information rate. Bell Sys. Tech. J., 35(4):917–926, jul
1956.
[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for
max cut and other 2-variable csps. In Proc. 45th Annu. IEEE Sympos. Found. Comput.
Sci. (FOCS), pages 146–154, 2004. To appear in SICOMP.
[KKT91] C. Kaklamanis, D. Krizanc, and T. Tsantilas. Tight bounds for oblivious routing in the
hypercube. Math. sys. theory, 24(1):223–232, 1991.
[KLMN05] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: A new embedding
method for finite metric spaces. Geom. funct. anal. (GAFA), 15(4):839–858, 2005.
[KPW92] J. Komlós, J. Pach, and G. Woeginger. Almost tight bounds for ε-nets. Discrete Comput.
Geom., 7:163–173, 1992.
[Leo98] S. J. Leon. Linear Algebra with Applications. Prentice Hall, 5th edition, 1998.
[LLS01] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of
learning. J. Comput. Syst. Sci., 62(3):516–527, 2001.
[Mag07] A. Magen. Dimensionality reductions in `2 that preserve volumes and distance to affine
spaces. Discrete Comput. Geom., 38(1):139–153, 2007.
[Mat90] J. Matoušek. Bi-Lipschitz embeddings into low-dimensional Euclidean spaces. Comment.
Math. Univ. Carolinae, 31:589–600, 1990.
[Mat92] J. Matoušek. Reporting points in halfspaces. Comput. Geom. Theory Appl., 2(3):169–186,
1992.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20:427–448,
1998.
[Mat99] J. Matoušek. Geometric Discrepancy. Springer, 1999.
[Mat02] J. Matoušek. Lectures on Discrete Geometry, volume 212 of Grad. Text in Math. Springer,
2002.
[McD89] C. McDiarmid. Surveys in Combinatorics, chapter On the method of bounded differences.
Cambridge University Press, 1989.
[Mil76] G. L. Miller. Riemann’s hypothesis and tests for primality. J. Comput. Sys. Sci., 13(3):300–
317, 1976.

[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Press, 1998.

[MN08] M. Mendel and A. Naor. Towards a calculus for non-linear spectral gaps. manuscript, 2008.

[MOO05] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influ-
ences invariance and optimality. In Proc. 46th Annu. IEEE Sympos. Found. Comput. Sci.
(FOCS), pages 21–30, 2005.

[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, Cam-
bridge, UK, 1995.

[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and
probabilistic analysis. Cambridge, 2005.

[Mul89] K. Mulmuley. An efficient algorithm for hidden surface removal. Comput. Graph.,
23(3):379–388, 1989.

[Mul94] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms.


Prentice Hall, Englewood Cliffs, NJ, 1994.

[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press,
1998.

[PA95] J. Pach and P. K. Agarwal. Combinatorial Geometry. John Wiley & Sons, 1995.

[Rab76] M. O. Rabin. Probabilistic algorithms. In J. F. Traub, editor, Algorithms and Complexity:


New Directions and Recent Results, pages 21–39. Academic Press, Orlando, FL, USA, 1976.

[Rab80] M. O. Rabin. Probabilistic algorithm for testing primality. J. Number Theory, 12(1):128–
138, 1980.

[RVW02] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph product,
and new constant-degree expanders and extractors. Annals Math., 155(1):157–187, 2002.

[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Appli-
cations. Cambridge University Press, New York, 1995.

[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. In J. Pach, editor, New
Trends in Discrete and Computational Geometry, volume 10 of Algorithms and Combina-
torics, pages 37–68. Springer-Verlag, 1993.

[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput.,
12(2):191–201, 2003.

[Smi00] M. Smid. Closest-point problems in computational geometry. In J.-R. Sack and J. Urrutia,
editors, Handbook of Computational Geometry, pages 877–935. Elsevier, Amsterdam, The
Netherlands, 2000.

[Ste12] E. Steinlight. Why novels are redundant: Sensation fiction and the overpopulation of
literature. ELH, 79(2):501–535, 2012.

[Tót03] C. D. Tóth. A note on binary plane partitions. Discrete Comput. Geom., 30(1):3–16, 2003.

[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies
of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.

[Wes01] D. B. West. Introduction to Graph Theory. Prentice Hall, 2nd edition, 2001.

[WG75] H. W. Watson and F. Galton. On the probability of the extinction of families. J. Anthrop.
Inst. Great Britain, 4:138–144, 1875.

Index

(k, n) decoding function, 225 autopartition, 16


(k, n) encoding function, 225 average-case analysis, 12
(n, d)-graph, 237
C-Lipschitz, 204 ball, 203
K-bi-Lipschitz, 204 volume, 136
δ-expander, 237 Bernoulli distribution, 35
Fi -measurable, 80 bi-Lipschitz, 141
σ-algebra, 13 bi-tension, 241
σ-field, 79 binary symmetric channel, 225
c-Lipschitz, 81 binomial
regular, d, 122 estimates, 167
k-HST, 205 binomial distribution, 35
k-median clustering, 205 birthday paradox, 39
t-step transition probability, 111 bit fixing, 62
[CNF], 86 bounded differences, 75
brick set, 132
abelian, 190 butterfly, 177
adjacency matrix, 237
affine combination, 254 Catalan number, 111
affine subspace, 254 central limit theorem, 53
algorithm certified vertex, 177
Alg, 24, 25, 51, 52, 80, 87, 90, 91, 127–129 characteristic vector, 242
Contract, 33, 34 Chernoff inequality, 65
EuclidGCD, 188 simplified form, 65
EuclidGCD, 188 clause
FastCut, 33, 34 dangerous, 95
Jacobi, 198 survived, 96
Las Vegas, 24 closed, 253
LazySelect, 43, 44 clusters, 205
MinCut, 31, 32, 34 combinatorial dimension, 176
MinCutRep, 32–34 commutative group, 190
Monte Carlo, 24 commute time, 116
QuickSort, 14, 15, 17, 24, 52, 59, 60 compact, 253
QuickSelect, 17, 18 Complexity
RandomRoute, 62 co−, 24
approximation factor, 87 BPP, 25
arrangement, 103, 171 NP, 24
atomic event, 13, 79 PP, 25

P, 24 edge, 103, 171
RP, 24 effective resistance, 117
ZPP, 25 eigenvalue, 237
conditional expectation, 73 eigenvector, 237
conditional probability, 13, 29 electrical network, 117
confidence, 20 elementary event, 13, 79
conflict graph, 173 embedding, 101, 141, 204
conflict list, 173 entropy, 217
congruent modulo n, 188 binary, 217
consistent labeling, 247 epochs, 121
contraction estimate, 146
edge, 30 Euler totient function, 190
convex hull, 253 event, 13
coprime, 188 expander
cover time, 116 [n, d, δ]-expander, 237
covering radius, 169 [n, d, c]-expander, 125
critical, 23 c, 125
crossing number, 101, 160 expectation, 13, 39
cut, 29
face, 171
minimum, 29
faces, 103
cuts, 29
field, 192, 243
cutting, 181
filter, 79
cyclic, 191
filtration, 79
defining set, 176 final strong component, 112
Delaunay fragment, 16
circle, 177 fully explicit, 127
triangle, 177
generator, 191
dependency graph, 93
geometric distribution, 36
dimension
graph
combinatorial, 176 d-regular, 237
dual shattering, 152 labeled, 122
shattering, 149 lollipop, 116
dual, 152 grid, 21
discrepancy, 159 grid cell, 21
compatible, 160 grid cluster, 21
cross, 160 ground set, 145
distortion, 141, 204 group, 188, 190
distributivity of multiplication over addition, 192 growth function, 148
Doob martingale, 81 gcd, 187
double factorial, 136
doubly stochastic, 117 harmonic number, 15
dual heavy
range space, 152 t-heavy, 179
shatter function, 152 Hierarchically well-separated tree, 205
shattering dimension, 152 history, 111
Dyck words, 111 hitting time, 116

Hoeffding’s inequality, 70 median, 138
HST, 205 memorylessness property, 111
HST, 205, 207, 208 merge, 162
hypercube, 61 metric, 203
metric space, 203–211
identity element, 190 mincut, 29
independent, 14 Minkowski sum, 131
pairwise, 49 modulo
wise equivalent, 49
k, 49 moments technique, 184
indicator variable, 14 all regions, 176
inequality
Hoeffding, 70 net
isoperimetric, 134 ε-net, 155
irreducible, 112 ε-net theorem, 155
isoperimetric inequality, 134 NP
complete, 86
Jacobi symbol, 196
oblivious, 61
Kelly criterion, 57 Ohm’s law, 117
Kirchhoff’s law, 117 open ball, 203
OR-concentrator, 89
Law of quadratic reciprocity, 196 order, 191
lazy randomized incremental construction, 185 orthonormal eigenvector basis, 239
Legendre symbol, 194
level, 103 periodicity, 112
k-level, 103 prime, 187
linear subspace, 254 prime factorization, 190
Linearity of expectation, 14 probabilistic distortion, 207
Lipschitz, 138, 204 probabilities, 13
bi-Lipschitz, 204 Probability
Lipschitz condition, 81 Amplification, 32
lollipop graph, 116 probability, 13
long, 248 probability measure, 13, 79
lcm, 187 probability space, 13, 79
problem
Markov chain, 110 MAX-SAT, 86–88
aperiodic, 112
ergodic, 113 quadratic residue, 194
martingale, 81 quotation
edge exposure, 75 – From Gustible’s Planet, Cordwainer Smith,
vertex exposure, 75 85
martingale difference, 80 – The Glass Bead Game, Hermann Hesse, 93
martingale sequence, 74 – Yehuda Amichai, My Father., 223, 225
matrix — Dirk Gently’s Holistic Detective Agency, Dou-
positive definite, 253 glas Adams., 79
symmetric, 253 — Yehuda Amichai, Tourists, 99
measure, 145 –Romain Gary, The talent scout., 217

Anonymous, 105 sphere
Carl XIV Johan, King of Sweden, 231 surface area, 136
quotient, 188 spread, 208
quotient group, 191 squaring, 250
standard deviation, 35
Radon’s theorem, 147 state
random incremental construction, 172, 173, 176, aperiodic, 112
181 ergodic, 113
lazy, 185 non null, 111
random sample, 103, 155–158, 162, 164, 168, 174, null persistent, 111
175, 178–180, 182, 184 periodic, 112
relative (p, ε)-approximation, 168 persistent, 111
via discrepancy, 170 transient, 111
ε-sample, 155 state probability vector, 112
random variable, 13, 207 stationary distribution, 112
random walk, 105 stochastic, 117
randomized rounding, 87 stopping set, 176
range, 145 strong component, 112
range space, 145 sub martingale, 80
dual, 152 subgraph
primal, 152 unique, 96
projection, 146 subgroup, 190
rank, 42 super martingale, 80
relative pairwise distance, 126
remainder, 188 tension, 238
replacement product, 248 theorem
residue, 188 ε-net, 155
resistance, 117, 121 Radon’s, 147
ε-sample, 155
sample transition matrix, 125, 237
ε-sample, 155 transition probabilities matrix, 110
ε-sample theorem, 155 transition probability, 110
ε-sample, 155 traverse, 122
sample space, 13 Turing machine
semidefinite, 215 log space, 122
set
union bound, 36
defining, 176 uniqueness, 23
stopping, 176 universal traversal sequence, 122
shallow cuttings, 185
shatter function, 149 variance, 35
dual, 152 VC
shattered, 146 dimension, 146
shattering dimension, 149 vertex, 103
short, 248 vertical decomposition, 172
sketch, 161 vertex, 172
sketch and merge, 162, 163, 170 vertical trapezoid, 172
spectral gap, 126, 243 vertices, 171

volume, 132
ball, 136
simplex, 254

walk, 122
weight
region, 176
width, 21

zig-zag, 249
zig-zag product, 249
zig-zag-zig path, 248

