FIT2004 Course Notes
Algorithms and Data Structures
The original version of the course notes was developed by Daniel Anderson. Over the years, additions, removals and modifications have been made by members of FIT2004's teaching teams, including Nathan Companez and Rafael Dowsley.
1 Analysis of Algorithms 1
1.1 Program Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Arguing Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Arguing Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Common Recurrence Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Asymptotic Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Measures of Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Analysis of Basic Sorting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.1 Selection Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.2 Insertion Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.3 Algorithms’ Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Graphs Basics 49
5.1 Modelling with Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Representation and Storage of Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Graph Traversal and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.1 Depth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.2 Finding Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.3 Cycle Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.4 Breadth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Shortest Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 Properties of Shortest Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4.2 Shortest Path Problem Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.3 Unweighted Shortest Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 The Topological Sorting Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5.1 Kahn’s Algorithm for Topological Sorting . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5.2 Topological Sorting Using DFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Incremental Graph Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.1 The Union-Find Disjoint-Set Data Structure . . . . . . . . . . . . . . . . . . . . . . 68
6 Greedy Algorithms 73
6.1 Shortest Path in Graphs with Non-negative Weights . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Minimum Spanning Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.1 Prim’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.2 Kruskal’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7 Dynamic Programming 83
7.1 The Key Elements of Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Top-down vs. Bottom-up Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . 87
7.3 Reconstructing Optimal Solutions to Optimisation Problems . . . . . . . . . . . . . . . 89
7.4 The Unbounded Knapsack Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.5 The 0-1 Knapsack Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.6 The Edit Distance Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.7 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.8 The Space-Saving Trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
1 Binary Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Selection sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Insertion sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Karatsuba’s multiplication algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Merge sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
7 Sort-and-CountInv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
8 Merge-and-CountSplitInv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
9 Heapsort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
10 Heapify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
11 Heap: Insert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
12 Heap: Delete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
13 Heap: Rise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
14 Heap: Fall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
15 Naive partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
16 Hoare partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
17 Dutch national flag partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
18 Quicksort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
19 Counting sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
20 Radix sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
21 Select minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
22 Select minimum and maximum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
23 Quickselect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
24 Median of medians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
25 Generic depth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
26 Finding connected components using depth-first search . . . . . . . . . . . . . . . . . . 59
27 Cycle detection in an undirected graph using depth-first search . . . . . . . . . . . . . 60
28 Generic breadth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
29 Single-source shortest paths in an unweighted graph . . . . . . . . . . . . . . . . . . . . . 65
30 Reconstruct shortest path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
31 Topological sorting using Kahn’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
32 Topological sorting using DFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
33 Connectivity check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
34 Union-find using disjoint-set forests (without optimisations) . . . . . . . . . . . . . . . 70
35 The FIND operation using path compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
36 The UNION operation using union by rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
37 Edge relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
38 Dijkstra’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
39 Improved Dijkstra’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
40 Prim’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
41 Kruskal’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
42 Recursive Fibonacci numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
43 Memoised Fibonacci numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
44 Bottom-up Fibonacci numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
45 Top-down coin change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
46 Bottom-up coin change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
47 Coin change solution reconstruction using backtracking . . . . . . . . . . . . . . . . . . 89
48 Bottom-up coin change with solution reconstruction using decision table . . . . . 90
49 Bottom-up unbounded knapsack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
50 Bottom-up 0-1 knapsack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
51 Bottom-up edit distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
52 Optimal sequence alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
53 Optimal matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
54 Bellman-Ford . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
55 Bellman-Ford: Handling negative cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
56 Floyd-Warshall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
57 Warshall’s transitive closure algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
58 Bottom-up longest path in a DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
59 Recursive longest path in a DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
60 The Ford-Fulkerson method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
61 Ford-Fulkerson implemented using depth-first search . . . . . . . . . . . . . . . . . . . . 117
62 The method to solve the circulation with demands problem . . . . . . . . . . . . . . . . 126
63 Cuckoo hashing: Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
64 Cuckoo hashing: Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
65 Cuckoo hashing: Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
66 AVL tree: Right rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
67 AVL tree: Left rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
68 AVL tree: Double-right rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
69 AVL tree: Double-left rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
70 AVL tree: Rebalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
71 Prefix trie: Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
72 Prefix trie: Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
73 Prefix trie: String sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
74 Naive suffix array construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
75 Prefix-doubling suffix array construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
76 Suffix array: Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Chapter 1
Analysis of Algorithms
When we write programs, we usually like to test them to give us confidence that they are cor-
rect. But testing a program can only ever demonstrate that it is wrong (if we find a case that
breaks it). To prove that a program always works, we need to be more mathematical. The two
major focuses of algorithm analysis are proving that an algorithm is correct, and determining
the amount of resources (time and space) used by the algorithm.
Invariants give us a way to analyse and establish the correctness of many simple algorithms; binary search is the example that we will explore in this chapter. A useful invariant for binary search is shown below.
We can now re-interpret the binary search algorithm such that every action it takes is simply
aiming to maintain these invariants. This observation can lead us to prove that the algorithm
is indeed correct and that it terminates.
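To make this concrete, here is a minimal Python sketch of binary search written in the invariant-maintaining style described above. It is a sketch rather than the notes' Algorithm 1 verbatim: the 0-indexing and the half-open interval [lo..hi) are implementation assumptions.

def binary_search(array, key):
    """Return True if key is in the sorted list `array`, else False.

    Every step either moves lo or hi to mid, shrinking the candidate
    interval [lo..hi) while keeping the key (if present) inside it.
    """
    if not array:
        return False
    lo, hi = 0, len(array)          # candidate interval is [lo..hi)
    while lo < hi - 1:              # interval still has at least two elements
        mid = (lo + hi) // 2        # lo < mid < hi, so progress is guaranteed
        if array[mid] <= key:
            lo = mid                # key, if present, lies in [mid..hi)
        else:
            hi = mid                # key, if present, lies in [lo..mid)
    return array[lo] == key

For example, binary_search([1, 3, 5, 8], 5) returns True, while binary_search([1, 3, 5, 8], 4) returns False.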
Case 2: key ∉ array
Since key is not in the array, provided that the algorithm terminates, regardless of the value of lo, array[lo] ≠ key, and hence the algorithm correctly identifies that key is not an element of array.
mid = ⌊(lo + hi) / 2⌋

If lo < hi − 1, then

lo < mid < hi
Proof
Since mid = ⌊(lo + hi)/2⌋, we have lo + hi − 1 ≤ 2 × mid ≤ lo + hi.
Since lo < hi − 1, we have lo ≤ hi − 2, i.e. hi ≥ lo + 2. Substituting this into the lower bound gives 2 × mid ≥ lo + hi − 1 ≥ 2 × lo + 1, and hence

2 × lo < 2 × mid ≤ lo + hi.

Since lo < hi − 1, we also have lo + 1 < hi. Substituting this into the upper bound gives 2 × mid ≤ lo + hi < 2 × hi. Dividing everything by two, we conclude that

lo < mid < hi

as desired.
Since this is true, and we always set either lo or hi to be equal to mid, it is always the case that at
every iteration, the interval [lo..hi) decreases in size by at least one. The interval [lo..hi) is finite
in size, therefore after some number of iterations it is true that lo ≥ hi - 1. Therefore the loop
must exit in a finite number of iterations and hence the binary search algorithm terminates in
finite time.
The running time of binary search therefore satisfies the recurrence T(n) = T(n/2) + a for n > 1, with T(1) = b, whose solution is

T(n) = a log₂(n) + b.

Proof
Unrolling the recurrence, T(n) = a + T(n/2) = 2a + T(n/4) = ... = a log₂(n) + T(1) = a log₂(n) + b.

Given this, we can conclude that the time complexity of binary search is O(log(n)).
Logarithmic Complexity
The solution to the recurrence relation

T(n) = T(n/2) + a   if n > 1,
       b            if n = 1,

is given by

T(n) = a log₂(n) + b.

Linear Complexity
The solution to the recurrence relation

T(n) = T(n − 1) + a   if n > 0,
       b              if n = 0,

is given by

T(n) = a·n + b.
Superlinear Complexity
The solution to the recurrence relation
T(n) = 2·T(n/2) + a·n   if n > 1,
       b                if n = 1,

is given by

T(n) = a·n·log(n) + b·n.
This is the complexity of many divide and conquer sorting algorithms, such as Mergesort.
Quadratic Complexity
The solution to the recurrence relation
T(n) = T(n − 1) + c·n   if n > 0,
       b                if n = 0,

is given by

T(n) = c·n(n + 1)/2 + b.
Exponential Complexity
The solution to the recurrence relation
T(n) = 2·T(n − 1) + a   if n > 0,
       b                if n = 0,

is given by

T(n) = (a + b)·2^n − a.
Big-O notation
Big-O is the notation that we will use most frequently. Informally, big-O notation denotes an
upper bound on the size of a function. That is, f(n) = O(g(n)) means f(n) is no bigger than the order of magnitude of g(n). Formally, we say that f(n) = O(g(n)) as n → ∞ if there are
constants c and n0 such that
f (n ) ≤ c · g (n ) for all n ≥ n0 .
This says that for values of n above some threshold, f is always no bigger than some constant
multiple of g . Note that this means the behaviour of f for small values of n is irrelevant. Also,
remember that big-O bounds can be overestimates. For example, it is correct to say that 2n + 1 = O(n³), and also that 2n + 1 = O(n).
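As a concrete instance of the formal definition, here is a short worked check (with one possible, non-unique choice of constants) that 2n + 1 = O(n). Take c = 3 and n₀ = 1. For every n ≥ n₀ we have 1 ≤ n, so

2n + 1 ≤ 2n + n = 3n = c · n,

which is exactly the condition f(n) ≤ c · g(n) for all n ≥ n₀, with f(n) = 2n + 1 and g(n) = n.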
Big-Ω notation
Big-Ω (Omega) notation is a counterpart to big-O notation that denotes a lower bound instead
of an upper bound. That is, f (n ) = Ω(g (n )) means that f (n ) is at least as big as the order of
magnitude of g (n ). Formally, we say that f (n ) = Ω(g (n )) as n → ∞ if there are constants c and
n0 such that
f (n ) ≥ c · g (n ) for all n ≥ n0 .
Note that this is equivalent to saying that g (n ) = O (f (n )). Just like big-O bounds could be over-
estimates, big-Ω bounds can be underestimates. For example, n⁵ = Ω(n²), and also n⁵ = Ω(n⁵).
Big-Θ notation
Big-Θ (Theta) notation means that the bound given is both big-O and big-Ω at the same time,
i.e. both an upper bound and a lower bound. Informally, f (n ) = Θ(g (n )) means that f (n ) is the
same order of magnitude as g (n ). Formally, we say that f (n ) = Θ(g (n )) as n → ∞ if
f (n ) = O (g (n )) and f (n ) = Ω(g (n )) as n → ∞,
or equivalently if
f (n ) = O (g (n )) and g (n ) = O (f (n )) as n → ∞.
Big-Θ notation implies that the bound given is precise, i.e. it is neither an overestimate nor an
underestimate. So n² + 3n + 5 = Θ(n²), but n² + 3n + 5 ≠ Θ(n³), and n² + 3n + 5 ≠ Θ(n).
Best-case complexity. The best-case complexity is the fewest instructions that the algorithm could possibly execute on any possible input, as a function of the input size. Formally,

t_best(n) = min_{I ∈ I(n)} T_A(I).
Do not fall into the common trap of thinking that the best-case complexity of an algorithm is
determined by its behaviour for small inputs (the best case is not “when the input is empty”).
Worst-case complexity. The worst-case complexity is likewise, the most instructions that the
algorithm could possibly execute on any possible input, as a function of the input size. Formally,
t_worst(n) = max_{I ∈ I(n)} T_A(I).
Average-case complexity. Average-case complexity is slightly more tricky to define, and in fact
there is not a universally agreed upon definition. The simplest definition, and the one that we
will use is that the average-case complexity is the average number of operations that the algo-
rithm will execute, averaged over all possible inputs of a given size (i.e., a uniformly random
probability distribution over all possible inputs of a given size), as a function of the input size.
Formally,
t_average(n) = E_{I ∈ I(n)}[T_A(I)].
This is typically much more difficult to compute than the other two.
Remark: In this unit we are mostly concerned about worst-case complexity, and when com-
plexity is mentioned without further specification you should assume we are talking about worst-
case complexity.
Space complexity vs. auxiliary space complexity. In this unit we differentiate between space
complexity and auxiliary space complexity. Space complexity is the total amount of space taken
by an algorithm as a function of input size. Auxiliary space complexity is the amount of space
taken by an algorithm excluding the space taken by the input.
An in-place algorithm is an algorithm that has O (1) auxiliary space complexity. Some authors
use the following definition instead: an algorithm is called in place if it does not store or copy
the input into any additional data structures, but modifies the input directly to produce the
output. More concretely, an algorithm is in place if it never stores more than O (1) elements of
the input outside of the input.
Expected complexity. Expected complexity tells us the average time that a randomised algo-
rithm will take over all possible random decisions that it could make. Note that this is distinct
from average-case complexity, which averages over all possible inputs. Unless stated otherwise,
we will always analyse expected worst-case performance, i.e. we will analyse the expected per-
formance over all possible random choices for the worst possible input. We could also define
best-case, and even average-case expected time complexity, but these are seldom used.
As before, let TA (I ) denote the running time of some algorithm A on some input I . Note that
TA (I ) is no longer a single value, but a probability distribution since the algorithm may take dif-
ferent times based on the outcomes of random choices. The expected (worst-case) time com-
plexity of A can then be defined as
max_{I ∈ I(n)} E[T_A(I)].
Notice how we still take the maximum value since we care about the worst-case behaviour. It is
important to not conflate worst-case behaviour (which depends on the input to the algorithm)
and unlucky random decisions.
High probability analyses. We will not cover high probability analysis in this course, but we’ll
define it here for completeness, since it is frequently used in the analysis of randomised algo-
rithms. We say that an algorithm takes O(f(n)) time with high probability if it takes O(c · f(n)) time with probability at least 1 − 1/n^c for all c ≥ 1. Note that this is much stronger than having
expected time complexity O (f (n )). That is, if an algorithm runs in O (f (n )) time with high prob-
ability, then its expected time complexity is at most O (f (n )) as well (and may even be lower).
You are likely already familiar with several simple sorting algorithms, such as:
• Bubble sort
• Selection sort
• Insertion sort
You probably also know some much faster O (n log(n )) sorting algorithms that we will deal with
later. In this chapter, we will analyse selection sort and insertion sort, compare the two, and
explore their fundamental underlying properties.
Let's try to understand what is going on here more logically. At some given value of i in the main loop of the algorithm, we have the two sublists array[1..i − 1] (the already sorted part) and array[i..n] (the yet-to-be-sorted part).
The key strategy behind selection sort simply says to progressively make i larger while main-
taining the following invariants.
For any given value of i in the main loop of selection sort, at the beginning of the itera-
tion, the following invariants hold:
1. array[1..i − 1] is sorted
2. For any x in array[1..i − 1] and y in array[i ..n ], x ≤ y
Initially, when i = 1, the sorted part of the array array[1..0] is empty, so the invariants trivially
hold. Note that we will often use the notation array[1..0] (or similarly array[n + 1..n ]) to mean
an empty array for convenience.
At each iteration of the main loop, we seek the minimum array[j] for i ≤ j ≤ n . Since at the
beginning of iteration i, it is true that array[1..i − 1] ≤ array[i ..n ], the value of array[min] is no
smaller than any x ∈ array[1..i −1], and hence the extended sorted part array[1..i ] remains sorted
after swapping array[i ] and array[min], so invariant 1 is maintained.
Since we are selecting array[min] as a minimum of array[i ..n ], it must be true that after swap-
ping array[i] and array[min], that array[min] ≤ array[i +1..n ], so invariant 2 is also maintained.
Termination
When the i loop terminates, the invariant must still hold, and since it was maintained in itera-
tion i = n , this implies that array[1..n ] is sorted, i.e. the entire list is sorted, and we are done.
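The pseudocode for selection sort (Algorithm 2) is not reproduced in this extract, so the following minimal Python sketch shows one implementation consistent with the invariants above; the 0-indexing is an adaptation of the notes' 1-indexed pseudocode.

def selection_sort(array):
    """In-place selection sort.

    At the start of iteration i, array[:i] is sorted and every element
    of array[:i] is <= every element of array[i:].
    """
    n = len(array)
    for i in range(n):
        # find the index of a minimum element of the unsorted suffix
        minimum = i
        for j in range(i + 1, n):
            if array[j] < array[minimum]:
                minimum = j
        # extend the sorted prefix by one element
        array[i], array[minimum] = array[minimum], array[i]
    return array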
Selection sort always does a complete scan of the remainder of the array on each iteration. It is
therefore not hard to see that the number of operations performed by selection sort is indepen-
dent of the contents of the array. Therefore the best, average and worst-case time complexities
are all exactly the same.
Proof
Selection sort performs (n − 1) + (n − 2) + ... + 1 = n(n − 1)/2 = O(n²) comparisons, and hence O(n²) total operations.
We also only require one extra variable and some loop counters, so the amount of extra space
required for selection sort is constant.
Selection sort was based on the idea that we maintain two invariants with respect to the sorted
and yet-to-be-sorted sublists. We required that the first sublist be sorted, and that the second
sublist contain elements greater or equal to those in the sorted sublist. Insertion sort maintains
a weaker invariant shown below. We call the invariant weaker since it maintains a subset of the
invariants that selection sort did.
For any given value of i in the main loop of insertion sort, at the beginning of the itera-
tion, the following invariant holds:
1. array[1..i − 1] is sorted
Let’s prove that this invariant holds and is sufficient to prove that insertion sort is correct.
Initial state of the invariant
Again, the sorted sublist is empty at the beginning of the main loop, so the invariant holds.
Maintenance of the invariant
At the beginning of iteration i, the sublist array[1..i − 1] is sorted. The inner while loop of in-
sertion sort swaps the item array[j] one place to the left until array[ j − 1] ≤ array[ j ]. Since
array[1..i − 1] was originally sorted, it is therefore true that after the termination of the while
loop, array[1.. j ] is sorted. Since the sublist array[ j + 1..i ] was originally sorted, and for each
x ∈ array[ j + 1..i ], we had array[ j ] < x, it is also true that array[ j ..i ] is sorted. Combining these
observations, we can conclude that array[1..i ] is now sorted, and hence the invariant has been
maintained.
Termination
At each iteration of the inner while loop of insertion sort, j is decremented, therefore either the
loop condition j ≥ 2 must eventually be false, or it will be the case that array[ j − 1] ≤ array[ j ].
Since the loop invariant was maintained when i = n , it is true that array[1..n ] is sorted, and we
are done.
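For reference, here is a minimal Python sketch of insertion sort in the same swap-based style as the notes' Algorithm 3 (which is not reproduced in this extract); the 0-indexing is again an adaptation.

def insertion_sort(array):
    """In-place insertion sort.

    Maintains the invariant that array[:i] is sorted at the start of
    iteration i, by swapping array[i] leftwards into position.
    """
    for i in range(1, len(array)):
        j = i
        # swap array[j] left until it is no smaller than its left neighbour
        while j >= 1 and array[j - 1] > array[j]:
            array[j - 1], array[j] = array[j], array[j - 1]
            j -= 1
    return array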
The worst case for insertion sort occurs when the input is sorted in reverse order. Notice that in this case, the amount of work done is the same as for selection sort, since we perform i − 1 operations for each i from 2 to n. Therefore, in the worst case, insertion sort takes O(n²) time.
For a random list, on average we will have to swap array[j] with half of the preceding elements, meaning that we need to perform roughly half the number of operations of the worst case. This means that in the average case, insertion sort still takes O(n²) time². Finally, just like selection sort, the amount of extra space required is constant, since we keep only one extra variable and a loop counter.
Stable Sorting
A sorting algorithm is said to be stable if the relative ordering of elements that compare equal is
maintained by the algorithm. What does this mean? Say for example that we are going to sort a
list of people's names by their length. If multiple names have the same length, a stable sorting algorithm guarantees that, within each class of names of the same length, their original ordering is preserved. For example, if we want to sort the names Bill, Bob, Jane, Peter, Kate by their
length, the names Bill, Jane, Kate all share the same length. A stable sorting algorithm will
produce the following sorted list
Bob, Bill, Jane, Kate, Peter
noticing that Bill, Jane, Kate are relatively in the same order as the initial list. An unstable
sorting algorithm may have shuffled the order of Bill, Jane, Kate while still producing a list
of names that is sorted by length. Insertion sort is an example of a stable sorting algorithm.
Selection sort on the other hand is not a stable sorting algorithm.
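Python's built-in sort is guaranteed to be stable, so the example from the text can be reproduced directly:

names = ["Bill", "Bob", "Jane", "Peter", "Kate"]
print(sorted(names, key=len))
# ['Bob', 'Bill', 'Jane', 'Kate', 'Peter'] -- Bill, Jane and Kate keep their relative order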
Online Algorithms
An algorithm (sorting or otherwise) is called online if it can process its input sequentially with-
out knowing it all in advance, as opposed to an offline algorithm which must know the entire
input in advance. Insertion sort is an example of an online sorting algorithm. If you only know
some of the input, you can begin to run insertion sort, and continue to read in the rest of the
input while continuing to sort, and the algorithm will still work. On the other hand, selection
sort is not an online algorithm, since if any new elements are added to the input, the algorithm
would have to start all over.
2 This is not a rigorous argument and hence not a formal proof. Formally proving average-case complexities is actu-
ally quite tricky, but we don’t usually care about average-case complexity, so we won’t delve into the details for now.
Chapter 2
Divide and Conquer
In this chapter we will present the divide-and-conquer algorithm design paradigm and give a
few examples of algorithms that use this paradigm.
When we learned that algorithm, our teachers almost certainly did not talk about its efficiency or correctness; back then, we were simply users of the algorithm. But,
as IT professionals, understanding the efficiency and correctness of algorithms is a central skill
in our careers.
Considering addition, subtraction and multiplication of single digit numbers as the basic oper-
ations, for the multiplication of two n -digit numbers x and y , this algorithm clearly has com-
plexity O (n 2 ). And this brings up one of the most important questions in algorithm design: Can
we solve the problem more efficiently?
In the quest for a sub-quadratic multiplication algorithm, a first improvement idea is the following divide-and-conquer strategy:
• Split each number into its n/2 most significant digits and its n/2 least significant digits.
• Let xM and yM denote the n /2 most significant digits of x and y , respectively.
• Let x L and yL denote the n /2 least significant digits of x and y , respectively.
• Do recursive calls with xM , x L , yM , yL and use the results to obtain x · y .
From math we know that

x · y = xM · yM · 10^n + (xM · yL + xL · yM) · 10^(n/2) + xL · yL.

So, we can reduce one instance of the problem of size n to four instances of size n/2. Note that
multiplications by powers of 10 are trivial left shifts. Are we making some progress?
Not really, if you solve the recurrence T (n ) ≤ 4T (n /2) + c · n , you will verify that the complexity
is still O (n 2 ).
Therefore, if we want to follow this approach to improve the efficiency, we should use at most
three recursive calls to subproblems of size n /2.
For a long time, no sub-quadratic multiplication algorithm was known, and eventually Russian
mathematician Andrey Kolmogorov (one of the greatest mathematicians of the 20th century,
with great contributions to areas such as probability theory and computational complexity)
conjectured that sub-quadratic multiplication was impossible.
But then, Anatoly Karatsuba (at the time a 23-year-old student) heard the conjecture in one of Kolmogorov's seminar presentations and within one week came up with the first sub-quadratic
multiplication algorithm! His solution used the approach described above, but doing only three
recursive calls.
Coming back to the equations, the problem in question is to compute the middle term z = xM · yL + xL · yM using only one additional multiplication. The key observation is that

(xM + xL) · (yM + yL) = xM · yM + xM · yL + xL · yM + xL · yL
                      = z + xM · yM + xL · yL.
Therefore
z = (xM + x L ) · (yM + yL ) − xM · yM − x L · yL
and to obtain the desired term z we just need to subtract the results of the 1st and 2nd recursive
calls from the result of the 3rd recursive call!
Where does this trick come from? This trick traces back to the method developed in the 19th
century by the German mathematician Carl Friedrich Gauss (one of the greatest mathemati-
cians in history, if not the greatest) to multiply complex numbers using three multiplications
of real numbers instead of four. Adapting previous ideas to solve your new problem can be very
useful!
Note that all the techniques would work the same if binary representation is used instead of
decimals (or any other base for that matter).
By solving the recurrence T(n) ≤ 3T(n/2) + c · n it can be verified that the time complexity of Karatsuba's algorithm is O(n^(log₂ 3)), where log₂ 3 < 1.59.
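The following Python sketch illustrates the three-recursive-call idea described above. It is not the notes' Algorithm 4 verbatim; splitting on decimal digits and the single-digit base case are implementation assumptions.

def karatsuba(x, y):
    """Multiply two non-negative integers using three recursive calls."""
    if x < 10 or y < 10:                 # single digit: multiply directly
        return x * y
    n = max(len(str(x)), len(str(y)))
    power = 10 ** (n // 2)
    x_m, x_l = divmod(x, power)          # most / least significant halves
    y_m, y_l = divmod(y, power)
    a = karatsuba(x_m, y_m)              # 1st recursive call
    b = karatsuba(x_l, y_l)              # 2nd recursive call
    c = karatsuba(x_m + x_l, y_m + y_l)  # 3rd recursive call
    z = c - a - b                        # the middle term x_m*y_l + x_l*y_m
    return a * power * power + z * power + b

For example, karatsuba(1234, 5678) returns 7006652, the same value as 1234 * 5678.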
This is the algorithm used in Python to multiply large numbers. Don't underestimate the difference between n² (blue line in Figure 2.2) and n^1.59 (green line)!
Algorithm 5 Merge
1: function MERGE(A[i ..n1 ], B[ j ..n2 ])
2: result = empty array
3: while i ≤ n1 or j ≤ n2 do
4: if j > n2 or (i ≤ n1 and A[i ] ≤ B[ j ]) then
5: result.append(A[i ])
6: i += 1
7: else
8: result.append(B[ j ])
9: j += 1
10: return result
our merge sort implementation will also be stable. Using this merge routine, an implementa-
tion of merge sort can be expressed like this. The recursive merge sort pseudocode is shown in
Algorithm 6.
Merge sort repeatedly splits the input sequence in half until it can no longer do so any more.
The number of times that this can occur is log2 (n ), so the depth of the recurrence is O (log(n )).
Merging n elements takes O (n ) time, so doing this at each level of recursion results in a total of
O (n log(n )) time. This happens regardless of the input, hence the best, average and worst-case
performances for merge sort are equal. Although the MERGE_SORT routine itself uses no extra
space, the MERGE routine uses O (n ) extra space, hence the auxiliary space complexity of merge
sort is O (n ).
Merge sort can also be implemented bottom-up, in which we eliminate the recursion altogether and simply iteratively merge together the sub-arrays of size 1, 2, 4, 8, ..., n. This still requires O(n) auxiliary memory, but eliminates the use of the program stack (for recursion) and hence should be slightly faster in practice. There are also in-place implementations of merge sort that require only O(1) auxiliary space, although they are considerably more involved.
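As an illustration of the bottom-up variant just described, here is a minimal Python sketch (an illustrative rendering, not the notes' Algorithm 6):

def merge_sort_bottom_up(array):
    """Iterative merge sort: merge adjacent runs of size 1, 2, 4, ..."""
    n = len(array)
    width = 1
    while width < n:
        merged = []
        for start in range(0, n, 2 * width):
            left = array[start:start + width]
            right = array[start + width:start + 2 * width]
            i = j = 0
            # stable merge of two sorted runs (ties favour `left`)
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:
                    merged.append(left[i]); i += 1
                else:
                    merged.append(right[j]); j += 1
            merged.extend(left[i:])
            merged.extend(right[j:])
        array = merged
        width *= 2
    return array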
1 See Nicholas Pippenger, Sorting and Selecting in Rounds, SIAM Journal on Computing 1987.
Algorithm 7 Sort-and-CountInv
1: function SORT-AND-COUNTINV(array[lo..hi])
2: if lo = hi then
3: return (array[lo], 0)
4: else
5: mid = ⌊(lo + hi)/2⌋
6: (array[lo..mid], InvL ) = SORT-AND-COUNTINV(array[lo..mid])
7: (array[mid + 1..hi], InvH ) = SORT-AND-COUNTINV(array[mid + 1..hi])
8: (array[lo..hi], InvS ) = MERGE-AND-COUNTSPLITINV(array[lo..mid], array[mid + 1..hi])
9: Inv = InvL + InvH + InvS
10: return (array[lo..hi], Inv)
Algorithm 8 Merge-and-CountSplitInv
1: function MERGE-AND-COUNTSPLITINV(A[i ..n1 ], B[ j ..n2 ])
2: result = empty array
3: splitInversions = 0
4: while i ≤ n1 or j ≤ n2 do
5: if j > n2 or (i ≤ n1 and A[i ] ≤ B[ j ]) then
6: result.append(A[i ])
7: i += 1
8: else
9: result.append(B[ j ])
10: j += 1
11: splitInversions = splitInversions + n1 - i + 1
12: return (result, splitInversions)
Chapter 3
Fast Sorting Algorithms
Sorting is a fundamental and interesting problem in computer science. Many sorting algo-
rithms exist, each having their own advantages and disadvantages. Sorting algorithms like se-
lection sort and insertion sort - analysed in Chapter 1 - take O (n 2 ) time but are very easy to
implement and run very fast for very small input sizes. Faster sorting algorithms exist which only take O(n log(n)) time, but which may be more difficult to implement and may run slower for small lists. Merge sort was revised in Chapter 2, and we will now analyse some other algo-
rithms and also explore some interesting theoretical results regarding sorting. We will also take
a look at how in many cases, integers can be sorted in just O (n ) time.
Making use of the heap data structure, one can easily piece together the heapsort algorithm,
which simply involves converting the given sequence into a max heap (heapifying it) and then
successively removing the maximum element and placing it at the back of the array.
Algorithm 9 Heapsort
1: function HEAPSORT(array[1..n ])
2: HEAPIFY (array[1..n ])
3: for i = n to 1 do
4: array[i ] = EXTRACT_MAX(array[1..i ])
That’s it! Of course you’ll need to recall how to perform the HEAPIFY and EXTRACT_MAX opera-
tions. They are provided below for convenience.
Algorithm 10 Heapify
1: function HEAPIFY(array[1..n ])
2: for i = n /2 to 1 do
3: FALL (array[1..n ], i )
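The supporting heap operations EXTRACT_MAX and FALL (Algorithms 11-14) are not reproduced in this extract, so the following self-contained Python sketch shows one way the whole procedure could fit together; the 0-indexed array layout and the inlined sift-down helper are implementation assumptions.

def heapsort(array):
    """In-place heapsort: build a max heap, then repeatedly move the
    maximum to the end of the shrinking heap prefix."""
    n = len(array)

    def fall(heap_size, i):
        # sift array[i] down until the max-heap property holds below it
        while True:
            left, right, largest = 2 * i + 1, 2 * i + 2, i
            if left < heap_size and array[left] > array[largest]:
                largest = left
            if right < heap_size and array[right] > array[largest]:
                largest = right
            if largest == i:
                return
            array[i], array[largest] = array[largest], array[i]
            i = largest

    # heapify: sift down every internal node, from the last one to the root
    for i in range(n // 2 - 1, -1, -1):
        fall(n, i)
    # repeatedly extract the maximum and place it at the back
    for end in range(n - 1, 0, -1):
        array[0], array[end] = array[end], array[0]
        fall(end, 0)
    return array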
3.2 Quicksort
Quicksort is perhaps the most well-known divide-and-conquer algorithm for sorting. Quicksort
follows these key ideas to sort an array.
• Select some element of the array, which we will call the pivot.
• Partition the array so that all items less than the pivot are to its left, all elements
equal to the pivot are in the middle, and all elements greater than the pivot are to
its right.
• Quicksort the left part of the array (elements less than the pivot).
• Quicksort the right part of the array (elements greater than the pivot).
The key ideas are very simple. The two major components to correctly (and efficiently) implementing Quicksort are:
• the partitioning algorithm (how does the algorithm move elements lesser/greater than the pivot to the left/right efficiently?), and
• the pivot selection rule (which element of the array should be chosen as the pivot?).
Naive partitioning takes O (n ) time and is stable, which are both nice properties. However, naive
partitioning is not in-place since it stores all n elements in new arrays, using O (n ) space. Cre-
ating extra arrays means that the algorithm will be slower in practice than alternatives.
An alternative partitioning scheme that accounts for this is the so called Dutch National Flag
(DNF) scheme proposed by Edsger Dijkstra. The DNF scheme partitions the array into three
sections < p , = p , and > p in-place. Only the < and > subarrays then need to be recursed on.
Dijkstra described the partitioning scheme by phrasing it in terms of the following problem.
Problem Statement
Given an array of n objects (array[1..n ]) coloured red, white or blue, sort them so that
all objects of the same colour are together, with the colours in the order red, white and
blue.
The connection to the Quicksort partitioning problem should be quite clear: The red items cor-
respond to elements less than the pivot, the white items correspond to elements equal to the
pivot, and the blue items correspond to elements greater than the pivot. To solve the DNF prob-
lem, we keep three pointers, lo, mid, and hi, such that the following invariant is maintained: array[1..lo − 1] contains only red items, array[lo..mid − 1] contains only white items, array[mid..hi] is the as-yet unknown section, and array[hi + 1..n] contains only blue items.
Initially, we set lo = 1, mid = 1, hi = n . This means that initially the entire array is the unknown
section. We then iteratively move items from the unknown section into their corresponding sec-
tions while updating the pointers lo, mid, hi. At each iteration, we move the item at array[mid]
(i.e. the first unknown item). There are three cases to consider:
If array[mid] is red, we’d like to move it into the first section. So, we will swap array[mid] with
array[lo]. This means that the red section is extended by one element, so we increment the lo
pointer. The item originally at array[lo] was either white, in which case it has now been moved to the end of the white section, or the white section was empty, in which case we simply swapped array[lo] with itself; in either case we also increment mid. In both situations, the invariant is maintained, and the unknown section has shrunk in size by one element.
If array[mid] is white, then it is already where we want it to be (right at the end of white section)
so we simply have to increment the mid pointer. This also shrinks the unknown section by one
element.
If array[mid] is blue, we want to move it into the final section, so we will swap array[mid] with array[hi]. The blue section is thereby extended, so we decrement the hi pointer. We do not move the mid pointer in this case, since array[mid] now contains what was previously the final element of the unknown section.
Note that we return the final positions of the lo and mid pointers so that we can quickly identify the locations of the three sections after the partitioning. Like Hoare partitioning, this partitioning process is unfortunately not stable.
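The description above translates quite directly into code. The following Python sketch is one possible rendering of the DNF partitioning scheme (Algorithm 17 itself is not reproduced in this extract); the 0-indexing and the returned boundary convention are assumptions.

def dnf_partition(array, pivot):
    """Dutch national flag partition (in-place, not stable).

    Rearranges array into  < pivot | == pivot | > pivot  and returns the
    boundaries of the middle section, mirroring the three cases above.
    """
    lo, mid, hi = 0, 0, len(array) - 1
    while mid <= hi:                    # while the unknown section is non-empty
        if array[mid] < pivot:          # "red": extend the < section
            array[lo], array[mid] = array[mid], array[lo]
            lo += 1
            mid += 1
        elif array[mid] == pivot:       # "white": already in place
            mid += 1
        else:                           # "blue": swap to the back, shrink hi
            array[mid], array[hi] = array[hi], array[mid]
            hi -= 1
    return lo, mid - 1                  # the == pivot section is array[lo..mid-1]

Quicksort would then recurse on the sections strictly before lo and strictly after mid − 1, skipping the middle section of elements equal to the pivot.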
Implementing Quicksort
We now have the tools to implement Quicksort. Assume for now that we select the first element
in the range [lo..hi] to be our pivot. We will explore more complicated pivot selection rules later.
Using our partitioning functions and the fact that they return the resulting location of the pivot,
Quicksort can be implemented as shown in Algorithm 18.
Algorithm 18 Quicksort
1: function QUICKSORT(array[lo..hi])
2: if hi > lo then
3: pivot = array[lo]
4: mid = PARTITION(array[lo..hi], pivot)
5: QUICKSORT (array[lo..mid − 1])
6: QUICKSORT (array[mid + 1..hi])
Best-case partition
In the ideal case, the pivot turns out to be the median of the array. Recall that the median is the
element such that half of the array is less than it and half of the array is greater than it. This will
result in us only having log2 (n ) levels of recursion since the size of the array would be halved at
each level. At each level, the partitioning takes O (n ) total time, and hence the total runtime in
the best case for Quicksort is O (n log(n )).
Worst-case partition
In the worst case, the pivot element might be the smallest or largest element of the array. Sup-
pose we select the smallest element of the array as our pivot, then the size of the subproblems
solved recursively will be 0 and n − 1. The size 0 problem requires no effort, but we will have
only reduced the size of our remaining subproblem by one. We will therefore have to endure
O (n ) levels of recursion, on each of which we will perform O (n ) work to do the partition. In
total, this yields a worst-case run time of O (n 2 ) for Quicksort.
On average, the partitions obtained by an arbitrary pivot will be neither a perfect median nor
the smallest or largest element. This means that on average, the splits should be “okay.” Our
intuition should tell us that in the average case, Quicksort will take a O (n log(n )) running time,
since selecting a terrible pivot at every single level of recursion is extremely unlikely. Let’s prove
that this intuition is correct.
We will present two proofs of this fact. First, an intuitive proof based on coin flips and proba-
bility, then a more rigorous proof using recurrence relations.
Consider an arbitrary array of n elements. Ideally, we would like to select a pivot that
partitions the array fairly evenly. Let’s define a “good pivot” to be one that lies in the
middle 50% of the array, i.e. a good pivot is one such that at least 25% of the array is less
than it, and 25% of the array is greater than it.
Suppose we are lucky and we select a good pivot every single time. The worst thing
that can happen is that the pivot might be on the edge of the good range, i.e. we will
select a pivot right on the 25th or 75th percentile. In this case, our recursive calls will
have arrays of size 0.25n and 0.75n to deal with. The longest branch of the recursion tree will therefore consist of the latter calls. The number of elements left after d levels of recursion will be given by 0.75^d × n = (3/4)^d × n, and hence the depth of the tree will be

(3/4)^d × n = 1  ⟹  d = log_{4/3}(n).

Since log_{4/3}(n) = O(log(n)), the tree has logarithmic height. At each level of recursion we perform O(n) work for the partitioning, and hence the time taken will be O(n log(n)).
This should not be surprising, as we already know that the best-case performance for
Quicksort is O (n log(n )).
We know that we can not expect to select a good pivot every time, but since the good
pivots make up 50% of the array, we have a 50% probability of selecting a good pivot.
Therefore we should expect that on average, every second pivot that we select will be
good. In other words, we expect to need two flips of an unbiased coin before seeing
heads. This means that even if every other pivot is bad and barely improves the sub-
problem size, we expect that after 2 × log_{4/3}(n) levels of recursion we will hit the base case. This means that the expected amount of work to Quicksort a random array is just twice the amount of work required in the best case, which was O(n log(n)). Double O(n log(n)) is still O(n log(n)), hence from this we can conclude that the average-case complexity of Quicksort is O(n log(n)).
Let us denote the time taken to Quicksort an array of size n by T (n ). If the k ’th smallest
element is selected as the pivot, then the running time T (n ) will be given by
T (n ) = n + 1 + T (k − 1) + T (n − k ),
where the terms T (k − 1) and T (n − k ) correspond to the time taken to perform the re-
cursive calls after the pivot operation. Now observe that over all possible inputs, there
are equally many that partition the input into any given sizes k −1 and n −k . Given this,
the average running time over all possible inputs is equal to the average over all possible
partitions, and hence
E[T(n)] = n + 1 + (1/n) ∑_{k=1}^{n} ( E[T(k − 1)] + E[T(n − k)] ),

where the expectation is taken over all possible inputs. Observe that each term E[T(j)] for j = 0, 1, ..., n − 1 appears exactly twice across the sum (once as E[T(k − 1)] and once as E[T(n − k)]), so this can be rewritten as

E[T(n)] = n + 1 + (2/n) ∑_{k=0}^{n−1} E[T(k)].
Now substitute n − 1 for n and subtract the resulting equation from the above; after simplifying, the resulting recurrence can be unrolled all the way down to the base case T(0) = 0. Finally, using the fact that

∑_{k=1}^{n} 2/(k + 1) = O(log(n)),

we can conclude that E[T(n)] = O(n log(n)), which is the claimed average-case complexity.
Space complexity
In the worst case, due to the linear number of recursive calls, we would expect that the auxiliary
space required by Quicksort would be O(n). This is true, but can actually be improved by making a very simple optimisation. Instead of recursively sorting the left half of the array and then the right half of the array, we can always sort the smaller half first, followed by the larger half. By sorting the smaller half first, we encounter only O(log(n)) levels of recursion on the program
stack, while the larger half of the partition is sorted by a tail-recursive call, which adds no over-
head to the program stack. If you work in a language without tail-call optimisation, you can
replace the second recursion with a loop to achieve the same effect. The complexities written
below assume that such an optimisation is made.
For the strict O (1) space definition of in-place, Quicksort is not in-place since it requires at least
O (log(n )) space for recursion. Using the other definition, if we use one of the in-place parti-
tioning schemes (Hoare or DNF) then Quicksort will also be in-place, but not stable. If we use
a stable partitioning scheme (such as the naive partitioning scheme), then Quicksort will be
stable, but not in-place.
Selection sort and insertion sort take O (n 2 ) time, which is fine for very small lists, but completely
impractical when n grows large. Faster sorting algorithms such as merge sort, heapsort and
Quicksort run in O (n log(n )) time, which is asymptotically a significant improvement. The next
question on your mind should now be, can we do even better than this? An interesting result
actually shows us that if we work in the comparison model, that is to say that the only valid operations that we can perform on elements are comparisons (<, >, ≥, ≤, =, ≠), then sorting cannot possibly be done faster than O(n log(n)).
Sorting in the comparison model takes Ω(n log(n )) in the worst case. In other words,
any comparison-based sorting algorithm can not run faster than O (n log(n )) in the worst
case. Recall that Ω notation denotes an asymptotic lower bound, i.e. the algorithm must
take at least this long.
To prove the lower bound, we consider a decision tree that models the knowledge we have about
the order of a particular sequence of data after comparing individual elements. A decision tree
that represents the sorting process of a sequence of three elements in shown below. Each node
corresponds to a set of potential sorted orders based on the comparisons performed so far. The
leaf nodes of the tree correspond to states where we have enough information to know the fully
sorted order of the sequence. An example of such a tree is shown in Figure 3.1.
Figure 3.1: A decision tree for comparison-based sorting. Taken from Weiss, Data Structures
and Algorithm Analysis in Java, 3rd ed, page 303.
We appeal to the structure of the decision tree depicted above and observe that com-
paring elements until we deduce the fully sorted order of a sequence is equivalent to
traversing the tree until reaching a leaf node. The worst-case behaviour of any compari-
son based sorting algorithm therefore corresponds to the depth of the deepest leaf in the
tree, i.e. the height of the tree. There are n ! total possible sorted orders for a sequence
of n elements, each of which must appear as a leaf in the tree. Let’s denote the height
of the tree by h and observe that a binary tree of height h cannot have more than 2^h leaves. Putting this together, we have that

n! ≤ 2^h,

and hence, taking logarithms of both sides,

h ≥ log₂(n!).
To get a lower bound, let’s use log laws to rewrite the quantity in question as
log2 (n !) = log2 (1 × 2 × 3 × ... × n ) = log2 (1) + log2 (2) + log2 (3) + ... + log2 (n ).
Considering just the second half of the terms, since the log function is increasing, we
can write

log₂(n!) ≥ log₂(n/2) + log₂(n/2 + 1) + ... + log₂(n)
         ≥ (n/2) log₂(n/2)
         = (n/2)(log₂(n) − 1)
         = Ω(n log₂(n)),

which establishes the desired lower bound. Hence the height of the tree is at least Ω(n log(n)), which shows that sorting in the comparison model takes Ω(n log(n)) time.
As a simple but illustrative example, suppose that we want to sort an array that consists only
of 0s and 1s. This can be solved quite easily in linear time by simply counting the number of
0s and 1s in the array and then writing a new array consisting of that many 0s followed by that
many 1s . Let’s see how this idea can be generalised to arrays of more than just 0s and 1s.
The sorted position of each element can be determined from these counts via their prefix sums, which can be computed in linear time in the size of the universe. Each time we place an element y into its correct position, we increment position[y] to prepare for the next occurrence. This algorithm is called counting sort, and is shown below in Algorithm 19. Note that counting sort
can be similarly applied in other domains, e.g. alphabet characters, integers in the range from
−n /2 to n /2, and so on.
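Since Algorithm 19 is not reproduced in this extract, the following Python sketch shows one way the counting sort just described can be written; the assumption that keys lie in the range 0..u − 1 mirrors the universe described above.

def counting_sort(array, u):
    """Stable counting sort for integers in the range 0..u-1."""
    count = [0] * u
    for x in array:                     # first pass: count occurrences
        count[x] += 1
    position = [0] * u                  # position[v] = first index of value v
    for v in range(1, u):
        position[v] = position[v - 1] + count[v - 1]
    temp = [None] * len(array)
    for x in array:                     # second pass: place elements stably
        temp[position[x]] = x
        position[x] += 1
    return temp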
This sort is stable but is not in place since we construct the answer in a separate temp array.
What is the complexity of counting sort? We are required to create and maintain the counter
array of size u , and we do two passes over the input. This yields a time and space complexity of
O (n + u ) in all cases.
This tells us that counting sort takes linear time in n if and only if u = O (n ). An alternative
way to analyse integer sorting algorithms is to consider the width of the integers, rather than
the maximum value u . The width of an integer is the number of bits required to represent it in
binary. If we are sorting w-bit integers, then our universe size is u = 2^w − 1. The time complexity of counting sort on w-bit integers is therefore O(n + 2^w), which is linear in n if and only if w = log(n) + O(1).
We will improve on this in the next section with an algorithm that can sort integers with a greater
number of bits.
by sorting an array of elements one digit at a time, from the least significant to the most signifi-
cant. Each digit must be sorted in a stable manner in order to maintain the relative ordering of
the previously sorted digits.
• Sort the array one digit at a time, from least significant (rightmost) to most signif-
icant (leftmost).
• For each digit: Sort using a stable sorting algorithm.
Figure 3.2: An illustration of LSD radix sort. First the numbers are sorted by their least
significant digit, then their second, etc. until the most significant digit is sorted and we are
finished.
Since individual digits have a very small number of possible values, we will sort the individual
digits using counting sort. A visualisation of an LSD radix sort in base-10 is shown in Figure 3.2.
It is important to realise that we do not necessarily have to use their decimal digits to sort the
elements of the array. We could have instead done radix sort with a radix (base) of 100, which
would mean sorting consecutive groups of two decimal digits at a time. Using a larger base
means fewer passes over the array must be made, but that more work must be done on each
pass. It is therefore a tradeoff to find the best base for a given input.
An implementation of LSD radix sort is shown in Algorithm 20. Note that the RADIX_PASS function is simply performing counting sort on a particular digit of the input. The function call GET_DIGIT(x, b, d) computes the value of the d-th digit in base b of the integer x.
Let’s suppose that we are sorting n integers that have k digits in base-b . Radix sort will perform
k passes over the array, one for each digit, and perform a counting sort each time. Since we
are interpreting the integers in base-b , the counting sort will sort over a universe of size b , and
hence have complexity O (n +b ). The total time complexity of radix sort is therefore O (k (n +b )),
and the space complexity is just that required of counting sort, so O (n + b ).
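Algorithm 20 is likewise not reproduced in this extract, so the following Python sketch shows one possible LSD radix sort built from stable counting-sort passes; the default base of 10 and the non-negative-integer assumption are illustrative choices.

def radix_sort(array, base=10):
    """LSD radix sort for non-negative integers: one stable counting-sort
    pass per base-`base` digit, least significant digit first."""
    if not array:
        return array
    result = list(array)
    largest = max(result)
    divisor = 1
    while divisor <= largest:           # one pass per digit of the largest value
        count = [0] * base
        for x in result:
            count[(x // divisor) % base] += 1
        position = [0] * base           # starting index of each digit value
        for v in range(1, base):
            position[v] = position[v - 1] + count[v - 1]
        temp = [None] * len(result)
        for x in result:                # stable placement preserves earlier passes
            digit = (x // divisor) % base
            temp[position[digit]] = x
            position[digit] += 1
        result = temp
        divisor *= base
    return result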
The implementation of radix sort described above is stable but not in place since we use a tem-
porary array for counting sort. Alternate implementations of radix sort can be written that are
in place but not stable. In fact, it is even possible to write a stable, in-place radix sort, but it is
extremely complicated and well beyond the scope of this course1 .
So how does radix sort compare to counting sort? Recall that counting sort can sort a sequence of n integers of maximum value O(n) in linear time. An important fact to notice is that for radix sort, the number of digits and the base are not independent: they are related. A larger base will mean fewer digits to sort. Suppose that we wish to sort integers that are w bits wide, i.e. they have maximum value 2^w. Then k, the number of digits, is related to b, the base, by

k = w / log(b).

The total time complexity of radix sort in terms of the width w is therefore

O( (w / log(b)) · (n + b) ).
1 See G Franceschini, S Muthukrishnan, M Pǎtraşcu, Radix Sorting with No Extra Space, ESA 2007
Comparing this to counting sort, let’s say we choose a constant base b , and wish to sort integers
of maximum value O (n ), i.e. integers with log(n )+O (1) bits. Then from the above complexity, we
see that radix sort will take O (n log(n )) time, since b is a constant, which is worse than counting
sort! In order to make radix sort faster, we must therefore choose the base b more cleverly.
Notice that we can make radix sort perform fewer loop iterations by lowering the number of
digits, which can be achieved by increasing the base. Of course, increasing the base too much
would make the counting sort subroutine too slow, so we need to find a balance point. Since the
counting sort subroutine takes O (n + b ) time, the largest base that we can pick without slowing
it down is b = O(n). If we pick base n, we then find that the complexity of radix sort will be

O( (w / log(n)) · (n + n) ) = O( n · w / log(n) ).
We can now see that for integers with at most w = O (log(n )) bits, the complexity of radix sort will
be O(n), which is an improvement over counting sort! In particular, notice that integers with c·log(n) bits have maximum size O(2^(c·log(n))) = O(n^c), which is better than counting sort, which can only handle integers of size O(n). The space complexity will still be the space complexity
of counting sort, which is O (n + u ), where u = n , hence it will be O (n ).
Summary: radix sort with base n runs in O(n · w / log(n)) time and uses O(n) auxiliary space in all cases.
Fun (Non-examinable) Fact: We showed here two algorithms that sort integers in linear time provided that they have a bounded number of bits. Surprisingly, linear-time sorting is not only possible for integers with a bounded width w. For integers with very large width, specifically those that have at least w = Ω(log^(2+ε)(n)) bits for some ε > 0, an algorithm called signature sort can sort them in linear time. It is an open research problem whether integers of any width w can be sorted in linear time. The best known algorithms can sort n integers of any width w in O(n·√(log(log(n)))) time randomised, or O(n·log(log(n))) time deterministic in the worst case².
2 If interested, you should watch this lecture from MIT OpenCourseware, https://youtu.be/pOKy3RZbSws
Chapter 4
Order Statistics and Selection
Given a sequence of numbers, many statistical applications are interested in computing the
minimum, maximum or median element of the sequence. These are all special cases of a gen-
eral problem called order statistics, which asks for the k th smallest number in the sequence.
In this chapter we will consider several different cases of this problem and explore some algo-
rithms for solving them.
It is not hard to convince yourself that we cannot determine the minimum with any fewer than n − 1 comparisons, since each comparison can only rule out one element as not being the minimum/maximum.
A more interesting problem arises if we consider the case of finding both the minimum and
maximum element at the same time. Clearly, our first algorithm for finding the minimum in
n − 1 comparisons could easily be adapted to find both the minimum and maximum in 2n − 2
comparisons, but can we do better? The answer is no longer obvious since we are doing more
comparisons than necessary to find either one of the minimum or maximum on its own. It turns
out that we can in fact select both the minimum and maximum element of a sequence in just
approximately 1.5n comparisons. The clever trick is to consider each pair of numbers in the
input, and for each pair, to only compare the higher of the two with the current maximum, and
the lower of the two with the current minimum. This results in cutting out a quarter of the comparisons,
which were unnecessary. The implementation shown in Algorithm 22 demonstrates this trick.
Interestingly, it is possible to prove that 1.5n is the lower bound on the number of comparisons
that are required in the worst case to select the minimum and maximum element of a sequence,
so this algorithm is in fact performing the optimal number of comparisons.
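To make the pairing trick concrete, here is a minimal Python sketch of the idea (the function name and structure are ours, not Algorithm 22 itself); it performs roughly three comparisons per pair of elements, i.e. about 1.5n in total.

    def min_and_max(a):
        # Return (minimum, maximum) of a non-empty list using about 1.5n comparisons.
        n = len(a)
        if n % 2 == 0:
            lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
            start = 2
        else:
            lo = hi = a[0]
            start = 1
        for i in range(start, n, 2):
            x, y = a[i], a[i + 1]
            small, big = (x, y) if x < y else (y, x)  # one comparison per pair
            if small < lo:                            # compare only the smaller with the minimum
                lo = small
            if big > hi:                              # compare only the larger with the maximum
                hi = big
        return lo, hi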
Algorithm 23 Quickselect
1: function QUICKSELECT(array[lo..hi], k)
2: if hi > lo then
3: pivot = array[lo]
4: mid = PARTITION(array[lo..hi], pivot)
5: if k < mid then
6: return QUICKSELECT(array[lo..mid − 1], k )
7: else if k > mid then
8: return QUICKSELECT(array[mid + 1..hi], k )
9: else
10: return array[k ]
11: else
12: return array[k ]
Just like Quicksort though, this function has a worst-case run time of O (n 2 ) since we might select
the minimum or maximum element as the pivot and hence perform O (n ) levels of recursion.
Luckily for us, there are strategies to avoid this sort of pathological behaviour. We will explore
two such strategies: random pivot selection which results in an expected O (n ) run time and is
empirically the fastest strategy, and the median of medians of fives technique, which yields a
guaranteed worst-case linear performance, although is not as efficient in practice.
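As a sketch of the first strategy, a randomised Quickselect might look like the following Python (0-indexed and iterative rather than recursive; the details are ours and not Algorithm 23 verbatim).

    import random

    def quickselect(a, k):
        # Return the k-th smallest element of a (k = 0 gives the minimum) in expected O(n) time.
        lo, hi = 0, len(a) - 1
        while True:
            if lo == hi:
                return a[lo]
            # Random pivot selection avoids the pathological O(n^2) behaviour in expectation.
            pivot_index = random.randint(lo, hi)
            a[lo], a[pivot_index] = a[pivot_index], a[lo]
            pivot = a[lo]
            mid = lo
            for i in range(lo + 1, hi + 1):           # partition a[lo..hi] around the pivot
                if a[i] < pivot:
                    mid += 1
                    a[i], a[mid] = a[mid], a[i]
            a[lo], a[mid] = a[mid], a[lo]             # the pivot ends up at index mid
            if k < mid:
                hi = mid - 1
            elif k > mid:
                lo = mid + 1
            else:
                return a[mid]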
Using randomised pivot selection, the Quickselect algorithm has expected run time
O (n ).
Assume without loss of generality that the array contains no duplicate elements and
denote by T (n ) the time taken by Quickselect to select the k th order statistic of an array
of size n . If the pivot selected is the i th order statistic of the array, then the time taken by
Quickselect will be
T(n, i) = n + 1 + { T(n − i)  if i < k,   0  if i = k,   T(i − 1)  if i > k }.
Averaging out over all possible pivot selections, we obtain an expected run time of
E[T(n)] = n + 1 + (1/n) · ( Σ_{i=1}^{k−1} E[T(n − i)] + Σ_{i=k+1}^{n} E[T(i − 1)] + 1 ).
Rearranging the indices of the sums, we find that this is the same as
E[T(n)] = n + 1 + (1/n) · ( Σ_{i=n−k+1}^{n−1} E[T(i)] + Σ_{i=k}^{n−1} E[T(i)] + 1 ).
Since both of the sums appearing above in the bracketed expression contain a consec-
utive sequence of terms ending at T (n − 1) and there are a total of n terms, they must be
bounded above by the largest possible terms, so we have the inequality
E[T(n)] ≤ n + 1 + (2/n) · Σ_{i=n/2}^{n−1} E[T(i)].
We can now prove by induction that E[T(n)] ≤ C n = O(n) for some constant C . First, we
have that E[T(0)] = E[T(1)] = 1 ≤ C for any choice of C ≥ 1. Now suppose that E[T(i)] ≤ C i
for some constant C for all i < n, for some value of n . We have that
E[T(n)] ≤ n + 1 + (2/n) · Σ_{i=n/2}^{n−1} E[T(i)]
        ≤ n + 1 + (2C/n) · ( n(n − 1)/2 − (n/2)(n/2 − 1)/2 )
        = n + 1 + C(n − 1) − (C/2)(n/2 − 1)
        ≤ (3C/4 + 1)·n + 1 − C/2.
If we choose C = 4 then we will have
E[T(n)] ≤ (3 + 1)·n + 1 − 2 = 4n − 1 ≤ 4n = C n
as desired. Hence by induction on n , we have that E[T (n )] = O (n ), completing the proof.
• Finding the median of each of these groups (just sort them since they are small).
• Finding the true median of the n/5 medians recursively using Quickselect.
Using these ideas, an implementation of the median of medians technique is shown in Algo-
rithm 24. The function MEDIAN_OF_FIVE selects the true median of a small list using insertion
sort.
The recursive call made to QUICKSELECT in the median of medians algorithm must then also use
MEDIAN_OF_MEDIANS to select its pivot. We call these two functions mutually recursive since
they do not recurse directly on themselves but rather back and forth on each other.
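A compact Python sketch of the overall idea (folding the two mutually recursive functions into a single routine, so it is not a line-for-line rendering of Algorithm 24) is shown below.

    def select(a, k):
        # Return the k-th smallest element of a (k = 0 gives the minimum) in worst-case O(n) time.
        if len(a) <= 5:
            return sorted(a)[k]
        # Median of each group of (at most) five elements.
        medians = [sorted(group)[len(group) // 2]
                   for group in (a[i:i + 5] for i in range(0, len(a), 5))]
        # Recursively find the true median of the medians and use it as the pivot.
        pivot = select(medians, len(medians) // 2)
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        if k < len(less):
            return select(less, k)
        elif k < len(less) + len(equal):
            return pivot
        else:
            return select(greater, k - len(less) - len(equal))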
It remains for us to prove that this technique does indeed result in guaranteed linear time per-
formance. On the one hand, we must ensure that the pivot selected is good enough to yield
low recursion depth, and on the other, we need to be convinced that finding the approximate
median does not take too long and kill the performance of the algorithm.
Quickselect using the median of medians strategy has worst case O (n ) performance.
Proof
First, we note that out of the n/5 groups, half of them must have their median less than
the pivot (by definition of the median). Similarly, half of the n/5 groups must have their
medians greater than the pivot. Together, we have n/10 groups whose median is less than
the pivot, and n/10 groups whose median is greater than the pivot.
Since within each group of five, two elements are greater than its median and two elements
are less than its median, we have transitively that each of the n/10 groups whose median
is less than the pivot contains three elements that are less than the pivot, and similarly
each of the n/10 groups whose median is greater than the pivot contains three elements
that are greater than the pivot. This yields a grand total of 3n/10 elements that are below
and above the pivot respectively.
Consequently, the median of medians lies in the 30th to 70th percentile of the data. From
this, we know that the worst case of the recursive call made by Quickselect will be on
a sequence of size 0.7n. In addition to this recursive call, we also make an additional
recursive call of size 0.2n to compute the exact median of the original n/5 medians. Adding
in the linear number of operations required to perform the partitioning step and to find
the median of each group of 5, a recurrence relation for the amount of work done by
Quickselect using the median of medians strategy is
T(n) = a n + T(0.2n) + T(0.7n)
for some constant a . We can show that this recurrence relation has solution T(n) ≤ C n
for some constant C by induction. First let C be a constant such that C ≥ 10a . Consider
the base case
T(0) = 0 ≤ C × 0 = 0,
which trivially holds. Now suppose that T(i) ≤ C i for all i < n. Then, substituting this
inductive hypothesis, we have
T(n) = a n + T(0.2n) + T(0.7n)
     ≤ a n + 0.2 C n + 0.7 C n
     = a n + 0.9 C n
     ≤ 0.1 C n + 0.9 C n        (since a ≤ C/10)
     = C n,
hence T(n) ≤ C n = O(n).
Improving Quicksort
Finally, we note that we can use this technique to improve Quicksort. If we use the median as
Quicksort’s pivot, which is sometimes referred to as Balanced Quicksort, we will get the mini-
mum possible recursion depth, resulting in a guaranteed worst-case O (n log(n )) performance!
In practice, this strategy is outperformed by random pivot selection, but it is satisfying to see
Quicksort achieve the same worst-case behaviour as heapsort and merge sort.
Chapter 5
Graphs Basics
Graphs are a simple way of modelling sets of objects and encoding relationships between those
objects. Graphs have an enormous number of applications in a wide range of fields. Indeed, two
of the biggest tech companies in existence, Google and Facebook, base their entire business
around two huge graphs! Additionally, graph problems are ubiquitous not only in computer
science, but in modelling complex processes and situations in all disciplines of science and
business. In this chapter we begin our study of graphs.
• What are the groups of mutual friends? I.e. friends that have a friend in common, or a
friend-of-a-friend etc. These are called the connected components of the graph.
• What is the largest degree of separation between two people in the network? This would
correspond to the longest distance between any two people in the graph - also called its
diameter.
• Are there any people in the network who, if removed, would cause a pair of mutual friends
to no longer be mutual friends? Such a vertex in a graph is called an articulation point or
cut point.
Graph and network algorithms can help us solve each of these problems.
Figure 5.1: A network representing a set of people who have connections with each other.
Graphs can also be weighted. As another example, consider a road map of towns in the country,
where each town is represented by a vertex, and roads between towns are represented by edges
with weights corresponding to the distance between them.
Figure 5.2: A weighted graph representing towns (vertices) connected by roads (edges) whose weights are the distances between them.
• Given a pair of towns, what is the shortest route connecting them? This is called the short-
est path problem.
• What is the overall shortest subset of roads that we could keep, removing all others while
still keeping every town connected? This is called the minimum spanning tree problem.
Edges in a graph can also be directed, that is the relation only holds in one direction but not
both. For example, consider your courses that you are taking for your degree, which have pre-
requisites that must be completed first. We can model this with a graph, in which a course has
a directed edge to another course if the first course is a prerequisite of the second course and
hence must be completed first.
Figure 5.3: A network representing a course progression for a Computer Science degree.
• Are there any cycles in the course graph? If so, the degree is impossible to complete!
Ooops.
• If there aren’t any cycles, what is a valid order to complete the courses in that satisfies all
of the prerequisites? Such an order is called a topological ordering of the directed graph.
• If there aren’t any cycles, what is the longest chain of prerequisites? I.e. assuming we could
handle as many courses at a time as we wanted, how many semesters would it take us to
complete the degree? This is called the critical path problem.
Once again, graph and network algorithms can help us solve all of these and many more prob-
lems!
Direction
The edges of a graph may be directed or undirected. In an undirected graph, the edges (u , v )
and (v, u ) are equivalent, but in a directed graph, they represent different things: for example,
a one-way street in a road map, or a course prerequisite, which are not the same if you reverse
them!
Weights
The edges may also be unweighted or weighted, in which case they have an associated quantity,
which may represent for example, the distance between two towns in a road map, the band-
width of a connection in an internet network, the amount of money that it would cost to con-
nect two houses via fibre-optic cable, or anything else you can imagine. We will usually denote
the weight of the edge (u , v ) by w (u , v ).
Depending on the particular problem, it may be allowed or disallowed for a graph to contain
multiple edges between the same pair of vertices. A graph that does contain multiple edges
between the same pair of vertices is usually referred to as a multigraph.
Similarly, for a particular purpose, a graph may or may not be allowed to contain edges that
connect a vertex to itself. Such edges are usually referred to as loops. A graph that contains no
loops or multiple edges between the same pair of vertices is called a simple graph. If we do not
specify otherwise, we will assume that the graphs we consider in this unit are simple by default.
A directed acyclic graph (DAG) is a directed graph that contains no cycles. DAGs are useful in
particular for representing tasks that have prerequisites. For example, the units in your degree
and their prerequisites form a DAG, where each edge denotes a unit that has another as a re-
quirement. In project management, the tasks in a project might form a DAG which indicates
which tasks must be performed before others. The two most fundamental and interesting prob-
lems on DAGs are the topological sorting problem and the critical path problem.
• In a directed graph, the maximum number of possible edges that we can have is therefore
|V |2 if we allow loops, or |V |(|V | − 1) otherwise.
• In an undirected graph, the maximum number of possible edges is (|V| + 1 choose 2) = |V|(|V| + 1)/2 if we allow loops, or (|V| choose 2) = |V|(|V| − 1)/2 otherwise.
We say that a graph is dense if |E| ≈ |V|², that is, informally, the graph has a lot of edges. Conversely, we call a graph sparse if |E| ≪ |V|², i.e. the graph has a relatively small number of edges.
In the case of multigraphs, adjacency matrices can be generalised such that a i , j = k where k is
the number of edges between vertices i and j .
In a weighted graph, the adjacency matrix stores the weights of the edges, such that
a_{i,j} = { w(i, j)   if there is an edge (i, j),
            0 or ∞    otherwise }.
The choice of whether to use 0 or ∞ as an indicator that a particular edge is absent will depend
completely on the problem at hand. The choice must be made such that the inclusion of the
absent edge with the given weight does not change the solution to the problem being consid-
ered. Alternatively, if this is not feasible, one could store two matrices, one indicating adjacency
without weights, and one storing the weights.
In an undirected graph, the adjacency matrix will clearly be symmetric, so if we wish to, we can
save memory by only storing the upper triangular entries. Given an adjacency matrix, checking
whether two vertices have an edge between them can be done in constant O (1) time by simply
checking the corresponding entry in the adjacency matrix.
Adjacency matrices are good for representing dense graphs since they use a fixed amount of
memory that is proportional to the total number of possible edges. However, when a graph is
sparse, the corresponding adjacency matrix will contain a very large number of 0 entries, using
the same O (|V |2 ) space regardless of the actual number of edges in the graph. In this case,
a strong alternative is to store the graph in the form of an adjacency list. An adjacency list is
simply a list of lists, where each list corresponding to a particular vertex stores the vertices that
the given vertex is adjacent to. In a weighted graph, the adjacency list will store the weights of
the edges alongside the vertices corresponding to those edges.
Since an adjacency list only uses memory for edges that actually exist in the graph, the space
required for this representation is O (|V | + |E |), which is a huge improvement over adjacency
matrices for sparse graphs, and about the same for dense graphs.
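As an illustration, a small Python sketch that builds both representations for an undirected, weighted graph from a list of edges (the vertex numbering and names are our own choices):

    def build_representations(n, edges):
        # edges is a list of (u, v, w) tuples with vertices numbered 0..n-1.
        matrix = [[0] * n for _ in range(n)]      # adjacency matrix: O(|V|^2) space
        adj_list = [[] for _ in range(n)]         # adjacency list: O(|V| + |E|) space
        for u, v, w in edges:
            matrix[u][v] = matrix[v][u] = w       # undirected: store both directions
            adj_list[u].append((v, w))
            adj_list[v].append((u, w))
        return matrix, adj_list

    # Example: a triangle on vertices 0, 1 and 2.
    matrix, adj_list = build_representations(3, [(0, 1, 5), (1, 2, 2), (0, 2, 4)])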
Unlike with adjacency matrices, we can no longer check whether two given vertices are adjacent
in constant time using an adjacency list. Adjacency lists are, however, particularly convenient
when we only need to iterate over the existing edges since we will never need to examine a non-
existent edge. We could instead store our adjacency list for each vertex in a balanced binary
search tree, which would allow for O (log(|E |)) adjacency checks while retaining the ability to it-
erate efficiently. Such a strategy would use more memory however and is not particularly useful
for most applications.
The edge list uses only O(|E|) space, but affords us neither the ability to check adjacency quickly
nor the ability to iterate over the adjacent vertices of a particular vertex efficiently. Its primary use is in
Kruskal's minimum spanning tree algorithm, where we need to sort all of the edges in ascending
order of weight, which cannot be done effectively with the other two representation strategies.
Figure 5.5: A graph with its nodes labelled in the order that they might be visited by a depth-first search. The highlighted edges are those traversed by the search, forming a depth-first search tree.
Depth-first search can be implemented very succinctly using recursion. We will assume that the
graph is represented as an adjacency list, so that we can access adjacent vertices in constant
time. A depth-first search is shown in Algorithm 25. Note that since the graph might not be
connected, we loop over all vertices and make a call to DFS if that vertex has not been visited.
Otherwise we may only visit one component of the graph (although for some applications, this
may be all that is needed). This algorithm is applicable to both directed and undirected graphs.
1. Vertices are interchangeable with their ID number. That is, we will frequently loop over
all vertices by looping over 1..n , and we will read from and assign to arrays using vertices
as indices.
2. Properties of the graph, for example the adjacency list, are visible to the DFS function.
This avoids us having to pass around lots of extra arguments.
3. Quantities associated with the graph that we are working on are visible to the DFS func-
tion. For example, we assume that the DFS function can see and use the visited array even
though it was declared in a different scope. This too saves on passing extra arguments
around.
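A minimal Python sketch in the same spirit as Algorithm 25 (here the adjacency list is passed explicitly instead of being globally visible, and vertices are numbered 0..n−1):

    def dfs_all(adj_list):
        # adj_list[u] is the list of vertices adjacent to u.
        n = len(adj_list)
        visited = [False] * n

        def dfs(u):
            visited[u] = True
            for v in adj_list[u]:
                if not visited[v]:
                    dfs(v)

        # Start a new search from every unvisited vertex so that every
        # component of a possibly disconnected graph is traversed.
        for u in range(n):
            if not visited[u]:
                dfs(u)
        return visited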
Since the depth-first search algorithm visits every vertex only once and examines every edge
at most once (or twice for an undirected graph), its time complexity is O (|V | + |E |). Despite its
simplicity, the skeleton DFS algorithm forms the basis of a huge number of graph algorithms
with wide-ranging applications. We will now explore a few of these applications.
• Two-colouring a graph, or deciding that the graph cannot be two-coloured – a graph can
be two-coloured if every vertex can be assigned a colour, white or black, such that no
two neighbouring vertices share the same colour. This is equivalent to the graph being
bipartite.
• Finding bridges and cut-points of a graph – these are edges and vertices that if removed
from the graph would disconnect some connected component. In a sense they are “non-
redundant” connections.
We can find the connected components by observing that a single call to the depth-first search
function visits precisely the connected component of the starting vertex, and nothing else. So
we can simply run a depth-first search from each vertex that has not yet been visited by a previ-
ous one, and each time, we will discover one more component of the graph. An implementation
is shown in Algorithm 26. Since the depth-first search runs in O (|V | + |E |) time, this algorithm
for determining the connected components also runs in O (|V | + |E |).
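A Python sketch of this idea (along the lines of Algorithm 26, though not identical to it) labels each vertex with a component number:

    def connected_components(adj_list):
        # Two vertices are connected if and only if they receive the same label.
        n = len(adj_list)
        component = [None] * n

        def dfs(u, label):
            component[u] = label
            for v in adj_list[u]:
                if component[v] is None:
                    dfs(v, label)

        num_components = 0
        for u in range(n):
            if component[u] is None:
                num_components += 1
                dfs(u, num_components)
        return num_components, component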
Figure 5.7: A graph with its nodes labelled in the order that they might be visited by a breadth-first search. The traversed edges form a shortest path tree from s. Observe that all paths from s in the tree use the fewest edges possible to reach their respective destinations.
pletely a matter of convention. You can perform breadth-first search from multiple compo-
nents if you want, and you may even search from multiple starting locations simultaneously.
An implementation of breadth-first search is shown in Algorithm 28.
In order to visit vertices that are nearby sooner, the algorithm maintains a queue of vertices
that need to be visited, appending progressively further vertices to the end of the queue as it
traverses. Finally, note that just like the depth-first search, breadth-first search also visits each
vertex at most once and examines each edge at most once (or twice for an undirected graph),
hence its time complexity is O (|V | + |E |).
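A minimal Python sketch of breadth-first search from a single source s, using a queue as described above (not Algorithm 28 verbatim):

    from collections import deque

    def bfs(adj_list, s):
        # Visit every vertex reachable from s in order of increasing number of edges from s.
        n = len(adj_list)
        visited = [False] * n
        visited[s] = True
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj_list[u]:
                if not visited[v]:
                    visited[v] = True     # mark when enqueued so nothing is enqueued twice
                    queue.append(v)
        return order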
total distance travelled, or the time taken, or indeed any other quantity that we are interested
in, such as toll costs. Note that we usually refer to a shortest path rather than the shortest path
since there may be multiple equally short paths between two vertices.
Before diving straight in to the algorithms, we’ll first take a look at some useful properties that
shortest paths have which will help us in understanding the algorithms and how they work.
Figure 5.8: A weighted, directed graph. A shortest path from vertex 1 to vertex 7 has a total weight of 6, passing through the vertices 1 → 2 → 4 → 7.
δ(s , v ) ≤ δ(s , u ) + w (u , v ).
Behind these fancy symbols is a simple and intuitive but very critical fact that underpins most
shortest path algorithms. All that this says in plain English is that if we have a shortest path
s ⇝ v, then we can't find a path s ⇝ u, followed by an edge u → v, with a shorter overall length,
because this would mean that s ⇝ u → v would be an even shorter path. Expressed even more
simply, you can't find a path that is shorter than your shortest path!
Why is this true? If the shortest path from vertex s to vertex u has total length δ(s , u ) and the
edge (u , v ) has weight w (u , v ) then we can form a path to vertex v by taking a shortest path from
s to u and then traversing the edge (u , v ), forming a path from s to v . The total length of this
path is δ(s , u ) + w (u , v ), so this length cannot be shorter than the length of the shortest path,
since it is itself a valid path. Try to formalise this argument using contradiction.
Figure 5.9: A weighted, directed graph. A shortest path from vertex a to vertex g has length 9.
δ(a , e ) = 7, so the triangle inequality tells us that δ(a , g ) ≤ δ(a , e ) + w (e , g ) = 7 + 3 = 10, which
is indeed satisfied. Since the bound is not tight, i.e. δ(a , g ) < δ(a , e ) + w (e , g ), this implies that
the edge (e , g ) is not part of a shortest path from a to g .
Consider a path that contains a cycle in it. If this cycle has a positive total weight, then we can
remove it from the path and the total weight decreases, so this path was not a shortest path. If
the cycle has a negative total weight, then we may travel around the cycle over and over for as
long as we wish and accumulate arbitrarily short paths, so the notion of a shortest path is not
well defined in this case. Finally, if the path contains a cycle of zero total weight, then we may
remove it and obtain a path with the same total weight. Therefore without loss of generality, we
may assume that shortest paths are simple paths (contain no cycles) when looking for them.
Figure 5.10: An edge weighted, directed graph containing a directed cycle. The cycle can not possibly be a part of a shortest path since it only makes the path length longer without going anywhere!
Figure 5.11: An edge weighted, directed graph containing a directed cycle with a negative total weight. The shortest path between vertex 1 and 5 is now undefined. For any path that you give me, I can always make an even shorter one by adding in another traversal of the cycle.
Although single-pair is seemingly the simplest possible shortest path problem, it turns out that
the more general variants are typically simpler and asymptotically no more difficult to solve.
Given a weighted, directed graph G = (V , E ) and a starting vertex s ∈ V , find for every
vertex v ∈ V a shortest path from s to v .
Given the copious amounts of substructure that is present in shortest paths, it should be intu-
itive that we can solve the single-source problem much faster than we could solve a single-pair
problem separately for every possible vertex v .
Given a weighted, directed graph G = (V , E ), find a shortest path between every pair of
vertices u , v ∈ V .
There is much overlap between these problems, and indeed each of them can be used to solve
the previous one. We will only explore algorithms for the last two of them since, in general,
solving the single-pair problem is usually just as difficult as solving the single-source problem, except in some special cases.
In this chapter we will show an algorithm that finds the shortest path in unweighted graphs.
In Chapter 6 we present a greedy algorithm to solve the shortest path problem in graphs with
non-negative weights, and in Chapter 8 we present dynamic programming algorithms that can
compute shortest paths on graphs with negative weights.
Note that this algorithm does not only compute the shortest path between one pair of vertices,
it computes the shortest paths from the source vertex s to every other reachable vertex in the
graph. Once we have the shortest path information, we can reconstruct the sequence of vertices
that make up the shortest path from s to the vertex u by backtracking through the pred array
until we reach s. An implementation is depicted in Algorithm 30.
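A sketch of what this looks like in Python, computing dist and pred with a breadth-first search and then backtracking through pred (a stand-in for the pseudocode in Algorithms 29 and 30, not a verbatim copy):

    from collections import deque

    def bfs_shortest_paths(adj_list, s):
        # dist[v] = fewest edges on a path from s to v; pred[v] = previous vertex on one such path.
        n = len(adj_list)
        dist = [None] * n
        pred = [None] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj_list[u]:
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    pred[v] = u
                    queue.append(v)
        return dist, pred

    def reconstruct_path(pred, s, u):
        # Walk backwards through pred from u to s (assumes u is reachable from s).
        path = [u]
        while path[-1] != s:
            path.append(pred[path[-1]])
        return list(reversed(path))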
A DAG (a) representing prerequisites for getting dressed and a correct topological ordering of
the DAG (b). Image source: CLRS
Since Kahn’s algorithm visits each vertex once and removes every edge once, its time complexity
is O (|V | + |E |) if implemented with the appropriate data structures. In practice, for efficiency,
we do not actually delete edges from the graph during Kahn’s algorithm. What we usually do
is maintain an array that remembers the in-degree of every vertex (the number of incoming
edges). Whenever we pop a vertex u , we decrease by one the in-degree of all vertices v such
that there exists an edge (u , v ). When this number hits zero, we know that the vertex has no
more incoming edges and hence is ready to be added to the queue.
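A Python sketch of Kahn's algorithm implemented with an in-degree array, as described above (the interface details are ours):

    from collections import deque

    def kahns_topological_sort(adj_list):
        # adj_list[u] lists the vertices v with a directed edge (u, v).
        n = len(adj_list)
        in_degree = [0] * n
        for u in range(n):
            for v in adj_list[u]:
                in_degree[v] += 1
        queue = deque(u for u in range(n) if in_degree[u] == 0)
        order = []
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj_list[u]:
                in_degree[v] -= 1          # "remove" the edge (u, v)
                if in_degree[v] == 0:      # v now has no incoming edges
                    queue.append(v)
        if len(order) < n:
            return None                    # a cycle exists, so no topological order
        return order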
Figure 5.12: An undirected graph where each vertex is labelled by its connected component number. Note that two vertices are connected if and only if they have the same number.
The problem becomes much more interesting (and difficult) if we want to modify the graph in
between queries. One solution is to simply recompute the connected components each time we
make a modification, but this will get expensive if we begin to make lots of them. This problem
is called the dynamic connectivity problem. If we restrict ourselves to the case in which edges
are only added, and not deleted, this is called the incremental connectivity problem.
The incremental connectivity problem is the problem of determining whether two vertices u , v
in a graph G are connected, while allowing the graph G to be modified, that is, have new edges
inserted in between queries. Simply rerunning the connected components algorithm every
time we make a modification would work, but is very inefficient, so we’d like to find a better
way to approach the problem.
Formally, the incremental connectivity problem is the problem of finding a data structure that
can support the following two operations:
1. LINK(u, v): Add an edge connecting the vertices u and v.
2. CONNECTED(u, v): Return true if the vertices u and v are currently connected, i.e. in the same connected component, or false otherwise.
We’d like to perform these operations fast, so doing a fresh search per CONNECTED query or
recomputing the entire connected components per LINK operation is too slow. This problem
can be solved efficiently using a data structure called a union-find or disjoint-set data structure.
1. FIND(u ): Determine which set the element u is contained in. This means returning the
identity of the representative of the set containing u .
2. UNION(u , v ): Join the contents of the sets containing u and v into a single set. This may
change the representatives of the sets.
These two operations can be seen as equivalent to maintaining the connected components of
an undirected graph, where each item in a set represents a vertex, and each union operation
represents the addition of an edge which might connect two components. Checking whether
two vertices are in the same connected component then reduces to checking whether they have
the same representative, i.e. whether FIND(u ) = FIND(v ).
Figure 5.13: A disjoint set forest. The result of merging the two sets shown in (a) is depicted by
(b). The left tree has become a subtree of the right tree. Figure source: CLRS
An implementation is depicted in Algorithm 34. So far, this is not very efficient, since a sequence
of union operations may result in a long chain, which will make FIND operations take O (n ) time
per query. To improve this, we make two optimisations.
The first optimisation is path compression. Whenever we perform a FIND operation to find the root of a particular node's tree, we have to look at all of the nodes on the path to the root. We can
therefore modify the parent pointer for each node on the path to point directly at the root, so
that long paths are compressed and will never have to be traversed again. This has the effect of
flattening the tree and removing long chains, which significantly speeds up future FIND queries.
It can be shown that with path compression, the worst-case cost of performing m operations is
O (m log(n )), i.e. at most O (log(n )) per operation in total.
An implementation of FIND with path compression is shown in Algorithm 35. By using recur-
sion, path compression can be implemented cleverly in just one line!
In union by rank, we maintain a rank for each tree, which is an upper bound on the height of the
tree. When merging two trees together, we choose to always make the tree with the smaller rank
a child of the tree with the larger rank. This results in more balanced trees since it encourages
the shorter tree to become a subtree of the larger one. It can be shown that with union by rank,
the worst-case cost of performing m operations is also O(m log(n)), i.e. at most O(log(n)) per
operation in total.
Combining both path compression and union-by-rank speeds up the data structure even more.
It can be shown that in the worst case, a sequence of m operations will cost O (mα(n )), where
α(n ) is the inverse Ackermann function, an extremely slowly growing function. To put it in per-
spective, α(n) ≤ 4 for any value of n that could ever be written down in the universe, so for all practical pur-
poses (but not theoretical purposes) it is a constant. This is a significant improvement over
the O (log(n )) complexity of path compression and union by rank individually. This complexity
Chapter 5. Graphs Basics 71
bound also turns out to be provably optimal, in other words, no disjoint-set data structure can
possibly do better than this.
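A compact Python sketch of the union-find data structure with both optimisations, in the spirit of the FIND and UNION routines described above (folded into a single class of our own design):

    class DisjointSets:
        # Union-find with path compression and union by rank.
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, u):
            # Path compression: point u and its ancestors directly at the root.
            if self.parent[u] != u:
                self.parent[u] = self.find(self.parent[u])
            return self.parent[u]

        def union(self, u, v):
            ru, rv = self.find(u), self.find(v)
            if ru == rv:
                return False               # already in the same set
            if self.rank[ru] < self.rank[rv]:
                ru, rv = rv, ru            # union by rank: attach the shorter tree to the taller
            self.parent[rv] = ru
            if self.rank[ru] == self.rank[rv]:
                self.rank[ru] += 1
            return True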
Chapter 6
Greedy Algorithms
In this chapter, we will study greedy algorithms. Greedy algorithms make a locally optimal
choice at each stage and commit to it. Often it is easy to come up with ideas to solve a problem
using a greedy strategy, but in many cases the proposed greedy algorithm does not find the
overall optimal solution for the problem (thus correctness proofs are very important). Some
of the most important greedy algorithms solve problems on graphs and will be studied in this
chapter.
The predecessor array, just like in breadth-first search, indicates for each vertex its parent in the
shortest path tree from the source. It is maintained in the same way as it was in a breadth-first
search and can be used in the same way to reconstruct the shortest paths once we have found
all of the correct distances (see Algorithm 30).
Dijkstra's algorithm for finding shortest paths in graphs with non-negative weights is based on
the following key ideas:
1. If all edge weights are non-negative, i.e. w(u, v) ≥ 0 for all edges (u, v), then any sub-path s ⇝ u of some longer path s ⇝ v necessarily has a length that is no greater than the entire path. This is not true when negative weights appear, as a path may grow in number of edges but decrease in total length due to the negative weights.
2. This means that if we know the minimum distance to all vertices u ∈ S for some set S and we relax all edges that leave S, then we can guarantee that the distance estimate to the vertex v ∉ S with the minimum estimate δ(s, u) + w(u, v) is optimal. This is because even though we do not know the distances to any other vertex v′ ∉ S, none of them can provide a shorter path to v since all edge weights are non-negative and hence the paths can only get longer.
3. Dijkstra's algorithm therefore employs a greedy approach, a similar idea to breadth-first search, by visiting nodes in order of distance from the source.
Dijkstra’s algorithm works like a weighted version of breadth-first search, i.e. vertices are still
visited in order of their distance from the source, but distance is now computed in terms of
edge weights, rather than number of edges. The primary difference in implementation is that
because the edge weights are no longer all 1, we cannot simply use an ordinary queue to store
the vertices that we need to visit, since a path consisting of many low weight edges may have a
shorter length than a path of fewer large weight edges (see Figure 6.1 for an example). Instead,
Dijkstra’s algorithm employs a priority queue to ensure that vertices are visited in the correct
order of their distance from the source1 . An implementation is given in Algorithm 38.
Figure 6.1: A graph where the shortest path from 1 to 4 has far more edges than the path with the fewest edges. In this case, breadth-first search will not find the shortest path.
The priority queue Q is a minimum priority queue initially containing all vertices. It serves each
vertex u in order of their current distance estimate, dist[u ]. At each iteration, the next vertex
1 Actually,
Dijkstra’s original implementation of his algorithm did not use a priority queue, but rather simply looped
over the vertices to find which one was the next closest. This is very inefficient for sparse graphs however, and was
improved by Fredman and Tarjan, who described the now standard min-heap-based priority queue implementation.
u that has not been visited yet and which has the smallest current distance estimate is pro-
cessed. The intuitive idea behind the correctness of Dijkstra’s is that since vertices are removed
in distance order, once a vertex has been removed from the queue, its distance estimate must
be correct and will never be further decreased.
Proof
Let’s use induction on the set of vertices S that have been removed from the queue, i.e.
the set S = V \ Q . We will show for any vertex v ∈ S , that dist[v ] is correct.
For the base case, we will always remove the source vertex s first which has dist[s ] = 0,
which is correct since the graph contains no negative weights.
Suppose for the purpose of induction that at some point it is true that for all v ∈ S , dist[v ]
is correct. Let u be the next vertex removed from the priority queue, we will show that
dist[u ] must be correct.
Suppose for contradiction that there exists a path s ⇝ u with a shorter distance
δ(s ⇝ u) < dist[u]. Let x be the furthest vertex along this correct s ⇝ u path that is
in S. Since x ∈ S, by the inductive hypothesis, its distance estimate is correct. Let y be
the next vertex on the shortest path after x. Since δ(s ⇝ u) < dist[u] and all edge weights
are non-negative, it must be true that δ(s ⇝ y) ≤ δ(s ⇝ u) < dist[u]. But y is adjacent
to x on a shortest path, which means the edge (x, y) was relaxed when x was removed
from Q. This means that dist[y] = δ(s ⇝ y) < dist[u]. If dist[y] < dist[u] and y ≠ u,
then Dijkstra's algorithm would have popped y from the priority queue instead of u, a
contradiction. Alternatively if y = u and dist[y] < dist[u], this is also a contradiction.
By contradiction, we conclude that dist[u] must be correct and hence by induction on
the set S, when Dijkstra's algorithm terminates, dist[v] is correct for all v ∈ V.
Notice how the key step of the proof requires us to invoke the fact that the edge weights are
non-negative. Without this restriction, this step of the proof would not be true, since the sub-path
s ⇝ y might actually have a higher total weight than the total path s ⇝ u due to negative
weights later on.
this is asymptotically equivalent (and we can always preprocess a graph to make it simple by
taking just the shortest edge between every pair of vertices). In practice, the number of entries
in the priority queue at any given time is likely to be much smaller than |E |, so this method
is almost always a speed-up too, except in pathological cases. An example implementation is
depicted in Algorithm 39.
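A sketch of this lazy, insert-on-relaxation variant in Python using the standard heapq module (our own rendering, not Algorithm 39 itself):

    import heapq

    def dijkstra(adj_list, s):
        # adj_list[u] is a list of (v, w) pairs; all weights w must be non-negative.
        n = len(adj_list)
        dist = [float('inf')] * n
        pred = [None] * n
        dist[s] = 0
        heap = [(0, s)]                      # (distance estimate, vertex)
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                     # stale entry: u was already finalised
            for v, w in adj_list[u]:
                if dist[u] + w < dist[v]:    # relax the edge (u, v)
                    dist[v] = dist[u] + w
                    pred[v] = u
                    heapq.heappush(heap, (dist[v], v))
        return dist, pred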
Although the definition of a minimum spanning tree is a global one, that is we are required to
minimise the total weight of the tree, minimum spanning trees have local structure which allows
us to find them using greedy algorithms. We will see two such greedy algorithms, Prim’s algo-
rithm which is almost identical to Dijkstra’s algorithm, and Kruskal’s algorithm which makes use
of the union-find disjoint-set data structure. For both algorithms presented below, we assume
that the graph G is connected (otherwise no spanning trees exist).
Figure 6.2: A weighted graph G and a minimum spanning tree of G . Image source: Weiss
In every iteration of Prim’s algorithm, the current set of selected edges in T is a subset
of some minimum spanning tree of G .
Proof
Suppose that T is currently a subset of some minimum spanning tree M, and let (u, v) be the next edge selected by Prim's algorithm, with u ∈ T and v ∉ T. If (u, v) ∈ M then the invariant clearly still holds, so suppose (u, v) ∉ M. Then M ∪ {(u, v)} contains a cycle, and this cycle must contain some other edge (x, y) with x ∈ T and y ∉ T.
Since M is a tree, removing the edge (x, y) disconnects M into two components M1 and M2. Connecting the edge (u, v) to M1 and M2 reconnects them to form a new spanning tree M′ = M1 ∪ M2 ∪ {(u, v)} = M ∪ {(u, v)} \ {(x, y)}. Since (u, v) is the lightest edge connecting T to a new vertex, it is no heavier than (x, y), otherwise Prim's would have picked (x, y) instead since x ∈ T and y ∉ T. In other words, w(u, v) ≤ w(x, y), which implies that w(u, v) − w(x, y) ≤ 0. So the total weight of M′ is
w(M′) = w(M) + w(u, v) − w(x, y)
      ≤ w(M),
so M′ is also a minimum spanning tree, and it contains T ∪ {(u, v)}, hence the invariant is maintained.
Prim's algorithm correctly finds a minimum spanning tree of a connected graph G.
Proof
First, we argue that Prim’s algorithm produces a spanning tree. Whenever Prim’s adds an
edge to the MST, it does so to a newly added vertex, which means it can never create a cy-
cle. Since the graph is connected, Prim’s will visit every vertex and hence it will produce
a spanning tree. Therefore when the algorithm terminates, T is a spanning tree, and by
the invariant above, it is a subset of some minimum spanning tree. Since all spanning
trees contain the same number of edges, T itself must be a minimum spanning tree.
Hence Prim’s algorithm is correct.
Sorting the edges of the graph takes O(|E| log(|E|)) time, and the sequence of UNION and FIND
operations takes O(|E| log(|V|)). The total time complexity of Kruskal's algorithm is therefore
O(|E| log(|E|)), or equivalently O(|E| log(|V|)), since log(|E|) = O(log(|V|²)) = O(log(|V|)) in a simple graph.
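A Python sketch of Kruskal's algorithm (using a simplified union-find with path halving for brevity; the full union-find structure sketched earlier could be substituted):

    def kruskal(n, edges):
        # edges is a list of (w, u, v) tuples; vertices are numbered 0..n-1.
        parent = list(range(n))

        def find(u):
            while parent[u] != u:
                parent[u] = parent[parent[u]]   # path halving
                u = parent[u]
            return u

        mst = []
        for w, u, v in sorted(edges):           # ascending order of weight
            ru, rv = find(u), find(v)
            if ru != rv:                        # adding (u, v) cannot create a cycle
                parent[ru] = rv
                mst.append((u, v, w))
        return mst                              # assumes the graph is connected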
In every iteration of Kruskal's algorithm, the current set of selected edges T is a subset of
some minimum spanning tree of G.
Proof
Suppose that T is currently a subset of some minimum spanning tree M, and let e = (u, v) be the next edge added by Kruskal's algorithm. If e ∈ M then the invariant clearly still holds, so suppose e ∉ M. Then M ∪ {e} contains a cycle, and this cycle must contain some other edge e′ that crosses between the two components of T that e joins. Since e′ connects two different components of T, it has not yet been considered by Kruskal's (it would not have been rejected), and since edges are considered in non-decreasing order of weight, w(e) ≤ w(e′). Hence M′ = M ∪ {e} \ {e′} is a spanning tree of weight at most w(M), i.e. another minimum spanning tree, and it contains T ∪ {e}, so the invariant is maintained.
Kruskal's algorithm correctly finds a minimum spanning tree of a connected graph G.
Proof
First, we argue that Kruskal’s algorithm produces a spanning tree. T will never contain
a cycle since the use of the union-find data structure explicitly prevents it. T must also
be connected since G is connected, and if there were two components that were not
connected in T , Kruskal’s would select the edge connecting them. Therefore when the
algorithm terminates, T is a spanning tree, and by the invariant above, it is a subset of
some minimum spanning tree. Since all spanning trees contain the same number of
edges, T itself must be a minimum spanning tree. Hence Kruskal’s algorithm is correct.
Chapter 7
Dynamic Programming
A powerful technique that we have already encountered for solving problems is to reduce a
problem into smaller problems, and to then combine the solutions to those smaller problems
into a solution to the larger problem. We have seen this for example in the recursive computa-
tion of Fibonacci numbers, and in the divide-and-conquer algorithms for sorting (Merge Sort
and Quicksort). In some situations, the smaller problems that we have to solve might have to
be solved multiple times, perhaps even exponentially many times (for example, recursive com-
putation of the Fibonacci numbers). Dynamic programming involves employing a technique
called memoisation, where we store the solutions to previously computed subproblems in order
to ensure that repeated problems are solved only once. Dynamic programming very frequently
arises in optimisation problems (minimise or maximise some quantity) and counting problems
(count how many ways we can do a particular thing).
programming applies when the subproblems overlap and are repeated multiple times. Recall
the example of computing Fibonacci numbers recursively. Fibonacci numbers are defined by the
recurrence relation
F(n) = F(n − 1) + F(n − 2), with F(1) = F(2) = 1,
and can be computed recursively with a very simple algorithm shown in Algorithm 42.
Although simple, this algorithm is extremely inefficient. What is its time complexity? Well, the
amount of work T(n) taken to compute the value of F(n) is the sum of the times taken to compute
the values F(n − 1) and F(n − 2), so the time complexity is given by the solution to the
recurrence
T(n) = T(n − 1) + T(n − 2) + c.
Hopefully we recognise the main piece of this recurrence, it's just the Fibonacci recurrence with
an added constant! The time complexity to compute F (n ) can therefore be shown to be O (F (n )).
As cool as this fact is, hopefully you remember from your mathematics studies that the Fibonacci
numbers grow exponentially, so this is an extremely slow algorithm. Specifically they grow
asymptotically like ϕ n where ϕ is the golden ratio. How does it get so bad? What exactly are
we doing so wrong?
Figure 7.1: The call tree resulting from the recursive Fibonacci function for computing Fn .
Figure 7.1 depicts what happens when we try to compute F (n ). F (n ) and F (n − 1) are com-
puted once, F (n −2) is computed twice, F (n −3) is computed three times, F (n −4) is computed
five times... In general, F (n − k ) is computed F (k + 1) times, so the leaves of the call tree are
computed exponentially many times, which explains why the algorithm is so slow. The solu-
tion to this problem is simple but demonstrates the most powerful, core idea of dynamic pro-
gramming. Instead of allowing ourselves to compute the same thing over and over again, we
will simply store the result of each computation in a table, and if we are required to compute it
again, we can look it up and return it straight from the table without any recomputation. This
technique is called memoisation (no, that is not a typo, it is actually spelled and pronounced
memo-isation). A memoised implementation of the recursive Fibonacci number calculation is
shown in Algorithm 43.
We store the Fibonacci numbers that have been computed in the memoisation table memo,
and return straight from the table whenever we encounter a problem that we already have the
solution to. The initial value of null in the table is used to indicate that we have not computed
that number yet. If we were planning on computing lots of Fibonacci numbers, we could even
keep the memo table between calls. Since each call to FIBONACCI_MEMO takes constant time,
each problem is never computed more than once, and there are n subproblems, the total time
complexity of this implementation is O (n ).
The approach that we have implemented above is an example of “top-down” dynamic program-
ming since we first ask for the solution to the largest problem and recursively ask for the solu-
tions to smaller problems while filling in the memoisation table. An alternative implementa-
tion would be to start from the smaller problems and work our way up to the big problems. This
would be the “bottom-up” approach. A bottom-up version of computing Fibonacci numbers is
shown in Algorithm 44.
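For comparison, minimal Python sketches of both styles (not Algorithms 43 and 44 verbatim), using the convention F(1) = F(2) = 1:

    def fibonacci_memo(n, memo=None):
        # Top-down: store each F(k) the first time it is computed, O(n) total time.
        if memo is None:
            memo = {1: 1, 2: 1}
        if n not in memo:
            memo[n] = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo)
        return memo[n]

    def fibonacci_bottom_up(n):
        # Bottom-up: fill the table from the smallest subproblems upwards.
        fib = [0] * (n + 1)
        fib[1] = 1
        if n >= 2:
            fib[2] = 1
        for i in range(3, n + 1):
            fib[i] = fib[i - 1] + fib[i - 2]
        return fib[n]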
We will explore the differences between the top-down and bottom-up approaches soon.
If we need to make $V and we choose to use a coin with value $x , then we are now required to
make change for $(V − x ). The essence of dynamic programming is then to try every possible
choice and take the best one, while memoising the answers to avoid repeated computation.
Using this idea, the subproblems for coin change can be expressed as
MinCoins[v] = {the minimum number of coins required to make exactly $v}
for all values of v in 0 ≤ v ≤ V. These subproblems exhibit optimal substructure since if the
optimal solution for $V includes a coin of value $x, then the remaining coins must necessarily
form an optimal solution for $(V − x), otherwise we could swap them for a better one.
MinCoins[v] =
    0                                                      if v = 0,
    ∞                                                      if v < c[i] for all i,
    min_{1 ≤ i ≤ n, c[i] ≤ v} (1 + MinCoins[v − c[i]])     otherwise.
We include the base case MinCoins[0] = 0 since the recurrence must terminate, and it takes
zero coins to make zero dollars. We also write the infeasible case (returning ∞) in case it is not
possible to make some values, since there might not necessarily be a $1 coin.
Implementing a solution
We can implement a top-down solution to coin change by implementing the recurrence relation
above as a recursive function. Your implementation might look like the pseudocode shown in
Algorithm 45. Note that we use memoisation to avoid computing a solution to a subproblem
twice! Assume that the memo table is appropriately initialised to null.
Since we solve V + 1 subproblems and each one tries all n coins, the time complexity of this
solution is O (nV ). The space complexity of storing the solutions to V subproblems is O (V ).
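A minimal top-down Python sketch of this recurrence (a stand-in for Algorithm 45, using a dictionary as the memo table; note that the recursion depth can reach V):

    import math

    def min_coins(coins, V, memo=None):
        # Minimum number of coins (with repetition allowed) needed to make value V,
        # or math.inf if V cannot be made. Memoised top-down recursion, O(nV) time.
        if memo is None:
            memo = {}
        if V == 0:
            return 0
        if V not in memo:
            best = math.inf
            for c in coins:
                if c <= V:
                    best = min(best, 1 + min_coins(coins, V - c, memo))
            memo[V] = best
        return memo[V]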
Just as we did for Fibonacci numbers, we could also convert this into a bottom-up solution.
Pseudocode for a bottom-up coin change algorithm is given in Algorithm 46.
table one problem at a time in an order which ensures that any dependent subproblems have al-
ready been computed before they are needed. The order in which we fill the table in the bottom-
up approach is the reverse topological order of the dependency graph of the subproblems. Most
of the time, figuring out the order in which the subproblems must be computed is simple, but
in rare cases it might be more difficult.
The majority of the time, the two dynamic programming styles are equivalent and equally as
good, but there are certain situations in which it may be preferable to favour one over the other.
As alluded to above, one reason that the top-down approach might be preferable in some cases is
that we do not need to know a-priori the order in which the subproblems need to be computed,
as the recursion takes care of ensuring that subproblems are made available when required.
The top-down approach also boasts the advantage that subproblems are only computed if they
are actually required, while the bottom-up approach strictly computes the solution to every
possible subproblem. If the solutions to only a small number of the possible subproblems are
required, then the top-down approach will avoid computing the ones that are not needed.
On the contrary, the bottom-up approach avoids recursion altogether and hence should be
faster if subproblems are frequently revisited since we drop the overhead on the program’s call
stack required to make the recursive calls. The bottom-up approach also allows us to more eas-
ily exploit the structure of specific dynamic programs in situations where we can improve the
time and space complexity of the resulting program. We will see one such example, the space-
saving trick, which allows us to reduce the space complexity of some dynamic programs, which
is not possible with the top-down approach. The following table summarises the pros and cons
of both approaches.
Both approaches are very similar. The first is advantaged by the fact that it uses less memory,
since a second table is not required to remember the optimal coins. It should also usually be
faster as a consequence. The second approach however is simpler to understand, and also al-
lows us to do some more advanced optimisations that involve dynamically adjusting the search
space for a particular subproblem based on the optimal choices of previous subproblems.
Algorithm 48 Bottom-up coin change with solution reconstruction using decision table
1: function COIN_CHANGE(c[1..n ], V )
2: MinCoins[0..V ] = ∞
3: OptCoin[0..V ] = null
4: MinCoins[0] = 0
5: for v = 1 to V do
6: for i = 1 to n do
7: if c[i ] ≤ v then
8: if 1 + MinCoins[v − c [i ]] < MinCoins[v ] then
9: MinCoins[v ] = 1 + MinCoins[v − c [i ]]
10: OptCoin[v ] = c[i ] // Remember the best coin
11: return MinCoins[V ]
12:
13: function GET_COINS(c[1..n ], V , MinCoins[0..V ], OptCoin[0..V ])
14: if MinCoins[V] = ∞ then return error
15: coins = empty list
16: while V > 0 do
17: coins.append(OptCoin[V ]) // Take the best coin
18: V = V - OptCoin[V ]
19: return coins
Figure 7.2: An example instance of a knapsack problem. The optimal solution is to take all but
the green box for a total value of $15. Image taken from Wikimedia Commons1 .
Suppose that we are trying to fill a bag with capacity C kg. If we choose an item with weight x kg,
we are then required to fill the remaining (C − x)kg of space with the most valuable items. This
substructure is optimal since we always want to use the remaining (C − x)kg optimally. This
suggests that our subproblems should be the following:
MaxValue[c] = {the maximum total value of items that we can fit in a capacity of c kg}
for all 0 ≤ c ≤ C.
Deriving a recurrence
Using the substructure that we derived above, we can write a recurrence for the unbounded
knapsack problem like so (where item i has weight w_i and value v_i):
MaxValue[c] =
    0                                                  if no item has w_i ≤ c,
    max_{1 ≤ i ≤ n, w_i ≤ c} (v_i + MaxValue[c − w_i])   otherwise.
Notice that we do not need an infeasible case like we did for coin change since we are not
requiring that the knapsack be completely full.
Suppose we try the blue item. Since it takes up precisely 2kg of the 15kg backpack, there is 13kg
of space left for other items. We are then left with solving the subproblem of finding the most
value that we can accrue using 13kg of space using the remaining items. This is the optimal
substructure of the 0-1 knapsack problem. Note the emphasis highlighting the difference com-
pared to the unbounded version. In the 0-1 case, we need to never consider an item again if it
has already been used. One way to do this would be to keep track of the subset of items that
have been used so far, but this does not improve on the naive solution since we would need to
account for 2n different subsets, which is far too large for moderate values of n . Instead, the
key observation to make is that the order in which we consider items is completely irrelevant
since taking the 2kg item followed by the 4kg item is the same as taking the 4kg item followed
by taking the 2kg item. Therefore instead of keeping track of all subsets, we will just remember
that we have taken some items from the first i of them, for all values of i from 1 ≤ i ≤ n .
MaxValue[i , c ] = {The maximum value that we can fit in a capacity of c using items 1 to i }
for all 0 ≤ c ≤ C and 0 ≤ i ≤ n . This way, if we choose to take item i , then we know that we can
subsequently only consider items 1 to i −1. Note that by i = 0, we denote the empty set of items
to give us a simple base case.
Deriving a recurrence
Consider an instance of the knapsack problem consisting of n items of weight wi and value vi
for 1 ≤ i ≤ n and a backpack of capacity C kg. Since we are avoiding duplicate items, we should
not try using every item in the backpack, but rather, in a subproblem considering items 1 to i ,
we should simply try taking item i or not taking item i . This gives us two options, both of which
lead to recursively trying to fill the knapsack with the remaining items 1 to i − 1.
MaxValue[i, c] =
    0                                                            if i = 0,
    MaxValue[i − 1, c]                                           if w_i > c,
    max(MaxValue[i − 1, c], v_i + MaxValue[i − 1, c − w_i])      otherwise.
In the second case of the recurrence, we account for the fact that item i may not fit in the knap-
sack, in which case there is no decision to make, we simply can not take that item and should
consider the remaining i − 1 items.
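A bottom-up Python sketch of this recurrence (the interface and names are ours):

    def knapsack_01(weights, values, C):
        # MaxValue[i][c] = best total value using items 1..i with capacity c.
        n = len(weights)
        MaxValue = [[0] * (C + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for c in range(C + 1):
                MaxValue[i][c] = MaxValue[i - 1][c]              # do not take item i
                if weights[i - 1] <= c:                          # item i fits: try taking it
                    MaxValue[i][c] = max(MaxValue[i][c],
                                         values[i - 1] + MaxValue[i - 1][c - weights[i - 1]])
        return MaxValue[n][C]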
The edit distance between two strings is the minimum number of edit operations re-
quired to convert one of the strings into the other.
We can also define an optimal alignment of two strings which shows where the optimal num-
ber of insertions, deletions and substitutions occur. An optimal alignment for “computer” and
“commuter” would be
c o m p u t e r
c o m m u t e r
*
which contains one edit operation (replace 'p' with 'm'). An optimal alignment for the words
“kitten” and “sitting” is given by
k i t t e n -
s i t t i n g
* * *
which contains two substitutions and one insertion, so the edit distance between them is 3.
An optimal alignment of "ACAATCC" and "AGCATCG", for example, must end in one of three ways: align the final two characters, delete the final character of the first string, or insert the final character of the second string:
OPTIMAL_ALIGNMENT("ACAATC", "AGCATC")   followed by aligning C with G (a substitution),
OPTIMAL_ALIGNMENT("ACAATC", "AGCATCG")  followed by aligning C with - (a deletion),
OPTIMAL_ALIGNMENT("ACAATCC", "AGCATC")  followed by aligning - with G (an insertion).
The alignment of the remaining prefixes must be optimal because if it were not, we could sub-
stitute it for a more optimal one and achieve a better overall alignment. Let us therefore define
our subproblems as follows:
Dist[i , j ] = {The edit distance between the prefixes S1 [1..i] and S2 [1..j] }
for 0 ≤ i ≤ n , 0 ≤ j ≤ m.
Deriving a recurrence
Using the optimal subproblems observed above, we can derive a recurrence for the edit distance
between two strings S1 [1..n ] and S2 [1..m] like so. Consider the computation of the subproblem
Dist[i , j ], where i , j > 0. We have three choices:
1. Align S1 [i ] with S2 [ j ], which has a cost of zero if S1 [i ] = S2 [ j ], or one otherwise.
2. Delete the character S1 [i ], which has a cost of one.
3. Insert the character S2 [ j ], which has a cost of one.
If we choose to align S1 [i ] with S2 [ j ], then the subproblem that remains to be solved is to align
the remaining prefixes i.e. to minimise Dist[i − 1, j − 1]. If we delete the character S1 [i ], then
we need to align the remaining prefix of S1 with S2 , i.e. we are interested in minimising the sub-
problem Dist[i − 1, j ]. Finally, if we chose to insert the character S2 [ j ], then we are left to align
the remaining prefix of S2 with S1 , i.e. we want to optimise the subproblem Dist[i , j − 1]. Trying
all three possibilities and selecting the best one leads to the following general recurrence for
Dist[i , j ].
Dist[i, j] =
    i                                              if j = 0,
    j                                              if i = 0,
    min( Dist[i − 1, j − 1] + 1_{S1[i] ≠ S2[j]},
         Dist[i − 1, j] + 1,
         Dist[i, j − 1] + 1 )                      otherwise.
We have used the notation 1_{S1[i] ≠ S2[j]} to denote an indicator variable for S1[i] ≠ S2[j]. In other
words, it equals one if S1[i] ≠ S2[j], or zero otherwise. The base cases should be easy to understand.
If one of the two strings is empty, then an optimal alignment simply consists of inserting
all of the characters of the other string.
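A bottom-up Python sketch of this recurrence (our own rendering):

    def edit_distance(s1, s2):
        n, m = len(s1), len(s2)
        # Dist[i][j] = edit distance between the prefixes s1[:i] and s2[:j].
        Dist = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            Dist[i][0] = i                      # delete all i characters of s1
        for j in range(m + 1):
            Dist[0][j] = j                      # insert all j characters of s2
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                substitute = Dist[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
                delete = Dist[i - 1][j] + 1
                insert = Dist[i][j - 1] + 1
                Dist[i][j] = min(substitute, delete, insert)
        return Dist[n][m]

    # edit_distance("kitten", "sitting") == 3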
Multiplying an n × m matrix by an m × o matrix produces an n × o matrix, and doing so requires n · m · o scalar multiplications. For example, suppose A is a 4 × 2 matrix, B is a 2 × 5 matrix and C is a 5 × 3 matrix, and we wish to compute the product A × B × C.
There are several ways that we could compute this product, owing to the fact that matrix multi-
plication is associative. This means that we can bracket the product in any way we want and we
will still obtain the same answer. We can compute this product in either the order (A × B )×C or
A × (B × C). Surprisingly, the order that we choose to do the multiplications in actually has a substan-
tial effect on the amount of computational time required. If we choose the first order (A×B )×C ,
then we perform (4 × 2 × 5) + (4 × 5 × 3) = 100 scalar multiplications to obtain the answer. If we
use the second order A × (B × C ), then we only perform (2 × 5 × 3) + (4 × 2 × 3) = 54 scalar mul-
tiplications. That’s almost half the amount of work! The general optimal matrix multiplication
problem is the problem of deciding for a sequence of n matrices, the best way to multiply them
to minimise the number of scalar multiplications required.
For a sequence of four matrices, there are five different ways to parenthesise them as follows:
A × ((B × C ) × D ), A × (B × (C × D )), (A × B ) × (C × D ), (((A × B ) × C ) × D ) and (A × (B × C )) ×
D . In general, the number of ways to parenthesise a sequence of n matrices is exponential in
n (it is related to the famous Catalan number sequence), so trying all of them would be way
too slow. Dynamic programming can be used to quickly find a solution to the optimal matrix
multiplication problem.
Consider the very last multiplication that is performed when computing the product A_1 × A_2 × ... × A_n. It must have the form
(A_1 × A_2 × ... × A_k) × (A_{k+1} × A_{k+2} × ... × A_n),
where (A 1 × A 2 ... × A k ) and (A k +1 × A k +2 × ... × A n ) were multiplied earlier. If this is the optimal
order in which to multiply the matrices, then the two subproblems of multiplying (A 1 ×A 2 ...×A k )
and (A k +1 × A k +2 × ... × A n ) must also be multiplied in optimal order. If they were not, we could
substitute them for a better order and obtain a better solution. This is the optimal substructure
for the optimal matrix multiplication problem.
The dynamic programming solution for the optimal matrix multiplication problem therefore
involves trying every possible “split point” k at which to perform the final multiplication. The
problem is then recursively solved by asking for the optimal order in which to multiply matrices
1 to k and the remaining matrices k + 1 to n. Our subproblems should therefore be:
DP[i, j] = {the minimum number of scalar multiplications required to compute A_i × A_{i+1} × ... × A_j}
for all 1 ≤ i ≤ j ≤ n.
Deriving a recurrence
Let us denote the number of columns in the matrices A 1 , A 2 , ...A n by the sequence c1 , c2 , ..., cn .
Recalling that two adjacent matrices can only be multiplied if the number of columns of the
first is the same as the number of rows of the second, we know that if the sequence of matrices
(A i ) is valid, then the number of rows in A 2 , ...A n must be c1 , ...cn −1 . For completeness, we will
further denote the number of rows of A 1 as c0 . A base case for our problems is easy to identify.
It takes zero multiplications to multiply a sequence of just one matrix. So DP[i , i ] = 0 for all i .
For the general case, to find the optimal number of multiplications for the sequence A i , ...A j
where i < j , we must try every possible “split point” and recurse. If we perform the split af-
ter matrix k , where i ≤ k < j , then the total number of required multiplications is DP[i , k ] +
DP[k + 1, j ] plus the number of multiplications required to multiply (A i ×...×A k ) and (A k +1 ×...×
Aj). The matrix (Ai × ... × Ak) has exactly ci−1 rows and ck columns, and the matrix
(Ak+1 × ... × Aj) has exactly ck rows and cj columns. Hence the number of scalar multi-
plications required to multiply them is ci −1 × ck × c j . Therefore the total number of required
multiplications to multiply matrices i to j if we split after matrix k is given by
    DP[i, k] + DP[k + 1, j] + ci−1 × ck × cj.
Combining everything together and trying all possible split points, we end up with the recurrence
    DP[i, j] = 0                                                            if i = j,
    DP[i, j] = min_{i ≤ k < j} ( DP[i, k] + DP[k + 1, j] + ci−1 ck cj )     if i < j,
for 1 ≤ i ≤ j ≤ n. Note that we do not care about defining DP[i, j] for j < i since it does not
make sense to have reversed intervals. With this recurrence, the answer to the entire problem is
DP[1, n ], the optimal number of scalar multiplications required to multiply the entire sequence
A 1 × ... × A n .
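A bottom-up Python sketch of this recurrence is shown below (the naming is ours; dims[i] plays the role of ci, with dims[0] = c0 being the number of rows of A1):

    def matrix_chain(dims: list[int]) -> int:
        """dims[i-1] x dims[i] is the shape of matrix A_i, for i = 1..n.
        Returns the minimum number of scalar multiplications for A_1 x ... x A_n."""
        n = len(dims) - 1
        INF = float("inf")
        # dp[i][j] = optimal cost of multiplying A_i ... A_j (1-indexed, i <= j)
        dp = [[0] * (n + 1) for _ in range(n + 1)]
        for length in range(2, n + 1):            # subproblems ordered by interval length
            for i in range(1, n - length + 2):
                j = i + length - 1
                dp[i][j] = INF
                for k in range(i, j):             # try every split point
                    cost = dp[i][k] + dp[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                    dp[i][j] = min(dp[i][j], cost)
        return dp[1][n]

    # Example from the text: A is 4x2, B is 2x5, C is 5x3
    # matrix_chain([4, 2, 5, 3]) == 54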
Since there are roughly n²/2 subproblems and the computation of each subproblem (i, j) requires looping
over j − i previous subproblems, the time complexity of the dynamic programming solution
for optimal matrix multiplication is

    Σ_{i=1}^{n} Σ_{j=i+1}^{n} (j − i) = Σ_{i=1}^{n} ( Σ_{j=i+1}^{n} j − Σ_{j=i+1}^{n} i )
                                      = Σ_{i=1}^{n} Σ_{j=i+1}^{n} j − Σ_{i=1}^{n} (n − i) i
                                      = Σ_{i=1}^{n} ( n(n+1)/2 − i(i+1)/2 ) − n Σ_{i=1}^{n} i + Σ_{i=1}^{n} i²
                                      = n²(n+1)/2 − (1/2) Σ_{i=1}^{n} i² − (1/2) Σ_{i=1}^{n} i − n²(n+1)/2 + Σ_{i=1}^{n} i²
                                      = −n(n+1)/4 + (1/2) Σ_{i=1}^{n} i².

Finally, using the sum of squares formula Σ_{i=1}^{n} i² = n(n+1)(2n+1)/6, we obtain

    Σ_{i=1}^{n} Σ_{j=i+1}^{n} (j − i) = −n(n+1)/4 + n(n+1)(2n+1)/12 = (n³ − n)/6,

which is O(n³).
Chapter 8
Dynamic Programming Graph Algorithms
In this section, we will study dynamic programming graph algorithms. Dynamic programming
can, for instance, be used to solve the shortest path problem with negative weights.
Throughout this chapter, we will assume that we are dealing only with directed graphs.
• If no negative cycles are reachable from node s , then for every node t that is reach-
able from s there is a shortest path from s to t that is simple (i.e., no nodes are
repeated).
• Cycles with positive weight cannot be part of a shortest path.
• Given a shortest path that contains cycles of weight 0, the cycles can be removed
to obtain an alternative shortest path that is simple.
• Note that any simple path has at most |V | − 1 edges.
For a source node s, let OPT[i, v] denote the minimum weight of a path from s to v with at most i
edges. Let P be an optimal path with at most i edges that achieves total weight OPT[i, v]:
• If P has at most i − 1 edges, then OPT[i , v ] = OPT[i − 1, v ].
• If P has exactly i edges and (u , v ) is the last edge of P , then OPT[i , v ] = OPT[i − 1, u ] +
w (u , v ).
The recursive formula for dynamic programming is then
    OPT[i, v] = min( OPT[i − 1, v],  min_{u : (u,v) ∈ E} ( OPT[i − 1, u] + w(u, v) ) ).
To obtain the shortest paths we just need to compute OPT[|V | − 1, v ] for all vertices v ∈ V .
Commonly, a more space-efficient version of Bellman-Ford algorithm is implemented (in some
cases, this version also provides a speed-up, but no improvement in the worst-case time com-
plexity). The idea is that |V |−1 iterations of a loop are executed and in each iteration all edges are
sequentially relaxed to obtain the new distance estimates. A simple implementation is shown
in Algorithm 54. The time complexity of Bellman-Ford is easy to establish: the outer loop runs
for |V |−1 iterations, the inner loop runs for |E | iterations, and each relaxation takes O (1), so the
total time taken is O (|V ||E |).
Algorithm 54 Bellman-Ford
1: function BELLMAN_FORD(G = (V , E ), s )
2: dist[1..n ] = ∞
3: pred[1..n ] = null
4: dist[s ] = 0
5: for k = 1 to n − 1 do
6: for each edge e in E do
7: RELAX (e )
8: return dist[1..n ], pred[1..n ]
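As an illustration, one possible Python rendering of this pseudocode is shown below, with the RELAX step written inline and the graph given as an edge list (names are our own):

    import math

    def bellman_ford(n: int, edges: list[tuple[int, int, float]], s: int):
        """Single-source shortest paths with negative edge weights allowed.
        n: number of vertices (labelled 0..n-1); edges: list of (u, v, w)."""
        dist = [math.inf] * n
        pred = [None] * n
        dist[s] = 0
        for _ in range(n - 1):            # |V| - 1 rounds
            for (u, v, w) in edges:       # relax every edge
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    pred[v] = u
        return dist, pred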
Correctness of Bellman-Ford
The key to establishing the correctness of Bellman-Ford lies in a simple property about paths
in graphs. We write a formal proof of correctness using induction.
The Bellman-Ford algorithm terminates with the correct distance estimates to all ver-
tices whose shortest path is well defined after at most |V | − 1 iterations.
Proof
In a well defined shortest path, the number of edges on the path cannot be greater than
|V| − 1. Why? If a path contained more than |V| − 1 edges, then it must visit some vertex
multiple times and hence contain a cycle, which we know from our discussion before
that we can ignore. We claim that after k iterations of the Bellman-Ford algorithm,
all valid shortest paths consisting of at most k edges are correct.
We can prove this formally by induction on the number of edges in each shortest path.
For the base case, we initialise dist[s] = 0, which is correct if there is no negative cycle
containing s . Hence all valid shortest paths containing zero edges are correct.
Inductively, suppose that all valid shortest paths containing at most k edges are correct
after k iterations. Let’s argue that all valid shortest paths containing at most k + 1 edges
are correct after one more iteration. Consider some shortest path from s to v consisting
of at most k + 1 edges. By the substructure of shortest paths, the sub-path from s to u consisting
of all but the final edge (u, v) is a shortest path, and by our inductive hypothesis,
is correctly estimated at the current iteration of the algorithm. If the current distance
estimate of v is incorrect, it will be relaxed in this iteration by the edge (u , v ) and hence
be made correct since δ(s , v ) = δ(s , u ) + w (u , v ) = dist[u ] + w (u , v ).
Hence by induction on the number of edges in the shortest paths, all valid shortest paths
consisting of at most k edges are correctly estimated after k iterations. Therefore the
Bellman-Ford algorithm terminates correctly after at most |V | − 1 iterations.
After the algorithm terminates (and after the post-processing step that checks for undefined shortest paths), the reported distance of a vertex should be:
1. ∞ if the vertex in question was never reached (it is not connected to the source s ).
2. −∞ if the vertex in question is reachable via a negative cycle and hence does not have a
shortest path.
Note that the pred array may contain cycles if there are negative cycles present, so do not at-
tempt to reconstruct shortest paths to vertices whose distance is undefined!
Optimising Bellman-Ford
There are some simple optimisations that can be made to Bellman-Ford that improve its run-
ning time:
1. During the iterations of the main k loop, if no relaxations are successful, then the shortest
paths have already been found and we can terminate early.
2. If such an early termination occurs, then we are guaranteed that no negative cycles exist
and hence we can skip the post-processing step that checks for undefined shortest paths.
3. Rather than iterating over every single edge, we could instead maintain a queue of edges
that need to be relaxed. If an edge is not successfully relaxed in one phase, then it clearly
need not be relaxed in the next phase either unless its source vertex has its distance esti-
mate improved. For sparse graphs, this improves the empirical average case complexity
of the algorithm to just O (|E |), although the worst-case behaviour is unfortunately still
O (|V ||E |). This variant is often referred to as the “Shortest Path Faster Algorithm.”
The distance matrix is initialised with dist[v][v] = 0 for all vertices v and dist[u][v] = w(u, v) for all edges e = (u, v). Observe that if the graph
is represented as an adjacency matrix, then initialising the distance matrix dist is easy as this
is nothing but a copy of the adjacency matrix! See the implementation in Algorithm 56. The
time complexity of Floyd-Warshall is obvious. There are three loops from 1 to n and a constant
time update each iteration, so the total runtime is O (|V |3 ). The space complexity is O (|V |2 )
since we store the |V | × |V | dist matrix (note that the algorithm implicitly already uses dynamic
programming space savings techniques).
Algorithm 55 Floyd-Warshall
1: function FLOYD_WARSHALL(G = (V, E))
2: dist[1..n ][1..n ] = ∞
3: dist[v ][v ] = 0 for all vertices v
4: dist[u ][v ] = w (u , v ) for all edges e = (u , v ) in E
5: for each vertex k = 1 to n do
6: for each vertex u = 1 to n do
7: for each vertex v = 1 to n do
8: dist[u ][v ] = min(dist[u ][v ], dist[u ][k ] + dist[k ][v ])
9: return dist[1..n ][1..n ]
Comparing this against the running times of the repeated single-source solution, we can see
that for dense graphs, where |E | ≈ |V |2 , Floyd-Warshall is the best or equal best in all cases. For
sparse graphs however (|E | ≈ |V |), Floyd-Warshall is beaten by O (|V |2 ) and O (|V |2 log(|V |)).
Finally, note that just like Bellman-Ford, Floyd-Warshall can handle negative weights just fine,
and can also detect the presence of negative weight cycles. To detect negative weight cycles,
simply inspect the diagonal entries of the dist matrix. If a vertex v has dist[v ][v ] < 0, then it
must be contained in a negative weight cycle.
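A compact Python sketch of Floyd-Warshall including this negative-cycle check might look as follows (adjacency-matrix input with math.inf for missing edges; the names are our own):

    import math

    def floyd_warshall(adj: list[list[float]]):
        """adj[u][v] = w(u, v), 0 on the diagonal, math.inf where no edge exists.
        Returns (dist, has_negative_cycle)."""
        n = len(adj)
        dist = [row[:] for row in adj]            # start from a copy of the adjacency matrix
        for k in range(n):
            for u in range(n):
                for v in range(n):
                    if dist[u][k] + dist[k][v] < dist[u][v]:
                        dist[u][v] = dist[u][k] + dist[k][v]
        # A vertex contained in a negative cycle ends up with a negative distance to itself
        has_negative_cycle = any(dist[v][v] < 0 for v in range(n))
        return dist, has_negative_cycle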
Correctness of Floyd-Warshall
Floyd-Warshall produces the correct distances for all pairs of vertices (u , v ) such that
there exists a shortest path between u and v .
Proof
We will establish the following invariant: After k iterations of Floyd-Warshall, dist[u ][v ]
contains the length of a shortest path from u to v consisting only of intermediate vertices
from 1 to k .
For k = 0, we have performed no iterations yet. dist[u ][v ] is therefore equal to w (u , v ) if
an edge exists between u and v . Using no intermediate vertices at all, this is the only path
possible, and hence it is the shortest path possible. If u = v , then dist[u ][v ] = 0 which is
optimal using no other vertices. If there is no edge (u , v ), then dist[u ][v ] = ∞ which is
correct since no paths are possible yet.
Suppose that after iteration k, dist[u][v] contains the length of a shortest path from u to v
consisting only of intermediate vertices from 1 to k. We will show that after iteration
k + 1, dist[u][v] contains the length of a shortest path from u to v consisting only of inter-
mediate vertices from 1 to k + 1. Consider a shortest path between some vertices u and
v consisting of intermediate vertices in 1 to k + 1. If vertex k + 1 is not a part of the
path then by the inductive hypothesis we already have the optimal length. Otherwise,
this path contains only one occurrence of vertex k + 1, since if it contained multiple, it
would contain a cycle. Therefore there exist two sub-paths, one from u to k + 1 and one from k + 1 to v,
which do not contain vertex k + 1 as an intermediate vertex. By the inductive hypothe-
sis, dist[u ][k +1] and dist[k +1][v ] are optimal, and since together they form the shortest
path, the update will set dist[u ][v ] = dist[u ][k + 1] + dist[k + 1][v ] which is now optimal.
Therefore after iteration k + 1, dist[u][v] contains the length of a shortest path from u to v consisting of
intermediate vertices from 1 to k + 1.
By induction on k, after iteration n, dist[u][v] contains the length of a shortest path
from u to v using any intermediate vertices, hence the distances are optimal.
The transitive closure of a graph G is a new graph G 0 on the same vertices such that
there is an edge (u , v ) in G 0 if and only if there is a path between u and v in G .
Finding the transitive closure of a graph can be reduced to the problem of finding all-pairs
shortest paths by noting that there is an edge between u and v in the transitive closure if and
only if dist[u ][v ] < ∞. We can save some space by observing that we only ever care about
whether or not two vertices are connected, not what their distance value is. Given this, we can
store our connectivity matrix using single bits for each entry (a data structure called a bitset or
bitvector in some languages), where 0 means disconnected and 1 means connected, and update
them by noting that vertex u is connected to vertex v via vertex k if and only if connected[u][k]
and connected[k][v] are both true. The space complexity of the algorithm therefore becomes
O(|V|²/w), where w is the number of bits in a machine word. The time complexity can also be
improved to O(|V|³/w) by using bitwise operations, but we won't discuss that for now.
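For instance, a sketch of this idea in Python, using arbitrary-precision integers as bitsets, might look like the following (names are ours):

    def transitive_closure(adj: list[list[bool]]) -> list[int]:
        """Transitive closure using Python integers as bitsets.
        reach[u] has bit v set iff v is reachable from u."""
        n = len(adj)
        reach = [sum(1 << v for v in range(n) if adj[u][v] or u == v) for u in range(n)]
        for k in range(n):
            for u in range(n):
                if reach[u] >> k & 1:        # u can reach k ...
                    reach[u] |= reach[k]     # ... so u can reach everything k can reach
        return reach

Each OR of two n-bit integers costs O(n/w) machine operations, which is where the factor of w in the time complexity comes from.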
The dependencies of the subproblems are clear from the graph, we must compute the longest
path for all descendants of a node before we can compute the result for that node. In other
words, the subproblems are dependent in a reverse topological order. Computing a topological
order and iterating over every vertex and edge in the graph takes O (|V |+|E |) time, hence we can
solve the critical path problem in O (|V | + |E |) time.
Since we are required to compute the topological order, we might as well solve the problem
top-down recursively which removes the need to know the order of dependencies of the sub-
problems. This results in a very short and elegant solution which also runs in O(|V| + |E|) time.
The resulting recursive algorithm looks a lot like a depth-first search (it is doing one after all), and it looks a lot like our top-down dynamic pro-
gramming algorithms (which it is also).
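A top-down memoised sketch of this idea in Python (our own naming; here the value for a node is the length, in edges, of the longest path starting at it) might look like:

    import sys

    def longest_path_lengths(n: int, adj: list[list[int]]) -> list[int]:
        """Length of the longest path starting at each vertex of a DAG.
        adj[u] lists the successors of u; the memoised recursion is a DFS in disguise."""
        sys.setrecursionlimit(max(1000, 2 * n + 10))
        memo = [None] * n

        def longest_from(u: int) -> int:
            if memo[u] is None:
                # 1 + best over all successors; 0 if u has no outgoing edges
                memo[u] = max((1 + longest_from(v) for v in adj[u]), default=0)
            return memo[u]

        return [longest_from(u) for u in range(n)]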
Although it might not be immediately obvious, this problem actually illuminates a very inter-
esting fact about dynamic programming that we saw when comparing top-down vs. bottom-up
solutions. When computing a dynamic program bottom-up, it is necessary to know the order
in which the subproblems are dependent on each other so that all of the necessary values are
available when needed to compute each subsequent subproblem. When implementing a solu-
tion top-down however, we did not need to know the order of the subproblems since they would
be computed precisely when needed.
The subproblems cannot depend on each other cyclically, so their dependencies on each other
actually form a DAG, where each subproblem is a vertex, and each dependency is a directed
edge. The order in which we compute the subproblems must therefore be a valid topological
order of that DAG. Whenever we compute a solution to a dynamic program using top-down
recursion, we are actually performing a depth-first search on the subproblem graph, which
means that we are actually producing a reverse topological order as a by-product! This explains
why top-down solutions do not need to know a valid order in advance, because they are per-
forming precisely the algorithm that computes one, even if we did not realise it.
This realisation allows us to rephrase many dynamic programming problems as path prob-
lems on DAGs. For example, recall the coin change problem from Chapter 7. If we construct
a graph G where each node corresponds to a dollar amount $V , and add edges between the
dollar amounts (V , V − ci ) for all coin denominations ci and all dollar amounts V , then the coin
change problem is precisely the problem of finding the shortest path from the desired dollar
amount in this DAG to the $0 vertex!
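To illustrate the reduction, a minimal sketch under these assumptions is shown below: since every edge of this DAG is unweighted, the shortest path (i.e. the minimum number of coins) can be found with a breadth-first search.

    from collections import deque

    def min_coins(amount: int, coins: list[int]) -> int:
        """Minimum number of coins summing to `amount`, phrased as an unweighted
        shortest path from `amount` down to 0 on the DAG described above."""
        dist = [-1] * (amount + 1)
        dist[amount] = 0
        queue = deque([amount])
        while queue:
            v = queue.popleft()
            if v == 0:
                return dist[0]
            for c in coins:                    # edge (v, v - c) for each denomination
                if v - c >= 0 and dist[v - c] == -1:
                    dist[v - c] = dist[v] + 1
                    queue.append(v - c)
        return -1                              # the amount cannot be made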
Chapter 9
Network Flow
Network flow models describe the flow of some material through a network with fixed capac-
ities, with applications arising in several areas of combinatorial optimisation. On the surface,
network flow can be used to model fluid in pipes, telecommunication networks, transport net-
works, financial networks and more. Beyond its obvious applications however, network flow
turns out to be useful and applicable to solving a large range of other combinatorial problems,
including graph matchings, scheduling, image segmentation, project management, and many
more.
Figure 9.1: A road network of major cities in Canada. The edge weights represent the capacity
of the roads, say the number of trucks that can be sent along that particular road per day.
Image source: CLRS
which results in a total flow of 23 trucks per day. See Figure 9.2.
Figure 9.2: An optimal solution to the road network problem in Figure 9.1. A label of x /y
indicates that we send x trucks along the road which has capacity y . Image source: Adapted
from CLRS
All of these are examples of network flow problems. In short, we have a network consisting of
nodes, one of which is the “source”, the location at which the flow is produced, and the “sink”,
the location at which the flow is consumed. The problem that we will consider is the maximum
network flow problem, or max-flow for short, where we are interested in maximising the total
amount of flow that can be sent from the source node to the sink node.
f : V × V → R+
• Flow conservation: The amount of flow entering a node must be equal to the
amount of flow leaving that node, with the exception of the source and sink. That
is, for all u ∈ V \ {s, t},
    Σ_{v∈V} f(v, u) = Σ_{v∈V} f(u, v).
The maximum flow problem then seeks to find a flow f that maximises the value
    |f| = Σ_{v∈V} f(s, v) − Σ_{v∈V} f(v, s),
i.e. the net amount of flow leaving the source s.
a The maximum flow problem can in fact be generalised to consider non-integer capacities, although the
Figure 9.3: A flow on the road network from Figure 9.1. No more capacitated paths exist from
the source to the sink, but the total flow of 19 is not optimal. In an optimal solution, we’d like
to redirect the 4 flow that is going from v3 to v2 to go to t instead. In other words, we need a
way to send suboptimal flow choices back and redirect them. Image source: CLRS
Given a flow f on the network, the residual network G_f contains, for each edge (u, v) ∈ E, a forward edge with residual capacity
    c_f(u, v) = c(u, v) − f(u, v).
Additionally, the residual network contains back edges or reverse edges, which are edges that flow
in the opposite direction to those in E . The back edges are what allow the algorithm to “undo”
flow by pushing flow back along the edge that it came from, and along a new direction instead.
For each edge (u , v ) with positive flow f (u , v ), the residual network contains an edge (v, u ) with
residual capacity
c f (v, u ) = f (u , v ).
In other words, the back edge (v, u ) has the ability to cancel out the flow that currently exists on
the edge (u , v ) by redirecting it down a different path.
Figure 9.4: The residual network G f corresponding to the network and flow in Figure 9.3. The
green labelled edges are the forward edges from E with capacity c (u , v ) − f (u , v ). The red
edges are the back edges with capacity f (v, u ). Image source: Adapted from CLRS
Chapter 9. Network Flow 113
• Augmenting along a forward edge simply corresponds to pushing more flow through that
edge on the way from the source to the sink.
• Augmenting along a back edge corresponds to redirecting flow along that edge down a
different path. The flow is reduced on the respective forward edge, and is moved to the
remainder of the augmenting path.
Although the flow on the road network depicted in Figure 9.3 has no more capacitated paths
from s to t , the residual network does indeed have a capacitated path given by s → v2 → v3 → t .
The total capacity of the path is the minimum capacity of the edges encountered along the
path (that is the bottleneck of the path), as clearly we can send no more than this amount of
additional flow along the path (or we would violate a capacity constraint.) Filling an edge to full
capacity is often referred to as “saturating” the edge. Similarly, an edge is called saturated if the
flow along it is at capacity.
Figure 9.5: An augmenting path p in the residual network G_f from Figure 9.4. The total capacity
of the augmenting path is 4, the bottleneck of the edge capacities in p. Image source: Adapted
from CLRS
Augmenting 4 units of flow along the path p depicted in Figure 9.5 will result in a total flow of 23
units which is now optimal. The key thing to understand here is that by augmenting along a back
edge, we have redirected the flow that was once flowing along v3 → v2 → v4 → t to flow directly
from v3 → t instead. As this flow is redirected away, the 4 units of flow that was being supplied
to v2 by v3 is now supplied by s , so that the flow along v2 → v4 → t is not reduced. Since any
flow that is reduced by augmenting along a back edge always gets replaced, an augmentation
always increases the total value of the flow.
The method
We now have everything we need to describe the Ford-Fulkerson method. See the implemen-
tation in Algorithm 60. Of course, the Ford-Fulkerson method is really just a skeleton. The
method tells us what to do, but does not specifically tell us how to do it. There are many possi-
ble strategies for actually finding augmenting paths, each of which will result in a different time
complexity.
Note that the Ford-Fulkerson method will always find integer-valued flows (i.e., flows in which the
flow on every edge has an integer value). This is because the initial flow is 0 (so the invariant holds at
the beginning) and all edge capacities are integers. Therefore in each step of the loop the flow
will be augmented along the augmenting path by an integer amount, thus keeping the invariant.
Use zero capacity edges to effectively represent back edges: In a naive implementation, we
might try to write separate code for dealing with forward edges and back edges, but we can
actually use a very simple trick to make them work the same, eliminating repeated code. All
we need to do is define the capacity of the back edges to be 0, and to maintain that their flow
value is always equal to the negative of the flow on the corresponding forward edge. This works
because the formula for the residual capacity on an edge,
    c_f(u, v) = c(u, v) − f(u, v),
when applied to a back edge (v, u) with capacity c(v, u) = 0 and stored flow f(v, u) = −f(u, v), gives c_f(v, u) = 0 − (−f(u, v)) = f(u, v), which is exactly the definition of the residual capacity on a back edge. In other words, this formulation
allows us to treat forward and back edges the same since the formula for their residual capacities
is now the same.
Augment flow in the same routine that finds an augmenting path: Once an augmenting path
has been found, we know that we are going to augment along it, so we might as well make our
depth-first search that finds augmenting paths also perform the augmentation at the same time.
Otherwise we would have to write roughly twice as much code: first locate
an augmenting path, and then separately traverse it again to add the flow.
Edge structure: Let’s assume that we have an object structure representing an edge of the net-
work, which contains the fields capacity, indicating the edge’s capacity (remember that this
will be zero for back edges), flow, which will indicate the current flow on that edge, and reverse
which contains a reference to its reverse edge (the back-edge corresponding to the forward edge
or vice versa).
Throwing all of these tricks together yields the pseudocode depicted in Algorithm 61.
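Algorithm 61 itself is given as pseudocode; purely as an illustration, a Python sketch that combines the same tricks (zero-capacity back edges, an edge object with a reverse reference, and augmentation performed inside the depth-first search) might look like this. The class and method names are our own.

    class Edge:
        """One direction of an edge; `reverse` points to its paired back edge."""
        def __init__(self, to: int, capacity: int):
            self.to = to
            self.capacity = capacity   # 0 for back edges
            self.flow = 0              # kept equal to -(flow on the paired edge)
            self.reverse = None

        def residual(self) -> int:
            return self.capacity - self.flow

    class FlowNetwork:
        def __init__(self, n: int):
            self.adj = [[] for _ in range(n)]

        def add_edge(self, u: int, v: int, capacity: int):
            forward, backward = Edge(v, capacity), Edge(u, 0)
            forward.reverse, backward.reverse = backward, forward
            self.adj[u].append(forward)
            self.adj[v].append(backward)

        def dfs_augment(self, u: int, t: int, bottleneck: int, visited: list) -> int:
            """Find an augmenting path and push flow along it on the way back."""
            if u == t:
                return bottleneck
            visited[u] = True
            for e in self.adj[u]:
                if not visited[e.to] and e.residual() > 0:
                    pushed = self.dfs_augment(e.to, t, min(bottleneck, e.residual()), visited)
                    if pushed > 0:
                        e.flow += pushed
                        e.reverse.flow -= pushed
                        return pushed
            return 0

        def max_flow(self, s: int, t: int) -> int:
            total = 0
            while True:
                pushed = self.dfs_augment(s, t, float("inf"), [False] * len(self.adj))
                if pushed == 0:
                    return total
                total += pushed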
Time Complexity
The time complexity of the Ford-Fulkerson algorithm depends on the strategy selected to find
augmenting paths. Let’s consider the simplest variant in which we use a depth-first search to
find any augmenting path.
• Finding an augmenting path if one exists using a depth-first search from the source s takes
O (|E |) time.
• In the worst case, each augmentation only increases the amount of flow by one unit, and
hence it could take up to |f_max| augmentations, where |f_max| is the value of a maximum
flow.
Therefore, the total time complexity of the Ford-Fulkerson algorithm is O(|E| × |f_max|).
Selecting the augmenting path with the largest possible capacity can be achieved by finding a maximum spanning tree1 in the residual
graph G_f with Prim's algorithm and then augmenting along the resulting path from s to
t. This is called the fattest augmenting path algorithm and results in a runtime of
    O( |E|² log(|V|) log( |E| · max_{u,v∈V} c(u, v) ) ),
which is better, although much trickier to read, than the complexity of Edmonds-Karp!
Let’s illustrate this with a realistic example. Let’s say you have a road network consisting of
directed roads. Suppose that your tutor lives at vertex s and that Monash University is at vertex
t . You have a test coming up in your next lab, so you want to sabotage your tutor’s chances of
getting to the lab on time to avoid taking the test! For each directed road, you know the dollar
amount that it would cost to have that road temporarily shut down and blocked off. What is the
minimum amount of money that you’ll have to spend in order to ensure that there is no way for
your tutor to get to the university?
Figure 9.6: A road network, where edges are labelled by the cost of blocking off that particular
road. Image source: Adapted from CLRS
This is precisely an example of a minimum cut problem. If we interpret the costs of blocking
each road as the edges capacity, then the minimum cut is the cheapest way to disconnect s from
t . Consider the problem in Figure 9.6. By inspection, we can see that the minimum cut for the
network above corresponds to blocking off the roads v1 → v3 , v4 → v3 and v4 → t for a total cost
of $23.
The minimum cut problem is intimately related to the maximum flow problem. Notice that
this road network is actually the same network that we used as an example for maximum flow
in the previous section, whose maximum flow turned out to be exactly 23 units. This is not a
coincidence.
1 Exactly the same as a minimum spanning tree except selecting the heaviest edges instead of the lightest.
Figure 9.7: An optimal solution to the minimum cut problem. Image source: Adapted from
CLRS
where as usual, we take c (u , v ) = 0 if there does not exist an edge (u , v ). The minimum
cut problem seeks an s-t cut C = (S , T ) such that c (S , T ) is as small as possible.
The minimum cut problem and the maximum flow problem are related by the min-cut max-
flow theorem. Informally, in a given flow network, the minimum cut and maximum flow prob-
lems have the same solution.
Given a flow network G = (V , E ) with a source vertex s and sink vertex t , the value of a
maximum s-t flow in G is equal to the minimum capacity s-t cut in G . Mathematically:
    max_f |f| = min_{(S,T)} c(S, T).
Proof
For a rigorous mathematical proof of the theorem, see CLRS, Theorem 26.6. We will
consider a more intuitive version of the proof, which follows roughly the same structure
as the formal proof, but using intuitive arguments in place of rigorous mathematical
arguments.
If we consider a flow network G , some flow f and some cut (S , T ), the net amount of
flow that crosses the cut, i.e. moves through edges from S into T must be equal to the
value of the flow | f |. This has to be true because if | f | total units of flow are moving from
s (contained in S ) to t (contained in T ), then they must cross through the cut (S , T ).
Given this, the value of the maximum flow cannot be greater than the capacity of any
(S , T ) cut, since it must travel through it, so in particular, the value of the maximum flow
must be less than or equal to the capacity of the minimum cut.
Consider a maximum flow f and the particular s-t cut (S , T ) such that S contains all ver-
tices reachable from s in the residual graph G f , and T contains all remaining vertices.
(S , T ) is a valid cut since t is not reachable from s , otherwise there would be an aug-
menting path, which would imply that the flow f was not in fact maximum. Suppose
that the value of f was less than the capacity of (S , T ), i.e. that | f | < c (S , T ), then either
there is an edge crossing (S , T ) that is not saturated, or there is an edge bringing flow
from T into S . If there is an unsaturated edge (u , v ) such that u ∈ S and v ∈ T , then
there is a path from S to T in the residual graph, which is a contradiction. Similarly, if
there is an edge bringing flow from T into S , then the corresponding back-edge from
S into T has non-zero residual capacity, and hence there is a path from S to T in the
residual graph, which is also a contradiction. Therefore, the value of the flow f must be
equal to the capacity of the cut (S , T ).
Since the value of the maximum flow f cannot exceed the capacity of any cut, and we
have found a cut such that | f | = c (S , T ), we can conclude that the value of the maximum
flow is equal to the capacity of the minimum cut.
Not only does the proof above illuminate why the theorem is true, it also shows us how to con-
struct a minimum cut if we have found a maximum flow. If G = (V , E ) is a network and f is the
maximum flow, then the minimum cut (S , T ) can be found by taking S equal to all of the vertices
that are reachable from s via the residual graph G f , and taking T as all remaining vertices. The
proof shows that this will indeed be a minimum s-t cut in G . The min-cut max-flow theorem
also gives us an easy tool to prove that the Ford-Fulkerson algorithm is correct.
Ford-Fulkerson terminates when there are no augmenting paths left. Since each aug-
menting path increases the flow by an integer, and the maximum flow is finite, the algo-
rithm eventually terminates. When there does not exist an augmenting path from s to
t , the flow across the cut (S = {v : v reachable from s in G f }, T = V \ S ) is equal to the
capacity of the cut. By the min-cut max-flow theorem, this flow is therefore maximum.
Figure 9.8: The residual network G f for the maximum flow in the network in Figure 9.6. The
vertices reachable from s are {s, v1, v2, v4}, leaving T = {v3, t}, across which the capacity can
be seen to equal 12 + 7 + 4 = 23, which agrees with the value of the maximum flow f. Image
source: Adapted from CLRS
Figure 9.9: A bipartite graph composed of the two sides, L = {1, 2, 3, 4, 5} and R = {6, 7, 8, 9}, and
a maximum matching which has size three.
Consider the example in Figure 9.9. Suppose that the set L represents job applicants, and the
set R are available jobs. There is an edge between the applicant u and the job v if person u is
qualified for job v . The bipartite matching problem asks us to match qualified applicants to
jobs such that no applicant gets multiple jobs, and no job is taken by multiple people. For this
example, the greatest number of applicants that can be successfully matched is three.
Although the maximum bipartite matching problem does not sound very similar to the maxi-
mum flow problem, it can actually be solved by constructing a corresponding flow network and
finding a maximum flow. The key idea is to use units of flow to represent matches. We construct
the corresponding flow network by taking the bipartite graph B = (V , E ) and adding new source
(s ) and sink (t ) vertices. We then connect the source vertex s to every vertex in the left subset L
and connect every vertex in the right subset R to the sink vertex t , and direct all of the original
edges in E to point from L to R . We then finally assign every edge a capacity of one.
Figure 9.10: The corresponding flow network for the bipartite graph in Figure 9.9. Capacities
are omitted since all edges have capacity 1.
Since every edge has a capacity of one, at most one unit of flow can be sent from s through any
vertex u ∈ L . For an integer-valued flow, this ensures that each applicant can get at most one
job. Also, at most one unit of flow can leave any v ∈ R into t , which ensures that each job gets
at most one applicant. After finding a maximum flow, matches will correspond to edges from
L to R that are saturated. The flow being a maximum flow ensures that the maximum possible
number of matches are obtained.
Figure 9.11: A maximum flow in the corresponding flow network. The highlighted edges are
saturated with a flow of one, and form a maximum matching of size 3.
Given a bipartite graph B = (V , E ), the maximum possible number of matches cannot be greater
than the number of vertices, so the value of the flow in the corresponding flow network is bounded
above by | f | ≤ |V |. Using the Ford-Fulkerson algorithm to find the maximum flow therefore
results in a runtime of O(|V||E|) to solve the maximum bipartite matching problem. A faster al-
gorithm called Hopcroft-Karp solves the maximum bipartite matching problem in O(√|V| · |E|)
by using the same flow network but a faster maximum flow algorithm.
Using Ford-Fulkerson to find maximum matchings is sometimes called the alternating paths al-
gorithm for bipartite matching, since the augmenting paths that will be found by Ford-Fulkerson
will strictly alternate between the left and right parts of the graph using forward edges and back
edges sequentially. In practical implementations of the algorithm, constructing the flow net-
work explicitly is not required, and much unnecessary code can be avoided by simply keeping
track of the matches such that augmenting paths can be found by beginning at an arbitrary un-
matched vertex and traversing the graph, alternating between unmatched vertices on the left
and matched vertices on the right until an unmatched vertex is discovered on the right.
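A minimal Python sketch of this alternating-path approach (often attributed to Kuhn) is shown below; it keeps only the current matches rather than an explicit flow network, and the names are our own.

    def bipartite_matching(adj: list[list[int]]) -> int:
        """Maximum bipartite matching via alternating (augmenting) paths.
        adj[u] lists the right-side vertices that left vertex u may be matched to."""
        n_left = len(adj)
        match_of_right = {}                      # right vertex -> matched left vertex

        def try_augment(u: int, visited: set) -> bool:
            for v in adj[u]:
                if v in visited:
                    continue
                visited.add(v)
                # v is free, or its current partner can be rematched elsewhere
                if v not in match_of_right or try_augment(match_of_right[v], visited):
                    match_of_right[v] = u
                    return True
            return False

        matching_size = 0
        for u in range(n_left):
            if try_augment(u, set()):
                matching_size += 1
        return matching_size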
Below we show an example of the circulation with demands problem in Figure 9.12, and a so-
lution to this instance of the problem in Figure 9.13.
Figure 9.12: An instance of the circulation with demands problem. The demand is indicated in
each vertex, and the capacity in each edge.
But how can we reduce the circulation with demands problem to a standard max-flow problem
so that we can use Ford-Fulkerson to solve it?
Figure 9.13: A solution for the circulation with demands instance above, with the flows along
each edge indicated in red.
Figure 9.14: The supergraph G′ obtained from G presented in Figure 9.12 so that a max-flow
problem can be solved.
We now solve the max-flow problem on G′ to obtain a maximum flow f′_max between s and t
in G′. A solution of the max-flow problem in our example is presented in Figure 9.15.
Let D = Σ_{u∈V : d_u > 0} d_u = Σ_{u∈V : d_u < 0} (−d_u). Remember that a necessary condition for the existence
of a feasible circulation is that the total demand equals the total supply, i.e. Σ_{u∈V} d_u = 0, which is why these two sums are equal.
Note that |f′_max| cannot be bigger than D, as the cut C = (S, T) with S = {s} has capacity equal
to D by construction.
Figure 9.15: A maximum flow in G′ from Figure 9.14, with flow/capacity labels on each edge. All edges leaving s and all edges entering t are saturated, so the value of the flow is D.
On the other hand, suppose that there is a (max-)flow f′ in G′ with value D; then it must be the
case that all outgoing edges of s are saturated (as by design the sum of their capacities is D).
Similarly, all incoming edges of t must be saturated. Therefore, to obtain a feasible solution f to
the circulation with demands {d_u} in G, we simply need to delete these edges from f′ (note that
each vertex u ≠ s, t met the flow conservation constraint in f′, so after deleting the mentioned
edges, it will now meet the demand constraint Σ_{v∈V} f(v, u) − Σ_{v∈V} f(u, v) = d_u).
The method for solving the circulation with demands problem is presented below.
For many applications of network flow to combinatorial problems it is also useful to consider
lower bounds on the edges (i.e., we will enforce that at least a specified amount of flow go
through the edge). Therefore we introduce circulation with demands and lower bounds, as for-
mally specified below.
    ℓ(u, v) ≤ f(u, v) ≤ c(u, v).
Let’s go back to our previous example and now impose a lower bound of 1 on all edges, as de-
picted in Figure 9.16. From our previous solution to the circulation with demands problem, we
already know that it is feasible to solve this example even when a lower bound of 1 is put on
all edges, but how can the circulation with demands and lower bounds problem be solved in
general?
The idea is that we will reduce this problem to the problem of circulation with demands but no
lower bounds.
We first define the flow f^ℓ by setting, for each edge, f^ℓ(u, v) = ℓ(u, v). We then figure out whether
there exists a flow f* such that f = f^ℓ + f* is a feasible solution to the circulation with
demands {d_u} and lower bounds problem on G. Note that f^ℓ already ensures that the flow
f is at least equal to the lower bound on each edge. Now we just have to consider the related
graph G ∗ with adjusted capacities and demands:
• the vertices of G* are the same as in G, but they now have demand
  d*_u = d_u − Σ_{v∈V} f^ℓ(v, u) + Σ_{v∈V} f^ℓ(u, v);
• the edges of G* are the same as in G, but they now have capacities c*(u, v) = c(u, v) − ℓ(u, v) and no lower bounds;
and solve the problem of circulation with demands {d*_u} (without lower bounds) in G*.
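A small Python sketch of this reduction (computing the adjusted demands and capacities only; solving the resulting circulation still requires a max-flow routine) might look like the following, with names of our own choosing:

    def reduce_lower_bounds(n: int, edges: list[tuple[int, int, int, int]], demand: list[int]):
        """Reduce circulation with demands AND lower bounds to circulation with demands only.
        edges: list of (u, v, lower, capacity); demand[u] = d_u.
        Returns (adjusted_demands, adjusted_edges) describing G* as in the text."""
        adjusted_demand = demand[:]
        adjusted_edges = []
        for (u, v, lower, cap) in edges:
            # The compulsory flow f_l(u, v) = lower adjusts the demands:
            # d*_u = d_u - (incoming f_l) + (outgoing f_l)
            adjusted_demand[u] += lower
            adjusted_demand[v] -= lower
            adjusted_edges.append((u, v, cap - lower))   # c*(u, v) = c(u, v) - l(u, v)
        return adjusted_demand, adjusted_edges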
Following our example, we now consider the flow f^ℓ in Figure 9.17, the adjusted graph G* with
Figure 9.16: An instance of the circulation with demands and lower bound problem. The
demand is indicated in each vertex, and for each edge the capacity is indicated in black and
the lower bound in blue.
its solution f* in Figure 9.18, and the final solution f = f^ℓ + f* for the circulation with demands
{d u } and lower bounds problem in G in Figure 9.19.
Figure 9.17: The flow f^ℓ (in red) corresponding to the circulation with demands and lower
bound example in Figure 9.16.
Equipped with these extremely powerful tools, you will be able to solve many nontrivial com-
binatorial problems, as you can see from the examples given in the lectures and tutorials.
Figure 9.18: The adjusted graph G ∗ for the circulation with demands {d u∗ } problem (without
lower bounds) corresponding to the example in Figure 9.16, and its solution f ∗ in red.
Figure 9.19: The final solution f = f^ℓ + f* for the circulation with demands {d_u} and lower
bounds problem in G (lower bounds in blue, flow in red and capacities in black).
Chapter 10
Hashing and Hashtables
Storing and retrieving data is one of the most common scenarios in which we employ advanced
data structures. We will first look at arguably one of the most well-known data structures in
computer science, hashing. Hashing aims to solve the “dictionary” problem, where we need to
store and search for items associated with keys. Using hashing, we can usually support these
operations in just O (1) expected time! We will briefly revise the basic hashing schemes that you
should be familiar with, hashing with chaining and hashing with open addressing (probing).
Then we will cover a selection of good hash functions for random integer keys and string keys,
and briefly discuss the role of randomness in producing good hashes. We will then discuss a more
advanced hashtable strategy, cuckoo hashing.
1. INSERT (item): insert an item with a given key into the dictionary.
2. DELETE (item): delete an item with a given key from the dictionary.
3. LOOKUP (key): find the item with the given key if it exists.
Note that the items can contain other data in addition to the keys. Often the main data structure
used to solve the dictionary problems will store only the keys together with links to the associ-
ated data. In this chapter we focus only on the keys to simplify the explanation. We assume
that all items have unique keys, and that all keys come from a universe U , typically integers
{0, 1, ..., u − 1}. Non-integer keys should be converted to integers. We will see in the next chap-
ter that this problem can be solved using balanced binary search trees in O (log(n )) time per
operation, but we want to do better than this. The goal of hashtables is to solve the dictionary
problem in just O (1) expected time per operation.
Hashing
The solution to the prohibitive memory usage of direct addressing is to employ hashing. We
reduce the universe U down to a size that we can manage, say integers {0, 1, ..., m − 1} and use
a hash function
h : U → {0, 1, ..., m − 1}
to map keys onto the reduced universe. Items are then stored in a table of size m, with key k
being stored at position h (k ). This allows us to control the memory usage, but introduces a new
problem. Since it is likely that m ≪ |U |, it will be the case that there are distinct pairs of keys
ki , k j such that h (ki ) = h (k j ), i.e. they will map to the same value. We call such a pair of keys a
collision. We will explore several strategies for resolving collisions. First, we revise chaining and
open addressing. Later, we will explore an entirely different strategy, Cuckoo hashing.
(Diagram: a hash function h maps the keys k1, k2, k3, k4 from the universe U into slots 0 to 4 of the hash table.)
Chaining
Collision resolution via chaining involves maintaining a linked list at every position 0, 1, ..., m −1
in the table and inserting all elements with h (k ) = i into the linked list at position i . If we assume
that the hash function maps keys evenly to each slot in the hash table, then for a random set of
keys, the expected number of items in each chain is n /m, where n is the number of items in the
hashtable. We call the quantity α = n /m the load factor. Given that the expected chain length is
α, the average-case time complexity for chaining is therefore O (1 + α) per operation. If we keep
m ≥ n , which implies α ≤ 1, this yields average-case O (1) performance per operation!
(Diagram: collision resolution by chaining. Slot 0 holds the chain k1, slot 1 holds the chain k2 → k3, slot 3 holds the chain k4, and slots 2 and 4 hold empty chains.)
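A minimal Python sketch of a chained hashtable is shown below (the class, the choice h(k) = k mod m, and the use of Python lists as chains are all our own simplifications):

    class ChainedHashTable:
        """A minimal hashtable with chaining; h(k) = k mod m for simplicity."""
        def __init__(self, m: int = 8):
            self.m = m
            self.table = [[] for _ in range(m)]   # one chain (Python list) per slot

        def _h(self, key: int) -> int:
            return key % self.m

        def insert(self, key: int) -> None:
            chain = self.table[self._h(key)]
            if key not in chain:                  # keys are assumed unique
                chain.append(key)

        def lookup(self, key: int) -> bool:
            return key in self.table[self._h(key)]

        def delete(self, key: int) -> None:
            chain = self.table[self._h(key)]
            if key in chain:
                chain.remove(key)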
Deletion must be done with caution when using open addressing. Simply removing an item
from the table may make other items unreachable if they were placed there via probing. To
account for this, we should mark deleted slots with a flag that insertion is free to ignore and
overwrite, but lookup should interpret as an occupied slot.
If we once again assume that our hash function maps keys evenly into the available table slots,
and additionally assume that for every key, each of the m! possible probe sequences is equally
likely1, then the expected number of operations performed by probing is O(1/(1 − α)), since we have
probability at least (m − n)/m of finding an empty slot on each probe, which implies an expected number
of trials at most equal to
    1 / ((m − n)/m) = m / (m − n) = 1 / (1 − n/m) = 1 / (1 − α).
e.g. if α = 0.5, we expect at most two probes, for α = 0.75, four probes, etc. Keeping α below any
fixed constant therefore implies that we can perform operations in O (1) average-case time. In
practice, this means that we must increase the size of the table and select a new hash function
whenever the load factor exceeds this selected threshold. Common thresholds used are 0.25,
0.5, 2/3, and 0.75 depending on the probing method used.
We will look at three probing strategies that are commonly employed.
Linear Probing
In linear probing, we probe adjacent locations in the table until we find an empty slot:
h (k , i ) = (h 0 (k ) + i ) mod m,
where h 0 (k ) is the original hash function. In theory, linear probing performs quite poorly since
it is quite susceptible to primary clustering. If many keys end up around the same location,
1 This is a very unrealistic assumption, but it is commonly made since it makes for an easier analysis.
then it only further increases the probability that even more keys end up around that location.
However in practice, linear probing is incredibly efficient due to cache locality.
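A short Python sketch of open addressing with linear probing, including the tombstone-style deletion described earlier, might look like this (all names are ours; it assumes a key is not inserted twice):

    EMPTY, DELETED = object(), object()   # sentinel markers

    class LinearProbingTable:
        """Open addressing with linear probing: h(k, i) = (h'(k) + i) mod m."""
        def __init__(self, m: int = 16):
            self.m = m
            self.slots = [EMPTY] * m

        def _h(self, key: int) -> int:
            return key % self.m

        def insert(self, key: int) -> None:
            for i in range(self.m):
                j = (self._h(key) + i) % self.m
                if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                    self.slots[j] = key           # insertion may overwrite a tombstone
                    return
            raise RuntimeError("table is full; resize and rehash")

        def lookup(self, key: int) -> bool:
            for i in range(self.m):
                j = (self._h(key) + i) % self.m
                if self.slots[j] is EMPTY:        # a truly empty slot ends the probe sequence
                    return False
                if self.slots[j] == key:          # DELETED slots are treated as occupied
                    return True
            return False

        def delete(self, key: int) -> None:
            for i in range(self.m):
                j = (self._h(key) + i) % self.m
                if self.slots[j] is EMPTY:
                    return
                if self.slots[j] == key:
                    self.slots[j] = DELETED       # leave a tombstone so lookups keep probing
                    return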
Quadratic Probing
In quadratic probing, we follow the probe sequence:
h (k , i ) = (h 0 (k ) + c1 i + c2 i 2 ) mod m,
where h 0 (k ) is the original hash function and c1 and c2 are chosen constants. Quadratic probing
avoids the problem of primary clustering since keys jump further away, but we are still suscepti-
ble to secondary clustering, keys with the same hash values will always probe the same elements.
Double Hashing
Double hashing aims to avoid the problem of secondary clustering. In double hashing, we use
a second hash function to determine the distance between probed elements.
h (k , i ) = (h1 (k ) + i h2 (k )) mod m ,
where h1 (k ) is the original hash function, and h2 (k ) is the secondary hash function. Although
it eliminates secondary clustering, double hashing is sometimes less efficient in practice since
two hash functions have to be computed.
Given a group of n people, find the probability that there is at least one pair of people
with the same birthday.
Very unintuitively, the probability reaches 50% at just 23 people. At 367 people, the probability
is 100% since there are only 366 possible birthdays. If we consider the analogy with respect to
a hash table, we have a table of size 366, and with just 23 keys to insert, it is already more likely
than not that a collision will occur. In general, with n keys and a table of size m where n < m,
the probability that there exists a colliding pair is
    p(n, m) = 1 − m! / ( (m − n)! · mⁿ ),
which rapidly approaches 1 as n → m. In practice, the best we can do is to minimise the proba-
bility of a collision by designing a good hash function. It is not realistic to try to avoid collisions
100% of the time, so the design of good hashtable data structures is therefore often heavily fo-
cused not only on avoiding collisions, but handling them in an efficient way when they do in-
evitably occur.
For the mathematically inclined, the rigorous analysis of hash collisions and complexity bounds
on hashtables is analysed using tools from probabilistic combinatorics. Interested students
should read Probability and Computing, Mitzenmacher & Upfal.
For the division method h(k) = k mod m, prime values of m tend to be good choices, since some number theory will tell us that consecutive multiples of a key will continu-
ously result in different hash values provided that the key is not a multiple of m itself.
The multiplicative hash works as follows. We select a constant A with 0 < A < 1 and take the
fractional part of kA, which is kA − ⌊kA⌋. We then scale the result by m to produce the hash
    h(k) = ⌊ m (kA − ⌊kA⌋) ⌋.
Unlike the divisional method, the value of m does not heavily influence the quality of the hash,
so taking m as a power of two is fine and even preferable since the computation of the hash can
then be sped up by taking advantage of bitwise operations.
The main decision to make is the value of A, whose optimal choice is heavily influenced by the
characteristics of the keys being hashed. According to Knuth2 , a good choice is
    A = ϕ⁻¹ = (√5 − 1)/2 ≈ 0.61803...
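A small Python sketch of this scheme (floating-point for simplicity; a fixed-point formulation is usually preferred in practice) could look like:

    import math

    def multiplicative_hash(k: int, m: int, A: float = (math.sqrt(5) - 1) / 2) -> int:
        """Multiplicative hashing: h(k) = floor(m * frac(k * A)), with Knuth's A by default."""
        frac = (k * A) % 1.0          # the fractional part of k * A
        return int(m * frac)

    # Example: multiplicative_hash(123456, 2**14) gives a slot in [0, 16384).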
The polynomial hashing technique is a common and effective method for hashing strings. Given
a string S [1..n ], if we associate each character of S with a numeric value (its position in the al-
phabet for example), then the polynomial hash is given by
    h(S) = ( S[1]·x^(n−1) + S[2]·x^(n−2) + ... + S[n−1]·x + S[n] ) mod p,
where p is a prime such that p > m . To reduce the chance of collisions, the value used for the
base x should be greater than the maximum attainable value of S [i ]. If this is the case, then it is
easy to see that the value of the polynomial hash for any string ignoring the modulo is unique,
suggesting that this is a good hash function. Evaluating the polynomial hash for a given string
naively term-by-term will lead to a time complexity of O (n 2 ) or O (n log(n )) if fast exponentiation
is used, but this can be easily improved by using Horner’s Method.
2 Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
To evaluate a polynomial
    p(x) = a0 + a1 x + a2 x² + ... + an xⁿ,
Horner's method computes the sequence
    bn = an,
    bn−1 = an−1 + bn x,
    ...
    b0 = a0 + b1 x,
and then p(x) = b0. This works because the polynomial
    p(x) = a0 + a1 x + a2 x² + ... + an xⁿ
can be rewritten as
    p(x) = a0 + x(a1 + x(a2 + ... + x(an−1 + an x))).
Horner’s method then simply evaluates this expression from the innermost bracketed part out-
wards.
Given the polynomial hash of S[i..j] under this convention, the polynomial hash for the substring S[i + 1..j] can be computed as
    H(S[i+1..j]) = ( H(S[i..j]) − S[i] · x^(j−i) ) mod p.
This allows us to “roll” the polynomial hash across a string and compute the hash values for
all substrings of a fixed length in just linear time, rather than the quadratic time that would be
required to do each substring separately.
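A minimal Python sketch of this rolling computation, under our assumed convention H(S[i..j]) = S[i]·x^(j−i) + ... + S[j] (mod p) and with an arbitrarily chosen base and prime, is:

    def rolling_hashes(s: str, length: int, x: int = 31, p: int = (1 << 61) - 1) -> list[int]:
        """Polynomial hashes of every substring of `s` of the given length, in O(n) total."""
        vals = [ord(c) for c in s]
        n = len(vals)
        if length > n:
            return []
        # Hash of the first window, evaluated with Horner's method
        h = 0
        for v in vals[:length]:
            h = (h * x + v) % p
        hashes = [h]
        x_pow = pow(x, length - 1, p)                # x^(length-1) mod p
        for i in range(1, n - length + 1):
            h = (h - vals[i - 1] * x_pow) % p        # drop the leading character
            h = (h * x + vals[i + length - 1]) % p   # shift and append the next character
            hashes.append(h)
        return hashes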
Real-world data, however, is never random, so such assumptions are incorrect and foolish. When discussing Quicksort and
Quickselect, a remarkably simple solution to combat the fact that our input data is not random
was to choose our pivots randomly. Indeed, the simplest and most effective way to combat the
fact that real-world data is not random is to introduce randomness into our algorithms instead!
This ensures that no adversary can produce input that will cause our algorithms to run at worst-
case performance. An ideal hash function would be totally random, that is, it would satisfy for
all x ∈ U , independent of all other elements of U ,
    Pr( h(x) = t ) = 1/m.
Using a totally random hash function, chaining and linear probing both achieve O (1) expected
worst-case performance. Unfortunately, generating a totally random hash function would re-
quire us to store a random hash value for every possible input key x ∈ U , which would use
just as much space as direct addressing! Instead, we try to approximate totally random hashing
with hash functions that are somewhat random but easy to compute. One such approach is
universal hashing.
In universal hashing, we select our hash function h randomly from a universal family of hash
functions. A family H is called universal if it satisfies
    Pr_{h∈H} { h(k) = h(k′) } ≤ 1/m,   for all k ≠ k′ ∈ U.
With universal hashing, the expected time per operation for chaining can be shown to still be
O (1 + α). Unfortunately, unlike totally random, universal hashing is not strong enough to guar-
antee that probing achieves O (1) expected complexity. Here are three examples of universal
families:
• The set H of all hash functions. This is clearly universal but it is completely useless
since selecting from this family is equivalent to generating a totally random hash func-
tion which as we discussed is infeasible.
• H = { h_{a,b}(k) = ((a k + b) mod p) mod m } for all 0 < a < p and all 0 ≤ b < p, for a fixed
prime number p such that p ≥ u. This family has the disadvantage that table slots above p
mod m may be slightly less likely to be occupied. Its advantage is that it only needs to
store three integers, a, b and p, and hence uses very little space (a sketch of this family is given after this list).
• H = {ha ,b (k ) = ((a k + b ) mod u ) >> (log2 (u ) − log2 (m))} for all odd 0 < a < u , and all
0 ≤ b < u /m . This family works when u and m, the universe size and the table size, are
powers of two. This hash function then amounts to simply taking the highest log2 (m ) bits
from the product (a k + b ) mod u , and is very fast to compute since when m and u are
powers of two, no modular arithmetic is required, only bitwise operations and a single
multiplication, which are much faster.
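A minimal Python sketch of the second family is shown below (the helper name and the particular prime in the example are our own choices):

    import random

    def make_universal_hash(u: int, m: int, p: int):
        """Randomly choose h(k) = ((a*k + b) mod p) mod m from the universal family.
        p must be a prime with p >= u (the universe size)."""
        a = random.randrange(1, p)        # 0 < a < p
        b = random.randrange(0, p)        # 0 <= b < p
        def h(k: int) -> int:
            return ((a * k + b) % p) % m
        return h

    # Example: h = make_universal_hash(u=2**32, m=1024, p=(1 << 61) - 1)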
Even stronger families of hash functions exist that provide stronger mathematical guarantees,
but we won’t discuss these for now3 .
3 If interested, you should watch this lecture from MIT OpenCourseware, https://youtu.be/Mf9Nn9PbGsE.
To lookup a key k: check position h1(k) in the first table and position h2(k) in the second table; k is present if and only if it occupies one of these two slots.
To insert a key k:
1. Place k at position h1(k) in the first table.
2. If that slot was occupied, the evicted key is moved to its alternate position in the other table.
3. If that in turn evicts another key, the newly evicted key is moved to its alternate position, and so on.
4. Continue until successful or until too many attempts have been made.
5. If the process fails, rebuild the tables larger with new hash functions.
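A minimal Python sketch of cuckoo hashing is shown below. The two hash functions used here are deliberately simple placeholders (real implementations need much stronger ones, as discussed later), and the eviction cap is based on the table size rather than the exact 6 log(n) threshold from the analysis.

    import math

    class CuckooHashTable:
        """A minimal cuckoo hashtable sketch with two tables."""
        def __init__(self, m: int = 16):
            self.m = m
            self.tables = [[None] * m, [None] * m]

        def _hash(self, which: int, key: int) -> int:
            # Two cheap, distinct hash functions; placeholders only.
            return (key if which == 0 else key * 0x9E3779B1 + 1) % self.m

        def lookup(self, key: int) -> bool:
            return (self.tables[0][self._hash(0, key)] == key or
                    self.tables[1][self._hash(1, key)] == key)

        def insert(self, key: int) -> None:
            if self.lookup(key):
                return
            max_loop = max(1, int(6 * math.log(self.m)))
            which = 0
            for _ in range(max_loop):
                pos = self._hash(which, key)
                key, self.tables[which][pos] = self.tables[which][pos], key  # place, maybe evict
                if key is None:                      # nothing was evicted; we are done
                    return
                which = 1 - which                    # reinsert the evicted key in the other table
            self._rehash_and_insert(key)             # too many evictions: rebuild larger

        def _rehash_and_insert(self, pending_key: int) -> None:
            old_keys = [k for table in self.tables for k in table if k is not None]
            self.m *= 2
            self.tables = [[None] * self.m, [None] * self.m]
            for k in old_keys + [pending_key]:
                self.insert(k)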
Lookup and deletion from a hashtable using cuckoo hashing clearly has a worst-case O (1) com-
plexity since it suffices to simply check two addresses. Analysis of insertion on the other hand
is tricky. Selecting the appropriate threshold MaxLoop for insertion is important to ensure that
rebuilds do not happen too frequently, but happen frequently enough such that insertions do
not encounter too many probes. In theory, a good threshold to use is MaxLoop = 6 log(n ) with
appropriate hash functions.
If the hash functions are assumed to be totally random and we maintain that the table sizes
are m ≥ 2n , then it can be shown that any given insertion will cause a failure with probability
O (1/n 2 ). Consequently, Cuckoo hashing will fail to insert n keys with probability O (1/n ), which
implies that n keys can be inserted in expected O (n ) time. Unfortunately, Cuckoo hashing’s
downside is that it requires rather strong hash functions to achieve this complexity. Cuckoo
hashing does not achieve O (1) expected time with universal hashing, or even with some much
stronger hash families4 .
4 For the details and analysis, see Pagh & Rodler, Cuckoo Hashing, http://www.it-c.dk/people/pagh/
Chapter 11
Balanced Binary Search Trees
Storing and retrieving data is one of the most common scenarios in which we employ advanced
data structures. We just covered hashtables which solve the dynamic dictionary problem in O (1)
expected time. You should already be familiar with binary search trees as a powerful alternative
lookup data structure. Binary search trees are slower than hashtables in expectation, but they
provide worst-case guarantees and can also support operations that hashtables are incapable
of, like finding the elements closest to a given key. Ordinary binary search trees can, in the worst
case, degrade to O(n) performance, so this chapter focuses on overcoming this weakness.
Recall from your previous studies that a binary search tree is a rooted binary tree such that for
each node, the keys in its left subtree compare less than its own and the keys in its right subtree
compare greater than its own. Binary search trees are an efficient data structure for storing and
retrieving elements of an ordered set in O (log(n )) time per operation. Pathological cases can
occur however when the tree becomes imbalanced, where performance degrades in the worst
case to linear time. We will study one variety of improved binary search trees called an AVL1 tree
which self-adjusts to prevent imbalance and keeps all operations running in guaranteed worst
case O (log(n )) time.
1 AVL trees are named after Adelson, Velskii, and Landis, the original inventors of the data structure.
An example of a very imbalanced binary search tree resulting from inserting the keys
1, 2, 3, 4, 5, 6 in sorted order. Performance in such a tree is equivalent to a linked list with
worst-case O (n ) lookup and insertion.
11.1.1 Definitions
In general, a tree is considered balanced if it has a height of O (log(n )). There are many ways to
achieve this with different kinds of trees. AVL trees use a very strict definition of balance which
they maintain throughout insertions and deletions. In the context of an AVL tree, a tree is con-
sidered to be balanced if for any node in the tree, the heights of their left and right subtrees differ
by at most one. This implies that its height is worst-case O (log(n )), as required. Since lookup,
insertion and deletion all take time proportional to the height of the tree, the tree being bal-
anced implies that lookup, insertion and deletion can all be performed in worst-case O (log(n ))
time.
A balanced binary search tree (a) and an imbalanced binary search tree (b). (b) is imbalanced
because the heights of the root node’s left and right children are 1 and 3 respectively.
11.1.2 Rebalancing
AVL trees maintain balance by performing rotations whenever an insertion or deletion opera-
tion violates the balance property. Rotations move some of the elements of the tree around in
order to restore balance. We define the balance factor of a node to be the difference between
the heights of its left and right subtrees, i.e.
    balance_factor(u) = height(left(u)) − height(right(u)).
A node u is considered to be out of balance whenever
    |balance_factor(u)| ≥ 2.
There are four separate cases that can occur when a tree is imbalanced.
Left-left imbalance
A left-left imbalance occurs when a node's balance factor is 2 (its left subtree is taller than its right subtree) and the left node's left subtree is at least as tall as the left node's right subtree (in other words, the balance factor of the left node is non-negative). To remedy a left-left imbalance, we perform a single rightward rotation on the imbalanced node.
Right-right imbalance
A right-right imbalance occurs when a node's balance factor is −2 (its right subtree is taller than its left subtree) and the right node's right subtree is at least as tall as the right node's left subtree (in other words, the balance factor of the right node is non-positive). To remedy a right-right imbalance, we perform a single leftward rotation on the imbalanced node.
Left-right imbalance
A left-right imbalance occurs when a node’s balance factor is
2 (its left subtree is taller than its right subtree) and the left
node’s right subtree is taller than the left node’s left subtree (in
other words, the balance factor of the left node is negative). To
remedy a left-right imbalance, we first perform a leftward ro-
tation on the left node, which converts the imbalance into the
left-left situation. We then perform a rightward rotation on
the root to balance it. (Image source: Wikimedia Commons)
Right-left imbalance
A right-left imbalance occurs when a node’s balance factor is
−2 (its right subtree is taller than its left subtree) and the right
node’s left subtree is taller than the right node’s right subtree
(in other words, the balance factor of the right node is pos-
itive). To remedy a right-left imbalance, we first perform a
rightward rotation on the right node, which converts the im-
balance into the right-right situation. We then perform a left-
ward rotation on the root node to balance it. (Image source:
Wikimedia Commons)
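The rotations themselves are just a handful of pointer manipulations. The following is a rough Python sketch of the two rotation primitives (it is not the notes’ own pseudocode; the Node class and the helper names are illustrative assumptions):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.parent = None
        self.height = 1  # height of the subtree rooted at this node

def height(node):
    return node.height if node is not None else 0

def update_height(node):
    node.height = 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    return height(node.left) - height(node.right)

def rotate_right(node):
    # Rightward rotation: the left child becomes the new root of this subtree.
    pivot = node.left
    node.left = pivot.right              # pivot's right subtree moves across
    if pivot.right is not None:
        pivot.right.parent = node
    pivot.parent = node.parent           # attach pivot where node used to hang
    if node.parent is not None:
        if node.parent.left is node:
            node.parent.left = pivot
        else:
            node.parent.right = pivot
    pivot.right = node                   # node becomes pivot's right child
    node.parent = pivot
    update_height(node)                  # fix heights bottom-up
    update_height(pivot)
    return pivot                         # new root of this subtree

def rotate_left(node):
    # Leftward rotation: the exact mirror image of rotate_right.
    pivot = node.right
    node.right = pivot.left
    if pivot.left is not None:
        pivot.left.parent = node
    pivot.parent = node.parent
    if node.parent is not None:
        if node.parent.left is node:
            node.parent.left = pivot
        else:
            node.parent.right = pivot
    pivot.left = node
    node.parent = pivot
    update_height(node)
    update_height(pivot)
    return pivot

If the rotated node was the root of the whole tree, the caller must also update the tree’s root pointer to the returned node.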
Note that in the above algorithms, it is crucial that the functions for manipulating the children
also correctly update the node’s parent pointers! Failing to do so is a very common bug that pro-
grammers encounter when attempting to implement self-balancing binary search trees. We
should also make sure that we handle the null cases correctly when the children or the par-
ent nodes are null. In implementations of self-balancing binary search trees, it is common to
use special dummy nodes to represent null nodes rather than using the language’s actual null
pointer to avoid having to include special cases all throughout the code. We also have to store
and maintain the heights of each node in the tree, since computing them online would be too
slow.
Combining each of these together, the entire rebalancing procedure can be written as in Algo-
rithm 70. After performing any modification to the tree, we should call rebalance on all ances-
tors of the modified nodes, starting with the deepest ones first. It is important to rebalance the
deepest nodes first since fixing an imbalance at a low level of the tree may correct an imbalance
higher up in the tree, eliminating the need for redundant rotations.
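As a rough sketch only (Algorithm 70 in the notes is the reference version), the rebalancing step can be expressed in terms of the rotation helpers sketched earlier; it assumes the subtrees of the given node are already balanced:

def rebalance(node):
    # Restore the AVL balance property at node and return the new root
    # of this subtree.
    update_height(node)
    if balance_factor(node) == 2:            # left-heavy
        if balance_factor(node.left) < 0:    # left-right case
            rotate_left(node.left)           # reduce it to the left-left case
        return rotate_right(node)            # left-left case
    if balance_factor(node) == -2:           # right-heavy
        if balance_factor(node.right) > 0:   # right-left case
            rotate_right(node.right)         # reduce it to the right-right case
        return rotate_left(node)             # right-right case
    return node                              # already balanced: nothing to do

After an insertion or deletion, this function would be called on each ancestor of the modified node, deepest first, exactly as described above.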
Since the tree is assumed to be balanced before we modify it, the heights of at most O (log(n ))
nodes are affected by any one modification operation and hence we require at most O (log(n ))
rebalances, each of which can be executed in constant time. Therefore the total time complexity
of insertion and deletion for an AVL tree is worst-case O (log(n )). Since the tree is kept balanced
by the rebalances, lookup in the tree is also worst-case O (log(n )).
Chapter 12
Prefix Tries and Suffix Trees
Today, huge amounts of the world’s data are in the form of text, or of data that can be interpreted
as text. From classic literature and your favourite algorithms and data structures textbook
to DNA sequences, text data is everywhere, and it needs to be stored, processed and analysed.
We will introduce and study one special data structure for working with text strings, the suffix
tree, as well as a related data structure, the prefix trie / retrieval tree data structure which allows
for fast storage and lookup of strings.
String Terminology
Recall that a string S [1..n ] is a sequence of characters, i.e. some textual data.
1. A substring of S is a string S [i .. j ] where 1 ≤ i ≤ j ≤ n.
2. A prefix of S is a substring S [1.. j ] where 1 ≤ j ≤ n .
3. A suffix of S is a substring S [i ..n ] where 1 ≤ i ≤ n .
A useful observation to make is that a substring of S is always a prefix of some suffix of S. For example, the substring “nan” of “banana” is a prefix of the suffix “nana”.
• A path from the root to a node in the trie corresponds to a prefix of a word in the
trie.
• A path from the root to an internal node of the trie corresponds to a common prefix
of multiple strings.
Prefix tries can be implemented in a variety of ways. The most important decision is how to
index the children of a particular node. There are several strategies, and we will consider three
of them. We will denote the length of a query string by n , the total length of all words in the trie
by T , and the alphabet by Σ.
• Use an array to store a child pointer for each character in the alphabet:
This method allows us to perform lookup and insert in O (n ) time since we can follow
each child pointer in constant time. However, the space requirement is quite high, since
we are storing a lot of pointers to null children. Specifically, we will use O (|Σ|T ) space in
the worst case, since every node has a pointer for every character in the alphabet.
• Use a balanced binary search tree for storing child pointers:
This method uses the minimal amount of space, as we only need to store pointers to chil-
dren that actually exist. For large alphabets, this is advantageous. However, we lose in
lookup time: each lookup now requires up to O (n log(|Σ|)) time because at each node
we must search a BST to find the child pointer. The total space requirement however is
minimal, at just O (T ).
• Use a hashtable for storing child pointers:
This method allows us to only store pointers to the children that exist, hence we will only
use the minimal O (T ) space. It also allows lookup in O (n ) expected time, since at each
node we perform a hashtable lookup in expected constant time. This method seems to be
superior to the first two since it has optimal speed and optimal space usage. However, this
implementation is less versatile, and prohibits us from doing some more advanced tricks
with the trie. For example, you might want to implement the ability to lookup the alpha-
betically closest word to a given word. This is possible with the first two approaches, but
not (efficiently) with a hashtable since there is no way in a hashtable to quickly lookup the
alphabetically nearest key. It also does not provide deterministic worst-case guarantees
which might be important to us.
It is common to terminate each word with a special character to allow us to distinguish between
full words and prefixes. This special character is usually denoted by $. This ensures that a node
is a leaf if and only if it is the end of a word. In actual code, you should use the null character
(‘\0’ in ASCII) rather than an actual $ since $ may be a character in one of the strings.
Figure 12.1: An example of a prefix trie containing the strings be, ant, alloy, ate, are, aloe,
an, allot, all, with terminating $ characters added to indicate the ends of words.
Because of the terminating character, a complete word will not be contained in any proper prefix of a word. This makes tries a potential substitute for a
hashtable for storing a dictionary of strings. Implementations of insertion and lookup for prefix
tries are shown in Algorithms 71 and 72 respectively. If we use the array-based implementation,
building a prefix trie takes O (|Σ|T ) time, and lookup takes O (n ) time, which are optimal if |Σ| is
constant.
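As a rough Python sketch only (Algorithms 71 and 72 in the notes are the reference versions), here is what insertion and lookup look like when child pointers are stored in a hashtable (a Python dict); the class and method names are illustrative assumptions, and the null character plays the role of $:

TERMINATOR = '\0'   # stands in for the $ character

class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to a child TrieNode

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # Insert word (plus the terminator) in O(len(word)) expected time.
        node = self.root
        for ch in word + TERMINATOR:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]

    def lookup(self, word):
        # Return True if word was inserted as a complete word.
        node = self.root
        for ch in word + TERMINATOR:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

For example, after trie = Trie() and trie.insert("all"), the call trie.lookup("all") returns True while trie.lookup("al") returns False, because "al" is only a prefix and has no terminator below it.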
A trie also gives us a way to sort a list of words: insert every word into the trie and then traverse it in alphabetical order, which takes O (|Σ|T ) time in total. Sorting the same list with an ordinary comparison-based sorting algorithm would instead require O (n log(n )) comparisons, each of which costs time proportional to the length of the words being compared, on average O (T /n ), for a total of
O(n log(n)) × O(T/n) = O(T log(n)),
where n is the number of words in the list being sorted. We can therefore see that if the alphabet
size is constant or sufficiently small, the prefix trie method will be faster. Lastly, note that this
can actually be interpreted as a form of radix sort, specifically a most significant digit (MSD)
radix sort, since strings are divided up based on their characters, first to last as we go down the
tree. An implementation of sorting using a trie is given in Algorithm 73.
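As an illustrative sketch only (Algorithm 73 in the notes is the reference version), sorting with a trie amounts to inserting every word and then reading the words back with a depth-first traversal that visits children in alphabetical order. The sketch below reuses the hypothetical Trie class from the previous sketch; note that duplicate words collapse to a single copy in this simple version:

def trie_sort(words):
    # MSD-style radix sort: insert the words, then collect them back in order.
    trie = Trie()
    for word in words:
        trie.insert(word)
    result = []

    def dfs(node, prefix):
        # Visit children in alphabetical order; the terminator sorts first,
        # so a word is emitted before any of its extensions.
        for ch in sorted(node.children):
            if ch == TERMINATOR:
                result.append(prefix)
            else:
                dfs(node.children[ch], prefix + ch)

    dfs(trie.root, "")
    return result

For example, trie_sort(["ant", "all", "alloy", "an"]) returns ["all", "alloy", "an", "ant"].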
Compact Tries
One major drawback to tries that we have observed is that they can potentially waste a lot of
memory, particularly if they contain long non-branching paths (paths where each node only
has one child). In cases like these, we can make prefix tries more efficient by combining the
edges along a non-branching path into a single edge. The Radix Trie and the PATRICIA Trie are
particular kinds of compact prefix tries. Figure 12.2 shows an example of a compact trie.
Figure 12.2: A compact trie containing be, ant, alloy, ate, are, aloe, an, allot, all. No
terminating character $ has been added in this example.
Given a text string T [1..n ] and a pattern P [1..m], find all occurrences of P as a substring
of T .
Many well-known algorithms exist that solve the pattern matching problem in O (n + m ) time per
query, but what if we wish to keep the same text string T and search for lots of small patterns?
Doing the O (n ) work per query might turn out to be extremely wasteful if the text string has
length n = 1, 000, 000 and the patterns only have length m = 20.
We can start with a suffix trie by inserting all of the suffixes into a prefix trie. In many appli-
cations, we need to be able to distinguish suffixes from substrings so we add the terminating
character $.
A suffix trie of the string T can be used to solve the pattern matching problem by checking
whether P is a prefix of some suffix, i.e. whether it traces a path down from the root of the trie. This will take O (m ) time per query, which is optimal. Since a
string of length n has n suffixes, and each has length O (n ), a suffix trie will take O (n²) space,
which is extremely inefficient. This is why the suffix tree is stored as a compact trie, which will
take just O (n ) space. A suffix tree of the string banana$ is shown in Figure 12.3.
It is important to note that if we store all of the edge labels explicitly then the memory used
by the suffix tree will still be O (n²). In order to mitigate this and bring the memory required
down to O (n ), we instead simply refer to each substring by its position in the original string.
For example, the substring “na” in “banana$” would simply be represented as [3, 4], meaning
that it is the substring spanning the positions 3 to 4 in the string “banana$”.
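To make the idea concrete, the following rough sketch builds the (quadratic, non-compact) suffix trie just described by reusing the hypothetical Trie class from earlier, and answers a pattern matching query in O(m) time; it is an illustration of the suffix trie, not of the O(n)-space suffix tree:

def build_suffix_trie(text):
    # Insert every suffix of text into a prefix trie. The trie's terminator
    # plays the role of $. This takes O(n^2) time and space, which is
    # exactly why suffix trees are stored in compact form instead.
    trie = Trie()
    for i in range(len(text)):
        trie.insert(text[i:])
    return trie

def occurs_in(suffix_trie, pattern):
    # A substring of the text is a prefix of some suffix, so the pattern
    # occurs if and only if it traces a path down from the root.
    node = suffix_trie.root
    for ch in pattern:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

For example, occurs_in(build_suffix_trie("banana"), "nan") returns True, while occurs_in(build_suffix_trie("banana"), "nab") returns False.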
Given a string S [1..n ], find the longest substring of S that occurs at least two times.
To solve this problem, we recall that within a prefix trie and equivalently a suffix tree, internal
nodes correspond precisely to common prefixes, and hence in the case of suffix trees, substrings
that occur multiple times. Finding the longest repeated substring is therefore simply a matter of
traversing the suffix tree and looking for the deepest internal node (non-leaf node). An example
is depicted in Figure 12.4.
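As a rough illustration of the same idea on the simpler (quadratic) suffix trie sketched above, rather than on the compact suffix tree used in the notes: in the suffix trie, a node with at least two children corresponds to a substring that occurs at least twice, so the longest repeated substring corresponds to the deepest such node.

def longest_repeated_substring(text):
    # Walk the suffix trie and remember the deepest node with >= 2 children.
    trie = build_suffix_trie(text)
    best = ""

    def dfs(node, prefix):
        nonlocal best
        if len(node.children) >= 2 and len(prefix) > len(best):
            best = prefix
        for ch, child in node.children.items():
            if ch != TERMINATOR:
                dfs(child, prefix + ch)

    dfs(trie.root, "")
    return best

For example, longest_repeated_substring("banana") returns "ana", matching Figure 12.4.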
1 If interested, see E Ukkonen, On-line construction of suffix trees, Algorithmica 1995, and this Stackoverflow post
http://stackoverflow.com/questions/9452701 which describes the algorithm in a very accessible way.
Figure 12.4: A suffix tree for the string “banana$” where each node is labelled with its depth
(distance from the root node as measured by the number of characters on each edge). The
deepest internal node has depth 3, which corresponds to the substring “ana”.
Chapter 13
Suffix Arrays
We just saw the suffix tree data structure, a compact structure for processing and answering
questions about strings. In addition to the suffix tree, there are many other data structures that
utilise the suffixes of a string to perform efficient string related queries. We will study one more
such structure as a more space efficient alternative to the suffix tree, the suffix array.
Suffix Arrays
Suffix arrays are a compact data structure that index all of the suffixes of a particular string in
sorted order. While this may not sound immediately useful, the suffix array has a
mountain of applications in string processing which carry over to applications in fields ranging
from the study of natural languages to bioinformatics.
Consider as a first example, the string banana. Recall that when working with suffixes, we often
mark the end of the string with a special character, denoted by $. The suffixes of banana$ are
then given by
banana$
anana$
nana$
ana$
na$
a$
$
In general, we can see that a string of length n has n + 1 suffixes (including the empty suffix
containing only $.) The suffix array of a string is a sorted array of its suffixes, so the suffix array
of banana looks like
$
a$
ana$
anana$
banana$
na$
nana$
Note that the special character ($) is considered to be lexicographically less than all other characters, so the suffix consisting of just $ appears at the beginning of the suffix array. This is useful since it conceptually signi-
fies the empty string (the suffix of length zero of the original string.)
Since actually writing down all of the suffixes of a string would take O (n²) space, suffix arrays
are represented in a more compact form, where each element simply refers to an index 1 ≤ i ≤ n
into S at which the corresponding suffix begins. For example, the suffix array for banana would
be stored as:
SA("banana$") = [7,6,4,2,1,5,3]
where the i th index corresponds to the position in banana at which the i th sorted suffix starts
as shown in the table below.
Index Suffix
7 $
6 a$
4 ana$
2 anana$
1 banana$
5 na$
3 nana$
In practice, the first entry (7 in this case) is sometimes omitted since it is guaranteed that the
empty suffix is always the first one.
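Before looking at efficient construction algorithms, it may help to see the most direct (and slow) way of building a suffix array: simply sort the starting positions of the suffixes. This naive construction can take O(n² log(n)) time in the worst case because each comparison of two suffixes may cost O(n). The function name and 1-based output convention below are just illustrative choices:

def naive_suffix_array(s):
    # Append the terminator and sort suffix start positions by the suffix
    # that begins there. '$' compares smaller than all lowercase letters in
    # ASCII, so it sorts first. Returned indices are 1-based to match the notes.
    s = s + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return [i + 1 for i in order]

For example, naive_suffix_array("banana") returns [7, 6, 4, 2, 1, 5, 3], matching the table above.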
A faster approach is prefix doubling, which sorts the suffixes by consecutively larger prefixes: first by their first character, then by their first two characters, then four, and so on, hence the
name prefix doubling. If we perform prefix doubling naively by simply comparing substrings at
each iteration, this will be no faster than the naive algorithm (in fact it will be even slower!) We
therefore need a trick to perform the comparisons faster.
Prefix doubling iteratively sorts longer and longer substrings each round. The key idea is that
by knowing the sorted order of the shorter substrings, we can compare the longer substrings
fast!
Suppose we wish to compare two strings A and B , each of which has the same even
length l . We can imagine these two strings as being composed of two halves. Call these
halves A 1 , A 2 and B1 , B2 .
A = A1 A2        B = B1 B2
When we compare A and B naively, we are comparing them from character 1 through
to character l in order until we hit a pair of letters that differ. What if we already know
that A 1 < B1 ? This means that the first pair of characters that differ between A and B
occurs in the first half in favour of A, and therefore A < B . We don’t even have to look at
the second halves. If instead we knew that A 1 > B1 , then we would similarly know that
A > B.
If instead A 1 = B1 , then there is no point comparing the characters in the first halves of
A and B since we already know that they must be the same. We can then apply the same
logic to A 2 and B2 and deduce that if A 2 < B2 then A < B , or if A 2 > B2 that A > B , or
finally if A 2 = B2 that A = B . What this tells us is that knowing the order of the halves of
the strings we are comparing allows us to skip comparing all of the characters. In fact,
all we had to do was compare the halves, so we only had to do at most 2 comparisons,
regardless of the lengths of A and B ! This means we can perform comparisons in O (1) if
we already know the sorted order of the halves.
For example, say we wish to compare the long strings “supercalafragalisticexpialadoshs” and
“supercalafragalisticsuffixarrays”. Comparing them naively, we would take until character 21
to find the answer. However, if we were to look at the strings as made up of two halves:
“supercalafragali” + “sticexpialadoshs”
“supercalafragali” + “sticsuffixarrays”
and I were to tell you that the first halves are the same, and that the second half of the first string
“sticexpialadoshs” is less than the second half of the second string “sticsuffixarrays”, you could
immediately tell me that the first string is less than the second.
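In code, comparing by halves is just a lexicographic comparison of two pairs of ranks; a tiny hypothetical helper makes this explicit:

def compare_by_halves(a_halves, b_halves):
    # a_halves = (rank(A1), rank(A2)), b_halves = (rank(B1), rank(B2)).
    # Returns -1, 0 or 1 according to whether A < B, A = B or A > B.
    return (a_halves > b_halves) - (a_halves < b_halves)

This is exactly the pair comparison that prefix doubling performs at every step.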
This explains why prefix doubling sorts the first 1 character, then 2 characters, then 4 and so on.
What is important to realise here is that if I am looking at the first 2k characters of the suffix at
position p , then what I really have are two halves: the first k characters of the suffix at position
p and the first k characters of the suffix at position p + k . At each iteration, knowing the order
of the previous halves allows all of the comparisons of the current suffixes to be performed in
O (1) time.
To make use of this, we maintain a rank for each suffix: roughly speaking, its current position in the suffix array. Two suffixes that are equal up to the current length
must be assigned the same rank so that we know that they are equal. For example, say we are
computing the suffix array of “banana” and we have already sorted on the first two characters
of each suffix and are now about to compare on the first four characters of each suffix. The
partially sorted suffixes and their ranks at this point are:
Rank Suffix
1 $
2 a$
3 ana$
3 anana$
4 banana$
5 na$
5 nana$
Note that ana$ and anana$ share a rank, as do na$ and nana$, because each pair agrees on its first two characters.
If we know the ranks of two suffixes, we can compare them by simply comparing their ranks,
since a lower rank means it comes earlier in the suffix array. We now have all of the ingredients
to implement prefix doubling.
An implementation of prefix doubling is given in Algorithm 75. The value ord(c) simply refers
to the order or rank of the character c in the alphabet, e.g. ord(‘a’) = 1, ord(‘b’) = 2, etc.
The SUFFIX_COMPARE function compares two prefixes of length 2k in O (1) by looking at the ranks
of their two halves of length k , i.e. it compares by the pairs (rank[i ], rank[i +k ]). After sorting, we
compute the new ranks by going through the suffixes in the current order, and adding 1 to the
rank if the current suffix is greater than the previous one. This ensures that we correctly assign
equal ranks to suffixes that are currently equal. Sorting at each iteration of the algorithm takes
O (n log(n )) time since the comparisons are now performed in O (1), and we have to perform
log(n ) iterations to fully sort the suffixes, so the total time complexity of the prefix doubling
algorithm is O (n log²(n)).
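The sketch below is a rough Python rendering of this O(n log²(n)) version only (Algorithm 75 in the notes is the reference pseudocode). Instead of a separate SUFFIX_COMPARE function it sorts with the pair (rank[i], rank[i + k]) as the key; the function name and the 1-based output are illustrative choices:

def suffix_array_prefix_doubling(s):
    # Each round sorts the suffixes by their first 2k characters, using the
    # ranks from the previous round so that each key is a pair of integers.
    s = s + "$"
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]   # round 0: ranks are just character codes
    k = 1
    while True:
        def key(i):
            # (rank of first half, rank of second half); a missing second
            # half sorts before everything else.
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Recompute ranks: suffixes with equal keys must get equal ranks.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) > key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return [i + 1 for i in sa]     # 1-based, to match the notes

For example, suffix_array_prefix_doubling("banana") returns [7, 6, 4, 2, 1, 5, 3].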
The prefix doubling algorithm is already significantly faster than the naive suffix array construc-
tion method, but there is still lots of room for improvement. Since the ranks are bounded above
by n , we can speed up prefix doubling by replacing the O (n log(n )) sort with an O (n ) radix sort
instead. To take advantage of radix sort, we interpret prefix doubling as sorting suffixes using
the pairs (rank[i ], rank[i + k ]) as the keys, since these correspond to the two halves of the first
2k characters of the suffix starting at position i . We can therefore sort the array by first sorting on rank[i + k ] and then stably
sorting on rank[i ]. Using a counting sort for each of these, we perform O (n ) work to sort the
array. Using this method, the complexity of the prefix doubling algorithm drops to O (n log(n ))
since we still perform O (log(n )) rounds, but each round now takes just O (n ).
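As a rough sketch of the radix-sort refinement, under the same assumptions as the previous sketch and additionally assuming the initial character ranks have been compressed into the range 0..n − 1: each round, instead of calling sort, we could perform two stable counting-sort passes over the suffix indices, first by rank[i + k] and then by rank[i]. The helper name below is illustrative:

def counting_sort_by(sa, key_of, max_key):
    # Stable counting sort of the indices in sa by the integer key_of(i),
    # where keys lie in the range 0..max_key.
    count = [0] * (max_key + 2)
    for i in sa:
        count[key_of(i) + 1] += 1
    for v in range(1, len(count)):
        count[v] += count[v - 1]        # prefix sums: starting position of each key
    result = [0] * len(sa)
    for i in sa:                        # stable: equal keys keep their current order
        result[count[key_of(i)]] = i
        count[key_of(i)] += 1
    return result

Inside each round, one would then replace the sort call with

sa = counting_sort_by(sa, lambda i: rank[i + k] + 1 if i + k < n else 0, n)
sa = counting_sort_by(sa, lambda i: rank[i], n)

(shifting the second-half key by one so that a missing second half gets the smallest key 0), which makes each round O(n) and the whole construction O(n log(n)).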
Prefix doubling with radix sort achieves reasonably good performance on very large strings in
practice, but we can still do even better! Several algorithms exist for constructing suffix arrays
in just O (n ) time for a fixed size alphabet. Probably the most well known of these algorithms
is the DC3 algorithm1 , which works by sorting a set of “sample suffixes” which are the suffixes
at positions in the string that are not divisible by three. DC3 operates by radix sorting the first
three characters of each suffix (stably in reverse order), then recursively sorting sets of suffixes
that are still equal. Using the sample suffixes, one can then quickly sort the remaining 1/3 of
the suffixes and merge them with the sample suffixes since every non-sample suffix is just one
character prepended to a sample suffix. The running time is then given by
T(n) = T(2n/3) + O(n) if n > 1, and T(n) = Θ(1) if n = 1,
which solves to T(n) = O(n) since the problem size shrinks by a constant factor at each level of the recursion.
1 See Juha Kärkkäinen, Peter Sanders and Stefan Burkhardt, Linear Work Suffix Array Construction.