Average Case 1
Average Case 1
University of Freiburg
Alexander Souza
Contents
1 Introduction 3
1.1 Optimization Problems and Algorithms . . . . . . . . . . . . . . . . . . . . 4
1.2 Adversarial Models and Performance Measures . . . . . . . . . . . . . . . . 5
1.3 Basic Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Knapsack 17
3.1 Stochastic Profits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Nemhauser-Ullmann Algorithm . . . . . . . . . . . . . . . . . . . . . 21
3.2 Stochastic Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Adaptive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Non-Adaptive Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Paging 36
5.1 Competitive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Structure of Typical Request Sequences . . . . . . . . . . . . . . . . . . . . 44
6 Bin Packing 48
6.1 Deterministic Bin Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Stochastic Bin Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7 Traveling Repairman 53
7.1 Deterministic Arrivals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.2 Stochastic Arrivals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2
Chapter 1
Introduction
Generally speaking, average-case analysis asks how some given algorithm behaves “typ-
ically”. The main motivation for this is that for some (practically relevant) algorithms
there is a gap between the worst-possible and the usually observed mannerism.
For example, the deterministic sorting algorithm Quicksort that always chooses the
first element of an n-element array as its pivot requires O n2 comparisons in the worst
case. Thus, from that perspective one has to dismiss Quicksort (since there are algo-
rithms that guarantee O (n log n) comparisons). On the other hand, Quicksort is often
observed to outperform other algorithms in terms of number of comparisons. We will show
very soon that the expected number of comparisons of Quicksort is also O (n log n) and
even that this is achieved with high probability.
Now we explain the unspecified notions of the first two sentences in little more detail.
First of all, an algorithm receives an input and produces an output after a finite number of
elementary operations. Within this lecture we will mostly be concerned with optimization
algorithms. The outputs these algorithms produce are valuated with an objective function
and the goal is to minimize (respectively maximize) the valuation.
The second undetermined notion from above is behave: We will either be interested in
the running time or the approximation guarantee of a particular algorithm. The number
of elementary operations the algorithm executes before returning its output is called the
running time on that particular instance. For example, many algorithms that solve NP-
hard optimization problems exactly require exponential running time on some instances.
Exponential running time is in general not desired since it only allows treatment of rela-
tively small inputs. On the other hand, if we resort to polynomial running time we may
have to surrender on solution quality (unless P = NP), i.e., the algorithm does not always
produce solutions with best-possible objective value. We are mainly interested in algo-
rithms where one can prove an a priori bound on the ratio of the objective value achieved
by the algorithm compared with the best-possible value – the approximation guarantee.
The third notion, “typically” is intended to reflect the following gap that some practi-
cally relevant algorithms feature. From the worst-case perspective, unless P = NP, there is
no way that we can avoid exponential running time (if we are interested in exact solutions)
or suboptimal solutions (if we insist on polynomial running time) in general. So, what is
the point in asking for “typical” running time, respectively approximation guarantee? The
simple answer is: observed behaviour. There are several algorithms where the worst-case
bounds on running time, respectively approximation guarantee, do not adequately reflect
the behaviour one usually comes across – these bounds seem to be too pessimistic. Among
these algorithms there are several ones that play an important role in applications, e.g.,
Quicksort for sorting and Least Recently Used for memory management. It is thus
3
primarily of practical interest to give answers why those algorithms are often observed to
work well. After this glimpse into the field we have to make all these notions more precise.
extremize val(X)
subject to X ∈ U satisfies F
maximize p(X)
subject to w(X) ≤ c,
X ⊆ {1, 2, 3, 4}.
Which is the best solution? We evaluate all possibilities and find that X ∗ = {3, 4}
gives w(X ∗ ) = 155 altogether which maximizes our profit.
4
A central problem around combinatorial optimization is that it is often in principle
possible to find an optimum solution by enumerating the set of feasible solutions, but
this set mostly contains “too many” elements. This phenomenon is called combinatorial
explosion.
An optimization algorithm is given a description of an instance I, called the input, and
has to produce a feasible solution O, called the output, after a finite number of elementary
operations, where each such operation is effective and well-defined. We do not further
specify what an elementary operation means as this depends on the context. Operations
of interest include additions, multiplications, memory accesses, comparisons, and so on.
The number of elementary operations an algorithm executes on an input I before producing
an output O is called its running time on that input. The number of bits of the input is
denoted |I| and called the input-length. An algorithm whose running time is a polynomial
in |I| is called a polynomial time (or efficient) algorithm.
The central problem we are facing is that (unless P = NP), there are (many) optimiza-
tion problems for which no efficient algorithm exists. This means that we only have the
following choices: If we insist on the optimum solution, then we have to accept exponential
running time (at least on some instances). Otherwise, if we want polynomial running time,
then we must accept that any algorithm may produce suboptimal solutions (at least on
some instances).
This fundamental problem yields that we have to reason more about how to judge
optimization algorithms. It is convenient to see optimization as a game played between
a (malicious) adversary and an algorithm. Such a game is played with an optimization
problem P , which is simply a set of instances P = {I1 , I2 , . . . }. In the most basic form,
the adversary chooses an instance among the members of P and the algorithm has to find
a feasible solution. Throughout, we shall assume that our algorithms are deterministic.
(1) The way the adversary is restricted in the choice of the instance (the adversarial
model ), and
(2) the manner in which the solution produced by the algorithm is assessed (the perfor-
mance measure).
Worst-Case Model
In the classical worst-case perspective, the adversary may choose an arbitrary instance I
of the members of a given problem P .
For any instance I, let tI (alg) denote the running time of an algorithm alg on it. In
terms of running time we valuate the algorithm with the quantity
5
tries to maximize the approximation ratio and the overall quality of the algorithm is
measured with
alg(I)
r(alg) = sup rI (alg) = sup .
I∈P I∈P opt(I)
An algorithm with r(alg) ≤ % is called a %-approximation. If P is a maximization problem,
he tries to minimize the ratio and we judge
alg(I)
r(alg) = inf rI (alg) = inf
I∈P I∈P opt(I)
and call it a %-approximation if r(alg) ≥ %.
Average-Case Models
For reasons sketched above, the worst-case model is sometimes too pessimistic to give
satisfactory answers from the perspective of the practitioner. So, apparently, the adversary
is too powerful, and one has to seek models that try to capture “typical” behaviour.
One attempt in that direction is the diffuse adversary. The adversary is given the
problem, i.e., the set P of instances and along with that a set of available probability
distributions ∆ = {D1 , D2 , . . . }. The set ∆ is assumed to be chosen from an outside
entity. Each distribution D ∈ ∆ assigns to each instance I ∈ P a certain probability
Pr [I]. Notice that the input-instance I is now a random variable. Instead of allowing
the adversary to choose a particular instance we only allow him to choose any probability
distribution from the set ∆.
In terms of running time, one is usually interested in
X
t(alg) = sup E [tI (alg)] = sup tI (alg)Pr [I] .
D∈∆ D∈∆ I∈P
Notice that I is a random variable and hence tI (alg) is also a random variable.
It is debatable how the quality of an algorithm should be judged in such a model.
There are at least two alternatives. Let P be a minimization problem without loss of
generality. To transfer the definitions below to maximization we only have to replace sup
with inf.
One point of view is: Let D be the any distribution for P . Then taking the expectation
with respect to D yields
X
alg(I) alg(I)
eD (alg) = E = Pr [I]
opt(I) opt(I)
I∈P
which judges the algorithm through the expected value of the approximation ratio. Con-
sequently, as the adversary is malicious, the overall judgement is given through
alg(I)
e(alg) = sup eD (alg) = sup E .
D∈∆ D∈∆ opt(I)
Another way is this: For a distribution D, and taking expectations with respect to it,
define P
0 E [alg(I)] alg(I)Pr [I]
eD (alg) = = PI∈P ,
E [opt(I)] I∈P opt(I)Pr [I]
i.e., the ratio of the expectations, which yields the overall measure
E [alg(I)]
e0 (alg) = sup eD (alg) = sup .
D∈∆ D∈∆ E [opt(I)]
6
If e(alg) ≤ %, respectively e0 (alg) ≤ % holds, we speak of a %-expected-approximation
(with the measure used in the context).
Both measures have their advantages and drawbacks. The perspective of e(alg) is
that one instance is drawn at random and the algorithm is compared directly against the
optimal algorithm on that instance (in expectation). Thus the measure prefers algorithms
that give good approximation ratios “often”. By contrast, from the point of view of e0 (alg)
many instances are drawn and the total objective value achieved is compared with the total
optimal value (respectively in expectation). This measure thus prefers algorithms that
give good approximation ratios when the optimum itself yields “large” objective values.
It certainly depends on the intended application which perspective is the advisable one.
7
A classical worst-case notion is competitive analysis where an algorithm receives a
sequence of inputs and has to react upon each one without any foresight. Analogously
to the approximation ratio, the performance is judged by comparing the objective value
achieved with the optimal value. In such situations, a malicious adversary is even more
powerful than before, since an algorithm that is not even aware of the whole instance is
compared with an optimal algorithm that knows the whole instance in advance. Thus the
problem that the results in this model tend to be overly pessimistic is true even more.
Hence average-case models for decision-making under uncertainty are interesting both
from a theoretical and practical perspective. However, up to now, there are several rea-
sonable approaches, each one with advantages and drawbacks. We will not go into the
details here, but introduce the models in the chapters when discussed.
Fundamentals
A set Ω = {ω1 , ω2 , . . . } is called a probability space and its elements ω ∈ Ω are called
the elementary events. A probability distribution D assigns to each elementary event ω a
number, called its probability Pr [ω] such that:
(1) 0 ≤ Pr [ω] ≤ 1 for all ω ∈ Ω, and
P
(2) ω∈Ω Pr [ω] = 1.
8
For some given probability space any mapping X : Ω → Z is called a (numerical)
random variable. The expected value of X is given through
X X
E [X] = X(ω)Pr [ω] = xPr [X = x]
ω∈Ω x∈Z
P
provided that x∈Z |x|Pr [X = x] converges.
Theorem 1.6. Let X and Y be two random variables such that X(ω) ≤ Y (ω) for all
ω ∈ Ω then E [X] ≤ E [Y ].
Bounds on Distributions
Two elementary bounds on the distribution of any random variable are the theorems of
Markov and Chebyshev.
E [X]
Pr [X ≥ t] ≤
t
holds for all t > 0.
Var [X]
Pr [|X − E [X] | ≥ t] ≤
t2
holds for all t > 0.
Important Distributions
A Bernoulli distributed random variable X ∈ {0, 1} indicates if a certain event that has
probability p occurs. We denote X ∼ Ber (p) and it has the distribution:
Pr [X = 1] = p and Pr [X = 0] = 1 − p.
9
A sum of n independent Bernoulli variables X = X1 + · · · + Xn with respective pa-
rameter p has Binomial distribution, denoted X ∼ Bin (n, p). The distribution is for any
k ∈ {0, . . . , n}:
n k
Pr [X = k] = p (1 − p)n−k .
k
We clearly have E [X] = np and Var [X] = np(1 − p).
A random variable X ∈ {1, 2, . . . } that counts the number of trials until some event
that has probability p occurs for the first time has geometric distribution. It is denoted
X ∼ Geo (p) and has the distribution
Pr [X = k] = (1 − p)k−1 p
10
Chapter 2
In this chapter we give average-case analyses of the well-known sorting and searching
algorithms Quicksort and Quickselect.
2.1 Quicksort
The problem Sort is the following: We are given a sequence a = (a1 , a2 , . . . , an ) of
pairwise distinct numbers and are asked to find a permutation π of (1, 2, . . . , n) such that
the sequence b = (b1 , b2 , . . . , bn ) = (aπ(1) , aπ(2) , . . . , aπ(n) ) satisfies b1 ≤ b2 ≤ · · · ≤ bn . The
elementary operations we are interested in are the comparisons “≤” an algorithm asks.
Let ta (alg) be the number of comparisons of an algorithm alg given a sequence a.
The assumption that the numbers are pairwise distinct can easily be removed, but we
consider it for clarity of exposition.
The idea of the algorithm Quicksort is to choose some element p from a, called the
pivot, and divide a into two subsequences ` and r. The sequence ` contains the elements
ai ≤ p and r those ai > p. Quicksort is then called recursively until the sequence is
empty. The sequence (Quicksort(`), p, Quicksort(r)) is finally returned.
(1) If n = 0 return.
Worst-Case Analysis
It is well-known that this variant of Quicksort has the weakness that it may require
2
Ω n comparisons.
Observation 2.1. There is a sequence a such that ta (Quicksort) = n(n − 1)/2.
11
Proof. Consider a = (1, 2, . . . , n). Then, in step (3), ` remains empty while r contains
n − 1 elements. This step requires n − 1 comparisons. By induction, the recursive calls
Quicksort(`) and Quicksort(r) require 0, respectively (n − 1)(n − 2)/2 comparisons
“≤”. Thus, the whole algorithm needs n−1+(n−1)(n−2)/2 = n(n−1)/2 comparisons.
Average-Case Analysis
Here we study Quicksort in the following probabilistic model. Let π be any permutation
of (1, 2, . . . , n) and let the probability of π be 1/n!, i.e., Pr [π] = 1/n!. We draw a permu-
tation π at random according to this probability distribution and the input of Quicksort
is the sequence A = (aπ(1) , aπ(2) , . . . , aπ(n) ). (We use the capital letter A to emphasize that
it is a random variable.)
Then the expected P number of comparisons Quicksort needs is O (n log n). More
specifically let Hn = nk=1 1/k denote the n-th Harmonic number and recall that log n ≤
Hn ≤ log n + 1.
We will give two proofs of this theorem. In the first one we derive a recursion, in
the second one we use linearity of expectation. It causes no loss of generality to assume
that the original sequence is itself a permutation of (1, 2, . . . , n) and will be assumed in
the sequel. Also, let cn = E [tA (Quicksort)] be the expected number of comparisons
Quicksort carries out on a uniformly distributed permutation A of length n.
First proof of Theorem 2.2. The idea of this proof is to give a recursion formula and then
solve it (by induction). Before we actually prove this theorem we need a lemma that is
needed due to the recursive structure of the algorithm. The original input A is a random
permutation with uniform distribution. Hence the pivot P is a random variable. Thus also
the sequences L and R constructed by Quicksort are random permutations of subsets
of {1, 2, . . . , n}. However, for the proof to work, we have to show that these sequences are
also uniform distributed.
Lemma 2.3. Let the outcome of the pivot element be P = p. Let L = ` and R = r be two
possible outcomes of the sequences Quicksort constructs given pivot p. Then we have
1 1
Pr [ L = ` | P = p] = and Pr [ R = r | P = p] = .
|`|! |r|!
12
Since A is uniformly distributed the probability that the outcome of the pivot P = p
equals 1/n for every p ∈ {1, 2, . . . , n}. For n = 0 we have c0 = 0 and for n > 0 we we
obtain the recursion
n
X
cn = E [tA (Quicksort)] = n − 1 + E [ tA (Quicksort) | P = p] Pr [P = p]
p=1
n
X
=n−1+ (E [ tL (Quicksort) | P = p] + E [ tR (Quicksort) | P = p])Pr [P = p]
p=1
n
X 1
=n−1+ (cp−1 + cn−p )
n
p=1
n−1
2X
=n−1+ ck
n
k=0
where the equality E [ tL (Quicksort) | P = p] = cp−1 for L (and analogously for R) de-
pends critically on Lemma 2.3.
We “guess” aP solution for cn and verify by induction that it satisfies the recursion
cn = n − 1 + 2/n n−1 k=0 ck . Our guess is ck = 2(k + 1)Hk − 4k for any k ∈ {0, 1, . . . , n}.
The base case is k = 0. There, Quicksort does not perform any comparison at all,
i.e., c0 = 0. Using H0 = 0 yields the base case.
For the inductive case we have that Pthe claim is true for any k ≤ n − 1 and we will
prove it for k = n. Using the identity n−1 k=0 (k + 1)Hk = n/2(nHn − n/2 − 3/2 + Hn ) we
find
n−1 n−1
2X 2X
cn = n − 1 + ck = n − 1 + (2(k + 1)Hk − 4k)
n n
k=0 k=0
n 3
= n − 1 + 2 nHn − − + Hn − 4(n − 1) = 2(n + 1)Hn − n − 3 − 3(n − 1)
2 2
= 2(n + 1)Hn − 4n
Second proof of Theorem 2.2. We assumed that the original sequence is a permutation of
(1, 2, . . . , n). So, for any i < j ∈ {1, 2, . . . , n} let the random variable Xij be equal to one
if i and j are compared during the course Pn−1ofP the algorithm and zero otherwise. The total
number of comparisons is hence X = i=1 nj=i+1 Xij . Thus, by
n−1
X n
X n−1
X n
X n−1
X n
X
cn = E [X] = E Xij = E [Xij ] = Pr [Xij = 1]
i=1 j=i+1 i=1 j=i+1 i=1 j=i+1
13
j is the first pivot in the set {i, . . . , j}. The converse direction is trivial: if one of i or j is
the first pivot from the set {i, . . . , j}, then i and j are still in the same sequence and will
hence be compared.
What is the probability that, say, i is the first pivot of {i, . . . , j}? Consider the sub-
sequence S induced by the elements {i, . . . , j} on A. The crucial observation is that S is
also a subsequence of P . Hence i is the first pivot from the set {i, . . . , j} if and only if i is
the first element from that set in A. Since A is uniformly distributed the probability for
that event is exactly 1/(j − i + 1). Analogous reasoning for j yields the overall probability
Pr [Xij = 1] = 2/(j − i + 1).
This allows us to calculate
n−1 n n−1 n n−1
X n−i+1
X X X X 2 X 2
cn = Pr [Xij = 1] = =
j−i+1 k
i=1 j=i+1 i=1 j=i+1 i=1 k=2
n n+1−k n n
X X 2 X 2 X 2
= = (n + 1 − k) = (n + 1) − 2(n − 1)
k k k
k=2 i=1 k=2 k=2
= 2(n + 1)Hn − 4n
and the proof is complete.
So we have shown that the expected number of comparisons of Quicksort is O (n log n).
We can even strengthen the statement: The number of comparisons is O (n log n) with high
probability, i.e., with probability that tends to one as n tends to infinity.
Theorem 2.4. For any c > 1 we have that
4
Pr [tA (Quicksort) ≤ cE [tA (Quicksort)]] ≥ 1 − .
(c − 1)2 Hn
Proof. Our goal is to apply Chebyshev’s inequality: Pr [|X − E [X] | ≥ t] ≤ Var [X] /t2 ,
where, as before X denotes the number of comparisons.
Suppose that we can prove Var [X] = 2nE [X] + 8n2 then we have
Var [X]
Pr [X > cE [X]] ≤ Pr [|X − E [X] | > (c − 1)E [X]] ≤
(c − 1)2 E [X]2
2 2 4
= 2
+ 2 2
≤
2(c − 1) Hn (c − 1) Hn (c − 1)2 Hn
which is what we wanted to show.
As before, Xij is equal to one if the elements i and j are compared during the course
of the algorithm, zero otherwise.
!
X X
Var [X] = E X 2 − E [X]2 = E Xk` − E [X]2
Xij
i<j k<`
X X X
= E [Xij ] + 2 E [Xij Xik ] + E [Xij Xk` ] − E [X]2
i<j i<j,k i<j, k<`
k6=i, `6=j
X X
≤ E [X] + 2 E [Xij Xik ] + E [Xij ] E [Xk` ] − E [X]2
i<j,k i<j, k<`
k6=i, `6=j
X
≤ E [X] + 2 E [Xij Xik ] .
i<j,k
14
Now we have to bound E [Xij Xik ] for any combination of i < j and a free k. For any of
the cases i < j < k, i < k < j, and k < i < j we have
2
E [Xij Xik ] = E [ Xik | Xij ] Pr [Xij = 1] ≤ .
j−i+1
This yields
X X 2
E [Xij Xik ] ≤ (n − 1) ≤ 2n(n − 1)Hn
j−i+1
i<j,k i<j
and thus in total Var [X] ≤ E [X]+4(n−1)nHn ≤ 2nE [X]+8n2 which gives the result.
2.2 Quickselect
In the problem Select we are given a sequence a = (a1 , a2 , . . . , an ) of pairwise distinct
numbers and are asked to find the k-th smallest element, for some given k. We are again
interested in the number of comparisons “≤”. Let ta (alg, k) be the number of comparisons
of an algorithm alg given a sequence a and parameter k.
The idea of the algorithm Quickselect is, similarly to Quicksort, to choose some
element p from a, called the pivot, and divide a into two subsequences ` and r. The se-
quence ` contains the elements ai ≤ p and r those ai > p. Then Quickselect determines
which subsequence contains the desired element and is then called recursively on that
subsequence until the element is found.
Output. Element b
(1) If n = 0 return.
(4) If |`| = k − 1 return b = p. Else, if |`| > k return Quickselect(`, k), otherwise return
Quickselect(r, k − |`|)
Worst-Case Analysis
Analogously to Quicksort, this variant of Quickselect may require Ω n2 compar-
isons.
Proof. Consider a = (1, 2, . . . , n) and k = n. Then, in step (3), ` remains empty while r
contains n − 1 elements. This step requires n − 1 comparisons. By induction, the recursive
call to Quickselect(r, n − 1) requires (n − 1)(n − 2)/2 comparisons “≤”. Thus, the whole
algorithm needs n − 1 + (n − 1)(n − 2)/2 = n(n − 1)/2 comparisons.
15
Average-Case Analysis
We study Quickselect also in the probabilistic model that each permutation of the origi-
nal sequence is equally likely. Then the expected number of comparisons of Quickselect
is O (n).
Theorem 2.6. For any k we have that E [tA (Quickselect, k)] ≤ 4n.
Proof. Again, without loss of generality we assume that the original sequence is itself a
permutation of (1, 2, . . . , n). Let cn,k be the expected number of comparisons of Quicks-
elect for the parameter k on a uniformly distributed permutation A of length n.
We give a recursion formula and then give an upper bound on the solution (by induc-
tion). Notice that Lemma 2.3 holds here, too, since Quickselect uses the same pivot-rule
as Quicksort. Since A is uniformly distributed the probability that the outcome of the
pivot P = p equals 1/n for every p ∈ {1, 2, . . . , n}. For n = 0 we have c0 = 0 and for n > 0
we we obtain the recursion
n
X
cn,k = E [tA (Quickselect, k)] = n − 1 + E [ tA (Quickselect, k) | P = p] Pr [P = p]
p=1
k−1
X
=n−1+ E [ tR (Quickselect, k − p) | P = p]
p=1
n
X
+ E [ tL (Quickselect, k) | P = p] Pr [P = p]
p=k+1
k−1 n
1 X X
=n−1+ cn−p,k−p + cp,k
n
p=1 p=k+1
16
Chapter 3
Knapsack
This chapter is concerned with the Knapsack problem. This problem is of interest in
its own right because it formalizes the natural problem of selecting items so that a given
budget is not exceeded but profit is as large as possible. Questions like that often also
arise as subproblems of other problems. Typical applications include: option-selection in
finance, cutting, and packing problems. See Example 1.1 for an illustration.
In the Knapsack problem we are given a capacity c and n items. Each item i comes
along with a profit pi and a weight wi . We are asked to choose a subset of the items as
to maximize total profit but the total weight not exceeding c. Let xj ∈ {0, 1} indicate if
a certain item j is included in the solution or not.
In the sequel let p = (p1 , . . . , pn ), w = (w1 , . . . , wn ), and x =P(x 1 , . . . , xn ) denote the
n n
respective vectors. The profit of a vector x ∈ {0, P 1} is val(x) = j=1 pj xj . The weight of
a vector x ∈ {0, 1}n is given by weight(x) = nj=1 P wj xj . In order to obtain a non-trivial
problem we assume wj ≤ c for all j = 1, . . . , n and nj=1 wj > c throughout.
xj ∈ {0, 1} j = 1, . . . , n.
Knapsack is NP-hard which means that “most probably”, there is no polynomial time
optimization algorithm for it. We will consider two stochastic variants of the problem: In
Section 3.1 the profits of the items are random, in Section 3.2 the weights are.
17
fit into the knapsack anymore. Common sense suggests the following simple algorithm:
xj = 1 for j = 1, . . . , k − 1, xj = 0 for j = k, . . . , n.
Unfortunately, the approximation ratio of this algorithm can be arbitrarily bad as the
example below shows. The problem is that more efficient items can “block” more profitable
ones.
Example 3.1. Consider the following instance, where c is a sufficiently large integer.
j pj wj pj /wj
1 1 1 1
2 c−1 c 1 − 1/c
The algorithm chooses item 1, i.e., the solution x = (1, 0) and hence val(x) = 1. The
optimum solution is x∗ = (0, 1) and thus val(x∗ ) = c − 1. The approximation ratio of the
algorithm is 1/(c − 1), i.e., arbitrarily bad (small). However, this natural algorithm can
be turned into a 1/2-approximation from a worst-case perspective and even better from
an average-case point of view.
Worst-Case Analysis
The result below states that this modified algorithm is a 1/2-approximation. It is an
exercise to show that this bound is tight.
Theorem 3.2. We have that r(Greedy) ≥ 1/2.
Proof. In Problem 3.1, let us replace the constraints xj ∈ {0, 1} with xj ∈ [0, 1]. This yields
that the optimum solution of the modified problem can only be larger than the optimum
of the original problem. Such a modification is called relaxation of the constraints.
xj ∈ [0, 1] j = 1, . . . , n.
The optimal solution of this problem has the following structure. The proof of the
observation below is an exercise.
18
Observation 3.3. Let p, w, ∈ Nn be non-negative integral vectors with
p1 p2 pn
≥ ≥ ··· ≥ .
w1 w2 wn
The break index is defined as
j
( )
X
k = min j ∈ {1, . . . , n} : wi > c .
i=1
zj∗ = 1 for j = 1, . . . , k − 1,
c − k−1
P
∗ i=1 wi
zj = for j = k, and
wk
zj∗ = 0 for j = k + 1, . . . , n.
Notice that z ∗ has the form z ∗ = (1, . . . , 1, α, 0, . . . , 0) for some value 0 ≤ α < 1 at the
break index. The Greedy algorithm constructs two solutions x and y. Observe that x
has the form x = (1, . . . , 1, 0, 0, . . . , 0), i.e., equal to z ∗ except it has α = 0 at the break
index. Furthermore y = (0, . . . , 0, 1, 0, . . . , 0) with only the break item taken. The value
obtained by the Greedy algorithm is equal to max{val(x), val(y)}.
Let x∗ be an optimum solution for the Knapsack instance. Since every solution
that is feasible for the Knapsack instance is also feasible for the respective Fractional
Knapsack instance we have that val(x∗ ) ≤ val(z ∗ ).
In total we have
Average-Case Analysis
On random instances (and sufficiently large capacity c), Greedy is usually much better
than 1/2-approximate. Indeed, the algorithm approximates the better, the larger the
capacity is. In the sequel let the weights wj be arbitrary, i.e., adversarial and the profits
Pj ∼ Uni (0, 1) be uniform distributed. It is important to note that the adversary specifies
the weights before the profits materialize.
Theorem 3.4. We have that e(Greedy) ≥ 1 − O c−1 .
Proof. Let p be an outcome of the random profits and let them be ordered according to
efficiency. Consider the solution-candidate x of Greedy and z ∗ the optimal solution for
Fractional Knapsack. Analogously to the deterministic case we have
val(x) val(x) val(x) αpk 1
≥ ≥ =1− ≥1− .
val(x∗ ) val(z ∗ ) val(x) + αpk val(x) val(x)
Taking expectations yields e(Greedy) ≥ 1 − E [1/val(X)]. Thus it suffices to show that
E [1/val(X)] = O c−1 . Consider the following solution s: Sort the items according to
weight and include exactly the c smallest items. It is important to note that this solution
is constructed without looking at the profits. We prove:
19
(1) For any outcome of the profits we have val(s) ≤ val(x).
(2) E [1/val(X)] ≤ E [1/val(s)] = O c−1 .
20
3.1.2 Nemhauser-Ullmann Algorithm
In this section we will give an average-case analysis of an elegant enumeration algorithm,
named after its authors, Nemhauser-Ullmann. This algorithm computes an optimal
solution for the Knapsack problem. Not surprisingly, since Knapsack is NP-hard, the
algorithm has exponential running time in the worst case. However, under some rather
weak assumptions on profit distributions, the algorithm has polynomial expected running
time. For sake of exposition, we restrict our attention to the (important) special case that
profits are uniformly distributed.
One (stupid) way of solving Knapsack is to enumerate all subsets of the set of items,
check the feasibility of the current candidate, and keep track of the most profitable subset.
However, this approach has always exponential running time and enumerates a lot of
“uninteresting” subsets. Why? If we know (for some reason) that a certain subset is not
optimal, then all its subsets will also not be optimal. Hence it makes no sense enumerating
them. A more clever way will be introduced now.
The algorithm uses the following dominance criterion. First of all, we may assume
without loss of generality that for two solutions x 6= y we have val(x) 6= val(y) or
weight(x) 6= weight(y). Why? Since these solutions are equivalent otherwise, we lose
nothing by droping one of them. Let x and y be two solutions. We say that y is dominated
by x if
val(x) ≥ val(y) and weight(x) ≤ weight(y).
A solution x is called dominating if there is no solution y 6= x that dominates x. The
intuition is that the solutions that are dominated are redundant: We already have a
solution that is not heavier and at least as profitable. The following observation is the
basis of the algorithm.
Observation 3.5. An optimal solution is to be sought among the dominating solutions.
Let Sj be the set of all dominating solutions only considering the items {1, . . . , j}, i.e.,
the solutions are of the form x ∈ {0, 1}j 0n−j . For any two solutions x and y define the
operation z = x ⊕ y that sets zj = 1 if xj = 1 or yj = 1, and zj = 0 otherwise. Further
define the special solution 1j which is everywhere zero except at item j, where it has the
value one.
Algorithm 3.2 Nemhauser-Ullmann
Integer c, vectors p, w ∈ Nn with wj ≤ c, and
P
Input. j wj > c.
The running time obviously depends on the cardinalities of the sets S0 , S1 , . . . , Sn . For
any j let qj be an upper bound on Sj . We may assume that q0 ≤ q1 ≤ · · · ≤ qn . It is an
exercise to show that the computation of Sj can be implemented to run in time O (qj ).
This yields the following bound.
21
Lemma 3.6. Let I be an instance of Knapsack with n items. We have that
tI (Nemhauser-Ullmann) = O (nqn ) ,
Worst-Case Analysis
In the worst case, the running time of Nemhauser-Ullmann is exponential. The proof
is left as an exercise.
Average-Case Analysis
Here we give an average-case analysis using the following probabilistic model. It causes
no loss of generality to assume that the weights and profits are chosen between zero and
one. Specifically, an adversary chooses weights wj ∈ [0, 1] arbitrarily and the profits are
independent and uniformly distributed Pj ∼ Uni (0, 1). It is important to note that the
adversary specifies the weights without knowing the outcomes of the Pj . Then the expected
running time of Nemhauser-Ullmann is polynomial.
Let the random variable Qj denote the number of dominating sets using the elements
{1, . . . , j}, i.e., the cardinality of the set Sj . The idea of the proof below is to show
E [Qn ] = O n3 .
Proof of Theorem 3.8. Let m = 2n and define x1 , x2 , . . . , xm as the sequence of all vectors
x ∈ {0, 1}n according to non-decreasing weight, i.e., weight(x1 ) ≤ · · · ≤ weight(xm ), and
those with equal weight, with decreasing profit.
Let A0 = ∅ and for any k ≥ 1 define the set Ak = {x1 , . . . , xk } and the number
The crucial observation is that xk is dominating if and only if δk > 0. Why? First of
all, since Ak ⊇ Ak−1 we have δk ≥ 0. For δk > 0 the solution xk is not dominated by
any y ∈ Ak−1 since val(xk ) > val(y). It is also not dominated by any y ∈ Am − Ak since
at least weight(xk ) < weight(y) or val(xk ) > val(y) is true. Thus, if δk > 0, then xk is
dominating. Conversely, if δk = 0 there is a solution y ∈ Ak−1 with val(xk ) = val(y) and
weight(xk ) ≥ weight(y), i.e., xk is dominated by y. Pk
Furthermore observe that we can write val(xk ) = `=1 δ` for any k. For random
profits, δk is actually a random variable ∆k . The following intermediate result (proof
below) is the key: It essentially states that if a solution is dominating, i.e., ∆k > 0, then
its increase in profit is reasonably large.
Lemma 3.9. It holds that E [ ∆k | ∆k > 0] ≥ 1/(32n2 ).
22
Observe that E [val(xm )] = n/2. Using Lemma 3.9 we can furthermore write
"m # m
X X
E [val(xm )] = E ∆k = E [∆k ]
k=1 k=1
m
X
= E [ ∆k | ∆k > 0] Pr [∆k > 0]
k=1
m
X 1
≥ Pr [∆k > 0]
32n2
k=1
so we have obtained
m
X
Pr [∆k > 0] ≤ 32n2 E [val(xm )] = 16n3 .
k=1
We are interested inPE [Qn ]. Let Dk be an indicator variable for the event ∆k > 0. Thus
we can write Qn = m k=1 Dk and hence
"m # m
X X
E [Qn ] = E Dk = Pr [∆k > 0] ≤ 16n3
k=1 k=1
Proof of Lemma 3.9. For ease of notation we will drop unnecessary subscripts: Let x = xk
be some solution in the sequence x1 , x2 , . . . , xm as defined above and let ∆ = ∆k be the
associated random variable to δ = δk . We will use this notation below: For two solutions
x and y define z = x y so that zj = 1 if xj = 1 and yj = 0, and zj = 0 otherwise.
We bound E [ ∆ | ∆ > 0] from below by
1 1
E [ ∆ | ∆ > 0] ≥ Pr ∆ ≥ 2
∆>0 ·
16n 16n2
and thus it suffices to show that
1 1
Pr ∆ ≥ 2
∆>0 ≥ .
16n 2
Fix a solution x = xk and let x1 , . . . , xk−1 be its predecessors in the sequence of
solutions. Further define a` = x x` and b` = x` x. Now we look at the probability we
are interested in:
1 1
Pr ∆ ≥ ∆ > 0 = Pr ∀` < k val(x) − val(x` ) ≥ ∀` < k val(x) > val(x` )
16n2 16n2
1
= Pr ∀` < k val(a` ) ≥ val(b` ) + ∀` < k val(a` ) ≥ val(b ` )
16n2
1
= Pr ∀` < k val(b` ) ≤ val(a` ) − ∀` < k val(b` ) ≤ val(a` )
16n2
Without loss of generality, the items are indexed such that x ∈ {0, 1}d 0n−d for some d.
This yields that a` ∈ {0, 1}d 0n−d and b` ∈ 0d {0, 1}n−d .
Now fix the realizations of the profits p1 , . . . , pd , i.e., we have fixed the values of x
and a` for any ` < k. The remaining profits pd+1 , . . . , pn may still be varied. Recall the
23
definition of conditional probabilities Pr [ A | B] = Pr [A ∩ B] /Pr [B] for any two events A
and B with Pr [B] > 0. Consider the following sets (that will correspond to events later):
n−d 1
A = (pd+1 , . . . , pn ) ∈ [0, 1] : ∀` < k val(b` ) ≤ val(a` ) −
16n2
n o
B = (pd+1 , . . . , pn ) ∈ [0, 1]n−d : ∀` < k val(b` ) ≤ val(a` )
and it only remains to show that the choice of ε = 1/(4n) has the desired property.
For any a` with val(a` ) ≥ 1/(4n) and this ε we have
val(a` ) 1
val(a` )(1 − ε) = val(a` ) − ≤ val(a` ) − .
4n 16n2
Thus, whenever val(a` ) ≥ 1/(4n), any member of Bε is also a member of A, which means
24
that Bε ⊆ A. This property allows us to finalize the proof:
1
Pr [Bε ⊆ A] ≥ Pr val(a` ) ≥ ∀` < k val(b` ) ≤ val(a` )
4n
1
= 1 − Pr val(a` ) ≤ ∀` < k val(b` ) ≤ val(a` )
4n
1
≥ 1 − Pr ∃j ∈ {1, . . . , d} : Pj ≤ ∀` < k val(b` ) ≤ val(a` )
4n
d
X 1
≥1− Pr Pj ≤ ∀` < k val(b ` ) ≤ val(a ` )
4n
j=1
d
X 1 d
≥1− Pr Pj ≤ =1−
4n 4n
j=1
3
≥ .
4
Above, Pr [ Pj ≤ 1/(4n) | ∀` < k val(b` ) ≤ val(a` )] ≤ Pr [Pj ≤ 1/(4n)] is the only location
in the whole proof where we use that the profits are uniform distributed.
Example 3.10. The capacity is c = 1, n = 3, and the profits are p = (1, 1, 1). The weight
distributions are W = (W1 , W2 , W3 ) with
( (
1/5 probability 1/2 2/5 probability 1/2
W1 = , W2 = 4/5 probability 1, W3 = .
3/5 probability 1/2 9/10 probability 1/2
If we have to specify the ordering at which we try the items in advance, then the best
ordering is (1, 2, 3). This yields an expected profit of 3/2.
If we are allowed to choose depending on the current situation in our knapsack, we
might want to do as follows: We first insert item 1. In case W1 = 1/5 we insert item
2. Otherwise we insert item 3. Looking at the respective decision tree yields that our
expected profit is 7/4.
25
We further want to motivate that there are actually applications (in scheduling) for
such a setting. Suppose we are given a set of jobs and we can obtain a certain profit pj
from each job j. The actual duration of the job wj is not entirely known. We rather know
the probability distribution for the duration. There is a fixed deadline (i.e., the capacity of
the knapsack) and we seek to maximize the profit of the jobs finished before the deadline.
We will consider two types of algorithms, depending on the point in time they have
to choose the item that is tried next. The more powerful ones are adaptive algorithms:
Given the set I ⊆ J that is currently in the knapsack and its residual capacity 0 ≤ c ≤ 1,
an algorithm Π determines an item j = Π(I, c) as the next candidate. We can thus think
of such an algorithm as a mapping Π : 2J × [0, 1] → J specifying the next item to insert.
Notice that the profit of such an algorithm, denoted val(Π) is a random variable. We
denote by
adapt(p, W ) = max E [val(Π)]
Π
the maximal expected profit achievable by adaptive algorithms for the instance (p, W ).
By contrast, a non-adaptive algorithm π has to specify the ordering in which items are
tried in advance, i.e., the algorithm chooses a permutation π of J. We denote by
the maximal expected profit achievable by non-adaptive algorithms for the instance (p, W ).
Since non-adaptive are special cases of adaptive algorithms we always have the inequal-
ity
adapt(p, W ) ≥ non-adapt(p, W ).
For any non-adaptive algorithm Alg, that chooses the permutation π, say, and achieves
alg(p, W ) = E [val(π)], the quantity
alg(p, W )
α(Alg) = sup
(p,W ) adapt(p, W )
26
Fix an adaptive policy Π and let X denote the (random) items that are inserted
successfully. If the weights are deterministic it is clear that mass(X) ≤ 1. Perhaps
surprisingly, in a stochastic setting we can have E [mass(X)] ≥ 1.
Example 3.12. Suppose we have an infinite number of items, where each profit is one
and the weight is one with probability p and zero otherwise. Then the expected number of
items we insert successfully is 2/p−1 since we can continue until two of them instantiate to
unit weight. Each item j has mean truncated weight µj = p and thus E [mass(X)] = 2 − p,
which can be arbitrarily close to 2.
However, the mass we try to insert can never be more than two in expectation. The
idea of proof of the following lemma is the observation that each W̃j is bounded by one
and we can count at most one item beyond the capacity of one.
Lemma 3.13. For any instance (with knapsack capacity equal to one) and any adaptive
policy, let X denote the (random) vector of items the algorithm attempts to insert. Then
E [mass(X)] ≤ 2.
With this key-lemma, we can bound the expected profit of an optimal adaptive al-
gorithm. Define by vj = pj Pr [Wj ≤ 1] the effective value of item j which is an upper
bound on the expected profit an algorithm can gain if it attempts to insert j. Now we
parametrize the Fractional Knapsack relaxation we already know with a value t for
the capacity of the knapsack:
X X
Φ(t) = max vj xj : µj xj ≤ t, xj ∈ [0, 1] .
j j
With this formulation we can prove a bound on the expected profit of any optimal adaptive
algorithm.
Theorem 3.14. For any instance (p, W ) it holds that adapt(p, W ) ≤ Φ(2).
Proof. Fix an adaptive policy Π and let x denote the (random) vector of items that Π
attempts to insert into the knapsack. Further let the vector y be yj = Pr [xj P
= 1].
Given y, the expected mass that Π attempts to insert is E [mass(x)] = j µj yj . On
the other hand, Lemma 3.13 yields that E [mass(x)] ≤ 2. But this in turn means that the
vector y is feasible for the problem Φ(2) and we find the bounds
X
Φ(2) ≥ vj yj ≥ E [val(Π)]
j
27
Consider the function Φ(t), which can be seen as the Fractional Knapsack relax-
ation with capacity t. The usual Greedy algorithm considers the items in decreasing
order of expected effective efficiency vj /µj :
v1 vn
≥ ··· ≥ .
µ1 µn
Pk
Define mk = j=1 µj and observe that we have for any t = mk−1 + ξ ∈ [mk−1 , mk ]
k−1
X vk
Φ(t) = vj + ξ.
µk
j=1
We may assume that Φ(1) = 1 by scaling the vjPby an appropriate factor. And we may
assume that there are enough items such that nj=1 µj ≥ 1 (which can be achieved by
adding items with profit zero and positive weight).
Now we define the randomized greedy Rgreedy algorithm: Let r be the break index,
i.e., the smallest index with rj=1 µj ≥ 1 and let µ0r = 1 − r−1
P P
j=1 j , i.e., the part of item r
µ
that fits within capacity one. LetPp = µr /µr , wr = pwr . For j = 1, . . . , r − 1 set wj0 = wj
0 0 0
(2) If k < r insert item k, i.e., xk = 1. If k = r, flip another independent coin and insert
item r, i.e., xr = 1 with probability p0 , otherwise discard it.
(3) Then insert the items 1, 2, . . . , k − 1, k + 1, . . . , r in the greedy order, i.e., for j =
1, 2, . . . , k − 1, k + 1, . . . , r set xj = 1 if the item j still fits.
adapt(p, W ) ≤ Φ(2) ≤ 1 + ω,
where ω = vr /µr ≤ ( rj=1 vj )/( rj=1 µj ) = 1. With rj=1 µj = 1, the algorithm has the
P P P
simpler form:
We estimate the expected value achieved by this algorithm, with respect to the ran-
domization of the weights and with respect to our own randomization. Fix any k. The
Pk−1k is inserted first with probability vk , inserted after {1, . . . , k − 1} with probability
item
j=1 vj , and after {j, 1, . . . , k − 1} with probability vj for any k < j ≤ r. If k is the first
28
item, then its expected profit is vk = pk Pr [Wk ≤ 1]. Otherwise we use Lemma 3.11, i.e.,
a variant of Markov’s inequality:
X k Xk
Pr [item k fits] = Pr Wj ≤ 1 ≥ 1 −
µj .
j=1 j=1
P by {j, 1, . . . , k P
Similarly for the case when k is preceded − 1}.
With the obvious observation 1 − kj=1 µj ≥ vj (1 − kj=1 µj ) for any j, be denote by
ck the following lower bound on the expected profit of item k:
k−1
X k
X X r Xk
ck = vk vk + vj (1 − µ` ) + (1 − µ` − µj )
j=1 `=1 k+1 `=1
r
X k
X k
X r
X
= vk vj (1 − µ` ) + vk µ` − vj µj .
j=1 `=1 `=1 `=k+1
Pr Pr Pr
It holds that rgreedy(p, W ) ≥ k=1 ck and using j=1 vj = 1 and j=1 µj = 1 we
simplify the bound to
r
X k
X k
X r
X
rgreedy(p, W ) ≥ vk 1 − µj + vk µj − wj µj
k=1 j=1 j=1 j=k+1
X X
=1+ (−vk µj + vk2 µj ) − vk vj µj
j≤k≤r k≤j≤r
X r
X
=1+ (−vk µj + vk2 µj + vk vj µj ) − vk vj µj
j≤k≤r j,k=1
X r
X
=1+ vk µj (vj + vk − 1) − vj µj .
j≤k≤r j=1
To symmetrize this polynomial, we apply the condition of the greedy ordering. For any
j < k we have vj + vk − 1 ≤ 0 and the greedy ordering implies vk µj ≤ vj µk , allowing us
to replace vk µj by (vk µj + vj µk )/2 for all pairs j < k. This yields
r r
1 X X X
rgreedy(p, W ) ≥ 1 + (vk µj + vj µk )(vj + vk − 1) + vj µj (2vk − 1) − vj µj
2
j≤k≤r j=1 j=1
r r r
1 X 1 X X
=1+ vk µj (vj + vk − 1) + vj µj (2vj − 1) − vj µj
2 2
j,k=1 j=1 j=1
Pr Pr
and using j=1 vj = j=1 µj = 1 gives
r r r
1 1X 2 X 2 X
≥ + vk + vk µk − vk µk .
2 2
k=1 k=1 k=1
We want to compare this expression to 1 + ω, where ω = min{vj /µj : j < r}. We use the
29
Pr 2
Pr
value of ω to estimate k=1 vk ≥ω k=1 vk µk and we obtain
r r r
1 ωX X
2
X
rgreedy(p, W ) ≥ + vk µk + vk µk − vk µk
2 2
k=1 k=1 k=1
r
1 X ω
= + µk vk + vk − 1
2 2
k=1
r 2
1 X 1 ω
≥ − µk −
2 2 4
k=1
1 ω ω2
≥ + − .
4 4 16
We compare this to the adaptive optimum, which is bounded by 1 + ω and minimize for
ω ∈ [0, 1]:
rgreedy(p, W ) 1 ω2 7
≥ − ≥ .
adapt(p, W ) 4 16(1 + ω) 32
It remains to remove the assumption that j=1 µj = 1. We claim that if rj=1 µj > 1, the
Pr P
randomized greedy algorithm performs just like the simplified algorithmPr we 0just analyzed,
0 0
on a modified instance with values vj and mean sizes µj (so that j=1 µj = 1; see the
description of the algorithm). Indeed, Φ(1) = 1 and ω = vr /µr = vr0 /µ0r in both cases.
Hence the bound on adapt(p, W ) is the same. For an item k < r, our estimate of the
expected profit in both instances is
k−1
X k
X r
X k
X
ck = vk0 vk0 + vj0 (1 − µ0` ) + (1 − µ0` − µ0j ) .
j=1 `=1 k+1 `=1
For the original instance, this is because the expected contributions of item r to the total
weight, conditioned on being selected first, is p0 µr = µ0r . If it is not selected first it is not
counted at all. The expected profit for item r is cr = vr0 p0 vr = (vr0 )2 in both instances.
This reduces the analysis to the case we dealt with already, completing the proof of the
theorem.
30
Chapter 4
where cj (π) denotes the time when job j finishes under the ordering π. This objective
function measures the total (weighted) time the jobs have to wait in the system before
completion. Of course, our goal is to minimize the objective function.
subject to π ∈ Π(J)
In order to get an intuition for the objective function and that the ordering indeed
plays a role, let us have a look at the following instance.
Example 4.1. There are n jobs J = {1, . . . , n}, all with weight wj = 1, and the following
processing times: p1 = p, p2 = · · · = pn = 1, where p is a large number. Consider the
orderings
π = (1, 2, . . . , n) and π ∗ = (2, . . . , n, 1).
We find val(π) = p + p + 1 + p + 2 + · · · + p + n − 1 = np + n(n − 1)/2 and val(π ∗ ) =
1 + 2 + · · · + n − 1 + n − 1 + p = n(n − 1)/2 + n − 1 + p. So if we choose p substantially larger
than n2 , then the ratio val(π)/val(π ∗ ) is arbitrarily close to n. We will see very soon how
to solve this problem in polynomial time (which will imply that π ∗ is optimal).
The above example indicates that one might want to schedule the jobs in increasing
order of processing time. For the weighted version it may be a good idea to schedule the
31
jobs in increasing order of the ratio pj /wj , called efficiency. So we may assume that the
jobs are given in the greedy order
p1 p2 pn
≤ ≤ ··· ≤ .
w1 w2 wn
For any list π denote by j ≤π k that job j is not after job k in π; analogously j <π k
for j before k. In any ordering π, whenever some job j is somewhere before a job k, i.e.,
j <π k, such that pj /wj > pk /wk , then j and k from an inversion. Notice that the greedy
ordering has no inversions.
Theorem 4.2. The greedy order is optimal for total weighted completion time scheduling.
Proof. We show that eliminating an inversion strictly improves the objective function. Let
there be an inversion, otherwise there is nothing to show. Then there is an inversion where
the involved job j directly precedes the other one k, say, i.e., the ordering π has the form
π = (. . . , j, k, . . . ). Swap j and k and leave the rest of the ordering unchanged yielding
the ordering π 0 = (. . . , k, j, . . . ). Now π 0 has exactly one inversion less than π.
We obviously have
val(π 0 ) = val(π) − wk pj + wj pk .
Since j and k from an inversion it holds that −wk pj + wj pk < 0 since pj /wj > pk /wk
completing the proof.
the total weighted completion time of that algorithm. alg = alg(P, w) denotes the
induced random variable. Given any outcome p of P , let
32
be the optimal objective value. Let opt = opt(P, w) be the induced random variable and
Π∗ the respective random ordering. As mentioned already, we measure the quality of an
algorithm with the expected value of the approximation ratio
h alg i
e(alg) = E ,
opt
where the expectation is taken with respect to the processing time distribution.
Greedy Algorithm
Maybe the most natural algorithm is to determine the expected values µj = E [Pj ] and to
schedule the jobs in the ordering
µ1 µ2 µn
≤ ≤ ··· ≤ .
w1 w2 wn
Let this algorithm be called Egreedy.
The question is how good is this algorithm? Suppose that the algorithm scheduled the
job j before k since µj /wj ≤ µk /wk , but the outcome is such that pj /wj > pk /wk . This
means that the algorithm scheduled the jobs j and k in the wrong order. How large is the
contribution to the objective function of this error?
If we look at the proof of Theorem 4.2 we see that a pair of adjacent jobs that are
in the wrong order contribute wj pk − wk pj to the objective function. They are in wrong
position if and only if pj /wj > pk /wk . This suggests the following definitions: For any
pair j and k of jobs define
δjk = wj pk − wk pj ,
and (
1 if pj /wj > pk /wk ,
xjk =
0 otherwise.
The intuition here is that δjk measures the influence of the error, i.e., of the inversion of
j and k, to the objective function and xjk triggers if this error had occurred. Thus the
following lemma, whose proof is left as an exercise, comes to mind: Let π be an arbitrary
ordering and π ∗ be an optimal ordering for the instance p and w.
For technical reasons in the analysis below, we need a little generalized version of this
lemma. We alter the definition of δjk a little bit, but xjk stays the same: For a pair j and
k of jobs and any number β ≥ 1 define δjk = wk pj − βwj pk . This yields a similar result.
Lemma 4.4. For any β ≥ 1 we have val(π) ≤ βval(π ∗ ) + j k>π j δjk xjk .
P P
Proof. Define the variable yjk (π) that is one if k >π j and otherwise zero. We use yjk =
∗ = y (π ∗ ) as a shorthand. Observe that x
yjk (π) and yjk jk jk = 1 if and only if yjk = 1 and
∗ ∗ π ∗
yjk = 0. To see this recall that yjk = 0, i.e., k ≤ j implies pk /wk ≤ pj /wj and hence
δjk ≥ 0. Further note that we have yjk = 1 − ykj and yjk ∗ = 1 − y ∗ for j 6= k.
kj
For every list π it holds that
XX XX X
val(π) = wk pj = wk pj yjk (π) + wj pj
j k≥π j j k j
33
Using β ≥ 1 we calculate
XX
val(π) − βval(π ∗ ) ≤ ∗
wk pj (yjk − βyjk )
j k
XX
∗ ∗
= wk pj (yjk − βyjk ) + wj pk (ykj − βykj )
j k>π j
XX
∗ ∗
≤ wk pj (yjk − yjk ) − wj pk ((β − βykj ) − (β − βykj ))
j k>π j
XX
∗ ∗
= wk pj (yjk − yjk ) − βwj pk (yjk − yjk )
j k>π j
XX XX
∗ ∗
= (wk pj − βwj pk )(yjk − yjk )= δjk (1 − yjk )
j k>π j j k>π j
XX
= δjk xjk
j k≥π j
34
It is not hard to see that, since the jobs are assumed to be independent, t is allowed
to be the random variable wj Pk . Observe also that opt is a monotone non-decreasing
function. (Why?) So we apply the lemma to the random variable wk Pj and see that we
are allowed to choose β = 1 in (4.1).
Proof of Theorem 4.5. Let the n jobs be indexed such that µ1 /w1 ≤ · · · ≤ µn /wn , i.e.,
in the greedy order. Let π = (1, . . . , n) be the list of the jobs in that order. We use
Lemma 4.4 and property (4.1) to find
35
Chapter 5
Paging
where c = c(k) may depend on k but not on the length of R and γ is a constant independent
of both. The smallest such c is called the competitive ratio of the algorithm and denoted
rA (alg). If no such c exists, then the algorithm is called not competitive.
It is known that Lru (and several other paging algorithms including Fifo) are k-com-
petitive against the unrestricted adversary, and that this bound is tight. However, the
36
theoretical k-competitiveness seems to be overly pessimistic to be meaningful for practical
situations. For example, in an experimental study the observed competitive ratio of Lru
ranged between one and three but was frequently around 3/2. This gap between theo-
retical worst-case bounds and practically observed behaviour called for research. There is
consensus that the practical success of Lru (and other algorithms) is due to a phenomenon
called locality of reference. This means that a requested page is likely to be requested again
in the near future. Thus a lot of the related theoretical work was devoted to modeling
locality appropriately. The two most common ways are randomized adversaries, where re-
quest sequences obey an underlying probability distribution and deterministic adversaries,
for which the set of admissible sequences is restricted explicitly.
Overview
One difficulty in analyzing the paging problem was the lack of a strong lower bound for the
minimum number of faults needed to serve a given request sequence. So far, this number
was bounded from below by partitioning the sequence into phases, giving insight only into
local structure of the paging problem.
In Section 5.1 we give an analysis, which is based on a lower bound that captures the
structure of the entire sequence. Furthermore, the framework characterizes the regime in
which an offline optimal strategy can substantially outperform Lru. Two requests to the
same page with exactly ` distinct pages inbetween are called a pair with distance `. Define
the characteristic vector
of a sequence R, where every entry c` (R) counts the number of pairs in the sequence with
distance exactly `. Here p = p(R) denotes the number of distinct pages accessed in R.
This definition allows us to characterize the number of faults of Lru because it will have
evicted a certain page if and only if at least k distinct pages were requested since the last
request to that page. Therefore we have
X
lru(R) = c` (R) + p(R).
`≥k
The similar form of these bounds allows direct comparison of the number of faults of Lru
and Opt on any given sequence and any cache size k.
In Section 5.2 we propose a deterministic adversary, which is only mildly restricted by
two parameters. The adversary not only captures locality of reference, but also another
phenomenon which has not been considered in previous theoretical work: typical memory
access patterns. The idea here is that if data is accessed, many of the items will be
needed again, but not in the near future. For those requests any algorithm, especially
also an optimum one, will most likely incur a fault. We explain the favorable observed
competitive ratio to a large extend with this phenomenon. In specific, we prove a bound
for the competitive ratio of Lru which depends on the parameters of our adversary. We
argue that the values for these in real programs are indeed such that our bound gives
values between two and five, coming relatively close to the competitive ratios observed in
pratice.
37
We formalize this intuition with the (α, β)-adversary. Let α > 1 and β ≥ 0 be fixed.
The (α, β)-adversary is free to choose any sequence R respecting the constraint
αk−1
X p−1
X
c` (R) ≤ β c` (R).
`=k `=αk
The intuition behind this definition is that there are not “too many” request pairs with
“critical” distance between k and αk − 1: The pairs with distance less than k do no harm
because neither Lru nor Opt fault there. On the other extreme, both, Lru and Opt will
fault on many pairs with distance at least αk. But the ones where Opt might outperform
Lru are those with distance inbetween.
Our analysis yields that the competitive ratio of Lru for this class of sequences is
bounded by
α
r(α,β) (lru) ≤ 2(1 + β) .
α−1
Observe that, depending on the values for α and β, our bound can actually become as
small as two. Furthermore, the notions of the characteristic vector and the (α, β)-adversary
are a formal framework describing on which sequences Lru performs good and explains
why it performs poorly on others. We performed extensive experiments and calculated
feasible values for α and β. The results supported the intuition that the sequences in
practice feature “large” α and “small” β. Our bound for the competitive ratio of Lru on
these sequences ranged between two and five, coming close to practical experience. See
Figure 5.2 and Figure 5.3 for two representative results.
where c` (R) = ||C` (R)||. Observe that each c` (R) counts the number of consecutive pairs
(i, j) that have exactly ` distinct pages between i and j.
38
Bounds for LRU and OPT
The characteristic vector enables us to derive algebraic descriptions for the number of
faults of Lru, see Theorem 5.1. Furthermore, it allows us to bound the number of faults
of any optimal offline paging algorithm Opt from below, see Theorem 5.2. Especially the
latter result is crucial for our analysis.
Theorem 5.1. Let k denote the cache size and let R be a request sequence with charac-
teristic vector c(R), then X
lru(R) = c` (R) + p(R).
`≥k
Proof. If a certain page is in the cache, it will be evicted by Lru before it is requested the
next time if and only if there are requests to at least k distinct pages before that. The
additional p(R) in the bound is due to the initial faults, i.e., the faults incurred when the
page is brought to the cache for the first time.
Theorem 5.2. Let k denote the cache size and let R be a request sequence with charac-
teristic vector c(R) and p(R) ≥ k + 1, then
1 X`−k+1
opt(R) ≥ k−1 k−1
c` (R).
1+ k − p−1 `≥k
`
39
Theorem 5.2 states that the max in the above lower bound can be replaced by a sum,
losing a factor which is at most two. The main difficulty for the proof is that consecutive
request pairs of different distances may overlap, which has to be handled appropriately.
We make use of amortized analysis to deal with it. The remainder of this section is devoted
to the proof.
Proof of Theorem 5.2. Recall that C denotes the set of all consecutive request pairs and
also recall the set T which is referred to as time. Consider an execution of an arbitrary
optimal offline algorithm Opt on a given sequence R, e.g., Longest-Forward-Distance.
Define
F = F (R) = {t ∈ T : Opt faults at time t} (5.1)
and (
1 if t ∈ F (R),
ft (R) =
0 otherwise,
which yields X
opt(R) = ft (R).
t∈T
Let t ∈ T be a point in time and let (i, j) ∈ C be a consecutive request pair. We prove
that there exitsts a function δt (i, j) which has the two properties expressed in Claim 5.4
and Claim 5.5 below.
Claim 5.4. Let (i, j) ∈ C. If dist(i, j) ≥ k, then it holds that
X dist(i, j) − k + 1
δt (i, j) = .
dist(i, j)
t∈T
P
Otherwise, i.e., if dist(i, j) < k, then t∈T δt (i, j) = 0.
Claim 5.5. For every t ∈ T we have that
X k−1 k−1
δt (i, j) ≤ 1 + − ft (R).
k p−1
(i,j)∈C
Suppose that Claim 5.4 and Claim 5.5 hold. Then the following argument completes
the proof of the lower bound.
X
opt(R) = ft (R)
t∈T
(Claim 5.5) 1 X X
≥ k−1
δt (i, j)
1+ k − k−1
p−1 t∈T (i,j)∈C
1 X X
= k−1 k−1
δt (i, j)
1+ k − p−1 (i,j)∈C t∈T
40
i 5 4 3 2 1 j t
(a)
t=i 1 1 1 2 3 3 4 5 5 j t
(b)
Figure 5.1: Two views on the rank: fixed pair and fixed time.
It is easy to see from the discussion preceeding this proof, that the bound is tight for
(k + 1)-multicycles (1, 2, . . . , k + 1, 1, 2, . . . , k + 1, . . . ). These sequences request p = k + 1
distinct pages and hence 1 + (k − 1)/k − (k − 1)/(p − 1) = 1.
Corollary 5.3 is implied from the observation
k−1 k−1
1+ − ft (R) ≤ 2ft (R).
k p−1
Below, we present the proofs of Claim 5.4 and Claim 5.5 that complete the proof of
the theorem.
Before we actually prove Claim 5.4 and Claim 5.5 we need additional notation. For
every pair (i, j) ∈ C define the set
i.e., the times when Opt faults between i and j. For every t ∈ F (i, j), define the rank of
the fault at time t within (i, j) by
The intuition is that the fault with rank r is the r-th last fault within (i, j).
Figure 5.1 depicts two perspectives on the rank. Black circles indicate points in time
when the optimum faults, white circles the others. It turns out that both perspectives are
needed to prove Claim 5.4 and Claim 5.5.
In Figure 5.1 (a), a fixed pair (i, j) is considered and the ranks of the points in time
when the optimum faults are shown. These ranks decrease over time.
In Figure 5.1 (b), the fixed time i = t ∈ F is considered and all the pairs (v, w) with
w > t (and v < t) are shown. The numbers below the time axis indicate the respective
rank of time t for each pair.
41
Let
L(i, j) = {t ∈ F (i, j) : rankt (i, j) ≤ dist(i, j) − k + 1}
denote the set of the dist(i, j) − k + 1 last faults of (i, j).
Define the value of (i, j) at time t ∈ T by
val1 (i, j) = dist(i, j), (5.2)
(
valt (i, j) − 1 if t ∈ L(i, j),
valt+1 (i, j) = (5.3)
valt (i, j) otherwise.
Observe that the value of a pair (i, j) decreases by exactly one at a time whenever one
of the dist(i, j) − k + 1 last faults in L(i, j) occurs. At all other times, the value remains
constant. It is not hard to see that the following equality holds. It relates the rank of a
fault at time t ∈ L(i, j) with the value at that time:
valt (i, j) = rankt (i, j) + k − 1. (5.4)
Hence valt (i, j) ∈ {k, . . . , dist(i, j)} for all t ∈ L(i, j).
We are now in position to define the function δt (i, j) based on valt (i, j):
(
k−1 k−1
valt+1 (i,j) − valt (i,j) if dist(i, j) ≥ k,
δt (i, j) = (5.5)
0 otherwise.
Observe that δt (i, j) = 0 for all times except t ∈ L(i, j). The two crucial properties
of
P this definition are the following. First, for each pair (i, j) with dist(i, j) ≥ k we have
t∈T δt (i, j) = (dist(i, j) − k + 1)/dist(i, j), which will be shown below. Second, observe
that (5.5) is defined
P in such a way that the smaller the rank of a fault is, the more it
contributes to t∈L(i,j) δt (i, j). This second property is crucially needed for the proof of
Claim 5.5 below.
Proof of Claim 5.4. Let (i, j) ∈ C be a consecutive request pair. P
First, let dist(i, j) < k and observe that the definition (5.5) implies t∈T δt (i, j) = 0.
Second, let dist(i, j) ≥ k. Observe that the number of faults of Opt within (i, j)
is ||F (i, j)|| ≥ dist(i, j) − k + 1 for all pairs with distance at least k. This is because
there are dist(i, j) distinct pages between i and j and there are at most k − 1 of these
in the cache of Opt at time i. Hence, for pairs (i, j) with dist(i, j) ≥ k the property
||F (i, j)|| ≥ dist(i, j) − k + 1 implies that ||L(i, j)|| = dist(i, j) − k + 1.
Then (5.1), (5.2), (5.3), and (5.5) imply
X X X
δt (i, j) = δt (i, j) = δt (i, j)
t∈T t∈F t∈L(i,j)
X k−1 k−1
= −
valt+1 (i, j) valt (i, j)
t∈L(i,j)
dist(i,j)
X 1 1
= (k − 1) −
v−1 v
v=k
1 1
= (k − 1) −
k − 1 dist(i, j)
dist(i, j) − k + 1
= (k − 1)
(k − 1)dist(i, j)
dist(i, j) − k + 1
=
dist(i, j)
42
and Claim 5.4 is established.
Before we proceed to the proof of Claim 5.5, we give some of the intuition behind it
and introduce additional notation.
P t ∈ F be a point in time when an Opt fault occurs. We have to prove that the
Let
sum (i,j)∈C δt (i, j) of all the pairs (i, j) for which δt (i, j) > 0 is not “too large.” As
mentioned above (see the discussion after (5.5)), for every consecutive
P request pair (i, j)
with dist(i, j) ≥ k, a fault at time t contributes the more to t∈L(i,j) δt (i, j) the smaller
the value of that pair is at time t. Hence it is crucial to prove that there are not “too
many” pairs with small value at time t. This central property is expressed in Lemma 5.6.
For every t ∈ T define the set
Observe that if no fault occurs at time t, then X(t) is empty. Otherwise, i.e., for t ∈ F , the
set X(t) comprises all the pairs for which the fault at time t is one of the dist(i, j) − k + 1
last faults. Hence, for these pairs the value valt (i, j) decreases by definition (5.3) at time
t + 1. Furthermore, observe that
X X
δt (i, j) = δt (i, j),
(i,j)∈C (i,j)∈X(t)
i.e., it actually suffices to consider the pairs (i, j) ∈ X(t), because δt (v, w) = 0 for all the
other pairs (v, w) 6∈ X(t).
For every ` = 1, . . . , p − 1 define the sets and quantities
i.e., the (number of) pairs for which the fault at time t has rank `.
Proof. Consider a fixed time t ∈ F . For simplicity of notation, we omit the argument t,
e.g., we write X` and X instead of X` (t) and X(t).
First observe that (5.4) implies rankt (i, j) ≤ dist(i, j) − k + 1 ≤ p − k. Hence Xp−k+1 =
· · · = Xp−1 = ∅ and xp−k+1 = · · · = xp−1 = 0.
The sets X` partition the pairs (i, j) ∈ X according to P their rank. Hence
Pr the sets X`
r r
induce a disjoint partition of X and we have || ∪`=1 X` || = `=1 ||X` || = `=1 x` .
For sake of contradiction, fix r such that ∪r`=1 X` comprises m > r + k − 1 pairs. Let
these be (i1 , j1 ), . . . , (im , jm ). Without loss of generality t < j1 < · · · < jm−1 < jm .
Observe that all the pages rt , rj1 , . . . , rjm are distinct from each other. Hence, at time
t, the set {rt , rj1 , . . . , rjm−1 , rjm } comprises exactly m + 1 distinct pages. Therefore any
algorithm with cache size k faults at least m + 1 − k > r + k − 1 + 1 − k = r times. Hence
there are more than r faults of Opt within the time range t, . . . , jm . Consider the pair
(im , jm ) and observe that the fault at time t within that pair has rank more than r, i.e.,
rankt (im , jm ) > r. Hence (im , jm ) 6∈ ∪r`=1 X` yields the desired contradiction.
43
Proof of Claim 5.5. First observe that for every t ∈ T \F we have
X k−1 k−1
δt (i, j) = 0 ≤ 1 + − ft
k p−1
(i,j)∈C
k−1 k−1
δt (i, j) = −
`+k−2 `+k−1
P
to (i,j)∈X(t) δt (i, j).
Now interpret the x` as variables of the following linear program, where the constraints
(5.7) are implied from Lemma 5.6.
p−k
X k−1 k−1
maximize x` − (5.6)
`+k−2 `+k−1
`=1
Xr
subject to x` ≤ r + k − 1 for r = 1, . . . , p − k (5.7)
`=1
Clearly,
P the optimum value of the linear program (5.6) immediately yields an upper bound
on (i,j)∈X(t) δt (i, j).
Now we prove that the unique optimal solution to (5.6) is x∗ = (k, 1, . . . , 1). Observe
that this solution x∗ has the property that all constraints in (5.7) are satisfied with equality.
Consider an arbitrary optimal solution x = (x1 , . . . , xp−k ) where P not all constraints in
(5.7) are satisfied with equality. Let v be the smallest such index, i.e., v`=1 x` = v+k−1−ε
for some ε > 0.
Suppose v = p − k holds. Then we may increase xv by ε and achive a feasible solution
with larger objective value and hence a contradiction. Otherwise, i.e., for v < p − k,
consider the solution xε = (x1 , . . . , xv + ε, xv+1 − ε, . . . , xp−k ), which is indeed feasible. We
have that the objective value of xε − x is
k−1 k−1
ε − >0
(v + k − 1)(v + k − 2) (v + k)(v + k − 1)
44
have short distance. For example, a lot of time of the control flow is spent in executing
loops resulting in an enormous number of requests to few pages. However, the size of the
datastructures needed by an algorithm can be assumed to be large compared to the cache
size, e.g., databases, matrices, or graphs. This data is mostly accessed in a structured
way, e.g., linearly reading, or in an irregular way, e.g., the queries to a database. In both
cases, we expect a typical memory access pattern: a consecutive request to a page either
has small or large distance compared to the cache size.
We argue that request distances can be divided into three classes: short, long, and
intermediate. Intuitively, for short distance, “no” (reasonable) algorithm faults, for long
distance, “any” algorithm faults (especially any optimal algorithm), but it is not clear for
the distances inbetween. As discussed below, the treatment of the latter turns out to be
crucial for the performance of a strategy. Let k denote the cache size, let α > 1 be such
that αk is an integer, and let β ≥ 0. If a request pair has distance in the interval [0, k − 1],
then its distance is short, if it is in the interval [k, αk − 1] it is intermediate (also called
critical ), and if it is in the interval [αk, p − 1] it is called long.
The (α, β)-adversary is free to choose any sequence R in the set S(α,β) defined by
αk−1 p−1
( )
X X
S(α,β) = R with c(R) : c` (R) ≤ β c` (R) . (5.9)
`=k `=αk
The intuition behind this definition is that there must not be “too many” pairs with
critical distance compared to the long-distance pairs. This classification of sequences is
directly derived from the discussion preceeding this definition. The parameter β controls
the number of consecutive requests with distance in the critical interval [k, αk − 1]. To see
why this interval is critical, consider an (` + 1)-multicycle and observe that each request
pair has distance `. For these sequences, the competitive ratio of Lru is `/(` − k + 1).
This function decreases from k to α/(α − 1) as ` grows from k to αk − 1. Notice that we
do not restrict the number of requests with short distance, i.e., in the interval [0, k − 1].
Hence, our model implicitly captures a notion of locality of reference. Observe that the
interval [αk, p − 1] is also not constrained, allowing sequences that have a typical memory
access pattern.
Summarizing, the (α, β)-adversary is restricted with the number of pairs with distance
αk − 1], but only mildly. We just require that the total number
the interval [k,P Pof interme-
p−1
diate requests αk−1
`=k c` (R) can be balanced with the long-distance requests `=αk c` (R).
Theorem 5.7. For the (α, β)-adversary and cache size k, it holds that
α
r(α,β) (lru) ≤ f (α, β) := 2(1 + β) .
α−1
Proof. Let R ∈ S(α,β) be an admissible sequence. Further, with Theorem 5.1, Theorem 5.2,
(5.9), and `≥k `−k+1 c` (R) ≥ (α − 1)/α p`=αk c` (R) we find
P P
`
p
X αk−1
X p
X
lru(R) = c` (R) + p(R) = c` (R) + c` (R) + p(R)
`=k `=k `=αk
p
X α X`−k+1
≤ (1 + β) c` (R) + p(R) ≤ (1 + β) c` (R) + p(R)
α−1 `
`=αk `≥k
α
≤ 2(1 + β) opt(R) + p(R)
α−1
and the proof is complete, where we have used that p(R) ≤ p is a constant.
45
SPEC Hydro
purpose fluid dynamics
pages 7 451 717
x-axis `
y-axis c` (log-scale)
k 1024 (4 MB)
(α, β) (16.002, 0.012)
f (α, β) 2.16
This result can be considered a theoretical underpinning of the good practical behaviour
of Lru. It gives a criterion for the sequences on which the algorithm achieves constant
competitive ratio. We argue that request sequences that occur in practice satisfy this
criterion.
The bound f (α, β) is a constant if α is “large” and β is “small”. Indeed, the discus-
sion above suggests that practical sequences are in a set S(α,β) with this property. To
substantiate this claim, we performed experiments with memory traces generated by our
implementations of standard algorithms and by the execution of SPEC benchmark pro-
grams. SPEC benchmarks are standard for performance evaluation in computer industry.
They provide programs for a broad variety of applications, e.g., numerical simulations or
user programs. We calculated feasible values for α and β and obtained that our bound
f (α, β) gives values between two and five for these request sequences, see Figure 5.2 and
Figure 5.3.
Further experimental study is needed to determine consistent values of α and β. How-
ever, according to the above reasoning, the size of data needed by a program is large
compared to the cache size, i.e., at least a constant factor. We hence expect that α is
“large” in practice. In our experiments, we also monitored the structure of the charac-
teristic vector of these sequences. Figure 5.2 and Figure 5.3 depict two representative
examples of our results. The x-axis shows the length ` of the consecutive request and the
46
SPEC Nasa7
purpose weather simulation
pages 31 764 036
x-axis `
y-axis c` (log-scale)
k 1024 (4 MB)
(α, β) (3.97, 0.6)
f (α, β) 4.28
y-axis shows the value of the respective c` in log-scale. There is a peak for smaller values
of ` and there are many large entries c` for larger values of `, corresponding to short and
long distance. This structure suggests that β should be “small” in practice.
47
Chapter 6
Bin Packing
The problem Bin Packing is also a rather basic combinatorial optimization problem. It
occurs, for example, when certain items have to be packed into as few as possible trucks,
where each truck has a weight limitation.
minimize m
subject to Bj ⊆ I, 1 ≤ j ≤ m,
m
∪j=1 Bj = I,
weight(Bj ) ≤ 1, 1 ≤ j ≤ m.
The problem is NP-hard. Specifically, the offline version, i.e., the set I and the vector w
are given at the outset, admits a polynomial-time approximation algorithm with guarantee
3/2. But the problems is not approximable within 3/2 − ε for any ε > 0 in polynomial
time, unless P = NP. Under the (natural) assumption that the optimal number of bins
diverges as the number of items grows, the problem admits a (so-called) asymptotic PTAS,
i.e., the problem can be approximated within 1 + ε in polynomial time for any constant
ε > 0. However, this is not a practical algorithm, since the constants involved are rather
large and since there is an exponential dependency in 1/ε.
48
and have to place item i into some bin before the learn the weight of item i + 1. For
techncal reasons we will assume that limn→∞ weight(I) = ∞.
In worst-case perspective, we measure the quality of an algorithm with its (asymptotic)
competitive ratio
alg(w)
r(alg) = lim
n→∞ opt(w)
where alg(I) and opt(I) denote the respective number of bins used by the algorithm alg
and an offline optimal algorithm opt.
One of the simplest online algorithms for Bin Packing is Next-Fit Pk (nf): Initially
i = 1 and j = 1. Pack items i, i + 1, . . . , k into bin Bj as long as `=i w` ≤ 1. When
violated, set i = k + 1 and j = j + 1 and repeat the previous step until there are no items
left.
The idea is to keep always one bin active, i.e., the one that receives items. Once an
item yields that the capacity of the currently active bin is exceeded, the bin is closed
and another bin is opened, i.e., becomes active. Closed bins will never be opened again.
Next-Fit is 2-competitive, which is also best-possible for the algorithm.
Proof. We may assume that nf is using more than one bin, since otherwise there is nothing
to show.
Consider the event that nf closes a bin which has remaining capacity p and opens a new
bin. The item that is placed into the new bin has weight at least p. Thus the total weight
of any two consecutive bins is at least one, i.e., at most half of the total capacity is wasted.
Thus, weight(B1 ) + weight(B2 ) ≥ 1, . . . , weight(B2k−1 ) + weight(B
P 2k ) ≥ 1, for k = bm/2c.
Since all items are placed in the bins B1 , . . . , Bm we have i∈I wi ≥ bm/2c ≥ (m − 1)/2.
Thus, X
alg(w) = m ≤ 2 wi + 1.
i∈I
P
Obviously, since we have unit-capacity bins, we have opt(w) ≥ i wi , which yields
alg(w) ≤ 2opt(w) + 1
49
where alg(W ) and opt(W ) denote the respective random variables for the number of
bins for W .
We consider the (stochastic) process of Next Fit operating on the random weights
W1 , . . . , Wn given one-by-one. The main result of this section is for uniform distributed
weights (but the analysis is more general and can in principle be transfered to other
distributions as well).
Theorem 6.2. If the weights Wi are independent and identically uniform distributed over
[0, 1], then
4
e0 (nf) ≤ .
3
(1) The sequence (Lj )j≥1 denotes the total weight, also called level, of items packed in Bj
when bin j is closed. The complementary sequence (Rj )j≥1 with Rj = 1 − Lj gives
the remainder of bin j when closed, respectively.
(2) The sequence (Nj )j≥1 of the number of items packed into bin j when closed respec-
tively.
(3) For any non-empty bin j, (Uj )j≥1 denotes the sequence of the sizes of the first item
packed into j.
(4) The process (Mk )k≥1 denotes the sequence of the number of bins when exactly k items
are given.
We shall assume throughout that the item-weights Wj are independent and identically
distributed over [0, 1] according to some random variable X. First we derive some general
structural properties of the above processes, and then we will materialize a bound on nf
for uniform distributed weights.
General Properties
We investigate the process (Lj )j≥1 first. Since nf never opens a bin once closed, it follows
that the distribution of Lj+1 depends only on Lj but not on Lk for any 1 ≤ k < j. This
means that (Lj )j≥1 induces a (first-order) Markov chain, i.e., the Markov condition
Pr [ Lj+1 ≤ t | Lj = tj , . . . , L1 = t1 ] = Pr [ Lj+1 ≤ t | Lj = tj ]
K(s, t) = Pr [ Lj+1 ≤ t | Lj = s]
in the sequel. It is known from Markov chain theory that the distributions for Lj+1 and
Lj satisfy
Z 1
FLj+1 (t) = K(s, t)dFLj (s), (6.1)
0
with the boundary condition
FL1 (t) = K(1, t),
50
since L1 can be seen as a bin following
Pk a bin with L0 = 1.
For any k denote by Sk = i=1 Wi . If the level at closure of bin j is s, then the first
item placed into bin j + 1 has weight more than 1 − s. In our notation: Lj = s implies
Uj+1 > 1 − s. With this we find the following expression for K(s, t):
(
0 if t ≤ 1 − s,
K(s, t) = P
k≥0 Pr [ Uj+1 + Sk ≤ t, Uj+1 + Sk + X > 1 | Lj = s] if t > 1 − s.
Above, we have used that the distribution of Wk+1 is identical to X. The distribution of
Uj+1 is also that of X, conditioned on being larger than 1 − Lj , i.e.,
(
0 if u ≤ 1 − s
FUj+1 |Lj (u|s) = FX (u)−FX (1−s)
1−FX (1−s) if 1 − s < u ≤ 1.
With these observations, (6.1) is well defined. It is, again, known from Markov chain
theory that the Markov chain for (Lj )j≥1 is ergodic, which implies that it has a unique
limit distribution FL = limj→∞ FLj . Specifically, the following theorem gives a way how
to compute it:
Theorem 6.3. The process (Lj )j≥1 is an ergodic Markov chain with a limit distribution
FL satisfying Z 1
FL (t) = K(s, t)dFL (s).
0
Lemma 6.4. If FX (t) = t for 0 ≤ t ≤ 1, then K(s, t) defined in (6.1) is given through
(
0 if t ≤ 1 − s
K(s, t) = 1 −(1−t)+s
1 − s (1 − t)e if 1 − s < t ≤ 1.
Before we prove Lemma 6.4 we show how it implies our main result.
Proof of Theorem 6.2. Define R = 1 − L as the remaining capacity in the limit. Fix any
n, let Mn and Mn∗ be the respective number of bins used by nf and opt, respectively, for
W . Observe the lower bound
Mn
X Mn
X Mn
X
Mn∗ ≥ Mn − Ri = 1 − Rj = Lj .
j=1 j=1 j=1
51
Thus we may write
h PMn i hP i hP i
Mn Mn
E [nf(W )] E [Mn ] E M n − j=1 R j + E j=1 R j E j=1 R j
= ∗
≤ h i = 1 + hP i.
E [opt(W )] E [Mn ] P
E Mn − j=1 Rj M n Mn
E j=1 Lj
which proves the theorem using E [L] = 3/4 and E [R] = 1/4 as given in Lemma 6.4.
XZ t
K(s, t) = fU |L (u|s)Pr [Sk ≤ t − u, Sk + X > 1 − u] du
k≥0 u=1−s
XZ t Z t−u
= fU |L (u|s)fSk (x)Pr [X > 1 − u − x] dxdu
k≥0 u=1−s x=0
where we have used independence. Using Pr [X > x] = 1 − x and fU |L (u|s) = 1/s yields
t t−u
XZ 1 xk−1
Z
K(s, t) = (u + x)dxdu.
u=1−s x=0 s (k − 1)!
k≥0
Rt
The term for k = 0 gives the contribution u=1−s u/sdu = (t2 − (1 − s)2 )/2s. Changing
the order of summation (which is allowed as the sums converge uniformly) and integrating
yields the claim.
52
Chapter 7
Traveling Repairman
Here we consider the following online problem: A server operates on a graph G with non-
negative edge-weights and requests to vertices arrive over time (chosen by an adversary),
where each request requires servicetime of one unit. Each request must be served, but the
server is free to choose their ordering and intends to minimize the total flowtime. The
flowtime of a request is the time between its arrival and departure, i.e., the time spent
waiting.
First of all, from the worst-case perspective we observe that no deterministic online
algorithm can have competitive ratio less than diam(G), where diam(G) denotes the diam-
eter of G, i.e., the maximum value of all shortest path lengths between all pairs of vertices.
Even worse, but not at all surprising, no algorithm of the Ignore-type is competitive,
i.e., has bounded competitive ratio, see Theorem 7.2.
These negative results for the natural Ignore-type algorithms from the worst-case
perspective justify an average-case model. We assume that the requests arrive according
to a any stochastic process with rate λ > 0 (and two further natural assumptions), but
an adversary remains free to choose the vertices of the requests. This model captures
as special cases deterministic arrivals and Poisson arrivals (that can be observed in many
real-world scenarios, e.g., arrival times of phone-calls at a call-center). We give an Ignore-
type algorithm that has expected competitive ratio c(λ)ham(G) if λ 6= 1, where c(λ) is a
constant depending on λ, see Theorem 7.3. The value of c(λ) can be as small as two but
diverges as λ tends to one. Thus, for λ 6= 1, the algorithm is competitive. The intuition is
as follows: If λ < 1, then the requests arrive slower than they are served by our algorithm.
If λ > 1, then they arrive faster than they can be served even by an optimum offline
algorithm. For the remaining case λ = 1, no Ignore-type algorithm is competitive. The
reason is that the requests arrive as fast as they can be served by an optimal algorithm, but
online algorithms can be forced to make routing-mistakes that accumulate in the flowtimes
of many requests.
53
At certain points in time 0 ≤ a1 ≤ a2 ≤ · · · ≤ am requests R = (r1 , r2 , . . . , rm ) arrive.
Each request has the form rj = (ai , vi ), i.e., its arrival aj and a vertex vj . It is demanded
that the server eventually visits the vertex vj to serve the request, which takes unit time.
The time dj when the request is served is called its departure. Hence the request has
waited for time fj = dj − aj , which is called its flowtime.
The server is free to choose the ordering of service
Pm and is allowed to wait; these decisions
induce a schedule. For any schedule S, F (S) = j=1 fj (S) denotes its total flowtime, where
fj (S) is the flowtime of request rj induced by S. Our objective function is to minimize the
total flowtime. The schedule that minimizes F given the request sequence R in advance
is denoted S ∗ . For ease of notation we usually write F instead of F (S) and F ∗ instead of
F (S ∗ ). An algorithm Alg is called online if its decision at any point in time t depends
only on the past, i.e., the requests r1 , . . . , rj with aj ≤ t. Otherwise the algorithm is called
offline. We write alg(R) = F and opt(R) = F ∗ .
We study two different sources for the request sequence R. Firstly, a malicious deter-
ministic adversary is free to choose R. We measure the quality of schedules produced by
an online algorithm Alg facing such an adversary with the competitive ratio
alg(R)
r(alg) = sup .
R opt(R)
If r(alg) is not bounded, then the algorithm alg is not competitive. Secondly, we study
the following stochastic adversary. For a fixed λ > 0, the adversary is free to choose
a probability distribution D(λ) on the arrival times of the requests, provided that the
expected number of arrivals is λ. Furthermore, the vertices of the requests can also be
chosen arbitrarily. We define the expected competitive ratio by
h alg i
eλ (alg) = sup E ,
D(λ) opt
where alg and opt denote the respective random variables induced by D(λ).
We consider a family of algorithms, called Ignore. Such an algorithm maintains two
memories M and M 0 that are served in phases as follows. Suppose that the algorithm
has collected several requests in the memory M . As soon as the algorithm starts serving
those, all requests that arrive in the meantime are stored in the memory M 0 (and are hence
ignored in the current phase). After all requests in M have been served, the phase ends
and the procedure repeats with changed roles of M and M 0 . Note that we have not yet
specified how the requests in a phase are actually served, which leaves space for various
strategies.
Observation 7.1. There is a graph G such that for every lazy online algorithm alg it
holds r(alg) ≥ diam(G).
54
Proof. Consider the graph G = ({1, 2}, {{1, 2}}) with w({1, 2}) = h > 0, where h is
arbitrary. Clearly diam(G) = h.
Initially, the server of the algorithm and the optimum are at vertex 1. Consider just
one request at vertex 2 which arrives at time h. The optimal server anticipates the arrival
and starts traversing the edge at time zero, reaches vertex 2 at time h and serves the
request with flowtime 1. The lazy online algorithm starts traversing the edge at time h
and serves the request with flowtime h + 1.
B1 = (h, 2)
B2 = (1, 2), . . . , (1, 2), (1, 1), (1, 2), . . . , (1, 2)
| {z } | {z } | {z }
h requests at 2 one trap request at 1 3h + 1 requests at 2
B3 = (h + 2, 1), (1, 1), . . . , (1, 1), (1, 2), (1, 1), . . . , (1, 1)
| {z } | {z } | {z }
5h − 1 requests at 1 one trap request at 2 5h + 1 requests at 1
B4 = (h + 2, 2), (1, 2), . . . , (1, 2), (1, 1), (1, 2), . . . , (1, 2)
| {z } | {z } | {z }
9h − 1 requests at 2 one trap request at 1 7h + 1 requests at 2
..
.
55
Now we consider the algorithms. Its server is at vertex 1 until the first request arrives
which is served with flowtime h + 1 at time 2h + 1. Notice that the delay of the algorithm
is already h. Until time 2h + 1 the first h + 1 requests of B2 have arrived. These include
the trap request at vertex 1. As we are dealing with an Ignore algorithm it will serve the
trap during the next phase. As the phases are served optimally the trap is handled last in
the phase after one edge-traversal. But now the server is at the wrong vertex to serve the
remaining requests of the block and the edge needs to be traversed once more. Thus the
algorithm crossed the edge two times more than the optimum server and hence its delay
increased by 2h. We continue in this manner and see that the total delay before block Bi
is (at least) 2h(i − 1). Thus we find the lower bounds on the total flowtime
k
X k
X
alg(R) ≥ 2h(i − 1)(2ih) ≥ 2h2 i2 ≥ h2 k 3 /2
i=1 i=k/2
(1) For some fixed rate λ > 0 the expected number of arrivals per unit-time is λ.
(2) For any stopping time T , E [XT +t − XT ] ≤ E [Xt ] holds for all t ≥ 0. (A random
variable T is called stopping time for (Xt )t≥0 if the event T = t depends only on
(Xs )s≤t .)
(3) We also require that Var [Xt ] ≤ cE [Xt ] for some constant c, i.e., the process has to
have bounded variance.
Notice that Poisson-distributed and deterministic arrivals fall into this class as special
cases.
We are interested in the expected competitive ratio defined by eλ (alg) = limt→∞ eλ (alg, t),
where eλ (alg, t) = E [alg/opt] when the process is stopped at some time t and the ex-
pectation is taken with respect to Xt . We show that, in this model, bounded expected
56
competitive ratios are possible for Ignore algorithms if and only if λ 6= 1. For the case
λ = 1 the ratio is unbounded for all deterministic Ignore algorithms.
In specific, we consider the following Ignore algorithm Alg. At time zero, the algo-
rithm computes a fixed circle in the graph G that visits all vertices and has length equal
to ham(G). Let w be a parameter to be specified later on. The algorithm waits for time
w and then serves the requests that arrived using the precomputed tour. All requests that
arrive in the meantime are ignored and served in the subsequent phase, etc.
Theorem 7.3. For any stochastic process as defined above, if λ 6= 1, then there is a
constant c(λ) such that the algorithm alg yields
Proof. First condition on the event Y ≥ E [Y ] δ and then use E [ X | A] ≤ E [X] /Pr [A]
which holds for any non-negative random variable and any event A with Pr [A] > 0. We
deduce
X X
E =E Y ≥ δE [Y ] Pr [Y ≥ δE [Y ]]
Y Y
X
+E Y < δE [Y ] Pr [Y < δE [Y ]]
Y
E [ X | Y ≥ δE [Y ]] Pr [Y ≥ δE [Y ]]
≤
δE [Y ]
X
+E Y < δE [Y ] Pr [Y < δE [Y ]]
Y
E [X] X
≤ +E Y < δE [Y ] Pr [Y < δE [Y ]] .
δE [Y ] Y
The next to last inequality holds since for every event A with Pr
[A] > 0 the
conditional
expectation can be bounded in the way E [ X | A] = (E [X] − E X | A Pr A )/Pr [A] ≤
E [X] /Pr [A], where we have used that X is non-negative.
57
Below we use the following additional notation. A request rj is called pending while it
has arrived but not departed. By definition, the algorithm initially waits until time T0 = w.
Let the number of requests that arrive during the time interval I0 = [0, T0 ) be A0 . The
time when the algorithm has served these requests and is back at the initial vertex 1 be T1 .
Further let the number of arriving requests in the time interval I1 = [T0 , T1 ) be A1 . This
induces a sequence of intervals I0 , I1 , I2 , . . . , called phases, stopping times T0 , T1 , T2 , . . .
and numbers A0 , A1 , A2 , . . . , called arrivals. Fj denotes the (random) flowtime of a request
rj and I(j) the phase in which rj arrives. The optimum flowtime of request rj is denoted
Fj∗ .
Lemma 7.5. For λ < 1 we have eλ (alg) ≤ (4/(1 − λ) + c/2) · (ham(G) + 1).
Proof. We begin by choosing a value for the parameter w, i.e., the time the algorithm waits
before starting its first phase. For any positive value of w, we expect that λw requests
arrive while waiting. Hence we expect that the first phase takes time ham(G) + λw. We
require that this time be no more than w, i.e., we need ham(G) + λw ≤ w. Hence we
choose w = ham(G)/(1 − λ) which is possible for λ < 1.
Let t be any fixed time and let Xt be the (random) number of requests that arrive until
time t. Now we establish the bounds F ∗ ≥ Xt and E [F ] ≤ E [Xt ] 2(ham(G) + 1)/(1 − λ).
The optimum flowtime for any request rj is Fj∗ ≥ 1 and for Fj we have Fj ≤ ham(G) +
AI(j)−1 + ham(G) + AI(j) + 1.
We prove by induction that E [Ai ] ≤ λw for all i ≥ 0. This implies E [Fj ] ≤ 2ham(G) +
(2λw+1) ≤ 2ham(G)/(1−λ)+1, by choice of w. The base case E [A0 ] = λw is clear. For the
inductive case we exploit the following assumed property of our process: For any stopping
time T , E [XT +t − XT ] ≤ E [Xt ] holds for all t ≥ 0. By induction hypothesis E [Ai ] ≤ λw
holds for a fixed i and it follows that E [Ai+1 ] = λE [Ti+1 − Ti ] ≤ λE [ham(G) + Ai ] ≤
λ(ham(G) + λw) ≤ λw. The crude estimates F ≤ Xt2 (ham(G) + 1) and F ∗ ≥ Xt give
F 1 E [Xt ] (ham(G) + 1)
E Xt < E [Xt ] ≤ .
F∗ 2 2
With Lemma 7.4, Chebyshev’s inequality, and the choice δ = 1/2 we have
F 1
eλ (alg, t) ≤ E Xt ≥ E [Xt ]
Xt 2
F 1 1
+E X t < E [X t ] Pr X t < E [X t ]
F∗ 2 2
2E [F ] E [Xt ] (ham(G) + 1) 1
≤ + Pr |Xt − E [Xt ] | ≥ E [Xt ]
E [Xt ] 2 2
4(ham(G) + 1) E [Xt ] (ham(G) + 1) Var [Xt ]
≤ + · ,
1−λ 2 E [Xt ]2
Lemma 7.6. Let 0 < δ < 1 be an any constant such that (1 − δ)λ > 1. Then we have
!
alg 2 cλ2
E (λ) ≤ + 1 2 · (ham(G) + 1).
δ 2 (λ − 1−δ )
58
Proof. Here we choose w = 0, i.e., the algorithms begins its first phase immediately after
the arrival of the first request. We stop the process at any fixed time t and let Xt be the
number of requests that arrived.
As upper bound we use the crude estimate F ≤ Xt2 (ham(G) + 1). Now we establish
the lower bound F ∗ ≥ Xt2 δ 2 /2 which holds conditional on the event that Xt (1 − δ) ≥ t.
At time t the optimum can have served at most t many requests. Thus, by the condition
on Xt , there are P ≥ XP t − t ≥ δXt requests pending. As each request takes time at
least one we have F ∗ ≥ Pi=1 i ≥ δ 2 Xt2 /2. Thus, using Chebyshev’s inequality, we find
Pr [Xt (1 − δ) < t] ≤ Pr [|Xt − E [Xt ] | > t(λ − 1/(1 − δ))] ≤ cλ/t(λ − 1/(1 − δ))2 and hence
2Xt2 (ham(G) + 1)
alg
Et (λ) ≤ E Xt (1 − δ) ≥ t
δ 2 Xt2
2
Xt (ham(G) + 1)
+E
Xt Xt (1 − δ) < t Pr [Xt (1 − δ) < t]
2(ham(G) + 1) (ham(G) + 1)t cλt
≤ 2
+ · 2 1 2
δ (1 − δ) t (λ − 1−δ )
2(ham(G) + 1) c(ham(G) + 1)λ2
≤ + 1 2
δ2 (λ − 1−δ )
59