Online Optimal Transport & Algorithms
EL MANSOURI Omar
Contents
1 Random tree metric and online optimal transport
1.1 Brief introduction to optimal transport
1.2 The online transport problems
1.3 The tree case
1.4 The general case
1.5 Wrapping-ups
1 Random tree metric and online optimal transport
1.1 Brief introduction to optimal transport
The principle of optimal transport is, given two measures µ and η on Rd , to transport µ onto η
while minimizing some cost function.
From Monge’s point of view, the objective is to find a function T : X → Y (X and Y being
the respective supports of µ and η) such that T# µ = η, where T# µ(A) = µ(T −1 (A)) is the
push-forward of µ by T, and that minimizes the cost
∫_X c(x, T(x)) dµ(x)
If µ is not regular (absolutely continuous with respect to the Lebesgue measure), such a
“transport map” T might not exist, as the following example shows.
[Figure: µ = δ0 in the middle; η = ½(δ−1 + δ1), with mass ½ at −1 and mass ½ at +1.]
Here, µ = δ0 and η = ½(δ−1 + δ1). In order to transport µ onto η, one would have to split the
mass at 0 into equal halves and send each half onto −1 and 1 respectively, effectively mapping 0
to two different destinations.
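This splitting is exactly what Kantorovich’s relaxation permits. As a quick numerical check (a sketch using SciPy’s `scipy.stats.wasserstein_distance`, which works on empirical one-dimensional distributions):

```python
# W1 between µ = δ0 and η = ½(δ−1 + δ1): the optimal plan splits the
# unit mass at 0 and sends half to −1 and half to +1, each at cost 1.
from scipy.stats import wasserstein_distance

w1 = wasserstein_distance([0.0], [-1.0, 1.0])  # uniform weights by default
print(w1)  # 1.0
```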
In Kantorovich’s terms, the goal is to find a distribution π ∈ M(X , Y), where M(X , Y) is the
set of all joint distributions on X × Y with marginals µ on X and η on Y, such that the cost
∫_{X ×Y} c(x, y) dπ(x, y)
is minimized. Note that this objective function is linear in π, as are the constraints π1 = µ
and π2 = η. Hence, Kantorovich’s formulation of the problem is a linear program, whose dual
can be expressed as
max_{f : X →R, g : Y→R} ∫_X f(x) dµ(x) + ∫_Y g(y) dη(y)   such that ∀ (x, y) ∈ X × Y, f(x) + g(y) ≤ c(x, y)
The typical choice for the cost function is c(x, y) = d(x, y) for some distance d. Under these
conditions, and raising the cost to a power p, the problem can be restated as searching for
W_p(µ, η) = min_{π∈M(X ,Y)} E[d^p(x, y)]^{1/p}
Theorem 1.1. If µ is supported on R^d and µ_n = (1/n) Σ_{i=1}^{n} δ_{X_i} where X_i ∼ µ i.i.d., then
E[W(µ, µ_n)] ≲ n^{−1/d} if d ≥ 2,   and   E[W(µ, µ_n)] ≤ 1/√n if d = 1.
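For d = 1 this rate is easy to observe numerically. The sketch below is mine, not from the notes: it estimates E[W1(µ, µ_n)] for µ = U[0, 1] using the identity W1(µ, µ_n) = ∫₀¹ |F_n(t) − t| dt, computed exactly from the sorted sample.

```python
import random

def w1_to_uniform(sample):
    """Exact W1 between U[0,1] and the empirical measure of `sample`,
    via W1 = integral of |F_n(t) - t| over [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    grid = [0.0] + xs + [1.0]
    total = 0.0
    for i in range(n + 1):            # F_n = i/n on [grid[i], grid[i+1]]
        a, b, c = grid[i], grid[i + 1], i / n
        if c <= a:                    # |c - t| = t - c on the whole interval
            total += (b - a) * ((a + b) / 2 - c)
        elif c >= b:                  # |c - t| = c - t on the whole interval
            total += (b - a) * (c - (a + b) / 2)
        else:                         # the level c crosses the interval
            total += ((c - a) ** 2 + (b - c) ** 2) / 2
    return total

rng = random.Random(0)

def estimate(n, trials=200):
    """Monte-Carlo estimate of E[W1(µ, µ_n)] for µ = U[0,1]."""
    return sum(w1_to_uniform([rng.random() for _ in range(n)])
               for _ in range(trials)) / trials

e10, e1000 = estimate(10), estimate(1000)
```

Going from n = 10 to n = 1000 should shrink the estimate by roughly a factor √100 = 10, in line with the 1/√n rate.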
The difficult case is when F = µ. The data is still u_1, …, u_n ∈ R^d, and at each step
v_j ∼ U({u_1, …, u_n}); then observe, match, and pay. Here, the v_j are i.i.d. (sampled with
replacement) and E[OPT] = E[W(µ, µ_n)]. The optimal online solution is obtained through dynamic
programming, but it is too complicated to analyze (and to compute). We need a simpler algorithm.
The idea is to maintain an “invariant”: after k iterations, the distribution of the n − k remaining
v_i is that of a uniform subset of size n − k of the n points. We thus minimize, at each iteration,
the cost of maintaining this invariant.
Let us denote by S_k the set of the k remaining v_i. Then S_k is a set of k points drawn
from {u_1, …, u_n} without replacement. At stage n − k, there remain k points in S_k and
the algorithm computes the optimal matching between the uniform distribution on {u_1, …, u_n}
(i.e. the law of v_{n−k+1}) and the uniform distribution on S_k. It then matches v_{n−k+1} at random
using this optimal matching.
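On the real line, the optimal coupling between two discrete uniform distributions is the monotone (quantile) coupling, which gives a concrete way to sketch one step of this algorithm. The sketch below is illustrative: the helper `quantile_match` and the toy instance are mine, and it only approximates the invariant-preserving matching described above.

```python
import random

def quantile_match(points, remaining, v, rng):
    """Match the arrival v (one of `points`) to an element of `remaining`,
    sampling from the monotone coupling of U(points) with U(remaining)."""
    pts, rem = sorted(points), sorted(remaining)
    n, k = len(pts), len(rem)
    j = pts.index(v)                    # rank of v among all points
    q = (j + rng.random()) / n          # uniform quantile within v's mass
    return rem[min(int(q * k), k - 1)]  # same quantile among remaining points

rng = random.Random(0)
points = [rng.random() for _ in range(20)]  # n points on the line
remaining = list(points)
cost = 0.0
for _ in range(len(points)):
    v = rng.choice(points)              # requests drawn with replacement
    m = quantile_match(points, remaining, v, rng)
    cost += abs(v - m)                  # pay the line distance
    remaining.remove(m)
```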
We know that E[OPT] = E[W (µ, µn )]. The expected cost of the algorithm is
E[ALG] = Σ_{k=1}^{n} E[W(µ, U(S_k))]
where Sk is a k-subset of {u1 , . . . , un } drawn without replacement.
Lemma 1.2. We have: c_k = E_{S_k∼I_k}[M(S_k)] ≥ E_{S_k∼U_k}[M(S_k)].
Hence, setting OPT = c_n, we obtain the competitive ratio of the algorithm by:
ALG ≤ (1/n) Σ_{k=1}^{n} c_k,   hence   CR(ALG) ≤ (1/(n c_n)) Σ_{k=1}^{n} c_k
Using the fact that E[W(µ, µ_k)] ≤ k^{−1/d},
CR(ALG) ≲ (1/(n c_n)) Σ_{k=1}^{n} c_k ≲ 1/(n^{1/d} c_n)
Proof. Let e be an edge of the line (“two adjacent nodes”) and let us denote by d_e the
distance cost of this edge. Then,
M(S_k) = Σ_{e∈E} d_e |X_e/k − n_e/n|
where n_e is the size (number of u_i) of the smallest half-line starting from e and X_e is the
number of v_j that belong to this half-line. In the example above, n_e = 2 and X_e = 1. Note
that X_e is a random variable, but we know its law: it is binomial with parameters k and n_e/n,
i.e. X_e ∼ B(k, n_e/n). Then,
E[M(S_k)] ≤ Σ_{e∈E} (d_e/k) √(k (n_e/n)(1 − n_e/n)) ≤ Σ_{e∈E} d_e √(n_e/(nk))
Similarly, OPT = M(S_n), so that
E[OPT] = E[Σ_{e∈E} d_e |X_e/n − n_e/n|]
where X_e ∼ B(n, n_e/n). It also holds that E[|X_e − E[X_e]|] ≥ σ(X_e)/2 ≥ √n_e/4, with σ(X_e)
the standard deviation of X_e (using n_e ≤ n/2). Hence,
E[OPT] ≥ Σ_{e∈E} d_e √n_e/(4n)
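The two binomial mean-absolute-deviation bounds used in this argument (E|X − E[X]| ≤ σ(X), which follows from Jensen, and E|X − E[X]| ≥ σ(X)/2 when the mean is at least 1) can be sanity-checked numerically. This check is mine, not from the notes:

```python
import math

def binom_mad_over_sigma(n, p):
    """E|X - E[X]| / sigma(X) for X ~ B(n, p), computed exactly from the pmf."""
    mean, sigma = n * p, math.sqrt(n * p * (1 - p))
    mad = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) * abs(k - mean)
              for k in range(n + 1))
    return mad / sigma

# a few regimes with mean n*p >= 1 and p <= 1/2, as in the proof
ratios = [binom_mad_over_sigma(n, p)
          for n, p in [(100, 0.01), (100, 0.1), (50, 0.5), (20, 0.25)]]
```

In each of these regimes the ratio sits between 1/2 and 1, consistent with the bounds above.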
Corollary 1.4. The result also holds for a tree (the exact same proof works).
1.4 The general case
In general the metric induced by u_1, …, u_n is not a tree metric. But if there were a tree T such
that d(u_i, u_j) ≤ d_T(u_i, u_j) ≤ C d(u_i, u_j) for all i, j (the vertices of T being the u_i), where
d_T is the metric of T, then using the algorithm with respect to the tree metric d_T would provide
E[ALG] = (1/n) Σ_{k=1}^{n} E_{S_k∼U_k}[M(S_k)]
≤ (1/n) Σ_{k=1}^{n} E_{S_k∼U_k}[M^T(S_k)]   because d ≤ d_T
≤ 4 E[OPT^T]   because d_T is a tree metric
≤ 4C E[OPT]   because d_T ≤ C d
So, the algorithm would be 4C-competitive. Unfortunately, there is no such deterministic tree
with C ≪ n, as the following example shows: take n points on a cycle, with distance 1 between
neighbors; any spanning tree must break one edge of the cycle, putting its two endpoints at tree
distance n − 1. But if we break the cycle at a uniformly random edge, then
E[d_T(u_i, u_{i+1})] = (1 − 1/n) · 1 + (1/n) · n = 2 − 1/n ≤ 2.
Theorem 1.5. Given n points u1 , . . . , un and a metric d(· , ·), there exists a random tree T
such that d(u, v) ≤ E[dT (u, v)] ≤ 8 log(n)d(u, v).
Proof. The random tree is constructed as follows: first, let π be a random permutation of [n],
then let β ∈ (1, 2) be a random number drawn from the distribution with density 1/(x log 2)
on (1, 2). Without loss of generality, we can assume that min d(u, v) ≥ 1 and max d(u, v) = 2^I
for some large I ∈ N.
We will have D_I = {{u_1, …, u_n}}. Given a layer D_{i+1} (a collection of clusters of [n]), we
build D_i inductively by splitting each of its clusters. Denote β_i = β 2^{i−1}; for every cluster
of D_{i+1} and every k ∈ {1, …, n}, we create a new cluster with all vertices of that cluster not
yet assigned whose distance to u_{π(k)} is smaller than β_i.
And the distance between a cluster of D_{i+1} and its sub-clusters in D_i is 2^{i+1}. Given a pair
(u, v) of vertices, we say that w “settles” (u, v) at level i if w is the first center to which u or v is
assigned, and that w “cuts” (u, v) at level i if w settles (u, v) at that level but not both u and v
are assigned to w. Therefore, if (u, v) is cut at level i, then d_T(u, v) ≲ 2^{i+2}.
This implies that d_T(u, v) ≤ Σ_w d_T^w(u, v).
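The hierarchical construction above can be sketched in code. This is an illustrative implementation (the function names, the edge-length convention 2^{i+1} between level i and level i + 1, and the toy metric are mine); the test only checks the dominance d(u, v) ≤ d_T(u, v), since the 8 log n upper bound holds only in expectation over π and β.

```python
import math
import random

def frt_decomposition(n, dist, rng):
    """Random hierarchical decomposition in the style of Theorem 1.5.
    Assumes min pairwise distance >= 1. Returns the partitions D_0, ..., D_I
    of {0, ..., n-1}, with D_I the single full cluster."""
    pi = list(range(n))
    rng.shuffle(pi)                          # random permutation of centers
    beta = 2 ** rng.random()                 # density 1/(x log 2) on (1, 2)
    dmax = max(dist(u, v) for u in range(n) for v in range(u + 1, n))
    I = max(1, math.ceil(math.log2(dmax)))
    levels = {I: [list(range(n))]}
    for i in range(I - 1, -1, -1):           # build D_i by splitting D_{i+1}
        beta_i = beta * 2 ** (i - 1)
        levels[i] = []
        for cluster in levels[i + 1]:
            unassigned = set(cluster)
            for c in pi:                     # centers in permutation order
                ball = [v for v in unassigned if dist(v, c) < beta_i]
                if ball:
                    levels[i].append(ball)
                    unassigned -= set(ball)
    return levels, I

def tree_distance(levels, I, u, v):
    """d_T(u, v): climb until u and v share a cluster, paying, at each level i
    where they are still separated, the two edges of length 2^(i+1)."""
    d = 0
    for i in range(I):
        cluster_u = next(C for C in levels[i] if u in C)
        if v in cluster_u:
            return d
        d += 2 * 2 ** (i + 1)
    return d
```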
1.5 Wrapping-ups
Theorem 1.6. The algorithm for online stochastic matching (applied to the random tree metric)
has a competitive ratio of Θ(log n).
With more care, one can get a competitive ratio of O((log log log n)^2). According to an
ongoing conjecture, it should be constant (with perhaps a dimension-dependent constant).
An algorithm is C-competitive if there exists some constant α > 0 such that, for any sequence
of requests, E[ALG] ≤ C E[OPT] + α (α simply reflects the fact that initialization does not
matter).
Remark 2.1. A natural idea would be to use a greedy algorithm: move the closest server to
the request. For instance, if X = {x1, x2, x3} (which is discrete) and ρ0 = {x1, x3}:
[Figure: server 1 at x1, server 2 at x3, with x2 in between.]
The case of the K-server problem turns out to be very complex, so we will instead focus on a
simpler version: the K-paging problem.
In the K-paging problem, X = [n] = {1, …, n} and d(x_i, x_j) = 1_{x_i ≠ x_j} (the original motivation
behind this lies in the context of a memory of size K and n different request types, e.g. web
pages, for which a page needs to be in the memory to be displayed).
We are going to look for fractional algorithms: a fractional algorithm specifies at each time
the “probability” of having a server at a given position. Instead of ρ = (ρ_1, …, ρ_K) ∈ [n]^K, we
will consider z = (z_1, …, z_n) ∈ [0, 1]^n such that Σ_{i=1}^{n} z_i = K, or equivalently, the anti-paging
allocation, setting x = 1 − z, so that x = (x_1, …, x_n) is required to satisfy Σ_{i=1}^{n} x_i = n − K.
Define then:
P = { x = (x_1, …, x_n) ∈ [0, 1]^n | Σ_{i=1}^{n} x_i = n − K }
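A tiny numeric sanity check of this change of variables (the values below are illustrative, not from the notes): a fractional server configuration z of total mass K corresponds to an anti-paging vector x = 1 − z lying in P.

```python
# n = 5 pages, K = 2 servers of fractional mass (illustrative values)
n, K = 5, 2
z = [0.6, 0.4, 0.5, 0.3, 0.2]             # fractional server mass per page
assert abs(sum(z) - K) < 1e-9             # total server mass is K
x = [1 - zi for zi in z]                  # anti-paging allocation
in_P = all(0 <= xi <= 1 for xi in x) and abs(sum(x) - (n - K)) < 1e-9
print(in_P)  # True
```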
The performance of an algorithm is measured by its regret
R_T = (1/T) Σ_{t=1}^{T} f_t(x_t) − min_{x∈X} (1/T) Σ_{t=1}^{T} f_t(x)
Note that in the minimum, the same x is used to evaluate each f_t. If f_t = f for some function f
independent of t, then R_T ≥ f((1/T) Σ_{t=1}^{T} x_t) − f(x*) (the “optimization error”).
Definition 2.4. Given a convex open set D ⊂ R^d, ϕ is a mirror map for D if:
∗) ϕ is strictly convex and differentiable
∗) ∇ϕ(D) = R^d
∗) lim_{x→∂D} ∥∇ϕ(x)∥ = +∞
If ϕ is a mirror map and X ⊂ D, the mirror descent algorithm is defined by:
∇ϕ(y_{t+1}) = ∇ϕ(x_t) − η ∇f(x_t)
x_{t+1} = π_X^ϕ(y_{t+1})
[Figure: mirror descent — x_t is mapped to the dual space via ∇ϕ(·), a gradient step −η∇f(x_t) is taken there, the result ∇ϕ(y_{t+1}) is mapped back via ∇ϕ(·)^{−1}, and y_{t+1} is projected onto X to obtain x_{t+1}.]
where π_X^ϕ is a projection onto X that respects the geometry of ϕ, i.e. x_{t+1} = argmin_{x∈X} D_ϕ(x, y_{t+1}),
with D_ϕ the Bregman divergence defined by
D_ϕ(x, y) = ϕ(x) − ϕ(y) − ∇ϕ(y)^T(x − y) ≈ ½ (x − y)^T ∇²ϕ(y) (x − y)
(the terms ϕ(y) + ∇ϕ(y)^T(x − y) form a first-order Taylor expansion of ϕ(x)).
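As a concrete instance (an illustrative sketch, not code from the notes): with the entropy mirror map ϕ(x) = Σ x_i log x_i on the simplex, ∇ϕ(x)_i = 1 + log x_i, so the dual step becomes a multiplicative update and the Bregman (KL) projection onto the simplex is a simple renormalization; this is the exponentiated-gradient algorithm.

```python
import math

def entropic_mirror_descent(grad, x0, eta, T):
    """Mirror descent with phi(x) = sum x_i log x_i on the simplex.
    Dual step: multiplicative update; projection: renormalization."""
    x = list(x0)
    for _ in range(T):
        g = grad(x)
        # grad_phi(y) = grad_phi(x) - eta * g, i.e. log y_i = log x_i - eta * g_i
        y = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        s = sum(y)
        x = [yi / s for yi in y]     # Bregman projection onto the simplex
    return x

# toy linear loss f(x) = <c, x>, minimized over the simplex at argmin_i c_i
c = [0.9, 0.5, 0.1]
x_T = entropic_mirror_descent(lambda x: c, [1 / 3, 1 / 3, 1 / 3], eta=0.5, T=200)
```

After 200 steps the iterate concentrates on coordinate 2, the minimizer of the linear loss over the simplex, while staying in the relative interior as the mirror-map conditions require.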
Theorem 2.6. The mirror descent (with respect to ϕ) for f convex and L-Lipschitz, with
η = (R/L)√(2ρ/T), where R² = sup_{x∈X} ϕ(x) − ϕ(x_1) and ρ is the strong convexity parameter of ϕ,
ensures that:
R_T ≤ RL √(2/(ρT))
by telescoping in the first two terms. In addition,
Corollary 2.7. If X = {(x_1, …, x_n) ∈ [0, 1]^n | Σ_k x_k = 1} and ∥∇f∥_∞ ≤ 1, then the mirror
descent with the entropy ϕ(x) = Σ_{i=1}^{n} x_i log(x_i) ensures that
R_T ≤ √(log(K)/T)   (instead of √(K/T))
which is the continuous-time version of the unprojected mirror descent. In our case (since the
anti-paging polytope P is not the simplex), the mirror descent, both in discrete and continuous
time, might “hit” the boundary of P.
If we define the normal cone of P at x by the expression
N_P(x) = {z ∈ R^d | z^T(y − x) ≤ 0, ∀ y ∈ P}
The interesting property is that this equation has a unique solution starting from x(0). The
algorithm is given a request e_r and follows the continuous-time solution of the projected mirror
descent until the request is satisfied.
Proof. (The result is actually useless here because sup ∥∇ϕ∥ = +∞.) We will look at how the
Bregman divergence between x(t) (the position of the algorithm) and y(t) (the position of
the optimum) evolves. More precisely, we shall focus on:
D̂_ϕ(y, x) = −ϕ(x) − ⟨∇ϕ(x), y − x⟩
Indeed, ∂_t D̂_ϕ(y, x) = ∂_{t,1} D̂_ϕ(y, x) ẏ + ∂_{t,2} D̂_ϕ(y, x) ẋ, the two terms respectively corresponding
(up to ∥∇ϕ∥_∞) to the instantaneous cost of OPT and to the instantaneous cost of ALG.
Since D̂_ϕ(y, x) = −ϕ(x) − ⟨∇ϕ(x), y − x⟩, if y is fixed and we differentiate with respect to t,
then
∂_t D̂_ϕ(y, x) = −∇ϕ(x)^T ẋ − ⟨∇²ϕ(x) ẋ, y − x⟩ + ⟨∇ϕ(x), ẋ⟩
= −⟨∇²ϕ(x) ẋ, y − x⟩
∈ ⟨e_r + N_P(x), y − x⟩