
Stopping times and online algorithms

EL MANSOURI Omar

Contents

1 Random tree metric and online optimal transport
  1.1 Brief introduction to optimal transport
  1.2 The online transport problems
  1.3 The tree case
  1.4 The general case
  1.5 Wrapping up

2 The K-server problem / simplex paging via online convex optimization
  2.1 Introduction to the K-server problem
  2.2 “Reminder” on online gradient and online mirror descent
  2.3 Back to the paging problem

1 Random tree metric and online optimal transport
1.1 Brief introduction to optimal transport
The principle of optimal transport is, given two measures µ and η on Rd , to transport µ onto η
while minimizing some cost function.

Figure 1: intuition for optimal transport

From Monge’s point of view, the objective is to find a map T : X → Y (X and Y being the respective supports of µ and η) such that $T_\#\mu = \eta$, where $T_\#\mu(A) = \mu(T^{-1}(A))$ is the push-forward of µ by T, and that minimizes the cost
\[ \int_{\mathcal{X}} c(x, T(x)) \, d\mu(x). \]
If µ is not regular (absolutely continuous with respect to the Lebesgue measure), such a “transport map” T might not exist, as the following example shows.

Figure 2: counter-example for the existence of T (µ = δ0 in the middle, η = ½(δ−1 + δ1) at the two endpoints)

Here, µ = δ0 and η = ½(δ−1 + δ1). In order to transport µ onto η, one would have to split the mass at 0 into equal halves and send each half onto −1 and 1 respectively, effectively mapping 0 to two different destinations.

In Kantorovich’s terms, the goal is to find a distribution π ∈ M(X , Y) where M(X , Y) is the
set of all joint distributions on X × Y with marginals µ on X and η on Y , such that the cost:

E(x,y)∼π [c(x, y)]

is minimized. Note that this objective function is linear in π, and so are the constraints $\pi_1 = \mu$ and $\pi_2 = \eta$. Hence, Kantorovich’s formulation of the problem is a linear program, whose dual can be expressed as
\[ \max_{\substack{f:\,\mathcal{X} \to \mathbb{R} \\ g:\,\mathcal{Y} \to \mathbb{R}}} \int_{\mathcal{X}} f(x)\, d\mu(x) + \int_{\mathcal{Y}} g(y)\, d\eta(y) \quad \text{such that} \quad \forall\, (x, y) \in \mathcal{X} \times \mathcal{Y},\ f(x) + g(y) \le c(x, y). \]

The typical choice for the cost function is c(x, y) = d(x, y) for some distance d. Under these conditions, and adding exponents, the problem can be restated as searching for
\[ W_p(\mu, \eta) = \min_{\pi \in \mathcal{M}(\mathcal{X}, \mathcal{Y})} \mathbb{E}_{(x,y) \sim \pi}\big[d^p(x, y)\big]^{1/p}, \]
called the p-Wasserstein distance between µ and η.

Theorem 1.1. If µ is supported on $\mathbb{R}^d$ and $\mu_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ where $X_i \overset{\text{i.i.d.}}{\sim} \mu$, then
\[ \mathbb{E}[W(\mu, \mu_n)] \lesssim n^{-1/d} \ \text{ if } d \ge 2, \qquad \mathbb{E}[W(\mu, \mu_n)] \le \frac{1}{\sqrt{n}} \ \text{ if } d = 1. \]
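For d = 1 and µ the uniform law on [0, 1], $W_1(\mu, \mu_n) = \int |F_n - F|$ has a closed form from the order statistics, which allows a quick numerical sanity check of the $n^{-1/2}$ rate (a pure-Python sketch; the helper names are ours):

```python
import random

def w1_to_uniform(sample):
    """Exact W1 distance between U[0,1] and the empirical measure of `sample`:
    the L1 distance between the two CDFs, integrated piecewise."""
    xs = sorted(sample)
    n = len(xs)
    h = lambda t: t * abs(t) / 2.0                 # antiderivative of |t|
    seg = lambda a, b, c: h(b - c) - h(a - c)      # integral of |x - c| over [a, b]
    total = seg(0.0, xs[0], 0.0) + seg(xs[-1], 1.0, 1.0)
    for i in range(1, n):
        total += seg(xs[i - 1], xs[i], i / n)      # F_n = i/n between xs[i-1] and xs[i]
    return total

random.seed(0)
w = {n: w1_to_uniform([random.random() for _ in range(n)]) for n in (100, 10_000)}
```

The value for n = 10 000 should be roughly ten times smaller than the one for n = 100, as the $1/\sqrt{n}$ bound predicts.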

1.2 The online transport problems


Given $\mu = \frac{1}{n}\sum_{i=1}^{n} \delta_{u_i}$ and $\nu_n = \frac{1}{n}\sum_{j=1}^{n} \delta_{v_j}$, where the $v_j$ are i.i.d. and distributed according to F, the goal is to minimize the transport cost from $\nu_n$ to µ when the $v_j$ are observed and matched (or transported) sequentially.
The data of the problem is $(u_1, \dots, u_n) \in (\mathbb{R}^d)^n$. The sequential transport works as follows: at each step,
– observe $v_j \sim F$;
– match $v_j$ to $u_{m(j)}$ where $m(j) \in [n] \setminus \{m(1), \dots, m(j-1)\}$;
– pay $d(v_j, u_{m(j)})$.
The goal in online matching is then to minimize the total cost paid, $\frac{1}{n}\sum_{j=1}^{n} d(v_j, u_{m(j)})$. The “oracle” that knows everything in advance would use the optimal transport map, i.e. $\mathbb{E}[\mathrm{OPT}] = \mathbb{E}[W(\mu, \nu_n)]$ where OPT is the optimum.

The difficult case is when F = µ. The data is still $u_1, \dots, u_n \in \mathbb{R}^d$, and at each step $v_j \sim \mathcal{U}(\{u_1, \dots, u_n\})$; then observe, match and pay. Here, the $v_j$ are i.i.d. (sampled with replacement) and $\mathbb{E}[\mathrm{OPT}] = \mathbb{E}[W(\mu, \mu_n)]$. The optimal online solution is obtained through dynamic programming, but it is too complicated to analyze (and to compute). We need a simpler algorithm.
The idea is to maintain an “invariant”: after k iterations, the set of n − k unmatched points is distributed as a uniform (n − k)-subset of the n points. At each iteration, we then minimize the cost of keeping this invariant.

Let us denote by $S_k$ the set of the last k remaining points. Then $S_k$ is a set of k points drawn from $\{u_1, \dots, u_n\}$ without replacement. At stage n − k, there remain k points in $S_k$, and the algorithm computes the optimal matching between the uniform distribution on $\{u_1, \dots, u_n\}$ (i.e. the law of $v_{n-k+1}$) and the uniform distribution on $S_k$. It then matches $v_{n-k+1}$ at random using this optimal matching.
We know that $\mathbb{E}[\mathrm{OPT}] = \mathbb{E}[W(\mu, \mu_n)]$. The expected cost of the algorithm is
\[ \mathbb{E}[\mathrm{ALG}] = \frac{1}{n}\sum_{k=1}^{n} \mathbb{E}\big[W(\mu, \mathcal{U}(S_k))\big] \]
where $S_k$ is a k-subset of $\{u_1, \dots, u_n\}$ drawn without replacement.
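As an illustration, here is a minimal sketch of this matching scheme for points on a line, where the optimal coupling between two uniform distributions is monotone (sorted order). The quantile-based matching rule below is a simplified deterministic variant of the randomized coupling described above, and all names are ours:

```python
import random

def online_match_line(u, seed=0):
    """Match arrivals v ~ U({u_1..u_n}) (with replacement) sequentially to distinct
    points of u, via the monotone coupling between the uniform law on all n sorted
    points and the uniform law on the k points still unmatched."""
    rng = random.Random(seed)
    u = sorted(u)
    n = len(u)
    remaining = list(u)                   # stays sorted: we only pop
    alg_cost, arrivals = 0.0, []
    for _ in range(n):
        i = rng.randrange(n)              # arrival v = u_i, sampled with replacement
        v = u[i]
        arrivals.append(v)
        k = len(remaining)
        j = min(i * k // n, k - 1)        # quantile i/n mapped into the k remaining points
        alg_cost += abs(v - remaining.pop(j))
    # offline optimum: on the line, the min-cost perfect matching pairs sorted lists
    opt_cost = sum(abs(a - b) for a, b in zip(sorted(arrivals), u))
    return alg_cost / n, opt_cost / n

alg, opt = online_match_line([x / 10 for x in range(50)])
```

Since OPT is the minimum over all perfect matchings of the realized arrivals, the online cost can never beat it.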

Notation. Let us introduce a bit of notation.
— We write $M(S_k) = W(\mu, \mathcal{U}(S_k))$, where $S_k$ is a multi-subset of $\{u_1, \dots, u_n\}$ of cardinality k.
— We denote by $\mathcal{U}_k$ the distribution of $S_k$ when the elements are drawn without replacement (recall that $S_k$ is then a subset of $\{u_1, \dots, u_n\}$).
— We denote by $\mathcal{I}_k$ the same distribution but with replacement.
The following lemma will be admitted.

Lemma 1.2. We have: ck = ESk ∼Ik [M (Sk )] ≥ ESk ∼Uk [M (Sk )].

Hence, setting $\mathrm{OPT} = c_n$, we bound the competitive ratio of the algorithm:
\[ \mathbb{E}[\mathrm{ALG}] \le \frac{1}{n}\sum_{k=1}^{n} c_k, \quad \text{hence} \quad \mathrm{CR}(\mathrm{ALG}) \le \frac{1}{n c_n} \sum_{k=1}^{n} c_k. \]
Using the fact that $\mathbb{E}[W(\mu, \mu_k)] \lesssim k^{-1/d}$,
\[ \mathrm{CR}(\mathrm{ALG}) \lesssim \frac{1}{n c_n} \sum_{k=1}^{n} k^{-1/d} \lesssim \frac{1}{n^{1/d}\, c_n}. \]

1.3 The tree case

Proposition 1.3. If the $u_i$ are on a line, then the competitive ratio CR is constant (less than 4). More precisely, we will show that
\[ c_k \lesssim \sqrt{\frac{n}{k}}\; c_n. \]

Proof. Let e be an edge of the line (“two adjacent nodes”) and denote by $d_e$ the length of this edge. Then
\[ M(S_k) = \sum_{e \in E} d_e \left| \frac{X_e}{k} - \frac{n_e}{n} \right| \]
where $n_e$ is the size (number of $u_i$) of the smaller half-line determined by e, and $X_e$ is the number of points of $S_k$ that belong to this half-line (in the lecture’s example, $n_e = 2$ and $X_e = 1$). Note that $X_e$ is a random variable, but we know its law: it is binomial with parameters k and $n_e/n$, i.e. $X_e \sim \mathcal{B}(k, \frac{n_e}{n})$. Then, by Cauchy–Schwarz,
\[ \mathbb{E}[M(S_k)] \le \sum_{e \in E} \frac{d_e}{k} \sqrt{\frac{k n_e}{n}\left(1 - \frac{n_e}{n}\right)} \le \sum_{e \in E} d_e \sqrt{\frac{n_e}{nk}}. \]
For the optimum, $c_n = \mathbb{E}\big[\sum_{e \in E} d_e \big|\frac{X_e}{n} - \frac{n_e}{n}\big|\big]$, where now $X_e \sim \mathcal{B}(n, \frac{n_e}{n})$. It also holds that $\mathbb{E}\,|X_e - \mathbb{E}[X_e]| \ge \frac{\sigma(X_e)}{2} \ge \frac{\sqrt{n_e}}{4}$, with $\sigma(X_e)$ the standard deviation of $X_e$ (since $n_e \le n/2$, $\sigma(X_e)^2 \ge n_e/2$). Hence
\[ c_n \ge \sum_{e \in E} \frac{d_e \sqrt{n_e}}{4n}, \]
and combining the two bounds gives $c_k \le \sqrt{\frac{n}{k}} \sum_{e \in E} \frac{d_e \sqrt{n_e}}{n} \le 4 \sqrt{\frac{n}{k}}\, c_n$, as announced (up to the constant).

Corollary 1.4. The result also holds for a tree (the exact same proof works).

1.4 The general case
In general the metric induced by $u_1, \dots, u_n$ is not a tree metric. But if there were a tree T (with the $u_i$ among its vertices) such that $d(u_i, u_j) \le d_T(u_i, u_j) \le C\, d(u_i, u_j)$, where $d_T$ is the metric of T, then running the algorithm with respect to the tree metric $d_T$ would give
\[ \mathbb{E}[\mathrm{ALG}] = \frac{1}{n}\sum_{k=1}^{n} \mathbb{E}_{S_k \sim \mathcal{U}_k}[M(S_k)] \le \frac{1}{n}\sum_{k=1}^{n} \mathbb{E}_{S_k \sim \mathcal{U}_k}[M^T(S_k)] \le 4\, \mathbb{E}[\mathrm{OPT}^T] \le 4C\, \mathbb{E}[\mathrm{OPT}], \]
where $M^T$ and $\mathrm{OPT}^T$ are computed with respect to $d_T$: the first inequality holds because $d \le d_T$, the second by the tree case, and the last because $d_T \le C d$. So the algorithm would be 4C-competitive. Unfortunately, there is no such deterministic tree with $C \ll n$, as the following example shows: take n points on a cycle with unit edge lengths; any spanning path is obtained by cutting one edge of the cycle, and the two endpoints of the cut edge end up at tree distance n − 1, so $C \ge n - 1$.
But if we cut the cycle at random (each edge with probability 1/n), then for adjacent points
\[ \mathbb{E}[d_T(u_i, u_{i+1})] = \left(1 - \frac{1}{n}\right) \cdot 1 + \frac{1}{n}\,(n - 1) \le 2. \]
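A quick numeric check of this computation (pure Python; the cycle size is arbitrary):

```python
def expected_stretch(n):
    """n points on a cycle with unit edges; the tree is the path obtained by
    removing one edge, chosen uniformly. Returns the average (over cuts) tree
    distance of a fixed adjacent pair."""
    total = 0
    for cut in range(n):                       # which edge is removed
        # the pair's path distance is n - 1 iff the cut edge is exactly its own edge
        total += (n - 1) if cut == 0 else 1
    return total / n

print(expected_stretch(100))   # 2 * 99 / 100 = 1.98
```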

Theorem 1.5. Given n points $u_1, \dots, u_n$ and a metric d(·, ·), there exists a random tree T such that $d(u, v) \le d_T(u, v)$ almost surely and $\mathbb{E}[d_T(u, v)] \le 8 \log(n)\, d(u, v)$.

Proof. The random tree is constructed as follows: first, let π be a random permutation of [n]; then let β ∈ (1, 2) be a random number drawn from the distribution with density $\frac{1}{x \log 2}$ on (1, 2). Without loss of generality, we can assume that $\min_{u \ne v} d(u, v) \ge 1$ and $\max_{u, v} d(u, v) = 2^I$ for some large $I \in \mathbb{N}$. This is equivalent to a representation of the following form:

Figure 3: the constructed random tree for n = 5 (root cluster {u1, . . . , u5} at layer I, split into {u4, u5} and {u1, u2, u3} at layer I − 1, then {u1, u2} at layer I − 2, down to singleton leaves)

We will have $D_I = \{\{u_1, \dots, u_n\}\}$. Given a layer $D_{i+1}$ (a collection of clusters of [n]), we build $D_i$ inductively by splitting each of its clusters. Let $\beta_i = \beta 2^{i-1}$; for every cluster of $D_{i+1}$ and every $k \in \{1, \dots, n\}$ (taken in the order given by π), we create a new cluster containing all the yet-unassigned vertices of that cluster whose distance to $u_{\pi(k)}$ is smaller than $\beta_i$.
The tree distance between a cluster of $D_{i+1}$ and each of its sub-clusters in $D_i$ is set to $2^{i+1}$. Given a pair (u, v) of vertices, we say that w “settles” (u, v) at level i if w is the first center to which u or v is assigned at that level, and that w “cuts” (u, v) at level i if w settles (u, v) at level i but only one of u and v is assigned to w. If (u, v) is cut at level i, then $d_T(u, v) \le 2^{i+2}$ (up to constants).

We then define the local cost
\[ d_T^{w}(u, v) = \sum_{i} \mathbb{1}\{w \text{ cuts } (u, v) \text{ at level } i\} \cdot 2^{i+2}. \]
This implies that $d_T(u, v) \le \sum_{w} d_T^{w}(u, v)$.

Notice that w cuts (u, v) at level i only if:
(i) $d(w, u) \le \beta_i = \beta 2^{i-1} < d(w, v)$ (or the reverse inequality);
(ii) w settles the pair (u, v).
If we rank the w’s with respect to their distance to the pair (u, v), and if s denotes the rank of w, then
\[ \mathbb{E}[d_T^{w}(u, v)] \le \int_{d(w,u)}^{d(w,v)} \frac{8x}{x \log 2} \cdot \frac{1}{s} \, dx, \]
because the probability that w settles (u, v), knowing that it is the s-th closest point, is at most 1/s. Hence,
\[ \mathbb{E}[d_T^{w}(u, v)] \le \frac{8}{s \log 2}\big(d(w, v) - d(w, u)\big) \le \frac{8\, d(u, v)}{s \log 2}, \]
so
\[ \mathbb{E}\left[\sum_{w} d_T^{w}(u, v)\right] \le \frac{8\, d(u, v)}{\log 2} \sum_{s=1}^{n} \frac{1}{s} \lesssim \frac{8 \log(n)}{\log 2}\, d(u, v). \]
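The construction above can be sketched in code. The following is a minimal implementation (our own naming and simplifications): it records, for each pair, the level at which the pair is first separated and returns the resulting leaf-to-leaf path lengths, and the usage only checks the domination property $d \le d_T$:

```python
import math
import random
from itertools import combinations

def random_tree_metric(points, dist, seed=0):
    """Hierarchical random decomposition of a finite metric (FRT-style sketch).
    Assumes min pairwise distance >= 1. Returns tree distances over index pairs."""
    rng = random.Random(seed)
    n = len(points)
    pi = list(range(n))
    rng.shuffle(pi)                                  # random permutation of centers
    beta = 2 ** rng.random()                         # law with density 1/(x log 2) on (1, 2)
    diam = max(dist(points[a], points[b]) for a, b in combinations(range(n), 2))
    I = math.ceil(math.log2(diam)) + 1               # beta * 2^(I-1) >= diam: D_I is one cluster
    layers = [[list(range(n))]]                      # layers[d] is the partition D_{I-d}
    for i in range(I - 1, -1, -1):                   # build D_{I-1}, ..., D_0
        radius = beta * 2 ** (i - 1)
        new_layer = []
        for cluster in layers[-1]:
            unassigned = set(cluster)
            for c in pi:                             # centers taken in permutation order
                ball = [v for v in unassigned if dist(points[c], points[v]) <= radius]
                if ball:
                    new_layer.append(ball)
                    unassigned -= set(ball)
        layers.append(new_layer)
    d_T = {}
    for a, b in combinations(range(n), 2):
        lvl = 0
        for depth, layer in enumerate(layers):       # first layer where a and b are split
            if not any(a in c and b in c for c in layer):
                lvl = I - depth                      # they separate when building D_lvl
                break
        # each leaf climbs edges of lengths 2^1, ..., 2^(lvl+1): path = 2 * (2^(lvl+2) - 2)
        d_T[(a, b)] = 2 * (2 ** (lvl + 2) - 2)
    return d_T

pts = [(0, 0), (3, 0), (0, 4), (7, 1), (2, 6)]       # arbitrary points, min distance >= 1
euclid = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
dT = random_tree_metric(pts, euclid, seed=3)
```

Since two points still clustered together at level lvl + 1 lie within a ball of radius $\beta 2^{\text{lvl}}$, the tree distance assigned here always dominates the original distance, whatever the random draws.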

1.5 Wrapping up
Theorem 1.6. The algorithm for online stochastic matching (applied to the random tree metric) has a competitive ratio of Θ(log n).

With a more careful analysis, one can get a competitive ratio of $(\log \log \log n)^2$. According to an open conjecture, it should be constant (with perhaps a dimension-dependent constant).

2 The K-server problem / simplex paging via online convex optimization
2.1 Introduction to the K-server problem
Given a metric space (X, d), “requests” arrive sequentially: $r_1 \in X$, then $r_2 \in X, \dots, r_T \in X$. The algorithm has K servers at some locations in X, and it must send one server to each request $r_t$. If we denote by $\rho_t \in X^K$ the locations of the servers at time t (server 1 at $\rho_{t,1}$, etc.), then an algorithm must enforce $r_t \in \{\rho_{t,i}\}_{i \in [K]}$.
The performance of an algorithm is the total distance traveled by all the servers, starting from $\rho_0$:
\[ \mathbb{E}[\mathrm{ALG}] = \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{j=1}^{K} d(\rho_{t-1,j},\, \rho_{t,j}) \right] \]
(we use expected values if the algorithm ALG is randomized), and
\[ \mathrm{OPT} = \min_{Y\,:\; r_t \in \{Y_{t,j}\}_{j \in [K]}} \; \sum_{t=1}^{T} \sum_{j=1}^{K} d(Y_{t-1,j},\, Y_{t,j}). \]
An algorithm is C-competitive if there exists some constant α > 0 such that, for any sequence of requests, $\mathbb{E}[\mathrm{ALG}] \le C\, \mathbb{E}[\mathrm{OPT}] + \alpha$ (α simply reflects the fact that the initialization does not matter).

Remark 2.1. A natural idea would be to use a greedy algorithm: move the closest server to the request. For instance, if $X = \{x_1, x_2, x_3\}$ (three points on a line, which is discrete) and $\rho_0 = (x_1, x_3)$:

Figure 4: counter-example for the greedy algorithm’s optimality (server 1 at x1, server 2 at x3)

If the requests are $x_2$, then $x_3$, then $x_2$, then $x_3, \dots$, the greedy algorithm moves server 2 from $x_3$ to $x_2$ (cost 1), then from $x_2$ back to $x_3$ (cost 1), and so on. The greedy algorithm thus pays $1 + \dots + 1 = T$ (server 2 is moved at every step), whereas the optimal procedure simply moves server 1 to $x_2$ at the beginning, producing a total cost of only 1.
This shows that the greedy algorithm has an infinite competitive ratio on the simplest (non-trivial) instance. We are thus forced to use randomness to “hide” the positions of the servers (i.e. hide them from an adversary who sends requests designed to make the algorithm fail).
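A small simulation of this instance (the positions 0, 2, 3 are our choice, taken so that x2 is strictly closer to x3 and no tie-breaking rule is needed):

```python
def serve(requests, start, policy):
    """Run a 2-server schedule on the line; `policy` picks which server moves."""
    servers = list(start)
    cost = 0
    for r in requests:
        if r not in servers:
            j = policy(servers, r)
            cost += abs(servers[j] - r)
            servers[j] = r
    return cost

x1, x2, x3 = 0, 2, 3                     # hypothetical positions on the line
T = 20
requests = [x2 if t % 2 == 0 else x3 for t in range(T)]

greedy = lambda s, r: min(range(2), key=lambda j: abs(s[j] - r))
greedy_cost = serve(requests, (x1, x3), greedy)           # server 2 shuttles forever
clever_cost = serve(requests, (x1, x3), lambda s, r: 0)   # move server 1 to x2 once
```

Here the greedy cost grows linearly in T while the clever schedule pays a constant, which is exactly the unbounded competitive ratio of the remark.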

The case of the K-server problem turns out to be very complex, so we will instead focus on a
simpler version: the K-paging problem.

In the K-paging problem, $X = [n] = \{1, \dots, n\}$ and $d(x_i, x_j) = \mathbb{1}_{x_i \ne x_j}$ (the original motivation comes from a memory of size K and n different request types, e.g. web pages, where a page needs to be in the memory to be displayed).
We are going to look for fractional algorithms: a fractional algorithm specifies at each time the “probability” of having a server at a given position. Instead of $\rho = (\rho_1, \dots, \rho_K) \in [n]^K$, we consider $z = (z_1, \dots, z_n) \in [0, 1]^n$ such that $\sum_{i=1}^{n} z_i = K$, or equivalently the anti-paging allocation, setting $x = \mathbf{1} - z$, so that $x = (x_1, \dots, x_n)$ is required to satisfy $\sum_{i=1}^{n} x_i = n - K$.
Define then:
\[ \mathcal{P} = \left\{ x = (x_1, \dots, x_n) \in [0, 1]^n \;\middle|\; \sum_{i=1}^{n} x_i = n - K \right\}. \]

A request $r_t \in X$ is rewritten as a unit vector $e_{r_t} = (0, \dots, 0, 1, 0, \dots, 0)$ where the 1 is in the $r_t$-th position. The constraint on an algorithm is that at time t, $x_{t, r_t} = 0$ (the request has been served). In this case, $x_t$ moves in the direction of $-e_{r_t}$ in order to satisfy the request. A tentative solution would be to follow projected gradient descent,
\[ x_{t+1} = \pi_{\mathcal{P}}\big(x_t - \eta\, e_{r_t}\big), \]
where $\pi_{\mathcal{P}}$ is the projection onto the polytope $\mathcal{P}$.
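The Euclidean projection onto P has a simple form: shift all coordinates by a scalar λ and clip to [0, 1], where λ is found by bisection (a standard sketch; the bracket and iteration count are our choices):

```python
def project_onto_P(y, K, iters=100):
    """Euclidean projection of y onto P = {x in [0,1]^n : sum_i x_i = n - K}.
    The projection has the form x_i = clip(y_i - lam, 0, 1); the coordinate sum
    is non-increasing in lam, so the right lam is found by bisection."""
    n = len(y)
    target = n - K
    clip = lambda t: min(1.0, max(0.0, t))
    lo, hi = min(y) - 1.0, max(y)            # sum = n at lo, sum = 0 at hi
    for _ in range(iters):
        lam = (lo + hi) / 2
        if sum(clip(yi - lam) for yi in y) > target:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2
    return [clip(yi - lam) for yi in y]

x = project_onto_P([0.9, 0.2, 0.4, 0.8], K=1)   # -> approximately [1.0, 0.4, 0.6, 1.0]
```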

2.2 “Reminder” on online gradient and online mirror descent


Given an action space X ⊂ Rd that is compact and convex, at stage t ∈ N, the algorithm
chooses xt ∈ X then nature chooses a convex L-Lipschitz loss function ft : X → R, and the
algorithm suffers ft (xt ) and observes ft (·).

The performance of an algorithm is measured by its (average) regret
\[ R_T = \frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \frac{1}{T}\sum_{t=1}^{T} f_t(x). \]
Note that in the minimum, the same x is used to evaluate every $f_t$. If $f_t = f$ for some function independent of t, then $R_T \ge f\big(\frac{1}{T}\sum_{t=1}^{T} x_t\big) - f(x^*)$ (the “optimization error”), by convexity of f.

Proposition 2.2. Gradient descent with the usual step size,
\[ x_{t+1} = \pi_{\mathcal{X}}\big(x_t - \eta \nabla f_t(x_t)\big) \quad \text{with} \quad \eta = \frac{R}{L\sqrt{T}}, \]
L being the Lipschitz constant, ensures that $R_T \le RL/\sqrt{T}$, where R is the diameter of $\mathcal{X}$.

Proof. Write $y_{t+1} = x_t - \eta \nabla f_t(x_t)$, so that $x_{t+1} = \pi_{\mathcal{X}}(y_{t+1})$. For all $x \in \mathcal{X}$:
\[ f_t(x_t) - f_t(x) \le \nabla f_t(x_t)^\top (x_t - x) = \frac{1}{\eta}(x_t - y_{t+1})^\top (x_t - x) \]
\[ = \frac{1}{2\eta}\left( \|x_t - x\|^2 + \|x_t - y_{t+1}\|^2 - \|y_{t+1} - x\|^2 \right) \le \frac{1}{2\eta}\left( \|x_t - x\|^2 - \|x_{t+1} - x\|^2 \right) + \frac{\eta L^2}{2}, \]
using $\|x_t - y_{t+1}\| = \eta \|\nabla f_t(x_t)\| \le \eta L$ and $\|y_{t+1} - x\| \ge \|x_{t+1} - x\|$ (the projection is a contraction). Telescoping,
\[ \frac{1}{T}\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x)\big) \le \frac{\|x_1 - x\|^2}{2\eta T} + \frac{\eta L^2}{2} \le \frac{R^2}{2\eta T} + \frac{\eta L^2}{2} = \frac{RL}{\sqrt{T}}. \]
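A minimal sketch of projected online gradient descent on $\mathcal{X} = [0, 1]$ with linear losses $f_t(x) = g_t \cdot x$ (the alternating loss sequence is our toy example); the realized average regret stays below the $RL/\sqrt{T}$ guarantee:

```python
import math

def ogd_interval(gradients, x0=0.5, R=1.0, L=1.0):
    """Projected online gradient descent on X = [0, 1] with linear losses g_t * x."""
    T = len(gradients)
    eta = R / (L * math.sqrt(T))                      # the step size of Proposition 2.2
    x, xs = x0, []
    for g in gradients:
        xs.append(x)
        x = min(1.0, max(0.0, x - eta * g))           # gradient step + projection onto [0, 1]
    return xs

T = 100
gs = [1.0 if t % 2 == 0 else -1.0 for t in range(T)]  # adversarial alternating slopes
xs = ogd_interval(gs)
alg = sum(g * x for g, x in zip(gs, xs)) / T
best = min(sum(gs) * x for x in (0.0, 1.0)) / T       # linear: best fixed point is an endpoint
regret = alg - best                                   # here regret = 0.05 <= 1/sqrt(T) = 0.1
```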

Example 2.3. For (unprojected) gradient descent, $x_{t+1} = x_t - \eta \nabla f(x_t)$.

Since $x_t$ and $\nabla f(x_t)$ do not really live in the same space ($x_t \in \mathbb{R}^d$ while $\nabla f(x_t)$ lives in its dual), “mirror descent” rewrites the gradient descent method in a way that respects the two dual spaces. To do so, we consider a map ϕ that sends points to gradients and gradients back to points.

Definition 2.4. Given a convex open set $\mathcal{D} \subset \mathbb{R}^d$, ϕ is a mirror map for $\mathcal{D}$ if:
∗) ϕ is strictly convex and differentiable;
∗) $\nabla\phi(\mathcal{D}) = \mathbb{R}^d$;
∗) $\lim_{x \to \partial\mathcal{D}} \|\nabla\phi(x)\| = +\infty$.
If ϕ is a mirror map and $\mathcal{X} \subset \bar{\mathcal{D}}$, the mirror descent algorithm is defined by:
\[ \nabla\phi(y_{t+1}) = \nabla\phi(x_t) - \eta \nabla f(x_t), \qquad x_{t+1} = \pi_{\mathcal{X}}^{\phi}(y_{t+1}). \]

Figure 5: illustration of mirror maps and mirror descent ($x_t$ is mapped to $\nabla\phi(x_t)$ in the dual space, the gradient step $-\eta\nabla f(x_t)$ gives $\nabla\phi(y_{t+1})$, and $y_{t+1}$ is mapped back via $\nabla\phi(\cdot)^{-1}$ and projected onto $\mathcal{X}$)

where $\pi_{\mathcal{X}}^{\phi}$ is a projection onto $\mathcal{X}$ that respects the geometry of ϕ, i.e. $x_{t+1} = \arg\min_{x \in \mathcal{X}} D_\phi(x, y_{t+1})$, with $D_\phi$ the Bregman divergence defined by
\[ D_\phi(x, y) = \phi(x) - \phi(y) - \nabla\phi(y)^\top (x - y) \approx \frac{1}{2}(x - y)^\top \nabla^2\phi(y)\,(x - y) \]
(the terms $\phi(y) + \nabla\phi(y)^\top (x - y)$ form a first-order Taylor expansion of ϕ(x)).
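For the entropy $\phi(x) = \sum_i x_i \log x_i$, the Bregman divergence between two probability vectors is exactly the Kullback–Leibler divergence, which a few lines of Python confirm (the toy vectors are our own):

```python
import math

def bregman_entropy(x, y):
    """Bregman divergence of phi(v) = sum_i v_i log v_i (gradient: 1 + log y_i)."""
    phi = lambda v: sum(t * math.log(t) for t in v)
    grad_dot = sum((1 + math.log(yi)) * (xi - yi) for xi, yi in zip(x, y))
    return phi(x) - phi(y) - grad_dot

def kl(x, y):
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

x, y = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
# on the simplex (sum x = sum y = 1) the two quantities coincide
```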

Proposition 2.5 (properties of the Bregman divergence). The following hold:
(i) $\big(\nabla\phi(x) - \nabla\phi(y)\big)^\top (x - z) = D_\phi(x, y) + D_\phi(z, x) - D_\phi(z, y)$;
(ii) for all $x \in \mathcal{X}$ and $y \in \mathcal{D}$, $\big(\nabla\phi(\pi_{\mathcal{X}}^{\phi}(y)) - \nabla\phi(y)\big)^\top \big(\pi_{\mathcal{X}}^{\phi}(y) - x\big) \le 0$.

Theorem 2.6. Mirror descent (with respect to ϕ), for $f_t$ convex and L-Lipschitz, with
\[ \eta = \frac{R}{L}\sqrt{\frac{2\rho}{T}}, \]
where $R^2 = \sup_{x \in \mathcal{X}} \phi(x) - \phi(x_1)$ and ρ is the strong convexity parameter of ϕ, ensures that
\[ R_T \le RL\sqrt{\frac{2}{\rho T}}. \]

Proof. It is almost identical to the gradient descent proof:
\[ f_t(x_t) - f_t(x) \le \nabla f_t(x_t)^\top (x_t - x) = \frac{1}{\eta}\big(\nabla\phi(x_t) - \nabla\phi(y_{t+1})\big)^\top (x_t - x) \]
\[ = \frac{1}{\eta}\big(D_\phi(x, x_t) + D_\phi(x_t, y_{t+1}) - D_\phi(x, y_{t+1})\big) \]
\[ \le \frac{1}{\eta}\big(D_\phi(x, x_t) - D_\phi(x, x_{t+1}) + D_\phi(x_t, y_{t+1}) - D_\phi(x_{t+1}, y_{t+1})\big), \]
where the last inequality uses $D_\phi(x, y_{t+1}) \ge D_\phi(x, x_{t+1}) + D_\phi(x_{t+1}, y_{t+1})$, a consequence of Proposition 2.5; the first two terms telescope when summing over t. In addition, by ρ-strong convexity of ϕ,
\[ D_\phi(x_t, y_{t+1}) - D_\phi(x_{t+1}, y_{t+1}) = \phi(x_t) - \phi(x_{t+1}) - \nabla\phi(y_{t+1})^\top (x_t - x_{t+1}) \]
\[ \le \big(\nabla\phi(x_t) - \nabla\phi(y_{t+1})\big)^\top (x_t - x_{t+1}) - \frac{\rho}{2}\|x_t - x_{t+1}\|^2 \]
\[ = \eta\, \nabla f_t(x_t)^\top (x_t - x_{t+1}) - \frac{\rho}{2}\|x_t - x_{t+1}\|^2 \le \eta L \|x_t - x_{t+1}\| - \frac{\rho}{2}\|x_t - x_{t+1}\|^2 \le \frac{(\eta L)^2}{2\rho}. \]
Summing over t gives:
\[ \frac{1}{T}\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x)\big) \le \frac{D_\phi(x, x_1)}{\eta T} + \frac{\eta L^2}{2\rho} \le \frac{R^2}{\eta T} + \frac{\eta L^2}{2\rho} = RL\sqrt{\frac{2}{\rho T}}, \]
where $D_\phi(x, x_1) \le \phi(x) - \phi(x_1) \le R^2$ holds when $x_1 = \arg\min_{\mathcal{X}} \phi$.

Corollary 2.7. If $\mathcal{X} = \big\{(x_1, \dots, x_n) \in [0, 1]^n \;\big|\; \sum_k x_k = 1\big\}$ and $\|\nabla f_t\|_\infty \le 1$, then mirror descent with the entropy $\phi(x) = \sum_{i=1}^{n} x_i \log(x_i)$ ensures that
\[ R_T \lesssim \sqrt{\frac{\log(n)}{T}} \qquad \Big(\text{instead of } \sqrt{n/T}\Big). \]
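With the entropy mirror map, the update has a closed form: multiply each coordinate by $e^{-\eta g_i}$ and renormalize (exponentiated gradient). A minimal sketch with linear losses (the loss sequence is a toy choice of ours):

```python
import math

def exponentiated_gradient(loss_vectors, n, eta):
    """Mirror descent on the simplex with phi(x) = sum_i x_i log x_i:
    x_{t+1,i} is proportional to x_{t,i} * exp(-eta * g_{t,i})."""
    x = [1.0 / n] * n                     # start at the uniform point (the entropy minimizer)
    iterates = []
    for g in loss_vectors:
        iterates.append(list(x))
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]
    return iterates

n, T = 4, 400
losses = [[(t + i) % 2 for i in range(n)] for t in range(T)]  # entries in {0,1}: ||g||_inf <= 1
eta = math.sqrt(2 * math.log(n) / T)
xs = exponentiated_gradient(losses, n, eta)
alg = sum(sum(gi * xi for gi, xi in zip(g, x)) for g, x in zip(losses, xs)) / T
best = min(sum(losses[t][i] for t in range(T)) for i in range(n)) / T  # best fixed vertex
regret = alg - best
```

The realized regret stays below the $\sqrt{2\log(n)/T}$ guarantee of Theorem 2.6 applied to the entropy.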

2.3 Back to the paging problem


At time t, the algorithm receives a request $r_t$ and, if we were following gradient descent, it would move in the direction of $-e_{r_t}$. We shall instead use the mirror descent algorithm with respect to the entropy $\phi(x) = \sum_i x_i \log(x_i)$. For gradient descent, we have
\[ x_{m+1} = x_m - \eta \nabla f(x_m), \quad \text{i.e.} \quad \frac{x_{m+1} - x_m}{\eta} = -\nabla f(x_m), \]
whose continuous-time limit is $\dot{x} = -\nabla f(x)$.

For mirror descent,
\[ \nabla\phi(x_{t+1}) = \nabla\phi(x_t) - \eta \nabla f(x_t), \quad \text{i.e.} \quad \frac{\nabla\phi(x_{t+1}) - \nabla\phi(x_t)}{\eta} = -\nabla f(x_t), \]
so in the limit $\nabla^2\phi(x_t)\, \dot{x}_t = -\nabla f(x_t)$, which is the continuous-time version of unprojected mirror descent. In our case (since the anti-paging polytope is not the simplex), the mirror descent, in discrete or continuous time, might “hit” the boundary of $\mathcal{P}$.
If we define the normal cone of $\mathcal{P}$ at x by
\[ N_{\mathcal{P}}(x) = \{ z \in \mathbb{R}^n \mid z^\top (y - x) \le 0, \ \forall\, y \in \mathcal{P} \}, \]
the projected mirror descent in continuous time writes as
\[ \nabla^2\phi(x(t))\, \dot{x}(t) \in -e_{r_t} - N_{\mathcal{P}}(x(t)). \]

The interesting property is that this equation has a unique solution starting from x(0). The algorithm receives a request $e_{r}$ and follows the continuous-time solution of the projected mirror descent until the request is satisfied.

Proposition 2.8. This algorithm is supx∈P ∥∇ϕ∥-competitive.

Proof. (As stated, the bound is actually useless because $\sup_{x \in \mathcal{P}} \|\nabla\phi\| = +\infty$; the interest lies in the method.) We will look at how the Bregman divergence between y(t) (the position of the optimum) and x(t) (the position of the algorithm) evolves, i.e. at
\[ D_\phi(y, x) = \phi(y) - \phi(x) - \langle \nabla\phi(x),\, y - x \rangle. \]
Indeed, $\partial_t D_\phi(y, x) = \partial_{t,1} D_\phi(y, x)\,\dot{y} + \partial_{t,2} D_\phi(y, x)\,\dot{x}$, the two terms being controlled respectively by the instantaneous cost of OPT (times $\|\nabla\phi\|_\infty$) and by the instantaneous cost of ALG.
If y is fixed and we differentiate with respect to t through x, then
\[ \partial_t D_\phi(y, x) = -\nabla\phi(x)^\top \dot{x} - \langle \nabla^2\phi(x)\dot{x},\, y - x \rangle + \langle \nabla\phi(x),\, \dot{x} \rangle = -\langle \nabla^2\phi(x)\dot{x},\, y - x \rangle \in \langle e_{r} + N_{\mathcal{P}}(x),\, y - x \rangle, \]
so, since $\langle z,\, y - x \rangle \le 0$ for every $z \in N_{\mathcal{P}}(x)$, and since y is a point that satisfies the request ($y_{r} = 0$), we get $\partial_t D_\phi(y, x) \le \langle e_{r},\, y - x \rangle = -x_{r} \le 0$.
