
Published as a conference paper at ICLR 2022

GENERATIVE MODELING WITH OPTIMAL TRANSPORT MAPS

Litu Rout
Space Applications Centre
Indian Space Research Organisation
lr@sac.isro.gov.in

Alexander Korotin
Skolkovo Institute of Science and Technology
Artificial Intelligence Research Institute (AIRI)
a.korotin@skoltech.ru

Evgeny Burnaev
Skolkovo Institute of Science and Technology
Artificial Intelligence Research Institute (AIRI)
e.burnaev@skoltech.ru

arXiv:2110.02999v2 [cs.LG] 5 Mar 2022

ABSTRACT

With the discovery of Wasserstein GANs, Optimal Transport (OT) has become a
powerful tool for large-scale generative modeling tasks. In these tasks, OT cost is
typically used as the loss for training GANs. In contrast to this approach, we show
that the OT map itself can be used as a generative model, providing comparable
performance. Previous analogous approaches consider OT maps as generative
models only in the latent spaces due to their poor performance in the original
high-dimensional ambient space. In contrast, we apply OT maps directly in the
ambient space, e.g., a space of high-dimensional images. First, we derive a min-
max optimization algorithm to efficiently compute OT maps for the quadratic cost
(Wasserstein-2 distance). Next, we extend the approach to the case when the
input and output distributions are located in the spaces of different dimensions
and derive error bounds for the computed OT map. We evaluate the algorithm on
image generation and unpaired image restoration tasks. In particular, we consider
denoising, colorization, and inpainting, where the optimality of the restoration
map is a desired attribute, since the output (restored) image is expected to be close
to the input (degraded) one.

1 INTRODUCTION
Since the discovery of Generative Adversarial Networks (GANs, Goodfellow et al. (2014)), there has
been a surge in generative modeling (Radford et al., 2016; Arjovsky et al., 2017; Brock et al., 2019;
Karras et al., 2019). In the past few years, Optimal Transport (OT, Villani (2008)) theory has been
pivotal in addressing important issues of generative models. In particular, the usage of Wasserstein
distance has improved diversity (Arjovsky et al., 2017; Gulrajani et al., 2017), convergence (Sanjabi
et al., 2018), and stability (Miyato et al., 2018; Kim et al., 2021) of GANs.
Generative models based on OT can be split into two classes depending on what OT is used for.
First, the optimal transport cost serves as the loss for generative models, see Figure 1a. This is
the most prevalent class of methods which includes WGAN (Arjovsky et al., 2017) and its modifi-
cations: WGAN-GP (Gulrajani et al., 2017), WGAN-LP (Petzka et al., 2018), and WGAN-QC (Liu
et al., 2019). Second, the optimal transport map is used as a generative model itself, see Fig-
ure 1b. Such approaches include LSOT (Seguy et al., 2018), AE-OT (An et al., 2020a), ICNN-OT
(Makkuva et al., 2020), W2GN (Korotin et al., 2021a). Models of the first class have been well-
studied, but limited attention has been paid to the second class. Existing approaches of the second
class primarily consider OT maps in latent spaces of pre-trained autoencoders (AE), see Figure 3.
The performance of such generative models depends on the underlying AEs, whose decoding
transformations are often inaccurate; this deficiency limits practical applications in
high-dimensional ambient spaces. For this reason, using OT in the latent space does not necessarily
guarantee superior performance in generative modeling.


(a) OT cost as the loss for the generative model. (b) OT map as the generative model.

Figure 1: Two existing approaches to use optimal transport in generative models.

The focus of our paper is the second class of OT-based models using OT map as the generative map.
Finding an optimal mapping is motivated by its ability to preserve specific attributes of the input
samples, a desired property in unpaired learning. For example, in unpaired image-to-image transla-
tion, the learner has to fit a map between two data distributions which preserves the image content.
CycleGAN-based models (Zhu et al., 2017) are widely used for this purpose. However, they typi-
cally have complex optimization objectives consisting of several losses (Amodio & Krishnaswamy,
2019; Lu et al., 2019) in order to make the fitted map preserve the required attributes.
The main contributions of this paper are as follows:

1. We propose an end-to-end algorithm (§4.3) to fit OT maps for the quadratic cost (Wasserstein-2
distance) between distributions located on spaces of equal dimensions (§4.1) and extend the
method to unequal dimensions as well (§4.2). We prove error bounds for the method (§4.4).
2. We demonstrate large-scale applications of OT maps in popular computer vision tasks. We consider
image generation (§5.1) and unpaired image restoration (§5.2) tasks.

Our strict OT-based framework allows theoretical analysis of the recovered transport map. The
OT map obtained by our method can be used directly in large-scale computer vision problems, in
sharp contrast to previous related methods relying on autoencoders and OT maps in the latent
space. Importantly, the performance and computational complexity of our method are comparable to
those of OT-based generative models using the OT cost as the loss.
Notations. In what follows, X and Y are two complete metric spaces, µ(x) and ν(y) are probability
distributions on X and Y, respectively. For a measurable map T : X → Y, T_#µ denotes the
pushforward distribution of µ, i.e., the distribution for which any measurable set E ⊂ Y satisfies
T_#µ(E) = µ(T⁻¹(E)). For a vector x, ‖x‖ denotes its Euclidean norm. We use ⟨x, y⟩ to denote the
inner product of vectors x and y. We use Π(µ, ν) to denote the set of joint probability distributions on
X × Y whose marginals are µ and ν, respectively (couplings). For a function f : R^D → R ∪ {±∞},
its Legendre–Fenchel transform (the convex conjugate) is f̄(y) = sup_{x∈R^D} {⟨x, y⟩ − f(x)}. It is
convex, even if f is not.
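
As a quick worked example (our addition for illustration, not part of the original text), the quadratic function f(x) = ½‖x‖² is its own conjugate:

```latex
% Worked example: convex conjugate of f(x) = (1/2)||x||^2.
\overline{f}(y) = \sup_{x\in\mathbb{R}^D}\Big\{\langle x, y\rangle - \tfrac{1}{2}\|x\|^{2}\Big\}
                = \langle y, y\rangle - \tfrac{1}{2}\|y\|^{2}   % supremum attained at x = y
                = \tfrac{1}{2}\|y\|^{2}.
```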

2 BACKGROUND ON OPTIMAL TRANSPORT

Consider a cost of transportation, c : X × Y → R defined over the product space of X and Y.


Monge’s Formulation. The optimal transport cost between µ and ν for ground cost c(·, ·) is

    \mathrm{Cost}(\mu,\nu) \stackrel{\mathrm{def}}{=} \inf_{T_{\#}\mu=\nu} \int_{\mathcal{X}} c\big(x, T(x)\big)\, d\mu(x),    (1)

where the infimum is taken over all measurable maps T : X → Y pushing µ to ν, see Figure 2. The map
T* on which the infimum in (1) is attained is called the optimal transport map. Monge’s formulation
does not allow mass splitting. For example, when µ is a Dirac distribution and ν is a non-Dirac
distribution, the feasible set of equation (1) is empty.

Figure 2: Monge’s OT.
Kantorovich’s Relaxation. Instead of asking to which particular point y ∈ Y all the probability
mass of x should be moved, Kantorovich (1948) asks how the mass of x should be distributed among
all y ∈ Y. Formally, a transport coupling replaces a transport map; the OT cost is given by:

    \mathrm{Cost}(\mu,\nu) \stackrel{\mathrm{def}}{=} \inf_{\pi\in\Pi(\mu,\nu)} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\pi(x, y),    (2)

where the infimum is taken over all couplings π ∈ Π(µ, ν) of µ and ν. The coupling π* attaining
the infimum of (2) is called the optimal transport plan. Unlike formulation (1), formulation (2) is
well-posed: under mild assumptions on the spaces X, Y and the ground cost c(·, ·), the minimizer π*
of (2) always exists (Villani, 2008, Theorem 4.1). In particular, if π* is deterministic,
i.e., π* = [id_X, T*]_#µ for some T* : X → Y, then T* minimizes (1).
Duality. The dual form of (2) is given by (Kantorovich, 1948):

    \mathrm{Cost}(\mu,\nu) = \sup_{(u,v)} \Big\{ \int_{\mathcal{X}} u(x)\, d\mu(x) + \int_{\mathcal{Y}} v(y)\, d\nu(y) \;:\; u(x) + v(y) \le c(x, y) \Big\},    (3)

with u ∈ L¹(µ), v ∈ L¹(ν) called Kantorovich potentials. For u : X → R and v : Y → R, define
their c-transforms by u^c(y) = inf_{x∈X} {c(x, y) − u(x)} and v^c(x) = inf_{y∈Y} {c(x, y) − v(y)},
respectively. Using the c-transform, (3) is reformulated as (Villani, 2008, §5)

    \mathrm{Cost}(\mu,\nu) = \sup_{v} \Big\{ \int_{\mathcal{X}} v^{c}(x)\, d\mu(x) + \int_{\mathcal{Y}} v(y)\, d\nu(y) \Big\} = \sup_{u} \Big\{ \int_{\mathcal{X}} u(x)\, d\mu(x) + \int_{\mathcal{Y}} u^{c}(y)\, d\nu(y) \Big\}.    (4)

Primal-dual relationship. For certain ground costs c(·, ·), the primal solution T* of (1) can be
recovered from the dual solution u* of (3). For example, if X = Y = R^D, c(x, y) = h(x − y) with
strictly convex h : R^D → R, and µ is absolutely continuous and supported on a compact set, then

    T^{*}(x) = x - (\nabla h)^{-1}\big(\nabla u^{*}(x)\big),    (5)

see (Santambrogio, 2015, Theorem 1.17). For general costs, see (Villani, 2008, Theorem 10.28).
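
For the quadratic cost used throughout this paper, (5) takes a particularly simple form; the following short specialization is our addition for clarity:

```latex
% Quadratic cost: c(x, y) = (1/2)||x - y||^2, i.e. h(z) = (1/2)||z||^2,
% so \nabla h = \mathrm{id} and (\nabla h)^{-1} = \mathrm{id}. Plugging into (5):
T^{*}(x) = x - \nabla u^{*}(x) = \nabla\Big(\tfrac{1}{2}\|x\|^{2} - u^{*}(x)\Big),
% i.e. the OT map is the gradient of a potential (cf. Brenier, 1991).
```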

3 OPTIMAL TRANSPORT IN GENERATIVE MODELS

OPTIMAL TRANSPORT COST AS THE LOSS (Figure 1a). Starting with the works of Arjovsky &
Bottou (2017); Arjovsky et al. (2017), the usage of OT cost as the loss has become a major way
Bottou (2017); Arjovsky et al. (2017), the usage of OT cost as the loss has become a major way
to apply OT for generative modeling. In this setting, given data distribution ν and fake distribution
µθ , the goal is to minimize Cost(µθ , ν) w.r.t. the parameters θ. Typically, µθ is a pushforward
distribution of some given distribution, e.g., N (0, I), via generator network Gθ .
The Wasserstein-1 distance (W1), i.e., the transport cost for ground cost c(x, y) = ‖x − y‖, is
the most practically prevalent example of such a loss. Models based on this loss are known as
Wasserstein GANs (WGANs). They estimate W1(µθ, ν) based on the dual form as given by (4). For
W1, the optimal potentials u*, v* of (4) satisfy u* = −v*, where u* is a 1-Lipschitz function (Villani,
2008, Case 5.16). As a result, to compute W1, one needs to optimize the following simplified form:

    W_1(\mu_\theta,\nu) = \sup_{\|u\|_{L}\le 1} \Big\{ \int_{\mathcal{X}} u(x)\, d\mu_\theta(x) - \int_{\mathcal{Y}} u(y)\, d\nu(y) \Big\}.    (6)

In WGANs, the potential u is called the discriminator. Optimization of (6) reduces constrained
optimization of (4) with two potentials u, v to optimization of only one discriminator u. In practice,
enforcing the Lipschitz constraint on u is challenging. Most methods to do this are regularization-
based, e.g., they use gradient penalty (Gulrajani et al., 2017, WGAN-GP) and Lipschitz penalty
(Petzka et al., 2018, WGAN-LP). Other methods enforce Lipschitz property via incorporating cer-
tain hard restrictions on the discriminator’s architecture (Anil et al., 2019; Tanielian & Biau, 2021).
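
For concreteness, here is a minimal PyTorch sketch of the gradient-penalty idea from WGAN-GP (our illustration of the cited technique, not code from this paper); `critic` is an assumed network mapping a batch of images to a batch of scalars.

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP style penalty (Gulrajani et al., 2017): push the critic's gradient
    norm towards 1 on random interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = eps * real.detach() + (1 - eps) * fake.detach()
    interp.requires_grad_(True)
    grads = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True
    )[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```
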
General transport costs (other than W1) can also be used as the loss for generative models. They
are less popular since they do not have a dual form reducing to a single potential function similar
to (6) for W1. Consequently, the challenging estimation of the c-transform u^c is needed. To
avoid this, Sanjabi et al. (2018) consider the dual form (3) with two potentials u, v instead of form
(4) with one u and softly enforce the condition u(x) + v(y) ≤ c(x, y) via entropy or quadratic
regularization. Nhan Dam et al. (2019) use the dual form (4) and amortized optimization to
compute u^c via an additional neural network. Both methods work for general c(·, ·), though the
authors test them for c(x, y) = ‖x − y‖ only, i.e., the W1 distance. Mallasto et al. (2019) propose a
fast way to approximate the c-transform and test the approach (WGAN-(q, p)) with several costs,
in particular, the Wasserstein-2 distance (W2), i.e., the transport cost for the quadratic ground cost
c(x, y) = ½‖x − y‖². Specifically for W2, Liu et al. (2019) approximate the c-transform via a linear
program (WGAN-QC).
A fruitful branch of OT-based losses for generative models comes from modified versions of OT
cost, such as Sinkhorn (Genevay et al., 2018), sliced (Deshpande et al., 2018) and minibatch (Fatras
et al., 2019) OT distances. They typically have lower sample complexity than usual OT and can be
accurately estimated from random mini-batches without using dual forms such as (3). In practice,
these approaches usually learn the ground OT cost c(·, ·).
The aforementioned methods use OT cost in the ambient space to train GANs. There also exist ap-
proaches using OT cost in the latent space. For example, Tolstikhin et al. (2017); Patrini et al. (2020)
use OT cost between encoded data and a given distribution as an additional term to reconstruction
loss for training an AE. As a result, the AE’s latent distribution becomes close to the given one.
OPTIMAL TRANSPORT MAP AS THE GENERATIVE MAP (Figure 1b). Methods to compute the
OT map (plan) are less common in comparison to those computing the cost. Recovering the map
from the primal form (1) or (2) usually yields complex optimization objectives containing several
adversarial terms (Xie et al., 2019; Liu et al., 2021; Lu et al., 2020). Such procedures require careful
hyperparameter choice. This needs to be addressed before using these methods in practice.
Primal-dual relationship (§2) makes it possible to recover the OT map via solving the dual form (3).
Dual-form based methods primarily consider W2 cost due to its nice theoretical properties and rela-
tion to convex functions (Brenier, 1991). In the semi-discrete case (µ is continuous, ν is discrete),
An et al. (2020a) and Lei et al. (2019) compute the dual potential and the OT map by using the
Alexandrov theory and convex geometry. For the continuous case, Seguy et al. (2018) use the en-
tropy (quadratic) regularization to recover the dual potentials and extract OT map from them via
the barycenteric projection. Taghvaei & Jalali (2019), Makkuva et al. (2020), Korotin et al. (2021a)
employ input-convex neural networks (ICNNs, see Amos et al. (2017)) to parametrize potentials in
the dual problem and recover OT maps by using their gradients.

Figure 3: The existing most prevalent approach to use OT maps in generative models.

The aforementioned dual form methods compute OT maps in LATENT SPACES for problems such
as domain adaptation and latent space mass transport, see Figure 3. OT maps in high-dimensional
ambient spaces, e.g., natural images, are usually not considered. Recent evaluation of continuous
OT methods for W2 (Korotin et al., 2021b) reveals their crucial limitations, which negatively affect
their scalability, such as poor expressiveness of ICNN architectures or bias due to regularization.

4 END-TO-END SOLUTION TO LEARN OPTIMAL MAPS

4.1 EQUAL DIMENSIONS OF INPUT AND OUTPUT DISTRIBUTIONS

In this section, we use X = Y = R^D and consider the Wasserstein-2 distance (W2), i.e., the optimal
transport for the quadratic ground cost c(x, y) = ½‖x − y‖². We use the dual form (4) to derive a
saddle point problem whose solution yields the OT map T*. We consider distributions µ, ν
with finite second moments. We assume that for the distributions µ, ν in view there exists a unique OT
plan π* minimizing (2) and that it is deterministic, i.e., π* = [id_{R^D}, T*]_#µ. Here T* is an OT map
which minimizes (1). Previous related works (Makkuva et al., 2020; Korotin et al., 2021a) assumed
the absolute continuity of µ, which implied the existence and uniqueness of T* (Brenier, 1991).

Let ψ(y) ≝ ½‖y‖² − v(y), where v is the potential of (4). Note that

    v^{c}(x) = \inf_{y\in\mathbb{R}^D}\Big\{\tfrac{1}{2}\|x-y\|^{2} - v(y)\Big\} = \tfrac{1}{2}\|x\|^{2} - \sup_{y\in\mathbb{R}^D}\big\{\langle x, y\rangle - \psi(y)\big\} = \tfrac{1}{2}\|x\|^{2} - \overline{\psi}(x).    (7)

Therefore, (4) is equivalent to

    W_2^{2}(\mu,\nu) = \int_{\mathcal{X}}\tfrac{\|x\|^{2}}{2}\,d\mu(x) + \int_{\mathcal{Y}}\tfrac{\|y\|^{2}}{2}\,d\nu(y) + \sup_{\psi}\Big\{-\int_{\mathcal{X}}\overline{\psi}(x)\,d\mu(x) - \int_{\mathcal{Y}}\psi(y)\,d\nu(y)\Big\}    (8)

    = \mathrm{Constant}(\mu,\nu) - \inf_{\psi}\Big\{\int_{\mathcal{X}}\overline{\psi}(x)\,d\mu(x) + \int_{\mathcal{Y}}\psi(y)\,d\nu(y)\Big\}    (9)

    = \mathrm{Constant}(\mu,\nu) - \inf_{\psi}\Big\{\int_{\mathcal{X}}\sup_{y\in\mathbb{R}^D}\big\{\langle x, y\rangle - \psi(y)\big\}\,d\mu(x) + \int_{\mathcal{Y}}\psi(y)\,d\nu(y)\Big\}    (10)

    = \mathrm{Constant}(\mu,\nu) - \inf_{\psi}\sup_{T}\Big\{\int_{\mathcal{X}}\big[\langle x, T(x)\rangle - \psi\big(T(x)\big)\big]\,d\mu(x) + \int_{\mathcal{Y}}\psi(y)\,d\nu(y)\Big\},    (11)

where between lines (10) and (11) we replace the optimization over y ∈ R^D with the equivalent
optimization over functions T : R^D → R^D. The equivalence follows from the interchange between
the integral and the supremum (Rockafellar, 1976, Theorem 3A). We also provide an independent
proof of the equivalence, specializing Rockafellar’s interchange theorem, in Appendix A.1. Thanks to
the following lemma, we may solve the saddle point problem (11) and obtain the OT map T* from its
solution (ψ*, T*).

Lemma 4.1. Let T* be the OT map from µ to ν. Then, for every optimal potential ψ*,

    T^{*} \in \operatorname*{arg\,sup}_{T}\int_{\mathcal{X}}\big[\langle x, T(x)\rangle - \psi^{*}\big(T(x)\big)\big]\,d\mu(x).    (12)

We prove Lemma 4.1 in Appendix A.2. For general µ, ν the arg sup_T set for an optimal ψ* might contain
not only the OT map T*, but other functions as well. Working with real-world data in experiments
(§5.2), we observe that despite this issue, optimization (11) still recovers T*.
Relation to previous works. The use of the function T to approximate the c-transform was pro-
posed by Nhan Dam et al. (2019) to estimate the Wasserstein loss in WGANs. For W2 , the fact
that T ∗ is an OT map was used by Makkuva et al. (2020); Korotin et al. (2021a) who primarily
assumed continuous µ, ν and reduced (11) to convex ψ and T = ∇φ for convex φ. Issues with non-
uniqueness of solution of (12) were softened, but using ICNNs to parametrize ψ became necessary.
Korotin et al. (2021b) demonstrated that ICNNs negatively affect the practical performance of OT and
tested an unconstrained formulation similar to (11). As per the evaluation, it provided the best
empirical performance (Korotin et al., 2021b, §4.5). The method ⌊MM:R⌋ they consider parametrizes
½‖·‖² − ψ(·) by a neural network, while we directly parametrize ψ(·) by a neural network (§4.3).
Recent work by Fan et al. (2021) exploits formulation similar to (11) for general costs c(·, ·). While
their formulation leads to a max-min scheme with general costs (Fan et al., 2021, Theorem 3), our
approach gives rise to a min-max method for quadratic cost. In particular, we extend the formulation
to learn OT maps between distributions in spaces with unequal dimensions, see the next subsection.

4.2 UNEQUAL DIMENSIONS OF INPUT AND OUTPUT DISTRIBUTIONS

Consider the case when X = R^H and Y = R^D have different dimensions, i.e., H ≠ D. In order
to map the probability distribution µ to ν, a straightforward solution is to embed X into Y via some
Q : X → Y and then to fit the OT map between Q_#µ and ν for the quadratic cost on Y = R^D. In
this case, the optimization objective becomes

    \inf_{\psi}\sup_{T}\Big\{\int_{\mathcal{X}}\big[\langle Q(x), T(Q(x))\rangle - \psi\big(T(Q(x))\big)\big]\,d\mu(x) + \int_{\mathcal{Y}}\psi(y)\,d\nu(y)\Big\}    (13)

with the optimal T* recovering the OT map from Q_#µ to ν. For equal dimensions H = D and the
identity embedding Q(x) ≡ x, expression (13) reduces to optimization (11) up to a constant.

Instead of optimizing (13) over functions T : Q(X) → Y, we propose to consider optimization
directly over generative mappings G : X → Y:

    \inf_{\psi}\sup_{G}\mathcal{L}(\psi, G) \stackrel{\mathrm{def}}{=} \inf_{\psi}\sup_{G}\Big\{\int_{\mathcal{X}}\big[\langle Q(x), G(x)\rangle - \psi\big(G(x)\big)\big]\,d\mu(x) + \int_{\mathcal{Y}}\psi(y)\,d\nu(y)\Big\}.    (14)

The following lemma establishes the connection between (14) and OT with unequal dimensions:

Lemma 4.2. Assume that there exists a unique OT plan between Q_#µ and ν and that it is deterministic, i.e.,
[id_{R^D}, T*]_#(Q_#µ). Then G*(x) = T*(Q(x)) is the OT map between µ and ν for the Q-embedded
quadratic cost c(x, y) = ½‖Q(x) − y‖². Moreover, for every optimal potential ψ* of problem (14),

    G^{*} \in \operatorname*{arg\,sup}_{G}\int_{\mathcal{X}}\big[\langle Q(x), G(x)\rangle - \psi^{*}\big(G(x)\big)\big]\,d\mu(x).    (15)

We prove Lemma 4.2 in Appendix A.3 and schematically present its idea in Figure 4. Analogously to
Lemma 4.1, it provides a way to compute the OT map G* for the Q-embedded quadratic cost between
distributions µ and ν by solving the saddle point problem (14). Note that the situation with
non-uniqueness of arg sup_G is similar to §4.1.

Figure 4: The scheme of our approach for learning OT maps between unequal dimensions. In the figure,
the setup of §5.1 is shown: µ is a noise, Q is the bicubic upscaling, ν is a distribution of images.

Relation to previous works. In practice, learning OT maps directly between spaces of unequal
dimensions was considered in the work by (Fan et al., 2021, §5.2) but only on toy examples. We
demonstrate that our method works well in large-scale generative modeling tasks (§5.1). Theoretical
properties of OT maps for embedded costs are studied, e.g., in (Pass, 2010; McCann & Pass, 2020).

4.3 PRACTICAL ASPECTS AND OPTIMIZATION PROCEDURE

To optimize functional (14), we approximate G : R^H → R^D and ψ : R^D → R with neural networks
Gθ, ψω and optimize their parameters via stochastic gradient descent-ascent (SGDA) by using mini-batches
from µ, ν. The practical optimization procedure is given in Algorithm 1 below. Following
the usual practice in GANs, we add a small penalty (§B.3) on the potential ψω for better stability. The
penalty is not included in Algorithm 1 to keep it simple.

Relation to previous works. WGAN by Arjovsky & Bottou (2017) uses W1 as the loss to update the
generator, while we solve a different task: we fit the generator G to be the OT map for the Q-embedded
quadratic cost. Despite this, our Algorithm 1 has similarities with WGAN's training. The update
of ψ (line 4) coincides with the discriminator's update in WGAN. The update of the generator G (line 8)
differs from WGAN's update by the term −⟨Q(·), Gθ(·)⟩. Besides, in WGAN the optimization is
inf_G sup_D. We have inf_ψ sup_G, i.e., the generator in our case is the solution of the inner problem.

4.4 ERROR ANALYSIS

Given a pair (ψ̂, Ĝ) approximately solving (14), a natural question is how good the recovered
OT map Ĝ is. In this subsection, we provide a bound on the difference between G* and Ĝ based
on the duality gaps for solving the outer and inner optimization problems.
In (8), and, as a result, in (10), (11), (13), (14), it is enough to consider optimization over convex
functions ψ, see (Villani, 2008, Case 5.17). Our theorem below assumes the convexity of ψ̂, although
this might not hold in practice, since ψ̂ is a neural network.


Algorithm 1: Learning the optimal transport map between unequal dimensions.

Input: input distribution µ on X = R^H; output distribution ν on Y = R^D;
       generator network Gθ : R^H → R^D; potential network ψω : R^D → R;
       number of iterations per network: K_G, K_ψ; embedding Q : X → Y.
Output: trained generator Gθ representing the OT map from µ to ν.
 1  repeat
 2    for k_ψ = 1 to K_ψ do
 3      Draw batch X ∼ µ and Y ∼ ν;
 4      L_ψ ← (1/|Y|) Σ_{y∈Y} ψω(y) − (1/|X|) Σ_{x∈X} ψω(Gθ(x));
 5      Update ω by using ∂L_ψ/∂ω to minimize L_ψ;
 6    for k_G = 1 to K_G do
 7      Draw batch X ∼ µ;
 8      L_G ← (1/|X|) Σ_{x∈X} [ψω(Gθ(x)) − ⟨Q(x), Gθ(x)⟩];
 9      Update θ by using ∂L_G/∂θ to minimize L_G;
10  until not converged;
11  return Gθ
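
To make the procedure concrete, here is a minimal PyTorch sketch of Algorithm 1 (our illustration, not the released implementation; the stability penalty of §B.3 is omitted, and `sample_mu`, `sample_nu` are assumed batch samplers):

```python
import torch

def train_otm(G, psi, Q, sample_mu, sample_nu, n_iters=10000,
              K_psi=1, K_G=2, lr_G=2e-4, lr_psi=2e-4, batch_size=64):
    """Minimal sketch of Algorithm 1: fit G to be the OT map for the Q-embedded
    quadratic cost. G: R^H -> R^D, psi: R^D -> R (one value per sample),
    Q: R^H -> R^D, sample_mu/sample_nu: callables returning batches."""
    opt_G = torch.optim.Adam(G.parameters(), lr=lr_G, betas=(0.0, 0.9))
    opt_psi = torch.optim.Adam(psi.parameters(), lr=lr_psi, betas=(0.0, 0.9))
    for _ in range(n_iters):
        # Outer problem: update psi (analogous to a WGAN discriminator step, line 4).
        for _ in range(K_psi):
            x, y = sample_mu(batch_size), sample_nu(batch_size)
            loss_psi = psi(y).mean() - psi(G(x).detach()).mean()
            opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
        # Inner problem: update G; differs from WGAN by the -<Q(x), G(x)> term (line 8).
        for _ in range(K_G):
            x = sample_mu(batch_size)
            g = G(x)
            inner = (Q(x) * g).flatten(1).sum(dim=1)   # <Q(x), G(x)> per sample
            loss_G = (psi(g).view(-1) - inner).mean()
            opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return G
```

The ψ-step matches a WGAN discriminator update, while the G-step additionally subtracts the inner product with Q(x), as discussed in §4.3.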

Theorem 4.3. Assume that there exists a unique deterministic OT plan for the Q-embedded quadratic
cost between µ and ν, i.e., π* = [id_{R^H}, G*]_#µ for G* : R^H → R^D. Assume that ψ̂ is β-strongly
convex (β > 0) and Ĝ : R^H → R^D. Define

    \epsilon_1 = \sup_{G}\mathcal{L}(\hat\psi, G) - \mathcal{L}(\hat\psi, \hat G)
    \quad\text{and}\quad
    \epsilon_2 = \sup_{G}\mathcal{L}(\hat\psi, G) - \inf_{\psi}\sup_{G}\mathcal{L}(\psi, G).

Then the following bound holds true for the OT map G* from µ to ν:

    \frac{\mathrm{FID}(\hat G_{\#}\mu, \nu)}{L^{2}} \;\le\; 2\cdot W_2^{2}(\hat G_{\#}\mu, \nu) \;\le\; \int_{\mathcal{X}}\|\hat G(x) - G^{*}(x)\|^{2}\,d\mu(x) \;\le\; \frac{2}{\beta}\big(\sqrt{\epsilon_1} + \sqrt{\epsilon_2}\big)^{2},    (16)

where FID is the Fréchet inception distance (Heusel et al., 2017) and L is the Lipschitz constant of
the feature extractor of the pre-trained InceptionV3 neural network (Szegedy et al., 2016).

We prove Theorem 4.3 in Appendix A.4. The duality gaps upper bound the L²(µ) norm between the
computed map Ĝ and the true map G*, as well as the W₂² distance between the true ν and the generated
(fake) distribution Ĝ_#µ. Consequently, they upper bound the FID between the data distribution ν and
the fake (generated) distribution Ĝ_#µ.

Relation to previous works. Makkuva et al. (2020); Korotin et al. (2021a) prove related bounds for
W2 with µ, ν located on spaces of the same dimension. Our result holds for different dimensions.

5 EXPERIMENTS

We evaluate our algorithm on generative modeling of the data distribution from a noise (§5.1) and
on unpaired image restoration tasks (§5.2). Technical details are given in Appendix B. Additionally,
we test our method on toy 2D datasets in Appendix B.4 and evaluate it on the Wasserstein-2 benchmark
(Korotin et al., 2021b) in Appendix B.2. The code is in the supplementary material.

5.1 MODELING DATA DISTRIBUTION FROM NOISE DISTRIBUTION

In this subsection, µ is a 192-dimensional normal noise and ν is the high-dimensional data distribution.
Let the images from ν be of size w × h with c channels. As the embedding Q : X → Y we use a
naive upscaling of the noise. For x ∈ R^{192}, we represent it as a 3-channel 8 × 8 image and bicubically
upscale it to the size w × h of the data images from ν. For grayscale images drawn from ν, we stack c
copies over the channel dimension.
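
A possible PyTorch implementation of this naive upscaling embedding (our sketch; the exact resizing code is not specified in the text):

```python
import torch
import torch.nn.functional as F

def color_embedding_q(x, out_hw):
    """Sketch of the naive embedding Q from Section 5.1 (color variant):
    reshape a 192-dim noise vector into a 3-channel 8x8 image and
    bicubically upscale it to the resolution of the data images.
    x: [B, 192] noise batch; out_hw: (h, w), e.g. (64, 64) for CelebA."""
    img = x.view(-1, 3, 8, 8)
    return F.interpolate(img, size=out_hw, mode='bicubic', align_corners=False)
```
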
We test our method on MNIST 32 × 32 (LeCun et al., 1998), CIFAR10 32 × 32 (Krizhevsky et al.,
2009), and CelebA 64 × 64 (Liu et al., 2015) image datasets. In Figure 5, we show random samples
generated by our approach, namely Optimal Transport Modeling (OTM). To quantify the results,
in Tables 1 and 2 we give the inception (Salimans et al., 2016) and FID (Heusel et al., 2017) scores
of generated samples. Similar to (Song & Ermon, 2019, Appendix B.2), we compute them on 50K
real and generated samples. Additionally, in Appendix B.4, we test our method on 128 × 128 CelebA
faces. We provide qualitative results (images of generated faces) in Figure 11.

(a) MNIST, 32 × 32, grayscale (b) CIFAR10, 32 × 32, RGB (c) CelebA, 64 × 64, RGB

Figure 5: Randomly generated MNIST, CIFAR10, and CelebA samples by our method (OTM).

Table 1: Results on CIFAR10 dataset.

Model      | Related Work              | Inception ↑ | FID ↓
NVAE       | Vahdat & Kautz (2020)     | -           | 51.71
PixelIQN   | Ostrovski et al. (2018)   | 5.29        | 49.46
EBM        | Du & Mordatch (2019)      | 6.02        | 40.58
DCGAN      | Radford et al. (2016)     | 6.64±0.14   | 37.70
NCSN       | Song & Ermon (2019)       | 8.87±0.12   | 25.32
NCP-VAE    | Aneja et al. (2021)       | -           | 24.08
WGAN       | Arjovsky et al. (2017)    | -           | 55.2
WGAN-GP    | Gulrajani et al. (2017)   | 6.49±0.09   | 39.40
3P-WGAN    | Nhan Dam et al. (2019)    | 7.38±0.08   | 28.8
AE-OT      | An et al. (2020a)         | -           | 28.5
AE-OT-GAN  | An et al. (2020b)         | -           | 17.1
OTM        | Ours                      | 7.42±0.06   | 21.78

Table 2: Results on CelebA dataset.

Model      | Related Work              | FID ↓
DCGAN      | Radford et al. (2016)     | 52.0
DRAGAN     | Kodali et al. (2017)      | 42.3
BEGAN      | Berthelot et al. (2017)   | 38.9
NVAE       | Vahdat & Kautz (2020)     | 13.4
NCP-VAE    | Aneja et al. (2021)       | 5.2
WGAN       | Arjovsky et al. (2017)    | 41.3
WGAN-GP    | Gulrajani et al. (2017)   | 30.0
WGAN-QC    | Liu et al. (2019)         | 12.9
AE-OT      | An et al. (2020a)         | 28.6
AE-OT-GAN  | An et al. (2020b)         | 7.8
OTM        | Ours                      | 6.5

For comparison, we include the scores of existing generative models of three types: (1) OT map as
the generative model; (2) OT cost as the loss; (3) not OT-based. Note that models of the first type
compute OT in the latent space of an autoencoder in contrast to our approach. According to our
evaluation, the performance of our method is better or comparable to existing alternatives.

5.2 UNPAIRED IMAGE RESTORATION

In this subsection, we consider unpaired image restoration tasks on the CelebA faces dataset. In this
case, the input distribution µ consists of degraded images, while ν consists of clean images. In all the
cases, the embedding Q is a straightforward identity embedding Q(x) ≡ x.

In image restoration, optimality of the restoration map is desired since the output (restored) image
is expected to be close to the input (degraded) one, minimizing the transport cost. Note that GANs
do not seek an optimal mapping. However, in practice, due to implicit inductive biases such as
convolutional architectures, GANs still tend to fit low transport cost maps (Bézenac et al., 2021).

The experimental setup is shown in Figure 6. We split the dataset into 3 parts A, B, C containing 90K,
90K, 22K samples, respectively. To each image we apply a degradation transform (decolorization,
noising, or occlusion) and obtain the degraded dataset consisting of 3 respective parts A, B, C. For
unpaired training we use part A of the degraded images and part B of the clean images. For testing, we use parts C.
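
The degradation transforms are specified per task below (σ = 0.3 noise, RGB averaging, zeroed right half); a compact PyTorch sketch of all three (our illustration, with the [0, 1] value range as an assumption):

```python
import torch

def degrade(images, task, sigma=0.3):
    """Sketch of the degradations used in the unpaired restoration experiments
    (Section 5.2); our illustration, not the authors' preprocessing code.
    images: float tensor [B, 3, H, W], assumed to be in [0, 1]."""
    if task == 'noising':        # add white Gaussian noise with std sigma = 0.3
        return images + sigma * torch.randn_like(images)
    if task == 'decolorization': # average the RGB values of each pixel
        return images.mean(dim=1, keepdim=True).repeat(1, 3, 1, 1)
    if task == 'occlusion':      # replace the right half of the image with zeros
        out = images.clone()
        out[:, :, :, out.shape[-1] // 2:] = 0.0
        return out
    raise ValueError(f'unknown task: {task}')
```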

Figure 6: The training/testing scheme that we use for unpaired restoration tasks.
To quantify the results, we compute FID of the restored images w.r.t. the clean images of part C. The scores
for denoising, inpainting, and colorization are given in Table 3; details of each experiment and qualitative
results are given below.

As a baseline, we include WGAN-GP. For a fair comparison, we fit it using exactly the same hyperparameters
as in our method OTM-GP. This is possible due to the similarities between our method
and WGAN-GP’s training procedure, see the discussion in §4.3. In OTM, there is no GP (§B.3).


Table 3: FID↓ on test part C in image restoration experiments.

Model          | Denoising | Colorization | Inpainting
Input          | 166.59    | 32.12        | 47.65
WGAN-GP        | 25.49     | 7.75         | 16.51
OTM-GP (ours)  | 10.95     | 5.66         | 9.96
OTM (ours)     | 5.92      | 5.65         | 8.13
Denoising. To create noisy images, we add white normal noise with σ = 0.3 to each pixel. Figure 7
illustrates image denoising using our OTM approach on the test part of the dataset. We show
additional qualitative results for varying σ in Figure 15 of §B.4.

(a) Noisy samples. (b) Pushforward samples. (c) Original samples.


Figure 7: OTM for image denoising on test C part of CelebA, 64 × 64.
Colorization. To create grayscale images, we average the RGB values of each pixel. Figure 8
illustrates image colorization using OTM on the test part of the dataset.

(a) Grayscale samples. (b) Pushforward samples. (c) Original samples.


Figure 8: OTM for image colorization on test C part of CelebA, 64 × 64.
Inpainting. To create incomplete images, we replace the right half of each clean image with zeros.
Figure 9 illustrates image inpainting using OTM on the test part of the dataset.

(a) Occluded samples. (b) Pushforward samples. (c) Original samples.


Figure 9: OTM for image inpainting on test C part of CelebA, 64 × 64.
6 CONCLUSION
Our method fits OT maps for the embedded quadratic transport cost between probability distributions.
Unlike its predecessors, it scales well to high dimensions, enabling applications of OT maps
directly in ambient spaces, such as spaces of images. The performance is comparable to other existing
generative models while the complexity of training is similar to that of popular WGANs.
Limitations. For distributions µ, ν we assume the existence of the OT map between them. In
practice, this might not hold for all real-world µ, ν. Working with equal dimensions, we focus on
the quadratic ground cost ½‖x − y‖². Nevertheless, our approach extends to other costs c(·, ·),
see Fan et al. (2021). When the dimensions are unequal, we restrict our analysis to the embedded
quadratic cost ½‖Q(x) − y‖², where Q equalizes dimensions. Choosing the embedding Q might
not be straightforward in some practical problems, but our evaluation (§5.1) shows that even naive
choices of Q work well.
Potential impact and ethics. Real-world image restoration problems often do not have paired
datasets limiting the application of supervised techniques. In these practical unpaired learning prob-
lems, we expect our optimal transport approach to improve the performance of the existing models.
However, biases in data might lead to biases in the pushforward samples. This should be taken into
account when using our method in practical problems.


Reproducibility. The PyTorch source code is provided at

https://github.com/LituRout/OptimalTransportModeling

The instructions to use the code are included in the README.md file.

7 ACKNOWLEDGMENT

This research was supported by the computational resources provided by Space Applica-
tions Centre (SAC), ISRO. The first author acknowledges the funding by HRD Grant No.
0303T50FM703/SAC/ISRO. Skoltech RAIC center was supported by the RF Government (subsidy
agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).

REFERENCES
Matthew Amodio and Smita Krishnaswamy. Travelgan: Image-to-image translation by transfor-
mation vector learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 8983–8992, 2019.

Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International
Conference on Machine Learning, pp. 146–155. PMLR, 2017.

Dongsheng An, Yang Guo, Na Lei, Zhongxuan Luo, Shing-Tung Yau, and Xianfeng Gu. Ae-ot:
A new generative model based on extended semi-discrete optimal transport. In International
Conference on Learning Representations, 2020a. URL https://openreview.net/
forum?id=HkldyTNYwH.

Dongsheng An, Yang Guo, Min Zhang, Xin Qi, Na Lei, and Xianfang Gu. Ae-ot-gan: Training gans
from data specific latent distribution. In European Conference on Computer Vision, pp. 548–564.
Springer, 2020b.

Jyoti Aneja, Alexander Schwing, Jan Kautz, and Arash Vahdat. Ncp-vae: Variational autoencoders
with noise contrastive priors. In Advances in Neural Information Processing Systems Conference,
2021.

Cem Anil, James Lucas, and Roger Grosse. Sorting out lipschitz function approximation. In
International Conference on Machine Learning, pp. 291–301. PMLR, 2019.

Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial
networks. arXiv preprint arXiv:1701.04862, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
In International conference on machine learning, pp. 214–223. PMLR, 2017.

David Berthelot, Thomas Schumm, and Luke Metz. Began: Boundary equilibrium generative ad-
versarial networks. arXiv preprint arXiv:1703.10717, 2017.

Emmanuel de Bézenac, Ibrahim Ayed, and Patrick Gallinari. Cyclegan through the lens of (dy-
namical) optimal transport. In Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, pp. 132–147. Springer, 2021.

Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions.
Communications on pure and applied mathematics, 44(4):375–417, 1991.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
natural image synthesis. In International Conference on Learning Representations, 2019. URL
https://openreview.net/forum?id=B1xsqj09Fm.

Biwei Dai and Uros Seljak. Sliced iterative normalizing flows. In ICML Workshop on Invertible
Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.


Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced
wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 3483–3491, 2018.
DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions.
Journal of multivariate analysis, 12(3):450–455, 1982.
Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv
preprint arXiv:1903.08689, 2019.
Jiaojiao Fan, Shu Liu, Shaojun Ma, Yongxin Chen, and Haomin Zhou. Scalable computation of
monge maps with general costs. arXiv preprint arXiv:2106.03812, 2021.
Kilian Fatras, Younes Zine, Rémi Flamary, Rémi Gribonval, and Nicolas Courty. Learning with
minibatch wasserstein: asymptotic and gradient properties. arXiv preprint arXiv:1910.04091,
2019.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn di-
vergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608–1617.
PMLR, 2018.
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron C Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural
Information Processing Systems Conference, pp. 2674–2680, 2014.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Im-
proved training of wasserstein gans. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, pp. 5769–5779, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems, 30, 2017.
Leygonie Jacob, Jennifer She, Amjad Almahairi, Sai Rajeswar, and Aaron Courville. W2gan: Re-
covering an optimal transport map with a gan. 2018.
Leonid Vitalevich Kantorovich. On a problem of monge. Uspekhi Mat. Nauk, pp. 225–226, 1948.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 4401–4410, 2019.
Cheolhyeong Kim, Seungtae Park, and Hyung Ju Hwang. Local stability of wasserstein gans with
abstract gradient penalty. IEEE Transactions on Neural Networks and Learning Systems, 2021.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans.
arXiv preprint arXiv:1705.07215, 2017.
Alexander Korotin, Vage Egiazarian, Arip Asadulaev, Alexander Safin, and Evgeny Burnaev.
Wasserstein-2 generative networks. In International Conference on Learning Representations,
2021a. URL https://openreview.net/forum?id=bEoxzW_EXsa.
Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, Alexander Filippov, and Evgeny
Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. arXiv
preprint arXiv:2106.01954, 2021b.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.


Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng David Gu. A geometric view of optimal
transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein gan with quadratic transport cost.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October
2019.
Shu Liu, Shaojun Ma, Yongxin Chen, Hongyuan Zha, and Haomin Zhou. Learning high dimensional
wasserstein geodesics. arXiv preprint arXiv:2102.02992, 2021.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild.
In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738, 2015.
Guansong Lu, Zhiming Zhou, Yuxuan Song, Kan Ren, and Yong Yu. Guiding the one-to-one map-
ping in cyclegan via optimal transport. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pp. 4432–4439, 2019.
Guansong Lu, Zhiming Zhou, Jian Shen, Cheng Chen, Weinan Zhang, and Yong Yu. Large-scale op-
timal transport via adversarial training with cycle-consistency. arXiv preprint arXiv:2003.06635,
2020.
Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping
via input convex neural networks. In International Conference on Machine Learning, pp. 6672–
6681. PMLR, 2020.
Anton Mallasto, Jes Frellsen, Wouter Boomsma, and Aasa Feragen. (q, p)-wasserstein gans: Com-
paring ground metrics for wasserstein gans. arXiv preprint arXiv:1902.03642, 2019.
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smol-
ley. Least squares generative adversarial networks. In Proceedings of the IEEE international
conference on computer vision, pp. 2794–2802, 2017.
Robert J McCann and Brendan Pass. Optimal transportation between unequal dimensions. Archive
for Rational Mechanics and Analysis, 238(3):1475–1520, 2020.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for
generative adversarial networks. In International Conference on Learning Representations, 2018.
Quan Hoang Nhan Dam, Trung Le, Tu Dinh Nguyen, Hung Bui, and Dinh Phung. Threeplayer
wasserstein gan via amortised duality. In Proc. of the 28th Int. Joint Conf. on Artificial Intelligence
(IJCAI), 2019.
Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative
modeling. In International Conference on Machine Learning, pp. 3936–3945. PMLR, 2018.
Brendan Pass. Regularity of optimal transportation between spaces with different dimensions. arXiv
preprint arXiv:1008.1544, 2010.
Giorgio Patrini, Rianne van den Berg, Patrick Forre, Marcello Carioni, Samarth Bhargav, Max
Welling, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. In Uncertainty in Artificial
Intelligence, pp. 733–743. PMLR, 2020.
Henning Petzka, Asja Fischer, and Denis Lukovnikov. On the regularization of wasserstein gans. In
International Conference on Learning Representations, 2018.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. In 4th International Conference on Learning
Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track
Proceedings, 2016. URL http://arxiv.org/abs/1511.06434.
R Tyrrell Rockafellar. Integral functionals, normal integrands and measurable selections. In
Nonlinear operators and the calculus of variations, pp. 157–207. Springer, 1976.


Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans. Advances in neural information processing systems, 29:
2234–2242, 2016.
Maziar Sanjabi, Meisam Razaviyayn, Jimmy Ba, and Jason D Lee. On the convergence and ro-
bustness of training gans with regularized optimal transport. Advances in Neural Information
Processing Systems, 2018:7091–7101, 2018.
Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94,
2015.
Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and
Mathieu Blondel. Large scale optimal transport and mapping estimation. In International
Conference on Learning Representations, 2018.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethink-
ing the inception architecture for computer vision. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2818–2826, 2016.
Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex potentials
with application to improved training for gans. arXiv preprint arXiv:1902.07197, 2019.
Ugo Tanielian and Gerard Biau. Approximating lipschitz continuous functions with groupsort neu-
ral networks. In International Conference on Artificial Intelligence and Statistics, pp. 442–450.
PMLR, 2021.
Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-
encoders. arXiv preprint arXiv:1711.01558, 2017.
Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. In Advances in
Neural Information Processing Systems Conference, 2020.
Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media,
2008.
Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient
computation of large scale optimal transport. In International Conference on Machine Learning,
pp. 6882–6892. PMLR, 2019.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image trans-
lation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE
International Conference on, 2017.


A PROOFS

A.1 PROOF OF EQUIVALENCE: EQUATIONS (10) AND (11)

Proof. Pick any T : X → Y. For every point x ∈ X, by the definition of the supremum we have

    \langle x, T(x)\rangle - \psi\big(T(x)\big) \le \sup_{y\in\mathcal{Y}}\big\{\langle x, y\rangle - \psi(y)\big\}.

Integrating the expression w.r.t. x ∼ µ yields

    \int_{\mathcal{X}}\big\{\langle x, T(x)\rangle - \psi\big(T(x)\big)\big\}\,d\mu(x) \le \int_{\mathcal{X}}\sup_{y\in\mathcal{Y}}\big\{\langle x, y\rangle - \psi(y)\big\}\,d\mu(x) = L_1.

Since the inequality holds for all T : X → Y, we conclude that

    L_2 = \sup_{T:\mathcal{X}\to\mathcal{Y}}\int_{\mathcal{X}}\big\{\langle x, T(x)\rangle - \psi\big(T(x)\big)\big\}\,d\mu(x) \le \int_{\mathcal{X}}\sup_{y\in\mathcal{Y}}\big\{\langle x, y\rangle - \psi(y)\big\}\,d\mu(x) = L_1,

i.e., L_2 ≤ L_1. Now let us prove that the sup on the left side actually equals L_1. To do this, we need
to show that for every ε > 0 there exists T^ε : X → Y satisfying

    \int_{\mathcal{X}}\big\{\langle x, T^{\varepsilon}(x)\rangle - \psi\big(T^{\varepsilon}(x)\big)\big\}\,d\mu(x) \ge L_1 - \varepsilon.

First note that for every x ∈ X, by the definition of the supremum, there exists y^ε = y^ε(x) which
provides

    \langle x, y^{\varepsilon}(x)\rangle - \psi\big(y^{\varepsilon}(x)\big) \ge \sup_{y\in\mathcal{Y}}\big\{\langle x, y\rangle - \psi(y)\big\} - \varepsilon.

We take T^ε(x) = y^ε(x) for all x ∈ X and integrate the previous inequality w.r.t. x ∼ µ. We obtain

    \int_{\mathcal{X}}\big\{\langle x, T^{\varepsilon}(x)\rangle - \psi\big(T^{\varepsilon}(x)\big)\big\}\,d\mu(x) \ge \int_{\mathcal{X}}\sup_{y\in\mathcal{Y}}\big\{\langle x, y\rangle - \psi(y)\big\}\,d\mu(x) - \varepsilon = L_1 - \varepsilon,

which is the desired inequality.

A.2 PROOF OF LEMMA 4.1

Proof. It is enough to prove that ψ̄*(x) = ⟨T*(x), x⟩ − ψ*(T*(x)) holds µ-almost everywhere, i.e.,
T*(x) ∈ arg sup_{y∈R^D} {⟨x, y⟩ − ψ*(y)}. Since ν = T*_#µ, we use (9) with ψ ← ψ* to derive

    W_2^{2}(\mu,\nu) - \int_{\mathcal{X}}\tfrac{1}{2}\|x\|^{2}\,d\mu(x) - \int_{\mathcal{Y}}\tfrac{1}{2}\|y\|^{2}\,d\nu(y)
    = -\int_{\mathcal{X}}\overline{\psi^{*}}(x)\,d\mu(x) - \int_{\mathcal{Y}}\psi^{*}(y)\,d\nu(y)
    = -\int_{\mathcal{X}}\overline{\psi^{*}}(x)\,d\mu(x) - \int_{\mathcal{X}}\psi^{*}\big(T^{*}(x)\big)\,d\mu(x)
    = -\int_{\mathcal{X}}\underbrace{\big[\overline{\psi^{*}}(x) + \psi^{*}\big(T^{*}(x)\big)\big]}_{\ge\,\langle T^{*}(x),\,x\rangle}\,d\mu(x)
    \le -\int_{\mathcal{X}}\langle T^{*}(x), x\rangle\,d\mu(x)    (17)
    = \int_{\mathcal{X}}\tfrac{1}{2}\|x - T^{*}(x)\|^{2}\,d\mu(x) - \int_{\mathcal{X}}\tfrac{1}{2}\|x\|^{2}\,d\mu(x) - \int_{\mathcal{X}}\tfrac{1}{2}\|T^{*}(x)\|^{2}\,d\mu(x)
    = W_2^{2}(\mu,\nu) - \int_{\mathcal{X}}\tfrac{1}{2}\|x\|^{2}\,d\mu(x) - \int_{\mathcal{Y}}\tfrac{1}{2}\|y\|^{2}\,d\nu(y).

As a result, inequality (17) becomes an equality; in particular, ψ̄*(x) + ψ*(T*(x)) = ⟨T*(x), x⟩
holds µ-almost everywhere.


A.3 PROOF OF LEMMA 4.2

Proof. Let QW₂² denote the Q-embedded quadratic cost. We use the change of variables formula to
derive

    \mathrm{QW}_2^{2}(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{Y}}\tfrac{1}{2}\|Q(x) - y\|^{2}\,d\pi(x, y)
    = \inf_{\pi'\in\Pi(Q_{\#}\mu,\,\nu)}\int_{\mathcal{Y}\times\mathcal{Y}}\tfrac{1}{2}\|x - y\|^{2}\,d\pi'(x, y) = W_2^{2}(Q_{\#}\mu, \nu),    (18)

i.e., computing the OT plan for QW₂²(µ, ν) boils down to computing the OT plan for W₂²(Q_#µ, ν).
It follows that [id_{R^H}, T* ∘ Q]_#µ = [id_{R^H}, G*]_#µ is an OT plan for QW₂²(µ, ν), and G* is the
OT map. Inclusion (15) now follows from Lemma 4.1.

A.4 PROOF OF THEOREM 4.3


Proof. Pick any G⁰ ∈ arg sup_G L(ψ̂, G) = arg sup_G ∫_X {⟨Q(x), G(x)⟩ − ψ̂(G(x))} dµ(x) or,
equivalently, for all x ∈ R^H, G⁰(x) ∈ arg sup_y {⟨Q(x), y⟩ − ψ̂(y)}. Consequently, for all y ∈ R^D,

    \langle Q(x), G^{0}(x)\rangle - \hat\psi\big(G^{0}(x)\big) \ge \langle Q(x), y\rangle - \hat\psi(y),

which after regrouping the terms yields

    \hat\psi(y) \ge \hat\psi\big(G^{0}(x)\big) + \langle Q(x),\, y - G^{0}(x)\rangle.

This means that Q(x) is contained in the subgradient ∂ψ̂ at G⁰(x) for a convex ψ̂. Since ψ̂ is
β-strongly convex, for points G(x), G⁰(x) ∈ R^D and Q(x) ∈ ∂ψ̂(G⁰(x)) we derive

    \hat\psi\big(G(x)\big) \ge \hat\psi\big(G^{0}(x)\big) + \langle Q(x),\, G(x) - G^{0}(x)\rangle + \tfrac{\beta}{2}\|G^{0}(x) - G(x)\|^{2}.

Regrouping the terms, this gives

    \big[\langle Q(x), G^{0}(x)\rangle - \hat\psi\big(G^{0}(x)\big)\big] - \big[\langle Q(x), G(x)\rangle - \hat\psi\big(G(x)\big)\big] \ge \tfrac{\beta}{2}\|G^{0}(x) - G(x)\|^{2}.

Integrating w.r.t. x ∼ µ and setting G = Ĝ yields

    \epsilon_1 = \mathcal{L}(\hat\psi, G^{0}) - \mathcal{L}(\hat\psi, \hat G) \ge \tfrac{\beta}{2}\int_{\mathcal{X}}\|G^{0}(x) - \hat G(x)\|^{2}\,d\mu(x) = \tfrac{\beta}{2}\cdot\|\hat G - G^{0}\|^{2}_{L^{2}(\mu)}.    (19)

Let G* be the OT map from µ to ν. We use G*_#µ = ν to derive

    \mathcal{L}(\hat\psi, G^{0}) = \int_{\mathcal{X}}\big\{\langle Q(x), G^{0}(x)\rangle - \hat\psi\big(G^{0}(x)\big)\big\}\,d\mu(x) + \int_{\mathcal{Y}}\hat\psi(y)\,d\nu(y)
    = \int_{\mathcal{X}}\big\{\langle Q(x), G^{0}(x)\rangle - \hat\psi\big(G^{0}(x)\big)\big\}\,d\mu(x) + \int_{\mathcal{X}}\hat\psi\big(G^{*}(x)\big)\,d\mu(x)
    = \int_{\mathcal{X}}\big\{\langle Q(x), G^{0}(x)\rangle - \hat\psi\big(G^{0}(x)\big) + \hat\psi\big(G^{*}(x)\big)\big\}\,d\mu(x)
    \ge \int_{\mathcal{X}}\langle Q(x), G^{*}(x)\rangle\,d\mu(x) + \tfrac{\beta}{2}\int_{\mathcal{X}}\|G^{0}(x) - G^{*}(x)\|^{2}\,d\mu(x),    (20)

where the last step applies the strong convexity bound above with G = G* pointwise, i.e.,
⟨Q(x), G⁰(x)⟩ − ψ̂(G⁰(x)) + ψ̂(G*(x)) ≥ ⟨Q(x), G*(x)⟩ + (β/2)‖G⁰(x) − G*(x)‖².

Let ψ* be an optimal potential in (14). Thanks to Lemma 4.2, we have

    \inf_{\psi}\sup_{G}\mathcal{L}(\psi, G) = \mathcal{L}(\psi^{*}, G^{*})
    = \int_{\mathcal{X}}\big[\langle Q(x), G^{*}(x)\rangle - \psi^{*}\big(G^{*}(x)\big)\big]\,d\mu(x) + \int_{\mathcal{Y}}\psi^{*}(y)\,d\nu(y)
    = \int_{\mathcal{X}}\big[\langle Q(x), G^{*}(x)\rangle - \psi^{*}\big(G^{*}(x)\big)\big]\,d\mu(x) + \int_{\mathcal{X}}\psi^{*}\big(G^{*}(x)\big)\,d\mu(x)
    = \int_{\mathcal{X}}\langle Q(x), G^{*}(x)\rangle\,d\mu(x).    (21)

By combining (20) with (21), we obtain

    \epsilon_2 = \mathcal{L}(\hat\psi, G^{0}) - \mathcal{L}(\psi^{*}, G^{*}) \ge \tfrac{\beta}{2}\int_{\mathcal{X}}\|G^{0}(x) - G^{*}(x)\|^{2}\,d\mu(x) = \tfrac{\beta}{2}\cdot\|G^{0} - G^{*}\|^{2}_{L^{2}(\mu)}.    (22)

The right-hand inequality of (16) follows from the triangle inequality combined with (19) and (22).
The middle inequality of (16) follows from (Korotin et al., 2021a, Lemma A.2) and G*_#µ = ν.

Now we prove the left-hand inequality of (16). Let I be the feature extractor of the pre-trained
InceptionV3 neural network. The FID score between the generated (fake) Ĝ_#µ and the data distribution ν is

    \mathrm{FID}(\hat G_{\#}\mu, \nu) = \mathrm{FD}(I_{\#}\hat G_{\#}\mu,\, I_{\#}\nu) \le 2\cdot W_2^{2}(I_{\#}\hat G_{\#}\mu,\, I_{\#}\nu),    (23)

where FD(·, ·) is the Fréchet distance, which lower bounds 2·W₂², see (Dowson & Landau, 1982).
Finally, from (Korotin et al., 2021a, Lemma A.1) it follows that

    W_2^{2}(I_{\#}\hat G_{\#}\mu,\, I_{\#}\nu) \le L^{2}\cdot W_2^{2}(\hat G_{\#}\mu, \nu).    (24)

Here L is the Lipschitz constant of I. We combine (23) and (24) to get the left-hand inequality
in (16).

B EXPERIMENTAL DETAILS

We use the PyTorch framework. All the experiments are conducted on 2×V100 GPUs. We compute
inception and FID scores with the official implementations from OpenAI
(https://github.com/openai/improved-gan/tree/master/inception_score) and TTUR
(https://github.com/bioinf-jku/TTUR). The compared results are taken from the respective papers
or publicly available source codes.

B.1 GENERAL TRAINING DETAILS

MNIST (LeCun et al., 1998). On MNIST, we use x ∈ R^{192} and y ∈ R^{32×32}. The batch size is 64,
the learning rate is 2·10^{-4}, the optimizer is Adam (Kingma & Ba, 2014) with betas (0, 0.9), the gradient
optimality coefficient is λ = 10, and the number of training epochs is T = 30. We observe stable training
while updating ψ once per multiple G updates, i.e., K_G = 2 and K_ψ = 1.

CIFAR10 (Krizhevsky et al., 2009). We use all 50000 samples while training. The latent vector
x ∈ R^{192} and y ∈ R^{32×32×3}, batch size 64, λ = 10, K_G = 1, K_ψ = 1, T = 1000, Adam optimizer
with betas (0, 0.9), and learning rate 2·10^{-4} for G and 1·10^{-3} for ψ.

CelebA (Liu et al., 2015). We use x ∈ R^{192} and y ∈ R^{64×64×3}. The images are first cropped at the
center with size 140 and then resized to 64 × 64. We consider all 202599 samples. We use Adam
with betas (0, 0.9), T = 200, K_G = 2, K_ψ = 1, and learning rate 2·10^{-4}.

Image restoration. In the unpaired image restoration experiments, we use the Adam optimizer with
betas (0, 0.9), K_G = 5, K_ψ = 1, λ = 0, learning rate 1·10^{-4}, and train for T = 300 epochs.

CelebA 128x128 (Liu et al., 2015). On this dataset, we resize the cropped images as in CelebA to
128 × 128, i.e., y ∈ R^{128×128×3}. Here, K_G = 5, K_ψ = 1, λ = 0.01, learning rate 1·10^{-4}, and
betas (0.5, 0.999). The batch size is reduced to 16 so as to fit in the GPU memory.

Anime 128x128 (https://www.kaggle.com/reitanaka/alignedanimefaces). This dataset consists of
500000 high-resolution images. We resize the cropped images as in CelebA to 128 × 128, i.e.,
y ∈ R^{128×128×3}. Here, K_G = 5, K_ψ = 1, λ = 0.01, learning rate 2·10^{-4}, batch size 16, and
betas (0, 0.9).

Toy datasets. The dimension is D = H = 2, and the total number of samples is 10000. We use batch
size 400, λ = 0.1, K_ψ = 1, K_G = 16, and T = 100. The optimizer is Adam with betas (0.5, 0.99)
and learning rate 1·10^{-3}. We use the following datasets: Gaussian to mixture of Gaussians
(https://github.com/AmirTag/OT-ICNN), two moons (sklearn.datasets.make_moons), circles
(sklearn.datasets.make_circles), Gaussian to S-curve (sklearn.datasets.make_s_curve), and Gaussian
to Swiss roll (sklearn.datasets.make_swiss_roll).

Wasserstein-2 benchmark (Appendix B.2). The dimension is D = H = 64 × 64 × 3. We use batch
size 64, λ = 0, K_ψ = 1, K_G = 5, learning rate 10^{-4}, and the Adam optimizer with default betas.

B.2 EVALUATION ON THE CONTINUOUS WASSERSTEIN-2 BENCHMARK

To empirically show that the method recovers optimal transport maps well on equal dimensions,
we evaluate it on the recent continuous Wasserstein-2 benchmark by Korotin et al. (2021b). The
benchmark provides a number of artificial test pairs (µ, ν) of continuous probability distributions
with an analytically known OT map T* between them.

For evaluation, we use the "Early" images benchmark pair (D = 12288), see (Korotin et al., 2021b, §4.1)
for details. We adopt the L²-unexplained variance percentage metric (Korotin et al., 2021a, §5.1) to
quantify the recovered OT map T̂: L²-UVP(T̂) = 100 · ‖T̂ − T*‖²/Var(ν) %. For our method, the
L²-UVP metric is only ≈ 1%, see Table 4. This is comparable to the best method, ⌊MM:R⌋, which the
authors evaluate on their benchmark. The qualitative results are given in Figure 10.

Table 4: L²-UVP metric of the recovered transport map on the "Early" images benchmark pair.

Method      | L²-UVP ↓
⌊MM:R⌋      | 1.4%
OTM (ours)  | 1.32%
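
A Monte-Carlo sketch of this metric in PyTorch (our illustration with placeholder callables `T_hat`, `T_star`, `sample_mu`; the Var(ν) estimate below is our assumption, not the benchmark's reference implementation):

```python
import torch

def l2_uvp(T_hat, T_star, sample_mu, n_samples=4096):
    """L2-UVP(T_hat) = 100 * E_{x~mu} ||T_hat(x) - T_star(x)||^2 / Var(nu), in percent.
    Var(nu) is estimated here as the total variance of T_star(x) over x ~ mu
    (valid since T_star pushes mu forward to nu)."""
    x = sample_mu(n_samples)
    y_hat, y_star = T_hat(x), T_star(x)
    sq_err = (y_hat - y_star).flatten(1).pow(2).sum(dim=1).mean()
    var_nu = y_star.flatten(1).var(dim=0, unbiased=True).sum()
    return (100.0 * sq_err / var_nu).item()
```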

Figure 10: Qualitative results of OTM on the ”Early” images benchmark pair (µ, ν) by Korotin et al.
(2021b). The 1st line shows samples x ∼ µ, the 2nd line shows fitted OT map T̂ (x), and the 3rd
line shows the corresponding optimal map T ∗ (x) ∼ ν.

B.3 FURTHER DISCUSSION AND EVALUATION

Generative modeling. In the experiments, we use the gradient penalty on ψ for better stability of
optimization. The penalty is intended to make the gradient norm of the optimal WGAN critic equal
to 1 (Gulrajani et al., 2017, Corollary 1). This condition does not necessarily hold for optimal ψ ∗ in
our case and consequently might introduce bias to optimization.
To address this issue, we additionally tested an alternative regularization which we call the gradient
optimality. For every optimal potential ψ* and map G* of problem (14), we get from Lemma 4.2:

    \nabla_{G}\,\mathbb{E}_{x\sim\mu}\big[\langle Q(x), G^{*}(x)\rangle - \psi^{*}\big(G^{*}(x)\big)\big] = \mathbb{E}_{x\sim\mu}[Q(x)] - \mathbb{E}_{x\sim\mu}\big[\nabla\psi^{*}\big(G^{*}(x)\big)\big] = 0.    (25)

Since µ is a normal noise distribution and Q(x) is a naive upscaling (§5.1), the above expression
simplifies to E_{x∼µ}[∇ψ*(G*(x))] = 0. Based on this property, we establish the following regularizer
λ‖E_{x∼µ}[∇ψ(G(x))]‖ for λ > 0 and add it to L_ψ in our Algorithm 1.

While the gradient penalty considers the expectation of the norm, gradient optimality considers the
norm of the expectation. The gradient optimality is always non-negative and vanishes at the optimal point.
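
A minimal PyTorch sketch of this regularizer (our code, not the release), using the simplified form with E_{x∼µ}[Q(x)] = 0:

```python
import torch

def gradient_optimality(psi, g_out):
    """Gradient optimality term of Appendix B.3: the norm of the batch-averaged
    gradient of psi at generated points, || E_x[ grad psi(G(x)) ] ||.
    g_out = G(x) for a batch x ~ mu; it is detached below because the term is
    added (scaled by lambda) to L_psi in Algorithm 1, i.e., it regularizes psi only."""
    g = g_out.detach().requires_grad_(True)
    grads = torch.autograd.grad(psi(g).sum(), g, create_graph=True)[0]
    return grads.mean(dim=0).norm()
```

In Algorithm 1, line 4 would then read L_ψ ← L_ψ + λ · gradient_optimality(ψω, Gθ(x)).
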
We conduct additional experiments with the gradient optimality and compare FID scores for different
λ in Table 5. It leads to an improvement of the FID score from the earlier 7.7 with the gradient penalty
to the current 6.5 with the gradient optimality on CelebA (Table 2).

Table 5: Ablation study of gradient optimality in OTM.

λ      | FID ↓
0.001  | 16.91
0.01   | 16.22
0.1    | 16.70
1.0    | 10.01
10     | 6.50

Unpaired restoration. In the unpaired restoration experiments (§5.2), we test OTM with the gradient
penalty to make a fair comparison with the baseline WGAN-GP. We find that OTM without regularization,
i.e., λ = 0, works better than OTM-GP (Table 3). In practice, more G updates for a single ψ
update work fairly well (§B.1).

B.4 ADDITIONAL QUALITATIVE RESULTS

OTM works with both the grayscale and color embeddings of noise in the ambient space.
CelebA128x128. Figure 11 shows the grayscale embedding Q(x), the recovered transport map
Ĝ(x), and independently drawn real samples y ∼ ν.

Figure 11: OTM between 128-dimensional noise and CelebA, 128 × 128. The 1st line shows
the grayscale embedding Q (repeating bicubic upscaling of a noise, 16 × 8), the 2nd line shows
corresponding generated samples, and the 3rd line shows random samples from the dataset.

Anime128x128. Figure 12 shows the color embedding Q(x), the recovered transport map Ĝ(x),
and independently drawn real samples y ∼ ν.

Figure 12: OTM between 192-dimensional noise and Anime, 128 × 128. The 1st line shows the
color embedding Q (bicubic upscaling of a noise, 3 × 8 × 8), the 2nd line shows corresponding
generated samples, and the 3rd line shows random samples from the dataset.

The extended qualitative results with color embedding on MNIST, CIFAR10, and CelebA are
shown in Figure 13a, Figure 13b, and Figure 13c, respectively. Table 6 shows quantitative results
on MNIST. The color embedding Q is bicubic upscaling of a noise in R^{3×8×8}. The samples are
generated randomly (uncurated) by fitted optimal transport maps between noise and the ambient space,
e.g., spaces of high-dimensional images. Figure 14 illustrates latent space interpolation between the
generated samples. Figure 15 shows denoising of images with varying levels of σ = 0.1, 0.2, 0.3, 0.4
by the model trained with σ = 0.3.

Table 6: Results on MNIST dataset.

Model      | Related Work             | FID ↓
VAE        | Kingma & Welling (2013)  | 23.8±0.6
LSGAN      | Mao et al. (2017)        | 7.8±0.6
BEGAN      | Berthelot et al. (2017)  | 13.1±1.0
WGAN       | Arjovsky et al. (2017)   | 6.7±0.4
SIG        | Dai & Seljak (2021)      | 4.5
AE-OT      | An et al. (2020a)        | 6.2±0.2
AE-OT-GAN  | An et al. (2020b)        | 3.2
OTM        | Ours                     | 2.4

(a) MNIST, 32 × 32, grayscale (b) CIFAR10, 32 × 32, RGB (c) CelebA, 64 × 64, RGB

Figure 13: Randomly generated MNIST, CIFAR10, and CelebA samples by our method (OTM).

Figure 14: OTM for latent space interpolation on CelebA, 64 × 64. Extended samples.

Toy datasets. Figure 16 shows the results of our method and related approaches (§3) on a toy 2D
dataset. Figure 17 shows the results of our method applied to other toy datasets.

B.5 NEURAL NETWORK ARCHITECTURES

This section contains architectures on CIFAR10 (Table 7), CelebA (Table 8), CelebA 128 × 128
(Figure 11) generation, image restoration tasks (Table 9), evaluation on the toy 2D datasets and
Wasserstein-2 images benchmark (Table 4).
In the unpaired restoration tasks (§5.2), we use a UNet architecture for the transport map G and a
convolutional architecture for the potential ψ. Similarly to (Song & Ermon, 2019), we use
BatchNormalization (BN) and InstanceNormalization+ (INorm+) layers. In the ResNet architectures,
we use the ResidualBlock of NCSN (Song & Ermon, 2019).
In the toy 2D examples, we use a simple multi-layer perceptron with 3 hidden layers consisting of
128 neurons each and LeakyReLU activation. The final layer is linear without any activation.
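
A sketch of this toy architecture in PyTorch (our reading of the description; `out_dim` is 2 for the map G and 1 for the potential ψ):

```python
import torch.nn as nn

def toy_mlp(in_dim=2, out_dim=2, hidden=128):
    """Multi-layer perceptron with 3 hidden layers of 128 neurons, LeakyReLU
    activations, and a linear output layer, as described for the toy 2D examples."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, out_dim),
    )
```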


σ = 0.1 σ = 0.2 σ = 0.3 σ = 0.4 σ = 0.1 σ = 0.2 σ = 0.3 σ = 0.4 σ = 0.3

(a) Noisy samples. (b) Pushforward samples. (c) Clean.

Figure 15: OTM for image denoising for varying levels of noise on test C part of CelebA, 64 × 64.

(a) LSOT (Seguy et al., 2018)  (b) W2GAN (Jacob et al., 2018)  (c) WGAN-LP (Petzka et al., 2018)
(d) ICNN-OT (Makkuva et al., 2020)  (e) W2GN (Korotin et al., 2021a)  (f) OTM (ours)

Figure 16: Mapping between a Gaussian and a Mixture of 8 Gaussians in 2D by various methods.
The colors green, blue, and peru represent input, pushforward, and output samples respectively.

The transport map G and potential ψ architectures on MNIST 32 × 32, CelebA 128 × 128, and
Anime 128 × 128 are the generator and discriminator architectures of WGAN-QC (Liu et al., 2019),
respectively.

In the evaluation on the Wasserstein-2 benchmark, we use the publicly available UNet architecture
(https://github.com/milesial/Pytorch-UNet) for the transport map T and the WGAN-QC discriminator's
architecture for ψ (Liu et al., 2019). These neural network architectures are the same as those used by
the authors of the benchmark.


(a) Two moons  (b) Circles  (c) S Curve  (d) Swiss Roll

Figure 17: OTM on toy datasets, D = 2. Here, the colors green, blue, and peru represent input,
pushforward, and output samples respectively.

Table 7: Architectures for generation task on CIFAR10, 32 × 32.

G(.)
Noise: x ∈ R^{128}
Linear, Reshape, output shape: [128 × 4 × 4]
ResidualBlock Up, output shape: [128 × 8 × 8]
ResidualBlock Up, output shape: [128 × 16 × 16]
ResidualBlock Up, output shape: [128 × 32 × 32]
Conv, Tanh, output shape: [3 × 32 × 32]

ψ(.)
Target: y ∈ R^{3×32×32}
ResidualBlock Down, output shape: [128 × 16 × 16]
ResidualBlock Down, output shape: [128 × 8 × 8]
ResidualBlock, output shape: [128 × 8 × 8]
ResidualBlock, output shape: [128 × 8 × 8]
ReLU, Global sum pooling, output shape: [128 × 1 × 1]
Linear, output shape: [1]


Table 8: Architectures for generation task on Celeba, 64 × 64.

G(.)
Noise: x ∈ R^{128}
ConvTranspose, BN, LeakyReLU, output shape: [256 × 1 × 1]
ConvTranspose, BN, LeakyReLU, output shape: [512 × 4 × 4]
Conv, PixelShuffle, BN, LeakyReLU, output shape: [512 × 8 × 8]
Conv, PixelShuffle, BN, LeakyReLU, output shape: [512 × 16 × 16]
Conv, PixelShuffle, BN, LeakyReLU, output shape: [512 × 32 × 32]
ConvTranspose, Tanh, output shape: [3 × 64 × 64]

ψ(.)
Target: y ∈ R^{3×64×64}
Conv, output shape: [128 × 64 × 64]
ResidualBlock Down, output shape: [256 × 32 × 32]
ResidualBlock Down, output shape: [256 × 16 × 16]
ResidualBlock Down, output shape: [256 × 8 × 8]
ResidualBlock Down, output shape: [128 × 4 × 4]
Conv, output shape: [1]

Table 9: Architectures for restoration tasks on CelebA, 64 × 64.

G(.)
Input: x ∈ R^{3×64×64}
Conv, BN, LeakyReLU, output shape: [256 × 64 × 64]
Conv, LeakyReLU, AvgPool, output shape: [256 × 32 × 32], x1
Conv, LeakyReLU, AvgPool, output shape: [256 × 16 × 16], x2
Conv, LeakyReLU, AvgPool, output shape: [256 × 8 × 8], x3
Conv, LeakyReLU, AvgPool, output shape: [256 × 4 × 4], x4
Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 8 × 8], y3
Add (y3, x3), output shape: [256 × 8 × 8], y3
Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 16 × 16], y2
Add (y2, x2), output shape: [256 × 16 × 16], y2
Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 32 × 32], y1
Add (y1, x1), output shape: [256 × 32 × 32], y1
Nearest Neighbour Upsample, Conv, BN, ReLU, output shape: [256 × 64 × 64], y
Add (y, x), output shape: [256 × 64 × 64], y
ConvTranspose, Tanh, output shape: [3 × 64 × 64]

ψ(.)
Target: y ∈ R^{3×64×64}
Conv, LeakyReLU, AvgPool, output shape: [256 × 32 × 32]
Conv, LeakyReLU, AvgPool, output shape: [256 × 16 × 16]
Conv, LeakyReLU, AvgPool, output shape: [256 × 8 × 8]
Conv, LeakyReLU, AvgPool, output shape: [256 × 4 × 4]
Linear, output shape: [1]
