Deep Riemannian Manifold Learning
Abstract
We present a new class of learnable Riemannian manifolds with a metric parameterized by a deep neural network. The core manifold operations, specifically the Riemannian exponential and logarithmic maps, are solved using approximate numerical techniques. Input and parameter gradients are computed with an adjoint sensitivity analysis. This enables us to fit geodesics and distances with gradient-based optimization of both on-manifold values and the manifold itself. We demonstrate our method's capability to model smooth, flexible metric structures in graph embedding tasks.
1 Introduction
Geometric domain knowledge is important for machine learning systems that interact with, process, and produce data from inherently geometric spaces and phenomena [BBL+ 17, MBM+ 17]. Geometric knowledge often empowers the model with explicit operations and components that reflect the geometric properties of the domain.
One commonly used construct is a Riemannian manifold, the generalization of standard Euclidean
space. These structures enable models to represent non-trivial geometric properties of data [NK17,
DFDC+ 18, GDM+ 18, NK18, GBH18]. However, despite this capability, the current set of usable
Riemannian manifolds is surprisingly small, as the core manifold operations must be provided in a
closed form to enable gradient-based optimization. Worse still, these manifolds are often set a priori,
as the techniques for interpolating between manifolds are extremely restricted [GSGR19, SGB20].
In this work, we study how to integrate a general class of Riemannian manifolds into learning systems.
Specifically, we parameterize a class of Riemannian manifolds with a deep neural network and
develop techniques to optimize through the manifold operations. We apply this manifold learning
method in the context of graph embeddings to accurately reconstruct distances.
A geodesic γ on our manifold satisfies, in local coordinates, the geodesic equation
$$\frac{d^2\gamma^i}{dt^2} + \Gamma^i_{jk}\,\frac{d\gamma^j}{dt}\,\frac{d\gamma^k}{dt} = 0, \qquad (1)$$
where γ^i are the local coordinate components of γ and the Christoffel symbols Γ^i_{jk} are defined by
$$\Gamma^i_{jk} := \frac{1}{2}\sum_l g^{il}\left(\frac{\partial g_{jl}}{\partial x^k} + \frac{\partial g_{kl}}{\partial x^j} - \frac{\partial g_{jk}}{\partial x^l}\right). \qquad (2)$$
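For concreteness, eq. (2) can be evaluated by automatic differentiation whenever the metric is available as a differentiable function of x, such as the deep metric defined later in the paper. The PyTorch sketch below is purely illustrative, not the paper's implementation; it assumes a hypothetical `metric` callable that maps a batch of points to SPD matrices.

```python
import torch

def christoffel(metric, x):
    """Christoffel symbols of eq. (2) at a single point x of shape (n,).

    `metric` maps a (batch, n) tensor of points to (batch, n, n) SPD matrices.
    Returns Gamma with Gamma[i, j, k] = Γ^i_{jk}.
    """
    g = metric(x.unsqueeze(0))[0]                                  # g_{jk}, shape (n, n)
    # dg[j, k, l] = ∂ g_{jk} / ∂ x^l, obtained with one Jacobian call.
    # create_graph=True so we can later differentiate through geodesic solves.
    dg = torch.autograd.functional.jacobian(
        lambda y: metric(y.unsqueeze(0))[0], x, create_graph=True)
    g_inv = torch.linalg.inv(g)
    # Γ^i_{jk} = 1/2 Σ_l g^{il} (∂_k g_{lj} + ∂_j g_{lk} − ∂_l g_{jk})
    return 0.5 * torch.einsum(
        'il,ljk->ijk', g_inv, dg + dg.permute(0, 2, 1) - dg.permute(2, 0, 1))
```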
3 Related work
Our work primarily relates to problems in geometric machine learning. Several previous works build learning systems for predefined manifolds [NK17, NK18, GSGR19, GDM+ 18, GBH18, LKJ+ 20, LLK+ 20, MN20], while others attempt to learn a manifold structure through neural networks [BC20, AHH17, SKF18]. Our method complements both lines of work: for the first, we expand the scope of tractable Riemannian manifolds, and, for the second, we allow for deep metric constructions independent of the manifold structure.
Our work can also be viewed as a deep metric learning method. There exist several approaches which
construct deep metrics directly [HA15, VBL+ 16, SSZ17, MBL20, DSGRS18] and even ones which
structure neural networks for metrics [PCJB20]. By contrast, our method constructs deep metrics
endowed with Riemannian geometry, which allow for several desirable properties such as smoothness
and interpretability. While other papers have explored usages of Riemannian geometry for metric
learning [HFB12], ours is the first to unify this with the generality of deep neural networks.
To construct the metric g, we first note that in local coordinates each g_x is a symmetric positive definite (SPD) matrix. Since our local coordinates remain fixed by our single-chart construction, the function x ↦ g_x is in fact a smooth function from R^n to S^n_+, where S^n_+ is the space of n × n SPD matrices.
To parameterize this function with a neural network, we note that the matrix exponential exp acts as a bijection between S^n, the space of symmetric matrices, and S^n_+. Let sym : R^{n(n+1)/2} → S^n be the function which takes an R^{n(n+1)/2} vector and produces the corresponding symmetric matrix in R^{n×n}. Given a neural network f_nn : R^n → R^{n(n+1)/2}, we define our deep metric as
$$g^d_x := (\exp \circ \operatorname{sym})(f_{nn}(x)).$$
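As a concrete illustration, a minimal PyTorch sketch of this construction could look as follows; the MLP architecture, activation, and hidden width are placeholder assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DeepMetric(nn.Module):
    """Deep metric g^d_x = (exp ∘ sym)(f_nn(x)), one SPD matrix per input point."""

    def __init__(self, n, hidden=64):
        super().__init__()
        self.n = n
        self.f_nn = nn.Sequential(                    # f_nn : R^n -> R^{n(n+1)/2}
            nn.Linear(n, hidden), nn.Tanh(),
            nn.Linear(hidden, n * (n + 1) // 2))
        self.register_buffer("tril", torch.tril_indices(n, n))  # indices used by sym(.)

    def forward(self, x):
        # x: (batch, n)  ->  g: (batch, n, n), symmetric positive definite
        v = self.f_nn(x)
        s = x.new_zeros(x.shape[0], self.n, self.n)
        s[:, self.tril[0], self.tril[1]] = v          # fill the lower triangle
        s = s + torch.tril(s, -1).transpose(-1, -2)   # symmetrize: sym(v)
        return torch.matrix_exp(s)                    # matrix exponential -> SPD
```

For tangent vectors u, v at x the induced inner product is then u^T g^d_x v, e.g. `torch.einsum('bi,bij,bj->b', u, g, v)` for a batch of points.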
We note that our construction is indeed a Riemannian metric, as it is a smooth inner product for
tangent vectors. Furthermore, it allows for some notion of universal approximation.
Prop 4.1 (Universal Approximation of Riemannian Metrics). On a compact set of R^n, our deep metric is capable of approximating any Riemannian metric on our single-chart manifolds.
This follows since neural networks are universal approximators and the matrix exponential is Lipschitz on bounded sets. We present the full proof in app. B.1.
The exponential map exp_x(v) is the time-1 solution to eq. (1) with initial position x and initial velocity v. For our deep manifold, x, v, and exp_x(v) are all elements of R^n, so solving the geodesic equation with these initial values becomes a standard numerical ODE problem. By rewriting the second-order ODE as a first-order system, the exponential map becomes a first-order initial value problem (IVP), so computation and differentiation can be implemented with a Neural ODE [CRBD18].
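A hedged sketch of this computation, built on the christoffel and DeepMetric helpers above and on the torchdiffeq odeint solver accompanying [CRBD18]; the solver choice and integration settings here are assumptions, not the paper's configuration.

```python
import torch
from torchdiffeq import odeint   # ODE solver from the Neural ODE line of work [CRBD18]

def exp_map(metric, x, v, t_final=1.0):
    """Approximate exp_x(v): integrate eq. (1), rewritten as a first-order system
    in (position, velocity), from t = 0 to t = t_final.
    Uses the christoffel helper sketched after eq. (2)."""
    def dynamics(t, state):
        pos, vel = state
        gamma = christoffel(metric, pos)                      # Γ^i_{jk} at the current point
        acc = -torch.einsum('ijk,j,k->i', gamma, vel, vel)    # geodesic equation
        return vel, acc

    ts = torch.tensor([0.0, t_final])
    pos_path, _ = odeint(dynamics, (x, v), ts)
    return pos_path[-1]
```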
The log map logx (y) is the initial value v for which expx (v) = y. This is a boundary value problem
(BVP), and, for our deep manifolds, reduces to a BVP on Euclidean space. As opposed to IVPs, these
problems are conceptually harder, more difficult to solve, and lack a clean adjoint sensitivity analysis.
Computation There exist many computational methods for BVPs. Common methods include the shooting method and grid-based (collocation) solvers [SB02], and modern numerical solvers such as MATLAB's bvp5c can even control the residual and error [KS08]. However, for our purposes we found that the BVP solver of [AHHS19], which we reimplement for batched computation in PyTorch [PGM+ 19], provided faster and more stable computation.
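For intuition only, a simple single-shooting alternative treats log_x(y) as an unknown initial velocity and optimizes it until the exponential map hits y. The sketch below relies on the exp_map helper above, with placeholder iteration count and learning rate; it is far less robust than the reimplemented [AHHS19] solver described above.

```python
def log_map_shooting(metric, x, y, iters=200, lr=1e-2):
    """Rough log_x(y) via single shooting: find v with exp_x(v) ≈ y."""
    v = (y - x).clone().detach().requires_grad_(True)  # Euclidean difference as a warm start
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((exp_map(metric, x, v) - y) ** 2).sum()
        loss.backward()
        opt.step()
    return v.detach()
```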
Differentiation To calculate gradients for the input values and neural network parameters, we
develop an adjoint sensitivity analysis for BVPs. We present the proof in app. B.2.
Theorem 5.1 (Adjoint Sensitivity for BVPs). Suppose we are given a BVP B(x, y) with corresponding IVP I(x, v) such that I(x, B(x, y)) = y, and suppose the solution to B(x_0, y_0) is b. Then
$$D_x B(x_0, y_0) = -D_v I(x_0, b)^{-1} \circ D_x I(x_0, b), \qquad D_y B(x_0, y_0) = D_v I(x_0, b)^{-1}. \qquad (3)$$
Note that the derivatives of I can be computed through the adjoint sensitivity analysis. Furthermore, if our ODE problem is controlled by some parameter θ, then
$$D_\theta B(x_0, y_0) = -D_v I(x_0, b)^{-1} \circ D_\theta I(x_0, b).$$
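In code, these identities yield the vector-Jacobian products needed to backpropagate through the log map without differentiating through the BVP solver itself. The dense-Jacobian sketch below uses the exp_map helper above in place of I and is purely didactic; in practice the inverse factors are applied via adjoint solves rather than by materializing full Jacobians.

```python
def bvp_grads(metric, x, y, b, grad_b):
    """Gradients of a loss L through b = B(x, y) = log_x(y), given grad_b = dL/db.
    Applies Theorem 5.1 with I(x, v) = exp_x(v)."""
    J_x = torch.autograd.functional.jacobian(lambda x_: exp_map(metric, x_, b), x)  # D_x I
    J_v = torch.autograd.functional.jacobian(lambda v_: exp_map(metric, x, v_), b)  # D_v I
    u = torch.linalg.solve(J_v.T, grad_b)   # u = (D_v I)^{-T} grad_b
    grad_y = u                              # since D_y B = (D_v I)^{-1}
    grad_x = -J_x.T @ u                     # since D_x B = -(D_v I)^{-1} D_x I
    return grad_x, grad_y
```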
6 Experiments
We experimentally validate our deep manifold formulation for graph embeddings.
In this experiment we embed graphs into our deep Riemannian manifolds. Since our manifolds are able to represent more complex geometric structures than previous methods, we expect our embeddings to be more faithful and have lower distortion.
Figure 2: Geodesics and 2D embeddings on the Tree40 graph (rows: Euclidean, Poincaré, DeepMetric), where each column corresponds to the geodesics connected to a point (highlighted in red). The contours show the geodesic distances. The deep metric accurately captures the node distances and induces non-trivial geodesics.
Table 1: Distortions for 2D, 5D, and 10D embeddings of the graph distances over 3 trials.
Sphere100 Tree6 Tree40 Cycle10 Cube1 Cube2
2 5 10 2 5 10 2 5 10 2 5 10 2 5 10 2 5 10
Euclidean 42.64± 21.88 2.85± 0.14 2.82± 0.08 3.14± 2.80 1.43± 0.07 1.51± 0.10 44.64± 19.64 7.85± 1.26 5.62± 0.39 1.67± 0.06 1.91± 0.12 1.79± 0.04 6.21± 1.27 1.83± 0.09 1.74± 0.16 4.37± 0.54 2.05± 0.07 1.98± 0.09
PoincareBall 12.11± 0.60 2.42± 0.04 2.64± 0.05 1.65± 0.04 1.63± 0.00 1.63± 0.00 8.65± 3.14 3.45± 0.12 2.38± 0.12 1.93± 0.00 1.93± 0.00 1.94± 0.00 1.80± 0.01 1.81± 0.01 1.81± 0.01 2.66± 0.31 2.17± 0.01 2.16± 0.01
Sphere 3.86± 0.00 2.42± 0.00 2.56± 0.01 1.69± 0.02 1.71± 0.01 1.71± 0.00 11.61± 0.94 6.83± 0.90 5.51± 0.23 1.72± 0.00 1.73± 0.00 1.73± 0.00 1.78± 0.05 1.76± 0.01 1.76± 0.01 2.48± 0.04 2.15± 0.05 2.10± 0.01
DeepManifold 102.74± 62.74 2.41± 0.01 2.51± 0.06 2.01± 0.80 1.13± 0.04 1.13± 0.05 71.57± 15.44 4.47± 0.65 2.51± 0.37 8.23± 5.69 1.77± 0.55 1.34± 0.07 5.99± 1.49 1.55± 0.15 1.53± 0.09 3.69± 1.02 1.65± 0.03 1.58± 0.05
We consider the synthetic graphs from [CBG20], which we visualize in sect. 6.
We present our experimental details in app. C. In table 1, we report the distortion of our embeddings; our manifolds are generally capable of producing better (lower-distortion) embeddings. Geodesics are visualized in fig. 2; note how our method produces non-trivial geodesics adapted to the data geometry.
7 Conclusion
We have presented deep Riemannian manifolds, a type of Riemannian manifold with a metric
parameterized by a deep neural network, and have shown how to optimize both manifold-valued data
as well as the manifold itself. We hope our work enables future geometric machine learning models
to better account for more general structures.
References
[AHH17] Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on
the curvature of deep generative models. arXiv preprint arXiv:1710.11379, 2017.
[AHHS19] Georgios Arvanitidis, Søren Hauberg, Philipp Hennig, and Michael Schober. Fast and
robust shortest paths on manifolds learned from data. arXiv preprint arXiv:1901.07229,
2019.
[BBL+ 17] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Van-
dergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal
Processing Magazine, 34(4):18–42, 2017.
[BC20] Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and
density estimation. arXiv preprint arXiv:2003.13913, 2020.
[CBG20] Calin Cruceru, Gary Bécigneul, and Octavian-Eugen Ganea. Computationally tractable
riemannian manifolds for graph embeddings. arXiv preprint arXiv:2002.08665, 2020.
[CRBD18] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural
ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing
Systems 31, pages 6571–6583. Curran Associates, Inc., 2018.
[Cyb89] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics
of Control, Signals and Systems, 2(4):303–314, 1989.
[DC16] Manfredo P Do Carmo. Differential geometry of curves and surfaces: revised and
updated second edition. Courier Dover Publications, 2016.
[DFDC+ 18] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.
Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.
[DSGRS18] Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation
tradeoffs for hyperbolic embeddings. Proceedings of machine learning research,
80:4460, 2018.
[GBH18] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks.
In Advances in neural information processing systems, pages 5345–5355, 2018.
[GDM+ 18] Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu,
Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro,
et al. Hyperbolic attention networks. arXiv preprint arXiv:1805.09786, 2018.
[GSGR19] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré. Learning mixed-curvature
representations in product spaces. In International Conference on Learning Represen-
tations, 2019.
[HA15] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International
Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[HFB12] Søren Hauberg, Oren Freifeld, and Michael J. Black. A geometric take on metric
learning. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 2024–2032. Curran
Associates, Inc., 2012.
[KS08] J Kierzenka and Lawrence Shampine. A bvp solver that controls residual and error.
European Society of Computational Methods in Sciences and Engineering (ESCMSE)
Journal of Numerical Analysis, Industrial and Applied Mathematics, 3:27–41, 01 2008.
[L+ 18] Allan M. M. Leal et al. autodiff, a modern, fast and expressive C++ library for automatic
differentiation. https://autodiff.github.io, 2018.
[Lee97] J.M. Lee. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in
Mathematics. Springer New York, 1997.
[Lee03] J.M. Lee. Introduction to Smooth Manifolds. Graduate Texts in Mathematics. Springer,
2003.
[LKJ+ 20] Aaron Lou, Isay Katsman, Qingxuan Jiang, Serge Belongie, Ser-Nam Lim, and Christo-
pher De Sa. Differentiating through the fréchet mean, 2020.
[LLK+ 20] Aaron Lou, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser-Nam Lim, and
Christopher De Sa. Neural manifold ordinary differential equations, 2020.
[MBL20] Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check,
2020.
[MBM+ 17] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda,
and Michael M Bronstein. Geometric deep learning on graphs and manifolds using
mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5115–5124, 2017.
[MN20] Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows,
2020.
[NK17] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical
representations. In Advances in neural information processing systems, pages 6338–
6347, 2017.
[NK18] Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the lorentz
model of hyperbolic geometry. arXiv preprint arXiv:1806.03417, 2018.
[PCJB20] Silviu Pitis, Harris Chan, Kiarash Jamali, and Jimmy Ba. An inductive bias for distances:
Neural nets that respect the triangle inequality. arXiv preprint arXiv:2002.05825, 2020.
[PGM+ 19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmai-
son, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani,
Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
Pytorch: An imperative style, high-performance deep learning library. In H. Wal-
lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran
Associates, Inc., 2019.
[SB02] J. Stoer and R Bulirsch. Introduction to numerical analysis. Texts in applied mathemat-
ics. Springer, 2002.
[SDSGR18] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs
for hyperbolic embeddings. volume 80 of Proceedings of Machine Learning Research,
pages 4460–4469, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[SGB20] Ondrej Skopek, Octavian-Eugen Ganea, and Gary Bécigneul. Mixed-curvature varia-
tional autoencoders. In International Conference on Learning Representations, 2020.
[SKF18] H. Shao, A. Kumar, and P. T. Fletcher. The riemannian geometry of deep generative
models. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW), pages 428–4288, 2018.
[SSZ17] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot
learning. In Advances in neural information processing systems, pages 4077–4087,
2017.
[VBL+ 16] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching
networks for one shot learning. In Advances in neural information processing systems,
pages 3630–3638, 2016.
[YDS19] Tao Yu and Christopher M De Sa. Numerically accurate hyperbolic embeddings using
tiling-based models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems
32, pages 2023–2033. Curran Associates, Inc., 2019.
A Background in Riemannian Geometry
We give a cursory overview of the core Riemannian geometry constructs integral to our construction.
For a more thorough overview of this topic, we recommend interested readers consult a text such as
[DC16, Lee97, Lee03].
The core object of study is the differentiable manifold M of dimension n, the higher dimensional
analogue of a surface. Locally around each point, the space “resembles” Euclidean space Rn . These
are mathematically formulated by charts, diffeomorphic functions ϕ : U → V with U, V open
subsets of Rn and M respectively. Our chart ranges cover our manifold, allowing the neighborhood
of each point on our manifold to be controlled by local coordinates x1 , . . . , xn which correspond to
the standard directions in Rn .
By taking linear approximations of functions at each point, one can construct tangent vectors v
which form the Tx M, the tangent space of M at x. These tangent spaces are vector spaces of
dimension n. With a local coordinate system, we can define the coordinate basis ∂/∂x^1, . . . , ∂/∂x^n of T_xM, which we assume implicitly from the local coordinate system.
To endow our manifold with a notion of distance, we first construct Riemannian metrics. These are
a collection of local inner products g_x : T_xM × T_xM → R that vary smoothly with x. We write g for the full metric (g_x)_{x∈M} and ‖·‖_g for the norm induced by g.
A Riemannian metric gives rise to a notion of distance. Specifically, for a curve γ : I → M, the length of γ is
$$L(\gamma) = \sqrt{\int_0^1 \|\gamma'(t)\|_g^2 \, dt}.$$
For two points x, y ∈ M, the distance d_g(x, y) is defined as the minimum length over curves joining them, min_{γ: γ(0)=x, γ(1)=y} L(γ).
In practice, we can more efficiently find these minimizing curves by considering geodesics. Geodesics
are curves which are governed by the geodesic equation eq. (1) and are the locally minimizing curves.
This means that a minimizing curve will be a geodesic, but a geodesic need not be minimizing (for example, a great-circle arc that wraps more than halfway around a sphere).
Given a point x ∈ M and an initial velocity v ∈ T_xM, there is a unique constant-speed geodesic with γ(0) = x and γ'(0) = v. The point γ(1) is called the exponential map of x with v, written exp_x(v). The logarithmic map is the (local) inverse of this and is denoted log_x(y). Note that with this construction we can interpolate a geodesic using only the exp and log maps: for a geodesic γ between x and y, γ(t) = exp_x(t log_x(y)). Furthermore, the distance can be represented as d_g(x, y) = ‖log_x(y)‖_g.
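In terms of the illustrative helpers sketched in the main text (the hypothetical exp_map and log_map_shooting functions), these relations translate directly into code:

```python
def geodesic(metric, x, y, t):
    """Geodesic interpolation gamma(t) = exp_x(t · log_x(y))."""
    return exp_map(metric, x, t * log_map_shooting(metric, x, y))

def distance(metric, x, y):
    """Geodesic distance d_g(x, y) = ||log_x(y)||_{g_x}."""
    v = log_map_shooting(metric, x, y)
    g = metric(x.unsqueeze(0))[0]
    return torch.sqrt(v @ g @ v)
```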
B Proof of Claims
B.1 Universal Approximation

Prop B.1 (Universal Approximation of Riemannian Metrics, Full). Suppose we have a compact set D ⊂ R^n and a neural network with sufficient approximation capacity as in [Cyb89]. For any metric g and any ε > 0, there exists a neural network such that sup_{x∈D} ‖g_x − g^d_x‖ < ε.

Proof. Let g'_x be the function such that g_x = (exp ∘ sym)(g'_x). We note that g'_x exists and is unique, as exp is bijective as a function from S^n to S^n_+.

Since exp ∘ sym is smooth, there is a constant K such that ‖(exp ∘ sym)(a) − (exp ∘ sym)(b)‖ ≤ K‖a − b‖ for all a, b in a bounded set containing f_nn(D) and g'(D). By universal approximation, there exists a neural network such that sup_{x∈D} ‖f_nn(x) − g'_x‖ ≤ ε/K. Then
$$\sup_{x\in D} \|g_x - g^d_x\| \le K \sup_{x\in D} \|f_{nn}(x) - g'_x\| \le \varepsilon, \qquad (4)$$
as desired.
B.2 Adjoint Sensitivity
Theorem B.1 (Adjoint Sensitivity for BVPs). Suppose we are given a BVP B(x, y) with corresponding IVP I(x, v) such that I(x, B(x, y)) = y, and suppose the solution to B(x_0, y_0) is b. Then
$$D_x B(x_0, y_0) = -D_v I(x_0, b)^{-1} \circ D_x I(x_0, b), \qquad D_y B(x_0, y_0) = D_v I(x_0, b)^{-1}. \qquad (5)$$
Note that the derivatives of I can be computed through the adjoint sensitivity analysis. Furthermore, if our ODE problem is controlled by some parameter θ, then
$$D_\theta B(x_0, y_0) = -D_v I(x_0, b)^{-1} \circ D_\theta I(x_0, b).$$
Proof. The core of this proof comes from the fact that I(x_0, B(x_0, y_0)) = y_0. Differentiating this identity w.r.t. y using the chain rule gives
$$I = D_y y_0 = \begin{bmatrix} D_x I(x_0, B(x_0, y_0)) & D_v I(x_0, B(x_0, y_0)) \end{bmatrix} \begin{bmatrix} 0 \\ D_y B(x_0, y_0) \end{bmatrix} = D_v I(x_0, B(x_0, y_0))\, D_y B(x_0, y_0).$$
Rearranging and setting b = B(x_0, y_0) gives D_y B(x_0, y_0) = D_v I(x_0, b)^{-1}, as desired.
For x, differentiating through the same identity gives
$$0 = D_x y_0 = \begin{bmatrix} D_x I(x_0, B(x_0, y_0)) & D_v I(x_0, B(x_0, y_0)) \end{bmatrix} \begin{bmatrix} I \\ D_x B(x_0, y_0) \end{bmatrix} = D_x I(x_0, B(x_0, y_0)) + D_v I(x_0, B(x_0, y_0))\, D_x B(x_0, y_0),$$
and rearranging gives D_x B(x_0, y_0) = -D_v I(x_0, b)^{-1} ∘ D_x I(x_0, b).
For θ, the proof is very similar to the proof for x. In particular, differentiating the same identity w.r.t. θ gives
$$0 = D_\theta y_0 = \begin{bmatrix} D_x I & D_v I & D_\theta I \end{bmatrix}(x_0, B(x_0, y_0)) \begin{bmatrix} 0 \\ D_\theta B(x_0, y_0) \\ I \end{bmatrix} = D_v I(x_0, B(x_0, y_0))\, D_\theta B(x_0, y_0) + D_\theta I(x_0, B(x_0, y_0)),$$
and rearranging gives D_θ B(x_0, y_0) = -D_v I(x_0, b)^{-1} ∘ D_θ I(x_0, b), completing the proof.
Table 2: Reconstruction losses (stress) for 2D, 5D, and 10D embeddings of the graph distances.
Sphere100 Tree6 Tree40 Cycle10 Cube1 Cube2
2 5 10 2 5 10 2 5 10 2 5 10 2 5 10 2 5 10
Euclidean 1.76± 0.26 0.25± 0.00 0.25± 0.00 0.32± 0.46 0.04± 0.01 0.05± 0.01 1.42± 0.28 0.35± 0.03 0.26± 0.05 0.17± 0.00 0.21± 0.01 0.20± 0.01 0.34± 0.16 0.11± 0.01 0.12± 0.01 0.37± 0.06 0.15± 0.00 0.15± 0.00
PoincareBall 1.83± 0.00 0.13± 0.00 0.13± 0.00 0.07± 0.00 0.07± 0.00 0.07± 0.00 1.87± 0.17 0.25± 0.00 0.04± 0.00 0.22± 0.00 0.22± 0.00 0.22± 0.00 0.18± 0.00 0.18± 0.00 0.18± 0.00 0.29± 0.02 0.24± 0.00 0.24± 0.00
Sphere 0.53± 0.00 0.15± 0.00 0.07± 0.00 0.07± 0.00 0.08± 0.00 0.08± 0.00 1.03± 0.00 0.76± 0.00 0.66± 0.00 0.12± 0.00 0.12± 0.00 0.12± 0.00 0.16± 0.00 0.16± 0.00 0.16± 0.00 0.27± 0.00 0.23± 0.00 0.23± 0.00
DeepManifold 1.92± 0.29 0.11± 0.00 0.18± 0.02 0.14± 0.12 0.00± 0.00 0.00± 0.00 1.48± 0.70 0.18± 0.05 0.08± 0.01 0.62± 0.42 0.11± 0.10 0.03± 0.01 0.67± 0.15 0.05± 0.03 0.05± 0.02 0.28± 0.08 0.06± 0.00 0.06± 0.01
However, one will note that we report distortion as our loss function, which is given by