
Topological Methods in Machine Learning: A Tutorial for Practitioners

BARIS COSKUNUZER, University of Texas at Dallas, USA


CÜNEYT GÜRCAN AKÇORA, University of Central Florida, USA
Topological Machine Learning (TML) is an emerging field that leverages techniques from algebraic topology to analyze complex data
structures in ways that traditional machine learning methods may not capture. This tutorial provides a comprehensive introduction
to two key TML techniques, persistent homology and the Mapper algorithm, with an emphasis on practical applications. Persistent
homology captures multi-scale topological features such as clusters, loops, and voids, while the Mapper algorithm creates an
interpretable graph summarizing high-dimensional data. To enhance accessibility, we adopt a data-centric approach, enabling readers
to gain hands-on experience applying these techniques to relevant tasks. We provide step-by-step explanations, implementations,
hands-on examples, and case studies to demonstrate how these tools can be applied to real-world problems. The goal is to equip
researchers and practitioners with the knowledge and resources to incorporate TML into their work, revealing insights often hidden
from conventional machine learning methods. The tutorial code is available at https://github.com/cakcora/TopologyForML.

CCS Concepts: • Mathematics of computing → Algebraic topology; • Computing methodologies → Learning paradigms;
Machine learning algorithms; Machine learning applications; • Applied computing → Life and medical sciences; Physical
sciences and engineering.

Additional Key Words and Phrases: Topological Data Analysis, Machine Learning, Persistent Homology, Mapper, Multiparameter
Persistence

ACM Reference Format:


Baris Coskunuzer and Cüneyt Gürcan Akçora. 2023. Topological Methods in Machine Learning: A Tutorial for Practitioners. 1, 1
(September 2023), 54 pages. https://doi.org/10.1145/XXXXXXX.XXXXXXX

Contents

Abstract 1
Contents 1
1 Introduction 3
1.1 Roadmap for the Tutorial 4
2 Background 5
2.1 A Crash Course on Topology 5
2.1.1 Topological Space 5
2.1.2 Topological Equivalence 7
2.1.3 Topological Invariant 8
2.1.4 Geometry vs. Topology 8
2.2 Homology 9
2.2.1 TLDR 9
2.2.2 Computation of Homology 11
3 Persistent Homology 14
3.1 Constructing Filtrations 14
3.1.1 Filtrations for Point Clouds 15
3.1.2 Filtrations for Images 17
3.1.3 Filtrations for Graphs 18
3.1.4 Choosing Thresholds 21
3.2 Persistence Diagrams 22
3.2.1 Interpretation of Persistence Diagrams 23
3.2.2 Wasserstein Distance 24
3.3 Integrating PDs to ML tasks 1: Vectorizations 25
3.3.1 Stability 29
3.3.2 Choice of Vectorization and Hyperparameters 29
3.4 Integrating PDs to ML tasks 2: Using Neural Networks 30
3.4.1 Vectorization with NN: Perslay 30
3.4.2 PD Learning with NN: PointNet 30
3.5 Computational Complexity for Persistent Homology 31
3.6 Software Libraries for Persistent Homology 31
4 Multiparameter Persistence 32
4.1 Multifiltrations 33
4.1.1 Multifiltrations for Graphs. 33
4.1.2 Multifiltrations for Point Clouds. 34
4.1.3 Multifiltrations for Images. 34
4.2 MP Vectorization Methods 34
5 Mapper 35
5.1 Mapper for Point Clouds 35
5.1.1 Hyperparameters for Mapper 37
5.2 Mapper for Other Data Formats 37
5.3 Computational Complexity of Mapper 39
5.4 Software Libraries for Mapper 39
6 Applications 40
6.1 PH for Point Clouds: Shape Recognition 40
6.2 PH for Graphs: Crypto-Token Anomaly Forecasting 41
6.3 PH for Images: Cancer Diagnosis from Histopathological Images 42
6.4 Multiparameter Persistence: Computer Aided Drug Discovery 43
6.5 Mapper: Cancer Genotyping from RNA Sequencing 45
7 Future Directions 47
7.1 Topological Deep Learning 47
7.2 TDA and Interpretability 47

7.3 Fully Automated TDA Models 47


7.4 Scalability of TDA methods 48
8 Conclusion 48
Acknowledgments 48
References 49
A Notation Table 54
B Dataset Resources 54

1 INTRODUCTION
As the complexity of datasets has grown in recent years, topological methods have emerged as powerful complements
to state-of-the-art ML methods. The advent of ML has revolutionized the way we analyze and interpret complex data,
yet there remain challenges in capturing the intrinsic topological structures inherent in such data. Traditional ML
techniques, while powerful, often fall short in identifying and leveraging these structures, leading to the potential
loss of valuable insights. Topological Machine Learning (TML) bridges this gap by integrating concepts from algebraic
topology into ML workflows, enabling the discovery of patterns and features that are otherwise elusive. Despite this
utility, much of the existing literature on topological methods in ML is highly technical, making it challenging for
newcomers to grasp the direct connections to practical applications.
In this tutorial, we introduce the fundamental concepts of topological methods to the machine learning (ML)
community and a broader audience interested in integrating these novel approaches into their research. No prior
knowledge of topology or ML is required. Our primary aim is to address this pressing need by providing a practical
guide for non-experts looking to employ topological techniques in various ML contexts. To maintain accessibility, we
will simplify the exposition and offer references to more detailed technical resources for those interested in further
exploration.
In this paper, we teach two cornerstone techniques of TML, Persistent Homology (PH) and the Mapper algorithm, and their
effective utilization in ML. Persistent homology offers a robust, multi-scale analysis of topological features, allowing
researchers to detect and quantify structures such as clusters, loops, and voids across different scales within the data.
This capability is particularly beneficial for understanding the intricate relationships and hierarchies that may exist in
complex datasets. From an ML perspective, PH offers a powerful feature extraction method for complex datasets, capturing
information that is difficult to obtain with other methods. In this part, we will focus on this aspect of PH, giving hands-on instructions with
illustrations on deriving effective topological vectors from complex data. On the other hand, the Mapper algorithm
complements this by providing a visual and interpretable summary of high-dimensional data. By constructing a
summary graph that mirrors the underlying topology of data, the Mapper algorithm facilitates the exploration and
interpretation of data in an intuitive and informative way. This technique is instrumental in uncovering data’s geometric
and topological essence, making it accessible for practical analysis.
Throughout the paper, we provide comprehensive explanations and step-by-step implementations of these techniques,
supported by case studies spanning diverse applications such as cancer diagnosis, shape recognition, genotyping, and
drug discovery. We aim to equip researchers and practitioners with the necessary knowledge and tools to integrate
TML techniques into their studies, thereby unlocking new avenues for discovery and innovation. By demonstrating the
practical utility of persistent homology and the Mapper algorithm, we highlight their potential to reveal insights that
traditional methods may overlook, ultimately advancing the field of Machine Learning.
We note that this paper is not intended to be a survey of recent advances in topological data analysis but rather
a tutorial aimed at introducing the fundamentals of the topic to ML practitioners. For comprehensive surveys in
topological data analysis, refer to [28, 53, 86, 100]. For in-depth discussions of these topics, see the excellent textbooks
on TDA and computational topology [38, 39].

1.1 Roadmap for the Tutorial


We recommend reading the entire tutorial for a complete understanding, but readers may skip sections not pertinent to
their needs. Here, we give a quick overview of the paper’s structure.
In Section 2, we provide the essential topological background needed to follow the concepts discussed throughout
the rest of the paper. We aim to introduce these topological concepts, help build an intuition for TDA approaches,
and demonstrate how they can be adapted and applied to various needs. For those unfamiliar with topology, we
strongly encourage reading our introductory crash course in Section 2.1. When discussing homology, we start with
a non-technical, brief overview (Section 2.2.1), which should suffice for following the rest of the paper. For readers
interested in more technical details of homology computation, we offer a more in-depth explanation in Section 2.2.2.
In Section 3, we introduce Persistent Homology (PH) in three key sections: constructing filtrations (Section 3.1),
deriving persistence diagrams (PD) (Section 3.2), and applying PDs to ML tasks, including vectorization (Section 3.3) and
neural networks (Section 3.4). The filtration process is tailored to different data formats, as methods differ significantly
based on the type of data—whether it’s point clouds (Section 3.1.1), images (Section 3.1.2), or networks (Section 3.1.3).
If you focus on a specific data type, you can skip sections on other formats. In each subsection, we also discuss the
hyperparameter selection process. Lastly, Section 3.6 covers available software for PH.
In Section 4, we give a brief introduction to a niche subfield, Multiparameter Persistence (MP), an effective extension
of PH. Again, we outline how to tailor the method to specific data formats for multifiltrations, i.e., point clouds
(Section 4.1.2), images (Section 4.1.3), and networks (Section 4.1.1). Next, we outline the state-of-the-art methods on
how to integrate MP information in ML pipelines (Section 4.2).
In Section 5, we begin with a friendly introduction to the Mapper method (Section 5.1), an effective TML tool for
unsupervised learning. We then cover hyperparameter selection for Mapper in Section 5.1.1, which is key for applications.
Although the original Mapper algorithm is intended for point clouds, Section 5.2 explores recent advancements that
extend its application to images and networks. Lastly, we provide an overview of available software libraries for Mapper
in Section 5.4.
In Section 6, we summarize five real-life applications of these methods from published works, namely shape
recognition for point clouds (Section 6.1), anomaly forecasting for transaction networks (Section 6.2), cancer detection
from histopathological images (Section 6.3), computer-aided drug discovery (Section 6.4), and cancer genotyping from
RNA sequencing (Section 6.5).
Finally, in Section 7, we outline potential future directions to advance TML methods, aiming to improve their practical
use in ML and discuss strategies for broadening the application of topological methods to new and emerging fields.


2 BACKGROUND
In this section, we provide the background needed for the rest of the paper. We first introduce several core mathematical
concepts and then describe homology, which is essential for introducing the methods in subsequent sections.

2.1 A Crash Course on Topology


Topology, a core discipline in mathematics, studies the properties of shapes and spaces that remain unchanged under
continuous transformations such as stretching, crumpling, and bending, but not tearing or gluing. In machine learning,
topology provides a powerful framework for examining complex data structures flexibly and intuitively. To provide
clarity, we begin by defining several key terms essential for following the paper.

2.1.1 Topological Space. From a mathematical perspective, topology refers to the structure of a set. For instance, consider
the set of real numbers between 0 and 1, i.e., X = {𝑥 ∈ R | 0 ≤ 𝑥 ≤ 1}. Many readers might immediately consider the
closed interval [0, 1], whose shape resembles a stick. However, X is merely a set with no defined structure yet. This
means we do not know which points in X are close to or far from others, as there is no concept of neighborhoods.
For example, if we define a distance (metric) on X such as 𝑑 1 (𝑥, 𝑦) = 1 for 𝑥 ≠ 𝑦 and 𝑑 1 (𝑥, 𝑦) = 0 for 𝑥 = 𝑦, the "shape"
of the set X would be entirely different, resembling an infinite number of points dispersed in a very high-dimensional
space. Conversely, if we use a metric like 𝑑 2 (𝑥, 𝑦) = |𝑥 − 𝑦| on X, we retrieve our familiar closed interval [0, 1], a stick
of length 1. Notice that due to the different metrics used, the neighborhood structures and shapes of (X, 𝑑 1 ) and (X, 𝑑 2 )
are completely different. This brings us to the first principle in Topology: the topology of a set is defined as the complete
neighborhood information on the set.
The principle is not without exceptions: a significant research area called point-set topology studies topological
spaces that do not necessarily have a metric [45, 68]. However, these topologies are beyond our scope. Thus, throughout
the paper, a topological space refers to a set X (or dataset) equipped with a distance 𝑑 (·, ·), forming what is known as a
metric space (X, 𝑑). In other words, a set qualifies as a topological space if we can unambiguously identify the neighbors
of its points.
Although we might not explicitly reference distance, most datasets inherently possess some form of a metric for
detailed analysis. For instance, point clouds use the metric of the space they are embedded in, providing neighboring
information. In graphs, adjacent nodes are naturally considered neighbors. Similarly, in images, neighboring pixels
are deemed neighbors. This paper treats these examples as topological spaces with their natural topologies. However,
specific datasets, such as RNA-sequencing data, lack an inherent metric, requiring users to define a metric to determine
which data points are considered close and distant, depending on the context.

Simplicial Complexes. One special family of topological spaces used throughout our paper will be simplicial complexes.
A simplicial complex is a mathematical structure used in topology and combinatorics to study the properties of shapes
and spaces in a discrete manner. It consists of vertices, edges, and higher-dimensional simplices such as triangles,
tetrahedra, and their higher-dimensional counterparts, which are glued together in a specific way to form a coherent
whole.
Each simplex is a generalization of the concept of a triangle to higher dimensions. By this, we mean that the properties
and structure of a triangle (such as having vertices, edges, and faces) are extended into higher dimensions, even though
the shapes look different as we go up in dimensions. In particular, a 𝑘-simplex is defined as the convex hull of 𝑘 + 1 affinely
independent points in R𝑘. For example, a 2-simplex is a triangle (with its inside filled), e.g., the convex hull of
(smallest convex set containing) the three points {(0, 0), (1, 0), (0, 1)} in R2. Similarly, a 3-simplex is a tetrahedron (with
its inside filled), e.g., the convex hull of {(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)} in R3. A union of simplices is called a simplicial
complex if any two simplices in the complex either do not intersect or intersect in a complete subsimplex (See Figure 1).
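To make this concrete, the short snippet below builds a small simplicial complex with the SimplexTree data structure of the Gudhi library. This is a minimal sketch we add for illustration, assuming Gudhi is installed; it is not taken from the tutorial's repository. It builds the complex consisting of a filled triangle [𝑣0, 𝑣1, 𝑣2] with an extra edge [𝑣2, 𝑣3] attached at the vertex 𝑣2.

import gudhi

# Build a small simplicial complex: the filled triangle [0, 1, 2] plus an
# extra edge [2, 3] attached at vertex 2 (a valid simplicial complex).
st = gudhi.SimplexTree()
st.insert([0, 1, 2])   # inserting a 2-simplex also inserts all of its faces
st.insert([2, 3])      # an additional 1-simplex (edge)

# List every simplex in the complex (each with its filtration value, 0.0 here).
for simplex, filtration_value in st.get_simplices():
    print(simplex, filtration_value)
print("dimension:", st.dimension(), "| simplices:", st.num_simplices())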

Fig. 1. Simplicial Complexes. Among the complexes, only b and f fail to be simplicial complexes, as their simplices do not intersect at complete subsimplices. All others are valid simplicial complexes.

Dimension in Topology. One of the most confusing concepts for non-experts in topology is the notion of dimension. To study topological spaces more effectively, we focus on a specific family of spaces that meet certain regularity conditions, known as manifolds. A 𝑘-dimensional manifold is a topological space that locally resembles a ball in R𝑘, i.e., a small neighborhood of any point looks like (is homeomorphic to) a ball in R𝑘. Importantly, we do not concern ourselves with the ambient (i.e., surrounding) space in which our manifold resides; we focus solely on its intrinsic properties, e.g., the material the manifold is made of. For example, although a circle is often visualized in R2 as a two-dimensional object, from a topological perspective, it is actually one-dimensional. This is because, at any point on the circle, its local neighborhood resembles a line segment, i.e., it is made of 1-dimensional material. Consider an ant walking on a circle; it has two options: moving forward or backward. While we might need two coordinates to describe the ant’s position on the circle mathematically, from the ant’s perspective, it only experiences a continuous path where it can move in either direction. To the ant, the circle feels like an endless line. Similarly, a sphere is a 2-dimensional manifold (surface) even though it is often visualized in R3. This is because a small neighborhood of any point on the sphere resembles a piece of R2. Notice that with this definition, the dimension of the ambient space becomes irrelevant; only the topological properties define the dimension of a manifold, i.e., a loop (circle) in R2 or R3 is called 1-dimensional. Similarly, a torus (a hollow donut) and a genus-2 surface (a surface with two holes) are examples of 2-dimensional manifolds. If this discussion leads you to wonder about manifolds with more than two dimensions, we have some discouraging news: they do exist, but visualizing them is not intuitive.

Boundary. In topology, understanding the boundary of a manifold is crucial to grasping its structure. A 𝑘-dimensional
manifold 𝑀 is a space where each point has a neighborhood that resembles an open subset of R𝑘 . The boundary 𝜕𝑀 of
𝑀 comprises points where these neighborhoods resemble an open subset of the half-space R𝑘+ = {𝑥 ∈ R𝑘 | 𝑥𝑘 ≥ 0}.
This implies that the local structure is similar to part of R𝑘+ near each boundary point.
To illustrate, consider a sphere, a 2-dimensional surface without any boundaries. An ant walking on the sphere can
indefinitely move in any direction without encountering an edge. In contrast, the unit disk D in R2 is a 2-dimensional
surface with a boundary, specifically the unit circle C. An ant residing in D could not continue its journey once it
reaches the boundary C, or the "border" of D. This is denoted as 𝜕D = C. Similarly, the unit ball B in R3 (the solid
ball) is a 3-dimensional manifold with a boundary, and its boundary is the unit sphere S, denoted by 𝜕B = S. In this
context, we say that the sphere S bounds the ball B.

If you’re puzzled by the statement that a sphere has no boundary, whereas the solid ball does, consider it from the
perspective of "its inhabitants." An ant living on the sphere’s surface can move freely in any direction without ever
encountering a limit (no border), which is why we say the sphere has no boundary. However, a mouse living inside the
ball will eventually hit the boundary (border)—the sphere’s surface—beyond which it cannot go. That limiting surface is
what we call the solid ball’s boundary.
A key principle in topology is that a boundary’s boundary is always empty. Mathematically, this is expressed as
𝜕(𝜕X) = ∅. Furthermore, the boundary of a 𝑘-dimensional manifold is generally a (𝑘 − 1)-dimensional manifold with
no boundary. For example, the boundary of a disk D (a 2-dimensional manifold with a boundary) is a circle C (a
1-dimensional manifold without a boundary). This observation will soon be important when discussing homology.

2.1.2 Topological Equivalence. Topology primarily focuses on the global properties of shapes rather than their local
characteristics. For example, topology is concerned with whether a space is connected or has holes without regard to
the size of the object or the holes.
Topological equivalence can be intuitively understood as follows: two shapes, X and Y are topologically equivalent if
one can be continuously deformed into the other. Imagine X and Y are made of Play-Doh. If you can reshape X into Y
without tearing, gluing, or collapsing any part of the shape, then they are considered to be continuously deformable into
each other. In this context, "collapsing" refers to reducing a part of the shape to a lower dimension, such as squishing a
surface or line down to a point or flattening a 3-dimensional object into a 2-dimensional plane.
In mathematical terms, this concept corresponds to a homeomorphism (the prefix "homeo-" comes from the Greek
word "homoios," which means "similar" or "like"). In particular, such a deformation from X to Y represents a bijective
map, which means it creates a one-to-one correspondence between elements of X and Y, ensuring that each element
in X is paired with a unique element in Y and vice versa. The map 𝜑 : X → Y keeps track of how each point 𝑥 ∈ X
becomes a point 𝑦 ∈ Y after the deformation, i.e., 𝜑 (𝑥) = 𝑦. The condition of no tearing ensures that this map is
continuous as nearby points in X must map to nearby points in Y. Gluing is the opposite of tearing, so we also require
that the inverse map 𝜑 −1 is continuous. In summary, 𝜑 : X → Y is called a homeomorphism if 𝜑 is a continuous
bijection with a continuous inverse.
We previously noted that size does not affect the topological structure. For example, let X = R, the set of all real
numbers, and Y = (−1, 1), the open interval containing all real numbers between −1 and 1, excluding the endpoints.
These two spaces are topologically equivalent via the homeomorphism 𝜑 : R → (−1, 1) with 𝜑 (𝑥) = 𝑥/(1 + |𝑥 |), whose
inverse 𝜑 −1 (𝑦) = 𝑦/(1 − |𝑦|) is also continuous. However, R
has an infinite diameter, while (−1, 1) has a diameter of only 2. Hence, this is a good example of how the size of a space does
not matter in topology. On the other hand, consider X = (0, 1) ∪ (1, 2) and Y = (0, 2). While both spaces are similar as
sets, they are not topologically equivalent because X consists of two separate connected components, whereas Y is a
single connected piece. A continuous deformation (homeomorphism) must preserve the number of components.

Homotopy. Another important concept in topology is homotopy, which can be seen as a flexible deformation of
one space into another. Unlike homeomorphism, homotopy allows for collapsing but not tearing or gluing. We call
two spaces homotopic to each other if such a deformation exists from one to another. For instance, a disk and a
point are homotopic because the entire disk can be collapsed into a point by pushing inwards from all directions
towards the center. Similarly, a punctured disk (D2 − {(0, 0)}) and a circle are homotopic, as a punctured disk can be
continuously deformed towards its boundary, starting from the puncture point (0, 0). Homotopy provides a powerful
tool for classifying topological spaces and understanding their fundamental properties, as most topological invariants
are homotopy invariants, meaning they yield the same output for two homotopic spaces. For example, the connectivity
and the count of holes or cavities do not change under homotopy.

2.1.3 Topological Invariant. To show that two spaces X and Y are topologically equivalent, exhibiting a homeomorphism
𝜑 : X → Y is sufficient. However, if they are topologically different, one must demonstrate that no such map
can exist, which is highly challenging. Mathematicians use invariants to show such inequivalence. An invariant refers
to a property or feature of an object that remains unchanged under certain transformations. A topological invariant is
an invariant that is preserved under homeomorphism. A topological invariant can be a number (number of components,
Euler characteristics, Betti numbers), a mathematical object (homology groups, fundamental group, cohomology ring),
or a mathematical property (compactness, connectedness).
In the example above, X = (0, 1) ∪ (1, 2) versus Y = (0, 2), we used the number of components as a topological invariant to show they are not equivalent. However, more subtle examples require more sophisticated invariants. For instance, a cube and a sphere are topologically equivalent because one can deform a cube into a sphere without tearing or gluing. In contrast, comparing a sphere and a torus (the surface of a donut, as shown in Figure 2c) is more complex. They are not topologically equivalent because the torus has "holes." By holes, we mean the loops on the surface which cannot be shrunk down to a point without leaving the surface. From this perspective, a sphere has no hole; however, on a torus, some loops (meridian and longitude) can’t be shrunk down to a point in the torus, representing the "holes" in the torus (see Figure 3). Despite this difference, proving that a sphere and a torus are not homeomorphic is challenging. Therefore, subtle topological invariants that detect these holes are crucial for distinguishing such spaces.

Fig. 2. The sphere and the cube are topologically equivalent, whereas the torus is different from both. (a) Sphere (b) Cube (c) Torus.

Euler Characteristic. One important and well-known example of such topological invariants is the Euler characteristic.
If C is a simplicial complex, we define the Euler characteristic as the alternating sum of the counts of 𝑘-simplices in C,
i.e., 𝜒 (C) = Σ𝑘 (−1)ᵏ 𝑛𝑘, where 𝑛𝑘 is the count of 𝑘-simplices in C. It turns out the Euler characteristic is a topological
invariant, i.e., if C and D are homeomorphic, then 𝜒 (C) = 𝜒 (D). The Euler characteristic is, in fact, a homotopy
invariant as well (see [51]). For example, a sphere S is topologically equivalent to the surface of a hollow tetrahedron F,
which has 4 vertices (𝑛0), 6 edges (𝑛1), and 4 triangles (𝑛2). Therefore, we have 𝜒 (S) = 𝜒 (F) = 𝑛0 − 𝑛1 + 𝑛2 = 4 − 6 + 4 = 2.
Similarly, if one computes the Euler characteristic of a torus T through a simplicial complex, we find that 𝜒 (T) = 0.
Since 𝜒 (S) ≠ 𝜒 (T), it follows that a sphere is not homeomorphic to a torus.
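As a quick sanity check, the Euler characteristic is straightforward to compute once the simplex counts are known. The following minimal Python sketch (our illustration, not code from the tutorial's repository) reproduces the computation above.

# Euler characteristic from simplex counts: chi = sum_k (-1)^k * n_k.
def euler_characteristic(simplex_counts):
    """simplex_counts[k] is n_k, the number of k-simplices."""
    return sum((-1) ** k * n_k for k, n_k in enumerate(simplex_counts))

# Hollow tetrahedron F: 4 vertices, 6 edges, 4 triangles -> chi = 4 - 6 + 4 = 2.
print(euler_characteristic([4, 6, 4]))   # 2, matching chi(S) for the sphere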

2.1.4 Geometry vs. Topology. Before proceeding, let’s clarify a common confusion between the terms topology and
geometry. Both are branches of mathematics, but they focus on different aspects of metric spaces.
Geometry focuses on the local study of shapes, sizes, and properties of space, as well as how these spaces embed (fit
into) higher-dimensional spaces. For example, a 2-dimensional sphere S can be embedded into R3 (or R10 ) in various
shapes and sizes. Geometry focuses on precise measurements (e.g., angles, distances, curvature) and the relationships
between points, lines, surfaces, and solids. It explores how shapes bend and twist, providing tools to understand complex
surfaces and spaces through concepts such as curvature and geodesics (i.e., shortest paths).

In contrast, topology examines the global properties of space that remain unchanged under continuous deformations.
It focuses on qualitative aspects like connectedness and continuity rather than precise measurements. For instance,
topology considers a cube and a sphere as topologically equivalent because these shapes can be transformed into one
another through continuous deformation despite their geometric differences—one being flat and the other curved. For
instance, no matter how we embed a 2-sphere into R10 , its topological properties remain unchanged. However, the
geometry varies significantly: a round sphere with a radius of 1 differs greatly from one with a radius of 100. Moreover,
an irregular sphere would be geometrically distinct from both.
In data science, geometry usually refers to the local shape characteristics of a dataset, such as distances, curvature,
and angles, whereas topology pertains to global characteristics, such as connectedness and the number of holes/cavities.
For example, dimension reduction methods like PCA [96] and UMAP [73] are considered geometric methods as they
heavily depend on the distances and how the dataset sits in the high dimensional space. In contrast, methods that count
the number of components or holes in the space are called topological methods.

2.2 Homology
To distinguish topological spaces, the most common method is to use topological invariants such as Euler charac-
teristics (Section 2.1.3), the fundamental group [9], or homology. Among these, homology is the most versatile and
robust invariant that applies to a wide range of spaces such as surfaces (e.g., spheres, tori), simplicial complexes (e.g.,
triangulated shapes), and manifolds (e.g., higher-dimensional analogs of curves and surfaces).
There are various ways to compute homology (cellular [51], simplicial, Morse [72]); the outputs are the
same, but the computation methods applied are different. To utilize computational tools effectively, it is preferable
to use discrete representations of topological spaces, like simplicial complexes. Simplicial homology is par-
ticularly suited for TDA because it deals with simplicial complexes, which are sets of vertices, edges, triangles,
and their higher-dimensional counterparts. Considering these aspects, TDA mostly employs simplicial homology
to capture the topological patterns in data, although Morse or cellular homology is used in specific applications
within TDA, such as cellular homology in image classification [61, 80].
Here, we provide a brief overview (TLDR) of homology, followed by a formal yet accessible introduction. For a
detailed, friendly introduction to simplicial homology, refer to [9, Chapter 8], or for an in-depth study, see [51, 77].

2.2.1 TLDR. Homology is a fundamental invariant in topology that captures information about a space’s structure by
examining its holes/cavities of various dimensions. The focus on holes/cavities may surprise the reader, but holes are
preferred because they are fundamental features that significantly influence the structure and properties of a space, as
we outline below. To simplify the exposition, we use the concept of a 𝑘-hole in a topological space X. Although this term
slightly abuses notation, it refers to a 𝑘-dimensional submanifold S in X that cannot be continuously deformed into a
point within the space. In reality, this 𝑘-hole corresponds to a (𝑘 + 1)-dimensional "cavity" Ω in X. Since this cavity
represents a "missing" region within the space, we describe it by using its boundary 𝜕Ω = S, which is non-contractible
in X. The 𝑘 𝑡ℎ homology group H𝑘 (X) captures these 𝑘-holes, or 𝑘-dimensional manifolds in X, that do not bound any
(𝑘 + 1)-dimensional region in the space.
• 𝑘-holes are topological invariants, meaning they remain unchanged under continuous deformations, such as
stretching or bending, that do not involve tearing or gluing. And homology can detect them.
• 0-holes represent the connected components of a space. The dimension (or rank) of H0 (X) corresponds to the
number of these components in X.

• 1-holes correspond to loops in the space that cannot be contracted to a point without leaving the space. The
dimension of H1 (X) represents the number of such loops (1-holes) in the space.
• 2-holes correspond to cavities within the space, which can be considered hollow regions enclosed by surfaces
(e.g., the interior of a sphere or torus). The dimension of H2 (X) represents the number of such cavities (2-holes)
in the space.

To make the concept of homology groups more accessible, we will offer a simplified and visual explanation. We denote homology groups by H𝑘 (X), where X represents the space under consideration, and 𝑘 indicates the dimension being analyzed. Independent 𝑘-holes are generators for the homology group H𝑘 (X). Therefore, the rank of H𝑘 (X) indicates the number of 𝑘-dimensional “holes” present in the space X. These numbers are also called Betti numbers {𝛽𝑘 (X)}, e.g., if H2 (X) = Z^3, then we say rank(H2 (X)) = 3, or 𝛽2 (X) = 3, meaning X has three 2-holes. Below, we give some examples for dimensions 0, 1, and 2 (see Figure 3).

• H0 (X): 0-dimensional homology represents the connected components in X. If X has three components, then we say H0 (X) = Z^3 (or Z2^3 if Z2-coefficients are used), and rank(H0 (X)) = 𝛽0 (X) = 3.

• H1 (X): 1-dimensional homology computes the non-contractible loops (holes) in X. For example, a sphere S has no nontrivial loops, as any loop in the sphere bounds a disk in the sphere, i.e., 𝛽1 (S) = 0. A torus T (hollow donut), on the other hand, has meridian and longitude circles (see Figure 3), both being non-contractible loops, making 𝛽1 (T) = 2. On the other hand, a solid donut D would have only the longitude circle as noncontractible, so 𝛽1 (D) = 1.

• H2 (X): The 2-dimensional homology group captures two-dimensional cavities (voids) in the space X. To count these cavities, we utilize surfaces in the space which do not bound a 3-dimensional domain in the space. For example, in the unit ball B in R3 (the solid ball), any closed surface bounds a 3-dimensional domain within B, and there are no cavities within it. Hence, we say rank(H2 (B)) = 𝛽2 (B) = 0. Similarly, if one removes two disjoint smaller balls from B, say B′, we would have two different cavities, which can be represented by different spheres in X, enclosing these balls. Then, we say 𝛽2 (B′) = 2. In Figure 3, the sphere, the torus, and the genus-2 surface each have one 2-dimensional cavity, represented by themselves.

Fig. 3. Toy examples for homology. We present the ranks of the homology groups of various topological spaces; i.e., for H0 (X) = Z^𝑘 we write only H0 = 𝑘 for simplicity, representing the total count of the 𝑘-dimensional holes in X.
We conclude this section with an interesting fact. While we defined the Euler characteristic for simplicial complexes
as the alternating sum of the number of 𝑘-simplices, there is an alternative way using Betti numbers. In particular, if Y
is a topological space, the Euler characteristic is given by 𝜒 (Y) = Σ𝑘 (−1)ᵏ 𝛽𝑘 (Y).


2.2.2 Computation of Homology. We now turn to a formal introduction. Homology is a mathematical operation whose
inputs are topological spaces and outputs are groups. In particular, for a given space X, H𝑘 (X) represents the 𝑘 𝑡ℎ
homology group of X, summarizing the 𝑘-dimensional non-collapsible submanifolds in X, each representing different
(𝑘 + 1)-dimensional "cavity" of X. We will call these "𝑘-holes" by abusing notation.
The concept of homology groups stems from the idea that a (𝑘 + 1)-dimensional hole or cavity in X is detected by the
presence of its 𝑘-dimensional boundary, which cannot be continuously contracted within X. While this method might
seem indirect—using boundaries to infer the existence of cavities—the difficulty arises from the need to identify what is
missing in X. Here, a fundamental principle of topology comes into play: the boundary of a boundary is always empty. If
Ω is a (𝑘 + 1)-dimensional domain in X with boundary S = 𝜕Ω, then the boundary of S is empty, i.e., 𝜕(𝜕Ω) = 𝜕S = ∅.
Hence, to identify cavities, we first find all 𝑘-dimensional submanifolds with no boundary in X (𝑘-cycles). Then, by
eliminating those that bound a domain in X (𝑘-boundaries), we are left with the true cavities. This process forms the
core idea behind homology computation.
To keep the exposition focused, we will describe only simplicial homology with Z2 coefficients, the most common
version used in TDA. For other homology settings, refer to [51]. In particular, simplicial homology involves representing
a given space X as a simplicial complex (a collection of simplices) and performing computations on these simplices.
This approach allows us to discretize the problem by focusing on the 𝑘-dimensional "building blocks" of the topological
space, such as vertices for 0-dimension and edges for 1-dimension. By considering 𝑘-submanifolds (or 𝑘-subcomplexes)
as unions of 𝑘-simplices, we can identify the special ones that correspond to true cavities in X by using computational
tools. To formalize this concept, we now introduce the relevant mathematical notions.

i. Representation of 𝑘-simplices. In a simplicial complex X, we describe 𝑘-simplices by listing their vertices. For
example, a 1-simplex (an edge) 𝑒 with endpoints 𝑣 2 and 𝑣 4 is denoted as 𝑒 = [𝑣 2, 𝑣 4 ]. Similarly, a 2-simplex (a triangle) 𝜏
with vertices 𝑣 1 , 𝑣 5 , and 𝑣 7 is written as 𝜏 = [𝑣 1, 𝑣 5, 𝑣 7 ]. Since we are using Z2 coefficients, the order of the vertices does
not matter. However, in other versions, such as with Z-coefficients, the order of the vertices would be significant. In
general, a 𝑘-simplex Δ is represented by its 𝑘 + 1 vertices, i.e., Δ = [𝑣𝑖 0 , 𝑣𝑖 1 , . . . , 𝑣𝑖𝑘 ]. We will call a union of 𝑘-simplices
a 𝑘-subcomplex of X.

ii. 𝑘-chains C𝑘 (X). To describe all 𝑘-dimensional subcomplexes (which correspond to all 𝑘-submanifolds, with
or without boundary) within a simplicial complex X, we define a group C𝑘 (X), known as the group of 𝑘-chains.
Recall that any union of 𝑘-simplices forms a 𝑘-subcomplex. For instance, if the simplicial complex X consists of three
edges {𝑒 1, 𝑒 2, 𝑒 3 }, then all possible 1-subcomplexes are {𝑒 1, 𝑒 2, 𝑒 3, 𝑒 1 ∪ 𝑒 2, 𝑒 1 ∪ 𝑒 3, 𝑒 2 ∪ 𝑒 3, 𝑒 1 ∪ 𝑒 2 ∪ 𝑒 3 }. We represent
this collection using group elements, where each 1-simplex acts as a generator. In particular, the union 𝑒 1 ∪ 𝑒 3 is
represented by 𝜎1 = (1, 0, 1), while 𝑒1 ∪ 𝑒2 is represented by 𝜎2 = (1, 1, 0). Hence, we obtain the group C1 (X) = Z2^3,
where the 0 element corresponds to the empty set. The group operation is addition, and since 1 + 1 = 0 in Z2, we
have 𝜎1 + 𝜎2 = (1, 0, 1) + (1, 1, 0) = (0, 1, 1), corresponding to the union 𝑒2 ∪ 𝑒3. Similarly, if X contains 𝑚 𝑘-simplices
{Δ1, Δ2, . . . , Δ𝑚}, then C𝑘 (X) = Z2^𝑚, where a group element like (1, 0, 1, . . . , 1) represents the 𝑘-subcomplex Δ1 ∪ Δ3 ∪ · · · ∪ Δ𝑚
within X.
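As a tiny illustration (a sketch we add for convenience, not from the paper's repository), these Z2 chain groups can be handled directly as binary vectors, with addition acting as coordinate-wise XOR:

import numpy as np

# 1-chains of the three-edge complex {e1, e2, e3}, encoded as binary vectors.
sigma1 = np.array([1, 0, 1], dtype=np.uint8)   # the subcomplex e1 U e3
sigma2 = np.array([1, 1, 0], dtype=np.uint8)   # the subcomplex e1 U e2
print((sigma1 + sigma2) % 2)                   # [0 1 1], i.e., e2 U e3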

iii. Boundary operator 𝜕𝑘 . To identify 𝑘-holes in a simplicial complex, we must first identify 𝑘-subcomplexes that have
no boundary. This is accomplished by defining a boundary operator 𝜕𝑘 : C𝑘 (X) → C𝑘 −1 (X), which maps 𝑘-chains
to their (𝑘 − 1)-dimensional boundaries. In particular, 𝜕𝑘 maps each 𝑘-simplex to its boundary. For example, if we
have a 1-simplex (edge) 𝑒 = [𝑣 0, 𝑣 1 ], then 𝜕1 [𝑒] = 𝑣 0 + 𝑣 1 ∈ C0 (X), representing the boundary of the edge as the union
of its end vertices {𝑣 0 } ∪ {𝑣 1 }. Similarly, for a 2-simplex (triangle) 𝜏 = [𝑣 0, 𝑣 1, 𝑣 2 ], the boundary is given by 𝜕2𝜏 =
[𝑣 0, 𝑣 1 ] + [𝑣 1, 𝑣 2 ] + [𝑣 2, 𝑣 0 ] ∈ C1 (X), representing the boundary of the triangle as the union [𝑣 0, 𝑣 1 ] ∪ [𝑣 1, 𝑣 2 ] ∪ [𝑣 2, 𝑣 0 ]. To
define 𝜕𝑘 for general 𝑘-chains, we sum the boundaries of each 𝑘-simplex within the chain. For instance, if 𝜎 = Δ1 +Δ3 +Δ7
is a 𝑘-chain, then 𝜕𝑘 𝜎 = 𝜕𝑘 Δ1 + 𝜕𝑘 Δ3 + 𝜕𝑘 Δ7 . A 𝑘-chain 𝜎 is said to have no boundary if 𝜕𝑘 𝜎 = 0. In other words, a
𝑘-subcomplex with no boundary in X must map to 0 under the boundary operator.
The boundary operator 𝜕𝑘 : C𝑘 (X) → C𝑘−1 (X) is a linear operator and can be represented as a matrix. For example,
if C𝑘 (X) = Z2^𝑛 and C𝑘−1 (X) = Z2^𝑚, then 𝜕𝑘 can be written as an 𝑚 × 𝑛 matrix A, where each column A𝑖 corresponds to
𝜕𝑘 Δ𝑖, with Δ𝑖 the 𝑖-th 𝑘-simplex in X. In Figure 5, we give an explicit example of a matrix representation of a boundary
operator for the simplicial complex in Figure 4.

iv. 𝑘-cycles Z𝑘 (X). We define a special subgroup Z𝑘 (X), named 𝑘-cycles, within C𝑘 (X) for 𝑘-subcomplexes that have no boundary. As previously mentioned, a 𝑘-subcomplex with no boundary in X must map to zero under the boundary operator. Hence, we define the subgroup Z𝑘 (X) = ker 𝜕𝑘, which includes all 𝑘-chains that map to zero. Recall that 1-chains with no boundary correspond to loops, while 2-chains with no boundary correspond to closed surfaces, like a sphere or torus.

For example, consider a square-shaped simplicial complex X (Figure 4) with four vertices {𝑣0, 𝑣1, 𝑣2, 𝑣3} and four edges {𝑒1, 𝑒2, 𝑒3, 𝑒4}, where 𝑒1 = [𝑣0, 𝑣1], 𝑒2 = [𝑣1, 𝑣2], 𝑒3 = [𝑣2, 𝑣3], and 𝑒4 = [𝑣3, 𝑣0]. In this case, C1 (X) = Z2^4 and C0 (X) = Z2^4. Suppose 𝜎1 = 𝑒1 + 𝑒2; then 𝜕1 𝜎1 = (𝑣0 + 𝑣1) + (𝑣1 + 𝑣2) = 𝑣0 + 𝑣2, meaning that 𝜎1 represents a 1-subcomplex with a boundary. However, if 𝜎2 = 𝑒1 + 𝑒2 + 𝑒3 + 𝑒4, then 𝜕1 𝜎2 = (𝑣0 + 𝑣1) + (𝑣1 + 𝑣2) + (𝑣2 + 𝑣3) + (𝑣3 + 𝑣0) = 2(𝑣0 + 𝑣1 + 𝑣2 + 𝑣3) = 0, indicating that 𝜎2 represents a 1-subcomplex with no boundary. Notice that 𝜎2 corresponds to a complete loop in X. See Example 1 for more details.

Fig. 4. Toy example for homology.

We can also describe the boundary map 𝜕1 : C1 (X) → C0 (X) as a 4 × 4 matrix, as shown in Figure 5. It is clear that the only element in C1 (X) that maps to (0, 0, 0, 0) ∈ C0 (X) is (1, 1, 1, 1), which corresponds to the loop 𝜎2. This means X has only one 1-cycle, which is 𝜎2. Then, Z1 (X) is the subgroup of C1 (X) generated by the single element (1, 1, 1, 1). If you are unfamiliar with group theory, you can think of C1 (X) as a four-dimensional vector space, where Z1 (X) is a one-dimensional subspace generated by the vector (1, 1, 1, 1).

Fig. 5. 𝜕1 : C1 (X) → C0 (X) is represented as a 4 × 4 binary matrix. Columns represent the edges in C1 (X), and rows correspond to the vertices in C0 (X). For example, 𝜕𝑒3 = 𝑣2 + 𝑣3 can be read from the third column:

            𝜕𝑒1  𝜕𝑒2  𝜕𝑒3  𝜕𝑒4
    𝑣0  [    1    0    0    1  ]
    𝑣1  [    1    1    0    0  ]
    𝑣2  [    0    1    1    0  ]
    𝑣3  [    0    0    1    1  ]

v. 𝑘-boundaries B𝑘 (X). While we identified all 𝑘-subcomplexes with no boundary, not all of them correspond to true cavities. We must eliminate the ones that bound a domain in X. To find them, we use the boundary operator again. As C𝑘+1 (X) represents all (𝑘 + 1)-subcomplexes in X, the image of 𝜕𝑘+1, i.e., B𝑘 (X) = 𝜕𝑘+1 C𝑘+1 (X) ⊂ C𝑘 (X), represents the ones which bound a (𝑘 + 1)-domain in X. We call an element in B𝑘 (X) a 𝑘-boundary. Recall that 𝜕𝑘 (𝜕𝑘+1 𝜎) = 0 from the earlier discussion. This means for any 𝑘-boundary 𝜑 = 𝜕𝑘+1 𝜎 in B𝑘 (X), 𝜕𝑘 𝜑 = 0. Therefore, we have B𝑘 (X) ⊂ Z𝑘 (X). For example, if X is a simplicial complex formed by only one 2-simplex 𝜏 with vertices {𝑣0, 𝑣1, 𝑣2}, then B1 (X) would be the subgroup of C1 (X) generated by 𝜕2 𝜏 = 𝜎 = (1, 1, 1), while Z1 (X) would be the same group, i.e., Z1 (X) = B1 (X). This means there is only one loop 𝜎 in X, but it does not represent a hole in X, as it is filled by 𝜏, i.e., 𝜎 = 𝜕𝜏.

vi. Homology group H𝑘 (X). Now we are ready to define homology. Notice that with the boundary operator, we
obtained the following sequence of groups and maps. We can consider these as vector spaces and linear maps. For each
𝑘, the 𝑘-chains C𝑘 (X) represent all 𝑘-subcomplexes in X, the 𝑘-cycles Z𝑘 (X) ⊂ C𝑘 (X) represent the 𝑘-subcomplexes with no
boundary, and finally, B𝑘 (X) ⊂ Z𝑘 (X) represents the 𝑘-subcomplexes which bound a (𝑘 + 1)-domain in X.

· · · → C𝑘+1 (X) → C𝑘 (X) → C𝑘−1 (X) → · · · → C0 (X) → 0, where the arrows are the boundary maps 𝜕𝑘+2, 𝜕𝑘+1, 𝜕𝑘, 𝜕𝑘−1, . . . , 𝜕1.
Now, to identify 𝑘-holes (true cavities), we consider all 𝑘-cycles in X (Z𝑘 (X)), which potentially represent a cavity
of X, and, among them, eliminate all 𝑘-boundaries in X (B𝑘 (X)), as they represent the "fake" cavities. Hence, we define
the 𝑘-th homology group as the quotient group

H𝑘 (X) = Z𝑘 (X)/B𝑘 (X) = ker(𝜕𝑘 )/image(𝜕𝑘+1 )

In terms of the sequence above, this quotient H𝑘 (X) = Z𝑘 (X)/B𝑘 (X) effectively counts the 𝑘-dimensional cycles that
correspond to actual holes or cavities in X, not those that are merely boundaries of higher-dimensional regions. From a
computational perspective, with this formulation, we only need to compute the kernels and images of a sequence of
linear maps {𝜕𝑘 } (binary matrices) to compute homology.
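In practice (a standard linear-algebra restatement we add here for convenience), since we use Z2 coefficients, each C𝑘 (X) is a vector space over the field Z2, and the Betti numbers can be read off directly from matrix ranks:

𝛽𝑘 (X) = dim Z𝑘 (X) − dim B𝑘 (X) = (𝑛𝑘 − rank 𝜕𝑘) − rank 𝜕𝑘+1,

where 𝑛𝑘 denotes the number of 𝑘-simplices in X.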
To clarify these notions, we give two toy examples for explicit computation of homology.

Example 1. Consider the example in Figure 4, where X is a square-shaped simplicial complex with four vertices {𝑣0, 𝑣1, 𝑣2, 𝑣3} and four edges. For 𝑘 ≥ 2, C𝑘 (X) = {0} since there are no 𝑘-simplices in X. Both C1 (X) and C0 (X) are isomorphic to Z2^4, as discussed earlier (𝑘-cycles above). The group Z1 (X) has only one generator, the sum of all edges. Since C2 (X) = {0}, we have B1 (X) = {0} because B1 (X) = 𝜕2 C2 (X). Therefore, H1 (X) = Z1 (X)/B1 (X) = Z2 /{0} = Z2, meaning that H1 (X) has rank 1, corresponding to the entire square loop. The boundary map 𝜕1 is given in Figure 5.

For H0 (X), we need to determine Z0 (X) and B0 (X). Since 𝜕0 sends everything to 0, Z0 (X) = C0 (X) = Z2^4. The group B0 (X) = 𝜕1 C1 (X) is generated by the boundaries of the edges: (𝑣0 + 𝑣1), (𝑣1 + 𝑣2), (𝑣2 + 𝑣3), (𝑣3 + 𝑣0). However, these boundaries are not linearly independent because 𝑣3 + 𝑣0 is the sum of the other three boundaries. Thus, B0 (X) ≃ Z2^3. Consequently, H0 (X) = Z0 (X)/B0 (X) = Z2^4 / Z2^3 = Z2. Recall that the rank of H0 (X) represents the number of connected components, and the fact that rank(H0 (X)) = 1 confirms that there is only one connected component.

The second example is one dimension higher.

Fig. 6. Toy example 2 for homology.

Example 2. Consider the hollow tetrahedron X of Figure 6, composed of four triangular faces: 𝜏1 = [𝑣0, 𝑣1, 𝑣2], 𝜏2 = [𝑣1, 𝑣2, 𝑣3], 𝜏3 = [𝑣0, 𝑣1, 𝑣3], and 𝜏4 = [𝑣0, 𝑣2, 𝑣3]. The complex X contains four 2-simplices (triangles), six 1-simplices (edges), and four 0-simplices (vertices). Let 𝑒1 = [𝑣0, 𝑣1], 𝑒2 = [𝑣1, 𝑣2], 𝑒3 = [𝑣2, 𝑣3], 𝑒4 = [𝑣3, 𝑣0], 𝑒5 = [𝑣0, 𝑣2], and 𝑒6 = [𝑣1, 𝑣3]. Therefore, we have C2 (X) = Z2^4, C1 (X) = Z2^6, and C0 (X) = Z2^4.

Fig. 7. Boundary maps for Example 2. The boundary map 𝜕2 : C2 (X) → C1 (X) is represented as a 6 × 4 binary matrix (left): each column corresponds to a 2-simplex and each row to an edge, so that, for example, 𝜕𝜏2 = 𝑒2 + 𝑒3 + 𝑒6 can be read from the second column. The boundary map 𝜕1 : C1 (X) → C0 (X) is represented similarly as a 4 × 6 binary matrix (right).

            𝜕𝜏1  𝜕𝜏2  𝜕𝜏3  𝜕𝜏4                  𝜕𝑒1  𝜕𝑒2  𝜕𝑒3  𝜕𝑒4  𝜕𝑒5  𝜕𝑒6
    𝑒1  [    1    0    1    0  ]        𝑣0  [    1    0    0    1    1    0  ]
    𝑒2  [    1    1    0    0  ]        𝑣1  [    1    1    0    0    0    1  ]
    𝑒3  [    0    1    0    1  ]        𝑣2  [    0    1    1    0    1    0  ]
    𝑒4  [    0    0    1    1  ]        𝑣3  [    0    0    1    1    0    1  ]
    𝑒5  [    1    0    0    1  ]
    𝑒6  [    0    1    1    0  ]

A straightforward computation shows that the kernel of 𝜕2, denoted Z2 (X), has a single generator: the sum of all the triangles, 𝜏1 + 𝜏2 + 𝜏3 + 𝜏4. Since there is no 3-simplex in X, we find that B2 (X) = {0}, leading to H2 (X) = Z2. This result aligns with the fact that rank(H2 (X)) = 1, indicating the presence of one cavity in X, consistent with X being a hollow tetrahedron. Furthermore, calculating Z1 (X) ≃ Z2^3 and B1 (X) ≃ Z2^3 shows that these cancel out, yielding H1 (X) = {0}. In other words, any loop in X is filled, so there is no 1-hole in X. Similarly, following the same reasoning as in the example above, we obtain H0 (X) = Z2 as expected.
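To connect this with the matrix viewpoint, here is a minimal Python sketch (our illustration, not code from the tutorial's repository) that recovers the Betti numbers of the hollow tetrahedron from the two boundary matrices in Figure 7, computing ranks over Z2.

import numpy as np

def rank_gf2(mat):
    """Rank of a binary matrix over the field Z2, via Gaussian elimination."""
    m = np.array(mat, dtype=np.uint8) % 2
    rank, (rows, cols) = 0, m.shape
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if m[r, col]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]     # move the pivot row up
        for r in range(rows):
            if r != rank and m[r, col]:
                m[r] ^= m[rank]                 # eliminate the column elsewhere
        rank += 1
    return rank

# Boundary matrices of the hollow tetrahedron (Figure 7).
d1 = [[1, 0, 0, 1, 1, 0],    # rows: v0..v3, columns: e1..e6
      [1, 1, 0, 0, 0, 1],
      [0, 1, 1, 0, 1, 0],
      [0, 0, 1, 1, 0, 1]]
d2 = [[1, 0, 1, 0],          # rows: e1..e6, columns: tau1..tau4
      [1, 1, 0, 0],
      [0, 1, 0, 1],
      [0, 0, 1, 1],
      [1, 0, 0, 1],
      [0, 1, 1, 0]]

n0, n1, n2 = 4, 6, 4                     # numbers of vertices, edges, triangles
r1, r2 = rank_gf2(d1), rank_gf2(d2)
beta0 = n0 - r1                          # dim ker(d0) - rank(d1), with d0 = 0
beta1 = (n1 - r1) - r2                   # dim ker(d1) - rank(d2)
beta2 = (n2 - r2)                        # no 3-simplices, so rank(d3) = 0
print(beta0, beta1, beta2)               # expected output: 1 0 1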

3 PERSISTENT HOMOLOGY
In this section, we introduce Persistent Homology (PH), a foundational technique that played a pivotal role in the
emergence of TDA, as developed by Carlsson, Edelsbrunner, Zomorodian, and others in the early 2000s [21, 40, 124].
PH captures the underlying shape patterns within complex data sets by studying the evolution of topological features
across multiple scales.
PH first constructs a nested sequence of simplicial complexes, known as a filtration, and tracks the birth and death of features, such as connected components, loops, and voids, in this sequence. The resulting multi-scale representation highlights significant features while filtering out noise, making PH a valuable tool in various fields, including medical imaging [99], biomedicine [100], time series analysis [119], material science [79], geography [34], shape analysis [107], and finance [59].

Fig. 8. PH Pipeline. From data acquisition (point clouds, images, networks) and filtration complex construction (Rips, Čech, cubical, clique, power) to generating persistence diagrams using software libraries (Dionysus, Gudhi, Ripser, Giotto-TDA, PHAT, JavaPLEX). The final step highlights methods for integrating persistence diagrams into downstream ML tasks: vectorizations (persistence images, silhouettes, Betti curves, kernels, Wasserstein distances) or direct and automated approaches (Perslay, PointNet).

In the following, we explain PH in an accessible way, focusing on its key aspects relevant to ML applications. The main idea behind PH is to capture the hidden shape patterns in the data by using algebraic topology tools. PH achieves this by keeping track of the evolution of the topological features (𝑘-holes, i.e., components, loops, and cavities) created in the data while looking at it in different resolutions.
In simple terms, PH can be summarized as a three-step procedure as follows.

(1) Filtration: Generate a nested sequence of simplicial complexes derived from the data.
(2) Persistence Diagrams: Record the evolution of topological features across this sequence.
(3) ML Integration: Transform the persistence diagrams into vectors for efficient use in ML models.

We provide details of these steps in the following sections. Although the second and third steps are conceptually
similar across most settings, the first step, constructing filtrations, varies significantly depending on the data type, i.e.,
point clouds, images, and networks. See Figure 8 for a visual summary of the PH pipeline.
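To make the three-step procedure above concrete, the following minimal end-to-end sketch runs it on a toy point cloud with the ripser.py library (one of several possible choices; see Section 3.6). This snippet is an illustration we add, not code from the tutorial's repository.

import numpy as np
from ripser import ripser

# Step 0: a noisy circle in R^2; its dominant topological feature is one loop.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.normal(size=(200, 2))

# Steps 1-2: Rips filtration and persistence diagrams in dimensions 0 and 1.
diagrams = ripser(X, maxdim=1)["dgms"]

# Step 3 (a toy vectorization): total lifetime per dimension, usable as a
# crude topological feature vector for a downstream ML model.
for k, dgm in enumerate(diagrams):
    finite = dgm[np.isfinite(dgm[:, 1])]
    print(f"H{k}: {len(dgm)} features, total persistence = "
          f"{np.sum(finite[:, 1] - finite[:, 0]):.3f}")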

3.1 Constructing Filtrations


In Section 2, we introduced simplicial complexes which enable us to study the topological spaces via computational
tools. To study the given data in different resolutions, PH generates a nested sequence of simplicial complexes K1 ⊂
K2 ⊂ · · · ⊂ K𝑛 induced from the data. Such a sequence is called a filtration. This step can be considered the most crucial
for the effectiveness of PH in ML applications. The primary reason is that the filtration process involves examining the
data at different resolutions by adjusting a "scale parameter". The choice of this "scale parameter" can greatly influence
the performance of the method.

For each data type, there are well-established methods to construct filtrations that have proven highly effective in
their respective contexts. These methods vary significantly depending on the data type. While the point cloud [26] and
image [116] settings use relatively standard approaches, the construction of filtrations in the graph setting
is notably more versatile [5].

3.1.1 Filtrations for Point Clouds. As the process is generally similar for various metric spaces, for simplicity, we will describe it for a point cloud X in R𝑁 using the Euclidean metric. Let X = {𝑥1, 𝑥2, . . . , 𝑥𝑚} be a point cloud in R𝑁. We will define a nested sequence of simplicial complexes K1 ⊂ K2 ⊂ · · · ⊂ K𝑛 induced by X. The central idea here is to build a series of simplicial complexes that progressively capture the topological picture of the point cloud X in different resolutions as we move from K1 to K𝑛. The "nested" part signifies that each simplicial complex is contained within the next one, like a series of Russian Matryoshka dolls.

Fig. 9. For the 8-shaped point cloud X, filtration steps {N𝜖 (X)} for the thresholds (a) 𝜖 = 0.4, (b) 𝜖 = 0.7, and (c) 𝜖 = 2.

Before moving on to simplicial complexes, we will define a simpler nested sequence for X. Let B𝑟 (𝑥) = {𝑦 ∈ R𝑁 | 𝑑 (𝑥, 𝑦) ≤ 𝑟 } be the closed 𝑟-ball around 𝑥. Then, let the 𝑟-neighborhood of X be N𝑟 (X) = B𝑟 (𝑥1) ∪ B𝑟 (𝑥2) ∪ · · · ∪ B𝑟 (𝑥𝑚), the union of the 𝑟-balls around the points in X. By declaring 𝑟 as our scale parameter, we first fix a monotone sequence of threshold values 0 = 𝑟1 < 𝑟2 < · · · < 𝑟𝑛, where 𝑟𝑛 = max𝑖,𝑗 {𝑑 (𝑥𝑖, 𝑥𝑗)}, the diameter of X. These values intuitively represent the resolution at which we observe the point cloud X. In particular, a smaller value of 𝑟 indicates a closer, more detailed examination of X, while a larger value of 𝑟 means observing the point cloud with a broader view from a greater distance, making it difficult to distinguish between points that are close together in X. This naturally gives a nested sequence of topological spaces N𝑟1 (X) ⊂ N𝑟2 (X) ⊂ · · · ⊂ N𝑟𝑛 (X) (see Figure 9). While these neighborhoods {N𝑟 (X)} give a natural sequence for the data, to effectively leverage computational tools, we need to induce a sequence of simplicial complexes. There are two common ways to achieve this while preserving the underlying topological information.

Method 1 - Rips complexes: For X ⊂ R𝑁 and 𝑟 > 0, the Rips complex (aka Vietoris-Rips complex) is the abstract simplicial complex R𝑟 (X) where a 𝑘-simplex 𝜎 = [𝑥𝑖0, 𝑥𝑖1, . . . , 𝑥𝑖𝑘] ∈ R𝑟 (X) if and only if 𝑑 (𝑥𝑖𝑚, 𝑥𝑖𝑛) < 𝑟 for any 0 ≤ 𝑚, 𝑛 ≤ 𝑘. In other words, for 𝑟 > 0, if 𝑘 + 1 points are pairwise 𝑟-close to each other, they form a 𝑘-simplex in R𝑟 (X).

Method 2 - Čech complexes: Similarly, for X ⊂ R𝑁 and 𝑟 > 0, the Čech complex is the abstract simplicial complex Č𝑟 (X) where a 𝑘-simplex 𝜎 = [𝑥𝑖0, 𝑥𝑖1, . . . , 𝑥𝑖𝑘] ∈ Č𝑟 (X) if and only if B𝑟 (𝑥𝑖0) ∩ B𝑟 (𝑥𝑖1) ∩ · · · ∩ B𝑟 (𝑥𝑖𝑘) ≠ ∅. Here, the condition to build a 𝑘-simplex is different, as we ask for a nontrivial intersection of the 𝑟-balls of the 𝑘 + 1 points.

Fig. 10. Comparison of Rips and Čech Complexes. In panels 10a and 10b, the Čech and Rips complexes differ: the Čech complex does not form a 2-simplex among 𝑣1, 𝑣2, and 𝑣3 due to insufficient ball overlap, while the Rips complex does. A larger radius, as in panel 10c, is needed for the Čech complex to include this simplex. (a) Čech Complex (b) Rips Complex (c) Čech Complex.

The main relationship between our original filtration {N𝑟𝑖 (X)} and the simplicial complex filtrations arises from the Nerve Lemma. This lemma states that the Čech complex Č𝑟 (X) is homotopy equivalent to N𝑟 (X) for any 𝑟 ≥ 0 [38]. Since homotopy equivalent spaces have the same homology, we have H𝑘 (Č𝑟 (X)) ≃ H𝑘 (N𝑟 (X)) for any 𝑘 ≥ 0. Therefore, from a PH perspective, the filtrations {N𝑟𝑖 (X)} and {Č𝑟𝑖 (X)} are essentially the same. Furthermore, for X ⊂ R𝑁 with the Euclidean metric, Rips and Čech complexes are closely related and produce similar topological information, as Č𝑟 (X) ⊂ R2𝑟 (X) ⊂ Č√2𝑟 (X) [39].

While both complexes provide similar filtrations, Rips complexes are more commonly used in practice. This preference is due to the fact that Rips complexes only require the pairwise distances {𝑑 (𝑥𝑖, 𝑥𝑗)} among points, which can be easily obtained at the beginning of the process. Hence, constructing Rips complexes is straightforward once these distances are known. In contrast, constructing Čech complexes requires checking whether collections of 𝑟-balls have nontrivial intersections. Figure 10 depicts their differences in a toy scenario.

In particular, Rips complexes are the most common filtration type used for point clouds because of their computational practicality, and, because of the Nerve Lemma, they capture very similar information to that produced by the simple neighborhood filtration {N𝑟 (X)} described earlier.

Although we discuss the construction for point clouds in R𝑁, the same process can be applied to any metric space (any space with a distance function). Furthermore, in some cases, the point cloud X is given in an abstract setting, where only pairwise distances {𝑑 (𝑥𝑖, 𝑥𝑗)} among the points are provided. The Rips complex filtration can be effectively applied to such point clouds. In Section 6.1, we detail the real-life application of this approach for shape recognition.

Other complex types. One of the primary challenges in applying PH to point clouds is the computational cost. To
mitigate this, witness complexes offer a valuable alternative for efficiently analyzing the topological features of point
clouds, especially in high-dimensional spaces. Unlike the Rips complexes, which can be computationally expensive,
witness complexes reduce complexity by utilizing a representative subset of the data points, known as witnesses, to
construct the simplicial complex [81]. This method strikes a balance between computational efficiency and topological
accuracy, making it particularly well-suited for large datasets where constructing full complexes would be impractical.
By concentrating on representative data points (witnesses), witness complexes facilitate a more manageable and scalable
computation of persistent homology, enabling the detection of the underlying shape and features of the point cloud
across various scales. In addition to Rips and Čech complexes, other types of complexes, such as alpha complexes and

Algorithm 1 PH Machinery for Point Clouds with Rips Complexes


1: Input: Point cloud 𝑃, Distance metric 𝑑, Threshold set {𝜖𝑖 }𝑛𝑖=0
2: Output: Topological vector 𝛽®(𝑃, 𝑑, {𝜖𝑖 })
3: 𝑉 ← {𝑣 ∈ 𝑃 } ⊲ Vertex set doesn’t change with filtration
4: for 𝑖 = 0 to 𝑛 do ⊲ Filtration index
5: Vertex set remains the same for all 𝑖
6: 𝐸𝑖 ← {(𝑢, 𝑣) | 𝑢, 𝑣 ∈ 𝑃 and 𝑑 (𝑢, 𝑣) ≤ 𝜖𝑖 } ⊲ Edges with distance ≤ 𝜖𝑖
7: R𝑖 ← (𝑉 , 𝐸𝑖 ) ⊲ Rips complex at 𝜖𝑖
8: end for
9: for 𝑘 = 0 to 1 do ⊲ Topological dimensions
10: Compute persistence diagram PD𝑘 ({R𝑖 }𝑛𝑖=0 )
11: Obtain vectorization 𝛽®𝑘 from PD𝑘 ({R𝑖 }𝑛𝑖=0 )
12: end for
13: Return Concatenation of 𝛽®0 and 𝛽®1 , denoted as 𝛽®(𝑃, 𝑑, {𝜖𝑖 }) = 𝛽®0 ∥ 𝛽®1
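As a concrete companion to Algorithm 1, the following is a minimal sketch of the same pipeline using the GUDHI library (the library choice, the synthetic circle-shaped point cloud, and the scale cutoff are illustrative assumptions; Ripser or giotto-tda could be used similarly).

import numpy as np
import gudhi

# Illustrative point cloud: noisy samples around a unit circle.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
points = np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.05, (100, 2))

# Rips filtration up to a maximal scale (the analogue of the threshold set).
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)  # 2-simplices suffice for PD0 and PD1

# Persistence diagrams for topological dimensions 0 and 1.
st.compute_persistence()
pd0 = st.persistence_intervals_in_dimension(0)
pd1 = st.persistence_intervals_in_dimension(1)
print(len(pd0), "features in PD0;", len(pd1), "features in PD1")

The two diagrams can then be vectorized and concatenated exactly as in the last step of Algorithm 1.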



Fig. 11. For the 5 × 5 image X with the given pixel values, the sublevel filtration is the sequence of binary images X1 ⊂ X2 ⊂
X3 ⊂ X4 ⊂ X5 .

Delaunay complexes, are particularly useful in lower-dimensional spaces. However, to maintain the focus of this paper,
we refer the reader to [81] for an in-depth discussion of these complexes.

3.1.2 Filtrations for Images. For images, constructing filtrations differs significantly due to the unique structure of
image data. To keep the explanation simple, we will focus on 2D images, though the concepts can be extended to 3D
and other types of images. Filtration for images typically involves nested sequences of binary images, known as cubical
complexes [60]. Starting with a given color (or grayscale) image X of dimensions 𝑟 × 𝑠, we first select a specific color
channel (e.g., red, blue, green, or grayscale). The color values 𝛾𝑖 𝑗 ∈ [0, 255] of individual pixels Δ𝑖 𝑗 ⊂ X are used, where
Δ𝑖 𝑗 represents a closed box (square including its boundary) in the 𝑖 𝑡ℎ row and 𝑗 𝑡ℎ column of the image X.
Next, we choose the number of thresholds "𝑛" to span the color interval [0, 255], i.e., 0 = 𝑡 1 < 𝑡 2 < · · · < 𝑡𝑛 = 255.
This determines the length of our filtration sequence, with a typical range being between 50 and 100. Using these
thresholds, we define a nested sequence of binary images (cubical complexes) X1 ⊂ X2 ⊂ · · · ⊂ X𝑛 such that
X𝑚 = {Δ𝑖 𝑗 ⊂ X | 𝛾𝑖 𝑗 ≤ 𝑡𝑚 } (see Figure 11).
In particular, this involves starting with a blank 𝑟 × 𝑠 image and progressively activating (coloring black) pixels
as their grayscale values reach each specified threshold 𝑡𝑚 . This process is known as sublevel filtration, applied to X
relative to the chosen color channel (in this case, grayscale). Alternatively, pixels can be activated in descending order,
referred to as superlevel filtration. In this context, let Y𝑚 = {Δ𝑖 𝑗 ⊂ X | 𝛾𝑖 𝑗 ≥ 𝑠𝑚 }, where 255 = 𝑠 1 > 𝑠 2 > · · · > 𝑠𝑛 = 0,
and Y1 ⊂ Y2 ⊂ · · · ⊂ Y𝑛 is called superlevel filtration. The persistent homology process involving such cubical
complexes has a special name, called cubical persistence. In Section 6.3, we detail a successful application of this method
in cancer detection from histopathological images.
While these sublevel and superlevel filtrations are common choices for color or grayscale images, there are other
filtration types used for binary images, e.g., erosion, dilation, and signed distance [43]. These filtrations are specific to


Fig. 12. Erosion Filtration. For a given binary image X , we define the erosion function, which assigns a value of 0 to black pixels and the distance to
the nearest black pixel for white pixels (shown in X0 ). From this, we generate a filtration of binary images, X0 ⊂ X1 ⊂ · · · ⊂ X3 , where each X𝑖 is
obtained by activating pixels that reach a specific threshold value.


Algorithm 2 PH Machinery for Images with Cubical Persistence


1: Input: Image X, Color channel 𝐹 : X → R, Threshold set {𝜖𝑖 }𝑛𝑖=0
2: Output: Topological vector 𝛽®(X, 𝐹, {𝜖𝑖 })
3: for 𝑖 = 0 to 𝑛 do ⊲ Filtration index
4: X𝑖 ← {Δ𝑖 𝑗 ∈ X | 𝐹 (Δ𝑖 𝑗 ) ≤ 𝜖𝑖 } ⊲ Binary Images
5: end for
6: for 𝑘 = 0 to 1 do ⊲ Topological dimensions
7: Compute persistence diagram PD𝑘 ({X𝑖 }𝑛𝑖=0 )
8: Obtain vectorization 𝛽®𝑘 from PD𝑘 ({X𝑖 }𝑛𝑖=0 )
9: end for
10: Return Concatenation of 𝛽®0 and 𝛽®1 , denoted as 𝛽®(X, 𝐹, {𝜖𝑖 }) = 𝛽®0 ∥ 𝛽®1

binary images and effective in capturing the size and other properties of topological features. In Figure 12, we give an
example of erosion filtration.
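To illustrate the sublevel (cubical) filtration on a grayscale image, here is a minimal sketch using GUDHI's CubicalComplex; the synthetic ring image is an assumption made only for this example, and a superlevel filtration can be obtained by negating the pixel values.

import numpy as np
import gudhi

# Synthetic 64x64 grayscale image: a dark ring on a bright background (values in [0, 255]).
xx, yy = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
radius = np.sqrt(xx**2 + yy**2)
image = 255.0 * (1.0 - np.exp(-((radius - 0.6) ** 2) / 0.02))

# Sublevel filtration: pixels are activated as their grayscale value is reached.
cc = gudhi.CubicalComplex(top_dimensional_cells=image)
cc.compute_persistence()

pd0 = cc.persistence_intervals_in_dimension(0)  # connected components
pd1 = cc.persistence_intervals_in_dimension(1)  # loops: the dark ring yields a long-lived 1-hole
print(len(pd0), "points in PD0;", len(pd1), "points in PD1")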

3.1.3 Filtrations for Graphs. Graphs are a widely used application domain for PH because they generate highly effective
features that are not easily obtained through other methods. While filtration methods for point clouds and images are
relatively standard, graph filtrations offer a variety of choices. The way you construct these filtrations can significantly
impact the performance of ML models. Unlike other data formats, graphs admit a variety of ways to construct filtrations; the details can be found in [5]. We categorize these methods into two groups:

i. Filtrations through node/edge functions. This type of filtration is computationally efficient in most cases and is
commonly used in applications. The main concept involves defining a filtration function on nodes or edges and using
the order dictated by these functions to create a sequence of subgraphs and corresponding simplicial complexes.
Given a graph G = (V, E), let 𝑓 : V → R be a node filtration function. Common examples of such functions include degree, centrality, betweenness, closeness, and heat kernel signatures (HKS). Additionally, functions may be derived from the domain of the graph, such as atomic number for molecular graphs or account balance for transaction networks. These functions establish a hierarchy among the nodes, ordering them from less important to more important within the context defined by the filtration function.

Fig. 13. Graph Filtration. For G = G3 in both examples, the top figure illustrates a superlevel filtration using the node degree function with thresholds 3 > 2 > 1, where nodes of degree 3 are activated first, followed by those of lower degrees. Similarly, the bottom figure illustrates a sublevel filtration based on edge weights with thresholds 1.5 < 1.8 < 2.1.

To define the resolution of our filtration, we choose a set of thresholds that cover the range of 𝑓 , denoted as I = {𝜖𝑖 }𝑛𝑖=1 , where 𝜖1 = min𝑣∈V 𝑓 (𝑣) < 𝜖2 < . . . < 𝜖𝑛 = max𝑣∈V 𝑓 (𝑣). For each threshold 𝜖𝑖 ∈ I, let V𝑖 = {𝑣 ∈ V | 𝑓 (𝑣) ≤ 𝜖𝑖 }. Define G𝑖 as the subgraph of G induced by V𝑖 ,
i.e., G𝑖 = (V𝑖 , E𝑖 ) where E𝑖 = {𝑒𝑖 𝑗 ∈ E | 𝑣𝑖 , 𝑣 𝑗 ∈ V𝑖 }. In other words, at each threshold, we activate the nodes whose
values reach that threshold, along with the edges in the graph between these activated nodes.
This process results in a nested sequence of subgraphs G1 ⊂ G2 ⊂ . . . ⊂ G𝑛 = G (See Figure 13). For each G𝑖 , we define an abstract simplicial complex Ĝ𝑖 for 1 ≤ 𝑖 ≤ 𝑛, creating a filtration of simplicial complexes Ĝ1 ⊆ Ĝ2 ⊆ . . . ⊆ Ĝ𝑛 . Typically, clique complexes are used, where the complex Ĝ𝑖 is obtained by assigning a 𝑘-simplex to each (𝑘 + 1)-complete

Algorithm 3 PH machinery for graphs (sublevel filtration)


Input: Graph G = (V, E), Filtration function 𝑓 : V → R, Threshold set I = {𝜖𝑖 }𝑛𝑖=0
Output: Topological vector 𝛽®(G, 𝑓 , I)
for 𝑖 = 0 to 𝑛 do ⊲ Filtration index
    V𝑖 ← {𝑣 ∈ V | 𝑓 (𝑣) ≤ 𝜖𝑖 } ⊲ Vertex set changes with 𝑖
    G𝑖 ← Induced subgraph of G with vertex set V𝑖
    Ĝ𝑖 ← Clique complex of G𝑖
end for
for 𝑘 = 0 to 1 do ⊲ Topological dimensions
    Compute persistence diagram PD𝑘 ({Ĝ𝑖 }𝑛𝑖=0 )
    Obtain vectorization 𝛽®𝑘 from PD𝑘 ({Ĝ𝑖 }𝑛𝑖=0 )
end for
Return Concatenation of 𝛽®0 and 𝛽®1 , denoted as 𝛽®(G, 𝑓 , I) = 𝛽®0 ∥ 𝛽®1

subgraph ((𝑘 + 1)-clique) in G [5]. The term clique refers to a group (clique) of 𝑘 + 1 vertices forming a 𝑘-simplex. This
is called sublevel filtration for the filtration function 𝑓 . Similarly, one can reverse this process (using a decreasing order
for activation of the nodes) to obtain a superlevel filtration, where V𝑖 = {𝑣 ∈ V | 𝑓 (𝑣) ≥ 𝜖𝑖′ } where {𝜖𝑖′ } is a monotone
decreasing sequence.
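The node-filtration construction above can be made concrete with a short sketch; it uses NetworkX and GUDHI's SimplexTree (both assumptions of this example), with the degree function as the sublevel filtration and each simplex entering at the maximum filtration value of its vertices.

import networkx as nx
import gudhi

# Toy graph and a node filtration function (here: degree).
G = nx.karate_club_graph()
f = dict(G.degree())

# Each vertex/edge enters the filtration at the max degree of its endpoints.
st = gudhi.SimplexTree()
for v in G.nodes():
    st.insert([v], filtration=f[v])
for u, v in G.edges():
    st.insert([u, v], filtration=max(f[u], f[v]))

# Clique (flag) complex up to 2-simplices, consistent with the sublevel order.
st.expansion(2)

st.compute_persistence()
pd0 = st.persistence_intervals_in_dimension(0)
pd1 = st.persistence_intervals_in_dimension(1)
print(len(pd0), "points in PD0;", len(pd1), "points in PD1")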
While we have explained the process for node filtration functions, edge filtration functions are also common. This
approach is frequently employed in weighted graphs or graphs with key edge functions for downstream tasks. For a
graph G, let 𝑔 : E → R be an edge filtration function. In the case of weighted graphs, the weight function 𝑤 : E → R
with 𝑤 (𝑒𝑖 𝑗 ) = 𝜔𝑖 𝑗 serves as an example of such a function. Other common examples include Forman Ricci and Ollivier
Ricci curvatures, which assign the (weighted or unweighted) edge a curvature value by interpreting the geometry of
the edge’s neighborhood [41]. Furthermore, edge filtration functions can be derived from the graph’s domain, such
as transaction amounts in blockchain networks [1] or density in traffic networks [22]. These functions establish a
hierarchy among the edges, similar to node filtration functions.
Again, by choosing the threshold set I = {𝜖𝑖 }𝑛𝑖=1 with 𝜖1 = min𝑒𝑖 𝑗 ∈ E 𝑔(𝑒𝑖 𝑗 ) < 𝜖2 < . . . < 𝜖𝑛 = max𝑒 𝑗𝑘 ∈ E 𝑔(𝑒 𝑗𝑘 ), we
define the filtration as follows. For 𝜖𝑖 ∈ I, let E𝑖 = {𝑒 𝑗𝑘 ∈ E | 𝑔(𝑒 𝑗𝑘 ) ≤ 𝜖𝑖 }. Let G𝑖 be a subgraph of G induced by E𝑖 ,
i.e., G𝑖 is the smallest subgraph of G containing the edge set E𝑖 (Fig. 13). In other words, for each threshold, we activate
the edges whose value reaches that threshold, along with the nodes attached to them. Again, this induces a nested
sequence of subgraphs G1 ⊂ · · · ⊂ G𝑛 = G. Then, one can follow the same steps by using clique complexes as before.
In Figure 13, we give a toy example for superlevel (node) and sublevel (weight) filtrations. Here, we list the most
common node and edge filtration functions used in applications:
• Node Filtration Functions: Degree, betweenness, centrality, eccentricity, heat kernel signature (HKS), Fiedler
values (spectral), node functions from the domain of the data.
• Edge Filtration Functions: Edge weights (weighted graphs), Ollivier Ricci curvature, Forman Ricci curvature, and edge functions from the domain of the data.

ii. Graph distance-based filtrations. In addition to the sublevel filtrations above, a completely distinct filtration method for
graphs uses the distances between nodes. Essentially, it treats the graph as a point cloud, where pairwise distances
between nodes are defined by graph distance, and applies Rips filtration (Section 3.1.1) to this point cloud. While this
method is more computationally intensive than filtrations via node or edge functions, it can be highly effective for
small graphs as it captures distance information and finer topological details.

To apply this method, we need to define graph distances between nodes. In an unweighted graph G = (V, E), a
common approach is to set the distance between adjacent nodes to 1 (∥𝑒𝑖 𝑗 ∥ = 1) and define the distance between 𝑣𝑖 and
𝑣𝑘 as the length of the shortest path 𝜏𝑖𝑘 in G with endpoints 𝑣𝑖 and 𝑣𝑘 , i.e. 𝑑 (𝑣𝑖 , 𝑣𝑘 ) = min ∥𝜏𝑖𝑘 ∥. Thus, each pairwise
distance is an integer representing the number of hops needed to travel from 𝑣𝑖 to 𝑣 𝑗 in G. The largest such distance is
the diameter of G, denoted as 𝑛.
After obtaining the pairwise distances {𝑑 (𝑣𝑖 , 𝑣 𝑗 )} between the nodes, we treat the graph as a point cloud V =
{𝑣 1, 𝑣 2, . . . , 𝑣𝑚 } with these distances. Since all distances are integers, we use integer thresholds {𝑟𝑖 = 𝑖} for 0 ≤ 𝑖 ≤ 𝑛 =
diam(G). Now, we are ready to define our filtration. For the filtration index, we use superscripts instead of subscripts,
where the reason will become clear later.
To make the exposition simpler, let Ĝᵏ be the Rips complex corresponding to 𝑟 = 𝑘, and let Gᵏ be the graph corresponding to the 1-skeleton of Ĝᵏ. Here, 1-skeleton means the graph itself with its nodes and edges, without considering any higher-dimensional elements like faces or solids that might be part of a more complex simplicial structure. For the distance parameter 𝑟 = 0, we have Ĝ⁰ = V since there are no edges in the Rips complex. It is easy to see that at the distance threshold of 1, G¹ = G because all edges are automatically included in the complex when 𝑟 = 1. Recall that Ĝ¹ is a Rips complex, hence any complete 𝑗-subgraph in G¹ generates a (𝑗 − 1)-simplex (𝑗-clique) in the Rips complex Ĝ¹. In particular, Ĝᵏ is nothing but the clique (or flag) complex of Gᵏ. Furthermore, Gᵏ is the graph with node set V and edge set E𝑘 = {𝑒𝑖𝑗 | 𝑑 (𝑣𝑖 , 𝑣𝑗 ) ≤ 𝑘 }, i.e., Gᵏ = (V, E𝑘 ). Therefore, in the filtration Ĝ⁰ ⊂ Ĝ¹ ⊂ · · · ⊂ Ĝⁿ, all simplicial complexes have the same node set, but at each step, we add new edges and cliques. Finally, for Ĝⁿ, the 1-skeleton is a complete graph with 𝑚 nodes, and hence the corresponding simplicial complex Ĝⁿ would be an (𝑚 − 1)-simplex.
The graphs {Gᵏ } are called graph powers (See Figure 14), and this filtration is called the power filtration (or Rips filtration), hence the superscripts. Observe that the power filtration calculates the distances between every pair of vertices in the graph. Therefore, even vertices that are not direct neighbors or appear distant in the graph can still form a simplex in the later stages of the filtration.

Fig. 14. Graph Powers. A graph G = G¹ and its graph powers. Red edges are added in G², and green ones are added in G³. Note that G³ is the complete graph on 7 vertices since 𝐷 = diam(G) = 3. Hence, G³ would be a complete graph K7 , and all higher powers are the same, i.e., Gⁿ = G³ for 𝑛 ≥ 3.

Note that you don't need to complete the entire filtration. In most cases, the critical information lies in the first few steps (e.g., up to 𝜖 = 5 or 10) for the power filtration. Given the high computational cost, it's both practical and reasonable to stop early.

For a weighted graph G = (V, E, W), edge lengths can be defined using weights {𝜔𝑖𝑗 }, where ∥𝑒𝑖𝑗 ∥ = 𝑓 (𝜔𝑖𝑗 ) instead of the uniform ∥𝑒𝑖𝑗 ∥ = 1 used in unweighted graphs. The function 𝑓 : W → R+ assigns distances based on weights, where a smaller 𝑓 (𝜔𝑖𝑗 ) indicates that nodes 𝑣𝑖 and 𝑣𝑗 are closely related, and a larger distance indicates they are less related. This approach is particularly useful in financial networks where large edge weights (e.g., transaction amounts) indicate stronger financial connections or dependencies. In Section 6.2, we elaborate on this method in the context of a real-world application: Crypto-token anomaly forecasting.

Once edge lengths are defined, pairwise distances between any node pair can be calculated as the length of the shortest path as before, i.e., 𝑑 (𝑣𝑖 , 𝑣𝑗 ) = min ∥𝜏𝑖𝑗 ∥, where 𝜏𝑖𝑗 is any such path in the graph with endpoints 𝑣𝑖 and 𝑣𝑗 . After obtaining pairwise distances, the process is the same as for unweighted graphs.
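A minimal sketch of the power (Rips) filtration follows; the toy graph, the early stopping value, and the use of NetworkX/GUDHI are illustrative assumptions. For a weighted graph, the same code applies once the shortest-path computation uses the chosen edge lengths.

import networkx as nx
import gudhi

# Toy connected graph; for weighted graphs, store edge lengths in the "weight" attribute.
G = nx.connected_watts_strogatz_graph(30, 4, 0.3, seed=1)

# Pairwise graph distances (hop counts): the point-cloud view of the graph.
D = nx.floyd_warshall_numpy(G)

# Power filtration, stopped early at distance 5 since the first few steps
# usually carry the critical information.
rips = gudhi.RipsComplex(distance_matrix=D.tolist(), max_edge_length=5)
st = rips.create_simplex_tree(max_dimension=2)
st.compute_persistence()
print(st.persistence_intervals_in_dimension(1))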

Graph Reduction and PH. Although PH generates highly effective topological embeddings for graph representation learning, scalability remains a challenge. To address this issue, Akcora et al. [3] introduced two key methods to significantly reduce the computational costs of the PH process. The first method, CoralTDA, leverages the observation that nodes with low graph-core do not contribute to persistence diagrams in higher dimensions. Specifically, they demonstrate that the (𝑘 + 1)-core of a graph, which is the subgraph where each vertex has at least 𝑘 + 1 neighbors, is sufficient to compute the PDk (G) of the original graph. This improvement can be implemented in as few as three lines of Python code (see the repository). The second algorithm, PrunIT, introduces an efficient method to reduce the size of graphs without altering their persistence diagrams. In this approach, a vertex 𝑢 is said to dominate another vertex 𝑣 if the 1-neighborhood of 𝑢 contains the 1-neighborhood of 𝑣, i.e., N (𝑢) ⊃ N (𝑣). The authors show that removing (pruning) a dominated vertex from the graph does not affect the persistence diagrams at any level as long as the dominated vertex enters the filtration after the dominating vertex.

Fig. 15. Coral Reduction. With CoralTDA, the 2-core of G has the same persistence diagrams PDk (G) for 𝑘 ≥ 1.

Which filtration to use for graphs? We first note that filtrations using node/edge functions and filtrations based on
graph distances are entirely different methods, producing distinct outputs. Distance-based filtrations are computationally
demanding, but for datasets with small graphs (such as molecular graph datasets), power filtrations can be highly
effective. Conversely, for larger graphs, the computational costs necessitate the use of filtration functions. In these
cases, the choice of filtration function could be critical for performance [17]. Typically, relevant node or edge functions
specific to the domain of the data are the best choices. If such functions are unavailable, heat kernel signatures (HKS)
for node functions and Ollivier Ricci for edge functions are excellent alternatives. Choosing a filtration function can
be seen as an outdated method, as letting ML algorithms select or construct the best filtration function for optimal
performance is more reasonable. However, in many settings, model interpretability or greater control over the process
is needed. In these instances, selecting a relevant filtration function is more suitable for the model.
If the goal is to achieve optimal performance with PH machinery, learnable filtration functions are a promising
alternative. There are significant works, along with available code, that can be adapted for various tasks in graph
representation learning [55, 120].

3.1.4 Choosing Thresholds. One of the key steps in effectively applying PH to ML problems is selecting the thresholds
{𝜖𝑖 }𝑚
𝑖=1 for constructing the filtration. The first decision involves determining the number of thresholds, i.e., 𝑚. This
number can be viewed as the resolution of the filtration: a larger 𝑚 implies higher resolution, while a smaller 𝑚 indicates
lower resolution. Then, a natural question arises: is there any disadvantage to choosing a large 𝑚? The answer is "Yes".
Firstly, the computational cost increases with 𝑚. Secondly, the key information captured in the data may become diluted
in higher dimensions, depending on the vectorization step.
For graph filtrations, depending on the filtration function, selecting 𝑚 between 10 and 20 generally yields good
results. Once the number of thresholds is set, they can be chosen equally spaced if the filtration function 𝑓 : V → R
(or 𝑔 : E → R) is appropriate. Another effective method is to examine the distribution of the value set {𝑓 (𝑣𝑖 )} for all
vertices in the graphs across the dataset (or a random subsample), and then select the thresholds as the corresponding
quantiles in this distribution. Similarly, for distance-based filtrations, analyzing the distribution of pairwise distances
between vertices in a random subsample of the dataset can provide valuable insights for selecting appropriate thresholds.
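For instance, a quantile-based threshold set can be computed in a few lines; the sketch below is a generic helper (the sampled degree values are an illustrative input).

import numpy as np

def quantile_thresholds(values, m=10):
    """Choose m thresholds as evenly spaced quantiles of the observed filtration values."""
    values = np.asarray(values, dtype=float)
    return np.quantile(values, np.linspace(0.0, 1.0, m))

# Example: degree values collected over a random subsample of graphs in the dataset.
sample_degrees = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 12, 20]
print(quantile_thresholds(sample_degrees, m=5))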

For sublevel filtrations in the image setting, it’s important to choose thresholds that cover the color range [0, 255].
Typically, using 50-100 thresholds yields good results. However, if the task only concerns a specific color interval [𝑎, 𝑏],
it’s advisable to concentrate most or all thresholds within that range. These thresholds don’t need to be evenly spaced.
In the case of binary image filtrations (e.g., erosion, height, signed distance), thresholds are usually evenly distributed
up to the maximum value.
For point clouds, the number of thresholds directly deter-
mines the filtration’s resolution. While thresholds are gen-
erally evenly spaced, in the presence of outliers, they can
be more sparsely distributed after a certain value. Addition-
ally, since computational cost is often a concern with point
clouds, it’s important to find a balance between the number
of thresholds and the associated computational expense.

3.2 Persistence Diagrams


Fig. 16. Evolution of Persistence Barcode (PB). In PBs shown on the right, red bars represent PB0 (X), and blue bars represent PB1 (X). The 8-shaped point cloud X with 35 points initially has 35 components, corresponding to 35 red bars. As 𝜖 increases, these components merge. At 𝜖 = 0.4, only the top 11 red bars remain (b), indicating the number of components in N0.4 (X) (a). By 𝜖 = 0.7, the space becomes fully connected (c), and all red bars in PB0 (X) terminate, except for the top ∞-bar (d). Regarding loops (1-holes), a small loop appears around 𝜖 = 0.3 (a-b) and persists in N0.7 (X) (c), while a larger loop emerges around 𝜖 = 0.6 (d). In PB1 (X) (f), two blue bars, [0.3, 0.9) and [0.6, 1.7), correspond to the small and large loops in the 8-shape. Panels: (a) 𝜖 = 0.4, (b) PB up to 𝜖 = 0.4, (c) 𝜖 = 0.7, (d) PB up to 𝜖 = 0.7, (e) 𝜖 = 2, (f) Persistence barcode.

After constructing the filtration K1 ⊂ K2 ⊂ · · · ⊂ K𝑛 for a data type X, PH systematically tracks the evolution of topological features (𝑘-holes) in the filtration sequence and records this information in a persistence diagram, which we define next. The nontrivial elements in the homology groups H𝑘 (K𝑖 ) for 1 ≤ 𝑖 ≤ 𝑛 represent the 𝑘-dimensional topological features (or 𝑘-holes) appearing in the filtration sequence. Furthermore, the inclusion map 𝜄 : K𝑖 ↩→ K𝑖+1 allows us to determine if a 𝑘-hole 𝜎 in K𝑖 persists in K𝑖+1 through the induced map 𝜄∗ : H𝑘 (K𝑖 ) → H𝑘 (K𝑖+1 ).

For each 𝑘-hole 𝜎, PH records its first appearance in the filtration sequence, denoted K𝑖0 , and its first disappearance in a later complex, K𝑗0 . We define 𝑏𝜎 = 𝜖𝑖0 as the birth time of 𝜎 and 𝑑𝜎 = 𝜖𝑗0 as the death time of 𝜎, where {𝜖𝑖 }𝑛1 is the threshold set used for the filtration. The difference 𝑑𝜎 − 𝑏𝜎 is called the lifespan of 𝜎. For example, if a 𝑘-hole 𝜏 first appears in K3 and disappears in K7 , we mark the birth time as 𝑏𝜏 = 𝜖3 and the death time as 𝑑𝜏 = 𝜖7 . The lifespan of 𝜏 is then 𝜖7 − 𝜖3 .

For each nontrivial 𝜎 ∈ H𝑘 (K𝑖 ) for 1 ≤ 𝑖 ≤ 𝑛, we represent 𝜎 with a 2-tuple (𝑏𝜎 , 𝑑𝜎 ) to denote its birth and death times in the filtration. The collection of all such 2-tuples is called the persistence diagram (PD) as depicted in Figure 17. Note that a topological feature with the same birth and death time (0 lifespan) is considered a trivial topological feature, and they are represented with diagonal elements in the persistence diagrams. Therefore, the diagonal Δ = {𝑥 = 𝑦} ⊂ R2


is always included in any persistence diagram. Then, the 𝑘 𝑡ℎ persistence diagram is defined as

PDk (X) = {(b𝜎 , d𝜎 ) | 𝜎 ∈ Hk (Ki ) for b𝜎 ≤ i < d𝜎 } ∪ ∆

Since trivial elements (𝑡, 𝑡) in PDs can appear multiple times, we take the diagonal Δ with infinite multiplicity.
Essentially, the persistence diagram is a subset of R2 (PDk (X) ⊂ R2 ), where each point in PDk (X) is either a pair of
threshold values (𝜖𝑖 , 𝜖 𝑗 ) for some 𝑖 < 𝑗, or belongs to the diagonal (trivial element). Here, infinite multiplicity is a
technical assumption, which will be important when discussing Wasserstein distance between PDs (Section 3.2.2).
There is an equivalent concept called the persistence
barcode (see Figure 16), which uses bars (half-open in-
tervals) {[𝑏𝜎 , 𝑑𝜎 )} instead of 2-tuples {(𝑏𝜎 , 𝑑𝜎 )}. The
bar notation [𝑏𝜎 , 𝑑𝜎 ) represents the entire lifespan of
a topological feature 𝜎 as such an interval. In practice,
persistence diagrams are more commonly used due to
their practicality in applications.
Fig. 17. Persistence Barcode (left) and Persistence Diagram (right) for the 8-shaped point cloud X (Figure 16). Both representations convey the same information, while the PD plots each bar's birth time as its 𝑥 -coordinate and death time as its 𝑦 -coordinate. For example, the 35 red bars in PB0 (X) that begin at 0 are represented as 35 red points along the 𝑥 = 0 line in PD0 (X), where the 𝑦 -coordinates correspond to the bars' death times. Similarly, the blue points at (0.3, 0.9) and (0.6, 1.7) represent the blue bars [0.3, 0.9) and [0.6, 1.7), respectively.

Note that the choice between sublevel and superlevel filtration is irrelevant for the point cloud setting. In the image setting, this choice is not essential, as they essentially convey the same information due to Alexander duality [51]; when analyzing the topological features of images, the choice of focusing on either the binary image or its complement is not crucial because the topological information of one is directly related to the topological information of the other through this duality. The Alexander duality holds only when the ambient space is quite simple, such as when the space has a structure like an 𝑚 × 𝑛 rectangle, like images. However, sublevel and superlevel filtrations can yield significantly different results in the graph setting. Carriere et al. [24] introduced extended persistence diagrams to address this issue. This approach enables the birth and death pairs of the persistence diagram to be positioned below the diagonal (𝑏 > 𝑑). By doing so, it captures topological features identified through both superlevel and sublevel filtrations.
3.2.1 Interpretation of Persistence Diagrams. A persistence diagram PDk (X) records the 𝑘-dimensional topological
features of the data X as a collection of points in R2 . For example, PD0 (X) records 0-holes (components), and PD1 (X)
records 1-holes (loops) appearing in the filtration sequence {K𝑖 }. Each point 𝑞𝑗 = (𝑥𝑗 , 𝑦𝑗 ) represents a 𝑘-dimensional hole 𝜎𝑗 . For instance, consider a threshold set I = {𝜖𝑖 = 𝑖/10 | 0 ≤ 𝑖 ≤ 100} with 100 thresholds spanning the interval [0, 10].
Suppose we have two points, 𝑞 1 = (0.3, 9.7) and 𝑞 2 = (4.2, 4.6), in PD2 (X). These points represent 2-holes (cavities) 𝜎1
and 𝜎2 that appear in the filtration K0 ⊂ K1 ⊂ · · · ⊂ K100 .
For 𝑞 1 = (0.3, 9.7), the cavity 𝜎1 first appears in K3 and persists until K97 , indicating a long lifespan. Conversely, for
𝑞 2 = (4.2, 4.6), the cavity 𝜎2 first appears in K42 and persists only until K46 , indicating a short lifespan. This suggests
that 𝜎1 represents an important topological feature of the data X, while 𝜎2 is likely just topological noise.
In general, features with long lifespans (where 𝑦 − 𝑥 is large) are located far from the diagonal and are considered
significant (or "big") features. Features with short lifespans (where 𝑦 − 𝑥 is small) are close to the diagonal and are
considered insignificant (or "small") features. This is why many vectorization techniques try to involve the lifespan
information in their computation to give different emphasis to small and big features in the output.

3.2.2 Wasserstein Distance. After obtaining the persistence diagrams (PDs) for two datasets, X+ and X − , we can assess
their topological similarities using these diagrams. One effective approach is to employ a metric in the space of PDs.
If the distance between the two persistence diagrams is small, we can conclude that the two datasets exhibit similar
topological characteristics; otherwise, they differ. The most commonly used metric for this purpose is the Wasserstein
distance (also known as the matching or earth mover distance) of the Optimal Transport Theory (see an ICML tutorial
in [16]), which is defined as follows:
Let PD(X + ) and PD(X − ) be the persistence diagrams for the datasets X + and X − , respectively (we omit the
dimensions in PDs for simplicity). Denote PD(X + ) = {q+j } ∪ ∆ and PD(X − ) = {ql− } ∪ ∆, where Δ represents the
diagonal (indicating trivial cycles) with infinite multiplicity, and 𝑞 ±𝑗 = (𝑏 ±𝑗 , 𝑑 ±𝑗 ) ∈ PD(X ± ) represents the birth and death
times of a topological feature 𝜎 𝑗 in X ± . Let 𝜙 : PD(X + ) → PD(X − ) represent a bijection (matching). The presence
of the diagonal Δ on both sides ensures the existence of these bijections even if the cardinalities |{𝑞 +𝑗 }| and |{𝑞𝑙− }|
differ. This means that for optimal matching, some points in {𝑞 +𝑗 } and {𝑞𝑙− } are matched to diagonal elements whenever
necessary. Then, the 𝑝𝑡ℎ Wasserstein distance W𝑝 is defined as

    W𝑝 (PD(X+ ), PD(X− )) = min_𝜙 ( ∑_𝑗 ∥ q𝑗+ − 𝜙 (q𝑗+ ) ∥∞^𝑝 )^{1/𝑝} , 𝑝 ∈ Z+ .
Here, ∥(𝑥1, 𝑦1 ) − (𝑥2, 𝑦2 )∥∞ = max{|𝑥1 − 𝑥2 |, |𝑦1 − 𝑦2 |} is called the supremum norm (or 𝑙∞ -norm). When 𝑝 = ∞, W∞ is called the bottleneck distance, i.e., W∞ (PD(X+ ), PD(X− )) = max𝑗 {∥q𝑗+ − 𝜙 (q𝑗+ )∥∞ } (See Figure 18).
In applications, the most common choices for the Wasserstein
distance are 𝑝 = 1, 2, and ∞. The bottleneck distance, unlike the
𝑝-Wasserstein distance, is insensitive to the number of points in
PDk (X ± ) and instead focuses on the distance between the farthest
points in the optimal matching. For instance, ignoring the diagonal,
if PDk (X + ) has 100 points and PDk (X − ) has only three points, the
bottleneck distance is primarily determined by finding the three
closest points in PDk (X + ) to those in PDk (X − ) and taking the
maximum of these distances. Conversely, when using 𝑝 = 1 or
𝑝 = 2, the quantity of points in PDk (X ± ) becomes significant as all
points contribute to the distances, even if only slightly. Therefore, Fig. 18. Wasserstein distances W1 and W∞ (bottleneck)
if there are only a few points in {PDk (Xj )} and the focus is on the between PD1 (8) and PD1 (9) for the point clouds of "8"
and "9" on the left.
distances of the most critical features, the bottleneck distance is
preferable. However, if there are many points in {PDk (Xj )} and the primary topological patterns arise from the quantity
and location of smaller features, then choosing 𝑝 = 1 (or 𝑝 = 2) for the Wasserstein distance would be more suitable.
There are several ways to utilize the Wasserstein distance in applications. One common approach is to bypass the
vectorization step, which requires some choices, and directly compare persistence diagrams. For instance, to determine
if two datasets have similar shapes or structures, one can compute the persistence diagrams for both datasets and
then calculate the Wasserstein distance between these diagrams. A smaller distance indicates more similar shapes.
Another application lies in unsupervised learning, where datasets are clustered based on their topological similarity
using persistence diagrams, utilizing the Wasserstein distance to measure distances between them.
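In practice, these distances are available in standard TDA libraries; below is a minimal sketch using GUDHI (the two toy diagrams are assumptions, and the Wasserstein routine additionally requires the optional POT package).

import numpy as np
import gudhi
from gudhi.wasserstein import wasserstein_distance  # needs the optional POT dependency

# Two toy persistence diagrams given as (birth, death) pairs.
pd_a = np.array([[0.3, 0.9], [0.6, 1.7]])
pd_b = np.array([[0.35, 0.8], [0.1, 0.2]])

print("bottleneck:", gudhi.bottleneck_distance(pd_a, pd_b))
print("1-Wasserstein:", wasserstein_distance(pd_a, pd_b, order=1, internal_p=np.inf))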


3.3 Integrating PDs to ML tasks 1: Vectorizations


To effectively use persistence diagrams (PDs) in ML applications, it is crucial to convert them into numerical or vector
formats, allowing for the seamless integration of TDA outputs into standard ML workflows. Although PDs capture the
birth and death of topological features within data, their variable size and structure make them challenging to handle
directly. For example, in a binary graph classification task with 1000 graphs (400 in Class A and 600 in Class B), PDs
might visually highlight differences between the classes. However, applying traditional statistical tools to PDs, which
are subsets of R2 , is problematic; for instance, it is not possible to compute averages or confidence intervals for each
class. Moreover, directly inputting PDs into ML models is impractical because their variable point count conflicts with
the fixed-size input required by most ML algorithms.
Vectorization addresses this by making PDs compatible with traditional ML and statistical techniques. Below, we
outline the most common vectorization methods used in practice. For a comprehensive review, refer to [6].
In the following, for a fixed dataset X, a threshold set {𝜖𝑖 }𝑛𝑖=1 , and an induced filtration {K𝑖 }𝑛𝑖=1 , we will introduce
several vectorization methods. It is important to note that all these vectorization methods are applicable to any type of
data since they simply transform a given PD into vectors. However, the effectiveness and common usage of certain
vectorization methods can vary depending on the data type and the density or sparsity of the PDs. Although we present
various vectorization methods and their configurations here, those who prefer to avoid the intricacies of vectorization
selection and hyperparameter tuning can opt for automated approaches, as detailed in Section 3.4.

Function vs. Vector Format. In the following, we introduce several vectorization methods. For each method, we will
describe how to convert a given PD into a vector or a function depending on the setting. Depending on the context,
one of these formats might be preferable (visualization, ML input, etc.), however, both formats can be easily converted
to each other. For example, if we have a function 𝑓 : [𝜖1, 𝜖𝑛 ] → R defined over an interval, we can convert 𝑓 into
an 𝑁 -dimensional vector by sampling the values 𝑓 (𝑡𝑗 ), where 𝑡𝑗 = 𝜖1 + (𝑗/𝑁 )(𝜖𝑛 − 𝜖1 ) and 𝑗 ranges from 1 to 𝑁 . For each point 𝑡𝑗 , we calculate 𝑓 (𝑡𝑗 ). The vector v𝑓 is then v𝑓 = [𝑓 (𝑡1 ), 𝑓 (𝑡2 ), . . . , 𝑓 (𝑡𝑁 )]. Conversely, for a given 𝑁 -dimensional vector v = [𝑣1 𝑣2 . . . 𝑣𝑁 ], we can convert it to a function via a step function or linear spline interpolation, e.g., 𝑔v : [𝜖1, 𝜖𝑛 ] → R such that 𝑔v (𝑡𝑗 ) = 𝑣𝑗 for 𝑡𝑗 = 𝜖1 + (𝑗/𝑁 )(𝜖𝑛 − 𝜖1 ). Then, for any point 𝑠 ∈ [𝑡𝑗 , 𝑡𝑗+1 ], we define the linear extension 𝑔(𝑠) = 𝑔(𝑡𝑗 ) + [(𝑔(𝑡𝑗+1 ) − 𝑔(𝑡𝑗 ))/(𝑡𝑗+1 − 𝑡𝑗 )] (𝑠 − 𝑡𝑗 ). Hence, any of the following vectorizations can be taken as a function or vector depending on the need.

Topological Dimensions and Final Topological Vector. In each vectorization method, we convert a persistence diagram
PDk (X) into a vector (or array) v𝑘 where 𝑘 represents a topological dimension of 𝑘-holes. In particular, each persistence
diagram produces a different vector. The user needs to decide which topological dimensions to be used in the method.
After getting an 𝑁 -dimensional vector for each dimension, a common method is to concatenate them to obtain the
final topological vector. Hence, if we use 𝑚 different topological dimensions, we have (𝑚 · 𝑁 )-dimensional final vector
v(X) = v0 ∥ v1 ∥ . . . ∥ v𝑚−1 . In most applications, the common dimensions used are 𝑘 = 0 and 𝑘 = 1, and hence the
final topological vector is 2𝑁 -dimensional, i.e., v(X) = v0 ∥ v1 .

Betti vectors. One of the most straightforward and interpretable vectorization methods in TDA is the Betti function.
The 𝑘 th Betti number of a topological space K essentially counts the total number of 𝑘-dimensional holes in K. More
formally, it is defined as 𝛽𝑘 (K) = rank(𝐻𝑘 (K)), which is the rank of the 𝑘 th homology group of K. For example, 𝛽 0 (K)
represents the number of connected components in K, while 𝛽 1 (K) indicates the number of 1-dimensional holes.

For a given filtration {K𝑖 }𝑛1 , we define the 𝑘 𝑡ℎ Betti vector as 𝛽®𝑘 (X) = [𝛽𝑘 (K1 ) 𝛽𝑘 (K2 ) . . . 𝛽𝑘 (K𝑛 )]. Therefore, for each topological dimension 𝑘, we obtain a Betti vector whose length equals the number of thresholds.
For example, 𝑘 = 0, 1 with 𝑛 = 50 results in 50-dimensional vec-
tor for each dimension. In ML applications, typically, these vectors
are concatenated to create a 100-dimensional topological vector as
output.

Example 3. Consider the point cloud X in Figure 16. We define


a filtration using five threshold values 𝜖b = [0, 0.25, 0.75, 1.5, 1.75],
where K𝑖 = N𝜖𝑖 (X). The Betti-0 vector 𝛽®0 (X) = [35, 20, 1, 1, 1] is obtained, with 𝛽0 (X) = 35, 𝛽0 (N0.25 (X)) = 20, and 𝛽0 (N𝜖 (X)) = 1 for 𝜖 ≥ 0.7. This indicates that N0.25 (X) consists of 20 connected components. The Betti-0 vector can be easily determined from the persistence barcode PB0 (X) (Figure 16f) by counting the number of red bars intersected by a vertical line 𝑥 = 𝜖𝑖 , representing the number of persistent topological features at the threshold 𝜖𝑖 . Thus, with 5 thresholds, we obtain a 5-dimensional vector, 𝛽®0 (X).
Similarly, we can determine the corresponding Betti-1 vector by examining the blue bars in the persistence barcode (Figure 16f). We observe no blue bar at 𝑥 = 0, 0.25, and 1.75. Furthermore, there are two blue bars at 𝑥 = 0.75 and one blue bar at 𝑥 = 1.5. Therefore, the Betti-1 vector is 𝛽®1 (X) = [0, 0, 2, 1, 0]. This indicates that there are no 1-dimensional holes (1-holes) in X = N0 (X), N0.25 (X), or N1.75 (X), while N0.75 (X) contains two 1-holes and N1.5 (X) contains one. It is also common to use Betti vectors as Betti functions via step functions (See Figure 19). Similarly, using 21 equally spaced thresholds 𝜖b = {0, 0.1, 0.2, . . . , 2} would yield 21-dimensional vectors 𝛽®0 and 𝛽®1 .

Fig. 19. Betti Function. The step function represents the first Betti number (𝛽1 ) over the figure-eight-shaped point cloud of Figure 16 as a function of the threshold 𝜖 . The function shows the number of 1-dimensional holes (loops) in the point cloud. Initially, there are no loops (𝛽1 = 0), then a single loop appears at 𝜖 = 0.35 and persists until 𝜖 = 0.94, during which another loop appears at 𝜖 = 0.55 and vanishes at 𝜖 = 0.94. The final loop disappears at 𝜖 = 1.70, returning 𝛽1 to 0.

We note that Betti vectors do not require the computation of persistence diagrams. Therefore, there are computa-
tionally more effective ways to produce Betti vectors [7, 52]. Another favorable aspect of Betti vectors is their ease of
interpretation. Simply put, 𝛽𝑘 (𝜖𝑖 ) is equal to the number of 𝑘-dimensional topological features in K𝑖 .
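Since 𝛽𝑘 (𝜖𝑖 ) simply counts the bars alive at 𝜖𝑖 , a Betti vector can be computed directly from a persistence diagram; the following is a small self-contained sketch that reproduces 𝛽®1 (X) from Example 3.

import numpy as np

def betti_vector(diagram, thresholds):
    """Count features alive at each threshold: birth <= t < death."""
    diagram = np.asarray(diagram, dtype=float)
    return np.array([int(np.sum((diagram[:, 0] <= t) & (t < diagram[:, 1])))
                     for t in thresholds])

# PD1 of the 8-shaped point cloud: two loops [0.3, 0.9) and [0.6, 1.7).
pd1 = [(0.3, 0.9), (0.6, 1.7)]
print(betti_vector(pd1, thresholds=[0, 0.25, 0.75, 1.5, 1.75]))  # -> [0 0 2 1 0]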

Persistence Landscapes. Persistence Landscapes are one of the first vectorization methods in TDA, introduced by P. Bubenik, directly utilizing the lifespan information [15]. In particular, in this vectorization, the points away from the diagonal (large features) are easily distinguished and promoted. For a given persistence diagram PDk (X) = {(bi , di )}, we first define generating functions Λ𝑖 for each (𝑏𝑖 , 𝑑𝑖 ) ∈ PDk (X), i.e., Λ𝑖 : [𝑏𝑖 , 𝑑𝑖 ] → R is a piecewise linear function obtained by two line segments starting from (𝑏𝑖 , 0) and (𝑑𝑖 , 0) and connecting to the same point ((𝑏𝑖 + 𝑑𝑖 )/2, (𝑑𝑖 − 𝑏𝑖 )/2). Then, we define several piecewise-linear functions {𝜆𝑚 } in the following way. For each 𝑡 ∈ [𝜖1, 𝜖𝑛 ], we check all generating functions {Λ𝑖 (𝑡)} and mark the 𝑚𝑡ℎ largest value. In particular, the 𝑚𝑡ℎ Persistence Landscape function 𝜆𝑚 (X) : [𝜖1, 𝜖𝑛 ] → R is defined as 𝜆𝑚 (X)(𝑡) = 𝑚𝑡ℎ max𝑖 {Λ𝑖 (𝑡)} (See Figure 20). Note that the first and second persistence landscapes are the most commonly used vectors, and they are used with concatenation in the applications.

Fig. 20. Persistence Landscape (PL). The plot shows the PL for PD1 (X) for the 8-shaped point cloud filtration in Example 3. In the plot, the blue function corresponds to 𝜆1 , and the orange function to 𝜆2 .
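A minimal sketch of persistence landscapes with GUDHI's representations module (the diagram and parameter values are illustrative):

import numpy as np
from gudhi.representations import Landscape

pd1 = np.array([[0.3, 0.9], [0.6, 1.7]])  # PD1 from Example 3

# First two landscape functions, each sampled at 100 points.
landscape = Landscape(num_landscapes=2, resolution=100)
vec = landscape.fit_transform([pd1])  # shape: (1, 2 * 100)
print(vec.shape)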


Silhouettes. While persistence landscapes are among the first vectorizations to effectively utilize lifespan information, the need to consider 𝑚 different maxima makes it difficult to use in ML applications. In [27], Chazal et al. proposed a practical modification called the Silhouette for persistence landscapes. This modification introduces a tuning parameter 𝑝 to better utilize the lifespans of topological features. For a persistence diagram PDk (X) = {(bi , di )}𝑁𝑖=1 , let Λ𝑖 be the generating function for (𝑏𝑖 , 𝑑𝑖 ) as defined in the persistence landscapes. The Silhouette function Ψ is defined as:

    Ψ𝑘 (X)(𝑡) = ( ∑_{𝑖=1}^{𝑁} 𝑤𝑖 Λ𝑖 (𝑡) ) / ( ∑_{𝑖=1}^{𝑁} 𝑤𝑖 ), 𝑡 ∈ [𝜖1, 𝜖𝑛 ],

where the weight 𝑤𝑖 is usually chosen as the lifespan (𝑑𝑖 − 𝑏𝑖 )^𝑝 . Thus, the 𝑝-silhouette function Ψ𝑘^𝑝 (X) : [𝜖1, 𝜖𝑛 ] → R is defined as:

    Ψ𝑘^𝑝 (X)(𝑡) = ( ∑_{𝑖=1}^{𝑁} (𝑑𝑖 − 𝑏𝑖 )^𝑝 Λ𝑖 (𝑡) ) / ( ∑_{𝑖=1}^{𝑁} (𝑑𝑖 − 𝑏𝑖 )^𝑝 ).

Fig. 21. Silhouette Functions. The plot shows the silhouette functions with tuning parameters 𝑝 = 0.5, 1, 2 for PD1 (X) for the 8-shaped point cloud filtration given in Example 3.

The tuning parameter 𝑝 is crucial as it adjusts the silhouette function's emphasis on topological features with varying lifespans. When 𝑝 < 1, shorter lifespans (𝑑𝑖 − 𝑏𝑖 ) are given more weight, emphasizing smaller features. Conversely, shorter lifespans are down-weighted when 𝑝 > 1, highlighting larger features. Common choices for 𝑝 are 1/2, 1, and 2 (See Figure 21). If the persistence diagram has a few points and the goal is to emphasize significant features, 𝑝 = 2 would be a good choice. If there are many points and the key information comes from smaller features, 𝑝 = 1/2 is considered more suitable. Similarly, Silhouette functions Ψ𝑘^𝑝 (X) can be converted to vectors for each dimension 𝑘, and concatenation of these vectors can be used in the application as the topological vector of the dataset.
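A corresponding sketch for silhouettes, again with GUDHI's representations module (the lifespan-squared weight realizes the 𝑝 = 2 case; all values are illustrative):

import numpy as np
from gudhi.representations import Silhouette

pd1 = np.array([[0.3, 0.9], [0.6, 1.7]])  # PD1 from Example 3

# p = 2 silhouette: weight each generating function by its lifespan squared.
sil = Silhouette(resolution=100, weight=lambda pt: (pt[1] - pt[0]) ** 2)
vec = sil.fit_transform([pd1])  # shape: (1, 100)
print(vec.shape)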

Persistence Curves. The Persistence Curve (PC) framework, introduced by Chung et al. [31], offers a structured approach to the vectorization process. The core idea revolves around utilizing a generating function 𝜓 , a summary statistic 𝑇 , and a filtration parameter 𝜖. From these elements, a PC is defined as:

    𝑃𝐶 (D,𝜓,𝑇 ) (𝜖) = 𝑇 ([𝜓 (D; 𝑏, 𝑑, 𝜖) | (𝑏, 𝑑) ∈ D𝜖 ]), 𝜖 ∈ R.

Here, D𝜖 represents the points within the persistence diagram D that fall inside a region Δ𝜖 varying with the filtration value 𝜖. The function 𝜓 takes as input a point from the persistence diagram D and the filtration parameter 𝜖, and outputs a real number. This function can be chosen to prioritize certain points in the PD or to emphasize specific features. Finally, the statistic 𝑇 acts upon a multiset of values, aggregating them into a single real number. Common examples include sum, mean, or max.

Fig. 22. Persistence Curves. The plot shows the persistence curves for PD1 (X) for the 8-shaped point cloud filtration in Example 3 by using two different summary statistics: sum and mean. The dashed line represents the persistence curve using the sum of lifespans, which emphasizes the cumulative contribution of all loops. The solid line represents the persistence curve using the mean of lifespans, which highlights the average significance of the loops. This visualization illustrates how different summary statistics influence the interpretation of topological features in the data.

PCs provide a general and unifying framework for vectorization methods. By selecting different combinations of 𝜓 and 𝑇 , one can generate various functional summaries of the PD, each potentially


highlighting different aspects of the data. Many established vectorizations of PDs can be represented within the PC
framework, e.g., Betti curves, persistence landscapes, silhouettes, and entropy curves. See [31] for more details.
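The PC template is easy to prototype; the sketch below takes D𝜖 to be the bars alive at 𝜖 and uses the lifespan as 𝜓 with sum or mean as 𝑇 (these particular choices are assumptions made for illustration).

import numpy as np

def persistence_curve(diagram, thresholds, psi=lambda b, d, t: d - b, T=np.sum):
    """Generic persistence curve: apply psi to the bars alive at each threshold,
    then aggregate the resulting values with the summary statistic T."""
    diagram = np.asarray(diagram, dtype=float)
    curve = []
    for t in thresholds:
        alive = diagram[(diagram[:, 0] <= t) & (t < diagram[:, 1])]
        curve.append(T([psi(b, d, t) for b, d in alive]) if len(alive) else 0.0)
    return np.array(curve)

pd1 = [(0.3, 0.9), (0.6, 1.7)]
ts = np.linspace(0, 2, 9)
print(persistence_curve(pd1, ts))             # sum of lifespans
print(persistence_curve(pd1, ts, T=np.mean))  # mean of lifespans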

Persistence Images. Our next vectorization is Persistence Images, introduced by Adams et al. [2]. Unlike most vectorizations, Persistence Images, as the name suggests, produce 2𝐷-arrays (tensors). The idea is to capture the location of the points in the PDs with a multivariable function by using 2𝐷 Gaussian functions centered at these points. For 𝑃𝐷 (G) = {(𝑏𝑖 , 𝑑𝑖 )}, let 𝜙𝑖 represent a 2𝐷-Gaussian centered at the point (𝑏𝑖 , 𝑑𝑖 ) ∈ R2 . Then, one defines a multivariable function, the Persistence Surface, 𝜇̃ = ∑𝑖 𝑤𝑖 𝜙𝑖 , where 𝑤𝑖 is the weight, mostly a function of the lifespan 𝑑𝑖 − 𝑏𝑖 . To represent this multivariable function as a 2𝐷-vector, one defines a 𝑘 × 𝑙 grid (resolution size) on the domain of 𝜇̃, i.e., the threshold domain of 𝑃𝐷 (G). Then, one obtains the Persistence Image, a 2𝐷-vector (matrix) 𝜇® = [𝜇𝑟𝑠 ] of size 𝑘 × 𝑙 such that

    𝜇𝑟𝑠 = ∫_{Δ𝑟𝑠} 𝜇̃(𝑥, 𝑦) 𝑑𝑥 𝑑𝑦, where Δ𝑟𝑠 is the pixel with index 𝑟𝑠 in the 𝑘 × 𝑙 grid.

Fig. 23. Persistence Images (PI). PIs are vectorizations with 2D (matrix) output. Here, we give examples of PIs for PD1 (X) (the 8-shaped point cloud in Example 3) with different spread values (𝜎 = 0.1, 0.2, 0.5). The spread parameter 𝜎 controls the standard deviation of the Gaussian functions used to smooth the persistence points in the persistence diagram. As illustrated in the figures, as 𝜎 increases, the generating Gaussian functions are even more spread out, producing a flatter persistence image.
Note that the resolution size 𝑘 × 𝑙 is independent of the number of thresholds used in the filtering; the choice of 𝑘
and 𝑙 is completely up to the user. There are two other important tuning parameters for persistence images, namely the
weight 𝑤𝑖 and the variance 𝜎 (the width of the Gaussian functions). Like Silhouettes, one can choose 𝑤𝑖 = (𝑑𝑖 − 𝑏𝑖 ) 𝑝 to
emphasize large or small features in the PD. Similarly, the width parameter 𝜎 determines the sharpness of Gaussian,
where smaller 𝜎 would make the Gaussian functions more like Dirac 𝛿-function, and larger 𝜎 would make the Gaussians
flat. Depending on the context, 𝜎 can be chosen a constant (e.g. 𝜎 = 0.1) or depending on the point (𝑏𝑖 , 𝑑𝑖 ), e.g.,
𝜎𝑖 = 𝑘 (𝑑𝑖 − 𝑏𝑖 ) for some constant 𝑘 > 0.
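A minimal sketch of a persistence image with GUDHI's representations module (resolution, bandwidth, and the weight are illustrative; note that GUDHI evaluates the weight on (birth, persistence) coordinates after its internal transform, so the second coordinate plays the role of the lifespan — treat this detail as an assumption to verify against the library documentation):

import numpy as np
from gudhi.representations import PersistenceImage

pd1 = np.array([[0.3, 0.9], [0.6, 1.7]])  # PD1 from Example 3

# 20 x 20 persistence image with Gaussian spread 0.2 and a lifespan-style weight.
pi = PersistenceImage(bandwidth=0.2, resolution=[20, 20], weight=lambda pt: pt[1])
img = pi.fit_transform([pd1])  # shape: (1, 400); reshape to (20, 20) to display
print(img.shape)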

Kernel Methods. Kernel methods provide an alternative to traditional direct vectorization techniques used to transform
PDs into formats suitable for ML algorithms. In ML, these methods utilize a mathematical function known as a kernel
to analyze data and discern patterns. Unlike earlier approaches that directly represent PDs using vectors or functions,
kernel methods compute a similarity score as an inner product between pairs of PDs in a high-dimensional space without
explicitly mapping the data. This approach is advantageous because it adapts well to ML techniques like support vector
machines (SVMs) and kernel principal component analysis (KPCA). Therefore, kernel methods are well-suited for tasks
such as classification, regression, and principal component analysis.
One common such method is the persistence weighted Gaussian kernel (PWGK) [63]. It enhances the Gaussian
kernel by incorporating weights based on the significance of topological features. This approach amplifies the influence
of important features while reducing noise (i.e., short-lived holes) impact. For instance, it assigns weights to points
𝑝 = (𝑏, 𝑑) ∈ PD(X) proportional to their lifespans 𝑤 (𝑝) = (𝑑 − 𝑏). In particular, PWGK is defined as

    K(PD(X+ ), PD(X− )) = ∑_{p ∈ PD(X+ )} ∑_{q ∈ PD(X− )} w(p) w(q) k(p, q),

where 𝑘 (𝑝, 𝑞) = exp( −∥𝑝 − 𝑞∥² / (2𝜎²) ) denotes the Gaussian kernel function.
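The PWGK formula above can be implemented directly; the following is an unoptimized sketch with toy diagrams and an illustrative choice of 𝜎.

import numpy as np

def pwgk(pd_a, pd_b, sigma=0.5):
    """Persistence weighted Gaussian kernel with lifespan weights w(p) = d - b."""
    total = 0.0
    for b1, d1 in np.asarray(pd_a, dtype=float):
        for b2, d2 in np.asarray(pd_b, dtype=float):
            gauss = np.exp(-((b1 - b2) ** 2 + (d1 - d2) ** 2) / (2 * sigma ** 2))
            total += (d1 - b1) * (d2 - b2) * gauss
    return total

pd_a = [(0.3, 0.9), (0.6, 1.7)]
pd_b = [(0.35, 0.8), (0.1, 0.2)]
print(pwgk(pd_a, pd_b))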

Another significant approach is the sliced Wasserstein kernel (SWK) [25], which computes the Wasserstein distance
between PDs and integrates it into a kernel framework. This method employs Optimal Transport Theory to establish a
meaningful metric for comparing PDs. Although kernel methods can yield better results in some settings, they can
be computationally intensive and impractical for large datasets due to the high computational costs associated with
computing the kernel matrix. In particular, computing kernels takes quadratic time in the number of diagrams, while
vectorizing PDs takes only linear time.

3.3.1 Stability. In most applications, the stability of vectorization is vital for statistical and inferential tasks. Essentially,
stability means that a small change in the persistence diagram (PD) should not lead to a significant change in its
vectorization. In particular, if two PDs, PDk (X + ) and PDk (X − ), are close, their corresponding vectorizations, 𝛽®𝑘 (X + )
and 𝛽®𝑘 (X − ), should also be close. This ensures that the vectorization process preserves the structural properties of the
data. Therefore, when two persistence diagrams PDk (X ± ) are similar, it implies that the datasets X + and X − share
similar shape characteristics. If these datasets are intuitively expected to belong to the same class, their vectorizations
𝛽®𝑘 (X ± ) should likewise remain close.
To formalize this concept, we need to define what constitutes a "small/big change" or what it means for PDs to be
close. This requires a distance (metric) in the space of PDs, with the most common being the Wasserstein distance (or
matching distance) as defined in Section 3.2.2. Similarly, we need a metric for the space of vectorizations, such as the
Euclidean distance ∥ 𝛽®(X+ ) − 𝛽®(X− )∥ for 𝛽® ∈ R𝑁 . Thus, for a given vectorization 𝛽®, we call it stable (with respect to
the Wasserstein-𝑝 metric) if it satisfies the stability equation

∥ 𝛽®𝑘 (X + ) − 𝛽®𝑘 (X − )∥ ≤ 𝐶 · W𝑝 (PDk (X + ), PDk (X − ))

This ensures that if W𝑝 (PDk (X + ), PDk (X − )) (the distance between PDs) is small, then the distance ∥ 𝛽®𝑘 (X + ) −
𝛽®𝑘 (X − ) ∥ between the corresponding vectorizations will also be small. Among the methods described earlier, persistence
landscapes, silhouettes, persistence images, and most kernel methods are stable vectorizations, while Betti functions
are generally unstable.

3.3.2 Choice of Vectorization and Hyperparameters. The choice of vectorization method should align with the char-
acteristics of your data and the problem at hand. If your data contains a few prominent topological features that are
crucial to the task, Silhouettes with 𝑝 ≥ 2 or Persistence Images may be the most suitable options. These methods are
also effective when dealing with noisy data, allowing you to filter out less significant features. Conversely, if your data
generates a high number of small features, where the task hinges on their location and density—in other words, when
the noise itself carries important information—Betti curves, Persistence curves, Silhouettes with 𝑝 ≤ 0.5 and kernel
methods are likely to yield strong performance. On the other hand, for point cloud data, Turkes et al. [107] provide
valuable insights on how to effectively utilize PH for shape recognition. Lastly, if interpretability is a priority, Betti
Curves stands out as the most interpretable vectorization method. Note that most vectorizations are computationally
efficient and require minimal time compared to the computation of PDs.

Hyperparameters: Each vectorization method comes with its own set of hyperparameters that need to be carefully
tuned to maximize performance. In all of them, the choice of thresholds is one of the key hyperparameters we discussed
in Section 3.1.4.
In addition to thresholds, many vectorization methods have tuning parameters that are used to adjust sensitivity
to topological noise. Typically, topological features with short lifespans (i.e., those close to the diagonal in PDs) are

regarded as noise. These features, represented as pairs {(𝑏, 𝑑)} with short lifespans (𝑑 − 𝑏), are generally easy to detect,
leading to many tuning parameters being tied to the lifespan. For example, in persistence landscapes, the number of
landscapes (𝑚) specifies how many maxima are included in the representation. A higher number of landscapes provides
a richer depiction of topological features but increases computational complexity. The first few landscapes (e.g., 1𝑠𝑡
and 2𝑛𝑑 landscapes) emphasize features with the largest lifespans. In silhouettes, the tuning parameter 𝑝 in the weight
function 𝑤𝑖 = (𝑑𝑖 − 𝑏𝑖 ) 𝑝 determines the emphasis of the vectorization. A higher 𝑝 (e.g., 𝑝 ≥ 2) de-emphasizes features
with short lifespans, while a smaller 𝑝 (e.g., 𝑝 ≤ 0.5) increases the importance of topological noise.
For persistence images, the spread (𝜎) controls the width of the Gaussian kernels applied to each persistence point.
A smaller spread results in sharper, more localized features, whereas a larger spread yields smoother images. The
resolution parameter sets the number of pixels in the persistence image, thus determining the output dimension of the
vectorization. Higher resolution captures finer detail but at a higher computational cost. Similar to silhouettes, a weight
function (e.g., linear) can be applied to weigh the contribution of each persistence point based on its significance, such
as its lifespan.
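As an illustration, the sketch below sets these hyperparameters using the gudhi.representations module; the class and argument names follow that library, but exact conventions (e.g., the coordinates in which persistence-image weights are evaluated) should be checked against the library documentation. Here, diagrams is a toy list containing one persistence diagram:

```python
import numpy as np
from gudhi.representations import Landscape, Silhouette, PersistenceImage, BettiCurve

diagrams = [np.array([[0.0, 1.2], [0.3, 0.5], [0.4, 2.0]])]   # one toy diagram of (birth, death) pairs

# Number of landscapes m: the first landscapes emphasize the most persistent features.
land = Landscape(num_landscapes=3, resolution=100).fit_transform(diagrams)

# Silhouette power p: p >= 2 suppresses short-lived features, p <= 0.5 emphasizes them.
sil_p2 = Silhouette(resolution=100, weight=lambda pt: (pt[1] - pt[0]) ** 2).fit_transform(diagrams)
sil_p05 = Silhouette(resolution=100, weight=lambda pt: (pt[1] - pt[0]) ** 0.5).fit_transform(diagrams)

# Persistence image: bandwidth = spread of the Gaussians, resolution = output size,
# plus a weight function emphasizing more significant points.
pim = PersistenceImage(bandwidth=0.1, resolution=[20, 20],
                       weight=lambda pt: pt[1]).fit_transform(diagrams)

# Betti curve: only the number of samples along the threshold axis needs to be chosen.
betti = BettiCurve(resolution=100).fit_transform(diagrams)

print(land.shape, sil_p2.shape, sil_p05.shape, pim.shape, betti.shape)
```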

3.4 Integrating PDs to ML tasks 2: Using Neural Networks


While vectorization choices provide greater control over the model and data analysis, there are automated methods
to avoid dealing with vectorization choices or hyperparameter tuning. These methods optimize the selection for
downstream tasks by either finding the best vectorization within a large vectorization space or working directly with
persistence diagrams as point clouds, allowing neural networks to handle the data.

3.4.1 Vectorization with NN: Perslay. PersLay, introduced by Carrière et al. [24], is an automated vectorization framework
for persistence diagrams, which decides the best vectorization method for the downstream task. The method leverages a
neural network architecture that processes PDs via a specialized layer called the PersLay layer. This layer incorporates
a variety of vectorization strategies, providing a unified and flexible approach to extracting topological features.
The PersLay layer is composed of several key components:
• Weighting Functions: These assign importance to points in the persistence diagram. Common choices include
exponential weighting and Gaussian weighting, which can be learned during training.
• Transformation Functions: These functions, such as linear transformations or learned neural networks, apply to
the coordinates of the persistence points to encode meaningful geometric and topological information.
• Symmetric Functions: After transformation, symmetric functions like sum, mean, or max aggregate the transformed
points into a fixed-size vector, ensuring permutation invariance.
With this architecture, PersLay covers a wide range of known vectorization methods within its framework, i.e., Per-
sistence Landscapes, Persistence Images, Silhouettes, Betti Curves, Entropy Functions and other Persistence Curves [31].
By incorporating various vectorization techniques, PersLay can adapt to diverse data types and topological features,
making it a robust and versatile tool for integrating PH into ML workflows. For an alternative method involving slightly
different generating functions for learnable vectorizations, see [56] by Hofer et al.

3.4.2 PD Learning with NN: PointNet. An alternative automated method to leverage PDs in ML tasks without hyper-
parameter tuning is to directly utilize PointNet [88] or similar neural network architectures to process and analyze
point cloud data, e.g., DeepSets [118], PointMLP [70]. Unlike conventional techniques that transform point clouds
into regular grids or structured formats, these networks treat PDs as sets of points in R2 . Perslay and PointNet offer

fundamentally different strategies for incorporating PDs into ML models without requiring vectorization or hyper-
parameter optimization. The choice between these methods depends on the specific task, as there is no universally
superior approach.
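To illustrate the idea shared by these architectures, namely transforming each diagram point, weighting it, and aggregating with a permutation-invariant operation, the following PyTorch sketch implements a generic set-pooling layer. It is not the actual PersLay or PointNet code, only a minimal stand-in for the weighting, transformation, and symmetric-function components described above:

```python
import torch
import torch.nn as nn

class SetPoolingLayer(nn.Module):
    """Embed a persistence diagram (a set of points in R^2) into a fixed-size vector."""
    def __init__(self, hidden_dim=32, out_dim=16):
        super().__init__()
        self.point_transform = nn.Sequential(            # transformation applied to each point
            nn.Linear(2, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, out_dim))
        self.point_weight = nn.Sequential(               # learned weighting of each point
            nn.Linear(2, 1), nn.Sigmoid())

    def forward(self, diagram):                          # diagram: (n_points, 2) tensor
        feats = self.point_transform(diagram)            # (n_points, out_dim)
        w = self.point_weight(diagram)                   # (n_points, 1)
        return (w * feats).sum(dim=0)                    # symmetric aggregation (sum)

pd = torch.tensor([[0.0, 1.0], [0.1, 0.4], [0.2, 0.9], [0.3, 0.5], [0.0, 2.0]])
print(SetPoolingLayer()(pd).shape)                       # torch.Size([16]), independent of point order
```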

3.5 Computational Complexity for Persistent Homology


The computational complexity of persistent homology (PH) is strongly influenced by the choice of filtration complex,
as it is directly related to the size, |K |, of the simplicial complex K, i.e., the number of simplices it contains. Early PH
algorithms had a cubic complexity with respect to the number of simplices, i.e., O (|K | 3 ) [75]. Subsequent advancements
have reduced this exponent to 𝑤 = 2.376, i.e., O (|K | 𝑤 ) [81].
Thus, the computational challenge boils down to the number of simplices in a given simplicial complex K. For a
point cloud with 𝑁 points, the worst-case scenario for the Rips or Čech complex involves up to 2^𝑁 simplices across
all dimensions. However, in PH, the computation of 𝑑-dimensional topological features requires only simplices up
to dimension 𝑑 + 1. Higher-dimensional simplices do not contribute to the calculation of 𝐻𝑑 (K). For instance, when
focusing on 0- and 1-dimensional features, only simplices up to dimension 2 are needed.
In a point cloud of size 𝑁 , for a Rips complex, the number of 0-simplices (vertices) is 𝑁 , the number of possible
1-simplices (edges) is 𝑁 (𝑁 − 1)/2, and the number of possible 2-simplices (triangles) is 𝑁 (𝑁 − 1) (𝑁 − 2)/6. Recall
that if we include all dimensions, the possible number of simplices would be 2^𝑁 . Increasing the dimension 𝑑
for topological features thus significantly raises computational costs, which is why most ML applications limit the
maximum dimension for simplices in the filtration (𝑑 + 1) to 2 or 3, which is enough to calculate up to 1- or 2-holes,
respectively.
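In practice, this truncation corresponds to a single parameter in most libraries. The short GUDHI sketch below builds a Rips filtration whose simplices are capped at dimension 2, which is sufficient for computing 𝐻0 and 𝐻1 while avoiding the combinatorial blow-up of higher-dimensional simplices:

```python
import numpy as np
import gudhi

points = np.random.default_rng(42).uniform(size=(300, 3))        # point cloud with N = 300

rips = gudhi.RipsComplex(points=points, max_edge_length=0.4)
st = rips.create_simplex_tree(max_dimension=2)                    # vertices, edges, and triangles only

print("number of simplices:", st.num_simplices())                 # far smaller than 2^300
diagram = st.persistence()                                         # 0- and 1-dimensional features
```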
The discussion above primarily applies to Rips and Čech complexes, but PH computations can be significantly
more efficient when using cubical complexes, which are commonly employed in image data. For 2D images, the time
complexity of PH is approximately O (|P |𝑟 ), where 𝑟 ≈ 2.37 and |P | represents the total number of pixels [74]. Practically,
this implies that the computational cost of PH scales almost quadratically with the image size. For higher-dimensional
images, alternative methods for efficient computation exist [112].
In recent years, several works have been published to improve the scalability and computational efficiency of PH. One
approach focuses on developing alternative methods for more efficient computation of persistence diagrams [46, 61],
while another aims to sparsify datasets while preserving topological information [3, 71, 81].

3.6 Software Libraries for Persistent Homology


Several software libraries provide tools for computing persistent diagrams and vectorizations across different types of
data structures, particularly point clouds and graphs (see Table 1). Notable among these are GUDHI, DIONYSUS, and
RIPSER, which have been compared in a benchmarking study [101].
Most TDA libraries require point cloud data as input, rather than image or network (graph) data. GUDHI can process
images, but only after they are converted into a suitable format, such as a point cloud or a cubical complex. Giotto-TDA
extends the capability to graphs through its VietorisRipsPersistence and SparseRipsPersistence for undirected graphs
and FlagserPersistence for directed ones. However, a common challenge in graph-based TDA is the reliance on shortest
path (geodesic) distances, which may not always capture the most relevant topological features. Giotto-TDA also
supports the input of images and time series data, enabling the computation of various vectorizations. Beyond PH
computations, the Persim library from Scikit-TDA offers a suite of tools for further analysis of persistence diagrams,

including metrics like the Bottleneck distance and visualizations like Persistence Landscapes and Persistence Images,
enhancing the interpretability of TDA results.
In terms of usage, most libraries (e.g., TDA in R) were language-specific bindings of the underlying GUDHI, Dionysus,
and PHAT libraries, which were originally implemented in Matlab or C++ for efficiency and acted as the earliest
and most established workhorses for TDA. In recent years, Python implementations like Giotto-TDA have become
increasingly popular, and the widely used ML library Scikit-learn has contributed to this trend by creating a new set of
tools called Scikit-TDA for TDA integration (https://github.com/scikit-tda).
It is worth mentioning that since TDA has been developed and extended primarily by mathematicians, the popularity
of R within this community has made the TDA R library particularly important. Mathematicians often release advanced
models and methods that can be directly connected to TDA research in R first, leading to a delay in their implementation
in other programming languages.
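As a minimal illustration of the libraries in Table 1, the sketch below computes persistence diagrams for a point cloud with Ripser.py and with Giotto-TDA; the calls follow their documented interfaces, though exact signatures may vary slightly across versions:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(200, 2))           # a toy point cloud

# Ripser.py (Scikit-TDA): returns one diagram per homology dimension.
from ripser import ripser
dgms = ripser(X, maxdim=1)["dgms"]

# Giotto-TDA: scikit-learn style transformer over a batch of point clouds.
from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence(homology_dimensions=[0, 1])
diagrams = VR.fit_transform(X[None, :, :])   # array of (birth, death, homology_dim) triples per sample
```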

Table 1. Software Libraries for Persistent Homology. In the data column, P indicates support for Point cloud data, I for Image
data, and N for Network (graph) data.

Library Data Language Notable Feature Code URL


Giotto-TDA [105] P,N,I Python Scikit integration https://github.com/giotto-ai/giotto-tda
Gudhi [85] P,I C++/Python Well-established https://github.com/GUDHI/gudhi-devel
Dionysus2 P C++/Python Persistence types https://github.com/mrzv/dionysus
Ripser.py [106] P Python Integration with Scikit-TDA https://github.com/scikit-tda/ripser.py
Ripser [10] P C++ Standalone https://github.com/Ripser/ripser
JavaPLEX [103] P Matlab/Java Well-established https://github.com/appliedtopology/javaplex
TDA P R R ecosystem integration https://cran.r-project.org/package=TDA
PHAT [11] P Python/C++ Well-established https://bitbucket.org/phat-code/phat/src/master/

4 MULTIPARAMETER PERSISTENCE
Multiparameter Persistence (often referred to as Multipersistence) introduces a novel concept with the potential to
significantly enhance the performance of single-parameter persistence. The concept, introduced in the late 2000s by
Carlsson, Zomorodian, and Singh [19, 20], has since been actively investigated for real-world applications [14].
For original PH, the term single persistence is applied because we filter the data in just one direction, K1 ⊂ K2 ⊂ · · · ⊂ K𝑛 .
The filtration’s construction is pivotal in achieving a detailed analysis of the data to uncover hidden shape patterns. For
example, in the graph setting, when utilizing a single function 𝑓 : V → R containing crucial domain information (e.g.,
value for blockchain networks, atomic number for protein networks), it induces a single-parameter filtration as described
earlier. However, numerous datasets offer multiple highly relevant domain functions for data analysis. Similarly, in other
data types, there are several ways to expand the single filtration into a multifiltration to get finer information on the
topological patterns hidden in the data. The idea of using multiple filtration functions simultaneously to obtain a finer
decomposition of the dataset led to the suggestion of the MP theory as a natural extension of single persistence (SP).

Table 2. Generic Bifiltration

K𝑚1 ⊂ K𝑚2 ⊂ · · · ⊂ K𝑚𝑛
 ∪      ∪              ∪
 ...  ⊂  ...  ⊂ · · · ⊂  ...
 ∪      ∪              ∪
K21  ⊂ K22  ⊂ · · · ⊂ K2𝑛
 ∪      ∪              ∪
K11  ⊂ K12  ⊂ · · · ⊂ K1𝑛

4.1 Multifiltrations
Multifiltration refers to filtrations constructed using multiple scale parameters. A typical example of a bifiltration,
denoted as {K𝑖 𝑗 }, is illustrated in Table 2, where each row and column corresponds to a distinct filtration. This bifiltration
is generated by simultaneously varying one scale parameter along the horizontal axis and another along the vertical
axis. The specific construction of these filtrations depends on the data type, with additional parameters introduced
based on the nature of the data. Below, we summarize common methods for different data types. While the explanation
focuses on two parameters for simplicity, this approach generalizes to any number of parameters.

4.1.1 Multifiltrations for Graphs. In graph settings, a single filtration function results in a single-parameter filtration
Ĝ1 ⊂ · · · ⊂ Ĝ𝑁 = G. However, employing multiple filtration functions enables a more detailed analysis of the data.
For instance, two node functions 𝑓 : V → R and 𝑔 : V → R, which provide complementary information about
the network, can be combined using Multiparameter Persistence to generate a unique topological fingerprint. These
functions induce a multivariate filtration function 𝐹 : V → R2 defined by 𝐹 (𝑣) = (𝑓 (𝑣), 𝑔(𝑣)).
Next, we define sets of increasing thresholds {𝛼𝑖 }, 𝑖 = 1, . . . , 𝑚, for 𝑓 and {𝛽 𝑗 }, 𝑗 = 1, . . . , 𝑛, for 𝑔. Then, we have
V𝑖 𝑗 = {𝑣𝑟 ∈ V | 𝑓 (𝑣𝑟 ) ≤ 𝛼𝑖 , 𝑔(𝑣𝑟 ) ≤ 𝛽 𝑗 }, which can be written as V𝑖 𝑗 = 𝐹 −1 ((−∞, 𝛼𝑖 ] × (−∞, 𝛽 𝑗 ]). Let G𝑖 𝑗 be the
subgraph of G induced by V𝑖 𝑗 , meaning the smallest subgraph of G generated by V𝑖 𝑗 . This setup induces a bifiltration
of complexes {Ĝ𝑖 𝑗 | 1 ≤ 𝑖 ≤ 𝑚, 1 ≤ 𝑗 ≤ 𝑛}, which can be visualized as a rectangular grid of size 𝑚 × 𝑛 (see Figure 24).
For more details
on applying multipersistence in graph settings, refer to [30, 67]. Additionally, one can combine power filtration with
filtration by functions by applying power filtration to each subgraph in the sequence induced by sublevel filtration via
functions [37]. In Section 6.4, we give details of the utilization of MP method on graph setting for computer-aided drug
discovery.
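The following sketch (a simplified illustration, not the implementation of [30] or [67]) builds such a sublevel bifiltration on a toy graph with networkx, using node degree as 𝑓 and a synthetic second node function as 𝑔, and records the number of connected components (Betti-0) of every cell in the resulting grid:

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
f = dict(G.degree())                                    # first node function f (degree)
rng = np.random.default_rng(0)
g = {v: rng.random() for v in G}                        # second node function g (toy values)

alphas = sorted(set(f.values()))                        # thresholds for f
betas = np.linspace(0, 1, 5)                            # thresholds for g

betti0 = np.zeros((len(alphas), len(betas)), dtype=int)
for i, a in enumerate(alphas):
    for j, b in enumerate(betas):
        nodes = [v for v in G if f[v] <= a and g[v] <= b]
        G_ij = G.subgraph(nodes)                        # induced subgraph of cell (i, j)
        betti0[i, j] = nx.number_connected_components(G_ij)

print(betti0)   # an m x n grid of Betti-0 values over the bifiltration (cf. Section 4.2)
```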
Fig. 24. Multidimensional persistence on a graph network (original graph: left). Black numbers denote degree values whilst red
numbers show the edge weights of the network. Hence, shape properties are computed on two filtration functions (i.e., degree and
edge weight). While each row filters by degree, each column filters the corresponding subgraph using its edge weights. For each cell,
the lower left corners represent the corresponding threshold values. For each cell, B0 and B1 represent the corresponding Betti
numbers. The figure is adapted from [30].

4.1.2 Multifiltrations for Point Clouds. In the context of point clouds, a natural parameter to consider is the radius 𝑟 , as
discussed in Section 3.1.1. However, a point cloud P often has regions of varying density, with some areas being very
dense and others quite sparse. Additionally, a few outliers can significantly distort the topological signature due to the
way it is constructed. To address this issue, it is common to use a density parameter in conjunction with the radius
parameter when constructing the filtration. This parameter reflects the local density of points in the point cloud and
helps identify regions of different densities, distinguishing features significant in dense regions from those in sparse
ones. To achieve this, we first need to compute local densities. For each point 𝑝 ∈ P, we compute a density measure
𝛿 (𝑝), which could be based on the number of points within a fixed radius 𝑟𝑑 or kernel density estimation [14, 110].
Next, we define our threshold set {(𝑑𝑖 , 𝑟 𝑗 )}, where 𝑑 1 > 𝑑 2 > · · · > 𝑑𝑚 corresponds to thresholds for density, and
0 = 𝑟 1 < 𝑟 2 < · · · < 𝑟𝑛 = diam(P) are radius thresholds. We then define the multifiltration as follows. Using 𝛿 (𝑝), we
define a nested sequence of subsets P1 ⊂ P2 ⊂ · · · ⊂ P𝑚 = P, where P𝑖 = {𝑝 ∈ P | 𝛿 (𝑝) ≥ 𝑑𝑖 }. For each 𝑖 0 , we treat
P𝑖 0 as a separate point cloud and apply the Rips filtration process to it. Specifically, for each 𝑖 0 , we construct the Rips
filtration R𝑖 0 1 ⊂ R𝑖 0 2 ⊂ · · · ⊂ R𝑖 0𝑛 . This gives a bifiltration {K𝑖 𝑗 } as in Table 2, with K𝑖 𝑗 = R𝑖 𝑗 representing the Rips
filtration for point cloud P𝑖 with radius parameter 𝑟 𝑗 . e.g., the first column K𝑖1 = P𝑖 corresponds to radius 𝑟 1 = 0.
This multifiltration approach provides more detailed information by capturing topological patterns in both dense
and sparse regions, offering a richer understanding of the data’s underlying shape, especially when varying densities
and scales are crucial for the analysis. For further details, see [66, 110].
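A minimal sketch of this density–radius construction, assuming a k-nearest-neighbor density estimate and GUDHI for the Rips filtrations, could look as follows; each density threshold yields one row of the bifiltration grid in Table 2:

```python
import numpy as np
import gudhi
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
P = np.vstack([rng.normal(0, 0.1, (150, 2)),              # a dense cluster
               rng.uniform(-1, 1, (50, 2))])               # sparse background points

k = 10                                                      # density: inverse distance to the k-th neighbor
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(P).kneighbors(P)
density = 1.0 / dist[:, -1]

rows = []
for d in np.quantile(density, [0.75, 0.5, 0.25, 0.0]):      # decreasing density thresholds d_i
    P_i = P[density >= d]                                    # points that are at least this dense
    st = gudhi.RipsComplex(points=P_i, max_edge_length=0.8).create_simplex_tree(max_dimension=2)
    rows.append(st.persistence())                            # Rips filtration of the i-th row of the grid
```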

4.1.3 Multifiltrations for Images. Similarly, in the image setting, if the image is a color image, one can easily employ
all color channels at the same time, similar to the graph setting. In particular, a color image has three color functions,
denoted as 𝑅 (red), 𝐺 (green), and 𝐵 (blue). Thus, for every pixel Δ𝑖 𝑗 , there exist corresponding color values: 𝑅𝑖 𝑗 , 𝐺𝑖 𝑗 , 𝐵𝑖 𝑗 ∈
[0, 255]. To proceed, we establish a three-parameter multifiltration with parameters {𝛼𝑚 }1𝑁𝑅 , {𝛽𝑛 }1𝑁𝐺 , {𝛾𝑟 }1𝑁𝐵 , where
𝛼𝑚 , 𝛽𝑛 , 𝛾𝑟 ∈ [0, 255] are threshold values for color channels 𝑅, 𝐺, and 𝐵 respectively. By simply defining binary images
X𝑚,𝑛,𝑟 = {Δ𝑖 𝑗 ⊂ X | 𝑅𝑖 𝑗 ≤ 𝛼𝑚 , 𝐺𝑖 𝑗 ≤ 𝛽𝑛 , 𝐵𝑖 𝑗 ≤ 𝛾𝑟 }, we obtain a three parameter multifiltration of size 𝑁𝑅 × 𝑁𝐺 × 𝑁𝐵 .
Similarly, filtration methods such as height, radial, erosion, or signed distance [43] can be combined with the grayscale
or color filtrations above to obtain multifiltrations for grayscale and color images, respectively.

4.2 MP Vectorization Methods


By computing the homology groups of the complexes in these multifiltrations, {H𝑘 (K𝑖 𝑗 )}, along with the induced
inclusion maps, we obtain the corresponding multipersistence module, which can be visualized as a rectangular grid
of size 𝑚 × 𝑛. The goal here is to track 𝑘-dimensional topological features through the homology groups {H𝑘 (K𝑖 𝑗 )}
within this grid. However, as explained in [14], technical challenges rooted in commutative algebra prevent us from
transforming the multipersistence module into a mathematical structure like a "Multipersistence Diagram." The key
issue is that for any 𝑘-dimensional hole in the multifiltration, we cannot directly assign a birth and death time due to the
partial ordering within the module. For example, if the same 𝑘-hole 𝜎 appears at K1,4 and K2,3 , these grid indices are
incomparable, i.e., (1, 4) ⊀ (2, 3) and (2, 3) ⊀ (1, 4), making it difficult to define a birth or death time for 𝜎 unless it exists in a perfectly rectangular
region. This issue does not arise in single persistence, where any two indices are always comparable, i.e., 𝑖 < 𝑗 or 𝑗 < 𝑖.
As a result, we do not have an effective representation of the MP module [14]. While these technical obstacles prevent
this promising approach from reaching its full potential in real-life applications, in the past years, several approaches
have been proposed to extract key insights from MP modules by skipping MP representations and directly going to the
vectorization from multifiltrations, which we detail below.

The simplest method to derive a vector (or tensor) from a multifiltration involves using the Hilbert function of the
MP module, which is a generalization of Betti functions. Let {K𝑖 𝑗 } denote an 𝑚 × 𝑛 multifiltration associated with
a given dataset X. First, we construct the multipersistence module H𝑘 (K𝑖 𝑗 ). Next, by computing the Betti numbers
of each simplicial complex in the multifiltration, we obtain an 𝑚 × 𝑛 matrix (or tensor) 𝛽b𝑘 (X) = [𝛽𝑘 (K𝑖 𝑗)], i.e.,
𝛽𝑘 (K𝑖 𝑗 ) = rank(H𝑘 (K𝑖 𝑗 )). These are also called multipersistence (or bigraded) Betti numbers. In graph representation
learning, this simple vectorization of the MP module has demonstrated remarkable performance [30, 67, 115].
While bigraded Betti numbers offer a straightforward vectorization method for multiparameter (MP) modules, they
fail to capture crucial topological information, especially regarding the significance of dominant features or the presence
of a few important topological features. More sophisticated techniques are necessary for effective MP vectorization.
One widely used strategy involves the "slicing technique," which focuses on studying one-dimensional fibers within
the multiparameter domain. A clearer understanding can be obtained by restricting the multidimensional persistence
module to these single directions (slices) and applying single persistence analysis.
In their work [23], Carriere et al. explored this approach by considering multiple such slices, often referred to
as vineyards, and summarizing the resulting persistence diagrams. Another significant advancement is found in
multipersistence landscapes [110] by Vipond, which extends the concept of persistence landscapes [15] from single to
multiparameter persistence.
The vectorization of MP modules is a rapidly evolving area of research. Various techniques have recently been pro-
posed in practical applications, e.g., Hilbert decomposition signed measure (MP-HSM-C) [67], Effective MultiPersistence
(EMP) [30], Generalized Rank Invariant Landscape (GRIL) [115], and Stable Candidate Decomposition Representations
(S-CDR) [66]. Although many recent methods are burdened by high computational costs, the MP-HSM-C method
introduced by Loiseaux et al. [67] stands out as the most efficient in terms of speed and performance among current
MP vectorization techniques.

5 MAPPER
Next to Persistent Homology, another powerful method in TDA is the Mapper, which Singh, Memoli, and Carlsson
introduced in the late 2000s [18, 98]. The Mapper stands out as an effective and versatile tool for extracting insights
from high-dimensional datasets, which creates a simplified graph representation of data by combining ideas from
algebraic topology and data visualization. While PH is mostly used in supervised learning, Mapper is commonly utilized in
unsupervised settings. In the past decade, it has made critical contributions to various fields involving point clouds in
high dimensional spaces, e.g., biomedicine [100]. This method works by projecting data onto a lower-dimensional space,
clustering the points within overlapping intervals, and constructing a topological network that captures the essential
structure of the dataset. Through this approach, Mapper can reveal hidden patterns, relationships, and the underlying
shape of data, making it particularly valuable for unsupervised learning in fields such as genomics, neuroscience, and
social network analysis. Therefore, it is also considered a smart dimension reduction technique (Figure 33).

5.1 Mapper for Point Clouds


For a point cloud X in R𝑁 and a real-valued function 𝑓 : X → R, the Mapper algorithm provides a summary of the
data by scanning the clusters in X in the direction dictated by the function 𝑓 , which is commonly referred to as a
filter function or lens [38]. The output of the Mapper algorithm is the Mapper graph, which is considered a meaningful
summary of the data, representing clusters and relations between the clusters in the data. It has been applied in several

contexts like diabetes subtype identification [64], ransomware payment detection on blockchains [4], and cancer
genotyping from single-cell RNA sequence data (Section 6.5).
The Mapper algorithm generates a graph-based summary of a high-dimensional point cloud. In this new Mapper
graph, nodes represent clusters within the original point cloud, and edges indicate the interaction between these clusters.
In other words, Mapper is a soft clustering approach where a data point may appear in multiple clusters; these clusters
are then represented as nodes in the new Mapper graph.
Given a point cloud X in R𝑁 , we first define a lens 𝑓 : X → R. In general, the lens function can be a natural function
derived from the data domain, or it may be computed using a dimensionality reduction technique. The proper selection
of hyperparameters for Mapper is crucial, and we will shortly discuss these in Section 5.1.1.
The lens function then decomposes the data into subregions through a cover of the image 𝑓 (X) ⊂ R, allowing
for the identification of clusters within each subregion. Formally, we cover the image 𝑓 (X) ⊂ R with a collection of
open intervals, i.e., I = {𝐼𝑘 }𝑛𝑘=1 where ⋃𝑘 𝐼𝑘 ⊃ 𝑓 (X). The intervals 𝐼𝑘 = (𝑎𝑘 , 𝑏𝑘 ) are indexed such that 𝑎𝑘 < 𝑎𝑘+1 and
typically 𝑏𝑘 ≥ 𝑎𝑘+1 , allowing for overlap between consecutive intervals. Figure 25b shows 6 overlapping intervals
I = {𝐼 1, . . . , 𝐼 6 } for a toy example.
Next, we consider the preimage of each interval, 𝑈𝑘 = 𝑓 −1 (𝐼𝑘 ) ⊂ X. In particular, the preimage of each interval
refers to the set of points in the original high-dimensional space, such as the data points, mapped into a given interval
by the lens function. For example, in Figure 25, the union of all points from 𝑈 2 and 𝑈 3 and lower-part points from 𝑈 1
are the preimage of the interval 𝐼 2 . We apply a clustering algorithm (e.g., k-means, DBSCAN) to each preimage 𝑈𝑘0 ,
dividing it into 𝑚𝑘0 clusters {𝐶𝑘0𝑙 }, 𝑙 = 1, . . . , 𝑚𝑘0 .
For each cluster 𝐶𝑘𝑙 , we create a node 𝑣𝑘𝑙 in the Mapper graph and let V𝑘 = {𝑣𝑘1 , 𝑣𝑘2 , . . . , 𝑣𝑘𝑚𝑘 } represent the
clusters in 𝑈𝑘 ⊂ X. Thus, V = ⋃𝑘 V𝑘 forms the node set for the Mapper graph G. In Figure 25, there are 10 nodes in
the graph for the 6 preimages: 𝐼 2 , 𝐼 3 , 𝐼 4 , and 𝐼 6 , each resulting in two clusters. Note that the number of clusters within a
preimage can vary. For example, representing 𝑈 1 with a single cluster might result from using a density-based clustering
algorithm (e.g., DBSCAN), which identifies dense regions as clusters while treating sparse regions as noise, thereby
avoiding the creation of clusters in those less populated areas.
After obtaining the nodes, we define the edges based on the intersections of the clusters. Note that as 𝐼𝑘 ∩ 𝐼𝑘+1 ≠ ∅,
clusters in subsequent levels might have nontrivial intersections. Then, if the clusters 𝐶𝑖 ∩ 𝐶 𝑗 ≠ ∅, we add an edge
between the corresponding nodes 𝑣𝑖 and 𝑣 𝑗 , i.e., E = {𝑒𝑖 𝑗 | 𝐶𝑖 ∩ 𝐶 𝑗 ≠ ∅} where 𝑒𝑖 𝑗 represents the edge between the
nodes 𝑣𝑖 and 𝑣 𝑗 . It is also possible to define a weighted graph with weights 𝜔𝑖 𝑗 = #(𝐶𝑖 ∩ 𝐶 𝑗 ), the count of the points
in the intersection. In this case, the weights reflect the strength of the interaction between the corresponding clusters,
indicating how many data points are included in the overlapped area. The final Mapper graph gives a rough
summary/sketch of the whole point cloud.

Fig. 25. Toy Example of Mapper. For a point cloud X (a), we define a lens function 𝑓 : X → R (b), and the induced covering
defines a Mapper graph where nodes represent clusters and edges represent related clusters (c).
The Mapper graph not only helps in identifying data clusters but also enables the selection of data points that are
similar to a given subset by leveraging its structure [12]. In this context, nodes that are closer together on the Mapper
graph tend to contain more similar data points. This aspect of cluster similarity based on their proximity in the Mapper

graph is an area that remains largely understudied and offers significant potential for further exploration. The Mapper
graph computations have been detailed in Algorithm 4. In Section 6.5, we outline the utilization of Mapper in a real-life
application, i.e., cancer genotyping from RNA sequencing.
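A minimal from-scratch Python sketch of Algorithm 4, restricted to a one-dimensional lens and using DBSCAN for the clustering step, is given below; it is meant only to make the steps concrete, not to replace the dedicated libraries discussed in Section 5.4 (the interval construction and parameter values are illustrative choices):

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

def mapper_graph(X, f, n_intervals=10, overlap=0.2, eps=0.5, min_samples=3):
    fX = f(X)                                              # filter values f(X)
    lo, hi = fX.min(), fX.max()
    length = (hi - lo) / n_intervals
    G, clusters, node_id = nx.Graph(), {}, 0
    for k in range(n_intervals):                           # overlapping covering intervals I_k
        a, b = lo + k * length - overlap * length, lo + (k + 1) * length + overlap * length
        idx = np.where((fX >= a) & (fX <= b))[0]           # preimage U_k
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in set(labels) - {-1}:                     # one node per cluster (noise label -1 skipped)
            clusters[node_id] = set(idx[labels == lab])
            G.add_node(node_id)
            node_id += 1
    for i in clusters:                                      # edges between clusters sharing points
        for j in clusters:
            if i < j and clusters[i] & clusters[j]:
                G.add_edge(i, j, weight=len(clusters[i] & clusters[j]))
    return G

# Usage: a noisy circle with the y-coordinate as the lens.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(400, 2))
G = mapper_graph(X, f=lambda pts: pts[:, 1], n_intervals=8, overlap=0.25)
print(G.number_of_nodes(), G.number_of_edges())
```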

Common Filter Functions for Mapper. While we explain the Mapper algorithm for single-valued filter functions to
keep the exposition simpler, in practice it is very common to use multivariate filter functions 𝐹 : X → R2 . One of the
most common filter functions used in applications is the t-distributed Stochastic Neighbor Embedding (t-SNE) [54], and a variant
of this, called Neighborhood Lens [42].
Similarly, it is common to use other dimension reduction techniques as filter map 𝐹 : X → R2 , e.g., Isomap, PCA,
or UMAP [92]. Then, for an open covering I = {𝐼𝑖 𝑗 } of 𝐹 (X) in R2 , one can repeat the process of 𝑈𝑖 𝑗 = 𝐹 −1 (𝐼𝑖 𝑗 ), and
assign a node 𝑤 for each cluster in 𝑈𝑖 𝑗 . Again, assign an edge between the nodes when the corresponding clusters have
nontrivial intersections. Note that if the domain of the data offers good filter functions 𝑓 , 𝑔 : X → R, one can also
utilize multivariate filter functions 𝐹 (𝑥) = (𝑓 (𝑥), 𝑔(𝑥)) ∈ R2 by combining both functions in the process.

5.1.1 Hyperparameters for Mapper. For a point cloud X, four main parameter choices must be made to obtain its
Mapper graph.
(1) The lens function. If available, the lens function can be derived directly from the data domain, which in turn
greatly enhances the interpretability of the model. In practice, when appropriate lens functions are not readily
available from the data domain, dimensionality reduction techniques like t-SNE, UMAP, or PCA are commonly
used to create lenses 𝑓 : X → R2 .
(2) Clustering method. For each subset 𝑓 −1 (𝐼𝑘 ), the clustering method determines the nodes of the Mapper graph.
The most common methods employed are DBSCAN and k-means, with their hyperparameters controlling the
granularity of the resulting clusters. For k-means, it is common to use fewer than 10 clusters (e.g., see [95, Figure 10]).
(3) Resolution and Overlap. The resolution and overlap are the parameters for the covering {𝐼𝑘 }𝑛1 of 𝑓 (X) ⊂ R,
i.e., ⋃𝑛𝑘=1 𝐼𝑘 ⊃ 𝑓 (X). The resolution, 𝑛, is the number of bins (intervals) {𝐼𝑘 } to cover 𝑓 (X). The overlap is the
percentage of overlap between consecutive intervals, i.e., |𝐼𝑘 ∩ 𝐼𝑘+1 |/|𝐼𝑘 |. Note that increasing the resolution gives a finer summary
by increasing the number of nodes in the Mapper network and making the clusters smaller (See Figure 34). On
the other hand, increasing overlap adds more relation (edges) between the nodes (clusters). In Figure 25, the
resolution is 6, and the overlap is the fixed intersection amount (e.g., 20%) between the intervals 𝐼𝑘 and 𝐼𝑘+1 .
Typically, it is common to use 10% − 30% overlap and 10-20 intervals (e.g., Figure 10 of GraphPulse [95]).
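The sketch below wires these four choices together with the Kepler-Mapper library (Table 3): a PCA lens, a DBSCAN clusterer, and a cover whose n_cubes and perc_overlap arguments play the roles of resolution and overlap; all parameter values here are illustrative only:

```python
import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10))             # a toy point cloud in R^10

mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(X, projection=PCA(n_components=2))   # (1) lens function

graph = mapper.map(
    lens, X,
    cover=km.Cover(n_cubes=15, perc_overlap=0.2),                # (3) resolution and overlap
    clusterer=DBSCAN(eps=0.8, min_samples=5),                    # (2) clustering method
)
mapper.visualize(graph, path_html="mapper_output.html", title="Toy Mapper graph")
```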

5.2 Mapper for Other Data Formats


While the Mapper approach is most commonly applied to point clouds, it can be adapted for other data formats, such as
graphs and images. For graphs, this adaptation can be seen as a form of graph coarsening/skeletonization, where the
goal is to summarize clusters of nodes within the graph.
In [13, 48], Bodnar et al. present a natural extension of the Mapper algorithm to the graph setting. Given a graph
G = (V, E), we start with a filter function 𝑓 : V → R and define a cover I = {𝐼𝑘 }𝑛𝑘=1 , where ⋃𝑘 𝐼𝑘 ⊃ 𝑓 (V). The set
V𝑘 = 𝑓 −1 (𝐼𝑘 ) represents a subset of nodes, and G𝑘 is the induced subgraph, i.e., G𝑘 = (V𝑘 , E𝑘 ) where E𝑘 ⊂ E are the
edges connecting pairs of vertices in V𝑘 . Instead of clustering, we directly use the components (connected subgraphs)
in G𝑘 as the nodes of the Mapper graph. Specifically, if G𝑘0 decomposes into 𝑚𝑘0 connected subgraphs G𝑘0 1 , . . . , G𝑘0𝑚𝑘0 ,
we define the node set C𝑘0 = {𝑐𝑘0 1 , . . . , 𝑐𝑘0𝑚𝑘0 }, one node per component. For example, if G𝑘 has five connected
subgraphs, we define five nodes in the Mapper graph, each representing one connected subgraph in G𝑘 .

Algorithm 4 Mapper Algorithm


1: Input: DataSet X, Filter function 𝑓 : X → R, I = {𝐼𝑘 } covering of 𝑓 (X) for a given resolution and overlap.
2: Output: Graph G (nodes and edges)
3: Initialize empty graph G
4: Compute filter values: 𝑓 (X)
5: for each covering set 𝐼𝑘 in I do ⊲ Defining nodes of G
6: Compute preimage: 𝑈𝑘 = {𝑥 ∈ X | 𝑓 (𝑥) ∈ 𝐼𝑘 }
7: Cluster 𝑈𝑘 into clusters {𝐶𝑖 }
8: Assign a node for each cluster, and get a set of nodes V𝑘
9: Add nodes in V𝑘 to graph G
10: end for
11: for each pair of nodes (𝑣𝑖 , 𝑣 𝑗 ) in G do ⊲ Defining edges of G
12: if Clusters 𝐶𝑖 and 𝐶 𝑗 have common points then
13: Add edge (𝑣𝑖 , 𝑣 𝑗 ) to G
14: end if
15: end for
16: Return Graph G

Let W = {𝑤 1 , 𝑤 2 , . . . , 𝑤𝑀 } be the collection of all such nodes, where 𝑀 = 𝑚1 + · · · + 𝑚𝑛 . This defines the node set
of the Mapper graph of G.
The edge set is defined similarly: If the subgraphs G𝑘𝑙 and G(𝑘+1)𝑙 ′ share a common node, then an edge is added
between the corresponding nodes in W. In summary, in the original construction, we replace the point cloud with
the node set of G, and the clustering algorithm is directly derived from the graph structure. For this approach, one
can utilize the common filtering functions 𝑓 : V → R from persistent homology (e.g., degree, closeness, betweenness,
eccentricity) or can use other popular functions from graph representation learning, e.g., eigenfunctions of graph
Laplacian, pagerank [48].
Alternatively, if the graph G = (V, E, X) includes node attributes, a Mapper graph can be directly derived from the
set of node attributes X. In particular, we treat the node attribute vectors X = {X𝑖 }𝑛1 with X𝑖 ∈ R𝑁 as a point cloud
X ⊂ R𝑁 and apply the Mapper algorithm as before. This Mapper graph provides a visual summary of the node attribute
space, independent of the graph’s structure. In other words, the resulting Mapper graphs summarize the graph based on
node attributes alone, without incorporating information about node neighborhoods in the original graph. An effective
utilization of this approach can be found in [95].

Fig. 26. Mapper for Graphs. For a graph G (a), a filter function 𝑓 : V → R (b), and the covering {𝑈𝛼 , 𝑈 𝛽 , 𝑈𝛾 } (c), we obtain the
node clusters (connected subgraphs) (d), which define a Mapper network (e). The figure is adapted from [48].

The Mapper algorithm on images is analogous to its application in graphs, where clustering is derived from the
original image [94]. For a given 𝑟 × 𝑠 image X, let 𝑓 : X → R represent the color values of the pixels. We define a cover
I = {𝐼𝑘 }𝑛𝑘=1 such that ⋃𝑘 𝐼𝑘 ⊃ 𝑓 (X). A node is defined for each connected component in 𝑓 −1 (𝐼𝑘 ), and an edge is
created between nodes if the corresponding components have a nontrivial intersection. This Mapper graph summarizes
the interaction between different color regions in X as a graph.

5.3 Computational Complexity of Mapper


The computational complexity of TDA Mapper varies depending on the specific choices of filter functions, clustering
algorithms, and dimensionality reduction techniques.
The first step in TDA Mapper involves applying a filter function. If t-SNE is used as the filter, the computational
complexity of the Barnes-Hut t-SNE variant is O (𝑁 log 𝑁 · 𝐷) [108], where 𝑁 is the number of data points (e.g.,
for graphs, the number of nodes |V |) and 𝐷 is the dimensionality of the data (e.g., for graphs, the number of node
attributes). The complexity of t-SNE is largely driven by the computation of pairwise distances in a 𝐷-dimensional
space, followed by optimizing the low-dimensional embedding. While dimensionality 𝐷 influences the time required
for these computations, the overall complexity is typically dominated by the number of data points 𝑁 .
After t-SNE, the data is covered by overlapping intervals, and clustering is performed within each interval. The
complexity of this step depends on the clustering algorithm used. For example, if a clustering algorithm with complexity
O (𝑁 2 ) is used, this step may add significant computational cost. Finally, the Mapper complex is constructed by
connecting clusters, which typically has a lower complexity, often O (𝑁 ) to O (𝑁 log 𝑁 ), depending on the number of
clusters and the method used to connect them.
The overall computational complexity of TDA Mapper when using t-SNE is dominated by the complexity of t-SNE,
which is O (𝑁 log 𝑁 · 𝐷). The subsequent steps in the Mapper pipeline add to this complexity, particularly the clustering
step. When using 𝑘-means clustering, the complexity is typically O (𝑘 · 𝑁 · 𝐷 · 𝑇 ), where 𝑘 is the number of clusters,
𝑁 is the number of data points, 𝐷 is the dimensionality of the data, and 𝑇 is the number of iterations. Therefore, the
overall complexity can be expressed as O (𝑁 log 𝑁 · 𝐷) + O (𝑘 · 𝑁 · 𝐷 · 𝑇 ), which is dominated by the O (𝑁 log 𝑁 · 𝐷) term of t-SNE when 𝑘 and 𝑇 are treated as constants.

5.4 Software Libraries for Mapper


Several software libraries provide tools for constructing and analyzing Mapper (see Table 3). Among these, Kepler-
Mapper is a popular choice for Python users, particularly within the Scikit-TDA ecosystem, offering HTML outputs that
are ideal for creating shareable, interactive visualizations. This makes it a practical tool for users who need to present
their findings to non-technical users.
Giotto-TDA integrates well with the Python environment and offers parameter visualization capabilities, making
it an excellent choice for those needing to iteratively refine their Mapper construction. Its interactive features are
especially helpful for real-time exploration of the impact of different filter functions and covering parameters.
For users prioritizing performance over interactivity, tda-mapper-python offers a faster computation engine, making
it well-suited for large datasets or scenarios where rapid iteration is necessary. However, the lack of interactive features
means that users will need to rely on external tools for visual exploration.
TDAmapper in R is an excellent choice for users already embedded in the R ecosystem. It’s a robust standalone
solution that is particularly useful for those who prefer the extensive statistical and data manipulation capabilities
available in R. However, the lack of interactivity might require additional effort for visualization and exploration.

Table 3. Summary of TDA Mapper Libraries.

Library Name Lang. Notable Feature Interactive Code URL


Kepler-Mapper [109] Python HTML output ✓ https://github.com/scikit-tda/kepler-mapper (Scikit-TDA)
Giotto-tda [104] Python Parameter visuals ✓ https://github.com/giotto-ai/giotto-tda
tda-mapper-python [97] Python Fast computation × https://github.com/lucasimi/tda-mapper-python
TDAmapper [98] R Standalone × https://github.com/paultpearson/TDAmapper
Mapper Interactive [122] Python Interactive ✓ https://github.com/MapperInteractive/MapperInteractive

Finally, Mapper Interactive is designed for those who value hands-on engagement with the data. It offers a highly
interactive Python-based environment, making it ideal for exploratory data analysis where immediate feedback and
manipulation of the Mapper graph are crucial.
In practice, the choice of library typically depends on the specific requirements of your project, such as whether you
prioritize interactivity, computational speed, or seamless integration with other tools in your data analysis workflow. For
example, Kepler-Mapper and Giotto-TDA are ideal for exploratory analysis and visualization, while tda-mapper-python
and TDAmapper are more suitable for handling large-scale computations and integrating with ecosystems like Python
and R, respectively.

6 APPLICATIONS
In this section, we will provide examples of the applications of TDA methods in machine learning. We provide four
examples from published papers and outline their model and how they applied TDA in their project.

6.1 PH for Point Clouds: Shape Recognition


In this section, we present an illustrative example of the utilization of PH in shape recognition, based on the work of
Chazal et al. [26]. One specific example involved four different animals: an elephant, flamingo, lion, and camel, each
representing a unique shape in R3 , where each shape is normalized to have a diameter of one. They began by selecting
500 random points from the surface of each animal, as shown in Figure 27, resulting in the point clouds X𝐸 , X𝐿 , X𝐹 ,
and X𝐶 for the elephant, lion, flamingo, and camel, respectively. Then, they created 100 subsamples of 300 points each
from these point clouds, resulting in 400 different point clouds: 𝐸𝑖 ⊂ X𝐸 , 𝐿𝑖 ⊂ X𝐿 , 𝐹𝑖 ⊂ X𝐹 , and 𝐶𝑖 ⊂ X𝐶 for
1 ≤ 𝑖 ≤ 100.

Fig. 27. Shape Recognition via PH. In [26], the authors obtained 100 point clouds for each animal by subsampling the surfaces above
and analyzed their persistence landscapes. On the right, they present the 95% confidence bands for the persistence landscapes of
these 400 point clouds, demonstrating that each animal’s persistence landscape exhibits significantly distinct characteristics. The
figure is adapted from [26].
Next, they performed Rips filtration on each point cloud {𝐸𝑖 , 𝐿𝑖 , 𝐹𝑖 , 𝐶𝑖 }, 1 ≤ 𝑖 ≤ 100 (see Section 3.1.1), and computed
the corresponding persistence diagrams for dimension one. By vectorizing these persistence diagrams, they obtained
the persistence landscapes for each point cloud, resulting in 400 persistence landscape functions: {𝜆(𝐸𝑖 ), 𝜆(𝐿𝑖 ), 𝜆(𝐹𝑖 ),
𝜆(𝐶𝑖 )}, 1 ≤ 𝑖 ≤ 100.
In Figure 27, they present the 95% confidence bands for the true average landscapes (bold curves) of each class. e.g.,
blue confidence band correspond to 100 persistence landscapes coming from lion point clouds {𝜆(𝐿1 ), . . . , 𝜆(𝐿100 )}.

The narrowness of these confidence bands indicates that PH techniques provide robust shape-embedding methods that
are minimally affected by noise. Moreover, the distinct confidence bands for each class underscore the effectiveness of
PH in shape recognition. We also note that recent work by Türkeş et al. [107] studied the effectiveness of PH for shape
recognition in different settings.
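The subsample-and-compare pipeline above can be sketched in a few lines with GUDHI (assuming the gudhi package and its representations module are installed). The noisy circle below is only a placeholder for a sampled animal surface, and the landscape parameters are illustrative.

import numpy as np
import gudhi
from gudhi.representations import Landscape

def landscape_from_points(points, max_edge=2.0, resolution=100):
    # Rips filtration -> dimension-1 persistence diagram -> persistence landscape.
    rips = gudhi.RipsComplex(points=points, max_edge_length=max_edge)
    st = rips.create_simplex_tree(max_dimension=2)
    st.persistence()
    diag1 = st.persistence_intervals_in_dimension(1)
    if len(diag1):
        diag1 = diag1[np.isfinite(diag1[:, 1])]   # drop any essential (infinite) bars
    return Landscape(num_landscapes=3, resolution=resolution).fit_transform([diag1])[0]

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
# Placeholder "shape": 500 points on a noisy circle embedded in R^3.
shape = np.c_[np.cos(theta), np.sin(theta), np.zeros(500)] + 0.05 * rng.standard_normal((500, 3))

# 100 subsamples of 100 points each; one landscape function per subsample.
landscapes = np.array([
    landscape_from_points(shape[rng.choice(500, 100, replace=False)])
    for _ in range(100)
])

# Pointwise mean landscape and a simple 95% band across the subsamples.
mean_landscape = landscapes.mean(axis=0)
band_width = 1.96 * landscapes.std(axis=0) / np.sqrt(len(landscapes))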

6.2 PH for Graphs: Crypto-Token Anomaly Forecasting


Three primary types of problems dominate graph representation learning: graph classification, node classification,
and link prediction, as outlined by Hamilton et al. [50]. These tasks encompass a broad range of real-life applications,
including brain connectivity networks, molecular property prediction, recommender systems, fraud detection and
transaction networks. In a related study, Li et al. [65] employ persistent homology to extract effective feature vectors
from weighted and directed Ethereum crypto-token networks, modeled as temporal graphs. The approach is tailored
for graph anomaly prediction, a specialized form of graph classification.
The research problem is set as a prediction task where the authors aim to predict whether the absolute price return of
an Ethereum token will change significantly beyond a predefined threshold |𝛿 | > 0 within the next ℎ days. This involves
analyzing the token’s transaction network and its price fluctuations over multiple discrete time graph snapshots. The
price information is sourced from external ground truth data, as token prices are determined by trading on blockchain
exchanges.
For each snapshot, the authors use the transferred token amounts on edges to define distances between adjacent vertices. These functions establish the similarity between nodes: ω_uv = [1 + α · (A_uv − A_min)/(A_max − A_min)]^{−1}, where A_uv represents the amount of tokens transferred between nodes u and v, and A_min and A_max are the minimum and maximum transferred amounts, respectively. The authors set the parameter α = 9 so that the bracketed term ranges over [1, 10] and the weights ω_uv are mapped to the interval [0.1, 1], thus standardizing the weight values.
By treating the edge weights (similarity measures) as distances between nodes, i.e., d(u, v) = ω_uv, the authors construct a filtration of Rips complexes (Section 3.1.3-ii) from the snapshot graphs. For non-adjacent nodes, the distance is defined by the shortest path in the graph. The main insight here is to extract topological patterns developed in the graphs at multiple scales. This process involves forming a simplicial complex where nodes are connected if their distance is within a threshold ε, reflecting their similarity based on the edge weights. Essentially, nodes with high similarity (those with large transaction amounts between them) are connected early in the filtration, while more distant nodes appear later.

Fig. 28. Betti pivots. Comparative functional summaries of the (a) Tronix and (b) Power Ledger token networks over seven days, displaying daily Betti-1 values across various scaling parameters. Each plot captures the evolution of topological features, with the central red lines indicating the Betti pivots. These pivots represent the most stable or 'normal' network structures, providing insights into network behavior and potential anomalies driven by underlying transaction activities. The figure is adapted from [65].

The authors then compute persistence diagrams from these filtrations and convert them into Betti functions. Additionally, they introduce new functional summaries of topological descriptors, namely Betti limits and Betti pivots, which track the evolution of topological features as the scale parameter ε changes over the snapshots. Figure 28 illustrates two token networks and their corresponding functional summaries.
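A schematic reconstruction of this snapshot-to-Betti-function step is sketched below using networkx and GUDHI. The weight formula follows the text above, while the toy graph, the fallback distance for disconnected pairs, and the threshold grid are illustrative assumptions; the Betti limit/pivot and band-depth computations of [65] are omitted.

import networkx as nx
import numpy as np
import gudhi

def betti1_curve(G, amounts, alpha=9.0, thresholds=np.linspace(0.1, 1.0, 50)):
    # amounts[(u, v)]: transferred token amount on edge (u, v) of the snapshot graph G.
    vals = np.array(list(amounts.values()), dtype=float)
    a_min, a_max = vals.min(), vals.max()
    denom = max(a_max - a_min, 1e-12)
    for (u, v), a_uv in amounts.items():
        # Edge weight used as a distance; large transfers give small distances.
        G[u][v]["dist"] = 1.0 / (1.0 + alpha * (a_uv - a_min) / denom)

    # Pairwise distances: shortest path for non-adjacent nodes (large fallback if disconnected).
    nodes = list(G.nodes())
    sp = dict(nx.all_pairs_dijkstra_path_length(G, weight="dist"))
    D = np.array([[sp[u].get(v, 10.0) for v in nodes] for u in nodes])

    # Rips filtration on the distance matrix and dimension-1 persistence.
    rips = gudhi.RipsComplex(distance_matrix=D, max_edge_length=float(thresholds[-1]))
    st = rips.create_simplex_tree(max_dimension=2)
    st.persistence()
    diag1 = st.persistence_intervals_in_dimension(1)

    # Betti-1 value at each scale: number of bars alive at that threshold.
    if len(diag1) == 0:
        return np.zeros(len(thresholds), dtype=int)
    return np.array([int(((diag1[:, 0] <= t) & (diag1[:, 1] > t)).sum()) for t in thresholds])

# Toy snapshot graph with token transfer amounts on edges.
G = nx.cycle_graph(6)
amounts = {(u, v): float(u + v + 1) for u, v in G.edges()}
curve = betti1_curve(G, amounts)

Computing one such curve per daily snapshot yields, for each day, a functional summary of the kind plotted in Figure 28.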

Table 4. Anomaly forecasting performance of crypto-token prices for two days horizon (ℎ = 2) for the top-10 tokens [65].

Token Tronix Omisego Mcap Storj BNB ZRX CyberMiles Vechain Icon BAT
M4 Accuracy 0.975 0.990 0.913 0.962 0.979 0.978 0.961 0.954 0.908 0.965

To identify which transaction networks indicate anomalous patterns, the authors use Modified Band Depth to assess how central or peripheral each network's topological descriptors are within the observed data set. Figure 28 shows two token networks, where each daily snapshot graph is represented by one of the plotted functions. Snapshot graphs with deeper Betti limits (i.e., more central curves) are considered more typical.
The authors integrate these novel topological features with conventional network summaries to predict price
anomalies. Daily labels are assigned as anomalous based on significant price changes anticipated in the near future (e.g.,
in one or two days). To this end, the authors construct a predictive model by utilizing topological vectors they produce
for token networks. Table 4 displays the accuracy metrics for their model, specifically for a prediction horizon of ℎ = 2
(two days). Achieving an average accuracy of 96% across ten token networks, they demonstrate the effectiveness of
topological features in the anomaly forecasting task.
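The final forecasting step is a standard supervised pipeline; the sketch below is a hypothetical illustration (the synthetic features, feature layout, and classifier choice are placeholders, not the configuration used in [65]).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# One row per daily snapshot: topological features (e.g., a Betti-1 curve) concatenated
# with conventional network summaries; label = 1 if the absolute price return exceeds
# the threshold delta within the next h days. Here both are synthetic stand-ins.
rng = np.random.default_rng(0)
X_feat = rng.random((200, 50))
y = rng.integers(0, 2, 200)

# Chronological split: train on earlier days, test on later days.
split = int(0.7 * len(y))
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_feat[:split], y[:split])
print("accuracy:", accuracy_score(y[split:], clf.predict(X_feat[split:])))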

6.3 PH for Images: Cancer Diagnosis from Histopathological Images


Our example will directly apply persistent homology methods to histopathological images. As noted in Section 3.1.2, cubical persistence is the primary method for applying persistent homology in an image context. In [116], Yadav et al. successfully applied cubical persistence to analyze histopathological images. Specifically, for each image X, the authors generated a topological feature vector β̂(X) and utilized standard ML methods on these vectors (image embeddings) for tumor classification.
Recall that for a given image X with r × s resolution, the first step is to create a filtration, which is a nested sequence of binary images X_n. A common method to create such a sequence is to use the color values γ_ij of each pixel Δ_ij ⊂ X. Specifically, for a sequence of grayscale values t_1 < t_2 < · · · < t_N, one obtains a nested sequence of binary images X_1 ⊂ X_2 ⊂ · · · ⊂ X_N such that X_n = {Δ_ij ⊂ X | γ_ij ≤ t_n}.

Fig. 29. Eight Color Channels. Different color channels are used to create sublevel filtrations. The figure is adapted from [116].
In [116], the authors construct filtrations for cubical persistence by first extracting eight color channels γ^k_ij from histopathological images with 1 ≤ k ≤ 8, where each superscript k corresponds to one color channel. The first four channels come from the RGB color space: red, green, blue, and grayscale (the average of R, G, and B). Additionally, they utilize the HSV color space, which includes hue, saturation, value, and their average (see Figure 29). Each color channel defines a different filtration {X^k_n}_{n=1}^N. In the paper, they set N = 100.

Next, by using these filtrations, they obtain the corresponding persistence diagrams for dimensions 0 and 1. Then, by applying Betti vectorization to these persistence diagrams, they obtain 100-dimensional Betti vectors β⃗^k_m = [β^k_m(t_1) . . . β^k_m(t_100)], where k represents the color channel and m = 0, 1 represents the dimension. Hence, β^k_0(t_n) is the number of components (Betti-0 number) in the binary image X^k_n, and β^k_1(t_n) is the number of holes (Betti-1 number) in the binary image X^k_n. Considering that there are eight color channels and two dimensions, concatenating these 16 vectors {β⃗^k_m} yields a 1600-dimensional vector β⃗(X).

Fig. 30. Median curves and 40% confidence bands of Betti vectors, one for each cancer type: (a) Colon (Gray Betti-0), (b) Bone (Gray Betti-0), (c) Cervical (Value Betti-0). The x-axis represents color values t ∈ [0, 255] and the y-axis represents β_0(t) (Betti-0), i.e., the count of components in the binary image X_t. For details, see [116].

By extracting extra features utilizing local binary patterns (LBP) and Gabor filters, they produced additional 800- and 400-dimensional feature vectors for each image, respectively. They then obtain a 2800-dimensional final vector β̂(X). In particular, one can consider this as an image embedding method, where each image is realized as a point in R^2800. Next, to improve performance by removing correlated features, they apply a feature selection method for the downstream task and reduce the vector size to 500 dimensions.
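A minimal sketch of the Betti-vector computation for a single color channel is given below using GUDHI's cubical complexes; the random image is a stand-in for a histopathology channel and the threshold grid is illustrative. Repeating it over the eight channels and both dimensions and concatenating the outputs gives the 1600-dimensional vector described above.

import numpy as np
import gudhi

def betti_vectors(channel, n_thresholds=100):
    # Sublevel (cubical) filtration of one color channel.
    cc = gudhi.CubicalComplex(top_dimensional_cells=channel.astype(float))
    cc.persistence()
    d0 = cc.persistence_intervals_in_dimension(0)
    d1 = cc.persistence_intervals_in_dimension(1)
    ts = np.linspace(channel.min(), channel.max(), n_thresholds)

    def count(diag, t):
        # Number of bars alive at threshold t (infinite bars count as alive).
        return 0 if len(diag) == 0 else int(((diag[:, 0] <= t) & (diag[:, 1] > t)).sum())

    beta0 = np.array([count(d0, t) for t in ts])   # components in the binary image X_t
    beta1 = np.array([count(d1, t) for t in ts])   # holes in the binary image X_t
    return beta0, beta1

# Placeholder grayscale patch standing in for one histopathology color channel.
img = np.random.default_rng(0).integers(0, 256, size=(128, 128))
b0, b1 = betti_vectors(img)
channel_features = np.concatenate([b0, b1])        # 200-dimensional vector per channel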
They studied five cancer types, namely bone, breast, cervical, prostate, and colon cancer. In Figure 30, they provide the median curves and confidence bands for each class for three of these cancer types. As discussed earlier, while Betti vectors are considered a weak vectorization method in general, they can be highly effective when the signature comes from the quantity and distribution of small features, as in these histopathological images. They utilized benchmark datasets for each cancer type consisting of 20K to 60K histopathological images. In Table 5, we give their results for various sets of feature vectors. Their results indicate that utilizing filtrations with multiple color channels can significantly improve the classification results in some cancer types, e.g., bone, breast, and colon cancers.

Table 5. Cancer diagnosis with PH. Multiclass accuracy results for various cancer types from histopathological images using a
topological ML model with a random forest classifier applied to different feature vector sets. The number of classes is indicated in
parentheses. The first five rows correspond to the performance of Betti vectors across different color channels [116].

Features # Features Bone (3) Breast (4) Prostate (4) Cervical (5) Colon (3)
Gray 200 84.6 68.8 91.1 54.6 95.9
G-RGB 800 89.4 80.0 93.2 77.5 97.5
avg HSV 200 82.3 60.8 84.5 60.7 88.0
HSV+ avg HSV 800 90.1 78.7 93.0 82.7 97.5
All Betti features 1600 90.8 82.8 93.9 85.5 97.8
Betti + Gabor + LBP 2800 93.7 88.4 94.7 92.6 98.5
Feature Selection 500 94.2 91.6 95.2 91.4 98.4

6.4 Multiparameter Persistence: Computer Aided Drug Discovery


In this part, we outline an effective application of multiparameter persistence in computer-aided drug discovery (CADD),
showcasing its use within the graph setting. For an application in the context of point clouds, see [23, 67, 111].

Virtual screening (VS), a key technique in CADD, is used to identify potential drug candidates from a vast library of
compounds that are most likely to bind to a specific molecular target. Demir et al. [37] employed a multiparameter
persistence approach for virtual screening by framing it as a graph classification task. In this method, compounds are
represented as graphs, with atoms as nodes and bonds as edges. They adopted a ligand-based approach, wherein a
few positive samples are provided for a given protein target, and the goal is to screen the compound library to find
compounds that are most similar to these positive samples. Essentially, this approach can be viewed as a topology-based
graph ranking problem. While their approach is more technical, here we outline the key concepts of their method.
Their framework, ToDD, generates fingerprints of compounds as multidimensional vectors (tensors), represented as a 2D or 3D array for each compound. The core idea is to simultaneously employ two or three highly relevant functions or weights (e.g., atomic mass, partial charge, bond type, electron affinity, distance) to obtain a multifiltration, which decomposes the original compound into substructures using these relevant chemical functions. As detailed in Section 4.1.1, for a given compound G = (V, E), they used atomic number A and partial charge P as node functions A : V → R and P : V → R, as well as bond strength B as an edge function B : E → R, to define multifiltrations. An illustration of such a multifiltration for the compound cytosine (Figure 31) is provided in Figure 32. In this example, the functions are partial charge and atomic number.

Fig. 31. Cytosine. Atom types are coded by their color: White=Hydrogen, Gray=Carbon, Blue=Nitrogen, and Red=Oxygen. The decimal numbers next to atoms represent their partial charges. The figure is adapted from [37].

After constructing multifiltrations, they obtained compound fingerprints using a slicing technique. In particular, within each multifiltration Ĝ_ij, they took horizontal slices by fixing a row i_0 to get a single persistence filtration Ĝ_{i_0 j}. For each horizontal slice i_0, they then generated a persistence diagram PD_k(Ĝ_{i_0 j}). These persistence diagrams were vectorized using Betti and Silhouette vectorizations to produce 2D arrays, such as M_β = [β(Ĝ_ij)] (Betti numbers of the clique complex Ĝ_ij).
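The slicing step can be illustrated with a small sketch (a hypothetical toy molecule, not the actual cytosine values); GUDHI's SimplexTree is used to build each slice as a sublevel filtration by partial charge, restricted to atoms below an atomic-number cutoff.

import gudhi

# Toy molecular graph: atom -> (atomic number, partial charge), plus bonds.
atoms = {0: (6, -0.33), 1: (7, -0.57), 2: (6, 0.36), 3: (8, -0.62), 4: (1, 0.26)}
bonds = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 0)]

def slice_diagrams(atomic_cutoff, max_hom_dim=1):
    # One horizontal slice of the bifiltration: keep atoms with atomic number <= cutoff,
    # and filter the induced subgraph by partial charge (sublevel filtration).
    keep = {v for v, (z, _) in atoms.items() if z <= atomic_cutoff}
    st = gudhi.SimplexTree()
    for v in keep:
        st.insert([v], filtration=atoms[v][1])
    for u, v in bonds:
        if u in keep and v in keep:
            st.insert([u, v], filtration=max(atoms[u][1], atoms[v][1]))
    st.expansion(max_hom_dim + 1)   # clique complex, so dimension-1 homology is meaningful
    st.persistence()
    return {k: st.persistence_intervals_in_dimension(k) for k in range(max_hom_dim + 1)}

# Sweep the vertical (atomic number) axis; each slice yields persistence diagrams that
# can be vectorized (e.g., Betti or Silhouette) and stacked row by row into a 2D array.
fingerprint_rows = [slice_diagrams(z) for z in (1, 6, 7, 8)]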
Fig. 32. Sublevel bifiltration of cytosine induced by the filtering functions partial charge f and atomic number g. In the horizontal direction, thresholds α = −0.5, −0.1, +0.3, +0.7 filter the compound into substructures f(v) ≤ α with respect to their partial charge. In the vertical direction, thresholds β = 1, 6, 7, 8 filter the compound into substructures g(v) ≤ β with respect to atomic numbers. The figure is adapted from [37].


Table 6. ToDD vs. SOTA. EF1% performance comparison between ToDD and the top-performing method among 12 SOTA baseline methods on eight targets from the DUD-E Diverse subset.

Model AMPC CXCR4 KIF11 CP3A4 GCR AKT1 HIVRT HIVPR Avg.
Best Baseline 39.6 61.1 54.3 44.3 40.9 89.4 43.8 65.7 47.9
ToDD 42.9 92.3 75.0 67.6 78.9 90.7 64.1 92.1 73.7
Relative gains 8.3% 51.1% 38.1% 52.6% 92.9% 1.5% 46.3% 40.2% 52.8%

Next, using these 2D arrays, they applied two ML classifiers: Random Forest and ConvNeXt Vision Transformers.
Both classifiers performed exceptionally well, surpassing all state-of-the-art models on benchmark datasets. In Table 6,
their results are shown for the DUD-E Diverse dataset, which comprises 116K compounds targeting eight proteins. The
common metric in virtual screening is enrichment factor, which compares the proportion of active compounds found in
the top-ranked subset of a model output to the proportion of active compounds in the entire dataset. 𝐸𝐹 1% represents
the enrichment factor for the top 1% of the dataset.
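Concretely, the enrichment factor can be computed as follows (a small helper sketch with placeholder inputs):

import numpy as np

def enrichment_factor(scores, labels, top_frac=0.01):
    # EF@x%: fraction of actives among the top-ranked x% of compounds,
    # divided by the fraction of actives in the whole library.
    scores, labels = np.asarray(scores), np.asarray(labels)
    n_top = max(1, int(len(scores) * top_frac))
    top = np.argsort(-scores)[:n_top]          # indices ranked by predicted activity
    return labels[top].mean() / labels.mean()

# Example: a library of 10,000 compounds with 100 actives. A perfect ranker places
# all 100 actives in the top 1% (100 compounds), giving EF1% = 1.0 / 0.01 = 100.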

6.5 Mapper: Cancer Genotyping from RNA Sequencing


In this section, we will explore a significant application domain for the Mapper algorithm: the analysis of single-cell
RNA sequencing data. Recall that Mapper is particularly effective for analyzing high-dimensional point clouds by
generating a low-dimensional graph summary that preserves local relationships (Section 5). In the Mapper summary
graph, the nodes represent clusters in the point cloud, and the edges between the nodes indicate that the corresponding
clusters are nearby (interacting) in the high-dimensional space.
RNA sequencing data is crucial for cancer genotype analysis, though it presents significant challenges. The expression
profile of a cell can be mathematically represented as a point in a high-dimensional expression space, where each
dimension corresponds to the RNA level of a gene, and the dimension of the space is the number of expressed genes.

Fig. 33. Visualization Methods. Visualization of melanoma cells in the expression space with different dimension reduction
algorithms. The figure is adapted from [114].

Fig. 34. Mapper Graph for Varying Bin Sizes. As the number of bins increases, the clusters become smaller, resulting in an
increased number of nodes (clusters). The figure is adapted from [114].

Points that are close to each other in this space correspond to cells with similar expression profiles. The set of all
possible tumors of a cancer type spans a subspace of the expression space. From an ML perspective, this is a highly
sparse point cloud in a high-dimensional space. In general, cells are assigned to some cancer subtype (e.g., malignant,
benign, etc.), and the aim is to understand the relation of these types in this high-dimensional space.
There are several significant works on cancer genotyping through the use of Mapper techniques [90, 91, 93]. In this paper, we will detail the methods in two works. In the first one [114], Wang et al. conducted a comparative analysis of Mapper visualization techniques on RNA-sequencing data. They used the melanoma tumor cells dataset GSE72056, which contains 4,645 cells. Of these, 1,257 are malignant, and 3,256 are non-malignant, with some minority classes included. The dataset includes 23,686 expressed genes. From an ML perspective, this data can be represented as a point cloud X = {x^i}_{i=1}^{4645} in the very high dimensional space R^{23,686}, where each dimension corresponds to a gene {γ_j}_{j=1}^{23,686}. In particular, for a cell x^i ∈ X with x^i = (x^i_1, x^i_2, . . . , x^i_{23,686}), the coordinate x^i_j represents the RNA level of gene γ_j in cell x^i. This is computed as x^i_j = log_2(1 + TPM_ij/10), where TPM_ij is the transcript-per-million (TPM) value for gene γ_j in cell x^i. Notice that the number of dimensions far exceeds the number of points; from an ML perspective, this presents a highly challenging setup. In Figure 33, the authors illustrate different dimension reduction techniques on this dataset. In Figure 34, they show how the Mapper graph changes when the number of bins (resolution) is varied.

Fig. 35. Genetic Alterations. Topological representation of the expression space for low-grade gliomas using Mapper. The structure reveals three main clusters corresponding to distinct glioma subtypes. One cluster, labeled IDH2mut, highlights the significance of the IDH2 gene in differentiating this subtype. This visualization emphasizes the distinct expression profiles that separate these glioma subtypes. This figure is adapted from [90].
In the second work [90], Rabadan et al. applied the Mapper algorithm to identify somatic mutations relevant
to tumor progression. They analyzed mutation and RNA expression data for 4500 genes across 4476 patients from
12 tumor types, including low-grade glioma (LGG), lung adenocarcinoma, breast invasive carcinoma, and colorectal
adenocarcinoma. This study led to the identification of 95 mutated cancer genes, 38 of which were previously unreported.
They identified three large expression groups with oligodendrogliomas (enriched for CIC and IDH2 mutations),
IDH1-mutant astrocytomas (enriched for TP53 mutations), and IDH1-wild-type astrocytomas (enriched for EGFR
mutations) (Figure 35). In their analysis, they used the neighborhood lens (Section 5) as the filter function and scanned
the resolution and gain parameter space to determine the optimal parameters (Figure 34).
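The parameter scan described above can be sketched with Kepler-Mapper as follows; the random expression matrix, the PCA lens (a simple stand-in for the neighborhood lens), the clustering step, and the parameter grids are all illustrative assumptions.

import numpy as np
import kmapper as km
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Placeholder expression matrix: cells x genes, e.g., log2(1 + TPM/10) values.
X = np.random.default_rng(0).random((500, 2000))

mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(X, projection=PCA(n_components=2))

# Scan the resolution (number of bins) and gain (overlap) parameter space.
for n_cubes in (10, 20, 30):
    for overlap in (0.2, 0.4, 0.6):
        graph = mapper.map(
            lens, X,
            cover=km.Cover(n_cubes=n_cubes, perc_overlap=overlap),
            clusterer=DBSCAN(eps=20.0, min_samples=3),   # eps must be tuned to the data
        )
        n_edges = sum(len(v) for v in graph["links"].values())
        print(f"n_cubes={n_cubes}, overlap={overlap}: "
              f"{len(graph['nodes'])} nodes, {n_edges} edges")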

7 FUTURE DIRECTIONS
In the preceding sections, we covered the primary topological methods employed in machine learning. In this section,
we will explore future directions for advancing these methods, aiming to enhance the practicality of TDA in ML. We
will also discuss strategies for expanding the application of topological methods into new and emerging areas.

7.1 Topological Deep Learning


While TDA is already an effective tool for feature extraction from complex data, recent research underscores its potential
to augment deep learning models by providing complementary insights. Deep learning algorithms typically emphasize
local data relationships, whereas TDA offers a global perspective, delivering supplementary information. Furthermore,
TDA methods have also been integrated into deep learning algorithms to guide the training process, e.g., through topology-aware loss functions and network layers [57, 113]. In recent years, there has been significant progress in integrating TDA with
deep learning in domains such as image analysis [32, 84, 99], biomedicine [69], graph representation learning [29, 58],
genomics [8], cybersecurity [47] and time-series forecasting [35, 95]. This is a rapidly emerging field, with several
recent papers surveying the current state-of-the-art [49, 82, 83, 123].

7.2 TDA and Interpretability


TDA, though often used as a feature extraction technique in machine learning, provides a powerful framework for
enhancing model interpretability by capturing the inherent geometric and topological structures of data. Tools like
PH and Mapper excel at identifying and quantifying features such as clusters, loops, and voids across multiple scales,
unveiling intricate patterns that might escape traditional analytic methods [33, 62, 76]. By utilizing these topological
insights into ML pipelines, TDA offers a unique perspective on decision boundaries, feature interactions, and model
behavior. For instance, Mapper facilitates the visualization and interpretation of complex, high-dimensional datasets by
projecting them into lower-dimensional topological spaces [78, 87, 89, 102]. This transformation uncovers relationships
within the data that are otherwise difficult to grasp, making it easier to interpret the reasoning behind certain predictions
or classifications. Ultimately, by incorporating TDA, researchers and practitioners can clarify the decision-making
processes of machine learning models, promoting greater transparency and trustworthiness in AI systems.

7.3 Fully Automated TDA Models


While PH and Mapper have demonstrated effectiveness across several machine learning domains, their successful
application still requires considerable expertise, such as hyperparameter tuning and selecting appropriate filtration
functions. To make PH and Mapper more accessible to the broader ML community, there is a pressing need for end-to-
end algorithms that automate these processes. For PH, this automation can be tailored to specific data formats, such as
point clouds, images, or graphs. In each case, the fully automated algorithm would handle the selection of filtration
functions, thresholds, vectorization methods, and other hyperparameters for optimal performance in downstream
tasks. For instance, an automated PH model designed for graph data could automatically choose the best filtration
function (learnable), appropriate threshold values, and vectorization techniques to achieve optimal results in a node
classification task. Similarly, a fully automated Mapper algorithm that selects filter functions, resolution, gain, and other hyperparameters would significantly enhance its usability and accessibility to the ML community. For PH, there are works on learnable filtration functions, which find the best filtration functions in the graph setting [55, 120]. On
the other hand, for Mapper, there is promising progress in this direction with Mapper Interactive [122]. The broader
adoption of TDA hinges on developing practical software libraries that streamline its use for various applications.

7.4 Scalability of TDA methods


Another key barrier to the widespread adoption of TDA methods in various application areas is their high computational
cost. Although TDA performs well on small to medium datasets, scaling these methods to larger datasets poses significant
challenges due to computational demands. Several strategies have been introduced to alleviate these costs, varying by
data type. For graph datasets, recent research [3, 117] has proposed practical techniques that substantially lower the
computational burden of PH. Similarly, for point clouds, recent works [36, 71, 121] present algorithms that improve
the efficiency of PH computations. While these advances are promising, there remains a critical need to enhance the
scalability of PH for large graph and point cloud datasets. In contrast, cubical persistence is already computationally
efficient for image datasets and can be applied to large datasets without significant cost concerns. Even in the case
of 3D image datasets [44], PH provides a cost-effective alternative to deep learning approaches. Similarly, scalability
issues are minimal for Mapper when hyperparameters are appropriately tuned.

8 CONCLUSION
In this tutorial, we have introduced the key concepts of topological methods, particularly focusing on persistent
homology and the Mapper algorithm, and demonstrated their practical application in machine learning tasks. By
providing a clear and accessible roadmap, we aimed to equip readers with the tools and understanding necessary to
integrate TDA techniques into their research workflows. The strength of these methods lies in their ability to capture
and quantify intricate, multi-scale topological features that are often missed by traditional ML approaches.
As demonstrated through various case studies, including cancer diagnosis, shape recognition, and drug discovery,
TDA offers unique insights and interpretability, which are increasingly valuable in today’s era of complex data. The
integration of persistent homology for feature extraction and Mapper for intuitive data visualization opens new pathways
for researchers to explore the underlying structures of their data.
Looking forward, the continued development of software libraries, scalable algorithms, and more interpretable
models will play a crucial role in making topological machine learning even more accessible to a broader audience.
The future directions we highlighted, such as topological deep learning and automated TDA models, provide exciting
opportunities for further advancements in the field.
We hope this tutorial serves as a foundational resource for those new to TDA and inspires further exploration of
topological methods in machine learning and beyond. By embracing these techniques, researchers can unlock novel
insights and push the boundaries of what is possible in data analysis.

ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation under grants DMS-2202584, 2229417, and
DMS-2220613 and by the Simons Foundation under grant # 579977. The authors acknowledge the Texas Advanced
Computing Center (TACC) at The University of Texas at Austin for providing computational resources that have
contributed to the research results reported within this paper.

REFERENCES
[1] Nazmiye Ceren Abay, Cuneyt Gurcan Akcora, Yulia R. Gel, Murat Kantarcioglu, Umar D. Islambekov, Yahui Tian, and Bhavani Thuraisingham. 2019.
ChainNet: Learning on Blockchain Graphs with Topological Features. In 2019 IEEE International Conference on Data Mining, ICDM 2019, Beijing, China,
November 8-11, 2019, Jianyong Wang, Kyuseok Shim, and Xindong Wu (Eds.). IEEE, NY, USA, 946–951. https://doi.org/10.1109/ICDM.2019.00105
[2] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta,
and Lori Ziegelmeier. 2017. Persistence images: A stable vector representation of persistent homology. JMLR 18 (2017).
[3] Cuneyt G Akcora, Murat Kantarcioglu, Yulia Gel, and Baris Coskunuzer. 2022. Reduction algorithms for persistence diagrams of networks:
CoralTDA and PrunIT. Advances in Neural Information Processing Systems 35 (2022), 25046–25059.
[4] Cuneyt Gurcan Akcora, Yitao Li, Yulia R. Gel, and Murat Kantarcioglu. 2019. BitcoinHeist: Topological Data Analysis for Ransomware Detection
on the Bitcoin Blockchain. CoRR abs/1906.07852 (2019). arXiv:1906.07852 http://arxiv.org/abs/1906.07852
[5] Mehmet E Aktas, Esra Akbas, and Ahmed El Fatmaoui. 2019. Persistence homology of networks: methods and applications. Applied Network
Science 4, 1 (2019), 1–28.
[6] Dashti Ali, Aras Asaad, María José Jiménez, Vidit Nanda, Eduardo Paluzo-Hidalgo, and Manuel Soriano-Trigueros. 2023. A Survey of Vectorization
Methods in Topological Data Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 45, 12 (2023), 14069–14080. https://doi.org/10.1109/TPAMI.2023.3308391
[7] Bernardo Ameneyro, George Siopsis, and Vasileios Maroulas. 2022. Quantum Persistent Homology for Time Series. In 7th IEEE/ACM Symposium
on Edge Computing, SEC 2022, Seattle, WA, USA, December 5-8, 2022. IEEE, 387–392. https://doi.org/10.1109/SEC54971.2022.00057
[8] Erik J Amézquita, Farzana Nasrin, Kathleen M Storey, and Masato Yoshizawa. 2023. Genomics data analysis via spectral shape and topology. Plos
one 18, 4 (2023), e0284820.
[9] Mark Anthony Armstrong. 2013. Basic topology. Springer Science & Business Media, NY, USA.
[10] Ulrich Bauer. 2021. Ripser: efficient computation of Vietoris-Rips persistence barcodes. J. Appl. Comput. Topol. 5, 3 (2021), 391–423. https://doi.org/10.1007/s41468-021-00071-5
[11] Ulrich Bauer, Michael Kerber, Jan Reininghaus, and Hubert Wagner. 2017. Phat–persistent homology algorithms toolbox. Journal of symbolic
computation 78 (2017), 76–90.
[12] Murat Ali Bayir, Kiarash Shamsi, Huseyincan Kaynak, and Cuneyt Gurcan Akcora. 2022. Topological Forest. IEEE Access 10 (2022), 131711–131721.
[13] Cristian Bodnar, Cătălina Cangea, and Pietro Liò. 2021. Deep graph mapper: Seeing graphs through the neural lens. Frontiers in big Data 4 (2021),
680535.
[14] Magnus Bakke Botnan and Michael Lesnick. 2022. An Introduction to Multiparameter Persistence. CoRR abs/2203.14289 (2022). https://doi.org/10.48550/ARXIV.2203.14289 arXiv:2203.14289
[15] Peter Bubenik et al. 2015. Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16, 1 (2015), 77–102.
[16] Charlotte Bunne. 2023. Optimal transport in learning, control, and dynamical systems. ICML Tutorial (2023).
[17] Chen Cai and Yusu Wang. 2020. Understanding the power of persistence pairing via permutation test. arXiv:2001.06058 (2020).
[18] G. Carlsson. 2009. Topology and data. Bull. Amer. Math. Soc. 46, 2 (2009).
[19] Gunnar Carlsson, Gurjeet Singh, and Afra Zomorodian. 2009. Computing multidimensional persistence. In Algorithms and Computation: 20th
International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings 20. Springer, 730–739.
[20] Gunnar Carlsson and Afra Zomorodian. 2007. The theory of multidimensional persistence. In Proceedings of the twenty-third annual symposium on
Computational geometry. 184–193.
[21] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas Guibas. 2004. Persistence barcodes for shapes. In Proceedings of the 2004
Eurographics/ACM SIGGRAPH symposium on Geometry processing. 124–135.
[22] Daniel R Carmody and Richard B Sowers. 2021. Topological analysis of traffic pace via persistent homology. Journal of Physics: Complexity 2, 2
(2021), 025007.
[23] Mathieu Carrière and Andrew Blumberg. 2020. Multiparameter persistence image for topological machine learning. NeurIPS 33 (2020).
[24] Mathieu Carrière, Frédéric Chazal, Yuichi Ike, Théo Lacombe, Martin Royer, and Yuhei Umeda. 2020. Perslay: A neural network layer for persistence
diagrams and new graph topological signatures. In AISTATS. 2786–2796.
[25] Mathieu Carriere, Marco Cuturi, and Steve Oudot. 2017. Sliced Wasserstein kernel for persistence diagrams. In International conference on machine
learning. PMLR, 664–673.
[26] Frédéric Chazal, Brittany Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. 2015. Subsampling methods for
persistent homology. In International Conference on Machine Learning. PMLR, 2143–2151.
[27] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman. 2014. Stochastic convergence of persistence
landscapes and silhouettes. In Proceedings of the thirtieth annual symposium on Computational geometry. 474–483.
[28] Frédéric Chazal and Bertrand Michel. 2021. An introduction to topological data analysis: fundamental and practical aspects for data scientists.
Frontiers in Artificial Intelligence 4 (2021).
[29] Yuzhou Chen, Baris Coskunuzer, and Yulia Gel. 2021. Topological relational learning on graphs. NeurIPS 34 (2021), 27029–27042.
[30] Yuzhou Chen, Ignacio Segovia-Dominguez, Cuneyt Gurcan Akcora, Zhiwei Zhen, Murat Kantarcioglu, Yulia Gel, and Baris Coskunuzer. 2024. EMP:
Effective Multidimensional Persistence for Graph Representation Learning. In Learning on Graphs Conference. PMLR, 24–1.


[31] Yu-Min Chung and Austin Lawson. 2019. Persistence curves: A canonical framework for summarizing persistence diagrams. arXiv:1904.07768
(2019).
[32] James R Clough, Nicholas Byrne, Ilkay Oksuz, Veronika A Zimmer, Julia A Schnabel, and Andrew P King. 2020. A topological loss function for
deep-learning based image segmentation using persistent homology. IEEE transactions on pattern analysis and machine intelligence 44, 12 (2020),
8766–8778.
[33] Alex Cole, Gregory J Loges, and Gary Shiu. 2021. Quantitative and interpretable order parameters for phase transitions from persistent homology.
Physical Review B 104, 10 (2021), 104426.
[34] Padraig Corcoran and Christopher B Jones. 2023. Topological data analysis for geographical information science using persistent homology.
International Journal of Geographical Information Science 37, 3 (2023), 712–745.
[35] Baris Coskunuzer, Ignacio Segovia-Dominguez, Yuzhou Chen, and Yulia R Gel. 2024. Time-Aware Knowledge Representations of Dynamic Objects
with Multidimensional Persistence. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 11678–11686.
[36] Thibault de Surrel, Felix Hensel, Mathieu Carrière, Théo Lacombe, Yuichi Ike, Hiroaki Kurihara, Marc Glisse, and Frédéric Chazal. 2022. RipsNet: a
general architecture for fast and robust estimation of the persistent homology of point clouds. In Topological, Algebraic and Geometric Learning
Workshops 2022. PMLR, 96–106.
[37] Andac Demir, Baris Coskunuzer, Yulia Gel, Ignacio Segovia-Dominguez, Yuzhou Chen, and Bulent Kiziltan. 2022. ToDD: Topological compound
fingerprinting in computer-aided drug discovery. NeurIPS 35 (2022), 27978–27993.
[38] Tamal Krishna Dey and Yusu Wang. 2022. Computational Topology for Data Analysis. Cambridge University Press.
[39] Herbert Edelsbrunner and John L Harer. 2022. Computational topology: an introduction. American Mathematical Society.
[40] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. 2002. Topological Persistence and Simplification. Discrete Comput Geom 28 (2002),
511–533.
[41] Lukas Fesser, Sergio Serrano de Haro Ivánez, Karel Devriendt, Melanie Weber, and Renaud Lambiotte. 2024. Augmentations of Forman’s Ricci
curvature and their applications in community detection. Journal of Physics: Complexity 5, 3 (2024), 035010.
[42] Rickard Brüel Gabrielsson and Gunnar Carlsson. 2019. Exposition and interpretation of the topology of neural networks. In 2019 18th ieee
international conference on machine learning and applications (icmla). IEEE, 1069–1076.
[43] Adélie Garin and Guillaume Tauzin. 2019. A topological" reading" lesson: Classification of MNIST using TDA. In 2019 18th IEEE International
Conference On Machine Learning And Applications (ICMLA). IEEE, 1551–1556.
[44] Aldo Gonzalez-Lorenzo, Mateusz Juda, Alexandra Bac, Jean-Luc Mari, and Pedro Real. 2016. Fast, simple and separable computation of Betti
numbers on three-dimensional cubical complexes. In Computational Topology in Image Context: 6th International Workshop, CTIC 2016, Marseille,
France, June 15-17, 2016, Proceedings 6. Springer, 130–139.
[45] Jean Goubault-Larrecq. 2013. Non-Hausdorff topology and domain theory: Selected topics in point-set topology. Vol. 22. Cambridge University Press.
[46] Pierre Guillou, Jules Vidal, and Julien Tierny. 2023. Discrete morse sandwich: Fast computation of persistence diagrams for scalar data–an algorithm
and a benchmark. IEEE Transactions on Visualization and Computer Graphics 30, 4 (2023), 1897–1915.
[47] Wei Guo, Han Qiu, Zimian Liu, Junhu Zhu, and Qingxian Wang. 2022. GLD-Net: Deep Learning to Detect DDoS Attack via Topological and Traffic
Feature Fusion. Computational Intelligence and Neuroscience 2022, 1 (2022), 4611331.
[48] Mustafa Hajij, Bei Wang, and Paul Rosen. 2018. MOG: Mapper on graphs for relationship preserving clustering. arXiv preprint arXiv:1804.11242
(2018).
[49] Mustafa Hajij, Ghada Zamzmi, Theodore Papamarkou, Nina Miolane, Aldo Guzmán-Sáenz, Karthikeyan Natesan Ramamurthy, Tolga Birdal,
Tamal K Dey, Soham Mukherjee, Shreyas N Samaga, et al. 2022. Topological deep learning: Going beyond graph data. arXiv preprint arXiv:2206.00606
(2022).
[50] William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. (2017).
[51] Allen Hatcher. 2002. Algebraic Topology. Cambridge University Press.
[52] Ryu Hayakawa. 2022. Quantum algorithm for persistent betti numbers and topological data analysis. Quantum 6 (2022), 873.
[53] Felix Hensel, Michael Moor, and Bastian Rieck. 2021. A Survey of Topological Machine Learning Methods. Frontiers in Artificial Intelligence 4
(2021), 52.
[54] Geoffrey E Hinton and Sam Roweis. 2002. Stochastic neighbor embedding. Advances in neural information processing systems 15 (2002).
[55] Christoph Hofer, Florian Graf, Bastian Rieck, Marc Niethammer, and Roland Kwitt. 2020. Graph filtration learning. In ICML. 4314–4323.
[56] Christoph D Hofer, Roland Kwitt, and Marc Niethammer. 2019. Learning Representations of Persistence Barcodes. JMLR 20, 126 (2019), 1–45.
[57] Xiaoling Hu, Fuxin Li, Dimitris Samaras, and Chao Chen. 2019. Topology-preserving deep image segmentation. Advances in neural information
processing systems 32 (2019).
[58] Johanna Immonen, Amauri Souza, and Vikas Garg. 2024. Going beyond persistent homology using persistent homology. Advances in Neural
Information Processing Systems 36 (2024).
[59] Mohd Sabri Ismail, Mohd Salmi Md Noorani, Munira Ismail, Fatimah Abdul Razak, and Mohd Almie Alias. 2022. Early warning signals of financial
crises using persistent homology. Physica A: Statistical Mechanics and Its Applications 586 (2022), 126459.
[60] Shizuo Kaji, Takeki Sudo, and Kazushi Ahara. 2020. Cubical ripser: Software for computing persistent homology of image and volume data. arXiv
preprint arXiv:2005.12692 (2020).


[61] Harish Kannan, Emil Saucan, Indrava Roy, and Areejit Samal. 2019. Persistent homology of unweighted complex networks via discrete Morse
theory. Scientific reports 9, 1 (2019), 13817.
[62] Aditi S Krishnapriyan, Joseph Montoya, Maciej Haranczyk, Jens Hummelshøj, and Dmitriy Morozov. 2021. Machine learning with persistent
homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks. Scientific reports 11, 1
(2021), 8888.
[63] Genki Kusano, Yasuaki Hiraoka, and Kenji Fukumizu. 2016. Persistence weighted Gaussian kernel for topological data analysis. In ICML. 2004–2013.
[64] Li Li, Wei-Yi Cheng, Benjamin S Glicksberg, Omri Gottesman, Ronald Tamler, Rong Chen, Erwin P Bottinger, and Joel T Dudley. 2015. Identification
of type 2 diabetes subgroups through topological analysis of patient similarity. Science translational medicine 7, 311 (2015), 311ra174–311ra174.
[65] Yitao Li, Umar Islambekov, Cuneyt Akcora, Ekaterina Smirnova, Yulia R Gel, and Murat Kantarcioglu. 2020. Dissecting ethereum blockchain
analytics: What we learn from topology and geometry of the ethereum graph?. In Proceedings of the 2020 SIAM international conference on data
mining. SIAM, 523–531.
[66] David Loiseaux, Mathieu Carrière, and Andrew Blumberg. 2024. A framework for fast and stable representations of multiparameter persistent
homology decompositions. Advances in Neural Information Processing Systems 36 (2024).
[67] David Loiseaux, Luis Scoccola, Mathieu Carrière, Magnus Bakke Botnan, and Steve Oudot. 2024. Stable vectorization of multiparameter persistent
homology using signed barcodes as measures. Advances in Neural Information Processing Systems 36 (2024).
[68] Rafael López. 2024. Point-set Topology: A Working Textbook. Springer Nature.
[69] Yuankai Luo, Lei Shi, and Veronika Thost. 2024. Improving self-supervised molecular representation learning using persistent homology. Advances
in Neural Information Processing Systems 36 (2024).
[70] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. 2022. Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual
MLP Framework. In International Conference on Learning Representations.
[71] Nicholas O Malott, Aaron M Sens, and Philip A Wilsey. 2020. Topology preserving data reduction for computing persistent homology. In 2020 IEEE
International Conference on Big Data (Big Data). IEEE, 2681–2690.
[72] Yukio Matsumoto. 2002. An introduction to Morse theory. Vol. 208. American Mathematical Soc.
[73] Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv
preprint arXiv:1802.03426 (2018).
[74] Nikola Milosavljević, Dmitriy Morozov, and Primoz Skraba. 2011. Zigzag persistent homology in matrix multiplication time. In SoCG. 216–225.
[75] Dmitriy Morozov. 2005. Persistence algorithm takes cubic time in worst case. BioGeometry News, Dept. Comput. Sci., Duke Univ 2 (2005).
[76] Soham Mukherjee, Darren Wethington, Tamal K Dey, and Jayajit Das. 2022. Determining clinically relevant features in cytometry data using
persistent homology. PLoS Computational Biology 18, 3 (2022), e1009931.
[77] James R Munkres. 2018. Elements of algebraic topology. CRC press.
[78] Monica Nicolau, Arnold J Levine, and Gunnar Carlsson. 2011. Topology based data analysis identifies a subgroup of breast cancers with a unique
mutational profile and excellent survival. Proceedings of the National Academy of Sciences 108, 17 (2011), 7265–7270.
[79] Ippei Obayashi, Takenobu Nakamura, and Yasuaki Hiraoka. 2022. Persistent homology analysis for materials research and persistent homology
software: HomCloud. journal of the physical society of japan 91, 9 (2022), 091013.
[80] Adam Onus and Vanessa Robins. 2022. Quantifying the homology of periodic cell complexes. arXiv preprint arXiv:2208.09223 (2022).
[81] Nina Otter, Mason A Porter, Ulrike Tillmann, Peter Grindrod, and Heather A Harrington. 2017. A roadmap for the computation of persistent
homology. EPJ Data Science 6 (2017), 1–38.
[82] Theodore Papamarkou, Tolga Birdal, Michael Bronstein, Gunnar Carlsson, Justin Curry, Yue Gao, Mustafa Hajij, Roland Kwitt, Pietro Liò, Paolo
Di Lorenzo, et al. 2024. Position paper: Challenges and opportunities in topological deep learning. arXiv preprint arXiv:2402.08871 (2024).
[83] Mathilde Papillon, Sophia Sanborn, Mustafa Hajij, and Nina Miolane. 2023. Architectures of Topological Deep Learning: A Survey of Message-Passing
Topological Neural Networks. arXiv preprint arXiv:2304.10031 (2023).
[84] Yaopeng Peng, Hongxiao Wang, Milan Sonka, and Danny Z Chen. 2024. PHG-Net: Persistent Homology Guided Medical Image Classification. In
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 7583–7592.
[85] The GUDHI Project. 2024. GUDHI User and Reference Manual (3.10.1 ed.). GUDHI Editorial Board. https://gudhi.inria.fr/doc/3.10.1/
[86] Chi Seng Pun, Si Xian Lee, and Kelin Xia. 2022. Persistent-homology-based machine learning: a survey and a comparative study. Artificial
Intelligence Review 55, 7 (2022), 5169–5213.
[87] Emilie Purvine, Davis Brown, Brett Jefferson, Cliff Joslyn, Brenda Praggastis, Archit Rathore, Madelyn Shapiro, Bei Wang, and Youjia Zhou.
2023. Experimental observations of the topology of convolutional neural network activations. In Proceedings of the AAAI Conference on Artificial
Intelligence, Vol. 37. 9470–9479.
[88] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 652–660.
[89] Raúl Rabadán and Andrew J Blumberg. 2019. Topological data analysis for genomics and evolution: topology in biology. Cambridge University Press.
[90] Raúl Rabadán, Yamina Mohamedi, Udi Rubin, Tim Chu, Adam N Alghalith, Oliver Elliott, Luis Arnés, Santiago Cal, Álvaro J Obaya, Arnold J Levine,
et al. 2020. Identification of relevant genetic alterations in cancer using topological data analysis. Nature communications 11, 1 (2020), 3808.
[91] Omar Rafique and Ajaz Hussain Mir. 2020. A topological approach for cancer subtyping from gene expression data. Journal of biomedical informatics
102 (2020), 103357.

[92] Arif Ahmad Rather and Manzoor Ahmad Chachoo. 2022. UMAP guided topological analysis of transcriptomic data for cancer subtyping.
International Journal of Information Technology 14, 6 (2022), 2855–2865.
[93] Abbas H Rizvi, Pablo G Camara, Elena K Kandror, Thomas J Roberts, Ira Schieren, Tom Maniatis, and Raul Rabadan. 2017. Single-cell topological
RNA-seq analysis reveals insights into cellular differentiation and development. Nature biotechnology 35, 6 (2017), 551–560.
[94] Alejandro Robles, Mustafa Hajij, and Paul Rosen. 2017. The shape of an image: A study of mapper on images. arXiv preprint arXiv:1710.09008
(2017).
[95] Kiarash Shamsi, Farimah Poursafaei, Shenyang Huang, Bao Tran Gia Ngo, Baris Coskunuzer, and Cuneyt Gurcan Akcora. 2024. GraphPulse:
Topological representations for temporal graph property prediction. In The Twelfth International Conference on Learning Representations.
[96] Jonathon Shlens. 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100 (2014).
[97] Luca Simi. 2024. A Scalable Implementation of Mapper for Topological Data Analysis via Vantage Point Trees. https://doi.org/10.5281/zenodo.10659652
[98] Gurjeet Singh, Facundo Mémoli, and Gunnar E Carlsson. 2007. Topological methods for the analysis of high dimensional data sets and 3d object
recognition.. In Eurographics Symposium on Point-Based Graphics. 91–100.
[99] Yashbir Singh, Colleen M Farrelly, Quincy A Hathaway, Tim Leiner, Jaidip Jagtap, Gunnar E Carlsson, and Bradley J Erickson. 2023. Topological
data analysis in medical imaging: current state of the art. Insights into Imaging 14, 1 (2023), 58.
[100] Yara Skaf and Reinhard Laubenbacher. 2022. Topological data analysis in biomedicine: A review. Journal of Biomedical Informatics 130 (2022),
104082.
[101] Eashwar V Somasundaram, Shael E Brown, Adam Litzler, Jacob G Scott, and Raoul R Wadhwa. 2021. Benchmarking r packages for calculation of
persistent homology. The R journal 13, 1 (2021), 184.
[102] Adam Spannaus, Heidi A Hanson, Georgia Tourassi, and Lynne Penberthy. 2024. Topological Interpretability for Deep Learning. In Proceedings of
the Platform for Advanced Scientific Computing Conference. 1–11.
[103] Andrew Tausz, Mikael Vejdemo-Johansson, and Henry Adams. 2014. JavaPlex: A research software package for persistent (co)homology. In Proceedings of ICMS 2014 (Lecture Notes in Computer Science 8592), Han Hong and Chee Yap (Eds.). 129–136. Software available at http://appliedtopology.github.io/javaplex/.
[104] Guillaume Tauzin, Umberto Lupo, Lewis Tunstall, Julian Burella Pérez, Matteo Caorsi, Anibal M. Medina-Mardones, Alberto Dassatti, and
Kathryn Hess. 2021. giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration. Journal of Machine Learning
Research 22, 39 (2021), 1–6. http://jmlr.org/papers/v22/20-325.html
[105] Guillaume Tauzin, Umberto Lupo, Lewis Tunstall, Julian Burella Pérez, Matteo Caorsi, Anibal Medina-Mardones, Alberto Dassatti, and Kathryn
Hess. 2020. giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration. arXiv:2004.02551 [cs.LG]
[106] Christopher Tralie, Nathaniel Saul, and Rann Bar-On. 2018. Ripser.py: A Lean Persistent Homology Library for Python. The Journal of Open Source
Software 3, 29 (Sep 2018), 925. https://doi.org/10.21105/joss.00925
[107] Renata Turkes, Guido F Montufar, and Nina Otter. 2022. On the effectiveness of persistent homology. Advances in Neural Information Processing
Systems 35 (2022), 35432–35448.
[108] Laurens Van Der Maaten. 2014. Accelerating t-SNE using tree-based algorithms. The journal of machine learning research 15, 1 (2014), 3221–3245.
[109] Hendrik Jacob van Veen, Nathaniel Saul, David Eargle, and Sam W. Mangham. 2019. Kepler Mapper: A flexible Python implementation of the
Mapper algorithm. Journal of Open Source Software 4, 42 (2019), 1315. https://doi.org/10.21105/joss.01315
[110] Oliver Vipond. 2020. Multiparameter Persistence Landscapes. J. Mach. Learn. Res. 21 (2020), 61–1.
[111] Oliver Vipond, Joshua A Bull, Philip S Macklin, Ulrike Tillmann, Christopher W Pugh, Helen M Byrne, and Heather A Harrington. 2021.
Multiparameter persistent homology landscapes identify immune cell spatial patterns in tumors. Proceedings of the National Academy of Sciences
118, 41 (2021).
[112] Hubert Wagner, Chao Chen, and Erald Vuçini. 2011. Efficient computation of persistent homology for cubical data. In Topological methods in data
analysis and visualization II: theory, algorithms, and applications. Springer, 91–106.
[113] Fan Wang, Huidong Liu, Dimitris Samaras, and Chao Chen. 2020. Topogan: A topology-aware generative adversarial network. In Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, 118–136.
[114] Tongxin Wang, Travis Johnson, Jie Zhang, and Kun Huang. 2018. Topological methods for visualization and analysis of high dimensional single-cell
RNA sequencing data. In BIOCOMPUTING 2019: Proceedings of the Pacific Symposium. World Scientific, 350–361.
[115] Cheng Xin, Soham Mukherjee, Shreyas N Samaga, and Tamal K Dey. 2023. GRIL: A 2-parameter Persistence Based Vectorization for Machine
Learning. In Topological, Algebraic and Geometric Learning Workshops 2023. PMLR, 313–333.
[116] Ankur Yadav, Faisal Ahmed, Ovidiu Daescu, Reyhan Gedik, and Baris Coskunuzer. 2023. Histopathological Cancer Detection with Topological
Signatures. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 1610–1619.
[117] Zuoyu Yan et al. 2022. Neural Approximation of Extended Persistent Homology on Graphs. In ICLR Workshop on Geometrical and Topological
Representation Learning.
[118] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. 2017. Deep sets. Advances in
neural information processing systems 30 (2017).
[119] Sebastian Zeng, Florian Graf, Christoph Hofer, and Roland Kwitt. 2021. Topological attention for time series forecasting. Advances in neural
information processing systems 34 (2021), 24871–24882.

[120] Simon Zhang, Soham Mukherjee, and Tamal K Dey. 2022. GEFL: extended filtration learning for graph classification. In Learning on Graphs
Conference. PMLR, 16–1.
[121] Chi Zhou, Zhetong Dong, and Hongwei Lin. 2022. Learning persistent homology of 3D point clouds. Computers & Graphics 102 (2022), 269–279.
[122] Youjia Zhou, Nithin Chalapathi, Archit Rathore, Yaodong Zhao, and Bei Wang. 2021. Mapper Interactive: A scalable, extendable, and interactive
toolbox for the visual exploration of high-dimensional data. In 2021 IEEE 14th Pacific Visualization Symposium (PacificVis). IEEE, 101–110.
[123] Ali Zia, Abdelwahed Khamis, James Nichols, Usman Bashir Tayab, Zeeshan Hayder, Vivien Rolland, Eric Stone, and Lars Petersson. 2024. Topological
deep learning: a review of an emerging paradigm. Artificial Intelligence Review 57, 4 (2024), 77.
[124] Afra Zomorodian and Gunnar Carlsson. 2004. Computing persistent homology. In Proceedings of the twentieth annual symposium on Computational
geometry. 347–356.


A NOTATION TABLE

Table 7. Notation and Main Symbols

Notation Description Details


( X, 𝑑 ) Topological space with metric 𝑑 𝑑 (𝑥, 𝑦) defines the distance between points 𝑥 and 𝑦
𝜒 ( C) Euler characteristics of C Topological invariant to characterize C
H𝑘 ( X) 𝑘 -th homology group of X Describes 𝑘 -dimensional holes in X
C𝑘 ( X) 𝑘 -chains of X All 𝑘 -subcomplexes in X
Z𝑘 ( X) 𝑘 -cycles of X 𝑘 -subcomplexes in X without boundary
B𝑘 ( X) 𝑘 -boundaries of X Boundaries of (𝑘 + 1) -subcomplexes
𝜕𝑘 Boundary operator Maps C𝑘 ( X) to C𝑘 −1 ( X)
I = {𝜖𝑖 } Threshold set Used to define filtration
{ K𝑖 } Filtration Nested sequence of simplicial complexes
PDk ( X) 𝑘 -th persistence diagram of X Visualizes persistent features
PBk ( X) 𝑘 -th persistence barcode of X Alternate view of persistence
N𝑟 ( X) 𝑟 -neighborhood of X Forms covering space
R𝑟 ( X) Rips complex of point cloud X Built from pairwise distances
Č𝑟 ( X) Čech complex of point cloud X Constructed using open balls
G = ( V, E ) Graph, vertices, edges V are vertices, E are edges
W = {𝑤𝑖 𝑗 } Edge weights of G Assigns weight 𝑤𝑖 𝑗 to edge 𝑒𝑖 𝑗
G𝑘 𝑘 -th power of graph G Higher-order graph structure
Ge Clique complex of G Contains all maximal cliques
Δ𝑖 𝑗 Pixel at (𝑖, 𝑗 ) -coordinate A pixel in an 𝑚 × 𝑛 -size image
W𝑝 (·, ·) Wasserstein distance Measures distance between PDs
W∞ (·, ·) Bottleneck distance Measures distance between PDs
𝛽®𝑘 ( X) Betti vector for PDk ( X) Summarizes homological features
𝛽𝑖 ( X) 𝑖 𝑡ℎ Betti number of X Rank of H𝑖 ( X)
{ K𝑖 𝑗 } Bifiltration Nested sequence of spaces in two parameters

B DATASET RESOURCES
As we develop this tutorial to introduce topological methods to the ML community, we also seek to highlight their
real-world applications to the mathematics community. For those eager to gain hands-on experience with topological
techniques using real-world datasets, Table 8 lists the commonly utilized datasets. For task-specific datasets, you can
explore the collections at https://paperswithcode.com/datasets and https://www.kaggle.com.
Table 8. Datasets Resources. The list of commonly used benchmark datasets in ML.

Format        Dataset                          Task                            Dataset URL
Point Cloud   ModelNet                         Shape classification            https://modelnet.cs.princeton.edu
Point Cloud   UCR                              Time Series Classification      https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
Point Cloud   Broad Institute                  Single Cell Genotyping          https://singlecell.broadinstitute.org/single_cell
Point Cloud   3D-PointCloud                    Various                         https://github.com/zhulf0804/3D-PointCloud
Graph         Small Benchmarks                 Graph Classification            https://paperswithcode.com/task/graph-classification
Graph         Small Benchmarks                 Node Classification             https://paperswithcode.com/task/node-classification
Graph         Small Benchmarks                 Link Prediction                 https://paperswithcode.com/task/link-prediction
Graph         OGB (Large Benchmarks)           Graph, Node, Link               https://ogb.stanford.edu
Graph         TUDatasets (Small Benchmarks)    Graph, Node, Link               https://chrsmrrs.github.io/datasets
Graph         TGB (Temporal Graphs)            Dynamic Prediction              https://tgb.complexdatalab.com/
Image         MNIST                            Handwritten Digits              https://yann.lecun.com/exdb/mnist
Image         MedMNIST                         Medical Image Classification    https://medmnist.com/
Image         Fashion MNIST                    Image Classification            https://www.kaggle.com/datasets/zalando-research/fashionmnist
Image         Various Collections              Image Classification            https://paperswithcode.com/datasets?task=image-classification

Received 20 September 2024; revised 12 March 20xx; accepted 5 June 20xx

