MLST Wavelet Positional Encoding
Truong-Son Hy ∗
University of Alabama at Birmingham, United States
thy@uab.edu
Abstract
Unsupervised pre-training on vast amounts of graph data is critical in real-world ap-
plications wherein labeled data is limited, such as molecule properties prediction or
materials science. Existing approaches pre-train models for specific graph domains,
neglecting the inherent connections within networks. This limits their ability to
transfer knowledge to various supervised tasks. In this work, we propose a novel
pre-training strategy on graphs that focuses on modeling their multi-resolution
structural information, allowing us to capture global information of the whole
graph while preserving local structures around its nodes. We extend the work
of Graph Wavelet Positional Encoding (WavePE) from Ngo et al. [44] by pre-
training a High-Order Permutation-Equivariant Autoencoder (HOPE-WavePE) to
reconstruct node connectivities from their multi-resolution wavelet signals. Since
our approach relies solely on the graph structure, it is domain-agnostic and adapt-
able to datasets from various domains, therefore paving the way for developing
general graph structure encoders and graph foundation models. We theoretically
demonstrate that for $k$ given resolutions, the width required for the autoencoder
to learn arbitrarily long-range information is $O\!\left(n^{1/k} r^{1+1/k} \epsilon^{-1/k}\right)$, where $n$ and $r$
denote the number of nodes and the rank of the normalized Laplacian, respectively,
and $\epsilon$ is the error tolerance defined by the Frobenius norm. We also evaluate
HOPE-WavePE on graph-level prediction tasks of different areas and show its
superiority compared to other methods. Our source code is publicly available at
https://github.com/HySonLab/WaveletPE.
1 Introduction
One of the fastest-growing areas in machine learning is graph representation learning, with impactful
applications in biomedicine, molecular chemistry, and network science. Most graph neural networks
(GNNs) rely on the message-passing framework that processes graph-structured data by exchanging
the vectorized information between nodes on graphs along their edges. Albeit achieving remarkable
results in a wide range of tasks on graph data, message-passing neural networks (MPNNs) possess
several fundamental limits, including expressiveness [64], over-smoothing [8], and over-squashing
[56]. In recent years, transformer-based architectures [73, 31, 12, 68] have emerged as powerful
alternatives to address the mentioned issues of MPNNs. The self-attention mechanism in conventional
graph transformers lets every pair of nodes interact directly, and virtual-node (VN) augmented MPNNs
achieve a similar effect by routing all messages through a shared global node.
† This work was done during the authors' internship at FPT Software AI Center, Vietnam, under the supervision of Dr. Truong-Son Hy.
∗ Correspondence to thy@uab.edu
However, graph transformers (GTs) and VN-augmented MPNNs disregard the underlying structure
of graph data by altering inherent connections among the nodes (i.e., shortening all paths to two).
This disregard may explain their limitations in several graph-level prediction tasks. To address this,
positional and structural encodings (PSE) are commonly used to enhance structural information
in modern GNNs. However, existing PSE methods are often tailored to specific
tasks. For instance, random walk encodings [14, 37] excel with small molecules, while
Laplacian encoding [12] captures long-range interactions effectively. Meanwhile, Ngo et al. [44]
propose Wavelet positional encoding (WavePE), a PSE encoding technique that models the node
interactions at multiple scales. By leveraging multi-resolution analysis based on Wavelet Transform
on graphs [63, 21, 11], WavePE enables neural networks to capture a comprehensive range of
structural information, from local to global properties.
Real-world applications of learning methods on graphs face two challenges [45, 24]. First, in fields
like biology and chemistry, task-specific labeled data is scarce, and acquiring them requires significant
time and resources [76]. Second, real-world graph datasets often contain out-of-distribution samples;
for example, newly synthesized molecules may have different structures from those in the training
set, leading to inaccurate predictions from supervised models. Transfer learning is a
promising approach to overcoming these challenges. Our key insight is that graph data possesses
two distinct feature types: (1) domain-specific features (e.g., atom types in molecules or user names in citation
networks) and (2) topological features. While the former depends on specific domains, the latter is a
general form of graph data. This observation gives rise to our idea of structural pretraining, wherein
we pretrain a model on topological features of graph data. Specifically, given the Wavelet signals of a
graph, we train an autoencoder to reconstruct its adjacency matrix. The graph wavelet signals are
defined based on functions of eigenvalues of the graph Laplacian at multiple resolutions, representing
general forms of connectivities among nodes on the graphs. As a result, a model trained on these
features should be able to capture the underlying patterns of different graph structures and generalize
well to downstream tasks. After pretraining, we can use this pretrained model as a general structural
encoder that extracts node structural features, which are passed to graph neural networks along with
domain-specific features for predicting task-specific properties.
1.1 Contribution
In this work, we propose a new pretraining approach on graphs that leverages the reconstruction of
graph structures from the Wavelet signals to generalize structural information on graph data, thus
enabling transfer learning to various downstream tasks across various domains. Our contributions are
three-fold as follows:
• We propose a high-order structural pretrained model for graph-structured data that lever-
ages high-order interactions of nodes on graphs, therefore capturing better positional and
structural information.
• We theoretically prove that pretraining by reconstruction from multi-resolution Wavelet
signals enables the autoencoder to learn the node state after a walk of arbitrary length $d$,
which can contain both local and global information about graph structures. We also analyze
the width required for the latent space of the autoencoder to guarantee such performance
and propose a low-rank alternative.
• We empirically show that such a pretraining scheme can enhance the performance of super-
vised models when fine-tuned on downstream datasets from different domains, indicating the
generalizability and effectiveness of pretrained structural encodings compared with other
domain-specific pretraining methods.
2 Related work
Graph Positional and Structural Encodings Node positional encodings are augmented into node
features to preserve the graph structure information and increase the expressiveness of MPNNs
[33, 14, 59, 16] and Graph Transformers [13, 37, 40, 29, 47, 73]. Many recent works directly use the
graph Laplacian eigenvectors as node positional encodings and combine them with domain-specific
node features [13, 29, 47, 73]. Other works use several approaches to encode a graph's structural
information, including random walks [41, 37, 14, 32], diffusion kernels on graphs [41, 31, 44],
shortest paths [32, 69], and unsupervised node embedding methods [59, 34]. Our work builds on
similar approaches to Wang et al. [59] and Liu et al. [34], where we pre-train an encoder to capture
node positional information in an unsupervised setting. However, unlike previous works, our encoder
is high-order equivariant, enabling it to capture multi-scale properties of graph structures using
Wavelet positional encodings [44]. These learned positional features can be adapted to various
downstream tasks and generalize well to domain-specific datasets. Furthermore, we theoretically
demonstrate that our method is general and can express several common positional encoding schemes.
Pretraining on Graphs In the age of modern deep learning, pre-training models on massive
unlabeled datasets and then fine-tuning them on smaller, labeled datasets has proven remarkably
successful in areas like natural language processing [6, 54] and computer vision [22, 46]. For graph-
structured data, self-supervised pre-training is a growing area of research, with contrastive learning
being a popular approach [72, 66, 75, 71, 28]. While these methods can enhance performance on
various downstream tasks, their reliance on domain-specific features to create different views of the
input graphs limits their transferability to other domains. For instance, models pretrained on small
molecules struggle to adapt to other graph types like citation networks, proteins, or code repositories.
Our work aims to propose a more generalizable and transferable graph pre-training method that
relies only on intrinsic graph features, i.e., adjacency matrices, so it can be trained on unfeatured
data and easily transferred to predict properties of small graph datasets in arbitrary domains.
3.1 Notations
In this work, we define a graph $G = (V, A)$, where $V$ denotes the node set and
$A \in \mathbb{R}^{n \times n}$ is the adjacency matrix. The normalized graph Laplacian is computed as
$\tilde{L} = I_n - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix and $I_n$ is the identity matrix.
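As a point of reference, a minimal NumPy sketch of this computation (illustrative code, not from the released implementation):

```python
import numpy as np

def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    """Compute L_tilde = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5   # guard against isolated nodes
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

# Example on a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L_tilde = normalized_laplacian(A)
```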
In this section, we formally define permutation symmetry, i.e. symmetry to the action of the symmetric
group, Sn , and construct permutation-equivariant neural networks to encode graph wavelets. An
element σ ∈ Sn is a permutation of order n, or a bijective map from {1, 2, . . . , n} to {1, 2, . . . , n}.
We present each element $\sigma \in S_n$ as a permutation matrix $P_\sigma \in \mathbb{R}^{n \times n}$. For example, the action of $S_n$
on an adjacency matrix $A \in \mathbb{R}^{n \times n}$ and on a latent matrix (i.e., node embedding matrix) $Z \in \mathbb{R}^{n \times d_z}$ is:
$$\sigma(A) = P_\sigma A P_\sigma^\top, \qquad \sigma(Z) = P_\sigma Z,$$
for $\sigma \in S_n$. Here, the adjacency matrix $A$ is a second-order tensor with a single feature channel,
while the latent matrix $Z$ is a first-order tensor with $d_z$ feature channels. Likewise, a permutation
acting on a $k$-th order tensor $X \in \mathbb{R}^{n^k \times d}$ operates on the first $k$ dimensions of $X$.
Furthermore, we define permutation equivariance for functions between tensors of different orders.
Definition 3.1. An $S_n$-equivariant (or permutation-equivariant) function is a function $f : \mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^{k'} \times d'}$ that satisfies $f(\sigma(X)) = \sigma(f(X))$ for all $\sigma \in S_n$ and $X \in \mathbb{R}^{n^k \times d}$.
Although Def. 3.1 is stated for features of arbitrary order, in the scope of this work we only consider
features up to second order, i.e. $k, k' \le 2$. The definition applies equally to neural networks,
all of whose components satisfy this property.
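As a quick numerical illustration of Def. 3.1 (our own toy check, not part of the paper's codebase), the row-sum reduction used later in the encoder is $S_n$-equivariant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1) + np.triu(A, 1).T          # random symmetric adjacency (second-order tensor)

perm = rng.permutation(n)
P = np.eye(n)[perm]                           # permutation matrix P_sigma

row_sum = lambda X: X.sum(axis=1)             # maps second-order tensors to first-order tensors

# Equivariance: f(sigma(A)) == sigma(f(A)).
assert np.allclose(row_sum(P @ A @ P.T), P @ row_sum(A))
```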
[Figure 1 panels: s = 4, s = 15, s = 50; color scale from 0.0 to 1.0.]
Figure 1: Visualization of graph Wavelet on the geometric graph of a torus. The low scaling factor s
results in a highly localized structure around the center node (yellow), while higher factors can lead
to smoother signals that can spread out to a larger part of the graph.
Laplacian Positional Encoding (Laplacian PE). Let $G = (V, A)$ be an undirected graph with
adjacency matrix $A \in \mathbb{R}^{n \times n}$ and degree matrix $D$. The normalized Laplacian is defined as
$\tilde{L} = I - D^{-1/2} A D^{-1/2}$. Let $\tilde{L} = U \Lambda U^\top$ be its eigendecomposition. The Laplacian PE is given by
$X_{\mathrm{LapPE}} = U_k$, where $U_k \in \mathbb{R}^{n \times k}$ contains the eigenvectors corresponding to the $k$ smallest nontrivial eigenvalues.
Random Walk Positional Encoding (RWPE). Let $P = D^{-1} A$ be the random walk transition
matrix. The $t$-step transition probability is $P^t$. The RWPE of node $i$ up to step $k$ is
$$x_i^{\mathrm{RWPE}} = \left( (P^1)_{ii}, (P^2)_{ii}, \ldots, (P^k)_{ii} \right).$$
Both encodings capture structural information from spectral and probabilistic perspectives,
respectively.
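Both encodings can be computed directly from $A$; a small NumPy sketch (illustrative only, assuming a connected graph with no isolated nodes):

```python
import numpy as np

def lap_pe(A: np.ndarray, k: int) -> np.ndarray:
    """Laplacian PE: eigenvectors of L_tilde for the k smallest nontrivial eigenvalues."""
    n = A.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt
    _, eigvec = np.linalg.eigh(L)              # eigenvalues in ascending order
    return eigvec[:, 1:k + 1]                  # skip the trivial eigenvector

def rw_pe(A: np.ndarray, k: int) -> np.ndarray:
    """RWPE: return probabilities (P^t)_{ii} for t = 1..k, with P = D^{-1} A."""
    P = A / A.sum(axis=1, keepdims=True)
    feats, Pt = [], np.eye(A.shape[0])
    for _ in range(k):
        Pt = Pt @ P
        feats.append(np.diag(Pt))
    return np.stack(feats, axis=1)             # shape (n, k)
```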
4 Methodology
4.1 Spectral Graph Wavelet Tensors
As the normalized graph Laplacian $\tilde{L} \in \mathbb{R}^{n \times n}$ is symmetric when $G$ is undirected, we can decompose
it into a complete set of orthonormal eigenvectors $U = (u_1, u_2, \cdots, u_n)$, wherein $u_i$ is associated
with a real, non-negative eigenvalue $\lambda_i$:
$$\tilde{L} = U \Lambda U^\top, \qquad (1)$$
where $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_n)$ is the diagonal matrix of eigenvalues. The graph Wavelet transform
constructs a set of spectral graph Wavelets as bases to project the graph signal from the vertex domain
to the spectral domain:
$$\psi_s = U \Lambda_s U^\top, \qquad (2)$$
where $\Lambda_s = \mathrm{diag}(g(s\lambda_1), \cdots, g(s\lambda_n))$ is the scaling matrix of the eigenvalues. The scaling function
$g$ takes the eigenvalue $\lambda_i$ and an additional scaling factor $s$ as inputs, indicating how a signal diffuses
away from node $i$ at scale $s$; we select $g_s(\lambda) = \exp(-s\lambda)$ as the heat kernel. This means that we
can vary the scaling parameter $s$ to adjust the neighborhoods surrounding a center node; see
Figure 1 for an illustration. Following Ngo et al. [44], a set of $k$ graph Wavelets $\{\psi_{s_i}\}_{i=1}^{k} \subset \mathbb{R}^{n \times n}$ can
be constructed, resulting in a tensor of graph Wavelets $\mathcal{W} \in \mathbb{R}^{n \times n \times k}$.
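For concreteness, a direct eigendecomposition-based construction of the wavelet tensor $\mathcal{W} \in \mathbb{R}^{n \times n \times k}$ might look as follows (a minimal sketch; the default scales mirror the appendix setting, and the function name is ours):

```python
import numpy as np

def wavelet_tensor(A: np.ndarray, scales=(0.25, 0.5, 0.75, 1.0)) -> np.ndarray:
    """Stack psi_s = U diag(exp(-s * lambda)) U^T over the given scales -> shape (n, n, k)."""
    n = A.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt          # normalized Laplacian
    lam, U = np.linalg.eigh(L)                           # L = U diag(lam) U^T
    psis = [U @ np.diag(np.exp(-s * lam)) @ U.T for s in scales]
    return np.stack(psis, axis=-1)                       # wavelet tensor W
```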
[Figure 2 block labels: Multidomain Dataset; Wavelet Transform; Domain-agnostic, Long-range Structural Information; Equivariant Encoder; Equivariant Decoder; Equivariant MLP.]
Figure 2: Our proposed equivariant autoencoder pretraining scheme is applied to a large multi-domain
dataset while extending the feature degree, overcoming the domain-specific weakness while
also embedding long-range information.
Fast Graph Wavelet Transform Traditionally, computing graph wavelet bases necessitates the full
diagonalization of the graph Laplacian $\tilde{L}$ in Equation (1). This approach becomes computationally
expensive as the number of nodes increases. Our work adopts the efficient algorithm proposed by
Hammond et al. [21], which uses Chebyshev polynomials to approximate the Wavelets with
a time complexity of $O(|E| \times M)$, where $M$ is the order of the polynomials and $E$ is the set
of edges. This approach scales linearly with the number of edges, and is therefore more efficient than
previous methods based on the eigendecomposition of the graph Laplacian.
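A sketch of this Chebyshev-based approximation, written with dense NumPy arrays for readability (our own illustrative implementation of the recurrence from Hammond et al. [21], not the paper's code; with a sparse Laplacian the same recurrence applied to a signal matrix yields the stated $O(|E| \times M)$ cost per signal):

```python
import numpy as np

def cheb_coeffs(s: float, order: int, lam_max: float = 2.0, n_quad: int = 200) -> np.ndarray:
    """Chebyshev expansion coefficients of g(lam) = exp(-s * lam) on [0, lam_max]."""
    theta = np.pi * (np.arange(n_quad) + 0.5) / n_quad
    lam = lam_max * (np.cos(theta) + 1.0) / 2.0          # map [-1, 1] back to [0, lam_max]
    g = np.exp(-s * lam)
    return np.array([2.0 / n_quad * np.sum(g * np.cos(m * theta)) for m in range(order + 1)])

def fast_wavelet(L: np.ndarray, s: float, order: int = 30, lam_max: float = 2.0) -> np.ndarray:
    """Approximate psi_s = g(sL) via the Chebyshev recurrence, without eigendecomposition."""
    n = L.shape[0]
    Lr = (2.0 / lam_max) * L - np.eye(n)                 # rescale spectrum into [-1, 1]
    c = cheb_coeffs(s, order, lam_max)
    T_prev, T_curr = np.eye(n), Lr                       # T_0 and T_1 of the rescaled Laplacian
    psi = 0.5 * c[0] * T_prev + c[1] * T_curr
    for m in range(2, order + 1):
        T_prev, T_curr = T_curr, 2.0 * Lr @ T_curr - T_prev
        psi = psi + c[m] * T_curr
    return psi
```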
Our goal is to train a learnable structural encoder to extract generalized and abstract node-level
structural features that can be transferred across downstream tasks in graph learning. To fulfill this,
we parameterize a high-order autoencoder [27, 55] that learns to reconstruct high-order features of
graph data from their wavelet tensors W ∈ Rn×n×k .
In particular, given a second-order wavelet tensor $\mathcal{W} \in \mathbb{R}^{n \times n \times k}$, the encoder $E$ encodes $\mathcal{W}$ into a
latent matrix $Z = E(\mathcal{W}) \in \mathbb{R}^{n \times d_\ell}$; the encoder $E$ can be composed of many equivariant operators.
Furthermore, to reduce the redundancy caused by high-order training, we use only two equivariant
mappings in the encoder, the diagonal and the row sum:
$$E^1(\mathcal{W}) = \mathrm{diag}(\mathcal{W}), \qquad E^2(\mathcal{W}) = \mathcal{W} \mathbf{1}_n.$$
Then, the decoder $D$ lifts $Z$ back to a high-order feature map $F = D(Z) \in \mathbb{R}^{n \times n \times d}$. Here, we
use the outer product and the diagonal operator, which represent structural and positional information,
respectively:
$$D^1(Z_i) = Z_i \otimes Z_i^\top, \qquad D^2(Z_i) = \mathrm{diag}(Z_i),$$
where $\otimes$ is the outer product operator and $Z_i$ is the $i$-th channel of the encoded information.
Here, $d_\ell$ and $d$ denote the numbers of channels (dimensions) of each entry in an $n \times n$ matrix.
Finally, we use a channel-wise multi-layer perceptron (MLP) $\phi : \mathbb{R}^{n \times n \times d} \mapsto \mathbb{R}^{n \times n \times r}$ to map $F$ to a
concatenation of high-degree adjacency matrices in binary values. Specifically, let $A^j$ be the binary matrix
of $j$-hop neighbors in a graph and $\hat{A}^j$ be its prediction. The final MLP network returns the predicted
array
$$\phi(F) = \left[ \hat{A}^{s_1} \,\|\, \hat{A}^{s_2} \,\|\, \ldots \,\|\, \hat{A}^{s_r} \right], \qquad (3)$$
where $s_1, s_2, \ldots, s_r$ are natural numbers (hop degrees) to be chosen. In this work, we let these values follow an
exponential pattern, which emphasizes range diversity.
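A condensed PyTorch-style sketch of the encoder/decoder maps described above (module names and sizes are illustrative; the actual model stacks IGN layers [39] as detailed in the appendix):

```python
import torch
import torch.nn as nn

class HopeWaveAE(nn.Module):
    """Simplified equivariant autoencoder over a wavelet tensor W of shape (n, n, k)."""
    def __init__(self, k: int, d_latent: int = 20, r: int = 4):
        super().__init__()
        # Encoder: mix the two equivariant reductions (diagonal and row sum), channel-wise.
        self.enc = nn.Linear(2 * k, d_latent)
        # Channel-wise MLP mapping decoded features (2 * d_latent channels) to r hop logits.
        self.head = nn.Sequential(nn.Linear(2 * d_latent, 64), nn.ReLU(), nn.Linear(64, r))

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        # --- Encoder: second order -> first order ---------------------------------
        diag = torch.diagonal(W, dim1=0, dim2=1).T          # (n, k)  E^1: diagonal
        rowsum = W.sum(dim=1)                                # (n, k)  E^2: row sum
        Z = self.enc(torch.cat([diag, rowsum], dim=-1))      # latent, (n, d_latent)
        # --- Decoder: first order -> second order ---------------------------------
        outer = torch.einsum('ic,jc->ijc', Z, Z)             # D^1: channel-wise outer product
        diag_lift = torch.diag_embed(Z.T).permute(1, 2, 0)   # D^2: channel-wise diagonal lift
        F = torch.cat([outer, diag_lift], dim=-1)            # (n, n, 2 * d_latent)
        return self.head(F)                                  # logits for r hop adjacencies, (n, n, r)

# Usage: logits = HopeWaveAE(k=4)(W) with W the (n, n, 4) wavelet tensor.
```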
Masking removes redundancy In sparse structures, learning different hop adjacencies induces an
imbalance between edges and non-edges, which varies with the adjacency degree and with the data
structure. For example, a molecular graph from MoleculeNet [61] has a shorter diameter than a peptide
graph [15]. Thus, long-degree adjacencies are redundant for small molecules and hurt generalizability
for bigger graphs. To fix this, we use a binary mask $M \in \mathbb{R}^{n \times n \times r}$ that filters out random entries in
both the ground truth and the prediction such that non-edges and edges are equal in quantity for each degree of
predicted adjacency. The remaining entries define the binary cross-entropy loss:
$$\mathcal{L}_M(A^{s_i}, \hat{A}^{s_i}) = \mathrm{BinCrossEntropy}(M \odot A^{s_i}, M \odot \hat{A}^{s_i}),$$
where $\odot$ is the elementwise product. For small connected graphs, masking removes redundant
adjacencies $A^j$ (all-ones matrices) for high values of $j$. For graphs with larger diameters, masking also eases
the tuning of the degree hyperparameters $s_i$ for $i \in [r]$, as the loss produced by $A^{s_i}$ is zero
for large degrees $s_i$. This motivates our multi-range training strategy, which makes the autoencoder
range-aware with respect to graph size and structure.
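A sketch of the balanced masking loss for a single hop degree, assuming the ground-truth hop adjacency and the predicted logits are already available (variable and function names are ours):

```python
import torch
import torch.nn.functional as F

def balanced_mask(target: torch.Tensor) -> torch.Tensor:
    """Binary mask keeping all edges and an equal number of randomly chosen non-edges."""
    pos = target > 0.5
    n_pos = int(pos.sum())
    keep_neg = torch.zeros_like(target, dtype=torch.bool)
    neg_idx = (~pos).nonzero(as_tuple=False)
    if len(neg_idx) > 0 and n_pos > 0:
        sel = neg_idx[torch.randperm(len(neg_idx))[:n_pos]]
        keep_neg[sel[:, 0], sel[:, 1]] = True
    return (pos | keep_neg).float()

def masked_bce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Binary cross entropy restricted to the masked (class-balanced) entries."""
    mask = balanced_mask(target)
    loss = F.binary_cross_entropy_with_logits(logits, target.float(), reduction='none')
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```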
5 Theoretical Results
5.1 HOPE-WavePE is stable
Stability in graph structural positional encoding refers to the comparative change of positional
information when the (normalized) Laplacian is slightly perturbed. We expand on the preliminary
definition of [58].
Definition 5.1 (Stable PE). Let $S(\tilde{L}) \in \mathbb{R}^{n \times d}$ be the positional feature that can be added or
concatenated to the input representation. Then $S$ is stable if there exists a universal constant $C > 0$ such that
$$\left\| S(\tilde{L}) - S(\tilde{L} + \delta\tilde{L}) \right\| \le C \left\| \delta\tilde{L} \right\|$$
for all natural numbers $n, d$, normalized Laplacians $\tilde{L}$, and perturbation terms $\delta\tilde{L} \in \mathbb{R}^{n \times n}$.
We consider two scenarios for $\delta\tilde{L}$: one where it is a random perturbation of magnitude $o(\|\tilde{L}\|)$,
and one where $\delta\tilde{L}$ is the difference between two normalized Laplacians,
$\delta\tilde{L} = \tilde{L}_1 - P^* \tilde{L}_2 P^{*\top}$ with $P^* = \arg\min_{P} \|\tilde{L}_1 - P \tilde{L}_2 P^\top\|$, where $\tilde{L}_1, \tilde{L}_2$ are the normalized Laplacians
of two different structures. This serves as an extension of the initial definition, showing that stable PEs
are resilient to small perturbations. This is especially meaningful, as random perturbations usually
inflict heavy misinformation on spectral-based positional encodings.
A stable PE ensures that a small change in the normalized Laplacian, which can occur due to
perturbation or structural change, induces at most a linear change in the PE.
Methods such as Laplacian Positional Encoding (LapPE), for example, are not stable because the SVD is not
unique. Works such as SignNet and BasisNet [33] are invariant under these unstable transformations;
however, they cannot guarantee a small change of the PE given a small change of $\tilde{L}$, and thus they are also
not stable. Moreover, we show that RWPE is not stable either. First, let us make a remark on
RWPE [14]:
Definition 5.2. Given $G = (V, E)$ with corresponding adjacency matrix $A$, the RWPE matrix
$S_{RW} = \{s_{ij}\}$ is defined as
$$s_{ij} = \mathrm{RW}_i(j) = \tilde{L}^j[i, i],$$
where $\mathrm{RW}_i(j)$ is the probability that node $i$ returns to itself after a random walk of length $j$.
Theorem 5.3. For sufficiently large $n$ and any $C > 0$, there exist a normalized Laplacian $\tilde{L}_0$
and a perturbation $\delta\tilde{L}_0$ such that
$$\left\| S_{RW}(\tilde{L}_0 + \delta\tilde{L}_0) - S_{RW}(\tilde{L}_0) \right\| > C \left\| \delta\tilde{L}_0 \right\|.$$
The proof of Theorem 5.3 is quite straightforward. We observe that the encoding induced by a random
walk is a polynomial of degree at most $d$ with non-vanishing coefficients. This indicates that the
change of $S_{RW}$ is $O(\|\delta\tilde{L}\|^d)$ rather than $O(\|\delta\tilde{L}\|)$, which is the necessary condition for stability.
This negative result for RWPE can be fixed by scaling down the coefficients attached to long walks,
i.e., high polynomial degrees. Conveniently, HOPE-WavePE tackles exactly this, either through the
analytic approximation of the fast Wavelet transform used in practice [21] or as an infinite Chebyshev
polynomial.
Theorem 5.4 (informal). Let $S_{Wave}$ be the Wavelet PE induced by the first-order embedding of the
AE; then there exists a parametrization such that $S_{Wave} = S_{Wave}(\tilde{L})$ is a stable PE.
In contrast to the non-vanishing coefficients of RWPE, Wavelets at all scales are computed via the Chebyshev
polynomial expansion. This expansion has a logarithmic convergence rate, which dominates the polynomial
rate-of-change problem raised by RWPE. This, together with the multiscale motivation of HOPE-
WavePE, ensures a stable and range-diverse structural embedding.
The authors of [37] showed that RWPE combined with an $S_n$-equivariant MLP can predict the state of a
walk only up to the length embedded in the encoding; for example, the concatenation of $(D^{-1}A)^i$ for
$i = 1, \ldots, k$, where $D$ and $A$ are the degree and adjacency matrices respectively, can only predict the
state of walks up to length $k$ (see Proposition 3.1 of [37]). This means that embedding global information is not
possible without expensive computational cost. We demonstrate theoretically that HOPE-WavePE
can reconstruct second-order features while also generating arbitrarily high-degree information, which
can capture global information. The inputs to HOPE-WavePE are primarily the structural wavelets at
different scaling factors.
Clarification. We use the non-linearity $\mathrm{ReLU}(x) = \max(0, x)$ as the activation in the hidden layers
of the MLPs. Moreover, biases in the linear mappings are included. The norm $\|\cdot\|$ denotes the
Frobenius norm for real vectors and matrices, and the $L_2$ norm for function spaces.
For convenience, we denote the input of HOPE-WavePE as the concatenated array
$E_k^{(2)} = \left[ \psi_{s_1} \,\|\, \psi_{s_2} \,\|\, \ldots \,\|\, \psi_{s_k} \right]$ of size $n \times n \times k$. The superscript denotes the order of the feature; this
notation is also used later in the Appendix.
We justify our HOPE-WavePE pretraining scheme by first showing that, with asymptotically sublinear
width, the AE can estimate $E_d^{(2)}$ for all $d > k$.
Theorem 5.5. For any $\epsilon > 0$ and real coefficients $\theta_0, \theta_1, \ldots, \theta_d$, assume that the multidomain
dataset has $C$ connected components and $r = n + 1 - C$; then there exists an $S_n$-equivariant AE
$f : \mathbb{R}^{n \times n \times k} \to \mathbb{R}^{n \times n \times d}$ of width $O(n^{1/k} r^{1+1/k} \epsilon^{-1/k})$ and a channel-wise feed-forward network
$g : \mathbb{R}^d \to \mathbb{R}$ such that
$$\left\| (g \circ f)(E_k^{(2)}) - p(\psi_s) \right\| < \epsilon,$$
where $p(\psi_s) = \sum_{j=0}^{d} \theta_j \psi_s^j$.
The connected-component aspect of Theorem 5.5 allows more efficient pretraining of HOPE-
WavePE while preserving performance. This is particularly useful since a combined dataset of
diverse domains and sparse graph structures also carries a large number of connected components.
Moreover, if $r < n + 1 - C$, the statement can be read as a low-rank approximation of the
second-order features. In fact, we found this low-rank alternative to be robust for $r$ as low as four.
We summarize the proof of Theorem 5.5 as follows:
• The tensor $E_k^{(2)}$ is contracted to the first-order tensor $E_k^{(1)}$ of size $n \times k$ via a resolution-wise
row sum.
• A spline method is applied to approximate $\Lambda^{k+1}$ through $\Lambda^{r}$, where $\Lambda$ is the diagonal eigenvalue
matrix of $\psi_s$. The bound on the width is derived from the number of spline pieces and the
results of [49].
• The decoder broadcasts the first-order feature into second-order form via an outer product. The
output is a tensor of size $n \times n \times r$.
• Since $\psi_s^i - I_n$ for $i = 1, \ldots, d$ all have rank $r$, a linear layer applied after the AE can
represent the rank-one summations needed to estimate the output, together with a weighted sum corresponding
to the polynomial coefficients of $p(\psi_s)$.
Theorem 5.5 shows that HOPE-WavePE can reconstruct, and also generate high-degree versions of, the
second-order features. However, the question remains of how large a range an output of degree
$d$ can encode. To answer this, we define the binary matrix $A_t \in \mathbb{R}^{n \times n}$, where $A_t[i, j] = 1$ if there exists a
walk of length $t$ between nodes $i$ and $j$, and $0$ otherwise.
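The matrices $A_t$ can be materialized directly from powers of $A$; a short sketch for an unweighted graph:

```python
import numpy as np

def walk_indicator(A: np.ndarray, t: int) -> np.ndarray:
    """A_t[i, j] = 1 iff there exists a walk of length exactly t between nodes i and j."""
    return (np.linalg.matrix_power(A, t) > 0).astype(np.int8)

# Example targets for multi-hop reconstruction, e.g. exponentially spaced degrees.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
targets = [walk_indicator(A, s) for s in (1, 2, 4, 8)]
```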
Theorem 5.6. For any $\epsilon > 0$ and real coefficients $\theta_1, \theta_2, \ldots, \theta_d$, there exists a two-layer ReLU feed-forward
network $\varphi : \mathbb{R}^{n \times n \times d} \to \mathbb{R}^{n \times n}$ of hidden dimension $d_h = 2$ such that
$$\left\| \varphi(E_d^{(2)}) - \sum_{j=1}^{d} \theta_j A_j \right\| < \epsilon.$$
To prove Theorem 5.6, we consider $\psi_s$ in the two scenarios of WavePE and RWPE separately.
For WavePE, we perform an inverse Wavelet transform to recover the polynomial of the normalized
Laplacian. Then, the binary step function is approximated via the MLP, and the output is taken as a
weighted sum of $A_j$ for $j = 1, \ldots, d$.
Theorem 5.6 combined with Theorem 5.5 implies that embedding long-range dependencies is possible
with our AE pretraining scheme and can be done efficiently. Moreover, since the linear
layers and MLPs in these statements are all channel-wise, they all satisfy the $S_n$-equivariance requirement.
By learning long-range dependencies, we can obtain much meaningful information about the graph,
including pairwise shortest-path distances, cycle counting, cycle detection, and more. The latest
provably stable work of [26], where the authors use an eigendecomposition, can cost $O(n^3)$ and is not
provably capable of capturing long-range information. In contrast, HOPE-WavePE is not only provably $O(n)$ for sparse graphs
but also stable, resilient to perturbation, and able to learn much more complex long-range structures.
6 Experiments
This section empirically evaluates the advantages of our structurally pre-trained autoencoder (AE).
We first detail the pre-training process, which utilizes a large dataset of graph structures. We then
present experimental results on various downstream tasks. As our objective is to learn a structural
feature extractor for graph data, we aim to demonstrate three key points:
• Enhanced performance on small datasets: The encoder of the pre-trained AE can extract
structural features that enhance the performance of graph-level prediction tasks on small
datasets.
• Effective PSE feature representation: The learned structural features can capture global
information and preserve graph structures when integrated into and fine-tuned along with
global models like graph transformers or VN-augmented MPNNs.
• Generalizability: The pre-trained AE can generalize well to out-of-distribution graphs with
different connectivity types and sizes, even beyond those present in the training dataset.
6.1 Setup
Method BBBP BACE Tox21 ToxCast SIDER
D-MPNN [67] 71.0(0.3) 80.9(0.6) 75.9(0.7) 65.5(0.3) 57.0(0.7)
AttentiveFP [62] 64.3(1.8) 78.4(0.0) 76.1(0.5) 63.7(0.2) 60.6(3.2)
N-GramRF [35] 69.7(0.6) 77.9(1.5) 74.3(0.4) - 66.8(0.7)
N-GramXGB [35] 69.1(0.8) 79.1(1.3) 75.8(0.9) - 65.5(0.7)
PretrainGNN [25] 68.7(1.3) 79.9(0.9) 76.7(0.4) 65.7(0.6) 62.7(0.8)
GROVERbase [48] 70.0(0.1) 82.6(0.7) 74.3(0.1) 65.4(0.4) 64.8(0.6)
GROVERlarge [48] 69.5(0.1) 81.0(1.4) 73.5(0.1) 65.3(0.5) 65.4(0.1)
GraphLOG pretrained [65] 72.5(0.8) 83.5(1.2) 75.7(0.5) 63.5(0.7) 61.2 (1.1)
GraphCL pretrained [70] 69.7(0.7) 75.4(1.4) 73.9(0.7) 62.4(0.6) 60.5(0.9)
InfoGraph pretrained [60] 66.3(0.6) 64.8(0.7) 68.1(0.6) 58.4(0.6) 57.1(0.8)
MPNN + LapPE 68.5(0.6) 81.1(0.5) 76.2(0.3) 64.1(0.4) 63.0(0.2)
MPNN + RWPE 69.3(0.5) 81.7(0.6) 76.8(0.2) 65.4(0.4) 63.8(0.3)
MPNN + WavePE 70.2(0.4) 86.2(0.4) 77.5(0.2) 66.3(0.5) 64.3(0.3)
MPNN + HOPE-WavePE (ours) 71.2(0.4) 86.8(0.4) 78.0(0.1) 66.9(0.5) 64.7(0.3)
Table 1: Experimental results on five small-molecule (scaffold-split) classification datasets. Methods
are evaluated by ROC-AUC % (↑), where higher scores indicate better performance. We report the
mean and standard deviation (in brackets) of all methods over three random seeds. Top 3 results are
highlighted, including First, Second, and Third.
Downstream task After pretraining, we integrate the pretrained autoencoder into other graph
neural networks for graph-level supervised tasks. We concatenate the node-level structural features
extracted by the pretrained autoencoder with domain-specific features before feeding them into GNN-based
models for downstream evaluation. The implementation details for each specific downstream
task are provided in Appendix B.
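Schematically, the fine-tuning pipeline looks as follows (a hedged sketch with hypothetical module names; the structural encoder is the pretrained HOPE-WavePE encoder and the backbone is any MPNN or graph transformer):

```python
import torch
import torch.nn as nn

class DownstreamModel(nn.Module):
    """Concatenate pretrained structural features with domain features, then run a GNN."""
    def __init__(self, structural_encoder: nn.Module, gnn: nn.Module,
                 d_domain: int, d_struct: int, d_hidden: int):
        super().__init__()
        self.structural_encoder = structural_encoder      # pretrained HOPE-WavePE encoder
        self.proj = nn.Linear(d_domain + d_struct, d_hidden)
        self.gnn = gnn                                    # any MPNN / graph transformer backbone

    def forward(self, x_domain, wavelet_tensor, edge_index):
        z_struct = self.structural_encoder(wavelet_tensor)        # (n, d_struct) structural features
        h = self.proj(torch.cat([x_domain, z_struct], dim=-1))    # fused node features
        return self.gnn(h, edge_index)                            # downstream graph-level prediction
```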
6.2 Results
While we pretrained our HOPE-WavePE autoencoder on the MolPCA dataset within the Moleculenet
benchmark [61], the performance on downstream tasks within this benchmark may not fully capture
the generalizability of our approach. To address this, we evaluate HOPE-WavePE on a broader range
of real-world graph datasets. These datasets encompass diverse graph structures, connectivity types,
sizes, and domain-specific features, significantly differing from those found in Moleculenet.
Moleculenet Benchmark We follow the setting in [53] wherein pretrained models are evaluated
on a selection of five small molecule datasets in the MoleculeNet Benchmark [61]. Scaffold splitting
is used to define training and testing data, ensuring that test molecules may significantly differ
from those used for training. This approach assesses a model’s ability to predict properties of out-
of-distribution molecular structures. We compare our methods with multiple baselines, including
supervised and pretraining. In particular, two supervised methods are D-MPNN [67] and AttentiveFP
[62]. For pretraining, we compare our work with PretrainGNN [25], GraphCL [70], GraphLog
[65], Grover [48], and InfoGraph [60]. According to the experimental results in Table 1, pretrained
HOPE-WavePE can improve the performances of MPNN on small molecule property prediction
tasks. Our approach, in particular, outperforms other pretraining baselines by a large margin on 3 of the 5
datasets and achieves comparable results on the other two.
TUDataset Benchmark We evaluate HOPE-WavePE on six diverse datasets from the TUDataset
benchmark [42]: a small molecule dataset (MUTAG), two chemical compound datasets (NCI1 and
NCI109), a macromolecule dataset (PROTEINS) and two social network datasets (IMDB-B and
IMDB-M). In this task, we combine the structural features extracted by the pretrained autoencoder
with node domain features and feed them into an MPNN consisting of five GIN [64] layers with
hidden dimensions varying in [32, 64, 128]. The results in Table 2 demonstrate that GIN augmented
with HOPE-WavePE significantly outperforms other complicated high-order networks like IGN [39],
CIN, and PPGNs on three out of six datasets. Notably, even though pretrained on molecular graphs,
HOPE-WavePE effectively improves GIN’s expressiveness on the two social network datasets as well.
This is because our method focuses solely on the inherent properties of graph data, independent of
domain-specific features.
Table 2: Experimental results on TU datasets. The methods are evaluated by Accuracy % (↑). The
reported results are means and standard deviations of runs over five random seeds. Top 3 results
are highlighted, including First, Second, and Third.
Different Graph Connectivity Patterns We evaluate on two image classification datasets, MNIST
and CIFAR-10, from [16]. In these datasets, we construct k-nearest neighbor graphs (k = 8) using
superpixels within each image. This generates a collection of graph structures that differ significantly
from those encountered during the pre-training phase, which typically involves molecular structures.
In this task, we augment node superpixel features with their structural features before feeding them
into GPS [47], a graph transformer model. As shown in Table 3, GPS + HOPE-WavePE achieves
comparable accuracy to other MPNNs and GPS models that rely on different explicit positional
encoding methods.
ZINC Dataset We evaluate the performance of HOPE-WavePE on a subset of the ZINC dataset
[19] containing 12,000 graphs with up to 38 heavy atoms. We assess its ability on molecular
tasks using GPS [47] equipped with HOPE-WavePE. The results in Table 3 show that our method outperforms
the two most popular encoding baselines, LapPE and RWSE, and shows statistically significant
differences compared to traditional GNN methods.
Table 3: Experimental results on image classification tasks and ZINC. The results are reported by
taking the mean ± the std over four random seeds. Top 3 results are highlighted, including First,
Second and Third.
Long-range Graph Benchmark (LRGB) To show that the pretrained AE can capture global infor-
mation while preserving local information on graph data, we conduct experiments on two datasets,
Peptides-struct and Peptides-func, from the Long-range Graph Benchmark [15]. In this task, we integrate
pretrained HOPE-WavePE into a VN-augmented MPNN and fine-tune it on the two datasets. We
compare our work with several MPNN and transformer-based baselines, including GCN [30], GatedGCN [5],
Drew-GCN + LapPE [20], GraphGPS + LapPE [47], GRIT [37], GraphViT, GraphMLP-Mixer [23], and
MPNN with virtual node and random walk structural encoding [7].
Figure 3: A visualization of how HOPE-WavePE identifies several patterns and models the
interactions between nodes on graphs. We observe that pretraining on a large dataset of
graph-structured data enables the high-order AE to explore intricate substructures on molecular
graphs, such as cycles or repeated leaves and atoms.
Table 4: Experimental results on Peptides func and Peptides struct. We report the means and standard
deviations of 4 random seeds.
Visualization of Local Information The underlying motivation of using graph wavelets over graph
Laplacians, i.e. the graph Fourier transform, is because they offer localization in spectral domains
[63]. This is a crucial property for learning graphs, particularly in molecular or community graphs
where local connections among atoms or people reveal important structural properties. To visualize
the local information of molecular structures captured by our pretraining scheme, we compute
the L2 -norm of each row in the latent matrix Z ∈ Rn×dℓ . This provides a magnitude vi = ||Z[i, :]||2
for each node i, reflecting the strength of its signal in the latent space of the graph with n nodes.
Figure 3, which shows 8 randomly selected graphs from the validation set of our pretraining
dataset, demonstrates that our method can detect repeated or long-range patterns, such as cycles in
complex molecular graphs. Nodes with similar L2 -norm values tend to be close in the graph structure,
indicating the model’s ability to capture these cyclical relationships.
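The node-wise magnitudes used in Figure 3 are simply row norms of the latent matrix; a one-function sketch:

```python
import numpy as np

def latent_node_magnitudes(Z: np.ndarray) -> np.ndarray:
    """v_i = ||Z[i, :]||_2 : per-node signal strength in the latent space (n,)."""
    return np.linalg.norm(Z, axis=1)

# The resulting values can be passed as node colors to any graph-drawing utility.
```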
[Figure 4 legend: 7, 10, 20, 30, 40, and 60 wavelet channels; x-axis: number of graph nodes; y-axis: accuracy.]
Figure 4: Average reconstruction accuracy on Peptides-struct graphs with different wavelet channel
quantities.
Our ablations have three focuses. First, we explore how the number of wavelet channels
affects the reconstructability of the adjacency tensor. Second, we show that cross-domain training
improves performance only incrementally. Finally, we show how masked versus unmasked training
affects the reconstructability of HOPE-WavePE on medium-sized graphs.
Wavelet Channel Number As shown in Figure 1, different scaling factors impact the receptive
field of each node. Since each resolution is responsible for learning a specific range of hops, more
resolutions should capture dependencies over a more diverse range. Figure 4 records
the reconstruction accuracies of the concatenated array $\{A_{2^i}\}_{i=0}^{7}$ given 7, 10, 20, 30, 40, and 60 wavelet
channels, respectively. The experiment is conducted on the Peptides-struct dataset from the Long
Range Graph Benchmark [15], which emphasizes long-range learnability.
[Figure 5 legend: masked vs. unmasked training with 7 and 20 channels; shaded regions mark redundant and non-redundant hop ranges; x-axis: number of graph nodes; y-axis: accuracy.]
Figure 5: Masked vs. unmasked reconstruction accuracy on the MNIST dataset. The regions shaded red
and blue contain hop lengths $s$ for which $A_s$ is and is not all-ones, respectively.
HOPE or no HOPE? We investigate whether HOPE actually makes the encoding vector more
meaningful in terms of structural information. In particular, we use the encoding vector from WavePE
without any pre-training to train a decoder that lifts it back to second order. Moreover, we replace
WavePE with RWPE to examine whether choosing WavePE as the backbone was necessary. The
results in Table ... suggest that
7 Conclusion
We have introduced HOPE-WavePE, a novel high-order permutation-equivariant pretraining method
specifically designed for graph-structured data. Our approach leverages the inherent connectivity
of graphs, eliminating reliance on domain-specific features. This enables HOPE-WavePE to gen-
eralize effectively across diverse graph types and domains. The advantages of HOPE-WavePE are
demonstrated through both theoretical and empirical analysis. Finally, we have discussed
the potential of HOPE-WavePE as a foundation for a general graph structural encoder. A promising
future direction will be to focus on optimizing the scalability of this approach.
References
[1] J. Atwood and D. Towsley. Diffusion-convolutional neural networks, 2016.
[2] C. Bodnar, F. Frasca, Y. G. Wang, N. Otter, G. Montúfar, P. Liò, and M. Bronstein. Weisfeiler
and lehman go topological: Message passing simplicial networks, 2021.
[3] C. Bodnar, F. Frasca, N. Otter, Y. G. Wang, P. Liò, G. Montúfar, and M. Bronstein. Weisfeiler
and lehman go cellular: Cw networks, 2022.
[4] G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network
expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 45(1):657–668, 2023. doi: 10.1109/TPAMI.2022.3154319.
[5] X. Bresson and T. Laurent. Residual gated graph convnets. CoRR, abs/1711.07553, 2017. URL
http://arxiv.org/abs/1711.07553.
[7] C. Cai, T. S. Hy, R. Yu, and Y. Wang. On the connection between mpnn and graph transformer.
arXiv preprint arXiv:2301.11956, 2023.
[8] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun. Measuring and relieving the over-smoothing
problem for graph neural networks from the topological view. Proceedings of the AAAI
Conference on Artificial Intelligence, 34(04):3438–3445, Apr. 2020. doi: 10.1609/aaai.v34i04.
5747. URL https://ojs.aaai.org/index.php/AAAI/article/view/5747.
[10] M. Defferrard, L. Martin, R. Pena, and N. Perraudin. PyGSP: Graph signal processing in Python.
URL https://github.com/epfl-lts2/pygsp, 2017.
[11] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec. Learning structural node embeddings via
diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, KDD ’18, page 1320–1329, New York, NY, USA, 2018.
Association for Computing Machinery. ISBN 9781450355520. doi: 10.1145/3219819.3220025.
URL https://doi.org/10.1145/3219819.3220025.
[14] V. P. Dwivedi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson. Graph neural networks with
learnable structural and positional representations. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=wTTjnvGphYj.
[15] V. P. Dwivedi, L. Rampasek, M. Galkin, A. Parviz, G. Wolf, A. T. Luu, and D. Beaini. Long
range graph benchmark. In Thirty-sixth Conference on Neural Information Processing Sys-
tems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=
in7XC5RcjEn.
[17] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives.
In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages
129–143, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-45167-9.
[18] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for
quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning
- Volume 70, ICML’17, page 1263–1272. JMLR.org, 2017.
[19] R. Gómez-Bombarelli, D. Duvenaud, J. M. Hernández-Lobato, J. Aguilera-Iparraguirre, T. D.
Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven
continuous representation of molecules. CoRR, abs/1610.02415, 2016. URL http://arxiv.
org/abs/1610.02415.
[20] B. Gutteridge, X. Dong, M. Bronstein, and F. D. Giovanni. Drew: Dynamically rewired message
passing with delay, 2023.
[21] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph
theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual
representation learning. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 9729–9738, 2020.
[23] X. He, B. Hooi, T. Laurent, A. Perold, Y. LeCun, and X. Bresson. A generalization of vit/mlp-
mixer to graphs, 2023.
[24] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness
and uncertainty. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pages 2712–2721. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.
press/v97/hendrycks19a.html.
[25] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for
pre-training graph neural networks. In International Conference on Learning Representations,
2020. URL https://openreview.net/forum?id=HJlWWJSFDH.
[26] Y. Huang, W. Lu, J. Robinson, Y. Yang, M. Zhang, S. Jegelka, and P. Li. On the stability
of expressive positional encodings for graphs, 2024. URL https://arxiv.org/abs/2310.
02579.
[27] T. S. Hy and R. Kondor. Multiresolution equivariant graph variational autoencoder. Machine
Learning: Science and Technology, 4(1):015031, mar 2023. doi: 10.1088/2632-2153/acc0d8.
URL https://dx.doi.org/10.1088/2632-2153/acc0d8.
[28] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu. Sub-graph contrast for scalable
self-supervised graph representation learning. In 2020 IEEE international conference on data
mining (ICDM), pages 222–231. IEEE, 2020.
[29] J. Kim, D. Nguyen, S. Min, S. Cho, M. Lee, H. Lee, and S. Hong. Pure transformers are powerful
graph learners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,
Advances in Neural Information Processing Systems, volume 35, pages 14582–14595. Curran
Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/
2022/file/5d84236751fe6d25dc06db055a3180b0-Paper-Conference.pdf.
[30] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
arXiv preprint arXiv:1609.02907, 2016.
[31] D. Kreuzer, D. Beaini, W. Hamilton, V. Létourneau, and P. Tossou. Rethinking graph trans-
formers with spectral attention. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and
J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages
21618–21629. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/
paper_files/paper/2021/file/b4fd1d2cb085390fbbadae65e07876a7-Paper.pdf.
[32] P. Li, Y. Wang, H. Wang, and J. Leskovec. Distance encoding: Design provably more powerful
neural networks for graph representation learning. Advances in Neural Information Processing
Systems, 33, 2020.
[33] D. Lim, J. Robinson, L. Zhao, T. Smidt, S. Sra, H. Maron, and S. Jegelka. Sign and basis
invariant networks for spectral graph representation learning, 2022.
[34] R. Liu, S. Cantürk, O. Lapointe-Gagné, V. Létourneau, G. Wolf, D. Beaini, and L. Rampášek.
Graph positional and structural encoder. arXiv preprint arXiv:2307.07107, 2023.
[35] S. Liu, M. F. Demirel, and Y. Liang. N-gram graph: Simple unsupervised representation
for graphs, with applications to molecules. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Sys-
tems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/
paper_files/paper/2019/file/2f3926f0a9613f3c3cc21d52a3cdb4d9-Paper.pdf.
[36] Y. Luo, H. Li, L. Shi, and X.-M. Wu. Enhancing graph transformers with hierarchical distance
structural encoding, 2024. URL https://arxiv.org/abs/2308.11129.
[37] L. Ma, C. Lin, D. Lim, A. Romero-Soriano, K. Dokania, M. Coates, P. H.S. Torr, and S.-N. Lim.
Graph Inductive Biases in Transformers without Message Passing. In Proc. Int. Conf. Mach.
Learn., 2023.
[38] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph net-
works. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Cur-
ran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/
bb04af0f7ecaee4aae62035497da1387-Paper.pdf.
[39] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks.
In International Conference on Learning Representations, 2019. URL https://openreview.
net/forum?id=Syx72jC9tm.
[40] R. Menegaux, E. Jehanno, M. Selosse, and J. Mairal. Self-attention in colors: Another take on
encoding graph structure in transformers. Transactions on Machine Learning Research, 2023.
ISSN 2835-8856. URL https://openreview.net/forum?id=3dQCNqqv2d.
[41] G. Mialon, D. Chen, M. Selosse, and J. Mairal. Graphit: Encoding graph structure in transform-
ers. arXiv preprint arXiv:2106.05667, 2021.
[42] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. Tudataset: A
collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663,
2020.
[43] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient
graph kernels from propagated information. Machine Learning, 102:209 – 245, 2014. URL
https://api.semanticscholar.org/CorpusID:14487732.
[44] N. K. Ngo, T. S. Hy, and R. Kondor. Multiresolution graph transformers and wavelet positional
encoding for learning long-range and hierarchical structures. The Journal of Chemical Physics,
159(3):034109, 07 2023. ISSN 0021-9606. doi: 10.1063/5.0152833. URL https://doi.org/
10.1063/5.0152833.
[45] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359, 2010. doi: 10.1109/TKDE.2009.191.
[46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision.
In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[47] L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini. Recipe for a
General, Powerful, Scalable Graph Transformer. arXiv:2205.12454, 2022.
[48] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang. Self-supervised graph
transformer on large-scale molecular data. In Proceedings of the 34th International Conference
on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran
Associates Inc. ISBN 9781713829546.
[49] E. Sande, C. Manni, and H. Speleers. Explicit error estimates for spline approximation of
arbitrary smoothness in isogeometric analysis. Numerische Mathematik, 144(4):889–929, Jan.
2020. ISSN 0945-3245. doi: 10.1007/s00211-019-01097-9. URL http://dx.doi.org/10.
1007/s00211-019-01097-9.
[50] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet
kernels for large graph comparison. In D. van Dyk and M. Welling, editors, Proceedings
of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of
Proceedings of Machine Learning Research, pages 488–495, Hilton Clearwater Beach Resort,
Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL https://proceedings.mlr.
press/v5/shervashidze09a.html.
[51] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt.
Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(77):2539–2561,
2011. URL http://jmlr.org/papers/v12/shervashidze11a.html.
[52] Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, and Y. Sun. Masked label prediction: Unified
message passing model for semi-supervised classification. In Z.-H. Zhou, editor, Proceedings
of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages
1548–1554. International Joint Conferences on Artificial Intelligence Organization, 8 2021.
doi: 10.24963/ijcai.2021/214. URL https://doi.org/10.24963/ijcai.2021/214. Main
Track.
[53] R. Sun, H. Dai, and A. W. Yu. Does gnn pretraining help molecular representation? Advances
in Neural Information Processing Systems, 35:12096–12109, 2022.
[54] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M.
Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805, 2023.
[55] E. H. Thiede, T. S. Hy, and R. Kondor. The general theory of permutation equivarant neural
networks and higher order graph variational encoders. arXiv preprint arXiv:2004.03990, 2020.
[56] J. Topping, F. D. Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding
over-squashing and bottlenecks on graphs via curvature. In International Conference on
Learning Representations, 2022. URL https://openreview.net/forum?id=7UmjRGzp-A.
[57] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention
Networks. International Conference on Learning Representations, 2018. URL https://
openreview.net/forum?id=rJXMpikCZ.
[58] H. Wang, H. Yin, M. Zhang, and P. Li. Equivariant and stable positional encoding for more
powerful graph neural networks, 2022. URL https://arxiv.org/abs/2203.00199.
[59] H. Wang, H. Yin, M. Zhang, and P. Li. Equivariant and stable positional encoding for more
powerful graph neural networks. In International Conference on Learning Representations,
2022. URL https://openreview.net/forum?id=e95i1IHcWj.
[60] H. Wang, J. Kaddour, S. Liu, J. Tang, J. Lasenby, and Q. Liu. Evaluating self-
supervised learning for molecular graph embeddings. In A. Oh, T. Naumann, A. Glober-
son, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Informa-
tion Processing Systems, volume 36, pages 68028–68060. Curran Associates, Inc.,
2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/
d6dc15cc2442a40904e704d624d1fbe8-Paper-Datasets_and_Benchmarks.pdf.
[61] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and
V. Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):
513–530, 2018.
[62] Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, Z. Li, X. Luo, K. Chen, H. Jiang, et al.
Pushing the boundaries of molecular representation for drug discovery with the graph attention
mechanism. Journal of medicinal chemistry, 63(16):8749–8760, 2019.
[63] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng. Graph wavelet neural network. In International
Conference on Learning Representations, 2019. URL https://openreview.net/forum?
id=H1ewdiR5tQ.
[64] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In
7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=
ryGs6iA5Km.
[65] M. Xu, H. Wang, B. Ni, H. Guo, and J. Tang. Self-supervised graph-level representation
learning with local and global structure. In M. Meila and T. Zhang, editors, Proceedings
of the 38th International Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 11548–11558. PMLR, 18–24 Jul 2021. URL https:
//proceedings.mlr.press/v139/xu21g.html.
[66] M. Xu, H. Wang, B. Ni, H. Guo, and J. Tang. Self-supervised graph-level representation
learning with local and global structure. In M. Meila and T. Zhang, editors, Proceedings
of the 38th International Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 11548–11558. PMLR, 18–24 Jul 2021. URL https:
//proceedings.mlr.press/v139/xu21g.html.
[67] K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper,
B. Kelley, M. Mathea, et al. Analyzing learned molecular representations for property prediction.
Journal of chemical information and modeling, 59(8):3370–3388, 2019.
[68] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do transformers
really perform badly for graph representation? In A. Beygelzimer, Y. Dauphin, P. Liang,
and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL
https://openreview.net/forum?id=OeWooOxFwDa.
[69] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do trans-
formers really perform badly for graph representation? In M. Ranzato, A. Beygelz-
imer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Infor-
mation Processing Systems, volume 34, pages 28877–28888. Curran Associates, Inc.,
2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/
f1c1592588411002af340cbaedd6fc33-Paper.pdf.
[70] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with
augmentations. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 5812–5823. Curran
Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/
2020/file/3fe230348e9a12c13120749e3f9fa4cd-Paper.pdf.
[71] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with
augmentations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, ed-
itors, Advances in Neural Information Processing Systems, volume 33, pages 5812–5823.
Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/
file/3fe230348e9a12c13120749e3f9fa4cd-Paper.pdf.
[72] Y. You, T. Chen, Y. Shen, and Z. Wang. Graph contrastive learning automated. In M. Meila
and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
volume 139 of Proceedings of Machine Learning Research, pages 12121–12132. PMLR, 18–24
Jul 2021. URL https://proceedings.mlr.press/v139/you21a.html.
[73] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim. Graph transformer networks.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/
9d63484abb477c97640154d40595a3bb-Paper.pdf.
[74] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph
classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI
Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18.
AAAI Press, 2018. ISBN 978-1-57735-800-8.
[75] Y. Zhu, Y. Xu, Q. Liu, and S. Wu. An empirical study of graph contrastive learning. arXiv
preprint arXiv:2109.01116, 2021.
[76] M. Zitnik, R. Sosič, and J. Leskovec. Prioritizing network communities. Nature Communi-
cations, 9(1):2544, Jun 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-04948-5. URL
https://doi.org/10.1038/s41467-018-04948-5.
A Theoretical analysis
Notations. Throughout this section, for $X \in \mathbb{R}^{m \times n}$, we denote by $X[i : j]$ the submatrix of $X$
from row $i$ to row $j$ and by $X[i]$ the $i$-th row of $X$.
The approximation methodology we use is spline approximation on the eigenvalues of the
Laplacians. First, we need to define a scalar $k$-spline function on a bounded domain $[a, b] \subset \mathbb{R}$. Let
$$a =: \eta_1 < \eta_2 < \cdots < \eta_{N+1} := b$$
with $\eta_{j+1} = \eta_j + (b - a)/N$; these points are called uniform knots in the interval $[a, b]$. Assume
that $\mathcal{P}_k$ is the class of polynomials of degree at most $k$.
Definition A.1. A scalar function $f$ is called a $k$-spline function on the uniformly divided interval
$[a, b]$ with $N + 1$ knots if $f$ restricted to each subinterval $[\eta_j, \eta_{j+1}]$ belongs to $\mathcal{P}_k$ and $f$ is
$(k-1)$-times continuously differentiable on $[a, b]$.
For $u(x) = x^d$ on $[0, 1]$, we have
$$\left\| u^{(k+1)} \right\|^{1/k} = \bigl( d(d-1)\cdots(d-k) \bigr)^{1/k} \sup_{x \in [0,1]} x^{\frac{d-k+1}{k}} \le d^{1+\frac{1}{k}}.$$
Combining this with the statement of Lemma A.2, we deduce that the error defined by the max norm
will be less than $\epsilon$ if
$$w := \frac{1}{h} \ge \pi^{-1} \epsilon^{-\frac{1}{k}} d^{1+\frac{1}{k}} = O\!\left( \epsilon^{-\frac{1}{k}} d^{1+\frac{1}{k}} \right),$$
with $h$ the knot spacing.
From here, we deduce an important lemma:
Lemma A.3. Given two natural numbers $d$ and $k$ such that $d > k$, there exists a two-layer MLP
$f : \mathbb{R}^k \to \mathbb{R}$ of hidden width $O\!\left( \epsilon^{-\frac{1}{k}} d^{1+\frac{1}{k}} \right)$ such that
$$\left| x^d - f\!\left( x, x^2, \ldots, x^k \right) \right| < \epsilon.$$
Lemma A.4 (First order extension). For any $\epsilon > 0$ and a given natural number $d > k$, there exists a
two-layer $S_n$-equivariant MLP $: \mathbb{R}^{n \times k} \to \mathbb{R}^{n \times r}$ of width $O\!\left( n^{\frac{1}{k}} \epsilon^{-\frac{1}{k}} r^{1+\frac{1}{k}} \right)$ such that
$$\left\| \mathrm{MLP}\!\left( E_k^{(1)} \right) - E_r^{(1)} \right\| \le \epsilon.$$
Proof. Assume that $\Lambda^i$ is the diagonal matrix containing all eigenvalues of $\psi_s^i$. Applying Lemma A.3,
we see that $\Lambda^q$ for $q = k+1, \ldots, r$ can be estimated using a two-layer MLP of width
$O\!\left( \epsilon^{-\frac{1}{k}} r^{1+\frac{1}{k}} \right)$ with max-norm error less than $\epsilon$. Formally, let $\hat{\psi}_s^q$ be the estimate of $\psi_s^q$
and $e_i$ be the error at the $i$-th entry along the diagonal; then we have
$$\left\| \hat{\psi}_s^q - \psi_s^q \right\| = \left\| \sum_{i=1}^{n} e_i u_i u_i^\top \right\| \le \sum_{i=1}^{n} |e_i| \left\| u_i u_i^\top \right\| \le n\epsilon.$$
With enough first-order information, i.e. sufficiently large $r$ in Lemma A.4, we can reconstruct the
second-order features up to arbitrarily high degree.
Theorem A.5 (First to second order). Assume that $\mathrm{rank}(\psi_s - I_n) \le r$ and let $h : \mathbb{R}^{n \times r} \to \mathbb{R}^{n \times n \times r}$
be the resolution-wise outer product; then there exists a broadcasted linear feed-forward layer
$g : \mathbb{R}^r \to \mathbb{R}^d$ such that $(h \circ g)\left( E_d^{(1)} \right) = E_r^{(2)}$.
Proof. The first-order features are aggregated through an outer product operator, which returns $r$ square
matrices of order $n$. However, these matrices are all rank-one matrices and cannot individually represent the
initial second-order features. Since the rank of a square matrix equals its size minus the
multiplicity of the eigenvalue zero, we can see that $\mathrm{rank}(\psi_s^i - I_n) = \mathrm{rank}(\psi_s - I_n) \le r$ for all $i$.
Therefore, for all $i = 1, \ldots, d$, the matrix $\psi_s^i - I_n$ can be written as a weighted sum of $r$ rank-one matrices
produced from the outer product. This concludes the proof.
Theorem A.6. For any $\epsilon > 0$ and real coefficients $\theta_0, \theta_1, \ldots, \theta_d$, assume that $\mathrm{rank}(\psi_s - I_n) \le r$;
then there exists an $S_n$-equivariant AE $f : \mathbb{R}^{n \times n \times k} \to \mathbb{R}^{n \times n \times r}$ of width $O(n^{1/k} r^{1+1/k} \epsilon^{-1/k})$ and
a broadcasted feed-forward network $g : \mathbb{R}^r \to \mathbb{R}$ such that
$$\left\| (g \circ f)(E_k^{(2)}) - p(\psi_s) \right\| < \epsilon,$$
where $p(\psi_s) = \sum_{j=0}^{d} \theta_j \psi_s^j$.
Proof. Encoder. The input tensor is of size $n \times n \times k$, representing second-order features at $k$
different resolutions. The encoder simply operates resolution-wise and takes the row sum of each
square matrix, producing a first-order tensor of size $n \times k$. This layer is evidently
$S_n$-equivariant.
Latent. For the latent space, i.e. the first-order feature space, we apply Lemma A.4 to extend from $k$
resolutions to $r$ resolutions using a two-layer MLP of width $O\!\left( n^{\frac{1}{k}} \epsilon^{-\frac{1}{k}} r^{1+\frac{1}{k}} \right)$. Since this MLP
is also broadcast along the $n$-axis, it is $S_n$-equivariant as well.
Decoder. Applying the content of Theorem A.5 we can conclude the proof.
Once the AE can learn to reconstruct, the proof of Theorem 5.6 ensures that it can capture long-range
information.
Theorem A.7. For any $\epsilon > 0$ and real coefficients $\theta_1, \theta_2, \ldots, \theta_d$, there exists a two-layer ReLU feed-forward
network $\varphi : \mathbb{R}^{n \times n \times d} \to \mathbb{R}^{n \times n}$ of hidden dimension $d_h = 2$ such that
$$\left\| \varphi(E_d^{(2)}) - \sum_{j=1}^{d} \theta_j A_j \right\| < \epsilon.$$
Proof. For this proof, we consider the wavelet and random-walk cases separately.
Wavelet. Let $\psi_s = U \Lambda_s U^\top$ where $\Lambda_s = \mathrm{diag}(\exp(-s\lambda_1), \exp(-s\lambda_2), \ldots, \exp(-s\lambda_n))$. We first perform a
transform on the vector basis of $E_d^{(2)}$. Essentially, the transformations are done independently of the
eigenvectors $U$. Formally, we observe that
$$\begin{pmatrix} e^{-s\lambda_i} - 1 \\ e^{-2s\lambda_i} - 1 \\ \vdots \\ e^{-ds\lambda_i} - 1 \end{pmatrix} \approx A \begin{pmatrix} \lambda_i - 1 \\ (\lambda_i - 1)^2 \\ \vdots \\ (\lambda_i - 1)^d \end{pmatrix}, \qquad (4)$$
where $A \in \mathbb{R}^{d \times d}$ contains the corresponding Chebyshev polynomial expansion coefficients. Note
that (4) is essentially a discrete Fourier-type transform, thus the inverse version is simply
$$\begin{pmatrix} \lambda_i - 1 \\ (\lambda_i - 1)^2 \\ \vdots \\ (\lambda_i - 1)^d \end{pmatrix} \approx A^{-1} \begin{pmatrix} e^{-s\lambda_i} - 1 \\ e^{-2s\lambda_i} - 1 \\ \vdots \\ e^{-ds\lambda_i} - 1 \end{pmatrix}. \qquad (5)$$
This means that the powers of $\tilde{L}$ up to $d$ can be retrieved via a broadcasted linear layer. Now let $\zeta$ be
the scalar step function, i.e. $\zeta(x) = 1$ for $x > 0$ and $0$ otherwise. Then, let $\varphi$ be a continuous
piecewise-linear function such that
$$\varphi(x) = \begin{cases} 0 & \text{if } x < 0, \\ x/\varepsilon & \text{if } 0 \le x < \varepsilon, \\ 1 & \text{if } x \ge \varepsilon. \end{cases}$$
Since this function is a three-piece linear function, it can be represented as a ReLU-based feed-forward
network with hidden dimension two. Evidently,
$$\| \varphi - \zeta \| \to 0 \quad \text{as } \varepsilon \to 0.$$
Furthermore, it follows that $\zeta\!\left( (I_n - \tilde{L})^k \right) = A_k$ for all $k$, applied entry-wise. This concludes the proof.
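For completeness, the three-piece function $\varphi$ above can be written explicitly with two ReLU units, matching the hidden dimension $d_h = 2$ claimed in Theorem A.7 (our own explicit form):
$$\varphi(x) = \frac{1}{\varepsilon}\Bigl(\mathrm{ReLU}(x) - \mathrm{ReLU}(x - \varepsilon)\Bigr) = \begin{cases} 0 & x < 0, \\ x/\varepsilon & 0 \le x < \varepsilon, \\ 1 & x \ge \varepsilon. \end{cases}$$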
Table 6 presents details of all benchmarking datasets used in our experiments. We focus on improving
model performance in graph-level prediction tasks. All datasets contain over 1000 samples, with the
average number of nodes per dataset ranging from 13 to over 100.
Dataset # Graphs Avg. # Nodes Avg. # Edges Pred. level Pred. task Metric
CIFAR10 60,000 117.63 469.10 graph class. (10-way) ACC
MNIST 70,000 70.57 281.65 graph class. (10-way) ACC
ZINC-subset 12,000 23.15 24.92 graph reg. MAE
MolBBBP 2,039 24.06 25.95 graph class. (binary) AUROC
MolBACE 1,513 34.09 36.86 graph class. (binary) AUROC
MolTox21 7,831 18.57 19.29 graph class. (binary) AUROC
MolToxCast 8,576 18.78 19.26 graph class. (binary) AUROC
MolSIDER 2,039 33.64 35.36 graph class. (binary) AUROC
Peptides-func 15,535 150.94 153.65 graph class. (binary) AP
Peptides-struct 15,535 150.94 153.65 graph reg. MAE
MUTAG 188 17.9 39.6 graph class. (binary) ACC
PROTEINS 1,113 39.1 72.8 graph class. (binary) ACC
NCI1 4,110 29.9 32.3 graph class. (binary) ACC
NCI109 4,127 29.7 32.1 graph class. (binary) ACC
IMDB-B 1,000 19.8 96.5 graph class. (binary) ACC
IMDB-M 1,500 13.0 65.9 graph class. (3-way) ACC
Table 6: Dataset details for transferability experiments on image, ZINC, MoleculeNet, LRGB and
TUDataset.
Pretraining Table 7 lists the hyperparameters of our high-order autoencoder and the training settings.
In general, we used three layers of IGN [39] to build the encoder, with hidden dimensions {8, 16, 32}.
The decoder mirrors the encoder, with hidden dimensions {32, 16, 8}. We used a channel-wise
2-layer MLP to compute the latent Z from the encoder's output, and the latent dimension is set to
20. We preprocessed the Wavelet signals of graph data with the PyGSP [10] library. For each graph,
we performed a Wavelet transform to obtain its 4-resolution Wavelet tensor, with scales in
{0.25, 0.5, 0.75, 1}. Finally, the autoencoder is pretrained for 100 epochs with a batch size of 32 and a
learning rate of 0.0005.
Batch size # Epoch Encoder Decoder Learning rate Scales Latent dim
32 100 [8, 16, 32] [32,16,8] 0.0005 [0.25, 0.5, 0.75, 1.0] 20
MoleculeNet Table 8 shows the hyperparameter settings for fine-tuning MPNN on the five downstream
datasets in the MoleculeNet benchmark. In general, we used local attention as proposed in [52]. To model
global interactions, we augment the local models with virtual nodes, which improves performance on
ToxCast and SIDER.
TUDataset Table 9 summarizes the hyperparameter settings for the transfer learning experiments
on six TUDataset benchmark datasets. For IMDB-B and IMDB-M, which lack domain node features,
we employed HOPE-WavePE as their initial node features. To create unified node features compatible
with the hidden dimensions of the MPNN layers, we added a 2-layer MLP before the MPNN layers
to update the combination of domain and HOPE-WavePE features. Following Bodnar et al. [3], we
performed 10-fold validations for each dataset and reported means and standard deviations.
ZINC, Image Classification Tasks, and LRGB We follow the best hyperparameter settings reported
in previous work on GPS [47] and MPNN+VN [7] and then fine-tune them for better performance. Our
full hyperparameter settings for these benchmarks are shown in Table 10.
Hyperparameter BBBP BACE Tox21 ToxCast SIDER
Pre-MPNN MLP MLP MLP MLP MLP
MPNN type Attention Attention Attention Attention Attention
VN Augmented - - - ✓ ✓
# Layers 5 5 5 3 3
Hidden Dim 300 300 300 512 512
Dropout 0.5 0.5 0.5 0.5 0.5
Pooling type mean mean mean mean mean
Learning rate 1e-3 1e-3 1e-3 1e-3 1e-3
Weight decay 1e-9 1e-9 1e-9 1e-9 1e-9
# Epochs 50 50 50 100 50
Batch size 32 32 32 32 32
Table 8: Hyperparameter settings for fine-tuning MPNN on the five MoleculeNet datasets.
Table 10: Hyperparameter settings for ZINC, MNIST, CIFAR10, Peptides-func and Peptides-struct datasets.