MLST Wavelet Positional Encoding
Truong-Son Hy ∗
University of Alabama at Birmingham, United States
thy@uab.edu
Abstract
Unsupervised pre-training on vast amounts of graph data is critical in real-world ap-
plications wherein labeled data is limited, such as molecule properties prediction or
materials science. Existing approaches pre-train models for specific graph domains,
neglecting the inherent connections within networks. This limits their ability to
transfer knowledge to various supervised tasks. In this work, we propose a novel
pre-training strategy on graphs that focuses on modeling their multi-resolution
structural information, allowing us to capture global information of the whole
graph while preserving local structures around its nodes. We extend the work
of Graph Wavelet Positional Encoding (WavePE) from Ngo et al. [44] by pre-
training a High-Order Permutation-Equivariant Autoencoder (HOPE-WavePE) to
reconstruct node connectivities from their multi-resolution wavelet signals. Since
our approach relies solely on the graph structure, it is domain-agnostic and adapt-
able to datasets from various domains, therefore paving the way for developing
general graph structure encoders and graph foundation models. We theoretically
demonstrate that for $k$ given resolutions, the width required for the autoencoder
to learn arbitrarily long-range information is $O\!\left(n^{1/k} r^{1+1/k} \epsilon^{-1/k}\right)$, where $n$ and $r$
denote the number of nodes and the rank of the normalized Laplacian, respectively,
and $\epsilon$ is the error tolerance defined by the Frobenius norm. We also evaluate
HOPE-WavePE on graph-level prediction tasks of different areas and show its
superiority compared to other methods. Our source code is publicly available at
https://github.com/HySonLab/WaveletPE.
1 Introduction
One of the fastest-growing areas in machine learning is graph representation learning, with impactful
applications in biomedicine, molecular chemistry, and network science. Most graph neural networks
(GNNs) rely on the message-passing framework that processes graph-structured data by exchanging
the vectorized information between nodes on graphs along their edges. Albeit achieving remarkable
results in a wide range of tasks on graph data, message-passing neural networks (MPNNs) possess
several fundamental limits, including expressiveness [64], over-smoothing [8], and over-squashing
[56]. In recent years, transformer-based architectures [73, 31, 12, 68] have emerged as powerful
alternatives to address the mentioned issues of MPNNs. The self-attention mechanism in conventional
graph transformers lets every pair of nodes interact directly, and virtual-node (VN) augmented MPNNs
achieve a similar effect by routing all messages through a shared global node.
† This work was done during the authors' internship at FPT Software AI Center, Vietnam, under the supervision of Dr. Truong-Son Hy.
∗ Correspondence to thy@uab.edu
However, graph transformers (GTs) and VN-augmented MPNNs disregard the underlying structure
of graph data by altering inherent connections among the nodes (i.e., shortening all paths to two).
This disregard may explain their limitations in several graph-level prediction tasks. To address this,
positional and structural encodings (PSE) are commonly used to enhance structural information
in modern GNNs. However, existing PSE methods are often tailored to specific
tasks. For instance, random walk encodings [14, 37] excel with small molecules, while
Laplacian encoding [12] captures long-range interactions effectively. Meanwhile, Ngo et al. [44]
propose Wavelet positional encoding (WavePE), a PSE encoding technique that models the node
interactions at multiple scales. By leveraging multi-resolution analysis based on Wavelet Transform
on graphs [63, 21, 11], WavePE enables neural networks to capture a comprehensive range of
structural information, from local to global properties.
Real-world applications of learning methods on graphs face two challenges [45, 24]. First, in fields
like biology and chemistry, task-specific labeled data is scarce, and acquiring them requires significant
time and resources [76]. Second, real-world graph datasets often contain out-of-distribution samples;
for example, newly synthesized molecules may have different structures from those in the training
set, leading to inaccurate predictions from supervised models. Transfer learning is a
promising approach to overcoming these challenges. Our key insight is that graph data possesses
two distinct feature types: (1) domain-specific features (e.g., atom types in molecules or user names in citation
networks) and (2) topological features. While the former depends on specific domains, the latter is a
general form of graph data. This observation gives rise to our idea of structural pretraining, wherein
we pretrain a model on topological features of graph data. Specifically, given the Wavelet signals of a
graph, we train an autoencoder to reconstruct its adjacency matrix. The graph wavelet signals are
defined based on functions of eigenvalues of the graph Laplacian at multiple resolutions, representing
general forms of connectivities among nodes on the graphs. As a result, a model trained on these
features should be able to capture the underlying patterns of different graph structures and generalize
well to downstream tasks. After pretraining, we can use this pretrained model as a general structural
encoder that extracts node structural features, which are passed to graph neural networks along with
domain-specific features for predicting task-specific properties.
1.1 Contribution
In this work, we propose a new pretraining approach on graphs that leverages the reconstruction of
graph structures from the Wavelet signals to generalize structural information on graph data, thus
enabling transfer learning to various downstream tasks across various domains. Our contributions are
three-fold as follows:
• We propose a high-order structural pretrained model for graph-structured data that lever-
ages high-order interactions of nodes on graphs, therefore capturing better positional and
structural information.
• We theoretically prove that pretraining by reconstruction from multi-resolution Wavelet
signals enables the autoencoder to learn the node state after a walk of arbitrary length $d$,
which can contain both local and global information about graph structures. We also analyze
the width required for the latent space of the autoencoder to guarantee such performance
and propose a low-rank alternative.
• We empirically show that such a pretraining scheme can enhance the performance of super-
vised models when fine-tuned on downstream datasets from different domains, indicating the
generalizability and effectiveness of pretrained structural encodings compared with other
domain-specific pretraining methods.
2 Related work
Graph Positional and Structural Encodings Node positional encodings are augmented into node
features to preserve the graph structure information and increase the expressiveness of MPNNs
[33, 14, 59, 16] and Graph Transformers [13, 37, 40, 29, 47, 73]. Many recent works directly use the
graph Laplacian eigenvectors as node positional encodings and combine them with domain-specific
node features [13, 29, 47, 73]. Other works use several approaches to encode a graph's structural
information, including random walks [41, 37, 14, 32], diffusion kernels on graphs [41, 31, 44],
shortest paths [32, 69], and unsupervised node embedding methods [59, 34]. Our work builds on
similar approaches to Wang et al. [59] and Liu et al. [34], where we pre-train an encoder to capture
node positional information in an unsupervised setting. However, unlike previous works, our encoder
is high-order equivariant, enabling it to capture multi-scale properties of graph structures using
Wavelet positional encodings [44]. These learned positional features can be adapted to various
downstream tasks and generalize well to domain-specific datasets. Furthermore, we theoretically
demonstrate that our method is general and can express several common positional encoding schemes.
Pretraining on Graphs In the age of modern deep learning, pre-training models on massive
unlabeled datasets and then fine-tuning them on smaller, labeled datasets has proven remarkably
successful in areas like natural language processing [6, 54] and computer vision [22, 46]. For graph-
structured data, self-supervised pre-training is a growing area of research, with contrastive learning
being a popular approach [72, 66, 75, 71, 28]. While these methods can enhance performance on
various downstream tasks, their reliance on domain-specific features to create different views of the
input graphs limits their transferability to other domains. For instance, models pretrained on small
molecules struggle to adapt to other graph types like citation networks, proteins, or code repositories.
Our work aims to propose a more generalizable and transferable graph pre-training method that
relies only on intrinsic graph features, i.e., adjacency matrices, so it can be trained on unfeatured
data and easily transferred to predict properties of small graph datasets in arbitrary domains.
3.1 Notations
In this work, we define a graph $G = (V, A)$, where $V$ denotes the node set and
$A \in \mathbb{R}^{n \times n}$ is the adjacency matrix. The normalized graph Laplacian is computed as
$\tilde{L} = I_n - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix and $I_n$ is the identity matrix.
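As a point of reference, a minimal NumPy sketch of this computation (illustrative code, not from the released implementation):

```python
import numpy as np

def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    """Compute L_tilde = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5   # guard against isolated nodes
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

# Example on a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L_tilde = normalized_laplacian(A)
```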
In this section, we formally define permutation symmetry, i.e. symmetry to the action of the symmetric
group, Sn , and construct permutation-equivariant neural networks to encode graph wavelets. An
element σ ∈ Sn is a permutation of order n, or a bijective map from {1, 2, . . . , n} to {1, 2, . . . , n}.
We present each element $\sigma \in S_n$ as a permutation matrix $P_\sigma \in \mathbb{R}^{n \times n}$. For example, the action of $S_n$
on an adjacency matrix $A \in \mathbb{R}^{n \times n}$ and on a latent matrix (i.e., node embedding matrix) $Z \in \mathbb{R}^{n \times d_z}$ is:
$$\sigma(A) = P_\sigma A P_\sigma^\top, \qquad \sigma(Z) = P_\sigma Z,$$
for $\sigma \in S_n$. Here, the adjacency matrix $A$ is a second-order tensor with a single feature channel,
while the latent matrix $Z$ is a first-order tensor with $d_z$ feature channels. Likewise, a permutation
acting on a $k$-th order tensor $X \in \mathbb{R}^{n^k \times d}$ operates on the first $k$ dimensions of $X$.
Furthermore, we define permutation equivariance for functions between tensors of different orders.
Definition 3.1. An $S_n$-equivariant (or permutation-equivariant) function is a function $f : \mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^{k'} \times d'}$ that satisfies $f(\sigma(X)) = \sigma(f(X))$ for all $\sigma \in S_n$ and $X \in \mathbb{R}^{n^k \times d}$.
Although Def. 3.1 is stated for features of arbitrary order, in the scope of this work we only consider
features up to second order, i.e. $k, k' \le 2$. The definition applies equally to neural networks,
all of whose components satisfy this property.
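As a quick numerical illustration of Def. 3.1 (our own toy check, not part of the paper's codebase), the row-sum reduction used later in the encoder is $S_n$-equivariant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1) + np.triu(A, 1).T          # random symmetric adjacency (second-order tensor)

perm = rng.permutation(n)
P = np.eye(n)[perm]                           # permutation matrix P_sigma

row_sum = lambda X: X.sum(axis=1)             # maps second-order tensors to first-order tensors

# Equivariance: f(sigma(A)) == sigma(f(A)).
assert np.allclose(row_sum(P @ A @ P.T), P @ row_sum(A))
```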
[Figure 1 panels: s = 4, s = 15, s = 50; color scale from 0.0 to 1.0.]
Figure 1: Visualization of graph Wavelet on the geometric graph of a torus. The low scaling factor s
results in a highly localized structure around the center node (yellow), while higher factors can lead
to smoother signals that can spread out to a larger part of the graph.
Laplacian Positional Encoding (Laplacian PE). Let $G = (V, A)$ be an undirected graph with
adjacency matrix $A \in \mathbb{R}^{n \times n}$ and degree matrix $D$. The normalized Laplacian is defined as
$\tilde{L} = I - D^{-1/2} A D^{-1/2}$. Let $\tilde{L} = U \Lambda U^\top$ be its eigendecomposition. The Laplacian PE is given by
$X_{\mathrm{LapPE}} = U_k$, where $U_k \in \mathbb{R}^{n \times k}$ contains the eigenvectors corresponding to the $k$ smallest nontrivial eigenvalues.
Random Walk Positional Encoding (RWPE). Let $P = D^{-1} A$ be the random walk transition
matrix. The $t$-step transition probability is $P^t$. The RWPE of node $i$ up to step $k$ is
$$x_i^{\mathrm{RWPE}} = \left( (P^1)_{ii}, (P^2)_{ii}, \ldots, (P^k)_{ii} \right).$$
Both encodings capture structural information from spectral and probabilistic perspectives,
respectively.
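Both encodings can be computed directly from $A$; a small NumPy sketch (illustrative only, assuming a connected graph with no isolated nodes):

```python
import numpy as np

def lap_pe(A: np.ndarray, k: int) -> np.ndarray:
    """Laplacian PE: eigenvectors of L_tilde for the k smallest nontrivial eigenvalues."""
    n = A.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt
    _, eigvec = np.linalg.eigh(L)              # eigenvalues in ascending order
    return eigvec[:, 1:k + 1]                  # skip the trivial eigenvector

def rw_pe(A: np.ndarray, k: int) -> np.ndarray:
    """RWPE: return probabilities (P^t)_{ii} for t = 1..k, with P = D^{-1} A."""
    P = A / A.sum(axis=1, keepdims=True)
    feats, Pt = [], np.eye(A.shape[0])
    for _ in range(k):
        Pt = Pt @ P
        feats.append(np.diag(Pt))
    return np.stack(feats, axis=1)             # shape (n, k)
```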
4 Methodology
4.1 Spectral Graph Wavelet Tensors
As the normalized graph Laplacian $\tilde{L} \in \mathbb{R}^{n \times n}$ is symmetric when $G$ is undirected, we can decompose
it into a complete set of orthonormal eigenvectors $U = (u_1, u_2, \cdots, u_n)$, wherein $u_i$ is associated
with a real, non-negative eigenvalue $\lambda_i$:
$$\tilde{L} = U \Lambda U^\top, \qquad (1)$$
where $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_n)$ is the diagonal matrix of eigenvalues. The graph Wavelet transform
constructs a set of spectral graph Wavelets as bases to project the graph signal from the vertex domain
to the spectral domain:
$$\psi_s = U \Lambda_s U^\top, \qquad (2)$$
where $\Lambda_s = \mathrm{diag}(g(s\lambda_1), \cdots, g(s\lambda_n))$ is the scaling matrix of the eigenvalues. The scaling function
$g$ takes the eigenvalue $\lambda_i$ and an additional scaling factor $s$ as inputs, indicating how a signal diffuses
away from node $i$ at scale $s$; we select $g_s(\lambda) = \exp(-s\lambda)$ as the heat kernel. This means that we
can vary the scaling parameter $s$ to adjust the neighborhoods surrounding a center node; see
Figure 1 for an illustration. Following Ngo et al. [44], a set of $k$ graph Wavelets $\{\psi_{s_i}\}_{i=1}^{k} \subset \mathbb{R}^{n \times n}$ can
be constructed, resulting in a tensor of graph Wavelets $\mathcal{W} \in \mathbb{R}^{n \times n \times k}$.
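For concreteness, a direct eigendecomposition-based construction of the wavelet tensor $\mathcal{W} \in \mathbb{R}^{n \times n \times k}$ might look as follows (a minimal sketch; the default scales mirror the appendix setting, and the function name is ours):

```python
import numpy as np

def wavelet_tensor(A: np.ndarray, scales=(0.25, 0.5, 0.75, 1.0)) -> np.ndarray:
    """Stack psi_s = U diag(exp(-s * lambda)) U^T over the given scales -> shape (n, n, k)."""
    n = A.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt          # normalized Laplacian
    lam, U = np.linalg.eigh(L)                           # L = U diag(lam) U^T
    psis = [U @ np.diag(np.exp(-s * lam)) @ U.T for s in scales]
    return np.stack(psis, axis=-1)                       # wavelet tensor W
```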
[Figure 2 block labels: Multidomain Dataset; Wavelet Transform; Domain-agnostic, Long-range Structural Information; Equivariant Encoder; Equivariant Decoder; Equivariant MLP.]
Figure 2: Our proposed equivariant autoencoder pretraining scheme is applied to a large multi-domain
dataset while extending the feature degree, overcoming the domain-specific weakness while
also embedding long-range information.
Fast Graph Wavelet Transform Traditionally, computing graph wavelet bases necessitates the full
diagonalization of the graph Laplacian $\tilde{L}$ in Equation (1). This approach becomes computationally
expensive as the number of nodes increases. Our work adopts the efficient algorithm proposed by
Hammond et al. [21], which uses Chebyshev polynomials to approximate the Wavelets with
a time complexity of $O(|E| \times M)$, where $M$ is the order of the polynomials and $E$ is the set
of edges. This approach scales linearly with the number of edges, and is therefore more efficient than
previous methods based on the eigendecomposition of the graph Laplacian.
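A sketch of this Chebyshev-based approximation, written with dense NumPy arrays for readability (our own illustrative implementation of the recurrence from Hammond et al. [21], not the paper's code; with a sparse Laplacian the same recurrence applied to a signal matrix yields the stated $O(|E| \times M)$ cost per signal):

```python
import numpy as np

def cheb_coeffs(s: float, order: int, lam_max: float = 2.0, n_quad: int = 200) -> np.ndarray:
    """Chebyshev expansion coefficients of g(lam) = exp(-s * lam) on [0, lam_max]."""
    theta = np.pi * (np.arange(n_quad) + 0.5) / n_quad
    lam = lam_max * (np.cos(theta) + 1.0) / 2.0          # map [-1, 1] back to [0, lam_max]
    g = np.exp(-s * lam)
    return np.array([2.0 / n_quad * np.sum(g * np.cos(m * theta)) for m in range(order + 1)])

def fast_wavelet(L: np.ndarray, s: float, order: int = 30, lam_max: float = 2.0) -> np.ndarray:
    """Approximate psi_s = g(sL) via the Chebyshev recurrence, without eigendecomposition."""
    n = L.shape[0]
    Lr = (2.0 / lam_max) * L - np.eye(n)                 # rescale spectrum into [-1, 1]
    c = cheb_coeffs(s, order, lam_max)
    T_prev, T_curr = np.eye(n), Lr                       # T_0 and T_1 of the rescaled Laplacian
    psi = 0.5 * c[0] * T_prev + c[1] * T_curr
    for m in range(2, order + 1):
        T_prev, T_curr = T_curr, 2.0 * Lr @ T_curr - T_prev
        psi = psi + c[m] * T_curr
    return psi
```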
Our goal is to train a learnable structural encoder to extract generalized and abstract node-level
structural features that can be transferred across downstream tasks in graph learning. To fulfill this,
we parameterize a high-order autoencoder [27, 55] that learns to reconstruct high-order features of
graph data from their wavelet tensors W ∈ Rn×n×k .
In particular, given a second-order wavelet tensor $\mathcal{W} \in \mathbb{R}^{n \times n \times k}$, the encoder $E$ encodes $\mathcal{W}$ into a
latent matrix $Z = E(\mathcal{W}) \in \mathbb{R}^{n \times d_\ell}$; the encoder $E$ can be composed of many equivariant operators.
Furthermore, to reduce the redundancy caused by high-order training, we use only two equivariant
mappings in the encoder, the diagonal and the row sum:
$$E^1(\mathcal{W}) = \mathrm{diag}(\mathcal{W}), \qquad E^2(\mathcal{W}) = \mathcal{W} \mathbf{1}_n.$$
Then, the decoder $D$ lifts $Z$ back to a high-order feature map $F = D(Z) \in \mathbb{R}^{n \times n \times d}$. Here, we
use the outer product and the diagonal operator, which represent structural and positional information,
respectively:
$$D^1(Z_i) = Z_i \otimes Z_i^\top, \qquad D^2(Z_i) = \mathrm{diag}(Z_i),$$
where $\otimes$ is the outer product operator and $Z_i$ is the $i$-th channel of the encoded information.
Here, $d_\ell$ and $d$ denote the numbers of channels (dimensions) of each entry in an $n \times n$ matrix.
Finally, we use a channel-wise multi-layer perceptron (MLP) $\phi : \mathbb{R}^{n \times n \times d} \mapsto \mathbb{R}^{n \times n \times r}$ to map $F$ to a
concatenation of high-degree adjacency matrices in binary values. Specifically, let $A^j$ be the binary matrix
of $j$-hop neighbors in a graph and $\hat{A}^j$ be its prediction. The final MLP network returns the predicted
array
$$\phi(F) = \left[ \hat{A}^{s_1} \,\|\, \hat{A}^{s_2} \,\|\, \ldots \,\|\, \hat{A}^{s_r} \right], \qquad (3)$$
where $s_1, s_2, \ldots, s_r$ are natural numbers (hop degrees) to be chosen. In this work, we let these values follow an
exponential pattern, which emphasizes range diversity.
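A condensed PyTorch-style sketch of the encoder/decoder maps described above (module names and sizes are illustrative; the actual model stacks IGN layers [39] as detailed in the appendix):

```python
import torch
import torch.nn as nn

class HopeWaveAE(nn.Module):
    """Simplified equivariant autoencoder over a wavelet tensor W of shape (n, n, k)."""
    def __init__(self, k: int, d_latent: int = 20, r: int = 4):
        super().__init__()
        # Encoder: mix the two equivariant reductions (diagonal and row sum), channel-wise.
        self.enc = nn.Linear(2 * k, d_latent)
        # Channel-wise MLP mapping decoded features (2 * d_latent channels) to r hop logits.
        self.head = nn.Sequential(nn.Linear(2 * d_latent, 64), nn.ReLU(), nn.Linear(64, r))

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        # --- Encoder: second order -> first order ---------------------------------
        diag = torch.diagonal(W, dim1=0, dim2=1).T          # (n, k)  E^1: diagonal
        rowsum = W.sum(dim=1)                                # (n, k)  E^2: row sum
        Z = self.enc(torch.cat([diag, rowsum], dim=-1))      # latent, (n, d_latent)
        # --- Decoder: first order -> second order ---------------------------------
        outer = torch.einsum('ic,jc->ijc', Z, Z)             # D^1: channel-wise outer product
        diag_lift = torch.diag_embed(Z.T).permute(1, 2, 0)   # D^2: channel-wise diagonal lift
        F = torch.cat([outer, diag_lift], dim=-1)            # (n, n, 2 * d_latent)
        return self.head(F)                                  # logits for r hop adjacencies, (n, n, r)

# Usage: logits = HopeWaveAE(k=4)(W) with W the (n, n, 4) wavelet tensor.
```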
Masking removes redundancy In sparse structures, learning different hop adjacencies induces an
imbalance between edges and non-edges, which varies with the adjacency degree and with the data
structure. For example, a molecular graph from MoleculeNet [61] has a shorter diameter than a peptide
graph [15]. Thus, long-degree adjacencies are redundant for small molecules and hurt generalizability
for bigger graphs. To fix this, we use a binary mask $M \in \mathbb{R}^{n \times n \times r}$ that filters out random entries in
both the ground truth and the prediction such that non-edges and edges are equal in quantity for each degree of
predicted adjacency. The remaining entries define the binary cross-entropy loss:
$$\mathcal{L}_M(A^{s_i}, \hat{A}^{s_i}) = \mathrm{BinCrossEntropy}(M \odot A^{s_i}, M \odot \hat{A}^{s_i}),$$
where $\odot$ is the elementwise product. For small connected graphs, masking removes redundant
adjacencies $A^j$ (all-ones matrices) for high values of $j$. For graphs with larger diameters, masking also eases
the tuning of the degree hyperparameters $s_i$ for $i \in [r]$, as the loss produced by $A^{s_i}$ is zero
for large degrees $s_i$. This motivates our multi-range training strategy, which makes the autoencoder
range-aware with respect to graph size and structure.
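A sketch of the balanced masking loss for a single hop degree, assuming the ground-truth hop adjacency and the predicted logits are already available (variable and function names are ours):

```python
import torch
import torch.nn.functional as F

def balanced_mask(target: torch.Tensor) -> torch.Tensor:
    """Binary mask keeping all edges and an equal number of randomly chosen non-edges."""
    pos = target > 0.5
    n_pos = int(pos.sum())
    keep_neg = torch.zeros_like(target, dtype=torch.bool)
    neg_idx = (~pos).nonzero(as_tuple=False)
    if len(neg_idx) > 0 and n_pos > 0:
        sel = neg_idx[torch.randperm(len(neg_idx))[:n_pos]]
        keep_neg[sel[:, 0], sel[:, 1]] = True
    return (pos | keep_neg).float()

def masked_bce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Binary cross entropy restricted to the masked (class-balanced) entries."""
    mask = balanced_mask(target)
    loss = F.binary_cross_entropy_with_logits(logits, target.float(), reduction='none')
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```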
5 Theoretical Results
5.1 HOPE-WavePE is stable
Stability in graph structural positional encoding refers to the comparative change of positional
information when the (normalized) Laplacian is slightly perturbed. We expand on the preliminary
definition of [58].
Definition 5.1 (Stable PE). Let $S(\tilde{L}) \in \mathbb{R}^{n \times d}$ be the positional feature that can be added or
concatenated to the input representation. Then $S$ is stable if there exists a universal constant $C > 0$ such that
$$\left\| S(\tilde{L}) - S(\tilde{L} + \delta\tilde{L}) \right\| \le C \left\| \delta\tilde{L} \right\|$$
for all natural numbers $n, d$, normalized Laplacians $\tilde{L}$, and perturbation terms $\delta\tilde{L} \in \mathbb{R}^{n \times n}$.
We consider two scenarios for $\delta\tilde{L}$: one where it is a random perturbation of magnitude $o(\|\tilde{L}\|)$,
and one where $\delta\tilde{L}$ is the difference between two normalized Laplacians,
$\delta\tilde{L} = \tilde{L}_1 - P^* \tilde{L}_2 P^{*\top}$ with $P^* = \arg\min_{P} \|\tilde{L}_1 - P \tilde{L}_2 P^\top\|$, where $\tilde{L}_1, \tilde{L}_2$ are the normalized Laplacians
of two different structures. This serves as an extension of the initial definition, showing that stable PEs
are resilient to small perturbations. This is especially meaningful, as random perturbations usually
inflict heavy misinformation on spectral-based positional encodings.
A stable PE ensures that a small change in the normalized Laplacian, which can occur due to
perturbation or structural change, induces at most a linear change in the PE.
Methods such as Laplacian Positional Encoding (LapPE), for example, are not stable because the SVD is not
unique. Works such as SignNet and BasisNet [33] are invariant under these unstable transformations;
however, they cannot guarantee a small change of the PE given a small change of $\tilde{L}$, and thus they are also
not stable. Moreover, we show that RWPE is not stable either. First, let us make a remark on
RWPE [14]:
Definition 5.2. Given $G = (V, E)$ with corresponding adjacency matrix $A$, the RWPE matrix
$S_{RW} = \{s_{ij}\}$ is defined as
$$s_{ij} = \mathrm{RW}_i(j) = \tilde{L}^j[i, i],$$
where $\mathrm{RW}_i(j)$ is the probability that node $i$ returns to itself after a random walk of length $j$.
Theorem 5.3. For sufficiently large $n$ and any $C > 0$, there exist a normalized Laplacian $\tilde{L}_0$
and a perturbation $\delta\tilde{L}_0$ such that
$$\left\| S_{RW}(\tilde{L}_0 + \delta\tilde{L}_0) - S_{RW}(\tilde{L}_0) \right\| > C \left\| \delta\tilde{L}_0 \right\|.$$
The proof of Theorem 5.3 is quite straightforward. We observe that the encoding induced by a random
walk is a polynomial of degree at most $d$ with non-vanishing coefficients. This indicates that the
change of $S_{RW}$ is $O(\|\delta\tilde{L}\|^d)$ rather than $O(\|\delta\tilde{L}\|)$, which is the necessary condition for stability.
This negative result for RWPE can be fixed by scaling down the coefficients attached to long walks,
i.e., high polynomial degrees. Conveniently, HOPE-WavePE tackles exactly this, either through the
analytic approximation of the fast Wavelet transform used in practice [21] or as an infinite Chebyshev
polynomial.
Theorem 5.4 (informal). Let $S_{Wave}$ be the Wavelet PE induced by the first-order embedding of the
AE; then there exists a parametrization such that $S_{Wave} = S_{Wave}(\tilde{L})$ is a stable PE.
In contrast to the non-vanishing coefficients of RWPE, Wavelets at all scales are computed via the Chebyshev
polynomial expansion. This expansion has a logarithmic convergence rate, which dominates the polynomial
rate-of-change problem raised by RWPE. This, together with the multiscale motivation of HOPE-
WavePE, ensures a stable and range-diverse structural embedding.
The authors of [37] showed that RWPE combined with an $S_n$-equivariant MLP can predict the state of a
walk only up to the length embedded in the encoding; for example, the concatenation of $(D^{-1}A)^i$ for
$i = 1, \ldots, k$, where $D$ and $A$ are the degree and adjacency matrices respectively, can only predict the
state of walks up to length $k$ (see Proposition 3.1 of [37]). This means that embedding global information is not
possible without expensive computational cost. We demonstrate theoretically that HOPE-WavePE
can reconstruct second-order features while also generating arbitrarily high-degree information, which
can capture global information. The inputs to HOPE-WavePE are primarily the structural wavelets at
different scaling factors.
Clarification. We use the non-linearity $\mathrm{ReLU}(x) = \max(0, x)$ as the activation in the hidden layers
of the MLPs. Moreover, biases in the linear mappings are included. The norm $\|\cdot\|$ denotes the
Frobenius norm for real vectors and matrices, and the $L_2$ norm for function spaces.
For convenience, we denote the input of HOPE-WavePE as the concatenated array
$E_k^{(2)} = \left[ \psi_{s_1} \,\|\, \psi_{s_2} \,\|\, \ldots \,\|\, \psi_{s_k} \right]$ of size $n \times n \times k$. The superscript denotes the order of the feature; this
notation is also used later in the Appendix.
We justify our HOPE-WavePE pretraining scheme by first showing that, with asymptotically sublinear
width, the AE can estimate $E_d^{(2)}$ for all $d > k$.
Theorem 5.5. For any $\epsilon > 0$ and real coefficients $\theta_0, \theta_1, \ldots, \theta_d$, assume that the multidomain
dataset has $C$ connected components and $r = n + 1 - C$; then there exists an $S_n$-equivariant AE
$f : \mathbb{R}^{n \times n \times k} \to \mathbb{R}^{n \times n \times d}$ of width $O(n^{1/k} r^{1+1/k} \epsilon^{-1/k})$ and a channel-wise feed-forward network
$g : \mathbb{R}^d \to \mathbb{R}$ such that
$$\left\| (g \circ f)(E_k^{(2)}) - p(\psi_s) \right\| < \epsilon,$$
where $p(\psi_s) = \sum_{j=0}^{d} \theta_j \psi_s^j$.
The connected-component aspect of Theorem 5.5 allows more efficient pretraining of HOPE-
WavePE while preserving performance. This is particularly useful since a combined dataset of
diverse domains and sparse graph structures also carries a large number of connected components.
Moreover, if $r < n + 1 - C$, the statement can be read as a low-rank approximation of the
second-order features. In fact, we found this low-rank alternative to be robust for $r$ as low as four.
We summarize the proof of Theorem 5.5 as follows:
• The tensor $E_k^{(2)}$ is contracted to the first-order tensor $E_k^{(1)}$ of size $n \times k$ via a resolution-wise
row sum.
• A spline method is applied to approximate $\Lambda^{k+1}$ through $\Lambda^{r}$, where $\Lambda$ is the diagonal eigenvalue
matrix of $\psi_s$. The bound on the width is derived from the number of spline pieces and the
results of [49].
• The decoder broadcasts the first-order feature into second-order form via an outer product. The
output is a tensor of size $n \times n \times r$.
• Since $\psi_s^i - I_n$ for $i = 1, \ldots, d$ all have rank $r$, a linear layer applied after the AE can
represent the rank-one summations needed to estimate the output, together with a weighted sum corresponding
to the polynomial coefficients of $p(\psi_s)$.
Theorem 5.5 shows that HOPE-WavePE can reconstruct, and also generate high-degree versions of, the
second-order features. However, the question remains of how large a range an output of degree
$d$ can encode. To answer this, we define the binary matrix $A_t \in \mathbb{R}^{n \times n}$, where $A_t[i, j] = 1$ if there exists a
walk of length $t$ between nodes $i$ and $j$, and $0$ otherwise.
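The matrices $A_t$ can be materialized directly from powers of $A$; a short sketch for an unweighted graph:

```python
import numpy as np

def walk_indicator(A: np.ndarray, t: int) -> np.ndarray:
    """A_t[i, j] = 1 iff there exists a walk of length exactly t between nodes i and j."""
    return (np.linalg.matrix_power(A, t) > 0).astype(np.int8)

# Example targets for multi-hop reconstruction, e.g. exponentially spaced degrees.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
targets = [walk_indicator(A, s) for s in (1, 2, 4, 8)]
```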
Theorem 5.6. For any $\epsilon > 0$ and real coefficients $\theta_1, \theta_2, \ldots, \theta_d$, there exists a two-layer ReLU feed-forward
network $\varphi : \mathbb{R}^{n \times n \times d} \to \mathbb{R}^{n \times n}$ of hidden dimension $d_h = 2$ such that
$$\left\| \varphi(E_d^{(2)}) - \sum_{j=1}^{d} \theta_j A_j \right\| < \epsilon.$$
To prove Theorem 5.6, we consider $\psi_s$ in the two scenarios of WavePE and RWPE separately.
For WavePE, we perform an inverse Wavelet transform to recover the polynomial of the normalized
Laplacian. Then, the binary step function is approximated via the MLP, and the output is taken as a
weighted sum of $A_j$ for $j = 1, \ldots, d$.
Theorem 5.6 combined with Theorem 5.5 implies that embedding long-range dependencies is possible
with our AE pretraining scheme and can be done efficiently. Moreover, since the linear
layers and MLPs in these statements are all channel-wise, they all satisfy the $S_n$-equivariance requirement.
By learning long-range dependencies, we can obtain much meaningful information about the graph,
including pairwise shortest-path distances, cycle counting, cycle detection, and more. The latest
provably stable work of [26], where the authors use an eigendecomposition, can cost $O(n^3)$ and is not
provably capable of capturing long-range information. In contrast, HOPE-WavePE is not only provably $O(n)$ for sparse graphs
but also stable, resilient to perturbation, and able to learn much more complex long-range structures.
6 Experiments
This section empirically evaluates the advantages of our structurally pre-trained autoencoder (AE).
We first detail the pre-training process, which utilizes a large dataset of graph structures. We then
present experimental results on various downstream tasks. As our objective is to learn a structural
feature extractor for graph data, we aim to demonstrate three key points:
• Enhanced performance on small datasets: The encoder of the pre-trained AE can extract
structural features that enhance the performance of graph-level prediction tasks on small
datasets.
• Effective PSE feature representation: The learned structural features can capture global
information and preserve graph structures when integrated into and fine-tuned along with
global models like graph transformers or VN-augmented MPNNs.
• Generalizability: The pre-trained AE can generalize well to out-of-distribution graphs with
different connectivity types and sizes, even beyond those present in the training dataset.
6.1 Setup
Method BBBP BACE Tox21 ToxCast SIDER
D-MPNN [67] 71.0(0.3) 80.9(0.6) 75.9(0.7) 65.5(0.3) 57.0(0.7)
AttentiveFP [62] 64.3(1.8) 78.4(0.0) 76.1(0.5) 63.7(0.2) 60.6(3.2)
N-GramRF [35] 69.7(0.6) 77.9(1.5) 74.3(0.4) - 66.8(0.7)
N-GramXGB [35] 69.1(0.8) 79.1(1.3) 75.8(0.9) - 65.5(0.7)
PretrainGNN [25] 68.7(1.3) 79.9(0.9) 76.7(0.4) 65.7(0.6) 62.7(0.8)
GROVERbase [48] 70.0(0.1) 82.6(0.7) 74.3(0.1) 65.4(0.4) 64.8(0.6)
GROVERlarge [48] 69.5(0.1) 81.0(1.4) 73.5(0.1) 65.3(0.5) 65.4(0.1)
GraphLOG pretrained [65] 72.5(0.8) 83.5(1.2) 75.7(0.5) 63.5(0.7) 61.2 (1.1)
GraphCL pretrained [70] 69.7(0.7) 75.4(1.4) 73.9(0.7) 62.4(0.6) 60.5(0.9)
InfoGraph pretrained [60] 66.3(0.6) 64.8(0.7) 68.1(0.6) 58.4(0.6) 57.1(0.8)
MPNN + LapPE 68.5(0.6) 81.1(0.5) 76.2(0.3) 64.1(0.4) 63.0(0.2)
MPNN + RWPE 69.3(0.5) 81.7(0.6) 76.8(0.2) 65.4(0.4) 63.8(0.3)
MPNN + WavePE 70.2(0.4) 86.2(0.4) 77.5(0.2) 66.3(0.5) 64.3(0.3)
MPNN + HOPE-WavePE (ours) 71.2(0.4) 86.8(0.4) 78.0(0.1) 66.9(0.5) 64.7(0.3)
Table 1: Experimental results on five small-molecule (scaffold-split) classification datasets. Methods
are evaluated by ROC-AUC % (↑), where higher scores indicate better performance. We report the
mean and standard deviation (in brackets) of all methods over three random seeds. Top 3 results are
highlighted, including First, Second, and Third.
Downstream task After pretraining, we integrate the pretrained autoencoder into other graph
neural networks for graph-level supervised tasks. We concatenate the node-level structural features
extracted by the pretrained autoencoder with domain-specific features before feeding them into GNN-based
models for downstream evaluation. The implementation details for each specific downstream
task are provided in Appendix B.
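Schematically, the fine-tuning pipeline looks as follows (a hedged sketch with hypothetical module names; the structural encoder is the pretrained HOPE-WavePE encoder and the backbone is any MPNN or graph transformer):

```python
import torch
import torch.nn as nn

class DownstreamModel(nn.Module):
    """Concatenate pretrained structural features with domain features, then run a GNN."""
    def __init__(self, structural_encoder: nn.Module, gnn: nn.Module,
                 d_domain: int, d_struct: int, d_hidden: int):
        super().__init__()
        self.structural_encoder = structural_encoder      # pretrained HOPE-WavePE encoder
        self.proj = nn.Linear(d_domain + d_struct, d_hidden)
        self.gnn = gnn                                    # any MPNN / graph transformer backbone

    def forward(self, x_domain, wavelet_tensor, edge_index):
        z_struct = self.structural_encoder(wavelet_tensor)        # (n, d_struct) structural features
        h = self.proj(torch.cat([x_domain, z_struct], dim=-1))    # fused node features
        return self.gnn(h, edge_index)                            # downstream graph-level prediction
```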
6.2 Results
While we pretrained our HOPE-WavePE autoencoder on the MolPCA dataset within the Moleculenet
benchmark [61], the performance on downstream tasks within this benchmark may not fully capture
the generalizability of our approach. To address this, we evaluate HOPE-WavePE on a broader range
of real-world graph datasets. These datasets encompass diverse graph structures, connectivity types,
sizes, and domain-specific features, significantly differing from those found in Moleculenet.
Moleculenet Benchmark We follow the setting in [53] wherein pretrained models are evaluated
on a selection of five small molecule datasets in the MoleculeNet Benchmark [61]. Scaffold splitting
is used to define training and testing data, ensuring that test molecules may significantly differ
from those used for training. This approach assesses a model’s ability to predict properties of out-
of-distribution molecular structures. We compare our methods with multiple baselines, including
supervised and pretraining. In particular, two supervised methods are D-MPNN [67] and AttentiveFP
[62]. For pretraining, we compare our work with PretrainGNN [25], GraphCL [70], GraphLog
[65], Grover [48], and InfoGraph [60]. According to the experimental results in Table 1, pretrained
HOPE-WavePE can improve the performances of MPNN on small molecule property prediction
tasks. Our approach, in particular, outperforms other pretraining baselines by a large margin on 3 of the 5
datasets and achieves comparable results on the other two.
TUDataset Benchmark We evaluate HOPE-WavePE on six diverse datasets from the TUDataset
benchmark [42]: a small molecule dataset (MUTAG), two chemical compound datasets (NCI1 and
NCI109), a macromolecule dataset (PROTEINS) and two social network datasets (IMDB-B and
IMDB-M). In this task, we combine the structural features extracted by the pretrained autoencoder
with node domain features and feed them into an MPNN consisting of five GIN [64] layers with
hidden dimensions varying in [32, 64, 128]. The results in Table 2 demonstrate that GIN augmented
with HOPE-WavePE significantly outperforms other complicated high-order networks like IGN [39],
CIN, and PPGNs on three out of six datasets. Notably, even though pretrained on molecular graphs,
HOPE-WavePE effectively improves GIN’s expressiveness on the two social network datasets as well.
This is because our method focuses solely on the inherent properties of graph data, independent of
domain-specific features.
Table 2: Experimental results on TU datasets. The methods are evaluated by Accuracy % (↑). The
reported results are means and standard deviations of runs over five random seeds. Top 3 results
are highlighted, including First, Second, and Third.
Different Graph Connectivity Patterns We evaluate on two image classification datasets, MNIST
and CIFAR-10, from [16]. In these datasets, we construct k-nearest neighbor graphs (k = 8) using
superpixels within each image. This generates a collection of graph structures that differ significantly
from those encountered during the pre-training phase, which typically involves molecular structures.
In this task, we augment node superpixel features with their structural features before feeding them
into GPS [47], a graph transformer model. As shown in Table 3, GPS + HOPE-WavePE achieves
comparable accuracy to other MPNNs and GPS models that rely on different explicit positional
encoding methods.
ZINC Dataset We evaluate the performance of HOPE-WavePE on a subset of the ZINC dataset
[19] containing 12,000 graphs with up to 38 heavy atoms. We assess its ability on molecular
tasks using GPS [47] equipped with HOPE-WavePE. The results in Table 3 show that our method outperforms
the two most popular encoding baselines, LapPE and RWSE, and shows statistically significant
differences compared to traditional GNN methods.
Table 3: Experimental results on image classification tasks and ZINC. The results are reported by
taking the mean ± the std over four random seeds. Top 3 results are highlighted, including First,
Second and Third.
Long-range Graph Benchmark (LRGB) To show that the pretrained AE can capture global infor-
mation while preserving local information on graph data, we conduct experiments on two datasets,
Peptides-struct and Peptides-func, from the Long-range Graph Benchmark [15]. In this task, we integrate
pretrained HOPE-WavePE into a VN-augmented MPNN and fine-tune it on the two datasets. We
compare our work with several MPNN and transformer-based baselines, including GCN [30], GatedGCN [5],
Drew-GCN + LapPE [20], GraphGPS + LapPE [47], GRIT [37], GraphViT, GraphMLP-Mixer [23], and
MPNN with virtual node and random walk structural encoding [7].
Figure 3: A visualization of how HOPE-WavePE identifies several patterns and models the
interactions between nodes on graphs. We observe that pretraining on a large dataset of
graph-structured data enables the high-order AE to explore intricate substructures on molecular
graphs, such as cycles or repeated leaves and atoms.
Table 4: Experimental results on Peptides func and Peptides struct. We report the means and standard
deviations of 4 random seeds.
Visualization of Local Information The underlying motivation of using graph wavelets over graph
Laplacians, i.e. the graph Fourier transform, is because they offer localization in spectral domains
[63]. This is a crucial property for learning graphs, particularly in molecular or community graphs
where local connections among atoms or people reveal important structural properties. To visualize
the local information of molecular structures captured by our pretraining scheme, we compute
the L2 -norm of each row in the latent matrix Z ∈ Rn×dℓ . This provides a magnitude vi = ||Z[i, :]||2
for each node i, reflecting the strength of its signal in the latent space of the graph with n nodes.
Figure 3, which shows 8 randomly selected graphs from the validation set of our pretraining
dataset, demonstrates that our method can detect repeated or long-range patterns, such as cycles in
complex molecular graphs. Nodes with similar L2 -norm values tend to be close in the graph structure,
indicating the model’s ability to capture these cyclical relationships.
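The node-wise magnitudes used in Figure 3 are simply row norms of the latent matrix; a one-function sketch:

```python
import numpy as np

def latent_node_magnitudes(Z: np.ndarray) -> np.ndarray:
    """v_i = ||Z[i, :]||_2 : per-node signal strength in the latent space (n,)."""
    return np.linalg.norm(Z, axis=1)

# The resulting values can be passed as node colors to any graph-drawing utility.
```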
[Figure 4 legend: 7, 10, 20, 30, 40, and 60 wavelet channels; x-axis: number of graph nodes; y-axis: accuracy.]
Figure 4: Average reconstruction accuracy on Peptides-struct graphs with different wavelet channel
quantities.
Our ablations have three focuses. First, we explore how the number of wavelet channels
affects the reconstructability of the adjacency tensor. Second, we show that cross-domain training
improves performance only incrementally. Finally, we show how masked versus unmasked training
affects the reconstructability of HOPE-WavePE on medium-sized graphs.
Wavelet Channel Number As shown in Figure 1, different scaling factors impact the receptive
field of each node. Since each resolution is responsible for learning a specific range of hops, more
resolutions should capture dependencies over a more diverse range. Figure 4 records
the reconstruction accuracies of the concatenated array $\{A_{2^i}\}_{i=0}^{7}$ given 7, 10, 20, 30, 40, and 60 wavelet
channels, respectively. The experiment is conducted on the Peptides-struct dataset from the Long
Range Graph Benchmark [15], which emphasizes long-range learnability.
[Figure 5 legend: masked vs. unmasked training with 7 and 20 channels; shaded regions mark redundant and non-redundant hop ranges; x-axis: number of graph nodes; y-axis: accuracy.]
Figure 5: Masked vs. unmasked reconstruction accuracy on the MNIST dataset. The regions shaded red
and blue contain hop lengths $s$ for which $A_s$ is and is not all-ones, respectively.
HOPE or no HOPE? We investigate whether HOPE actually makes the encoding vector more
meaningful in terms of structural information. In particular, we use the encoding vector from WavePE
without any pre-training to train a decoder that lifts it back to second order. Moreover, we replace
WavePE with RWPE to examine whether choosing WavePE as the backbone was necessary. The
results in Table ... suggest that
7 Conclusion
We have introduced HOPE-WavePE, a novel high-order permutation-equivariant pretraining method
specifically designed for graph-structured data. Our approach leverages the inherent connectivity
of graphs, eliminating reliance on domain-specific features. This enables HOPE-WavePE to gen-
eralize effectively across diverse graph types and domains. The advantages of HOPE-WavePE are
demonstrated through both theoretical and empirical analysis. Finally, we have discussed
the potential of HOPE-WavePE as a foundation for a general graph structural encoder. A promising
future direction will be to focus on optimizing the scalability of this approach.
References
[1] J. Atwood and D. Towsley. Diffusion-convolutional neural networks, 2016.
[2] C. Bodnar, F. Frasca, Y. G. Wang, N. Otter, G. Montúfar, P. Liò, and M. Bronstein. Weisfeiler
and lehman go topological: Message passing simplicial networks, 2021.
[3] C. Bodnar, F. Frasca, N. Otter, Y. G. Wang, P. Liò, G. Montúfar, and M. Bronstein. Weisfeiler
and lehman go cellular: Cw networks, 2022.
[4] G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network
expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 45(1):657–668, 2023. doi: 10.1109/TPAMI.2022.3154319.
[5] X. Bresson and T. Laurent. Residual gated graph convnets. CoRR, abs/1711.07553, 2017. URL
http://arxiv.org/abs/1711.07553.
[7] C. Cai, T. S. Hy, R. Yu, and Y. Wang. On the connection between mpnn and graph transformer.
arXiv preprint arXiv:2301.11956, 2023.
[8] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun. Measuring and relieving the over-smoothing
problem for graph neural networks from the topological view. Proceedings of the AAAI
Conference on Artificial Intelligence, 34(04):3438–3445, Apr. 2020. doi: 10.1609/aaai.v34i04.
5747. URL https://ojs.aaai.org/index.php/AAAI/article/view/5747.
[10] M. Defferrard, L. Martin, R. Pena, and N. Perraudin. PyGSP: Graph signal processing in Python.
URL https://github.com/epfl-lts2/pygsp, 2017.
[11] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec. Learning structural node embeddings via
diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, KDD ’18, page 1320–1329, New York, NY, USA, 2018.
Association for Computing Machinery. ISBN 9781450355520. doi: 10.1145/3219819.3220025.
URL https://doi.org/10.1145/3219819.3220025.
[14] V. P. Dwivedi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson. Graph neural networks with
learnable structural and positional representations. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=wTTjnvGphYj.
[15] V. P. Dwivedi, L. Rampasek, M. Galkin, A. Parviz, G. Wolf, A. T. Luu, and D. Beaini. Long
range graph benchmark. In Thirty-sixth Conference on Neural Information Processing Sys-
tems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=
in7XC5RcjEn.
[17] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives.
In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages
129–143, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-45167-9.
[18] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for
quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning
- Volume 70, ICML’17, page 1263–1272. JMLR.org, 2017.
[19] R. Gómez-Bombarelli, D. Duvenaud, J. M. Hernández-Lobato, J. Aguilera-Iparraguirre, T. D.
Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven
continuous representation of molecules. CoRR, abs/1610.02415, 2016. URL http://arxiv.
org/abs/1610.02415.
[20] B. Gutteridge, X. Dong, M. Bronstein, and F. D. Giovanni. Drew: Dynamically rewired message
passing with delay, 2023.
[21] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph
theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual
representation learning. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 9729–9738, 2020.
[23] X. He, B. Hooi, T. Laurent, A. Perold, Y. LeCun, and X. Bresson. A generalization of vit/mlp-
mixer to graphs, 2023.
[24] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness
and uncertainty. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pages 2712–2721. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.
press/v97/hendrycks19a.html.
[25] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for
pre-training graph neural networks. In International Conference on Learning Representations,
2020. URL https://openreview.net/forum?id=HJlWWJSFDH.
[26] Y. Huang, W. Lu, J. Robinson, Y. Yang, M. Zhang, S. Jegelka, and P. Li. On the stability
of expressive positional encodings for graphs, 2024. URL https://arxiv.org/abs/2310.
02579.
[27] T. S. Hy and R. Kondor. Multiresolution equivariant graph variational autoencoder. Machine
Learning: Science and Technology, 4(1):015031, mar 2023. doi: 10.1088/2632-2153/acc0d8.
URL https://dx.doi.org/10.1088/2632-2153/acc0d8.
[28] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu. Sub-graph contrast for scalable
self-supervised graph representation learning. In 2020 IEEE international conference on data
mining (ICDM), pages 222–231. IEEE, 2020.
[29] J. Kim, D. Nguyen, S. Min, S. Cho, M. Lee, H. Lee, and S. Hong. Pure transformers are powerful
graph learners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,
Advances in Neural Information Processing Systems, volume 35, pages 14582–14595. Curran
Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/
2022/file/5d84236751fe6d25dc06db055a3180b0-Paper-Conference.pdf.
[30] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
arXiv preprint arXiv:1609.02907, 2016.
[31] D. Kreuzer, D. Beaini, W. Hamilton, V. Létourneau, and P. Tossou. Rethinking graph trans-
formers with spectral attention. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and
J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages
21618–21629. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/
paper_files/paper/2021/file/b4fd1d2cb085390fbbadae65e07876a7-Paper.pdf.
[32] P. Li, Y. Wang, H. Wang, and J. Leskovec. Distance encoding: Design provably more powerful
neural networks for graph representation learning. Advances in Neural Information Processing
Systems, 33, 2020.
[33] D. Lim, J. Robinson, L. Zhao, T. Smidt, S. Sra, H. Maron, and S. Jegelka. Sign and basis
invariant networks for spectral graph representation learning, 2022.
[34] R. Liu, S. Cantürk, O. Lapointe-Gagné, V. Létourneau, G. Wolf, D. Beaini, and L. Rampášek.
Graph positional and structural encoder. arXiv preprint arXiv:2307.07107, 2023.
[35] S. Liu, M. F. Demirel, and Y. Liang. N-gram graph: Simple unsupervised representation
for graphs, with applications to molecules. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Sys-
tems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/
paper_files/paper/2019/file/2f3926f0a9613f3c3cc21d52a3cdb4d9-Paper.pdf.
[36] Y. Luo, H. Li, L. Shi, and X.-M. Wu. Enhancing graph transformers with hierarchical distance
structural encoding, 2024. URL https://arxiv.org/abs/2308.11129.
[37] L. Ma, C. Lin, D. Lim, A. Romero-Soriano, K. Dokania, M. Coates, P. H.S. Torr, and S.-N. Lim.
Graph Inductive Biases in Transformers without Message Passing. In Proc. Int. Conf. Mach.
Learn., 2023.
[38] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph net-
works. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Cur-
ran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/
bb04af0f7ecaee4aae62035497da1387-Paper.pdf.
[39] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks.
In International Conference on Learning Representations, 2019. URL https://openreview.
net/forum?id=Syx72jC9tm.
[40] R. Menegaux, E. Jehanno, M. Selosse, and J. Mairal. Self-attention in colors: Another take on
encoding graph structure in transformers. Transactions on Machine Learning Research, 2023.
ISSN 2835-8856. URL https://openreview.net/forum?id=3dQCNqqv2d.
[41] G. Mialon, D. Chen, M. Selosse, and J. Mairal. Graphit: Encoding graph structure in transform-
ers. arXiv preprint arXiv:2106.05667, 2021.
[42] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. Tudataset: A
collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663,
2020.
[43] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient
graph kernels from propagated information. Machine Learning, 102:209 – 245, 2014. URL
https://api.semanticscholar.org/CorpusID:14487732.
[44] N. K. Ngo, T. S. Hy, and R. Kondor. Multiresolution graph transformers and wavelet positional
encoding for learning long-range and hierarchical structures. The Journal of Chemical Physics,
159(3):034109, 07 2023. ISSN 0021-9606. doi: 10.1063/5.0152833. URL https://doi.org/
10.1063/5.0152833.
[45] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359, 2010. doi: 10.1109/TKDE.2009.191.
[46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision.
In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[47] L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini. Recipe for a
General, Powerful, Scalable Graph Transformer. arXiv:2205.12454, 2022.
[48] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang. Self-supervised graph
transformer on large-scale molecular data. In Proceedings of the 34th International Conference
on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran
Associates Inc. ISBN 9781713829546.
[49] E. Sande, C. Manni, and H. Speleers. Explicit error estimates for spline approximation of
arbitrary smoothness in isogeometric analysis. Numerische Mathematik, 144(4):889–929, Jan.
2020. ISSN 0945-3245. doi: 10.1007/s00211-019-01097-9. URL http://dx.doi.org/10.
1007/s00211-019-01097-9.
[50] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet
kernels for large graph comparison. In D. van Dyk and M. Welling, editors, Proceedings
of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of
Proceedings of Machine Learning Research, pages 488–495, Hilton Clearwater Beach Resort,
Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL https://proceedings.mlr.
press/v5/shervashidze09a.html.
[51] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt.
Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(77):2539–2561,
2011. URL http://jmlr.org/papers/v12/shervashidze11a.html.
[52] Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, and Y. Sun. Masked label prediction: Unified
message passing model for semi-supervised classification. In Z.-H. Zhou, editor, Proceedings
of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages
1548–1554. International Joint Conferences on Artificial Intelligence Organization, 8 2021.
doi: 10.24963/ijcai.2021/214. URL https://doi.org/10.24963/ijcai.2021/214. Main
Track.
[53] R. Sun, H. Dai, and A. W. Yu. Does gnn pretraining help molecular representation? Advances
in Neural Information Processing Systems, 35:12096–12109, 2022.
[54] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M.
Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805, 2023.
[55] E. H. Thiede, T. S. Hy, and R. Kondor. The general theory of permutation equivarant neural
networks and higher order graph variational encoders. arXiv preprint arXiv:2004.03990, 2020.
[56] J. Topping, F. D. Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding
over-squashing and bottlenecks on graphs via curvature. In International Conference on
Learning Representations, 2022. URL https://openreview.net/forum?id=7UmjRGzp-A.
[57] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention
Networks. International Conference on Learning Representations, 2018. URL https://
openreview.net/forum?id=rJXMpikCZ.
[58] H. Wang, H. Yin, M. Zhang, and P. Li. Equivariant and stable positional encoding for more
powerful graph neural networks, 2022. URL https://arxiv.org/abs/2203.00199.
[59] H. Wang, H. Yin, M. Zhang, and P. Li. Equivariant and stable positional encoding for more
powerful graph neural networks. In International Conference on Learning Representations,
2022. URL https://openreview.net/forum?id=e95i1IHcWj.
[60] H. Wang, J. Kaddour, S. Liu, J. Tang, J. Lasenby, and Q. Liu. Evaluating self-
supervised learning for molecular graph embeddings. In A. Oh, T. Naumann, A. Glober-
son, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Informa-
tion Processing Systems, volume 36, pages 68028–68060. Curran Associates, Inc.,
2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/
d6dc15cc2442a40904e704d624d1fbe8-Paper-Datasets_and_Benchmarks.pdf.
[61] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and
V. Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):
513–530, 2018.
[62] Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, Z. Li, X. Luo, K. Chen, H. Jiang, et al.
Pushing the boundaries of molecular representation for drug discovery with the graph attention
mechanism. Journal of medicinal chemistry, 63(16):8749–8760, 2019.
[63] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng. Graph wavelet neural network. In International
Conference on Learning Representations, 2019. URL https://openreview.net/forum?
id=H1ewdiR5tQ.
[64] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In
7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=
ryGs6iA5Km.
[65] M. Xu, H. Wang, B. Ni, H. Guo, and J. Tang. Self-supervised graph-level representation
learning with local and global structure. In M. Meila and T. Zhang, editors, Proceedings
of the 38th International Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 11548–11558. PMLR, 18–24 Jul 2021. URL https:
//proceedings.mlr.press/v139/xu21g.html.
[66] M. Xu, H. Wang, B. Ni, H. Guo, and J. Tang. Self-supervised graph-level representation
learning with local and global structure. In M. Meila and T. Zhang, editors, Proceedings
of the 38th International Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 11548–11558. PMLR, 18–24 Jul 2021. URL https:
//proceedings.mlr.press/v139/xu21g.html.
[67] K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper,
B. Kelley, M. Mathea, et al. Analyzing learned molecular representations for property prediction.
Journal of chemical information and modeling, 59(8):3370–3388, 2019.
[68] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do transformers
really perform badly for graph representation? In A. Beygelzimer, Y. Dauphin, P. Liang,
and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL
https://openreview.net/forum?id=OeWooOxFwDa.
[69] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do trans-
formers really perform badly for graph representation? In M. Ranzato, A. Beygelz-
imer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Infor-
mation Processing Systems, volume 34, pages 28877–28888. Curran Associates, Inc.,
2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/
f1c1592588411002af340cbaedd6fc33-Paper.pdf.
[70] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with
augmentations. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 5812–5823. Curran
Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/
2020/file/3fe230348e9a12c13120749e3f9fa4cd-Paper.pdf.
[71] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with
augmentations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, ed-
itors, Advances in Neural Information Processing Systems, volume 33, pages 5812–5823.
Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/
file/3fe230348e9a12c13120749e3f9fa4cd-Paper.pdf.
[72] Y. You, T. Chen, Y. Shen, and Z. Wang. Graph contrastive learning automated. In M. Meila
and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
volume 139 of Proceedings of Machine Learning Research, pages 12121–12132. PMLR, 18–24
Jul 2021. URL https://proceedings.mlr.press/v139/you21a.html.
[73] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim. Graph transformer networks.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/
9d63484abb477c97640154d40595a3bb-Paper.pdf.
[74] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph
classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI
Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18.
AAAI Press, 2018. ISBN 978-1-57735-800-8.
[75] Y. Zhu, Y. Xu, Q. Liu, and S. Wu. An empirical study of graph contrastive learning. arXiv
preprint arXiv:2109.01116, 2021.
[76] M. Zitnik, R. Sosič, and J. Leskovec. Prioritizing network communities. Nature Communi-
cations, 9(1):2544, Jun 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-04948-5. URL
https://doi.org/10.1038/s41467-018-04948-5.
A Theoretical analysis
Notations. Throughout this section, for $X \in \mathbb{R}^{m \times n}$, we denote by $X[i : j]$ the submatrix of $X$
from row $i$ to row $j$ and by $X[i]$ the $i$-th row of $X$.
The approximation methodology we use is spline approximation on the eigenvalues of the
Laplacians. First, we need to define a scalar $k$-spline function on a bounded domain $[a, b] \subset \mathbb{R}$. Let
$$a =: \eta_1 < \eta_2 < \cdots < \eta_{N+1} := b$$
with $\eta_{j+1} = \eta_j + (b - a)/N$; these points are called uniform knots in the interval $[a, b]$. Assume
that $\mathcal{P}_k$ is the class of polynomials of degree at most $k$.
Definition A.1. A scalar function $f$ is called a $k$-spline function on the uniformly divided interval
$[a, b]$ with $N + 1$ knots if $f$ restricted to each subinterval $[\eta_j, \eta_{j+1}]$ belongs to $\mathcal{P}_k$ and $f$ is
$(k-1)$-times continuously differentiable on $[a, b]$.
For $u(x) = x^d$ on $[0, 1]$, we have
$$\left\| u^{(k+1)} \right\|^{1/k} = \bigl( d(d-1)\cdots(d-k) \bigr)^{1/k} \sup_{x \in [0,1]} x^{\frac{d-k+1}{k}} \le d^{1+\frac{1}{k}}.$$
Combining this with the statement of Lemma A.2, we deduce that the error defined by the max norm
will be less than $\epsilon$ if
$$w := \frac{1}{h} \ge \pi^{-1} \epsilon^{-\frac{1}{k}} d^{1+\frac{1}{k}} = O\!\left( \epsilon^{-\frac{1}{k}} d^{1+\frac{1}{k}} \right),$$
with $h$ the knot spacing.
From here, we deduce an important lemma:
Lemma A.3. Given two natural numbers $d$ and $k$ such that $d > k$, there exists a two-layer MLP
$f : \mathbb{R}^k \to \mathbb{R}$ of hidden width $O\!\left( \epsilon^{-\frac{1}{k}} d^{1+\frac{1}{k}} \right)$ such that
$$\left| x^d - f\!\left( x, x^2, \ldots, x^k \right) \right| < \epsilon.$$
Lemma A.4 (First order extension). For any $\epsilon > 0$ and a given natural number $d > k$, there exists a
two-layer $S_n$-equivariant MLP $: \mathbb{R}^{n \times k} \to \mathbb{R}^{n \times r}$ of width $O\!\left( n^{\frac{1}{k}} \epsilon^{-\frac{1}{k}} r^{1+\frac{1}{k}} \right)$ such that
$$\left\| \mathrm{MLP}\!\left( E_k^{(1)} \right) - E_r^{(1)} \right\| \le \epsilon.$$
Proof. Assume that $\Lambda^i$ is the diagonal matrix containing all eigenvalues of $\psi_s^i$. Applying Lemma A.3,
we see that $\Lambda^q$ for $q = k+1, \ldots, r$ can be estimated using a two-layer MLP of width
$O\!\left( \epsilon^{-\frac{1}{k}} r^{1+\frac{1}{k}} \right)$ with max-norm error less than $\epsilon$. Formally, let $\hat{\psi}_s^q$ be the estimate of $\psi_s^q$
and $e_i$ be the error at the $i$-th entry along the diagonal; then we have
$$\left\| \hat{\psi}_s^q - \psi_s^q \right\| = \left\| \sum_{i=1}^{n} e_i u_i u_i^\top \right\| \le \sum_{i=1}^{n} |e_i| \left\| u_i u_i^\top \right\| \le n\epsilon.$$
With enough first-order information, i.e. sufficiently large $r$ in Lemma A.4, we can reconstruct the
second-order features up to arbitrarily high degree.
Theorem A.5 (First to second order). Assume that $\mathrm{rank}(\psi_s - I_n) \le r$ and let $h : \mathbb{R}^{n \times r} \to \mathbb{R}^{n \times n \times r}$
be the resolution-wise outer product; then there exists a broadcasted linear feed-forward layer
$g : \mathbb{R}^r \to \mathbb{R}^d$ such that $(h \circ g)\left( E_d^{(1)} \right) = E_r^{(2)}$.
Proof. The first-order features are aggregated through an outer product operator, which returns $r$ square
matrices of order $n$. However, these matrices are all rank-one matrices and cannot individually represent the
initial second-order features. Since the rank of a square matrix equals its size minus the
multiplicity of the eigenvalue zero, we can see that $\mathrm{rank}(\psi_s^i - I_n) = \mathrm{rank}(\psi_s - I_n) \le r$ for all $i$.
Therefore, for all $i = 1, \ldots, d$, the matrix $\psi_s^i - I_n$ can be written as a weighted sum of $r$ rank-one matrices
produced from the outer product. This concludes the proof.
Theorem A.6. For any $\epsilon > 0$ and real coefficients $\theta_0, \theta_1, \ldots, \theta_d$, assume that $\mathrm{rank}(\psi_s - I_n) \le r$;
then there exists an $S_n$-equivariant AE $f : \mathbb{R}^{n \times n \times k} \to \mathbb{R}^{n \times n \times r}$ of width $O(n^{1/k} r^{1+1/k} \epsilon^{-1/k})$ and
a broadcasted feed-forward network $g : \mathbb{R}^r \to \mathbb{R}$ such that
$$\left\| (g \circ f)(E_k^{(2)}) - p(\psi_s) \right\| < \epsilon,$$
where $p(\psi_s) = \sum_{j=0}^{d} \theta_j \psi_s^j$.
Proof. Encoder. The input tensor is of size $n \times n \times k$, representing second-order features at $k$
different resolutions. The encoder simply operates resolution-wise and takes the row sum of each
square matrix, producing a first-order tensor of size $n \times k$. This layer is evidently
$S_n$-equivariant.
Latent. For the latent space, i.e. the first-order feature space, we apply Lemma A.4 to extend from $k$
resolutions to $r$ resolutions using a two-layer MLP of width $O\!\left( n^{\frac{1}{k}} \epsilon^{-\frac{1}{k}} r^{1+\frac{1}{k}} \right)$. Since this MLP
is also broadcast along the $n$-axis, it is $S_n$-equivariant as well.
Decoder. Applying the content of Theorem A.5 we can conclude the proof.
Once the AE can learn to reconstruct, the proof of Theorem 5.6 ensures that it can capture long-range
information.
Theorem A.7. For any $\epsilon > 0$ and real coefficients $\theta_1, \theta_2, \ldots, \theta_d$, there exists a two-layer ReLU feed-forward
network $\varphi : \mathbb{R}^{n \times n \times d} \to \mathbb{R}^{n \times n}$ of hidden dimension $d_h = 2$ such that
$$\left\| \varphi(E_d^{(2)}) - \sum_{j=1}^{d} \theta_j A_j \right\| < \epsilon.$$
Proof. For this proof, we consider the wavelet and random-walk cases separately.
Wavelet. Let $\psi_s = U \Lambda_s U^\top$ where $\Lambda_s = \mathrm{diag}(\exp(-s\lambda_1), \exp(-s\lambda_2), \ldots, \exp(-s\lambda_n))$. We first perform a
transform on the vector basis of $E_d^{(2)}$. Essentially, the transformations are done independently of the
eigenvectors $U$. Formally, we observe that
$$\begin{pmatrix} e^{-s\lambda_i} - 1 \\ e^{-2s\lambda_i} - 1 \\ \vdots \\ e^{-ds\lambda_i} - 1 \end{pmatrix} \approx A \begin{pmatrix} \lambda_i - 1 \\ (\lambda_i - 1)^2 \\ \vdots \\ (\lambda_i - 1)^d \end{pmatrix}, \qquad (4)$$
where $A \in \mathbb{R}^{d \times d}$ contains the corresponding Chebyshev polynomial expansion coefficients. Note
that (4) is essentially a discrete Fourier-type transform, thus the inverse version is simply
$$\begin{pmatrix} \lambda_i - 1 \\ (\lambda_i - 1)^2 \\ \vdots \\ (\lambda_i - 1)^d \end{pmatrix} \approx A^{-1} \begin{pmatrix} e^{-s\lambda_i} - 1 \\ e^{-2s\lambda_i} - 1 \\ \vdots \\ e^{-ds\lambda_i} - 1 \end{pmatrix}. \qquad (5)$$
This means that the powers of $\tilde{L}$ up to $d$ can be retrieved via a broadcasted linear layer. Now let $\zeta$ be
the scalar step function, i.e. $\zeta(x) = 1$ for $x > 0$ and $0$ otherwise. Then, let $\varphi$ be a continuous
piecewise-linear function such that
$$\varphi(x) = \begin{cases} 0 & \text{if } x < 0, \\ x/\varepsilon & \text{if } 0 \le x < \varepsilon, \\ 1 & \text{if } x \ge \varepsilon. \end{cases}$$
Since this function is a three-piece linear function, it can be represented as a ReLU-based feed-forward
network with hidden dimension two. Evidently,
$$\| \varphi - \zeta \| \to 0 \quad \text{as } \varepsilon \to 0.$$
Furthermore, it follows that $\zeta\!\left( (I_n - \tilde{L})^k \right) = A_k$ for all $k$, applied entry-wise. This concludes the proof.
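For completeness, the three-piece function $\varphi$ above can be written explicitly with two ReLU units, matching the hidden dimension $d_h = 2$ claimed in Theorem A.7 (our own explicit form):
$$\varphi(x) = \frac{1}{\varepsilon}\Bigl(\mathrm{ReLU}(x) - \mathrm{ReLU}(x - \varepsilon)\Bigr) = \begin{cases} 0 & x < 0, \\ x/\varepsilon & 0 \le x < \varepsilon, \\ 1 & x \ge \varepsilon. \end{cases}$$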
Table 6 presents details of all benchmarking datasets used in our experiments. We focus on improving
model performance in graph-level prediction tasks. All datasets contain over 1000 samples, with the
average number of nodes per dataset ranging from 13 to over 100.
Dataset # Graphs Avg. # Nodes Avg. # Edges Pred. level Pred. task Metric
CIFAR10 60,000 117.63 469.10 graph class. (10-way) ACC
MNIST 70,000 70.57 281.65 graph class. (10-way) ACC
ZINC-subset 12,000 23.15 24.92 graph reg. MAE
MolBBBP 2,039 24.06 25.95 graph class. (binary) AUROC
MolBACE 1,513 34.09 36.86 graph class. (binary) AUROC
MolTox21 7,831 18.57 19.29 graph class. (binary) AUROC
MolToxCast 8,576 18.78 19.26 graph class. (binary) AUROC
MolSIDER 2,039 33.64 35.36 graph class. (binary) AUROC
Peptides-func 15,535 150.94 153.65 graph class. (binary) AP
Peptides-struct 15,535 150.94 153.65 graph reg. MAE
MUTAG 188 17.9 39.6 graph class. (binary) ACC
PROTEINS 1,113 39.1 72.8 graph class. (binary) ACC
NCI1 4,110 29.9 32.3 graph class. (binary) ACC
NCI109 4,127 29.7 32.1 graph class. (binary) ACC
IMDB-B 1,000 19.8 96.5 graph class. (binary) ACC
IMDB-M 1,500 13.0 65.9 graph class. (3-way) ACC
Table 6: Dataset details for transferability experiments on image, ZINC, MoleculeNet, LRGB and
TUDataset.
Pretraining Table 7 lists the hyperparameters of our high-order autoencoder and the training settings.
In general, we used three layers of IGN [39] to build the encoder, with hidden dimensions {8, 16, 32}.
The decoder mirrors the encoder, with hidden dimensions {32, 16, 8}. We used a channel-wise
2-layer MLP to compute the latent Z from the encoder's output, and the latent dimension is set to
20. We preprocessed the Wavelet signals of graph data with the PyGSP [10] library. For each graph,
we performed a Wavelet transform to obtain its 4-resolution Wavelet tensor, with scales in
{0.25, 0.5, 0.75, 1}. Finally, the autoencoder is pretrained for 100 epochs with a batch size of 32 and a
learning rate of 0.0005.
Batch size # Epoch Encoder Decoder Learning rate Scales Latent dim
32 100 [8, 16, 32] [32,16,8] 0.0005 [0.25, 0.5, 0.75, 1.0] 20
MoleculeNet Table 8 shows the hyperparameter settings for fine-tuning MPNN on the five downstream
datasets in the MoleculeNet benchmark. In general, we used local attention as proposed in [52]. To model
global interactions, we augment the local models with virtual nodes, which improves performance on
ToxCast and SIDER.
TUDataset Table 9 summarizes the hyperparameter settings for the transfer learning experiments
on six TUDataset benchmark datasets. For IMDB-B and IMDB-M, which lack domain node features,
we employed HOPE-WavePE as their initial node features. To create unified node features compatible
with the hidden dimensions of the MPNN layers, we added a 2-layer MLP before the MPNN layers
to update the combination of domain and HOPE-WavePE features. Following Bodnar et al. [3], we
performed 10-fold validations for each dataset and reported means and standard deviations.
ZINC, Image Classification Tasks, and LRGB We follow the best hyperparameter settings reported
in previous work on GPS [47] and MPNN+VN [7] and then fine-tune them for better performance. Our
full hyperparameter settings for these benchmarks are shown in Table 10.
Hyperparameter BBBP BACE Tox21 ToxCast SIDER
Pre-MPNN MLP MLP MLP MLP MLP
MPNN type Attention Attention Attention Attention Attention
VN Augmented - - - ✓ ✓
# Layers 5 5 5 3 3
Hidden Dim 300 300 300 512 512
Dropout 0.5 0.5 0.5 0.5 0.5
Pooling type mean mean mean mean mean
Learning rate 1e-3 1e-3 1e-3 1e-3 1e-3
Weight decay 1e-9 1e-9 1e-9 1e-9 1e-9
# Epochs 50 50 50 100 50
Batch size 32 32 32 32 32
Table 8: Hyperparameter settings for fine-tuning MPNN on the five MoleculeNet datasets.
Table 10: Hyperparameter settings for ZINC, MNIST, CIFAR10, Peptides-func and Peptides-struct datasets.