… graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graphs (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-attributed graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we discuss the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

Index Terms—Large Language Models, Graph Neural Networks, Natural Language Processing, Graph Representation Learning

[Figure 1: graph scenarios organized by the graph-text relationship (Pure Graphs, Text-Paired Graphs, Text-Attributed Graphs) and techniques organized by the large language models' roles (LLM as Predictor, LLM as Aligner, LLM as Encoder).]
Fig. 1. According to the relationship between graph and text, we categorize three LLM-on-graph scenarios. Depending on the role of the LLM, we summarize three LLM-on-graph techniques. "LLM as Predictor" is where LLMs are responsible for predicting the final answer. "LLM as Aligner" aligns the input-output pairs with those of GNNs. "LLM as Encoder" uses LLMs to encode the text associated with nodes/edges into feature vectors that are fed to GNNs.

1 INTRODUCTION
articles, e ∈ E as the citation links between them, and d ∈ D as the textual content of these articles. A graph with node-level textual information is also called a text-attributed graph [31], a text-rich graph [62], or a textual graph [72].

Definition 3 (Graph with edge-level textual information): A graph with edge-level textual information can be denoted as G = (V, E, D), where V, E, and D are the node set, edge set, and text set, respectively. Each e_ij ∈ E is associated with some textual information d_{e_ij} ∈ D. For example, in a social network, one can interpret v ∈ V as the users, e ∈ E as the interactions between the users, and d ∈ D as the textual content of the messages sent between the users.

Definition 4 (Graph with graph-level textual information): It can be denoted as the pair (G, d_G), where G = (V, E). V and E are the node set and edge set, and d_G is the text set paired with the graph G. For instance, in a molecular graph G, v ∈ V denotes an atom, e ∈ E represents the strong attractive forces or chemical bonds that hold molecules together, and d_G represents the textual description of the molecule. We note that texts may also be associated with subgraph-level concepts and then paired with the entire graph.

2.2 Background

(Large) Language Models. Language Models (LMs), or language modeling, is an area in the field of natural language processing (NLP) concerned with understanding and generating from text distributions. In recent years, large language models (LLMs) have demonstrated impressive capabilities in tasks such as machine translation, text summarization, and question answering [26], [43], [112]–[115], [195].

Language models have evolved significantly over time. BERT [23] marks significant progress in language modeling and representation. BERT models the conditional probability of a word given its bidirectional context, also named the masked language modeling (MLM) objective:

E_{S∼D} [ Σ_{s_i ∈ S} log p(s_i | s_1, ..., s_{i−1}, s_{i+1}, ..., s_{N_S}) ],   (1)

where S is a sentence sampled from the corpus D, s_i is the i-th word in the sentence, and N_S is the length of the sentence. BERT utilizes the Transformer architecture with attention mechanisms as the core building block. In the vanilla Transformer, the attention mechanism is defined as:

Attention(Q, K, V) = softmax( QK^⊤ / √d_k ) V,   (2)

where Q, K, V ∈ R^{N_S × d_k} are the query, key, and value vectors for each word in the sentence, respectively. Following BERT, other masked language models were proposed, such as RoBERTa [24], ALBERT [116], and ELECTRA [117], with similar architectures and objectives of text representation.

Although the original Transformer paper [93] was experimented on machine translation, it was not until the release of GPT-2 [115] that language generation (aka. causal language modeling) became impactful on downstream tasks. Causal language modeling is the task of predicting the next word given the previous words in a sentence. The objective of causal language modeling is defined as:

E_{S∼D} [ Σ_{s_i ∈ S} log p(s_i | s_1, ..., s_{i−1}) ].   (3)

Simple but powerful, subsequent models like GPT-3 [26], GPT-4 [118], LLaMA [119], LLaMA2 [119], Mistral 7B [120], and T5 [29] show impressive emergent capabilities such as few-shot learning, chain-of-thought reasoning, and programming. Efforts have been made to combine language models with other modalities such as vision [96], [121] and biochemical structures [47], [122], [123]. We will discuss their combination with graphs in this paper.

We would like to point out that the word "large" in LLM is not associated with a clear and static threshold to divide language models. "Large" actually refers to a direction in which language models are inevitably evolving, and larger foundational models tend to possess significantly more representation and generalization power. Hence, we define LLMs to encompass both medium-scale PLMs, such as BERT, and large-scale LMs, like GPT-4, as suggested by [21].

Graph Neural Networks & Graph Transformers. In real-world scenarios, not all data are sequential like text; much data lies in more complex, non-Euclidean structures, i.e., graphs. GNNs were proposed as deep-learning architectures for graph data. Early GNNs, including GCN [84], GraphSAGE [85], and GAT [86], are designed for solving node-level tasks. They mainly adopt a propagation-aggregation paradigm to obtain node representations:

a^{(l−1)}_{v_i v_j} = PROP^{(l)} ( h^{(l−1)}_{v_i}, h^{(l−1)}_{v_j} ),  ∀ v_j ∈ N(v_i);   (4)
h^{(l)}_{v_i} = AGG^{(l)} ( h^{(l−1)}_{v_i}, { a^{(l−1)}_{v_i v_j} | v_j ∈ N(v_i) } ).   (5)

Later works such as GIN [189] explore GNNs for solving graph-level tasks. They obtain graph representations by adopting a READOUT function on node representations:

h_G = READOUT({ h_{v_i} | v_i ∈ G }).   (6)

The READOUT functions include mean pooling, max pooling, and so on. Subsequent work on GNNs tackles the issues of over-smoothing [139], over-squashing [140], interpretability [145], and bias [143]. While message-passing-based GNNs have demonstrated advanced structure-encoding capability, researchers are exploring how to further enhance their expressiveness with Transformers (i.e., graph Transformers). Graph Transformers utilize a global multi-head attention mechanism to expand the receptive field of each graph encoding layer [141]. They integrate the inductive biases of graphs into the model by positional encoding, structural encoding, the combination of message-passing layers with attention layers [142], or improving the efficiency of attention on large graphs [144]. Graph Transformers have been proven to be the state-of-the-art solution for many pure graph problems.
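To make the propagation-aggregation paradigm of Eqs. (4)-(5) and the READOUT of Eq. (6) concrete, the following is a minimal sketch, not tied to any specific GNN surveyed here: PROP is a linear map of the neighbor representation, AGG combines the center node with the mean of the incoming messages, and READOUT is mean pooling. The feature dimension and the toy triangle graph are illustrative assumptions.

```python
import numpy as np

def propagate(h_j, W_prop):
    # PROP in Eq. (4): message from neighbor v_j (here a simple linear map of h_j).
    return h_j @ W_prop

def aggregate(h_i, messages, W_self):
    # AGG in Eq. (5): combine the center representation with the mean of neighbor messages.
    return np.tanh(h_i @ W_self + np.mean(messages, axis=0))

def gnn_layer(H, adj, W_prop, W_self):
    # One propagation-aggregation layer over all nodes; adj[i] lists the neighbors of node i.
    return np.stack([
        aggregate(H[i], [propagate(H[j], W_prop) for j in adj[i]], W_self)
        for i in range(len(H))
    ])

def readout(H):
    # READOUT in Eq. (6): mean pooling over node representations gives a graph representation.
    return H.mean(axis=0)

# Toy triangle graph with 4-dimensional node features (illustrative only).
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
W_prop, W_self = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

H = gnn_layer(H, adj, W_prop, W_self)   # node-level representations, Eqs. (4)-(5)
h_G = readout(H)                        # graph-level representation, Eq. (6)
print(h_G.shape)                        # (4,)
```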
Language Models vs. Graph Transformers. Modern language models and graph Transformers both use Transformers [93] as the base model architecture. This makes the two concepts hard to distinguish, especially when language models are adopted for graph applications. In this paper, "Transformers" typically refers to Transformer language models for simplicity. Here, we provide three points to help distinguish them: 1) Tokens (word token vs. node token): Transformers take a token sequence as input. For language models, the tokens are word tokens, while for graph Transformers, the tokens are node tokens. In cases where the tokens include both word tokens and node tokens, if the backbone Transformer is pretrained on a text corpus (e.g., BERT [23] and LLaMA [119]), we will call it a "language model". 2) Positional Encoding (sequence vs. graph): Language models typically adopt absolute or relative positional encodings that consider the position of a word token in the sequence, while graph Transformers adopt the shortest path distance [141], the random walk distance, or the eigenvalues of the graph Laplacian [142] to consider the distance between nodes in the graph. 3) Goal (text vs. graph): Language models are originally proposed for text encoding and generation, while graph Transformers are proposed for node encoding or graph encoding. In cases where texts serve as nodes/edges on the graph, if the backbone Transformer is pretrained on a text corpus, we will call it a "language model".
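As a concrete illustration of the graph positional encodings mentioned in the Language Models vs. Graph Transformers comparison above, the sketch below computes Laplacian eigenvector positional encodings for a small graph. The symmetric normalization and the number of retained eigenvectors are illustrative assumptions, not the exact recipe of any particular graph Transformer.

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    L = np.eye(len(A)) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    # Eigenvectors of L sorted by eigenvalue; skip the trivial first eigenvector
    # and keep the next k as node positional encodings.
    eigvals, eigvecs = np.linalg.eigh(L)
    order = np.argsort(eigvals)
    return eigvecs[:, order[1:k + 1]]

# 4-node path graph (illustrative); each row is one node's k-dimensional positional encoding.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(laplacian_positional_encoding(A, k=2))
```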
3 CATEGORIZATION AND FRAMEWORK

In this section, we first introduce our categorization of graph scenarios where language models can be adopted. Then we discuss the categorization of LLM-on-graph techniques. Finally, we summarize the training & inference framework for language models on graphs.

3.1 Categorization of Graph Scenarios with LLMs

Pure Graphs without Textual Information are graphs with no text information or no semantically rich text information. Examples include traffic graphs and power transmission graphs. Those graphs often serve as context to test the graph reasoning ability of large language models (solving graph theory problems) or serve as knowledge sources to enhance large language models (alleviating hallucination).

Text-Attributed Graphs refer to graphs where nodes or edges are associated with semantically rich text information. They are also called text-rich networks [31], textual graphs [72], or textual-edge networks [74]. Examples include academic networks, e-commerce networks, social networks, and legal case networks. On these graphs, researchers are interested in learning representations for nodes or edges with both textual and structural information [72], [74].

Text-Paired Graphs have textual descriptions defined for the entire graph structure. For example, graphs like molecules may be paired with captions or textual features. While the graph structure significantly contributes to molecular properties, text descriptions can complement our understanding of molecules. The graph scenarios can be found in Fig. 1.

3.2 Categorization of LLMs on Graph Techniques

According to the roles of LLMs and what the final components for solving graph-related problems are, we classify LLM-on-graph techniques into three main categories:

LLM as Predictor. This category of methods serves the LLM as the final component that outputs representations or predictions. It can be enhanced with GNNs and can be classified depending on how the graph information is injected into the LLM: 1) Graph as Sequence: This type of method makes no changes to the LLM architecture, but makes it aware of the graph structure by taking a "graph token sequence" as input. The "graph token sequence" can be natural language descriptions of a graph or hidden representations outputted by graph encoders. 2) Graph-Empowered LLM: This type of method modifies the architecture of the LLM base model (i.e., the Transformer) and enables it to conduct joint text and graph encoding inside its architecture. 3) Graph-Aware LLM Finetuning: This type of method makes no changes to the input or architecture of the LLM, but only fine-tunes the LLM with supervision from the graph.

LLM as Encoder. This method is mostly utilized for graphs where nodes or edges are associated with text information (solving node-level or edge-level tasks). GNNs are the final components, and the LLM is adopted as the initial text encoder. To be specific, LLMs are first utilized to encode the text associated with the nodes/edges. The feature vectors outputted by the LLMs then serve as input embeddings for GNNs for graph structure encoding. The output embeddings from the GNNs are adopted as final node/edge representations for downstream tasks. However, these methods suffer from convergence, data sparsity, and efficiency issues, for which we summarize solutions from the optimization, data augmentation, and knowledge distillation perspectives.

LLM as Aligner. This category of methods adopts LLMs as text-encoding components and aligns them with GNNs, which serve as graph structure encoding components. LLMs and GNNs are adopted together as the final components for task solving. To be specific, the alignment between LLMs and GNNs can be categorized into 1) Prediction Alignment, where the pseudo labels generated by one modality are utilized for training on the other modality in an iterative learning fashion, and 2) Latent Space Alignment, where contrastive learning is adopted to align text embeddings generated by LLMs and graph embeddings generated by GNNs.

In the following sections, we will follow our categorization in Section 3 and discuss detailed methodologies for each graph scenario.

4 PURE GRAPHS

Problems on pure graphs provide a fundamental motivation for why and how LLMs are introduced into graph-related reasoning problems. Investigated thoroughly in graph theory, pure graphs serve as a universal representation format for a wide range of classical algorithmic problems in all perspectives of computer science. Many graph-based concepts, such as shortest paths, particular sub-graphs, and flow networks, have strong connections with real-world applications [133]–[135], [193]. Therefore, pure graph-based reasoning is vital in providing theoretical solutions and insights for reasoning problems grounded in real-world applications.

Nevertheless, many reasoning tasks require computational capacity beyond traditional GNNs. GNNs are typically designed to carry out a bounded number of operations given a graph size. In contrast, graph reasoning problems can require up to indefinite complexity depending on the task's nature. On the other hand, LLMs have recently demonstrated excellent emergent reasoning ability [48], [112], [113]. This is partially due to their autoregressive mechanism, which enables computing indefinite sequences of intermediate steps with careful prompting or training [48], [49].

The following subsections discuss the attempts to incorporate LLMs into pure graph reasoning problems. We will also discuss the corresponding challenges, limitations, and findings. Table 4 in the Appendix lists a categorization of these efforts. Usually, input graphs are serialized as part of the input sequence, either by verbalizing the graph structure [124]–[126], [128]–[132] or by encoding the graph structure into implicit feature sequences [42]. The studied reasoning problems range from simpler ones like connectivity, shortest paths, and cycle detection to harder ones like maximum flow and Hamiltonian pathfinding (an NP-complete problem). A comprehensive list of the studied problems is given in Appendix Table 5. Note that we only list representative problems here. This table does not include more domain-specific problems, such as the spatial-temporal reasoning problems in [128].

4.1 Direct Answering

Although graph-based reasoning problems usually involve complex computation, researchers still attempt to let language models directly generate answers from the serialized input graphs as a starting point or a baseline, partially because of the simplicity of the approach and partially in awe of other emergent abilities of LLMs. Although various attempts have been made to optimize how graphs are presented in the input sequence, which we will discuss in the following sections, bounded by the finite sequence length and computational operations, there is a fundamental limitation of this approach to solving complex reasoning problems such as NP-complete ones. Unsurprisingly, most studies find that LLMs possess preliminary graph understanding ability, but the performance is less satisfactory on more complex problems or larger graphs [42], [124]–[126], [128], [131] where reasoning is necessary.

Plainly Verbalizing Graphs. Verbalizing the graph structure in natural language is the most straightforward way of representing graphs. Representative approaches include describing the edge and adjacency lists, widely studied in [124], [125], [128], [131]. For example, for a triangle graph with three nodes, the edge list can be written as "[(0, 1), (1, 2), (2, 0)]", which means node 0 is connected to node 1, node 1 is connected to node 2, and node 2 is connected to node 0. It can also be written in natural language such as "There is an edge between node 0 and node 1, an edge between node 1 and node 2, and an edge between node 2 and node 0." On the other hand, we can describe the adjacency list from the nodes' perspective. For example, for the same triangle graph, the adjacency list can be written as "Node 0 is connected to node 1 and node 2. Node 1 is connected to node 0 and node 2. Node 2 is connected to node 0 and node 1." On these inputs, one can prompt LLMs to answer questions either in zero-shot or few-shot (in-context learning) settings: the former directly asks questions given the graph structure, while the latter asks questions about the graph structure after providing a few examples of questions and answers. [124]–[126] confirm that LLMs can answer easier questions such as connectivity, neighbor identification, and graph size counting but fail to answer more complex questions such as cycle detection and Hamiltonian pathfinding. Their results also reveal that providing more examples in the few-shot setting increases the performance, especially on easier problems, although it is still not satisfactory.
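The two verbalization formats described above (edge list and adjacency list) can be produced with a few lines of code and wrapped into a zero-shot question prompt. The template wording below is an illustrative assumption rather than the exact prompt of any cited benchmark.

```python
def verbalize_edge_list(edges):
    # "There is an edge between node i and node j, ..." style description.
    parts = [f"an edge between node {i} and node {j}" for i, j in edges]
    return "There is " + ", ".join(parts) + "."

def verbalize_adjacency(edges, num_nodes):
    # "Node i is connected to node j and node k." style description.
    neighbors = {v: [] for v in range(num_nodes)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return " ".join(
        f"Node {v} is connected to " +
        " and ".join(f"node {u}" for u in sorted(neighbors[v])) + "."
        for v in range(num_nodes)
    )

def zero_shot_prompt(edges, num_nodes, question):
    # Serialize the graph, then ask the reasoning question directly (zero-shot setting).
    return f"{verbalize_adjacency(edges, num_nodes)}\nQuestion: {question}\nAnswer:"

triangle = [(0, 1), (1, 2), (2, 0)]
print(verbalize_edge_list(triangle))
print(zero_shot_prompt(triangle, 3, "Is there a cycle in this graph?"))
```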
Paraphrasing Graphs. The verbalized graphs can be lengthy, unstructured, and complicated to read, even for humans, so they might not be the best input format for LLMs to infer the answers from. To this end, researchers also attempt to paraphrase the graph structure into more natural or concise sentences. [126] find that by prompting LLMs to generate a format explanation of the raw graph inputs for themselves (Format-Explanation) or to pretend to play a role in a natural task (Role Prompting), the performance on some problems can be improved, but not systematically. [131] explores the effect of grounding the pure graph in a real-world scenario, such as social networks, friendship graphs, or co-authorship graphs. In such graphs, nodes are described as people, and edges are relationships between people. Results indicate that encoding in real-world scenarios can improve the performance on some problems, but still not consistently.

Encoding Graphs into Implicit Feature Sequences. Finally, researchers also attempt to encode the graph structure into implicit feature sequences as part of the input sequence [42]. Unlike the previous verbalizing approaches, this usually involves training a graph encoder to encode the graph structure into a sequence of features and fine-tuning the LLMs to adapt to the new input format. [42] demonstrates drastic performance improvement on problems including substructure counting, maximum triplet sum, shortest path, and bipartite matching, indicating that fine-tuning LLMs has great fitting power on a specific task distribution.

4.2 Heuristic Reasoning

Direct mapping to the output leverages the LLMs' powerful representation power to "guess" the answers. Still, it does not fully utilize the LLMs' impressive emergent reasoning ability, which is essential for solving complex reasoning problems. To this end, attempts have been made to let LLMs perform heuristic reasoning on graphs. This approach encourages LLMs to perform a series of intermediate reasoning steps that might heuristically lead to the correct answer, which resembles a path-finding reasoning schema [203].

Reasoning Step by Step. Encouraged by the success of chain-of-thought (CoT) reasoning [48], [113], researchers also attempt to let LLMs perform reasoning step by step on graphs. Chain-of-thought encourages LLMs to roll out a sequence of reasoning steps to solve a problem, similar to how humans solve problems. Zero-shot CoT is a similar approach that does not require any examples. These techniques are studied in [42], [124]–[126], [128], [131], [132]. Results indicate that CoT-style reasoning can improve the performance on simpler problems, such as cycle detection and shortest path detection. Still, the improvement is inconsistent or diminishes on more complex problems, such as Hamiltonian path finding and topological sorting.

Retrieving Subgraphs as Evidence. Many graph reasoning problems, such as node degree counting and neighborhood detection, only involve reasoning on a subgraph of the whole graph. Such properties allow researchers to let LLMs retrieve subgraphs as evidence and perform reasoning on those subgraphs. Build-a-Graph prompting [124] encourages LLMs to reconstruct the graph structures relevant to the questions and then perform reasoning on them. This method demonstrates promising results on problems except for Hamiltonian pathfinding, a notoriously tricky problem requiring reasoning on the whole graph. Another approach, Context-Summarization [126], encourages LLMs to summarize the key nodes, edges, or sub-graphs and perform reasoning.

Searching on Graphs. This kind of reasoning is related to search algorithms on graphs, such as breadth-first search (BFS) and depth-first search (DFS). Although not universally applicable, BFS and DFS are the most intuitive and effective ways to solve some graph reasoning problems. Numerous explorations have been made to simulate searching-based reasoning, especially on knowledge-graph question answering. This approach enjoys the advantage of providing interpretable evidence besides the answer. Reasoning-on-Graphs (RoG) [129] is a representative approach that prompts LLMs to generate several relation paths as plans, which are then retrieved from the knowledge graph (KG) and used as evidence to answer the questions. Another approach is to iteratively retrieve and reason on the subgraphs from the KG [130], [132], simulating a dynamic searching process. At each step, the LLMs retrieve neighbors of the current nodes and then decide whether to answer the question or continue to the next search step. These methods address the scalability challenge when knowledge from multiple graphs is available.

4.3 Algorithmic Reasoning

The previous two approaches are heuristic, which means that the reasoning process accords with human intuition but is not guaranteed to lead to the correct answer. In contrast, such problems are usually solved by algorithms in computer science. Therefore, researchers also attempt to let LLMs perform algorithmic reasoning on graphs. [124] proposed "Algorithmic Prompting", which prompts the LLMs to recall the algorithms that are relevant to the questions and then perform reasoning step by step according to the algorithms. Their results, however, do not show consistent improvement over the heuristic reasoning approach. A more direct approach, Graph-ToolFormer [127], lets LLMs generate API calls as explicit reasoning steps. These API calls are then executed externally to acquire answers on an external graph. This approach is suitable for converting tasks grounded in real-world settings into pure graph reasoning problems, demonstrating efficacy on various applications such as knowledge graphs, social networks, and recommendation systems.

4.4 Discussion

The above approaches are not mutually exclusive, and they can be combined to achieve better performance, for example, by prompting language models for heuristics in algorithmic searching. Moreover, heuristic reasoning can also conduct direct answering, while algorithmic reasoning contains the capacity of heuristic reasoning as a special case. Researchers are advised to select the most suitable approach for a specific problem.

5 TEXT-ATTRIBUTED GRAPHS

Text-attributed graphs exist ubiquitously in the real world, e.g., academic networks and legal case networks. Learning on such networks requires the model to encode both the textual information associated with the nodes/edges and the structural information lying inside the input graph. Depending on the role of the LLM, existing works can be categorized into three types: LLM as Predictor, LLM as Encoder, and LLM as Aligner. We summarize all surveyed methods in Appendix Table 6.
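The iterative retrieve-and-reason pattern described under Searching on Graphs can be outlined as the loop below. This is a schematic sketch only: llm_decide and the dictionary-based KG lookup are hypothetical stand-ins for an LLM call and a knowledge-graph interface, not components of any cited system.

```python
def search_and_answer(question, start_entities, kg, llm_decide, max_steps=5):
    """Iteratively expand a subgraph around the current frontier and let the LLM
    decide at each step whether to answer or to keep searching (schematic only)."""
    frontier = list(start_entities)
    evidence = []  # (head, relation, tail) triples retrieved so far
    for _ in range(max_steps):
        # Retrieve neighbors of the current nodes from the knowledge graph.
        neighbors = [t for e in frontier for t in kg.get(e, [])]
        evidence.extend(neighbors)
        # The LLM either returns a final answer or the next entities to expand.
        decision = llm_decide(question, evidence)
        if decision["answer"] is not None:
            return decision["answer"], evidence
        frontier = decision["next_entities"]
    return None, evidence

# Toy usage with a dictionary KG and a trivial decision function (illustrative only).
kg = {"Paris": [("Paris", "capital_of", "France")],
      "France": [("France", "continent", "Europe")]}
dummy = lambda q, ev: {"answer": ev[-1][2] if len(ev) >= 2 else None,
                       "next_entities": [ev[-1][2]]}
print(search_and_answer("Which continent is Paris in?", ["Paris"], kg, dummy))
```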
5.1 LLM as Predictor

These methods serve the language model as the main model architecture to capture both the text information and the graph structure information. They can be categorized into three types: Graph as Sequence methods, Graph-Empowered LLMs, and Graph-Aware LLM finetuning methods, depending on how structure information in graphs is injected into language models (input vs. architecture vs. loss). In the Graph as Sequence methods, graphs are converted into sequences that can be understood by language models together with texts from the inputs. In the Graph-Empowered LLMs methods, people modify the architecture of Transformers (which is the base architecture for LLMs) to enable it to encode text and graph structure simultaneously. In the Graph-Aware LLM finetuning methods, the LLM is fine-tuned with graph structure supervision and can generate graph-contextualized representations.

5.1.1 Graph as Sequence

In these methods, the graph information is mainly encoded into the LLM from the "input" side. The ego-graphs associated with nodes/edges are serialized into a sequence H_{G_v} which can be fed into the LLM together with the texts d_v:

H_{G_v} = Graph2Seq(G_v),   (7)
h_v = LLM([H_{G_v}, d_v]).   (8)

Depending on the choice of the Graph2Seq(·) function, the methods can be further categorized into rule-based methods and GNN-based methods. The illustration of the categories can be found in Fig. 3.

Fig. 3. The illustration of various LLM as Predictor methods, including (a) Rule-based Graph as Sequence, (b) GNN-based Graph as Sequence, and (c) Graph-Empowered LLMs.

Rule-based: Linearizing Graphs into Text Sequence with Rules. These methods design rules to describe the structure with natural language and adopt a text prompt template as Graph2Seq(·). For example, given an ego-graph G_{v_i} of the paper node v_i connecting to author nodes v_j and v_k and venue nodes v_t and v_s, H_{G_{v_i}} = Graph2Seq(G_{v_i}) = "The center paper node is v_i. Its author neighbor nodes are v_j and v_k and its venue neighbor nodes are v_t and v_s". This is the most straightforward and easiest way (without introducing extra model parameters) to encode graph structures into language models. Along this line, InstructGLM [46] designs templates to describe the local ego-graph structure (maximum 3-hop connection) for each node and conducts instruction tuning for node classification and link prediction. GraphText [65] further proposes a syntax tree-based method to transfer structure into a text sequence. Researchers [82] also study when and why the linearized structure information on graphs can improve the performance of LLMs on node classification and find that the structure information is beneficial when the textual information associated with the node is scarce (in this case, the structure information can provide auxiliary information gain).

GNN-based: Encoding Graphs into Special Tokens with GNNs. Different from rule-based methods, which use natural language prompts to linearize graphs into sequences, GNN-based methods adopt graph encoder models (i.e., GNNs) to encode the ego-graph associated with nodes into special token representations which are concatenated with the pure text information and fed into the language model:

H_{G_v} = Graph2Seq(G_v) = GraphEnc(G_v).   (9)

The strength of these methods is that they can capture the hidden representations of useful structure information with a strong graph encoder, while the challenge is how to fill the gap between the graph modality and the text modality. GNP [41] adopts a similar philosophy to LLaVA [91]: it utilizes a GNN to generate graph tokens and then projects the graph tokens into the text token space with learnable projection matrices. The projected graph tokens are concatenated with text tokens and fed into the language model. GraphGPT [45] further proposes to train a text-grounded GNN for the projection with a text encoder and contrastive learning. DGTL [76] introduces disentangled graph learning, serves graph representations as positional encodings, and adds them to the text sequence. METERN [75] adds learnable relation embeddings to node textual sequences for text-based multiplex representation learning on graphs [92].

5.1.2 Graph-Empowered LLMs

In these methods, researchers design advanced LLM architectures (i.e., Graph-Empowered LLMs) which can conduct joint text and graph encoding inside their model architecture. Transformers [93] serve as the base model for today's pretrained LMs [23] and LLMs [36]. However, they are designed for natural language (sequence) encoding and do not take non-sequential structure information into consideration. To this end, Graph-Empowered LLMs are proposed. They share the philosophy of introducing virtual structure tokens H_{G_v} inside each Transformer layer:

H̃^{(l)}_{d_v} = [H^{(l)}_{G_v}, H^{(l)}_{d_v}],   (10)

where H_{G_v} can be learnable embeddings or the output from graph encoders. Then the original multi-head attention (MHA) in Transformers is modified into an asymmetric MHA to take the structure tokens into consideration:

MHA_asy(H^{(l)}_{d_v}, H̃^{(l)}_{d_v}) = ∥_{u=1}^{U} head_u(H^{(l)}_{d_v}, H̃^{(l)}_{d_v}),
where head_u(H^{(l)}_{d_v}, H̃^{(l)}_{d_v}) = softmax( Q^{(l)}_u K̃^{(l)⊤}_u / √(d/U) ) · Ṽ^{(l)}_u,
Q^{(l)}_u = H^{(l)}_{d_v} W^{(l)}_{Q,u},  K̃^{(l)}_u = H̃^{(l)}_{d_v} W^{(l)}_{K,u},  Ṽ^{(l)}_u = H̃^{(l)}_{d_v} W^{(l)}_{V,u}.   (11)

With the asymmetric MHA mechanism, the node encoding process of the (l+1)-th layer is:

H̃^{(l)′}_{d_v} = Normalize( H^{(l)}_{d_v} + MHA_asy(H̃^{(l)}_{d_v}, H^{(l)}_{d_v}) ),
H^{(l+1)}_{d_v} = Normalize( H̃^{(l)′}_{d_v} + MLP(H̃^{(l)′}_{d_v}) ).   (12)

Along this line of work, GreaseLM [67] proposes to have a language encoding component and a graph encoding component in each layer. These two components interact through a modality-fusion layer (MInt layer), where a special structure token is added to the text Transformer input, and a special node is added to the graph encoding layer. DRAGON [81] further proposes strategies to pretrain GreaseLM with unsupervised signals. GraphFormers [72] are designed for node representation learning on homogeneous text-attributed networks where the current-layer [CLS] token hidden states of neighboring documents are aggregated and added as a new token on the current-layer center node text encoding. Patton [31] proposes to pretrain GraphFormers with two novel strategies: network-contextualized masked language modeling and masked node prediction.
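A minimal PyTorch sketch of the asymmetric multi-head attention of Eq. (11): text tokens provide the queries, while the concatenation of virtual structure tokens and text tokens from Eq. (10) provides the keys and values. The dimensions, the single layer, and the absence of the residual/normalization steps of Eq. (12) are simplifications for illustration, not the implementation of any specific Graph-Empowered LLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricMHA(nn.Module):
    """Text tokens attend over [structure tokens; text tokens], as in Eqs. (10)-(11)."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.u, self.d_head = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, H_text, H_struct):
        H_aug = torch.cat([H_struct, H_text], dim=1)      # Eq. (10): [H_Gv, H_dv]
        B, N_t, _ = H_text.shape
        N_a = H_aug.shape[1]
        # Project and split into heads.
        Q = self.W_q(H_text).view(B, N_t, self.u, self.d_head).transpose(1, 2)
        K = self.W_k(H_aug).view(B, N_a, self.u, self.d_head).transpose(1, 2)
        V = self.W_v(H_aug).view(B, N_a, self.u, self.d_head).transpose(1, 2)
        attn = F.softmax(Q @ K.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, N_t, -1)  # concatenate heads
        return out

# Toy shapes: 2 virtual structure tokens, 5 text tokens, d_model = 16 (illustrative).
mha = AsymmetricMHA(d_model=16, num_heads=4)
H_text, H_struct = torch.randn(1, 5, 16), torch.randn(1, 2, 16)
print(mha(H_text, H_struct).shape)  # torch.Size([1, 5, 16])
```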
Heterformer [73] introduces virtual neighbor tokens for text-rich neighbors and textless neighbors, which are concatenated with the original text tokens and fed into each Transformer layer. Edgeformers [74] are proposed for representation learning on textual-edge networks where edges are associated with rich textual information. When conducting edge encoding, virtual node tokens are concatenated onto the original edge text tokens for joint encoding.

5.1.3 Graph-Aware LLM Finetuning

In these methods, the graph information is mainly injected into the LLM by "fine-tuning on graphs". Researchers assume that the structure of graphs can provide hints on which documents are "semantically similar" to which other documents. For example, papers citing each other in an academic graph can be of similar topics. These methods adopt vanilla language models that take text as input (e.g., BERT [23] and SciBERT [25]) as the base model and fine-tune them with structure signals on the graph [51]. After that, the LLMs will learn node/edge representations that capture the graph homophily from the text perspective. This is the simplest way to utilize LLMs on graphs. However, during encoding, the model itself can only consider text.

Most methods adopt the two-tower encoding and training pipeline, where the representation of each node is obtained separately and the model is optimized as follows:

h_{v_i} = LLM_θ(d_{v_i}),   min_θ f(h_{v_i}, {h_{v_i^+}}, {h_{v_i^−}}).   (13)

Here v_i^+ represents the positive nodes for v_i, v_i^− represents the negative nodes for v_i, and f(·) denotes the pairwise training objective. Different methods have different strategies for v_i^+ and v_i^− with different training objectives f(·). SPECTER [51] constructs the positive text/node pairs with the citation relation, explores random negatives and structure hard negatives, and fine-tunes SciBERT [25] with the triplet loss. SciNCL [52] extends SPECTER by introducing more advanced positive and negative sampling methods based on embeddings trained on graphs. Touchup-G [54] proposes a measurement of feature homophily on graphs and brings up a binary cross-entropy fine-tuning objective. TwHIN-BERT [56] mines positive node pairs with off-the-shelf heterogeneous information network embeddings and trains the model with a contrastive social loss. MICoL [59] discovers semantically positive node pairs with meta-paths [90] and adopts the InfoNCE objective. E2EG [60] utilizes a philosophy similar to GIANT [58] and adds a neighbor prediction objective apart from the downstream task objective. WalkLM [61] conducts random walks for structure linearization before fine-tuning the language model. A summarization of the two-tower graph-centric LLM fine-tuning objectives can be found in Appendix Table 7.

There are other methods using the one-tower pipeline, where node pairs are concatenated and encoded together:

h_{v_i, v_j} = LLM_θ(d_{v_i}, d_{v_j}),   min_θ f(h_{v_i, v_j}).   (14)

LinkBERT [30] proposes a document relation prediction objective (an extension of next sentence prediction in BERT [23]) which aims to classify the relation of two node text pairs as contiguous, random, or linked. MICoL [59] explores predicting the node pairs' binary meta-path or meta-graph indicated relation with the one-tower language model.

5.1.4 Discussion

Although the community is making good progress, there are still some open questions to be solved.

Graph as Code Sequence. Existing graph-as-sequence methods are mainly rule-based or GNN-based. The former relies on natural language to describe the graphs, which is not natural for structured data, while the latter has a GNN component that needs to be trained. A more promising way is to obtain a structure-aware sequence for graphs that can support zero-shot inference. A potential solution is to adopt code (which can capture structures) to describe the graphs and utilize code LLMs [22].

Advanced Graph-Empowered LLM Techniques. Graph-empowered LLMs are a promising direction toward foundational models for graphs. However, existing works are far from enough: 1) Task. Existing methods are mainly designed for representation learning (with encoder-only LLMs), which is hard to adapt to generation tasks. A potential solution is to design Graph-Empowered LLMs with decoder-only or encoder-decoder LLMs as the base architecture. 2) Pretraining. Pretraining is important to equip LLMs with contextualized data understanding capability, which can be generalized to other tasks. However, existing works mainly focus on pretraining LLMs on homogeneous text-attributed networks. Future studies are needed to explore LLM pretraining in more diverse real-world scenarios including heterogeneous text-attributed networks [73], dynamic text-attributed networks [128], and textual-edge networks [74].

5.2 LLM as Encoder

LLMs extract textual features to serve as initial node feature vectors for GNNs, which then generate node/edge representations and make predictions. These methods typically adopt an LLM-GNN cascaded architecture to obtain the final representation h_{v_i} for node v_i:

x_{v_i} = LLM(d_{v_i}),   h_{v_i} = GNN(X_v, G).   (15)

Here x_{v_i} is the feature vector that captures the textual information d_{v_i} associated with v_i. The final representation h_{v_i} will contain both the textual information and the structure information of v_i and can be used for downstream tasks. In the following sections, we will discuss the optimization, augmentation, and distillation of such models. The figures for these techniques can be found in Fig. 4.

Fig. 4. The illustration of various techniques related to LLM as Encoder, including (a) One-step Training, (b) Two-step Training, (c) Data Augmentation, and (d) Knowledge Distillation.
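A minimal sketch of the LLM-GNN cascaded architecture of Eq. (15). The encode_text function is a placeholder standing in for any pretrained text encoder (it returns random vectors here purely to show the data flow), and the single mean-aggregation GNN layer, dimensions, and toy citation graph are illustrative assumptions.

```python
import torch
import torch.nn as nn

def encode_text(docs, dim=32):
    # Placeholder for x_vi = LLM(d_vi): in practice this would be a pretrained
    # language model; here we return random vectors just to show the data flow.
    return torch.randn(len(docs), dim)

class MeanGNNLayer(nn.Module):
    # h_vi = GNN(X_v, G): one mean-aggregation layer over the self-loop-augmented graph.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, X, A):
        A_hat = A + torch.eye(A.shape[0])                 # add self-loops
        A_norm = A_hat / A_hat.sum(dim=1, keepdim=True)   # row-normalize (mean aggregation)
        return torch.relu(self.lin(A_norm @ X))

docs = ["Paper on graph neural networks.", "Paper on language models.", "Survey paper."]
A = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])  # toy citation graph

X = encode_text(docs)              # LLM as the initial text encoder
H = MeanGNNLayer(32, 16)(X, A)     # GNN for structure encoding
print(H.shape)                     # torch.Size([3, 16])
```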
5.2.1 Optimization

One-step training refers to training the LLM and the GNN together in the cascaded architecture for the downstream tasks. TextGNN [77] explores GCN [84], GraphSAGE [85], and GAT [86] as the base GNN architecture, adds a skip connection between the LLM output and the GNN output, and optimizes the whole architecture for the sponsored search task. AdsGNN [78] further extends TextGNN by proposing edge-level information aggregation. GNN-LM [66] adds GNN layers to enable the vanilla language model to reference similar contexts in the corpus for language modeling. Jointly training LLMs and GNNs in a cascaded pipeline is convenient but may suffer from efficiency issues [68] (only a few one-hop neighbors can be sampled given the memory complexity) and local minima issues [35] (the LLM underfits the data).

Two-step training means first adapting the LLM to the graph, and then finetuning the whole LLM-GNN cascaded pipeline. GIANT [58] proposes to conduct neighborhood prediction with the use of XR-Transformers [79] and results in an LLM that can output better feature vectors than bag-of-words and vanilla BERT [23] embeddings for node classification. LM-GNN [68] introduces graph-aware pre-fine-tuning to warm up the LLM on the given graph before fine-tuning the whole LLM-GNN pipeline and demonstrates significant performance gains. SimTeG [35] finds that the simple framework of first training the LLM on the downstream task and then fixing the LLM and training the GNN can result in outstanding performance. They further find that using an efficient fine-tuning method, e.g., LoRA [40], to tune the LLM can alleviate overfitting issues. GaLM [80] explores ways to pretrain the LLM-GNN cascaded architecture. The two-step strategy can effectively alleviate the insufficient training of the LLM, which contributes to higher text representation quality, but it is more computationally expensive and time-consuming than the one-step training strategy.

5.2.2 Data Augmentation

With their demonstrated zero-shot capability [43], LLMs can be used for data augmentation to generate additional text data for the LLM-GNN cascaded architecture. The philosophy of using LLMs to generate pseudo data is widely explored in NLP [83], [89]. LLM-GNN [64] proposes to conduct zero-shot node classification on text-attributed networks by labeling a few nodes and using the pseudo labels to fine-tune GNNs. TAPE [70] presents a method that uses an LLM to generate prediction text and explanation text, which serve as augmented text data compared with the original text data. A following medium-scale language model is adopted to encode the texts and output features for the augmented texts and the original texts respectively before feeding them into GNNs. ENG [71] brings forward the idea of generating labeled nodes for each category, adding edges between labeled nodes and other nodes, and conducting semi-supervised GNN learning for node classification.

5.2.3 Knowledge Distillation

The LLM-GNN cascaded pipeline is capable of capturing both text information and structure information. However, the pipeline suffers from time complexity issues during inference, since GNNs need to conduct neighbor sampling and LLMs need to encode the text associated with both the center node and its neighbors. A straightforward solution is to serve the LLM-GNN cascaded pipeline as the teacher model and distill it into an LLM as the student model. In this case, during inference, the model (which is a pure LLM) only needs to encode the text on the center node and avoids time-consuming neighbor sampling. AdsGNN [78] proposes an L2 loss to force the outputs of the student model to preserve topology after the teacher model is trained. GraD [69] introduces three strategies, including the distillation objective and the task objective, to optimize the teacher model and distill its capability to the student model.

5.2.4 Discussion

Given that GNNs have been demonstrated to be powerful models for encoding graphs, "LLM as Encoder" seems to be the most straightforward way to utilize LLMs on graphs. However, there are still open questions.

Limited Task: Go Beyond Representation Learning. Current "LLM as Encoder" methods or LLM-GNN cascaded architectures mainly focus on representation learning, given the single embedding propagation-aggregation mechanism of GNNs, which prevents them from being adopted for generation tasks (e.g., node/text generation). A potential solution to this challenge can be to conduct GNN encoding for LLM-generated token-level representations and to design proper decoders that can perform generation based on the LLM-GNN cascaded model outputs.

Low Efficiency: Advanced Knowledge Distillation. The LLM-GNN cascaded pipeline suffers from time complexity issues, since the model needs to conduct neighbor sampling and encode the text of the sampled neighbors during inference. A potential solution is to distill the model into a much smaller LM or even an MLP. Similar methods [87] have been proven effective in GNN-to-MLP distillation and are worth exploring for the LLM-GNN cascaded pipeline as well.
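A minimal sketch of the distillation setup described in Section 5.2.3: after the LLM-GNN teacher is trained, a student that only sees the center node's text is fitted to the teacher's node representations with an L2 loss, so that inference no longer requires neighbor sampling. The placeholder linear encoders and feature sizes are illustrative assumptions, not the architecture of any cited method.

```python
import torch
import torch.nn as nn

# Placeholder encoders standing in for a trained LLM-GNN teacher and an LLM student.
teacher_llm_gnn = nn.Sequential(nn.Linear(32, 16))  # consumes center + neighborhood text features
student_llm = nn.Sequential(nn.Linear(16, 16))      # consumes only the center node's text features

def distillation_step(center_feat, neighborhood_feat, optimizer):
    # Teacher embedding uses the full neighborhood; it is treated as a fixed target.
    with torch.no_grad():
        target = teacher_llm_gnn(torch.cat([center_feat, neighborhood_feat], dim=-1))
    student_out = student_llm(center_feat)
    loss = ((student_out - target) ** 2).mean()  # L2 distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.Adam(student_llm.parameters(), lr=1e-3)
center, neigh = torch.randn(8, 16), torch.randn(8, 16)  # toy batch of 8 nodes
print(distillation_step(center, neigh, optimizer))
```

At inference time only student_llm is used, which encodes the center node's text alone and therefore avoids neighbor sampling.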
5.3 LLM as Aligner

These methods contain an LLM component for text encoding and a GNN component for structure encoding. The two components are treated equally and trained iteratively or in parallel. LLMs and GNNs can mutually enhance each other, since the LLMs can provide textual signals to the GNNs, while the GNNs can deliver structure information to the LLMs. According to how the LLM and the GNN interact, these methods can be further categorized into LLM-GNN Prediction Alignment and LLM-GNN Latent Space Alignment. The illustration of these two categories of methods can be found in Fig. 5.

Fig. 5. The illustration of LLM as Aligner methods, including (a) LLM-GNN Prediction Alignment and (b) LLM-GNN Latent Space Alignment.

5.3.1 LLM-GNN Prediction Alignment

This refers to training the LLM with the text data on a graph and training the GNN with the structure data on the graph iteratively. The LLM generates labels for nodes from the text perspective, which serve as pseudo-labels for GNN training, while the GNN generates labels for nodes from the structure perspective, which serve as pseudo-labels for LLM training. By this design, the two modality encoders can learn from each other and contribute to a final joint text and graph encoding. In this direction, LTRN [57] proposes a novel GNN architecture with personalized PageRank [94] and an attention mechanism for structure encoding, while adopting BERT [23] as the language model. The pseudo labels generated by the LLM and the GNN are merged for the next iteration of training. GLEM [62] formulates the iterative training process into a pseudo-likelihood variational framework, where the E-step optimizes the LLM and the M-step trains the GNN.

5.3.2 LLM-GNN Latent Space Alignment

This denotes connecting text encoding (LLM) and structure encoding (GNN) with cross-modality contrastive learning:

h_{d_{v_i}} = LLM(d_{v_i}),   h_{v_i} = GNN(G_v),   (16)
l(h_{d_{v_i}}, h_{v_i}) = Sim(h_{d_{v_i}}, h_{v_i}) / Σ_{j ≠ i} Sim(h_{d_{v_i}}, h_{v_j}),   (17)
L = (1 / (2|G|)) Σ_{v_i ∈ G} ( l(h_{d_{v_i}}, h_{v_i}) + l(h_{v_i}, h_{d_{v_i}}) ).   (18)

A similar philosophy is widely used in vision-language joint modality learning [96]. Along this line of approaches, ConGrat [53] adopts GAT [86] as the graph encoder and tries MPNet [34] as the language model encoder. They expand the original InfoNCE loss by incorporating graph-specific elements. These elements pertain to the most likely second, third, and subsequent choices regarding the nodes from which a text originates and the texts that a node generates. In addition to the node-level multi-modality contrastive objective, GRENADE [55] proposes a KL-divergence-based neighbor-level knowledge alignment, which minimizes the discrepancy between the neighborhood similarity distributions calculated from the LLM and the GNN. G2P2 [63] further extends node-text contrastive learning by adding text-summary interaction and node-summary interaction. They then introduce using label texts in the text modality for zero-shot classification, and using soft prompts for few-shot classification. THLM [33] proposes to pretrain the language model by contrastive learning with a heterogeneous GNN on heterogeneous text-attributed networks. The pretrained LLM can be fine-tuned on downstream tasks.

5.3.3 Discussion

In "LLM as Aligner" methods, most research adopts shallow GNNs (e.g., GCN, GAT, with thousands of parameters) as the graph encoders that are aligned with LLMs through iterative training (i.e., prediction alignment) or contrastive training (i.e., latent space alignment). Although LLMs (with millions or billions of parameters) have strong expressive capability, the shallow GNNs (with limited representative capability) can constrain the mutual learning effectiveness between LLMs and GNNs. A potential solution is to adopt GNNs which can be scaled up [88]. Furthermore, deeper research is needed to explore the best model size combination for LLMs and GNNs in such "LLM as Aligner" LLM-GNN mutual enhancement frameworks.

6 TEXT-PAIRED GRAPHS

Graphs are prevalent data objects in scientific disciplines such as cheminformatics [183], [194], [200], material informatics [181], bioinformatics [201], and computer vision [147]. Within these diverse fields, graphs frequently come paired with critical graph-level text information. For instance, molecular graphs in cheminformatics are annotated with text properties such as toxicity, water solubility, and permeability [181], [183]. Research on such graphs (scientific discovery) could be accelerated by the text information and the adoption of LLMs. In this section, we review the application of LLMs on graph-captioned graphs with a focus on molecular graphs. According to the technique categorization in Section 3.2, we begin by investigating methods that utilize LLMs as Predictor. Then, we discuss methods that align GNNs with LLMs. We summarize all surveyed methods in Appendix Table 8.

6.1 LLM as Predictor

In this subsection, we review how to conduct "LLM as Predictor" for graph-level tasks. Existing methods can be categorized into Graph as Sequence (treating graph data as sequence input) and Graph-Empowered LLMs (designing model architectures to encode graphs).
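As a concrete illustration of the Graph as Sequence idea for text-paired graphs, the sketch below linearizes molecules as SMILES strings and assembles a few-shot property-prediction prompt roughly following the four-part template discussed in Section 6.1.3. The SMILES strings, labels, and wording are illustrative assumptions, not taken from any benchmark or cited method.

```python
def few_shot_molecule_prompt(examples, query_smiles):
    # {General Description}{Task-Specific Description}{Question-Answer Examples}{Test Question}
    general = "You are an expert chemist."
    task = "Given a molecule as a SMILES string, predict whether it is water-soluble."
    shots = "\n".join(f"SMILES: {s}\nSoluble: {label}" for s, label in examples)
    question = f"SMILES: {query_smiles}\nSoluble:"
    return "\n\n".join([general, task, shots, question])

# Illustrative examples only; the labels here are not drawn from any dataset.
examples = [("CCO", "yes"), ("c1ccccc1", "no")]
print(few_shot_molecule_prompt(examples, query_smiles="CC(=O)O"))
```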
GSD is the graph shortest distance between two nodes, and Mean_{k∈SP(i,j)} represents the mean pooling of the edge features x_{e_k} along the shortest path SP(i, j) between nodes i and j. GIMLET [47] adapts bi-directional attention for node tokens and enables texts to selectively attend to nodes. These designs render the Transformer's submodule, which handles the graph part, equivalent to a Graph Transformer [141].

Cross-attention is also used to interact representations between graphs and texts. Given the graph hidden state h_G, its node-level hidden state H_v, and the text hidden state H_{d_G}, Text2Mol [122] implements this interaction in the hidden layers of the encoders, while Prot2Text [161] implements it between the layers of the encoder and the decoder:

H_{d_G} = softmax( W_Q H_{d_G} · (W_K H_v)^⊤ / √d_k ) · W_V H_v,

where W_Q, W_K, W_V are trainable parameters that transform the query modality (e.g., sequences) and the key/value modality (e.g., graphs) into the attention space. Furthermore, Prot2Text [161] utilizes two trainable parameter matrices W_1 and W_2 to integrate the graph representation into the sequence representation: H_{d_G} = H_{d_G} + 1_{|d_G|} h_G W_1 W_2.

6.1.3 Discussion

LLM Inputs with Sequence Prior. The first challenge is that the progress in advanced linearization methods has not kept pace with the development of LLMs. Emerging around 2020, linearization methods for molecular graphs like SELFIES offer significant grammatical advantages, yet advanced LMs and LLMs from the graph machine learning and language model communities might not fully utilize them, as these encoded results were not part of pretraining corpora prior to their proposal. Consequently, recent studies [168] indicate that LLMs, such as GPT-3.5/4, may be less adept at using SELFIES compared to SMILES. Therefore, the performance of LM-only and LLM-only methods may be limited by the expressiveness of older linearization methods, as there is no way to optimize these hard-coded rules during the learning pipeline of LLMs. The second challenge is that the inductive bias of graphs may be broken by linearization. Rule-based linearization methods introduce inductive biases for sequence modeling, thereby breaking the permutation-invariance assumption inherent in molecular graphs. Linearization may reduce task difficulty by introducing a sequence order that shrinks the search space, but this does not imply better model generalization. Specifically, there could be multiple string-based representations for a single graph, from a single approach or from different approaches. Numerous studies [152]–[154] have shown that training on different string-based views of the same molecule can improve the sequential model's performance, as these data augmentation approaches manage to retain the permutation-invariance nature of graphs. These advantages are also achievable with a permutation-invariant GNN, potentially simplifying the model by reducing the need for complex, string-based data augmentation design.

LLM Inputs with Graph Prior. Rule-based linearization may be considered less expressive and generalizable compared to the direct graph representation with rich node features, edge features, and the adjacency matrix [187]. Various atomic features include atomic number, chirality, degree, formal charge, number of hydrogen atoms, number of radical electrons, hybridization state, aromaticity, and presence in a ring. Bond features encompass the bond's type (e.g., single, double, or triple), the bond's stereochemistry (e.g., E/Z or cis/trans), and whether the bond is conjugated [188]. Each feature provides specific information about atomic properties and structure, crucial for molecular modeling and cheminformatics. One may directly vectorize the molecular graph structure into binary vectors [186] and then apply parameterized Multilayer Perceptrons (MLPs) on top of these vectors to get the graph representation. These vectorization approaches are based on human-defined rules and vary, such as MACCS, ECFP, and CDK fingerprints [186]. These rules take a molecule as input and output a vector consisting of 0/1 bits. Each bit denotes a specific type of substructure related to functional groups that could be used for various property predictions. Fingerprints consider atoms and structures, but they cannot automatically learn from the graph structure. GNNs could serve as automatic feature extractors to replace or enhance fingerprints. Some specific methods are explored in Section 6.1.2, while other graph priors such as the eigenvectors of a graph Laplacian and the random walk prior could also be used [142].

LLM Outputs for Prediction. LMs like KV-PLM [175], SMILES-BERT [179], MFBERT [176], and Chemformer [156] use a prediction head on the output vector of the last layer. These models are finetuned with standard classification and regression losses but may not fully utilize all the parameters and advantages of the complete architecture. In contrast, models like RT [164], MolXPT [169], and Text+Chem T5 [171] frame prediction as a text generation task. These models are trained with either masked language modeling or autoregressive targets, which requires a meticulous design of the context words in the text [164]. Specifically, domain knowledge instructions may be necessary to activate the in-context learning ability of LLMs, thereby making them domain experts [168]. For example, a possible template could be divided into four parts: {General Description}{Task-Specific Description}{Question-Answer Examples}{Test Question}.

LLM Outputs for Reasoning. Since string representations of molecular graphs usually carry new and in-depth domain knowledge, which is beyond the knowledge of LLMs, recent work [146], [157], [165] also attempts to utilize the reasoning ability of LLMs, instead of using them as a knowledge source for predicting the properties of molecular graphs. ReLM [157] utilizes GNNs to suggest top-k candidates, which are then used to construct multiple-choice answers for in-context learning. ChemCrow [146] designs the LLM as a chemical agent that operates various chemical tools, avoiding direct inference in an expertise-intensive domain.

6.2 LLM as Aligner

6.2.1 Latent Space Alignment

One may directly align the latent spaces of the GNN and the LLM through contrastive learning and predictive regularization. Typically, a graph representation from a GNN can be read out by summarizing all node-level representations, and a sequence representation can be obtained from the [CLS] token. We first use two projection heads, which are usually MLPs, to map the separate representation vectors from the GNN and the LLM into a unified space as h_G and h_{d_G}, and then align them within this space.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13
and MoMu-v2 [173] retrieve two sentences from the corpus dimensions. The scale of GNNs may be a bottleneck in learn-
for each molecular graph. During training, graph data ing semantic meaningful representation and there is a risk of
augmentation was applied to molecular graphs, creating two over-reliance on one modality, neglecting the other. Therefore,
augmented views. Consequently, there are four pairs of G for future large-scale GNN designs comparable to LLMs,
and dG . For each pair, the contrastive loss for space alignment scaling up the dimension size and adding deeper layers, may
exp(cos(hG ,hdG )/τ ) be considered. Besides, Transformer encoders [142] may also
is as ℓMoMu = − log P where τ is the
d˜G ̸=dG exp cos hG ,hd˜
G
/τ improve the expressive power of deep GNNs.
temperature hyper-parameter and d˜G denotes the sequence Generation Decoder with GNNs. GNNs are often not used
not paired to the graph G . MoleculeSTM [172] also applies as decoders for graph generation. The prevalent decoders
contrastive learning to minimize the representation distance are mostly text-based, generating linearized graph structures
between a molecular graph G and its corresponding texts dG , such as SMILES. These methods may be sensitive to the
while maximizing the distance between the molecule and sequence order in the linearized graph. Generative diffusion
unrelated descriptions. MoleculeSTM [172] randomly sam- models [202] on graphs could be utilized in future work to
ples negative graphs or texts to construct negative pairs of design generators with GNNs.
˜ and (G̃, d). Similarly, MolFM [162] and GIT-Mol [158]
(G, d)
implement contrastive loss with mutual information and
7 A PPLICATIONS
negative sampling. These two methods also use cross-entropy 7.1 Datasets, Splitting and Evaluation
to regularize the unified space with the assumption that We summarize the datasets for three scenarios (namely pure
randomly permuted graph and text inputs are predictable if graphs, text-attributed graphs, and text-paired graphs) and
they originate from the same molecule. show them in Table 5, Table 2, and Table 3 respectively.
However, the aforementioned methods cannot leverage task labels. Given a classification label $y$, CLAMP [170] learns to map active molecules ($y = 1$) so that they align with the corresponding assay description for each molecular graph $G$:
$$\ell_{\mathrm{CLAMP}} = y \log\left(\sigma\left(\tau^{-1} h_G^{\top} h_{d_G}\right)\right) + (1 - y) \log\left(1 - \sigma\left(\tau^{-1} h_G^{\top} h_{d_G}\right)\right).$$
CLAMP [170] requires labels to encourage that active molecules and their corresponding text descriptions are clustered together in the latent space.
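This label-supervised variant can be sketched in a few lines; the snippet below assumes projected graph/text embeddings and binary activity labels (names are illustrative), and uses the standard binary cross-entropy, whose minimization corresponds to maximizing the objective above.

```python
import torch
import torch.nn.functional as F

def clamp_style_loss(h_graph, h_text, y, tau: float = 0.1):
    """Label-supervised graph-text alignment via binary cross-entropy on the scaled similarity.

    h_graph, h_text: [B, d] projected embeddings of molecules and assay descriptions.
    y:               [B] binary activity labels (1 = active on the described assay, 0 = inactive).
    """
    score = (h_graph * h_text).sum(dim=-1) / tau   # tau^{-1} * h_G^T h_{d_G}
    # BCE-with-logits is the negative of y*log(sigma(score)) + (1-y)*log(1 - sigma(score)).
    return F.binary_cross_entropy_with_logits(score, y.float())
```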
To advance the alignment between the two modalities, MolCA [167] trains a Query Transformer (Q-Former) [190] for molecule-text projection and contrastive alignment. The Q-Former initializes $N_q$ learnable query tokens $\{q_k\}_{k=1}^{N_q}$. These query tokens are updated with self-attention and interact with the output of GNNs through cross-attention to obtain the $k$-th queried molecular representation vector $(h_G)_k := \text{Q-Former}(q_k)$. The query tokens share the same self-attention modules with the texts but use different MLPs, allowing the Q-Former to also be used for obtaining the representation of the text sequence, $h_{d_G} := \text{Q-Former}([\mathrm{CLS}])$. Then we have $\ell_{\mathrm{MolCA}} = -\ell_{\mathrm{g2t}} - \ell_{\mathrm{t2g}}$, where
$$\ell_{\mathrm{g2t}} = \log \frac{\exp\left(\max_k \cos((h_G)_k, h_{d_G})/\tau\right)}{\sum_{\tilde{d}_G \neq d_G} \exp\left(\max_k \cos((h_G)_k, h_{\tilde{d}_G})/\tau\right)}, \qquad
\ell_{\mathrm{t2g}} = \log \frac{\exp\left(\max_k \cos(h_{d_G}, (h_G)_k)/\tau\right)}{\sum_{\tilde{G} \neq G} \exp\left(\max_k \cos(h_{d_G}, (h_{\tilde{G}})_k)/\tau\right)}.$$
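The max-over-queries scoring used in these losses can be sketched as follows, assuming the queried representation vectors have already been computed (i.e., this skips the Q-Former itself and only illustrates how graph-text logits are formed; shapes and names are illustrative).

```python
import torch
import torch.nn.functional as F

def max_query_similarity(query_tokens, h_text, tau: float = 0.1):
    """Graph-text logits where each graph is represented by Nq queried vectors.

    query_tokens: [B, Nq, d] outputs (h_G)_k of the cross-attention module, L2-normalized.
    h_text:       [B, d]     text representations h_{d_G}, L2-normalized.
    Returns a [B, B] matrix whose (i, j) entry is max_k cos((h_{G_i})_k, h_{d_{G_j}}) / tau.
    """
    sim = torch.einsum('bkd,cd->bck', query_tokens, h_text)  # [B, B, Nq] pairwise cosines
    return sim.max(dim=-1).values / tau

# The g2t / t2g contrastive terms then apply cross-entropy over these logits, e.g.,
# F.cross_entropy(logits, torch.arange(logits.size(0))) for the graph-to-text direction.
```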
6.2.2 Discussion
Larger-Scale GNNs. GNNs integrate atomic and graph structural features for molecular representation learning [145]. Specifically, Text2Mol [122] utilizes the GCN [84] as its graph encoder and extracts unique identifiers for node features based on Morgan fingerprints [186]. MoMu [174], MoMu-v2 [173], MolFM [162], GIT-Mol [158], and MolCA [167] prefer GIN [189] as the backbone, as GIN has been proven to be as expressive and powerful as the Weisfeiler-Lehman graph isomorphism test. As described in Section 2.2, there has been notable progress in making GNNs deeper, more generalizable, and more powerful since the proposal of the GCN [84] in 2016 and the GIN [189] in 2018. However, most reviewed works [158], [162], [167], [173], [174] are developed using the GIN [189] as a proof of concept for their approaches. These pretrained GINs feature five layers and 300 hidden dimensions. The scale of GNNs may be a bottleneck in learning semantically meaningful representations, and there is a risk of over-reliance on one modality, neglecting the other. Therefore, for future large-scale GNN designs comparable to LLMs, scaling up the dimension size and adding deeper layers may be considered. Besides, Transformer encoders [142] may also improve the expressive power of deep GNNs.
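For reference, a GIN encoder in this spirit (five message-passing layers, 300-dimensional hidden states, and a sum readout) can be sketched with PyTorch Geometric as below; the actual pretrained encoders in the cited works may differ in details such as edge-feature handling, batch normalization, and dropout.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, global_add_pool

class GINEncoder(nn.Module):
    """A GIN molecular-graph encoder: 5 message-passing layers, 300-dim hidden states."""
    def __init__(self, in_dim: int, hidden_dim: int = 300, num_layers: int = 5):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            mlp = nn.Sequential(
                nn.Linear(in_dim if i == 0 else hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )
            self.layers.append(GINConv(mlp))

    def forward(self, x, edge_index, batch):
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))   # node-level message passing
        return global_add_pool(x, batch)          # graph-level readout h_G

# encoder = GINEncoder(in_dim=...)  # in_dim = dimension of the initial atom feature vectors
```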
Generation Decoder with GNNs. GNNs are often not used as decoders for graph generation. The prevalent decoders are mostly text-based, generating linearized graph structures such as SMILES. These methods may be sensitive to the sequence order in the linearized graph. Generative diffusion models [202] on graphs could be utilized in future work to design generators with GNNs.

7 APPLICATIONS
7.1 Datasets, Splitting and Evaluation
We summarize the datasets for the three scenarios (namely pure graphs, text-attributed graphs, and text-paired graphs) and show them in Table 5, Table 2, and Table 3, respectively.
7.1.1 Pure Graphs
In Table 5, we summarize the pure graph reasoning problems discussed in Section 4. Many problems are shared or revisited in different datasets due to their commonality. NLGraph [124], LLMtoGraph [125] and GUC [126] study a set of standard graph reasoning problems, including connectivity, shortest path, and graph diameter. GraphQA [131] benchmarks a similar set of problems but additionally describes the graphs in real-world scenarios to study the effect of graph grounding. LLM4DyG [128] focuses on reasoning tasks on temporally evolving graphs. Accuracy is the most common evaluation metric, since these problems are primarily formulated as graph question-answering tasks.
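To illustrate how such benchmarks pose graph reasoning to an LLM, a minimal helper that verbalizes an edge list into a connectivity question might look as follows; the template is a generic sketch, not the exact prompt format of any benchmark above.

```python
def connectivity_prompt(num_nodes: int, edges: list[tuple[int, int]], src: int, dst: int) -> str:
    """Turn an undirected edge list into a natural-language connectivity question."""
    edge_text = ", ".join(f"({u}, {v})" for u, v in edges)
    return (
        f"You are given an undirected graph with {num_nodes} nodes, numbered 0 to {num_nodes - 1}.\n"
        f"The edges are: {edge_text}.\n"
        f"Question: is there a path between node {src} and node {dst}? Answer yes or no."
    )

print(connectivity_prompt(4, [(0, 1), (1, 2)], 0, 3))  # expected answer: no
```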
7.1.2 Text-Attributed Graphs
We summarize well-known datasets for evaluating models on text-attributed graphs in Table 2. The datasets are mostly from the academic, e-commerce, book, social media, and Wikipedia domains. The popular tasks used to evaluate models on these datasets include node classification, link prediction, edge classification, regression, and recommendation. The evaluation metrics for node/edge classification include Accuracy, Macro-F1, and Micro-F1. For link prediction and recommendation evaluation, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Hit Ratio (Hit) usually serve as metrics. When evaluating model performance on regression tasks, people tend to adopt mean absolute error (MAE) or root mean squared error (RMSE).
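A minimal sketch of these metrics is given below, with Macro-F1 computed via scikit-learn and MRR/Hits@k computed from the rank of the true candidate among scored alternatives; in practice, official evaluators (e.g., the OGB evaluators) should be preferred.

```python
import numpy as np
from sklearn.metrics import f1_score

def classification_metrics(y_true, y_pred):
    """Accuracy / Macro-F1 / Micro-F1 for node or edge classification."""
    return {
        "accuracy": float(np.mean(np.asarray(y_true) == np.asarray(y_pred))),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
    }

def ranking_metrics(scores, target_idx, k: int = 10):
    """MRR and Hits@k for link prediction / recommendation.

    scores:     [num_queries, num_candidates] array of predicted scores.
    target_idx: [num_queries] index of the true candidate for each query.
    """
    order = np.argsort(-np.asarray(scores), axis=1)                        # best candidate first
    ranks = np.argmax(order == np.asarray(target_idx)[:, None], axis=1) + 1
    return {"mrr": float(np.mean(1.0 / ranks)), f"hits@{k}": float(np.mean(ranks <= k))}
```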
7.1.3 Text-Paired Graphs
Table 3 shows text-paired graph datasets (including text-available and graph-only datasets). For data splitting, options include random splitting, source-based splitting, activity cliffs and scaffolds [196], and data balancing [143]. Graph classification usually adopts AUC [188] as the metric, while regression uses MAE, RMSE, and R² [145]. For text generation evaluation, people tend to use the Bilingual Evaluation Understudy (BLEU) score; for molecule generation evaluation, heuristic evaluation methods (based on factors including validity, novelty, and uniqueness) are adopted. However, it is worth noting that the BLEU score is efficient but less accurate, while heuristic evaluation methods are problematic in that they are subject to unintended modes, such as the superfluous addition of carbon atoms in [197].
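The heuristic molecule generation metrics can be sketched with RDKit as follows: validity via SMILES parsing, uniqueness via canonical SMILES, and novelty against a training set. Exact definitions vary across papers, so this is one common formulation rather than a standard implementation, and it assumes the training SMILES are themselves valid.

```python
from rdkit import Chem

def generation_metrics(generated_smiles, training_smiles):
    """Validity, uniqueness, and novelty of generated molecules (one common formulation)."""
    # Validity: fraction of generated strings that parse into a molecule.
    valid = [Chem.MolToSmiles(m)
             for m in (Chem.MolFromSmiles(s) for s in generated_smiles)
             if m is not None]
    validity = len(valid) / max(len(generated_smiles), 1)
    # Uniqueness: fraction of valid molecules with distinct canonical SMILES.
    unique = set(valid)
    uniqueness = len(unique) / max(len(valid), 1)
    # Novelty: fraction of unique molecules not seen in training (training SMILES assumed valid).
    train_canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    novelty = len(unique - train_canonical) / max(len(unique), 1)
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}
```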
TABLE 2
Data collection in Section 5 for text-attributed graphs. Task: “NC”, “UAP”, “LP”, “Rec”, “EC”, “RG” denote node classification, user activity prediction, link prediction, recommendation, edge classification, and regression task.

Data | Year | Task | # Nodes | # Edges | Domain | Source & Notes
Node-level text:
ogb-arxiv | 2020.5 | NC | 169,343 | 1,166,243 | Academic | OGB [188]
ogb-products | 2020.5 | NC | 2,449,029 | 61,859,140 | E-commerce | OGB [188]
ogb-papers110M | 2020.5 | NC | 111,059,956 | 1,615,685,872 | Academic | OGB [188]
ogb-citation2 | 2020.5 | LP | 2,927,963 | 30,561,187 | Academic | OGB [188]
Cora | 2000 | NC | 2,708 | 5,429 | Academic | [10]
Citeseer | 1998 | NC | 3,312 | 4,732 | Academic | [11]
DBLP | 2023.1 | NC, LP | 5,259,858 | 36,630,661 | Academic | www.aminer.org/citation
MAG | 2020 | NC, LP, Rec, RG | ~10M | ~50M | Academic | multiple domains [12], [13]
Goodreads-books | 2018 | NC, LP | ~2M | ~20M | Books | multiple domains [14]
Amazon-items | 2018 | NC, LP, Rec | ~15.5M | ~100M | E-commerce | multiple domains [15]
SciDocs | 2020 | NC, UAP, LP, Rec | - | - | Academic | [51]
PubMed | 2020 | NC | 19,717 | 44,338 | Academic | [16]
Wikidata5M | 2021 | LP | ~4M | ~20M | Wikipedia | [17]
Twitter | 2023 | NC, LP | 176,279 | 2,373,956 | Social | [53]
Edge-level text:
Goodreads-reviews | 2018 | EC, LP | ~3M | ~100M | Books | multiple domains [14]
TABLE 3
Data collection in Section 6 for text-captioned graphs. “PT”, “FT”, “Cap.”, “GC”, “Retr.”, and “Gen.” refer to pretraining, finetuning, caption, graph classification, retrieval, and graph generation, respectively. The superscript on the size denotes # graph-text pairs (^1), # graphs (^2), # assays (^3).

Data | Date | Task | Size | Source & Notes
ChEMBL-2023 [185] | 2023 | Various | 2.4M^2, 20.3M^3 | Drug-like
PubChem [183] | 2019 | Various | 96M^2, 237M^3 | Biomedical
PC324K [167] | 2023 | PT, Cap. | 324K^1 | PubChem [183]
MolXPT-PT [169] | 2023 | PT | 30M^2 | PubChem [183], PubMed, ChEBI [182]
ChE-bio [47] | 2023 | PT | 365K^2 | ChEMBL [184]
ChE-phy [47] | 2023 | PT | 365K^2 | ChEMBL [184]
ChE ZS [47] | 2023 | GC | 91K^2 | ChEMBL [184]
PC223M [170] | 2023 | PT, Retr. | 223M^1, 2M^2, 20K^3 | PubChem [183]
PCSTM [172] | 2022 | PT | 281K^1 | PubChem [183]
PCdes [183] | 2022 | FT, Cap., Retr. | 15K^1 | PubChem [183]
ChEBI-20 [122] | 2021 | FT, Retr., Gen., Cap. | 33K^1 | PubChem [183], ChEBI [182]
7.2 Open-source Implementations
HuggingFace. HF Transformers^1 is the most popular Python library for Transformer-based language models. Besides, it also provides two additional packages: Datasets^2 for easily accessing and sharing datasets and Evaluate^3 for easily evaluating machine learning models and datasets.
Fairseq. Fairseq^4 is another open-source Python library for Transformer-based language models.
PyTorch Geometric. PyG^5 is an open-source Python library for graph machine learning. It packages more than 60 types of GNN, aggregation, and pooling layers.
Deep Graph Library. DGL^6 is another open-source Python library for graph machine learning.
RDKit. RDKit^7 is one of the most popular open-source cheminformatics software packages; it facilitates various operations on and visualizations of molecular graphs. It offers many useful APIs, such as the linearization implementation for molecular graphs, which converts them into easily stored SMILES strings and converts these SMILES back into graphs.
1. https://huggingface.co/docs/transformers/index
2. https://huggingface.co/docs/datasets/index
3. https://huggingface.co/docs/evaluate/index
4. https://github.com/facebookresearch/fairseq
5. https://pytorch-geometric.readthedocs.io/en/latest/index.html
6. https://www.dgl.ai/
7. https://www.rdkit.org/docs/
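For example, a round trip between a SMILES string and an explicit graph view (atoms as nodes, bonds as edges) can be performed with RDKit roughly as follows:

```python
from rdkit import Chem

smiles = "CCO"  # ethanol
mol = Chem.MolFromSmiles(smiles)  # parse the linearized string into a molecular graph

# Read the graph out as node and edge lists.
atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType())) for b in mol.GetBonds()]
print(atoms)   # ['C', 'C', 'O']
print(bonds)   # [(0, 1, 'SINGLE'), (1, 2, 'SINGLE')]

# Convert the graph back into a canonical SMILES string for storage.
print(Chem.MolToSmiles(mol))  # 'CCO'
```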
7.3 Practical Applications
7.3.1 Scientific Discovery
Virtual Screening. Virtual screening aims to search a library of unlabeled molecules to identify useful structures for a given task. Machine learning models could automatically screen out trivial candidates to accelerate this process. However, training accurate models is not easy since labeled molecules are limited in size and imbalanced in distribution [143]. There are many efforts to improve GNNs against data sparsity [143], [145], [192]. However, it is difficult for a model to generalize and understand in-depth domain knowledge that it has never been trained on. Texts could be complementary knowledge sources. Discovering task-related content from massive scientific papers and using it as instructions has great potential for designing accurate GNNs for virtual screening [47].
Molecular Generation. Molecular generation and optimization is one fundamental goal for drug and material discovery. Scientific hypotheses about molecules [199] can be represented in the joint space of GNNs and LLMs. Then, one may search in the latent space for a better hypothesis that aligns with the text description (human requirements) and adheres to structural constraints like chemical validity. Chemical space has been estimated to contain more than 10^60 molecules [198], which is beyond the capacity of exploration in wet-lab experiments. Generating constrained candidates within relevant subspaces is a challenging and promising direction [202], especially when incorporating textual conditions.
Synthesis Planning. Synthesis designs start from available
molecules and involve planning a sequence of steps that can finally produce a desired chemical compound through a series of reactions [199]. This procedure includes a sequence of reactant molecules and reaction conditions. Both graphs and texts play important roles in this process. For example, graphs may represent the fundamental structure of molecules, while texts may describe the reaction conditions, additives, and solvents. LLMs can assist in the planning by suggesting possible synthesis paths directly or by serving as agents to operate on existing planning tools [146].

7.3.2 Computational Social Science
In computational social science, researchers are interested in modeling the behavior of people/users and discovering new knowledge that can be utilized to forecast the future. The behaviors of users and the interactions between users can be modeled as graphs, where the nodes are associated with rich text information (e.g., user profiles, messages, emails). We will show two example scenarios below.
E-commerce. On e-commerce platforms, there are many interactions (e.g., purchase, view) between users and products. For example, users can view or purchase products. In addition, the users, products, and their interactions are associated with rich text information. For instance, products have titles/descriptions and users can leave reviews of products. In this case, we can construct a graph [102] where nodes are users and products, while edges are their interactions. Both nodes and edges are associated with text. It is important to utilize both the text information and the graph structure information (user behavior) to model users and items and solve complex downstream tasks (e.g., item recommendation [106], bundle recommendation [107], and product understanding [108]).
Social Media. On social media platforms, there are many users and they interact with each other through messages, emails, and so on. In this case, we can build a graph where nodes are users and edges are the interactions between users. There will be text associated with nodes (e.g., user profiles) and edges (e.g., messages). Interesting research questions will be how to do joint text and graph structure modeling to deeply understand the users for friend recommendation [109], user analysis [110], community detection [111], and personalized response generation [97], [98].

7.3.3 Specific Domains
In many specific domains, text data are interconnected and lie in the format of graphs. The structure information on the graphs can be utilized to better understand the text units and contribute to advanced problem-solving.
Academic Domain. In the academic domain, graphs [12] are constructed with papers as nodes and their relations (e.g., citation, authorship) as edges. The representations learned for papers on such graphs can be utilized for paper recommendation [103], paper classification [104], and author identification [105].
Legal Domain. In the legal domain, opinions given by the judges always contain references to opinions given for previous cases. In such scenarios, people can construct a graph [99] based on the citation relations between opinions. The representations learned on such a graph with both text and structure information can be utilized for clause classification [100] and opinion recommendation [101].
Education Domain. In the education domain, we can construct a graph with coursework as nodes and their relations as edges. The model learned on such a graph can be utilized for knowledge tracing [136] and student performance prediction [137].

8 FUTURE DIRECTIONS
Better Benchmark Datasets. Most pure graph benchmarks evaluate LLMs' reasoning ability on homogeneous graphs but do not include evaluations on heterogeneous or spatial-temporal graphs. For text-attributed graphs, as summarized in Table 2, most benchmark datasets are from the academic and e-commerce domains. However, in the real world, text-attributed graphs are ubiquitous across multiple domains (e.g., legal and health). More diverse datasets are needed to comprehensively evaluate LLMs on real-world scenarios. For text-paired graphs, as summarized in Table 3, there is a lack of comprehensive datasets covering various machine learning tasks in chemistry. Although a massive number of scientific papers are available, preprocessing them into a ready-to-use format and pairing them with specific molecular graph data points of interest remains a cumbersome and challenging task. Besides, we could investigate graph-text pairs in 3D space, where each molecule may be associated with atomic coordinates [138].
Broader Task Space with LLMs. More comprehensive studies on the performance of LLMs for graph tasks hold promise for the future. While LLM-as-encoder approaches have been explored for text-attributed graphs, their application to text-captioned molecular graphs remains underexplored. Promising directions include using LLMs for data augmentation and knowledge distillation to design domain-specific GNNs for various text-paired graph tasks. Furthermore, although graph generation has been approached for text-paired graphs, it remains an open problem for text-attributed graphs (i.e., how to conduct joint text and graph structure generation).
Multi-Modal Foundation Models. One open question is, “Should we use one foundation model to unify different modalities, and how?” The modalities can include texts, graphs, and even images. For instance, molecules can be represented as graphs, described as texts, and photographed as images; products can be treated as nodes in a graph, associated with a title/description, and combined with an image. Designing a model that can conduct joint encoding for all modalities will be useful but challenging. Furthermore, there has always been tension between building a unified foundational model and customizing model architectures for different domains. It is thus intriguing to ask whether a unified architecture will suit different data types, or if tailoring model designs according to domains will be necessary. Correctly answering this question can save economic and intellectual resources from unnecessary attempts and also shed light on a deeper understanding of graph-related tasks.
Efficient LLMs on Graphs. While LLMs have shown a strong capability to learn on graphs, they suffer from inefficiency in graph linearization and model optimization. On the one hand, as discussed in Sections 5.1.1 and 6.1.1, many methods rely on transferring graphs into sequences that can be inputted into LLMs. However, the length of the transferred sequence will increase significantly as the size of the graph increases. This poses challenges since LLMs always have a
maximum sequence input length, and a long input sequence will lead to higher time and memory complexity. On the other hand, optimizing LLMs themselves is computationally expensive. Although some general efficient tuning methods such as LoRA have been proposed, there is a lack of discussion on graph-aware LLM efficient tuning methods.
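As a point of reference for the general-purpose (not graph-aware) side of this problem, LoRA-style tuning can be configured with the HuggingFace PEFT library roughly as shown below; the backbone name and the target module names are illustrative and depend on the chosen architecture (e.g., q_proj/v_proj for LLaMA-style models, c_attn for GPT-2).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# A small placeholder backbone; any causal LM could be substituted.
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # low-rank update dimension
    lora_alpha=16,              # scaling factor for the low-rank update
    target_modules=["c_attn"],  # attention projection(s) to adapt; architecture-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

How to make such adapters aware of graph structure (e.g., conditioning the low-rank updates on neighborhood information) remains an open question.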
Generalizable and Robust LLMs on Graphs. Another interesting direction is to explore the generalizability and robustness of LLMs on graphs. Generalizability refers to having the ability to transfer knowledge learned from one domain graph to another, while robustness denotes producing consistent predictions under obfuscations and attacks. Although LLMs have demonstrated their strong generalizability in processing text, they still suffer from robustness and hallucination issues, which are to be solved for graph data modeling as well.
LLM as Dynamic Agents on Graphs. Although LLMs have shown advanced capability in generating text, one-pass generation by LLMs suffers from hallucination and misinformation issues due to the lack of accurate parametric knowledge. Simply augmenting retrieved knowledge in context is also bottlenecked by the capacity of the retriever. In many real-world scenarios, graphs such as academic networks and Wikipedia are dynamically looked up by humans for knowledge-guided reasoning. Simulating such a role of dynamic agents can help LLMs more accurately retrieve relevant information via multi-hop reasoning, thereby correcting their answers and alleviating hallucinations.

9 CONCLUSION
In this paper, we provide a comprehensive review of large language models on graphs. We first categorize graph scenarios where LMs can be adopted and summarize the techniques for large language models on graphs. We then provide a thorough review, analysis, and comparison of methods within each scenario. Furthermore, we summarize available datasets, open-source codebases, and multiple applications. Finally, we suggest future directions for large language models on graphs.

ACKNOWLEDGMENTS
This work was supported in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and INCAS Program No. HR001121C0165, National Science Foundation IIS-19-56151, and the Molecule Maker Lab Institute: An AI Research Institutes program supported by NSF under Award No. 2019897, and the Institute for Geospatial Understanding through an Integrative Discovery Environment (I-GUIDE) by NSF under Award No. 2118329. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily represent the views, either expressed or implied, of DARPA or the U.S. Government.

REFERENCES
[1] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li, M. and Lin, J., “End-to-end open-domain question answering with BERTserini,” in NAACL, 2019.
[2] Liu, Y. and Lapata, M., “Text Summarization with Pretrained Encoders,” in EMNLP, 2019.
[3] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S.R., “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” in ICLR, 2018.
[4] Reimers, N. and Gurevych, I., “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in EMNLP, 2019.
[5] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., “Emergent Abilities of Large Language Models,” in TMLR, 2022.
[6] Nagamochi, H. and Ibaraki, T., “Algorithmic aspects of graph connectivity,” in Cambridge University Press, 2018.
[7] Goldberg, A.V. and Harrelson, C., “Computing the shortest path: A* search meets graph theory,” in SODA (Vol. 5, pp. 156-165), 2005.
[8] Sun, Z., Wang, H., Wang, H., Shao, B. and Li, J., “Efficient subgraph matching on billion node graphs,” in arXiv preprint arXiv:1205.6691, 2012.
[9] Chen, Z., Mao, H., Li, H., Jin, W., Wen, H., Wei, X., ... & Tang, J., “Exploring the potential of large language models (LLMs) in learning on graphs,” in arXiv preprint arXiv:2307.03393, 2023.
[10] McCallum, A.K., Nigam, K., Rennie, J. and Seymore, K., “Automating the construction of internet portals with machine learning,” in Information Retrieval, 3, pp.127-163, 2000.
[11] Giles, C.L., Bollacker, K.D. and Lawrence, S., “CiteSeer: An automatic citation indexing system,” in Proceedings of the third ACM conference on Digital libraries (pp. 89-98), 1998.
[12] Wang, K., Shen, Z., Huang, C., Wu, C.H., Dong, Y. and Kanakia, A., “Microsoft academic graph: When experts are not enough,” in Quantitative Science Studies, 1(1), pp.396-413, 2020.
[13] Zhang, Y., Jin, B., Zhu, Q., Meng, Y. and Han, J., “The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study,” in WWW, 2023.
[14] Wan, M. and McAuley, J., “Item recommendation on monotonic behavior chains,” in Proceedings of the 12th ACM conference on recommender systems, 2018.
[15] Ni, J., Li, J. and McAuley, J., “Justifying recommendations using distantly-labeled reviews and fine-grained aspects,” in EMNLP-IJCNLP, 2019.
[16] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B. and Eliassi-Rad, T., “Collective classification in network data,” in AI magazine, 29(3), pp.93-93, 2008.
[17] Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J. and Tang, J., “KEPLER: A unified model for knowledge embedding and pre-trained language representation,” in TACL, 2021.
[18] Liu, L., Du, B., Ji, H., Zhai, C. and Tong, H., “Neural-answering logical queries on knowledge graphs,” in KDD, 2021.
[19] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Philip, S. Y., “A comprehensive survey on graph neural networks,” in IEEE Transactions on Neural Networks and Learning Systems, 32(1), 4-24, 2020.
[20] Liu, J., Yang, C., Lu, Z., Chen, J., Li, Y., Zhang, M., Bai, T., Fang, Y., Sun, L., Yu, P.S. and Shi, C., “Towards Graph Foundation Models: A Survey and Beyond,” in arXiv preprint arXiv:2310.11829, 2023.
[21] Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J. and Wu, X., “Unifying Large Language Models and Knowledge Graphs: A Roadmap,” in arXiv preprint arXiv:2306.08302, 2023.
[22] Wang, Y., Le, H., Gotmare, A.D., Bui, N.D., Li, J. and Hoi, S.C., “CodeT5+: Open code large language models for code understanding and generation,” in arXiv preprint arXiv:2305.07922, 2023.
[23] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
[24] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., “RoBERTa: A robustly optimized BERT pretraining approach,” in arXiv preprint arXiv:1907.11692, 2019.
[25] Beltagy, I., Lo, K. and Cohan, A., “SciBERT: A pretrained language model for scientific text,” in arXiv preprint arXiv:1903.10676, 2019.
[26] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., “Language models are few-shot learners,” in NeurIPS, 2020.
[27] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R. and Le, Q.V., “XLNet: Generalized autoregressive pretraining for language understanding,” in NeurIPS, 2019.
[28] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. and Zettlemoyer, L., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in ACL, 2020.
[29] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., “Exploring the limits of transfer learning with a unified text-to-text transformer,” in JMLR, 2020.
[30] Yasunaga, M., Leskovec, J. and Liang, P., “LinkBERT: Pretraining Language Models with Document Links,” in ACL, 2022.
[31] Jin, B., Zhang, W., Zhang, Y., Meng, Y., Zhang, X., Zhu, Q. and Han, [55] Li, Y., Ding, K. and Lee, K., “GRENADE: Graph-Centric Lan-
J., “Patton: Language Model Pretraining on Text-Rich Networks,” in guage Model for Self-Supervised Representation Learning on Text-
ACL, 2023. Attributed Graphs,” in EMNLP., 2023.
[32] Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J. [56] Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J.
and El-Kishky, A., “TwHIN-BERT: a socially-enriched pre-trained and El-Kishky, A., “TwHIN-BERT: A Socially-Enriched Pre-trained
language model for multilingual Tweet representations,” in KDD, Language Model for Multilingual Tweet Representations at Twitter,”
2023. in KDD., 2023.
[33] Zou, T., Yu, L., Huang, Y., Sun, L. and Du, B., “Pretraining Language [57] Zhang, X., Zhang, C., Dong, X.L., Shang, J. and Han, J., “Minimally-
Models with Text-Attributed Heterogeneous Graphs,” in arXiv supervised structure-rich text categorization via learning on text-rich
preprint arXiv:2310.12580, 2023. networks,” in WWW., 2021.
[34] Song, K., Tan, X., Qin, T., Lu, J. and Liu, T.Y., “Mpnet: Masked and [58] Chien, E., Chang, W.C., Hsieh, C.J., Yu, H.F., Zhang, J., Milenkovic,
permuted pre-training for language understanding,” in NeurIPs., O., and Dhillon, I.S., “Node feature extraction by self-supervised
2020. multi-scale neighborhood prediction,” in ICLR., 2022.
[35] Duan, K., Liu, Q., Chua, T.S., Yan, S., Ooi, W.T., Xie, Q. and He, J., [59] Zhang, Y., Shen, Z., Wu, C.H., Xie, B., Hao, J., Wang, Y.Y., Wang, K.
“Simteg: A frustratingly simple approach improves textual graph and Han, J., “Metadata-induced contrastive learning for zero-shot
learning,” in arXiv preprint arXiv:2308.02565., 2023. multi-label text classification,” in WWW., 2022.
[36] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., [60] Dinh, T.A., Boef, J.D., Cornelisse, J. and Groth, P., “E2EG: End-
Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E. and to-End Node Classification Using Graph Topology and Text-based
Krusche, S., “ChatGPT for good? On opportunities and challenges Node Attributes,” in arXiv preprint arXiv:2208.04609., 2022.
of large language models for education,” in Learning and individual [61] Tan, Y., Zhou, Z., Lv, H., Liu, W. and Yang, C., “Walklm: A
differences, 103., 2023. uniform language model fine-tuning framework for attributed graph
[37] Lester, B., Al-Rfou, R. and Constant, N., “The power of scale for embedding,” in NeurIPs., 2023.
parameter-efficient prompt tuning,” in EMNLP, 2021. [62] Zhao, J., Qu, M., Li, C., Yan, H., Liu, Q., Li, R., Xie, X. and Tang,
[38] Li, X.L. and Liang, P., “Prefix-tuning: Optimizing continuous J., “Learning on large-scale text-attributed graphs via variational
prompts for generation,” in ACL, 2021. inference,” in ICLR., 2023.
[39] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Larous- [63] Wen, Z. and Fang, Y., “Augmenting Low-Resource Text Classifica-
silhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., “Parameter- tion with Graph-Grounded Pre-training and Prompting,” in SIGIR.,
efficient transfer learning for NLP,” in ICML, 2019. 2023.
[40] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, [64] Chen, Z., Mao, H., Wen, H., Han, H., Jin, W., Zhang, H., Liu, H.
L. and Chen, W., “Lora: Low-rank adaptation of large language and Tang, J., “Label-free Node Classification on Graphs with Large
models,” in ICLR, 2022. Language Models (LLMS),” in arXiv preprint arXiv:2310.04668., 2023.
[65] Zhao, J., Zhuo, L., Shen, Y., Qu, M., Liu, K., Bronstein, M., Zhu, Z.
[41] Tian, Y., Song, H., Wang, Z., Wang, H., Hu, Z., Wang, F., Chawla,
and Tang, J., “Graphtext: Graph reasoning in text space,” in arXiv
N.V. and Xu, P., “Graph Neural Prompting with Large Language
preprint arXiv:2310.01089., 2023.
Models,” in arXiv preprint arXiv:2309.15427., 2023.
[66] Meng, Y., Zong, S., Li, X., Sun, X., Zhang, T., Wu, F. and Li, J.,
[42] Chai, Z., Zhang, T., Wu, L., Han, K., Hu, X., Huang, X. and Yang, Y.,
“Gnn-lm: Language modeling based on global contexts via gnn,” in
“GraphLLM: Boosting Graph Reasoning Ability of Large Language
ICLR., 2022.
Model,” in arXiv preprint arXiv:2310.05845., 2023.
[67] Zhang, X., Bosselut, A., Yasunaga, M., Ren, H., Liang, P., Manning,
[43] Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N.,
C.D. and Leskovec, J., “Greaselm: Graph reasoning enhanced
Dai, A.M. and Le, Q.V., “Finetuned language models are zero-shot
language models for question answering,” in ICLR., 2022.
learners,” in ICLR., 2022.
[68] Ioannidis, V.N., Song, X., Zheng, D., Zhang, H., Ma, J., Xu, Y., Zeng,
[44] Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, B., Chilimbi, T. and Karypis, G., “Efficient and effective training of
Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A. and Dey, M., language and graph neural network models,” in AAAI, 2023.
“Multitask prompted training enables zero-shot task generalization,”
[69] Mavromatis, C., Ioannidis, V.N., Wang, S., Zheng, D., Adeshina, S.,
in ICLR., 2022.
Ma, J., Zhao, H., Faloutsos, C. and Karypis, G., “Train Your Own
[45] Tang, J., Yang, Y., Wei, W., Shi, L., Su, L., Cheng, S., Yin, D. GNN Teacher: Graph-Aware Distillation on Textual Graphs,” in
and Huang, C., “GraphGPT: Graph Instruction Tuning for Large PKDD, 2023.
Language Models,” in arXiv preprint arXiv:2310.13023., 2023. [70] He, X., Bresson, X., Laurent, T. and Hooi, B., “Explanations as
[46] Ye, R., Zhang, C., Wang, R., Xu, S. and Zhang, Y., “Natural language Features: LLM-Based Features for Text-Attributed Graphs,” in arXiv
is all a graph needs,” in arXiv preprint arXiv:2308.07134., 2023. preprint arXiv:2305.19523., 2023.
[47] Zhao, H., Liu, S., Ma, C., Xu, H., Fu, J., Deng, Z.H., Kong, L. and Liu, [71] Yu, J., Ren, Y., Gong, C., Tan, J., Li, X. and Zhang, X., “Empower Text-
Q., “GIMLET: A Unified Graph-Text Model for Instruction-Based Attributed Graphs Learning with Large Language Models (LLMs),”
Molecule Zero-Shot Learning,” in bioRxiv, pp.2023-05., 2023. in arXiv preprint arXiv:2310.09872., 2023.
[48] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, [72] Yang, J., Liu, Z., Xiao, S., Li, C., Lian, D., Agrawal, S., Singh, A.,
Q.V. and Zhou, D., “Chain-of-thought prompting elicits reasoning Sun, G. and Xie, X., “GraphFormers: GNN-nested transformers for
in large language models,” in NeurIPs., 2022. representation learning on textual graph,” in NeurIPs., 2021.
[49] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y. and [73] Jin, B., Zhang, Y., Zhu, Q. and Han, J., “Heterformer: Transformer-
Narasimhan, K., “Tree of thoughts: Deliberate problem solving with based deep node representation learning on heterogeneous text-rich
large language models,” in arXiv preprint arXiv:2305.10601., 2023. networks,” in KDD., 2023.
[50] Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., [74] Jin, B., Zhang, Y., Meng, Y. and Han, J., “Edgeformers: Graph-
Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk, Empowered Transformers for Representation Learning on Textual-
P. and Hoefler, T., “Graph of thoughts: Solving elaborate problems Edge Networks,” in ICLR., 2023.
with large language models,” in arXiv preprint arXiv:2308.09687., [75] Jin, B., Zhang, W., Zhang, Y., Meng, Y., Zhao, H. and Han, J.,
2023. “Learning Multiplex Embeddings on Text-rich Networks with One
[51] Cohan, A., Feldman, S., Beltagy, I., Downey, D. and Weld, D.S., Text Encoder,” in arXiv preprint arXiv:2310.06684., 2023.
“Specter: Document-level representation learning using citation- [76] Qin, Y., Wang, X., Zhang, Z. and Zhu, W., “Disentangled Represen-
informed transformers,” in ACL., 2020. tation Learning with Large Language Models for Text-Attributed
[52] Ostendorff, M., Rethmeier, N., Augenstein, I., Gipp, B. and Rehm, Graphs,” in arXiv preprint arXiv:2310.18152., 2023.
G., “Neighborhood contrastive learning for scientific document [77] Zhu, J., Cui, Y., Liu, Y., Sun, H., Li, X., Pelger, M., Yang, T., Zhang,
representations with citation embeddings,” in EMNLP., 2022. L., Zhang, R. and Zhao, H., “Textgnn: Improving text encoder via
[53] Brannon, W., Fulay, S., Jiang, H., Kang, W., Roy, B., Kabbara, J. and graph neural network in sponsored search,” in WWW., 2021.
Roy, D., “ConGraT: Self-Supervised Contrastive Pretraining for Joint [78] Li, C., Pang, B., Liu, Y., Sun, H., Liu, Z., Xie, X., Yang, T., Cui,
Graph and Text Embeddings,” in arXiv preprint arXiv:2305.14321., Y., Zhang, L. and Zhang, Q., “Adsgnn: Behavior-graph augmented
2023. relevance modeling in sponsored search,” in SIGIR., 2021.
[54] Zhu, J., Song, X., Ioannidis, V.N., Koutra, D. and Faloutsos, C., [79] Zhang, J., Chang, W.C., Yu, H.F. and Dhillon, I., “Fast multi-
“TouchUp-G: Improving Feature Representation through Graph- resolution transformer fine-tuning for extreme multi-label text
Centric Finetuning,” in arXiv preprint arXiv:2309.13885., 2023. classification,” in NeurIPs., 2021.
[80] Xie, H., Zheng, D., Ma, J., Zhang, H., Ioannidis, V.N., Song, X., Ping, [109] Chen, L., Xie, Y., Zheng, Z., Zheng, H. and Xie, J., “Friend recom-
Q., Wang, S., Yang, C., Xu, Y. and Zeng, B., “Graph-Aware Language mendation based on multi-social graph convolutional network,” in
Model Pre-Training on a Large Graph Corpus Can Help Multiple IEEE Access, 8, pp.43618-43629, 2020.
Graph Applications,” in KDD., 2023. [110] Wang, G., Zhang, X., Tang, S., Zheng, H. and Zhao, B.Y., “Unsu-
[81] Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning, pervised clickstream clustering for user behavior analysis,” in CHI,
C.D., Liang, P.S. and Leskovec, J., “Deep bidirectional language- 2016.
knowledge graph pretraining,” in NeurIPs., 2022. [111] Shchur, O. and Günnemann, S., “Overlapping community detec-
[82] Huang, J., Zhang, X., Mei, Q. and Ma, J., “CAN LLMS EF- tion with graph neural networks,” in arXiv:1909.12201., 2019.
FECTIVELY LEVERAGE GRAPH STRUCTURAL INFORMATION: [112] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S.,
WHEN AND WHY,” in arXiv preprint arXiv:2309.16595.., 2023. Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., 2022.
[83] Jin, X., Vinzamuri, B., Venkatapathy, S., Ji, H. and Natarajan, P., ”Emergent Abilities of Large Language Models” in Transactions on
“Adversarial Robustness for Large Language NER models using Machine Learning Research, 2022.
Disentanglement and Word Attributions,” in EMNLP., 2023. [113] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y. and Iwasawa, Y., 2022.
[84] Kipf, T.N. and Welling, M., “Semi-supervised classification with ”Large language models are zero-shot reasoners” in NeurIPS.
graph convolutional networks,” in ICLR., 2017. [114] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E.,
[85] Hamilton, W., Ying, Z. and Leskovec, J., “Inductive representation Le, Q.V. and Zhou, D., 2022. ”Chain-of-thought prompting elicits
learning on large graphs,” in NeurIPs., 2017. reasoning in large language models” in NeurIPS.
[86] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. and [115] Radford, A., 2019. ”Language Models are Unsupervised Multitask
Bengio, Y., “Graph attention networks,” in ICLR., 2018. Learners” in OpenAI Blog, 2019.
[87] Zhang, S., Liu, Y., Sun, Y. and Shah, N., “Graph-less Neural [116] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P. and Soricut,
Networks: Teaching Old MLPs New Tricks Via Distillation,” in R., 2019, September. ”ALBERT: A Lite BERT for Self-supervised
ICLR., 2022. Learning of Language Representations” in ICLR.
[88] Liu, M., Gao, H. and Ji, S., “Towards deeper graph neural networks,” [117] Clark, K., Luong, M.T., Le, Q.V. and Manning, C.D., 2019, Septem-
in KDD., 2020. ber. ”ELECTRA: Pre-training Text Encoders as Discriminators Rather
[89] Meng, Y., Huang, J., Zhang, Y. and Han, J., “Generating training Than Generators” in ICLR.
data with language models: Towards zero-shot language under- [118] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz,
standing,” in NeurIPS., 2022. E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S. and Nori, H.,
[90] Sun, Y., Han, J., Yan, X., Yu, P.S. and Wu, T., “Pathsim: Meta 2023. ”Sparks of artificial general intelligence: Early experiments
path-based top-k similarity search in heterogeneous information with gpt-4” in arXiv preprint arXiv:2303.12712.
networks,” in VLDB., 2011.
[119] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei,
[91] Liu, H., Li, C., Wu, Q. and Lee, Y.J., “Visual instruction tuning,” in Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. and Bikel, D.,
NeurIPs., 2023. 2023. ”Llama 2: Open foundation and fine-tuned chat models” in
[92] Park, C., Kim, D., Han, J. and Yu, H., “Unsupervised attributed arXiv preprint arXiv:2307.09288.
multiplex network embedding,” in AAAI., 2020.
[120] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chap-
[93] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
lot, D.S., Casas, D.D.L., Bressand, F., Lengyel, G., Lample, G.,
A.N., Kaiser, Ł. and Polosukhin, I., “Attention is all you need,” in
Saulnier, L. and Lavaud, L.R., 2023. ”Mistral 7B” in arXiv preprint
NeurIPs., 2017.
arXiv:2310.06825.
[94] Haveliwala, T.H., “Topic-sensitive pagerank,” in WWW., 2002.
[121] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson,
[95] Oord, A.V.D., Li, Y. and Vinyals, O., “Representation learning with
Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M. and Ring, R.,
contrastive predictive coding,” in arXiv:1807.03748., 2018.
2022. ”Flamingo: a visual language model for few-shot learning” in
[96] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agar- NeurIPS.
wal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger,
[122] Edwards, C., Zhai, C. and Ji, H., 2021. ”Text2mol: Cross-modal
G., “Learning transferable visual models from natural language
molecule retrieval with natural language queries” in EMNLP.
supervision,” in ICML., 2021.
[97] Sun, C., Li, J., Fung, Y.R., Chan, H.P., Abdelzaher, T., Zhai, C. and Ji, [123] Edwards, C., Lai, T., Ros, K., Honke, G., Cho, K. and Ji, H., 2022,
H., “Decoding the silent majority: Inducing belief augmented social December. ”Translation between Molecules and Natural Language”
graph with large language model for response forecasting,” in arXiv in EMNLP.
preprint arXiv:2310.13297., 2023. [124] Wang, H., Feng, S., He, T., Tan, Z., Han, X. and Tsvetkov, Y., ”Can
[98] Sun, C., Li, J., Chan, H.P., Zhai, C. and Ji, H., “Measuring the Effect Language Models Solve Graph Problems in Natural Language?” in
of Influential Messages on Varying Personas,” in ACL., 2023. arXiv preprint arXiv:2305.10037., 2023.
[99] Whalen, R., “Legal networks: The promises and challenges of legal [125] Liu, C. and Wu, B., 2023. ”Evaluating large language models on
network analysis,” in Mich. St. L. Rev.., 2016. graphs: Performance insights and comparative analysis” in arXiv
[100] Friedrich, A. and Palmer, A. and Pinkal, M., “Situation entity preprint arXiv:2308.11224, 2023.
types: automatic classification of clause-level aspect,” in ACL., 2016. [126] Guo, J., Du, L. and Liu, H.. ”GPT4Graph: Can Large Language
[101] Guha, N., Nyarko, J., Ho, D.E., Ré, C., Chilton, A., Narayana, Models Understand Graph Structured Data? An Empirical Evalua-
A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D.N. and tion and Benchmarking” in arXiv preprint arXiv:2305.15066, 2023.
Zambrano, D., “Legalbench: A collaboratively built benchmark for [127] Zhang, J., 2023. ”Graph-ToolFormer: To Empower LLMs with
measuring legal reasoning in large language models,” in arXiv Graph Reasoning Ability via Prompt Augmented by ChatGPT” in
preprint arXiv:2308.11462., 2023. arXiv preprint arXiv:2304.11116, 2023.
[102] Lin, Y., Wang, H., Chen, J., Wang, T., Liu, Y., Ji, H., Liu, Y. [128] Zhang, Z., Wang, X., Zhang, Z., Li, H., Qin, Y., Wu, S. and Zhu,
and Natarajan, P., “Personalized entity resolution with dynamic W.. ”LLM4DyG: Can Large Language Models Solve Problems on
heterogeneous knowledge graph representations,” in arXiv preprint Dynamic Graphs?” in arXiv preprint arXiv:2310.17110, 2023.
arXiv:2104.02667, 2021. [129] Luo, L., Li, Y.F., Haffari, G. and Pan, S., 2023. ”Reasoning on
[103] Bai, X., Wang, M., Lee, I., Yang, Z., Kong, X. and Xia, F., “Scientific graphs: Faithful and interpretable large language model reasoning”
paper recommendation: A survey,” in Ieee Access, 2019. in arXiv preprint arXiv:2310.01061, 2023.
[104] Chowdhury, S. and Schoen, M.P., “Research paper classification [130] Jiang, J., Zhou, K., Dong, Z., Ye, K., Zhao, W.X. and Wen, J.R..
using supervised machine learning techniques,” in Intermountain ”Structgpt: A general framework for large language model to reason
Engineering, Technology and Computing, 2020. over structured data” in arXiv preprint arXiv:2305.09645, 2023.
[105] Madigan, D., Genkin, A., Lewis, D.D., Argamon, S., Fradkin, D. [131] Fatemi, B., Halcrow, J. and Perozzi, B.. ”Talk like a graph: Encoding
and Ye, L., “Author identification on the large scale,” in CSNA, 2005. graphs for large language models” in arXiv:2310.04560, 2023.
[106] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y. and Wang, M., [132] Sun, J., Xu, C., Tang, L., Wang, S., Lin, C., Gong, Y., Shum, H.Y.
“Lightgcn: Simplifying and powering graph convolution network and Guo, J.. ”Think-on-graph: Deep and responsible reasoning of
for recommendation,” in SIGIR, 2020. large language model with knowledge graph” in arXiv preprint
[107] Chang, J., Gao, C., He, X., Jin, D. and Li, Y., “Bundle recommenda- arXiv:2307.07697, 2023.
tion with graph convolutional networks,” in SIGIR, 2020. [133] Danny Z. Chen.. ”Developing algorithms and software for geo-
[108] Xu, H., Liu, B., Shu, L. and Yu, P., “Open-world learning and metric path planning problems” in ACM Comput. Surv. 28, 4es (Dec.
application to product classification,” in WWW, 2019. 1996), 18–es. https://doi.org/10.1145/242224.242246, 1996.
[134] Iqbal A., Hossain Md., Ebna A., ”Airline Scheduling with Max [159] Ock J, Guntuboina C, Farimani AB. Catalyst Property Prediction
Flow algorithm” in IJCA, 2018. with CatBERTa: Unveiling Feature Exploration Strategies through
[135] Li Jiang, Xiaoning Zang, Ibrahim I.Y. Alghoul, Xiang Fang, Junfeng Large Language Models. arXiv preprint arXiv:2309.00563, 2023.
Dong, Changyong Liang. ”Scheduling the covering delivery problem [160] Fang Y, Liang X, Zhang N, Liu K, Huang R, Chen Z, Fan X, Chen
in last mile delivery” in Expert Systems with Applications, 2022. H., Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset
[136] Nakagawa, H., Iwasawa, Y. and Matsuo, Y., ”Graph-based knowl- for Large Language Models. arXiv preprint arXiv:2306.08018, 2023.
edge tracing: modeling student proficiency using graph neural [161] Abdine H, Chatzianastasis M, Bouyioukos C, Vazirgiannis M.,
network” in WI, 2019. Prot2Text: Multimodal Protein’s Function Generation with GNNs
[137] Li, H., Wei, H., Wang, Y., Song, Y. and Qu, H.. ”Peer-inspired and Transformers, arXiv preprint arXiv:2307.14367, 2023.
student performance prediction in interactive online question pools [162] Luo Y, Yang K, Hong M, Liu X, Nie Z., MolFM: A Multimodal
with graph neural network” in CIKM, 2020. Molecular Foundation Model, arXiv preprint arXiv:2307.09484, 2023.
[138] Zhang, X., Wang, L., Helwig, J., Luo, Y., Fu, C., Xie, Y., ... & Ji, S. [163] Qian, C., Tang, H., Yang, Z., Liang, H., & Liu, Y., Can large
(2023). Artificial intelligence for science in quantum, atomistic, and language models empower molecular property prediction? arXiv
continuum systems. arXiv preprint arXiv:2307.08423. preprint arXiv:2307.07443, 2023
[139] Rusch, T. K., Bronstein, M. M., & Mishra, S. (2023). A sur- [164] Born, J., & Manica, M., Regression Transformer enables concur-
vey on oversmoothing in graph neural networks. arXiv preprint rent sequence regression and generation for molecular language
arXiv:2303.10993. modelling. Nature Machine Intelligence, 5(4), 432-444, 2023.
[140] Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., & Bron- [165] Li J, Liu Y, Fan W, Wei XY, Liu H, Tang J, Li Q., Empowering
stein, M. M. (2021). Understanding over-squashing and bottlenecks Molecule Discovery for Molecule-Caption Translation with Large
on graphs via curvature. arXiv preprint arXiv:2111.14522. Language Models: A ChatGPT Perspective. arXiv, 2023.
[141] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., ... & Liu, [166] Zeng, Z., Yin, B., Wang, S., Liu, J., Yang, C., Yao, H., ... & Liu,
T. Y. (2021). Do transformers really perform badly for graph Z., Interactive Molecular Discovery with Natural Language. arXiv,
representation?. NeurIPS, 34, 28877-28888. 2023.
[142] Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., [167] Liu Z, Li S, Luo Y, Fei H, Cao Y, Kawaguchi K, Wang X, Chua TS.,
& Beaini, D. (2022). Recipe for a general, powerful, scalable graph MolCA: Molecular Graph-Language Modeling with Cross-Modal
transformer. NeurIPS, 35, 14501-14515. Projector and Uni-Modal Adapter, in EMNLP, 2023.
[143] Liu, G., Zhao, T., Inae, E., Luo, T., & Jiang, M. (2023). [168] Guo T, Guo K, Liang Z, Guo Z, Chawla NV, Wiest O, Zhang X.
Semi-Supervised Graph Imbalanced Regression. arXiv preprint What indeed can GPT models do in chemistry? A comprehensive
arXiv:2305.12087. benchmark on eight tasks. in NeurIPS, 2023.
[169] Liu Z, Zhang W, Xia Y, Wu L, Xie S, Qin T, Zhang M, Liu TY.,
[144] Wu Q, Zhao W, Li Z, Wipf DP, Yan J. Nodeformer: A scalable
MolXPT: Wrapping Molecules with Text for Generative Pre-training,
graph structure learning transformer for node classification. NeurIPS.
in ACL, 2023.
2022 Dec 6;35:27387-401.
[170] Seidl, P., Vall, A., Hochreiter, S., & Klambauer, G., Enhancing
[145] Liu, G., Zhao, T., Xu, J., Luo, T., & Jiang, M., Graph rationalization
activity prediction models in drug discovery with the ability to
with environment-based augmentations, In ACM SIGKDD, 2022.
understand human language, in ICML, 2023.
[146] Bran, A. M., Cox, S., White, A. D., & Schwaller, P., ChemCrow:
[171] Christofidellis, D., Giannone, G., Born, J., Winther, O., Laino, T.,
Augmenting large-language models with chemistry tools, arXiv
& Manica, M., Unifying molecular and textual representations via
preprint arXiv:2304.05376, 2023.
multi-task language modelling, in ICML, 2023.
[147] Riesen, K., & Bunke, H., IAM graph database repository for [172] Liu, S., Nie, W., Wang, C., Lu, J., Qiao, Z., Liu, L., ... & Anandku-
graph based pattern recognition and machine learning. In Structural, mar, A. Multi-modal molecule structure-text model for text-based
Syntactic, and Statistical Pattern Recognition: Joint IAPR International retrieval and editing, Nature Machine Intelligence, 2023.
Workshop.
[173] Lacombe, R., Gaut, A., He, J., Lüdeke, D., & Pistunova, K., Extract-
[148] Weininger, D., SMILES, a chemical language and information ing Molecular Properties from Natural Language with Multimodal
system. 1. Introduction to methodology and encoding rules. Journal Contrastive Learning, ICML Workshop on Computational Biology, 2023.
of chemical information and computer sciences, 28(1), 31-36, 1988
[174] Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., ... & Wen, J. R.,
[149] Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I. InChI- A molecular multimodal foundation model associating molecule
the worldwide chemical structure identifier standard. Journal of graphs with natural language, arXiv preprint arXiv:2209.05481. 2022.
cheminformatics. 2013 Dec;5(1):1-9. [175] Zeng, Z., Yao, Y., Liu, Z., & Sun, M., A deep-learning system
[150] O’Boyle, N., & Dalke, A., DeepSMILES: an adaptation of SMILES bridging molecule structure and biomedical text with comprehen-
for use in machine-learning of chemical structures, 2018. sion comparable to human professionals, Nature communications.
[151] Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A., [176] Iwayama, M., Wu, S., Liu, C., & Yoshida, R., Functional Output
Self-referencing embedded strings (SELFIES): A 100% robust molec- Regression for Machine Learning in Materials Science. Journal of
ular string representation. Machine Learning: Science and Technology. Chemical Information and Modeling, 62(20), 4837-4851, 2022.
[152] Bjerrum, E. J. (2017). SMILES enumeration as data augmenta- [177] Bagal V, Aggarwal R, Vinod PK, Priyakumar UD. MolGPT:
tion for neural network modeling of molecules. arXiv preprint molecular generation using a transformer-decoder model. Journal of
arXiv:1703.07076. Chemical Information and Modeling. 2021 Oct 25;62(9):2064-76.
[153] Arús-Pous, J., Johansson, S. V., Prykhodko, O., Bjerrum, E. J., [178] Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A.,
Tyrchan, C., Reymond, J. L., ... & Engkvist, O. (2019). Randomized Saravia, E., ... & Stojnic, R., Galactica: A large language model for
SMILES strings improve the quality of molecular generative models. science. arXiv, 2022.
Journal of cheminformatics, 11(1), 1-13. [179] Wang, S., Guo, Y., Wang, Y., Sun, H., & Huang, J., Smiles-bert: large
[154] Tetko IV, Karpov P, Bruno E, Kimber TB, Godin G. Augmentation scale unsupervised pre-training for molecular property prediction.
is what you need!. InInternational Conference on Artificial Neural In BCB, 2019.
Networks 2019 Sep 9 (pp. 831-835). Cham: Springer International [180] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J.,
Publishing. BioBERT: a pre-trained biomedical language representation model
[155] Kudo, T., & Richardson, J., Sentencepiece: A simple and language for biomedical text mining. Bioinformatics, 36(4), 1234-1240, 2020.
independent subword tokenizer and detokenizer for neural text [181] Ma, R., & Luo, T. (2020). PI1M: a benchmark database for polymer
processing, in EMNLP, 2018. informatics. Journal of Chemical Information and Modeling.
[156] Irwin, R., Dimitriadis, S., He, J., & Bjerrum, E. J. (2022). Chem- [182] Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukr-
former: a pre-trained transformer for computational chemistry. ishnan, V., ... & Steinbeck, C., ChEBI in 2016: Improved services and
Machine Learning: Science and Technology, 3(1), 015022. an expanding collection of metabolites. Nucleic acids research.
[157] Shi, Y., Zhang, A., Zhang, E., Liu, Z., & Wang, X., ReLM: [183] Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., ... &
Leveraging Language Models for Enhanced Chemical Reaction Bolton, E. E., PubChem 2019 update: improved access to chemical
Prediction, in EMNLP, 2023. data, Nucleic acids research, 47(D1), D1102-D1109, 2019.
[158] Liu P, Ren Y, Ren Z., Git-mol: A multi-modal large language model [184] Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M.,
for molecular science with graph, image, and text, arXiv preprint Hersey, A., ... & Overington, J. P., ChEMBL: a large-scale bioactivity
arXiv:2308.06911, 2023 database for drug discovery. Nucleic acids research.
TABLE 4
A collection of LLM reasoning methods on pure graph discussed in Section 4. We do not include the backbone models used in these methods
studied in the original papers, as these methods generally apply to any LLMs. The “Papers” column lists the papers that study the specific methods.
TABLE 5
A collection of pure graph reasoning problems studied in Section 4. G = (V, E) denotes a graph with vertices V and edges E . v and e denote
individual vertices and edges, respectively. The “Papers” column lists the papers that study the problem using LLMs. The “Complexity” column lists
the time complexity of standard algorithms for the problem, ignoring more advanced but complex algorithms that are not comparable to LLMs’
reasoning processes.
TABLE 6
Summary of large language models on text-attributed graphs. Role of LM: “TE”, “SE”, “ANN” and “AUG” denote text encoder, structure encoder,
annotator (labeling the nodes/edges), and augmentor (conducting data augmentation). Task: “NC”, “UAP”, “LP”, “Rec”, “QA”, “NLU”, “EC”, “LM”, “RG”
denote node classification, user activity prediction, link prediction, recommendation, question answering, natural language understanding, edge
classification, language modeling, and regression task.
TABLE 7
A summarization of Graph-Aware LLM finetuning objectives on text-attributed graphs. vi+ and vi− denote a positive training node and a negative
training node to vi , respectively.
SciNCL [52]: positive $v_i^+$ selected with $\|h_{v_i} - h_{v_i^+}\|_2 \in (k^{+} - c^{+};\, k^{+}]$; hard negative $v_i^-$ selected with $\|h_{v_i} - h_{v_i^-}\|_2 \in (k^{-}_{\mathrm{hard}} - c^{-}_{\mathrm{hard}};\, k^{-}_{\mathrm{hard}}]$; objective $\max\{\|h_{v_i} - h_{v_i^+}\|_2 - \|h_{v_i} - h_{v_i^-}\|_2 + m,\, 0\}$.
TABLE 8
Model collection in Section 6 for text-captioned graphs. “Lin.” and “Vec.” represent Linearized Graph Encoding and Vectorized Graph Encoding.
“Classif.”, “Regr.”, “NER”, “RE”, “Retr.”, “Gen.”, “Cap.” represent classification, regression, named entity recognition, relation extraction, (molecule)
graph retrieval, (molecule) graph generation, (molecule) graph captioning.