Humans Learn Language from Situated
Communicative Interactions.
What about Machines?
Katrien Beuls∗
Université de Namur, Belgium
Faculté d’informatique
katrien.beuls@unamur.be
Paul Van Eecke∗
Vrije Universiteit Brussel, Belgium
Artificial Intelligence Laboratory
paul@ai.vub.ac.be
Humans acquire their native languages by taking part in communicative interactions with
their caregivers. These interactions are meaningful, intentional, and situated in their everyday
environment. The situated and communicative nature of the interactions is essential to the lan-
guage acquisition process, as language learners depend on clues provided by the communicative
environment to make sense of the utterances they perceive. As such, the linguistic knowledge they
build up is rooted in linguistic forms, their meaning, and their communicative function. When it
comes to machines, the situated, communicative, and interactional aspects of language learning
are often passed over. This applies in particular to today’s large language models (LLMs), where
the input is predominantly text-based, and where the distribution of character groups or words
serves as a basis for modeling the meaning of linguistic expressions. In this article, we argue that
this design choice lies at the root of a number of important limitations, in particular regarding the
data-hungriness of the models, their limited ability to perform human-like logical and pragmatic
reasoning, and their susceptibility to biases. At the same time, we make a case for an alternative
approach that models how artificial agents can acquire linguistic structures by participating
in situated communicative interactions. Through a selection of experiments, we show how the
linguistic knowledge that is captured in the resulting models is of a fundamentally different
nature than the knowledge captured by LLMs and argue that this change of perspective provides
a promising path towards more human-like language processing in machines.
∗ Both authors contributed equally.
Action Editors: Marianna Apidianaki, Abdellah Fourtassi, and Sebastian Padó. Submission received: 16
December 2023; revised version received: 30 April 2024; accepted for publication: 19 June 2024.
https://doi.org/10.1162/coli a 00534
© 2024 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license
1. Humans Learn Language from Situated Communicative Interactions
Human languages are evolutionary systems that continuously adapt to changes in the
communicative needs and environment of their users (Schleicher 1869; Darwin 1871;
Maynard Smith and Szathmáry 1999). They emerge and evolve as a result of meaningful
and intentional communicative interactions between members of a linguistic commu-
nity. When children acquire their native languages as new members of such a commu-
nity, they build up their linguistic knowledge by actively taking part in communicative
interactions with their caregivers. They thereby face two challenges that can be consid-
ered fundamental to the language acquisition process. On the one hand, children need
to work out the communicative intents of their interlocutors, effectively making sense
of the utterances they perceive. On the other hand, based on these utterances and their
“reconstructed” meanings, they need to be able to bootstrap a linguistic system that en-
ables them to understand and produce utterances that they have never observed before.
The cognitive processes involved in the acquisition of language from situated com-
municative interactions have been extensively studied in the fields of developmental
psychology and cognitive linguistics, where they form the basis of constructivist and
usage-based theories of language acquisition (Piaget 1923; Boden 1978; Bruner 1983;
Langacker 1987; Nelson 1998; Croft 1991; Givón 1995; Clark 1996; Tomasello 2003;
Goldberg 2006; Bybee 2010; Lieven 2014; MacWhinney 2014; Diessel 2017; Behrens
2021). The ability of children to acquire language is thereby attributed to two broad
sets of skills, for which Tomasello (2003, pages 3–4) coined the terms intention reading
and pattern finding. Intention reading refers to the cognitive abilities that are concerned
with the functional, meaningful dimension of linguistic communication. These include
in particular the capacity of children to share attention, to follow and direct the attention
of others through non-linguistic gestures like pointing, and to recognize the commu-
nicative intents of their interlocutors. Pattern finding refers to the cognitive abilities
that enable language users to generalize across different communicative interactions.
These concern in particular the capacity to recognize similarities and differences in
sensory-motor experiences, and to use this capacity for perceptual and conceptual cate-
gorization, schema formation, frequency-based distributional reasoning, and analogical
thinking.
Intention reading and pattern finding are complementary yet highly interdepen-
dent skills. Intention reading enables a language user to reconstruct the intended
meaning of an observed utterance. It implements an abductive reasoning process by
which the language user constructs a hypothesis about the communicative intents of
their interlocutor. This hypothesis is constructed based on clues that are provided by
the utterance on the one hand, and its situational embedding in a communicative
environment on the other. Pattern finding then provides the ability to generalize over
observed utterances and their reconstructed meanings. It implements an inductive
reasoning process, which, over time, yields an inventory of productive schemata that
constitute the linguistic knowledge of a language user. As small children start out
without any prior linguistic knowledge, their intention reading process can only rely on
environmental clues. For example, during a communicative interaction with a caregiver,
a young child might hypothesize that the observed utterance bear-gone refers to the
observed disappearance from sight of their favorite stuffed animal. This association
can be made based on sensory experience and pre-linguistic reasoning only, without
any prior knowledge about the compositional nature of the utterance. An association
between an observed form and its hypothesized meaning that is reconstructed based
on environmental clues only is referred to as a holophrastic construction. When the
same child later hears the utterance ball-gone in the context of a certain spherical toy
disappearing from sight, their pattern finding ability might enable them to infer the
compositional structure of the observed utterance. Indeed, based on the previously ac-
quired holophrastic construction that associates bear-gone to their favorite stuffed animal
disappearing from sight, and the observation of the utterance ball-gone in the context of
a certain spherical toy disappearing from sight, the child might infer through syntactico-
semantic generalization that the form bear is associated with their favorite stuffed
animal, that the form ball is associated with a certain spherical toy, and that the form X-
gone is associated with the referent of X disappearing from sight. The associations of bear
and ball with their respective meanings are called holistic constructions. Constructions
that contain abstract slots, such as the association between X-gone and its meaning
of X’s referent disappearing from sight, are referred to as item-based constructions.
The constructions that emerge from the pattern finding process can in turn provide
clues for the process of intention reading during future communicative interactions.
For example, if the child later observes the utterance gimme-bear in the context of a
request by a caregiver to hand over their favorite stuffed animal, they can recognize
the form bear, relate it to its meaning, and hypothesize that gimme-X is associated to a
request by their caregiver to hand over the referent of X. The interaction between the
processes of intention reading and pattern finding is thus bidirectional. Pattern finding
relies on intention reading for providing hypotheses about the intended meaning of
observed utterances, while intention reading relies on constructions that result from
pattern finding to constrain and navigate the space of possible meaning hypotheses.
The inter-dependency between the processes of intention reading and pattern finding is
schematically depicted in Figure 1, using the examples introduced above.
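To make the pattern finding step concrete, the sketch below gives a deliberately simplified, purely illustrative operationalization of this kind of generalization in Python. The token- and predicate-based representations, the function name generalize, and the slot marker X are our own illustrative choices and do not correspond to any particular construction grammar formalism; actual pattern finding operates over richer syntactico-semantic structures.

```python
# Toy sketch of syntactico-semantic generalization: two observed form-meaning
# pairs that share material are generalized into an item-based construction
# with a slot, plus holistic constructions for the parts that differ.
# Representations and names are illustrative simplifications.

def generalize(obs_a, obs_b):
    """Each observation is a (form tokens, meaning predicates) pair."""
    form_a, meaning_a = obs_a
    form_b, meaning_b = obs_b
    shared_form = [t for t in form_a if t in form_b]
    shared_meaning = [p for p in meaning_a if p in meaning_b]
    holistic_a = ([t for t in form_a if t not in form_b],
                  [p for p in meaning_a if p not in meaning_b])
    holistic_b = ([t for t in form_b if t not in form_a],
                  [p for p in meaning_b if p not in meaning_a])
    item_based = (["X"] + shared_form, ["X"] + shared_meaning)
    return item_based, [holistic_a, holistic_b]

bear_gone = (["bear", "gone"], ["bear(x)", "gone(x)"])
ball_gone = (["ball", "gone"], ["ball(x)", "gone(x)"])
item_based, holistic = generalize(bear_gone, ball_gone)
# item_based -> (['X', 'gone'], ['X', 'gone(x)'])
# holistic   -> [(['bear'], ['bear(x)']), (['ball'], ['ball(x)'])]
```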
The abductive nature of the intention reading process entails that the meaning that
a language learner attributes to an observed utterance has the epistemological status
of a hypothesis. This hypothesis is likely and plausible given the linguistic and non-
linguistic knowledge of the language learner on the one hand, and the communicative
environment in which the utterance was observed on the other. Hypotheses resulting
from intention reading are thereby not free from uncertainty or doubt, and might in
some cases not even transfer to any other communicative situation. An example of
this uncertainty, rooted in anecdotal evidence, would be the case of a Flemish toddler
(2Y2M) occasionally observing the utterance goed-weekend (lit. good-weekend—Eng. Enjoy
your weekend) at the end of a conversation between their parents and caregivers when
leaving their day-care center. At home, the toddler starts to reuse the phrase as a farewell
greeting whenever they leave the room for a few minutes, for example to fetch another
toy or garment. The hypothesis of the toddler about the meaning and communicative
function of the phrase goed-weekend is only partially accurate. While they clearly have
managed to reconstruct certain aspects of its meaning, in particular its social function
as a farewell greeting, they did not capture the compositionality of the phrase and its
meaning of wishing the addressee an enjoyable Saturday and Sunday.
The uncertainty introduced by the intention reading process directly percolates to
the constructions that are formed, whether they are holophrastic mappings or the result
of a syntactico-semantic generalization process. As a language learner uses these con-
structions across a multitude of communicative interactions that are situated in a variety
of environments, they refine their hypotheses and gradually deepen the entrenchment
of constructions that generalize better across situations. At the same time, constructions
that are in conflict or competition with more generally applicable constructions, thereby
hindering successful communication rather than supporting it, will gradually assume
a different function or disappear altogether from the language learner’s inventory of
[Figure 1 diagram: situation-based intention reading (left) and syntactico-semantic pattern finding (right) across three interactions. Interaction A ("bear gone") yields the holophrastic construction "bear-gone"; interaction B ("ball gone") yields the item-based construction "X-gone" and the holistic constructions "bear" and "ball"; interaction C ("gimme bear") yields the item-based construction "gimme-X".]
Figure 1
Illustrative example of language acquisition through situation-based intention reading (left) and
syntactico-semantic pattern finding (right), featuring a young child taking part in three
communicative interactions. On the intention reading side of interactions A and B, the child
makes a hypothesis about the meaning of the observed utterances based on environmental clues
only. On the pattern finding side, the child creates a holophrastic construction after interaction A
and uses this construction to create an item-based construction along with two holistic
constructions after interaction B. One of these holistic constructions is then used to help the
intention reading process during interaction C, after which another item-based construction
is created.
constructions. The evolutionary dynamics of creating, strengthening, and weakening
associations between linguistic forms and meanings, on any level of abstraction, based
on their successful or unsuccessful use in communication, thereby provides a way to
overcome the uncertainty that is inherent to the intention reading process.
As the language acquisition process of a child further unfolds, the linguistic clues
that they provide as a speaker and discern as a listener gradually become more fine-
grained and reliable. Their language understanding and production processes become
less reliant on immediate environmental clues, making displaced communication pos-
sible and enabling language users to take part in conversations about abstract topics
(Hockett and Hockett 1960). Such conversations rely to a large extent on the interpreta-
tion of linguistic clues with respect to the background knowledge and belief system
of the individual interlocutors (Van Eecke et al. 2023). Yet, the linguistic and non-
linguistic knowledge that is involved remains, on the most fundamental level, grounded
in real-world experiences. “Democracy” and “eternity,” for example, are often cited as
conversational topics at the more abstract end of the spectrum (see, e.g., Löhr 2022).
The corresponding concepts have no directly observable referents as they capture social,
cultural, and linguistic constructs. However, it does not take much to argue that without
the grounding of these concepts in an individual’s lived experiences with volition,
authority, permanence, perishability, and boredom, for instance, conversations about
these topics would be hollow and soulless, if they were ever to take place at all.
In sum, humans acquire their native languages by actively taking part in com-
municative interactions with other language users. These interactions are meaningful,
intentional, and situationally embedded in a communicative environment. The situated
and communicative nature of the interactions is essential to the language acquisition
process, as language learners crucially depend on clues provided by the communicative
environment to make sense of the utterances they perceive. Through the interdependent
processes of situation-based intention reading and syntactico-semantic pattern finding,
language learners gradually build up an inventory of productive schemata that capture,
on any level of abstraction, the relationship between linguistic forms on the one hand,
and their meanings and communicative functions on the other. Importantly, the lan-
guage acquisition process is communicatively motivated, personal, usage-based, and
constructivist, in the sense that the linguistic system of each person is individually built
up and shaped by their past experiences and interactions. Through the evolutionary
dynamics of creating, strengthening, and weakening associations between linguistic
forms and meanings based on their successful or unsuccessful use in communication,
language learners effectively manage to build up a productive linguistic system that is
tied to their own physical and cognitive endowment yet compatible on a communica-
tive level with the linguistic systems of other members of their community.
2. Large Language Models Learn Language from Texts
Over the last decade, important advances in neural machine learning and infrastructure
(Sutskever, Vinyals, and Le 2014; Vaswani et al. 2017), combined with the availability
of huge text corpora, have led to previously unimaginable progress in the field of
natural language processing. The main catalyst in this process has been the advent of
transformer-based large language models (LLMs), such as BERT (Devlin et al. 2019),
GPT-3 (Brown et al. 2020), PaLM (Chowdhery et al. 2022), BLOOM (BigScience Work-
shop 2022), and LLaMA (Touvron et al. 2023). Definitive solutions to a wide variety
of NLP subtasks that were previously considered immensely challenging, such as
machine translation, speech recognition, conversational question answering, and text
summarization, now seem well within reach—if not already considered achieved (for an
overview, see, e.g., Lauriola, Lavelli, and Aiolli 2022). The extraordinary performance of
these models on tasks that are traditionally considered hallmarks of human intelligence
has naturally sparked the interest of researchers in other disciplines, including cognitive
science, psychology, and linguistics. The focus of these interdisciplinary investigations
is typically on comparing the performance of humans and LLMs on a variety of tasks,
with the goal of assessing to what extent LLMs can serve as a model of language un-
derstanding and intelligent reasoning in humans. Methodologically, these studies more
often than not adopt an approach called probing (Hupkes, Veldhoen, and Zuidema
2018), in which textual input prompts are designed to elicit the generation of output
text that is taken to unveil the knowledge implicitly captured by the probed LLMs
(Vulić et al. 2020). Probing as such does not reveal any explicit reasoning processes or
strategies, but leaves it up to the interpretation process of the human experimenter to
draw any conclusions about what a model must have “known” or “thought” (Shiffrin
and Mitchell 2023). While probing is an effective methodology to systematically investi-
gate the types of problems that LLMs can handle or struggle with—thereby sometimes
revealing interesting “curious failures” (Shiffrin and Mitchell 2023)—there is some
danger that the output-based interpretation of the inner workings of LLMs becomes a
speculative and unfalsifiable endeavor. This risk can only be mitigated by a constant
awareness of how these models are constructed, what they do, and how they do it
(Shanahan 2024).
A major result obtained through probing studies is that despite the exceptional
performance of LLMs on many NLP tasks, they seem to struggle with human-like
logical and pragmatic inferencing (Jiang and de Marneffe 2021; Hong et al. 2023).
In particular, it is often observed that LLMs fail to reason in a human-like manner
about the knowledge they seem to capture (Choudhury, Rogers, and Augenstein 2022;
Weissweiler et al. 2022; West et al. 2024). Although the precise nature of this knowledge
is still a very active field of investigation, sometimes referred to as “BERTology” (Rogers,
Kovaleva, and Rumshisky 2020), it has been convincingly argued that the reasons why
large language models struggle so much with logical and pragmatic inferencing are to
a large extent ascribable to the fundamental differences between how large language
models are constructed and how human languages are acquired (Bender and Koller
2020; Trott et al. 2023).
Large language models are rooted in the tradition of distributional linguistics, as
pioneered by Joos (1950), Harris (1954), and Firth (1957). Distributional linguistics offers
a framework for analyzing language in terms of “the occurrence of parts [...] relative
to other parts” (Harris 1954). These parts can be of any nature and can be chosen to
correspond, for example, to phonemes, words, or dependency relations. Let us consider
the example of words, given the focus of today’s NLP techniques on the word (or
subword/character group) level. A word-level distributional analysis of a language
describes that language solely in terms of the occurrence of words relative to each
other. Words are thereby represented in terms of their frequency of collocation with
all other words of the language as observed in corpora of language use. A word’s
frequency of collocation is captured in the form of a high-dimensional vector of which
each dimension corresponds to one other word of the language. These other words are
referred to as the context words of a given target word. Frequency of collocation is
defined in terms of the number of times a context word appears in the context window
of a target word, by which the context window spans a chosen number of words to
the left and to the right of the target word. The vectors representing words are thus
the result of counting word co-occurrences in corpora of language use. Interestingly,
words that exhibit similar distributions, indicated by a small angle between their vector
representations, tend to have similar meanings. For example, the words salmon and cod
would both frequently appear in the neighborhood of swim, water, and dinner, and only
rarely co-occur with democracy and eternity. On the other hand, cod would be collocated
more often than salmon with chips and less often with river. Their overall similar collo-
cational behavior thereby indicates a high degree of semantic similarity, whereas their
collocational differences are taken to indicate a semantic difference. The assumption
that collocational behavior can serve as a proxy to meaning has become known as the
distributional hypothesis, or perhaps more famously through the quote “You shall
know a word by the company it keeps!” (Firth 1957). The elegance of distributional
linguistics lies in the fact that linguistic forms can be analyzed independently from
any external factors (Harris 1954). Word-level distributional analyses can indeed be
constructed based on raw text only, without the need to include etymology, history,
semantics, pragmatics, or grounding in (knowledge of) the world, for example. At the
same time, aspects of those other factors can only be captured as a side-effect of their
influence on the distribution of words with respect to each other (Hill, Reichart, and
Korhonen 2015).
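As a minimal illustration of this count-based approach, the following sketch builds word vectors from co-occurrence counts within a symmetric context window over a tiny toy corpus and compares them using cosine similarity. The corpus, window size, and all names are our own illustrative choices.

```python
# Toy count-based distributional analysis: each word is represented by its
# co-occurrence counts with context words within a symmetric window; words
# with similar distributions end up with a small angle between their vectors.
from collections import defaultdict
import math

corpus = [
    "salmon swim in the water".split(),
    "cod swim in the water".split(),
    "we had salmon for dinner".split(),
    "we had cod and chips for dinner".split(),
]

window = 2
vocab = sorted({w for sentence in corpus for w in sentence})
counts = {w: defaultdict(int) for w in vocab}

for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][sentence[j]] += 1   # context word within the window

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in vocab)
    norm = lambda x: math.sqrt(sum(x[w] ** 2 for w in vocab))
    return dot / (norm(u) * norm(v))

print(cosine(counts["salmon"], counts["cod"]))    # similar contexts: higher value
print(cosine(counts["salmon"], counts["chips"]))  # different contexts: lower value
```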
While the foundations of the field of distributional linguistics were laid out in the
1950s, and its routine application to specific NLP tasks such as word sense disam-
biguation (Schütze 1992, 1998) and essay grading (Rehder et al. 1998) dates back to the
1990s, the widespread adoption of distributional semantics as the standard meaning
representation in the field of NLP is much more recent. Turney and Pantel (2010)
are often credited with drawing the attention of the NLP community at large to the
potential of using distributional meaning representations. A major turning point in this
direction was the introduction of the famous CBOW and Skip-gram architectures for
efficiently estimating high-quality word vectors, along with their implementation in the
word2vec software package and the public availability of pre-trained word vectors
(Mikolov et al. 2013a, b). In contrast to the high-dimensional, “sparse” vectors tradition-
ally obtained through counting word co-occurrences, low-dimensional, “dense” vec-
tors, called embeddings, were now estimated by a neural network in a word prediction
task. While the method is somewhat different, the underlying idea and intuition remain
the same: Words are represented as vectors that reflect their observed collocational
behavior in corpora of language use. Interestingly, the dense vectors that are obtained
through machine learning methods turn out to outperform traditional sparse vectors
on virtually every NLP task they are used for (Baroni, Dinu, and Kruszewski 2014),
for reasons that are still not fully understood today (Jurafsky and Martin 2024). The
next major milestone concerns the use of transformer networks (Vaswani et al. 2017) to
learn word representations via a masked language modeling task (Devlin et al. 2019).
Such a task consists of filling in randomly selected words that have been blanked out
in a given stretch of text. The major difference from the Skip-gram and CBOW-based methods is
that the resulting word embeddings are “contextual” rather than “static.” As such, a
target word is no longer represented by a single vector, but by a different vector for
each different context in which it appears. The fact that words are now represented
on the level of their textual occurrence rather than their dictionary entry makes for a
more fine-grained modeling of their collocational behavior, and consequently leads to
a drastically improved performance on downstream NLP tasks. For a more extensive
overview of vector-based word representation methods, we refer the reader to the re-
cent survey by Apidianaki (2022).
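To illustrate the difference between static and contextual embeddings, the sketch below retrieves occurrence-level vectors for one and the same word form in two different sentences. It assumes the Hugging Face transformers library and a publicly available BERT checkpoint; the example sentences and the helper function are our own illustrative choices.

```python
# Sketch: contextual embeddings assign a different vector to each textual
# occurrence of a word form, rather than one vector per dictionary entry.
# Assumes the Hugging Face transformers library and the bert-base-uncased model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def contextual_vector(sentence, target):
    """Return the contextual embedding of `target` as it occurs in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (n_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(target)]

v_river = contextual_vector("they sat on the bank of the river", "bank")
v_money = contextual_vector("she deposited the money at the bank", "bank")
similarity = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"one word form, two occurrence-level vectors, cosine: {similarity:.2f}")
```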
The scale on which today’s large language models capture the collocational behavior
of words has led to a situation in which the generated texts not only exhibit human-like
lexical and morpho-syntactic correctness, fluency, and eloquence, but are also
characterized by a remarkable discourse coherence and semantic adequacy.
The contextual embeddings learned by these models
mirror the cognitive semantic spaces of humans to a remarkable extent (Vulić et al.
2020) and even seem to capture non-linguistic knowledge to an extent that they can
be used to solve natural language understanding tasks that were always thought to
require vast knowledge of the world or human-like intelligent reasoning capabilities
(cf. Hu et al. 2023; Webb, Holyoak, and Lu 2023). The reasoning errors made by LLMs
under certain circumstances even resemble those observed in humans (Dasgupta et al.
2022).1 Yet, these models still fully subscribe to the distributional linguistics paradigm,
1 In particular, Dasgupta et al. (2022) show that the tendency of humans to commit logical fallacies when
faced with reasoning problems that are not consistent with real-world knowledge and beliefs is
mirrored by LLMs. For example, syllogistic fallacies tend to go unnoticed when the conclusion is
intuitive. Likewise, logic puzzles are more likely to be solved correctly when instantiated as common
real-world situations. Taking into account the word prediction task through which LLMs are trained, the
fact that they are more likely to consider “probable utterances” to be valid conclusions of syllogisms and
that they achieve better results on logic problems that are expressed in more realistic language is not
entirely surprising (cf. McCoy et al. 2023). While it is interesting to see that human reasoning exhibits
similar effects, more research on both humans and machines would be needed to investigate whether this
similar behavior should in fact be ascribed to familiarity with utterances in machines vs. familiarity with
situations in humans. Adversarial robustness experiments in which utterances are varied but describe the same
situation could provide an interesting methodological framework here.
where languages are modeled solely in terms of the observed “co-occurrence of parts
[...] relative to other parts [...] without intrusion of other features such as history or
meaning” (Harris 1954). As such, all “knowledge” and “reasoning capabilities” they
might hold are rooted in contextual word prediction based on collocations that were
observed in huge amounts of raw textual data. While it is almost inconceivable that such
results can be obtained based on word distributions only, admittedly distilled from equally
inconceivable amounts of text, the fact that these LLMs are learned in the complete absence
of situationally grounded and intentional communicative interactions lies at the origin
of a number of inherent limitations:
• Hallucinations. Large language models seamlessly mix fact and fiction in
the output they produce, a phenomenon referred to by the term
hallucination. Hallucination is a direct consequence of the generative
nature of the models, by which they use collocation-based patterns and
structures observed in input texts to generate output texts that exhibit
similar patterns and structures. The epistemological status of the
generated texts is thereby uniform and may consequently correspond to
facts, beliefs, opinions, or fantasies in human terms (McKenna et al. 2023).
Simply put, highly probable word combinations might very well correspond
neither to factual truths nor to non-factual information contained in the
training data. The use of the term “hallucination” to refer to generated
fantasies is somewhat misleading, as hallucinations result from exactly
the same generative process as any other output text. The use of the label
is thus necessarily the result of a post-hoc interpretation by the human
interpreter. One could just as well argue that all output of LLMs is
hallucinated, but that their hallucinations are remarkably often
semantically adequate.
• Human-like logical and pragmatic reasoning. Large language models
suffer from an apparent dissonance between the vast knowledge they
seem to capture and their difficulty in using this knowledge to perform
human-like logical and pragmatic reasoning (Choudhury, Rogers, and
Augenstein 2022; Weissweiler et al. 2022; Mitchell and Krakauer 2023;
Hong et al. 2023; West et al. 2024; Mitchell, Palmarini, and Moskvichev
2024). This limitation is ascribable to two main reasons. The first reason
concerns the lack of situational grounding during the training process.2
LLMs are trained to generate output texts that exhibit similar (sub)word
distributions as those observed in input texts. This task, at which LLMs
clearly excel, perfectly aligns with the standard assumption of generative
machine learning that models should generate output that follows the
same distributions as the input they are trained on. However, when the
task shifts from generating texts that reflect the distribution of words in
texts written by humans to human-like logical and pragmatic reasoning,
the standard machine learning assumption is no longer upheld. Indeed,
when the distribution of words in texts is used to make predictions about
2 We are using the term situational grounding to refer to “the communicative, perceptual, or goal-oriented
contexts in which language occurs” (Pavlick 2023), thereby spanning at least Coelho Mollo and Millière’s
(2023) notions of referential grounding, sensory-motor grounding, and communicative grounding.
the distribution of objects and events in the world, it is no longer
guaranteed that the input and target distributions coincide. The
distribution of objects and events in the world might well be unlearnable
from the distribution of words in texts. Moreover, it is extremely hard to
gauge where these distributions coincide sufficiently to justify accurate
predictions and where they are too different (Shichman et al. 2023). The
second reason why LLMs struggle so much with human-like logical and
pragmatic inferencing concerns the absence of communicative intent
during the training process (cf. Bender and Koller 2020). Human
linguistic communication is intentional, in the sense that speakers seek to
achieve an effect in their interlocutors (Austin 1962). At the same time,
language is an inferential coding system, by which is meant that a
speaker’s intentions are not losslessly encoded in the utterances they
produce, but need to be inferentially reconstructed by their interlocutors
(Sperber and Wilson 1986). Pragmatic inferences are drawn based on
these communicative intents, whereby the redundancy of information is
particularly meaningful (Grice 1967; Kravtchenko and Demberg 2022).
For example, if your co-worker mentions that the door was locked when
they entered the lab in the morning, this presupposes (at least) that the
door is usually not locked when they arrive. Or if you explicitly mention
that your friend paid for their coffee, this presupposes that they normally
don’t pay for their coffee. These presuppositions do not result from the
information encoded in the utterances, or from general knowledge of the
world, but from a reconstruction of the intention that motivated the
speaker to convey this information that would otherwise have been
redundant or irrelevant. While pragmatic inferences play a central role in
human communication, they are notoriously difficult to draw for LLMs
(Chang and Bergen 2024; Ruis et al. 2023). This should not be surprising
given the fact that LLMs are trained on raw textual utterances in the
absence of clues that might reveal the communicative intents that
motivated their production. At the end of the day, the distributional
hypothesis states that the distribution of words reflects their semantic
similarities and differences, but remains silent about why particular
utterances are produced in the first place.3
• Data-hungriness. Large language models are learned from raw textual
data through a word prediction task. Consequently, all aspects of
meaning, reasoning, and world knowledge that we expect LLMs to
capture need to be learned indirectly through the intermediary of their
effect on the distribution of words. In practice, this requires access to text
corpora that consist of hundreds of billions of tokens (Chang and Bergen
2024). For example, the GPT-3 family of models was trained on 300 billion
tokens sampled from a base corpus of more than 500 billion tokens
(Brown et al. 2020) and the Chinchilla model was even trained on 1.4
3 These arguments relate to a broader philosophical discussion within the community about whether
LLMs are capable of symbol grounding, reference, and capturing meaning. In this debate, it has been
argued that language models cannot capture meaning as they lack reference and communicative intent
(Bender and Koller 2020), but also that reference (Piantadosi and Hill 2022) and communicative intent
(Pavlick 2023) are not prerequisites for meaning, and that beliefs and experiences are not prerequisites for
reference (Mandelkern and Linzen 2024).
trillion tokens (Hoffmann et al. 2022). These amounts of data are
necessary to stretch the distributional hypothesis from making
predictions about semantic similarity on the word level to generating
output that exhibits aspects of human-like intelligence. The number of
tokens that an LLM is exposed to is several orders of magnitude larger
than the number of tokens a human is exposed to, estimated at about 60
million at the age of 5 and 200 million at the age of 20 (Frank 2023). It
should be noted, however, that such a comparison is virtually
meaningless, as the nature of their input, namely, texts vs. situated
communicative interactions, is fundamentally different.
• Sensitivity to biases. Large language models mirror the biases that are
present in the data they were trained on (Nissim, van Noord, and van der
Goot 2020; Vallor 2024), concerning, for example, age, ethnicity, gender,
religion, and sexual orientation (Navigli, Conia, and Ross 2023). These
biases do not even originate from the unjustifiable extrapolation of
correlations observed in the world, but from texts, written by humans,
that at times literally contain stereotypes, hate speech, or fringe opinions.
While a laudable idea in principle, the curation of training data is not a
scalable solution given the difficulty of this task and the sheer amount of
training data that is needed.
In sum, LLMs learn language from texts. They meticulously capture the collo-
cational behavior of (sub)words in a fine-grained manner and on an inconceivable
scale. All knowledge they seem to capture and reasoning abilities they seem to possess
originate from observing the distribution of (sub)words with respect to each other in
corpora of human-written texts. While their capabilities are astonishing, improving
on almost any NLP/NLU task they are applied to and stretching the distributional
hypothesis further than anyone would have imagined, the fact that they are learned
from textual data and pass over the situated, communicative, and intentional aspects of
human language use lies at the root of a number of inherent limitations. These concern
in particular the uniform epistemological status of their output, their limited ability to
perform logical and pragmatic reasoning, their data-hungriness, and their susceptibility
to biases.
3. Towards Human-like Language Learning in Machines
The text-internal prediction task through which LLMs learn their linguistic capabilities
sharply contrasts with the situated communicative interactions through which humans
acquire their native languages. As we have argued above, the fundamental differences
between the learning processes of humans and LLMs in turn lead to fundamental
differences in the kind of linguistic and non-linguistic knowledge that is built up, as well
as in how this knowledge is used to engage in communicative and non-communicative
actions. In no way does this argument refute the role of (distributional) statistical
learning in human language acquisition (cf. Saffran, Aslin, and Newport 1996; Aslin
2017), which we consider, like Tomasello (2003) and Kuhl, Tsao, and Liu (2003) among
others, to be an integral part of a human’s broader pattern finding ability (cf. Section 1).
It does, however, imply that text-based statistical learning is insufficient on its own,
and that the key towards more human-like language learning in machines lies in more
faithfully modeling the circumstances under which human languages are acquired and
used to communicate.
The goal of the present section is to zoom in on approaches that incorporate aspects
of the meaningful, intentional, situated, communicative, and interactional nature of
linguistic communication with the aim of operationalizing more human-like language
learning in machines. We start by briefly considering recent work that remains firmly
rooted within the core of the LLM paradigm, but extends it beyond a purely text-based
prediction task (Section 3.1). In particular, we touch upon the integration of input from
other modalities and the use of reinforcement learning to better align the output of
LLMs with externally defined criteria such as human preferences. We then move on to
discuss a line of work that takes a very different perspective on the matter, as it explicitly
aims to study how artificial agents can acquire languages through the processes of
situation-based intention reading and syntactico-semantic pattern finding during situ-
ated communicative interactions (Section 3.2). We discuss two concrete experiments that
operationalize this paradigm and show how the linguistic systems that are acquired by
the agents in such a setting are of a fundamentally different nature than those acquired
by LLMs.
3.1 Large Language Models beyond (Sub)word Distributions
The argument that the key towards more human-like language learning in machines
lies in more faithfully modeling the circumstances under which human languages are
acquired and used to communicate minimally calls for letting go of a strict interpreta-
tion of the distributional hypothesis. In other terms, the idea that the “co-occurrence of
parts [...] relative to other parts [...] without intrusion of other features such as history
or meaning” (Harris 1954) suffices to capture all there is to human linguistic communi-
cation can no longer be upheld. When it comes to LLMs, this implies that the paradigm
needs to be extended to integrate information beyond the collocational behavior of
(sub)words as learned from raw text corpora through a text-internal prediction task.
Two main approaches to extending LLMs beyond observed (sub)word distributions
have already proven to be valuable additions to the paradigm. While the first approach
focuses on broadening the input to the learning process to other modalities than text,
the second approach concentrates on better aligning the output of LLMs with externally
defined criteria such as human preferences or task-specific ground-truth accuracies.
Multi-modal Language Models. The high-level idea behind multi-modal language models
is that information originating from multiple modalities can be integrated into an LLM’s
embedding space (see, e.g., Radford et al. 2021; Baevski et al. 2022; Lyu et al. 2023).4 The
theoretical promise of this approach is that models capable of cross-modal inferencing
break free from the limitation that knowledge of the world, or any knowledge other
than word distributions per se, can only be captured as a side-effect of the influence of
external factors on the observed distribution of words (see Section 2). For example, the
graphical structures that are commonly used to represent spatial and temporal relations
through the visual modality, for example through maps or timelines, concisely hold in-
formation about distances, durations, and temporal orderings that are hard, inefficient,
4 Note that approaches that first map input from other modalities to texts, for example through image
captioning, and then feed the resulting texts to an LLM, are also sometimes referred to as multi-modal
LLMs. However, we do not consider these approaches in this section, as the LLMs themselves remain
purely text-based.
or practically impossible to capture using the textual modality (see, e.g., Li et al. 2020).
After all, there are good reasons why humans make extensive use of graphical rep-
resentations in their atlases, road maps, and handbooks on subjects such as anatomy,
botany, and chemistry. Apart from the theoretical advantages that come with enriching
the representational power of LLMs beyond observed (sub)word distributions, the
integration of non-textual modalities has also facilitated the development of a variety
of real-world applications, such as multi-modal chatbots (OpenAI 2023; Yamazaki et al.
2023), robotic manipulation planning systems (Driess et al. 2023), and navigation aids
(Fan et al. 2023).
Language Model Alignment. It has often been argued that the text-internal training ob-
jective of LLMs yields models that might well generate texts that are in some sense
indistinguishable from human-written texts, but at the same time fail to satisfy user ex-
pectations in terms of helpfulness, relevance, factual correctness, safety, and adherence
to human values, among others (Weidinger et al. 2021; Liu et al. 2022; Tan et al. 2023).
A particularly successful approach to better satisfying user expectations concerns the
incorporation of text-external objectives through a two-stage language model alignment
process. Concretely, a standard LLM is first pre-trained on a text-internal prediction
task. Then, the learned model is fine-tuned to optimize a text-external criterion such
as human preference or ground-truth accuracy on a specific task. The language model
alignment approach has perhaps become best known through its use in combination
with reinforcement learning from human feedback to align the output of open-domain
chatbots to user expectations. Typically, a reward model is first trained in interaction with
humans to capture how humans would rank a set of possible output texts. Then, this
model is used as a reward function to fine-tune the pre-trained LLM. For a more detailed
introduction to language model alignment, we refer the reader to Ziegler et al. (2019),
Ouyang et al. (2022), and Bai et al. (2022).
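As a highly simplified illustration of the reward modeling stage, the sketch below trains a scorer to assign higher rewards to human-preferred outputs than to dispreferred ones using a pairwise ranking loss. The toy feature encoding, dimensionalities, and all names are our own assumptions; actual reward models score token sequences with a transformer, and the resulting scorer is subsequently used to fine-tune the pre-trained LLM, typically with a policy-gradient method and a penalty for diverging too far from the original model.

```python
# Toy sketch of reward model training for LLM alignment: human preference pairs
# are used to fit a scorer such that preferred outputs receive higher rewards
# (pairwise ranking loss). Real systems encode full token sequences with a
# transformer; here, outputs are stand-in random feature vectors.
import torch

torch.manual_seed(0)
dim = 16
reward_model = torch.nn.Linear(dim, 1)             # stand-in for a transformer head
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# toy "encoded" (preferred, dispreferred) response pairs
pairs = [(torch.randn(dim) + 0.5, torch.randn(dim) - 0.5) for _ in range(64)]

for epoch in range(20):
    for preferred, dispreferred in pairs:
        r_pos = reward_model(preferred)
        r_neg = reward_model(dispreferred)
        # maximize log sigmoid(r_pos - r_neg): the preferred output should
        # outscore the dispreferred one
        loss = -torch.nn.functional.logsigmoid(r_pos - r_neg)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# the trained scorer then serves as the reward function when fine-tuning the
# pre-trained LLM, e.g., with PPO plus a KL penalty towards the original model
```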
The integration of input data from modalities other than text, as well as the align-
ment of LLMs to text-external criteria, cautiously lifts the limitation that all knowledge,
reasoning capabilities, and linguistic behavior of LLMs need to be learned through
the intermediary of observed (sub)word distributions. While the integration of input
from multiple modalities provides a way to incorporate richer representations of the
situational embedding of observed utterances, the alignment of LLMs to text-external
criteria facilitates the integration of any optimization criterion for which an effective
reward function can be designed or learned. Naturally, this sparks the question of
whether the limitations ascribed in Section 2 to the absence of situationally grounded
and intentional communicative interactions during the training process of LLMs still
persist. Any answer to this question crucially hinges on two variables that were left
open in the argument.
The first variable concerns the nature of the multi-modal input to the learning
process, in particular the extent to which the input can reflect the sensory-motor inter-
actions in which humans engage every day. There seems to be no reason to believe that
it would be fundamentally impossible to collect training data in real-world situations
using sensors modeled after the sensory apparatus of humans.5 At the same time, no
5 For an exploratory operationalization of this idea, see, for example, Vong et al. (2024), who trained a
language model based on a toddler’s auditory and visual experiences captured using head-mounted
sensors for 61 hours over a period of 19 months (Sullivan et al. 2021). See Gabrieli et al. (2022) and Park
et al. (2022), respectively, for examples of devices specifically designed to emulate human taste and
tactile sensing.
comprehensive data sets of this kind exist to date and their construction would require
overcoming extensive technical, ethical, and financial challenges. For now at least, the
input to multi-modal LLMs typically remains restricted to a combination of texts, speech
recordings, images, and/or videos.
The second variable concerns the reward function that is used in the language
model alignment process. The central question here is to what extent such a func-
tion can capture human linguistic and non-linguistic behavior. The situation is much
thornier than in the case of the first variable. The design of robust and reliable reward
functions is notoriously difficult even in simple environments (Skalse et al. 2022; Ngo,
Chan, and Mindermann 2024). The underlying reason is closely related to Goodhart’s
law (Goodhart 1975; Manheim and Garrabrant 2018), which states that measures that
become targets cease to be useful as measures (Ngo, Chan, and Mindermann 2024). For
instance, the number of times the entrance door to a shop opens and closes might well
be an excellent proxy to assess how busy that shop is. However, rewarding employees
for maximizing this number is unlikely to be effective at making the shop any busier, as
employees might spend all their time flipping switches and keeping clients out so that
the door can be opened and closed more efficiently. If the goal is to capture something
as multifaceted as human linguistic communication, it seems highly unlikely that an
entirely non-gameable proxy can even exist. At the same time, it is difficult to give up
on proxies altogether, as the alternative of replacing every invocation of the reward
function by an authentic real-world interaction is definitely not feasible in such a data-
hungry setting. Reward functions that rely on sufficiently realistic simulations of the
situated communicative interactions that humans engage in might offer a way out, but
they will somewhere need to strike a balance between robustness and versatility on the
one hand, and computational efficiency on the other.
Do the inherent limitations of LLMs then persist when extending the paradigm
beyond modeling (sub)word distributions? The argument developed in Section 2 that
the tendency of LLMs to hallucinate, their limited ability to perform human-like logi-
cal and pragmatic reasoning, their data-hungriness, and their sensitivity to biases are
limitations that are inherent to the paradigm was based on the premise that LLMs are
learned from texts, optimizing a text-internal prediction objective. If the input shifts
from raw texts to human-like sensory-motor observations and if the text-internal predic-
tion objective is complemented with a human-like linguistic communication objective,
the original argument no longer holds. The reality is, however, less clear-cut. While
current LLM extensions beyond (sub)word distributions already exhibit more desirable
behavior than text-internal LLMs, they are still far removed from the ideal, both on the
input side and on the task side. The most fundamental question here seems to be whether
a non-gameable proxy that is both practically usable for language model alignment
and sufficiently close to capturing human linguistic behavior could in principle be
designed or learned. In the end, the extent to which the limitations of LLMs can be lifted
when extended beyond (sub)word distributions is bounded by the extent to which the
meaningful, intentional, situated, communicative, and interactional aspects of human
linguistic communication can be integrated into the training process.
3.2 Language Learning from Situated Communicative Interactions
The goal of the present section is to discuss an alternative approach that leaves the text-
based prediction paradigm altogether and takes situated communicative interactions
as the starting point for modeling language acquisition in machines. In particular, we
examine a line of work that studies how artificial agents can acquire languages through
the processes of situation-based intention reading and syntactico-semantic pattern
finding during simulated communicative interactions. It is not our aim to provide a
comprehensive overview of computational models of human language acquisition or
emergent communication, for which we refer the reader to Doumen et al. (2025), Steels
(2011), and Lazaridou and Baroni (2020), nor to provide an application-level alternative
to current LLMs. Instead, we discuss two concrete experiments that show how the lin-
guistic systems that are acquired by the agents in such a setting are of a fundamentally
different nature than those acquired by LLMs. Through these experiments, we aim
to further convince the reader that more faithfully modeling the circumstances under
which human languages are acquired is likely to constitute a major leap towards more
human-like language processing in machines.
The two experiments that we discuss below highlight different aspects of the ap-
proach. The first experiment, which is presented in Section 3.2.1, demonstrates how a
population of autonomous agents can learn to communicate about entities that they
observe in their environment. Concretely, by taking part in meaningful and intentional
situated communicative interactions, the population converges on a communicatively
adequate linguistic convention. Through the processes of intention reading and pattern
finding, each individual agent builds up an inventory of holistic constructions that
associate atomic forms to concept representations that are grounded in their own sen-
sory endowment and experiences. The second experiment, presented in Section 3.2.2,
focuses on the acquisition of constructions that capture compositional patterns of higher
morpho-syntactic and semantic complexity. We consider a tutor-learner scenario in
which the learner agent has previously acquired an inventory of conceptual distinctions
that are grounded in the environment in which the agents operate, for example through
the mechanisms presented in the first experiment. Through the processes of intention
reading and pattern finding, the learner agent now bootstraps an inventory of item-
based and holistic constructions that they can use to ask and answer English questions
about their environment. The meaning side of these constructions is again grounded in
the agent’s personal endowment and experiences.
3.2.1 Experiment 1: Acquisition of Grounded Concepts. The first experiment illustrates how
a population of artificial agents that are equipped with sensors can establish a linguistic
convention that is adequate to refer to entities that they perceive in their surround-
ings. The emergent convention consists of holistic constructions that associate atomic
forms to concept representations that are grounded in the agents’ environment. Not
only does this experiment show how the acquisition of situationally grounded concept
representations through task-oriented communicative interactions can concretely be
operationalized in artificial agents, but it also shows how these concept representations
can emerge and evolve to satisfy the communicative needs of a community of language
users. The foundations of the methodological framework that is adopted date back to
pioneering work by Steels (1995), Batali (1998), and Oliphant (1999), among others. The
concrete experiment that we discuss was originally introduced by Botoko Ekila et al.
(2024a, b). A skeleton version of the methodology is presented below to better fit the
illustrative role of the experiment in this article.
In order to set up the grounded concept learning experiment, we first define a
population, a world, and an interaction script. The population consists of a number of
autonomous agents, say 10, which are each endowed with a sensory apparatus. The
world consists of a number of entities that are represented through feature vectors.
Each dimension of these vectors corresponds to a particular sensor that the agents are
endowed with. Imagine, for example, that the world consists of geometrical objects
and that the agents are endowed with 20 visual sensors. As such, the agents would
perceive individual objects along 20 dimensions, say the number of corners of an ob-
ject, its width-height ratio, its color channel values, its area, and its position on the
horizontal and vertical axes, among others. At the beginning of the experiment, agents
do not know any words or concepts. Any objects they might encounter in their en-
vironment are perceived as continuously valued feature vectors resulting from sensor
read-outs.
The agents in the population engage in a series of situated communicative inter-
actions that follow a structured interaction script. The script that is adopted in this
experiment, and which will be formally defined below, proceeds as follows. First, a
Downloaded from http://direct.mit.edu/coli/article-pdf/50/4/1277/2475979/coli_a_00534.pdf by guest on 25 August 2025
scene is created as a random subset of entities from the world. We will assume for now
that a scene consists of a minimum of three and a maximum of ten entities. Then, two
agents are drawn from the population and are randomly assigned the roles of speaker
and listener for the purposes of the current interaction. One entity from the scene is
randomly selected to be the topic of the interaction. As such, it will be the task of the
speaker to draw the attention of the listener to this specific entity in the scene. Obviously,
the identity of the topic is disclosed to the speaker only. The speaker then retrieves the
construction in its construction inventory that best combines entrenchment and discrim-
inative power for the topic with respect to other entities in the scene. If no adequate
construction exists, as will often be the case in the initial stages of the experiment, the
speaker invents a new construction that couples a concept representation, created based
on the observed feature vector, to a new word form, say demoxu. The form of the selected
construction is then uttered to the listener. If the listener already knows a construction
of which the form side matches the utterance, they identify the entity in the scene that
most closely matches the meaning side of the construction and point towards it. If the
listener does not know a construction of which the form side matches the utterance, they
indicate that they did not understand. In both cases, the speaker then provides feedback
to the listener by signaling communicative success or failure and by pointing towards
the topic entity. In case of success, both agents boost the entrenchment score of the
constructions that they have used and inhibit the score of any competing constructions,
that is, constructions that would also have discriminated the topic in the scene. At the
same time, the concept representations of the used constructions are shifted towards
the perceived feature vector. In case of failure, the speaker will inhibit the entrenchment
score of their used construction. If the listener knew a construction of which the form
side matches the observed utterance, they will inhibit its entrenchment score and shift
its meaning side towards the observed feature vector. If they did not know such a
construction, they will learn a new construction that couples the observed utterance
with a concept representation that is created based on the perceived feature vector. A
schematic overview of a situated communicative interaction of the grounded concept
learning experiment is presented in Figure 2. The figure shows an interaction in which
the speaker could successfully use the utterance walibu to draw the attention of the
listener to the topic entity in the scene.
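Before turning to the formal definition, the overall flow of one interaction can be summarized in a minimal Python sketch. The names used here (a world as a list of feature vectors, Agent objects with produce, interpret, and align methods) are illustrative assumptions rather than the authors' implementation.

```python
import random

def play_interaction(world, population, min_scene=3, max_scene=10):
    """One situated communicative interaction (illustrative sketch)."""
    # (A) Draw a random scene of three to ten entities from the world.
    scene = random.sample(world, random.randint(min_scene, max_scene))
    # (B) Draw two agents and assign them the speaker and listener roles.
    speaker, listener = random.sample(population, 2)
    # (C) Pick a topic entity; only the speaker gets to see it.
    topic = random.choice(scene)
    # (D) The speaker conceptualizes the topic and utters a form,
    #     inventing a new construction if none discriminates the topic.
    utterance = speaker.produce(topic, scene)
    # (E) The listener points to the entity that best matches the
    #     construction it knows for the utterance, or signals failure.
    guess = listener.interpret(utterance, scene)
    # (F) The speaker signals success or failure and points to the topic.
    success = guess is topic
    # (G) Both agents align their inventories based on the outcome.
    speaker.align(utterance, topic, scene, success)
    listener.align(utterance, topic, scene, success)
    return success
```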
Formally, an experiment E = (W, P, G) is defined as a tuple that groups a world
W = {e1, . . . , em} comprising m entities, a population P = {a1, . . . , ak} of k agents, and
a sequence G = (g1, . . . , gi) of i situated communicative interactions. Each agent is equipped
with a set of l sensors S = {s1, . . . , sl} and is initialized with its own empty construction
inventory Ia = {}. Each entity in the world is represented by means of a continuously
valued l-dimensional feature vector X.
Figure 2
Schematic overview of a successful situated communicative interaction during the grounded
concept learning experiment. A scene comprising three entities is drawn from the world (A).
Two agents are drawn from the population and assigned the roles of speaker and listener (B).
A topic entity is drawn from the scene and disclosed to the speaker only (C). The speaker selects
the construction from its inventory with the highest communicative adequacy for the topic in the
scene. In this case, the speaker identifies the WALIBU-CXN-S and utters its form walibu (D). The
listener observes the produced utterance. In this case, the listener has previously acquired a
construction with the form walibu and retrieves the entity from the scene that matches its
meaning side most closely. The listener points to this entity (E). The speaker signals success (F).
Both speaker and listener boost the entrenchment score of the constructions they used, shift their
meaning side towards the observation, and inhibit the score of competing constructions (G).
The dimensions of X correspond to properties of the entities that can be recorded by the agents' corresponding sensors.6
A holistic construction w = ( f, c, s) is defined as a pairing between a form f ∈ F,
a concept representation c, and an entrenchment score s. Entrenchment scores are
bounded between a minimum of 0 and a maximum of 1. F is an infinite set of possible
forms. Concept representations are each defined as a sequence of l tuples c = ((ω1 , µ1 ,
σ1 ), . . . , (ωl , µl , σl )) that group three numerical values: ω, µ, and σ. Each tuple relates to
a feature channel of one sensor with which the agents are endowed. The weight value
ωi represents the importance of the feature channel for the concept, the mean value µi
holds the prototypical value for the concept on this channel, and σi holds the standard
deviation for the concept on this channel. Concepts are thus represented as a sequence
of normal distributions. Each distribution relates to a feature channel and the relevance
of the channel for the concept is indicated by a weight value.
6 In order to fit the illustrative purposes of this article, we present a simplified version of the original
methodology. We assume here that all agents are equipped with the same set of calibrated sensors.
We kindly refer the reader to Botoko Ekila et al. (2024a, 2024b) for a more elaborate version of the
methodology that successfully accommodates heteromorphic populations, perceptual differences, and
even sudden sensor defects.
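The definitions above translate naturally into simple data structures. The following sketch, in which all class and field names are our own, encodes a concept as a list of weighted normal distributions, one per feature channel; the count and m2 fields are bookkeeping for the online updates discussed further below.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Channel:
    """One feature channel of a concept: a weighted normal distribution."""
    weight: float      # omega: relevance of the channel for the concept
    mean: float        # mu: prototypical value on this channel
    stdev: float       # sigma: standard deviation on this channel
    count: int = 1     # number of observations incorporated so far
    m2: float = 0.0    # running sum of squared deviations (for Welford updates)

@dataclass
class Construction:
    """Holistic construction: an atomic form paired with a grounded concept."""
    form: str                # e.g. "demoxu"
    concept: List[Channel]   # one Channel per sensor (l channels in total)
    score: float = 0.5       # entrenchment score, bounded between 0 and 1

@dataclass
class Agent:
    """An agent with l sensors and an initially empty construction inventory."""
    sensors: int
    inventory: List[Construction] = field(default_factory=list)
```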
The interaction script according to which each communicative interaction g ∈ G
takes place is formalized as follows:
(A) Scene selection. n entities are drawn from the world to constitute a scene: C =
{e1, . . . , en} ⊆ W.
(B) Agent and role selection. A speaker S ∈ P and a listener L ∈ P are drawn from the
population P.
(C) Topic selection. The topic entity T ∈ C is drawn from the scene C and disclosed to
the speaker S.
(D) Conceptualization and production. The speaker considers all constructions in its
inventory that discriminate the topic in the scene, that is all w = ( f, c, s) ∈ IS for which
holds that the similarity between their concept representation c and the feature vector
X for the topic entity T is higher than the similarity between c and the feature vector
for any other entity in the scene e ∈ C. The similarity between a construction’s concept
representation c and a feature vector X is computed according to Equation (1), taking
into account the similarity on each feature channel as well as the weight value of each
feature channel in the construction’s concept representation.
\[
\mathrm{sim}(c, X) \;=\; \underbrace{\sum_{i=1}^{l}}_{\text{sum over channels}} \; \underbrace{\frac{\omega_i}{\sum_{j=1}^{l} \omega_j}}_{\text{weight}} \; \underbrace{\exp\!\left(-\left|\frac{x_i - \mu_i}{\sigma_i}\right|\right)}_{\text{similarity}} \qquad (1)
\]
If multiple discriminating constructions exist in the construction inventory of the
speaker, the construction with the highest communicative adequacy is selected. The
communicative adequacy reflects both the degree of entrenchment of the construction
and the extent to which it discriminates the topic entity in the scene. The idea is that
constructions that are more entrenched and better discriminate the topic are more likely
to lead to communicative success. The communicative adequacy is concretely computed
by multiplying the entrenchment score s of the construction with the difference in
similarity between the construction’s concept representation c and the feature vector
for the topic on the one hand, and the similarity between c and the feature vector for
the closest other entity in the scene on the other. The form f of the construction with
the highest communicative adequacy is then uttered by S as U and shared with L. If no
discriminating constructions exist in the construction inventory of the speaker IS , a new
construction w = (f, c, s) is added, with f being randomly selected from the infinite set
of forms F, s being assigned a default initial value, and c = ((ω1 , µ1 , σ1 ), . . . , (ωl , µl , σl ))
being initialized with the exact values of the feature vector X for µ1 . . . µl , with a default
initial value for σ1 . . . σl , and with a default initial value for ω1 . . . ωl . The form f of this
construction is then uttered as U and shared with L.
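As an illustration of how step (D) could be operationalized, the sketch below implements Equation (1), the discrimination test, the communicative adequacy score, and the invention of a new construction. It assumes the Channel and Construction classes from the previous sketch, treats each entity simply as a sequence of floats, and uses helper names and default values that are our own.

```python
import math
import random
import string

def similarity(concept, x):
    """Equation (1): weighted similarity between a concept and a feature vector x."""
    total_weight = sum(ch.weight for ch in concept) or 1e-9
    return sum((ch.weight / total_weight)
               * math.exp(-abs((x[i] - ch.mean) / max(ch.stdev, 1e-9)))
               for i, ch in enumerate(concept))

def discriminating(inventory, topic, scene):
    """Constructions whose concept is closer to the topic than to any other entity."""
    others = [e for e in scene if e is not topic]
    return [cxn for cxn in inventory
            if all(similarity(cxn.concept, topic) > similarity(cxn.concept, e)
                   for e in others)]

def adequacy(cxn, topic, scene):
    """Communicative adequacy: entrenchment times the similarity margin between
    the topic and the closest other entity in the scene."""
    others = [e for e in scene if e is not topic]
    margin = (similarity(cxn.concept, topic)
              - max(similarity(cxn.concept, e) for e in others))
    return cxn.score * margin

def produce(agent, topic, scene):
    """Step (D): select the most adequate construction, or invent a new one."""
    candidates = discriminating(agent.inventory, topic, scene)
    if candidates:
        return max(candidates, key=lambda c: adequacy(c, topic, scene)).form
    new_cxn = Construction(
        form="".join(random.choices(string.ascii_lowercase, k=6)),  # e.g. "demoxu"
        concept=[Channel(weight=0.5, mean=v, stdev=0.01) for v in topic],
        score=0.5)
    agent.inventory.append(new_cxn)
    return new_cxn.form
```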
(E) Comprehension and interpretation. If a construction with the form U exists in the
construction inventory of the listener, that is, w = (U, c, s) ∈ IL , L points to the entity
in the scene e ∈ C of which the feature vector X is most similar to c, according to the
similarity metric defined in Equation (1). If no such construction exists in IL , L signals
that they could not understand U.
(F) Feedback. S signals success if L correctly identified T. Otherwise, S signals failure,
and provides feedback to L by pointing to T.
(G) Alignment. If the communicative interaction g was successful, both the speaker S
and the listener L will increase the entrenchment scores sS and sL of the constructions
they have used, namely, w = (U, cS , sS ) ∈ IS and w = (U, cL , sL ) ∈ IL , by a fixed value. S
and L also shift the concept representations cS and cL of these constructions towards
the feature vector XT for topic entity T given the context C. They do that by updating
µi and σi on each feature channel using Welford’s online algorithm (Welford 1962) to
incorporate XT. As for the channel weights ω1 . . . ωl, the subset of channels with the
highest discriminative power for T in relation to C is selected from all channels with
positive discriminative power. The weights on the channels in this set are increased by
a fixed step on a sigmoid function. The weights on all other channels are decreased
by a fixed step on the same function. Both S and L also decrease the entrenchment
score of any competing constructions, defined as other constructions in their inventories
that have a positive discriminative power for T in C. The decrease in score for these
constructions is proportional to how similar their concept representations are to the
concept representation of the used construction. In other terms, constructions with more
similar concept representations are punished harder as they are considered stronger
competitors. The similarity of two concept representations is computed according to
Equation (2) and takes into account the similarity of the distributions on each channel,
the similarity of the weights on each channel, and the average of the weights on each
channel. This last component is included to reflect that similarities and differences on
channels with high weights are more meaningful to the overall concept similarity than
those on channels with low weights.
\[
\mathrm{sim}(c_q, c_r) \;=\; \underbrace{\sum_{i=1}^{l}}_{\text{sum over channels}} \Bigg[\, \underbrace{\Big(1 - H\big(\mathcal{N}(\mu_{qi}, \sigma_{qi}^2),\, \mathcal{N}(\mu_{ri}, \sigma_{ri}^2)\big)\Big)}_{\text{Hellinger similarity of distributions}} \cdot \underbrace{\left(1 - \left|\frac{\omega_{qi}}{\sum_{k=1}^{l} \omega_{qk}} - \frac{\omega_{ri}}{\sum_{k=1}^{l} \omega_{rk}}\right|\right)}_{\text{similarity of weights}} \cdot \underbrace{\frac{1}{2}\!\left(\frac{\omega_{qi}}{\sum_{k=1}^{l} \omega_{qk}} + \frac{\omega_{ri}}{\sum_{k=1}^{l} \omega_{rk}}\right)}_{\text{average of weights}} \Bigg] \qquad (2)
\]
If the communicative interaction g was not successful, S will decrease the entrenchment
score sS by a fixed value. If a construction with the form U existed in the construction
inventory of L, that is, w = (U, cL , sL ) ∈ IL , L will decrease its entrenchment score sL
by a fixed value and shift its concept representation cL towards XT given C. If no
such construction existed, a new construction w = (U, c, s) is added to IL , with s being
assigned a default initial value, and c = ((ω1 , µ1 , σ1 ), . . . , (ωl , µl , σl )) being initialized
with the exact values of the topic’s feature vector X for µ1 . . . µl , with a default initial
value for σ1 . . . σl , and with a default initial value for ω1 . . . ωl .
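The alignment dynamics of step (G) can likewise be sketched. The snippet below, which again builds on the Channel and Construction classes introduced earlier, shows a Welford-style update of the concept distributions, the sigmoid-based reward and punishment of channel weights, a concept similarity in the spirit of Equation (2), and the lateral inhibition of competing constructions after a successful interaction. Function names and numerical safeguards are our own.

```python
import math

def welford_update(ch, x):
    """Shift a channel's mean and standard deviation towards an observed
    value x using Welford's online algorithm (Welford 1962)."""
    ch.count += 1
    delta = x - ch.mean
    ch.mean += delta / ch.count
    ch.m2 += delta * (x - ch.mean)
    ch.stdev = math.sqrt(ch.m2 / ch.count)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x / 2.0))

def inverse_sigmoid(w):
    w = min(max(w, 1e-6), 1.0 - 1e-6)
    return -2.0 * math.log(1.0 / w - 1.0)

def step_weight(ch, step):
    """Reward (positive step) or punish (negative step) a channel weight by
    moving it a fixed step along the input axis of the sigmoid function."""
    ch.weight = sigmoid(inverse_sigmoid(ch.weight) + step)

def hellinger(mu_q, s_q, mu_r, s_r):
    """Hellinger distance between two univariate normal distributions."""
    s_q, s_r = max(s_q, 1e-9), max(s_r, 1e-9)
    bc = (math.sqrt(2 * s_q * s_r / (s_q ** 2 + s_r ** 2))
          * math.exp(-((mu_q - mu_r) ** 2) / (4 * (s_q ** 2 + s_r ** 2))))
    return math.sqrt(max(0.0, 1.0 - bc))

def concept_similarity(cq, cr):
    """Equation (2): channel-wise product of distribution similarity,
    weight similarity, and average normalized weight."""
    wq = sum(ch.weight for ch in cq) or 1e-9
    wr = sum(ch.weight for ch in cr) or 1e-9
    total = 0.0
    for q, r in zip(cq, cr):
        dist_sim = 1.0 - hellinger(q.mean, q.stdev, r.mean, r.stdev)
        weight_sim = 1.0 - abs(q.weight / wq - r.weight / wr)
        avg_weight = (q.weight / wq + r.weight / wr) / 2.0
        total += dist_sim * weight_sim * avg_weight
    return total

def align_success(agent, used_cxn, topic, competitors,
                  boost=0.1, inhibition=0.02):
    """Alignment after a successful interaction: entrench the used construction,
    shift its concept towards the topic, and inhibit competitors in proportion
    to how similar their concepts are to the used concept."""
    used_cxn.score = min(1.0, used_cxn.score + boost)
    for i, ch in enumerate(used_cxn.concept):
        welford_update(ch, topic[i])
    for comp in competitors:
        penalty = inhibition * concept_similarity(comp.concept, used_cxn.concept)
        comp.score = max(0.0, comp.score - penalty)
```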
The grounded concept learning methodology is directly applicable to any scenario
where a population of agents needs to learn to communicate about entities that can
be represented by means of continuously valued feature vectors. These vectors can
originate from robotic simulations as referred to above or be derived from any tabular
dataset that stores its entries in terms of continuously valued attributes. Botoko Ekila
et al. (2024a) present three scenarios that show the broad applicability of the methodol-
ogy on the one hand, and that illustrate the grounding of the agents’ emergent linguistic
knowledge in the communicative task and situated environment on the other. In the
first scenario, referred to as CLEVR, the scenes in which the interactions take place are
based on the images from the CLEVR dataset (Johnson et al. 2017). Each synthesized 3D
image depicts three to ten geometrical objects that vary in shape, size, color, shininess,
and 3D position. The conversion from images to scenes is performed using the visual
preprocessing procedure described by Nevens, Van Eecke, and Beuls (2020), resulting,
per scene, in a set of 20-dimensional feature vectors that each represent one object in the
original image. The 20 dimensions of the vectors correspond to human-interpretable
visual properties, such as the object’s area, color channel values, position on the x and y
axes, number of edges, and width-height ratio. A total of 1,000,000 training scenes and
100,000 test scenes were sampled from the original train and test splits, respectively.
The scenes of the second scenario, referred to as WINE, were created based on the 4,898
physicochemical analyses of wine samples included in the Wine Quality dataset (Cortez
et al. 2009). Each wine sample is described through an 11-dimensional feature vector
in which each dimension corresponds to a particular physicochemical property of the
sample, including for instance its acidity, amount of residual sugar, alcohol content, and
amount of sulphates. Ninety percent of the wine samples were used to create 1,000,000
training scenes that each hold three to ten samples and the remaining 10% were used to
create 100,000 test scenes in the same way. The scenes of the third scenario, referred to
as CREDIT, were created based on the 284,807 financial transaction records included
in the Credit Card Fraud Detection dataset (Dal Pozzolo et al. 2014). Each financial
transaction is described along 28 dimensions that result from a principal component
analysis. Again, 90% of the financial transaction records were used to create 1,000,000
training scenes that each hold three to ten records and the remaining 10% were used
to create 100,000 test scenes. During the scene creation process, it was ensured that no
scene holds the exact same entity more than once and that no entity that appears in a
training scene is part of a test scene.
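To illustrate how such scenes could be derived from any tabular dataset, the following sketch splits the entities into training and test portions and samples scenes of three to ten distinct entities. It is an illustration under our own naming; in the CLEVR scenario the scenes are instead obtained from preprocessed images.

```python
import random

def make_scenes(entities, n_scenes, min_size=3, max_size=10, seed=0):
    """Sample scenes of three to ten distinct entities from a list of feature vectors."""
    rng = random.Random(seed)
    scenes = []
    for _ in range(n_scenes):
        size = rng.randint(min_size, max_size)
        scenes.append(rng.sample(entities, size))  # sample() never repeats an entity
    return scenes

def train_test_scenes(entities, n_train=1_000_000, n_test=100_000, seed=0):
    """90% of the entities feed the training scenes, the held-out 10% feed the
    test scenes, so no entity from a training scene ever appears at test time."""
    rng = random.Random(seed)
    shuffled = list(entities)
    rng.shuffle(shuffled)
    split = int(0.9 * len(shuffled))
    train_scenes = make_scenes(shuffled[:split], n_train, seed=seed)
    test_scenes = make_scenes(shuffled[split:], n_test, seed=seed + 1)
    return train_scenes, test_scenes
```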
For all three scenarios, the parameters in the formal definition were set as follows.
The population consists of 10 agents (k = 10) and each scene consists of a minimum
of three and a maximum of 10 entities (3 ≤ n ≤ 10). Constructions initially receive
an entrenchment score of 0.5. After a successful interaction, the scores of the used
constructions are increased by 0.1 and the scores of their competitors are decreased
by 0.02 · sim(cq, cr). After a failed interaction, the scores of the used constructions are
decreased by 0.1. When it comes to the concept representations, initial channel weights
(ω) are set to 0.5 and initial standard deviations (σ) to 0.01. Channel weights are
rewarded or punished with a step of +1 and −5, respectively, on the sigmoid function
$\frac{1}{1+e^{-x/2}}$. Each experimental run consists of 1,000,000 communicative interactions, that
is, one interaction for each scene in the training sets.
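For reference, these settings can be grouped into a single configuration object, as in the illustrative sketch below; the field names are our own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    population_size: int = 10            # k agents
    min_scene_size: int = 3              # 3 <= n <= 10 entities per scene
    max_scene_size: int = 10
    initial_score: float = 0.5           # initial entrenchment score
    score_boost: float = 0.1             # used constructions, successful interaction
    competitor_inhibition: float = 0.02  # scaled by concept similarity
    score_penalty: float = 0.1           # used constructions, failed interaction
    initial_weight: float = 0.5          # initial channel weight (omega)
    initial_stdev: float = 0.01          # initial standard deviation (sigma)
    weight_reward_step: float = 1.0      # step on the sigmoid for rewarded channels
    weight_penalty_step: float = -5.0    # step on the sigmoid for punished channels
    n_interactions: int = 1_000_000      # one interaction per training scene
```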
As the experimental runs unfold, the typical learning dynamics known from the
language evolution literature (see, e.g., Blythe and Croft 2012) manifest themselves
in all three scenarios. Figure 3 illustrates these dynamics for the case of the CLEVR
experiment. The x axis represents the time dimension in terms of the number of communicative interactions that have taken place.
Figure 3
Evolutionary dynamics during the training phase of the CLEVR experiment, showing the degree
of success on the communicative task, the level of conventionality of the emergent language, and
the average number of constructions in active use as a function of the number of communicative
interactions that have taken place. Mean and 2 standard deviations computed over 10
independent experimental runs with populations of 10 agents. This graph has been created
based on experimental data from Botoko Ekila et al. (2024a).
The solid line indicates, on the left y axis, the average degree of communicative success over the last 1,000 interactions. An interaction
is counted as successful if the listener has indeed been able to identify the topic entity
during the comprehension and interpretation step of the interaction script. Given that
the construction inventories of the agents are empty at the beginning of the experiment,
the degree of communicative success necessarily starts at 0. After 50,000 interactions, a
degree of communicative success of about 90% is reached. This number then continues
to increase, with the agents successfully communicating in 99.5% of the interactions
after the 1,000,000 interactions of the experiment have taken place. The dashed line
indicates, also on the left y axis, the average level of conventionality of the emergent
language over the last 1,000 interactions. An interaction is counted as linguistically
conventional if the listener would in principle have produced the same utterance if they
would have been the speaker. Like the degree of communicative success, the level of
conventionality will always start at 0. It increases at a somewhat slower pace than the
degree of communicative success and reaches about 90% after 1,000,000 interactions.
The dashdotted line shows, on the right y axis, the average number of distinct construc-
tions used by the agents during the last 1,000 interactions in which they participated
as the speaker. In the earlier stages of the experiment, a wide variety of concept rep-
resentations and words emerges in the population, peaking at about an average of 90
constructions per agent after 50,000 interactions. As the emergent language becomes
more conventional, the number of constructions in active use starts to decline, with an
average of just over 50 constructions per agent after 1,000,000 interactions.
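The monitored quantities can be computed with running windows over the most recent interactions. The sketch below, which reuses the discriminating and adequacy helpers introduced earlier, shows one possible way to record communicative success and conventionality; it is an illustration rather than the authors' monitoring code.

```python
from collections import deque

class Monitor:
    """Running averages over the last `window` interactions (illustrative)."""

    def __init__(self, window=1000):
        self.success = deque(maxlen=window)
        self.conventional = deque(maxlen=window)

    def record(self, listener, utterance, topic, scene, success):
        self.success.append(1.0 if success else 0.0)
        # An interaction counts as conventional if the listener would have
        # produced the same utterance had they been the speaker.  Only existing
        # constructions are consulted; nothing is invented or updated here.
        candidates = discriminating(listener.inventory, topic, scene)
        best = max(candidates, key=lambda c: adequacy(c, topic, scene), default=None)
        self.conventional.append(1.0 if best is not None and best.form == utterance else 0.0)

    def report(self):
        mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
        return {"communicative success": mean(self.success),
                "conventionality": mean(self.conventional)}
```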
After running the experiments for 1,000,000 interactions, the emerged conventions
are evaluated against the test portions of the datasets. The interaction script remains the
Table 1
Results of the CLEVR, WINE, and CREDIT experiments after evaluation on the test sets in terms
of degree of communicative success, level of conventionality, and construction inventory size.
The reported values represent the mean and 2 standard deviations computed over 10
independent experimental runs.
Scenario    Communicative success    Conventionality    Inventory size
CLEVR       99.65 (± 0.13)           93.86 (± 1.09)     46.72 (± 2.45)
WINE        99.74 (± 0.15)           88.67 (± 1.92)     52.67 (± 2.93)
CREDIT      99.67 (± 0.13)           87.72 (± 2.50)     51.43 (± 2.49)
same, with the additional constraints that agents can no longer create new constructions
to add to their inventories and that construction scores and concept representations are
no longer updated. The test results are reported in Table 1 for all three scenarios. Degrees
of communicative success of over 99.5% are achieved in each scenario, validating that
the emergent languages indeed generalize to previously unseen entities and scenes.
Levels of conventionality range from 87.72% for CREDIT, over 88.67% for WINE, to
93.86% for CLEVR. These numbers show that the emergent languages are not only
effective at solving the communicative tasks, but that the agents in the population also
manage to effectively align their linguistic systems. The average construction inventory
size of the agents ranges from 46.72 in the case of CLEVR, over 51.43 for CREDIT, to 52.67 in the case of WINE.
While further analysis would be required to provide a more theoretically substantiated
interpretation of the resulting construction inventory sizes, the reported numbers at
least corroborate the finding that the agents converge on a manageable set of widely
applicable concepts.
The emergent conventions consist of holistic constructions that associate linguistic
forms with grounded concept representations. Figure 4 shows, for each scenario, a con-
struction from the inventory of one agent that has reached the maximum entrenchment
score of 1.0 after 1,000,000 interactions. The DEMOXU-CXN shown in Figure 4a emerged
during the CLEVR experiment. It associates the form demoxu to a concept representation
with three relevant dimensions, that is, dimensions with a strictly positive weight value
ω. The area dimension represents the number of pixels within an object’s perimeter,
normalized on a scale from 0 to 1. The bb-area dimension represents the normalized
number of pixels within an object’s rectangular bounding box. Finally, the rel-area
dimension represents the ratio between the number of pixels within an object’s perime-
ter and the total number of pixels in the image. Converting the normalized values
back to raw pixel counts, the means and standard deviations on the three dimensions
indicate that the concept representation of the DEMOXU-CXN prototypically represents
entities with 1,344 pixels within their perimeter (σ = 76.8 pixels), 1,574 pixels within their
rectangular bounding box (σ = 115 pixels), and which cover about 1% of the image. An
analysis of the agent’s communicative behavior reveals that they use the construction
73% of the time to refer to small objects with a spherical shape. These are indeed objects
that cover a relatively small area of the image and fill about 80% of their bounding
box. The ZAPOSE-CXN shown in Figure 4b emerged during the WINE experiment. It
associates the form zapose to a concept representation with a single relevant dimension,
namely, the normalized residual sugar content of a wine sample. The prototypical value
on this dimension corresponds to 12.34 g/l (σ = 1.39 g/l), a residual sugar content typi-
cally associated with medium dry wines.
[Figure 4 diagrams: (a) DEMOXU-CXN, form "demoxu", s: 1.0, with channels area (ω: 1.0, μ: 0.035, σ: 0.002), bb-area (ω: 1.0, μ: 0.041, σ: 0.003), and rel-area (ω: 1.0, μ: 0.009, σ: 0.001); (b) ZAPOSE-CXN, form "zapose", s: 1.0, with channel residual-sugar (ω: 1.0, μ: 0.137, σ: 0.012); (c) ZISENI-CXN, form "ziseni", s: 1.0, with channel V2 (ω: 1.0, μ: 0.208, σ: 0.008).]
Figure 4
Examples of constructions that emerged in the CLEVR (a), WINE (b), and CREDIT (c) scenarios.
The constructions associate atomic labels on their form side with situationally grounded and
communicatively motivated concept representations on their meaning side. Figure adapted from
Botoko Ekila et al. (2024b).
The ZISENI-CXN shown in Figure 4c emerged during the CREDIT experiment and associates the form ziseni to a low value range on the second PCA component of the transaction records.
In sum, the grounded concept learning experiment shows how a population of
agents that take part in situated communicative interactions can self-organize a lin-
guistic convention that can be used to refer to arbitrary entities in their environment.
The convention consists of holistic constructions that associate atomic forms to concept
representations. The meaningful, intentional, interactional, and communicative nature
of the learning environment facilitated the establishment of a linguistic convention that
is not only grounded in the (non-textual) environment of the agents, but which is also
motivated and shaped by their communicative needs. While we have focused on the
emergence of such a convention, agents that are added to the population at a later stage
can acquire the established convention using the exact same adoption and alignment
mechanisms while they take part in communicative interactions with more “mature”
agents. As such, also existing natural languages can be learned in a constructivist
manner through interactions with “tutor” agents that already master these languages,
as will be demonstrated in the second experiment.
3.2.2 Experiment 2: Acquisition of Grammatical Structures. Through the second experiment,
we aim to illustrate how linguistic structures of a higher morpho-syntactic and semantic
complexity can be acquired in a communicative and situationally grounded learning
environment, thereby transcending the level of atomic word forms and concept rep-
resentations addressed in the first experiment. Concretely, we focus on a tutor-learner
scenario in which the learner agent needs to acquire in a constructivist manner a linguis-
tic system that is adequate to ask and answer English questions about the environment
in which the agents are situated. The experiment, which has been extensively discussed
by Nevens et al. (2022) and Doumen, Beuls, and Van Eecke (2023), is set up as follows.
The population consists of two agents, one being the tutor and the other being the
learner. The communicative interactions between the two agents take place in randomly
selected scenes from the CLEVR visual question answering dataset (Johnson et al. 2017).
Each communicative interaction starts with the tutor asking an English question that
is provided by the dataset for the selected scene. The task of the learner is to answer
the question asked by the tutor. The learner starts without any linguistic knowledge
apart from conceptual distinctions, such as different colors, materials, or sizes, which
are assumed to have been acquired previously through a grounded concept learning
experiment. The learner also possesses the ability to perform a number of primitive
cognitive operations. These primitive operations can be thought of as the instruction
set on which all complex reasoning processes need to build. In general, primitive
operations can be cognitively inspired, rooted in theory of computation, or follow from
a practical constraint such as a robot’s API specification. In our experiment, we provide
access to a number of basic operations, such as segmenting a scene, counting the number
of elements in a set, computing unions and intersections of sets, querying attributes, and
filtering sets according to prototypes.
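A possible rendering of such an instruction set in Python is sketched below, where scenes are assumed to be lists of attribute dictionaries; the operation names mirror those used in the running example but are otherwise our own.

```python
def segment_scene(scene):
    """Return the set of objects that make up the scene."""
    return list(scene)

def filter_by(objects, attribute, prototype):
    """Keep the objects whose attribute matches a prototype, e.g. ('shape', 'cube')."""
    return [o for o in objects if o.get(attribute) == prototype]

def count(objects):
    """Count the number of elements in a set of objects."""
    return len(objects)

def query(objects, attribute):
    """Query an attribute of a single object, e.g. its color."""
    assert len(objects) == 1
    return objects[0][attribute]

def union(a, b):
    return a + [o for o in b if o not in a]

def intersect(a, b):
    return [o for o in a if o in b]

# The meaning hypothesized for "How many blocks are there?" then corresponds
# to the network [segment scene -> filter cube -> count]:
def how_many_blocks(scene):
    return count(filter_by(segment_scene(scene), "shape", "cube"))
```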
Imagine that during their very first communicative interaction, the learner observes
the question How many blocks are there?. The learner cannot understand the question and
receives the answer to the question, say 4, as feedback from the tutor. The learner then
starts the process of intention reading. Based on the feedback provided by the tutor,
the environment in which the utterance was observed, and the inventory of primitive
cognitive operations that the agent can perform, the agent constructs a hypothesis about
the intended meaning of the observed utterance. In order to do this, the agent searches
for a network of primitive operations, referred to as a procedural semantic network
(Woods 1968; Johnson-Laird 1977; Spranger et al. 2012; Verheyen et al. 2023), that, upon
evaluation with respect to the scene, leads to the answer that was provided by the tutor.
This network could, for example, consist of the operations [segment image → filter
cube → count]. The mapping between the observed utterance How-many-blocks-are-
there? and its hypothesized meaning [segment image → filter cube → count] is
then stored by the learner agent as a holophrastic construction. Imagine that the learner
later engages in a communicative interaction where they observe the utterance How
many spheres are there?. Again, the learner cannot understand the question and needs to
construct a hypothesis about the intended meaning of the utterance through the process
of intention reading. Upon feedback by the tutor, in this case 3, the learner constructs
the hypothesis [segment image → filter ball → count]. Based on the previously
acquired holophrastic construction that associates the form How-many-blocks-are-there?
with the meaning [segment image → filter cube → count] and the current obser-
vation of the utterance How-many-spheres-are-there? and its hypothesized meaning [seg-
ment image → filter ball → count], the learner can now use its pattern finding
abilities to construct a more general item-based construction that pairs the form How-
many-Xs-are-there? with the generalized meaning representation [segment image →
filter ?type → count] in which X and ?type, respectively, represent variable slots
on the form and meaning sides of the construction. Importantly, the construction also
captures that the referent of the filler of the X-slot on the form side will fill the ?type slot
on its meaning side. At the same time, the learner can acquire two holistic constructions
that pair the forms block and sphere to their respective meanings, in this case the concepts
of cube and ball. Along with the item-based construction and the two holistic construc-
tions, the agent also learns that the BLOCK-CXN and the SPHERE-CXN provide suitable
fillers for the X-slot in the HOW-MANY-XS-ARE-THERE-CXN. This information is added
to the agent’s categorial network, which models emergent grammatical categories as
links between constructional slots and their observed fillers, very much in the spirit
of Radical Construction Grammar (Croft 2001). Pattern finding is implemented as an
anti-unification-based generalization process (Plotkin 1970; Van Eecke 2018; Yernaux
and Vanhoof 2019). A schematic representation of the intention reading and pattern
finding processes involved in this example is shown in Figure 5. Note that the con-
structions that are learned constitute form-meaning mappings that can be used for both
language comprehension, that is, mapping from utterances to their meaning represen-
tations, and language production, that is, mapping from meaning representations to
utterances that express them.
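The following sketch gives a deliberately simplified flavor of these two processes. Intention reading is reduced to enumerating tiny [segment → filter → count] programs (reusing the primitive operations sketched above) until one reproduces the tutor's answer, and pattern finding is reduced to a token-wise anti-unification of two observations; real implementations search over full procedural semantic networks and anti-unify structured meaning representations.

```python
# --- Intention reading: search for a network of primitive operations whose
# evaluation against the current scene yields the answer given by the tutor.
# This toy version only enumerates [segment -> filter -> count] programs.
PROTOTYPES = ["cube", "ball", "cylinder"]

def intention_reading(scene, answer):
    for prototype in PROTOTYPES:
        if count(filter_by(segment_scene(scene), "shape", prototype)) == answer:
            return ("segment", ("filter", prototype), "count")
    return None  # no consistent meaning hypothesis found

# --- Pattern finding: generalize two observations into an item-based
# construction by replacing the positions in which they differ with slots.
def anti_unify(obs1, obs2):
    """Token-wise least general generalization of two (form, meaning) pairs."""
    (form1, meaning1), (form2, meaning2) = obs1, obs2
    general_form = tuple(a if a == b else "?X" for a, b in zip(form1, form2))
    general_meaning = tuple(a if a == b else "?type" for a, b in zip(meaning1, meaning2))
    return general_form, general_meaning

obs_blocks = (("how", "many", "blocks", "are", "there"),
              ("segment", ("filter", "cube"), "count"))
obs_spheres = (("how", "many", "spheres", "are", "there"),
               ("segment", ("filter", "ball"), "count"))
item_based_cxn = anti_unify(obs_blocks, obs_spheres)
# -> (("how", "many", "?X", "are", "there"), ("segment", "?type", "count"))
```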
As discussed in Section 1 in the context of human language acquisition, intention
reading is an abductive reasoning process that introduces uncertainty into the over-
all learning process. In this experiment, meaning hypotheses are guaranteed to be
consistent and plausible given the communicative interaction at hand, but might not
generalize to other situations. This uncertainty directly percolates to the holophrastic,
item-based, and holistic constructions that are learned.
Figure 5
Schematic overview of the acquisition of an item-based construction and two holistic
constructions through the processes of intention reading and pattern finding during a situated
communicative interaction. An agent observes a question about its environment (A), cannot
understand it, and receives the answer to the question as feedback (B). Based on its inventory of
primitive operations (C), the agent uses its intention reading capabilities to make a hypothesis
about the intended meaning of the observed utterance, which, upon evaluation with respect to
the environment, is consistent with the feedback (D). Based on the observed utterance, the
meaning hypothesis and a similar yet not identical construction that was previously acquired
(E), the agent then applies its pattern finding capabilities to learn an item-based construction and
two holistic constructions (F), as well as categorial relations between the constructional slots and
their observed fillers (G).
The evolutionary dynamics that overcome this uncertainty are modeled on the level of constructions through similar
alignment dynamics as those discussed in Section 3.2.1 in the context of the grounded
concept learning experiment. Concretely, all constructions carry an entrenchment score,
which reflects the frequency of their past successes and failures in communication.
During language comprehension and production, constructions with a higher score
will be preferred over less entrenched constructions. At the end of a successful com-
municative interaction, the learner agent increases the score of the constructions that
were used and decreases the score of their competitors (i.e., other constructions that
would have led to successful communication). In the case of a failed interaction, the
scores of the constructions that were used are decreased. As a consequence of these
evolutionary dynamics, constructions that are applicable in a wider range of situations
gradually gain the upper hand over constructions that either compete with them or hurt
communication rather than support it. Not only does this provide a way to overcome
suboptimal hypotheses resulting from the intention reading process, it also causes more
general and abstract constructions to ultimately prevail over more specific ones.
The learning dynamics of the experiment are visualized in Figure 6, where the
degree of communicative success and the construction inventory size are plotted as
a function of the number of communicative interactions that have taken place. The
degree of communicative success starts at 0 and reaches 1 after about 25,000 interactions.
The construction inventory size exhibits the typical “overshoot pattern” that was also
found in the grounded concept learning experiment. In the earlier stages of the experi-
ment, many new constructions are learned. These constructions are often holophrastic
constructions or item-based constructions with few variable slots.
[Figure 6 plot: communicative success (left y axis, 0.0–1.0) and construction inventory size (right y axis, 0–1,200) as a function of the number of communicative interactions (x axis, 0–250K).]
Figure 6
Learning dynamics of the experiment on the acquisition of grammatical structures through
intention reading and pattern finding during situated communicative interactions. The degree of
communicative success and the construction inventory size are plotted as a function of the
number of communicative interactions that have taken place. This graph has been created based
on experimental data from Nevens et al. (2022).
Moreover, many constructions result from pattern finding operations over suboptimal meaning hypotheses. As more and more interactions take place in a variety of environments, the
evolutionary dynamics of strengthening and weakening constructions based on their
successful or unsuccessful use in communication ensures that suboptimal constructions
gradually disappear from the construction inventory of the learner agent.
In sum, the grammar acquisition experiment shows how linguistic structures that
transcend the level of atomic word forms and concept representations can be acquired
through the processes of intention reading and pattern finding during situated com-
municative interactions. The linguistic knowledge that is built up consists in an inven-
tory of constructions at different levels of abstraction. These form-meaning mappings
correspond to syntactico-semantic generalizations over the compositional and non-
compositional aspects of language use observed in meaningful, intentional, and situated
environments. The meaning side of the acquired constructions is thereby grounded in
the physio-cognitive endowment of the agents, their shared environment, and their
communicative tasks and intentions. The evolutionary dynamics of strengthening and
weakening constructions based on their successful or unsuccessful use in communica-
tion provides a way to overcome the uncertainty involved in working out the underly-
ing meanings and intentions of linguistic utterances.
The two experiments discussed in this section illustrate that the linguistic systems
built up by artificial agents during situated communicative interactions are fundamen-
tally different from language models learned through a text prediction task. For one
thing, the agents’ linguistic knowledge is directly grounded in their physio-cognitive
endowment and environment. The constructions they acquire are motivated by their
function in solving communicative tasks rather than by the observed collocational
behavior of linguistic forms. For another, the processes of language comprehension
and production no longer rely on (sub)word prediction, but on finding combinations of
constructions that optimally map from linguistic forms to their meaning representations
and vice versa. Due to their grounded nature, these meaning representations can be
conceptualized and interpreted with respect to the communicative tasks and environ-
ment of the agents. As a consequence, the limitations of LLMs that result from their
generative nature do not apply as such to the linguistic systems that are acquired in this
setting. The agents do not hallucinate in the sense that LLMs do. They might well fail to
achieve communicative success, but the utterances they produce are always motivated
by more factors than the collocational behavior of the linguistic forms that they contain.
The situated and interactional nature of the learning environment ensures that the
constructions learned by the agents are communicatively motivated, which will ulti-
mately be required to draw human-like logical and pragmatic inferences (see Section 2).
The richness of the learning environment also lifts the constraint that all aspects of
meaning, reasoning, and world knowledge need to be learned through the intermediary
of their effect on the distribution of (sub)words, bearing the promise to avoid the need
for exposure to hundreds of billions of tokens. Finally, the linguistic systems that are
acquired will still mirror the biases present in the learning environment, but these will
likely be easier to mitigate than those resulting from uncuratable amounts of textual
data. By no means is this discussion intended to dismiss the impressive results obtained
through the LLM paradigm, nor would we claim that achieving similar results through
agent-based simulations is within reach in the near future. We do argue, however, that
more faithfully modeling the situated, communicative, and interactional environments
in which human languages are acquired provides a promising path to overcome the
limitations of systems that essentially rely on the distributional hypothesis, as powerful
as it might be.
4. Conclusion
The primary argument of this article has been that the way in which humans acquire
their native languages is fundamentally different from the way in which LLMs learn
their linguistic capabilities. We have argued that these differences have consequential
repercussions on the linguistic knowledge that is built up, as well as on how this knowl-
edge can be used to drive the processes of language comprehension and production.
We have highlighted the fact that humans learn language by actively taking part
in meaningful and intentional communicative interactions that are situated in their ev-
eryday environment. During these communicative interactions, they make hypotheses
about the meanings and communicative intentions that underlie the utterances they
perceive, relying on clues provided by the communicative environment in which the
interactions take place. Constructions of different degrees of abstraction are learned
as syntactico-semantic generalizations over combinations of perceived utterances, their
reconstructed meanings and intentions, and previously acquired constructions. As such,
humans rely on their situation-based intention reading and syntactico-semantic pattern
finding capabilities to bootstrap a productive linguistic system that is tied to their own
physical and cognitive endowment, grounded in their environment, shaped by their
past experiences, and motivated by their communicative needs.
By contrast, LLMs acquire their linguistic capabilities by learning to predict
(sub)words based on the textual environment in which they appear. They meticulously
capture the (sub)words’ collocational behavior in a fine-grained manner and on an
inconceivable scale. Stretching the distributional hypothesis to previously unimagin-
able dimensions, they are able to generate texts that do not only exhibit human-like
lexical and morpho-syntactic correctness, fluency, and eloquence, but that also exhibit
a remarkable discourse coherence and semantic adequacy. The models even seem to be
capable of solving natural language understanding tasks that require substantial non-
linguistic reasoning abilities.
However, the fact that LLMs are learned in the absence of meaningful and inten-
tional situated communicative interactions lies at the root of a number of inherent limi-
tations. A first limitation concerns their so-called hallucinations. Due to their grounding
in text generation, by which they use collocation-based patterns and structures observed
in input texts to generate output texts that exhibit similar patterns and structures, they
seamlessly mix fact and fiction in the output they produce. A second limitation concerns
their struggle to perform human-like logical and pragmatic reasoning. For one thing, this
struggle is ascribable to a lack of situational grounding during the training process, by
which the distribution of objects and events in the world about which they need to
reason differs from the distribution of (sub)words in the texts they are trained on. For
another, it can be ascribed to the absence of communicative intent during the training
process. While LLMs have been trained to capture fine-grained collocational patterns
and structures in enormous corpora of language use, they have never had access to clues
about why particular utterances were produced in the first place. A third limitation
concerns their data-hungriness, which results from the fact that all aspects of meaning,
reasoning, and world knowledge that they are expected to capture need to be learned
indirectly through the intermediary of their effect on the distribution of (sub)words in
corpora. Finally, LLMs are sensitive to biases, as they mirror the collocational patterns
and structures present in uncuratable amounts of human-written texts.
Our comparison between the ways in which humans and LLMs learn their linguistic
capabilities along with our analysis of the limitations inherent to LLMs have led us to
conclude that the key towards more human-like language learning in machines lies
in more faithfully modeling the meaningful, intentional, situated, communicative, and
interactional aspects of human linguistic communication. In order to investigate how
this could be operationalized in practice, we have first discussed approaches that extend
the LLM paradigm beyond a purely text-based prediction task. The integration of input
from other modalities than text provides a way to incorporate richer representations of
the situational embedding of observed utterances, whereas the alignment of language
models to text-external criteria confers the possibility of integrating task-oriented, com-
municative, and interactional objectives into the training process of LLMs. While these
extensions have already proven to be valuable additions to the LLM paradigm, the
extent to which situated communicative interactions can be approximated by non-
gameable reward functions still remains an open question.
We have then moved on to discuss a line of work that leaves the text-based predic-
tion paradigm altogether and takes situated communicative interactions as the starting
point for modeling language acquisition in machines. We have discussed two experi-
ments that model how artificial agents can acquire a language through the processes of
intention reading and pattern finding during situated communicative interactions. The
first experiment focused on grounded concept learning and shows how a population of
autonomous agents can self-organize a linguistic convention that can be used to refer to
arbitrary entities in their environment. The meaningful, intentional, interactional, and
communicative nature of the learning environment facilitates the emergence and evolu-
tion of an environmentally grounded and communicatively motivated convention. The
second experiment focused on the acquisition of linguistic structures that transcend
the level of atomic forms and concept representations. It implements a tutor-learner
scenario in which a learner agent acquires linguistic knowledge that is adequate to
ask and answer English questions about their environment. This linguistic knowledge
consists of constructions at different levels of abstraction, which can combine to map
between English utterances and situationally grounded executable meaning representa-
tions. These meaning representations are grounded in the physio-cognitive endowment
of the agents, their shared environment, and their communicative tasks and intentions.
The experiments have shown that the linguistic systems that are built up by the
agents through situated communicative interactions are of a fundamentally different
nature than the linguistic knowledge that is captured by LLMs. The agents’ acquired
constructions are directly motivated by their communicative function, rather than by
the collocational behavior of linguistic forms. The constructions can thereby be used to
comprehend and produce linguistic expressions without relying on textual prediction,
and the meanings of linguistic expressions can be conceptualized and interpreted with
respect to the communicative tasks and environment of the agents. They thereby steer
clear of the hallucination effects that characterize text prediction-based models. As a
result of the rich nature of the learning environment, constructions can be acquired in a
more data-efficient manner and they are able to capture the communicative function of
language, which will ultimately be required to draw human-like logical and pragmatic
inferences.
The experiments that we have discussed in the final part of this article definitely do
not offer an application-level alternative to today’s LLMs, nor were they ever designed
with that goal in mind. We hope, however, to have convinced the reader that a better
integration of the situated, communicative, and interactional aspects of human linguis-
tic communication constitutes the key to overcoming the limitations of current LLMs,
and that more faithfully modeling the situated communicative interactions through
which humans acquire their native languages provides a promising path towards more
human-like language processing in machines.
Acknowledgments (Volume 1: Long Papers), pages 238–247.
We would like to thank Jérôme Botoko Ekila https://doi.org/10.3115/v1/P14
and Jens Nevens for their assistance in -1023
creating the graphs shown in Figures 3 and 6. Batali, John. 1998. Computational
We are grateful to Marie-Catherine de simulations of the emergence of grammar.
Marneffe and Remi van Trijp for their role in In Chris Knight, James R. Hurford,
the discussions that led up to the writing of Michael Studdert-Kennedy, editors,
this article, and to Lara Verheyen, Remi van Approaches to the Evolution of Language:
Trijp, and three anonymous reviewers Social and Cognitive Bases. Cambridge
for their insightful feedback on earlier University Press, Cambridge, UK,
versions of the manuscript. The research pages 405–426.
reported on in this article was funded by the Behrens, Heike. 2021. Constructivist
European Union’s Horizon 2020 research approaches to first language acquisition.
and innovation programme under Journal of Child Language, 48(5):959–983.
Downloaded from http://direct.mit.edu/coli/article-pdf/50/4/1277/2475979/coli_a_00534.pdf by guest on 25 August 2025
grant agreement no. 951846, the https://doi.org/10.1017
Flemish Government under the /S0305000921000556, PubMed: 34382923
Onderzoeksprogramma Artificiële Bender, Emily M. and Alexander Koller.
Intelligentie (AI) Vlaanderen programme, 2020. Climbing towards NLU: On
the AI Flagship project ARIAC by meaning, form, and understanding in the
DigitalWallonia4.ai, and the age of data. In Proceedings of the 58th
F.R.S.-FNRS-FWO WEAVE project HERMES Annual Meeting of the Association for
I under grant numbers T002724F Computational Linguistics, pages 5185–5198.
(F.R.S.-FNRS) and G0AGU24N (FWO). https://doi.org/10.18653/v1/2020
.acl-main.463
BigScience Workshop. 2022. BLOOM: A
References 176B-parameter open-access multilingual
language model. arXiv preprint
Apidianaki, Marianna. 2022. From word
arXiv:2211.05100.
types to tokens and back: A survey of
approaches to word meaning Blythe, Richard A. and William Croft. 2012.
representation and interpretation. S-curves and the mechanisms of
Computational Linguistics, 49(2):465–523. propagation in language change. Language,
Aslin, Richard N. 2017. Statistical learning: 88:269–304. https://doi.org/10.1353
A powerful mechanism that operates by /lan.2012.0027
mere exposure. Wiley Interdisciplinary Boden, Margaret A. 1978. Artificial
Reviews: Cognitive Science, 8(1–2):e1373. intelligence and Piagetian theory. Synthese,
https://doi.org/10.1002/wcs.1373, 38(3):389–414. https://doi.org/10
PubMed: 27906526 .1007/BF00486637
Austin, John L. 1962. How to Do Things with Botoko Ekila, Jérôme, Jens Nevens, Lara
Words. Oxford University Press, London. Verheyen, Katrien Beuls, and Paul Van
Baevski, Alexei Wei-Ning Hsu, Qiantong Eecke. 2024a. Decentralised emergence of
Xu, Arun Babu, Jiatao Gu, and Michael robust and adaptive linguistic conventions
Auli. 2022. Data2vec: A general framework in populations of autonomous agents
for self-supervised learning in speech, grounded in continuous worlds. In
vision and language. In International Proceedings of the 23rd International
Conference on Machine Learning, Conference on Autonomous Agents and
pages 1298–1312. Multi-Agent Systems, pages 2168–2170.
Bai, Yuntao, Andy Jones, Kamal Ndousse, Botoko Ekila, Jérôme, Jens Nevens, Lara
Amanda Askell, Anna Chen, Nova Verheyen, Katrien Beuls, and Paul Van
DasSarma, Dawn Drain, Stanislav Fort, Eecke. 2024b. Decentralised emergence of
Deep Ganguli, Tom Henighan, et al. 2022. robust and adaptive linguistic conventions
Training a helpful and harmless assistant in populations of autonomous agents
with reinforcement learning from human grounded in continuous worlds. arXiv
feedback. arXiv preprint arXiv:2204.05862. preprint arXiv:2401.08461.
Baroni, Marco, Georgiana Dinu, and Germán Brown, Tom, Benjamin Mann, Nick Ryder,
Kruszewski. 2014. Don’t count, predict! A Melanie Subbiah, Jared D. Kaplan, Prafulla
systematic comparison of context-counting Dhariwal, Arvind Neelakantan, Pranav
vs. context-predicting semantic vectors. In Shyam, Girish Sastry, Amanda Askell,
Proceedings of the 52nd Annual Meeting of the et al. 2020. Language models are few-shot
Association for Computational Linguistics learners. In Advances in Neural Information
1305
Computational Linguistics Volume 50, Number 4
Processing Systems 33 (NeurIPS 2020), Darwin, Charles R. 1871. The Descent of Man,
pages 1877–1901. and Selection in Relation to Sex, 1st edition,
Bruner, Jerome. 1983. Learning to use language. volume 1. John Murray, London, United
Oxford University Press, Oxford, United Kingdom. https://doi.org/10.1037
Kingdom. /12293-000
Bybee, Joan. 2010. Language, Usage and Dasgupta, Ishita, Andrew K Lampinen,
Cognition. Cambridge University Press, Stephanie C. Y. Chan, Antonia Creswell,
Cambridge, United Kingdom. https:// Dharshan Kumaran, James L. McClelland,
doi.org/10.1017/CBO9780511750526 and Felix Hill. 2022. Language models
Chang, Tyler A. and Benjamin K. Bergen. show human-like content effects on
2024. Language model behavior: A reasoning. arXiv preprint
comprehensive survey. Computational arXiv:2207.07051v1.
Linguistics, 50(1):293–350. https://doi Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
.org/10.1162/coli a 00492 and Kristina Toutanova. 2019. BERT:
Downloaded from http://direct.mit.edu/coli/article-pdf/50/4/1277/2475979/coli_a_00534.pdf by guest on 25 August 2025
Choudhury, Sagnik Ray, Anna Rogers, and Pre-training of deep bidirectional
Isabelle Augenstein. 2022. Machine transformers for language understanding.
reading, fast and slow: When do models In Proceedings of the 2019 Conference of the
“understand” language? In Proceedings of North American Chapter of the Association for
the 29th International Conference on Computational Linguistics: Human Language
Computational Linguistics, pages 78–93. Technologies, Volume 1 (Long and Short
Chowdhery, Aakanksha, Sharan Narang, Papers), pages 4171-4186.
Jacob Devlin, Maarten Bosma, Gaurav Diessel, Holger. 2017. Usage-based
Mishra, Adam Roberts, Paul Barham, linguistics. In Mark Aronoff, editor, Oxford
Hyung Won Chung, Charles Sutton, Research Encyclopedia of Linguistics. Oxford
Sebastian Gehrmann, et al. 2022. PaLM: University Press, Oxford, United
Scaling language modeling with Kingdom. https://doi.org/10.1093
pathways. arXiv preprint arXiv:2204.02311. /acrefore/9780199384655.013.363
Clark, Herbert H. 1996. Using Language. Doumen, Jonas, Katrien Beuls, and Paul Van
Cambridge University Press, Cambridge, Eecke. 2023. Modelling language
United Kingdom. https://doi.org/10 acquisition through syntactico-semantic
.1017/CBO9780511620539 pattern finding. In Findings of the
Coelho Mollo, Dimitri and Raphaël Millière. Association for Computational Linguistics:
2023. The vector grounding problem. arXiv EACL 2023, pages 1317–1327. https://
preprint arXiv:2304.01481v1. doi.org/10.18653/v1/2023.findings
Cortez, Paulo, Antonio Cerdeira, Fernando -eacl.99
Almeida, Telmo Matos, and José Reis. Doumen, Jonas, Veronica J. Schmalz, Katrien
2009. Modeling wine preferences by data Beuls, and Paul Van Eecke. 2025. The
mining from physicochemical properties. computational learning of construction
Decision Support Systems, 47(4):547–553. grammars: State of the art and prospective
https://doi.org/10.1016/j.dss roadmap. Constructions and Frames, 17.
.2009.05.016 https://arxiv.org/pdf/2407.07606
Croft, William. 1991. Syntactic Categories and Driess, Danny, Fei Xia, Mehdi S. M. Sajjadi,
Grammatical Relations: The Cognitive Corey Lynch, Aakanksha Chowdhery,
Organization of Information. University of Brian Ichter, Ayzaan Wahid, Jonathan
Chicago Press, Chicago, IL, USA. Tompson, Quan Vuong, Tianhe Yu,
Croft, William. 2001. Radical construction Wenlong Huang, Yevgen Chebotar, Pierre
grammar: Syntactic theory in typological Sermanet, Daniel Duckworth, Sergey
perspective. Oxford University Press, Levine, Vincent Vanhoucke, Karol
Oxford, United Kingdom. https://doi Hausman, Marc Toussaint, Klaus Greff,
.org/10.1093/acprof:oso/9780198299554 Andy Zeng, Igor Mordatch, and Pete
.001.0001 Florence. 2023. PaLM-E: An embodied
Dal Pozzolo, Andrea, Olivier Caelen, multimodal language model. In
Yann-Aël Le Borgne, Serge Waterschoot, Proceedings of the 40th International
and Gianluca Bontempi. 2014. Learned Conference on Machine Learning,
lessons in credit card fraud detection from pages 8469–8488.
a practitioner perspective. Expert Systems Fan, Yue, Jing Gu, Kaizhi Zheng, and Xin
with Applications, 41(10):4915–4928. Wang. 2023. R2H: Building multimodal
https://doi.org/10.1016/j.eswa navigation helpers that respond to help
.2014.02.026 requests. In Proceedings of the 2023
1306
Beuls and Van Eecke Language Learning from Situated Communicative Interactions
Firth, John R. 1957. A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis. Basil Blackwell, Oxford, pages 1–31.
Frank, Michael C. 2023. Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 27(11):990–992. https://doi.org/10.1016/j.tics.2023.08.007, PubMed: 37659919
Gabrieli, Gianmarco, Michal Muszynski, Edouard Thomas, David Labbe, and Patrick W. Ruch. 2022. Accelerated estimation of coffee sensory profiles using an AI-assisted electronic tongue. Innovative Food Science & Emerging Technologies, 82:103205. https://doi.org/10.1016/j.ifset.2022.103205
Givón, Talmy. 1995. Functionalism and Grammar. John Benjamins, Amsterdam, Netherlands. https://doi.org/10.1075/z.74
Goldberg, Adele E. 2006. Constructions at Work: The Nature of Generalization in Language. Oxford University Press, Oxford, United Kingdom.
Goodhart, Charles. 1975. Problems of monetary management: The U.K. experience. In Papers in Monetary Economics. Reserve Bank of Australia, Sydney, pages 1–20.
Grice, Paul. 1967. Logic and conversation. In Paul Grice, editor, Studies in the Way of Words. Harvard University Press, Cambridge, MA, USA, pages 41–58.
Harris, Zellig S. 1954. Distributional structure. Word, 10(2–3):146–162. https://doi.org/10.1080/00437956.1954.11659520
Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695. https://doi.org/10.1162/COLI_a_00237
Hockett, Charles F. and Charles D. Hockett. 1960. The origin of speech. Scientific American, 203(3):88–97. https://doi.org/10.1038/scientificamerican0960-88
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. 2022. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pages 30016–30030.
Hong, Xudong, Margarita Ryzhova, Daniel Adrian Biondi, and Vera Demberg. 2023. Do large language models and humans have similar behaviors in causal inference with script knowledge? arXiv preprint arXiv.07311.
Hu, Jennifer, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. 2023. A fine-grained comparison of pragmatic language understanding in humans and language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4194–4213. https://doi.org/10.18653/v1/2023.acl-long.230
Hupkes, Dieuwke. 2018. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926. https://doi.org/10.1613/jair.1.11196
Jiang, Nanjiang and Marie-Catherine de Marneffe. 2021. He thinks he knows better than the doctors: BERT for event factuality fails on pragmatics. Transactions of the Association for Computational Linguistics, 9:1081–1097. https://doi.org/10.1162/tacl_a_00414
Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2901–2910. https://doi.org/10.1109/CVPR.2017.215
Johnson-Laird, Philip N. 1977. Procedural semantics. Cognition, 5(3):189–214. https://doi.org/10.1016/0010-0277(77)90001-4
Joos, Martin. 1950. Description of language design. The Journal of the Acoustical Society of America, 22(6):701–707. https://doi.org/10.1121/1.1906674
Jurafsky, Daniel and James H. Martin. 2024. Vector semantics and embeddings. In Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed.
Kravtchenko, Ekaterina and Vera Demberg. 2022. Informationally redundant utterances elicit pragmatic inferences. Cognition, 225:105159. https://doi.org/10.1016/j.cognition.2022.105159, PubMed: 35580451
Kuhl, Patricia K., Feng-Ming Tsao, and Huei-Mei Liu. 2003. Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning. Proceedings of the National Academy of Sciences, 100(15):9096–9101. https://doi.org/10.1073/pnas.1532872100, PubMed: 12861072
Langacker, Ronald W. 1987. Foundations of Cognitive Grammar: Theoretical Prerequisites, volume 1. Stanford University Press, Stanford, CA, USA.
Lauriola, Ivano, Alberto Lavelli, and Fabio Aiolli. 2022. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing, 470(C):443–456. https://doi.org/10.1016/j.neucom.2021.05.103
Lazaridou, Angeliki and Marco Baroni. 2020. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419.
Li, KunChang, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
Lieven, Elena. 2014. First language learning from a usage-based approach. In Thomas Herbst, Hans-Jörg Schmid, and Susen Faulhaber, editors, Constructions Collocations Patterns. De Gruyter Mouton, Berlin, pages 9–32. https://doi.org/10.1515/9783110356854.9
Liu, Ruibo, Ge Zhang, Xinyu Feng, and Soroush Vosoughi. 2022. Aligning generative language models with human values. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 241–252. https://doi.org/10.18653/v1/2022.findings-naacl.18
Löhr, Guido. 2022. What are abstract concepts? On lexical ambiguity and concreteness ratings. Review of Philosophy and Psychology, 13(3):549–566. https://doi.org/10.1007/s13164-021-00542-9
Lyu, Chenyang, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. 2023. Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093.
MacWhinney, Brian. 2014. Item-based patterns in early syntactic development. In Thomas Herbst, Hans-Jörg Schmid, and Susen Faulhaber, editors, Constructions Collocations Patterns. De Gruyter Mouton, Berlin, pages 33–69. https://doi.org/10.1515/9783110356854.33
Mandelkern, Matthew and Tal Linzen. 2024. Do language models refer? Computational Linguistics, 50(3):1191–1200. https://doi.org/10.1162/coli_a_00522
Manheim, David and Scott Garrabrant. 2018. Categorizing variants of Goodhart’s Law. arXiv preprint arXiv:1803.04585.
Maynard Smith, John and Eörs Szathmáry. 1999. The Origins of Life: From the Birth of Life to the Origin of Language. Oxford University Press, Oxford, United Kingdom. https://doi.org/10.1093/oso/9780198504931.001.0001
McCoy, R. Thomas, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. 2023. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.
McKenna, Nick, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. 2023. Sources of hallucination by large language models on inference tasks. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2758–2774. https://doi.org/10.18653/v1/2023.findings-emnlp.182
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations (ICLR 2013), Workshop Proceedings.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 1–9.
Mitchell, Melanie and David C. Krakauer. 2023. The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120. https://doi.org/10.1073/pnas.2215907120, PubMed: 36943882
Mitchell, Melanie, Alessandro B. Palmarini, and Arsenii Kirillovich Moskvichev. 2024. Comparing humans, GPT-4, and GPT-4v on abstraction and reasoning tasks. In AAAI 2024 Workshop on “Are Large Language Models Simply Causal Parrots?”, 9 pages. https://llmcp.cause-lab.net/pdf/LLMCP_4.pdf
Navigli, Roberto, Simone Conia, and Bjorn Ross. 2023. Biases in large language models: Origins, inventory, and discussion. Journal of Data and Information Quality, 15(2):1–21. https://doi.org/10.1145/3597307
Nelson, Katherine. 1998. Language in Cognitive Development: The Emergence of the Mediated Mind. Cambridge University Press, Cambridge, United Kingdom.
Nevens, Jens, Jonas Doumen, Paul Van Eecke, and Katrien Beuls. 2022. Language acquisition through intention reading and pattern finding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 15–25.
Nevens, Jens, Paul Van Eecke, and Katrien Beuls. 2020. From continuous observations to symbolic concepts: A discrimination-based strategy for grounded concept learning. Frontiers in Robotics and AI, 7(84):1–20. https://doi.org/10.3389/frobt.2020.00084, PubMed: 33501251
Ngo, Richard, Lawrence Chan, and Sören Mindermann. 2024. The alignment problem from a deep learning perspective: A position paper. In The Twelfth International Conference on Learning Representations (ICLR 2024).
Nissim, Malvina, Rik van Noord, and Rob van der Goot. 2020. Fair is better than sensational: Man is to doctor as woman is to doctor. Computational Linguistics, 46(2):487–497. https://doi.org/10.1162/coli_a_00379
Oliphant, Michael. 1999. The learning barrier: Moving from innate to learned systems of communication. Adaptive Behavior, 7(3–4):371–383. https://doi.org/10.1177/105971239900700309
OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Park, Kyungseo, Hyunwoo Yuk, M. Yang, Junhwi Cho, Hyosang Lee, and Jung Kim. 2022. A biomimetic elastomeric robot skin using electrical impedance and acoustic tomography for tactile sensing. Science Robotics, 7(67):eabm7187. https://doi.org/10.1126/scirobotics.abm7187, PubMed: 35675452
Pavlick, Ellie. 2023. Symbols and grounding in large language models. Philosophical Transactions of the Royal Society A, 381(2251):20220041. https://doi.org/10.1098/rsta.2022.0041, PubMed: 37271171
Piaget, Jean. 1923. Le langage et la pensée chez l’enfant. Delachaux & Niestlé, Neuchâtel/Paris, Switzerland/France.
Piantadosi, Steven T. and Felix Hill. 2022. Meaning without reference in large language models. arXiv preprint arXiv:2208.02957v2.
Plotkin, Gordon D. 1970. A note on inductive generalization. Machine Intelligence, 5(1):153–163.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763.
Rehder, Bob, Missy E. Schreiner, Michael B. W. Wolfe, Darrell Laham, Thomas K. Landauer, and Walter Kintsch. 1998. Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25(2–3):337–354. https://doi.org/10.1080/01638539809545031
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866. https://doi.org/10.1162/tacl_a_00349
Ruis, Laura Eline, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. 2023. The goldilocks of pragmatic understanding: Fine-tuning strategy matters for implicature resolution by LLMs. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), pages 20827–20905.
Saffran, Jenny R., Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928. https://doi.org/10.1126/science.274.5294.1926, PubMed: 8943209
Schleicher, August. 1869. Darwinism Tested by the Science of Language. English Translation of Schleicher 1863, translated by Alex V. W. Bikkers. John Camden Hotten, London, United Kingdom. https://doi.org/10.5962/bhl.title.49464
Schütze, Hinrich. 1992. Dimensions of meaning. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 787–796. https://doi.org/10.1109/SUPERC.1992.236684
Schütze, Hinrich. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.
Shanahan, Murray. 2024. Talking about large language models. Communications of the ACM, 67(2):68–79. https://doi.org/10.1145/3624724
Shichman, Mollie, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, and Rachel Rudinger. 2023. Use defines possibilities: Reasoning about object function to interpret and execute robot instructions. In Proceedings of the 15th International Conference on Computational Semantics, pages 284–292.
Shiffrin, Richard and Melanie Mitchell. 2023. Probing the psychology of AI models. Proceedings of the National Academy of Sciences, 120(10):e2300963120. https://doi.org/10.1073/pnas.2300963120, PubMed: 36857344
Skalse, Joar, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. Defining and characterizing reward gaming. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pages 9460–9471.
Sperber, Dan and Deirdre Wilson. 1986. Relevance: Communication and Cognition. Harvard University Press, Cambridge, MA, USA.
Spranger, Michael, Simon Pauw, Martin Loetzsch, and Luc Steels. 2012. Open-ended procedural semantics. In Luc Steels and Manfred Hild, editors, Language Grounding in Robots. Springer, New York, NY, USA, pages 153–172. https://doi.org/10.1007/978-1-4614-3064-3_8
Steels, Luc. 1995. A self-organizing spatial vocabulary. Artificial Life, 2(3):319–332. https://doi.org/10.1162/artl.1995.2.3.319, PubMed: 8925502
Steels, Luc. 2011. Modeling the cultural evolution of language. Physics of Life Reviews, 8(4):339–356. https://doi.org/10.1016/j.plrev.2011.10.014, PubMed: 22071322
Sullivan, Jessica, Michelle Mei, Andrew Perfors, Erica Wojcik, and Michael C. Frank. 2021. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind, 5:20–29. https://doi.org/10.1162/opmi_a_00039, PubMed: 34485795
Sutskever, Ilya, Oriol Vinyals, and Quoc Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3104–3112.
Tan, Xiaoyu, Shaojie Shi, Xihe Qiu, Chao Qu, Zhenting Qi, Yinghui Xu, and Yuan Qi. 2023. Self-criticism: Aligning large language models with their understanding of helpfulness, honesty, and harmlessness. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 650–662. https://doi.org/10.18653/v1/2023.emnlp-industry.62
Tomasello, Michael. 2003. Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press, Cambridge, MA, USA.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Trott, Sean, Cameron Jones, Tyler Chang, James Michaelov, and Benjamin Bergen. 2023. Do large language models know what humans know? Cognitive Science, 47:e13309. https://doi.org/10.1111/cogs.13309, PubMed: 37401923
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188. https://doi.org/10.1613/jair.2934
Vallor, Shannon. 2024. The AI Mirror: How to Reclaim our Humanity in an Age of Machine Thinking. Oxford University Press, Oxford, United Kingdom. https://doi.org/10.1093/oso/9780197759066.001.0001
Van Eecke, Paul. 2018. Generalisation and Specialisation Operators for Computational Construction Grammar and Their Application in Evolutionary Linguistics Research. Ph.D. thesis, Vrije Universiteit Brussel, Brussels: VUB Press.
Van Eecke, Paul, Lara Verheyen, Tom Willaert, and Katrien Beuls. 2023. The Candide model: How narratives emerge where observations meet beliefs. In Proceedings of the 5th Workshop on Narrative Understanding (WNU), pages 48–57. https://doi.org/10.18653/v1/2023.wnu-1.7
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6000–6010.
Verheyen, Lara, Jérôme Botoko Ekila, Jens Nevens, Paul Van Eecke, and Katrien Beuls. 2023. Neuro-symbolic procedural semantics for reasoning-intensive visual dialogue tasks. In Proceedings of the 26th European Conference on Artificial Intelligence (ECAI 2023), pages 2419–2426. https://doi.org/10.3233/FAIA230544
Vong, Wai Keen, Wentao Wang, A. Emin Orhan, and Brenden M. Lake. 2024. Grounded language acquisition through the eyes and ears of a single child. Science, 383(6682):504–511. https://doi.org/10.1126/science.adi1374, PubMed: 38300999
Vulić, Ivan, Simon Baker, Edoardo Maria Ponti, Ulla Petti, Ira Leviant, Kelly Wing, Olga Majewska, Eden Bar, Matt Malone, Thierry Poibeau, Roi Reichart, and Anna Korhonen. 2020. Multi-SimLex: A large-scale evaluation of multilingual and crosslingual lexical semantic similarity. Computational Linguistics, 46(4):847–897. https://doi.org/10.1162/coli_a_00391
Vulić, Ivan, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240. https://doi.org/10.18653/v1/2020.emnlp-main.586
Webb, Taylor, Keith J. Holyoak, and Hongjing Lu. 2023. Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9):1526–1541. https://doi.org/10.1038/s41562-023-01659-w, PubMed: 37524930
Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
Weissweiler, Leonie, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze. 2022. The better your syntax, the better your semantics? Probing pretrained language models for the English comparative correlative. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10859–10882. https://doi.org/10.18653/v1/2022.emnlp-main.746
Welford, Barry Payne. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420. https://doi.org/10.1080/00401706.1962.10490022
West, Peter, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, and Yejin Choi. 2024. The generative AI paradox: “What it can create, it may not understand.” In The Twelfth International Conference on Learning Representations.
Woods, William A. 1968. Procedural semantics for a question-answering machine. In Proceedings of the December 9–11, 1968, Fall Joint Computer Conference, Part I, pages 457–471, New York, NY, USA. https://doi.org/10.1145/1476589.1476653
Yamazaki, Takato, Tomoya Mizumoto, Katsumasa Yoshikawa, Masaya Ohagi, Toshiki Kawamoto, and Toshinori Sato. 2023. An open-domain avatar chatbot by exploiting a large language model. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 428–432. https://doi.org/10.18653/v1/2023.sigdial-1.40
Yernaux, Gonzague and Wim Vanhoof. 2019. Anti-unification in constraint logic programming. Theory and Practice of Logic Programming, 19(5–6):773–789. https://doi.org/10.1017/S1471068419000188
Ziegler, Daniel M., Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.