
Decoding Word Embeddings with Brain-Based Semantic Features

Emmanuele Chersoni∗ (The Hong Kong Polytechnic University)
Enrico Santus∗∗ (MIT)
Chu-Ren Huang† (The Hong Kong Polytechnic University)
Alessandro Lenci‡ (University of Pisa)

Word embeddings are vectorial semantic representations built with either counting or predicting
techniques aimed at capturing shades of meaning from word co-occurrences. Since their intro-
duction, these representations have been criticised for lacking interpretable dimensions. This
property of word embeddings limits our understanding of the semantic features they actually
encode. Moreover, it contributes to the “black box” nature of the tasks in which they are
used, since the reasons for the performance of word embeddings often remain opaque to humans. In
this contribution, we explore the semantic properties encoded in word embeddings by mapping
them onto interpretable vectors, consisting of explicit and neurobiologically motivated semantic
features (Binder et al. 2016). Our exploration takes into account different types of embeddings,
including factorized count vectors and predict models (e.g., Skip-Gram, GloVe, etc.), as well as
the most recent contextualized representations (i.e., ELMo and BERT).
In our analysis, we first evaluate the quality of the mapping in a retrieval task, then we shed
light on the semantic features that are better encoded in each embedding type. A large number
of probing tasks is finally set up to assess how the original and the mapped embeddings perform in
discriminating semantic categories. For each probing task, we identify the most relevant semantic
features, and we show that there is a correlation between the performance of the embeddings and how they
encode those features. This study represents a step forward in understanding which aspects of
meaning are captured by vector spaces, by proposing a new and simple method to carve human-
interpretable semantic representations out of distributional vectors.

1. Introduction

One of the most influential and longstanding approaches to semantic representation
assumes that the conceptual content of lexical items is decomposable into semantic
features that identify meaning components, hence the name of featural, decompositional,
or componential theories of meaning (Vigliocco and Vinson 2007). In linguistics,
features are typically represented by symbols (e.g., HUMAN, PATH, CAUSE, etc.) standing

∗ Department of Chinese and Bilingual Studies, 11 Yuk Choi Road, Hung Hom, Kowloon, Hong Kong.
E-mail: emmanuele.chersoni@polyu.edu.hk
∗∗ MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139,
United States. E-mail: esantus@mit.edu
† Department of Chinese and Bilingual Studies, 11 Yuk Choi Road, Hung Hom, Kowloon, Hong Kong.
E-mail: churen.huang@polyu.edu.hk
‡ Department of Philology, Literature and Linguistics, Via Santa Maria 36, 56126 Pisa, Italy. E-mail:
alessandro.lenci@unipi.it


for basic or primitive semantic dimensions (Jackendoff 1990; Wierzbicka 1996; Murphy
2010; Pustejovsky and Batiukova 2019). These “building blocks” of meaning are selected
a priori and structured into categorical representations defined by the presence or
absence of symbolic features, like in this semantic analysis of enter:
(1) enter [+ MOVE, + PATH, - CAUSE, . . . ]
Besides the issue of establishing the criteria to define the repertoire of alleged seman-
tic primitives, discrete symbolic structures struggle to cope with the gradient nature of
lexical meaning and cannot capture the varying degrees of feature prototypicality in
concepts (Murphy 2002). Secondly, the basic semantic features are normally too coarse-
grained to provide a full characterization of conceptual content (e.g., accounting for
the dimensions that distinguish painter from violinist). In cognitive psychology, instead
of using categorical representations formed by manually selected components, it is
customary to represent concepts with verbal properties generated by native speakers
to describe a word meaning and collected in feature norms (e.g., McRae et al. 2005;
Vinson and Vigliocco 2008; Devereux et al. 2014). Each feature is associated with a
weight corresponding to the number of subjects that listed it for a given concept and
is used to estimate its salience in that concept. The following is a representation of car
using a subset of its feature distribution from the norms in McRae et al. (2005):

(2) car [a_vehicle: 9, has_4_wheels: 18, is_fast: 9, is_expensive: 11]
The main advantage of featural representations is that they are human-interpretable and
explainable: Features explicitly label the dimensions of word meanings and provide
explanatory factors of their semantic behavior (e.g., the similarity between violinist
and athlete can be explained by assuming that they both share a feature like HUMAN).
Conversely, featural semantic representations raise several methodological concerns, as
they are either based on intuition and therefore highly subjective, or must be carried out
with a complex and time-consuming process of elicitation from human subjects, which
is hardly scalable to cover large areas of the lexicon. In fact, existing feature norms only
include a few hundred lexical items, typically limited to concrete nouns.
Semantic features have been widely used in computational linguistics and artificial
intelligence (AI), but their limits have eventually contributed to the success of a com-
pletely different approach to semantic representation. This is based on data-driven, low-
dimensional, dense distributional vectors called word embeddings, which represent
lexical meaning as a function of the statistical distribution of words in texts. Word
embeddings are built by Distributional Semantic Models (DSMs) (Turney and Pantel
2010; Lenci 2018) using various types of methods, ranging from the factorization of
co-occurrence matrices with Singular Value Decomposition (SVD) to neural language
models. Traditional DSMs represent the content of lexical types through
a single vector that “summarizes” their whole distributional history, disregarding the fact that
word tokens may have different meanings in different contexts. Things have recently
changed with the introduction of deep neural architectures for language modeling such
as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019), whose word representations
have helped achieve state-of-the-art results in a wide variety of supervised NLP
tasks. These embeddings are “contextualized”, in the sense that the model computes
a different vector for each token of the same word, depending on the sentence in which
it occurs. The popularity of word embeddings, both contextualized and not, is due to
the fact that they allow for the fast construction of continuous semantic representations


that can be pretrained on large natural language corpora. The vector-based encoding of
meaning is easily machine-interpretable, as embeddings can be directly fed into complex
neural architectures and indeed boost performance in several natural language process-
ing (NLP) tasks and applications.
Although word embeddings play an important role in the success of deep learning
models and do capture some aspects of lexical meaning, it is hard to understand their
actual semantic content. In fact, one notorious problem of embeddings is their lack
of human-interpretability: Information is distributed across vector dimensions that cannot
be labelled with specific semantic values. In neural word embeddings, the vector
dimensions are learned as network parameters, instead of being derived from explicit
co-occurrence counts between target words and linguistic contexts, making their in-
terpretation even more challenging. Scholars have argued that DSMs provide a holistic
representation of meaning, as the content of each word can exclusively be read off from
its position relative to other elements in the semantic space, while the coordinates of
such space are themselves arbitrary and without any intrinsic semantic value (Landauer
et al. 2007; Vigliocco and Vinson 2007; Sahlgren 2008). This makes embeddings “black
box” representations that can be understood only by observing their behavior in some
external task, but whose internal content defies direct inspection. A recent and widely
used tool to investigate the linguistic properties captured by embeddings is the so-
called probing task (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017; Conneau et al.
2018; Kann et al. 2019). A probing task is a classification problem that targets a specific
linguistic aspect (e.g., word order, animacy, etc.). The name refers to the fact that the
classifier is used to “probe” embeddings for a particular piece of linguistic information.
The successful performance of an embedding model on this task is then used to
infer that the vectors encode that information. However, as recently pointed out
by Shwartz and Dagan (2019), probing tasks are also a form of “black box” testing, since
they just provide indirect evidence about the embedding content.
The rise of the interpretability problem in AI and NLP has motivated the necessity of
understanding which shades of semantics are actually encoded by word embeddings,
and has therefore refueled the debate about the relationship between distributional rep-
resentations and semantic features (Boleda and Erk 2015). “Opening the black box” of
deep learning methods has become an imperative in computational linguistics (Linzen,
Chrupała, and Alishahi 2018; Linzen et al. 2019). This research effort aims at analyzing
the specific information encoded by vector representations, which may help explain
their behavior in downstream tasks and applications.
In this paper, we contribute to this goal by showing that featural semantic represen-
tations can be used to interpret the content of word embeddings. In particular, we argue
that decoding semantic information from distributional vectors is strikingly similar to
the problem faced by neuroscience of how to “read off meaning” from distributed brain
activity patterns. Neurosemantic decoding is a research line that develops computa-
tional methods to identify the mental state represented by brain activity recorded with
neuroimaging techniques such as fMRI (e.g., recognizing that a given activation pattern
produced by a stimulus picture or word corresponds to an apple). A common approach
to address such task is to learn a mapping between featural concept representations
and a vector containing the corresponding fMRI recorded brain activity (Naselaris
et al. 2011; Poldrack 2011). These computational models are able to predict the concept
corresponding to a certain brain activation and contribute to shedding light on the neural
representation of semantic features.
In neurosemantic decoding, human-interpretable semantic vectors are used to de-
code the content of vectors of “brain-interpretable” signals activated by a certain stim-


ulus (cf. Section 2.2). In a similar way, we aim at decoding the semantic content of
word embeddings by learning a mapping onto vectors of human-interpretable fea-
tures. To this end, we use the semantic features introduced by Binder et al. (2016),
who proposed a set of cognitively motivated semantic primitives (henceforth, Binder
features) derived from a wide variety of modalities of neural information processing
(hence their definition as brain-based), and provided human ratings about the relevance
of each feature for a set of English words (henceforth, Binder dataset). We use these
ratings to represent the words with continuous vectors of semantic features and to learn
a map from word embeddings dimensions to Binder features. Such mapping provides a
human-interpretable correlate of word embeddings that we use to address these issues:
1. identifying which semantic features are best encoded in word embeddings;
2. explaining the performance of embeddings in semantic probing tasks.
The idea of mapping word embeddings onto semantic features is not by itself new
(Fagarasan, Vecchi, and Clark 2015; Utsumi 2020), but to the best of our knowledge the
present contribution is the first one to use mapped featural representations to interpret
the semantic information learnt by probing classifiers and to explain the embedding be-
havior in such tasks. Therefore, we establish a bridge between the research on semantic
features and the challenge of enhancing the interpretability of distributed representa-
tions, by showing that featural semantic representations can work as an important key
to open the black boxes of word embeddings and of the tasks in which they are used. As
an additional element of novelty, we also apply the neural decoding methodology to the
recently-introduced contextualized embeddings, to evaluate whether and how they dif-
fer from static ones in encoding semantic information. It is important to remark that we
do not argue that Binder feature vectors should replace distributional representations.
The main claim of this paper is rather that continuous vectors of human-interpretable
semantic features, such as Binder’s, are an important tool to investigate what
aspects of meaning distributional embeddings actually encode, and they can be used
to lay a bridge between symbolic and distributed semantic representations.
This paper is organized as follows. Section 2 introduces the main typologies of
DSMs and reviews the related work on vector decoding. In Section 3, we describe the
Binder features, we present the method used to map word embeddings onto Binder
feature vectors, and we evaluate the mapping accuracy. In Section 4, we investigate
which Binder features are best encoded by each type of embedding. In Section 5 we
set up a series of probing tasks to verify how the original and mapped embeddings
encode semantic categories, such as animate/inanimate or positive/negative sentiment.
Some probing tasks focus on static embeddings, while others target the token vectors
produced by contextualized embeddings. The aim of the analysis is to identify the
most important semantic features for a given task and to investigate whether there is
a correlation between the system performance and how those features are encoded by
the embeddings.

2. Related Work

2.1 From Static Distributional Models to Contextualized Embeddings

We use the term word embedding to refer to any kind of dense, low-dimensional
distributional vector. In the early days of Distributional Semantics, embeddings were
built by applying dimensionality reduction techniques derived from linear algebra,
such as SVD, to matrices keeping track of the co-occurrence information about the target


terms and some pre-defined set of linguistic contexts. Parameter tuning was mostly
carried out empirically, as it was driven by the model performance on specific tasks.
This family of DSMs is referred to as count models (Baroni, Dinu, and Kruszewski 2014).
The construction of distributional representations started to be conceived mainly as
the by-product of a supervised language modelling task after the introduction of the
Word2Vec package (Mikolov et al. 2013). Low-dimensional distributional word vectors
are created by neural network algorithms by learning to optimally predict the contexts
of a target word, hence their name of predict models. "Neural" embeddings have become
an essential component for several NLP applications, also thanks to the availability of
many efficient and easy-to-use tools (Mikolov et al. 2013; Bojanowski et al. 2017) that
allow researchers to quickly obtain well-performing word representations. Indeed, an
important finding of a first comparative evaluation between count and predict models
was that the latter achieve far superior performance in a wide variety of tasks (Baroni,
Dinu, and Kruszewski 2014). Although this result was later attributed to the sub-
optimal choice of "vanilla" hyperparameters for the count models (Levy, Goldberg, and
Dagan 2015), it was still proof that predict models could be very efficient even without
any parameter tuning. Subsequent studies adopting cognitively-motivated benchmarks
(e.g., based on priming, eye-tracking, or EEG data) have also shown that word em-
beddings exhibit strong correlations with human performance in psycholinguistic and
neurolinguistic tasks (Søgaard 2016; Mandera, Keuleers, and Brysbaert 2017; Bakarov
2018; Schwartz and Mitchell 2019; Hollenstein et al. 2019). Finally, and significantly,
Carota et al. (2017) found that the semantic similarity computed via distributional
models between action-related words correlates with the fMRI response patterns of the
brain regions that are involved in the processing of this category of lexical items.
Another novelty has recently emerged from research on deep neural networks for
language modeling. Both count and predict models share a common and longstanding
assumption: a single, stable representation is built for each word type in the
corpus. In the latest generation of embeddings, instead, each occurrence of a word in a
specific sentence context gets a unique representation (Peters et al. 2018). Such models
typically rely on an encoder (i.e., an LSTM or a Transformer) trained on large amounts of
textual data, and the word vectors are learned as a function of the internal states of the
encoder, such that a word in different sentence contexts determines different activation
states and is represented by a distinct vector (McCann et al. 2017; Peters et al. 2018;
Howard and Ruder 2018; Devlin et al. 2019; Yang et al. 2019). Thus, the embeddings
produced by these new frameworks are said to be contextualized, as opposed to the static
vectors produced by the earlier frameworks, and they aim at modeling the specific sense
assumed by the word in context (Wiedemann et al. 2019). Interestingly, the distinction
between traditional and contextualized embeddings has been recently discussed by
drawing a parallel between the prototype and exemplar models of categorization in
cognitive psychology (Sikos and Padó 2019).
Two very popular models for obtaining contextualized word embeddings are ELMo
(Peters et al. 2018) and BERT (Devlin et al. 2019). ELMo is based on a two-layer LSTM
trained as the concatenation of a forward and a backward language model, while BERT is based on a
stack of Transformer layers (Vaswani et al. 2017) trained jointly on a masked language
modeling and a next sentence prediction task. The semantic interpretation of the dimen-
sions of contextualized embeddings is still an open question. The classical approach to
analyze the syntactic and semantic information encoded by these representations is to
test them in some probing tasks (Tenney et al. 2019; Liu et al. 2019; Hewitt and Manning
2019; Kim et al. 2019; Kann et al. 2019; Yaghoobzadeh et al. 2019; Jawahar et al. 2019). In


this contribution we adopt a different approach to the problem, mainly inspired by the
literature on neurosemantic decoding.

2.2 Interpreting Vector Representations

Like word embeddings, the brain encodes information in distributed activity patterns
that defy direct interpretation. The general goal of neurosemantic decoding is to de-
velop computational methods to infer the content of brain activities associated with a
certain word or phrase (e.g., to recognize that a pattern of brain activations corresponds
to the meaning of the stimulus word dog, instead of car). One of the most common
approaches to neural decoding consists in learning to map vectors of fMRI signals onto
vectors of semantic dimensions. If the mapping is successful, we can infer that these
dimensions are encoded in the brain. Mitchell et al. (2008) pioneered this method by
training a linear regression model on a set of words paired with their fMRI activations. The trained
model was then asked to predict the activations for unseen words. Approaches differ
in the type of semantic representation adopted to model brain data. Mitchell et al.
(2008) used a vector of features corresponding to textual co-occurrences with 25 verbs
capturing basic semantic dimensions (e.g., hear, eat, etc.). Chang, Mitchell, and Just
(2011) instead represented words with vectors of verbal properties derived from feature
norms, and Anderson et al. (2016) with vectors of Binder features (cf. Section 3.2).
After the popularization of DSMs, the use of word embeddings for neurosemantic
decoding has become widespread. Actually, the decoding task itself has turned into an
important benchmark for DSMs, since it is claimed to represent a more robust alterna-
tive to the traditional use of behavioral datasets (Murphy, Talukdar, and Mitchell 2012).
Some of these studies used fMRI data to learn a mapping from the classical count-based
distributional models (Devereux, Kelly, and Korhonen 2010; Murphy, Talukdar, and
Mitchell 2012), from both count and prediction vectors (Bulat, Clark, and Shutova 2017b;
Abnar et al. 2018), from contextualized vectors (Beinborn, Abnar, and Choenni 2019) or
from topic models (Pereira, Detre, and Botvinick 2011; Pereira, Botvinick, and Detre
2013). This methodology has recently been extended beyond words to represent the
meanings of entire sentences (Anderson et al. 2016; Pereira et al. 2018; Sun et al. 2019),
even in the presence of complex compositionality phenomena such as negation (Djokic
et al. 2019), or to predict the neural responses to complex visual stimuli (Güçlü and van
Gerven 2015). Athanasiou, Iosif, and Potamianos (2018) showed that neural activation
semantic models built out of these mappings can also be used to successfully carry out
NLP tasks such as similarity estimation, concept categorization and textual entailment.
Despite the analogy, it is important to underline a crucial difference between our
work and neurosemantic decoding. In the latter, word embeddings are used as proxies
for semantic representations to decode brain patterns that are not directly human-
interpretable. Our aim is instead to decode the content of word embeddings themselves.
We actually believe this enterprise to be also relevant for (and to a certain extent a
precondition to) the task of decoding brain states. In fact, if we want to use embeddings
for neural decoding, it is essential to have a better understanding of the semantic content
hidden in distributional representations. Otherwise, the risk is to run into the classical
fallacy of obscurum per obscurius, in which one tries to explain something unknown
(brain activations), with something that is even less known (word embeddings).
Another related line of work makes use of property norms for grounding distribu-
tional models in perceptual data and for mapping them onto interpretable representations
(Fagarasan, Vecchi, and Clark 2015; Bulat, Kiela, and Clark 2016; Derby, Miller, and
Devereux 2019), an approach that has proven useful, among other things, also


for the detection of cross-domain mappings in metaphors (Bulat, Clark, and Shutova
2017a). Similarly, other studies focusing on conceptual categorization have proposed to learn
mappings from distributional vectors to spaces of higher-order concepts (Şenel et al.
2018; Schwarzenberg, Raithel, and Harbecke 2019). Finally, Utsumi (2018, 2020) carried
out an analysis of the semantic content of non-contextualized word embeddings, which
is close in spirit to our correlation analyses in Section 4. However, our study significantly
differs from Utsumi’s in its goals and scope. While Utsumi (2020) only aims at under-
standing the semantic knowledge encoded in distributional vectors, we add to this the
idea of using the decoded embeddings to explain and interpret their performance in
probing semantic tasks (Section 5). Moreover, our study involves a larger array of DSMs
and it is the first one to include state-of-the-art contextualized embeddings.

3. Decoding the Semantic Content of Word Embeddings

We decode the meaning of word embeddings e_1, ..., e_n by mapping them onto vectors
of human-interpretable semantic features f_1, ..., f_n. We henceforth use the term dimen-
sion to refer to embedding components, while we reserve the term (semantic) feature
only for interpretable meaning components. First, we present the DSMs we have used
in our experiments (Section 3.1), then we introduce the Binder features (Section 3.2), we
illustrate the mapping method (Section 3.3), and we evaluate its quality (Section 3.4).

3.1 Word Embedding Models

Since we aim at providing a systematic comparison of the most common DSMs, we
evaluate a large pool of standard, non-contextualized word embedding models. We
trained 300-dimensional vectors on a corpus of about 3.9 billion tokens, obtained from
the concatenation of ukWaC (Baroni et al. 2009) and a 2018 dump of Wikipedia. All
vectors share the same vocabulary of ca. 345K unlemmatized tokens, corresponding to
the words with a minimum frequency of 100 in the training corpus.
Our “model zoo” includes both predict models – SGNS and FastText – and count
models – PPMI and GloVe. SGNS (Mikolov et al. 2013) is the Skip-Gram with Negative
Sampling algorithm, which learns word embeddings that predict the context lexemes
co-occurring with the targets. FastText (Bojanowski et al. 2017) is a variation of SGNS
that uses subword information and represents word types as the sum of their n-gram
embeddings. GloVe (Pennington, Socher, and Manning 2014) is a matrix model that uses
weighted regression to find embeddings that minimize the squared error of the ratios of
co-occurrence probabilities. PPMI (Bullinaria and Levy 2012) consists of a co-occurrence
matrix weighted with Positive Pointwise Mutual Information and reduced with SVD.
Although the latter DSM type could be considered out of date, we decided to include it
in our experiments, since Levy, Goldberg, and Dagan (2015) have shown that it can be
competitive with predict models, given a proper hyperparameter optimization.
Four DSMs are window-based (the w2 models select co-occurrences with a context
window of 2 words to either side of the target), and four are syntax-based. The synt
models use contexts typed with syntactic dependencies (e.g., eat-nobj), while the synf
models use syntactically filtered, untyped contexts. Dependencies were extracted from
the training corpus parsed with CoreNLP (Manning et al. 2014). As suggested by Levy,
Goldberg, and Dagan (2015) for the parameter tuning of count models, we used
context distribution smoothing of 0.75 for PPMI and we dropped the singular value
matrix produced by SVD. We also applied to PPMI and GloVe the subsampling method
proposed in Mikolov et al. (2013).


Model        Hyperparameters
PPMI.w2      345K window-selected context words, window of width 2;
             weighted with Positive Pointwise Mutual Information (PPMI);
             reduced with Singular Value Decomposition (SVD);
             subsampling method from Mikolov et al. (2013)
PPMI.synf    345K syntactically filtered context words;
             weighted with PPMI; reduced with SVD;
             subsampling method from Mikolov et al. (2013)
PPMI.synt    345K syntactically typed context words;
             weighted with PPMI; reduced with SVD;
             subsampling method from Mikolov et al. (2013)
GloVe        window of width 2;
             subsampling method from Mikolov et al. (2013)
SGNS.w2      skip-gram with negative sampling;
             window of width 2, 15 negative examples;
             trained with the word2vec library (Mikolov et al. 2013)
SGNS.synf    skip-gram with negative sampling;
             syntactically filtered context words, 15 negative examples;
             trained with the word2vecf library (Levy and Goldberg 2014)
SGNS.synt    skip-gram with negative sampling;
             syntactically typed context words, 15 negative examples;
             trained with the word2vecf library (Levy and Goldberg 2014)
FastText     skip-gram with negative sampling and subword information;
             window of width 2, 15 negative examples;
             trained with the fasttext library (Bojanowski et al. 2017)
ELMo         pretrained ELMo embeddings (Peters et al. 2018),
             available at https://allennlp.org/elmo;
             original model trained on the 1 Billion Word Benchmark (Chelba et al. 2013)
BERT         pretrained BERT-Base embeddings (Devlin et al. 2019),
             available at https://github.com/google-research/bert;
             model trained on the concatenation of the Books corpus
             (Zhu et al. 2015) and the English Wikipedia
Table 1: List of the embedding models used for the study, together with their hyperparameter settings.

A summary of all models with their respective training hyperparameters is provided in Table 1.
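To make the count-model pipeline above concrete, the following is a minimal sketch of the PPMI + SVD construction under the recommendations of Levy, Goldberg, and Dagan (2015) that we follow (context distribution smoothing with alpha = 0.75, singular value matrix dropped). The matrix C and the function name are our own illustrative choices, not code from this study:

```python
import numpy as np
from scipy.sparse.linalg import svds

def ppmi_svd(C, dim=300, alpha=0.75):
    """C: dense word-by-context co-occurrence matrix (floats)."""
    total = C.sum()
    p_w = C.sum(axis=1) / total              # target word marginals
    ctx = C.sum(axis=0) ** alpha             # context distribution smoothing
    p_c = ctx / ctx.sum()
    with np.errstate(divide="ignore"):       # zero counts give log(0) = -inf
        pmi = np.log((C / total) / np.outer(p_w, p_c))
    ppmi = np.maximum(pmi, 0.0)              # keep only positive PMI values
    U, S, Vt = svds(ppmi, k=dim)             # truncated SVD
    return U                                 # drop the singular value matrix S
```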
The contextualized embedding models are ELMo1 and BERT (the BERT-Base un-
cased version).2 Since they produce token vectors, we created type representations by
randomly sampling 1,000 sentences for each target word from the Wikipedia corpus.
We generated a contextualized embedding for each word token by feeding the sentence
to the publicly available pre-trained models of ELMo and BERT. Finally, an embedding
for each word was obtained by averaging its 1,000 contextualized vectors. We assume
this choice to be consistent with the hypothesis that context-independent conceptual
representations are abstractions from token exemplar concepts (Yee and Thompson-
Schill 2016). As a baseline, we also built models based on 300-dimensional randomly-
generated vectors (Random).
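As an illustration of this averaging procedure, the sketch below builds a type vector for a target word with the Hugging Face transformers API (the study itself used the spacy-transformers pipelines; the function and variable names here are ours, and subword tokenization is handled only in the simplest possible way):

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def type_embedding(word, sentences):
    """Average the contextualized vectors of `word` over its sampled contexts."""
    vectors = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        if word in tokens:                               # skip subword-split tokens
            vectors.append(hidden[tokens.index(word)].numpy())
    return np.mean(vectors, axis=0)                      # 768-d type vector
```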

1 https://tfhub.dev/google/elmo/3.
2 We used the pipelines included in the spacy-transformers package
(https://spacy.io/universe/project/spacy-transformers).


Domain      Meaning components (features)
Vision      VISION, BRIGHT, DARK, COLOUR, PATTERN, LARGE, SMALL, MOTION,
            BIOMOTION, FAST, SLOW, SHAPE, COMPLEXITY, FACE, BODY
Somatic     TOUCH, TEMPERATURE, TEXTURE, WEIGHT, PAIN
Audition    AUDITION, LOUD, LOW, HIGH, SOUND, MUSIC, SPEECH
Gustation   TASTE
Olfaction   SMELL
Motor       HEAD, UPPER LIMB, LOWER LIMB, PRACTICE
Spatial     LANDMARK, PATH, SCENE, NEAR, TOWARD, AWAY, NUMBER
Temporal    TIME, DURATION, LONG, SHORT
Causal      CAUSED, CONSEQUENTIAL
Social      SOCIAL, HUMAN, COMMUNICATION, SELF
Cognition   COGNITION
Emotion     BENEFIT, HARM, PLEASANT, UNPLEASANT, HAPPY,
            SAD, ANGRY, DISGUSTED, FEARFUL, SURPRISED
Drive       DRIVE, NEEDS
Attention   ATTENTION, AROUSAL
Table 2: List of the domains and meaning components (features) in Binder et al. (2016).

Word   VISION   BRIGHT   ...   COGNITION   BENEFIT   HARM     PLEASANT   UNPLEASANT
dog    5.3548   1.0968   ...   0.3548      3.5806    2.8065   3.9355     0.7097
love   0.7931   0.4828   ...   4.5172      4.9310    1.7586   5.4828     0.5172
Table 3: A sample of the rated Binder features for dog and love.

3.2 The Binder Dataset: Features for Brain-Based Semantics

Binder et al. (2016) proposed a brain-based semantics consisting of conceptual primitives
defined in terms of the modalities of neural information processing. This study aimed
at developing a representation that captured aspects of experience that are central in
the acquisition of concepts. The authors organized human experience in 13 different
domains (see Table 2), each one corresponding to a variable number of features for
which some specialized neural processor has been identified and described in the neu-
roscientific literature (Binder et al. 2016). In total, the brain-based semantics consists of
65 cognitively-motivated features, which we henceforth refer to as the Binder features.
For their collection of ratings, Binder et al. (2016) selected 242 words from the Knowl-
edge Representation in Neural Systems project (Glasgow et al. 2016), including 141 nouns,
62 verbs and 39 adjectives, plus 293 additional nouns chosen to improve the coverage of
abstract nouns, for a total of 535 words. Rated words belong to various concept types. A
summary of the concept types, parts-of-speech, and the number of words per type is
provided in Table 4. For each of these words, ratings on a 0-6 scale were collected with
Amazon Mechanical Turk, in order to assess the degree to which humans associate their
meaning with particular kinds of experience. Words were rated across multiple sessions:
each participant was assigned one word per session and provided ratings for all the se-
mantic features (cf. Table 3 for an example). Since there are several ambiguous words in
the data, participants were presented with an example sentence that allowed the correct
identification of the target word sense. The reported mean intra-word individual-to-
group correlation of the collected ratings is 0.78 (median 0.80).


Type-POS                            No. of items
Concrete Objects - Nouns            275
Living Things - Nouns               126
Other Natural Objects - Nouns       19
Artifacts - Nouns                   130
Concrete Events - Nouns             60
Abstract Entities - Nouns           99
Concrete Actions - Verbs            52
Abstract Actions - Verbs            5
States - Verbs                      5
Abstract Properties - Adjectives    13
Physical Properties - Adjectives    26
Table 4: Concept types, parts-of-speech, and number of items in the dataset by Binder et al. (2016).

Interestingly, the concept representations based on the elicited features were compared with their distributional
representations, obtained via Latent Semantic Analysis (Landauer and Dumais 1997),
showing that brain-based features are more efficient in separating conceptual categories.
We have chosen the Binder features for our decoding experiments for three main
reasons. First of all, they are empirically motivated on the grounds of neurocognitive
evidence supporting their key role for conceptual organization. This allows us to test
the extent to which these central components of meaning are actually captured by word
embeddings. Secondly, despite being quite coarse-grained, Binder features differ from
human-generated properties because the latter are actually linguistic structures that
often express complex concepts (e.g., used_for_transportation as a property for airplane),
rather than core meaning components. Thirdly, the Binder dataset covers nouns, verbs,
and adjectives, and encompasses both concrete and abstract words, while no existing
feature norms have a comparable morphosyntactic or semantic variety. Of course, we do
not claim this to be the “definitive” semantic feature list, but in our view it represents
the most complete repository of continuous featural representations available to date.
However, the analysis methodology we present in the next section is totally general,
and can be applied to any type of semantic feature vector.

3.3 Mapping Word Embeddings onto Semantic Features

For this study, we learn a mapping from an n-dimensional word embedding e (with
n = 300 for non-contextualized DSMs, 768 for BERT, and 1024 for ELMo) onto a
65-dimensional feature vector f whose components correspond to the ratings for the
Binder features. We henceforth refer to the mapped feature vectors as Binder vectors.
Our dataset consists of 534 Binder words.3
In the previous literature, mainly two methods have been used to learn a mapping
between embeddings and discrete feature spaces: regression models (Fagarasan, Vecchi,
and Clark 2015; Pereira et al. 2018; Utsumi 2018, 2020) and feedforward neural networks
(Abnar et al. 2018; Utsumi 2018, 2020). In our preliminary experiments, the mapping
with feedforward neural networks turned out to be suboptimal. Thus, we report the

3 One less than the original collection, because used appears twice, as verb and adjective.


Figure 1: (top) Mean Squared Error (values summed across the Binder features) and (bottom) explained variance for ELMo, BERT, and GloVe vectors per number of regression components.

results for our partial least squares regression model, with an appropriately chosen num-
ber k of regression components. We tested k = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
The regression models have been implemented with the Python Scikit-learn package
(Pedregosa et al. 2011).4

4 https://scikit-learn.org/stable/.
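As an illustration, a minimal sketch of this mapping step with scikit-learn's PLSRegression follows; the arrays E (embeddings) and F (gold Binder ratings) and their random stand-in values are our own assumptions for the example:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

E = np.random.rand(534, 300)      # stand-in for the word embeddings
F = np.random.rand(534, 65) * 6   # stand-in for the 0-6 Binder ratings

E_tr, E_te, F_tr, F_te = train_test_split(E, F, test_size=0.2, random_state=0)

for k in (10, 20, 30, 40, 50):    # number of regression components
    pls = PLSRegression(n_components=k).fit(E_tr, F_tr)
    F_pred = pls.predict(E_te)
    print(k, mean_squared_error(F_te, F_pred), r2_score(F_te, F_pred))
```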

3.4 Mapping Evaluation

For a preliminary evaluation of the mapping quality, we analyze the traditional metrics
of Mean Squared Error (MSE) and explained variance. First, we randomly split the data into
training and test sets, using an 80:20 ratio, and we measure the summed MSE and the
explained variance in order to determine the optimal value for the parameter k (the number of
regression components). After choosing the optimal k, vectors of Binder features are
predicted with the leave-one-out training paradigm, as in Utsumi (2018, 2020): For each
word in the dataset, we train a mapping between the embeddings and the gold standard
Binder vectors of the remaining words, and we predict the held-out word with the
resulting mapping.

Model        MSE    Variance
PPMI.w2      0.16   0.50
PPMI.synf    0.15   0.54
PPMI.synt    0.16   0.48
GloVe        0.16   0.46
SGNS.w2      0.15   0.54
SGNS.synf    0.14   0.58
SGNS.synt    0.14   0.59
FastText     0.14   0.55
ELMo         0.16   0.48
BERT         0.15   0.53
Random       0.30   -0.73
Table 5: Mean Squared Error (summed across features) and explained variance for each model, mapping with Partial Least Squares Regression and k = 30.
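A minimal sketch of this leave-one-out protocol, continuing the illustrative PLSRegression example above (array and function names are our assumptions, not the study's code):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def loo_predictions(E, F, k=30):
    """Predict each word's Binder vector from a mapping trained on all others."""
    preds = np.zeros_like(F)
    for train_idx, test_idx in LeaveOneOut().split(E):
        pls = PLSRegression(n_components=k).fit(E[train_idx], F[train_idx])
        preds[test_idx] = pls.predict(E[test_idx])
    return preds
```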
Moreover, following the literature on neural decoding, the predicted vectors are
tested on a retrieval task over the top-N neighbors. Given a predicted vector, we rank all
the 534 vectors in the gold standard dataset by decreasing cosine similarity. Then
we measure the Top-N accuracy (Top-N Acc) as the percentage of the items of the dataset
whose gold standard vector is among the top N neighbors of the predicted vector
(Fagarasan, Vecchi, and Clark 2015). We assess this value for N = 1, 5, 10.
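A rough sketch of this retrieval metric (function and variable names are ours; `predicted` and `gold` are assumed to be row-aligned (534, 65) arrays):

```python
import numpy as np

def top_n_accuracy(predicted, gold, n):
    # cosine similarity of every predicted vector to every gold vector
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    g = gold / np.linalg.norm(gold, axis=1, keepdims=True)
    sims = p @ g.T
    hits = sum(i in np.argsort(-sims[i])[:n] for i in range(len(sims)))
    return hits / len(sims)

# e.g., top_n_accuracy(F_pred, F_gold, n=10) for the Top-10 accuracy
```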
We measured the MSE and the explained variance for each model, finding that
k between 30 and 50 produces the optimal results for all models. Figure 1 shows the
MSE and explained variance as a function of k for the GloVe, ELMo, and BERT embeddings. Most
models achieve the best fit with k = 30 or k = 40. Since the average explained variance is
slightly higher for k = 30, we keep this value as the optimal one for the mapping.
Table 5 reports the MSE and explained variance for the k = 30 mapping. The best scores
are obtained with the syntax-based versions of the SGNS model, together with FastText
and BERT. All mappings perform largely better than the random baseline, for which the
explained variance is negative.
Using the Partial Least Squares Regression model with k = 30 for the mapping and
leave-one-out training, we predict the vectors of all the Binder words and we evaluate
them with the retrieval task. The results are shown in Table 6. At a glance, we can
notice that all DSMs vastly outperform the random vectors and are able to retrieve
at least half of the target concepts in the top 10 ranks. For this auxiliary task, the best
performing model is BERT, which retrieves 30% of the target concepts at the top of
the ranking and more than three quarters of them in the top 10. The next best models are
the syntactically-enriched versions of the Skip-Gram vectors, with the one using typed
dependencies coming close to BERT’s performance.
These results show overall good-quality representations for all the embedding types,
and a comparison with the scores by Utsumi (2018) confirms the superiority of the SGNS
model over GloVe and PPMI for this kind of mapping. Differently from the previous
study, we also consider embeddings that are trained with syntactic dependencies,
showing that for SGNS syntactic contexts lead to a general improvement in
performance (while typed dependencies are suboptimal for the PPMI model).


Model        N = 1   N = 5   N = 10
PPMI.w2      0.14    0.42    0.57
PPMI.synf    0.14    0.46    0.61
PPMI.synt    0.10    0.36    0.54
GloVe        0.18    0.43    0.58
SGNS.w2      0.19    0.49    0.64
SGNS.synf    0.20    0.55    0.71
SGNS.synt    0.23    0.57    0.74
FastText     0.20    0.53    0.70
ELMo         0.22    0.50    0.68
BERT         0.30    0.59    0.76
Random       0.00    0.01    0.01
Table 6: Top-N accuracy (Top-N Acc) for each word embedding model.

The next tests will aim at revealing how well the different features in the Binder
dataset are encoded by our vectors.

4. How do word embeddings encode semantic features?

In the literature on neurosemantic decoding, it has been shown that models can be
compared for their ability to explain the activity in a particular brain region (Wehbe
et al. 2014; Gauthier and Ivanova 2018; Anderson et al. 2018). Analogously, we want to
inspect which features are better predicted by a given embedding model. To this end, we compute
the average Spearman correlation between human ratings and model predictions, both
across words and across features. Results are reported in Table 7.
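Concretely, the two averages can be computed as in the following sketch (names are ours; `pred` and `gold` are assumed to be (534, 65) arrays of predicted and human-rated features):

```python
import numpy as np
from scipy.stats import spearmanr

def avg_word_correlation(pred, gold):
    # correlate the 65 feature values of each word, then average over words
    return np.mean([spearmanr(pred[i], gold[i])[0]
                    for i in range(pred.shape[0])])

def avg_feature_correlation(pred, gold):
    # correlate the 534 word values of each feature, then average over features
    return np.mean([spearmanr(pred[:, j], gold[:, j])[0]
                    for j in range(pred.shape[1])])
```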
All DSMs achieve high correlation values, higher than 0.7 per word, vastly out-
performing the random baseline. The results across the models are similar, with again
BERT and the syntax-based SGNS models taking the top spots. Consistently with the
previous tests, syntactic information seems to be useful for predicting the property val-
ues, as syntax-based models almost always perform slightly better than their window-
based counterparts. A similar finding for the prediction of brain activation patterns has
already been described by Abnar et al. (2018), who also reported a strong performance
by dependency-based embeddings. It is also interesting to notice that all our models
have much higher correlation values than the best results reported by Utsumi (2018), a
difference that might be due to the choice of the training corpora (we used a concate-
nation of ukWaC and Wikipedia, while Utsumi trained his models on COCA and,
separately, on the Wikipedia corpus alone). Finally, while the PPMI embeddings used
by Utsumi drastically underperform, our PPMI vectors come much closer to the predict
ones, although the latter still retain an edge.


Figure 2: Average Spearman correlations per domain between the estimated and the original Binder features for each embedding type.

Model        Word correlation   Feature correlation
PPMI.w2      0.77               0.70
PPMI.synf    0.79               0.72
PPMI.synt    0.80               0.71
GloVe        0.76               0.69
SGNS.w2      0.80               0.74
SGNS.synf    0.82               0.76
SGNS.synt    0.82               0.77
FastText     0.81               0.75
ELMo         0.81               0.76
BERT         0.82               0.77
Random       0.20               -0.01
Table 7: Average word and feature Spearman correlation between the human ratings in Binder et al. (2016) and the estimated values for each embedding type.

In the heatmap in Figure 2, it is possible to observe the average correlations for
each Binder domain. It is striking that the features belonging to the Cognition, Causal,
and Social domains are the best predicted ones. On the other hand, somatosensorial
features are predicted with lower accuracy, except for Gustation and Olfaction, which
however include just one feature each. As suggested by Utsumi (2018, 2020), who reported
consistent findings, this can be explained by the fact that embeddings learn word
meaning only from textual data: Psycholinguistic studies on the mental lexicon theorize
that humans combine both linguistic information and first-hand experience of the world
(Vigliocco and Vinson 2007; McRae and Matsuki 2009), and domains such as Cognition
and Social are especially important in the characterization of abstract concepts, for
which textual information has been suggested to be the prevailing source (Vigliocco
et al. 2009).

Figure 3: Spearman correlations between the estimated and original Binder features for each embedding type.

When it comes to the somatosensorial features of concrete concepts, instead,
text-based models are clearly missing that kind of information about the referents,
although various aspects of experiential information are “redundantly” encoded in
linguistic expressions (Riordan and Jones 2011), as suggested by the so-called Symbol
Interdependency Hypothesis (Louwerse 2008). Finally, spatial and temporal features are
particularly challenging for distributional representations. This is compatible with the
hypothesis that temporal concepts are mainly represented in spatial terms and the
acquisition of spatial attributes requires multimodal evidence (Binder et al. 2016), which
is instead lacking in our distributional embeddings. The Emotion domain also shows
good correlation values, confirming the role of distributional information in shaping
the affective content of lexical items (Recchia and Louwerse 2015; Lenci, Lebani, and
Passaro 2018).
Figure 3 provides a more analytical and variegated view of the way embeddings
predict each Binder feature, revealing interesting differences within the various do-
mains. First of all, we can observe that some somatosensorial semantic dimensions
are indeed strongly captured by embeddings, consistently with the hypothesis that
several embodied features are encoded in language (Louwerse 2008). For instance,
COLOR, MOTION (i.e., “showing a lot of visually observable movement”), BIOMOTION
(i.e., “showing movement like that of a living thing”), and SHAPE (i.e., “having a
characteristic or defining visual shape or form”) are among the best predicted visual
features.

Figure 4: Average Spearman correlations per word super category (a) and per word super type (b).

FAST is predicted much better than SLOW, while embeddings do not seem
to discriminate the BRIGHT and DARK components. In the Audition domain, LOUD,
MUSIC (i.e., “making a musical sound”), and SPEECH (i.e., “someone or something that
talks”) are generally very well predicted. The Spatial domain instead shows an uneven
behavior, with LANDMARK (i.e., “having a fixed location, as on a map”), SCENE (i.e.,
“bringing to mind a particular setting or physical location”), and PATH (i.e., “showing
changes in location along a particular direction or path”) presenting much higher corre-
lation values than the other features. The best predicted social features are HUMAN (i.e.,
“having human or human-like intentions, plans, or goals”) and COMMUNICATION (i.e.,
“a thing or action that people use to communicate”). In relation to the spatial features, TIME,
and HUMAN, it is interesting to point out that the models with syntactic information
generally make better predictions than their window-based equivalents (cf. in Figure 3
the values for the synf/synt versions of the PPMI/SGNS models against their w2 equivalents
and FastText). Finally, negative sentiments and emotions are better predicted than
positive ones. This is consistent with previous reports of negative moods being more
frequently expressed in online texts (De Choudhury, Counts, and Gamon 2012).
Using the metadata in the Binder dataset, we group the words per super category
and type, and compute the average correlations. A quick look at Figure 4a reveals that
mental entities are the best represented ones, while embeddings struggle the most
with physical and abstract properties. Also living objects and events tend to be well
represented by most embedding models. Figure 4b provides a summary of the average
correlations per word type, confirming that things are the most correlated, whereas
weaker correlations are observed for actions. Finally, the models only manage to achieve
moderate-to-low correlations for properties.
Finally, it is worth focusing on the behavior of contextualized embeddings. Though
BERT has a slightly higher Top-N accuracy (cf. Section 3.4), its overall word and feature


correlation is equivalent to that of SGNS.synt (cf. Table 7). Moreover, Figures 2–4
do not show any significant difference in the kinds of semantic dimensions encoded by
traditional DSMs with respect to BERT and ELMo vectors. This leads us to conjecture
that the true added value of the latter models lies in their ability to capture the meaning
variations of word tokens in context, rather than in the type of semantic information
they can distil from distributional data.

5. Using semantic features to analyze probing tasks

Probing tasks (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017) have become one of
the most common tools to investigate the content of embeddings. A probing task con-
sists of a binary classifier which is fed with embeddings and is trained to classify them
with respect to a certain linguistic dimension (e.g., animacy). The classification accuracy
is taken as proof that the embeddings encode that piece of linguistic information. As
we said in the Introduction, the limit of the probing task methodology is that it
only provides indirect evidence about the way linguistic categories are represented by
embeddings. In this section, we show how the decoded Binder vectors can be used to
“open the box” of semantic probing tasks, to inspect the features that are relevant for a
certain task, and to analyze the performance of distributional embeddings.

5.1 Probing tasks for human-interpretable embeddings

We use the original word embeddings and their corresponding mapped Binder vectors
as input features to a logistic regression classifier, which has to determine whether they belong
to a given semantic class (Yaghoobzadeh et al. 2019). The human-interpretable nature
of Binder vectors allows us to decode and explain the performance of the original
embeddings in the probing tasks. In computational linguistics, being able to determine
the semantic class membership of a word is an important task, with several
applications such as question answering, information extraction, and ontology
generation (Vulić et al. 2017). Similarly, the automatic detection of a given semantic
feature of a word is potentially useful for the automatic creation of lexicons and dictio-
naries, e.g., sensory lexicons (Tekiroglu, Özbal, and Strapparava 2014), emotion lexicons
(Buechel and Hahn 2018), and sentiment dictionaries (Turney and Littman 2003; Esuli
and Sebastiani 2006; Baccianella, Esuli, and Sebastiani 2010; Cardoso and Roy 2016;
Sedinkina, Breitkopf, and Schütze 2019).
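A minimal sketch of such a probing classifier with scikit-learn follows; the arrays X and y are illustrative stand-ins for the mapped Binder vectors and the binary class labels, not data from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = np.random.rand(500, 65)        # stand-in for mapped 65-d Binder vectors
y = np.random.randint(0, 2, 500)   # stand-in binary semantic-class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))

# With Binder vectors as input, clf.coef_[0] gives one weight per semantic
# feature, so the most important features of the probed class can be read off.
```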
Non-contextualized embeddings were tested on the following probing tasks that
target different semantic classes and features:

Positive/Negative – Given the embedding of a word, the logistic regression clas-
sifier has to decide whether the word has a positive or a negative polarity. The
dataset consists of 250 positive words and 250 negative words from the ANEW
sentiment lexicon (Nielsen 2011; Bradley and Lang 2017), which comprises
a total of 3,188 words with human valence ratings on a scale between 1 (very
unpleasant) and 8 (very pleasant). The selected positive items have valence ratings
higher than 7, and the negative items have valence ratings lower than 3. The
dataset was randomly split into 400 items for training and 100 words for test.

Concrete/Abstract – The task is to decide whether a noun is concrete or abstract.
The dataset consists of 254 nouns (91 abstract, 163 concrete) selected from SimLex-
999 (Hill, Reichart, and Korhonen 2015). The dataset was randomly split into 203

items for training and 51 words for test. For this task, concrete nouns are assumed
as the positive class.

Animate/Inanimate – The task is to decide whether a noun is animate or inani-
mate. The dataset includes 810 nouns (672 animate, 138 inanimate) corresponding
to the targets in the Direct object animacy task described below, randomly split
into 658 for training and the remaining 152 for test. For this task, animate nouns are
assumed as the positive class.

VerbNet – The task is based on verb semantic classes included in VerbNet (Kipper
et al. 2008; Palmer, Bonial, and Hwang 2017). For each VerbNet class, we generated
a set of negative examples by randomly extracting an equal number of verbs that
do not belong to the semantic class (i.e., for a semantic class with n verbs, we
extract n verbs from the other classes to build the negative examples). Each class
was then randomly split in a training and in a test set, using a 80 : 20 ratio, and
we selected the 23 classes that contained at least 20 test verbs.5 The task consists
in predicting whether a target verb is a class instance or not.

As the key feature of contextualized DSMs is to generate embeddings of words in
context, BERT and ELMo were tested on two semantic tasks probing a target word token
in an input sentence:

Direct object animacy – The task is to decide whether the direct object noun
of a sentence is animate or inanimate, and is the contextualized equivalent of
the Animate/Inanimate task above. The dataset includes 647 training subject-
verb-object sentences with animate and inanimate direct objects, and 163 test
sentences.6

Causative/Inchoative alternation – The task is to decide whether the verb in
a sentence undergoes the causative/inchoative alternation or not (Levin 1993).
Alternating verbs like break can occur both in agent-patient transitive sentences
(The man broke the vase) and in intransitive sentences in which the patient noun
occurs as subject (The vase broke). Non-alternating verbs like buy can instead only
occur in transitive sentences (The man bought the book vs. *The book bought). This
task has already been used to probe vectors by Warstadt, Singh, and Bowman
(2019) and Klafka and Ettinger (2020). We used the dataset of the latter work, con-
sisting of 4,000 training sentences and 1,000 test sentences, equally split between
alternating and non-alternating target verbs. For this task, alternating verbs are
assumed as the positive class.

BERT and ELMo were queried with the sentences in the dataset to obtain the contextu-
alized embedding of the target word (the direct object noun for the animacy task, the
verb for the causative/inchoative one), which was then fed into the classifiers.
The embeddings were not fine-tuned in the probing tasks. In fact, the overall
purpose of the analysis is not to optimize the performance of the classifiers, but to use
them to investigate the information that the original embeddings encode.
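
The probing classifiers themselves reduce to a few lines of code. The following sketch assumes a logistic-regression probe trained with scikit-learn (Pedregosa et al. 2011) on the frozen embeddings; the classifier type and the file names are illustrative assumptions, since the choice of probe is not the focus of the analysis.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical files holding the frozen embeddings (one row per word,
    # or per token for the contextualized tasks) and the binary labels.
    X_train = np.load("train_embeddings.npy")
    y_train = np.load("train_labels.npy")      # 1 = positive class
    X_test = np.load("test_embeddings.npy")
    y_test = np.load("test_labels.npy")

    # Only the linear probe is trained; the embeddings stay fixed.
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("probing accuracy: %.2f" % probe.score(X_test, y_test))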

5 The classes pour-9.5+spray-9.7, remove-10.1+clear-10.3+mine-10.9, and cut-21.1+carve-21.2 were
obtained by merging some VerbNet subclasses.
6 The dataset was developed and kindly provided by Evelina Fedorenko, Anna Ivanova and Carina Kauf.


5.2 Interpreting probing tasks with Binder features

Task PPMI.w2 PPMI.synf PPMI.synt SGNS.w2 SGNS.synf SGNS.synt GloVe FastText Majority
Positive/Negative 0.52 0.65 0.65 0.83 0.85 0.79 0.79 1.00 0.52
Concrete/Abstract 0.69 0.70 0.73 1.00 1.00 1.00 1.00 1.00 0.59
Animate/Inanimate 0.74 0.78 0.86 0.94 0.96 0.97 0.96 0.98 0.83
VerbNet
pour-9.5+spray-9.7 0.61 0.52 0.48 0.70 0.78 0.78 0.82 0.74 0.52
fill-9.8 0.50 0.50 0.52 0.74 0.72 0.74 0.70 0.74 0.50
butter-9.9 0.47 0.55 0.50 0.87 0.84 0.86 0.89 0.89 0.63
pocket-9.10 0.58 0.50 0.50 0.88 0.88 0.92 0.75 0.83 0.50
remove-10.1+clear-10.3+mine-10.9 0.44 0.56 0.44 0.56 0.68 0.76 0.68 0.64 0.52
steal-10.5 0.48 0.48 0.41 0.83 0.83 0.79 0.86 0.90 0.52
debone-10.8 0.59 0.63 0.54 0.86 0.82 0.82 0.68 0.90 0.50
cut-21.1+carve-21.2 0.65 0.65 0.57 0.80 0.80 0.96 0.57 0.73 0.50
amalgamate-22.2 0.56 0.52 0.52 0.74 0.70 0.78 0.74 0.83 0.52
tape-22.4 0.65 0.62 0.73 0.98 0.97 0.94 0.98 1.00 0.59
characterize-29.2 0.57 0.57 0.57 0.81 0.81 0.86 0.76 0.81 0.52
amuse-31.1 0.55 0.56 0.51 0.69 0.80 0.75 0.67 0.72 0.51
admire-31.2 0.44 0.52 0.56 0.87 0.91 0.87 0.78 0.96 0.52
marvel-31.3 0.59 0.62 0.65 0.72 0.69 0.83 0.69 0.76 0.58
judgement-33.1 0.71 0.66 0.66 0.77 0.80 0.80 0.77 0.80 0.54
manner_of_speaking-37.3 0.71 0.79 0.48 0.79 0.86 0.90 0.90 0.86 0.50
say-37.7 0.62 0.60 0.50 0.55 0.50 0.64 0.50 0.55 0.55
animal_sounds-38 0.67 0.80 0.67 0.87 0.90 0.83 0.83 0.93 0.56
sound_emission-43.2 0.52 0.48 0.61 0.70 0.78 0.74 0.65 0.70 0.52
cooking-45.3 0.55 0.60 0.45 0.77 0.86 0.90 0.73 0.86 0.52
other_cos-45.4 0.54 0.50 0.57 0.70 0.73 0.74 0.76 0.68 0.50
contiguous_location-47.8 0.57 0.57 0.52 0.86 0.76 0.81 0.61 0.81 0.52
run-51.3.2 0.52 0.48 0.56 0.80 0.84 0.80 0.75 0.80 0.56
Table 8: Classification accuracy on the probing tasks for the 8 non-contextualized DSMs.

Task BERT ELMo Majority


Direct object animacy 0.99 0.96 0.83
Causative/Inchoative alternation 0.91 0.86 0.51
Table 9: Classification accuracy on the contextualized probing tasks.

Our analysis consists of three main steps: i.) for each semantic task, we first train
a classifier using the original word embeddings and we measure its accuracy, as is
customary in the probing task literature; ii.) then, we train the same classifiers using
the corresponding mapped Binder vectors in the training sets, and we inspect the most
important semantic features of each probed class; finally, iii.) we measure the overlap
between the classifier top features and the top features of the words in the test sets, and
we use this information to interpret the performance of the models in the various tasks.

5.2.1 Measuring embedding accuracy in probing tasks. First of all, we evaluate the
performance of the embeddings in each task via the traditional accuracy metric, in
order to check their ability to predict the semantic class of the word. A summary of
the performance of the traditional DSMs can be seen in Table 8, while the scores for the
contextualized models are shown in Table 9. Since the classes are unbalanced in most


tasks, the tables also report the results for a majority baseline. At first glance, in Table
8 we can notice a performance gap between the count models based on PPMI and SVD
and the other word embedding models, with the former being largely outperformed
by the latter on all probing datasets (the largest observed gap being around 40%)
and struggling even to beat the majority baseline in many of the VerbNet-derived test
sets. All neural embeddings achieve 100% accuracy on the Concrete/Abstract task,
and one of the models, FastText, achieves the same score on the Positive/Negative
task as well. The VerbNet tasks, possibly because of the fuzzy boundaries of the verb
semantic classes, proved to be the most challenging ones, and in some cases the models
struggle to beat a chance classifier. The best performing embeddings are, in general, the
FastText ones and the vectors of the SGNS family.
As for the contextualized probing tasks, BERT outperforms ELMo, and the
Causative/Inchoative alternation task is more difficult, probably because alternating
verbs are semantically more heterogeneous. However, even in this case the classification
accuracy is very high, when compared to the majority baseline.

5.2.2 Examining the semantic features of the probed classes. Since probing tasks are
typically used as “black box” tools, the performance obtained by a certain DSM is
usually regarded as sufficient to draw conclusions about the information encoded by
its vectors. Here, the mere embedding accuracy reported in Tables 8 and 9 is
not the primary aim of our analyses. In fact, we want to make the semantic information
learnt by the classifiers explicit and human-interpretable, in order to characterize the
content of the probed semantic dimensions. To this purpose:
• for each DSM, we learn a mapping between its embeddings and 65-dimensional
Binder vectors, using the whole set of 534 Binder words as training data;
• we use the decoding mapping to generate the Binder vectors of all the words
contained in the probing datasets;
• for each probing task t, we train a classifier with the decoded Binder vectors;
• we extract the weights assigned by the classifier to the Binder features and sort
them in descending order. Given the task t, TopTaskFeats(t, n) is the set of the
top n features learnt by the classifier for t using the Binder vectors (a minimal
sketch of this pipeline is given after the list).
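
The following sketch illustrates the pipeline above with scikit-learn, assuming, for illustration only, a ridge regression decoder and a logistic-regression task classifier; file names and variable names are hypothetical.

    import numpy as np
    from sklearn.linear_model import Ridge, LogisticRegression

    # Hypothetical inputs: embeddings and 65-dimensional Binder vectors
    # for the 534 Binder training words.
    E_binder = np.load("binder_word_embeddings.npy")   # (534, dim)
    F_binder = np.load("binder_feature_vectors.npy")   # (534, 65)

    # Learn the decoding map from embeddings to Binder features
    # (a ridge regression here, as an illustrative choice).
    decoder = Ridge(alpha=1.0).fit(E_binder, F_binder)

    # Decode Binder vectors for all the words of a probing task t and
    # train a linear classifier on them.
    F_task = decoder.predict(np.load("task_word_embeddings.npy"))
    y_task = np.load("task_labels.npy")
    clf = LogisticRegression(max_iter=1000).fit(F_task, y_task)

    # TopTaskFeats(t, n): the n Binder features with the largest
    # classifier weights for the positive class.
    feature_names = open("binder_feature_names.txt").read().splitlines()
    order = np.argsort(clf.coef_[0])[::-1]
    print([feature_names[i] for i in order[:5]])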
The set TopTaskFeats(t, n) includes the most important semantic features for the
classification task t. Table 10 reports the top 5 features for some of our probing tasks
using the Binder vectors decoded from FastText, which is one of the best performing
non-contextualized models on average, and from BERT. Notice that the top features
provide a nice characterization of the semantic classes targeted across tasks. FACE,
HUMAN, and SPEECH appear among the top features of animate nouns. For sentiment
classification, the most relevant features are positive emotions (PLEASANT, HAPPY,
BENEFIT) or belong to the Social domain (SELF). On the other hand, physical properties
(SHAPE, VISION, WEIGHT) are the most important ones for the Concrete/Abstract
distinction, in which concrete nouns represent the positive class. Similar considerations
apply to the VerbNet tasks. The class run-51.3.2 contains motion verbs and its most
relevant features refer to movement (MOTION, LOWER LIMB, FAST) and direction
(PATH). The classes judgement-33.1 and say-37.7 are characterized by features related to
communication and cognition. The class sound_emission-43.2 is instead associated with
features belonging to the Audition domain. Perhaps the least perspicuous case is
represented by the features associated with the alternating class in the
Causative/Inchoative task. However, it is worth noticing the salience of the
TEMPERATURE feature, as various alternating verbs express this dimension (e.g., warm,


Task Top Features
Positive/Negative PLEASANT, HAPPY, BENEFIT, NEEDS, SELF
Concrete/Abstract SHAPE, VISION, WEIGHT, TEXTURE, TOUCH
Animate/Inanimate FACE, BODY, HUMAN, SPEECH, BIOMOTION
fill-9.8 COLOR, VISION, BRIGHT, WEIGHT, PATTERN
cut-21.1+carve-21.2 PRACTICE, TOUCH, UPPER LIMB, VISION, SHAPE
admire-31.2 COGNITION, SOCIAL, AROUSAL, HAPPY, PLEASANT
judgement-33.1 COMMUNICATION, SOCIAL, HEAD, COGNITION, AROUSAL
say-37.7 COMMUNICATION, COGNITION, BENEFIT, SOCIAL, SELF
sound_emission-43.2 AUDITION, LOUD, SOUND, HIGH, MUSIC
cooking-45.3 TASTE, TEMPERATURE, SMELL, HEAD, PRACTICE
contiguous_location-47.8 LANDMARK, VISION, COLOR, LARGE, SCENE
run-51.3.2 LOWER LIMB, MOTION, PATH, FAST, BIOMOTION
Direct object animacy FACE, UPPER LIMB, SCENE, COMPLEXITY, BIOMOTION
Causative/Inchoative alternation SLOW, COMPLEXITY, TEMPERATURE, UPPER LIMB, SHORT
Table 10: Top 5 features (ordered from left to right) for a selection of the non-contextualized
probing tasks with the Binder vectors mapped from the FastText embeddings, and for the
contextualized probing tasks with the Binder vectors mapped from the BERT token embeddings.

heat, cool, burn, etc.). This shows how a simple featural decoding of the embeddings can
be used to investigate the internal structure of the semantic classes that are targeted by
probing tasks.

5.2.3 Explaining the performance of embeddings in probing tasks. The third phase of
our analysis combines the results of the previous two steps: The Binder feature vectors
learnt in Section 5.2.2 are used to explain the accuracy of the embeddings in the probing
tasks in Section 5.2.1.
For each task t and word w in the test set of t, we rank the features of the decoded
Binder vector fw in descending order according to their values. We indicate with
TopWordFeats(fw, n) the set of the top n features in the Binder vector fw. Then we
measure with Average Precision (AP) the extent to which the top Binder features of t
appear among the top decoded features of the test word w. Given the ranked feature
sets TopTaskFeats(t, n) and TopWordFeats(fw, n), we compute AP(t, w) as follows:

AP(t, w) = \frac{1}{n} \sum_{r=1}^{n} P_w^t(r)    (1)

P_w^t(r) = \frac{|TopTaskFeats(t, n) \cap TopWordFeats(f_w, n)_1^r|}{r}    (2)

where the numerator of Equation 2 is the number of task features that are also in
the word feature vector from rank 1 to rank r. AP is a measure derived from in-
formation retrieval combining precision, relevance ranking and overall recall (Man-
ning, Raghavan, and Schütze 2008; Kotlerman et al. 2010). In our case, the ranked
task features are like documents to be retrieved and the word features are like doc-
uments returned by a query. AP takes into account two main factors: i.) the extent
of the intersection among the n most important semantic features for a word and a


[Figure 5 here: a grid of AP boxplots, one panel per non-contextualized DSM (FastText, GloVe, PPMI.synf, PPMI.synt, PPMI.w2, SGNS.synf, SGNS.synt, SGNS.w2); x-axis: Classification (FN, FP, TN, TP); y-axis: AP.]

Figure 5: Average Precision (AP) boxplots of the Binder vectors of the test words with
respect to the top-20 Binder features of each probing task. True Positive (TP), True
Negative (TN), False Positive (FP), and False Negative (FN) refer to the output of the
classifiers trained on the original embeddings of the non-contextualized DSMs.

task, and ii.) their mutual ranking. The higher the AP (t, w) score, the more the top
features of w that are also included in the top features for the task t. For example,
suppose that T opT askF eats(P ositive/N egative, 3) = {P LEASANT, H APPY, B ENEFIT},
AP (P ositive/N egative, w) = 1 if and only if T opW ordF eats(fw , 3) contains the same
semantic features at the top of the rank.
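
The computation of AP(t, w) can be transcribed directly from Equations 1 and 2; the sketch below assumes the task and word features are given as ranked lists.

    def average_precision(top_task_feats, top_word_feats, n=20):
        # AP(t, w) of Equations 1-2: the mean, over ranks r = 1..n of the
        # word ranking, of the precision at r, i.e., the number of word
        # features up to rank r that are also among the task's top-n
        # features, divided by r.
        task_set = set(top_task_feats[:n])
        hits, precisions = 0, []
        for r, feat in enumerate(top_word_feats[:n], start=1):
            if feat in task_set:
                hits += 1
            precisions.append(hits / r)
        return sum(precisions) / n

    # The toy case from the text: identical top-3 feature sets give AP = 1.
    task = ["PLEASANT", "HAPPY", "BENEFIT"]
    word = ["HAPPY", "PLEASANT", "BENEFIT"]
    print(average_precision(task, word, n=3))   # 1.0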
For each model and each task, we analyze the AP of the output of the classifiers
trained on the original word embeddings, whose accuracy is reported in Tables 8 and 9.
We compute the AP of the words correctly classified in the positive class (true positive,
TP) and in the negative class (true negative, TN). Moreover, we compute the AP of the
words wrongly classified in the positive class (false positive, FP) and in the negative class
(false negative, FN). The AP distribution of each word group across the probing tasks
is reported in Figure 5 for the non-contextualized DSMs and in Figure 6 for BERT and
ELMo. The Kruskal-Wallis rank sum non-parametric test shows that in all models the
four word groups differ significantly in their AP values (df = 3, p-value < 0.001).
Post-hoc pairwise Mann–Whitney U-tests (with Bonferroni correction for multiple
comparisons) confirm that across tasks TPs have a significantly higher AP than FPs
(p < 0.001). Therefore, the words correctly classified in the positive class share a large
number of the top ranked features for that class (e.g., the words whose embeddings


[Figure 6 here: AP boxplots by classification outcome (FN, FP, TN, TP), one panel for BERT and one for ELMo; y-axis: AP.]

Figure 6: Average Precision (AP) boxplots of the Binder vectors of the test words with
respect to the top-20 Binder features of each probing task. True Positive (TP), True
Negative (TN), False Positive (FP), and False Negative (FN) refer to the output of the
classifiers trained on the original BERT and ELMo embeddings.

are correctly classified as animate have a large number of the top semantic features
that characterize animacy). Conversely, the words correctly classified in the negative
class have very few, if any, of the top task features. It is interesting to observe that the
DSMs for which the difference between the median AP (represented by the thick line in
each boxplot) of TPs and the median AP of TNs is higher (i.e., the neural embeddings
for the non-contextualized models and BERT) are the models that in general show a
higher classification accuracy in the probing tasks and better encode the Binder features
(cf. Section 4). This suggests that a model's accuracy in probing tasks is strongly related
to the way its embeddings encode the most important semantic features for a certain
classification task (cf. below).
In Figures 5 and 6, the AP of the wrongly classified words (i.e., FPs and FNs) tends
to occupy an intermediate position between the AP of TPs and TNs. In fact, we can
conjecture that a word in the positive class (e.g., an animate noun) is wrongly classified
(e.g., labelled as inanimate), because it lacks many of the top features characterizing
the target class (e.g., animacy). Post-hoc pairwise Mann–Whitney U-tests support this
hypothesis, because the AP of the FNs is significantly different from the one of TPs
(PPMI.synt: p < 0.05; GloVe: p < 0.05; SGNS.w2: p < 0.001; SGNS.synf: p < 0.001;
SGNS.synt: p < 0.001; FastText: p < 0.001; ELMo: p < 0.001), except for PPMI.w2 (p = 0.23),
PPMI.synf (p = 1) and BERT (p = 0.39). Conversely, the AP of FPs is significantly higher
than the one of TNs (SGNS.w2: p < 0.001; SGNS.synf: p < 0.001; SGNS.synt: p < 0.001;
FastText: p < 0.001), except for the largely underperforming PPMI models (PPMI.w2: p
= 1; PPMI.synf: p = 0.39; PPMI.synt: p = 0.38), ELMo (p = 1), and marginally for GloVe
(p = 0.08) and BERT (p = 0.08). This suggests that the semantic features of FPs tend to
overlap with the top features of the positive class more than those of TNs.


[Figure 7 here: AP boxplots for classes 0 and 1, one panel for the Positive/Negative task and one for the VerbNet say-37.7 task; y-axis: AP.]

Figure 7: The boxplots show the Average Precision (AP) of the Binder vectors decoded
from FastText embeddings for the words belonging to the positive (1) and negative (0)
classes in the test sets of the Positive/Negative and VerbNet say-37.7 probing tasks.

The analysis of the semantic features of misclassified words can also provide interesting
clues to explain why DSMs make errors in probing tasks. For instance, FastText
does not classify keen as a sound emission verb (i.e., it is a FN for the VerbNet class
38). If we inspect its decoded vector we find COGNITION, SOCIAL, SELF, BENEFIT and
PLEASANT among its top features, likely referring to the abstract adjective keen, which
is surely much more frequent in the PoS-ambiguous training data than the verb keen (to
emit wailing sounds). On the other hand, PPMI.w2 wrongly classifies judge as a manner
of speaking verb (i.e., it is a FP of the VerbNet class 37.3). This mistake can be explained
by looking at its decoded vector, whose top feature is SPEECH, which is probably due to
the quite common usage of judge as a communication verb.
As illustrated in Tables 8 and 9, the variance of model accuracy across tasks is
extremely high. For instance, the accuracy of FastText ranges from 1 in the Posi-
tive/Negative and Concrete/Abstract tasks, to 0.55 for the VerbNet say-37.7 class. In the
standard use of probing tasks, the classifier accuracy is taken to be enough to draw con-
clusions about the way a certain piece of information is encoded by embeddings. Here,
we go beyond this “black box” analysis and provide a more insightful interpretation
of the different behavior of embeddings in semantic probing tasks. We argue that such
explanation can come from the decoded Binder features, and that a model performance
in a given task t depends on the way the words to be classified encode the top-n ranked
features for t (i.e., T opT askF eats(t, n)). For instance, consider the boxplots in Figure
7, which show the AP of the Binder vectors decoded from the FastText embeddings
for the words belonging to the positive (1) and negative (0) classes in the test sets of
the Positive/Negative and VerbNet say-37.7 probing tasks. FastText achieves a very high
accuracy in the former task, and the AP distributions of the 1 and 0 words are clearly
distinct, indicating that these two sets have different semantic features, and that the
features of the 0 words have a very low overlap with the top task features. Conversely,


Model ρ p-value
PPMI.w2 0.29 0.15
PPMI.synf 0.43 0.03∗
PPMI.synt 0.23 0.26
GloVe 0.65 < 0.001∗
SGNS.w2 0.68 < 0.001∗
SGNS.synf 0.78 < 0.001∗
SGNS.synt 0.70 < 0.001∗
FastText 0.71 < 0.001∗
Table 11: Spearman correlation (ρ) between APdiff(t) and the classification accuracy for
the static embedding models.

Task Model Accuracy APdiff


Direct object animacy BERT 0.99 0.13
Direct object animacy ELMo 0.96 0.04
Causative/Inchoative alternation BERT 0.91 0.06
Causative/Inchoative alternation ELMo 0.86 0.01
Table 12: Classification accuracy and APdiff(t) for the contextualized models.

the AP distributions of the 1 and 0 words for the say-37.7 task overlap to a great extent,
suggesting that the two groups are not well separated in the semantic feature space. Our
hypothesis is that the DSM accuracy in a probing task tends to be strongly correlated
with the degree of separation between the semantic features decoded from the positive
and negative items in the target class.
To verify this hypothesis, we take the sets of positive (W1) and negative (W0) test
words of each task t and we compute the following measure:

AP_{diff}(t) = AP(t, W_1) - AP(t, W_0)    (3)

where AP(t, W1) and AP(t, W0) are respectively the mean AP for W1 and W0. Therefore,
APdiff(t) estimates the separability of the positive and negative words in the
semantic feature space relevant for the task t. We expect that the higher the APdiff(t) of
a model, the higher its performance in t. Table 11 shows that this prediction is borne out,
at least for the best performing non-contextualized DSMs. The Spearman correlation
between the model accuracy in the probing tasks and APdiff(t) is fairly high for all
models, except for the PPMI ones. It is again suggestive that these are not only the
worst-performing models in the probing tasks, but also the embeddings with a less
satisfactory encoding of the Binder features. Table 12 illustrates that the correlation
between APdiff(t) and task accuracy holds true for contextualized embeddings as well.
For both BERT and ELMo, APdiff and accuracy are greater for the Direct object
animacy task than for the Causative/Inchoative alternation.
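
A sketch of how APdiff(t) and its correlation with task accuracy can be computed is given below; the AP scores and accuracies used here are toy values for illustration, not our experimental results.

    import numpy as np
    from scipy.stats import spearmanr

    def ap_diff(ap_w1, ap_w0):
        # Equation 3: mean AP of the positive test words (W1) minus the
        # mean AP of the negative test words (W0).
        return float(np.mean(ap_w1) - np.mean(ap_w0))

    # Toy AP scores for three tasks of one model (not real results):
    per_task_ap = [([0.71, 0.65, 0.80], [0.21, 0.18, 0.30]),   # well separated
                   ([0.55, 0.49, 0.60], [0.35, 0.41, 0.38]),   # moderate
                   ([0.42, 0.44, 0.40], [0.39, 0.43, 0.41])]   # overlapping
    accuracies = [0.98, 0.80, 0.55]   # toy classification accuracies

    apdiffs = [ap_diff(w1, w0) for w1, w0 in per_task_ap]
    rho, pval = spearmanr(apdiffs, accuracies)
    print("Spearman rho = %.2f (p = %.2f)" % (rho, pval))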

6. General discussion and conclusions

Word embeddings have become the most common semantic representation in NLP
and AI. Despite their success in boosting the performance of applications, the way
embeddings capture meaning still defies our full understanding. The challenge mainly


depends on the apparent impossibility of interpreting the specific semantic content of
vector dimensions. Indeed, this is the essence of distributed representations like em-
beddings, in which information is spread among patterns of vector components (Hin-
ton, McClelland, and Rumelhart 1986). Consequently, the content of embeddings is
usually interpreted indirectly, by analyzing either the space of nearest neighbors, or
their performance in tasks designed to “probe” a particular semantic aspect.
In this paper, we have taken a different route, adopting a methodology inspired
by the literature on neural decoding in cognitive neuroscience. The brain too repre-
sents semantic information in distributed patterns (Huth et al. 2016). We argue that
the problem of interpreting the content of embeddings is similar to interpreting the
semantic content of brain activity. Neurosemantic decoding aims at identifying the
information encoded in the brain by learning a mapping from neural activations to
semantic features. Analogously, we decode the content of word embeddings by map-
ping them onto interpretable semantic feature vectors. Featural representations are well-
known in linguistics and cognitive science (Vigliocco and Vinson 2007), and provide a
human-interpretable analysis of the components of lexical meaning. In particular, we
rely on the ratings collected by Binder et al. (2016), whose feature set is motivated on
neurobiological basis. We have carried out the mapping of continuous embeddings onto
discrete semantic features with a twofold aim: i.) identifying which semantic features
are best encoded in word embeddings; ii.) using the proposed featural representations
to explain the performance of embeddings in semantic probing tasks.
Concerning the first goal, we have tested the embedding decoding method on
several types of static and contextualized DSMs. All models achieve high correlations
across words and features, with dependency-based DSMs having a slight edge over
the others, consistently with the findings of Abnar et al. (2018). The features from
abstract domains such as Cognition, Social and Causal seem to be the ones that are
best predicted by the models, which rely purely on text-based information,
while the prediction of spatial and temporal features is obviously more challenging. A
further analysis reveals the salience of visual, motion and audition features, supporting
the hypothesis that language redundantly encodes several aspects of sensory-motor
information (Louwerse 2008; Riordan and Jones 2011). In terms of word categories, the
vectors are very good at predicting entities, whereas they struggle with physical and
abstract properties. Moreover, it is interesting to observe that the new generation of
contextualized DSMs does not significantly differ from traditional ones for the type of
semantic information they encode.
As for the second goal, we have applied our decoded feature representations to the
widely popular probing task methodology, to gain insight on what pieces of semantic
information are actually captured by probing classifiers. For our experiments, we tested
the original embeddings on probing tasks designed to target affective valence, animacy,
concreteness, and several verb classes derived from VerbNet for non-contextualized
DSMs, and direct object animacy and causative/inchoative verb alternations for contex-
tualized embeddings. If a binary classifier manages to identify whether a word belongs
to a semantic class on the basis of its embedding, this is typically taken as indirect
evidence that the embedding encodes the relevant piece of semantic information. In our
work, instead of regarding probing tasks just as “black box” experiments, we use the
decoded feature vectors to inspect the semantic dimensions learned by the classifiers.
Moreover, we have set up a battery of tests to show how the decoded features can
explain the embedding performances in the probing tasks. We have measured with AP
the overlap between the top task features and the most important features of the test
words belonging to the positive and negative classes. Our analyses reveal that:


• the words correctly classified in the positive class (i.e., TPs) share a large number
of the top ranked features for that class, and, symmetrically, the words correctly
classified in the negative class (i.e., TNs) have a significantly lower number of the
top task features;
• words of the positive class that are wrongly classified (i.e., FNs) lack many of the
top features characterizing the target class. Conversely, the features of words
wrongly classified into the positive class (i.e., FPs) tend to overlap with the top
task features more than those of TNs;
• the accuracy of a DSM in a probing task strongly correlates with the degree of
separation between the semantic features decoded from its embeddings of the
words in the positive and negative classes.

These results show that semantic feature decoding provides a simple and useful tool
to explain the performance of word embeddings and to enhance the interpretability of
probing tasks.
The methodology we have proposed paves the way for other types of analyses
and applications. There are at least two prospective research extensions that we plan
to pursue, respectively concerning selectional preferences and word sense disambigua-
tion. Many recent approaches to the modeling of selectional preferences have given
up on the idea of characterizing the semantic constraints of predicates in terms of dis-
crete semantic types, focusing instead on measuring a continuous degree of predicate-
argument compatibility, known as thematic fit (McRae and Matsuki 2009). DSMs have
been extensively and successfully applied to address this issue, typically measuring the
cosine between a target noun vector and the vectors of the most prototypically predicate
arguments (Baroni and Lenci 2010; Erk, Padó, and Padó 2010; Lenci 2011; Sayeed, Green-
berg, and Demberg 2016; Santus et al. 2017; Chersoni et al. 2019; Zhang, Ding, and Song
2019; Zhang et al. 2019; Chersoni et al. 2020). This approach can be profitably paired
with our decoding methodology to identify the most salient features associated with
a predicate argument. For instance, we can expect that listen selects for direct objects
in which Audition features are particularly salient. This way, distributional methods
will be able not only to measure the gradient preference of a predicate for a certain
argument, but also to highlight the features that explain this preference, contributing to
characterize the semantic constraints of predicates.
As for word-sense disambiguation, models like ELMo and BERT provide con-
textualized embeddings that allow us to investigate word sense variation in context.
Using contextualized vectors, it might be possible to investigate how meaning changes
in contexts by inspecting the feature salience variation of different word tokens. For
example, we expect features like SOUND and MUSIC to be more salient in the vector of
play in the sentence The violinist played the sonata, rather than in the sentence The team
played soccer. This could be extremely useful also in tasks such as metaphor and token-
level idiom detection, where it is typically required to disambiguate expressions that
might have a literal or a non-literal sense depending on the context of usage (King and
Cook 2018; Rohanian et al. 2020).
Word embeddings and featural symbolic representations are often regarded as
antithetic and possibly incompatible ways of representing semantic information, which
pertain to very different approaches to the study of language and cognition. In this
paper, we have shown that the distance between these two types of meaning repre-
sentation is smaller than what appears prima facie. New bridges between symbolic and
distributed lexical representations can be laid, and used to exploit their complementary
strengths: the human-interpretability of the former and the gradience and robustness of the


latter. An important contribution may come from collecting more extensive data about
feature salience: the Binder dataset is an important starting point, but human ratings
about other types of semantic features and words might be easily collected with crowd-
sourcing methods.
In this work, we have mainly used feature-based representations as a heuristic
tool to interpret embeddings. An interesting research question is whether decoded
features from embeddings could actually have other applications too. For instance,
semantic features provide a more abstract type of semantic representation that might be
complementary to the fine-grained information captured by distributional embeddings.
This suggests to explore new ways to integrate symbolic and vector models of meaning.

References

Abnar, Samira, Rasyan Ahmed, Max Mijnheer, and Willem Zuidema. 2018. Experiential, Distributional and Dependency-Based Word Embeddings Have Complementary Roles in Decoding Brain Activity. In Proceedings of the LSA Workshop on Cognitive Modeling and Computational Linguistics.
Adi, Yossi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In Proceedings of ICLR, Toulon, France.
Anderson, Andrew James, Jeffrey R Binder, Leonardo Fernandino, Colin J Humphries, Lisa L Conant, Mario Aguilar, Xixi Wang, Donias Doko, and Rajeev DS Raizada. 2016. Predicting Neural Activity Patterns Associated with Sentences Using a Neurobiologically Motivated Model of Semantic Representation. Cerebral Cortex, 27(9):4379–4395.
Anderson, Andrew James, Edmund C Lalor, Feng Lin, Jeffrey R Binder, Leonardo Fernandino, Colin J Humphries, Lisa L Conant, Rajeev DS Raizada, Scott Grimm, and Xixi Wang. 2018. Multiple Regions of a Cortical Network Commonly Encode the Meaning of Words in Multiple Grammatical Positions of Read Sentences. Cerebral Cortex, 29(6):2396–2411.
Athanasiou, Nikos, Elias Iosif, and Alexandros Potamianos. 2018. Neural Activation Semantic Models: Computational Lexical Semantic Models of Localized Neural Activations. In Proceedings of COLING, pages 2867–2878.
Baccianella, Stefano, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of LREC, pages 2200–2204.
Bakarov, Amir. 2018. Can Eye Movement Data Be Used As Ground Truth For Word Embeddings Evaluation? In Proceedings of the LREC Workshop on Linguistic and Neurocognitive Resources.
Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3):209–226.
Baroni, Marco, Georgiana Dinu, and Germán Kruszewski. 2014. Don't Count, Predict! A Systematic Comparison of Context-Counting vs. Context-Predicting Semantic Vectors. In Proceedings of ACL.
Baroni, Marco and Alessandro Lenci. 2010. Distributional Memory: A General Framework for Corpus-Based Semantics. Computational Linguistics, 36(4):673–721.
Beinborn, Lisa, Samira Abnar, and Rochelle Choenni. 2019. Robust Evaluation of Language-Brain Encoding Experiments. arXiv preprint arXiv:1904.02547.
Binder, Jeffrey R, Lisa L Conant, Colin J Humphries, Leonardo Fernandino, Stephen B Simons, Mario Aguilar, and Rutvik H Desai. 2016. Toward a Brain-Based Componential Semantic Representation. Cognitive Neuropsychology, 33(3-4):130–174.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.
Boleda, Gemma and Katrin Erk. 2015. Distributional Semantic Features as Semantic Primitives – Or Not. In Proceedings of Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches: Papers from the 2015 AAAI Spring Symposium, pages 2–5, Stanford, CA, USA.
Bradley, Margaret M. and Peter J. Lang. 2017. Affective Norms for English Words (ANEW). In Technical Report C-3. UF Center for the Study of Emotion and Attention, Gainesville, FL.
Buechel, Sven and Udo Hahn. 2018. Emotion Representation Mapping for Automatic Lexicon Construction (Mostly) Performs on Human Level. In Proceedings of COLING.
Bulat, Luana, Stephen Clark, and Ekaterina Shutova. 2017a. Modelling Metaphor with Attribute-Based Semantics. In Proceedings of EACL.
Bulat, Luana, Stephen Clark, and Ekaterina Shutova. 2017b. Speaking, Seeing, Understanding: Correlating Semantic Models with Conceptual Representation in the Brain. In Proceedings of EMNLP.
Bulat, Luana, Douwe Kiela, and Stephen Christopher Clark. 2016. Vision and Feature Norms: Improving Automatic Feature Norm Learning Through Cross-Modal Maps. In Proceedings of NAACL-HLT.
Bullinaria, John A and Joseph P Levy. 2012. Extracting Semantic Representations from Word Co-Occurrence Statistics: Stop-Lists, Stemming, and SVD. Behavior Research Methods, 44(3):890–907.
Cardoso, Pedro Dias and Anindya Roy. 2016. Sentiment Lexicon Creation using Continuous Latent Space and Neural Networks. In Proceedings of the NAACL Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 37–42.
Carota, Francesca, Nikolaus Kriegeskorte, Hamed Nili, and Friedemann Pulvermüller. 2017. Representational Similarity Mapping of Distributional Semantics in Left Inferior Frontal, Middle Temporal, and Motor Cortex. Cerebral Cortex, 27(1):294–309.
Chang, Kai-min Kevin, Tom M Mitchell, and Marcel Adam Just. 2011. Quantitative Modeling of the Neural Representation of Objects: How Semantic Feature Norms Can Account for fMRI Activation. NeuroImage, 56(2):716–727.
Chelba, Ciprian, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005.
Chersoni, Emmanuele, Ludovica Pannitto, Enrico Santus, Alessandro Lenci, and Chu-Ren Huang. 2020. Are Word Embeddings Really a Bad Fit for the Estimation of Thematic Fit? In Proceedings of LREC.
Chersoni, Emmanuele, Enrico Santus, Ludovica Pannitto, Alessandro Lenci, Philippe Blache, and Chu-Ren Huang. 2019. A Structured Distributional Model of Sentence Meaning and Processing. Natural Language Engineering, 25(4):483–502.
Conneau, Alexis, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What You Can Cram into a Single $&!#* Vector: Probing Sentence Embeddings for Linguistic Properties. In Proceedings of ACL, pages 2126–2136, Melbourne, Australia.
De Choudhury, Munmum, Scott Counts, and Michael Gamon. 2012. Not All Moods Are Created Equal! Exploring Human Emotional States in Social Media. In Proceedings of ICWSM, pages 1–8.
Derby, Steven, Paul Miller, and Barry Devereux. 2019. Feature2Vec: Distributional Semantic Modelling of Human Property Knowledge. In Proceedings of EMNLP.
Devereux, Barry, Colin Kelly, and Anna Korhonen. 2010. Using fMRI Activation to Conceptual Stimuli to Evaluate Methods for Extracting Conceptual Representations from Corpora. In Proceedings of the NAACL Workshop on Computational Neurolinguistics.
Devereux, Barry J., Lorraine K. Tyler, Jeroen Geertzen, and Billi Randall. 2014. The Centre for Speech, Language and the Brain (CSLB) Concept Property Norms. Behavior Research Methods, 46(4):1119–1127.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, Minneapolis, MN.
Djokic, Vesna, Jean Maillard, Luana Bulat, and Ekaterina Shutova. 2019. Modeling Affirmative and Negated Action Processing in the Brain with Lexical and Compositional Semantic Models. In Proceedings of ACL, pages 5155–5165.
Erk, Katrin, Sebastian Padó, and Ulrike Padó. 2010. A Flexible, Corpus-Driven Model of Regular and Inverse Selectional Preferences. Computational Linguistics, 36(4):723–763.
Esuli, Andrea and Fabrizio Sebastiani. 2006. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In Proceedings of LREC, volume 6, pages 417–422.
Ettinger, Allyson, Ahmed Elgohary, and Philip Resnik. 2016. Probing for Semantic Evidence of Composition by Means of Simple Classification Tasks. In Proceedings of the 1st Workshop on Evaluating Vector Space Representations for NLP, pages 134–139, Berlin, Germany.
Fagarasan, Luana, Eva Maria Vecchi, and Stephen Clark. 2015. From Distributional Semantics to Feature Norms: Grounding Semantic Models in Human Perceptual Data. In Proceedings of IWCS.
Gauthier, Jon and Anna Ivanova. 2018. Does the Brain Represent Words? An Evaluation of Brain Decoding Studies of Language Understanding. arXiv preprint arXiv:1806.00591.
Glasgow, Kimberly, Matthew Roos, Amy Haufler, Mark Chevillet, and Michael Wolmetz. 2016. Evaluating Semantic Models with Word-Sentence Relatedness. arXiv preprint arXiv:1603.07253.
Güçlü, Umut and Marcel AJ van Gerven. 2015. Semantic Vector Space Models Predict Neural Responses to Complex Visual Stimuli. arXiv preprint arXiv:1510.04738.
Hewitt, John and Christopher D Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of NAACL.
Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 41(4):665–695.
Hinton, Geoffrey E., James L. McClelland, and David E. Rumelhart. 1986. Distributed Representations. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, Cambridge, MA, pages 77–109.
Hollenstein, Nora, Antonio de la Torre, Nicolas Langer, and Ce Zhang. 2019. CogniVal: A Framework for Cognitive Word Embedding Evaluation. In Proceedings of CONLL.
Howard, Jeremy and Sebastian Ruder. 2018. Universal Language Model Fine-Tuning for Text Classification. In Proceedings of ACL.
Huth, Alexander G, Wendy A De Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. 2016. Natural Speech Reveals the Semantic Maps that Tile Human Cerebral Cortex. Nature, 532(7600):453–458.
Jackendoff, Ray. 1990. Semantic Structures, volume 18. The MIT Press, Cambridge, MA, USA.
Jawahar, Ganesh, Benoît Sagot, and Djamé Seddah. 2019. What Does BERT Learn About the Structure of Language? In Proceedings of ACL.
Kann, Katharina, Alex Warstadt, Adina Williams, and Samuel R Bowman. 2019. Verb Argument Structure Alternations in Word and Sentence Embeddings. In Proceedings of SCIL.
Kim, Najoung, Roma Patel, Adam Poliak, Alex Wang, Patrick Xia, R Thomas McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, et al. 2019. Probing What Different NLP Tasks Teach Machines about Function Word Comprehension. In Proceedings of *SEM.
King, Milton and Paul Cook. 2018. Leveraging Distributed Representations and Lexico-Syntactic Fixedness for Token-Level Prediction of the Idiomaticity of English Verb-Noun Combinations. In Proceedings of ACL, pages 345–350.
Kipper, Karin, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-Scale Classification of English Verbs. Language Resources and Evaluation, 42(1):21–40.
Klafka, Josef and Allyson Ettinger. 2020. Spying on Your Neighbors: Fine-Grained Probing of Contextual Embeddings for Information about Surrounding Words. In Proceedings of ACL, pages 4801–4811.
Kotlerman, Lili, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional Distributional Similarity for Lexical Inference. Journal of Natural Language Engineering, 16(4):359.
Landauer, Thomas K and Susan T Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211.
Landauer, Thomas K., Danielle S. McNamara, Simon Dennis, and Walter Kintsch, editors. 2007. Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Mahwah, NJ.
Lenci, Alessandro. 2011. Composing and Updating Verb Argument Expectations: A Distributional Semantic Model. In Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics, pages 58–66.
Lenci, Alessandro. 2018. Distributional Models of Word Meaning. Annual Review of Linguistics, 4:151–171.
Lenci, Alessandro, Gianluca E. Lebani, and Lucia C. Passaro. 2018. The Emotions of Abstract Words: A Distributional Semantic Analysis. Topics in Cognitive Science, 10(3):550–572.
Levin, Beth. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL.
Levy, Omer and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In Proceedings of ACL.
Levy, Omer, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
Linzen, Tal, Grzegorz Chrupała, and Afra Alishahi. 2018. Introduction. In Proceedings of the EMNLP Workshop on BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium.
Linzen, Tal, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes. 2019. Introduction. In Proceedings of the ACL Workshop on BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy.
Liu, Nelson F, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. 2019. Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of NAACL.
Louwerse, Max M. 2008. Embodied Relations Are Encoded in Language. Psychonomic Bulletin & Review, 15(4):838–844.
Mandera, Paweł, Emmanuel Keuleers, and Marc Brysbaert. 2017. Explaining Human Performance in Psycholinguistic Tasks with Models of Semantic Similarity Based on Prediction and Counting: A Review and Empirical Validation. Journal of Memory and Language, 92:57–78.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation: Contextualized Word Vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.
McRae, Ken, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic Feature Production Norms for a Large Set of Living and Nonliving Things. Behavior Research Methods, 37(4):547–559.
McRae, Ken and Kazunaga Matsuki. 2009. People Use Their Knowledge of Common Events to Understand Language, and Do So as Quickly as Possible. Language and Linguistics Compass, 3(6):1417–1429.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Mitchell, Tom M, Svetlana V Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just. 2008. Predicting Human Brain Activity Associated with the Meanings of Nouns. Science, 320(5880):1191–1195.
Murphy, Brian, Partha Talukdar, and Tom Mitchell. 2012. Selecting Corpus-Semantic Models for Neurolinguistic Decoding. In Proceedings of *SEM.
Murphy, Gregory. 2002. The Big Book of Concepts. MIT Press, Cambridge, MA.
Murphy, M. Lynne. 2010. Lexical Meaning. Cambridge University Press, Cambridge, UK.
Naselaris, Thomas, Kendrick N. Kay, Shinji Nishimoto, and Jack L. Gallant. 2011. Encoding and Decoding in fMRI. NeuroImage, 56(2):400–410.
Nielsen, Finn Årup. 2011. A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs. arXiv preprint arXiv:1103.2903.
Palmer, Martha, Claire Bonial, and Jena D Hwang. 2017. VerbNet: Capturing English Verb Behavior, Meaning and Usage. The Oxford Handbook of Cognitive Science, pages 315–336.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP.
Pereira, Francisco, Matthew Botvinick, and Greg Detre. 2013. Using Wikipedia to Learn Semantic Feature Representations of Concrete Concepts in Neuroimaging Experiments. Artificial Intelligence, 194:240–252.
Pereira, Francisco, Greg Detre, and Matthew Botvinick. 2011. Generating Text from Functional Brain Images. Frontiers in Human Neuroscience, 5:72.
Pereira, Francisco, Bin Lou, Brianna Pritchett, Samuel Ritter, Samuel J Gershman, Nancy Kanwisher, Matthew Botvinick, and Evelina Fedorenko. 2018. Toward a Universal Decoder of Linguistic Meaning from Brain Activation. Nature Communications, 9(1):963.
Peters, Matthew E, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of NAACL.
Poldrack, Russell A. 2011. Inferring Mental States from Neuroimaging Data: From Reverse Inference to Large-Scale Decoding. Neuron, 72(5):692–697.
Pustejovsky, James and Olga Batiukova. 2019. The Lexicon. Cambridge University Press, Cambridge.
Recchia, Gabriel and Max M Louwerse. 2015. Reproducing Affective Norms with Lexical Co-Occurrence Statistics: Predicting Valence, Arousal, and Dominance. The Quarterly Journal of Experimental Psychology, 68(8):1584–1598.
Riordan, Brian and Michael N. Jones. 2011. Redundancy in Perceptual and Linguistic Experience: Comparing Feature-Based and Distributional Models of Semantic Representation. Topics in Cognitive Science, 3(2):303–345.
Rohanian, Omid, Marek Rei, Shiva Taslimipoor, and Le Han Ha. 2020. Verbal Multiword Expressions for Identification of Metaphor. In Proceedings of ACL, pages 2890–2895.
Sahlgren, Magnus. 2008. The Distributional Hypothesis. Italian Journal of Linguistics, 20:33–53.
Santus, Enrico, Emmanuele Chersoni, Alessandro Lenci, and Philippe Blache. 2017. Measuring Thematic Fit with Distributional Feature Overlap. In Proceedings of EMNLP.
Sayeed, Asad, Clayton Greenberg, and Vera Demberg. 2016. Thematic Fit Evaluation: An Aspect of Selectional Preferences. In Proceedings of the ACL Workshop on Evaluating Vector-Space Representations for NLP, pages 99–105.
Schwartz, Dan and Tom Mitchell. 2019. Understanding Language-Elicited EEG Data by Predicting It from a Fine-Tuned Language Model. In Proceedings of NAACL.
Schwarzenberg, Robert, Lisa Raithel, and David Harbecke. 2019. Neural Vector Conceptualization for Word Vector Space Interpretation. In Proceedings of the NAACL Workshop on Evaluating Vector Space Representations.
Sedinkina, Marina, Nikolas Breitkopf, and Hinrich Schütze. 2019. Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes. In Proceedings of ACL, pages 346–359.
Şenel, Lütfi Kerem, Ihsan Utlu, Veysel Yücesoy, Aykut Koc, and Tolga Cukur. 2018. Semantic Structure and Interpretability of Word Embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1769–1779.
Shwartz, Vered and Ido Dagan. 2019. Still a Pain in the Neck: Evaluating Text Representations on Lexical Composition. Transactions of the Association for Computational Linguistics, 7:403–419.
Sikos, Jennifer and Sebastian Padó. 2019. Frame Identification as Categorization: Exemplars vs Prototypes in Embeddingland. In Proceedings of IWCS.
Søgaard, Anders. 2016. Evaluating Word Embeddings with fMRI and Eye-Tracking. In Proceedings of the ACL Workshop on Evaluating Vector-Space Representations for NLP, pages 116–121.
Sun, Jingyuan, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2019. Towards Sentence-Level Brain Decoding with Distributed Representations. In Proceedings of AAAI, volume 33, pages 7047–7054.
Tekiroglu, Serra Sinem, Gözde Özbal, and Carlo Strapparava. 2014. Sensicon: An Automatically Constructed Sensorial Lexicon. In Proceedings of EMNLP, pages 1511–1521.
Tenney, Ian, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What Do You Learn From Context? Probing for Sentence Structure in Contextualized Word Representations. In Proceedings of ICLR.
Turney, Peter D and Michael L Littman. 2003. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems (TOIS), 21(4):315–346.
Turney, Peter D and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188.
Utsumi, Akira. 2018. A Neurobiologically Motivated Analysis of Distributional Semantic Models. In Proceedings of CogSci.
Utsumi, Akira. 2020. Exploring What Is Encoded in Distributional Word Vectors: A Neurobiologically Motivated Analysis. Cognitive Science, 44(6):e12844.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Vigliocco, Gabriella, Lotte Meteyard, Mark Andrews, and Stavroula Kousta. 2009. Toward a Theory of Semantic Representation. Language and Cognition, 1(2):219–247.
Vigliocco, Gabriella and David P. Vinson. 2007. Semantic Representation. In Gareth Gaskell, editor, The Oxford Handbook of Psycholinguistics. Oxford University Press, Oxford, pages 195–215.
Vinson, David P. and Gabriella Vigliocco. 2008. Semantic Feature Production Norms for a Large Set of Objects and Events. Behavior Research Methods, 40(1):183–190.
Vulić, Ivan, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017. HyperLex: A Large-Scale Evaluation of Graded Lexical Entailment. Computational Linguistics, 43(4):781–835.
Warstadt, Alex, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
Wehbe, Leila, Brian Murphy, Partha Talukdar, Alona Fyshe, Aaditya Ramdas, and Tom Mitchell. 2014. Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses. PloS One, 9(11):e112575.
Wiedemann, Gregor, Steffen Remus, Awi Chawla, and Chris Biemann. 2019. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. In Proceedings of KONVENS.
Wierzbicka, Anna. 1996. Semantics: Primes and Universals. Oxford University Press, Oxford.
Yaghoobzadeh, Yadollah, Katharina Kann, Timothy J Hazen, Eneko Agirre, and Hinrich Schütze. 2019. Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings. In Proceedings of ACL.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
Yee, Eiling and Sharon L. Thompson-Schill. 2016. Putting Concepts into Context. Psychonomic Bulletin & Review, 23(4):1015–1027.
Zhang, Hongming, Jiaxin Bai, Yan Song, Kun Xu, Changlong Yu, Yangqiu Song, Wilfred Ng, and Dong Yu. 2019. Multiplex Word Embeddings for Selectional Preference Acquisition. In Proceedings of EMNLP.
Zhang, Hongming, Hantian Ding, and Yangqiu Song. 2019. SP-10K: A Large-Scale Evaluation Set for Selectional Preference Acquisition. In Proceedings of ACL.
Zhu, Yukun, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.