Neural Decoding
Word embeddings are vectorial semantic representations built with either counting or predicting
techniques aimed at capturing shades of meaning from word co-occurrences. Since their intro-
duction, these representations have been criticised for lacking interpretable dimensions. This
property of word embeddings limits our understanding of the semantic features they actually
encode. Moreover, it contributes to the “black box” nature of the tasks in which they are
used, since the reasons for the performance of word embeddings often remain opaque to humans. In
this contribution, we explore the semantic properties encoded in word embeddings by mapping
them onto interpretable vectors, consisting of explicit and neurobiologically-motivated semantic
features (Binder et al. 2016). Our exploration takes into account different types of embeddings,
including factorized count vectors and predict models (e.g., Skip-Gram, GloVe, etc.), as well as
the most recent contextualized representations (i.e., ELMo and BERT).
In our analysis, we first evaluate the quality of the mapping in a retrieval task, then we shed
light on the semantic features that are better encoded in each embedding type. Finally, a large
number of probing tasks is set up to assess how the original and the mapped embeddings perform in
discriminating semantic categories. For each probing task, we identify the most relevant semantic
features and we show that there is a correlation between the performance of the embeddings and
how well they encode those features. This study is a step forward in understanding which aspects
of meaning are captured by vector spaces, by proposing a new and simple method to carve human-
interpretable semantic representations from distributional vectors.
1. Introduction
for basic or primitive semantic dimensions (Jackendoff 1990; Wierzbicka 1996; Murphy
2010; Pustejovsky and Batiukova 2019). These “building blocks” of meaning are selected
a priori and structured into categorical representations defined by the presence or
absence of symbolic features, like in this semantic analysis of enter:
(1) enter [+ MOVE, + PATH, - CAUSE, . . . ]
Besides the issue of establishing the criteria to define the repertoire of alleged seman-
tic primitives, discrete symbolic structures struggle to cope with the gradient nature of
lexical meaning and cannot capture the varying degrees of feature prototypicality in
concepts (Murphy 2002). Secondly, the basic semantic features are normally too coarse-
grained to provide a full characterization of conceptual content (e.g., accounting for
the dimensions that distinguish painter from violinist). In cognitive psychology, instead
of using categorical representations formed by manually selected components, it is
customary to represent concepts with verbal properties generated by native speakers
to describe word meanings and collected in feature norms (e.g., McRae et al. 2005;
Vinson and Vigliocco 2008; Devereux et al. 2014). Each feature is associated with a
weight corresponding to the number of subjects that listed it for a given concept and
is used to estimate its salience in that concept. The following is a representation of car
using a subset of its feature distribution from the norms in McRae et al. (2005):
that can be pretrained on large natural language corpora. The vector-based encoding of
meaning is easily machine-interpretable, as embeddings can be directly fed into complex
neural architectures and indeed boost performance in several natural language process-
ing (NLP) tasks and applications.
Although word embeddings play an important role in the success of deep learning
models and do capture some aspects of lexical meaning, it is hard to understand their
actual semantic content. In fact, one notorious problem of embeddings is their lack
of human-interpretability: Information is distributed across vector dimensions that can-
not be labelled with specific semantic values. In neural word embeddings, the vector
dimensions are learned as network parameters, instead of being derived from explicit
co-occurrence counts between target words and linguistic contexts, making their in-
terpretation even more challenging. Scholars have argued that DSMs provide a holistic
representation of meaning, as the content of each word can exclusively be read off from
its position relative to other elements in the semantic space, while the coordinates of
such space are themselves arbitrary and without any intrinsic semantic value (Landauer
et al. 2007; Vigliocco and Vinson 2007; Sahlgren 2008). This makes embeddings “black
box” representations that can be understood only by observing their behavior in some
external task, but whose internal content defies direct inspection. A recent and widely
used tool to investigate the linguistic properties captured by embeddings is the so-
called probing task (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017; Conneau et al.
2018; Kann et al. 2019). A probing task is a classification problem that targets a specific
linguistic aspect (e.g., word order, animacy, etc.). The name refers to the fact that the
classifier is used to “probe” embeddings for a particular piece of linguistic information.
The successful performance of an embedding model in this task is then used to
infer that the vectors encode that information. However, as recently pointed out
by Shwartz and Dagan (2019), probing tasks are also a form of “black box” testing, since
they just provide indirect evidence about the embedding content.
The rise of the interpretability problem in AI and NLP has made it necessary to
understand which shades of semantics are actually encoded by word embeddings,
and has therefore refueled the debate about the relationship between distributional rep-
resentations and semantic features (Boleda and Erk 2015). “Opening the black box” of
deep learning methods has become an imperative in computational linguistics (Linzen,
Chrupała, and Alishahi 2018; Linzen et al. 2019). Such research effort aims at analyzing
the specific information encoded by vector representations, which may help explain
their behavior in downstream tasks and applications.
In this paper, we contribute to this goal by showing that featural semantic represen-
tations can be used to interpret the content of word embeddings. In particular, we argue
that decoding semantic information from distributional vectors is strikingly similar to
the problem faced by neuroscience of how to “read off meaning” from distributed brain
activity patterns. Neurosemantic decoding is a research line that develops computa-
tional methods to identify the mental state represented by brain activity recorded with
neuroimaging techniques such as fMRI (e.g., recognizing that a given activation pattern
produced by a stimulus picture or word corresponds to an apple). A common approach
to address such task is to learn a mapping between featural concept representations
and a vector containing the corresponding fMRI recorded brain activity (Naselaris
et al. 2011; Poldrack 2011). These computational models are able to predict the concept
corresponding to a certain brain activation and help shed light on the neural
representation of semantic features.
In neurosemantic decoding, human-interpretable semantic vectors are used to de-
code the content of vectors of “brain-interpretable” signals activated by a certain stim-
ulus (cf. Section 2.2). In a similar way, we aim at decoding the semantic content of
word embeddings by learning a mapping onto vectors of human-interpretable fea-
tures. To this end, we use the semantic features introduced by Binder et al. (2016),
who proposed a set of cognitively motivated semantic primitives (henceforth, Binder
features) derived from a wide variety of modalities of neural information processing
(hence their definition as brain-based), and provided human ratings about the relevance
of each feature for a set of English words (henceforth, Binder dataset). We use these
ratings to represent the words with continuous vectors of semantic features and to learn
a map from word embeddings dimensions to Binder features. Such mapping provides a
human-interpretable correlate of word embeddings that we use to address these issues:
1. identifying which semantic features are best encoded in word embeddings;
2. explaining the performance of embeddings in semantic probing tasks.
The idea of mapping word embeddings onto semantic features is not by itself new
(Fagarasan, Vecchi, and Clark 2015; Utsumi 2020), but to the best of our knowledge the
present contribution is the first one to use mapped featural representations to interpret
the semantic information learnt by probing classifiers and to explain the embedding be-
havior in such tasks. Therefore, we establish a bridge between the research on semantic
features and the challenge of enhancing the interpretability of distributed representa-
tions, by showing that featural semantic representations can work as an important key
to open the black boxes of word embeddings and of the tasks in which they are used. As
an additional element of novelty, we also apply the neural decoding methodology to the
recently-introduced contextualized embeddings, to evaluate whether and how they dif-
fer from static ones in encoding semantic information. It is important to remark that we
do not argue that Binder feature vectors should replace distributional representations.
The main claim of this paper is rather that continuous vectors of human-interpretable
semantic features, such as Binder’s, are an important tool to investigate what
aspects of meaning distributional embeddings actually encode, and they can be used
to lay a bridge between symbolic and distributed semantic representations.
This paper is organized as follows. Section 2 introduces the main typologies of
DSMs and reviews the related work on vector decoding. In Section 3, we describe the
Binder features, we present the method used to map word embeddings onto Binder
feature vectors, and we evaluate the mapping accuracy. In Section 4, we investigate
which Binder features are best encoded by each type of embedding. In Section 5 we
set up a series of probing tasks to verify how the original and mapped embeddings
encode semantic categories, such as animate/inanimate or positive/negative sentiment.
Some probing tasks focus on static embeddings, while others target the token vectors
produced by contextualized embeddings. The aim of the analysis is to identify the
most important semantic features for a given task and to investigate whether there is
a correlation between the system performance and how those features are encoded by
the embeddings.
2. Related Work
We use the term word embedding to refer to any kind of dense, low-dimensional
distributional vector. In the early days of Distributional Semantics, embeddings were
built by applying dimensionality reduction techniques derived from linear algebra,
such as SVD, to matrices keeping track of the co-occurrence information about the target
terms and some pre-defined set of linguistic contexts. Parameter tuning was mostly
carried out empirically, as it was driven by the model performance on specific tasks.
This family of DSMs is referred to as count models (Baroni, Dinu, and Kruszewski 2014).
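As a rough illustration of this pipeline, the following sketch builds count-model embeddings from a raw co-occurrence matrix with PPMI weighting and truncated SVD; the matrix `C` and all names are illustrative placeholders, not the exact configuration used in this study:

```python
# Minimal count-model sketch: PPMI weighting of a co-occurrence matrix,
# followed by SVD-based dimensionality reduction.
import numpy as np

def ppmi(C):
    """Positive Pointwise Mutual Information weighting of a count matrix."""
    total = C.sum()
    p_wc = C / total                              # joint probabilities
    p_w = C.sum(axis=1, keepdims=True) / total    # target word marginals
    p_c = C.sum(axis=0, keepdims=True) / total    # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    return np.where(pmi > 0, pmi, 0.0)            # clip negative/undefined PMI to 0

def svd_embeddings(M, k=300):
    """Reduce the weighted matrix to k dimensions with truncated SVD."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]

# C: (vocabulary x contexts) raw co-occurrence counts
# E = svd_embeddings(ppmi(C), k=300)
```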
The construction of distributional representations started to be conceived mainly as
the by-product of a supervised language modelling task after the introduction of the
Word2Vec package (Mikolov et al. 2013). Low-dimensional distributional word vectors
are created by neural network algorithms by learning to optimally predict the contexts
of a target word, hence their name of predict models. "Neural" embeddings have become
an essential component for several NLP applications, also thanks to the availability of
many efficient and easy-to-use tools (Mikolov et al. 2013; Bojanowski et al. 2017) that
allow researchers to quickly obtain well-performing word representations. Indeed, an
important finding of a first comparative evaluation between count and predict models
was that the latter achieve far superior performances in a wide variety of tasks (Baroni,
Dinu, and Kruszewski 2014). Although the result was claimed to be due to the sub-
optimal choice of "vanilla" hyperparameters for the count models (Levy, Goldberg, and
Dagan 2015), it still showed that predict models can be very effective even without
any parameter tuning. Subsequent studies adopting cognitively-motivated benchmarks
(e.g., based on priming, eye-tracking or EEG data) have also shown that word em-
beddings exhibit strong correlation with human performance in psycholinguistic and
neurolinguistic tasks (Søgaard 2016; Mandera, Keuleers, and Brysbaert 2017; Bakarov
2018; Schwartz and Mitchell 2019; Hollenstein et al. 2019). Finally, and significantly,
Carota et al. (2017) found that the semantic similarity computed via distributional
models between action-related words correlates with the fMRI response patterns of the
brain regions that are involved in the processing of this category of lexical items.
Another novelty has recently emerged from research on deep neural networks for
language modeling. Both count and predict models share a common and longstanding
assumption: a single, stable representation is built for each word type in the
corpus. In the latest generation of embeddings, instead, each occurrence of a word in a
specific sentence context gets a unique representation (Peters et al. 2018). Such models
typically rely on an encoder (i.e., an LSTM or a Transformer) trained on large amounts of
textual data, and the word vectors are learned as a function of the internal states of the
encoder, such that a word in different sentence contexts determines different activation
states and is represented by a distinct vector (McCann et al. 2017; Peters et al. 2018;
Howard and Ruder 2018; Devlin et al. 2019; Yang et al. 2019). Thus, the embeddings
produced by these new frameworks are said to be contextualized, as opposed to the static
vectors produced by the earlier frameworks, and they aim at modeling the specific sense
assumed by the word in context (Wiedemann et al. 2019). Interestingly, the distinction
between traditional and contextualized embeddings has been recently discussed by
drawing a parallel between the prototype and exemplar models of categorization in
cognitive psychology (Sikos and Padó 2019).
Two very popular models for obtaining contextualized word embeddings are ELMo
(Peters et al. 2018) and BERT (Devlin et al. 2019). ELMo is based on a two-layer LSTM
trained as the concatenation of a forward and a backward language model, while BERT
is based on a stack of Transformer layers (Vaswani et al. 2017) trained jointly on a masked
language modeling and a next sentence prediction task. The semantic interpretation of the dimen-
sions of contextualized embeddings is still an open question. The classical approach to
analyze the syntactic and semantic information encoded by these representations is to
test them in some probing tasks (Tenney et al. 2019; Liu et al. 2019; Hewitt and Manning
2019; Kim et al. 2019; Kann et al. 2019; Yaghoobzadeh et al. 2019; Jawahar et al. 2019). In
this contribution we adopt a different approach to the problem, mainly inspired by the
literature on neurosemantic decoding.
Like word embeddings, the brain encodes information in distributed activity patterns
that defy direct interpretation. The general goal of neurosemantic decoding is to de-
velop computational methods to infer the content of brain activities associated with a
certain word or phrase (e.g., to recognize that a pattern of brain activations corresponds
to the meaning of the stimulus word dog, instead of car). One of the most common
approaches to neural decoding consists in learning to map vectors of fMRI signals onto
vectors of semantic dimensions. If the mapping is successful, we can infer that these
dimensions are encoded in the brain. Mitchell et al. (2008) pioneered this method by
training a linear regression model on a set of words paired with their fMRI activations. The trained
model was then asked to predict the activations for unseen words. Approaches differ
in the type of semantic representation adopted to model brain data. Mitchell et al.
(2008) used a vector of features corresponding to textual co-occurrences with 25 verbs
capturing basic semantic dimensions (e.g., hear, eat, etc.). Chang, Mitchell, and Just
(2011) instead represented words with vectors of verbal properties derived from feature
norms, and Anderson et al. (2016) with vectors of Binder features (cf. Section 3.2).
After the popularization of DSMs, the use of word embeddings for neurosemantic
decoding has become widespread. Actually, the decoding task itself has turned into an
important benchmark for DSMs, since it is claimed to represent a more robust alterna-
tive to the traditional use of behavioral datasets (Murphy, Talukdar, and Mitchell 2012).
Some of these studies used fMRI data to learn a mapping from the classical count-based
distributional models (Devereux, Kelly, and Korhonen 2010; Murphy, Talukdar, and
Mitchell 2012), from both count and prediction vectors (Bulat, Clark, and Shutova 2017b;
Abnar et al. 2018), from contextualized vectors (Beinborn, Abnar, and Choenni 2019) or
from topic models (Pereira, Detre, and Botvinick 2011; Pereira, Botvinick, and Detre
2013). This methodology has recently been extended beyond words to represent the
meanings of entire sentences (Anderson et al. 2016; Pereira et al. 2018; Sun et al. 2019),
even in the presence of complex compositionality phenomena such as negation (Djokic
et al. 2019), or to predict the neural responses to complex visual stimuli (Güçlü and van
Gerven 2015). Athanasiou, Iosif, and Potamianos (2018) showed that neural activation
semantic models built out of these mappings can also be used to successfully carry out
NLP tasks such as similarity estimation, concept categorization and textual entailment.
Despite the analogy, it is important to underline a crucial difference between our
work and neurosemantic decoding. In the latter, word embeddings are used as proxies
for semantic representations to decode brain patterns that are not directly human-
interpretable. Our aim is instead to decode the content of word embeddings themselves.
We actually believe this enterprise to be also relevant for (and to a certain extent a
precondition to) the task of decoding brain states. In fact, if we want to use embeddings
for neural decoding, it is essential to have a better understanding of the semantic content
hidden in distributional representations. Otherwise, the risk is to run into the classical
fallacy of obscurum per obscurius, in which one tries to explain something unknown
(brain activations), with something that is even less known (word embeddings).
Another related line of work makes use of property norms for grounding distribu-
tional models in perceptual data, and to map them onto interpretable representations
(Fagarasan, Vecchi, and Clark 2015; Bulat, Kiela, and Clark 2016; Derby, Miller, and
Devereux 2019), an approach that has proven useful, among other things, also
for the detection of cross-domain mappings in metaphors (Bulat, Clark, and Shutova
2017a). Similarly, other studies focusing on conceptual categorization have proposed to learn
mappings from distributional vectors to spaces of higher-order concepts (Şenel et al.
2018; Schwarzenberg, Raithel, and Harbecke 2019). Finally, Utsumi (2018, 2020) carried
out an analysis of the semantic content of non-contextualized word embeddings, which
is close in spirit to our correlation analyses in Section 4. However, our study significantly
differs from Utsumi’s in its goals and scope. While Utsumi (2020) only aims at under-
standing the semantic knowledge encoded in distributional vectors, we add to this the
idea of using the decoded embeddings to explain and interpret their performance in
probing semantic tasks (Section 5). Moreover, our study involves a larger array of DSMs
and it is the first one to include state-of-the-art contextualized embeddings.
PPMI.w2: 345K window-selected context words, window of width 2; weighted with Positive Pointwise Mutual Information (PPMI); reduced with Singular Value Decomposition (SVD); subsampling method from Mikolov et al. (2013)
PPMI.synf: 345K syntactically filtered context words; weighted with PPMI; reduced with SVD; subsampling method from Mikolov et al. (2013)
PPMI.synt: 345K syntactically typed context words; weighted with PPMI; reduced with SVD; subsampling method from Mikolov et al. (2013)
GloVe: window of width 2; subsampling method from Mikolov et al. (2013)
SGNS.w2: Skip-gram with negative sampling; window of width 2, 15 negative examples; trained with the word2vec library (Mikolov et al. 2013)
SGNS.synf: Skip-gram with negative sampling; syntactically-filtered context words, 15 negative examples; trained with the word2vecf library (Levy and Goldberg 2014)
SGNS.synt: Skip-gram with negative sampling; syntactically-typed context words, 15 negative examples; trained with the word2vecf library (Levy and Goldberg 2014)
FastText: Skip-gram with negative sampling and subword information; window of width 2, 15 negative examples; trained with the fasttext library (Bojanowski et al. 2017)
ELMo: Pretrained ELMo embeddings (Peters et al. 2018), available at https://allennlp.org/elmo; original model trained on the 1 Billion Word Benchmark (Chelba et al. 2013)
BERT: Pretrained BERT-Base embeddings (Devlin et al. 2019), available at https://github.com/google-research/bert; large model trained on the concatenation of the Books corpus (Zhu et al. 2015) and the English Wikipedia
Table 1: List of the embedding models used for the study, together with their hyperparameter settings.
proposed in Mikolov et al. (2013). A summary of all models with their respective
training hyperparameters is provided in Table 1.
The contextualized embedding models are ELMo1 and BERT (the BERT-Base un-
cased version).2 Since they produce token vectors, we created type representations by
randomly sampling 1,000 sentences for each target word from the Wikipedia corpus.
We generated a contextualized embedding for each word token by feeding the sentence
to the publicly available pre-trained models of ELMo and BERT. Finally, an embedding
for each word was obtained by averaging its 1,000 contextualized vectors. We assume
this choice to be consistent with the hypothesis that context-independent conceptual
representations are abstractions from token exemplar concepts (Yee and Thompson-
Schill 2016). As a baseline, we also built models based on 300-dimensional randomly-
generated vectors (Random).
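As a rough illustration of this averaging procedure, the sketch below builds a type-level BERT vector for a target word; it uses the Hugging Face transformers API rather than the spacy-transformers pipeline actually employed, and the function and variable names are ours:

```python
# Sketch: average the contextualized vectors of a word's occurrences
# to obtain a single type-level embedding.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def type_vector(word, sentences):
    """Average the last-layer vectors of `word`'s subtoken span across sentences."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    token_vectors = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        ids = enc["input_ids"][0].tolist()
        for start in range(len(ids) - len(word_ids) + 1):
            if ids[start:start + len(word_ids)] == word_ids:    # word found in the sentence
                with torch.no_grad():
                    hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
                span = hidden[start:start + len(word_ids)]
                token_vectors.append(span.mean(dim=0))          # one vector per occurrence
                break
    return torch.stack(token_vectors).mean(dim=0) if token_vectors else None
```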
1 https://tfhub.dev/google/elmo/3.
2 We used the pipelines included in the spacy-transformers package
(https://spacy.io/universe/project/spacy-transformers).
representations based on the elicited features were compared with their distributional
representations, obtained via Latent Semantic Analysis (Landauer and Dumais 1997),
showing that brain-based features are more efficient in separating conceptual categories.
We have chosen the Binder features for our decoding experiments for three main
reasons. First of all, they are empirically motivated on the grounds of neurocognitive
evidence supporting their key role for conceptual organization. This allows us to test
the extent to which these central components of meaning are actually captured by word
embeddings. Secondly, despite being quite coarse-grained, Binder features differ from
human-generated properties because the latter are actually linguistic structures that
often express complex concepts (e.g., used_for_transportation as a property for airplane),
rather than core meaning components. Thirdly, the Binder dataset covers nouns, verbs,
and adjectives, and encompasses both concrete and abstract words, while no existing
feature norms have a comparable morphosyntactic or semantic variety. Of course, we do
not claim this to be the “definitive” semantic feature list, but in our view it represents
the most complete repository of continuous featural representations available to date.
However, the analysis methodology we present in the next section is totally general,
and can be applied to any type of semantic feature vector.
For this study, we learn a mapping from a n-dimensional word embedding e (with
n equal to 300 for non-contextualized DSMs, 768 for BERT, and 1024 for ELMo) onto a
65-dimensional feature vector f whose components correspond to the ratings for the
Binder features. We henceforth refer to the mapped feature vectors as Binder vectors.
Our dataset consists of 534 Binder words.3
In the previous literature, two main methods have been used to learn a mapping
between embeddings and discrete feature spaces: regression models (Fagarasan, Vecchi,
and Clark 2015; Pereira et al. 2018; Utsumi 2018, 2020) and feedforward neural networks
(Abnar et al. 2018; Utsumi 2018, 2020). In our preliminary experiments, the mapping
with feedforward neural networks turned out to be suboptimal. Thus, we report the
3 One less than the original collection, because used appears twice, as verb and adjective.
Figure 1: (top) Mean Squared Error (values have been summed across the Binder features) and (bottom) explained variance for ELMo, BERT and GloVe vectors per number of regression components.
results for our partial least squares regression model, with an appropriately chosen num-
ber k of regression components. We tested with k = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
The regression models have been implemented with the Python Scikit-learn package
(Pedregosa et al. 2011).4
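A minimal sketch of this setup with scikit-learn is given below, assuming `E` is a matrix of word embeddings and `F` the matrix of 65-dimensional gold Binder vectors for the same words (the variable names are illustrative):

```python
# Sketch: tune the number k of PLS regression components on a held-out split.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, explained_variance_score

E_train, E_test, F_train, F_test = train_test_split(E, F, test_size=0.2, random_state=0)

for k in (10, 20, 30, 40, 50, 60, 70, 80, 90, 100):
    pls = PLSRegression(n_components=k).fit(E_train, F_train)
    F_pred = pls.predict(E_test)
    mse = mean_squared_error(F_test, F_pred, multioutput="raw_values").sum()  # summed over features
    ev = explained_variance_score(F_test, F_pred)
    print(f"k={k}: summed MSE={mse:.3f}, explained variance={ev:.3f}")
```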
For a preliminary evaluation of the mapping quality, we analyze the traditional metrics
of Mean Squared Error (MSE) and explained variance. First, we randomly split the data into
training and test sets, using an 80:20 ratio, and we measure the sum of the MSE and the
explained variance in order to determine the optimal value for the parameter k (the number of
regression components). After choosing the optimal k, vectors of Binder features are
predicted with the leave-one-out training paradigm, as in Utsumi (2018, 2020): For each
word in the dataset, we train a mapping between the embeddings and the gold standard

4 https://scikit-learn.org/stable/.
Binder vectors of the remaining words, and we predict the held-out word with the
resulting mapping.
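The leave-one-out prediction can be sketched as follows (again with `E` and `F` as assumed above and k fixed to the chosen value):

```python
# Sketch: leave-one-out prediction of the Binder vectors.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

pred = np.zeros(F.shape)
for train_idx, test_idx in LeaveOneOut().split(E):
    pls = PLSRegression(n_components=30).fit(E[train_idx], F[train_idx])
    pred[test_idx] = pls.predict(E[test_idx])   # predict the single held-out word
```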
Moreover, following the literature on neural decoding, the predicted vectors are
tested on a task of retrieval in the top-N neighbors. Given a predicted vector, we rank all
the 534 vectors in the gold standard dataset by decreasing cosine similarity values. Then
we measure the Top-N accuracy (Top-N Acc), as the percentage of the items of the dataset
whose gold standard vector is in the top-N of the neighbors list of the predicted vector
(Fagarasan, Vecchi, and Clark 2015). We assess this value for N = 1, 5, 10.
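A simple way to compute this retrieval score, assuming `pred` and `gold` are aligned arrays of predicted and gold-standard Binder vectors (one row per word), is sketched below:

```python
# Sketch: Top-N retrieval accuracy over the 534 Binder words.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_n_accuracy(pred, gold, n):
    sims = cosine_similarity(pred, gold)        # each predicted vector vs. all gold vectors
    hits = 0
    for i in range(pred.shape[0]):
        ranking = np.argsort(-sims[i])          # gold items by decreasing similarity
        if i in ranking[:n]:                    # is the word's own gold vector in the top n?
            hits += 1
    return hits / pred.shape[0]

# for n in (1, 5, 10):
#     print(n, top_n_accuracy(pred, gold, n))
```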
We measured the MSE and the explained variance for each model, finding that
k between 30 and 50 produces the optimal results for all models. Figure 1 shows the
MSE and explained variance as a function of k for GloVe, ELMo and BERT embeddings. Most
models achieve the best fit with k = 30 or k = 40. Since the average explained variance is
slightly higher for k = 30, we keep this parameter as the optimal value for the mapping.
Table 5 reports the MSE and explained variance for the k = 30 mapping. The best scores
are obtained with the syntax-based versions of the SGNS model, together with FastText
and BERT. All mappings perform largely better than the random baseline, for which the
explained variance is negative.
Using the Partial Least Squares Regression model with k = 30 for the mapping and
leave-one-out training, we predict the vectors of all the Binder words and we evaluate
them with the Retrieval task. The results are shown in Table 6. At a glance, we can
notice that all DSMs vastly outperform the random vectors and are able to retrieve in
the top 10 ranks at least half of the target concepts. For this auxiliary task, the best
performing model is BERT, which retrieves 30% of the target concepts at the top of
the ranking and more than three quarters of them in the top 10. The next best models are
the syntactically-enriched versions of the Skip-Gram vectors, with the one using typed
dependencies coming close to BERT's performance.
These results show overall good-quality representations for all the embedding types,
and a comparison with the scores by Utsumi (2018) confirms the superiority of the SGNS
model over GloVe and PPMI for this kind of mapping. Unlike the previous
study, we also consider embeddings that are trained with syntactic dependencies,
showing that for SGNS syntactic contexts lead to a general improvement in
performance (while typed dependencies are suboptimal for the PPMI model).
The next tests will aim at revealing how well the different features in the Binder
dataset are encoded by our vectors.
In the literature on neurosemantic decoding, it has been shown that models can be
compared for their ability to explain the activity in a particular brain region (Wehbe
et al. 2014; Gauthier and Ivanova 2018; Anderson et al. 2018). Analogously, we want to
inspect which features are better predicted by a given embedding model. We compute
the average of the Spearman correlation between human ratings and model predictions
across words and features. Results are reported in Table 7.
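These correlations can be computed as in the following sketch, with `pred` and `gold` as assumed above:

```python
# Sketch: per-word and per-feature Spearman correlations between
# predicted and gold Binder vectors.
import numpy as np
from scipy.stats import spearmanr

# per-word: how well a word's 65-feature profile is reproduced
per_word = [spearmanr(pred[i], gold[i]).correlation for i in range(pred.shape[0])]
# per-feature: how well a single feature is predicted across all words
per_feature = [spearmanr(pred[:, j], gold[:, j]).correlation for j in range(pred.shape[1])]

print("mean per-word rho:", np.mean(per_word))
print("mean per-feature rho:", np.mean(per_feature))
```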
All DSMs achieve high correlation values, higher than 0.7 per word, vastly out-
performing the random baseline. The results across the models are similar, with again
BERT and the syntax-based SGNS models taking the top spots. Consistently with the
previous tests, syntactic information seems to be useful for predicting the property val-
ues, as syntax-based models almost always perform slightly better than their window-
based counterparts. A similar finding for the prediction of brain activation patterns has
already been described by Abnar et al. (2018), who also reported a strong performance
by dependency-based embeddings. It is also interesting to notice that all our models
have much higher correlation values than the best results reported by Utsumi (2018), a
difference that might be due to the choice of the training corpora (we used a concate-
nation of Ukwac and Wikipedia, while Utsumi trained his models on the COCA and,
separately, on the Wikipedia corpus alone). Finally, while the PPMI embeddings used
by Utsumi drastically underperform, our PPMI vectors come much closer to the predict
ones, although the latter still retain an edge.
Figure 2: Average Spearman correlations per domain between the estimated and the original Binder features for each embedding type.
Figure 3: Spearman correlations between the estimated and original Binder features for each embedding type.
et al. 2009). When it comes to the somatosensorial features of concrete concepts, instead,
text-based models are clearly missing that kind of information about the referents,
although various aspects of experiential information are “redundantly” encoded in
linguistic expressions (Riordan and Jones 2011), as suggested by the so-called Symbol
Interdependency Hypothesis (Louwerse 2008). Finally, spatial and temporal features are
particularly challenging for distributional representations. This is compatible with the
hypothesis that temporal concepts are mainly represented in spatial terms and the
acquisition of spatial attributes requires multimodal evidence (Binder et al. 2016), which
is instead lacking in our distributional embeddings. The Emotion domain also shows
good correlation values, confirming the role of distributional information in shaping
the affective content of lexical items (Recchia and Louwerse 2015; Lenci, Lebani, and
Passaro 2018).
Figure 3 provides a more analytical and variegated view of the way embeddings
predict each Binder feature, revealing interesting differences within the various do-
mains. First of all, we can observe that some somatosensorial semantic dimensions
are indeed strongly captured by embeddings, consistently with the hypothesis that
several embodied features are encoded in language (Louwerse 2008). For instance,
COLOR, MOTION (i.e., “showing a lot of visually observable movement”), BIOMOTION
(i.e., “showing movement like that of a living thing”), and SHAPE (i.e., “having a
characteristic or defining visual shape or form”) are among the best predicted visual
Figure 4: Average Spearman correlations per word super category (a) and per word super type (b).
features. FAST is predicted much better than SLOW, while embeddings do not seem
to discriminate the BRIGHT and DARK components. In the Audition domain, LOUD,
MUSIC (i.e., “making a musical sound”), and SPEECH (i.e., “someone or something that
talks”) are generally very well predicted. The Spatial domain instead shows an uneven
behavior, with LANDMARK (i.e., “having a fixed location, as on a map”), SCENE (i.e.,
“bringing to mind a particular setting or physical location”), and PATH (i.e., “showing
changes in location along a particular direction or path”) presenting much higher corre-
lation values than the other features. The best predicted social features are HUMAN (i.e.,
“having human or human-like intentions, plans, or goals”) and COMMUNICATION (i.e.,
“a thing or action that people use to communicate”). In relation to spatial features, Time
and Human, it is interesting to point out that the models with syntactic information
generally have better predictions than their window-based equivalents (cf. in Figure 3
the values for the synf/synt versions of the PPMI/SGNS models with their w2 equivalents
and with FastText). Finally, negative sentiments and emotions are better predicted than
positive ones. This is consistent with previous reports of negative moods being more
frequently expressed in online texts (De Choudhury, Counts, and Gamon 2012).
Using the metadata in the Binder dataset, we group the words per super category
and type, and compute the average correlations. A quick look at Figure 4a reveals that
mental entities are the best represented ones, while embeddings are struggling the most
with physical and abstract properties. Also living objects and events tend to be well
represented by most embedding models. Figure 4b provides a summary of the average
correlations per word type, confirming that things are the most correlated, whereas
weaker correlations are observed for actions. Finally, the models only manage to achieve
moderate-to-low correlations for properties.
Finally, it is worth focusing on the behavior of contextualized embeddings. Though
BERT has a slightly higher Top-N accuracy (cf. Section 3.4), its overall word and feature
correlation is equivalent to that of SGNS.synt (cf. Table 7). Moreover, Figures 2–4
do not show any significant difference in the kinds of semantic dimensions encoded by
traditional DSMs with respect to BERT and ELMo vectors. This leads us to conjecture
that the true added value of the latter models lies in their ability to capture the meaning
variations of word tokens in context, rather than in the type of semantic information
they can distil from distributional data.
Probing tasks (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017) have become one of
the most common tools to investigate the content of embeddings. A probing task con-
sists in a binary classifier which is fed with embeddings and is trained to classify them
with respect to a certain linguistic dimension (e.g., animacy). The classification accuracy
is taken as a proof that the embeddings encode that piece of linguistic information. As
we have said in the Introduction, the limit of the probing task methodology is that it
only provides indirect evidence about the way linguistic categories are represented by
embeddings. In this section, we show how the decoded Binder vectors can be used to
“open the box” of semantic probing tasks, to inspect the features that are relevant for a
certain task, and to analyze the performance of distributional embeddings.
We use the original word embeddings and their corresponding mapped Binder vectors
as input features to a logistic regression classifier, which has to determine if they belong
to a given semantic class (Yaghoobzadeh et al. 2019). The human-interpretable nature
of Binder vectors allows us to decode and explain the performance of the original
embeddings in the probing tasks. In computational linguistics, being able to determine
the semantic class membership of a word is an important task, and it has several
applications such as question answering systems, information extraction and ontology
generation (Vulić et al. 2017). Similarly, the automatic detection of a given semantic
feature of a word is potentially useful for the automatic creation of lexicons and dictio-
naries, e.g., sensory lexicons (Tekiroglu, Özbal, and Strapparava 2014), emotion lexicons
(Buechel and Hahn 2018) and sentiment dictionaries (Turney and Littman 2003; Esuli
and Sebastiani 2006; Baccianella, Esuli, and Sebastiani 2010; Cardoso and Roy 2016;
Sedinkina, Breitkopf, and Schütze 2019).
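A minimal sketch of such a probing classifier, assuming `X_train`/`X_test` hold the embeddings (or the decoded Binder vectors) of the training and test words and `y_train`/`y_test` their binary class labels (names are ours), could look like this:

```python
# Sketch: probing a semantic class with a logistic regression classifier.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```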
Non-contextualized embeddings were tested on the following probing tasks that
target different semantic classes and features:
items for training and 51 words for test. For this task, concrete nouns are assumed
as the positive class.
VerbNet – The task is based on verb semantic classes included in VerbNet (Kipper
et al. 2008; Palmer, Bonial, and Hwang 2017). For each VerbNet class, we generated
a set of negative examples by randomly extracting an equal number of verbs that
do not belong to the semantic class (i.e., for a semantic class with n verbs, we
extract n verbs from the other classes to build the negative examples). Each class
was then randomly split into a training and a test set, using an 80:20 ratio, and
we selected the 23 classes that contained at least 20 test verbs.5 The task consists
in predicting whether a target verb is a class instance or not.
Direct object animacy – The task is to decide whether the direct object noun
of a sentence is animate or inanimate, and is the contextualized equivalent of
the Animate/Inanimate task above. The dataset includes 647 training subject-
verb-object sentences with animate and inanimate direct objects, and 163 test
sentences.6
BERT and ELMo were queried with the sentences in the dataset to obtain the contextu-
alized embedding of the target word (the direct object noun for the animacy task, the
verb for the causative/inchoative one), which was then fed into the classifiers.
The embeddings were not fine-tuned in the probing tasks. In fact, the overall
purpose of the analysis is not to optimize the performance of the classifiers, but to use
them to investigate the information that the original embeddings encode.
Our analysis consists of three main steps: i.) for each semantic task, we first train
a classifier using the original word embeddings and we measure their accuracy, as is
customary in the probing task literature; ii.) then, we train the same classifiers using the
corresponding mapped Binder vectors in the training sets and we inspect the most
important semantic features of each probed class; finally iii.) we measure the overlap
between the classifier top features and the top features of the words in the test sets, and
we use this information to interpret the performance of the models in the various tasks.
5.2.1 Measuring embedding accuracy in probing tasks. First of all, we evaluate the
performance of the embeddings in each task via the traditional accuracy metric, in
order to check their ability to predict the semantic class of the word. A summary of
the performance of the traditional DSMs can be seen in Table 8, while the scores for the
contextualized models are shown in Table 9. Since the classes are unbalanced in most
tasks, the tables also report the results for a majority baseline. At first glance, in Table
8 we can notice a performance gap between count models based on PPMI and SVD
and the other word embedding models, with the former being largely outperformed
by the latter on all probing datasets (the largest observed gap being around 40%)
and struggling even to beat the majority baseline in many of the VerbNet-derived test
sets. All neural embeddings achieve 100% accuracy in the classification for the Con-
crete/Abstract task and one of the models, FastText, achieves the same score also on the
Positive/Negative task. The VerbNet tasks, possibly because of the fuzzy boundaries
of the verb semantic classes, proved to be the most challenging ones and in some cases
the models struggle to beat a chance classifier. The best performing embeddings are, in
general, the FastText ones and the vectors of the SGNS family.
As for the contextualized probing tasks, BERT outperforms ELMo, and the
Causative/Inchoative alternation task is more difficult, probably because alternating
verbs are semantically more heterogeneous. However, even in this case the classification
accuracy is very high, when compared to the majority baseline.
5.2.2 Examining the semantic features of the probed classes. Since probing tasks are
typically used as “black box” tools, the performance obtained by a certain DSM is
usually regarded to be enough to draw conclusions about the information encoded by
its vectors. Here, however, the mere embedding accuracy reported in Tables 8 and 9 is
not the primary aim of our analyses. In fact, we want to make the semantic information
learnt by the classifiers explicit and human-interpretable, in order to characterize the
content of the probed semantic dimensions. To this purpose:
• for each DSM, we learn a mapping between its embeddings and 65-dimensional
Binder vectors, using the whole set of 534 Binder words as training data;
• we use the decoding mapping to generate the Binder vectors of all the words
contained in the probing datasets;
• for each probing task, we train a classifier t with the decoded Binder vectors;
• we extract the weights assigned by the classifier to the Binder features and sort
them in descending order (see the sketch below). Given the task t, TopTaskFeats(t, n) is the set of the
top n features learnt by the classifier for t using the Binder vectors.
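A minimal sketch of this last step, assuming `probe` is the logistic regression classifier trained on the decoded Binder vectors for task t and `binder_features` is the list of the 65 feature names in column order (both names are ours):

```python
# Sketch: TopTaskFeats(t, n) from the weights of a classifier trained on Binder vectors.
import numpy as np

def top_task_feats(probe, binder_features, n=5):
    weights = probe.coef_[0]               # one weight per Binder feature
    top_idx = np.argsort(-weights)[:n]     # n features with the largest positive weights
    return [binder_features[i] for i in top_idx]
```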
The set TopTaskFeats(t, n) includes the most important semantic features for the
classification task t. Table 10 reports the top 5 features for some of our probing tasks
using the Binder vectors decoded from FastText, which is one of the best performing
non-contextualized models on average, and from BERT. Notice that the top features
provide a clear characterization of the semantic classes targeted across
tasks. FACE, HUMAN, and SPEECH appear among the top features of animate nouns. For
sentiment classification, the most relevant features are positive emotions (PLEASANT,
HAPPY, BENEFIT) or belong to the Social domain (SELF). On the other hand,
physical properties (SHAPE, VISION, WEIGHT) are the most important ones for the
Concrete/Abstract distinction, in which concrete nouns represent the positive class.
Similar considerations apply to the VerbNet tasks. The class run-51.3.2 contains mo-
tion verbs and its most relevant features refer to movement (MOTION, LOWER LIMB,
FAST) and direction (PATH). The classes judgment-33.1 and say-37.7 are characterized
by features related to communication and cognition. The class sound_emission-43.2 is
instead associated with features belonging to the Audition domain. Perhaps the least
perspicuous case is represented by the features associated with the alternating class
in the Causative/Inchoative task. However, it is worth noticing the salience of the
TEMPERATURE feature, as various alternating verbs express this dimension (e.g., warm,
heat, cool, burn, etc). This shows how a simple featural decoding of the embeddings can
be used to investigate the internal structure of the semantic classes that are targeted by
probing tasks.
5.2.3 Explaining the performance of embeddings in probing tasks. The third phase of
our analysis combines the results of the previous two steps: The Binder feature vectors
learnt in Section 5.2.2 are used to explain the accuracy of the embeddings in the probing
tasks in Section 5.2.1.
For each task t and word w in the test set of t, we rank the features of the decoded
Binder vector f_w in descending order according to their values. We indicate with
TopWordFeats(f_w, n) the set of the top n features in the Binder vector f_w. Then we
measure with Average Precision (AP) the extent to which the top Binder features of t
appear among the top decoded features of the test word w. Given the ranked feature
sets TopTaskFeats(t, n) and TopWordFeats(f_w, n), we compute AP(t, w) as follows:

AP(t, w) = \frac{1}{n} \sum_{r=1}^{n} P^t_w(r)    (1)

where P^t_w(r) is the precision at rank r, whose numerator is the number of task features that
are also in the word feature vector from rank 1 to rank r. AP is a measure derived from in-
formation retrieval combining precision, relevance ranking and overall recall (Man-
ning, Raghavan, and Schütze 2008; Kotlerman et al. 2010). In our case, the ranked
task features are like documents to be retrieved and the word features are like doc-
uments returned by a query. AP takes into account two main factors: i.) the extent
of the intersection among the n most important semantic features for a word and a
Figure 5: Average Precision (AP) boxplots of the Binder vectors of the test words with respect to the top-20 Binder features of each probing task. True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) refer to the output of the classifiers trained on the original embeddings of the non-contextualized DSMs.
task, and ii.) their mutual ranking. The higher the AP(t, w) score, the more of the top
features of w are also included among the top features for the task t. For example,
suppose that TopTaskFeats(Positive/Negative, 3) = {PLEASANT, HAPPY, BENEFIT};
then AP(Positive/Negative, w) = 1 if and only if TopWordFeats(f_w, 3) contains the same
semantic features at the top of the rank.
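A sketch of this measure as defined in Equation (1), assuming `task_feats` and `word_feats` are the ranked top-n feature lists for the task and the word (names are ours):

```python
# Sketch: Average Precision of a word's top features against the top task features.
def average_precision(task_feats, word_feats):
    n = len(task_feats)
    relevant = set(task_feats)
    hits, precision_sum = 0, 0.0
    for r, feat in enumerate(word_feats, start=1):
        if feat in relevant:                # a task feature appears at rank r
            hits += 1
        precision_sum += hits / r           # precision at rank r
    return precision_sum / n

task = ["PLEASANT", "HAPPY", "BENEFIT"]
print(average_precision(task, ["PLEASANT", "HAPPY", "BENEFIT"]))  # 1.0
print(average_precision(task, ["SHAPE", "VISION", "WEIGHT"]))     # 0.0
```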
For each model and each task, we analyze the AP of the output of the classifiers
trained on the original word embeddings, whose accuracy is reported in Tables 8 and 9.
We compute the AP of the words correctly classified in the positive class (true positive,
TP) and in the negative class (true negative, TN). Moreover, we compute the AP of the
words wrongly classified in the positive class (false positive, FP) and in the negative class
(false negative, FN). The AP distribution of each word group across the probing tasks
is reported in Figure 5 for the non-contextualized DSMs and in Figure 6 for BERT and
ELMo. The Kruskal-Wallis rank sum non-parametric test shows that in all models the
four word groups significantly differ in their AP values (df = 3, p-value < 0.001).
Post-hoc pairwise Mann–Whitney U-tests (with Bonferroni correction for multiple
comparisons) confirm that across tasks TPs have a significantly higher AP than FPs
(p < 0.001). Therefore, the words correctly classified in the positive class share a large
number of the top ranked features for that class (e.g., the words whose embeddings
Figure 6: Average Precision (AP) boxplots of the Binder vectors of the test words with respect to the top-20 Binder features of each probing task. True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) refer to the output of the classifiers trained on the original BERT and ELMo embeddings.
are correctly classified as animate have a large number of the top semantic features
that characterize animacy). Conversely, the words correctly classified in the negative
class have very few, if any, of the top task features. It is interesting to observe that the
DSMs for which the difference between the median AP (represented by the thick line in
each boxplot) of TPs and the median AP of TNs is higher (i.e., the neural embeddings
for the non-contextualized models and BERT) are the models that in general show a
higher classification accuracy in the probing tasks and better encode the Binder features
(cf. Section 4). This suggests that a model's accuracy in probing tasks is strongly related
to the way its embeddings encode the most important semantic features for a certain
classification task (cf. below).
In Figures 5 and 6, the AP of the wrongly classified words (i.e., FPs and FNs) tends
to occupy an intermediate position between the AP of TPs and TNs. In fact, we can
conjecture that a word in the positive class (e.g., an animate noun) is wrongly classified
(e.g., labelled as inanimate), because it lacks many of the top features characterizing
the target class (e.g., animacy). Post-hoc pairwise Mann–Whitney U-tests support this
hypothesis, because the AP of the FNs is significantly different from the one of TPs
(PPMI.synt: p < 0.05; GloVe: p < 0.05; SGNS.w2: p < 0.001; SGNS.synf: p < 0.001;
SGNS.synt: p < 0.001; FastText: p < 0.001; ELMo: p < 0.001), except for PPMI.w2 (p = 0.23),
PPMI.synf (p = 1) and BERT (p = 0.39). Conversely, the AP of FPs is significantly higher
than the one of TNs (SGNS.w2: p < 0.001; SGNS.synf: p < 0.001; SGNS.synt: p < 0.001;
FastText: p < 0.001), except for the largely underperforming PPMI models (PPMI.w2: p
= 1; PPMI.synf: p = 0.39; PPMI.synt: p = 0.38), ELMo (p = 1), and marginally for GloVe
(p = 0.08) and BERT (p = 0.08). This suggests that the semantic features of FPs tend to
overlap with the top features of the positive class more than TNs.
Figure 7: The boxplots show the Average Precision (AP) of the Binder vectors decoded from FastText embeddings for the words belonging to the positive (1) and negative (0) classes in the test sets of the Positive/Negative and VerbNet say-37 probing tasks.
The analysis of the semantic features of misclassified words can also provide inter-
esting clues to explain why DSMs make errors in probing tasks. For instance, FastText
does not classify keen as a sound emission verb (i.e., it is a FN for the VerbNet class
38). If we inspect its decoded vector we find COGNITION, SOCIAL, SELF, BENEFIT and
PLEASANT among its top features, likely referring to the abstract adjective keen, which
is surely much more frequent in the PoS-ambiguous training data than the verb keen (to
emit wailing sounds). On the other hand, PPMI.w2 wrongly classifies judge as a manner
of speaking verb (i.e., it is a FP of the VerbNet class 37.3). This mistake can be explained
by looking at its decoded vector, whose top feature is SPEECH, which is probably due to
the quite common usage of judge as a communication verb.
As illustrated in Tables 8 and 9, the variance of model accuracy across tasks is
extremely high. For instance, the accuracy of FastText ranges from 1 in the Posi-
tive/Negative and Concrete/Abstract tasks, to 0.55 for the VerbNet say-37.7 class. In the
standard use of probing tasks, the classifier accuracy is taken to be enough to draw con-
clusions about the way a certain piece of information is encoded by embeddings. Here,
we go beyond this “black box” analysis and provide a more insightful interpretation
of the different behavior of embeddings in semantic probing tasks. We argue that such
explanation can come from the decoded Binder features, and that a model performance
in a given task t depends on the way the words to be classified encode the top-n ranked
features for t (i.e., TopTaskFeats(t, n)). For instance, consider the boxplots in Figure
7, which show the AP of the Binder vectors decoded from the FastText embeddings
for the words belonging to the positive (1) and negative (0) classes in the test sets of
the Positive/Negative and VerbNet say-37 probing tasks. FastText achieves a very high
accuracy in the former task, and the AP distributions of the 1 and 0 words are clearly
distinct, indicating that these two sets have different semantic features, and that the
features of the 0 words have a very low overlap with the top task features. Conversely,
Model ρ p-value
PPMI.w2 0.29 0.15
PPMI.synf 0.43 0.03∗
PPMI.synt 0.23 0.26
GloVe 0.65 < 0.001∗
SGNS.w2 0.68 < 0.001∗
SGNS.synf 0.78 < 0.001∗
SGNS.synt 0.70 < 0.001∗
FastText 0.71 < 0.001∗
Table 11: Spearman correlation (ρ) between APdiff(t) and the classification accuracy for
the static embedding models.
the AP distributions of the 1 and 0 words for the say-37.7 task overlap to a great extent,
suggesting that the two groups are not well separated in the semantic feature space. Our
hypothesis is that the DSM accuracy in a probing task tends to be strongly correlated
with the degree of separation between the semantic features decoded from the positive
and negative items in the target class.
To verify this hypothesis, we take the sets of positive (W1 ) and negative (W0 ) test
words of each task t and we compute the following measure:

AP_{diff}(t) = AP(t, W_1) - AP(t, W_0)    (2)

where AP(t, W1) and AP(t, W0) are respectively the mean AP for W1 and W0. There-
fore, APdiff(t) estimates the separability of the positive and negative words in the
semantic feature space relevant for the task t. We expect that the higher the APdiff(t) of
a model, the higher its performance in t. Table 11 shows that this prediction is borne out,
at least for the best performing non-contextualized DSMs. The Spearman correlation
between the model accuracy in the probing tasks and APdiff(t) is fairly high for all
models, except for the PPMI ones. It is again suggestive that these are not only the
worst-performing models in the probing tasks, but also the embeddings with a less
satisfactory encoding of the Binder features. Table 12 illustrates that the correlation
between APdiff(t) and task accuracy holds true for contextualized embeddings as well.
For both BERT and ELMo, the APdiff and accuracy are greater for the Direct object
animacy task than for the Causative/Inchoative alternation.
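The measure can be computed as in the following sketch, assuming `ap` maps each test word to its AP(t, w) value and `labels` maps it to its gold class (1 for W1, 0 for W0); both names are ours:

```python
# Sketch: APdiff(t) as the difference between the mean AP of the positive
# and negative test words.
import numpy as np

def ap_diff(ap, labels):
    pos = np.mean([ap[w] for w in ap if labels[w] == 1])
    neg = np.mean([ap[w] for w in ap if labels[w] == 0])
    return pos - neg
```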
Word embeddings have become the most common semantic representation in NLP
and AI. Despite their success in boosting the performance of applications, the way
embeddings capture meaning still defies our full understanding. The challenge mainly
• the words correctly classified in the positive class (i.e., TPs) share a large number
of the top ranked features for that class, and, symmetrically, the words correctly
classified in the negative class (i.e., TNs) have a significantly lower number of the
top task features;
• words in the positive class that are wrongly classified (i.e., FNs) lack many of the
top features characterizing the target class. Conversely, the features of words wrongly
classified into the positive class (i.e., FPs) tend to overlap with the top task features
more than TNs;
• the accuracy of a DSM in a probing task strongly correlates with the degree of
separation between the semantic features decoded from its embeddings of the
words in the positive and negative classes.
These results show that semantic feature decoding provides a simple and useful tool
to explain the performance of word embeddings and to enhance the interpretability of
probing tasks.
The methodology we have proposed paves the way for other types of analyses
and applications. There are at least two prospective research extensions that we plan
to pursue, respectively concerning selectional preferences and word sense disambigua-
tion. Many recent approaches to the modeling of selectional preferences have given
up on the idea of characterizing the semantic constraints of predicates in terms of dis-
crete semantic types, focusing instead on measuring a continuous degree of predicate-
argument compatibility, known as thematic fit (McRae and Matsuki 2009). DSMs have
been extensively and successfully applied to address this issue, typically measuring the
cosine between a target noun vector and the vectors of the most prototypical predicate
arguments (Baroni and Lenci 2010; Erk, Padó, and Padó 2010; Lenci 2011; Sayeed, Green-
berg, and Demberg 2016; Santus et al. 2017; Chersoni et al. 2019; Zhang, Ding, and Song
2019; Zhang et al. 2019; Chersoni et al. 2020). This approach can be profitably paired
with our decoding methodology to identify the most salient features associated with
a predicate argument. For instance, we can expect that listen selects for direct objects
in which Audition features are particularly salient. This way, distributional methods
will be able not only to measure the gradient preference of a predicate for a certain
argument, but also to highlight the features that explain this preference, contributing to
characterize the semantic constraints of predicates.
As for word-sense disambiguation, models like ELMo and BERT provide con-
textualized embeddings that allow us to investigate word sense variation in context.
Using contextualized vectors, it might be possible to investigate how meaning changes
in context by inspecting the feature salience variation of different word tokens. For
example, we expect features like SOUND and MUSIC to be more salient in the vector of
play in the sentence The violinist played the sonata, rather than in the sentence The team
played soccer. This could be extremely useful also in tasks such as metaphor and token-
level idiom detection, where it is typically required to disambiguate expressions that
might have a literal or a non-literal sense depending on the context of usage (King and
Cook 2018; Rohanian et al. 2020).
Word embeddings and featural symbolic representations are often regarded as
antithetic and possibly incompatible ways of representing semantic information, which
pertain to very different approaches to the study of language and cognition. In this
paper, we have shown that the distance between these two types of meaning repre-
sentation is smaller than what appears prima facie. New bridges between symbolic and
distributed lexical representations can be laid, and used to exploit their complementary
strengths: the human-interpretability of the former and the gradience and robustness of the
latter. An important contribution may come from collecting more extensive data about
feature salience: the Binder dataset is an important starting point, but human ratings
about other types of semantic features and words might be easily collected with crowd-
sourcing methods.
In this work, we have mainly used feature-based representations as a heuristic
tool to interpret embeddings. An interesting research question is whether decoded
features from embeddings could actually have other applications too. For instance,
semantic features provide a more abstract type of semantic representation that might be
complementary to the fine-grained information captured by distributional embeddings.
This suggests exploring new ways to integrate symbolic and vector models of meaning.