Evaluation of Semantic Answer Similarity Metrics

Farida Mustafazade, Peter Ebbinghaus

Abstract

We propose a cross-encoder augmented BERTScore model for semantic answer similarity, trained on our new dataset of non-common names in English contexts. There are several issues with the existing general machine translation (MT) and natural language generation (NLG) evaluation metrics, and question answering (QA) systems are no different in that sense. To build robust QA systems, we need equivalently robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics, as opposed to pure lexical overlap, is important not only to compare models fairly but also to indicate more realistic acceptance criteria in real-life applications. We build upon the first paper, to our knowledge, that uses transformer-based model metrics to assess semantic answer similarity, and we achieve superior results in the case of no lexical overlap.

1 Introduction

Having reliable metrics for the evaluation of language models in general, and of models solving difficult question answering (QA) problems in particular, is crucial in this rapidly developing field. These metrics are not only useful to identify concerns with the current models, but they also influence the development of a new generation of models. In addition, the consensus is to have a simple metric as opposed to a highly configurable and parameterizable one, so that the development and the hyperparameter tuning do not add more layers of complexity to already complex QA pipelines. SAS, a cross-encoder-based metric for the estimation of semantic answer similarity (Risch et al., 2021), provides one such metric to compare answers based on semantic similarity.

The central objective of this research project is to analyse pairs of answers like the one in Figure 1, where (Risch et al., 2021)'s Semantic Answer Similarity (SAS) cross-encoder metric differs from human judgement, and to find patterns resulting in such errors. The main hypotheses that we aim to test thoroughly through experiments are twofold. First of all, lexical-based metrics are not well suited to the automated evaluation of QA models. Secondly, most metrics, specifically SAS and BERTScore as described in (Risch et al., 2021), find some data types more difficult to assess for similarity than others.

After familiarising ourselves with the current state of research in the field in Section 2, we describe the datasets provided in (Risch et al., 2021) and the new dataset of names that we purposefully tailor to our model in Section 3. This is followed by Section 4, introducing the four semantic answer similarity approaches described in (Risch et al., 2021) as well as three lexical n-gram-based automated metrics. Then in Section 6, we thoroughly analyse the evaluation datasets described in the previous section and conduct an in-depth qualitative analysis of the errors. Since the human labels are discrete, while the model outputs come from a continuous distribution, our error analysis methodology involves filtering on thresholds to discretise the scores. Finally, in Section 4, we mitigate some of those issues by training the models further on external datasets for some categories and on systematically generated datasets for others. In Section 7, we summarise our contributions and discuss ways in which, as part of future work, the model could be improved and used in real-life applications.

Question: Who killed Natalie and Ann in Sharp Objects?
Ground-truth answer: Amma
Predicted answer: Luke
EM: 0.00
F1: 0.00
Top-1-Accuracy: 0.00
SAS: 0.0096
Human judgement: 0
fBERT (BERTScore vanilla): 0.226
f′BERT (BERTScore trained): 0.145
Bi-Encoder: 0.208
f̃BERT (new, trained on names): 0.00
Bi-Encoder (new model): -0.034

Figure 1: Representative example of a question and all semantic answer similarity measurement results.
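To make the lexical baselines in Figure 1 concrete, the following is a minimal sketch of SQuAD-style exact match and token-level F1, the two string-based scores reported above. The normalisation shown here is a simplified assumption rather than the exact implementation used in our experiments.

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    # Simplified normalisation: lowercase, strip punctuation, split on whitespace.
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def token_f1(prediction: str, ground_truth: str) -> float:
    pred, gold = normalize(prediction), normalize(ground_truth)
    common = Counter(pred) & Counter(gold)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# The Figure 1 pair has no lexical overlap, so both scores are 0.
print(exact_match("Luke", "Amma"))  # 0.0
print(token_f1("Luke", "Amma"))     # 0.0
```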
2 Related work

We define semantic similarity as different descriptions for something that has the same meaning in a given context, largely following (Zeng, 2007)'s definition of semantic and contextual synonyms. Min et al. (2021) noted that open-domain QA is inherently ambiguous because of the uncertainties in the language itself. They observed that automatic evaluation based on exact match (EM) fails to capture semantic similarity, which is observed in 60% of the ground-truth and prediction pairs in the NQ-open dataset. They further reported that 13%-17% of the predictions that fail automated evaluations are either definitely correct or partially correct. As a result, human evaluation improves the accuracy of the open-domain QA systems they tested on by 17%-54% compared to the EM used by the automated evaluation system.

Two of the four semantic textual similarity (STS) metrics that we analyse, and the model that we eventually train, depend on BERTScore, which is introduced in (Zhang et al., 2019). This metric is not one-size-fits-all. On top of choosing a suitable contextual embedding and model, there is an additional feature of importance weighting using inverse document frequency (idf). The idea is to limit the influence of common words. One of the findings is that most automated evaluation metrics demonstrate significantly better results on datasets without adversarial examples, even when these are introduced within the training dataset, while the performance of BERTScore suffers only slightly. Zhang et al. (2019) use MT and image captioning tasks in their experiments, not QA. Chen et al. (2019) apply BERT-based evaluation metrics for the first time in the context of QA. Even though they find that METEOR, an n-gram-based evaluation metric, proved to perform better than the BERT-based approaches, they encourage more research in the area of semantic text analysis for QA.

Risch et al. (2021) expand on this idea and further address the issues with existing general MT and NLG evaluation metrics, which also cover generative and extractive QA evaluation. These issues include, but are not limited to, reliance on string-based methods such as EM, F1-score, and top-n-accuracy. There is a more generic problem of evaluating STS. The problem is even more substantial for multi-way annotations, where multiple ground-truth answers exist in the document for the same question, but only one of them is annotated. The major contribution of the authors is the formulation and analysis of four semantic answer similarity approaches that aim to resolve, to a large extent, the issues mentioned above. They also release two three-way annotated datasets: one a subset of the well-known English SQuAD dataset (Rajpurkar et al., 2018), one a German GermanQuAD dataset (Möller et al., 2021), plus a subset of NQ-open (Min et al., 2021). Looking into error categories revealed problematic data types, where entities, particularly names, turned out to be the leading category. This is also what Si et al. (2021) try to solve using knowledge-base (KB) mined aliases as additional ground-truth answers. In contrast to Si et al. (2021), we generate a standalone names dataset from another dataset, described in greater detail in Section 3. The authors try to accomplish higher EM scores, where EM is defined as the maximum exact match on any of the correct answers in the expanded answer set.

Our main assumption is that better metrics will have a higher correlation with human judgement, but the choice of a correlation metric is important. Pearson correlation is commonly used in evaluating semantic textual similarity (STS) to compare the system output to human evaluation. Reimers et al. (2016) show that the Pearson product-moment correlation can be misleading when it comes to intrinsic evaluation. They further demonstrate that no single evaluation metric is well suited for all STS tasks, hence evaluation metrics should be chosen based on the specific task. In our case, most of the assumptions behind Pearson correlation, such as normality of the data and continuity of the variables, do not hold. Kendall's rank correlations are meant to be more robust and slightly more efficient in comparison to Spearman, as demonstrated in Croux and Dehon (2010).
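Following that recommendation, we report rank correlations alongside Pearson throughout the paper. The sketch below shows how a metric's continuous scores can be compared against the discrete human labels with SciPy; the score and label arrays are illustrative placeholders, not data from our subsets.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Illustrative values: continuous metric scores vs. discrete human labels (0, 1, 2).
metric_scores = [0.01, 0.35, 0.92, 0.80, 0.10, 0.55]
human_labels  = [0,    1,    2,    2,    0,    1]

r,   _ = pearsonr(metric_scores, human_labels)   # assumes normality and continuity
rho, _ = spearmanr(metric_scores, human_labels)  # rank-based
tau, _ = kendalltau(metric_scores, human_labels) # rank-based, robust to ties

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```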
3 Data

We perform our analysis on the three subsets1 of the larger datasets provided by Risch et al. (2021), each manually annotated by three human raters. Unless specified otherwise, we will refer to the subsets by the associated dataset names.

3.1 Original datasets

SQuAD is an English-language dataset containing multi-way annotated questions with 4.8 answers per question on average. GermanQuAD (Möller et al., 2021) is a three-way annotated German-language question/answer pairs dataset created by the deepset team, which also wrote (Risch et al., 2021). Based on the German counterparts of the English Wikipedia articles used in SQuAD, GermanQuAD is the SOTA dataset for German question answering models. To address a shortcoming of SQuAD mentioned in (Kwiatkowski et al., 2019), GermanQuAD was created with the goal of preventing strong lexical overlap between questions and answers. For this reason, questions were constantly rephrased via synonyms and altered syntax, and complex questions were encouraged. SQuAD and GermanQuAD contain a pair of answers and a hand-labelled notation of 0 if the answers are completely dissimilar, 1 if the answers have a somewhat similar meaning, and 2 if the two answers express the same meaning. NQ-open is a five-way annotated open-domain adaptation of Kwiatkowski et al. (2019)'s Natural Questions dataset. NQ-open is based on actual Google search engine queries. In the case of NQ-open, the labels follow a different methodology, as described in Min et al. (2021). The assumption is that we only keep questions with a non-vague interpretation. The human annotators attach a label of 2 to all predictions that answer the question correctly ("definitely correct"), 1 to "possibly correct" predictions, and 0 to "definitely incorrect" ones. Questions like Who won the last FIFA World Cup? received the label 1 because they have different correct answers, with no single precise answer at a point in time later than when the question was retrieved. There is yet another ambiguity with this question, namely whether it is discussing the FIFA Women's World Cup or the FIFA Men's World Cup. This way, two answers can be correct without semantic similarity, even though only one correct answer is expected. In comparison to the annotation for SQuAD and GermanQuAD, we conclude that the annotation of NQ-open indicates truthfulness of the predicted answer, whereas for SQuAD and GermanQuAD the annotation relates to the semantic similarity of both answers, which can lead to differences in interpretation as well as evaluation.

In an attempt to further improve the NQ-open subset, we build on (Risch et al., 2021)'s filter to only include question-answer pairs with one given ground-truth, by manually re-labelling incorrect labels as well as filtering out vague questions. We focus on the most obvious cases described in Section 3 and provide an improvement across all metrics in Section 6.

Table 1 describes the size and some lexical features of each of the three datasets. There were 2, 3 and 23 duplicates in each dataset respectively. Dropping these duplicates led to slight changes in the metric scores.

                   SQuAD   GermanQuAD   NQ-open
Label 0             56.7         27.3      71.7
Label 1             30.7         51.5      16.6
Label 2             12.7         21.1      11.7
F1 = 0               565          124      3030
F1 ≠ 0               374          299       529
Size                 939          423      3559
Avg answer size       23           68        13

Table 1: Percentage distribution of the labels and statistics on the subsets of datasets used in the analyses. The average answer size column refers to the average over both the first and second answers, as well as over the ground-truth answer and predicted answer (NQ-open only). F1 = 0 indicates no string similarity, F1 ≠ 0 indicates some string similarity. The label distribution is given in percentages.

1 https://semantic-answer-similarity.s3.amazonaws.com/data.zip

3.2 Augmented datasets

In Section 6, we find that for NQ-open, in the majority of cases, the underlying QA model fails in its prediction of names, and BERTScore likewise fails in assessing their similarity to the ground-truth answers. To resolve this issue, we provide a new dataset that consists of 13,593 name pairs and employ the Augmented SBERT approach (Thakur et al., 2021), whereby we use the cross-encoder model to label a new dataset consisting of name pairs (only USA at the moment) and then train a bi-encoder model on the resulting dataset, as sketched below.
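The following is a minimal sketch of that labelling-and-training loop with the sentence-transformers library, under the assumption that the name pairs are already available as a list of tuples; the example pairs and training hyperparameters here are illustrative, not the exact setup behind our released model.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, SentenceTransformer, InputExample, losses

# 1) Silver-label randomly paired names with the English STS cross-encoder.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
name_pairs = [("gary a labranche", "labranche gary"), ("mark davis", "tona rozum")]  # illustrative
silver_scores = cross_encoder.predict(name_pairs)  # one similarity score per pair

# 2) Fine-tune the multilingual bi-encoder on the silver-labelled pairs (Augmented SBERT).
bi_encoder = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")
train_examples = [InputExample(texts=[a, b], label=float(s))
                  for (a, b), s in zip(name_pairs, silver_scores)]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_loader, train_loss)], epochs=2, warmup_steps=100)
```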
We deploy the already presented cross-encoder/stsb-roberta-large to label our new name pairs dataset, which we use to fine-tune the aforementioned T-Systems-onsite/cross-en-de-roberta-sentence-transformer, which we then use in our Bi-Encoder as well as BERTScore trained semantic answer similarity metrics. Such labels are commonly referred to as silver labels. The underlying dataset is created from the open-source "Politicians on Wikipedia and DBpedia" dataset2, which includes the names of 1,167,261 persons that have a page on Wikipedia and DBpedia. Out of these, we only use those based in the U.S., as the English cross-encoder model has difficulties with labelling names that are not as common in the U.S. Besides, the questions in NQ-open are on predominantly U.S.-related topics. We then shuffle the list of 25,462 names and pair them randomly to get the name pairs, which are then labelled by the cross-encoder model, resulting in a dataset where 75 per cent of the values have a score of 0.012 or smaller. Only 23 pairs receive a score higher than 0.5. To add to the pairs which only by chance ended up describing the same individual, as in the case of mark davis and tona rozum, we make use of another benefit of the dataset: it includes different ways of writing a person's name, such as gary a labranche and labranche gary, but also aliases like Lisa Marie Abato's stage name Holly Ryder, as well as e.g. Chinese ways of writing such as Rulan Chao Pian and 卞趙如蘭. In this context, we filter out all examples where more than three different ways of writing a person's name exist, because in these cases the names do not refer to a person but were mistakenly included in the dataset, as, for example, the names of various members of the Tampa Bay Rays minor league who have one page for all members.

We find that adding only the first variation of names improves the overall performance of the bi-encoder model trained on the new name pairs dataset. Since most persons in the dataset have a maximum of one variation of their name, we only leave out close to 800 other variations this way, and we can add 14,131 additional pairs to the aforementioned random pairs. All name variation pairs receive a manually annotated score of 1 because they are synonymous and refer to the same person.

2 https://github.com/Kandy16/people-networks/tree/dbpedia-data/dbpedia-data/final_datasets
3 https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark

4 Models

The focus of our research lies on different semantic similarity metrics and their underlying models. As a human baseline, the original paper reports correlations between the labels of the first and the second annotator for the subsets of SQuAD and GermanQuAD, and omits these for the NQ-open subset since they are not publicly available. The maximum correlations achieved are just under 0.7 Spearman rank correlation for SQuAD and 0.64 for GermanQuAD, while the maximum Kendall rank correlations achieved are just under 0.5 and 0.6 for SQuAD and GermanQuAD respectively.

The baseline semantic similarity models we have considered are the bi-encoder, BERTScore vanilla, and BERTScore trained, whereas the focus will be on cross-encoder (SAS) performance. Table 5 outlines the exact configurations used for each model. For NQ-open specifically, METEOR and ROUGE-L remain essential baselines, too, because the lexical-overlap based metrics perform similarly to the semantic similarity metrics on NQ-open.

The bi-encoder approach is based on the sentence Transformer structure (Reimers and Gurevych, 2019). An advantage of the bi-encoder model architecture is that it calculates the embeddings of the two input texts separately. Thus, the embeddings for the ground-truth answer can be pre-computed early and compared later with the prediction answer embeddings. The model we use can be applied to English and German texts because it was trained in both languages. As described in (Risch et al., 2021), it is based on xlm-roberta-base, and it was further trained on an unreleased multi-lingual paraphrase dataset, which resulted in the model paraphrase-xlm-r-multilingual-v1, which in turn was fine-tuned on an English-language STS benchmark dataset (Cer et al., 2017) and a machine-translated German version of the same benchmark3, resulting in the model T-Systems-onsite/cross-en-de-roberta-sentence-transformer.

While the bi-encoder approach calculates separate embeddings for a pair of answers, the cross-encoder architecture used for SAS (Risch et al., 2021) concatenates the answers with a special separator token. Pre-computation is not possible with the cross-encoder approach because it takes both input texts into account at the same time to calculate an embedding, as opposed to calculating embeddings separately. The contrast between the two approaches is sketched below.
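A minimal sketch of this difference, again using the sentence-transformers library; the model names are the ones discussed above, while the example answers are taken from Figure 1 for illustration only.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

gold, pred = "Amma", "Luke"

# Bi-encoder: embed each answer independently, then compare with cosine similarity.
# The gold embedding could be pre-computed and cached ahead of time.
bi_encoder = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")
emb_gold = bi_encoder.encode(gold, convert_to_tensor=True)
emb_pred = bi_encoder.encode(pred, convert_to_tensor=True)
bi_score = util.cos_sim(emb_gold, emb_pred).item()

# Cross-encoder (SAS): both answers are fed jointly, so nothing can be pre-computed.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
sas_score = cross_encoder.predict([(gold, pred)])[0]

print(bi_score, sas_score)
```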
As opposed to the bi-encoder, (Risch et al., 2021) used a separate English and a separate German model for the cross-encoder, because there is no multi-lingual cross-encoder implementation available yet. Similar to the bi-encoder approach, the English SAS cross-encoder model relies on cross-encoder/stsb-roberta-large, which was trained on the same English STS benchmark as the bi-encoder model (Cer et al., 2017). For German, on the other hand, a new cross-encoder model had to be trained, as there were no German cross-encoder models available. It is based on deepset's gbert-large (Chan et al., 2020) and trained on the same machine-translated German STS benchmark as the bi-encoder model, resulting in the gbert-large-sts model that is used in the experiments.

BERTScore (Zhang et al., 2019) uses Transformer-based language models to generate contextual embeddings, then matches the tokens of the ground-truth answer and the prediction (or the second answer in SQuAD and GermanQuAD), followed by creating a score from the maximum cosine similarity of the matched tokens. The implementation from Zhang et al. (2019)4 is used for our evaluation, with minor changes to accommodate the missing key-value pairs for the T-Systems-onsite/cross-en-de-roberta-sentence-transformer model type and its number of layers (12) for BERTScore trained, as we are using the last layer for the trained version and only the second-layer representations for the vanilla type of BERTScore. We follow Risch et al. (2021)'s approach of comparing two different BERTScore implementations with each other: BERTScore vanilla is based on the standard pre-trained BERT language model bert-base-uncased for English (SQuAD and NQ-open) as well as deepset's gelectra-base (Chan et al., 2020) for German (GermanQuAD), whereas BERTScore trained is based on the multi-lingual model that is used by the bi-encoder approach, called T-Systems-onsite/cross-en-de-roberta-sentence-transformer.

4 https://github.com/Tiiiger/bert_score
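As an illustration, both BERTScore variants can be computed with that library roughly as follows. The layer choices mirror the description above, but treat the exact arguments as an assumption rather than our verbatim evaluation script (the paper notes that minor changes to the library were needed for the custom model type).

```python
from bert_score import score

preds = ["Luke"]
golds = ["Amma"]

# BERTScore vanilla: standard bert-base-uncased, low-layer representations.
p_van, r_van, f_van = score(preds, golds,
                            model_type="bert-base-uncased", num_layers=2)

# BERTScore trained: the multilingual sentence-transformer, last-layer representations.
p_tr, r_tr, f_tr = score(preds, golds,
                         model_type="T-Systems-onsite/cross-en-de-roberta-sentence-transformer",
                         num_layers=12)

print(f_van.item(), f_tr.item())  # F1 component of each BERTScore variant
```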
In Section 6 we observe that BERTScore trained outperforms SAS for answer-prediction pairs without lexical overlap, which in addition form the largest group in NQ-open. Therefore, and also because cross-encoders are in general slower than bi-encoder approaches, we analyse the two approaches more thoroughly and find that the bi-encoder and BERTScore trained do not perform well on names in particular. We therefore use our new name pairs dataset to train T-Systems-onsite/cross-en-de-roberta-sentence-transformer on it, with the same hyperparameters as were used to train paraphrase-xlm-r-multilingual-v1 on the English-language STS benchmark dataset, which resulted in our augmented version of T-Systems-onsite/cross-en-de-roberta-sentence-transformer (see below).

5 Experiments

To evaluate the shortcomings of lexical-based metrics in the context of question answering, we compare scores on the evaluation datasets from BLEU, ROUGE-L, METEOR, F1-score and the semantic answer similarity metrics, i.e. Bi-Encoder, BERTScore vanilla, BERTScore trained, and Cross-Encoder (SAS). To address the second hypothesis, we delve deeply into every single dataset and find differences between different types of answers, e.g. names and numbers. As can be observed from Figure 2, lexical-based metrics show considerably lower results than any of the semantic similarity approaches, as described in Risch et al. (2021). In line with what the authors found, BLEU indeed lags behind all other metrics, followed by METEOR. Similarly, we found that ROUGE-L and F1 achieve close results. In the absence of lexical overlap, METEOR gives superior results compared to the other n-gram-based metrics in the case of SQuAD, but ROUGE-L is closer to human judgement for the rest. Regarding the semantic answer similarity metrics, the highest correlations are achieved in the case of the BERTScore trained models, followed closely by the bi- and cross-encoder models. We found some inconsistencies regarding the performance of the cross-encoder based SAS metric. The superior performance of SAS does not hold up for the correlation metrics other than Pearson. We observed that the SAS score underperformed when F1 = 0, compared to all other semantic answer similarity metrics, and overperformed when there is some lexical similarity.

The score distribution for SAS and BERTScore trained shows that SAS scores are heavily tilted towards 0, as per Figure 6. We also analyse BERTScore trained thoroughly; however, since the labels are not a continuous variable, we rely heavily on the two rank correlations, namely Spearman's and Kendall's rank correlations, similar to (Chen et al., 2019). Furthermore, using a combination of metric values, a considerable number of mislabelled pair cases, mainly in NQ-open, have been discovered.
[Figure 2: distribution of evaluation metric scores (BLEU, ROUGE-L, METEOR, F1-score, Bi-Encoder, fBERT, f′BERT, SAS, New Bi-Encoder, f̃BERT) for SQuAD, GermanQuAD and NQ-open.]

Figure 2: Comparison of all (similarity) scores for the pairs in the evaluation datasets. METEOR computations for GermanQuAD are omitted since METEOR is not available for German.
6 Error Analysis

This section is entirely dedicated to highlighting the major categories of problematic samples. We also include, towards the end of subsection 6.3, details of the updated NQ-open dataset.

6.1 SQuAD

In Figure 3, we analyse the SQuAD subset of answer pairs and observe a similar phenomenon as in the original paper when there is no lexical overlap between the answer pairs: the higher in the layers we go in the case of BERTScore trained, the higher the correlation values with human labels are. Quite the opposite is observed in the case of BERTScore vanilla, where it is either not as sensitive to the embedding representations in the case of no lexical overlap, or the correlations decrease with higher embedding layers. There are only 16 cases where SAS completely diverges from the human labels. In all seven cases where the SAS score is above 0.5 and the label is 0, we notice that the two answers either have a common substring or could often be used in the same context. At the other extreme, when the label is indicative of semantic similarity and SAS gives scores below 0.25, totalling only 9 cases overall, there are three spatial translations; in other words, SAS struggles with non-common names in English contexts, for which we find even more evidence in the case of NQ-open. There is an encoding-related example with 12 and 10 special characters respectively, which to our team seems to be a mislabelled example. We have not identified any other consistent error categories for SQuAD.

6.2 GermanQuAD

Evaluating GermanQuAD more closely proves that SAS correlates the strongest with the human annotation, as Figure 5 shows.

                        GermanQuAD
               F1 = 0                  F1 ≠ 0
Metrics      r      ρ      τ        r      ρ      τ
BLEU       0.000  0.000  0.000    0.153  0.095  0.089
ROUGE-L    0.172  0.106  0.100    0.579  0.554  0.460
F1-score   0.000  0.000  0.000    0.560  0.534  0.443
Bi-Encoder 0.392  0.337  0.273    0.596  0.595  0.491
fBERT      0.149  0.008  0.006    0.599  0.554  0.457
f′BERT     0.410  0.349  0.284    0.606  0.592  0.489
SAS        0.488  0.432  0.349    0.713  0.690  0.574

Table 2: Pearson, Spearman's, and Kendall's rank correlations of annotator labels and automated metrics on subsets of GermanQuAD. fBERT is BERTScore vanilla and f′BERT is BERTScore trained.

For GermanQuAD, SAS fails to identify semantic similarity in cases where the answers are synonyms or translations, which also include technical terms that rely on Latin (e.g. vis viva and living forces (translated) (SAS score: 0.5), Anorthotiko Komma Ergazomenou Laou and Progressive Party of the Working People (transl.) (0.04), Nährgebiet and Akkumulationsgebiet (0.45), Zehrgebiet and Ablationsgebiet (0.43)). This is likely the case because SAS does not use a multilingual model. Since multilingual models have not been implemented for cross-encoders yet, this remains an area for future research. The general hypothesis is supported by significantly higher BERTScore trained scores for the same pairs (0.43-0.5), apart from Anorthotiko Komma Ergazomenou Laou (0.06).
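A rough sketch of the per-layer analysis behind Figure 3: recompute BERTScore at each layer and correlate the resulting F1 with the human labels. The inputs below are illustrative placeholders standing in for the subset columns, and it is assumed that the bert-score package accepts the model type directly.

```python
from bert_score import score
from scipy.stats import kendalltau

# Assumed inputs: predictions (or second answers), ground-truth answers, human labels (0/1/2).
cands = ["Luke", "Amma", "Eiffel Tower"]
refs = ["Amma", "Amma", "the Eiffel Tower"]
labels = [0, 2, 2]

model_name = "T-Systems-onsite/cross-en-de-roberta-sentence-transformer"
for layer in range(1, 13):  # the multilingual model has 12 transformer layers
    _, _, f1 = score(cands, refs, model_type=model_name, num_layers=layer)
    tau, _ = kendalltau(f1.tolist(), labels)
    print(f"layer {layer:2d}: Kendall tau = {tau:.3f}")
```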
                              SQuAD                                          NQ-open
                 F1 = 0                  F1 ≠ 0                 F1 = 0                  F1 ≠ 0
Metrics        r      ρ      τ        r      ρ      τ        r      ρ      τ        r      ρ      τ
BLEU         0.000  0.000  0.000    0.182  0.168  0.159    0.000  0.000  0.000    0.052  0.054  0.051
ROUGE-L      0.100  0.043  0.041    0.556  0.537  0.455    0.220  0.163  0.159    0.450  0.458  0.377
METEOR       0.398  0.207  0.200    0.450  0.464  0.378    0.233  0.152  0.148    0.188  0.179  0.139
F1-score     0.000  0.000  0.000    0.594  0.579  0.497    0.000  0.000  0.000    0.394  0.407  0.337
Bi-Encoder   0.487  0.372  0.303    0.684  0.684  0.566    0.294  0.212  0.170    0.454  0.446  0.351
fBERT        0.249  0.132  0.108    0.612  0.601  0.492    0.156  0.169  0.135    0.165  0.142  0.112
f′BERT       0.516  0.391  0.318    0.698  0.688  0.571    0.319  0.225  0.181    0.452  0.449  0.354
SAS          0.561  0.359  0.291    0.743  0.735  0.613    0.422  0.196  0.158    0.662  0.647  0.512
New Bi-Encoder 0.501 0.391 0.318    0.694  0.690  0.572    0.338  0.252  0.203    0.501  0.501  0.392
f̃BERT        0.519  0.399  0.324    0.707  0.698  0.581    0.351  0.257  0.208    0.498  0.507  0.398

Table 3: Pearson, Spearman's, and Kendall's rank correlations of annotator labels and automated metrics on subsets of SQuAD and NQ-open. fBERT is BERTScore vanilla, f′BERT is BERTScore trained, and f̃BERT is the new BERTScore trained on names.

[Figure 3: three panels of line plots, correlation of BERTScore to human labels with no lexical overlap, with some lexical overlap, and aggregated, plotted across embedding layers for fBERT, f′BERT and SAS with Pearson's r, Spearman's ρ and Kendall's τ.]

Figure 3: Pearson, Spearman's, and Kendall's rank correlations for different embedding extractions for when there is no lexical overlap (F1 = 0), when there is some overlap (F1 ≠ 0), and aggregated, for the SQuAD subset. fBERT is BERTScore vanilla and f′BERT is BERTScore trained.

The Anorthotiko Komma Ergazomenou Laou pair is noticeable because the multilingual model used in BERTScore trained was fine-tuned on English and German, but the training set of the underlying XLM-RoBERTa model included more than ten times as many Greek tokens as Latin ones (Conneau et al., 2020). Difficult as well are text-based calculations and numbers (transl.: 46th day before Easter Sunday and Wednesday after the 7th Sunday before Easter (0.41), 24576 kB/s and (transl.) Execute three transmissions per micro-frame (125 µs) with up to 1024 bytes (0.26)).

Apart from that, there are aliases or descriptions of relations which point to the same person or object: Thayendanegea and Joseph Brant (0.028) are the same person, but SAS fails to recognise it, while BERTScore vanilla and BERTScore trained both find some similarity (0.36, 0.22); Goring House and Buckinghams Haus (0.29) refer to the same object, but one is the official name and the other a description of the same, and again BERTScore vanilla and BERTScore trained identify more similarity (0.44, 0.37); for Aschraf Marwan and The husband of Nasser's daughter Mona (translated) (0.17), where the first is a name and the latter a description of the same person, BERTScore vanilla discovers more similarity (0.38) and BERTScore trained slightly less than SAS (0.15).

Overall, the error analysis for GermanQuAD is limited to a few cases because it is the smallest dataset of the three. For this reason, as well as because all metrics perform worst on it, we focus our error analysis on NQ-open.

6.3 NQ-open

NQ-open is not only by far the largest of the three datasets but also the most skewed one. We observe that the vast majority of answer-prediction pairs have a label of 0 (see Table 1). Thus, in the majority of cases, the underlying QA model predicted the wrong answer.
Apart from that, all four semantic similarity metrics perform considerably worse on NQ-open than on SQuAD and GermanQuAD, in particular for answer-prediction pairs that have no lexical overlap (F1 = 0), which amount to 95 per cent of all answer-prediction pairs with the label 0, i.e. incorrect predictions with no lexical overlap with the ground-truth answer. This is expected for wrong answers. All four metrics perform only slightly better than METEOR or ROUGE-L, thus adding no value via their semantic approach in the majority of all cases in NQ-open.

We also observe that for answer-prediction pairs which include numbers, e.g. an amount, a date or a year, SAS as well as BERTScore trained differ in many cases significantly from the label indication. By our definition of semantic similarity, the only semantically similar entity to an answer expected to contain a numeric value should be the exact value, not a unit more or less. Also, the position within the pair seems to matter for digits and their string representation: we observe for SAS that the pair of 11 and eleven has a score of 0.09, whereas the pair of eleven and 11 has a score of 0.89. The experiments conducted to improve the performance on numbers can be found in Appendix B.

Apart from numbers, we find that for BERTScore trained as well as the bi-encoder, the overall performance related to names is subpar: for Alexandria Ocasio-Cortez and Kevin McCarthy they indicate a similarity of 0.16 and 0.11 respectively, whereas SAS correctly identifies no semantic similarity (0.01). A potential reason for the identified similarity might be that both names refer to politicians, but as Figure 1 shows, the issue can be observed for more common names, too. We use our new name pairs dataset to fine-tune the bi-encoder model. The results are mentioned below, after addressing more general issues with NQ-open first.

After correcting for encoding errors and fixing the labels (one-way as of now) manually in the NQ-open subset, totalling 70 samples, the correlations have already improved by about a per cent for SAS. Correcting wrong labels in extreme cases, where the SAS score is below 0.25 and the label is 2 or where SAS is above 0.5 and the label is 0, improves results almost across the board for all models, but more so for SAS, as shown in Table 4. Figure 4 depicts the major error categories for when SAS scores range below 0.25 while human annotations indicate a label of 2, and Table 9 defines those categories along with a sample question, ground-truth answer and prediction triple. After removing duplicates, samples with imprecise questions, wrong gold labels or multiple correct answers, we are left with 3559 ground-truth answer/prediction pairs, compared to the 3658 we started with. A sketch of how such disagreement cases can be flagged is given at the end of this section.

[Figure 4: horizontal bar chart of explanation categories for SAS/label disagreements on NQ-open (Spatial, Names, Co-reference, Different levels of precision, Synonyms, Incomplete answer, Medical term, Biological term, Dates, Temporal, Alias, Numeric, Unrelated answer, Acronym, Chemical term, Unrelated), with counts ranging from 0 to about 21.]

Figure 4: Subset of the NQ-open test set where the SAS score is < 0.01 and the human label is 2, manually annotated with explanations of the discrepancies. The original questions and Google search have been used to assess the correctness of the gold labels.

                        NQ-open
               F1 = 0                 F1 ≠ 0
Metrics      r      ρ      τ       r      ρ      τ
ROUGE-L    37.7   23.9   24.5    10.7   10.9   11.4
METEOR     22.7    9.9   10.1    -3.7    3.4    3.6
Bi-Encoder 21.1   13.2   13.5    16.3   15.7   16.0
fBERT      21.2   10.7   10.4    12.1   12.0   12.5
f′BERT     20.4   12.4   12.7    15.7   15.1   15.3
SAS        25.6   17.3   17.7    18.7   17.2   17.4

Table 4: Improvements in correlation figures for the NQ-open subset after re-labelling. All numbers are reported in percentages.

We observe an improvement of 14 (Spearman) to 15 (Kendall) per cent on the newly created NQ-open subset for BERTScore trained, and for the bi-encoder we see an uplift of 19 per cent for both Spearman and Kendall when applying our fine-tuned sentence-transformer model (see Table 3). This refers only to pairs with no lexical overlap. For both the bi-encoder and BERTScore trained, the improvement for F1 ≠ 0 is smaller but still in double digits.

Figure 1 shows a representative example where both BERTScore trained and the bi-encoder have significantly improved towards identifying no semantic similarity where there is none.
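The sketch below shows how the extreme disagreement cases described above can be flagged for manual review with pandas. The column names and the rows of the dataframe are assumptions made for illustration, not the released schema of the subset.

```python
import pandas as pd

# Assumed evaluation frame: one row per ground-truth/prediction pair.
df = pd.DataFrame({
    "ground_truth": ["Amma", "OWN", "scapula"],
    "prediction":   ["Luke", "Oprah Winfrey Network", "shoulder blade"],
    "sas":          [0.0096, 0.004, 0.02],
    "label":        [0, 2, 2],
})

# Extreme disagreements: high SAS with label 0, or low SAS with label 2.
false_positives = df[(df["sas"] > 0.5) & (df["label"] == 0)]
false_negatives = df[(df["sas"] < 0.25) & (df["label"] == 2)]

# These candidates are then re-checked manually (one-way re-labelling).
review_queue = pd.concat([false_positives, false_negatives])
print(review_queue[["ground_truth", "prediction", "sas", "label"]])
```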
7 Conclusion

We have found a few patterns in the mistakes that SAS was making. These include spatial awareness, names, numbers, dates, context awareness, translations, acronyms, scientific terminology, historical events, conversions, and encodings.

Currently, the comparison to annotator labels is performed on ground-truth answers taken from subsets of the SQuAD and GermanQuAD datasets, and only for NQ-open do we have a prediction and answer pair. There are two main reasons why we focus more on NQ-open. Firstly, focusing on the other two would mean weaker evidence on how the metric will perform when applied to model predictions behind a real-world application. Secondly, effectively all semantic similarity metrics failed to have a high correlation with the human labels, more so when there was no lexical overlap. The NQ-open subset had quite a few issues associated with the labels, as well as some other minor issues. Removing duplicates, re-labelling wrong labels one-way and re-calculating all metrics led to significant improvements across the board for the semantic similarity metrics. An element of future research would be improving the performance of all metrics on non-common names in English contexts and on spatial names. The idea of using a KB, such as Freebase or Wikipedia, as explored in Si et al. (2021), could be used to find equivalent answers to named geographical entities as well. In addition, we found that bi-encoders not only outperform cross-encoders on answer-prediction pairs without lexical overlap but are also faster than cross-encoders, which makes them more applicable in real-world scenarios. It is also relatively easy to enhance bi-encoder models, as demonstrated by the improved results after training on the names dataset. This could be essential for companies as well, because models most probably will not understand the relationships between different employees and stakeholders mentioned in internal documents. Future research is needed for answer-prediction pairs without lexical overlap, the main use case of semantic answer similarity. Another dimension of future research, and yet one more reason to have a preference towards BERTScore, is the ability to use BERTScore as a training objective to generate soft predictions, allowing the network to remain differentiable end-to-end. Both SAS and BERTScore trained should be considered as metrics to evaluate the performance of a QA model and should be thoroughly compared in the context of the given dataset.

Acknowledgements

We would like to thank Ardhendu Singh, Julian Risch, Malte Pietsch and the XCS224U course facilitators, Ankit Chadha in particular, as well as Christopher Potts for their constant support.

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623-2631.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1-14, Vancouver, Canada. Association for Computational Linguistics.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788-6796, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 119-124, Hong Kong, China. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. arXiv:1911.02116.

Christophe Croux and Catherine Dehon. 2010. Influence functions of the Spearman and Kendall correlation measures. Statistical Methods & Applications, 19:497-515.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452-466.
Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen-tau Yih. 2021. NeurIPS 2020 EfficientQA competition: Systems, analyses and lessons learned. In Proceedings of the NeurIPS 2020 Competition and Demonstration Track, volume 133 of Proceedings of Machine Learning Research, pages 86-111. PMLR.

Timo Möller, Julian Risch, and Malte Pietsch. 2021. GermanQuAD and GermanDPR: Improving non-English question answering and passage retrieval. arXiv:2104.12741.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Nils Reimers, Philip Beyer, and Iryna Gurevych. 2016. Task-oriented intrinsic evaluation of semantic textual similarity. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 87-96, Osaka, Japan. The COLING 2016 Organizing Committee.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982-3992, Hong Kong, China. Association for Computational Linguistics.

Julian Risch, Timo Möller, Julian Gutsch, and Malte Pietsch. 2021. Semantic answer similarity for evaluating question answering models. arXiv preprint arXiv:2108.06130.

Chenglei Si, Chen Zhao, and Jordan Boyd-Graber. 2021. What's in a name? Answer equivalence for open-domain question answering. arXiv preprint arXiv:2109.05289.

Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2021. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. arXiv:2010.08240v2 [cs.CL].

Xian-Mo Zeng. 2007. Semantic relationships between contextual synonyms. US-China Education Review, 4:33-37.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

A Model configurations

The configuration details of each model used in our evaluations are given in Table 5.

B Numeric errors

Presumably, numbers are difficult to evaluate (for all metrics), including for the underlying QA model behind the predictions, because we observe a high number of label 0 cases where the prediction needed to be a number; however, the labels in NQ-open are not entirely reliable, more so when they are 0. Therefore, we performed two experiments using the NQ-open dataset: in the first we remove all numbers from both ground-truth answers and predictions, and in the second experiment we remove numbers only from ground-truth answers. We further investigated whether numbers and digits are bringing the SAS performance down. We derived a new dataset from NQ-open where any row with a number in the ground-truth is removed and then evaluated the four metrics. The removal of numbers further deteriorated the SAS performance, as is evident in Table 6.

A similar experiment with the SQuAD dataset shows similar behaviour, in that SAS performed poorly compared to the BERTScore trained and Bi-Encoder metrics, but we did not observe a significant drop in performance when rows with numbers in the ground-truth are removed from SQuAD, since numbers are found in only 13% of the SQuAD data compared to 28% of the NQ-open data. We observe a similar distribution of scores in Figure 6.

To investigate further, we created a new numbers dataset consisting of numbers as strings and their respective digit representations (digit/string and string/digit pairs), which were manually labelled as 1. These pairs were complemented by pairs of digits and their consecutive and preceding numbers, labelled as 0. Training the bi-encoder model on this dataset resulted in no change or worse performance, while training the cross-encoder model on the manually annotated dataset led to non-significant improvements. Training the bi-encoder model on the dataset with cross-encoder derived labels led to slightly less poor performance.
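A minimal sketch of the filtering used for the "without numbers" condition, assuming the subset is loaded into a pandas dataframe; the column names and rows are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ground_truth": ["1945", "scapula", "24576 kB/s"],
    "prediction":   ["Old John Feather Merchant", "shoulder blade", "1024 bytes"],
    "label":        [0, 2, 1],
})

# Drop every row whose ground-truth answer contains a digit ("wo num" condition).
has_number = df["ground_truth"].str.contains(r"\d", regex=True)
df_wo_num = df[~has_number]

print(f"kept {len(df_wo_num)} of {len(df)} rows")
```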
Model                                                       hidden size  intermediate size  max position embeddings  model type   attention heads  hidden layers  vocab size  transformers version
deepset/gbert-large-sts                                     1,024        4,096              512                      bert         16               24             31,102      4.9.2
cross-encoder/stsb-roberta-large                            1,024        4,096              514                      roberta      16               24             50,265      -
T-Systems-onsite/cross-en-de-roberta-sentence-transformer   768          3,072              514                      xlm-roberta  12               12             250,002     -
bert-base-uncased                                           768          3,072              512                      bert         12               12             30,522      4.6.0.dev0
deepset/gelectra-base                                       768          3,072              512                      electra      12               12             31,102      -
Augmented cross-en-de-roberta-sentence-transformer (ours)   768          3,072              514                      xlm-roberta  12               12             250,002     4.12.2

Table 5: Configuration details of each of the models used in the evaluations. The architectures for the first two models and our own model follow the corresponding sequence classification heads. The T-Systems-onsite model as well as our trained model follow the XLMRobertaModel architecture, and the other two follow the BertForMaskedLM and ElectraForPreTraining architectures respectively. Most of the models use absolute position embeddings.

                    NQ-open
              F1 = 0            F1 ≠ 0
Metrics     w num  wo num    w num  wo num
fBERT        10.9    13.5      7.1    22.6
Bi-Encoder   13.3    13.1     29.9    25.8
f′BERT       14.4    14.0     29.8    25.9
SAS          11.3     9.7     41.3    35.1

Table 6: Kendall's τ performance on the NQ-open dataset, with and without numbers.

                     SQuAD
              F1 = 0            F1 ≠ 0
Metrics     w num  wo num    w num  wo num
fBERT         8.5     8.3     46.9    49.9
Bi-Encoder   29.2    31.4     56.0    56.8
f′BERT       30.5    32.7     56.3    56.7
SAS          27.6    28.4     60.5    60.8

Table 7: Kendall's τ performance on the SQuAD dataset, with and without numbers.

C Distribution of scores

Figure 5: Distribution of scores across labels for answer pairs in GermanQuAD.

D Hyperparameter tuning

We did an automatic hyperparameter search over 5 trials with Optuna (Akiba et al., 2019). Note that cross-validation is an approximation of Bayesian optimization, so it is not necessary to use it with Optuna. We found the following best hyperparameters: 'Batch': 64, 'Epochs': 2, 'warm': 0.45. A sketch of the search setup is shown after Table 8.

Batch size    {16, 32, 64, 128, 256}
Epochs        {1, 2, 3, 4}
warm          uniform(0.0, 0.5)

Table 8: Experimental setup for the hyperparameter tuning of the cross-encoder augmented BERTScore.
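A minimal sketch of that search space with Optuna; the objective shown here calls a stand-in training function rather than our full fine-tuning loop.

```python
import random
import optuna

def train_and_evaluate(batch_size: int, epochs: int, warm: float) -> float:
    # Stand-in for the real fine-tuning run: returns a dummy validation correlation.
    return random.random()

def objective(trial: optuna.Trial) -> float:
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128, 256])
    epochs = trial.suggest_int("epochs", 1, 4)
    warm = trial.suggest_float("warm", 0.0, 0.5)  # warm-up proportion
    return train_and_evaluate(batch_size, epochs, warm)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)
print(study.best_params)  # the search described above found batch 64, 2 epochs, warm 0.45
```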
[Figure 6: histograms (log-scaled counts) of SAS scores and of BERTScore trained scores, split by the human labels 0, 1 and 2.]

Figure 6: Distribution of SAS and BERTScore trained scores for NQ-open when F1 = 0.

E Error categories

Category | Definition | Question | Gold label | Prediction
Acronym | An abbreviation formed from the initial letters of other words and pronounced as a word | what channel does the haves and have nots come on on directv | OWN | Oprah Winfrey Network
Alias | Indicates an additional name that a person sometimes uses | who is the man in black the dark tower | Randall Flagg | Walter Padick
Co-reference | Requires resolution of a relationship between two distinct words referring to the same entity | who is marconi in we built this city | the father of the radio | Italian inventor Guglielmo Marconi
Different levels of precision | When both answers are correct, but one is more precise | when does the sympathetic nervous system be activated | constantly | fight-or-flight response
Imprecise question | There can be more than one correct answer | b-25 bomber accidentally flew into the empire state building | Old John Feather Merchant | 1945
Medical term | Language used to describe components and processes of the human body | what is the scientific name for the shoulder bone | shoulder blade | scapula
Multiple correct answers | There is no single definite answer | city belonging to mid west of united states | Des Moines | kansas city
Spatial | Requires an understanding of the concept of space, location, or proximity | where was the tv series pie in the sky filmed | Marlow in Buckinghamshire | bray studios
Synonyms | Gold label and prediction are synonymous | what is the purpose of a chip in a debit card | control access to a resource | security
Biological term | Of or relating to biology or life and living processes | where is the ground tissue located in plants | in regions of new growth | cortex
Wrong gold label | The ground-truth label is incorrect | how do you call a person who cannot speak | sign language | mute
Wrong label | The human judgement is incorrect | who wrote the words to the original pledge of allegiance | Captain George Thatcher Balch | Francis Julius Bellamy
Incomplete answer | The gold label answer contains only a subset of the full answer | what are your rights in the first amendment | religion | freedom of the press

Table 9: Category definitions and examples from the annotated NQ-open dataset.
