Corpus-Based Vocabulary Lists for Language Learners for Nine Languages
Adam Kilgarriff, Frieda Charalabopoulou, Maria Gavrilidou, Janne Bondi Johannessen,
Saussan Khalil, Sofie Johansson Kokkinakis, Robert Lew, Serge Sharoff, Ravikiran Vadlapudi,
Elena Volodina
Abstract
We present the Kelly project, and its work on developing word lists, monolingual and bilingual, for language
learning, using corpus methods, for nine languages and thirty-six language pairs. We describe the method in some
detail and discuss the many challenges encountered. We have loaded the data into an online database and made it
accessible for anyone to explore: we present our own first explorations of it.
1. Introduction
Word lists are much-used resources in many disciplines, from language learning to psycholinguistics
to learning-to-read books. A natural way to develop a word list is from a corpus. Yet a corpus-derived
list, on its own, usually has grave shortcomings as a practical resource. In this paper we
explore a substantial effort to generate lists for nine languages, as far as possible in a corpus-driven,
principled way, but with the overriding priority of creating lists which are as useful as possible for
language learners.
The goal of the Kelly project 1 was to develop sets of bilingual language learning word cards
in many different language combinations. For this we needed to know which words to include, and
we wanted them to be the 9,000 most frequent words in each of the nine languages. We then added a
research goal: to do this using as principled a corpus-driven method as possible. The lists needed to
be ordered, so learners could learn the more common words first. Four of the languages were 'more
commonly taught' (Arabic, Chinese, English, Russian), the other five 'less commonly taught'
(Italian, Swedish, Norwegian, Greek, Polish).
The Kelly procedure for preparing the list for each language was as follows:
• Identify the corpus
• Generate a frequency list (the 'monolingual 1' or 'M1' list)
• Clean up, compare it with lists from other corpora, and other wordlists,
make adjustments, to give the 'M2' list
• Translate each item into all the other Kelly languages (the T1 (translation) list)
• Use the 'back translations' to identify items for addition or deletion,
make further adjustments, to give the final, M3 list.
1 EU Lifelong Learning Programme grant 505630. Partners: Stockholm University, Sweden (co-ordinators); Adam Mickiewicz
University, Poland; Cambridge Lexicography and Language Services, UK; Institute for Language and Speech Processing (ILSP),
Greece; Italian National Research Council (CNR); Keewords AB, Sweden; Lexical Computing Ltd., UK; Gothenburg University,
Sweden; University of Leeds, UK; University of Oslo, Norway.
While the process was corpus-based, it was not one in which the corpus was religiously seen as the
authority. Every corpus has peccadilloes, and the corpus to which you have access is rarely the ideal
corpus for the task to hand. So, at various points, we were happy for expert judgment to overrule
corpus frequencies. The paper considers these divergences and what underlies them.
Once the process was complete, the translations were entered into a database which let us ask
questions like "what 'symmetrical pairs' are there, where X is translated as Y, and Y is also
translated as X?" and "What word sets of three or more words (all of different languages) are there where
all words are in symmetric pairs with all others?" The database is available to all to interrogate.2
The structure of the paper is as follows. Section 2 discusses word lists. Section 3 gives details
of the Kelly procedure for preparing lists. Section 4 considers the Kelly database as a resource for
linguistic research, and Section 5 concludes.
2. Word lists
Word frequency lists can be seen from several perspectives.
For computational linguistics or information theory, they are also called unigram lists and can be
seen as a compact representation of a corpus, lacking much of the information in the corpus but
small and easily tractable. Psychologists exploring language production, understanding, and
acquisition are interested in word frequency, as a word’s frequency is related to the speed with
which it is understood or learned, so frequency needs to be allowed for in choosing words to use
in psycholinguistic experiments. Educationalists are interested too, since frequency can guide the
curriculum for learning to read and similar. To these ends, for English, Thorndike and Lorge
prepared The Teacher’s Word Book of 30,000 Words in 1944 by counting words in a corpus,
creating a reference set used for many studies for many years (Thorndike and Lorge, 1944). It
made its way into English Language Teaching via West’s General Service List (West, 1953)
which was a key resource for choosing which words to use in the English Language Teaching
curriculum until the British National Corpus replaced it in the 1990s. More recently, the English
Profile Project 3 has developed the ‘English Vocabulary Profile’, which lists vocabulary for each
CEFR level (Capel 2010).
In language teaching word frequency lists are used for:
• defining a syllabus
• deciding which words are used in:
o learning-to-read books for children
o textbooks for non-native learners
o dictionaries
o language tests for non-native learners.
2.1 The pedagogical perspective: learning vocabulary with lists and cards
Vocabulary learning is an essential part of mastering a second language (L2). As Nation (2001)
says, vocabulary knowledge constitutes an integral part of learners' general proficiency in L2 and
is a prerequisite for successful communication.
The best approach for building good receptive and productive vocabulary skills is not
self-evident, since a number of factors intervene and relevant empirical evidence and support is
still scant. Generally, there are two types of vocabulary learning: intentional, which refers to
activities aiming directly at learning lexical items, and incidental, where learning vocabulary is
considered a by-product of other L2 activities not primarily focusing on the systematic learning
of words (Nation, 2001), e.g. reading. Word lists and word cards fall under the first category.
2 http://kelly.sketchengine.co.uk
3 http://www.englishprofile.org
Since the 1970's the dominant L2 teaching and learning paradigm has been the
communicative approach. Within this framework, intentional vocabulary learning (especially if it
involves rote learning, such as word lists) has been out of fashion and sometimes dismissed. A
substantial body of research, however, indicates that intentional vocabulary learning realized by
using word lists and cards should have its place in the instructional and learning context. In fact, a
number of studies have shown that this type of learning may prove to be more efficient than
contextualized vocabulary learning as favoured by the communicative approach (see for
example Laufer, 2003). There is no doubt that contextualized and incidental vocabulary learning
contributes to successful lexical development and is in line with the communicative approach
principles. However, research on learning from context shows that such learning may require
learners to engage in large amounts of reading and listening, and is thus much slower.
A critical issue regarding vocabulary learning material and activities is retention. The aim
of vocabulary learning activities should be to lead to long-term retention. Here, there are a
number of studies in favour of list and word learning (Schmitt & Schmitt, 1995; Waring, 2004;
Mondria & Mondria-de Vries, 1994; Nation, 2001), as the use of word lists seems to exhibit
good retention (Hulstijn, 2001; Nation, 2001) and faster gains. "There is a very large number of
studies showing the effectiveness of such learning (i.e. using vocabulary cards) in terms of the
amount and speed of learning." (Nation, 1997).
Contextualized vocabulary learning requires exposure to words through reading, listening
and speaking, which, however, should be combined with a systematic study of lexical items,
collocations and similar. In addition, if the L2 learner has limited exposure to L2 outside the
classroom, then word-focused activities should complement vocabulary learning in context
(Hulstijn, 2001; Laufer, 2003; Nation, 2001). List learning can prove an efficient word-focused
activity that can help learners achieve vocabulary mastery.
Another argument in favour of using lists and cards is that the L2 learners may work at
their own pace and that this approach fosters learner autonomy. However, using word cards and
lists as pedagogical tools for vocabulary acquisition requires motivated and disciplined learners,
who should also be able to deploy the right metacognitive strategies for self-monitoring,
planning their own learning etc. "If they (learners) cannot monitor their learning accurately and
plan their review schedule accordingly, they cannot make the most of word cards and may run the
risk of inefficient learning, e.g. over-learning (devoting more time than necessary) of easy items
or under-learning of hard items" (Nakata, 2008:7).
2.2 What Word Lists Are There?
Now that we have made the case for the importance of word lists, the next question is: are there
already lists meeting those needs? How good are they? Might Kelly lists improve on what is
currently available? In this section we review the lists that are in existence for the
less-commonly-taught languages of the project.
2.2.1 Greek
The Center for the Greek Language has exclusive responsibility, assigned by the Greek Government, for
the organization, planning, and administration of examinations for the Certification of Attainment in
Modern Greek. It provides two word lists, for levels A and B. The authors simply say "Indicative
vocabulary for levels A & B" without providing any further information (Efstathiades 2001). The lists are
not corpus-based and the number of lemmas is not specified.
In the curriculum published by the University of Athens for teaching Modern Greek as L2 to
adults, a vocabulary list is included as an appendix. The authors state that the list has been created based
solely on their intuition and teaching experience and that the vocabulary listed (which they call
"representative vocabulary") complies with the communicative needs and the learning goals specified in
the curriculum and relates to particular notions and functions, speech acts and thematic domains. The
number of words is not specified (University of Athens, 1998).
A dictionary for Greek as L2 has recently been released as support material within the framework
of the EU-funded programme MiNERA - O.P. Education II – "Educational Project for Muslim Children
2005-2007"4. The dictionary5 includes 10,000 lemmas, which emerged from a combination of (i)
existing dictionaries of Modern Greek addressed to pupils of primary and secondary education in Greece
(as representative for the definition of the "basic/core vocabulary") and (ii) e-corpora, in which school
textbooks were also included. No further information about the corpus is provided.
We are aware of an attempt to create a lexicon based entirely on e-corpora, statistics and
frequency lists and constructed in thematic domains at the University of Athens but as yet there are no
publications.
2.2.2 Italian
The corpus of the spoken frequency lexicon (Lessico di frequenza dell'italiano parlato, Corpus LIP) is one
of the most important collections of texts of spoken Italian and one of the most widely used in linguistic
research. It was composed in 1990-1992 by a group of linguists led by Tullio De Mauro who used it to
build the first frequency list of spoken Italian (De Mauro et al 1993). Its 469 texts, containing a total of
approx. 490,000 words, were collected in four cities (Milan, Florence, Rome and Naples) and comprise
face-to-face and mediated dialogues and monologues.
The Vocabolario di Base della lingua italiana (VdB) [basic vocabulary of Italian], also by De
Mauro, is a list of terms drawn up primarily using statistical criteria. It represents the portion of the Italian
language used and understood by most of those who speak Italian. The choice of entries was
based on the first 5,000 entries in the Italian Frequency Dictionary (Bortolini et al., 1972),
supplemented with a set of terms determined by other means. The terms in the VdB are classified into
three levels:
• Fundamental Vocabulary: the 1,991 most frequent items
• High-use Vocabulary: the 2,750 next items
• High-availability Vocabulary: 2,337 entries determined in various ways, especially with
common Italian dictionaries. The integration was necessary because the Frequency
dictionary was the result of the examination of written texts.
The VdB was the first work of this kind made in Italy and is now widely used, for example, to monitor and
improve the readability of a text according to scientific criteria.
Two centres for Italian teaching for foreigners are the Università per Stranieri di Perugia and the
Università per Stranieri di Siena. Both were contacted and replied that there are no official lists of words
for assessing students' knowledge of Italian nor for preparing teaching material and that the most used
frequency lists for deriving lexical syllabi are LIP and Vocabolario di base.
Both centres have developed lists of the words most used by learners. These are based on spoken
productions by L2 students of Italian at different levels of competence.
4 http://www.museduc.gr/en/index.php
5 http://www.museduc.gr/docs/gymnasio/Dictionary.pdf
2.2.3 Norwegian
There is no official list of words, but there are word lists in many text books for foreign students.
Accounts of the methods used to create these lists were not found.
Lexin6, with 36,000 entries and lots of pictures, based on a Swedish version (see below), is one
available resource. The illustrations are divided into 33 topic areas with titles such as Family and
relatives, Our bodies outside, The human body inside, Mail and banking and School and education.
2.2.4 Polish
Again, there is no official list, and we were unable to identify any widely-used lists at all.
2.2.5 Swedish
For Swedish there are a number of lists available. The oldest and most famous is Sture Allén's Tiotusen i
topp [Top ten thousand; Allen 1972]. It was produced on the basis of newspaper texts collected around
1965, and has not been updated.
Other leading resources include the following.
Svensk skolordlista [Swedish wordlist for schools], with 35,000 items, the outcome of a
collaboration between the Swedish Academy and the Swedish language board. It is aimed at pupils from
the 5th grade and higher, and contains short explanations in easy Swedish for most items. It is a selection
from the SAOL (Swedish Academy's Word List of Swedish Language, updated regularly, approx 125,000
words) made on the basis of most frequent words in modern newspapers and books, including a number of
colloquial words. No frequency information is provided.
Lexin Svenska ord med uttal och förklaringar 7 [Lexin Swedish words with pronunciation and
explanations] contains 28,500 words and is aimed at immigrants. The vocabulary has been selected using
frequency studies, vocabulary from course books, words specific for social studies, partly manually
selected and partly coming from specific interpreter lists, and colloquial and 'difficult' words from a range of
sources (see Gellerstam 1978). It is regularly updated based on corpus studies. There are no frequencies
or information on the vocabulary appropriateness for different learner levels.
The Base Vocabulary Pool8 (Forsbom 2006) is a frequency based list constituting central
vocabulary derived from the SUC (Stockholm Umeå Corpus). The base vocabulary pool is created on
the assumption that domain- or genre-specific words should not be in the base vocabulary pool. The core
of this list is constituted by stylistically neutral general-purpose words collected from as many domains
and genres as possible. Out of 69,371 entries in the lemma list based on SUC, 8,215 lemmas are included
in the base vocabulary pool.
3. Preparing the Kelly lists
3.1 Identify the corpus
For each language, we needed a corpus. We wanted it to be a corpus of general, everyday language,
and we wanted it to be large, with enough different texts so that it would not be skewed by topics of
particular texts, and so that it would not miss any core vocabulary. Moreover, we wished the corpora
to be, as far as possible, 'comparable': we wanted all the lists to represent the same kind of language,
so that it made sense to make connections between them.
For some languages there was a good choice of corpora available, but not all project
languages were equally well served in terms of corpus resources. Spoken corpora were only available
for a minority of the languages.
6 http://decentius.hit.uib.no/lexin.html
7 http://lexin.nada.kth.se/
8 http://stp.lingfil.uu.se/~evafo/resources/basevocpool
One corpus type that is available, or can be created, for most languages, and which does
provide a large general corpus, is a web corpus, using methods as presented in Sharoff (2006) and
Baroni et al (2010). These papers also show that web corpora can represent the language well – in
some regards, better than a corpus such as the BNC, which has a heavier weighting of fiction,
newspaper, and in general the more formal and less interactive registers. For each of the languages,
we had access to, or created, a web corpus prepared using the methods described by Sharoff and
Baroni et al.
One central question was: what should the list be a list of? The most basic option was word
forms, so invade invading invades and invaded would all be separate items. This was at odds with
usual practice, and not useful for learners (especially for highly inflectional languages like Russian,
Polish and Arabic), so we needed to lemmatise the corpus: to identify, for each word, the lemma.
We also decided that the list items would all be associated with a word class (noun, verb etc) with
brush (noun) and can (noun) treated as distinct items from brush (verb), can (verb) and can (modal).
For this we needed a part-of-speech tagger.
For details of all corpora, and lemmatisers and POS-taggers used to process them, see
Appendix 1.
3.2 Generate a frequency list
The processed corpora were then loaded into corpus tools, such as the Sketch Engine (Kilgarriff et al.
2004) or the University of Leeds installation of the Corpus WorkBench. These tools both support the
preparation of word lists, lemma lists, or, as we wanted here, lists for lemma+word class, all with
frequencies attached. They also allow the user to easily view the underlying data, the 'corpus lines',
for any item in the list, to check for, for example, lemmatisation and POS-tagging errors and other
anomalies.
For each language, we took the 6000 most frequent lemma+word-class pairs, and this was the
M1 list, the input to the next process. (This number is lower than the target 9000 because we
expected the next steps to add many more items than they deleted, as they largely did.)
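As a rough illustration of this step, and not the project's actual tooling (which was the Sketch Engine and Corpus WorkBench), the M1 list can be thought of as a frequency count over lemma+word-class pairs. The sketch below assumes the corpus has already been lemmatised and POS-tagged into (lemma, word class) tuples; the function name and toy data are hypothetical.

```python
from collections import Counter

def top_lemma_pos(tagged_tokens, n=6000):
    """Return the n most frequent (lemma, word_class) pairs with their counts."""
    counts = Counter(tagged_tokens)
    return counts.most_common(n)

# Toy corpus: each token is already lemmatised and POS-tagged,
# so 'brush' (noun) and 'brush' (verb) are counted as distinct items.
tokens = [
    ("brush", "noun"), ("brush", "verb"), ("can", "modal"),
    ("brush", "verb"), ("can", "noun"), ("can", "modal"), ("can", "modal"),
]
m1 = top_lemma_pos(tokens, n=2)
```

With n=6000 and a real corpus, the result would play the role of the M1 list described above.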
3.3 Clean up, compare it with lists from other corpora, and other wordlists
3.3.1 Cleanup
There then followed a series of procedures to 'clean' the list, delete anomalies, correct errors (in
particular word class errors) and to check against other lists for omissions. The process would make
each team aware of the idiosyncrasies of their corpus so that, where possible, these could be
mitigated by the integration of other data. The cleaning process included the following:
• Checking part of speech coding (especially for complex parts of speech such as determiners
and conjunctions where the automatic part of speech tagger may not be accurate).
• Checking surprising inclusions to see whether they were errors. For instance 'top' as an
English verb appeared in the list because of numerous examples of 'back to top' in our
internet-derived corpus. Similarly, various lemmatizing errors were identified, for example
the entry 'ty', which was the wrongly-formed singular of 'tie'.
• Checking surprising verb uses which are more usefully coded as adjectives, e.g. English
'neighbouring' rather than the verb 'neighbour' or Polish 'zróżnicowany' ('various') which
was lemmatized as the verb 'zróżnicować' ('vary').
• Amalgamating variant spellings such as 'organise' and 'organize' so that their frequency is not
distorted by being divided.
• Merging and splitting, as necessary, aspectual variants of verbs and reflexive verbs, often
mislemmatized, such as Polish 'opłacać się' ('be worthwhile') versus 'opłacić' ('cover').
To promote consistency between language teams, a list of word types for inclusion was drawn up at
the outset. This included decisions on abbreviations, proper nouns, dialect words, affixes,
inflections, hyphenated words, trademarks and others. The project guidelines are attached as
Appendix 3.
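One of the cleanup operations above, amalgamating variant spellings, lends itself to a short sketch. The mapping from variant to preferred form is a hypothetical hand-built table; the project's actual cleanup was largely manual.

```python
from collections import Counter

# Hypothetical table mapping a variant spelling to the preferred form.
VARIANTS = {"organize": "organise"}

def merge_variants(freqs, variants=VARIANTS):
    """Sum the counts of variant spellings under one (lemma, word_class) entry."""
    merged = Counter()
    for (lemma, pos), count in freqs.items():
        merged[(variants.get(lemma, lemma), pos)] += count
    return merged

freqs = {("organise", "verb"): 120, ("organize", "verb"): 80, ("tie", "noun"): 40}
merged = merge_variants(freqs)
```

After merging, the combined entry competes for its place in the list with its full frequency rather than two smaller ones.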
3.3.2 Polysemy, multi-word units
Two central issues for creating word lists are polysemy, and multi-word units. The problem with
polysemy is this: if a word has two meanings, for example a linguistic sentence and a prisoner's
sentence, then it is not useful for a learner (or translator) to include the word in a list without
indicating which meaning is intended. An immediate response might be "let's make it a list of word
senses". This strategy has two difficulties, one theoretical and the other practical. The theoretical
one is that there is no agreement about what the word senses for each word of a language are, and is
never likely to be (Kilgarriff 1997). The practical one is that we cannot count word senses: fifty
years of research in automatic Word Sense Disambiguation has not delivered programs which can
automatically say, with a reasonable level of accuracy, which sense a word is being used in.
The problem with multi-word units, like according to, is similar. It certainly makes more
sense for learners and translators to see according to in the list than to see a high frequency for the
word according (or, worse, the verb accord). But according to is a clear case; what about the many
thousands of compounds, phrasal verbs, idioms and other fixed expressions? The first problem, again,
is the theoretical one: what is the list of items we should count? The second is the practical one: how
do we count them, without getting many false positives and distortions where, for example, we do
not know what frequency to give to look because so much of the look data is taken up by look at,
look into, look up, look for, look forward to ...
Different language teams took different strategies on these two issues. Some, including the
one for English, took a hard line: we cannot count word senses or multiword units reliably, so we
shall have a plain list of simple words (in all but the most vivid cases, such as according to, united in
united states). Others, notably the Polish team, took a more translator-friendly position, splitting
clearly polysemous words between meanings and giving meaning indicators for each, and including
multi-word items, estimating frequencies in each case.
3.3.3 Points of Comparison
We quickly realized that everyday items (e.g. mummy, bread and the like) were under-represented or
sometimes missing in the first list, while administrative and technical items (e.g. sector, review) were
over-represented.
For a subset of the languages (English, Norwegian, Italian and Polish) we were fortunate in
having at our disposal spoken corpora (or subcorpora), including records of everyday informal
speech, against which we could run comparisons. For English, for instance, we used the
conversational-speech part of the British National Corpus (BNC-sp). We ran a comparison to
identify all the words which had at least 50 occurrences in BNC-sp, and were either not in the M1 list
or had much higher normalised frequency in BNC-sp than in M1.
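The comparison just described can be sketched as follows. The threshold of 50 occurrences comes from the text; the ratio that counts as "much higher" is not stated, so min_ratio below is an assumed placeholder, and all names and toy figures are hypothetical.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def spoken_candidates(spoken, spoken_size, written, written_size,
                      m1_words, min_count=50, min_ratio=5.0):
    """Words frequent in the spoken corpus but missing from, or under-represented in, M1."""
    out = []
    for word, count in spoken.items():
        if count < min_count:
            continue
        if word not in m1_words:
            out.append(word)
            continue
        sp = per_million(count, spoken_size)
        wr = per_million(written.get(word, 0), written_size)
        if wr == 0 or sp / wr >= min_ratio:  # min_ratio is an assumption
            out.append(word)
    return sorted(out)

spoken = {"mummy": 400, "bread": 300, "sector": 60, "yeah": 20}
written = {"bread": 2000, "sector": 50000}
candidates = spoken_candidates(spoken, 1_000_000, written, 100_000_000,
                               m1_words={"bread", "sector"})
```

In this toy data, mummy is flagged because it is absent from M1 and bread because its spoken frequency is far higher than its web frequency, while sector is not flagged.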
We wanted the final list to be ordered by 'centrality in the language'. In straightforward cases
we could simply use UKWaC frequency for sorting, but it was not clear how words which were
added in would be sorted, or how any other manual interventions would interact with the sorting. We
decided to use a points system, as follows.
The original list was divided into six equal groups and allocated points, with six for the most
frequent group descending to one for the least frequent. BNC-sp words were added on the following
principles. (The variance in points allowed a small amount of judgment as to the overall generality
and usefulness of the word):
• The most frequent 100 words from BNC-sp were given 5 or 6 points
• 100-200: 4 or 5 points
• 200-400: 3 or 4 points
• 400-600: 2 or 3 points
• Points were then deducted: -1 for informal, -2 for taboo or slang, -2 for old fashioned.
• Any words on the UKWaC list that did not occur at all in BNC spoken had one point
deducted.
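The points rules above can be written down compactly. This is a sketch under stated assumptions: rank-band upper bounds are treated as inclusive (the text leaves the boundaries open), the label names are hypothetical, and no lower limit on points is applied because none is given.

```python
def base_points(rank, list_size):
    """Points (6..1) for a word at 1-based frequency rank, with the list split into six equal groups."""
    group = (rank - 1) * 6 // list_size  # 0 = most frequent sixth, 5 = least frequent
    return 6 - group

# Rank bands in the spoken corpus and the range of points allowed for added words.
# Treating each upper bound as inclusive is an assumption.
SPOKEN_BANDS = [(100, (5, 6)), (200, (4, 5)), (400, (3, 4)), (600, (2, 3))]

def spoken_points_range(spoken_rank):
    """Return the (low, high) points allowed for a word added from the spoken corpus."""
    for upper, allowed in SPOKEN_BANDS:
        if spoken_rank <= upper:
            return allowed
    return None  # no rule is given beyond rank 600

# Label names are hypothetical; the deduction sizes follow the text.
DEDUCTIONS = {"informal": 1, "taboo_or_slang": 2, "old_fashioned": 2}

def apply_deductions(points, labels=(), in_spoken=True):
    """Subtract points for usage labels and for absence from the spoken corpus."""
    points -= sum(DEDUCTIONS[label] for label in labels)
    if not in_spoken:
        points -= 1  # word from the web list that never occurs in the spoken corpus
    return points
```

The choice between the two values in each band was left to editorial judgment, which is why the function returns a range rather than a single number.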
We then looked at a keyword comparison between UKWaC and BNC spoken, in which words
were sorted according to the ratio of their frequencies in the two corpora (for the exact method, see
Kilgarriff 2009). For keywords of BNC-sp vs. UKWaC and vice versa, adjustments were made using
a points system, so that words such as sector and review, which originally had 6 points, were
demoted, and words such as bread were promoted.
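A keyword comparison of this kind can be sketched as a ratio of normalised frequencies. The version below adds a constant to both frequencies before dividing, in the spirit of Kilgarriff (2009); the value n=100 and the toy counts are assumptions, not figures from the project.

```python
def keyness(count_a, size_a, count_b, size_b, n=100.0):
    """Ratio of per-million frequencies in corpus A vs corpus B, smoothed by adding n to each."""
    fpm_a = count_a * 1_000_000 / size_a
    fpm_b = count_b * 1_000_000 / size_b
    return (fpm_a + n) / (fpm_b + n)

def keywords(freq_a, size_a, freq_b, size_b, n=100.0):
    """Words of either corpus, sorted from most typical of A to most typical of B."""
    words = set(freq_a) | set(freq_b)
    scored = [(keyness(freq_a.get(w, 0), size_a, freq_b.get(w, 0), size_b, n), w)
              for w in words]
    return [w for _, w in sorted(scored, reverse=True)]

web = {"sector": 50000, "bread": 2000}   # toy counts, 100M-token web corpus
spoken = {"sector": 60, "bread": 300}    # toy counts, 1M-token spoken corpus
ranking = keywords(web, 100_000_000, spoken, 1_000_000)
```

Words at the top of the ranking are candidates for demotion and words at the bottom candidates for promotion, as with sector and bread above.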
For a number of very restricted sets, such as numbers, compass points and days of the week,
points were assigned to ensure consistency. This is because it would be unhelpful to language
learners to see such items at different levels.
Kelly lists include some proper nouns. The inclusion of proper nouns was corpus based, but
it was felt necessary for teams to use some judgment. In particular, teams were asked to privilege
items which did not come from their own geographical area, since these were more likely to be of
universal importance. So, for instance, for the English list, an item such as Mediterranean would be
deemed to be of more importance than Cornwall.
The additional resources (corpora and word lists) used for each language are listed in
Appendix 2.
3.4 Translate each item into all the other Kelly languages
Once each team had prepared its own monolingual lists, these were sent to a team of translators.
Each of the nine lists was translated into each of the eight other languages, in 72 translation tasks
giving 72 translation lists.
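The counts follow from simple combinatorics: nine languages give 9 × 8 = 72 ordered (source, target) translation tasks and 36 unordered language pairs. A minimal check, with the language names taken from the text:

```python
from itertools import combinations, permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Ordered pairs: one translation task per (source, target) direction.
tasks = list(permutations(LANGUAGES, 2))
# Unordered pairs: each language pair counted once.
pairs = list(combinations(LANGUAGES, 2))
```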
Translators were asked to choose the core translation for each word and to make sure that the
translation was equivalent in word class and register. They were encouraged to give single-word
translations, and only one translation, where this was viable, though they should give multi-word
translations and/or multiple translations if this seemed the only sensible thing to do. Each team
prepared instructions to deal with specific aspects of their language: for example, should the
translation include word class (not relevant for Chinese, where word class is a problematic concept)
and should the translated noun's gender and declension class be given, and if so, how.
The work was subcontracted to a translation agency. There were, in some cases, several
iterations, with Kelly project members who knew both languages for a list assessing the quality and
sending it back for re-translation if the quality was not high enough.
The output of this stage was a rich dataset of 72 T1 lists, each of around 6000-7000
translation pairs (and additional information relating to word class, frequency, points, sometimes
sense indicators, translator notes and so forth.)
3.5 Use the ‘back translations’ to identify items for addition or deletion
By 'back translations' for a language, e.g., English, we mean those words used by translators when
translating into English. It seemed likely that some words that were wanted in the list but were not
in the M2 lists, and some high-salience multiword units, would occur frequently as back translations.
We simplified all rows in T1 lists to plain lemma-translation pairs. This involved a number
of iterations to ensure all items which should match, as they were essentially the same word although
they came from either the M3 list or one of eight translators' files, did match. To support the process
we threw away word-class information: word class often did not match across languages. We then
built a database of the resulting pairs.
The database was used to prepare three lists for each language: single-word candidates for
inclusion, multiword candidates for inclusion, and candidates for exclusion/demotion.
• Inclusion. Each team was given a list of items that occurred as translations, but were not in
their own list. These were incorporated according to a points system based on the number of
lists in which they occurred as translations. So, for instance, for English, words such as wolf,
torture, mayor, earthquake, institute were not in the original list, but occurred frequently as
translations, so they were added.
• Demotion/deletion. Conversely, words such as align, arguably, broker and bungalow were in
the original list but did not occur once as translations from other lists. These were therefore
considered for deletion or demotion.
• Multi-word units. Phrasal verbs and other phrases had not been included in the original lists
because of the difficulty of identifying them automatically. It was hoped that these would
emerge as translations of other languages. Items such as take out, of course, for example,
take place were identified in this way.
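The three candidate lists can be derived from the lemma-translation pairs along the following lines. This is a sketch, not the project's database code: the data layout (a dictionary keyed by source and target language) and the use of a space to detect multiword items are assumptions.

```python
from collections import defaultdict

def back_translation_candidates(lang, own_list, t1_lists):
    """t1_lists maps (source_lang, target_lang) to a list of (lemma, translation) pairs.

    Returns (single-word inclusion candidates, multiword inclusion candidates,
    exclusion/demotion candidates). Inclusion candidates map each word to the
    number of source lists in which it occurred as a translation.
    """
    seen_in = defaultdict(set)  # back translation -> source languages that used it
    for (src, tgt), pairs in t1_lists.items():
        if tgt != lang:
            continue
        for _, translation in pairs:
            seen_in[translation].add(src)
    own = set(own_list)
    new = {w: len(srcs) for w, srcs in seen_in.items() if w not in own}
    single = {w: n for w, n in new.items() if " " not in w}
    multi = {w: n for w, n in new.items() if " " in w}
    unused = sorted(own - set(seen_in))
    return single, multi, unused

t1 = {
    ("sv", "en"): [("hund", "dog"), ("varg", "wolf"), ("ske", "take place")],
    ("pl", "en"): [("pies", "dog"), ("wilk", "wolf")],
    ("en", "sv"): [("dog", "hund")],
}
single, multi, unused = back_translation_candidates("en", ["dog", "bungalow"], t1)
```

In the toy data, wolf is proposed for inclusion with support from two lists, take place surfaces as a multiword candidate, and bungalow is flagged because no translator used it.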
There were then numerous further phases of merging lists, merging frequency facts and CEFR levels,
and many extra rounds of editing and checking, and then word cards were created.
4. The Kelly database
The Kelly database is an interesting object. For each of nine languages, for each of around 9000
words,9 it contains translation mappings to one or more words in each of the other eight languages.
With 74,258 lemmas and 423,848 mappings, it is large. We are not aware of any other comparable
resources. While it has many limitations, as will be apparent from its method of construction as
detailed above, it can supply data for many research questions.
9 These are lemmas, as discussed above. As the simpler word 'word' will introduce no ambiguity, we shall use that
throughout this section.
4.1 Symmetric pairs (sympairs)
A basic construct for fathoming the database is the symmetric pair (hereafter sympair). This is a pair
of words, <a, b>, of two different languages A and B, such that a translates to b and b translates to a.
A naïve theory of translation might expect most words to come in symmetric pairs. The actual
numbers of sympairs, for each language pair, are given in Table 1 (top right, above the leading
diagonal).
Note that the definition of symmetric pairs does not exclude a having another translation into
B in addition to b, or b, into A. So a more constrained construct is the one-translation-only (oto)
sympair, where neither a nor b has any other translations into the other's language. We might expect
this constraint to set aside the polysemous words. Numbers for these are in the bottom left triangle of
Table 1 (below the leading diagonal).
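The two definitions translate directly into code. In this sketch each direction of translation is assumed to be a dictionary from a word to the set of its translations in the other language; the example words are illustrative and not drawn from the Kelly data.

```python
def sympairs(a_to_b, b_to_a):
    """Pairs (a, b) such that a translates to b and b translates to a."""
    return {(a, b)
            for a, targets in a_to_b.items()
            for b in targets
            if a in b_to_a.get(b, set())}

def oto_sympairs(a_to_b, b_to_a):
    """Sympairs where neither word has any other translation into the other language."""
    return {(a, b) for a, b in sympairs(a_to_b, b_to_a)
            if len(a_to_b[a]) == 1 and len(b_to_a[b]) == 1}

en_sv = {"music": {"musik"}, "sun": {"sol"}, "sentence": {"mening", "dom"}}
sv_en = {"musik": {"music"}, "sol": {"sun", "sunshine"},
         "mening": {"sentence", "opinion"}, "dom": {"verdict"}}

pairs = sympairs(en_sv, sv_en)
oto = oto_sympairs(en_sv, sv_en)
```

Here sentence/mening is a sympair but not an oto-sympair, since both words have further translations, which is the kind of polysemy the constraint is meant to set aside.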
            English     Polish      Italian     Swedish     Chinese     Arabic      Russian     Greek       Norwegian
English     -           2863 37.9%  2896 42.1%  2983 39.5%  1574 20.8%  822 10.8%   2526 33.4%  2594 34.3%  2298 30.4%
Polish      1147 15.1%  -           2342 34.1%  2423 28.7%  945 12.2%   1189 14%    2614 29.2%  2461 32.5%  2443 28.8%
Italian     1331 19.4%  1198 17.4%  -           2632 38.3%  1015 15.4%  1059 15.4%  2103 30.6%  2164 31.5%  2366 34.4%
Swedish     1308 17.3%  1253 14.8%  1163 17%    -           1109 14.3%  617 7.3%    2270 26.9%  1954 25.8%  3109 36.9%
Chinese     390 5.1%    284 3.6%    236 3.4%    315 4%      -           608 7.9%    979 12.6%   726 9.3%    600 7.7%
Arabic      383 5%      340 3.9%    323 4.6%    247 2.9%    164 2%      -           1451 16.5%  966 12.7%   916 10.4%
Russian     1050 13.9%  1620 19.2%  1142 16.8%  1308 15.5%  376 4.8%    399 4.4%    -           2192 29%    2114 23.6%
Greek       690 9.1%    962 12.7%   1139 16.3%  941 12.5%   206 2.7%    329 4.32%   957 12.7%   -           1377 18.2%
Norwegian   1074 14.2%  1307 15.5%  1148 16.8%  2338 27.7%  217 2.8%    273 3%      1128 12.6%  673 9%      -
List length 7549        8459        6867        8425        7730        8744        8940        7553        8942
Table 1: Sympairs and oto-sympairs by language pair
Note that these numbers are low. In a simple world, sympairs would account for a large share
of the vocabulary. In practice, the fractions range between 42.1% (English-Italian) and 7.3%
(Swedish-Arabic). The percentages, also given in the table, are
computed as the number of sympairs for a language pair divided by the smaller of the two numbers
for the total number of words for the two languages.
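The definitions above can be made concrete with a minimal sketch. The layout of the translation data (a dictionary keyed by directed language pair) and all function names are assumptions made for illustration, not the structure of the Kelly database; the toy entries are invented.

```python
# Hypothetical toy data: translations[(src_lang, tgt_lang)][word] = set of translations.
translations = {
    ("en", "sv"): {"music": {"musik"}, "big": {"stor", "väldig"}, "sun": {"sol"}},
    ("sv", "en"): {"musik": {"music"}, "stor": {"big", "large"}, "sol": {"sun"}},
}

def sympairs(translations, a_lang, b_lang):
    """Pairs (a, b) such that a translates to b and b translates to a."""
    ab = translations.get((a_lang, b_lang), {})
    ba = translations.get((b_lang, a_lang), {})
    return {(a, b) for a, targets in ab.items() for b in targets
            if a in ba.get(b, set())}

def oto_sympairs(translations, a_lang, b_lang):
    """Sympairs where neither word has any other translation into the other language."""
    ab = translations.get((a_lang, b_lang), {})
    ba = translations.get((b_lang, a_lang), {})
    return {(a, b) for a, b in sympairs(translations, a_lang, b_lang)
            if len(ab[a]) == 1 and len(ba[b]) == 1}

def sympair_percentage(n_pairs, list_len_a, list_len_b):
    """Number of pairs divided by the shorter of the two list lengths, as a percentage."""
    return 100.0 * n_pairs / min(list_len_a, list_len_b)

print(sorted(sympairs(translations, "en", "sv")))
print(sorted(oto_sympairs(translations, "en", "sv")))
# Swedish-Norwegian figures from Table 1: 3109 sympairs, list lengths 8425 and 8942.
print(round(sympair_percentage(3109, 8425, 8942), 1))
```

In the toy data, big/stor is a sympair but not an oto-sympair, because each word has a second translation in the other language.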
4.2
Cliques
A further construct of interest is the n-language clique,10 where, for words <a, b, … n> of languages
A, B, … N, all pairs <(a,b), (a,c), … (a,n), (b,c), … (b,n) …> are sympairs. For cliques as for
sympairs, we can have or not have the one-translation-only constraint. Figures are given, with and
without oto, in Table 2.
10
Terminology from graph theory, where a fully-connected subgraph such as this is called a clique.
# languages  3      4      5      6     7    8    9
Clique       55023  35146  16048  4980  975  106  5
Oto-Clique   14211  6413   2204   520   71   4    0
Table 2: Numbers of cliques and oto-cliques, for different numbers of languages
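As an illustration of the clique construct, the following sketch enumerates n-language cliques in a graph whose nodes are (language, word) pairs and whose edges are sympairs. It is a simple exhaustive search written for clarity, not necessarily the procedure used to produce Table 2; the helper names and the toy edges are hypothetical. Because sympairs only link words of different languages, every clique found automatically contains one word per language.

```python
def pair(u, v):
    """An undirected sympair edge between two (language, word) nodes."""
    return frozenset((u, v))

def find_cliques(edges, n):
    """Return all n-node cliques as frozensets of (language, word) nodes."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    results = set()

    def extend(clique, candidates):
        if len(clique) == n:
            results.add(frozenset(clique))
            return
        for node in sorted(candidates):
            # Grow only in sorted order so that each clique is produced once.
            if node > clique[-1]:
                extend(clique + [node], candidates & adj[node])

    for start in sorted(adj):
        extend([start], adj[start])
    return results

# Toy sympairs: 'sun' is fully connected across en, it, sv; Polish links to English only.
edges = {
    pair(("en", "sun"), ("it", "sole")),
    pair(("en", "sun"), ("sv", "sol")),
    pair(("it", "sole"), ("sv", "sol")),
    pair(("en", "sun"), ("pl", "słońce")),
}
print(find_cliques(edges, 3))
```

For oto-cliques the same search can be run on a graph containing only oto-sympair edges.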
We present the five nine-language cliques in Table 3 and the four eight-language oto-cliques in
Table 4. In Appendix 5 we present the 33 seven-language oto-cliques (that do not share more than
three words with either of the tables below), and in Appendix 6, the 49 eight-language cliques (that
do not share more than three words with either of the tables below or the first table in the
appendix).11 (Near-duplicates are a complication: if one language has two words for a concept that is
otherwise largely stable, the outcome may be two cliques sharing most words.)
Arabic    Chinese  English   Greek        Italian     Norwegian  Polish      Russian      Swedish
مستشفى    医院     hospital  νοσοκομείο   ospedale    sykehus    Szpital     больница     Sjukhus
مكتبة     图书馆   Library   βιβλιοθήκη   biblioteca  bibliotek  Biblioteka  библиотека   Bibliotek
موسيقى    音乐     Music     Μουσική      musica      musikk     Muzyka      музыка       Music
شمس       太阳     Sun       Ήλιος        sole        sol        Słońce      солнце       Sol
نظرية     理论     Theory    Θεωρία       teoria      teori      Teoria      теория       Teori
Table 3: The five 9-language cliques in the dataset
Arabic   Chinese  English  Greek      Italian   Norwegian  Polish       Russian    Swedish
-        吉他     Guitar   Κιθάρα     chitarra  gitar      Gitara       гитара     Gitarr
ملكة     -        Queen    βασίλισσα  regina    dronning   Królowa      королева   Drottning
-        三十     Thirty   τριάντα    trenta    tretti     Trzydzieści  тридцать   Trettio
مأساة    -        tragedy  τραγωδία   tragedia  tragedie   Tragedia     трагедия   Tragedy
Table 4: The four 8-language oto-cliques in the dataset
The concepts represented by many-language cliques are of interest, as they are lexicalised in a stable
way across languages; one could even propose the method as a way of seeking out universals. We
take a brief look at the concepts here, with each concept represented by its English word (as this will
indicate the concept for most readers).
The 51 English words featuring in 8- and 9-language cliques are
bank bed bomb book bread bridge chair channel church climate coffee dog eye
father fish forest future government guitar heart horse hospital kitchen knee
level library logic marriage milk music office pocket prison problem psychology
queen revolution sand snow source sun system tea ten theory thirty trade tragedy
university water week
11
All tables order columns alphabetically by the English spelling of the language, and rows by the spelling of the
English word or, if there is no English word, by the word in another Latin-alphabet language, taking the remaining
four Latin-alphabet languages in alphabetical order: Italian, Norwegian, Polish, Swedish.
Word class is not a construct in the database, since <lemma, word class> pairs were reduced to
lemmas to avoid mismatches due to non-matching word class inventories. Nonetheless it is apparent
that these are all nouns, with the possible exceptions of future (also an adjective) and ten, thirty
(depending on whether numbers are seen as a distinct word class from nouns). Two numbers are in the
list but not others.
Institutions are well-represented: we have eight (bank, church, government, hospital, library,
office, prison, university), or nine if we include marriage. The natural world provides six (climate,
forest, sand, snow, sun, water); edibles and drinkables, four (bread, coffee, milk, tea); animals and
body-parts, three each (dog, fish, horse; eye, heart, knee); and people and furniture, two each (queen,
father; bed, chair).
The 211 English words featuring in 7-language cliques but not in 8- or 9-language ones are given
in Appendix 4. In addition to contributing further members to the groupings mentioned above, they
introduce verbs (believe, have, hope, read, sleep, write), adverbs (almost, already), adjectives (big,
blind, central, clinical, green, industrial, mathematical, national, nervous, new, philosophical, single,
theoretical, tragic, typical), nationalities (French, Italian), months (February, July, June, November)
and days of the week (Saturday, Sunday, Thursday; one can't help wondering what happened to
Monday, Tuesday, Wednesday and Friday). (As can be seen, allocation of words to word classes is
problematic, as, for example, hope may be a noun as well as a verb; the analysis here is indicative
only.)
4.3
Non-sympairs
If life were simple, most words, for most language pairs, would be in sympairs. So, one question is,
why are words not in sympairs?
We can distinguish several kinds of non-sympairs. The translation pair <a of language A, b of
language B>, where a is in the source list for A, is a non-sympair if a is not given as a translation of b.
This could be because b is not in the source list for B. We can divide the non-sympair set (NS) for
the directed language pair <A, B> into those where the word in B is in the source list for B, and those
where it is not. We may call them the non-sympair-source (NSS) and non-sympair-non-source
(NSNS) sets.
In the database as a whole there are some hapaxes: words that only appear once in the whole
database, as the translation of one word of one other language only. These will form a subset of the
target words in the non-sympair-non-source set.
A further question we may ask about non-sympairs is: can we get from a to b (or vice versa) via a
third language: is there a word z in a third language Z, such that a translates as z (or vice versa) and z
translates as b (or vice versa). There may be zero routes from a to b, via another language, or there
may be one, or there may be more than one. We shall call them the 0, 1, m sets. This gives the
classification of translation-pairs shown in Fig 1.
directed translation pairs
    sympairs
    non-sympairs
        non-sympair-source (NSS); translation via third word?
            zero: NSS-0
            one: NSS-1
            many: NSS-m
        non-sympair-non-source (NSNS); translation via third word?
            zero: hapaxes, non-hapaxes
            one: NSNS-1
            many: NSNS-m
Fig. 1: Types of translation pairs in the Kelly database
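The classification can be sketched in code as follows. This is an illustrative reading of the definitions in this section, assuming a hypothetical dictionary layout for the translation data and invented toy entries; it does not separate hapaxes from other NSNS-0 items, which would additionally require counting how often each target word occurs in the whole database.

```python
def vocabulary(translations):
    """All words seen per language, whether as source words or as translations."""
    vocab = {}
    for (src, tgt), entries in translations.items():
        for word, targets in entries.items():
            vocab.setdefault(src, set()).add(word)
            vocab.setdefault(tgt, set()).update(targets)
    return vocab

def linked(translations, x_lang, x, y_lang, y):
    """True if x translates as y, or y translates as x."""
    forward = translations.get((x_lang, y_lang), {}).get(x, set())
    backward = translations.get((y_lang, x_lang), {}).get(y, set())
    return y in forward or x in backward

def classify(translations, source_lists, a_lang, a, b_lang, b):
    """Classify the directed translation pair <a, b>."""
    if a in translations.get((b_lang, a_lang), {}).get(b, set()):
        return "sympair"
    kind = "NSS" if b in source_lists[b_lang] else "NSNS"
    # Count words z of any third language that connect a and b.
    routes = sum(
        1
        for z_lang, words in vocabulary(translations).items()
        if z_lang not in (a_lang, b_lang)
        for z in words
        if linked(translations, a_lang, a, z_lang, z)
        and linked(translations, z_lang, z, b_lang, b)
    )
    suffix = "0" if routes == 0 else "1" if routes == 1 else "m"
    return f"{kind}-{suffix}"

# Hypothetical toy data for three languages.
translations = {
    ("en", "sv"): {"big": {"stor"}, "sun": {"sol"}, "cottage": {"stuga"}},
    ("sv", "en"): {"stor": {"large"}, "sol": {"sun"}},
    ("en", "no"): {"big": {"stor"}},
    ("no", "sv"): {"stor": {"stor"}},
}
source_lists = {"en": {"big", "sun", "cottage"}, "sv": {"stor", "sol"}, "no": {"stor"}}

print(classify(translations, source_lists, "en", "big", "sv", "stor"))
```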
We investigated the directed-translation-pairs for Arabic-English, Chinese-Russian, English-Greek,
Greek-English, Norwegian-Swedish, Russian-Chinese, Swedish-English and Swedish-Russian. We
identified how many translation pairs there were in each category, and give the counts in Table 5.
              Ara-Eng  Chi-Rus  Eng-Gre  Gre-Eng  Nor-Swe  Rus-Chi  Swe-Eng  Swe-Rus
NS            4692     3871     5599     5519     2958     5443     3120     3553
NSS           2918     2647     2381     3339     1864     2706     2095     2453
NSS-0         628      1191     701      1135     683      1221     633      801
NSS-1         630      807      527      664      531      749      576      712
NSS-m         1660     649      1153     1540     650      736      886      940
NSNS          373      328      1923     554      81       1155     214      295
Hapax         1401     896      1295     1626     1013     1582     811      805
Other NSNS-0  286      262      594      355      36       303      103      106
NSNS-1        75       60       638      176      28       504      97       149
NSNS-m        12       6        691      23       17       348      14       40
Table 5: Analysis of non-sympairs
Possible reasons why the directed pair was not a sympair, that is, why there was not a translation
<b, a> included:
Frequency: b is not frequent enough to get into the source list for B
    o NSNS cases only
Polysemy: b has more than one meaning and the translation given for it is not a
    o NSS cases only
Bad translation
Translation problem: a doesn't carry across easily; it doesn't typically get a single-word translation in B
    o Typically gets a multiword translation in B
Culture: a denotes a salient concept in the culture of A-speakers but the concept isn't present or isn't so salient for B-speakers
Corpus problem: a is only there because of a skew in the A corpus
General woolliness: translators might have given any of several translations of a (and b if NSS), so it is not so surprising they did not match up
We then took a stratified random sample of 100 translation pairs, containing 15 each of NSS-0,
NSS-1 and NSS-m, 30 hapaxes, 10 each of NSNS-0 and NSNS-1, and 5
NSNS-m. A team member who knew the two languages classified each item in the sample.
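A stratified sample with these quotas could be drawn as in the sketch below. The quotas come from the text; the function name, the fixed seed and the placeholder pools are assumptions for illustration only.

```python
import random

# Quotas per category, as stated in the text (they sum to 100).
QUOTAS = {
    "NSS-0": 15, "NSS-1": 15, "NSS-m": 15,
    "hapax": 30,
    "NSNS-0": 10, "NSNS-1": 10, "NSNS-m": 5,
}

def stratified_sample(items_by_category, quotas, seed=0):
    """Draw the requested number of items from each category without replacement."""
    rng = random.Random(seed)  # fixed seed so that the sample is reproducible
    sample = {}
    for category, k in quotas.items():
        pool = items_by_category[category]
        # Take the whole stratum if it is smaller than its quota.
        sample[category] = rng.sample(pool, min(k, len(pool)))
    return sample

# Placeholder pools standing in for the translation pairs of each category.
pools = {cat: [f"{cat}-item{i}" for i in range(50)] for cat in QUOTAS}
sample = stratified_sample(pools, QUOTAS)
print({cat: len(items) for cat, items in sample.items()})
```

The guard on stratum size matters in practice: Table 5 shows that some strata are small (for example, NSNS-m has only 6 items for Chi-Rus).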
4.4
Analysis by language family
One might expect there to be more sympairs where the languages are more closely related.
We can test the hypothesis because Swedish and Norwegian are both Scandinavian languages, a
branch of the Germanic family, to which English also belongs; Polish and Russian are both Slavic
(see Fig. 2). The percentage of sympairs for these is given in Table 6. (Data here is a subset of the data
in Table 1; we just draw attention to the language families.)
Scandinavian                                         36.9%
Other Germanic (En-Sw, En-No)                        39.5%, 30.4%
Slavic (Ru-Pl)                                       29.2%
Other (where one of the pair is Arabic or Chinese)   Percentages vary, e.g. Ara-Rus (16.5%) and Ara-Swe (7.3%)
Table 6: Sympairs by language family.
Kelly languages
    non-Indo-European: Chinese, Arabic
    Indo-European
        Greek
        Italian
        Germanic
            English
            Scandinavian: Swedish, Norwegian
        Slavic: Russian, Polish
Fig. 2: Genetic relationships between the nine languages in the Kelly project
We have used oto-sympair ratios (Table 1) as a metric of lexical similarity to compute a complete-linkage cluster analysis. The resulting tree is given in Fig. 3. In broad outline, the clustering
corresponds to the genetic relationships between languages, although it is surprising to see Italian
and English cluster so closely. In comparing the two trees we need to bear in mind that the
genetic relationships between languages do not take into account later lexical borrowing, in
particular the extent to which English words have permeated the vocabularies of various
languages.
[Dendrogram. Axis: linkage distance, from 0.65 to 1.00. Leaves, in order: Chinese, Greek, Arabic, Polish, Russian, Swedish, Norwegian, English, Italian.]
Fig. 3: Cluster analysis of Kelly languages based on sympair distance, one-translation-only
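For readers who wish to reproduce this kind of analysis, the following is a minimal pure-Python sketch of complete-linkage clustering. It assumes that the distance between two languages is one minus their oto-sympair ratio; that conversion, the function name and the restriction to six languages are assumptions made here for brevity. The percentages are the oto-sympair figures from Table 1.

```python
# Oto-sympair percentages from Table 1 for six of the nine languages.
pct = {
    ("English", "Polish"): 15.1, ("English", "Italian"): 19.4,
    ("English", "Swedish"): 17.3, ("English", "Russian"): 13.9,
    ("English", "Norwegian"): 14.2, ("Polish", "Italian"): 17.4,
    ("Polish", "Swedish"): 14.8, ("Polish", "Russian"): 19.2,
    ("Polish", "Norwegian"): 15.5, ("Italian", "Swedish"): 17.0,
    ("Italian", "Russian"): 16.8, ("Italian", "Norwegian"): 16.8,
    ("Swedish", "Russian"): 15.5, ("Swedish", "Norwegian"): 27.7,
    ("Russian", "Norwegian"): 12.6,
}
# Assumed conversion from similarity to distance.
dist = {frozenset(k): 1 - v / 100 for k, v in pct.items()}
names = sorted({lang for key in pct for lang in key})

def complete_linkage(names, dist):
    """Agglomerative clustering; returns merges as (cluster1, cluster2, distance)."""
    clusters = [frozenset([n]) for n in names]
    merges = []

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters) if i < j
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

for c1, c2, d in complete_linkage(names, dist):
    print(sorted(c1), "+", sorted(c2), round(d, 3))
```

On this subset the first merges are Swedish with Norwegian, then English with Italian, then Polish with Russian, in line with the leaf order shown in Fig. 3.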
We can also explore three-language cliques. The sets of three languages for which there are most
three-language oto-cliques are
No-Ru-Sw (535), No-Po-Sw (528), En-No-Sw (503), It-No-Sw (485),
Po-Ru-Sw (473), No-Po-Ru (412), It-Po-Ru (404), En-Po-Ru (397).
The top four triples all include the two closest languages, Norwegian and Swedish. They are joined
with, first, their two geographical and cultural neighbours, Russian and Polish, before their cousin in
the language tree, English.
All triples including one of the non-European languages, Arabic and Chinese, scored lower
than all-European triples. The lowest score for an all-European triple was 164, for En-Gr-No,
whereas the highest for a triple including a non-European language was 99, for Chinese-Polish-Russian.
The lowest-scoring triple of all was Arabic-Chinese-Greek, with just 22 three-language oto-cliques.
4.5
Are words and their translations of similar frequencies?
It is not clear whether there is any reason to expect words in a sympair to have similar frequencies.
Of course our frequencies will come from our corpora, so, if food words are commoner in Italian
than Polish, this could be a feature of the corpus (hence uninteresting) or it could be a feature of the
language (hence interesting), and we will not be well equipped for unpicking the two. Yet our
corpora are comparable in their methods of construction and we can at least begin to explore the
question.
First, for all the European languages, for all words in the database, we identified the
frequency in the main source corpus, and normalised to frequency per million. We left out Chinese
and Arabic because the difficulty of segmenting the texts into words (for Chinese) and of
lemmatisation (for Arabic) meant the prospects of comparing like with like across corpora, without
human intervention, were poor. Throughout, we normalised to lower-case.
For each oto-sympair 12 for the (undirected) language pairs English-Greek, English-Russian,
English-Swedish and Russian-Swedish, we calculated the ratio of the higher normalised frequency to
the lower (so the lowest possible value of the ratio, when the normalised frequencies are equal, is
1). In Table 7 we present the numbers of sympairs where this ratio was less than two, between two
and four, four and eight, eight and sixteen, and over sixteen.
Lg pair   # oto-sympairs  Ratio <2  2-4  4-8  8-16  >16
Eng-Gre   688             444       162  48   13    21
Eng-Rus   1044            634       306  64   14    2
Eng-Swe   1308            749       401  126  22    10
Swe-Rus   1292            716       430  119  19    8
Table 7: Ratios of frequencies for oto-sympairs.
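The ratio computation can be illustrated with a short sketch. The bin labels follow Table 7; how values falling exactly on a boundary are assigned is not stated in the text, so the choice below (each bin includes its lower bound) is an assumption, as are the function names and the example frequencies.

```python
from collections import Counter

# Upper bounds and labels of the bins used in Table 7.
BINS = [(2, "<2"), (4, "2-4"), (8, "4-8"), (16, "8-16")]

def per_million(count, corpus_size):
    """Normalise a raw corpus count to a frequency per million words."""
    return count * 1_000_000 / corpus_size

def ratio_bin(freq_a, freq_b):
    """Bin the ratio of the higher normalised frequency to the lower."""
    ratio = max(freq_a, freq_b) / min(freq_a, freq_b)
    for upper, label in BINS:
        if ratio < upper:
            return label
    return ">16"

# Hypothetical normalised frequencies for the two members of four oto-sympairs.
pairs = [(120.0, 95.0), (10.0, 35.0), (3.0, 20.0), (1.0, 100.0)]
print(Counter(ratio_bin(a, b) for a, b in pairs))
```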
Here, if life were simple, most ratios would be low. For these four language pairs, a member of the
group who knows both languages of the pair will shortly be looking at all items with a ratio greater
than four.
5
Summary and outlook
In this paper we have presented the Kelly project, and its work on developing word lists,
monolingual and bilingual, for language learning, using corpus methods, for nine languages and
thirty-six language pairs. We have described the method in some detail and discussed the many
complications encountered. We have loaded the data into an online database and made it
accessible for anyone to explore: we presented our own first explorations of it.
The prospects for Kelly lie in three arenas: commercial, scientific and administrative. The
commercial dimension, under active development by consortium member Keewords AB, is the
creation, sales and marketing of the word cards. The scientific is in a range of directions
including the continued exploration of the database and the evaluation of Kelly lists, against
others, and for their validity in the classroom. The administrative relates to the question: might
Kelly lists become key resources, perhaps official vocabularies, for language teaching for those
Kelly languages where currently-available resources are poor? We shall be making the case for
adoption of Kelly lists (or, in all likelihood, their successors) to the language-teaching institutions
of several Kelly countries.
12
We excluded the few oto-sympairs containing a multiword from the analysis.
References
Allen, S (1972). Tiotusen i topp [Top ten thousand]. Almqvist & Wiksell, Sweden
Bortolini, U., Tagliavini, G. and A. Zampolli, 1972. Lessico di frequenza della lingua italiana
contemporanea. Milano, Garzanti.
Capel, A. (2010). A1-B2 vocabulary: Insights and issues arising from the English Profile Wordlists
project. English Profile Journal 1 (1).
De Mauro, T. , Mancini, M., Vedovelli, M., and M. Voghera . 1993 . Lessico di frequenza dell'italiano
parlato. Milano, EtasLibri.
De Mauro, T. 1997. Guida all'uso delle parole. Roma, Editori Riuniti.
Efstathiadis, S., Antonopoulou, N., Manavi, D. & Vogiatzidou, S. (2001). Certificate of Attainment in
Greek. Salonica: Ministry of Education-Center for the Greek Language.
Forsbom, E. (2006). Deriving a Base Vocabulary Pool from the Stockholm Umeå Corpus.
Gavioli, L. & Aston, G. (2001). Enriching reality: language corpora in language pedagogy. ELT Journal,
55/3, pp. 238-246
Gellerstam M. (1978) Välja sina ord. Reports from Språkdata 9.
Hulstijn, J. (2001) Intentional and incidental second language vocabulary learning: a reappraisal of
elaboration, rehearsal, and automaticity. In: Robinson, P. (ed.) Cognition and second language
instruction. Cambridge: Cambridge University Press, 258–286.
Laufer, B. (2003) Vocabulary acquisition in a second language: do learners really acquire most vocabulary
by reading? some empirical evidence. Canadian Modern Language Review, 59(4), 567–587.
Leech, G, Rayson, P and Wilson, A (2001) Word Frequencies in Written and Spoken English: based on
the British National Corpus. Longman, London.
McCrostie, J. (2007). Investigating the accuracy of teachers' word frequency intuitions. RELC Journal
38(1): 53-66.
Mondria, J.-A. and Mondria-de Vries, S. (1994). Efficiently memorizing words with the help of word
cards and 'hand computer': theory and applications. System, 22(1): 47–57.
Nakata, T. (2008). English vocabulary learning with word lists, word cards and computers; implications
from cognitive psychology for optimal spaced learning. ReCALL, 20(1), 3–20
Nation, I. S. P. (2001) Learning vocabulary in another language. Cambridge: Cambridge
Nation, P. (1997). Vocabulary size, text coverage and word lists. In Schmitt, N. & McCarthy, M. (eds.)
Vocabulary: Description, Acquisition and Pedagogy. Cambridge University Press
Radziszewski, A., A. Kilgarriff and R. Lew (2011). Polish Word Sketches. 5th Language & Technology
Conference, Poznań, 2011.
Schmitt, N. & Schmitt, D. (1995). Vocabulary notebooks: theoretical underpinnings and practical
suggestions. ELT Journal, 49(2): 133–143.
University of Athens (1998). Curriculum for Teaching Modern Greek as a Foreign Language to Adults
(Levels 1 and 2: Introductory and Basic). Athens: University of Athens.
Waring, R. (2004) In defence of learning words in word pairs: but only when doing it the 'right' way!
Available at http://www1.harenet.ne.jp/~waring/vocab/principles/systematic_learning.htm
Retrieved 25/9/2011
Appendix 1: Base corpora
Language   Name                Size (m)  Processing tools                     Note
Arabic     Arabic web corpus   174       AMIRA                                MSA only; mainly wikipedia, newspaper
Chinese    Internet-ZH         277       From Northeastern University, China
English    UKWaC               1,526     TreeTagger
Greek      GkWaC               149       ILSP tools
Italian    ItWaC               1,910     TreeTagger
Norwegian  NoWaC               ?         Oslo-Bergen tagger
Polish     Polish web corpus   128       Takipi
Russian    Russian web corpus  188       TreeTagger
Swedish    SwedishWaC          114       From Gothenburg Univ
Appendix 2: Other resources used (corpora and wordlists)
Language   Other corpora and word lists used
Arabic     -
Chinese    -
English    BNC, BNC-spoken
Greek      Official list from the Center for the Greek Language
Italian    Italian PAROLE corpus: 250,000 words, newspapers and periodical
           Corpus Stammerjohann: 100,000 words spontaneous speech
           Corpus per il Confronto Diacronico LABLITA: 1000,000 words of speech, Florence area
Norwegian  Spoken corpus
Polish     Existing Poznan wordlists
Swedish    -
Appendix 3: Guidelines for inclusion of word types in Kelly lists

Word type: Variants.
Policy: Spelling variants should be amalgamated, so that e.g. organize and organise are counted as one word for frequency calculations. Each language team will have to have a style guide for preferred forms for the list itself. For English, British and US spelling variants such as color/colour will also be amalgamated. Lexical variants, e.g. cash machine/ATM would be treated as separate items.

Word type: Inflected forms.
Policy: These are not shown unless an inflected form has a meaning that is not inherent in the base form, e.g. better in the sense of 'to get better'.
Comments: Although learners may want to look up inflections, esp. irregular ones, for the purposes of frequency they should be treated together with the base form.

Word type: Derivational inflected forms, e.g. quickly, happiness.
Policy: To be treated as words in their own right, i.e. as separate lemgrams.

Word type: Affixes, including productive affixes.
Policy: No, an affix will only appear if it forms a word that is common enough in itself to merit inclusion.

Word type: Abbreviations.
Policy: Yes, including abbreviations that are written only, but only if they meet the normal criteria of what we are including, so not abbreviations for proper nouns and encyclopedic items. The most common abbreviations will probably be forms of address, weights and measures, Latin abbrevs, and the few cases where an abbreviation is the normal way to refer to an item, e.g. DVD.
Comments: NB The inclusion of abbreviations will mean searching on the non-alphabet character [.].

Word type: Multi-word units.
Policy: Yes for the teams who decided to add them at this stage, no for those who didn't.

Word type: Hyphenated compounds.
Policy: Yes, as long as they can be found automatically.

Word type: Phrasal verbs.
Policy: No for English, as they count as multi-words; yes for languages where they have a one word lemma.

Word type: Phrases, idioms, proverbs, quotations.
Policy: No.

Word type: Subject-specific vocabulary.
Policy: Only if it makes it by the normal frequency criteria (it may do, for instance for some computing terms).
Comments: NB When it comes to adding CEF levels, we may need to consider grammar vocabulary as a special case because of its usefulness to language learners.

Word type: Dialect words.
Policy: No.

Word type: Items marked by register, e.g. very formal, slang, offensive
Policy: Normal frequency rules apply: if they come in the top 5,000 then yes.
Comments: NB We agreed that an 'offensive' attribute should be added to the database so that while the frequency lists themselves can be purely frequency based, offensive items can be weeded out if necessary.

Word type: Geographic terms.
Policy: Country name/related adjective/name of people/language: For these, give your own, then any others that appear in your frequency list in the normal way.
Oceans/continents/important areas/mountain ranges: These should be included on a frequency basis, but privilege items which are not from your own area. So for the English list, an item such as 'Mediterranean' would be more important than 'Lake District'. This suggestion is to avoid over-representation of these items: every list is likely to include many from one's own region.
Cities: Your own capital city, plus any really major cities in your country which have a different name in translation. Then any cities from other countries which fulfil the normal frequency criteria and have a different name in your language from the original.
We will not cover individual rivers, mountains, deserts etc.

Word type: Famous places and buildings.
Policy: Only if they have metonymy, e.g. Hollywood. Likely to be very rare.

Word type: Stars, planets, galaxies, etc.
Policy: No.

Word type: Imaginary, biblical or mythological people or place names.
Policy: No.

Word type: Personal names.
Policy: No.

Word type: Famous people and places, and other encyclopedic info such as names of wars, treaties, names of ancient peoples, names of organizations, etc.
Policy: No.

Word type: Adjectives derived from famous people.
Policy: Only if they are in the top 5,000.

Word type: Festivals and ceremonies.
Policy: If they are in the top 5,000.

Word type: Trademarks.
Policy: If they appear in the top 5,000 and are the name of an item, but not company names.

Word type: Beliefs and religions, and associated nouns and adjectives.
Policy: If they are in the top 5,000.

Word type: Currencies.
Policy: Include your own currency and any others in the top 5,000.
Appendix 4: English words that featured in 7-language cliques
afternoon age aggressive air almost already angel apple balcony beer
believe big bird blind blood body bus category catholic central chaos
cheese christian city clinical club comment constitution contact
corruption country court cry culture daughter democracy description
diagnosis dialogue dictionary difficulty digital direction director
discipline distance document dollar door doubt eighty engineer example
experiment family february festival fifteen fifth fifty filter finger five
flag flower four french fresh friend garden glass god green guarantee have
height hero history hope hundred ice industrial industry italian july june
key kilometre knife lake liberal life light literature litre loan long
mathematical mathematics meat mechanism member metal method million
minister minute month mother museum myth national nervous new nightmare
nine ninth nose november page pain park parliament pay period personality
philosophical philosophy planet poem poet police population president
price product production professor quality question radio rain read
religion restaurant revenge river role root salt saturday scandal school
screen sea series seventy shirt simple six sixty sky sleep soldier son
stability strategy sugar sunday surprise sweet sword symbol tail talent
technology temperature temple text theatre theoretical third three
thursday ticket time tobacco tooth tournament tower tradition tragic
travel twelve twenty two typical understanding video virus vote war
weather white window winter woman word wound write year
Appendix 5: 33 7-language-oto-Cliques
苹果
ج
ف د
apple
ηάζο
Mela
jabłko
ГϵϿЂϾЂ
äpple
apple
ηάζο
Mela
eple
jabłko
ГϵϿЂϾЂ
äpple
cheese
νλέ
ost
ser
ЅЏЄ
ost
cheese
νλέ
Formaggio
ost
ser
ЅЏЄ
ost
corruption
δα γολΪ
corruzione
korrupsjon
δα γολΪ
corruzione
korrupsjon
korupcja
Febbraio
februar
luty
fifteen
ίλονΪλ
δομ
εαπΫθ
ϾЂЄЄЇЃЊϼ
Г
ϾЂЄЄЇЃЊϼ
Г
ЈϹ϶ЄϴϿА
Quindici
femten
fifty
π θάθ α
Cinquanta
femti
ЃГІЁϴϸЊϴ
ІА
ЃГІАϸϹЅГІ
horse
Ϊζοΰο
Cavallo
hest
piętnaści
e
pięćdгies
iąt
koń
korru
ption
korru
ption
februa
ri
femto
n
femti
o
häst
corruption
july
δοτζδομ
Luglio
juli
lipiec
ϼВϿА
juli
june
δοτθδομ
Giugno
juni
czerwiec
ϼВЁА
juni
knee
ΰσθα ο
Ginocchio
kne
kolano
ϾЂϿϹЁЂ
lake
ζέηθβ
Lago
jezioro
ЂϻϹЄЂ
litre
ζέ λο
Litro
liter
litr
ϿϼІЄ
million
Milione
million
milion
ЀϼϿϿϼЂЁ
museum
εα οηητ
λδο
ηον έο
Museo
museum
muzeum
museum
ηον έο
Museo
museum
muzeum
ЀЇϻϹϽ
Incubo
mareritt
koszmar
ϾЂЌЀϴЄ
Naso
nese
nos
ЁЂЅ
february
马
رك
ي
湖
ملي
مت ف
nightmare
أف
nose
جي
沙
Ϋπβ
Tasca
lomme
kiesгeń
pocket
Ϋπβ
Tasca
lomme
kiesгeń
ϾϴЄЀϴЁ
ficka
Sabbia
sand
piasek
ЃϹЅЂϾ
sand
Sabato
lørdag
sobota
ЅЇϵϵЂІϴ
lördag
Settanta
sytti
siedemd
гiesiąt
herbata
ЅϹЀАϸϹЅГ
І
sjuttio
sand
seventy
茶
ش
في
ئ
tea
tea
في
ητ β
muse
um
muse
um
mardr
öm
pocket
saturday
ش
δΪζ βμ
liter
Ϊίία ο
ί οηάθ
α
Ϊδ
Ϊδ
Tè
Tè
te
herbata
ficka
te
te
tooth
σθ δ
Dente
tann
гąb
ϻЇϵ
tand
twelve
υ εα
Dodici
tolv
tolv
二十
twenty
έεο δ
Venti
ϸ϶ϹЁϴϸЊϴ
ІА
ϸ϶ϴϸЊϴІА
病毒
virus
病毒
virus
virus
dаanaści
e
dаadгieś
cia
wirus
Virus
virus
wirus
δσμ
tjugo
virus
virus
狼
ζτεομ
Lupo
ulv
wilk
϶ЂϿϾ
狼
ζτεομ
Lupo
ulv
wilk
϶ЂϿϾ
varg
Appendix 6: 49 8-language-cliques
ق ل
银行
bank
λΪπ αα
banca
bank
bank
ϵϴЁϾ
Bank
床
bed
ελ ίΪ δ
letto
seng
łяżko
ϾЄЂ϶ϴІА
Sang
炸弹
bomb
ίσηία
bomba
bomba
ϵЂЀϵϴ
Bomb
bomb
ίσηία
bomba
bombe
bomba
ϵЂЀϵϴ
Bomb
书
book
ίδίζέο
libro
bok
książka
ϾЁϼϷϴ
Bok
面包
bread
οπηέ
brød
chleb
ЉϿϹϵ
Bröd
bread
οπηέ
pane
brød
chleb
ЉϿϹϵ
Bröd
bridge
ΰΫ νλα
ponte
bro
ЀЂЅІ
Bro
chair
εαλΫεζα
sedia
stol
krгesło
ЅІЇϿ
Stol
channel
εαθΪζδ
canale
kanal
kanał
ϾϴЁϴϿ
Kanal
chiesa
kirke
kościяł
clima
klima
klimat
ϾϿϼЀϴІ
Klimat
kaffe
kawa
ϾЂЈϹ
Kaffe
ق ل
ج
面包
桥
椅子
ق
كي
教堂
church
εεζβ έα
Kyrka
م
climate
εζέηα
ق
咖啡
coffee
εα Ϋμ
ق
咖啡
coffee
caffè
kaffe
kawa
ϾЂЈϹ
Kaffe
狗
dog
cane
hund
pies
ЅЂϵϴϾϴ
Hund
Öga
كل
عي
أ
父亲
鱼
eye
ηΪ δ
occhio
øye
oko
ϷϿϴϻ
father
πα Ϋλαμ
padre
far
ojciec
ЂІϹЊ
fish
οΪλδ
pesce
fisk
ryba
ЄЏϵϴ
Fisk
skog
las
ϿϹЅ
Skog
prгвsгłoś
ć
rгąd
ϵЇϸЇЍϹϹ
framtid
regerin
g
Hjärta
غ
森林
forest
Ϊ ομ
م تق ل
未来
future
ηΫζζοθ
futuro
حك م
政府
government
governo
心脏
heart
ενίΫλθβ
β
εαλ δΪ
cuore
hjerte
serce
ЃЄϴ϶ϼІϹϿАЅ
І϶Ђ
ЅϹЄϸЊϹ
مط
厨房
kitchen
εοναέθα
cucina
kjøkken
kuchnia
ϾЇЉЁГ
مط
厨房
kitchen
εοναέθα
cucina
kjøkken
πέπ ο
livello
nivå
م ت
م طق
اج
level
poziom
ϾЇЉЁГ
Kök
ЇЄЂ϶ϹЁА
Nivå
ϿЂϷϼϾϴ
Logic
逻辑
logic
ζοΰδεά
logica
logikk
逻辑
logic
ζοΰδεά
logica
logikk
logika
ϿЂϷϼϾϴ
Logic
婚姻
marriage
ekteskap
milk
ΰΪζα
melk
małżeńst
wo
mleko
ϵЄϴϾ
牛奶
matrimoni
o
latte
ЀЂϿЂϾЂ
äktens
kap
Mjölk
office
ΰλα έο
ufficio
kontor
biuro
ЂЈϼЅ
Kontor
prigione
fengsel
аięгienie
fengsel
аięгienie
ІВЄАЀϴ
problema
problem
problem
ЃЄЂϵϿϹЀϴ
psicologia
psykologi
psycholog
ia
ЃЅϼЉЂϿЂϷϼ
Г
مكت
ج
监狱
prison
νζαεά
ج
监狱
prison
νζαεά
مشكل
problem
心理学
psychology
πλσίζβη
α
ονξοζοΰέ
α
fängels
e
fängels
e
proble
m
psykol
ogi
ث ر
revolution
雪
rewolucja
ЄϹ϶ЂϿВЊϼГ
rivoluzione
revolusjon
rewolucja
ЄϹ϶ЂϿВЊϼГ
neve
snø
śnieg
ЅЁϹϷ
revolut
ion
revolut
ion
Snö
fonte
kilde
źrяdło
ϼЅІЂЋЁϼϾ
Källa
system
τ βηα
sistema
system
system
ЅϼЅІϹЀϴ
System
十
ten
Ϋεα
dieci
ti
dгiesięć
ϸϹЅГІА
Tio
trade
ηπσλδο
commercio
handel
handel
ІЂЄϷЂ϶ϿГ
Handel
universitet
ЇЁϼ϶ϹЄЅϼІ
ϹІ
϶Ђϸϴ
univers
itet
Vatten
大学
university
水
water
周
مي
城市
عش
十
ه تف
revolusjon
系统
أ
مط
πβΰά
革命
تج ر
مء
source
revolution
مص ر
ج مع
snow
παθΪ α
β
παθΪ α
β
ξδσθδ
革命
雨
电话
παθ πδ
άηδο
θ λσ
vann
uniwersyt
et
woda
week
ί οηΪ α
settimana
uke
tвdгień
ЁϹϸϹϿГ
vecka
week
ί οηΪ α
settimana
uke
tвdгień
ЁϹϸϹϿГ
vecka
πσζβ
città
by
miasto
ϷЂЄЂϸ
stad
Ϋεα
dieci
ti
dгiesięć
ϸϹЅГІА
tio
pioggia
regn
deszcz
ϸЂϺϸА
regn
telefono
telefon
telefon
ІϹϿϹЈЂЁ
telefon
ίλοξά
βζΫ πθο