Corpus-based vocabulary lists for language learners for nine languages

We present the KELLY project and its work on developing monolingual and bilingual word lists for language learning, using corpus methods, for nine languages and thirty-six language pairs. We describe the method and discuss the many challenges encountered. We have loaded the data into an online database to make it accessible for anyone to explore and we present our own first explorations of it. The focus of the paper is thus twofold, covering pedagogical and methodological aspects of the lists’ construction, and linguistic aspects of the by-product of the project, the KELLY database.

Adam Kilgarriff, Frieda Charalabopoulou, Maria Gavrilidou, Janne Bondi Johannessen, Saussan Khalil, Sofie Johansson Kokkinakis, Robert Lew, Serge Sharoff, Ravikiran Vadlapudi, Elena Volodina

Abstract

We present the Kelly project and its work on developing word lists, monolingual and bilingual, for language learning, using corpus methods, for nine languages and thirty-six language pairs. We describe the method in some detail and discuss the many challenges encountered. We have loaded the data into an online database and made it accessible for anyone to explore; we present our own first explorations of it.

1. Introduction

Word lists are much-used resources in many disciplines, from language learning to psycholinguistics to learning-to-read books. A natural way to develop a word list is from a corpus. Yet a corpus-derived list, on its own, usually has grave shortcomings as a practical resource. In this paper we explore a substantial effort to generate lists for nine languages, as far as possible in a corpus-driven, principled way, but with the overriding priority of creating lists which are as useful as possible for language learners.

The goal of the Kelly project¹ was to develop sets of bilingual language learning word cards in many different language combinations. For this we needed to know which words to include, and we wanted them to be the 9,000 most frequent words in each of the nine languages. We then added a research goal: to do this using as principled a corpus-driven method as possible. The lists needed to be ordered, so that learners could learn the more common words first. Four of the languages were 'more commonly taught' (Arabic, Chinese, English, Russian), the other five 'less commonly taught' (Italian, Swedish, Norwegian, Greek, Polish).
The Kelly procedure for preparing the list for each language was as follows:

• Identify the corpus
• Generate a frequency list (the 'monolingual 1' or 'M1' list)
• Clean up, compare it with lists from other corpora, and other wordlists, make adjustments, to give the 'M2' list
• Translate each item into all the other Kelly languages (the T1 (translation) list)
• Use the 'back translations' to identify items for addition or deletion, make further adjustments, to give the final, M3 list.

¹ EU Lifelong Learning Programme grant 505630. Partners: Stockholm University, Sweden (co-ordinators); Adam Mickiewicz University, Poland; Cambridge Lexicography and Language Services, UK; Institute for Language and Speech Processing (ILSP), Greece; Italian National Research Council (CNR); Keewords AB, Sweden; Lexical Computing Ltd., UK; Gothenburg University, Sweden; University of Leeds, UK; University of Oslo, Norway.

While the process was corpus-based, it was not one in which the corpus was religiously seen as the authority. Every corpus has peccadilloes, and the corpus to which you have access is rarely the ideal corpus for the task at hand. So, at various points, we were happy for expert judgment to overrule corpus frequencies. The paper considers these divergences and what underlies them.

Once the process was complete, the translations were entered into a database which let us ask questions like "What 'symmetrical pairs' are there, where X is translated as Y, and Y is also translated as X?" and "What word sets of three or more words (all of different languages) are there where all words are in symmetric pairs with all others?" The database is available to all to interrogate.²

The structure of the paper is as follows. Section 2 discusses word lists. Section 3 gives details of the Kelly procedure for preparing lists. Section 4 considers the Kelly database as a resource for linguistic research, and Section 5 concludes.

2. Word lists

Word frequency lists can be seen from several perspectives. For computational linguistics or information theory, they are also called unigram lists and can be seen as a compact representation of a corpus, lacking much of the information in the corpus but small and easily tractable. Psychologists exploring language production, understanding, and acquisition are interested in word frequency, as a word's frequency is related to the speed with which it is understood or learned, so frequency needs to be controlled for when choosing words to use in psycholinguistic experiments. Educationalists are interested too, since frequency can guide the curriculum for learning to read and similar tasks.

To these ends, for English, Thorndike and Lorge prepared The Teacher's Word Book of 30,000 Words in 1944 by counting words in a corpus, creating a reference set used for many studies for many years (Thorndike and Lorge, 1944). It made its way into English Language Teaching via West's General Service List (West, 1953), which was a key resource for choosing which words to use in the English Language Teaching curriculum until the British National Corpus replaced it in the 1990s. More recently, the English Profile Project³ has developed the 'English Vocabulary Profile', which lists vocabulary for each CEFR level (Capel 2010).

In language teaching word frequency lists are used for:

• defining a syllabus
• deciding which words are used in:
  o learning-to-read books for children
  o textbooks for non-native learners
  o dictionaries
  o language tests for non-native learners.

2.1 The pedagogical perspective: learning vocabulary with lists and cards

Vocabulary learning is an essential part of mastering a second language (L2). As Nation (2001) says, vocabulary knowledge constitutes an integral part of learners' general proficiency in L2 and is a prerequisite for successful communication.
The best approach for building good receptive and productive vocabulary skills is not self-evident, since a number of factors intervene and relevant empirical evidence and support is still scant. Generally, there are two types of vocabulary learning: intentional, which refers to activities aimed directly at learning lexical items, and incidental, where learning vocabulary is a by-product of other L2 activities, such as reading, that do not primarily focus on the systematic learning of words (Nation, 2001). Word lists and word cards fall under the first category.

² http://kelly.sketchengine.co.uk
³ http://www.englishprofile.org

Since the 1970s the dominant L2 teaching and learning paradigm has been the communicative approach. Within this framework, intentional vocabulary learning (especially if it involves rote learning, such as word lists) has been out of fashion and sometimes dismissed. A substantial body of research, however, indicates that intentional vocabulary learning using word lists and cards should have its place in the instructional and learning context. In fact, a number of studies have shown that this type of learning may prove to be more efficient than contextualized vocabulary learning as favoured by the communicative approach (see for example Laufer, 2003). There is no doubt that contextualized and incidental vocabulary learning contributes to successful lexical development and is in line with the principles of the communicative approach. However, research on learning from context shows that such learning may require learners to engage in large amounts of reading and listening, and is thus much slower.

A critical issue regarding vocabulary learning material and activities is retention. The aim of vocabulary learning activities should be to lead to long-term retention.
Here, there are a number of studies in favour of list and word learning (Schmitt & Schmitt, 1995; Waring, 2004; Mondria & Mondria-de Vries, 1994; Nation, 2001), as the use of word lists seems to exhibit good retention (Hulstijn, 2001; Nation, 2001) and faster gains. "There is a very large number of studies showing the effectiveness of such learning (i.e. using vocabulary cards) in terms of the amount and speed of learning." (Nation, 1997).

Contextualized vocabulary learning requires exposure to words through reading, listening and speaking, which, however, should be combined with a systematic study of lexical items, collocations and similar. In addition, if the L2 learner has limited exposure to L2 outside the classroom, then word-focused activities should complement vocabulary learning in context (Hulstijn, 2001; Laufer, 2003; Nation, 2001). List learning can prove an efficient word-focused activity that can help learners achieve vocabulary mastery.

Another argument in favour of using lists and cards is that L2 learners may work at their own pace and that this approach fosters learner autonomy. However, using word cards and lists as pedagogical tools for vocabulary acquisition requires motivated and disciplined learners, who should also be able to deploy the right metacognitive strategies for self-monitoring, planning their own learning etc. "If they (learners) cannot monitor their learning accurately and plan their review schedule accordingly, they cannot make the most of word cards and may run the risk of inefficient learning, e.g. over-learning (devoting more time than necessary) of easy items or under-learning of hard items" (Nakata, 2008:7).

2.2 What Word Lists Are There?

Now that we have made the case for the importance of word lists, the next questions are: are there already lists meeting those needs? How good are they? Might Kelly lists improve on what is currently available?
In this section we review the lists that are in existence for the less-commonly-taught languages of the project.

2.2.1 Greek

The Center for the Greek Language has exclusive responsibility, assigned by the Greek Government, for the organization, planning, and administration of examinations for the Certification of Attainment in Modern Greek. It provides two word lists, for levels A and B. The authors simply say "Indicative vocabulary for levels A & B" without providing any further information (Efstathiades 2001). The lists are not corpus-based and the number of lemmas is not specified.

In the curriculum published by the University of Athens for teaching Modern Greek as L2 to adults, a vocabulary list is included as an appendix. The authors state that the list has been created based solely on their intuition and teaching experience and that the vocabulary listed (which they call "representative vocabulary") complies with the communicative needs and the learning goals specified in the curriculum and relates to particular notions and functions, speech acts and thematic domains. The number of words is not specified (University of Athens, 1998).

A dictionary for Greek as L2 has recently been released as support material within the framework of the EU-funded programme MiNERA - O.P. Education II, "Educational Project for Muslim Children 2005-2007"⁴. The dictionary⁵ includes 10,000 lemmas, which were derived by combining (i) existing dictionaries of Modern Greek addressed to pupils of primary and secondary education in Greece (as representative for the definition of the "basic/core vocabulary") and (ii) e-corpora, in which school textbooks were also included. No further information about the corpus is provided.

We are aware of an attempt at the University of Athens to create a lexicon based entirely on e-corpora, statistics and frequency lists and organised in thematic domains, but as yet there are no publications.
2.2.2 Italian

The corpus of the spoken frequency lexicon (Lessico di frequenza dell'italiano parlato, Corpus LIP) is one of the most important collections of texts of spoken Italian and one of the most widely used in linguistic research. It was compiled in 1990-1992 by a group of linguists led by Tullio De Mauro, who used it to build the first frequency list of spoken Italian (De Mauro et al 1993). Its 469 texts, containing a total of approx. 490,000 words, were collected in four cities (Milan, Florence, Rome and Naples) and comprise face-to-face and mediated dialogues and monologues.

The Vocabolario di Base della lingua italiana (VdB) [basic vocabulary of Italian], also by De Mauro, is a list of terms drawn up primarily using statistical criteria. It represents the portion of the Italian language used and understood by most of those who speak Italian. The entries were chosen from the first 5,000 entries in the Italian Frequency Dictionary (Bortolini et al., 1972), supplemented with a set of terms determined by other means. The terms in the VdB are classified into three levels:

• Fundamental Vocabulary: the 1,991 most frequent items
• High-use Vocabulary: the 2,750 next items
• High-availability Vocabulary: 2,337 entries determined in various ways, especially with common Italian dictionaries.

The integration was necessary because the Frequency Dictionary was the result of the examination of written texts. The VdB was the first work of this kind made in Italy and is now widely used, for example to monitor and improve the readability of a text according to scientific criteria.

Two centres for teaching Italian to foreigners are the Università per Stranieri di Perugia and the Università per Stranieri di Siena.
Both were contacted and replied that there are no official lists of words for assessing students' knowledge of Italian nor for preparing teaching material, and that the most used frequency lists for deriving lexical syllabi are LIP and the Vocabolario di base. Both centres have developed lists of the words most used by learners. These are based on spoken productions by L2 students of Italian at different levels of competence.

⁴ http://www.museduc.gr/en/index.php
⁵ http://www.museduc.gr/docs/gymnasio/Dictionary.pdf

2.2.3 Norwegian

There is no official list of words, but there are word lists in many textbooks for foreign students. Accounts of the methods used to create these lists were not found. Lexin⁶, with 36,000 entries and lots of pictures, based on a Swedish version (see below), is one available resource. The illustrations are divided into 33 topic areas with titles such as Family and relatives, Our bodies outside, The human body inside, Mail and banking and School and education.

2.2.4 Polish

Again, there is no official list, and we were unable to identify any widely-used lists at all.

2.2.5 Swedish

For Swedish there are a number of lists available. The oldest and most famous is Sture Allén's Tiotusen i topp [Top ten thousand; Allen 1972]. It was produced on the basis of newspaper texts collected around 1965, and has not been updated. Other leading resources include the following.

Svensk skolordlista [Swedish wordlist for schools], with 35,000 items, is the outcome of a collaboration between the Swedish Academy and the Swedish language board. It is aimed at pupils from the 5th grade and higher, and contains short explanations in easy Swedish for most items. It is a selection from the SAOL (Swedish Academy's Word List of Swedish Language, updated regularly, approx. 125,000 words) made on the basis of the most frequent words in modern newspapers and books, including a number of colloquial words. No frequency information is provided.
Lexin Svenska ord med uttal och förklaringar⁷ [Lexin Swedish words with pronunciation and explanations] contains 28,500 words and is aimed at immigrants. The vocabulary has been selected using frequency studies, vocabulary from course books, words specific to social studies (partly manually selected and partly coming from specific interpreter lists), and colloquial and 'difficult' words from a range of sources (see Gellerstam 1978). It is regularly updated based on corpus studies. There are no frequencies and no information on the appropriateness of the vocabulary for different learner levels.

The Base Vocabulary Pool⁸ (Forsbom 2006) is a frequency-based list of central vocabulary derived from the SUC (Stockholm Umeå Corpus). The base vocabulary pool is created on the assumption that domain- or genre-specific words should not be in the base vocabulary pool. The core of this list consists of stylistically neutral general-purpose words collected from as many domains and genres as possible. Out of 69,371 entries in the lemma list based on SUC, 8,215 lemmas are included in the base vocabulary pool.

3. Preparing the Kelly lists

3.1 Identify the corpus

For each language, we needed a corpus. We wanted it to be a corpus of general, everyday language, and we wanted it to be large, with enough different texts so that it would not be skewed by the topics of particular texts, and so that it would not miss any core vocabulary. Moreover, we wished the corpora to be, as far as possible, 'comparable': we wanted all the lists to represent the same kind of language, so that it made sense to make connections between them.

For some languages there was a good choice of corpora available, but not all project languages were equally well served in terms of corpus resources. Spoken corpora were only available for a minority of the languages.
⁶ http://decentius.hit.uib.no/lexin.html
⁷ http://lexin.nada.kth.se/
⁸ http://stp.lingfil.uu.se/~evafo/resources/basevocpool

One corpus type that is available, or can be created, for most languages, and which does provide a large general corpus, is a web corpus, using methods as presented in Sharoff (2006) and Baroni et al (2010). These papers also show that web corpora can represent the language well: in some regards, better than a corpus such as the BNC, which has a heavier weighting of fiction, newspaper, and in general the more formal and less interactive registers. For each of the languages, we had access to, or created, a web corpus prepared using the methods described by Sharoff and Baroni et al.

One central question was: what should the list be a list of? The most basic option was word forms, so that invade, invading, invades and invaded would all be separate items. This was at odds with usual practice, and not useful for learners (especially for highly inflectional languages like Russian, Polish and Arabic), so we needed to lemmatise the corpus: to identify, for each word, the lemma. We also decided that the list items would all be associated with a word class (noun, verb etc.), with brush (noun) and can (noun) treated as distinct items from brush (verb), can (verb) and can (modal). For this we needed a part-of-speech tagger. For details of all corpora, and the lemmatisers and POS-taggers used to process them, see Appendix 1.

3.2 Generate a frequency list

The processed corpora were then loaded into corpus tools, such as the Sketch Engine (Kilgarriff et al. 2004) or the University of Leeds installation of the Corpus WorkBench. These tools both support the preparation of word lists, lemma lists, or, as we wanted here, lists for lemma+word class, all with frequencies attached. They also allow the user to easily view the underlying data, the 'corpus lines', for any item in the list, to check for lemmatisation and POS-tagging errors and other anomalies.
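The counting step behind a lemma+word-class list can be sketched in a few lines. This is only an illustration, not the procedure implemented in the Sketch Engine or the Corpus WorkBench: the function name `frequency_list` and the toy input are assumptions, and the input is assumed to have been lemmatised and POS-tagged already.

```python
from collections import Counter


def frequency_list(tagged_tokens, size):
    """Count (lemma, word_class) pairs and return the `size` most frequent.

    `tagged_tokens` is assumed to be an iterable of (lemma, word_class)
    tuples produced by some lemmatiser and POS-tagger (not shown here).
    """
    counts = Counter(tagged_tokens)
    return counts.most_common(size)


# Toy input: word forms already reduced to lemma + word class, so that
# can (modal) and can (noun) are counted as separate items.
tokens = [
    ("can", "modal"), ("brush", "noun"), ("can", "modal"),
    ("brush", "verb"), ("invade", "verb"), ("can", "noun"),
    ("brush", "noun"), ("can", "modal"),
]

m1 = frequency_list(tokens, 2)
for (lemma, word_class), freq in m1:
    print(lemma, word_class, freq)
```

Keeping the word class inside the counted key is what separates brush (noun) from brush (verb); counting bare word forms would merge them.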
For each language, we took the 6000 most frequent lemma+word-class pairs, and this was the M1 list, the input to the next process. (This number is lower than the target 9000 because we expected the next steps to add many more items than they deleted, as they largely did.)

3.3 Clean up, compare it with lists from other corpora, and other wordlists

3.3.1 Cleanup

There then followed a series of procedures to 'clean' the list, delete anomalies, correct errors (in particular word class errors) and check against other lists for omissions. The process would make each team aware of the idiosyncrasies of their corpus so that, where possible, these could be mitigated by the integration of other data. The cleaning process included the following:

• Checking part-of-speech coding (especially for complex parts of speech such as determiners and conjunctions, where the automatic part-of-speech tagger may not be accurate).
• Checking surprising inclusions to see whether they were errors. For instance, 'top' as an English verb appeared in the list because of numerous examples of 'back to top' in our internet-derived corpus. Similarly, various lemmatizing errors were identified, for example the entry 'ty', which was the wrongly-formed singular of 'tie'.
• Checking surprising verb uses which are more usefully coded as adjectives, e.g. English 'neighbouring' rather than the verb 'neighbour', or Polish 'zróżnicowany' ('various'), which was lemmatized as the verb 'zróżnicować' ('vary').
• Amalgamating variant spellings such as 'organise' and 'organize' so that their frequency is not distorted by being divided.
• Merging and splitting, as necessary, aspectual variants of verbs and reflexive verbs, often mislemmatized, such as Polish 'opłacać się' ('be worthwhile') versus 'opłacić' ('cover').

To promote consistency between language teams, a list of word types for inclusion was drawn up at the outset.
This included decisions on abbreviations, proper nouns, dialect words, affixes, inflections, hyphenated words, trademarks and others. The project guidelines are attached as Appendix 3.

3.3.2 Polysemy, multi-word units

Two central issues for creating word lists are polysemy and multi-word units. The problem with polysemy is this: if a word has two meanings, for example a linguistic sentence and a prisoner's sentence, then it is not useful for a learner (or translator) to include the word in a list without indicating which meaning is intended. An immediate response might be "let's make it a list of word senses". This strategy has two difficulties, one theoretical and the other practical. The theoretical one is that there is no agreement about what the word senses for each word of a language are, and there is never likely to be (Kilgarriff 1997). The practical one is that we cannot count word senses: fifty years of research in automatic Word Sense Disambiguation has not delivered programs which can automatically say, with a reasonable level of accuracy, which sense a word is being used in.

The problem with multi-word units, like according to, is similar. It certainly makes more sense for learners and translators to see according to in the list than to see a high frequency for the word according (or, worse, the verb accord). But according to is a clear case: what about the many thousands of compounds, phrasal verbs, idioms and other fixed expressions? The first problem, again, is the theoretical one: what is the list of items we should count? The second is the practical one: how do we count them without getting many false positives and distortions where, for example, we do not know what frequency to give to look because so much of the look data is taken up by look at, look into, look up, look for, look forward to ...

Different language teams took different strategies on these two issues.
Some, including the one for English, took a hard line: we cannot count word senses or multiword units reliably, so we shall have a plain list of simple words (in all but the most vivid cases, such as according to, united in united states). Others, notably the Polish team, took a more translator-friendly position, splitting clearly polysemous words between meanings and giving meaning indicators for each, and including multi-word items, estimating frequencies in each case.

3.3.3 Points of Comparison

We quickly realized that everyday items (e.g. mummy, bread and the like) were under-represented or sometimes missing in the first list, while administrative and technical items (e.g. sector, review) were over-represented. For a subset of the languages (English, Norwegian, Italian and Polish) we were fortunate in having at our disposal spoken corpora (or subcorpora), including records of everyday informal speech, against which we could run comparisons.

For English, for instance, we used the conversational-speech part of the British National Corpus (BNC-sp). We ran a comparison to identify all the words which had at least 50 occurrences in BNC-sp, and were either not in the M1 list or had much higher normalised frequency in BNC-sp than M1.

We wanted the final list to be ordered by 'centrality in the language'. In straightforward cases we could simply use UKWaC frequency for sorting, but it was not clear how words which were added in would be sorted, or how any other manual interventions would interact with the sorting. We decided to use a points system, as follows. The original list was divided into six equal groups and allocated points, with six for the most frequent group descending to one for the least frequent. BNC-sp words were added on the following principles.
(The variance in points allowed a small amount of judgment as to the overall generality and usefulness of the word):

• The most frequent 100 words from BNC-sp were given 5 or 6 points
• 100-200: 4 or 5 points
• 200-400: 3 or 4 points
• 400-600: 2 or 3 points

Points were then deducted: -1 for informal, -2 for taboo or slang, -2 for old-fashioned. Any words on the UKWaC list that did not occur at all in BNC-sp had one point deducted.

We then looked at a keyword comparison between UKWaC and BNC-sp, in which words were sorted according to the ratio of their frequencies in the two corpora (for the exact method, see Kilgarriff 2009). For keywords of BNC-sp vs. UKWaC and vice versa, adjustments were made using a points system, so that words such as sector and review, which originally had 6 points, were demoted, and words such as bread were promoted.

For a number of very restricted sets, such as numbers, compass points and days of the week, points were assigned to ensure consistency, because it would be unhelpful to language learners to see such items at different levels.

Kelly lists include some proper nouns. The inclusion of proper nouns was corpus-based, but it was felt necessary for teams to use some judgment. In particular, teams were asked to privilege items which did not come from their own geographical area, since these were more likely to be of universal importance. So, for instance, for the English list, an item such as Mediterranean would be deemed to be of more importance than Cornwall.

The additional resources (corpora and word lists) used for each language are listed in Appendix 2.

3.4 Translate each item into all the other Kelly languages

Once each team had prepared its own monolingual lists, these were sent to a team of translators. Each of the nine lists was translated into each of the eight other languages, in 72 translation tasks giving 72 translation lists.
Translators were asked to choose the core translation for each word and to make sure that the translation was equivalent in word class and register. They were encouraged to give single-word translations, and only one translation, where this was viable, though they should give multi-word translations and/or multiple translations if this seemed the only sensible thing to do. Each team prepared instructions to deal with specific aspects of their language: for example, should the translation include word class (not relevant for Chinese, where word class is a problematic concept), and should the translated noun's gender and declension class be given, and if so, how.

The work was subcontracted to a translation agency. There were, in some cases, several iterations, with Kelly project members who knew both languages of a list assessing the quality and sending it back for re-translation if the quality was not high enough.

The output of this stage was a rich dataset of 72 T1 lists, each of around 6000-7000 translation pairs (with additional information relating to word class, frequency, points, sometimes sense indicators, translator notes and so forth).

3.5 Use the 'back translations' to identify items for addition or deletion

By 'back translations' for a language, e.g. English, we mean those words used by translators when translating into English. It seemed likely that some words that were wanted in the list but were not in the M2 lists, and some high-salience multiword units, would occur frequently as back translations.

We simplified all rows in T1 lists to plain lemma-translation pairs. This involved a number of iterations to ensure that all items which should match did match: items that were essentially the same word, whether they came from the monolingual list or from one of the eight translators' files. To support the process we threw away word-class information, since word class often did not match across languages. We then built a database of the resulting pairs.
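The reduction to plain pairs and the search for frequent back translations can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the row layout, the function names `plain_pairs` and `back_translation_counts`, the lower-casing used for matching, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def plain_pairs(rows):
    """Reduce T1 rows to plain lemma-translation pairs, dropping word class.

    Each row is assumed to be a tuple
    (source_lang, lemma, word_class, target_lang, translation).
    Stripping and lower-casing is an assumed normalisation step.
    """
    pairs = set()
    for src_lang, lemma, _word_class, tgt_lang, translation in rows:
        pairs.add((src_lang, lemma.strip().lower(),
                   tgt_lang, translation.strip().lower()))
    return pairs


def back_translation_counts(pairs, target_lang, own_list):
    """Rank words used as translations into `target_lang` that are missing
    from that language's own list, by the number of source languages."""
    source_langs = defaultdict(set)
    for src_lang, _lemma, tgt_lang, translation in pairs:
        if tgt_lang == target_lang and translation not in own_list:
            source_langs[translation].add(src_lang)
    ranked = [(word, len(langs)) for word, langs in source_langs.items()]
    # Most widely attested first; ties broken alphabetically.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical sample rows, for illustration only.
rows = [
    ("sv", "varg", "noun", "en", "wolf"),
    ("pl", "wilk", "noun", "en", "Wolf "),
    ("it", "lupo", "noun", "en", "wolf"),
    ("it", "sindaco", "noun", "en", "mayor"),
    ("sv", "hund", "noun", "en", "dog"),
]
candidates = back_translation_counts(plain_pairs(rows), "en", own_list={"dog"})
print(candidates)
```

Counting distinct source languages rather than raw rows keeps one translator's repeated choice from inflating a candidate's rank.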
The database was used to prepare three lists for each language: single-word candidates for inclusion, multiword candidates for inclusion, and candidates for exclusion/demotion.

• Inclusion. Each team was given a list of items that occurred as translations, but were not in their own list. These were incorporated according to a points system based on the number of lists in which they occurred as translations. So, for instance, for English, words such as wolf, torture, mayor, earthquake, institute were not in the original list, but occurred frequently as translations, so they were added.
• Demotion/deletion. Conversely, words such as align, arguably, broker and bungalow were in the original list but did not occur once as translations from other lists. These were therefore considered for deletion or demotion.
• Multi-word units. Phrasal verbs and other phrases had not been included in the original lists because of the difficulty of identifying them automatically. It was hoped that these would emerge as translations from other languages. Items such as take out, of course, for example, take place were identified in this way.

There were then numerous further phases of merging lists, merging frequency facts and CEFR levels, and many extra rounds of editing and checking, and then word cards were created.

4. The Kelly database

The Kelly database is an interesting object. For each of nine languages, for each of around 9000 words,⁹ it contains translation mappings to one or more words in each of the other eight languages. With 74,258 lemmas and 423,848 mappings, it is large. We are not aware of any other comparable resources. While it has many limitations, as will be apparent from its method of construction as detailed above, it can supply data for many research questions.

⁹ These are lemmas, as discussed above. As the simpler word 'word' will introduce no ambiguity, we shall use that throughout this section.

4.1 Symmetric pairs (sympairs)
A basic construct for fathoming the database is the symmetric pair (hereafter sympair). This is a pair of words, <a, b>, of two different languages A and B, such that a translates to b and b translates to a. A naïve theory of translation might expect most words to come in symmetric pairs. The actual numbers of sympairs, for each language pair, are given in Table 1 (top right, above the leading diagonal).

Note that the definition of symmetric pairs does not exclude a having another translation into B in addition to b, or b into A. So a more constrained construct is the one-translation-only (oto) sympair, where neither a nor b has any other translations into the other's language. We might expect this constraint to set aside the polysemous words. Numbers for these are in the bottom left triangle of Table 1 (below the leading diagonal).

             English     Polish      Italian     Swedish     Chinese     Arabic      Russian     Greek       Norwegian
English      -           2863 37.9%  2896 42.1%  2983 39.5%  1574 20.8%  822 10.8%   2526 33.4%  2594 34.3%  2298 30.4%
Polish       1147 15.1%  -           2342 34.1%  2423 28.7%  945 12.2%   1189 14%    2614 29.2%  2461 32.5%  2443 28.8%
Italian      1331 19.4%  1198 17.4%  -           2632 38.3%  1015 15.4%  1059 15.4%  2103 30.6%  2164 31.5%  2366 34.4%
Swedish      1308 17.3%  1253 14.8%  1163 17%    -           1109 14.3%  617 7.3%    2270 26.9%  1954 25.8%  3109 36.9%
Chinese      390 5.1%    284 3.6%    236 3.4%    315 4%      -           608 7.9%    979 12.6%   726 9.3%    600 7.7%
Arabic       383 5%      340 3.9%    323 4.6%    247 2.9%    164 2%      -           1451 16.5%  966 12.7%   916 10.4%
Russian      1050 13.9%  1620 19.2%  1142 16.8%  1308 15.5%  376 4.8%    399 4.4%    -           2192 29%    2114 23.6%
Greek        690 9.1%    962 12.7%   1139 16.3%  941 12.5%   206 2.7%    329 4.32%   957 12.7%   -           1377 18.2%
Norwegian    1074 14.2%  1307 15.5%  1148 16.8%  2338 27.7%  217 2.8%    273 3%      1128 12.6%  673 9%      -
List length  7549        8459        6867        8425        7730        8744        8940        7553        8942

Table 1: Sympairs and oto-sympairs by language pair

Note that these numbers are low. In a simple world, sympairs would account for a large share of the vocabulary.
In practice, the fractions range between 36.9% (Swedish-Norwegian) and 7.3% (Swedish-Arabic). The percentages, also given in the table, are computed as the number of sympairs for a language pair divided by the smaller of the two list lengths for the two languages.

4.2 Cliques

A further construct of interest is the n-language clique (footnote 10), where, for words <a, b, ..., n> of languages A, B, ..., N, all pairs <(a,b), (a,c), ..., (a,n), (b,c), ..., (b,n), ...> are sympairs. For cliques, as for sympairs, we can apply or not apply the one-translation-only constraint. Figures are given, with and without oto, in Table 2.

Footnote 10: Terminology from graph theory, where a fully-connected subgraph such as this is called a clique.

# languages   3      4      5      6     7    8    9
Clique        55023  35146  16048  4980  975  106  5
Oto-clique    14211  6413   2204   520   71   4    0

Table 2: Numbers of cliques and oto-cliques, for different numbers of languages

We present the five nine-language cliques in Table 3 and the four eight-language oto-cliques in Table 4. In Appendix 5 we present the 33 seven-language oto-cliques (that do not share more than three words with either of the tables below), and in Appendix 6, the 49 eight-language cliques (that do not share more than three words with either of the tables below or the first table in the appendix) (footnote 11). (Near-duplicates are a complication: if one language has two words for a concept that is otherwise largely stable, the outcome may be two cliques sharing most words.)
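The definitions of sympair, oto-sympair, sympair percentage and clique given in Sections 4.1 and 4.2 can be illustrated with a minimal Python sketch. The data layout (a dictionary keyed by directed language pair), the toy entries and all function names are assumptions made for illustration; they are not the Kelly database schema or the project's own code. Only the figures in the final comment (3109 sympairs, list lengths 8425 and 8942) come from Table 1.

```python
from itertools import combinations

# Hypothetical toy data, not taken from the Kelly database:
# TRANSLATIONS[(source_lang, target_lang)][source_word] -> set of target words.
TRANSLATIONS = {
    ("en", "sv"): {"sun": {"sol"}, "music": {"musik"}, "big": {"stor", "stora"}},
    ("sv", "en"): {"sol": {"sun"}, "musik": {"music"}, "stor": {"big", "large"}},
    ("en", "it"): {"sun": {"sole"}, "music": {"musica"}},
    ("it", "en"): {"sole": {"sun"}, "musica": {"music"}},
    ("sv", "it"): {"sol": {"sole"}, "musik": {"musica", "suono"}},
    ("it", "sv"): {"sole": {"sol"}, "musica": {"musik"}},
}


def sympairs(lang_a, lang_b, oto=False):
    """Pairs (a, b) where a translates to b and b translates to a.

    With oto=True, keep only pairs where neither word has any other
    translation into the other language (one-translation-only).
    """
    forward = TRANSLATIONS.get((lang_a, lang_b), {})
    backward = TRANSLATIONS.get((lang_b, lang_a), {})
    pairs = set()
    for a, targets in forward.items():
        for b in targets:
            back_targets = backward.get(b, set())
            if a not in back_targets:
                continue
            if oto and (len(targets) > 1 or len(back_targets) > 1):
                continue
            pairs.add((a, b))
    return pairs


def sympair_percentage(n_sympairs, list_len_a, list_len_b):
    """Sympair count divided by the smaller of the two list lengths."""
    return 100.0 * n_sympairs / min(list_len_a, list_len_b)


def cliques(langs, oto=False):
    """Word tuples (one word per language) in which every pair is a sympair."""
    pair_sets = {(x, y): sympairs(x, y, oto) for x, y in combinations(langs, 2)}
    partial = set(pair_sets[(langs[0], langs[1])])
    for k in range(2, len(langs)):
        new_lang = langs[k]
        extended = set()
        for clique in partial:
            # Candidates must be sympaired with the first word of the clique...
            for a, cand in pair_sets[(langs[0], new_lang)]:
                if a != clique[0]:
                    continue
                # ...and with every other word already in the clique.
                if all((clique[i], cand) in pair_sets[(langs[i], new_lang)]
                       for i in range(1, k)):
                    extended.add(clique + (cand,))
        partial = extended
    return partial


# Swedish-Norwegian figures from Table 1: 3109 sympairs, lists of 8425 and 8942.
swedish_norwegian = round(sympair_percentage(3109, 8425, 8942), 1)
```

In the toy data, big and stor form a sympair but not an oto-sympair, because each has a second translation. This is the situation the oto constraint is meant to set aside.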
Arabic    Chinese   English    Greek         Italian      Norwegian   Polish       Russian      Swedish
مستشفى    医院      hospital   νοσοκομείο    ospedale     sykehus     Szpital      больница     Sjukhus
مكتبة     图书馆    Library    βιβλιοθήκη    biblioteca   bibliotek   Biblioteka   библиотека   Bibliotek
موسيقى    音乐      Music      Μουσική       musica       musikk      Muzyka       музыка       Music
شمس       太阳      Sun        Ήλιος         sole         sol         Słońce       солнце       Sol
نظرية     理论      Theory     Θεωρία        teoria       teori       Teoria       теория       Teori

Table 3: The five 9-language cliques in the dataset

Arabic    Chinese   English    Greek         Italian      Norwegian   Polish        Russian      Swedish
-         吉他      Guitar     Κιθάρα        chitarra     gitar       Gitara        гитара       Gitarr
ملكة      -         Queen      βασίλισσα     regina       dronning    Królowa       королева     Drottning
-         三十      Thirty     τριάντα       trenta       tretti      Trzydzieści   тридцать     Trettio
مأساة     -         tragedy    τραγωδία      tragedia     tragedie    Tragedia      трагедия     Tragedy

Table 4: The four 8-language oto-cliques in the dataset

The concepts represented by many-language cliques are of interest, as they are lexicalised in a stable way across languages; one could even propose the method as a way of seeking out universals. We take a brief look at the concepts here, with each concept represented by its English word (as this will indicate the concept for most readers). The 51 English words featuring in 8- and 9-language cliques are:

bank bed bomb book bread bridge chair channel church climate coffee dog eye father fish forest future government guitar heart horse hospital kitchen knee level library logic marriage milk music office pocket prison problem psychology queen revolution sand snow source sun system tea ten theory thirty trade tragedy university water week

Footnote 11: All tables order columns alphabetically by the English spelling of the language, and rows by the spelling of the English word or, if there is no English word, by the word in another Latin-alphabet language, taking the remaining four Latin-alphabet languages in alphabetical order: Italian, Norwegian, Polish, Swedish.

Word class is not a construct in the database, since <lemma, word class> pairs were reduced to lemmas to avoid mismatches due to non-matching word class inventories.
Nonetheless it is apparent that these are all nouns, with the possible exceptions of future (also an adjective) and ten, thirty (depending on whether numbers are seen as a distinct word class from nouns). Two numbers are in the list but not others. Institutions are well-represented: we have eight, bank church government hospital library office prison university (or nine if we include marriage). The natural world provides six (climate, forest, sand, snow, sun, water), edibles and drinkables, four (bread, coffee, milk, tea), animals and body-parts, three each (dog, fish, horse; eye, heart, knee), and people and furniture, two each (queen, father; bed, chair).

The 211 English words featuring in 7-language cliques but not in 8- or 9-language ones are given in Appendix 4. In addition to contributing further members to the groupings mentioned above, they introduce verbs (believe have hope read sleep write), adverbs (almost, already), adjectives (big, blind, central, clinical, green, industrial, mathematical, national, nervous, new, philosophical, single, theoretical, tragic, typical), nationalities (French, Italian), months (February, July, June, November) and days of the week (Saturday, Sunday, Thursday; one can't help wondering what happened to Monday, Tuesday, Wednesday and Friday). (As can be seen, allocation of words to word classes is problematic, as, for example, hope may be a noun as well as a verb; the analysis here is indicative only.)

4.3 Non-sympairs

If life were simple, most words, for most language pairs, would be in sympairs. So, one question is, why are words not in sympairs? We can distinguish several kinds of non-sympairs. The translation pair <a of language A, b of language B>, where a is in the source list for A, is a non-sympair if a is not given as a translation of b. This could be because b is not in the source list for B.
We can divide the non-sympair set (NS) for the directed language pair <A, B> into those where the word in B is in the source list for B, and those where it is not. We may call them the non-sympair-source (NSS) and non-sympair-non-source (NSNS) sets. In the database as a whole there are some hapaxes: words that appear only once in the whole database, as the translation of one word of one other language only. These will form a subset of the target words in the non-sympair-non-source set.

A further question we may ask about non-sympairs is: can we get from a to b (or vice versa) via a third language? That is, is there a word z in a third language Z, such that a translates as z (or vice versa) and z translates as b (or vice versa)? There may be zero routes from a to b via another language, or there may be one, or there may be more than one. We shall call them the 0, 1, m sets. This gives the classification of translation pairs shown in Fig. 1.

directed translation pairs
    sympairs
    non-sympairs
        non-sympair-source (NSS)
            translation via third word?
                zero: NSS-0
                one: NSS-1
                many: NSS-m
        non-sympair-non-source (NSNS)
            translation via third word?
                zero: hapaxes; non-hapaxes (NSNS-0)
                one: NSNS-1
                many: NSNS-m

Fig. 1: Types of translation pairs in the Kelly database

We investigated the directed translation pairs for Arabic-English, Chinese-Russian, English-Greek, Greek-English, Norwegian-Swedish, Russian-Chinese, Swedish-English and Swedish-Russian. We identified how many translation pairs there were in each category, and give the counts in Table 5.
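The classification in Fig. 1 can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the project's code: the edge representation (source language, source word, target language, target word), the toy entries and the function names are all assumptions. A route via a third language is counted whenever some word z of a language other than A and B is linked, in either direction, to both a and b.

```python
from collections import defaultdict

# Hypothetical toy database: (source_lang, source_word, target_lang, target_word).
EDGES = {
    ("en", "dog", "sv", "hund"),
    ("sv", "hund", "en", "dog"),
    ("en", "big", "sv", "stor"),
    ("sv", "stor", "en", "large"),
    ("en", "big", "it", "grande"),
    ("it", "grande", "sv", "stor"),
    ("en", "wolf", "sv", "varg"),
    ("it", "lupo", "sv", "varg"),
    ("it", "lupo", "en", "wolf"),
}


def build_index(edges):
    """Collect source-list words, undirected links and target occurrence counts."""
    sources = {(sl, sw) for sl, sw, _, _ in edges}
    neighbours = defaultdict(set)
    target_count = defaultdict(int)
    for sl, sw, tl, tw in edges:
        neighbours[(sl, sw)].add((tl, tw))
        neighbours[(tl, tw)].add((sl, sw))
        target_count[(tl, tw)] += 1
    return sources, neighbours, target_count


def classify(edge, edges):
    """Return 'sympair', 'hapax', or one of NSS-0/1/m and NSNS-0/1/m."""
    sl, sw, tl, tw = edge
    if (tl, tw, sl, sw) in edges:
        return "sympair"
    sources, neighbours, target_count = build_index(edges)
    a, b = (sl, sw), (tl, tw)
    # Hapax: b is in no source list and occurs exactly once as a translation.
    if b not in sources and target_count[b] == 1:
        return "hapax"
    # Routes: third-language words linked (in either direction) to both a and b.
    routes = sum(1 for z in neighbours[a] & neighbours[b] if z[0] not in (sl, tl))
    suffix = "0" if routes == 0 else "1" if routes == 1 else "m"
    return ("NSS-" if b in sources else "NSNS-") + suffix
```

For example, in the toy data the pair big/stor is classed as NSS-1: stor is a source word that is not translated back as big, and there is exactly one route between them, via the Italian grande.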
                                               NSNS
          NS     NSS    NSS-0  NSS-1  NSS-m    Hapax  Other  NSNS-0  NSNS-1  NSNS-m
Ara-Eng   4692   2918   628    630    1660     1401   373    286     75      12
Chi-Rus   3871   2647   1191   807    649      896    328    262     60      6
Eng-Gre   5599   2381   701    527    1153     1295   1923   594     638     691
Gre-Eng   5519   3339   1135   664    1540     1626   554    355     176     23
Nor-Swe   2958   1864   683    531    650      1013   81     36      28      17
Rus-Chi   5443   2706   1221   749    736      1582   1155   303     504     348
Swe-Eng   3120   2095   633    576    886      811    214    103     97      14
Swe-Rus   3553   2453   801    712    940      805    295    106     149     40

Table 5: Analysis of non-sympairs

Possible reasons why the directed pair was not a sympair, that is, why there was not a translation <b, a> included:

- Frequency: b is not frequent enough to get into the source list for B (NSNS cases only).
- Polysemy: b has more than one meaning and the translation given for it is not a (NSS cases only).
- Bad translation.
- Translation problem: a doesn't carry across easily; it doesn't typically get a single-word translation in B (it typically gets a multiword translation in B).
- Culture: a denotes a salient concept in the culture of A-speakers but the concept isn't present or isn't so salient for B-speakers.
- Corpus problem: a is only there because of a skew in the A corpus.
- General wooliness: translators might have given any of several translations of a (and of b, if NSS), so it is not so surprising they did not match up.

We then took a stratified random sample of 100 translation pairs, containing 15 each of NSS-0, NSS-1 and NSS-m, 30 hapaxes, 10 each of NSNS-0 and NSNS-1, and 5 NSNS-m. A team member who knew the two languages classified each item in the sample.

4.4 Analysis by language family

One might expect there to be more sympairs where the languages are more closely related. We can test the hypothesis because Swedish and Norwegian are both Scandinavian languages, a branch of the Germanic family, to which English also belongs, and Polish and Russian are both Slavic (see Fig. 2). The percentage of sympairs for these is given in Table 6.
(The data here are a subset of the data in Table 1; we simply draw attention to the language families.)

Language family                                       Sympairs
Scandinavian                                          36.9%
Other Germanic (En-Sw, En-No)                         39.5%, 30.4%
Slavic (Ru-Pl)                                        29.2%
Other (where one of the pair is Arabic or Chinese)    Percentages vary, e.g. Ara-Rus (16.5%) and Ara-Swe (7.3%)

Table 6: Sympairs by language family.

Kelly languages
    non-Indo-European: Chinese, Arabic
    Indo-European: Greek, Italian
        Germanic: English
            Scandinavian: Swedish, Norwegian
        Slavic: Russian, Polish

Fig. 2: Genetic relationships between the nine languages in the Kelly project

We have used oto-sympair ratios (Table 1) as a metric of lexical similarity to compute a complete-linkage cluster analysis. The resulting tree is given in Fig. 3. In broad outline, the clustering corresponds to the genetic relationships between languages, although it is surprising to see Italian and English cluster so closely. In comparing the two trees we need to bear in mind that the genetic relationships between languages do not take into account later lexical borrowing, in particular the extent to which English words have permeated the vocabularies of various languages.

[Figure: dendrogram; vertical axis: linkage distance, from 0.65 to 1.00; leaves, left to right: Chinese, Greek, Arabic, Polish, Russian, Swedish, Norwegian, English, Italian]

Fig. 3: Cluster analysis of Kelly languages based on sympair distance, one-translation-only

We can also explore three-language cliques. The sets of three languages for which there are most three-language oto-cliques are No-Ru-Sw (535), No-Po-Sw (528), En-No-Sw (503), It-No-Sw (485), Po-Ru-Sw (473), No-Po-Ru (412), It-Po-Ru (404) and En-Po-Ru (397). The top four triples all include the two closest languages, Norwegian and Swedish. They are joined with, first, their two geographical and cultural neighbours, Russian and Polish, before their cousin in the language tree, English. All triples including one of the non-European languages, Arabic and Chinese, scored lower than all-European triples.
The lowest score for an all-European triple was 164, for En-Gr-No, whereas the highest for a triple including a non-European language was 99, for Chinese-Polish-Russian. The lowest-scoring triple of all was Arabic-Chinese-Greek, with just 22 three-language oto-cliques.

4.5 Are words and their translations of similar frequencies?

It is not clear whether there is any reason to expect words in a sympair to have similar frequencies. Of course our frequencies will come from our corpora, so, if food words are commoner in Italian than Polish, this could be a feature of the corpus (hence uninteresting) or it could be a feature of the language (hence interesting), and we will not be well equipped for unpicking the two. Yet our corpora are comparable in their methods of construction and we can at least begin to explore the question.

First, for all the European languages, for all words in the database, we identified the frequency in the main source corpus, and normalised to frequency per million. We left out Chinese and Arabic because the difficulty of segmenting the texts into words (for Chinese) and of lemmatisation (for Arabic) meant that the prospects of comparing like with like across corpora, without human intervention, were low. Throughout, we normalised to lower case. For each oto-sympair (footnote 12) for the (undirected) language pairs English-Greek, English-Russian, English-Swedish and Russian-Swedish, we calculated the ratio of the higher normalised frequency to the lower (so the lowest possible value of the ratio, when the normalised frequencies are equal, is 1). In Table 7 we present the numbers of sympairs where this ratio was less than two, between two and four, four and eight, eight and sixteen, and over sixteen.

                            Ratio
Lg pair   # oto-sympairs    <2     2-4    4-8    8-16   >16
Eng-Gre   688               444    162    48     13     21
Eng-Rus   1044              634    306    64     14     2
Eng-Swe   1308              749    401    126    22     10
Swe-Rus   1292              716    430    119    19     8

Table 7: Ratios of frequencies for oto-sympairs.

Here, if life were simple, most ratios would be low.
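The normalisation, ratio and binning behind Table 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are invented here, frequencies are assumed to be positive, and the treatment of values falling exactly on a bin boundary (2 and 4 and 8 go to the higher bin, 16 to the 8-16 bin) is a choice made for the sketch, since the text does not specify it.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Normalise a raw corpus count to a frequency per million words."""
    return count * 1_000_000 / corpus_size


def frequency_ratio(freq_a, freq_b):
    """Ratio of the higher normalised frequency to the lower (at least 1)."""
    return max(freq_a, freq_b) / min(freq_a, freq_b)


def ratio_bin(ratio):
    """Assign a ratio to one of the five bins used in Table 7.

    Boundary handling is an assumption of this sketch.
    """
    if ratio < 2:
        return "<2"
    if ratio < 4:
        return "2-4"
    if ratio < 8:
        return "4-8"
    if ratio <= 16:
        return "8-16"
    return ">16"


def bin_counts(frequency_pairs):
    """Count oto-sympairs per bin, given (freq_a, freq_b) per-million pairs."""
    return Counter(ratio_bin(frequency_ratio(fa, fb)) for fa, fb in frequency_pairs)
```

Calling bin_counts on the list of per-million frequency pairs for one language pair yields one row of a table shaped like Table 7.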
For these four language pairs, a member of the group who knows both languages of the pair will shortly be looking at all items with a ratio greater than four.

Footnote 12: We excluded the few oto-sympairs containing a multiword from the analysis.

5 Summary and outlook

In this paper we have presented the Kelly project, and its work on developing word lists, monolingual and bilingual, for language learning, using corpus methods, for nine languages and thirty-six language pairs. We have described the method in some detail and discussed the many complications encountered. We have loaded the data into an online database and made it accessible for anyone to explore, and we have presented our own first explorations of it.

The prospects for Kelly lie in three arenas: commercial, scientific and administrative. The commercial dimension, under active development by consortium member Keewords AB, is the creation, sales and marketing of the word cards. The scientific dimension covers a range of directions, including the continued exploration of the database and the evaluation of Kelly lists, against others and for their validity in the classroom. The administrative dimension relates to the question: might Kelly lists become key resources, perhaps official vocabularies, for language teaching for those Kelly languages where currently-available resources are poor? We shall be making the case for adoption of Kelly lists (or, in all likelihood, their successors) to the language-teaching institutions of several Kelly countries.

References

Allen, S. (1972). Tiotusen i topp [Top ten thousand]. Almqvist & Wiksell, Sweden.
Bortolini, U., Tagliavini, G. and A. Zampolli (1972). Lessico di frequenza della lingua italiana contemporanea. Milano, Garzanti.
Capel, A. (2010). A1-B2 vocabulary: Insights and issues arising from the English Profile Wordlists project. English Profile Journal 1 (1).
De Mauro, T., Mancini, M., Vedovelli, M., and M. Voghera (1993). Lessico di frequenza dell'italiano parlato. Milano, EtasLibri.
De Mauro, T.
1997. Guida all'uso delle parole. Roma, Editori Riuniti.
Efstathiadis, S., Antonopoulou, N., Manavi, D. & Vogiatzidou, S. (2001). Certificate of Attainment in Greek. Salonica: Ministry of Education-Center for the Greek Language.
Forsbom, E. (2006). Deriving a Base Vocabulary Pool from the Stockholm Umeå Corpus.
Gavioli, L. & Aston, G. (2001). Enriching reality: language corpora in language pedagogy. ELT Journal, 55/3, pp. 238-246.
Gellerstam, M. (1978). Välja sina ord. Reports from Språkdata 9.
Hulstijn, J. (2001). Intentional and incidental second language vocabulary learning: a reappraisal of elaboration, rehearsal, and automaticity. In: Robinson, P. (ed.) Cognition and second language instruction. Cambridge: Cambridge University Press, 258–286.
Laufer, B. (2003). Vocabulary acquisition in a second language: do learners really acquire most vocabulary by reading? Some empirical evidence. Canadian Modern Language Review, 59(4), 567–587.
Leech, G., Rayson, P. and Wilson, A. (2001). Word Frequencies in Written and Spoken English: based on the British National Corpus. Longman, London.
McCrostie, J. (2007). Investigating the accuracy of teachers' word frequency intuitions. RELC Journal 38(1): 53-66.
Mondria, J.-A. and Mondria-de Vries, S. (1994). Efficiently memorizing words with the help of word cards and 'hand computer': theory and applications. System, 22(1): 47–57.
Nakata, T. (2008). English vocabulary learning with word lists, word cards and computers: implications from cognitive psychology for optimal spaced learning. ReCALL, 20(1), 3–20.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge
Nation, P. (1997). Vocabulary size, text coverage and word lists. In Schmitt, N. & McCarthy, M. (eds.) Vocabulary: Description, Acquisition and Pedagogy. Cambridge University Press.
Radziszewski, A., A. Kilgarriff and R. Lew (2011). Polish Word Sketches. 5th Language & Technology Conference, Poznań, 2011.
Schmitt, N. & Schmitt, D. (1995).
Vocabulary notebooks: theoretical underpinnings and practical suggestions. ELT Journal, 49(2): 133–143.
University of Athens (1998). Curriculum for Teaching Modern Greek as a Foreign Language to Adults (Levels 1 and 2: Introductory and Basic). Athens: University of Athens.
Waring, R. (2004). In defence of learning words in word pairs: but only when doing it the 'right' way! Available at http://www1.harenet.ne.jp/~waring/vocab/principles/systematic_learning.htm Retrieved 25/9/2011.

Appendix 1: Base corpora

Language    Name                 Size (m)   Processing tools      Note
Arabic      Arabic web corpus    174        AMIRA                 MSA only; mainly wikipedia, newspaper
Chinese     Internet-ZH          277        From Northeastern University, China
English     UKWaC                1,526      TreeTagger
Greek       GkWaC                149        ILSP tools
Italian     ItWaC                1,910      TreeTagger
Norwegian   NoWaC                ?Janne     Oslo-Bergen tagger
Polish      Polish web corpus    128        Takipi
Russian     Russian web corpus   188        TreeTagger
Swedish     SwedishWaC           114        From Gothenburg Univ

Appendix 2: Other resources used (corpora and wordlists)

Language    Other corpora and word lists used
Arabic
Chinese
English     BNC, BNC-spoken
Greek       Official list from the Center for the Greek Language
Italian     Italian PAROLE corpus: 250,000 words, newspapers and periodicals
            Corpus Stammerjohann: 100,000 words, spontaneous speech
            Corpus per il Confronto Diacronico
            LABLITA: 1000,000 words of speech, Florence area
Norwegian   Spoken corpus
Polish      Existing Poznan wordlists
Swedish

Appendix 3: Guidelines for inclusion of word types in Kelly lists

Word type: Variants.
Policy: Spelling variants should be amalgamated, so that e.g. organize and organise are counted as one word for frequency calculations. Each language team will have to have a style guide for preferred forms for the list itself. For English, British and US spelling variants such as color/colour will also be amalgamated.
Comments: Lexical variants, e.g. cash machine/ATM, would be treated as separate items.

Word type: Inflected forms.
Policy: These are not shown unless an inflected form has a meaning that is not inherent in the base form, e.g. better in the sense of 'to get better'.
Comments: Although learners may want to look up inflections, esp. irregular ones, for the purposes of frequency they should be treated together with the base form.

Word type: Derivational inflected forms, e.g. quickly, happiness.
Policy: To be treated as words in their own right, i.e. as separate lemgrams.

Word type: Affixes, including productive affixes.
Policy: No, an affix will only appear if it forms a word that is common enough in itself to merit inclusion.

Word type: Abbreviations.
Policy: Yes, including abbreviations that are written only, but only if they meet the normal criteria of what we are including, so not abbreviations for proper nouns and encyclopedic items. The most common abbreviations will probably be forms of address, weights and measures, Latin abbrevs, and the few cases where an abbreviation is the normal way to refer to an item, e.g. DVD.
Comments: NB The inclusion of abbreviations will mean searching on the non-alphabet character [.].

Word type: Multi-word units.
Policy: Yes for the teams who decided to add them at this stage, no for those who didn't.

Word type: Hyphenated compounds.
Policy: Yes, as long as they can be found automatically.

Word type: Phrasal verbs.
Policy: No for English, as they count as multi-words; yes for languages where they have a one-word lemma.

Word type: Phrases, idioms, proverbs, quotations.
Policy: No.

Word type: Subject-specific vocabulary.
Policy: Only if it makes it by the normal frequency criteria (it may do, for instance for some computing terms).
Comments: NB When it comes to adding CEF levels, we may need to consider grammar vocabulary as a special case because of its usefulness to language learners.

Word type: Dialect words.
Policy: No.

Word type: Items marked by register, e.g. very formal, slang, offensive.
Policy: Normal frequency rules apply: if they come in the top 5,000 then yes.
Comments: NB We agreed that an 'offensive' attribute should be added to the database so that while the frequency lists themselves can be purely frequency based, offensive items can be weeded out if necessary.

Word type: Geographic terms.

  Country name/related adjective/name of people/language.
  Policy: For these, give your own, then any others that appear in your frequency list in the normal way.

  Oceans/continents/important areas/mountain ranges.
  Policy: These should be included on a frequency basis, but privilege items which are not from your own area. So for the English list, an item such as 'Mediterranean' would be more important than 'Lake District'. We will not cover individual rivers, mountains, deserts etc.
  Comments: This suggestion is to avoid over-representation of these items: every list is likely to include many from one's own region.

  Cities.
  Policy: Your own capital city, plus any really major cities in your country which have a different name in translation. Then any cities from other countries which fulfil the normal frequency criteria and have a different name in your language from the original.

Word type: Famous places and buildings.
Policy: Only if they have metonymy, e.g. Hollywood. Likely to be very rare.

Word type: Stars, planets, galaxies, etc.
Policy: No.

Word type: Imaginary, biblical or mythological people or place names.
Policy: No.

Word type: Personal names.
Policy: No.

Word type: Famous people and places, and other encyclopedic info such as names of wars, treaties, names of ancient peoples, names of organizations, etc.
Policy: No.

Word type: Adjectives derived from famous people.
Policy: Only if they are in the top 5,000.

Word type: Festivals and ceremonies.
Policy: If they are in the top 5,000.

Word type: Trademarks.
Policy: If they appear in the top 5,000 and are the name of an item, but not company names.

Word type: Beliefs and religions, and associated nouns and adjectives.
Policy: If they are in the top 5,000.

Word type: Currencies.
Policy: Include your own currency and any others in the top 5,000.
Appendix 4: English words that featured in 7-language cliques afternoon age aggressive air almost already angel apple balcony beer believe big bird blind blood body bus category catholic central chaos cheese christian city clinical club comment constitution contact corruption country court cry culture daughter democracy description diagnosis dialogue dictionary difficulty digital direction director discipline distance document dollar door doubt eighty engineer example experiment family february festival fifteen fifth fifty filter finger five flag flower four french fresh friend garden glass god green guarantee have height hero history hope hundred ice industrial industry italian july june key kilometre knife lake liberal life light literature litre loan long mathematical mathematics meat mechanism member metal method million minister minute month mother museum myth national nervous new nightmare nine ninth nose november page pain park parliament pay period personality philosophical philosophy planet poem poet police population president price product production professor quality question radio rain read religion restaurant revenge river role root salt saturday scandal school screen sea series seventy shirt simple six sixty sky sleep soldier son stability strategy sugar sunday surprise sweet sword symbol tail talent technology temperature temple text theatre theoretical third three thursday ticket time tobacco tooth tournament tower tradition tragic travel twelve twenty two typical understanding video virus vote war weather white window winter woman word wound write year Appendix 5: 33 7-language-oto-Cliques 苹果 ‫ج‬ ‫ف د‬ apple ηάζο Mela jabłko ГϵϿЂϾЂ äpple apple ηάζο Mela eple jabłko ГϵϿЂϾЂ äpple cheese νλέ ost ser ЅЏЄ ost cheese νλέ Formaggio ost ser ЅЏЄ ost corruption δα γολΪ corruzione korrupsjon δα γολΪ corruzione korrupsjon korupcja Febbraio februar luty fifteen ίλονΪλ δομ εαπΫθ ϾЂЄЄЇЃЊϼ Г ϾЂЄЄЇЃЊϼ Г ЈϹ϶ЄϴϿА Quindici femten fifty π θάθ α Cinquanta femti 
ЃГІЁϴϸЊϴ ІА ЃГІАϸϹЅГІ horse Ϊζοΰο Cavallo hest piętnaści e pięćdгies iąt koń korru ption korru ption februa ri femto n femti o häst corruption july δοτζδομ Luglio juli lipiec ϼВϿА juli june δοτθδομ Giugno juni czerwiec ϼВЁА juni knee ΰσθα ο Ginocchio kne kolano ϾЂϿϹЁЂ lake ζέηθβ Lago jezioro ЂϻϹЄЂ litre ζέ λο Litro liter litr ϿϼІЄ million Milione million milion ЀϼϿϿϼЂЁ museum εα οηητ λδο ηον έο Museo museum muzeum museum ηον έο Museo museum muzeum ЀЇϻϹϽ Incubo mareritt koszmar ϾЂЌЀϴЄ Naso nese nos ЁЂЅ february 马 ‫رك‬ ‫ي‬ 湖 ‫ملي‬ ‫مت ف‬ nightmare ‫أف‬ nose ‫جي‬ 沙 Ϋπβ Tasca lomme kiesгeń pocket Ϋπβ Tasca lomme kiesгeń ϾϴЄЀϴЁ ficka Sabbia sand piasek ЃϹЅЂϾ sand Sabato lørdag sobota ЅЇϵϵЂІϴ lördag Settanta sytti siedemd гiesiąt herbata ЅϹЀАϸϹЅГ І sjuttio sand seventy 茶 ‫ش‬ ‫في‬ ‫ئ‬ tea tea ‫في‬ ητ β muse um muse um mardr öm pocket saturday ‫ش‬ δΪζ βμ liter Ϊίία ο ί οηάθ α Ϊδ Ϊδ Tè Tè te herbata ficka te te tooth σθ δ Dente tann гąb ϻЇϵ tand twelve υ εα Dodici tolv tolv 二十 twenty έεο δ Venti ϸ϶ϹЁϴϸЊϴ ІА ϸ϶ϴϸЊϴІА 病毒 virus 病毒 virus virus dаanaści e dаadгieś cia wirus Virus virus wirus δσμ tjugo virus virus 狼 ζτεομ Lupo ulv wilk ϶ЂϿϾ 狼 ζτεομ Lupo ulv wilk ϶ЂϿϾ varg Appendix 6: 49 8-language-cliques ‫ق ل‬ 银行 bank λΪπ αα banca bank bank ϵϴЁϾ Bank 床 bed ελ ίΪ δ letto seng łяżko ϾЄЂ϶ϴІА Sang 炸弹 bomb ίσηία bomba bomba ϵЂЀϵϴ Bomb bomb ίσηία bomba bombe bomba ϵЂЀϵϴ Bomb 书 book ίδίζέο libro bok książka ϾЁϼϷϴ Bok 面包 bread οπηέ brød chleb ЉϿϹϵ Bröd bread οπηέ pane brød chleb ЉϿϹϵ Bröd bridge ΰΫ νλα ponte bro ЀЂЅІ Bro chair εαλΫεζα sedia stol krгesło ЅІЇϿ Stol channel εαθΪζδ canale kanal kanał ϾϴЁϴϿ Kanal chiesa kirke kościяł clima klima klimat ϾϿϼЀϴІ Klimat kaffe kawa ϾЂЈϹ Kaffe ‫ق ل‬ ‫ج‬ 面包桥椅子 ‫ق‬ ‫كي‬ 教堂 church εεζβ έα Kyrka ‫م‬ climate εζέηα ‫ق‬ 咖啡 coffee εα Ϋμ ‫ق‬ 咖啡 coffee caffè kaffe kawa ϾЂЈϹ Kaffe 狗 dog cane hund pies ЅЂϵϴϾϴ Hund Öga ‫كل‬ ‫عي‬ ‫أ‬ 父亲鱼 eye ηΪ δ occhio øye oko ϷϿϴϻ father πα Ϋλαμ padre far ojciec ЂІϹЊ fish οΪλδ pesce fisk ryba ЄЏϵϴ Fisk skog las ϿϹЅ Skog 
prгвsгłoś ć rгąd ϵЇϸЇЍϹϹ framtid regerin g Hjärta ‫غ‬ 森林 forest Ϊ ομ ‫م تق ل‬ 未来 future ηΫζζοθ futuro ‫حك م‬ 政府 government governo 心脏 heart ενίΫλθβ β εαλ δΪ cuore hjerte serce ЃЄϴ϶ϼІϹϿАЅ І϶Ђ ЅϹЄϸЊϹ ‫مط‬ 厨房 kitchen εοναέθα cucina kjøkken kuchnia ϾЇЉЁГ ‫مط‬ 厨房 kitchen εοναέθα cucina kjøkken πέπ ο livello nivå ‫م ت‬ ‫م طق‬ ‫اج‬ level poziom ϾЇЉЁГ Kök ЇЄЂ϶ϹЁА Nivå ϿЂϷϼϾϴ Logic 逻辑 logic ζοΰδεά logica logikk 逻辑 logic ζοΰδεά logica logikk logika ϿЂϷϼϾϴ Logic 婚姻 marriage ekteskap milk ΰΪζα melk małżeńst wo mleko ϵЄϴϾ 牛奶 matrimoni o latte ЀЂϿЂϾЂ äktens kap Mjölk office ΰλα έο ufficio kontor biuro ЂЈϼЅ Kontor prigione fengsel аięгienie fengsel аięгienie ІВЄАЀϴ problema problem problem ЃЄЂϵϿϹЀϴ psicologia psykologi psycholog ia ЃЅϼЉЂϿЂϷϼ Г ‫مكت‬ ‫ج‬ 监狱 prison νζαεά ‫ج‬ 监狱 prison νζαεά ‫مشكل‬ problem 心理学 psychology πλσίζβη α ονξοζοΰέ α fängels e fängels e proble m psykol ogi ‫ث ر‬ revolution 雪 rewolucja ЄϹ϶ЂϿВЊϼГ rivoluzione revolusjon rewolucja ЄϹ϶ЂϿВЊϼГ neve snø śnieg ЅЁϹϷ revolut ion revolut ion Snö fonte kilde źrяdło ϼЅІЂЋЁϼϾ Källa system τ βηα sistema system system ЅϼЅІϹЀϴ System 十 ten Ϋεα dieci ti dгiesięć ϸϹЅГІА Tio trade ηπσλδο commercio handel handel ІЂЄϷЂ϶ϿГ Handel universitet ЇЁϼ϶ϹЄЅϼІ ϹІ ϶Ђϸϴ univers itet Vatten 大学 university 水 water 周 ‫مي‬ 城市 ‫عش‬ 十 ‫ه تف‬ revolusjon 系统 ‫أ‬ ‫مط‬ πβΰά 革命 ‫تج ر‬ ‫مء‬ source revolution ‫مص ر‬ ‫ج مع‬ snow παθΪ α β παθΪ α β ξδσθδ 革命雨电话 παθ πδ άηδο θ λσ vann uniwersyt et woda week ί οηΪ α settimana uke tвdгień ЁϹϸϹϿГ vecka week ί οηΪ α settimana uke tвdгień ЁϹϸϹϿГ vecka πσζβ città by miasto ϷЂЄЂϸ stad Ϋεα dieci ti dгiesięć ϸϹЅГІА tio pioggia regn deszcz ϸЂϺϸА regn telefono telefon telefon ІϹϿϹЈЂЁ telefon ίλοξά βζΫ πθο
