Papers by Adam Kilgarriff
Abstract: NLP system developers and corpus lexicographers would both benefit from a tool for finding and organizing the distinctive patterns of use of words in texts. Such a tool would be an asset for both language research and lexicon development, particularly for lexicons for Machine Translation.
ITRI, University of Brighton, UK. Abstract: This paper outlines a novel architecture for the development of word sense disambiguation (WSD) systems. It is based on the premise that the only way to improve the performance of such systems is through increased, and more flexible, human intervention. To this end a human-WSD programming environment, WASPBENCH, has been developed for use by lexicographers in organizing corpus data in the drawing up of new dictionary entries. A by-product of this activity will be an accurate sense-disambiguation program.
Abstract: The value of language resources is greatly enhanced if they share a common markup with an explicit minimal semantics. Achieving this goal for lexical databases is difficult, as large-scale resources can realistically only be obtained by up-translation from pre-existing dictionaries, each with its own proprietary structure. This paper describes the approach we have taken in the Concede project, which aims to develop compatible lexical databases for six Central and Eastern European languages.
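To make the up-translation problem concrete, here is a minimal sketch in Python; the proprietary field names (hw, gram, sns) and the target structure are invented for illustration and are not the Concede project's actual schema.

```python
# Illustrative sketch of "up-translation": mapping one publisher's
# proprietary entry layout onto a shared minimal schema. All field
# names on both sides are hypothetical, chosen only for the example.
def up_translate(entry):
    """Convert a proprietary dictionary entry to a common structure."""
    return {
        "headword": entry["hw"],
        "pos": entry["gram"],
        "senses": [{"definition": s["df"], "examples": s.get("ex", [])}
                   for s in entry["sns"]],
    }

# Usage with a toy entry:
# up_translate({"hw": "bank", "gram": "n",
#               "sns": [{"df": "side of a river"},
#                       {"df": "financial institution",
#                        "ex": ["a bank loan"]}]})
```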
Abstract: I first consider the spectrum of lexical information from the semantic to the textual. A range of lexicons are classified according to where they sit on this scale. Lexicographic tools and WSD programs are included in the classification, and this inclusion is justified. There is currently a lacuna between the most text-oriented of the lexicographic approaches and the most sophisticated of the data-driven ones. Lexical tuning requires that the lacuna be filled, so that corpus data can flow into the lexicon.
We are often interested in discovering which words are markedly different in their distribution between two texts or two corpora. In this paper I show that one statistic which has sometimes been used for this purpose, chi-square, is inappropriate. I present an alternative, the Mann-Whitney ranks test. I apply the test to finding the words which are most different between the LOB and Brown corpora and show that it produces output that is well suited to the interests of lexicographers and humanities scholars.
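To illustrate the alternative the abstract proposes, here is a minimal sketch of a Mann-Whitney comparison, assuming each corpus is split into equal-sized chunks and scipy is available; the chunk size and the example word are arbitrary choices, not the paper's.

```python
# Minimal sketch: compare one word's distribution across two corpora with
# the Mann-Whitney ranks test. Each corpus is cut into equal-sized chunks,
# and the word's count per chunk forms the two rank samples.
from scipy.stats import mannwhitneyu

def chunk_counts(tokens, word, chunk_size=2000):
    """Frequency of `word` in each consecutive chunk of `tokens`."""
    return [tokens[i:i + chunk_size].count(word)
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]

def compare_word(word, tokens_a, tokens_b):
    """U statistic and p-value for `word` across the two corpora."""
    a = chunk_counts(tokens_a, word)
    b = chunk_counts(tokens_b, word)
    return mannwhitneyu(a, b, alternative="two-sided")

# Usage (tokens would be read from LOB and Brown in practice):
# stat, p = compare_word("quite", lob_tokens, brown_tokens)
```

Unlike chi-square on pooled counts, the ranks test is sensitive to how evenly a word is spread across chunks, which is the point the paper exploits.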
Abstract: In Taiwan and other Asian countries, students of English expect, and are expected, to memorize a lot of vocabulary: MCU, for example, relies fairly heavily on vocabulary acquisition and retention in its teaching and testing resources. Often, the lists of vocabulary items to be learned by students do not really belong to a particular topic, or fit it only loosely, because the items have not been chosen in a principled way.
Abstract: We present the Kelly project and its work on developing word lists, monolingual and bilingual, for language learning, using corpus methods, for nine languages and thirty-six language pairs. We describe the method in some detail and discuss the many challenges encountered. We have loaded the data into an online database and made it accessible for anyone to explore: we present our own first explorations of it.
As part of the development programme for the Pearson Test of English Academic, it was decided in 2007 to compile an academic corpus comprising spoken and written data from five major English-speaking countries. The aim was to ground PTE Academic in an accurate rendition of the English that students need to understand and produce in order to function in academic settings where English is the language of instruction.
Analysis of text and spoken language, for the purposes of second language teaching, dictionary making and other linguistic applications, used to be based on the intuitions of linguists and lexicographers. The compilation of dictionaries and thesauri, for example, required that the compiler read very widely and record the results of his efforts (the definitions and different senses of words) on thousands, or millions, of index cards.
Abstract: NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. To date, there has been only very limited collaboration. The two reasons for this are (a) dictionary publishers' concerns regarding intellectual property, and (b) the different languages that lexicographers and NLP researchers speak. In this paper I present a model for overcoming the first and suggest some strategies for the second.
A crucial resource for language technology development is a corpus. It should, if possible, be large and varied: otherwise it may simply fail to cover all the core phenomena of the language, and tools based on it will sometimes fail when they encounter something that was absent from the development corpus. Corpora are critical for the development of morphological analysers because they provide a good sample of all the words, in all their forms, that the analyser might be expected to handle.
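A minimal sketch of one way to quantify this coverage effect, assuming tokenized corpora; the function and the sizes in the usage note are illustrative, not from the paper.

```python
# Minimal sketch of why size matters: estimate the out-of-vocabulary rate
# of a held-out text against development corpora of increasing size.
def oov_rate(dev_tokens, heldout_tokens):
    """Share of held-out tokens whose form never occurs in the dev corpus."""
    seen = set(dev_tokens)
    misses = sum(1 for t in heldout_tokens if t not in seen)
    return misses / len(heldout_tokens)

# for size in (10_000, 100_000, 1_000_000):
#     print(size, oov_rate(corpus_tokens[:size], heldout))
# The rate falls as the corpus grows, but variety matters too: a large
# corpus of one genre still misses forms that are common elsewhere.
```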
Computers are good at recall, people are good at precision; that is, computers are good at finding a large set of possibilities, and people are good judges of which possibilities are appropriate. Conversely, people are bad at recall and computers are bad at precision; it is hard for people to think, unprompted, of lots of possibilities, and it is hard for computers to work out which candidate answers are good ones. This points to a straightforward division of duties: computer proposes, human disposes.
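A toy sketch of this division of labour, with an invented gold set; the "human" step is simulated here by checking candidates against the gold standard.

```python
# Toy illustration: the machine over-generates candidates (high recall,
# low precision); a person then filters them (restoring precision).
# Gold set and candidates are invented for the demo.
gold = {"strong tea", "heavy rain", "make a decision"}

machine_candidates = {            # computer proposes: large, noisy set
    "strong tea", "heavy rain", "make a decision",
    "strong rain", "heavy tea", "of the", "in a",
}
human_kept = {c for c in machine_candidates if c in gold}  # human disposes
# (in reality a person judges; here the gold set stands in for them)

def precision(found, gold):
    return len(found & gold) / len(found)

def recall(found, gold):
    return len(found & gold) / len(gold)

print(precision(machine_candidates, gold), recall(machine_candidates, gold))
# machine alone: precision 3/7, recall 3/3
print(precision(human_kept, gold), recall(human_kept, gold))
# after human filtering: precision 1.0, recall preserved
```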
Abstract: The prospects for automatically identifying two-word multiwords in corpora have been explored in depth, and there are now well-established methods in widespread use. (We use 'multiwords' to include collocations, colligations, idioms, set phrases, etc.) But many multiwords consist of more than two words, and research on items of three or more words has been less successful. We present three complementary strategies, all implemented and available in the Sketch Engine.
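As a sketch of the kind of well-established two-word method the abstract refers to, here is pointwise mutual information over adjacent pairs; the frequency floor is an arbitrary choice, and the paper's own three-word strategies are not reproduced here.

```python
# Minimal sketch of one standard two-word measure: pointwise mutual
# information over adjacent bigrams, with a frequency floor for noise.
import math
from collections import Counter

def pmi_bigrams(tokens, min_freq=5):
    """Score adjacent word pairs by PMI; return highest-scoring first."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), f in bigrams.items():
        if f < min_freq:
            continue
        pmi = math.log2((f / (n - 1)) /
                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append((pmi, w1, w2, f))
    return sorted(scored, reverse=True)

# for score, w1, w2, f in pmi_bigrams(corpus_tokens)[:20]:
#     print(f"{w1} {w2}\t{score:.2f}\t{f}")
```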
Abstract: Collocation is increasingly recognised as a central aspect of language, a fact to which English learners' dictionaries have responded extensively. Statistical measures for identifying collocations in large corpora are now well established. We move on to a further issue: which words have a particularly strong tendency to occur in collocations, or are most 'collocational', and thereby merit having their collocates shown in dictionaries.
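The paper's actual measure is not reproduced here; as a rough illustrative proxy, one could rank words by the mean association score of their strongest collocates, as in this sketch (the window size, thresholds and the choice of PMI are all assumptions).

```python
# Illustrative proxy for "collocationality" (not the paper's measure):
# a word scores highly if its top collocates have high mean PMI.
import math
from collections import Counter, defaultdict

def collocationality(tokens, window=2, top_k=10, min_freq=10):
    """Rank words by the mean PMI of their top_k collocates."""
    n = len(tokens)
    freq = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):   # count both directions
            pairs[(w, tokens[j])] += 1
            pairs[(tokens[j], w)] += 1
    by_word = defaultdict(list)
    for (w, c), f in pairs.items():
        if freq[w] >= min_freq and freq[c] >= min_freq and f >= 3:
            by_word[w].append(
                math.log2((f / n) / ((freq[w] / n) * (freq[c] / n))))
    ranking = [(sum(sorted(ps, reverse=True)[:top_k]) / top_k, w)
               for w, ps in by_word.items() if len(ps) >= top_k]
    return sorted(ranking, reverse=True)
```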
Most corpora contain repeated material. In sampled corpora like the Brown Corpus, duplication is not much of an issue, since the linguistic data is carefully selected in proportion to genre, which reduces the risk of introducing unwanted duplication. However, the typical corpus used in NLP is one in which as much data as possible of the desired genre is gathered. The result is a corpus whose nature and content are largely unknown.
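A minimal sketch of one standard way to detect such repetition, using Jaccard resemblance over hashed word 5-grams (Broder-style shingling); the threshold in the closing comment is a rule of thumb, not the paper's.

```python
# Minimal sketch of near-duplicate detection: Jaccard overlap of the
# two documents' hashed k-gram shingle sets (Broder-style resemblance).
def shingles(tokens, k=5):
    """Set of hashed k-grams for a tokenized document."""
    return {hash(tuple(tokens[i:i + k])) for i in range(len(tokens) - k + 1)}

def resemblance(doc_a, doc_b, k=5):
    """Jaccard overlap of the two documents' shingle sets (0.0 to 1.0)."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Pairs scoring above roughly 0.5 are strong near-duplicate candidates
# and would be flagged for removal from the corpus.
```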
Corpora are not easy to get a handle on. The usual way of getting to grips with text is to read it, but corpora are mostly too big to read (and not designed to be read). We show, with examples, how keyword lists (of one corpus vs. another) are a direct, practical and fascinating way to explore the characteristics of corpora, and of text types. Our method is to classify the top one hundred keywords of corpus1 vs. corpus2, and of corpus2 vs. corpus1.
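A minimal sketch of building such keyword lists, ranking by a smoothed ratio of per-million frequencies in the spirit of Kilgarriff's "simple maths" keyword score; the smoothing constant is a tunable assumption.

```python
# Minimal sketch of a keyword list: normalise to frequencies per million,
# then rank by a smoothed ratio. The constant n damps the scores of rare
# words; larger n favours common words.
from collections import Counter

def keywords(tokens1, tokens2, n=100, top=100):
    """Top words of corpus1 vs corpus2 by smoothed per-million ratio."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)

    def fpm(counts, total, w):
        return counts[w] * 1_000_000 / total

    vocab = set(f1) | set(f2)
    ranked = sorted(vocab,
                    key=lambda w: (fpm(f1, n1, w) + n) / (fpm(f2, n2, w) + n),
                    reverse=True)
    return ranked[:top]

# keywords(corpus1_tokens, corpus2_tokens) gives one list to classify;
# swapping the arguments gives the corpus2-vs-corpus1 direction.
```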
How does language work? The Fregean tradition, picking up from Aristotle and Leibniz and carried forward by Quine, Davidson and Montague, gives some glimpses of how the meanings of words and phrases combine, using grammar rules, to give the meanings of sentences. Formal work on discourse and dialogue gives hope for an understanding of how sentences build the 'meanings' (or, better, achieve the communicative purposes) of larger units. But what of the words and phrases? What, or better how, do they mean?
Abstract: Senseval is a series of evaluation exercises for Word Sense Disambiguation. The core design follows the MUC and TREC model of quantitative, developer-oriented (rather than user-oriented) evaluation. The first exercise took place in 1998, with tasks for three languages and 25 participating research teams; the second in 2001, with tasks for twelve languages, thirty-five participating research teams and over 90 participating systems. The third is currently in planning.
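A minimal sketch of the exact-match scoring used in such exercises, computing precision over attempted items and recall over all items; real Senseval scoring also handled multiple and weighted sense answers, which this omits.

```python
# Minimal sketch of Senseval-style scoring (exact-match, fine-grained):
# precision over attempted items, recall over all gold items.
def score(answers, gold):
    """`answers` and `gold` map instance ids to sense ids."""
    attempted = [i for i in gold if i in answers]
    correct = sum(1 for i in attempted if answers[i] == gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold)
    return precision, recall

# Usage with hypothetical instance and sense ids:
# p, r = score({"art.40001": "art%1:06:00"}, gold_keys)
```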