Word frequency and key word statistics in corpus linguistics

2009

Word frequency and key word statistics in historical corpus linguistics Alistair Baron, Lancaster University Paul Rayson, Lancaster University Dawn Archer, University of Central Lancashire 1. Introduction Frequency-sorted word lists have long been part of the standard methodology for exploiting corpora. Sinclair (1991: 30) noted that "anyone studying a text is likely to need to know how often each different word form occurs in it". Tribble and Jones (1997: 36) outlined a methodology for using texts in the language classroom, proposing that the most effective starting point for understanding a text is a frequency-sorted word list. A frequency list records the number of times that each word occurs in the text. It can therefore provide interesting information about the words that appear (and do not appear) in a text. A word list can be arranged in order of first occurrence, alphabetically or in frequency order. Firstoccurrence order serves as a quick guide to the distribution of words in a text, an alphabetic listing is built mainly for indexing purposes, but a frequency-ordered listing highlights the most commonly-occurring words in the text. Frequency dictionaries have appeared for Spanish, Rumanian, French, Portuguese, German and English (Juilland et al 1964, 1965 and 1970; Davies, 2006; Davies and Preto-Bay, 2008; Jones and Tschirner, 2006; Leech et al, 2001). Traditional dictionaries also use frequency information indirectly, in choosing entries for inclusion. Francis and Ku!era (1982) took the simple word frequency list one stage further when they reported grammatical word frequencies drawn from the tagged version of the Brown corpus. Grammatical word frequencies are associated with a specific part-ofspeech (POS) tag. Although the computer saves us time when processing texts into frequency lists, it presents us with so much information that we need a filtering mechanism to pick out significant items prior to any analysis proper. There are at least two methods that we can use. First, formulae can be applied to adjust the raw frequencies for the distribution of words within a text; in other words, to describe the dispersion of frequencies in subsections of a corpus. Secondly, we can apply statistical procedures to highlight words that occur significantly more or less frequently than expected in a corpus. The frequency profile for a given text can be compared to the profile of a comparable text or to a profile derived from large amounts of text. Since the high frequency items tend to have a stable distribution generally, significant changes to the ordering of the words in the frequency list can flag points of interest to the researcher (Sinclair 1991: 31). For example, Hofland and Johansson (1982) use Yule's K statistic and the chi-squared goodness-of-fit test to pick out statistically significant different word frequencies across British and American English in their comparison of the two language varieties. This paper examines the technique of key word analysis. This is one of the most widely-used methods for discovering significant words, and is achieved by comparing the frequencies of words in a corpus with frequencies of those words in a (usually larger) reference corpus. It should be noted that the vast majority of key words studies take place using corpora of modern language. However, in this paper, we look at the possible problems that may occur when applying the same technique to historical corpora, and in particular, corpora of Early Modern English, a variety for which there are significant volumes of text already available. In addition, there is a growing body of historical data from this period being scanned and transcribed in the large digitisation initiatives such as Early English Books Online, British Library Newspapers Digitisation Project etc. The paper continues in section two with further background information on statistical techniques that are used to compare frequencies in corpora of modern languages. We also look at the few key word studies that have been carried out on historical data. In our case study, presented in section three, we first quantify the amount of spelling variation occurring in historical corpora. We then examine the problems of applying the key words technique to historical data and show how much key word lists are affected by issues of spelling variation. Our study quantifies, systematically and on a large scale, how the process of standardisation of written English throughout the Early Modern English period affects the robustness of key words results in historical corpus linguistics. In our conclusion (section four), we highlight possible solutions to this problem and describe directions for further work. 2. Background 2.1. Modern Although word frequency lists are very useful as a starting point for the analysis of corpora, there are well-known problems with using them. First, the frequencies must be normalised before the lists can be compared directly. Second, high frequency words at the top of any word frequency list are generally of no further interest to those trying to examine the content of corpora. Third, comparing the ranking of words is also misleading. Finally, multiword expressions and inflectional variants of the same lemma are not counted together. For further description of these problems, see Rayson (forthcoming) and Hoffman et al (2008). Even when they are derived from a large comprehensively-sampled corpus such as the British National Corpus (BNC), the word frequency counts themselves may be misleading. This is not because we might have miscounted the words, but because of how well the frequencies relate to usage in the English language as a whole. If a word has a high frequency count, we may reasonably infer, due to the nature of the BNC, that the word has a similarly high usage in the language. However, it may be the case that the word has a high frequency in the corpus not because it is widely used in the language as a whole but because it is widely used in a small(ish) number of texts, or parts of texts, within the corpus. To reveal these cases, we can calculate range or dispersion statistics. These show how widely distributed the occurrences of a word are within a corpus: i.e. whether it is frequent because it occurs in a lot of text samples in the corpus or whether it is frequent because of a very high usage in only a subset of texts (which may represent particular domains or genres). Frequent words with high dispersion values may be considered to have high currency in the language as a whole; high frequencies associated with low dispersion values should, in contrast, be treated with caution. For example, Church and Gale (1995) term this as the "bunchiness" or "burstiness" of words and show that the occurrences of the "very contagious" word "Kennedy" are not evenly dispersed in the Brown corpus (because he was the president of the United States when the Brown corpus was compiled in 1961). In the discipline of statistics, the mean and standard deviation are used as summary measures. In corpus linguistics, these are analogous to frequency and dispersion. According to Fries and Traver (1950: 21), Thorndike was the first to introduce range values into frequency lists. For further discussion of dispersion statistics, see Lyne (1985). Another way of dealing with the burstiness of words is to combine separate frequency and dispersion values into one measure called adjusted frequency (Francis and Ku!era, 1982: 464). Words can then be ranked by their adjusted frequencies. A more complex approach for describing variability within corpora is proposed by Gries (2006). The comparison of word frequency profiles has increasingly been used to examine issues in language variation, that is, to compare language usage across corpora, users, genres, etc. There are two types of corpus comparison. First, a comparison of a sample corpus with a larger 'normative' (or general language standard) corpus (e.g. Scott, 2000b). Second, a comparison of two roughly equal-sized corpora (e.g. Granger, 1998). These two main types can be extended to the comparison of more than two corpora. For example, we may compare one normative corpus to several smaller corpora at the same time, or compare three or more equal-sized corpora with each other. In general, however, this makes the results more difficult to interpret. Homogeneity (Stubbs, 1996: 152) within each of the corpora is important since we may find that the results reflect sections within one of the corpora that are unlike other sections in either of the corpora under consideration (Kilgarriff, 1997). There are a number of different statistics that can be applied in the comparison of word frequency lists. In what follows, we will examine a number of these in order to see how key word analysis operates. Hofland and Johansson (1982) carried out one of the largest early studies comparing word frequency profiles. This was the comparison of one million words of American English (the Brown corpus) with one million words of British English (the LOB corpus). They used a difference coefficient defined by Yule (1944) to assess the difference in the relative frequency of a word in the two corpora: The value of the coefficient varies between +1 and –1. A positive value indicated overuse in the LOB corpus, a negative value showed overuse in the Brown corpus. A statistical goodness-of-fit test originally suggested by Pearson (1904), the chi-squared test (!2), was also used to compare word frequencies across the two corpora. The chi-squared test was calculated as follows: where where Oi is the observed (actual) frequency, Ei is the expected (averaged) frequency, and Ni is the total frequency in corpus i (i in this case takes the values 1 and 2 for the LOB and Brown corpora respectively). Hofland and Johansson marked any resulting differences that were indicated by chi-squared values showing statistically significant difference at the 5%, 1%, or 0.1% level. The null hypothesis of the goodness of fit test is that there was no difference between the observed frequencies of a word in the two corpora. Note that even if the null hypothesis was not rejected, they could not conclude that it is true. The cut-off value corresponding to the chosen degree of confidence may not be exceeded, but this only indicates there is not enough evidence to reject the null hypothesis (Krenn and Samuelsson, 1997: 36). Critical values for the chi-squared statistic are listed in statistical tables such as those in Barnett and Cronin (1986) and Oakes (1998: 266). For example, the critical value for the 5% level, shown as 0.05 in the tables, is 3.84 at 1 degree of freedom (see below). Leech and Fallon (1992) used the lists produced by Hofland and Johansson to examine evidence of cultural differences between America and Britain in 1961. In corpus linguistics, we usually use a 2 " 2 table to compare frequencies of words or other linguistic features between two corpora. The chi-squared test is applicable to a general table with r rows and c columns. The number of degrees of freedom (d.f.), which is used when looking up critical values, is defined as the number of independent terms given that the marginal totals in the table are fixed. So, in the 2 " 2 table, d.f., as calculated by (r-1)"(c-1), is equal to 1. In this case, the 2 " 2 'contingency' table is as shown in Table 1 . CORPUS ONE Frequency of feature a Frequency of feature c not occurring TOTAL a+c CORPUS TWO b d TOTAL a+b c+d b+d N=a+b+c+d Table 1 - Contingency table for the chi-squared test Hence, we can calculate the chi-squared statistic (X2) as follows: When comparing the frequency distribution of word classes across the two major subdivisions of the Brown corpus, informative prose and imaginative prose, Francis and Ku!era (1982: 544) used a normalised ratio value (NR) rather than the chi-squared test. The ratio is normalised to take account of the fact that the informative section is nearly three times larger than the imaginative section of the corpus. An NR value of more than 1 indicates a greater occurrence in informative prose, while a value of less than 1 points to a higher relative frequency in imaginative texts. The greater the NR deviates from 1, the greater the grouping of a particular word class in one of the sections of the corpus. Comparing NR values is problematic since they are not on a linear scale, and the calculation is too generous for smaller relative differences when lower frequency items are compared to higher frequency items. Francis and Ku!era (1982: 547) also employed the Mosteller-Rourke (MR) adjustment for chi-squared for large numbers. The MR value is calculated as follows: where n is the frequency of an item in the whole corpus (Mosteller and Rourke, 1973: 191). The resulting values cannot be assessed for significance in the chi-squared tables, but they are used to rank items according to their MR value. In effect, MR reduces the chi-squared values for items occurring more than 1000 times, and increases the values for items with a frequency that is less than 1000. This seems a rather arbitrary figure, chosen to show 'nice numbers' and, if anything, the figure should be dependent on the corpus size(s). Kilgarriff (1996a, 1996b) pointed out that, in the Brown versus LOB comparison, many common words are marked as having significant chi-squared statistics and that, because words are not selected at random in language, we will always see a large number of differences in two such text collections. As an alternative, Kilgarriff selected the MannWhitney test that uses ranks of frequency data rather than the frequency values themselves to compute the statistic. Kilgarriff selected the Mann-Whitney test because it "does not give undue weight to single documents with a high [frequency] count" for a particular word. However, he observed that, even with the new test, 60% of words are marked as significant. Ignoring the actual frequency of occurrence, as in the Mann-Whitney test, means discarding most of the evidence we have about the distribution of words. As such, the test will have lower discriminatory power. Due to problems of too many zeros in the Mann-Whitney test, Kilgarriff (2001) reported that his technique omits words with less than 30 occurrences in the joint LOB and Brown corpus. This is a major drawback with the Mann-Whitney test; here it omits 92% of the types in the joint corpus. A further problem is that many words share ranks at the low end of frequency lists, especially for large corpora. For example, Copeck et al (1999) report that 18,630 words occur six times – 10 percent of their list for the BNC. Within each rank, words are ordered alphabetically. Additionally, comparing rank lists between different-sized corpora is also problematic. Copeck et al (1999) note the sizes of their frequency lists for LOB (7,950) and Wall Street Journal (4,550). This means that ranks for middle and lower frequency words in the BNC fall outside this range. These points suggest that the Mann-Whitney ranks test is suitable only for investigating mid- to high-frequency words when comparing corpora of the same size. Numerous other authors have used the chi-squared test to determine significant frequency differences of individual words or other linguistics features, rather than whole frequency profiles, between two corpora (for example Woods et al 1986: 140, Virtanen 1997, Oakes 1998: 26, Roland et al 2000, Wikberg 1999). Many authors also apply Yates' continuity correction (1934), developed to improve the approximation of the continuous probability distribution (chi-squared) to the discrete probability distribution of the observed frequency (multinomial). The Yates' corrected chi-squared statistic (Y2) is calculated as follows (from Table 1): In some texts, its use has been recommended (Everitt 1992: 14, Butler 1985: 122, and Woods et al 1986: 146), but current statistical textbooks report that the correction is less important than it was once thought (Agresti 1990: 68). Fisher's exact test may be used for tables with small expected frequencies, as an alternative to the chi-squared test. It uses the observed frequencies themselves to find the probability (P) of obtaining any particular arrangement of frequencies a, b, c, and d (again from Table 1): where a! is 'a factorial' (the product of a and all the whole numbers less than it, down to one, 0! = 1). The P value is then compared directly to the probability level, e.g. 0.05 for 5%, or 0.01 for 1%, to indicate departure from the null hypothesis in a specific direction. It is a onetailed test whereas the chi-squared is two-tailed. The P value may be doubled in order to compare it with the probability obtained through the chi-squared test. Fisher's exact test is computationally expensive since it involves calculating factorials, and the value of P for every possible arrangement of frequencies keeping the marginal totals fixed. Dunning (1993) reported that we should not rely on the assumption of a normal distribution when performing statistical text analysis and suggested that parametric analysis based on the binomial or multinomial distributions is a better alternative for smaller texts. Dunning also went on to propose the log-likelihood ratio as an alternative to Pearson's chisquared test, and he demonstrated this for the extraction of significant bigrams from text. Conversely, Mosteller and Rourke (1973: 162) state that the chi-squared statistic assumes a multinomial distribution, as do Cressie and Read (1994). Woods et al (1986: 188) describe the chi-squared test for association as non-parametric and state that it makes no special distributional assumptions of normality. There seems to be some confusion in the literature. Everitt (1992: 5-8) explains the situation more clearly. It is the observed frequencies that are assumed to follow a multinomial distribution, whereas the chi-squared distribution (which is used to calculate and tabulate critical values) arises from the normal distribution. Some papers in the literature report that the chi-squared statistic becomes unreliable when the expected frequency is too small, and possibly overestimates significance with high frequency words and when comparing a relatively small corpus to a much larger one. The former of these vague terms has been taken as meaning that all expected values must be greater than 5 (for example, Butler 1985: 117, Woods et al 1986: 144), and sometimes the same limit is applied to the observed frequencies (De Cock, 1998 and Nelson et al, 2002: 277). It was Cochran (1954) who suggested a rule that 4 in 5 (80%) of the expected values in an r " c table should be 5 or more. In the 2 " 2 table case, this means all cells should have expected values of 5 or more. Everitt (1992: 39) cites other more recent work than Cochran, which suggests that this rule is too conservative. Butler (1985: 117) suggests one possible solution to this is to combine frequencies until the combined classes have an expected frequency of 5 or more; likewise Nelson et al (2002: 277) for the observed frequencies, but Everitt (1992: 41) argues against this practice. Everitt (1992: 72) also mentions that the chi-squared statistic is "easily shown to be an approximation to" the log-likelihood for large samples. The two statistics take similar values for many tables. Williams (1976) notes that the log-likelihood is preferable to Pearson's chisquared in general. Everitt (1992: 18) also notes that the chi-squared test, Yates' corrected chi-squared and Fisher's exact test are equivalent in large samples. The obvious question, then, is: what constitutes a large sample? Kretzschmar et al (1997) start to answer the question by estimating sample sizes for various confidence levels. Scott (2001b) uses the loglikelihood statistic in his keywords procedure, as we shall see below. For the 2 " 2 case (in Table 1), the log-likelihood ratio is calculated as follows: G2 = 2 (alna + blnb + clnc + dlnd + NlnN - (a+b)ln(a+b) - (a+c)ln(a+c) - (b+d)ln(b+d) (c+d)ln(c+d)) Cressie and Read (1984) show that Pearson's X2 (chi-squared) and the likelihood ratio G2 (Dunning's log-likelihood) are, in fact, two statistics in a continuum defined by the powerdivergence family of statistics. They go on to describe this family in later work (1988, 1989). Here, they also make reference to the long and continuing discussion (since 1900) of the normal and chi-squared approximations for X2 and G2, and 2 " 2 contingency tables, during which many alternative tests have been devised (Yates, 1984). Finally, we can present the key word approach taken by Scott, which takes a systematic approach to the comparison of word frequency lists (Scott 1997, 1998, 2001a). Tribble (2000: 79-80) describes the way that the WordSmith tool finds keywords as follows: 1. frequency sorted wordlists are generated for the 'reference' corpus and for the research text or texts 2. each word in the research text is compared with its equivalent in the reference text and the program evaluates a statistical test based on the log-likelihood procedure to calculate the keyness 3. the wordlist for the research corpus is reordered in terms of the 'keyness' of each word Scott (1997) sets a minimum threshold of two occurrences for each word in the research text, although this does result in manually identified keywords being omitted from the keywords database (Scott, 2001b: 118). Other words with frequencies that violate the Cochran rule are still included in the keyword listing since, in practice, they are still interesting. The resulting keyword list contains two types of keyword: positive (those which are unusually frequent in the target corpus in comparison with the reference corpus), and negative (those which are unusually infrequent in the target corpus). These correspond to the terms overuse and underuse used in the learner corpus literature to describe the same observations. Tribble compares the list of positive and negative keywords against the frequency list for his corpus and demonstrates the improved usefulness of the keyword technique over simple frequencies for extracting interesting lexical items for stylistic studies. Scott also uses the notion of keykeywords. These are words that are key in all, or a large percentage, of the texts that are contained in the corpus under investigation. Tribble uses this feature to select lexical items that (may) give pedagogical insights in respect to (particular) genre(s). Key-keywords give us an insight into the dispersion of a key word in the corpus. Having reviewed how the key words procedure works and the different possibilities for the statistical apparatus that is used in the procedure, we now turn our attention to (some of the) studies which have applied the word frequency and key word approaches to historical data. 2.2. Key word studies relating to historical data Sub-branches within the field of historical linguistics have a long tradition of utilising corpusbased techniques (see, e.g., Risannen et al 1993 for an example of early studies made possible because of the advent of the computer/computer-based techniques, and Jucker et al. 1999: 1620 for an overview of the impact of computerisation on historical linguistic research methods). As such, it may surprise the reader to learn that there are relatively few studies of historical data that make use of the key words approach. Several of these (e.g., Culpeper 2002, sections of Culpeper and Archer 2008, Mahlberg 2007a, Mahlberg 2007b, Mahlberg forthcoming, Archer et al. forthcoming) explore classic English literature, whilst others explore specific activity types such as the historical English courtroom (see, e.g., Archer 2006 and sections of Culpeper and Archer 2008) or specific topics such as swearing (see, e.g., McEnery 2005, McEnery forthcoming). It is worth noting that the majority of these studies make use of Mike Scott's WordSmith Tools programme. Yet, only two of the above - Culpeper (2002) and Scott and Tribble (2006) – are mentioned in Scott's online lists of key word studies.1 Both explore Romeo and Juliet: Scott and Tribble compare the latter to a number of reference corpora (other Shakespearean tragedies, the Complete Works of Shakespeare and the British National Corpus (BNC)) to determine the extent to which the choice of reference corpora affected the key word results for Romeo and Juliet; Culpeper (2002) explores the extent to which key words can be used to identify the characteristics of six characters from the play. His choice of reference corpus was thus the Romeo and Juliet play itself minus the particular character's speech he was investigating at that time. Interestingly, Culpeper opted to utilise a modern edition of Romeo and Juliet (W. J. Craig's 1914 edition) as opposed to, for example, the First Folio from the Oxford Text Archive. His reasoning is that he wished to avoid as much spelling variation as possible, not least because "spelling variation is perhaps the greatest obstacle in the statistical manipulation of historical texts" (Culpeper 2002: 14). The idea that multiple variant spellings within a text greatly hinder standard corpus linguistic methods (such as frequency profiling, concordancing and key word analysis) is commonly-held (e.g. Markus, 2002). Indeed, it is highlighted by Archer and Culpeper (forthcoming, 2009) in their key word (key part-of-speech and key domain) study of EmodE social dyads (taken from comedy plays and trial proceedings), and by Archer et al. (forthcoming) in their key word (and key domain) comparison of Shakespearean lovecomedies and love-tragedies.2 It is the belief that spelling variation adversely affects the 1 See: http://www.lexically.net./wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm). Not all researchers who have employed key word techniques on historical data explicitly raise the issue of multiple spelling variants. This should not be taken as a sign that they do not regard the latter as a problem. On the contrary, they may have sidestepped the problem of spelling variation altogether by opting for modern editions and/or they may work on texts that represent an historical period where spelling was relatively fixed 2 accuracy of the statistical manipulation of historical texts which also prompted a number of researchers to develop a variant detector that can detect and normalise spellings, using a variety of computational techniques (see, e.g., Archer et al. 2003; Rayson et al 2005). Prior to this paper, however, no specific work had been undertaken to test the degree to which key word results are affected by multiple spelling variants (as far as we are aware). We seek to address this, here, by quantifying the effect of historical spelling variation on the lists of key words extracted from corpora. 3. Case study 3.1. The extent of spelling variation The aim for the first part of the analysis presented here was to discover, quantitatively, the extent of spelling variation in the Early Modern English (EModE) period, not least because many researchers comment on the large amount of spelling variation within the period without explicitly quantifying it (see, e.g., Vallins and Scragg (1965); Görlach (1991)). One exception is Schneider (2002) who, in his attempts to develop a normalised version of the Zurich English Newspaper (ZEN) Corpus (1670-1799) 3 , produced an overview of the spelling variations contained within. Schneider found that 3.99% of the tokens and 38.02% of the types within the corpus were unrecognised by the ENGCG tagger4, and hence could be considered spelling variants. The corpus was also split into four time periods, 1670-1709, 1710-1739, 1740-1769 and 1770-1799. The percentage of unrecognized tokens and types reduced in each subsequent time period, from 4.66% tokens and 36.57% types in the 16701709 sub-corpus to 2.85% tokens and 26.06% types in the 1770-1799 sub-corpus. As this paper will cover the entire EModE period5, a more thorough quantitative study of the spelling variation within the period is required. To this end, six different corpora were analysed: The ARCHER corpus, Early English Books Online, the Innsbruck Letter corpus, the Lampeter corpus, a corpus of Early English medical writings, and a collection of Shakespeare's works. The ARCHER corpus (A Representative Corpus of Historical English Registers)6 is a multi-purpose diachronic corpus covering from 1650 to the present day (only texts dated before 1800 were used in this study). It was built to facilitate the analysis of historical change in written and speech-based registers. Early English Books Online (EEBO)7 is a collection of digital facsimiles of virtually every English printed work between 1473 and 1700; nearly 125,000 works. As digital images of texts are of no use in this study, we have been given access to 12,268 of the 25,000 works that are being transcribed into ASCII SGML (see, e.g., Mahlberg’s investigations of Dicken’s works, for example, and McEnery’s investigations of swearing, and the response to “bad language” use exhibited by political movements such as the Society for the Reformation of Bad Manners). 3 See Fries and Schneider (2000) for more details. 4 See Voutilainen and Heikkilä (1994) for details. 5 The precise dating of the EModE period is a topic of some contention, see for example Görlach (1991: 9-11). Henry V’s commitment to the vernacular in 1417 (Richardson, 1980: 727) could be considered the earliest date for the period, whilst 1776, the year of the American Declaration of Independence - “the notional birth of the first (non-insular) extraterritorial English” (Lass, 1999a: 1) could be considered the latest date. 6 We used the ARCHER-3.x version of the corpus (1990–1993/2002/2007/2010). Compiled under the supervision of Douglas Biber and Edward Finegan at Northern Arizona University, University of Southern California, University of Freiburg, University of Heidelberg, University of Helsinki, Uppsala University, University of Michigan, University of Manchester, Lancaster University, University of Bamberg and University of Zurich. 7 http://eebo.chadwyck.com/ texts as part of the EEBO Text Creation Partnership.8 The Innsbruck Letter corpus, part of the Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) corpus (Markus, 1999) is a collection of 469 complete letters dated between 1386 and 1688, a total of 182,000 words. The Lampeter corpus of Early Modern English Tracts (Schmied, 1994) is a collection of tracts and pamphlets published between 1640 and 1740. Each decade has two texts from each of the following six domains: religion, politics, economy & trade, science, law, and miscellaneous; resulting in a corpus of 120 texts and c. 1.1 millions words. The Early Modern English Medical Texts (EMEMT) corpus (Taavitsainen et al., forthcoming; Taavitsainen and Pahta, 1997 and forthcoming) is a collection of specifically medical texts built to study the evolution of medical writing. The portion of the corpus available to us covers 1525 to 1700. The collection of Shakespeare's works is a digitally-transcribed version of the first folio, which was printed in 1623. This can be sourced from the Oxford Text Archive9. Shakespeare's works were written between c. 1590 and c. 1613. A summary of the corpora used in this study is shown in Table 210. Corpus Genre and Type ARCHER General / Mixed EEBO General / Mixed Innsbruck Letters Lampeter Religion, Politics, Economy & Trade, Science, Law, and Miscellaneous tracts and pamphlets Medical texts EMEMT Shakespeare All plays (Comedies, Histories and First Folio Tragedies) from the First Folio. Years Eligible11 16601799 14701709 14101689 16401739 Texts Eligible 364 Tokens Eligible 632,639 12,265 535,910,150 436 170,538 120 1,124,131 15401699 c1590c161312 51 491,384 36 821,123 Table 2 - Summary of corpora used in study The total coverage of the corpora used in the study dates from 1410 to 1799; thus representing the entire EModE period. The corpora are all very different, covering various genres and text types. It is important to note that the corpora are never combined in our study and are always treated as separate entities. The first stage of the study involved sampling each corpus at regular intervals in order to gain a fair representation of the corpus over time. A sample period of ten years was chosen, hence the texts were split into their relevant decade (e.g. 1410 – 1419). This level of sampling 8 http://www.lib.umich.edu/tcp/eebo/ http://ota.ahds.ac.uk/ 10 We wish to thank Manfred Markus for allowing us to use the Innsbruck Letter Corpus and Irma Taavitsainen for providing us with a copy of the EMEMT corpus. 11 The full decade range was not used from all corpora due to texts dating too far from the EModE period or a lack of texts and/or words from certain decades. 12 It should be noted that the dates given for Shakespeare’s plays are estimates as there is considerable debate in respect to precise dating. In any case, the First Folio was printed in 1623 and it is difficult to know exactly the extent to which the editors adhered to the original source of each play. The Shakespeare plays cover only a small section of the EModE period and are included here to show any contrast between them and other corpora from the same time period. 9 did mean a small number of decades were omitted in certain corpora due to a lack of texts and/or words. The smaller EMEMT corpus could not be sampled in this way due to many decades containing only one or two files, or a small number of words; therefore the decision was made to include everything from the EMEMT corpus with a minimum of two files per decade. All results were normalised to a percentage in order to compare corpora with different sample sizes. The sampling sizes for each corpus are shown in Table 3. Corpus ARCHER EEBO Innsbruck Lampeter EMEMT Shakespeare First Folio Decade Sample Size 4,000 80,000 1,200 40,000 Total Possible 60,000 Minimum Texts 10 10 4 10 2 4 Decades not included due to a lack of texts and/or words 1740 1420, 1430, 1490, 1590 1620, 1640 Table 3 - Corpus sample sizes. For the more general corpora (ARCHER, EEBO and Lampeter), a minimum of ten texts per decade were required to ensure that one text did not account for more than 10% of a decade's sample. Elsewhere, a smaller number of texts were sufficient due to the fact that the specialised form of the corpora resulted in less variety of text. Samples were chosen from randomly selected texts from each decade, with the sample from each text beginning at a randomly selected index (word count) within the text. In order to discover the extent of spelling variation per corpus and per decade, each word in a given historical sample was compared to a modern word list derived from the Spell Checking Oriented Word Lists (SCOWL) 13 and a word list containing words with a frequency greater than 5 in the British National Corpus (BNC) (Leech et al., 2001). If a word was not found in the modern word lists it was classed as a spelling variant. This analysis provided a percentage of variant types and tokens per corpus and per decade sample. The variant type percentages are plotted in Figure 1 and the variant token percentages are plotted in Figure 2. An average variant percentage over all the available corpora for each decade was also calculated; this is shown for types in Figure 3 and for tokens in Figure 4. The general trend line is shown with a dotted line in all four graphs. 13 http://wordlist.sourceforge.net/scowl-readme 100 ARCHER EEBO Innsbruck Lampeter EMEMT Shakespeare Average Trend 90 80 % Variant Types 70 60 50 40 30 20 10 1400 1450 1500 1550 1600 Decade 1650 1700 1750 1800 Figure 1 - Graph showing variant types % in all corpora over time. 100 ARCHER EEBO Innsbruck Lampeter EMEMT Shakespeare Average Trend 90 80 % Variant Tokens 70 60 50 40 30 20 10 1400 1450 1500 1550 1600 Decade 1650 1700 1750 1800 Figure 2 - Graph showing variant tokens % in all corpora over time. Figure 3 - Average variant types % over corpora available for each decade. Figure 4 - Average variant tokens % over corpora available for each decade. Figures 1-4 all show a definite downwards trend in respect to the amount of spelling variation occurring throughout the EModE period. This not only corroborates Schneider's (2002) quantitative analysis of the ZEN Corpus for the latter part of the EModE period (1670-1799), but also quantifies the trend over the entire EModE period, verifying many scholar's claims that the language was under significant change throughout the period (see, e.g., Görlach, 1991:8-9; Lass, 1999b: 56, Rissanen, 1999: 187). Another point to note is that the rate of reduction in variation slows from around 1700; this is particularly noticeable in the graphs representing tokens (Figures 2 and 4). This backs up Görlach's (1991: 11) claim that, by 1700, the language had achieved "considerable homogeneity," with regional (written) dialect differences no longer present and "the period of co-existing variants, so typical of all levels of EModE, being over." It should be noted that the variant percentages shown in Figures 1-4 do not represent absolutely precise variant rates; they are all approximate values. It is extremely difficult to precisely calculate variant rates for large samples of text due to the problems involved when computing which words are actually variants – automatically processing large samples is necessary due to the time required for manual normalisation. First, so called 'real-word errors' are a concern; these are undetectable when comparing to a modern word list (as in this study), contextual knowledge is required to distinguish between variants which happen to match another modern word and words which are spelt in the 'standardised' modern form, e.g. "be" for "bee". An analysis of two small manually standardised samples (one from the Lampeter corpus, the other from Shakespeare's First Folio) used in a previous study (see Rayson et al., 2007) indicated the real-word error rates shown in Table 4. These figures are relatively low compared to real-word error rates in Modern English. By way of illustration, Peterson (1986) found that between 2% and 16% of typing errors would be undetected depending on the size of the word list used. Mitton (1987) found much larger rates; 40% of the spelling errors found in his study were real-word errors. In addition, when we opted to replicate the procedure we have used to analyse the Lampeter and Shakespeare samples (see above) on a manuallyprocessed corpus of child language spelling errors we found that 24.07% of the variant types identified and 18.31% of variant tokens identified were real-word errors. Sample Total words % of words which required normalisation (i.e. variants) Types Tokens Types Tokens Lampeter 839 2,726 19.19% 9.61% Shakespeare 897 3,991 63.88% 24.03% % of variants which are realword errors % of words erroneously marked as variants14 Types Tokens Types Tokens 4.35% 2.67% 12.04% 4.37% 8.55% 5.11% 7.80% 3.38% Table 4 – Analysis of variants found in manually standardised EModE samples. Table 4 indicates another problem when detecting spelling variants automatically - words incorrectly marked as variants. These may include proper nouns, encoded words (e.g. with Unicode entity values), words in languages other than English (e.g. Latin and French) and words which are simply not in the modern word list but are perfectly valid (e.g. archaic and obsolete words such as betwixt and howbeit). All of the problems listed occur in some of the corpora used in this study. Whilst a large amount of time was spent "cleaning" the texts, it is impossible to remove all imperfections. EEBO, for example, contains many Unicode entities for which there is no obvious ASCII replacement, and any word containing one (or more) of these values will be counted as a variant by our detector. Lampeter, ARCHER, Innsbruck and EEBO are known to contain sections of Latin and, in some cases, French passages; some of these passages will no doubt have been passed into the corpora samples. Aside from the odd exception all words in these foreign passages will be counted as variants. Proper nouns invariably cause problems when detecting spelling errors, whether in historical texts or in modern spell checkers. Due to the potentially large number of proper nouns which could be found within any text, it is not sensible to try and list them all (although adding more frequent proper nouns is a sensible first step). A common-sense approach to the problem would be to exploit the rule that proper nouns always begin with a capital letter in Modern English; this, however, does not work in all cases as a capital letter is also used to signify the start of a sentence. The problem is even worse in EModE, particularly in later EModE texts. Osselton (1998) describes how between 1550 and 1750 there was a distinct climb in the use of a capital letter to begin nouns where one would not be present in Modern English. The effect of this proper noun "problem" is evaluated in Figures 5 and 6 where the EEBO corpus samples are analysed as above and also by counting all words beginning with a capital as non-variants. As can be seen, variant counts are consistently lower if words with initial capitals are not considered as variants. However, the general downward trend remains the same with the lines following almost parallel paths. Marking all initial capital words as non-variants will no doubt lead to an increase in real-word errors due to "abnormal" capitalization of words which are also variants, sentence initial variants and inconsistently spelt proper nouns. 14 These are words which are “detected” as variants after the text has been normalised. Figure 5 - Comparison of variant type counts in EEBO corpus samples with (=original) and without initial capital words. Figure 6 - Comparison of variant tokens counts in EEBO corpus samples with (=original) and without initial capital words. It is clear that the level of variation displayed in Figures 1-6 are approximations.15 However, it is reasonable to assume that the level of "noise" leading to inaccuracies is relatively uniform throughout corpus samples and thus the general trend of spelling variation reducing over time throughout the Early Modern English period is maintained. 3.2. The effect of spelling variation on keyword analysis The second part of our case study analyzes the effect caused by the levels of spelling variation described in the previous section. The focus of our analysis will be the effect on key word lists, as described in section 2.1. In order to discover any effect caused by spelling variation, key word lists need to be formulated before and after spelling variation is removed, thus, any change in the key word list rankings will indicate an effect of spelling variation. Producing versions of texts or corpora with spelling variation removed is no simple task; except for very small samples, manually standardising texts is an exceedingly timeconsuming process. Fortunately, one of the corpora in our study, the Innsbruck Letter Corpus, has been standardised and manually checked. The standardised corpus contains parallel line pairs, the first line in each pair contains the original text, the second line contains a standardised version of the first line with any spelling variants replaced with modern English word equivalents. The corpus was split into two parts, one containing just the original text lines (this was sampled in section 3.1), the other containing the standardised equivalent lines. This resulted in two separate corpora on which a key word analysis could be completed, and the differences between the lists analysed. For this particular part of our study, log-likelihood was used to identify key words, and Wmatrix (Rayson, 2007) was used to produce key word lists. The BNC Written 15 The average Shakespeare decade sample variant rates for types and tokens respectively were 51.84% and 21.41%, compared to 63.08% and 23.04% in the manually processed sample. For Lampeter the average decade sample variant rates were (types/tokens) 22.64%/5.50% and for the manually processed sample: 19.19%/9.61%. Sampler16 was used as a reference corpus. Any word with a log-likelihood greater than or equal to 6.63 (p < 0.01 for 1 d.f.) was considered key, and any word with a frequency less than 5 in either the Innsbruck Letter Corpus (before or after standardisation) or the BNC sample was removed from the key word list. We included both overused and underused words that were considered key. After this filtering process, two key word lists remained, one representing the original corpus and the other representing the standardised corpus, each containing the same list of words along with their log-likelihood value representing each word's keyness in its parent (original or standardised) corpus. It was important that both lists contained the same list of words, as we wanted to analyse the effect on key word list ranks, not the number of extra variants appearing in the original list. Our hypothesis was that whilst there will be some similarity between the key word list rankings from the original corpus and the standardised corpus due to them originating from essentially the same texts, we expect a large deviation in the rankings; therefore showing a degradation in accuracy due to spelling variation. We wished to both prove this hypothesis and quantify the amount of deviation. In order to calculate the difference between the two key word lists, rank correlation was used. Rank correlation measures the correspondence between two different rankings on the same set of items and returns a value between -1 and 1; -1 is returned if one ranking is the exact reverse of the other, 0 is returned if the rankings are completely independent and 1 is returned if the two rankings are exactly the same. For this study, two rank correlation statistics were used: Spearman's Rank Correlation Coefficient (Spearman, 1904) and Kendall's Tau Rank Correlation Coefficient (Kendall, 1938). The first stage was to produce a set of log-likelihood observation pairs, these were created by performing a look-up of the log-likelihood values from both lists for each word. Both rank correlation statistics convert the log-likelihood values into ranks; that is every word will have a rank associated to it representing where the word appears in each list sorted descending by log-likelihood. For Spearman's Rank Correlation Coefficient the differences ( ) between each word's ranks are calculated, then the coefficient ( ) is given by: where is the number of words. Kendall's Tau Rank Correlation Coefficient works slightly differently in that it looks at the difference between each possible pairing in one list, if the sign of this difference (whether it is greater than, equal to, or less than 0) is equal to the sign of the difference between the same pair in the other list a concordant pair is counted ( ), otherwise a discordant pair is counted ( ). The coefficient ( ) is then calculated with: with again representing the number of words. Both rank correlation statistics were calculated on the paired log-likelihoods as described above and the results are shown in Table 5. Both figures show that whilst there is some correlation between the two key word lists there is a definite difference between the rankings of the standardised version's key word list and the original version's key word list. We can therefore confirm our original hypothesis (i.e. a deviation in the rankings of some key 16 Although clearly not the best match as a comparable corpus since it is from a different time period and design to the historical corpora, this effect will be minimised since we are using the same reference corpus for both before and after standardisation corpus comparisons. For more details about the BNC Sampler, see: http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products#sampler words) and conclude that spelling variation does have an effect on key word analysis of the Innsbruck Letter Corpus. Rank Correlation Method Spearman's Rank Correlation Coefficient Kendall's Tau Rank Correlation Coefficient Score 0.7045437 0.5304464 Table 5 - Rank correlations found when comparing the original and standardised versions of the Innsbruck Letter Corpus. In order to further show the effect of spelling variation on key word analysis, we wished to analyse key word lists before and after standardisation of samples from different time periods. Our hypothesis was that there would be more differentiation between the key word lists for samples that represent the earlier centuries of the EModE period, due to the greater level(s) of spelling variation evidenced at that time (as shown in section 3.1). As with the key words analysis of the Innsbruck Letter Corpus, we required both original and standardised versions of a corpus, this time sampled at regular intervals throughout the EModE period. Due to the significant amount of time required to manually standardise large samples, automatically (partly) standardised samples were deemed sufficient to detect a trend. A tool, named VARD (Rayson et al, forthcoming; Baron and Rayson, 2008), has been developed which can perform automatic standardisation of historical texts. The tool inserts modern equivalents alongside any historical spelling variants where the probability of a match is greater than a threshold set by the user. The tool does not successfully replace all spelling variants in a given text automatically, however a large amount of spelling variants can be dealt with before the user manually processes the remaining variants. For this study we decided to use the EEBO corpus as it covers the EModE period and has enough texts available per decade to build a large sample. The same decade samples used in section 3.1 were processed by VARD, producing partly standardised matching samples. As with the Innsbruck corpus, both versions of the samples were then processed with Wmatrix to produce two key word lists per sample. These lists were then filtered exactly as before, after which Spearman's Rank Correlation Coefficient and Kendall's Tau Rank Correlation Coefficient were calculated for each decade sample. The two coefficients are plotted in Figure 7 and Figure 8, with the dotted line showing the average trend. Figure 7 - Graph showing Spearman's Rank Correlation Coefficients comparing EEBO decade samples' key word lists before and after automatic standardisation. Figure 8 - Graph showing Kendall's Tau Rank Correlation Coefficients comparing EEBO decade samples' key word lists before and after automatic standardisation. The two graphs show erratic results for the earliest decade samples. This is mirrored (although not to the same extent) in the variant rates shown in Figures 1-6. This can be explained by examining the samples, especially that for 1510-19, a local maximum in Figures 7 and 8. The sample for 1510-19 contains a large section of foreign translations, containing many different languages. It is not possible to standardise this section, and so the standardised version will be more similar to the original version. This is shown in Figures 9 and 10, where the amount of spelling variation remaining after automatic standardisation is both higher and more erratic for the earlier decade samples. Figure 9 - Graph showing the frequency of spelling variant types in the EEBO samples before and after automatic standardisation. Figure 10 - Graph showing the frequency of spelling variant tokens in the EEBO samples before and after automatic standardisation. Noise in corpora of this nature is unavoidable and will have an influence on the results, also the effect of spelling variation is underestimated due to spelling variation still remaining (shown in Figures 9 and 10). However, the general upwards trend can be clearly seen for both coefficients, indicating an increase in correlation between the two key word lists the later the decade of the sample. We can conclude that a reduction in spelling variation over time produces less effect on key word analysis, thus proving our hypothesis. 4. Conclusion and future work In this paper, we have given an overview of the use of word frequency profiles and key words in corpus linguistics. We began with a review of the various statistics used when comparing word frequencies between corpora. We also noted that the studies that have exploited the key words technique on historical data have tended to use modernised versions of their datasets in order to sidestep the issue of spelling variation. In the case study presented in this paper, we carried out a quantitative analysis of spelling variation in a set of well-known historical corpora. The trends identified match the expected rapid decline in spelling variation until around 1700. For the first time, we have been able to quantify the extent of spelling variation in these corpora. The second part of the case study showed the effect of this variation on the key words procedure. We were able to demonstrate how the key words lists were affected by comparing the lists produced from original historical data with that of a standardised version of the corpora. We also showed that the reduction in spelling variation over time has a knockon effect on key word accuracy, with samples from later decades suffering less of an effect from spelling variation. We will continue to refine our techniques for detecting historical spelling variants, including, for example, contextual clues to detect so called 'real-word errors' as described in section 3. However, the quantitative trends presented here are already clear enough. Researchers using frequency-based techniques on non-standardised historical datasets should be wary of spelling variation and need to exercise caution when interpreting key words analyses carried out on such data. Where standardised versions of corpora are available, the results obtained from them can be considered more robust. However, where it is unfeasible to carry out manual standardisation, for example on the vast digitised textual resources such as EEBO, there is a need for a tool which can detect historical variants and automatically standardise them in a pre-processing step for the application of key words and other modern corpus linguistic procedures. A prototype for such a tool, VARD, was discussed in section 3.2. In future work, we plan to further develop the VARD tool. Currently, VARD employs the following procedures as a means of detecting variants, and mapping them to their 'modern' equivalents: a manually produced list of variants, SoundEx phonetic matching, edit distance and letter replacement heuristics. But these procedures are merely dealing with the surface forms of words. We will therefore be attempting semantic disambiguation in the near future so that we can also begin to distinguish the underlying meanings of words and their variants. This is important in respect to variants such as 'peece', which have more than one potential modern form (i.e. 'peace' and 'piece'). It is worth noting that, by adding a semantic component to the VARD, we have come full circle in our research endeavour as the VARD initially grew out of attempts to develop an historical version of the Wmatrix tool so that we could semantically annotate historical texts automatically (see Archer et al 2003). A related issue is the problem of 'realword errors', as previously mentioned in section 3.1. Variants such as 'bee' for 'be' or 'then' for 'than' are impossible to detect with a dictionary check alone. Therefore, future work will involve using part-of-speech and semantic information to detect potential spelling variants of this type in order to achieve more accurate automatic standardisation. However, it is important to be aware that this is a circular issue in that spelling variation will have an effect on part-of-speech and semantic tagging accuracy as shown by Rayson et al (2007) and Archer et al (2003) respectively. One solution may be to incorporate the part-of-speech tagger in the variant detection process; this has been partially explored by Atwell and Elliot (1987). 4.1. Investigating spelling from a diachronic perspective Although our main aim in this paper has been determining the effect that spelling variation has on (the meaningfulness of) keyness results, we effectively provide a means of quantifying the ongoing process of standardisation of written English throughout the EModE period, as witnessed by the decreasing levels of spelling variation. Moreover, we do so by exploring written texts that are both representative of the different centuries (and decades within) that make up the EModE period and also representative of different genres (i.e. plays, letters, medical texts, etc.). To our knowledge, we are the first to do this systematically on such a large scale: prior to Schneider (2002), who looks at the Zen corpus, this study and an earlier study by Archer and Rayson (2004) (for details of which, see below), most studies that have explored spelling from a diachronic (i.e. historical) perspective have tended to be qualitative in focus, that is, they have attended to the most obvious spelling patterns for a given period. Smith (2005: 222), for example, comments on the following patterns for Shakespearean English: the inter-changeability of <u> / <v> (depending on their initial/medial positioning), the use of <i> to represent <j> and the use of <vv> for <w> (see also Blake 1996; Scragg 1974). This focus is not surprising, given that these are the patterns that will strike the consciousness of the researcher as they read through texts. But it means that patterns below the level of consciousness – patterns that, for example, might be more subtle or only emerge across many texts – go unnoticed. The VARD tool therefore affords us with the opportunity to begin exploring spelling variability more subtly and systematically, whilst also determining the point(s) at which standardisation occurred (depending on the genre(s) under investigation). In respect to our own future work, we plan to assess the extent to which genre plays a part in which variants are used as well as the extent to which they are used. This work will build on Archer and Rayson's (2004) study of 3,823 spelling variants in a variety of texttypes representative of the 17th, 18th and 19th centuries, which appears to suggest that levels of spelling variation differed quite substantially across individual genres. For example, they examined a seventeenth century Newsbook Corpus, which effectively contained 296 occurrences per million words (of the 3,823 forms identified by them) – a frequency that seems very low, when compared to the 2,247 occurrences (per million words) found in (the seventeenth century component of) the Lampeter dataset. As Culpeper and Archer (forthcoming) highlight, the latter effectively contains genres - 'science', 'religion', 'politics', 'law' and 'economy' - which are regarded as having some of the very factors that are meant to provide a motivating force for standardisation (i.e. prestige and power). Yet, Archer and Rayson's study suggests that the more broad-based, popular genre of newsbooks was in the vanguard instead in the seventeenth century. Such a (surprising) result merits the type of systematic diachronic comparison of genres that the current study affords. References Agresti, Alan. Categorical data analysis. New York: Wiley, 1990. Archer, Dawn. "Tracing the development of 'advocacy' in two nineteenth century English trials." Diachronic Perspectives on Domain-Specific English. Eds. Marina Dossena and Irma Taavitsainen. Bern: Peter Lang; Linguistic Insights series, 2006. Archer, Dawn and Jonathan Culpeper. "Identifying key sociophilological usage in plays and trial proceedings (1640-1760): An empirical approach via corpus annotation." Journal of Historical Pragmatics Sociopragmatics Special Issue. Ed. Jonathan Culpeper. Forthcoming 2009. Archer, Dawn and Paul Rayson. "Using an historical semantic tagger as a diagnostic tool for variation in spelling." Thirteenth International Conference on English Historical Linguistics (ICEHL 13). Vienna, Austria: University of Vienna. (23-29 Aug. 2004). Archer, Dawn, Tony McEnery, Paul Rayson and Andrew Hardie. "Developing an automated semantic analysis system for Early Modern English." Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. Eds. Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery. (2003): 22-31. Archer, Dawn, Jonathan Culpeper, and Paul Rayson. "Love – 'a familiar or a devil'? An Exploration of Key Domains in Shakespeare's Comedies and Tragedies." What's in a word-list? Investigating word frequency and keyword extraction. Ed. Dawn Archer. Ashgate, forthcoming. Atwell, Eric and Stephen Elliot. "Dealing with ill-formed English text." The Computational Analysis of English. A corpus-based approach. Eds. Roger Garside, Geoffrey Leech and Geoffrey Sampson. London/New York: Longman, 1987. Barnett, Stephen and Timothy M. Cronin. Mathematical formulae for engineering and science students. 4th ed. London: Longman, 1986. Baron, Alistair and Paul Rayson. "VARD 2: A tool for dealing with spelling variation in historical corpora." Proceedings of the Postgraduate Conference in Corpus Linguistics. Birmingham: Aston University (22 May 2008). Blake, Norman F. Shakespeare's Language: An Introduction. 2nd ed. Basingstoke: Macmillan, 1996. Butler, Christopher. Statistics in linguistics. Oxford: Blackwell, 1985. Church, Kenneth W. and William A. Gale. "Poisson mixtures." Natural Language Engineering 1.2. Cambridge: Cambridge University Press, 1995. 163-190. Cochran, William G. "Some methods for strengthening the common !2 tests." Biometrics 10 (1954): 417-451. Copeck, Terry, Ken Barker, Sylvain Delisle and Stan Szpakowicz. "More Alike than not - An Analysis of Word Frequencies in Four General-purpose Text Corpora." Proceedings of the 1999 Pacific Association for Computational Linguistics Conference (PACLING 99). Ontario, Canada: Waterloo (25-28 Aug. 1999): 282-287. Cressie, Noel and Timothy R. C. Read. "Multinomial Goodness-of-Fit Tests." Journal of the Royal Statistical Society. Series B (Methodological) 46.3 (1984): 440-464. Cressie, Noel and Timothy R. C. Read. "Pearson's X2 and the Log Likelihood Ratio Statistic G2: A comparative review." International Statistical Review 57.1. Belfast: Belfast University Press, 1989. 19-43. Culpeper, Jonathan. "Computers, language and characterisation: An analysis of six characters in Romeo and Juliet." Conversation in Life and in Literature: Papers from the ASLA Symposium 15. Eds. Ulla Merlander-Marttala, Carin Ostman and Merja Kytö. Uppsala : Universitetstryckeriet, 2002. 11-30. Culpeper, Jonathan and Dawn Archer. "The History of English Spelling." English Language and Linguistics. Eds. Jonathan Culpeper, Francis Katamba, Paul Kerswill, Ruth Wodak and Tony McEnery. Basingstoke, UK: Palgrave Macmillan, forthcoming. Davies, Mark. A frequency dictionary of Spanish. London: Routledge, 2006. Davies, Mark. and Ana Maria Raposo Preto-Bay. A frequency dictionary of Portuguese. London: Routledge, 2008. De Cock, Sylvie. "A recurrent word combination approach to the study of formulae in the speech of native and non-native speakers of English." International Journal of Corpus Linguistics 3.1. Amsterdam: John Benjamins, 1998. 59-80. Dunning, Ted. "Accurate Methods for the Statistics of Surprise and Coincidence." Computational Linguistics 19.1. MIT Press, March 1993. 61-74. Everitt, Brian S. The analysis of contingency tables. 2nd ed. London: Chapman and Hall, 1992. Francis, W. Nelson. and Henry Ku!era. Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin, 1982. Fries, Charles C. and A. Aileen Traver. English word lists: a study of their adaptability for instruction. Ann Arbor, Michigan: George Wahr Publishing Company, 1950. Fries, Udo and Peter Schneider. "Zen: preparing the Zurich English Newspaper Corpus." English media texts – past and present. Language and textual structure. Ed. Friedrich Ungerer. Amsterdam / Philadelphia: John Benjamins, 2000. 3-24. Görlach, Manfred. Introduction to Early Modern English. Cambridge: Cambridge University Press, 1991. Granger, Sylviane. "The computer learner corpus: a versatile new source of data for SLA research." Learner English on Computer. Ed. Sylviane Granger. London: Longman, 1998. 3-18. Gries, Stefan Th. "Exploring variability within and between corpora: some methodological considerations." Corpora 1.2 (2006): 109-151. Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee and Ylva Berglund Prytz. Corpus Linguistics with BNCweb - a Practical Guide. Frankfurt am Main, Germany: Peter Lang, 2008. Hofland, Knut and Stig Johansson. Word frequencies in British and American English. Bergen, Norway: The Norwegian Computing Centre for the Humanities, 1982. Jones, Randall and Erwin Tschirner. A frequency dictionary of German. London: Routledge, 2006. Jucker, Andreas H., Gerd Fritz and Franz Lebsanft, eds. Historical Dialogue Analysis. Amsterdam: John Benjamins, 1999. Juilland, Alphonse and Eugenio Chang-Rodríguez. Frequency dictionary of Spanish words. The Hague: Mouton & Co., 1964. Juilland, Alphonse, P. Maximilian H. Edwards, and Ileana Juilland. Frequency dictionary of Rumanian words. The Hague: Mouton & Co., 1965. Juilland, Alphonse, Dorothy Brodin, and Catherine Davidovitch. Frequency dictionary of French words. Paris: Mouton & Co., 1970. Kendall, Maurice G. "A New Measure of Rank Correlation." Biometrika 30 (1938): 81-89. Kilgarriff, Adam. "Which words are particularly characteristic of a text? A survey of statistical approaches." Language Engineering for Document Analysis and Recognition (LEDAR), AISB 96 Workshop proceedings. Eds. Lindsay J. Evett, and Tony G. Rose. Brighton, UK (April 1996a): 33-40. Kilgarriff, Adam. "Why chi-square doesn't work, and an improved LOB-Brown comparison." ALLC-ACH Conference. Bergen, Norway (June 1996b). Kilgarriff, Adam. "Using word frequency lists to measure corpus homogeneity and similarity between corpora." Proceedings 5th ACL workshop on very large corpora. Beijing and Hong Kong (1997): 231-245. Kilgarriff, Adam. "Comparing Corpora." International Journal of Corpus Linguistics 6.1. Amsterdam: John Benjamins, 2001. 97-133. Krenn, Brigitte and Christer Samuelsson. The Linguist's Guide to Statistics: Don't Panic. 19 Dec. 1997 <http://nlp.stanford.edu/fsnlp/dontpanic.pdf>. Kretzschmar, William A., Charles F. Meyer and Dominique Ingegneri. "Uses of inferential statistics in corpus studies." Corpus-based studies in English: papers from the seventeenth International Conference on English language research on computerized corpora (ICAME 17), Stockholm, May 15-19, 1996. Ed. Magnus Ljung. Amsterdam: Rodopi, 1997. 167-177. Lass, Roger. "Introduction." The Cambridge History of the English Language: Volume III, 1476-1776. Ed. Roger Lass. Cambridge: Cambridge University Press, 1999a. Lass, Roger. "Phonology and Morphology." The Cambridge History of the English Language: Volume III, 1476-1776. Ed. Roger Lass. Cambridge: Cambridge University Press, 1999b. Leech, Geoffrey and Roger Fallon. "Computer corpora - what do they tell us about culture?" ICAME Journal 16. Bergen, Norway: Norwegian Computing Centre for the Humanities, 1992. 29-50. Leech, Geoffrey, Paul Rayson and Andrew Wilson. Word Frequencies in Written and Spoken English: based on the British National Corpus. London: Longman, 2001. Lyne, Anthony A. The vocabulary of French business correspondence. Geneva: Slatkine, 1985. Mahlberg, Michaela. "Clusters, key clusters and local textual functions in Dickens." Corpora 2.1 (2007a): 1-31. Mahlberg, Michaela. "Corpora and translation studies: textual functions of lexis in Bleak House and in a translation of the novel into German." La Traduzione. Lo Stato dell'Arte. Translation. The State of the Art. Ravenna: Longo, 2007b. 115-135. Mahlberg, Michaela. "A corpus stylistic perspective on Dickens' Great Expectations." Contemporary Stylistics. Eds. Marina Lambrou and Peter Stockwell. London: Continuum, forthcoming. Markus, Manfred. "Manual of ICAMET (Innsbruck Computer-Archive of Machine-Readable English Texts)." Innsbrucker Beitraege zur Kulturwissenschaft, Anglistische Reihe 7. Innsbruck: Leopold-Franzens-Universitaet Innsbruck, Institut fuer Anglistik, 1999. Markus, Manfred. "Towards an analysis of pragmatic and stylistic features in 15th and 17th century English letters." New frontiers of corpus research: papers from the 21st International Conference on English Language Research on Computerized Corpora, Sydney, 2000. Eds. Pam Peters, Peter Collins and Adam Smith. Amsterdam: Rodopi, 2002. McEnery, Tony. Swearing in English. Bad Language, Purity and Power from 1586 to the Present. London: Routledge, 2005. McEnery, Tony. "Keywords and Moral Panics: Mary Whitehouse and Media Censorship." What's in a word-list? Investigating word frequency and keyword extraction. Ed. Dawn Archer. Ashgate, forthcoming. Mitton, Roger. "Spelling Checkers, Spelling Correctors and the Misspelling of Poor Spellers." Information Processing & Management 23.5 (1987): 495-505. Mosteller, Frederick and Robert E. K. Rourke. Sturdy statistics. Reading, Massachusetts: Addison-Wesley, 1973. Nelson, Gerald, Sean Wallis and Bas Aarts. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins, 2002. Oakes, Michael P. Statistics for corpus linguistics. Edinburgh: Edinburgh University Press, 1998. Osselton, Noel E. "Spelling-Book Rules and the Capitalization of Nouns in the Seventeenth and Eighteenth Centuries." A Reader in Early Modern English. Eds. Mats Rydén, Ingrid Tiegen-Boon van Ostade and Merja Kytö. Frankfurt am Main, Germany: Peter Lang, 1998. Pearson, Karl. "On the theory of contingency and its relation to association and normal correlation." Biometric Series 1. London: Drapers' Co. Memoirs, 1904. Peterson, James L. "A Note on Undetected Typing Errors." Communications of the ACM 29.7 (1986): 633-637. Rayson, Paul. Wmatrix: a web-based corpus processing environment. Computing Department, Lancaster University. 2007 <http://ucrel.lancs.ac.uk/wmatrix/>. Rayson, Paul, Dawn Archer and Nicholas Smith. "VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora." Proceedings of the Corpus Linguistics 2005 conference. Birmingham, UK (14-17 July 2005). Rayson, Paul, Dawn Archer, Alistair Baron, Jonathan Culpeper and Nicholas Smith. "Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora." Proceedings of Corpus Linguistics 2007. University of Birmingham, UK (27-30 July 2007). Rayson, Paul, Dawn Archer, Alistair Baron, and Nicholas Smith. "Travelling Through Time with Corpus Annotation Software." Proceedings of Practical Applications in Language and Computers (PALC) 2007. The Department of English Language at !ód" University, Poland, 19th-22nd April 2007 (forthcoming). Rayson, Paul. "From key words to key semantic domains." International Journal of Corpus Linguistics (forthcoming). Read, Timothy R. C. and Noel A. C. Cressie. "Goodness-of-fit statistics for discrete multivariate data." Springer series in statistics. New York: Springer-Verlag, 1988. Richardson, Malcolm. "Henry V, the English Chancery, and Chancery English." Speculum 55.4 (1980): 726-750. Rissanen, Matti. "Syntax." The Cambridge History of the English Language: Volume III, 1476-1776. Ed. Roger Lass. Cambridge: Cambridge University Press, 1999. Rissanen, Matti, Merja Kytö and Minna Palander-Collin, eds. Early English in the Computer Age. Explorations in the Helsinki Corpus. Berlin/New York: Mouton de Gruyter (Topics in English Linguistics), 1993. Roland, Douglas, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder and Chris Riddoch. "Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: the role of verb sense." Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000). Hong Kong (1-8 Oct. 2000): 28-34. Schmied, Josef. "The Lampeter Corpus of Early Modern English Tracts." Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora. Cambridge, March 1993. Eds. Merja Kytö, Matti Rissanen, Susan Wright. Amsterdam: Rodopi, 1994. Schneider, Peter. "Computer Assisted Spelling Normalization of 18th Century English." New Frontiers of Corpus Research: Papers from the 21st International Conference on English language Research on Computerized Corpora, Sydney, 2000. Eds. Pam Peters, Peter Collins and Adam Smith. Amsterdam: Rodopi, 2002. 199-211. Scott, Mike. "PC analysis of key words – and key key words." System 25.2. Amsterdam: Elsevier, 1997. 233-245. Scott, Mike. "Focusing on the text and its key words." Rethinking language pedagogy from a corpus perspective: papers from the third international conference on teaching and language corpora. Eds. Lou Burnard and Tony McEnery. Frankfurt: Peter Lang, 2000. 104-121. Scott, Mike. "Comparing corpora and identifying key words, collocations, and frequency distributions through the WordSmith Tools suite of computer programs" Small corpus studies and ELT: theory and practice. Eds. Mohsen Ghadessy, Alex Henry and Robert L. Roseberry. Amsterdam: John Benjamins, 2001a. 47-67. Scott, Mike. "Mapping key words to problem and solution." Patterns of Text: in honour of Michael Hoey. Eds. Mike Scott and Geoff Thompson. Amsterdam: John Benjamins, 2001b. 109-127. Scott, Mike and Christopher Tribble. Textual Patterns: keyword and corpus analysis in language education. Amsterdam: John Benjamins, 2006. Scragg, Donald C. English Spelling. Manchester: Manchester University Press, 1974. Sinclair, John. Corpus, concordance, collocation. Oxford: Oxford University Press, 1991. Sinclair, John. "A way with common words." Out of corpora: studies in honour of Stig Johansson. Eds. Hilde Hasselgård and Signe Oksefjell. Amsterdam: Rodopi, 1999. 157179. Smith, Jeremy J. Essentials of Early English: An Introduction to Old, Middle and Early Modern English. London: Routledge, 2005. Spearman, Charles. "The proof and measurement of association between two things." American Journal of Psychology 15 (1904): 72-101. Stubbs, Michael. Text and corpus analysis: computer-assisted studies of language and culture. Oxford: Blackwell, 1996. Taavitsainen, Irma and Päivi Pahta. "Corpus of Early English medical writing 1375–1750." ICAME Journal 21 (1997): 71–81. Taavitsainen, Irma and Päivi Pahta, eds. Medical Writing in Early Modern English. Cambridge: Cambridge University Press, forthcoming. Taavitsainen, Irma, Päivi Pahta, Turo Hiltunen, Martti Mäkinen, Ville Marttila, Maura Ratia, Carla Suhr and Jukka Tyrkkö. Early Modern Medical Texts. forthcoming. Tribble, Christopher. "Genres, keywords, teaching: towards a pedagogic account of the language of project proposals." Rethinking language pedagogy from a corpus perspective: papers from the third international conference on teaching and language corpora. Eds. Lou Burnard and Tony McEnery. Frankfurt: Peter Lang, 2000. 75-90. Tribble, Christopher and Glyn Jones. Concordances in the classroom. Houston, Texas: Athelstan, 1997. Vallins, George H. and Donald G. Scragg. Spelling. London: André Deutsch, 1965. Virtanen, Tuija. "The progressive in NS and NNS student compositions: evidence from the International Corpus of Learner English." Corpus-based studies in English: papers from the seventeenth International Conference on English language research on computerized corpora (ICAME 17), Stockholm, May 15-19, 1996. Ed. Magnus Ljung. Amsterdam: Rodopi, 1997. 299-309. Voutilainen, Atro and Juha Heikkilä. "An English Constraint Grammar (ENGCG) a surfacesyntactic parser of English." Creating and Using English Language Corpora. Papers from the 14th International Conference on English Language Research on Computerized Corpora, Zürich, 1993. Eds. Udo Fries, Gunnel Tottie and Peter Schneider. Amsterdam: Rodopi, 1994. 189-199. Williams, Raymond. Keywords: a vocabulary of culture and society. 2nd ed. London: Fontana Press, 1983. Wikberg, Kay. "The style marker as if (though): a corpus study." Out of corpora: studies in honour of Stig Johansson. Eds. Hilde Hasselgård and Signe Oksefjell. Amsterdam: Rodopi, 1999. 93-105. Woods, Anthony, Paul Fletcher, and Arthur Hughes. Statistics in language studies. Cambridge: Cambridge University Press, 1986. Yates, Frank. "Contingency tables involving small numbers and the chi-squared test." Journal of the Royal Statistical Society Supplement 1 (1934): 217-235. Yates, Frank. "Tests of significance for 2 " 2 contingency tables." Journal of the Royal Statistical Society, A 147.3 (1984): 426-463. Yule, George. The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press, 1944.

Log In

Word frequency and key word statistics in corpus linguistics

Related papers

Related papers