Using Bilingual Materials to Develop
Word Sense Disambiguation Methods
William A. Gale
Kenneth W. Church
David Yarowsky
AT&T Bell Laboratories
600 Mountain Avenue
P. O. Box 636
Murray Hill, NJ 07974-0636
Abstract
Word sense disambiguation has been recognized as a major problem in natural language processing
research for over forty years. Much of this work has been stymied by difficulties in acquiring appropriate
lexical resources, such as semantic networks and annotated corpora. Following the suggestion in Brown et
al. (1991a) and Dagan et al. (1991), we have achieved considerable progress recently by taking advantage
of a new source of testing and training materials. Rather than depending on small amounts of hand-labeled
text, we have been making use of relatively large amounts of parallel text, text such as the Canadian
Hansards (parliamentary debates), which are available in two (or more) languages. The translation can
often be used in lieu of hand-labeling. For example, consider the polysemous word sentence, which has
two major senses: (1) a judicial sentence, and (2), a syntactic sentence. We can collect a number of sense
(1) examples by extracting instances that are translated as peine, and we can collect a number of sense (2)
examples by extracting instances that are translated as phrase. In this way, we have been able to acquire a
considerable amount of testing and training material for developing and testing our disambiguation
algorithms.
The availability of this testing and training material has enabled us to develop quantitative disambiguation
methods that achieve 90% accuracy in discriminating between two very distinct senses of a noun such as
sentence. In the training phase, we collect a number of instances of each sense of the polysemous noun.
Then in the testing phase, we are given a new instance of the noun, and are asked to assign the instance to
one of the senses. We attempt to answer this question by comparing the context of the unknown instance
with contexts of known instances using a Bayesian argument that has been applied successfully in related
applications such as author identification and information retrieval.
The final section of the paper will describe a number of methodological studies which show that the
training set need not be large and that it need not be free from errors. Perhaps most surprisingly, we find
that the context should extend ±50 words, an order of magnitude larger than one typically finds in the
literature.
1. Word-Sense Disambiguation
Consider, for example, the word duty which has at least two quite distinct senses: (1) a tax and (2) an
obligation. Three examples of each sense are given in Table 1 below. The classic disambiguation problem
is to construct a means for discriminating between two or more sets of examples such as those shown in
Table 1. This paper will focus on the methodology required to address the classic problem, and will have
less to say about the details required for practical application of this methodology. Consequently, the
reader should exercise some caution in interpreting the 90% figure reported here; this figure could easily be
swamped out in a practical system by any number of factors that go beyond the scope of this paper. In
particular, the Canadian Hansards, one of the few currently available sources of parallel text, is
extremely unbalanced, and is therefore severely limited as a basis for a practical disambiguation system.
Moreover, it is important to distinguish the monolingual word-sense disambiguation problem from the
translation issue. It is not always necessary to resolve the word-sense ambiguity in order to translate a
polysemous word. Especially in related languages like English and French, it is common for word-sense
ambiguity to be preserved in both languages. For example, both the English noun interest and the French
equivalent intérêt are multiply ambiguous in more or less the same ways. Thus, one
cannot turn to the French to resolve the ambiguity in the English, since the word is equally ambiguous in
both languages.
Furthermore, when one word does translate to two (e.g., sentence → peine and phrase), the choice of target
translation need not indicate a sense split in the source. Consider, for example, the group of Japanese
words translated by "wearing clothes" in English. While the Japanese have five different words for
"wear" depending on which part of the body is involved, we doubt that English speakers would ever sort
"wearing shoes" and "wearing shirt" into separate categories.
These examples indicate that word-sense disambiguation and translation are somewhat different problems.
It would have been nice if the translation could always be used in lieu of hand-tagging to resolve the word-sense ambiguity, but unfortunately this is not the case. Nevertheless, the translation is often helpful for
resolving the ambiguity. It seems to us to make sense to continue to use the Hansard translations to
develop the discrimination methodology, while we continue to seek more appropriate sources of testing and
training materials. See Yarowsky (1992) for an application of the methods developed here to a somewhat
more appropriate source, a combination of the Roget's Thesaurus (Chapman, 1977)1 and the Grolier's
Encyclopedia (1991).
2. Knowledge Acquisition Bottleneck
In our view, the crux of the problem in developing methods for word sense disambiguation is to find a
strategy for acquiring a sufficiently large set of training material. We think that we have found such a
strategy by turning to parallel text as a source of testing and training materials. Most of the previous work
falls into one of three camps: (1) Qualitative Methods, e.g., Hirst (1987), (2) Dictionary-based Methods,
e.g., Lesk (1986), and (3) Hand Annotated Corpora, e.g., Kelly and Stone (1975). In each case, the work
has been limited by the knowledge acquisition bottleneck.
2.1 Qualitative Methods
For example, there has been a tradition in parts of the AI community of building large experts by hand, e.g.,
Granger (1977), Rieger (1977), Small and Rieger (1982), Hirst (1987). Unfortunately, this approach is not
very easy to scale up, as many researchers have observed: "The expert for THROW is currently six pages
long,... but it should be 10 times that size" (Small and Rieger, 1982). Since this approach is so difficult to
scale up, much of the work has had to focus on "toy" domains (e.g., Winograd's Blocks World) or
sublanguages (e.g., Isabelle (1984), Hirschman (1986)). Currently, it is not possible to find a semantic
network with the kind of broad coverage that would be required for unrestricted text.
1. This thesaurus should not be confused with the much smaller and less up-to-date 1911 edition of Roget's.
From an AI point of view, it appears that the word-sense disambiguation problem is "AI-Complete,"
meaning that you can't solve this problem until you've solved all of the other hard problems in AI. Since
this is unlikely to happen any time soon (if at all), it would seem to suggest that word-sense disambiguation
is just too hard a problem, and we should spend our time working on a simpler problem where we have a
good chance of making progress. Rather than accept this rather pessimistic conclusion, we prefer to reject
the premise and search for an alternative point of view.
2.2 Machine-Readable Dictionaries (MRDs)
Others such as Lesk (1986), Walker (1987), Ide and Veronis (1990) have turned to machine-readable
dictionaries (MRD) such as Oxford's Advanced Learner's Dictionary of Current English (OALDCE) in the
hope that MRDs might provide a way out of the knowledge acquisition bottleneck. These researchers seek
to develop a program that could read an arbitrary text and tag each word in the text with a pointer to a
particular sense number in a particular dictionary.
Unfortunately, the approach doesn't seem to work as well as one might hope. Lesk (1986) reports
accuracies of 50-70% on short samples of Pride and Prejudice. Part of the problem may be that dictionary
definitions are too short to mention all of the collocations (words that are often found in the context of a
particular sense of a polysemous word). In addition, dictionaries have much less coverage than one might
have expected. Walker (1987) reports that perhaps half of the words occurring in a new text cannot be
related to a dictionary entry.
Thus, like the AI approach, the dictionary-based approach is also limited by the knowledge acquisition
bottleneck; dictionaries simply don't record enough of the relevant information, and much of the
information that is stored in the dictionary is not in a format that computers can easily digest, at least at
present.
2.3 Approaches Based on Hand-Annotated Corpora
A third line of research makes use of hand-annotated corpora. Most of these studies are limited by the
availability of hand-annotated text. Since it is unlikely that such text will be available in large quantities
for most of the polysemous words in the vocabulary, there are serious questions about how such an
approach could be scaled up to handle unrestricted text. Nevertheless, we are extremely sympathetic with
the basic approach, and will adopt a very similar strategy ourselves. However, we will introduce one
important difference, the use of parallel text in lieu of hand-annotated text, as suggested by Brown et al.
(1991a), Dagan et al. (1991) and others.
Kelly and Stone (1975) constructed 1815 disambiguation models by hand, selecting words with a frequency
of at least 20 in a half million word corpus. Most subsequent work has sought automatic methods because
it is quite labor intensive to construct these rules by hand. Weiss (1973) first built rule sets by hand for five
words, then developed automatic procedures for building similar rule sets, which he applied to three
additional words. Unfortunately, the system was tested on the training set, so it is difficult to know how well it
actually worked.
Black (1987, 1988) studied five 4-way polysemous words using about 2000 hand tagged concordance lines
for each word. Using 1500 training examples for each word, his program constructed decision trees based
on the presence or absence of 81 "contextual categories" within the context2 of the ambiguous word. He
used three different types of contextual categories: (1) subject categories from LDOCE, the Longman
Dictionary of Contemporary English (Longman, 1978), (2) the 41 vocabulary items occurring most
frequently within two words of the ambiguous word, and (3) the 40 vocabulary items excluding function
words occurring most frequently in the concordance line. Black found that the dictionary categories
2. The context was defined to be the concordance line, which we estimate to be about ± 6 words from the ambiguous word, given that
his 2000 concordance lines contained about 26,000 words.
produced the weakest performance (47 percent correct), while the other two were quite close at 72 and 75
percent correct, respectively.
There has recently been a flurry of interest in approaches based on hand-annotated corpora. Hearst (1991)
is a very recent example of an approach somewhat like those of Black (1987, 1988), Weiss (1973) and Kelly and
Stone (1975), though she makes use of considerably more syntactic information than the
others. Her performance also seems to be somewhat better than the others', though it is difficult to compare
performance across systems.
3. An Information Retrieval (IR) Approach to Sense Disambiguation
We have been experimenting with an Information Retrieval approach to sense disambiguation. In the
training phase, we collect a number of instances of sentence that are translated as peine, and a number of
instances of sentence uses that are translated as phrase. Then in the testing phase, we are given a new
instance of sentence, and are asked to assign the instance to one of the two senses. We attempt to answer
this question by comparing the context of the unknown instance with contexts of known instances.
Basically we are treating contexts as analogous to documents in an information retrieval setting. Just as the
probabilistic retrieval model (van Rijsbergen, 1979, chapter 6; Salton, 1989, section 10.3) sorts documents
d by

    Pr(rel | d) / Pr(irrel | d) ∝ ∏_{token ∈ d} Pr(token | rel) / Pr(token | irrel),

we sort the two senses of the polysemous word by the analogous ratio

    Pr(sense1 | context) / Pr(sense2 | context) ∝ ∏_{token ∈ context} Pr(token | sense1) / Pr(token | sense2),

where Pr(token | sense) is an estimate of the probability that token appears in the context of sense1 or
sense2. Contexts are defined to extend 50 words to the left and 50 words to the right of the polysemous
word in question for reasons that will be discussed in section 5. This model ignores a number of important
linguistic factors such as word order and collocations (correlations among words in the context).
Nevertheless, there are 2V ≈ 200,000 parameters in the model. It is a non-trivial task to estimate such a
large number of parameters, especially given the sparseness of the training data. The training material
typically consists of approximately 12,000 words of text (100 words of context for 60 instances of
each of two senses). Thus, there are more than 15 parameters to be estimated for each data point.
Clearly, we need to be fairly careful given that we have so many parameters and so little evidence.
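Concretely, the comparison amounts to summing log likelihood ratios over the tokens in the unknown context, as in a naive Bayes classifier. The sketch below is ours rather than the authors' implementation: the vocabulary and training contexts are invented, and simple add-one smoothing stands in for the corpus-based interpolation described in section 3.1.

```python
import math
from collections import Counter

def train_sense_model(contexts, vocab):
    """Estimate Pr(token | sense) from training contexts (lists of tokens).
    Add-one smoothing keeps unseen tokens from getting zero probability."""
    counts = Counter(tok for ctx in contexts for tok in ctx)
    total = sum(counts.values())
    return {tok: (counts[tok] + 1) / (total + len(vocab)) for tok in vocab}

def score(context, model1, model2):
    """Sum of log likelihood ratios; positive favors sense 1."""
    return sum(math.log(model1[tok] / model2[tok])
               for tok in context if tok in model1)

# invented training data for the two senses of "duty"
vocab = {"countervailing", "import", "trade", "petition", "honour"}
tax = [["countervailing", "import", "trade"], ["countervailing", "trade"]]
obligation = [["petition", "honour"], ["honour", "petition", "petition"]]
m_tax = train_sense_model(tax, vocab)
m_obl = train_sense_model(obligation, vocab)
tax_score = score(["countervailing", "import"], m_tax, m_obl)  # positive: tax
```

A realistic model would, as the paper describes, smooth against the whole corpus and use the ±50-word contexts discussed in section 5.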
3.1 Using Global Probabilities to Smooth the Local Probabilities
In principle, the conditional probabilities, Pr(tok|sense), can be estimated by selecting those parts of the
entire corpus which satisfy the required conditions (e.g., 100-word contexts surrounding instances of one
sense of duty), counting the frequency of each word, and dividing the counts by the total number of words
satisfying the conditions. However, this estimate, which is known as the maximum likelihood estimate
(MLE), has a number of well-known problems. In particular, it will assign zero probability to words that
do not happen to appear in the sample. Zero is not only a biased estimate of their true probability, but it is
also unusable for the sense disambiguation task.
In order to avoid these problems, we have decided to use information from the entire corpus in addition to
information from the conditional sample. We will estimate Pr(tok|sense) by interpolating
between local probabilities computed within the 100-word context and global probabilities computed over
the entire corpus Pr(tok). The local probabilities are more relevant and the global probabilities are better
measured. We seek a trade-off between random measurement errors and bias errors. This is accomplished
by estimating the relevance of the larger corpus to the conditional sample in order to find the optimal trade
off between random error and bias. See Gale et al. (to appear) for further details.
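The interpolation itself is a one-liner; what the paper actually fits from the data is the relevance weight (Gale et al., to appear), so the fixed weight below is purely illustrative.

```python
def interpolated_prob(local_count, sample_size, global_prob, relevance):
    """Blend the local MLE (local_count / sample_size) with the global
    probability Pr(tok); relevance in [0, 1] is a stand-in for the
    weight the paper estimates from the data."""
    local = local_count / sample_size
    return relevance * local + (1.0 - relevance) * global_prob

# a token unseen in the 12,000-word conditioned sample no longer
# gets probability zero
p_unseen = interpolated_prob(0, 12000, 1e-4, 0.8)
```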
3.2 An Example
Table 2 (below) gives a sense of what the interpolation procedure does for some of the words that play an
important role in disambiguating between the two senses of duty in the Canadian Hansards. Table 2 lists
the 15 words with the largest product (shown as the first column) of the model score (the second column)
and the frequency in the 6000 word training corpus (the third column). The conditioned samples are
obtained by extracting a 100-word window surrounding each of the 60 training examples. The training sets
were selected by randomly sampling instances of duty in the Hansards until 60 instances were found that
were translated as droit and 60 instances were found that were translated as devoir. The first set of 60 are
used to construct the model for the tax sense of duty and the second set of 60 are used to construct the
model for the obligation sense of duty.
The column labeled "freq" shows the number of times that each word appeared in the conditioned sample.
For example, the count of 50 for the word countervailing indicates that countervailing appeared 50 times
within the conditioned sample. This is a remarkable fact, given that countervailing is a fairly unusual word.
It is much less surprising to find a common word like to appearing quite often (228 times) in the other
conditioned sample. The second column (labeled "weight") models the fact that 50 instances of
countervailing are more surprising than 228 instances of to. The weights for a word are its log likelihood in
the conditioned sample compared with its likelihood in the global corpus. The first column, the product of
these log likelihoods and the frequencies, is a measure of the importance, in the training set, of the word for
determining which sense the training examples belong to. Note that words with large scores do seem to
intuitively distinguish the two senses, at least in the Canadian Hansards.
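The weight-times-frequency product of Table 2 can be reproduced with a few invented counts; the global figures below are assumptions for illustration, not the actual Hansard values.

```python
import math

def importance(local_count, local_total, global_count, global_total):
    """weight = log of the word's likelihood in the conditioned sample
    relative to the global corpus; importance = local frequency * weight."""
    weight = math.log((local_count / local_total) /
                      (global_count / global_total))
    return local_count * weight, weight

# 50 instances of a rare word outweigh 228 instances of a common one
imp_cv, w_cv = importance(50, 6000, 100, 10_000_000)       # "countervailing"
imp_to, w_to = importance(228, 6000, 300_000, 10_000_000)  # "to"
```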
There are obviously some biases introduced by the unusual nature of this corpus, which is hardly a balanced
sample of general language. For example, the set of words listed in Table 2 under the obligation sense of
duty is heavily influenced by the fact that the Hansards contain a fair amount of boilerplate of the form:
"Mr. speaker, pursuant to standing order..., I have the honour and duty to present petitions duly signed
by... of my electors...."
4. Materials
4.1 Six Polysemous Words
We will focus on six polysemous words: duty, drug, land, language, position and sentence. Table 3
(below) shows the six English nouns, along with two French translations. The penultimate column shows
the number of times that each English noun was found with the particular French translation in the corpus,
while the final column shows the accuracy of the system in identifying the appropriate sense based on the
context of use. We selected these nouns because they could be disambiguated by looking at their French
translation in the Canadian Hansards. As mentioned above, the polysemous noun interest, for example,
would not meet this constraint because the French target intérêt is just as ambiguous as the English source.
In addition, for studying methodological questions, it is important that there be an adequate number of
instances of both translations in the corpus, though this constraint can be relaxed in a practical application as
we will see. Unless stated otherwise, the studies to be reported here all use 60 instances of the six
polysemous words in each of the two main senses for training Pr(tok|sense). An additional 90 instances of
each word in each sense are used for testing. Consequently, we require a total of 150 (60 + 90) instances of
each word in each sense in order to investigate the methodological issues.
4.2 Sentence Alignment and Word Correspondence
In order to collect the testing and training sets, we need to know the "truth." As mentioned above, we
approximate the "truth" by assuming that the French translation in the Hansards is adequate for our
purposes. The process of identifying the French translation is a two step procedure. As in Brown, Lai, and
Mercer (1991b), we begin by aligning the parallel texts at the sentence level (Gale and Church, 1991a). In
our experience, 90% of the English sentences match exactly one French sentence, but other possibilities,
especially two sentences matching one (2-1) or one matching two (1-2), are not uncommon. The method
correctly aligned all but 4% of the regions. Moreover, by selecting the best scoring 80% of the corpus, the
error rate dropped to 0.7%. See (Gale and Church, 1991a) for more details on the method and its
evaluation.
After the sentences have been aligned, we can then identify the French correspondences using a very simple
set of programs designed by Yarowsky. Gale and Church (1991b) describe a more elaborate program that
attempts to find correspondences for most of the words in the English text, not just the polysemous words
of interest. However, for our purposes here, the more elaborate methods proved unnecessary.
5. Studies of Methodological Questions
5.1 How Much Context Should We Use?
As mentioned above, we use a very wide context, 100 words surrounding the polysemous word in question.
Most previous studies have limited themselves to a much narrower notion of context, perhaps only 5 words
to the left and 5 words to the right of the polysemous word, based on the observation that people don't seem
to need very much context (Kaplan, 1950; Choueka and Lusignan, 1985). Although people may be able to
get by without the additional context, we find that there are often very useful clues even quite far away from
the polysemous word in question. Figure 1 shows that information is measurable out to 10,000 words
away from the polysemous word, and Figure 2 shows that this information is useful out to 50 words. Since
the disambiguation problem is as difficult for the machine as it is, we believe that it would be a mistake to
ignore this information just because people don't seem to need it. As in computer chess, it is not always
Contextual Clues are Measurable Out to 10,000 Words
Figure 1. The horizontal axis d shows the size of the context as a distance (in words) from the
polysemous word in question. The vertical scale shows disambiguation performance (percent
correct), computed over a context of ten words at the distance specified by the horizontal axis:
[-d-5, -d] and [d, d+5]. Vertical lines indicate means and standard deviations computed over a
group of 90 instances times six polysemous words times two senses. Note that performance
remains significantly above chance (50%) out to 10,000 words away from the polysemous word.
best for the computer to try to copy human strategies.
Figure 1 demonstrates that the information is measurable at very large distances from the polysemous
words. In order to show this, we selected a very unusual context ([-d-5,-d] and [d, d+5]) and measured
performance as a function of d. This experiment thus asks, if you did not know any of the intervening
words, would ten words at distance d be sufficient for disambiguation? The answer is "yes" for
d < 10,000, at least in the Hansards. We found this result surprising, given that almost all previous
disambiguation studies have concentrated so heavily on very narrow contexts. The result almost certainly
has something to do with discourse structure, and may depend fairly strongly on the average length of an
average debate. In addition, the result may depend on other factors such as part of speech.
Although Figure 1 shows that contextual clues are measurable at surprisingly large distances, much of this
information might not be very useful. In particular, it might have been possible to find the same
information at smaller distances. Figure 2 attempts to address this concern by examining the marginal
contribution of context as a function of distance. Figure 2 is computed just like Figure 1, except that Figure
2 uses a 2d-word context ([-d, -1] and [1, d]), rather than the 10-word context in Figure 1. This
experiment thus asks, given that you know all the words out to d, what is the value of a few additional
words further out? The contribution is largest, not surprisingly, for smaller d, but nevertheless, the
contribution continues to grow out to at least twenty words, perhaps fifty words, well beyond the ±6 word
contexts typically found in many disambiguation studies. Increasing the context from ±6 words to ±50
words improves performance from 86% to 90%.
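The two context definitions are easy to state as window extractors; here `tokens` is a hypothetical tokenized corpus and `i` the position of the polysemous word.

```python
def distant_window(tokens, i, d, width=5):
    """Detached windows [-d-width, -d] and [d, d+width] around position i,
    as in the Figure 1 experiment."""
    left = tokens[max(0, i - d - width): max(0, i - d + 1)]
    right = tokens[i + d: i + d + width + 1]
    return left + right

def full_window(tokens, i, d):
    """All tokens within distance d of position i, excluding i itself,
    as in the Figure 2 experiment."""
    return tokens[max(0, i - d): i] + tokens[i + 1: i + d + 1]
```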
Contextual Clues are Useful Out to 50 Words
Figure 2. The horizontal axis shows the size of the context, as in Figure 1. The vertical axis
shows performance (percent correct), computed over a context of 2d words, from -d to d (but
excluding 0). Note that performance rises very rapidly at first and reaches an asymptote at about
50 words.
5.2 Quantity and Quality of Training Material
In a practical application, we might be concerned that the Bayesian discrimination methods would be too
demanding on the quantity and quality of the training material. This section will consider the quantity
question first and then return to the quality question.
As mentioned above, the method would have limited applicability if it requires unreasonably large training
sets. We expect performance to degrade with the size of the training set, and we would like to control this
source of variability as we study other factors. In addition, we would also like to be able to predict
performance when the size of the training set is severely constrained, because this is usually the case for
most senses of most words. Figure 3 shows performance as a function of the size of the training set. Note
that very small training sets perform remarkably well; just 3 exemplars are sufficient to achieve 75%.
Nevertheless, it helps to use larger training sets, up to about 50 or 60 exemplars when performance reaches
asymptote.
The quality of the training set is another potential source of concern. If training materials are to be
collected on a large scale, then we will need to accept a certain number of errors. Moreover, if the method
is robust to errors, then it will be possible to consider bootstrapping methods that might be able to speed up
the data collection effort.
In order to study the quality issue, we deliberately introduced a variable number of errors into the training
set. Table 4 shows the mean performance (percent correct), averaged over 90 instances times two senses
times six polysemous words, as a function of the quality of the training set (the fraction of
errors
Small Training Sets Perform Surprisingly Well
Figure 3. The horizontal axis shows the size of the training set, while the vertical scale shows
performance (percent correct, averaged over 90 instances times two senses times six
polysemous words). Note that very small training sets perform remarkably well; just 3 exemplars
are sufficient to achieve 75%. Nevertheless, it helps to use larger training sets, up to about 50 or
60 exemplars when performance reaches asymptote.
deliberately introduced into the training set) and coverage (the fraction of the test set with the largest
discrimination score).
Two observations on this table are important. First, at 10 percent errors input, the output errors have only
increased from 10 percent to 12 percent. Thus we can accommodate up to about ten percent errors with
little degradation of performance. Second, at fifty percent coverage, input errors of twenty to thirty percent
result in about half as many errors on output. Therefore, if one had obtained a set of examples with no more
than twenty to thirty percent errors, one could iterate example selection just once or twice and have example sets
that had less than ten percent errors.
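This iteration can be caricatured in one dimension: train a trivial threshold model on noisy labels, relabel the examples with its predictions, and retrain. The data and classifier below are invented stand-ins; the real procedure would rank Hansard examples by the discrimination score of section 3.

```python
def train(examples):
    """Toy model: threshold at the midpoint of the two class means."""
    pos = [x for x, lab in examples if lab == 1]
    neg = [x for x, lab in examples if lab == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def relabel(examples, threshold):
    """Replace each label with the current model's prediction."""
    return [(x, 1 if x > threshold else -1) for x, _ in examples]

# six examples; 8 and -8 are deliberately mislabeled (2/6 = 33% errors)
noisy = [(10, 1), (9, 1), (8, -1), (-10, -1), (-9, -1), (-8, 1)]
threshold = train(noisy)
cleaned = relabel(noisy, threshold)  # one round removes both label errors
```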
5.3 Recall vs. Precision
One can make a tradeoff between coverage (fraction of words for which a disambiguation is attempted) and
accuracy (fraction of attempted words which are correctly disambiguated), analogous to the recall versus
precision tradeoff in Information Retrieval. Figure 4 shows that the error rate depends very strongly on
coverage, and consequently, it is important to state coverage carefully when reporting performance. At 66
percent coverage, the error rate is about a quarter of its value at 100 percent coverage. (Some of the
literature has not been as careful as it might be in this respect.)
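The tradeoff can be sketched by attempting only the test instances whose discrimination score is largest in absolute value; the scored list below is invented for illustration.

```python
def error_at_coverage(scored, coverage):
    """scored: list of (score, true_sense) with senses +1 / -1.
    Keep the fraction `coverage` with the largest |score| and
    return the error rate among those attempted."""
    ranked = sorted(scored, key=lambda st: abs(st[0]), reverse=True)
    kept = ranked[:max(1, round(coverage * len(ranked)))]
    wrong = sum(1 for s, t in kept if (s >= 0) != (t > 0))
    return wrong / len(kept)

scored = [(5.0, 1), (4.0, 1), (-3.0, -1), (0.1, -1)]
low_cov = error_at_coverage(scored, 0.75)  # confident cases only: 0.0
full_cov = error_at_coverage(scored, 1.0)  # includes near-zero score: 0.25
```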
Error Rate Depends on Coverage
Figure 4. The horizontal axis shows coverage while the vertical scale shows error rate. Note
that the error rate increases very quickly with small increases in coverage, because most of the
errors result when there is very little information to go on.
6. Conclusions
Difficulties in acquiring suitable testing and training materials have deterred progress on word-sense
disambiguation over the past forty years. We have achieved considerable progress recently by taking
advantage of a new source of testing and training materials. Rather than depending on small amounts of
hand-labeled text, we have been making use of relatively large amounts of parallel text, text such as the
Canadian Hansards, which are available in multiple languages. Consider, for example, the polysemous
word drug, which has two major senses: (1) a medical drug, and (2), an illicit drug. We can collect a
number of examples of each sense by using the French translation (médicament vs. drogue) as an indicator
of the sense of the English word. In this way, we have been able to acquire considerable amounts of testing
and training material for study of quantitative methods.
The availability of this testing and training material has enabled us to develop quantitative disambiguation
methods that achieve 90% accuracy in discriminating between two senses corresponding to different topics,
based on context alone. In addition, and perhaps more importantly, the availability of this testing and
training material has allowed us to carry out a number of methodological studies. In particular, we find
that a training set of as few as ten exemplars seems to be quite useful, and that the training set can tolerate a
fair number of errors. The most surprising result, perhaps, is that the width of context should be ±50
words, an order of magnitude larger than one normally finds in the literature.
References
1. Bar-Hillel, Yehoshua (1960), "Automatic Translation of Languages," in Advances in Computers, Donald Booth and R. E.
Meagher, eds., Academic, New York.
2. Black, Ezra (1987), Towards Computational Discrimination of English Word Senses, Ph. D. thesis, City
University of New York.
3. Black, Ezra (1988), "An Experiment in Computational Discrimination of English Word Senses," IBM Journal
of Research and Development, v 32, pp 185-194.
4. Brown, Peter, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer (1991a), "Word Sense
Disambiguation using Statistical Methods," Proceedings of the 29th Annual Meeting of the Association for
Computational Linguistics, pp 264-270.
5. Brown, Peter, Jennifer Lai, and Robert Mercer (1991b) "Aligning Sentences in Parallel Corpora," Proceedings
of the 29th Annual Meeting of the Association for Computational Linguistics, pp 169-176.
6. Chapman, Robert (1977). Roget's International Thesaurus (Fourth Edition), Harper and Row, NY.
7. Choueka, Yaacov, and Serge Lusignan (1985), "Disambiguation by Short Contexts," Computers and the
Humanities, v 19. pp. 147-158.
8. Church, Kenneth (1989), "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text,"
Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, Glasgow.
9. Cruse, D. A. (1986), Lexical Semantics, Cambridge University Press, Cambridge, England.
10. Dagan, Ido, Alon Itai, and Ulrike Schwall (1991), "Two Languages are more Informative than One,"
Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp 130-137.
11. Fillmore, Charles, and Sue Atkins, (1991) "Word Meaning: Starting where MRD's Stop," invited talk at the
29th Annual Meeting of the Association for Computational Linguistics.
12. Gale, W., and Church, K. (1991a) "A Program for Aligning Sentences in Bilingual Corpora," Association for
Computational Linguistics.
13. Gale, W., and Church, K. (1991b) "Identifying Word Correspondences in Parallel Text," Fourth Darpa
Workshop on Speech and Natural Language, Asilomar.
14. Gale, W., K. Church, and D. Yarowsky, (to appear) "A Method for Disambiguating Word Senses in a Large
Corpus," Computers and the Humanities.
15. Granger, Richard (1977), "FOUL-UP: A program that figures out meanings of words from context," IJCAI-77,
pp. 172-178.
16. Grolier's Inc. (1991) New Grolier's Electronic Encyclopedia.
17. Hearst, Marti (1991) "Toward Noun Homonym Disambiguation Using Local Context in Large Text Corpora,"
in Proceedings of the Seventh Annual Conference of the UW Centre for the New OED and Text Research,
available from the UW Centre for the New OED and Text Research, University of Waterloo, Waterloo, Ontario,
Canada.
18. Hirschman, Lynette (1986), "Discovering Sublanguage Structures," in Analyzing Language in Restricted
Domains, Ralph Grishman and Richard Kittredge, eds., Lawrence Erlbaum, Hillsdale, New Jersey.
19. Hirst, G. (1987), Semantic Interpretation and the Resolution of Ambiguity, Cambridge University Press,
Cambridge.
20. Ide, N. and Veronis, J. (1990) "Mapping Dictionaries: A Spreading Activation Approach," in Proceedings of
the Sixth Annual Conference of the UW Centre for the OED and Text Research, available from the UW Centre
for the New OED and Text Research, University of Waterloo, Waterloo, Ontario, Canada.
21. Isabelle, P. (1984) "Machine Translation at the TAUM Group," in King, M. (ed.) Machine Translation Today:
The State of the Art, Edinburgh University Press.
22. Jackson, Howard (1988) Words and their Meaning, Longman, London.
23. Jacobs, Paul, George Krupka, Susan McRoy, Lisa Rau, Norman Sondheimer, and Uri Zernik (1990), "Generic
Text Processing: A Progress Report," Proceedings DARPA Speech and Natural Language Workshop, pp. 359-364.
24. Kaplan, Abraham (1950), "An Experimental Study of Ambiguity in Context," cited in Mechanical Translation,
v. 1, nos. 1-3.
25. Kelly, Edward, and Phillip Stone (1975), Computer Recognition of English Word Senses, North-Holland,
Amsterdam.
26. Kucera, H., and W. Francis (1967), Computational Analysis of Present-day American English, Brown University
Press, Providence.
27. Lesk, Michael (1986), "Automatic Sense Disambiguation: How to tell a Pine Cone from an Ice Cream Cone,"
Proceeding of the 1986 SIGDOC Conference, Association for Computing Machinery, New York.
28. Longman Group Limited, eds. (1978), Longman Dictionary of Contemporary English, Longman, Burnt Mill,
England.
29. Masterman, Margaret (1967), "Mechanical Pidgin Translation," in Machine Translation, Donald Booth, ed.,
Wiley, 1967.
30. Mosteller, Frederick, and David Wallace (1964) Inference and Disputed Authorship: The Federalist, Addison-Wesley, Reading, Massachusetts.
31. Quine, W. v. O. (1960), Word and Object, MIT Press, Cambridge.
32. Rieger, Charles (1977), "Viewing Parsing as Word Sense Discrimination," in A Survey of Linguistic Science,
W. Dingwall, ed., Greylock.
33. Salton, G. (1989) Automatic Text Processing, Addison-Wesley.
34. Sinclair, J., Hanks, P., Fox, G., Moon, R., Stock, P. et al. (eds.) (1987) Collins Cobuild English Language
Dictionary, Collins, London and Glasgow.
35. Small, S. and C. Rieger (1982), "Parsing and Comprehending with Word Experts (A Theory and its
Realization)," in Strategies for Natural Language Processing, W. Lehnert and M. Ringle, eds., Lawrence
Erlbaum Associates, Hillsdale, NJ.
36. Stone, Phillip, D. C. Dunphy, M. S. Smith, and D. M. Ogilvie (1966), The General Inquirer: A Computer
Approach to Content Analysis, MIT Press, Cambridge.
37. van Rijsbergen, C. (1979) Information Retrieval, Second Edition, Butterworths, London.
38. Walker, Donald (1987), "Knowledge Resource Tools for Accessing Large Text Files," in Machine Translation:
Theoretical and Methodological Issues, Sergei Nirenburg, ed., Cambridge University Press, Cambridge,
England.
39. Weinreich, U. (1980), On Semantics, University of Pennsylvania Press, Philadelphia.
40. Weiss, Stephen (1973), "Learning to Disambiguate," Information Storage and Retrieval, v. 9, pp 33-41.
41. Yarowsky, David (1992), "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained
on Large Corpora," Proceedings COLING-92.
42. Yngve, Victor (1955), "Syntax and the Problem of Multiple Meaning," in Machine Translation of Languages,
William Locke and Donald Booth, eds., Wiley, New York.
43. Zernik, Uri (1990) "Tagging Word Senses in Corpus: The Needle in the Haystack Revisited," in Text Based
Intelligent Systems: Current Research in Text Analysis, Information Extraction, and Retrieval, P. Jacobs, ed., GE
Research and Development Center, pp. 25-29.