
Module 4

Natural Language Processing



Information Retrieval and Lexical Resources: Information Retrieval: Design features of Information Retrieval Systems – Classical, Non-classical, and Alternative Models of Information Retrieval – Evaluation.
Lexical Resources: WordNet – FrameNet – Stemmers – POS Tagger – Research Corpora.

Textbook 1: Ch. 9,12 Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information
Retrieval”, Oxford University Press, 2008

Information Retrieval And Lexical Resources:

1. Information Retrieval:
1. Information retrieval (IR) deals with the organisation, storage, retrieval, and evaluation of
information relevant to a user’s query.
2. A user in need of information formulates a request in the form of a query written in a
natural language.
3. The retrieval system responds by retrieving documents that seem relevant to the
query.
4. An information retrieval system does not inform (i.e., change the knowledge of) the user
on the subject of her inquiry. It merely informs on the existence (or non-existence) and
whereabouts of documents relating to her request.

2. Design features of Information Retrieval Systems

Fig 1. Basic information retrieval process


1. Fig. 1 illustrates the basic process of IR.
2. It begins with the user’s information need.

3. Based on this need, he/she formulates a query.


4. The IR system returns documents that seem relevant to the query.
5. The retrieval is performed by matching the query representation with document
representation.

1. Indexing
A collection of raw documents is usually transformed into an easily accessible
representation. This process is known as indexing.
Most indexing techniques involve identifying good document descriptors, such as
keywords or terms, which describe the information content of documents.
Luhn (1957, 1958) is considered the first person to advance the notion of automatic indexing
of documents based on their content. He assumed that the frequency of certain word
occurrences in an article gives a meaningful identification of the article's content. He
proposed that the discrimination power of index terms is a function of the rank order of the
frequency of their occurrence, and that middle-frequency terms have the highest
discrimination power. This model was proposed for the extraction of salient terms from a
document.
A term can be a single word or a multi-word phrase. For example, the sentence,
Design features of information retrieval systems, can be represented as follows:
Design, features, information, retrieval, systems.
It can also be represented by the set of terms:
Design, features, information retrieval, information retrieval systems.
These multi-word terms can be obtained by looking at frequently appearing sequences of
words (n-grams), by using part-of-speech tags, by applying NLP techniques to identify
meaningful phrases, or by handcrafting.
In the Text REtrieval Conference (TREC), the method used for phrase extraction is as follows
(a small sketch follows this list):
1. Any pair of adjacent non-stop words is regarded as a potential phrase.
2. The final list of phrases is composed of those pairs of words that occur in, say, 25 or more
documents in the document collection.
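A minimal sketch of this pair-extraction heuristic, assuming the collection is available as an in-memory list of token lists and using a small illustrative stop list (both are assumptions made only for this sketch, not part of the TREC specification):

from collections import defaultdict

STOP_WORDS = {"of", "the", "a", "an", "in", "and", "for"}   # illustrative subset only

def extract_phrases(docs, min_doc_freq=25):
    """Return adjacent non-stop word pairs that occur in >= min_doc_freq documents."""
    doc_freq = defaultdict(int)
    for tokens in docs:                          # docs: one token list per document
        pairs = set()
        for w1, w2 in zip(tokens, tokens[1:]):
            if w1.lower() not in STOP_WORDS and w2.lower() not in STOP_WORDS:
                pairs.add((w1.lower(), w2.lower()))
        for pair in pairs:                       # count each pair once per document
            doc_freq[pair] += 1
    return [pair for pair, df in doc_freq.items() if df >= min_doc_freq]
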
2. Eliminating Stop Words
The lexical processing of index terms involves the elimination of stop words. Stop words are
high-frequency words which have little semantic weight and are thus unlikely to help in
retrieval. Typical examples of stop words are articles and prepositions. Eliminating them
considerably reduces the number of index terms. The drawback of eliminating stop words is
that it can sometimes result in the elimination of useful index terms, for instance the stop
word A in Vitamin A. Some phrases, like 'to be or not to be', consist entirely of stop words.
Eliminating stop words in such cases makes it impossible to correctly search for a document.
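A minimal sketch of stop word elimination with a small hand-made stop list (real systems use much larger lists, e.g. the one distributed with NLTK; the list below is an assumption for illustration):

STOP_WORDS = {"of", "the", "a", "an", "to", "in", "on", "and", "or", "be", "not"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("Design features of information retrieval systems".split()))
# -> ['Design', 'features', 'information', 'retrieval', 'systems']
print(remove_stop_words("to be or not to be".split()))
# -> []  (the whole phrase is lost, as noted above)
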
3. Stemming
Stemming normalizes morphological variants, though in a crude manner, by removing affixes
from words to reduce them to their stem; e.g. the words compute, computing,
computes, and computer are all reduced to the same word stem, comput. The stemmed
representation of the text, Design features of information retrieval systems, is (design,
feature, inform, retriev, system).
One of the problems associated with stemming is that it may throw away useful distinctions.
In some cases, it may help conflate similar terms, resulting in increased recall. In
others, it may be harmful, resulting in reduced precision (e.g. when documents containing
the term computation are returned in response to the query phrase personal computer).
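A small sketch using the Porter stemmer implementation shipped with NLTK (the use of NLTK here is an assumption; the exact stems it produces may differ slightly from the hand-written example above):

from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
words = ["design", "features", "information", "retrieval", "systems",
         "compute", "computing", "computes", "computer"]
print([stemmer.stem(w) for w in words])
# roughly: ['design', 'featur', 'inform', 'retriev', 'system', 'comput', 'comput', 'comput', 'comput']
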
4. Zipf’s Law
Zipf's law says that the frequency of words multiplied by their rank in a large corpus is
more or less constant. More formally,
frequency × rank ≈ constant
This means that if we compute the frequencies of the words in a corpus and arrange them
in decreasing order of frequency, then the product of the frequency of a word and its rank is
approximately equal to the product of the frequency and rank of any other word. This
indicates that the frequency of a word is inversely proportional to its rank. This relationship
is shown in the figure.

Empirical investigation of Zipf's law on large corpora suggests that human languages contain
a small number of words that occur with high frequency and a large number of words that
occur with low frequency. The high-frequency words, being common, have less discriminating
power, and thus are not useful for indexing. Low-frequency words are less likely to be
included in a query, and are also not useful for indexing. As there are a large number of
rare (low-frequency) words, dropping them considerably reduces the size of the list of index
terms. The remaining medium-frequency words are content-bearing terms and can be used
for indexing. This can be implemented by defining thresholds for high and low frequency,
and dropping words that have frequencies above or below these thresholds. Stop word
elimination can be thought of as an implementation of Zipf's law, where high-frequency
terms are dropped from the set of index terms.
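Zipf's law is easy to check empirically; the sketch below counts word frequencies in a plain-text corpus (corpus.txt is a hypothetical file name) and prints rank × frequency for the top-ranked words, which should stay roughly constant:

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:      # any large plain-text corpus
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(rank, word, freq, rank * freq)             # rank * freq is roughly constant
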

3. Information Retrieval Models


 The IR system consists of a model for documents, a model for queries, and a
matching function which compares queries to documents. The central objective of
the model is to retrieve all documents relevant to a query. This defines the central
task of an IR system. IR models can be classified as follows:
1. Classical Models of IR

2. Non classical Models of IR

3. Alternative Models of Information Retrieval

 The three classical IR models – Boolean, vector, and probabilistic – are based on
mathematical knowledge that is easily recognized and well understood. These
models are simple, efficient, and easy to implement. Almost all existing commercial
systems are based on the mathematical models of IR. That is why they are called
classical models of IR.
 Non-classical models perform retrieval based on principles other than those used
by classical models, i.e., similarity, probability, and Boolean operations. These are
best exemplified by models based on special logic techniques, situation theory, or the
concept of interaction.
 The third category of IR models, namely alternative models, are actually
enhancements of classical models, making use of specific techniques from other
fields. The cluster model, fuzzy model, and latent semantic indexing (LSI) model are
examples of alternative models of IR.

4. CLASSICAL INFORMATION RETRIEVAL MODELS


1. Boolean Model
 The Boolean model is the oldest of the three classical models.
 It is based on Boolean logic and classical set theory.
 In this model documents are represented as a set of keywords, usually stored in an
inverted file.
 An inverted file is a list of keywords and identifiers of the documents in which they
occur.
 Users are required to express their queries as Boolean expressions consisting of
keywords connected with the Boolean logical operators (AND, OR, NOT). Retrieval is
performed based on whether or not a document contains the query terms.
Given a finite set
T = {t1, t2, ..., ti, ..., tm}
of index terms, a finite set
D = {d1, d2, ..., dj, ..., dn}
of documents, and a Boolean expression (in normal form) representing a query Q as
follows:
Q = ∧ (∨ θi),   θi ∈ {ti, ¬ti}
the retrieval is performed in two steps:
1. The sets Ri of documents that contain or do not contain the term ti are obtained:
Ri = {dj | θi ∈ dj},   θi ∈ {ti, ¬ti}
where ¬ti ∈ dj means ti ∉ dj.
2. Set operations are used to retrieve documents in response to Q:
∩ Ri
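A toy sketch of this two-step procedure over an inverted file; the three-document collection is made up, and the query (information AND retrieval AND NOT fuzzy) is written directly as set operations rather than parsed from a Boolean expression:

from collections import defaultdict

docs = {                                      # hypothetical toy collection
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "fuzzy", "model"},
    "d3": {"information", "retrieval", "vector"},
}

inverted = defaultdict(set)                   # inverted file: term -> document ids
for doc_id, terms in docs.items():
    for t in terms:
        inverted[t].add(doc_id)

all_docs = set(docs)
# Step 1 builds the R_i sets; step 2 combines them (AND = intersection, NOT = complement).
result = inverted["information"] & inverted["retrieval"] & (all_docs - inverted["fuzzy"])
print(sorted(result))                         # -> ['d1', 'd3']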

2. Probabilistic model:
 The probabilistic model applies a probabilistic framework to IR.
 It ranks documents based on the probability of their relevance to a given query.
 Retrieval depends on whether the probability of relevance of a document is higher than
that of non-relevance, i.e., whether it exceeds a threshold value.
 Given a set of documents D, a query q, and a cut-off value α, this model first
calculates the probability of relevance and irrelevance of a document to the query.
 It then ranks documents having probabilities of relevance at least that of irrelevance
in decreasing order of their relevance.
 Documents are retrieved if the probability of relevance in the ranked list exceeds the
cut off value.
 More formally, if P(R|d) is the probability of relevance of a document d to query q,
and P(I|d) is the probability of irrelevance, then the set of documents retrieved in
response to the query is as follows (a toy sketch appears after this list):

S = {dj | P(R|dj) ≥ P(I|dj) and P(R|dj) ≥ α}
 Most of the systems assume that terms are independent when estimating
probabilities for the probabilistic model.
 This assumption allows for accurate estimation of parameter values and helps
reduce computational complexity of the model.
 However, this assumption seems to be inaccurate as terms in a given domain usually
tend to co-occur.
 For example, it is more likely that 'match point' will co-occur with 'tennis' rather than
with 'cricket'. The probabilistic model, like the vector model, can produce results that
partly match the user query.

 Nevertheless, this model has drawbacks, one of which is the determination of a
threshold value for the initially retrieved set; the number of relevant documents retrieved by a
query is usually too small for the probability to be estimated accurately.
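A toy sketch of the retrieval rule above. It assumes the relevance probabilities P(R|d) have already been estimated (the estimation step, e.g. under the term-independence assumption, is not shown) and takes P(I|d) = 1 - P(R|d), an assumption made only for this illustration:

def probabilistic_retrieve(p_rel, alpha):
    """p_rel: dict doc_id -> estimated P(R|d); P(I|d) is taken as 1 - P(R|d).
    Return documents with P(R|d) >= P(I|d) and P(R|d) >= alpha, ranked by P(R|d)."""
    retrieved = [(d, p) for d, p in p_rel.items() if p >= 1 - p and p >= alpha]
    return sorted(retrieved, key=lambda x: x[1], reverse=True)

print(probabilistic_retrieve({"d1": 0.82, "d2": 0.40, "d3": 0.65}, alpha=0.7))
# -> [('d1', 0.82)]   (made-up probability estimates)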

3. Vector Space Model


 The vector space model is one of the most well studied retrieval models.
 The vector space model represents documents and queries as vectors of features
representing terms that occur within them.
 Each document is characterized by a Boolean or numerical vector.
 These vectors are represented in a multi-dimensional space, in which each
dimension corresponds to a distinct term in the corpus of documents. In its simplest
form, each feature takes a value of either zero or one, indicating the absence or
presence of that term in a document or query.
 More generally, features are assigned numerical values that are usually a function of
the frequency of terms.
 Ranking algorithms compute the similarity between document and query vectors to
yield a retrieval score for each document.
 This score is used to produce a ranked list of retrieved documents. Given a finite set
of documents, each document and the query is represented as a vector over the index terms.
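A minimal sketch of this model with raw term-frequency vectors and cosine similarity as the matching function; tf-idf or another weighting scheme would normally replace the raw counts, and the two documents are made up for illustration:

import math
from collections import Counter

def cosine(v1, v2):
    """Cosine similarity between two term-frequency dictionaries."""
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

docs = {"d1": "design of information retrieval systems",
        "d2": "vector space model for retrieval"}
query = Counter("information retrieval".split())
scores = {d: cosine(Counter(text.split()), query) for d, text in docs.items()}
print(sorted(scores.items(), key=lambda s: s[1], reverse=True))   # ranked list of documents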

5. NON-CLASSICAL MODELS OF IR
 Non-classical IR models are based on principles other than similarity, probability,
Boolean operations, etc., on which classical retrieval models are based.
 Examples include information logic model, situation theory model, and interaction
model.
 The information logic model is based on a special logic technique called logical
imaging.
 Retrieval is performed by making inferences from document to query. This is unlike
classical models, where a search process is used.
 Unlike the usual implication, which is true in all cases except when the antecedent is
true and the consequent is false, this inference is uncertain.
 Hence, a measure of uncertainty is associated with this inference. The principle put
forward by van Rijsbergen is used to measure this uncertainty. This principle says:
Given any two sentences x and y, a measure of the uncertainty of y →x relative to a given
data set is determined by the minimal extent to which one has to add information to the
data set in order to establish the truth of y →x.

 In fact, this model was developed in response to van Rijsbergen’s realization that
classical models were unable to enhance effectiveness and that new meaning based
models were required to do so.
 The situation theory model is also based on van Rijsbergen's principle. Retrieval is
considered as a flow of information from document to query.

 A structure called infon, denoted by t, is used to describe the situation and to model
information flow. An infon represents an n-ary relation and its polarity. The polarity
of an infon can be either 1 or 0, indicating that the infon carries either positive or
negative information.
 For example, the information in the sentence, 'Adil is serving a dish', is conveyed by
an infon.
 If a document does not support the query q, it does not necessarily mean that the
document is not relevant to the query.
 Additional information, such as synonyms, hypernyms/hyponyms, meronyms, etc.,
can be used to transform the document d into d' such that d' |= q.
 Semantic relationships in a thesaurus, like WordNet, are useful sources for this
information. The transformation from d to d’ is regarded as flow of information
between situations.
 The interaction IR model was first introduced in Dominich (1992,1993) and
Rijsbergen (1996). In this model, the documents are not isolated, instead they are
interconnected. The query interacts with the interconnected documents. Retrieval is
conceived as a result of this interaction.

6. ALTERNATIVE MODELS OF IR
1. Cluster Model:

 The cluster model is an attempt to reduce the number of matches during retrieval.
 The need for clustering was first pointed out by Salton. Before we discuss the cluster-
based IR model, we would like to state the cluster hypothesis, which explains why
clustering could prove efficient in IR: closely associated documents tend to be
relevant to the same requests.
 This hypothesis suggests that closely associated documents are likely to be retrieved
together. This means that by forming groups (classes or clusters) of related
documents, the search time is reduced considerably.
 Instead of matching the query with every document in the collection, it is matched
with the representatives of the classes, and only documents from a class whose
representative is close to the query are considered for individual match.
 Clustering can be applied to terms instead of documents. Thus, terms can be
grouped to form classes of co-occurring terms.
 Co-occurring terms can be used in dimensionality reduction or thesaurus
construction. A number of methods are used to group documents. We discuss here
a cluster generation method based on a similarity matrix. This method works as
follows:
 Let D = {d1, d2, ..., dj, ..., dn} be a finite set of documents, and let E = (eij)n×n be the
similarity matrix. The element eij of this matrix denotes the similarity between
documents di and dj. Let T be the threshold value. Any pair of documents di and dj (i ≠ j)
whose similarity measure exceeds the threshold (eij ≥ T) is grouped to form a cluster.
The remaining documents form a single cluster.

 The set of clusters thus obtained is C = {C1, C2, ..., Ck, ..., Cp}


A representative vector of each class (cluster) is constructed by computing the centroid of
the document vectors belonging to that class. The representative vector for a cluster Ck is
rk = (a1k, a2k, ..., aik, ..., amk)
An element aik of this vector is computed as
aik = ( Σ dj∈Ck aij ) / |Ck|
where aij is the weight of the term ti in the document dj in cluster Ck. During retrieval, the query
is compared with the cluster vectors
(r1, r2, ..., rk, ..., rp)
This comparison is carried out by computing the similarity between the query vector q and
the representative vector rk as
sk = Σ i=1..m aik qi,   k = 1, ..., p
A cluster Ck whose similarity sk exceeds a threshold is returned, and the search proceeds in
that cluster.
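A small sketch of the retrieval step just described: centroids rk are computed for hypothetical term-weight vectors, and the score sk selects the cluster in which the search proceeds (the clustering itself, via the similarity-matrix threshold, is assumed to have been done already):

def centroid(vectors):
    """Representative vector r_k: element-wise mean of the document vectors in a cluster."""
    m = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

clusters = {"C1": [[1, 1, 0, 0], [1, 0.5, 0, 0]],      # made-up term-weight vectors
            "C2": [[0, 0, 1, 1], [0, 0, 0.5, 1]]}
reps = {k: centroid(vs) for k, vs in clusters.items()}

query = [1, 1, 0, 0]
best = max(reps, key=lambda k: dot(reps[k], query))    # s_k = sum_i a_ik * q_i
print("search proceeds in cluster", best)              # -> C1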

2. Fuzzy Model
3. Latent Semantic Indexing Model

Evaluation of the IR System


Chapter 2:
Lexical Resources:

1. Word Net:
1. WordNet is a large lexical database for the English language.
2. WordNet consists of three databases- one for nouns, one for verbs, and one for
both adjectives and adverbs.
3. Information is organized into sets of synonymous words called synsets, each
representing one base concept. The synsets are linked to each other by means of
lexical and semantic relations.
4. Lexical relations occur between word-forms and semantic relations between word
meanings. These relations include synonymy, hypernymy/hyponymy, antonymy,
meronymy/holonymy, troponymy etc.
5. A word may appear in more than one synset and in more than one part of speech.
The meaning of a word is called sense. WordNet lists all senses of a word, each sense
belonging to a different synset. WordNet’s sense entries consist of a set of synonyms
and a gloss. A gloss consists of a dictionary-style definition and examples
demonstrating the use of a synset in a sentence, as shown in figure 12.1.
6. The figure shows the entries for the word 'read'. 'Read' has one sense as a noun and
11 senses as a verb. Glosses help differentiate meanings (a small code sketch for looking up
these entries appears after this list).
7. Figures 12.2, 12.3, and 12.4 show some of the relationships that hold between nouns,
verbs, and adjectives and adverbs.
8. Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy
relation, whereas adjectives are organized into clusters based on antonym pairs.
Figure 12.5 shows a hypernym chain for 'river' extracted from WordNet. Figure 12.6
shows the troponym relations for the verb 'laugh'.
9. WordNet is freely and publicly available for download from
http://wordnet.princeton.edu/obtain.

10. WordNets for other languages have also been developed, e.g., EuroWordNet and
Hindi WordNet. EuroWordNet covers European languages, including English, Dutch, Spanish,
Italian, German, French, Czech, and Estonian. Besides language-internal relations, it also
contains multilingual relations from each WordNet to English meanings.
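The entries described above can also be inspected programmatically. A small sketch using NLTK's WordNet interface (the choice of NLTK is an assumption; the text refers only to the WordNet database itself):

from nltk.corpus import wordnet as wn      # requires: nltk.download('wordnet')

for synset in wn.synsets("read"):          # all senses of 'read', across parts of speech
    print(synset.name(), synset.pos(), "-", synset.definition())

# hypernym chain for the first noun sense of 'river', as in Figure 12.5
river = wn.synsets("river", pos=wn.NOUN)[0]
print([s.name() for s in river.hypernym_paths()[0]])
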
Applications of WordNet:
WordNet has found numerous applications in problems related with IR and NLP. Some of these are
discussed here.

 Concept identification in Natural Language:

WordNet can be used to identify concepts pertaining to a term, to suit them to the full semantic
richness and complexity of a given information need.

 Word Sense Disambiguation:

WordNet combines features of a number of the other resources commonly used in disambiguation
work. It offers sense definitions of words, identifies synsets of synonyms, defines a number of
semantic relations, and is freely available. This makes it the best known and most utilized resource
for word sense disambiguation. One of the earliest attempts to use WordNet for word sense
disambiguation in IR was by Voorhees. She used the WordNet noun hierarchy to achieve
disambiguation.

WordNet semantic relations can be used to expand queries so that the search for a document is not
confined to the pattern-matching of query terms, but also covers synonyms. The work performed by
Voorhees is based on the use of WordNet relations, such as synonyms, hypernyms, and hyponyms, to
expand queries.
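A minimal sketch of synonym-based query expansion in the spirit of this work, again using NLTK's WordNet interface; note that all senses of every term are expanded here, which is exactly where sense disambiguation becomes important:

from nltk.corpus import wordnet as wn

def expand_query(terms):
    """Add WordNet synonyms (from all senses) of each query term to the query."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            expanded.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
    return expanded

print(expand_query(["car", "crash"]))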

Document Structuring and Categorization

The semantic information extracted from WordNet, and WordNet conceptual representation of
knowledge, have been used for text categorization.

Document Summarization

WordNet has found useful application in text summarization. The approach presented by Barzilay
and Elhadad (1997) utilizes information from WordNet to compute lexical chains.

2. Frame Net
1. FrameNet is a large database of semantically annotated English sentences. It is
based on the principles of frame semantics.
2. It defines a tagset of semantic roles called frame elements.
3. Sentences from the British National Corpus are tagged with these frame elements.
4. The basic philosophy involved is that each word evokes a particular situation with
particular participants.
5. FrameNet aims at capturing these situations through a case-frame representation of
words. The word that invokes a frame is called the target word or predicate, and the
participant entities are defined using semantic roles, which are called frame
elements.
6. The FrameNet ontology can be viewed as a semantic-level representation of
predicate-argument structure.
7. Each frame contains a main lexical item as predicate and associated frame-specific
semantic roles, such as AUTHORITIES, TIME and SUSPECT in the ARREST frame, called
frame elements.
8. As an example, consider sentence (12.1), annotated with the semantic roles
AUTHORITIES and SUSPECT (a simple data-structure view of this annotation is sketched
after this list).
9. The target word in sentence (12.1) is 'nab', which is a verb in the ARREST frame.
[Authorities The police] nabbed [Suspect the snatcher]. (12.1)

10. A COMMUNICATION frame has core and non-core frame elements such as SPEAKER,
ADDRESSEE, and MESSAGE. A JUDGEMENT frame contains roles such as JUDGE, EVALUEE,
and REASON. A frame may inherit roles from another frame. For example, a STATEMENT
frame may inherit from a COMMUNICATION frame; it contains roles such as SPEAKER,
ADDRESSEE, and MESSAGE.
11. The following sentences show some of these roles:
[Judge She] blames [Evaluee the police] [Reason for failing to provide enough protection]. (12.2)
[Speaker She] told [Addressee me] [Message 'I'll return by 7:00 pm today']. (12.3)
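For illustration only, the annotation in sentence (12.1) can be held in a simple data structure like the one below; this is a hypothetical representation, not FrameNet's own annotation format:

from dataclasses import dataclass

@dataclass
class FrameAnnotation:
    frame: str        # e.g. ARREST
    target: str       # the predicate that evokes the frame
    elements: dict    # frame element -> text span

example_12_1 = FrameAnnotation(
    frame="ARREST",
    target="nabbed",
    elements={"AUTHORITIES": "the police", "SUSPECT": "the snatcher"},
)
print(example_12_1)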

FrameNet Applications:
1. Gildea and Jurafsky (2002) and Kwon et al. (2004) used FrameNet data for automatic
semantic parsing. The shallow semantic roles obtained from FrameNet can play an
important role in information extraction. For example, a semantic role makes it
possible to identify that the theme role played by 'match' is the same in sentences (12.4)
and (12.5), though the syntactic role is different.
The umpire stopped the match. (12.4)
The match stopped due to bad weather. (12.5)
2. In sentence (12.4), the word 'match' is the object, while it is the subject in
sentence (12.5).
Semantic roles may help in question answering systems. For example, the verbs 'send' and
'receive' would share the semantic roles SENDER, RECIPIENT, GOODS, etc., when defined
with respect to a common TRANSFER frame. Such common frames allow a question
answering system to answer a question such as 'Who sent a packet to Khushbu?' using
sentence (12.6):
Khushbu received a packet from the examination cell (12.6)

Other applications include IR, interlingua for machine translation, text summarization, and
word sense disambiguation.

3. Stemmers
- Stemming is the process of reducing inflected (or sometimes derived) words to their
base or root form.
- The stem need not be identical to the morphological base of the word; it is usually
sufficient that related words map to the same stem, even if this stem is not in itself a
valid root.
- Stemming is useful in search engines for query expansion or indexing and other NLP
problems.
- The most common algorithm for stemming English is Porter’s Algorithm.
- Other existing stemmers include Lovins stemmer and a more recent one called the
Paice/Husk stemmer.
1. Stemmers for European Languages
- There are many stemmers available for English and other languages.
- Snowball presents stemmers for English, Russian, and a number of other European
languages, including French, Spanish, Portuguese, Hungarian, Italian, German, Dutch,
Swedish, Norwegian, Danish and Finnish.
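NLTK ships implementations of these Snowball stemmers; a quick sketch (the word choices are arbitrary examples):

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)            # tuple of supported languages
for lang, word in [("english", "retrieval"), ("french", "nationale"), ("german", "katzen")]:
    print(lang, SnowballStemmer(lang).stem(word))
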
2. Stemmers for Indian Languages
- Standard stemmers are not yet available for Hindi and other Indian languages.
- Majumder et al. used a cluster-based approach to find classes of root words and their
morphological variants.
- They used a task-based evaluation of their approach and reported that stemming
improves recall for Indian languages.
3. Stemming Applications
- Stemmers are common elements in search and retrieval systems such as Web search
engines.
- Stemming reduces the variants of a word to the same stem.
- This reduces the size of the index and also helps retrieve documents that contain
variants of a query term.

- For example, a user issuing a query for documents on 'astronauts' would like documents
on 'astronaut' as well.
- Stemming permits this by reducing both versions of the word to the same stem.
- However, the effectiveness of stemming for English query systems is not too great, and
in some cases may even reduce precision.
- Text Summarization and text categorization also involve term frequency analysis to find
features.
- In this analysis, stemming is used to transform various morphological forms of words
into their stems.

4. POS Tagger (PART-OF-SPEECH Tagger)


- Part-of-speech tagging is used at an early stage of text processing in many NLP
applications such as speech synthesis, machine translation, IR, and information
extraction.
- In IR, part-of-speech tagging can be used in indexing, extracting phrases and
disambiguating word senses.
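A quick example of part-of-speech tagging with NLTK's default tagger (an averaged-perceptron model; the taggers surveyed below each come with their own toolkits and interfaces):

import nltk
# one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The umpire stopped the match.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('umpire', 'NN'), ('stopped', 'VBD'), ('the', 'DT'), ('match', 'NN'), ('.', '.')]
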
1. Stanford Log-linear Part-of-Speech (POS) Tagger
- This POS Tagger is based on maximum entropy Markov models. The key features of
the Tagger are as follows:
i) It makes explicit use of both the preceding and following tag contexts via a
dependency network representation.
ii) It uses a broad range of lexical features.
iii) It utilizes priors in conditional log-linear models.
2. A Part-of-Speech Tagger for English
- This Tagger uses a bi-directional inference algorithm for part-of-speech tagging.
- It is based on Maximum entropy Markov models (MEMM).
- The algorithm can enumerate all possible decomposition structures and find the
highest probability sequence together with the corresponding decomposition structure
in polynomial time.
- Experimental results of this part-of-speech tagger show that the proposed bi-
directional inference methods consistently outperform unidirectional inference methods
and bi-directional MEMMs give comparable performance to that achieved by state-of-
the-art learning algorithms, including kernel support vector machines.
3. TnT tagger
- Trigrams’n Tags or TnT is an efficient statistical part-of-speech tagger.

- This tagger is based on hidden Markov models (HMM) and uses some optimization
techniques for smoothing and handling unknown words.
4. Brill Tagger
- Brill described a trainable rule-based tagger that obtained performance comparable to
that of stochastic taggers.
- It uses transformation-based learning to automatically induce rules.
5. CLAWS Part-of-Speech Tagger for English
- Constituent likelihood automatic word-tagging system (CLAWS) is one of the earliest
probabilistic taggers for English.
- The latest version of the tagger, CLAWS4, can be considered a hybrid tagger as it
involves both probabilistic and rule-based elements.
6. Tree-Tagger
- Tree-Tagger is a probabilistic tagging method. It avoids problems faced by the Markov
model methods when estimating transition probabilities from sparse data, by using a
decision tree to estimate transition probabilities.
7. ACOPOST: A collection of POS Taggers
- ACOPOST is a set of freely available POS Taggers.
- ACOPOST consists of four taggers
a) Maximum Entropy Tagger (MET)
b) Trigram Tagger (T3)
c) Error-driven Transformation-based Tagger (TBT)
d) Example-based Tagger (ET)
8. POS Tagger for Indian Languages
- The automatic text processing of Hindi and other Indian languages is constrained
heavily due to the lack of basic tools and large annotated corpora.

5. Research Corpora
1. Research corpora have been developed for a number of NLP-related tasks.

2. Standard document collections are available for a variety of NLP-related tasks. Some of
them are described below.
a. IR Test Collection
- LETOR (learning to rank) is a package of benchmark data sets released by Microsoft
Research Asia.

- It consists of two datasets OHSUMED and TREC (TD2003 and TD2004).


- LETOR is packaged with extracted features for each query-document pair in the collection,
baseline results of several state-of-the-art learning-to-rank algorithms on the data and
evaluation tools.
- The data set is aimed at supporting future research in the area of learning ranking functions
for information retrieval.
b. Summarization Data
- Evaluating a text summarization system requires the existence of 'gold' summaries.
- DUC provides document collections with known extracts and abstracts, which are used for
evaluating performance of summarization systems submitted at TREC conferences.
c. Word Sense Disambiguation
- SEMCOR is a sense-tagged corpus used in disambiguation.
- It is a subset of the Brown corpus, sense-tagged with WordNet synsets.
- Open Mind Word Expert attempts to create a very large sense-tagged corpus.
- It collects word sense tagging from the general public over the Web.
d. Asian Language Corpora
- The multilingual EMILLE corpus is the result of the enabling minority language engineering
(EMILLE) project at Lancaster University, UK. The project focuses on generation of data,
software resources and basic language engineering tools for the NLP of south Asian
Languages.
- The Central Institute of Indian Languages (CIIL) provides a wide range of data in Indian
languages from a wide range of genres.
