Introduction to
Information Storage and Retrieval
 Chapter Two: Text Operations
    Chapter Objectives
    At the end of this chapter students will be able to:
       Identify the major steps in text operations
       Understand challenges in text operations
       Differentiate content-bearing terms and function terms
       Tokenize text into index terms
       Eliminate stopwords from text
       Understand stemming and stemming algorithms
       Understand the need for thesaurus construction
       Compute the distribution of terms in text
       Relate term distribution and term significance
             Text Operations
     Five text operations
    1.       Lexical analysis
              Objective is to handle digits, hyphens, punctuation marks, and case of
               letters
    2.       Stop word elimination
          Objective is filtering out words with very low discriminating power for
             retrieval purpose
    3.       Stemming
          Removing affixes to allow retrieval of documents containing syntactic
             variations of query terms
    4.       Term selection
          Select index terms that carry more semantic content
    5.       Thesaurus construction
          Term categorization to allow expansion of query with related terms
     Text Operations …
 Not all words in a document are equally significant to represent the
     contents/meanings of a document
      Some words carry more meaning than others
      Noun words are the most representative of a document's content
 Therefore, need to preprocess the text of a document in a collection
  to be used as index terms
 Using the set of all words in a collection to index documents creates
  too much noise for the retrieval task
      Reducing noise means reducing the number of words that can be used to refer to the document
 Preprocessing is the process of controlling the size of the
     vocabulary or the number of distinct words used as index terms
      Preprocessing will lead to an improvement in the information retrieval
       performance
 However, some search engines on the Web omit preprocessing
  Every word in the document is an index term
Generating Document Representatives
Text Processing System
    Input text – full text, abstract or title
    Output – a document representative adequate for use in an automatic
      retrieval system
 The document representative consists of a list of class names, each
    name representing a class of words occurring in the total input text.
    A document will be indexed by a name if one of its significant words
    occurs as a member of that class.
    Documents → Tokenization → Stop word elimination → Stemming → Thesaurus → Index Terms
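
As a rough illustration only, the stages of this pipeline could be sketched in Python as follows (a minimal sketch with a toy stop list and a stand-in stemmer, not a real IR system):

    import re

    STOPWORDS = {"the", "a", "an", "of", "in", "and", "is"}   # toy stop list

    def tokenize(text):
        # lexical analysis: lowercase and keep alphabetic strings only
        return re.findall(r"[a-z]+", text.lower())

    def remove_stopwords(tokens):
        return [t for t in tokens if t not in STOPWORDS]

    def stem(token):
        # stand-in for a real stemmer: strip a final 's'
        return token[:-1] if token.endswith("s") and len(token) > 3 else token

    def document_representative(text):
        return [stem(t) for t in remove_stopwords(tokenize(text))]

    print(document_representative("The cats slept in the living rooms"))
    # -> ['cat', 'slept', 'living', 'room']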
    Tokenization/Lexical Analysis
Change text of the documents into words to be adopted as
 index terms
Objective - identify words in the text
     Digits, hyphens, punctuation marks, case of letters
     Numbers alone are usually not good index terms (e.g. 1910, 1999); but a
     combination such as 510 B.C. is unique and meaningful
     Hyphens – break up the words (e.g. state-of-the-art = state of the
     art); but some words, e.g. gilt-edged, B-49, are unique words which
     require hyphens
       gilt-edged and gilt edged do not mean the same; similarly, B-49 refers to a
        Soviet submarine while B49 refers to a city bus route in the US
     Punctuation marks – remove them totally unless significant, e.g. in
     program code, x.exe and xexe are different
     Case of letters – usually not important; all letters can be converted to
     upper or lower case
 Tokenization …
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (an instance of a sequence of characters that are
 grouped together as a useful semantic unit for processing)
 Friends
 Romans
 and
 Countrymen
Each such token is now a candidate for an index entry,
 after further processing
But what are valid tokens to produce?
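
A minimal tokenizer along these lines could be sketched as follows (illustration only; real tokenizers must also handle hyphens, apostrophes, numbers, etc., as discussed next):

    import re

    def tokenize(text):
        # split on runs of non-alphanumeric characters; punctuation is dropped
        return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

    print(tokenize("Friends, Romans and Countrymen"))
    # -> ['Friends', 'Romans', 'and', 'Countrymen']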
    Issues in Tokenization
One word or multiple: How do you decide it is one token or
 two or more?
Splitting on white space can also split what should be regarded
 as a single token. This occurs most commonly with names (San
 Francisco, Los Angeles)
    Hewlett-Packard → Hewlett and Packard as two tokens?
        state-of-the-art: break up the hyphenated sequence?
        San Francisco, Los Angeles
    lowercase, lower-case, lower case ?
        data base, database, data-base
• cases with internal spaces that we might wish to regard as a single
  token include phone numbers ((800) 234-2333) and dates
  (Mar 11, 1983)
• Splitting tokens on spaces can cause bad retrieval results, for
  example, if a search for York University mainly returns documents
  containing New York University
Issues in Tokenization …
• The problems of hyphens and non-separating whitespace can even
  interact; Advertisements for air fares frequently contain items like San
  Francisco-Los Angeles, where simply doing whitespace splitting would
  give unfortunate results
• Elimination of periods can be harmful for some tokens, e.g.
      IP addresses (100.2.86.144)
How to handle special cases involving apostrophes, hyphens etc? C++,
    C#, URLs, emails, …
     Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs.
       republican) can be a meaningful part of a token, although usually they
       are not
 Simplest approach is to ignore all numbers and punctuation and use only
    case-insensitive unbroken strings of alphabetic characters as tokens
    Generally, don’t index numbers as text,
      Will often index "meta-data", including creation date, format, etc., separately
Issues of tokenization are language specific
    Requires the language to be known
Elimination of Stopwords
Stopwords are extremely common words across document
     collections that have no discriminatory power
They may occur in 80% of the documents in a collection.
They would appear to be of little value in helping select documents
     matching a user need and need to be filtered out from the index list
Examples of stopwords:
     articles (a, an, the);
     pronouns: (I, he, she, it, their, his)
      prepositions (on, of, in, about, besides, against),
      conjunctions/ connectors (and, but, for, nor, or, so, yet),
      verbs (is, are, was, were),
      adverbs (here, there, out, because, soon, after) and
      adjectives (all, any, each, every, few, many, some)
Stopwords are language dependent
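
A simple dictionary-based stopword filter could be sketched as follows (the stop list here is an illustrative subset; real lists are much longer and language dependent):

    # illustrative subset of an English stopword list
    STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
                 "and", "but", "or", "is", "are", "was", "were"}

    def remove_stopwords(tokens):
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords(["The", "cat", "slept", "in", "the", "room"]))
    # -> ['cat', 'slept', 'room']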
     Elimination of Stopwords …
Stop word elimination used to be standard in older IR
 systems
But the trend is away from doing this. Most web search
 engines index stop words:
  Good query optimization techniques mean you pay little
   at query time for including stop words.
  You need stopwords for:
       Phrase queries: “King of Denmark”
       Various song titles, etc.: “Let it be”, “To be or not to be”
       “Relational” queries: “flights to London”
     Elimination of stopwords might reduce recall (e.g. “To be
      or not to be” – all eliminated except “be” – no or irrelevant
      retrieval)
How to determine a list of stopword?
 Intuition:
      Stopwords have little semantic content; it is typical to remove such
       high-frequency words
      Stopwords take up about 50% of the text. Hence, document size reduces by
       30-50% when stopwords are eliminated
 One method: Sort terms (in decreasing order) by collection
     frequency and take the most frequent ones
     In a collection about insurance practices, “insurance” would be a stop
      word
 Another method: Build a stop word list that contains a set of
     articles, pronouns, etc.
       Why do we need stop lists: with a stop list, we can compare and
        exclude from the index terms entirely the most common words
  With the removal of stopwords, we can measure a better approximation
   of importance for classification, summarization, etc.
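
The frequency-based method above could be sketched as follows (a sketch only; the resulting candidate list still needs manual review, as the insurance example shows):

    from collections import Counter

    def build_stoplist(tokens, top=25):
        # take the most frequent terms in the collection as stopword candidates;
        # a frequent domain term (e.g. "insurance") may still be a poor stopword
        return {term for term, _ in Counter(tokens).most_common(top)}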
     Normalization
• It is standardizing tokens so that matches occur
     despite superficial differences in the character
     sequences of the tokens
     Need to “normalize” terms in indexed text as well as query terms
      into the same form
     Example: We want to match U.S.A. and USA, by deleting periods
      in a term
Case Folding: Often best to lower case everything, since users
     will use lowercase regardless of ‘correct’ capitalization…
     Republican vs. republican
     Fasil vs. fasil vs. FASIL
     Anti-discriminatory vs. antidiscriminatory
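Case folding and period deletion could be sketched as follows (a minimal illustration of normalization):

    def normalize(token):
        # delete periods and fold case, so "U.S.A." and "USA" map to the same term
        return token.replace(".", "").lower()

    print(normalize("U.S.A."), normalize("USA"), normalize("Republican"))
    # -> usa usa republican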
     Normalization issues
Good for
 Allow instances of Automobile at the beginning of a sentence to
  match with a query of automobile
 Helps a search engine when most users type ferrari when they
  are interested in a Ferrari car
Bad for
 Proper names vs. common nouns
        E.g. General Motors, Associated Press, …
 Solution:
 lowercase only words at the beginning of the sentence
In IR, lowercasing is most practical because of the way
     users issue their queries
 Stemming
  Stemming is the conflation of variant forms of a word into a single
     representation, the stem, semantically related to the variants
     The words connect, connects, connected, connecting, connectivity, and
       connection can be stemmed into connect
  The stem does not need to be a valid word, but it needs to capture the
   meaning of the word
  Stemming aims to increase the effectiveness of information retrieval,
   especially recall, whereby more of the relevant documents in the collection
   are retrieved
  It is also used to reduce the size of index files: since a single stem
   typically corresponds to several full terms, by storing stems instead of
   terms a compression factor of about 50 percent can be achieved
  Conflation of words, or so-called stemming, can either be done manually
   by using some kind of regular expressions or automatically using
   stemmers
     Term conflation
 One of the problems involved in the use of free text for
  indexing and retrieval is the variation in word forms that is
  likely to be encountered
 The most common types of variations are
      spelling errors (father, fathor)
      alternative spellings, i.e. local or national usage (color vs
       colour, labor vs labour)
      multi-word concepts (database, data base)
      affixes (dependent, independent, dependently)
       abbreviations (i.e., that is).
     Conflation or stemming methods
     Automatic approaches:
        Affix Removal Method (longest match, simple removal)
        Table lookup Method
        Successor Variety Method
        n-gram Method
     Affix removal
     Affix removal method removes suffix or prefix from the
      words so as to convert them into a common stem form
     Most of the stemmers that are currently used use this type of
      approach for conflation
     Affix removal method is based on two principles: one is
      iteration and the other is longest match
     An iterative stemming algorithm is simply a recursive
      procedure, as its name implies, which removes strings in
      each order-class one at a time, starting at the end of a word
      and working toward its beginning
     No more than one match is allowed within a single order-
      class, by definition
     Affix removal
 Iteration is usually based on the fact that suffixes are attached
  to stems in a certain order, that is, there exist order-classes of
  suffixes
 The longest-match principle states that within any given class
  of endings, if more than one ending provides a match, the one
  which is longest should be removed
 The first stemmer based on this approach is the one developed
  by Lovins (1968); M.F. Porter (1980) also used this method
 However, Porter's stemmer is more compact and easier to use
  than Lovins's
 YASS is another stemmer based on the same approach; it is
  however language independent
     Table lookup approach
Store terms and their corresponding stems in a table
Stemming is then done via lookups in the table
One way to do stemming is to store a table of all index
 terms and their stems
Terms from queries and indexes could then be stemmed
 via table lookup
Problems with this approach
     To build these lookup tables, we need to work extensively on a
      language
     There is some probability that these tables may miss out
      some exceptional cases
     storage overhead for such a table
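A table-lookup stemmer is essentially a dictionary lookup; the sketch below (with an illustrative three-entry table) also shows the missed-case problem:

    # illustrative fragment of a term-to-stem table; a real table must cover the language
    STEM_TABLE = {
        "engineering": "engineer",
        "engineered": "engineer",
        "engineer": "engineer",
    }

    def stem(term):
        # fall back to the term itself when the table has no entry
        return STEM_TABLE.get(term, term)

    print(stem("engineered"), stem("engineers"))
    # -> engineer engineers   (the second lookup misses the table)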
      Successor Variety approach
 Determine word and morpheme boundaries based on the distribution of
  phonemes in a large body of utterances
 The successor variety of a string is the number of different characters that follow
  it in words in some body of text
 The successor variety of substrings of a term will decrease as more characters are
  added until a segment boundary is reached
     Test Word: READABLE
     Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
            READING, READS, RED, ROPE, RIPE
       Prefix         Successor Variety      Letters
       R                     3               E, I, O
       RE                    2               A, D
       REA                   1               D
       READ                  3               A, I, S
       READA                 1               B
       READAB                1               L
       READABL               1               E
       READABLE              1               (blank)
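
The successor varieties in the table can be recomputed with a sketch like the following (note: the last row of the table counts the end-of-word blank as a follower, which this sketch does not):

    # successor variety of a prefix: how many distinct letters follow it in the corpus
    CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
              "READING", "READS", "RED", "ROPE", "RIPE"]

    def successor_variety(prefix, corpus=CORPUS):
        followers = {w[len(prefix)] for w in corpus
                     if w.startswith(prefix) and len(w) > len(prefix)}
        return len(followers), sorted(followers)

    for i in range(1, len("READABLE") + 1):
        print("READABLE"[:i], successor_variety("READABLE"[:i]))
    # R -> (3, ['E', 'I', 'O']), RE -> (2, ['A', 'D']), REA -> (1, ['D']), ...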
        n-gram stemmers
  Another method of conflating terms is called the shared digram method
  A digram is a pair of consecutive letters
  Besides digrams we can also use trigrams and hence it is called n-gram method
   in general
  Association measures are calculated between pairs of terms based on shared
   unique digrams
     statistics => st ta at ti is st ti ic cs
     unique digrams = at cs ic is st ta ti
     statistical => st ta at ti is st ti ic ca al
     unique digrams = al at ca ic is st ta ti
  Dice’s coefficient (similarity)
                                   2C     2*6
                                               .80S
                                  A B 78
  A and B are the numbers of unique digrams in the first and the second words. C
   is the number of unique digrams shared by A and B
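
The digram extraction and Dice's coefficient can be sketched as follows, reproducing the 0.80 value for statistics/statistical:

    def digrams(word):
        # set of unique adjacent letter pairs
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def dice(w1, w2):
        a, b = digrams(w1), digrams(w2)
        return 2 * len(a & b) / (len(a) + len(b))

    print(round(dice("statistics", "statistical"), 2))
    # -> 0.8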
     n-gram stemmers …
Similarity measures are determined for all pairs of terms in
 the database, forming a similarity matrix
Once such similarity is computed for all the word pairs, they
 are clustered into groups
The value of the Dice coefficient gives us the hint that the stem
 for this pair of words lies in the first 8 unique digrams out
 of 10.
Criteria for judging stemmers
Correctness
 Overstemming: too much of a term is removed.
       Can cause unrelated terms to be conflated → retrieval of non-relevant
        documents
       Over-stemming is when two words with different stems are stemmed to
        the same root. This is also known as a false positive
     Understemming: too little of a term is removed.
       Prevents related terms from being conflated → relevant documents may
        not be retrieved
       Under-stemming is when two words that should be stemmed to the
        same root are not. This is also known as a false negative
         Term                       Stem
         GASES                      GAS (correct)
         GAS                        GA (over)
         GASEOUS                    GASE (under)
       Assumptions in Stemming
     Words with the same stem are semantically
      related and have the same meaning to the
      user of the text
     The chance of matching increases when the
      index terms are reduced to their word stems,
      because it is more natural to search using
      "retrieve" than "retrieving"
Of the four types of stemming strategies (affix removal, table
lookup, successor variety, and n-grams) which is preferable?
 Table lookup consists simply of looking for the stem of a word
  in a table, a simple procedure but dependent on data on stems
  for the whole language. Since such data is not readily available
  and might require considerable storage space, this type of
  stemming algorithm might not be practical.
Successor variety stemming is based on the determination of
  morpheme boundaries, uses knowledge from structural
  linguistics, and is more complex than affix removal algorithms
 N-grams stemming is based on the identification of digrams
  and trigrams and is more a term clustering procedure than a
  stemming one
 Affix removal stemming is intuitive, simple, and can be
  implemented efficiently
   Porter Stemmer
 Porter stemmer is the most popular affix (suffix) removal algorithm, for its
 simplicity and elegance
 Despite being simpler, the Porter algorithm yields results comparable to those of
  the more sophisticated algorithms
 The Porter algorithm uses a suffix list for suffix stripping; the idea is to apply a
  series of rules to the suffixes of the words in the text. For instance, the suffix s is
  replaced by nil (s → ∅), converting plurals into singular
 Among the rules whose left-hand sides match, the one matching the longest
   sequence of letters is applied
 Applied to the word stresses, this yields the stem stress instead of the stem stresse.
 A detailed description of the Porter algorithm can be found in the appendix of the
   textbook, and its implementation at
http://tartarus.org/~martin/PorterStemmer/index.html
     Porter stemmer
Most common algorithm for stemming English words to
 their common grammatical root
It is a simple procedure for removing known affixes in
 English without using a dictionary. To get rid of plurals,
 the following rules are used:
 SSES → SS           caresses → caress
 IES  → Y            ponies → pony
 SS   → SS           caress → caress
 S    →              cats → cat
 ment →              (delete the final element if what remains is longer than 1
     character)
                                 replacement → replace
                                 cement → cement
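
A sketch of just these plural rules is shown below (longest matching suffix first); note that the published Porter algorithm differs in details, e.g. it actually maps IES to I (ponies → poni):

    def strip_plural(word):
        # rules are tried in order of longest matching suffix
        if word.endswith("sses"):
            return word[:-2]          # SSES -> SS   caresses -> caress
        if word.endswith("ies"):
            return word[:-3] + "y"    # IES  -> Y    ponies -> pony
        if word.endswith("ss"):
            return word               # SS   -> SS   caress -> caress
        if word.endswith("s"):
            return word[:-1]          # S    ->      cats -> cat
        return word

    for w in ["caresses", "ponies", "caress", "cats", "stresses"]:
        print(w, "->", strip_plural(w))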
  Thesauri
Often, full-text searching cannot be accurate, since different
  authors may select different words to represent the same
  concept
  Problem: the same meaning can be expressed using different terms
   that are synonyms, homonyms, or related terms
  How can we ensure that, for the same meaning, identical
   terms are used in the index and in the query?
Thesaurus: The vocabulary of a controlled indexing
 language, formally organized so that a priori relationships
 between concepts (for example as "broader" and “related")
 are made explicit.
A thesaurus contains terms and relationships between terms
  IR thesauri rely typically upon the use of symbols such as USE/UF
    (UF=used for), BT, and RT to demonstrate inter-term relationships.
  e.g., car = automobile, truck, bus, taxi, motor vehicle
        color = colour, paint
     Aim of Thesaurus
Thesaurus tries to control the use of the vocabulary by
 showing a set of related words to handle synonyms and
 homonyms
The aim of thesaurus is therefore:
 to provide a standard vocabulary for indexing and searching
      A thesaurus rewrites terms to form equivalence classes, and we index such
       equivalence classes
      When the document contains automobile, index it under car as well
       (usually, also vice-versa)
 to assist users with locating terms for proper query
  formulation: When the query contains automobile, look
  under car as well for expanding query
 to provide classified hierarchies that allow the broadening
  and narrowing of the current request according to user needs
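
Query expansion with a thesaurus could be sketched as follows (the thesaurus entries are illustrative only):

    # toy thesaurus mapping a term to its equivalence class / related terms
    THESAURUS = {
        "automobile": {"car", "motor vehicle"},
        "car": {"automobile", "motor vehicle"},
    }

    def expand_query(terms):
        expanded = set(terms)
        for t in terms:
            expanded |= THESAURUS.get(t, set())
        return expanded

    print(expand_query(["automobile", "insurance"]))
    # -> {'automobile', 'car', 'motor vehicle', 'insurance'}  (set order may vary)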
  Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and
  vehicles :
Term: Motor vehicles
  UF :      Automobiles, Cars, Trucks
  BT:       Vehicles
  RT:       Road Engineering, Road Transport
Example: thesaurus built to assist IR in the fields of
 computer science:
TERM: natural languages
   UF natural language processing (UF=used for NLP)
   BT languages (BT=broader term is languages)
   TT languages (TT = top term is languages)
   RT artificial intelligence (RT=related term/s)
     computational linguistics, formal languages, query languages, speech
      recognition
       Language-specificity
     Many of the above features embody transformations
      that are
       Language-specific and
       Often, application-specific
     These are “plug-in” addenda to the indexing process
     Both open source and commercial plug-ins are
      available for handling these
 Statistical Properties of Text
How is the frequency of different words distributed?
 Refer to Zipf’s law
How fast does vocabulary size grow with the size of a
 corpus? Refer to Heaps' law
  Such factors affect the performance of IR system & can be used
    to select suitable term weights & other aspects of the system.
A few words are very common.
 2 most frequent words (e.g. “the”, “of”) can account for about
  10% of word occurrences in a document.
Most words are very rare.
 Half the words in a corpus appear only once, called “read only
  once”
Called a "heavy tailed" distribution, since most of the
 probability mass is in the "tail"
 Sample Word Frequency Data
      Zipf’s distributions
For all the words in a documents collection, for each word w
•    f : is the frequency that w appears
•    r : is rank of w in order of frequency.
        (The most commonly occurring word has rank 1, etc.)
     [Figure: plot of frequency f (vertical axis) against rank r (horizontal axis),
      showing the distribution of sorted word frequencies according to Zipf's law;
      a word w has rank r and frequency f]
     Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
     George Kingsley Zipf (1902-1950),
     attempts to capture the distribution of the frequencies (i.e.,
      number of occurrences) of the words within a text
Zipf's Law states that when the distinct words in a text
 are arranged in decreasing order of their frequency of
 occurrence (most frequent words first), the occurrence
 characteristics of the vocabulary can be characterized by the
 constant rank-frequency law of Zipf:
               Frequency * Rank = constant
If the words, w, in a collection are ranked, r, by their
 frequency, f, they roughly fit the relation:    r*f=c
 Different collections have different constants c
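A quick empirical check of the rank * frequency product could be sketched as follows (assuming a tokenize() function like the one sketched earlier; corpus.txt is a placeholder file name):

    from collections import Counter

    def zipf_table(tokens, top=10):
        # rank * frequency should be roughly constant if the text follows Zipf's law
        ranked = Counter(tokens).most_common(top)
        return [(rank, term, freq, rank * freq)
                for rank, (term, freq) in enumerate(ranked, start=1)]

    # usage:
    # for row in zipf_table(tokenize(open("corpus.txt").read())):
    #     print(row)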
       Example: Zipf's Law
 The table shows the most frequently occurring words from a
     336,310-document collection containing 125,720,891 total
     words, of which 508,209 are unique words
Zipf’s law: modeling word distribution
The collection frequency of the ith most common term is
     proportional to 1/i
                                  fi ∝ 1/i
     If the most frequent term occurs f1 times, then the second most
       frequent term has half as many occurrences, the third most
       frequent term a third as many, etc.
Zipf's Law states that the frequency of the i-th most
     frequent word is 1/i^θ times that of the most frequent word
     The probability of occurrence of some event (P), as a function of its rank (i),
       when the rank is determined by the frequency of occurrence, is a
       power-law function Pi ~ 1/i^θ with the exponent θ close to unity.
     More Example: Zipf’s Law
 Illustration of Rank-Frequency Law. Let the total number of word
  occurrences in the sample N = 1,000,000
           Rank (R)         Term   Frequency (F)   R*(F/N)
                1           the       69,971        0.070
                2            of       36,411        0.073
                3           and       28,852        0.086
                4            to       26,149        0.104
                5            a        23,237        0.116
                6            in       21,341        0.128
                7           that      10,595        0.074
                8            is       10,099        0.081
                9           was        9,816        0.088
               10            he        9,543        0.095
     Methods that Build on Zipf's Law
 • Stop lists: Ignore the most frequent words (upper cut-
   off)
     • Used by almost all systems
 • Significant words: Take words in between the most
   frequent (upper cut-off) and least frequent words (lower
   cut-off)
     • Used in IR systems
 • Term weighting: Give differing weights to terms based
   on their frequency, with most frequent words weighted
   less
     • Used by almost all ranking methods
 Explanations for Zipf’s Law
The law has been explained by the "principle of least effort",
     which makes it easier for a speaker or writer of a language
     to repeat certain words instead of coining new words
     Zipf's explanation was his "principle of least effort", which balances the
  speaker's desire for a small vocabulary against the hearer's desire
  for a large one
Zipf’s Law Impact on IR
 Good News: Stopwords will account for a large fraction
      of text so eliminating them greatly reduces inverted-index
      storage costs
     Bad News: For most words, gathering sufficient data for
      meaningful statistical analysis (e.g. for correlation
      analysis for query expansion) is difficult since they are
      extremely rare
     Word significance: Luhn’s Ideas
     Luhn Idea (1958): the frequency of word occurrence in a
      text furnishes a useful measurement of word significance
     Luhn suggested that both extremely common and
      extremely uncommon words were not very useful for
      indexing
     Word significance: Luhn’s Ideas
For this, Luhn specifies two cutoff points: an upper and a
 lower cutoff, based on which non-significant words are
 excluded
     The words exceeding the upper cutoff are considered to be too
      common
     The words below the lower cutoff are considered to be too rare
     Hence they do not contribute significantly to the content of the
      text
     The ability of words to discriminate content reaches a peak at a
      rank-order position halfway between the two cutoffs
Let f be the frequency of occurrence of words in a text, and    r their
 rank in decreasing order of word frequency, then a plot relating f & r
 yields the following curve
 Luhn’s Ideas
 Luhn (1958) suggested that both extremely common and extremely
 uncommon words were not very useful for document representation
  & indexing
 Vocabulary Growth: Heaps’ Law
How does the size of the overall vocabulary (number of
 unique words) grow with the size of the corpus?
 This determines how the size of the inverted index will scale with
     the size of the corpus.
Heaps’ law: estimates the number of vocabularies in a
 given corpus
 For a text of n words, the vocabulary size grows by O(nβ), where β
  is a constant, 0 < β < 1
 If V is the size of the vocabulary and n is the length of the corpus
  in words, Heaps provides the following equation:
                                  V = K * n^β
 where K and β are constants; typically K ≈ 10-100 and
 β ≈ 0.4-0.6 (approximately the square root)
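A direct sketch of the formula, with illustrative values for the constants K and β:

    def heaps_vocabulary(n, K=50, beta=0.5):
        # V = K * n**beta; K and beta are empirical constants (K ~ 10-100, beta ~ 0.4-0.6)
        return K * n ** beta

    print(int(heaps_vocabulary(1_000_000)))
    # -> 50000 with these illustrative constants (1,000,000 ** 0.5 = 1,000)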
 Heaps’ distributions
• Distribution of size of the vocabulary: there is a linear
  relationship between vocabulary size and number of
  tokens
     • Example: from 1,000,000,000 words, there may be
       1,000,000 distinct words. Do you agree?
 Example
We want to estimate the size of the vocabulary for a corpus of
     1,000,000 words. However, we only know the statistics
     computed on smaller corpora sizes:
     For 100,000 words, there are 50,000 unique words
     For 500,000 words, there are 150,000 unique words
     Estimate the vocabulary size for the 1,000,000-word corpus
     How about for a corpus of 1,000,000,000 words? (One way to fit the
      constants is sketched below)
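One possible way to fit K and β from the two given measurements and then extrapolate (a sketch; with these two points β ≈ 0.68, and the estimate for the 1,000,000-word corpus comes out at roughly 240,000 distinct words):

    import math

    # fit V = K * n**beta from the two given measurements, then extrapolate
    n1, v1 = 100_000, 50_000
    n2, v2 = 500_000, 150_000

    beta = math.log(v2 / v1) / math.log(n2 / n1)   # ~0.68
    K = v1 / n1 ** beta                            # ~19

    for n in (1_000_000, 1_000_000_000):
        print(n, int(K * n ** beta))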
     Relevance Measure
                                    retrieved                    not retrieved
                   irrelevant       retrieved & irrelevant       not retrieved & irrelevant
                   relevant         retrieved & relevant         not retrieved but relevant
       Issues: recall and precision
     breaking up hyphenated terms increases recall
      but decreases precision
     preserving case distinctions enhances precision
      but decreases recall
     commercial information systems usually take a
      recall-enhancing approach (numbers and words
      containing digits are index terms, and all are
      case insensitive)
        Exercise: Tokenization
     The cat slept peacefully in the living room. It’s a very old
       cat.
     Mr. O’Neill thinks that the boys’ stories about Chile’s
       capital aren’t amusing.
     Index term selection
 Index language is the language used to describe
  documents and requests
 Elements of the index language are index terms which
  may be derived from the text of the document to be
  described, or may be arrived at independently.
      If a full text representation of the text is adopted, then
       all words in the text are used as index terms = full text
       indexing
      Otherwise, we need to select the words to be used as index
       terms to reduce the size of the index file, which is
       essential for designing an efficient searching IR system
          End of Chapter Two