Processing Text
● Corpus: a collection of documents of some type
○ Corpus of news/articles
○ Usually share some characteristics
● We can rely on the textual characteristics to build text-based engines
● The statistical laws of text (discussed below) apply to:
○ Different languages
○ Different text domains
1. Words to Terms
● Text transformation/processing: convert words into index terms
○ Index term: representation of the content of a document (used for searching)
● Tokenization: split words apart
● Stopping: ignore some words to make query processing more efficient and effective
● Stemming: allow similar words (like “run” and “running”) to match each other
⇒ Search engines are about deciphering the meaning of text from counts of word occurrences
and co-occurrences (the number of times groups of words occur together in documents)
2. Text Statistics
● Distribution of word frequencies is very skewed
○ Words like “the”, “of” account for 10% of all word occurrences
○ ½ of all unique words in a large sample of text occur only once
● Knowing (1) how fast the vocabulary size grows with the size of a corpus and (2) how
the words are distributed helps us determine:
○ Indexing technique
○ Storage: how to store data + storage availability
Probability
● Experiment: a specific set of actions whose results cannot be predicted with certainty
● Sample space: possible set of recorded data
● Event space: subsets of a sample space defined by a specific event/outcome
○ Example: the event that the sum of 2 dice is 4, i.e. {(1,3), (2,2), (3,1)}, which has probability 3/36 = 1/12
● Zipf’s law: The frequency of the rth most common word is inversely proportional to r
○ Graphing frequency vs. rank shows a steep power-law decay (the more frequent the word,
the lower its rank)
○ The rank of word r * its frequency f is approximately a constant k
■ 𝑟 × 𝑓 = 𝑘
● If the rank-1 term occurs cN times, then the rank-2 term occurs cN/2 times, …
○ The most frequent word (1/1) will occur twice as frequently as the
2nd most frequent (½).
○ Alternatively:
■ 𝑟 × 𝑃_𝑟 = 𝑐 where
● 𝑃_𝑟: probability of occurrence of the rth ranked word
● 𝑐: a constant (≈ 0.1)
○ Might be inaccurate for low and high ranks
■ Not accurate at tails
○ There may be several words with the same frequency n ⇒ the number of such words is 𝑟_𝑛 − 𝑟_(𝑛+1), where 𝑟_𝑛 is
the rank of the last word in the group with frequency n, and 𝑟_(𝑛+1) is the rank of the last word of the
previous group of words with the next higher frequency
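A minimal Python sketch of checking Zipf's law on a corpus; the file name corpus.txt and the crude regex tokenizer are assumptions, not from the notes:
```python
import re
from collections import Counter

# Crude tokenization of an assumed plain-text corpus file.
tokens = re.findall(r"[a-z0-9]+", open("corpus.txt", encoding="utf-8").read().lower())
freqs = Counter(tokens)
total = len(tokens)

# Zipf: rank r times frequency f should be roughly constant,
# and r * P_r should hover around c (~0.1 for English text).
for r, (word, f) in enumerate(freqs.most_common(10), start=1):
    print(f"rank {r:2d}  {word:10s}  f={f:6d}  r*f={r * f:8d}  r*P_r={r * f / total:.3f}")
```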
● Vocabulary Growth: as the size of the corpus grows, new words occur
○ Can be due to: typos, new words in the vocabulary, …
○ Heaps' law: 𝑣 = 𝑘 × 𝑛^β
■ 𝑣: vocab size for a corpus of size 𝑛 words
■ 𝑘, β: parameters (peculiar to each collection)
● 10 ≤ 𝑘 ≤ 100
● β ≈ 0. 5
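A minimal sketch of estimating the Heaps' law parameters; the sampling interval, the simple log-log least-squares fit, and the assumed corpus.txt file (with at least a few tens of thousands of tokens) are illustration choices, not from the notes:
```python
import math
import re

# Same crude tokenization as the Zipf sketch above (corpus.txt is an assumed file).
tokens = re.findall(r"[a-z0-9]+", open("corpus.txt", encoding="utf-8").read().lower())

# Record (corpus size n, vocabulary size v) while scanning the corpus.
seen, points = set(), []
for n, tok in enumerate(tokens, start=1):
    seen.add(tok)
    if n % 10_000 == 0:
        points.append((n, len(seen)))

# Heaps' law v = k * n^beta  =>  log v = log k + beta * log n (simple least-squares fit).
xs = [math.log(n) for n, v in points]
ys = [math.log(v) for n, v in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
k = math.exp(my - beta * mx)
print(f"k ≈ {k:.1f}, β ≈ {beta:.2f}")        # expect roughly 10 ≤ k ≤ 100 and β ≈ 0.5
```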
3. Document Parsing
Process:
1. Take a document
2. Turn it into a representation (like metadata)
a. Context-free: bag of words
b. word2vec: an ML algorithm that estimates the probability of seeing one word next to
another
For the query, on the other hand:
1. Take a query
2. Turn it into a representation (it has to be in the same form as the document's)
⇒ Compare them
⇒ Evaluate the comparisons (how related is this to the query)
⇒ Feedback (eg: based on how many times you clicked on something)
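A minimal sketch of this pipeline using a bag-of-words representation and a simple overlap score; the example documents and the scoring function are made up for illustration (real engines use better scoring functions):
```python
from collections import Counter

def bag_of_words(text):
    # Context-free representation: term -> count.
    return Counter(text.lower().split())

docs = {
    "d1": "tropical fish include fish found in tropical environments",
    "d2": "fishkeeping is a popular hobby",
}
query = bag_of_words("tropical fish")

# Query and documents are in the same form, so they can be compared directly.
for doc_id, text in docs.items():
    doc = bag_of_words(text)
    score = sum(min(query[t], doc[t]) for t in query)
    print(doc_id, score)            # d1 matches better than d2
```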
Basic inverted index construction:
● Token: a semantic unit
○ Instance of a word occurring in a doc. Example: Bush, bush, friends
○ Tokenizing: figuring out how to divide up the documents into fragments
● Text preprocessing:
○ Remove punctuations
○ Lowercase
● Inverted index: maps a word to the documents in which it appears (see the sketch below)
○ Term: entry in the dictionary (normalized). Example: bush, friend
○ Type: class of all tokens containing the same character sequence (Example: “to
be or not to be” has 6 tokens, 4 types)
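A minimal sketch of basic inverted index construction with the preprocessing above (lowercasing, stripping punctuation); the two example documents are made up:
```python
import re
from collections import defaultdict

docs = {
    1: "Bush visited his friends.",
    2: "The bush caught fire.",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z0-9]+", text.lower()):   # tokenize + normalize
        index[token].add(doc_id)                            # term -> set of doc IDs

print(sorted(index["bush"]))    # [1, 2] -- "Bush" and "bush" map to the same term
```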
What’s in a document
● Multiple languages
● Multiple character sets/encodings: how do we go from bytes to characters?
● Different compression algorithms: documents may be zipped/compressed
● Recognition of the content and structure of text documents
○ Tokenizing/lexical analysis: recognizing each word occurrence in the sequence
of characters in a document
○ Metadata: document attributes (date/author); tags to identify document
components
Big challenge in querying: Vocabulary mismatch
● Word boundary variation
○ eusa is different from e-usa = usa = u.s.a = us of a
● Morphological variation
○ Policeman = policemen = police vs. policy = politics
● Spelling variation:
○ Gaddafi = Gadhafi = …
● Synonymy and Polysemy
○ apple vs. Apple vs. Big Apple
a) Tokenizing
Challenges:
● Small words that are important in queries may be disregarded
● Hyphenated/non-hyphenated forms of many words are common
● Special characters
● Capitalized words can carry meaning (e.g., Bush vs. bush), but tokenizing often lowercases everything
● Apostrophes (‘) that can be part of a word
● Periods
⇒ Can treat hyphens, apostrophes and periods as spaces
⇒ No single right way to do it
1. First pass: identifying markup or tags in the document (accommodate syntax errors in
the structure)
2. Second pass: identify structure in appropriate parts of the document
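A minimal sketch of one tokenizing policy from the notes (treat hyphens, apostrophes, and periods as spaces); this is just one of many reasonable choices:
```python
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[-'.]", " ", text)    # hyphens, apostrophes, periods -> spaces
    return text.split()

print(tokenize("U.S.A.'s on-line store"))   # ['u', 's', 'a', 's', 'on', 'line', 'store']
```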
b) Stopping
● Function words: words that have little meaning apart from other words
● Stopwords: function words in the context of information retrieval
○ We will throw them out → decrease index size, increase retrieval efficiency
○ Can be customized for specific parts of the document structure/fields (like
“click”, “here”, “privacy” are stopwords when processing anchor text)
⇒ Downside of just removing the stopwords: Cannot search on them!
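A minimal sketch of stopword removal and its downside; the tiny stopword list is made up (real lists are much longer and can be field-specific):
```python
STOPWORDS = {"the", "of", "to", "be", "or", "not", "a", "in"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "tropical", "fish"]))             # ['tropical', 'fish']
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))   # [] -- the query is lost
```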
c) Stemming
Pros:
+ Smaller dictionary size
+ Increased recall (a larger share of the relevant documents is returned)
Cons:
- Exact quotes
- Decrease in specificity
- Decrease in precision (more *junk*)
● Component of text processing that captures the relationships between different
variations of a word
○ Reduce the different forms of a word to a common stem; the variations occur because of
■ Inflection (e.g., plurals, tenses) or
■ Derivation (making a verb into a noun by adding the suffix -ation)
→ For example, “swimming” and “swam” can be reduced to “swim”
● 2 types of stemmers:
○ Algorithmic: decide whether 2 words are related
■ Simplest: “s” signifies plurality
● False negatives: cannot detect a relationship (example: country
and countries)
● False positives: detect a relationship when there is none (example:
I and is)
⇒ Solution: consider more kinds of suffixes → fewer false
negatives, but more false positives (see the sketch after this list)
■ Porter stemmer:
● The most popular algorithmic stemmer
● Used in many information retrieval experiments and systems
since the 70s
○ Dictionary-based: store term relationships based on pre-created dictionaries of
related terms
■ Instead of computing the relationship between words, these just store
lists of related words in a large dictionary
■ Drawback: tend to be static + less flexible (while language evolves
constantly)
○ Combine the 2:
■ Use algorithmic to deal with new/unrecognized words
■ Dictionary is used to detect relationships between common words
■ Krovetz stemmer: use dictionary to check the validity of a word + if not
found, then check for suffixes → feed to the dictionary again
● Lower false-positive rate than Porter
● Higher false-negative rate
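A minimal sketch of the simplest algorithmic stemmer from the notes ("s" signifies plurality), kept deliberately naive to show the false negatives/positives; a real system would use a Porter or Krovetz stemmer instead:
```python
def naive_stem(word):
    # Strip a trailing "s" as a (very rough) plural marker.
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

print(naive_stem("friends"))    # 'friend'   -- correct
print(naive_stem("countries"))  # 'countrie' -- misses 'country': false negative
print(naive_stem("is"))         # 'i'        -- collides with 'I': false positive
```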
d) Phrases and N-grams
● Phrase: a simple noun phrase (can be nouns, or adjectives followed by nouns)
○ POS tagging:
■ Too slow
■ Alternative: store word position information in the indexes → only identify
phrases when a query is processed
⇒ Testing word proximities at query time is likely to be too slow
⇒ Identify n words to be a unit during text processing
● n-gram: sequence of n words
○ Generated by choosing a particular value for n
○ Then move that window forward one unit (character/word) at a time
■ Example: “hello” contains the following bigrams: he, el, ll, lo
○ Indexes based on n-grams are larger than word indexes
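A minimal sketch of n-gram generation by sliding a window one unit at a time; it works for character n-grams (the "hello" example) and word n-grams alike:
```python
def ngrams(sequence, n):
    # Slide a window of length n forward one unit at a time.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(ngrams("hello", 2))                          # ['he', 'el', 'll', 'lo']
print(ngrams("tropical fish are fun".split(), 2))  # [['tropical', 'fish'], ['fish', 'are'], ['are', 'fun']]
```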
4. Document Structure and Markup
● Document parsers use the structure of web pages (example, from HTML markup) for
ranking algorithm
○ Headings indicate that the phrase is particularly important
● XML: allows each application to define its own element types
○ Defined by a schema (like a database schema)
○ Therefore, its elements are more tied to the semantics of the data than HTML
elements
5. Link Analysis
● Links indicate relationships between pages and help with ranking them
● Anchor Text: 2-3 words describing the topic of the linked page
○ For example, links to www.ebay.com very often have “eBay” as their anchor text
○ This motivates a simple ranking algorithm: search through all links in the
collection, looking for anchor text that matches user’s query
● PageRank:
○ Naive approach: rank pages by popularity
■ e.g., count the number of inlinks (links pointing to a page) for each page
■ Susceptible to spam!
○ PageRank: Google search engine
■ A value associated with a web page
■ Help search engines sift through all pages containing the word “eBay” to
find the most popular one
■ The PageRank for a page u is the sum, over all pages v that link to u, of PR(v) divided by the
number of pages v links to: PR(u) = Σ PR(v) / L_v
○ If we take into account “distraction” (random) jumps on the site…
■ Random-surfer adjustment: with probability λ the surfer jumps to a random page, and with
probability 1 − λ follows a link on the current page → PR(u) = λ/N + (1 − λ) × Σ PR(v) / L_v (a small sketch follows this list)
○ Rank sinks: pages with no outbound links (accumulate PR but do not distribute it)
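A minimal sketch of iterative PageRank with the random-surfer adjustment; λ (here `lam`) is the probability of jumping to a random page, and the tiny link graph, including the rank sink D, is made up:
```python
def pagerank(links, lam=0.15, iterations=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: lam / n for p in pages}            # random jump ("teleport") share
        for p in pages:
            out = links[p]
            if out:                                   # distribute PR over outgoing links
                for q in out:
                    new[q] += (1 - lam) * pr[p] / len(out)
            else:                                     # rank sink: spread PR over all pages
                for q in pages:
                    new[q] += (1 - lam) * pr[p] / n
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}   # D has no outbound links
print(pagerank(links))
```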
● Link quality:
○ Hard to assess: web page designers may try to create useless links to improve
the PR → link spam
■ Solution: detect link spam in comment sections and ignore them during
indexing
● Easier solution: add rel="nofollow" to unimportant/untrusted links
6. Internationalization
● Monolingual search engine: search engine designed for a particular language
● Challenge:
○ Word segmentation: identifying word (index-term) boundaries in text, especially in languages written without spaces between words
■ Techniques: e.g., Hidden Markov Models
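The notes name Hidden Markov Models as the segmentation technique; as a much simpler illustration (not an HMM), this is greedy maximum matching against a made-up dictionary of known words:
```python
DICTIONARY = {"search", "engine", "engines"}

def max_match(text, dictionary, max_len=10):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):   # try the longest match first
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:                                                  # no dictionary word: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("searchengines", DICTIONARY))   # ['search', 'engines']
```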