Lecture 4-Dictionaries and Tolerant Retrieval
Lecture 4-Dictionaries and Tolerant Retrieval
Introduction to
Information Retrieval
Dictionaries and tolerant retrieval
Introduction to Information Retrieval
This lecture
▪ Dictionary data structures
▪ “Tolerant” retrieval
▪ Wild-card queries
▪ Spelling correction
▪ Soundex
3
Introduction to Information Retrieval
4
Introduction to Information Retrieval
A naive dictionary
▪ An array of struct:
6
Introduction to Information Retrieval
7
Introduction to Information Retrieval
Hashtables
▪ Each vocabulary term is hashed to an integer
▪ Pros:
▪ Lookup is faster than for a tree: O(1)
▪ Cons:
▪ No easy way to find minor variants:
▪ judgment/judgement
▪ No prefix search [tolerant retrieval]
▪ If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything
8
Introduction to Information Retrieval
ot
kle
s
ark
gen
zyg
sic
v
huy
ard
9
Introduction to Information Retrieval
10
Introduction to Information Retrieval
Tree: B-tree
a-h n-
u hy- z
m
Trees
▪ Simplest: binary Search tree
▪ More usual: B-trees
▪ Trees require a standard ordering of characters and hence
strings … but we typically have one
▪ Pros:
▪ Solves the prefix problem (terms starting with hyp)
▪ Cons:
▪ Slower: O(log M) [and this requires balanced tree]
▪ Rebalancing binary trees is expensive
▪ But B-trees mitigate the rebalancing problem
12
Introduction to Information Retrieval
WILD-CARD QUERIES
13
Introduction to Information Retrieval
Wild-card queries: *
▪ mon*: find all docs containing any word beginning
with “mon”.
▪ Easy with binary tree (or B-tree) lexicon: retrieve all
words in range: mon ≤ w < moo
▪ *mon: find words ending in “mon”: harder
▪ Maintain an additional B-tree for terms backwards.i.e
root –to-leaf path of the B-tree corresponds to a term in
the dictionary written backwards eg. Lemon would be
represented as root-n-o-m-e-l.
Can retrieve all words in range: nom ≤ w < oom.
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent ? 14
Introduction to Information Retrieval
Query processing
▪ At this point, we have an enumeration of all terms in the
dictionary that match the wild-card query.
15
Introduction to Information Retrieval
16
Introduction to Information Retrieval
Permuterm index
▪ A special index for general wildcard queries is the permuterm index.
▪ First, we introduce a special symbol $ into our character set, to mark the
end of a term. E.g. Hello => Hello $
Permuterm
vocabulary
17
Introduction to Information Retrieval
Permuterm index
▪ Consider the wild card query m*n
▪ The key is to rotate such a wildcard query so that the * symbol appears at
the end of the string – thus the rotated wildcard query becomes n$m*.
▪ Next, we look up this string in the permuterm index, where seeking n$m*
(via a search tree) leads to rotations of (among others) the terms man and
moron.
18
.
Introduction to Information Retrieval
▪ In this case we first enumerate the terms in the dictionary that are in the
permuterm index of er$fi*.
▪ Not all such dictionary terms will have the string mo in the middle
▪ In this example, the term fishmonger would survive this filtering but filibuster
would not.
▪ Then run the surviving terms through the standard inverted index for
document retrieval. 19
Introduction to Information Retrieval
$ mac madde
m e n
m amon amortiz
o g e
o along among
n
21
Introduction to Information Retrieval
Processing wild-cards
▪ Query mon* can now be run as
▪ $m AND mo AND on
22
Introduction to Information Retrieval
SPELLING CORRECTION
23
Introduction to Information Retrieval
Spell correction
▪ There are two basic principles underlying most spelling
correction algorithms.
▪ When two correctly spelled queries are tied (or nearly tied), select the
one that is more common.
24
Introduction to Information Retrieval
Spell correction
▪ Two principal uses
▪ Correcting document(s) being indexed
▪ Correcting user queries to retrieve “right” answers
▪ Two main flavors:
▪ Isolated word
▪ Check each word on its own for misspelling
▪ Will not catch typos resulting in correctly spelled words
▪ e.g., from → form
▪ Context-sensitive
▪ Look at surrounding words,
▪ e.g., I flew form Heathrow to Delhi.
25
Introduction to Information Retrieval
Document correction
▪ Especially needed for OCR’ed documents
▪ Correction algorithms are tuned for this:
▪ Can use domain-specific knowledge
▪ E.g., OCR can confuse O and D more often than it would confuse O
and I (adjacent on the QWERTY keyboard, so more likely
interchanged in typing).
26
Introduction to Information Retrieval
27
Introduction to Information Retrieval
▪ What’s “closest”?
28
Introduction to Information Retrieval
▪ E.g., the edit distance from dof to dog is 1 (Just 1 with transpose.)
▪ From cat to act is 2
▪ from cat to dog is 3.
30
Introduction to Information Retrieval
31
Introduction to Information Retrieval
32
Introduction to Information Retrieval
33
Introduction to Information Retrieval
34
Introduction to Information Retrieval
36
Introduction to Information Retrieval
Matching bigrams
▪ Consider the query lord – we wish to identify words
matching 2 of its 3 bigrams (lo, or, rd)
38
Introduction to Information Retrieval
Context-sensitive correction
▪ Need surrounding context to catch this.
▪ Now try all possible resulting phrases with one word “fixed” at
a time
▪ flew from heathrow
▪ fled form heathrow
▪ flea form heathrow
Exercise
▪ Suppose that for “flew form Heathrow” we have 7
alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in
this scheme?
40
Introduction to Information Retrieval
Another approach
▪ Break phrase query into a conjunction of biwords.
41
Introduction to Information Retrieval
42
Introduction to Information Retrieval
SOUNDEX
43
Introduction to Information Retrieval
Soundex
▪ Our final technique for tolerant retrieval has to do with
phonetic correction: misspellings that arise because the user
types a query that sounds like the target term.
44
Introduction to Information Retrieval
45
Introduction to Information Retrieval
Soundex continued
4. Remove all pairs of consecutive digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and
return the first four positions, which will be of the
form <uppercase letter> <digit> <digit> <digit>.
Questions
▪ Beijing
▪ Peking
48
Introduction to Information Retrieval
49
Introduction to Information Retrieval
Exercise
▪ Draw yourself a diagram showing the various indexes
in a search engine incorporating all the functionality
we have talked about
▪ Identify some of the key design choices in the index
pipeline:
▪ Does stemming happen before the Soundex index?
▪ What about n-grams?
▪ Given a query, how would you parse and dispatch
sub-queries to the various indexes?
50