Lecture 4-Dictionaries and Tolerant Retrieval

This document introduces dictionary data structures and tolerant retrieval in information retrieval systems. It discusses how hashtables and trees such as B-trees can store the dictionary: hashtables give faster exact-match lookup, while trees support the prefix search needed for wildcard queries. Permuterm indexes and k-gram indexes are introduced to handle general wildcard queries by indexing term rotations and character k-grams. The document also covers spelling correction (edit distance, k-gram overlap, context-sensitive correction) and phonetic correction with Soundex.

Uploaded by

Prateek Sharma

Introduction to Information Retrieval

Dictionaries and tolerant retrieval

Recap of the previous lecture

▪ The type/token distinction
▪ Terms are normalized types put in the dictionary
▪ Tokenization problems:
▪ Hyphens, apostrophes, compounds, CJK
▪ Term equivalence classing:
▪ Numbers, case folding, stemming, lemmatization
▪ Skip pointers
▪ Encoding a tree-like structure in a postings list
▪ Biword indexes for phrases
▪ Positional indexes for phrases/proximity queries

This lecture
▪ Dictionary data structures
▪ “Tolerant” retrieval
▪ Wild-card queries
▪ Spelling correction
▪ Soundex


Dictionary data structures for inverted indexes

▪ The dictionary data structure stores the term vocabulary,
document frequency, pointers to each postings list … in what
data structure?


A naive dictionary
▪ An array of struct:

term       document frequency   pointer to postings list
char[20]   int                  Postings *
20 bytes   4/8 bytes            4/8 bytes
▪ How do we store a dictionary in memory efficiently?
▪ How do we quickly look up elements at query time?

Dictionary data structures

▪ Two main choices:
▪ Hashtables
▪ Trees
▪ Some IR systems use hashtables, some trees


Dictionary data structures

▪ The choice of solution (hashing or search trees) is
governed by:
▪ How many keys are we likely to have?
▪ Is the number likely to remain static, or change a lot? In
the case of changes, are we likely to only have new keys
inserted, or to also have some keys in the dictionary
deleted?
▪ What are the relative frequencies with which various keys
will be accessed?


Hashtables
▪ Each vocabulary term is hashed to an integer
▪ Pros:
▪ Lookup is faster than for a tree: O(1)
▪ Cons:
▪ No easy way to find minor variants:
▪ judgment/judgement
▪ No prefix search [tolerant retrieval]
▪ If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything


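Tree: binary search tree

[Figure: binary search tree over the dictionary. The root splits the terms into the ranges a-m and n-z; these split further into a-hu, hy-m, n-sh and si-z, with leaves such as aardvark, huygens, sickle and zygot.]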

Issue with BST

▪ The principal issue here is that of rebalancing: as terms are
inserted into or deleted from the binary search tree, it needs
to be rebalanced so that the balance property is maintained.


Tree: B-tree

[Figure: B-tree whose root has three children covering the ranges a-hu, hy-m and n-z.]

▪ Definition: Every internal node has a number of children
in the interval [a,b], where a, b are appropriate natural
numbers, e.g., [2,4].

Trees
▪ Simplest: binary search tree
▪ More usual: B-trees
▪ Trees require a standard ordering of characters and hence
strings … but we typically have one
▪ Pros:
▪ Solves the prefix problem (terms starting with hyp)
▪ Cons:
▪ Slower: O(log M), where M is the number of terms [and this requires a balanced tree]
▪ Rebalancing binary trees is expensive
▪ But B-trees mitigate the rebalancing problem
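
The trade-off between the two structures can be sketched in a few lines. This is a minimal illustration, not an IR system's actual dictionary: a Python `dict` stands in for the hashtable, a sorted list with binary search stands in for the leaves of a search tree, and the toy vocabulary and the name `prefix_search` are assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical toy vocabulary; the sorted list plays the role of tree leaves.
vocabulary = sorted(["hydra", "hyphen", "hypnosis", "hypothesis", "ibis", "judgment"])

# Hashtable: exact-match lookup only (term -> position of its postings entry).
postings_ptr = {term: i for i, term in enumerate(vocabulary)}

def prefix_search(sorted_terms, prefix):
    """Return all terms starting with a non-empty prefix via two binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    # Smallest string that sorts after every string with this prefix:
    # increment the last character of the prefix (hyp -> hyq).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

hyp_terms = prefix_search(vocabulary, "hyp")
print(hyp_terms)
print("judgement" in postings_ptr)  # the variant spelling is simply a miss
```

The hashtable cannot answer the prefix query without scanning every key, whereas the ordered structure answers it with two O(log M) searches plus the size of the result.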


WILD-CARD QUERIES


Wild-card queries: *
▪ mon*: find all docs containing any word beginning
with “mon”.
▪ Easy with binary tree (or B-tree) lexicon: retrieve all
words in range: mon ≤ w < moo
▪ *mon: find words ending in “mon”: harder
▪ Maintain an additional B-tree for terms written backwards, i.e., each
root-to-leaf path of the B-tree corresponds to a term in
the dictionary written backwards; e.g., lemon would be
represented as root-n-o-m-e-l.
▪ Can retrieve all words in range: nom ≤ w < non.
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent?
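
A rough sketch of the two range scans, assuming sorted Python lists in place of the forward and backward B-trees. The vocabulary and the helper name `range_scan` are made up for illustration; the upper bound of each range is the prefix with its last character incremented.

```python
from bisect import bisect_left

def range_scan(sorted_terms, lower, upper):
    """Return every term w with lower <= w < upper."""
    return sorted_terms[bisect_left(sorted_terms, lower):bisect_left(sorted_terms, upper)]

# Hypothetical toy vocabulary.
terms = ["demon", "lemon", "money", "monday", "month", "moon", "salmon"]
forward = sorted(terms)                     # stands in for the normal B-tree
backward = sorted(t[::-1] for t in terms)   # stands in for the B-tree of reversed terms

# mon*: scan the forward structure.
prefix_matches = range_scan(forward, "mon", "moo")

# *mon: scan the backward structure for the reversed suffix, then un-reverse.
suffix_matches = sorted(w[::-1] for w in range_scan(backward, "nom", "non"))

print(prefix_matches)
print(suffix_matches)
```

Note that moon is excluded from both result sets: it sorts at or after the upper bound in the forward list, and its reversal noom sorts after non in the backward list.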

Query processing
▪ At this point, we have an enumeration of all terms in the
dictionary that match the wild-card query.
▪ We still have to look up the postings for each enumerated
term.
▪ E.g., consider the query:
se*ate AND fil*er
▪ This may result in the execution of many Boolean AND
queries.


B-trees handle *’s at the end of a query term

▪ How can we handle *’s in the middle of a query term?
▪ co*tion
▪ We could look up co* AND *tion in a B-tree and
intersect the two term sets
▪ Expensive
▪ The solution: transform wild-card queries so that the
*’s occur at the end
▪ This gives rise to the Permuterm Index.


Permuterm index
▪ A special index for general wildcard queries is the permuterm index.
▪ First, we introduce a special symbol $ into our character set, to mark the
end of a term. E.g., hello => hello$
▪ Next, we construct a permuterm index, in which the various rotations of
each term (augmented with $) are linked to the original vocabulary term.
E.g., hello$, ello$h, llo$he, lo$hel, o$hell and $hello all point to hello.

[Figure: a portion of the permuterm vocabulary.]


Permuterm index
▪ Consider the wildcard query m*n
▪ The key is to rotate such a wildcard query so that the * symbol appears at
the end of the string; thus the rotated wildcard query becomes n$m*.
▪ Next, we look up this string in the permuterm index, where seeking n$m*
(via a search tree) leads to rotations of (among others) the terms man and
moron.
▪ Now that the permuterm index enables us to identify the original
vocabulary terms matching a wildcard query, we look up these terms in
the standard inverted index to retrieve matching documents.
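
The rotation and lookup steps can be sketched as follows. This is a simplified illustration under stated assumptions: a sorted list of (rotation, term) pairs replaces the search tree, the vocabulary is invented, and the function names are hypothetical. It handles queries with exactly one `*`.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$', e.g. hello$ -> ello$h -> llo$he ..."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Sorted (rotation, original term) pairs; a stand-in for a B-tree."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def lookup_single_wildcard(index, query):
    """Answer a query containing exactly one '*', such as m*n."""
    head, tail = query.split("*")
    key = tail + "$" + head            # m*n -> n$m, with the trailing * implicit
    keys = [rot for rot, _ in index]   # rebuilt per call only to keep the sketch short
    matches = set()
    for rot, term in index[bisect_left(keys, key):]:
        if not rot.startswith(key):
            break                      # left the block of rotations sharing the prefix
        matches.add(term)
    return sorted(matches)

index = build_permuterm(["man", "moron", "mat", "nom", "hello"])
result = lookup_single_wildcard(index, "m*n")
print(result)
```

The same function covers mon* (key `$mon`) and *mon (key `mon$`), which is why a single permuterm index can replace the forward and backward trees, at the cost of storing one entry per rotation.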


Permuterm query processing

▪ Rotate the query so that the wild-card appears at the right end. But what about a query
with more than one *, such as fi*mo*er?
▪ In this case we first enumerate the dictionary terms that the permuterm index
returns for er$fi*.
▪ Not all such dictionary terms will have the string mo in the middle.
▪ Filter these out by exhaustive enumeration, checking each candidate to see if it
contains mo.
▪ In this example, the term fishmonger would survive this filtering but filibuster
would not.
▪ Then run the surviving terms through the standard inverted index for
document retrieval.
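
A small sketch of this two-stage processing, assuming the query contains at least one `*`. To stay self-contained it checks rotations by brute force over an invented vocabulary instead of consulting a real permuterm index; the function names are hypothetical.

```python
import re

def matches_rotated_prefix(term, key):
    """True if some rotation of term + '$' starts with key.
    Brute-force stand-in for a permuterm index lookup."""
    s = term + "$"
    return any((s[i:] + s[:i]).startswith(key) for i in range(len(s)))

def wildcard_terms(vocabulary, query):
    parts = query.split("*")
    # Stage 1: keep only the first and last pieces, fi*mo*er -> er$fi*.
    key = parts[-1] + "$" + parts[0]
    candidates = [t for t in vocabulary if matches_rotated_prefix(t, key)]
    # Stage 2: post-filter each candidate against the full wildcard pattern.
    pattern = re.compile(".*".join(re.escape(p) for p in parts))
    matches = [t for t in candidates if pattern.fullmatch(t)]
    return matches, candidates

vocab = ["fishmonger", "filibuster", "finder", "monger", "fixer"]
matches, candidates = wildcard_terms(vocab, "fi*mo*er")
print(candidates)
print(matches)
```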

Bigram (k-gram) indexes

▪ Enumerate all k-grams (sequences of k characters) occurring in any
term
▪ e.g., from the text “April is the cruelest month” we get the
2-grams (bigrams)
$a,ap,pr,ri,il,l$, $i,is,s$, $t,th,he,e$,
$c,cr,ru,ue,el,le,es,st,t$, $m,mo,on,nt,th,h$
▪ $ is a special word boundary symbol
▪ Maintain a second inverted index from bigrams to the dictionary
terms that match each bigram.
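
As an illustration, k-gram enumeration and the second inverted index might look like the sketch below. The function names and the use of a `dict` of sets are assumptions made for the example.

```python
def kgrams(term, k=2):
    """k-grams of a term padded with the boundary symbol $ on both sides."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the set of dictionary terms containing it."""
    index = {}
    for term in vocabulary:
        for gram in kgrams(term, k):
            index.setdefault(gram, set()).add(term)
    return index

index = build_kgram_index(["april", "is", "the", "cruelest", "month"])
print(kgrams("month"))
print(sorted(index["th"]))
```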

Bigram index example

▪ The k-gram index finds terms based on a query consisting of
k-grams (here k=2).

$m → mace, madden
mo → among, amortize
on → along, among


Processing wild-cards
▪ Query mon* can now be run as
$m AND mo AND on
▪ Gets terms that match the AND version of our wildcard query.
▪ But we’d also enumerate moon, which contains all three bigrams yet
does not match mon*.
▪ Must post-filter these terms against the query.
▪ Surviving enumerated terms are then looked up in the
term-document inverted index.
▪ Fast, space efficient (compared to permuterm).
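
The steps above can be sketched as follows, assuming lowercase alphabetic terms and a query that yields at least one bigram. The vocabulary and function names are hypothetical; `fnmatchcase` from the standard library does the post-filtering.

```python
from fnmatch import fnmatchcase

def kgrams(s, k=2):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def build_index(vocabulary, k=2):
    """Bigram -> set of terms, with $ marking both word boundaries."""
    index = {}
    for term in vocabulary:
        for gram in kgrams("$" + term + "$", k):
            index.setdefault(gram, set()).add(term)
    return index

def query_kgrams(query, k=2):
    """Pad the query with $, split at each *, and take k-grams of every piece."""
    grams = []
    for piece in ("$" + query + "$").split("*"):
        grams.extend(kgrams(piece, k))
    return grams

def wildcard_lookup(index, query, k=2):
    grams = query_kgrams(query, k)  # mon* -> $m, mo, on
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    matches = {t for t in candidates if fnmatchcase(t, query)}  # post-filter
    return candidates, matches

index = build_index(["moon", "month", "monday", "lemon", "among"])
candidates, matches = wildcard_lookup(index, "mon*")
print(sorted(candidates))
print(sorted(matches))
```

Here moon passes the Boolean AND over bigrams but is removed by the post-filter, which is exactly the false positive the slide warns about.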


SPELLING CORRECTION


Spell correction
▪ There are two basic principles underlying most spelling
correction algorithms.
▪ Of various alternative correct spellings for a misspelled query, choose
the “nearest” one.
▪ When two correctly spelled queries are tied (or nearly tied), select the
one that is more common.


Spell correction
▪ Two principal uses
▪ Correcting document(s) being indexed
▪ Correcting user queries to retrieve “right” answers
▪ Two main flavors:
▪ Isolated word
▪ Check each word on its own for misspelling
▪ Will not catch typos resulting in correctly spelled words
▪ e.g., from → form
▪ Context-sensitive
▪ Look at surrounding words,
▪ e.g., I flew form Heathrow to Delhi.


Document correction
▪ Especially needed for OCR’ed documents
▪ Correction algorithms are tuned for this:
▪ Can use domain-specific knowledge
▪ E.g., OCR can confuse O and D more often than it would confuse O
and I (adjacent on the QWERTY keyboard, so more likely
interchanged in typing).


Isolated word correction

▪ Fundamental premise – there is a lexicon from which the
correct spellings come

▪ Two basic choices for this

▪ A standard lexicon such as
▪ Webster’s English Dictionary
▪ An “industry-specific” lexicon – hand-maintained

▪ The lexicon of the indexed corpus

▪ E.g., all words on the web
▪ All names, acronyms etc.


Isolated word correction

▪ Given a lexicon and a character sequence Q, return the words
in the lexicon closest to Q

▪ What’s “closest”?

▪ We’ll study several alternatives

▪ Edit distance (Levenshtein distance)
▪ Weighted edit distance
▪ n-gram overlap


Edit distance (Levenshtein distance)

▪ Given two strings S1 and S2, the minimum number of
operations to convert one to the other

▪ Operations are typically character-level

▪ Insert, Delete, Replace, (Transposition)

▪ E.g., the edit distance from dof to dog is 1
▪ From cat to act is 2 (just 1 with transposition)
▪ From cat to dog is 3

▪ Generally found by dynamic programming.
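
A minimal sketch of the standard dynamic-programming computation, using insert, delete and replace at unit cost. Transposition is deliberately left out, so cat to act comes out as 2.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost insert, delete and replace."""
    m, n = len(s1), len(s2)
    # dist[i][j] = distance between the first i chars of s1 and first j chars of s2
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # delete from s1
                dist[i][j - 1] + 1,             # insert into s1
                dist[i - 1][j - 1] + replace_cost,
            )
    return dist[m][n]

print(edit_distance("dof", "dog"), edit_distance("cat", "act"), edit_distance("cat", "dog"))
```

The table has (|S1|+1) × (|S2|+1) cells and each is filled in constant time, so the running time is O(|S1| · |S2|).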


Weighted edit distance

▪ As above, but the weight of an operation depends on the
character(s) involved
▪ Meant to capture OCR or keyboard errors
▪ Example: m is more likely to be mistyped as n than as q
▪ Therefore, replacing m by n counts as a smaller edit distance than
replacing m by q
▪ This may be formulated as a probability model
▪ Requires weight matrix as input.
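
One way to sketch this is the same dynamic program with a substitution-cost lookup. The weights below are invented purely for illustration (a real weight matrix would be estimated from OCR or typing error data), and the sparse `dict` representation of the matrix is an assumption.

```python
def weighted_edit_distance(s1, s2, sub_cost, default_sub=1.0, indel=1.0):
    """Edit distance where replacing a by b costs sub_cost[(a, b)] if present."""
    m, n = len(s1), len(s2)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + indel
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = s1[i - 1], s2[j - 1]
            sub = 0.0 if a == b else sub_cost.get((a, b), default_sub)
            dist[i][j] = min(
                dist[i - 1][j] + indel,        # delete a
                dist[i][j - 1] + indel,        # insert b
                dist[i - 1][j - 1] + sub,      # replace a by b (or match)
            )
    return dist[m][n]

# Hypothetical weights: m and n are easily confused, so that swap is cheap.
weights = {("m", "n"): 0.5, ("n", "m"): 0.5}
d_n = weighted_edit_distance("man", "nan", weights)
d_q = weighted_edit_distance("man", "qan", weights)
print(d_n, d_q)
```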


Using edit distances

▪ Given query, first enumerate all character sequences within a
preset (weighted) edit distance (e.g., 2)
▪ Intersect this set with list of “correct” words
▪ Show terms you found to user as suggestions
▪ Alternatively,
▪ We can look up all possible corrections in our inverted index and
return all docs … slow
▪ We can run with a single most likely correction
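
The enumerate-then-intersect idea can be sketched as below, assuming unweighted edits over a lowercase alphabet. The lexicon is a made-up example and the function names are hypothetical. The candidate set grows quickly with the distance bound and the query length, which is part of why this approach is only practical for small distances.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings exactly one insert, delete or replace away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    return (deletes | inserts | replaces) - {word}

def within_distance(word, max_dist=2):
    """All strings within max_dist edits of word (breadth-first expansion)."""
    seen = {word}
    frontier = {word}
    for _ in range(max_dist):
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return seen

def suggestions(query, lexicon, max_dist=2):
    """Intersect the enumerated strings with the list of correct words."""
    return sorted(within_distance(query, max_dist) & set(lexicon))

lexicon = ["dog", "do", "of", "doff", "cat", "dot", "door"]  # hypothetical
print(suggestions("dof", lexicon, max_dist=1))
print(suggestions("dof", lexicon, max_dist=2))
```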


Edit distance to all dictionary terms?

▪ Given a (mis-spelled) query – do we compute its edit distance
to every dictionary term?

▪ Expensive and slow

▪ Alternative?

▪ How do we cut the set of candidate dictionary terms?

▪ One possibility is to use n-gram overlap for this

▪ This can also be used by itself for spelling correction.


K-gram indexes for spelling correction

▪ Enumerate all the n-grams in the query string as well as in the
lexicon

▪ Use the n-gram index (recall wild-card search) to retrieve all
lexicon terms matching any of the query n-grams

▪ Threshold by number of matching n-grams

▪ Variants – weight by keyboard layout, etc.


Example with trigrams

▪ Suppose the text is november
▪ Trigrams are nov, ove, vem, emb, mbe, ber.
▪ The query is december
▪ Trigrams are dec, ece, cem, emb, mbe, ber.
▪ So 3 trigrams overlap (of 6 in each term)
▪ How can we turn this into a normalized measure of
overlap?


One option – Jaccard coefficient

▪ A commonly-used measure of overlap
▪ Let X and Y be two sets; then the J.C. is
|X ∩ Y| / |X ∪ Y|
▪ Equals 1 when X and Y have the same elements and
zero when they are disjoint
▪ X and Y don’t have to be of the same size
▪ Always assigns a number between 0 and 1
▪ Now threshold to decide if you have a match
▪ E.g., if J.C. > 0.8, declare a match
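
A short sketch of the Jaccard coefficient over n-gram sets, computed as the size of the intersection divided by the size of the union. No boundary symbols are added here, and returning 0.0 for two empty sets is an arbitrary choice made for this example.

```python
def ngrams(term, n):
    """Set of character n-grams of term (no boundary padding)."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|; defined as 0.0 here when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

x = ngrams("november", 3)
y = ngrams("december", 3)
print(sorted(x & y))   # the shared trigrams
print(jaccard(x, y))   # 3 shared out of 9 distinct trigrams
```

With the 0.8 threshold mentioned above, november and december would not be declared a match.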

One option – Jaccard coefficient

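▪ Suppose the query is bord, with bigrams bo, or, rd.
▪ Then the Jaccard coefficient for the query word “bord”
and the word “boardroom” (8 distinct bigrams, 2 shared) is
JC = 2/(3+8-2) = 0.22
▪ Whereas JC for the query word “bord” and the word “border”
(5 bigrams, 3 shared) is: JC = 3/(3+5-3) = 0.60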


Matching bigrams
▪ Consider the query lord – we wish to identify words
matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lore, sloth
or → border, lore, morbid
rd → ardent, border, card

Standard postings “merge” will enumerate …

Adapt this to using Jaccard (or another) measure.
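
The thresholded match can be sketched by counting, for each term, how many of the query's bigram postings lists it appears in. The tiny index below contains only the three postings lists from the figure, and the counting approach is a simplification of a real postings merge.

```python
from collections import Counter

# Postings lists taken from the figure (bigram -> dictionary terms).
bigram_index = {
    "lo": ["alone", "lore", "sloth"],
    "or": ["border", "lore", "morbid"],
    "rd": ["ardent", "border", "card"],
}

def candidates(query, index, min_matches):
    """Terms that share at least min_matches bigrams with the query."""
    grams = {query[i:i + 2] for i in range(len(query) - 1)}
    counts = Counter(term for g in grams for term in index.get(g, []))
    return sorted(t for t, c in counts.items() if c >= min_matches)

result = candidates("lord", bigram_index, min_matches=2)
print(result)
```

To use Jaccard instead of a raw count, the per-term hit count would be divided by the size of the union of the query's and the term's bigram sets, which requires knowing each term's bigram count.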

Context-sensitive spell correction

▪ Isolated-term correction would fail to correct typographical
errors that result in another correctly spelled word.
▪ E.g., text: I flew from Heathrow to Delhi.
▪ Consider the phrase query “flew form Heathrow”
▪ We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.


Context-sensitive correction
▪ Need surrounding context to catch this.

▪ First idea: retrieve dictionary terms close (in weighted edit
distance) to each query term
▪ Now try all possible resulting phrases with one word “fixed” at
a time
▪ flew from heathrow
▪ fled form heathrow
▪ flea form heathrow
▪ Hit-based spelling correction: Suggest the alternative that has
lots of hits.
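
A toy sketch of the one-word-at-a-time enumeration followed by hit-based selection. The alternatives and the hit counts are invented for illustration; in a real system the alternatives would come from an edit-distance lookup and the counts from running each phrase against the index or a query log.

```python
def one_word_fixed(query_terms, alternatives):
    """Phrases obtained by replacing exactly one query term with an alternative."""
    phrases = []
    for i, term in enumerate(query_terms):
        for alt in alternatives.get(term, []):
            phrases.append(tuple(query_terms[:i] + [alt] + query_terms[i + 1:]))
    return phrases

def best_by_hits(phrases, hit_count):
    """Suggest the phrase with the most hits."""
    return max(phrases, key=hit_count)

query = ["flew", "form", "heathrow"]
# Hypothetical near-neighbours of each query term.
alternatives = {"flew": ["fled", "flea"], "form": ["from", "fore"]}
# Hypothetical phrase hit counts (made-up numbers).
hits = {("flew", "from", "heathrow"): 120, ("fled", "form", "heathrow"): 1}

phrases = one_word_fixed(query, alternatives)
best = best_by_hits(phrases, lambda p: hits.get(p, 0))
print(phrases)
print(" ".join(best))
```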

Exercise
▪ Suppose that for “flew form Heathrow” we have 7
alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in
this scheme?


Another approach
▪ Break phrase query into a conjunction of biwords.

▪ Look for biwords that need only one term corrected.

▪ Enumerate only phrases containing “common”
(popular) biwords.


General issues in spell correction

▪ We enumerate multiple alternatives for “Did you
mean?”

▪ Need to figure out which to present to the user

▪ The alternative hitting most docs
▪ Query log analysis


SOUNDEX


Soundex
▪ Our final technique for tolerant retrieval has to do with
phonetic correction: misspellings that arise because the user
types a query that sounds like the target term.

▪ Such algorithms are especially applicable to searches on the
names of people.
▪ The main idea here is to generate, for each term, a “phonetic
hash” so that similar-sounding terms hash to the same value.


Soundex algorithm scheme

▪ Turn every token to be indexed into a 4-character reduced
form.

▪ Build an inverted index from these reduced forms to the
original terms; call this the soundex index.
▪ Do the same with query terms.
▪ When the query calls for a soundex match, search this
soundex index.


Soundex – typical algorithm

1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0'
(zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
▪ B, F, P, V → 1
▪ C, G, J, K, Q, S, X, Z → 2
▪ D, T → 3
▪ L → 4
▪ M, N → 5
▪ R → 6

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and
return the first four positions, which will be of the
form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.

Will hermann generate the same code?
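
The six steps can be sketched as follows. This follows the steps as listed on these slides, not necessarily every variant of Soundex in use. Two choices here are assumptions of the sketch: the retained first letter takes no part in the duplicate-digit collapsing, and characters outside A-Z are dropped.

```python
# Steps 2 and 3: letter -> digit table.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    """4-character reduced form: <uppercase letter><digit><digit><digit>."""
    word = [ch for ch in term.upper() if ch in CODES]   # assumption: drop non-letters
    if not word:
        return ""
    first, rest = word[0], word[1:]                     # step 1: retain first letter
    digits = [CODES[ch] for ch in rest]                 # steps 2-3
    collapsed = []
    for d in digits:                                    # step 4: collapse identical neighbours
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    no_zeros = [d for d in collapsed if d != "0"]       # step 5
    return (first + "".join(no_zeros) + "000")[:4]      # step 6: pad and truncate

print(soundex("Herman"))
```

Calling `soundex("hermann")` with this sketch is a quick way to check the question posed above.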

Questions
▪ Beijing
▪ Peking


What queries can we process?

▪ We have
▪ Positional inverted index with skip pointers
▪ Wild-card index
▪ Spell-correction
▪ Soundex
▪ Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)


Exercise
▪ Draw yourself a diagram showing the various indexes
in a search engine incorporating all the functionality
we have talked about
▪ Identify some of the key design choices in the index
pipeline:
▪ Does stemming happen before the Soundex index?
▪ What about n-grams?
▪ Given a query, how would you parse and dispatch
sub-queries to the various indexes?
