Week 5
Week 5
Similarity measures
– Sequence-based: edit distance, Needleman-Wunch, affine gap, Smith-
Waterman, Jaro, Jaro-Winkler
– Set-based: overlap, Jaccard, TF/IDF
– Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
– Phonetic: Soundex
5
Accuracy Challenges
• Solution:
• Use a similarity measure s(x,y) ε [0,1]
• The higher s(x,y), the more likely that x and y match
• Declare x and y matched if s(x,y) ≥ t
• Distance measure/cost measure have also been used
• smaller values higher similarities
6
Scalability Challenges
7
Outline
• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering.
8
Edit Distance
10
Computing Edit Distance using Dynamic
Programming
• Define x = x1x2… xn, y = y1y2… ym
• d(i,j) = edit distance between x1x2… xi and y1y2… yj, the ith and jth
prefixes of x and y
• Recurrence equations
11
Example
• x = dva, y = dave
y0 y1 y2 y3 y4 y0 y1 y2 y3 y4
d a v e d a v e x=d–va
x0 0 1 2 3 4 x0 0 1 2 3 4 y=dave
x1 d 1 0 1 x1 d 1 0 1 2 3
substitute a with e
x2 v 2 x2 v 2 1 1 1 2 insert a (after d)
x3 a 3 x3 a 3 2 1 2 2
12
The Jaro Measure
13
The Jaro Measure: Examples
• x = jon, y = john
• c = 3 because the common characters are j, o, and n
• t=0
• jaro(x,y) = 1 / 3(3/3 + 3/4 + 3/3) = 0.917
• contrast this to 0.75, the similarity score of x and y using edit
distance
• x = jon, y = ojhn
• c = 3 because the common characters are j, o, and n
• t=2
• common char sequence in x is jon
• common char sequence in y is ojn
• jaro(x,y) = 0.81 14
The Jaro-Winkler Measure
15
Outline
• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering
49
Set-based Similarity Measures
• View strings as sets or multi-sets of tokens
• Use set-related properties to compute similarity scores
• Find the similar tokens in both sets. More the number of common
tokens, more is the similarity between the sets.
• Common methods to generate tokens
• consider words delimited by space
• possibly stem the words (depending on the application) [catch, catchy,
catcher,..)
• remove common stop words (e.g., the, and, of)
• e.g., given “david smith” generate tokens “david” and “smith”
• consider n-grams, substrings of length n
• e.g., “david smith” the set of 3-grams are {##d, #da, dav, avi, …, h##}
18
The Jaccard Measure
19
The TF/IDF Measure: Motivation
20
Term Frequencies and
Inverse Document Frequencies
• Assume x and y are taken from a collection of strings
• Each string is converted into a bag of terms called
a document
• Define term frequency tf(t,d) = number of times term
t appears in document d
• Define inverse document frequency idf(t) = N / Nd,
number of documents in collection divided by number
of documents that contain t
• A higher value of IDF means that the occurrence of the term is
more distinguishing.
• note: in practice, idf(t) is often defined as log(N / Nd), here we
will use the above simple formula to define idf(t) 21
Example
22
Feature Vectors
23
TF/IDF Similarity Score
24
TF/IDF Similarity Score
25
Outline
• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Similarity-based: Ratcliff-Obershelp similarity
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering
26
7
• Jaccard measure
• considers overlapping tokens in both x and y
• a token from x and a token from y must be identical to be included in
the set of overlapping tokens
• this can be too restrictive in certain cases
• Example:
• matching taxonomic nodes that describe companies
• “Energy & Transportation” vs. “Transportation, Energy, & Gas”
• in theory Jaccard is well suited here, in practice Jaccard may not work
well if tokens are commonly mispelled
• e.g., energy vs. eneryg
• generalized Jaccard measure can help such cases
8
An Example
• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering
32
Phonetic Similarity Measures
33
The Soundex Measure
• Used primarily to match surnames
• maps a surname x into a 4-letter code
• two surnames are judged similar if share the same code
• Algorithm to map x into a code:
• Step 1: keep the first letter of x, subsequent steps are
performed on the rest of x
• Step 2: remove all occurences of W and H. Replace the
remaining letters with digits as follows:
• replace B, F, P, V with 1, C, G, J, K, Q, S, X, Z with 2, D, T with 3, L with 4, M, N
with 5, R with 6
• Step 3: replace sequence of identical digits by the digit itself
• Step 4: Drop all non-digit letters, return the first four letters as the
Soundex code 65
The Soundex Measure
• Example: x = Ashcraft
• after Step 2: A226a13, after Step 3: A26a13, Step 4 converts
this into A2613, then returns A261
• Soundex code is padded with 0 if there is not enough digits
• Example: Robert and Rupert map into R163
• Soundex fails to map Gough and Goff, and Jawornicki
and Yavornitzky
• designed primarily for Caucasian names, but found to work well
for names of many different origins
• does not work well for names of East Asian origins
• which uses vowels to discriminate, Soundex ignores vowels
35
Outline
• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering
36
Class exercises --
• To compute jaro(x,y)
• find common characters xi and yj such that xi = yj and
|i-j| ≤ min {|x|,|y|}/2
• jaro(x,y) = 1/ 3[c/|x| + c/|y| + (c – t/2)/c], where c: # of
common characters, t: is the # of transpositions
39
Naïve Solution Example
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}
O(x,y) > 2 ?
40
Naïve Solution Example
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}
41
Scalability Challenges
• Step 1: “Blocking”
• Using Index/heuristics/filtering/etc, reduce # of candidates to
compare
• Step 2: sim() only within candidate sets
For x in A:
Using Foo, find a candidate set C in B
For y in C:
If sim(x,y) > t, return (x,y);
43
Variants for “Foo”
44
Inverted Index over Strings [Sarawagi et al, SIGMOD 04]
45
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}
49
Size Filtering [Arasu et al,VLDB 06]
,6
• Key idea: if two sets share many terms large subsets of them
also share terms
• Consider overlap measure O(x,y) = |x ⋂ y|
• if |x ⋂ y| ≥ k any subset x’ x of size at least |x| - (k – 1) must
overlap y
• For J(x,y), any subset x’ x of size at least |x| - (k– 1) must overlap y,
г
where k = |x|*t˥
• To exploit this idea to find pairs (x,y) such that O(x,y) ≥ k
• given x, construct subset x’ of size |x| - (k – 1)
• use an inverted index to find all y that overlap x’
52
Example
Proof:
Prefix Suffix
• Algorithm
• reorder terms in each x ε X and y ε Y in increasing order of their
frequencies
• for each y ε Y, create y’, the prefix of size |y| - (k – 1) of y
• build an inverted index over all prefixes y’
• for each x ε X, create x’, the prefix of size |x| - (k – 1) of x, then
use above index to find all y such that x’ overlaps with y’
56
Prefix + Inverted Index [Bayardo et al,WWW 07]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}
Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}
Order: Dane > Research > University > Area > Mendota > Lake > Monona
Prefix + Inverted Index [Bayardo et al,WWW 07]
Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}
O(x,y) > 2
Prefix Inverted Index for B
Prefix(x)=|x|-(k-1)=|x|-1
Token in B ID List
Area 5
Lake 4, 6
ID=1: {Mendota, Lake} Mendota 6
Research 5
ID=2: …
Candidate set C: {6} University 4
ID=3: …
Prefix + Inverted Index [Bayardo et al,WWW 07]
Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}
O(x,y) > 2
Prefix Inverted Index for B
Prefix(x)=|x|-(k-1)=|x|-1
Token in B ID List
Area 5
Lake 4, 6
ID=1: …
Mendota 6
ID=2: {Area, Lake, Monona} Research 5
Candidate set C: University 4
ID=3: …
{5} + {4,6} = {4,5,6}
Prefix + Inverted Index [Bayardo et al,WWW 07]
Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}
O(x,y) > 2
Prefix Inverted Index for B
Prefix(x)=|x|-(k-1)=|x|-1
Token in B ID List
ID=1: … Area 5
Lake 4, 6
ID=2: …
Mendota 6
ID=3: {Dane, Mendota, Research 5
Lake, Monona}
Candidate set C: University 4
{6} + {4,6} = {4,6}
Position Index [Xiao et al,WWW 08]
Further limits the set of candidate matches by deriving an upper bound on
the size of overlap between x and y
Order: Dane > Research > University > Area > Mendota > Lake > Monona
94