Week 5
When designing a data integration system, should you deal with the
schema mapping problem first or the schema matching problem? Why?
Methods for Data Matching and Cleansing

Slides taken from AnHai Doan's book on Data Integration


Introduction
 Find strings that refer to same real-world entities
– “Venketesh Vishwanathan” and “Venktesh Vishvanathan”
– “Room-108, Student Residence, IIITD, Delhi 20” and “Room# 108, IIIT-D,
Okhla Phase 3, New Delhi 110020”

 Play critical roles in many Data Integration tasks


– Schema matching, data matching, information extraction

 Similarity measures
– Sequence-based: edit distance, Needleman-Wunsch, affine gap, Smith-
Waterman, Jaro, Jaro-Winkler
– Set-based: overlap, Jaccard, TF/IDF
– Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
– Phonetic: Soundex

 Scaling up string matching


– Inverted index, size filtering, prefix filtering, position filtering, bound filtering
Problem Description

• Given two sets of strings X and Y


• Find all pairs x ∈ X and y ∈ Y that refer to the same real-world entity
• We refer to (x,y) as a match
• Example

• Two major challenges: accuracy & scalability


4
Accuracy Challenges

• Matching strings often appear quite differently


• Typing and OCR errors: David Smith vs. Davod Smith
• Different formatting conventions: 10/8 vs. Oct 8
• Custom abbreviation, shortening, or omission:
Daniel Walker Herbert Smith vs. Dan W. Smith
• Different names, nick names: William Smith vs. Bill Smith
• Shuffling parts of strings: Dept. of Computer Science, UW-
Madison vs. Computer Science Dept., UW-Madison

5
Accuracy Challenges

• Solution:
• Use a similarity measure s(x,y) ∈ [0,1]
• The higher s(x,y), the more likely that x and y match
• Declare x and y matched if s(x,y) ≥ t
• Distance/cost measures have also been used
• smaller values → higher similarities

6
Scalability Challenges

 Applying s(x,y) to all pairs is impractical


 Quadratic in size of data
 Solution: apply s(x,y) to only most promising pairs, using a
method FindCands
 For each string x ∈ X
use method FindCands to find a candidate set Z ⊆ Y
for each string y ∈ Z
if s(x,y) ≥ t then return (x,y) as a matched pair
 We discuss ways to implement FindCands later

7
Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering.

8
Edit Distance

• Also known as Levenshtein distance


• d(x,y) computes minimal cost of transforming x into y,
using a sequence of operators, each with cost 1
• Delete a character
• Insert a character
• Substitute a character with another
• Example: x = David Smiths, y = Davidd Simth,
d(x,y) = 4, using the following sequence:
• Inserting a character d (after David)
• Substituting m by i
• Substituting i by m
• Deleting the last character of x, which is s
Edit Distance

• Models common editing mistakes


• Inserting an extra character, swapping two characters, etc.
• So smaller edit distance → higher similarity
• Can be converted into a similarity measure
• s(x,y) = 1 - d(x,y) / [max(length(x), length(y))]
• Example
• s(David Smiths, Davidd Simth) = 1 – 4 / max(12, 12) = 0.67

10
Computing Edit Distance using Dynamic
Programming
• Define x = x1x2… xn, y = y1y2… ym
• d(i,j) = edit distance between x1x2… xi and y1y2… yj, the ith and jth
prefixes of x and y
• Recurrence equations:
d(i,0) = i, d(0,j) = j
d(i,j) = min{ d(i-1,j) + 1, d(i,j-1) + 1, d(i-1,j-1) + c(i,j) },
where c(i,j) = 0 if xi = yj and 1 otherwise
11
Example

• x = dva, y = dave

        y0   y1   y2   y3   y4
              d    a    v    e
  x0     0    1    2    3    4
x1  d    1    0    1    2    3
x2  v    2    1    1    1    2
x3  a    3    2    1    2    2

• d(x,y) = d(3,4) = 2, via the alignment x = d–va, y = dave:
  insert a (after d), then substitute a with e
12
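The recurrence above translates directly into code. Below is a minimal Python sketch (not from the slides) that fills the d(i,j) table and checks the two examples; edit_similarity is the conversion to a similarity score from the previous slide.

def edit_distance(x, y):
    # Levenshtein distance via the DP recurrence d(i,j)
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                           # delete the first i characters of x
    for j in range(m + 1):
        d[0][j] = j                           # insert the first j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete x_i
                          d[i][j - 1] + 1,         # insert y_j
                          d[i - 1][j - 1] + cost)  # substitute (or copy)
    return d[n][m]

def edit_similarity(x, y):
    return 1 - edit_distance(x, y) / max(len(x), len(y))

assert edit_distance("dva", "dave") == 2
assert edit_distance("David Smiths", "Davidd Simth") == 4
assert round(edit_similarity("David Smiths", "Davidd Simth"), 2) == 0.67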
The Jaro Measure

• Mainly for comparing short strings, e.g., first/last names


• To compute jaro(x,y)
• find common characters xi and yj such that
• xi = yj and |i-j| ≤ min {|x|,|y|}/2
• intuitively, common characters are identical and positionally
“close to each other”
• if the i-th common character of x does not match the i-th
common character of y, then we have a transposition
• return jaro(x,y) = (1/3)·[c/|x| + c/|y| + (c – t/2)/c], where c is the
number of common characters and t is the number of
transpositions

13
The Jaro Measure: Examples

• x = jon, y = john
• c = 3 because the common characters are j, o, and n
• t=0
• jaro(x,y) = (1/3)(3/3 + 3/4 + 3/3) = 0.917
• contrast this to 0.75, the similarity score of x and y using edit
distance
• x = jon, y = ojhn
• c = 3 because the common characters are j, o, and n
• t=2
• common char sequence in x is jon
• common char sequence in y is ojn
• jaro(x,y) = (1/3)(3/3 + 3/4 + (3 – 2/2)/3) ≈ 0.81
The Jaro-Winkler Measure

• Captures cases where x and y have a low Jaro score,


but share a prefix → still likely to match
• Computed as
• jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW
• PL = length of the longest common prefix
• PW is a weight given to the prefix

• Exercise: Compute the edit distance between


“Vishwanathan” and “Vishvnahtan” using Levenshtein, Jaro
and Jaro-Winkler. Here PW=0.1

15
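A minimal Python sketch of Jaro and Jaro-Winkler following the definitions above (not part of the original slides). One assumption: the match window min(|x|,|y|)/2 is taken with integer (floor) division, which reproduces both worked examples; note that standard implementations usually also cap PL at 4.

def jaro(x, y):
    window = min(len(x), len(y)) // 2            # "positionally close" window
    used = [False] * len(y)
    x_common, y_positions = [], []
    for i, ch in enumerate(x):
        for j in range(max(0, i - window), min(len(y), i + window + 1)):
            if not used[j] and y[j] == ch:       # common character
                used[j] = True
                x_common.append(ch)
                y_positions.append(j)
                break
    c = len(x_common)
    if c == 0:
        return 0.0
    y_common = [y[j] for j in sorted(y_positions)]
    t = sum(1 for a, b in zip(x_common, y_common) if a != b)   # transpositions
    return (c / len(x) + c / len(y) + (c - t / 2) / c) / 3

def jaro_winkler(x, y, pw=0.1):
    # PL = length of the longest common prefix, PW = prefix weight
    pl = 0
    for a, b in zip(x, y):
        if a != b:
            break
        pl += 1
    j = jaro(x, y)
    return (1 - pl * pw) * j + pl * pw

print(round(jaro("jon", "john"), 3))            # 0.917
print(round(jaro("jon", "ojhn"), 2))            # 0.81
print(round(jaro_winkler("jon", "john"), 3))    # 0.933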
Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering
49
Set-based Similarity Measures
• View strings as sets or multi-sets of tokens
• Use set-related properties to compute similarity scores
• Find the tokens common to both sets: the more common tokens, the
higher the similarity between the sets.
• Common methods to generate tokens
• consider words delimited by space
• possibly stem the words, depending on the application (e.g., catch, catchy,
catcher, …)
• remove common stop words (e.g., the, and, of)
• e.g., given “david smith” → generate tokens “david” and “smith”
• consider n-grams, substrings of length n
• e.g., “david smith” → the set of 3-grams is {##d, #da, dav, avi, …, h##}

• the special character # is added to mark the start and end of the string.


The Overlap Measure

• Let Bx = set of tokens generated for string x


• Let By = set of tokens generated for string y
• O(x,y) = |Bx ⋂ By|
• returns the number of common tokens
• E.g., x = dave, y = dav
• Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
• O(x,y) = 3

18
The Jaccard Measure

• J(x,y) = |Bx ⋂ By|/|Bx ⋃ By|


• E.g., x = dave, y = dav
• Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
• J(x,y) = 3/6
• Very commonly used in practice

19
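A short Python sketch tying the last few slides together: q-gram tokenization with # padding, plus the overlap and Jaccard measures, checked on the dave/dav example (2-grams).

def qgrams(s, q=2, pad="#"):
    # Pad with q-1 copies of '#' at each end, then slide a window of length q.
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def overlap(x, y, q=2):
    return len(qgrams(x, q) & qgrams(y, q))

def jaccard(x, y, q=2):
    bx, by = qgrams(x, q), qgrams(y, q)
    return len(bx & by) / len(bx | by)

assert qgrams("dave") == {"#d", "da", "av", "ve", "e#"}
assert overlap("dave", "dav") == 3
assert jaccard("dave", "dav") == 0.5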
The TF/IDF Measure: Motivation

• uses the TF/IDF notion commonly used in IR


• two strings are similar if they share distinguishing terms
• e.g., x = Apple Corporation, CA
y = IBM Corporation, CA
z = Apple Corp
• s(x,y) > s(x,z) using edit distance or the Jaccard measure, so x is
matched with y → incorrect
• the TF/IDF measure can recognize that Apple is a distinguishing
term, whereas Corporation and CA are far more common →
correctly match x with z

20
Term Frequencies and
Inverse Document Frequencies
• Assume x and y are taken from a collection of strings
• Each string is converted into a bag of terms called
a document
• Define term frequency tf(t,d) = number of times term
t appears in document d
• Define inverse document frequency idf(t) = N / Nd, where N is the
number of documents in the collection and Nd is the number
of documents that contain t
• A higher IDF value means that an occurrence of the term is
more distinguishing.
• note: in practice, idf(t) is often defined as log(N / Nd); here we
will use the simple ratio above to define idf(t)
Example

22
Feature Vectors

• Each document d is converted into a feature vector vd


• vd has a feature vd(t) for each term t
• value of vd(t) is a function of TF and IDF scores
• here we assume vd(t) = tf(t,d) * idf(t)

23
TF/IDF Similarity Score

24
TF/IDF Similarity Score

25
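The formulas on the two slides above are not reproduced in this text version; the sketch below assumes the standard formulation, in which the TF/IDF similarity of two documents is the cosine (normalized dot product) of their feature vectors. The toy corpus is an assumption too: it is padded with extra "... Corporation, CA" documents so that Corporation and CA are common terms, as in the motivating example.

import math
from collections import Counter

docs = {
    "x": ["apple", "corporation", "ca"],
    "y": ["ibm", "corporation", "ca"],
    "z": ["apple", "corp"],
    "d1": ["oracle", "corporation", "ca"],
    "d2": ["intel", "corporation", "ca"],
    "d3": ["cisco", "corporation", "ca"],
}
N = len(docs)
df = Counter(t for d in docs.values() for t in set(d))   # document frequency Nd

def idf(t):
    return N / df[t]          # simple ratio form used on the slides

def feature_vector(d):
    tf = Counter(d)
    return {t: tf[t] * idf(t) for t in tf}

def tfidf_sim(d1, d2):
    # Cosine of the two feature vectors (normalized dot product).
    v1, v2 = feature_vector(d1), feature_vector(d2)
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

# x is more similar to z than to y: Apple is the distinguishing term.
print(tfidf_sim(docs["x"], docs["z"]), tfidf_sim(docs["x"], docs["y"]))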
Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Similarity-based: Ratcliff-Obershelp similarity
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

26
7

Generalized Jaccard Measure

• Jaccard measure
• considers overlapping tokens in both x and y
• a token from x and a token from y must be identical to be included in
the set of overlapping tokens
• this can be too restrictive in certain cases

• Example:
• matching taxonomic nodes that describe companies
• “Energy & Transportation” vs. “Transportation, Energy, & Gas”
• in theory Jaccard is well suited here; in practice it may not work
well if tokens are commonly misspelled
• e.g., energy vs. eneryg
• generalized Jaccard measure can help such cases
8

Generalized Jaccard Measure

• Let Bx = {x1, …, xn}, By = {y1, …, ym}


• Step 1: find token pairs that will be in the “softened”
overlap set
• apply a similarity measure s to compute sim score for each pair (xi, yj)
• keep only those pairs with score ≥ a given threshold t; this forms a bipartite graph G
• find the maximum-weight matching M in G

• Step 2: return normalized weight of M as generalized


Jaccard score
• GJ(x,y) = Σ(xi,yj)∈M s(xi,yj) / (|Bx| + |By| - |M|)
9

An Example

 Generalized Jaccard score: (0.7 + 0.9)/(3 + 2 – 2) = 0.53
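A sketch of the generalized Jaccard computation. Two assumptions not fixed by the slides: the token-level similarity s is difflib's Ratcliff-Obershelp ratio, and the maximum-weight matching M is found with scipy's linear_sum_assignment on the thresholded score matrix (zero-weight pairs are dropped from M).

from difflib import SequenceMatcher
import numpy as np
from scipy.optimize import linear_sum_assignment

def token_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def generalized_jaccard(Bx, By, t=0.5):
    Bx, By = list(Bx), list(By)
    S = np.array([[token_sim(a, b) for b in By] for a in Bx])
    S[S < t] = 0.0                                 # keep only edges with score >= t
    rows, cols = linear_sum_assignment(S, maximize=True)
    M = [(i, j) for i, j in zip(rows, cols) if S[i, j] > 0]
    total = sum(S[i, j] for i, j in M)
    return total / (len(Bx) + len(By) - len(M))

# Roughly 0.61 under these assumptions: transportation matches exactly,
# energy/eneryg match softly, gas stays unmatched.
print(generalized_jaccard({"energy", "transportation"},
                          {"transportation", "eneryg", "gas"}))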


Sequence based -- Ratcliff-Obershelp similarity

1. Find the longest common substring from the two strings.


2. Remove that part from both strings, and split at the same location.
This breaks the strings into two parts, one left and another to the
right of the found common substring.
3. Now take the left part of both strings and call the function again to
find the longest common substring.
4. Do this too for the right part.
5. Repeat Steps 1 to 4 recursively until the size of any broken part is
less than a default value.

The similarity score = 2*the number of characters found in common/


the total number of characters in the two strings
30
Example
S1 = M A T H E M A T I C S
S2 = M A T E M A T I C A
(another pair from the slide to try: MKUESHMOHANIA vs MUKHESHMOHANIYA)

Step 1: Longest common substring: E M A T I C

Step 2: Left String 1: M A T H   Left String 2: M A T
        Right String 1: S        Right String 2: A

Step 3: Common substring in the left strings: M A T

Step 4: No common substring in the right strings

Step 5: Algorithm terminates

Similarity score = 2*9/(11+10) = 0.857
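Python's difflib.SequenceMatcher.ratio() computes this same Ratcliff-Obershelp similarity (for short strings, with no junk heuristic), so it can be used to check the worked example.

from difflib import SequenceMatcher

def ratcliff_obershelp(s1, s2):
    return SequenceMatcher(None, s1, s2).ratio()

print(round(ratcliff_obershelp("MATHEMATICS", "MATEMATICA"), 3))       # 0.857
# The other pair from the slide, left to try:
print(round(ratcliff_obershelp("MKUESHMOHANIA", "MUKHESHMOHANIYA"), 3))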


Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

32
Phonetic Similarity Measures

• Match strings based on their sound, instead of


appearances
• Very effective in matching names, which often appear in
different ways that sound the same
• e.g., Banarjee, Benerji, Banergi; Meyer, Meier, Myir , Mier, and Mire;
Smith, Smithe, and Smythe
• Soundex is most commonly used

33
The Soundex Measure
• Used primarily to match surnames
• maps a surname x into a 4-letter code
• two surnames are judged similar if they share the same code
• Algorithm to map x into a code:
• Step 1: keep the first letter of x, subsequent steps are
performed on the rest of x
• Step 2: remove all occurrences of W and H. Replace the
remaining letters with digits as follows:
• replace B, F, P, V with 1, C, G, J, K, Q, S, X, Z with 2, D, T with 3, L with 4, M, N
with 5, R with 6
• Step 3: replace sequence of identical digits by the digit itself
• Step 4: Drop all non-digit letters; return the first four characters as the
Soundex code
The Soundex Measure

• Example: x = Ashcraft
• after Step 2: A226a13, after Step 3: A26a13, Step 4 converts
this into A2613, then returns A261
• The Soundex code is padded with 0s if there are not enough digits
• Example: Robert and Rupert both map into R163
• Soundex fails to map Gough and Goff, or Jawornicki
and Yavornitzky, to the same code
• designed primarily for Caucasian names, but found to work well
for names of many different origins
• does not work well for names of East Asian origin, which often
rely on vowels to discriminate between names; Soundex ignores vowels
35
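A Python sketch of the Soundex code following exactly the four steps above (standard variants differ in small details, e.g. how letters separated only by H or W are collapsed).

def soundex(name):
    name = name.upper()
    first, rest = name[0], name[1:]                  # Step 1: keep the first letter
    digits = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
              **dict.fromkeys("DT", "3"), "L": "4",
              **dict.fromkeys("MN", "5"), "R": "6"}
    # Step 2: drop W and H, map consonants to digits (vowels stay for now).
    coded = [digits.get(c, c) for c in rest if c not in "WH"]
    # Step 3: collapse runs of identical digits.
    collapsed = []
    for c in coded:
        if c.isdigit() and collapsed and collapsed[-1] == c:
            continue
        collapsed.append(c)
    # Step 4: drop the remaining letters, pad with 0, keep four characters.
    code = first + "".join(c for c in collapsed if c.isdigit())
    return (code + "000")[:4]

assert soundex("Ashcraft") == "A261"
assert soundex("Robert") == soundex("Rupert") == "R163"
assert soundex("Gough") != soundex("Goff")    # the failure case noted above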
Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering

36
Class exercises --

• To compute jaro(x,y)
• find common characters xi and yj such that xi = yj and
|i-j| ≤ min {|x|,|y|}/2
• jaro(x,y) = (1/3)[c/|x| + c/|y| + (c – t/2)/c], where c: # of
common characters, t: # of transpositions

• Exercise: Compute the similarity score between “RAPID”


and “DIPAR” using Jaro method.
37
Class exercises -- contd
• To compute Ratcliff-Obershelp similarity score
1. Find the longest common substring from the two strings.
2. Remove that part from both strings, and split at the same location. This breaks the strings into two
parts, one left and another to the right of the found common substring.
3. Now take the left part of both strings and call the function again to find the longest common substring
4. Do this too for the right part.
5. Repeat Steps 1 to 4 recursively until the size of any broken part is less than a default value.
The similarity score= 2*#of characters found in common/the total # of characters in the two strings
• To compute Soundex measure
• maps the value x into a 4-letter code. Two values are judged similar if they share the same code
• Step 1: keep the first letter of x, subsequent steps are performed on the rest of x
• Step 2: remove all occurrences of W and H. Replace the remaining letters with digits as:
replace B, F, P, V with 1, C, G, J, K, Q, S, X, Z with 2, D, T with 3, L with 4, M, N with 5, R with 6
• Step 3: replace sequence of identical digits by the digit itself
• Step 4: Drop all non-digit letters, return the first four letters as the Soundex code
• Exercise: Compute the similarity score between “Vishwanathan” and
“Vishvnahtan” using Phonetic and Ratcliff-Obershelp similarity measures.
38
Computing similarity between two
strings/records/documents -- Naïve Solution

• All pair-wise comparison between A and B


For x in A:
For y in B:
If sim(x,y) > t, return (x,y);

(A, B: tables; x, y: records viewed as token sets)

• Nested-loop: |A||B| comparisons


• The sim() evaluation may be costly

39
Naïve Solution Example
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

O(x,y) > 2 ?

O(x,y) ID=4 ID=5 ID=6


ID=1 1 0 2
ID=2 2 2 3
ID=3 2 1 3

40
Naïve Solution Example
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

J(x,y) > 0.6 ?

J(x,y) ID=4 ID=5 ID=6

ID=1 0.25 0 0.5
ID=2 0.5 0.5 0.75
ID=3 0.4 0.17 0.6

41
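The naive nested-loop matcher on the example tables, as a small Python sketch; with the Jaccard threshold 0.6 it reports only the pair (2, 6), and with the overlap threshold O(x,y) > 2 it reports (2, 6) and (3, 6).

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

def jaccard(x, y):
    return len(x & y) / len(x | y)

# Nested loop: |A| * |B| similarity evaluations.
print([(xid, yid) for xid, x in A.items() for yid, y in B.items()
       if jaccard(x, y) > 0.6])          # [(2, 6)]
print([(xid, yid) for xid, x in A.items() for yid, y in B.items()
       if len(x & y) > 2])               # [(2, 6), (3, 6)]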
Scalability Challenges

• Applying s(x,y) to all pairs is impractical


• Quadratic in size of data
• Solution: apply s(x,y) to only most promising pairs, using a
method FindCands
• For each string x ∈ X
• use method FindCands to find a candidate set Z ⊆ Y
• for each string y ∈ Z
• if s(x,y) ≥ t then return (x,y) as a matched pair
• This is often called a blocking solution
• We now discuss ways to implement FindCands
• using Jaccard and overlap measures for now
42
2-Step Framework

• Step 1: “Blocking”
• Using Index/heuristics/filtering/etc, reduce # of candidates to
compare
• Step 2: sim() only within candidate sets
For x in A:
Using Foo, find a candidate set C in B
For y in C:
If sim(x,y) > t, return (x,y);

43
Variants for “Foo”

• “Foo”: How to identify candidate set C


• Fast
• Accurate: should not miss any true matches (no false negatives)
• Many Variants for “Foo”
• Inverted Index
• Size filtering
• Prefix Index
• Prefix + Inverted Index
• Position Index

44
Inverted Index over Strings [Sarawagi et al, SIGMOD 04]

• Convert each string y ∈ Y into a document, build an


inverted index over these documents
• Given term t, use the index to quickly find documents
of Y that contain t

45
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

Inverted Index (IDX) for A Inverted Index (IDX) for B


Token in A ID List Token in B ID List
Area 2 Area 5, 6
Dane 3 Lake 4, 6
Lake 1, 2, 3 Mendota 6
Mendota 1, 3 Monona 4, 5, 6
Monona 2, 3 Research 5
University 4
76
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

For x in A:
Using IDX, find a candidate set C in B
For y in C:
If sim(x,y) > t, return (x,y);

Inverted Index (IDX) for B
Token in B ID List
Area 5, 6
Lake 4, 6
Mendota 6
Monona 4, 5, 6
Research 5
University 4

ID=1: {Lake, Mendota} → Candidate set C: {4,6} + {6} = {4, 6}
ID=2: …
ID=3: …
47
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

For x in A:
Using IDX, find a candidate set C in B
For y in C:
If sim(x,y) > t, return (x,y);

Inverted Index (IDX) for B
Token in B ID List
Area 5, 6
Lake 4, 6
Mendota 6
Monona 4, 5, 6
Research 5
University 4

ID=1: {Lake, Mendota} → Candidate set C: {4, 6}
Count how many of x's posting lists each candidate appears in:
ID Freq.
4 1
6 2
then check O(x,y) > 2 only for these candidates
ID=2: …
ID=3: …
48
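A sketch of FindCands using an inverted index over B, on the same example tables; it performs 8 instead of 9 similarity evaluations and finds the same match (2, 6) for the Jaccard threshold 0.6.

from collections import defaultdict

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

# Build the inverted index: token -> set of ids of B records containing it.
idx = defaultdict(set)
for yid, y in B.items():
    for token in y:
        idx[token].add(yid)

def find_cands(x):
    # Union of the posting lists of x's tokens.
    cands = set()
    for token in x:
        cands |= idx.get(token, set())
    return cands

for xid, x in A.items():
    for yid in sorted(find_cands(x)):
        y = B[yid]
        if len(x & y) / len(x | y) > 0.6:
            print((xid, yid))            # only (2, 6)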
Limitations

• The inverted list of some terms (e.g., stop words) can


be very long → costly to build and manipulate such lists
• Requires enumerating all pairs of strings that share at
least one term. This set can still be very large in
practice.

49
Size Filtering [Arasu et al,VLDB 06]

• Retrieves only strings in Y whose sizes make them


match candidates
• given a string x ∈ X, infer a constraint on the size of strings in Y
that can possibly match x
• uses a B-tree index to retrieve only strings that satisfy the size
constraints
• E.g., for Jaccard measure J(x,y) = |x ⋂ y| / |x ⋃ y|
• assume two strings x and y match if J(x,y) ≥ t
• can show that J(x,y) ≤ |y|/|x| ≤ 1/J(x,y)
• so, given a string x ∈ X, only strings y such that
• |x|*t ≤ |y| ≤ |x|/t can possibly match x
50
Example


• Consider x = {lake, mendota}. Suppose t = 0.8


• If y ∈ Y matches x, we must have
• 2/0.8 = 2.5 ≥ |y| ≥ 2*0.8 = 1.6
• no string in Y satisfies this constraint → no match for x
51
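A sketch of size filtering for Jaccard with threshold t: only y whose token-set size lies in [|x|·t, |x|/t] is retrieved. A sorted list plus bisect stands in for the B-tree index mentioned above.

from bisect import bisect_left, bisect_right

B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

# Sort B's ids by token-set size (the "index").
by_size = sorted(B, key=lambda yid: len(B[yid]))
sizes = [len(B[yid]) for yid in by_size]

def size_candidates(x, t):
    lo, hi = len(x) * t, len(x) / t
    i = bisect_left(sizes, lo)
    j = bisect_right(sizes, hi)
    return [by_size[k] for k in range(i, j)]

x = {"Lake", "Mendota"}
print(size_candidates(x, 0.8))   # []: no y has size in [1.6, 2.5], so no match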
Prefix Filtering [Chaudhuri et al, ICDE 06]

• Key idea: if two sets share many terms → large subsets of them
also share terms
• Consider overlap measure O(x,y) = |x ⋂ y|
• if |x ⋂ y| ≥ k → any subset x' ⊆ x of size at least |x| - (k – 1) must
overlap y
• For J(x,y) ≥ t, any subset x' ⊆ x of size at least |x| - (k – 1) must overlap y,
where k = ⌈|x|*t⌉
• To exploit this idea to find pairs (x,y) such that O(x,y) ≥ k
• given x, construct subset x’ of size |x| - (k – 1)
• use an inverted index to find all y that overlap x’
52
Example

• Consider matching using O(x,y) ≥ 2


• x1 = {lake, mendota}, let x1’ = {lake}
• Use the inverted index to find {y4, y6}, which contain at least one token in x1’.
• If we had used the inverted index to find all strings in Y that contain at least one
token in x, what would be the answer?
53
Prefix-Inverted Index:
Selecting the Subset Intelligently [Bayardo et al,WWW 07]

• Recall that we select a subset x’ of x and check its overlap with


the entire set y
• We can do better by selecting a particular subset x’ and
checking its overlap with only a particular subset y’ of y
• How?
• impose an ordering O over the universe of all possible terms in increasing
frequency
• reorder the terms in each x ε X and y ε Y according to O
• select subset x’ that contains the first n terms of x as the prefix of
size n of x
• if |x ⋂ y| ≥ k, then x’ and y’ must overlap, where x’ is the prefix
of size |x| - (k – 1) of x and y’ is the prefix of size |y| - (k – 1) of y.
54
Claim: if |x ⋂ y| ≥ k, then x' and y' must overlap, where x' is the prefix of
size |x| - (k – 1) of x and y' is the prefix of size |y| - (k – 1) of y.

Proof:
Write x = x' x" and y = y' y", where x" and y" are the suffixes of size k – 1.
Suppose x' ⋂ y' = ∅. Then x' ⋂ y" ≠ ∅ (otherwise x' ⋂ y = ∅, so all of x ⋂ y
would have to come from the k – 1 tokens of x", and |x ⋂ y| ≥ k could not hold).
Similarly, y' ⋂ x" ≠ ∅.
So ∃u: u ∈ x' and u ∈ y", and similarly ∃v: v ∈ y' and v ∈ x".
From u ∈ x' and v ∈ x": u < v in the ordering O.
But from v ∈ y' and u ∈ y": v < u → contradiction.
Thus x' overlaps with y'.
Selecting the Subset Intelligently

• Algorithm
• reorder terms in each x ε X and y ε Y in increasing order of their
frequencies
• for each y ε Y, create y’, the prefix of size |y| - (k – 1) of y
• build an inverted index over all prefixes y’
• for each x ε X, create x’, the prefix of size |x| - (k – 1) of x, then
use above index to find all y such that x’ overlaps with y’

56
Prefix + Inverted Index [Bayardo et al,WWW 07]

A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

Inverted Index (IDX) for both A and B:

Token ID List TF Order
Area 2, 5, 6 3 4
Dane 3 1 1
Lake 1, 2, 3, 4, 6 5 6
Mendota 1, 3, 6 3 5
Monona 2, 3, 4, 5, 6 5 7
Research 5 1 2
University 4 1 3

Create a universal order: put rare tokens first
Order: Dane > Research > University > Area > Mendota > Lake > Monona
Prefix + Inverted Index [Bayardo et al,WWW 07]

Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}

Order: Dane > Research > University > Area > Mendota > Lake > Monona
Prefix + Inverted Index [Bayardo et al,WWW 07]

Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}

O(x,y) ≥ 2 (k = 2), Prefix(x) = |x|-(k-1) = |x|-1

Prefix Inverted Index for B
Token in B ID List
Area 5, 6
Lake 4, 6
Mendota 6
Research 5
University 4

ID=1: {Mendota, Lake}, prefix {Mendota} → Candidate set C: {6}
ID=2: …
ID=3: …
Prefix + Inverted Index [Bayardo et al,WWW 07]

Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}

O(x,y) ≥ 2 (k = 2), Prefix(x) = |x|-(k-1) = |x|-1

Prefix Inverted Index for B
Token in B ID List
Area 5, 6
Lake 4, 6
Mendota 6
Research 5
University 4

ID=1: …
ID=2: {Area, Lake, Monona}, prefix {Area, Lake} → Candidate set C: {5,6} + {4,6} = {4,5,6}
ID=3: …
Prefix + Inverted Index [Bayardo et al,WWW 07]

Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}

O(x,y) ≥ 2 (k = 2), Prefix(x) = |x|-(k-1) = |x|-1

Prefix Inverted Index for B
Token in B ID List
Area 5, 6
Lake 4, 6
Mendota 6
Research 5
University 4

ID=1: …
ID=2: …
ID=3: {Dane, Mendota, Lake, Monona}, prefix {Dane, Mendota, Lake} → Candidate set C: {6} + {4,6} = {4,6}
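A Python sketch of prefix filtering with a universal "rare tokens first" order and a prefix inverted index over B, reproducing the candidate sets in the walkthrough above (overlap threshold k = 2; ties in frequency are broken alphabetically, which happens to match the order on the slides).

from collections import Counter, defaultdict

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
k = 2

# Universal order: rare tokens first (ties broken alphabetically here).
freq = Counter(tok for rec in list(A.values()) + list(B.values()) for tok in rec)
order = {tok: rank for rank, tok in
         enumerate(sorted(freq, key=lambda tok: (freq[tok], tok)))}

def prefix(rec):
    return sorted(rec, key=order.get)[:len(rec) - (k - 1)]

# Prefix inverted index over B.
pidx = defaultdict(set)
for yid, y in B.items():
    for tok in prefix(y):
        pidx[tok].add(yid)

for xid, x in A.items():
    cands = set().union(*(pidx.get(tok, set()) for tok in prefix(x)))
    print(xid, sorted(cands))
# 1 [6]
# 2 [4, 5, 6]
# 3 [4, 6]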
Position Index [Xiao et al,WWW 08]
Further limits the set of candidate matches by deriving an upper bound on
the size of overlap between x and y
Order: Dane > Research > University > Area > Mendota > Lake > Monona

• x: {Dane, Research, Area, Mendota, Lake}


• y: {Research, Area, Mendota, Lake, Monona}
• O(x,y) > 4 ?

• Prefix(x) = Prefix(y) = 5 – (4 -1) = 2
• x: {Dane, Research, Area, Mendota, Lake}
• y: {Research, Area, Mendota, Lake, Monona}
• “Research” is a common prefix → (x,y) is a candidate pair under the
Prefix Filtering + Inverted Index method
• → we would need to compute sim(x,y)
• Estimation of max overlap = overlap in prefixes + min of unseen tokens
= 1 + min(3,4) = 4, which is not > 4 → no need to compute sim(x,y)!
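A sketch of the position-filtering bound on this slide: the best possible overlap is the overlap already seen in the prefixes plus the smaller number of tokens not yet seen on either side. For the example pair the bound is 4, so the overlap cannot exceed the threshold O(x,y) > 4 and sim(x,y) need not be computed.

def position_upper_bound(x_ordered, y_ordered, k):
    px = x_ordered[:len(x_ordered) - (k - 1)]    # prefix of x
    py = y_ordered[:len(y_ordered) - (k - 1)]    # prefix of y
    common = set(px) & set(py)
    if not common:
        return 0                                 # not even a candidate pair
    # Position (0-based) of the last common prefix token on each side.
    ix = max(x_ordered.index(tok) for tok in common)
    iy = max(y_ordered.index(tok) for tok in common)
    unseen = min(len(x_ordered) - ix - 1, len(y_ordered) - iy - 1)
    return len(common) + unseen

x = ["Dane", "Research", "Area", "Mendota", "Lake"]
y = ["Research", "Area", "Mendota", "Lake", "Monona"]
print(position_upper_bound(x, y, k=4))   # 4: cannot satisfy O(x,y) > 4, so prune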
Summary
• String matching is pervasive in data integration
• Two key challenges:
• what similarity measure and how to scale up?
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering.
Acknowledgment

 Slides in the scalability section are adapted from


http://pike.psu.edu/p2/wisc09-tech.ppt

94
