0% found this document useful (0 votes)

47 views64 pages

Week 5

When designing a data integration system, you should first deal with the schema mapping problem before the schema matching problem. Schema mapping determines how data from different sources will be combined and structured in the integrated system. Schema matching is the process of finding semantic correspondences between elements of different schemas, which informs the schema mapping process.

Uploaded by

Vaibhav Jindal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views64 pages

Week 5

Uploaded by

Vaibhav Jindal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 64

When designing a data integration system, should you deal the

schema mapping problem first or schema matching problem?

Why?
Methods for
Data Matching and Cleansing

Slides taken from Anhai’s book on Data Integration

Introduction
 Find strings that refer to same real-world entities
– “Venketesh Vishwanathan” and “Venktesh Vishvanathan”
– “Room-108, Student Residence, IIITD, Delhi 20” and “Room# 108, IIIT-D,
Okhla Phase 3, New Delhi 110020”

 Play critical roles in many Data Integration tasks

– Schema matching, data matching, information extraction

 Similarity measures
– Sequence-based: edit distance, Needleman-Wunch, affine gap, Smith-
Waterman, Jaro, Jaro-Winkler
– Set-based: overlap, Jaccard, TF/IDF
– Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
– Phonetic: Soundex

 Scaling up string matching

– Inverted index, size filtering, prefix filtering, position filtering, bound filtering
Problem Description

• Given two sets of strings X and Y

• Find all pairs xεX and yεY that refer to the same real-world entity
• We refer to (x,y) as a match
• Example

• Two major challenges: accuracy & scalability

4
Accuracy Challenges

• Matching strings often appear quite differently

• Typing and OCR errors: David Smith vs. Davod Smith
• Different formatting conventions: 10/8 vs. Oct 8
• Custom abbreviation, shortening, or omission:
Daniel Walker Herbert Smith vs. Dan W. Smith
• Different names, nick names: William Smith vs. Bill Smith
• Shuffling parts of strings: Dept. of Computer Science, UW-
Madison vs. Computer Science Dept., UW-Madison

5
Accuracy Challenges

• Solution:
• Use a similarity measure s(x,y) ε [0,1]
• The higher s(x,y), the more likely that x and y match
• Declare x and y matched if s(x,y) ≥ t
• Distance measure/cost measure have also been used
• smaller values  higher similarities

6
Scalability Challenges

 Applying s(x,y) to all pairs is impractical

 Quadratic in size of data
 Solution: apply s(x,y) to only most promising pairs, using a
method FindCands
 For each string x ε X
use method FindCands to find a candidate set Z  Y
for each string y ε Z
if s(x,y) ≥ t then return (x,y) as a matched pair
 We discuss ways to implement FindCands later

7
Outline

8
Edit Distance

• Also known as Levenshtein distance

• d(x,y) computes minimal cost of transforming x into y,
using a sequence of operators, each with cost 1
• Delete a character
• Insert a character
• Substitute a character with another
• Example: x = David Smiths, y = Davidd Simth,
d(x,y) = 4, using following sequence
• Inserting a character d (after David)
• Substituting m by i
• Substituting i by m
• Deleting the last character of x, which is s 9
Edit Distance

• Models common editing mistakes

• Inserting an extra character, swapping two characters, etc.
• So smaller edit distance  higher similarity
• Can be converted into a similarity measure
• s(x,y) = 1 - d(x,y) / [max(length(x), length(y))]
• Example
• s(David Smiths, Davidd Simth) = 1 – 4 / max(12, 12) = 0.67

10
Computing Edit Distance using Dynamic
Programming
• Define x = x1x2… xn, y = y1y2… ym
• d(i,j) = edit distance between x1x2… xi and y1y2… yj, the ith and jth
prefixes of x and y
• Recurrence equations

11
Example

• x = dva, y = dave
y0 y1 y2 y3 y4 y0 y1 y2 y3 y4
d a v e d a v e x=d–va

x0 0 1 2 3 4 x0 0 1 2 3 4 y=dave
x1 d 1 0 1 x1 d 1 0 1 2 3
substitute a with e
x2 v 2 x2 v 2 1 1 1 2 insert a (after d)
x3 a 3 x3 a 3 2 1 2 2

12
The Jaro Measure

• Mainly for comparing short strings, e.g., first/last names

• To compute jaro(x,y)
• find common characters xi and yj such that
• xi = yj and |i-j| ≤ min {|x|,|y|}/2
• intuitively, common characters are identical and positionally
“close to each other”
• if the i-th common character of x does not match the i-th
common character of y, then we have a transposition
• return jaro(x,y) = 1 / 3[c/|x| + c/|y| + (c – t/2)/c], where c is the
number of common characters, and t is the number of
transpositions

13
The Jaro Measure: Examples

• x = jon, y = john
• c = 3 because the common characters are j, o, and n
• t=0
• jaro(x,y) = 1 / 3(3/3 + 3/4 + 3/3) = 0.917
• contrast this to 0.75, the similarity score of x and y using edit
distance
• x = jon, y = ojhn
• c = 3 because the common characters are j, o, and n
• t=2
• common char sequence in x is jon
• common char sequence in y is ojn
• jaro(x,y) = 0.81 14
The Jaro-Winkler Measure

• Captures cases where x and y have a low Jaro score,

but share a prefix  still likely to match
• Computed as
• jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW
• PL = length of the longest common prefix
• PW is a weight given to the prefix

• Exercise: Compute the edit distance between

“Vishwanathan” and “Vishvnahtan” using Levenshtein, Jaro
and Jaro-Winkler. Here PW=0.1

15
Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering
49
Set-based Similarity Measures
• View strings as sets or multi-sets of tokens
• Use set-related properties to compute similarity scores
• Find the similar tokens in both sets. More the number of common
tokens, more is the similarity between the sets.
• Common methods to generate tokens
• consider words delimited by space
• possibly stem the words (depending on the application) [catch, catchy,
catcher,..)
• remove common stop words (e.g., the, and, of)
• e.g., given “david smith”  generate tokens “david” and “smith”
• consider n-grams, substrings of length n
• e.g., “david smith”  the set of 3-grams are {##d, #da, dav, avi, …, h##}

• special character # is added to handle the start and end of string.

The Overlap Measure

• Let Bx = set of tokens generated for string x

• Let By = set of tokens generated for string y
• O(x,y) = |Bx ⋂ By|
• returns the number of common tokens
• E.g., x = dave, y = dav
• Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
• O(x,y) = 3

18
The Jaccard Measure

• J(x,y) = |Bx ⋂ By|/|Bx ⋃ By|

• E.g., x = dave, y = dav
• Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
• J(x,y) = 3/6
• Very commonly used in practice

19
The TF/IDF Measure: Motivation

• uses the TF/IDF notion commonly used in IR

• two strings are similar if they share distinguishing terms
• e.g., x = Apple Corporation, CA
y = IBM Corporation, CA
z = Apple Corp
• s(x,y) > s(x,z) using edit distance or Jaccard measure, so x is
matched with y  incorrect
• TF/IDF measure can recognize that Apple is a distinguishing
term, whereas Corporation and CA are far more common 
correctly match x with z

20
Term Frequencies and
Inverse Document Frequencies
• Assume x and y are taken from a collection of strings
• Each string is converted into a bag of terms called
a document
• Define term frequency tf(t,d) = number of times term
t appears in document d
• Define inverse document frequency idf(t) = N / Nd,
number of documents in collection divided by number
of documents that contain t
• A higher value of IDF means that the occurrence of the term is
more distinguishing.
• note: in practice, idf(t) is often defined as log(N / Nd), here we
will use the above simple formula to define idf(t) 21
Example

22
Feature Vectors

• Each document d is converted into a feature vector vd

• vd has a feature vd(t) for each term t
• value of vd(t) is a function of TF and IDF scores
• here we assume vd(t) = tf(t,d) * idf(t)

23
TF/IDF Similarity Score

24
TF/IDF Similarity Score

25
Outline

• Problem description
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Similarity-based: Ratcliff-Obershelp similarity
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

26
7

Generalized Jaccard Measure

• Jaccard measure
• considers overlapping tokens in both x and y
• a token from x and a token from y must be identical to be included in
the set of overlapping tokens
• this can be too restrictive in certain cases

• Example:
• matching taxonomic nodes that describe companies
• “Energy & Transportation” vs. “Transportation, Energy, & Gas”
• in theory Jaccard is well suited here, in practice Jaccard may not work
well if tokens are commonly mispelled
• e.g., energy vs. eneryg
• generalized Jaccard measure can help such cases
8

Generalized Jaccard Measure

• Let Bx = {x1, …, xn}, By = {y1, …, ym}

• Step 1: find token pairs that will be in the “softened”
overlap set
• apply a similarity measure s to compute sim score for each pair (xi, yj)
• keep only those score ≥ a given threshold t, this forms a bipartite graph G
• find the maximum-weight matching M in G

• Step 2: return normalized weight of M as generalized

Jaccard score
• GJ(x,y) =  (xi,yj)εM s(xi,yj) / (|Bx| + |By| - |M|)
9

An Example

 Generalized Jaccard score: (0.7 + 0.9)/(3 + 2 – 2) = 0.53

Sequence based -- Ratcliff-Obershelp similarity

1. Find the longest common substring from the two strings.

2. Remove that part from both strings, and split at the same location.
This breaks the strings into two parts, one left and another to the
right of the found common substring.
3. Now take the left part of both strings and call the function again to
find the longest common substring.
4. Do this too for the right part.
5. Repeat Steps 1 to 4 recursively until the size of any broken part is
less than a default value.

The similarity score = 2*the number of characters found in common/

the total number of characters in the two strings
30
Example
S1 = M A T H E M A T I C S
S1: MKUESHMOHANIA
S2 = M A T E M A T I C A
S2: MUKHESHMOHANIYA

Step 1: Common substring - E M A T I C

S1 = M A T H E M A T I C S
S2 = M A T E M A T I C A

Step 2: Left String 1: M A T H Left String 2: M A T

Right String 1: S Right String 2: A

Step 3: Common substring in left string: M A T

S1 = M A T H E M A T I C S
S2 = M A T E M A T I C A
Step 4: None
Step 5: Algorithm terminates

Similarity score = 2*9/(11+10) = 0.857 61

Outline

32
Phonetic Similarity Measures

• Match strings based on their sound, instead of

appearances
• Very effective in matching names, which often appear in
different ways that sound the same
• e.g., Banarjee, Benerji, Banergi; Meyer, Meier, Myir , Mier, and Mire;
Smith, Smithe, and Smythe
• Soundex is most commonly used

33
The Soundex Measure
• Used primarily to match surnames
• maps a surname x into a 4-letter code
• two surnames are judged similar if share the same code
• Algorithm to map x into a code:
• Step 1: keep the first letter of x, subsequent steps are
performed on the rest of x
• Step 2: remove all occurences of W and H. Replace the
remaining letters with digits as follows:
• replace B, F, P, V with 1, C, G, J, K, Q, S, X, Z with 2, D, T with 3, L with 4, M, N
with 5, R with 6
• Step 3: replace sequence of identical digits by the digit itself
• Step 4: Drop all non-digit letters, return the first four letters as the
Soundex code 65
The Soundex Measure

• Example: x = Ashcraft
• after Step 2: A226a13, after Step 3: A26a13, Step 4 converts
this into A2613, then returns A261
• Soundex code is padded with 0 if there is not enough digits
• Example: Robert and Rupert map into R163
• Soundex fails to map Gough and Goff, and Jawornicki
and Yavornitzky
• designed primarily for Caucasian names, but found to work well
for names of many different origins
• does not work well for names of East Asian origins
• which uses vowels to discriminate, Soundex ignores vowels
35
Outline

36
Class exercises --

• To compute jaro(x,y)
• find common characters xi and yj such that xi = yj and
|i-j| ≤ min {|x|,|y|}/2
• jaro(x,y) = 1/ 3[c/|x| + c/|y| + (c – t/2)/c], where c: # of
common characters, t: is the # of transpositions

• Exercise: Compute the similarity score between “RAPID”

and “DIPAR” using Jaro method.
37
Class exercises -- contd
• To compute Ratcliff-Obershelp similarity score
1. Find the longest common substring from the two strings.
2. Remove that part from both strings, and split at the same location. This breaks the strings into two
parts, one left and another to the right of the found common substring.
3. Now take the left part of both strings and call the function again to find the longest common substring
4. Do this too for the right part.
5. Repeat Steps 1 to 4 recursively until the size of any broken part is less than a default value.
The similarity score= 2*#of characters found in common/the total # of characters in the two strings
• To compute Soundex measure
• maps the value x into a 4-letter code. Two values are judged similar if share the same code
• Step 1: keep the first letter of x, subsequent steps are performed on the rest of x
• Step 2: remove all occurences of W and H. Replace the remaining letters with digits as:
replace B, F, P, V with 1, C, G, J, K, Q, S, X, Z with 2, D, T with 3, L with 4, M, N with 5, R with 6
• Step 3: replace sequence of identical digits by the digit itself
• Step 4: Drop all non-digit letters, return the first four letters as the Soundex code
• Exercise: Compute the similarity score between “Vishwanathan” and
“Vishvnahtan” using Phonetic and Ratcliff-Obershelp similarity measures.
38
Computing similarity between two
strings/record/documents -- Naïve Solution

• All pair-wise comparison between A and B

For x in A: A, B: table
x, y: record as set
For y in B:
If sim(x,y) > t, return (x,y);

• Nested-loop: |A||B| comparisons

• The sim() evaluation may be costly

39
Naïve Solution Example
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

O(x,y) > 2 ?

O(x,y) ID=4 ID=5 ID=6

ID=1 1 0 2
ID=2 2 2 3
ID=3 2 1 3

40
Naïve Solution Example
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

J(x,y) > 0.6 ?

J(x,y)) ID=4 ID=5 ID=6

ID=1 0.25 0 0.5
ID=2 0.5 0.4 0.75
ID=3 0.2 0.16 0.6

41
Scalability Challenges

• Applying s(x,y) to all pairs is impractical

• Quadratic in size of data
• Solution: apply s(x,y) to only most promising pairs, using a
method FindCands
• For each string x ε X
• use method FindCands to find a candidate set Z  Y for
each string y ε Z
• if s(x,y) ≥ t then return (x,y) as a matched pair
• This is often called a blocking solution
• We now discuss ways to implement FindCands
• using Jaccard and overlap measures for now
42
2-Step Framework

• Step 1: “Blocking”
• Using Index/heuristics/filtering/etc, reduce # of candidates to
compare
• Step 2: sim() only within candidate sets
For x in A:
Using Foo, find a candidate set C in B
For y in C:
If sim(x,y) > t, return (x,y);

43
Variants for “Foo”

• “Foo”: How to identify candidate set C

• Fast
• Accurate: no false positives/negatives
• Many Variants for “Foo”
• Inverted Index
• Size filtering
• Prefix Index
• Prefix + Inverted Index
• Position Index

44
Inverted Index over Strings [Sarawagi et al, SIGMOD 04]

• Convert each string yεY into a document, build an

inverted index over these documents
• Given term t, use the index to quickly find documents
of Y that contain t

45
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

Inverted Index (IDX) for A Inverted Index (IDX) for B

Token in A ID List Token in B ID List
Area 2 Area 5, 6
Dane 3 Lake 4, 6
Lake 1, 2, 3 Mendota 6
Mendota 1, 3 Monona 4, 5, 6
Monona 2, 3 Research 5
University 4
76
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

For x in A: Inverted Index (IDX) for B

Using IDX, find a candidate set C in B
Token in B ID List
For y in C:
If sim(x,y) > t, return (x,y); Area 5
Lake 4, 6
ID=1: {Lake, Mendota} Mendota 6
Monona 4, 5, 6
ID=2: …
Candidate set C: Research 5
ID=3: … {4,6} + {6} = {4, 6}
University 4
47
Inverted Index [Sarawagi et al, SIGMOD 04]
A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

For x in A: Inverted Index (IDX) for B

Using IDX, find a candidate set C in B
Token in B ID List
For y in C:
If sim(x,y) > t, return (x,y); Area 5
Lake 4, 6
ID=1: {Lake, Mendota} Mendota 6
Monona 4, 5, 6
ID=2: … ID Freq.
Candidate set C: Research 5
4 1
ID=3: …
O(x,y) > 2 6 University
2 4
48
Limitations

• The inverted list of some terms (e.g., stop words) can

be very long  costly to build and manipulate such lists
• Requires enumerating all pairs of strings that share at
least one term. This set can still be very large in
practice.

49
Size Filtering [Arasu et al,VLDB 06]

• Retrieves only strings in Y whose sizes make them

match candidates
• given a string xε X, infer a constraint on the size of strings in Y
that can possibly match x
• uses a B-tree index to retrieve only strings that satisfy size
constraints
• E.g., for Jaccard measure J(x,y) = |x ⋂ y| / |x ⋃ y|
• assume two strings x and y match if J(x,y) ≥ t
• 1/J(x,y) ≥ 1 ≥ |y|/|x| ≥ 1 ≥ J(x,y)
• can show that given a string xε X, only strings y such that
• |x|/t ≥ |y| ≥ |x|*t can possibly match x
50
Example

• Consider x = {lake, mendota}. Suppose t = 0.8

• If yε Y matches x, we must have
• 2/0.8 = 2.5 ≥ |y| ≥ 2* 0.8 = 1.6
• no string in Set Y satisfies this constraint  no match
51
Prefix Filtering [Chaudhuri et al, ICDE 06]

• Key idea: if two sets share many terms  large subsets of them
also share terms
• Consider overlap measure O(x,y) = |x ⋂ y|
• if |x ⋂ y| ≥ k  any subset x’  x of size at least |x| - (k – 1) must
overlap y
• For J(x,y), any subset x’  x of size at least |x| - (k– 1) must overlap y,
г
where k = |x|*t˥
• To exploit this idea to find pairs (x,y) such that O(x,y) ≥ k
• given x, construct subset x’ of size |x| - (k – 1)
• use an inverted index to find all y that overlap x’
52
Example

• Consider matching using O(x,y) ≥ 2

• x1 = {lake, mendota}, let x1’ = {lake}
• Use inverted index to find {y4, y6} which contain at least one token in x1’.
If we had used the inverted index to find all strings in Y that contain at least one
53
token in x, what would be the answer?
Prefix-Inverted Index:
Selecting the Subset Intelligently [Bayardo et al,WWW 07]

• Recall that we select a subset x’ of x and check its overlap with

the entire set y
• We can do better by selecting a particular subset x’ and
checking its overlap with only a particular subset y’ of y
• How?
• impose an ordering O over the universe of all possible terms in increasing
frequency
• reorder the terms in each x ε X and y ε Y according to O
• select subset x’ that contains the first n terms of x as the prefix of
size n of x
• if |x ⋂ y| ≥ k, then x’ and y’ must overlap, where x’ is the prefix
of size |x| - (k – 1) of x and y’ is the prefix of size |y| - (k – 1) of y.
54
5

if |x ⋂ y| ≥ k, then x’ and y’ must overlap, where x’ is the prefix of

size |x| - (k – 1) of x and y’ is the prefix of size |y| - (k – 1) of y.

Proof:

Prefix Suffix

Suppose x’⋂y’=Φ  x’⋂y”≠Φ (otherwise |x⋂y|≥k will not hold)

∃u: uεx’ and uεy” and similarly ∃v: vεy’ and vεx”
uεx’ and vεx” , u<v in the ordering O
But vεy’ and uεy”, means v<u –> contradiction
This x’ overlaps with y’
Selecting the Subset Intelligently

• Algorithm
• reorder terms in each x ε X and y ε Y in increasing order of their
frequencies
• for each y ε Y, create y’, the prefix of size |y| - (k – 1) of y
• build an inverted index over all prefixes y’
• for each x ε X, create x’, the prefix of size |x| - (k – 1) of x, then
use above index to find all y such that x’ overlaps with y’

56
Prefix + Inverted Index [Bayardo et al,WWW 07]

A B
ID Content ID Content
1 {Lake, Mendota} 4 {Lake, Monona, University}
2 {Lake, Monona, Area} 5 {Monona, Research, Area}
3 {Lake, Mendota, Monona, Dane} 6 {Lake, Mendota, Monona, Area}

Token ID List TF Order

Inverted Index (IDX)
for both A and B Area 2, 5, 6 3 4
Dane 3 1 1
Lake 1, 2, 3, 4, 6 5 6
Create a universal order: Mendota 1, 3, 6 3 5
Put rare tokens front Monona 2, 3, 4, 5, 6 5 7
Research 5 2
Order: Dane > Research > Uni versity > Area > Mendota > L ake > Monona
1
University 4 1 3
Prefix + Inverted Index [Bayardo et al,WWW 07]

Ordered A Ordered B
ID Content ID Content
1 {Mendota, Lake} 4 {University, Lake, Monona}
2 {Area, Lake, Monona} 5 {Research, Area, Monona}
3 {Dane, Mendota, Lake, Monona} 6 {Area, Mendota, Lake, Monona}

Order: Dane > Research > University > Area > Mendota > Lake > Monona
Prefix + Inverted Index [Bayardo et al,WWW 07]

O(x,y) > 2
Prefix Inverted Index for B
Prefix(x)=|x|-(k-1)=|x|-1
Token in B ID List
Area 5
Lake 4, 6
ID=1: {Mendota, Lake} Mendota 6
Research 5
ID=2: …
Candidate set C: {6} University 4
ID=3: …
Prefix + Inverted Index [Bayardo et al,WWW 07]

O(x,y) > 2
Prefix Inverted Index for B
Prefix(x)=|x|-(k-1)=|x|-1
Token in B ID List
Area 5
Lake 4, 6
ID=1: …
Mendota 6
ID=2: {Area, Lake, Monona} Research 5
Candidate set C: University 4
ID=3: …
{5} + {4,6} = {4,5,6}
Prefix + Inverted Index [Bayardo et al,WWW 07]

O(x,y) > 2
Prefix Inverted Index for B
Prefix(x)=|x|-(k-1)=|x|-1
Token in B ID List
ID=1: … Area 5
Lake 4, 6
ID=2: …
Mendota 6
ID=3: {Dane, Mendota, Research 5
Lake, Monona}
Candidate set C: University 4
{6} + {4,6} = {4,6}
Position Index [Xiao et al,WWW 08]
Further limits the set of candidate matches by deriving an upper bound on
the size of overlap between x and y
Order: Dane > Research > University > Area > Mendota > Lake > Monona

• x: {Dane, Research, Area, Mendota, Lake}

• y: {Research, Area, Mendota, Lake, Monona}
• O(x,y) > 4 ?

• Prefix(x) = Prefix(y) = 5 – (4 -1) = 2
• x: {Dane, Research, Area, Mendota, Lake}
• y: {Research, Area, Mendota, Lake, Monona}
• “Research” is a common prefix  (x,y) is a candidate pair using a
Prefix Filtering + Inverted Index method
•  need to compute sim(x,y)
• Estimation of max overlap = overlap in prefixes + min of unseen tokens
= 1 + min(3,4) = 4 > k  No need to compute sim(x,y) ! 62
Summary
• String matching is pervasive in data integration
• Two key challenges:
• what similarity measure and how to scale up?
• Similarity measures
• Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
• Set-based: overlap, Jaccard, Sorensen-Dice, TF/IDF
• Similarity-based: Ratcliff-Obershelp similarity
• Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
• Phonetic: Soundex
• Scaling up string matching
• Inverted index, size filtering, prefix filtering, position filtering.
Acknowledgment

 Slides in the scalability section are adapted from

http://pike.psu.edu/p2/wisc09-tech.ppt

2. Vector Semantics - NLP
No ratings yet
2. Vector Semantics - NLP
118 pages
week_5_8
No ratings yet
week_5_8
80 pages
TE IT DMBI Module2 Data Preprocessing L8-L11
No ratings yet
TE IT DMBI Module2 Data Preprocessing L8-L11
73 pages
Lecture 3
No ratings yet
Lecture 3
58 pages
String matching
No ratings yet
String matching
66 pages
L3
No ratings yet
L3
54 pages
Clustering Part4
No ratings yet
Clustering Part4
79 pages
Chapter 7 Normalization
No ratings yet
Chapter 7 Normalization
77 pages
Lecture -7 MSDS
No ratings yet
Lecture -7 MSDS
32 pages
Reference Material For NLP - 1
No ratings yet
Reference Material For NLP - 1
40 pages
IR Lecture 3b
No ratings yet
IR Lecture 3b
44 pages
Chapter 3 IR
No ratings yet
Chapter 3 IR
34 pages
CSC_522_Lecture10_5f0e8c83dce359ee001691c737303b46
No ratings yet
CSC_522_Lecture10_5f0e8c83dce359ee001691c737303b46
30 pages
Similarity Measures Le 512
No ratings yet
Similarity Measures Le 512
14 pages
Module 5 Document Clustering
No ratings yet
Module 5 Document Clustering
33 pages
1730098650_ML12_Clustering
No ratings yet
1730098650_ML12_Clustering
34 pages
3 Term Weighting
No ratings yet
3 Term Weighting
34 pages
IRS Unit-4
No ratings yet
IRS Unit-4
35 pages
3 Termweighting
No ratings yet
3 Termweighting
34 pages
3 termWeightingIR
No ratings yet
3 termWeightingIR
32 pages
BDA
No ratings yet
BDA
31 pages
A Guided Tour To Approximate String Matching: Gonzalo Navarro
No ratings yet
A Guided Tour To Approximate String Matching: Gonzalo Navarro
58 pages
Similarity Analysis
No ratings yet
Similarity Analysis
85 pages
Introduction To Data Science: Tom A S Horv Ath
No ratings yet
Introduction To Data Science: Tom A S Horv Ath
39 pages
L6 Recommendation
No ratings yet
L6 Recommendation
56 pages
Term Weighting and Similarity Measures
No ratings yet
Term Weighting and Similarity Measures
35 pages
Lec 5
No ratings yet
Lec 5
24 pages
Lec2 2
No ratings yet
Lec2 2
17 pages
10 Dictionaries and Tolerant Retrieval
No ratings yet
10 Dictionaries and Tolerant Retrieval
13 pages
Lecture 2. Similarity Measures For Cluster Analysis
No ratings yet
Lecture 2. Similarity Measures For Cluster Analysis
31 pages
Chapter Three Term Weighting and Similarity Measures
No ratings yet
Chapter Three Term Weighting and Similarity Measures
25 pages
Chapter_2
No ratings yet
Chapter_2
70 pages
Chandola Kumar
No ratings yet
Chandola Kumar
13 pages
Data Mining: Characterization: Jimma University, Faculty of Computing Arranged By: Dessalegn Y
No ratings yet
Data Mining: Characterization: Jimma University, Faculty of Computing Arranged By: Dessalegn Y
79 pages
ISR Chap..3
No ratings yet
ISR Chap..3
26 pages
ML Clustering Algorithm
No ratings yet
ML Clustering Algorithm
29 pages
Similarity Metric
No ratings yet
Similarity Metric
13 pages
Simple Efficient Algorithm
No ratings yet
Simple Efficient Algorithm
9 pages
The Stringdist Package For Approximate String Matching
No ratings yet
The Stringdist Package For Approximate String Matching
13 pages
Similarity Filters Jaro Winkler
No ratings yet
Similarity Filters Jaro Winkler
7 pages
Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression
No ratings yet
Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression
23 pages
Task 1
No ratings yet
Task 1
5 pages
Mini-Project #2: Instructions
No ratings yet
Mini-Project #2: Instructions
5 pages
Unit III
No ratings yet
Unit III
85 pages
Similarity
No ratings yet
Similarity
19 pages
Lecture 10
No ratings yet
Lecture 10
18 pages
Fast String Matching in Python
No ratings yet
Fast String Matching in Python
5 pages
Chapter Three Term Weighting and Similarity Measures
No ratings yet
Chapter Three Term Weighting and Similarity Measures
33 pages
Newer Method of String Comparison: The Modified Moving Contracting Window Pattern Algorithm
No ratings yet
Newer Method of String Comparison: The Modified Moving Contracting Window Pattern Algorithm
9 pages
Text Similarity Algorithms
No ratings yet
Text Similarity Algorithms
28 pages
Foundations of Sequence Analysis
No ratings yet
Foundations of Sequence Analysis
161 pages
Data Mining: Similarity and Distance Recommendation Systems Sketching, Locality Sensitive Hashing
No ratings yet
Data Mining: Similarity and Distance Recommendation Systems Sketching, Locality Sensitive Hashing
57 pages
Plagiarism_Detector_NLP_Theory (1)
No ratings yet
Plagiarism_Detector_NLP_Theory (1)
3 pages
tkde-2014-26-7
No ratings yet
tkde-2014-26-7
17 pages
GetStarted_ecomatDisplay_english
No ratings yet
GetStarted_ecomatDisplay_english
30 pages
Assignment No. 2: Similarity and Dissimilarity Measures
No ratings yet
Assignment No. 2: Similarity and Dissimilarity Measures
11 pages
Url Log Pass Cartel Cloud 16.10.2023
No ratings yet
Url Log Pass Cartel Cloud 16.10.2023
795 pages
Dealing With Massive Data - Duplicate Detection
No ratings yet
Dealing With Massive Data - Duplicate Detection
3 pages
Tf-Idf: David Kauchak cs160 Fall 2009
No ratings yet
Tf-Idf: David Kauchak cs160 Fall 2009
51 pages
Aw 8
No ratings yet
Aw 8
268 pages
Lecture Notes in Artificial Intelligence 3230
No ratings yet
Lecture Notes in Artificial Intelligence 3230
497 pages
Merge
No ratings yet
Merge
370 pages
750-342 Hawk 4000
No ratings yet
750-342 Hawk 4000
128 pages
Student Practical Assessment Sheet DSAL 2024 25 .Docx
No ratings yet
Student Practical Assessment Sheet DSAL 2024 25 .Docx
8 pages
Quiz 3 Solution (No 1-4)
No ratings yet
Quiz 3 Solution (No 1-4)
3 pages
Appointment Confirmation · Customer Self-Service (3)
No ratings yet
Appointment Confirmation · Customer Self-Service (3)
6 pages
7 DBMS
No ratings yet
7 DBMS
27 pages
Unit 1 HDL Design
No ratings yet
Unit 1 HDL Design
93 pages
Im Unit 3 DBMS Updated
No ratings yet
Im Unit 3 DBMS Updated
93 pages
List of Practicals XII IP 2023-24
No ratings yet
List of Practicals XII IP 2023-24
5 pages
Autocad Land Desktop Tutorial: Lesson 1
No ratings yet
Autocad Land Desktop Tutorial: Lesson 1
288 pages
Karam Al Aqtash - Computer FINAL PROJECT
No ratings yet
Karam Al Aqtash - Computer FINAL PROJECT
27 pages
Database CS-229T All - LABs
No ratings yet
Database CS-229T All - LABs
58 pages
Hangover Tamil Dubbed Bad Words Full 107
0% (1)
Hangover Tamil Dubbed Bad Words Full 107
4 pages
AllExercise FA-TOC Sipser
0% (1)
AllExercise FA-TOC Sipser
17 pages
Recap Through Exercise
No ratings yet
Recap Through Exercise
37 pages
Score: Context-Oriented Structured and Unstructured Information Integration
No ratings yet
Score: Context-Oriented Structured and Unstructured Information Integration
35 pages
Serial No 44
No ratings yet
Serial No 44
2 pages
Information Integration: Existing Methods and Solutions
No ratings yet
Information Integration: Existing Methods and Solutions
25 pages
Flutter Live
No ratings yet
Flutter Live
14 pages
AERO2598 Matlab Assignment 1-1 Slides
No ratings yet
AERO2598 Matlab Assignment 1-1 Slides
16 pages
Daram Vishnu Vardhan: Work Experience Skills
No ratings yet
Daram Vishnu Vardhan: Work Experience Skills
1 page
Rosendo TSA3
No ratings yet
Rosendo TSA3
15 pages
ICT Module Outline Nov 2023
No ratings yet
ICT Module Outline Nov 2023
8 pages
Gene Expression: Quantification of Information Molecules and Their Applications
No ratings yet
Gene Expression: Quantification of Information Molecules and Their Applications
146 pages
Indian Addresses Matching
No ratings yet
Indian Addresses Matching
12 pages
Procurve msm410
No ratings yet
Procurve msm410
6 pages
Summer Internship Report NISHANT NEWWWW
No ratings yet
Summer Internship Report NISHANT NEWWWW
12 pages
DVR and NVR Network Setup
No ratings yet
DVR and NVR Network Setup
7 pages
NehaBhardwaj (3 6)
No ratings yet
NehaBhardwaj (3 6)
4 pages
Rutuja N. Mhaske: Profile Contact
No ratings yet
Rutuja N. Mhaske: Profile Contact
2 pages
Implementing Snowbound VirtualViewer On Pega Platform
No ratings yet
Implementing Snowbound VirtualViewer On Pega Platform
2 pages
Sathyam Mohanram Vellal: Education
No ratings yet
Sathyam Mohanram Vellal: Education
1 page
Assignment 8 - Solutions: 4 Points
No ratings yet
Assignment 8 - Solutions: 4 Points
2 pages
Multiple Integrals, A Collection of Solved Problems
From Everand
Multiple Integrals, A Collection of Solved Problems
Steven Tan
No ratings yet
Worked Examples in Mathematics for Scientists and Engineers
From Everand
Worked Examples in Mathematics for Scientists and Engineers
G. Stephenson
No ratings yet

Week 5

Uploaded by

Week 5

Uploaded by

When designing a data integration system, should you deal the

schema mapping problem first or schema matching problem?

Slides taken from Anhai’s book on Data Integration

 Play critical roles in many Data Integration tasks

 Scaling up string matching

• Given two sets of strings X and Y

• Two major challenges: accuracy & scalability

• Matching strings often appear quite differently

 Applying s(x,y) to all pairs is impractical

• Also known as Levenshtein distance

• Models common editing mistakes

• Mainly for comparing short strings, e.g., first/last names

• Captures cases where x and y have a low Jaro score,

• Exercise: Compute the edit distance between

• special character # is added to handle the start and end of string.

• Let Bx = set of tokens generated for string x

• J(x,y) = |Bx ⋂ By|/|Bx ⋃ By|

• uses the TF/IDF notion commonly used in IR

• Each document d is converted into a feature vector vd

Generalized Jaccard Measure

Generalized Jaccard Measure

• Let Bx = {x1, …, xn}, By = {y1, …, ym}

• Step 2: return normalized weight of M as generalized

 Generalized Jaccard score: (0.7 + 0.9)/(3 + 2 – 2) = 0.53

1. Find the longest common substring from the two strings.

The similarity score = 2*the number of characters found in common/

Step 1: Common substring - E M A T I C

Step 2: Left String 1: M A T H Left String 2: M A T

Step 3: Common substring in left string: M A T

Similarity score = 2*9/(11+10) = 0.857 61

• Match strings based on their sound, instead of

• Exercise: Compute the similarity score between “RAPID”

• All pair-wise comparison between A and B

• Nested-loop: |A||B| comparisons

O(x,y) ID=4 ID=5 ID=6

J(x,y) > 0.6 ?

J(x,y)) ID=4 ID=5 ID=6

• Applying s(x,y) to all pairs is impractical

• “Foo”: How to identify candidate set C

• Convert each string yεY into a document, build an

Inverted Index (IDX) for A Inverted Index (IDX) for B

For x in A: Inverted Index (IDX) for B

For x in A: Inverted Index (IDX) for B

• The inverted list of some terms (e.g., stop words) can

• Retrieves only strings in Y whose sizes make them

• Consider x = {lake, mendota}. Suppose t = 0.8

• Consider matching using O(x,y) ≥ 2

• Recall that we select a subset x’ of x and check its overlap with

if |x ⋂ y| ≥ k, then x’ and y’ must overlap, where x’ is the prefix of

Suppose x’⋂y’=Φ  x’⋂y”≠Φ (otherwise |x⋂y|≥k will not hold)

Token ID List TF Order

• x: {Dane, Research, Area, Mendota, Lake}

 Slides in the scalability section are adapted from

You might also like