A Learning Method for Entity Matching

The document presents a learning method for entity matching that addresses challenges related to data inconsistency and incompleteness. It outlines a solution involving a training phase to select distance functions and thresholds, and a testing phase to evaluate the model's performance. The approach is validated through experiments on real datasets, demonstrating its efficiency compared to existing techniques.

Uploaded by

bocerin283
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views19 pages

A Learning Problem For Entity Matching

The document presents a learning method for entity matching that addresses challenges related to data inconsistency and incompleteness. It outlines a solution involving a training phase to select distance functions and thresholds, and a testing phase to evaluate the model's performance. The approach is validated through experiments on real datasets, demonstrating its efficiency compared to existing techniques.

Uploaded by

bocerin283
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 19

A Learning Method for Entity Matching

Jie Chen, Cheqing Jin, Rong Zhang, Aoying Zhou


cqjin@sei.ecnu.edu.cn
East China Normal University
Outline
1. Introduction
2. Our Solution
3. Experiments
4. Conclusion

QDB' 2012
1. Introduction
• In many applications, data is inaccurate, inconsistent, and incomplete.
• One entity may be represented by more than one (inconsistent) record.
• Entity matching aims at finding record pairs that refer to the same entity.

Example: (four tuples from the Cora dataset)

RID | Author                             | Title                                          | Address               | Date
1   | carla e. brodley and paul e. utgo. | multivariate versus univariate decision trees. | amherst ma            | 1992.
2   | c. e. brodley and p. e. utgo.      | multivariate decision trees.                   | amherst massachusetts | 1992.
3   | c. e. brodley and p. e. utgo.      | multivariate decision trees.                   | NULL                  | 1995.
4   | carla e. brodley and paul e. utgo. | multivariate decision trees.                   | NULL                  | 1995.

same authors · same title · unspecified address · same date

[1, 2] Carla E. Brodley, Paul E. Utgoff. Multivariate versus univariate decision trees. Technical report, 1992
[3, 4] Carla E. Brodley, Paul E. Utgoff. Multivariate decision trees. Machine Learning 19(1):45-77, 1995

Generally, we can use a distance function, along with a threshold, to compare each pair of tuples. When the distance is below the threshold, the tuples refer to the same entity.
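A minimal sketch of this rule in Python, using an edit distance normalized by the longer string's length (the 0.28 threshold is the value derived later for the title attribute; the function names are illustrative):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def same_entity(a, b, threshold=0.28):
    # Normalize the edit distance to [0, 1] by the longer string length,
    # then claim a match when the distance is below the threshold.
    dist = edit_distance(a, b) / max(len(a), len(b), 1)
    return dist <= threshold

print(same_entity("carla e. brodley and paul e. utgo.",
                  "c. e. brodley and p. e. utgo."))
```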

Challenges
• How to select a distance function for each attribute?
  – e.g., which is better, Jaccard or edit distance?
• How to set the threshold value for the distance function?
  – e.g., what distance means similar, 0.2 or 0.4?
• How to integrate multiple distance functions?
  – e.g., is it possible to use Jaccard and edit distance simultaneously?
Related work
• Classification-based methods
  – Extract feature vectors and train classifiers
  – Effective, but often expensive

• Rule-based methods
  – Build rules to match records
  – Explainable and efficient

• Our work belongs to the latter category

2. Our solution
• Training phase: generate a model
  – Step 1: Choose a metric to measure the appropriateness of a distance function
  – Step 2: Compute the threshold value for that distance function
  – Step 3: Integrate multiple distance functions and attributes

• Testing phase: test the model

• Step 1: measure the quality of a distance function by sweeping the threshold and computing precision, recall, and F-score:

precision = (# of correctly identified duplicate pairs) / (# of identified duplicate pairs)

recall = (# of correctly identified duplicate pairs) / (# of true duplicate pairs)

F-score = (2 * precision * recall) / (precision + recall)

[Figures: precision, recall, and F-score versus distance threshold, for the Jaccard distance and the edit distance.]

Note: it is easy to extend to the biased F-score.
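The three measures can be written directly in Python; the `beta` parameter below is one common form of the biased F-score mentioned in the note (beta > 1 weights recall more heavily), and the helper names are illustrative:

```python
def precision_recall(identified, true_pairs):
    # identified, true_pairs: sets of record-ID pairs.
    correct = identified & true_pairs  # correctly identified duplicates
    precision = len(correct) / len(identified) if identified else 0.0
    recall = len(correct) / len(true_pairs) if true_pairs else 0.0
    return precision, recall

def f_score(precision, recall, beta=1.0):
    # beta = 1 gives the balanced F-score from the slide;
    # beta != 1 gives a biased variant.
    if precision + recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))
```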


• Step 1: Definition 1 (Attribute-level maximum F-score)

MAXF_A(E) = max_{dis ∈ D} (MAXF_{dis,A}(E))

where D is the set of distance functions, A is an attribute, dis is a distance function, and E is a relation.

Detailed steps:
maxf = 0;
for each attribute A
    for each distance function dis
        compute MAXF_{dis,A}(E);
        if (MAXF_{dis,A}(E) > maxf)
            maxf = MAXF_{dis,A}(E);
return the pair (A, dis) that achieves the maxf value
• Step 2: compute the proper threshold for max_A(MAXF_A(E))

[Figure: threshold selection; the edit distance is normalized to [0, 1].]

Conclusion: we can claim that a pair of records belongs to the same entity if the edit distance of the title attribute is smaller than 0.28.
• Step 3: How about using multiple attributes?

It is better to use multiple attributes to measure the distance between a pair of tuples.

Hint: given an attribute group G, we use a weighted metric, GD, to measure the distance between a pair of tuples e_i and e_j, weighting the proper distance over each single attribute.
• Step 3
  – Using more attributes may either increase or decrease the MAXF value.
  – {Author, volume, page} is better than the other groups.

However, finding the proper group requires checking all attribute combinations, which is expensive to execute. Hence, we provide a heuristic solution.
• Step 3: Heuristic solution
Let S_h denote the set containing all attribute groups of size h.
1. Initially, generate S_1, containing all single-attribute groups;
2. Iteratively generate S_h from S_{h-1} and S_1;
3. Each element G in S_h should be superior to its constituents in S_{h-1} and S_1:
   assume G' ∈ S_{h-1}, G'' ∈ S_1, G = G' ∪ G''; then MAXF_G > max(MAXF_G', MAXF_G'').
Performance analysis:
1. Efficient to execute.
2. May not find the optimal attribute group.
3. Performs well in experiments.
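The level-wise search above can be sketched as follows; for brevity the MAXF values for candidate groups are assumed precomputable and passed in as a map keyed by frozensets (an implementation convenience, not part of the paper):

```python
def heuristic_groups(singles, maxf):
    # Greedy level-wise search: S_1 holds single-attribute groups; each
    # S_h is built from S_{h-1} x S_1, keeping a union G = G' | G'' only
    # if MAXF_G strictly beats both parents. Returns the best group seen.
    s1 = [frozenset([a]) for a in singles]
    best = max(s1, key=lambda g: maxf[g])
    prev = s1
    while prev:
        nxt = []
        for gp in prev:
            for gs in s1:
                g = gp | gs
                if len(g) != len(gp) + 1 or g in nxt:
                    continue  # skip non-growing unions and duplicates
                if g in maxf and maxf[g] > max(maxf[gp], maxf[gs]):
                    nxt.append(g)
                    if maxf[g] > maxf[best]:
                        best = g
        prev = nxt  # stop when no group survives the superiority test
    return best
```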

3. Experiments
• Data sets
 Cora is a collection of 1876 citation entries
 Attributes (9): author, title, address, date, page, volume, publisher, editor, and journal.
 Restaurant is a collection of 191 restaurant records
 Attributes (4): name, address, city, and type.

3. Experiments
1. Single-attribute test

[Figures: single-attribute test results on the Restaurant data set and the Cora data set.]
3. Experiments
2. Attribute group test

[Figures: the recall-precision curve under the best attribute group; performance of the heuristic solution.]

3. Experiments
• Comparison with existing techniques
 Op-tree: does not consider the redundancy among similarity functions
 SiFi: highly dependent on the given rules

4. Conclusion
• This paper aims at selecting an appropriate distance function, rather than proposing a new distance function.
• It is unnecessary to define the threshold parameter manually.
• A heuristic approach is used for higher efficiency.
• The approach is evaluated on real data sets.

Questions?
