A Learning method For Entity matching
Jie Chen, Cheqing Jin, Rong Zhang, Aoying Zhou
cqjin@sei.ecnu.edu.cn
East China Normal University
Outline
1.Introduction
2.Our Solution
3.Experiments
4.Conclusion
QDB' 2012
1. Introduction
• In many applications, data is inaccurate, inconsi
stent and incomplete.
• One entity may be represented by more than on
e (inconsistent) records.
• Entity matching aims at finding record pairs ref
erring to the same entity.
QDB' 2012
Example: (Four tuples from Cora dataset)
RID Author Title Address Date
1 carla e. brodley and paul e. utgo. multivariate versus univariate amherst ma 1992.
decision trees.
2 c. e. brodley and p. e. utgo. multivariate decision trees. amherst massachusetts 1992.
3 c. e. brodley and p. e. utgo. multivariate decision trees. NULL 1995.
4 carla e. brodley and paul e. utgo. multivariate decision trees. NULL 1995.
same authors same title unspecified address same date
[1, 2] Carla E. Brodley, Paul E. Utgoff. Multivariate versus univariate decision trees. Technical report, 1992
[3, 4] Carla E. Brodley, Paul E. Utgoff. Multivariate Decision trees. Machine learning 19(1):45-77, 1995
Generally, we can use a distance function, along with a threshold, to comp
are each tuple. When the distiance is below the threshold, the tuples refer t
o the same entity.
QDB' 2012
Challenges
• How to select distance function for each attribute
?
– i.e, which is better, Jaccard or edit?
• How to set the threshold value for the distance f
unction?
– i.e, what distance means similar, 0.2 or 0.4?
• How to integrate multiple distance functions?
– i.e, is it possible to use Jaccard and edit simultaneous
ly?
QDB' 2012
Related work
• Classification-based Method
– By extracting feature vectors and training the classifiers
– effective, but often expensive
• Rule-based Method
– By build rules to match records
– explainable and efficient
• Our work belongs to the latter
QDB' 2012
2. Our solution
• Training phase: general a model
– Step 1: Choose a metric to measure the appropriateness of a d
istance function
– Step 2: Compute the threshold value for that distance function
– Step 3: Integrate multiple distance functions and attributes
• Testing phase: test the model
QDB' 2012
Jaccard distance
• Step 1
1
0.8
# of correctly identified duplicate pairs
precision = 0.6
# of identified duplicate pairs
0.4
Precsion
# of correctly identified duplicate pairs Recall
recall = 0.2
F-score
# of true duplicate pairs
0
2 * precision * recall
Edit Distance
0 0.2 0.4 0.6 0.8
F-score =
Distance
precision + recall 0.8
0.6
Note: it is easy to extend to the biased F-score.
Precsion
0.4
Recall
F-score
0.2
QDB' 2012 0 0.2 0.4 0.6 0.8
• Step 1 Definition1 (Attribute-level maximum F-score)
MAXFA(E) = maxdis∈D(MAXFdis, A(E))
A distance function set
An attribute
A distance function
A relation
Detail steps:
maxfa = 0;
for each attribute A
for each distance function dis
compute MAXFA(E);
if (MAXFA(E) > maxfa
maxfa = MAXFA(E)
return A and E that has maxfa value
QDB' 2012
• Step 2: compute the proper threshold for MAXF(E)
maxA(MAXFA(E))
The edit distance
is normalized to
[0, 1]
Conclusion: We can claim that a pair of records belong to the same entit
y if the edit distance of title attribute is smaller than 0.28
QDB' 2012
• Step 3: How about using multiple attributes?
It is better to use multiple attributes to measure the distance between a
pair of tuples.
Hint: Given an attribute group G, we use a weighted metric, GD,
to measure the distance between a pair of tuples ei and ej
weight the proper distance over
single attribute
QDB' 2012
• Step 3
•Using more attributes may either
increase, or decrease the MAXF
value.
•{Author, volume, page} is better t
han the rest groups.
However, finding the proper group needs to check all att
ribute combinations, which is expensive to execute.
Hence, we provide a heuristic solution.
QDB' 2012
• Step 3: Heuristic solution
Let Sh denote a set containing all attribute groups with a size of h.
1. Initially, generate S1 containing all single-attribute groups;
2. Iteratively generate Sh by using Sh-1 and S1;
3. Each element G in Sh should be superior to any element in Sh-1 or S1.
Assume G'∈Sh-1, G''∈S1, G=G'∪G''
MAXFG > max( MAXFG', MAXFG'')
Performance Analysis:
1. Efficient to execute
2. May not find the optimal attribute-group.
3. Experimental results run well
QDB' 2012
3. Experiments
• Data sets
Cora is a collection of 1876 citation entries
Attributes(9): author, title, address, date, page, volume, publisher, editor and
journal.
Restaurant is a collection of 191 restaurant records
Attributes(4): name, address, city and type.
QDB' 2012
3. Experiments
1. Single-attribute test
Restaurant data set
Cora data set
QDB' 2012
3. Experiments
2. Attribute group test
The Recall-Precision Curve
under the best attribute group
Performance of the heuristic solution
QDB' 2012
3. Experiments
• Comparison with
Existing Techniques
Op-tree: not consider the
redundancy among
similarity functions
SiFi: highly dependent on
given rules
QDB' 2012
4. Conclusion
• This paper aims at selecting an appropriate dista
nce function, rather than proposing a new distan
ce function.
• Unnecessary to define the threhold parameter.
• A heuristic approach to for higher efficiency.
• Evaluated by real data sets.
QDB' 2012
Questions?