A Learning Problem For Entity Matching

The document presents a learning method for entity matching that addresses challenges related to data inconsistency and incompleteness. It outlines a solution involving a training phase to select distance functions and thresholds, and a testing phase to evaluate the model's performance. The approach is validated through experiments on real datasets, demonstrating its efficiency compared to existing techniques.

A Learning Method for Entity Matching

Jie Chen, Cheqing Jin, Rong Zhang, Aoying Zhou

cqjin@sei.ecnu.edu.cn
East China Normal University
Outline
1. Introduction
2. Our Solution
3. Experiments
4. Conclusion

QDB' 2012
1. Introduction
• In many applications, data is inaccurate, inconsistent and incomplete.
• One entity may be represented by more than one (inconsistent) record.
• Entity matching aims at finding record pairs referring to the same entity.

Example (four tuples from the Cora dataset):

RID  Author                              Title                                           Address                Date
1    carla e. brodley and paul e. utgo.  multivariate versus univariate decision trees.  amherst ma             1992.
2    c. e. brodley and p. e. utgo.       multivariate decision trees.                    amherst massachusetts  1992.
3    c. e. brodley and p. e. utgo.       multivariate decision trees.                    NULL                   1995.
4    carla e. brodley and paul e. utgo.  multivariate decision trees.                    NULL                   1995.

Callouts: same authors, same title, unspecified address, same date.

[1, 2] Carla E. Brodley, Paul E. Utgoff. Multivariate versus univariate decision trees. Technical report, 1992
[3, 4] Carla E. Brodley, Paul E. Utgoff. Multivariate Decision trees. Machine learning 19(1):45-77, 1995

Generally, we can use a distance function, along with a threshold, to compare each pair of tuples. When the distance is below the threshold, the tuples refer to the same entity.
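As a minimal sketch of this idea, the snippet below applies a token-based Jaccard distance to the titles of tuples 1 and 2 from the Cora example. The whitespace tokenization, the function names, and the threshold of 0.5 are illustrative assumptions, not values taken from the talk.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # assumption: two empty values count as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def same_entity(a: str, b: str, threshold: float) -> bool:
    """Declare a match when the distance falls below the threshold."""
    return jaccard_distance(a, b) < threshold


# Titles of tuples 1 and 2 from the Cora example.
t1 = "multivariate versus univariate decision trees."
t2 = "multivariate decision trees."

# 3 shared tokens out of 5 distinct tokens, so the distance is 1 - 3/5.
print(jaccard_distance(t1, t2))
# The threshold 0.5 is an arbitrary illustrative choice.
print(same_entity(t1, t2, threshold=0.5))
```

With a stricter threshold such as 0.3 the same pair would be rejected, which is why the choice of threshold matters.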

Challenges
• How to select a distance function for each attribute?
– e.g., which is better, Jaccard or edit distance?
• How to set the threshold value for the distance function?
– e.g., which distance means similar, 0.2 or 0.4?
• How to integrate multiple distance functions?
– e.g., is it possible to use Jaccard and edit distance simultaneously?

Related work
• Classification-based Method
– By extracting feature vectors and training the classifiers
– effective, but often expensive

• Rule-based Method
– By building rules to match records
– explainable and efficient

• Our work belongs to the latter

2. Our solution
• Training phase: generate a model
– Step 1: Choose a metric to measure the appropriateness of a distance function
– Step 2: Compute the threshold value for that distance function
– Step 3: Integrate multiple distance functions and attributes

• Testing phase: test the model

• Step 1

\[ \text{precision} = \frac{\#\text{ of correctly identified duplicate pairs}}{\#\text{ of identified duplicate pairs}} \]

\[ \text{recall} = \frac{\#\text{ of correctly identified duplicate pairs}}{\#\text{ of true duplicate pairs}} \]

\[ \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

Note: it is easy to extend to the biased F-score.

[Figure: two panels, "Jaccard distance" and "Edit Distance", each plotting Precision, Recall and F-score against Distance (x-axis from 0 to 0.8).]
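The three metrics can be computed directly from sets of record-ID pairs. The sketch below is illustrative only: the function name is hypothetical, and reading the "biased F-score" as the standard F-beta measure is an assumption (beta = 1 gives the plain F-score defined above).

```python
def precision_recall_f(predicted: set, truth: set, beta: float = 1.0):
    """Return (precision, recall, F-beta) for predicted vs. true duplicate pairs."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # beta = 1 reduces to 2 * precision * recall / (precision + recall).
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# True duplicate pairs from the Cora example: tuples [1, 2] and [3, 4].
truth = {(1, 2), (3, 4)}
# A hypothetical matcher output with two false positives.
predicted = {(1, 2), (2, 3), (3, 4), (1, 4)}

p, r, f = precision_recall_f(predicted, truth)
print(p, r, f)  # precision 2/4, recall 2/2, F-score 2/3
```

Passing beta greater than 1 weights recall more heavily, and beta less than 1 favors precision.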

• Step 1, Definition 1 (Attribute-level maximum F-score)

\[ \mathrm{MAXF}_A(E) = \max_{dis \in D} \mathrm{MAXF}_{dis,A}(E) \]

where D is a distance function set, A is an attribute, dis is a distance function, and E is a relation.
Detailed steps:
maxfa = 0
for each attribute A
    for each distance function dis in D
        compute MAXF_{dis,A}(E)
        if MAXF_{dis,A}(E) > maxfa
            maxfa = MAXF_{dis,A}(E); remember (A, dis)
return the attribute A and distance function dis that give the maxfa value
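The selection loop above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names, the fixed threshold grid, the exact-match distance, and the two-attribute toy relation (built from the Cora example tuples) are all assumptions.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard distance over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def exact(a: str, b: str) -> float:
    """0 when the values are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def max_f_score(records, truth, attribute, dist, thresholds):
    """Best (F-score, threshold) for one attribute and one distance function."""
    distances = {
        (i, j): dist(records[i][attribute], records[j][attribute])
        for i, j in combinations(sorted(records), 2)
    }
    best_f, best_t = 0.0, None
    for t in thresholds:
        predicted = {pair for pair, d in distances.items() if d < t}
        correct = len(predicted & truth)
        if correct == 0:
            continue  # precision and recall are both 0, so the F-score is 0
        precision = correct / len(predicted)
        recall = correct / len(truth)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


def select_best(records, truth, attributes, distance_functions, thresholds):
    """Scan every (attribute, distance function) pair and keep the highest MAXF."""
    best = (0.0, None, None, None)
    for attribute in attributes:
        for name, dist in distance_functions.items():
            f, t = max_f_score(records, truth, attribute, dist, thresholds)
            if f > best[0]:
                best = (f, attribute, name, t)
    return best


# Toy relation: titles and dates of the four Cora example tuples.
records = {
    1: {"title": "multivariate versus univariate decision trees.", "date": "1992."},
    2: {"title": "multivariate decision trees.", "date": "1992."},
    3: {"title": "multivariate decision trees.", "date": "1995."},
    4: {"title": "multivariate decision trees.", "date": "1995."},
}
truth = {(1, 2), (3, 4)}
thresholds = [k / 10 for k in range(1, 10)]  # assumed grid: 0.1 .. 0.9

result = select_best(records, truth, ["title", "date"],
                     {"jaccard": jaccard, "exact": exact}, thresholds)
print(result)  # (MAXF, attribute, distance function name, threshold)
```

On this four-tuple toy relation the date attribute happens to separate the two entities perfectly; that is an artifact of the tiny sample and says nothing about the full Cora data set.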

• Step 2: compute the proper threshold for MAXF(E) = max_A(MAXF_A(E))

Note: the edit distance is normalized to [0, 1].

Conclusion: We can claim that a pair of records belongs to the same entity if the edit distance of the title attribute is smaller than 0.28.
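To make the resulting rule concrete, here is a minimal sketch of a normalized edit distance and the title rule with the 0.28 threshold. The slide only states that the edit distance is normalized to [0, 1]; dividing the Levenshtein distance by the length of the longer string is an assumption, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


TITLE_THRESHOLD = 0.28  # threshold reported on the slide for the title attribute


def titles_match(a: str, b: str) -> bool:
    """Apply the learned rule: same entity if the title distance is below the threshold."""
    return normalized_edit_distance(a, b) < TITLE_THRESHOLD


print(titles_match("multivariate decision trees.", "multivariate decision tree"))
print(titles_match("multivariate versus univariate decision trees.",
                   "multivariate decision trees."))
```

Under this assumed normalization the first pair differs by 2 edits over 28 characters and matches, while the second differs by 18 edits over 46 characters and does not.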

• Step 3: How about using multiple attributes?
It is better to use multiple attributes to measure the distance between a pair of tuples.

Hint: Given an attribute group G, we use a weighted metric, GD, to measure the distance between a pair of tuples ei and ej.

[Formula for GD not recovered; its callouts label a weight and the proper distance over a single attribute.]
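Since the GD formula itself did not survive extraction, the sketch below assumes one plausible form: a weighted average of the per-attribute distances, where each attribute in the group carries its own weight and its own selected distance function. This assumed form, the function names, and the equal weights are illustrative and may differ from the definition used in the talk.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard distance over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def exact(a: str, b: str) -> float:
    """0 when the values are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def group_distance(record_i: dict, record_j: dict, group: dict) -> float:
    """Assumed GD: weighted average of per-attribute distances.

    `group` maps attribute name -> (weight, distance function).
    """
    total_weight = sum(weight for weight, _ in group.values())
    weighted = sum(weight * dist(record_i[attr], record_j[attr])
                   for attr, (weight, dist) in group.items())
    return weighted / total_weight


e1 = {"title": "multivariate versus univariate decision trees.", "date": "1992."}
e2 = {"title": "multivariate decision trees.", "date": "1992."}

# Hypothetical group: equal weights, Jaccard on title and exact match on date.
group = {"title": (0.5, jaccard), "date": (0.5, exact)}
gd = group_distance(e1, e2, group)
print(gd)  # title contributes 0.5 * 0.4, date contributes 0.5 * 0.0
```

Dividing by the total weight keeps the result in [0, 1] whenever each per-attribute distance is in [0, 1], so a single threshold can still be applied to the combined value.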

• Step 3
• Using more attributes may either increase or decrease the MAXF value.
• {Author, volume, page} is better than the other groups.

However, finding the proper group requires checking all attribute combinations, which is expensive.

Hence, we provide a heuristic solution.

• Step 3: Heuristic solution
Let S_h denote the set containing all attribute groups of size h.
1. Initially, generate S_1, containing all single-attribute groups;
2. Iteratively generate S_h by using S_{h-1} and S_1;
3. Each element G in S_h must be superior to the elements of S_{h-1} and S_1 it is built from.
Assume G' ∈ S_{h-1}, G'' ∈ S_1, G = G' ∪ G''; then
MAXF_G > max(MAXF_G', MAXF_G'')
Performance analysis:
1. Efficient to execute.
2. May not find the optimal attribute group.
3. Performs well in experiments.
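The level-wise search described above can be sketched as follows. The function name and the `max_f` callback (which returns the MAXF value of an attribute group) are hypothetical, and the scores in the toy table are made-up numbers chosen only to exercise the pruning rule; they are not experimental results from the talk.

```python
def heuristic_groups(attributes, max_f):
    """Grow attribute groups level by level, keeping a group only if it
    scores strictly higher than both groups it was built from."""
    singles = [frozenset([a]) for a in attributes]
    scores = {g: max_f(g) for g in singles}   # cache of evaluated groups
    kept = dict(scores)                       # S_1 is kept in full
    level = set(singles)
    while level:
        next_level = set()
        for parent in level:                  # G' in S_{h-1}
            for single in singles:            # G'' in S_1
                if single <= parent:
                    continue                  # attribute already in the group
                group = parent | single       # G = G' U G''
                if group not in scores:
                    scores[group] = max_f(group)
                if scores[group] > max(scores[parent], scores[single]):
                    next_level.add(group)
                    kept[group] = scores[group]
        level = next_level
    best = max(kept, key=kept.get)
    return best, kept[best]


# Made-up MAXF values for illustration only.
toy_scores = {
    frozenset({"author"}): 0.70,
    frozenset({"volume"}): 0.50,
    frozenset({"page"}): 0.60,
    frozenset({"author", "volume"}): 0.80,
    frozenset({"author", "page"}): 0.75,
    frozenset({"volume", "page"}): 0.55,   # pruned: not better than {"page"}
    frozenset({"author", "volume", "page"}): 0.90,
}

best_group, best_score = heuristic_groups(["author", "volume", "page"],
                                          lambda g: toy_scores[g])
print(sorted(best_group), best_score)
```

Because a group is expanded only when it beats both of its parents, whole branches of the combination lattice are never evaluated; the cost is that a good larger group whose subgroups were pruned can be missed, which matches point 2 of the performance analysis.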

3. Experiments
• Data sets
– Cora is a collection of 1876 citation entries
   Attributes (9): author, title, address, date, page, volume, publisher, editor and journal.
– Restaurant is a collection of 191 restaurant records
   Attributes (4): name, address, city and type.

3. Experiments
1. Single-attribute test

[Figure panels: Restaurant data set; Cora data set]

3. Experiments
2. Attribute group test

[Figure panels: the recall-precision curve under the best attribute group; performance of the heuristic solution]

3. Experiments
• Comparison with existing techniques
– Op-tree: does not consider the redundancy among similarity functions
– SiFi: highly dependent on given rules

4. Conclusion
• This paper aims at selecting an appropriate distance function, rather than proposing a new distance function.
• It is unnecessary to define the threshold parameter by hand; it is computed in the training phase.
• A heuristic approach for higher efficiency.
• Evaluated on real data sets.

Questions?
