Web Data Integration Summary
Data Integration Process
Types of Structured Data on the Web
1. Data Portals
collect and host datasets
collect and generate metadata describing the datasets
provide for data search and exploration
provide free or payment-based access to data
Types of Shared Data: public sector, research, commercial
FAIR Data Principles
Findable
o (Meta)data are assigned a globally unique identifier
o Data are described with rich metadata
o (Meta)data are registered or indexed in a searchable resource
Accessible
o (Meta)data are retrievable by their identifier using a standardized communications protocol
o Metadata are accessible, even when the data are no longer available
Interoperable
o (Meta)data use a formal, broadly applicable language for knowledge representation
o (Meta)data use vocabularies that follow FAIR principles
o (Meta)data include qualified references to other (meta)data
Reusable
o (Meta)data are released with a clear data usage license
o (Meta)data are associated with detailed provenance
o (Meta)data meet domain-relevant community standards
2. Web APIs
Platforms that enable users to share information, e.g., Facebook, make their data only partly accessible via Web APIs
They slice the Web into Data Silos
--- 1. Not indexable by generic web crawlers
--- 2. No automatic discovery of additional data sources
--- 3. No single global data space
3. Linked Data
+++ Entities are identified with HTTP URIs (role of global primary keys); URIs can be looked up on the Web (discover new data sources, navigate the global data graph)
4. HTML-embedded Data
+++
1. Webpages traditionally contain structured data in the form of HTML tables as well as template-generated data
2. More and more websites semantically mark up the content of their HTML pages using standardized markup formats, like RDFa
Data Exchange Formats
Data Exchange: Transfer of data from one system to another
Data Exchange Format: Format used to represent (encode) the transferred data
Web Data is heterogeneous with respect to the employed:
o 1. Data Exchange Format (Technical Heterogeneity, e.g., XML, JSON, CSV)
o 2. Character Encoding (Syntactical Heterogeneity, e.g., ASCII, Unicode)
Character Encoding
is the mapping of "real" characters to bit sequences and a common source of problems in data integration
UTF-8: most common encoding, covers Asian scripts as well; common characters are encoded using only one byte, less common ones in two to four bytes
Comma Separated Values (CSV)
Data model is a table
Pro: Data representation with minimal overhead
Cons
o restricted to tabular data
o hard to read for humans when tables get wider
o different variations, no support for data types
Extensible Markup Language (XML)
Widely used format for data exchange in the Web and enterprise contexts
Data model: Tree
Is a meta language: defines standard syntax, allows the definition of specific languages (XML applications)
HTML versus XML
o HTML: Aimed at displaying information to humans; mixes structure, content, and presentation
o XML: Aimed at data exchange; separates structure, content, and presentation
Well-formed XML Documents
o 1. Closing tag for each opening tag
o 2. Proper nesting of tags
o 3. Only one attribute with a specific name per element
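A minimal sketch of a well-formed document (element and attribute names are made up) that satisfies all three rules:

<movie year="1999">
  <title>The Matrix</title>
  <actor>Keanu Reeves</actor>
  <actor>Carrie-Anne Moss</actor>
</movie>

Every opening tag has a closing tag, the tags are properly nested, and no element repeats an attribute name.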
XML References
o Trees are limited when it comes to n:m relations
  Problem: data duplication (affects consistency, storage, transmission volume)
o Solution: IDs and references
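A minimal sketch of the ID/reference pattern (element and attribute names are made up): the actor is stored once and referenced from two movies instead of being duplicated in both. The attributes only behave as real IDs/references if a DTD declares them as ID and IDREF (see the DTD sketch below).

<db>
  <actor id="a1"><name>Keanu Reeves</name></actor>
  <movie title="The Matrix"><cast ref="a1"/></movie>
  <movie title="John Wick"><cast ref="a1"/></movie>
</db>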
Document Type Definition (DTD)
Defines valid content structure of an XML document
o allowed elements, attributes, child elements, optional elements
o allowed order of elements
DTDs can be used to validate an XML document from the Web before it is further processed
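A small sketch of a DTD for the reference example above (element and attribute names are made up); a validating parser would reject documents that violate these declarations:

<!DOCTYPE db [
  <!ELEMENT db     (actor*, movie*)>
  <!ELEMENT actor  (name)>
  <!ELEMENT name   (#PCDATA)>
  <!ELEMENT movie  (cast*)>
  <!ELEMENT cast   EMPTY>
  <!ATTLIST actor  id    ID     #REQUIRED>
  <!ATTLIST movie  title CDATA  #REQUIRED>
  <!ATTLIST cast   ref   IDREF  #REQUIRED>
]>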
XPath
See slides 33 to 44 (not summarized here).
JavaScript Object Notation (JSON)
is a lightweight data exchange format that uses JavaScript syntax
o less verbose alternative to XML
o widely adopted
– by Web APIs as data exchange format
– for embedding structured data in the HEAD section of HTML pages
Basics:
o objects are noted as in JavaScript
o objects are enclosed in curly brackets {…}
o data is organized in key-value pairs separated by colons { key : value }
Example: { "firstname" : "John", "lastname" : "Smith", "age" : 46 }
JSON is a lot like XML:
o data model: tree
o opening/closing tags/brackets
Differences
o more compact notation compared to XML
o no id/idref – JSON data is strictly tree-shaped
o fewer data types (only string, number, and Boolean)
Resource Description Framework (RDF)
Graph data model designed for sharing data on the Web
Applications:
o annotation of Web pages (RDFa, JSON-LD)
o publication of data on the Web (Linked Data)
o exchange of graph data between applications
View 1: Sentences of the form Subject-Predicate-Object, called triples, e.g., "Chris works at University of Mannheim" (see the Turtle sketch below)
View 2: Labeled directed graph
Resources
o everything (a person, a place, a web page, …) is a resource
o are identified by URI references
o may have one or more types (e.g. foaf:Person)
Literals
o are data values, e.g., strings or integers
o may only be objects, not subjects of triples
o may have a data type or a language tag
Predicates (Properties)
o connect resources to other resources
o connect resources to literals
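A minimal sketch of a few triples in Turtle syntax illustrating resources, types, literals, and predicates; the ex: namespace and resource names are made up, foaf: is the real FOAF vocabulary:

@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:Chris a foaf:Person ;                      # the resource has type foaf:Person
    foaf:name "Chris" ;                       # predicate connecting a resource to a literal
    ex:worksAt ex:UniversityOfMannheim .      # predicate connecting a resource to another resource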
1. RDF Data Model
2. RDF Syntaxes
3. RDF Schema
4. SPARQL Query Language
5. RDF in Java
Schema Mapping and Data Translation
1. Two Basic Integration Situations
Schema Mapping
Goal: Translate data from a set of source schemata into a given target schema
Top-down integration situation
Triggered by concrete information need
Schema Integration
Goal: Create a new integrated schema that can represent all data from a given set of source schemata
Bottom-up integration situation
Triggered by the goal to fulfill different information needs based on data from all sources (no information is thrown away)
2. Types of Correspondences
A correspondence relates a set of elements in a schema S to a set of elements in schema T
Mapping = Set of all correspondences that relate S and T
Schema Matching: Automatically or semi-automatically discover correspondences between
schemata
Types of correspondences
o One-to-One Correspondences – Movie.title → Item.name
o One-to-Many – Person.Name → split() → FirstName (Token 1), Surname (Token 2)
o Many-to-One – Product.basePrice * (1 + Location.taxRate) → Item.price
3. Schema Integration
Completeness: All elements of the source schemata should be covered
Correctness: All data should be represented semantically correctly
Minimality: The integrated schema should be minimal with respect to the number of relations and attributes
Understandability: The integrated schema should be easy to understand
4. Data Translation
Query Generation Goal: Derive suitable data translation queries (or programs) from the correspondences.
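A sketch of what such a generated translation query could look like in SQL, assuming the many-to-one correspondence from the previous section and hypothetical tables Product, Location, and Item with a made-up join key locationId:

-- Many-to-one correspondence realized as a translation query:
-- Product.basePrice * (1 + Location.taxRate) -> Item.price
CREATE VIEW Item AS
SELECT p.name                        AS name,
       p.basePrice * (1 + l.taxRate) AS price
FROM   Product p
JOIN   Location l ON p.locationId = l.locationId;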
5. Schema Matching
Automatically or semi-automatically discover correspondences between schemata
5.1 Challenges to Finding Correspondences
Large schemata: >100 tables and >1,000 attributes
Generic, automatically generated names, e.g., attribute1, attribute2, attribute3
Missing documentation
5.2. Schema Matching Methods
1. Label-based Methods: Rely on the names of schema elements
o 1. Generate cross product of all attributes (classes) from A and B
o 2. For each pair, calculate the similarity of the attribute labels using some similarity metric: Levenshtein (insert/delete/replace are the edit operations), Jaccard, Soundex, etc.
o 3. The most similar pairs are the matches
o Problems:
Semantic heterogeneity is not recognized (e.g., synonyms/homonyms)
Problems with different naming conventions (e.g., abbreviations)
o Solution
  Preprocessing: Normalize labels (stop word removal, stemming, …) to prepare them for matching
  Matching: Employ similarity metrics that fit the specifics of the schemata (see the sketch after this list)
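A minimal Java sketch of label-based matching: normalize the labels, compute a similarity over the cross product of attributes, and keep pairs above a threshold. The token-based Jaccard similarity, the attribute names, and the threshold are illustrative choices, not the lecture's prescription:

import java.util.*;

public class LabelBasedMatcher {
    // Normalize a label: split camelCase, replace separators, lowercase
    static String normalize(String label) {
        return label.replaceAll("([a-z])([A-Z])", "$1 $2")
                    .replaceAll("[_\\-.]", " ")
                    .toLowerCase()
                    .trim();
    }

    // Token-based Jaccard similarity between two normalized labels
    static double similarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(normalize(a).split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(normalize(b).split("\\s+")));
        Set<String> inter = new HashSet<>(ta); inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta); union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> schemaA = Arrays.asList("movie_title", "releaseYear");
        List<String> schemaB = Arrays.asList("Title", "year_of_release");
        double threshold = 0.3;
        // cross product of all attribute pairs, keep pairs above the threshold
        for (String a : schemaA)
            for (String b : schemaB) {
                double sim = similarity(a, b);
                if (sim >= threshold)
                    System.out.println(a + " <-> " + b + " (sim=" + sim + ")");
            }
    }
}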
2. Instance-based Methods: Compare the actual data values
o determine correspondences between A and B by examining which attributes in A and B contain similar values; values often better capture the semantics of an attribute than its label
o 1. Attribute Recognizers
o 2. Value Overlap using Jaccard
o 3. Feature-based Methods
By comparing e.g., attribute data type, average string length
Discussion
Require a decision on which features to use
Require a decision on how to compare and combine values
Similar attribute values do not always imply the same semantics
o 4. Duplicate-based Methods
  Check which attribute values closely match within each duplicate (= similar entry in two DBs)
  ++ Can correctly distinguish very similar attributes
  ++ Work well if duplicates are known or easy to find
  -- Do not work well if identity resolution is too noisy, e.g., for products with very similar names
3. Structure-based Methods: Exploit the structure of the schema
o Address the ambiguities left by label- and instance-based matching
o High similarity of neighboring attributes and/or the name of the relation increases the similarity of an attribute pair
4. Combined Approaches: Use combinations of above methods
5.3 Generating Correspondences from the Similarity Matrix
Input: Matrix containing attribute similarities
Output: Set of correspondences
Local Single Attribute Strategies:
o Thresholding
all attribute pairs with sim above a threshold are returned as correspondences
domain expert checks correspondences afterwards and selects the right ones
o TopK
give domain expert TopK correspondences for each attribute
o Top1
directly return the best match as correspondence
very optimistic, errors might frustrate domain expert
Alternative: Global Matching
o Looking at the complete mapping (all correct correspondences between A and B) gives us the additional restriction that one attribute in A should only be matched to one attribute in B
o Find an optimal set of disjoint correspondences
Alternative: Stable Marriage
5.4 Finding Many-to-One and One-to-Many Correspondences (see slide 65)
5.5 Table Annotation
Goal: Annotate the columns (type and property) of tables in a large table corpus with concepts from a
knowledge graph or shared vocabulary
6. Schema Heterogeneity on the Web
Identity Resolution
Goal: Find all records that refer to the same real-world entity.
Challenge 1: Representations of the same real-world entity are not identical (fuzzy duplicates)
o Solution: Entity Matching: compare multiple attributes using attribute-specific similarity measures, after value normalization
Challenge 2: Quadratic Runtime Complexity
o Comparing every pair of records is too expensive for larger datasets
o Solution: Blocking methods avoid unnecessary comparisons
Entity Matching
2.1 Linearly Weighted Matching Rules
Compute the similarity score between records x and y as a linearly weighted combination of individual attribute similarity scores: sim(x,y) = w1*sim1(x,y) + ... + wn*simn(x,y), with weights typically chosen to sum to 1
We declare x and y matched if sim(x,y) >= b for a pre-specified threshold b, and not matched otherwise
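A minimal Java sketch of such a rule with two attributes; the attribute similarity functions, weights, and threshold are made up for illustration:

import java.util.*;

public class LinearMatchingRule {
    // attribute-specific similarity for titles: token-based Jaccard
    static double titleSim(String a, String b) {
        Set<String> ta = tokens(a), tb = tokens(b);
        Set<String> inter = new HashSet<>(ta); inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta); union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
    // attribute-specific similarity for years: 1 if equal, linearly decaying otherwise
    static double yearSim(int a, int b) {
        return Math.max(0.0, 1.0 - Math.abs(a - b) / 10.0);
    }
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }

    public static void main(String[] args) {
        // two records x and y, each with a title and a year
        String titleX = "The Matrix", titleY = "Matrix, The";
        int yearX = 1999, yearY = 1999;

        double w1 = 0.7, w2 = 0.3;   // weights sum to 1
        double b  = 0.8;             // decision threshold

        double sim = w1 * titleSim(titleX, titleY) + w2 * yearSim(yearX, yearY);
        System.out.println("sim = " + sim + ", matched = " + (sim >= b));
    }
}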
2.2 Non-Linear Matching Rules
Often better than linear rules but require specific domain knowledge.
Non-linear rules can be learned using tree-based learners
2.3 Data Gathering for Matching
Not only the values of the records to be compared, but also the values of related records are relevant for the similarity computation, e.g., for movies also look at their actors
2.4 Data Preprocessing for Matching
To enable similarity measures to compute reliable scores, the data needs to be normalized (spelling,
value formats, measurement units, abbreviations), then parsed (Extract attribute/value pairs from
title) and translated (e.g., into target language).
2.5 Local versus Global Matching
Local Matching
o consider all pairs above threshold as matches
o implies that one record can be matched with several other records
o makes sense for duplicate detection within single data source
Global Matching
o enforce the constraint that one record in data set A should only be matched to one record in data set B
o makes sense for data sources that do not contain duplicates
o Stable Marriage: always match the remaining pair with the highest similarity first
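A minimal Java sketch of the greedy strategy just described: repeatedly take the remaining pair with the highest similarity and remove both records from further consideration (record ids and scores are made up):

import java.util.*;

public class GreedyGlobalMatching {
    record Pair(String a, String b, double sim) {}   // Java 16+ record

    public static void main(String[] args) {
        List<Pair> pairs = new ArrayList<>(List.of(
            new Pair("a1", "b1", 0.9),
            new Pair("a1", "b2", 0.8),
            new Pair("a2", "b1", 0.85),
            new Pair("a2", "b2", 0.4)));

        // sort candidate pairs by similarity, highest first
        pairs.sort(Comparator.comparingDouble(Pair::sim).reversed());

        Set<String> usedA = new HashSet<>(), usedB = new HashSet<>();
        for (Pair p : pairs) {
            // keep the pair only if both records are still unmatched (one-to-one constraint)
            if (!usedA.contains(p.a()) && !usedB.contains(p.b())) {
                System.out.println(p.a() + " <-> " + p.b() + " (sim=" + p.sim() + ")");
                usedA.add(p.a());
                usedB.add(p.b());
            }
        }
    }
}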
2.6 Cluster Records using Pairwise Correspondences
Goal: Create groups of records describing the same real-world entity from pairwise correspondences.
Simple Approach: Connected Components, Smarter Approach: Correlation Clustering (…)
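A minimal Java sketch of the simple approach, computing connected components over the pairwise correspondences with union-find (record ids are made up):

import java.util.*;

public class ConnectedComponents {
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        if (!parent.get(x).equals(x))
            parent.put(x, find(parent.get(x)));   // path compression
        return parent.get(x);
    }

    static void union(String x, String y) {
        parent.put(find(x), find(y));
    }

    public static void main(String[] args) {
        // pairwise correspondences produced by identity resolution
        String[][] matches = { {"r1", "r2"}, {"r2", "r3"}, {"r4", "r5"} };
        for (String[] m : matches) union(m[0], m[1]);

        // group records by their component representative
        Map<String, List<String>> clusters = new HashMap<>();
        for (String r : new ArrayList<>(parent.keySet()))
            clusters.computeIfAbsent(find(r), k -> new ArrayList<>()).add(r);
        System.out.println(clusters);   // two clusters: {r1,r2,r3} and {r4,r5}
    }
}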
Blocking
Since similarity is reflexive and symmetric, the number of comparisons can already be reduced from n² to (n²-n)/2.
3.1 Standard Blocking
Idea: Reduce the number of comparisons by partitioning the records into buckets and comparing only records within each bucket, e.g., partition books by publisher (see the sketch at the end of this subsection).
Pro: much faster, Con: missed true matches
Reduction ratio depends on effectiveness of blocking key
o high: if records are equally distributed over buckets
o low: if majority of the records end up in one bucket
o possible workaround: build sub-buckets using a second blocking attribute
Recall depends on the actually matching pairs being kept (not more than 3% should be lost)
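A minimal Java sketch of standard blocking: partition the records into buckets by a blocking key (the publisher here) and emit candidate pairs only within each bucket (the records are made up; the compare step just prints the candidate pairs):

import java.util.*;

public class StandardBlocking {
    record Book(String title, String publisher) {}   // Java 16+ record

    public static void main(String[] args) {
        List<Book> books = List.of(
            new Book("Data Integration", "Morgan Kaufmann"),
            new Book("Principles of Data Integration", "Morgan Kaufmann"),
            new Book("Web Data Mining", "Springer"));

        // partition the records by blocking key
        Map<String, List<Book>> buckets = new HashMap<>();
        for (Book b : books)
            buckets.computeIfAbsent(b.publisher(), k -> new ArrayList<>()).add(b);

        // only records within the same bucket become candidate pairs
        for (List<Book> bucket : buckets.values())
            for (int i = 0; i < bucket.size(); i++)
                for (int j = i + 1; j < bucket.size(); j++)
                    System.out.println("compare: " + bucket.get(i).title()
                                       + " <-> " + bucket.get(j).title());
    }
}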
3.2 The Sorted Neighborhood Method (SNM)
Idea: Sort records so that similar records are close to each other. Only compare records within a small neighborhood window (see the sketch below).
Challenges
o Choice of Blocking Key
o Choice of Window Size
o But: unlike standard blocking, no problem with unevenly sized buckets
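A minimal Java sketch of the Sorted Neighborhood Method: sort the records by a blocking key and compare each record only with its neighbors inside a sliding window (keys and window size are made up):

import java.util.*;

public class SortedNeighborhood {
    public static void main(String[] args) {
        // records represented by their blocking key (e.g., first letters of title + year)
        List<String> keys = new ArrayList<>(
            List.of("MAT1999", "MAT1998", "INC2010", "INC2010", "AVA2009"));
        Collections.sort(keys);       // similar records end up next to each other

        int w = 3;                    // window size
        for (int i = 0; i < keys.size(); i++)
            for (int j = i + 1; j < Math.min(i + w, keys.size()); j++)
                System.out.println("compare: " + keys.get(i) + " <-> " + keys.get(j));
    }
}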
3.3 Token Blocking for Textual Attributes (…) – see slide 32
Evaluation
A gold standard (GS) is necessary
Accuracy is not a good metric as matching is a strongly unbalanced task
Goal: Have a high reduction ratio while keeping pair completeness > 0.97
Similarity Measures – In Detail
5.1 Edit-based String Similarity Measures
Levenshtein Distance (aka Edit Distance)
o Measures the minimum number of edits needed to transform one string into the other (insert, delete, replace); see the sketch at the end of this subsection
o ++ can deal with typos
o -- does not work well if parts of the string (words) appear in a different order
o -- quadratic runtime complexity
Jaro Similarity
o Specifically designed for matching names
Winkler Similarity
o Intuition: Similarity of the first few letters is more important
  fewer typos in the first letters
  helps when dealing with abbreviations
Jaro-Winkler Similarity
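A minimal Java sketch of the Levenshtein distance (referenced above) using the standard dynamic-programming table; it can be turned into a similarity via 1 - distance / max(length):

public class Levenshtein {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int replace = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // delete
                                            d[i][j - 1] + 1),     // insert
                                   d[i - 1][j - 1] + replace);    // replace / keep
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Levenshtein", "Levenstein"));   // 1 (one missing character)
    }
}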
5.2 Token-based String Similarity Measures
Token-based measures ignore the order of words, which is often desirable for comparing multi-word strings
Using n-grams for Jaccard Coefficient (see the sketch at the end of this subsection)
o Deals with typos and different order of words
o Reduces the time complexity compared to Levenshtein
Cosine Similarity: for comparing weighted term vectors
TF-IDF: gives less weight to common tokens
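A minimal Java sketch of the Jaccard coefficient over character 3-grams (referenced above); the example strings and the n-gram size are illustrative:

import java.util.*;

public class NgramJaccard {
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + n <= t.length(); i++)
            grams.add(t.substring(i, i + n));
        return grams;
    }

    static double jaccard(String a, String b, int n) {
        Set<String> ga = ngrams(a, n), gb = ngrams(b, n);
        Set<String> inter = new HashSet<>(ga); inter.retainAll(gb);
        Set<String> union = new HashSet<>(ga); union.addAll(gb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Apple Inc.", "Inc. Apple", 3));   // same words in different order
        System.out.println(jaccard("Microsoft", "Micrsoft", 3));      // tolerates the typo
    }
}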
5.3 Hybrid String Similarity Measures
hybrid similarity measures split strings into tokens and apply an internal similarity function to compare the tokens
Monge-Elkan Similarity
o ++ can deal with typos and different order of words
o -- runtime complexity: quadratic
5.4 Datatype-specific Similarity Measures
Numerical Comparison, Dates, geographic coordinates
Learning Matching Rules
Data Fusion
Data profiling
= refers to the activity of calculating statistics and creating summaries of a data source or data lake.
Data Fusion
= Given multiple records that describe the same real-world entity, create a single record while
resolving conflicting data values.
Goal: Create a single high-quality record.
Two basic fusion situations:
Slot Filling
o Fill missing values (NULLs) in one dataset with corresponding values from other datasets → increased dataset density
Conflict Resolution
o Resolve contradictions between records by applying a conflict resolution function (heuristic) → increased data quality
Conflict Resolution Functions
Conflict resolution functions are attribute-specific
o 1. Content-based functions that rely only on the data values to be fused
E.g., average, min/max, union,…
o 2. Metadata-based functions that rely on provenance data, ratings, or quality scores
E.g., favourSources, mostRecent, …
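A minimal Java sketch combining both fusion situations: picking any non-NULL value performs slot filling, and a simple content-based function (longest string, a common heuristic for names and titles) resolves value conflicts; the record values are made up:

import java.util.*;

public class DataFusion {
    // content-based conflict resolution: prefer the longest non-null string
    static String longestString(List<String> values) {
        return values.stream()
                     .filter(Objects::nonNull)
                     .max(Comparator.comparingInt(String::length))
                     .orElse(null);
    }

    public static void main(String[] args) {
        // three records describing the same entity, with NULLs and a conflicting title
        List<String> titles = Arrays.asList("The Matrix", null, "Matrix");
        List<String> years  = Arrays.asList(null, "1999", null);

        System.out.println("fused title: " + longestString(titles));  // conflict resolution -> "The Matrix"
        System.out.println("fused year:  " + longestString(years));   // slot filling -> "1999"
    }
}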
Evaluation of Fusion Results
1. Data-Centric Evaluation Measures
o Density: measures the fraction of non-NULL values
o Consistency: A data set is consistent if it is free of conflicting information
2. Ground-Truth Based Evaluation Measures
o Accuracy: Fraction of correct values selected by the conflict resolution function
o Manually determine correct values for a subset of the records
Learning Conflict Resolution Functions: Choose the function with the smallest mean absolute error with respect to a gold standard