Web Data Integration Summary
Data Integration Process
Types of Structured Data on the Web
1. Data Portals
collect and host datasets
collect and generate metadata describing the datasets
provide for data search and exploration
provide free or payment-based access to data
Types of Shared Data: public sector, research, commercial
FAIR Data Principles
Findable
o (Meta)data are assigned a globally unique identifier
o Data are described with rich metadata
o (Meta)data are registered or indexed in a searchable resource
Accessible
o (Meta)data are retrievable by their identifier using a standardized communications protocol
o Metadata are accessible, even when the data are no longer available
Interoperable
o (Meta)data use a formal, broadly applicable language for knowledge representation
o (Meta)data use vocabularies that follow FAIR principles
o (Meta)data include qualified references to other (meta)data
Reusable
o (Meta)data are released with a clear data usage license
o (Meta)data are associated with detailed provenance
o (Meta)data meet domain-relevant community standards
2. Web APIs
Platforms that enable users to share information, e.g., Facebook, make their data only partly accessible via Web APIs
They slice the Web into Data Silos
--- 1. Not indexable by generic web crawlers
--- 2. No automatic discovery of additional data sources
--- 3. No single global data space
3. Linked Data
+++ Entities are identified with HTTP URIs (role of global primary keys); URIs can be looked up on the Web (discover new data sources, navigate the global data graph)
4. HTML-embedded Data
+++
1. Webpages traditionally contain structured data in the form of HTML tables as well as template-generated data
2. More and more websites semantically mark up the content of their HTML pages using standardized markup formats, like RDFa
Data Exchange Formats
Data Exchange: Transfer of data from one system to another
Data Exchange Format: Format used to represent (encode) the transferred data
Web Data is heterogeneous with respect to the employed:
o 1. Data Exchange Format (Technical Heterogeneity, e.g., XML, JSON, CSV)
o 2. Character Encoding (Syntactical Heterogeneity, e.g., ASCII, Unicode)
Character Encoding
is the mapping of "real" characters to bit sequences and a common source of problems in data integration
UTF-8: most common encoding, covers Asian scripts as well; common characters are encoded using only one byte, less common ones in two to four bytes
Comma Separated Values (CSV)
Data model is a table
Pro: Data representation with minimal overhead
Cons
o restricted to tabular data
o hard to read for humans when tables get wider
o different variations, no support for data types
Extensible Markup Language (XML)
Widely used format for data exchange in the Web and enterprise contexts
Data model: Tree
Is a meta language: defines standard syntax, allows the definition of specific languages (XML applications)
HTML versus XML
o HTML: Aimed at displaying information to humans; mixes structure, content, and presentation
o XML: Aimed at data exchange; separates structure, content, and presentation
Well-formed XML Documents
o 1. Closing tag for each opening tag
o 2. Proper nesting of tags
o 3. Only one attribute with a specific name per element
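A minimal sketch of a well-formed document (element and attribute names are made up) that satisfies all three rules:

<movie year="1999">
  <title>The Matrix</title>
  <actor>Keanu Reeves</actor>
  <actor>Carrie-Anne Moss</actor>
</movie>

Every opening tag has a closing tag, the tags are properly nested, and no element repeats an attribute name.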
XML References
o Trees are limited when it comes to n:m relations
  Problem: data duplication (affects consistency, storage, transmission volume)
o Solution: IDs and references
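A minimal sketch of the ID/reference pattern (element and attribute names are made up): the actor is stored once and referenced from two movies instead of being duplicated in both. The attributes only behave as real IDs/references if a DTD declares them as ID and IDREF (see the DTD sketch below).

<db>
  <actor id="a1"><name>Keanu Reeves</name></actor>
  <movie title="The Matrix"><cast ref="a1"/></movie>
  <movie title="John Wick"><cast ref="a1"/></movie>
</db>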
Document Type Definition (DTD)
Defines valid content structure of an XML document
o allowed elements, attributes, child elements, optional elements
o allowed order of elements
DTDs can be used to validate an XML document from the Web before it is further processed
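A small sketch of a DTD for the reference example above (element and attribute names are made up); a validating parser would reject documents that violate these declarations:

<!DOCTYPE db [
  <!ELEMENT db     (actor*, movie*)>
  <!ELEMENT actor  (name)>
  <!ELEMENT name   (#PCDATA)>
  <!ELEMENT movie  (cast*)>
  <!ELEMENT cast   EMPTY>
  <!ATTLIST actor  id    ID     #REQUIRED>
  <!ATTLIST movie  title CDATA  #REQUIRED>
  <!ATTLIST cast   ref   IDREF  #REQUIRED>
]>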
XPath
See slides 33 to 44 (not summarized here).
JavaScript Object Notation (JSON)
is a lightweight data exchange format that uses JavaScript syntax
o less verbose alternative to XML
o widely adopted
– by Web APIs as data exchange format
– for embedding structured data in the HEAD section of HTML pages
Basics:
o objects are noted as in JavaScript
o objects are enclosed in curly brackets {…}
o data is organized in key-value pairs separated by colons { key : value }
Example: { "firstname" : "John", "lastname" : "Smith", "age" : 46 }
JSON is a lot like XML:
o data model: tree
o opening/closing tags/brackets
Differences
o more compact notation compared to XML
o no id/idref – JSON data is strictly tree-shaped
o fewer data types (only string, number, and Boolean)
Resource Description Framework (RDF)
Graph data model designed for sharing data on the Web
Applications:
o annotation of Web pages (RDFa, JSON-LD)
o publication of data on the Web (Linked Data)
o exchange of graph data between applications
View 1: Sentences of the form Subject-Predicate-Object, called triples, e.g., "Chris works at University of Mannheim" (see the Turtle sketch below)
View 2: Labeled directed graph
Resources
o everything (a person, a place, a web page, …) is a resource
o are identified by URI references
o may have one or more types (e.g. foaf:Person)
Literals
o are data values, e.g., strings or integers
o may only be objects, not subjects of triples
o may have a data type or a language tag
Predicates (Properties)
o connect resources to other resources
o connect resources to literals
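A minimal sketch of a few triples in Turtle syntax illustrating resources, types, literals, and predicates; the ex: namespace and resource names are made up, foaf: is the real FOAF vocabulary:

@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:Chris a foaf:Person ;                      # the resource has type foaf:Person
    foaf:name "Chris" ;                       # predicate connecting a resource to a literal
    ex:worksAt ex:UniversityOfMannheim .      # predicate connecting a resource to another resource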
1. RDF Data Model
2. RDF Syntaxes
3. RDF Schema
4. SPARQL Query Language
5. RDF in Java
Schema Mapping and Data Translation
1. Two Basic Integration Situations
Schema Mapping
Goal: Translate data from a set of source schemata into a given target schema
Top-down integration situation
Triggered by concrete information need
Schema Integration
Goal: Create a new integrated schema that can represent all data from a given set of source schemata
Bottom-up integration situation
Triggered by the goal to fulfill different information needs based on data from all sources (no information is thrown away)
2. Types of Correspondences
A correspondence relates a set of elements in a schema S to a set of elements in schema T
Mapping = Set of all correspondences that relate S and T
Schema Matching: Automatically or semi-automatically discover correspondences between
schemata
Types of correspondences
o One-to-One Correspondences – Movie.title → Item.name
o One-to-Many – Person.Name → split() → FirstName (Token 1), Surname (Token 2)
o Many-to-One – Product.basePrice * (1 + Location.taxRate) → Item.price
3. Schema Integration
Completeness: All elements of the source schemata should be covered
Correctness: All data should be represented semantically correctly
Minimality: The integrated schema should be minimal with respect to the number of relations and attributes
Understandability: The integrated schema should be easy to understand
4. Data Translation
Query Generation Goal: Derive suitable data translation queries (or programs) from the correspondences.
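A sketch of what such a generated translation query could look like in SQL, assuming the many-to-one correspondence from the previous section and hypothetical tables Product, Location, and Item with a made-up join key locationId:

-- Many-to-one correspondence realized as a translation query:
-- Product.basePrice * (1 + Location.taxRate) -> Item.price
CREATE VIEW Item AS
SELECT p.name                        AS name,
       p.basePrice * (1 + l.taxRate) AS price
FROM   Product p
JOIN   Location l ON p.locationId = l.locationId;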
5. Schema Matching
Automatically or semi-automatically discover correspondences between schemata
5.1 Challenges to Finding Correspondences
Large schemata: >100 tables and >1,000 attributes
Generic, automatically generated names, e.g., attribute1, attribute2, attribute3
Missing documentation
5.2. Schema Matching Methods
1. Label-based Methods: Rely on the names of schema elements
o 1. Generate cross product of all attributes (classes) from A and B
o 2. For each pair, calculate the similarity of the attribute labels using some similarity metric: Levenshtein (insert/delete/replace are the edit operations), Jaccard, Soundex, etc.
o 3. The most similar pairs are the matches
o Problems:
Semantic heterogeneity is not recognized (e.g., synonyms/homonyms)
Problems with different naming conventions (e.g., abbreviations)
o Solution
  Preprocessing: Normalize labels (stop word removal, stemming, …) to prepare them for matching
  Matching: Employ similarity metrics that fit the specifics of the schemata (see the sketch after this list)
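A minimal Java sketch of label-based matching: normalize the labels, compute a similarity over the cross product of attributes, and keep pairs above a threshold. The token-based Jaccard similarity, the attribute names, and the threshold are illustrative choices, not the lecture's prescription:

import java.util.*;

public class LabelBasedMatcher {
    // Normalize a label: split camelCase, replace separators, lowercase
    static String normalize(String label) {
        return label.replaceAll("([a-z])([A-Z])", "$1 $2")
                    .replaceAll("[_\\-.]", " ")
                    .toLowerCase()
                    .trim();
    }

    // Token-based Jaccard similarity between two normalized labels
    static double similarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(normalize(a).split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(normalize(b).split("\\s+")));
        Set<String> inter = new HashSet<>(ta); inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta); union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> schemaA = Arrays.asList("movie_title", "releaseYear");
        List<String> schemaB = Arrays.asList("Title", "year_of_release");
        double threshold = 0.3;
        // cross product of all attribute pairs, keep pairs above the threshold
        for (String a : schemaA)
            for (String b : schemaB) {
                double sim = similarity(a, b);
                if (sim >= threshold)
                    System.out.println(a + " <-> " + b + " (sim=" + sim + ")");
            }
    }
}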
2. Instance-based Methods: Compare the actual data values
o determine correspondences between A and B by examining which attributes in A and B contain similar values; values often better capture the semantics of an attribute than its label
o 1. Attribute Recognizers
o 2. Value Overlap using Jaccard
o 3. Feature-based Methods
By comparing e.g., attribute data type, average string length
Discussion
Require a decision on which features to use
Require a decision on how to compare and combine values
Similar attribute values do not always imply the same semantics
o 4. Duplicate-based Methods
  Check which attribute values closely match within each duplicate (= similar entry in two DBs)
  ++ Can correctly distinguish very similar attributes
  ++ Work well if duplicates are known or easy to find
  -- Do not work well if identity resolution is too noisy, e.g., for products with very similar names
3. Structure-based Methods: Exploit the structure of the schema
o Address the ambiguities left by label- and instance-based matching
o High similarity of neighboring attributes and/or the name of the relation increases the similarity of an attribute pair
4. Combined Approaches: Use combinations of above methods
5.3 Generating Correspondences from the Similarity Matrix
Input: Matrix containing attribute similarities
Output: Set of correspondences
Local Single Attribute Strategies:
o Thresholding
all attribute pairs with sim above a threshold are returned as correspondences
domain expert checks correspondences afterwards and selects the right ones
o TopK
give domain expert TopK correspondences for each attribute
o Top1
directly return the best match as correspondence
very optimistic, errors might frustrate domain expert
Alternative: Global Matching
o Looking at the complete mapping (all correct correspondences between A and B) gives us the additional restriction that one attribute in A should only be matched to one attribute in B
o Find an optimal set of disjoint correspondences
Alternative: Stable Marriage
5.4 Finding Many-to-One and One-to-Many Correspondences (see slide 65)
5.5 Table Annotation
Goal: Annotate the columns (type and property) of tables in a large table corpus with concepts from a
knowledge graph or shared vocabulary
6. Schema Heterogeneity on the Web
Identity Resolution
Goal: Find all records that refer to the same real-world entity.
Challenge 1: Representations of the same real-world entity are not identical (fuzzy duplicates)
o Solution: Entity Matching: compare multiple attributes using attribute-specific similarity measures, after value normalization
Challenge 2: Quadratic Runtime Complexity
o Comparing every pair of records is too expensive for larger datasets
o Solution: Blocking methods avoid unnecessary comparisons
Entity Matching
2.1 Linearly Weighted Matching Rules
Compute the similarity score between records x and y as a linearly weighted combination of individual attribute similarity scores: sim(x,y) = w1*sim1(x,y) + ... + wn*simn(x,y), with weights typically chosen to sum to 1
We declare x and y matched if sim(x,y) >= b for a pre-specified threshold b, and not matched otherwise
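A minimal Java sketch of such a rule with two attributes; the attribute similarity functions, weights, and threshold are made up for illustration:

import java.util.*;

public class LinearMatchingRule {
    // attribute-specific similarity for titles: token-based Jaccard
    static double titleSim(String a, String b) {
        Set<String> ta = tokens(a), tb = tokens(b);
        Set<String> inter = new HashSet<>(ta); inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta); union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
    // attribute-specific similarity for years: 1 if equal, linearly decaying otherwise
    static double yearSim(int a, int b) {
        return Math.max(0.0, 1.0 - Math.abs(a - b) / 10.0);
    }
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }

    public static void main(String[] args) {
        // two records x and y, each with a title and a year
        String titleX = "The Matrix", titleY = "Matrix, The";
        int yearX = 1999, yearY = 1999;

        double w1 = 0.7, w2 = 0.3;   // weights sum to 1
        double b  = 0.8;             // decision threshold

        double sim = w1 * titleSim(titleX, titleY) + w2 * yearSim(yearX, yearY);
        System.out.println("sim = " + sim + ", matched = " + (sim >= b));
    }
}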
2.2 Non-Linear Matching Rules
Often better than linear rules but require specific domain knowledge.
Non-linear rules can be learned using tree-based learners
2.3 Data Gathering for Matching
Not only the values of the records to be compared, but also the values of related records are relevant for the similarity computation, e.g., for movies also look at their actors
2.4 Data Preprocessing for Matching
To enable similarity measures to compute reliable scores, the data needs to be normalized (spelling,
value formats, measurement units, abbreviations), then parsed (Extract attribute/value pairs from
title) and translated (e.g., into target language).
2.5 Local versus Global Matching
Local Matching
o consider all pairs above threshold as matches
o implies that one record can be matched with several other records
o makes sense for duplicate detection within single data source
Global Matching
o enforce the constraint that one record in data set A should only be matched to one record in data set B
o makes sense for data sources that do not contain duplicates
o Stable Marriage: always match the remaining pair with the highest similarity first
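A minimal Java sketch of the greedy strategy just described: repeatedly take the remaining pair with the highest similarity and remove both records from further consideration (record ids and scores are made up):

import java.util.*;

public class GreedyGlobalMatching {
    record Pair(String a, String b, double sim) {}   // Java 16+ record

    public static void main(String[] args) {
        List<Pair> pairs = new ArrayList<>(List.of(
            new Pair("a1", "b1", 0.9),
            new Pair("a1", "b2", 0.8),
            new Pair("a2", "b1", 0.85),
            new Pair("a2", "b2", 0.4)));

        // sort candidate pairs by similarity, highest first
        pairs.sort(Comparator.comparingDouble(Pair::sim).reversed());

        Set<String> usedA = new HashSet<>(), usedB = new HashSet<>();
        for (Pair p : pairs) {
            // keep the pair only if both records are still unmatched (one-to-one constraint)
            if (!usedA.contains(p.a()) && !usedB.contains(p.b())) {
                System.out.println(p.a() + " <-> " + p.b() + " (sim=" + p.sim() + ")");
                usedA.add(p.a());
                usedB.add(p.b());
            }
        }
    }
}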
2.6 Cluster Records using Pairwise Correspondences
Goal: Create groups of records describing the same real-world entity from pairwise correspondences.
Simple Approach: Connected Components, Smarter Approach: Correlation Clustering (…)
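A minimal Java sketch of the simple approach, computing connected components over the pairwise correspondences with union-find (record ids are made up):

import java.util.*;

public class ConnectedComponents {
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        if (!parent.get(x).equals(x))
            parent.put(x, find(parent.get(x)));   // path compression
        return parent.get(x);
    }

    static void union(String x, String y) {
        parent.put(find(x), find(y));
    }

    public static void main(String[] args) {
        // pairwise correspondences produced by identity resolution
        String[][] matches = { {"r1", "r2"}, {"r2", "r3"}, {"r4", "r5"} };
        for (String[] m : matches) union(m[0], m[1]);

        // group records by their component representative
        Map<String, List<String>> clusters = new HashMap<>();
        for (String r : new ArrayList<>(parent.keySet()))
            clusters.computeIfAbsent(find(r), k -> new ArrayList<>()).add(r);
        System.out.println(clusters);   // two clusters: {r1,r2,r3} and {r4,r5}
    }
}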
Blocking
Since similarity is reflexive and symmetric, the number of comparisons can already be reduced from n² to (n²-n)/2.
3.1 Standard Blocking
Idea: Reduce the number of comparisons by partitioning the records into buckets and comparing only records within each bucket, e.g., partition books by publisher (see the sketch at the end of this subsection).
Pro: much faster, Con: missed true matches
Reduction ratio depends on effectiveness of blocking key
o high: if records are equally distributed over buckets
o low: if majority of the records end up in one bucket
o possible workaround: build sub-buckets using a second blocking attribute
Recall depends on the actually matching pairs being kept (not more than 3% should be lost)
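A minimal Java sketch of standard blocking: partition the records into buckets by a blocking key (the publisher here) and emit candidate pairs only within each bucket (the records are made up; the compare step just prints the candidate pairs):

import java.util.*;

public class StandardBlocking {
    record Book(String title, String publisher) {}   // Java 16+ record

    public static void main(String[] args) {
        List<Book> books = List.of(
            new Book("Data Integration", "Morgan Kaufmann"),
            new Book("Principles of Data Integration", "Morgan Kaufmann"),
            new Book("Web Data Mining", "Springer"));

        // partition the records by blocking key
        Map<String, List<Book>> buckets = new HashMap<>();
        for (Book b : books)
            buckets.computeIfAbsent(b.publisher(), k -> new ArrayList<>()).add(b);

        // only records within the same bucket become candidate pairs
        for (List<Book> bucket : buckets.values())
            for (int i = 0; i < bucket.size(); i++)
                for (int j = i + 1; j < bucket.size(); j++)
                    System.out.println("compare: " + bucket.get(i).title()
                                       + " <-> " + bucket.get(j).title());
    }
}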
3.2 The Sorted Neighborhood Method (SNM)
Idea: Sort records so that similar records are close to each other. Only compare records within a small neighborhood window (see the sketch below).
Challenges
o Choice of Blocking Key
o Choice of Window Size
o But: unlike standard blocking, no problem with unevenly sized buckets
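A minimal Java sketch of the Sorted Neighborhood Method: sort the records by a blocking key and compare each record only with its neighbors inside a sliding window (keys and window size are made up):

import java.util.*;

public class SortedNeighborhood {
    public static void main(String[] args) {
        // records represented by their blocking key (e.g., first letters of title + year)
        List<String> keys = new ArrayList<>(
            List.of("MAT1999", "MAT1998", "INC2010", "INC2010", "AVA2009"));
        Collections.sort(keys);       // similar records end up next to each other

        int w = 3;                    // window size
        for (int i = 0; i < keys.size(); i++)
            for (int j = i + 1; j < Math.min(i + w, keys.size()); j++)
                System.out.println("compare: " + keys.get(i) + " <-> " + keys.get(j));
    }
}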
3.3 Token Blocking for Textual Attributes (…) – see slide 32
Evaluation
A gold standard (GS) is necessary
Accuracy is not a good metric as matching is a strongly unbalanced task
Goal: Have a high reduction ratio while keeping pair completeness > 0.97
Similarity Measures – In Detail
5.1 Edit-based String Similarity Measures
Levenshtein Distance (aka Edit Distance)
o Measures the minimum number of edits needed to transform one string into the other (insert, delete, replace); see the sketch at the end of this subsection
o ++ can deal with typos
o -- does not work well if parts of the string (words) appear in a different order
o -- quadratic runtime complexity
Jaro Similarity
o Specifically designed for matching names
Winkler Similarity
o Intuition: Similarity of the first few letters is more important
  fewer typos in the first letters
  helps when dealing with abbreviations
Jaro-Winkler Similarity
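A minimal Java sketch of the Levenshtein distance (referenced above) using the standard dynamic-programming table; it can be turned into a similarity via 1 - distance / max(length):

public class Levenshtein {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int replace = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // delete
                                            d[i][j - 1] + 1),     // insert
                                   d[i - 1][j - 1] + replace);    // replace / keep
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Levenshtein", "Levenstein"));   // 1 (one missing character)
    }
}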
5.2 Token-based String Similarity Measures
Token-based measures ignore the order of words, which is often desirable for comparing multi-word strings
Using n-grams for Jaccard Coefficient (see the sketch at the end of this subsection)
o Deals with typos and different order of words
o Reduces the time complexity compared to Levenshtein
Cosine Similarity: for comparing weighted term vectors
TF-IDF: gives less weight to common tokens
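A minimal Java sketch of the Jaccard coefficient over character 3-grams (referenced above); the example strings and the n-gram size are illustrative:

import java.util.*;

public class NgramJaccard {
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + n <= t.length(); i++)
            grams.add(t.substring(i, i + n));
        return grams;
    }

    static double jaccard(String a, String b, int n) {
        Set<String> ga = ngrams(a, n), gb = ngrams(b, n);
        Set<String> inter = new HashSet<>(ga); inter.retainAll(gb);
        Set<String> union = new HashSet<>(ga); union.addAll(gb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Apple Inc.", "Inc. Apple", 3));   // same words in different order
        System.out.println(jaccard("Microsoft", "Micrsoft", 3));      // tolerates the typo
    }
}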
5.3 Hybrid String Similarity Measures
hybrid similarity measures split strings into tokens and apply an internal similarity function to compare the tokens
Monge-Elkan Similarity
o ++ can deal with typos and different order of words
o -- runtime complexity: quadratic
5.4 Datatype-specific Similarity Measures
Numerical Comparison, Dates, geographic coordinates
Learning Matching Rules
Data Fusion
Data profiling
= refers to the activity of calculating statistics and creating summaries of a data source or data lake.
Data Fusion
= Given multiple records that describe the same real-world entity, create a single record while
resolving conflicting data values.
Goal: Create a single high-quality record.
Two basic fusion situations:
Slot Filling
o Fill missing values (NULLs) in one dataset with corresponding values from other datasets → increased dataset density
Conflict Resolution
o Resolve contradictions between records by applying a conflict resolution function (heuristic) → increased data quality
Conflict Resolution Functions
Conflict resolution functions are attribute-specific
o 1. Content-based functions that rely only on the data values to be fused
E.g., average, min/max, union,…
o 2. Metadata-based functions that rely on provenance data, ratings, or quality scores
E.g., favourSources, mostRecent, …
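A minimal Java sketch combining both fusion situations: picking any non-NULL value performs slot filling, and a simple content-based function (longest string, a common heuristic for names and titles) resolves value conflicts; the record values are made up:

import java.util.*;

public class DataFusion {
    // content-based conflict resolution: prefer the longest non-null string
    static String longestString(List<String> values) {
        return values.stream()
                     .filter(Objects::nonNull)
                     .max(Comparator.comparingInt(String::length))
                     .orElse(null);
    }

    public static void main(String[] args) {
        // three records describing the same entity, with NULLs and a conflicting title
        List<String> titles = Arrays.asList("The Matrix", null, "Matrix");
        List<String> years  = Arrays.asList(null, "1999", null);

        System.out.println("fused title: " + longestString(titles));  // conflict resolution -> "The Matrix"
        System.out.println("fused year:  " + longestString(years));   // slot filling -> "1999"
    }
}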
Evaluation of Fusion Results
1. Data-Centric Evaluation Measures
o Density: measures the fraction of non-NULL values
o Consistency: A data set is consistent if it is free of conflicting information
2. Ground-Truth Based Evaluation Measures
o Accuracy: Fraction of correct values selected by the conflict resolution function
o Manually determine correct values for a subset of the records
Learning Conflict Resolution Functions: Choose the function with the smallest mean absolute error with respect to a gold standard