Data warehouse & Exploration
PRINCIPLES OF
DATA INTEGRATION
Sellami Mokhtar
CHAPTER 1: INTRODUCTION
TO DATA INTEGRATION
PRINCIPLES OF
DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Outline
➢ Introduction: data integration as a new abstraction
▪ Examples of data integration applications
▪ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Data Integration
▪ Databases are great: they let us manage huge
amounts of data
▪ Assuming you’ve put it all into your schema.
▪ In reality, data sets are often created independently
▪ Only to discover later that they need to combine their
data!
▪ At that point, they’re using different systems, different
schemata and have limited interfaces to their data.
▪ The goal of data integration: tie together different
sources, controlled by many people, under a
common schema.
DBMS: it’s all about abstraction
▪ Logical vs. Physical; What vs. How.
Students: Takes:
SSN Name Category SSN CID
123-45-6789 Charles undergrad 123-45-6789 CSE444
234-56-7890 Dan grad 123-45-6789 CSE444
… … 234-56-7890 CSE142
Courses: …
CID Name Quarter
CSE444 Databases fall
CSE541 Operating systems winter
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name=“Mary” and
S.ssn = T.ssn and T.cid = C.cid
Data Integration:
A Higher-level Abstraction
Query Independence of:
• source & location
Mediated Schema • data model, syntax
• semantic variations
•…
Semantic
Mappings
S1 S2 S3
<cd> <title> The best of … </title>
SSN
123-45-6789
234-56-7890
Name
Charles
Dan
…
Category
undergrad
grad
…
SSN
123-45-6789
123-45-6789
234-56-7890
CID
CSE444
CSE444
CSE142
… <artist> Carreras </artist>
<artist> Pavarotti </artist>
…
… <artist> Domingo </artist>
CID Name Quarter
CSE444 Databases fall <price> 19.95 </price>
CSE541 Operating systems winter
</cd>
Outline
✓ Introduction: data integration as a new abstraction
➢ Examples of data integration applications
▪ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Applications of Data Integration
▪ Business
▪ Science
▪ Government
▪ The Web
▪ Pretty much everywhere
Application Area 1: Business
Enterprise Databases
EII Apps:
CRM
ERP
Single Mediated View
Portals
…
Legacy Databases
Services and Applications
50% of all IT $$$ spent here!
Application Area 2: Science
Sequenceable Structured
Phenotype Gene Experiment
Entity Vocabulary
Nucleotide Microarray
Protein
Sequence Experiment
Swiss-
OMIM HUGO GO
Prot
Gene- Locus-
Entrez GEO
Clinics Link
Hundreds of biomedical data sources available; growing rapidly!
Application Area 3: The Web
Hundreds of millions of high-quality
tables on the Web
The Deep Web
▪ Millions of high quality HTML forms out there
▪ Each form has its own special interface
▪ Hard to explore data across sites.
▪ Goal (for some domains):
▪ A single interface into a multitude of deep-web sources.
Create a single site to search for jobs/rentals/…
Easily traverse between the site by clicking its name
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
➢ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Enterprise Data Integration:
FullServe Corporation
Employees Resumes
FullTimeEmp Interview
Hire CV
TempEmployees
Training Services
Services
Courses
Customers
Enrollments
Contracts
Sales HelpLine
Products Calls
Sales
EuroCard Corporation
Employees Resumes
Employees Interview
Hire
Credit Cards HelpLine
Customer Calls
CustDetail
Examples of Heterogeneity
FullServe EuroCard
FullTimeEmp Employees
ssn, empId, firstName ID, firstNameMiddleInitial,
middleName, lastName lastName
Hire Hire
empId, hireDate, recruiter ID, hireDate, recruiter
TempEmployees
ssn, hireStart, hireEnd
Find all employees (making over $100K)
Customer Call Center
Agents should have a full view of customer when they call
in.
Sales
Credit Cards
Products
Customer
Sales
CustDetail
Services
Services
Customers
Contracts
Other Reasons to Integrate Data
▪ Create a (useful) web site for tracking services
▪ Collaborate with third parties
▪ E.g., create branded services
▪ Comply with government regulations
▪ Find “risky” employees
▪ Business intelligence
▪ What’s really wrong with our products?
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
✓ Schema heterogeneity
➢ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Goal of Data Integration
▪ Uniform query access to a set of data sources
▪ Handle:
▪ Scale of sources: from tens to millions to billions
▪ Heterogeneity
▪ Autonomy
▪ Semi-structure
▪ Non Structured
Why is it Hard?
▪ Systems-level reasons:
▪ Managing different platforms
▪ SQL across multiple systems is not so simple
▪ Distributed query processing
▪ Logical reasons:
▪ Schema (and data) heterogeneity
▪ ‘Social’ reasons:
▪ Locating and capturing relevant data in the enterprise.
▪ Convincing people to share (data fiefdoms)
❖ Security, privacy and performance implications.
Setting Expectations
Data integration is AI-Complete.
▪ Completely automated solutions unlikely.
Goal 1:
▪ Reduce the effort needed to set up an integration
application.
Goal 2:
▪ Enable the system to perform gracefully with uncertainty
(e.g., on the web)
Goal 3:
▪ Enable the system to secure integration of heretegous
sources : privacy aware security data integration
Data Integration Smorgasbord
Something for everyone:
▪ Theory of modeling data sources
▪ Systems aspects of data integration
▪ Architectural issues: e.g., P2P data sharing, Microservice
▪ AI @ work: automated schema matching
▪ IA@ work: AI Based schema matching
▪ Web: latest on data integration & web
▪ Commercial products: BEA, IBM
▪ Semantic Web: what does it have to offer?
▪ New trends in DBMS: uncertainty, dataspaces, big data,
nosql, cloud, security privacy.
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
✓ Schema heterogeneity
✓ Goal of data integration, why it’s a hard problem
➢ Data integration architectures
Virtual, Warehousing and in Between
▪ Data warehousing: integrate by bringing the data
into a single physical warehouse
▪ Virtual data integration: leave the data at the
sources and access it at query time.
▪ Some differences, but semantic heterogeneity arises
in both cases.
▪ Numerous intermediate architectures.
▪ The course illustrates data integration technology
mostly through the virtual architecture.
Virtual Data Integration Architecture
Mediated Schema
or Warehouse Query reformulation/
Query over materialized data
Source
descriptions/
Transforms
Wrapper / Wrapper / Wrapper / Wrapper /
Extractor Extractor Extractor Extractor
RDBMS 1 RDBMS 2
HTML1 XML1
Example
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Reviews(title, rating, description)
S1 S2 S3 S4 S5
Movies (name, Cinemas (place, CinemasInNYC CinemasInSF Reviews (title,
actors, director, movie, start) (cinema, title, (location, movie, date, grade,
genre) startTime) startingTime) review)
Wrappers
<cd> <title> The best of … </title>
<artist> Abiteboul </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>
…
Send queries to data
sources and transform
answers into tuples (or
other internal data
model). (Chapter 9)
Mediation Languages
Mediated Schema
Describe CD: ASIN, Title, Genre,…
Artist: ASIN, name, …
relationships
between
logic
mediated
schema and CDs Books
Album Title Authors
data sources ASIN
Price
ISBN
Price
ISBN
FirstName
DiscountPrice DiscountPrice LastName
(Chapter 3). Studio Edition
Artists
ASIN
CDCategories BookCategories ArtistName
ASIN ISBN GroupName
Category Category
Woody Allen Comedies in NY
Mediated schema:
Movie: Title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description
select title, startTime
from Movie, Plays
where Movie.title=Plays.movie AND
location=“New York” AND
director=“Woody Allen”
Movie: Title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description
select title, startTime
from Movie, Plays
where Movie.title=Plays.movie AND
location=“New York” AND
director=“Woody Allen”
Sources S1 and S3 are relevant, sources S4 and S5 are
irrelevant, and source S2 is relevant but possibly
redundant.
S1 S2 S3 S4 S5
Movies: Cinemas: Cinemas in NYC: Cinemas in SF: Reviews:
name, actors, place, movie, cinema, title, location, movie, title, date
director, genre start startTime startingTime grade, review
Query Processing
Query Query LLM reformulation
Logical query plan
Chapter 8 Query LLM optimizer
Physical query plan
Replanning request
Execution LLM engine
LLM Wrapper LLM wrapper wrapper wrapper wrapper
source source source source source
Data Warehouses – Offline Replication
▪ Determine physical schema
▪ Define a database with this Query Results
schema Data Warehouse
▪ Define procedural mappings
in an “ETL tool” to import
the data and clean it.
▪ Periodically copy all of the
data from the data sources
▪ Note that the sources
and the warehouse are
basically independent at
this point 37
Pros and Cons of Data Warehouses
Need to spend time to design the physical database
layout, as well as logical
This actually takes a lot of effort!
Data is generally not up-to-date (lazy or offline
refresh)
✓ Queries over the warehouse don’t disrupt the data
sources
✓ Can run very heavy-duty computations, including
data mining and cleaning
38
Summary of Chapter 1
▪ Data integration: abstract away the fact that data
comes from multiple sources in varying schemata.
▪ Problem occurs everywhere: it’s key to business,
science, Web and government.
▪ Goal: reduce the effort involved in integrating.
▪ Regardless of the architecture, heterogeneity is a key
issue.
▪ Architectures range from warehousing to virtual
integration.