[go: up one dir, main page]

0% found this document useful (0 votes)
10 views23 pages

SIP10 - Data Integration

The document discusses the challenges of data integration, emphasizing the need to understand different data sources and their interpretations. It outlines various data integration architectures and common mistakes made during the process, as well as the importance of balancing performance and normalization in database design. Ultimately, it highlights that data integration is often the most difficult aspect of system integration projects.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views23 pages

SIP10 - Data Integration

The document discusses the challenges of data integration, emphasizing the need to understand different data sources and their interpretations. It outlines various data integration architectures and common mistakes made during the process, as well as the importance of balancing performance and normalization in database design. Ultimately, it highlights that data integration is often the most difficult aspect of system integration projects.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

System Integration

Mini Case Studies © 2010

Data Integration

Shawn A. Butler, Ph.D.


Senior Lecturer, Executive Education Program
Institute for Software Research
Carnegie Mellon University
Objectives
 Understand why data integration is so
challenging
 Techniques for data integration
 Understand performance tradeoffs in data
integration

Data and Reality, William Kent, 1st Books Library, 2000

© 2010 CMU-ISR 2
Agenda
 A quick review of purpose
 Data Integration Architectures
 What does It mean?
 Relationships
 Data normalization

“Entities are a state of mind.


No two people agree what the real world view is.”
[A. Metaxides]

© 2010 CMU-ISR 3
Two Architectures For Data
Integration
 Problem – Develop an application that
aggregates information from several
applications

© 2010 CMU-ISR 4
Data Integration Model I

Presentation

Application Logic
Batch File Transfer
Database Gateway
Middleware ODBC
OLAP
Data Transformation

Application 1 Application 2

Data 1 Data 1

© 2010 CMU-ISR 5
Data Integration Model II

Presentation

Application Logic
Batch File Transfer
Database Gateway
Middleware ODBC
OLAP
Data Transformation

Application 1 Application 2

Data 1 Data 1 Data 1

© 2010 CMU-ISR 6
2 Architecture-Many Choices
 Get all the records and aggregate manually
 Aggregate field in a special top level record,
application making changes update the field
 System updates the aggregate field in system
business logic
 The new application computes the aggregate on
retrieval
 Special application “query processor” aggregates
the information when needed
 Interface provides this as part of the system

© 2010 CMU-ISR 7
Data Integration Common
Mistakes
 Creating yet another database
 Waiting for the data analyst to finish developing
the perfect schema
 Implementing the perfectly normalized schema
 Assuming the data exists as described in the
documentation
 Testing without a sufficient set of real data
 Assuming that one site is a good representation of
data at all sites

© 2010 CMU-ISR 8
What is One Thing?

 We exist in a world of ambiguity


 The system cannot tolerate ambiguity

 Oneness
 Sameness – When are two things the same?
 What is it? – In what categories does it exist?

© 2010 CMU-ISR 9
Meaning
 What is a warehouse?
• A location within a building?
• One physical building?
• Several physical buildings at a single location?
• A logical concept of where things are stored?

All definitions may be correct, but different among applications.

© 2010 CMU-ISR 10
Meaning
 What is a street?
• Segments along the physical road may have
different names
• Different streets may have the same name
• Some roads have discontinuous segments
• Is a street terminated by a city, county, state?
• Does street imply motor traffic?
• Does it also mean freeways, highways,
expressways, toll ways, circles, etc.

© 2010 CMU-ISR 11
Meaning
 How do we think of skills?
• What we know how to do is usually quite varied
• Categorization arbitrarily limits how that
information is defined
• Categorization arbitrarily limits the number of
skills that can be described

© 2010 CMU-ISR 12
How Many Things is It?
 A person is:
• An employee
• Spouse
• Shift supervisor
• Stockholder

 A warehouse is:
• Physical building
• A place where parts are stored

© 2010 CMU-ISR 13
When does change make It
different?
 Car
• Does a different color make it a different car?
• Does a new engine make it a different car?
• Slowly replace the parts of a car, at what point is it a
different car?
• Is the essence of change the same in two different
systems?
 Parts
• When does it matter when the component parts are
different?
 Versions
• At what point does a version move to a different
product?

© 2010 CMU-ISR 14
Change
 Sometimes our perception changes
• When do two things become one?

© 2010 CMU-ISR 15
Categories, Attributes,
Relationships
 Categories (aka types)
• Require arbitrary decisions
• Categories often have subsets
• Overlap with other categories
 Plaintiffs are people, corporations, government
agencies…)
• Once categories are established real things are
assigned
 Employee (part time, fulltime, managers, etc…)
• Items change in a category, does it still fit the
category

© 2010 CMU-ISR 16
Categories, Attributes,
Relationships
 Attributes may be part of the category or
they may be an essential part of the
categorization
• Cars, by the way, have wheels
• Not all cars have wheels
 Relationships
• Relationship is not meaningful between things
within a category
• Relationship is meaningful, but cannot be
related to itself
• Relationship is meaningful, and things can be
related to themselves

© 2010 CMU-ISR 17
Relationships
 Optional versus Mandatory
 Transitivity
 Symmetry and Anti-symmetry
 Implication
 Subset consistency
 Constraints
 Attributes of the relationship
 Names
• No name
• One name
• Multiple names

© 2010 CMU-ISR 18
Relational Databases
 Database normalization rules are designed to
prevent anomalies and inconsistencies in
databases
 Database normalization rules, strictly applied,
may introduce inefficiencies in database design
 All databases have various schemas that work
well for their application, but don’t combine well
into an efficient schema for all applications
 Database design is always a tradeoff among
application performance optimization

© 2010 CMU-ISR 19
Normalization of Databases

 First normal form – All record types must


contain the same number of fields

Transportation John Smith Anne Harbor Connie Redwood


Accounting
Logistics

Mark Johnson
Sally Worth
Mary Ebner
Chris Waters Tracy Elmore Judd Heron

© 2010 CMU-ISR 20
Normalization of Databases
 Second Normal Form – Non-key field is not
related to a key field
Employee Department Manager Dept_Address

1. The Department address is repeated in every record that refers to an


employee in that Department.
2. If the address of the Department changes, every record referring to a part
stored in that Department must be updated.
3. Because of the redundancy, the data might become inconsistent, with different
records showing different addresses for the same Department.
4. If at some point in time there are no employees in a Department, there may be
no record in which to keep the Department’s address.

© 2010 CMU-ISR 21
Normalization of Databases
 Third normal form – Non-key field is not a
fact about another non-key field

Employee Department Manager Department Address

Employee Department Manager

Department Department Address

© 2010 CMU-ISR 22
Summary
 Data integration will most likely be the
most difficult challenge of a system
integration project
 Data integration is difficult because each
data source will have their own views of
what It means
 There will always be a balance between
performance and normalization of the data
model
 Don’t let the data tail wag the system dog
© 2010 CMU-ISR 23

You might also like