Data Quality and Integration
IT 221 Information Management
SLIDESMANIA.COM
Objectives
● Define terms
● Describe importance and goals of data governance
● Describe importance and measures of data quality
● Define characteristics of quality data
● Describe reasons for poor data quality in organizations
● Describe a program for improving data quality
Objectives (cont.)
● Describe three types of data integration approaches
● Describe the purpose and role of master data management
● Describe four steps and activities of ETL for data integration for a data warehouse
● Explain various forms of data transformation for data warehouses
Data Governance
● Data governance
○ High-level organizational groups and processes overseeing data stewardship across the organization
Data Governance
● Data steward
○ A person responsible for ensuring that organizational applications properly support the organization’s data quality goals
Roles and Responsibilities
For the Data Steward:
● Resolving Data Integration Issues
● Determining Data Security
● Documenting Data Definitions, Calculations, Summarizations, etc.
● Maintaining/Updating Business Rules
● Analyzing and Improving Data Quality
● Implementing Data Security Requirements
Roles and Responsibilities
For the Data Administrators:
● Translating the Business Rules into Data Models
● Maintaining Conceptual, Logical, and Physical Data Models
● Assisting in Data Integration Resolution
● Maintaining the Metadata Repository
Roles and Responsibilities
For the Database Administrators:
● Generating the Physical DB Schema
● Performing Database Tuning
● Creating Database Backups
● Planning for Database Capacity
Requirements for Data Governance
● Sponsorship from both senior management and business units
● A data steward manager to support, train, and coordinate data stewards
● Data stewards for different business units, subjects, and/or source systems
● A governance committee to provide data management guidelines and standards
Importance of Data Quality
● If the data are bad, the business fails. Period.
○ GIGO – garbage in, garbage out
○ Sarbanes-Oxley (SOX) compliance, mandated by law, sets data and metadata quality standards
Importance of Data Quality
● Purposes of data quality
○ Minimize IT project risk
○ Make timely business decisions
○ Ensure regulatory compliance
○ Expand customer base
Characteristics of Data Quality
● Uniqueness
● Accuracy
● Consistency
● Completeness
● Timeliness
● Currency
● Conformance
● Referential integrity
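As a rough illustration, two of these characteristics can be measured directly in code. This is a minimal sketch; the sample records, field names, and scoring rules are invented for the example and are not from the slides.

```python
# Minimal sketch of checks for two data quality characteristics:
# uniqueness and completeness. Sample data is invented for illustration.
from collections import Counter

records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "ben@example.com", "age": None},  # incomplete age
    {"id": 2, "email": "ben@example.com", "age": 41},    # duplicate id
]

def uniqueness(rows, key):
    """Fraction of rows whose key value appears exactly once."""
    counts = Counter(r[key] for r in rows)
    return sum(1 for r in rows if counts[r[key]] == 1) / len(rows)

def completeness(rows, field):
    """Fraction of rows where the field is present and non-null."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

print(round(uniqueness(records, "id"), 2))     # only 1 of 3 ids is unique
print(round(completeness(records, "age"), 2))  # 2 of 3 ages are filled in
```

Scores like these give a baseline that a data quality program can track over time.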
Causes of Poor Data Quality
● External data sources
○ Lack of control over data quality
● Redundant data storage and inconsistent metadata
○ Proliferation of databases with uncontrolled redundancy and metadata
Causes of Poor Data Quality (cont.)
● Data entry
○ Poor data capture controls
● Lack of organizational commitment
○ Not recognizing poor data quality as an organizational issue
Steps in Data Quality Improvement
● Get business buy-in
● Perform data quality audit
● Establish data stewardship program
● Improve data capture processes
● Apply modern data management principles and technology
● Apply total quality management (TQM) practices
Business Buy-in
● Executive sponsorship
● Building a business case
● Proving a return on investment (ROI)
● Avoidance of cost
● Avoidance of opportunity loss
Data Quality Audit
● Statistically profile all data files
● Document the set of values for all fields
● Analyze data patterns (distribution, outliers, frequencies)
● Verify whether controls and business rules are enforced
● Use specialized data profiling tools
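The profiling step above can be sketched for a single field. This is a minimal sketch, not a substitute for a real profiling tool; the sample values and the two-standard-deviation outlier rule are invented for illustration.

```python
# Minimal sketch of statistical profiling for one field of a data file.
# Sample data and the outlier threshold are invented for illustration.
from collections import Counter
from statistics import mean, stdev

ages = [23, 25, 24, 26, 25, 24, 240]  # 240 looks like a data entry error

profile = {
    "count": len(ages),
    "distinct": len(set(ages)),
    "mean": mean(ages),
    "frequencies": Counter(ages),      # value distribution / frequencies
}

# Flag values more than 2 standard deviations from the mean as outliers.
mu, sigma = mean(ages), stdev(ages)
outliers = [v for v in ages if abs(v - mu) > 2 * sigma]
print(outliers)  # -> [240]
```

An audit would run such a profile over every field and compare the results against documented business rules.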
Data Stewardship Program
● Roles:
○ Oversight of the data stewardship program
○ Manage data subject area
○ Oversee data definitions
○ Oversee production of data
○ Oversee use of data
● Report to: business unit vs. IT organization?
Improving Data Capture Process
● Automate data entry as much as possible
● Manual data entry should be selected from preset options
● Use trained operators when possible
● Follow good user interface design principles
● Immediate data validation for entered data
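Two of these practices, preset options and immediate validation, can be sketched in a few lines. The field names, preset values, and date rule below are invented for the example.

```python
# Minimal sketch of immediate validation at data entry.
# Field rules (preset options, date pattern) are invented for illustration.
import re

VALID_STATUSES = {"single", "married", "widowed"}     # preset options
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")     # ISO-style date

def validate(record):
    """Return a list of validation errors; empty means the entry is accepted."""
    errors = []
    if record.get("civil_status") not in VALID_STATUSES:
        errors.append("civil_status must be chosen from the preset options")
    if not DATE_PATTERN.match(record.get("birth_date", "")):
        errors.append("birth_date must look like YYYY-MM-DD")
    return errors

print(validate({"civil_status": "single", "birth_date": "1999-04-01"}))  # -> []
print(validate({"civil_status": "sngle", "birth_date": "04/01/1999"}))   # two errors
```

Rejecting bad values at entry time is far cheaper than cleansing them later in the warehouse.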
Apply Modern Data Management Principles and Technology
● Software tools for analyzing and correcting data quality problems:
○ Pattern matching
○ Fuzzy logic
○ Expert systems
● Sound data modeling and database design
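A fuzzy-matching correction can be sketched with the standard library's `difflib`. This is a minimal sketch under invented data; real data quality tools use far more sophisticated matching.

```python
# Minimal sketch of fuzzy matching to correct a misspelled value against
# a reference list. The city names are invented for illustration.
import difflib

reference_cities = ["Manila", "Quezon City", "Cebu City", "Davao City"]

def correct_city(raw):
    """Return the closest reference value, or the input if nothing is close."""
    matches = difflib.get_close_matches(raw, reference_cities, n=1, cutoff=0.7)
    return matches[0] if matches else raw

print(correct_city("Quezn City"))  # -> "Quezon City"
```

The `cutoff` threshold controls how aggressive the correction is; too low a value risks "correcting" legitimate new entries.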
TQM Principles and Management
● TQM – Total Quality Management
● TQM Principles:
○ Defect prevention
○ Continuous improvement
○ Use of enterprise data standards
○ Strong foundation of measurement
TQM Principles and Management (cont.)
● Balanced focus
○ Customer
○ Product/Service
Master Data Management
● Disciplines, technologies, and methods to ensure the currency, meaning, and quality of reference data within and across various subject areas
Master Data Management
● Three main architectures
○ Identity registry – master data remains in the source systems; a registry tells applications where to find it
○ Integration hub – data changes are broadcast through a central service to subscribing databases
○ Persistent – a central “golden record” is maintained, and all applications have access to it; requires applications to push data and is prone to data duplication
Data Integration
● Data integration creates a unified view of business data
● Other possibilities:
○ Application integration
○ Business process integration
○ User interaction integration
Data Integration
● Any approach requires changed data capture (CDC)
○ Indicates which data have changed since the previous data integration activity
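One common way to implement CDC is to compare a row's last-modified timestamp against the time of the previous extract. A minimal sketch, with invented table contents and dates:

```python
# Minimal sketch of changed data capture (CDC) using a last-extract
# timestamp. The rows and dates are invented for illustration.

source_rows = [
    {"id": 1, "name": "Ana",  "updated_at": "2024-01-05"},
    {"id": 2, "name": "Ben",  "updated_at": "2024-02-10"},
    {"id": 3, "name": "Carl", "updated_at": "2024-02-12"},
]

def changed_since(rows, last_extract):
    """Return only the rows modified after the previous integration run."""
    return [r for r in rows if r["updated_at"] > last_extract]

changes = changed_since(source_rows, "2024-02-01")
print([r["id"] for r in changes])  # -> [2, 3]
```

Production systems often read database logs instead of timestamps, but the goal is the same: move only what changed.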
Techniques for Data Integration
● Consolidation (ETL)
○ Consolidates all data into a centralized database (such as a data warehouse)
● Data federation (EII)
○ Provides a virtual view of data without actually creating one centralized database
● Data propagation (EAI and EDR)
○ Duplicates data across databases, with near-real-time delay
We will talk about this first.
Reconciled Data Layer
● Typical operational data is:
○ Transient – not historical
○ Not normalized (perhaps due to denormalization for performance)
○ Restricted in scope – not comprehensive
○ Sometimes poor quality – inconsistencies and errors
Reconciled Data Layer (cont.)
● After ETL, data should be:
○ Detailed – not summarized yet
○ Historical – periodic
○ Normalized – 3rd normal form or higher
○ Comprehensive – enterprise-wide perspective
○ Timely – data should be current enough to assist decision-making
○ Quality controlled – accurate with full integrity
The ETL Process
● Capture/Extract
● Scrub or data cleansing
● Transform
● Load and Index
ETL = Extract, transform, and load. ETL is performed during the initial load of the Enterprise Data Warehouse (EDW) and during subsequent periodic updates to the EDW.
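The four steps above can be sketched as plain functions in a pipeline. This is a minimal sketch; the source rows, cleansing rules, and target layout are invented for the example.

```python
# Minimal sketch of the four ETL steps as a function pipeline.
# Source data and cleansing rules are invented for illustration.

source = [
    {"name": " ana ", "sales": "100"},
    {"name": "BEN",   "sales": "250"},
]

def extract(rows):                 # Capture/Extract: snapshot the source
    return list(rows)

def scrub(rows):                   # Scrub: trim whitespace, normalize case
    return [{"name": r["name"].strip().title(), "sales": r["sales"]}
            for r in rows]

def transform(rows):               # Transform: convert sales text to integers
    return [{"name": r["name"], "sales": int(r["sales"])} for r in rows]

def load(rows, warehouse):         # Load/Index: append to the warehouse
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(scrub(extract(source))), [])
print(warehouse)  # -> [{'name': 'Ana', 'sales': 100}, {'name': 'Ben', 'sales': 250}]
```

Each stage is kept separate so it can be tested, monitored, and rerun independently.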
Steps in Data Reconciliation
Capture/Extract … obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
● Static extract – capturing a snapshot of the source data at a point in time
● Incremental extract – capturing changes that have occurred since the last static extract
Steps in Data Reconciliation (cont.)
Scrub/Cleanse … uses pattern recognition and AI techniques to upgrade data quality
● Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
● Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
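One cleansing task from the list, detecting duplicate records after light reformatting, can be sketched briefly. The sample rows and normalization rules are invented for the example.

```python
# Minimal sketch of one scrub/cleanse task: finding duplicate records
# after normalizing the fields. Sample rows are invented for illustration.

rows = [
    {"name": "Ana Cruz ", "city": "Manila"},
    {"name": "ana cruz",  "city": "MANILA"},   # same person, different formatting
    {"name": "Ben Reyes", "city": "Cebu City"},
]

def normalize(r):
    """Reformat fields so that equivalent records compare equal."""
    return (r["name"].strip().lower(), r["city"].strip().lower())

seen, duplicates = set(), []
for r in rows:
    key = normalize(r)
    if key in seen:
        duplicates.append(r)   # log the duplicate for error detection
    seen.add(key)

print(len(duplicates))  # -> 1
```

Real cleansing tools combine this kind of normalization with the fuzzy matching mentioned earlier, since duplicates rarely differ only in case and whitespace.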
Steps in Data Reconciliation (cont.)
Transform … convert data from the format of the operational system to the format of the data warehouse
● Record-level:
○ Selection – data partitioning
○ Joining – data combining
○ Aggregation – data summarization
● Field-level:
○ Single-field – from one field to one field
○ Multi-field – from many fields to one, or one field to many
Steps in Data Reconciliation (cont.)
Load/Index … place transformed data into the warehouse and create indexes
● Refresh mode – bulk rewriting of target data at periodic intervals
● Update mode – only changes in source data are written to the data warehouse
Record-Level Transformation Functions
● Selection – the process of partitioning data according to predefined criteria
● Joining – the process of combining data from various sources into a single table or view
Record-Level Transformation Functions (cont.)
● Normalization – the process of decomposing relations with anomalies to produce smaller, well-structured relations
● Aggregation – the process of transforming data from detailed to summary level
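Selection, joining, and aggregation can each be shown in a few lines. A minimal sketch with invented sample tables:

```python
# Minimal sketch of record-level transformations: selection, joining,
# and aggregation. The sample tables are invented for illustration.
from collections import defaultdict

orders = [
    {"order_id": 1, "cust_id": 10, "amount": 50},
    {"order_id": 2, "cust_id": 10, "amount": 30},
    {"order_id": 3, "cust_id": 11, "amount": 20},
]
customers = {10: "Ana", 11: "Ben"}

# Selection: partition data according to a predefined criterion.
large_orders = [o for o in orders if o["amount"] >= 30]

# Joining: combine data from two sources into a single view.
joined = [{**o, "name": customers[o["cust_id"]]} for o in orders]

# Aggregation: transform detailed data to summary level.
totals = defaultdict(int)
for o in orders:
    totals[o["cust_id"]] += o["amount"]

print(len(large_orders), joined[0]["name"], dict(totals))
# -> 2 Ana {10: 80, 11: 20}
```

In a warehouse these would be SQL `WHERE`, `JOIN`, and `GROUP BY` clauses; the Python versions just make the mechanics visible.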
Single-field transformation
a) Basic representation
In general, some transformation function translates data from the old form to the new form.
Single-field transformation (cont.)
b) Algorithmic
An algorithmic transformation uses a formula or logical expression.
Single-field transformation (cont.)
c) Table lookup
A table lookup uses a separate table keyed by the source record code.
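The two styles can be sketched side by side. The formula and the code table below are invented examples, not from the slides.

```python
# Minimal sketch of the two single-field transformation styles.
# The formula and code table are invented for illustration.

# b) Algorithmic: a formula or logical expression converts the value.
def fahrenheit_to_celsius(f):
    return round((f - 32) * 5 / 9, 1)

# c) Table lookup: a separate table keyed by the source record code.
PRODUCT_CODES = {"A": "Appliance", "F": "Furniture", "T": "Toy"}

def decode_product(code):
    return PRODUCT_CODES.get(code, "Unknown")

print(fahrenheit_to_celsius(212))  # -> 100.0
print(decode_product("F"))         # -> Furniture
```

Lookup tables are preferred when the mapping is arbitrary (codes to names); formulas when the relationship is computable.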
Multi-field transformation
a) Many sources to one target
Multi-field transformation (cont.)
b) One source to many targets
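Both directions of multi-field transformation can be sketched briefly. The field names and formats below are invented for the example.

```python
# Minimal sketch of multi-field transformations in both directions.
# Field names and formats are invented for illustration.

# a) Many source fields to one target field: combine name parts.
def merge_name(first, last):
    return f"{last}, {first}"

# b) One source field to many target fields: split a composite value.
def split_phone(phone):            # expects a form like "02-8123-4567"
    area, prefix, line = phone.split("-")
    return {"area": area, "prefix": prefix, "line": line}

print(merge_name("Ana", "Cruz"))    # -> Cruz, Ana
print(split_phone("02-8123-4567"))  # -> {'area': '02', 'prefix': '8123', 'line': '4567'}
```

Splitting composite fields on load makes the parts individually queryable in the warehouse, which is usually why this transformation is applied.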