2/21/2013
Topics Already Covered
Data Warehouse Architecture and its components
SS G515 - Data Warehousing: Introduction
Extraction,Transformation and Loading (ETL) Data Marts Approached to Design Data Warehouses
Inmons Kimballs
Dr. Yashvardhan Sharma Assistant Professor, CS & IS Dept. BITS-Pilani
A General Architecture for Data Warehousing
A General Architecture for Data Warehousing
The major components of data warehouse architecture are:
Source systems are where the data comes from. Extraction, transformation, and load (ETL) move data between
different data stores.
The central repository is the main store for the data warehouse. The metadata repository describes what is available and where. Data marts provide fast, specialised access for end users and
applications.
Operational feedback integrates decision support back into the
operational systems. place
End-users are the reason for developing the warehouse in the first
MOLAP: Multi-Dimensional On-Line Analytical Processing ROLAP: Relational On-Line Analytical Processing
3 4
Loading the Data Warehouse
Data is periodically extracted Data is cleansed and transformed
Data Warehousing Architecture
Monitoring & Administration
Metadata Repository
OLAP servers
Analysis Query/ Reporting
Serve
External Sources Operational dbs
Extract Transform Load Refresh
Users query the data warehouse Source Systems (OLTP) Data Staging Area Data Warehouse
6
Data Mining
Data Marts
2/21/2013
Data Warehousing Architecture
Data Warehouse Architecture
Data Warehouse COMPONENTS
Data Warehouse COMPONENTS
Source Data Component
Production Data. Internal Data. Archived Data. External Data.
Data Staging Component
Data Extraction Data Transformation. Data Loading.
10
Data Loading
Data Storage Component
Many of the data warehouses also employ multidimensional
database management systems. Data extracted from the data warehouse storage is aggregated in many ways and the summary data is kept in the multidimensional databases (MDDBs). Such multidimensional database systems are usually proprietary products.
11
12
2/21/2013
Information Delivery Component
Metadata Component
Metadata in a data warehouse is similar to a data dictionary,
but much more than a data dictionary.
Types of Metadata
Operational Metadata Extraction and Transformation Metadata End-User Metadata
More Details in Chapter 9.
13
14
Why Meta Data: Special Significance
First, it acts as the glue that connects all parts of the data
The architecture
Operational data source1
warehouse. Next, it provides information about the contents and structures to the developers. Finally, it opens the door to the end-users and makes the contents recognizable in their own terms.
Meta-data Operational data source 2 Lightly summarized data
High summarized data
Query Load Manager
Reporting, query, application development, and EIS(executive information system) Manage tools
Operational data source n
Detailed data
DBMS
OLAP(online analytical processing) tools
Operational data store (ods) Warehouse Manager
Operational data store (ODS)
Data mining
Archive/backup data
End-user access tools
15
Typical architecture of a data warehouse
The main components
Operational data sourcesfor the DW is supplied from mainframe
operational data held in first generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstaions and private serves and external systems such as the Internet, commercially available DB, or DB assoicated with and organizations suppliers or customers
The main components
load manageralso called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse warehouse managerperforms all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing-up data
Operational datastore(ODS)is a repository of current and
integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse
2/21/2013
The main components
query manageralso called backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries detailed, lightly and lightly summarized data,archive/backup data meta-data end-user access toolscan be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools
Data flows
Inflow- The processes associated with the extraction, cleansing, and
loading of the data from the source systems into the data warehouse.
upflow- The process associated with adding value to the data in the
warehouse through summarizing, packaging , packaging, and distribution of the data
downflow- The processes associated with archiving and backing-up
of data in the warehouse
outflow- The process associated with making the data availabe to the
end-users
Meta-flow- The processes associated with the management of the
meta-data
Tools and Technologies
Operational data source1
Warehouse Manager Meta-flow
Meta-data High summarized data
Reporting, query,application development, and EIS (executive information system) tools
The critical steps in the construction of a data
warehouse:
Inflow Load Manager
Operational data source n Detailed data Lightly summarized data
Outflow OLAP (online analytical processing) tools
Upflow
DBMS
Query Manage
a. Extraction b. Cleansing c. Transformation
after the critical steps, loading the results into target
Warehouse Manager
Operational data store (ods)
Data mining tools Downflow Archive/backup data End-user access tools
system can be carried out either by separate products, or by a single, categories:
code generators database data replication tools dynamic transformation engines
Information flows of a data warehouse
Populating & Refreshing the Warehouse
Data Extraction Data Cleaning Data Transformation
ETL Process : Issues & Challenges
Consumes 70-80% of project time Heterogeneous Source Systems Little or no control over source systems Source systems scattered Source systems operating in different time zones Different currencies Different measurement units Data not captured by OLTP systems Ensuring data quality
Convert from legacy/host format to warehouse format Sort, summarize, consolidate, compute views, check integrity, build indexes, partition Bring new data from source systems
Load
Refresh
2/21/2013
Data Staging Area
A storage area where extracted data is Cleaned Transformed Deduplicated Initial storage for data Need not be based on Relational model Spread over a number of machines Mainly sorting and Sequential processing COBOL or C code running against flat files Does not provide data access to users Analogy kitchen of a restaurant
Presentation Servers
A target physical machine on which DW data is organized for Direct querying by end users using OLAP Report writers Data Visualization tools Data mining tools Data stored in Dimensional framework Analogy Sitting area of a restaurant
Data Cleaning
Why?
Soundex Algorithms
Misspelled terms For example NAMES Phonetic algorithms can find similar sounding names Based on the six phonetic classifications of human speech sounds
Data warehouse contains data that is analyzed for business decisions More data and multiple sources could mean more errors in the data and harder to trace such errors Results in incorrect analysis
Detecting data anomalies and rectifying them early has huge payoffs Long Term Solution
Change business practices and data entry tools Repository for meta-data
Data Warehouse Design
OLTP Systems are Data Capture Systems DATA IN systems DW are DATA OUT systems
Analyzing the DATA
Active Analysis User Queries User-guided data analysis Show me how X varies with Y OLAP Automated Analysis Data Mining Whats in there? Set the computer FREE on your data Supervised Learning (classification) Unsupervised Learning (clustering)
OLTP
DW
2/21/2013
OLAP Queries
How much of product P1 was sold in 2009 state wise? Top 5 selling products in 2010 Total Sales in Q1 of FY 2008-09? Color wise sales figure of cars from 2008 to 2010 Model wise sales of cars for the month of Jan from 2006 to
Data Mining Investigations
Which type of customers are more likely to spend most
with us in the coming year?
What additional products are most likely to be sold to
customers who buy sportswear? year?
In which area should we open a new store in the next What are the characteristics of customers most likely to
2010
default on their loans before the year is out?
Continuum of Analysis
Specialized Algorithms
Data Marts
What is a data mart? Advantages and disadvantages of data marts Issues with the development and management of data marts
SQL
OLTP
Primitive & Canned Analysis
OLAP
Complex Ad-hoc Analysis
Data Mining
Automated Analysis
34
21-Feb-13
Data Marts
A subset of a data warehouse that supports the requirements
Data Marts
Data Mart: A scaled-down version of the data warehouse A data mart is a small warehouse designed for the
of a particular department or business process Data Mart is a subset of corporate-wide data warehouse that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart. Characteristics include:
Does not always contain detailed data unlike data warehouses More easily understood and navigated Can be dependent or independent
department level.
It is often a way to gain entry and provide an opportunity to
learn
Major problem: if they differ from department to
department, they can be difficult to integrate enterprisewide
35
21-Feb-13
36
2/21/2013
Reasons for Creating Data Marts
Proof of Concept for the DW Can be developed quickly and less resource intensive than DW To give users access to data they need to analyze most often To improve query response time due to reduction in the volume
Kimball vs Inmon
Bill Inmon's paradigm: Data warehouse is one part of
the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
of data to be accessed
Ralph Kimball's paradigm: Data warehouse is the
conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
37
21-Feb-13
38
21-Feb-13
Kimball vs Inmon
Bill Inmon: Endorses a Top-Down design
Kimball vs Inmon: War of Words
"...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997. "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.
Independent data marts cannot comprise an effective EDW. Organizations must focus on building EDW
Ralph Kimball: Endorses a Bottom-Up design
EDW effectively grows up around many of the several independent data marts such as for sales, inventory, or marketing
39
21-Feb-13
40
21-Feb-13
Kimball vs. Inmon
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.
Data Warehousing Process
Enterprise-wide warehouse, top down, the Inmon
methodology
Data mart, bottom up, the Kimball methodology When properly executed, both result in an enterprise-wide
data warehouse
41
21-Feb-13
42
2/21/2013
Data warehouse versus data mart.
Building a Data Warehouse
Questions to be asked:
Top-down or bottom-up approach? Enterprise-wide or departmental? Which firstdata warehouse or data mart? Build pilot or go with a full-fledged implementation? Dependent or independent data marts?
43
44