Zettel
DWDM Module 2
Data Warehouse
• a decision support database that is maintained separately from the organization's operational
database. It supports info-processing by providing a platform of consolidated, historical data
for analysis.
• A data warehouse is a
1. subject-oriented : organized around subjects such as customer, products, sales. Focuses on
modelling and analysis of this data and not dailt ops and transactions.
Provide a simple and concise view of these subjects by excluding unnecessary details.
2. integrated : built by combining multiple heterogeneous data sources. Data cleaning and
integration methods are applied to ensure consistency in all naming conventions, encoding,
measures, etc. Data is usually transformed into a conforming format after before integrating
to the warehouse.
3. time-variant : the time horizon for warehouses is much longer than of ops systems. they
provide historical data going back decades while the Ops systems focus on day-to-day
current value. Every warehouse item contains an element of time.
4. non-volatile : a physically separate store of data is transformed from Ops env to form a
warehouse. Operational updates of data do not occur in WHs, no transaction processing,
recovery or concurrency either. The only 2 operations are the initial data-loading and access
of data.
collection of data in support of management's decision making process.
3-Tier Warehouse architecture
Warehouse Development Approach
OLTP - Online Transaction Processing v/s OLAP - Online
Analytical Processing
Why a sperate Warehouse?
• DBMS are tuned for OLTP access methods, recovery, indexing and concurrency.
• WH are tuned for OLAP with complex queries, multidimensional views and data consolidation.
• Decision support requires historical data which is not available in Ops systems
• Decision support via analytics require consolidation of various different data sources which is
only possible in WH via various cleaning, integration and mining methods.
• Different sources typically have inconsistent data types and need to be transformed and
integrated first into a WH to be of any good use.
• Newer systems are now performing OLAP directly on Operation RDBMSs
DATA CUBE Aggregation
Aggregating data into lower levels and into multiple levels of data cubes reduce the size of data to
deal with and these small representations hence speed up query resolution and solution to tasks.
MULTIDIMENSIONAL Data
A single subject may be represented as a
function of multiple other data points.
Sales --> f ( Product, Month, Region)
Hierarchical summarization paths
Data warehouses also allow for multi-linked views of the data stored in them.
DIMENSIONAL TABLES : store subject info such as item (name, brand. type)
FACT TABLES : store measures such as "dollars_sold" and the keys to each of the related dimension
tables.
A n-D base cube is a base cuboid while the O-D cuboid with highest level of sdummarization is
called the apex cuboid. This lattice of cuboids forms a data cube.
CONCEPTUAL MODELING of Data Warehouses
1. Star Schema: A fact table is in the middle;
connected to a set of dimensiona tables.
1. Snowflake Schema : a
refinement of Star. Some
dimensional hierarchy is
normalized into a set of
smaller dimension tables;
forming a shape similar to
a snowflake.
1. Fact constellations : Multiple fact tables
share dimension tables, viewed as
collection of stars hence also called Galaxy
schema.
Concept Hierarchy in Warehouses
organizes concepts hierarchically and is associated with each dimension in a data WH.
This facilitates drilling (higher to lower concepts) and rolling (get less specific, going up the
hierarchy) in data warehouse to view data in multiple levels of granularity.
Recursively reduces low level concepts (ages - 11, 12, 13 ... 99) with higher level concepts ( youth ,
adult ,senior )
These hierarchies can be explicitly stated by domain experts or WH designers
street <- city <- state <- country
or left for automation. Numerical data is automatically set into hierarchies by Discretization
methods. Attribute with most distinct values automatically placed below and the the one with least
is placed atop. so naturally many streets = street at lowest level while countries at top level.
THREE WAREHOUSE MODELS
ENTERPRISE WAREHOUSE : collects all info about the subjects spanning the entire org.
DATA MART : a subset of the corporate-wide data that is of value to a specific group of users. Its
scope is confined to the specific group. For example the marketing department will have a different
data mart than the Engineering department.
These marts can be independent or dependant on the mainwarehouse.
VIRTUAL WAREHOUSE : A set of views over operational DBs. Only few "possible" summary views
can be materialized as this is STILL an Ops system and not a dedicated WH.
EXTRACTION, TRANSFORMATION, LOADING - ETL
1. Data Extraction : get data from multiple sources, heterogeneous in nature.
2. Data Cleaning : detect errors, incomplete points and outliers in th data and rectify.
3. Data Transformation : convert data from inconsistent legacy formats to consistent warehouse
format.
4. Load : sort, summarize, consolidate , aggregate, check integrity, build indices and partitions and
compute possible views / representations.
5. Refresh : propagate the updates from the data sources to the warehouse.
META DATA REPOSITORY
• meta data is the data that defines other data; in this case the warehouse objects.
• It stores the structure of the data WH, schema, views, dimensions. hierarchies, derived data,
marts and contents.
• Operational meta data includes "data lineage" (how the source provided info, how it was
cleaned, transformed and integrated), "currency of data" (active, archived, purged) and
"monitoring info" (usage stats, errors and audits)
• also stors algorithms used for many data miniung tasks like summarization
• mapping from Ops systems and env to data WH
• data related to performance and maintenance
• Business data - terms and definitions, ownership of data and various policies
OLAP OPERATIONS
• Roll Up (drill-up) : climb up the hierarchy; summarize; dimension reduction
• Drill Down (roll down) : climb down the hierarchy; detailed data; dimenions increase
• Pivot : reorient the cube / visualization to 2D planes.
• Slice : selecting a single dimension from a cube, effectively creating a new sub-cube. This
operation reduces the dimensionality of the data by fixing a specific value for one of the
dimensions. ITEM in year 2008 (so time fixed and rest 2 same)
• Dice : selecting specific years and product categories and excluding EVERRYTHING else. this
reduces dimensions and focus on the selected things.
• Drill across : involve more than once fact table
• Drill through : through the bottom of the cube to its actual backend relational tables.
Star-Net Query Model
enhances the traditional star schema by introducing a network of dimensions and supporting
hierarchical relationships, allowing for more complex and flexible data analysis. It is particularly
useful in data warehousing and OLAP systems, where users need to perform multidimensional
queries efficiently and intuitively.
• Central fact table with dimension tables.
• Interconnected dimensions for complex queries.
• Supports drill-down and roll-up analyses.
• Allows multidimensional analysis from various perspectives.
DATA WAREHOUSE Views
1. Top-Down View: Focuses on selecting relevant information necessary for the data warehouse.
2. Data Source View: Exposes information captured, stored, and managed by operational systems.
3. Data Warehouse View: Comprises fact tables and dimension tables.
4. Business Query View: Represents data perspectives from the end-user's viewpoint.
DATA WAREHOUSE Design Process
• Top-Down: Comprehensive design and planning for a mature data warehouse aligned with
business goals.
• Bottom-Up: Rapid development through experiments and prototypes, focusing on immediate
needs.
• Combination: Merges both approaches for flexibility and strategic alignment.
Software Engineering Perspectives:
• Waterfall Model: Structured, step-by-step analysis; each phase completed before the next.
• Spiral Model: Iterative development with quick turnaround; continuous refinement based on
feedback.
Typical Design Process:
1. Choose a Business Process: Identify key processes (e.g., orders, invoices).
2. Choose the Grain: Define the atomic level of data (e.g., individual transactions).
3. Choose the Dimensions: Identify relevant dimensions (e.g., Time, Product, Customer).
4. Choose the Measures: Define measures for fact tables (e.g., sales amount, quantity sold).
DATA WAREHOUSE Usage / Application
1. Information Processing:
• Supports querying, basic statistical analysis, and reporting.
• Utilizes crosstabs, tables, charts, and graphs.
2. Analytical Processing:
• Enables multidimensional analysis of data.
• Supports OLAP operations like slice, dice, drilling, and pivoting.
3. Data Mining:
• Focuses on knowledge discovery from hidden patterns.
• Supports associations, analytical model construction, classification, prediction, and
visualization of results.
From OLAP to OLAM - On Line Analytical Mining
• high quality of data in data warehouses
• Open database connectivity, web accessing, reporting and OLAP tools
• Online selection of data mining functions
Efficient Cube Computation
How many cuboids in an n-dimensional cube with L levels?
Materialization of data cubes can be full, partial or none. It depends on the data's size, sharing and
access freqeuency.
Compute Cubes
Text Text
define cube sales [item, city, year]: SELECT item, city, year, SUM (amount)
sum (sales_in_dollars) FROM SALES
compute cube sales CUBE BY item, city, year
Indexing OLAP Data
• index on a a particular column.
• each value has a bit vector; bit op is fast
• length of bit increases with number of
records so not suitable for high cardinality
data
• Word-Aligned Hybrid (WAH) makes it work
for such cases too now...
• In data warehouses the join index relates
the values of dimensions of a star schema
to a fact table. This speeds up relational
join operations. ex: Sales fact table for 2
dimensions of City and Product.
OLAP Query Processing
Determine what operations (drill, roll, slice, dice...) to perform on the available cuboid views.
Convert these operations into corresponding SQL/OLAP operations.
ROLAP - Relational OLAP
• Use relational or extended-relational DBMS to store and manage WH data and middle ware.
• Include optimization of DBBS backend and implementation of aggregation and navigation logic
along with additional tools and services
• greater scalability
MOLAP - Multidimensional OLAP
• sparse arra-based multidimensional summarized data
• fast-indexing to pre-computed data
HOLAP - Hybrid OLAP
• flexibility; ex: low level : relational, high-level: array.
Specialized SQL Servers
• specialized support for SQL queries over star/snowflake schemas
Attribute-oriented induction of Data in Warehouse
• collect task-relevant data using a relational database query
• perform generalization by
1. Attribute removal : CASE1 - there's a large set of distinct values or there is no concept of
hierarchy for the attribute. CASE2 - its higher level concepts are described in terms of other
attributes; then the attribute should be removed. ex: Customer ID, Order ID, and other similar
key values are all unique.
2. Attribute generalization : if there exists a set of generalization for distinct values of an
attribute then it should be selected and applied. ex: age in (youth, adult, senior) , address in
(state, country) , salaries sizes, etc.
• Apply aggregation by merging identical, genberalized tuples and accumulate their counts.
• Interact with users for knowledge visualization
Suppose that a user wants to describe the general characteristics of graduate students in the Big
University database. Given the attributes name, gender, major, birth place, birth date, residence,
phone#, and gpa.
Text Text
DMQL QUERY TRANSFORMED QUERY
use Big University DB use Big University DB
mine characteristics as “Science select name, gender, major, birth
Students” place, birth date, residence,
in relevance to name, gender, major, phone#, gpa
birth place, birth date, residence, from student where status in
phone#, gpa {“M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.”}
from student
where status in “graduate”