DW Presentation Logic
DW Presentation Logic
Overview
Data Warehous e
APIs • Extract
• Transform
• Load
Sources
BI Architecture Diagram
3
Why Warehouse?
Data Warehous e
• Why can’t we just go straight to the data source?
• Optimized for analytical reporting
• Historical data
• Multiple integrated sources
• Many years of history
• Aggregated or summarized
4
Characteristics of Data Warehouse
• Subject-oriented
• Data is categorized and stored by business subject
• Integrated
• Data on a given subject is collected from multiple sources and stored in a
single place
• Non-volatile
• Data in DW is not updated or deleted
• Time-variant
• Data is stored as a series of snapshots, each representing a period of time
5
WHY BI?
DWH and BI
• One without the other wouldn’t function.
• The data warehouse enhances the BI processes.
6
BI Tools & Technology
Languages
ETL and
Data
Integration
Database
Platform
BI OLAP
Tools
02
• A database model is a type of data models that determines the logical structure of
a database.
• A collection of tools for describing
• data
• data relationships
• data semantics
• data constraints
Database models
• Relational Data Model designs the data in the form of rows and columns within a table. Thus, a relational model
uses tables for representing data and in-between relationships. Tables are also called relations.
• An ER model is the logical representation of data as objects and relationships among them. These objects are
known as entities, and relationship is an association among these entities. A set of attributes describe the entities.
For example, student_name, student_id describes the 'student' entity.
• Object-based Data Model is an extension of the ER model with notions of functions, encapsulation, and object
identity, as well. This model supports a rich type system that includes structured and collection types.
• Semistructured Data Model: This type of data model is different from the other three data models (explained
above). The semistructured data model allows the data specifications at places where the individual data items of
the same type may have different attributes sets. The Extensible Markup Language, also known as XML, is widely
used for representing the semistructured data.
Data Models
• Other Types
• Hierarchical data model: Oldest form in tree structure. In this model, files are related in
a parent/child manner.
• Network data model: obsolete and being replaced by Relational model. In this model,
files are related as owners and members, like to the common network model.
• The Network Model in DBMS is a hierarchical model that is used to represent the
1:M/M:M relationship among the database constraints. It is represented in the form of a
graph hence it is a simple and easy-to-construct database model.
Why Us e Models ?
• Models can be useful when we want to examine or manage part of the real world
• The costs of using a model are often considerably lower than the costs of using or
experimenting with the real world itself
EXAMPLES
• airplane simulator
• nuclear power plant simulator
• flood warning system
• model of economy
• map
Network model
• Relational model can represent as a table with columns and rows. Each row is known as a
tuple. Each table of the column has a name or attribute.
• Data:
• Raw facts; building blocks of information
• Unprocessed information
• Information:
• Processed data to reveal meaning
Data vs. Information
Warehouse Terminology
• Dimension
• Attribute
• Facts
• Measures
25
Dimension
26
Dimension
• Product
• Company-Division-Group-Department-Class-Subclass-Item
• UDA, Item List, Item Differentiators, Packs
• Organization
• Company-Chain-Area-Region-District-Location
• Channel, Location Traits, Physical/Virtual Warehouse
• Supplier
• Supplier Country, Trait, Currency, Primary Supplier
• Promotion
• Promotion Head, Scheme, Scheme Type, Start and End Dates
27
Product
Product Dimension
• The Product dimension represents the product lines that the company sells.
• Generate Information:
• Item with highest profit,
• Item performance
28
Organization
Store Dimension
• Store Dimension contains store information like city, store name and phone number of the store
where the product was sold.
• Used to track store-specific metrics, such as customer traffic.
29
Time/Calendar
Time Dimension
• Keep track of the time of occurrence of different events in the Retail Organization.
• Analyze the trends over the time
30
4-5-4 Calendar
31
13 Period Calendar
32
Attributes
33
Business Vs Surrogate Key
• Record
• Business Key: It refers to the identifier given to the entity when it was created in the source system. Also known
as identifiers.
• Surrogate Key: A surrogate key is a unique value used to identify an entity in the Data warehouse. A new key is
attached to an entity whenever it is inserted into a data mart dimension table.
34
Hierarchy
Company
Company
Fiscal Year
Division
Fiscal Half Year Chain
Group
Roll Up
Drill Down
Department
Day Location
Item
35
Fact
36
?
PO of
$20000
37
Fact
38
Facts
39
Facts
• Standard Facts
• can be summed up across associate dimension.
For example, the sum of total sales for each of the days in a week gives the total sales for that week
• Positional Facts
• The data in positional fact tables reports the state of an entity at a certain point in time, rather than the
total activity of an entity, these facts cannot be simply summed over time.
For example, Stock on Hand
40
Facts
• Additive Fact
• Eg. Sales
• Semi-Additive Fact
• Eg. Inventory Position
• Non-Additive Fact
• Eg. Price
41
Measure
42
Aggregate Table
43
Data Modeling
• To achieve this, a model must be defined, and • Here, the primary keys of dimension tables and foreign
relationship needs to be mapped between the fact keys of fact tables relate to each other. This enables us to
use the field of respective tables together.
and dimension tables.
44
03
Understanding
OLAP
Data Redundancy
47
Data Normalization
48
Data Normalization
49
Data Normalization
50
OLTP
51
OLAP
52
OLAP vs OLTP
OLTP OLAP
53
How Do Data Warehous es Differ From Operational Systems ?
• Goals
• Structure
• Size
• Performance optimization
• Technologies used
54
Comparison Chart of Databas e Types
De-normalized table structure (few tables, many Normalized table structure (many tables, few
columns per table) columns per table)
Batch updates Continuous updates
56
Data Warehous es , Data Marts, and Operational Data Stores
• Data Warehouse – The queryable source of data in the enterprise. It is comprised of the
union of all of its constituent data marts.
• Data Mart – A logical subset of the complete data warehouse. Often viewed as a
restriction of the data warehouse to a single business process or to a group of related
business processes targeted toward a particular business group.
• Operational Data Store (ODS) – A point of integration for operational systems that
developed independent of each other. Since an ODS supports day to day operations, it
needs to be continually updated.
58
04
Approaches to Data
Warehous ing Models
Approaches to Data Warehous ing
Kimball Model:
Approaches to Data Warehous ing
Inmon Model:
Approaches to Data Warehous ing
64
Key Terms
• Rollup Table
• Dimension Identity/Fact Identities
• Dimension Keys
65
Flow to Data Warehouse
66
ETL Extraction Transformation Load
• To get data out of the source and load it into the data warehouse – simply a process of copying data
from one database to other
• Data is extracted from an OLTP database, transformed to match the data warehouse schema and loaded
into the data warehouse database
• Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems,
and spreadsheets; such data also requires extraction, transformation, and loading
67
Data Flow in Warehouse
Source
Source
Tables Source Temp
Tables
(Batch)
Flat files
68
Flow to Data Warehouse
• Source Dependent Extract (SDE): In this phase, SDE tasks extract data from the source system and
stage it in staging tables. SDE tasks are source specific.
• Source Independent Load (SIL): Load tasks transform and port the data from staging tables to base
fact or dimension tables. SIL tasks are source independent.
• Post Load Proces s (PLP). PLP tasks are only executed after the dimension and fact tables are populated.
A typical usage of a PLP task is to transform data from a base fact table and load it into an aggregate
table. PLP tasks are source independent
69
Delta vs Full snapshot
• Delta
Extracts rows with data changes since last extract
less number of rows change compared to the total number of rows in a table
Records for deleted rows in source tables are maintained in a separate table
• Full Snapshot
Extracts all the rows from required source tables irrespective of changes since
the last extract
higher the changes in rows compared to all rows in a table
Target tables are either truncated and loaded with latest source full snapshot
(generally for Dimension)
or rows matching business keys between source and target tables are updated
per latest source system full snapshot and those that do not exist anymore in the
source extract are considered deleted from source and necessary delete processing are
applied in such target tables
No need of separate tables for deleted rows
Time consuming if changes are higher
70
Flow to Data Warehous e
71
Terminology
• Staging
• Temporary
• Target
• Slandered/ Incremental Update and Insert
• Reclassification
72
Dimens ion Load Strategy
73
Slowly Changing Dimension
74
Minor Change
75
SCD2 Major Change
Class Shirt
● Item Linen Shirt has been moved from Sub-category Casual to Formal.
76
SCD Type 2 Major Change...
Before Change
ITEM TABLE
77
SCD Type 2 Major Changes…
ITEM TABLE
78
SCD Type 3 Major Changes…
Before Change
ITEM TABLE
79
SCD Type 3 Major Changes…
ITEM TABLE
80
Reporting Approach for Reclas s ification
• AS WAS
• History is associated with the dimension hierarchy that was effective at the time of the
transaction.
• AS IS
• History is restated according to the current dimension hierarchy as it exists after the
reclassification occurred.
81
Fact Load Strategy
82
Fact Table Types: Base and Aggregate
• A base fact table holds fact data for a given functional area at the lowest level of granularity
Example : SALES_ITM_LOC_DAY_FACT
Stores sales fact data at the item/location/day/transaction/voucher level.
• An aggregate table holds data for a given functional area which is summarized or aggregated across one
or more dimensions.
Example : SALES_SBC_LOC_DAY_AGG
Stores sales fact data at the subclass/location/day level.
83
Aggregated Tables
84
Aggregated Tables
85
Major Change impact in Fact
86
Major Change impact in Fact
87
Major Change impact in Fact
88
Major Change impact in Fact
89
Standard Fact Aggregation
• Simple aggregates that are aggregated through summation of measures across associated dimensions.
• As Was Aggregates
• Aggregates across Organization, Product or Calendar dimension using hierarchy that exists at time of
roll up.
• Item, Location and Day level -> Sub-class, Location and Day Level : subclass for the item is
determined from the relation that exist at the day of roll-up
• Corporate Aggregates
• High level aggregates for Top Level Executives.
• No Organization level information.
90
Positional Fact Aggregation
91
Snowflake Schema
Month ProductCategory
MonthKey CategoryKey
MonthNumb Calendar Date CategoryName
er DateKey Description
…. MonthKey …..
….
Brand
Branch Order Fact BrandKey
BranchKey DateKey CategoryKey
Name ProductKey BrandName
….. CustomerKey …..
StoreKey
Produc
Stor
Customer tProductKey
StoreKey
e
CustomerKey BrandKey SKU
BranchKey
Code …..
…..
• Retail Analytics base fact loading program checks dimension data integrity with
dimension tables
• Any source data that violate dimension data integrity are rejected
• The rejected records are written to the respective rejection tables.
• Rejected data will not be loaded to Retail Analytics. The data must be manually cleaned up
on source table and rerun the program again
93
06
Q&A