Data warehouse fundamentals
Data Warehouse Overview
Definition:
A data warehouse is a centralized system that aggregates data from
multiple sources into a single, consistent data store to support data
analytics.
Key Features:
Data Aggregation: Combines data from various sources, including
transactional systems and operational databases.
ETL Process: Involves Extracting, Transforming, and Loading data to
prepare it for analysis.
Support for Analytics: Facilitates data mining, artificial intelligence (AI),
and machine learning (ML) applications.
Use Cases:
Business Intelligence (BI): Enables organizations to perform online
analytical processing (OLAP) for fast, flexible data analysis.
Industry Applications: Used across various sectors such as e-commerce,
healthcare, finance, and government for reporting and decision-making.
Benefits:
Centralized Data: Provides a single source of truth, improving data
quality and access.
Performance Improvement: Separates database operations from
analytics, enhancing data access speed.
Competitive Advantage: Supports advanced analytics capabilities,
leading to smarter business decisions.
Trends:
Cloud Data Warehouses (CDWs): Gaining popularity due to scalability
and cost-effectiveness, allowing organizations to access data warehousing
services without the need for hardware.
The ETL process in data warehousing stands for Extract, Transform, Load. It
is a critical procedure for preparing data for analysis. Here’s a brief breakdown of
each component:
ETL Process Steps
1. Extract:
o Definition: The process of retrieving data from various source
systems, which can include databases, flat files, APIs, and more.
o Purpose: To gather all relevant data needed for analysis from
disparate sources.
2. Transform:
o Definition: The process of cleaning, enriching, and converting the
extracted data into a suitable format for analysis.
o Key Activities:
Data cleaning (removing duplicates, correcting errors)
Data integration (combining data from different sources)
Data transformation (changing data types, aggregating data)
o Purpose: To ensure data quality and consistency, making it ready
for loading into the data warehouse.
3. Load:
o Definition: The process of loading the transformed data into the
data warehouse.
o Types of Loading:
Full Load: All data is loaded into the warehouse.
Incremental Load: Only new or updated data is loaded.
o Purpose: To make the data available for querying and analysis in
the data warehouse.
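As a simple illustration, parts of the transform and load steps can be expressed directly in SQL. The sketch below is hypothetical, assuming a staging table named staging_orders and a warehouse table named dw_orders:
INSERT INTO dw_orders (order_id, customer_id, order_date, amount)
SELECT DISTINCT                           -- data cleaning: remove duplicate rows
       order_id,
       customer_id,
       CAST(order_date AS DATE),          -- data transformation: standardize the date type
       CAST(amount AS DECIMAL(10,2))      -- data transformation: standardize the numeric type
FROM staging_orders
WHERE order_id IS NOT NULL;               -- drop incomplete records before loading
Run as part of an incremental load, the SELECT would also be restricted to rows added or changed since the previous load.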
Importance of ETL
Data Quality: Ensures that the data is accurate and reliable for decision-
making.
Efficiency: Streamlines the process of data integration and preparation,
allowing for faster analytics.
Scalability: Supports the growing volume of data from various sources.
Popular Data Warehouse Systems
Types of Data Warehouse Systems
1. Appliance Data Warehouse Systems:
o Oracle Exadata:
Can be deployed on-premises or via Oracle Public Cloud.
Supports various workloads: OLTP, data warehouse analytics,
in-memory analytics, and mixed workloads.
o IBM Netezza:
Deployable on IBM Cloud, AWS, Microsoft Azure, and private
clouds.
Known for enabling data science and machine learning.
2. Cloud-Based Data Warehouse Systems:
o Amazon Redshift:
Utilizes AWS-specific hardware and software.
Features accelerated data compression, encryption, and
machine learning capabilities.
o Snowflake:
Offers a multi-cloud analytics solution.
Complies with GDPR and CCPA, with always-on encryption.
o Google BigQuery:
Described as a flexible, multi-cloud data warehouse.
Claims 99.99% uptime and sub-second query response times.
3. Hybrid Data Warehouse Systems (On-Premises and Cloud):
o Microsoft Azure Synapse Analytics:
Provides code-free visual ETL/ELT processes.
Supports data lake and data warehouse use cases.
o Teradata Vantage:
Unifies data lakes, data warehouses, and analytics.
Combines open-source and commercial technologies for
performance.
o IBM Db2 Warehouse:
Known for scalability and parallel processing capabilities.
Offers a containerized scale-out solution.
o Vertica:
Supports multi-cloud environments and reports fast data
transfer rates.
o Oracle Autonomous Data Warehouse:
Operates in Oracle Public Cloud and on-premises.
Features automated data management and security.
Key Takeaways
Data warehouse systems can be categorized into appliances, cloud-based,
or hybrid systems.
Popular vendors include Oracle, IBM, Microsoft, Google, Snowflake,
and Amazon.
Each system has unique features tailored to different business needs, such
as scalability, performance, and security.
Here are some advantages of using hybrid data warehouse systems:
Advantages
1. Flexibility:
o Organizations can choose between on-premises and cloud resources
based on their needs, allowing for tailored solutions.
2. Scalability:
o Hybrid systems can easily scale resources up or down,
accommodating varying workloads and data volumes without
significant infrastructure changes.
3. Cost Efficiency:
o Businesses can optimize costs by utilizing cloud resources for less
critical workloads while keeping sensitive data on-premises.
4. Performance:
o Hybrid systems can leverage the strengths of both environments,
ensuring high performance for analytics and reporting tasks.
5. Data Security:
o Sensitive data can be stored on-premises, while less sensitive data
can be processed in the cloud, enhancing overall security.
6. Disaster Recovery:
o Hybrid systems can provide robust disaster recovery options by
backing up data across both on-premises and cloud environments.
7. Integration:
o They facilitate the integration of various data sources, enabling
organizations to combine data from different environments
seamlessly.
8. Compliance:
o Organizations can meet regulatory requirements by keeping certain
data on-premises while utilizing the cloud for other data processing
needs.
Selecting a Data Warehouse System
Key Criteria for Evaluating Data Warehouse Systems
1. Features and Capabilities:
o Location: Data warehouses can be on-premises, on appliances, or
in the cloud. Organizations must balance data ingestion, storage,
and access needs.
o Architecture: Consider if the organization is ready for vendor-
specific architecture or needs multi-cloud installations.
o Data Types: Ensure the system supports the required data types,
including structured, semi-structured, and unstructured data.
2. Compatibility and Implementation:
o Evaluate how easily the system can be integrated with existing
infrastructure.
o Consider data governance, migration, and transformation
capabilities.
3. Ease of Use and Required Skills:
o Assess if the staff has the necessary skills for implementation and
management.
o Determine the complexity of the deployment and the need for
external expertise.
4. Support Considerations:
o Look for a single vendor for accountability and support.
o Verify service level agreements (SLAs) for uptime, security, and
scalability.
o Check for available support channels (phone, email, chat) and self-
service options.
5. Cost Considerations:
o Calculate the Total Cost of Ownership (TCO), which includes:
Infrastructure costs (compute and storage)
Software licensing or subscription costs
Data migration and integration costs
Administration and personnel costs
Recurring support and maintenance costs
Summary of Decision Factors
Organizations must balance security and data privacy with the need for
speed and insights.
The choice between on-premises and cloud solutions often hinges on
data security requirements.
A thorough analysis of costs and support is essential for long-term
success.
The most critical factor in selecting a data warehouse system often depends on
the specific needs of the organization, but data security and privacy
requirements frequently take precedence.
Key Considerations:
Security Needs: Organizations that handle sensitive data may require
on-premises solutions to ensure compliance with regulations like GDPR or
CCPA.
Data Privacy: Multi-location businesses must consider geo-specific data
storage to meet legal obligations.
Balancing Act: While security is paramount, organizations also need to
consider performance, scalability, and cost-effectiveness to ensure they
can derive valuable insights from their data.
Data Marts Overview
What is a Data Mart?
Definition: A data mart is an isolated part of a larger enterprise data
warehouse, specifically built to serve a particular business function,
purpose, or community of users.
Examples:
o Sales and finance departments may have dedicated data marts for
quarterly reports.
o Marketing teams may analyze customer behavior data using their
own data marts.
Purpose of Data Marts
Support Tactical Decisions: Data marts provide specific support for
making tactical decisions by focusing on the most relevant data.
Efficiency: They save end users time by providing quick access to
necessary data without searching through the entire data warehouse.
Structure of Data Marts
Database Type: Typically a relational database.
Schema: Often uses a star or snowflake schema:
o Fact Table: Contains business metrics relevant to a business
process.
o Dimension Tables: Provide context for the facts.
Comparison with Other Data Repositories
Data Marts vs. Transactional Databases:
o Data marts are optimized for read-intensive queries (OLAP).
o Transactional databases are optimized for write-intensive queries
(OLTP).
Data Marts vs. Data Warehouses:
o Data marts have a smaller, tactical scope compared to the broader
strategic requirements of data warehouses.
o Data marts are leaner and faster than data warehouses.
Types of Data Marts
1. Dependent Data Marts:
o Draw data from the enterprise data warehouse.
o Inherit security and have simpler data pipelines.
2. Independent Data Marts:
o Created directly from sources, bypassing the data warehouse.
o Require custom ETL processes and may need separate security
measures.
3. Hybrid Data Marts:
o Combine data from both the enterprise data warehouse and
operational systems.
Key Functions of Data Marts
Provide relevant data to end-users when needed.
Accelerate business processes with efficient query response times.
Offer a cost-effective method for data-driven decision-making.
Ensure secure access and control over data.
The main purpose of a data mart is to provide specific support for making
tactical decisions within a business. Here are the key points regarding its
purpose:
Focused Data Access: Data marts are designed to deliver relevant data
quickly to end-users, allowing them to make informed decisions without
sifting through large volumes of data.
Efficiency: By concentrating on the most pertinent data for a particular
business function or department, data marts save time and effort for
users.
Support for Business Functions: They cater to specific business areas,
such as sales, marketing, or finance, providing tailored insights that help
in operational decision-making.
Enhanced Query Performance: Data marts are optimized for read-
intensive queries, ensuring fast response times for users seeking insights.
The structure of a data mart typically involves the following key components:
1. Database Type
Relational Database: Data marts are usually built on relational database
management systems (RDBMS) that support structured data storage and
retrieval.
2. Schema
Star Schema:
o Consists of a central fact table surrounded by dimension tables.
o The fact table contains quantitative data (metrics) relevant to
business processes (e.g., sales amounts).
o Dimension tables provide context (e.g., time, product, customer) for
the facts.
Snowflake Schema:
o A more normalized version of the star schema.
o Dimension tables are further broken down into related tables,
reducing data redundancy.
3. Fact Table
Definition: Contains the measurable, quantitative data for analysis.
Examples of Metrics: Sales revenue, quantities sold, profit margins.
4. Dimension Tables
Definition: Provide descriptive attributes related to the facts.
Examples:
o Time Dimension: Year, quarter, month, day.
o Product Dimension: Product ID, name, category.
o Customer Dimension: Customer ID, name, location.
5. Data Pipeline
ETL Process: Data marts typically involve an Extract, Transform, Load
(ETL) process to gather, clean, and load data from various sources into the
data mart.
Here’s a comparison of data marts, transactional databases, and data
warehouses based on their key characteristics:
1. Data Mart
Purpose: Supports specific business functions or departments with
focused data access.
Data Type: Contains validated, transformed, and cleaned data.
Structure: Typically uses star or snowflake schema.
Optimization: Optimized for read-intensive queries (OLAP).
Data Sources: Draws data from transactional databases or data
warehouses.
Historical Data: Accumulates historical data for trend analysis.
Performance: Lean and fast, designed for quick query responses.
2. Transactional Database
Purpose: Manages day-to-day operations and transactions of an
organization.
Data Type: Contains raw, uncleaned data.
Structure: Generally uses a normalized structure to reduce redundancy.
Optimization: Optimized for write-intensive queries (OLTP).
Data Sources: Serves as the source for operational applications (e.g.,
point-of-sale systems).
Historical Data: May not store older data consistently.
Performance: Focused on transaction processing speed and data
integrity.
3. Data Warehouse
Purpose: Centralizes and organizes data from multiple sources for
comprehensive analysis and reporting.
Data Type: Contains cleaned and validated data from various sources.
Structure: Often uses star or snowflake schema, similar to data marts.
Optimization: Optimized for read-intensive queries (OLAP).
Data Sources: Integrates data from multiple transactional databases and
other sources.
Historical Data: Stores large volumes of historical data for extensive
analysis.
Performance: Can be larger and slower compared to data marts due to
the volume of data.
Summary
Data Marts: Focused, tactical data access for specific departments.
Transactional Databases: Operational systems for daily transactions.
Data Warehouses: Comprehensive repositories for integrated data
analysis.
Data Lakes Overview
Definition
Data Lake: A storage repository that can hold large amounts of
structured, semi-structured, and unstructured data in its native format.
Data is classified and tagged with metadata.
Key Characteristics
Raw Data Storage: Data lakes store data in its original form without
needing to define structure or schema beforehand.
Flexibility: Users do not need to know all potential use cases for the data
at the time of storage.
Benefits of Data Lakes
Diverse Data Types: Can store:
o Unstructured data (e.g., documents, emails)
o Semi-structured data (e.g., JSON, XML)
o Structured data (e.g., from relational databases)
Scalability: Capable of handling storage from terabytes to petabytes.
Time Efficiency: Saves time by eliminating the need to define structures
and schemas before data loading.
Flexible Data Reuse: Enables fast and flexible access to data for various
current and future use cases.
Use Cases
Staging Area: Often used as a staging area for transforming data before
loading it into a data warehouse or data mart.
Machine Learning and Analytics: Supports advanced analytics and
machine learning development.
Comparison with Data Warehouses
Data Structure:
o Data Lake: Stores raw and unstructured data.
o Data Warehouse: Stores processed data that conforms to specific
standards.
Schema Definition:
o Data Lake: No need for schema definition before data loading.
o Data Warehouse: Requires strict schema design prior to data
loading.
Data Quality:
o Data Lake: Data may not be curated and can lack governance.
o Data Warehouse: Data is curated and adheres to governance
standards.
Typical Users
Data Lakes: Primarily used by data scientists, data developers, and
machine learning engineers.
Data Warehouses: Mainly utilized by business analysts and data
analysts.
Technologies and Vendors
Common technologies and platforms for data lakes include:
o Amazon S3
o Apache Hadoop
o Platforms and services from IBM, Microsoft, Google, Oracle, and others.
Data Lakehouses
The concept of Data Lakehouses combines the best features of both data
lakes and data warehouses. Here’s a breakdown of the key points:
What is a Data Lakehouse?
Definition: A data lakehouse is a unified platform that allows for the
storage of both structured and unstructured data, providing the flexibility
of a data lake with the performance and management features of a data
warehouse.
Key Features
Cost-Effectiveness: Data lakehouses are designed to store large volumes
of data at a lower cost compared to traditional data warehouses.
Flexibility: They can handle various data types, including structured,
semi-structured, and unstructured data.
Performance: Optimized for high-performance analytics and machine
learning workloads, allowing for quick data retrieval and processing.
Benefits
Unified Architecture: Combines the capabilities of data lakes and data
warehouses, reducing the need for separate systems.
Data Management: Built-in data governance and management features
help maintain data quality and integrity.
Support for Modern Workloads: Ideal for supporting AI and machine
learning applications, enabling organizations to leverage their data for
advanced analytics.
Challenges
Complexity: Implementing a data lakehouse can be complex, requiring
careful planning and architecture.
Data Governance: Ensuring data quality and compliance can be
challenging, especially with diverse data sources.
Use Cases
Business Intelligence: Organizations can use data lakehouses to create
dashboards and reports by integrating data from various sources.
Machine Learning: Data scientists can access and analyze large datasets
for training machine learning models.
Conclusion
Data lakehouses represent a modern approach to data architecture, providing a
flexible, cost-effective solution for organizations looking to leverage their data for
insights and analytics.
Data Warehouse Architectures
General Data Warehouse Architecture
Components:
o Data Sources: Flat files, databases, operational systems.
o ETL Layer: Extract, Transform, Load processes for data integration.
o Staging and Sandbox Areas: Temporary storage for data
processing and workflow development.
o Data Warehouse Repository: Centralized storage for integrated
data.
o Data Marts: Subsets of data warehouses, often used for specific
business areas (hub and spoke architecture).
o Analytics Layer: Tools for data analysis and business intelligence.
Security: Ensures data protection during transfer and storage.
Reference Architectures
Vendor-Specific Architectures: Tailored solutions from vendors that
ensure interoperability among components.
Example: IBM's reference architecture includes:
o Data Acquisition Layer: Collects raw data from various sources.
o Data Integration Layer: Staging area for ETL processes.
o Data Repository Layer: Stores integrated data, typically in a
relational model.
o Analytics Layer: Often uses cube formats for easier analysis.
o Presentation Layer: Applications for user access to data.
Key Takeaways
Data warehouse architecture is adaptable based on analytics
requirements.
Proprietary architectures are tested for compatibility within vendor
ecosystems.
Understanding these architectures is crucial for designing effective data
warehousing solutions.
Cubes, Rollups, and Materialized Views and Tables
Data Cubes
Definition: A data cube represents multi-dimensional data, typically
derived from a star or snowflake schema.
Dimensions: Coordinates in the cube are defined by dimensions (e.g.,
Product categories, State, Year).
Fact: The cells contain a fact of interest (e.g., total sales in thousands of
dollars).
Operations on Data Cubes
1. Slicing: Selecting a single member from a dimension, reducing the cube's
dimensions.
o Example: Analyzing sales for the year 2018.
2. Dicing: Selecting a subset of values from a dimension.
o Example: Focusing on specific product types like "Gloves" and "T-
shirts".
3. Drilling Down: Exploring hierarchical dimensions for more detail.
o Example: Viewing specific product groups under "T-shirts".
4. Drilling Up: Reversing the drill-down process to return to a higher-level
view.
5. Pivoting: Rotating the cube to change the perspective of analysis without
altering the data.
o Example: Switching the year and product dimensions.
6. Rolling Up: Summarizing data along a dimension using aggregations
(e.g., COUNT, SUM).
o Example: Calculating the average selling price of T-shirts across
states.
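In SQL, rolling up maps onto GROUP BY with an aggregate function, optionally using the ROLLUP modifier to add totals. A minimal sketch, assuming a hypothetical sales table with product_type, state, and selling_price columns:
SELECT state,
       AVG(selling_price) AS avg_selling_price   -- aggregate along the state dimension
FROM sales
WHERE product_type = 'T-shirt'
GROUP BY ROLLUP (state);                         -- ROLLUP also returns an overall average row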
Materialized Views
Definition: A materialized view is a local, read-only snapshot of the
results of a query.
Uses:
o Replicating data for staging databases in ETL processes.
o Precomputing and caching expensive queries for analytics.
Refresh Options:
o Never: Populated only when created.
o Upon Request: Manually refreshed.
o Scheduled: Automatically refreshed at set intervals.
o Immediately: Automatically refreshed after every statement.
Example SQL for Materialized Views
Oracle:
CREATE MATERIALIZED VIEW My_Mat_View
REFRESH FAST            -- fast (incremental) refresh requires a materialized view log on the base table
START WITH SYSDATE
NEXT SYSDATE + 1        -- refresh automatically once per day
AS SELECT * FROM my_table_name;
PostgreSQL:
CREATE MATERIALIZED VIEW My_Mat_View
TABLESPACE tablespace_name
AS SELECT * FROM table_name;
-- PostgreSQL populates the view at creation time; refresh it on demand with:
-- REFRESH MATERIALIZED VIEW My_Mat_View;
Summary
Data cubes facilitate multi-dimensional analysis with various operations.
Materialized views enhance performance by storing query results for
efficient access.
In Db2, materialized views are referred to as Materialized Query Tables
(MQTs). Here are the key points regarding MQTs in Db2:
Materialized Query Tables (MQTs)
Definition: MQTs are precomputed tables that store the results of a query.
They provide a way to improve query performance by storing aggregated
or joined data.
Key Features
Automatic Refresh: MQTs can be set to refresh automatically based on
the underlying data changes.
Immediate Refresh: This option allows the MQT to be updated
immediately after the underlying data is modified.
Deferred Refresh: Data is not inserted into the MQT until a refresh
command is executed.
Creating an MQT
Here’s an example of how to create an MQT in Db2:
CREATE TABLE emp_mqt AS (
  SELECT e.employee_id, e.name, d.department_name
  FROM Employee e
  JOIN Department d ON e.department_id = d.department_id
)
DATA INITIALLY DEFERRED   -- the table stays empty until it is refreshed
REFRESH DEFERRED;         -- REFRESH IMMEDIATE is available only for queries that meet Db2's restrictions
Refreshing an MQT
To refresh an MQT, you can use the following command:
REFRESH TABLE emp_mqt;
Benefits of Using MQTs
Performance Improvement: MQTs can significantly reduce query
execution time by storing precomputed results.
Simplified Queries: They allow complex queries to be simplified, as the
heavy lifting is done during the MQT creation.
Use Cases
Reporting: MQTs are often used in reporting scenarios where data is
aggregated and needs to be accessed quickly.
Data Warehousing: They are useful in data warehousing environments
for summarizing large datasets.
Grouping Sets in SQL
The GROUPING SETS clause is used in conjunction with the GROUP BY clause to allow you
to easily summarize data by aggregating a fact over as many dimensions as you like.
SQL GROUP BY clause
Recall that the SQL GROUP BY clause allows you to summarize an aggregation such as
SUM or AVG over the distinct members, or groups, of a categorical variable or dimension.
You can extend the functionality of the GROUP BY clause using SQL clauses such as CUBE
and ROLLUP to select multiple dimensions and create multi-dimensional summaries.
These two clauses also generate grand totals, like a report you might see in a spreadsheet
application or an accounting style sheet. Just like CUBE and ROLLUP, the SQL GROUPING
SETS clause allows you to aggregate data over multiple dimensions but does not generate
grand totals.
Examples
Let’s start with an example of a regular GROUP BY aggregation and then compare the
result to that of using the GROUPING SETS clause. We’ll use data from a fictional
company called Shiny Auto Sales. The schema for the company’s warehouse is displayed
in the entity-relationship diagram in Figure 1.
Fig. 1. Entity-relationship diagram for a “sales” star schema
based on the fictional “Shiny Auto Sales” company.
We’ll work with a convenient materialized view of a completely denormalized fact
table from the sales star schema, called DNsales. DNsales was created by joining all the
dimension tables to the central fact table and selecting only a handful of columns of
interest. Each record in DNsales contains the details of an individual sales transaction.
Example 1
Consider a query that invokes GROUP BY on the auto class dimension to summarize total
sales of new autos by auto class.
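A minimal sketch of such a query, assuming DNsales has an amount column for the sale value and an isnew flag marking new-auto sales (both names are assumed here, not taken from the schema):
SELECT autoclassname,
       SUM(amount) AS total_sales   -- total sales of new autos per auto class
FROM DNsales
WHERE isnew = 1                     -- hypothetical flag for new-auto transactions
GROUP BY autoclassname;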
The result contains one row per auto class, showing the total sales of new autos for that class.
Example 2
Now suppose you want to generate a similar view, but you also want to include the total
sales by salesperson. You can use the GROUPING SETS clause to access both the auto
class and salesperson dimensions in the same query, summarizing total sales of new autos,
both by auto class and by salesperson, in a single expression.
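A minimal sketch, using the same assumed column names as in Example 1:
SELECT autoclassname,
       salespersonname,
       SUM(amount) AS total_sales
FROM DNsales
WHERE isnew = 1                     -- hypothetical flag for new-auto transactions
GROUP BY GROUPING SETS (autoclassname, salespersonname);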
In the query result, the first four rows are identical to the result of Example 1, while the
next five rows are what you would get by substituting salespersonname for autoclassname
in Example 1.
Essentially, applying GROUPING SETS to the two dimensions,
salespersonname and autoclassname, provides the same result that you would get by
appending the two individual results of applying GROUP BY to each dimension separately
as in Example 1.
Facts and Dimensional Modeling
Facts
Definition: Facts are quantitative data that can be measured. They
represent business metrics or performance indicators.
Types of Facts:
o Quantitative Facts: Numerical values such as sales amounts,
temperatures, or counts (e.g., number of sales).
o Qualitative Facts: Non-numeric data that can also be considered
facts, such as descriptions or statuses (e.g., "partly cloudy" in a
weather report).
Fact Tables
Definition: A fact table is a central table in a star or snowflake schema
that contains the facts of a business process.
Components:
o Measures: The actual facts (e.g., sales amount).
o Foreign Keys: Links to dimension tables that provide context to
the facts (e.g., store ID, product ID).
Types of Fact Tables:
o Detail Level Fact Tables: Contain individual transactions (e.g.,
each sale).
o Summary Tables: Aggregate facts (e.g., total sales per quarter).
Dimensions
Definition: Dimensions are attributes that provide context to facts. They
categorize and describe the facts.
Characteristics:
o Dimensions are often referred to as categorical variables.
o They enable users to filter, group, and label data for analysis.
Common Dimensions:
o Time: Date and time stamps.
o Location: Geographic data (e.g., country, city).
o Products: Attributes of products (e.g., make, model).
o People: Attributes of individuals (e.g., employee names).
Dimension Tables
Definition: A dimension table stores the attributes of a dimension and is
linked to the fact table via foreign keys.
Examples:
o Product Table: Contains details about products (e.g., product ID,
name, category).
o Employee Table: Contains details about employees (e.g.,
employee ID, name, department).
Relationships
Fact and Dimension Tables: Fact tables are linked to multiple dimension
tables through foreign keys, allowing for complex queries and analysis.
Schema Types:
o Star Schema: A simple structure where a central fact table is
connected to multiple dimension tables.
o Snowflake Schema: A more complex structure where dimension
tables are normalized into multiple related tables.
Example
Sales at a Car Dealership:
o Fact Table: Contains facts like Sale ID, Sale Date, and Sale Amount.
o Dimension Tables:
Vehicle Table: Attributes like Vehicle ID, Make, Model.
Salesperson Table: Attributes like Salesperson ID, Name.
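To make these relationships concrete, a typical star-join query over such a schema, with table and column names assumed for illustration, might look like this:
SELECT v.make,
       s.name AS salesperson,
       SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_vehicle v      ON f.vehicle_id = v.vehicle_id          -- foreign key to the vehicle dimension
JOIN dim_salesperson s  ON f.salesperson_id = s.salesperson_id  -- foreign key to the salesperson dimension
GROUP BY v.make, s.name;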
Summary
Facts are measurable quantities that represent business performance.
Dimensions provide context to these facts, enabling meaningful analysis.
Fact and Dimension Tables are essential components of dimensional
modeling, facilitating data organization and retrieval for analytical
purposes.
Dimensions
Definition: Dimensions are attributes or categories that provide context
to facts in a data warehouse. They help to describe the "who," "what,"
"where," and "when" of the data.
Characteristics
Categorical Variables: Dimensions are often referred to as categorical
variables in statistics and data analysis.
Contextual Information: They provide essential context that makes
facts meaningful and useful for analysis.
Common Types of Dimensions
1. Time Dimension:
o Attributes: Year, Quarter, Month, Day.
o Example: A time dimension might include data for each day of the
year.
2. Geography Dimension:
o Attributes: Country, State, City, Postal Code.
o Example: A geography dimension could categorize sales data by
location.
3. Product Dimension:
o Attributes: Product ID, Name, Category, Brand.
o Example: A product dimension might include details about each
product sold.
4. Customer Dimension:
o Attributes: Customer ID, Name, Age, Gender.
o Example: A customer dimension could provide insights into
customer demographics.
5. Employee Dimension:
o Attributes: Employee ID, Name, Department, Role.
o Example: An employee dimension might track sales performance by
salesperson.
Dimension Tables
Definition: Dimension tables store the attributes of dimensions and are
linked to fact tables through foreign keys.
Purpose: They allow for filtering, grouping, and labeling operations in
data analysis.
Example of a Dimension Table
Product Table:
o Columns: Product ID, Product Name, Category, Price.
o This table provides detailed information about each product, which
can be used to analyze sales data in conjunction with a fact table.
The role of dimensions in data warehousing is crucial for organizing and
analyzing data effectively. Here are the key functions they serve:
1. Contextualization of Facts
Dimensions provide context to the facts stored in fact tables, making the
data meaningful. For example, a sales amount (fact) becomes more
informative when associated with dimensions like time (date of sale) and
product (type of product sold).
2. Categorization
Dimensions categorize facts into meaningful groups, allowing users to
analyze data based on different attributes. This categorization helps in
understanding trends and patterns.
3. Facilitating Queries
Dimensions enable complex queries by allowing users to filter, group, and
label data. For instance, users can query sales data by specific time
periods, product categories, or geographic locations.
4. Supporting Data Analysis
Dimensions are essential for analytical operations such as:
o Filtering: Narrowing down data to specific criteria (e.g., sales in a
particular region).
o Grouping: Aggregating data based on dimension attributes (e.g.,
total sales by month).
o Labeling: Providing descriptive labels for data points in reports and
dashboards.
5. Enhancing Reporting and Visualization
Dimensions improve the quality of reports and visualizations by providing
detailed attributes that can be used to create insightful dashboards. For
example, a sales report can show performance by product category and
region.
6. Enabling Drill-Down Analysis
Dimensions allow users to perform drill-down analysis, where they can
start with summarized data and explore more detailed levels. For example,
starting with total sales and drilling down to see sales by individual
products or salespersons.
Data Modeling using Star and Snowflake Schemas
Star Schema
Definition: A star schema is a type of data modeling that organizes data
into fact and dimension tables.
Structure:
o Fact Table: Central table that contains measurable, quantitative
data (facts) and foreign keys that reference dimension tables.
o Dimension Tables: Surround the fact table and contain descriptive
attributes related to the facts (e.g., time, product, store).
Visualization: The schema resembles a star, with the fact table at the
center and dimension tables radiating outwards.
Usage: Commonly used in data warehouses and data marts for efficient
querying and reporting.
Snowflake Schema
Definition: A snowflake schema is an extension of the star schema that
normalizes dimension tables into multiple related tables.
Structure:
o Normalization: Dimension tables are split into additional tables to
reduce redundancy and improve data integrity.
o Hierarchy: Each dimension can have multiple levels of hierarchy
(e.g., a product dimension may include category and brand tables).
Visualization: The schema resembles a snowflake due to its multiple
layers of branching.
Usage: Useful for complex queries and when data integrity is a priority.
Key Differences
Normalization:
o Star schema is denormalized (fewer tables, faster queries).
o Snowflake schema is normalized (more tables, better data
integrity).
Complexity:
o Star schema is simpler and easier to understand.
o Snowflake schema is more complex due to additional tables and
relationships.
Design Considerations
1. Business Process: Identify the business process to model (e.g., sales,
inventory).
2. Granularity: Determine the level of detail needed (e.g., daily sales vs.
monthly sales).
3. Dimensions: Identify relevant dimensions (e.g., time, product, customer).
4. Facts: Define the facts to measure (e.g., sales amount, quantity sold).
Example Scenario
Business Process: Point-of-sale transactions for a store.
Granularity: Individual line items on receipts.
Dimensions: Date, store, product, cashier, payment method.
Facts: Transaction amount, quantity, discounts, sales tax.
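A minimal star-schema sketch for this scenario, with table names, columns, and data types assumed for illustration:
-- Dimension tables surrounding the fact table
CREATE TABLE dim_date    (date_id INT PRIMARY KEY, full_date DATE, month INT, year INT);
CREATE TABLE dim_store   (store_id INT PRIMARY KEY, store_name VARCHAR(50), city VARCHAR(50));
CREATE TABLE dim_product (product_id INT PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(30));

-- Central fact table at individual line-item granularity
CREATE TABLE fact_line_item (
    line_item_id INT PRIMARY KEY,
    date_id      INT REFERENCES dim_date (date_id),
    store_id     INT REFERENCES dim_store (store_id),
    product_id   INT REFERENCES dim_product (product_id),
    quantity     INT,
    amount       DECIMAL(10,2),
    discount     DECIMAL(10,2),
    sales_tax    DECIMAL(10,2)
);
In a snowflake variant, dim_product could be further normalized into separate category and brand tables.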
Conclusion
Star and Snowflake schemas are essential for effective data modeling
in data warehousing.
Choosing between them depends on the specific requirements of the
business process, including the need for speed versus data integrity.
Here are the advantages of using a snowflake schema over a star
schema:
1. Data Integrity
Normalization: Snowflake schemas reduce data redundancy by
normalizing dimension tables, which helps maintain data integrity and
consistency.
2. Storage Efficiency
Reduced Redundancy: By breaking down dimension tables into smaller,
related tables, snowflake schemas can save storage space, especially
when dealing with large datasets.
3. Flexibility
Hierarchical Relationships: Snowflake schemas allow for more complex
relationships and hierarchies within dimensions, making it easier to
manage and analyze multi-level data.
4. Improved Query Performance for Certain Queries
Targeted Queries: For queries that require detailed information from
multiple related dimensions, snowflake schemas can perform better due to
their structured relationships.
5. Easier Maintenance
Simplified Updates: Changes to dimension attributes can be made in
one place (the normalized table), which simplifies maintenance and
updates.
6. Better for Complex Data Models
Complex Relationships: Snowflake schemas are better suited for
complex data models where dimensions have multiple levels of hierarchy
or relationships.
Understanding Slowly Changing Dimensions (SCD)
Slowly Changing Dimensions (SCDs) are the methods used to track changes in dimension
attributes and manage updates, helping businesses preserve historical data and ensure
accurate reporting. Managing changes in dimensional data over time is a common problem
in data warehousing, and SCDs are the standard way to address it. This reading briefly
explains the types of SCD and discusses their benefits, usage, and the considerations they
raise when designing a data warehouse.
Various types of SCDs:
There are four primary types of SCDs:
Type 0: Retain Original Value
Type 1: Overwrite the Existing Data
Type 2: Preserve Historical Data
Type 3: Add New Attribute
However, most advanced implementations also use the following types, which combine or
extend the basic ones:
Type 4: Historical Table
Type 6: Hybrid Approach
Type 0: Retain Original Value
Type 0 is used for static dimensions: once a value is inserted, it never changes, and
neither the dimension data nor its history is updated. This approach suits data that should
remain constant over time, such as product codes or account numbers. Its main advantage
is that it is simple to implement and well suited to dimensions that rarely, if ever, change.
Type 1: Overwrite the Existing Data
Type 1 applies changes to the dimension directly by overwriting the existing data. No
record of historical changes is kept, so when an attribute value is updated, the old value is
lost. It is appropriate when only the current state of the data matters, such as correcting
spelling mistakes or updating contact information.
Pros:
Easy to implement.
Saves storage space.
Cons:
No historical data is retained.
May lead to inaccurate historical reporting.
Example: If a customer changes their address, the new address overwrites the
old one.
Type 2: Preserve Historical Data (Row Versioning)
Type 2 tracks changes by adding a new row to the dimension table whenever there is an
update. The table holds both the current and the historical versions of the data, and
start/end dates or flags indicate which row is the current version. Use it when retaining a
full history of changes is essential, such as tracking customer address changes for legal
compliance or auditing.
Pros:
Full historical data is preserved.
Data can be queried as it existed at any point in time.
Cons:
Type 2 increases the size of the dimension table.
It requires careful management of the versioning fields.
Example: When a customer updates their address, a new row is created with the new
address, while the old row is marked as historical (a SQL sketch of this pattern follows below).
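A minimal sketch of the Type 2 pattern in generic SQL, assuming a hypothetical dim_customer table with a surrogate key (customer_sk), the natural key customer_id, an address column, effective dates, and an is_current flag:
-- Expire the current row for the customer whose address changed
UPDATE dim_customer
SET end_date   = CURRENT_DATE,
    is_current = 0
WHERE customer_id = 1001
  AND is_current  = 1;

-- Insert a new row carrying the new address as the current version
INSERT INTO dim_customer
    (customer_sk, customer_id, address, start_date, end_date, is_current)
VALUES
    (98765, 1001, '42 New Street', CURRENT_DATE, NULL, 1);  -- surrogate key usually comes from a sequence or identity column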
Type 3: Add New Attribute (Tracking Limited History)
Type 3 tracks historical changes by adding new columns to the dimension table, with each
column representing a different version of the attribute. This is helpful when only a limited
amount of history needs to be stored, such as the previous and current values, and when it
is only necessary to compare the previous and current states.
Pros:
Type 3 is easy to implement.
It requires less storage space than Type 2.
Cons:
Type 3 can only track a limited amount of history.
It does not maintain a complete history of changes.
Example: Storing a customer’s current address and previous address in
separate columns in the same row.
Type 4: Historical Table (Tracking Historical Data in a Separate Table)
In Type 4, historical data is stored in a separate table from the current dimension data:
the main dimension table holds only the current data, while a separate historical table
stores all previous versions. Use it when you want to separate current data from historical
data to improve performance and simplify the design.
Pros:
Type 4 maintains a complete historical record.
It keeps current and historical data separate.
Cons:
Type 4 is more complex to implement.
It requires additional storage for the historical tables.
Example: A current customer table holds only the latest information, while an associated
historical customer table holds the older records.
Type 6: Hybrid Approach
Type 6 is a hybrid approach that combines aspects of Types 1, 2, and 3: it retains the full
history with new rows like Type 2, overwrites a current-value column in place like Type 1,
and tracks previous values in a separate column like Type 3. This makes it possible to
access the current data, compare it with previous versions, and still maintain a complete
historical record. Use it when you need a flexible solution that tracks both current and
historical versions of the data and supports comparisons with previous values.
Pros:
Type 6 combines the advantages of Types 1, 2, and 3.
It tracks the complete history while keeping the current state readily available.
Cons:
Type 6 is more complex to manage.
It requires more storage.
Example: When a customer changes their address, the dimension table has a
current address field (Type 1), new rows that track the full history of changes
(Type 2), and a previous address field (Type 3).
Key Considerations in Implementing SCD:
1. Business Requirements: Before choosing an SCD type, assess the
business requirements. Do you need to track historical changes? If so, how
much history do you need to keep?
2. Versioning: As mentioned earlier, Type 2 often requires a start date, end
date, and a current flag to manage the different versions of the same
dimension row. Handle these fields carefully to avoid errors in version control.
3. Storage and Performance: Tracking historical data can increase the size of
the dimension tables, so consider the performance impact on queries that
access them.
4. Extract, Transform, Load (ETL) Process: The ETL process should be
designed to fit the type of SCD in use. For example, Type 1 ETL only updates
existing rows, while Type 2 ETL needs to detect changes and insert new rows.
Conclusion:
Slowly Changing Dimensions (SCDs) provide a robust way to manage changes in dimension
data over time. By carefully selecting the appropriate SCD type for their business
requirements, organizations can ensure accurate reporting, maintain historical data, and
optimize the performance of their data warehouses. Whether you need simple overwriting
(Type 1), full historical tracking (Type 2), or a hybrid solution (Type 6), the right SCD
strategy will support long-term data management success.
Data Warehousing with Star and Snowflake schemas
Why do we use these schemas, and how do they differ?
Star schemas are optimized for reads and are widely used for designing data
marts, whereas snowflake schemas are optimized for writes and are widely used
for transactional data warehousing. A star schema is a special case of a
snowflake schema in which all hierarchical dimensions have been denormalized,
or flattened.
Attribute | Star schema | Snowflake schema
Read speed | Fast | Moderate
Write speed | Moderate | Fast
Storage space | Moderate to high | Low to moderate
Data integrity risk | Low to moderate | Low
Query complexity | Simple to moderate | Moderate to complex
Schema complexity | Simple to moderate | Moderate to complex
Dimension hierarchies | Denormalized single tables | Normalized over multiple tables
Joins per dimension hierarchy | One | One per level
Ideal use | OLAP systems, Data Marts | OLTP systems
Table 1. A comparison of star and snowflake schema attributes.
Normalization reduces redundancy
Both star and snowflake schemas benefit from the application of normalization.
“Normalization reduces redundancy” is an idiom that points to a key advantage
leveraged by both schemas. Normalizing a table means to create, for each
dimension:
1. A surrogate key to replace the natural key, that is, the unique values of
the given column, and
2. A lookup table to store the surrogate and natural key pairs.
Each surrogate key’s values are repeated exactly as many times within the
normalized table as the natural key was before moving the natural key to its new
lookup table. Thus, you did nothing to reduce the redundancy of the original
table.
However, dimensions typically contain groups of items that appear frequently,
such as a “city name” or “product category”. Since you only need one instance
from each group to build your lookup table, your lookup table will have many
fewer rows than your fact table. If there are child dimensions involved, then the
lookup table may still have some redundancy in the child dimension columns. In
other words, if you have a hierarchical dimension, such as “Country”, “State”,
and “City”, you can repeat the process on each level to further reduce the
redundancy. Notice that further normalizing your hierarchical dimensions has no
effect on the size or content of your fact table - star and snowflake schema data
models share identical fact tables.
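As a small illustration of normalizing one dimension, assume a denormalized sales table with a repeating city_name column (all names here are hypothetical):
-- 1. Build a lookup table of surrogate/natural key pairs, one row per distinct city
CREATE TABLE dim_city (
    city_id   INT PRIMARY KEY,
    city_name VARCHAR(50)
);
INSERT INTO dim_city (city_id, city_name)
SELECT ROW_NUMBER() OVER (ORDER BY city_name), city_name
FROM (SELECT DISTINCT city_name FROM sales) AS distinct_cities;

-- 2. Add the surrogate key to the fact table and populate it from the lookup table
ALTER TABLE sales ADD COLUMN city_id INT;
UPDATE sales
SET city_id = (SELECT c.city_id FROM dim_city c WHERE c.city_name = sales.city_name);
-- The original city_name column can then be dropped from sales.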
Normalization reduces data size
When you normalize a table, you typically reduce its data size, because in the
process you likely replace expensive data types, such as strings, with much
smaller integer types. But to preserve the information content, you also need to
create a new lookup table that contains the original objects. The question is,
does this new table use less storage than the savings you just gained in the
normalized table? For small data, this question is probably not worth considering,
but for big data, or data that is growing rapidly, the answer is yes: the storage saved in
the fact table soon outweighs the cost of the new lookup table. Indeed, your fact table will
grow much more quickly than your dimension tables, so normalizing your fact table, at
least to the minimum degree of a star schema, is likely warranted. Now the question is which is better –
star or snowflake?
Comparing benefits: snowflake vs. star data warehouses
The snowflake, being completely normalized, offers the least redundancy and the
smallest storage footprint. If the data ever changes, this minimal redundancy
means the snowflaked data needs to be changed in fewer places than would be
required for a star schema. In other words, writes are faster, and changes are
easier to implement. However, due to the additional joins required in querying
the data, the snowflake design can have an adverse impact on read speeds. By
denormalizing to a star schema, you can boost your query efficiency. You can
also choose a middle path in designing your data warehouse. You could opt for a
partially normalized schema. You could deploy a snowflake schema as your basis
and create views or even materialized views of denormalized data. You could for
example simulate a star schema on top of a snowflake schema. At the cost of
some additional complexity, you can select from the best of both worlds to craft
an optimal solution to meet your requirements.
Practical differences
Most queries you apply to the dataset, regardless of your schema choice, go
through the fact table. Your fact table serves as a portal to your dimension
tables. The main practical difference between star and snowflake schema from
the perspective of an analyst has to do with querying the data. You need more
joins for a snowflake schema to gain access to the deeper levels of the
hierarchical dimensions, which can reduce query performance over a star
schema. Thus, data analysts and data scientists tend to prefer the simpler star
schema. Snowflake schemas are generally good for designing data warehouses
and in particular, transaction processing systems, while star schemas are better
for serving data marts, or data warehouses that have simple fact-dimension
relationships. For example, suppose you have point-of-sale records accumulating
in an Online Transaction Processing System (OLTP) which are copied as a daily
batch ETL process to one or more Online Analytics Processing (OLAP) systems
where subsequent analysis of large volumes of historical data is carried out. The
OLTP source might use a snowflake schema to optimize performance for frequent
writes, while the OLAP system uses a star schema to optimize for frequent reads.
The ETL pipeline that moves the data between systems includes a
denormalization step which collapses each hierarchy of dimension tables into a
unified parent dimension table.
Too much of a good thing?
There is always a tradeoff between storage and compute that should factor into
your data warehouse design choices. For example, do your end-users or
applications need to have precomputed, stored dimensions such as ‘day of
week’, ‘month of year’, or ‘quarter’ of the year? Columns or tables which are
rarely required are occupying otherwise usable disk space. It might be better to
compute such dimensions within your SQL statements only when they are
needed. For example, given a star schema with a date dimension table, you
could apply the SQL ‘MONTH’ function as MONTH(dim_date.date_column) on
demand instead of joining the precomputed month column from the MONTH
table in a snowflake schema.
Scenario
Suppose you are handed a small sample of data from a very large dataset in the
form of a table by your client who would like you to take a look at the data and
consider potential schemas for a data warehouse based on the sample. Putting
aside gathering specific requirements for the moment, you start by exploring the
table and find that there are exactly two types of columns in the dataset - facts
and dimensions. There are no foreign keys although there is an index. You think
of this table as being a completely denormalized, or flattened dataset. You also
notice that amongst the dimensions are columns with relatively expensive data
types in terms of storage size, such as strings for names of people and places. At
this stage you already know you could equally well apply either a star or
snowflake schema to the dataset, thereby normalizing to the degree you wish.
Whether you choose star or snowflake, the total data size of the central fact
table will be dramatically reduced. This is because instead of using dimensions
directly in the main fact table, you use surrogate keys, which are typically
integers; and you move the natural dimensions to their own tables or hierarchy
of tables which are referenced by the surrogate keys. Even a 32-bit integer is
small compared to say a 10-character string (8 X 10 = 80 bits). Now it’s a matter
of gathering requirements and finding some optimal normalization scheme for
your schema.
Staging Areas for Data Warehouses
What is a Staging Area?
Definition: A staging area is an intermediate storage location used during
the ETL (Extract, Transform, Load) process.
Purpose: It acts as a bridge between data sources and target data
warehouses, data marts, or other data repositories.
A staging area in data warehousing is an intermediate storage location used
during the ETL (Extract, Transform, Load) process. Here are the key points to
understand:
Definition
Intermediate Storage: It serves as a temporary holding area for data
extracted from various source systems before it is transformed and loaded
into the target data warehouse.
Purpose
Data Integration: The staging area consolidates data from multiple
sources, allowing for integration before it reaches the final destination.
Decoupling: It separates the data processing tasks from the source
systems, minimizing the risk of corrupting the original data.
Characteristics
Transient Nature: Staging areas are often temporary and may be cleared
after the ETL process is completed. However, they can also retain data for
archival or troubleshooting purposes.
Flexibility: They can be implemented using various methods, such as flat
files (e.g., CSV) or SQL tables in a relational database.
Functions
Data Cleansing: Handles missing values, duplicates, and other data
quality issues.
Transformation: Prepares data by applying necessary transformations to
meet the requirements of the target system.
Monitoring: Helps in monitoring and optimizing ETL workflows.
Key Functions of Staging Areas
1. Integration: Consolidates data from multiple source systems.
2. Change Detection: Manages extraction of new and modified data.
3. Scheduling: Allows tasks within an ETL workflow to run in a specific
sequence or concurrently.
4. Data Cleansing and Validation: Handles missing values and duplicates.
5. Aggregating Data: Summarizes data (e.g., daily sales into weekly
averages).
6. Normalizing Data: Ensures consistency in data types and naming
conventions.
Implementation Methods
Flat Files: Simple formats like CSV files managed with scripts (e.g., Bash,
Python).
SQL Tables: Stored in a relational database (e.g., Db2).
Self-contained Databases: Within data warehousing or business
intelligence platforms (e.g., Cognos Analytics).
Example Use Case
Cost Accounting System: Data from Payroll, Sales, and Purchasing
departments is extracted to individual staging tables. The data is then
transformed and integrated into a single table before loading into the
target system.
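A sketch of this pattern in SQL, with table and column names assumed for illustration:
-- One staging table per source department
CREATE TABLE stg_payroll    (employee_id INT, cost    DECIMAL(12,2), period_end DATE);
CREATE TABLE stg_sales      (order_id    INT, revenue DECIMAL(12,2), period_end DATE);
CREATE TABLE stg_purchasing (po_id       INT, cost    DECIMAL(12,2), period_end DATE);

-- Integrated staging table that feeds the target cost accounting system
CREATE TABLE stg_cost_accounting (source VARCHAR(20), amount DECIMAL(12,2), period_end DATE);

INSERT INTO stg_cost_accounting (source, amount, period_end)
SELECT 'payroll',    cost,    period_end FROM stg_payroll
UNION ALL
SELECT 'sales',      revenue, period_end FROM stg_sales
UNION ALL
SELECT 'purchasing', cost,    period_end FROM stg_purchasing;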
Benefits of Staging Areas
Decoupling Operations: Separates data processing from source
systems, minimizing the risk of data corruption.
Recovery: If extracted data becomes corrupted, it can be easily
recovered from the staging area.
Summary
Staging areas are crucial for integrating disparate data sources in data
warehouses.
They can be implemented in various ways and serve multiple functions,
enhancing the efficiency and reliability of the ETL process.
Steps to Implement a Staging Area
1. Define Requirements:
o Identify the data sources (e.g., databases, APIs).
o Determine the types of data to be extracted and the
transformations needed.
2. Choose Implementation Method:
o Flat Files: Use CSV or JSON files for simple projects.
o Database Tables: Set up tables in a relational database (e.g.,
PostgreSQL, Db2).
o Data Warehousing Tools: Utilize platforms like Cognos Analytics
for more complex needs.
3. Set Up the Environment:
o Create the necessary infrastructure (servers, databases).
o Ensure access permissions for data sources and staging area.
4. Develop ETL Processes:
o Extract: Write scripts or use ETL tools to pull data from source
systems.
o Transform: Cleanse, validate, and aggregate data as required.
o Load: Insert the transformed data into the staging area.
5. Schedule ETL Jobs:
o Use scheduling tools (e.g., cron jobs, Apache Airflow) to automate
the ETL process at defined intervals.
6. Monitor and Optimize:
o Implement logging and monitoring to track ETL performance.
o Optimize queries and processes to improve efficiency.
7. Data Validation:
o Ensure data integrity by validating the data in the staging area
before loading it into the target system.
8. Documentation:
o Document the architecture, processes, and any transformations
applied for future reference and maintenance.
Example Tools
ETL Tools: Apache NiFi, Talend, or custom scripts in Python.
Databases: PostgreSQL, MySQL, or cloud solutions like AWS RDS.
Verifying Data Quality
Definition of Data Quality Verification
Data Quality Verification involves checking data for:
o Accuracy: Ensuring data is correct and matches source data.
o Completeness: Identifying missing data or voids in fields.
o Consistency: Ensuring uniformity in data entry (e.g., date formats).
o Currency: Keeping data up to date.
Importance of Data Quality
High-quality data is essential for:
o Successful integration of related data.
o Advanced analysis, statistical modeling, and machine learning.
o Enhanced confidence in insights and decision-making.
Common Data Quality Concerns
1. Accuracy Issues:
o Duplicated records during data migration.
o Manual entry errors (typos, out-of-range values).
o Data misalignment (e.g., CSV misinterpretation).
2. Completeness Issues:
o Missing values in required fields.
o Use of placeholders for missing data.
o Entire records missing due to system failures.
3. Consistency Issues:
o Deviations from standard terminology.
o Inconsistent date formats.
o Variations in data entry (e.g., "Mr. John Doe" vs. "John Doe").
o Inconsistent units of measurement (e.g., kilograms vs. pounds).
4. Currency Issues:
o Outdated customer addresses.
o Name changes not reflected in the data.
Process for Handling Bad Data
1. Implement Rules: Create rules to detect bad data.
2. Capture and Quarantine: Identify and isolate bad data.
3. Reporting: Share findings with domain experts.
4. Root Cause Analysis: Investigate upstream data lineage for issues.
5. Correction: Diagnose and correct identified problems.
6. Automation: Automate data cleaning workflows as much as possible.
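For instance, simple detection rules can be expressed directly in SQL. The sketch below assumes a hypothetical staging table named stg_customers:
-- Completeness: count rows missing a required field
SELECT COUNT(*) AS missing_emails
FROM stg_customers
WHERE email IS NULL;

-- Accuracy: find records duplicated on the natural key
SELECT customer_id, COUNT(*) AS copies
FROM stg_customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Consistency/accuracy: flag out-of-range values
SELECT COUNT(*) AS future_birth_dates
FROM stg_customers
WHERE birth_date > CURRENT_DATE;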
Tools for Data Quality Solutions
Examples of leading vendors and their tools include:
o IBM InfoSphere Server for Data Quality
o Informatica Data Quality
o SAP Data Quality Management
o Microsoft Data Quality Services
o OpenRefine (open-source tool)
Conclusion
Data verification is crucial for managing data quality and enhancing
reliability.
Enterprise-grade tools can help maintain data quality in a unified
environment.
To ensure data accuracy in your organization, consider implementing the
following strategies:
1. Data Entry Standards
Establish clear guidelines for data entry to minimize errors.
Use standardized formats for dates, names, and other fields.
2. Validation Rules
Implement validation checks during data entry to catch errors in real-time
(e.g., range checks, format checks).
3. Regular Audits
Conduct periodic audits of data to identify inaccuracies and
inconsistencies.
Use sampling methods to review data quality.
4. Training and Awareness
Provide training for staff on the importance of data accuracy and best
practices for data entry.
Foster a culture of data quality within the organization.
5. Automated Data Cleaning
Utilize automated tools to identify and correct inaccuracies, such as
duplicate records or out-of-range values.
6. Data Integration Checks
When migrating or integrating data from different sources, ensure that
data matches and is aligned correctly.
7. Feedback Mechanism
Establish a process for users to report data inaccuracies and provide
feedback for continuous improvement.
8. Data Governance
Implement a data governance framework to oversee data management
practices and ensure accountability.
9. Use of Technology
Leverage data quality tools and software that specialize in data
verification and cleansing.
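As a minimal sketch of the validation-rules strategy above, database-level
constraints can enforce format and range checks at the point of entry (the
employees table and its columns are assumptions for illustration):
-- Hypothetical table with validation rules built in.
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    email       VARCHAR(255) NOT NULL
                CHECK (email LIKE '%_@_%._%'),          -- simple format check
    hire_date   DATE NOT NULL
                CHECK (hire_date >= DATE '2000-01-01'), -- range check
    salary      NUMERIC(10,2)
                CHECK (salary > 0)                      -- out-of-range guard
);
Entries that violate a constraint are rejected at insert time, catching errors
before they reach downstream reports.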
Populating a Data Warehouse
Overview
Populating a data warehouse is an ongoing process that involves:
o Initial load: The first time data is loaded into the warehouse.
o Incremental loads: Regular updates to add new or changed data.
Key Steps in Populating a Data Warehouse
1. Schema Modeling:
o Ensure that the data warehouse schema is designed (e.g., star or
snowflake schema).
o Create production tables based on the schema.
2. Data Staging:
o Data should be staged in tables or files before loading.
o Verify data quality before loading into the warehouse.
3. Initial Load:
o Instantiate the data warehouse and its schema.
o Create fact and dimension tables.
o Load transformed and cleaned data from staging tables into the
warehouse.
4. Ongoing Data Loads:
o Automate incremental loads using scripts as part of the ETL
(Extract, Transform, Load) process.
o Schedule loads to occur daily or weekly based on requirements.
Change Detection and Incremental Loading
Change Detection:
o Identify new or updated records in the source system.
o Use timestamps or built-in change-tracking mechanisms in relational
databases to identify changed records.
Incremental Loading:
o Load only the new or changed data instead of the entire dataset.
o This can be done using scripts or ETL tools.
Maintenance
Periodic Maintenance:
o Archive or delete older data that is not frequently accessed.
o Automate the archiving process to move data to less costly storage.
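A simple sketch of such an archiving job, assuming a hypothetical fact_sales
table with a sale_date column and a five-year retention cutoff:
-- One-time setup: an archive table with the same structure as the fact table.
CREATE TABLE IF NOT EXISTS fact_sales_archive (LIKE fact_sales INCLUDING ALL);
-- Copy rows older than the cutoff into the archive, then remove them.
INSERT INTO fact_sales_archive
SELECT * FROM fact_sales
WHERE sale_date < CURRENT_DATE - INTERVAL '5 years';
DELETE FROM fact_sales
WHERE sale_date < CURRENT_DATE - INTERVAL '5 years';
Running both statements inside one transaction keeps the warehouse and the
archive consistent.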
Example: Manually Populating a Data Warehouse
1. Creating Dimension Tables:
o Use SQL commands like CREATE TABLE to define dimension tables
(e.g., DimSalesPerson).
o Populate these tables using INSERT INTO statements.
2. Creating Fact Tables:
o Define fact tables (e.g., FactAutoSales) with primary keys and
foreign keys.
o Populate fact tables with sales data using INSERT INTO.
3. Establishing Relationships:
o Use ALTER TABLE and ADD CONSTRAINT to set up relationships
between fact and dimension tables.
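A condensed, PostgreSQL-style sketch of these three steps (only the table names
DimSalesPerson and FactAutoSales come from the example above; the columns and
values are assumptions):
-- 1. Create and populate a dimension table.
CREATE TABLE DimSalesPerson (
    salesperson_id   INT PRIMARY KEY,
    salesperson_name VARCHAR(100)
);
INSERT INTO DimSalesPerson (salesperson_id, salesperson_name)
VALUES (1, 'John Doe');
-- 2. Create and populate a fact table.
CREATE TABLE FactAutoSales (
    sale_id        INT PRIMARY KEY,
    salesperson_id INT,
    amount         NUMERIC(10,2)
);
INSERT INTO FactAutoSales (sale_id, salesperson_id, amount)
VALUES (100, 1, 25000.00);
-- 3. Establish the relationship between fact and dimension.
ALTER TABLE FactAutoSales
    ADD CONSTRAINT fk_salesperson
    FOREIGN KEY (salesperson_id) REFERENCES DimSalesPerson (salesperson_id);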
Tools and Technologies
ETL Tools: Tools such as Apache Airflow (workflow orchestration) and Apache
Kafka (streaming data pipelines) can help automate the data loading process.
Database Utilities: Use database-specific utilities (e.g., Db2 Load utility)
for efficient data loading.
Conclusion
Populating a data warehouse is a structured process that requires careful
planning and execution.
Regular maintenance and automation are crucial for keeping the data
warehouse current and efficient.
To automate incremental loading in a data warehouse, you can follow these
steps:
Steps to Automate Incremental Loading
1. Identify Change Detection Mechanism:
o Use timestamps or versioning in your source data to track changes.
o Many relational databases have built-in features to identify new or
modified records.
2. Create ETL Scripts:
o Write scripts (using languages like Python, SQL, or Bash) that:
Extract new or updated records from the source.
Transform the data as needed (cleaning, formatting).
Load the transformed data into the data warehouse.
3. Schedule the ETL Process:
o Use scheduling tools (like cron jobs in Unix/Linux) to run your ETL
scripts at regular intervals (e.g., daily or weekly).
o Alternatively, use ETL tools like Apache Airflow to manage and
schedule your workflows.
4. Implement Logic for Incremental Loads:
o In your ETL scripts, include logic to:
Query the source data for records that have changed since
the last load (using timestamps).
Insert or update records in the data warehouse based on the
extracted data.
5. Error Handling and Logging:
o Implement error handling in your scripts to manage failures
gracefully.
o Log the results of each load process for monitoring and
troubleshooting.
6. Testing and Validation:
o Test your automated process to ensure it correctly identifies and
loads incremental changes.
o Validate the data in the warehouse to ensure accuracy and
completeness.
Example of Incremental Loading Logic
Here’s a simplified SQL example of how you might implement incremental
loading:
-- Assuming you have a last_loaded timestamp to track the last load
SELECT * FROM source_table
WHERE last_modified > :last_loaded;
This query retrieves records that have been modified since the last load, which
can then be processed and loaded into the data warehouse.
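To complete the load step, the changed rows can then be inserted or updated in
the warehouse. A PostgreSQL-style upsert sketch, assuming a staging_table and a
target_table whose primary key is id (all three names are illustrative):
-- Insert new rows; update rows whose key already exists in the warehouse.
INSERT INTO target_table (id, attribute, last_modified)
SELECT id, attribute, last_modified
FROM staging_table
ON CONFLICT (id) DO UPDATE
SET attribute     = EXCLUDED.attribute,
    last_modified = EXCLUDED.last_modified;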
Tools for Automation
ETL Tools: Apache Airflow, Talend, Informatica, or IBM DataStage can help
automate the ETL process.
Scripting: Use Python or Bash scripts to handle the extraction,
transformation, and loading.
Querying the Data
Key Concepts
1. Entity-Relationship Diagram (ERD):
o Represents the star schema in a data warehouse.
o Helps in understanding the relationships between tables.
2. Materialized Views:
o Created by denormalizing or joining tables from a star schema.
o Store precomputed results to enhance query performance.
o Can be refreshed on a schedule or on demand (see the refresh example
after this list).
3. CUBE and ROLLUP Operations:
o Used in SQL to generate total and subtotal summaries.
o CUBE: Generates subtotals for all possible combinations of the specified
dimensions.
o ROLLUP: Generates a hierarchical summary based on the order of
dimensions.
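For instance, in PostgreSQL a materialized view is brought up to date on
demand with a single statement (sales_summary_mv is a hypothetical view name),
which a scheduler can also run at fixed intervals:
-- Recompute and store the view's results.
REFRESH MATERIALIZED VIEW sales_summary_mv;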
Practical Application
Scenario: Creating live summary tables for reporting January sales by
salesperson and automobile type for ShinyAutoSales.
1. Understanding the Star Schema:
o Explore the existing schema in the data warehouse (e.g., "sasDW").
o Identify the central fact table (e.g., "fact_auto_sales") and its foreign
keys.
2. Querying Tables:
o Use SQL to query the fact and dimension tables.
o Example SQL command:
SELECT * FROM sales.fact_auto_sales LIMIT 10;
3. Creating a Denormalized View:
o Join dimension tables to the fact table to create a more
interpretable dataset.
o Example SQL command to create a materialized view:
CREATE MATERIALIZED VIEW D_N_sales AS
SELECT date, auto_class_name, is_new, salesperson_name, amount
FROM fact_auto_sales
INNER JOIN date_dimension ON fact_auto_sales.sales_date_key =
date_dimension.date_key
INNER JOIN auto_category_dimension ON fact_auto_sales.auto_class_id =
auto_category_dimension.auto_class_id
INNER JOIN salesperson_dimension ON fact_auto_sales.salesperson_id =
salesperson_dimension.salesperson_id;
4. Using CUBE and ROLLUP:
o Apply these operations to the materialized view to generate
summaries.
o Example for CUBE:
SELECT auto_class_name, salesperson_name, SUM(amount)
FROM D_N_sales
WHERE is_new = TRUE
GROUP BY CUBE(auto_class_name, salesperson_name);
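A corresponding ROLLUP query over the same view produces a hierarchical
summary instead: totals per auto class and salesperson, a subtotal per auto
class, and a grand total:
SELECT auto_class_name, salesperson_name, SUM(amount)
FROM D_N_sales
WHERE is_new = TRUE
GROUP BY ROLLUP(auto_class_name, salesperson_name);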
Summary
CUBE and ROLLUP provide powerful capabilities for quickly querying and
analyzing data.
Materialized views help reduce the load on the database and improve
query performance.
Understanding the star schema and effectively querying the data is crucial
for data analysis tasks.