Unit 3 Data Warehousing and OLAP
Unit 3 Data Warehousing and OLAP
Cost reductions: Data warehousing can result in cost savings over time by reducing data management
procedures and increasing overall efficiency, even when there are setup costs initially.
Data security: Data warehouses employ security protocols to safeguard confidential information,
guaranteeing that only authorized personnel are granted access to certain data.
Faster Queries: The data warehouse is designed to handle large queries that’s why it runs queries
faster than the database..
Historical Insight: The warehouse stores all your historical data which contains details about the
business so that one can analyze it at any time and extract insights from it.
Disadvantages of Data Warehousing
Cost: Building a data warehouse can be expensive, requiring significant investments in hardware,
software, and personnel.
Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel
to manage the system.
Time-consuming: Building a data warehouse can take a significant amount of time, requiring
businesses to be patient and committed to the process.
Data integration challenges: Data from different sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.
Data security: Data warehousing can pose data security risks, and businesses must take measures to
protect sensitive data from unauthorized access or breaches.
Finance
• Evaluation of customer expenses trends
• Maintain transparency in transactions
• Predict/spot defaulters and act accordingly
• Analyze and forecast different aspects of business, stock, and bond performance
Government
• Maintain and analyze tax records, health policy records, and their respective providers
• Prediction of criminal activities from patterns and trends
• Searching terrorist profile
• Threat assessment and fraud detection
Education
• Store and analyze information about faculty and students
• Maintain student portals to facilitate student activities
• Extract information for research grants and assess student demographics
• Integrate information from different sources into a single repository for analysis and strategic
decision-making
Healthcare
• Generate patient, employee, and financial records
• Share data with other entities, like insurance companies, NGOs, and medical aid services
• Use data mining to identify patient trends
• Provide feedback to physicians on procedures and tests
Insurance
• Analyze data patterns and customer trends – Maintain records of all internal and external
sources, including existing participants
• Design customized offers and promotions for customers
• Predict and analyze changes in the industry
Manufacturing
• Predict market changes and analyze current business trends
• Analyze previous and current market data
• Track customer feedback and identify opportunities for improvement
• Gather, standardize, and store data from various internal and external sources
• Identify profitable product lines and required product features
Retail
• Maintain records of producers and consumers
• Track items, their promotion strategies, and consumer buying trends (trend analysis)
• Analyze sales to determine shelf space
• Understanding the patterns of complaints, claims, and returns
Services
• Maintenance of financial and employee records
• Customer profiling and screening
• Resource allocation and management
• Revenue patterns and profitability
Transactional integrity: Enforces ACID (atomicity, consistency, isolation, durability) properties for
reliable transaction processing.
Row-oriented storage: Stores data in rows, optimal for frequent insert, update, and delete operations.
Normalized data structure: Organizes data into smaller, related tables to eliminate redundancy and
maintain efficiency.
Support for OLTP (online transaction processing): Prioritizes rapid execution of simple queries and
updates over complex analytics.
Here are some key characteristics of a data warehouse:
Analytical focus: Built to perform OLAP (online analytical processing) tasks like trend analysis and
reporting.
Historical data storage: Consolidates and organizes historical data from multiple sources for long-
term use.
Column-oriented storage: Structures data for efficient retrieval in analytical queries, often using
columnar formats.
Denormalized data structure: Optimized for read-heavy operations by minimizing joins, using star or
snowflake schemas.
Batch processing: Updates data in bulk during scheduled intervals rather than real time.
Scalability: Handles growing volumes of data without compromising query performance.
By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data
and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools,
analysis tools and data mining tools.
Enterprise Warehouse
An EDW is a data warehouse that encompasses and stores all of an organization’s data from sources
across the entire business. A smaller data warehouse may be specific to a business department or line of
business (like a data mart). In contrast, an EDW is intended to be a single repository for all of an
organization’s data.
Data Sources: These encompass data collected from various operational and transactional systems
within the organization, such as Enterprise Resource Planning (ERP) systems, Customer Relationship
Management (CRM) platforms, finance applications, Internet of Things (IoT) devices, and mobile and
online systems.
Staging Area: This serves as an intermediate space where data is aggregated, cleaned, and prepared
before being loaded into the EDW. The staging area ensures that data is in a consistent and usable
format, ready for further processing and analysis.
Storage layer: The data is finally loaded into the storage space. With the ELT approach, it might still
take some transformation here. But, at that stage, all the general changes will be applied, so the data
will be loaded into its final model(s). As we mentioned, data warehouses are most often relational
databases. DW will also include a database management system and additional storage for metadata.
Presentation or Access Space: This component provides an interface for users to access and interact
with the data stored in the EDW. It enables analytics, querying, reporting, and data sharing, facilitating
efficient decision-making and strategic insights across the organization.
Data marts
A data mart is a data storage system that contains information specific to an organization's business
unit. It contains a small and selected part of the data that the company stores in a larger storage system.
Companies use a data mart to analyze department-specific information more efficiently. It provides
summarized data that key stakeholders can use to quickly make informed decisions.
For example, a company might store data from various sources, such as supplier information, orders,
sensor data, employee information, and financial records in their data warehouse or data lake.
However, the company stores information relevant to, for instance, the marketing department, such as
social media reviews and customer records, in a data mart.
By using a data mart, companies can access specific information more efficiently. Compared to a data
warehouse, a data mart contains relevant and detailed information that a department accesses
frequently. Therefore, business managers don’t need to search the entire data warehouse to generate
performance reports or graphics.
Streamline decision-making
Companies can create a subset of data from a data warehouse with a data mart. Employees within the
department can then analyze the data and make decisions based on the same set of information.
Control information more effectively
A data mart gives employees highly granular access privileges. This means the company can authorize
a certain person to view or retrieve specific data. It helps companies to improve data governance and
enforce information access policies. For example, you can use data marts to provide user access to
employees for specific information in a data warehouse.
Manage data flexibly
A data mart is smaller and contains fewer tables than a data warehouse. This means data engineers can
manage and change information in a data mart without causing major database changes.
retrieving information much easier, they present a single point of failure. If the data warehouse fails, all
the connected data marts will also fail.
In a fact table, each row corresponds to one event or measurement, with some rows being associated
with a dimension table in the schema. Every row referencing a dimension table has its own foreign key,
which links the data in the fact table to the referenced dimension table. The foreign keys contained in
the fact table are also the primary keys of the referenced dimension tables.
The fact table shown in orange contains all the primary keys (PK) of the dimension tables, which are
the foreign keys (FK) in the fact table, along with two quantitative fields — namely quantity and
amount.
Characteristic Positioned in the middle of a The edges of the snowflake or star schema,
snowflake or star schema, attached to the fact table,
surrounded by dimensions.
Design Defined by their grain or at the It must be extensive, in-depth, and of the
atomic level. highest calibre.
Data Type Facts tables may include data Each dimension table has attributes that
about sales in relation to a provide information about the specifics of
number of parameters, such as the dimension. For instance, product
Product and Date. dimensions may include the product ID,
category, etc.
Key The fact table's primary key is Each dimension in a dimension table
mapped as a foreign key to contains a primary key column that
dimensions. uniquely identifies it.
Storage Helps to save report labels and Load dimensional structures with thorough
filter domain values in atomic data.
dimension tables.
The following diagram shows the sales data of a company with respect to the four dimensions, namely
time, item, branch, and location.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state,country}.
This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" both the cities
are in the Canadian province of British Columbia. The entries for such cities may cause data
redundancy along the attributes province_or_state and country.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier-
key. The supplier key is linked to the supplier dimension table. The supplier dimension table contains
the attributes supplier_key and supplier_type.
Due to normalization in the Snowflake schema, the redundancy is reduced and therefore, it becomes
easy to maintain and the save storage space.
• Multiple hierarchies that can belong to the same dimension have been designed at the lowest
possible detail.
In star schema, The fact tables and While in snowflake schema, The fact tables, dimension
the dimension tables are contained. tables as well as sub dimension tables are contained.
It takes less time for the execution of While it takes more time than star schema for the
queries. execution of queries.
In star schema, Normalization is not While in this, Both normalization and denormalization
used. are used.
The query complexity of star schema While the query complexity of snowflake schema is
is low. higher than star schema.
It has less number of foreign keys. While it has more number of foreign keys.
The sales fact table is same as that in the star schema. The shipping fact table has the five dimensions,
namely item_key, time_key, shipper_key, from_location, to_location. The shipping fact table also
contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, time, item, and location
dimension tables are shared between the sales and shipping fact table.
Advantages
• Tables are subdivided into fact and dimensional to understand the relationship between them.
• It is a flexible schema that makes users use it.
• Here dimensional tables are shared by the number of fact tables.
• It is a normalized form of snowflake and star schema.
• We can access the data in the database using complex queries.
Disadvantages
• It is difficult to understand as it is a very complex schema to implement.
• It uses more space in the database comparative to the star schema.
• It has many joins between dimensional and fact tables and thus it is difficult to understand.
• This is difficult to maintain and operate.
Roll-up
It is the opposite of the drill-down operation and is also known as a drill-up or aggregation operation. It
is a dimension-reduction technique that performs aggregation on a data cube. It makes the data less
detailed and it can be performed by combining similar dimensions across any axis.
Example: Considering the above-mentioned clothing company sales example:
Here, we are performing the Roll-up operation on the given data cube by combining and categorizing
the sales based on the countries instead of cities.
Drill-Down
Drill down operation allows a user to zoom in on the data cube i.e., the less detailed data is converted
into highly detailed data. It can be implemented by either stepping down a concept hierarchy for a
dimension or adding additional dimensions to the hypercube.
Example: Consider a cube that represents the annual sales (4 Quarters: Q1, Q2, Q3, Q4) of various
kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4 cities (Delhi, Mumbai, Las Vegas, New
York) as shown below.
In the figure, the drill-down operation is applied on the time dimension and the quarter Q1 is drilled
down to January, February, and March. Hence, by applying the drill-down operation, we can move
down from quarterly sales in a year to monthly or weekly records.
Slice
Slice operation is used to select a single dimension from the given cube to generate a new sub-cube. It
represents the information from another point of view.
Example: Considering our clothing company sales example:
Here, the sales done by the company during the first quarter are retrieved by performing the slice
operation on the given hypercube.
Dice
Dice operation is used to generate a new sub-cube from the existing hypercube. It selects two or more
dimensions from the hypercube to generate a new sub-cube for the given data.
Example: Considering our clothing company sales example.
In the figure, we are using the dice operation to retrieve the sales done by the company in the first half
of the year i.e., the sales in the first two quarters.
Pivot (Rotate)
It is used to provide an alternate view of the data available to the users. It is also known as Rotate
operation as it rotates the cube’s orientation to view the data from different perspectives.
Example: Considering our clothing company sales example:
Here, we are using the Pivot operation to view the sub-cube from a different perspective.
ROLAP
ROLAP is relational OLAP where the data is arranged in traditional methods like rows and columns in
the data warehouse. It is visible and accessible to users in multi-dimensional form. To display it as a
multi-dimensional view the data is designed as the related layer of metadata which supports the
collection and storage of data. It does dynamically in handling the complex query. It is slower than
MOLAP where ROLAP deals with the enormous volume of data at a higher speed.
Benefits
• It is compatible with data warehouses and OLTP systems.
• The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a
result, ROLAP does not limit the amount of data that can be stored.
Limitations
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.
MOLAP
MOLAP is a multi-dimensional OLAP where the data is analyzed on the registered system. The data is
arranged in a multi-dimensional array. The array carries predefined data when the data is loaded in
database management. MOLAP system is implemented on the application layer and when the user
sends any request it fetches the data with the minimum response time.
Benefits
• Suitable for slicing and dicing operations.
• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.
Limitations
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large amount of data cannot be
stored in the cube itself.
HOLAP
ROLAP and MOLAP are combined in Hybrid On-Line Analytical Processing (HOLAP). It offers
greater scalability than ROLAP and faster computation than MOLAP.
HOLAP servers are capable of storing large amounts of detailed data. HOLAP benefits from ROLAP’s
greater scalability. HOLAP, on the other hand, makes use of cube technology for faster performance
and summary-type information. Because detailed data is stored in a relational database, cubes are
smaller than MOLAP.
Benefits
• HOLAP combines the benefits of MOLAP and ROLAP.
• Provide quick access at all aggregation levels.
Limitations
• HOLAP architecture is complex.
may be the best solution for businesses that require both transaction processing and data analysis.
Ultimately, choosing the right system depends on the specific needs of your business, including data
volume, query complexity, response time, scalability, and cost.
OLAP helps you analyze large volumes of OLTP helps you manage and process
Purpose
data to support decision-making. real-time transactions.
OLAP uses historical and aggregated data OLTP uses real-time and transactional
Data source
from multiple sources. data from a single source.
OLAP uses star schema, snowflake schema, OLTP uses normalized or denormalized
Data model
or other analytical models. models.
OLAP has large storage requirements. Think OLTP has comparatively smaller storage
Volume of data
terabytes (TB) and petabytes (PB). requirements. Think gigabytes (GB).
OLAP has longer response times, typically OLTP has shorter response times,
Response time
in seconds or minutes. typically in milliseconds
OLAP is good for analyzing trends, OLTP is good for processing payments,
Example
predicting customer behavior, and customer data management, and order
applications
identifying profitability. processing.