unit1

This document provides an overview of Data Warehouse and OLAP technology, defining data warehouses and their key features, including being subject-oriented, integrated, time-variant, and non-volatile. It discusses the differences between operational database systems and data warehouses, detailing OLAP operations, multidimensional data models, and various schemas like star and snowflake. The document also outlines data warehouse architecture, models, and the transition from data warehousing to data mining, emphasizing the importance of efficient data processing and analysis for decision-making.

UNIT-I

Data Warehouse and OLAP Technology

Syllabus:

Data Warehouse and OLAP Technology: An overview: Data warehouse, A Multidimensional Data model, Data
Warehouse Architecture, Data Warehouse Implementation, From Data warehousing to Data Mining. (Han &
Kamber)

What is a Data Warehouse?

Definition 1:

Data warehousing provides architectures and tools for business executives to systematically organize, understand,
and use their data to make strategic decisions.

Definition 2:

A data warehouse refers to a data repository that is maintained separately from an organization’s operational
databases.

Definition 3:

According to William H. Inmon, a leading architect in the construction of data warehouse systems, “A data
warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of
management’s decision-making process.”

Four major features of a data warehouse are:

Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and
sales.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as
relational databases, flat files, and online transaction records.

Time-variant: Data are stored to provide information from an historic perspective (e.g., the past 5–10 years).

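Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data
found in the operational environment. Because of this separation, it does not need transaction processing, recovery,
or concurrency control mechanisms; data access usually involves only the initial loading of data and subsequent reads.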

How are organizations using the information from data warehouses?

Many organizations use this information to support business decision-making activities, including:

 Increasing customer focus, which includes the analysis of customer buying patterns (such as buying
preference, buying time, budget cycles, and appetites for spending).
 Repositioning products and managing product portfolios by comparing the performance of sales by quarter,
by year, and by geographic regions in order to fine-tune production strategies.
 Analysing operations and looking for sources of profit.
 Managing customer relationships, making environmental corrections, and managing the cost of corporate
assets.
Differences between Operational Database Systems and Data Warehouses

 The major task of online operational database systems is to perform online transaction and query
processing. These systems are called online transaction processing (OLTP) systems. They cover most of
the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking,
payroll, registration, and accounting.
 Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis
and decision making. Such systems can organize and present data in various formats to accommodate the
various needs of different users. These systems are known as online analytical processing (OLAP)
systems.

The major distinguishing features of OLTP and OLAP are summarized as follows:

1. Users and system orientation: An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology professionals. An OLAP system is
market-oriented and is used for data analysis by knowledge workers, including managers, executives, and
analysts.

2. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used
for decision making. An OLAP system manages large amounts of historic data, provides facilities for
summarization and aggregation, and stores and manages information at different levels of granularity.

3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or a snowflake
model and a subject-oriented database design.

4. View: An OLTP system focuses mainly on the current data within an enterprise or department, without
referring to historic data or data in different organizations. In contrast, an OLAP system often spans
multiple versions of a database schema, due to the evolutionary process of an organization.

5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions.
However, accesses to OLAP systems are mostly read-only operations (because most data warehouses
store historic rather than up-to-date information).
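
The contrast in access patterns can be sketched in a few lines. The following is only an illustration using Python's built-in sqlite3 module; the table names (inventory, orders), the helper functions, and all values are hypothetical and do not come from the notes. The first function is a short, atomic read-write transaction of the OLTP kind; the second is a read-only aggregation of the OLAP kind.

```python
import sqlite3

# Hypothetical operational tables with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER NOT NULL);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT, qty INTEGER, order_year INTEGER);
    INSERT INTO inventory VALUES ('tv', 10), ('phone', 20);
    INSERT INTO orders (item, qty, order_year) VALUES
        ('tv', 2, 2022), ('phone', 5, 2022), ('tv', 1, 2023);
""")

def place_order(conn, item, qty, year):
    """OLTP-style access: a short, atomic read-write transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE inventory SET stock = stock - ? WHERE item = ?", (qty, item))
        conn.execute(
            "INSERT INTO orders (item, qty, order_year) VALUES (?, ?, ?)",
            (item, qty, year),
        )

def units_per_year(conn):
    """OLAP-style access: a read-only aggregation over historic rows."""
    rows = conn.execute(
        "SELECT order_year, SUM(qty) FROM orders GROUP BY order_year ORDER BY order_year"
    )
    return dict(rows.fetchall())

place_order(conn, "tv", 3, 2023)
stock_tv = conn.execute("SELECT stock FROM inventory WHERE item = 'tv'").fetchone()[0]
summary = units_per_year(conn)
```

The transaction touches a handful of rows and must either fully succeed or fully fail, whereas the summary query scans many rows but changes nothing.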
A Multidimensional Data Model

 Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form
of a data cube.
Data cube: A data cube allows data to be modelled and viewed in multiple dimensions. It is defined by
dimensions and facts.
Dimensions
 Dimensions represent perspectives or entities with respect to which an organization wants to keep
records.
 For example, AllElectronics may create a sales data warehouse to keep records of the store’s sales with
respect to the dimensions time, item, branch, and location. These dimensions allow the store to keep track
of things like monthly sales of items and the branches and locations at which the items were sold.
 Each dimension may have a table associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes item name, brand, and type.

Facts

 Facts are numeric measures, that is, the quantities by which we want to analyse relationships between
dimensions.
 Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold
(number of units sold), and amount budgeted.
 The fact table contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.

Figure: A 3-D data cube representation of the data in Table 4.3, according to time, item, and location. The
measure displayed is dollars sold (in thousands).
Figure: A 4-D data cube representation of sales data, according to time, item, location, and supplier. The measure
displayed is dollars sold (in thousands). For improved readability, only some of the cube values are shown.

In data warehousing, a data cube is n-dimensional. The n-D cuboid, which holds the lowest level of summarization,
is called the base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.

Figure: Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each cuboid
represents a different degree of summarization.
Schemas for Multidimensional Data Models

 A data warehouse requires a concise, subject-oriented schema that facilitates online data analysis.
 The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a
star schema, a snowflake schema, or a fact constellation schema.

Schema                       Dimension tables   Fact tables
Star                         N                  1
Snowflake                    N                  1
Fact constellation/Galaxy    N                  N

Star schema:
The most common modelling paradigm is the star schema, in which the data warehouse contains:
 a large central table (fact table) containing the bulk of the data, with no redundancy, and
 a set of smaller attendant tables (dimension tables), one for each dimension.
Example (star schema): A star schema for AllElectronics sales is shown in the following Figure.
Sales are considered along four dimensions: time, item, branch, and location. The schema contains a central fact
table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold, and
units sold.
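
As a rough illustration of this layout, the sketch below builds a small star schema with Python's built-in sqlite3 module. The four dimensions and two measures follow the example above; the table and column names (sales_fact, time_dim, item_type, and so on) and every data row are assumptions made up for the sketch, not the actual AllElectronics schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one per dimension, each with its own key.
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, item_type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Central fact table: keys to the four dimensions plus two measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );

    -- Made-up rows for illustration only.
    INSERT INTO time_dim VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
    INSERT INTO item_dim VALUES (1, 'LED TV', 'BrandA', 'home entertainment'),
                                (2, 'Laptop', 'BrandB', 'computer');
    INSERT INTO branch_dim VALUES (1, 'B1');
    INSERT INTO location_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
    INSERT INTO sales_fact VALUES
        (1, 1, 1, 1, 500.0, 2),
        (1, 2, 1, 1, 900.0, 1),
        (2, 1, 1, 2, 250.0, 1),
        (2, 2, 1, 2, 1800.0, 2);
""")

# A typical analytical query: join the fact table to the dimensions it needs,
# then aggregate the measures.
query = """
    SELECT i.item_type, l.country, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON f.item_key = i.item_key
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY i.item_type, l.country
    ORDER BY i.item_type, l.country
"""
totals = conn.execute(query).fetchall()
```

Each dimension is reached from the fact table with a single join, which is what gives the schema graph its star shape.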

Snowflake schema:
 The snowflake schema is a variant of the star schema model, where some dimension tables are normalized,
thereby further splitting the data into additional tables.
 The resulting schema graph forms a shape similar to a snowflake.
 The major difference between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form to reduce redundancies.
 Such tables are easy to maintain and save storage space, although queries may need more joins than against a
star schema.
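
To make the normalization step concrete, here is a minimal sketch in plain Python. It assumes a hypothetical item dimension that carries supplier details on every row, and splits it into an item table and a separate supplier table. The column names and rows are invented for illustration.

```python
# Star-style (denormalized) item dimension: supplier_type repeats on every row.
item_dim_star = [
    {"item_key": 1, "item_name": "LED TV", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "Laptop", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 3, "item_name": "Phone", "supplier_key": 20, "supplier_type": "retail"},
]

def snowflake_item_dim(rows):
    """Split the item dimension into an item table and a supplier table."""
    item_rows = []
    suppliers = {}
    for row in rows:
        item_rows.append(
            {
                "item_key": row["item_key"],
                "item_name": row["item_name"],
                "supplier_key": row["supplier_key"],  # foreign key to the supplier table
            }
        )
        # Each supplier is stored once, however many items refer to it.
        suppliers[row["supplier_key"]] = {
            "supplier_key": row["supplier_key"],
            "supplier_type": row["supplier_type"],
        }
    return item_rows, list(suppliers.values())

def supplier_type_of(item_key, item_rows, supplier_rows):
    """Looking up a supplier attribute now takes an extra join step."""
    by_key = {s["supplier_key"]: s for s in supplier_rows}
    item = next(r for r in item_rows if r["item_key"] == item_key)
    return by_key[item["supplier_key"]]["supplier_type"]

item_rows, supplier_rows = snowflake_item_dim(item_dim_star)
laptop_supplier_type = supplier_type_of(2, item_rows, supplier_rows)
```

The redundancy is gone (two supplier rows instead of three repeated values), at the cost of one more lookup per query.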

Fact constellation/Galaxy schema:


 Complex applications may require multiple fact tables to share dimension tables.
 It contains several fact tables, with the dimension tables arranged around them and shared among them.
 It can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
 Because the fact tables share dimension tables, the design of a fact constellation schema is more involved than
that of a single star schema.
 It is used for large size databases.
OLAP Operations:

 In the multidimensional model, data are organized into multiple dimensions.


 This organization provides users with the flexibility to view data from different perspectives.
 OLAP data cubes have various operations that help create different views of data. These operations
allow users to query and analyse the data interactively.
Example:
At the centre of the figure is a data cube for AllElectronics sales.
 The cube contains the dimensions location, time, and item, where location is aggregated with
respect to city values, time is aggregated with respect to quarters, and item is aggregated with
respect to item types.
 The measure displayed is dollars sold (in thousands). (For improved readability, only some of
the cube’s cell values are shown.) The data examined are for the cities Chicago, New York,
Toronto, and Vancouver.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on
a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Example: roll-up on location (from cities to countries)

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing
additional dimensions.
Example: drill-down on time (from quarters to months).

Slice and dice:


 The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
Example: A slice operation where the sales data are selected from the central cube for the dimension time
using the criterion time = “Q1.”
 The dice operation defines a subcube by performing a selection on two or more dimensions.
Example: A dice operation on the central cube based on the following selection criteria that involve three
dimensions: (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home
entertainment” or “computer”).

Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the data axes in view to provide an
alternative data presentation.
Example: A pivot operation where the item and location axes in a 2-D slice are rotated.
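
The operations above can be sketched on a toy cube in plain Python. This is a minimal illustration, not how an OLAP server implements them: the cube is assumed to be a dictionary keyed by (city, quarter, item type), the city-to-country mapping is a hand-written concept hierarchy, and all dollar values are made up.

```python
from collections import defaultdict

# Cube cells keyed by (city, quarter, item_type); values are dollars sold.
# All numbers are invented for illustration.
cube = {
    ("Chicago", "Q1", "computer"): 100,
    ("Chicago", "Q2", "computer"): 120,
    ("New York", "Q1", "computer"): 200,
    ("Toronto", "Q1", "computer"): 80,
    ("Toronto", "Q1", "home entertainment"): 60,
    ("Toronto", "Q2", "phone"): 30,
    ("Vancouver", "Q1", "home entertainment"): 90,
    ("Vancouver", "Q3", "computer"): 70,
}

# Concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

def roll_up_location(cube):
    """Roll-up: climb the location hierarchy from cities to countries."""
    result = defaultdict(int)
    for (city, quarter, item), dollars in cube.items():
        result[(CITY_TO_COUNTRY[city], quarter, item)] += dollars
    return dict(result)

def slice_time(cube, quarter):
    """Slice: select on one dimension (time), leaving a 2-D (city, item) subcube."""
    return {(city, item): v for (city, q, item), v in cube.items() if q == quarter}

def dice(cube, cities, quarters, items):
    """Dice: select on two or more dimensions at once."""
    return {
        key: v
        for key, v in cube.items()
        if key[0] in cities and key[1] in quarters and key[2] in items
    }

def pivot(slice_2d):
    """Pivot: swap the two axes of a 2-D slice."""
    return {(b, a): v for (a, b), v in slice_2d.items()}

by_country = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
diced = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"home entertainment", "computer"})
rotated = pivot(q1)
```

Drill-down is not shown as a function because it cannot be computed from the rolled-up result alone: it means navigating back to the finer-grained cells, here from by_country back to cube.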
Data Warehouse Architecture: A Multitiered Architecture

Data warehouses often adopt a three-tier architecture, as shown in Figure.

Bottom tier:
The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and
utilities are used to feed data into the bottom tier from operational databases or other external sources.
Operational databases: These are databases that store transactional data generated by day-to-day operations of
an organization. They are typically optimized for quick and efficient data processing.
External Sources: These are sources of data external to the organization, such as data from suppliers, partners, or
public sources.
Data Marts: These are smaller, specialized data warehouses that focus on specific departments, business units, or
subject areas within the organization.
Data Warehouse: This is the central repository that integrates data from operational databases and external
sources. It stores historical and aggregated data for analysis and reporting.
Metadata Repository: This stores metadata, which provides information about the structure, definitions, and
relationships of data stored in the data warehouse.
Monitoring and Administration: This involves tools and processes for monitoring and managing the
performance, security, and integrity of the data warehouse system.

Middle tier:
This tier is responsible for performing online analytical processing (OLAP) operations on the data stored in the data
warehouse. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP)
model or (2) a multidimensional OLAP (MOLAP) model.

Top tier:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools
(e.g., trend analysis, prediction, and so on) that allow end-users to interact with and analyse data from the data warehouse.

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and
the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all the information about subjects spanning the entire
organization. It provides corporate-wide data integration, usually from one or more operational systems or external
information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data and
can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may
be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive
business modelling and may take years to design and build.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is
confined to specific selected subjects.
For example, a marketing data mart may confine its subjects to customer, item, and sales.
The data contained in data marts tend to be summarized. Data marts are usually implemented on low-cost departmental
servers that are Unix/Linux or Windows based.

Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only
some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess
capacity on operational database servers.
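
A virtual warehouse can be pictured as nothing more than view definitions. The sketch below uses Python's built-in sqlite3 module with a hypothetical orders table and made-up rows. SQLite has no materialized views, so a table snapshot stands in for a materialized summary here; that substitution is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, item TEXT, amount REAL);
    INSERT INTO orders (city, item, amount) VALUES
        ('Toronto', 'tv', 500.0), ('Toronto', 'laptop', 900.0), ('Chicago', 'tv', 450.0);

    -- Virtual summary view: recomputed from the operational table on every query.
    CREATE VIEW sales_by_city AS
        SELECT city, SUM(amount) AS total FROM orders GROUP BY city;

    -- Stand-in for a materialized summary view: a stored snapshot of the result.
    CREATE TABLE sales_by_item_mat AS
        SELECT item, SUM(amount) AS total FROM orders GROUP BY item;
""")

# A new operational row arrives after the views were defined.
conn.execute("INSERT INTO orders (city, item, amount) VALUES ('Chicago', 'tv', 50.0)")

view_rows = dict(conn.execute("SELECT city, total FROM sales_by_city").fetchall())
mat_rows = dict(conn.execute("SELECT item, total FROM sales_by_item_mat").fetchall())
```

The plain view always reflects the latest operational data but puts its query load on the operational server, which is the excess capacity mentioned above. The snapshot answers quickly but is stale until refreshed.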

A recommended method for the development of data warehouse systems is to implement the warehouse in an
incremental and evolutionary manner.
From Data Warehousing to Data Mining
Data warehouse usage:
Data warehouses and data marts are used in a wide range of applications. Business executives use the data in data
warehouses and data marts to perform data analysis and make strategic decisions.
There are three kinds of data warehouse applications: information processing, analytical processing, and data mining.
Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or
graphs.
Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It
generally operates on historic data in both summarized and detailed forms. The major strength of online analytical
processing over information processing is the multidimensional data analysis of data warehouse data.
Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization tools.

Online Analytical Processing to Online Analytical Mining


Among the many different paradigms and architectures of data mining systems, multidimensional data mining is
particularly important for the following reasons:
 High quality of data in data warehouses
 Available information processing infrastructure surrounding data warehouses
 OLAP-based exploration of multidimensional data
 Online selection of data mining functions

An OLAM System Architecture:


 An OLAM server performs analytical mining in the same way that an OLAP server performs analytical processing. An
integrated OLAM and OLAP architecture is shown below.
 Both OLAM and OLAP servers accept user queries via a GUI API and work with the data cube via a cube API.
 A metadata directory is used to guide access to the data cube.
 The data cube is constructed by accessing and/or integrating multiple databases via an MDDB API, which may support
OLE DB or ODBC connections.
 An OLAM server may perform multiple data mining tasks. It therefore usually consists of several integrated data
mining modules and is more sophisticated than an OLAP server.

Data Warehouse Implementation


 Data warehouses contain huge volumes of data.
 OLAP servers demand that decision support queries be answered in the order of seconds.
 Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques,
access methods, and query processing techniques.

Efficient Data Cube Computation: An Overview


 Multidimensional Analysis Basics: Multidimensional data analysis involves computing aggregations across
various sets of dimensions.
 In SQL terms, these aggregations are referred to as group-by’s.
 Each group-by can be represented by a cuboid, where the set of group-by’s forms a lattice of cuboids
defining a data cube.

Example 4.6 A data cube is a lattice of cuboids. Suppose that you want to create a data cube for
AllElectronics sales that contains the following: city, item, year, and sales in dollars.
You want to be able to analyze the data, with queries such as the following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”
What is the total number of cuboids, or group-by’s, that can be computed for this data cube?
Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in
dollars as the measure, the total number of cuboids, or group-by’s, that can be computed for this data
cube is $2^3 = 8$.
The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city),
(item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not grouped).
These group-by’s form a lattice of cuboids for the data cube, as shown in Figure 4.14.
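
The count of cuboids can be checked by enumerating every subset of the dimensions. The sketch below does this in plain Python with itertools.combinations; the function names and the three sales rows are hypothetical and only serve to show one aggregation per cuboid.

```python
from itertools import combinations

def cuboids(dimensions):
    """List every group-by (subset of dimensions), from the base cuboid down to the apex ()."""
    dims = tuple(dimensions)
    return [combo for k in range(len(dims), -1, -1) for combo in combinations(dims, k)]

def aggregate(rows, group_dims):
    """Compute the sum of sales for one cuboid, i.e. one group-by."""
    totals = {}
    for row in rows:
        key = tuple(row[d] for d in group_dims)
        totals[key] = totals.get(key, 0) + row["sales"]
    return totals

# Made-up sales rows for illustration.
rows = [
    {"city": "Toronto", "item": "tv", "year": 2022, "sales": 100},
    {"city": "Toronto", "item": "laptop", "year": 2022, "sales": 200},
    {"city": "Chicago", "item": "tv", "year": 2023, "sales": 50},
]

lattice = cuboids(["city", "item", "year"])
by_city = aggregate(rows, ("city",))
apex = aggregate(rows, ())
```

With three dimensions the lattice has eight entries, and the empty group-by () yields a single grand total, which is the apex cuboid.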
Types of OLAP
There are three main types of OLAP servers:
ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.

Relational OLAP (ROLAP) Server


These are intermediate servers that stand between a relational back-end server and user
front-end tools.

They use a relational or extended-relational DBMS to store and manage warehouse data, and
OLAP middleware to support missing pieces.

ROLAP servers include optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP slicing and dicing functionality. In essence, each slicing or
dicing action is equivalent to adding a "WHERE" clause to the SQL statement.
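
The idea that a slice or dice is just a WHERE clause can be sketched as follows, using Python's built-in sqlite3 module. The sales table, the select_cells helper, and all rows are hypothetical; a real ROLAP server generates far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical relational table holding the cube's base data.
    CREATE TABLE sales (city TEXT, quarter TEXT, item_type TEXT, dollars_sold REAL);
    INSERT INTO sales VALUES
        ('Toronto', 'Q1', 'computer', 80.0),
        ('Toronto', 'Q2', 'computer', 95.0),
        ('Vancouver', 'Q1', 'home entertainment', 90.0),
        ('Chicago', 'Q1', 'computer', 100.0);
""")

ALLOWED_COLUMNS = {"city", "quarter", "item_type"}

def select_cells(conn, **criteria):
    """Translate slice/dice criteria into a WHERE clause.

    Each keyword maps a dimension column to a list of accepted values.
    """
    clauses, params = [], []
    for column, values in criteria.items():
        if column not in ALLOWED_COLUMNS:  # never interpolate unknown column names
            raise ValueError(f"unknown dimension: {column}")
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(values)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = (
        "SELECT city, quarter, item_type, dollars_sold FROM sales "
        f"WHERE {where} ORDER BY city, quarter"
    )
    return conn.execute(sql, params).fetchall()

# Slice: one dimension restricted. Dice: two dimensions restricted.
slice_q1 = select_cells(conn, quarter=["Q1"])
dice_rows = select_cells(conn, city=["Toronto", "Vancouver"], quarter=["Q1", "Q2"])
```

A slice adds one condition and a dice adds one condition per restricted dimension, joined with AND.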
Relational OLAP Architecture
ROLAP Architecture includes the following components

o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational tables
to be created, avoiding the need to structure records around the desired view.

Some products in this segment support robust SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being RDBMS-aware, and being able to generate SQL statements
tuned to the optimizer of the DBMS engine.

Advantages
Can handle large amounts of information: The data size limit of ROLAP technology
depends on the data size limit of the underlying RDBMS, so ROLAP itself does not restrict the
amount of data.

Can leverage RDBMS functionality: An RDBMS already comes with many features, so ROLAP
technologies, which work on top of the RDBMS, can leverage these functionalities.

Disadvantages
Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL queries) against the
relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies on generating SQL
statements to query the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server


A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically into multidimensional arrays, and positional
techniques are used to access them.

One significant distinction between MOLAP and ROLAP is that in MOLAP the data are summarized and
stored in an optimized format in a multidimensional cube, instead of in a relational database.
In the MOLAP model, data are structured into proprietary formats according to the client's reporting
requirements, with the calculations pre-generated on the cubes.
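
Array storage with positional access can be illustrated with a small dense cube. The sketch below is plain Python under stated assumptions: the dimension members, the row-major layout, and the per-city totals computed at build time are all choices made for this example, and the values are invented. Real MOLAP engines use proprietary formats and also handle sparsity.

```python
CITIES = ["Chicago", "Toronto"]
QUARTERS = ["Q1", "Q2"]
ITEMS = ["computer", "phone"]

# Map each dimension member to its position along that axis.
city_pos = {name: i for i, name in enumerate(CITIES)}
quarter_pos = {name: i for i, name in enumerate(QUARTERS)}
item_pos = {name: i for i, name in enumerate(ITEMS)}

def offset(city, quarter, item):
    """Positional access: compute the cell's index in a row-major flat array."""
    return (city_pos[city] * len(QUARTERS) + quarter_pos[quarter]) * len(ITEMS) + item_pos[item]

def build_cube(facts):
    """Load facts into a dense array and pre-generate per-city totals at build time."""
    cells = [0.0] * (len(CITIES) * len(QUARTERS) * len(ITEMS))
    for city, quarter, item, dollars in facts:
        cells[offset(city, quarter, item)] += dollars
    block = len(QUARTERS) * len(ITEMS)  # number of cells belonging to one city
    city_totals = [sum(cells[i * block:(i + 1) * block]) for i in range(len(CITIES))]
    return cells, city_totals

# Made-up facts for illustration.
facts = [
    ("Chicago", "Q1", "computer", 100.0),
    ("Toronto", "Q2", "phone", 30.0),
    ("Toronto", "Q1", "computer", 80.0),
]

cells, city_totals = build_cube(facts)
toronto_q2_phone = cells[offset("Toronto", "Q2", "phone")]
toronto_total = city_totals[city_pos["Toronto"]]
```

No search or join is needed to read a cell: its address follows directly from the member positions, and the aggregate is read rather than computed at query time.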

MOLAP Architecture
MOLAP Architecture includes the following components

o Database server.
o MOLAP server.
o Front-end tool.

MOLAP structure primarily reads the precompiled data. MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results which have not been pre-
calculated and stored.

Applications requiring iterative and comprehensive time-series analysis of trends are well suited
for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's
Lightship Server, Sinper's TM/1, Planning Sciences' Gentium, and Kenan Technology's Multiway.

Some of the problems faced by clients are related to maintaining support for multiple subject
areas in an RDBMS. Some vendors address these problems by providing continued access from
MOLAP tools to detailed data in an RDBMS.

This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture
that contains multiple subject areas.

An example would be the creation of sales data measured by several dimensions (e.g., product
and sales region) to be stored and maintained in a persistent structure. This structure would be
provided to reduce the application overhead of performing calculations and building aggregation
during initialization. These structures can be automatically refreshed at predetermined intervals
established by an administrator.

Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only possible, but they return quickly.

Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed
when the cube is built, it is not possible to contain a large amount of data in the cube itself.

Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are other
investments in human and capital resources are needed.

Hybrid OLAP (HOLAP) Server


HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture.
HOLAP systems keep the larger quantities of detailed data in relational tables, while
the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from the
cube down to the relational tables for detailed data. Microsoft SQL Server
2000 provides a hybrid OLAP server.
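
The split between stored aggregates and relational detail can be sketched as follows. This is a toy illustration only: a Python dictionary plays the role of the pre-calculated cube, an in-memory SQLite table (via the built-in sqlite3 module) plays the role of the relational store, and the table, function names, and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detail records stay in the relational database.
    CREATE TABLE sales_detail (order_id INTEGER PRIMARY KEY, city TEXT, item TEXT, dollars_sold REAL);
    INSERT INTO sales_detail (city, item, dollars_sold) VALUES
        ('Toronto', 'tv', 500.0), ('Toronto', 'laptop', 900.0), ('Chicago', 'tv', 450.0);
""")

# Aggregations are computed once and kept in the "cube" (here, a dictionary).
aggregate_cube = dict(
    conn.execute("SELECT city, SUM(dollars_sold) FROM sales_detail GROUP BY city").fetchall()
)

def total_for(city):
    """Summary query: answered from the pre-calculated cube, no SQL needed."""
    return aggregate_cube[city]

def drill_through(city):
    """Detail query: drill through from the cube to the relational table."""
    return conn.execute(
        "SELECT item, dollars_sold FROM sales_detail WHERE city = ? ORDER BY order_id",
        (city,),
    ).fetchall()

toronto_total = total_for("Toronto")
toronto_detail = drill_through("Toronto")
```

Only the aggregates are duplicated outside the database; the detail rows exist in one place and are fetched on demand.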
Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server and the detail record remains in the relational database. So no duplicate copy of the
detail record is maintained.

Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
