NOTES
on
DATA WAREHOUSING AND DATA MINING
MCA-III Semester
By:
Dr. Asha Ambhaikar
Professor & Dean Students Welfare
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 1
Kalinga University, Naya Raipur
Data Warehousing & Data Mining
MCA 305 External Marks:70 Internal Marks: 30
UNIT- I
The Compelling Need for data warehousing: Escalating Need for strategic information, failures of Past
decision-support systems, operational versus decision-support systems, data warehousing – the only viable
solution, data warehouse defined Data warehouse – The building Blocks: Defining Features, data warehouses
and data marts, overview of the components, metadata in the data warehouse Defining the business
requirements: Dimensional analysis, OLAP operations : Drilldown and roll-up, slice-and-dice or rotation.
UNIT- II Principles of dimensional modeling: , the STAR schema, STAR Schema Keys, Advantages of the STAR
Schema Dimensional Modeling: Updates to the Dimension tables, miscellaneous dimensions, the snowflake
schema, aggregate fact tables, families of STARS ,Steps for the Design & Construction of Data warehouse :
Framework , Architecture , Type of OLAP Servers : ROLAP , MOLAP , Data warehouse implementation tolls &
techniques.
UNIT- III Data Mining, Data Mining of what kind of Data , Knowledge discovery process (KDD) , What kind of
patterns can be mined , OLAP versus data mining, data mining and the data warehouse, Data mining
functionalities, classification Systems , Data processing : Cleaning , Integration & transformation, Reduction .
Data Mining primitives: What defines a Data Mining Task.
UNIT- IV Data Mining Query language (DMQL), Cluster Analysis : Partitioning , Hierarchical Density , Grid &
Model based methods., Major Data Mining Techniques, Cluster detection, decision Trees, memory-based
reasoning, link analysis, neural networks, genetic algorithms, moving into data mining, Data Mining
Applications, Benefits of data mining & applications.
Text books: 1. Paul Raj Poonia, “Fundamentals of Data Warehousing”, John Wiley & Sons, 2003. 2. W. H.
Inmon, “Building the Operational Data Store”,2nd Ed., John Wiley, 1999 3. Sam Anahony, “Data Warehousing
in the Real World: A Practical Guide for Building Decision Support Systems”, John Wiley, 2004. 4. Jarke,
“Fundamentals of Data Warehouse”, Springer
References books: 1. Kamber and Han, “Data Mining Concepts and Techniques”, Hartcourt India P. Ltd.,
2001. 2. G. K. Gupta, “Introduction to Data Mining with Case Studies”, PHI, 2006. 3. A. B. M. Shawkat Ali,
Saleh A. Wasimi, “Data Mining Methods and Techniques”, Cengage Learning, 2009. 4. Pang - Ning, Michael-
Steinbach, “Introduction to Data Mining”, Pearson, 4th Ed., 2009.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 2
UNIT-I
COMPILING NEED FOR DATA WAREHOUSING
What is Data Warehousing?
A process of transforming
Information data into information
and making it available to
users in a timely enough
manner to make a
difference
Data
Technique for assembling and managing data from various sources for the
purpose of answering business questions. Thus making decisions that were
not previous possible
A decision support database maintained separately from the organization’s
operational database
“A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Subject Oriented
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 3
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
Integrated:
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
Time variant:
The time horizon for the data warehouse is significantly longer than that of
operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time element”.
Non volatile:
A physically separate store of data transformed from the operational
environment.
Operational update of data does not occur in the data warehouse
environment.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 4
Does not require transaction processing, recovery, and concurrency
control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
Very large Databases:
Terabytes -- 10^12 bytes: Walmart-- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Zottabytes -- 10^24 bytes: Intelligence Agency Videos
Need for Data Ware House:
A Data Warehouse is the necessary foundation for a true Business Intelligence
solution. Almost all sophisticated, mature BI systems are based on one. The
alternative to creating a data warehouse is to use data accessed directly from the
applications that create it.
Reporting and analysis can only be done properly by experts.
A Rapid Decision data warehouse eliminates all these problems while offering many
other advantages.
Evolution of Data Warehouse:
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
Build on top of heterogeneous databases
Query driven approach
When a query is posed to a client site, a meta-dictionary is used
to translate the query into queries appropriate for individual
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 5
heterogeneous sites involved, and the results are integrated into
a global answer set
Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis
Benefits of Data warehouse:
Better Information
Better Strategies and plans
Better tactics and decisions
More efficient processed
Time saving
Reduction in paper reporting
Increase in knowledge worker productivity
Supports all decision makers’ data requirements
Provide ready access to critical data
Insulates operation databases from ad hoc processing
Provides high-level summary information
Provides drill down capabilities
Yields
Improved business knowledge
Competitive advantage
Enhances customer service and satisfaction
Facilitates decision making
Help streamline business processes
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 6
Goals of Data Warehouse:
Makes an organization’s information accessible.
Makes the organization’s information consistent.
Is an adaptive and durable source of information
Is a secure support that protects the
organization’s information asset
Is the foundation for decision making
Data Warehouse vs. Operational DBMS:
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 7
OLTP vs Data Warehouse
OLTP Warehouse (DSS)
Application Oriented Subject Oriented
Used to run business Used to analyze business
Detailed data Summarized and refined
Current up to date Snapshot data
Isolated Data Integrated Data
Repetitive access Ad-hoc access
Clerical User Knowledge User
(Manager)
OLTP Data Warehouse
Performance Sensitive Performance relaxed
Few Records accessed at a Large volumes accessed at
time (tens) a time(millions)
Mostly Read (Batch
Read/Update Access Update)
Redundancy present
No data redundancy Database Size 100
Database Size 100MB - GB - few terabytes
100 GB
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 8
To summarize ...
OLTP Systems are
used to “run” a
business
The Data Warehouse
helps to “optimize” the
business
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, and consolidation.
Need of Data Warehouse:
missing data: Decision support requires historical data which
operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled.
What are the various Trends in Data Warehouse?
There are Three Complementary Trends in Data Warehouse:
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 9
Data Warehousing
_ Consolidate data from many sources in one large repository.
– Loading, periodic synchronization of replicas.
– Semantic integration.
OLAP:
– Complex SQL queries and views.
– Queries based on spreadsheet-style operations and multidimensional” view
of data.
– Interactive and “online” queries.
Data Mining
_ Exploratory search for interesting
trends and anomalies.
Apart from above basic Trends, In the early stages, four significant factors drove
many companies to move into data warehousing:
violent competition
Government deregulation
Need to restore internal processes
Imperative for customized marketing
Significant Factors:
These significant factors reflect the new
trends in data warehousing:
Multiple Data Types
Data Visualization
Parallel Processing
Query Tools
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 10
Browser Tools
Data Fusion
Multidimensional Analysis
Agent Technology
E-Business- ERP, KM, CRM
Basic Elements of Data warehouse:
Source System
Staging Area
Presentation Area
End User Data Access Tools
Metadata
Data warehouse Architecture
Monitor
& OLAP Server
other Metadata
sources Integrator
Analysis
Operational Extract Query
Transform Data Serve Reports
DBs Load
Refresh
Warehouse Data mining
Data Marts
Data Sources Data Storage OLAP Engine Front-End Tools
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 11
Working of Data Warehouse:
Bottom Layer:
The bottom layer is a DW database servers that is almost always a relational
database system
Data from operational databases and external sources are extracted using
application program interfaces known as gateways
It is supported by primary system
It has repository that is metadata (data about data)
Which is responsible for extracting the information from DW according to the
queries given by the end users
Metadata is the bridge between DW and the DSS
It provides logical linkage between data and application
Metadata can pinpoint access to information across the entire DW and can
unable the development of applications which automatically updates
themselves to reflect data warehouse content change.
Middle Layer:
The middle layer consists of OLAP server
OLAP means On Line Analytical Processing
It is used to perform analysis on data
and transform it in to useful information for decision making
OLAP is a continuously iterative process
OLAP servers are implemented by either ROLAP,MOLAP or HOLAP
TOP Layer:
The top layer is a client
That is the end user
It consists of
1.query and reporting tools
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 12
2.Analysis tools and
3. Data Mining Tools
It acts as an interface between the user and the server
This layer takes queries from the users
And then send it to the servers
Receiving information records back and
Gives them as output to the end users.
Eg. Analysis of weather forecasting, predictions and so on.
DEFINING THE BUSINESS REQUIREMENTS:
A requirements definition for the data warehouse can, therefore, be
based on business dimensions such as product, geography, time, and
promotion.
Information packages are the backbone of the requirements definition.
An information package records the critical measurements or facts and
business dimensions along which the facts are normally analyzed.
Interviews and group sessions are standard methods for collecting
requirements.
Key people to be interviewed or to be included in group sessions are
senior executives (including the sponsors), departmental managers,
business analysts, and operational systems DBAs.
Review all existing documentation of related operational systems.
Scope and content of the requirements definition document include
data sources, data transformation, data storage, information delivery,
and information package diagrams.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 13
Project Planning and Management:
Objectives:
Review the essentials of planning for Data Warehouse
While planning a data Warehouse key issues to be considered include:
setting proper expectations, assessing risk, deciding between Top-
down and Bottom-up approaches, choosing from vendor solution.
Business requirement not technology must drive your project.
The data warehouse project without the full support of the top
management and without strong, enthusiastic and executive
sponsored.
Data warehouse projects are different from OLTP projects.
Discuss the project team, organization’s role and responsibilities.
Participation of the user is mandatory for success of the Data
Warehouse project. Users can participate in a variety of ways.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 14
Consider the warning sign and success factors in the final analysis;
adopt a practical approach to built a successful data warehouse.
Project Manager: It is a serious mistake to have a project manager who
is more technology-oriented than user-oriented and business-oriented.
New Paradigm: Data warehousing is new for most companies;
innovative project management methods are essential to deal with the
unexpected challenges.
Team Roles: Team roles are not to be assigned arbitrarily; the roles
must reflect the needs of each individual data warehouse project.
Data Quality: Three critical aspects of data in the data warehouse are:
quality, quality and quality.
User Requirements: Although obvious, user requirements alone form
the driving force of every task on the project schedule.
Building for Growth: Number of users and number of queries shoot up
very quickly after deployment; data warehouses not built for growth will
fall to pieces.
Project Politics: The first data warehouse project in a company poses
challenges and threats to users at different levels; trying to handle
project politics is like walking the known tightrope, to be compressed
with extreme caution.
Realistic Expectations: It is easy to promise the world in the first data
warehouse project; setting expectations at the right and attainable
levels is the best course.
Dimensional Data Modeling: A well-designed dimensional data model is
a required foundation and blueprint.
External Data: A data warehouse does not live by internal data alone;
data from relevant external sources is an absolutely necessary
ingredient.
Training: Data warehouse user tools are different and new. If the users
do not know how to use the tools, they will not use the data warehouse.
An unused data warehouse is a failed data warehouse.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 15
Fig. 1.2 Data Warehouse Deployment Phases
Dimensional Analysis (Multidimensional Data Model)
A data warehouse is based on a multidimensional data model which views data in the form
of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D
cuboid, which holds the highest-level of summarization, is called the apex cuboid. The
lattice of cuboids forms a data cube.
Multidimensionality:
3-D + Spreadsheets (OLAP has this)
Data can be organized the way managers like to see them, rather than the way that the
system analysts do
Different presentations of the same data can be arranged easily and quickly
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 16
Dimensions: products, salespeople, market segments, business units, geographical
locations, distribution channels, country, or industry
Measures: money, sales volume, head count, inventory profit, actual versus forecast
Time: daily, weekly, monthly, quarterly, or yearly
Multidimensional Data
Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region Year
Category Country Quarter
Product
Product City Month Week
Office Day
Month
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 17
A Sample Data Cube
Total annual sales
Date of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
TV
PC U.S.A
VCR
Country
sum
Canada
Mexico
sum
Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
product date country
1-D cuboids
product,date product,country date, country
2-D cuboids
3-D(base) cuboid
product, date, country
OLAP Operations
OLAP means On Line Analytical Processing.
It is used to perform analysis on data and transform it into information for decision making
purpose.
OLAP is a continuous iterative process.
A common operation is to aggregate a measure over one or more dimensions.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 18
– Find total sales.
– Find total sales for each city, or for each state.
– Find top five products ranked by total sales.
Typical OLAP Operations
Roll up (drill-up): summarize data
Drill down (roll down): reverse of roll-up
Slice and dice:
project and select
Pivot : rotate
Roll-up and Drill Down
Higher Level of
Aggregation
Sales Channel
Region
Country
State
Location Address
Sales Representative
Low-level
Details
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 19
Slicing and Dicing
Product The Telecomm Slice
Household
Telecomm
Video Europe
Far East
Audio India
Retail Direct Special Sales Channel
A Visual Operation: Pivot
(Rotate)
Juice
Cola 10
Milk 47
Crea 30
m 12 Product
3/1 3/2 3/3 3/4
Date
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 20
This operation performs aggregation on the data cube, either by climbing up a
concept of hierarchy for a dimension or by dimension reduction.
When roll up is performed by dimension reduction, one or more dimensions are
removed from the given cube.
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or introducing
new dimensions
It navigates from less detailed data to more detailed data.
This can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions
Slice and dice:
project and select
The slice operation performs a selection on one dimension of the given cube resulting
in a sub cube
The dice operation defines a sub cube by performing a selection on two or more
dimensions
Pivot (rotate):
It is visualization operation that rotates the data axes in new view in order to provide
an alternative presentation of the data.
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: Executes queries involving (across) more than one fact table
drill through: Operation uses relational SQL facilities to drill through the bottom level
of the data cube to its back-end relational tables
OLAP Is FASMI
Fast
Analysis to
Share
Multidimensional
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 21
Information
--------------------
Important Questions:
1. What is Data Warehouse? Explain in detail.
2. What are the goals of Data Warehouse?
3. What is the need of Data Warehouse?
4. What are the benefits of Data warehouse?
5. Write the Data Warehouse applications with its storage capacities.
6. Explain an Architecture of Data Warehouse with neat diagram.
7. What are the various elements of Data warehouse?
8. Explain and give the Business requirements of Data Warehouse.
9. How to plan for Data warehouse?
10. What is Multidimensional data model? Explain with neat diagram.
11. What are the various OLAP operations? Explain each with neat diagram.
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 22