BIA - Unit 2 Notes
Unit II
BI: DATA MINING & WAREHOUSING
SYLLABUS: Data Mining - Introduction to Data Mining, Architecture of Data Mining and How Data Mining Works (Process), Functionalities & Classifications of Data Mining, Representation of Input Data, Analysis Methodologies.
Data Warehousing - Introduction to Data Warehousing, Data Mart,
Online Analytical Processing (OLAP) – Tools, Data Modelling, Difference between OLAP and OLTP, Schema
– Star and Snowflake Schemas,
ETL Process – Role of ETL
©Dr.Preeti Gupta
GALGOTIAS COLLEGE OF ENGINEERING AND TECHNOLOGY
Knowledge Park-II, Greater Noida, U.P.
NOTE: Data mining is a crucial step in turning data into knowledge. By applying advanced algorithms, it
allows organizations to make informed, data-driven decisions.
Main Components of a Data Mining System
Knowledge Base
o Stores domain knowledge, constraints, and rules to guide the mining process.
o Helps improve accuracy and relevance of results.
How Data Mining Works (Process)
1. Data Cleaning
o Remove noise, errors, and missing values.
o Example: Handling null values in customer records.
2. Data Integration
o Combine data from multiple sources.
o Example: Merging sales data from online and physical stores.
3. Data Selection
o Choose relevant subsets of data for analysis.
o Example: Selecting sales data for the last 3 years only.
4. Data Transformation
o Convert data into a suitable format (normalization, aggregation, encoding).
o Example: Scaling values between 0–1 for machine learning models.
5. Data Mining
o Apply algorithms to extract patterns.
o Example: Clustering customers into groups based on spending habits.
6. Pattern Evaluation
o Identify truly interesting, useful, and valid patterns.
o Example: Filtering out rules that have low confidence or support.
7. Knowledge Presentation
o Present discovered patterns using visualization (graphs, dashboards, reports).
o Example: Displaying customer segments in pie charts.
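As a small illustration, steps 1 (Data Cleaning) and 4 (Data Transformation) above can be sketched in plain Python; the customer spend values here are made up for illustration:

```python
# Illustrative (made-up) customer spend records with missing values
records = [120.0, None, 300.0, 80.0, None, 200.0]

# Step 1 (Data Cleaning): replace missing values with the mean of the known values
known = [v for v in records if v is not None]
mean = sum(known) / len(known)
cleaned = [v if v is not None else mean for v in records]

# Step 4 (Data Transformation): min-max normalization to the 0-1 range
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]

print(normalized)  # every value now lies between 0 and 1
```

Real pipelines use library routines (e.g., imputation and scaling utilities) rather than hand-rolled loops, but the underlying arithmetic is the same.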
Functionalities of Data Mining
Data mining functionalities specify the kinds of patterns that can be discovered. These are the core tasks that data mining systems perform.
A. Concept/Class Description
Summarization of data and characteristics of a class.
Techniques:
o Characterization → general features of a class (e.g., profile of "high-value customers").
o Discrimination → compare features of different classes (e.g., loyal vs. non-loyal customers).
B. Association Analysis
Discovers frequent itemsets and association (co-occurrence) rules among data items.
Example: Market basket analysis, e.g., "customers who buy bread also buy butter."
C. Classification
Assigns objects into predefined categories (supervised learning).
Example: Classifying emails as spam or not spam.
D. Prediction
Predicts unknown or future values.
Example: Predicting next month’s sales revenue.
E. Clustering
Groups similar objects into clusters without predefined labels (unsupervised learning).
Example: Segmenting customers based on purchasing behavior.
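A minimal one-dimensional k-means sketch (k = 2) of the clustering idea above; the spend values are toy numbers chosen so that two natural groups exist:

```python
# Toy customer spend values: a low-spend group and a high-spend group
spend = [20, 25, 22, 300, 310, 295]

# Initialize the two centroids at the extremes of the data range
c = [min(spend), max(spend)]
for _ in range(10):  # a few refinement iterations suffice here
    clusters = [[], []]
    for v in spend:
        # assign each value to its nearest centroid
        clusters[0 if abs(v - c[0]) <= abs(v - c[1]) else 1].append(v)
    # recompute each centroid as the mean of its cluster
    c = [sum(g) / len(g) for g in clusters]

print(clusters)  # low spenders vs. high spenders
```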
F. Outlier/Anomaly Detection
Identifies data objects that deviate significantly from the rest.
Example: Fraud detection in banking transactions.
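One common way to detect such deviations is a z-score test: flag values that lie far from the mean. A sketch with made-up transaction amounts:

```python
import statistics

# Toy transaction amounts; one value deviates sharply (a possible fraud signal)
amounts = [52, 48, 55, 50, 47, 53, 49, 51, 900]

mu = statistics.mean(amounts)
sigma = statistics.stdev(amounts)

# Flag values more than 2 standard deviations from the mean
outliers = [a for a in amounts if abs(a - mu) > 2 * sigma]
print(outliers)
```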
Classifications of Data Mining
Data mining can be classified into different categories based on the type of knowledge to be discovered, the type of data to be mined, and the techniques applied.
NOTE:
Data mining systems can be classified by the type of data, the type of knowledge, the methods used, and the application area:
Data → Relational, Transactional, Text, Web, Multimedia, Spatial, Time-series.
Knowledge → Descriptive vs. Predictive.
Methods → ML-based, Statistical, Database-oriented.
Application → Business, Scientific, Engineering.
Representation of Input Data
Data can be represented in various forms depending on the application, source, and mining task.
2. Transaction Data
Represents transactions (sets of items/events).
Each transaction = ID + list of items.
Mostly used in market basket analysis (association rule mining).
Example: Supermarket Transactions
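Transaction data of this kind can be represented as a mapping from transaction IDs to item lists; the transactions below are made up for illustration:

```python
# Illustrative transaction data: each transaction = ID + list of items
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "butter", "milk"],
    "T300": ["butter", "eggs"],
    "T400": ["bread", "butter"],
}

# Support of an itemset = fraction of transactions containing all of its items
def support(itemset):
    hits = sum(1 for items in transactions.values() if set(itemset) <= set(items))
    return hits / len(transactions)

print(support(["bread", "butter"]))  # 2 of 4 transactions contain both -> 0.5
```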
o Example (Image Data):
Image = 3 × 3 grayscale pixels
0 120 255
60 200 180
30 90 150
Analysis Methodologies
Analysis methodologies are the approaches, strategies, or models used to extract knowledge from data. They define how data is analyzed depending on the goal (prediction, description, clustering, etc.).
1. Descriptive Analysis
Objective → Describe patterns & relationships in existing data.
Focus → What happened?
Methods:
o Clustering (grouping similar data)
o Association Rules (finding co-occurrence patterns, e.g., Market Basket Analysis)
o Summarization (producing compact descriptions of data)
Example → “Customers who buy laptops also buy laptop bags.”
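The laptop example above is an association rule; its strength is measured by confidence (the fraction of laptop buyers who also bought a bag). A sketch over made-up purchase data:

```python
# Toy purchase baskets (sets of items bought together)
purchases = [
    {"laptop", "laptop bag"},
    {"laptop", "mouse"},
    {"laptop", "laptop bag", "mouse"},
    {"mouse"},
]

# Confidence of the rule "laptop -> laptop bag":
# among baskets containing a laptop, how many also contain a bag?
with_laptop = [p for p in purchases if "laptop" in p]
with_both = [p for p in with_laptop if "laptop bag" in p]

confidence = len(with_both) / len(with_laptop)
print(confidence)  # 2 of 3 laptop buyers also bought a bag
```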
2. Predictive Analysis
Objective → Forecast future outcomes based on historical data.
Focus → What will happen?
Methods:
o Classification (assign items into predefined classes, e.g., spam filtering)
o Regression (predict continuous values, e.g., sales forecasting)
o Time-Series Analysis (predict trends over time, e.g., stock prices)
Example → “Next month’s sales are expected to increase by 10%.”
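A forecast like the one above can come from a regression model. A least-squares trend fit in plain Python, using invented monthly sales figures:

```python
# Toy historical data: month index and sales for that month
months = [1, 2, 3, 4, 5]
sales = [100, 110, 118, 131, 140]

# Ordinary least-squares fit of sales = slope * month + intercept
n = len(months)
mx = sum(months) / n
my = sum(sales) / n
num = sum((x - mx) * (y - my) for x, y in zip(months, sales))
den = sum((x - mx) ** 2 for x in months)
slope = num / den
intercept = my - slope * mx

# Predict sales for month 6 by extrapolating the fitted line
forecast = slope * 6 + intercept
print(round(forecast, 1))
```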
3. Diagnostic Analysis
Objective → Explain causes behind past events.
Focus → Why did it happen?
Methods:
o Correlation analysis
o Causality analysis
o Drill-down and OLAP operations
Example → “Sales dropped because of reduced advertising and seasonal changes.”
4. Exploratory Analysis
Objective → Discover unknown or hidden patterns without predefined hypotheses.
Focus → What patterns exist?
Methods:
o Clustering (discover natural groups)
o Anomaly detection (identify outliers/frauds)
Example → Detecting unusual credit card transactions indicating fraud.
Summary Table
Introduction to Data Warehousing
In simple terms:
A database is for running the business (daily transactions).
A data warehouse is for analyzing the business (decision-making).
Main Components of a Data Warehouse
1. Data Sources
o Operational databases (OLTP), external data, flat files, sensors, etc.
2. ETL Process (Extract, Transform, Load)
o Extract → data pulled from multiple sources.
o Transform → clean, standardize, and integrate data.
o Load → move processed data into the warehouse.
3. Data Warehouse Database
o Central repository storing detailed and summarized data.
4. Metadata
o "Data about data" → definitions, mappings, transformations.
5. Access Tools
o Reporting tools, OLAP (Online Analytical Processing), dashboards, data mining applications.
Data Mart
A Data Mart is a subset of a Data Warehouse focused on a single subject area or department. It is smaller, more focused, and faster to implement than a full Data Warehouse.
Used mainly by a particular group of users (e.g., Sales, Marketing, Finance).
Think of a Data Warehouse as a supermarket (everything is available), and a Data Mart as a specialized
shop (only specific items).
Hybrid Data Mart:
o Uses both the DW and operational sources.
o Combines benefits of dependent and independent approaches.
o Example: A Retail Data Mart combining sales data from EDW and real-time POS systems.
Layers:
1. Data Sources → Operational databases, DW, external files.
2. ETL Layer → Extract, Transform, Load data into the mart.
3. Data Mart Storage → Stores subject-specific data.
4. Access Layer → OLAP, dashboards, reports for end users.
Examples of Data Marts:
Finance Data Mart → Budgeting, forecasting, cost analysis.
Marketing Data Mart → Customer segmentation, campaign effectiveness.
HR Data Mart → Employee performance, recruitment analysis.
What is OLAP?
OLAP (Online Analytical Processing) is a set of tools and technologies that allow users to analyze
multidimensional data interactively from different perspectives for decision-making.
It enables:
Fast querying and reporting
Multi-dimensional analysis
Drill-down, roll-up, slicing, dicing, and pivoting
Example: A manager analyzing sales data by region → product → time period.
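The roll-up, drill-down, and slice operations above can be sketched with ordinary aggregation; the fact rows below are made-up sales records:

```python
from collections import defaultdict

# Toy fact data: (region, product, month, sales)
facts = [
    ("North", "Laptop", "Jan", 500),
    ("North", "Phone", "Jan", 300),
    ("South", "Laptop", "Jan", 400),
    ("South", "Phone", "Feb", 200),
]

# Roll-up: aggregate sales up to the region level
by_region = defaultdict(int)
for region, product, month, sales in facts:
    by_region[region] += sales
print(dict(by_region))  # totals per region

# Drill-down: break one region's total down by product
north_by_product = defaultdict(int)
for region, product, month, sales in facts:
    if region == "North":
        north_by_product[product] += sales
print(dict(north_by_product))

# Slice: fix one dimension (month = 'Jan') and keep the rest
jan_slice = [f for f in facts if f[2] == "Jan"]
```

OLAP engines perform these operations over pre-built cubes rather than raw loops, but the semantics are the same.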
Popular OLAP Tools in the Industry
Microsoft SQL Server Analysis Services (SSAS) – Supports MOLAP, ROLAP, HOLAP.
Oracle OLAP – Enterprise-level OLAP integrated with Oracle DB.
IBM Cognos Analytics – Strong MOLAP-based analysis tool.
MicroStrategy – Enterprise BI with OLAP features.
SAP BW (Business Warehouse) – Integrated OLAP engine.
Pentaho Mondrian OLAP – Open-source OLAP engine.
Tableau – Provides OLAP-like features through interactive dashboards.
QlikView / Qlik Sense – In-memory OLAP analysis.
Data Modelling
Goals of data modelling:
Ensure consistency, integrity, and accuracy of data.
Make data easy to query for business intelligence.
Act as a communication bridge between business users and database developers.
Common schema types:
1. Star Schema
o Central Fact Table (measures like sales, revenue).
o Surrounding Dimension Tables (time, region, product).
o Simple, widely used.
2. Snowflake Schema
o Similar to Star Schema but dimension tables are normalized.
o More complex but reduces redundancy.
3. Galaxy Schema (Fact Constellation)
o Multiple Fact Tables share dimension tables.
o Used for large and complex warehouses.
Good data modelling also ensures scalability of databases and warehouses.
NOTE:
Data Modelling is the foundation of database and data warehouse design, guiding how data is structured,
stored, and accessed for analysis.
Star Schema
A Star Schema is a multidimensional model used in data warehouses. It consists of a central fact table connected to surrounding dimension tables, forming a star-shaped structure, and requires fewer foreign-key joins than a fully normalized design.
Fact Table: Sales_Fact (Date_ID, Product_ID, Customer_ID, Sales_Amount, Quantity).
Dimension Tables: Date_Dim, Product_Dim, Customer_Dim (one dimension per foreign key in the fact table).
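This star schema can be sketched as DDL and queried with a single join per dimension. The dimension-table columns below (Day, Month, Category, City, etc.) are illustrative assumptions, not taken from the notes:

```python
import sqlite3

# Build a hypothetical star schema in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Date_Dim     (Date_ID INTEGER PRIMARY KEY, Day INT, Month TEXT, Year INT);
CREATE TABLE Product_Dim  (Product_ID INTEGER PRIMARY KEY, Name TEXT, Category TEXT);
CREATE TABLE Customer_Dim (Customer_ID INTEGER PRIMARY KEY, Name TEXT, City TEXT);
CREATE TABLE Sales_Fact (
    Date_ID      INTEGER REFERENCES Date_Dim(Date_ID),
    Product_ID   INTEGER REFERENCES Product_Dim(Product_ID),
    Customer_ID  INTEGER REFERENCES Customer_Dim(Customer_ID),
    Sales_Amount REAL,
    Quantity     INTEGER
);
""")

# A typical star-schema query: total sales per product category
conn.execute("INSERT INTO Product_Dim VALUES (1, 'Laptop', 'Electronics')")
conn.execute("INSERT INTO Sales_Fact VALUES (1, 1, 1, 1200.0, 2)")
row = conn.execute("""
    SELECT p.Category, SUM(f.Sales_Amount)
    FROM Sales_Fact f JOIN Product_Dim p ON f.Product_ID = p.Product_ID
    GROUP BY p.Category
""").fetchone()
print(row)
```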
Snowflake Schema
The Snowflake Schema is also a multidimensional model used for data warehouses. It contains fact tables, dimension tables, and sub-dimension tables: the dimension tables are normalized into sub-dimension tables, which gives the schema its snowflake-like structure.
Star Schema vs. Snowflake Schema:
Data Integrity: Star has lower data integrity due to redundancy; Snowflake has higher data integrity due to normalization.
Updates and Modifications: Star is more difficult to update due to denormalization; Snowflake is easier to update as data is normalized.
Learning Curve: Star is easier to learn and implement; Snowflake is more complex to learn and implement.
1. Star Schema
Best for Simplicity and Speed: If we need a straightforward, easy-to-implement solution with fast query
execution, the Star Schema is ideal. It works well for small to medium datasets where quick, simple
queries are essential.
Use Case: Perfect for scenarios with fewer dimensions and limited hierarchy levels, such as sales data
warehouses in small businesses. It allows for fast data retrieval with minimal joins, making it suitable for
quick reporting and analytics.
Storage Considerations: Suitable when redundancy isn’t a significant issue and storage requirements are
manageable.
2. Snowflake Schema
Best for Flexibility and Data Integrity: If we need to handle large datasets with multiple levels of
hierarchy and a high degree of normalization, the Snowflake Schema offers greater flexibility. It’s perfect
for maintaining data integrity across complex datasets.
Use Case: Ideal for large organizations dealing with large, normalized datasets or those with frequent
updates, like customer or inventory management systems. It minimizes redundancy and improves storage
efficiency.
Storage Considerations: Snowflake is more storage-efficient due to its normalized structure, making it a
great choice for scenarios with complex, high-volume data.
What is ETL?
ETL (Extract, Transform, Load) is the process of moving data from multiple sources into a Data
Warehouse where it can be analyzed.
Steps of ETL
1. Extract
o Pulls data from multiple sources:
Databases (Oracle, SQL Server, MySQL)
Flat files (CSV, Excel, JSON)
Cloud services (AWS, Salesforce, APIs)
o Must handle structured, semi-structured, and unstructured data.
2. Transform
o Applies business rules and data cleansing:
Remove duplicates, fix missing values
Standardize formats (e.g., dates, currency)
Aggregate data (e.g., monthly sales totals)
Apply business logic (profit = sales – cost)
o Ensures data quality, consistency, and accuracy.
3. Load
o Puts data into the Data Warehouse.
o Two types of loading:
Full Load → Load all data at once (initial setup).
Incremental Load → Only load new/updated data (daily, weekly).
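The three steps above can be sketched end-to-end in a few lines of Python. The CSV source, its column names, and the business rule (profit = sales - cost, as stated earlier) all use made-up data:

```python
import csv
import io

# Tiny in-memory CSV "source" standing in for an operational system
raw = """order_id,date,sales,cost
1,2024/01/05,100,60
2,2024/01/06,150,90
2,2024/01/06,150,90
3,2024/01/07,,50
"""

# Extract: pull rows from the source
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: remove duplicates, fix missing values, standardize dates, apply business logic
seen, clean = set(), []
for r in rows:
    if r["order_id"] in seen:
        continue                          # drop duplicate records
    seen.add(r["order_id"])
    sales = float(r["sales"] or 0)        # treat missing sales as 0
    cost = float(r["cost"])
    clean.append({
        "order_id": int(r["order_id"]),
        "date": r["date"].replace("/", "-"),  # standardize the date format
        "profit": sales - cost,               # business rule: profit = sales - cost
    })

# Load: here the "warehouse" is just a list; in practice this is an INSERT into the DW
warehouse = clean
print(warehouse)
```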
NOTE:
The ETL process is the backbone of Data Warehousing, ensuring that data from multiple sources is
extracted, cleaned, transformed, and loaded into a structured warehouse for OLAP and BI applications.
UNIT II: DATA MINING & WAREHOUSING
QUESTION BANK
9 Discuss the Star and Snowflake Schema models in the context of Data Warehousing. (CO2, 2023-24)
10 How are Star and Snowflake schemas different from each other? Explain with the help of a suitable example. (CO2, 2022-23)
11 What are the different steps of the ETL process? How is the ETL process useful in data warehousing? (CO2, 2022-23)
Short Answer Questions (2 marks)
1 Define Data Mining.
2 List any two applications of Data Mining.
3 What is a Data Warehouse?
4 Define ETL in the context of Data Warehousing.
5 What is a Data Mart?
6 Expand OLAP and OLTP.
7 What is meant by schema in Data Warehousing?
8 List any two Data Mining techniques.
9 Mention any two advantages of Data Mining.
10 Define clustering.
11 Differentiate between Data Mining and Data Warehousing.
12 Explain the Star Schema.
13 Explain the Snowflake Schema.
14 Explain the functional classification of Data Mining.
15 Define Online Analytical Processing (OLAP).
16 Explain the role of metadata in a Data Warehouse.
17 State two differences between OLAP and OLTP.
18 Explain data representation in Data Mining.
19 Define Data Transformation in ETL.
20 Give examples of OLAP tools.
21 Identify the role of clustering in customer segmentation.
22 Explain the purpose of fact tables in schema design.
23 State the difference between supervised and unsupervised learning in Data Mining.
24 Define association rule mining with an example.
25 List two key challenges in implementing Data Warehousing.
26 Explain why ETL is important.
27 What is data preprocessing in Data Mining?
28 Define drill-down and roll-up in OLAP.
29 Mention one example each of structured and unstructured data.
30 Define dimensional modeling.
31 What is knowledge discovery in databases (KDD)?
32 Define classification in Data Mining.
Medium Answer Questions (6 marks)
1 Explain the architecture of a Data Mining system.
2 Discuss how Data Mining works (process).
3 Explain the steps in Knowledge Discovery in Databases (KDD).
4 Differentiate between descriptive and predictive data mining tasks.
5 Describe the role of data cleaning in ETL.
6 Explain the importance of OLAP in decision-making.
7 Differentiate between Data Mart and Data Warehouse.
8 Explain with example the concept of association rule mining.
9 Illustrate Star Schema with a neat diagram.
10 Compare Star and Snowflake schemas.
11 Explain the functional classification of Data Mining.
12 Analyze how clustering helps in market basket analysis.
13 Describe OLTP with suitable examples.
14 Explain with an example the process of data transformation in ETL.
15 Differentiate between structured, semi-structured, and unstructured data.
16 Compare roll-up and drill-down operations in OLAP.
17 Explain with example classification vs. regression.
18 Illustrate the role of metadata in Data Warehousing.
19 Describe dimensional modeling in Data Warehousing.
20 Compare advantages and disadvantages of Snowflake schema.
21 Evaluate the importance of preprocessing in data mining accuracy.
22 Analyze the role of OLAP tools in Business Intelligence.
23 Explain why Data Marts are needed when Data Warehouses exist.
24 Evaluate the role of ETL in ensuring data quality.
25 Compare and contrast supervised and unsupervised learning methods.
26 Illustrate how data warehousing supports decision support systems.
27 Discuss the challenges of implementing data warehouses in large organizations.
Long Answer Questions
4 Illustrate the process of knowledge discovery (KDD) and explain its importance.
5 Analyze the differences between Data Lakes, Data Warehouses, and Data Marts with examples.
6 Discuss Star Schema and Snowflake Schema in detail with diagrams.
7 Evaluate the advantages and limitations of Data Mining in Business Intelligence.
8 Describe classification, clustering, and association rule mining with real-world applications.
9 Analyze how OLAP operations (roll-up, drill-down, slice, dice, pivot) support BI decision-making.
10 Discuss the role of metadata in Data Warehousing and evaluate its impact on ETL processes.
11 Create a BI solution workflow using Data Warehousing, ETL, and Data Mining concepts.
12 Critically evaluate challenges in Data Mining related to scalability, data quality, and privacy.