
GALGOTIAS COLLEGE OF ENGINEERING AND TECHNOLOGY

Knowledge Park-II, Greater Noida, U.P.

Unit II
BI: DATA MINING & WAREHOUSING
SYLLABUS
Data Mining - Introduction to Data Mining, Architecture of Data Mining and How Data Mining Works (Process), Functionalities & Classifications of Data Mining, Representation of Input Data, Analysis Methodologies.
Data Warehousing - Introduction to Data Warehousing, Data Mart, Online Analytical Processing (OLAP) - Tools, Data Modelling, Difference between OLAP and OLTP, Schema - Star and Snowflake Schemas, ETL Process - Role of ETL

1. Introduction to Data Mining


What is Data Mining?
Data mining is the process of extracting hidden patterns, trends, correlations, and useful information from
large datasets using statistical, mathematical, machine learning, and database management techniques. It
transforms raw data into meaningful insights that support decision-making and predictions.

Data Mining = Data + Knowledge Discovery

Why is Data Mining Important?


 Businesses generate huge volumes of data (e.g., sales records, customer details, transactions).
 Raw data by itself has little value unless processed and analyzed.
 Data mining helps to:
o Understand customer behavior
o Detect fraud
o Improve business strategies
o Predict future outcomes

Data Mining vs. Related Concepts


 Data Mining vs. Statistics → Statistics focuses on hypothesis testing, while data mining emphasizes
discovery of patterns.
 Data Mining vs. Machine Learning → Machine learning provides algorithms, while data mining
applies them to extract knowledge.
 Data Mining vs. Big Data → Big Data deals with huge, complex data; data mining extracts knowledge
from it.

Steps in Data Mining (Knowledge Discovery Process)


1. Data Cleaning – Removing noise and inconsistent data
2. Data Integration – Combining data from multiple sources
3. Data Selection – Choosing relevant data for analysis
4. Data Transformation – Converting data into suitable format
5. Data Mining – Applying algorithms to extract patterns
6. Pattern Evaluation – Identifying truly useful patterns
7. Knowledge Presentation – Visualizing and presenting results


Techniques of Data Mining


 Classification – Categorizing data (e.g., spam vs. non-spam emails)
 Clustering – Grouping similar data items (e.g., customer segmentation)
 Regression – Predicting continuous values (e.g., sales forecasting)
 Association Rule Learning – Discovering relationships (e.g., market basket analysis → “People who
buy bread often buy butter”)
 Anomaly Detection – Identifying outliers or frauds
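
A minimal sketch of two of these techniques (clustering and classification) using the scikit-learn library; the data values, features, and cluster count below are invented for illustration:

```python
# A minimal sketch of clustering and classification with scikit-learn
# (data, cluster count, and features are invented for illustration).
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Clustering (unsupervised): group customers by (age, monthly_spend).
customers = [[25, 400], [30, 550], [22, 350], [45, 2000], [50, 2200]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print("Cluster labels:", labels)            # e.g. [0 0 0 1 1]

# Classification (supervised): predict "high spender" (1) from age.
X, y = [[25], [30], [22], [45], [50]], [0, 0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Prediction for age 40:", clf.predict([[40]]))   # [1]
```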

Applications of Data Mining


 Business – Customer relationship management, market analysis
 Healthcare – Disease prediction, drug discovery
 Banking & Finance – Fraud detection, credit scoring
 Education – Student performance prediction
 E-commerce – Recommendation systems (e.g., Amazon, Netflix)

NOTE: Data mining is a crucial step in turning data into knowledge. By applying advanced algorithms, it
allows organizations to make informed, data-driven decisions.

2. Architecture of Data Mining


The architecture of a data mining system describes the components and how they interact to extract
knowledge from large datasets.

Main Components

1. Database, Data Warehouse, or Other Data Repositories


o Source of data: databases, data warehouses, flat files, web data, sensor data, etc.
o Provides the raw data needed for mining.
2. Database/Data Warehouse Server
o Responsible for fetching data requested by the mining engine.
o Ensures efficient data retrieval.
3. Data Cleaning & Integration Module
o Prepares data by handling missing values, noise, and inconsistencies.
o Integrates data from multiple sources into a unified format.
4. Data Mining Engine (Core Component)
o Applies algorithms and techniques such as classification, clustering, regression, association
rule mining, and anomaly detection.
o The "brain" of the system.
5. Pattern Evaluation Module
o Identifies useful patterns and eliminates irrelevant or redundant results.
o Uses interestingness measures to rank discovered knowledge.
6. Graphical User Interface (GUI)
o Enables users to interact with the system, define queries, visualize results, and interpret
patterns.
7. Knowledge Base

o Stores domain knowledge, constraints, and rules to guide the mining process.
o Helps improve accuracy and relevance of results.

3. How Data Mining Works (Process)


The process of data mining is often called Knowledge Discovery in Databases (KDD). The steps are given below; a compact code sketch follows the list:

1. Data Cleaning
o Remove noise, errors, and missing values.
o Example: Handling null values in customer records.
2. Data Integration
o Combine data from multiple sources.
o Example: Merging sales data from online and physical stores.
3. Data Selection
o Choose relevant subsets of data for analysis.
o Example: Selecting sales data for the last 3 years only.
4. Data Transformation
o Convert data into a suitable format (normalization, aggregation, encoding).
o Example: Scaling values between 0–1 for machine learning models.
5. Data Mining
o Apply algorithms to extract patterns.
o Example: Clustering customers into groups based on spending habits.
6. Pattern Evaluation
o Identify truly interesting, useful, and valid patterns.
o Example: Filtering out rules that have low confidence or support.

7. Knowledge Presentation
o Present discovered patterns using visualization (graphs, dashboards, reports).
o Example: Displaying customer segments in pie charts.
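
The steps above can be sketched in a few lines of pandas. This is an illustrative toy pipeline, not a production workflow; the column names and values are assumptions:

```python
# Hypothetical end-to-end sketch of the KDD steps with pandas
# (column names and values are illustrative assumptions).
import pandas as pd

raw = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "spend":    [400, None, 350, 2200],   # a missing value to clean
    "region":   ["N", "N", "S", "S"],
})

clean = raw.dropna(subset=["spend"])          # 1. Data cleaning
# 2. Data integration would merge other sources here (pd.concat / merge).
selected = clean[["customer", "spend"]]       # 3. Data selection
spend = selected["spend"]
scaled = selected.assign(                     # 4. Data transformation (0-1 scaling)
    spend_norm=(spend - spend.min()) / (spend.max() - spend.min())
)
# 5-6. Mining and pattern evaluation would apply an algorithm such as clustering.
print(scaled)                                 # 7. Knowledge presentation (here: print)
```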

4. Functionalities of Data Mining

Data mining functionalities specify the kinds of patterns that can be discovered. These are the core tasks that
data mining systems perform.

A. Concept/Class Description
 Summarization of data and characteristics of a class.
 Techniques:
o Characterization → general features of a class (e.g., profile of "high-value customers").
o Discrimination → compare features of different classes (e.g., loyal vs. non-loyal customers).

B. Association Analysis (Correlation/Link Analysis)


 Finds relationships among items.
 Example: Market Basket Analysis → "If a customer buys milk, they are likely to buy bread".

C. Classification
 Assigns objects into predefined categories (supervised learning).
 Example: Classifying emails as spam or not spam.

D. Prediction
 Predicts unknown or future values.
 Example: Predicting next month’s sales revenue.

E. Clustering
 Groups similar objects into clusters without predefined labels (unsupervised learning).
 Example: Segmenting customers based on purchasing behavior.

F. Outlier/Anomaly Detection
 Identifies data objects that deviate significantly from the rest.
 Example: Fraud detection in banking transactions.

G. Evolution Analysis (Trend & Sequential Pattern Mining)


 Analyzes time-related data to discover trends and patterns.
 Example: Stock market trend prediction, web clickstream analysis


5. Classification of Data Mining

Data mining can be classified into different categories based on the type of knowledge to be discovered, the
type of data to be mined, and the techniques applied.

A. Based on the Type of Data Mined

 Relational Data Mining → Works on relational databases (tables, rows, columns).


 Transactional Data Mining → Analyzes transaction records (e.g., sales data, purchase logs).
 Spatial Data Mining → Deals with spatial/geographical data (e.g., maps, satellite images).
 Text Mining → Extracts useful information from unstructured text documents.
 Web Mining → Finds patterns from the World Wide Web (web usage, web content, web structure).
 Multimedia Data Mining → Works on audio, video, and image data.
 Time-series & Sequential Data Mining → Focuses on temporal or sequence-based data (e.g., stock
market trends, DNA sequences).

B. Based on the Type of Knowledge to be discovered


 Descriptive Data Mining
o Summarizes or describes existing data.
o Examples: Clustering, Association Rule Mining, Summarization.
 Predictive Data Mining
o Uses historical data to predict future trends or values.
o Examples: Classification, Regression, Anomaly Detection.

C. Based on the Techniques/Methods Used


 Machine Learning–Based → Decision Trees, Neural Networks, Support Vector Machines (SVM).
 Statistical Techniques → Regression, Bayesian Models, Hypothesis Testing.
 Database-Oriented Approaches → OLAP, SQL-based mining, Query-driven mining.

D. Based on Application Domain


 Business Data Mining → Customer segmentation, market basket analysis, fraud detection.
 Scientific Data Mining → Bioinformatics, drug discovery, climate analysis.
 Engineering Data Mining → Quality control, fault detection.

NOTE:
Summary of the classification dimensions (data, knowledge, methods, applications):
Data → Relational, Transactional, Text, Web, Multimedia, Spatial, Time-series.
Knowledge → Descriptive vs. Predictive.
Methods → ML-based, Statistical, Database-oriented.
Application → Business, Scientific, Engineering.


6. Representation of Input Data in Data Mining

Data can be represented in various forms depending on the application, source, and mining task.

1. Record Data (Tabular/Relational Data)


 Most common form → stored in rows and columns (like a database table).
 Each row (tuple) = an object/record.
 Each column (attribute/field) = a property/feature.
 Example: Customer Database

Customer_ID Name Age Income City


C101 Rahul 25 40,000 Delhi
C102 Neha 30 55,000 Mumbai
C103 Sameer 22 35,000 Bangalore

2. Transaction Data
 Represents transactions (sets of items/events).
 Each transaction = ID + list of items.
 Mostly used in market basket analysis (association rule mining).
 Example: Supermarket Transactions

TID Items Bought


T01 {Milk, Bread, Butter}
T02 {Milk, Bread, Eggs}
T03 {Bread, Butter, Jam}
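
Using the three transactions above, the support and confidence of the rule {Milk} → {Bread} can be computed directly in plain Python:

```python
# Support and confidence of the rule {Milk} -> {Bread}, computed on the
# three transactions above in plain Python.
transactions = [
    {"Milk", "Bread", "Butter"},   # T01
    {"Milk", "Bread", "Eggs"},     # T02
    {"Bread", "Butter", "Jam"},    # T03
]
n = len(transactions)
milk       = sum(1 for t in transactions if "Milk" in t)
milk_bread = sum(1 for t in transactions if {"Milk", "Bread"} <= t)

support    = milk_bread / n       # 2/3, the rule occurs in 67% of baskets
confidence = milk_bread / milk    # 2/2 = 1.0, Bread appears in every Milk basket
print(f"support={support:.2f}, confidence={confidence:.2f}")
```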

3. Text Data (Documents & Web Data)


 Unstructured or semi-structured text.
 Represented using Bag of Words (BoW), TF-IDF, or embeddings.
 Used in text mining, sentiment analysis, document clustering.
 Example:
Document = "Data mining discovers patterns."
BoW Representation = {Data:1, Mining:1, Discovers:1, Patterns:1}
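
A BoW vector like the one above can be built with the Python standard library alone:

```python
# The BoW representation above, built with the standard library alone.
from collections import Counter

doc = "Data mining discovers patterns."
tokens = [w.strip(".").lower() for w in doc.split()]
bow = Counter(tokens)
print(bow)   # Counter({'data': 1, 'mining': 1, 'discovers': 1, 'patterns': 1})
```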

4. Spatial, Image, and Multimedia Data


 Used in image mining, video mining, GIS applications.
 Data represented as:
o Pixel values (for images)
o Coordinates (for spatial data)
o Feature descriptors (e.g., color histogram, edges).

o Example (Image Data):
Image = 3 × 3 grayscale pixels

0 120 255
60 200 180
30 90 150

5. Time-Series Data / Sequence Data


 Ordered data with respect to time or sequence.
 Used in trend analysis, stock market prediction, DNA sequence mining.
 Example: Stock Price (Time-Series)

Day Price (₹)


1 250
2 260
3 245
4 270
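
A simple trend summary for the series above, here a 2-day moving average computed in plain Python:

```python
# A 2-day moving average over the prices above (plain Python), a common
# first step in trend analysis.
prices = [250, 260, 245, 270]
moving_avg = [sum(prices[i - 1:i + 1]) / 2 for i in range(1, len(prices))]
print(moving_avg)   # [255.0, 252.5, 257.5]
```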

6. Graph & Network Data


 Data represented as nodes (entities) and edges (relationships).
 Used in social network analysis, web mining, fraud detection.
 Example: Social Network Graph
Nodes = Users (A, B, C, D)
Edges = Friendships {(A–B), (A–C), (B–D)}
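
The same social network can be stored as an adjacency list, the most common in-memory graph representation:

```python
# The social network above as an adjacency list (a dict of neighbor sets),
# built in plain Python.
edges = [("A", "B"), ("A", "C"), ("B", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)   # friendships are undirected
print(adj)   # e.g. {'A': {'B', 'C'}, 'B': {'A', 'D'}, 'C': {'A'}, 'D': {'B'}}
```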
7. Data Cubes (OLAP Representation)
 Multi-dimensional representation for OLAP (Online Analytical Processing).
 Useful for business intelligence.
 Example: Sales Data Cube
Dimensions: Time × Product × Location
Fact: Sales amount
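
A small Time × Product slice of such a cube can be approximated with a pandas pivot table; the sales figures below are invented for illustration:

```python
# A tiny Time × Product slice of a sales cube built with pandas pivot_table
# (figures invented for illustration).
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "product": ["Laptop", "Phone", "Laptop", "Phone"],
    "amount":  [100, 80, 120, 90],
})
cube = sales.pivot_table(values="amount", index="month",
                         columns="product", aggfunc="sum")
print(cube)   # months as rows, products as columns, summed amounts as cells
```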

7. Analysis Methodologies in Data Mining

Analysis methodologies are the approaches, strategies, or models used to extract knowledge from data. They
define how data is analyzed depending on the goal (prediction, description, clustering, etc.).

1. Descriptive Analysis
 Objective → Describe patterns & relationships in existing data.
 Focus → What happened?
 Methods:
o Clustering (grouping similar data)
o Association Rules (finding co-occurrence patterns, e.g., Market Basket Analysis)

o Summarization (producing compact descriptions of data)
 Example → “Customers who buy laptops also buy laptop bags.”

2. Predictive Analysis
 Objective → Forecast future outcomes based on historical data.
 Focus → What will happen?
 Methods:
o Classification (assign items into predefined classes, e.g., spam filtering)
o Regression (predict continuous values, e.g., sales forecasting)
o Time-Series Analysis (predict trends over time, e.g., stock prices)
 Example → “Next month’s sales are expected to increase by 10%.”

3. Diagnostic Analysis
 Objective → Explain causes behind past events.
 Focus → Why did it happen?
 Methods:
o Correlation analysis
o Causality analysis
o Drill-down and OLAP operations
 Example → “Sales dropped because of reduced advertising and seasonal changes.”

4. Exploratory Analysis
 Objective → Discover unknown or hidden patterns without predefined hypotheses.
 Focus → What patterns exist?
 Methods:
o Clustering (discover natural groups)
o Anomaly detection (identify outliers/frauds)
 Example → Detecting unusual credit card transactions indicating fraud.
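
A minimal sketch of z-score based outlier detection using only the standard library; the threshold is an assumption tuned to this tiny sample:

```python
# A z-score outlier check using only the statistics module; the 1.8 threshold
# is an assumption chosen so the single extreme value stands out in this tiny
# sample (textbooks often use 2 or 3 on larger data).
from statistics import mean, stdev

amounts = [120, 130, 125, 118, 122, 5000]   # 5000 looks fraudulent
mu, sigma = mean(amounts), stdev(amounts)
outliers = [a for a in amounts if abs(a - mu) / sigma > 1.8]
print(outliers)   # [5000]
```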

5. Confirmatory (Hypothesis-Driven) Analysis


 Objective → Test specific hypotheses on the data.
 Focus → Is this assumption true?
 Methods:
o Statistical hypothesis testing
o Chi-square tests, t-tests
 Example → Testing if “Younger customers spend more on online shopping” is true.

6. Prescriptive Analysis


 Objective → Suggest best actions based on insights.
 Focus → What should we do?
 Methods:
o Optimization models
o Simulation
o Decision trees for decision-making
 Example → “Offer a 15% discount to retain high-risk customers.”

Summary Table

| Methodology | Focus | Example | Techniques Used |
|---|---|---|---|
| Descriptive | What happened? | Market basket analysis | Clustering, Association, Summarization |
| Predictive | What will happen? | Sales forecasting, spam detection | Classification, Regression, Time-series |
| Diagnostic | Why did it happen? | Sales decline analysis | Correlation, OLAP, Drill-down |
| Exploratory | What patterns exist? | Fraud detection, anomaly spotting | Clustering, Outlier Detection |
| Confirmatory | Is assumption true? | Hypothesis testing in research | Chi-square, t-test, ANOVA |
| Prescriptive | What should we do? | Optimizing marketing strategies | Optimization, Simulation, Decision Trees |

8. Introduction to Data Warehousing

What is a Data Warehouse?


A Data Warehouse (DW) is a centralized repository that stores integrated, subject-oriented, time-variant,
and non-volatile data collected from multiple sources.
It is designed to support decision-making, reporting, and analysis, not day-to-day transaction processing.

In simple terms:
 A database is for running the business (daily transactions).
 A data warehouse is for analyzing the business (decision-making).

Characteristics of a Data Warehouse (Inmon's definition)


1. Subject-Oriented
o Organized around key subjects (e.g., sales, customers, products), not applications.
2. Integrated
o Data is collected from multiple sources, cleaned, and made consistent.
3. Time-Variant
o Stores historical data (months/years) to analyze trends over time.
4. Non-Volatile
o Data is stable; once loaded, it is not modified or deleted. New data is appended rather than overwriting existing records.

Components of a Data Warehouse

1. Data Sources
o Operational databases (OLTP), external data, flat files, sensors, etc.
2. ETL Process (Extract, Transform, Load)
o Extract → data pulled from multiple sources.
o Transform → clean, standardize, and integrate data.
o Load → move processed data into the warehouse.
3. Data Warehouse Database
o Central repository storing detailed and summarized data.

4. Metadata
o "Data about data" → definitions, mappings, transformations.
5. Access Tools
o Reporting tools, OLAP (Online Analytical Processing), dashboards, data mining applications.

Types of Data Warehouses

1. Enterprise Data Warehouse (EDW)


o Central warehouse for the entire organization.
o Provides a single version of truth.
2. Data Mart
o Subset of a DW, focused on a specific department (e.g., sales, finance).
o Faster, smaller, cheaper to implement.
3. Virtual Data Warehouse
o A set of views on operational databases.
o No physical storage; depends on live systems.

Architecture of Data Warehousing

Common layers:

1. Data Source Layer → Operational systems, external sources.


2. Data Staging Layer (ETL) → Cleansing, integration, transformation.
3. Data Storage Layer (Warehouse) → Central repository.
4. Analysis/Presentation Layer → OLAP, dashboards, reporting tools.


Benefits of Data Warehousing


 Better decision-making through historical analysis.
 Integrated view of enterprise data.
 Supports trend analysis and forecasting.
 Faster query performance compared to OLTP systems.
 Enables data mining and BI (Business Intelligence).

Example Use Cases


 Retail → Analyzing sales patterns across seasons and locations.
 Banking → Customer profiling, fraud detection.
 Healthcare → Patient care analysis, hospital management.
 Telecom → Churn analysis, customer segmentation.

9. Data Mart

What is a Data Mart?


A Data Mart is a subset of a Data Warehouse that focuses on a specific business area, department, or
subject within an organization.

 It is smaller, more focused, and faster to implement than a full Data Warehouse.
 Used mainly by a particular group of users (e.g., Sales, Marketing, Finance).

Think of a Data Warehouse as a supermarket (everything is available), and a Data Mart as a specialized
shop (only specific items).

Characteristics of Data Mart


1. Subject-Oriented → Focuses on one business area (sales, HR, finance).
2. Department-Specific → Designed for a single department or user group.
3. Smaller in Size → Few GBs to a few TBs (compared to multi-TB DWs).
4. Faster Implementation → Can be built in weeks/months.
5. Cost-Effective → Cheaper than building a full enterprise DW.

Types of Data Marts


1. Dependent Data Mart
o Created directly from the Enterprise Data Warehouse (EDW).
o Gets cleaned and consistent data from DW.
o Ensures data integrity across departments.
o Example: Marketing Data Mart derived from company-wide EDW.
2. Independent Data Mart
o Created directly from operational or external sources without a DW.
o Quick to build but may lack consistency with other departments.
o Example: A small Finance Data Mart built only from bank transaction systems.
3. Hybrid Data Mart

o Uses both DW + operational sources.
o Combines benefits of dependent and independent approaches.
o Example: A Retail Data Mart combining sales data from EDW and real-time POS systems.

Architecture of Data Mart

Layers:
1. Data Sources → Operational databases, DW, external files.
2. ETL Layer → Extract, Transform, Load data into the mart.
3. Data Mart Storage → Stores subject-specific data.
4. Access Layer → OLAP, dashboards, reports for end users.

Advantages of Data Mart


 Faster implementation than full DW.
 Cost-effective for small departments.
 Provides focused analysis for specific needs.
 Improves query performance (less data compared to DW).
 Easier for non-technical users to access.

Disadvantages of Data Mart


 May create data silos if independent marts are not integrated.
 Can lead to inconsistencies across departments.
 Requires additional ETL maintenance.

Example Use Cases of Data Marts


 Sales Data Mart → Analyzing monthly/quarterly sales performance.

 Finance Data Mart → Budgeting, forecasting, cost analysis.
 Marketing Data Mart → Customer segmentation, campaign effectiveness.
 HR Data Mart → Employee performance, recruitment analysis.

10. Online Analytical Processing (OLAP) – Tools

What is OLAP?
OLAP (Online Analytical Processing) is a set of tools and technologies that allow users to analyze
multidimensional data interactively from different perspectives for decision-making.
It enables:
 Fast querying and reporting
 Multi-dimensional analysis
 Drill-down, roll-up, slicing, dicing, and pivoting
 Example: A manager analyzing sales data by region → product → time period.

Features of OLAP Tools


1. Multidimensional View → Data stored as a cube (dimensions: Time, Product, Region, etc.).
2. Fast Query Performance → Pre-computed aggregations improve speed.
3. Interactive Analysis → Drill-down, slice & dice, roll-up.
4. Support for Complex Calculations → Profit margins, growth rates, ratios.
5. Integration with BI Tools → Dashboards, visualization, and reporting.

Types of OLAP Tools


1. MOLAP (Multidimensional OLAP)
o Stores data in multidimensional cubes.
o Very fast query performance (pre-computed aggregates).
o Example: Cognos PowerPlay, Microsoft Analysis Services (MOLAP mode).
2. ROLAP (Relational OLAP)
o Stores data in relational databases (tables).
o Uses SQL queries for aggregations.
o Scales well with large datasets but slower than MOLAP.
o Example: Oracle OLAP, MicroStrategy, SAP BW (ROLAP mode).
3. HOLAP (Hybrid OLAP)
o Combination of MOLAP + ROLAP.
o Frequently accessed data stored in cubes (fast), detailed data in relational DB.
o Balanced performance + scalability.
o Example: Microsoft Analysis Services (HOLAP mode).
4. DOLAP (Desktop OLAP)
o Small OLAP system that runs on individual PCs.
o Limited scalability, but good for small businesses.
o Example: Cognos PowerPlay for Windows.
5. WOLAP (Web-based OLAP)
o Provides OLAP functionality via web browsers.
o No need for heavy desktop installations.
o Example: Pentaho, Tableau Online, Oracle Express OLAP.

Popular OLAP Tools in the Industry
 Microsoft SQL Server Analysis Services (SSAS) – Supports MOLAP, ROLAP, HOLAP.
 Oracle OLAP – Enterprise-level OLAP integrated with Oracle DB.
 IBM Cognos Analytics – Strong MOLAP-based analysis tool.
 MicroStrategy – Enterprise BI with OLAP features.
 SAP BW (Business Warehouse) – Integrated OLAP engine.
 Pentaho Mondrian OLAP – Open-source OLAP engine.
 Tableau – Provides OLAP-like features through interactive dashboards.
 QlikView / Qlik Sense – In-memory OLAP analysis.

OLAP Operations (User Functions)


1. Roll-Up (Aggregation) → Summarize data (e.g., daily → monthly → yearly sales).
2. Drill-Down → Opposite of roll-up, more detailed view (year → month → day).
3. Slice → Fix one dimension (e.g., sales of 2024 only).
4. Dice → Select a sub-cube by choosing multiple dimensions (e.g., sales of Laptops in India in 2024).
5. Pivot (Rotation) → Reorient dimensions to view data from different perspectives.
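
Roll-up, slice, and dice can be mimicked on a flat table with pandas; the rows below are invented for illustration (a pivot would simply reorder the axes):

```python
# Roll-up, slice, and dice mimicked on a flat pandas table
# (rows are invented for illustration).
import pandas as pd

df = pd.DataFrame({
    "year":    [2023, 2024, 2024, 2024],
    "product": ["Laptop", "Laptop", "Laptop", "Phone"],
    "country": ["India", "India", "USA", "India"],
    "sales":   [8, 10, 7, 5],
})

rollup = df.groupby("year")["sales"].sum()     # roll-up: summarize to year level
sliced = df[df["year"] == 2024]                # slice: fix the time dimension
diced  = sliced[(sliced["product"] == "Laptop") &
                (sliced["country"] == "India")]  # dice: sub-cube on two more dims
print(rollup, sliced, diced, sep="\n\n")
```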

Advantages of OLAP Tools


 Faster decision-making through quick insights.
 Multi-dimensional data analysis.
 Handles complex queries efficiently.
 User-friendly interfaces (drag-and-drop, dashboards).
 Supports trend analysis & forecasting.

11. Data Modelling

What is Data Modelling?


Data Modelling is the process of defining and structuring data so that it can be easily stored, retrieved, and analyzed in a database or data warehouse. It is a blueprint for how data is organized, related, and processed.

Objectives of Data Modelling
 Provide a clear structure for data storage.
 Define relationships between data elements.

 Ensure consistency, integrity, and accuracy of data.
 Make data easy to query for business intelligence.
 Act as a communication bridge between business users and database developers.

Types of Data Models


1. Conceptual Data Model
o High-level representation of business entities and relationships.
o Focus: What data is required?
o Example: "Customers purchase Products."
o Tools: ER diagrams (Entity-Relationship Models).
2. Logical Data Model
o More detailed than conceptual; includes attributes, primary keys, and relationships.
o Independent of any specific database.
o Example: Customer (Cust_ID, Name, Email) → Order (Order_ID, Date, Amount).
3. Physical Data Model
o Actual implementation in a specific database (SQL, Oracle, MySQL, etc.).
o Includes tables, columns, data types, indexes, constraints.
o Example: CREATE TABLE Customer (Cust_ID INT PRIMARY KEY, Name VARCHAR(50),
Email VARCHAR(50));

Data Modelling Techniques


1. Entity-Relationship (ER) Modelling → Uses entities, attributes, and relationships.
2. Relational Modelling → Organizes data in tables (rows & columns).
3. Dimensional Modelling → Used in Data Warehousing (Fact tables + Dimension tables).
4. Object-Oriented Modelling → Uses objects (attributes + methods).
5. Hierarchical/Network Modelling → Legacy models, data stored as trees/networks.

Data Modelling in Data Warehousing


In Data Warehousing, the main approaches used are:

1. Star Schema
o Central Fact Table (measures like sales, revenue).
o Surrounding Dimension Tables (time, region, product).
o Simple, widely used.
2. Snowflake Schema
o Similar to Star Schema but dimension tables are normalized.
o More complex but reduces redundancy.
3. Galaxy Schema (Fact Constellation)
o Multiple Fact Tables share dimension tables.
o Used for large and complex warehouses.

Benefits of Data Modelling


 Improves data quality and avoids redundancy.
 Makes databases efficient & consistent.
 Helps in business decision-making.
 Provides a visual representation for better understanding.

 Ensures scalability of databases and warehouses.

NOTE:
Data Modelling is the foundation of database and data warehouse design, guiding how data is structured,
stored, and accessed for analysis.

Difference between OLAP and OLTP

| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Handles day-to-day transactions | Supports decision-making & analysis |
| Data Type | Operational data (current, real-time) | Historical data (aggregated, summarized) |
| Process | Insert, Update, Delete (CRUD operations) | Query, Reporting, Analysis |
| Users | Clerks, cashiers, online users | Managers, analysts, executives |
| Data Volume | Small transactions, but very frequent | Large volumes of historical data |
| Database Design | Normalized (to avoid redundancy, fast transactions) | Denormalized (star/snowflake schemas for faster queries) |
| Query Type | Simple, short queries (fast response needed) | Complex queries (multi-dimensional, aggregations) |
| Examples | Banking systems, ATM transactions, online bookings, retail POS | Business intelligence, sales forecasting, market analysis, dashboards |
| Performance Metric | Measured in transactions per second (TPS) | Measured in query response time |
| Storage Requirement | Smaller (stores only operational data) | Larger (stores years of historical data) |
| Backup/Recovery | Critical (loss affects day-to-day operations) | Less frequent but important (loss affects decision-making) |

Star Schema
Star Schema is a type of multidimensional model used for data warehouses. It consists of a central fact table connected to surrounding dimension tables, forming a star shape. Because the dimension tables are denormalized, queries need fewer foreign-key joins.

Fact Table: Sales_Fact (Date_ID, Product_ID, Customer_ID, Sales_Amount, Quantity).

Dimension Tables (denormalized, each one join away from the fact table):

 Product_Dim → Product_ID, Product_Name, Category_Name
 Date_Dim → Date_ID, Day, Month_Name, Year
 Customer_Dim → Customer_ID, Customer_Name, City
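
A typical star-schema query joins the fact table to one or more dimensions and aggregates a measure. A hedged pandas sketch of that pattern, using tiny invented tables:

```python
# Hedged pandas sketch of a star-schema query: join the fact table to the
# product dimension and total Sales_Amount per category (tiny invented data).
import pandas as pd

sales_fact = pd.DataFrame({
    "Product_ID":   [1, 1, 2],
    "Date_ID":      [10, 11, 10],
    "Sales_Amount": [500, 300, 200],
})
product_dim = pd.DataFrame({
    "Product_ID":    [1, 2],
    "Product_Name":  ["Laptop", "Mouse"],
    "Category_Name": ["Computers", "Accessories"],
})

report = (sales_fact.merge(product_dim, on="Product_ID")   # one star join
                    .groupby("Category_Name")["Sales_Amount"].sum())
print(report)   # Computers 800, Accessories 200
```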

Snowflake Schema

Snowflake Schema is also a type of multidimensional model used for data warehouses. In a snowflake schema, the dimension tables are normalized into additional sub-dimension tables, so the fact table connects to dimension tables that in turn connect to sub-dimension tables, forming a snowflake shape.

Example (the same sales warehouse with normalized dimensions):

 Product_Dim → Product_ID, Product_Name, Category_ID
 Category_Dim → Category_ID, Category_Name
 Date_Dim → Date_ID, Day, Month_ID
 Month_Dim → Month_ID, Month_Name, Year

Difference between Star and Snowflake Schema

| Feature | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Central fact table connected to dimension tables | Fact table connected to normalized dimension tables |
| Data Normalization | Denormalized dimension tables | Normalized dimension tables |
| Performance | Faster query execution due to fewer joins | Slower query performance due to multiple joins |
| Design Complexity | Simple and easy to understand | Complex design with multiple levels of relationships |
| Space Usage | Uses more storage due to denormalization | Uses less storage due to normalization |
| Data Redundancy | Higher data redundancy | Lower data redundancy |
| Foreign Keys | Fewer foreign keys | More foreign keys |
| Use Cases | Best for large datasets and quick ad-hoc queries | Best for structured, predictable queries |
| Query Complexity | Low query complexity | High query complexity due to multiple joins |
| Maintainability | Easier to maintain due to simple design | More difficult to maintain due to complexity |
| Scalability | Scalable but may encounter performance issues with large data volumes | More scalable for very large datasets due to normalization |
| Suitability for BI Tools | Ideal for BI tools and quick reporting | Better for systems that require detailed reporting and data analysis |
| Data Integrity | Lower data integrity due to redundancy | Higher data integrity due to normalization |
| Updates and Modifications | More difficult to update due to denormalization | Easier to update as data is normalized |
| Learning Curve | Easier to learn and implement | More complex to learn and implement |

Choosing Between Star Schema and Snowflake Schema


When selecting between the Star Schema and the Snowflake Schema, it is important to align your choice with your organization's needs, data characteristics, and performance expectations. Here is a quick guide to help you decide:

1. Star Schema
 Best for Simplicity and Speed: If you need a straightforward, easy-to-implement solution with fast query execution, the Star Schema is ideal. It works well for small to medium datasets where quick, simple queries are essential.
 Use Case: Perfect for scenarios with fewer dimensions and limited hierarchy levels, such as sales data
warehouses in small businesses. It allows for fast data retrieval with minimal joins, making it suitable for
quick reporting and analytics.
 Storage Considerations: Suitable when redundancy isn’t a significant issue and storage requirements are
manageable.

2. Snowflake Schema
 Best for Flexibility and Data Integrity: If you need to handle large datasets with multiple levels of hierarchy and a high degree of normalization, the Snowflake Schema offers greater flexibility. It is well suited to maintaining data integrity across complex datasets.
 Use Case: Ideal for large organizations dealing with large, normalized datasets or those with frequent
updates, like customer or inventory management systems. It minimizes redundancy and improves storage
efficiency.
 Storage Considerations: Snowflake is more storage-efficient due to its normalized structure, making it a
great choice for scenarios with complex, high-volume data.

Which Schema is Right for You?


 If simplicity and speed are your priorities, the Star Schema is a better fit.
 If you need to handle complex data with frequent updates while minimizing storage, the Snowflake Schema is more suitable.

12. ETL Process – Role of ETL

What is ETL?
ETL (Extract, Transform, Load) is the process of moving data from multiple sources into a Data
Warehouse where it can be analyzed.

 Extract → Collects data from different sources (databases, files, APIs).


 Transform → Cleans, formats, and integrates data into a consistent structure.
 Load → Stores transformed data into the Data Warehouse (fact & dimension tables).

Steps of ETL
1. Extract
o Pulls data from multiple sources:
 Databases (Oracle, SQL Server, MySQL)
 Flat files (CSV, Excel, JSON)
 Cloud services (AWS, Salesforce, APIs)
o Must handle structured, semi-structured, and unstructured data.
2. Transform
o Applies business rules and data cleansing:
 Remove duplicates, fix missing values
 Standardize formats (e.g., dates, currency)
 Aggregate data (e.g., monthly sales totals)
 Apply business logic (profit = sales – cost)
o Ensures data quality, consistency, and accuracy.
3. Load
o Puts data into the Data Warehouse.
o Two types of loading:
 Full Load → Load all data at once (initial setup).
 Incremental Load → Only load new/updated data (daily, weekly).
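
A minimal ETL sketch in Python, assuming pandas is available; the source file, column names, and SQLite warehouse are illustrative assumptions, not a prescribed implementation:

```python
# A minimal ETL sketch; "sales_source.csv", its columns, and the SQLite
# warehouse are hypothetical stand-ins for real sources and targets.
import sqlite3
import pandas as pd

# Extract: pull raw rows from a source file.
raw = pd.read_csv("sales_source.csv")

# Transform: deduplicate, fill missing values, apply business logic.
clean = raw.drop_duplicates().fillna({"quantity": 0})
clean["profit"] = clean["sales"] - clean["cost"]   # profit = sales - cost

# Load: write the result into a warehouse table (full load shown; an
# incremental load would append only new rows with if_exists="append").
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_fact", conn, if_exists="replace", index=False)
```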

Role of ETL in Data Warehousing


1. Data Integration
o Combines data from multiple heterogeneous sources into a unified view.
2. Data Cleaning & Quality
o Removes inconsistencies, duplicates, and errors → improves data reliability.
3. Data Transformation
o Converts raw data into a business-friendly format for reporting and OLAP.
4. Data Loading
o Makes data available in fact & dimension tables for analysis.
5. Supports Business Intelligence (BI)
o ETL ensures managers & analysts have accurate and timely data for decision-making.
6. Automation
o ETL tools automate repetitive tasks, reducing human errors and improving efficiency.

Popular ETL Tools


 Informatica PowerCenter
 Talend Open Studio
 Apache NiFi
 Microsoft SQL Server Integration Services (SSIS)
 IBM DataStage
 AWS Glue
 Apache Airflow (workflow orchestration)

NOTE:
The ETL process is the backbone of Data Warehousing, ensuring that data from multiple sources is
extracted, cleaned, transformed, and loaded into a structured warehouse for OLAP and BI applications.

UNIT II: DATA MINING & WAREHOUSING
QUESTION BANK

(AKTU Previous Year Questions)


| S.No | Question | CO | Year |
|---|---|---|---|
| 1 | Define data mining. Give an example. | CO2 | 2024-25 |
| 2 | Explain the architecture of a data mining system. Describe the different stages involved in the data mining process. | CO2 | 2024-25 |
| 3 | Discuss the functionalities and classifications of data mining. What are the key techniques used in data mining? | CO2 | 2024-25 |
| 4 | Define data warehousing and explain its importance in the context of Business Intelligence. Discuss the architecture of a data warehouse. | CO2 | 2024-25 |
| 5 | Define Data Mining and highlight its significance in business intelligence. | CO2 | 2023-24 |
| 6 | Explain the architecture of Data Mining. | CO2 | 2023-24 |
| 7 | Provide a detailed analysis of Online Analytical Processing (OLAP), including its tools and applications. Discuss how OLAP supports decision-making processes in organizations, and provide examples of its usage. | CO2 | 2023-24 |
| 8 | Explore the principles of data modelling in the context of Data Warehousing. Discuss the importance of data modelling in designing an effective data warehouse and how it aids in information retrieval. | CO2 | 2023-24 |
| 9 | Discuss the Star and Snowflake Schema models in the context of Data Warehousing. | CO2 | 2023-24 |
| 10 | How are Star and Snowflake schemas different from each other? Explain with the help of a suitable example. | CO2 | 2022-23 |
| 11 | What are the different steps of the ETL process? How is the ETL process useful in data warehousing? | CO2 | 2022-23 |
| 12 | What do OLTP & OLAP stand for? | CO2 | 2022-23 |
| 13 | Define evolution and deviation analysis. | CO2 | 2022-23 |

Short Answer Questions (2 marks)
1 Define Data Mining.
2 List any two applications of Data Mining.
3 What is a Data Warehouse?
4 Define ETL in the context of Data Warehousing.
5 What is a Data Mart?
6 Expand OLAP and OLTP.
7 What is meant by schema in Data Warehousing?
8 List any two Data Mining techniques.
9 Mention any two advantages of Data Mining.
10 Define clustering.
11 Differentiate between Data Mining and Data Warehousing.
12 What is a Star Schema? Explain briefly.
13 What is a Snowflake Schema? Explain briefly.
14 Explain the functional classification of Data Mining.
15 Define Online Analytical Processing (OLAP).
16 Explain the role of metadata in a Data Warehouse.
17 State two differences between OLAP and OLTP.
18 Explain data representation in Data Mining.
19 Define Data Transformation in ETL.
20 Give examples of OLAP tools.
21 Identify the role of clustering in customer segmentation.
22 Explain the purpose of fact tables in schema design.
23 State the difference between supervised and unsupervised learning in Data Mining.
24 Define association rule mining with an example.
25 List two key challenges in implementing Data Warehousing.
26 Explain why ETL is important.
27 What is data preprocessing in Data Mining?
28 Define drill-down and roll-up in OLAP.
29 Mention one example each of structured and unstructured data.
30 Define dimensional modeling.
31 What is knowledge discovery in databases (KDD)?
32 Define classification in Data Mining.

Medium Answer Questions (6 marks)
1 Explain the architecture of a Data Mining system.
2 Discuss how Data Mining works (process).
3 Explain the steps in Knowledge Discovery in Databases (KDD).
4 Differentiate between descriptive and predictive data mining tasks.
5 Describe the role of data cleaning in ETL.
6 Explain the importance of OLAP in decision-making.
7 Differentiate between Data Mart and Data Warehouse.
8 Explain with example the concept of association rule mining.
9 Illustrate Star Schema with a neat diagram.
10 Compare Star and Snowflake schemas.
11 Explain the functional classification of Data Mining.
12 Analyze how clustering helps in market basket analysis.
13 Describe OLTP with suitable examples.
14 Explain with an example the process of data transformation in ETL.
15 Differentiate between structured, semi-structured, and unstructured data.
16 Compare roll-up and drill-down operations in OLAP.
17 Explain with example classification vs. regression.
18 Illustrate the role of metadata in Data Warehousing.
19 Describe dimensional modeling in Data Warehousing.
20 Compare advantages and disadvantages of Snowflake schema.
21 Evaluate the importance of preprocessing in data mining accuracy.
22 Analyze the role of OLAP tools in Business Intelligence.
23 Explain why Data Marts are needed when Data Warehouses exist.
24 Evaluate the role of ETL in ensuring data quality.
25 Compare and contrast supervised and unsupervised learning methods.
26 Illustrate how data warehousing supports decision support systems.
27 Discuss the challenges of implementing data warehouses in large organizations.

Long Answer Questions (10 marks)


1 Discuss in detail the architecture of a Data Mining system with a diagram.
2 Explain the entire ETL process with suitable examples and discuss its role in BI.
3 Compare OLAP and OLTP systems with examples and evaluate their roles in BI.


4 Illustrate the process of knowledge discovery (KDD) and explain its importance.
5 Analyze the differences between Data Lakes, Data Warehouses, and Data Marts with examples.
6 Discuss Star Schema and Snowflake Schema in detail with diagrams.
7 Evaluate the advantages and limitations of Data Mining in Business Intelligence.
8 Describe classification, clustering, and association rule mining with real-world applications.
9 Analyze how OLAP operations (roll-up, drill-down, slice, dice, pivot) support BI decision-making.
10 Discuss the role of metadata in Data Warehousing and evaluate its impact on ETL processes.
11 Create a BI solution workflow using Data Warehousing, ETL, and Data Mining concepts.
12 Critically evaluate challenges in Data Mining related to scalability, data quality, and privacy.

©Dr.Preeti Gupta
