DWDM External
Data Mining
Data mining is the process of extracting valuable information, patterns, and insights from large datasets. It
involves using statistical techniques, machine learning algorithms, and artificial intelligence to analyze data,
recognize trends, and make predictions. Data mining helps businesses, researchers, and analysts discover hidden
relationships in data, leading to better decision-making.
Key Steps in the Data Mining Process:
1. Data Collection: Gathering raw data from different sources such as databases, spreadsheets, and online
platforms.
2. Data Cleaning: Removing errors, missing values, and inconsistencies to improve data quality.
3. Data Transformation: Converting data into a suitable format for analysis.
4. Data Integration: Combining data from multiple sources into a unified system.
5. Data Analysis & Pattern Recognition: Applying algorithms to detect trends, correlations, and
unexpected patterns.
6. Data Interpretation & Visualization: Presenting findings using charts, graphs, and reports for decision-
making.
Example:
An e-commerce website tracks customer purchases and analyzes buying patterns. Using data mining, it suggests
personalized product recommendations, increasing sales and customer satisfaction.
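A minimal Python sketch of these steps end to end (pandas assumed installed; the file name and columns are hypothetical):

# A tiny end-to-end sketch of the steps above ("purchases.csv" and its columns are made up).
import pandas as pd

# 1-2. Data collection & cleaning: load raw data, drop rows with missing values
df = pd.read_csv("purchases.csv")          # columns: customer_id, product, price, date
df = df.dropna()

# 3-4. Data transformation & integration: parse dates, derive a month column
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.to_period("M")

# 5. Analysis / pattern recognition: total spend per customer per month
spend = df.groupby(["customer_id", "month"])["price"].sum()

# 6. Interpretation: print the top spenders for a quick report
print(spend.sort_values(ascending=False).head(10))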
Data mining systems are classified based on various factors such as the type of data they handle, the techniques
they use, and the type of knowledge they discover.
Based on the type of data mined, data mining systems are designed to work on different kinds of data sources such as:
• Relational Databases
Example: SQL-based databases like MySQL, Oracle.
Used for mining structured data using tables, rows, and columns.
• Data Warehouses
Example: OLAP cubes storing large volumes of data.
Used for multi-dimensional analysis and aggregation.
• Transactional Databases
Example: Retail sales records, banking transactions.
Useful in market basket analysis.
• Multimedia Databases
Example: Image, audio, and video databases.
Used for face recognition, image tagging.
Based on the kind of knowledge mined:
• Association Rules
Example: Customers who buy bread also buy butter.
Used in Market Basket Analysis.
• Classification & Prediction
Example: Predicting if a customer will default on a loan.
Used in banking and finance.
• Clustering
Example: Grouping students based on performance.
Used in education analytics.
Based on the techniques used:
• Machine Learning-based
Example: Decision Trees, Neural Networks
Used for predictive modeling and pattern recognition.
• Statistical Methods
Example: Regression analysis
Used for finding relationships between variables.
• Database-Oriented Methods
Example: SQL-based queries for pattern discovery.
Works well with structured data.
Based on how much the user interacts with the system:
• Query-Driven Systems
The user specifies what they want to find.
Example: Using SQL queries.
• Interactive Systems
The system suggests patterns and the user can refine the search.
Example: OLAP tools.
Applications of Data Mining
1. In Retail Domain:
Data Mining is widely used in the retail sector to improve sales, marketing strategies, and customer experience.
Key Applications:
• Market Basket Analysis:
Helps identify products that are frequently bought together. For example, if customers buy bread and
butter, the store can place these items near each other to increase sales.
• Customer Segmentation:
Customers are grouped based on their buying behavior (e.g., regular, seasonal, or discount seekers) to
target them with specific offers.
• Sales Forecasting:
Retailers use past sales data to predict future trends and manage inventory more efficiently.
• Loyalty Programs:
Analyzing the shopping patterns of loyal customers helps design better reward systems.
• Fraud Detection:
Unusual transaction patterns can be flagged to prevent fraudulent activities.
2. In Education Domain:
In the field of education, data mining is used to enhance teaching and learning processes.
Key Applications:
• Predicting student performance and identifying students who need extra academic support.
• Detecting students at risk of dropping out so that timely interventions can be made.
• Personalizing learning materials based on individual learning patterns.
• Improving course design by analyzing feedback and assessment results.
KDD (Knowledge Discovery in Databases)
KDD is a systematic process used to extract useful knowledge from large datasets. It includes multiple steps,
starting from raw data collection to presenting meaningful insights.
1. Data Selection
• Identifies and retrieves the data relevant to the analysis task from the available sources.
2. Data Integration
• Combines data from multiple sources into a single, consistent store.
3. Data Cleaning
• Removes noise, errors, and inconsistent or missing values to improve data quality.
4. Data Transformation
• Converts data into a suitable format for mining while reducing its size without losing important details.
• Reduces the dimensionality of data while preserving key information.
5. Data Mining
• The core step where patterns, relationships, and trends are extracted using algorithms.
• Techniques include classification and clustering.
6. Pattern Evaluation
• Identifies the truly interesting and useful patterns based on measures such as support and confidence.
7. Knowledge Presentation
• Presents extracted knowledge using graphs, charts, reports, or dashboards for decision-making.
• Ensures results are understandable and actionable.
A Data Warehouse is a centralized storage system that collects data from different sources, organizes it, and
stores it for analysis and reporting.
Simple Diagram:
+-------------------+        +-------------------+
|  Operational DBs  |        |   External Data   |
+-------------------+        +-------------------+
          |                            |
          +------------+---------------+
                       |  ETL Process
                       v
            +-------------------+
            |  Data Warehouse   |
            +-------------------+
                       |
                       v
            +----------------------+
            |  Reporting Tools /   |
            |  Data Mining / OLAP  |
            +----------------------+
In Simple Words:
A Data Warehouse acts like a big library that stores clean and organized data from different departments. It
helps in analyzing business performance over time.
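A minimal Python sketch of this ETL flow, assuming two hypothetical CSV sources and a local SQLite file acting as the warehouse (pandas plus the standard sqlite3 module):

import sqlite3
import pandas as pd

# Extract: pull data from operational and external sources (hypothetical files)
ops = pd.read_csv("operational_sales.csv")     # e.g., order_id, amount, region
ext = pd.read_csv("external_market_data.csv")  # e.g., order_id, channel

# Transform: clean and combine into one consistent table
merged = ops.merge(ext, on="order_id", how="left")
merged["amount"] = merged["amount"].fillna(0)

# Load: store the cleaned data in the "warehouse" (a SQLite database here)
conn = sqlite3.connect("warehouse.db")
merged.to_sql("sales_fact", conn, if_exists="replace", index=False)

# Reporting / OLAP layer: analysts query the warehouse, not the source systems
report = pd.read_sql("SELECT region, SUM(amount) AS total FROM sales_fact GROUP BY region", conn)
print(report)
conn.close()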
A Data Warehouse Architecture is a structured framework that defines how data is collected, processed, stored,
and accessed for analysis. It consists of multiple layers, ensuring efficient data management and retrieval.
A Data Warehouse is a system used to store large amounts of historical data from various sources for analysis
and reporting purposes. The idea of a data warehouse was born due to the need to support decision-making in
organizations.
• During the 1960s, organizations began using mainframe computers to manage business data.
• However, these systems were mainly transactional, used only for daily operations — not for analysis.
• Bill Inmon, considered the "Father of Data Warehousing," formally introduced the Data Warehouse
concept.
• He defined it as:
“A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of
management’s decision-making process.”
• Around this time, ETL (Extract, Transform, Load) tools and processes were developed to collect and
clean data from multiple sources.
• Introduction of Data Marts, Star/Snowflake Schemas, and Metadata to better organize data.
• Use of Data Mining techniques and real-time data integration became common.
• With the rise of Big Data, warehouses have moved to Cloud platforms like Amazon Redshift, Google
BigQuery, and Snowflake.
• These support huge data volumes, real-time processing, and advanced analytics using AI/ML.
Types of Data Warehouse Architecture
1. Single-Tier Architecture
• Definition: Uses a single layer for both storing and analyzing data, with the aim of keeping only one copy of the data and minimizing redundancy.
• Pros:
o Simple and cost-effective for small businesses.
o Reduces data duplication.
• Cons:
o Slower performance due to mixed processing.
o Not scalable for large businesses.
2. Two-Tier Architecture
• Definition: Adds a middle layer (staging area) between data sources and the warehouse to clean and organize
data.
• Key Components:
o Source Layer: Collects data from different sources.
o Staging Layer: Cleans and transforms data using ETL tools.
o Data Warehouse Layer: Stores and organizes data for analysis.
o Analysis Layer: Provides reports, dashboards, and business intelligence.
• Pros:
o Improves data quality.
o Reduces workload on live databases.
• Cons:
o Limited scalability.
o Can slow down when handling large data volumes.
3. Three-Tier Architecture
• Definition: Separates data storage, processing, and analysis into three layers for better performance.
• Key Components:
o Bottom Tier (Data Storage): Stores raw data using databases.
o Middle Tier (Processing/OLAP Server): Organizes and processes data for faster queries.
o Top Tier (Front-End/Analysis): Provides reports, dashboards, and data mining tools.
• Pros:
o Fast and efficient for handling large data sets.
o Supports advanced analytics and reporting.
o Highly scalable for big enterprises.
• Cons:
o Expensive to implement.
o Requires expert management.
Comparison Table
Feature        Single-Tier         Two-Tier            Three-Tier
Complexity     Low                 Medium              High
Data Quality   Low                 Moderate            High
Performance    Slow                Moderate            Fast
Scalability    Low                 Limited             High
Best For       Small Businesses    Medium Businesses   Large Enterprises
Difference between OLTP and OLAP
OLTP (Online Transaction Processing) handles the day-to-day operations of a system with many small insert/update transactions, while OLAP (Online Analytical Processing) is used for analyzing large volumes of historical data with complex, read-heavy queries.
Example:
• OLTP: A user books a movie ticket. The system inserts that transaction into the database.
• OLAP: Management analyzes the total number of tickets sold last year by region and genre.
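A small Python sketch contrasting the two workloads on one table, using the built-in sqlite3 module (the table and columns are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (ticket_id INTEGER, region TEXT, genre TEXT, year INTEGER)")

# OLTP: a user books a ticket -> one small write transaction
conn.execute("INSERT INTO bookings VALUES (?, ?, ?, ?)", (101, "West", "Action", 2024))
conn.commit()

# OLAP: management analyses totals over historical data -> a large read/aggregate query
rows = conn.execute(
    "SELECT region, genre, COUNT(*) FROM bookings WHERE year = 2024 GROUP BY region, genre"
).fetchall()
print(rows)
conn.close()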
A Data Warehouse Schema defines how data is structured and organized in a data warehouse. It determines how
tables, relationships, and keys are designed for efficient storage, retrieval, and analysis of data.
1. Star Schema
The star schema is the simplest and most widely used schema in data warehousing. It consists of a central fact
table connected to multiple dimension tables in a star-like structure.
Structure:
• Fact Table: Contains numerical data (e.g., sales amount, revenue) and foreign keys referencing dimension
tables.
• Dimension Tables: Store descriptive attributes related to the fact table (e.g., product details, customer
details, time, location).
Example:
• Fact Table: Sales (Product_ID, Order_ID, Customer_ID, Emp_ID, Total, Quantity, Discount)
• Dimension Tables:
o Product (Product_ID, Name, Category, Price)
o Time (Order_ID, Date, Year, Quarter, Month)
o Customer (Customer_ID, Name, Address, City, Zip)
o Emp (Emp_ID, Name, Title, Department, Region)
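A small Python sketch of how a query runs against this star schema, joining the fact table to two of its dimension tables (pandas; the sample rows are made up):

import pandas as pd

# Fact table: one row per sale, with foreign keys to the dimensions
sales = pd.DataFrame({"Product_ID": [1, 2, 1], "Customer_ID": [10, 11, 10],
                      "Total": [200.0, 150.0, 120.0], "Quantity": [2, 1, 1]})

# Dimension tables: descriptive attributes
product = pd.DataFrame({"Product_ID": [1, 2], "Name": ["Pen", "Notebook"],
                        "Category": ["Stationery", "Stationery"]})
customer = pd.DataFrame({"Customer_ID": [10, 11], "City": ["Pune", "Mumbai"]})

# Join fact -> dimensions (the "star" joins), then aggregate
joined = sales.merge(product, on="Product_ID").merge(customer, on="Customer_ID")
print(joined.groupby(["Category", "City"])["Total"].sum())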
Advantages
• Simple design that is easy to understand and query.
• Faster query performance because fewer joins are required.
Disadvantages
• Dimension tables are denormalized, which leads to data redundancy.
• Requires more storage space and can cause update anomalies.
2. Snowflake Schema
The Snowflake Schema is a more normalized version of the Star Schema. It reduces data redundancy by splitting
dimension tables into smaller related tables, forming a snowflake-like structure.
Structure:
• Fact Table: Stores numerical data and foreign keys referencing dimension tables.
• Normalized Dimension Tables: Dimension tables are further divided into sub-tables to remove
redundancy.
Example:
• Fact Table: Sales (Product_ID, Order_ID, Customer_ID, Employee_ID, Total, Quantity, Discount)
• Dimension Tables:
o Product (Product_ID, Product_Name, Category_ID)
▪ Category (Category_ID, Name, Description, Price)
o Customer (Customer_ID, Name, Address, City_ID)
▪ City (City_ID, Name, Zipcode, State, Country)
o Time (Order_ID, Date, Year, Quarter, Month)
o Employee (Employee_ID, Name, Department_ID, Region, Territory)
▪ Department (Department_ID, Name, Location)
Advantages
• Reduces data redundancy
• Saves storage space
• Improves data integrity
Disadvantages
• Queries need more joins, which can slow down performance.
• More complex design that is harder to understand and maintain.
3. Fact Constellation Schema
The fact constellation schema is a more complex structure that includes multiple fact tables sharing common
dimension tables. It is also called a galaxy schema because it looks like a collection of multiple star schemas.
Structure:
• Multiple Fact Tables: Used when a business has different processes that share common dimensions.
• Shared Dimension Tables: Common dimensions are used across fact tables.
Example:
A college placement scenario with two fact tables that share the dimension tables below, for example:
Fact Tables
1. Placement (with foreign keys to Student, Company, and TPO)
2. Training / Workshop (with foreign keys to Student, Training Institute, and TPO)
Dimension Tables
1. Student Dimension Table
o Stud_roll (Primary Key)
o Name
o CGPA
2. Company Dimension Table
o Company_id (Primary Key)
o Name
o Offer_Package
3. Training Institute Dimension Table
o Institute_id (Primary Key)
o Name
o Full_course_fee
4. TPO Dimension Table
o TPO_id (Primary Key)
o Name
o Age
Advantages
• Supports multiple fact tables, so different business processes can be analyzed together.
• Shared dimension tables avoid duplication and keep dimensions consistent.
Disadvantages
• More complex to design and maintain than star or snowflake schemas.
• Queries that span several fact tables can become complicated and slow.
Data Preprocessing
Data preprocessing is the process of preparing raw data for analysis by transforming it into a clean, structured, and usable format. Real-world data is often incomplete, inconsistent, or contains errors and missing values, which makes it unsuitable for direct analysis. Preprocessing is therefore a crucial step in data analysis and machine learning: its aim is to clean, transform, and structure raw data into a consistent, usable form so that mining, decision-making, and model building are accurate and efficient.
Importance / Need / Objectives of Data Preprocessing
1. Handling Missing Data
• Real-world data often has missing values due to human error or system failures.
• Filling missing values or removing incomplete records ensures data reliability.
2. Removing Noise and Inconsistencies
• Noisy, duplicate, or inconsistent records (e.g., typos, outliers, conflicting formats) can distort results; smoothing and standardization improve consistency.
Techniques
Data processing techniques help in refining raw data to ensure accuracy, consistency, and efficiency in data
analysis.
1. Data Cleaning
Definition: The process of detecting and correcting inaccurate, incomplete, or inconsistent data to improve data
quality.
Steps:
Example:
A retail company finds missing values in the "Customer Age" column of its database.
• Solution: Fill missing values with the average age of other customers or remove incomplete records.
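A minimal Python sketch of this cleaning step (pandas; the sample ages are made up):

import pandas as pd

df = pd.DataFrame({"Customer_Age": [25, None, 40, 35, None]})

# Option 1: fill missing ages with the average of the known ages
df_filled = df.fillna({"Customer_Age": df["Customer_Age"].mean()})

# Option 2: drop the incomplete records instead
df_dropped = df.dropna()

print(df_filled)
print(df_dropped)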
2. Data Integration
Definition: The process of combining data from multiple sources into a unified dataset.
Steps:
Example:
A company has customer data in separate systems for online and in-store purchases.
• Solution: Merge both datasets into a single database to provide a 360-degree customer view.
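A small Python sketch of this integration step on two hypothetical purchase tables (pandas):

import pandas as pd

online = pd.DataFrame({"customer_id": [1, 2], "online_spend": [500, 300]})
in_store = pd.DataFrame({"customer_id": [1, 3], "store_spend": [200, 400]})

# Outer join keeps every customer from both systems (a 360-degree view)
combined = online.merge(in_store, on="customer_id", how="outer").fillna(0)
print(combined)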
3. Data Transformation
Definition: The process of converting data into a suitable format, scale, or structure for analysis (e.g., normalization, aggregation, encoding).
Steps:
Example:
A company collects product prices in different currencies (USD, INR, EUR).
• Solution: Convert all prices into a single standard currency so that values can be compared and analyzed consistently.
4. Data Reduction
Definition: The process of minimizing data volume while preserving its integrity.
Steps:
Example: Instead of storing daily weather data for 10 years, only monthly averages are stored to reduce data
volume while maintaining essential trends and patterns.
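A minimal Python sketch of this kind of reduction, collapsing synthetic daily readings into monthly averages (pandas and numpy assumed installed):

import pandas as pd
import numpy as np

# Ten years of synthetic daily temperature readings
days = pd.date_range("2010-01-01", "2019-12-31", freq="D")
daily = pd.Series(np.random.normal(25, 5, len(days)), index=days)

# Keep only monthly averages: thousands of daily values shrink to 120 rows
monthly = daily.resample("M").mean()
print(len(daily), "daily values ->", len(monthly), "monthly averages")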
In real-world datasets like agriculture, missing values are common due to human error, sensor failure, or
incomplete surveys.
Common methods to handle missing values include, among others:
• Manual Input: filling in the missing value by hand when the correct value is known.
• Interpolation: estimating the missing value from neighbouring values (e.g., the readings just before and after a gap), as in the sketch below.
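A small Python sketch of interpolation over a gap in sensor readings (pandas):

import pandas as pd

# A soil-moisture sensor missed two readings (NaN values)
readings = pd.Series([30.0, 32.0, None, None, 38.0, 40.0])

# Linear interpolation estimates the gaps from the neighbouring values
filled = readings.interpolate(method="linear")
print(filled)   # the two gaps become 34.0 and 36.0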
Metadata
Metadata is "data about data". In a data warehouse, it describes:
• Structure of data
• Source of data
• Transformation rules
• Usage and access details
Think of metadata as the instruction manual that helps understand and manage the data warehouse.
Importance
1. Improves Data Organization
• Describes how data is structured, stored, and related, making it easier to locate and manage.
2. Ensures Data Security and Access Control
• Defines who can access or modify data, ensuring privacy and security.
• Example: Metadata in a cloud storage system controls user permissions, restricting access to sensitive
files.
3. Enables Interoperability
• Helps different systems and applications understand and use shared data.
• Example: Different organizations using standardized metadata formats (e.g., XML, JSON) can seamlessly
exchange data.
4. Supports Better Decision-Making
• Provides valuable context for analyzing trends and making informed decisions.
• Example: In business intelligence, metadata helps categorize sales data by region, product, and time period
for analysis.
Types
1. Operational Metadata (Source layer)
• Stores details about the source data used in the data warehouse.
• Helps track where the data comes from, its format, and how it is stored.
• Ensures that data can be traced back to its original source when needed.
• Includes information about data structures, field lengths, and data types in the source systems.
• Keeps records of data updates, deletions, and modifications made in the operational systems.
• Helps in troubleshooting by providing logs of data movement and transformations.
Examples:
• A sales table in a MySQL database contains fields: Order_ID, Customer_Name, Amount, Date.
→ Metadata: field names, data types (int, varchar, date), source system name.
• Source file is updated every day at 10 PM.
→ Metadata: update frequency, timestamp logs.
• Log showing that 5 records were deleted yesterday from the source system.
2. Extraction and Transformation Metadata
• Describes how data is extracted from different sources and transformed before storing it in the warehouse.
• Includes details such as:
o Extraction frequency (e.g., daily, weekly, or real-time updates).
o Methods used for extraction (e.g., full extraction, incremental extraction).
o Business rules applied for cleaning and modifying data before loading.
• Provides information on data validation techniques (e.g., handling missing values, removing duplicates).
• Ensures that the transformed data is accurate, consistent, and structured properly for analysis.
• Helps maintain data lineage, tracking changes made to the data throughout the process.
Examples:
• Customer names from multiple sources are converted to uppercase for consistency.
→ Metadata stores this rule: UPPER(Customer_Name)
• Extraction frequency: Sales data is pulled every 24 hours.
• Validation rule: Remove records where Amount < 0.
• A field Total_Sales is derived as Quantity × Unit_Price.
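A small Python sketch showing how transformation rules like these could be applied during ETL (pandas; the sample rows are made up):

import pandas as pd

raw = pd.DataFrame({"Customer_Name": ["alice", "Bob"],
                    "Quantity": [2, 3], "Unit_Price": [50.0, 40.0],
                    "Amount": [100.0, -5.0]})

# Rule: customer names converted to uppercase -> UPPER(Customer_Name)
raw["Customer_Name"] = raw["Customer_Name"].str.upper()

# Validation rule: remove records where Amount < 0
clean = raw[raw["Amount"] >= 0].copy()

# Derived field: Total_Sales = Quantity x Unit_Price
clean["Total_Sales"] = clean["Quantity"] * clean["Unit_Price"]
print(clean)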
3. End-User Metadata
• Acts as a guide for users to find and understand data in the warehouse.
• Allows users to search for information using business-friendly terms instead of complex database
terminology.
• Provides details about data relationships, definitions, and usage to help end-users interpret reports
correctly.
• Supports data visualization and reporting tools by mapping technical data to business concepts.
• Makes it easier for non-technical users (e.g., managers, analysts) to access and analyze the data
effectively.
• Improves decision-making by ensuring users can quickly locate and trust the data they need.
Examples:
• A report label such as "Monthly Sales by Region" is mapped to the underlying fact and dimension tables.
• A business glossary entry explains that "Revenue" means the total sales amount after discounts.
OLAP Operations
1. Roll-up (Aggregation)
o Moves from detailed data to summarized data along a concept hierarchy.
o Example:
▪ Quarter → Year (Time Dimension).
▪ City → Country (Location Dimension).
2. Drill-down
o The reverse of roll-up: moves from summarized data back to more detailed data (e.g., Year → Quarter → Month).
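A minimal Python sketch of a roll-up on a small sales table (pandas; the figures are made up):

import pandas as pd

sales = pd.DataFrame({"Year": [2023, 2023, 2024, 2024],
                      "Quarter": ["Q1", "Q2", "Q1", "Q2"],
                      "Amount": [100, 120, 130, 150]})

# Detailed level: sales per quarter
print(sales.groupby(["Year", "Quarter"])["Amount"].sum())

# Roll-up: aggregate quarters up to the year level
print(sales.groupby("Year")["Amount"].sum())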
Types of OLAP Servers
1. ROLAP (Relational OLAP)
Pros: Handles large data, more flexible for dynamic queries, uses existing RDBMS infrastructure.
Cons: Slower performance compared to MOLAP, complex SQL queries may affect performance.
2. MOLAP (Multidimensional OLAP)
Pros: Fast query performance, efficient for complex calculations.
Cons: Needs more storage, less flexible.
3. HOLAP (Hybrid OLAP)
Pros: Balance between speed and storage, supports both detailed and summarized data, optimized for complex
queries.
Cons: Higher system complexity, requires more maintenance.
A Data Mart is a subset of a Data Warehouse that is designed to serve a specific department, business
function, or user group within an organization, such as Sales, Marketing, or Finance.
Features of a Data Mart:
• Focuses on a single subject area or department.
• Smaller in size and simpler to design than a full data warehouse.
• Faster and cheaper to build, query, and maintain.
Example:
A Sales Data Mart stores only sales-related data (orders, revenue, customers) so that the sales team can analyze it quickly.
Types
1. Dependent Data Mart
• Created from an existing central data warehouse (Top-Down approach).
• Data is first loaded into the warehouse and then extracted into the department-specific mart.
Example:
A banking system where a data warehouse stores all transactions, and separate data marts exist for loans, credit
cards, and customer accounts.
2. Independent Data Mart
• Built directly from operational or external data sources, without a central data warehouse (Bottom-Up approach).
Example:
A small retail company builds a data mart to analyze sales trends, without needing a full-scale data warehouse.
3. Hybrid Data Mart
• Created by extracting data from both operational sources and the data warehouse.
• Provides flexibility, allowing organizations to access data from external sources or a central warehouse.
• Supports both Top-Down and Bottom-Up approaches.
• Suitable for businesses that need fast access to data from different sources.
Example:
An e-commerce company integrating real-time sales data from operational systems while also using historical
customer data from the data warehouse.
Organizations build Data Marts to simplify, speed up, and personalize access to data for specific business
units. Below are the main reasons:
1. Department-Specific Focus
• Data marts serve specific departments like Sales, HR, Marketing, or Finance.
• Allow users to work with only the data relevant to their roles.
• Reduces complexity for non-technical users.
Example: A Sales Data Mart helps sales teams access product-wise and region-wise sales data without
navigating the entire enterprise warehouse.
2. Improved Performance
• Queries run faster because the data mart holds a smaller, focused dataset than the full warehouse.
Example: A marketing executive can run campaign performance reports faster on a marketing data mart than
querying the full data warehouse.
3. Cost-Effective
• Building and maintaining a small data mart is cheaper than a large data warehouse.
• Suitable for small teams or companies with limited budgets.
4. Faster Implementation
• A data mart can be designed and deployed much more quickly than a full enterprise data warehouse.
5. Better Data Security
• Sensitive data can be kept within the department that owns it.
Example: HR data mart contains employee records but keeps payroll or medical information secure from
other departments.
6. Customization
• Each department can design reports and dashboards tailored to their specific KPIs.
• Supports better data visualization and interpretation.
Concept Hierarchies
Concept Hierarchies are used to organize data values at multiple levels of abstraction. They help in
generalizing or drilling down data during analysis, making data mining more efficient and insightful.
What is a Concept Hierarchy?
A concept hierarchy defines a sequence of mappings from low-level (detailed) concepts to high-level
(general) concepts.
Example:
"City → State → Country"
"Date → Month → Quarter → Year"
Uses of Concept Hierarchies:
• Data generalization: Replace detailed data with higher-level concepts for summarization.
• OLAP operations: Used in roll-up and drill-down.
• Simplifies analysis by grouping related data.
Usage in OLAP:
Rolling up along a hierarchy (e.g., City → State → Country) summarizes the data at a higher level, while drilling down moves back to finer detail.
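A minimal Python sketch of generalizing City to Country with a concept-hierarchy mapping (pandas; the mapping and figures are made up):

import pandas as pd

sales = pd.DataFrame({"City": ["Pune", "Mumbai", "Paris"], "Amount": [100, 200, 150]})

# Concept hierarchy: City -> Country
city_to_country = {"Pune": "India", "Mumbai": "India", "Paris": "France"}
sales["Country"] = sales["City"].map(city_to_country)

# Roll-up to the higher-level concept
print(sales.groupby("Country")["Amount"].sum())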
Market Basket Analysis (MBA) is a data mining technique used to find associations or relationships between
sets of items that customers buy together. It helps businesses understand customer purchasing behavior.
Key Concept:
"If a customer buys item A, they are likely to buy item B as well."
Example:
If customers who buy bread often also buy butter, a store may place these items close to each other or offer a
combo deal.
How it works:
Transactions are analyzed to find itemsets that frequently appear together, and association rules are generated from them, for example: {Milk, Bread} → {Butter}.
This means people who buy Milk and Bread often also buy Butter.
Terminology:
Term        Description
Itemset     A group of items bought together (e.g., {Milk, Bread})
Support     How frequently an itemset appears in the dataset
Confidence  How often item B is bought when item A is bought
Lift        How much more likely A and B are bought together than if they were bought independently
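For reference, a minimal Python sketch of these three measures using their standard definitions (the tiny transaction list is made up):

# Each transaction is a set of items
transactions = [{"Milk", "Bread", "Butter"}, {"Milk", "Bread"}, {"Bread"}, {"Milk", "Butter"}]

def support(itemset):
    # Fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B):
    # How often B appears in transactions that contain A
    return support(A | B) / support(A)

def lift(A, B):
    # Confidence compared to how often B appears on its own
    return confidence(A, B) / support(B)

print(support({"Milk", "Bread"}))        # 0.5
print(confidence({"Milk"}, {"Bread"}))   # 2/3
print(lift({"Milk"}, {"Bread"}))         # about 0.89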
Use Cases:
• Product placement in stores, combo offers and discounts, cross-selling, and product recommendations in e-commerce.
Apriori Algorithm
Apriori is an algorithm for finding frequent itemsets in a transaction database and then generating association
rules from those itemsets.
Step-by-Step Procedure
1. Set the minimum support and minimum confidence thresholds.
2. Scan the transactions to find frequent 1-itemsets (items that meet minimum support).
3. Generate candidate k-itemsets from the frequent (k-1)-itemsets and keep those that meet minimum support.
4. Repeat until no new frequent itemsets are found.
5. Generate association rules from the frequent itemsets and keep those that meet minimum confidence.
Example Problem:
Use the Apriori Algorithm to generate association rules from the following transactions:
Minimum Support = 50%, Minimum Confidence = 75%
Transactions:
TID Items
T1 Bread, Butter, Jam, Milk
T2 Bread, Butter, Milk
T3 Bread, Juice, Curd
T4 Bread, Milk, Juice
T5 Butter, Milk, Juice
Step 1: Find Frequent 1-Itemsets (L1)
Count each item's support (minimum support = 50% of 5 transactions, i.e., at least 3 transactions):
• Bread = 4 (80%), Milk = 4 (80%), Butter = 3 (60%), Juice = 3 (60%) → frequent
• Jam = 1 (20%), Curd = 1 (20%) → not frequent
L1 = {Bread}, {Milk}, {Butter}, {Juice}
Step 2: Generate Candidate 2-Itemsets (C2) and Find Frequent 2-Itemsets (L2)
• {Bread, Milk} = 3 (60%) and {Butter, Milk} = 3 (60%) → frequent
• {Bread, Butter} = 2 (40%), {Bread, Juice} = 2 (40%), {Milk, Juice} = 2 (40%), {Butter, Juice} = 1 (20%) → not frequent
L2 = {Bread, Milk}, {Butter, Milk}
Step 3: Generate Association Rules
Association rules are if-then statements that show the relationship between items in a transaction dataset.
Example:
If a customer buys Bread, then they also buy Milk.
→ Written as: Bread → Milk
The frequent 2-itemsets are:
• {Bread, Milk}
• {Butter, Milk}
Generate Rules:
From {Bread, Milk}:
1. Bread → Milk
o Support = 60%; Confidence = 75%
2. Milk → Bread
o Support = 60%; Confidence = 75%
From {Butter, Milk}:
1. Butter → Milk
o Support = 60%; Confidence = 100%
2. Milk → Butter
o Support = 60%; Confidence = 75%
All four rules meet the minimum confidence threshold of 75%, so all of them are accepted.
These rules help businesses make decisions like product placement, combo offers, and targeted promotions.
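A small Python sketch (pure standard library) that recomputes the supports and confidences of this worked example:

from itertools import combinations

transactions = [
    {"Bread", "Butter", "Jam", "Milk"},   # T1
    {"Bread", "Butter", "Milk"},          # T2
    {"Bread", "Juice", "Curd"},           # T3
    {"Bread", "Milk", "Juice"},           # T4
    {"Butter", "Milk", "Juice"},          # T5
]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Frequent 1-itemsets and 2-itemsets (minimum support = 50%)
items = sorted(set().union(*transactions))
L1 = [i for i in items if support({i}) >= 0.5]
L2 = [set(pair) for pair in combinations(L1, 2) if support(set(pair)) >= 0.5]
print(L1)   # ['Bread', 'Butter', 'Juice', 'Milk']
print(L2)   # the two frequent pairs: {Bread, Milk} and {Butter, Milk}

# Rules from the frequent 2-itemsets (minimum confidence = 75%)
for pair in L2:
    a, b = sorted(pair)
    for X, Y in ((a, b), (b, a)):
        conf = support({X, Y}) / support({X})
        if conf >= 0.75:
            print(f"{X} -> {Y}: support={support({X, Y}):.0%}, confidence={conf:.0%}")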
Classification
Classification is a data mining technique used to predict the category or class of a given data point based on
past data.
It uses a model trained on labeled data to classify new, unseen data.
Classification is a supervised learning technique where the goal is to assign predefined labels (classes) to data
based on input features.
A Decision Tree is a machine learning algorithm used for classification tasks that splits the dataset into smaller
subsets based on feature values. It forms a tree structure where each internal node is a condition (test), and
each leaf node represents a class label (decision).
Example: Classifying whether a student will pass, given:
• Attendance = 80%
• Assignment Score = 72
The tree might, for example, first test whether Attendance > 75% and then whether Assignment Score > 60, and classify this student as Pass (illustrative thresholds; see the sketch below).
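A minimal sketch of such a decision tree in Python (assuming scikit-learn is installed; the training data below is made up for illustration):

from sklearn.tree import DecisionTreeClassifier

# Training data: [attendance %, assignment score] with labels Pass/Fail (made up)
X = [[90, 85], [80, 70], [60, 40], [50, 30], [85, 75], [55, 45]]
y = ["Pass", "Pass", "Fail", "Fail", "Pass", "Fail"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict the class of the new student from the example (Attendance=80, Score=72)
print(model.predict([[80, 72]]))   # expected: ['Pass']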
Common Classification Algorithms:
• Decision Tree
• Naive Bayes
• K-Nearest Neighbor (KNN)
• Support Vector Machine (SVM)
• Random Forest
• Logistic Regression