DWDM External

Data Mining Basics & KDD

Data Mining

Data mining is the process of extracting valuable information, patterns, and insights from large datasets. It
involves using statistical techniques, machine learning algorithms, and artificial intelligence to analyze data,
recognize trends, and make predictions. Data mining helps businesses, researchers, and analysts discover hidden
relationships in data, leading to better decision-making.

Key Features of Data Mining:

• Pattern Discovery: Identifies recurring trends and relationships in data.


• Predictive Analysis: Helps forecast future trends based on past data.
• Automatic Processing: Uses AI and machine learning to analyze data without manual intervention.
• Large Dataset Handling: Efficiently processes vast amounts of structured and unstructured data.
• Decision Support: Provides insights to improve business strategies and operations.

Steps in Data Mining Process:

1. Data Collection: Gathering raw data from different sources such as databases, spreadsheets, and online
platforms.
2. Data Cleaning: Removing errors, missing values, and inconsistencies to improve data quality.
3. Data Transformation: Converting data into a suitable format for analysis.
4. Data Integration: Combining data from multiple sources into a unified system.
5. Data Analysis & Pattern Recognition: Applying algorithms to detect trends, correlations, and
unexpected patterns.
6. Data Interpretation & Visualization: Presenting findings using charts, graphs, and reports for decision-
making.

Example:

An e-commerce website tracks customer purchases and analyzes buying patterns. Using data mining, it suggests
personalized product recommendations, increasing sales and customer satisfaction.
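A minimal pandas sketch of this kind of purchase-pattern analysis is shown below. The file name (purchases.csv) and column names are assumptions for illustration, not part of any specific system.

```python
# Rough sketch of the data mining steps on purchase data.
# Assumptions: "purchases.csv" has columns customer_id, product, amount.
import pandas as pd

purchases = pd.read_csv("purchases.csv")

# Data cleaning: drop incomplete rows and duplicate records
purchases = purchases.dropna().drop_duplicates()

# Pattern recognition: how often does each customer buy each product?
counts = (purchases
          .groupby(["customer_id", "product"])
          .size()
          .reset_index(name="times_bought"))

# Simple recommendation: the product each customer buys most often
top_picks = (counts.sort_values("times_bought", ascending=False)
                   .groupby("customer_id")
                   .head(1))
print(top_picks)
```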

Classification of Data Mining Systems with an example

Data mining systems are classified based on various factors such as the type of data they handle, the techniques
they use, and the type of knowledge they discover.

1. Classification Based on Type of Data Source

Data mining systems are designed to work on different kinds of data sources like:

• Relational Databases
Example: SQL-based databases like MySQL, Oracle.
Used for mining structured data using tables, rows, and columns.
• Data Warehouses
Example: OLAP cubes storing large volumes of data.
Used for multi-dimensional analysis and aggregation.
• Transactional Databases
Example: Retail sales records, banking transactions.
Useful in market basket analysis.
• Multimedia Databases
Example: Image, audio, and video databases.
Used for face recognition, image tagging.

2. Classification Based on Type of Knowledge Mined

This is based on what kind of pattern or knowledge we want to discover:

• Association Rules
Example: Customers who buy bread also buy butter.
Used in Market Basket Analysis.
• Classification & Prediction
Example: Predicting if a customer will default on a loan.
Used in banking and finance.
• Clustering
Example: Grouping students based on performance.
Used in education analytics.

3. Classification Based on Mining Techniques Used

This depends on the type of method used to mine the data:

• Machine Learning-based
Example: Decision Trees, Neural Networks
Used for predictive modeling and pattern recognition.
• Statistical Methods
Example: Regression analysis
Used for finding relationships between variables.
• Database-Oriented Methods
Example: SQL-based queries for pattern discovery.
Works well with structured data.

4. Classification Based on User Interaction

This is based on how much the user interacts with the system:

• Query-Driven Systems
The user specifies what they want to find.
Example: Using SQL queries.
• Interactive Systems
The system suggests patterns and the user can refine the search.
Example: OLAP tools.

Applications of Data Mining (In Retail Industry & Education Domain)


1. In Retail Industry:

Data Mining is widely used in the retail sector to improve sales, marketing strategies, and customer experience.

Key Applications:
• Market Basket Analysis:
Helps identify products that are frequently bought together. For example, if customers buy bread and
butter, the store can place these items near each other to increase sales.
• Customer Segmentation:
Customers are grouped based on their buying behavior (e.g., regular, seasonal, or discount seekers) to
target them with specific offers.
• Sales Forecasting:
Retailers use past sales data to predict future trends and manage inventory more efficiently.
• Loyalty Programs:
Analyzing the shopping patterns of loyal customers helps design better reward systems.
• Fraud Detection:
Unusual transaction patterns can be flagged to prevent fraudulent activities.

2. In Education Domain:

In the field of education, data mining is used to enhance teaching and learning processes.

Key Applications:

• Student Performance Prediction:


Based on past test scores, attendance, and engagement, institutions can predict which students need
additional support.
• Dropout Rate Analysis:
Identifies students who are at risk of dropping out by analyzing behavioral and academic data.
• Personalized Learning:
Adaptive learning systems recommend study materials based on each student's strengths and
weaknesses.
• Curriculum Development:
Data from student feedback and performance helps in designing better course content.
• Decision Support for Management:
Helps schools and colleges make better administrative decisions by analyzing enrollment, results, and
feedback trends.

KDD (Knowledge Discovery in Databases) process/steps

KDD is a systematic process used to extract useful knowledge from large datasets. It includes multiple steps,
starting from raw data collection to presenting meaningful insights.
1. Data Selection

• Identifies and extracts relevant data required for analysis.


• Filters out unnecessary or irrelevant data to improve efficiency.

2. Data Integration

• Combines data from multiple sources into a unified dataset.


• Helps in avoiding redundancy and inconsistency.

3. Data Cleaning & Preprocessing

• Handles missing values, removes noise, and resolves inconsistencies.


• Ensures high-quality, accurate, and reliable data for analysis.

4. Data Transformation & Reduction

• Converts data into a suitable format for mining while reducing its size without losing important details.
• Reduces the dimensionality of data while preserving key information.

5. Data Mining

• The core step where patterns, relationships, and trends are extracted using algorithms.
• Techniques include classification and clustering.

6. Pattern Evaluation

• Identifies interesting and useful patterns from mined data.


• Uses statistical measures and visualization tools to validate insights.

7. Visualization & Knowledge Presentation

• Presents extracted knowledge using graphs, charts, reports, or dashboards for decision-making.
• Ensures results are understandable and actionable.

Data Warehouse & Architecture


Data Warehouse

A Data Warehouse is a centralized storage system that collects data from different sources, organizes it, and
stores it for analysis and reporting.

It is used for decision-making, business intelligence, and data analysis.

Key Features of a Data Warehouse:

1. Subject-Oriented – Focuses on specific areas like sales, inventory, or customers.


2. Integrated – Collects data from different sources and stores it in a consistent format.
3. Time-Variant – Stores historical data for trend analysis.
4. Non-Volatile – Data is stable and not frequently changed once entered.
Purpose of a Data Warehouse:

• Helps top management make better business decisions.


• Provides a platform for reporting, data mining, and data analysis.
• Supports OLAP (Online Analytical Processing) operations like slicing, dicing, roll-up, and drill-down.

Simple Diagram:
+------------------+       +------------------+
| Operational DBs  |       |  External Data   |
+------------------+       +------------------+
         |                          |
         +------> ETL Process <-----+
                       |
                       v
          +------------------------+
          |     Data Warehouse     |
          +------------------------+
                       |
                       v
          +------------------------+
          |   Reporting Tools /    |
          |   Data Mining / OLAP   |
          +------------------------+

In Simple Words:

A Data Warehouse acts like a big library that stores clean and organized data from different departments. It
helps in analyzing business performance over time.

Data Warehouse Architecture

A Data Warehouse Architecture is a structured framework that defines how data is collected, processed, stored,
and accessed for analysis. It consists of multiple layers, ensuring efficient data management and retrieval.

Main Components of Data Warehouse Architecture

1. Data Source Layer (Operational Systems)


o Data comes from multiple sources such as databases, ERP systems, CRM systems, flat files, or
external sources.
o These sources may have different formats and structures.
2. Data Staging Layer (ETL – Extract, Transform, Load)
o Extraction: Data is collected from different sources.
o Transformation: Data is cleaned, standardized, and converted into a common format.
o Loading: Transformed data is loaded into the data warehouse.
3. Data Storage Layer (Data Warehouse)
o The central storage area where data is organized into fact tables and dimension tables (star schema
or snowflake schema).
o Supports historical data and enables efficient querying.
4. Data Mart Layer (Departmental Data Stores)
o A subset of the data warehouse focused on specific business areas like sales, finance, or marketing.
o Improves query performance by storing only relevant data.
5. Metadata Layer
o Stores information about data sources, definitions, transformations, and relationships.
o Helps users and systems understand and manage the data warehouse.
6. OLAP (Online Analytical Processing) Layer
o Supports complex queries, aggregations, and multi-dimensional analysis.
o Enables drill-down, roll-up, slicing, and dicing of data.
7. Presentation Layer (BI & Reporting)
o Provides access to data through dashboards, reports, and visualization tools.
o Used by business analysts and decision-makers to gain insights.
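As a rough illustration of the staging (ETL) and storage layers, the sketch below extracts a hypothetical CSV export, applies simple transformations, and loads the result into a SQLite table standing in for the warehouse; all file, table, and column names are assumptions.

```python
# Illustrative ETL sketch (names are assumptions): extract an operational
# CSV export, transform it, and load it into a SQLite "fact" table.
import sqlite3
import pandas as pd

# Extract
raw = pd.read_csv("daily_sales_export.csv")        # order_id, sale_date, amount, region

# Transform: standardize formats and drop bad rows
raw["sale_date"] = pd.to_datetime(raw["sale_date"])
raw["region"] = raw["region"].str.upper()
clean = raw.dropna(subset=["amount"])

# Load into the warehouse's fact table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_fact", conn, if_exists="append", index=False)
```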

History of Data Warehouse

A Data Warehouse is a system used to store large amounts of historical data from various sources for analysis
and reporting purposes. The idea of a data warehouse was born due to the need to support decision-making in
organizations.

1. 1960s – Early Concepts:

• During the 1960s, organizations began using mainframe computers to manage business data.
• However, these systems were mainly transactional, used only for daily operations — not for analysis.

2. 1980s – The Need for Analytics:

• As businesses grew, so did their data volumes.


• Managers required data for decision-making, but operational systems couldn’t support complex queries
without slowing down.
• This led to the concept of separating analytical data from operational data.

3. 1990s – Birth of Data Warehousing (by Bill Inmon):

• Bill Inmon, considered the "Father of Data Warehousing," formally introduced the Data Warehouse
concept.
• He defined it as:
“A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of
management’s decision-making process.”
• Around this time, tools like ETL (Extract, Transform, Load) processes were developed to collect and
clean data from multiple sources.

4. Late 1990s – Growth and Adoption:

• More companies started using Data Warehouses for Business Intelligence.


• OLAP (Online Analytical Processing) tools were developed to perform fast analytical queries.
5. 2000s – Modern Advancements:

• Introduction of Data Marts, Star/Snowflake Schemas, and Metadata to better organize data.
• Use of Data Mining techniques and real-time data integration became common.

6. Present Day – Big Data & Cloud Warehousing:

• With the rise of Big Data, warehouses have moved to Cloud platforms like Amazon Redshift, Google
BigQuery, and Snowflake.
• These support huge data volumes, real-time processing, and advanced analytics using AI/ML.

Types of Data Warehouse


1. Single-Tier Architecture

• Definition: Stores all data in a single layer to reduce redundancy.


• Key Features:
o Uses a virtual data warehouse.
o No clear separation between storing and analyzing data.

• Pros:
o Simple and cost-effective for small businesses.
o Reduces data duplication.

• Cons:
o Slower performance due to mixed processing.
o Not scalable for large businesses.

2. Two-Tier Architecture

• Definition: Adds a middle layer (staging area) between data sources and the warehouse to clean and organize
data.
• Key Components:
o Source Layer: Collects data from different sources.
o Staging Layer: Cleans and transforms data using ETL tools.
o Data Warehouse Layer: Stores and organizes data for analysis.
o Analysis Layer: Provides reports, dashboards, and business intelligence.

• Pros:
o Improves data quality.
o Reduces workload on live databases.

• Cons:
o Limited scalability.
o Can slow down when handling large data volumes.

3. Three-Tier Architecture (Most Common & Widely Used)

• Definition: Separates data storage, processing, and analysis into three layers for better performance.
• Key Components:
o Bottom Tier (Data Storage): Stores raw data using databases.
o Middle Tier (Processing/OLAP Server): Organizes and processes data for faster queries.
o Top Tier (Front-End/Analysis): Provides reports, dashboards, and data mining tools.
• Pros:
o Fast and efficient for handling large data sets.
o Supports advanced analytics and reporting.
o Highly scalable for big enterprises.
• Cons:
o Expensive to implement.
o Requires expert management.

Comparison Table

| Feature      | Single-Tier      | Two-Tier          | Three-Tier        |
|--------------|------------------|-------------------|-------------------|
| Complexity   | Low              | Medium            | High              |
| Data Quality | Low              | Moderate          | High              |
| Performance  | Slow             | Moderate          | Fast              |
| Scalability  | Low              | Limited           | High              |
| Best For     | Small Businesses | Medium Businesses | Large Enterprises |

Difference between OLTP and OLAP

| Feature       | OLTP (Online Transaction Processing)       | OLAP (Online Analytical Processing)         |
|---------------|--------------------------------------------|---------------------------------------------|
| Purpose       | Handles day-to-day transactions            | Performs complex analysis and reporting     |
| Data Type     | Current, real-time data                    | Historical, aggregated data                 |
| Operations    | INSERT, UPDATE, DELETE                     | SELECT (complex queries, aggregations)      |
| Users         | Front-line workers (cashiers, clerks)      | Analysts, managers, executives              |
| Speed         | Fast for simple queries                    | Optimized for reading large data sets       |
| Data Size     | Smaller (transactional data)               | Very large (warehouse data)                 |
| Normalization | Highly normalized (many tables)            | Denormalized (fewer tables, star schema)    |
| Examples      | Banking, e-commerce orders, ticket booking | Sales analysis, business trends, dashboards |
Example:

• OLTP: A user books a movie ticket. The system inserts that transaction into the database.
• OLAP: Management analyzes the total number of tickets sold last year by region and genre.
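The contrast can be sketched with two SQLite statements run from Python; the bookings table and its columns are hypothetical.

```python
# OLTP vs OLAP in miniature, using an in-memory SQLite database and a
# hypothetical bookings table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (ticket_id INTEGER, region TEXT, "
             "genre TEXT, year INTEGER, price REAL)")

# OLTP-style operation: record a single ticket booking as it happens
conn.execute("INSERT INTO bookings VALUES (1, 'North', 'Action', 2024, 250.0)")

# OLAP-style operation: aggregate historical data for analysis
rows = conn.execute(
    "SELECT region, genre, COUNT(*) AS tickets_sold "
    "FROM bookings WHERE year = 2024 "
    "GROUP BY region, genre"
).fetchall()
print(rows)
```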

Schemas in Data Warehousing

A Schema in data warehousing is a logical description of the entire database structure.


It defines how data is organized, how different tables are related to each other, and how they can be accessed.

Think of it like a blueprint or map of the database.

A Data Warehouse Schema defines how data is structured and organized in a data warehouse. It determines how
tables, relationships, and keys are designed for efficient storage, retrieval, and analysis of data.

1. Star Schema

The star schema is the simplest and most widely used schema in data warehousing. It consists of a central fact
table connected to multiple dimension tables in a star-like structure.

Structure:

• Fact Table: Contains numerical data (e.g., sales amount, revenue) and foreign keys referencing dimension
tables.
• Dimension Tables: Store descriptive attributes related to the fact table (e.g., product details, customer
details, time, location).

Example:

For a sales data warehouse:

• Fact Table: Sales (Product_ID, Order_ID, Customer_ID, Emp_ID, Total, Quantity, Discount)
• Dimension Tables:
o Product (Product_ID, Name, Category, Price)
o Time (Order_ID, Date, Year, Quarter, Month)
o Customer (Customer_ID, Name, Address, City, Zip)
o Emp (Emp_ID, Name, Title, Department, Region)
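A minimal sketch of how such a star schema is typically queried: the fact table is joined to two dimension tables and then aggregated. It assumes the tables above already exist in a SQLite file named warehouse.db (an illustrative setup, not a required one).

```python
# Querying the star schema above: join the Sales fact table to the
# Product and Customer dimensions, then aggregate.
import sqlite3

conn = sqlite3.connect("warehouse.db")
query = """
SELECT p.Category,
       c.City,
       SUM(s.Total)    AS total_sales,
       SUM(s.Quantity) AS units_sold
FROM   Sales    s
JOIN   Product  p ON s.Product_ID  = p.Product_ID
JOIN   Customer c ON s.Customer_ID = c.Customer_ID
GROUP  BY p.Category, c.City
"""
for row in conn.execute(query):
    print(row)
```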
Advantages

• Simple and easy to understand


• Faster query performance due to fewer joins
• Efficient for OLAP (Online Analytical Processing)

Disadvantages

• Data redundancy in dimension tables


• Not suitable for highly complex data relationships

2. Snowflake Schema

The Snowflake Schema is a more normalized version of the Star Schema. It reduces data redundancy by splitting
dimension tables into smaller related tables, forming a snowflake-like structure.

Structure:

• Fact Table: Stores numerical data and foreign keys referencing dimension tables.
• Normalized Dimension Tables: Dimension tables are further divided into sub-tables to remove
redundancy.

Example:

For a sales data warehouse:

• Fact Table: Sales (Product_ID, Order_ID, Customer_ID, Employee_ID, Total, Quantity, Discount)
• Dimension Tables:
o Product (Product_ID, Product_Name, Category_ID)
▪ Category (Category_ID, Name, Description, Price)
o Customer (Customer_ID, Name, Address, City_ID)
▪ City (City_ID, Name, Zipcode, State, Country)
o Time (Order_ID, Date, Year, Quarter, Month)
o Employee (Employee_ID, Name, Department_ID, Region, Territory)
▪ Department (Department_ID, Name, Location)

Advantages
• Reduces data redundancy
• Saves storage space
• Improves data integrity

Disadvantages

• More complex queries due to multiple joins


• Slower query performance compared to Star Schema

3. Fact Constellation Schema (Galaxy Schema)

The fact constellation schema is a more complex structure that includes multiple fact tables sharing common
dimension tables. It is also called a galaxy schema because it looks like a collection of multiple star schemas.

Structure:

• Multiple Fact Tables: Used when a business has different processes that share common dimensions.
• Shared Dimension Tables: Common dimensions are used across fact tables.

Example:

Fact Tables

1. Placement Fact Table


o Stud_roll (Foreign Key) → References Student
o Company_id (Foreign Key) → References Company
o TPO_id (Foreign Key) → References TPO
o No. of students eligible
o No. of students placed
2. Workshop Fact Table
o Stud_roll (Foreign Key) → References Student
o Institute_id (Foreign Key) → References Training Institute
o TPO_id (Foreign Key) → References TPO
o No. of students selected
o No. of students attended

Dimension Tables
1. Student Dimension Table
o Stud_roll (Primary Key)
o Name
o CGPA
2. Company Dimension Table
o Company_id (Primary Key)
o Name
o Offer_Package
3. Training Institute Dimension Table
o Institute_id (Primary Key)
o Name
o Full_course_fee
4. TPO Dimension Table
o TPO_id (Primary Key)
o Name
o Age

Advantages

• Supports multiple business processes


• Eliminates duplication of dimensions
• More flexible and scalable

Disadvantages

• High complexity in design and queries


• Requires more storage and maintenance

Data Preprocessing & Metadata


Data Preprocessing

Data preprocessing is the process of preparing raw data for analysis by transforming it into a clean, structured,
and usable format. Since real-world data often contains errors, inconsistencies, and missing values, preprocessing
ensures better accuracy and efficiency in data mining and machine learning.

Raw data is often incomplete, inconsistent, or erroneous, making it unsuitable for direct analysis. Preprocessing
therefore cleans, transforms, and structures the data so that analysis, machine learning, and decision-making
produce accurate and efficient results.

Essential/Importance/Need/Objectives
1. Handling Missing Data

• Real-world data often has missing values due to human error or system failures.
• Filling missing values or removing incomplete records ensures data reliability.
2. Removing Noise and Inconsistencies

• Data can contain errors, outliers, or duplicate values, affecting analysis.


• Preprocessing removes unwanted variations and ensures clean data.

3. Standardizing Data Formats

• Data from multiple sources may have different formats or scales.


• Converting them into a uniform format ensures better integration and analysis.

4. Improving Data Accuracy and Quality

• Incorrect or irrelevant data can lead to misleading conclusions.


• Cleaning and transforming data enhance accuracy and reliability.

5. Enhancing Computational Efficiency

• Reducing unnecessary data speeds up processing and saves storage space.


• Selecting relevant features improves model performance.

6. Supporting Better Decision-Making

• Well-preprocessed data provides clear insights for analysis.


• Helps businesses and researchers make informed and data-driven decisions.

Techniques

Data processing techniques help in refining raw data to ensure accuracy, consistency, and efficiency in data
analysis.

1. Data Cleaning

Definition: The process of detecting and correcting inaccurate, incomplete, or inconsistent data to improve data
quality.

Steps:

1. Identify missing, duplicate, or incorrect data.


2. Handle missing values by removal or imputation.
3. Correct inconsistencies (e.g., standardizing date formats).
4. Remove duplicate records.
5. Validate and verify data accuracy.

Example:
A retail company finds missing values in the "Customer Age" column of its database.

• Solution: Fill missing values with the average age of other customers or remove incomplete records.
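A short pandas sketch of this cleaning step; the file and column names are assumed.

```python
# Cleaning sketch for the "Customer Age" example.
import pandas as pd

customers = pd.read_csv("customers.csv")            # Customer_ID, Name, Age, City

customers["Age"] = customers["Age"].fillna(customers["Age"].mean())  # impute mean age
customers = customers.drop_duplicates(subset="Customer_ID")          # remove duplicates
```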

2. Data Integration
Definition: The process of combining data from multiple sources into a unified dataset.

Steps:

1. Identify data sources (e.g., sales, marketing, HR).


2. Resolve data conflicts and inconsistencies.
3. Remove duplicate records.
4. Merge datasets into a centralized database.
5. Standardize data formats for consistency.

Example:
A company has customer data in separate systems for online and in-store purchases.

• Solution: Merge both datasets into a single database to provide a 360-degree customer view.
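A minimal sketch of such an integration with pandas, assuming two hypothetical order files with matching columns.

```python
# Integration sketch: combine online and in-store purchase records
# into one unified customer view.
import pandas as pd

online   = pd.read_csv("online_orders.csv")         # Customer_ID, Amount
in_store = pd.read_csv("store_orders.csv")          # Customer_ID, Amount

all_orders    = pd.concat([online, in_store], ignore_index=True).drop_duplicates()
customer_view = all_orders.groupby("Customer_ID")["Amount"].sum().reset_index()
```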

3. Data Transformation

Definition: Converting data into a suitable format for analysis.

Steps:

1. Standardize units and formats (e.g., currency, dates).


2. Normalize or scale numerical data.
3. Encode categorical variables (e.g., Male → 1, Female → 0).
4. Convert unstructured data (e.g., text, images) into structured form.

Example:
A company collects product prices in different currencies (USD, INR, EUR).

• Solution: Convert all prices into USD for uniform analysis.
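A small transformation sketch; the sample rows and exchange rates are placeholders, not real figures.

```python
# Transformation sketch: bring prices quoted in different currencies to USD
# and encode a categorical column.
import pandas as pd

prices = pd.DataFrame({
    "product":  ["A", "B", "C"],
    "price":    [1000.0, 12.0, 9.5],
    "currency": ["INR", "USD", "EUR"],
})

to_usd = {"USD": 1.0, "INR": 0.012, "EUR": 1.08}     # placeholder exchange rates
prices["price_usd"] = prices["price"] * prices["currency"].map(to_usd)

# Encode the categorical attribute as integer codes for mining algorithms
prices["currency_code"] = prices["currency"].astype("category").cat.codes
```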

4. Data Reduction

Definition: The process of minimizing data volume while preserving its integrity.

Steps:

1. Remove irrelevant or redundant features.


2. Apply feature selection techniques.
3. Aggregate data to higher levels.
4. Use sampling methods to reduce dataset size.

Example: Instead of storing daily weather data for 10 years, only monthly averages are stored to reduce data
volume while maintaining essential trends and patterns.
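A compact sketch of this reduction with pandas; the file and column names are assumed.

```python
# Reduction sketch: aggregate daily weather readings into monthly averages,
# keeping the trend while shrinking the data.
import pandas as pd

daily = pd.read_csv("weather_daily.csv", parse_dates=["date"])   # date, temp, rainfall

monthly = (daily
           .groupby(daily["date"].dt.to_period("M"))[["temp", "rainfall"]]
           .mean())
```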

Handling Missing Values in datasets (like in Agriculture domain)

In real-world datasets like agriculture, missing values are common due to human error, sensor failure, or
incomplete surveys.

Why It's Important to Handle Missing Values:


• Missing data can affect accuracy of analysis.
• It may lead to incorrect predictions in machine learning.
• Algorithms may fail or give biased results if data is incomplete.

Common Methods to Handle Missing Values:

1. Ignore the Tuple (Record)

• Remove rows with missing data.


• ✔ Useful when the dataset is large.
• ✘ Risk: May lose important information.

Example: Remove crop records that have missing rainfall data.

2. Manual Input

• Ask domain experts or refer to field notes to fill missing values.


• ✔ Accurate, if experts are available.
• ✘ Time-consuming.

3. Global Constant Replacement

• Replace missing value with a constant like “Unknown” or 0.


• ✔ Simple to implement.
• ✘ May reduce accuracy.

Example: Replace missing crop names with “Unknown”.

4. Mean / Median / Mode Substitution

• Replace missing numerical data with:


o Mean (average)
o Median (middle value)
o Mode (most common value)

Example: Replace missing soil pH with average pH from other records.

5. Predictive Modeling (Machine Learning)

• Use algorithms (like regression or k-NN) to predict missing values.


• ✔ More accurate.
• ✘ Needs computational power and skill.

6. Interpolation

• Use nearby time-based data to estimate the missing value.


• Example: If rainfall is missing for June, take the average of May and July.
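A tiny pandas sketch contrasting mean substitution with interpolation for the rainfall example; the numbers are made up.

```python
# Missing-value sketch for a monthly rainfall series: mean substitution vs
# linear interpolation (the June gap is estimated from May and July).
import pandas as pd

rain = pd.Series([80.0, 95.0, None, 110.0, 60.0],
                 index=["Apr", "May", "Jun", "Jul", "Aug"])

filled_mean   = rain.fillna(rain.mean())    # global mean substitution
filled_interp = rain.interpolate()          # Jun = (95 + 110) / 2 = 102.5
```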
Metadata in Data Warehousing
Definition

Metadata means “data about data.”


In a data warehouse, it provides detailed information about:

• Structure of data
• Source of data
• Transformation rules
• Usage and access details

Think of metadata as the instruction manual that helps understand and manage the data warehouse.

• It provides information about a dataset, file, or document.


• It makes it easier to organize, search, retrieve, and manage data.
• For example, in a document, metadata includes the author, creation date, and file size.

Importance
1. Improves Data Organization

• Metadata helps structure and classify data, making it easier to manage.


• Example: In a library, book metadata (title, author, genre) helps in proper categorization and quick
retrieval.

2. Enhances Search & Retrieval

• Metadata enables efficient searching by providing keywords, tags, and descriptions.


• Example: Search engines use metadata to rank and display relevant web pages in search results.

3. Ensures Data Accuracy & Consistency

• Helps maintain standardized data formats and structures across systems.


• Example: A company’s database uses metadata to ensure all customer records follow the same format
(e.g., name, address, phone number).

4. Supports Data Security & Access Control

• Defines who can access or modify data, ensuring privacy and security.
• Example: Metadata in a cloud storage system controls user permissions, restricting access to sensitive
files.

5. Aids in Data Integration & Interoperability

• Helps different systems and applications understand and use shared data.
• Example: Different organizations using standardized metadata formats (e.g., XML, JSON) can seamlessly
exchange data.

6. Enables Data Analysis & Decision-Making

• Provides valuable context for analyzing trends and making informed decisions.
• Example: In business intelligence, metadata helps categorize sales data by region, product, and time period
for analysis.

7. Supports Digital Preservation

• Helps maintain records for long-term storage and future use.


• Example: Digital archives use metadata to track file creation dates, formats, and modification history.

8. Improves Content Management

• Helps in organizing and managing digital assets efficiently.


• Example: In a media library, metadata tags (e.g., actor names, movie genres) allow users to filter and find
content easily.

9. Reduces Data Redundancy & Storage Costs

• Helps identify duplicate or unnecessary data, optimizing storage.


• Example: A company can use metadata to detect and remove duplicate files, saving storage space.

Types
1. Operational Metadata (Source layer)

• Stores details about the source data used in the data warehouse.
• Helps track where the data comes from, its format, and how it is stored.
• Ensures that data can be traced back to its original source when needed.
• Includes information about data structures, field lengths, and data types in the source systems.
• Keeps records of data updates, deletions, and modifications made in the operational systems.
• Helps in troubleshooting by providing logs of data movement and transformations.

Examples:

• A sales table in a MySQL database contains fields: Order_ID, Customer_Name, Amount, Date.
→ Metadata: field names, data types (int, varchar, date), source system name.
• Source file is updated every day at 10 PM.
→ Metadata: update frequency, timestamp logs.
• Log showing that 5 records were deleted yesterday from the source system.

2. Extraction and Transformation Metadata (ETL layer)

• Describes how data is extracted from different sources and transformed before storing it in the warehouse.
• Includes details such as:
o Extraction frequency (e.g., daily, weekly, or real-time updates).
o Methods used for extraction (e.g., full extraction, incremental extraction).
o Business rules applied for cleaning and modifying data before loading.
• Provides information on data validation techniques (e.g., handling missing values, removing duplicates).
• Ensures that the transformed data is accurate, consistent, and structured properly for analysis.
• Helps maintain data lineage, tracking changes made to the data throughout the process.

Examples:
• Customer names from multiple sources are converted to uppercase for consistency.
→ Metadata stores this rule: UPPER(Customer_Name)
• Extraction frequency: Sales data is pulled every 24 hours.
• Validation rule: Remove records where Amount < 0.
• A field Total_Sales is derived as Quantity × Unit_Price.
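A short sketch of applying the transformation rules recorded in this metadata; the sample rows are invented for illustration.

```python
# Apply ETL metadata rules: uppercase customer names, drop negative amounts,
# derive Total_Sales = Quantity * Unit_Price.
import pandas as pd

sales = pd.DataFrame({
    "Customer_Name": ["riya", "aman", "priya"],
    "Quantity":      [2, 1, 3],
    "Unit_Price":    [50.0, 200.0, 30.0],
    "Amount":        [100.0, -200.0, 90.0],
})

sales["Customer_Name"] = sales["Customer_Name"].str.upper()      # UPPER(Customer_Name)
sales = sales[sales["Amount"] >= 0]                              # remove Amount < 0
sales["Total_Sales"] = sales["Quantity"] * sales["Unit_Price"]   # derived field
```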

3. End-User Metadata (Business Layer)

• Acts as a guide for users to find and understand data in the warehouse.
• Allows users to search for information using business-friendly terms instead of complex database
terminology.
• Provides details about data relationships, definitions, and usage to help end-users interpret reports
correctly.
• Supports data visualization and reporting tools by mapping technical data to business concepts.
• Makes it easier for non-technical users (e.g., managers, analysts) to access and analyze the data
effectively.
• Improves decision-making by ensuring users can quickly locate and trust the data they need.

Examples:

• Instead of showing column cust_id, the UI shows Customer ID.


• Revenue = Sum of Total_Sales (technical field) → Shown in dashboard as Monthly Revenue.
• A business user searches for "Top selling products" → Metadata maps this to a report using fields like
Product_Name, Total_Sales.

OLAP Operations & Servers


OLAP is a technology that helps managers and analysts analyze large amounts of data efficiently using a
multidimensional data model. It provides fast, interactive, and consistent access to data for decision-making.

Operations (Roll-up, Drill-down, Slice, Dice, Pivot)

OLAP allows interactive data analysis through different operations:

1. Roll-up (Aggregation)
o Moves from detailed data to summarized data.
o Example:
▪ Quarter → Year (Time Dimension).
▪ City → Country (Location Dimension).

2. Drill-down (Detailed View)


o Opposite of Roll-up: Moves from summary to detailed data.
o Example:
▪ Year → Quarter → Month → Day.

3. Slice (Single Dimension Filtering)


o Selects a single dimension to create a new sub-cube.
o Example:
▪ Selecting Q1 (Quarter 1) in the Time dimension.
4. Dice (Multiple Dimensions Filtering)
o Filters two or more dimensions to create a sub-cube.
o Example:
▪ Location = "Delhi" OR "Kolkata".
▪ Time = "Q1" OR "Q2".
▪ Items = "Car" OR "Bus".

5. Pivot (Rotation of Data)


o Rotates the data to view it from different perspectives.
o Helps in analyzing different dimensions easily.
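The five operations can be sketched with pandas on a small, made-up sales cube (illustration only; real OLAP servers operate on much larger multidimensional stores).

```python
# OLAP operations sketched with pandas on a tiny sales cube.
import pandas as pd

cube = pd.DataFrame({
    "Year":    [2024, 2024, 2024, 2024],
    "Quarter": ["Q1", "Q1", "Q2", "Q2"],
    "City":    ["Delhi", "Kolkata", "Delhi", "Kolkata"],
    "Item":    ["Car", "Bus", "Car", "Bus"],
    "Sales":   [100, 80, 120, 90],
})

rollup = cube.groupby("Year")["Sales"].sum()                       # Roll-up: Quarter -> Year
drill  = cube.groupby(["Year", "Quarter", "City"])["Sales"].sum()  # Drill-down: back to detail
slice_q1 = cube[cube["Quarter"] == "Q1"]                           # Slice: fix one dimension
dice = cube[cube["City"].isin(["Delhi", "Kolkata"])                # Dice: filter two or more
            & cube["Quarter"].isin(["Q1", "Q2"])]                  #       dimensions
pivot = cube.pivot_table(index="City", columns="Quarter",          # Pivot: rotate the view
                         values="Sales", aggfunc="sum")
```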
Servers/types (ROLAP, MOLAP, HOLAP)

1. ROLAP (Relational OLAP)

• Uses relational databases and SQL queries for analysis.


• Works well with large datasets.
• Example: Customer behavior tracking.

Pros: Handles large data, more flexible for dynamic queries, uses existing RDBMS infrastructure.
Cons: Slower performance compared to MOLAP, complex SQL queries may affect performance.

2. MOLAP (Multidimensional OLAP)

• Stores data in pre-built cubes for fast access.


• Best for quick reports and summaries.
• Example: Sales forecasting, financial reports.

Pros: Fast query performance, efficient for complex calculations. Cons: Needs more storage, less flexible.

3. HOLAP (Hybrid OLAP)

• Mix of MOLAP and ROLAP to balance speed and storage.


• Stores summary data in cubes and detailed data in databases.
• Example: Bank transaction analysis.

Pros: Balance between speed and storage, supports both detailed and summarized data, optimized for complex
queries.
Cons: Higher system complexity, requires more maintenance.

Data Marts & Concept Hierarchies


Data Mart

A Data Mart is a subset of a Data Warehouse that is designed to serve a specific department, business
function, or user group within an organization, such as Sales, Marketing, or Finance.
Features of a Data Mart:

• Contains summarized and subject-specific data.


• Easier and faster to access than a full data warehouse.
• Focuses on a single area of business.
• Can be independent (stand-alone) or dependent (sourced from a central data warehouse).

Example:

• A Sales Data Mart might store:


Customer_ID, Product_Name, Sales_Amount, Region, Date
• A Finance Data Mart might store:
Account_ID, Expenses, Revenue, Profit, Quarter

Types
1. Dependent Data Mart

• Created by extracting data from a central data warehouse.


• The data warehouse is built first by gathering data from multiple external sources using an ETL
(Extract, Transform, Load) tool.
• Follows the Top-Down Approach of data warehouse architecture.
• Used by large organizations that require centralized data management.

Example:
A banking system where a data warehouse stores all transactions, and separate data marts exist for loans, credit
cards, and customer accounts.

2. Independent Data Mart

• Created directly from external sources without a central data warehouse.


• The data mart is built first, and then a data warehouse may be created later by integrating multiple
data marts.
• Follows the Bottom-Up Approach of data warehouse architecture.
• Used by small organizations due to its lower cost and quick implementation.

Example:
A small retail company builds a data mart to analyze sales trends, without needing a full-scale data warehouse.
3. Hybrid Data Mart

• Created by extracting data from both operational sources and the data warehouse.
• Provides flexibility, allowing organizations to access data from external sources or a central warehouse.
• Supports both Top-Down and Bottom-Up approaches.
• Suitable for businesses that need fast access to data from different sources.

Example:
An e-commerce company integrating real-time sales data from operational systems while also using historical
customer data from the data warehouse.

Comparison of Data Mart Types


| Feature     | Dependent Data Mart | Independent Data Mart | Hybrid Data Mart |
|-------------|---------------------|-----------------------|------------------|
| Data Source | Data Warehouse | External Sources | Both Warehouse & External Sources |
| Approach    | Top-Down | Bottom-Up | Mixed |
| Complexity  | High | Low | Medium |
| Cost        | Expensive | Cost-Effective | Moderate |
| Best for    | Large Organizations | Small Businesses | Medium to Large Businesses |
| Example     | A marketing data mart created from the organization's main warehouse. | A finance department collecting data from Excel files and creating its own mart. | A sales data mart that pulls data from the warehouse and CRM system. |

Reasons to build Data Marts

Organizations build Data Marts to simplify, speed up, and personalize access to data for specific business
units. Below are the main reasons:
1. Department-Specific Focus

• Data marts serve specific departments like Sales, HR, Marketing, or Finance.
• Allow users to work with only the data relevant to their roles.
• Reduces complexity for non-technical users.

Example: A Sales Data Mart helps sales teams access product-wise and region-wise sales data without
navigating the entire enterprise warehouse.

2. Improved Performance

• Data marts store less volume of data than a full warehouse.


• Faster query execution and better response time.

Example: A marketing executive can run campaign performance reports faster on a marketing data mart than
querying the full data warehouse.

3. Cost-Effective

• Building and maintaining a small data mart is cheaper than a large data warehouse.
• Suitable for small teams or companies with limited budgets.

4. Faster Implementation

• Quicker to design and deploy compared to enterprise-level data warehouses.


• Enables agile decision-making with faster access to needed data.

5. Data Security & Privacy

• Access can be restricted to data relevant to a department.


• Helps in maintaining confidentiality and compliance.

Example: HR data mart contains employee records but keeps payroll or medical information secure from
other departments.

6. Customization

• Each department can design reports and dashboards tailored to their specific KPIs.
• Supports better data visualization and interpretation.

7. Supports Business Decision-Making

• Helps managers and analysts access timely and relevant data.


• Supports strategic planning and operational improvements.

Concept Hierarchies

Concept Hierarchies are used to organize data values at multiple levels of abstraction. They help in
generalizing or drilling down data during analysis, making data mining more efficient and insightful.
What is a Concept Hierarchy?

A concept hierarchy defines a sequence of mappings from low-level (detailed) concepts to high-level
(general) concepts.

Example:
"City → State → Country"
"Date → Month → Quarter → Year"

Purpose of Concept Hierarchies:

• Data generalization: Replace detailed data with higher-level concepts for summarization.
• OLAP operations: Used in roll-up and drill-down.
• Simplifies analysis by grouping related data.

Types of Concept Hierarchies:

| Type                        | Description                              | Example                              |
|-----------------------------|------------------------------------------|--------------------------------------|
| Schema-based Hierarchies    | Defined by the database schema or user   | Product_ID → Category → Department   |
| Set-grouped Hierarchies     | Group values based on user-defined sets  | Age: {1–10, 11–20, ..., 60+}         |
| Rule-based Hierarchies      | Created using if-then rules              | IF city = Delhi THEN country = India |
| Attribute-based Hierarchies | Derived from the attribute itself        | Date: day → month → year             |

Example Hierarchy: Date

2024-05-02 (May 2, 2024) → May 2024 → Q2 2024 → 2024

Example Hierarchy: Location

Dwarka → Delhi → India → Asia → World

Usage in OLAP:

• Roll-up: Going up in hierarchy (e.g., City → Country)


• Drill-down: Going down (e.g., Year → Month)
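A tiny sketch of climbing a concept hierarchy in code: map low-level City values to higher-level Country values, then roll sales up. The data and mapping are illustrative.

```python
# Concept-hierarchy sketch: City -> Country, then roll-up.
import pandas as pd

sales = pd.DataFrame({
    "City":  ["Delhi", "Mumbai", "New York", "Chicago"],
    "Sales": [100, 150, 200, 120],
})

city_to_country = {"Delhi": "India", "Mumbai": "India",
                   "New York": "USA", "Chicago": "USA"}

sales["Country"] = sales["City"].map(city_to_country)   # climb one level in the hierarchy
by_country = sales.groupby("Country")["Sales"].sum()    # roll-up
```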
Association Rules & Market Basket Analysis
Market Basket Analysis (MBA)

Market Basket Analysis (MBA) is a data mining technique used to find associations or relationships between
sets of items that customers buy together. It helps businesses understand customer purchasing behavior.

Key Concept:

Market Basket Analysis is based on the idea:

"If a customer buys item A, they are likely to buy item B as well."

Example:
If customers who buy bread often also buy butter, a store may place these items close to each other or offer a
combo deal.

How it works:

MBA uses association rule mining to discover patterns like:

IF {Milk, Bread} THEN {Butter}

This means people who buy Milk and Bread often also buy Butter.

Terminology:

| Term       | Description                                                                   |
|------------|-------------------------------------------------------------------------------|
| Itemset    | A group of items bought together (e.g., {Milk, Bread})                        |
| Support    | Measure of how frequently an itemset appears in the dataset                   |
| Confidence | Measure of how often item B is bought when item A is bought                   |
| Lift       | Measure of how much more likely A and B are bought together than independently |

Use Cases:

• Supermarkets and retail stores


• E-commerce product recommendations
• Cross-selling strategies
• Inventory management
• Promotion planning

Apriori Algorithm

Apriori is an algorithm for finding frequent itemsets in a transaction database and then generating association
rules from those itemsets.

Step-by-Step Procedure

Use Apriori Algorithm to generate association rules from the following transactions:
Minimum Support = 50%, Minimum Confidence = 75%
Transactions:
TID Items
T1 Bread, Butter, Jam, Milk
T2 Bread, Butter, Milk
T3 Bread, Juice, Curd
T4 Bread, Milk, Juice
T5 Butter, Milk, Juice

Step 1: Find Frequent 1-Itemsets (L1)

Count the support of each item:

| Item   | Support Count | Support % |
|--------|---------------|-----------|
| Bread  | 4/5           | 80% ✅    |
| Butter | 3/5           | 60% ✅    |
| Jam    | 1/5           | 20% ❌    |
| Milk   | 4/5           | 80% ✅    |
| Juice  | 3/5           | 60% ✅    |
| Curd   | 1/5           | 20% ❌    |

Items with support ≥ 50% are kept:


L1 = {Bread, Butter, Milk, Juice}

Step 2: Generate Candidate 2-Itemsets (C2) and Find Frequent 2-Itemsets (L2)

Calculate support for combinations from L1:

| 2-Itemset       | Support Count | Support % |
|-----------------|---------------|-----------|
| {Bread, Butter} | 2/5           | 40% ❌    |
| {Bread, Milk}   | 3/5           | 60% ✅    |
| {Bread, Juice}  | 2/5           | 40% ❌    |
| {Butter, Milk}  | 3/5           | 60% ✅    |
| {Butter, Juice} | 1/5           | 20% ❌    |
| {Milk, Juice}   | 2/5           | 40% ❌    |

L2 = {Bread, Milk}, {Butter, Milk}

Step 3: Generate Candidate 3-Itemsets (C3) and Check Support

From L2, try:


{Bread, Butter, Milk} – appears only in T1 and T2 → Support = 2/5 = 40% ❌
Not frequent → No L3 generated

Step 4: Generate Association Rules from Frequent Itemsets


From L2:
Rule 1: Bread → Milk
• Support = 60%, Confidence = (Support(Bread & Milk) / Support(Bread))
• Confidence = 60/80 = 3/4 = 75% ✅

Rule 2: Milk → Bread

• Support = 60%, Confidence = (Support(Bread & Milk) / Support(Milk))


• Confidence = 60/80 = 3/4 = 75% ✅

Rule 3: Butter → Milk

• Support = 60%, Confidence = (Support(Butter & Milk) / Support(Butter))


• Confidence = 60/60 = 3/3 = 100% ✅

Rule 4: Milk → Butter

• Support = 60%, Confidence = (Support(Butter & Milk) / Support(Milk))


• Confidence = 60/80 = 3/4 = 75% ✅
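As a quick cross-check of the counts above, here is a small pure-Python sketch (just support and confidence counting for these five transactions, not a full Apriori implementation with candidate pruning):

```python
# Recompute supports and confidences for the worked example:
# min support = 50%, min confidence = 75%.
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Jam", "Milk"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Juice", "Curd"},
    {"Bread", "Milk", "Juice"},
    {"Butter", "Milk", "Juice"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return Fraction(sum(itemset <= t for t in transactions), n)

items = sorted(set().union(*transactions))
frequent_pairs = [set(p) for p in combinations(items, 2)
                  if support(set(p)) >= Fraction(1, 2)]

for pair in frequent_pairs:
    for a in pair:
        b = next(iter(pair - {a}))
        conf = support(pair) / support({a})
        if conf >= Fraction(3, 4):
            print(f"{a} -> {b}: support={float(support(pair)):.0%}, "
                  f"confidence={float(conf):.0%}")
```

Running this prints the same four strong rules derived above (Bread → Milk, Milk → Bread, Butter → Milk, Milk → Butter).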

Mining Association Rules (Using Apriori Algorithm)

Association rules are if-then statements that show the relationship between items in a transaction dataset.
Example:
If a customer buys Bread, then they also buy Milk.
→ Written as: Bread → Milk

Each rule is evaluated using:

• Support: How often an itemset appears in the dataset.


• Confidence: How often item B is bought when item A is bought.

Steps to Mine Association Rules from Frequent Itemsets

We'll use the frequent itemsets discovered in the Apriori example:

Frequent Itemsets (L2):

• {Bread, Milk}
• {Butter, Milk}

Generate Rules:
From {Bread, Milk}:

1. Bread → Milk
o Support = 60%; Confidence = 75%
2. Milk → Bread
o Support = 60%; Confidence = 75%

✅ Both rules meet minimum confidence (75%)


From {Butter, Milk}:

1. Butter → Milk
o Support = 60%; Confidence = 100%
2. Milk → Butter
o Support = 60%; Confidence = 75%

✅ Both rules meet minimum confidence (75%)

✅ Final Strong Association Rules:


Rule Support Confidence
Bread → Milk 60% 75%
Milk → Bread 60% 75%
Butter → Milk 60% 100%
Milk → Butter 60% 75%

These rules help businesses make decisions like product placement, combo offers, and targeted promotions.

Classification
Classification is a data mining technique used to predict the category or class of a given data point based on
past data.
It uses a model trained on labeled data to classify new, unseen data.

Classification is a supervised learning technique where the goal is to assign predefined labels (classes) to data
based on input features.

Classification Example Using Decision Tree

A Decision Tree is a machine learning algorithm used for classification tasks that splits the dataset into smaller
subsets based on feature values. It forms a tree structure where each internal node is a condition (test), and
each leaf node represents a class label (decision).

Example Problem: Predict if a student will Pass or Fail


Training Data:
Attendance (%) Assignment Score Result
≥ 75 ≥ 70 Pass
< 75 < 70 Fail
≥ 75 < 70 Fail
< 75 ≥ 70 Fail
Step-by-Step Construction of the Decision Tree:
          [Attendance ≥ 75?]
            /           \
          Yes            No
          /                \
 [Assignment ≥ 70?]       Fail
     /        \
   Yes         No
   /             \
 Pass           Fail
Explanation:

1. Root Node checks if attendance is ≥ 75%.


2. If Yes, it then checks if the assignment score is ≥ 70.
o If both are true → Pass
o Else → Fail
3. If Attendance < 75% → Fail, regardless of score.

Classifying New Student

Suppose a new student has:

• Attendance = 80%
• Assignment Score = 72

→ Goes down the Yes → Yes path


→ Result: Pass
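A sketch of the same example with scikit-learn; the four numeric training rows are illustrative values chosen to match the table above, and the learned thresholds approximate the stated rules rather than reproduce them exactly.

```python
# Decision-tree sketch for the Pass/Fail example (illustrative data).
from sklearn.tree import DecisionTreeClassifier

X = [[80, 75],   # attendance >= 75, score >= 70 -> Pass
     [60, 50],   # attendance <  75, score <  70 -> Fail
     [85, 60],   # attendance >= 75, score <  70 -> Fail
     [60, 80]]   # attendance <  75, score >= 70 -> Fail
y = ["Pass", "Fail", "Fail", "Fail"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

new_student = [[80, 72]]            # attendance 80%, assignment score 72
print(clf.predict(new_student))     # -> ['Pass']
```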

Advantages of Decision Trees:

• Easy to understand and visualize.


• Handles both numerical and categorical data.
• No need for feature scaling.

Common Classification Algorithms:

• Decision Tree
• Naive Bayes
• K-Nearest Neighbor (KNN)
• Support Vector Machine (SVM)
• Random Forest
• Logistic Regression

✅ Real-life Examples of Classification:


Application Classes (Labels)
Email Spam Detection Spam / Not Spam
Disease Diagnosis Disease Present / Not Present
Credit Approval Approved / Rejected
Image Recognition Cat / Dog / Other
Sentiment Analysis Positive / Negative
