Data Mining and Warehousing
Lecture-1,2
Data Mining
• Data Mining is the process of extracting meaningful patterns and
insights from large datasets using various statistical and
computational techniques, allowing businesses to make informed
decisions based on hidden trends and correlations within the data.
Data Mining Concepts
Achieving the best results from data mining requires an array of tools and techniques. Some of the most commonly used functions include:
• Data cleansing and preparation — A step in which data is transformed into a form suitable for further analysis and
processing, such as identifying and removing errors and missing data.
• Artificial intelligence (AI) — These systems perform analytical activities associated with human intelligence such as planning,
learning, reasoning, and problem-solving.
• Association rule learning — These tools, also known as market basket analysis, search for relationships among variables in a
dataset, such as determining which products are typically purchased together.
• Clustering — A process of partitioning a dataset into a set of meaningful sub-classes, called clusters, to help users
understand the natural grouping or structure in the data.
• Classification — This technique assigns items in a dataset to target categories or classes with the goal of accurately
predicting the target class for each case in the data.
• Data analytics — The process of examining large datasets to extract meaningful insights and patterns, typically using statistical methods and specialized software to understand trends, make informed decisions, and solve complex problems.
• Data warehousing — A large collection of business data used to help an organization make decisions. It is the foundational
component of most large-scale data mining efforts.
• Machine learning — A computer programming technique that uses statistical probabilities to give computers the ability to
“learn” without being explicitly programmed.
• Regression — A technique used to predict a range of numeric values, such as sales, temperatures, or stock prices, based on a given dataset (a minimal sketch follows this list).
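As a concrete illustration of the regression technique above, here is a minimal sketch using scikit-learn; the data, column meanings, and model choice are illustrative assumptions rather than part of the lecture material.

```python
# Minimal regression sketch: predict monthly sales from advertising spend.
# Assumes scikit-learn and NumPy are installed; all data below is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: advertising spend (feature) vs. observed sales (target).
ad_spend = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
sales = np.array([25.0, 44.0, 66.0, 83.0, 105.0])

model = LinearRegression()
model.fit(ad_spend, sales)

# Predict sales for a new spend level the model has not seen.
predicted = model.predict(np.array([[35.0]]))
print(f"Predicted sales for a spend of 35: {predicted[0]:.1f}")
```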
This diagram depicts the data mining lifecycle, illustrating the stages of the process from defining the problem to deploying and updating models. It integrates data preparation, exploration, model building, validation, and deployment, emphasizing the cyclical workflow typical of data mining and machine learning projects.
Advantages of Data Mining
Data mining delivers concrete business benefits. For example, it can tell you which prospects are likely to become profitable customers based on past customer profiles, and which are most likely to respond to a specific offer. With this knowledge, you can increase your return on investment (ROI) by making your offer only to those prospects likely to respond and become valuable customers. Typical advantages include:
• Increasing revenue.
• Understanding customer segments and preferences.
• Acquiring new customers.
• Improving cross-selling and up-selling.
• Retaining customers and increasing loyalty.
• Increasing ROI from marketing campaigns.
• Detecting fraud.
• Identifying credit risks.
• Monitoring operational performance.
What is Data Warehousing?
• Data warehousing refers to the process of collecting, storing, and
managing large volumes of data from multiple sources in a centralized
repository for analysis and reporting purposes. It is a core component
of business intelligence (BI) systems and helps organizations make data-
driven decisions by providing a structured and consolidated view of
their data.
• It serves as an electronic repository for storing vast amounts of business
information, specifically designed for querying and analysis rather than
transactional processing. This process involves transforming raw data
into meaningful information and delivering it to users promptly,
enabling informed decision-making and driving impactful outcomes.
Components of a Data Warehouse:
1. Data Sources: Operational systems, flat files, or external systems where raw data originates.
2. ETL Process (Extract, Transform, Load):
1. Extract: Data is retrieved from source systems.
2. Transform: Data is cleaned, formatted, and standardized.
3. Load: Data is stored in the warehouse.
3. Data Storage:
1. Central repository (relational databases, column storage).
2. Organized into fact tables (quantitative data) and dimension tables (descriptive data); a minimal sketch follows this list.
4. Metadata: Information about the data, such as definitions, formats, and relationships.
5. Query and Reporting Tools: Tools for data analysis, visualization, and reporting.
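To make the fact-table/dimension-table distinction in item 3 concrete, here is a minimal sketch using Python's built-in sqlite3 module; the schema and sample rows are invented for illustration.

```python
# Minimal star-schema sketch: one dimension table (descriptive data)
# and one fact table (quantitative data) that references it.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes of each product.
cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT)""")

# Fact table: quantitative measures, keyed to the dimension.
cur.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER,
    revenue REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 2, 1800.0), (2, 2, 1, 250.0), (3, 1, 1, 950.0)])

# A typical warehouse query: aggregate facts, sliced by a dimension attribute.
for row in cur.execute("""SELECT p.category, SUM(f.revenue)
                          FROM fact_sales f JOIN dim_product p USING (product_id)
                          GROUP BY p.category"""):
    print(row)
```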
This image visually compares two different data management
approaches: the data warehouse and the data lake.
Data Warehouse Flow:
1.Raw Data: Raw, unstructured or semi-structured data is
collected.
2.Formatted and Processed Data: The raw data is processed
and transformed into structured formats.
3.Data Warehouse: The structured data is stored in the data
warehouse, optimized for analytics.
4.Users: The processed data is accessed by users through
analytics tools.
Data Lake Flow:
1.Raw Data: Raw, unstructured, or semi-structured data is collected.
2.Data Lake: A data lake is a large repository of raw, unstructured data from various sources, stored in a centralized location.
3.Formatted and Processed Data: Data is processed on-demand for specific analytics use cases.
4.Users: Processed data is accessed by users through analytics tools.
ETL (Extract, Transform, and Load) Process
The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading. ETL is a recurring process (daily, weekly, or monthly) in a data warehouse system and needs to be agile, automated, and well documented.
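A minimal sketch of the three ETL steps in Python with pandas, assuming pandas is available; the file name, column names, and cleaning rules are illustrative assumptions.

```python
# Minimal ETL sketch with pandas: extract -> transform -> load.
# File and column names are illustrative; adapt to real sources.
import pandas as pd
import sqlite3

# Extract: pull raw records from a source system (here, a CSV export).
raw = pd.read_csv("orders_export.csv")

# Transform: clean, standardize, and derive a consistent schema.
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date", "customer_id"])
raw["country"] = raw["country"].str.strip().str.upper()

# Load: write the conformed data into the warehouse.
warehouse = sqlite3.connect("warehouse.db")
raw.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()
```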
Process of data warehousing
Data Sources:
•Multiple data sources (e.g., databases, applications, files) contribute raw data.
•ETL Process:
•Extract, Transform, Load (ETL) is the intermediate step where:
•Data is extracted from various sources.
•It is then transformed into a unified format, ensuring consistency and accuracy.
•Finally, the transformed data is loaded into the data warehouse.
•Data Warehouse:
•A central repository stores structured, cleaned, and processed data.
•Data is optimized for query performance and analytics.
•Output and Usage:
•The data warehouse supports:
•Visualization tools for graphical data representation.
•Reporting for operational and strategic insights.
•Business Intelligence (BI) tools for deeper analysis and informed decision-making.
Data warehouse architecture
1.External Sources (S1, S2, S3):
1. These are the origin points of raw data, which could include databases, flat files, or real-time streams.
2.Staging Area:
1. Before the data enters the data warehouse, it is temporarily stored in a staging area.
2. This step is critical for preprocessing tasks like deduplication, validation, and error correction (a small sketch follows this list).
3.ETL (Extract, Transform, Load):
1. The core process of:
1. Extracting data from external sources.
2. Transforming it to ensure uniformity, quality, and compatibility.
3. Loading the cleaned and structured data into the data warehouse.
4.Data Warehouse:
1. The central repository that stores transformed and processed data.
2. Data is optimized for queries, reporting, and analysis.
5.Data Marts:
1. Subsets of the data warehouse tailored for specific departments, teams, or use cases (e.g., finance,
marketing).
2. They help reduce complexity by providing focused data.
6.Data Mining:
1. Advanced techniques are applied to discover patterns, correlations, or insights within the data
warehouse.
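To make the staging-area step concrete, here is a toy sketch of the kind of deduplication and validation it performs; the rules and field names are assumptions for illustration.

```python
# Toy staging-area sketch: deduplicate and validate records before loading.
import pandas as pd

staged = pd.DataFrame({
    "customer_id": [101, 101, 102, None],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "age": [34, 34, -5, 41],
})

# Deduplication: drop exact duplicate records.
staged = staged.drop_duplicates()

# Validation: keep rows that satisfy basic integrity rules,
# and set aside the rest for error correction.
valid_mask = staged["customer_id"].notna() & staged["age"].between(0, 120)
clean, rejected = staged[valid_mask], staged[~valid_mask]

print(f"{len(clean)} rows pass validation, {len(rejected)} rejected")
```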
3-layer architecture for data analytics
1.Analytics Layer:
•This is the front-end client layer that
interacts with users.
•Provides tools such as dashboards and
reports for data access and visualization.
2.Semantics Layer:
•Data is restructured in this layer to
support complex queries and fast
analytics.
•It acts as an intermediary, optimizing
the data for ease of access and
improved performance.
3.Data Layer:
•Comprises the database server, data
marts, and data lakes where raw and
processed data are stored.
•Metadata is created here, and ELT
(Extract, Load, Transform) tools are
used to process and transform the data
for further use.
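The ELT (Extract, Load, Transform) pattern mentioned in the data layer differs from ETL in that raw data is loaded first and transformed afterwards inside the data store. A minimal sketch with sqlite3, using invented table and column names:

```python
# Minimal ELT sketch: load raw data first, then transform with SQL
# inside the data store itself.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Load: land the raw, untransformed records as-is.
cur.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
cur.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("u1", " 19.99"), ("u1", "5.00"), ("u2", "12.50 ")])

# Transform: done afterwards, in-database, by the query engine.
cur.execute("""CREATE TABLE spend_by_user AS
               SELECT user_id, SUM(CAST(TRIM(amount) AS REAL)) AS total
               FROM raw_events GROUP BY user_id""")

print(cur.execute("SELECT * FROM spend_by_user").fetchall())
```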
Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing.
Two-Tier Architecture
Source layer: A data warehouse system uses heterogeneous sources of data.
Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema.
Data warehouse layer: Information is saved to one logically centralized repository: the data warehouse.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts).
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population.
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
How does a Data Warehouse work?
A Data Warehouse works as a central repository where information arrives from one or more data
sources. Data flows into a data warehouse from the transactional system and other relational databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured
The data is processed, transformed, and ingested so that users can access the processed data in the Data
Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse
merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data warehousing
makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales
and profits.
Characteristics of data warehousing
• Subject oriented: data are organized by detailed subject, containing only information relevant for decision support. This provides a more comprehensive view of the organization.
• Integrated: data warehouses must place data from different sources into a consistent format.
• Time variant (time series): it contains historical data (daily, weekly, and monthly) in addition to current data (real-time).
• Nonvolatile: data cannot be changed or updated after it has entered the data warehouse. Obsolete (old) data are discarded, and changes are recorded as new data.
• Web based: designed for web-based applications.
• Relational/multidimensional: its structure is either relational or multidimensional.
• Uses client/server: so as to be easy to access.
• Includes metadata: data about data (how the data are organized and how to use them).
• Data mart
A data mart is a structured database that stores a selected
portion of an organization's data for a specific business unit,
department, or line of business.
Data marts are typically smaller and more focused than data
warehouses, and are often a subset of an organization's
larger enterprise data warehouse.
1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse (a minimal sketch follows this list).
2. Independent: Independent data marts are created without the use of a central data warehouse, drawing data directly from operational, external, or both sources.
3. Hybrid: Hybrid data marts can take data from both data warehouses and operational systems.
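As an illustration of a dependent data mart, the sketch below carves a department-specific subset out of a central warehouse using Python's sqlite3 module; all schema names are invented.

```python
# Toy sketch of a dependent data mart: a focused subset drawn from
# the central warehouse for one department.
import sqlite3

warehouse = sqlite3.connect(":memory:")
cur = warehouse.cursor()
cur.execute("CREATE TABLE sales (region TEXT, dept TEXT, revenue REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("EU", "marketing", 120.0), ("US", "finance", 300.0),
                 ("US", "marketing", 210.0)])

# The marketing data mart keeps only the rows that department needs.
cur.execute("""CREATE TABLE mart_marketing AS
               SELECT region, revenue FROM sales WHERE dept = 'marketing'""")
print(cur.execute("SELECT * FROM mart_marketing").fetchall())
```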
• Operational data stores (ODS)
An Operational Data Store (ODS) is a centralized database that
integrates real-time data from various sources to support operational
reporting and decision-making.
Unlike data warehouses that store historical data, an ODS focuses on
providing current data for immediate use. It acts as a bridge
between transactional systems and data warehouses, ensuring data
consistency and quick access to up-to-date information. This allows
organizations to make timely decisions based on the latest data available.
• Oper marts
An oper mart, or operational data mart, is a small-scale database that
stores and analyzes operational data for a specific department or
functional area in an organization.
• Enterprise data warehouse (EDW)
A technology that provides a vehicle for pushing data from source systems into a data warehouse that is used across the enterprise for decision support.
• Metadata
Data about data. In a data warehouse, metadata describe the contents of the warehouse and the manner of its use.
Data Warehousing Process
Overview
• Organizations continuously collect data, information, and knowledge
at an increasingly accelerated rate and store them in computerized
systems
• The number of users needing to access the information continues to
increase as a result of improved reliability and availability of network
access, especially the Internet
Data Objects and Attribute Types
• Data sets are made up of data objects.
• A data object represents an entity—in a sales database, the
objects may be customers, store items, and sales; in a medical
database, the objects may be patients; in a university database,
the objects may be students, professors, and courses.
• Data objects are typically described by attributes. Data objects
can also be referred to as samples, examples, instances, data
points, or objects. If the data objects are stored in a database,
they are data tuples. That is, the rows of a database
correspond to the data objects, and the columns correspond to
the attributes. In this section, we define attributes and look at
the various attribute types.
• Data: how the data objects and their attributes are stored.
• An attribute is an object's property or characteristic; for example, a person's hair color or the air humidity.
• An attribute set defines an object. The object is also referred to as a record, an instance, or an entity.
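A short sketch of the rows-are-objects, columns-are-attributes view described above, using pandas and an invented customer table:

```python
# Rows correspond to data objects (here, customers); columns
# correspond to attributes. The values are invented examples.
import pandas as pd

customers = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen"],                   # nominal attribute
     "age": [29, 41, 35],                              # numeric attribute
     "loyalty_tier": ["gold", "silver", "gold"]},      # ordinal attribute
)

print(customers.shape)    # (3 objects, 3 attributes)
print(customers.iloc[0])  # one data object (row/tuple) with its attributes
```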
Mapping the data warehouse to a
multiprocessor architecture
To manage a large number of client requests efficiently, database vendors designed parallel hardware architectures by implementing multiserver and multithreaded systems. This is called interquery parallelism, in which different server threads handle multiple requests at the same time (a toy sketch follows the list below).
This can be implemented on Symmetric Multiprocessing (SMP) systems, where it increases throughput and allows the support of more concurrent users.
Data warehouse can be mapped into different type of architectures as follows:
• Shared memory architecture
• Shared disk architecture
• Shared nothing architecture
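The following is a toy sketch of interquery parallelism, not a real DBMS: a pool of worker threads serves several simulated client queries concurrently, the way a multithreaded server would.

```python
# Toy illustration of interquery parallelism: several client requests
# are handled concurrently by a pool of server threads.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id: int) -> str:
    time.sleep(0.1)  # stand-in for real query work (I/O, scans, joins)
    return f"query {query_id} done"

# Four "client" queries dispatched at once to a multithreaded "server".
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_query, range(4)):
        print(result)
```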
Multiprocessor Architecture
This architecture is simple to implement, and the key idea is
that a single RDBMS server can potentially utilize all
processors, access all memory and access the entire
database.
There are three DBMS software architecture styles for parallel processing:
1. Shared memory (shared everything) architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture
These are tightly coupled, shared-memory systems: all processing nodes access the same memory space, allowing fast data sharing between them.
Pros:
High data transfer speeds due to direct memory access.
Ideal for applications with frequent data exchange
between nodes.
Cons:
Scalability limitations: Adding more nodes can create
memory contention and performance bottlenecks.
Single point of failure: If the shared memory fails, the
entire system is impacted.
Shared Disk Architecture
Shared disk systems are typically loosely coupled.
Multiple nodes in the cluster can access the same
disk storage, allowing data to be shared across the
system.
Pros:
Easier to implement than shared-nothing
architecture.
Suitable for applications with moderate data sharing
requirements.
Cons:
Potential performance bottlenecks due to disk
access contention when multiple nodes try to
read/write to the same data.
Limited scalability compared to shared-nothing
architectures.
Shared Nothing Architecture
Each node in the cluster has its own dedicated memory and
storage, with no shared resources between them. Data is
partitioned across nodes and accessed independently.
Pros:
Excellent scalability: Can easily add more nodes to handle increased
data volume.
High performance for large-scale queries due to parallel processing.
Fault tolerance: Failure of one node does not affect the entire
system.
Cons:
Requires complex data partitioning strategies.
May have higher network overhead due to data transfer between
nodes.
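A toy sketch of the shared-nothing idea: rows are hash-partitioned across independent "nodes" (plain Python lists standing in for separate machines), each partition is scanned in isolation, and partial results are merged at the end.

```python
# Toy shared-nothing sketch: hash-partition rows across independent
# "nodes", scan each partition separately, then merge partial results.
N_NODES = 3
nodes = [[] for _ in range(N_NODES)]

rows = [("alice", 10), ("bob", 7), ("carol", 3), ("dave", 12)]
for key, value in rows:
    nodes[hash(key) % N_NODES].append((key, value))  # data partitioning

# Each node computes a partial aggregate over only its own data...
partials = [sum(v for _, v in node) for node in nodes]

# ...and a coordinator merges the partial results (scatter-gather).
print("total =", sum(partials))
```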