Data Mining and Warehousing
Lecture-1,2
Data Mining
• Data Mining is the process of extracting meaningful patterns and
insights from large datasets using various statistical and
computational techniques, allowing businesses to make informed
decisions based on hidden trends and correlations within the data.
Data Mining Concepts
Achieving the best results from data mining requires an array of tools and techniques. Some of the most commonly used functions include:
• Data cleansing and preparation — A step in which data is transformed into a form suitable for further analysis and
processing, such as identifying and removing errors and missing data.
• Artificial intelligence (AI) — These systems perform analytical activities associated with human intelligence such as planning,
learning, reasoning, and problem-solving.
• Association rule learning — These tools, also known as market basket analysis, search for relationships among variables in a
dataset, such as determining which products are typically purchased together.
• Clustering — A process of partitioning a dataset into a set of meaningful sub-classes, called clusters, to help users
understand the natural grouping or structure in the data.
• Classification — This technique assigns items in a dataset to target categories or classes with the goal of accurately
predicting the target class for each case in the data.
• Data analytics — The process of examining large datasets to extract meaningful insights and patterns, typically using statistical methods and specialized software to understand trends, make informed decisions, and solve complex problems.
• Data warehousing — A large collection of business data used to help an organization make decisions. It is the foundational
component of most large-scale data mining efforts.
• Machine learning — A computer programming technique that uses statistical probabilities to give computers the ability to
“learn” without being explicitly programmed.
• Regression — A technique used to predict a range of numeric values, such as sales, temperatures, or stock prices, based on a given dataset (a minimal sketch follows this list).
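As a concrete illustration of the regression technique above, here is a minimal sketch using scikit-learn; the data, column meanings, and model choice are illustrative assumptions rather than part of the lecture material.

```python
# Minimal regression sketch: predict monthly sales from advertising spend.
# Assumes scikit-learn and NumPy are installed; all data below is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: advertising spend (feature) vs. observed sales (target).
ad_spend = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
sales = np.array([25.0, 44.0, 66.0, 83.0, 105.0])

model = LinearRegression()
model.fit(ad_spend, sales)

# Predict sales for a new spend level the model has not seen.
predicted = model.predict(np.array([[35.0]]))
print(f"Predicted sales for a spend of 35: {predicted[0]:.1f}")
```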
This diagram depicts the data mining lifecycle, illustrating the stages of the process from defining the problem to deploying and updating models. It integrates data preparation, exploration, model building, validation, and deployment, emphasizing the cyclical workflow typical of data mining and machine learning projects.
Advantages of Data Mining
Data mining delivers concrete business benefits. For example, it can tell you which prospects are likely to become profitable customers based on past customer profiles, and which are most likely to respond to a specific offer. With this knowledge, you can increase your return on investment (ROI) by making your offer only to those prospects likely to respond and become valuable customers. Typical advantages include:
• Increasing revenue.
• Understanding customer segments and preferences.
• Acquiring new customers.
• Improving cross-selling and up-selling.
• Retaining customers and increasing loyalty.
• Increasing ROI from marketing campaigns.
• Detecting fraud.
• Identifying credit risks.
• Monitoring operational performance.
What is Data Warehousing?
• Data warehousing refers to the process of collecting, storing, and
managing large volumes of data from multiple sources in a centralized
repository for analysis and reporting purposes. It is a core component
of business intelligence (BI) systems and helps organizations make data-
driven decisions by providing a structured and consolidated view of
their data.
• It serves as an electronic repository for storing vast amounts of business
information, specifically designed for querying and analysis rather than
transactional processing. This process involves transforming raw data
into meaningful information and delivering it to users promptly,
enabling informed decision-making and driving impactful outcomes.
Components of a Data Warehouse:
1. Data Sources: Operational systems, flat files, or external systems where raw data originates.
2. ETL Process (Extract, Transform, Load):
1. Extract: Data is retrieved from source systems.
2. Transform: Data is cleaned, formatted, and standardized.
3. Load: Data is stored in the warehouse.
3. Data Storage:
1. Central repository (relational databases, column storage).
2. Organized into fact tables (quantitative data) and dimension tables (descriptive data); a minimal sketch follows this list.
4. Metadata: Information about the data, such as definitions, formats, and relationships.
5. Query and Reporting Tools: Tools for data analysis, visualization, and reporting.
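To make the fact-table/dimension-table distinction in item 3 concrete, here is a minimal sketch using Python's built-in sqlite3 module; the schema and sample rows are invented for illustration.

```python
# Minimal star-schema sketch: one dimension table (descriptive data)
# and one fact table (quantitative data) that references it.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes of each product.
cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT)""")

# Fact table: quantitative measures, keyed to the dimension.
cur.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER,
    revenue REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 2, 1800.0), (2, 2, 1, 250.0), (3, 1, 1, 950.0)])

# A typical warehouse query: aggregate facts, sliced by a dimension attribute.
for row in cur.execute("""SELECT p.category, SUM(f.revenue)
                          FROM fact_sales f JOIN dim_product p USING (product_id)
                          GROUP BY p.category"""):
    print(row)
```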
This image visually compares two different data management
approaches: the data warehouse and the data lake.
Data Warehouse Flow:
1.Raw Data: Raw, unstructured or semi-structured data is
collected.
2.Formatted and Processed Data: The raw data is processed
and transformed into structured formats.
3.Data Warehouse: The structured data is stored in the data
warehouse, optimized for analytics.
4.Users: The processed data is accessed by users through
analytics tools.
Data Lake Flow:
1.Raw Data: Raw, unstructured, or semi-structured data is collected.
2.Data Lake: A data lake is a large repository of raw, unstructured data from various sources, stored in a centralized location.
3.Formatted and Processed Data: Data is processed on-demand for specific analytics use cases.
4.Users: Processed data is accessed by users through analytics tools.
ETL (Extract, Transform, and Load) Process
The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading. ETL is a recurring process (daily, weekly, or monthly) in a data warehouse system and needs to be agile, automated, and well documented.
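A minimal sketch of the three ETL steps in Python with pandas, assuming pandas is available; the file name, column names, and cleaning rules are illustrative assumptions.

```python
# Minimal ETL sketch with pandas: extract -> transform -> load.
# File and column names are illustrative; adapt to real sources.
import pandas as pd
import sqlite3

# Extract: pull raw records from a source system (here, a CSV export).
raw = pd.read_csv("orders_export.csv")

# Transform: clean, standardize, and derive a consistent schema.
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date", "customer_id"])
raw["country"] = raw["country"].str.strip().str.upper()

# Load: write the conformed data into the warehouse.
warehouse = sqlite3.connect("warehouse.db")
raw.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()
```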
Process of data warehousing
Data Sources:
•Multiple data sources (e.g., databases, applications, files) contribute raw data.
•ETL Process:
•Extract, Transform, Load (ETL) is the intermediate step where:
•Data is extracted from various sources.
•It is then transformed into a unified format, ensuring consistency and accuracy.
•Finally, the transformed data is loaded into the data warehouse.
•Data Warehouse:
•A central repository stores structured, cleaned, and processed data.
•Data is optimized for query performance and analytics.
•Output and Usage:
•The data warehouse supports:
•Visualization tools for graphical data representation.
•Reporting for operational and strategic insights.
•Business Intelligence (BI) tools for deeper analysis and informed decision-making.
Data warehouse architecture
1.External Sources (S1, S2, S3):
1. These are the origin points of raw data, which could include databases, flat files, or real-time streams.
2.Staging Area:
1. Before the data enters the data warehouse, it is temporarily stored in a staging area.
2. This step is critical for preprocessing tasks like deduplication, validation, and error correction (a small sketch follows this list).
3.ETL (Extract, Transform, Load):
1. The core process of:
1. Extracting data from external sources.
2. Transforming it to ensure uniformity, quality, and compatibility.
3. Loading the cleaned and structured data into the data warehouse.
4.Data Warehouse:
1. The central repository that stores transformed and processed data.
2. Data is optimized for queries, reporting, and analysis.
5.Data Marts:
1. Subsets of the data warehouse tailored for specific departments, teams, or use cases (e.g., finance,
marketing).
2. They help reduce complexity by providing focused data.
6.Data Mining:
1. Advanced techniques are applied to discover patterns, correlations, or insights within the data
warehouse.
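To make the staging-area step concrete, here is a toy sketch of the kind of deduplication and validation it performs; the rules and field names are assumptions for illustration.

```python
# Toy staging-area sketch: deduplicate and validate records before loading.
import pandas as pd

staged = pd.DataFrame({
    "customer_id": [101, 101, 102, None],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "age": [34, 34, -5, 41],
})

# Deduplication: drop exact duplicate records.
staged = staged.drop_duplicates()

# Validation: keep rows that satisfy basic integrity rules,
# and set aside the rest for error correction.
valid_mask = staged["customer_id"].notna() & staged["age"].between(0, 120)
clean, rejected = staged[valid_mask], staged[~valid_mask]

print(f"{len(clean)} rows pass validation, {len(rejected)} rejected")
```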
3-layer architecture for data analytics
1.Analytics Layer:
•This is the front-end client layer that
interacts with users.
•Provides tools such as dashboards and
reports for data access and visualization.
2.Semantics Layer:
•Data is restructured in this layer to
support complex queries and fast
analytics.
•It acts as an intermediary, optimizing
the data for ease of access and
improved performance.
3.Data Layer:
•Comprises the database server, data
marts, and data lakes where raw and
processed data are stored.
•Metadata is created here, and ELT
(Extract, Load, Transform) tools are
used to process and transform the data
for further use.
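The ELT (Extract, Load, Transform) pattern mentioned in the data layer differs from ETL in that raw data is loaded first and transformed afterwards inside the data store. A minimal sketch with sqlite3, using invented table and column names:

```python
# Minimal ELT sketch: load raw data first, then transform with SQL
# inside the data store itself.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Load: land the raw, untransformed records as-is.
cur.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
cur.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("u1", " 19.99"), ("u1", "5.00"), ("u2", "12.50 ")])

# Transform: done afterwards, in-database, by the query engine.
cur.execute("""CREATE TABLE spend_by_user AS
               SELECT user_id, SUM(CAST(TRIM(amount) AS REAL)) AS total
               FROM raw_events GROUP BY user_id""")

print(cur.execute("SELECT * FROM spend_by_user").fetchall())
```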
Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing.
Two-Tier Architecture
Source layer: A data warehouse system uses heterogeneous sources of data.
Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema.
Data warehouse layer: Information is saved to one logically centralized repository: the data warehouse.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts).
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population.
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
How does a Data Warehouse work?
A Data Warehouse works as a central repository where information arrives from one or more data
sources. Data flows into a data warehouse from the transactional system and other relational databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured
The data is processed, transformed, and ingested so that users can access the processed data in the Data
Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse
merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data warehousing
makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales
and profits.
Characteristics of data warehousing
• Subject oriented: data are organized by detailed subject, containing only information relevant for decision support. This provides a more comprehensive view of the organization.
• Integrated: data warehouses must place data from different sources into a consistent format.
• Time variant (time series): it contains historical data (daily, weekly, and monthly) in addition to current data (real-time).
• Nonvolatile: data cannot be changed or updated after it has entered the data warehouse. Obsolete (old) data are discarded, and changes are recorded as new data.
• Web based: designed for web-based applications.
• Relational/multidimensional: its structure is either relational or multidimensional.
• Uses client/server: so as to be easy to access.
• Includes metadata: data about data (how the data are organized and how to use them).
• Data mart
A data mart is a structured database that stores a selected
portion of an organization's data for a specific business unit,
department, or line of business.
Data marts are typically smaller and more focused than data
warehouses, and are often a subset of an organization's
larger enterprise data warehouse.
1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse (a minimal sketch follows this list).
2. Independent: Independent data marts are created without the use of a central data warehouse, drawing data directly from operational, external, or both sources.
3. Hybrid: Hybrid data marts can take data from both data warehouses and operational systems.
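As an illustration of a dependent data mart, the sketch below carves a department-specific subset out of a central warehouse using Python's sqlite3 module; all schema names are invented.

```python
# Toy sketch of a dependent data mart: a focused subset drawn from
# the central warehouse for one department.
import sqlite3

warehouse = sqlite3.connect(":memory:")
cur = warehouse.cursor()
cur.execute("CREATE TABLE sales (region TEXT, dept TEXT, revenue REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("EU", "marketing", 120.0), ("US", "finance", 300.0),
                 ("US", "marketing", 210.0)])

# The marketing data mart keeps only the rows that department needs.
cur.execute("""CREATE TABLE mart_marketing AS
               SELECT region, revenue FROM sales WHERE dept = 'marketing'""")
print(cur.execute("SELECT * FROM mart_marketing").fetchall())
```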
• Operational data stores (ODS)
An Operational Data Store (ODS) is a centralized database that
integrates real-time data from various sources to support operational
reporting and decision-making.
Unlike data warehouses that store historical data, an ODS focuses on
providing current data for immediate use. It acts as a bridge
between transactional systems and data warehouses, ensuring data
consistency and quick access to up-to-date information. This allows
organizations to make timely decisions based on the latest data available.
• Oper marts
An oper mart, or operational data mart, is a small-scale database that
stores and analyzes operational data for a specific department or
functional area in an organization.
• Enterprise data warehouse (EDW)
A technology that provides a vehicle for pushing data from source systems into a data warehouse that is used across the enterprise for decision support.
• Metadata
Data about data. In a data warehouse, metadata describe the contents of the warehouse and the manner of its use.
Data Warehousing Process
Overview
• Organizations continuously collect data, information, and knowledge
at an increasingly accelerated rate and store them in computerized
systems
• The number of users needing to access the information continues to
increase as a result of improved reliability and availability of network
access, especially the Internet
Data Objects and Attribute Types
• Data sets are made up of data objects.
• A data object represents an entity—in a sales database, the
objects may be customers, store items, and sales; in a medical
database, the objects may be patients; in a university database,
the objects may be students, professors, and courses.
• Data objects are typically described by attributes. Data objects
can also be referred to as samples, examples, instances, data
points, or objects. If the data objects are stored in a database,
they are data tuples. That is, the rows of a database
correspond to the data objects, and the columns correspond to
the attributes. In this section, we define attributes and look at
the various attribute types.
• Data: how the data objects and their attributes are stored.
• An attribute is an object's property or characteristic; for example, a person's hair color or the air humidity.
• An attribute set defines an object. The object is also referred to as a record, an instance, or an entity.
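A short sketch of the rows-are-objects, columns-are-attributes view described above, using pandas and an invented customer table:

```python
# Rows correspond to data objects (here, customers); columns
# correspond to attributes. The values are invented examples.
import pandas as pd

customers = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chen"],                   # nominal attribute
     "age": [29, 41, 35],                              # numeric attribute
     "loyalty_tier": ["gold", "silver", "gold"]},      # ordinal attribute
)

print(customers.shape)    # (3 objects, 3 attributes)
print(customers.iloc[0])  # one data object (row/tuple) with its attributes
```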
Mapping the data warehouse to a
multiprocessor architecture
To manage a large number of client requests efficiently, database vendors designed parallel hardware architectures by implementing multiserver and multithreaded systems. This is called interquery parallelism, in which different server threads handle multiple requests at the same time (a toy sketch follows the list below).
This can be implemented on Symmetric Multiprocessing (SMP) systems, where it increases throughput and allows the support of more concurrent users.
Data warehouse can be mapped into different type of architectures as follows:
• Shared memory architecture
• Shared disk architecture
• Shared nothing architecture
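The following is a toy sketch of interquery parallelism, not a real DBMS: a pool of worker threads serves several simulated client queries concurrently, the way a multithreaded server would.

```python
# Toy illustration of interquery parallelism: several client requests
# are handled concurrently by a pool of server threads.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id: int) -> str:
    time.sleep(0.1)  # stand-in for real query work (I/O, scans, joins)
    return f"query {query_id} done"

# Four "client" queries dispatched at once to a multithreaded "server".
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_query, range(4)):
        print(result)
```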
Multiprocessor Architecture
This architecture is simple to implement, and the key idea is
that a single RDBMS server can potentially utilize all
processors, access all memory and access the entire
database.
There are three DBMS software architecture styles for parallel processing:
1. Shared memory (shared everything) architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture
These are tightly coupled, shared-memory systems: all processing nodes access the same memory space, allowing fast data sharing between them.
Pros:
High data transfer speeds due to direct memory access.
Ideal for applications with frequent data exchange
between nodes.
Cons:
Scalability limitations: Adding more nodes can create
memory contention and performance bottlenecks.
Single point of failure: If the shared memory fails, the
entire system is impacted.
Shared Disk Architecture
Shared disk systems are typically loosely coupled.
Multiple nodes in the cluster can access the same
disk storage, allowing data to be shared across the
system.
Pros:
Easier to implement than shared-nothing
architecture.
Suitable for applications with moderate data sharing
requirements.
Cons:
Potential performance bottlenecks due to disk
access contention when multiple nodes try to
read/write to the same data.
Limited scalability compared to shared-nothing
architectures.
Shared Nothing Architecture
Each node in the cluster has its own dedicated memory and
storage, with no shared resources between them. Data is
partitioned across nodes and accessed independently.
Pros:
Excellent scalability: Can easily add more nodes to handle increased
data volume.
High performance for large-scale queries due to parallel processing.
Fault tolerance: Failure of one node does not affect the entire
system.
Cons:
Requires complex data partitioning strategies.
May have higher network overhead due to data transfer between
nodes.
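A toy sketch of the shared-nothing idea: rows are hash-partitioned across independent "nodes" (plain Python lists standing in for separate machines), each partition is scanned in isolation, and partial results are merged at the end.

```python
# Toy shared-nothing sketch: hash-partition rows across independent
# "nodes", scan each partition separately, then merge partial results.
N_NODES = 3
nodes = [[] for _ in range(N_NODES)]

rows = [("alice", 10), ("bob", 7), ("carol", 3), ("dave", 12)]
for key, value in rows:
    nodes[hash(key) % N_NODES].append((key, value))  # data partitioning

# Each node computes a partial aggregate over only its own data...
partials = [sum(v for _, v in node) for node in nodes]

# ...and a coordinator merges the partial results (scatter-gather).
print("total =", sum(partials))
```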