Chapter 1: Introduction to Data Mining
Data Mining: Concepts and Techniques
August 20, 2025
Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Kind of patterns to be mined
Technologies used
Major issues in data mining
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes (1,000 terabytes)
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation,
…
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Evolution of Database Technology
1960s:
Data collection, database creation and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems
Evolution of database system technology
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
Decision making
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
[Figure: the KDD process, in order: Databases; Data Cleaning and Data
Integration; Data Warehouse; Selection & Transformation; Data Mining;
Pattern Evaluation]
Data Mining in Business Intelligence
[Figure: layered pyramid, listed here from top to bottom; the potential
to support business decisions increases toward the top]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments,
Database Systems
Why Data Mining? Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis,
cross selling
Risk analysis and management
Forecasting, customer retention, quality
control, competitive analysis
Fraud detection and detection of unusual
patterns (outliers)
Other Applications
Text mining (news group, email, documents)
and Web mining
Stream data mining
Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management
Where does the data come from?—Credit card
transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle
studies
Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits,
etc.
Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations
between product sales, and predict based
on such associations
Customer profiling—What types of customers buy
what products (clustering or classification)
Ex. 1: Market Analysis and Management contd.
Customer requirement analysis
Identify the best products for different groups of
customers
Predict what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency
and variation)
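As a rough illustration of the statistical summary information mentioned above, the sketch below computes central tendency (mean, median, mode) and variation (range, standard deviation) with Python's standard `statistics` module. The spending figures are made up for this example and do not come from the slides.

```python
import statistics

# Hypothetical yearly spending (in Rs.) for a handful of customers.
spending = [12000, 15000, 15000, 18000, 22000, 50000]

summary = {
    # Central tendency
    "mean": statistics.mean(spending),
    "median": statistics.median(spending),
    "mode": statistics.mode(spending),
    # Variation
    "range": max(spending) - min(spending),
    "stdev": statistics.stdev(spending),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Note how the single large value pulls the mean well above the median, which is one reason summary reports usually show both.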
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-
sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data (geographical data)
Multimedia database
Text databases
The World-Wide Web
Data Mining: On What Kinds of Data?
Mining relational databases
E.g., analyze customer data to predict the credit
risk of new customers based on their income,
age and previous credit information.
Data Warehouses
E.g., sales per item type per branch for the third quarter.
Data is stored to provide information from a
historical perspective, e.g., summarized data
for the past 6 to 12 months.
Modeled by a multidimensional data structure
called a data cube.
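The "sales per item type per branch for the third quarter" query can be sketched as an aggregation over chosen dimensions, which is the basic operation behind a data cube. This is a minimal illustration in plain Python, not how a warehouse engine works internally; the records, branch names, and the `aggregate` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (item_type, branch, quarter, amount).
sales = [
    ("computer", "Pune", "Q3", 400),
    ("computer", "Pune", "Q3", 250),
    ("printer", "Pune", "Q3", 100),
    ("computer", "Delhi", "Q3", 300),
    ("printer", "Delhi", "Q2", 80),
]


def aggregate(facts, dims):
    """Sum the amount over the chosen dimensions.

    dims holds tuple positions: 0 = item type, 1 = branch, 2 = quarter.
    """
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        totals[key] += fact[3]
    return dict(totals)


# Slice the cube on quarter = Q3, then summarize by item type and branch.
q3 = [fact for fact in sales if fact[2] == "Q3"]
per_item_branch = aggregate(q3, dims=(0, 1))

# Roll up: drop the branch dimension to get totals per item type.
per_item = aggregate(q3, dims=(0,))

print(per_item_branch)
print(per_item)
```

Dropping a dimension gives a coarser summary (roll-up); adding one back gives a finer view (drill-down).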
Data Mining: On What Kinds of Data?
Transactional data
E.g., analyze which items sell well together:
printers are normally purchased together with
computers
Kinds of Patterns to be mined
What Kinds of Patterns Can Be Mined?
1) Generalization
2) Association and Correlation Analysis
3) Classification
4) Cluster Analysis
5) Outlier Analysis
Data Mining Function: (1) Generalization
Multidimensional concept description:
Characterization and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., summarize the
characteristics of customers who spend more
than Rs. 50,000 a year at an electronics store
Data characterization is a summarization of
the general characteristics or features of a
target class of data
Data cube technology for computing
OLAP (online analytical processing)
Examples of output forms: pie charts, multidimensional
data cubes, bar charts, curves, etc.
Data Mining Function: (1) Generalization contd.
Data discrimination is a comparison of the general
features of the target class data objects against the
general features of objects from one or multiple
contrasting classes.
E.g., compare two groups of customers: those who
shop for computer products regularly (more than
twice a month) and those who rarely shop for such
products (fewer than 3 times a year)
Data cube technology for computing
Drill down on any dimension
Discriminant rules: Discrimination descriptions
expressed in the form of rules
Output forms: same as those of data characterization,
along with discrimination descriptions
Data Mining Function: (2) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in
your mart? E.g., milk and bread
Association, correlation vs. causality
A typical association rule
computer → software [1%, 50%] (support,
confidence)
A confidence of 50% means that if a customer buys a computer,
there is a 50% chance that she will buy software too. A support
of 1% means that 1% of all transactions under analysis show
that a computer and software are purchased together
Association rules are discarded as uninteresting if
they do not satisfy both a minimum support
threshold and a minimum confidence
threshold
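The support and confidence of a rule such as computer → software can be computed directly from a list of transactions. The sketch below is a minimal illustration; the transactions, the function names, and the threshold values are assumptions made for this example.

```python
# Hypothetical market-basket transactions (one set of items per purchase).
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"milk", "bread"},
    {"computer", "software", "printer"},
    {"milk"},
]


def support(itemset, txns):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in txns if itemset <= t)
    return hits / len(txns)


def confidence(antecedent, consequent, txns):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, txns)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is unsupported
    return support(antecedent | consequent, txns) / base


def is_interesting(antecedent, consequent, txns, min_support, min_confidence):
    """Keep a rule only if it meets BOTH minimum thresholds."""
    return (
        support(antecedent | consequent, txns) >= min_support
        and confidence(antecedent, consequent, txns) >= min_confidence
    )


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer -> software [{rule_support:.0%}, {rule_confidence:.0%}]")
```

With these five transactions the rule has 40% support (2 of 5 purchases) and about 67% confidence (2 of the 3 purchases containing a computer), so it passes thresholds of 30% and 60% but would be discarded under a 70% confidence threshold.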
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future
prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
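To make "construct models based on some training examples" concrete, the sketch below learns a one-level decision tree (a decision stump) that classifies cars by gas mileage. It is a toy instance of the decision-tree family listed above, not a full tree learner. The training data, the class labels, and the assumption that the "high" class lies above the threshold are all hypothetical.

```python
# Hypothetical training examples: (gas mileage in km/l, class label).
training = [
    (8, "low"), (10, "low"), (11, "low"),
    (15, "high"), (18, "high"), (21, "high"),
]


def predict(threshold, mileage):
    """One-level decision tree: a single test on the mileage attribute."""
    return "high" if mileage > threshold else "low"


def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:]) if a != b]

    def errors(threshold):
        return sum(1 for x, label in examples if predict(threshold, x) != label)

    return min(candidates, key=errors)


threshold = train_stump(training)

# Predict class labels for cars that were not in the training set.
print(threshold, predict(threshold, 12), predict(threshold, 17))
```

The learned threshold is the model; applying it to unseen cars is the "predict some unknown class labels" step.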
Various forms of a classification model
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., Class label is
unknown)
Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns
Data objects are clustered or grouped
based on the principle of maximizing
intraclass similarity and minimizing
interclass similarity
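The slides do not name a specific clustering algorithm. As one common choice, the sketch below runs a one-dimensional k-means: each point joins its nearest centroid (high similarity within a cluster), then the centroids move to the mean of their members. The house prices and the starting centroids are made up for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers; no class labels are used at any step."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical house prices (in lakhs of Rs.).
prices = [20, 22, 25, 60, 62, 65]
centroids, clusters = kmeans_1d(prices, centroids=[20, 65])
print(clusters)
```

The two groups that emerge were never labeled in the input, which is what makes this unsupervised learning.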
Data Mining Function: (5) Outlier Analysis
Outlier analysis (anomaly mining)
Outlier: A data object that does not comply
with the general behaviour of the data
Noise or exception? ― One person’s
garbage could be another person’s treasure
Methods: by-product of clustering or
regression analysis, …
Useful in fraud detection, rare events
analysis
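The slides mention clustering and regression as sources of outliers. The sketch below swaps in a simpler statistical stand-in, a z-score test, which flags values that lie far from the mean in units of standard deviation. The transaction amounts and the threshold of 2.0 are assumptions chosen for this example.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
amounts = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) is a judgment the method itself cannot make.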
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of
them are interesting
Suggested approach: Human-centered, query-based, focused
mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., a large
earthquake often follows a cluster of small earthquakes.
Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Do we need to find all of the interesting patterns?
Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting
patterns?
Approaches
First generate all the patterns and then filter out the
uninteresting ones
Generate only the interesting patterns
Technologies Used
As a highly application-driven domain, data mining has
incorporated many techniques from other domains
The interdisciplinary nature of data mining research and
development contributes significantly to the success of data
mining and its extensive applications
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning,
Pattern Recognition, Statistics, Visualization, High-Performance
Computing, Database Technology, Algorithm, and Applications]
Data Mining: Confluence of Multiple Disciplines
Statistics
Statistical models are widely used to model data
and data classes.
E.g., we can use statistics to model noise and
missing data.
Machine learning
Computer programs automatically learn to
recognize complex patterns and make intelligent
decisions based on data.
E.g., recognizing handwritten postal codes
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as
terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked
data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., files in pdf or doc
Mining knowledge in multi-dimensional space.
Data mining: An interdisciplinary effort (e.g., mining data with
natural-language text)
Pattern evaluation: the interestingness problem
Handling noise, uncertainty, and incompleteness of data
Integration of the discovered knowledge with existing
one: knowledge fusion
Pattern evaluation and pattern- or constraint-guided
mining
Major Issues in Data Mining (1)
User interaction
Interactive mining (dynamically change the focus of the search)
Incorporation of background knowledge (constraints, rules)
Presentation and visualization of data mining results
Efficiency and Scalability
Efficiency and scalability of data mining algorithms (run time:
predictable, short, acceptable)
Parallel, distributed, stream, and incremental mining
methods
Diversity of data types
Handling complex types of data (from simple to temporal data
objects)
Mining dynamic, networked, and global data repositories
Major Issues in Data Mining (2)
Data mining and society
Social impacts of data mining (benefit to society)
Privacy-preserving data mining
Invisible data mining (mining built into systems as a function,
available at the click of a mouse)
Architecture: Typical Data Mining System
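[Figure: layered architecture, listed here from top to bottom]
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Knowledge-Base (shown beside Pattern Evaluation and the Data Mining Engine)
Database or Data Warehouse Server
Data cleaning, integration, and selection
Database, Data Warehouse, World-Wide Web, Other Info Repositories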
Summary
Data mining: Discovering interesting patterns from large
amounts of data
A natural evolution of database technology, in great demand,
with wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
Data mining systems and architectures
Major issues in data mining