Data Mining
[Figure: KDD process stages, from source to mining: Databases; Data Cleaning and Data Integration; Data Warehouse; Data Selection; Task-relevant Data; Data Mining]
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
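The steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, field names, thresholds, and the choice of mean imputation and frequency counting are assumptions made for the example, not part of the KDD process definition.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "age": 25, "income": 30000, "region": "north"},
    {"id": 2, "age": None, "income": 52000, "region": "north"},
    {"id": 3, "age": 47, "income": 61000, "region": "south"},
    {"id": 4, "age": 33, "income": None, "region": "north"},
    {"id": 5, "age": 58, "income": 75000, "region": "north"},
]

def select(records, region):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]

def clean(records):
    """Data cleaning: fill each missing numeric value with the column mean."""
    cleaned = [dict(r) for r in records]
    for field in ("age", "income"):
        known = [r[field] for r in cleaned if r[field] is not None]
        fill = mean(known)
        for r in cleaned:
            if r[field] is None:
                r[field] = fill
    return cleaned

def transform(records):
    """Data reduction and transformation: keep two derived features."""
    return [
        {"age_group": "young" if r["age"] < 40 else "older",
         "high_income": r["income"] >= 50000}
        for r in records
    ]

def mine(records):
    """Data mining (summarization): count each feature combination."""
    return Counter((r["age_group"], r["high_income"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = evaluate(mine(transform(clean(select(RAW, "north")))), min_count=2)
print(patterns)
```

Each function stands in for one step, so a different cleaning rule or mining function can be swapped in without touching the rest of the chain.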
Architecture: Typical Data Mining System
[Figure: system components: Pattern Evaluation; Data Mining Engine; Knowledge-Base; Database or Data Warehouse Server]
[Figure: layers by increasing potential to support business decisions: Data Exploration (Statistical Summary, Querying, and Reporting); Decision Making (End User)]
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Primitives that Define a Data Mining Task
1. Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
2. Type of knowledge to be mined
Characterization, discrimination, association, classification, prediction, clustering,
outlier analysis, other data mining tasks
3. Background knowledge
A typical kind of background knowledge: Concept hierarchies
Schema hierarchy: e.g., street < city < province_or_state < country
Set-grouping hierarchy: e.g., {20-39} = young, {40-59} = middle_aged
4. Pattern interestingness measurements
Simplicity
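The two concept hierarchies given under background knowledge can be written down as plain data structures. The sketch below uses the levels and age ranges from the examples above; the function names are hypothetical, and returning None for ages outside the listed ranges is an assumption.

```python
# Schema hierarchy from the text, ordered from most specific to most general.
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]

# Set-grouping hierarchy from the text: inclusive age ranges mapped to a concept.
AGE_GROUPS = {(20, 39): "young", (40, 59): "middle_aged"}

def generalize_attribute(attribute):
    """Return the next more general level, or None at the top of the hierarchy."""
    i = SCHEMA_HIERARCHY.index(attribute)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None

def age_concept(age):
    """Map an age to its concept; ages outside the listed ranges return None."""
    for (low, high), concept in AGE_GROUPS.items():
        if low <= age <= high:
            return concept
    return None

print(generalize_attribute("city"))
print(age_concept(25), age_concept(40))
```

A mining system could use such mappings to roll data up to a coarser level before searching for patterns.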
4. Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
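One simple way to flag objects that do not comply with the general behavior of the data is to measure how far each value lies from the median, in units of the median absolute deviation (MAD). This is only one of many possible techniques, chosen here for illustration; the cutoff k=3.0 and the sample amounts are assumptions.

```python
from statistics import median

def find_outliers(values, k=3.0):
    """Return values whose distance from the median exceeds k times the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All deviations are (mostly) zero, so this rule cannot rank anything.
        return []
    return [v for v in values if abs(v - med) / mad > k]

# Hypothetical transaction amounts with one unusual entry.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(amounts))
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) still has to be decided with domain knowledge.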
Data mining may generate thousands of patterns: Not all of them are interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data
with some degree of certainty, potentially useful, novel, or validates some hypothesis
that a user seeks to confirm
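The text does not name specific measures. As an assumption for illustration, the sketch below uses support and confidence, two common objective measures for association patterns, to filter out uninteresting rules. The transactions, thresholds, and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

def is_interesting(transactions, antecedent, consequent,
                   min_support=0.4, min_confidence=0.6):
    """Keep a rule only if it clears both (assumed) thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)

# Hypothetical market-basket data.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(TRANSACTIONS, {"bread", "milk"}))
print(confidence(TRANSACTIONS, {"bread"}, {"milk"}))
print(is_interesting(TRANSACTIONS, {"bread"}, {"milk"}))
```

Objective thresholds like these cover only validity and certainty; novelty, usefulness, and understandability remain subjective judgments made by the user.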
Major Issues in Data Mining
1. Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
2. User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
3. Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Potential Applications of Data Mining