Introduction
DATA MINING
Dr. Mohammad Alsaudi
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
○ Automated data collection tools, database systems, Web,
computerized society.
– Major sources of data generation
○ Web, e-commerce, transactions, stocks, …
○ Remote sensing, bioinformatics, scientific simulation, etc.
○ News, digital cameras, YouTube
What Is Data Mining?
• Data mining (knowledge discovery from data)
Extraction of interesting (previously unknown and potentially
useful) patterns or knowledge from huge amounts of data.
– Data mining: a misnomer?
• Alternative names:
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Is everything “data mining”?
– No: simple search and query processing, for example, are not data mining.
Knowledge discovery from databases
• This is a view from typical database systems
and data warehousing communities
• Data mining plays an essential role in the
knowledge discovery process
Example: A Web Mining Framework
• Web mining usually involves
– Data cleaning
– Data integration from multiple sources
– Warehousing the data
○ A data warehouse is an electronic system for storing information
in a manner that is secure, reliable, easy to retrieve, and easy
to manage.
– Data cube construction
– Data selection for data mining
– Data mining
– Presentation of the mining results
– Patterns and knowledge to be used or stored in a knowledge base
Data Mining in Business Intelligence
Layers of the pyramid, from top (highest potential to support business decisions) to bottom:
• Decision Making
• Data Presentation
– Visualization Techniques
• Data Mining
– Information Discovery
• Data Exploration
– Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources
– Paper, Files, Web documents, Scientific experiments, Database Systems
• Roles shown alongside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process: A Typical View from ML and
Statistics
• This is a view from typical machine learning and statistics communities
• Flow: Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data pre-processing
– Data integration
– Normalization
– Feature selection
– Dimension reduction
• Data mining
– Pattern discovery
– Association & correlation
– Classification
– Clustering
– Outlier analysis
– …
• Post-processing
– Pattern evaluation
– Pattern selection
– Pattern interpretation
– Pattern visualization
Multi-Dimensional View of Data Mining
• Data to be mined
– Database data (extended-relational, object-oriented,
heterogeneous, legacy), transactional data, stream, time-series,
sequence, text and web, multi-media, graphs & social and
information networks.
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– What is the difference between a descriptive and a predictive model?
Descriptive mining summarizes the past data stored in databases
and reports what that data shows. Predictive mining identifies
patterns in past and transactional data in order to estimate
risks and future outcomes.
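The contrast between descriptive and predictive mining can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the monthly sales figures and the names `describe`, `fit_line`, and `forecast_month_6` are made up for the example, and a least-squares line is just one simple predictive model among many.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from the slides).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

def describe(values):
    """Descriptive mining: summarize what the stored data already shows."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}

def fit_line(xs, ys):
    """Predictive mining: fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

summary = describe(sales)                    # a report on the past
slope, intercept = fit_line(months, sales)   # a model learned from the past
forecast_month_6 = slope * 6 + intercept     # an estimate for an unseen month
```

The summary only restates the stored data, while the fitted line is used to estimate a value that is not in the data at all.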
Multi-Dimensional View of Data Mining
• Techniques utilized
– Data warehousing, machine learning, statistics, pattern
recognition, visualization, high-performance computing, etc.
• Applications adapted
– Telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web
mining, etc.
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia databases
– Text databases
– The World-Wide Web
Data Mining Function: (1) Generalization
• Information integration and data warehouse construction
– Data cleaning, transformation, integration, and multidimensional
data model
• Data cube technology
– Scalable methods for computing (i.e., materializing)
multidimensional aggregates
– OLAP (online analytical processing)
• Multidimensional concept description: Characterization
and discrimination
– Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet region
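As a rough illustration of materializing multidimensional aggregates, the sketch below sums a measure over every subset of dimensions. The fact table, its column meanings, and the function name `build_cube` are assumptions made for this example; real data cube systems use far more scalable methods.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall_mm).
facts = [
    ("north", "summer", 20),
    ("north", "winter", 80),
    ("south", "summer", 5),
    ("south", "winter", 15),
]

def build_cube(rows):
    """Materialize the sum of the measure for every subset of dimensions."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(range(len(dims)), k):
                # "*" marks a dimension that has been rolled up (aggregated away).
                key = tuple(d if i in kept else "*" for i, d in enumerate(dims))
                cube[key] += measure
    return dict(cube)

cube = build_cube(facts)
wet_region_total = cube[("north", "*")]   # roll-up over season
dry_region_total = cube[("south", "*")]
```

Comparing `cube[("north", "*")]` with `cube[("south", "*")]` is a small instance of contrasting a wet region with a dry one.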
Data Mining Function: (2) Association and
Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
– A typical association rule
○ Diaper → Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large
datasets?
• How to use such patterns for classification, clustering,
and other applications?
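Support and confidence for a rule such as Diaper → Beer can be computed directly from a list of transactions. The following is a minimal sketch, assuming a tiny made-up set of baskets; the function names are illustrative, and the lift measure is included as one common way to check whether an associated pair is also positively correlated.

```python
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
rule_lift = lift({"diaper"}, {"beer"}, transactions)
```

On this toy data the rule has support 0.6 and confidence of about 0.75, yet its lift is below 1 because beer is already in 80% of baskets. A rule can therefore look strongly associated without the items being positively correlated.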
Data Mining Function: (3) Classification
• Classification and label prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
○ E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
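To make "construct a model from training examples, then predict unknown labels" concrete, here is a sketch of a decision stump, that is, a decision tree with a single split. The training data, the labels "low" and "high", and the function names are assumptions for illustration; practical decision tree learners choose splits with measures such as information gain over many attributes.

```python
# Hypothetical training examples: (gas mileage in mpg, class label).
training = [
    (15, "low"), (18, "low"), (22, "low"),
    (30, "high"), (35, "high"), (40, "high"),
]

def predict(threshold, mpg):
    """Classify a car as "high" mileage if it exceeds the learned threshold."""
    return "high" if mpg > threshold else "low"

def train_stump(examples):
    """Pick the midpoint between sorted neighbors that gives the fewest training errors."""
    values = sorted(x for x, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, x) != label for x, label in examples)

    return min(candidates, key=errors)

threshold = train_stump(training)          # model construction
label_for_new_car = predict(threshold, 28) # label prediction for an unseen car
```

The learned threshold is the whole model here; predicting a label for a new car only needs one comparison.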
Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., the class label is unknown)
• Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
• Principle: maximizing intra-class similarity & minimizing
inter-class similarity
• Many methods and applications
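One widely used clustering method is k-means. The sketch below runs it on one-dimensional data, assuming made-up house prices and caller-supplied starting centers; `kmeans_1d` is an illustrative name, and the slides do not single out this method.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical house prices (in thousands); the initial centers are arbitrary guesses.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
```

No labels are given to the algorithm: the two groups emerge only from how close the prices are to each other, which is the intra-class versus inter-class similarity principle in action.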
Data Mining Function: (5) Outlier Analysis
• Outlier analysis
– Outlier: A data object that does not comply with the general
behavior of the data
– Noise or exception?
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
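A very simple way to flag objects that do not comply with the general behavior of the data is a z-score test. Note that this is a basic statistical stand-in, not one of the clustering-based or regression-based methods named above; the transaction amounts, the threshold of 2.0, and the name `zscore_outliers` are assumptions for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; 500 is the planted anomaly.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = zscore_outliers(amounts)
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) still has to be decided by the analyst.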
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
• Sequence, trend and evolution analysis
– Trend, time-series, and deviation analysis: e.g., regression and
value prediction
– Sequential pattern mining
○ e.g., first buy digital camera, then buy large SD memory
cards
– Periodicity analysis
– Motifs and biological sequence analysis
○ Approximate and consecutive motifs
– Similarity-based analysis
• Mining data streams
– Ordered, time-varying, potentially infinite data streams
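The idea behind sequential pattern mining, where order matters, can be sketched by counting how many purchase histories contain a pattern as an ordered subsequence. The histories and function names below are made up for illustration; full sequential pattern miners also generate and prune candidate patterns, which this sketch does not attempt.

```python
# Hypothetical purchase histories, each ordered by time.
histories = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]

def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the match,
    # so later items must appear after earlier ones.
    return all(item in remaining for item in pattern)

def sequence_support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

camera_then_card = sequence_support(["digital camera", "SD card"], histories)
card_then_camera = sequence_support(["SD card", "digital camera"], histories)
```

The two supports differ even though both patterns involve the same items, which is what separates sequential patterns from plain frequent itemsets.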
Structure and Network Analysis
• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
• Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
○ e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
○ A person could be in multiple information networks: friends, family,
classmates, …
– Links carry a lot of semantic information: Link mining
• Web mining
– The Web is a big information network: from PageRank to Google
– Analysis of Web information networks
○ Web community discovery, opinion mining, usage mining, …
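Since PageRank is mentioned as an example of mining the Web as an information network, here is a minimal power-iteration sketch of the commonly taught formulation. The four-page graph, the damping factor of 0.85, and the function name are assumptions for illustration, and the code assumes every linked page also appears as a key in the dictionary; it is not a description of how any search engine implements ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
most_important = max(ranks, key=ranks.get)
```

In this toy graph page C collects links from three other pages and ends up with the highest score, while page D, which nobody links to, keeps only the baseline share.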
Evaluation of Knowledge
• Is all mined knowledge interesting?
– One can mine a tremendous number of “patterns” and much knowledge
– Some may fit only certain dimension spaces (time, location, …)
– Some may not be representative, may be transient, …
• Evaluation of mined knowledge → directly mine only
interesting knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– …
Data Mining: Confluence of Multiple Disciplines
• Disciplines that feed into data mining: machine learning, pattern
recognition, statistics, applications, visualization, algorithms,
database technology, and high-performance computing
Applications of Data Mining
• Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Major Issues in Data Mining (2)
• Efficiency and Scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining