Introduction
DATA MINING
Why Data Mining?
Necessity, who is the mother of invention. – Plato
We are drowning in data, but starving for knowledge!
The Explosive Growth of Data: from terabytes to
petabytes
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
Why Data Mining?
Data mining turns a large collection of data into
knowledge
A search engine (e.g., Google) receives hundreds of millions of queries
every day
Each query can be viewed as a transaction where the user describes her
or his information need
Some patterns found in user search queries can disclose invaluable
knowledge that cannot be obtained by reading individual data items
alone
Data Mining
Searching for knowledge (interesting patterns) in data
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown,
and potentially useful) patterns or knowledge from huge amounts of
data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”? No. The following are not:
Simple search and query processing
(Deductive) expert systems
Data Mining Applications
Data Mining for Financial Data Analysis
Design and construction of data warehouses
Loan payment prediction and customer credit
policy analysis
Classification and clustering of customers for
targeted marketing
Detection of money laundering and other financial
crimes
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing
communities
Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Knowledge Discovery (KDD) Process
Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be
combined)
Data selection (where data relevant to the analysis task are
retrieved from the database)
Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
Data mining (an essential process where intelligent methods
are applied to extract data patterns)
Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users)
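The seven steps above can be sketched as a small pipeline. This is only a minimal illustration in plain Python, not a real mining system: the record fields (`customer`, `item`, `amount`), the two toy sources, and both thresholds are invented assumptions, and the "mining" step is reduced to a frequency count.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are invented).
store_sales = [
    {"customer": "c1", "item": "laptop", "amount": 900.0},
    {"customer": "c2", "item": "laptop", "amount": None},  # noisy record
    {"customer": "c3", "item": "camera", "amount": 300.0},
]
web_sales = [
    {"customer": "c4", "item": "laptop", "amount": 950.0},
    {"customer": "c5", "item": "phone", "amount": 500.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Data transformation: aggregate to a count of sales per item."""
    return Counter(r["item"] for r in records)

def mine(item_counts):
    """Data mining: here, simply rank items by frequency."""
    return item_counts.most_common()

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return [(item, n) for item, n in patterns if n >= min_count]

def present(patterns):
    """Knowledge presentation: render the patterns as readable text."""
    return [f"{item}: sold {n} times" for item, n in patterns]

cleaned = integrate(clean(store_sales), clean(web_sales))
selected = select(cleaned, min_amount=400.0)
patterns = evaluate(mine(transform(selected)), min_count=2)
report = present(patterns)
print(report)
```

Each function maps to one step of the process, so any step can be replaced by something more realistic without touching the others.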
Data Warehouses
A data warehouse is a repository of information
collected from multiple sources, stored under a unified
schema, and usually residing at a single site.
It is usually modeled by a multidimensional data
structure, called a data cube
In a data cube, each dimension corresponds to an
attribute or a set of attributes in the schema
Each cell stores the value of some aggregate measure,
such as a count
A data cube provides a multidimensional view of data
and allows the pre-computation and fast access of
summarized data
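As a rough illustration of pre-computation in a data cube, the sketch below builds every cell of a small count cube in plain Python. The three dimensions (item, city, quarter), the sample facts, and the `ALL` marker are assumptions made up for this example. Real warehouses use far more efficient storage.

```python
from collections import Counter
from itertools import product

# Hypothetical sales facts: (item, city, quarter). All values are invented.
sales = [
    ("laptop", "Chicago", "Q1"),
    ("laptop", "Chicago", "Q1"),
    ("laptop", "Toronto", "Q2"),
    ("camera", "Chicago", "Q1"),
    ("camera", "Toronto", "Q1"),
]

ALL = "*"  # marker meaning "aggregated over this dimension"

def build_count_cube(facts):
    """Precompute the count measure for every cell, including roll-ups
    where one or more dimensions are replaced by ALL."""
    cube = Counter()
    for fact in facts:
        # Each fact contributes to 2**n cells: keep or roll up each dimension.
        for mask in product((False, True), repeat=len(fact)):
            cell = tuple(ALL if rolled else value for value, rolled in zip(fact, mask))
            cube[cell] += 1
    return cube

cube = build_count_cube(sales)
print(cube[("laptop", "Chicago", "Q1")])  # base cell
print(cube[("laptop", ALL, ALL)])         # rolled up over city and quarter
print(cube[(ALL, "Chicago", ALL)])        # rolled up over item and quarter
print(cube[(ALL, ALL, ALL)])              # apex cell: all facts
```

Once the cube is built, every summary is a single dictionary lookup, which is the "fast access of summarized data" the slide refers to.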
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Functionalities
Data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks
In general, such tasks can be classified into two
categories:
Descriptive: characterizes properties of the data in a
target data set
Predictive: performs induction on the current data in
order to make predictions
Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and
multidimensional data model
Multidimensional concept description: Characterization
and discrimination
Generalize, summarize, and contrast data characteristics
Example: Data Characterization
A customer relationship manager at
“ABCElectronics” may order the following data
mining task: Summarize the characteristics of
customers who spend more than $5000 a year at
“ABCElectronics”.
The result is a general profile of these customers,
such as that they are 40 to 50 years old, employed,
and have excellent credit ratings.
The data mining system should allow the customer
relationship manager to drill down on any
dimension, such as on occupation to view these
customers according to their type of employment
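The characterization task and the drill-down described above might look roughly like this in plain Python. The customer records, field names, and helper functions are all hypothetical; only the $5000 threshold comes from the example.

```python
from collections import Counter

# Hypothetical customer records; every value is invented for illustration.
customers = [
    {"age": 45, "occupation": "engineer", "credit": "excellent", "yearly_spend": 7200},
    {"age": 48, "occupation": "teacher", "credit": "excellent", "yearly_spend": 5600},
    {"age": 42, "occupation": "engineer", "credit": "excellent", "yearly_spend": 9100},
    {"age": 23, "occupation": "student", "credit": "fair", "yearly_spend": 800},
    {"age": 67, "occupation": "retired", "credit": "good", "yearly_spend": 1500},
]

def characterize(records, min_spend):
    """Summarize the target class: customers spending more than min_spend a year."""
    target = [r for r in records if r["yearly_spend"] > min_spend]
    ages = [r["age"] for r in target]
    return {
        "count": len(target),
        "age_range": (min(ages), max(ages)),
        "credit": Counter(r["credit"] for r in target),
        "target": target,
    }

def drill_down(target, dimension):
    """Break the summarized class down by a single dimension."""
    return Counter(r[dimension] for r in target)

profile = characterize(customers, min_spend=5000)
by_occupation = drill_down(profile["target"], "occupation")
print(profile["count"], profile["age_range"], dict(profile["credit"]))
print(dict(by_occupation))
```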
Example: Data Discrimination
A customer relationship manager at “ABCElectronics” may want
to compare two groups of customers—those who shop for
computer products regularly (e.g., more than twice a month) and
those who rarely shop for such products (e.g., fewer than three
times a year).
The resulting description provides a general comparative profile
of these customers: for example, 80% of the customers who
frequently purchase computer products are between 20 and 40
years old and have a university education, whereas 60% of the
customers who infrequently buy such products are either seniors
or youths and have no university degree.
Mining Frequent Patterns, Association
and Correlation Analysis
Frequent patterns or frequent item sets - patterns that
occur frequently in data.
A frequent item set typically refers to a set of items
that often appear together in a transactional data set
For example, milk and bread are frequently bought together in
grocery stores by many customers
What items are frequently purchased together in your Walmart?
A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a laptop, followed by a digital
camera, and then a memory card, is a (frequent) sequential pattern
Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
Association and Correlation Analysis
Suppose that, as a marketing manager at
“ABCElectronics”, you want to know which items are
frequently purchased together
An example of such a rule:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
A confidence, or certainty, of 50% means that if a
customer buys a computer, there is a 50% chance that
she will buy software as well
A 1% support means that 1% of all the transactions
under analysis show that computer and software are
purchased together
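To make the two measures concrete, here is a minimal sketch that computes support and confidence for the rule computer ⇒ software. The eight transactions are invented, so the resulting support differs from the 1% in the example. The function names are illustrative, not taken from any library.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software"},
    {"printer"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
    {"camera"},
    {"camera", "printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here 2 of the 8 transactions contain both items (support 25%), and 2 of the 4 transactions that contain a computer also contain software (confidence 50%).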
Question
A data mining system may find association rules as follows:
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]
What does the above association rule indicate?
Answer
The rule indicates that of all the customers under
study, 2% are 20 to 29 years old with an income of
$40,000 to $49,000 and have purchased a laptop
(computer)
There is a 60% probability that a customer in this age
and income group will purchase a laptop.
Classification
Classification and label prediction
Construct models (functions) based on some training
examples
Describe and distinguish classes or concepts for future
prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications: Credit card fraud detection, direct
Some Classification Tools
Classification and Regression
Suppose as a sales manager you want to classify a large set of
items in the store, based on three kinds of responses to a sales
campaign: good response, mild response and no response.
You want to derive a model for each of these three classes
based on the descriptive features of the items, such as price,
brand, place made, type, and category
Suppose instead that, rather than predicting categorical
response labels for each store item, you would like to predict
the amount of revenue that each item will generate during an
upcoming sale, based on the previous sales data
This is an example of regression
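The difference between the two tasks can be sketched with a toy example: classification returns a categorical label, regression returns a number. Everything below is an assumption made for illustration. The training data is invented, only the price feature is used, the classifier is a 1-nearest-neighbor rule picked for brevity, and the regression is an ordinary least-squares line.

```python
# Hypothetical training data for store items; all numbers are invented.
# Each tuple: (price, response to a past campaign, revenue in a past sale)
training_items = [
    (10.0, "good", 500.0),
    (20.0, "good", 900.0),
    (50.0, "mild", 2100.0),
    (60.0, "mild", 2500.0),
    (200.0, "no response", 8100.0),
]

def classify_response(price):
    """Classification: return the label of the training item whose price is
    closest (a 1-nearest-neighbor rule)."""
    nearest = min(training_items, key=lambda item: abs(item[0] - price))
    return nearest[1]

def fit_revenue_line():
    """Regression: least-squares fit of revenue = slope * price + intercept."""
    prices = [p for p, _, _ in training_items]
    revenues = [r for _, _, r in training_items]
    mean_p = sum(prices) / len(prices)
    mean_r = sum(revenues) / len(revenues)
    covariance = sum((p - mean_p) * (r - mean_r) for p, r in zip(prices, revenues))
    variance = sum((p - mean_p) ** 2 for p in prices)
    slope = covariance / variance
    return slope, mean_r - slope * mean_p

slope, intercept = fit_revenue_line()

def predict_revenue(price):
    """Predict a numeric revenue value from the fitted line."""
    return slope * price + intercept

print(classify_response(25.0))           # a class label
print(round(predict_revenue(100.0), 2))  # a numeric prediction
```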
Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
Principle: Maximizing intraclass similarity and minimizing
interclass similarity
Many methods and applications
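One widely used method is k-means, sketched below on one-dimensional data. This is a minimal version under stated assumptions: the house prices are invented, the starting centroids are supplied by hand rather than chosen at random, and distance is a plain absolute difference.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical house prices in thousands of dollars.
house_prices = [100, 110, 120, 300, 310, 320]
centroids, clusters = k_means(house_prices, centroids=[100, 320])
print(centroids)
print(clusters)
```

No class labels are supplied anywhere. The two groups emerge only from how close the prices are to one another.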
Outlier Analysis
Outlier: A data object that does not comply with the general behavior of
the data
Noise or exception? One person’s garbage could be another person’s
treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection and rare-event analysis
Example: Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of unusually large amounts for a given account number
in comparison to regular charges incurred by the same account.
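The credit card example can be approximated with a simple robust rule: flag a charge when it lies far from the account's median, measured in units of the median absolute deviation (MAD). This is one possible technique among many, not the method the slides prescribe. The charges and the cutoff `k = 5` are invented assumptions.

```python
from statistics import median

def find_outliers(amounts, k=5.0):
    """Return the amounts whose distance from the median exceeds k times the
    median absolute deviation (MAD). If MAD is 0, any amount that differs
    from the median is flagged."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    return [a for a in amounts if abs(a - center) > k * mad]

# Hypothetical charges on a single account.
charges = [25, 40, 30, 35, 28, 32, 2500]
suspicious = find_outliers(charges)
print(suspicious)
```

The median and MAD are used instead of the mean and standard deviation because a single huge charge would inflate the latter and could hide itself.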
Technologies Used
Statistics
Data mining has an inherent connection with statistics.
Statistics studies the collection, analysis, interpretation or
explanation, and presentation of data
Statistical models are widely used to model data and
data classes
Technologies Used
Machine Learning
It investigates how computers can learn (or improve
their performance) based on data
For example, a typical machine learning problem is to
program a computer so that it can automatically
recognize handwritten postal codes on mail after
learning from a set of examples
Technologies Used
Information Retrieval
It is the science of searching for documents or
information in documents
Documents can be text or multimedia, and may
reside on the Web
Major Issues
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Data mining—an interdisciplinary effort
Handling uncertainty, noise, or incompleteness of
data
Pattern evaluation and pattern- or constraint-guided mining