Data Mining: Data Mining Tasks & Data
Data Mining - Lecture 2
Data Mining Tasks
Prediction Tasks
Use some variables to predict unknown or future values of other
variables
Description Tasks
Find human-interpretable patterns that describe the data.
Common data mining tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the
class.
Find a model for the class attribute as a function of
the values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually,
the given data set is divided into training and test sets, with training
set used to build the model and test set used to validate it.
Classification Example
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set.
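As a rough illustration of the train/test workflow, the sketch below learns a deliberately simple one-attribute rule (the majority class for each Marital Status value) and checks it on held-out records. The learner, the 7/3 split of the ten labeled records, and the names train_one_rule, predict, and accuracy are assumptions made for this example; the lecture does not prescribe a particular classifier.

```python
from collections import Counter, defaultdict

# Labeled records from the slide: (refund, marital_status, income_k, cheat)
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]
MARITAL = 1  # index of the attribute the rule is built on (an assumption)

def train_one_rule(records, attr_index):
    """Map each value of one attribute to the majority class seen with it."""
    votes = defaultdict(Counter)
    for rec in records:
        votes[rec[attr_index]][rec[-1]] += 1
    rule = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    # Fallback class for attribute values never seen during training.
    default = Counter(rec[-1] for rec in records).most_common(1)[0][0]
    return rule, default

def predict(model, record, attr_index):
    rule, default = model
    return rule.get(record[attr_index], default)

def accuracy(model, records, attr_index):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(predict(model, r, attr_index) == r[-1] for r in records)
    return correct / len(records)

# Hold out the last three labeled records to validate the model.
train, test = RECORDS[:7], RECORDS[7:]
model = train_one_rule(train, MARITAL)
print("held-out accuracy:", accuracy(model, test, MARITAL))
```

The held-out records play the role of the test set: they are never shown to the learner, so the measured accuracy estimates how the model behaves on unseen records.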
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier
model.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Approach:
Use credit card transactions and the information on its
account-holder as attributes.
When a customer buys, what he buys, how often he
pays on time, etc.
Label past transactions as fraud or fair transactions. This forms
the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
Clustering: Definition
Given a set of data points, each having a set of
attributes, and a similarity measure among
them, find clusters such that
Data points in one cluster are more similar to one
another.
Data points in separate clusters are less similar to one
another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
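A minimal sketch of the Euclidean distance measure, followed by the basic step of assigning each point to its closest cluster center. The points, the centers, and the helper assign_to_nearest are hypothetical; a full clustering algorithm would also have to find the centers.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in points
    ]

# Hypothetical 2-D data points and two hypothetical cluster centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
centers = [(1.0, 1.5), (8.5, 8.0)]
labels = assign_to_nearest(points, centers)
print(labels)
```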
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on
their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
Approach: Identify frequently occurring
terms in each document. Form a similarity
measure based on the frequencies of different
terms. Use it to cluster.
Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
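One possible similarity measure over term frequencies is cosine similarity; the lecture does not name a specific measure, so this choice is an assumption. The sketch applies it to the three term-count vectors listed on the Document Data slide.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an empty document
    return dot / (norm_u * norm_v)

# Term counts of Document 1, 2 and 3 from the Document Data slide.
doc1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
doc2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
doc3 = [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]

print("sim(doc1, doc2) =", cosine_similarity(doc1, doc2))
print("sim(doc1, doc3) =", cosine_similarity(doc1, doc3))
```

Because the measure depends only on the direction of the vectors, a long and a short document with the same mix of terms still count as similar.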
Association Rule Discovery: Definition
Given a set of records, each of which contains some
number of items from a given collection,
produce dependency rules that predict the occurrence of an
item based on the occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Juice, Bread
3    Juice, Coke, Diaper, Milk
4    Juice, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Juice}
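The strength of such rules is commonly summarized by support and confidence. These two measures are not defined on the slide, so their use here is an assumption. The sketch computes both for the two discovered rules over the five transactions.

```python
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Juice", "Bread"},
    {"Juice", "Coke", "Diaper", "Milk"},
    {"Juice", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Juice"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs, TRANSACTIONS),
          "confidence:", confidence(lhs, rhs, TRANSACTIONS))
```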
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
Let the rule discovered be
{Cookies, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
Cookies in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling Cookies.
Cookies in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Cookies to promote sale of Potato chips!
Association Rule Discovery: Application 2
Supermarket shelf management.
Goal: To identify items that are bought
together by sufficiently many customers.
Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
A classic rule --
If a customer buys diapers and milk, then he is very
likely to buy juice.
So, don’t be surprised if you find six-packs stacked
next to diapers!
Sequential Pattern Discovery: Definition
Given a set of objects, with each object associated with
its own timeline of events, find rules that predict strong
sequential dependencies among different events:
In point-of-sale transaction sequences,
Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
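A building block for discovering such rules is testing whether one object's timeline contains a given sequential pattern. The sketch below assumes a timeline is a list of item sets in time order; the function name contains_pattern and the customer timeline (including the extra "Notebook" item) are hypothetical.

```python
def contains_pattern(timeline, pattern):
    """True if the pattern's elements occur, in order, in the timeline.

    Each pattern element must be a subset of a distinct timeline element,
    and the matched timeline elements must appear in the same order.
    """
    pos = 0
    for element in pattern:
        # Advance to the next timeline element that covers this element.
        while pos < len(timeline) and not element <= timeline[pos]:
            pos += 1
        if pos == len(timeline):
            return False
        pos += 1
    return True

# Hypothetical purchase history of one bookstore customer.
timeline = [
    {"Intro_To_Visual_C"},
    {"C++_Primer", "Notebook"},
    {"Perl_for_dummies"},
]
pattern = [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies"}]
print(contains_pattern(timeline, pattern))
```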
Regression
Predict the value of a given continuous-valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
Extensively studied in statistics and in the neural network
field.
Examples:
Predicting the age of a person based on
MaritalStatus, NumberOfChildren, Income, …
E.g., if MaritalStatus = Yes, Age = 20 + 4*NumberOfChildren + 0.0001*Income + …
Predicting wind velocities as a function of
temperature, humidity, and pressure.
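A minimal sketch of fitting a linear model with one input variable by ordinary least squares. The data is synthetic: it is generated from only the first two terms of the age formula above (Age = 20 + 4*NumberOfChildren), ignoring the remaining terms, so that the fit can be checked against known coefficients. The function name fit_line is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data built from Age = 20 + 4 * NumberOfChildren only.
children = [0, 1, 2, 3]
ages = [20 + 4 * c for c in children]

intercept, slope = fit_line(children, ages)
print("Age ~", intercept, "+", slope, "* NumberOfChildren")
```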
Deviation/Anomaly Detection
Detect significant deviations
from normal behavior
Applications:
Credit Card Fraud Detection
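One simple way to make "significant deviation" concrete is a z-score threshold: flag values that lie more than a chosen number of standard deviations from the mean. This technique, the threshold of 2.0, and the transaction amounts are all assumptions for illustration; real fraud detection uses far richer models.

```python
import statistics

def flag_deviations(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical transaction amounts on one credit card account.
amounts = [20, 25, 22, 19, 24, 21, 500]
flagged = flag_deviations(amounts)
print(flagged)
```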
What is Data?
Collection of data objects and
their attributes.
An attribute is a property or
characteristic of an object.
Examples: eye color of a
person, temperature, etc.
Attribute is also known as
variable, field, characteristic,
or feature.
A collection of attributes
describes an object.
Object is also known as
record, point, case, sample,
entity, or instance.
In the table, each column is an attribute and each row is an object:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of data sets
Record
  Data Matrix
  Document Data
  Transaction Data
Multi-Relational
  Star or snowflake schema
Graph
  World Wide Web
  Molecular Structures
Ordered
  Sequential Data
  Spatial Data
  Temporal Data
Record Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
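In code, record data maps naturally onto a list of objects that all share the same fixed set of typed attributes. The sketch below is one possible representation; the class name TaxRecord, the field names, and the choice of booleans for Refund and Cheat are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxRecord:
    """One record (object) with a fixed set of attributes."""
    tid: int
    refund: bool
    marital_status: str
    taxable_income_k: int  # taxable income in thousands
    cheat: bool

RECORDS = [
    TaxRecord(1, True, "Single", 125, False),
    TaxRecord(2, False, "Married", 100, False),
    TaxRecord(3, False, "Single", 70, False),
    TaxRecord(4, True, "Married", 120, False),
    TaxRecord(5, False, "Divorced", 95, True),
    TaxRecord(6, False, "Married", 60, False),
    TaxRecord(7, True, "Divorced", 220, False),
    TaxRecord(8, False, "Single", 85, True),
    TaxRecord(9, False, "Married", 75, False),
    TaxRecord(10, False, "Single", 90, True),
]

# Every record exposes the same attributes, so filtering is uniform.
cheater_ids = [r.tid for r in RECORDS if r.cheat]
print(cheater_ids)
```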
Data Matrix
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
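A small sketch of the m by n view, using the two objects and five attributes from the table. The nested-list representation and the helper name column are assumptions; a numeric array library would serve the same purpose.

```python
COLUMNS = [
    "Projection of x Load",
    "Projection of y Load",
    "Distance",
    "Load",
    "Thickness",
]

# One row per object, one column per attribute.
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m, n = len(matrix), len(matrix[0])

def column(matrix, name):
    """All objects' values for a single attribute."""
    j = COLUMNS.index(name)
    return [row[j] for row in matrix]

print("m =", m, "n =", n)
print("Thickness:", column(matrix, "Thickness"))
```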
Document Data
Each document becomes a ‘term’ vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times
the corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
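A minimal sketch of turning raw text into such a term vector. The vocabulary holds the slide's ten terms; the sample sentence, the lowercase tokenizer, and the function name term_vector are assumptions.

```python
import re
from collections import Counter

VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]

# Hypothetical document text.
text = "The team lost the game; the coach called a timeout and the team regrouped."
vector = term_vector(text, VOCABULARY)
print(dict(zip(VOCABULARY, vector)))
```

Words outside the vocabulary are simply ignored, so every document ends up with a vector of the same length.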
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of
products purchased by a customer during one shopping
trip constitutes a transaction, while the individual
products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Juice, Bread
3 Juice, Coke, Diaper, Milk
4 Juice, Bread, Diaper, Milk
5 Coke, Diaper, Milk
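Transaction data can also be viewed as record data by giving every item its own 0/1 attribute. The sketch below performs that conversion for the five transactions; the dictionary layout and the alphabetical column order are assumptions.

```python
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Juice", "Bread"},
    3: {"Juice", "Coke", "Diaper", "Milk"},
    4: {"Juice", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One column per distinct item, in alphabetical order.
items = sorted(set().union(*TRANSACTIONS.values()))

# One row per transaction: 1 if the item was purchased, else 0.
binary_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in TRANSACTIONS.items()
}

print("TID", items)
for tid, row in binary_rows.items():
    print(tid, row)
```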
Multi-Relational Data
• Attributes are objects themselves
Graph Data
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
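HTML links become graph edges once the link targets are extracted. The sketch below uses the standard-library HTMLParser to collect the href values from the snippet and pairs each with a source node. The class name LinkCollector and the node label "this_page" are hypothetical, since the slide does not say which page contains the links.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.targets.append(href)

SNIPPET = """
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
"""

collector = LinkCollector()
collector.feed(SNIPPET)

# Directed edges from the (hypothetical) containing page to each target.
edges = [("this_page", target) for target in collector.targets]
for edge in edges:
    print(edge)
```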
Chemical Data
Benzene Molecule: C6H6
Ordered Data
Sequences of transactions
(Figure: each element of the sequence is a set of items/events.)
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
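Because order matters in sequence data, typical first steps are counting symbols and short contiguous substrings (k-mers). The sketch below does both for the first line of the sequence above; the function name kmer_counts and the choice k = 3 are assumptions.

```python
from collections import Counter

# First line of the genomic sequence shown above.
sequence = "GGTTCCGCCTTCAGCCCCGCGCC"

# How often each nucleotide occurs, ignoring order.
base_counts = Counter(sequence)

def kmer_counts(seq, k):
    """Count every contiguous substring of length k (order-sensitive)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(base_counts)
print(kmer_counts(sequence, 3).most_common(3))
```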
Ordered Data
Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)