Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 6 —
©Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
Frequent pattern: pattern (set of items, sequence,
etc.) that occurs frequently in a database [AIS93]
Motivation: finding regularities in data
What products were often purchased together? — Beer
and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Why Is Frequent Pattern or Association
Mining an Essential Task in Data Mining?
Foundation for many essential data mining tasks
Association, correlation, causality
Sequential patterns, temporal or cyclic association,
partial periodicity, spatial and multimedia association
Associative classification, cluster analysis, iceberg cube,
fascicles (semantic data compression)
Broad applications
Basket data analysis, cross-marketing, catalog design,
sale campaign analysis
Web log (click stream) analysis, DNA sequence analysis,
etc.
Basic Concepts: Frequent Patterns and
Association Rules
Itemset X = {x1, …, xk}
Find all the rules X ⇒ Y with minimum support and confidence:
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

[Figure: customers who buy beer, buy diapers, or buy both]

Let min_support = 50%, min_conf = 50%:
  A ⇒ C (50%, 66.7%)
  C ⇒ A (50%, 100%)
Mining Association Rules—an Example
Min. support 50%, Min. confidence 50%

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.7%
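The support and confidence computations above can be checked with a small script. The following is a minimal illustrative sketch (ours, not from the book); the function names support() and confidence() are only illustrative.

```python
# Toy transaction database from the slide (min_sup = 50%, min_conf = 50%)
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))        # 0.5   -> rule A => C has 50% support
print(confidence({"A"}, {"C"}))   # 0.666... ~ 66.7% confidence
print(confidence({"C"}, {"A"}))   # 1.0   -> 100% confidence
```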
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
Apriori: A Candidate Generation-and-test Approach
Any subset of a frequent itemset must be frequent
if {beer, diaper, nuts} is frequent, so is {beer,
diaper}
Every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
generate length (k+1) candidate itemsets from length k
frequent itemsets, and
test the candidates against DB
Performance studies show its efficiency and scalability
(Agrawal & Srikant 1994; Mannila et al. 1994)
The Apriori Algorithm—An Example
Database TDB (min_sup = 2):
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
  L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
  L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
The Apriori Algorithm
Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
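The pseudo-code, the self-join, and the subset-pruning step above can be combined into a small runnable sketch. This is a simplified illustration (ours): candidates are counted by plain per-transaction subset checks rather than a hash-tree, and the function name apriori() is our own.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets.
    transactions: list of sets of items; min_sup: absolute count threshold."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Step 1: self-join Lk (itemsets agreeing on their first k-1 items)
        prev = sorted(sorted(s) for s in Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    # Step 2: prune -- every k-subset of cand must be frequent
                    if all(frozenset(sub) in Lk for sub in combinations(cand, k)):
                        candidates.add(cand)
        # Support counting against the database (no hash-tree here)
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        Lk = {c: n for c, n in cand_counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# The TDB from the earlier example slide (min_sup = 2)
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, cnt in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```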
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in
a transaction
Example: Counting Supports of Candidates
[Figure: hash tree for candidate 3-itemsets. Interior nodes hash items into three branches (1,4,7 / 2,5,8 / 3,6,9); leaves hold candidate itemsets such as 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689. The subset function walks the tree for transaction 1 2 3 5 6 (via prefixes 1+2356, 12+356, 13+56) to locate all candidates contained in it.]
Efficient Implementation of Apriori in SQL
Hard to get good performance out of pure SQL (SQL-
92) based approaches alone
Make use of object-relational extensions like UDFs,
BLOBs, Table functions etc.
Get orders of magnitude improvement
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
association rule mining with relational database
systems: Alternatives and implications. In SIGMOD’98
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins.
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.

[Figure: itemset lattice over {A, B, C, D}, from {} up to ABCD. Apriori starts counting 2-itemsets only after a full scan for 1-itemsets, whereas DIC starts counting 2-itemsets and 3-itemsets part-way through the transaction scan.]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions
of DB
Scan 1: partition database and find local
frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An
efficient algorithm for mining association in large
databases. In VLDB’95
Sampling for Frequent Patterns
Select a sample of original database, mine frequent
patterns within sample using Apriori
Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association
rules. In VLDB’96
DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is
below the threshold cannot be frequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Frequent 1-itemset: a, b, d, e
ab is not a candidate 2-itemset if the sum of count of
{ab, ad, ae} is below support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based
algorithm for mining association rules. In SIGMOD’95
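A tiny sketch (ours, not the DHP paper's code) of the hashing idea: while counting 1-itemsets, every 2-itemset of each transaction is hashed into a bucket; a 2-itemset whose bucket count stays below min_sup cannot be frequent and is dropped from C2. The data and the toy hash function are purely illustrative.

```python
from itertools import combinations

transactions = [{"a", "b", "d"}, {"b", "d", "e"}, {"a", "d", "e"}, {"b", "e"}]
min_sup = 2
num_buckets = 7

def bucket(pair):
    # deterministic toy hash over the two item names
    return sum(ord(ch) for item in pair for ch in item) % num_buckets

bucket_counts = [0] * num_buckets
item_counts = {}
for t in transactions:
    for item in t:
        item_counts[item] = item_counts.get(item, 0) + 1
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = sorted(i for i, c in item_counts.items() if c >= min_sup)
# Keep a 2-itemset candidate only if its hash bucket could reach min_sup
c2 = [pair for pair in combinations(frequent_items, 2)
      if bucket_counts[bucket(pair)] >= min_sup]
print(c2)
```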
Eclat/MaxEclat and VIPER: Exploring Vertical
Data Format
Use tid-list, the list of transaction-ids containing an itemset
Compression of tid-lists
Itemset A: t1, t2, t3, sup(A)=3
Itemset B: t2, t3, t4, sup(B)=3
Itemset AB: t2, t3, sup(AB)=2
Major operation: intersection of tid-lists
M. Zaki et al. New algorithms for fast discovery of
association rules. In KDD’97
P. Shenoy et al. Turbo-charging vertical mining of large
databases. In SIGMOD’00
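A minimal sketch (ours) of the vertical data format: keep a tid-list per item and obtain the support of an itemset by intersecting tid-lists, as in the A/B/AB example above.

```python
# Vertical (tid-list) layout of the example above
tidlists = {
    "A": {"t1", "t2", "t3"},
    "B": {"t2", "t3", "t4"},
}

def tidlist(itemset):
    """Intersect the tid-lists of all items in the itemset."""
    items = list(itemset)
    result = set(tidlists[items[0]])
    for item in items[1:]:
        result &= tidlists[item]
    return result

ab = tidlist({"A", "B"})
print(sorted(ab), "sup =", len(ab))   # ['t2', 't3'] sup = 2
```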
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of
scanning and generates lots of candidates
To find frequent itemset i1i2…i100:
  # of scans: 100
  # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 !
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
Mining Frequent Patterns Without
Candidate Generation
Grow long patterns from short ones using local
frequent items
“abc” is a frequent pattern
Get all transactions having “abc”: DB|abc
“d” is a local frequent item in DB|abc ⇒ abcd is
a frequent pattern
Construct FP-tree from a Transaction Database
TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns):
   f:4, c:4, a:3, b:3, m:3, p:3
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

[Figure: FP-tree with header table (item, frequency, head of node-links) and paths f:4 - c:3 - a:3 - (m:2 - p:2 and b:1 - m:1), f:4 - b:1, and c:1 - b:1 - p:1.]
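Steps 1 and 2 above can be sketched as follows (our illustration): find the frequent single items, build the f-list in frequency-descending order, and rewrite each transaction to keep only its frequent items in f-list order, which is the form in which transactions are inserted into the FP-tree. Tie-breaking among equally frequent items may differ from the slide's f-c-a-b-m-p.

```python
from collections import Counter

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

# Step 1: frequent single items
counts = Counter(item for t in transactions for item in t)
flist = [item for item, c in counts.most_common() if c >= min_support]
rank = {item: i for i, item in enumerate(flist)}

# Step 2: keep only frequent items, ordered by the f-list, per transaction
ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in transactions]
print(flist)    # frequency-descending; ties may break differently from the slide
print(ordered)  # these ordered item lists are what get inserted into the FP-tree
```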
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern
mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never larger than the original database (not counting
node-links and count fields)
For Connect-4 DB, compression ratio could be over 100
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets
according to f-list
F-list=f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but no a nor b, m, p
Pattern f
Completeness and non-redundancy
Find Patterns Having P From P-conditional Database
Starting at the frequent-item header table of the FP-tree:
  Traverse the FP-tree by following the node-links of each frequent item p
  Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
For each pattern base:
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} - f:3 - c:3 - a:3 (b is dropped: its count of 1 is below min_support)

All frequent patterns relating to m:
  m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
m-conditional FP-tree: {} - f:3 - c:3 - a:3

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} - f:3 - c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} - f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} - f:3
A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix path P.
Mining can be decomposed into two parts:
  Reduction of the single prefix path into one node
  Concatenation of the mining results of the two parts

[Figure: a tree whose top is the single prefix path a1:n1 - a2:n2 - a3:n3, below which branches b1:m1 and C1:k1 (with children C2:k2 and C3:k3) begin; it is split into the single prefix-path part and a reduced multi-branch tree rooted at r1.]
Mining Frequent Patterns With FP-trees
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and
database partition
Method
For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional
FP-tree
Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
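To make the recursion concrete, here is a deliberately simplified pattern-growth sketch (ours): it projects plain transaction lists instead of a compressed FP-tree, but follows the same divide-and-conquer logic of growing a pattern, building its conditional database, and recursing.

```python
from collections import Counter

def pattern_growth(transactions, min_sup, suffix=frozenset()):
    """Yield (frequent itemset, support). `transactions` is a list of item sets."""
    counts = Counter(item for t in transactions for item in t)
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        pattern = suffix | {item}
        yield pattern, sup
        # Conditional database for `pattern`: transactions containing `item`,
        # restricted to items that can still extend it (here: lexicographically
        # smaller items, so each itemset is enumerated exactly once).
        conditional = [{i for i in t if i < item and counts[i] >= min_sup}
                       for t in transactions if item in t]
        yield from pattern_growth([t for t in conditional if t], min_sup, pattern)

tdb = [{"f", "c", "a", "m", "p"}, {"f", "c", "a", "b", "m"}, {"f", "b"},
       {"c", "b", "p"}, {"f", "c", "a", "m", "p"}]
for pat, sup in pattern_growth(tdb, 3):
    print(sorted(pat), sup)
```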
Scaling FP-growth by DB Projection
FP-tree cannot fit in memory?—DB projection
First partition a database into a set of projected DBs
Then construct and mine FP-tree for each projected
DB
Parallel projection vs. Partition projection techniques
Parallel projection is space costly
Partition-based Projection
Parallel projection needs a lot of disk space; partition projection saves it.

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …

am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
FP-Growth vs. Apriori: Scalability With the Support
Threshold
[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime.]
FP-Growth vs. Tree-Projection: Scalability with
the Support Threshold
[Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth and D2 TreeProjection.]
Why Is FP-Growth the Winner?
Divide-and-conquer:
decompose both the mining task and DB according to
the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching
Implications of the Methodology
Mining closed frequent itemsets and max-patterns
CLOSET (DMKD’00)
Mining sequential patterns
FreeSpan (KDD’00), PrefixSpan (ICDE’01)
Constraint-based mining of frequent patterns
Convertible constraints (KDD’00, ICDE’01)
Computing iceberg data cubes with complex measures
H-tree and H-cubing algorithm (SIGMOD’01)
Max-patterns
A frequent pattern {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 frequent sub-patterns!
Max-pattern: a frequent pattern that has no proper frequent super-pattern
  BCDE and ACD are max-patterns
  BCD is not a max-pattern

Min_sup = 2
Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F
MaxMiner: Mining Max-patterns
1st scan: find frequent items: A, B, C, D, E

2nd scan: find support for
  AB, AC, AD, AE, ABCDE
  BC, BD, BE, BCDE
  CD, CE, CDE, DE
  (potential max-patterns)

Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan.

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F

R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98.
Frequent Closed Patterns
If conf(ac ⇒ d) = 100%, record acd only.
For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern.
  "acd" is a frequent closed pattern
A concise representation of frequent patterns; reduces the number of patterns and rules.
N. Pasquier et al. In ICDT'99.

Min_sup = 2
TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f
Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support-ascending order
  Flist: d-a-f-e-c
Divide the search space:
  Patterns having d
  Patterns having d but no a, etc.
Find frequent closed patterns recursively:
  Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern

Min_sup = 2
TID | Items
10 | a, c, d, e, f
20 | a, b, e
30 | c, e, f
40 | a, c, d, f
50 | c, e, f

J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.
Mining Frequent Closed Patterns: CHARM
Use vertical data format: t(AB)={T1, T12, …}
Derive closed pattern based on vertical intersections
t(X) = t(Y): X and Y always occur together
t(X) ⊆ t(Y): every transaction having X also has Y
Use diffsets to accelerate mining
  Only keep track of differences of tid-lists
  t(X) = {T1, T2, T3}, t(Xy) = {T1, T3}
  Diffset(Xy, X) = {T2}
M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining, CS-
TR99-10, Rensselaer Polytechnic Institute
M. Zaki, Fast Vertical Mining Using Diffsets, TR01-1, Department of Computer
Science, Rensselaer Polytechnic Institute
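A tiny sketch (ours, not CHARM's code) of the diffset idea: instead of storing t(Xy), keep only the difference t(X) − t(Xy), from which the support of Xy can still be recovered.

```python
# tid-sets from the slide
t_X  = {"T1", "T2", "T3"}
t_Xy = {"T1", "T3"}

diffset = t_X - t_Xy                 # {"T2"}: transactions with X but not y
sup_X   = len(t_X)
sup_Xy  = sup_X - len(diffset)       # support of Xy without storing t(Xy)
print(sorted(diffset), sup_Xy)       # ['T2'] 2
```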
Visualization of Association Rules: Pane Graph
Visualization of Association Rules: Rule Graph
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
Mining Various Kinds of Rules or Regularities
Multi-level, quantitative association rules,
correlation and causality, ratio rules, sequential
patterns, emerging patterns, temporal
associations, partial periodicity
Classification, clustering, iceberg cubes, etc.
Multiple-level Association Rules
Items often form a hierarchy.
Flexible support settings: items at lower levels are expected to have lower support.
Transaction databases can be encoded based on dimensions and levels.
Explore shared multi-level mining.

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]
ML/MD Associations with Flexible Support Constraints
Why flexible support constraints?
Real life occurrence frequencies vary greatly
Diamond, watch, pens in a shopping basket
Uniform support may not be an interesting model
A flexible model
  The lower the level, the more dimensions combined, and the longer the pattern, the smaller the support usually is
  General rules should be easy to specify and understand
  Special items and special groups of items may be specified individually and given higher priority
Multi-dimensional Association
Single-dimensional rules:
  buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
  Inter-dimension assoc. rules (no repeated predicates)
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  Hybrid-dimension assoc. rules (repeated predicates)
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical Attributes
finite number of possible values, no ordering among values
Quantitative Attributes
numeric, implicit ordering among values
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor”
relationships between items.
Example
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the “expected”
value, based on the rule’s ancestor.
Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level “weaker” frequent
itemsets:
2% milk (5%), wheat bread (4%)
Different min_support thresholds across the levels lead
to different algorithms:
If adopting the same min_support across multi-levels
then toss t if any of t’s ancestors is infrequent.
If adopting reduced min_support at lower levels
then examine only those descendents whose ancestor’s support
is frequent/non-negligible.
Techniques for Mining MD Associations
Search for frequent k-predicate set:
Example: {age, occupation, buys} is a 3-predicate set
Techniques can be categorized by how quantitative attributes, such as age, are treated
1. Using static discretization of quantitative attributes
Quantitative attributes are statically discretized by
using predefined concept hierarchies
2. Quantitative association rules
Quantitative attributes are dynamically discretized into
“bins”based on the distribution of the data
3. Distance-based association rules
This is a dynamic discretization process that considers
the distance between data points
Static Discretization of Quantitative Attributes
Discretized prior to mining using concept hierarchy.
Numeric values are replaced by ranges.
In relational database, finding all frequent k-predicate sets
will require k or k+1 table scans.
Data cube is well suited for mining:
  The cells of an n-dimensional cuboid correspond to the predicate sets.
  Mining from data cubes can be much faster.

[Figure: lattice of cuboids: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys).]
Quantitative Association Rules
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster "adjacent" association rules to form general rules using a 2-D grid.
Example:
  age(X, "30-34") ∧ income(X, "24K - 48K") ⇒ buys(X, "high resolution TV")
Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval
data
Price($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
7  | [0,10]  | [7,20]  | [7,7]
20 | [11,20] | [22,50] | [20,22]
22 | [21,30] | [51,53] | [50,53]
50 | [31,40] |         |
51 | [41,50] |         |
53 | [51,60] |         |
Distance-based partitioning, more meaningful discretization
considering:
density/number of points in an interval
“closeness” of points in an interval
Interestingness Measure: Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  The overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
Measure of dependent/correlated events: lift
  corr(A,B) = P(A ∪ B) / (P(A) P(B))

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
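Plugging the contingency table above into the lift formula gives the following worked check (ours):

```python
# Contingency table from the slide (5000 students total)
n          = 5000
basketball = 3000
cereal     = 3750
both       = 2000

lift_cereal = (both / n) / ((basketball / n) * (cereal / n))
print(round(lift_cereal, 2))       # 0.89 < 1: playing basketball and eating cereal
                                   # are negatively correlated, despite 66.7% confidence

not_cereal_and_basketball = 1000
lift_not_cereal = (not_cereal_and_basketball / n) / ((basketball / n) * ((n - cereal) / n))
print(round(lift_not_cereal, 2))   # 1.33 > 1: positively correlated
```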
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
Constraint-based Data Mining
Finding all the patterns in a database autonomously? —
unrealistic!
The patterns could be too many but not focused!
Data mining should be an interactive process
User directs what to be mined using a data mining
query language (or a graphical user interface)
Constraint-based mining
User flexibility: provides constraints on what to be
mined
System optimization: explores such constraints for
efficient mining—constraint-based mining
Constraints in Data Mining
Knowledge type constraint:
classification, association, etc.
Data constraint — using SQL-like queries
find product pairs sold together in stores in Vancouver
in Dec.’00
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum >
$200)
Interestingness constraint
  strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained Mining vs. Constraint-Based Search
Constrained mining vs. constraint-based search/reasoning
Both are aimed at reducing search space
Finding all patterns satisfying constraints vs. finding
some (or one) answer in constraint-based search in AI
Constraint-pushing vs. heuristic search
It is an interesting research problem on how to integrate
them
Constrained mining vs. query processing in DBMS
Database query processing requires finding all
Constrained pattern mining shares a similar philosophy
as pushing selections deeply in query processing
Constrained Frequent Pattern Mining: A Mining Query
Optimization Problem
Given a frequent pattern mining query with a set of constraints C, the
algorithm should be
sound: it only finds frequent sets that satisfy the given
constraints C
complete: all frequent sets satisfying the given
constraints C are found
A naïve solution
First find all frequent sets, and then test them for
constraint satisfaction
More efficient approaches:
Analyze the properties of constraints comprehensively
Push them as deeply as possible inside the frequent
pattern computation.
Anti-Monotonicity in Constraint-Based Mining
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
  sum(S.Price) ≤ v is anti-monotone
  sum(S.Price) ≥ v is not anti-monotone
Example. C: range(S.profit) ≤ 15 is anti-monotone
  Itemset ab violates C (range = 40)
  So does every superset of ab

TDB (min_sup = 2)
TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

Item | Profit
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10
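A small sketch (ours) of how an anti-monotone constraint such as range(S.profit) ≤ 15 is used for pruning: once an itemset violates it, all of its supersets can be discarded without support counting. The profit values are the ones from the table above.

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies_range(itemset, bound=15):
    values = [profit[i] for i in itemset]
    return max(values) - min(values) <= bound

print(satisfies_range({"a", "b"}))        # False: range(ab) = 40 > 15, so ab is pruned
print(satisfies_range({"a", "b", "c"}))   # False: every superset of ab violates C as well
```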
Which Constraints Are Anti-Monotone?
Constraint | Antimonotone
v ∈ S | no
S ⊇ V | no
S ⊆ V | yes
min(S) ≤ v | no
min(S) ≥ v | yes
max(S) ≤ v | yes
max(S) ≥ v | no
count(S) ≤ v | yes
count(S) ≥ v | no
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no
range(S) ≤ v | yes
range(S) ≥ v | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
support(S) ≥ ξ | yes
support(S) ≤ ξ | no
Monotonicity in Constraint-Based Mining
Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
  sum(S.Price) ≥ v is monotone
  min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15
  Itemset ab satisfies C
  So does every superset of ab

TDB (min_sup = 2)
TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

Item | Profit
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10
Which Constraints Are Monotone?
Constraint | Monotone
v ∈ S | yes
S ⊇ V | yes
S ⊆ V | no
min(S) ≤ v | yes
min(S) ≥ v | no
max(S) ≤ v | no
max(S) ≥ v | yes
count(S) ≤ v | no
count(S) ≥ v | yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | yes
range(S) ≤ v | no
range(S) ≥ v | yes
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
support(S) ≥ ξ | no
support(S) ≤ ξ | yes
Succinctness
Succinctness:
Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based on
A1 , i.e., S contains a subset belonging to A1
Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
min(S.Price) ≤ v is succinct
sum(S.Price) ≥ v is not succinct
Optimization: If C is succinct, C is pre-counting pushable
Which Constraints Are Succinct?
Constraint | Succinct
v ∈ S | yes
S ⊇ V | yes
S ⊆ V | yes
min(S) ≤ v | yes
min(S) ≥ v | yes
max(S) ≤ v | yes
max(S) ≥ v | yes
count(S) ≤ v | weakly
count(S) ≥ v | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no
range(S) ≤ v | no
range(S) ≥ v | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | no
support(S) ≥ ξ | no
support(S) ≤ ξ | no
The Apriori Algorithm — Example
Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
  L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
  L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D → L3: {2 3 5}:2
Naïve Algorithm: Apriori + Constraint
(The database D and the C1/L1, C2/L2, C3/L3 trace are identical to the previous slide.)
Constraint: sum(S.price) < 5
In the naïve approach, the constraint is tested only after all frequent itemsets have been found.
The Constrained Apriori Algorithm: Push
an Anti-monotone Constraint Deep
(Same database and Apriori trace as above.)
Constraint: sum(S.price) < 5, pushed deep into the mining: since the constraint is anti-monotone, a candidate that already violates it is pruned before support counting, and none of its supersets is generated.
The Constrained Apriori Algorithm: Push a
Succinct Constraint Deep
(Same database and Apriori trace as above.)
Constraint: min(S.price) ≤ 1, a succinct constraint: it can be enforced at candidate-generation time by generating only itemsets that contain an item of price ≤ 1, without looking at the transaction database.
Converting “Tough” Constraints
Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items.
Examine C: avg(S.profit) ≥ 25
  Order items in value-descending order: <a, f, g, d, b, h, c, e>
  If an itemset afb violates C, so do afbh and afb* (every itemset with afb as a prefix)
  It becomes anti-monotone!

TDB (min_sup = 2)
TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

Item | Profit
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10
Convertible Constraints
Let R be an order of items
Convertible anti-monotone
If an itemset S violates a constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≥ v w.r.t. item-value-descending order
Convertible monotone
If an itemset S satisfies constraint C, so does every
itemset having S as a prefix w.r.t. R
Ex. avg(S) ≤ v w.r.t. item-value-descending order
Strongly Convertible Constraints
avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: <a, f, g, d, b, h, c, e>
  If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R^-1: <e, c, h, b, d, g, f, a>
  If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
Thus, avg(X) ≥ 25 is strongly convertible

Item | Profit
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10
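A brief sketch (ours) of the prefix check behind this: with items listed in value-descending order R, once a prefix's average falls below 25, no extension along R can raise it again, so that branch can be pruned.

```python
profit = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20, "e": -30}
R = sorted(profit, key=profit.get, reverse=True)   # <a, f, g, d, b, h, c, e>

def avg(items):
    return sum(profit[i] for i in items) / len(items)

print(R)
print(avg(["a", "f"]))        # 35.0    -> satisfies avg(X) >= 25
print(avg(["a", "f", "d"]))   # ~26.7   -> still satisfies
print(avg(["a", "f", "b"]))   # ~23.3   -> violates; every extension of afb along R
                              #            (afbh, afbc, ...) can only lower the average
```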
What Constraints Are Convertible?
Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) θ v, θ ∈ {=, ≤, ≥} | Yes | Yes | Yes
median(S) θ v, θ ∈ {=, ≤, ≥} | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No | No
……
Combing Them Together—A General Picture
Constraint | Antimonotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no
Classification of Constraints
[Figure: classification of constraints into Antimonotone, Monotone, Succinct, Convertible anti-monotone, Convertible monotone, Strongly convertible, and Inconvertible.]
Mining With Convertible Constraints
C: avg(S.profit) ≥ 25
List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
Scan the transaction DB once:
  remove infrequent items (item h in transaction 40 is dropped)
  Itemsets a and f are good

TDB (min_sup = 2)
TID | Transaction
10 | a, f, d, b, c
20 | f, g, d, b, c
30 | a, f, d, c, e
40 | f, g, h, c, e

Item | Profit
a | 40
f | 30
g | 20
d | 10
b | 0
h | -10
c | -20
e | -30
Can Apriori Handle Convertible Constraint?
A convertible constraint that is neither monotone, nor anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm.
  Within the level-wise framework, no direct pruning based on the constraint can be made.
  Itemset df violates constraint C: avg(X) ≥ 25
  Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned.
But it can be pushed into the frequent-pattern growth framework!

Item | Value
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10
Mining With Convertible Constraints
C: avg(X) ≥ 25, min_sup = 2
List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
Scan TDB once:
  remove infrequent items (item h is dropped)
  Itemsets a and f are good, …
Projection-based mining:
  Impose an appropriate order on item projection
  Many tough constraints can be converted into (anti-)monotone constraints

Item | Value
a | 40
f | 30
g | 20
d | 10
b | 0
h | -10
c | -20
e | -30

TDB (min_sup = 2)
TID | Transaction
10 | a, f, d, b, c
20 | f, g, d, b, c
30 | a, f, d, c, e
40 | f, g, h, c, e
Handling Multiple Constraints
Different constraints may require different or even
conflicting item-ordering
If there exists an order R s.t. both C1 and C2 are
convertible w.r.t. R, then there is no conflict between
the two convertible constraints
If there exists conflict on order of items
Try to satisfy one constraint first
Then using the order for the other constraint to
mine frequent itemsets in the corresponding
projected database
Chapter 6: Mining Association Rules in
Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
Sequence Databases and Sequential
Pattern Analysis
Transaction databases, time-series databases vs. sequence
databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera,
within 3 months.
Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set
of frequent subsequences
A sequence: < (ef) (ab) (df) c b >
  An element may contain a set of items; items within an element are unordered and are listed alphabetically.

A sequence database:
SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
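A minimal sketch (ours) of the subsequence test used in this definition: each element of the candidate sequence must be contained in some later element of the data sequence, in order.

```python
def is_subsequence(sub, seq):
    """sub, seq: lists of sets (sequence elements). True if sub is a subsequence of seq."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

data  = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>
query = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                          # <a(bc)dc>
print(is_subsequence(query, data))   # True, as stated on the slide
```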
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are
hidden in databases
A mining algorithm should
find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small
number of database scans
be able to incorporate various kinds of user-specific
constraints
Studies on Sequential Pattern Mining
Concept introduction and an initial Apriori-like algorithm
R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95
GSP—An Apriori-based, influential mining method (developed at IBM
Almaden)
R. Srikant & R. Agrawal. “Mining sequential patterns:
Generalizations and performance improvements,” EDBT’96
From sequential patterns to episodes (Apriori-like + constraints)
H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent
episodes in event sequences,” Data Mining and Knowledge
Discovery, 1997
Mining sequential patterns with constraints
M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999
A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant '94)
  If a sequence S is not frequent, then none of the super-sequences of S is frequent.
  E.g., <hb> is infrequent, so are <hab> and <(ah)b>.

Given support threshold min_sup = 2:
Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>
GSP—A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm,
proposed by Srikant and Agrawal, EDBT'96
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
scan database to collect support count for each
candidate sequence
generate candidate length-(k+1) sequences from
length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate
can be found
Major strength: Candidate pruning by Apriori
Finding Length-1 Sequential Patterns
Examine GSP using an example.
Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the database once, count support for the candidates (min_sup = 2):

Cand | Sup
<a> | 3
<b> | 5
<c> | 4
<d> | 3
<e> | 3
<f> | 2
<g> | 1
<h> | 1

Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>
Generating Length-2 Candidates
51 length-2 candidates are generated from the 6 frequent length-1 sequences <a> … <f>:
  36 of the form <xy> (x and y need not be distinct): <aa>, <ab>, …, <ff>
  15 of the form <(xy)> (x < y): <(ab)>, <(ac)>, …, <(ef)>
Without the Apriori property (using all 8 items), 8*8 + 8*7/2 = 92 candidates would be needed.
Apriori pruning removes 44.57% of the candidates.
Generating Length-3 Candidates and Finding Length-3 Patterns
Generate Length-3 Candidates
Self-join length-2 sequential patterns
Based on the Apriori property
<ab>, <aa> and <ba> are all length-2 sequential patterns ⇒ <aba> is a length-3 candidate
<(bd)>, <bb> and <db> are all length-2 sequential patterns ⇒ <(bd)b> is a length-3 candidate
46 candidates are generated
Find Length-3 Sequential Patterns
Scan database once more, collect support counts for
candidates
19 out of 46 candidates pass support threshold
The GSP Mining Process
min_sup = 2
Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>

1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>) → 6 length-1 seq. patterns
2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>) → 19 length-2 seq. patterns; 10 candidates not in DB at all
3rd scan: 46 candidates (<abb> <aab> <aba> <baa> <bab> …) → 19 length-3 seq. patterns; 20 candidates not in DB at all
4th scan: 8 candidates (<abba> <(bd)bc> …) → 6 length-4 seq. patterns; some candidates not in DB at all
5th scan: 1 candidate (<(bd)cba>) → 1 length-5 seq. pattern; candidates that cannot pass the support threshold are pruned
Bottlenecks of GSP
A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
Multiple scans of the database in mining
Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 10^30 candidate sequences!
FreeSpan: Frequent Pattern-Projected
Sequential Pattern Mining
A divide-and-conquer approach
Recursively project a sequence database into a set of smaller
databases based on the current set of frequent patterns
Mine each projected database to find its patterns
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. In KDD'00.

f_list: b:5, c:4, a:3, d:3, e:3, f:2

Sequence Database SDB:
< (bd) c b (ac) >
< (bf) (ce) b (fg) >
< (ah) (bf) a b f >
< (be) (ce) d >
< a (bd) b c b (ade) >

All sequential patterns can be divided into 6 subsets:
  Seq. patterns containing item f
  Those containing e but no f
  Those containing d but no e nor f
  Those containing a but no d, e or f
  Those containing c but no a, d, e or f
  Those containing only item b
From FreeSpan to PrefixSpan: Why?
Freespan:
Projection-based: No candidate sequence needs to be
generated
But projection can be performed at any point in the
sequence, and the projected sequences do not shrink
much
PrefixSpan
Projection-based
But only prefix-based projection: less projections and
quickly shrinking sequences
Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are
prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>
Prefix Suffix (Prefix-Based Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<ab> <(_c)(ac)d(cf)>
Mining Sequential Patterns by Prefix
Projections
Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat.
can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
  …
  The ones having prefix <f>

SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>
Finding Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
  Having prefix <aa>;
  …
  Having prefix <af>

SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>
Completeness of PrefixSpan
SDB:
SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Partition the search space by prefix: having prefix <a>, having prefix <b>, …, having prefix <f>

<a>-projected database:
  <(abc)(ac)d(cf)>
  <(_d)c(bc)(ae)>
  <(_b)(df)cb>
  <(_f)cbc>
Length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Recurse: having prefix <aa> → <aa>-projected DB; …; having prefix <af> → <af>-projected DB
(The <b>-projected database, …, are handled analogously.)
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected
databases
Can be improved by bi-level projections
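A simplified sketch (ours) of prefix projection for the special case where every sequence element is a single item: for a length-1 prefix <x>, the projected database keeps, for each sequence containing x, the suffix after the first occurrence of x. Full PrefixSpan also handles multi-item elements and the (_x) notation, which this sketch omits.

```python
def project(sdb, item):
    """<item>-projected database: suffix after the first occurrence of `item`.

    sdb: list of sequences; each sequence is a list of single items
    (a simplification -- real PrefixSpan elements are itemsets).
    """
    projected = []
    for seq in sdb:
        if item in seq:
            pos = seq.index(item)
            suffix = seq[pos + 1:]
            if suffix:
                projected.append(suffix)
    return projected

sdb = [["a", "b", "c", "a", "c"], ["a", "d", "c", "b"], ["e", "a", "b", "d", "c"]]
print(project(sdb, "a"))   # [['b', 'c', 'a', 'c'], ['d', 'c', 'b'], ['b', 'd', 'c']]
```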
Optimization Techniques in PrefixSpan
Physical projection vs. pseudo-projection
Pseudo-projection may reduce the effort of
projection when the projected database fits in
main memory
Parallel projection vs. partition projection
Partition projection may avoid the blowup of
disk space
Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Postfixes of sequences often appear
repeatedly in recursive projected
databases
When (projected) database can be held in main
memory, use pointers to form projections
Example: s = <a(abc)(ac)d(cf)>; a pseudo-projection is a pointer to s plus the offset of the postfix:
  s|<a>  : (pointer to s, offset 2) → <(abc)(ac)d(cf)>
  s|<ab> : (pointer to s, offset 4) → <(_c)(ac)d(cf)>
Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
Efficient in running time and space when database
can be held in main memory
However, it is not efficient when database cannot fit
in main memory
Disk-based random accessing is very costly
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set
fits in memory
PrefixSpan Is Faster than GSP and FreeSpan
[Figure: runtime (seconds) vs. support threshold (%), comparing PrefixSpan-1, PrefixSpan-2, FreeSpan, and GSP.]
Effect of Pseudo-Projection
[Figure: runtime (seconds) vs. support threshold (%), comparing PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo), and PrefixSpan-2 (Pseudo).]
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
Associative Classification
Mine possible association rules (PRs) of the form: condset ⇒ c
  condset: a set of attribute-value pairs
  c: a class label
Build Classifier
Organize rules according to decreasing
precedence based on confidence and support
B. Liu, W. Hsu & Y. Ma. Integrating classification and
association rule mining. In KDD’98
Why Iceberg Cube?
It is too costly to materialize a high-dimensional cube
  20 dimensions, each with 99 distinct values, may lead to a cube of 100^20 cells.
  Even if there is only one nonempty cell in each 10^10 cells, the cube will still contain 10^30 nonempty cells.
Observation: Trivial cells are usually not interesting
Nontrivial: large volume of sales, or high profit
Solution:
iceberg cube—materialize only nontrivial cells of a data
cube
Anti-Monotonicity in Iceberg Cubes
If a cell c violates the HAVING clause, so do all more
specific cells
Example. Let HAVING COUNT(*) >= 50:
  (*, *, Edu, 1000, 30) violates the HAVING clause
  (Feb, *, Edu), (*, Van, Edu), (Mar, Tor, Edu): each must have count no more than 30

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING COUNT(*) >= 50

Month | City | Cust_grp | Prod | Cost | Price
Jan | Tor | Edu | Printer | 500 | 485
Mar | Van | Edu | HD | 540 | 520
… | … | … | … | … | …
Computing Iceberg Cubes Efficiently
Based on Apriori-like pruning
BUC [Beyer & Ramakrishnan, 99]
  bottom-up cubing, efficient bucket-sort algorithm
  Only handles anti-monotonic iceberg cubes, e.g.,
  measures confined to count and sum over non-negative values (e.g., price)
Computing non-anti-monotonic iceberg cubes
Finding a weaker but anti-monotonic measure (e.g.,
avg to top-k-avg) for dynamic pruning in computation
Use special data structure (H-tree) and perform H-
cubing (SIGMOD’01)
Spatial and Multi-Media Association: A
Progressive Refinement Method
Why progressive refinement?
Mining operator can be expensive or cheap, fine or
rough
Trade speed with quality: step-by-step refinement.
Superset coverage property:
  Preserve all the positive answers: allow a false
  positive test but not a false negative test.
Two- or multi-step mining:
First apply rough/cheap operator (superset coverage)
Then apply expensive algorithm on a substantially
reduced candidate set (Koperski & Han, SSD’95).
Progressive Refinement Mining of Spatial
Associations
Hierarchy of spatial relationship:
“g_close_to”: near_by, touch, intersect, contain, etc.
First search for rough relationship and then refine it.
Two-step mining of spatial association:
Step 1: rough spatial computation (as a filter)
Using MBR or R-tree for rough estimation.
Step 2: detailed spatial algorithm (as refinement)
Apply only to those objects which have passed the rough
spatial association test (no less than min_support)
Mining Multimedia Associations
Correlations with color, spatial relationships, etc.
From coarse to fine resolution mining
Further Evolution of PrefixSpan
Closed- and max- sequential patterns
Finding only the most meaningful (longest)
sequential patterns
Constraint-based sequential pattern growth
Adding user-specific constraints
From sequential patterns to structured patterns
Beyond sequential patterns, mining
structured patterns in XML documents
Closed- and Max- Sequential Patterns
A closed sequential pattern is a frequent sequence s such that no proper super-sequence of s has the same support count as s.
A max sequential pattern is a sequential pattern p such that no proper super-pattern of p is frequent.
Benefit of the notion of closed sequential patterns:
  {<a1 a2 … a50>, <a1 a2 … a100>}, with min_sup = 1
  There are about 2^100 sequential patterns, but only 2 are closed.
Similar benefits hold for max sequential patterns.
Methods for Mining Closed- and Max-
Sequential Patterns
PrefixSpan or FreeSpan can be viewed as projection-guided depth-first
search
For mining max- sequential patterns, any sequence which does not
contain anything beyond the already discovered ones will be removed
from the projected DB
{<a1 a2 … a50>, <a1 a2 … a100>}, with min_sup = 1
If we have found a max-sequential pattern <a1 a2 …
a100>, nothing will be projected in any projected DB
Similar ideas can be applied for mining closed- sequential-patterns
Constraint-Based Sequential Pattern
Mining
Constraint-based sequential pattern mining
Constraints: User-specified, for focused mining of desired patterns
How to explore efficient mining with constraints? — Optimization
Classification of constraints
Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
Monotone: E.g., count(S) > 5, S ⊇ {PC, digital_camera}
Succinct: E.g., length(S) ≤ 10, S ⊆ {Pentium, MS/Office, MS/Money}
Convertible: E.g., value_avg(S) < 25, profit_sum (S) > 160,
max(S)/avg(S) < 2, median(S) – min(S) > 5
Inconvertible: E.g., avg(S) – median(S) = 0
Sequential Pattern Growth for
Constraint-Based Mining
Efficient mining with convertible constraints
Not solvable by candidate generation-and-test methodology
Easily push-able into the sequential pattern growth framework
Example: push avg(S) < 25 in frequent pattern growth
project items in value (price/profit depending on mining
semantics) ascending/descending order for sequential pattern
growth
Grow each pattern by sequential pattern growth
If avg(current_pattern) ≥ 25, toss the current_pattern
Why?—future growths always make it bigger
But why not candidate generation?—no structure or ordering in
growth
From Sequential Patterns to Structured
Patterns
Sets, sequences, trees and other structures
Transaction DB: Sets of items
{{i1, i2, …, im}, …}
Seq. DB: Sequences of sets:
{<{i1, i2}, …, {im, in, ik}>, …}
Sets of Sequences:
{{<i1, i2>, …, <im, in, ik>}, …}
Sets of trees (each element being a tree):
{t1, t2, …, tn}
Applications: Mining structured patterns in XML documents
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
Mining various kinds of association/correlation rules
Constraint-based association mining
Sequential pattern mining
Applications/extensions of frequent pattern mining
Summary
Frequent-Pattern Mining: Achievements
Frequent pattern mining—an important task in data mining
Frequent pattern mining methodology
Candidate generation & test vs. projection-based (frequent-pattern
growth)
Vertical vs. horizontal format
Various optimization methods: database partition, scan reduction,
hash tree, sampling, border computation, clustering, etc.
Related frequent-pattern mining algorithm: scope extension
Mining closed frequent itemsets and max-patterns (e.g., MaxMiner,
CLOSET, CHARM, etc.)
Mining multi-level, multi-dimensional frequent patterns with flexible
support constraints
Constraint pushing for mining optimization
From frequent patterns to correlation and causality
Frequent-Pattern Mining: Applications
Related problems which need frequent pattern mining
Association-based classification
Iceberg cube computation
Database compression by fascicles and frequent
patterns
Mining sequential patterns (GSP, PrefixSpan, SPADE,
etc.)
Mining partial periodicity, cyclic associations, etc.
Mining frequent structures, trends, etc.
Typical application examples
Market-basket analysis, Weblog analysis, DNA mining,
etc.
Frequent-Pattern Mining: Research Problems
Multi-dimensional gradient analysis: patterns regarding
changes and differences
Not just counts—other measures, e.g., avg(profit)
Mining top-k frequent patterns without support constraint
Mining fault-tolerant associations
“3 out of 4 courses excellent” leads to A in data mining
Fascicles and database compression by frequent pattern
mining
Partial periodic patterns
DNA sequence analysis and pattern classification
References: Frequent-pattern Mining Methods
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm
for generation of frequent itemsets. Journal of Parallel and Distributed
Computing, 2000.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
between sets of items in large databases. SIGMOD'93, 207-216,
Washington, D.C.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94 487-499, Santiago, Chile.
J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate
generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May
2000.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for
discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
April 12, 2023 Data Mining: Concepts and Techniques 126
References: Frequent-pattern Mining Methods
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for
mining association rules in large databases. VLDB'95, 432-443, Zurich,
Switzerland.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques
for mining causal structures. VLDB'98, 594-605, New York, NY.
R. Srikant and R. Agrawal. Mining generalized association rules.
VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96, 1-12, Montreal, Canada.
H. Toivonen. Sampling large databases for association rules. VLDB'96,
134-145, Bombay, India, Sept. 1996.
M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for
fast discovery of association rules. KDD’97. August 1997.
April 12, 2023 Data Mining: Concepts and Techniques 127
References: Frequent-pattern Mining
(Performance Improvements)
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting
and implication rules for market basket analysis. SIGMOD'97, Tucson,
Arizona, May 1997.
D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered
association rules in large databases: An incremental updating technique.
ICDE'96, New Orleans, LA.
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining
using two-dimensional optimized association rules: Scheme, algorithms,
and visualization. SIGMOD'96, Montreal, Canada.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for
association rules. SIGMOD'97, Tucson, Arizona.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95, San Jose, CA, May 1995.
April 12, 2023 Data Mining: Concepts and Techniques 128
References: Frequent-pattern Mining
(Performance Improvements)
G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong
rules. In G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in
Databases. AAAI/MIT Press, 1991.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95, San Jose, CA.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule
mining with relational database systems: Alternatives and implications.
SIGMOD'98, Seattle, WA.
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama.
Computing optimized rectilinear regions for association rules. KDD'97,
Newport Beach, CA, Aug. 1997.
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for
discovery of association rules. Data Mining and Knowledge Discovery,
1:343-374, 1997.
April 12, 2023 Data Mining: Concepts and Techniques 129
References: Frequent-pattern Mining (Multi-
level, correlation, ratio rules, etc.)
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to
correlations. SIGMOD'97, 265-276, Tucson, Arizona.
J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-
431, Zurich, Switzerland.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules
from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable
data mining. VLDB'98, 582-593, New York, NY
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham,
England.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-
133, Bombay, India.
R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large
database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
J. Pei, A.K.H. Tung, J. Han. Fault-Tolerant Frequent Pattern Mining: Problems and Challenges.
SIGMOD DMKD’01, Santa Barbara, CA.
April 12, 2023 Data Mining: Concepts and Techniques 130
References: Mining Max-patterns and Closed
itemsets
R. J. Bayardo. Efficiently mining long patterns from databases.
SIGMOD'98, 85-93, Seattle, Washington.
J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining
Frequent Closed Itemsets", Proc. 2000 ACM-SIGMOD Int. Workshop on
Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May
2000.
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99, 398-416, Jerusalem,
Israel, Jan. 1999.
M. Zaki. Generating Non-Redundant Association Rules. KDD'00.
Boston, MA. Aug. 2000
M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule
Mining, SIAM’02
April 12, 2023 Data Mining: Concepts and Techniques 131
References: Constraint-based Frequent-pattern Mining
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00,
512-521, San Diego, CA, Feb. 2000.
Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95,
39-46, Singapore, Dec. 1995.
J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining",
COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.
L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of Constrained Frequent Set Queries
with 2-Variable Constraints", SIGMOD’99
R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. “Exploratory mining and pruning optimizations of
constrained association rules.” SIGMOD’98
J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints", Proc.
2001 Int. Conf. on Data Engineering (ICDE'01), April 2001.
J. Pei and J. Han "Can We Push More Constraints into Frequent Pattern Mining?", Proc. 2000 Int. Conf.
on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73,
Newport Beach, California
April 12, 2023 Data Mining: Concepts and Techniques 132
References: Sequential Pattern Mining Methods
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14,
Taipei, Taiwan.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations
and performance improvements. EDBT’96.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu,
"FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining",
Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'00), Boston, MA, August 2000.
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent
episodes in event sequences. Data Mining and Knowledge Discovery,
1:259-289, 1997.
April 12, 2023 Data Mining: Concepts and Techniques 133
References: Sequential Pattern Mining Methods
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu,
"PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected
Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering
(ICDE'01), Heidelberg, Germany, April 2001.
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association
rules. ICDE'98, 412-421, Orlando, FL.
S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of
interesting patterns in association rules. VLDB'98, 368-379, New York,
NY.
M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98,
November 1998.
M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999: 223-234,
Edinburgh, Scotland.
April 12, 2023 Data Mining: Concepts and Techniques 134
References: Frequent-pattern Mining in Spatial,
Multimedia, Text & Web Databases
K. Koperski, J. Han, and G. B. Marchisio, "Mining Spatial and Image Data through Progressive
Refinement Methods", Revue internationale de géomatique (European Journal of GIS and Spatial
Analysis), 9(4):425-440, 1999.
A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Breaking the Barrier of Transactions: Mining Inter-
Transaction Association Rules", Proc. 1999 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'99), San Diego, CA, Aug. 1999, pp. 297-301.
J. Han, G. Dong and Y. Yin, "Efficient Mining of Partial Periodic Patterns in Time Series Database",
Proc. 1999 Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia, March 1999, pp. 106-115
H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association Analysis:Mining Multi-Dimensional
Inter-Transaction Association Rules", ACM Transactions on Information Systems (TOIS’00), 18(4): 423-
454, 2000.
O. R. Zaiane, M. Xin, J. Han, "Discovering Web Access Patterns and Trends by Applying OLAP and Data
Mining Technology on Web Logs," Proc. Advances in Digital Libraries Conf. (ADL'98), Santa Barbara,
CA, April 1998, pp. 19-29
O. R. Zaiane, J. Han, and H. Zhu, "Mining Recurrent Items in Multimedia with Progressive Resolution
Refinement", ICDE'00, San Diego, CA, Feb. 2000, pp. 461-470
April 12, 2023 Data Mining: Concepts and Techniques 135
References: Frequent-pattern Mining for Classification
and Data Cube Computation
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman.
Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug.
1998.
J. Han, J. Pei, G. Dong, and K. Wang, “Computing Iceberg Data Cubes with
Complex Measures”, Proc. ACM-SIGMOD’2001, Santa Barbara, CA, May
2001.
M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-
dimensional association rules using data cubes. KDD'97, 207-210, Newport
Beach, California.
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing
association rules. Technical Report, Aug. 2000
April 12, 2023 Data Mining: Concepts and Techniques 136
www.cs.uiuc.edu/~hanj
Thank you !!!
April 12, 2023 Data Mining: Concepts and Techniques 137
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 7 —
©Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
April 12, 2023 Data Mining: Concepts and Techniques 1
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 2
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and the values (class
labels) of a classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
April 12, 2023 Data Mining: Concepts and Techniques 3
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting
will occur
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
April 12, 2023 Data Mining: Concepts and Techniques 4
Classification Process (1): Model
Construction
Training Data → Classification Algorithms → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
April 12, 2023 Data Mining: Concepts and Techniques 5
Classification Process (2): Use the
Model in Prediction
The classifier is applied to Testing Data, then to Unseen Data, e.g., (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
April 12, 2023 Data Mining: Concepts and Techniques 6
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
April 12, 2023 Data Mining: Concepts and Techniques 7
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 8
Issues Regarding Classification and Prediction
(1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
April 12, 2023 Data Mining: Concepts and Techniques 9
Issues regarding classification and prediction
(2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
April 12, 2023 Data Mining: Concepts and Techniques 10
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 11
Training Dataset
This follows an example from Quinlan’s ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
April 12, 2023 Data Mining: Concepts and Techniques 12
Output: A Decision Tree for “buys_computer”
age?
  <=30:   student?        no → no;         yes → yes
  31…40:  yes
  >40:    credit rating?  excellent → no;  fair → yes
April 12, 2023 Data Mining: Concepts and Techniques 13
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
April 12, 2023 Data Mining: Concepts and Techniques 14
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
S contains s_i tuples of class C_i, for i = 1, …, m
Expected information (entropy) needed to classify an arbitrary tuple:
    I(s_1, s_2, …, s_m) = − Σ_{i=1..m} (s_i / s) · log2(s_i / s)
Entropy of attribute A with values {a_1, a_2, …, a_v}:
    E(A) = Σ_{j=1..v} ((s_1j + … + s_mj) / s) · I(s_1j, …, s_mj)
Information gained by branching on attribute A:
    Gain(A) = I(s_1, s_2, …, s_m) − E(A)
April 12, 2023 Data Mining: Concepts and Techniques 15
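As a rough illustration of the formulas above, the following Python sketch computes I(s_1, …, s_m), E(A), and Gain(A) for the age attribute of the 14-tuple training set on the earlier slide; the helper names (info, entropy_of_attribute) are ad hoc, not from any standard library.

import math
from collections import Counter

# (age, buys_computer) pairs from the training set above
data = [("<=30","no"),("<=30","no"),("31..40","yes"),(">40","yes"),(">40","yes"),
        (">40","no"),("31..40","yes"),("<=30","no"),("<=30","yes"),(">40","yes"),
        ("<=30","yes"),("31..40","yes"),("31..40","yes"),(">40","no")]

def info(class_counts):
    """I(s1,...,sm) = -sum (si/s) log2(si/s)"""
    s = sum(class_counts)
    return -sum(si / s * math.log2(si / s) for si in class_counts if si)

def entropy_of_attribute(rows):
    """E(A): weighted I(...) over the partitions induced by the attribute values."""
    total = len(rows)
    e = 0.0
    for value in set(v for v, _ in rows):
        partition = [c for v, c in rows if v == value]
        e += len(partition) / total * info(Counter(partition).values())
    return e

overall = info(Counter(c for _, c in data).values())    # I(9,5) = 0.940
gain_age = overall - entropy_of_attribute(data)          # 0.940 - 0.694 = 0.246
print(round(overall, 3), round(gain_age, 3))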
Attribute Selection by Information
Gain Computation
Class P: buys_computer = “yes” (9 tuples);  Class N: buys_computer = “no” (5 tuples)
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Here (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

Hence Gain(age) = I(p, n) − E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

April 12, 2023 Data Mining: Concepts and Techniques 16
Other Attribute Selection Measures
Gini index (CART, IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for
each attribute
May need other tools, such as clustering, to get the
possible split values
Can be modified for categorical attributes
April 12, 2023 Data Mining: Concepts and Techniques 17
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
    gini(T) = 1 − Σ_{j=1..n} p_j²
where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to
split the node (need to enumerate all possible splitting
points for each attribute).
April 12, 2023 Data Mining: Concepts and Techniques 18
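A small Python sketch of the two formulas above; the class labels and the candidate two-way split below are invented purely for illustration.

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j in T."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(part1, part2):
    """gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2) for a binary split of T."""
    n = len(part1) + len(part2)
    return len(part1) / n * gini(part1) + len(part2) / n * gini(part2)

# Hypothetical example: a candidate split of 10 labeled tuples
t1 = ["yes", "yes", "yes", "no"]
t2 = ["no", "no", "no", "no", "yes", "yes"]
print(round(gini(t1 + t2), 3), round(gini_split(t1, t2), 3))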
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
April 12, 2023 Data Mining: Concepts and Techniques 19
Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due
to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to
decide which is the “best pruned tree”
April 12, 2023 Data Mining: Concepts and Techniques 20
Approaches to Determine the Final
Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may
improve the entire distribution
Use minimum description length (MDL) principle
halting growth of the tree when the encoding is
minimized
April 12, 2023 Data Mining: Concepts and Techniques 21
Enhancements to basic decision
tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are
sparsely represented
This reduces fragmentation, repetition, and replication
April 12, 2023 Data Mining: Concepts and Techniques 22
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification
methods)
convertible to simple and easy to understand
classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
April 12, 2023 Data Mining: Concepts and Techniques 23
Scalable Decision Tree Induction
Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.)
builds an index for each attribute and only class list and
the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)
constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
integrates tree splitting and tree pruning: stop growing
the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
separates the scalability aspects from the criteria that
determine the quality of the tree
builds an AVC-list (attribute, value, class label)
April 12, 2023 Data Mining: Concepts and Techniques 24
Data Cube-Based Decision-Tree
Induction
Integration of generalization with decision-tree induction
(Kamber et al’97).
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy
classification-trees
Semantic interpretation problems.
Cube-based multi-level classification
Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
April 12, 2023 Data Mining: Concepts and Techniques 25
Presentation of Classification Results
April 12, 2023 Data Mining: Concepts and Techniques 26
Visualization of a Decision Tree in
SGI/MineSet 3.0
April 12, 2023 Data Mining: Concepts and Techniques 27
Interactive Visual Mining by
Perception-Based Classification (PBC)
April 12, 2023 Data Mining: Concepts and Techniques 28
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 29
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
April 12, 2023 Data Mining: Concepts and Techniques 30
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the
probability that the hypothesis holds given the observed
data sample X
P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X, given that
the hypothesis holds
April 12, 2023 Data Mining: Concepts and Techniques 31
Bayesian Theorem
Given training data X, posteriori probability of a hypothesis
H, P(H|X) follows the Bayes theorem
    P(H|X) = P(X|H) · P(H) / P(X)
Informally, this can be written as
    posterior = likelihood × prior / evidence
MAP (maximum a posteriori) hypothesis:
    h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)
Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
April 12, 2023 Data Mining: Concepts and Techniques 32
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent:
    P(X|C_i) = Π_{k=1..n} P(x_k | C_i)
The joint probability of, say, two attribute values x_1 and x_2 given class C_i factors into the product of their individual conditional probabilities:
    P([x_1, x_2] | C_i) = P(x_1 | C_i) · P(x_2 | C_i)
No dependence relation between attributes
Greatly reduces the computation cost, only count the class
distribution.
Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci)*P(Ci)
April 12, 2023 Data Mining: Concepts and Techniques 33
Training dataset
Classes:  C1: buys_computer = ‘yes’;  C2: buys_computer = ‘no’
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
April 12, 2023 Data Mining: Concepts and Techniques 34
Naïve Bayesian Classifier: Example
Compute P(X|C_i) for each class:
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|C_i):  P(X|buys_computer=“yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
           P(X|buys_computer=“no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|C_i)·P(C_i):  P(X|buys_computer=“yes”) · P(buys_computer=“yes”) = 0.044 × 9/14 = 0.028
                  P(X|buys_computer=“no”) · P(buys_computer=“no”) = 0.019 × 5/14 = 0.007
Therefore X belongs to class “buys_computer = yes”
April 12, 2023 Data Mining: Concepts and Techniques 35
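The worked example above can be reproduced with the short sketch below: it re-enters the same 14 training tuples, counts conditional frequencies, and compares P(X|Ci)·P(Ci) for the two classes (plain counting only, no smoothing; the helper names are ad hoc).

from collections import Counter, defaultdict

# (age, income, student, credit_rating, buys_computer)
rows = [("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
        ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
        (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
        ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
        ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
        ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
        ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]

class_counts = Counter(r[-1] for r in rows)            # counts for P(Ci)
cond_counts = defaultdict(Counter)                      # (attribute index, class) -> value counts
for r in rows:
    for k, value in enumerate(r[:-1]):
        cond_counts[(k, r[-1])][value] += 1

def score(x, c):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    p = class_counts[c] / len(rows)
    for k, value in enumerate(x):
        p *= cond_counts[(k, c)][value] / class_counts[c]
    return p

x = ("<=30", "medium", "yes", "fair")
scores = {c: score(x, c) for c in class_counts}
print({c: round(p, 3) for c, p in scores.items()})      # approx {'no': 0.007, 'yes': 0.028}
print(max(scores, key=scores.get))                      # 'yes'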
Naïve Bayesian Classifier: Comments
Advantages :
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, which can cause a loss of accuracy
In practice, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Such dependencies cannot be modeled by a naïve Bayesian
classifier
How to deal with these dependencies?
Bayesian Belief Networks
April 12, 2023 Data Mining: Concepts and Techniques 36
Bayesian Networks
Bayesian belief network allows a subset of the variables
conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
Nodes: random variables
Links: dependency
Example graph: X and Y are the parents of Z, and Y is the parent of P
There is no direct dependency between Z and P
The graph has no loops or cycles
April 12, 2023 Data Mining: Concepts and Techniques 37
Bayesian Belief Network: An Example
[Figure: belief network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC      0.8      0.5       0.7       0.1
~LC     0.2      0.5       0.3       0.9

Bayesian belief networks:  P(z_1, …, z_n) = Π_{i=1..n} P(z_i | Parents(Z_i))
April 12, 2023 Data Mining: Concepts and Techniques 38
Learning Bayesian Networks
Several cases
Given both the network structure and all variables
observable: learn only the CPTs
Network structure known, some hidden variables:
method of gradient descent, analogous to neural
network learning
Network structure unknown, all variables observable:
search through the model space to reconstruct graph
topology
Unknown structure, all hidden variables: no good
algorithms known for this purpose
D. Heckerman, Bayesian networks for data mining
April 12, 2023 Data Mining: Concepts and Techniques 39
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 40
Classification
Classification:
predicts categorical class labels
Typical Applications
{credit history, salary} → credit approval (Yes/No)
{Temp, Humidity} → Rain (Yes/No)
Mathematically: x ∈ X = {0,1}^n,  y ∈ Y = {0,1}
A classifier is a function h: X → Y with y = h(x)
April 12, 2023 Data Mining: Concepts and Techniques 41
Linear Classification
Binary classification problem
[Figure: points of class ‘x’ and class ‘o’ separated by a red line]
The data above the red line belongs to class ‘x’
The data below the red line belongs to class ‘o’
Examples: SVM, Perceptron, Probabilistic Classifiers
April 12, 2023 Data Mining: Concepts and Techniques 42
Discriminative Classifiers
Advantages
prediction accuracy is generally high
(as compared to Bayesian methods – in general)
robust, works when training examples contain errors
fast evaluation of the learned target function
(Bayesian networks are normally slow)
Criticism
long training time
difficult to understand the learned function (weights)
(Bayesian networks can be used easily for pattern discovery)
not easy to incorporate domain knowledge
(easy in the form of priors on the data or distributions)
April 12, 2023 Data Mining: Concepts and Techniques 43
Neural Networks
Analogy to Biological Systems (Indeed a great example
of a good learning system)
Massive Parallelism allowing for computational
efficiency
The first learning algorithm came in 1959 (Rosenblatt): if a
target output value is provided for a single neuron with fixed
inputs, one can incrementally change the weights to learn to
produce that output using the perceptron learning rule
April 12, 2023 Data Mining: Concepts and Techniques 44
A Neuron
[Figure: a single neuron — input vector x = (x_0, …, x_n), weight vector w = (w_0, …, w_n), bias −μ_k, weighted sum, activation function f, output y]
The n-dimensional input vector x is mapped into
variable y by means of the scalar product and a
nonlinear function mapping
April 12, 2023 Data Mining: Concepts and Techniques 45
A Neuron
[Figure: the same neuron diagram — inputs x_0 … x_n, weights w_0 … w_n, bias −μ_k, weighted sum, activation function f, output y]
For example:
    y = sign( Σ_{i=0..n} w_i x_i − μ_k )
April 12, 2023 Data Mining: Concepts and Techniques 46
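A minimal sketch of the neuron above, assuming a hard sign activation; the input vector, weight vector, and bias below are made-up numbers.

def neuron_output(x, w, mu_k):
    """y = sign( sum_i w_i * x_i - mu_k ) -- a single neuron with a hard threshold."""
    net = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return 1 if net >= 0 else -1

# Hypothetical input vector, weight vector, and bias
print(neuron_output(x=[1.0, 0.5, -0.2], w=[0.4, 0.3, 0.9], mu_k=0.2))   # prints 1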
Multi-Layer Perceptron
[Figure: multi-layer network — input vector x_i feeding input nodes, hidden nodes, output nodes, and an output vector]

Backpropagation equations:
    Err_j = O_j (1 − O_j) (T_j − O_j)              for an output unit j
    Err_j = O_j (1 − O_j) Σ_k Err_k w_jk           for a hidden unit j
    w_ij = w_ij + (l) · Err_j · O_i                weight update with learning rate l
    θ_j = θ_j + (l) · Err_j                        bias update
    I_j = Σ_i w_ij O_i + θ_j                       net input to unit j
    O_j = 1 / (1 + e^{−I_j})                       logistic output of unit j
Network Training
The ultimate objective of training
obtain a set of weights that makes almost all the
tuples in the training data classified correctly
Steps
Initialize weights with random values
Feed the input tuples into the network one by one
For each unit
Compute the net input to the unit as a linear combination
of all the inputs to the unit
Compute the output value using the activation function
Compute the error
Update the weights and the bias
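The steps above correspond roughly to the following sketch: a tiny one-hidden-layer network with logistic units and the update rules from the backpropagation slide. The layer sizes, learning rate, number of epochs, and the toy XOR data are illustrative choices, not part of the original slides, and convergence depends on the random initialization.

import math, random

random.seed(0)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]   # toy XOR problem
n_in, n_hid, rate = 2, 3, 0.5                                  # illustrative sizes / learning rate

w_hid = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b_hid = [random.uniform(-1, 1) for _ in range(n_hid)]
w_out = [random.uniform(-1, 1) for _ in range(n_hid)]
b_out = random.uniform(-1, 1)

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

def forward(x):
    # Net input I_j = sum_i w_ij O_i + theta_j; output O_j = 1 / (1 + e^{-I_j})
    hid = [sigmoid(sum(w * xi for w, xi in zip(w_hid[j], x)) + b_hid[j]) for j in range(n_hid)]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hid)) + b_out)
    return hid, out

for _ in range(5000):                       # feed the tuples into the network one by one
    for x, target in data:
        hid, out = forward(x)
        err_out = out * (1 - out) * (target - out)                    # error of the output unit
        err_hid = [hid[j] * (1 - hid[j]) * err_out * w_out[j]         # backpropagated errors
                   for j in range(n_hid)]
        for j in range(n_hid):                                        # update weights and biases
            w_out[j] += rate * err_out * hid[j]
            b_hid[j] += rate * err_hid[j]
            for i in range(n_in):
                w_hid[j][i] += rate * err_hid[j] * x[i]
        b_out += rate * err_out

print([round(forward(x)[1], 2) for x, _ in data])   # should drift toward [0, 1, 1, 0]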
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 50
SVM – Support Vector Machines
[Figure: two separating hyperplanes — one with a small margin, one with a large margin; the points lying on the margin are the support vectors]
SVM – Cont.
Linear Support Vector Machine
Given a set of points x_i ∈ R^n with labels y_i ∈ {−1, 1},
the SVM finds a hyperplane defined by the pair (w, b)
(where w is the normal to the plane and b is the distance from
the origin) such that
    y_i (x_i · w + b) ≥ 1,   i = 1, …, N
x – feature vector, b – bias, y – class label, 2/||w|| – margin
April 12, 2023 Data Mining: Concepts and Techniques 52
SVM – Cont.
What if the data is not linearly separable?
Project the data to high dimensional space where it is
linearly separable and then we can use linear SVM –
(Using Kernels)
[Figure: points labeled + and − at −1, 0, +1 on a line are not linearly separable; after mapping to a higher-dimensional space (points near (0,0), (1,0), (0,1)) they become separable]
April 12, 2023 Data Mining: Concepts and Techniques 53
Non-Linear SVM
Classification using SVM (w, b):
    decide the class of x_i by the sign of  x_i · w + b  (≥ 0 ?)
In the non-linear case this becomes
    the sign of  K(x_i, w) + b  (≥ 0 ?)
Kernel – can be thought of as computing a dot product
in some high-dimensional space
April 12, 2023 Data Mining: Concepts and Techniques 54
April 12, 2023 Data Mining: Concepts and Techniques 55
Results
April 12, 2023 Data Mining: Concepts and Techniques 56
SVM vs. Neural Network
SVM:
  Relatively new concept
  Nice generalization properties
  Hard to learn – learned in batch mode using quadratic programming techniques
  Using kernels, can learn very complex functions

Neural Network:
  Quite old
  Generalizes well but doesn’t have a strong mathematical foundation
  Can easily be learned in incremental fashion
  To learn complex functions, use a multilayer perceptron (not that trivial)
April 12, 2023 Data Mining: Concepts and Techniques 57
SVM Related Links
http://svm.dcs.rhbnc.ac.uk/
http://www.kernel-machines.org/
C. J. C. Burges. A Tutorial on Support Vector Machines for
Pattern Recognition. Knowledge Discovery and Data
Mining, 2(2), 1998.
SVMlight – Software (in C)
http://ais.gmd.de/~thorsten/svm_light
BOOK: An Introduction to Support Vector Machines
N. Cristianini and J. Shawe-Taylor
Cambridge University Press
April 12, 2023 Data Mining: Concepts and Techniques 58
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 59
Association-Based Classification
Several methods for association-based classification
ARCS: Quantitative association mining and clustering
of association rules (Lent et al’97)
It beats C4.5 in (mainly) scalability and also accuracy
Associative classification: (Liu et al’98)
It mines high support and high confidence rules in the form of
“cond_set => y”, where y is a class label
CAEP (Classification by aggregating emerging patterns)
(Dong et al’99)
Emerging patterns (EPs): the itemsets whose support
increases significantly from one class to another
Mine EPs based on minimum support and growth rate
April 12, 2023 Data Mining: Concepts and Techniques 60
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 61
Other Classification Methods
k-nearest neighbor classifier
case-based reasoning
Genetic algorithm
Rough set approach
Fuzzy set approaches
April 12, 2023 Data Mining: Concepts and Techniques 62
Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean
space.
Locally weighted regression
Constructs local approximation
Case-based reasoning
Uses symbolic representations and knowledge-
based inference
April 12, 2023 Data Mining: Concepts and Techniques 63
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of
Euclidean distance.
The target function could be discrete- or real- valued.
For discrete-valued, the k-NN returns the most
common value among the k training examples nearest
to xq.
Voronoi diagram: the decision surface induced by 1-
NN for a typical set of training examples.
[Figure: training points labeled + and −, query point x_q, and the Voronoi decision surface induced by 1-NN]
April 12, 2023 Data Mining: Concepts and Techniques 64
Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions
Calculate the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors
according to their distance to the query point xq
giving greater weight to closer neighbors:
    w ≡ 1 / d(x_q, x_i)²
Similarly, for real-valued target functions
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could
be dominated by irrelevant attributes.
To overcome it, axes stretch or elimination of the least
relevant attributes.
April 12, 2023 Data Mining: Concepts and Techniques 65
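A minimal sketch of distance-weighted k-NN for a discrete-valued target, following the weighting w = 1/d(x_q, x_i)² above; the training points and queries are invented.

import math
from collections import defaultdict

train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((3.0, 3.2), "b"),
         ((2.9, 3.0), "b"), ((0.5, 1.1), "a")]

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(query, k=3):
    """Vote among the k nearest neighbors, weighting each by w = 1 / d(xq, xi)^2."""
    neighbors = sorted(train, key=lambda pc: euclidean(query, pc[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(query, point)
        votes[label] += 1.0 / (d ** 2) if d > 0 else float("inf")
    return max(votes, key=votes.get)

print(knn_predict((1.1, 1.0)))   # 'a'
print(knn_predict((2.8, 3.1)))   # 'b'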
Case-Based Reasoning
Also uses: lazy evaluation + analyze similar instances
Difference: Instances are not “points in a Euclidean space”
Example: Water faucet problem in CADET (Sycara et al’92)
Methodology
Instances represented by rich symbolic descriptions
(e.g., function graphs)
Multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving
Research issues
Indexing based on syntactic similarity measure, and
when failure, backtracking, and adapting to additional
cases
April 12, 2023 Data Mining: Concepts and Techniques 66
Remarks on Lazy vs. Eager Learning
Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation
Key differences
Lazy method may consider query instance xq when deciding how to
generalize beyond the training data D
Eager methods cannot, since they have already chosen a global
approximation before seeing the query
Efficiency: Lazy - less time training but more time predicting
Accuracy
Lazy method effectively uses a richer hypothesis space since it uses
many local linear functions to form its implicit global approximation
to the target function
Eager: must commit to a single hypothesis that covers the entire
instance space
April 12, 2023 Data Mining: Concepts and Techniques 67
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly
generated rules
e.g., IF A1 and Not A2 then C2 can be encoded as 100
Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules and
their offspring
The fitness of a rule is represented by its classification
accuracy on a set of training examples
Offspring are generated by crossover and mutation
April 12, 2023 Data Mining: Concepts and Techniques 68
Rough Set Approach
Rough sets are used to approximately or “roughly”
define equivalence classes
A rough set for a given class C is approximated by two
sets: a lower approximation (certain to be in C) and an
upper approximation (cannot be described as not
belonging to C)
Finding the minimal subsets (reducts) of attributes (for
feature reduction) is NP-hard but a discernibility matrix
is used to reduce the computation intensity
April 12, 2023 Data Mining: Concepts and Techniques 69
Fuzzy Set
Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to
represent the degree of membership (such as using
fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories
{low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may
apply
Each applicable rule contributes a vote for membership
in the categories
Typically, the truth values for each predicted category
are summed
April 12, 2023 Data Mining: Concepts and Techniques 70
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 71
What Is Prediction?
Prediction is similar to classification
First, construct a model
Second, use model to predict unknown value
Major method for prediction is regression
Linear and multiple regression
Non-linear regression
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions
April 12, 2023 Data Mining: Concepts and Techniques 72
Predictive Modeling in Databases
Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
April 12, 2023 Data Mining: Concepts and Techniques 73
Regress Analysis and Log-Linear
Models in Prediction
Linear regression: Y = α + β X
  Two parameters, α and β, specify the line and are to be estimated
  from the data at hand, using the least squares criterion on the
  known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b_0 + b_1 X_1 + b_2 X_2
  Many nonlinear functions can be transformed into the above.
Log-linear models:
  The multi-way table of joint probabilities is
  approximated by a product of lower-order tables.
  Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
April 12, 2023 Data Mining: Concepts and Techniques 74
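A short sketch of fitting Y = α + βX by the least squares criterion, on made-up (x, y) pairs; the closed-form estimates below are the standard ones implied by that criterion.

# Estimate alpha and beta of Y = alpha + beta * X by least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # hypothetical data
ys = [1.9, 4.1, 6.2, 7.8, 10.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
       sum((x - mean_x) ** 2 for x in xs)
alpha = mean_y - beta * mean_x
print(round(alpha, 3), round(beta, 3))     # slope should come out close to 2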
Locally Weighted Regression
Construct an explicit approximation to f over a local region
surrounding query instance x_q.
Locally weighted linear regression:
  The target function f is approximated near x_q using the linear function:
      f̂(x) = w_0 + w_1 a_1(x) + … + w_n a_n(x)
  Minimize the squared error with a distance-decreasing weight K:
      E(x_q) = (1/2) Σ_{x ∈ k nearest neighbors of x_q} (f(x) − f̂(x))² · K(d(x_q, x))
  The gradient descent training rule:
      Δw_j = η Σ_{x ∈ k nearest neighbors of x_q} K(d(x_q, x)) · (f(x) − f̂(x)) · a_j(x)
In most cases, the target function is approximated by a
constant, linear, or quadratic function.
April 12, 2023 Data Mining: Concepts and Techniques 75
Prediction: Numerical Data
April 12, 2023 Data Mining: Concepts and Techniques 76
Prediction: Categorical Data
April 12, 2023 Data Mining: Concepts and Techniques 77
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 78
Classification Accuracy: Estimating Error
Rates
Partition: Training-and-testing
use two independent data sets, e.g., training set
(2/3), test set(1/3)
used for data set with large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one sub-
sample as test data—k-fold cross-validation
for data set with moderate size
Bootstrapping (leave-one-out)
for small size data
April 12, 2023 Data Mining: Concepts and Techniques 79
Bagging and Boosting
General idea:
    Training data → Classification method (CM) → Classifier C
    Altered training data → CM → Classifier C1
    Altered training data → CM → Classifier C2
    ……
    Aggregation → Classifier C*
April 12, 2023 Data Mining: Concepts and Techniques 80
Bagging
Given a set S of s samples
Generate a bootstrap sample T from S. Cases in S may not
appear in T or may appear more than once.
Repeat this sampling procedure, getting a sequence of k
independent training sets
A corresponding sequence of classifiers C1,C2,…,Ck is
constructed for each of these training sets, by using the
same classification algorithm
To classify an unknown sample X, let each classifier predict
or vote
The Bagged Classifier C* counts the votes and assigns X
to the class with the “most” votes
April 12, 2023 Data Mining: Concepts and Techniques 81
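A sketch of the bagging procedure above. The base learner is deliberately trivial (a per-value majority classifier) and the tiny labeled sample S is invented; a real use would plug in decision trees or another classification algorithm.

import random
from collections import Counter

random.seed(1)
S = [("sunny", "yes"), ("rainy", "no"), ("sunny", "yes"), ("rainy", "no"),
     ("sunny", "no"), ("rainy", "yes"), ("sunny", "yes"), ("rainy", "no")]

def train_base(sample):
    """Trivial base classifier: for each attribute value, predict that value's majority class."""
    by_value = {}
    for value, label in sample:
        by_value.setdefault(value, []).append(label)
    return {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}

k = 25
classifiers = []
for _ in range(k):
    bootstrap = [random.choice(S) for _ in range(len(S))]   # sample with replacement
    classifiers.append(train_base(bootstrap))

def bagged_predict(x):
    """C*: each classifier votes; assign x to the class with the most votes."""
    votes = Counter(c.get(x) for c in classifiers if x in c)
    return votes.most_common(1)[0][0]

print(bagged_predict("sunny"), bagged_predict("rainy"))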
Boosting Technique — Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T Do
Obtain a hypothesis (classifier) h(t) under w(t)
Calculate the error of h(t) and re-weight the examples
based on the error . Each classifier is dependent on the
previous ones. Samples that are incorrectly predicted
are weighted more heavily
Normalize w^(t+1) to sum to 1 (so the example weights
again form a distribution)
Output a weighted sum of all the hypotheses, with each
hypothesis weighted according to its accuracy on the
training set
April 12, 2023 Data Mining: Concepts and Techniques 82
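A sketch of the weighting loop above in the style of AdaBoost: the hypotheses are weak threshold stumps over one-dimensional points, and both the data and the candidate thresholds are invented for illustration.

import math

# 1-D points with labels in {-1, +1}; example weights start equal at 1/N
data = [(0.1, 1), (0.3, 1), (0.45, -1), (0.6, -1), (0.8, 1), (0.9, -1)]
weights = [1.0 / len(data)] * len(data)
ensemble = []                                    # list of (alpha, threshold, sign)

for t in range(5):
    # Obtain a hypothesis h(t) under w(t): best threshold stump on the weighted data
    best = None
    for thr in [0.2, 0.4, 0.5, 0.7, 0.85]:
        for sign in (1, -1):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (sign if x < thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    err = max(err, 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)      # hypothesis weight from its error
    ensemble.append((alpha, thr, sign))
    # Re-weight: incorrectly predicted samples are weighted more heavily, then normalize
    for i, (x, y) in enumerate(data):
        pred = sign if x < thr else -sign
        weights[i] *= math.exp(-alpha * y * pred)
    total = sum(weights)
    weights = [w / total for w in weights]

def predict(x):
    """Weighted sum of all hypotheses, then take the sign."""
    s = sum(alpha * (sign if x < thr else -sign) for alpha, thr, sign in ensemble)
    return 1 if s >= 0 else -1

print([predict(x) for x, _ in data])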
Bagging and Boosting
Experiments with a new boosting algorithm,
Freund et al. (AdaBoost)
Bagging Predictors, Breiman
Boosting naïve Bayesian learning on a large subset
of MEDLINE, W. Wilbur
April 12, 2023 Data Mining: Concepts and Techniques 83
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
April 12, 2023 Data Mining: Concepts and Techniques 84
Summary
Classification is an extensively studied problem (mainly in
statistics, machine learning & neural networks)
Classification is probably one of the most widely used
data mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc..
April 12, 2023 Data Mining: Concepts and Techniques 85
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for
scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994
AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages
416-427, New York, NY, August 1998.
J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction . In SIGMOD'99 , Philadelphia, Pennsylvania, 1999
April 12, 2023 Data Mining: Concepts and Techniques 86
References (2)
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree
induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research
Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. Proc.
1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98) New York, NY, Aug.
1998.
W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, , Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA,
Nov. 2001.
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
(EDBT'96), Avignon, France, March 1996.
April 12, 2023 Data Mining: Concepts and Techniques 87
References (3)
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary
Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan. Bagging, boosting, and c4.5. In Proc. 13th Natl. Conf. on Artificial
Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August
1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept.
1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufman, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
April 12, 2023 Data Mining: Concepts and Techniques 88
www.cs.uiuc.edu/~hanj
Thank you !!!
April 12, 2023 Data Mining: Concepts and Techniques 89
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 8 —
©Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
April 12, 2023 Data Mining: Concepts and Techniques 1
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 2
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature
spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar
access patterns
April 12, 2023 Data Mining: Concepts and Techniques 4
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
April 12, 2023 Data Mining: Concepts and Techniques 5
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
April 12, 2023 Data Mining: Concepts and Techniques 6
Requirements of Clustering in Data
Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
April 12, 2023 Data Mining: Concepts and Techniques 7
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 8
Data Structures
Data matrix (two modes): n objects × p variables

    | x_11  …  x_1f  …  x_1p |
    |  …    …   …    …   …   |
    | x_i1  …  x_if  …  x_ip |
    |  …    …   …    …   …   |
    | x_n1  …  x_nf  …  x_np |

Dissimilarity matrix (one mode): n × n, lower triangular

    |   0                            |
    | d(2,1)   0                     |
    | d(3,1)  d(3,2)   0             |
    |   :       :      :             |
    | d(n,1)  d(n,2)   …   …   0     |
April 12, 2023 Data Mining: Concepts and Techniques 9
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
April 12, 2023 Data Mining: Concepts and Techniques 10
Type of data in clustering analysis
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
April 12, 2023 Data Mining: Concepts and Techniques 11
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
    s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f| )
where
    m_f = (1/n) ( x_1f + x_2f + … + x_nf )
Calculate the standardized measurement (z-score):
    z_if = (x_if − m_f) / s_f
Using mean absolute deviation is more robust than using
standard deviation
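As a concrete illustration of this standardization, here is a minimal Python/NumPy sketch (the function name and sample data are our own, not from the text):

```python
import numpy as np

def standardize_mad(X):
    """Standardize each column of an n x p data matrix using the
    mean absolute deviation s_f instead of the standard deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                   # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)       # s_f: mean absolute deviation
    return (X - m) / s                   # z_if = (x_if - m_f) / s_f

# Example: 4 objects measured on 2 interval-scaled variables
X = [[1.0, 200.0], [2.0, 220.0], [3.0, 260.0], [4.0, 300.0]]
print(standardize_mad(X))
```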
April 12, 2023 Data Mining: Concepts and Techniques 12
Similarity and Dissimilarity Between
Objects
Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include the Minkowski distance:
$$d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are
two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
April 12, 2023 Data Mining: Concepts and Techniques 13
Similarity and Dissimilarity Between
Objects (Cont.)
If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2\,}$$
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric Pearson
product moment correlation, or other dissimilarity measures
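To make these formulas concrete, a minimal sketch (our own illustration) of the Minkowski distance with its Manhattan and Euclidean special cases:

```python
import numpy as np

def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects i and j."""
    i, j = np.asarray(i, dtype=float), np.asarray(j, dtype=float)
    return np.sum(np.abs(i - j) ** q) ** (1.0 / q)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, q=1))   # Manhattan distance: 7.0
print(minkowski(a, b, q=2))   # Euclidean distance: 5.0
```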
April 12, 2023 Data Mining: Concepts and Techniques 14
Binary Variables
A contingency table for binary data

                    Object j
                    1        0        sum
Object i     1      a        b        a+b
             0      c        d        c+d
             sum    a+c      b+d      p

Simple matching coefficient (invariant, if the binary variable is symmetric):
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
$$d(i,j) = \frac{b + c}{a + b + c}$$
April 12, 2023 Data Mining: Concepts and Techniques 15
Dissimilarity between Binary
Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
$$d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
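The following short Python sketch (our own illustration) reproduces these values by counting the a, b, c cells over the asymmetric attributes only, as in the Jaccard coefficient:

```python
def asym_binary_dissim(x, y):
    """Jaccard-style dissimilarity d = (b + c) / (a + b + c) for
    asymmetric binary vectors (1 = present, 0 = absent)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0 over (Fever, Cough, Test-1..Test-4); gender is excluded
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(asym_binary_dissim(jack, mary))  # 0.33...
print(asym_binary_dissim(jack, jim))   # 0.67...
print(asym_binary_dissim(jim, mary))   # 0.75
```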
April 12, 2023 Data Mining: Concepts and Techniques 16
Nominal Variables
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables
creating a new binary variable for each of the M
nominal states
April 12, 2023 Data Mining: Concepts and Techniques 17
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace x_if by its rank r_if ∈ {1, …, M_f}
map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
compute the dissimilarity using methods for interval-
scaled variables
April 12, 2023 Data Mining: Concepts and Techniques 18
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as AeBt or Ae-Bt
Methods:
treat them like interval-scaled variables—not a good
choice! (why?—the scale can be distorted)
apply logarithmic transformation
yif = log(xif)
treat them as continuous ordinal data treat their rank
as interval-scaled
April 12, 2023 Data Mining: Concepts and Techniques 19
Variables of Mixed Types
A database may contain all the six types of variables
symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
One may use a weighted formula to combine their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
f is binary or nominal:
d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled:
compute the rank r_if, set
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
and treat z_if as interval-scaled
April 12, 2023 Data Mining: Concepts and Techniques 20
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 21
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of the data to
the given model
April 12, 2023 Data Mining: Concepts and Techniques 22
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 23
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by
the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman
& Rousseeuw’87): Each cluster is represented by one of
the objects in the cluster
April 12, 2023 Data Mining: Concepts and Techniques 24
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four
steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
Assign each object to the cluster with the nearest
seed point
Go back to Step 2; stop when no more new
assignments are made
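The four steps translate almost directly into code; below is a minimal NumPy sketch (our own illustration, not the textbook's implementation):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k objects as the initial seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # Step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, labels
```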
April 12, 2023 Data Mining: Concepts and Techniques 25
The K-Means Clustering Method
Example (K = 2): arbitrarily choose K objects as the initial
cluster centers; assign each object to the most similar
center; update the cluster means; reassign objects and
update the means again, repeating until the assignments
no longer change
April 12, 2023 Data Mining: Concepts and Techniques 26
Comments on the K-Means Method
Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as: deterministic
annealing and genetic algorithms
Weakness
Applicable only when mean is defined, then what about
categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
April 12, 2023 Data Mining: Concepts and Techniques 27
Variations of the K-Means Method
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
April 12, 2023 Data Mining: Concepts and Techniques 28
What is the problem of k-Means Method?
The k-means algorithm is sensitive to outliers !
An object with an extremely large value may substantially
distort the distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, which is the
most centrally located object in a cluster
April 12, 2023 Data Mining: Concepts and Techniques 29
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well
for large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
April 12, 2023 Data Mining: Concepts and Techniques 30
Typical k-medoids algorithm (PAM)
Illustration (K = 2):
Arbitrarily choose k objects as the initial medoids (total cost = 20)
Assign each remaining object to the nearest medoid
Randomly select a non-medoid object, O_random
Compute the total cost of swapping a medoid with O_random (total cost = 26 in the illustration)
Do loop: swap if the quality is improved; repeat until no change
April 12, 2023 Data Mining: Concepts and Techniques 31
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built in Splus
Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most
similar representative object
repeat steps 2-3 until there is no change
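A brute-force sketch of this loop over a precomputed dissimilarity matrix (our own illustration; it recomputes the full cost for each candidate swap instead of maintaining the swap costs TC_ih incrementally):

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """Minimal PAM sketch on a precomputed n x n distance matrix D.
    Returns the indices of the k medoids and the cluster labels."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    for _ in range(max_iter):
        improved = False
        cost = D[:, medoids].min(axis=1).sum()              # current total cost
        for i in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                new_cost = D[:, candidate].min(axis=1).sum()
                if new_cost < cost:                          # swap lowers the cost: accept
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)                    # assign to nearest medoid
    return medoids, labels
```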
April 12, 2023 Data Mining: Concepts and Techniques 32
PAM Clustering: Total Swapping Cost TC_ih = Σ_j C_jih
Illustration of the four cases for the contribution C_jih of an object j
when the medoid i is considered for replacement by the non-medoid h
(t denotes another medoid):
j is reassigned from i to h: C_jih = d(j, h) - d(j, i)
j stays with its other medoid t: C_jih = 0
j is reassigned from i to another medoid t: C_jih = d(j, t) - d(j, i)
j is reassigned from its medoid t to h: C_jih = d(j, h) - d(j, t)
April 12, 2023 Data Mining: Concepts and Techniques 33
What is the problem with PAM?
PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not
scale well for large data sets.
O(k(n-k)²) for each iteration,
where n is # of data points and k is # of clusters
Sampling-based method:
CLARA (Clustering LARge Applications)
April 12, 2023 Data Mining: Concepts and Techniques 34
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
April 12, 2023 Data Mining: Concepts and Techniques 35
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
April 12, 2023 Data Mining: Concepts and Techniques 36
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 37
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
Illustration with objects a, b, c, d, e:
Agglomerative (AGNES), Step 0 → Step 4: a and b merge into ab;
d and e merge into de; c joins de to form cde; finally ab and cde
merge into abcde
Divisive (DIANA), Step 4 → Step 0: the same hierarchy traversed in
reverse, splitting abcde back into the individual objects
April 12, 2023 Data Mining: Concepts and Techniques 38
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
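A minimal single-link merge loop in Python (our own sketch; statistical packages such as Splus provide production implementations):

```python
def single_link_agnes(D, k=1):
    """Minimal AGNES with single link: merge the two closest clusters
    until only k clusters remain. D is an n x n dissimilarity matrix."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = None
        # single link: cluster distance = least dissimilarity between members
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]     # merge the two closest clusters
        del clusters[b]
    return clusters
```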
April 12, 2023 Data Mining: Concepts and Techniques 39
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.
April 12, 2023 Data Mining: Concepts and Techniques 40
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
April 12, 2023 Data Mining: Concepts and Techniques 41
More on Hierarchical Clustering Methods
Major weakness of agglomerative clustering methods
do not scale well: time complexity of at least O(n²),
where n is the number of total objects
can never undo what was done previously
Integration of hierarchical with distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
April 12, 2023 Data Mining: Concepts and Techniques 42
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the
order of the data records.
April 12, 2023 Data Mining: Concepts and Techniques 43
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points, Σ_{i=1}^{N} X_i
SS: square sum of the N data points, Σ_{i=1}^{N} X_i²
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8)
give CF = (5, (16,30), (54,190))
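A small sketch (our own illustration) of computing a CF vector and the additive merge of two CFs, which is what allows BIRCH to maintain the tree incrementally:

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of d-dimensional points."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """CFs are additive: merging two subclusters just adds the components."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))   # (5, array([16., 30.]), array([ 54., 190.]))
```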
April 12, 2023 Data Mining: Concepts and Techniques 44
CF-Tree in BIRCH
Clustering feature:
summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
registers the crucial measurements for computing clusters and utilizes
storage efficiently
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: specifies the maximum number of children
Threshold: the maximum diameter of sub-clusters stored at the leaf nodes
April 12, 2023 Data Mining: Concepts and Techniques 45
CF Tree
Example structure (branching factor B = 7, leaf capacity L = 6):
Root: entries CF1, CF2, CF3, …, CF6, each pointing to a child node
Non-leaf node: entries CF1, CF2, CF3, …, CF5, each pointing to a child node
Leaf nodes: chained by prev/next pointers, each holding entries such as CF1, CF2, …, CF6 and CF1, CF2, …, CF4
April 12, 2023 Data Mining: Concepts and Techniques 46
CURE (Clustering Using
REpresentatives )
CURE: proposed by Guha, Rastogi & Shim, 1998
Stops the creation of a cluster hierarchy if a level
consists of k clusters
Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrarily
shaped clusters and avoids the single-link effect
April 12, 2023 Data Mining: Concepts and Techniques 47
Drawbacks of Distance-Based
Method
Drawbacks of square-error-based clustering methods
Consider only one point as the representative of a cluster
Good only for convex-shaped clusters of similar size and
density, and only if k can be reasonably estimated
April 12, 2023 Data Mining: Concepts and Techniques 48
Cure: The Algorithm
Draw random sample s.
Partition the sample into p partitions of size s/p
Partially cluster each partition into s/pq clusters
Eliminate outliers
By random sampling
If a cluster grows too slowly, eliminate it
Cluster the partial clusters.
Label data on disk
April 12, 2023 Data Mining: Concepts and Techniques 49
Data Partitioning and Clustering
Illustration: s = 50 sampled points, p = 2 partitions,
so s/p = 25 points per partition and s/pq = 5 partial clusters per partition
April 12, 2023 Data Mining: Concepts and Techniques 50
Cure: Shrinking Representative Points
Shrink the multiple representative points towards the
gravity center by a fraction of α
Multiple representatives capture the shape of the cluster
April 12, 2023 Data Mining: Concepts and Techniques 51
Clustering Categorical Data: ROCK
ROCK: Robust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
Use links to measure similarity/proximity
Not distance based
Computational complexity: O(n² + n·m_m·m_a + n² log n)
Basic ideas:
Similarity function and neighbors:
$$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$
Let T1 = {1,2,3}, T2 = {3,4,5}:
$$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$$
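A short illustration (ours) of this similarity function on the example transactions:

```python
def rock_sim(t1, t2):
    """Jaccard similarity between two transactions (sets of items)."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(rock_sim({1, 2, 3}, {3, 4, 5}))   # 0.2
```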
April 12, 2023 Data Mining: Concepts and Techniques 52
Rock: Algorithm
Links: the number of common neighbours of two points
Example neighbourhoods:
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
link({1,2,3}, {1,2,4}) = 3 (with a similarity threshold of 0.5, their
common neighbours are {1,2,5}, {1,3,4}, and {2,3,4})
Algorithm
Draw random sample
Cluster with links
Label data on disk
April 12, 2023 Data Mining: Concepts and Techniques 53
CHAMELEON (Hierarchical clustering
using dynamic modeling)
CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the
clusters
Cure ignores information about interconnectivity of the objects,
Rock ignores information about the closeness of two clusters
A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters
April 12, 2023 Data Mining: Concepts and Techniques 54
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph →
Merge Partitions → Final Clusters
April 12, 2023 Data Mining: Concepts and Techniques 55
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 56
Density-Based Clustering Methods
Clustering based on density (local cluster criterion),
such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
April 12, 2023 Data Mining: Concepts and Techniques 57
Density Concepts
Core object (CO): an object with at least ‘M’ objects within its
‘E-neighborhood’ (a neighborhood of radius E)
Directly density-reachable (DDR): y is DDR from x if x is a CO and
y is in x’s ‘E-neighborhood’
Density-reachable: there exists a chain of DDR objects from x to y
Density-based cluster: a set of density-connected objects that is
maximal w.r.t. density-reachability
April 12, 2023 Data Mining: Concepts and Techniques 58
Density-Based Clustering: Background
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
1) p belongs to N_Eps(q)
2) core point condition: |N_Eps(q)| ≥ MinPts
Example parameters: MinPts = 5, Eps = 1 cm
April 12, 2023 Data Mining: Concepts and Techniques 59
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps,
MinPts if there is a chain of points p1, …, pn, with p1 = q
and pn = p, such that p_{i+1} is directly density-reachable
from p_i
Density-connected:
A point p is density-connected to a point q wrt. Eps,
MinPts if there is a point o such that both p and q are
density-reachable from o wrt. Eps and MinPts
April 12, 2023 Data Mining: Concepts and Techniques 60
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
Illustration: core, border, and outlier points for
Eps = 1 cm and MinPts = 5
April 12, 2023 Data Mining: Concepts and Techniques 61
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps
and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the
database.
Continue the process until all of the points have been
processed.
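A compact sketch of this procedure (our own illustration, using a brute-force neighborhood query rather than a spatial index):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns a label per point (-1 means noise)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def region_query(p):
        return np.where(np.linalg.norm(X - X[p], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(region_query(p))
        if len(neighbors) < min_pts:
            continue                    # p is (for now) noise or a border point
        labels[p] = cluster             # p is a core point: start a new cluster
        i = 0
        while i < len(neighbors):       # expand the cluster via density-reachability
            q = neighbors[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    neighbors.extend(q_neighbors)
            if labels[q] == -1:
                labels[q] = cluster
            i += 1
        cluster += 1
    return labels
```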
April 12, 2023 Data Mining: Concepts and Techniques 62
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering
Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
Produces a special order of the database wrt its
density-based clustering structure
This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad
range of parameter settings
Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
Can be represented graphically or using visualization
techniques
April 12, 2023 Data Mining: Concepts and Techniques 63
OPTICS: Some Extension from
DBSCAN
Index-based:
k = number of dimensions, N = 20, p = 75%, M = N(1-p) = 5
Complexity: O(kN²)
Core distance of an object o (illustrated with points p1 and p2 around o)
Reachability distance of p from o: max(core-distance(o), d(o, p))
Example: MinPts = 5, ε = 3 cm, r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
April 12, 2023 Data Mining: Concepts and Techniques 64
Reachability plot: the reachability-distance (undefined for the first
object) plotted over the cluster order of the objects, shown for a
generating distance ε and a smaller ε′
April 12, 2023 Data Mining: Concepts and Techniques 65
Density-Based Cluster analysis:
OPTICS & Its Applications
April 12, 2023 Data Mining: Concepts and Techniques 66
DENCLUE: Using density functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
But needs a large number of parameters
April 12, 2023 Data Mining: Concepts and Techniques 67
Denclue: Technical Essence
Uses grid cells but only keeps information about grid
cells that do actually contain data points and manages
these cells in a tree-based access structure.
Influence function: describes the impact of a data point
within its neighborhood.
Overall density of the data space can be calculated as
the sum of the influence function of all data points.
Clusters can be determined mathematically by
identifying density attractors.
Density attractors are local maxima of the overall
density function.
April 12, 2023 Data Mining: Concepts and Techniques 68
Gradient: the steepness of a slope
Example (Gaussian influence function):
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
$$f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
$$\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
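A small NumPy sketch (our own illustration) of the Gaussian influence function and the overall density at a query point x:

```python
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))
    return np.exp(-d**2 / (2 * sigma**2))

def overall_density(x, data, sigma=1.0):
    """Overall density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0)]
print(overall_density((1.1, 0.9), data, sigma=0.5))
```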
April 12, 2023 Data Mining: Concepts and Techniques 69
Density Attractor
April 12, 2023 Data Mining: Concepts and Techniques 70
Center-Defined and Arbitrary
April 12, 2023 Data Mining: Concepts and Techniques 71
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 72
Grid-Based Clustering Method
Using multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
A multi-resolution clustering approach using
wavelet method
CLIQUE: Agrawal, et al. (SIGMOD’98)
April 12, 2023 Data Mining: Concepts and Techniques 73
STING: A Statistical Information
Grid Approach
Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
April 12, 2023 Data Mining: Concepts and Techniques 74
STING: A Statistical Information
Grid Approach (2)
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from
parameters of lower level cell
count, mean, s, min, max
type of distribution—normal, uniform, etc.
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer—typically with a small
number of cells
For each cell in the current level compute the confidence
interval
STING: A Statistical Information
Grid Approach (3)
Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to
the next lower level
Repeat this process until the bottom layer is reached
Advantages:
Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the
lowest level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
WaveCluster (1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
A multi-resolution clustering approach which applies
wavelet transform to the feature space
A wavelet transform is a signal processing
technique that decomposes a signal into different
frequency sub-bands.
Both grid-based and density-based
Input parameters:
# of grid cells for each dimension
the wavelet, and the # of applications of wavelet
transform.
April 12, 2023 Data Mining: Concepts and Techniques 77
WaveCluster (1998)
How to apply wavelet transform to find clusters
Summarize the data by imposing a multidimensional
grid structure onto the data space
These multidimensional spatial data objects are
represented in an n-dimensional feature space
Apply the wavelet transform on the feature space to find
the dense regions in the feature space
Apply the wavelet transform multiple times, which results
in clusters at different scales from fine to coarse
April 12, 2023 Data Mining: Concepts and Techniques 79
Wavelet Transform
Decomposes a signal into different frequency
subbands. (can be applied to n-dimensional
signals)
Data are transformed to preserve relative
distance between objects at different levels of
resolution.
Allows natural clusters to become more
distinguishable
April 12, 2023 Data Mining: Concepts and Techniques 80
What Is Wavelet (2)?
April 12, 2023 Data Mining: Concepts and Techniques 81
Quantization
April 12, 2023 Data Mining: Concepts and Techniques 82
Transformation
April 12, 2023 Data Mining: Concepts and Techniques 83
WaveCluster (1998)
Why is wavelet transformation useful for clustering
Unsupervised clustering
It uses hat-shaped filters to emphasize regions where
points cluster, while simultaneously suppressing weaker
information at their boundaries
Effective removal of outliers
Multi-resolution
Cost efficiency
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
April 12, 2023 Data Mining: Concepts and Techniques 84
CLIQUE (Clustering In QUEst)
Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
Automatically identifying subspaces of a high dimensional
data space that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-
based
It partitions each dimension into the same number of
equal-length intervals
It partitions an m-dimensional data space into non-
overlapping rectangular units
A unit is dense if the fraction of total data points
contained in the unit exceeds the input model parameter
A cluster is a maximal set of connected dense units
within a subspace
April 12, 2023 Data Mining: Concepts and Techniques 85
CLIQUE: The Major Steps
Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters:
Determine dense units in all subspaces of interest
Determine connected dense units in all subspaces of
interest.
Generate minimal description for the clusters
Determine maximal regions that cover a cluster of
connected dense units for each cluster
Determination of minimal cover for each cluster
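A toy sketch (our own illustration) of the first step for one 2-D subspace: count the points per grid cell and keep the units whose count reaches the density threshold τ:

```python
from collections import Counter

def dense_units(points, dims, n_intervals, tau):
    """Count points per grid cell in the subspace given by `dims`
    and return the cells whose count meets the density threshold tau.
    Assumes each coordinate has been scaled to [0, 1)."""
    counts = Counter(
        tuple(int(p[d] * n_intervals) for d in dims) for p in points
    )
    return {cell for cell, c in counts.items() if c >= tau}

points = [(0.10, 0.20), (0.15, 0.22), (0.12, 0.24), (0.80, 0.90), (0.82, 0.88)]
print(dense_units(points, dims=(0, 1), n_intervals=4, tau=3))
# {(0, 0)} -- the only unit holding at least tau points
```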
April 12, 2023 Data Mining: Concepts and Techniques 86
Illustration: the data space has the dimensions salary (in units of
10,000), vacation (in weeks), and age (20–60); with density
threshold τ = 3, dense units are found in the (age, salary) and
(age, vacation) subspaces, and their intersection suggests a
cluster roughly in the age range 30–50
April 12, 2023 Data Mining: Concepts and Techniques 87
Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
It is insensitive to the order of records in input and
does not presume some canonical data distribution
It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be
degraded for the sake of the simplicity of the method
April 12, 2023 Data Mining: Concepts and Techniques 88
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 89
Model-Based Clustering Methods
Attempt to optimize the fit between the data and some
mathematical model
Statistical and AI approach
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds characteristic description for each concept (class)
COBWEB (Fisher’87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification
tree
Each node refers to a concept and contains a probabilistic
description of that concept
April 12, 2023 Data Mining: Concepts and Techniques 90
COBWEB Clustering Method
A classification tree
April 12, 2023 Data Mining: Concepts and Techniques 91
More on Statistical-Based Clustering
Limitations of COBWEB
The assumption that the attributes are independent of
each other is often too strong because correlation may
exist
Not suitable for clustering large database data –
skewed tree and expensive probability distributions
CLASSIT
an extension of COBWEB for incremental clustering of
continuous data
suffers similar problems as COBWEB
AutoClass (Cheeseman and Stutz, 1996)
Uses Bayesian statistical analysis to estimate the
number of clusters
Popular in industry
April 12, 2023 Data Mining: Concepts and Techniques 92
Other Model-Based Clustering
Methods
Neural network approaches
Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
April 12, 2023 Data Mining: Concepts and Techniques 93
Model-Based Clustering Methods
April 12, 2023 Data Mining: Concepts and Techniques 94
Self-organizing feature maps (SOMs)
Clustering is also performed by having several
units competing for the current object
The unit whose weight vector is closest to the
current object wins
The winner and its neighbors learn by having
their weights adjusted
SOMs are believed to resemble processing that
can occur in the brain
Useful for visualizing high-dimensional data in 2-
or 3-D space
April 12, 2023 Data Mining: Concepts and Techniques 95
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 96
What Is Outlier Discovery?
What are outliers?
A set of objects that are considerably dissimilar from the
remainder of the data
Example: Sports: Michael Jordan, Wayne Gretzky, ...
Problem
Find top n outlier points
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
April 12, 2023 Data Mining: Concepts and Techniques 97
Outlier Discovery:
Statistical Approaches
Assume a model of the underlying distribution that generates
the data set (e.g., normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
most tests are for single attribute
In many cases, data distribution may not be known
April 12, 2023 Data Mining: Concepts and Techniques 98
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing
data distribution.
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the objects
in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
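A brute-force, nested-loop style sketch of DB(p, D)-outlier detection (our own illustration, not one of the optimized algorithms above):

```python
import numpy as np

def db_outliers(X, p, D):
    """Return indices of DB(p, D)-outliers: objects for which at least a
    fraction p of all objects lie farther away than distance D."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    outliers = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        far = np.sum(dists > D)           # objects farther than D from X[i]
        if far / n >= p:
            outliers.append(i)
    return outliers

X = [(1, 1), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (9, 9)]
print(db_outliers(X, p=0.7, D=3.0))   # [4] -- the isolated point
```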
Outlier Discovery: Deviation-
Based Approach
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
sequential exception technique
simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
OLAP data cube technique
uses data cubes to identify regions of anomalies in
large multidimensional data
April 12, 2023 Data Mining: Concepts and Techniques 100
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
April 12, 2023 Data Mining: Concepts and Techniques 101
Problems and Challenges
Considerable progress has been made in scalable
clustering methods
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, CURE
Density-based: DBSCAN, CLIQUE, OPTICS
Grid-based: STING, WaveCluster
Model-based: Autoclass, Denclue, Cobweb
Current clustering techniques do not address all the
requirements adequately
Constraint-based clustering analysis: Constraints exist in
data space (bridges and highways) or in user queries
April 12, 2023 Data Mining: Concepts and Techniques 102
Constraint-Based Clustering Analysis
Clustering analysis: fewer parameters but more user-desired
constraints, e.g., an ATM allocation problem
April 12, 2023 Data Mining: Concepts and Techniques 103
Clustering With Obstacle Objects
Illustration: clustering without taking obstacles into account vs. taking obstacles into account
April 12, 2023 Data Mining: Concepts and Techniques 104
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis,
such as constraint-based clustering
April 12, 2023 Data Mining: Concepts and Techniques 105
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD’99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
April 12, 2023 Data Mining: Concepts and Techniques 106
References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining. VLDB’97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96.
April 12, 2023 Data Mining: Concepts and Techniques 107
www.cs.uiuc.edu/~hanj
Thank you !!!
April 12, 2023 Data Mining: Concepts and Techniques 108