
Data Mining:

Concepts and Techniques


— Slides for Textbook —
— Chapter 6 —

©Jiawei Han and Micheline Kamber


Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Chapter 6: Mining Association Rules
in Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary



What Is Association Mining?
 Association rule mining:
 Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
 Frequent pattern: pattern (set of items, sequence,
etc.) that occurs frequently in a database [AIS93]
 Motivation: finding regularities in data
 What products were often purchased together? — Beer
and diapers?!
 What are the subsequent purchases after buying a PC?

 What kinds of DNA are sensitive to this new drug?

 Can we automatically classify web documents?



Why Is Frequent Pattern or Association
Mining an Essential Task in Data Mining?
 Foundation for many essential data mining tasks
 Association, correlation, causality
 Sequential patterns, temporal or cyclic association,
partial periodicity, spatial and multimedia association
 Associative classification, cluster analysis, iceberg cube,
fascicles (semantic data compression)
 Broad applications
 Basket data analysis, cross-marketing, catalog design,
sale campaign analysis
 Web log (click stream) analysis, DNA sequence analysis,
etc.
Basic Concepts: Frequent Patterns and
Association Rules

 Itemset X = {x1, …, xk}
 Find all the rules X  Y with min confidence and support
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

 Let min_support = 50%, min_conf = 50%:
  A  C (50%, 66.7%)
  C  A (50%, 100%)

(Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both.)
Mining Association Rules—an Example

Min. support 50%
Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A  C:
 support = support({A} ∪ {C}) = 50%
 confidence = support({A} ∪ {C}) / support({A}) = 66.6%


Chapter 6: Mining Association Rules
in Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary



Apriori: A Candidate Generation-and-test Approach

 Any subset of a frequent itemset must be frequent


 if {beer, diaper, nuts} is frequent, so is {beer,
diaper}
 Every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
 Method:
 generate length (k+1) candidate itemsets from length k
frequent itemsets, and
 test the candidates against DB

 The performance studies show its efficiency and scalability


 Agrawal & Srikant 1994, Mannila, et al. 1994
The Apriori Algorithm—An Example

Database TDB (min_sup = 2)
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan  C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
          L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan  C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
          L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (from L2): {B, C, E}
3rd scan  L3: {B, C, E}:2
The Apriori Algorithm
 Pseudo-code:
    Ck: Candidate itemset of size k
    Lk: frequent itemset of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
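To make the level-wise loop concrete, here is a minimal Python sketch of Apriori over a list of transactions. The names (apriori, min_sup, tdb) are illustrative, not from the slides, and candidate generation is done the simple way: all (k+1)-combinations of frequent items whose k-subsets are all frequent.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        items = sorted({i for s in Lk for i in s})
        # generate length-(k+1) candidates; Apriori pruning: every k-subset must be frequent
        Ck1 = {frozenset(c) for c in combinations(items, k + 1)
               if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        # one scan: count each candidate in the transactions that contain it
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck1}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# The TDB from the example above, with min_sup = 2
tdb = [['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']]
print(apriori(tdb, 2))  # includes frozenset({'B', 'C', 'E'}) with support 2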
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}



How to Generate Candidates?

 Suppose the items in Lk-1 are listed in an order


 Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
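A small Python sketch of the two steps, assuming each frequent (k-1)-itemset is stored as a sorted tuple (the names apriori_gen and L_prev are illustrative):

from itertools import combinations

def apriori_gen(L_prev):
    """Self-join then prune. L_prev: set of sorted (k-1)-tuples; returns k-tuples."""
    k_minus_1 = len(next(iter(L_prev)))
    # Step 1: self-join, combining pairs that agree on their first k-2 items
    joined = {p + (q[-1],) for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune any candidate that has an infrequent (k-1)-subset
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k_minus_1))}

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3))  # {('a','b','c','d')}: acde is pruned because ade is not in L3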



How to Count Supports of Candidates?

 Why is counting supports of candidates a problem?


 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets and
counts
 Interior node contains a hash table
 Subset function: finds all the candidates contained in
a transaction



Example: Counting Supports of Candidates

(Figure: a hash tree storing the candidate 3-itemsets 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689; interior nodes hash items into three branches (1,4,7 / 2,5,8 / 3,6,9), and the subset function matches the transaction 1 2 3 5 6 against the tree recursively, e.g. via the splits 1+2356, 12+356, 13+56.)


Efficient Implementation of Apriori in SQL

 Hard to get good performance out of pure SQL (SQL-


92) based approaches alone
 Make use of object-relational extensions like UDFs,
BLOBs, Table functions etc.
 Get orders of magnitude improvement
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
association rule mining with relational database
systems: Alternatives and implications. In SIGMOD’98



Challenges of Frequent Pattern Mining

 Challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for
candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
DIC: Reduce Number of Scans

 Once both A and D are determined frequent, the counting of AD begins
 Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

(Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD, next to a transaction timeline contrasting Apriori, which starts counting 2-itemsets only after a full pass over the 1-itemsets, with DIC, which starts counting 1-, 2- and 3-itemsets at different points within the same scan.)

 S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
Partition: Scan Database Only Twice

 Any itemset that is potentially frequent in DB


must be frequent in at least one of the partitions
of DB
 Scan 1: partition database and find local

frequent patterns
 Scan 2: consolidate global frequent patterns

 A. Savasere, E. Omiecinski, and S. Navathe. An


efficient algorithm for mining association rules in large
databases. In VLDB’95



Sampling for Frequent Patterns

 Select a sample of original database, mine frequent


patterns within sample using Apriori
 Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
 Example: check abcd instead of ab, ac, …, etc.

 Scan database again to find missed frequent patterns


 H. Toivonen. Sampling large databases for association
rules. In VLDB’96



DHP: Reduce the Number of Candidates

 A k-itemset whose corresponding hashing bucket count is


below the threshold cannot be frequent
 Candidates: a, b, c, d, e

 Hash entries: {ab, ad, ae} {bd, be, de} …

 Frequent 1-itemset: a, b, d, e

 ab is not a candidate 2-itemset if the sum of count of

{ab, ad, ae} is below support threshold


 J. Park, M. Chen, and P. Yu. An effective hash-based
algorithm for mining association rules. In SIGMOD’95
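A toy sketch of the hashing idea (not the full DHP algorithm): during the first scan, every 2-itemset of every transaction is hashed into a small table, and a pair can only survive as a candidate if its bucket count reaches the support threshold. Function and variable names are illustrative.

from itertools import combinations

def dhp_bucket_counts(transactions, num_buckets=7):
    # First scan: hash every 2-itemset of every transaction into a bucket
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_sup):
    # Necessary condition only: a pair whose bucket count is below min_sup
    # cannot be frequent, so it is never generated as a candidate 2-itemset
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_sup

tdb = [['a', 'b', 'd'], ['a', 'd', 'e'], ['b', 'd', 'e'], ['a', 'b', 'e']]
print(may_be_frequent(('a', 'b'), dhp_bucket_counts(tdb), min_sup=2))  # True: {a,b} occurs twice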



Eclat/MaxEclat and VIPER: Exploring Vertical
Data Format

 Use tid-list, the list of transaction-ids containing an itemset


 Compression of tid-lists
 Itemset A: t1, t2, t3, sup(A)=3
 Itemset B: t2, t3, t4, sup(B)=3
 Itemset AB: t2, t3, sup(AB)=2
 Major operation: intersection of tid-lists
 M. Zaki et al. New algorithms for fast discovery of
association rules. In KDD’97
 P. Shenoy et al. Turbo-charging vertical mining of large
databases. In SIGMOD’00
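A minimal sketch of the vertical format: the support of an itemset is the size of the intersection of its items' tid-lists (the names below are illustrative).

def make_support(tidlists):
    """tidlists: {item: set of transaction ids}."""
    def sup(itemset):
        return len(set.intersection(*(tidlists[i] for i in itemset)))
    return sup

tidlists = {'A': {1, 2, 3}, 'B': {2, 3, 4}}
sup = make_support(tidlists)
print(sup({'A'}), sup({'B'}), sup({'A', 'B'}))  # 3 3 2, matching the slide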



Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly


 Mining long patterns needs many passes of
scanning and generates lots of candidates
 To find frequent itemset i1i2…i100

 # of scans: 100
 # of Candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27*10^30 !
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?



Mining Frequent Patterns Without
Candidate Generation

 Grow long patterns from short ones using local


frequent items

 “abc” is a frequent pattern

 Get all transactions having “abc”: DB|abc

 “d” is a local frequent item in DB|abc  abcd is


a frequent pattern
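A minimal Python sketch of this growth-by-projection idea, without the FP-tree itself (the tree adds compression and a fixed f-list order that avoids re-deriving the same pattern from different branches); all names are illustrative.

from collections import Counter

def pattern_growth(transactions, min_sup, suffix=()):
    """Grow frequent patterns by recursively mining projected databases DB|item."""
    counts = Counter(item for t in transactions for item in set(t))
    freq = [i for i, c in counts.items() if c >= min_sup]
    patterns = {}
    for item in freq:
        pattern = tuple(sorted(suffix + (item,)))
        patterns[pattern] = counts[item]
        # DB|pattern: transactions containing `item`, restricted to items that
        # are still frequent here and can extend the pattern further
        projected = [[i for i in t if i in freq and i != item]
                     for t in transactions if item in t]
        patterns.update(pattern_growth(projected, min_sup, pattern))
    return patterns

# The transaction DB of the FP-tree example on the following slide, min_support = 3
tdb = [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
       ['c','b','p'], ['f','c','a','m','p']]
print(pattern_growth(tdb, 3)[('a','c','f','m')])  # 3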



Construct FP-tree from a Transaction Database

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single item patterns): f:4, c:4, a:3, b:3, m:3, p:3
2. Sort frequent items in frequency descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

(Figure: the resulting FP-tree rooted at {}: path f:4, c:3, a:3, m:2, p:2, with a branch b:1, m:1 under a:3 and a branch b:1 under f:4, plus a separate path c:1, b:1, p:1; a header table links all nodes of the same item.)
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent pattern

mining
 Never break a long pattern of any transaction

 Compactness
 Reduce irrelevant info—infrequent items are gone

 Items in frequency descending order: the more

frequently occurring, the more likely to be shared


 Never be larger than the original database (not count

node-links and the count field)


 For Connect-4 DB, compression ratio could be over 100



Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets


according to f-list
 F-list=f-c-a-b-m-p

 Patterns containing p

 Patterns having m but no p

 …

 Patterns having c but no a nor b, m, p

 Pattern f

 Completeness and non-redundancy



Find Patterns Having P From P-conditional Database

 Starting at the frequent item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

(Figure: the FP-tree from the previous slide, with the header-table node-links used to collect the prefix paths.)
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern-base
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base

 m-conditional pattern base: fca:2, fcab:1
 m-conditional FP-tree: the single path {}, f:3, c:3, a:3 (b is dropped, its count 1 is below min_sup)
 All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree

 Cond. pattern base of “am”: (fc:3)   am-conditional FP-tree: {}, f:3, c:3
 Cond. pattern base of “cm”: (f:3)    cm-conditional FP-tree: {}, f:3
 Cond. pattern base of “cam”: (f:3)   cam-conditional FP-tree: {}, f:3


A Special Case: Single Prefix Path in FP-tree

 Suppose a (conditional) FP-tree T has a shared single prefix-path P
 Mining can be decomposed into two parts
  Reduction of the single prefix path into one node
  Concatenation of the mining results of the two parts

(Figure: a tree whose prefix path {}, a1:n1, a2:n2, a3:n3 is shared by the branches b1:m1 and C1:k1 (with C2:k2 and C3:k3 below) is split into the single-path part {}, a1:n1, a2:n2, a3:n3 and a reduced tree r1 to which the branches are attached.)
Mining Frequent Patterns With FP-trees

 Idea: Frequent pattern growth


 Recursively grow frequent patterns by pattern and

database partition
 Method
 For each frequent item, construct its conditional

pattern-base, and then its conditional FP-tree


 Repeat the process on each newly created conditional

FP-tree
 Until the resulting FP-tree is empty, or it contains only

one path—single path will generate all the


combinations of its sub-paths, each of which is a
frequent pattern
Scaling FP-growth by DB Projection

 FP-tree cannot fit in memory?—DB projection


 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected
DB
 Parallel projection vs. Partition projection techniques
 Parallel projection is space costly



Partition-based Projection

 Parallel projection needs a lot of disk space
 Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam      m-proj DB: fcab, fca, fca      b-proj DB: f, cb, …
a-proj DB: fc, …               c-proj DB: f, …                f-proj DB: …

am-proj DB: fc, fc, fc         cm-proj DB: f, f, f            …
FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime; Apriori's runtime grows far more steeply as the threshold decreases.)


FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

(Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth with D2 TreeProjection as the support threshold decreases.)
Why Is FP-Growth the Winner?

 Divide-and-conquer:
 decompose both the mining task and DB according to
the frequent patterns obtained so far
 leads to focused search of smaller databases
 Other factors
 no candidate generation, no candidate test
 compressed database: FP-tree structure
 no repeated scan of entire database
 basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching



Implications of the Methodology

 Mining closed frequent itemsets and max-patterns

 CLOSET (DMKD’00)
 Mining sequential patterns

 FreeSpan (KDD’00), PrefixSpan (ICDE’01)


 Constraint-based mining of frequent patterns

 Convertible constraints (KDD’00, ICDE’01)


 Computing iceberg data cubes with complex measures

 H-tree and H-cubing algorithm (SIGMOD’01)



Max-patterns

 Frequent pattern {a1, …, a100}  (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27*10^30 frequent sub-patterns!
 Max-pattern: frequent patterns without proper frequent super pattern
  BCDE, ACD are max-patterns
  BCD is not a max-pattern

Min_sup = 2
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F


MaxMiner: Mining Max-patterns

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

 1st scan: find frequent items
  A, B, C, D, E
 2nd scan: find support for
  AB, AC, AD, AE, ABCDE
  BC, BD, BE, BCDE
  CD, CE, CDE, DE          (potential max-patterns)
 Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan
 R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD’98
Frequent Closed Patterns

 conf(ac  d) = 100%  record acd only
 For frequent itemset X, if there exists no item y s.t. every transaction containing X also contains y, then X is a frequent closed pattern
  “acd” is a frequent closed pattern
 Concise rep. of freq pats
 Reduce # of patterns and rules
 N. Pasquier et al. In ICDT’99

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f
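A tiny sketch of this definition on top of any frequent-itemset miner (for example the apriori sketch earlier); the name closed_patterns is illustrative.

def closed_patterns(frequent):
    """frequent: {frozenset: support}. An itemset is closed if no proper
    superset has exactly the same support."""
    return {s: c for s, c in frequent.items()
            if not any(s < t and frequent[t] == c for t in frequent)}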


Mining Frequent Closed Patterns: CLOSET

 Flist: list of all frequent items in support ascending order
  Flist: d-a-f-e-c
 Divide search space
  Patterns having d
  Patterns having d but no a, etc.
 Find frequent closed pattern recursively
  Every transaction having d also has cfa  cfad is a frequent closed pattern

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

 J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00


Mining Frequent Closed Patterns: CHARM

 Use vertical data format: t(AB)={T1, T12, …}


 Derive closed pattern based on vertical intersections
  t(X) = t(Y): X and Y always happen together
  t(X) ⊆ t(Y): a transaction having X always has Y
 Use diffset to accelerate mining (a small sketch follows the references below)
  Only keep track of differences of tids
  t(X) = {T1, T2, T3}, t(Xy) = {T1, T3}
  Diffset(Xy, X) = {T2}
 M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining, CS-
TR99-10, Rensselaer Polytechnic Institute
 M. Zaki, Fast Vertical Mining Using Diffsets, TR01-1, Department of Computer
Science, Rensselaer Polytechnic Institute
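A minimal illustration of the diffset bookkeeping above (hypothetical names):

def diffset(t_X, t_Xy):
    """Diffset(Xy, X): tids that contain X but not Xy; sup(Xy) = sup(X) - |Diffset|."""
    return t_X - t_Xy

t_X, t_Xy = {'T1', 'T2', 'T3'}, {'T1', 'T3'}
print(diffset(t_X, t_Xy))                    # {'T2'}
print(len(t_X) - len(diffset(t_X, t_Xy)))    # 2, the support of Xy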



Visualization of Association Rules: Pane Graph

(Figure omitted: pane-graph visualization of association rules.)

Visualization of Association Rules: Rule Graph

(Figure omitted: rule-graph visualization of association rules.)


Chapter 6: Mining Association Rules
in Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary
Mining Various Kinds of Rules or Regularities

 Multi-level, quantitative association rules,

correlation and causality, ratio rules, sequential

patterns, emerging patterns, temporal

associations, partial periodicity

 Classification, clustering, iceberg cubes, etc.



Multiple-level Association Rules

 Items often form hierarchy


 Flexible support settings: Items at the lower level are
expected to have lower support.
 Transaction database can be encoded based on
dimensions and levels
 explore shared multi-level mining

Uniform support vs. reduced support:
 Level 1, Milk [support = 10%]: min_sup = 5% under both settings
 Level 2, 2% Milk [support = 6%] and Skim Milk [support = 4%]: min_sup = 5% under uniform support, min_sup = 3% under reduced support


ML/MD Associations with Flexible Support Constraints

 Why flexible support constraints?


 Real life occurrence frequencies vary greatly
 Diamond, watch, pens in a shopping basket
 Uniform support may not be an interesting model
 A flexible model
 The lower the level, the more dimension combinations, and the longer
the pattern length, the smaller the support usually is
 General rules should be easy to specify and understand
 Special items and special group of items may be specified
individually and have higher priority



Multi-dimensional Association

 Single-dimensional rules:
buys(X, “milk”)  buys(X, “bread”)
 Multi-dimensional rules: ≥ 2 dimensions or predicates
  Inter-dimension assoc. rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”)  buys(X, “coke”)
  hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”)  buys(X, “coke”)
 Categorical Attributes
 finite number of possible values, no ordering among values
 Quantitative Attributes
 numeric, implicit ordering among values
Multi-level Association: Redundancy Filtering

 Some rules may be redundant due to “ancestor”


relationships between items.
 Example
 milk  wheat bread [support = 8%, confidence = 70%]

 2% milk  wheat bread [support = 2%, confidence = 72%]


 We say the first rule is an ancestor of the second rule.
 A rule is redundant if its support is close to the “expected”
value, based on the rule’s ancestor.
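As a hedged, numeric illustration of that redundancy test (the 25% market share below is an assumed figure, not stated on the slide):

# milk => wheat bread has support 8%; assume 2% milk makes up ~25% of milk purchases
expected = 0.08 * 0.25      # expected support of "2% milk => wheat bread"
observed = 0.02             # support reported for the descendant rule
print(expected, observed)   # 0.02 vs 0.02: close to the expectation, so the rule is redundant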



Multi-Level Mining: Progressive Deepening

 A top-down, progressive deepening approach:


 First mine high-level frequent items:
milk (15%), bread (10%)
 Then mine their lower-level “weaker” frequent
itemsets:
2% milk (5%), wheat bread (4%)

 Different min_support thresholds across multiple levels lead
to different algorithms:
  If adopting the same min_support across multiple levels
then toss t if any of t’s ancestors is infrequent.
  If adopting reduced min_support at lower levels
then examine only those descendants whose ancestor’s support
is frequent/non-negligible.
Techniques for Mining MD Associations

 Search for frequent k-predicate set:


 Example: {age, occupation, buys} is a 3-predicate set

 Techniques can be categorized by how quantitative attributes, such as age, are treated

1. Using static discretization of quantitative attributes


 Quantitative attributes are statically discretized by

using predefined concept hierarchies


2. Quantitative association rules
 Quantitative attributes are dynamically discretized into

“bins”based on the distribution of the data


3. Distance-based association rules
 This is a dynamic discretization process that considers

the distance between data points


Static Discretization of Quantitative Attributes

 Discretized prior to mining using concept hierarchy.


 Numeric values are replaced by ranges.
 In relational database, finding all frequent k-predicate sets
will require k or k+1 table scans.
 Data cube is well suited for mining.
  The cells of an n-dimensional cuboid correspond to the predicate sets.
  Mining from data cubes can be much faster.

(Figure: the lattice of cuboids over (age, income, buys): the apex cuboid (), the 1-D cuboids (age), (income), (buys), the 2-D cuboids (age, income), (age, buys), (income, buys), and the base cuboid (age, income, buys).)
Quantitative Association Rules
 Numeric attributes are dynamically discretized
 Such that the confidence or compactness of the rules
mined is maximized
 2-D quantitative association rules: Aquan1  Aquan2  Acat
 Cluster “adjacent”
association rules
to form general
rules using a 2-D
grid
 Example
age(X, ”30-34”) ∧ income(X, ”24K - 48K”)  buys(X, ”high resolution TV”)
Mining Distance-based Association Rules

 Binning methods do not capture the semantics of interval


data
Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]

 Distance-based partitioning, more meaningful discretization


considering:
 density/number of points in an interval

 “closeness” of points in an interval



Interestingness Measure: Correlations (Lift)

 play basketball  eat cereal [40%, 66.7%] is misleading
  The overall percentage of students eating cereal is 75%, which is higher than 66.7%.
 play basketball  not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

corrA,B = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
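Plugging the contingency table above into the lift formula, as a small sanity check (variable names are illustrative):

n = 5000
p_basketball = 3000 / n
p_cereal = 3750 / n
p_both = 2000 / n                      # plays basketball and eats cereal
lift = p_both / (p_basketball * p_cereal)
print(round(lift, 3))                  # 0.889 < 1: the two events are negatively
                                       # correlated, despite the 66.7% confidence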


Chapter 6: Mining Association Rules
in Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary



Constraint-based Data Mining

 Finding all the patterns in a database autonomously? —


unrealistic!
 The patterns could be too many but not focused!

 Data mining should be an interactive process


 User directs what to be mined using a data mining

query language (or a graphical user interface)


 Constraint-based mining
 User flexibility: provides constraints on what to be

mined
 System optimization: explores such constraints for

efficient mining—constraint-based mining



Constraints in Data Mining

 Knowledge type constraint:


 classification, association, etc.
 Data constraint — using SQL-like queries
 find product pairs sold together in stores in Vancouver
in Dec.’00
 Dimension/level constraint
 in relevance to region, price, brand, customer category
 Rule (or pattern) constraint
 small sales (price < $10) triggers big sales (sum >
$200)
 Interestingness constraint
 strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained Mining vs. Constraint-Based Search

 Constrained mining vs. constraint-based search/reasoning


 Both are aimed at reducing search space
 Finding all patterns satisfying constraints vs. finding
some (or one) answer in constraint-based search in AI
 Constraint-pushing vs. heuristic search
 It is an interesting research problem on how to integrate
them
 Constrained mining vs. query processing in DBMS
 Database query processing requires to find all
 Constrained pattern mining shares a similar philosophy
as pushing selections deeply in query processing



Constrained Frequent Pattern Mining: A Mining Query
Optimization Problem

 Given a frequent pattern mining query with a set of constraints C, the


algorithm should be
 sound: it only finds frequent sets that satisfy the given
constraints C
 complete: all frequent sets satisfying the given
constraints C are found
 A naïve solution
 First find all frequent sets, and then test them for
constraint satisfaction
 More efficient approaches:
 Analyze the properties of constraints comprehensively
 Push them as deeply as possible inside the frequent
pattern computation.
Anti-Monotonicity in Constraint-Based Mining

 Anti-monotonicity
  When an itemset S violates the constraint, so does any of its supersets
  sum(S.Price) ≤ v is anti-monotone
  sum(S.Price) ≥ v is not anti-monotone
 Example. C: range(S.profit) ≤ 15 is anti-monotone (see the sketch below)
  Itemset ab violates C
  So does every superset of ab

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
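A minimal sketch of anti-monotone pruning for C: range(S.profit) ≤ 15, using the profit table on this slide (function names are illustrative): once an itemset violates C, none of its supersets needs to be generated or tested.

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}

def satisfies_range(itemset, v=15):
    """C: range(S.profit) <= v."""
    values = [profit[i] for i in itemset]
    return max(values) - min(values) <= v

print(satisfies_range({'a', 'b'}))        # False: range = 40, so ab violates C
print(satisfies_range({'a', 'b', 'c'}))   # False as well; every superset of ab can be pruned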
Which Constraints Are Anti-Monotone?

Constraint                        Antimonotone
v ∈ S                             no
S ⊇ V                             no
S ⊆ V                             yes
min(S) ≤ v                        no
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        no
count(S) ≤ v                      yes
count(S) ≥ v                      no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      yes
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    yes
support(S) ≤ ξ                    no


Monotonicity in Constraint-Based Mining

 Monotonicity
  When an itemset S satisfies the constraint, so does any of its supersets
  sum(S.Price) ≥ v is monotone
  min(S.Price) ≤ v is monotone
 Example. C: range(S.profit) ≥ 15
  Itemset ab satisfies C
  So does every superset of ab

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Which Constraints Are Monotone?

Constraint                        Monotone
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             no
min(S) ≤ v                        yes
min(S) ≥ v                        no
max(S) ≤ v                        no
max(S) ≥ v                        yes
count(S) ≤ v                      no
count(S) ≥ v                      yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        yes
range(S) ≤ v                      no
range(S) ≥ v                      yes
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    no
support(S) ≤ ξ                    yes


Succinctness

 Succinctness:
 Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based on
A1 , i.e., S contains a subset belonging to A1
 Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
 min(S.Price) ≤ v is succinct
 sum(S.Price) ≥ v is not succinct
 Optimization: If C is succinct, C is pre-counting pushable



Which Constraints Are Succinct?

Constraint                        Succinct
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             yes
min(S) ≤ v                        yes
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        yes
count(S) ≤ v                      weakly
count(S) ≥ v                      weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      no
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         no
support(S) ≥ ξ                    no
support(S) ≤ ξ                    no


The Apriori Algorithm — Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D  C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
          L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D  C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
          L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D  L3: {2 3 5}:2
Naïve Algorithm: Apriori + Constraint

(Figure: the same Apriori run as on the previous slide, Database D with C1/L1, C2/L2, C3/L3, where the constraint Sum{S.price < 5} is applied only to the frequent itemsets after mining has finished.)
The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep

(Figure: the same Apriori run, but the anti-monotone constraint Sum{S.price < 5} is pushed into the mining loop, so itemsets violating it are pruned as soon as they are generated.)
The Constrained Apriori Algorithm: Push a Succinct Constraint Deep

(Figure: the same Apriori run with the succinct constraint min{S.price <= 1} pushed deep: the constraint is enforced up front through item selection, before support counting.)
Converting “Tough” Constraints

 Convert tough constraints into anti-monotone or monotone by properly ordering items
 Examine C: avg(S.profit) ≥ 25
  Order items in value-descending order
   <a, f, g, d, b, h, c, e>
  If an itemset afb violates C
   So does afbh, afb*
   It becomes anti-monotone!

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10


Convertible Constraints

 Let R be an order of items


 Convertible anti-monotone
 If an itemset S violates a constraint C, so does every
itemset having S as a prefix w.r.t. R
  Ex. avg(S) ≥ v w.r.t. item value descending order
 Convertible monotone
  If an itemset S satisfies constraint C, so does every
itemset having S as a prefix w.r.t. R
  Ex. avg(S) ≤ v w.r.t. item value descending order



Strongly Convertible Constraints

 avg(X) ≥ 25 is convertible anti-monotone w.r.t. item value descending order R: <a, f, g, d, b, h, c, e>
  If an itemset af violates a constraint C, so does every itemset with af as prefix, such as afd
 avg(X) ≥ 25 is convertible monotone w.r.t. item value ascending order R-1: <e, c, h, b, d, g, f, a>
  If an itemset d satisfies a constraint C, so do the itemsets df and dfa, which have d as a prefix
 Thus, avg(X) ≥ 25 is strongly convertible

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
What Constraints Are Convertible?

Constraint                                         Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤ v, ≥ v                                    Yes                         Yes                    Yes
median(S) ≤ v, ≥ v                                 Yes                         Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)    Yes                         No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)    No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)    No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)    Yes                         No                     No
……


Combining Them Together—A General Picture

Constraint                        Antimonotone   Monotone      Succinct
v ∈ S                             no             yes           yes
S ⊇ V                             no             yes           yes
S ⊆ V                             yes            no            yes
min(S) ≤ v                        no             yes           yes
min(S) ≥ v                        yes            no            yes
max(S) ≤ v                        yes            no            yes
max(S) ≥ v                        no             yes           yes
count(S) ≤ v                      yes            no            weakly
count(S) ≥ v                      no             yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes            no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no             yes           no
range(S) ≤ v                      yes            no            no
range(S) ≥ v                      no             yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible    convertible   no
support(S) ≥ ξ                    yes            no            no
support(S) ≤ ξ                    no             yes           no


Classification of Constraints

(Figure: the relationships among the constraint classes: Antimonotone, Monotone, and Succinct (overlapping both); Convertible anti-monotone and Convertible monotone, whose intersection is the Strongly convertible class; everything else is Inconvertible.)
Mining With Convertible Constraints

 C: avg(S.profit) ≥ 25
 List items in every transaction in value descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
 Scan transaction DB once
  remove infrequent items
   Item h in transaction 40 is dropped
  Itemsets a and f are good

TDB (min_sup = 2)
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e

Item   Profit
a      40
f      30
g      20
d      10
b      0
h      -10
c      -20
e      -30


Can Apriori Handle Convertible Constraint?

 A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  Within the level-wise framework, no direct pruning based on the constraint can be made
  Itemset df violates constraint C: avg(X) ≥ 25
  Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
 But it can be pushed into the frequent-pattern growth framework!

Item   Value
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10


Mining With Convertible Constraints

 C: avg(X) ≥ 25, min_sup = 2
 List items in every transaction in value descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R
 Scan TDB once
  remove infrequent items
   Item h is dropped
  Itemsets a and f are good, …
 Projection-based mining
  Imposing an appropriate order on item projection
  Many tough constraints can be converted into (anti-)monotone

Item   Value
a      40
f      30
g      20
d      10
b      0
h      -10
c      -20
e      -30

TDB (min_sup = 2)
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e


Handling Multiple Constraints

 Different constraints may require different or even


conflicting item-ordering
 If there exists an order R s.t. both C1 and C2 are
convertible w.r.t. R, then there is no conflict between
the two convertible constraints
 If there exists conflict on order of items
 Try to satisfy one constraint first
 Then using the order for the other constraint to
mine frequent itemsets in the corresponding
projected database



Chapter 6: Mining Association Rules in
Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary



Sequence Databases and Sequential
Pattern Analysis

 Transaction databases, time-series databases vs. sequence


databases
 Frequent patterns vs. (frequent) sequential patterns
 Applications of sequential pattern mining
 Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital camera,
within 3 months.
 Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets, etc.
 Telephone calling patterns, Weblog click streams
 DNA sequences and gene structures



What Is Sequential Pattern Mining?

 Given a set of sequences, find the complete set of frequent subsequences

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A sequence: < (ef) (ab) (df) c b >
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
Challenges on Sequential Pattern Mining

 A huge number of possible sequential patterns are


hidden in databases
 A mining algorithm should
 find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
 be highly efficient, scalable, involving only a small
number of database scans
 be able to incorporate various kinds of user-specific
constraints



Studies on Sequential Pattern Mining

 Concept introduction and an initial Apriori-like algorithm


 R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95

 GSP—An Apriori-based, influential mining method (developed at IBM


Almaden)
 R. Srikant & R. Agrawal. “Mining sequential patterns:

Generalizations and performance improvements,” EDBT’96


 From sequential patterns to episodes (Apriori-like + constraints)
 H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent

episodes in event sequences,” Data Mining and Knowledge


Discovery, 1997
 Mining sequential patterns with constraints
 M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999



A Basic Property of Sequential Patterns: Apriori

 A basic property: Apriori (Agrawal & Srikant’94)

  If a sequence S is not frequent

  Then none of the super-sequences of S is frequent

  E.g., <hb> is infrequent  so are <hab> and <(ah)b>

Seq. ID Sequence Given support threshold


10 <(bd)cb(ac)> min_sup =2
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>



GSP—A Generalized Sequential Pattern Mining Algorithm

 GSP (Generalized Sequential Pattern) mining algorithm


 proposed by Srikant and Agrawal, EDBT’96
 Outline of the method
 Initially, every item in DB is a candidate of length-1
 for each level (i.e., sequences of length-k) do
 scan database to collect support count for each

candidate sequence
 generate candidate length-(k+1) sequences from

length-k frequent sequences using Apriori


 repeat until no frequent sequence or no candidate
can be found
 Major strength: Candidate pruning by Apriori



Finding Length-1 Sequential Patterns

 Examine GSP using an example
 Initial candidates: all singleton sequences
  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
 Scan database once, count support for candidates

min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
Generating Length-2 Candidates

51 length-2 candidates:
 36 of the form <xy> over the length-1 patterns {a, …, f}: <aa>, <ab>, …, <ff>
 15 of the form <(xy)> with x < y: <(ab)>, <(ac)>, …, <(ef)>

Without the Apriori property, 8*8 + 8*7/2 = 92 candidates would be needed.
Apriori prunes 44.57% of the candidates.
Generating Length-3 Candidates and Finding Length-3 Patterns

 Generate Length-3 Candidates


 Self-join length-2 sequential patterns
 Based on the Apriori property

 <ab>, <aa> and <ba> are all length-2 sequential

patterns  <aba> is a length-3 candidate


 <(bd)>, <bb> and <db> are all length-2 sequential

patterns  <(bd)b> is a length-3 candidate


 46 candidates are generated
 Find Length-3 Sequential Patterns
 Scan database once more, collect support counts for
candidates
 19 out of 46 candidates pass support threshold
The GSP Mining Process

5th scan: 1 cand.   1 length-5 seq. pat.   <(bd)cba>   (candidates that cannot pass the support threshold are dropped)
4th scan: 8 cand.   6 length-4 seq. pat.   <abba> <(bd)bc> …   (some candidates are not in the DB at all)
3rd scan: 46 cand.  19 length-3 seq. pat.  <abb> <aab> <aba> <baa> <bab> …   (20 cand. not in DB at all)
2nd scan: 51 cand.  19 length-2 seq. pat.  <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>   (10 cand. not in DB at all)
1st scan: 8 cand.   6 length-1 seq. pat.   <a> <b> <c> <d> <e> <f> <g> <h>

min_sup = 2
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Bottlenecks of GSP

 A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate
    1000 * 1000 + (1000 * 999) / 2 = 1,499,500 length-2 candidates!
 Multiple scans of database in mining
 Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs 10^30 candidate sequences!
    Σ_{i=1..100} (100 choose i) = 2^100 − 1 ≈ 10^30


FreeSpan: Frequent Pattern-Projected
Sequential Pattern Mining
 A divide-and-conquer approach
 Recursively project a sequence database into a set of smaller
databases based on the current set of frequent patterns
 Mine each projected database to find its patterns
 J. Han J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, M.C. Hsu, FreeSpan:
Frequent pattern-projected sequential pattern mining. In KDD’00.

f_list: b:5, c:4, a:3, d:3, e:3, f:2


Sequence Database SDB
All seq. pat. can be divided into 6 subsets:
< (bd) c b (ac) > •Seq. pat. containing item f
< (bf) (ce) b (fg) > •Those containing e but no f
< (ah) (bf) a b f > •Those containing d but no e nor f
< (be) (ce) d > •Those containing a but no d, e or f
< a (bd) b c b (ade) > •Those containing c but no a, d, e or f
•Those containing only item b
From FreeSpan to PrefixSpan: Why?

 Freespan:
 Projection-based: No candidate sequence needs to be
generated
 But, projection can be performed at any point in the
sequence, and the projected sequences will not
shrink much
 PrefixSpan
 Projection-based
 But only prefix-based projection: less projections and
quickly shrinking sequences



Prefix and Suffix (Projection)

 <a>, <aa>, <a(ab)> and <a(abc)> are


prefixes of sequence <a(abc)(ac)d(cf)>
 Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>



Mining Sequential Patterns by Prefix
Projections

 Step 1: find length-1 sequential patterns


 <a>, <b>, <c>, <d>, <e>, <f>

 Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
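A minimal sketch of prefix projection for the simplified case where each sequence element is a single item (itemset elements and the "(_x)" notation are omitted); the function name and the toy database are illustrative, not from the slides.

def prefixspan(db, min_sup, prefix=()):
    """Recursively grow sequential patterns from prefix-projected databases."""
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    patterns = {}
    for item, cnt in counts.items():
        if cnt >= min_sup:
            new_prefix = prefix + (item,)
            patterns[new_prefix] = cnt
            # project: keep the suffix after the first occurrence of `item`
            projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
            patterns.update(prefixspan(projected, min_sup, new_prefix))
    return patterns

db = [list("abcacd"), list("acbc"), list("eabdcb"), list("eafcbc")]
print(prefixspan(db, 2)[('a', 'b', 'c')])   # 4: <abc> is a subsequence of all four sequences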



Finding Seq. Patterns with Prefix <a>

 Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
 Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Further partition into 6 subsets
   Having prefix <aa>;
   …
   Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>


Completeness of PrefixSpan

SDB
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a>: the <a>-projected database
  <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  further split into the <aa>-proj. db, …, the <af>-proj. db
Having prefix <b>: the <b>-projected database, …
Having prefix <c>, …, <f>: …


Efficiency of PrefixSpan

 No candidate sequence needs to be generated

 Projected databases keep shrinking

 Major cost of PrefixSpan: constructing projected


databases

 Can be improved by bi-level projections



Optimization Techniques in PrefixSpan

 Physical projection vs. pseudo-projection


 Pseudo-projection may reduce the effort of
projection when the projected database fits in
main memory
 Parallel projection vs. partition projection
 Partition projection may avoid the blowup of
disk space



Speed-up by Pseudo-projection

 Major cost of PrefixSpan: projection


 Postfixes of sequences often appear
repeatedly in recursive projected
databases
 When (projected) database can be held in main
memory, use pointers to form projections
  Pointer to the sequence plus offset of the postfix
  s = <a(abc)(ac)d(cf)>
  s|<a>:  (pointer to s, offset 2)   <(abc)(ac)d(cf)>
  s|<ab>: (pointer to s, offset 4)   <(_c)(ac)d(cf)>
Pseudo-Projection vs. Physical Projection

 Pseudo-projection avoids physically copying postfixes


 Efficient in running time and space when database
can be held in main memory
 However, it is not efficient when database cannot fit
in main memory
 Disk-based random accessing is very costly

 Suggested Approach:
 Integration of physical and pseudo-projection

 Swapping to pseudo-projection when the data set


fits in memory



PrefixSpan Is Faster than GSP and FreeSpan

(Figure: runtime (seconds) vs. support threshold (%) comparing PrefixSpan-1, PrefixSpan-2, FreeSpan, and GSP; the PrefixSpan variants have the lowest runtimes as the threshold decreases.)


Effect of Pseudo-Projection

(Figure: runtime (seconds) vs. support threshold (%) for PrefixSpan-1 and PrefixSpan-2, with and without pseudo-projection; the pseudo-projection variants run faster.)
Chapter 6: Mining Association Rules
in Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary
Associative Classification

 Mine possible association rules (PRs) in the form of
condset  c
  condset: a set of attribute-value pairs
  c: class label
 Build Classifier
 Organize rules according to decreasing
precedence based on confidence and support
 B. Liu, W. Hsu & Y. Ma. Integrating classification and
association rule mining. In KDD’98



Why Iceberg Cube?

 It is too costly to materialize a high-dimensional cube
  20 dimensions, each with 99 distinct values, may lead to a cube of 100^20 cells.
  Even if there is only one nonempty cell in each 10^10 cells, the cube will still contain 10^30 nonempty cells
 Observation: Trivial cells are usually not interesting
  Nontrivial: large volume of sales, or high profit
 Solution:
  iceberg cube—materialize only nontrivial cells of a data cube


Anti-Monotonicity in Iceberg Cubes

 If a cell c violates the HAVING clause, so do all more


specific cells
 Example. Let Having COUNT(*)>=50
 (*, *, Edu, 1000, 30) violates the HAVING clause

 (Feb, *, Edu), (*, Van, Edu), (Mar, Tor, Edu): each

must have count no more than 30

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING COUNT(*) >= 50

Month   City   Cust_grp   Prod      Cost   Price
Jan     Tor    Edu        Printer   500    485
Mar     Van    Edu        HD        540    520
…       …      …          …         …      …


Computing Iceberg Cubes Efficiently

 Based on Apriori-like pruning


 BUC [Beyer & Ramakrishnan, 99]
 bottom-up cubing, efficient bucket-sort alg.
 Only handles anti-monotonic iceberg cubes, e.g.,
measures confined to count and p+_sum (e.g., price)
 Computing non-anti-monotonic iceberg cubes
 Finding a weaker but anti-monotonic measure (e.g.,
avg to top-k-avg) for dynamic pruning in computation
 Use special data structure (H-tree) and perform H-
cubing (SIGMOD’01)



Spatial and Multi-Media Association: A
Progressive Refinement Method

 Why progressive refinement?


 Mining operator can be expensive or cheap, fine or

rough
 Trade speed with quality: step-by-step refinement.

 Superset coverage property:


 Preserve all the positive answers—allow a false positive
test but not a false negative test.


 Two- or multi-step mining:
 First apply rough/cheap operator (superset coverage)

 Then apply expensive algorithm on a substantially

reduced candidate set (Koperski & Han, SSD’95).



Progressive Refinement Mining of Spatial
Associations
 Hierarchy of spatial relationship:
 “g_close_to”: near_by, touch, intersect, contain, etc.

 First search for rough relationship and then refine it.

 Two-step mining of spatial association:


 Step 1: rough spatial computation (as a filter)

 Using MBR or R-tree for rough estimation.


 Step2: Detailed spatial algorithm (as refinement)
 Apply only to those objects which have passed the rough
spatial association test (no less than min_support)



Mining Multimedia Associations

Correlations with color, spatial relationships, etc.


From coarse to Fine Resolution mining



Further Evolution of PrefixSpan

 Closed- and max- sequential patterns


 Finding only the most meaningful (longest)
sequential patterns
 Constraint-based sequential pattern growth
 Adding user-specific constraints
 From sequential patterns to structured patterns
 Beyond sequential patterns, mining
structured patterns in XML documents
Closed- and Max- Sequential Patterns

 A closed- sequential pattern is a frequent sequence s where there is


no proper super-sequence of s sharing the same support count with s
 A max- sequential pattern is a sequential pattern p s.t. any proper
super-pattern of p is not frequent
 Benefit of the notion of closed sequential patterns
 {<a1 a2 … a50>, <a1 a2 … a100>}, with min_sup = 1
 There are 2^100 sequential patterns, but only 2 are
closed
 Similar benefits for the notion of max- sequential-patterns



Methods for Mining Closed- and Max-
Sequential Patterns

 PrefixSpan or FreeSpan can be viewed as projection-guided depth-first


search
 For mining max- sequential patterns, any sequence which does not
contain anything beyond the already discovered ones will be removed
from the projected DB
 {<a1 a2 … a50>, <a1 a2 … a100>}, with min_sup = 1
 If we have found a max-sequential pattern <a1 a2 …
a100>, nothing will be projected in any projected DB
 Similar ideas can be applied for mining closed- sequential-patterns



Constraint-Based Sequential Pattern
Mining

 Constraint-based sequential pattern mining


 Constraints: User-specified, for focused mining of desired patterns
 How to explore efficient mining with constraints? — Optimization
 Classification of constraints
 Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
 Monotone: E.g., count(S) > 5, S ⊇ {PC, digital_camera}
 Succinct: E.g., length(S) ≤ 10, S ⊆ {Pentium, MS/Office, MS/Money}
 Convertible: E.g., value_avg(S) < 25, profit_sum (S) > 160,
max(S)/avg(S) < 2, median(S) – min(S) > 5
 Inconvertible: E.g., avg(S) – median(S) = 0

April 12, 2023 Data Mining: Concepts and Techniques 119


Sequential Pattern Growth for
Constraint-Based Mining

 Efficient mining with convertible constraints


 Not solvable by candidate generation-and-test methodology
 Easily push-able into the sequential pattern growth framework
 Example: push avg(S) < 25 in frequent pattern growth
 project items in value (price/profit depending on mining
semantics) ascending/descending order for sequential pattern
growth
 Grow each pattern by sequential pattern growth
 If avg(current_pattern) ≥ 25, toss the current_pattern
 Why?—future growths always make it bigger
 But why not candidate generation?—no structure or ordering in
growth

April 12, 2023 Data Mining: Concepts and Techniques 120
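Note: a minimal Python sketch of the pruning rule above, applied to plain frequent-itemset growth rather than full sequential-pattern growth. The item prices, the toy transaction database, and min_sup are made up for illustration; the point is only that, once items are processed in ascending value order, a prefix whose average already reaches 25 can be tossed, because every further growth keeps the average at or above 25.

value = {'a': 10, 'b': 20, 'c': 30, 'd': 40}            # hypothetical item prices
transactions = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'a', 'c', 'd'}, {'b', 'c'}]
min_sup = 2
items = sorted(value, key=value.get)                     # ascending value order

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def grow(prefix, start):
    """Depth-first pattern growth pushing the convertible constraint avg(S) < 25."""
    for i in range(start, len(items)):
        cand = prefix | {items[i]}
        if support(cand) < min_sup:
            continue                                     # frequency pruning (anti-monotone)
        avg = sum(value[x] for x in cand) / len(cand)
        if avg >= 25:
            continue   # toss: items are appended in ascending value order,
                       # so any further growth can only raise the average
        print(sorted(cand), 'support =', support(cand), 'avg =', avg)
        grow(cand, i + 1)

grow(set(), 0)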


From Sequential Patterns to Structured
Patterns

 Sets, sequences, trees and other structures


 Transaction DB: Sets of items
 {{i1, i2, …, im}, …}
 Seq. DB: Sequences of sets:
 {<{i1, i2}, …, {im, in, ik}>, …}
 Sets of Sequences:
 {{<i1, i2>, …, <im, in, ik>}, …}
 Sets of trees (each element being a tree):
 {t1, t2, …, tn}
 Applications: Mining structured patterns in XML documents

April 12, 2023 Data Mining: Concepts and Techniques 121


Chapter 6: Mining Association Rules
in Large Databases

 Association rule mining


 Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
 Mining various kinds of association/correlation rules
 Constraint-based association mining
 Sequential pattern mining
 Applications/extensions of frequent pattern mining
 Summary

April 12, 2023 Data Mining: Concepts and Techniques 122


Frequent-Pattern Mining: Achievements

 Frequent pattern mining—an important task in data mining


 Frequent pattern mining methodology
 Candidate generation & test vs. projection-based (frequent-pattern

growth)
 Vertical vs. horizontal format

 Various optimization methods: database partition, scan reduction,

hash tree, sampling, border computation, clustering, etc.


 Related frequent-pattern mining algorithm: scope extension
 Mining closed frequent itemsets and max-patterns (e.g., MaxMiner,

CLOSET, CHARM, etc.)


 Mining multi-level, multi-dimensional frequent patterns with flexible

support constraints
 Constraint pushing for mining optimization

 From frequent patterns to correlation and causality

April 12, 2023 Data Mining: Concepts and Techniques 123


Frequent-Pattern Mining: Applications

 Related problems which need frequent pattern mining


 Association-based classification
 Iceberg cube computation
 Database compression by fascicles and frequent
patterns
 Mining sequential patterns (GSP, PrefixSpan, SPADE,
etc.)
 Mining partial periodicity, cyclic associations, etc.
 Mining frequent structures, trends, etc.
 Typical application examples
 Market-basket analysis, Weblog analysis, DNA mining,
etc.
April 12, 2023 Data Mining: Concepts and Techniques 124
Frequent-Pattern Mining: Research Problems

 Multi-dimensional gradient analysis: patterns regarding


changes and differences
 Not just counts—other measures, e.g., avg(profit)
 Mining top-k frequent patterns without support constraint
 Mining fault-tolerant associations
 “3 out of 4 courses excellent” leads to A in data mining
 Fascicles and database compression by frequent pattern
mining
 Partial periodic patterns
 DNA sequence analysis and pattern classification

April 12, 2023 Data Mining: Concepts and Techniques 125


References: Frequent-pattern Mining Methods

 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm


for generation of frequent itemsets. Journal of Parallel and Distributed
Computing, 2000.
 R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
between sets of items in large databases. SIGMOD'93, 207-216,
Washington, D.C.
 R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94 487-499, Santiago, Chile.
 J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate
generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May
2000.
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for
discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.

April 12, 2023 Data Mining: Concepts and Techniques 126


References: Frequent-pattern Mining Methods

 A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for


mining association rules in large databases. VLDB'95, 432-443, Zurich,
Switzerland.
 C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques
for mining causal structures. VLDB'98, 594-605, New York, NY.
 R. Srikant and R. Agrawal. Mining generalized association rules.
VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
 R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96, 1-12, Montreal, Canada.
 H. Toivonen. Sampling large databases for association rules. VLDB'96,
134-145, Bombay, India, Sept. 1996.
 M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for
fast discovery of association rules. KDD’97. August 1997.

April 12, 2023 Data Mining: Concepts and Techniques 127


References: Frequent-pattern Mining
(Performance Improvements)
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting
and implication rules for market basket analysis. SIGMOD'97, Tucson,
Arizona, May 1997.
 D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered
association rules in large databases: An incremental updating technique.
ICDE'96, New Orleans, LA.
 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining
using two-dimensional optimized association rules: Scheme, algorithms,
and visualization. SIGMOD'96, Montreal, Canada.
 E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for
association rules. SIGMOD'97, Tucson, Arizona.
 J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95, San Jose, CA, May 1995.

April 12, 2023 Data Mining: Concepts and Techniques 128


References: Frequent-pattern Mining
(Performance Improvements)
 G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong
rules. In G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in
Databases,. AAAI/MIT Press, 1991.
 J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95, San Jose, CA.
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule
mining with relational database systems: Alternatives and implications.
SIGMOD'98, Seattle, WA.
 K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama.
Computing optimized rectilinear regions for association rules. KDD'97,
Newport Beach, CA, Aug. 1997.
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for
discovery of association rules. Data Mining and Knowledge Discovery,
1:343-374, 1997.
April 12, 2023 Data Mining: Concepts and Techniques 129
References: Frequent-pattern Mining (Multi-
level, correlation, ratio rules, etc.)
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to
correlations. SIGMOD'97, 265-276, Tucson, Arizona.
 J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-
431, Zurich, Switzerland.
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules
from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
 F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable
data mining. VLDB'98, 582-593, New York, NY
 B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham,
England.
 R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-
133, Bombay, India.
 R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
 A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large
database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
 D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
 J. Pei, A.K.H. Tung, J. Han. Fault-Tolerant Frequent Pattern Mining: Problems and Challenges.
SIGMOD DMKD’01, Santa Barbara, CA.

April 12, 2023 Data Mining: Concepts and Techniques 130


References: Mining Max-patterns and Closed
itemsets
 R. J. Bayardo. Efficiently mining long patterns from databases.
SIGMOD'98, 85-93, Seattle, Washington.
 J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining
Frequent Closed Itemsets", Proc. 2000 ACM-SIGMOD Int. Workshop on
Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May
2000.
 N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99, 398-416, Jerusalem,
Israel, Jan. 1999.
 M. Zaki. Generating Non-Redundant Association Rules. KDD'00.
Boston, MA. Aug. 2000
 M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule
Mining, SIAM’02

April 12, 2023 Data Mining: Concepts and Techniques 131


References: Constraint-based Frequent-pattern Mining

 G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00,
512-521, San Diego, CA, Feb. 2000.
 Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95,
39-46, Singapore, Dec. 1995.
 J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining",
COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.
 L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of Constrained Frequent Set Queries
with 2-Variable Constraints", SIGMOD’99
 R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. “Exploratory mining and pruning optimizations of
constrained association rules.” SIGMOD’98
 J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints", Proc.
2001 Int. Conf. on Data Engineering (ICDE'01), April 2001.
 J. Pei and J. Han "Can We Push More Constraints into Frequent Pattern Mining?", Proc. 2000 Int. Conf.
on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
 R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73,
Newport Beach, California

April 12, 2023 Data Mining: Concepts and Techniques 132


References: Sequential Pattern Mining Methods

 R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14,


Taipei, Taiwan.
 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations
and performance improvements. EDBT’96.
 J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu,
"FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining",
Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'00), Boston, MA, August 2000.
 H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent
episodes in event sequences. Data Mining and Knowledge Discovery,
1:259-289, 1997.

April 12, 2023 Data Mining: Concepts and Techniques 133


References: Sequential Pattern Mining Methods

 J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu,


"PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected
Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering
(ICDE'01), Heidelberg, Germany, April 2001.
 B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association
rules. ICDE'98, 412-421, Orlando, FL.
 S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of
interesting patterns in association rules. VLDB'98, 368-379, New York,
NY.
 M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98.
November 1998.
 M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999: 223-234,
Edinburgh, Scotland.
April 12, 2023 Data Mining: Concepts and Techniques 134
References: Frequent-pattern Mining in Spatial,
Multimedia, Text & Web Databases

 K. Koperski, J. Han, and G. B. Marchisio, "Mining Spatial and Image Data through Progressive
Refinement Methods", Revue internationale de gomatique (European Journal of GIS and Spatial
Analysis), 9(4):425-440, 1999.
 A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Breaking the Barrier of Transactions: Mining Inter-
Transaction Association Rules", Proc. 1999 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'99), San Diego, CA, Aug. 1999, pp. 297-301.
 J. Han, G. Dong and Y. Yin, "Efficient Mining of Partial Periodic Patterns in Time Series Database",
Proc. 1999 Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia, March 1999, pp. 106-115
 H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association Analysis:Mining Multi-Dimensional
Inter-Transaction Association Rules", ACM Transactions on Information Systems (TOIS’00), 18(4): 423-
454, 2000.
 O. R. Zaiane, M. Xin, J. Han, "Discovering Web Access Patterns and Trends by Applying OLAP and Data
Mining Technology on Web Logs," Proc. Advances in Digital Libraries Conf. (ADL'98), Santa Barbara,
CA, April 1998, pp. 19-29
 O. R. Zaiane, J. Han, and H. Zhu, "Mining Recurrent Items in Multimedia with Progressive Resolution
Refinement", ICDE'00, San Diego, CA, Feb. 2000, pp. 461-470

April 12, 2023 Data Mining: Concepts and Techniques 135


References: Frequent-pattern Mining for Classification
and Data Cube Computation
 K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
 M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman.
Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug.
1998.
 J. Han, J. Pei, G. Dong, and K. Wang, “Computing Iceberg Data Cubes with
Complex Measures”, Proc. ACM-SIGMOD’2001, Santa Barbara, CA, May
2001.
 M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-
dimensional association rules using data cubes. KDD'97, 207-210, Newport
Beach, California.
 T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing
association rules. Technical Report, Aug. 2000

April 12, 2023 Data Mining: Concepts and Techniques 136


www.cs.uiuc.edu/~hanj

Thank you !!!


April 12, 2023 Data Mining: Concepts and Techniques 137
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 7 —

©Jiawei Han and Micheline Kamber


Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
April 12, 2023 Data Mining: Concepts and Techniques 1
Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 2
Classification vs. Prediction

 Classification:
 predicts categorical class labels (discrete or nominal)

 classifies data (constructs a model) based on the


training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
 Prediction:
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical Applications
 credit approval

 target marketing

 medical diagnosis

 treatment effectiveness analysis


April 12, 2023 Data Mining: Concepts and Techniques 3
Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes


 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set

 The model is represented as classification rules, decision trees,


or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

 The known label of test sample is compared with the

classified result from the model


 Accuracy rate is the percentage of test set samples that are

correctly classified by the model


 Test set is independent of training set, otherwise over-fitting

will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
April 12, 2023 Data Mining: Concepts and Techniques 4
Classification Process (1): Model
Construction
Training Data  →  Classification Algorithms  →  Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting model (example rule):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
April 12, 2023 Data Mining: Concepts and Techniques 5
Classification Process (2): Use the
Model in Prediction

Classifier

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)  →  Tenured?
April 12, 2023 Data Mining: Concepts and Techniques 6
Supervised vs. Unsupervised
Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data is unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
April 12, 2023 Data Mining: Concepts and Techniques 7
Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 8
Issues Regarding Classification and Prediction
(1): Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data

April 12, 2023 Data Mining: Concepts and Techniques 9


Issues regarding classification and prediction
(2): Evaluating Classification Methods

 Predictive accuracy
 Speed and scalability
 time to construct the model

 time to use the model

 Robustness
 handling noise and missing values

 Scalability
 efficiency in disk-resident databases

 Interpretability:
 understanding and insight provided by the model

 Goodness of rules
 decision tree size

 compactness of classification rules

April 12, 2023 Data Mining: Concepts and Techniques 10


Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 11
Training Dataset

This follows an example from Quinlan’s ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

April 12, 2023 Data Mining: Concepts and Techniques 12


Output: A Decision Tree for “buys_computer”

age?
  <=30    → student?
              no  → no
              yes → yes
  31..40  → yes
  >40     → credit_rating?
              excellent → no
              fair      → yes

April 12, 2023 Data Mining: Concepts and Techniques 13


Algorithm for Decision Tree Induction

 Basic algorithm (a greedy algorithm)


 Tree is constructed in a top-down recursive divide-and-conquer

manner
 At start, all the training examples are at the root

 Attributes are categorical (if continuous-valued, they are

discretized in advance)
 Examples are partitioned recursively based on selected attributes

 Test attributes are selected on the basis of a heuristic or statistical

measure (e.g., information gain)


 Conditions for stopping partitioning
 All samples for a given node belong to the same class

 There are no remaining attributes for further partitioning –

majority voting is employed for classifying the leaf


 There are no samples left

April 12, 2023 Data Mining: Concepts and Techniques 14


Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 S contains s_i tuples of class C_i for i = {1, …, m}
 Information (expected bits) required to classify an arbitrary tuple:

   I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

 Entropy of attribute A with values {a_1, a_2, …, a_v}:

   E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

 Information gained by branching on attribute A:

   Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

April 12, 2023 Data Mining: Concepts and Techniques 15


Attribute Selection by Information Gain Computation
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
 I(p, n) = I(9, 5) = 0.940
 Compute the entropy for age:

   age      p_i   n_i   I(p_i, n_i)
   <=30     2     3     0.971
   31…40    4     0     0
   >40      3     2     0.971

   E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

   where \frac{5}{14} I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

 Hence Gain(age) = I(p, n) - E(age) = 0.246
 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
 (The training data is the buys_computer table from the earlier “Training Dataset” slide.)

April 12, 2023 Data Mining: Concepts and Techniques 16
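Note: the entropy and gain figures above can be reproduced with a short Python sketch over the buys_computer table (an illustration, not the textbook's algorithm):

from math import log2
from collections import Counter

# The 14-tuple buys_computer training set from the slides:
# (age, income, student, credit_rating, buys_computer)
data = [
    ('<=30', 'high', 'no', 'fair', 'no'),           ('<=30', 'high', 'no', 'excellent', 'no'),
    ('31..40', 'high', 'no', 'fair', 'yes'),         ('>40', 'medium', 'no', 'fair', 'yes'),
    ('>40', 'low', 'yes', 'fair', 'yes'),            ('>40', 'low', 'yes', 'excellent', 'no'),
    ('31..40', 'low', 'yes', 'excellent', 'yes'),    ('<=30', 'medium', 'no', 'fair', 'no'),
    ('<=30', 'low', 'yes', 'fair', 'yes'),           ('>40', 'medium', 'yes', 'fair', 'yes'),
    ('<=30', 'medium', 'yes', 'excellent', 'yes'),   ('31..40', 'medium', 'no', 'excellent', 'yes'),
    ('31..40', 'high', 'yes', 'fair', 'yes'),        ('>40', 'medium', 'no', 'excellent', 'no'),
]
attrs = ['age', 'income', 'student', 'credit_rating']

def info(labels):
    """I(s1,...,sm) = -sum (si/s) log2(si/s)"""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(attr_index):
    labels = [row[-1] for row in data]
    e = 0.0
    for v in set(row[attr_index] for row in data):
        subset = [row[-1] for row in data if row[attr_index] == v]
        e += len(subset) / len(data) * info(subset)      # E(A)
    return info(labels) - e                              # Gain(A) = I - E(A)

for i, a in enumerate(attrs):
    # prints roughly 0.247, 0.029, 0.152, 0.048 -- matching the slide up to rounding
    print(a, round(gain(i), 3))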
Other Attribute Selection Measures

 Gini index (CART, IBM IntelligentMiner)


 All attributes are assumed continuous-valued
 Assume there exist several possible split values for
each attribute
 May need other tools, such as clustering, to get the
possible split values
 Can be modified for categorical attributes

April 12, 2023 Data Mining: Concepts and Techniques 17


Gini Index (IBM IntelligentMiner)
 If a data set T contains examples from n classes, the gini index, gini(T), is defined as

   gini(T) = 1 - \sum_{j=1}^{n} p_j^2

 where p_j is the relative frequency of class j in T.
 If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

   gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)

 The attribute that provides the smallest gini_{split}(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
April 12, 2023 Data Mining: Concepts and Techniques 18
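Note: a minimal Python sketch of the gini and gini_split computations; the class labels of the two subsets below are hypothetical.

from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2"""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)"""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels_T1 = ['yes'] * 6 + ['no'] * 1      # hypothetical split: subset T1
labels_T2 = ['yes'] * 3 + ['no'] * 4      # hypothetical split: subset T2
print(gini(labels_T1 + labels_T2), gini_split(labels_T1, labels_T2))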
Extracting Classification Rules from Trees

 Represent the knowledge in the form of IF-THEN rules


 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
 Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

April 12, 2023 Data Mining: Concepts and Techniques 19


Avoid Overfitting in Classification

 Overfitting: An induced tree may overfit the training data


 Too many branches, some may reflect anomalies due
to noise or outliers
 Poor accuracy for unseen samples

 Two approaches to avoid overfitting


 Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
 Difficult to choose an appropriate threshold

 Postpruning: Remove branches from a “fully grown”


tree—get a sequence of progressively pruned trees
 Use a set of data different from the training data to

decide which is the “best pruned tree”


April 12, 2023 Data Mining: Concepts and Techniques 20
Approaches to Determine the Final
Tree Size
 Separate training (2/3) and testing (1/3) sets
 Use cross validation, e.g., 10-fold cross validation
 Use all the data for training
 but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may
improve the entire distribution
 Use minimum description length (MDL) principle
 halting growth of the tree when the encoding is
minimized

April 12, 2023 Data Mining: Concepts and Techniques 21


Enhancements to basic decision
tree induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
April 12, 2023 Data Mining: Concepts and Techniques 22
Classification in Large Databases

 Classification—a classical problem extensively studied by


statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification
methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
April 12, 2023 Data Mining: Concepts and Techniques 23
Scalable Decision Tree Induction
Methods in Data Mining Studies
 SLIQ (EDBT’96 — Mehta et al.)
 builds an index for each attribute and only class list and

the current attribute list reside in memory


 SPRINT (VLDB’96 — J. Shafer et al.)
 constructs an attribute list data structure

 PUBLIC (VLDB’98 — Rastogi & Shim)


 integrates tree splitting and tree pruning: stop growing

the tree earlier


 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 separates the scalability aspects from the criteria that

determine the quality of the tree


 builds an AVC-list (attribute, value, class label)
April 12, 2023 Data Mining: Concepts and Techniques 24
Data Cube-Based Decision-Tree
Induction
 Integration of generalization with decision-tree induction
(Kamber et al’97).
 Classification at primitive concept levels
 E.g., precise temperature, humidity, outlook, etc.
 Low-level concepts, scattered classes, bushy
classification-trees
 Semantic interpretation problems.
 Cube-based multi-level classification
 Relevance analysis at multi-levels.
 Information-gain analysis with dimension + level.

April 12, 2023 Data Mining: Concepts and Techniques 25


Presentation of Classification Results

April 12, 2023 Data Mining: Concepts and Techniques 26


Visualization of a Decision Tree in
SGI/MineSet 3.0

April 12, 2023 Data Mining: Concepts and Techniques 27


Interactive Visual Mining by
Perception-Based Classification (PBC)

April 12, 2023 Data Mining: Concepts and Techniques 28


Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 29
Bayesian Classification: Why?

 Probabilistic learning: Calculate explicit probabilities for


hypothesis, among the most practical approaches to
certain types of learning problems
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
 Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
April 12, 2023 Data Mining: Concepts and Techniques 30
Bayesian Theorem: Basics

 Let X be a data sample whose class label is unknown


 Let H be a hypothesis that X belongs to class C
 For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
 P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
 P(X): probability that sample data is observed
 P(X|H) : probability of observing the sample X, given that
the hypothesis holds

April 12, 2023 Data Mining: Concepts and Techniques 31


Bayesian Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

   P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}

 Informally, this can be written as: posterior = likelihood × prior / evidence
 MAP (maximum a posteriori) hypothesis:

   h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\, P(h)

 Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
April 12, 2023 Data Mining: Concepts and Techniques 32
Naïve Bayes Classifier

 A simplified assumption: attributes are conditionally independent given the class:

   P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)

 The probability of observing, say, the two elements y1 and y2 together, given class C, is the product of the probabilities of each element taken separately given that class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C)
 No dependence relation between attributes
 Greatly reduces the computation cost: only the class distributions need to be counted
 Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)
April 12, 2023 Data Mining: Concepts and Techniques 33
Training dataset

Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

April 12, 2023 Data Mining: Concepts and Techniques 34


Naïve Bayesian Classifier: Example
 Compute P(X/Ci) for each class

P(age=“<30” | buys_computer=“yes”) = 2/9=0.222


P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):        P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                P(X|buys_computer = “no”)  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci):  P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
                P(X|buys_computer = “no”)  * P(buys_computer = “no”)  = 0.007

X belongs to class “buys_computer = yes”

April 12, 2023 Data Mining: Concepts and Techniques 35
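Note: the hand calculation above can be checked with a short Python sketch over the same 14 tuples, using raw relative frequencies (no smoothing); it is an illustration rather than a full classifier.

from collections import Counter

# buys_computer training data: (age, income, student, credit_rating, class)
rows = [
    ('<=30', 'high', 'no', 'fair', 'no'),           ('<=30', 'high', 'no', 'excellent', 'no'),
    ('31..40', 'high', 'no', 'fair', 'yes'),         ('>40', 'medium', 'no', 'fair', 'yes'),
    ('>40', 'low', 'yes', 'fair', 'yes'),            ('>40', 'low', 'yes', 'excellent', 'no'),
    ('31..40', 'low', 'yes', 'excellent', 'yes'),    ('<=30', 'medium', 'no', 'fair', 'no'),
    ('<=30', 'low', 'yes', 'fair', 'yes'),           ('>40', 'medium', 'yes', 'fair', 'yes'),
    ('<=30', 'medium', 'yes', 'excellent', 'yes'),   ('31..40', 'medium', 'no', 'excellent', 'yes'),
    ('31..40', 'high', 'yes', 'fair', 'yes'),        ('>40', 'medium', 'no', 'excellent', 'no'),
]
classes = Counter(r[-1] for r in rows)               # class counts: yes = 9, no = 5

def p_x_given_c(x, c):
    """P(X|Ci) = prod_k P(x_k|Ci) under the class-conditional independence assumption."""
    in_class = [r for r in rows if r[-1] == c]
    p = 1.0
    for k, v in enumerate(x):
        p *= sum(1 for r in in_class if r[k] == v) / len(in_class)
    return p

x = ('<=30', 'medium', 'yes', 'fair')
for c in classes:
    score = p_x_given_c(x, c) * classes[c] / len(rows)   # P(X|Ci) * P(Ci)
    print(c, round(score, 3))                            # yes: 0.028, no: 0.007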


Naïve Bayesian Classifier: Comments

 Advantages :
 Easy to implement

 Good results obtained in most of the cases

 Disadvantages
 Assumption: class conditional independence , therefore loss of

accuracy
 Practically, dependencies exist among variables

 E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc


 Dependencies among these cannot be modeled by Naïve Bayesian

Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks

April 12, 2023 Data Mining: Concepts and Techniques 36


Bayesian Networks

 Bayesian belief network allows a subset of the variables


conditionally independent
 A graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution

Nodes: random variables
Links: dependency
[Example graph with nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P; the graph has no loops or cycles]

April 12, 2023 Data Mining: Concepts and Techniques 37


Bayesian Belief Network: An Example

[Network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents (FamilyHistory, Smoker):

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC       0.8      0.5       0.7       0.1
~LC      0.2      0.5       0.3       0.9

Bayesian belief networks:   P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))

April 12, 2023 Data Mining: Concepts and Techniques 38


Learning Bayesian Networks

 Several cases
 Given both the network structure and all variables

observable: learn only the CPTs


 Network structure known, some hidden variables:

method of gradient descent, analogous to neural


network learning
 Network structure unknown, all variables observable:

search through the model space to reconstruct graph


topology
 Unknown structure, all hidden variables: no good

algorithms known for this purpose


 D. Heckerman, Bayesian networks for data mining
April 12, 2023 Data Mining: Concepts and Techniques 39
Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 40
Classification

 Classification:
 predicts categorical class labels

 Typical Applications
 {credit history, salary}-> credit approval ( Yes/No)

 {Temp, Humidity} --> Rain (Yes/No)

 Mathematically: a classifier is a function h : X → Y with y = h(x), where x ∈ X ⊆ {0,1}^n and y ∈ Y = {0,1}
April 12, 2023 Data Mining: Concepts and Techniques 41
Linear Classification
 Binary classification problem
 [Scatter plot: points of class ‘x’ above a separating line, points of class ‘o’ below it]
 The data above the red line belongs to class ‘x’
 The data below the red line belongs to class ‘o’
 Examples – SVM, Perceptron, probabilistic classifiers

April 12, 2023 Data Mining: Concepts and Techniques 42


Discriminative Classifiers

 Advantages
 prediction accuracy is generally high

 (as compared to Bayesian methods – in general)


 robust, works when training examples contain errors
 fast evaluation of the learned target function
 (Bayesian networks are normally slow)
 Criticism
 long training time

 difficult to understand the learned function (weights)

 (Bayesian networks can be used easily for pattern discovery)


 not easy to incorporate domain knowledge
 (easy in the form of priors on the data or distributions)

April 12, 2023 Data Mining: Concepts and Techniques 43


Neural Networks

 Analogy to Biological Systems (Indeed a great example


of a good learning system)
 Massive Parallelism allowing for computational
efficiency
 The first learning algorithm came in 1959 (Rosenblatt)
who suggested that if a target output value is provided
for a single neuron with fixed inputs, one can
incrementally change weights to learn to produce these
outputs using the perceptron learning rule

April 12, 2023 Data Mining: Concepts and Techniques 44


A Neuron

[Diagram: input vector x = (x_0, x_1, …, x_n) and weight vector w = (w_0, w_1, …, w_n) feed a weighted sum with bias -\mu_k, which passes through an activation function f to produce the output y]

 The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
April 12, 2023 Data Mining: Concepts and Techniques 45
A Neuron

[Same neuron diagram as on the previous slide: input vector x, weight vector w, weighted sum with bias -\mu_k, activation function f, output y]

For example:   y = sign\left( \sum_{i=0}^{n} w_i x_i - \mu_k \right)
April 12, 2023 Data Mining: Concepts and Techniques 46
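Note: a minimal Python/numpy sketch of the computation above; the weights, inputs, and bias value mu_k are made-up numbers.

import numpy as np

def neuron(x, w, mu_k):
    """y = sign(sum_i w_i * x_i - mu_k): weighted sum, bias, then a hard-threshold activation."""
    return np.sign(np.dot(w, x) - mu_k)

x = np.array([1.0, 0.5, -0.2])      # input vector
w = np.array([0.4, 0.3, 0.9])       # weight vector
print(neuron(x, w, mu_k=0.3))       # 1.0, since the weighted sum exceeds the bias here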
Multi-Layer Perceptron

[Feed-forward network: the input vector x_i enters the input nodes, which feed the hidden nodes through weights w_{ij}, which in turn feed the output nodes that produce the output vector]

   I_j = \sum_i w_{ij} O_i + \theta_j
   O_j = \frac{1}{1 + e^{-I_j}}
   Err_j = O_j (1 - O_j)(T_j - O_j)                    (for an output node)
   Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}         (for a hidden node)
   w_{ij} = w_{ij} + (l)\, Err_j O_i
   \theta_j = \theta_j + (l)\, Err_j
Network Training

 The ultimate objective of training


 obtain a set of weights that makes almost all the

tuples in the training data classified correctly


 Steps
 Initialize weights with random values

 Feed the input tuples into the network one by one

 For each unit

 Compute the net input to the unit as a linear combination


of all the inputs to the unit
 Compute the output value using the activation function
 Compute the error
 Update the weights and the bias
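Note: a minimal numpy sketch of the training steps above for a one-hidden-layer network with sigmoid units and learning rate l; the network size, random seed, and XOR-style toy data are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # toy inputs
T = np.array([[0], [1], [1], [0]], float)               # toy targets
W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)          # input -> hidden weights and biases
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)          # hidden -> output weights and biases
l = 0.5                                                  # learning rate

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))                      # O_j = 1 / (1 + e^{-I_j})

for epoch in range(5000):
    for x, t in zip(X, T):                               # feed the tuples one by one
        O1 = sigmoid(x @ W1 + b1)                        # hidden outputs
        O2 = sigmoid(O1 @ W2 + b2)                       # network outputs
        err2 = O2 * (1 - O2) * (t - O2)                  # Err_j for the output unit
        err1 = O1 * (1 - O1) * (err2 @ W2.T)             # Err_j for the hidden units
        W2 += l * np.outer(O1, err2); b2 += l * err2     # w_ij += l * Err_j * O_i
        W1 += l * np.outer(x, err1);  b1 += l * err1
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))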
Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 50
SVM – Support Vector Machines

[Figure: the same training data separated once by a hyperplane with a small margin and once by a hyperplane with a large margin; the points lying on the margin are the support vectors]
SVM – Cont.

 Linear Support Vector Machine
 Given a set of points x_i ∈ R^n with labels y_i ∈ {-1, +1}, the SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b determines its offset from the origin, such that

   y_i (x_i \cdot w + b) \ge 1, \quad i = 1, \ldots, N

 x – feature vector, b – bias, y – class label; under this normalization the margin is 2/||w||, so maximizing the margin means minimizing ||w||

April 12, 2023 Data Mining: Concepts and Techniques 52


SVM – Cont.

 What if the data is not linearly separable?
 Project the data to a high-dimensional space where it is linearly separable, and then use a linear SVM (using kernels)
 [Figure: a 1-D data set with labels +, -, + at the points -1, 0, +1 is not linearly separable; after mapping it into 2-D (points such as (0,0), (1,0), (0,1)) the two classes become linearly separable]
April 12, 2023 Data Mining: Concepts and Techniques 53
Non-Linear SVM

Classification using SVM (w, b): decide the class of x_i from the sign of

   x_i \cdot w + b \;\gtrless\; 0

In the non-linear case we can see this as

   K(x_i, w) + b \;\gtrless\; 0

Kernel – can be thought of as doing a dot product in some high-dimensional space

April 12, 2023 Data Mining: Concepts and Techniques 54
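Note: to make the "dot product in some high-dimensional space" remark concrete, the sketch below (illustrative, not part of the slides) checks that the degree-2 polynomial kernel K(x, z) = (x · z)^2 on 2-D inputs equals an ordinary dot product after the explicit feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).

import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(x, z):
    return np.dot(x, z) ** 2                 # kernel evaluated in the original space

x = np.array([1.0, 2.0]); z = np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))       # both give the same value (1.0)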


April 12, 2023 Data Mining: Concepts and Techniques 55
Results

April 12, 2023 Data Mining: Concepts and Techniques 56


SVM vs. Neural Network

 SVM
   Relatively new concept
   Nice generalization properties
   Hard to learn – learned in batch mode using quadratic programming techniques
   Using kernels can learn very complex functions
 Neural Network
   Quite old
   Generalizes well but doesn’t have a strong mathematical foundation
   Can easily be learned in incremental fashion
   To learn complex functions – use multilayer perceptron (not that trivial)

April 12, 2023 Data Mining: Concepts and Techniques 57


SVM Related Links

 http://svm.dcs.rhbnc.ac.uk/
 http://www.kernel-machines.org/
 C. J. C. Burges. A Tutorial on Support Vector Machines for
Pattern Recognition. Knowledge Discovery and Data
Mining, 2(2), 1998.
 SVMlight – Software (in C)
http://ais.gmd.de/~thorsten/svm_light
 BOOK: An Introduction to Support Vector Machines
N. Cristianini and J. Shawe-Taylor
Cambridge University Press
April 12, 2023 Data Mining: Concepts and Techniques 58
Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 59
Association-Based Classification

 Several methods for association-based classification


 ARCS: Quantitative association mining and clustering

of association rules (Lent et al’97)


 It beats C4.5 in (mainly) scalability and also accuracy
 Associative classification: (Liu et al’98)
 It mines high support and high confidence rules in the form of
“cond_set => y”, where y is a class label
 CAEP (Classification by aggregating emerging patterns)
(Dong et al’99)
 Emerging patterns (EPs): the itemsets whose support
increases significantly from one class to another
 Mine EPs based on minimum support and growth rate

April 12, 2023 Data Mining: Concepts and Techniques 60


Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 61
Other Classification Methods

 k-nearest neighbor classifier


 case-based reasoning
 Genetic algorithm
 Rough set approach
 Fuzzy set approaches

April 12, 2023 Data Mining: Concepts and Techniques 62


Instance-Based Methods
 Instance-based learning:
 Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
 Typical approaches
 k-nearest neighbor approach

 Instances represented as points in a Euclidean

space.
 Locally weighted regression

 Constructs local approximation

 Case-based reasoning

 Uses symbolic representations and knowledge-

based inference
April 12, 2023 Data Mining: Concepts and Techniques 63
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space.
 The nearest neighbors are defined in terms of Euclidean distance.
 The target function could be discrete- or real-valued.
 For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
 [Figure: query point x_q among ‘+’ and ‘-’ training points]
April 12, 2023
Data Mining: Concepts and Techniques 64
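Note: a minimal k-NN sketch for a discrete-valued target; the toy training points and the choice k = 3 are made up for illustration.

import numpy as np
from collections import Counter

train_X = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
train_y = ['-', '-', '+', '+', '+']

def knn_classify(xq, k=3):
    """Return the most common class among the k training examples nearest to xq (Euclidean distance)."""
    dist = np.linalg.norm(train_X - xq, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

print(knn_classify(np.array([2.5, 2.5])))    # '+': the three nearest points are all positive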
Discussion on the k-NN Algorithm
 The k-NN algorithm for continuous-valued target functions
 Calculate the mean values of the k nearest neighbors

 Distance-weighted nearest neighbor algorithm


 Weight the contribution of each of the k neighbors

according to their distance to the query point xq


 giving greater weight to closer neighbors, e.g.,  w \equiv \frac{1}{d(x_q, x_i)^2}
 Similarly, for real-valued target functions

 Robust to noisy data by averaging k-nearest neighbors


 Curse of dimensionality: distance between neighbors could
be dominated by irrelevant attributes.
 To overcome it, axes stretch or elimination of the least

relevant attributes.
April 12, 2023 Data Mining: Concepts and Techniques 65
Case-Based Reasoning

 Also uses: lazy evaluation + analyze similar instances


 Difference: Instances are not “points in a Euclidean space”
 Example: Water faucet problem in CADET (Sycara et al’92)
 Methodology
 Instances represented by rich symbolic descriptions
(e.g., function graphs)
 Multiple retrieved cases may be combined

 Tight coupling between case retrieval, knowledge-based


reasoning, and problem solving
 Research issues
 Indexing based on syntactic similarity measure, and
when failure, backtracking, and adapting to additional
cases
April 12, 2023 Data Mining: Concepts and Techniques 66
Remarks on Lazy vs. Eager Learning
 Instance-based learning: lazy evaluation
 Decision-tree and Bayesian classification: eager evaluation
 Key differences
 Lazy method may consider query instance xq when deciding how to

generalize beyond the training data D


 Eager method cannot since they have already chosen global

approximation when seeing the query


 Efficiency: Lazy - less time training but more time predicting
 Accuracy
 Lazy method effectively uses a richer hypothesis space since it uses

many local linear functions to form its implicit global approximation


to the target function
 Eager: must commit to a single hypothesis that covers the entire

instance space
April 12, 2023 Data Mining: Concepts and Techniques 67
Genetic Algorithms
 GA: based on an analogy to biological evolution
 Each rule is represented by a string of bits
 An initial population is created consisting of randomly
generated rules
 e.g., IF A1 and Not A2 then C2 can be encoded as 100

 Based on the notion of survival of the fittest, a new


population is formed, consisting of the fittest rules and their offspring
 The fitness of a rule is represented by its classification
accuracy on a set of training examples
 Offspring are generated by crossover and mutation

April 12, 2023 Data Mining: Concepts and Techniques 68


Rough Set Approach
 Rough sets are used to approximately or “roughly”
define equivalent classes
 A rough set for a given class C is approximated by two
sets: a lower approximation (certain to be in C) and an
upper approximation (cannot be described as not
belonging to C)
 Finding the minimal subsets (reducts) of attributes (for
feature reduction) is NP-hard but a discernibility matrix
is used to reduce the computation intensity

April 12, 2023 Data Mining: Concepts and Techniques 69


Fuzzy Set
Approaches

 Fuzzy logic uses truth values between 0.0 and 1.0 to


represent the degree of membership (such as using
fuzzy membership graph)
 Attribute values are converted to fuzzy values
 e.g., income is mapped into the discrete categories
{low, medium, high} with fuzzy values calculated
 For a given new sample, more than one fuzzy value may
apply
 Each applicable rule contributes a vote for membership
in the categories
 Typically, the truth values for each predicted category
are summed
April 12, 2023 Data Mining: Concepts and Techniques 70
Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 71
What Is Prediction?

 Prediction is similar to classification


 First, construct a model
 Second, use model to predict unknown value
 Major method for prediction is regression
 Linear and multiple regression
 Non-linear regression
 Prediction is different from classification
 Classification refers to predict categorical class label
 Prediction models continuous-valued functions

April 12, 2023 Data Mining: Concepts and Techniques 72


Predictive Modeling in Databases

 Predictive modeling: Predict data values or construct


generalized linear models based on the database data.
 One can only predict value ranges or category distributions
 Method outline:
 Minimal generalization
 Attribute relevance analysis
 Generalized linear model construction
 Prediction
 Determine the major factors which influence the prediction
 Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
 Multi-level prediction: drill-down and roll-up analysis
April 12, 2023 Data Mining: Concepts and Techniques 73
Regression Analysis and Log-Linear Models in Prediction
 Linear regression: Y = \alpha + \beta X
 Two parameters, \alpha and \beta, specify the line and are to be estimated by using the data at hand
 Use the least squares criterion on the known values of Y_1, Y_2, …, X_1, X_2, …
 Multiple regression: Y = b_0 + b_1 X_1 + b_2 X_2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 The multi-way table of joint probabilities is approximated by a product of lower-order tables
 Probability: p(a, b, c, d) = \alpha_{ab}\, \beta_{ac}\, \chi_{ad}\, \delta_{bcd}
April 12, 2023 Data Mining: Concepts and Techniques 74
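Note: a minimal least-squares sketch for Y = alpha + beta X; the five data points are made up, chosen to lie near Y = 2X.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.9, 4.1, 6.0, 8.2, 9.9])

# Least squares estimates: beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X)
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()
print(round(alpha, 3), round(beta, 3))        # roughly 0.0 and 2.0 for this data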
Locally Weighted Regression

 Construct an explicit approximation to f over a local region surrounding query instance x_q.
 Locally weighted linear regression:
 The target function f is approximated near x_q using the linear function:

   \hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)

 Minimize the squared error over the k nearest neighbors of x_q, with a distance-decreasing weight K:

   E(x_q) = \frac{1}{2} \sum_{x \in kNN(x_q)} \big( f(x) - \hat{f}(x) \big)^2 \, K\big(d(x_q, x)\big)

 The gradient descent training rule:

   \Delta w_j = \eta \sum_{x \in kNN(x_q)} K\big(d(x_q, x)\big)\, \big( f(x) - \hat{f}(x) \big)\, a_j(x)

 In most cases, the target function is approximated by a constant, linear, or quadratic function.
April 12, 2023 Data Mining: Concepts and Techniques 75
Prediction: Numerical Data

April 12, 2023 Data Mining: Concepts and Techniques 76


Prediction: Categorical Data

April 12, 2023 Data Mining: Concepts and Techniques 77


Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 78
Classification Accuracy: Estimating Error
Rates
 Partition: Training-and-testing
 use two independent data sets, e.g., training set
(2/3), test set(1/3)
 used for data set with large number of samples
 Cross-validation
 divide the data set into k subsamples
 use k-1 subsamples as training data and one sub-
sample as test data—k-fold cross-validation
 for data set with moderate size
 Bootstrapping (leave-one-out)
 for small size data
April 12, 2023 Data Mining: Concepts and Techniques 79
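Note: a minimal sketch of k-fold cross-validation; the "classifier" here is a trivial majority-class placeholder and the random data set is made up, so only the splitting and error-averaging logic is the point.

import numpy as np

def k_fold_error(X, y, k, train_fn, predict_fn):
    """Split the data into k folds; each fold is held out once as the test set."""
    idx = np.arange(len(X))
    np.random.default_rng(0).shuffle(idx)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        errors.append(np.mean(predict_fn(model, X[test]) != y[test]))
    return np.mean(errors)                               # average error over the k folds

# Placeholder "classifier": always predict the majority class of the training fold.
train_majority = lambda X, y: np.bincount(y).argmax()
predict_majority = lambda model, X: np.full(len(X), model)
X = np.random.default_rng(1).normal(size=(30, 2)); y = np.random.default_rng(1).integers(0, 2, 30)
print(k_fold_error(X, y, k=10, train_fn=train_majority, predict_fn=predict_majority))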
Bagging and Boosting

 General idea
   Training data          → classification method (CM) → Classifier C
   Altered training data  → CM                          → Classifier C1
   Altered training data  → CM                          → Classifier C2
   ……
   Aggregation            →                               Classifier C*

April 12, 2023 Data Mining: Concepts and Techniques 80


Bagging

 Given a set S of s samples


 Generate a bootstrap sample T from S. Cases in S may not
appear in T or may appear more than once.
 Repeat this sampling procedure, getting a sequence of k
independent training sets
 A corresponding sequence of classifiers C1,C2,…,Ck is
constructed for each of these training sets, by using the
same classification algorithm
 To classify an unknown sample X,let each classifier predict
or vote
 The Bagged Classifier C* counts the votes and assigns X
to the class with the “most” votes
April 12, 2023 Data Mining: Concepts and Techniques 81
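Note: a minimal sketch of the bagging procedure above; the base classifier is a simple 1-nearest-neighbor rule and the toy data are made up, so this only illustrates bootstrap sampling plus voting.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)); y = np.where(X[:, 0] + X[:, 1] > 0, 1, 0)   # toy data

def one_nn(train_X, train_y, xq):
    """Base classifier: label of the single nearest training point."""
    return train_y[np.argmin(np.linalg.norm(train_X - xq, axis=1))]

k = 5                                                          # number of bootstrap samples
samples = [rng.integers(0, len(X), len(X)) for _ in range(k)]  # sample S with replacement

def bagged_predict(xq):
    votes = [one_nn(X[s], y[s], xq) for s in samples]          # each classifier votes
    return Counter(votes).most_common(1)[0][0]                 # class with the most votes

print(bagged_predict(np.array([0.5, 0.5])), bagged_predict(np.array([-1.0, -1.0])))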
Boosting Technique — Algorithm

 Assign every example an equal weight 1/N


 For t = 1, 2, …, T Do
 Obtain a hypothesis (classifier) h(t) under w(t)
 Calculate the error of h(t) and re-weight the examples
based on the error . Each classifier is dependent on the
previous ones. Samples that are incorrectly predicted
are weighted more heavily
 Normalize w
(t+1) to sum to 1 (weights assigned to

different classifiers sum to 1)


 Output a weighted sum of all the hypothesis, with each
hypothesis weighted according to its accuracy on the
training set

April 12, 2023 Data Mining: Concepts and Techniques 82
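Note: a minimal AdaBoost-style sketch of the re-weighting loop above, using decision stumps on a made-up 1-D data set; the stump learner and the stopping test are simplifications for illustration only.

import numpy as np

X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1])           # labels in {-1, +1}
w = np.full(len(X), 1.0 / len(X))                   # assign every example an equal weight 1/N
hypotheses = []

for t in range(5):
    # Pick the threshold/direction stump with the smallest weighted error under w
    best = min(((thr, s) for thr in X for s in (1, -1)),
               key=lambda p: np.sum(w[np.where(X < p[0], p[1], -p[1]) != y]))
    thr, s = best
    pred = np.where(X < thr, s, -s)
    err = np.sum(w[pred != y])
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)           # hypothesis weight based on its accuracy
    w *= np.exp(-alpha * y * pred)                  # misclassified samples are weighted more heavily
    w /= w.sum()                                    # normalize w to sum to 1
    hypotheses.append((alpha, thr, s))

def boosted(x):
    """Weighted sum of all hypotheses, each weighted by its alpha."""
    return np.sign(sum(a * (s if x < thr else -s) for a, thr, s in hypotheses))

print([boosted(x) for x in X])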


Bagging and Boosting

 Experiments with a New Boosting Algorithm, Freund & Schapire (AdaBoost)
 Bagging Predictors, Breiman
 Boosting Naïve Bayesian Learning on a large subset of MEDLINE, W. Wilbur

April 12, 2023 Data Mining: Concepts and Techniques 83


Chapter 7. Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification by Support Vector Machines (SVM)
 Classification based on concepts from association rule
mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 84
Summary

 Classification is an extensively studied problem (mainly in


statistics, machine learning & neural networks)
 Classification is probably one of the most widely used
data mining techniques with a lot of extensions
 Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
 Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc..

April 12, 2023 Data Mining: Concepts and Techniques 85


References (1)

 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
 C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998.
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for
scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD'95), pages 39-44, Montreal, Canada, August 1995.
 U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994
AAAI Conf., pages 601-606, AAAI Press, 1994.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages
416-427, New York, NY, August 1998.
 J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction . In SIGMOD'99 , Philadelphia, Pennsylvania, 1999

April 12, 2023 Data Mining: Concepts and Techniques 86


References (2)

 M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree
induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research
Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
 B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. Proc.
1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98) New York, NY, Aug.
1998.
 W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, , Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA,
Nov. 2001.
 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
pages 118-159. Blackwell Business, Cambridge Massechusetts, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
(EDBT'96), Avignon, France, March 1996.

April 12, 2023 Data Mining: Concepts and Techniques 87


References (3)

 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.


 S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary
Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
 J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial
Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
 R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and
pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August
1998.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept.
1996.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufmann, 1991.
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.

April 12, 2023 Data Mining: Concepts and Techniques 88


www.cs.uiuc.edu/~hanj

Thank you !!!


April 12, 2023 Data Mining: Concepts and Techniques 89
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 8 —

©Jiawei Han and Micheline Kamber


Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
April 12, 2023 Data Mining: Concepts and Techniques 1
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 2
What is Cluster Analysis?

 Cluster: a collection of data objects


 Similar to one another within the same cluster

 Dissimilar to the objects in other clusters

 Cluster analysis
 Grouping a set of data objects into clusters

 Clustering is unsupervised classification: no predefined


classes
 Typical applications
 As a stand-alone tool to get insight into data

distribution
 As a preprocessing step for other algorithms
General Applications of Clustering

 Pattern Recognition
 Spatial Data Analysis
 create thematic maps in GIS by clustering feature
spaces
 detect spatial clusters and explain them in spatial data
mining
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification

 Cluster Weblog data to discover groups of similar


access patterns

April 12, 2023 Data Mining: Concepts and Techniques 4


Examples of Clustering Applications

 Marketing: Help marketers discover distinct groups in


their customer bases, and then use this knowledge to
develop targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location
 Earthquake studies: observed earthquake epicenters
should be clustered along continental faults
April 12, 2023 Data Mining: Concepts and Techniques 5
What Is Good Clustering?

 A good clustering method will produce high quality


clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.

April 12, 2023 Data Mining: Concepts and Techniques 6


Requirements of Clustering in Data
Mining

 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

April 12, 2023 Data Mining: Concepts and Techniques 7


Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 8
Data Structures
 Data matrix (two modes): an n-by-p object-by-variable table

   \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & & \cdots & & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & & \cdots & & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}

 Dissimilarity matrix (one mode): an n-by-n table of pairwise dissimilarities d(i, j)

   \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}

April 12, 2023 Data Mining: Concepts and Techniques 9


Measure the Quality of Clustering

 Dissimilarity/Similarity metric: Similarity is expressed in


terms of a distance function, which is typically metric:
d(i, j)
 There is a separate “quality” function that measures the
“goodness” of a cluster.
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.
April 12, 2023 Data Mining: Concepts and Techniques 10
Type of data in clustering analysis

 Interval-scaled variables:
 Binary variables:
 Nominal, ordinal, and ratio variables:
 Variables of mixed types:

April 12, 2023 Data Mining: Concepts and Techniques 11


Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:
   s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)

where

   m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)

 Calculate the standardized measurement (z-score)

   z_{if} = \frac{x_{if} - m_f}{s_f}
 Using mean absolute deviation is more robust than using
standard deviation
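
A minimal sketch of this standardization (Python/NumPy; the sample values are made up for illustration):

import numpy as np

def standardize(x):
    # z-scores of one variable f using the mean absolute deviation s_f
    m_f = x.mean()                     # m_f: mean of the variable
    s_f = np.abs(x - m_f).mean()       # mean absolute deviation (not the standard deviation)
    return (x - m_f) / s_f             # z_if = (x_if - m_f) / s_f

z = standardize(np.array([2.0, 4.0, 6.0, 12.0]))   # -> [-1.33, -0.67, 0., 2.]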

April 12, 2023 Data Mining: Concepts and Techniques 12


Similarity and Dissimilarity Between
Objects

 Distances are normally used to measure the similarity or


dissimilarity between two data objects
 Some popular ones include: Minkowski distance:
   d(i, j) = \sqrt[q]{\, |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \,}

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are
two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance

   d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|

April 12, 2023 Data Mining: Concepts and Techniques 13


Similarity and Dissimilarity Between
Objects (Cont.)

 If q = 2, d is Euclidean distance:
   d(i, j) = \sqrt{\, |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 \,}
 Properties
 d(i, j) \ge 0
 d(i, i) = 0
 d(i, j) = d(j, i)
 d(i, j) \le d(i, k) + d(k, j)
 Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
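
A small sketch of the Minkowski distance (Python/NumPy; the two vectors are made-up examples):

import numpy as np

def minkowski(i, j, q):
    # Minkowski distance between two p-dimensional objects i and j
    return (np.abs(i - j) ** q).sum() ** (1.0 / q)

i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 6.0, 3.0])
minkowski(i, j, 1)   # q = 1: Manhattan distance -> 7.0
minkowski(i, j, 2)   # q = 2: Euclidean distance -> 5.0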

April 12, 2023 Data Mining: Concepts and Techniques 14


Binary Variables
 A contingency table for binary data
                 Object j
                  1     0    sum
   Object i   1   a     b    a+b
              0   c     d    c+d
            sum  a+c   b+d    p

 Simple matching coefficient (invariant, if the binary
variable is symmetric):

   d(i, j) = \frac{b + c}{a + b + c + d}

 Jaccard coefficient (noninvariant if the binary variable is
asymmetric):

   d(i, j) = \frac{b + c}{a + b + c}
April 12, 2023 Data Mining: Concepts and Techniques 15
Dissimilarity between Binary
Variables
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

 gender is a symmetric attribute


 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
April 12, 2023 Data Mining: Concepts and Techniques 16
Nominal Variables

 A generalization of the binary variable in that it can take


more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
 m: # of matches, p: total # of variables
   d(i, j) = \frac{p - m}{p}

 Method 2: use a large number of binary variables


 creating a new binary variable for each of the M
nominal states

April 12, 2023 Data Mining: Concepts and Techniques 17


Ordinal Variables

 An ordinal variable can be discrete or continuous


 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace xif by their rank rif {1,..., M f }

 map the range of each variable onto [0, 1] by replacing


i-th object in the f-th variable by
rif 1
zif 
M f 1
 compute the dissimilarity using methods for interval-
scaled variables

April 12, 2023 Data Mining: Concepts and Techniques 18


Ratio-Scaled Variables

 Ratio-scaled variable: a positive measurement on a


nonlinear scale, approximately at exponential scale,
such as Ae^{Bt} or Ae^{-Bt}
 Methods:
 treat them like interval-scaled variables—not a good
choice! (why?—the scale can be distorted)
 apply logarithmic transformation
   y_{if} = \log(x_{if})
 treat them as continuous ordinal data and treat their rank
as interval-scaled
April 12, 2023 Data Mining: Concepts and Techniques 19
Variables of Mixed Types

 A database may contain all the six types of variables


 symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
 One may use a weighted formula to combine their
effects:

   d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

 f is binary or nominal:
d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
 compute ranks r_{if}, set z_{if} = \frac{r_{if} - 1}{M_f - 1},
and treat z_{if} as interval-scaled

April 12, 2023 Data Mining: Concepts and Techniques 20


Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 21
Major Clustering Approaches

 Partitioning algorithms: Construct various partitions and


then evaluate them by some criterion
 Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
 Grid-based: based on a multiple-level granularity structure
 Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of the data to
the given model

April 12, 2023 Data Mining: Concepts and Techniques 22


Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 23
Partitioning Algorithms: Basic Concept

 Partitioning method: Construct a partition of a database D


of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by
the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman
& Rousseeuw’87): Each cluster is represented by one of
the objects in the cluster
April 12, 2023 Data Mining: Concepts and Techniques 24
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four


steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when no more new
assignments are made
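
A minimal sketch of these steps (Python/NumPy; random initialization and the convergence test are illustrative assumptions, not part of the slide):

import numpy as np

def k_means(X, k, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # arbitrary initial seed points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest seed point
        labels = np.linalg.norm(X[:, None] - centers[None, :], axis=-1).argmin(axis=1)
        # recompute seed points as the centroids of the current partition
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                   # stop: no more new assignments
            break
        centers = new_centers
    return centers, labels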
April 12, 2023 Data Mining: Concepts and Techniques 25
The K-Means Clustering Method

 Example
(Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; repeat updating and reassigning until no assignment changes.)
April 12, 2023 Data Mining: Concepts and Techniques 26


Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is #


clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as: deterministic
annealing and genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
April 12, 2023 Data Mining: Concepts and Techniques 27
Variations of the K-Means Method

 A few variants of the k-means which differ in

 Selection of the initial k means

 Dissimilarity calculations

 Strategies to calculate cluster means

 Handling categorical data: k-modes (Huang’98)

 Replacing means of clusters with modes

 Using new dissimilarity measures to deal with categorical objects

 Using a frequency-based method to update modes of clusters

 A mixture of categorical and numerical data: k-prototype method


April 12, 2023 Data Mining: Concepts and Techniques 28
What is the problem of k-Means Method?

 The k-means algorithm is sensitive to outliers !


 Since an object with an extremely large value may substantially
distort the distribution of the data.

 K-Medoids: Instead of taking the mean value of the object in a


cluster as a reference point, medoids can be used, which is the most
centrally located object in a cluster.

April 12, 2023 Data Mining: Concepts and Techniques 29


The K-Medoids Clustering Method

 Find representative objects, called medoids, in clusters


 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale well
for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)

April 12, 2023 Data Mining: Concepts and Techniques 30


Typical k-medoids algorithm (PAM)
(Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping (here 26); swap O and O_random only if the quality is improved; repeat the loop until no change.)
April 12, 2023 Data Mining: Concepts and Techniques 31


PAM (Partitioning Around Medoids) (1987)

 PAM (Kaufman and Rousseeuw, 1987), built in Splus


 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most
similar representative object
 repeat steps 2-3 until there is no change
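
A compact sketch of this swap-based loop (Python/NumPy; Euclidean distances and greedy acceptance of improving swaps are assumptions made for illustration):

import numpy as np

def pam(X, k, max_iter=50, seed=None):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)        # pairwise distance matrix
    medoids = list(rng.choice(len(X), size=k, replace=False))   # k representative objects
    cost = D[:, medoids].min(axis=1).sum()                      # total distance to nearest medoid
    for _ in range(max_iter):
        improved = False
        for i in range(k):                                      # selected object (medoid slot i)
            for h in range(len(X)):                             # candidate non-selected object h
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                tc_ih = D[:, trial].min(axis=1).sum() - cost    # total swapping cost TC_ih
                if tc_ih < 0:                                   # replace i by h only if TC_ih < 0
                    medoids, cost, improved = trial, cost + tc_ih, True
        if not improved:                                        # no change: stop
            break
    labels = D[:, medoids].argmin(axis=1)                       # assign to most similar medoid
    return medoids, labels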
April 12, 2023 Data Mining: Concepts and Techniques 32
PAM Clustering: Total swapping cost TC_ih = \sum_j C_jih

(Figure: the four cases for the contribution C_jih of object j when medoid i is swapped with non-medoid h, where t denotes another current medoid:
   C_jih = d(j, h) - d(j, i)
   C_jih = 0
   C_jih = d(j, t) - d(j, i)
   C_jih = d(j, h) - d(j, t) )

April 12, 2023 Data Mining: Concepts and Techniques 33
What is the problem with PAM?
 Pam is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
 Pam works efficiently for small data sets but does not
scale well for large data sets.
 O(k(n-k)2 ) for each iteration
where n is # of data, k is # of clusters
Sampling-based method:
CLARA (Clustering LARge Applications)

April 12, 2023 Data Mining: Concepts and Techniques 34


CLARA (Clustering Large Applications) (1990)
 CLARA (Kaufmann and Rousseeuw in 1990)
 Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
April 12, 2023 Data Mining: Concepts and Techniques 35
CLARANS (“Randomized” CLARA) (1994)

 CLARANS (A Clustering Algorithm based on Randomized


Search) (Ng and Han’94)
 CLARANS draws sample of neighbors dynamically
 The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
 If the local optimum is found, CLARANS starts with new
randomly selected node in search for a new local optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
April 12, 2023 Data Mining: Concepts and Techniques 36
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 37
Hierarchical Clustering

 Use distance matrix as clustering criteria. This method


does not require the number of clusters k as an input,
but needs a termination condition
(Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.)
April 12, 2023 Data Mining: Concepts and Techniques 38
AGNES (Agglomerative Nesting)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
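A minimal single-link AGNES sketch (Python/NumPy; the naive O(n^3) merge loop and the target cluster count k are illustrative choices):

import numpy as np

def agnes_single_link(X, k):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    clusters = [[i] for i in range(len(X))]            # start with one object per cluster
    while len(clusters) > k:                            # k acts as the termination condition
        best = (0, 1, np.inf)
        for a in range(len(clusters)):                  # find the pair with least dissimilarity
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])   # single link
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])                 # merge the two closest clusters
        del clusters[b]
    return clusters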

April 12, 2023 Data Mining: Concepts and Techniques 39


A Dendrogram Shows How the
Clusters are Merged Hierarchically

Decompose data objects into several levels of nested


partitioning (tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the


dendrogram at the desired level, then each connected
component forms a cluster.

April 12, 2023 Data Mining: Concepts and Techniques 40


DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own


April 12, 2023 Data Mining: Concepts and Techniques 41


More on Hierarchical Clustering Methods

 Major weakness of agglomerative clustering methods


 do not scale well: time complexity of at least O(n^2),
where n is the number of total objects


 can never undo what was done previously

 Integration of hierarchical with distance-based clustering


 BIRCH (1996): uses CF-tree and incrementally adjusts

the quality of sub-clusters


 CURE (1998): selects well-scattered points from the

cluster and then shrinks them towards the center of the


cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using

dynamic modeling
April 12, 2023 Data Mining: Concepts and Techniques 42
BIRCH (1996)

 Birch: Balanced Iterative Reducing and Clustering using


Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the
order of the data record.
April 12, 2023 Data Mining: Concepts and Techniques 43
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)


N: Number of data points
LS: \sum_{i=1}^{N} X_i   (linear sum of the N points)
SS: \sum_{i=1}^{N} X_i^2   (square sum of the N points)

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
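
The CF triple for these five points can be checked in a few lines (Python/NumPy; SS is kept per dimension, as on this slide):

import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])
N  = len(points)                   # number of data points -> 5
LS = points.sum(axis=0)            # linear sum            -> (16, 30)
SS = (points ** 2).sum(axis=0)     # square sum            -> (54, 190)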

April 12, 2023 Data Mining: Concepts and Techniques 44


CF-Tree in BIRCH

 Clustering feature:
 summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
 registers crucial measurements for computing cluster and utilizes
storage efficiently
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: specify the maximum number of children.
 threshold: max diameter of sub-clusters stored at the leaf nodes
April 12, 2023 Data Mining: Concepts and Techniques 45
CF Tree
(Figure: a CF tree with branching factor B = 7 and L = 6 entries per leaf. The root holds entries CF1 … CF6, each pointing to a child; non-leaf nodes hold entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained by prev/next pointers.)

April 12, 2023 Data Mining: Concepts and Techniques 46


CURE (Clustering Using
REpresentatives )

 CURE: proposed by Guha, Rastogi & Shim, 1998


 Stops the creation of a cluster hierarchy if a level
consists of k clusters
 Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrary
shaped clusters and avoids single-link effect
April 12, 2023 Data Mining: Concepts and Techniques 47
Drawbacks of Distance-Based
Method

 Drawbacks of square-error based clustering method


 Consider only one point as representative of a cluster
 Good only for convex shaped, similar size and density,
and if k can be reasonably estimated
April 12, 2023 Data Mining: Concepts and Techniques 48
Cure: The Algorithm

 Draw random sample s.


 Partition sample to p partitions with size s/p
 Partially cluster partitions into s/pq clusters
 Eliminate outliers
 By random sampling
 If a cluster grows too slow, eliminate it.
 Cluster partial clusters.
 Label data on disk
April 12, 2023 Data Mining: Concepts and Techniques 49
Data Partitioning and Clustering

 s = 50
 p=2  s/pq = 5
 s/p = 25
April 12, 2023 Data Mining: Concepts and Techniques 50
Cure: Shrinking Representative Points

 Shrink the multiple representative points towards the


gravity center by a fraction \alpha.
 Multiple representatives capture the shape of the cluster
April 12, 2023 Data Mining: Concepts and Techniques 51
Clustering Categorical Data: ROCK

 ROCK: Robust Clustering using linKs,


by S. Guha, R. Rastogi, K. Shim (ICDE’99).
 Use links to measure similarity/proximity
 Not distance based
 Computational complexity:

   O(n^2 + n m_m m_a + n^2 \log n)

 Basic ideas:
 Similarity function and neighbors:

   Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}

Let T_1 = \{1,2,3\}, T_2 = \{3,4,5\}:

   Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2
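
A tiny sketch of this similarity function (Python; it reproduces the example above):

def sim(t1, t2):
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)   # |T1 ∩ T2| / |T1 ∪ T2|

sim({1, 2, 3}, {3, 4, 5})                # -> 1/5 = 0.2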

April 12, 2023 Data Mining: Concepts and Techniques 52


Rock: Algorithm

 Links: The number of common neighbours for


the two points.
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
link({1,2,3}, {1,2,4}) = 3
 Algorithm
 Draw random sample

 Cluster with links

 Label data on disk

April 12, 2023 Data Mining: Concepts and Techniques 53


CHAMELEON (Hierarchical clustering
using dynamic modeling)
 CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the
clusters
 Cure ignores information about interconnectivity of the objects,
Rock ignores information about the closeness of two clusters
 A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters
April 12, 2023 Data Mining: Concepts and Techniques 54
Overall Framework of CHAMELEON
(Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters.)

April 12, 2023 Data Mining: Concepts and Techniques 55


Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 56
Density-Based Clustering Methods

 Clustering based on density (local cluster criterion),


such as density-connected points
 Major features:
 Discover clusters of arbitrary shape

 Handle noise

 One scan

 Need density parameters as termination condition

 Several interesting studies:


 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98)

April 12, 2023 Data Mining: Concepts and Techniques 57


Density Concepts
 Core object (CO)–object with at least ‘M’ objects within a
radius ‘E-neighborhood’
 Directly density reachable (DDR)–x is CO, y is in x’s ‘E-
neighborhood’
 Density reachable–there exists a chain of DDR objects
from x to y
 Density based cluster–density connected objects
maximum w.r.t. reachability

April 12, 2023 Data Mining: Concepts and Techniques 58


Density-Based Clustering: Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
 1) p belongs to NEps(q)
 2) core point condition: |N_Eps(q)| >= MinPts
(Figure: p in the Eps-neighborhood of q; MinPts = 5, Eps = 1 cm.)

April 12, 2023 Data Mining: Concepts and Techniques 59


Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from
a point q wrt. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
 Density-connected
 A point p is density-connected to a
point q wrt. Eps, MinPts if there is
a point o such that both p and q
are density-reachable from o wrt.
Eps and MinPts.
April 12, 2023 Data Mining: Concepts and Techniques 60
DBSCAN: Density Based Spatial
Clustering of Applications with Noise

 Relies on a density-based notion of cluster: A cluster is


defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

(Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5.)

April 12, 2023 Data Mining: Concepts and Techniques 61


DBSCAN: The Algorithm

 Arbitrarily select a point p

 Retrieve all points density-reachable from p wrt Eps


and MinPts.

 If p is a core point, a cluster is formed.

 If p is a border point, no points are density-reachable


from p and DBSCAN visits the next point of the
database.

 Continue the process until all of the points have been


processed.
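
A minimal sketch of this procedure (Python/NumPy; label -1 marks noise, and the O(n^2) neighborhood search is only for clarity):

import numpy as np

def dbscan(X, eps, min_pts):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    nbrs = [np.flatnonzero(D[i] <= eps) for i in range(len(X))]   # Eps-neighborhoods
    labels = np.full(len(X), -1)
    cluster = 0
    for p in range(len(X)):
        if labels[p] != -1 or len(nbrs[p]) < min_pts:
            continue                          # already clustered, or p is not a core point
        labels[p] = cluster                   # p is a core point: a new cluster is formed
        seeds = list(nbrs[p])
        while seeds:                          # collect every point density-reachable from p
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(nbrs[q]) >= min_pts:   # q is also a core point: expand through it
                    seeds.extend(nbrs[q])
        cluster += 1
    return labels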
April 12, 2023 Data Mining: Concepts and Techniques 62
OPTICS: A Cluster-Ordering Method (1999)

 OPTICS: Ordering Points To Identify the Clustering


Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)

 Produces a special order of the database wrt its

density-based clustering structure


 This cluster-ordering contains info equiv to the

density-based clusterings corresponding to a broad


range of parameter settings
 Good for both automatic and interactive cluster

analysis, including finding intrinsic clustering structure


 Can be represented graphically or using visualization

techniques
April 12, 2023 Data Mining: Concepts and Techniques 63
OPTICS: Some Extension from
DBSCAN

 Index-based:
 k = number of dimensions
 N = 20
 p = 75%
 M = N(1-p) = 5
 Complexity: O(kN^2)
 Core Distance
 Reachability Distance: max(core-distance(o), d(o, p))
(Figure: MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm.)
April 12, 2023 Data Mining: Concepts and Techniques 64
(Figure: reachability plot; reachability-distance (undefined, ε, ε′) plotted against the cluster order of the objects.)
April 12, 2023 Data Mining: Concepts and Techniques 65
Density-Based Cluster analysis:
OPTICS & Its Applications

April 12, 2023 Data Mining: Concepts and Techniques 66


DENCLUE: Using density functions

 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)


 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
 But needs a large number of parameters

April 12, 2023 Data Mining: Concepts and Techniques 67


Denclue: Technical Essence

 Uses grid cells but only keeps information about grid


cells that do actually contain data points and manages
these cells in a tree-based access structure.
 Influence function: describes the impact of a data point
within its neighborhood.
 Overall density of the data space can be calculated as
the sum of the influence function of all data points.
 Clusters can be determined mathematically by
identifying density attractors.
 Density attractors are local maxima of the overall
density function.
April 12, 2023 Data Mining: Concepts and Techniques 68
Gradient: The steepness of a slope
 Example

   f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}

   f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}

   \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}

April 12, 2023 Data Mining: Concepts and Techniques 69


Density Attractor

April 12, 2023 Data Mining: Concepts and Techniques 70


Center-Defined and Arbitrary

April 12, 2023 Data Mining: Concepts and Techniques 71


Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 72
Grid-Based Clustering Method

 Using multi-resolution grid data structure


 Several interesting methods
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
 A multi-resolution clustering approach using
wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)

April 12, 2023 Data Mining: Concepts and Techniques 73


STING: A Statistical Information
Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution

April 12, 2023 Data Mining: Concepts and Techniques 74


STING: A Statistical Information
Grid Approach (2)
 Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
 Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
 Parameters of higher level cells can be easily calculated from
parameters of lower level cell
 count, mean, s, min, max

 type of distribution—normal, uniform, etc.

 Use a top-down approach to answer spatial data queries


 Start from a pre-selected layer—typically with a small
number of cells
 For each cell in the current level compute the confidence
interval
STING: A Statistical Information
Grid Approach (3)
 Remove the irrelevant cells from further consideration
 When finish examining the current layer, proceed to
the next lower level
 Repeat this process until the bottom layer is reached
 Advantages:
 Query-independent, easy to parallelize, incremental

update
 O(K), where K is the number of grid cells at the

lowest level
 Disadvantages:
 All the cluster boundaries are either horizontal or

vertical, and no diagonal boundary is detected


WaveCluster (1998)

 Sheikholeslami, Chatterjee, and Zhang (VLDB’98)


 A multi-resolution clustering approach which applies
wavelet transform to the feature space
 A wavelet transform is a signal processing
technique that decomposes a signal into different
frequency sub-band.
 Both grid-based and density-based
 Input parameters:
 # of grid cells for each dimension
 the wavelet, and the # of applications of wavelet
transform.
April 12, 2023 Data Mining: Concepts and Techniques 77
WaveCluster (1998)

 How to apply wavelet transform to find clusters


 Summarizes the data by imposing a multidimensional
grid structure onto data space
 These multidimensional spatial data objects are

represented in a n-dimensional feature space


 Apply wavelet transform on feature space to find the

dense regions in the feature space


 Apply wavelet transform multiple times, which results

in clusters at different scales from fine to coarse

April 12, 2023 Data Mining: Concepts and Techniques 79


Wavelet Transform

 Decomposes a signal into different frequency


subbands. (can be applied to n-dimensional
signals)
 Data are transformed to preserve relative
distance between objects at different levels of
resolution.
 Allows natural clusters to become more
distinguishable

April 12, 2023 Data Mining: Concepts and Techniques 80


What Is Wavelet (2)?

April 12, 2023 Data Mining: Concepts and Techniques 81


Quantization

April 12, 2023 Data Mining: Concepts and Techniques 82


Transformation

April 12, 2023 Data Mining: Concepts and Techniques 83


WaveCluster (1998)

 Why is wavelet transformation useful for clustering


 Unsupervised clustering

It uses hat-shaped filters to emphasize regions where
points cluster, while simultaneously suppressing weaker
information at their boundaries
 Effective removal of outliers

 Multi-resolution

 Cost efficiency

 Major features:
 Complexity O(N)

 Detect arbitrary shaped clusters at different scales

 Not sensitive to noise, not sensitive to input order

 Only applicable to low dimensional data


April 12, 2023 Data Mining: Concepts and Techniques 84
CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).


 Automatically identifying subspaces of a high dimensional
data space that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-
based
 It partitions each dimension into the same number of
equal length interval
 It partitions an m-dimensional data space into non-
overlapping rectangular units
 A unit is dense if the fraction of total data points
contained in the unit exceeds the input model parameter
 A cluster is a maximal set of connected dense units
within a subspace
April 12, 2023 Data Mining: Concepts and Techniques 85
CLIQUE: The Major Steps

 Partition the data space and find the number of points that
lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters:
 Determine dense units in all subspaces of interests
 Determine connected dense units in all subspaces of
interests.
 Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of
connected dense units for each cluster
 Determination of minimal cover for each cluster

April 12, 2023 Data Mining: Concepts and Techniques 86


(Figure: dense units found in the (age, salary ($10,000)) and (age, vacation (week)) subspaces with density threshold τ = 3; intersecting them identifies a candidate dense region for age between 30 and 50.)

April 12, 2023 Data Mining: Concepts and Techniques 87


Strength and Weakness of CLIQUE

 Strength
 It automatically finds subspaces of the highest

dimensionality such that high density clusters exist in


those subspaces
 It is insensitive to the order of records in input and

does not presume some canonical data distribution


 It scales linearly with the size of input and has good

scalability as the number of dimensions in the data


increases
 Weakness
 The accuracy of the clustering result may be

degraded at the expense of simplicity of the method


April 12, 2023 Data Mining: Concepts and Techniques 88
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 89
Model-Based Clustering Methods
 Attempt to optimize the fit between the data and some
mathematical model
 Statistical and AI approach
 Conceptual clustering

 A form of clustering in machine learning


 Produces a classification scheme for a set of unlabeled objects
 Finds characteristic description for each concept (class)
 COBWEB (Fisher’87)
 A popular and simple method of incremental conceptual learning
 Creates a hierarchical clustering in the form of a classification
tree
 Each node refers to a concept and contains a probabilistic
description of that concept
April 12, 2023 Data Mining: Concepts and Techniques 90
COBWEB Clustering Method

A classification tree

April 12, 2023 Data Mining: Concepts and Techniques 91


More on Statistical-Based Clustering

 Limitations of COBWEB
 The assumption that the attributes are independent of
each other is often too strong because correlation may
exist
 Not suitable for clustering large database data –
skewed tree and expensive probability distributions
 CLASSIT
 an extension of COBWEB for incremental clustering of
continuous data
 suffers similar problems as COBWEB

 AutoClass (Cheeseman and Stutz, 1996)


 Uses Bayesian statistical analysis to estimate the
number of clusters
 Popular in industry

April 12, 2023 Data Mining: Concepts and Techniques 92


Other Model-Based Clustering
Methods

 Neural network approaches


 Represent each cluster as an exemplar, acting as a

“prototype” of the cluster


 New objects are distributed to the cluster whose

exemplar is the most similar according to some


distance measure
 Competitive learning
 Involves a hierarchical architecture of several units

(neurons)
 Neurons compete in a “winner-takes-all” fashion for

the object currently being presented

April 12, 2023 Data Mining: Concepts and Techniques 93


Model-Based Clustering Methods

April 12, 2023 Data Mining: Concepts and Techniques 94


Self-organizing feature maps (SOMs)

 Clustering is also performed by having several


units competing for the current object
 The unit whose weight vector is closest to the
current object wins
 The winner and its neighbors learn by having
their weights adjusted
 SOMs are believed to resemble processing that
can occur in the brain
 Useful for visualizing high-dimensional data in 2-
or 3-D space

April 12, 2023 Data Mining: Concepts and Techniques 95


Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 96
What Is Outlier Discovery?

 What are outliers?


 A set of objects that are considerably dissimilar from the
remainder of the data
 Example: Sports: Michael Jordan, Wayne Gretzky, ...

 Problem
 Find top n outlier points

 Applications:
 Credit card fraud detection

 Telecom fraud detection

 Customer segmentation

 Medical analysis

April 12, 2023 Data Mining: Concepts and Techniques 97


Outlier Discovery:
Statistical Approaches

 Assume a model underlying distribution that generates


data set (e.g. normal distribution)
 Use discordancy tests depending on
 data distribution

 distribution parameter (e.g., mean, variance)

 number of expected outliers

 Drawbacks
 most tests are for single attribute

 In many cases, data distribution may not be known

April 12, 2023 Data Mining: Concepts and Techniques 98


Outlier Discovery: Distance-Based Approach

 Introduced to counter the main limitations imposed by


statistical methods
 We need multi-dimensional analysis without knowing

data distribution.
 Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the objects
in T lies at a distance greater than D from O
 Algorithms for mining distance-based outliers
 Index-based algorithm

 Nested-loop algorithm

 Cell-based algorithm
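
A simple nested-loop style sketch of the DB(p, D) definition above (Python/NumPy; treating the fraction as taken over the other n-1 objects is an interpretation choice):

import numpy as np

def db_outliers(X, p, D):
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    far = (dist > D).sum(axis=1)                    # how many objects lie farther than D from O
    return np.flatnonzero(far >= p * (len(X) - 1))  # indices of the DB(p, D)-outliers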
Outlier Discovery: Deviation-
Based Approach
 Identifies outliers by examining the main characteristics
of objects in a group
 Objects that “deviate” from this description are
considered outliers
 sequential exception technique
 simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in
large multidimensional data
April 12, 2023 Data Mining: Concepts and Techniques 100
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
April 12, 2023 Data Mining: Concepts and Techniques 101
Problems and Challenges

 Considerable progress has been made in scalable


clustering methods
 Partitioning: k-means, k-medoids, CLARANS
 Hierarchical: BIRCH, CURE
 Density-based: DBSCAN, CLIQUE, OPTICS
 Grid-based: STING, WaveCluster
 Model-based: Autoclass, Denclue, Cobweb
 Current clustering techniques do not address all the
requirements adequately
 Constraint-based clustering analysis: Constraints exist in
data space (bridges and highways) or in user queries
April 12, 2023 Data Mining: Concepts and Techniques 102
Constraint-Based Clustering Analysis

 Clustering analysis: less parameters but more user-desired


constraints, e.g., an ATM allocation problem

April 12, 2023 Data Mining: Concepts and Techniques 103


Clustering With Obstacle Objects

(Figure: clustering results without taking obstacles into account vs. taking obstacles into account.)

April 12, 2023 Data Mining: Concepts and Techniques 104


Summary

 Cluster analysis groups objects based on their similarity


and has wide applications
 Measure of similarity can be computed for various types
of data
 Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
 Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
 There are still lots of research issues on cluster analysis,
such as constraint-based clustering
April 12, 2023 Data Mining: Concepts and Techniques 105
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

April 12, 2023 Data Mining: Concepts and Techniques 106


References (2)

 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster


Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining, VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96.
April 12, 2023 Data Mining: Concepts and Techniques 107
www.cs.uiuc.edu/~hanj

Thank you !!!


April 12, 2023 Data Mining: Concepts and Techniques 108
