Data Mining: Association
Analysis
Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions
Example of Association Rules
TID Items
{Diaper} {Beer},
1 Bread, Milk {Milk, Bread} {Eggs,Coke},
2 Bread, Diaper, Beer, Eggs {Beer, Bread} {Milk},
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer Implication means co-occurrence,
5 Bread, Milk, Diaper, Coke not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset TID Items
• An itemset that contains k items 1 Bread, Milk
• Support count () 2 Bread, Diaper, Beer, Eggs
– Frequency of occurrence of an itemset 3 Milk, Diaper, Beer, Coke
– E.g. ({Milk, Bread,Diaper}) = 2 4 Bread, Milk, Diaper, Beer
• Support 5 Bread, Milk, Diaper, Coke
– Fraction of transactions that contain an
itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than
or equal to a minsup threshold
Definition: Association Rule
Association Rule
TID Items
– An implication expression of the form X
1 Bread, Milk
Y, where X and Y are itemsets
2 Bread, Diaper, Beer, Eggs
– Example:
3 Milk, Diaper, Beer, Coke
{Milk, Diaper} {Beer}
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Rule Evaluation Metrics
– Support (s)
Fraction of transactions that contain both Example:
X and Y {Milk , Diaper } Beer
– Confidence (c)
(Milk, Diaper, Beer) 2
Measures how often items in Y
appear in transactions that
s 0.4
|T| 5
contain X
(Milk, Diaper, Beer) 2
c 0.67
(Milk, Diaper) 3
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
Mining Association Rules
Example of Rules:
TID Items
1 Bread, Milk {Milk,Diaper} {Beer} (s=0.4, c=0.67)
2 Bread, Diaper, Beer, Eggs {Milk,Beer} {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} {Milk} (s=0.4, c=0.67)
3 Milk, Diaper, Beer, Coke
{Beer} {Milk,Diaper} (s=0.4, c=0.67)
4 Bread, Milk, Diaper, Beer
{Diaper} {Milk,Beer} (s=0.4, c=0.5)
5 Bread, Milk, Diaper, Coke {Milk} {Diaper,Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
Given d items, there
are 2d possible
ABCDE candidate itemsets
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
Transactions List of
Candidates
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
N 3 Milk, Diaper, Beer, Coke M
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
w
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2d !!!
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2d
– Total number of possible association rules:
d d k
R
d 1 d k
k j
k 1 j 1
3 2 1
d d 1
If d=6, R = 602 rules
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M=2d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the
candidates or transactions
– No need to match every candidate against every
transaction
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
• Apriori principle holds due to the following property
of the support measure:
X , Y : ( X Y ) s( X ) s(Y )
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Apriori Principle
• If an itemset is frequent, then all of its subsets must also be frequent
• If an itemset is infrequent, then all of its supersets must be infrequent too
(X Y) (¬Y ¬X) null
frequent
A B C D E
frequent
infrequent
AB AC AD AE BC BD BE CD CE DE
infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
13
ABCDE
Illustrating Apriori Principle
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
Pruned
ABCDE
supersets
Illustrating Apriori Principle
Item Count Items (1-itemsets)
Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, Itemset Count
6C + 6C + 6C = 41 {Bread,Milk,Diaper} 3
1 2 3
With support-based pruning,
6 + 6 + 1 = 13
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
Example
A database has five
transactions. Let the min sup
= 50% and min con f = 80%.
Solution
Step 1: Find all Frequent
Itemsets
Frequent Itemset:
{A} {B} {C} {E} {A C} {B C} {B E} {C E} {B C E}
Step 2: Generate strong association rules from the frequent itemsets
Example
A database has five
transactions. Let the min sup
= 50% and min con f = 80%.