ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3209 - DATA MINING
Week 6: Basic Association Analysis
Duc-Trong Le
Slide credit: Vipin Kumar et al.,
https://www-users.cse.umn.edu/~kumar001/dmbook
Hanoi, 09/2021
Association Rule Mining
● Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs,Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence,
not causality!
Definition: Frequent Itemset
● Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items
● Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread,Diaper}) = 2
● Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
● Association Rule
– An implication expression of the form
X → Y, where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}
● Rule Evaluation Metrics
– Support (s)
◆ Fraction of transactions that contain both X and Y
– Confidence (c)
◆ Measures how often items in Y appear in transactions that contain X
– Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
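A minimal Python sketch of these two measures, assuming the standard five-transaction market-basket table from the textbook (the transaction contents below are an assumption, since the table itself appears only as a figure in the slides); the printed values match the supports and confidences quoted on the surrounding slides:

```python
# Sketch: computing support and confidence for an association rule.
# The transactions are assumed from the classic textbook market-basket example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(lhs -> rhs) = sigma(lhs U rhs) / sigma(lhs)."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4  (sigma = 2, |T| = 5)
print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```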
Association Rule Mining Task
● Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
● Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each
rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
Computational Complexity
● Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
R = 3^d - 2^(d+1) + 1
If d = 6, R = 602 rules
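A quick check of the rule count for d = 6 (a sketch, not part of the original slides):

```python
# R = 3^d - 2^(d+1) + 1: each item goes to the LHS, the RHS, or neither,
# minus the cases where the LHS or the RHS ends up empty.
d = 6
print(3**d - 2**(d + 1) + 1)  # 602
```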
Mining Association Rules
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
● All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
● Rules originating from the same itemset have identical support but
can have different confidence
● Thus, we may decouple the support and confidence requirements
Mining Association Rules
● Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
● Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
● Brute-force approach:
– Each itemset in the lattice is a candidate frequent
itemset
– Count the support of each candidate by scanning the
database
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ Expensive since M = 2^d !!! (N = number of transactions, M = number of candidate itemsets, w = maximum transaction width)
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
● Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
● Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
● Apriori principle:
– If an itemset is frequent, then all of its subsets
must also be frequent
● Apriori principle holds due to the following
property of the support measure:
– Support of an itemset never exceeds the support of its subsets: ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
(Figure: if an itemset in the lattice is found to be infrequent, all of its supersets are pruned.)
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets), Pairs (2-itemsets), Triplets (3-itemsets): the support-count tables are shown in the original slides.
(No need to generate pair or triplet candidates involving Coke or Eggs, since these items are infrequent.)
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 4 = 16
Apriori Algorithm
– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
● Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
◆ Candidate Generation: Generate Lk+1 from Fk
◆ Candidate Pruning: Prune candidate itemsets in Lk+1
containing subsets of length k that are infrequent
◆ Support Counting: Count the support of each
candidate in Lk+1 by scanning the DB
◆ Candidate Elimination: Eliminate candidates in Lk+1
that are infrequent, leaving only those that are frequent
=> Fk+1
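A compact Python sketch of this loop follows. It uses the slide's notation (Fk for frequent k-itemsets, Lk+1 for candidates) and the Fk-1 x Fk-1 merge described on the next slides; the function and variable names are mine, and the structure is illustrative rather than the textbook's reference implementation.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Sketch of the Apriori loop on this slide (items are assumed to be sortable labels)."""
    transactions = [set(t) for t in transactions]

    # F1: frequent 1-itemsets, each stored as a sorted tuple of items
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    Fk = sorted(c for c, n in counts.items() if n >= minsup_count)
    frequent = {c: counts[c] for c in Fk}

    k = 1
    while Fk:
        # Candidate Generation: merge two frequent k-itemsets sharing their first k-1 items
        Lk1 = []
        for i in range(len(Fk)):
            for j in range(i + 1, len(Fk)):
                if Fk[i][:-1] == Fk[j][:-1]:
                    Lk1.append(tuple(sorted(set(Fk[i]) | set(Fk[j]))))
        # Candidate Pruning: drop candidates that have an infrequent k-item subset
        Fk_set = set(Fk)
        Lk1 = [c for c in Lk1 if all(s in Fk_set for s in combinations(c, k))]
        # Support Counting: one scan over the database per level
        cand_counts = {c: 0 for c in Lk1}
        for t in transactions:
            for c in Lk1:
                if set(c) <= t:
                    cand_counts[c] += 1
        # Candidate Elimination: keep only the frequent candidates => F(k+1)
        Fk = sorted(c for c, n in cand_counts.items() if n >= minsup_count)
        frequent.update({c: cand_counts[c] for c in Fk})
        k += 1
    return frequent
```

Run on the five-transaction table from the earlier sketch with minsup_count=3, it generates exactly one 3-itemset candidate, which is then eliminated during support counting (the 6 + 6 + 1 = 13 count discussed a few slides later).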
Candidate Generation: Brute-force method
(Figure: every possible (k+1)-itemset is considered as a candidate and then pruned.)
Candidate Generation: Merge Fk-1 and F1 itemsets
(Figure: each frequent (k-1)-itemset is extended with a frequent item to form a candidate k-itemset.)
Candidate Generation: Fk-1 x Fk-1 Method
● Merge two frequent (k-1)-itemsets if their first (k-2) items
are identical
● F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
– Do not merge (ABD, ACD) because they share only a prefix of length 1 instead of length 2
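The merge step above can be reproduced in a few lines of Python (a sketch; itemsets are written as sorted strings of item labels):

```python
from itertools import combinations

# F(k-1) x F(k-1) merge applied to the F3 example on this slide (k = 4).
F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]   # each string = a sorted 3-itemset

candidates = []
for a, b in combinations(F3, 2):
    if a[:-1] == b[:-1]:                                  # first k-2 = 2 items must be identical
        candidates.append(a[:-1] + "".join(sorted(a[-1] + b[-1])))

print(candidates)   # ['ABCD', 'ABCE', 'ABDE']
```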
Candidate Pruning
● Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
be the set of frequent 3-itemsets
● L4 = {ABCD,ABCE,ABDE} is the set of candidate
4-itemsets generated (from previous slide)
● Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
● After candidate pruning: L4 = {ABCD}
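The pruning step for this example, as a small sketch (same string encoding of itemsets as above):

```python
from itertools import combinations

F3 = {"ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"}   # frequent 3-itemsets
L4 = ["ABCD", "ABCE", "ABDE"]                            # candidates from the merge step

# Keep a candidate only if every 3-item subset of it is frequent.
pruned_L4 = [c for c in L4
             if all("".join(sub) in F3 for sub in combinations(c, 3))]
print(pruned_L4)   # ['ABCD']  (ABCE and ABDE are pruned)
```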
Illustrating Apriori Principle
Minimum Support = 3 (same example as before)
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 1 = 13
Use of the Fk-1 x Fk-1 method for candidate generation results in only one 3-itemset. This is eliminated after the support counting step.
Alternate Fk-1 x Fk-1 Method
● Merge two frequent (k-1)-itemsets if the last (k-2) items of the first one are identical to the first (k-2) items of the second.
● F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
Candidate Pruning for Alternate Fk-1 x Fk-1 Method
● Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
be the set of frequent 3-itemsets
● L4 = {ABCD,ABDE,ACDE,BCDE} is the set of
candidate 4-itemsets generated (from previous
slide)
● Candidate pruning
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE and ADE are infrequent
– Prune BCDE because BCE is infrequent
● After candidate pruning: L4 = {ABCD}
Support Counting of Candidate Itemsets
● Scan the database of transactions to determine the
support of each candidate itemset
– Must match every candidate itemset against every transaction,
which is an expensive operation
Support Counting of Candidate Itemsets
● To reduce number of comparisons, store the candidate
itemsets in a hash structure
– Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
Support Counting: An Example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
How many of these itemsets are supported by transaction (1,2,3,5,6)?
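A direct subset check answers the question (a sketch; the hash structure from the previous slide exists precisely to avoid testing all 15 candidates against every transaction):

```python
from itertools import combinations

candidates = [
    {1, 4, 5}, {1, 2, 4}, {4, 5, 7}, {1, 2, 5}, {4, 5, 8},
    {1, 5, 9}, {1, 3, 6}, {2, 3, 4}, {5, 6, 7}, {3, 4, 5},
    {3, 5, 6}, {3, 5, 7}, {6, 8, 9}, {3, 6, 7}, {3, 6, 8},
]
transaction = {1, 2, 3, 5, 6}

# The transaction contains C(5,3) = 10 different 3-itemsets ...
print(len(list(combinations(sorted(transaction), 3))))   # 10
# ... but only the candidates that are subsets of it are supported.
matched = [c for c in candidates if c <= transaction]
print(len(matched))   # 3 candidates: {1,2,5}, {1,3,6}, {3,5,6}
```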
Rule Generation
● Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L – f satisfies the
minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate
rules:
ABC →D, ABD →C, ACD →B, BCD →A,
A →BCD, B →ACD, C →ABD, D →ABC
AB →CD, AC → BD, AD → BC, BC →AD,
BD →AC, CD →AB,
● If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)
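Enumerating these candidate rules is a one-loop job (a sketch):

```python
from itertools import combinations

# All candidate rules f -> L - f for the frequent itemset L = {A, B, C, D}.
L = {"A", "B", "C", "D"}

rules = []
for r in range(1, len(L)):                    # non-empty proper subsets of L
    for lhs in map(set, combinations(sorted(L), r)):
        rules.append((lhs, L - lhs))

print(len(rules))                             # 14 = 2^4 - 2
for lhs, rhs in rules:
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```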
Rule Generation
● In general, confidence does not have an
anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
● But confidence of rules generated from the same
itemset has an anti-monotone property
– E.g., suppose {A,B,C,D} is a frequent 4-itemset:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the rule's right-hand side: moving items from the antecedent to the consequent cannot decrease the antecedent's support, so confidence cannot increase
Rule Generation for Apriori Algorithm
(Lattice of rules: if a rule has low confidence, all rules obtained by moving items from its antecedent to its consequent are pruned.)
Association Analysis: Basic Concepts
and Algorithms
Algorithms and Complexity
Factors Affecting Complexity of Apriori
● Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
● Dimensionality (number of items) of the data set
● Size of database
● Average transaction width
Impact of Support Based Pruning
Items (1-itemsets): support-count table shown in the original slide.
Minimum Support = 3:
– If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
– With support-based pruning: 6 + 6 + 4 = 16
Minimum Support = 2:
– If every subset is considered: 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56
Factors Affecting Complexity of Apriori
● Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
● Size of database
– run time of algorithm increases with number of transactions
● Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting
Compact Representation of Frequent Itemsets
● Some frequent itemsets are redundant because their supersets are also frequent
– Consider a data set over items A1-A10, B1-B10, C1-C10 (table shown in the original slide) with support threshold = 5: the number of frequent itemsets is very large
● Need a compact representation
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its
immediate supersets is frequent
(Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie immediately below this border.)
What are the Maximal Frequent Itemsets in this Data?
Minimum support threshold = 5
(Data set over items A1-A10, B1-B10, C1-C10; table shown in the original slide)
An illustrative example
Items A to J, 10 transactions (transaction table shown in the original slide)
Support threshold (by count) = 5:
– Frequent itemsets: {F}
– Maximal itemsets: {F}
Support threshold (by count) = 4:
– Frequent itemsets: {E}, {F}, {E,F}, {J}
– Maximal itemsets: {E,F}, {J}
Support threshold (by count) = 3:
– Frequent itemsets: all subsets of {C,D,E,F}, plus {J}
– Maximal itemsets: {C,D,E,F}, {J}
Another illustrative example
Items A to J, 10 transactions (transaction table shown in the original slide)
Support threshold (by count) = 5: Maximal itemsets: {A}, {B}, {C}
Support threshold (by count) = 4: Maximal itemsets: {A,B}, {A,C}, {B,C}
Support threshold (by count) = 3: Maximal itemsets: {A,B,C}
Closed Itemset
● An itemset X is closed if none of its immediate supersets has the same support as X.
● X is not closed if at least one of its immediate supersets has the same support count as X.
Maximal Frequent vs Closed Frequent Itemsets
Minimum support = 2
(Figure: itemset lattice marking which frequent itemsets are closed but not maximal, and which are closed and maximal.)
# Closed frequent itemsets = 9
# Maximal frequent itemsets = 4
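For small examples, maximal and closed frequent itemsets can be identified by brute force; the sketch below (exponential in the number of items, so only for toy data) applies the two definitions directly:

```python
from itertools import combinations

def classify_itemsets(transactions, minsup_count):
    """Brute-force sketch: find frequent, closed-frequent and maximal-frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    # Support count of every possible itemset (feasible only for tiny item universes).
    support = {}
    for k in range(1, len(items) + 1):
        for iset in map(frozenset, combinations(items, k)):
            support[iset] = sum(1 for t in transactions if iset <= t)

    frequent = {i for i, s in support.items() if s >= minsup_count}
    closed, maximal = set(), set()
    for iset in frequent:
        supersets = [iset | {x} for x in items if x not in iset]
        if all(support[s] < support[iset] for s in supersets):
            closed.add(iset)      # no immediate superset has the same support
        if not any(s in frequent for s in supersets):
            maximal.add(iset)     # no immediate superset is frequent
    return frequent, closed, maximal
```

Every maximal frequent itemset returned this way is also closed, matching the containment illustrated two slides below.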
What are the Closed Itemsets in this Data?
(Data set over items A1-A10, B1-B10, C1-C10; table shown in the original slide)
Maximal vs Closed Itemsets
(Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)
Pattern Evaluation
● Association rule algorithms can produce a large number of rules
● Interestingness measures can be used to
prune/rank the patterns
– In the original formulation, support &
confidence are the only measures used
Computing Interestingness Measure
● Given X → Y or {X,Y}, the information needed to compute interestingness can be obtained from a contingency table

Contingency table for X → Y:

            Y      ¬Y
   X       f11    f10    f1+
   ¬X      f01    f00    f0+
           f+1    f+0     N

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures
● support, confidence, Gini, entropy, etc.
Drawback of Confidence
Customers   Tea   Coffee   ...
C1           0      1      ...
C2           1      0      ...
C3           1      1      ...
C4           1      0      ...
...
Association Rule: Tea → Coffee
Confidence ≅ P(Coffee|Tea) = 150/200 = 0.75
Confidence > 50%, meaning people who drink tea are more likely to drink coffee than to not drink coffee
So the rule seems reasonable
Drawback of Confidence
           Coffee   ¬Coffee
Tea          150       50      200
¬Tea         650      150      800
             800      200     1000
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 150/200 = 0.75
but P(Coffee) = 0.8, which means knowing that a person drinks tea reduces the probability that the person drinks coffee!
⇒ Note that P(Coffee|¬Tea) = 650/800 = 0.8125
Drawback of Confidence
Customers   Tea   Honey   ...
C1           0      1     ...
C2           1      0     ...
C3           1      1     ...
C4           1      0     ...
...
Association Rule: Tea → Honey
Confidence ≅ P(Honey|Tea) = 100/200 = 0.50
Confidence = 50%, which may mean that drinking tea has little influence on whether honey is used or not
So the rule seems uninteresting
But P(Honey) = 120/1000 = 0.12 (hence tea drinkers are far more likely to have honey than the overall population)
Measure for Association Rules
● So, what kind of rules do we really want?
– Confidence(X → Y) should be sufficiently high
◆ To ensure that people who buy X will more likely
buy Y than not buy Y
– Confidence(X → Y) > support(Y)
◆ Otherwise, rule will be misleading because
having item X actually reduces the chance of
having item Y in the same transaction
◆ Is there any measure that captures this constraint?
– Answer: Yes. There are many of them.
Statistical Relationship between X and Y
● The criterion
confidence(X → Y) = support(Y)
is equivalent to:
– P(Y|X) = P(Y)
– P(X,Y) = P(X) × P(Y) (X and Y are independent)
● If P(X,Y) > P(X) × P(Y): X and Y are positively correlated
● If P(X,Y) < P(X) × P(Y): X and Y are negatively correlated
Measures that take into account statistical dependence
Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) × P(Y))
Lift is used for rules while Interest is used for itemsets; numerically the two are the same quantity.
Example: Lift/Interest
           Coffee   ¬Coffee
Tea          150       50      200
¬Tea         650      150      800
             800      200     1000
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.8
⇒ Interest = 0.15 / (0.2 × 0.8) = 0.9375 (< 1, therefore Tea and Coffee are negatively associated)
So, is it enough to use confidence/Interest for pruning?
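The same numbers, computed directly from the contingency table (a sketch):

```python
# Tea -> Coffee contingency counts from the table above.
f11, f10, f01, f00 = 150, 50, 650, 150
N = f11 + f10 + f01 + f00                            # 1000

p_tea_coffee = f11 / N                               # P(Tea, Coffee) = 0.15
p_coffee     = (f11 + f01) / N                       # P(Coffee)      = 0.80

confidence = f11 / (f11 + f10)                       # P(Coffee | Tea) = 0.75
lift = p_tea_coffee / ((f11 + f10) / N * p_coffee)   # = confidence / P(Coffee)
print(confidence, lift)                              # 0.75 0.9375
```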
There are lots of measures proposed in the literature.
Simpson’s Paradox
● Observed relationship in data may be influenced
by the presence of other confounding factors
(hidden variables)
– Hidden variables may cause the observed
relationship to disappear or reverse its
direction!
● Proper stratification is needed to avoid
generating spurious patterns
Simpson’s Paradox
● Recovery rate from Covid
– Hospital A: 80%
– Hospital B: 90%
● Which hospital is better?
● Covid recovery rate on older population
– Hospital A: 50%
– Hospital B: 30%
● Covid recovery rate on younger population
– Hospital A: 99%
– Hospital B: 98%
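The reversal can be reproduced numerically. The patient counts below are hypothetical (chosen only to roughly match the rates quoted on this slide); only the recovery rates come from the slide:

```python
# (recovered, treated) per age group; counts are assumed for illustration.
hospital_A = {"older": (1000, 2000), "younger": (2970, 3000)}   # 50%, 99%
hospital_B = {"older": (150, 500),   "younger": (4410, 4500)}   # 30%, 98%

for name, groups in [("A", hospital_A), ("B", hospital_B)]:
    recovered = sum(r for r, n in groups.values())
    treated = sum(n for r, n in groups.values())
    rates = {g: round(r / n, 2) for g, (r, n) in groups.items()}
    print(name, rates, "overall:", round(recovered / treated, 3))

# A {'older': 0.5, 'younger': 0.99} overall: 0.794
# B {'older': 0.3, 'younger': 0.98} overall: 0.912
# B looks better overall only because it treats far fewer of the harder (older) cases.
```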
Simpson’s Paradox
● Covid-19 deaths (per 100,000 population)
– County A: 15
– County B: 10
● Which county is managing the pandemic better?
● Covid death rate on older population
– County A: 20
– County B: 40
● Covid death rate on younger population
– County A: 2
– County B: 5