
Data Mining

Association Analysis

By
Lohith C
Association Rule Mining

⚫ Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
⚫ Association analysis is useful for discovering interesting relationships hidden in large data sets.
⚫ The uncovered relationships can be represented as frequent itemsets (sets of items present in many transactions) or as association rules (which express relationships between two itemsets).
⚫ Two key issues:
⚫ First, discovering patterns from a large transaction data set can be computationally expensive.
⚫ Second, some of the discovered patterns may be spurious (happen simply by chance) and, even for non-spurious patterns, some are more interesting than others.
Problem Definition: Frequent Itemset
⚫ Binary Representation
– Market basket data can be represented in a binary format:
– each row corresponds to a transaction
– each column corresponds to an item.
⚫ Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items
⚫ Support count (σ)
– The number of transactions that contain a particular itemset (the frequency of occurrence of an itemset)
– The support count σ(X) for an itemset X can be stated as σ(X) = |{ t | X ⊆ t, t ∈ T }|, where T is the set of transactions
– E.g. σ({Milk, Bread, Diaper}) = 2

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

⚫ Support
– Fraction of transactions in which an itemset occurs: s(X) = σ(X) / N, where N is the total number of transactions
– E.g. s({Milk, Bread, Diaper}) = 2/5
⚫ Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
⚫ Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
⚫ Rule Evaluation Metrics
– Support (s)
◆ How often a rule is applicable to a given data set
– Confidence (c)
◆ Measures how often items in Y appear in transactions that contain X

Example (using the market-basket transactions above): {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
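A minimal Python sketch of these two metrics on the same five transactions (function and variable names are illustrative, not part of the slides):

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]

def support_count(itemset, transactions):
    # sigma(X): number of transactions containing every item of X
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ~ 0.67
print(s, c)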
Association Rule Mining Task

⚫ Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

⚫ Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Prohibitively expensive.
Mining Association Rules
Example of Rules (from the transactions above):
{Milk, Diaper} → {Beer}   (s = 2/5 = 0.4, c = 2/3 = 0.67)
{Milk, Beer} → {Diaper}   (s = 2/5 = 0.4, c = 2/2 = 1.0)
{Diaper, Beer} → {Milk}   (s = 2/5 = 0.4, c = 2/3 = 0.67)
{Beer} → {Milk, Diaper}   (s = 2/5 = 0.4, c = 2/3 = 0.67)
{Diaper} → {Milk, Beer}   (s = 2/5 = 0.4, c = 2/4 = 0.5)
{Milk} → {Diaper, Beer}   (s = 2/5 = 0.4, c = 2/4 = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

⚫ A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. These rules are called strong rules.
⚫ Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Fig.: An itemset lattice. Given k items, there are 2^k - 1 possible frequent itemsets, excluding the null set.
Frequent Itemset Generation

⚫ Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Determine the support count of each candidate by
scanning the database.
Frequent Itemset Generation Strategies

⚫ Reduce the number of candidates (M)
– The Apriori principle is an effective way to eliminate some of the candidate itemsets without counting their support values.
⚫ Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidate itemsets or to compress the data set.
– No need to match each candidate against every transaction.
⚫ Reduce the number of transactions (N)
– Reduce the size of N: as the size of the candidate itemsets increases, fewer transactions will contain them.
Reducing Number of Candidates

⚫ Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent.
⚫ The strategy of trimming the exponential search
space based on the support measure is known as
support-based pruning.
⚫ The Apriori principle holds due to the following property of the support measure:
– For any two itemsets X ⊂ Y, σ(Y) ≤ σ(X).
– The support of an itemset never exceeds the support of its subsets.
– This is known as the anti-monotone property of support.
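In symbols, and checked against the support counts of the running example (the σ values are the ones tabulated in the following slides):

∀ X ⊂ Y : σ(Y) ≤ σ(X)
e.g. σ({Milk, Diaper, Beer}) = 2 ≤ σ({Milk, Diaper}) = 3 ≤ σ({Milk}) = 4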
Illustrating Apriori Principle

Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e}
must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d,
e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must
also be frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be
infrequent too.
Illustrating Apriori Principle
Figure: An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of {a, b} are infrequent.

[Lattice figure: the itemset {A, B} is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space.]
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates.
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
{Bread, Milk}
{Bread, Beer}
{Bread, Diaper}
{Beer, Milk}
{Diaper, Milk}
{Beer, Diaper}

Minimum Support = 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates.
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Minimum Support = 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates.
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Minimum Support = 3

Triplets (3-itemsets):
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates.
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Minimum Support = 3

Triplets (3-itemsets):
Itemset                Count
{Beer, Diaper, Milk}   2
{Beer, Bread, Diaper}  2
{Bread, Diaper, Milk}  2
{Beer, Bread, Milk}    1

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 4 = 16 candidates (or 6 + 6 + 1 = 13 if the Fk-1 x Fk-1 candidate generation described later is used, which produces only one 3-itemset candidate).
Apriori Algorithm
Notation: Fk = the set of frequent k-itemsets; Ck = the set of candidate k-itemsets.
Explanation (step numbers refer to the Apriori pseudocode):
⚫ The algorithm initially makes a single pass over the data set to
determine the support of each item. Upon completion of this step, the
set of all frequent 1-itemsets, F1, will be known (steps 1 and 2).
⚫ Next, the algorithm will iteratively generate new candidate k-itemsets
and prune unnecessary candidates that are guaranteed to be
infrequent given the frequent (k−1)-itemsets found in the previous
iteration (steps 5 and 6). Candidate generation and pruning is
implemented using the functions candidate-gen and candidate-prune.
⚫ To count the support of the candidates, the algorithm needs to make
an additional pass over the data set (steps 7–12). The subset function
is used to determine all the candidate itemsets in Ck that are
contained in each transaction t.
⚫ After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than N × minsup (step 13).
⚫ The algorithm terminates when there are no new frequent itemsets
generated, i.e., Fk = ∅ (step 14).
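The original pseudocode figure is not reproduced here, so the following is only a minimal Python sketch of the level-wise loop just described; the data set is the running example, and the function names, the simplified candidate merge, and the minsup value are illustrative choices rather than the slides' exact candidate-gen and candidate-prune functions:

from itertools import combinations

def apriori(transactions, minsup):
    # Return {frozenset(itemset): support_count} for all frequent itemsets.
    N = len(transactions)
    min_count = minsup * N
    # Pass 1: find the frequent 1-itemsets (steps 1 and 2).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: merge frequent (k-1)-itemsets
        # (simplified here: any pair whose union has size k).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Candidate pruning: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Support counting: one additional pass over the data set (steps 7-12).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Eliminate candidates whose count is below N x minsup (step 13).
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        k += 1          # terminate when no new frequent itemsets are generated (step 14)
    return all_frequent

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(transactions, minsup=0.6))   # minsup 0.6 corresponds to a support count of 3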
⚫ Apriori algorithm has two important characteristics.
⚫ First, it is a level-wise algorithm; i.e., it traverses the
itemset lattice one level at a time, from frequent 1-
itemsets to the maximum size of frequent itemsets.
⚫ Second, it employs a generate-and-test strategy for
finding frequent itemsets.
⚫ At each iteration (level), new candidate itemsets are
generated from the frequent itemsets found in the
previous iteration.
⚫ The support for each candidate is then counted and
tested against the minsup threshold.
⚫ Total number of iterations=kmax +1, where kmax is the
maximum size of the frequent itemsets.
2 main steps of Apriori

⚫ 1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k−1)-itemsets found in the previous iteration.
⚫ 2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent from previous iterations.
Candidate Generation

⚫ An effective candidate generation procedure must be complete and non-redundant.
⚫ A candidate generation procedure is said to be complete if it does not omit any frequent itemsets.
⚫ A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once.
Brute-Force Method

⚫ The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent.
⚫ The number of candidate itemsets generated at level k is equal to dCk, where d is the total number of items.
Candidate Generation: Brute-force method
Fk−1 ×F1 Method

⚫ An alternative method for candidate generation is to extend each frequent (k−1)-itemset with frequent items that are not part of the (k−1)-itemset.
⚫ The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset.
⚫ But it does not prevent the same candidate itemset from being generated more than once.
Candidate Generation: Merge Fk-1 and F1 itemsets
Fk-1 x Fk-1 Method

⚫ This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical.
⚫ Let A = {a1, a2, ..., ak−1} and B = {b1, b2, ..., bk−1} be a pair of frequent (k−1)-itemsets, arranged lexicographically. A and B are merged if they satisfy the following condition:
⚫ ai = bi for i = 1, 2, ..., k−2 (their last items, ak−1 and bk−1, therefore differ).
Candidate Generation: Fk-1 x Fk-1 Method

⚫ Merge two frequent (k-1)-itemsets if their first (k-2) items are identical.
⚫ F2 = {Bread, Diaper} + {Bread, Milk}
– Merge these two because they have an identical first item.
– Do not merge {Beer, Diaper} with {Diaper, Milk} because they do not have an identical first item.
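A small sketch of this merge condition, assuming itemsets are kept as lexicographically sorted tuples (names are illustrative):

def merge_fk1_fk1(freq_k1):
    # freq_k1: list of sorted (k-1)-tuples of frequent itemsets, in lexicographic order.
    candidates = []
    for i in range(len(freq_k1)):
        for j in range(i + 1, len(freq_k1)):
            a, b = freq_k1[i], freq_k1[j]
            # Merge only if the first k-2 items are identical (the last items differ).
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.append(tuple(sorted(set(a) | set(b))))
    return candidates

F2 = [("Beer", "Diaper"), ("Bread", "Diaper"), ("Bread", "Milk"), ("Diaper", "Milk")]
print(merge_fk1_fk1(F2))
# Only {Bread, Diaper} and {Bread, Milk} merge, giving ('Bread', 'Diaper', 'Milk');
# {Beer, Diaper} and {Diaper, Milk} are not merged because their first items differ.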
Alternate Fk-1 x Fk-1 Method

⚫ Merge two frequent (k-1)-itemsets if the last (k-2) items of the first one are identical to the first (k-2) items of the second.
⚫ F2 = {Bread, Diaper} + {Diaper, Milk}
– These could be merged using this approach to generate the candidate 3-itemset {Bread, Diaper, Milk}.
Candidate Pruning
⚫ Brute-force candidate generation method: candidate pruning requires checking only k subsets of size k − 1 for each candidate k-itemset.
⚫ Fk−1 × F1 candidate generation: since this ensures that at least one of the (k−1)-size subsets of every candidate k-itemset is frequent, we only need to check the remaining k − 1 subsets.
⚫ Fk−1 × Fk−1 strategy: we need to examine only k − 2 subsets of every candidate k-itemset, since two of its (k−1)-size subsets are already known to be frequent from the candidate generation step.
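A minimal sketch of the pruning check itself: a candidate k-itemset is kept only if all of its (k−1)-subsets are frequent (the data are the running example, chosen here for illustration):

from itertools import combinations

def prune(candidates, frequent_k_minus_1):
    # candidates: iterable of frozensets of size k;
    # frequent_k_minus_1: set of frozensets of size k-1 known to be frequent.
    kept = []
    for c in candidates:
        subsets = (frozenset(s) for s in combinations(c, len(c) - 1))
        if all(s in frequent_k_minus_1 for s in subsets):
            kept.append(c)
    return kept

F2 = {frozenset(x) for x in [("Bread", "Milk"), ("Bread", "Diaper"),
                             ("Milk", "Diaper"), ("Beer", "Diaper")]}
C3 = [frozenset(["Bread", "Milk", "Diaper"]), frozenset(["Beer", "Bread", "Diaper"])]
print(prune(C3, F2))   # only {Bread, Milk, Diaper} survives: {Beer, Bread} is infrequent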
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Minimum Support = 3

Triplets (3-itemsets):
Itemset                Count
{Bread, Diaper, Milk}  2

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.

Use of the Fk-1 x Fk-1 method for candidate generation results in only one 3-itemset. This is eliminated after the support counting step.
Support Counting of Candidate Itemsets

⚫ Support counting is the process of determining the frequency of occurrence of every candidate itemset that survives the candidate pruning step.
⚫ Scan the database of transactions to determine the support of each candidate itemset:
– Brute force: match every candidate itemset against every transaction, which is an expensive operation.
– An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets.

TID  Items                       Candidate 3-itemsets
1    Bread, Milk                 {Beer, Diaper, Milk}
2    Beer, Bread, Diaper, Eggs   {Beer, Bread, Diaper}
3    Beer, Coke, Diaper, Milk    {Bread, Diaper, Milk}
4    Beer, Bread, Diaper, Milk   {Beer, Bread, Milk}
5    Bread, Coke, Diaper, Milk
Support Counting

⚫ Support counting is the process of determining the frequency of occurrence of every candidate itemset that survives the candidate pruning step.
⚫ A brute-force approach for doing this is to
compare each transaction against every
candidate itemset and to update the support
counts of candidates contained in a transaction.
This approach is computationally expensive,
especially when the numbers of transactions and
candidate itemsets are large.
⚫ An alternative approach is to enumerate the
itemsets contained in each transaction and use
them to update the support counts of their
respective candidate itemsets.

⚫ To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are 5C3 = 10 itemsets of size 3 contained in this transaction.
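A quick way to verify the count and enumerate these subsets, using itertools:

from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))
print(len(subsets))   # 10, i.e. 5C3
print(subsets)        # (1,2,3), (1,2,5), (1,2,6), (1,3,5), ..., (3,5,6)
# During support counting, each enumerated 3-itemset is looked up among the
# candidate 3-itemsets and, if present, that candidate's count is incremented.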
Support Counting: An Example
How many of these itemsets are supported by transaction (1,2,3,5,6)?

[Figure: systematic enumeration of the 3-itemsets contained in transaction t = (1, 2, 3, 5, 6) as a prefix tree. Level 1 fixes the first item (1, 2, or 3); Level 2 fixes a two-item prefix (1 2, 1 3, 1 5, 2 3, 2 5, 3 5); Level 3 lists the ten 3-itemsets: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6.]


⚫ For instance, 1 2 3 5 6 represents a 3-itemset
that begins with item 1, followed by two more
items chosen from the set {2, 3, 5, 6}.
⚫ After fixing the first item, the prefix tree structure
at Level 2 represents the number of ways to
select the second item. For example, 1 2 3 5 6
corresponds to itemsets that begin with the prefix
(1 2) and are followed by the items 3, 5, or 6.
⚫ Finally, the prefix tree structure at Level 3
represents the complete set of 3-itemsets
contained in t. For example, the 3-itemsets that
begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and
{1, 2, 6}, while those that begin with prefix {2 3}
are {2, 3, 5} and {2, 3, 6}
⚫ The prefix tree structure shown in the figure demonstrates how the itemsets contained in a transaction can be systematically enumerated by specifying their items one by one, from the leftmost item to the rightmost item.
⚫ We still have to determine whether each
enumerated 3-itemset corresponds to an existing
candidate itemset. If it matches one of the
candidates, then the support count of the
corresponding candidate is incremented.
Support Counting Using a Hash Tree

⚫ In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree.
⚫ During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets.
⚫ Instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket.
Support Counting of Candidate Itemsets

⚫ To reduce the number of comparisons, store the candidate itemsets in a hash structure.
– Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets.

[Figure: the N transactions (TID 1-5 of the running example) on one side and a hash structure of buckets holding the candidate k-itemsets on the other; each transaction is hashed into the buckets whose candidates it could contain.]
Support Counting Using a Hash Tree
Each internal node of the tree uses the following hash function, h(p)=p
mod 3, to determine which branch of the current node should be followed
next.
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)

[Figure: the candidate hash tree built from these 15 itemsets. At each internal node, the hash function h(p) = p mod 3 determines the branch: items 1, 4, 7 go to one branch, items 2, 5, 8 to the second, and items 3, 6, 9 to the third. The leaf nodes store the candidate 3-itemsets.]
Support Counting Using a Hash Tree

[Figure sequence: the same candidate hash tree, highlighting in turn the branch reached by hashing on 1, 4 or 7, on 2, 5 or 8, and on 3, 6 or 9.]

To count the support for transaction t = {1 2 3 5 6}, the transaction is first split according to its possible first items:
1 + {2 3 5 6},  2 + {3 5 6},  3 + {5 6}
Each prefix is hashed and split again at the next level, e.g. 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, and the enumerated 3-itemsets are compared only against the candidates stored in the leaves that are reached.

Result: the transaction is matched against only 11 of the 15 candidate itemsets.
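The following is a simplified, illustrative hash-tree sketch using the slides' hash function h(p) = p mod 3 and a max leaf size of 3; the exact splitting and traversal details of the tree drawn in the figures may differ:

from itertools import combinations

MAX_LEAF_SIZE = 3
MAX_DEPTH = 3          # candidates are 3-itemsets

class Node:
    def __init__(self):
        self.children = {}   # hash value -> Node (internal node)
        self.itemsets = []   # stored candidates (leaf node)
        self.is_leaf = True

def insert(node, itemset, depth=0):
    if node.is_leaf:
        node.itemsets.append(itemset)
        if len(node.itemsets) > MAX_LEAF_SIZE and depth < MAX_DEPTH:
            # Split the leaf: redistribute its itemsets one level down.
            node.is_leaf = False
            stored, node.itemsets = node.itemsets, []
            for s in stored:
                insert(node, s, depth)
        return
    branch = itemset[depth] % 3            # h(p) = p mod 3
    child = node.children.setdefault(branch, Node())
    insert(child, itemset, depth + 1)

def lookup(node, itemset, depth=0):
    # Return the candidates stored in the leaf this itemset hashes to.
    if node.is_leaf:
        return node.itemsets
    branch = itemset[depth] % 3
    child = node.children.get(branch)
    return lookup(child, itemset, depth + 1) if child else []

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
root = Node()
for c in candidates:
    insert(root, c)

# Support counting for one transaction: each 3-subset is hashed down the tree
# and compared only against the candidates in the reached leaf.
t = (1, 2, 3, 5, 6)
matches = 0
for s in combinations(t, 3):
    bucket = lookup(root, s)
    if s in bucket:
        matches += 1
print(matches)   # 3 candidates, namely (1,2,5), (1,3,6) and (3,5,6), are contained in t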
Alternative Methods for Generating Frequent
Itemsets
⚫ Drawback of Apriori:
– The algorithm incurs considerable I/O overhead since
it requires making several passes over the transaction
data set.
– the performance of the Apriori algorithm may degrade
significantly for dense data sets because of the
increasing width of transactions.
⚫ Traversal of Itemset Lattice:
1. General-to-Specific versus Specific-to-General:
⚫ The Apriori algorithm uses a general-to-specific search
strategy, where pairs of frequent (k−1)-itemsets are
merged to obtain candidate k-itemsets.
⚫ This is effective, provided the maximum length of a frequent itemset is not too long.
⚫ A specific-to-general search strategy, in contrast, looks for the more specific frequent itemsets first, before finding the more general frequent itemsets.
– This strategy is useful for discovering maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice.
– It is also possible to combine the general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border.
2. Equivalence Classes:
– first partition the lattice into disjoint groups of nodes
(or equivalence classes). A frequent itemset
generation algorithm searches for frequent itemsets
within a particular equivalence class first before
moving to another equivalence class.
– The Apriori algorithm implicitly does this by partitioning the lattice on the basis of itemset sizes (levels).
– Equivalence classes can also be defined according to the prefix or suffix labels of an itemset.
3. Breadth-First versus Depth-First:
– The Apriori algorithm traverses the lattice in a breadth-first manner: it first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated.
– The depth-first approach is often used by algorithms
designed to find maximal frequent itemsets. This
approach allows the frequent itemset border to be
detected more quickly than using a breadth-first
approach.
⚫ Once a maximal frequent itemset is found, substantial pruning can be performed on its subsets.
⚫ For example, if the node bcde in the lattice is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e, because they will not contain any maximal frequent itemsets.
⚫ Similarly, suppose the support for {a,b,c} is identical to the support for {a,b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to contain any maximal frequent itemsets.
⚫ Representation of Transaction Data Set
– The choice of representation can affect the I/O costs
incurred when computing the support of candidate
itemsets.
1. Horizontal data layout, which is adopted by many
association rule mining algorithms, including Apriori.
2. Another possibility is to store the list of transaction
identifiers (TID-list) associated with each item. Such
a representation is known as the vertical data layout.
– The support for each candidate itemset is obtained by intersecting the TID-lists of its subset items.
– The length of the TID-lists shrinks as we progress to larger sized itemsets.
– However, one problem with this approach is that the
initial set of TID-lists might be too large to fit into main
memory, thus requiring more sophisticated techniques
to compress the TID-lists.
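A small sketch of the vertical data layout on the running example: each item keeps the set of TIDs containing it, and the support of a candidate itemset is obtained by intersecting TID-lists (variable names are illustrative):

transactions = {1: {"Bread", "Milk"},
                2: {"Bread", "Diaper", "Beer", "Eggs"},
                3: {"Milk", "Diaper", "Beer", "Coke"},
                4: {"Bread", "Milk", "Diaper", "Beer"},
                5: {"Bread", "Milk", "Diaper", "Coke"}}

# Build the vertical layout: item -> set of transaction IDs.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

print(tidlists["Beer"])                       # {2, 3, 4}
# Support count of {Milk, Diaper, Beer} = size of the intersection of the TID-lists.
support = len(tidlists["Milk"] & tidlists["Diaper"] & tidlists["Beer"])
print(support)                                # 2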
FP-Growth Algorithm
⚫ The FP-growth algorithm is an improvement over the Apriori algorithm.
⚫ FP-growth finds frequent itemsets in a transaction database without candidate generation.
⚫ FP-growth represents frequent items in a Frequent Pattern (FP) tree, or FP-tree.
⚫ The algorithm does not subscribe to the generate-and-test
paradigm of Apriori.
⚫ Instead, it encodes the data set using a compact data structure
called an FP-tree and extracts frequent itemsets directly from
this structure.
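For a usage-level sketch, a library implementation can also be used; the example below assumes the third-party mlxtend and pandas packages are installed, and the exact API should be checked against the mlxtend documentation:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

dataset = [["Bread", "Milk"],
           ["Bread", "Diaper", "Beer", "Eggs"],
           ["Milk", "Diaper", "Beer", "Coke"],
           ["Bread", "Milk", "Diaper", "Beer"],
           ["Bread", "Milk", "Diaper", "Coke"]]

# One-hot encode the transactions (the binary representation from earlier).
te = TransactionEncoder()
onehot = te.fit(dataset).transform(dataset)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with support >= 0.6 (i.e. support count >= 3 out of 5).
print(fpgrowth(df, min_support=0.6, use_colnames=True))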
FP-Tree Representation
⚫ An FP-tree is a compressed representation of the input data.
⚫ It is constructed by reading the data set one transaction at a
time and mapping each transaction onto a path in the FP-tree.
⚫ As different transactions can have several items in
common, their paths might overlap.
⚫ The more the paths overlap with one another, the more
compression we can achieve using the FP-tree structure.
⚫ If the size of the FP-tree is small enough to fit into main
memory, this will allow us to extract frequent itemsets
directly from the structure in memory instead of making
repeated passes over the data stored on disk.
⚫ Each node in the tree contains the label of an item along
with a counter that shows the number of transactions
mapped onto the given path.
⚫ Initially, the FP-tree contains only the root node
represented by the null symbol.
Steps to extend the FP-tree

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing order of support count inside every transaction of the data set.
2. The algorithm makes a second pass over the data to construct the FP-tree:
– After reading the first transaction, {a,b}, the nodes labeled a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.
– After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d.
– The third transaction, {a,c,d,e}, shares a common prefix item (a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes c, d, and e are equal to one.
– This process continues until every transaction has been mapped onto one of the paths given in the FP-tree.
– An FP-tree also contains a list of pointers connecting nodes that have the same items, represented by dotted lines in the figure.
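A minimal sketch of this construction process for the first three example transactions ({a,b}, {b,c,d}, {a,c,d,e}); the node-link (dotted-pointer) structure is reduced to a simple header dictionary, and names are illustrative:

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def insert_transaction(root, items, header):
    # Insert one transaction (items already sorted by decreasing support).
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            header.setdefault(item, []).append(child)   # node-link pointer
        child.count += 1                                 # shared prefix: bump the count
        node = child

root = FPNode(None, None)      # the FP-tree initially contains only the null root
header = {}
for t in [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"]]:
    insert_transaction(root, t, header)

# After these three transactions, the count of node a under the root is 2,
# because the first and third transactions share the prefix a.
print(root.children["a"].count)   # 2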
⚫ The size of an FP-tree is typically smaller than the size of
the uncompressed data because many transactions in
market basket data often share a few items in common.
⚫ In the best-case scenario, where all the transactions have
the same set of items, the FP-tree contains only a single
branch of nodes.
⚫ The worst-case scenario happens when every transaction has a unique set of items; the size of the FP-tree is then effectively the same as the size of the original data.
⚫ The physical storage of the FP-tree can even be larger, because it requires additional space to store pointers between nodes and counters for each item.
⚫ The size of an FP-tree also depends on how the items are
ordered.
Frequent Itemset Generation in FP-
Growth Algorithm
⚫ FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion.
⚫ FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems.
The task of finding frequent itemsets
ending with e.
⚫ The first step is to gather all the paths containing node e.
These initial paths are called prefix paths.
⚫ From the prefix paths , the support count for e is obtained
by adding the support counts associated with node e.
⚫ Assuming that the minimum support count is 2, {e} is
declared a frequent itemset because its support count is
3.
⚫ Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix.
A conditional FP-tree is obtained as
follows
⚫ First, the support counts along the prefix paths must be
updated because some of the counts include transactions
that do not contain item e. For example, the rightmost
path shown in Figure 5.27(a), null → b:2 → c:2 → e:1,
includes a transaction {b,c} that does not contain item e.
The counts along the prefix path must therefore be
adjusted to 1 to reflect the actual number of transactions
containing {b,c,e}.
⚫ The prefix paths are truncated by removing the nodes for
e. These nodes can be removed because the support
counts along the prefix paths have been updated to
reflect only transactions that contain e and the
subproblems of finding frequent itemsets ending in de, ce,
be, and ae no longer need information about node e.

⚫ After updating the support counts along the prefix paths, some of the items may no longer be frequent. For example, the node b appears only once and has a support count equal to 1, which means that there is only one transaction that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in be must be infrequent.
⚫ FP-growth uses the conditional FP-tree for e to solve the
subproblems of finding frequent itemsets ending in de, ce, and
ae. To find the frequent itemsets ending in de, the prefix paths
for d are gathered from the conditional FP-tree for e (Figure
5.27(c)).
⚫ By adding the frequency counts associated with node d, we
obtain the support count for {d,e}. Since the support count is
equal to 2, {d,e} is declared a frequent itemset.
⚫ Next, the algorithm constructs the conditional FP-tree for de
using the approach described in step 3. After updating the
support counts and removing the infrequent item c, the
conditional FP-tree for de is shown in Figure 5.27(d).
⚫ Since the conditional FP-tree contains only one item, a, whose
support is equal to minsup, the algorithm extracts the frequent
itemset {a,d,e} and moves on to the next subproblem, which is
to generate frequent itemsets ending in ce.
⚫ After processing the prefix paths for c,{c,e}is found to be
frequent. However, the conditional FP-tree for ce will have
no frequent items and thus will be eliminated. The
algorithm proceeds to solve the next subproblem and
finds {a,e} to be the only frequent itemset remaining.
Advantages

⚫ At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items.
⚫ Because the subproblems are disjoint, FP-growth will not
generate any duplicate itemsets.
⚫ In addition, the counts associated with the nodes allow
the algorithm to perform support counting while
generating the common suffix itemsets.
⚫ It illustrates how a compact representation of the
transaction data set helps to efficiently generate frequent
itemsets.
⚫ The run-time performance of FP-growth depends on the
compaction factor of the data set.
Limitation

⚫ If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly, because it has to generate a large number of subproblems and merge the results returned by each subproblem.
Evaluation of Association Patterns
⚫ Association rule algorithms can produce a large number of rules.
⚫ Objective interestingness measures can be used to prune or rank the patterns.
– In the original formulation, support and confidence are the only measures used.
Computing Interestingness Measure

⚫ Given a rule X → Y (or the itemset {X, Y}), the information needed to compute its interestingness can be obtained from a contingency table.

Contingency table for a pair of items A and B:
          B      B̄
A         f11    f10    f1+
Ā         f01    f00    f0+
          f+1    f+0    N

f11: support count of A and B
f10: support count of A and B̄ (A without B)
f01: support count of Ā and B (B without A)
f00: support count of Ā and B̄ (neither A nor B)

These counts are used to define various measures (support, confidence, etc.).
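A small sketch computing the four contingency counts for a rule A → B directly from the running market-basket example (function and variable names are illustrative):

def contingency(transactions, A, B):
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_a, has_b = A <= t, B <= t
        if has_a and has_b:
            f11 += 1
        elif has_a:
            f10 += 1
        elif has_b:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
f11, f10, f01, f00 = contingency(transactions, {"Milk", "Diaper"}, {"Beer"})
print(f11, f10, f01, f00)   # 2 1 1 1 -> support = f11/N = 0.4, confidence = f11/(f11+f10) = 0.67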
Drawback of Confidence

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 15/20 = 0.75

Confidence > 50%, meaning that people who drink tea are more likely to drink coffee than not to drink coffee.
So the rule seems reasonable, but it is misleading.
Drawback of Confidence

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 150/200 = 0.75,

but P(Coffee) = 0.8, which means knowing that a person drinks tea reduces the probability that the person drinks coffee!
Measure for Association Rules

⚫ So, what kind of rules do we really want?
– Confidence(X → Y) should be sufficiently high
◆ To ensure that people who buy X are more likely to buy Y than not to buy Y
– Confidence(X → Y) > support(Y)
◆ Otherwise, the rule will be misleading, because having item X actually reduces the chance of having item Y in the same transaction
◆ Is there any measure that captures this constraint?
– Answer: Yes. There are many of them.
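A quick numeric check of this constraint on the Tea → Coffee example above; the ratio confidence/support(Y), commonly called lift, is one such measure, though the slide does not name it:

confidence_tea_coffee = 150 / 200     # P(Coffee | Tea) = 0.75
support_coffee = 0.8                  # P(Coffee)
print(confidence_tea_coffee > support_coffee)       # False -> the rule is misleading
print(confidence_tea_coffee / support_coffee)       # 0.9375 < 1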


Simpson's Paradox

The hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon that is known as Simpson's paradox.
The rule {HDTV=Yes} → {Exercise machine =
Yes} has a confidence of 99/180 = 55% and
the rule {HDTV = No} → { Exercise machine =
Yes} has a confidence of 54/120 = 45%
The paradox can be explained in the following
way:
• Most customers who buy HDTVs are
working adults.
• Working adults are also the largest group of
customers who buy exercise machines.
• Because nearly 85% of the customers are working adults, the observed relationship between HDTV and exercise machine turns out to be stronger in the combined data than it would have been if the data were stratified.
Thank You
