Data Mining - Module 2
Association Analysis
By
Lohith C
Association Rule Mining
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
⚫ Association analysis is useful for discovering interesting
relationships hidden in large data sets.
⚫ The uncovered relationships can be represented as frequent
itemsets (sets of items that appear together in many
transactions) or as association rules, which represent
relationships between two itemsets.
⚫ Two key issues:
⚫ First, discovering patterns from a large transaction data
set can be computationally expensive.
⚫ Second, some of the discovered patterns may be
spurious (happen simply by chance) and even for non-
spurious patterns, some are more interesting than others.
Problem Definition: Frequent Itemset
⚫ Binary Representation
– Market basket data can be represented
in a binary format
– each row corresponds to a transaction
– each column corresponds to an item.
⚫ Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items
⚫ Support count (σ)
– The number of transactions that contain a particular itemset
(the frequency of occurrence of an itemset).
– The support count σ(X) for an itemset X can be stated as follows:
  σ(X) = |{ tᵢ | X ⊆ tᵢ, tᵢ ∈ T }|, where T is the set of all transactions.
– E.g. σ({Milk, Bread, Diaper}) = 2

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

⚫ Confidence example for the rule {Milk, Diaper} → {Beer}:
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
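These quantities are easy to compute directly from the transactions. The following is a minimal Python sketch of the support count σ(X) and of the confidence calculation shown above; the variable and function names are illustrative, not from any library:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X
    x = set(itemset)
    return sum(1 for t in transactions if x <= t)

def confidence(lhs, rhs, transactions):
    # c(X -> Y) = sigma(X union Y) / sigma(X)
    return support_count(set(lhs) | set(rhs), transactions) / support_count(lhs, transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))    # 2/3 = 0.67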
Association Rule Mining Task
⚫ Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Prohibitively expensive.
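As a rough illustration of why this is prohibitive, the sketch below enumerates every possible rule X → Y and filters by the thresholds; for d items there are 3^d − 2^(d+1) + 1 possible rules, which grows exponentially. This is an illustrative Python sketch, not an algorithm from the slides:

from itertools import combinations

def brute_force_rules(transactions, minsup, minconf):
    # Enumerate every rule X -> Y (X, Y non-empty and disjoint) and keep those
    # meeting the support and confidence thresholds.  For d items there are
    # 3**d - 2**(d + 1) + 1 possible rules, so this only works for tiny data.
    items = sorted(set().union(*transactions))
    n = len(transactions)
    sigma = lambda x: sum(1 for t in transactions if set(x) <= t)  # support count
    rules = []
    for k in range(2, len(items) + 1):             # every itemset Z with |Z| >= 2
        for z in combinations(items, k):
            sup_z = sigma(z)
            if sup_z / n < minsup:                 # rule support = support of Z
                continue
            for i in range(1, len(z)):             # every binary partition of Z
                for x in combinations(z, i):
                    y = tuple(it for it in z if it not in x)
                    if sup_z / sigma(x) >= minconf:
                        rules.append((x, y, sup_z / n, sup_z / sigma(x)))
    return rules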
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s = 2/5 = 0.4, c = 2/3 = 0.67)
{Milk, Beer} → {Diaper}   (s = 2/5 = 0.4, c = 2/2 = 1.0)
{Diaper, Beer} → {Milk}   (s = 2/5 = 0.4, c = 2/3 = 0.67)
{Beer} → {Milk, Diaper}   (s = 2/5 = 0.4, c = 2/3 = 0.67)
{Diaper} → {Milk, Beer}   (s = 2/5 = 0.4, c = 2/4 = 0.5)
{Milk} → {Diaper, Beer}   (s = 2/5 = 0.4, c = 2/4 = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
⚫ Association rule mining is commonly treated as a two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support is at least minsup. These
itemsets are called frequent itemsets.
2. Rule Generation
– Generate high-confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset.
These rules are called strong rules.
⚫ Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
[Figure: itemset lattice over the items {A, B, C, D, E} — the null set at the top, then all 1-itemsets, 2-itemsets, 3-itemsets, and so on down to the full itemset ABCDE.]
⚫ Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Determine the support count of each candidate by
scanning the database.
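A literal implementation of this brute-force strategy would test all 2^d − 1 non-empty itemsets for d items, scanning the database for each candidate. A minimal illustrative Python sketch (the function name is not from any library):

from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup_count):
    # Every non-empty subset of the item universe is a candidate (2**d - 1 of
    # them for d items); each candidate's support is found by scanning the
    # whole database, which is what makes this approach prohibitively expensive.
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= minsup_count:
                frequent[candidate] = count
    return frequent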
Frequent Itemset Generation Strategies
⚫ Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent.
⚫ The strategy of trimming the exponential search
space based on the support measure is known as
support-based pruning.
⚫ Apriori principle holds due to the following property
of the support measure:
⚫ For every pair of itemsets X ⊆ Y, we have σ(Y) ≤ σ(X).
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support.
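In an implementation, the anti-monotone property is what justifies discarding a candidate as soon as any of its subsets is known to be infrequent. A minimal sketch of that check, with illustrative names:

from itertools import combinations

def survives_apriori_pruning(candidate, frequent_k_minus_1):
    # A candidate k-itemset can be frequent only if every one of its
    # (k-1)-subsets is already frequent (anti-monotone property of support).
    k = len(candidate)
    return all(frozenset(sub) in frequent_k_minus_1
               for sub in combinations(candidate, k - 1))

# If {a, b} is infrequent, every candidate containing {a, b} is pruned:
F2 = {frozenset(p) for p in [("a", "c"), ("b", "c"), ("c", "d")]}
print(survives_apriori_pruning(("a", "b", "c"), F2))   # False: {a, b} is not in F2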
Illustrating Apriori Principle
Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e}
must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if
{c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be
infrequent too.
Illustrating Apriori Principle
Figure: An illustration of support-based pruning. If {a, b} is infrequent, then all
supersets of {a, b} are infrequent.

[Lattice diagram: the itemset {a, b} is found to be infrequent, so all of its supersets, up to the full itemset, are pruned from the search space.]
Illustrating Apriori Principle
TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3

With minimum support = 3, the frequent 1-itemsets are Bread, Milk, Beer, and Diaper; Coke and Eggs are pruned.
⚫ F2 contains {Bread, Diaper} and {Bread, Milk}
– Merge these two because they have an identical first item
(Bread), producing the candidate 3-itemset {Bread, Diaper, Milk}.
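A hedged sketch of this candidate-generation step (the F_{k−1} × F_{k−1} merge followed by Apriori pruning); function and variable names are illustrative:

from itertools import combinations

def generate_candidates(freq_k_minus_1):
    # F_{k-1} x F_{k-1} candidate generation: merge two sorted frequent
    # (k-1)-itemsets whose first k-2 items are identical, then keep the
    # resulting k-itemset only if all of its (k-1)-subsets are frequent.
    freq = sorted(tuple(sorted(f)) for f in freq_k_minus_1)
    freq_set = {frozenset(f) for f in freq}
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] == b[:-1]:                              # identical first k-2 items
                cand = a[:-1] + tuple(sorted((a[-1], b[-1])))
                if all(frozenset(s) in freq_set               # Apriori pruning test
                       for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

# Merging {Bread, Diaper} and {Bread, Milk} (same first item) yields {Bread, Diaper, Milk}:
print(generate_candidates([("Bread", "Diaper"), ("Bread", "Milk"), ("Diaper", "Milk")]))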
[Figure: systematic enumeration of all 3-itemsets contained in the transaction t = {1, 2, 3, 5, 6}. Level 1 fixes the first item (1, 2, or 3), level 2 fixes the second item, and level 3 lists the complete 3-itemsets, with items always taken in increasing order.]
[Figure: hash tree holding the 15 candidate 3-itemsets 145, 124, 457, 125, 458, 159, 136, 345, 356, 357, 367, 368, 689, 234, 567. At each internal node the hash function sends items 1, 4, 7 to one branch, items 2, 5, 8 to another, and items 3, 6, 9 to the third; the leaf nodes store the candidate itemsets.]
Support Counting Using a Hash Tree

[Figures: the same hash tree, highlighting in turn the branches reached by hashing on items 1, 4 or 7, on items 2, 5 or 8, and on items 3, 6 or 9. To count support for the transaction {1, 2, 3, 5, 6}, the transaction is hashed recursively down the tree: at level 1 it splits into 1 + 2356, 2 + 356 and 3 + 56; at level 2 the first branch splits further into 12 + 356, 13 + 56 and 15 + 6; and so on until the reachable leaf nodes are compared against the transaction.]

Match transaction against 11 out of 15 candidates.
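The hash-tree idea in the figures above can be sketched in Python as follows. This is a simplified illustration (each candidate is stored in a leaf addressed by its full hash path, and leaf splitting is omitted), not the exact data structure from the slides:

from itertools import combinations
from collections import defaultdict

class SimpleHashTree:
    # Simplified hash tree for candidate k-itemsets: every level hashes one item
    # of the sorted candidate, so each candidate sits in a leaf addressed by its
    # full hash path.  (A full implementation would split leaves lazily instead.)
    def __init__(self, k, n_buckets=3):
        self.k = k
        self.n = n_buckets
        self.leaves = defaultdict(list)     # hash path -> candidates stored there
        self.counts = defaultdict(int)

    def _path(self, itemset):
        # items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 fall into three different buckets
        return tuple(item % self.n for item in itemset)

    def insert(self, candidate):
        cand = tuple(sorted(candidate))
        self.leaves[self._path(cand)].append(cand)

    def add_transaction(self, transaction):
        # Hash every k-subset of the transaction to find the reachable leaves,
        # then match the transaction only against the candidates in those leaves.
        t = tuple(sorted(transaction))
        seen = set()
        for subset in combinations(t, self.k):
            path = self._path(subset)
            if path in seen:
                continue
            seen.add(path)
            for cand in self.leaves[path]:
                if set(cand) <= set(t):
                    self.counts[cand] += 1

# The 15 candidate 3-itemsets from the figure:
tree = SimpleHashTree(k=3)
for c in [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6),
          (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8), (2, 3, 4), (5, 6, 7)]:
    tree.insert(c)
tree.add_transaction((1, 2, 3, 5, 6))
print(dict(tree.counts))   # only candidates actually contained in the transaction are counted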
Alternative Methods for Generating Frequent
Itemsets
⚫ Drawback of Apriori:
– The algorithm incurs considerable I/O overhead since
it requires making several passes over the transaction
data set.
– the performance of the Apriori algorithm may degrade
significantly for dense data sets because of the
increasing width of transactions.
⚫ Traversal of Itemset Lattice:
1. General-to-Specific versus Specific-to-General:
⚫ The Apriori algorithm uses a general-to-specific search
strategy, where pairs of frequent (k−1)-itemsets are
merged to obtain candidate k-itemsets.
⚫ This strategy is effective provided the maximum length of a
frequent itemset is not too long.
– In contrast, a specific-to-general search strategy looks for more
specific frequent itemsets first, before finding the more general
frequent itemsets.
– This strategy is useful to discover maximal frequent
itemsets in dense transactions, where the frequent
itemset border is located near the bottom of the
lattice.
– A third alternative is to combine both general-to-specific and
specific-to-general search strategies. This bidirectional approach
requires more space to store the candidate itemsets,
but it can help to rapidly identify the frequent itemset
border.
2. Equivalence Classes:
– first partition the lattice into disjoint groups of nodes
(or equivalence classes). A frequent itemset
generation algorithm searches for frequent itemsets
within a particular equivalence class first before
moving to another equivalence class.
– The Apriori algorithm can be viewed as partitioning the lattice
on the basis of itemset sizes, i.e., level by level.
– Equivalence classes can also be defined according to
the prefix or suffix labels of an itemset.
3. Breadth-First versus Depth-First:
– The Apriori algorithm traverses the lattice in a breadth-first
manner: it first discovers all the frequent 1-itemsets, followed
by the frequent 2-itemsets, and so on, until no new frequent
itemsets are generated.
– The depth-first approach is often used by algorithms
designed to find maximal frequent itemsets. This
approach allows the frequent itemset border to be
detected more quickly than using a breadth-first
approach.
⚫Once a maximal frequent itemset is found,
substantial pruning can be performed on its
subsets.
⚫ For example, if the node bcde in the lattice is maximal
frequent, then the algorithm does not have to visit the
subtrees rooted at bd, be, c, d, and e, because they will
not contain any maximal frequent itemsets.
⚫ suppose the support for {a,b,c} is identical to the
support for {a,b}. The subtrees rooted at abd and
abe can be skipped because they are guaranteed
not to have any maximal frequent itemsets.
⚫ Representation of Transaction Data Set
– The choice of representation can affect the I/O costs
incurred when computing the support of candidate
itemsets.
1. Horizontal data layout, which is adopted by many
association rule mining algorithms, including Apriori.
2. Another possibility is to store the list of transaction
identifiers (TID-list) associated with each item. Such
a representation is known as the vertical data layout.
– The support for each candidate itemset is obtained by intersecting
the TID-lists of its subset items.
– The length of the TID-lists shrinks as we progress to larger-sized
itemsets.
– However, one problem with this approach is that the
initial set of TID-lists might be too large to fit into main
memory, thus requiring more sophisticated techniques
to compress the TID-lists.
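A minimal sketch of the vertical layout and the TID-list intersection described above (the names are illustrative, not from any library):

def vertical_layout(transactions):
    # Build the vertical representation: item -> set of transaction IDs.
    tid_lists = {}
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            tid_lists.setdefault(item, set()).add(tid)
    return tid_lists

def support_from_tidlists(itemset, tid_lists):
    # sigma(X) = size of the intersection of the TID-lists of the items in X.
    items = iter(itemset)
    acc = set(tid_lists[next(items)])
    for item in items:
        acc &= tid_lists[item]      # TID-lists shrink as the itemset grows
    return len(acc)

tids = vertical_layout([
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
])
print(support_from_tidlists({"Milk", "Diaper", "Beer"}, tids))   # 2 (TIDs 3 and 4)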
FP-Growth Algorithm
⚫ The FP-growth algorithm is an improvement over the Apriori algorithm.
⚫ FP-growth finds frequent itemsets in a transaction database
without candidate generation.
⚫ FP-growth represents frequent items in a Frequent Pattern tree,
or FP-tree.
⚫ The algorithm does not subscribe to the generate-and-test
paradigm of Apriori.
⚫ Instead, it encodes the data set using a compact data structure
called an FP-tree and extracts frequent itemsets directly from
this structure.
FP-Tree Representation
⚫ An FP-tree is a compressed representation of the input data.
⚫ It is constructed by reading the data set one transaction at a
time and mapping each transaction onto a path in the FP-tree.
⚫ As different transactions can have several items in
common, their paths might overlap.
⚫ The more the paths overlap with one another, the more
compression we can achieve using the FP-tree structure.
⚫ If the size of the FP-tree is small enough to fit into main
memory, this will allow us to extract frequent itemsets
directly from the structure in memory instead of making
repeated passes over the data stored on disk.
⚫ Each node in the tree contains the label of an item along
with a counter that shows the number of transactions
mapped onto the given path.
⚫ Initially, the FP-tree contains only the root node
represented by the null symbol.
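A minimal sketch of the node structure and of mapping one transaction onto a path (illustrative only; a full FP-growth implementation also maintains a header table and node links between nodes carrying the same item label):

class FPNode:
    # FP-tree node: an item label, a counter, and children keyed by item.
    def __init__(self, item=None, parent=None):
        self.item = item            # None for the root ("null") node
        self.count = 0
        self.parent = parent
        self.children = {}

def insert_transaction(root, items):
    # Map one transaction onto a path starting at the root, incrementing the
    # counter of every node on the path.  `items` should already be filtered
    # to frequent items and sorted in decreasing support order, so that
    # transactions sharing a prefix share a path (this overlap is where the
    # compression of the FP-tree comes from).
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1
        node = child

# Initially the tree contains only the root node; transactions extend it one at a time.
root = FPNode()
insert_transaction(root, ["Bread", "Milk"])
insert_transaction(root, ["Bread", "Diaper", "Beer"])
insert_transaction(root, ["Bread", "Milk", "Diaper"])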
Steps to extend the FP-tree