FP-Growth Algorithm and FP-Tree Construction
November 28, 2024
1 Introduction
The FP-Growth algorithm is an efficient method for frequent pattern mining in
large datasets, which avoids the candidate generation step used in algorithms
like Apriori. It uses a compact data structure called the FP-tree (Frequent
Pattern tree) to represent the database in a compressed form. This allows for
efficient mining of frequent itemsets.
2 Step 1: Calculate Item Frequencies
Before building the FP-tree, we need to calculate the frequency (support) of
each item in the dataset. Support refers to the number of transactions that
contain the item.
2.1 Example Dataset
Consider the following transactions:
Transaction ID Items Purchased
T1 A, B, D, E
T2 B, C, D, E
T3 A, B, D, E, F
T4 B, C, D, E
T5 A, B, D, E, F
2.2 Item Frequency Calculation
The frequency (support) of each item is calculated as follows:
• A: Appears in T1, T3, T5 (3 times)
• B: Appears in T1, T2, T3, T4, T5 (5 times)
• C: Appears in T2, T4 (2 times)
• D: Appears in all transactions (5 times)
• E: Appears in all transactions (5 times)
• F: Appears in T3, T5 (2 times)
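This first counting pass can be sketched in Python using the example transactions above (a minimal sketch; variable names are illustrative):

```python
from collections import Counter

# The five example transactions from the table above.
transactions = [
    ["A", "B", "D", "E"],
    ["B", "C", "D", "E"],
    ["A", "B", "D", "E", "F"],
    ["B", "C", "D", "E"],
    ["A", "B", "D", "E", "F"],
]

# Count how many transactions contain each item (its support count).
# Items never repeat within a transaction here, so a flat count works.
support = Counter(item for t in transactions for item in t)
print(support["B"], support["A"], support["F"])  # 5 3 2
```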
3 Step 2: Build the FP-Tree
Once the item frequencies are calculated, items are sorted in descending order of
their frequencies. Items whose support falls below the minimum support threshold
are removed before constructing the tree. In this example every item has support
of at least 2, so with a minimum support of 2 nothing is pruned.
3.1 Sort Items by Frequency
The items are sorted in descending order of frequency:
B : 5, D : 5, E : 5, A : 3, C : 2, F : 2
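The ordering step can be sketched as follows. The alphabetical tie-break among items with equal support is an assumption made here for determinism; any fixed tie-break rule works:

```python
# Support counts from Step 1.
support = {"A": 3, "B": 5, "C": 2, "D": 5, "E": 5, "F": 2}

# Sort by descending support, breaking ties alphabetically.
order = sorted(support, key=lambda item: (-support[item], item))
print(order)  # ['B', 'D', 'E', 'A', 'C', 'F']
```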
3.2 Insert Transactions into the FP-Tree
Now, we insert each transaction into the FP-tree, sorting the items according
to the frequency list and building the tree. We either increment the count of an
existing node or create a new node.
3.2.1 Inserting Transactions
• T1: A, B, D, E → Sorted: B, D, E, A
• T2: B, C, D, E → Sorted: B, D, E, C
• T3: A, B, D, E, F → Sorted: B, D, E, A, F
• T4: B, C, D, E → Sorted: B, D, E, C
• T5: A, B, D, E, F → Sorted: B, D, E, A, F
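Reordering each transaction against the global frequency order can be sketched like this (the `rank` mapping is a hypothetical helper, not part of any canonical API):

```python
# Global item order from Step 2, mapped to ranks for sorting.
rank = {item: i for i, item in enumerate(["B", "D", "E", "A", "C", "F"])}

def reorder(transaction):
    # Drop items pruned by the support threshold (none in this example),
    # then sort the rest by their global rank.
    return sorted((i for i in transaction if i in rank), key=rank.get)

print(reorder(["A", "B", "D", "E", "F"]))  # ['B', 'D', 'E', 'A', 'F']
```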
3.3 FP-Tree Structure
After inserting all transactions, the FP-tree looks like this:
[ROOT]
   |
 (B:5)
   |
 (D:5)
   |
 (E:5)
   |
 +-------+-------+
 |               |
(A:3)          (C:2)
 |
(F:2)
Here:
• ROOT is the root node.
• Every sorted transaction begins with B, D, E, so each node on that path
carries a count of 5.
• Below E the tree branches: T1, T3, and T5 continue with A (count 3), and
of those, T3 and T5 continue further with F (count 2).
• T2 and T4 continue with C (count 2).
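The insertion procedure from Section 3.2 can be sketched as a small Python class. The class and method names are illustrative, not canonical:

```python
class Node:
    """A minimal FP-tree node: an item label, a count, a parent link,
    and a dictionary of children keyed by item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def insert(root, transaction):
    # Walk down from the root: reuse an existing child when the item
    # matches, otherwise create a new node; increment counts as we go.
    node = root
    for item in transaction:
        if item not in node.children:
            node.children[item] = Node(item, node)
        node = node.children[item]
        node.count += 1

root = Node(None, None)
sorted_transactions = [
    ["B", "D", "E", "A"],
    ["B", "D", "E", "C"],
    ["B", "D", "E", "A", "F"],
    ["B", "D", "E", "C"],
    ["B", "D", "E", "A", "F"],
]
for t in sorted_transactions:
    insert(root, t)

b = root.children["B"]
print(b.count)  # 5
```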
4 Step 3: Header Table
The header table is an essential part of the FP-tree. It stores the items and
their frequencies and provides links to the nodes in the tree. The header table
facilitates efficient mining by allowing easy traversal of the FP-tree.
4.1 Header Table Example
For the FP-tree above, the header table looks like this:
Item Frequency Linked Nodes
B 5 (B : 5)
D 5 (D : 5)
E 5 (E : 5)
A 3 (A : 3)
C 2 (C : 2)
F 2 (F : 2)
In this particular tree each item happens to occur at exactly one node, so
each header entry links to a single node. In general, an item can occur at
several nodes, and the header entry chains all of them together through
node-link pointers.
5 Step 4: Mining the FP-Tree
The mining process involves recursively extracting frequent itemsets from the
FP-tree by examining the header table and conditional pattern bases.
5.1 Conditional Pattern Base for Item B
To mine the patterns related to item B, we look at all paths that contain B and
trace back to the root. The conditional pattern base for B consists of all items
appearing with B in the transactions.
Transactions containing B:
T 1 : B, D, E, A
T 2 : B, D, E, C
T 3 : B, D, E, A, F
T 4 : B, D, E, C
T 5 : B, D, E, A, F
The conditional pattern base for B is:
{D, E, A, F }, {D, E, C}
We then recursively build a conditional FP-tree for this pattern base and
continue mining for further frequent itemsets.
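Collecting a conditional pattern base can be sketched as follows. The node structure (`.item`, `.count`, `.parent` attributes) is an assumption matching the earlier tree sketch; the single-path example mirrors item A in the tree above:

```python
class Node:
    """Minimal node with a parent link, as used in the tree sketch."""
    def __init__(self, item, count, parent):
        self.item, self.count, self.parent = item, count, parent

def conditional_pattern_base(nodes):
    # For each node of the target item, walk parent links up to the
    # root, collecting the prefix path weighted by the node's count.
    base = []
    for node in nodes:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        base.append((list(reversed(path)), node.count))
    return base

# Rebuild just the path ROOT -> B -> D -> E -> A from the example tree.
root = Node(None, 0, None)
b = Node("B", 5, root)
d = Node("D", 5, b)
e = Node("E", 5, d)
a = Node("A", 3, e)
print(conditional_pattern_base([a]))  # [(['B', 'D', 'E'], 3)]
```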
6 Step 5: Recursive Mining and Frequent Itemsets
The mining process continues recursively for each item in the header table, and
we extract frequent itemsets. The frequent itemsets for the given dataset could
include:
{B} : Support = 5
{D} : Support = 5
{E} : Support = 5
{A} : Support = 3
{B, D} : Support = 5
{B, E} : Support = 5
{B, D, E} : Support = 5
{A, B} : Support = 3
{A, D} : Support = 3
{A, B, D} : Support = 3
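The supports above can be double-checked by brute force against the raw transactions, which is a useful sanity test when implementing FP-Growth:

```python
# The five example transactions, as sets for fast subset tests.
transactions = [
    {"A", "B", "D", "E"},
    {"B", "C", "D", "E"},
    {"A", "B", "D", "E", "F"},
    {"B", "C", "D", "E"},
    {"A", "B", "D", "E", "F"},
]

def support(itemset):
    # Count the transactions that contain every item of the itemset.
    return sum(1 for t in transactions if set(itemset) <= t)

print(support({"B", "D", "E"}))  # 5
print(support({"A", "B", "D"}))  # 3
```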
7 Advantages of FP-Growth
• No Candidate Generation: Unlike Apriori, FP-Growth does not generate
candidate itemsets, which reduces computation time.
• Efficient Memory Use: The FP-tree is a compressed representation of
the dataset, saving memory.
• Scalability: FP-Growth scales well with large datasets due to its efficient
use of memory and reduced I/O operations.
8 Conclusion
The FP-Growth algorithm is a powerful tool for frequent pattern mining. By
using the FP-tree and header table, FP-Growth efficiently mines frequent itemsets
without the need for candidate generation, making it faster and more scalable
than other algorithms like Apriori.