Introduction to Data Mining
Assignment #2
Q#1: A database has five transactions. Let min-sup = 60% and min-conf = 80%.
TID Items-bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of
the two mining processes.
List all the strong association rules (with support s and confidence c) matching the following
metarule, where X is a variable representing customers, and each item_i denotes a variable
representing an item (e.g., “A,” “B”):
∀X ∈ transaction, buys(X, item_1) ∧ buys(X, item_2) ⇒ buys(X, item_3) [s, c]
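As a way to check hand-computed supports and confidences for Q#1, the following is a minimal Python sketch. It uses plain brute-force enumeration, not Apriori or FP-growth, so it says nothing about the efficiency comparison the question asks for. The function names (support, frequent_itemsets, confidence) are hypothetical helpers, and each transaction is stored as a set, so the repeated O in T500 counts once.

```python
from itertools import combinations

# Transactions from the table in Q#1, stored as sets of single-letter items.
transactions = {
    "T100": set("MONKEY"),
    "T200": set("DONKEY"),
    "T300": set("MAKE"),
    "T400": set("MUCKY"),
    "T500": set("COOKIE"),
}

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in db.values()) / len(db)

def frequent_itemsets(db, min_sup):
    """Brute force: test every candidate of size k = 1, 2, ... in turn."""
    universe = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(universe) + 1):
        found = {}
        for candidate in combinations(universe, k):
            s = support(candidate, db)
            if s >= min_sup:
                found[candidate] = s
        if not found:
            break  # no frequent k-itemset, so no larger itemset can be frequent
        result.update(found)
    return result

def confidence(antecedent, consequent, db):
    """conf(A => B) = support(A union B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)
```

Calling frequent_itemsets(transactions, 0.6) returns a dictionary that maps each frequent itemset (as a sorted tuple) to its support; confidence can then be applied to candidate rules of the metarule's shape, with two items in the antecedent and one in the consequent.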
Q#2: (Implementation project) Using a programming language that you are familiar with, such
as C++ or Java, implement three frequent itemset mining algorithms introduced in this chapter:
(1) Apriori [AS94b], (2) FP-growth [HPY00], and (3) Eclat [Zak00] (mining using the
vertical data format). Compare the performance of each algorithm with various kinds of large
data sets. Write a report to analyze the situations (e.g., data size, data distribution, minimal
support threshold setting, and pattern density) where one algorithm may perform better than the
others, and explain why.
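The vertical data format mentioned for Eclat can be illustrated with a small sketch. This is only the core idea, assuming transactions are given as a dictionary from transaction ID to item set: each item is mapped to its set of transaction IDs (its tidset), and the support count of an itemset is the size of the intersection of its items' tidsets. The helper names and the toy data are hypothetical, and the sketch omits the depth-first search a full Eclat implementation would need.

```python
from collections import defaultdict

def to_vertical(db):
    """Map each item to the set of transaction IDs (tidset) containing it."""
    tidsets = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)

def itemset_tids(itemset, tidsets):
    """Tidset of an itemset: the intersection of its items' tidsets."""
    return set.intersection(*(tidsets[item] for item in itemset))

# Hypothetical toy data, not taken from the assignment.
toy_db = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}
vertical = to_vertical(toy_db)
support_count_ab = len(itemset_tids(("a", "b"), vertical))
```

Because support counting reduces to set intersection, no rescan of the original transactions is needed once the vertical layout is built, which is one of the trade-offs the report can examine.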
Q#3: Give a short example to show that items in a strong association rule actually may be
negatively correlated.
Q#4: The following contingency table summarizes supermarket transaction data, where hot dogs
refers to the transactions containing hot dogs, ¬hot dogs refers to the transactions that do not
contain hot dogs, hamburgers refers to the transactions containing hamburgers, and ¬hamburgers
refers to the transactions that do not contain hamburgers.
(a) Suppose that the association rule “hot dogs ⇒ hamburgers” is mined. Given a minimum
support threshold of 25% and a minimum confidence threshold of 50%, is this association rule
strong?
(b) Based on the given data, is the purchase of hot dogs independent of the purchase of
hamburgers? If not, what kind of correlation relationship exists between the two?
(c) Compare the use of the all_confidence, max_confidence, Kulczynski, and cosine measures
with lift and correlation on the given data.
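The measures named in part (c) can all be computed from four counts of a 2×2 contingency table. The sketch below assumes the standard definitions: lift = P(A ∪ B) / (P(A) P(B)), all_confidence = min(P(A|B), P(B|A)), max_confidence = max(P(A|B), P(B|A)), Kulczynski = the average of the two conditional probabilities, and cosine = P(A ∪ B) / sqrt(P(A) P(B)), where P(A ∪ B) denotes the probability that a transaction contains both A and B. The function name, parameter names, and the counts in the example call are hypothetical and are not the values from the assignment's table.

```python
import math

def measures(n_ab, n_a, n_b, n):
    """Pattern-evaluation measures for items A and B from raw counts.

    n_ab: transactions containing both A and B
    n_a:  transactions containing A
    n_b:  transactions containing B
    n:    total number of transactions
    """
    p_ab, p_a, p_b = n_ab / n, n_a / n, n_b / n
    conf_a_b = n_ab / n_a  # P(B | A), confidence of A => B
    conf_b_a = n_ab / n_b  # P(A | B), confidence of B => A
    return {
        "lift": p_ab / (p_a * p_b),
        "all_confidence": min(conf_a_b, conf_b_a),
        "max_confidence": max(conf_a_b, conf_b_a),
        "kulczynski": 0.5 * (conf_a_b + conf_b_a),
        "cosine": n_ab / math.sqrt(n_a * n_b),
    }

# Hypothetical counts chosen for illustration only.
example = measures(n_ab=40, n_a=50, n_b=80, n=100)
```

Only lift depends on the total count n; the other four measures use just n_ab, n_a, and n_b, so they are unaffected by transactions that contain neither item (they are null-invariant).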