Data Mining Chapter 2: Market Basket Analysis
This chapter covers machine learning methods for identifying associations among items in transactional
data—a practice commonly known as market basket analysis due to its widespread use among retail
stores.
To remember:
- Association rules: unsupervised learning on unlabeled data; the algorithm does not need to be trained
- Useful for large amounts of transactional data (Big Data)
- Rules are evaluated with statistical measures of "interestingness":
  - Confidence => reliability, accuracy of the rule
  - Lift => the strength of the association relative to what would be expected if A and B were independent
- Density = the proportion of non-zero cells in the sparse matrix
Support:
support(A) = count(A) / N, where N is the total number of transactions.
Apriori uses statistical measures of "interestingness", which help evaluate the usefulness and
significance of discovered rules. The most common measures include:
1- Confidence
● Definition: The proportion of transactions containing A in which B also appears; it measures the
rule's reliability.
● Formula: confidence(A → B) = support(A ∩ B) / support(A)
2- Lift
● Definition: The ratio of the observed support of A and B appearing together to the expected
support if A and B were independent.
● Formula: lift(A → B) = support(A ∩ B) / (support(A) × support(B)) = confidence(A → B) / support(B)
● Purpose: Indicates the strength of the association relative to random chance.
○ Lift > 1: Positive association (items occur together more often than by chance); the two
occurrences are dependent on one another. A large lift value is therefore a strong indicator
that a rule is important and reflects a true connection between the items.
○ Lift = 1: No association; the items are independent.
○ Lift < 1: Negative association (items occur together less often than by chance); the
items may be substitutes for each other.
Note: Unlike confidence, where the order of the items matters, lift(X → Y) is the same as lift(Y → X).
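The three measures above can be checked by hand. The following is an illustrative sketch (not from the chapter) that computes support, confidence, and lift in base R on a hypothetical toy set of five transactions:

```r
# Toy transaction data (made up for illustration)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("bread")
)
n <- length(transactions)

# support(items): fraction of transactions containing all of the given items
support <- function(items) {
  sum(sapply(transactions, function(t) all(items %in% t))) / n
}

s_milk_bread <- support(c("milk", "bread"))                    # 2/5 = 0.4
conf <- s_milk_bread / support("milk")                         # confidence(milk -> bread) = 2/3
lift <- s_milk_bread / (support("milk") * support("bread"))    # 0.4 / (0.6 * 0.8) = 5/6
```

Here lift is below 1, so in this toy data milk and bread appear together slightly less often than chance would predict.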
Strong rules have high support and confidence, making them valuable for decision-making, such as
optimizing product placement in a store. However, checking all possible item combinations becomes
impractical for large datasets. To address this, the Apriori algorithm applies minimum thresholds for
support and confidence to efficiently identify the most relevant rules.
Strengths
- Is ideally suited for working with very large amounts of transactional data
- Results in rules that are easy to understand
- Useful for "data mining" and discovering unexpected knowledge in databases
Weaknesses
- Not very helpful for small datasets
- Takes effort to separate the insight from the common sense
- Easy to draw spurious conclusions from random patterns
R application:
Data extraction and preparation:

groceries <- read.transactions("groceries.csv", sep = ",")
Similar to read.csv() except that it results in a sparse matrix suitable for transactional data.
The parameter sep = "," specifies that items in the input file are separated by a comma.

summary(groceries)
Used to see some basic information about the groceries dataset.

itemFrequency(groceries[, 1:3])
Allows us to see the proportion of transactions that contain each of the first three items.

itemFrequencyPlot(groceries, support = 0.1)
Results in a histogram showing the items in the data with at least 10 percent support.

image(sample(groceries, 100))
Allows us to view the sparse matrix for a randomly sampled set of 100 transactions.

groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
This saves our rules in a rules object, which we can peek into by typing its name:
groceryrules
set of 463 rules
Our groceryrules object contains a set of 463 association rules. To determine whether any of
them are useful, we'll have to dig deeper.

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
Finds any rules with berries appearing in the rule, then prints them.
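Beyond subsetting, a natural next step is to rank rules by an interestingness measure using the arules package's sort() and inspect() functions. The following standalone sketch (the toy transaction list is made up for illustration) mines rules from a tiny dataset and ranks them by lift; the same sort()/inspect() pattern applies directly to groceryrules:

```r
library(arules)  # provides transactions objects, apriori(), sort(), inspect()

# Build a tiny transactions object from a list of item vectors
toy <- as(list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter")
), "transactions")

# Mine rules with minimum support and confidence thresholds
toyrules <- apriori(toy, parameter = list(support = 0.25, confidence = 0.5, minlen = 2))

# Rank the discovered rules by lift and display the strongest ones
inspect(sort(toyrules, by = "lift"))
```

Sorting by "confidence" or "support" instead of "lift" works the same way.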