Data Mining Chapter 2: Market Basket Analysis

Chapter 2 discusses market basket analysis using association rules to identify relationships among items in transactional data, particularly in retail. It introduces the Apriori algorithm for mining association rules, emphasizing statistical measures like support, confidence, and lift to evaluate rule significance. The chapter also covers the strengths and weaknesses of the method, along with practical applications in R for data extraction, model training, and performance evaluation.


Chapter 2: Market Basket Analysis - Association Rules

This chapter covers machine learning methods for identifying associations among items in transactional
data, a practice commonly known as market basket analysis because of its widespread use in retail
stores.
To remember:
- Association rules: unsupervised learning on unlabeled data, so the algorithm does not need to be trained on labeled examples.
- Useful for large amounts of transactional data (Big Data).
- Statistical measures of "interestingness":
  - Confidence => reliability, accuracy
  - Lift => the strength of the association relative to what would be expected if A and B were independent
- Density value = the proportion of non-zero cells in the sparse matrix.

Understanding association rules


The result of a market basket analysis is a set of association rules that specify patterns of relationships
among items.
A typical rule might be expressed in the form:
{ peanut butter, jelly } → { bread }
Developed in the context of Big Data and database science, association rules are not used for
prediction, but rather for unsupervised knowledge discovery in large databases, unlike the classification
and numeric prediction algorithms seen before.

The Apriori algorithm for association rule learning


The challenge in association rule mining arises from the vast number of potential itemsets, which grow
exponentially with the number of items (2^k possible combinations). Apriori addresses this by leveraging
the Apriori property, which states that all subsets of a frequent itemset must also be frequent. This
allows the algorithm to eliminate infrequent item combinations early, significantly reducing the search
space.

Support:

support(A) = count(A) / N

where count(A) is the number of transactions containing itemset A and N is the total number of transactions.
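As a concrete base-R illustration (the five toy transactions below are invented for this sketch), support is simply the fraction of transactions that contain an itemset; the Apriori property follows because a superset can never appear in more transactions than any of its subsets:

```r
# Toy transaction database (invented for illustration)
transactions <- list(
  c("bread", "jelly", "peanut butter"),
  c("bread", "peanut butter"),
  c("bread", "milk", "peanut butter"),
  c("beer", "bread"),
  c("beer", "milk")
)
n <- length(transactions)  # N = 5 transactions

# support(itemset) = number of transactions containing the itemset / N
support <- function(itemset) {
  sum(sapply(transactions, function(t) all(itemset %in% t))) / n
}

support(c("bread"))                    # 4/5 = 0.8
support(c("bread", "peanut butter"))   # 3/5 = 0.6

# Apriori property: a superset is never more frequent than its subsets
support(c("bread", "jelly", "peanut butter"))  # 1/5 = 0.2
```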

Apriori uses statistical measures of "interestingness", which help evaluate the usefulness and
significance of discovered rules. The most common measures include:

1- Confidence

●​ Definition: The likelihood that item B is purchased when item A is purchased.


●​ Formula: confidence(A → B) = support(A, B) / support(A) = P(A ∩ B) / P(A)
●​ Purpose: Measures the reliability of the rule. Higher confidence means that B often follows A in
transactions.
2- Lift

●​ Definition: The ratio of the observed support of A and B appearing together to the expected
support if A and B were independent.
●​ Formula: lift(A → B) = support(A ∩ B) / (support(A) × support(B)) = confidence(A → B) / support(B)
●​ Purpose: Indicates the strength of the association relative to random chance.
○​ Lift > 1: Positive association (the items occur together more often than by chance); the two occurrences are dependent on one another. A large lift value is therefore a strong indicator that a rule is important and reflects a true connection between the items.

○​ Lift = 1: No association (the items are independent).

○​ Lift < 1: Negative association (the items occur together less often than by chance); the items may be substitutes for each other.

Note: Unlike confidence, where the order of the items matters, lift(X → Y) is the same as lift(Y → X).
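These two measures can be checked by hand on a small example. The sketch below uses a toy transaction list invented for illustration, computes confidence and lift directly from their definitions, and shows that confidence is order-sensitive while lift is symmetric:

```r
# Toy transaction database (invented for illustration)
transactions <- list(
  c("bread", "jelly", "peanut butter"),
  c("bread", "peanut butter"),
  c("bread", "milk", "peanut butter"),
  c("beer", "bread"),
  c("beer", "milk")
)
n <- length(transactions)

support <- function(itemset) {
  sum(sapply(transactions, function(t) all(itemset %in% t))) / n
}

# Definitions from the formulas above
confidence <- function(A, B) support(c(A, B)) / support(A)
lift       <- function(A, B) support(c(A, B)) / (support(A) * support(B))

confidence("peanut butter", "bread")  # 0.6 / 0.6 = 1.0
confidence("bread", "peanut butter")  # 0.6 / 0.8 = 0.75 (order matters)

lift("peanut butter", "bread")        # 0.6 / (0.6 * 0.8) = 1.25, i.e. > 1
lift("bread", "peanut butter")        # same value: lift is symmetric
```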

Strong rules have high support and confidence, making them valuable for decision-making, such as
optimizing product placement in a store. However, checking all possible item combinations becomes
impractical for large datasets. To address this, the Apriori algorithm applies minimum thresholds for
support and confidence to efficiently identify the most relevant rules.

Strengths
- Ideally suited for working with very large amounts of transactional data
- Results in rules that are easy to understand
- Useful for "data mining" and discovering unexpected knowledge in databases

Weaknesses
- Not very helpful for small datasets
- Takes effort to separate true insight from common sense
- Easy to draw spurious conclusions from random patterns
R application:
Data extraction and preparation:

groceries <- read.transactions("groceries.csv", sep = ",")
Similar to read.csv(), except that it results in a sparse matrix suitable for transactional data. The parameter sep = "," specifies that items in the input file are separated by a comma.

summary(groceries)
Shows some basic information about the groceries dataset.

inspect(groceries[1:5])
Displays the contents of the sparse matrix for the first five transactions.

itemFrequency(groceries[, 1:3])
Shows the proportion of transactions that contain each of the first three items.

itemFrequencyPlot(groceries, support = 0.1)
Produces a histogram showing the items in the data with at least 10 percent support.

itemFrequencyPlot(groceries, topN = 20)
Produces the histogram sorted by decreasing support, showing the top 20 items in the dataset.

image(groceries[1:5])
Visualizes the sparse matrix for the first five transactions.

image(sample(groceries, 100))
Visualizes the sparse matrix for a randomly sampled set of 100 transactions.

Training the model:

Package: arules

Association Rule Syntax

Using the apriori() function in the arules package:

myrules <- apriori(data = mydata,
                   parameter = list(support = 0.1, confidence = 0.8, minlen = 1))

●​ data: a sparse matrix holding the transactional data
●​ support: specifies the minimum required rule support
●​ confidence: specifies the minimum required rule confidence
●​ minlen: specifies the minimum number of items in a rule
The function returns a rules object storing all rules that meet the minimum criteria.
By default, support = 0.1 and confidence = 0.8.
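A minimal runnable sketch of this call, assuming the arules package is installed: the tiny in-memory basket list is invented here so that no groceries.csv file is needed, and the thresholds are loosened to suit the toy data.

```r
library(arules)  # assumed installed: install.packages("arules")

# Build a small transactions object in memory (toy data for illustration)
baskets <- list(
  c("bread", "jelly", "peanut butter"),
  c("bread", "peanut butter"),
  c("bread", "milk", "peanut butter"),
  c("beer", "bread"),
  c("beer", "milk")
)
trans <- as(baskets, "transactions")

# Low thresholds because the dataset is tiny; minlen = 2 avoids
# trivial one-item rules with an empty left-hand side
myrules <- apriori(trans,
                   parameter = list(support = 0.4, confidence = 0.7,
                                    minlen = 2))
inspect(myrules)
```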

Examining association rules:


inspect(myrules)
●​ myrules is a set of association rules from the apriori() function
This will output the association rules to the screen. Vector operators can be used on myrules to select
specific rules to view.
Setting the support:
Think about the minimum number of transactions you would need before you would consider a
pattern interesting.

Setting the confidence:


- If confidence is set too low, we might be overwhelmed with a large number of unreliable rules, such as
dozens of rules for items commonly purchased with batteries.
- If we set confidence too high, we will be limited to rules that are obvious or inevitable, like the
fact that a smoke detector is always purchased in combination with batteries.
The appropriate minimum confidence level depends a great deal on the goals of your analysis. If you
start with conservative (high) values, you can always reduce them to broaden the search if you aren't
finding actionable intelligence.

groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))

This saves our rules in a rules object, which we can peek into by typing its name:

groceryrules
set of 463 rules

Our groceryrules object contains a set of 463 association rules. To determine whether any of them are useful, we'll have to dig deeper.

Evaluating the model’s performance:


summary(groceryrules)
Gives a high-level overview of the association rules, including the rule length distribution (lhs + rhs) sizes:
  2   3   4
150 297  16

inspect(groceryrules[1:3]) To look at specific rules

Improving the model’s performance:


1- Sorting the set of association rules:
The most useful rules might be those with the highest support, confidence, or lift.

sort() reorders the list of rules by one of these measures.
By default, the sort order is decreasing. To reverse this order, add the parameter decreasing = FALSE.
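Sorting can be sketched as follows, assuming arules is installed; the rules object is rebuilt here from toy baskets (invented for illustration) so the example does not depend on the groceries data:

```r
library(arules)  # assumed installed

baskets <- list(
  c("bread", "jelly", "peanut butter"),
  c("bread", "peanut butter"),
  c("bread", "milk", "peanut butter"),
  c("beer", "bread"),
  c("beer", "milk")
)
rules <- apriori(as(baskets, "transactions"),
                 parameter = list(support = 0.4, confidence = 0.7,
                                  minlen = 2))

# Reorder by lift, highest first (decreasing = TRUE is the default)
best <- sort(rules, by = "lift")
inspect(head(best, 3))

# Lowest lift first instead:
worst <- sort(rules, by = "lift", decreasing = FALSE)
```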

2- Taking subsets of association rules


subset()
Searches for subsets of transactions, items, or rules.

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
Finds any rules with berries appearing anywhere in the rule.

You might also like