UNIT II
Unit-2 - Associations and Correlations
Market Basket Analysis – Apriori Algorithm –
Mining Frequent Itemsets without Candidate Generation –
Mining Frequent Itemsets Using Vertical Data Format –
Mining Closed Frequent Itemsets –
Mining Multilevel Association Rules –
Mining Multidimensional Association Rules –
Correlation Analysis –
Constraint-Based Association Mining
Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining Various Kinds of
Association Rules – Correlation Analysis – Constraint Based Association Mining
Association Mining
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of
items or objects in transaction databases, relational databases, and other information
repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis,
clustering, classification, etc.
• Examples:
– Rule form: “Body → Head [support, confidence]”
– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ∧ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by
a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive
services done
• Applications
– * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
– Home Electronics ⇒ * (what other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”
Rule Measures: Support and Confidence
• Find all rules X ∧ Y ⇒ Z with minimum support and confidence
– support, s: the probability that a transaction contains {X, Y, Z}
– confidence, c: the conditional probability that a transaction containing {X, Y} also contains Z
With minimum support 50% and minimum confidence 50%, we have (for the transactions below):
– A ⇒ C (support 50%, confidence 66.6%)
– C ⇒ A (support 50%, confidence 100%)
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
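The support and confidence figures above can be verified directly. Below is a minimal Python sketch; the transactions come from the table, while the function names are purely illustrative.

    # Toy transaction database from the table above
    transactions = {
        2000: {"A", "B", "C"},
        1000: {"A", "C"},
        4000: {"A", "D"},
        5000: {"B", "E", "F"},
    }

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        hits = sum(1 for items in transactions.values() if itemset <= items)
        return hits / len(transactions)

    def confidence(lhs, rhs):
        """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
        return support(set(lhs) | set(rhs)) / support(lhs)

    print(support({"A", "C"}))        # 0.5   -> both A => C and C => A have 50% support
    print(confidence({"A"}, {"C"}))   # 0.667 -> confidence of A => C
    print(confidence({"C"}, {"A"}))   # 1.0   -> confidence of C => A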
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (Based on the types of values handled)
– buys(x, “SQLServer”) ∧ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multidimensional associations (see the examples above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality
– Maxpatterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Market – Basket analysis
A market basket is a collection of items purchased by a customer in a single transaction, which
is a well-defined business activity. For example, a customer's visits to a grocery store or an online
purchase from a virtual store on the Web are typical customer transactions. Retailers accumulate
huge collections of transactions by recording business activities over time. One common analysis run
against a transactions database is to find sets of items, or itemsets, that appear together in many
transactions. A business can use knowledge of these patterns to improve the Placement of these items
in the store or the layout of mail- order catalog page and Web pages. An itemset containing i items is
called an i-itemset. The percentage of transactions that contain an itemset is called the itemset's
support. For an itemset to be interesting, its support must be higher than a user-specified minimum.
Such itemsets are said to be frequent.
Figure: Market basket analysis.
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. For example, a support of 2% for the rule
buys(computer) ⇒ buys(financial_management_software) means that 2% of all the transactions under
analysis show that a computer and financial management software are purchased together. A
confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
Mining Frequent Patterns
Apriori is a method that mines the complete set of frequent itemsets using candidate generation.
Apriori property & The Apriori Algorithm.
Apriori property
• All nonempty subsets of a frequent itemset must also be frequent.
– If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not
frequent, i.e., support(I) < min_sup.
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur
more frequently than I.
• Monotonic functions are functions that move in only one direction.
• This property is called anti-monotone.
• If a set cannot pass a test, all of its supersets will fail the same test as well.
• This property is monotonic in failing the test.
The Apriori Algorithm
• Join step: Ck is generated by joining Lk−1 with itself.
• Prune step: Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
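A minimal Python sketch of these two steps, run on the small transaction database from the support/confidence example earlier, is given below. It is an illustrative sketch rather than an optimized implementation; the variable and function names are not from the text.

    from itertools import combinations

    def apriori(transactions, min_count):
        """Return {frozenset(itemset): support count} for all frequent itemsets."""
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent = dict(current)
        k = 2
        while current:
            # Join step: union pairs of frequent (k-1)-itemsets that yield a k-itemset
            prev = list(current)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k:
                        # Prune step: every (k-1)-subset must itself be frequent
                        if all(frozenset(sub) in current
                               for sub in combinations(union, k - 1)):
                            candidates.add(union)
            # Count the surviving candidates with one scan of the database
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {s: c for s, c in counts.items() if c >= min_count}
            frequent.update(current)
            k += 1
        return frequent

    db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(db, min_count=2))   # {A}: 3, {B}: 2, {C}: 2, {A, C}: 2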
Mining Frequent Itemsets Without Candidate Generation (FP-Growth)
FP-growth mines the complete set of frequent itemsets without candidate generation.
• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!
Construct FP-tree from a Transaction DB (min_support = 0.5, i.e., an item must appear in at least 3 of the 5 transactions)

TID   Items bought                  (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o}               {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
Header Table
Item   frequency   head of node-links
f 4
c 4
a 3
b 3
m 3
p 3
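Steps 1 and 2 can be sketched in a few lines of Python; the item order below is taken from the header table above (descending frequency, with f listed before c):

    from collections import Counter

    transactions = [
        {"f", "a", "c", "d", "g", "i", "m", "p"},
        {"a", "b", "c", "f", "l", "m", "o"},
        {"b", "f", "h", "j", "o"},
        {"b", "c", "k", "s", "p"},
        {"a", "f", "c", "e", "l", "p", "m", "n"},
    ]
    min_count = 3   # min_support = 0.5 over 5 transactions

    # Step 1: scan the DB once and find the frequent 1-itemsets
    counts = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in counts.items() if c >= min_count}
    print(freq)      # f:4, c:4, a:3, b:3, m:3, p:3 (printed order may vary)

    # Step 2: keep only frequent items and order them by descending frequency
    # (ties broken here by the header-table order: f, c, a, b, m, p)
    rank = {item: i for i, item in enumerate(["f", "c", "a", "b", "m", "p"])}
    ordered = [sorted((i for i in t if i in freq), key=rank.__getitem__)
               for t in transactions]
    print(ordered)   # [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'], ...]

Step 3 then inserts each ordered transaction into the FP-tree, sharing common prefixes and incrementing node counts along the way.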
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never larger than the original database (not counting node-links and counts)
– Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using FP-tree
• General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent
pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
If the conditional FP-tree contains a single path, simply enumerate all the patterns
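A compact Python sketch of this divide-and-conquer procedure follows. For simplicity it rebuilds a small FP-tree for each conditional pattern base instead of reusing node-links, so it illustrates the steps rather than an optimized implementation; all names are illustrative.

    from collections import defaultdict

    class FPNode:
        """One FP-tree node: an item, its count, its parent, and its children."""
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fptree(transactions, min_count):
        """Scan the DB, keep frequent items, and insert ordered transactions into a tree."""
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        freq = {i: c for i, c in counts.items() if c >= min_count}
        root = FPNode(None, None)
        header = defaultdict(list)   # header table: item -> nodes holding that item
        for t in transactions:
            # frequency-descending order (ties broken alphabetically; any fixed order works)
            items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return header, freq

    def fp_growth(transactions, min_count, suffix=()):
        """Recursively mine frequent patterns; returns {pattern tuple: support count}."""
        header, freq = build_fptree(transactions, min_count)
        patterns = {}
        for item in sorted(freq, key=freq.get):      # least frequent first
            pattern = (item,) + suffix
            patterns[pattern] = freq[item]
            # Conditional pattern base: prefix paths leading to each occurrence of `item`
            cond_db = []
            for node in header[item]:
                path, parent = [], node.parent
                while parent.item is not None:
                    path.append(parent.item)
                    parent = parent.parent
                cond_db.extend([path] * node.count)
            # Recurse on the conditional FP-tree built from that pattern base
            patterns.update(fp_growth(cond_db, min_count, pattern))
        return patterns

    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
    print(fp_growth(db, min_count=3))   # maps each frequent pattern (tuple) to its support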
Principles of Frequent Pattern Growth
• Pattern growth property
– Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an
itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
• “abcdef” is a frequent pattern, if and only if
– “abcde” is a frequent pattern, and
– “f” is frequent in the set of transactions containing “abcde”
Why Is Frequent Pattern Growth Fast?
• Our performance study shows
– FP-growth is an order of magnitude faster than Apriori, and is also faster than
tree-projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– The basic operations are counting and FP-tree building
Mining multilevel association rules from transactional databases.
• Items often form hierarchy.
• Items at the lower level are expected to have lower support.
• Rules regarding itemsets at
appropriate levels could be quite useful.
• Transaction database can be encoded based on dimensions and levels
• We can explore shared multi-level mining
Figure: A concept hierarchy for items. Food is at the top level; Milk and Bread are at the next level;
Milk is divided into 2% and Skim, and Bread into Wheat and White; brand-level items such as Fraser
and Sunset appear at the lowest level.
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
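As a rough illustration of this encoding (assuming the first digit identifies the level-1 category, the second the level-2 type, and the third the brand; this is the usual reading but an assumption here), higher-level itemsets can be counted by truncating the codes:

    from collections import Counter

    # Encoded transactions from the table above
    transactions = [
        {"111", "121", "211", "221"},
        {"111", "211", "222", "323"},
        {"112", "122", "221", "411"},
        {"111", "121"},
        {"111", "122", "211", "221", "413"},
    ]

    def level_counts(level):
        """Count level-`level` items by truncating each code to its first `level` digits."""
        counts = Counter()
        for t in transactions:
            # several items may share the same ancestor; a set counts it once per transaction
            counts.update({code[:level] for code in t})
        return counts

    print(level_counts(1))   # level-1 categories, e.g. '1': 5, '2': 4, ...
    print(level_counts(2))   # level-2 types, e.g. '11': 5, '12': 4, '22': 4, '21': 3, ...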
Mining Multi-Level Associations
• A top-down, progressive deepening approach:
– First find high-level strong rules: milk → bread [20%, 60%].
– Then find their lower-level “weaker” rules: 2% milk → wheat bread [6%, 50%].
• Variations on mining multiple-level association rules:
– Level-crossing association rules: 2% milk → Wonder wheat bread
– Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
• Uniform support: the same minimum support for all levels
– (+) Only one minimum support threshold is needed, and there is no need to examine
itemsets containing any item whose ancestors do not have minimum support.
– (−) Lower-level items do not occur as frequently. If the support threshold is
• too high, low-level associations are missed
• too low, too many high-level associations are generated
• Reduced Support: reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent
• Level-cross filtering by k-itemset
• Level-cross filtering by single item
• Controlled level-cross filtering by single item
Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to “ancestor” relationships between items.
• Example
– milk → wheat bread [support = 8%, confidence = 70%]
– 2% milk → wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor
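As a worked illustration (assuming, purely for the sake of the example, that 2% milk accounts for about one quarter of all milk sold): the expected support of the second rule is roughly 8% × 1/4 = 2%. Since the observed support is also 2% and the confidences are similar, the second rule conveys no extra information and can be filtered out as redundant.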
Multi-Level Mining: Progressive Deepening
• A top-down, progressive deepening approach:
– First mine high-level frequent items:
milk (15%), bread (10%)
– Then mine their lower-level “weaker” frequent
itemsets: 2% milk (5%), wheat bread (4%)
• Different min_support thresholds across multiple levels lead to different search strategies:
– If the same min_support is adopted across all levels, then toss t if any of t's ancestors
is infrequent.
– If a reduced min_support is adopted at lower levels, then examine only those
descendants whose ancestors' support is frequent/non-negligible.
Correlation in detail.
• Interest (correlation, lift)
– takes both P(A) and P(B) into consideration
– P(A ∧ B) = P(A) × P(B) if A and B are independent events
– A and B are negatively correlated if the interest value is less than 1; otherwise A and B
are positively correlated
X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X, Y      25%       2
X, Z      37.5%     0.9
Y, Z      12.5%     0.57
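The interest (lift) column can be reproduced directly from the three rows above; a minimal Python sketch:

    # The three binary attributes over 8 transactions, copied from the table above
    X = [1, 1, 1, 1, 0, 0, 0, 0]
    Y = [1, 1, 0, 0, 0, 0, 0, 0]
    Z = [0, 1, 1, 1, 1, 1, 1, 1]
    n = len(X)

    def interest(a, b):
        """Lift: P(a and b) / (P(a) * P(b)); a value of 1 means independence."""
        p_a = sum(a) / n
        p_b = sum(b) / n
        p_ab = sum(1 for i in range(n) if a[i] and b[i]) / n
        return p_ab / (p_a * p_b)

    print(interest(X, Y))   # 2.0            -> positively correlated
    print(interest(X, Z))   # 0.857 (~ 0.9)  -> slightly negatively correlated
    print(interest(Y, Z))   # 0.571          -> negatively correlated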
χ² Correlation
• χ² measures correlation between categorical attributes:

χ² = Σ (observed − expected)² / expected
            game          not game      sum (row)
video       4000 (4500)   3500 (3000)   7500
not video   2000 (1500)    500 (1000)   2500
sum (col.)  6000          4000          10000
• expected(i,j) = count(row i) * count(column j) / N
• χ² = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000 = 555.6
• Since χ² is large and the observed count for (game, video) is less than the expected count, game
and video are negatively correlated
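The same figure can be checked with a few lines of Python over the contingency table (values copied from above):

    # Observed counts from the 2x2 contingency table
    observed = {("video", "game"): 4000, ("video", "not game"): 3500,
                ("not video", "game"): 2000, ("not video", "not game"): 500}
    row_totals = {"video": 7500, "not video": 2500}
    col_totals = {"game": 6000, "not game": 4000}
    n = 10000

    chi2 = 0.0
    for (row, col), obs in observed.items():
        expected = row_totals[row] * col_totals[col] / n   # expected(i, j)
        chi2 += (obs - expected) ** 2 / expected

    print(round(chi2, 1))   # 555.6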
Numeric correlation
• Correlation concept in statistics
– Used to study the relationship existing between 2 or more numeric variables
– A correlation is a measure of the linear relationship between variables
Ex: number of hours spent studying in a class with grade received
– Outcomes:
• positively related
• Not related
• negatively related
– Statistical relationships
• Covariance
• Correlation coefficient
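A minimal sketch of the correlation coefficient for the studying-hours example (the numbers below are made up purely for illustration):

    import math

    # Hypothetical data: hours spent studying vs. grade received
    hours  = [2, 4, 5, 7, 9]
    grades = [55, 62, 70, 78, 90]

    def pearson(x, y):
        """Pearson correlation coefficient: covariance / (std_x * std_y)."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
        std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
        std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
        return cov / (std_x * std_y)

    print(pearson(hours, grades))   # close to +1, i.e., positively related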
Constraint-Based Mining in detail.
• Interactive, exploratory mining of gigabytes of data?
– Could it be real? — Making good use of constraints!
• What kinds of constraints can be used in mining?
– Knowledge type constraint: classification, association, etc.
– Data constraint: SQL-like queries
• Find product pairs sold together in Vancouver in Dec.’98.
– Dimension/level constraints:
• in relevance to region, price, brand, customer category.
– Rule constraints
• small sales (price < $10) trigger big sales (sum > $200).
– Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
Rule Constraints in Association Mining
• Two kinds of rule constraints:
– Rule form constraints: meta-rule guided mining.
• P(x, y) ∧ Q(x, w) → takes(x, “database systems”).
– Rule (content) constraint: constraint-based query optimization (Ng, et al., SIGMOD’98).
• sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD’99):
– 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
– 2-var: A constraint confining both sides (L and R).
• sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)
Constraint-Based Association Query
• Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
• A constrained association query (CAQ) is of the form {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including a frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g., S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
• v θ S, θ is ∈ or ∉, e.g., snacks ∈ S.Type
• V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
– e.g., {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and
θ ∈ {=, ≠, <, ≤, >, ≥}
• e.g., count(S1.Type) = 1, avg(S2.Price) < 100
Constrained Association Query Optimization Problem
• Given a CAQ = {(S1, S2) | C}, the algorithm should be:
– sound: it only finds frequent sets that satisfy the given constraints C
– complete: all frequent sets that satisfy the given constraints C are found
• A naïve solution:
– Apply Apriori to find all frequent sets, and then test them for constraint satisfaction
one by one.
• Our approach:
– Comprehensively analyze the properties of constraints and push them as deeply as
possible inside the frequent set computation.
Categories of Constraints.
1. Anti-monotone and Monotone Constraints
• A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-
patterns of S can satisfy Ca
• A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also
satisfies it
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection
predicate p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that
SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies
that each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies
that each pattern of which S is a suffix w.r.t. R also satisfies C
Property of Constraints: Anti-Monotone
• Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
– sum(S.Price) = v is partly anti-monotone
• Application:
– Push “sum(S.Price) ≤ 1000” deeply into iterative frequent set computation.
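A minimal sketch of pushing such an anti-monotone constraint into the level-wise search (the item prices below are made up for illustration): any candidate whose price sum already exceeds the bound can be dropped before support counting, since no superset can satisfy the constraint either.

    price = {"A": 300, "B": 700, "C": 900, "D": 200}   # illustrative item prices
    MAX_SUM = 1000                                     # constraint: sum(S.Price) <= 1000

    def may_satisfy(itemset):
        """Anti-monotone check: once the sum exceeds the bound, no superset can recover."""
        return sum(price[i] for i in itemset) <= MAX_SUM

    # Inside an Apriori-style loop, prune candidates before counting their support:
    candidates = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "D"}]
    pruned = [c for c in candidates if may_satisfy(c)]
    print(pruned)   # keeps {'A','B'} (1000) and {'A','D'} (500); drops {'A','C'} and {'B','C'}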
Example of Convertible Constraints: avg(S) ≥ v
• Let R be the value-descending order over the set of items
– E.g., I = {9, 8, 6, 4, 3, 1}
• avg(S) ≥ v is convertible monotone w.r.t. R
– If S is a suffix of S1, then avg(S1) ≥ avg(S)
• {8, 4, 3} is a suffix of {9, 8, 4, 3}
• avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
– If S satisfies avg(S) ≥ v, so does S1
• {8, 4, 3} satisfies the constraint avg(S) ≥ 4, and so does {9, 8, 4, 3}
Property of Constraints: Succinctness
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1,
i.e., S contains a subset belonging to A1
• Example:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint
alone is not affected by the iterative support counting.