Association Rule
Given a set of transactions, find rules that will predict the occurrence of an item based on other items.
- Definitions
- Itemset $$S$$: a collection of items
- Support count $$\sigma(S)$$: frequency of occurrence of an itemset
- Support $$s(S)$$: fraction of transactions that contain an itemset
- Frequent itemset: an itemset whose support is greater than or equal to a $$minsup$$ threshold
- Association rule $$X \rightarrow Y$$
- Rule evaluation metrics
- Support $$s$$: fraction of transactions that contain $$X \cup Y$$
- $$s = \frac{\sigma(X \cup Y)}{|T|}$$
- Confidence $$c$$: how often items in $$Y$$ appear in transactions that contain $$X$$
- $$c = \frac{\sigma(X \cup Y)}{\sigma(X)}$$
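To make the two metrics above concrete, here is a minimal Python sketch; the toy transactions and the rule $$\{milk, diapers\} \rightarrow \{beer\}$$ are illustrative assumptions, not part of the notes:

```python
# Toy transaction database (assumed for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def sigma(itemset):
    """Support count: number of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions)

X, Y = {"milk", "diapers"}, {"beer"}
s = sigma(X | Y) / len(transactions)  # support of X -> Y: 2/5 = 0.4
c = sigma(X | Y) / sigma(X)           # confidence of X -> Y: 2/3 ~= 0.67
print(s, c)
```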
Association Rule Mining
Given a set of transactions, find all rules having:
- $$s \ge minsup$$
- $$c \ge minconf$$
Two-Step Approach
- Frequent itemset generation
- Generate all frequent itemsets with $$s \ge minsup$$
- Rule generation
- Generate high confidence rules from each frequent itemset
Frequent Itemset Generation
$$2^d$$ candidates given $$d$$ items.
- Factors affecting complexity
  - $$minsup$$ threshold
    - Number of candidates, max length of frequent itemsets, etc.
  - Dimensionality (number of items) of the dataset
    - Space & computation for support count, frequent items, etc.
  - Size of database (number of transactions)
    - Multiple passes over the data
  - Average transaction width
    - Number of subsets in a transaction, max length of frequent itemsets -> traversal of the hash tree
Brute-Force Approach
- Complexity
  - $$O(NMw)$$, where $$N$$ is the number of transactions, $$M = 2^d$$ is the number of candidate itemsets, and $$w$$ is the average transaction width (a brute-force sketch follows the list below)
- Possible improvements
  - Reduce the number of candidates $$M$$
    - Pruning, e.g. via the Apriori principle
  - Reduce the number of transactions $$N$$
    - Reduce the size of $$N$$ as the size of the itemset increases
    - Used by Direct Hashing and Pruning (DHP) and vertical-based mining algorithms
  - Reduce the number of comparisons $$NM$$
    - Efficient data structures, e.g. a candidate hash tree
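For concreteness, a brute-force sketch in Python: enumerate all $$2^d$$ candidates and count support with a full database scan per candidate, which is exactly the $$O(NMw)$$ behavior above (the dataset and $$minsup$$ are assumed for illustration):

```python
from itertools import chain, combinations

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]  # assumed toy data
items = sorted(set(chain.from_iterable(transactions)))
minsup = 2 / 3

frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):                 # M = 2^d candidates in total
        count = sum(set(cand) <= t for t in transactions)  # one pass over N transactions
        if count / len(transactions) >= minsup:
            frequent[cand] = count / len(transactions)

print(frequent)  # frequent itemsets with their supports
```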
Apriori Principle
Reduce number of candidates.
Based on anti-monotone property of support:
$$X \subseteq Y \Rightarrow s(X) \ge s(Y)$$
All subsets of a frequent itemset must be frequent; equivalently, once an itemset is found infrequent, all of its supersets can be pruned. This yields the level-wise algorithm:
init C                # candidate itemsets
init L                # L[i]: frequent itemsets of size i
init minsup
L[1] = {frequent 1-itemsets}
i = 1
while not L[i].empty():
    C = candidates generated from L[i]   # join step + prune step
    for each transaction t in database:
        for each candidate c in C:
            if t contains c:
                increment count of c
    L[i+1] = {c in C with support(c) >= minsup}
    i = i + 1
return union of all L[i]
- Candidate generation (see the sketch below)
  - Join step: generate candidate $$(i+1)$$-itemsets by joining $$L_i$$ with itself
  - Prune step: prune a candidate if any of its subsets is not a frequent itemset
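A compact, runnable Python sketch of the whole loop, including the join and prune steps; the function and variable names are my own, and items are kept as sorted tuples so the prefix join works:

```python
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)

    def support(cand):
        return sum(set(cand) <= t for t in transactions) / n

    # L[k] holds the frequent k-itemsets as sorted tuples.
    items = sorted({i for t in transactions for i in t})
    L = [set(), {(i,) for i in items if support((i,)) >= minsup}]
    k = 1
    result = set(L[1])
    while L[k]:
        # Join step: merge frequent k-itemsets sharing their first k-1 items.
        cands = {a + (b[-1],)
                 for a in L[k] for b in L[k]
                 if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # Prune step: drop candidates with an infrequent k-subset.
        cands = {c for c in cands
                 if all(sub in L[k] for sub in combinations(c, k))}
        L.append({c for c in cands if support(c) >= minsup})
        k += 1
        result |= L[k]
    return result

print(apriori([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], minsup=2/3))
```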
Candidate Hash Tree
Reduce number of comparisons: instead of matching every transaction against every candidate, store the candidates in a hash tree and match each transaction only against the candidates in the leaves it hashes to (sketch below).
- Hash function: applied to one item per tree level to choose a branch
- Max leaf size: max number of itemsets stored in a leaf node; an overflowing leaf is split into an interior node
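A minimal sketch of such a tree, under assumptions of my own: integer items, itemsets as sorted tuples, the common $$h(item) = item \bmod 3$$ hash, and a split policy that hashes on the next item position:

```python
class HashTreeNode:
    def __init__(self, depth=0, max_leaf_size=3, n_branches=3):
        self.depth = depth                 # item position this node hashes on
        self.max_leaf_size = max_leaf_size
        self.n_branches = n_branches
        self.children = None               # None => this node is still a leaf
        self.itemsets = []                 # candidate itemsets stored at a leaf

    def insert(self, itemset):
        if self.children is not None:      # interior node: follow the hashed branch
            self.children[itemset[self.depth] % self.n_branches].insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overflowing leaf, as long as there are items left to hash on.
        if len(self.itemsets) > self.max_leaf_size and self.depth < len(itemset):
            self.children = [HashTreeNode(self.depth + 1, self.max_leaf_size,
                                          self.n_branches)
                             for _ in range(self.n_branches)]
            stored, self.itemsets = self.itemsets, []
            for s in stored:
                self.insert(s)             # re-insert; now routed to children

tree = HashTreeNode()
for cand in [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8)]:
    tree.insert(cand)
```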
Rule Generation
Given a frequent itemset $$L$$, find all non-empty subsets $$f \subset L$$ s.t. $$f \rightarrow L - f$$ satisfies the min confidence requirement.
$$(2^k-2)$$ candidates if $$|L| = k$$ (excluding $$L \rightarrow \emptyset$$, $$\emptyset \rightarrow L$$).
Apriori Principle
Confidence does not have an anti-monotone property in general, but the confidence of rules generated from the same itemset does:
$$c(ABC \rightarrow D) \ge c(AB \rightarrow CD) \ge c(A \rightarrow BCD)$$
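Why this holds: all rules generated from the same itemset share the numerator $$\sigma(ABCD)$$, while a larger antecedent has a smaller (or equal) support count:

$$c(ABC \rightarrow D) = \frac{\sigma(ABCD)}{\sigma(ABC)} \ge \frac{\sigma(ABCD)}{\sigma(AB)} = c(AB \rightarrow CD), \quad \text{since } \sigma(ABC) \le \sigma(AB)$$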
- Candidate rule generation (a runnable sketch follows this list)
  - Merge two rules that share the same prefix in the rule consequent
    - e.g. merge $$CD \Rightarrow AB$$ and $$BD \Rightarrow AC$$ (shared consequent prefix $$A$$) to get the candidate $$D \Rightarrow ABC$$
  - Prune a rule if some of its subset rules do not have high confidence
    - e.g. prune $$D \Rightarrow ABC$$ if the subset rule $$AD \Rightarrow BC$$ has low confidence
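A sketch of this level-wise rule generation in Python; `sigma` is an assumed lookup from sorted item tuples to support counts (e.g., collected during the frequent-itemset phase), and the toy counts below are illustrative:

```python
def gen_rules(freq_itemset, sigma, minconf):
    """Return (antecedent, consequent, confidence) rules from one frequent itemset."""
    L = tuple(sorted(freq_itemset))
    rules = []
    consequents = [(c,) for c in L]           # start with 1-item consequents
    while consequents and len(consequents[0]) < len(L):
        survivors = []
        for cons in consequents:
            ante = tuple(i for i in L if i not in cons)
            conf = sigma[L] / sigma[ante]
            if conf >= minconf:
                rules.append((ante, cons, conf))
                survivors.append(cons)        # low-confidence consequents are pruned
        # Merge surviving consequents that share the same prefix.
        consequents = sorted({a + (b[-1],)
                              for a in survivors for b in survivors
                              if a[:-1] == b[:-1] and a[-1] < b[-1]})
    return rules

# Illustrative support counts for the frequent itemset (A, B, C).
sigma = {("A",): 5, ("B",): 4, ("C",): 4,
         ("A", "B"): 3, ("A", "C"): 4, ("B", "C"): 3,
         ("A", "B", "C"): 3}
print(gen_rules(("A", "B", "C"), sigma, minconf=0.8))
```

Note how a pruned consequent (here $$\{B\}$$, since $$c(AC \rightarrow B) = 3/4 < 0.8$$) never reappears inside a larger consequent, which is exactly the Apriori-style pruning described above.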