Association Rule
Given a set of transactions, find rules that will predict the occurrence of an item based on other items.
- Definitions
- Itemset $$S$$: a collection of items
- Support count $$\sigma(S)$$: frequency of occurrence of an itemset
- Support $$s(S)$$: fraction of transactions that contain an itemset
- Frequent itemset: an itemset whose support is greater than or equal to a $$minsup$$ threshold
- Association rule $$X \rightarrow Y$$
- Rule evaluation metrics
- Support $$s$$: fraction of transactions that contain $$X \cup Y$$
- $$s = \frac{\sigma(X \cup Y)}{|T|}$$
- Confidence $$c$$: how often items in $$Y$$ appear in transactions that contain $$X$$
- $$c = \frac{\sigma(X \cup Y)}{\sigma(X)}$$
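To make the two metrics above concrete, here is a minimal Python sketch; the toy transactions and the rule $$\{milk, diapers\} \rightarrow \{beer\}$$ are illustrative assumptions, not part of the notes:

```python
# Toy transaction database (assumed for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def sigma(itemset):
    """Support count: number of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions)

X, Y = {"milk", "diapers"}, {"beer"}
s = sigma(X | Y) / len(transactions)  # support of X -> Y: 2/5 = 0.4
c = sigma(X | Y) / sigma(X)           # confidence of X -> Y: 2/3 ~= 0.67
print(s, c)
```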
Association Rule Mining
Given a set of transactions, find all rules having:
- $$s \ge minsup$$
- $$c \ge minconf$$
Two-Step Approach
- Frequent itemset generation
- Generate all frequent itemsets with $$s \ge minsup$$
- Rule generation
- Generate high confidence rules from each frequent itemset
Frequent Itemset Generation
$$2^d$$ candidates given $$d$$ items.
- Factors affecting complexity
  - $$minsup$$ threshold
    - Number of candidates, max length of frequent itemsets, etc.
  - Dimensionality (number of items) of the dataset
    - Space & computation for support count, frequent items, etc.
  - Size of database (number of transactions)
    - Multiple passes over the data
  - Average transaction width
    - Number of subsets in a transaction, max length of frequent itemsets -> traversal of the hash tree
Brute-Force Approach
- Complexity
  - $$O(NMw)$$, where $$N$$ is the number of transactions, $$M = 2^d$$ is the number of candidate itemsets, and $$w$$ is the average transaction width (a brute-force sketch follows the list below)
- Possible improvements
  - Reduce the number of candidates $$M$$
    - Pruning, e.g. via the Apriori principle
  - Reduce the number of transactions $$N$$
    - Reduce the size of $$N$$ as the size of the itemset increases
    - Used by Direct Hashing and Pruning (DHP) and vertical-based mining algorithms
  - Reduce the number of comparisons $$NM$$
    - Efficient data structures, e.g. a candidate hash tree
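For concreteness, a brute-force sketch in Python: enumerate all $$2^d$$ candidates and count support with a full database scan per candidate, which is exactly the $$O(NMw)$$ behavior above (the dataset and $$minsup$$ are assumed for illustration):

```python
from itertools import chain, combinations

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]  # assumed toy data
items = sorted(set(chain.from_iterable(transactions)))
minsup = 2 / 3

frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):                 # M = 2^d candidates in total
        count = sum(set(cand) <= t for t in transactions)  # one pass over N transactions
        if count / len(transactions) >= minsup:
            frequent[cand] = count / len(transactions)

print(frequent)  # frequent itemsets with their supports
```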
Apriori Principle
Reduce number of candidates.
Based on anti-monotone property of support:
$$X \subseteq Y \Rightarrow s(X) \ge s(Y)$$
All subsets of a frequent itemset must be frequent; equivalently, once an itemset is found infrequent, all of its supersets can be pruned. This yields the level-wise algorithm:
init C                # candidate itemsets
init L                # L[i]: frequent itemsets of size i
init minsup
L[1] = {frequent 1-itemsets}
i = 1
while not L[i].empty():
    C = candidates generated from L[i]   # join step + prune step
    for each transaction t in database:
        for each candidate c in C:
            if t contains c:
                increment count of c
    L[i+1] = {c in C with support(c) >= minsup}
    i = i + 1
return union of all L[i]
- Candidate generation (see the sketch below)
  - Join step: generate candidate $$(i+1)$$-itemsets by joining $$L_i$$ with itself
  - Prune step: prune a candidate if any of its subsets is not a frequent itemset
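A compact, runnable Python sketch of the whole loop, including the join and prune steps; the function and variable names are my own, and items are kept as sorted tuples so the prefix join works:

```python
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)

    def support(cand):
        return sum(set(cand) <= t for t in transactions) / n

    # L[k] holds the frequent k-itemsets as sorted tuples.
    items = sorted({i for t in transactions for i in t})
    L = [set(), {(i,) for i in items if support((i,)) >= minsup}]
    k = 1
    result = set(L[1])
    while L[k]:
        # Join step: merge frequent k-itemsets sharing their first k-1 items.
        cands = {a + (b[-1],)
                 for a in L[k] for b in L[k]
                 if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # Prune step: drop candidates with an infrequent k-subset.
        cands = {c for c in cands
                 if all(sub in L[k] for sub in combinations(c, k))}
        L.append({c for c in cands if support(c) >= minsup})
        k += 1
        result |= L[k]
    return result

print(apriori([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], minsup=2/3))
```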
Candidate Hash Tree
Reduce number of comparisons: instead of matching every transaction against every candidate, store the candidates in a hash tree and match each transaction only against the candidates in the leaves it hashes to (sketch below).
- Hash function: applied to one item per tree level to choose a branch
- Max leaf size: max number of itemsets stored in a leaf node; an overflowing leaf is split into an interior node
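A minimal sketch of such a tree, under assumptions of my own: integer items, itemsets as sorted tuples, the common $$h(item) = item \bmod 3$$ hash, and a split policy that hashes on the next item position:

```python
class HashTreeNode:
    def __init__(self, depth=0, max_leaf_size=3, n_branches=3):
        self.depth = depth                 # item position this node hashes on
        self.max_leaf_size = max_leaf_size
        self.n_branches = n_branches
        self.children = None               # None => this node is still a leaf
        self.itemsets = []                 # candidate itemsets stored at a leaf

    def insert(self, itemset):
        if self.children is not None:      # interior node: follow the hashed branch
            self.children[itemset[self.depth] % self.n_branches].insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overflowing leaf, as long as there are items left to hash on.
        if len(self.itemsets) > self.max_leaf_size and self.depth < len(itemset):
            self.children = [HashTreeNode(self.depth + 1, self.max_leaf_size,
                                          self.n_branches)
                             for _ in range(self.n_branches)]
            stored, self.itemsets = self.itemsets, []
            for s in stored:
                self.insert(s)             # re-insert; now routed to children

tree = HashTreeNode()
for cand in [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8)]:
    tree.insert(cand)
```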
Rule Generation
Given a frequent itemset $$L$$, find all non-empty subsets $$f \subset L$$ s.t. $$f \rightarrow L - f$$ satisfies the min confidence requirement.
$$(2^k-2)$$ candidates if $$|L| = k$$ (excluding $$L \rightarrow \emptyset$$, $$\emptyset \rightarrow L$$).
Apriori Principle
Confidence does not have an anti-monotone property in general, but the confidence of rules generated from the same itemset does:
$$c(ABC \rightarrow D) \ge c(AB \rightarrow CD) \ge c(A \rightarrow BCD)$$
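Why this holds: all rules generated from the same itemset share the numerator $$\sigma(ABCD)$$, while a larger antecedent has a smaller (or equal) support count:

$$c(ABC \rightarrow D) = \frac{\sigma(ABCD)}{\sigma(ABC)} \ge \frac{\sigma(ABCD)}{\sigma(AB)} = c(AB \rightarrow CD), \quad \text{since } \sigma(ABC) \le \sigma(AB)$$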
- Candidate rule generation (a runnable sketch follows this list)
  - Merge two rules that share the same prefix in the rule consequent
    - e.g. merge $$CD \Rightarrow AB$$ and $$BD \Rightarrow AC$$ (shared consequent prefix $$A$$) to get the candidate $$D \Rightarrow ABC$$
  - Prune a rule if some of its subset rules do not have high confidence
    - e.g. prune $$D \Rightarrow ABC$$ if the subset rule $$AD \Rightarrow BC$$ has low confidence
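A sketch of this level-wise rule generation in Python; `sigma` is an assumed lookup from sorted item tuples to support counts (e.g., collected during the frequent-itemset phase), and the toy counts below are illustrative:

```python
def gen_rules(freq_itemset, sigma, minconf):
    """Return (antecedent, consequent, confidence) rules from one frequent itemset."""
    L = tuple(sorted(freq_itemset))
    rules = []
    consequents = [(c,) for c in L]           # start with 1-item consequents
    while consequents and len(consequents[0]) < len(L):
        survivors = []
        for cons in consequents:
            ante = tuple(i for i in L if i not in cons)
            conf = sigma[L] / sigma[ante]
            if conf >= minconf:
                rules.append((ante, cons, conf))
                survivors.append(cons)        # low-confidence consequents are pruned
        # Merge surviving consequents that share the same prefix.
        consequents = sorted({a + (b[-1],)
                              for a in survivors for b in survivors
                              if a[:-1] == b[:-1] and a[-1] < b[-1]})
    return rules

# Illustrative support counts for the frequent itemset (A, B, C).
sigma = {("A",): 5, ("B",): 4, ("C",): 4,
         ("A", "B"): 3, ("A", "C"): 4, ("B", "C"): 3,
         ("A", "B", "C"): 3}
print(gen_rules(("A", "B", "C"), sigma, minconf=0.8))
```

Note how a pruned consequent (here $$\{B\}$$, since $$c(AC \rightarrow B) = 3/4 < 0.8$$) never reappears inside a larger consequent, which is exactly the Apriori-style pruning described above.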