Association Rule

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

  • Definitions
    • Itemset $$S$$: a collection of items
    • Support count $$\sigma(S)$$: frequency of occurrence of an itemset
    • Support $$s(S)$$: fraction of transactions that contain an itemset
    • Frequent itemset: an itemset whose support is greater than or equal to a $$minsup$$ threshold
    • Association rule $$X \rightarrow Y$$: an implication of the form $$X \rightarrow Y$$, where $$X$$ and $$Y$$ are disjoint itemsets
  • Rule evaluation metrics
    • Support $$s$$: fraction of transactions that contain $$X \cup Y$$
      • $$s = \frac{\sigma(X \cup Y)}{|T|}$$
    • Confidence $$c$$: how often items in $$Y$$ appear in transactions that contain $$X$$ (see the worked example below)
      • $$c = \frac{\sigma(X \cup Y)}{\sigma(X)}$$
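
A small worked example (hypothetical transactions, for illustration only), computing both metrics for the rule $$\{Milk, Diaper\} \rightarrow \{Beer\}$$:

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]
X, Y = {"Milk", "Diaper"}, {"Beer"}
sigma = lambda S: sum(S <= t for t in T)  # support count
s = sigma(X | Y) / len(T)                 # support: 2/5 = 0.4
c = sigma(X | Y) / sigma(X)               # confidence: 2/3 ~ 0.67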

Association Rule Mining

Given a set of transactions, find all rules having:

  • $$s \ge minsup$$
  • $$c \ge minconf$$

Two-Step Approach

  1. Frequent itemset generation
    1. Generate all frequent itemsets with $$s \ge minsup$$
  2. Rule generation
    1. Generate high confidence rules from each frequent itemset

Frequent Itemset Generation

$$2^d$$ candidates given $$d$$ items.

  • Factors affecting complexity
    • minsup
      • Number of candidates, max length of frequent itemsets, etc.
    • Dimensionality (number of items) of the dataset
      • Space & computation for support count, frequent items, etc.
    • Size of database (number of transactions)
      • Multiple passes
    • Average transaction width
      • More subsets per transaction and longer frequent itemsets mean more work traversing the hash tree

Brute-Force Approach

  • Complexity
    • $$O(NMw)$$, where $$N$$ is the number of transactions, $$M = 2^d$$ is the number of candidate itemsets, and $$w$$ is the max transaction width (see the sketch after this list)
  • Possible improvements
    • Reduce number of candidates M
      • Pruning
    • Reduce number of transactions N
      • Shrink $$N$$ as the itemset size increases by discarding transactions that can no longer contain any frequent itemset
      • Used by Direct Hashing and Pruning (DHP) and vertical-based mining algorithms
    • Reduce number of comparisons NM
      • Efficient data structures
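
A brute-force sketch of the $$O(NMw)$$ baseline (assumed names, for illustration): enumerate all $$2^d$$ candidates and match each one against every transaction.

from itertools import chain, combinations

def brute_force_frequent(T, minsup):
    items = sorted({x for t in T for x in t})
    # M = 2^d candidates: every non-empty subset of the d items
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    # Match each candidate against every transaction: O(N * M * w)
    return [set(c) for c in candidates
            if sum(set(c) <= t for t in T) / len(T) >= minsup]
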
Apriori Principle

Reduce number of candidates.

Based on the anti-monotone property of support:

$$X \subseteq Y \Rightarrow s(X) \ge s(Y)$$

All subsets of a frequent itemset must also be frequent. A runnable sketch of the level-wise algorithm (assuming each transaction is a set of items):

from itertools import combinations

def apriori(database, minsup):
    # database: list of transactions (sets of items)
    # minsup: minimum support as a fraction of the number of transactions
    N = len(database)
    items = {x for t in database for x in t}
    # L[i]: frequent itemsets of size i, stored as frozensets
    L = {1: {frozenset([x]) for x in items
             if sum(x in t for t in database) / N >= minsup}}
    i = 1
    while L[i]:
        # Join step: combine size-i frequent itemsets into size-(i+1) candidates
        C = {a | b for a in L[i] for b in L[i] if len(a | b) == i + 1}
        # Prune step: drop candidates with an infrequent size-i subset
        C = {c for c in C if all(frozenset(s) in L[i] for s in combinations(c, i))}
        # Count each candidate's support in one pass over the database
        count = {c: 0 for c in C}
        for t in database:
            for c in C:
                if c <= t:
                    count[c] += 1
        L[i + 1] = {c for c in C if count[c] / N >= minsup}
        i += 1
    return set().union(*L.values())

  • Candidate generation
    • Join step: generate size-$$(i+1)$$ candidates by joining $$L_i$$ with itself
    • Prune step: remove a candidate if any of its size-$$i$$ subsets is not a frequent itemset
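
A quick usage sketch, reusing the hypothetical database T from the worked example above:

freq = apriori(T, minsup=0.6)
# -> singletons {Bread}, {Milk}, {Diaper}, {Beer} and the pairs {Bread, Milk},
#    {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}; every candidate triple
#    is either pruned (it has an infrequent pair subset) or falls below minsup
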
Candidate Hash Tree

Reduce the number of comparisons by storing the candidates in a hash tree: each transaction is hashed down the tree and compared only against candidates in the leaves it reaches, rather than against every candidate (see the sketch below).

  • Hash function: determines which branch an itemset follows at each interior node
  • Max leaf size: max number of itemsets stored in a leaf node; a leaf that overflows is split by hashing on the next item
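
A minimal one-level sketch of the idea (assumed names; a real hash tree applies the hash function recursively and splits leaves that exceed the max leaf size):

from collections import Counter

def build_buckets(candidates, nbuckets=3):
    # Place each candidate itemset (a frozenset) in one bucket,
    # keyed by a hash of its smallest item
    buckets = {}
    for c in candidates:
        buckets.setdefault(hash(min(c)) % nbuckets, []).append(c)
    return buckets

def count_with_buckets(T, buckets, nbuckets=3):
    counts = Counter()
    for t in T:
        # A transaction only visits buckets its own items hash to, so it is
        # never compared against candidates it cannot possibly contain
        for key in {hash(x) % nbuckets for x in t}:
            for c in buckets.get(key, []):
                if c <= t:
                    counts[c] += 1
    return counts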

Rule Generation

Given a frequent itemset $$L$$, find all non-empty subsets $$f \subset L$$ such that $$f \rightarrow L - f$$ satisfies the minimum confidence requirement.

$$(2^k - 2)$$ candidate rules if $$|L| = k$$ (excluding $$L \rightarrow \emptyset$$ and $$\emptyset \rightarrow L$$); e.g. $$L = \{A, B, C\}$$ yields $$2^3 - 2 = 6$$ rules.

Apriori Principle

Confidence is not anti-monotone in general, but the confidence of rules generated from the same itemset (here $$ABCD$$) is anti-monotone in the number of items in the consequent:

$$c(ABC \rightarrow D) \ge c(AB \rightarrow CD) \ge c(A \rightarrow BCD)$$
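
This holds because all three rules share the numerator $$\sigma(ABCD)$$ while the antecedent only shrinks:

$$c(X \rightarrow ABCD - X) = \frac{\sigma(ABCD)}{\sigma(X)}, \qquad \sigma(ABC) \le \sigma(AB) \le \sigma(A)$$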

  • Candidate rule generation (sketched below)
    • Merge two rules that share the same prefix in the rule consequent
      • e.g. merge $$CD \rightarrow AB$$ and $$BD \rightarrow AC$$ into the candidate $$D \rightarrow ABC$$
    • Prune a rule if some of its subset rules do not have high confidence
      • e.g. prune $$D \rightarrow ABC$$ if $$AD \rightarrow BC$$ has low confidence
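
A sketch of this procedure (the support-count function sigma is an assumed helper, e.g. sigma = lambda S: sum(S <= t for t in T)):

from itertools import combinations

def gen_rules(L, sigma, minconf):
    # Generate high-confidence rules X -> Y from one frequent itemset L by
    # growing consequents level-wise, Apriori-style
    L = frozenset(L)
    rules, consequents = [], [frozenset([x]) for x in L]
    while consequents:
        survivors = []
        for Y in consequents:
            X = L - Y
            if not X:                  # skip the empty-antecedent rule {} -> L
                continue
            c = sigma(L) / sigma(X)    # confidence of X -> Y
            if c >= minconf:
                rules.append((X, Y, c))
                survivors.append(Y)
        # Merge surviving consequents that differ in one item, then prune any
        # candidate whose smaller consequents did not all survive
        survivor_set = set(survivors)
        merged = {a | b for a in survivors for b in survivors
                  if len(a | b) == len(a) + 1}
        consequents = [Y for Y in merged
                       if all(frozenset(s) in survivor_set
                              for s in combinations(Y, len(Y) - 1))]
    return rules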
