- Association: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrence of other items in the transaction.
 - Support count ($\sigma$): frequency of occurrence of an itemset
 - Support: fraction of transactions that contain an item set
 - Frequent Itemset: An itemset whose support is greater or equal to a minsup threshold
 
2. Association rule
- Association rule
 - $ X \rightarrow Y $
 - support (s)
 - fraction of transactions that contains both X and Y
 - confidence
 - measures how often items in Y appear in transactions that contains X
 
3. Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
 - support >= minsup threshold
 - confidence >= minconf threshold
 - Brute-force approach
 - list all possible association rules
 - compute the support and confidence for each rule
 - prune rules that fail the minsup and minconf thresholds
 - this is computationally prohibitive!
 
4. Two-step Approach
- Step 1: frequent itemset generation
 - generate all itemset whose support >= minsup
 - Step 2: rule generation
 - generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
 - Drawback: frequent itemset generation is still computationally expensive
 
5. Computational Complexity
- Given d unique items
 - total number of item set is: $2^d$
 - total number of possible association rules is:
 - $ R = \sum^{d-1}_{k=1} [\binom{d}{k} *\sum^{d-k}_{j=1}\binom{d-k}{j}]$ = $3^d - s^{d+1} +1 $
 
6. Reducing Number of Candidates
- Apriori principal
 - If an itemset is frequent, then all of its subsets must also be frequent
 - i.e., $\forall X, Y: X \subset Y \rightarrow S(X) \geq S(Y)$
 - Apriori algorithm
 - Let k = 1
 - generate frequent itemset of length 1
 - repeat until no new frequent itemsets are identified
 - generate length (k+1) candidate itemsets from length k frequent itemsets
 - prune candidate itemsets containing subsets of length k that are infrequent
 - count the support of each candidate by scanning the db
 - eliminate candidates that are infrequent, leaving only those that are frequent
 
7. Closed Itermset
- An itemset is closed if none of its immediate supersets has the same support as the itemset.
 
8. Effect of Support Distribution
- How to set the appropriate minsup threshold?
 - if minsup is set too high, we could miss items involving interesting rate items (e.g., expensive products)
 - if minsup is set too low, it is computatinally expensive and the number of itemset is very large.
 - Using a single minimum support threshold may not be effective.
 
No comments:
Post a Comment