Home > A Rigorous Statistical Approach for Identifying Significant Itemsets

# A Rigorous Statistical Approach for Identifying Significant Itemsets

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Data Mining

• Discover hidden patterns, correlations, association rules, etc., in large data sets
• When is the discovery interesting, important, significant?
• We develop rigorous mathematical/statistical

approach

Frequent Itemsets

• Dataset D of transactions tj (subsets) of a base set of items I, (tj 2I).
• Support of an itemsets X = number of transactions that contain X.

support({Beer,Diaper}) = 3

Frequent Itemsets

• Discover all itemsets with significant support.
• Fundamental primitive in data mining applications

support({Beer,Diaper}) = 3

Significance

• What support level makes an itemset significantly frequent?
• Minimize false positive and false negative discoveries
• Improve ��quality�� of subsequent analyses
• How to narrow the search to focus only on significant itemsets?
• Reduce the possibly exponential time search

Statistical Model

• Input:
• D = a dataset of t transactions over |I|=n
• For iI, let n(i) be the support of {i} in D
• fi= n(i)/t = frequency of i in D
• H0 Model:
• D = a dataset of t transactions, |I|=n
• Item i is included in transaction j with probability fi independent of all other events.

Statistical Tests

• H0 : null hypothesis – the support of no itemset is significant with respect to D
• H1: alternative hypothesis, the support of itemsets X1, X2, X3,�� is significant. It is unlikely that their support comes from the distribution of D
• Significance level:

�� = Prob( rejecting H0 when it��s true )

Naïve Approach

• Let X={x1,x2,��xr},
• fx =��j fj, probability that a given itemset is in a given transaction
• sx  = support of X, distributed sx B(t, fx)
• Reject H0 if:

Prob(B(t, fx) �� sx) = p-value �� ��

Naïve Approach

• Variations:
• R=support /E[support in D]
• R=support - E[support in D]
• Z-value = (s-E[s])/ϭ[s]
• many more��
• D has 1,000,000 transactions, over 1000 items, each item has frequency 1/1000.
• We observed that a pair {i,j} appears 7 times, is this pair statistically significant?
• In D (random dataset):
• E[ support({i,j}) ] = 1
• Prob({i,j} has support �� 7 ) ≃ 0.0001
• p-value 0.0001  - must be significant!

What��s wrong? – example

What��s wrong? – example

• There are 499,500 pairs, each has probability 0.0001 to appear in 7 transactions in D
• The expected number of pairs with support �� 7 in D is 50,

not such a rare event!

• Many false positive discoveries (flagging itemsets that are not significant)
• Need to correct for multiplicity of hypothesis.

Multi-Hypothesis test

• Testing for significant itemsets of size k involves testing simultaneously for

m=      null hypotheses.

• H0 (X) = support of X conforms with D

s= support of X, distributed: sx B(t, fx)

• How to combine m tests while minimizing false positive and negative discoveries?

Family Wise Error Rate (FWER)

• Family Wise Error Rate (FWER) = probability of at least one false positive

(flagging a non-significant itemset as significant)

• Bonferroni method (union bound) – test each null hypothesis with significance level ��/m
• Too conservative – many false negative – does not flag many significant itemsets.

False Discovery Rate (FDR)

• Less conservative approach
• V= number of false positive discoveries
• R= total number of rejected null hypothesis

= number itemsets flagged as significant

• Test with level of significance �� : reject maximum number of null hypothesis such that FDR �� ��

FDR = E[V/R]      (FDR=0 when R=0)

Standard Multi-Hypothesis test

Standard Multi-Hypothesis test

• Less conservative than Bonferroni method:
• i ��/m VS ��/m
• For m=       , still needs very small individual p-value to reject an hypothesis

Alternative Approach

• Q(k, si) = observed number of itemsets of size k and support �� si
• p-value =                                            the probability of Q(k, si)  in D
• Fewer hypothesis
• How to compute the p-value? What is the distribution of the number of itemsets of size k and support �� si in D ?

• Simulations to estimate the probabilities
• Choose a data set at random and count
• Main problem:   m

small probabilities to reject hypothesis

a lot of simulations to estimate probabilities

Permutation Test

Main Contributions

• Poisson approximation:  let Qk,s = number of itemsets of size k and support s in D (random dataset), for s��smin:

Qk,s is well approximate by a Poisson distribution.

• Based on the Poisson approximation – a powerful FDR multi-hypothesis test for  significant frequent itemsets.

Chen-Stein Method

• A powerful technique for approximating the sum of dependent Bernoulli variables.
• For an itemset X of k items let ZX=1 if X has support at least s, else ZX=0
• Qk,s = ��X ZX   (X of k items)
• U~Poisson(��)
• I(x)= {Y | |y|=k, Y˄X �� empty},

Chen-Stein Method (2)

Approximation Result

• Qk,s is well approximate by a Poisson distribution for s��smin

Monte-Carlo Estimate

• To determine smin for a given set of parameters (n,t,fi ):
• Choose m random datasets with the required parameters.
• For each dataset extract all itemsets with support at least s�� (�� smin)
• Find the minimum s such that

Prob(b1(s)+b2(s) �� ��) �� 1-��

New Statistical Test

• Instead of testing the significance of the support of individual itemsets we test the significance of the number of itemsets with a given support
• The null hypothesis distribution is specified by the Poisson approximation result
• Reduces the number of simultaneous tests
• More powerful test – less false negatives

Test I

• Define ��1, ��2, ��3, �� such that �Ʀ�i�� ��
• For i=0,��,log (smax – smin ) +1
• si= smin +2i
• Q(k, si) = observed number of itemsets of size k and support �� si
• H0(k,si) = ��Q(k,si) conforms with Poisson(��i)��
• Reject H0(k,si) if p-value < ��i

Test I

• Let s* be the smallest s such that

H0 (k,s) rejected by Test I

• With confidence level �� the number of itemsets with support �� s* is significant
• Some itemsets with support �� s* could still be false positive

Test II

• Define ��1, ��2, ��3,�� such that �� ��i�� ��
• Reject H0 (k,si) if:

p-value < ��i  and Q(k,si)�� ��i / ��i

• Let s* be the minimum s such that H0(k,s) was rejected
• If we flag all itemsets with support �� s* as significant, FDR �� ��

Proof

• Vi = false discoveries if H0(k,si) first rejected
• Ei = ��H0(k,si) rejected��

Real Datasets

• FIMI repository
• http://fimi.cs.helsinki.fi/data/
• standard benchmarks
• m = avg. transaction length

Experimental Results

• Poisson approximation
• Poisson ��regime�� �� no itemsets expected

Experimental Results

• Poisson approximation
• not approximating the p-values of itemsets as hypothesis (small!)
• finding the minimum s such that:

Prob(b1(s)+b2(s) �� ��) �� 1-��

• fewer simulations
• less time per simulation (��few�� itemsets)

Experimental Results

• Test II: �� = 0.05, �� = 0.05
• Rk,s* = num. itemsets of size k with support �� s*

Itemset of size 154 with support �� 7

Experimental Results

• Standard Multi-Hypothesis test: �� = 0.05
• R = size output Standard Multi-Hypothesis test
• Rk,s* = size output Test II

Collaborators

Michael Mitzenmacher (Harvard)