Home > A Rigorous Statistical Approach for Identifying Significant Itemsets

A Rigorous Statistical Approach for Identifying Significant Itemsets


An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets


Data Mining 

  • Discover hidden patterns, correlations, association rules, etc., in large data sets
  • When is the discovery interesting, important, significant?
  • We develop rigorous mathematical/statistical

   approach


Frequent Itemsets 

  • Dataset D of transactions tj (subsets) of a base set of items I, (tj 2I).
  • Support of an itemsets X = number of transactions that contain X.
 
 

support({Beer,Diaper}) = 3


Frequent Itemsets 

  • Discover all itemsets with significant support.
  • Fundamental primitive in data mining applications
 
 

support({Beer,Diaper}) = 3


Significance 

  • What support level makes an itemset significantly frequent?
    • Minimize false positive and false negative discoveries
    • Improve ��quality�� of subsequent analyses
  • How to narrow the search to focus only on significant itemsets?
    • Reduce the possibly exponential time search

Statistical Model 

  • Input:
    • D = a dataset of t transactions over |I|=n
    • For iI, let n(i) be the support of {i} in D
    • fi= n(i)/t = frequency of i in D
  • H0 Model:
    • D = a dataset of t transactions, |I|=n
    • Item i is included in transaction j with probability fi independent of all other events.

Statistical Tests 

  • H0 : null hypothesis – the support of no itemset is significant with respect to D
  • H1: alternative hypothesis, the support of itemsets X1, X2, X3,�� is significant. It is unlikely that their support comes from the distribution of D
  • Significance level:

    �� = Prob( rejecting H0 when it��s true ) 

 


Naïve Approach 

  • Let X={x1,x2,��xr},
  • fx =��j fj, probability that a given itemset is in a given transaction
  • sx  = support of X, distributed sx B(t, fx)
  • Reject H0 if: 

       Prob(B(t, fx) �� sx) = p-value �� ��

 


Naïve Approach 

  • Variations:
    • R=support /E[support in D]
    • R=support - E[support in D]
    • Z-value = (s-E[s])/ϭ[s]
    • many more��
  • D has 1,000,000 transactions, over 1000 items, each item has frequency 1/1000.
  • We observed that a pair {i,j} appears 7 times, is this pair statistically significant?
  • In D (random dataset):
    • E[ support({i,j}) ] = 1
    • Prob({i,j} has support �� 7 ) ≃ 0.0001
  • p-value 0.0001  - must be significant!
 

What��s wrong? – example


What��s wrong? – example  

  • There are 499,500 pairs, each has probability 0.0001 to appear in 7 transactions in D
  • The expected number of pairs with support �� 7 in D is 50,

   not such a rare event!

  • Many false positive discoveries (flagging itemsets that are not significant)
  • Need to correct for multiplicity of hypothesis.

Multi-Hypothesis test 

  • Testing for significant itemsets of size k involves testing simultaneously for

   m=      null hypotheses.

  • H0 (X) = support of X conforms with D

   s= support of X, distributed: sx B(t, fx)

  • How to combine m tests while minimizing false positive and negative discoveries?

Family Wise Error Rate (FWER) 

  • Family Wise Error Rate (FWER) = probability of at least one false positive

    (flagging a non-significant itemset as significant)

  • Bonferroni method (union bound) – test each null hypothesis with significance level ��/m
  • Too conservative – many false negative – does not flag many significant itemsets.

 


False Discovery Rate (FDR) 

  • Less conservative approach
  • V= number of false positive discoveries
  • R= total number of rejected null hypothesis

     = number itemsets flagged as significant 
 

  • Test with level of significance �� : reject maximum number of null hypothesis such that FDR �� ��
 

FDR = E[V/R]      (FDR=0 when R=0)


Standard Multi-Hypothesis test


Standard Multi-Hypothesis test 

  • Less conservative than Bonferroni method:
    • i ��/m VS ��/m
  • For m=       , still needs very small individual p-value to reject an hypothesis 

Alternative Approach 

    • Q(k, si) = observed number of itemsets of size k and support �� si
  • p-value =                                            the probability of Q(k, si)  in D
  • Fewer hypothesis
  • How to compute the p-value? What is the distribution of the number of itemsets of size k and support �� si in D ?

 


  • Simulations to estimate the probabilities
    • Choose a data set at random and count
  • Main problem:   m
 

    small probabilities to reject hypothesis 

    a lot of simulations to estimate probabilities 

Permutation Test


Main Contributions 

  • Poisson approximation:  let Qk,s = number of itemsets of size k and support s in D (random dataset), for s��smin:

   Qk,s is well approximate by a Poisson distribution.

  • Based on the Poisson approximation – a powerful FDR multi-hypothesis test for  significant frequent itemsets.

Chen-Stein Method 

  • A powerful technique for approximating the sum of dependent Bernoulli variables.
  • For an itemset X of k items let ZX=1 if X has support at least s, else ZX=0 
  • Qk,s = ��X ZX   (X of k items)
  • U~Poisson(��)
  • I(x)= {Y | |y|=k, Y˄X �� empty},

Chen-Stein Method (2)


Approximation Result 

  • Qk,s is well approximate by a Poisson distribution for s��smin

Monte-Carlo Estimate 

  • To determine smin for a given set of parameters (n,t,fi ):
    • Choose m random datasets with the required parameters.
    • For each dataset extract all itemsets with support at least s�� (�� smin)
    • Find the minimum s such that

           Prob(b1(s)+b2(s) �� ��) �� 1-��

New Statistical Test 

  • Instead of testing the significance of the support of individual itemsets we test the significance of the number of itemsets with a given support
  • The null hypothesis distribution is specified by the Poisson approximation result
  • Reduces the number of simultaneous tests
  • More powerful test – less false negatives

Test I 

  • Define ��1, ��2, ��3, �� such that �Ʀ�i�� ��
  • For i=0,��,log (smax – smin ) +1
    • si= smin +2i
    • Q(k, si) = observed number of itemsets of size k and support �� si
    • H0(k,si) = ��Q(k,si) conforms with Poisson(��i)��
    • Reject H0(k,si) if p-value < ��i 

 


Test I 

  • Let s* be the smallest s such that

   H0 (k,s) rejected by Test I

  • With confidence level �� the number of itemsets with support �� s* is significant
  • Some itemsets with support �� s* could still be false positive 

Test II 

  • Define ��1, ��2, ��3,�� such that �� ��i�� ��
  • Reject H0 (k,si) if: 

         p-value < ��i  and Q(k,si)�� ��i / ��i 

    • Let s* be the minimum s such that H0(k,s) was rejected
    • If we flag all itemsets with support �� s* as significant, FDR �� ��

    Proof 

    • Vi = false discoveries if H0(k,si) first rejected
    • Ei = ��H0(k,si) rejected��

    Real Datasets 

    • FIMI repository
      • http://fimi.cs.helsinki.fi/data/
      • standard benchmarks
      • m = avg. transaction length

    Experimental Results 

    • Poisson approximation
    • Poisson ��regime�� �� no itemsets expected  
       
       
       
       
       

    Experimental Results 

    • Poisson approximation
      • not approximating the p-values of itemsets as hypothesis (small!)
      • finding the minimum s such that:

                Prob(b1(s)+b2(s) �� ��) �� 1-��

        • fewer simulations
        • less time per simulation (��few�� itemsets)

     


    Experimental Results 

    • Test II: �� = 0.05, �� = 0.05
      • Rk,s* = num. itemsets of size k with support �� s*
     
     

    Itemset of size 154 with support �� 7


    Experimental Results 

    • Standard Multi-Hypothesis test: �� = 0.05
      • R = size output Standard Multi-Hypothesis test
      • Rk,s* = size output Test II

    Collaborators 

      Adam Kirsch (Harvard)

      Michael Mitzenmacher (Harvard)

      Andrea Pietracaprina (U. of Padova) 

      Geppino Pucci (U. of Padova)

      Eli Upfal (Brown U.)

      Fabio Vandin (U. of Padova)


Set Home | Add to Favorites

All Rights Reserved Powered by Free Document Search and Download

Copyright © 2011
This site does not host pdf,doc,ppt,xls,rtf,txt files all document are the property of their respective owners. complaint#nuokui.com
TOP