An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Data Mining
- Discover hidden patterns, correlations, association rules, etc., in large data sets
- When is a discovery interesting, important, significant?
- We develop a rigorous mathematical/statistical approach
Frequent Itemsets
- Dataset D of transactions t_j, each a subset of a base set of items I (t_j ⊆ I)
- Support of an itemset X = number of transactions that contain X
- Example: support({Beer, Diaper}) = 3
Frequent Itemsets
- Discover all itemsets with significant support
- A fundamental primitive in data mining applications
- Example: support({Beer, Diaper}) = 3
Significance
- What support level makes an itemset significantly frequent?
- Minimize false positive and false negative discoveries
- Improve the "quality" of subsequent analyses
- How to narrow the search to focus only on significant itemsets?
- Reduce the possibly exponential search time
Statistical Model
- Input:
  - D = a dataset of t transactions over a set I of items, |I| = n
  - For i ∈ I, let n(i) be the support of {i} in D
  - f_i = n(i)/t = frequency of i in D
- H_0 Model:
  - D = a random dataset of t transactions, |I| = n
  - Item i is included in transaction j with probability f_i, independent of all other events
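The H_0 model above can be sketched in a few lines; this is an illustrative sketch, not the authors' code, and the function names (`random_dataset`, `support`) are my own.

```python
import random

def random_dataset(freqs, t, seed=0):
    # Sample t transactions under H0: item i enters each transaction
    # independently with probability freqs[i].
    rng = random.Random(seed)
    return [{i for i, f in freqs.items() if rng.random() < f}
            for _ in range(t)]

def support(dataset, itemset):
    # Number of transactions containing every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= txn for txn in dataset)

# Two items with frequency 0.5 over 1000 transactions: under H0 the pair
# {a, b} should appear in roughly 0.25 * 1000 = 250 transactions.
data = random_dataset({"a": 0.5, "b": 0.5}, 1000)
s = support(data, {"a", "b"})
```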
Statistical Tests
- H_0: null hypothesis – the support of no itemset is significant with respect to D
- H_1: alternative hypothesis – the support of itemsets X_1, X_2, X_3, … is significant; it is unlikely that their support comes from the distribution of D
- Significance level: α = Prob(rejecting H_0 when it is true)
Naïve Approach
- Let X = {x_1, x_2, …, x_r}
- f_X = ∏_j f_{x_j} = probability that a given itemset is in a given transaction
- s_X = support of X, distributed s_X ∼ B(t, f_X)
- Prob(B(t, f_X) ≥ s_X) = p-value ≤ α
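The naïve binomial-tail p-value is straightforward to compute exactly; a minimal sketch (the helper name `binomial_pvalue` is my own):

```python
from math import comb

def binomial_pvalue(t, f_X, s_X):
    # Naive-test p-value: P(B(t, f_X) >= s_X), computed via the
    # complementary CDF, 1 - P(B(t, f_X) <= s_X - 1).
    cdf = sum(comb(t, k) * f_X**k * (1 - f_X)**(t - k) for k in range(s_X))
    return 1.0 - cdf

# Toy check: P(B(10, 0.5) >= 8) = (C(10,8) + C(10,9) + C(10,10)) / 2^10
p = binomial_pvalue(10, 0.5, 8)  # 56/1024 = 0.0546875
```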
Naïve Approach
- Variations:
  - R = support / E[support in D]
  - R = support − E[support in D]
  - Z-value = (s − E[s]) / σ[s]
  - many more…
- D has 1,000,000 transactions over 1,000 items; each item has frequency 1/1000
- We observed that a pair {i, j} appears 7 times – is this pair statistically significant?
- In D (random dataset):
  - E[support({i, j})] = 1
  - Prob({i, j} has support ≥ 7) ≃ 0.0001
- p-value 0.0001 – must be significant!
What's wrong? – example
- There are 499,500 pairs; each has probability ≈ 0.0001 to appear in at least 7 transactions in D
- The expected number of pairs with support ≥ 7 in D is ≃ 50 – not such a rare event!
- Many false positive discoveries (flagging itemsets that are not significant)
- Need to correct for the multiplicity of hypotheses.
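The multiplicity arithmetic on this slide can be checked directly with the exact binomial tail; a quick sketch under the slide's parameters:

```python
from math import comb

t, n = 10**6, 1000
f_pair = (1 / 1000) ** 2          # probability a fixed pair is in one transaction
# Exact binomial tail: P(support of a fixed pair >= 7) in the random dataset D
p = 1.0 - sum(comb(t, k) * f_pair**k * (1 - f_pair)**(t - k) for k in range(7))
n_pairs = comb(n, 2)              # 499,500 candidate pairs
expected = n_pairs * p            # expected number of pairs with support >= 7
# The exact tail gives expected ~ 42; the slide's ~ 50 uses the rounded
# per-pair probability 1e-4. Either way, not a rare event.
```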
Multi-Hypothesis test
- Testing for significant itemsets of size k involves testing simultaneously for m = C(n, k) null hypotheses
- H_0(X) = "the support of X conforms with D"
- s_X = support of X, distributed s_X ∼ B(t, f_X)
- How to combine m tests while minimizing false positive and false negative discoveries?
Family Wise Error Rate (FWER)
- Family Wise Error Rate (FWER) = probability of at least one false positive (flagging a non-significant itemset as significant)
- Bonferroni method (union bound): test each null hypothesis at significance level α/m
- Too conservative – many false negatives – fails to flag many significant itemsets.
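The Bonferroni correction described above amounts to one line; a minimal sketch (function name is my own):

```python
def bonferroni_reject(pvalues, alpha):
    # Union bound: testing each of the m hypotheses at level alpha/m keeps
    # the probability of any false positive (FWER) at most alpha.
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p < alpha / m]

# With m = 3 the per-test threshold is 0.05/3 ~ 0.0167, so only the
# first hypothesis is rejected.
rejected = bonferroni_reject([0.001, 0.02, 0.4], 0.05)
```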
False Discovery Rate (FDR)
- A less conservative approach
- V = number of false positive discoveries
- R = total number of rejected null hypotheses = number of itemsets flagged as significant
- FDR = E[V/R] (FDR = 0 when R = 0)
- Test with significance level α: reject the maximum number of null hypotheses such that FDR ≤ α
Standard Multi-Hypothesis test
- Less conservative than the Bonferroni method
- For m = C(n, k), still needs a very small individual p-value to reject a hypothesis
Alternative Approach
- Q(k, s_i) = observed number of itemsets of size k and support ≥ s_i
- p-value = the probability of observing Q(k, s_i) in D
- Fewer hypotheses
- How to compute the p-value? What is the distribution of the number of itemsets of size k and support ≥ s_i in D?
- Simulations to estimate the probabilities
- Choose a dataset at random and count
Permutation Test
Main Contributions
- Poisson approximation: let Q_{k,s} = number of itemsets of size k and support ≥ s in D (random dataset); for s ≥ s_min, Q_{k,s} is well approximated by a Poisson distribution.
- Based on the Poisson approximation – a powerful FDR multi-hypothesis test for significant frequent itemsets.
Chen-Stein Method
- A powerful technique for approximating the sum of dependent Bernoulli variables.
- For an itemset X of k items, let Z_X = 1 if X has support at least s, else Z_X = 0
- Q_{k,s} = Σ_X Z_X (over all X of k items)
- U ∼ Poisson(λ), with λ = E[Q_{k,s}]
- I(X) = {Y : |Y| = k, Y ∩ X ≠ ∅}, the neighborhood of dependence of X
Chen-Stein Method (2)
Approximation Result
- Q_{k,s} is well approximated by a Poisson distribution for s ≥ s_min
Monte-Carlo Estimate
- To determine s_min for a given set of parameters (n, t, f_i):
  - Choose m random datasets with the required parameters.
  - For each dataset, extract all itemsets with support at least s′ (≤ s_min)
  - Find the minimum s such that the Poisson approximation holds
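The main ingredient of the Monte-Carlo step is estimating E[Q_{k,s}] by simulation; the sketch below does only that (the actual acceptance condition, a Chen–Stein error bound, is not spelled out on the slide). All names and the uniform-frequency simplification are my own assumptions.

```python
import random
from itertools import combinations

def estimate_EQ(n, t, f, k, s, trials=20, seed=0):
    # Monte-Carlo estimate of E[Q_{k,s}]: average, over random datasets drawn
    # from the H0 model (all item frequencies equal to f), of the number of
    # k-itemsets with support >= s.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        data = [[i for i in range(n) if rng.random() < f] for _ in range(t)]
        supp = {}
        for txn in data:
            for X in combinations(txn, k):
                supp[X] = supp.get(X, 0) + 1
        total += sum(1 for c in supp.values() if c >= s)
    return total / trials

# With n = 20 items at frequency 0.3 and t = 200 transactions, a fixed pair
# has expected support 200 * 0.09 = 18: almost every pair clears s = 1,
# while s = 50 is far out in the tail.
q_low = estimate_EQ(20, 200, 0.3, 2, 1)
q_high = estimate_EQ(20, 200, 0.3, 2, 50)
```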
New Statistical Test
- Instead of testing the significance of the support of individual itemsets, we test the significance of the number of itemsets with a given support
- The null-hypothesis distribution is specified by the Poisson approximation result
- Reduces the number of simultaneous tests
- A more powerful test – fewer false negatives
Test I
- Define α_1, α_2, α_3, … such that Σ_i α_i ≤ α
- For i = 0, …, log(s_max − s_min) + 1:
  - s_i = s_min + 2^i
  - Q(k, s_i) = observed number of itemsets of size k and support ≥ s_i
  - H_0(k, s_i) = "Q(k, s_i) conforms with Poisson(λ_i)"
  - Reject H_0(k, s_i) if p-value < α_i
Test I
- Let s* be the smallest s such that H_0(k, s) is rejected by Test I
- With confidence level α, the number of itemsets with support ≥ s* is significant
- Some itemsets with support ≥ s* could still be false positives
Test II
- Define β_1, β_2, β_3, … such that Σ_i β_i ≤ β
- Reject H_0(k, s_i) if: p-value < α_i and Q(k, s_i) ≥ α_i / β_i
- Let s* be the minimum s such that H_0(k, s) was rejected
- If we flag all itemsets with support ≥ s* as significant, then FDR ≤ β
Proof
- V_i = number of false discoveries if H_0(k, s_i) is the first rejected hypothesis
- E_i = the event "H_0(k, s_i) is rejected"
Real Datasets
- FIMI repository: http://fimi.cs.helsinki.fi/data/
- Standard benchmarks
- m = avg. transaction length
Experimental Results
- Poisson "regime" – no itemsets expected
Experimental Results
- Poisson approximation:
  - not used to approximate the p-values of individual itemset hypotheses (these are small!)
  - used to find the minimum s such that the approximation holds
  - fewer simulations
  - less time per simulation ("few" itemsets)
Experimental Results
- Test II: β = 0.05, α = 0.05
- R_{k,s*} = number of itemsets of size k with support ≥ s*
- Itemset of size 154 with support ≥ 7
Experimental Results
- Standard Multi-Hypothesis test: α = 0.05
- R = output size of the Standard Multi-Hypothesis test
- R_{k,s*} = output size of Test II
Collaborators
- Adam Kirsch (Harvard)
- Michael Mitzenmacher (Harvard)
- Andrea Pietracaprina (U. of Padova)
- Geppino Pucci (U. of Padova)
- Eli Upfal (Brown U.)
- Fabio Vandin (U. of Padova)