An Efficient Rigorous
Approach for Identifying Statistically Significant Frequent Itemsets
Data Mining
- Discover hidden patterns,
correlations, association rules, etc., in large data sets
- When is the discovery interesting,
important, significant?
- We develop rigorous mathematical/statistical
approach
Frequent Itemsets
- Dataset D of transactions tj
(subsets) of a base set of items I, (tj ⊆ 2I).
- Support of an itemsets
X = number of transactions that contain X.
support({Beer,Diaper}) = 3
Frequent Itemsets
- Discover all itemsets with
significant support.
- Fundamental primitive in
data mining applications
support({Beer,Diaper}) = 3
Significance
- What support level makes
an itemset significantly frequent?
- Minimize false positive and
false negative discoveries
- Improve ��quality�� of
subsequent analyses
- How to narrow the search
to focus only on significant itemsets?
- Reduce the possibly exponential
time search
Statistical Model
- Input:
- D = a dataset of
t transactions over |I|=n
- For i∊I, let n(i) be the support of
{i} in D.
- fi=
n(i)/t = frequency of i in D
- H0
Model:
- D =
a dataset of t transactions, |I|=n
- Item i is included
in transaction j with probability fi
independent of all other events.
Statistical Tests
- H0
: null hypothesis – the support of no itemset is significant with
respect to D
- H1:
alternative hypothesis, the support of itemsets X1,
X2,
X3,�� is significant. It is unlikely
that their support comes from the distribution of D
- Significance level:
�� = Prob( rejecting H0
when it��s true )
Naïve Approach
- Let X={x1,x2,��xr},
- fx
=��j
fj,
probability that a given itemset
is in a given transaction
- sx = support
of X, distributed sx ∼ B(t, fx)
Prob(B(t, fx)
�� sx)
= p-value �� ��
Naïve Approach
- Variations:
- R=support /E[support
in D]
- R=support - E[support
in D]
- Z-value = (s-E[s])/ϭ[s]
- many more��
- D has 1,000,000 transactions,
over 1000 items, each item has frequency 1/1000.
- We observed that a pair
{i,j} appears 7 times, is this pair statistically
significant?
- In D (random dataset):
- E[ support({i,j})
] = 1
- Prob({i,j}
has support ��
7 ) ≃ 0.0001
- p-value 0.0001 -
must be significant!
What��s wrong? –
example
What��s wrong? –
example
- There are 499,500 pairs,
each has probability 0.0001 to appear in 7 transactions in D
- The expected number of pairs
with support �� 7 in D is ≃ 50,
not such
a rare event!
- Many false positive discoveries
(flagging itemsets that are not significant)
- Need to correct for multiplicity
of hypothesis.
Multi-Hypothesis test
- Testing for significant
itemsets of size k involves testing simultaneously for
m=
null hypotheses.
- H0
(X) = support of X conforms with D
sx = support of X, distributed: sx ∼
B(t, fx)
- How to combine m
tests while minimizing false positive and negative discoveries?
Family Wise Error Rate
(FWER)
- Family Wise Error Rate (FWER)
= probability of at least one false positive
(flagging a non-significant
itemset as significant)
- Bonferroni method (union
bound) – test each null hypothesis with significance level ��/m
- Too conservative –
many false negative – does not flag many significant itemsets.
False Discovery Rate
(FDR)
- Less conservative approach
- V= number of false positive
discoveries
- R= total number of rejected
null hypothesis
= number itemsets
flagged as significant
- Test with level of significance �� : reject maximum number of null
hypothesis such that FDR
��
��
FDR
= E[V/R] (FDR=0 when
R=0)
Standard Multi-Hypothesis
test
Standard Multi-Hypothesis
test
- Less conservative than Bonferroni
method:
- For m=
, still needs very small individual p-value to reject an hypothesis
Alternative Approach
- Q(k, si) = observed
number of itemsets of size k and support �� si
- p-value =
the probability of Q(k, si)
in D
- Fewer hypothesis
- How to compute the p-value?
What is the distribution of the number of itemsets of size k
and support �� si
in D ?
- Simulations to estimate
the probabilities
- Choose a data set at random
and count
Permutation Test
Main Contributions
- Poisson approximation:
let Qk,s
= number of itemsets of size k and support s in D (random
dataset), for s��smin:
Qk,s
is well approximate by a Poisson distribution.
- Based on the Poisson approximation
– a powerful FDR multi-hypothesis test for significant
frequent itemsets.
Chen-Stein Method
- A powerful technique for
approximating the sum of dependent Bernoulli variables.
- For an itemset X
of k items let ZX=1 if X
has support at least s, else ZX=0
- Qk,s = ��X
ZX (X of k items)
- U~Poisson(��)
- I(x)= {Y | |y|=k, Y˄X �� empty},
Chen-Stein Method (2)
Approximation Result
- Qk,s
is well approximate by a Poisson distribution for s��smin
Monte-Carlo Estimate
- To determine smin
for a given set of parameters (n,t,fi
):
- Choose m random datasets
with the required parameters.
- For each dataset extract
all itemsets with support at least s�� (�� smin)
- Find the minimum s such that
New Statistical Test
- Instead of testing the significance
of the support of individual itemsets we test the significance of the number
of itemsets with a given support
- The null hypothesis distribution
is specified by the Poisson approximation result
- Reduces the number of simultaneous
tests
- More powerful test – less
false negatives
Test I
- Define
��1, ��2,
��3, ��
such that �Ʀ�i�� ��
- For i=0,��,log
(smax – smin
) +1
- si=
smin +2i
- Q(k, si) = observed
number of itemsets of size k and support �� si
- H0(k,si)
= ��Q(k,si) conforms
with Poisson(��i)��
- Reject H0(k,si)
if p-value < ��i
Test I
- Let s* be the smallest
s such that
H0
(k,s) rejected by Test I
- With confidence level ��
the number of itemsets with support �� s* is significant
- Some itemsets with support ��
s* could still be false positive
Test II
- Define ��1,
��2, ��3,�� such
that �� ��i�� ��
p-value
< ��i and Q(k,si)��
��i / ��i
- Let s* be the minimum
s such that H0(k,s) was rejected
- If we flag all itemsets
with support �� s* as significant, FDR �� ��
Proof
- Vi
= false discoveries if H0(k,si)
first rejected
- Ei
= ��H0(k,si)
rejected��
Real Datasets
- FIMI repository
- http://fimi.cs.helsinki.fi/data/
- standard benchmarks
- m = avg. transaction
length
Experimental Results
- Poisson ��regime�� ��
no itemsets expected
Experimental Results
- Poisson approximation
- not approximating the p-values
of itemsets as hypothesis (small!)
- finding the minimum s such
that:
- fewer simulations
- less time per simulation
(��few�� itemsets)
Experimental Results
- Test II: �� = 0.05, �� = 0.05
- Rk,s*
= num. itemsets of size k with support �� s*
Itemset of size 154 with support
�� 7
Experimental Results
- Standard Multi-Hypothesis
test: �� = 0.05
- R = size output Standard Multi-Hypothesis test
- Rk,s*
= size output Test II
Collaborators
Adam Kirsch (Harvard)
Michael
Mitzenmacher (Harvard)
Andrea Pietracaprina (U. of Padova)
Geppino Pucci (U.
of Padova)
Eli Upfal (Brown
U.)
Fabio Vandin (U.
of Padova)