Inference on the limiting false discovery rate and the p-value threshold parameter assuming weak dependence between gene expression levels within subject
Glenn Heller1 and Jing Qin2

1Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10021, USA
2Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, Bethesda, Maryland 20892, USA

Corresponding author: Glenn Heller
Email: hellerg@mskcc.org
Phone: 646 735 8112
Fax: 646 735 0010

Running Head: FDR analysis with dependent data

Summary. An objective of microarray data analysis is to identify gene expressions that are associated with a disease related outcome. For each gene, a test statistic is computed to determine if an association exists, and this statistic generates a marginal p-value. In an effort to pool this information across genes, a p-value density function is derived. The p-value density is modeled as a mixture of a uniform (0,1) density and a scaled ratio of normal densities derived from the asymptotic normality of the test statistic. The p-values are assumed to be weakly dependent and a quasi-likelihood is used to estimate the parameters in the mixture density. The quasi-likelihood and the weak dependence assumption enable estimation and asymptotic inference on the false discovery rate for a given rejection region, and its inverse, the p-value threshold parameter for a fixed false discovery rate. A false discovery rate analysis on a localized prostate cancer data set is used to illustrate the methodology. Simulations are performed to assess the performance of the methodology.

Keywords: Asymptotic normal test statistic, confidence interval, microarray, p-value mixture model, quasi-likelihood, weak dependence.

1 Introduction

Microarray analysis is used to identify gene expressions that are associated with a disease related outcome. It is typically exploratory, with no a priori hypothesis concerning the association between a specific gene expression and the outcome variable. The intent of the analysis is to generate hypotheses for further exploration, either in the laboratory or in the clinic. Microarray technology is currently applied in many areas, including clinical staging, cell line classification, distinguishing tumor type, and understanding the effect of a biological agent. Scientists believe that identification of informative genes will provide insight into a disease mechanism, a genetic pathway, or isolation of a therapeutic target.

Microarray analysis generates a vast amount of data. In a typical study, tens of thousands of gene expressions are recorded on the subjects under study. For each gene, a test of association between gene expression and the outcome variable is performed. The statistical challenge is to determine which genes are truly associated with outcome. Simply testing each gene individually, without adjustment for the number of genes examined, provides little confidence that true associations are identified and unimportant genes are eliminated from further study. For example, if a test statistic is computed for each gene and the genes with the highest test statistics are deemed discriminatory, the choice of cut-off remains problematic. Due to the thousands of tests performed, use of a standard nominal significance level to determine this critical region will overstate the number of significant associations identified.

Traditionally, protection against multiple comparisons is undertaken by choosing the critical region of a test to satisfy a familywise error rate of $\alpha$, where the familywise error rate is defined as the probability of rejecting at least one true hypothesis (Hochberg and Tamhane, 1987). While protecting against falsely rejecting tests, the familywise error rate is a conservative approach, resulting in a loss of power in each of the individual tests. As a result, using this approach to adjust for the multiple tests could miss genes that are associated with outcome. To correct for this conservativeness, Benjamini and Hochberg (1995) developed the false discovery rate (FDR), which is defined as
the expected proportion of false rejections of the null hypothesis. The FDR represents a compromise between the conservative familywise error rate and testing each gene at the nominal significance level. Variations of the false discovery rate, termed the positive, conditional, and marginal false discovery rates, have been proposed in the literature (Benjamini and Hochberg 1995, Storey 2002, Tsai 2003). Assuming weak dependence and at least one rejected test statistic, as the number of genes tested increases, these false discovery rates all converge to the probability that a gene is not associated with the outcome conditional on the test statistic lying in the rejection region. As a result of this asymptotic equivalence, we call the conditional probability the limiting false discovery rate. For the analysis of gene array data, where tens of thousands of tests are carried out, this asymptotic evaluation of the FDR is reasonable.

The limiting FDR has been estimated in the literature by pooling information across genes and using a mixture model for the density of the test statistic or corresponding p-value. Pan et al. (2003) employ a normal mixture model for the density of the t-statistic. Parker and Rothenberg (1988), Allison et al. (2002), and Pounds and Morris (2003) use a Uniform-Beta mixture to model the p-value density. The accuracy of the resulting FDR estimates relies on the adequacy of the mixture density specification. To avoid this specification, nonparametric estimates of the mixture density have been proposed by Efron et al. (2001), Storey (2002), and Black (2004). In addition to estimation of the limiting FDR, estimates of the p-value threshold, adapted from the sequential p-value method of Benjamini and Hochberg (1995), have been developed by Benjamini and Hochberg (2000), Genovese and Wasserman (2004), and Storey et al. (2004). Although these FDR and p-value threshold estimates are commonly employed in the analysis of microarray data, their precision is typically ignored. One exception is Owen (2005), who computed the variance of the number of false discoveries when genes are dependent, but this calculation is conditional on the observed gene expression data and assumes all genes are unrelated to the outcome variable.

In this paper, we develop estimates of the limiting false discovery rate evaluated at a p-value threshold, and its inverse, the p-value threshold evaluated
at a fixed FDR, from a quasi-likelihood derived from marginal p-value mixture densities. The asymptotic normality of the FDR and p-value threshold estimates stems from quasi-likelihood based results and a weak dependence assumption between gene expression values within subject. Estimation of the asymptotic variance for the limiting FDR estimate is derived, and confidence intervals for the limiting FDR and p-value threshold parameters are developed, accounting for the potential dependence between genes. We believe these estimates of precision provide a unique perspective to error rate analysis of microarray data.

The methodology is demonstrated on a microarray gene expression data set obtained from 79 patients who underwent a radical prostatectomy for localized prostate cancer. The data were obtained from tissue samples taken at the time of surgery. In the analysis, patients were followed for at least seven years; 37 patients were classified with recurrent disease based on a rising PSA profile, whereas 42 patients, classified with nonrecurrent disease, remained with an undetectable PSA seven years after surgery (Stephenson et al. 2005). Prostate specific antigen (PSA) is a biomarker that is commonly used to determine the existence of prostate tumor cells in the patient. The gene expression analysis was carried out using the Affymetrix U133A human gene array, which has 22,283 genes. Expression values on each array were preprocessed using Affymetrix MAS 5.0. This preprocessing algorithm includes a background adjustment of the expression values and a within array scale transformation, producing a 2% trimmed mean within each array equal to 500. The choice of 500 is the default value for MAS 5.0.

2 P-value Mixture Model

To determine differential gene expression between the recurrent/nonrecurrent outcomes, a t-test was performed for each gene, and the accompanying p-value, based on the standard normal reference distribution, was computed to test the hypotheses

H0g: no difference in gene g expression between outcome groups
H1g: gene g expression is different between outcome groups (g = 1, ..., G).

Each p-value is generated from one of these two classes (not different/different). A random variable $D_g$ indicates whether the observed p-value for gene g, denoted by $p_g$, was generated from the null class ($D_g = 0$) or the alternative class ($D_g = 1$). The marginal distribution of $D_g$ is Bernoulli with parameter $\pi = \Pr(\text{not different})$ and the density of $P$ given $D$ is written as $f_D(p)$. Since the $D_g$ are not observed, the marginal density of $P$ is represented as a mixture of two density functions
$$f(p) = \pi f_0(p) + (1 - \pi) f_1(p).$$
In the null class, the distribution of $P$ is uniform (0,1),
$$f_0(p) = 1, \qquad 0 < p < 1.$$
The alternative density is
$$f_1(p; \theta) = \frac{1}{2}\,\frac{\phi_{\theta,1}(\Phi^{-1}(1 - p/2))}{\phi_{0,1}(\Phi^{-1}(1 - p/2))},$$
where $\phi_{\mu,\sigma}(u)$ denotes a normal density function with location and scale parameters $\mu$ and $\sigma$, and $\Phi$ is the standard normal distribution function. This density is derived from the asymptotic normality of the test statistic $T$ and the change of variable $P = 2(1 - \Phi(|T|))$ (Hung et al. 1997).

Estimation and inference in this work are based on the marginal p-values derived from the asymptotic normality of the test statistic. The accuracy of the inference derived from the proposed methodology is a function of the accuracy of the asymptotic normal approximation and thus improves as the sample size increases. Although Student's t-statistic is used to generate the p-values for the prostate cancer gene expression data, the application is wide ranging and can be applied to any k-sample comparison or test of association that is based on an asymptotic normal test statistic.

The mixture model used to represent the density of the p-values is
$$f(p; \pi, \theta) = \pi + (1 - \pi)\,\frac{1}{2}\,\frac{\phi_{\theta,1}(\Phi^{-1}(1 - p/2))}{\phi_{0,1}(\Phi^{-1}(1 - p/2))}, \qquad 0 < p \le 1. \tag{1}$$
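To make the mixture specification concrete, the short Python sketch below evaluates the alternative and mixture densities exactly as written in (1). It assumes only that the p-values arise from a two-sided asymptotically normal test statistic; the function names and the example parameter values are illustrative, not part of the original analysis.

```python
import numpy as np
from scipy.stats import norm

def f1(p, theta):
    """Alternative p-value density: 0.5 * phi_{theta,1}(z) / phi_{0,1}(z),
    with z = Phi^{-1}(1 - p/2), from the asymptotic normality of the test statistic."""
    z = norm.ppf(1.0 - p / 2.0)
    return 0.5 * norm.pdf(z, loc=theta, scale=1.0) / norm.pdf(z, loc=0.0, scale=1.0)

def mixture_density(p, pi0, theta):
    """Marginal p-value density f(p; pi, theta) = pi * 1 + (1 - pi) * f1(p; theta)."""
    return pi0 + (1.0 - pi0) * f1(p, theta)

# Illustrative evaluation at a few p-values (parameter values are hypothetical).
p = np.array([0.001, 0.01, 0.05, 0.5])
print(mixture_density(p, pi0=0.75, theta=1.89))
```

Integrating $f_1$ over $(0, \gamma]$ recovers the alternative distribution function $F_1(\gamma; \theta) = 1 - \Phi[\Phi^{-1}(1 - \gamma/2) - \theta]$ that appears in Section 3.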

In this model, the parameter $\theta$ measures the strength of the differentially expressed genes, with a large value signaling that there is a group of genes with very small p-values. For the two-sample t-test, assuming a common variance $\sigma^2$ and $n_1, n_2$ subjects in the two groups,
$$\theta = \left(\frac{n_1 n_2}{n_1 + n_2}\right)^{1/2} \left(\frac{|\mu_1 - \mu_2|}{\sigma}\right).$$

Estimation of the parameters $\omega = (\pi, \theta)$ is based on the loglikelihood
$$L_G(\pi, \theta) \equiv \sum_{g=1}^{G} l_g(\omega) = \sum_{g=1}^{G} \log\left\{\pi + (1 - \pi)\,\frac{1}{2}\,\frac{\phi_{\theta,1}(\Phi^{-1}(1 - p_g/2))}{\phi_{0,1}(\Phi^{-1}(1 - p_g/2))}\right\}. \tag{2}$$

In the context of microarray analysis, it is not plausible to treat the G gene derived p-values as independent. We therefore treat $L_G(\pi, \theta)$ as a log quasi-likelihood. A form of weak dependence between the quasi-score components, described in conditions (D1)-(D3) below, is sufficient to satisfy the central limit theorem and the weak law of large numbers for the parameter estimates (Serfling 1968). As a result, the theorem stated following these conditions provides the asymptotic inferential structure for $\omega$. Denote the quasi-score component as
$$s_g(\omega) = \frac{\partial l_g(\omega)}{\partial \omega},$$
$M_a$ as a $\sigma$-algebra generated by the quasi-score components $\{s_1(\omega), \ldots, s_a(\omega)\}$, and the partial sum of the quasi-score as
$$T_a(\omega) = G^{-1/2} \sum_{g=a+1}^{a+G} s_g(\omega).$$
The following conditions are used to define weak dependence of the quasi-score components:

(D1) $E[s_g(\omega_0)] = 0$

(D2) $\lim_{G \to \infty} \mathrm{var}[T_a(\omega_0)] = V(\omega_0)$ uniformly in $a$

(D3) $E[E^2(T_a(\omega_0) \mid M_a)] \le b(G)$ and $E|\mathrm{var}(T_a(\omega_0) \mid M_a) - \mathrm{var}(T_a(\omega_0))| \le b(G)$,

where $b(G) = O(G^{-\delta})$, $\delta > 0$.

Theorem: Assume that the weak dependence conditions (D1)-(D3) are satisfied. Define $\hat\omega = (\hat\pi, \hat\theta)$ as the maximum quasi-likelihood estimate derived from equation (2) and let $\omega_0$ denote the true value of $\omega$. Let $N(\omega_0)$ denote a bounded neighborhood around $\omega_0$ and assume $G^{-1}L_G(\omega)$ and its derivative are uniformly bounded in $N(\omega_0)$. Then as $n \to \infty$ and $G \to \infty$,

(1) $\hat\omega$ is a consistent estimator of $\omega_0$,

(2) $\sqrt{G}(\hat\omega - \omega_0) \xrightarrow{D} N(0, \Sigma)$, with $\Sigma = A^{-1} V A^{-1}$,

where
$$\frac{1}{\sqrt{G}} \sum_{g=1}^{G} \frac{\partial l_g(\omega_0)}{\partial \omega} \xrightarrow{D} N(0, V), \qquad \frac{1}{G} \sum_{g=1}^{G} \frac{\partial^2 l_g(\omega_0)}{\partial \omega\, \partial \omega^T} \xrightarrow{P} A, \qquad \frac{1}{G} \sum_{(g,h) \in C} \frac{\partial l_g(\omega_0)}{\partial \omega} \frac{\partial l_h(\omega_0)}{\partial \omega} \xrightarrow{P} V,$$
and $C$ is the set of quasi-score component pairs with nonzero correlation (Lumley and Heagerty 1999). The matrices $A$ and $V$ are consistently estimated by
$$A_n(\hat\omega) = \frac{1}{G} \sum_{g=1}^{G} \left.\frac{\partial^2 l_g(\omega)}{\partial \omega\, \partial \omega^T}\right|_{\omega = \hat\omega}, \qquad V_n(\hat\omega) = \frac{1}{G} \sum_{(g,h) \in C} \left.\frac{\partial l_g(\omega)}{\partial \omega} \frac{\partial l_h(\omega)}{\partial \omega}\right|_{\omega = \hat\omega}.$$

Estimation of $V$ requires knowledge of the correlated quasi-score component pairs. The following argument is used to carry out this computation. Let $t(e_g) = \Phi^{-1}(1 - p_g/2)$ represent the observed value of the t-statistic for gene g as a function of the gene expression data for the n subjects, where $e_g^T = (e_{1g}, \ldots, e_{ng})$ is the gene expression data vector. These n elements are derived from the two outcome groups (recurrence / no recurrence). It is assumed that the elements are
generated independently and, up to a shift in location, identically distributed. The quasi-score for gene g is now written as
$$s(\omega; e_g) = \frac{\partial}{\partial \omega} \log\left\{\pi + (1 - \pi)\,\frac{1}{2}\,\frac{\phi_{\theta,1}(t(e_g))}{\phi_{0,1}(t(e_g))}\right\},$$
where we have added the gene expression vector as an argument to the quasi-score component. Application of the mean value theorem produces the following covariance calculation for the quasi-score components corresponding to genes (g, h):
$$\mathrm{cov}[s(\omega; e_g), s(\omega; e_h)] = \mathrm{cov}\left[\{s(\omega; \mu_g) + W^T(\omega; \mu_g^*)(e_g - \mu_g)\},\ \{s(\omega; \mu_h) + W^T(\omega; \mu_h^*)(e_h - \mu_h)\}\right],$$
where $E(e_g) = \mu_g$, and $W(\omega; \mu_g^*) = \partial s(\omega; e_g)/\partial e_g$ is an $n \times 2$ matrix evaluated at the point $\mu_g^*$, which lies on a line between $e_g$ and $\mu_g$. Letting $\sigma_{gh}$ denote the covariance of the gene expression data for genes g and h, it follows that
$$\mathrm{cov}[s(\omega; e_g), s(\omega; e_h)] = \sigma_{gh}\, W^T(\omega; \mu_g^*) W(\omega; \mu_h^*).$$
Thus, elimination of noncorrelated quasi-score component pairs can be accomplished by testing whether the corresponding gene expression data pairs are correlated.

For the prostate cancer data, the sample correlation matrix $R = (r_{gh})$ was computed for the 22,283 genes and Fisher's z-test statistic was used to test whether each gene pair had correlation zero. There were $G(G-1)/2 \approx 250$ million tests to determine correlated gene pairs, resulting in a further multiple comparison problem. Using the Benjamini and Hochberg (1995) procedure of controlling the false discovery rate at the 0.05 level, gene pairs were considered correlated if the test statistic produced a p-value less than 0.003. Seven percent of the gene pairs demonstrated a nonzero correlation for the expression data using this FDR criterion and were included in the summand for the estimate $V_n(\hat\omega)$.

Maximization of the quasi-likelihood was accomplished through the Nelder-Mead simplex algorithm, under the constrained parameter space ($0 < \pi < 1$, $0 < \theta$). The parameter estimates from the prostate cancer data were $\hat\pi = 0.75$, $\hat\theta = 1.89$. To assess the adequacy of the mixture model (1), the observed p-values were compared to the model derived p-values. As shown in Figure 1, the mixture model along with the maximum quasi-likelihood parameter estimates provides an adequate fit to the data.
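A minimal sketch of the quasi-likelihood maximization described above is given below, assuming the observed p-values are held in a NumPy array. It uses scipy.optimize.minimize with the Nelder-Mead simplex method; the logit/log reparameterization enforcing $0 < \pi < 1$ and $\theta > 0$ is our own device for illustration, not necessarily how the authors handled the constrained parameter space.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_quasi_likelihood(params, pvals):
    """Negative of L_G(pi, theta) in equation (2), on an unconstrained scale:
    pi = expit(a), theta = exp(b).  Assumes 0 < p <= 1 for every entry of pvals."""
    a, b = params
    pi0 = 1.0 / (1.0 + np.exp(-a))                 # 0 < pi < 1
    theta = np.exp(b)                              # theta > 0
    z = norm.ppf(1.0 - pvals / 2.0)
    # phi_{theta,1}(z) / phi_{0,1}(z) = exp(theta*z - theta^2/2), scaled by 1/2
    f1 = 0.5 * np.exp(theta * z - 0.5 * theta ** 2)
    return -np.sum(np.log(pi0 + (1.0 - pi0) * f1))

def fit_mixture(pvals, start=(0.5, 1.0)):
    """Maximize the quasi-likelihood by the Nelder-Mead simplex algorithm."""
    a0 = np.log(start[0] / (1.0 - start[0]))
    b0 = np.log(start[1])
    res = minimize(neg_log_quasi_likelihood, x0=[a0, b0],
                   args=(pvals,), method="Nelder-Mead")
    a, b = res.x
    return 1.0 / (1.0 + np.exp(-a)), np.exp(b)     # (pi_hat, theta_hat)

# Usage (pvals would be the vector of observed per-gene p-values):
# pi_hat, theta_hat = fit_mixture(pvals)
```

Applied to the 22,283 prostate cancer p-values, a fit of this form would be expected to return estimates close to the reported $\hat\pi = 0.75$ and $\hat\theta = 1.89$.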
Figure 1 (histogram of p; horizontal axis: p, vertical axis: density). Histogram for the observed 22,283 gene p-values and the p-value mixture density estimate. The horizontal dotted line represents the quasi-likelihood estimate of $\pi$.

The mixture parameter estimate indicates that 25% of the 22,283 genes analyzed are associated with recurrence. It is generally recognized that the level of confidence regarding membership in this alternative class is not equal for all genes. Gene membership in the alternative class becomes increasingly likely as the p-value decreases. Our strategy is to report those genes where there is a high level of confidence of this association.
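The correlated-pair screening and sandwich variance calculation described above can be sketched as follows. The sketch assumes the pair set C contains each gene paired with itself plus the off-diagonal pairs retained by Fisher's z-test with Benjamini-Hochberg screening at the 0.05 level, and it obtains the per-gene quasi-score components by finite differences; all names are ours, and for the full 22,283-gene array the G x G correlation matrix would have to be processed in blocks rather than held in memory as done here.

```python
import numpy as np
from scipy.stats import norm

def gene_scores(pvals, pi0, theta, eps=1e-6):
    """Per-gene quasi-score components s_g = d log f(p_g; pi, theta)/d(pi, theta),
    computed by central finite differences (an illustration, not the authors' code)."""
    def log_terms(pi_, th_):
        z = norm.ppf(1.0 - pvals / 2.0)
        f1 = 0.5 * np.exp(th_ * z - 0.5 * th_ ** 2)   # (1/2) phi_{theta,1}(z)/phi_{0,1}(z)
        return np.log(pi_ + (1.0 - pi_) * f1)
    s_pi = (log_terms(pi0 + eps, theta) - log_terms(pi0 - eps, theta)) / (2 * eps)
    s_th = (log_terms(pi0, theta + eps) - log_terms(pi0, theta - eps)) / (2 * eps)
    return np.column_stack([s_pi, s_th])              # G x 2 matrix of scores

def correlated_pairs(expr, fdr_level=0.05):
    """Screen gene pairs: Fisher z-test of zero correlation, kept when the
    Benjamini-Hochberg procedure at fdr_level rejects. expr is genes x subjects."""
    G, n = expr.shape
    r = np.corrcoef(expr)
    gi, hi = np.triu_indices(G, k=1)
    z = np.arctanh(r[gi, hi]) * np.sqrt(n - 3)
    p = 2.0 * (1.0 - norm.cdf(np.abs(z)))
    order = np.argsort(p)
    m = p.size
    passed = np.nonzero(np.sort(p) <= fdr_level * np.arange(1, m + 1) / m)[0]
    keep = np.zeros(m, dtype=bool)
    if passed.size:
        keep[order[: passed[-1] + 1]] = True          # BH step-up rejections
    return gi[keep], hi[keep]

def V_n(scores, pairs):
    """Quasi-score covariance estimate: the g = h terms plus the symmetric
    contributions of the retained correlated pairs, divided by G."""
    G = scores.shape[0]
    gi, hi = pairs
    V = scores.T @ scores
    V += scores[gi].T @ scores[hi] + scores[hi].T @ scores[gi]
    return V / G
```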

3 Gene Selection based on the FDR and P-Value Threshold Parameters

The false discovery rate (FDR) is a popular measure of this confidence level. Assuming a common marginal distribution for each p-value, the limiting FDR for a fixed rejection region $\gamma_0$ is defined by $\tau(\gamma_0) = \Pr(D = 0 \mid P \le \gamma_0)$, the probability that a gene belongs to the null class given that its associated p-value is less than $\gamma_0$ (Storey et al. 2004, Genovese and Wasserman 2004). Using the p-value mixture model framework, the limiting false discovery rate parameter at $\gamma_0$ is defined as
$$\tau(\gamma_0; \omega) = \frac{\pi \gamma_0}{\pi \gamma_0 + (1 - \pi)\left(1 - \Phi[\Phi^{-1}(1 - \gamma_0/2) - \theta]\right)}. \tag{3}$$
The consistency of $\tau(\gamma_0; \hat\omega)$ results from the consistency of the quasi-likelihood estimates. Alternatively, since the FDR defined in (3) is a monotone function of $\gamma$, it can be set to a sufficiently small value $\tau_0$, and a consistent estimate of the threshold p-value $\gamma$ is found through the equation $\tau(\gamma; \hat\omega) = \tau_0$. Functionally, this is accomplished by solving the equation
$$\gamma(\tau_0; \hat\omega) = \frac{(1 - \hat\pi)\, F_1(\gamma; \hat\theta)\, \tau_0}{\hat\pi (1 - \tau_0)}, \tag{4}$$
where $F_1(\gamma; \theta) = 1 - \Phi[\Phi^{-1}(1 - \gamma/2) - \theta]$ is the distribution function of the p-values under the alternative hypothesis evaluated at $\gamma$.

Although the FDR and p-value threshold estimates are consistent, their precision is a function of the level of dependence in the gene expression data. As this dependence increases, the confidence that the FDR and threshold estimates lie in a small neighborhood around their parameter values diminishes. To obtain a better understanding of these parameters, an asymptotic normal pivotal statistic is constructed to produce an asymptotic confidence interval for the FDR parameter $\tau(\gamma_0; \omega_0)$ and its inverse function, the p-value threshold parameter $\gamma(\tau_0; \omega_0)$.

The asymptotic normality of $\tau(\gamma_0; \hat\omega)$ stems from the asymptotic normality of the quasi-likelihood estimate $\hat\omega$, derived in the theorem, and the continuity
of $\tau$ with respect to $\omega$. The asymptotic variance of $\tau(\gamma_0; \hat\omega)$ follows directly from the asymptotic variance of the quasi-likelihood estimates and the delta method:
$$\mathrm{var}[\tau(\gamma_0; \hat\omega)] = \Delta^T \Sigma \Delta, \qquad \text{where } \Delta^T = \left[\frac{\partial \tau}{\partial \pi},\ \frac{\partial \tau}{\partial \theta}\right].$$
The resulting symmetric $(1 - \alpha)$ asymptotic confidence interval for the FDR at the p-value threshold level $\gamma_0$ is
$$\left[\tau(\gamma_0; \hat\omega) - z_{1-\alpha/2}\sqrt{\mathrm{var}\{\tau(\gamma_0; \hat\omega)\}},\ \ \tau(\gamma_0; \hat\omega) + z_{1-\alpha/2}\sqrt{\mathrm{var}\{\tau(\gamma_0; \hat\omega)\}}\right],$$
where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ standard normal quantile.

A confidence interval for the p-value threshold parameter at a given FDR level $\tau_0$ is constructed by applying the inverse transformation to the lower and upper confidence limits of $\tau(\gamma; \omega)$. Specifically, a $(1-\alpha)$ FDR confidence interval for any given $\gamma$ may be written as
$$\Pr[a_\alpha < \tau(\gamma; \omega) < b_\alpha] = 1 - \alpha.$$
By choosing $\gamma$ to be the p-value threshold parameter for a FDR level $\tau_0$, i.e. $\tau(\gamma; \omega) = \tau_0$, and applying the inverse transform, the $(1-\alpha)$ p-value threshold confidence interval is
$$\Pr\left[\frac{(1 - \pi) F_1(\gamma; \theta)\, a_\alpha}{\pi (1 - a_\alpha)} < \gamma(\tau_0; \omega) < \frac{(1 - \pi) F_1(\gamma; \theta)\, b_\alpha}{\pi (1 - b_\alpha)}\right] = 1 - \alpha.$$
Evaluation of this asymptotic confidence interval for the p-value threshold parameter at the FDR level $\tau_0$ is obtained by substituting consistent estimates for $(\omega, \gamma)$ in the upper and lower confidence bounds. These estimates are obtained from equations (2) and (4).

Although either estimate, $\tau(\gamma_0; \hat\omega)$ or $\gamma(\tau_0; \hat\omega)$, may be used in differential gene expression analysis, for the purpose of gene selection a direct approach is to fix the FDR and estimate the p-value threshold region; the genes that fall into the rejection region are chosen for further analysis. This is the approach carried out for the prostate cancer data set.

The proposed confidence interval can be employed in multiple ways for gene selection depending upon the objective of the analysis. If differential
gene expression analysis is used to choose a small set of genes for validation by the laboratory scientist at the bench, then a $(1-\alpha)$ lower confidence bound for the p-value threshold parameter could be used for gene selection. Typically, RT-PCR, northern blots, or immunohistochemistry are used to validate the differentially expressed genes. Alternatively, if the goal is to use the error rate analysis as a screening tool to weed out uninteresting genes for subsequent classification or prediction analysis, then a liberal approach using a $(1 - \alpha)$ upper confidence bound for $\gamma(\tau_0; \omega)$ would be suitable.

The original localized prostate cancer data analysis applied differential gene expression analysis as a filter to select candidate genes for model building and prediction (Stephenson et al. 2005). Using the p-value mixture model, the estimated p-value threshold is $2.1 \times 10^{-3}$ for a FDR level equal to 0.05. Accounting for the variability and dependence in the gene expression data, the width of the 95% confidence interval for the p-value threshold parameter is $2.9 \times 10^{-3}$, which is substantial relative to the threshold estimate. The 95% upper confidence bound for the p-value threshold parameter is $3.8 \times 10^{-3}$. For the prostate cancer p-value gene list, if the threshold estimate were used as the filter, 367 genes would be selected for further analysis. In contrast, the 95% upper confidence bound filter would incorporate 575 genes for the model building component of the analysis. Thus, when using the FDR as a filter, accounting for the variability provides a more liberal, but rational, gene selection mechanism.

It is interesting to note the effect of the dependence assumption on the selection mechanism. Under the assumption that the gene expression values were independent, the width of the 95% confidence interval for the p-value threshold parameter is $4.6 \times 10^{-4}$, approximately one-eighth the interval width of the threshold parameter estimate under weak dependence. Not surprisingly, the independence assumption used in conjunction with 22,283 genes provides a justification for the use of the threshold estimate as the filter, with little penalty for substituting the estimate for the parameter value. For this data set, however, the dependence between genes is an important aspect of the analysis. Thus, the uncertainty in the location of the FDR and p-value threshold parameters should be accounted for in the gene selection analysis.
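The quantities used in this section can be computed directly from the fitted mixture. The sketch below evaluates the limiting FDR in (3), inverts it numerically to obtain the p-value threshold at a fixed FDR level, and forms the delta-method confidence interval with a finite-difference gradient. The covariance argument is assumed to be the estimated covariance of $(\hat\pi, \hat\theta)$, i.e. $A_n^{-1} V_n A_n^{-1}/G$ from the theorem; the function names are ours.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def F1(gamma, theta):
    """Alternative p-value distribution function: 1 - Phi(Phi^{-1}(1 - gamma/2) - theta)."""
    return 1.0 - norm.cdf(norm.ppf(1.0 - gamma / 2.0) - theta)

def fdr(gamma, pi0, theta):
    """Limiting FDR at threshold gamma, equation (3)."""
    return pi0 * gamma / (pi0 * gamma + (1.0 - pi0) * F1(gamma, theta))

def threshold(fdr_level, pi0, theta):
    """P-value threshold gamma solving fdr(gamma) = fdr_level; the level must be
    attainable, i.e. smaller than fdr(1)."""
    return brentq(lambda g: fdr(g, pi0, theta) - fdr_level, 1e-12, 1.0 - 1e-12)

def fdr_confidence_interval(gamma, pi0, theta, cov, alpha=0.05, eps=1e-6):
    """Symmetric (1 - alpha) interval for the FDR at gamma via the delta method.
    cov is the estimated covariance matrix of (pi_hat, theta_hat)."""
    grad = np.array([
        (fdr(gamma, pi0 + eps, theta) - fdr(gamma, pi0 - eps, theta)) / (2 * eps),
        (fdr(gamma, pi0, theta + eps) - fdr(gamma, pi0, theta - eps)) / (2 * eps),
    ])
    se = np.sqrt(grad @ cov @ grad)
    z = norm.ppf(1.0 - alpha / 2.0)
    est = fdr(gamma, pi0, theta)
    return est - z * se, est + z * se

# Illustration with the reported point estimates (the covariance matrix is hypothetical):
# gamma_hat = threshold(0.05, pi0=0.75, theta=1.89)
```

With the reported estimates $\hat\pi = 0.75$ and $\hat\theta = 1.89$, threshold(0.05, 0.75, 1.89) should return a value near the $2.1 \times 10^{-3}$ quoted above.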

Finally, an alternative approach is to infer the asymptotic FDR parameter for a given p-value threshold. Gene selection based on a p-value threshold equal to 0.05 would produce an asymptotic FDR estimate equal to 0.240, with an estimated standard error equal to 0.030. Thus, the probability that a selected gene is not differentially expressed may be as high as 0.300.

4 Simulations

A series of simulations were performed to assess the adequacy of the point and interval estimates of the limiting FDR and the p-value threshold parameters. A two-sample t-test was used to compute the p-value for each of 10,000 "genes". The two-group comparison was based on either 20 or 40 subjects per group. Within each group, the expression data for each gene were generated independently and identically distributed from either a normal or log-Weibull family. For each gene in group 1, a vector of $n_1$ independent identically distributed mean zero and variance 1 random variables was generated. For each gene in group 2, a vector of $n_2$ independent identically distributed random variables was generated with either mean zero and variance 1 or mean 2 and variance 1. The probability that the $n_2$ vector components had mean 2 was set equal to $1 - \pi$. The parameter $\pi$ represents the proportion of true null hypotheses and was chosen to equal {0.3, 0.6, 0.9} for the simulations. Within each subject, a block dependence structure between genes was generated. The block size was 500, with equal correlation $\rho$ between genes within a block. The values of $\rho$ used in the simulations were {0, 0.3, 0.6}. This correlation structure is m-dependent and satisfies the weak dependence conditions. Five hundred replications were run for each simulation.

The results of the simulations are presented in Tables 1 and 2. In general, the level of correlation between genes did not influence the bias or coverage estimates. For both the FDR and p-value threshold estimates, the bias increased as the parameter moved away from zero. The log-Weibull simulations were less accurate than the normal simulations.
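For the normal case, one replicate of this simulation design can be generated along the following lines; the block equicorrelation is induced by adding a shared block-level normal term, and the helper names are illustrative. Note that scipy's two-sample t-test uses the t reference distribution rather than the standard normal reference described above, a negligible difference at these sample sizes.

```python
import numpy as np
from scipy.stats import ttest_ind

def simulate_pvalues(G=10000, n1=40, n2=40, pi0=0.9, rho=0.3,
                     block=500, effect=2.0, rng=None):
    """One replicate of the normal-data simulation: block-equicorrelated expression
    within subject, a mean shift of `effect` for the non-null genes in group 2,
    and two-sample t-test p-values for all G genes.  Assumes block divides G."""
    rng = np.random.default_rng(rng)
    null = rng.random(G) < pi0                      # true null indicator per gene
    def group(n, shift):
        # shared block term + independent noise gives within-block correlation rho
        shared = np.repeat(rng.standard_normal((G // block, n)), block, axis=0)
        noise = rng.standard_normal((G, n))
        x = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
        return x + shift[:, None]
    x1 = group(n1, np.zeros(G))
    x2 = group(n2, np.where(null, 0.0, effect))
    _, pvals = ttest_ind(x1, x2, axis=1)            # pooled-variance t-test per gene
    return pvals, null

# pvals, null = simulate_pvalues(rho=0.3, pi0=0.9, rng=1)
```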

Table 1. The columns in the table represent: the proportion of true null hypotheses ($\pi$), the limiting FDR and p-value threshold (PVT) parameters, the block correlation parameter ($\rho$), the bias ($\times 10^3$) of the estimates of these parameters, and the empirical 95% coverage probability of these parameters. The sample size is forty observations per group.

                                  rho = 0.0             rho = 0.3             rho = 0.6
 pi    parameter value    Bias x 10^3  Coverage   Bias x 10^3  Coverage   Bias x 10^3  Coverage

 Normal data
 0.3   FDR = 0.021            -0.024      0.960       -0.014      0.952        0.013      0.920
       PVT = 0.123             0.200      0.962        0.142      0.946       -0.021      0.926
 0.6   FDR = 0.047            -0.089      0.942        0.034      0.948        0.082      0.928
       PVT = 0.053             0.063      0.942       -0.004      0.946       -0.030      0.920
 0.9   FDR = 0.310            -0.027      0.922        0.697      0.952        0.082      0.914
       PVT = 0.006             0.006      0.924       -0.015      0.950        0.003      0.918

 log-Weibull data
 0.3   FDR = 0.021            -0.005      0.958       -0.006      0.942        0.018      0.926
       PVT = 0.123             0.083      0.966        0.096      0.948       -0.054      0.918
 0.6   FDR = 0.047            -0.044      0.940        0.052      0.932        0.094      0.928
       PVT = 0.053             0.039      0.942       -0.014      0.932       -0.037      0.924
 0.9   FDR = 0.310             0.160      0.922        0.766      0.950        0.152      0.918
       PVT = 0.006             0.000      0.928       -0.017      0.950        0.001      0.914

Table 2. The columns in the table represent: the proportion of true null hypotheses ($\pi$), the limiting FDR and p-value threshold (PVT) parameters, the block correlation parameter ($\rho$), the bias ($\times 10^3$) of the estimates of these parameters, and the empirical 95% coverage probability of these parameters. The sample size is twenty observations per group.

                                  rho = 0.0             rho = 0.3             rho = 0.6
 pi    parameter value    Bias x 10^3  Coverage   Bias x 10^3  Coverage   Bias x 10^3  Coverage

 Normal data
 0.3   FDR = 0.021             0.079      0.948        0.062      0.956        0.036      0.948
       PVT = 0.123            -0.414      0.954       -0.310      0.948       -0.159      0.944
 0.6   FDR = 0.047            -0.016      0.954        0.184      0.948       -0.009      0.934
       PVT = 0.053             0.022      0.952       -0.084      0.948        0.020      0.936
 0.9   FDR = 0.310             0.141      0.952       -1.102      0.950       -0.767      0.934
       PVT = 0.006             0.001      0.946        0.035      0.962        0.027      0.944

 log-Weibull data
 0.3   FDR = 0.021             0.504      0.806        0.281      0.906        0.210      0.934
       PVT = 0.123            -2.893      0.792       -1.600      0.886       -1.159      0.926
 0.6   FDR = 0.047             0.958      0.894        0.717      0.932        0.440      0.924
       PVT = 0.053            -0.497      0.890       -0.369      0.914        0.219      0.922
 0.9   FDR = 0.310             3.274      0.938        1.059      0.966        1.542      0.926
       PVT = 0.006            -0.083      0.928       -0.024      0.956       -0.035      0.920

In Table 1, with forty subjects per group, the bias remained small and the 95% empirical coverage was uniformly good for all the simulations examined. In Table 2, however, with twenty subjects per group, the bias sometimes became large and had a negative impact on the coverage estimate, particularly in the log-Weibull simulations.

Additional simulations were run to explore the impact of violating the weak dependence assumption. Within each subject, all 10,000 genes were equally correlated. The correlations examined were $\rho = \{0.2, 0.4, 0.6\}$. For the normal simulations with 40 subjects per group, a significant percentage of replicates produced negative variance estimates for the FDR estimates. The percentage of replicates within a simulation that resulted in a negative variance estimate ranged from 15% to 50%. Thus, the proposed methodology is not robust to a strong dependence structure.

5 General Comments

A mixture model is proposed to determine a subset of genes associated with an outcome variable. Since the observed p-values used in the mixture model are derived from the asymptotic normality of the test statistic, this method is not confined to a specific test statistic or outcome variable type. The proposed methodology can be applied to a test statistic based on a comparison between groups (Student's t-statistic, the Wilcoxon rank sum statistic, the log rank statistic, or their k-sample analogs), a test of association between variables (Pearson's correlation coefficient or Kendall's tau), regression analyses, or a multilevel factorial analysis.

The accuracy of the proposed methodology is a function of the accuracy of the asymptotic normality of the test statistic and the weak dependence assumption. We believe, however, there is a growing recognition that the inability to validate many gene expression analyses is a function of the limited number of samples in these analyses. Thus, it is our expectation that future gene expression studies will be based on larger sample sizes, enabling the asymptotic normality assumption to be justified in a greater proportion of
studies.

The effect of within subject gene expression dependence on FDR measures is a current subject of research. Qiu et al. (2005) demonstrated that dependence can impact the variability of an FDR measure. Efron (2005), using a test statistic mixture density, demonstrated that strong dependence can produce a significant deviation between the empirical and theoretical null component of the mixture density. The resulting bias in the FDR estimate may be reduced by adapting the null density to the observed data. For the p-value mixture density, this could entail replacing the standard uniform null density with a Beta$(\alpha, \beta)$ null density in the quasi-likelihood. Note that the standard uniform null density is the special case $\alpha = \beta = 1$. Whether this generalization produces less biased FDR and p-value threshold estimates under strong dependence will be the subject of future research.

Strong dependence relationships such as exchangeability (Qiu et al. 2005) and positive regression dependence (Benjamini and Yekutieli 2001) appear to have an adverse effect on the FDR measure and do not hold for the methodology presented in this paper. In contrast, under weak dependence, our simulations demonstrate that the FDR and p-value threshold estimates are accurate. What remains unclear is whether weak dependence is congruent with the concept of genetic pathways and hence whether it is sufficient to approximate the gene expression correlation structure. If weak dependence is not sufficient, it may be possible to transform the expression data in the preprocessing algorithm prior to performing the proposed FDR analysis.

Our measure of the FDR differs from the conventional measure proposed in Benjamini and Hochberg (1995). Their FDR measure is based on a fixed number of tests performed; we have modified the FDR to present its limiting value. When the marginal p-values are generated from a single mixture distribution, the asymptotic FDR for a given threshold $\gamma$ is defined as the probability that a gene is not differentially expressed given its p-value is less than $\gamma$. A benefit of the asymptotic FDR is the creation of an asymptotic pivotal statistic that is used to create a confidence interval for either the asymptotic FDR parameter, or its inverse function, the p-value threshold parameter. The confidence intervals are used to provide control of the error rate with a high
level of confidence or as a liberal gene filter for subsequent statistical analyses of the gene expression data.

An alternative error rate analysis is based on controlling the tail probability for the proportion of false rejections. The properties of this error rate estimate, also known as the proportion of false positives (PFP) or the false discovery proportion (FDP), have been studied for both independent and dependent gene expression values (Korn et al. 2004, Genovese and Wasserman 2004, van der Laan et al. 2004). As noted in Genovese and Wasserman (2004), the FDP can be written as a function of the chosen threshold $\gamma_0$:
$$S(\gamma_0) = \frac{\sum_g I(P_g \le \gamma_0)\, I(D_g = 0)}{\sum_g I(P_g \le \gamma_0)},$$
where it is assumed that at least one p-value is below the threshold $\gamma_0$. A connection between the tail probability for the FDP, $\Pr[S(\gamma_0) > c]$, and a confidence bound for the asymptotic FDR can be obtained using the central limit theorem assumption: $G^{1/2}[S(\gamma_0) - \tau(\gamma_0)]$ converges in distribution to a mean zero normal random variable, with asymptotic variance denoted by $W(\gamma_0)$ and the asymptotic FDR represented as $\tau(\gamma_0)$. The relationship between the two error rate analyses is realized by substituting
$$\tau(\gamma_0) + \frac{z_q W^{1/2}(\gamma_0)}{G^{1/2}}$$
for $c$ in the FDP tail probability, producing a $(1 - q)$ upper confidence bound for the asymptotic FDR:
$$S(\gamma_0) - \frac{z_q W^{1/2}(\gamma_0)}{G^{1/2}}.$$

Finally, in this paper an empirical criterion was used to group genes into the dependent sets for the asymptotic variance calculation. As our understanding of the gene environment continues to improve, dependent gene sets may be established from biological determinants, such as through linkage of their gene function, location in the cell, or involvement in the biological process. One source currently available to establish these connections is the gene
ontology consortium website (www.geneontology.org). As knowledge of gene interactions increases, the dependency classification can be carried out using external databases.

Appendix: Derivations of the asymptotic properties of the quasi-likelihood estimates

1) $\hat\omega \xrightarrow{p} \omega_0$

The proof will use the following arguments. The log quasi-likelihood is defined as the sum of log p-value mixture densities
$$L_G(\omega) = \sum_{g=1}^{G} \log\left\{\pi + (1 - \pi)\,\frac{1}{2}\,\frac{\phi_{\theta,1}(\Phi^{-1}(1 - p_g/2))}{\phi_{0,1}(\Phi^{-1}(1 - p_g/2))}\right\},$$
where $\omega = (\pi, \theta)$. It follows that for $\omega_0$, the true value of $\omega$,
$$E\left[\frac{\partial}{\partial \omega} G^{-1} L_G(\omega)\right]_{\omega = \omega_0} = 0 \qquad \text{and} \qquad E\left[\frac{\partial^2}{\partial \omega^2} G^{-1} L_G(\omega)\right]_{\omega = \omega_0} \text{ is negative definite.}$$
Let $N(\omega_0)$ denote a bounded neighborhood around $\omega_0$ and assume $G^{-1}L_G(\omega)$ and its derivative are uniformly bounded in $N(\omega_0)$. Define $\lambda(\omega) = \lim_{G \to \infty} G^{-1} L_G(\omega)$. Then for $\delta > 0$,
$$\sup_{\omega:\ \|\omega - \omega_0\| > \delta} \lambda(\omega) < \lambda(\omega_0). \tag{A.1}$$

Proof: Since $\hat\omega$ is the maximum quasi-likelihood estimate,
$$G^{-1} L_G(\hat\omega) \ge G^{-1} L_G(\omega_0).$$
By the assumptions above, $G^{-1} L_G(\omega)$ converges uniformly to $\lambda(\omega)$ for $\omega \in N(\omega_0)$. Thus for $\epsilon > 0$,
$$G^{-1} L_G(\hat\omega) \ge \lambda(\omega_0) - \epsilon.$$
Adding and subtracting $\lambda(\hat\omega)$ on the left side of the inequality and again using the uniform convergence argument,
$$\lambda(\hat\omega) \ge \lambda(\omega_0) - \epsilon,$$
which by (A.1) cannot occur unless $\hat\omega \to \omega_0$.

2) $\sqrt{G}(\hat\omega - \omega_0) \xrightarrow{D} N(0, \Sigma)$

Proof: Let $S_G(\omega) = \partial L_G(\omega)/\partial \omega$ and $A_G(\omega) = \partial^2 L_G(\omega)/\partial \omega\, \partial \omega^T$. Under the assumption of weak dependence, the weak law of large numbers and a Taylor expansion are applied to produce
$$G^{1/2}(\hat\omega - \omega_0) = [G^{-1} A_G(\omega_0)]^{-1} G^{-1/2} S_G(\omega_0) + o_p(1).$$
It follows that
$$\mathrm{var}[G^{1/2}(\hat\omega - \omega_0)] = [G^{-1} A_G(\omega_0)]^{-1}\,[G^{-1} \mathrm{var}\, S_G(\omega_0)]\,[G^{-1} A_G(\omega_0)]^{-T}$$
and therefore, using the central limit theorem for weakly dependent data,
$$\sqrt{G}(\hat\omega - \omega_0) \xrightarrow{D} N(0, A^{-1} V A^{-T}).$$

References

Allison, D. B., Gadbury, G. L., Heo, M., Fernandez, J. R., Lee, C. K., Prolla, T. A., and Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis, 39, 1-20.

Black, M. A. (2004). A note on the adaptive control of false discovery rates. Journal of the Royal Statistical Society, Series B, 66, 297-304.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289-300.

Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25, 60-83.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165-1188.

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151-1160.

Efron, B. (2006). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102, 93-103.

Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. Annals of Statistics, 32, 1035-1061.

Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley.

Hung, H. M. J., O'Neill, R. T., Bauer, P., and Köhne, K. (1997). The behavior of the p-value when the alternative hypothesis is true. Biometrics, 53, 11-22.

Korn, E. L., Troendle, J. F., McShane, L. M., and Simon, R. (2004). Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference, 124, 379-398.

Lumley, T. and Heagerty, P. J. (1999). Weighted empirical adaptive variance estimators. Journal of the Royal Statistical Society, Series B, 61, 459-477.

Owen, A. (2005). Variance of the number of false discoveries. Journal of the Royal Statistical Society, Series B, 67, 411-426.

Pan, W., Lin, J., and Le, C. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional and Integrative Genomics, 3, 117-124.

Parker, R. A. and Rothenberg, R. B. (1988). Identifying important results from multiple statistical tests. Statistics in Medicine, 7, 1031-1043.

Pounds, S. and Morris, S. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics, 19, 1236-1242.

Qiu, X., Klebanov, L., and Yakovlev, A. (2005). Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Statistical Applications in Genetics and Molecular Biology, 4, Article 34.

Serfling, R. J. (1968). Contributions to central limit theory for dependent variables. The Annals of Mathematical Statistics, 39, 1158-1175.

Stephenson, A. J., Smith, A., Kattan, M. W., Satagopan, J., Reuter, V. E., Scardino, P. T., and Gerald, W. L. (2005). Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer, 104, 290-298.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64, 479-498.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B, 66, 187-205.

Tsai, C. A., Hsueh, H. M., and Chen, J. J. (2003). Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics, 59, 1071-1081.

van der Laan, M. J., Dudoit, S., and Pollard, K. S. (2004). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology, 3, Issue 1, Article 15.
