We are concerned with testing replicability hypotheses for many endpoints simultaneously. This constitutes a multiple test problem with composite null hypotheses. Traditional $p$-values, which are computed under least favourable parameter configurations, are over-conservative in the case of composite null hypotheses. As demonstrated in prior work, this poses severe challenges in the multiple testing context, especially when one goal of the statistical analysis is to estimate the proportion $\pi_0$ of true null hypotheses. Randomized $p$-values have been proposed to remedy this issue. In the present work, we discuss the application of randomized $p$-values in replicability analysis. In particular, we introduce a general class of statistical models for which valid, randomized $p$-values can be calculated easily. By means of computer simulations, we demonstrate that their usage typically leads to a much more accurate estimation of $\pi_0$. Finally, we apply our proposed methodology to a real data example from genomics.
Given a family of null hypotheses $H_{1},\ldots,H_{s}$, we are interested in the hypothesis $H_{s}^{\gamma}$ that at most $\gamma-1$ of these null hypotheses are false. Assuming that the corresponding $p$-values are independent, we investigate combined $p$-values that are valid for testing $H_{s}^{\gamma}$. In various settings in which $H_{s}^{\gamma}$ is false, we determine which combined $p$-value works well in which setting. Via simulations, we find that the Stouffer method works well if the null $p$-values are uniformly distributed and the signal strength is low, and that the Fisher method works better if the null $p$-values are conservative, i.e., stochastically larger than the uniform distribution. The minimum method works well if the evidence for the rejection of $H_{s}^{\gamma}$ is focused on only a few non-null $p$-values, especially if the null $p$-values are conservative. Methods that incorporate the combination of $e$-values work well if the null hypotheses $H_{1},\ldots,H_{s}$ are simple.
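For concreteness, the three classical combination functions named above can be sketched in a few lines. This is a minimal Python illustration, not code from the work itself; the reduction to the $s-\gamma+1$ largest $p$-values for testing $H_{s}^{\gamma}$ follows the standard partial-conjunction construction, and all function names are ours:

```python
import math
from statistics import NormalDist

_N = NormalDist()  # standard normal

def stouffer(pvals):
    # Stouffer: average the normal quantiles of 1 - p and refer to N(0, 1).
    z = sum(_N.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - _N.cdf(z)

def fisher(pvals):
    # Fisher: -2 * sum(log p) is chi-square with 2s degrees of freedom under
    # the null; for even df the survival function has a closed (Erlang) form.
    x = -2.0 * sum(math.log(p) for p in pvals)
    return math.exp(-x / 2.0) * sum(
        (x / 2.0) ** k / math.factorial(k) for k in range(len(pvals))
    )

def minimum(pvals):
    # Minimum method: Bonferroni-adjusted smallest p-value.
    return min(1.0, len(pvals) * min(pvals))

def partial_conjunction(pvals, gamma, combiner=fisher):
    # Valid p-value for H_s^gamma: apply a combiner to the
    # s - gamma + 1 largest p-values, discarding the gamma - 1 smallest.
    return combiner(sorted(pvals)[gamma - 1:])
```

For $\gamma = 1$ (the global null) all $p$-values are combined; for larger $\gamma$ the smallest $\gamma-1$ of them, which may belong to false nulls, are discarded before combining.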
We are concerned with multiple test problems with composite null hypotheses and the estimation of the proportion $\pi_{0}$ of true null hypotheses. The Schweder–Spjøtvoll estimator $\hat{\pi}_0$ utilizes marginal $p$-values and only works properly if the $p$-values that correspond to the true null hypotheses are uniformly distributed on $[0,1]$ ($\mathrm{Uni}[0,1]$-distributed). In the case of composite null hypotheses, marginal $p$-values are usually computed under least favorable parameter configurations (LFCs). Thus, they are stochastically larger than $\mathrm{Uni}[0,1]$ under non-LFCs in the null hypotheses. When using these LFC-based $p$-values, $\hat{\pi}_0$ tends to overestimate $\pi_{0}$. We introduce a new way of randomizing $p$-values that depends on a tuning parameter $c\in[0,1]$, such that $c=0$ and $c=1$ lead to $\mathrm{Uni}[0,1]$-distributed $p$-values, which are independent of the data, and to the original LFC-based $p$-values, respectively. For a certain value $c=c^{\star}$, the bias of $\hat{\pi}_0$ is minimized when using our randomized $p$-values. This often also entails a smaller mean squared error of the estimator as compared to using the LFC-based $p$-values. We analyze these points theoretically, and we demonstrate them numerically in computer simulations under various standard statistical models.
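The Schweder–Spjøtvoll estimator itself is simple enough to state in a few lines. The sketch below is an illustration under our own assumptions (`lam` is the usual tuning parameter $\lambda$; the conservative null distribution $\sqrt{U}$ is just one convenient example of $p$-values stochastically larger than $\mathrm{Uni}[0,1]$), showing the upward bias the abstract describes:

```python
import random

def schweder_spjotvoll(pvals, lam=0.5):
    # Estimate pi_0 by the fraction of p-values above lam, rescaled by
    # 1 - lam: under uniform nulls, a fraction (1 - lam) * pi_0 of all
    # p-values is expected to exceed lam.
    return sum(p > lam for p in pvals) / ((1.0 - lam) * len(pvals))

random.seed(1)
uniform_nulls = [random.random() for _ in range(10_000)]
# sqrt(U) is stochastically larger than Uni[0,1], mimicking LFC-based
# p-values evaluated at a non-LFC null parameter.
conservative_nulls = [u ** 0.5 for u in uniform_nulls]

# With all hypotheses null (pi_0 = 1), the estimate is close to 1 for the
# uniform p-values but overshoots (toward 1.5 here) for the conservative ones.
```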
Replicability analysis aims to identify the findings that replicate across independent studies examining the same features. We provide powerful novel replicability analysis procedures for two studies, with FWER or FDR control on the replicability claims. The suggested procedures first select the promising features from each study solely based on that study, and then test for replicability only the features that were selected in both studies. We incorporate plug-in estimates of the fraction of null hypotheses in one study among the features selected by the other study. Since this fraction is typically small, the power gain can be remarkable. We provide theoretical guarantees for the control of the appropriate error rates, as well as simulations that demonstrate the excellent power properties of the suggested procedures. We demonstrate the usefulness of our procedures on real data examples from two application fields: behavioural genetics and microarray studies.
When testing for replication of results from a primary study with two-sided hypotheses in a follow-up study, we are usually interested in discovering the features with discoveries in the same direction in the two studies. The direction of testing in the follow-up study for each feature can therefore be decided by the primary study. We prove that in this case the methods suggested in Heller, Bogomolov, and Benjamini (2014) for control over false replicability claims are valid. Specifically, we prove that if we input into the procedures of Heller, Bogomolov, and Benjamini (2014) the one-sided $p$-values in the directions favoured by the primary study, then we achieve directional control over the desired error measure (family-wise error rate or false discovery rate).
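The conversion assumed here is the standard one: the follow-up two-sided $p$-value is halved when the follow-up effect points in the direction favoured by the primary study, and is otherwise reflected. A minimal sketch, assuming a follow-up test statistic that is symmetric about zero under the null (names are illustrative, not from the cited procedures):

```python
def one_sided_pvalue(two_sided_p, stat, primary_direction):
    """One-sided p-value in the direction favoured by the primary study.

    primary_direction: +1 or -1, the sign of the primary-study effect.
    stat: follow-up test statistic, assumed symmetric about 0 under the null.
    """
    if (stat >= 0) == (primary_direction > 0):
        # Follow-up effect agrees with the primary direction.
        return two_sided_p / 2.0
    # Follow-up effect points the other way: reflect the p-value.
    return 1.0 - two_sided_p / 2.0
```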
Multiple testing problems are a staple of modern statistical analysis. The fundamental objective of multiple testing procedures is to reject as many false null hypotheses as possible (that is, to maximize some notion of power), subject to controlling an overall measure of false discovery, such as the family-wise error rate (FWER) or the false discovery rate (FDR). In this paper, we formulate multiple testing of simple hypotheses as an infinite-dimensional optimization problem, seeking the most powerful rejection policy which guarantees strong control of the selected measure. In that sense, our approach is a generalization of the optimal Neyman–Pearson test for a single hypothesis. We show that for exchangeable hypotheses, for both FWER and FDR and relevant notions of power, these problems can be formulated as infinite linear programs and can in principle be solved for any number of hypotheses. We also characterize maximin rules for complex alternatives, and demonstrate that such rules can be found in practice, leading to improved practical procedures compared to existing alternatives. We derive explicit optimal tests for FWER or FDR control for three independent normal means, and find that the power gain over natural competitors is substantial in all settings examined. Finally, we apply our optimal maximin rule to subgroup analyses in systematic reviews from the Cochrane library, leading to an increase in the number of findings while guaranteeing strong FWER control against the one-sided alternative.