The knockoff-based multiple testing setup of Barber & Candes (2015) for variable selection in multiple regression, where the sample size is as large as the number of explanatory variables, is considered. The method of Benjamini & Hochberg (1995), based on ordinary least squares estimates of the regression coefficients, is adjusted to this setup, transforming it into a valid p-value based false discovery rate controlling method that does not rely on any specific correlation structure of the explanatory variables. Simulations and real data applications show that our proposed method, which is agnostic to $\pi_0$, the proportion of unimportant explanatory variables, and a data-adaptive version of it that uses an estimate of $\pi_0$ are powerful competitors of the false discovery rate controlling method in Barber & Candes (2015).
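For orientation, the following is a minimal sketch of the standard Benjamini & Hochberg (1995) step-up procedure that the abstract takes as its starting point. It is not the adjusted method proposed in the paper, whose details the abstract does not give. The function name `benjamini_hochberg` and the level `alpha` are illustrative assumptions.

```python
import numpy as np


def benjamini_hochberg(pvalues, alpha=0.1):
    """Return a boolean rejection mask from the standard BH step-up procedure.

    This is the unadjusted baseline, not the knockoff-adjusted variant
    described in the abstract (an assumption made for illustration).
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort the p-values ascending
    sorted_p = p[order]
    # Step-up thresholds alpha * k / m for k = 1, ..., m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below[-1]              # largest rank whose p-value meets its threshold
        reject[order[: k + 1]] = True  # reject it and every smaller p-value
    return reject


if __name__ == "__main__":
    pvals = [0.03, 0.03, 0.03, 0.9]
    print(benjamini_hochberg(pvals, alpha=0.05))
```

The step-up character matters here: in the usage example the two smallest ranks miss their own thresholds, yet all three p-values of 0.03 are rejected because the third rank meets its threshold.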
Variable selection on large-scale networks has been extensively studied in the literature. While most existing methods are limited to local functionals, especially graph edges, this paper focuses on selecting the discrete hub structures of networks. Specifically, we propose an inferential method, called the StarTrek filter, to select hub nodes with degrees larger than a certain threshold in high dimensional graphical models while controlling the false discovery rate (FDR). Discovering hub nodes in networks is challenging: there is no straightforward statistic for testing the degree of a node due to the combinatorial structures, and the complicated dependence in the multiple testing problem is hard to characterize and control. In methodology, the StarTrek filter overcomes this by constructing p-values based on the maximum test statistics via the Gaussian multiplier bootstrap. In theory, we show that the StarTrek filter can control the FDR by providing accurate bounds on the approximation errors of the quantile estimation and addressing the dependence structures among the maximal statistics. To this end, we establish novel Cramér-type comparison bounds for high dimensional Gaussian random vectors. Compared to the Gaussian comparison bound via the Kolmogorov distance established by Chernozhukov et al. (2014), our Cramér-type comparison bounds establish the relative difference between the distribution functions of two high dimensional Gaussian random vectors. We illustrate the validity of the StarTrek filter in a series of numerical experiments and apply it to the genotype-tissue expression dataset to discover central regulator genes.
The Benjamini-Hochberg (BH) procedure remains widely popular despite having limited theoretical guarantees in the commonly encountered scenario of correlated test statistics. Of particular concern is the possibility that the method could exhibit bursty behavior, meaning that it might typically yield no false discoveries while occasionally yielding both a large number of false discoveries and a false discovery proportion (FDP) that far exceeds its own well controlled mean. In this paper, we investigate which test statistic correlation structures lead to bursty behavior and which ones lead to well controlled FDPs. To this end, we develop a central limit theorem for the FDP in a multiple testing setup where the test statistic correlations can be either short-range or long-range as well as either weak or strong. The theorem and our simulations from a data-driven factor model suggest that the BH procedure exhibits severe burstiness when the test statistics have many strong, long-range correlations, but does not otherwise.
In many domains, data measurements can naturally be associated with the leaves of a tree, expressing the relationships among these measurements. For example, companies belong to industries, which in turn belong to ever coarser divisions such as sectors; microbes are commonly arranged in a taxonomic hierarchy from species to kingdoms; street blocks belong to neighborhoods, which in turn belong to larger-scale regions. The problem of tree-based aggregation that we consider in this paper asks which of these tree-defined subgroups of leaves should really be treated as a single entity and which of these entities should be distinguished from each other. We introduce the false split rate, an error measure that describes the degree to which subgroups have been split when they should not have been. We then propose a multiple hypothesis testing algorithm for tree-based aggregation, which we prove controls this error measure. We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients. We apply this methodology to aggregate stocks based on their volatility and to aggregate neighborhoods of New York City based on taxi fares.
We consider controlling the false discovery rate for testing many time series with an unknown cross-sectional correlation structure. Given a large number of hypotheses, false and missing discoveries can plague an analysis. While many procedures have been proposed to control false discovery, most of them either assume independent hypotheses or lack statistical power. A problem of particular interest is in financial asset pricing, where the goal is to determine which "factors" lead to excess returns out of a large number of potential factors. Our contribution is two-fold. First, we show the consistency of Fama and French's prominent method under multiple testing. Second, we propose a novel method for false discovery control using double bootstrapping. We achieve superior statistical power to existing methods and prove that the false discovery rate is controlled. Simulations and a real data application illustrate the efficacy of our method over existing methods.
Selecting relevant features associated with a given response variable is an important issue in many scientific fields. Quantifying the quality and uncertainty of a selection result via false discovery rate (FDR) control has been of recent interest. This paper introduces a way of using data-splitting strategies to asymptotically control the FDR while maintaining high power. For each feature, the method constructs a test statistic by estimating two independent regression coefficients via data splitting. FDR control is achieved by taking advantage of the statistic's property that, for any null feature, its sampling distribution is symmetric about zero. Furthermore, we propose Multiple Data Splitting (MDS) to stabilize the selection result and boost the power. Interestingly and surprisingly, with the FDR still under control, MDS not only helps overcome the power loss caused by sample splitting, but also results in a lower variance of the false discovery proportion (FDP) compared with all other methods under consideration. We prove that the proposed data-splitting methods can asymptotically control the FDR at any designated level for linear and Gaussian graphical models in both low and high dimensions. Through intensive simulation studies and a real-data application, we show that the proposed methods are robust to the unknown distribution of features, easy to implement and computationally efficient, and are often the most powerful among competitors, especially when the signals are weak and the correlations or partial correlations among features are high.
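The symmetry argument above can be sketched as follows, under stated assumptions. The abstract does not give the form of the test statistic, so the sign-times-sum combination in `mirror_statistics` is a hypothetical choice. The cutoff rule in `select_by_symmetry` is likewise one common way to exploit symmetry about zero: the count of large negative statistics estimates the number of false positives among large positive ones. All names and the level `q` are illustrative.

```python
import numpy as np


def mirror_statistics(beta1, beta2):
    """Combine two independent coefficient estimates into one statistic per feature.

    Assumed form (not specified in the abstract): sign(b1 * b2) * (|b1| + |b2|).
    For a null feature the two estimates agree in sign only by chance, so the
    statistic is symmetric about zero; for a relevant feature it tends to be
    large and positive.
    """
    b1 = np.asarray(beta1, dtype=float)
    b2 = np.asarray(beta2, dtype=float)
    return np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))


def select_by_symmetry(stats, q=0.1):
    """Select features whose statistic exceeds a data-driven cutoff.

    The cutoff is the smallest t with
        #{j: stats_j <= -t} / max(#{j: stats_j >= t}, 1) <= q,
    where the numerator uses the left tail as a stand-in for the number of
    nulls in the right tail.
    """
    stats = np.asarray(stats, dtype=float)
    candidates = np.sort(np.abs(stats[stats != 0]))
    for t in candidates:
        est_fdp = np.sum(stats <= -t) / max(np.sum(stats >= t), 1)
        if est_fdp <= q:
            return np.nonzero(stats >= t)[0]
    return np.array([], dtype=int)  # no cutoff meets the target level


if __name__ == "__main__":
    # Hypothetical coefficient estimates from the two halves of a split sample
    beta_half1 = [2.0, 1.5, 0.1, -0.2, 0.05, 3.0]
    beta_half2 = [1.8, 1.7, -0.1, 0.1, 0.02, 2.5]
    m = mirror_statistics(beta_half1, beta_half2)
    print(select_by_symmetry(m, q=0.2))
```

A multiple-split variant in the spirit of MDS would repeat this over several random splits and aggregate the selections; the aggregation rule is not described in the abstract and is therefore left out of the sketch.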