Accurate and Efficient Estimation of Small P-values with the Cross-Entropy Method: Applications in Genomic Data Analysis

105 0 0.0 ( 0 )

Download Cite

Added by Yang Shi

Publication date 2018

fields Mathematical Statistics

and research's language is English

Authors Yang Shi - Mengqiao Wang - Weiping Shi

Applications

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Small $p$-values are often required to be accurately estimated in large scale genomic studies for the adjustment of multiple hypothesis tests and the ranking of genomic features based on their statistical significance. For those complicated test statistics whose cumulative distribution functions are analytically intractable, existing methods usually do not work well with small $p$-values due to lack of accuracy or computational restrictions. We propose a general approach for accurately and efficiently calculating small $p$-values for a broad range of complicated test statistics based on the principle of the cross-entropy method and Markov chain Monte Carlo sampling techniques. We evaluate the performance of the proposed algorithm through simulations and demonstrate its application to three real examples in genomic studies. The results show that our approach can accurately evaluate small to extremely small $p$-values (e.g. $10^{-6}$ to $10^{-100}$). The proposed algorithm is helpful to the improvement of existing test procedures and the development of new test procedures in genomic studies.

rate research

Efficiently estimating small p-values in permutation tests using importance sampling and cross-entropy method

109 - Yang Shi , Huining Kang , Ji-Hyun Lee 2016

Permutation tests are commonly used for estimating p-values from statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is not available or unreliable for finite sample sizes. One critical challenge for permutation tests in genomic studies is that an enormous number of permutations is needed for obtaining reliable estimations of small p-values, which requires intensive computational efforts. In this paper, we develop a computationally efficient algorithm for evaluating small p-values from permutation tests based on an adaptive importance sampling approach, which uses the cross-entropy method for finding the optimal proposal density. Simulation studies and analysis of a real microarray dataset demonstrate that our approach achieves considerable gains in computational efficiency comparing with existing methods.

Computation

Combining independent p-values in replicability analysis: A comparative study

133 - Anh-Tuan Hoang , Thorsten Dickhaus 2021

Given a family of null hypotheses $H_{1},ldots,H_{s}$, we are interested in the hypothesis $H_{s}^{gamma}$ that at most $gamma-1$ of these null hypotheses are false. Assuming that the corresponding $p$-values are independent, we are investigating combined $p$-values that are valid for testing $H_{s}^{gamma}$. In various settings in which $H_{s}^{gamma}$ is false, we determine which combined $p$-value works well in which setting. Via simulations, we find that the Stouffer method works well if the null $p$-values are uniformly distributed and the signal strength is low, and the Fisher method works better if the null $p$-values are conservative, i.e. stochastically larger than the uniform distribution. The minimum method works well if the evidence for the rejection of $H_{s}^{gamma}$ is focused on only a few non-null $p$-values, especially if the null $p$-values are conservative. Methods that incorporate the combination of $e$-values work well if the null hypotheses $H_{1},ldots,H_{s}$ are simple.

Applications

Accurate estimation of influenza epidemics using Google search data via ARGO

316 - Shihao Yang , Mauricio Santillana , 2015

Accurate real-time tracking of influenza outbreaks helps public health officials make timely and meaningful decisions that could save lives. We propose an influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available Google-search-based tracking models, including the latest version of Google Flu Trends, even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics but also captures changes in peoples online search behavior over time. ARGO is also flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions.

Applications Social and Information Networks Machine Learning

Nested sampling for frequentist computation: fast estimation of small $p$-values

73 - Andrew Fowlie , Sebastian Hoof , Will Handley 2021

We propose a novel method for computing $p$-values based on nested sampling (NS) applied to the sampling space rather than the parameter space of the problem, in contrast to its usage in Bayesian computation. The computational cost of NS scales as $log^2{1/p}$, which compares favorably to the $1/p$ scaling for Monte Carlo (MC) simulations. For significances greater than about $4sigma$ in both a toy problem and a simplified resonance search, we show that NS requires orders of magnitude fewer simulations than ordinary MC estimates. This is particularly relevant for high-energy physics, which adopts a $5sigma$ gold standard for discovery. We conclude with remarks on new connections between Bayesian and frequentist computation and possibilities for tuning NS implementations for still better performance in this setting.

Data Analysis Statistics and Probability Instrumentation and Methods for Astrophysics High Energy Physics - Experiment

Efficient and Accurate Path Cost Estimation Using Trajectory Data

136 - Jian Dai , Bin Yang , Chenjuan Guo 2015

Using the growing volumes of vehicle trajectory data, it becomes increasingly possible to capture time-varying and uncertain travel costs in a road network, including travel time and fuel consumption. The current paradigm represents a road network as a graph, assigns weights to the graphs edges by fragmenting trajectories into small pieces that fit the underlying edges, and then applies a routing algorithm to the resulting graph. We propose a new paradigm that targets more accurate and more efficient estimation of the costs of paths by associating weights with sub-paths in the road network. The paper provides a solution to a foundational problem in this paradigm, namely that of computing the time-varying cost distribution of a path. The solution consists of several steps. We first learn a set of random variables that capture the joint distributions of sub-paths that are covered by sufficient trajectories. Then, given a departure time and a path, we select an optimal subset of learned random variables such that the random variables corresponding paths together cover the path. This enables accurate joint distribution estimation of the path, and by transferring the joint distribution into a marginal distribution, the travel cost distribution of the path is obtained. The use of multiple learned random variables contends with data sparseness, the use of multi-dimensional histograms enables compact representation of arbitrary joint distributions that fully capture the travel cost dependencies among the edges in paths. Empirical studies with substantial trajectory data from two different cities offer insight into the design properties of the proposed solution and suggest that the solution is effective in real-world settings.

Databases