Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications


Added by Wen-Xin Zhou

Publication date 2015

Fields Mathematical Statistics

Language English

Authors Jianqing Fan - Qi-Man Shao - Wen-Xin Zhou


Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
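To make the idea of an upper confidence limit for the maximum spurious correlation concrete, the following is a minimal sketch for the simplest case $s = 1$, where the statistic reduces to the largest absolute sample correlation between the response and any single covariate. It is not the authors' full procedure: the function names `max_abs_corr` and `multiplier_bootstrap_limit`, the choice of standard Gaussian multipliers, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def max_abs_corr(X, y):
    """Largest absolute sample correlation between y and any single column of X."""
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    num = Xc.T @ yc
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return float(np.max(np.abs(num / den)))

def multiplier_bootstrap_limit(X, alpha=0.05, n_boot=200, seed=None):
    """Approximate upper (1 - alpha) limit of the s = 1 maximum spurious
    correlation, holding X fixed and replacing the response by random
    multipliers (assumption: i.i.d. standard normal multipliers)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)  # multipliers play the role of an independent response
        stats[b] = max_abs_corr(X, xi)
    return float(np.quantile(stats, 1 - alpha))

# Simulated example (hypothetical data): y is independent of every column of X.
rng = np.random.default_rng(0)
n, p = 50, 1000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

observed = max_abs_corr(X, y)
limit = multiplier_bootstrap_limit(X, alpha=0.05, n_boot=200, seed=1)
print(f"observed max |corr| = {observed:.3f}, bootstrap 95% limit = {limit:.3f}")
```

Even though no covariate is related to the response here, the observed maximum correlation is far from zero when $p$ is much larger than $n$. A correlation found by a selection method would need to exceed a limit of this kind before it could be distinguished from a spurious one.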


When Is the First Spurious Variable Selected by Sequential Regression Procedures?

Weijie J. Su, 2017

Applied statisticians use sequential regression procedures to produce a ranking of explanatory variables and, in settings of low correlations between variables and strong true effect sizes, expect that variables at the very top of this ranking are truly relevant to the response. In a regime of certain sparsity levels, however, three examples of sequential procedures--forward stepwise, the lasso, and least angle regression--are shown to include the first spurious variable unexpectedly early. We derive a rigorous, sharp prediction of the rank of the first spurious variable for these three procedures, demonstrating that the first spurious variable occurs earlier and earlier as the regression coefficients become denser. This counterintuitive phenomenon persists for statistically independent Gaussian random designs and an arbitrarily large magnitude of the true effects. We gain a better understanding of the phenomenon by identifying the underlying cause and then leverage the insights to introduce a simple visualization tool termed the double-ranking diagram to improve on sequential methods. As a byproduct of these findings, we obtain the first provable result certifying the exact equivalence between the lasso and least angle regression in the early stages of solution paths beyond orthogonal designs. This equivalence can seamlessly carry over many important model selection results concerning the lasso to least angle regression.
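The quantity studied in this abstract, the rank at which a sequential procedure first selects a spurious variable, can be observed in a small simulation. The sketch below uses a plain forward stepwise rule on an independent Gaussian design; it does not reproduce the paper's sharp prediction, and the function name `first_spurious_rank` and all simulation settings (`n`, `p`, `k`, the effect size) are assumptions chosen for illustration.

```python
import numpy as np

def first_spurious_rank(X, y, true_support):
    """Run forward stepwise selection and return the step (1-based) at which
    the first variable outside `true_support` enters the model."""
    n, p = X.shape
    selected = []
    residual = y.copy()
    for step in range(1, p + 1):
        scores = np.abs(X.T @ residual)   # association of each column with the residual
        if selected:
            scores[selected] = -np.inf    # never pick the same variable twice
        j = int(np.argmax(scores))
        if j not in true_support:
            return step
        selected.append(j)
        # Refit least squares on the selected columns and update the residual.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return None

# Hypothetical setting: independent Gaussian design, k strong true effects.
rng = np.random.default_rng(0)
n, p, k = 200, 400, 20
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # unit-norm columns
beta = np.zeros(p)
beta[:k] = 10.0                            # large true effects (assumed value)
y = X @ beta + rng.standard_normal(n)

rank = first_spurious_rank(X, y, set(range(k)))
print("first spurious variable enters at step", rank)
```

Repeating the simulation with larger `k` at fixed `n` and `p` gives a rough empirical view of how the first spurious selection moves as the coefficient vector becomes denser.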

Statistics Theory, Machine Learning

Unions of Orthogonal Arrays and their aberrations via Hilbert bases

Roberto Fontana, Fabio Rapallo, 2018

We generate all the Orthogonal Arrays (OAs) of a given size n and strength t as the union of a collection of OAs which belong to an inclusion-minimal set of OAs. We derive a formula for computing the (Generalized) Word Length Pattern of a union of OAs that makes use of their polynomial counting functions. In this way the best OAs according to the Generalized Minimum Aberration criterion can be found by simply exploring a relatively small set of counting functions. The classes of OAs with 5 binary factors, strength 2, and sizes 16 and 20 are fully described.

Statistics Theory, Methodology

Limiting distributions of graph-based test statistics

Yejiong Zhu, Hao Chen, 2021

Two-sample tests utilizing a similarity graph on observations are useful for high-dimensional data and non-Euclidean data due to their flexibility and good performance under a wide range of alternatives. Existing works mainly focused on sparse graphs, such as graphs with the number of edges in the order of the number of observations. However, the tests have better performance with denser graphs under many settings. In this work, we establish the theoretical ground for graph-based tests with graphs that are much denser than those in existing works.

Statistics Theory, Methodology

Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency

Lutz Duembgen, Kaspar Rufibach, 2009

We study nonparametric maximum likelihood estimation of a log-concave probability density and its distribution and hazard function. Some general properties of these estimators are derived from two characterizations. It is shown that the rate of convergence with respect to supremum norm on a compact interval for the density and hazard rate estimator is at least $(\log(n)/n)^{1/3}$ and typically $(\log(n)/n)^{2/5}$, whereas the difference between the empirical and estimated distribution function vanishes with rate $o_{\mathrm{p}}(n^{-1/2})$ under certain regularity assumptions.

Statistics Theory, Methodology

Tests of exponentiality based on Arnold-Villasenor characterization, and their efficiencies

M. Jovanovic, B. Milosevic, Ya. Yu. Nikitin, 2014

We propose two families of scale-free exponentiality tests based on the recent characterization of exponentiality by Arnold and Villasenor. The test statistics are based on suitable functionals of U-empirical distribution functions. The family of integral statistics can be reduced to V- or U-statistics with relatively simple non-degenerate kernels. They are asymptotically normal and have reasonably high local Bahadur efficiency under common alternatives. This efficiency is compared with simulated powers of new tests. On the other hand, the Kolmogorov type tests demonstrate very low local Bahadur efficiency and rather moderate power for common alternatives, and can hardly be recommended to practitioners. We also explore the conditions of local asymptotic optimality of new tests and describe for both families special most favorable alternatives for which the tests are fully efficient.

Statistics Theory, Methodology
