Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
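To make the notion of a maximum spurious correlation concrete, the following is a minimal Python sketch of the simplest case only, $s = 1$ with mutually independent standard Gaussian covariates. It uses plain Monte Carlo simulation, not the multiplier bootstrap described in the abstract, and the function names (`sample_correlation`, `max_spurious_correlation`, `upper_quantile`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def sample_correlation(x, y):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, p, rng):
    """Largest |corr(X_j, Y)| over p covariates drawn independently of Y.

    Assumption for illustration: Y and every X_j are i.i.d. standard normal,
    so any nonzero sample correlation is purely spurious.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x_j = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(sample_correlation(x_j, y)))
    return best


def upper_quantile(n, p, level=0.95, replications=200, seed=0):
    """Monte Carlo estimate of an upper quantile of the maximum spurious correlation."""
    rng = random.Random(seed)
    draws = sorted(
        max_spurious_correlation(n, p, rng) for _ in range(replications)
    )
    # Empirical quantile: smallest draw with at least `level` of the mass below it.
    index = min(replications - 1, math.ceil(level * replications) - 1)
    return draws[index]


if __name__ == "__main__":
    # With n fixed, the baseline grows with p even though X and Y are independent.
    for p in (10, 100, 1000):
        print(p, round(upper_quantile(n=50, p=p, replications=100), 3))
```

Under these assumptions, an observed correlation below the simulated upper quantile would be hard to distinguish from pure chance. The sketch does not cover $s > 1$, correlated covariates, or residuals from regularized fits, which are the settings the abstract addresses with the bootstrap.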
Applied statisticians use sequential regression procedures to produce a ranking of explanatory variables and, in settings of low correlations between variables and strong true effect sizes, expect that variables at the very top of this ranking are tr
We generate all the Orthogonal Arrays (OAs) of a given size n and strength t as the union of a collection of OAs which belong to an inclusion-minimal set of OAs. We derive a formula for computing the (Generalized) Word Length Pattern of a union of OA
Two-sample tests utilizing a similarity graph on observations are useful for high-dimensional data and non-Euclidean data due to their flexibility and good performance under a wide range of alternatives. Existing works mainly focused on sparse graphs
We study nonparametric maximum likelihood estimation of a log-concave probability density and its distribution and hazard function. Some general properties of these estimators are derived from two characterizations. It is shown that the rate of conve
We propose two families of scale-free exponentiality tests based on the recent characterization of exponentiality by Arnold and Villasenor. The test statistics are based on suitable functionals of U-empirical distribution functions. The family of int