We consider the problem of recovering a $k$-sparse signal $\boldsymbol{\beta}_0 \in \mathbb{R}^p$ from noisy observations $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}_0 + \mathbf{w} \in \mathbb{R}^n$. One of the most popular approaches is $\ell_1$-regularized least squares, also known as the LASSO. We analyze the mean square error of the LASSO for random designs in which each row of $\mathbf{X}$ is drawn from the distribution $N(0, \boldsymbol{\Sigma})$ with general $\boldsymbol{\Sigma}$. We first derive the asymptotic risk of the LASSO in the limit $n, p \rightarrow \infty$ with $n/p \rightarrow \delta$. We then examine conditions on $n$, $p$, and $k$ under which the LASSO exactly reconstructs $\boldsymbol{\beta}_0$ in the noiseless case $\mathbf{w} = 0$. A phase boundary $\delta_c = \delta(\epsilon)$ is precisely established in the phase space defined by $0 \le \delta, \epsilon \le 1$, where $\epsilon = k/p$. Above this boundary, the LASSO perfectly recovers $\boldsymbol{\beta}_0$ with high probability; below it, the LASSO fails to recover $\boldsymbol{\beta}_0$ with high probability. While the values of the nonzero elements of $\boldsymbol{\beta}_0$ have no effect on the phase transition curve, our analysis shows that $\delta_c$ does depend on the sign pattern of the nonzero values of $\boldsymbol{\beta}_0$ for general $\boldsymbol{\Sigma} \neq \mathbf{I}_p$. This is in sharp contrast to previous phase transition results derived in the i.i.d. case $\boldsymbol{\Sigma} = \mathbf{I}_p$, where $\delta_c$ is completely determined by $\epsilon$ regardless of the distribution of $\boldsymbol{\beta}_0$. Underlying our formalism is the recently developed approximate message passing (AMP) algorithm. We generalize the state evolution of AMP from the i.i.d. case to the general case $\boldsymbol{\Sigma} \neq \mathbf{I}_p$. Extensive computational experiments confirm that our theoretical predictions are consistent with simulation results on moderate-size systems.
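For reference, the baseline AMP iteration for the i.i.d. design $\boldsymbol{\Sigma} = \mathbf{I}_p$ alternates soft thresholding with an Onsager-corrected residual. The sketch below is a minimal illustration of that standard recursion under assumed choices (a fixed threshold `theta`, entries of $\mathbf{X}$ i.i.d. $N(0, 1/n)$); it is not the generalized algorithm for $\boldsymbol{\Sigma} \neq \mathbf{I}_p$ developed in the paper.

```python
import numpy as np

def soft_threshold(x, theta):
    """Componentwise soft thresholding eta(x; theta)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def amp_lasso(X, y, theta, n_iter=50):
    """Minimal AMP sketch for the LASSO with an i.i.d. Gaussian design.

    Assumes the entries of X are i.i.d. N(0, 1/n). A practical
    implementation would tune theta along a schedule dictated by
    state evolution; a fixed theta is used here for brevity.
    """
    n, p = X.shape
    delta = n / p
    beta = np.zeros(p)
    z = y.copy()
    for _ in range(n_iter):
        beta_new = soft_threshold(beta + X.T @ z, theta)
        # Onsager correction: average derivative of the threshold function
        onsager = np.mean(np.abs(beta_new) > 0) / delta
        z = y - X @ beta_new + onsager * z
        beta = beta_new
    return beta
```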
Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting where the measured covariates are affected by hidden confounding, and propose the \emph{Doubly Debiased Lasso} estimator for individual components of the regression coefficient vector. The proposed method simultaneously corrects both the bias due to estimation of the high-dimensional parameters and the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e., that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.
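For orientation, the classical debiased Lasso corrects only the first of these two biases. In standard notation (ours, not necessarily the paper's), with Lasso estimate $\hat{\boldsymbol{\beta}}$ and a score vector $Z_j$ for covariate $j$, it takes the form
\[
\hat{b}_j = \hat{\beta}_j + \frac{Z_j^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{Z_j^\top \mathbf{X}_j}.
\]
The Doubly Debiased Lasso augments this construction with a second correction targeting the hidden-confounding bias, which the single adjustment above does not address.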
We develop a new class of distribution-free multiple testing rules for false discovery rate (FDR) control under general dependence. A key element in our proposal is a symmetrized data aggregation (SDA) approach that incorporates the dependence structure via sample splitting, data screening, and information pooling. The proposed SDA filter first constructs a sequence of ranking statistics that fulfill global symmetry properties, and then chooses a data-driven threshold along the ranking to control the FDR. The SDA filter substantially outperforms the knockoff method in power under moderate to strong dependence, and is more robust than existing methods based on asymptotic $p$-values. We first develop finite-sample theory that provides an upper bound for the actual FDR under general dependence, and then establish the asymptotic validity of SDA for both FDR and false discovery proportion (FDP) control under mild regularity conditions. The procedure is implemented in the R package \texttt{SDA}. Numerical results confirm the effectiveness and robustness of SDA in FDR control and show that it achieves substantial power gains over existing methods in many settings.
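The data-driven thresholding step shared by SDA and knockoff-type filters exploits the symmetry of the ranking statistics: under the null, a statistic is as likely to fall below $-t$ as above $t$, so the left tail estimates the number of false discoveries on the right. Below is a minimal sketch of this generic step only, not the full SDA filter (which also performs the sample splitting, screening, and pooling described above).

```python
import numpy as np

def symmetry_threshold(W, q=0.1):
    """Choose a data-driven threshold for symmetric ranking statistics W.

    Under the null, W_j is symmetric about 0, so #{W_j <= -t} estimates
    the number of false discoveries among {W_j >= t}. Returns the
    smallest t whose estimated FDP is at most the target level q.
    """
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf

# Rejection set: {j : W_j >= symmetry_threshold(W, q)}
```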
Let $G$ be a non-linear function of a Gaussian process $\{X_t\}_{t \in \mathbb{Z}}$ with long-range dependence. The resulting process $\{G(X_t)\}_{t \in \mathbb{Z}}$ is not Gaussian when $G$ is not linear. We consider random wavelet coefficients associated with $\{G(X_t)\}_{t \in \mathbb{Z}}$ and the corresponding wavelet scalogram, which is the average of squares of wavelet coefficients over locations. We obtain the asymptotic behavior of the scalogram as the number of observations and scales tend to infinity. It is known that when $G$ is a Hermite polynomial of any order, the limit is either the Gaussian or the Rosenblatt distribution, that is, the limit can be represented by a multiple Wiener-Itô integral of order one or two. We show, however, that there are large classes of functions $G$ which yield a higher-order Hermite distribution, that is, a limit represented by a multiple Wiener-Itô integral of order greater than two.
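To fix ideas, in the standard notation of this literature (assumed here, not quoted from the paper), $G$ admits a Hermite expansion and the scalogram averages squared wavelet coefficients $W_{j,k}$ over locations $k$ at scale $j$:
\[
G(x) = \sum_{q \ge q_0} \frac{c_q}{q!} H_q(x), \qquad c_q = \mathbb{E}\left[G(X) H_q(X)\right], \qquad S_{n,j} = \frac{1}{n_j} \sum_{k=1}^{n_j} W_{j,k}^2,
\]
where $q_0$ is the Hermite rank of $G$ and $n_j$ is the number of coefficients available at scale $j$. The order of the limiting Wiener-Itô integral is governed by the interplay between the Hermite expansion of $G$ and the long-range dependence of $\{X_t\}$.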
The Least Absolute Shrinkage and Selection Operator, or Lasso, introduced by Tibshirani (1996), is a popular estimation procedure in multiple linear regression when the underlying design has a sparse structure, owing to its property of setting some regression coefficients exactly equal to 0. In this article, we develop a perturbation bootstrap method and establish its validity in approximating the distribution of the Lasso in heteroscedastic linear regression. We allow the underlying covariates to be either random or non-random. We show that the proposed bootstrap method works irrespective of the nature of the covariates, unlike the resampling-based bootstrap of Freedman (1981), which must be tailored to the nature (random vs. non-random) of the covariates. A simulation study also supports our method in finite samples.
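As a generic illustration of the perturbation idea, the sketch below reweights the observation-level least squares loss with i.i.d. mean-one multipliers and refits the Lasso on each replicate. This is a simple multiplier bootstrap under assumed choices (exponential weights, a fixed penalty level); the published method uses a modified criterion with specific conditions on the perturbation variables, which this sketch does not reproduce.

```python
import numpy as np
from sklearn.linear_model import Lasso

def perturbation_bootstrap_lasso(X, y, lam, B=500, seed=0):
    """Generic multiplier-bootstrap sketch for the Lasso.

    Each replicate multiplies the squared-error loss of each
    observation by an i.i.d. positive weight with mean 1; the spread
    of the refitted coefficients approximates the sampling
    variability of the Lasso estimator.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    boot = np.empty((B, p))
    for b in range(B):
        w = rng.exponential(1.0, size=n)  # mean-one perturbation weights
        model = Lasso(alpha=lam, fit_intercept=False)
        model.fit(X, y, sample_weight=w)
        boot[b] = model.coef_
    return boot
```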
The lasso and related sparsity-inducing algorithms have been the target of substantial theoretical and applied research. Correspondingly, many results are known about their behavior for a fixed or optimally chosen tuning parameter specified up to unknown constants. In practice, however, this oracle tuning parameter is inaccessible, so one must use the data to select it. Common statistical practice is to use a variant of cross-validation for this task. However, little is known about the theoretical properties of the resulting predictions with such data-dependent methods. We consider the high-dimensional setting with random design wherein the number of predictors $p$ grows with the number of observations $n$. Under typical assumptions on the data generating process, similar to those in the literature, we recover oracle rates up to a log factor when choosing the tuning parameter with cross-validation. Under weaker conditions, when the true model is not necessarily linear, we show that the lasso remains risk consistent relative to its linear oracle. We also generalize these results to the group lasso and square-root lasso, and investigate the predictive and model selection performance of cross-validation via simulation.
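For concreteness, here is a minimal sketch of the kind of cross-validated tuning the paper studies (standard scikit-learn usage, not code from the paper): the lasso penalty is selected by 10-fold cross-validation in a $p > n$ design, and predictions are made with the selected model.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, k = 200, 500, 10                 # high-dimensional: p exceeds n
X = rng.standard_normal((n, p))
beta = np.concatenate([np.ones(k), np.zeros(p - k)])  # k-sparse truth
y = X @ beta + rng.standard_normal(n)

# Cross-validation selects the tuning parameter from the data.
model = LassoCV(cv=10).fit(X, y)
print("selected penalty:", model.alpha_)
predictions = model.predict(X)
```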