The Lasso is a method for high-dimensional regression, which is now commonly used when the number of covariates $p$ is of the same order as, or larger than, the number of observations $n$. Classical asymptotic normality theory is not applicable to this model for two fundamental reasons: (1) the regularized risk is non-smooth; (2) the distance between the estimator $\widehat{\boldsymbol{\theta}}$ and the true parameter vector $\boldsymbol{\theta}^\star$ cannot be neglected. As a consequence, the standard perturbative arguments that are the traditional basis for asymptotic normality fail. On the other hand, the Lasso estimator can be precisely characterized in the regime in which both $n$ and $p$ are large, while $n/p$ is of order one. This characterization was first obtained in the case of standard Gaussian designs, and subsequently generalized to other high-dimensional estimation procedures. Here we extend the same characterization to Gaussian correlated designs with non-singular covariance structure. This characterization is expressed in terms of a simpler ``fixed design'' model. We establish non-asymptotic bounds on the distance between distributions of various quantities in the two models, which hold uniformly over signals $\boldsymbol{\theta}^\star$ in a suitable sparsity class, and over values of the regularization parameter. As applications, we study the distribution of the debiased Lasso, and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals.
We consider the problem of learning a coefficient vector $x_0 \in \mathbb{R}^N$ from noisy linear observations $y = Ax_0 + w \in \mathbb{R}^n$. In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator $x$. In this case, a popular approach consists in solving an $\ell_1$-penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices $A$ of increasing dimensions, with independent Gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm that is inspired by ideas from graphical models. Simulations on real data matrices suggest that our results can be relevant in a broad array of practical applications.
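The setting of this abstract can be illustrated numerically. The following is a minimal sketch, not the paper's method: it solves the $\ell_1$-penalized least squares problem with plain proximal gradient descent (ISTA) rather than AMP, and measures the empirical mean square error on one random Gaussian instance. The function names, the objective scaling $\tfrac12\|y - Ax\|_2^2 + \lambda\|x\|_1$, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Entrywise soft thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(A, y, x, lam):
    """Assumed objective: 0.5 * ||y - A x||_2^2 + lam * ||x||_1."""
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))


def lasso_ista(A, y, lam, n_iter=500):
    """Approximately minimize the objective above by ISTA (not AMP)."""
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)          # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x


if __name__ == "__main__":
    # Hypothetical instance: i.i.d. Gaussian A, sparse x0, Gaussian noise w.
    rng = np.random.default_rng(0)
    n, N, k, sigma, lam = 100, 200, 10, 0.1, 0.1
    A = rng.normal(size=(n, N)) / np.sqrt(n)
    x0 = np.zeros(N)
    x0[:k] = 1.0
    y = A @ x0 + sigma * rng.normal(size=n)
    x_hat = lasso_ista(A, y, lam)
    print("empirical MSE per coordinate:", np.mean((x_hat - x0) ** 2))
```

Repeating the experiment for growing $n$ and $N$ at a fixed ratio $n/N$ is one way to observe empirically the kind of convergence of the normalized risk that the abstract describes.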
The classical binary hypothesis testing problem is revisited. We notice that when one of the hypotheses is composite, there is an inherent difficulty in defining an optimality criterion that is both informative and well-justified. For testing in the simple normal location problem (that is, testing for the mean of multivariate Gaussians), we overcome the difficulty as follows. In this problem there exists a natural hardness order between parameters, since for different parameters the error-probability curves (when the parameter is known) are either identical, or one dominates the other. We can thus define minimax performance as the worst case among parameters which are below some hardness level. Fortunately, there exists a universal minimax test, in the sense that it is minimax for all hardness levels simultaneously. Under this criterion we also find the optimal test for composite hypothesis testing with training data. This criterion extends to the wide class of locally asymptotically normal models, in an asymptotic sense where the approximation of the error probabilities is additive. Since we have the asymptotically optimal tests for composite hypothesis testing with and without training data, we quantify the loss of universality and the gain of training data for these models.
This paper studies the problem of accurately recovering a sparse vector $\beta^{\star}$ from highly corrupted linear measurements $y = X \beta^{\star} + e^{\star} + w$, where $e^{\star}$ is a sparse error vector whose nonzero entries may be unbounded and $w$ is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information on both $\beta^{\star}$ and $e^{\star}$. Our first result shows that the extended Lasso can faithfully recover both the regression vector and the corruption vector. Our analysis relies on the notion of an extended restricted eigenvalue for the design matrix $X$. Our second set of results applies to a general class of Gaussian design matrices $X$ with i.i.d. rows $\mathcal{N}(0, \Sigma)$, for which we can establish a surprising result: the extended Lasso can recover the exact signed supports of both $\beta^{\star}$ and $e^{\star}$ from only $\Omega(k \log p \log n)$ observations, even when the fraction of corruption is arbitrarily close to one. Our analysis also shows that this number of observations required to achieve exact signed support recovery is indeed optimal.
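The abstract does not state the extended Lasso objective, so the following is only a sketch under an assumed formulation: minimize $\tfrac12\|y - X\beta - e\|_2^2 + \lambda_\beta\|\beta\|_1 + \lambda_e\|e\|_1$ jointly over $(\beta, e)$. Under that assumption the problem is an ordinary $\ell_1$-penalized regression on the augmented design $[X \; I_n]$ with two penalty levels, which the code solves by proximal gradient descent. The names `extended_lasso`, `lam_beta`, and `lam_e` are hypothetical, and the scaling of the loss may differ from the paper's.

```python
import numpy as np


def soft_threshold(v, t):
    """Entrywise soft thresholding; t may be a scalar or a vector of levels."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def extended_objective(X, y, beta, e, lam_beta, lam_e):
    """Assumed objective: squared loss plus separate l1 penalties."""
    resid = y - X @ beta - e
    return (0.5 * np.sum(resid ** 2)
            + lam_beta * np.sum(np.abs(beta))
            + lam_e * np.sum(np.abs(e)))


def extended_lasso(X, y, lam_beta, lam_e, n_iter=2000):
    """Jointly estimate (beta, e) by ISTA on the augmented design [X, I]."""
    n, p = X.shape
    Z = np.hstack([X, np.eye(n)])               # y ~ Z @ [beta; e]
    step = 1.0 / np.linalg.norm(Z, 2) ** 2      # 1 / Lipschitz constant
    # Per-coordinate thresholds: lam_beta on beta, lam_e on e.
    thresh = step * np.concatenate([np.full(p, lam_beta), np.full(n, lam_e)])
    u = np.zeros(p + n)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ u - y)
        u = soft_threshold(u - step * grad, thresh)
    return u[:p], u[p:]


if __name__ == "__main__":
    # Hypothetical instance: a few grossly corrupted observations.
    rng = np.random.default_rng(1)
    n, p = 80, 40
    X = rng.normal(size=(n, p))
    beta_star = np.zeros(p)
    beta_star[:3] = [2.0, -1.5, 1.0]
    e_star = np.zeros(n)
    e_star[:8] = 20.0
    y = X @ beta_star + e_star + 0.05 * rng.uniform(-1, 1, size=n)
    beta_hat, e_hat = extended_lasso(X, y, lam_beta=1.0, lam_e=1.0)
    print("nonzeros in beta_hat:", np.flatnonzero(beta_hat))
    print("nonzeros in e_hat:", np.flatnonzero(e_hat))
```

Treating the corruption as extra coefficients on an identity block keeps the solver identical to a standard Lasso solver; only the thresholds differ between the two groups of variables.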
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov--Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved using a Ford--Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton--Watson-related processes.
In many statistical problems the hypotheses are naturally divided into groups, and the investigators are interested in performing group-level inference, possibly along with inference on individual hypotheses. We consider the goal of discovering groups containing $u$ or more signals with group-level false discovery rate (FDR) control. This goal can be addressed by multiple testing of partial conjunction hypotheses with a parameter $u$, which reduce to global null hypotheses for $u=1$. We consider the case where the partial conjunction $p$-values are combinations of within-group $p$-values, and obtain sufficient conditions on (1) the dependencies among the $p$-values within and across the groups, (2) the combining method for obtaining partial conjunction $p$-values, and (3) the multiple testing procedure, for obtaining FDR control on partial conjunction discoveries. We consider separately the dependencies encountered in the meta-analysis setting, where multiple features are tested in several independent studies, and the $p$-values within each study may be dependent. Based on the results for this setting, we generalize the procedure of Benjamini, Heller, and Yekutieli (2009) for assessing replicability of signals across studies, and extend their theoretical results regarding FDR control with respect to replicability claims.
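The pipeline described in this abstract (combine within-group $p$-values into a partial conjunction $p$-value, then apply a multiple testing procedure across groups) can be sketched as follows. The abstract does not fix a combining method or a testing procedure, so two standard choices are assumed here: the Bonferroni-style combination $(m - u + 1)\,p_{(u)}$ for a group of $m$ sorted $p$-values, and the Benjamini-Hochberg step-up procedure. The function names are hypothetical, and the code does not check the dependence conditions under which FDR control holds.

```python
import numpy as np


def pc_pvalue_bonferroni(pvals, u):
    """Bonferroni-style partial conjunction p-value for 'at least u signals'.

    With sorted p-values p_(1) <= ... <= p_(m), returns
    min(1, (m - u + 1) * p_(u)). For u = 1 this is the Bonferroni
    global null p-value m * p_(1).
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    if not 1 <= u <= m:
        raise ValueError("u must satisfy 1 <= u <= number of p-values")
    return min(1.0, (m - u + 1) * p[u - 1])


def benjamini_hochberg(pvals, q):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with q * i / m.
    below = p[order] <= q * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected


def discover_groups(groups, u, q):
    """Apply BH at level q to the partial conjunction p-values of each group."""
    pc = np.array([pc_pvalue_bonferroni(g, u) for g in groups])
    return pc, benjamini_hochberg(pc, q)


if __name__ == "__main__":
    # Hypothetical within-group p-values for three groups.
    groups = [[0.001, 0.002, 0.6], [0.01, 0.7, 0.9], [0.3, 0.5, 0.8]]
    pc, rejected = discover_groups(groups, u=2, q=0.05)
    print("partial conjunction p-values:", pc)
    print("groups declared to contain >= 2 signals:", rejected)
```

In the meta-analysis reading of the abstract, each group would hold the $p$-values of one feature across the independent studies, so a rejected group corresponds to a replicability claim for that feature.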