We study high-dimensional linear models with error-in-variables. Such models are motivated by various applications in econometrics, finance and genetics. They are challenging because one must account for measurement errors to avoid non-vanishing biases, in addition to handling the high dimensionality of the parameters. A growing recent literature has proposed various estimators that achieve good rates of convergence. Our main contribution complements this literature with the construction of simultaneous confidence regions for the parameters of interest in such high-dimensional linear models with error-in-variables. These confidence regions are based on the construction of moment conditions that have an additional orthogonality property with respect to nuisance parameters. We provide a construction that requires us to estimate an additional high-dimensional linear model with error-in-variables for each component of interest. We use a multiplier bootstrap to compute critical values for simultaneous confidence intervals for a subset $S$ of the components. We show its validity despite possible model selection mistakes, while allowing the cardinality of $S$ to be larger than the sample size. We apply our results to two examples, discuss their implications, and conduct Monte Carlo simulations to illustrate the performance of the proposed procedure.
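As a rough illustration of how a multiplier bootstrap can supply a critical value for simultaneous confidence intervals, the following sketch uses the generic Gaussian multiplier bootstrap for a maximum of studentized statistics. It is not the paper's construction: the input `psi` (an $n \times |S|$ matrix of estimated score contributions, one column per component of interest), the function name and all parameter names are assumptions made for this example.

```python
import numpy as np

def multiplier_bootstrap_critical_value(psi, alpha=0.05, n_boot=2000, seed=0):
    """Approximate the (1 - alpha) quantile of max_j |n^{-1/2} sum_i xi_i psi_ij / sigma_j|.

    psi is a hypothetical (n, p) array of estimated score contributions;
    xi_i are i.i.d. standard normal multipliers drawn independently of the data.
    """
    rng = np.random.default_rng(seed)
    psi = np.asarray(psi, dtype=float)
    n, p = psi.shape
    # Center each column and scale it by its sample standard deviation.
    centered = psi - psi.mean(axis=0)
    sigma = centered.std(axis=0, ddof=1)
    studentized = centered / sigma
    # Each row of xi is one bootstrap draw of the n multipliers.
    xi = rng.standard_normal((n_boot, n))
    # Bootstrap analogue of the maximal studentized statistic, one value per draw.
    max_stats = np.abs(xi @ studentized / np.sqrt(n)).max(axis=1)
    return float(np.quantile(max_stats, 1 - alpha))
```

Given point estimates and standard errors for the components in $S$, one would then form intervals of the form estimate ± critical value × standard error / $\sqrt{n}$. Because the critical value comes from the maximum over all components, it is larger than the pointwise normal quantile whenever $|S| > 1$.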
This is a revision of arXiv:1105.2454v1 from 2012. It considers a variation on the STIV estimator in which, instead of one conic constraint, there are as many conic constraints as moments (instruments), which allows moderate deviations for self-normalized sums to be used more directly. The idea first appeared in formula (6.5) of arXiv:1105.2454v1, for the case where some instruments can be endogenous. For reference, and to avoid confusion with the STIV estimator, this estimator should be called C-STIV.
In this paper we develop an online statistical inference approach for high-dimensional generalized linear models (GLMs) with streaming data, aimed at real-time estimation and inference. We propose an online debiased lasso (ODL) method to accommodate the special structure of streaming data. ODL differs from the offline debiased lasso in two important aspects. First, in computing the estimate at the current stage, it uses only summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from nonlinear online updating with streaming data. We show that the proposed online debiased estimators for GLMs are consistent and asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed ODL method. These experiments demonstrate the effectiveness of our algorithm and support the theoretical results. A streaming dataset from the National Automotive Sampling System-Crashworthiness Data System is analyzed to illustrate the application of the proposed method.
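The idea of using only summary statistics of the historical data can be illustrated in the simplest special case, the Gaussian linear model, where the lasso objective depends on the data only through $X^\top X$ and $X^\top y$. The sketch below is an assumption-laden toy, not the ODL method: it omits the debiasing step and the correction for nonlinear online updating that the abstract describes for general GLMs, and the class name `StreamingLasso` and its methods are hypothetical.

```python
import numpy as np

class StreamingLasso:
    """Toy online lasso for a linear model that stores only summary statistics."""

    def __init__(self, p):
        self.gram = np.zeros((p, p))  # running sum of X^T X over all batches
        self.xty = np.zeros(p)        # running sum of X^T y over all batches
        self.n = 0                    # number of observations seen so far

    def update(self, X, y):
        # Fold a new batch into the summaries; the raw batch can then be discarded.
        self.gram += X.T @ X
        self.xty += X.T @ y
        self.n += X.shape[0]

    def solve(self, lam, n_sweeps=200):
        # Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1,
        # written purely in terms of S = X^T X / n and r = X^T y / n.
        S = self.gram / self.n
        r = self.xty / self.n
        beta = np.zeros_like(r)
        for _ in range(n_sweeps):
            for j in range(len(beta)):
                # Correlation of feature j with the partial residual excluding j.
                rho = r[j] - S[j] @ beta + S[j, j] * beta[j]
                # Soft-thresholding update for coordinate j.
                beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / S[j, j]
        return beta
```

In this linear case the estimate after each batch coincides with the one computed from the pooled data, since the summaries are exact. For a nonlinear link that is no longer true, which is the approximation error the abstract says ODL corrects.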
High-dimensional linear models with endogenous variables play an increasingly important role in the recent econometric literature. In this work we allow for models with many endogenous variables and many instrumental variables to achieve identification. Because of the high dimensionality in the second stage, constructing honest confidence regions with asymptotically correct coverage is non-trivial. Our main contribution is to propose estimators and confidence regions that achieve this. The approach relies on moment conditions that have an additional orthogonality property with respect to nuisance parameters. Moreover, estimation of high-dimensional nuisance parameters is carried out via new pivotal procedures. To obtain simultaneously valid confidence regions, we use a multiplier bootstrap procedure to compute critical values and establish its validity.
Let $(Y,(X_i)_{i\in\mathcal{I}})$ be a zero-mean Gaussian vector and $V$ be a subset of $\mathcal{I}$. Suppose we are given $n$ i.i.d. replications of the vector $(Y,X)$. We propose a new test of the hypothesis that $Y$ is independent of $(X_i)_{i\in \mathcal{I}\backslash V}$ conditionally on $(X_i)_{i\in V}$ against the general alternative that it is not. This procedure does not depend on any prior information on the covariance of $X$ or the variance of $Y$ and applies in a high-dimensional setting. It extends straightforwardly to testing the neighbourhood of a Gaussian graphical model. The procedure is based on a model of Gaussian regression with random Gaussian covariates. We give nonasymptotic properties of the test and prove that it is rate optimal (up to a possible $\log(n)$ factor) over various classes of alternatives under some additional assumptions. In addition, it allows us to derive nonasymptotic minimax rates of testing in this setting. Finally, we carry out a simulation study to evaluate the performance of our procedure.
We propose a new estimator for the high-dimensional linear regression model with observation error in the design, where the number of coefficients is potentially larger than the sample size. The main novelty of our procedure is that the choice of penalty parameters is pivotal. The estimator is based on applying a self-normalization to the constraints that characterize the estimator. Importantly, we show how to cast the computation of the estimator as the solution of a convex program with second-order cone constraints. This allows the use of algorithms with theoretical guarantees and reliable implementations. Under sparsity assumptions, we derive $\ell_q$-rates of convergence and show that consistency can be achieved even if the number of regressors exceeds the sample size. We further provide a simple-to-implement thresholding rule that yields a provably sparse estimator with similar $\ell_2$- and $\ell_1$-rates of convergence. The thresholds are data-driven and component-dependent. Finally, we also study the rates of convergence of estimators that refit the data based on a selected support with possible model selection mistakes. In addition to our finite-sample theoretical results, which allow for non-i.i.d. data, we present simulations to compare the performance of the proposed estimators.