No Arabic abstract
In this paper we develop an online statistical inference approach for high-dimensional generalized linear models with streaming data for real-time estimation and inference. We propose an online debiased lasso (ODL) method to accommodate the special structure of streaming data. ODL differs from offline debiased lasso in two important aspects. First, in computing the estimate at the current stage, it only uses summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from nonlinear online updating with streaming data. We show that the proposed online debiased estimators for the GLMs are consistent and asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed ODL method. These experiments demonstrate the effectiveness of our algorithm and support the theoretical results. A streaming dataset from the National Automotive Sampling System-Crashworthiness Data System is analyzed to illustrate the application of the proposed method.
In this paper, we survey some recent results on statistical inference (parametric and nonparametric statistical estimation, hypotheses testing) about the spectrum of stationary models with tapered data, as well as, a question concerning robustness of inferences, carried out on a linear stationary process contaminated by a small trend. We also discuss some question concerning tapered Toeplitz matrices and operators, central limit theorems for tapered Toeplitz type quadratic functionals, and tapered Fejer-type kernels and singular integrals. These are the main tools for obtaining the corresponding results, and also are of interest in themselves. The processes considered will be discrete-time and continuous-time Gaussian, linear or Levy-driven linear processes with memory.
In this paper we discuss the estimation of a nonparametric component $f_1$ of a nonparametric additive model $Y=f_1(X_1) + ...+ f_q(X_q) + epsilon$. We allow the number $q$ of additive components to grow to infinity and we make sparsity assumptions about the number of nonzero additive components. We compare this estimation problem with that of estimating $f_1$ in the oracle model $Z= f_1(X_1) + epsilon$, for which the additive components $f_2,dots,f_q$ are known. We construct a two-step presmoothing-and-resmoothing estimator of $f_1$ and state finite-sample bounds for the difference between our estimator and some smoothing estimators $hat f_1^{text{(oracle)}}$ in the oracle model. In an asymptotic setting these bounds can be used to show asymptotic equivalence of our estimator and the oracle estimators; the paper thus shows that, asymptotically, under strong enough sparsity conditions, knowledge of $f_2,dots,f_q$ has no effect on estimation accuracy. Our first step is to estimate $f_1$ with an undersmoothed estimator based on near-orthogonal projections with a group Lasso bias correction. We then construct pseudo responses $hat Y$ by evaluating a debiased modification of our undersmoothed estimator of $f_1$ at the design points. In the second step the smoothing method of the oracle estimator $hat f_1^{text{(oracle)}}$ is applied to a nonparametric regression problem with responses $hat Y$ and covariates $X_1$. Our mathematical exposition centers primarily on establishing properties of the presmoothing estimator. We present simulation results demonstrating close-to-oracle performance of our estimator in practical applications.
In the low-dimensional case, the generalized additive coefficient model (GACM) proposed by Xue and Yang [Statist. Sinica 16 (2006) 1423-1446] has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the large $p$ small $n$ setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.
We propose statistical inferential procedures for panel data models with interactive fixed effects in a kernel ridge regression framework.Compared with traditional sieve methods, our method is automatic in the sense that it does not require the choice of basis functions and truncation parameters.Model complexity is controlled by a continuous regularization parameter which can be automatically selected by generalized cross validation. Based on empirical processes theory and functional analysis tools, we derive joint asymptotic distributions for the estimators in the heterogeneous setting. These joint asymptotic results are then used to construct confidence intervals for the regression means and prediction intervals for the future observations, both being the first provably valid intervals in literature. Marginal asymptotic normality of the functional estimators in homogeneous setting is also obtained. Simulation and real data analysis demonstrate the advantages of our method.
This paper considers the maximum generalized empirical likelihood (GEL) estimation and inference on parameters identified by high dimensional moment restrictions with weakly dependent data when the dimensions of the moment restrictions and the parameters diverge along with the sample size. The consistency with rates and the asymptotic normality of the GEL estimator are obtained by properly restricting the growth rates of the dimensions of the parameters and the moment restrictions, as well as the degree of data dependence. It is shown that even in the high dimensional time series setting, the GEL ratio can still behave like a chi-square random variable asymptotically. A consistent test for the over-identification is proposed. A penalized GEL method is also provided for estimation under sparsity setting.