No Arabic abstract
As an effective nonparametric method, empirical likelihood (EL) is appealing in combining estimating equations flexibly and adaptively for incorporating data information. To select important variables and estimating equations in the sparse high-dimensional model, we consider a penalized EL method based on robust estimating functions by applying two penalty functions for regularizing the regression parameters and the associated Lagrange multipliers simultaneously, which allows the dimensionalities of both regression parameters and estimating equations to grow exponentially with the sample size. A first inspection on the robustness of estimating equations contributing to the estimating equations selection and variable selection is discussed from both theoretical perspective and intuitive simulation results in this paper. The proposed method can improve the robustness and effectiveness when the data have underlying outliers or heavy tails in the response variables and/or covariates. The robustness of the estimator is measured via the bounded influence function, and the oracle properties are also established under some regularity conditions. Extensive simulation studies and a yeast cell data are used to evaluate the performance of the proposed method. The numerical results reveal that the robustness of sparse estimating equations selection fundamentally enhances variable selection accuracy when the data have heavy tails and/or include underlying outliers.
Nonparametric empirical Bayes methods provide a flexible and attractive approach to high-dimensional data analysis. One particularly elegant empirical Bayes methodology, involving the Kiefer-Wolfowitz nonparametric maximum likelihood estimator (NPMLE) for mixture models, has been known for decades. However, implementation and theoretical analysis of the Kiefer-Wolfowitz NPMLE are notoriously difficult. A fast algorithm was recently proposed that makes NPMLE-based procedures feasible for use in large-scale problems, but the algorithm calculates only an approximation to the NPMLE. In this paper we make two contributions. First, we provide upper bounds on the convergence rate of the approximate NPMLEs statistical error, which have the same order as the best known bounds for the true NPMLE. This suggests that the approximate NPMLE is just as effective as the true NPMLE for statistical applications. Second, we illustrate the promise of NPMLE procedures in a high-dimensional binary classification problem. We propose a new procedure and show that it vastly outperforms existing methods in experiments with simulated data. In real data analyses involving cancer survival and gene expression data, we show that it is very competitive with several recently proposed methods for regularized linear discriminant analysis, another popular approach to high-dimensional classification.
High-dimensional statistical inference with general estimating equations are challenging and remain less explored. In this paper, we study two problems in the area: confidence set estimation for multiple components of the model parameters, and model specifications test. For the first one, we propose to construct a new set of estimating equations such that the impact from estimating the high-dimensional nuisance parameters becomes asymptotically negligible. The new construction enables us to estimate a valid confidence region by empirical likelihood ratio. For the second one, we propose a test statistic as the maximum of the marginal empirical likelihood ratios to quantify data evidence against the model specification. Our theory establishes the validity of the proposed empirical likelihood approaches, accommodating over-identification and exponentially growing data dimensionality. The numerical studies demonstrate promising performance and potential practical benefits of the new methods.
Statistical methods with empirical likelihood (EL) are appealing and effective especially in conjunction with estimating equations through which useful data information can be adaptively and flexibly incorporated. It is also known in the literature that EL approaches encounter difficulties when dealing with problems having high-dimensional model parameters and estimating equations. To overcome the challenges, we begin our study with a careful investigation on high-dimensional EL from a new scope targeting at estimating a high-dimensional sparse model parameters. We show that the new scope provides an opportunity for relaxing the stringent requirement on the dimensionality of the model parameter. Motivated by the new scope, we then propose a new penalized EL by applying two penalty functions respectively regularizing the model parameters and the associated Lagrange multipliers in the optimizations of EL. By penalizing the Lagrange multiplier to encourage its sparsity, we show that drastic dimension reduction in the number of estimating equations can be effectively achieved without compromising the validity and consistency of the resulting estimators. Most attractively, such a reduction in dimensionality of estimating equations is actually equivalent to a selection among those high-dimensional estimating equations, resulting in a highly parsimonious and effective device for high-dimensional sparse model parameters. Allowing both the dimensionalities of model parameters and estimating equations growing exponentially with the sample size, our theory demonstrates that the estimator from our new penalized EL is sparse and consistent with asymptotically normally distributed nonzero components. Numerical simulations and a real data analysis show that the proposed penalized EL works promisingly.
Functional principal component analysis is essential in functional data analysis, but the inferences will become unconvincing when some non-Gaussian characteristics occur, such as heavy tail and skewness. The focus of this paper is to develop a robust functional principal component analysis methodology in dealing with non-Gaussian longitudinal data, for which sparsity and irregularity along with non-negligible measurement errors must be considered. We introduce a Kendalls $tau$ function whose particular properties make it a nice proxy for the covariance function in the eigenequation when handling non-Gaussian cases. Moreover, the estimation procedure is presented and the asymptotic theory is also established. We further demonstrate the superiority and robustness of our method through simulation studies and apply the method to the longitudinal CD4 cell count data in an AIDS study.
This paper proposes a new robust smooth-threshold estimating equation to select important variables and automatically estimate parameters for high dimensional longitudinal data. A novel working correlation matrix is proposed to capture correlations within the same subject. The proposed procedure works well when the number of covariates p increases as the number of subjects n increases. The proposed estimates are competitive with the estimates obtained with the true correlation structure, especially when the data are contaminated. Moreover, the proposed method is robust against outliers in the response variables and/or covariates. Furthermore, the oracle properties for robust smooth-threshold estimating equations under large n, diverging p are established under some regularity conditions. Extensive simulation studies and a yeast cell cycle data are used to evaluate the performance of the proposed method, and results show that our proposed method is competitive with existing robust variable selection procedures.