
Optimal Sparse Singular Value Decomposition for High-dimensional High-order Data

Posted by Anru Zhang
Publication date: 2018
Research field: Mathematical Statistics
Paper language: English





In this article, we consider the sparse tensor singular value decomposition, which aims at dimension reduction for high-dimensional, high-order data with certain sparsity structure. A method named Sparse Tensor Alternating Thresholding for Singular Value Decomposition (STAT-SVD) is proposed. The procedure features a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with the regular tensor SVD model, STAT-SVD permits more robust estimation under weaker assumptions. Both upper and lower bounds for estimation accuracy are developed, and the procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset of European country mortality rates.
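The alternating projection-and-thresholding idea behind such procedures can be sketched in a few lines. The NumPy code below is only a generic, illustrative version of a row-sparse alternating tensor SVD, not the authors' STAT-SVD: it keeps a fixed number of rows per mode (`sparsity`) rather than applying the paper's sharp per-iteration thresholding criterion, and all function names are hypothetical.

```python
import numpy as np

def mode_unfold(T, mode):
    """Unfold a 3-way tensor so that `mode` indexes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def project_other_modes(T, Ua, Ub, mode):
    """Contract the two modes other than `mode` with their current loadings."""
    if mode == 0:
        M = np.einsum('ijk,jb,kc->ibc', T, Ua, Ub)
    elif mode == 1:
        M = np.einsum('ijk,ia,kc->jac', T, Ua, Ub)
    else:
        M = np.einsum('ijk,ia,jb->kab', T, Ua, Ub)
    return M.reshape(M.shape[0], -1)

def sparse_tensor_svd(T, ranks=(2, 2, 2), sparsity=10, n_iter=25):
    """Alternating row-sparse tensor SVD (illustrative only, not STAT-SVD)."""
    # warm start: leading singular vectors of each mode-wise unfolding
    U = [np.linalg.svd(mode_unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    for _ in range(n_iter):
        for m in range(3):
            a, b = [k for k in range(3) if k != m]
            M = project_other_modes(T, U[a], U[b], m)
            # hard-threshold: keep only the `sparsity` rows with largest 2-norm
            keep = np.argsort(np.linalg.norm(M, axis=1))[-sparsity:]
            M_thr = np.zeros_like(M)
            M_thr[keep] = M[keep]
            # refresh this mode's loadings from the thresholded projection
            U[m] = np.linalg.svd(M_thr, full_matrices=False)[0][:, :ranks[m]]
    core = np.einsum('ijk,ia,jb,kc->abc', T, U[0], U[1], U[2])
    return core, U

# toy example: a rank-(2,2,2) signal supported on 10 rows per mode, plus noise
rng = np.random.default_rng(0)
d, r, s = 50, 2, 10
factors = []
for _ in range(3):
    F = np.zeros((d, r))
    F[:s] = rng.standard_normal((s, r))
    factors.append(np.linalg.qr(F)[0])        # orthonormal, row-sparse loadings
core = rng.standard_normal((r, r, r)) * 10
signal = np.einsum('abc,ia,jb,kc->ijk', core, *factors)
noisy = signal + rng.standard_normal((d, d, d))
core_hat, U_hat = sparse_tensor_svd(noisy, ranks=(r, r, r), sparsity=s)
```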




Read also

We consider high-dimensional measurement errors with high-frequency data. Our focus is on recovering the covariance matrix of the random errors with optimality. In this problem, not all components of the random vector are observed at the same time and the measurement errors are latent variables, leading to major challenges besides high data dimensionality. We propose a new covariance matrix estimator in this context with appropriate localization and thresholding. By developing a new technical device integrating the high-frequency data feature with the conventional notion of $\alpha$-mixing, our analysis successfully accommodates the challenging serial dependence in the measurement errors. Our theoretical analysis establishes the minimax optimal convergence rates associated with two commonly used loss functions. We then establish cases when the proposed localized estimator with thresholding achieves the minimax optimal convergence rates. Considering that the variances and covariances can be small in reality, we conduct a second-order theoretical analysis that further disentangles the dominating bias in the estimator. A bias-corrected estimator is then proposed to ensure its practical finite-sample performance. We illustrate the promising empirical performance of the proposed estimator with extensive simulation studies and a real data analysis.
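Setting aside the localization across time, the serial dependence, and the second-order bias correction described above, the basic thresholded covariance estimator can be sketched as follows; the threshold constant and the i.i.d. data used in the example are purely illustrative.

```python
import numpy as np

def thresholded_covariance(X, lam):
    """Sample covariance with off-diagonal entries hard-thresholded at `lam`."""
    S = np.cov(X, rowvar=False)
    T = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(T, np.diag(S))          # never threshold the variances
    return T

# example with an i.i.d. surrogate for the latent errors
rng = np.random.default_rng(1)
n, p = 200, 400
X = rng.standard_normal((n, p))
lam = 2.0 * np.sqrt(np.log(p) / n)           # rate-level choice; the constant is ad hoc
Sigma_hat = thresholded_covariance(X, lam)
```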
We derive a formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise; although it nominally involves unobservables, we show how to apply it even where the noise covariance structure is not a priori known or is not independently estimable. The proposed method, which we call ScreeNOT, is a mathematically solid alternative to Cattell's ever-popular but vague Scree Plot heuristic from 1966. ScreeNOT has a surprising oracle property: it typically achieves exactly, in large finite samples, the lowest possible MSE for matrix recovery on each given problem instance; i.e., the specific threshold it selects gives exactly the smallest achievable MSE loss among all possible threshold choices for that noisy dataset and that unknown underlying true low-rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure. Our results depend on the assumption that the singular values of the noise have a limiting empirical distribution of compact support; this model, which is standard in random matrix theory, is satisfied by many models exhibiting either cross-row or cross-column correlation structure, and also by many situations with inter-element correlation structure. Simulations demonstrate the effectiveness of the method even at moderate matrix sizes. The paper is supplemented by ready-to-use software packages implementing the proposed algorithm.
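The released ScreeNOT packages choose the threshold adaptively from the empirical singular-value distribution; the sketch below shows only the final reconstruction step, hard-thresholding the singular values at a user-supplied level, with an ad-hoc threshold in the example.

```python
import numpy as np

def svd_hard_threshold(Y, tau):
    """Low-rank reconstruction: zero out every singular value of Y at or below tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.where(s > tau, s, 0.0)
    return (U * s_thr) @ Vt                  # columns of U rescaled by the kept singular values

# example: denoise a rank-3 matrix observed in additive noise
rng = np.random.default_rng(2)
n, p, r = 300, 200, 3
low_rank = rng.standard_normal((n, r)) @ rng.standard_normal((r, p)) * 2.0
Y = low_rank + rng.standard_normal((n, p))
X_hat = svd_hard_threshold(Y, tau=1.2 * (np.sqrt(n) + np.sqrt(p)))   # ad-hoc threshold
```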
We study high-dimensional regression with missing entries in the covariates. A common strategy in practice is to impute the missing entries with an appropriate substitute and then implement a standard statistical procedure acting as if the covariates were fully observed. Recent literature on this subject proposes instead to design a specific, often complicated or non-convex, algorithm tailored to the case of missing covariates. We investigate a simpler approach where we fill in the missing entries with their conditional mean given the observed covariates. We show that this imputation scheme coupled with standard off-the-shelf procedures such as the LASSO and square-root LASSO retains the minimax estimation rate in the random-design setting where the covariates are i.i.d. sub-Gaussian. We further show that the square-root LASSO remains pivotal in this setting. It is often the case that the conditional expectation cannot be computed exactly and must be approximated from data. We study two cases where the covariates either follow an autoregressive (AR) process, or are jointly Gaussian with sparse precision matrix. We propose tractable estimators for the conditional expectation, then perform linear regression via the LASSO, and show similar estimation rates in both cases. We complement our theoretical results with simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our bounds but also the broader utility of this strategy beyond our theoretical assumptions.
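A minimal sketch of the impute-then-LASSO pipeline in the jointly Gaussian case, assuming for simplicity that the covariate mean and covariance are known (in practice they would be estimated, e.g. via the sparse precision matrix mentioned above); the helper name is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_conditional_impute(X, mu, Sigma):
    """Fill NaNs with their Gaussian conditional mean given the observed entries."""
    X_imp = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        if not obs.any():                    # fully missing row: fall back to the mean
            X_imp[i] = mu
            continue
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(miss, obs)]
        X_imp[i, miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, X[i, obs] - mu[obs])
    return X_imp

# synthetic example: sparse regression with 20% of covariate entries missing
rng = np.random.default_rng(3)
n, p = 200, 50
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-type covariance
X_full = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p)
beta[:5] = 1.0
y = X_full @ beta + rng.standard_normal(n)
X_obs = X_full.copy()
X_obs[rng.random((n, p)) < 0.2] = np.nan
X_imp = gaussian_conditional_impute(X_obs, np.zeros(p), Sigma)
beta_hat = Lasso(alpha=0.1).fit(X_imp, y).coef_   # off-the-shelf LASSO on the completed design
```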
Jue Hou, Zijian Guo, Tianxi Cai (2021)
Risk modeling with EHR data is challenging due to the lack of direct observations on the disease outcome and the high dimensionality of the candidate predictors. In this paper, we develop a surrogate-assisted semi-supervised learning (SAS) approach to risk modeling with high-dimensional predictors, leveraging a large unlabeled dataset containing candidate predictors and surrogates of the outcome, as well as a small labeled dataset with annotated outcomes. The SAS procedure borrows information from the surrogates along with the candidate predictors to impute the unobserved outcomes via a sparse working imputation model, using moment conditions to achieve robustness against mis-specification of the imputation model and a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high-dimensional working model, even when the underlying risk prediction model is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our SSL approach compared to existing supervised methods. We apply the method to derive genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.
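A stripped-down sketch of the two imputation-based stages (a sparse working imputation model fit on the labeled data, then a penalized risk model fit to the imputed outcomes) is given below; it is purely illustrative, uses simulated data, and omits the paper's moment conditions and one-step bias correction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(4)
p, n_lab, n_unlab = 30, 150, 5000

def simulate(n):
    """Candidate predictors X, surrogate features S, latent binary outcome y."""
    X = rng.standard_normal((n, p))
    logits = X[:, 0] - X[:, 1]
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    S = y[:, None] + 0.5 * rng.standard_normal((n, 2))   # noisy surrogates of y
    return X, S, y

X_lab, S_lab, y_lab = simulate(n_lab)
X_unl, S_unl, _ = simulate(n_unlab)                      # outcomes unobserved here

# step 1: sparse working imputation model on predictors + surrogates (labeled data)
imputer = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
imputer.fit(np.hstack([X_lab, S_lab]), y_lab)

# step 2: impute outcome probabilities on the large unlabeled data
y_imp = imputer.predict_proba(np.hstack([X_unl, S_unl]))[:, 1]

# step 3: penalized risk model on predictors alone, fit to the imputed outcomes
# (the paper's one-step bias correction for valid intervals is not shown)
risk_model = Lasso(alpha=0.01).fit(X_unl, y_imp)
```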
Pranab K. Sen (2008)
High-dimensional data models, often with low sample size, abound in many interdisciplinary studies, genomics and large biological systems being most noteworthy. The conventional assumption of multinormality or linearity of regression may not be plausible for such models, which are likely to be statistically complex due to a large number of parameters as well as various underlying restraints. As such, parametric approaches may not be very effective. Anything beyond parametrics, albeit having increased scope and robustness perspectives, may generally be baffled by the low sample size and hence unable to give reasonable margins of error. Kendall's tau statistic is exploited in this context with emphasis on dimensional rather than sample-size asymptotics. The Chen-Stein theorem has been thoroughly appraised in this study. Applications of these findings in some microarray data models are illustrated.
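For concreteness, a short sketch of the pairwise Kendall's tau matrix on which such dimension-asymptotic analyses operate, computed with SciPy on a small-sample, high-dimensional toy dataset:

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_tau_matrix(X):
    """Pairwise Kendall's tau across the p columns of an (n, p) data matrix."""
    p = X.shape[1]
    T = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            T[j, k] = T[k, j] = kendalltau(X[:, j], X[:, k])[0]
    return T

# small-n, large-p toy example (e.g. a microarray-like setting)
rng = np.random.default_rng(5)
X = rng.standard_normal((15, 100))       # n = 15 samples, p = 100 variables
T_hat = kendall_tau_matrix(X)
```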