ترغب بنشر مسار تعليمي؟ اضغط هنا

Data Distillery: Effective Dimension Estimation via Penalized Probabilistic PCA

73   0   0.0 ( 0 )
 نشر من قبل Wei Deng
 تاريخ النشر 2018
  مجال البحث الاحصاء الرياضي
والبحث باللغة English




اسأل ChatGPT حول البحث

The paper tackles the unsupervised estimation of the effective dimension of a sample of dependent random vectors. The proposed method uses the principal components (PC) decomposition of sample covariance to establish a low-rank approximation that helps uncover the hidden structure. The number of PCs to be included in the decomposition is determined via a Probabilistic Principal Components Analysis (PPCA) embedded in a penalized profile likelihood criterion. The choice of penalty parameter is guided by a data-driven procedure that is justified via analytical derivations and extensive finite sample simulations. Application of the proposed penalized PPCA is illustrated with three gene expression datasets in which the number of cancer subtypes is estimated from all expression measurements. The analyses point towards hidden structures in the data, e.g. additional subgroups, that could be of scientific interest.



قيم البحث

اقرأ أيضاً

Non-parametric maximum likelihood estimation encompasses a group of classic methods to estimate distribution-associated functions from potentially censored and truncated data, with extensive applications in survival analysis. These methods, including the Kaplan-Meier estimator and Turnbulls method, often result in overfitting, especially when the sample size is small. We propose an improvement to these methods by applying kernel smoothing to their raw estimates, based on a BIC-type loss function that balances the trade-off between optimizing model fit and controlling model complexity. In the context of a longitudinal study with repeated observations, we detail our proposed smoothing procedure and optimization algorithm. With extensive simulation studies over multiple realistic scenarios, we demonstrate that our smoothing-based procedure provides better overall accuracy in both survival function estimation and individual-level time-to-event prediction by reducing overfitting. Our smoothing procedure decreases the discrepancy between the estimated and true simulated survival function using interval-censored data by up to 49% compared to the raw un-smoothed estimate, with similar improvements of up to 41% and 23% in within-sample and out-of-sample prediction, respectively. Finally, we apply our method to real data on censored breast cancer diagnosis, which similarly shows improvement when compared to empirical survival estimates from uncensored data. We provide an R package, SISE, for implementing our penalized likelihood method.
Quadratic regression goes beyond the linear model by simultaneously including main effects and interactions between the covariates. The problem of interaction estimation in high dimensional quadratic regression has received extensive attention in the past decade. In this article we introduce a novel method which allows us to estimate the main effects and interactions separately. Unlike existing methods for ultrahigh dimensional quadratic regressions, our proposal does not require the widely used heredity assumption. In addition, our proposed estimates have explicit formulas and obey the invariance principle at the population level. We estimate the interactions of matrix form under penalized convex loss function. The resulting estimates are shown to be consistent even when the covariate dimension is an exponential order of the sample size. We develop an efficient ADMM algorithm to implement the penalized estimation. This ADMM algorithm fully explores the cheap computational cost of matrix multiplication and is much more efficient than existing penalized methods such as all pairs LASSO. We demonstrate the promising performance of our proposal through extensive numerical studies.
323 - Xin Gao , Daniel Q. Pu , Yuehua Wu 2009
In a Gaussian graphical model, the conditional independence between two variables are characterized by the corresponding zero entries in the inverse covariance matrix. Maximum likelihood method using the smoothly clipped absolute deviation (SCAD) pen alty (Fan and Li, 2001) and the adaptive LASSO penalty (Zou, 2006) have been proposed in literature. In this article, we establish the result that using Bayesian information criterion (BIC) to select the tuning parameter in penalized likelihood estimation with both types of penalties can lead to consistent graphical model selection. We compare the empirical performance of BIC with cross validation method and demonstrate the advantageous performance of BIC criterion for tuning parameter selection through simulation studies.
128 - Jiaqi Li , Liya Fu 2021
As an effective nonparametric method, empirical likelihood (EL) is appealing in combining estimating equations flexibly and adaptively for incorporating data information. To select important variables and estimating equations in the sparse high-dimen sional model, we consider a penalized EL method based on robust estimating functions by applying two penalty functions for regularizing the regression parameters and the associated Lagrange multipliers simultaneously, which allows the dimensionalities of both regression parameters and estimating equations to grow exponentially with the sample size. A first inspection on the robustness of estimating equations contributing to the estimating equations selection and variable selection is discussed from both theoretical perspective and intuitive simulation results in this paper. The proposed method can improve the robustness and effectiveness when the data have underlying outliers or heavy tails in the response variables and/or covariates. The robustness of the estimator is measured via the bounded influence function, and the oracle properties are also established under some regularity conditions. Extensive simulation studies and a yeast cell data are used to evaluate the performance of the proposed method. The numerical results reveal that the robustness of sparse estimating equations selection fundamentally enhances variable selection accuracy when the data have heavy tails and/or include underlying outliers.
262 - Libo Sun , Chihoon Lee , 2013
We consider the problem of estimating parameters of stochastic differential equations (SDEs) with discrete-time observations that are either completely or partially observed. The transition density between two observations is generally unknown. We pr opose an importance sampling approach with an auxiliary parameter when the transition density is unknown. We embed the auxiliary importance sampler in a penalized maximum likelihood framework which produces more accurate and computationally efficient parameter estimates. Simulation studies in three different models illustrate promising improvements of the new penalized simulated maximum likelihood method. The new procedure is designed for the challenging case when some state variables are unobserved and moreover, observed states are sparse over time, which commonly arises in ecological studies. We apply this new approach to two epidemics of chronic wasting disease in mule deer.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا