ترغب بنشر مسار تعليمي؟ اضغط هنا

Estimating Graph Dimension with Cross-validated Eigenvalues

110   0   0.0 ( 0 )
 نشر من قبل Fan Chen
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix; the user searches for a gap or elbow in the decreasing eigenvalues; unfortunately, these patterns can hide beneath the bias of the sample eigenvalues. This methodological problem is conceptually difficult because, in many situations, there is only enough signal to detect a subset of the $k$ population dimensions/eigenvectors. In this situation, one could argue that the correct choice of $k$ is the number of detectable dimensions. We alleviate these problems with cross-validated eigenvalues. Under a large class of random graph models, without any parametric assumptions, we provide a p-value for each sample eigenvector. It tests the null hypothesis that this sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we prove that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.



قيم البحث

اقرأ أيضاً

The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience, however. As such, a variety of estimators have been derived to overcome the shortcomings of the sample covariance matrix in these settings. Yet, the question of selecting an optimal estimator from among the plethora available remains largely unaddressed. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. In particular, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. We evaluate our proposed approach via a comprehensive set of simulation experiments and demonstrate its practical benefits by application in the exploratory analysis of two single-cell transcriptome sequencing datasets. A free and open-source software implementation of the proposed methodology, the cvCovEst R package, is briefly introduced.
Recent contributions to kernel smoothing show that the performance of cross-validated bandwidth selectors improve significantly from indirectness. Indirect crossvalidation first estimates the classical cross-validated bandwidth from a more rough and difficult smoothing problem than the original one and then rescales this indirect bandwidth to become a bandwidth of the original problem. The motivation for this approach comes from the observation that classical crossvalidation tends to work better when the smoothing problem is difficult. In this paper we find that the performance of indirect crossvalidation improves theoretically and practically when the polynomial order of the indirect kernel increases, with the Gaussian kernel as limiting kernel when the polynomial order goes to infinity. These theoretical and practical results support the often proposed choice of the Gaussian kernel as indirect kernel. However, for do-validation our study shows a discrepancy between asymptotic theory and practical performance. As for indirect crossvalidation, in asymptotic theory the performance of indirect do-validation improves with increasing polynomial order of the used indirect kernel. But this theoretical improvements do not carry over to practice and the original do-validation still seems to be our preferred bandwidth selector. We also consider plug-in estimation and combinations of plug-in bandwidths and crossvalidated bandwidths. These latter bandwidths do not outperform the original do-validation estimator either.
Forest-based methods have recently gained in popularity for non-parametric treatment effect estimation. Building on this line of work, we introduce causal survival forests, which can be used to estimate heterogeneous treatment effects in a survival a nd observational setting where outcomes may be right-censored. Our approach relies on orthogonal estimating equations to robustly adjust for both censoring and selection effects. In our experiments, we find our approach to perform well relative to a number of baselines.
78 - Shuxiao Chen , Bo Zhang 2021
Estimating dynamic treatment regimes (DTRs) from retrospective observational data is challenging as some degree of unmeasured confounding is often expected. In this work, we develop a framework of estimating properly defined optimal DTRs with a time- varying instrumental variable (IV) when unmeasured covariates confound the treatment and outcome, rendering the potential outcome distributions only partially identified. We derive a novel Bellman equation under partial identification, use it to define a generic class of estimands (termed IV-optimal DTRs), and study the associated estimation problem. We then extend the IV-optimality framework to tackle the policy improvement problem, delivering IV-improved DTRs that are guaranteed to perform no worse and potentially better than a pre-specified baseline DTR. Importantly, our IV-improvement framework opens up the possibility of strictly improving upon DTRs that are optimal under the no unmeasured confounding assumption (NUCA). We demonstrate via extensive simulations the superior performance of IV-optimal and IV-improved DTRs over the DTRs that are optimal only under the NUCA. In a real data example, we embed retrospective observational registry data into a natural, two-stage experiment with noncompliance using a time-varying IV and estimate useful IV-optimal DTRs that assign mothers to high-level or low-level neonatal intensive care units based on their prognostic variables.
In multivariate data analysis, it is often important to estimate a graph characterizing dependence among (p) variables. A popular strategy uses the non-zero entries in a (ptimes p) covariance or precision matrix, typically requiring restrictive model ing assumptions for accurate graph recovery. To improve model robustness, we instead focus on estimating the {em backbone} of the dependence graph. We use a spanning tree likelihood, based on a minimalist graphical model that is purposely overly-simplified. Taking a Bayesian approach, we place a prior on the space of trees and quantify uncertainty in the graphical model. In both theory and experiments, we show that this model does not require the population graph to be a spanning tree or the covariance to satisfy assumptions beyond positive-definiteness. The model accurately recovers the backbone of the population graph at a rate competitive with existing approaches but with better robustness. We show combinatorial properties of the spanning tree, which may be of independent interest, and develop an efficient Gibbs sampler for Bayesian inference. Analyzing electroencephalography data using a Hidden Markov Model with each latent state modeled by a spanning tree, we show that results are much more interpretable compared with popular alternatives.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا