
Distance-Based Independence Screening for Canonical Analysis

Published by: Chuanping Yu
Publication date: 2019
Research field: Mathematical statistics
Language: English





This paper introduces a new method named Distance-based Independence Screening for Canonical Analysis (DISCA) to reduce the dimensions of two random vectors of arbitrary dimensions. The objective of our method is to identify low-dimensional linear projections of the two random vectors such that any further reduction to lower-dimensional linear projections would necessarily destroy some dependence structure, because the removed components would not be independent. The essence of DISCA is to use the distance correlation to eliminate redundant dimensions until no further elimination is feasible. Unlike existing canonical analysis methods, DISCA does not require the dimensions of the reduced subspaces of the two random vectors to be equal, nor does it require any particular distributional assumption on the random vectors. We show that, under mild conditions, our approach does uncover the lowest possible linear dependency structures between two random vectors, and that our conditions are weaker than those of some sufficient linear subspace-based methods. Numerically, DISCA requires solving a non-convex optimization problem. We formulate it as a difference-of-convex (DC) optimization problem, and then adopt the alternating direction method of multipliers (ADMM) on the convex step of the DC algorithm to parallelize and accelerate the computation. Some sufficient linear subspace-based methods use a potentially computationally intensive bootstrap method to determine the dimensions of the reduced subspaces in advance; our method avoids this complexity. In simulations, we present cases that DISCA can solve effectively while other methods cannot. In both the simulation studies and the real data cases, when other state-of-the-art dimension reduction methods are applicable, we observe that DISCA performs comparably to or better than most of them. Code and an R package are available on GitHub at https://github.com/ChuanpingYu/DISCA.
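
The abstract relies on the distance correlation as its measure of dependence. As a rough illustration only, the sketch below computes the standard sample distance correlation (the V-statistic form due to Székely, Rizzo and Bakirov) with NumPy. It is not the DISCA algorithm and not the API of the authors' R package; the function names `_centered_distances` and `distance_correlation` are hypothetical and chosen here for illustration.

```python
import numpy as np


def _centered_distances(z):
    """Return the double-centered Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a 1-D input as n observations of one variable
    # Pairwise Euclidean distances between observations (n x n matrix).
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    # Subtract row and column means, then add back the grand mean.
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between paired samples x and y (hypothetical helper)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()    # squared sample distance covariance
    dvar_x = (a * a).mean()   # squared sample distance variance of x
    dvar_y = (b * b).mean()   # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    # y depends on x nonlinearly, so Pearson correlation is close to zero
    # while the distance correlation is clearly larger.
    print("Pearson  x, x^2:", np.corrcoef(x, x ** 2)[0, 1])
    print("dCor     x, x^2:", distance_correlation(x, x ** 2))
    print("dCor     x, independent noise:", distance_correlation(x, rng.normal(size=500)))
```

In the population, the distance correlation is zero only under independence, which is why it is a natural tool for deciding whether a removed component is independent of the rest. The sample version above is biased upward in finite samples, so an independence decision would in practice need a test or threshold, which this sketch does not provide.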




Read also

Wenjia Wang, Yi-Hui Zhou (2020)
Classical canonical correlation analysis (CCA) requires matrices to be low dimensional, i.e. the number of features cannot exceed the sample size. Recent developments in CCA have mainly focused on the high-dimensional setting, where the number of features in both matrices under analysis greatly exceeds the sample size. These approaches impose penalties in optimization problems that must be solved iteratively, and they estimate multiple canonical vectors sequentially. In this work, we provide an explicit link between sparse multiple regression and sparse canonical correlation analysis, and an efficient algorithm that can estimate multiple canonical pairs simultaneously rather than sequentially. Furthermore, the algorithm naturally allows parallel computing. These properties make the algorithm much more efficient. We provide theoretical results on the consistency of canonical pairs. The algorithm and theoretical development are based on solving an eigenvector problem, which significantly differentiates our method from existing methods. Simulation results support the improved performance of the proposed approach. We apply eigenvector-based CCA to the analysis of GTEx thyroid histology images, the analysis of SNPs and RNA-seq gene expression data, and a microbiome study. The real data analysis also shows improved performance compared to traditional sparse CCA.
Canonical correlation analysis (CCA) is a classical and important multivariate technique for exploring the relationship between two sets of continuous variables. CCA has applications in many fields, such as genomics and neuroimaging. It can extract meaningful features as well as use these features for subsequent analysis. Although some sparse CCA methods have been developed to deal with high-dimensional problems, they are designed specifically for continuous data and do not consider the integer-valued data from next-generation sequencing platforms that exhibit very low counts for some important features. We propose a model-based probabilistic approach for correlation and canonical correlation estimation for two sparse count data sets (PSCCA). PSCCA demonstrates that correlations and canonical correlations estimated at the natural parameter level are more appropriate than traditional estimation methods applied to the raw data. We demonstrate through simulation studies that PSCCA outperforms other standard correlation approaches and sparse CCA approaches in estimating the true correlations and canonical correlations at the natural parameter level. We further apply the PSCCA method to study the association of miRNA and mRNA expression data sets from a squamous cell lung cancer study, finding that PSCCA can uncover a larger number of strongly correlated pairs than standard correlation and other sparse CCA approaches.
Canonical correlation analysis investigates linear relationships between two sets of variables, but often works poorly on modern data sets due to high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach for sparse canonical correlation analysis based on the Gaussian copula. Our main contribution is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimating the marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies as well as in an application to the analysis of the association between gene expression and microRNA data of breast cancer patients.
This paper proposes a new statistic to test independence between two high dimensional random vectors $\mathbf{X}: p_1 \times 1$ and $\mathbf{Y}: p_2 \times 1$. The proposed statistic is based on the sum of regularized sample canonical correlation coefficients of $\mathbf{X}$ and $\mathbf{Y}$. The asymptotic distribution of the statistic under the null hypothesis is established as a corollary of general central limit theorems (CLT) for the linear statistics of classical and regularized sample canonical correlation coefficients when $p_1$ and $p_2$ are both comparable to the sample size $n$. As applications of the developed independence test, various types of dependent structures, such as factor models, ARCH models and a general uncorrelated but dependent case, are investigated by simulations. As an empirical application, cross-sectional dependence of daily stock returns of companies between different sections in the New York Stock Exchange (NYSE) is detected by the proposed test.
Independence screening methods such as the two-sample $t$-test and marginal correlation based ranking are among the most widely used techniques for variable selection in ultrahigh dimensional data sets. In this short note, simple examples are used to demonstrate potential problems with independence screening methods in the presence of correlated predictors. Also, an example is considered where all important variables are independent among themselves and all but one of the important variables are independent of the unimportant variables. Furthermore, a real data example from a genome-wide association study is used to illustrate the inferior performance of marginal correlation screening compared to another screening method.
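
The marginal correlation ranking mentioned in the last abstract above, and the kind of difficulty it can run into with correlated predictors, can be sketched as follows. This is an illustrative toy example under assumptions made here, not an example taken from that note: the function name `marginal_correlation_screening` and the simulated design (two correlated predictors whose effects cancel marginally) are hypothetical.

```python
import numpy as np


def marginal_correlation_screening(X, y, k):
    """Rank columns of X by |Pearson correlation with y| and keep the top k.

    Returns the indices of the k top-ranked columns and all marginal correlations.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)   # center each predictor
    yc = y - y.mean()         # center the response
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-np.abs(corr))  # largest absolute correlation first
    return order[:k], corr


# Assumed toy design: predictors 0 and 1 have correlation rho and both enter
# the model, but the coefficient of predictor 1 is chosen so that its marginal
# covariance with y is exactly zero in the population.
rng = np.random.default_rng(0)
n, p, rho = 2000, 10, 0.5
X = rng.normal(size=(n, p))
X[:, 1] = rho * X[:, 0] + np.sqrt(1 - rho ** 2) * X[:, 1]
y = X[:, 0] - rho * X[:, 1] + 0.1 * rng.normal(size=n)

selected, corr = marginal_correlation_screening(X, y, k=2)
print("selected columns:", selected)
print("marginal correlations:", np.round(corr, 3))
```

In this design predictor 1 has a nonzero regression coefficient, yet its marginal correlation with `y` is near zero, so it competes with the pure-noise columns for a place in the ranking and may be screened out.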