D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multiple High-dimensional Datasets

Published by Hai Shu

Publication date: 2020

Research field: Mathematical Statistics, Informatics Engineering

Paper language: English

Authors: Hai Shu - Zhe Qu - Hongtu Zhu

Machine learning

Abstract

Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A popular model for the joint analysis of multi-type datasets decomposes each data matrix into a low-rank common-variation matrix generated by latent factors shared across all datasets, a low-rank distinctive-variation matrix corresponding to each dataset, and an additive noise matrix. We propose decomposition-based generalized canonical correlation analysis (D-GCCA), a novel decomposition method that appropriately defines those matrices on the L2 space of random variables, whereas most existing methods are developed on its approximation, the Euclidean dot product space. Moreover, to calibrate the common latent factors well, we impose a desirable orthogonality constraint on the distinctive latent factors. Existing methods inadequately consider such orthogonality and can thus suffer from substantial loss of undetected common variation. Our D-GCCA goes one step further than GCCA by separating common and distinctive variations among canonical variables, and enjoys an appealing interpretation from the perspective of principal component analysis. Consistent estimators of our common-variation and distinctive-variation matrices are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation, especially for large-scale datasets. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
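The decomposition model described in the abstract can be written compactly as follows. The symbols here (K datasets, matrices Y_k, C_k, D_k, E_k, and dimensions p_k and n) are notation assumed for illustration; the abstract itself does not fix any notation.

```latex
% Assumed notation: K datasets observed on the same n objects;
% the k-th dataset is a p_k x n matrix Y_k.
\[
  \mathbf{Y}_k \;=\; \mathbf{C}_k \;+\; \mathbf{D}_k \;+\; \mathbf{E}_k ,
  \qquad k = 1, \dots, K,
\]
% C_k: low-rank common-variation matrix, generated by latent factors
%      shared across all K datasets
% D_k: low-rank distinctive-variation matrix, specific to dataset k
% E_k: additive noise matrix
```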

358 - Cencheng Shen, Ming Sun, Minh Tang 2013

For multiple multivariate data sets, we derive conditions under which Generalized Canonical Correlation Analysis (GCCA) improves classification performance of the projected datasets, compared to standard Canonical Correlation Analysis (CCA) using only two data sets. We illustrate our theoretical results with simulations and a real data experiment.

Machine learning

CDPA: Common and Distinctive Pattern Analysis between High-dimensional Datasets

242 - Hai Shu, Zhe Qu 2019

A representative model in integrative analysis of two high-dimensional correlated datasets is to decompose each data matrix into a low-rank common matrix generated by latent factors shared across datasets, a low-rank distinctive matrix corresponding to each dataset, and an additive noise matrix. Existing decomposition methods claim that their common matrices capture the common pattern of the two datasets. However, their so-called common pattern only denotes the common latent factors but ignores the common pattern between the two coefficient matrices of these common latent factors. We propose a new unsupervised learning method, called the common and distinctive pattern analysis (CDPA), which appropriately defines the two types of data patterns by further incorporating the common and distinctive patterns of the coefficient matrices. A consistent estimation approach is developed for high-dimensional settings, and shows reasonably good finite-sample performance in simulations. Our simulation studies and real data analysis corroborate that the proposed CDPA can provide better characterization of common and distinctive patterns and thereby benefit data mining.
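As a rough illustration of the generic model in the first sentence of this abstract (low-rank common part, low-rank distinctive part, additive noise), the sketch below simulates two such data matrices with NumPy. It does not implement CDPA or its estimators; all dimensions, ranks, the noise scale, and the helper name `simulate` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 100, 30, 40      # samples and feature counts (illustrative values)
r_common, r_dist = 2, 3      # assumed ranks of the common and distinctive parts

# Latent factors: one set shared by both datasets, one set per dataset.
f_common = rng.normal(size=(r_common, n))
f_dist = [rng.normal(size=(r_dist, n)) for _ in range(2)]

def simulate(p, f_d):
    """Hypothetical helper: build one p x n data matrix as C + D + E."""
    B_c = rng.normal(size=(p, r_common))   # coefficients of the common factors
    B_d = rng.normal(size=(p, r_dist))     # coefficients of the distinctive factors
    C = B_c @ f_common                     # low-rank common matrix
    D = B_d @ f_d                          # low-rank distinctive matrix
    E = 0.1 * rng.normal(size=(p, n))      # additive noise
    return C + D + E, C, D

Y1, C1, D1 = simulate(p1, f_dist[0])
Y2, C2, D2 = simulate(p2, f_dist[1])
```

Here both common matrices are driven by the same `f_common`, which is what makes them "common"; the coefficient matrices `B_c` differ between datasets, and it is the pattern shared between those coefficient matrices that the abstract says existing methods ignore.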

Machine learning

Canonical-Correlation-Based Fast Feature Selection

127 - Sikai Zhang, Tingna Wang, Keith Worden 2021

This paper proposes a canonical-correlation-based filter method for feature selection. The sum of squared canonical correlation coefficients is adopted as the feature ranking criterion. The proposed method boosts the computational speed of the ranking criterion in greedy search. The supporting theorems developed for the feature selection method are fundamental to the understanding of the canonical correlation analysis. In empirical studies, a synthetic dataset is used to demonstrate the speed advantage of the proposed method, and eight real datasets are applied to show the effectiveness of the proposed feature ranking criterion in both classification and regression. The results show that the proposed method is considerably faster than the definition-based method, and the proposed ranking criterion is competitive compared with the seven mutual-information-based criteria.

Machine learning
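To make the ranking criterion concrete, here is a minimal sketch of the slow, definition-based approach: compute the sum of squared canonical correlations directly and use it in a greedy forward search. This is not the paper's accelerated algorithm. The function names, the QR-plus-SVD route to the canonical correlations, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sum_squared_canonical_corr(X, Y):
    """Sum of squared canonical correlations between the columns of X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the centered column spaces (assumes full column rank).
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    rho = np.clip(rho, 0.0, 1.0)   # guard against rounding slightly above 1
    return float(np.sum(rho ** 2))

def greedy_select(X, Y, k):
    """Greedy forward search: add the feature that most increases the criterion."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        scores = [sum_squared_canonical_corr(X[:, selected + [j]], Y)
                  for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic example: the response depends only on features 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
Y = (2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=200)).reshape(-1, 1)
chosen = greedy_select(X, Y, 2)
```

Each greedy step recomputes the criterion from scratch for every candidate feature, which is the cost the paper says its method reduces.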

The Labeled Multiple Canonical Correlation Analysis for Information Fusion

102 - Lei Gao, Rui Zhang, Lin Qi 2021

The objective of multimodal information fusion is to mathematically analyze information carried in different sources and create a new representation which will be more effectively utilized in pattern recognition and other multimedia information processing tasks. In this paper, we introduce a new method for multimodal information fusion and representation based on the Labeled Multiple Canonical Correlation Analysis (LMCCA). By incorporating class label information of the training samples, the proposed LMCCA ensures that the fused features carry discriminative characteristics of the multimodal information representations, and are capable of providing superior recognition performance. We implement a prototype of LMCCA to demonstrate its effectiveness on handwritten digit recognition, face recognition, and object recognition utilizing multiple features, and on bimodal human emotion recognition involving information from both audio and visual domains. The generic nature of LMCCA allows it to take as input features extracted by any means, including those by deep learning (DL) methods. Experimental results show that the proposed method enhanced the performance of both statistical machine learning (SML) methods and methods based on DL.

Computer vision and pattern recognition, Machine learning

Ensembled sparse-input hierarchical networks for high-dimensional datasets

67 - Jean Feng, Noah Simon 2020

Neural networks have seen limited use in prediction for high-dimensional data with small sample sizes, because they tend to overfit and require tuning many more hyperparameters than existing off-the-shelf machine learning methods. With small modifications to the network architecture and training procedure, we show that dense neural networks can be a practical data analysis tool in these settings. The proposed method, Ensemble by Averaging Sparse-Input Hierarchical networks (EASIER-net), appropriately prunes the network structure by tuning only two L1-penalty parameters, one that controls the input sparsity and another that controls the number of hidden layers and nodes. The method selects variables from the true support if the irrelevant covariates are only weakly correlated with the response; otherwise, it exhibits a grouping effect, where strongly correlated covariates are selected at similar rates. On a collection of real-world datasets with different sizes, EASIER-net selected network architectures in a data-adaptive manner and achieved higher prediction accuracy than off-the-shelf methods on average.

Machine learning, Methodology