ﻻ يوجد ملخص باللغة العربية
VARCLUST algorithm is proposed for clustering variables under the assumption that variables in a given cluster are linear combinations of a small number of hidden latent variables, corrupted by the random noise. The entire clustering task is viewed as the problem of selection of the statistical model, which is defined by the number of clusters, the partition of variables into these clusters and the cluster dimensions, i.e. the vector of dimensions of linear subspaces spanning each of the clusters. The optimal model is selected using the approximate Bayesian criterion based on the Laplace approximations and using a non-informative uniform prior on the number of clusters. To solve the problem of the search over a huge space of possible models we propose an extension of the ClustOfVar algorithm which was dedicated to subspaces of dimension only 1, and which is similar in structure to the $K$-centroid algorithm. We provide a complete methodology with theoretical guarantees, extensive numerical experimentations, complete data analyses and implementation. Our algorithm assigns variables to appropriate clusterse based on the consistent Bayesian Information Criterion (BIC), and estimates the dimensionality of each cluster by the PEnalized SEmi-integrated Likelihood Criterion (PESEL), whose consistency we prove. Additionally, we prove that each iteration of our algorithm leads to an increase of the Laplace approximation to the model posterior probability and provide the criterion for the estimation of the number of clusters. Numerical comparisons with other algorithms show that VARCLUST may outperform some popular machine learning tools for sparse subspace clustering. We also report the results of real data analysis including TCGA breast cancer data and meteorological data. The proposed method is implemented in the publicly available R package varclust.
When working with large biological data sets, exploratory analysis is an important first step for understanding the latent structure and for generating hypotheses to be tested in subsequent analyses. However, when the number of variables is large com
The vast majority of Dimensionality Reduction (DR) techniques rely on second-order statistics to define their optimization objective. Even though this provides adequate results in most cases, it comes with several shortcomings. The methods require ca
We show how to approximate a data matrix $mathbf{A}$ with a much smaller sketch $mathbf{tilde A}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+epsilon)$ error. Importantly, this class of problem
Existing dimensionality reduction methods are adept at revealing hidden underlying manifolds arising from high-dimensional data and thereby producing a low-dimensional representation. However, the smoothness of the manifolds produced by classic techn
The R package CVEK introduces a suite of flexible machine learning models and robust hypothesis tests for learning the joint nonlinear effects of multiple covariates in limited samples. It implements the Cross-validated Ensemble of Kernels (CVEK)(Liu