No Arabic abstract
This report concerns the problem of dimensionality reduction through information geometric methods on statistical manifolds. While there has been considerable work recently presented regarding dimensionality reduction for the purposes of learning tasks such as classification, clustering, and visualization, these methods have focused primarily on Riemannian manifolds in Euclidean space. While sufficient for many applications, there are many high-dimensional signals which have no straightforward and meaningful Euclidean representation. In these cases, signals may be more appropriately represented as a realization of some distribution lying on a statistical manifold, or a manifold of probability density functions (PDFs). We present a framework for dimensionality reduction that uses information geometry for both statistical manifold reconstruction as well as dimensionality reduction in the data domain.
Existing dimensionality reduction methods are adept at revealing hidden underlying manifolds arising from high-dimensional data and thereby producing a low-dimensional representation. However, the smoothness of the manifolds produced by classic techniques over sparse and noisy data is not guaranteed. In fact, the embedding generated using such data may distort the geometry of the manifold and thereby produce an unfaithful embedding. Herein, we propose a framework for nonlinear dimensionality reduction that generates a manifold in terms of smooth geodesics that is designed to treat problems in which manifold measurements are either sparse or corrupted by noise. Our method generates a network structure for given high-dimensional data using a nearest neighbors search and then produces piecewise linear shortest paths that are defined as geodesics. Then, we fit points in each geodesic by a smoothing spline to emphasize the smoothness. The robustness of this approach for sparse and noisy datasets is demonstrated by the implementation of the method on synthetic and real-world datasets.
Molecular dynamics (MD) simulations have been widely applied to study macromolecules including proteins. However, high-dimensionality of the datasets produced by simulations makes it difficult for thorough analysis, and further hinders a deeper understanding of biomacromolecules. To gain more insights into the protein structure-function relations, appropriate dimensionality reduction methods are needed to project simulations onto low-dimensional spaces. Linear dimensionality reduction methods, such as principal component analysis (PCA) and time-structure based independent component analysis (t-ICA), could not preserve sufficient structural information. Though better than linear methods, nonlinear methods, such as t-distributed stochastic neighbor embedding (t-SNE), still suffer from the limitations in avoiding system noise and keeping inter-cluster relations. ivis is a novel deep learning-based dimensionality reduction method originally developed for single-cell datasets. Here we applied this framework for the study of light, oxygen and voltage (LOV) domain of diatom Phaeodactylum tricornutum aureochrome 1a (PtAu1a). Compared with other methods, ivis is shown to be superior in constructing Markov state model (MSM), preserving information of both local and global distances and maintaining similarity between high dimension and low dimension with the least information loss. Moreover, ivis framework is capable of providing new prospective for deciphering residue-level protein allostery through the feature weights in the neural network. Overall, ivis is a promising member in the analysis toolbox for proteins.
Spectral dimensionality reduction methods enable linear separations of complex data with high-dimensional features in a reduced space. However, these methods do not always give the desired results due to irregularities or uncertainties of the data. Thus, we consider aggressively modifying the scales of the features to obtain the desired classification. Using prior knowledge on the labels of partial samples to specify the Fiedler vector, we formulate an eigenvalue problem of a linear matrix pencil whose eigenvector has the feature scaling factors. The resulting factors can modify the features of entire samples to form clusters in the reduced space, according to the known labels. In this study, we propose new dimensionality reduction methods supervised using the feature scaling associated with the spectral clustering. Numerical experiments show that the proposed methods outperform well-established supervised methods for toy problems with more samples than features, and are more robust regarding clustering than existing methods. Also, the proposed methods outperform existing methods regarding classification for real-world problems with more features than samples of gene expression profiles of cancer diseases. Furthermore, the feature scaling tends to improve the clustering and classification accuracies of existing unsupervised methods, as the proportion of training data increases.
Grassmann manifolds have been widely used to represent the geometry of feature spaces in a variety of problems in medical imaging and computer vision including but not limited to shape analysis, action recognition, subspace clustering and motion segmentation. For these problems, the features usually lie in a very high-dimensional Grassmann manifold and hence an appropriate dimensionality reduction technique is called for in order to curtail the computational burden. To this end, the Principal Geodesic Analysis (PGA), a nonlinear extension of the well known principal component analysis, is applicable as a general tool to many Riemannian manifolds. In this paper, we propose a novel framework for dimensionality reduction of data in Riemannian homogeneous spaces and then focus on the Grassman manifold which is an example of a homogeneous space. Our framework explicitly exploits the geometry of the homogeneous space yielding reduced dimensional nested sub-manifolds that need not be geodesic submanifolds and thus are more expressive. Specifically, we project points in a Grassmann manifold to an embedded lower dimensional Grassmann manifold. A salient feature of our method is that it leads to higher expressed variance compared to PGA which we demonstrate via synthetic and real data experiments.
This is a tutorial and survey paper on unification of spectral dimensionality reduction methods, kernel learning by Semidefinite Programming (SDP), Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE), and its variants. We first explain how the spectral dimensionality reduction methods can be unified as kernel Principal Component Analysis (PCA) with different kernels. This unification can be interpreted as eigenfunction learning or representation of kernel in terms of distance matrix. Then, since the spectral methods are unified as kernel PCA, we say let us learn the best kernel for unfolding the manifold of data to its maximum variance. We first briefly introduce kernel learning by SDP for the transduction task. Then, we explain MVU in detail. Vario