A simple and fast analysis method to sort large data sets into groups with shared distinguishing characteristics is described and applied to single-molecule break junction conductance versus electrode displacement data. The method, based on principal component analysis, successfully sorted data sets according to the projection of the data onto the first or second principal component of the correlation matrix, without the need to assert any specific hypothesis about the expected features within the data. This improves on the current correlation matrix analysis approach because the sorting is automatic, making it more objective and less time-consuming, and the method is applicable to a wide range of multivariate data sets. Here the method is demonstrated on two systems. First, it was applied to mixtures of two molecules with identical anchor groups and similar lengths, but with either a $\pi$ (high conductance) or $\sigma$ (low conductance) bridge. The mixed data were automatically sorted into two groups, each containing one molecule or the other. Second, it was applied to break junction data measured with the $\pi$-bridged molecule alone. Again the method distinguished two groups, which were tentatively assigned to different geometries of the molecule in the junction.
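As a concrete illustration of this sorting idea, the following is a minimal sketch, assuming each trace has been binned into a fixed-length conductance histogram; the array names and synthetic data are illustrative, not from the original work.

```python
import numpy as np

def sort_traces_by_pc(histograms):
    """Split traces into two groups by the sign of their projection onto
    the first principal component of the bin-bin correlation matrix."""
    # Standardize each bin across traces so that the covariance of the
    # standardized data equals the correlation matrix.
    mean = histograms.mean(axis=0)
    std = histograms.std(axis=0)
    std[std == 0] = 1.0                      # guard against empty bins
    z = (histograms - mean) / std
    corr = (z.T @ z) / (len(z) - 1)          # bin-bin correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    pc1 = eigvecs[:, -1]                     # first principal component
    scores = z @ pc1                         # projection of each trace
    return scores >= 0                       # boolean group labels

# Example: 1000 synthetic "traces", 64 conductance bins each.
rng = np.random.default_rng(0)
histograms = rng.poisson(5.0, size=(1000, 64)).astype(float)
groups = sort_traces_by_pc(histograms)
print(groups.sum(), "traces in group A,", (~groups).sum(), "in group B")
```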
Single-molecule break junction measurements deliver a huge number of conductance versus electrode separation traces. In the course of such measurements the target molecules may bind to the electrodes in different geometries, and the evolution and rupture of the single-molecule junction may also follow distinct trajectories. Unraveling the various typical trace classes is a prerequisite for the proper physical interpretation of the data. Here we exploit the efficient feature recognition properties of neural networks to automatically find the relevant trace classes. To eliminate the need for manually labeled training data we apply a combined method which automatically selects training traces according to the extreme values of principal component projections or of some auxiliary measured quantities; the network then captures the features of these characteristic traces and generalizes its inference to the entire dataset. The use of a simple neural network structure also enables direct insight into the decision-making mechanism. We demonstrate that this combined machine learning method is efficient in the unsupervised recognition of non-obvious but highly relevant trace classes within low- and room-temperature gold-4,4'-bipyridine-gold single-molecule break junction data.
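A minimal sketch of the combined scheme described above, assuming a single-layer logistic classifier as the "simple neural network" and synthetic feature vectors in place of measured traces; the exact architecture and selection thresholds of the original work may differ.

```python
import numpy as np

def pca_scores(X):
    """Projection of each row onto the top principal component."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]

def train_logistic(X, y, lr=0.1, epochs=500):
    """Gradient descent on a one-layer logistic classifier."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))  # sigmoid
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 32))                   # stand-in trace features
s = pca_scores(X)
lo, hi = np.quantile(s, [0.1, 0.9])
train_idx = np.where((s <= lo) | (s >= hi))[0]    # extreme projections only
y_train = (s[train_idx] >= hi).astype(float)      # automatic labels
w, b = train_logistic(X[train_idx], y_train)
labels = (X @ w + b) >= 0                         # generalize to all traces
print("class sizes:", labels.sum(), (~labels).sum())
```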
Cryo-electron microscopy nowadays often requires the analysis of hundreds of thousands of 2D images as large as a few hundred pixels in each direction. Here we introduce an algorithm that efficiently and accurately performs principal component analysis (PCA) for a large set of two-dimensional images, and, for each image, the set of its uniform rotations in the plane and their reflections. For a dataset consisting of $n$ images of size $L \times L$ pixels, the computational complexity of our algorithm is $O(nL^3 + L^4)$, while existing algorithms take $O(nL^4)$. The new algorithm computes the expansion coefficients of the images in a Fourier-Bessel basis efficiently using the non-uniform fast Fourier transform. We compare the accuracy and efficiency of the new algorithm with traditional PCA and existing algorithms for steerable PCA.
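To illustrate the structural fact that such algorithms exploit, the sketch below uses plain polar resampling and an angular FFT rather than the paper's Fourier-Bessel basis and non-uniform FFT: once images are expressed in a steerable basis, the covariance of the rotation-augmented dataset is block-diagonal in angular frequency, so steerable PCA reduces to small per-frequency eigenproblems. This is only an illustration of the block structure, not the paper's algorithm.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def polar_resample(img, n_r=16, n_t=32):
    """Bilinearly resample a square image onto an (n_r, n_t) polar grid."""
    L = img.shape[0]; c = (L - 1) / 2.0
    r = np.linspace(0, c, n_r)
    t = np.linspace(0, 2 * np.pi, n_t, endpoint=False)
    rr, tt = np.meshgrid(r, t, indexing="ij")
    coords = np.stack([c + rr * np.sin(tt), c + rr * np.cos(tt)])
    return map_coordinates(img, coords, order=1)

rng = np.random.default_rng(2)
imgs = rng.normal(size=(200, 33, 33))              # stand-in image stack
polar = np.stack([polar_resample(im) for im in imgs])
fhat = np.fft.fft(polar, axis=2)                   # FFT along the angle

# One small covariance eigenproblem per angular frequency block.
for k in range(3):                                 # first few frequencies
    block = fhat[:, :, k]                          # shape (n_images, n_r)
    cov = block.conj().T @ block / len(block)      # Hermitian block covariance
    eigvals = np.linalg.eigvalsh(cov)
    print(f"frequency {k}: top eigenvalue {eigvals[-1]:.3f}")
```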
We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression. To achieve our results, we first observe that ridge regression can be used to obtain a smooth projection onto the top principal components. We then sharpen this approximation to the true projection using a low-degree polynomial approximation to the matrix step function. Matrix step function approximation is a topic of long-standing interest in scientific computing. We extend prior theory by constructing polynomials with a simple iterative structure and rigorously analyzing their behavior under limited precision.
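The following sketch illustrates the central observation, under the assumption that the black-box ridge routine is a dense solve: one ridge call applies the operator $S = (A^\top A + \lambda I)^{-1} A^\top A$, whose eigenvalues $\sigma^2/(\sigma^2+\lambda)$ form a smooth step around the threshold $\lambda$, and a low-degree polynomial in $S$ (here a single smoothstep $g(s) = 3s^2 - 2s^3$, far cruder than the paper's construction) sharpens this toward the true projection.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(200, 50))
AtA = A.T @ A
w_all = np.sort(np.linalg.eigvalsh(AtA))
lam = 0.5 * (w_all[-10] + w_all[-11])   # threshold between 10th/11th eigenvalues

def ridge_apply(v):
    """Black-box ridge call: returns (A^T A + lam I)^{-1} A^T A v."""
    return np.linalg.solve(AtA + lam * np.eye(AtA.shape[0]), AtA @ v)

x = rng.normal(size=50)
s1 = ridge_apply(x)                 # S x     (smooth projection)
s2 = ridge_apply(s1)                # S^2 x
s3 = ridge_apply(s2)                # S^3 x
approx = 3 * s2 - 2 * s3            # g(S) x, a sharper approximation

# Exact projection onto the top eigenvectors, for comparison only.
w, V = np.linalg.eigh(AtA)
P = V[:, w > lam] @ V[:, w > lam].T
print("smooth error:   ", np.linalg.norm(s1 - P @ x))
print("sharpened error:", np.linalg.norm(approx - P @ x))
```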
Dimension reduction for high-dimensional compositional data plays an important role in many fields, where the principal component analysis of the basis covariance matrix is of scientific interest. In practice, however, the basis variables are latent and rarely observed, and standard techniques of principal component analysis are inadequate for compositional data because of the simplex constraint. To address this challenging problem, we relate the principal subspace of the centered log-ratio compositional covariance to that of the basis covariance, and prove that the latter is approximately identifiable with diverging dimensionality under a subspace sparsity assumption. This interesting blessing-of-dimensionality phenomenon enables us to propose principal subspace estimation methods based on the sample centered log-ratio covariance. We also derive nonasymptotic error bounds for the subspace estimators, which exhibit a tradeoff between identification and estimation. Moreover, we develop efficient proximal alternating direction method of multipliers algorithms to solve the nonconvex and nonsmooth optimization problems. Simulation results demonstrate that the proposed methods perform as well as the oracle methods with known basis. Their usefulness is illustrated through an analysis of word usage patterns among statisticians.
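A minimal sketch of the basic estimation step, assuming fully observed compositions and omitting the sparsity-regularized ADMM refinement: apply the centered log-ratio transform, form the sample covariance, and take its top eigenvectors as the principal subspace estimate. Names and synthetic data are illustrative.

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform of compositions (rows sum to 1)."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def principal_subspace(X, k):
    Z = clr(X)
    S = np.cov(Z, rowvar=False)            # sample clr covariance
    w, V = np.linalg.eigh(S)               # eigenvalues in ascending order
    return V[:, -k:]                       # top-k eigenvectors

rng = np.random.default_rng(4)
W = rng.gamma(shape=2.0, size=(500, 100))  # latent positive basis variables
X = W / W.sum(axis=1, keepdims=True)       # observed compositions on the simplex
U = principal_subspace(X, k=3)
print(U.shape)                             # (100, 3) orthonormal subspace estimate
```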
Principal component analysis (PCA) is an important tool for exploring data. The conventional approach to PCA leads to a solution which favours structures with large variances; this is sensitive to outliers and can obscure interesting underlying structures. One of the equivalent definitions of PCA is that it seeks the subspaces that maximize the sum of squared pairwise distances between data projections. This definition opens up more flexibility in the analysis of principal components, which is useful in enhancing PCA. In this paper we introduce scales into PCA by maximizing only the sum of pairwise distances between projections for pairs of data points whose distances lie within a chosen interval $[l, u]$. The resulting principal component decompositions in Multiscale PCA depend on the point $(l, u)$ in the plane, and for each point we define projectors onto the principal components. Cluster analysis of these projectors reveals the structures in the data at various scales. Each structure is described by the eigenvectors at the medoid point of the cluster representing that structure. We also use the distortion of projections as a criterion for choosing an appropriate scale, especially for data with outliers. The method was tested on both artificial data distributions and real data. For data with multiscale structures, the method was able to reveal the different structures of the data and to reduce the effect of outliers in the principal component analysis.
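A minimal sketch of the scale-restricted objective, assuming Euclidean distances: restricting the pairwise-distance formulation to pairs within $[l, u]$ amounts to an eigendecomposition of the scatter matrix $\sum (x_i - x_j)(x_i - x_j)^\top$ over the retained pairs only. The projector clustering and medoid analysis are omitted; names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def multiscale_pca(X, l, u, k=2):
    """Top-k components maximizing summed squared projection distances
    over pairs whose original distance lies in [l, u]."""
    D = squareform(pdist(X))                     # pairwise distances
    idx_i, idx_j = np.triu_indices(len(X), k=1)
    mask = (D[idx_i, idx_j] >= l) & (D[idx_i, idx_j] <= u)
    M = np.zeros((X.shape[1], X.shape[1]))
    for i, j in zip(idx_i[mask], idx_j[mask]):
        d = X[i] - X[j]
        M += np.outer(d, d)                      # restricted pairwise scatter
    w, V = np.linalg.eigh(M)
    return V[:, -k:]                             # scale-dependent components

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
V_small = multiscale_pca(X, l=0.0, u=1.0)        # fine-scale structure
V_large = multiscale_pca(X, l=3.0, u=np.inf)     # coarse-scale structure
print(V_small.shape, V_large.shape)
```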