Principal Component Analysis (PCA) is a common multivariate statistical analysis method, and Probabilistic Principal Component Analysis (PPCA) is its probabilistic reformulation within the framework of a Gaussian latent variable model. To improve the robustness of PPCA, it has been proposed to replace the underlying Gaussian distributions with multivariate $t$-distributions. Based on the representation of the $t$-distribution as a scale mixture of Gaussians, a hierarchical model is used for implementation. However, although the robust PPCA methods work reasonably well in some simulation studies and on real data, the hierarchical model as implemented does not yield an equivalent interpretation. In this paper, we present a set of equivalence relationships between these models, and discuss the performance of robust PPCA methods using different multivariate $t$-distributed structures through several simulation studies. In doing so, we clarify a current misrepresentation in the literature, and make connections between a set of hierarchical models for robust PPCA.
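The scale-mixture representation mentioned above can be illustrated with a minimal sketch. It assumes the standard construction $x = \mu + z/\sqrt{u}$ with $z \sim N(0, \Sigma)$ and $u \sim \mathrm{Gamma}(\nu/2, \text{rate } \nu/2)$; the function name `sample_multivariate_t` and its arguments are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_multivariate_t(n, mu, sigma, nu, rng):
    """Draw n samples from a multivariate t via a Gaussian scale mixture.

    Illustrative sketch only: one shared mixing variable u per sample,
    so all coordinates share the same degrees of freedom nu.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    # u ~ Gamma(shape=nu/2, rate=nu/2); NumPy parameterises by scale = 1/rate.
    u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    # z ~ N(0, sigma), one row per sample.
    z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    # Dividing by sqrt(u) inflates the Gaussian draw and yields heavy tails.
    return mu + z / np.sqrt(u)[:, None]

rng = np.random.default_rng(0)
x = sample_multivariate_t(20000, [1.0, -1.0], np.eye(2), 1e6, rng)
```

For very large `nu` the mixing variable concentrates at 1, so the draws are approximately Gaussian; small `nu` produces occasional large outliers.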
Models based on multivariate t distributions are widely applied to analyze data with heavy tails. However, all the marginal distributions of the multivariate t distributions are restricted to have the same degrees of freedom, making these models unable to describe different marginal heavy-tailedness. We generalize the traditional multivariate t distributions to non-elliptically contoured multivariate t distributions, allowing for different marginal degrees of freedom. We apply the non-elliptically contoured multivariate t distributions to three widely-used models: the Heckman selection model with different degrees of freedom for selection and outcome equations, the multivariate Robit model with different degrees of freedom for marginal responses, and the linear mixed-effects model with different degrees of freedom for random effects and within-subject errors. Based on the Normal mixture representation of our t distribution, we propose efficient Bayesian inferential procedures for the model parameters based on data augmentation and parameter expansion. We show via simulation studies and real examples that the conclusions are sensitive to the existence of different marginal heavy-tailedness.
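To make the idea of different marginal degrees of freedom concrete, the sketch below gives each coordinate its own independent mixing variable. This is an assumed construction chosen for illustration, not necessarily the authors' definition; the name `sample_t_different_dof` and its parameters are hypothetical.

```python
import numpy as np

def sample_t_different_dof(n, sigma, nus, rng):
    """Draw n samples whose j-th margin is a scaled t with nus[j] dof.

    Assumed construction: x_j = z_j / sqrt(q_j), with z ~ N(0, sigma)
    and independent q_j ~ chi2(nus[j]) / nus[j] for each coordinate.
    """
    sigma = np.asarray(sigma, dtype=float)
    nus = np.asarray(nus, dtype=float)
    d = sigma.shape[0]
    # Correlated Gaussian core, shape (n, d).
    z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    # One mixing variable per sample and coordinate; nus broadcasts over columns.
    q = rng.chisquare(nus, size=(n, d)) / nus
    return z / np.sqrt(q)

rng = np.random.default_rng(1)
# First margin is Cauchy-like (1 dof), second is nearly Gaussian.
x = sample_t_different_dof(20000, np.eye(2), [1.0, 1e6], rng)
```

Because the mixing variables differ across coordinates, the joint density is no longer elliptically contoured, which matches the qualitative property described above.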
High-dimensional data has introduced challenges that are difficult to address with classical approaches to statistical process control, which has made it a topic of research interest in recent years. However, in many cases, such as advanced manufacturing systems, data sets have underlying structures. If these structures are extracted correctly, efficient methods for process control can be developed. This paper proposes a robust sparse dimensionality reduction approach for monitoring correlated high-dimensional processes to address these issues. The developed monitoring technique uses robust sparse probabilistic PCA to reduce the dimensionality of the data stream while retaining interpretability. The proposed methodology uses Bayesian variational inference to obtain the estimates of a probabilistic representation of PCA. Simulation studies were conducted to verify the efficacy of the proposed methodology. Furthermore, we conducted a case study on change detection for in-line Raman spectroscopy to validate the efficiency of our proposed method in a practical scenario.
Traditional principal component analysis (PCA) is well known in high-dimensional data analysis, but it requires the data to be expressed as a matrix of continuous observations. To overcome these limitations, a new method called flexible PCA (FPCA) for exponential family distributions is proposed. The goal is to ensure that it can be applied to arbitrarily shaped regions for either count or continuous observations. The methodology of FPCA is developed under the framework of generalized linear models. It provides statistical models for FPCA that are not limited to matrix expressions of the data. A maximum likelihood approach is proposed to derive the decomposition when the number of principal components (PCs) is known. This naturally induces a penalized likelihood approach to determine the number of PCs when it is unknown. After modification for missing data problems, the proposed method is compared with previous PCA methods for missing data. In the simulation study, FPCA always performs better than its competitors. In the application, the proposed method is used to reduce the dimensionality of arbitrarily shaped sub-regions of images and of the global spread patterns of COVID-19, under normal and Poisson distributions, respectively.
Sparse Principal Component Analysis (SPCA) is widely used in data processing and dimension reduction; it uses the lasso to produce modified principal components with sparse loadings for better interpretability. However, sparse PCA does not consider an additional grouping structure in which the loadings share similar coefficients (i.e., feature grouping), beyond the special group in which all coefficients are zero (i.e., feature selection). In this paper, we propose a novel method called Feature Grouping and Sparse Principal Component Analysis (FGSPCA), which allows the loadings to belong to disjoint homogeneous groups, with sparsity as a special case. The proposed FGSPCA is a subspace learning method designed to simultaneously perform grouping pursuit and feature selection by imposing a non-convex regularization with naturally adjustable sparsity and grouping effect. To solve the resulting non-convex optimization problem, we propose an alternating algorithm that incorporates difference-of-convex programming, the augmented Lagrangian method, and coordinate descent. Additionally, the experimental results on real data sets show that the proposed FGSPCA benefits from the grouping effect compared with methods without a grouping effect.
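As a small illustration of how a lasso penalty yields exactly zero loadings, the sketch below implements soft-thresholding, the proximal operator of the $\ell_1$ norm that appears in coordinate-descent lasso updates. It shows only the general lasso mechanism, not the non-convex FGSPCA penalty; the function name `soft_threshold` and the example values are chosen here for illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each entry of x towards zero by lam; entries within lam become 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Hypothetical loading vector: small entries are zeroed, large ones shrunk.
loadings = np.array([3.0, -0.5, 1.0, -2.0])
sparse_loadings = soft_threshold(loadings, 1.0)
```

Entries whose magnitude does not exceed the threshold are set to zero, which is the source of sparsity; a grouping penalty would additionally pull the surviving entries towards shared values.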
Sparse principal component analysis (PCA) is a popular tool for dimension reduction of high-dimensional data. Despite its massive popularity, there is still a lack of a theoretically justifiable Bayesian sparse PCA method that is computationally scalable. A major challenge is choosing a suitable prior for the loadings matrix, as principal components are mutually orthogonal. We propose a spike and slab prior that meets this orthogonality constraint and show that the posterior enjoys both theoretical and computational advantages. Two computational algorithms, the PX-CAVI and the PX-EM algorithms, are developed. Both algorithms use parameter expansion to deal with the orthogonality constraint and to accelerate convergence. We found that the PX-CAVI algorithm has better empirical performance than the PX-EM algorithm and two other penalty methods for sparse PCA. The PX-CAVI algorithm is then applied to study a lung cancer gene expression dataset. The $\mathsf{R}$ package $\mathsf{VBsparsePCA}$, with an implementation of the algorithm, is available on The Comprehensive R Archive Network.