Methods for global measurement of transcript abundance, such as microarrays and RNA-seq, generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been stunningly successful, but some important issues remain. These methods can fail to select the correct features, predict poorly relative to non-sparse alternatives, or ignore unknown grouping structures among the features. We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks, including regression and classification, especially in the typical omics context of correlated features. SuffPCR first estimates sparse principal components and then estimates a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic datasets, performing nearly optimally when the model assumptions are satisfied. We also establish near-optimal theoretical guarantees.
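The two-stage idea behind SuffPCR (sparse principal components, then a linear model on the recovered subspace) can be illustrated with a minimal sketch. This is not the authors' algorithm: the diagonal (variance) thresholding used here as a stand-in for sparse principal component estimation, and the toy data-generating setting, are simplifications for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n << p, with a single latent factor loading on the first
# 10 features, and a response driven by that factor.
n, p, k = 80, 200, 1
f = rng.standard_normal((n, 1))
X = 0.2 * rng.standard_normal((n, p))
X[:, :10] += f                              # only these features carry signal
y = 3.0 * f[:, 0] + 0.1 * rng.standard_normal(n)

# Stage 1: a diagonal-thresholding stand-in for sparse PCA -- keep
# high-variance features, then take the top-k principal components of
# the reduced matrix, so the loadings are sparse in the features.
var = X.var(axis=0)
keep = var > 2 * np.median(var)
Xc = X[:, keep] - X[:, keep].mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                           # scores on the sparse subspace

# Stage 2: ordinary least squares of y on the recovered scores, so the
# final predictions depend on only the selected subset of features.
A = np.column_stack([np.ones(n), Z])
gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ gamma
```

Because `keep` selects only the high-variance features, downstream predictions depend on a small feature subset, mirroring the interpretability claim in the abstract.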
Dimension reduction for high-dimensional compositional data plays an important role in many fields, where principal component analysis of the basis covariance matrix is of scientific interest. In practice, however, the basis variables are latent and rarely observed, and standard techniques of principal component analysis are inadequate for compositional data because of the simplex constraint. To address this challenging problem, we relate the principal subspace of the centered log-ratio compositional covariance to that of the basis covariance, and prove that the latter is approximately identifiable as the dimensionality diverges, under a subspace sparsity assumption. This interesting blessing-of-dimensionality phenomenon enables us to propose principal subspace estimation methods based on the sample centered log-ratio covariance. We also derive nonasymptotic error bounds for the subspace estimators, which exhibit a tradeoff between identification and estimation. Moreover, we develop efficient proximal alternating direction method of multipliers (ADMM) algorithms to solve the resulting nonconvex and nonsmooth optimization problems. Simulation results demonstrate that the proposed methods perform as well as oracle methods with known basis. Their usefulness is illustrated through an analysis of word usage patterns among statisticians.
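The centered log-ratio covariance at the heart of this approach is straightforward to form. The sketch below computes it for simulated compositional data and extracts a leading principal subspace by a plain eigendecomposition; the paper's actual estimators solve sparse nonconvex programs via proximal ADMM, which this simplification omits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy compositional data: n samples on the simplex (rows sum to 1),
# obtained by normalizing positive latent "basis" variables.
n, p = 100, 8
W = rng.lognormal(mean=0.0, sigma=1.0, size=(n, p))   # latent basis (unobserved)
X = W / W.sum(axis=1, keepdims=True)                  # observed compositions

# Centered log-ratio (clr) transform: log(x_ij) minus the row mean of logs.
logX = np.log(X)
clr = logX - logX.mean(axis=1, keepdims=True)

# Sample clr covariance and its leading k-dimensional principal subspace.
k = 2
S = np.cov(clr, rowvar=False)
eigval, eigvec = np.linalg.eigh(S)          # ascending eigenvalues
subspace = eigvec[:, ::-1][:, :k]           # top-k eigenvectors
```

Note that each clr row sums to zero, so the clr covariance is singular along the all-ones direction; this is the simplex constraint that makes naive PCA on raw compositions inadequate.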
We propose a supervised principal component regression method for relating a functional response to high-dimensional covariates. Unlike conventional principal component analysis, the proposed method builds on a newly defined expected integrated residual sum of squares, which directly makes use of the association between the functional response and the predictors. Minimizing this integrated residual sum of squares gives the supervised principal components, but doing so is equivalent to solving a sequence of nonconvex generalized Rayleigh quotient optimization problems and is thus computationally intractable. To overcome this computational challenge, we reformulate the nonconvex optimization problems as a simultaneous linear regression, with a sparse penalty added to handle the high-dimensional predictors. Theoretically, we show that the reformulated regression problem recovers the same supervised principal subspace under suitable conditions. Statistically, we establish non-asymptotic error bounds for the proposed estimators. Numerical studies and an application to the Human Connectome Project lend further support to the proposed method.
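A crude sketch of supervised dimension reduction for a functional response: below, predictors are screened by their integrated squared cross-covariance with the response curve (a hypothetical stand-in for minimizing the expected integrated residual sum of squares), followed by PCA on the retained predictors and least squares of the response curves on the scores. This is an illustration only, not the paper's reformulated sparse simultaneous regression.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup: high-dimensional scalar predictors and a functional
# response on a grid; only the first 3 predictors drive the response.
n, p, m = 60, 150, 25
t = np.linspace(0, 1, m)
X = rng.standard_normal((n, p))
X[:, :3] *= 2.0                                    # signal features get more variance
B = np.vstack([np.sin(np.pi * t), np.cos(np.pi * t), t])   # coefficient curves
Y = X[:, :3] @ B + 0.1 * rng.standard_normal((n, m))

# Supervised screening: rank predictors by integrated squared
# cross-covariance with the response curve.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
C = Xc.T @ Yc / n                                  # p x m cross-covariances
score = (C ** 2).sum(axis=1)                       # "integrate" over the grid
keep = np.argsort(score)[-10:]                     # retain the 10 best predictors

# PCA on the retained predictors, then least squares of Y on the scores.
Xs = Xc[:, keep]
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
Z = Xs @ Vt[:3].T                                  # top-3 component scores
coef, *_ = np.linalg.lstsq(Z, Yc, rcond=None)
fit = Z @ coef + Y.mean(0)
```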
Functional principal component analysis (FPCA) can be invalid when the data involve non-Gaussian features. We therefore aim to develop a general FPCA method that adapts to such non-Gaussian cases. We construct a Kendall's $\tau$ function, which possesses the same eigenfunctions as the covariance function. The particular formulation of the Kendall's $\tau$ function makes it insensitive to the underlying data distribution. We further apply it to the estimation of FPCA and study the corresponding asymptotic consistency. Moreover, the effectiveness of the proposed method is demonstrated through a comprehensive simulation study and an application to physical activity data collected by a wearable accelerometer monitor.
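One concrete estimator in this spirit is the multivariate Kendall's tau matrix, sketched below for densely observed curves on a common grid. Under elliptical models it shares eigenvectors with the covariance while downweighting heavy-tailed observations; this is a hedged illustration of the general idea, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "functional" data: curves on a common grid, generated from a
# heavy-tailed (t-distributed) two-component model.
n, m = 150, 20
t = np.linspace(0, 1, m)
phi1 = np.sqrt(2) * np.sin(2 * np.pi * t)      # true eigenfunctions
phi2 = np.sqrt(2) * np.cos(2 * np.pi * t)
scores = rng.standard_t(df=3, size=(n, 2)) * np.array([2.0, 1.0])
X = scores[:, :1] * phi1 + scores[:, 1:] * phi2

# Multivariate Kendall's tau matrix: average outer product of
# normalized pairwise differences (each pair counted once).
K = np.zeros((m, m))
pairs = 0
for i in range(n):
    d = X[i] - X[i + 1:]                       # differences to later curves
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    d = d / norms                              # project onto the unit sphere
    K += d.T @ d
    pairs += d.shape[0]
K /= pairs

# Its eigenvectors serve as (discretized) eigenfunction estimates.
eigval, eigvec = np.linalg.eigh(K)
est1 = eigvec[:, -1]                           # leading estimated eigenfunction
```

The sphering step is what makes the estimator distribution-insensitive: each pair contributes a unit-norm direction regardless of how extreme the heavy-tailed scores are.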
Functional binary datasets occur frequently in practice, but the discrete nature of the data poses challenges for model estimation. In this paper, we propose a sparse logistic functional principal component analysis (SLFPCA) method for functional binary data. SLFPCA seeks local sparsity in the eigenfunctions to ease interpretation. We formulate the problem through a penalized Bernoulli likelihood with both roughness and sparseness penalty terms. An efficient majorization-minimization (MM) algorithm is developed to optimize the penalized likelihood. The theoretical results indicate both consistency and sparsistency of the proposed method. We conduct a thorough numerical experiment to demonstrate the advantages of the SLFPCA approach. Our method is further applied to a physical activity dataset.
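The MM strategy for a penalized Bernoulli likelihood can be illustrated on a simpler cousin: sparse logistic regression, where the logistic loss is majorized by a quadratic (its Hessian is bounded by X'X/4) and the surrogate plus an L1 penalty is minimized in closed form by soft-thresholding. The setting and penalty weight below are hypothetical; SLFPCA itself additionally includes a roughness penalty and estimates eigenfunctions rather than regression coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sparse logistic model: only the first two coefficients are nonzero.
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob).astype(float)

lam = 5.0                                   # sparseness penalty weight (illustrative)
L = np.linalg.eigvalsh(X.T @ X)[-1] / 4     # curvature bound for the quadratic majorizer
beta = np.zeros(p)
for _ in range(500):
    mu = 1 / (1 + np.exp(-X @ beta))        # current fitted probabilities
    grad = X.T @ (mu - y)                   # gradient of the negative log-likelihood
    z = beta - grad / L                     # minimize the quadratic surrogate
    # Soft-threshold: exact minimizer of the surrogate plus the L1 penalty.
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
```

Each iteration decreases a majorizing upper bound of the penalized objective, which is the defining property of an MM scheme.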
Functional principal component analysis is essential in functional data analysis, but inference can become unreliable when non-Gaussian characteristics, such as heavy tails and skewness, are present. The focus of this paper is to develop a robust functional principal component analysis methodology for non-Gaussian longitudinal data, for which sparsity and irregularity, along with non-negligible measurement errors, must be considered. We introduce a Kendall's $\tau$ function whose particular properties make it a suitable proxy for the covariance function in the eigenequation when handling non-Gaussian cases. Moreover, we present the estimation procedure and establish the asymptotic theory. We further demonstrate the superiority and robustness of our method through simulation studies and apply it to longitudinal CD4 cell count data from an AIDS study.