We describe the results of principal component analysis (PCA) of up-the-ramp sampled IR array data from the HST WFC3 IR, JWST NIRSpec, and prototype WFIRST WFI detectors. These systems use, respectively, Teledyne H1R, H2RG, and H4RG-10 near-IR detector arrays, with a variety of IR array controllers. The PCA shows that the Legendre polynomials approximate the principal components of these systems (i.e., they roughly diagonalize the covariance matrix). In contrast to the monomial basis that is widely used for polynomial fitting and linearization today, the Legendre polynomials are an orthonormal basis. They provide a quantifiable, compact, and (nearly) linearly uncorrelated representation of the information content of the data. By fitting a few Legendre polynomials, nearly all of the meaningful information in representative WFC3 astronomical datacubes can be condensed from 15 up-the-ramp samples down to 6 compressible Legendre coefficients per pixel. The higher-order coefficients contain time-domain information that is lost when one projects up-the-ramp sampled datacubes onto 2-dimensional images by fitting a straight line, even if the data are linearized before fitting the line. Going forward, we believe that this time-domain information is potentially important for disentangling the various non-linearities that can affect IR array observations, i.e., inherent pixel non-linearity, persistence, burn-in, the brighter-fatter effect, (potentially) non-linear inter-pixel capacitance (IPC), and perhaps others.
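The Legendre-coefficient compression described above can be sketched in a few lines. This is a minimal illustration on a single synthetic pixel ramp; the sample count (15) and degree (5, giving 6 coefficients) follow the abstract, while the signal levels and noise are illustrative assumptions, not values from the paper:

```python
import numpy as np
from numpy.polynomial import legendre as L

# Hypothetical single-pixel ramp: 15 up-the-ramp samples (ADU) with a
# linear count rate, a mild quadratic non-linearity, and read noise.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 15)        # sample times rescaled to [-1, 1]
ramp = 500.0 + 300.0 * t - 20.0 * t**2 + rng.normal(0.0, 2.0, t.size)

# Fit the first 6 Legendre coefficients (degree 5): the compact,
# nearly uncorrelated representation the abstract describes.
coeffs = L.legfit(t, ramp, deg=5)

# Reconstructing from the 6 coefficients retains essentially all the signal.
recon = L.legval(t, coeffs)
rms_resid = np.sqrt(np.mean((ramp - recon) ** 2))
print(coeffs[:2])    # c0 ~ mean level, c1 ~ slope (count rate)
print(rms_resid)     # close to the read noise, i.e. little information lost
```

Note that `c1` plays the role of the straight-line slope, while the higher-order coefficients carry the time-domain information that a plain linear fit discards.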
Instrumental data are affected by systematic effects that dominate the errors and can be relevant when searching for small signals. This is the case for the K2 mission, the follow-up of the Kepler mission, which, after the failure of two of its reaction wheels, lost its pointing stability, strongly increasing the systematics in the light curves and reducing its photometric precision. In this work, we have developed a general method to remove time-related systematics from a set of light curves, which we have applied to K2 data. The method uses principal component analysis to retrieve the correlation between the light curves due to the systematics and to remove its effect without using any information other than the data themselves. We have applied the method to all the K2 campaigns available at the Mikulski Archive for Space Telescopes, and we have tested the effectiveness of the procedure and its ability to preserve the astrophysical signal on a few transits and eclipsing binaries. One product of this work is the identification of stable sources along the ecliptic plane that can be used as photometric calibrators for the upcoming Atmospheric Remote-sensing Infrared Exoplanet Large-survey (ARIEL) mission.
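The core idea, removing correlated trends found by PCA across many light curves, can be sketched as follows. This is not the paper's pipeline, just a minimal illustration on synthetic data in which every star shares one systematic trend (all shapes, amplitudes, and noise levels are assumed for the sketch):

```python
import numpy as np

# Synthetic ensemble: 50 light curves sharing a common systematic trend
# with star-dependent amplitude, plus independent photometric noise.
rng = np.random.default_rng(1)
n_stars, n_times = 50, 200
trend = np.sin(np.linspace(0.0, 6.0 * np.pi, n_times))      # shared systematic
fluxes = (1.0
          + 0.01 * rng.normal(size=(n_stars, 1)) * trend     # correlated part
          + 0.001 * rng.normal(size=(n_stars, n_times)))     # independent noise

# Center each light curve, then find the leading principal components,
# which capture behaviour correlated across stars (the systematics).
centered = fluxes - fluxes.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Subtract the contribution of the top k components from every light curve.
k = 1
systematics = (U[:, :k] * s[:k]) @ Vt[:k]
cleaned = centered - systematics

# Residual correlation with the injected trend is now small.
print(np.abs(np.corrcoef(cleaned[0], trend)[0, 1]))
```

In practice the number of components to remove must be chosen carefully, since aggressive removal can also absorb real astrophysical signal; the abstract's tests on transits and eclipsing binaries address exactly that risk.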
We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression. To achieve our results, we first observe that ridge regression can be used to obtain a smooth projection onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision.
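The key observation, that a ridge-regression solve already gives a smooth surrogate for projection onto the top principal components, can be illustrated numerically. This is only a sketch of the starting point (the spectral filter), not the paper's full iterative algorithm; the matrix below has an assumed, well-gapped spectrum so the effect is visible:

```python
import numpy as np

# Build a matrix with a known spectrum: a clear gap between the top 3
# singular values and the rest, with the regularization lam in the gap.
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(100, 5)))
V, _ = np.linalg.qr(rng.normal(size=(5, 5)))
s_true = np.array([10.0, 9.0, 8.0, 0.1, 0.05])
A = U @ np.diag(s_true) @ V.T

x = rng.normal(size=5)
lam = 1.0
AtA = A.T @ A

# One "black-box ridge regression" call, solving (A^T A + lam I) z = A^T A x.
# Along each principal direction this applies the smooth filter
# sigma_i^2 / (sigma_i^2 + lam), near 1 above the gap and near 0 below it.
smooth = np.linalg.solve(AtA + lam * np.eye(5), AtA @ x)

# Exact projection onto the top-3 principal components, for comparison.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
exact = Vt[:3].T @ (Vt[:3] @ x)

print(np.linalg.norm(smooth - exact) / np.linalg.norm(x))  # already small
```

The paper's contribution is then to sharpen this smooth filter toward a hard step function using low-degree polynomials built from repeated ridge solves.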
The Gaia optical reference frame is intrinsically undefined with respect to global orientation and spin, so it needs to be anchored to the radio-based International Celestial Reference Frame (ICRF) to provide a referenced and quasi-inertial celestial coordinate system. The link between the two fundamental frames is realized through two samples of distant extragalactic sources, mostly AGNs and quasars, but only the smaller sample of radio-loud ICRF sources with optical counterparts is available to determine the mutual orientation. The robustness of this link can be mathematically formulated in the framework of functional principal component analysis, using a set of vector spherical harmonics to represent the differences in celestial positions of the common objects. The weakest eigenvectors are computed, which describe the greatest deficiencies of the link. The deficient or poorly determined terms are specific vector fields on the sphere that carry the largest errors of absolute astrometry using Gaia in reference to the ICRF. This analysis provides guidelines for the future development of the ICRF, maximizing the accuracy of the link over the entire celestial sphere. A measure of robustness of a least-squares solution, which can be applied to any linear model fitting problem, is introduced to help discriminate between reference frame tie models of different degrees.
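The "weakest eigenvector" diagnostic generalizes to any linear least-squares fit, as the abstract notes. The following is a simplified sketch with a generic design matrix rather than vector spherical harmonics: the eigenvectors of the normal matrix with the smallest eigenvalues identify the poorly determined combinations of model terms (the near-degenerate column below is an assumed toy example):

```python
import numpy as np

# Generic linear model y = A p with unit weights; term 5 is made nearly
# identical to term 0, mimicking a poorly constrained combination.
rng = np.random.default_rng(3)
n_sources, n_terms = 30, 6
A = rng.normal(size=(n_sources, n_terms))
A[:, 5] = A[:, 0] + 0.01 * rng.normal(size=n_sources)   # near-degeneracy
W = np.eye(n_sources)                                    # unit weights

# Normal matrix of the least-squares problem and its eigendecomposition.
N = A.T @ W @ A
eigvals, eigvecs = np.linalg.eigh(N)    # eigenvalues in ascending order

# The eigenvector with the smallest eigenvalue is the weakest direction:
# here, roughly (e0 - e5)/sqrt(2), i.e. the difference of the two
# degenerate terms, which the data cannot separate.
weak = eigvecs[:, 0]
print(eigvals[0] / eigvals[-1])          # condition indicator, near zero
print(weak)
```

In the reference-frame application, each eigenvector corresponds to a specific vector field on the sphere, so the weakest ones show directly where and how the Gaia-ICRF tie is least constrained.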
Dimension reduction for high-dimensional compositional data plays an important role in many fields, where the principal component analysis of the basis covariance matrix is of scientific interest. In practice, however, the basis variables are latent and rarely observed, and standard techniques of principal component analysis are inadequate for compositional data because of the simplex constraint. To address this challenging problem, we relate the principal subspace of the centered log-ratio compositional covariance to that of the basis covariance, and prove that the latter is approximately identifiable with diverging dimensionality under some subspace sparsity assumption. This interesting blessing-of-dimensionality phenomenon enables us to propose principal subspace estimation methods based on the sample centered log-ratio covariance. We also derive nonasymptotic error bounds for the subspace estimators, which exhibit a tradeoff between identification and estimation. Moreover, we develop efficient proximal alternating direction method of multipliers algorithms to solve the nonconvex and nonsmooth optimization problems. Simulation results demonstrate that the proposed methods perform as well as the oracle methods with known basis. Their usefulness is illustrated through an analysis of word usage patterns among statisticians.
Principal component analysis (PCA) is an important tool for exploring data. The conventional approach to PCA leads to a solution that favours structures with large variances; this makes it sensitive to outliers and can obscure interesting underlying structures. One of the equivalent definitions of PCA is that it seeks the subspaces that maximize the sum of squared pairwise distances between data projections. This definition opens up more flexibility in the analysis of principal components, which is useful in enhancing PCA. In this paper we introduce scales into PCA by maximizing only the sum of squared pairwise distances between projections for pairs of datapoints whose distances lie within a chosen interval of values [l,u]. The resulting principal component decompositions in Multiscale PCA depend on the point (l,u) in the plane, and for each point we define projectors onto the principal components. Cluster analysis of these projectors reveals the structures in the data at various scales. Each structure is described by the eigenvectors at the medoid point of the cluster representing that structure. We also use the distortion of projections as a criterion for choosing an appropriate scale, especially for data with outliers. The method was tested on both artificial and real data. For data with multiscale structures, it was able to reveal the different structures in the data and to reduce the effect of outliers on the principal component analysis.
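The scale-restricted objective can be sketched directly from the pairwise-distance definition: keep only pairs whose distance falls in [l,u], and take the leading eigenvector of the sum of outer products of those pair differences. The function and data below are illustrative, not the paper's implementation:

```python
import numpy as np

def multiscale_pc(X, l, u):
    """Leading principal direction using only pairs with distance in [l, u]."""
    diffs = X[:, None, :] - X[None, :, :]            # all pairwise differences
    dists = np.linalg.norm(diffs, axis=2)
    mask = (dists >= l) & (dists <= u)
    # Covariance-like matrix built from the selected pairs only.
    M = np.einsum('ijk,ijl->kl', diffs * mask[..., None], diffs)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1]                            # top eigenvector

# Toy data: a cloud elongated along axis 0, plus one far outlier on axis 1.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2)) * [3.0, 0.3]
X = np.vstack([X, [[0.0, 100.0]]])

# Unrestricted scale: the outlier's huge pairwise distances dominate, so
# the leading direction points at the outlier (axis 1).
v_all = multiscale_pc(X, 0.0, np.inf)

# Restricting u below the outlier distance excludes those pairs, so the
# leading direction recovers the cloud's true elongation (axis 0).
v_local = multiscale_pc(X, 0.0, 20.0)
print(np.abs(v_all), np.abs(v_local))
```

This mirrors the abstract's claim that choosing the scale interval both reveals structures at different scales and suppresses the influence of outliers on the principal components.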