High-resolution galaxy spectra contain much information about galactic physics, but the high dimensionality of these spectra makes it difficult to fully utilize the information they contain. We apply variational autoencoders (VAEs), a non-linear dimensionality reduction technique, to a sample of spectra from the Sloan Digital Sky Survey. In contrast to Principal Component Analysis (PCA), a widely used technique, VAEs can capture non-linear relationships between latent parameters and the data. We find that a VAE can reconstruct the SDSS spectra well with only six latent parameters, outperforming PCA with the same number of components. Different galaxy classes are naturally separated in this latent space, without class labels having been given to the VAE. The VAE latent space is interpretable because the VAE can be used to make synthetic spectra at any point in latent space. For example, making synthetic spectra along tracks in latent space yields sequences of realistic spectra that interpolate between two different types of galaxies. Using the latent space to find outliers may yield interesting spectra: in our small sample, we immediately find unusual data artifacts and stars misclassified as galaxies. In this exploratory work, we show that VAEs create compact, interpretable latent spaces that capture non-linear features of the data. While a VAE takes substantial time to train (~1 day for 48000 spectra), once trained, VAEs can enable the fast exploration of large astronomical data sets.
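The idea of making synthetic spectra along a track in latent space can be illustrated with a minimal sketch. It assumes a straight-line track between two latent points, which the abstract does not specify. `toy_decoder` is a hypothetical stand-in for a trained VAE decoder, not the authors' model, and the latent values are made up for illustration.

```python
import math

def latent_track(z_start, z_end, n_steps):
    """Return n_steps latent points evenly spaced on the line from z_start to z_end."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    track = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # runs from 0.0 to 1.0 inclusive
        track.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return track

def toy_decoder(z, n_pixels=8):
    """Hypothetical decoder: a fixed non-linear map from latent vector to 'spectrum'."""
    return [
        math.tanh(sum(zk * math.sin((k + 1) * (p + 1) / n_pixels)
                      for k, zk in enumerate(z)))
        for p in range(n_pixels)
    ]

# Two illustrative points in a six-dimensional latent space (made-up values).
z_a = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
z_b = [1.0, -1.0, 0.5, 0.0, 2.0, -0.5]

track = latent_track(z_a, z_b, n_steps=5)
spectra = [toy_decoder(z) for z in track]  # one synthetic 'spectrum' per track point
```

With a real trained decoder in place of `toy_decoder`, each element of `spectra` would be a synthetic spectrum at one step between the two galaxy types.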
To process ever-higher-dimensional data such as images, sentences, or audio recordings efficiently, one needs a suitable way to reduce the dimensionality of such data. To this end, SVD-based methods including PCA and Isomap have been used extensively. More recently, a neural network alternative, the autoencoder, has been proposed and is often preferred for its greater flexibility. This work aims to show that PCA is still a relevant technique for dimensionality reduction in the context of classification. To this end, we evaluated the performance of PCA compared to Isomap, a deep autoencoder, and a variational autoencoder. Experiments were conducted on three commonly used image datasets: MNIST, Fashion-MNIST, and CIFAR-10. The four dimensionality reduction techniques were applied separately to each dataset to project the data into a low-dimensional space. A k-NN classifier was then trained on each projection, with a cross-validated random search over the number of neighbours. Interestingly, our experiments revealed that k-NN achieved comparable accuracy on the PCA projections and on the projections of both autoencoders, provided the projection dimension was large enough. However, PCA was two orders of magnitude faster to compute than its neural network counterparts.
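The PCA-then-k-NN pipeline described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' experimental setup. It uses synthetic Gaussian data instead of MNIST, Fashion-MNIST, or CIFAR-10, and a single hold-out split instead of cross-validation. The function names (`pca_fit`, `pca_transform`, `knn_predict`) and all parameter values are assumptions made for this example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project centred data onto the principal directions."""
    return (X - mean) @ components.T

def knn_predict(Z_train, y_train, Z_test, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])

rng = np.random.default_rng(0)

# Synthetic stand-in data: two Gaussian classes in 20 dimensions.
n_per_class, n_features = 100, 20
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, n_features)),
               rng.normal(3.0, 1.0, (n_per_class, n_features))])
y = np.repeat([0, 1], n_per_class)
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Fit PCA on the training split only, then project both splits.
mean, components = pca_fit(X_train, n_components=2)
Z_train = pca_transform(X_train, mean, components)
Z_val = pca_transform(X_val, mean, components)

# Random search over the number of neighbours, scored on the hold-out split.
candidate_ks = rng.choice(np.arange(1, 16), size=5, replace=False)
scores = {int(k): float((knn_predict(Z_train, y_train, Z_val, int(k)) == y_val).mean())
          for k in candidate_ks}
best_k = max(scores, key=scores.get)
best_accuracy = scores[best_k]
```

Fitting the projection on the training split alone keeps validation data out of the learned directions, which matters when comparing reduction methods by downstream accuracy.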
In this work, we present a quantum neighborhood preserving embedding and a quantum local discriminant embedding for dimensionality reduction and classification. We demonstrate that these two algorithms have an exponential speedup over their respective classical counterparts. Along the way, we propose a variational quantum generalized eigenvalue solver that finds the generalized eigenvalues and eigenstates of a matrix pencil $(\mathcal{G},\mathcal{S})$. As a proof-of-principle, we implement our algorithm to solve $2^5\times2^5$ generalized eigenvalue problems. Finally, our results offer two optional outputs with quantum or classical form, which can be directly applied in another quantum or classical machine learning process.
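For readers unfamiliar with the generalized eigenvalue problem of a matrix pencil, here is a small classical illustration. It is not the variational quantum solver of the abstract. It assumes symmetric $2\times2$ matrices with $\mathcal{S}$ positive definite and finds the values $\lambda$ satisfying $\det(\mathcal{G}-\lambda\mathcal{S})=0$ from the characteristic quadratic. The function name and the example matrices are chosen for this sketch only.

```python
import math

def generalized_eigenvalues_2x2(G, S):
    """Solve det(G - lam * S) = 0 for symmetric 2x2 G and S, with S positive definite.

    Expanding the determinant gives a * lam**2 + b * lam + c = 0.
    """
    (g11, g12), (_, g22) = G
    (s11, s12), (_, s22) = S
    a = s11 * s22 - s12 ** 2                       # det(S), positive for SPD S
    b = -(g11 * s22 + g22 * s11 - 2 * g12 * s12)
    c = g11 * g22 - g12 ** 2                       # det(G)
    root = math.sqrt(max(b * b - 4 * a * c, 0.0))  # clamp tiny negative round-off
    return ((-b - root) / (2 * a), (-b + root) / (2 * a))

def pencil_determinant(G, S, lam):
    """Evaluate det(G - lam * S); it should vanish at a generalized eigenvalue."""
    m11 = G[0][0] - lam * S[0][0]
    m12 = G[0][1] - lam * S[0][1]
    m22 = G[1][1] - lam * S[1][1]
    return m11 * m22 - m12 ** 2

G = [[2.0, 1.0], [1.0, 2.0]]
S = [[2.0, 0.0], [0.0, 1.0]]
lam_small, lam_large = generalized_eigenvalues_2x2(G, S)
```

When $\mathcal{S}$ is the identity, the routine reduces to the ordinary eigenvalues of $\mathcal{G}$.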
Manifold-valued data naturally arises in medical imaging. In cognitive neuroscience, for instance, brain connectomes base the analysis of coactivation patterns between different brain regions on the analysis of the correlations of their functional Magnetic Resonance Imaging (fMRI) time series - an object thus constrained by construction to belong to the manifold of symmetric positive definite matrices. One of the challenges that naturally arises is finding a lower-dimensional subspace for representing such manifold-valued data. Traditional techniques, like principal component analysis, are ill-adapted to non-Euclidean spaces and may fail to achieve a lower-dimensional representation of the data, thus potentially suggesting that no lower-dimensional representation exists. Indeed, these techniques are restricted in that: (i) they do not leverage the assumption that the connectomes lie on a pre-specified manifold, therefore discarding information; (ii) they can only fit a linear subspace to the data. In this paper, we are interested in variants that learn potentially highly curved submanifolds of manifold-valued data. Motivated by the brain connectomes example, we investigate a latent variable generative model, which has the added benefit of providing us with uncertainty estimates - a crucial quantity in the medical applications we are considering. While latent variable models have been proposed to learn linear and nonlinear spaces for Euclidean data, or geodesic subspaces for manifold data, no intrinsic latent variable model exists to learn nongeodesic subspaces for manifold data. This paper fills this gap and formulates a Riemannian variational autoencoder with an intrinsic generative model of manifold-valued data. We evaluate its performance on synthetic and real datasets by introducing the formalism of weighted Riemannian submanifolds.
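The claim that correlation-based connectomes are constrained to the symmetric positive definite (SPD) matrices can be checked numerically. The sketch below uses random synthetic time series rather than fMRI data, and the region and time-point counts are arbitrary. It also shows why linear tools fit poorly: the difference of two SPD matrices need not be SPD, so the set is not a vector space.

```python
import numpy as np

def is_spd(matrix, tol=1e-10):
    """True if the matrix is symmetric with all eigenvalues above tol."""
    return bool(np.allclose(matrix, matrix.T)
                and np.linalg.eigvalsh(matrix).min() > tol)

rng = np.random.default_rng(0)

# Synthetic stand-in for fMRI signals: 5 regions, 200 time points (arbitrary sizes).
n_regions, n_timepoints = 5, 200
time_series = rng.normal(size=(n_regions, n_timepoints))
time_series[1] += 0.5 * time_series[0]  # make two regions co-activate

# Rows are variables, so this is the 5 x 5 region-by-region correlation matrix.
connectome = np.corrcoef(time_series)

# Subtracting one SPD matrix (2 * identity) from another leaves the SPD set.
difference = connectome - 2.0 * np.eye(n_regions)
```

Here `is_spd(connectome)` holds while `is_spd(difference)` does not, which is the kind of constraint a method working in the ambient Euclidean space ignores.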
With the increasing number of deep multi-wavelength galaxy surveys, the spectral energy distribution (SED) of galaxies has become an invaluable tool for studying the formation of their structures and their evolution. In this context, standard analysis relies on simple spectro-photometric selection criteria based on a few SED colors. Although this fully supervised classification has already yielded clear achievements, it is not optimal for extracting relevant information from the data. In this article, we propose to employ very recent advances in machine learning, and more precisely in feature learning, to derive a data-driven diagram. We show that the proposed approach based on denoising autoencoders recovers the bi-modality in the galaxy population in an unsupervised manner, without using any prior knowledge on galaxy SED classification. This technique has been compared to principal component analysis (PCA) and to standard color/color representations. In addition, preliminary results illustrate that this approach captures additional physically meaningful information, such as redshift dependence, galaxy mass evolution and variation over the specific star formation rate. PCA also results in an unsupervised representation with physical properties, such as mass and sSFR, although this representation separates out other characteristics (bimodality, redshift evolution) less well than denoising autoencoders.
Extremely high data rates expected in next-generation radio interferometers necessitate a fast and robust way to process measurements in a big data context. Dimensionality reduction can alleviate computational load needed to process these data, in terms of both computing speed and memory usage. In this article, we present image reconstruction results from highly reduced radio-interferometric data, following our previously proposed data dimensionality reduction method, $\mathrm{R}_{\mathrm{sing}}$, based on studying the distribution of the singular values of the measurement operator. This method comprises a simple weighted, subsampled discrete Fourier transform of the dirty image. Additionally, we show that an alternative gridding-based reduction method works well for target data sizes of the same order as the image size. We reconstruct images from well-calibrated VLA data to showcase the robustness of our proposed method down to very low data sizes in a real data setting. We show through comparisons with the conventional reduction method of time- and frequency-averaging, that our proposed method produces more accurate reconstructions while reducing data size much further, and is particularly robust when data sizes are aggressively reduced to low fractions of the image size. $\mathrm{R}_{\mathrm{sing}}$ can function in a block-wise fashion, and could be used in the future to process incoming data by blocks in real-time, thus opening up the possibility of performing on-line imaging as the data are being acquired. MATLAB code for the proposed dimensionality reduction method is available on GitHub.
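The phrase "weighted, subsampled discrete Fourier transform of the dirty image" can be made concrete with a rough sketch. This is not the authors' MATLAB implementation. The weighting and selection rules below are assumptions for illustration: it keeps Fourier coefficients whose hypothetical singular-value estimate `sigma_estimate` exceeds a threshold and weights them by the inverse of that estimate. The image and the estimates are random placeholders.

```python
import numpy as np

def reduce_dirty_image(dirty_image, sigma_estimate, threshold):
    """Weighted, subsampled 2-D DFT of a dirty image (illustrative rule only).

    sigma_estimate: hypothetical per-coefficient singular-value estimates.
    threshold: coefficients with sigma_estimate <= threshold are discarded.
    """
    fourier = np.fft.fft2(dirty_image, norm="ortho")  # unitary 2-D DFT
    keep_mask = sigma_estimate > threshold            # subsampling step
    weights = 1.0 / sigma_estimate[keep_mask]         # assumed weighting
    return weights * fourier[keep_mask], keep_mask

rng = np.random.default_rng(0)

# Placeholder inputs: a random 32 x 32 'dirty image' and made-up estimates.
dirty_image = rng.normal(size=(32, 32))
sigma_estimate = rng.uniform(0.1, 1.0, size=(32, 32))

# Keep roughly the top 10% of coefficients by estimated singular value.
threshold = np.quantile(sigma_estimate, 0.9)
reduced, keep_mask = reduce_dirty_image(dirty_image, sigma_estimate, threshold)
```

The reduced vector here is about a tenth of the image size, in the spirit of the aggressive reduction regimes the abstract describes.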