Among the many challenges posed by the huge data volumes produced by the new generation of astronomical instruments is the search for rare and peculiar objects. Unsupervised outlier detection algorithms may provide a viable solution. In this work we compare the performance of six methods: the Local Outlier Factor, Isolation Forest, k-means clustering, a measure of novelty, and both a standard and a convolutional autoencoder. These methods were applied to data extracted from SDSS Stripe 82. After discussing the sensitivity of each method to its own set of hyperparameters, we combine the results from all methods to rank the objects and produce a final list of outliers.
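As an illustration of how such an ensemble ranking can be assembled, the sketch below scores a feature matrix with three of the six methods (Local Outlier Factor, Isolation Forest, and k-means distance-to-centroid) and averages the per-method ranks. This is a minimal reconstruction assuming scikit-learn, not the authors' actual pipeline; the feature matrix `X` and the hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

def combined_outlier_rank(X, k_clusters=10, random_state=0):
    """Rank objects by averaging the outlier ranks of three methods.

    Higher combined rank = more anomalous. X is an (n_objects,
    n_features) array of photometric features (placeholder).
    """
    X = StandardScaler().fit_transform(X)

    # Local Outlier Factor: larger -negative_outlier_factor_ = more outlying.
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X)
    lof_score = -lof.negative_outlier_factor_

    # Isolation Forest: lower score_samples = more outlying, so negate.
    iso = IsolationForest(random_state=random_state).fit(X)
    iso_score = -iso.score_samples(X)

    # k-means: distance to the nearest cluster centroid as an outlier score.
    km = KMeans(n_clusters=k_clusters, n_init=10, random_state=random_state).fit(X)
    km_score = np.min(km.transform(X), axis=1)

    # Convert each score to a rank, then average the ranks across methods.
    ranks = np.vstack([s.argsort().argsort() for s in (lof_score, iso_score, km_score)])
    return ranks.mean(axis=0)
```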
We present a comparison of several Difference Image Analysis (DIA) techniques, in combination with Machine Learning (ML) algorithms, applied to the identification of optical transients associated with gravitational wave events. Each technique is assessed using the scoring metrics of Precision, Recall, and their harmonic mean F1, measured both on the DIA results as standalone techniques and on the results after the application of ML algorithms, using transient source injections over simulated and real data. These simulations cover a wide range of instrumental configurations, as well as a variety of observing conditions, by exploring a multi-dimensional set of relevant parameters, allowing us to extract general conclusions about the identification of transient astrophysical events. The newest subtraction techniques, and particularly the methodology published in Zackay et al. (2016), are implemented in an open-source Python package named properimage, suitable for many other astronomical image analyses. This, together with the ML libraries we describe, provides an effective transient detection software pipeline. Here we study the effects of the different ML techniques and the relative feature importances for the classification of transient candidates, and propose an optimal combined strategy. These constitute the basic elements of pipelines that could be applied in searches for electromagnetic counterparts to GW sources.
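A minimal sketch of the candidate-classification stage described above, assuming scikit-learn; the feature table, labels, and train/test split are placeholders, and this is an illustrative reconstruction rather than the authors' published pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def score_candidates(X, y, feature_names):
    """Train a random forest on DIA candidate features and report
    Precision, Recall, and F1, plus the relative feature importances.

    X: (n_candidates, n_features) measurements on subtraction stamps
    (placeholder); y: 1 for injected/real transients, 0 for bogus.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)

    print(f"Precision = {precision_score(y_te, y_hat):.3f}")
    print(f"Recall    = {recall_score(y_te, y_hat):.3f}")
    print(f"F1        = {f1_score(y_te, y_hat):.3f}")

    # Rank features by their contribution to the classification.
    for i in np.argsort(clf.feature_importances_)[::-1]:
        print(f"{feature_names[i]}: {clf.feature_importances_[i]:.3f}")
    return clf
```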
In the present era of large-scale surveys, big data presents new challenges to the discovery process for anomalous data. Such data can be indicative of systematic errors, extreme (or rare) forms of known phenomena, or, most interestingly, truly novel phenomena that exhibit as-yet unobserved behaviors. In this work we present an outlier scoring methodology to identify and characterize the most promising unusual sources, facilitating discoveries of such anomalous data. We have developed a data mining method based on the k-Nearest Neighbor distance in feature space to efficiently identify the most anomalous lightcurves. We test variations of this method, including using principal components of the feature space, removing select features, varying the choice of k, and applying the scoring to subset samples. We evaluate the performance of our scoring on known object classes and find that it consistently scores rare (<1000 members) object classes higher than common classes. We have applied this scoring to all long-cadence lightcurves of Quarters 1 to 17 of Kepler's prime mission and present outlier scores for all 2.8 million lightcurves of the roughly 200k objects.
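The core of the method, scoring each lightcurve by its k-Nearest Neighbor distance in feature space, can be sketched as follows (assuming scikit-learn; the feature matrix and the default k=20 are placeholders, and the optional PCA step mirrors one of the variations tested above).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def knn_outlier_scores(X, k=20, n_components=None):
    """Score each object by the mean distance to its k nearest
    neighbors in (optionally PCA-reduced) feature space.

    Higher score = more anomalous. X is an (n_objects, n_features)
    array of lightcurve features (placeholder).
    """
    X = StandardScaler().fit_transform(X)
    if n_components is not None:
        # Variation tested in the text: score in principal-component space.
        X = PCA(n_components=n_components).fit_transform(X)
    # k + 1 neighbors because each point's nearest neighbor is itself.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nbrs.kneighbors(X)
    return dist[:, 1:].mean(axis=1)
```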
The Exoplanet Imaging Data Challenge is a community-wide effort meant to offer a platform for a fair and common comparison of image processing methods designed for exoplanet direct detection. For this purpose, it gathers on a dedicated repository (Zenodo) data from several high-contrast ground-based instruments worldwide, into which we injected synthetic planetary signals. The data challenge is hosted on the CodaLab competition platform, where participants can upload their results. The specifications of the data challenge are published on our website. The first phase, launched on the 1st of September 2019 and closed on the 1st of October 2020, consisted of detecting point sources in two common types of dataset in the field of high-contrast imaging: data taken in pupil-tracking mode at one wavelength (subchallenge 1, also referred to as ADI) and multispectral data taken in pupil-tracking mode (subchallenge 2, also referred to as ADI mSDI). In this paper, we describe the approach, the organisational lessons learnt and the current limitations of the data challenge, as well as preliminary results of the participants' submissions for this first phase. In the future, we plan to provide permanent access to the standard library of data sets and metrics, in order to guide the validation and support the publication of innovative image processing algorithms dedicated to high-contrast imaging of planetary systems.
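For context, the ADI datasets of subchallenge 1 are typically processed by variants of the following classical scheme: subtract an estimate of the quasi-static stellar halo from each frame, derotate by the parallactic angle, and stack. This is a generic median-ADI sketch (assuming numpy and scipy), not any participant's algorithm; the rotation sign convention depends on the instrument.

```python
import numpy as np
from scipy.ndimage import rotate

def classical_adi(cube, parallactic_angles):
    """Median ADI: remove the quasi-static stellar halo, then derotate
    and combine so that planetary signals add up coherently.

    cube: (n_frames, ny, nx) pupil-tracking image cube (placeholder);
    parallactic_angles: rotation angle of each frame, in degrees.
    """
    # The median over time approximates the static speckle pattern.
    residuals = cube - np.median(cube, axis=0)
    # Derotate each residual frame to align the sky (and any planet).
    derotated = np.stack([
        rotate(frame, -angle, reshape=False, order=3)
        for frame, angle in zip(residuals, parallactic_angles)
    ])
    # Median-combine: speckle residuals average out, the planet remains.
    return np.median(derotated, axis=0)
```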
We present CosmoHub (https://cosmohub.pic.es), a web application based on Hadoop for interactive exploration and distribution of massive cosmological datasets. Modern cosmology seeks to unveil the nature of both dark matter and dark energy by mapping the large-scale structure of the Universe, through the analysis of massive amounts of astronomical data that have grown steadily over recent (and future) decades with the digitization and automation of experimental techniques. CosmoHub, hosted and developed at the Port d'Informació Científica (PIC), supports a worldwide community of scientists without requiring the end user to know any Structured Query Language (SQL). It serves data for several large international collaborations, such as the Euclid space mission, the Dark Energy Survey (DES), the Physics of the Accelerating Universe Survey (PAUS) and the Marenostrum Institut de Ciències de l'Espai (MICE) numerical simulations. While originally developed as a web frontend to a PostgreSQL relational database, this work describes the current version of CosmoHub, built on top of Apache Hive, which facilitates scalable reading, writing and managing of huge datasets. As CosmoHub's datasets are seldom modified, Hive is a better fit. Over 60 TiB of catalogued information and $50 \times 10^9$ astronomical objects can be interactively explored using an integrated visualization tool that includes 1D histogram and 2D heatmap plots. In our current implementation, online exploration of datasets of $10^9$ objects can be done on a timescale of tens of seconds. Users can also download customized subsets of data in standard formats, generated in a few minutes.
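As an illustration of the kind of aggregation Hive can execute behind such a web interface, the sketch below computes a 1D histogram server-side with a single HiveQL query (assuming the PyHive client; the host, table, and column names are hypothetical and not taken from CosmoHub).

```python
from pyhive import hive  # assumed client library; not part of CosmoHub itself

# Hypothetical connection parameters and catalogue schema.
conn = hive.connect(host="hive.example.org", port=10000, database="catalogues")
cursor = conn.cursor()

# Bin magnitudes in steps of 0.1 mag server-side, so only the
# histogram (not ~10^9 rows) travels over the network.
cursor.execute("""
    SELECT floor(mag_i / 0.1) * 0.1 AS bin_left,
           COUNT(*)                 AS n
    FROM galaxy_catalogue
    WHERE mag_i BETWEEN 18 AND 25
    GROUP BY floor(mag_i / 0.1) * 0.1
    ORDER BY bin_left
""")
histogram = cursor.fetchall()  # list of (bin_left, count) tuples
```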
Intelligent scheduling of the sequence of scientific exposures taken at ground-based astronomical observatories is massively challenging. Observing time is over-subscribed and atmospheric conditions are constantly changing. We propose to guide observatory scheduling using machine learning. Leveraging a 15-year archive of exposures, environmental measurements, and operating conditions logged by the Canada-France-Hawaii Telescope, we construct a probabilistic data-driven model that accurately predicts image quality. We demonstrate that, by optimizing the opening and closing of twelve vents placed on the dome of the telescope, we can reduce dome-induced turbulence and improve telescope image quality by 0.05-0.2 arcseconds. This translates to a reduction in exposure time (and hence cost) of $\sim$10-15%. Our study is the first step toward data-based optimization of the multi-million dollar operations of current and next-generation telescopes.
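One way to realize such a probabilistic data-driven predictor is to fit quantile regressors over the logged environmental and operating conditions. The sketch below (assuming scikit-learn; the feature names are hypothetical, not the telescope's actual telemetry schema) returns a median image-quality prediction together with a central 68% interval.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical predictors logged with each exposure: wind speed,
# dome-minus-ambient temperature, humidity, number of open vents, ...
FEATURES = ["wind_speed", "temp_dome_minus_ambient", "humidity", "n_vents_open"]

def fit_iq_quantiles(X, y, quantiles=(0.16, 0.5, 0.84)):
    """Fit one gradient-boosted quantile regressor per quantile of the
    image-quality distribution (y in arcseconds, X with FEATURES columns)."""
    return {
        q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
        for q in quantiles
    }

def predict_iq(models, X_new):
    """Return the median predicted image quality and a central 68% interval,
    e.g. to compare candidate vent configurations before an exposure."""
    lo, med, hi = (models[q].predict(X_new) for q in (0.16, 0.5, 0.84))
    return med, np.stack([lo, hi])
```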