Upcoming synoptic surveys are set to generate an unprecedented amount of data, demanding an automated framework that can quickly and efficiently assign classification labels to newly observed objects. Using data describing 11 types of variable stars from the Catalina Real-Time Transient Survey (CRTS), we illustrate how to capture the most important information from computed features and describe in detail how to use Information Theory robustly for feature selection and evaluation. We apply three Machine Learning (ML) algorithms and demonstrate how to optimize these classifiers via cross-validation techniques. For the CRTS dataset, we find that the Random Forest (RF) classifier performs best in terms of balanced accuracy and geometric mean. We demonstrate substantially improved classification results by converting the multi-class problem into a binary classification task, achieving a balanced-accuracy rate of $\sim$99 per cent for the classification of $\delta$-Scuti and Anomalous Cepheid (ACEP) stars. Additionally, we describe how classification performance can be improved by converting a flat multi-class problem into a hierarchical taxonomy. We develop a new hierarchical structure and propose a new set of classification features, enabling the accurate identification of subtypes of Cepheids, RR Lyrae and eclipsing binary stars in CRTS data.
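The pipeline described above — information-theoretic feature ranking followed by a cross-validated Random Forest scored on balanced accuracy — can be sketched as follows. This is a minimal illustration only: `make_classification` stands in for the paper's computed light-curve features, and `mutual_info_classif` is a generic stand-in for its Information-Theoretic selection procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for computed light-curve features (period, amplitude, ...).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Information-theoretic ranking: mutual information between each feature
# and the class label; keep the five most informative features.
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]

# Random Forest evaluated with 5-fold cross-validation on balanced accuracy.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X[:, top], y, cv=5, scoring="balanced_accuracy")
print(scores.mean())
```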
Photometric variability detection is often treated as a hypothesis-testing problem: an object is variable if the null hypothesis that its brightness is constant can be ruled out given the measurements and their uncertainties. Uncorrected systematic errors limit the practical applicability of this approach to high-amplitude variability and well-behaved data sets. Searching for a new variability detection technique that would be applicable to a wide range of variability types while remaining robust to outliers and underestimated measurement uncertainties, we propose to treat variability detection as a classification problem that can be approached with machine learning. We compare several classification algorithms: Logistic Regression (LR), Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Neural Nets (NN), Random Forests (RF) and the Stochastic Gradient Boosting classifier (SGB), applied to 18 features (variability indices) quantifying scatter and/or correlation between points in a light curve. We use a subset of OGLE-II Large Magellanic Cloud (LMC) photometry (30265 light curves) that was searched for variability using traditional methods (168 known variable objects identified) as the training set, and then apply the NN to a new test set of 31798 OGLE-II LMC light curves. Among the 205 candidates selected in the test set, 178 are real variables and 13 low-amplitude variables are new discoveries. We find that the considered machine learning classifiers are more efficient (they find more variables and fewer false candidates) than traditional techniques that consider individual variability indices or their linear combination. The NN, SGB, SVM and RF show a higher efficiency than LR and kNN.
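The traditional hypothesis-testing baseline the abstract contrasts against can be illustrated with one of the simplest variability indices: chi-squared per degree of freedom against the constant-brightness null hypothesis. This is a hedged sketch on synthetic light curves, not the paper's actual feature set; the 18 indices used in the work include more elaborate scatter- and correlation-based statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic light curves: magnitudes m_i with per-point uncertainties sigma_i.
n = 200
sigma = np.full(n, 0.05)
t = np.linspace(0.0, 10.0, n)
m_const = 15.0 + rng.normal(0.0, sigma)                               # constant star
m_var = 15.0 + 0.3 * np.sin(2 * np.pi * t) + rng.normal(0.0, sigma)   # variable star

def reduced_chi2(m, sigma):
    """Chi-squared per degree of freedom against the constant-brightness
    (null) hypothesis; values well above 1 suggest variability."""
    w = 1.0 / sigma**2
    m_bar = np.sum(w * m) / np.sum(w)   # weighted mean magnitude
    return np.sum(((m - m_bar) / sigma) ** 2) / (len(m) - 1)

print(reduced_chi2(m_const, sigma))   # close to 1 for a constant star
print(reduced_chi2(m_var, sigma))     # much greater than 1 for this variable
```

In the classification approach, indices like this become input features for the LR/SVM/kNN/NN/RF/SGB classifiers rather than being thresholded individually.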
The need for automatic tools to explore astronomical databases has been recognized since the advent of CCDs and modern computers. Astronomers have already developed solutions to several science problems, such as the automatic classification of stellar objects, outlier detection, and globular cluster identification, among others. As new science problems emerge, it is critical to be able to re-use previously learned models without rebuilding everything from scratch whenever the science problem changes. In this paper, we propose a new meta-model that automatically integrates existing classification models of variable stars. The proposed meta-model incorporates existing models that were trained in different contexts, answering different questions and using different representations of the data. Conventional mixture-of-experts algorithms from the machine learning literature cannot be used, since each expert (model) uses different inputs. We also account for the computational complexity of the model by invoking the most expensive models only when necessary. We test our model on the EROS-2 and MACHO datasets and show that most of the classification challenges can be solved simply by training a meta-model to learn how to integrate the previous experts.
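The core idea — a meta-model that integrates experts which do not share an input space — can be sketched in a stacking-style form: the meta-model is trained on the experts' output probabilities, which live in a common space even when their inputs do not. This is a simplified illustration on synthetic data under that assumption; the paper's meta-model is more elaborate (e.g. its cost-aware invocation of expensive experts is omitted here).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

# Two "experts" trained on disjoint feature subsets, standing in for models
# built in different contexts with different data representations.
expert_a = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr[:, :4], y_tr)
expert_b = RandomForestClassifier(n_estimators=50, random_state=2).fit(X_tr[:, 4:], y_tr)

# Meta-model: learns to integrate the experts from their output probabilities,
# so the experts never need to share a common input space.
meta_X_tr = np.column_stack([expert_a.predict_proba(X_tr[:, :4])[:, 1],
                             expert_b.predict_proba(X_tr[:, 4:])[:, 1]])
meta = LogisticRegression().fit(meta_X_tr, y_tr)

meta_X_te = np.column_stack([expert_a.predict_proba(X_te[:, :4])[:, 1],
                             expert_b.predict_proba(X_te[:, 4:])[:, 1]])
print(meta.score(meta_X_te, y_te))
```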
We present a novel automated methodology to detect and classify periodic variable stars in a large database of photometric time series. The method is based on multivariate Bayesian statistics and uses a multi-stage approach. We applied it to the ground-based data of the TrES Lyr1 field, which is also observed by the Kepler satellite, covering ~26000 stars. We found many eclipsing binaries as well as classical non-radial pulsators, such as slowly pulsating B stars, Gamma Doradus, Beta Cephei and Delta Scuti stars. A few classical radial pulsators were also found.
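A simple analogue of multivariate Bayesian classification is to model each class with its own multivariate Gaussian in feature space and assign objects by posterior probability, as quadratic discriminant analysis does. The sketch below is only that analogue, on hypothetical synthetic features (e.g. dominant frequencies and amplitudes); the paper's multi-stage method is considerably richer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Hypothetical two-dimensional features for three well-separated classes.
X, y = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=1.0, random_state=0)

# Each class gets its own multivariate Gaussian; prediction is by the
# posterior probability of class membership.
clf = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
post = clf.predict_proba(X[:1])   # posterior over the three classes
print(post)                       # rows sum to 1
```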
The accurate automated classification of variable stars into their respective sub-types is difficult. Machine learning based solutions often fall foul of the imbalanced learning problem, which causes poor generalisation performance in practice, especially on rare variable star sub-types. In previous work, we attempted to overcome such deficiencies by developing a hierarchical machine learning classifier. This algorithm-level approach to tackling imbalance yielded promising results on Catalina Real-Time Survey (CRTS) data, outperforming the binary and multi-class classification schemes previously applied in this area. In this work, we attempt to further improve hierarchical classification performance by applying data-level approaches that directly augment the training data so that it better describes under-represented classes. We apply and report results for three data augmentation methods in particular: $\textit{R}$andomly $\textit{A}$ugmented $\textit{S}$ampled $\textit{L}$ight curves from magnitude $\textit{E}$rror ($\texttt{RASLE}$), augmenting light curves with Gaussian Process modelling ($\texttt{GpFit}$) and the Synthetic Minority Over-sampling Technique ($\texttt{SMOTE}$). When combining the algorithm-level approach (i.e. the hierarchical scheme) with the data-level approach, we further improve variable star classification accuracy by 1-4$\%$. We find that the highest classification rate is obtained when using $\texttt{GpFit}$ in the hierarchical model. Further improvement of the metric scores requires a better standard set of correctly identified variable stars and, perhaps, enhanced features.
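Of the three augmentation methods, SMOTE is the easiest to sketch: each synthetic minority-class sample is an interpolation between a real minority sample and one of its k nearest minority-class neighbours. The NumPy sketch below is a minimal SMOTE-style illustration on hypothetical feature vectors, not the implementation used in the work (which is typically done with a library such as imbalanced-learn).

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=5):
    """Minimal SMOTE-style oversampling: each synthetic sample interpolates
    between a minority-class point and one of its k nearest minority-class
    neighbours."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]        # k nearest neighbours (skip self)
        j = rng.choice(nbrs)
        lam = rng.random()                    # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Hypothetical feature vectors for a rare variable-star sub-type.
X_rare = rng.normal(loc=[1.0, 2.0], scale=0.1, size=(20, 2))
X_synth = smote_like(X_rare, n_new=80)
print(X_synth.shape)   # (80, 2)
```

By contrast, `RASLE` and `GpFit` operate on the light curves themselves (resampling within magnitude errors and drawing from a Gaussian Process fit, respectively) before features are computed.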
With the coming data deluge from synoptic surveys, there is a growing need for frameworks that can quickly and automatically produce calibrated classification probabilities for newly observed variables based on a small number of time-series measurements. In this paper, we introduce a methodology for variable-star classification, drawing from modern machine-learning techniques. We describe how to homogenize the information gleaned from light curves by the selection and computation of real-numbered metrics (features), detail methods to robustly estimate periodic light-curve features, introduce tree-ensemble methods for accurate variable star classification, and show how to rigorously evaluate the classification results using cross-validation. On a 25-class data set of 1542 well-studied variable stars, we achieve a 22.8% overall classification error using the random forest classifier; this represents a 24% improvement over the best previous classifier on these data. This methodology is effective for identifying samples of specific science classes: for pulsational variables used in Milky Way tomography we obtain a discovery efficiency of 98.2%, and for eclipsing systems we find an efficiency of 99.1%, both at 95% purity. We show that the random forest (RF) classifier is superior to other machine-learned methods in terms of accuracy, speed, and relative immunity to features with no useful class information; the RF classifier can also be used to estimate the importance of each feature in classification. Additionally, we present the first astronomical use of hierarchical classification methods to incorporate a known class taxonomy in the classifier, which further reduces the catastrophic error rate to 7.8%. Excluding low-amplitude sources, our overall error rate improves to 14%, with a catastrophic error rate of 3.5%.
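Hierarchical classification over a known taxonomy, as used here and in the CRTS work above, can be sketched as a two-stage scheme: a top-level classifier routes each object to a taxonomy node, and a per-node sub-classifier then assigns the leaf class. The sketch below uses synthetic features and a hypothetical two-node, four-leaf taxonomy purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Four hypothetical leaf classes grouped under two top-level taxonomy nodes:
# leaves {0, 1} -> node 0 (e.g. "pulsator"), leaves {2, 3} -> node 1 (e.g. "eclipsing").
X, y = make_classification(n_samples=800, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
y_top = (y >= 2).astype(int)

# Stage 1: classify into top-level taxonomy nodes.
top = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_top)

# Stage 2: one sub-classifier per node, trained only on that node's members.
subs = {node: RandomForestClassifier(n_estimators=100, random_state=0)
              .fit(X[y_top == node], y[y_top == node])
        for node in (0, 1)}

def predict_hier(x):
    """Route through the taxonomy: top-level node first, then leaf class."""
    node = top.predict(x)[0]
    return subs[node].predict(x)[0]

print(predict_hier(X[:1]))
```

Restricting each sub-classifier to a single taxonomy node is what limits catastrophic errors: a misclassification within a node stays among physically similar classes.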