Astronomy is increasingly encountering two fundamental truths: (1) the field is faced with the task of extracting useful information from extremely large, complex, and high-dimensional datasets; (2) the techniques of astroinformatics and astrostatistics are the only way to make this tractable and to bring the required level of sophistication to the analysis. Thus, an approach that provides these tools in a way that scales to these datasets is not just desirable, it is vital. The expertise required spans not just astronomy, but also computer science, statistics, and informatics. The contribution by Alex, a computer scientist and expert in machine learning, of expertise and of a large number of fast algorithms designed to scale to large datasets is extremely welcome. We focus in this discussion on the questions raised by the practical application of these algorithms to real astronomical datasets. That is, what is needed to maximally leverage their potential to improve the science return? This is not a trivial task. While computing and statistical expertise are required, so is astronomical expertise. Precedent has shown that, to date, the collaborations most productive in producing astronomical science results (e.g., the Sloan Digital Sky Survey) have either involved astronomers expert in computer science and/or statistics, or astronomers involved in close, long-term collaborations with experts in those fields. This does not mean that the astronomers are giving the most important input, but simply that their input is crucial in guiding the effort in the most fruitful directions and in coping with the issues raised by real data. Thus, the tools must be usable and understandable by those whose primary expertise is not computing or statistics, even though they may have quite extensive knowledge of those fields.
We review the current state of data mining and machine learning in astronomy. Data mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data and promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines; applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science; and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.
We have investigated a number of factors that can significantly affect the performance of machine learning techniques in classifying $\gamma$-ray sources detected by the Fermi Large Area Telescope (LAT). First, we show that a framework of automatic feature selection can construct a simple model with a small set of features that yields better performance than previous results. Second, because the training/test sets of certain $\gamma$-ray source classes are small, we suggest nested re-sampling and cross-validation for quantifying the statistical fluctuations of the quoted accuracy. We have also constructed a test set by cross-matching the identified active galactic nuclei (AGNs) and pulsars (PSRs) in the Fermi LAT eight-year point source catalog (4FGL) with the unidentified sources in the previous 3$^{\rm rd}$ Fermi LAT Source Catalog (3FGL). Using this cross-matched set, we show that some features used for building a classification model with the identified sources can suffer from covariate shift, which can result from various observational effects. This can hamper the actual performance when such a model is applied to classify unidentified sources. Using our framework, both the AGN/PSR and the young pulsar (YNG)/millisecond pulsar (MSP) classifiers are automatically updated to incorporate the new features and the enlarged training samples in the 4FGL catalog. Using a two-layer model with these updated classifiers, we have selected 20 promising MSP candidates with confidence scores $>98\%$ from the unidentified sources in the 4FGL catalog, which can provide inputs for a multi-wavelength identification campaign.
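A common first check for covariate shift is to compare, feature by feature, the distribution in the labelled (identified) sample with that in the sample the classifier will actually be applied to. The sketch below does this with a two-sample Kolmogorov-Smirnov statistic in plain Python. It is an illustration only and not necessarily the procedure used in the work summarized above; the feature names and the toy values are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    # The largest gap between two step-function ECDFs occurs at a sample point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def rank_features_by_shift(
    train: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (feature, KS statistic) pairs, most strongly shifted first."""
    scores = [
        (name, ks_statistic(train[name], target[name]))
        for name in train
        if name in target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy values for two hypothetical features (not taken from any catalog).
identified = {
    "spectral_index": [1.9, 2.1, 2.2, 2.4, 2.5],
    "variability_index": [40.0, 55.0, 80.0, 120.0, 300.0],
}
unidentified = {
    "spectral_index": [2.0, 2.1, 2.3, 2.4, 2.6],
    "variability_index": [10.0, 12.0, 15.0, 18.0, 25.0],
}

for feature, score in rank_features_by_shift(identified, unidentified):
    print(f"{feature}: KS = {score:.2f}")
```

A statistic near 0 means the two samples look alike in that feature, while a value near 1 means they barely overlap. Features that score high are candidates for removal or re-weighting before a model trained on identified sources is applied to unidentified ones.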
We present an analysis technique that uses the timing information of Cherenkov images from extensive air showers (EAS). Our emphasis is on distant (large core distance) gamma-ray-induced showers at multi-TeV energies. Specifically, combining pixel timing information with an improved direction reconstruction algorithm leads to improvements in angular and core resolution as large as ~40% and ~30%, respectively, when compared with the same algorithm without the use of timing. Above 10 TeV, this results in an angular resolution approaching 0.05 degrees, together with a core resolution better than ~15 m. The off-axis post-cut gamma-ray acceptance is energy dependent, and its full width at half maximum ranges from 4 degrees to 8 degrees. For shower directions that are up to ~6 degrees off-axis, the angular resolution achieved by using timing information is comparable, around 100 TeV, to the on-axis angular resolution. The telescope specifications and layout we describe here are geared towards energies above 10 TeV. However, the methods can in principle be applied to other energies, given suitable telescope parameters. The 5-telescope cell investigated in this study could initially pave the way for a larger array of sparsely spaced telescopes in an effort to push the collection area to >10 km$^2$. These results highlight the potential of a 'sparse array' approach in effectively opening up the energy range above 10 TeV.
We investigate star-galaxy classification for astronomical surveys in the context of four methods enabling the interpretation of black-box machine learning systems. The first is outputting and exploring the decision boundaries as given by decision tree based methods, which enables the visualization of the classification categories. Secondly, we investigate how the Mutual Information based Transductive Feature Selection (MINT) algorithm can be used to perform feature pre-selection. If one would like to provide only a small number of input features to a machine learning classification algorithm, feature pre-selection provides a method to determine which of the many possible input properties should be selected. Third is the use of the tree-interpreter package to enable popular decision tree based ensemble methods to be opened, visualized, and understood. This is done by additional analysis of the tree based model, determining not only which features are important to the model, but how important a feature is for a particular classification given its value. Lastly, we use decision boundaries from the model to revise an already existing method of classification, essentially asking the tree based method where decision boundaries are best placed and defining a new classification method. We showcase these techniques by applying them to the problem of star-galaxy separation using data from the Sloan Digital Sky Survey (hereafter SDSS). We use the output of MINT and the ensemble methods to demonstrate how more complex decision boundaries improve star-galaxy classification accuracy over the standard SDSS frames approach (reducing misclassifications by up to $\approx 33\%$). We then show how tree-interpreter can be used to explore how relevant each photometric feature is when making a classification on an object by object basis.
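The object-by-object feature relevance described above can be illustrated with a path decomposition of a single decision tree: the prediction for one object equals the root-node value (the bias) plus, for every split on the path to the leaf, the change in node value attributed to the feature that was split on. The sketch below is a minimal plain-Python version of that idea. It does not call the tree-interpreter package, and the tree, its thresholds, its node values, and the feature names are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical hand-built tree. "value" is the galaxy fraction among the
# training objects that reach a node; leaves carry only "value".
TREE = {
    "feature": "psf_minus_model", "threshold": 0.15, "value": 0.50,
    "left": {  # compact objects: psf_minus_model <= 0.15
        "feature": "r_mag", "threshold": 21.0, "value": 0.10,
        "left": {"value": 0.02},   # bright and compact
        "right": {"value": 0.40},  # faint and compact
    },
    "right": {"value": 0.95},      # clearly extended objects
}


def explain(tree: dict, obj: Dict[str, float]) -> Tuple[float, float, Dict[str, float]]:
    """Return (prediction, bias, per-feature contributions) for one object."""
    bias = tree["value"]
    contributions: Dict[str, float] = {}
    node = tree
    while "feature" in node:
        name = node["feature"]
        child = node["left"] if obj[name] <= node["threshold"] else node["right"]
        # Credit the change in node value to the feature used for this split.
        contributions[name] = contributions.get(name, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], bias, contributions


# A faint, compact object (toy values).
prediction, bias, contributions = explain(TREE, {"psf_minus_model": 0.05, "r_mag": 22.0})
print(f"P(galaxy) = {prediction:.2f}, bias = {bias:.2f}")
for name, delta in contributions.items():
    print(f"  {name}: {delta:+.2f}")
```

By construction, the bias plus the sum of the contributions reproduces the leaf value exactly. For an ensemble such as a random forest, the same decomposition can be averaged over the trees.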
How should we invest our available resources to best sustain astronomy's track record of discovery, established over the past few decades? Two strong hints come from (1) our history of astronomical discoveries and (2) literature citation patterns that reveal how discovery and development activities in science are strong functions of team size. These argue that progress in astronomy hinges on support for a diversity of research efforts in terms of team size, research tools and platforms, and investment strategies that encourage risk taking. These ideas also encourage us to examine the implications of the trend toward big team science and survey science in astronomy over the past few decades, and to reconsider the common assumption that progress in astronomy always means trading up to bigger apertures and facilities. Instead, the considerations above argue that we need a balanced set of investments in small- to large-scale initiatives and team sizes both large and small. Large teams tend to develop existing ideas, whereas small teams are more likely to fuel the future with disruptive discoveries. While large facilities are the value investments that are guaranteed to produce discoveries, smaller facilities are the growth stocks that are likely to deliver the biggest science bang per buck, sometimes with outsize returns. One way to foster the risk taking that fuels discovery is to increase observing opportunity, i.e., create more observing nights and facilitate the exploration of science-ready data.