No Arabic abstract
Machine Learning algorithms are good tools for both classification and prediction purposes. These algorithms can further be used for scientific discoveries from the enormous data being collected in our era. We present ways of discovering and understanding astronomical phenomena by applying machine learning algorithms to data collected with radio telescopes. We discuss the use of supervised machine learning algorithms to predict the free parameters of star formation histories and also better understand the relations between the different input and output parameters. We made use of Deep Learning to capture the non-linearity in the parameters. Our models are able to predict with low error rates and give the advantage of predicting in real time once the model has been trained. The other class of machine learning algorithms viz. unsupervised learning can prove to be very useful in finding patterns in the data. We explore how we use such unsupervised techniques on solar radio data to identify patterns and variations, and also link such findings to theories, which help to better understand the nature of the system being studied. We highlight the challenges faced in terms of data size, availability, features, processing ability and importantly, the interpretability of results. As our ability to capture and store data increases, increased use of machine learning to understand the underlying physics in the information captured seems inevitable.
Efficient identification and follow-up of astronomical transients is hindered by the need for humans to manually select promising candidates from data streams that contain many false positives. These artefacts arise in the difference images that are produced by most major ground-based time domain surveys with large format CCD cameras. This dependence on humans to reject bogus detections is unsustainable for next generation all-sky surveys and significant effort is now being invested to solve the problem computationally. In this paper we explore a simple machine learning approach to real-bogus classification by constructing a training set from the image data of ~32000 real astrophysical transients and bogus detections from the Pan-STARRS1 Medium Deep Survey. We derive our feature representation from the pixel intensity values of a 20x20 pixel stamp around the centre of the candidates. This differs from previous work in that it works directly on the pixels rather than catalogued domain knowledge for feature design or selection. Three machine learning algorithms are trained (artificial neural networks, support vector machines and random forests) and their performances are tested on a held-out subset of 25% of the training data. We find the best results from the random forest classifier and demonstrate that by accepting a false positive rate of 1%, the classifier initially suggests a missed detection rate of around 10%. However we also find that a combination of bright star variability, nuclear transients and uncertainty in human labelling means that our best estimate of the missed detection rate is approximately 6%.
Identification of anomalous light curves within time-domain surveys is often challenging. In addition, with the growing number of wide-field surveys and the volume of data produced exceeding astronomers ability for manual evaluation, outlier and anomaly detection is becoming vital for transient science. We present an unsupervised method for transient discovery using a clustering technique and the Astronomaly package. As proof of concept, we evaluate 85553 minute-cadenced light curves collected over two 1.5 hour periods as part of the Deeper, Wider, Faster program, using two different telescope dithering strategies. By combining the clustering technique HDBSCAN with the isolation forest anomaly detection algorithm via the visual interface of Astronomaly, we are able to rapidly isolate anomalous sources for further analysis. We successfully recover the known variable sources, across a range of catalogues from within the fields, and find a further 7 uncatalogued variables and two stellar flare events, including a rarely observed ultra fast flare (5 minute) from a likely M-dwarf.
The scientific community has been increasingly interested in harnessing the power of deep learning to solve various domain challenges. However, despite the effectiveness in building predictive models, fundamental challenges exist in extracting actionable knowledge from the deep neural network due to their opaque nature. In this work, we propose techniques for exploring the behavior of deep learning models by injecting domain-specific actionable concepts as tunable ``knobs in the analysis pipeline. By incorporating the domain knowledge with generative modeling, we are not only able to better understand the behavior of these black-box models, but also provide scientists with actionable insights that can potentially lead to fundamental discoveries.
The field of astronomy has arrived at a turning point in terms of size and complexity of both datasets and scientific collaboration. Commensurately, algorithms and statistical models have begun to adapt --- e.g., via the onset of artificial intelligence --- which itself presents new challenges and opportunities for growth. This white paper aims to offer guidance and ideas for how we can evolve our technical and collaborative frameworks to promote efficient algorithmic development and take advantage of opportunities for scientific discovery in the petabyte era. We discuss challenges for discovery in large and complex data sets; challenges and requirements for the next stage of development of statistical methodologies and algorithmic tool sets; how we might change our paradigms of collaboration and education; and the ethical implications of scientists contributions to widely applicable algorithms and computational modeling. We start with six distinct recommendations that are supported by the commentary following them. This white paper is related to a larger corpus of effort that has taken place within and around the Petabytes to Science Workshops (https://petabytestoscience.github.io/).
This paper reviews some of the challenges posed by the huge growth of experimental data generated by the new generation of large-scale experiments at UK national facilities at the Rutherford Appleton Laboratory site at Harwell near Oxford. Such Big Scientific Data comes from the Diamond Light Source and Electron Microscopy Facilities, the ISIS Neutron and Muon Facility, and the UKs Central Laser Facility. Increasingly, scientists are now needing to use advanced machine learning and other AI technologies both to automate parts of the data pipeline and also to help find new scientific discoveries in the analysis of their data. For commercially important applications, such as object recognition, natural language processing and automatic translation, deep learning has made dramatic breakthroughs. Googles DeepMind has now also used deep learning technology to develop their AlphaFold tool to make predictions for protein folding. Remarkably, they have been able to achieve some spectacular results for this specific scientific problem. Can deep learning be similarly transformative for other scientific problems? After a brief review of some initial applications of machine learning at the Rutherford Appleton Laboratory, we focus on challenges and opportunities for AI in advancing materials science. Finally, we discuss the importance of developing some realistic machine learning benchmarks using Big Scientific Data coming from a number of different scientific domains. We conclude with some initial examples of our SciML benchmark suite and of the research challenges these benchmarks will enable.