Do you want to publish a course? Click here

Robust Machine Learning Applied to Astronomical Datasets II: Quantifying Photometric Redshifts for Quasars Using Instance-Based Learning

63   0   0.0 ( 0 )
 Added by Nicholas M. Ball
 Publication date 2006
  fields Physics
and research's language is English




Ask ChatGPT about the research

We apply instance-based machine learning in the form of a k-nearest neighbor algorithm to the task of estimating photometric redshifts for 55,746 objects spectroscopically classified as quasars in the Fifth Data Release of the Sloan Digital Sky Survey. We compare the results obtained to those from an empirical color-redshift relation (CZR). In contrast to previously published results using CZRs, we find that the instance-based photometric redshifts are assigned with no regions of catastrophic failure. Remaining outliers are simply scattered about the ideal relation, in a similar manner to the pattern seen in the optical for normal galaxies at redshifts z < ~1. The instance-based algorithm is trained on a representative sample of the data and pseudo-blind-tested on the remaining unseen data. The variance between the photometric and spectroscopic redshifts is sigma^2 = 0.123 +/- 0.002 (compared to sigma^2 = 0.265 +/- 0.006 for the CZR), and 54.9 +/- 0.7%, 73.3 +/- 0.6%, and 80.7 +/- 0.3% of the objects are within delta z < 0.1, 0.2, and 0.3 respectively. We also match our sample to the Second Data Release of the Galaxy Evolution Explorer legacy data and the resulting 7,642 objects show a further improvement, giving a variance of sigma^2 = 0.054 +/- 0.005, and 70.8 +/- 1.2%, 85.8 +/- 1.0%, and 90.8 +/- 0.7% of objects within delta z < 0.1, 0.2, and 0.3. We show that the improvement is indeed due to the extra information provided by GALEX, by training on the same dataset using purely SDSS photometry, which has a variance of sigma^2 = 0.090 +/- 0.007. Each set of results represents a realistic standard for application to further datasets for which the spectra are representative.



rate research

Read More

109 - Nicholas M. Ball 2008
We apply machine learning in the form of a nearest neighbor instance-based algorithm (NN) to generate full photometric redshift probability density functions (PDFs) for objects in the Fifth Data Release of the Sloan Digital Sky Survey (SDSS DR5). We use a conceptually simple but novel application of NN to generate the PDFs - perturbing the object colors by their measurement error - and using the resulting instances of nearest neighbor distributions to generate numerous individual redshifts. When the redshifts are compared to existing SDSS spectroscopic data, we find that the mean value of each PDF has a dispersion between the photometric and spectroscopic redshift consistent with other machine learning techniques, being sigma = 0.0207 +/- 0.0001 for main sample galaxies to r < 17.77 mag, sigma = 0.0243 +/- 0.0002 for luminous red galaxies to r < ~19.2 mag, and sigma = 0.343 +/- 0.005 for quasars to i < 20.3 mag. The PDFs allow the selection of subsets with improved statistics. For quasars, the improvement is dramatic: for those with a single peak in their probability distribution, the dispersion is reduced from 0.343 to sigma = 0.117 +/- 0.010, and the photometric redshift is within 0.3 of the spectroscopic redshift for 99.3 +/- 0.1% of the objects. Thus, for this optical quasar sample, we can virtually eliminate catastrophic photometric redshift estimates. In addition to the SDSS sample, we incorporate ultraviolet photometry from the Third Data Release of the Galaxy Evolution Explorer All-Sky Imaging Survey (GALEX AIS GR3) to create PDFs for objects seen in both surveys. For quasars, the increased coverage of the observed frame UV of the SED results in significant improvement over the full SDSS sample, with sigma = 0.234 +/- 0.010. We demonstrate that this improvement is genuine. [Abridged]
119 - Nicholas M. Ball 2008
We present recent results from the LCDM (Laboratory for Cosmological Data Mining; http://lcdm.astro.uiuc.edu) collaboration between UIUC Astronomy and NCSA to deploy supercomputing cluster resources and machine learning algorithms for the mining of terascale astronomical datasets. This is a novel application in the field of astronomy, because we are using such resources for data mining, and not just performing simulations. Via a modified implementation of the NCSA cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million stars and galaxies in the Sloan Digital Sky Survey, improved distance measures, and a full exploitation of the simple but powerful k-nearest neighbor algorithm. A driving principle of this work is that our methods should be extensible from current terascale datasets to upcoming petascale datasets and beyond. We discuss issues encountered to-date, and further issues for the transition to petascale. In particular, disk I/O will become a major limiting factor unless the necessary infrastructure is implemented.
We present recent results from the Laboratory for Cosmological Data Mining (http://lcdm.astro.uiuc.edu) at the National Center for Supercomputing Applications (NCSA) to provide robust classifications and photometric redshifts for objects in the terascale-class Sloan Digital Sky Survey (SDSS). Through a combination of machine learning in the form of decision trees, k-nearest neighbor, and genetic algorithms, the use of supercomputing resources at NCSA, and the cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million objects in the SDSS, improved photometric redshifts, and a full exploitation of the powerful k-nearest neighbor algorithm. This work is the first to apply the full power of these algorithms to contemporary terascale astronomical datasets, and the improvement over existing results is demonstrable. We discuss issues that we have encountered in dealing with data on the terascale, and possible solutions that can be implemented to deal with upcoming petascale datasets.
We provide classifications for all 143 million non-repeat photometric objects in the Third Data Release of the Sloan Digital Sky Survey (SDSS) using decision trees trained on 477,068 objects with SDSS spectroscopic data. We demonstrate that these star/galaxy classifications are expected to be reliable for approximately 22 million objects with r < ~20. The general machine learning environment Data-to-Knowledge and supercomputing resources enabled extensive investigation of the decision tree parameter space. This work presents the first public release of objects classified in this way for an entire SDSS data release. The objects are classified as either galaxy, star or nsng (neither star nor galaxy), with an associated probability for each class. To demonstrate how to effectively make use of these classifications, we perform several important tests. First, we detail selection criteria within the probability space defined by the three classes to extract samples of stars and galaxies to a given completeness and efficiency. Second, we investigate the efficacy of the classifications and the effect of extrapolating from the spectroscopic regime by performing blind tests on objects in the SDSS, 2dF Galaxy Redshift and 2dF QSO Redshift (2QZ) surveys. Given the photometric limits of our spectroscopic training data, we effectively begin to extrapolate past our star-galaxy training set at r ~ 18. By comparing the number counts of our training sample with the classified sources, however, we find that our efficiencies appear to remain robust to r ~ 20. As a result, we expect our classifications to be accurate for 900,000 galaxies and 6.7 million stars, and remain robust via extrapolation for a total of 8.0 million galaxies and 13.9 million stars. [Abridged]
We present an analysis of anomaly detection for machine learning redshift estimation. Anomaly detection allows the removal of poor training examples, which can adversely influence redshift estimates. Anomalous training examples may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies with one or more poorly measured photometric quantity. We select 2.5 million clean SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730 anomalous galaxies with spectroscopic redshift measurements which are flagged as unreliable. We contaminate the clean base galaxy sample with galaxies with unreliable redshifts and attempt to recover the contaminating galaxies using the Elliptical Envelope technique. We then train four machine learning architectures for redshift analysis on both the contaminated sample and on the preprocessed anomaly-removed sample and measure redshift statistics on a clean validation sample generated without any preprocessing. We find an improvement on all measured statistics of up to 80% when training on the anomaly removed sample as compared with training on the contaminated sample for each of the machine learning routines explored. We further describe a method to estimate the contamination fraction of a base data sample.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا