We present an analysis of anomaly detection for machine learning redshift estimation. Anomaly detection allows the removal of poor training examples, which can adversely influence redshift estimates. Anomalous training examples may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies with one or more poorly measured photometric quantities. We select 2.5 million clean SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730 anomalous galaxies with spectroscopic redshift measurements that are flagged as unreliable. We contaminate the clean base galaxy sample with galaxies with unreliable redshifts and attempt to recover the contaminating galaxies using the Elliptical Envelope technique. We then train four machine learning architectures for redshift analysis on both the contaminated sample and the preprocessed anomaly-removed sample, and measure redshift statistics on a clean validation sample generated without any preprocessing. For each of the machine learning routines explored, we find an improvement of up to 80% on all measured statistics when training on the anomaly-removed sample rather than on the contaminated sample. We further describe a method to estimate the contamination fraction of a base data sample.
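The contaminate-then-recover experiment described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses a plain (non-robust) Gaussian fit and Mahalanobis distances on two toy features, whereas an Elliptical Envelope implementation such as scikit-learn's `EllipticEnvelope` fits a robust covariance estimate. The function names, the toy data, and the assumption that the contamination fraction is known in advance are all illustrative.

```python
import random


def fit_gaussian_2d(points):
    """Return the mean and inverse covariance of 2D points (non-robust fit)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv


def mahalanobis_sq(p, mean, inv):
    """Squared Mahalanobis distance of point p from the fitted Gaussian."""
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


def flag_anomalies(points, contamination):
    """Flag the `contamination` fraction of points with the largest distances."""
    mean, inv = fit_gaussian_2d(points)
    d2 = [mahalanobis_sq(p, mean, inv) for p in points]
    n_flag = int(round(contamination * len(points)))
    order = sorted(range(len(points)), key=lambda i: d2[i], reverse=True)
    return set(order[:n_flag])


# Toy data (assumption): 500 "clean" objects plus 10 offset contaminants.
rng = random.Random(0)
clean = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
contaminants = [(rng.gauss(8.0, 0.5), rng.gauss(-8.0, 0.5)) for _ in range(10)]
sample = clean + contaminants

flagged = flag_anomalies(sample, contamination=len(contaminants) / len(sample))
# Contaminants occupy the last indices, so count how many were recovered.
recovered = sum(1 for i in flagged if i >= len(clean))
print(f"recovered {recovered} of {len(contaminants)} contaminants")
```

The flagged indices would then be dropped from the training set before fitting a redshift estimator. Real contaminants are rarely this well separated from the clean sample, which is one reason a robust covariance estimate is normally preferred.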
We apply machine learning in the form of a nearest neighbor instance-based algorithm (NN) to generate full photometric redshift probability density functions (PDFs) for objects in the Fifth Data Release of the Sloan Digital Sky Survey (SDSS DR5). We use a conceptually simple but novel application of NN to generate the PDFs: we perturb the object colors by their measurement errors and use the resulting instances of nearest neighbor distributions to generate numerous individual redshifts. When the redshifts are compared to existing SDSS spectroscopic data, we find that the mean value of each PDF has a dispersion between the photometric and spectroscopic redshift consistent with other machine learning techniques, being sigma = 0.0207 +/- 0.0001 for main sample galaxies to r < 17.77 mag, sigma = 0.0243 +/- 0.0002 for luminous red galaxies to r < ~19.2 mag, and sigma = 0.343 +/- 0.005 for quasars to i < 20.3 mag. The PDFs allow the selection of subsets with improved statistics. For quasars, the improvement is dramatic: for those with a single peak in their probability distribution, the dispersion is reduced from 0.343 to sigma = 0.117 +/- 0.010, and the photometric redshift is within 0.3 of the spectroscopic redshift for 99.3 +/- 0.1% of the objects. Thus, for this optical quasar sample, we can virtually eliminate catastrophic photometric redshift estimates. In addition to the SDSS sample, we incorporate ultraviolet photometry from the Third Data Release of the Galaxy Evolution Explorer All-Sky Imaging Survey (GALEX AIS GR3) to create PDFs for objects seen in both surveys. For quasars, the increased coverage of the observed-frame UV of the SED results in significant improvement over the full SDSS sample, with sigma = 0.234 +/- 0.010. We demonstrate that this improvement is genuine. [Abridged]
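The perturbation idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published code: it assumes Gaussian color errors, a brute-force Euclidean nearest-neighbor search, and a simple histogram as the PDF estimate. All function names and the synthetic training set are hypothetical.

```python
import random


def nearest_redshift(colors, training):
    """Return the redshift of the training object closest in color space."""
    best = min(
        training,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(colors, t[0])),
    )
    return best[1]


def redshift_samples(colors, errors, training, n_draws=100, seed=0):
    """Perturb the colors by their errors and collect one NN redshift per draw."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        perturbed = [rng.gauss(c, e) for c, e in zip(colors, errors)]
        samples.append(nearest_redshift(perturbed, training))
    return samples


def histogram_pdf(samples, z_min, z_max, n_bins):
    """Bin the sampled redshifts into a normalized histogram (a crude PDF)."""
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for z in samples:
        if z_min <= z <= z_max:
            counts[min(int((z - z_min) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts]


# Synthetic training set (assumption): two colors that scale with redshift.
training = [([0.1 * i, 0.05 * i], 0.02 * i) for i in range(50)]

samples = redshift_samples([1.0, 0.5], [0.05, 0.05], training, n_draws=200)
pdf = histogram_pdf(samples, z_min=0.0, z_max=1.0, n_bins=20)
z_mean = sum(samples) / len(samples)  # mean of the PDF as a point estimate
print(f"mean photometric redshift: {z_mean:.3f}")
```

A PDF built this way can be inspected for the number of peaks, which is the kind of selection the abstract uses to isolate quasars with more reliable estimates.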
We apply instance-based machine learning in the form of a k-nearest neighbor algorithm to the task of estimating photometric redshifts for 55,746 objects spectroscopically classified as quasars in the Fifth Data Release of the Sloan Digital Sky Survey. We compare the results obtained to those from an empirical color-redshift relation (CZR). In contrast to previously published results using CZRs, we find that the instance-based photometric redshifts are assigned with no regions of catastrophic failure. Remaining outliers are simply scattered about the ideal relation, in a similar manner to the pattern seen in the optical for normal galaxies at redshifts z < ~1. The instance-based algorithm is trained on a representative sample of the data and pseudo-blind-tested on the remaining unseen data. The variance between the photometric and spectroscopic redshifts is sigma^2 = 0.123 +/- 0.002 (compared to sigma^2 = 0.265 +/- 0.006 for the CZR), and 54.9 +/- 0.7%, 73.3 +/- 0.6%, and 80.7 +/- 0.3% of the objects are within delta z < 0.1, 0.2, and 0.3 respectively. We also match our sample to the Second Data Release of the Galaxy Evolution Explorer legacy data and the resulting 7,642 objects show a further improvement, giving a variance of sigma^2 = 0.054 +/- 0.005, and 70.8 +/- 1.2%, 85.8 +/- 1.0%, and 90.8 +/- 0.7% of objects within delta z < 0.1, 0.2, and 0.3. We show that the improvement is indeed due to the extra information provided by GALEX, by training on the same dataset using purely SDSS photometry, which has a variance of sigma^2 = 0.090 +/- 0.007. Each set of results represents a realistic standard for application to further datasets for which the spectra are representative.
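As a rough illustration of the train-and-blind-test procedure and of the reported statistics, the sketch below runs a brute-force k-nearest-neighbor regression on synthetic data. The value of k, the definition of the variance as the mean squared residual, the helper names, and the toy color-redshift relation are assumptions made for the example and are not taken from the paper.

```python
import random


def knn_predict(colors, train, k=5):
    """Average the redshifts of the k training objects nearest in color space."""
    nearest = sorted(
        train,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(colors, t[0])),
    )[:k]
    return sum(z for _, z in nearest) / k


def evaluate(test, train, k=5, thresholds=(0.1, 0.2, 0.3)):
    """Return the mean squared residual and the fraction within each |delta z|."""
    residuals = [knn_predict(c, train, k) - z for c, z in test]
    variance = sum(r * r for r in residuals) / len(residuals)
    fractions = {
        t: sum(abs(r) < t for r in residuals) / len(residuals)
        for t in thresholds
    }
    return variance, fractions


# Synthetic sample (assumption): two noisy colors that track redshift.
rng = random.Random(1)
objects = []
for _ in range(400):
    z = rng.uniform(0.0, 3.0)
    objects.append(([z + rng.gauss(0, 0.05), 0.5 * z + rng.gauss(0, 0.05)], z))

# Train on one part of the sample and test on the unseen remainder.
train, test = objects[:300], objects[300:]
variance, fractions = evaluate(test, train)
print(f"sigma^2 = {variance:.4f}")
for t, f in fractions.items():
    print(f"|delta z| < {t}: {100 * f:.1f}%")
```

Real quasar colors are degenerate in redshift, so a genuine sample would show far more scatter than this toy relation.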
We estimated photometric redshifts (zphot) for more than 1.1 million galaxies of the ESO Public Kilo-Degree Survey (KiDS) Data Release 2. KiDS is an optical wide-field imaging survey carried out with the VLT Survey Telescope (VST) and the OmegaCAM camera, which aims at tackling open questions in cosmology and galaxy evolution, such as the origin of dark energy and the channel of galaxy mass growth. We present a catalogue of photometric redshifts obtained using the Multi Layer Perceptron with Quasi Newton Algorithm (MLPQNA) model, provided within the framework of the DAta Mining and Exploration Web Application REsource (DAMEWARE). These photometric redshifts are based on a spectroscopic knowledge base which was obtained by merging spectroscopic datasets from GAMA (Galaxy And Mass Assembly) data release 2 and SDSS-III data release 9. The overall 1 sigma uncertainty on Delta z = (zspec - zphot) / (1 + zspec) is ~0.03, with a very small average bias of ~0.001, an NMAD of ~0.02 and a fraction of catastrophic outliers (|Delta z| > 0.15) of ~0.4%.
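The quality statistics quoted above can be computed as in the following sketch. It uses the Delta z definition given in the abstract; the NMAD is taken as 1.4826 times the median absolute deviation about the median, which is a common convention and an assumption here, since the abstract does not spell it out. The function name and the five example redshifts are invented for illustration.

```python
import statistics


def photoz_metrics(z_spec, z_phot, outlier_cut=0.15):
    """Summarize Delta z = (zspec - zphot) / (1 + zspec) for a test sample."""
    dz = [(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]
    median = statistics.median(dz)
    return {
        "bias": statistics.mean(dz),
        "sigma": statistics.pstdev(dz),
        # Assumed convention: 1.4826 * median(|dz - median(dz)|).
        "nmad": 1.4826 * statistics.median([abs(d - median) for d in dz]),
        "outlier_fraction": sum(abs(d) > outlier_cut for d in dz) / len(dz),
    }


# Invented example values; the last object is a catastrophic outlier.
z_spec = [0.1, 0.2, 0.3, 0.4, 0.5]
z_phot = [0.11, 0.19, 0.30, 0.42, 1.00]

metrics = photoz_metrics(z_spec, z_phot)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the single outlier inflates the standard deviation far more than the NMAD, which is why both statistics are usually reported together.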
The scientific value of the next generation of large continuum surveys would be greatly increased if the redshifts of the newly detected sources could be rapidly and reliably estimated. Given the observational expense of obtaining spectroscopic redshifts for the large number of new detections expected, there has been substantial recent work on using machine learning techniques to obtain photometric redshifts. Here we compare the accuracy of the predicted photometric redshifts obtained from Deep Learning (DL) with the k-Nearest Neighbour (kNN) and the Decision Tree Regression (DTR) algorithms. Using a combination of near-infrared, visible and ultraviolet magnitudes, and training on a sample of SDSS QSOs, we find that the kNN and DL algorithms produce the best self-validation result, with a standard deviation of sigma = 0.24. Testing on various sub-samples, we find that the DL algorithm generally has lower values of sigma, in addition to exhibiting a better performance in other measures. Our DL method, which uses an easy-to-implement off-the-shelf algorithm with no filtering or removal of outliers, performs similarly to other, more complex, algorithms, resulting in an accuracy of Delta z < 0.1 up to z ~ 2.5. Applying the DL algorithm trained on our 70,000-strong sample to other independent (radio-selected) datasets, we find sigma < 0.36 over a wide range of radio flux densities. This indicates much potential in using this method to determine photometric redshifts of quasars detected with the Square Kilometre Array.
The Epoch of Reionization (EoR) features a rich interplay between the first luminous sources and the low-density gas of the intergalactic medium (IGM), where photons from these sources ionize the IGM. There are currently few observational constraints on key observables related to the EoR, such as the midpoint and duration of reionization. Although upcoming observations of the 21 cm power spectrum with next-generation radio interferometers such as the Hydrogen Epoch of Reionization Array (HERA) and the Square Kilometre Array (SKA) are expected to readily provide information about the midpoint of reionization, extracting the duration from the power spectrum alone is a more difficult proposition. As an alternative method for extracting information about reionization, we present an application of convolutional neural networks (CNNs) to images of reionization. These images are two-dimensional in the plane of the sky, and are extracted at a series of redshift values to generate image cubes that are qualitatively similar to those that HERA and the SKA will generate in the near future. Additionally, we include the impact that the bright foreground signal from the Milky Way imparts on such image cubes from interferometers, but do not include the noise induced by observations. We show that we are able to recover the duration of reionization $\Delta z$ to within 5% using CNNs, assuming that the midpoint of reionization is already relatively well constrained. These results have exciting implications for estimating $\tau$, the optical depth to the cosmic microwave background, which can help constrain other cosmological parameters.