We present results of using individual galaxies' redshift probability information, derived from a photometric redshift (photo-z) algorithm, SPIDERz, to identify potential catastrophic outliers in photometric redshift determinations. Using two test data sets composed of COSMOS multi-band photometry spanning a wide redshift range (0<z<4), matched with reliable spectroscopic or other redshift determinations, we explore the efficacy of a novel method to flag potential catastrophic outliers in analyses that rely on accurate photometric redshifts. SPIDERz is a custom support vector machine classification algorithm for photo-z analysis that naturally outputs a distribution of redshift probability information for each galaxy in addition to a discrete most probable photo-z value. By applying an analytic technique with flagging criteria to identify probability distribution features characteristic of catastrophic outlier photo-z estimates, such as multiple redshift probability peaks separated by substantial redshift distances, we can flag potential catastrophic outliers in photo-z determinations. We find that, depending on parameter choices, our proposed method can correctly flag a large fraction (>50%) of the catastrophic outlier galaxies while flagging only a small fraction (<5%) of the total non-outlier galaxies. The fraction of non-outlier galaxies flagged varies significantly with redshift and magnitude, however. We examine the performance of this strategy in photo-z determinations using a range of flagging parameter values. These results could be useful for the utilization of photometric redshifts in future large-scale surveys, where catastrophic outliers are particularly detrimental to the science goals.
We present results of using individual galaxies' probability distributions over redshift as a method of identifying potential catastrophic outliers in empirical photometric redshift estimation. In the course of developing this approach we introduce a method of modifying the redshift distribution of training sets to improve both the baseline accuracy of high-redshift (z>1.5) estimation and catastrophic outlier mitigation. We demonstrate these methods using two real test data sets and one simulated test data set spanning a wide redshift range (0<z<4). The results presented here inform an example prescription that can be applied in a realistic photometric redshift estimation scenario for a hypothetical large-scale survey. We find that with appropriate optimization, we can identify a significant percentage (>30%) of catastrophic outlier galaxies while incorrectly flagging only a small percentage (<7%, and in many cases <3%) of non-outlier galaxies as catastrophic outliers. We also find that our training set redshift distribution modification yields a significant (>10) percentage point decrease in outlier galaxies at z>1.5, with only a small (<3) percentage point increase in outlier galaxies at z<1.5, compared to the unmodified training set. In addition, we find that this modification can in some cases produce a significant (~20) percentage point decrease in galaxies which are non-outliers but have been incorrectly identified as outliers, while in other cases it causes only a small (<1) percentage point increase in this metric.
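The flagging idea described in the two abstracts above (multiple redshift probability peaks separated by a substantial redshift distance) can be sketched as follows. This is an illustrative reconstruction, not the SPIDERz implementation; the function name, the peak-strength fraction `peak_frac`, and the separation threshold `dz_min` are assumed parameters, not the papers' values.

```python
import numpy as np

def flag_outlier_candidate(z_grid, pz, peak_frac=0.2, dz_min=1.0):
    """Flag a galaxy as a potential catastrophic photo-z outlier.

    A galaxy is flagged when its redshift probability distribution
    has two or more local maxima, each at least `peak_frac` of the
    height of the tallest peak, separated by more than `dz_min` in
    redshift.  Thresholds here are illustrative assumptions.
    """
    pz = np.asarray(pz, dtype=float)
    pz = pz / pz.sum()  # normalise to unit total probability
    # indices of local maxima (strictly higher than both neighbours)
    interior = (pz[1:-1] > pz[:-2]) & (pz[1:-1] > pz[2:])
    peaks = np.where(interior)[0] + 1
    if peaks.size < 2:
        return False
    # keep only peaks carrying a non-negligible share of probability
    strong = peaks[pz[peaks] >= peak_frac * pz[peaks].max()]
    if strong.size < 2:
        return False
    # flag if the strong peaks span a large redshift interval
    return (z_grid[strong].max() - z_grid[strong].min()) > dz_min
```

A bimodal P(z) with peaks at, say, z~0.5 and z~3.0 would be flagged, while a single-peaked P(z) would not; tuning `peak_frac` and `dz_min` trades outlier completeness against the fraction of non-outliers incorrectly flagged, as the abstracts describe.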
We demonstrate that highly accurate joint redshift-stellar mass probability distribution functions (PDFs) can be obtained using the Random Forest (RF) machine learning (ML) algorithm, even with few photometric bands available. As an example, we use the Dark Energy Survey (DES), combined with the COSMOS2015 catalogue for redshifts and stellar masses. We build two ML models: one containing deep photometry in the $griz$ bands, and a second reflecting the photometric scatter present in the main DES survey, with carefully constructed representative training data in each case. We validate our joint PDFs for $10,699$ test galaxies by utilizing the copula probability integral transform and the Kendall distribution function, and their univariate counterparts to validate the marginals. Benchmarked against a basic set-up of the template-fitting code BAGPIPES, our ML-based method outperforms template fitting on all of our predefined performance metrics. In addition to accuracy, the RF is extremely fast, able to compute joint PDFs for a million galaxies in just under $6$ min on consumer-grade computer hardware. Such speed enables PDFs to be derived in real time within analysis codes, solving potential storage issues. As part of this work we have developed GALPRO, a highly intuitive and efficient Python package to rapidly generate multivariate PDFs on-the-fly. GALPRO is documented and available for researchers to use in their cosmology and galaxy evolution studies.
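One common way to obtain joint PDFs from a random forest, in the spirit of the abstract above, is to treat the spread of individual trees' predictions as a Monte Carlo sample of the joint (redshift, stellar mass) distribution. The sketch below is an assumption about the general approach, not the GALPRO implementation; the mock data, feature construction, and all parameter values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Mock photometry-like features and mock (redshift, log stellar mass)
# targets; purely synthetic, for illustration only.
rng = np.random.default_rng(0)
n = 2000
colours = rng.uniform(0, 2, size=(n, 4))               # mock griz-like features
z_true = colours[:, 0] + 0.1 * rng.normal(size=n)      # mock redshift
m_true = 9 + colours[:, 1] + 0.1 * rng.normal(size=n)  # mock log M*

# Multi-output forest: each tree predicts (z, M*) jointly.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(colours[:1500], np.column_stack([z_true, m_true])[:1500])

def joint_pdf_samples(rf, x):
    """Per-tree predictions for one galaxy: an (n_trees, 2) array
    approximating a sample from the joint redshift-stellar-mass PDF."""
    return np.array([tree.predict(x.reshape(1, -1))[0]
                     for tree in rf.estimators_])

samples = joint_pdf_samples(rf, colours[1500])
```

A 2D kernel density estimate or histogram over `samples` then gives the joint PDF for that galaxy; the per-tree loop is cheap, which is consistent with the speed the abstract reports for forest-based PDF generation.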
MiniJPAS is a ~1 deg^2 imaging survey of the AEGIS field in 60 bands, performed to demonstrate the scientific potential of the upcoming J-PAS survey. Full coverage of the 3800-9100 AA range with 54 narrow and 6 broad optical filters allows for extremely accurate photo-z, which, applied over thousands of deg^2, will enable new applications of the photo-z technique such as the measurement of baryonic acoustic oscillations. In this paper we describe the method used to obtain the photo-z included in the publicly available miniJPAS catalogue and characterise the photo-z performance. We build 100 AA resolution photo-spectra from the PSF-corrected forced-aperture photometry. Systematic offsets in the photometry are corrected by applying magnitude shifts obtained through iterative fitting with stellar population synthesis models. We compute photo-z with a customised version of LePhare, using a set of templates optimised for the J-PAS filter set. We analyse the accuracy of the miniJPAS photo-z and their dependence on multiple quantities using a subsample of 5,266 galaxies with spectroscopic redshifts from SDSS and DEEP, which we find to be representative of the whole r<23 miniJPAS sample. Formal uncertainties for the photo-z calculated with the Δχ² method underestimate the actual redshift errors. The odds parameter has the strongest correlation with |Δz|, and accurately reproduces the probability of a redshift outlier (|Δz|>0.03) irrespective of the magnitude, redshift, or spectral type of the sources. We show that the two main summary statistics characterising the photo-z accuracy of a galaxy population (σ_NMAD and η) can be predicted from the distribution of odds in that population, and use this to estimate them for the whole miniJPAS sample. At r<23 there are 17,500 galaxies/deg^2 with valid photo-z estimates, of which 4,200 are expected to have |Δz|<0.003 (abridged).
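The odds parameter mentioned above is conventionally the fraction of a galaxy's redshift probability enclosed in a window around the best-fitting photo-z. A minimal sketch, assuming a normalised P(z) on a uniform grid; the window half-width and the (1+z) scaling are common conventions assumed here, not values taken from the paper:

```python
import numpy as np

def odds(z_grid, pz, z_best, window=0.03):
    """Fraction of redshift probability within +/- window*(1+z_best)
    of the best-fitting photo-z.  The window value mirrors the
    |Dz|>0.03 outlier threshold quoted in the abstract; the (1+z)
    scaling is an assumption, not necessarily the paper's choice."""
    pz = np.asarray(pz, dtype=float)
    pz = pz / pz.sum()  # normalise to unit total probability
    mask = np.abs(z_grid - z_best) <= window * (1 + z_best)
    return pz[mask].sum()
```

A sharply peaked P(z) yields odds near 1, a broad or multi-modal P(z) yields a low value, which is why the odds distribution of a population can predict summary statistics such as the outlier rate.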
We present redshift probability distributions for galaxies in the SDSS DR8 imaging data. We used the nearest-neighbor weighting algorithm presented in Lima et al. 2008 and Cunha et al. 2009 to derive the ensemble redshift distribution N(z) and individual redshift probability distributions P(z) for galaxies with r < 21.8. As part of this technique, we calculated weights for a set of training galaxies with known redshifts such that their density distribution in five-dimensional color-magnitude space was proportional to that of the photometry-only sample, producing a nearly fair sample in that space. We then estimated the ensemble N(z) of the photometric sample by constructing a weighted histogram of the training set redshifts. We derived P(z)'s for individual objects using the same technique, but restricting to training set objects from the local color-magnitude space around each photometric object. Using the P(z) for each galaxy, rather than an ensemble N(z), can reduce the statistical error in measurements that depend on the redshifts of individual galaxies. The spectroscopic training sample is substantially larger than that used for the DR7 release, and the newly added PRIMUS catalog is now the most important training set used in this analysis by a wide margin. We expect the primary source of error in the N(z) reconstruction to be sample variance: the training sets are drawn from relatively small volumes of space. Using simulations we estimate that the uncertainty in N(z) at a given redshift is 10-15%. The uncertainty on calculations incorporating N(z) or P(z) depends on how they are used; we discuss the case of weak lensing measurements. The P(z) catalog is publicly available from the SDSS website.
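The local-neighbourhood step described above can be sketched as follows: take the training galaxies closest to a photometric object in color-magnitude space and form a weighted histogram of their redshifts as its P(z). This is an illustrative sketch of the idea, not the Lima et al. weighting implementation; the brute-force distance computation, the choice of k, and the binning are all assumptions.

```python
import numpy as np

def local_pz(train_features, train_z, train_w, query, k=100, bins=None):
    """P(z) for one photometric object via its k nearest training
    neighbours in color-magnitude space, weighted by the training-set
    weights.  k and the binning are illustrative choices.
    """
    # brute-force squared Euclidean distances in feature space
    d2 = ((train_features - query) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]
    if bins is None:
        bins = np.linspace(0, train_z.max(), 41)
    # weighted redshift histogram of the neighbours
    hist, edges = np.histogram(train_z[idx], bins=bins,
                               weights=train_w[idx])
    pz = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, pz
```

Summing these weighted histograms over all training galaxies (rather than a local neighbourhood) recovers the ensemble N(z) estimate described in the abstract.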
Obtaining accurate photometric redshift estimations is an important aspect of cosmology, remaining a prerequisite of many analyses. In creating novel methods to produce redshift estimations, there has been a shift towards using machine learning techniques. However, there has not been as much focus on how well different machine learning methods scale or perform with the ever-increasing amounts of data being produced. Here, we introduce a benchmark designed to analyse the performance and scalability of different supervised machine learning methods for photometric redshift estimation. Making use of the Sloan Digital Sky Survey (SDSS DR12) dataset, we analysed a variety of the most widely used machine learning algorithms. By scaling the number of galaxies used to train and test the algorithms up to one million, we obtained several metrics demonstrating the algorithms' performance and scalability for this task. Furthermore, by introducing a new optimisation method, time-considered optimisation, we were able to demonstrate how a small concession of error can allow for a great improvement in efficiency. Of the algorithms tested, we found that the Random Forest performed best in terms of error, with a mean squared error MSE = 0.0042; however, as other algorithms such as Boosted Decision Trees and k-Nearest Neighbours performed very similarly, we used our benchmarks to demonstrate how different algorithms could be superior in different scenarios. We believe benchmarks such as this will become even more vital with upcoming surveys, such as LSST, which will capture billions of galaxies requiring photometric redshifts.
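The trade-off at the heart of time-considered optimisation, conceding a small amount of error in exchange for a large runtime gain, can be sketched as a simple selection rule over benchmark results. The function below is an illustrative reading of the idea, not the paper's algorithm; the tolerance value and the example error/runtime numbers are assumptions (the RF error echoes the MSE quoted in the abstract, the runtimes are invented).

```python
def time_considered_choice(results, tol=0.05):
    """Among all configurations whose error is within a factor
    (1 + tol) of the best error, return the fastest.  `results` maps
    a configuration name to an (error, runtime_seconds) pair; the
    tolerance is an illustrative assumption."""
    best_err = min(err for err, _ in results.values())
    eligible = {name: runtime for name, (err, runtime) in results.items()
                if err <= best_err * (1 + tol)}
    return min(eligible, key=eligible.get)

# Hypothetical benchmark outcomes: RF is most accurate but slow,
# k-Nearest Neighbours is nearly as accurate and much faster.
results = {"RandomForest": (0.0042, 120.0),
           "kNN": (0.0043, 5.0),
           "BoostedTrees": (0.0060, 30.0)}
```

With a 5% error tolerance this rule would pick kNN over the marginally more accurate but far slower Random Forest, which is the kind of scenario-dependent choice the benchmark is designed to expose.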