140 - Nicholas M. Ball 2013
This is a companion Focus Demonstration article to the CANFAR+Skytree poster (Ball 2012), demonstrating the usage of the Skytree machine learning software on the Canadian Advanced Network for Astronomical Research (CANFAR) cloud computing system. CANFAR+Skytree is the world's first cloud computing system for data mining in astronomy.
411 - Nicholas M. Ball 2013
At the Canadian Astronomy Data Centre, we have combined our cloud computing system, CANFAR, with the world's most advanced machine learning software, Skytree, to create the world's first cloud computing system for data mining in astronomy. CANFAR provides a generic environment for the storage and processing of large datasets, removing the requirement to set up and maintain a computing system when implementing an extensive undertaking such as a survey pipeline. 500 processor cores and several hundred terabytes of persistent storage are currently available to users. The storage is implemented via the International Virtual Observatory Alliance's VOSpace protocol, and is accessible both interactively and to all processing jobs. The user interacts with CANFAR by utilizing virtual machines, which appear to them as equivalent to a desktop. Each machine is replicated as desired to perform large-scale parallel processing. Such an arrangement enables the user to immediately install and run the same astronomy code that they already utilize, in the same way as on a desktop. In addition, unlike many cloud systems, batch job scheduling is handled for the user on multiple virtual machines by the Condor job queueing system. Skytree is installed and run just as any other software on the system, and thus acts as a library of command line data mining functions that can be integrated into one's wider analysis. Thus we have created a generic environment for large-scale analysis by data mining, in the same way that CANFAR itself has done for storage and processing. Because Skytree scales to large data in linear runtime, this allows the full sophistication of the huge fields of data mining and machine learning to be applied to the hundreds of millions of objects that make up current large datasets. We demonstrate the utility of the CANFAR+Skytree system by showing science results obtained. [Abridged]
Astronomy is increasingly encountering two fundamental truths: (1) The field is faced with the task of extracting useful information from extremely large, complex, and high dimensional datasets; (2) The techniques of astroinformatics and astrostatistics are the only way to make this tractable, and bring the required level of sophistication to the analysis. Thus, an approach which provides these tools in a way that scales to these datasets is not just desirable, it is vital. The expertise required spans not just astronomy, but also computer science, statistics, and informatics. As a computer scientist and expert in machine learning, Alex's contribution of expertise and a large number of fast algorithms designed to scale to large datasets is extremely welcome. We focus in this discussion on the questions raised by the practical application of these algorithms to real astronomical datasets. That is, what is needed to maximally leverage their potential to improve the science return? This is not a trivial task. While computing and statistical expertise are required, so is astronomical expertise. Precedent has shown that, to date, the collaborations most productive in producing astronomical science results (e.g., the Sloan Digital Sky Survey) have either involved astronomers expert in computer science and/or statistics, or astronomers involved in close, long-term collaborations with experts in those fields. This does not mean that the astronomers are giving the most important input, but simply that their input is crucial in guiding the effort in the most fruitful directions, and coping with the issues raised by real data. Thus, the tools must be usable and understandable by those whose primary expertise is not computing or statistics, even though they may have quite extensive knowledge of those fields.
The Next Generation Virgo Cluster Survey is a 104 square degree survey of the Virgo Cluster, carried out using the MegaPrime camera of the Canada-France-Hawaii Telescope, from semesters 2009A-2012A. The survey will provide coverage of this nearby dense environment in the universe to unprecedented depth, providing profound insights into galaxy formation and evolution, including definitive measurements of the properties of galaxies in a dense environment in the local universe, such as the luminosity function. The limiting magnitude of the survey is g_AB = 25.7 (10 sigma point source), and the 2 sigma surface brightness limit is g_AB ~ 29 mag arcsec^-2. The data volume of the survey (approximately 50 terabytes of images), while large by contemporary astronomical standards, is not intractable. This renders the survey amenable to the methods of astroinformatics. The enormous dynamic range of objects, from the giant elliptical galaxy M87 at M(B) = -21.6, to the faintest dwarf ellipticals at M(B) ~ -6, combined with photometry in 5 broad bands (u* g r i z), and unprecedented depth revealing many previously unseen structures, creates new challenges in object detection and classification. We present results from ongoing work on the survey, including photometric redshifts, Virgo cluster membership, and the implementation of fast data mining algorithms on the infrastructure of the Canadian Astronomy Data Centre, as part of the Canadian Advanced Network for Astronomical Research (CANFAR).
We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science, and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm, and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.
81 - Adam D Myers 2009
The use of photometric redshifts in cosmology is increasing. Often, however, these photo-z's are treated like spectroscopic observations, in that the peak of the photometric redshift, rather than the full probability density function (PDF), is used. This overlooks useful information inherent in the full PDF. We introduce a new real-space estimator for one of the most used cosmological statistics, the 2-point correlation function, that weights by the PDF of individual photometric objects in a manner that is optimal when Poisson statistics dominate. As our estimator does not bin based on the PDF peak, it substantially enhances the clustering signal by usefully incorporating information from all photometric objects that overlap the redshift bin of interest. As a real-world application, we measure QSO clustering in the Sloan Digital Sky Survey (SDSS). We find that our simplest binned estimator improves the clustering signal by a factor equivalent to increasing the survey size by a factor of 2-3. We also introduce a new implementation that fully weights between pairs of objects in constructing the cross-correlation and find that this pair-weighted estimator improves clustering signal in a manner equivalent to increasing the survey size by a factor of 4-5. Our technique uses spectroscopic data to anchor the distance scale and it will be particularly useful where spectroscopic data (e.g., from BOSS) overlaps deeper photometry (e.g., from Pan-STARRS, DES or the LSST). We additionally provide simple, informative expressions to determine when our estimator will be competitive with the autocorrelation of spectroscopic objects. Although we use QSOs as an example population, our estimator can and should be applied to any clustering estimate that uses photometric objects.
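As a rough illustration of why weighting by the full PDF retains more objects than binning on the PDF peak, the sketch below computes a per-object weight as the fraction of each PDF that falls inside a redshift bin. This is not the estimator from the paper, whose exact form is not given in the abstract. The function names, the histogram-style PDF representation, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def pdf_bin_weights(z_grid, pdfs, z_lo, z_hi):
    """Fraction of each object's photo-z PDF lying in [z_lo, z_hi).

    Assumes (hypothetically) that each row of `pdfs` holds the probability
    in cells centred on `z_grid`; rows need not be normalised.
    """
    z_grid = np.asarray(z_grid, dtype=float)
    pdfs = np.asarray(pdfs, dtype=float)
    in_bin = (z_grid >= z_lo) & (z_grid < z_hi)
    return pdfs[:, in_bin].sum(axis=1) / pdfs.sum(axis=1)

def peak_bin_mask(z_grid, pdfs, z_lo, z_hi):
    """Conventional selection: keep an object only if its PDF peak is in the bin."""
    z_grid = np.asarray(z_grid, dtype=float)
    z_peak = z_grid[np.asarray(pdfs).argmax(axis=1)]
    return (z_peak >= z_lo) & (z_peak < z_hi)

# Toy data: one sharply peaked object and one with a flat PDF.
z_grid = [0.5, 1.5, 2.5, 3.5]
pdfs = [[0.0, 1.0, 0.0, 0.0],
        [0.25, 0.25, 0.25, 0.25]]
weights = pdf_bin_weights(z_grid, pdfs, 1.0, 2.0)
selected = peak_bin_mask(z_grid, pdfs, 1.0, 2.0)
```

In this toy case the flat-PDF object is discarded by peak binning but still contributes a quarter of its weight under PDF weighting, which is the kind of information the abstract describes as being recovered.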
We present recent results from the LCDM (Laboratory for Cosmological Data Mining; http://lcdm.astro.uiuc.edu) collaboration between UIUC Astronomy and NCSA to deploy supercomputing cluster resources and machine learning algorithms for the mining of terascale astronomical datasets. This is a novel application in the field of astronomy, because we are using such resources for data mining, and not just performing simulations. Via a modified implementation of the NCSA cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million stars and galaxies in the Sloan Digital Sky Survey, improved distance measures, and a full exploitation of the simple but powerful k-nearest neighbor algorithm. A driving principle of this work is that our methods should be extensible from current terascale datasets to upcoming petascale datasets and beyond. We discuss issues encountered to date, and further issues for the transition to petascale. In particular, disk I/O will become a major limiting factor unless the necessary infrastructure is implemented.
We apply machine learning in the form of a nearest neighbor instance-based algorithm (NN) to generate full photometric redshift probability density functions (PDFs) for objects in the Fifth Data Release of the Sloan Digital Sky Survey (SDSS DR5). We use a conceptually simple but novel application of NN to generate the PDFs - perturbing the object colors by their measurement error - and using the resulting instances of nearest neighbor distributions to generate numerous individual redshifts. When the redshifts are compared to existing SDSS spectroscopic data, we find that the mean value of each PDF has a dispersion between the photometric and spectroscopic redshift consistent with other machine learning techniques, being sigma = 0.0207 +/- 0.0001 for main sample galaxies to r < 17.77 mag, sigma = 0.0243 +/- 0.0002 for luminous red galaxies to r < ~19.2 mag, and sigma = 0.343 +/- 0.005 for quasars to i < 20.3 mag. The PDFs allow the selection of subsets with improved statistics. For quasars, the improvement is dramatic: for those with a single peak in their probability distribution, the dispersion is reduced from 0.343 to sigma = 0.117 +/- 0.010, and the photometric redshift is within 0.3 of the spectroscopic redshift for 99.3 +/- 0.1% of the objects. Thus, for this optical quasar sample, we can virtually eliminate catastrophic photometric redshift estimates. In addition to the SDSS sample, we incorporate ultraviolet photometry from the Third Data Release of the Galaxy Evolution Explorer All-Sky Imaging Survey (GALEX AIS GR3) to create PDFs for objects seen in both surveys. For quasars, the increased coverage of the observed frame UV of the SED results in significant improvement over the full SDSS sample, with sigma = 0.234 +/- 0.010. We demonstrate that this improvement is genuine. [Abridged]
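The perturbation idea described above can be sketched in a few lines: draw many copies of an object's colors scattered by their measurement errors, find the nearest spectroscopic training object for each copy, and collect the resulting redshifts as samples of the PDF. This is a minimal sketch under stated assumptions, not the authors' pipeline. Gaussian errors, Euclidean distance in color space, a single nearest neighbor, and all names (`nn_photoz_samples`, `n_draws`, the toy training set) are illustrative choices not taken from the source.

```python
import numpy as np

def nn_photoz_samples(train_colors, train_z, colors, color_err,
                      n_draws=100, seed=0):
    """Redshift samples for one object from perturbed nearest-neighbor lookups.

    train_colors : (n_train, n_colors) colors of the spectroscopic training set
    train_z      : (n_train,) spectroscopic redshifts
    colors       : (n_colors,) measured colors of the target object
    color_err    : (n_colors,) 1-sigma color errors (assumed Gaussian)
    """
    rng = np.random.default_rng(seed)
    train_colors = np.asarray(train_colors, dtype=float)
    train_z = np.asarray(train_z, dtype=float)
    colors = np.asarray(colors, dtype=float)

    # Each row is one perturbed realisation of the object's colors.
    draws = rng.normal(loc=colors, scale=color_err,
                       size=(n_draws, colors.size))

    # Brute-force squared Euclidean distances, shape (n_draws, n_train).
    d2 = ((draws[:, None, :] - train_colors[None, :, :]) ** 2).sum(axis=2)

    # Redshift of the single nearest training object for each realisation.
    return train_z[d2.argmin(axis=1)]

# Toy training set: three objects along a color-redshift sequence.
train_colors = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
train_z = [0.1, 0.5, 0.9]
samples = nn_photoz_samples(train_colors, train_z,
                            colors=[1.0, 1.0], color_err=[0.0, 0.0],
                            n_draws=50)
```

A histogram of `samples` then serves as the PDF estimate, and its mean or its number of peaks can be used to select subsets. The brute-force distance matrix is only practical for small training sets; a tree-based neighbor search would be needed at survey scale.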
We present recent results from the Laboratory for Cosmological Data Mining (http://lcdm.astro.uiuc.edu) at the National Center for Supercomputing Applications (NCSA) to provide robust classifications and photometric redshifts for objects in the terascale-class Sloan Digital Sky Survey (SDSS). Through a combination of machine learning in the form of decision trees, k-nearest neighbor, and genetic algorithms, the use of supercomputing resources at NCSA, and the cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million objects in the SDSS, improved photometric redshifts, and a full exploitation of the powerful k-nearest neighbor algorithm. This work is the first to apply the full power of these algorithms to contemporary terascale astronomical datasets, and the improvement over existing results is demonstrable. We discuss issues that we have encountered in dealing with data on the terascale, and possible solutions that can be implemented to deal with upcoming petascale datasets.