Pre-feasibility Study of Astronomical Data Archive Systems Powered by Public Cloud Computing and Hadoop Hive


Published by Satoshi Eguchi

Publication date: 2016

Research field: Physics

Language: English

Author: Satoshi Eguchi

Instrumentation and Methods for Astrophysics


The volume of astronomical observational data is increasing every year. For example, while the Atacama Large Millimeter/submillimeter Array is expected to generate 200 TB of raw data every year, the Large Synoptic Survey Telescope is estimated to produce 15 TB of raw data every night. Since computing power is growing much more slowly than astronomical data volumes, providing high-performance computing (HPC) resources together with scientific data will become common in the next decade. However, the installation and maintenance costs of an HPC system can be burdensome for the provider. I consider public cloud computing as an alternative way to obtain sufficient computing resources inexpensively. I build Hadoop and Hive clusters using a virtual private server (VPS) service and Amazon Elastic MapReduce (EMR), and measure their performance. The VPS cluster's behavior varies from day to day, while the EMR clusters are relatively stable. Since partitioning is essential for Hive, several partitioning algorithms are evaluated. In this paper, I report the benchmark results and the performance optimizations in a cloud computing environment.
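The abstract does not say which partitioning algorithms were evaluated, so the following is only an illustrative sketch of why partitioning matters for a Hive-style catalog: rows are grouped by a coarse sky-grid cell, and a positional lookup then reads a single partition instead of scanning every row. The 10-degree cell size, the function names, and the sample rows are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

CELL_DEG = 10.0  # hypothetical grid cell size in degrees (an assumption)


def partition_key(ra_deg, dec_deg, cell_deg=CELL_DEG):
    """Map a sky position to a (ra_bin, dec_bin) grid cell.

    This simple equal-angle grid is an illustrative choice, not one of
    the algorithms evaluated in the paper.
    """
    ra_bin = int((ra_deg % 360.0) // cell_deg)
    n_dec = int(180.0 // cell_deg)
    # Clamp dec = +90 into the top band so it does not create an extra bin.
    dec_bin = min(int((dec_deg + 90.0) // cell_deg), n_dec - 1)
    return (ra_bin, dec_bin)


def build_partitions(rows, cell_deg=CELL_DEG):
    """Group catalog rows by partition key, as a partitioned table would."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_key(row["ra"], row["dec"], cell_deg)].append(row)
    return dict(parts)


def rows_in_same_cell(parts, ra_deg, dec_deg, cell_deg=CELL_DEG):
    """Return only the rows stored in the partition covering the position."""
    return parts.get(partition_key(ra_deg, dec_deg, cell_deg), [])


# Made-up sample rows, for illustration only.
rows = [
    {"id": 1, "ra": 12.5, "dec": -45.0},
    {"id": 2, "ra": 15.0, "dec": -41.2},
    {"id": 3, "ra": 200.0, "dec": 30.0},
]
parts = build_partitions(rows)
hits = rows_in_same_cell(parts, 13.0, -44.0)
print(sorted(parts))
print([r["id"] for r in hits])
```

In Hive itself the same idea would be expressed by declaring the cell label as a partition column, so that queries filtering on that column skip the other partitions. A finer grid prunes more data per query but produces more, smaller partitions, which is presumably one of the trade-offs a comparison of partitioning schemes has to weigh.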


Pau Tallada, Jorge Carretero, Jordi Casals, 2020

We present CosmoHub (https://cosmohub.pic.es), a web application based on Hadoop to perform interactive exploration and distribution of massive cosmological datasets. Recent cosmology seeks to unveil the nature of both dark matter and dark energy by mapping the large-scale structure of the Universe, through the analysis of massive amounts of astronomical data, which have been progressively increasing during the last (and future) decades with the digitization and automation of experimental techniques. CosmoHub, hosted and developed at the Port d'Informació Científica (PIC), provides support to a worldwide community of scientists, without requiring the end user to know any Structured Query Language (SQL). It serves data from several large international collaborations such as the Euclid space mission, the Dark Energy Survey (DES), the Physics of the Accelerating Universe Survey (PAUS) and the Marenostrum Institut de Ciències de l'Espai (MICE) numerical simulations. While CosmoHub was originally developed as a PostgreSQL relational database web frontend, this work describes its current version, built on top of Apache Hive, which facilitates scalable reading, writing and managing of huge datasets. As CosmoHub's datasets are seldom modified, Hive is a better fit. Over 60 TiB of catalogued information and $50 \times 10^9$ astronomical objects can be interactively explored using an integrated visualization tool which includes 1D histogram and 2D heatmap plots. In our current implementation, online exploration of datasets of $10^9$ objects can be done on a timescale of tens of seconds. Users can also download customized subsets of data in standard formats, generated in a few minutes.

Instrumentation and Methods for Astrophysics; Distributed, Parallel, and Cluster Computing; Data Analysis, Statistics and Probability

A Redistribution Tool for Long-Term Archive of Astronomical Observation Data

Chao Sun, Ce Yu, Chenzhou Cui, 2020

Astronomical observation data require long-term preservation, and the rapid accumulation of observation data makes it necessary to consider the cost of long-term archive storage. In addition to low-speed disk-based online storage, optical disk or tape-based offline storage can be used to save costs. However, for astronomical research that requires historical data (particularly time-domain astronomy), the performance and energy consumption of data-accessing techniques cause problems because the requested data (which are organized according to observation time) may be located across multiple storage devices. In this study, we design and develop a tool referred to as AstroLayout to redistribute the observation data using spatial aggregation. The core algorithm uses graph partitioning to generate an optimized data placement according to the original observation data statistics and the target storage system. For the given observation data, AstroLayout can copy the long-term archive into the target storage system in accordance with this placement. An efficiency evaluation shows that AstroLayout can reduce the number of devices activated when responding to data-access requests in time-domain astronomy research. In addition to improving the performance of data-accessing techniques, AstroLayout can also reduce the storage system's power consumption. For enhanced adaptability, it supports storage systems of any media, including optical disks, tapes, and hard disks.

Instrumentation and Methods for Astrophysics

The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures -- Technical Report

Laurens Versluis, Roland Matha, Sacheendra Talluri, 2019

Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. We focus in this work on traces of workflows, which are common in datacenters, clouds, and HPC infrastructures. We show that the state of the art in using workflow traces raises important issues: (1) the use of realistic traces is infrequent, and (2) the use of realistic, open-access traces even more so. To alleviate these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures, together with tooling to parse, validate, and analyze traces. The WTA includes more than 48 million workflows captured from more than 10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields.

Distributed, Parallel, and Cluster Computing

The LAMOST Data Archive and Data Release

Boliang He, Dongwei Fan, Chenzhou Cui, 2016

The Large sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) is the largest optical telescope in China. In the last four years, the LAMOST telescope has published four editions of data (pilot data release, data release 1, data release 2 and data release 3). To archive and release these data (raw data, catalog, spectrum, etc.), we have set up a data cycle management system, including the transfer of data, archiving, and backup. And through the evolution of four softwa

Instrumentation and Methods for Astrophysics

Mega-Archive and the EURONEAR Tools for Datamining World Astronomical Images

Ovidiu Vaduvescu, Lucian Curelaru, Marcel Popescu, 2019

The world astronomical image archives represent huge opportunities for time-domain astronomy and other hot topics such as space defense, and astronomical observatories should improve this wealth and make it more accessible in the big data era. In 2010 we introduced the Mega-Archive database and the Mega-Precovery server for data mining images containing Solar system bodies, with a focus on near Earth asteroids (NEAs). This paper presents the improvements and introduces some new related data mining tools developed during the last five years. Currently, the Mega-Archive has indexed 15 million images available from six major collections (CADC, ESO, ING, LCOGT, NVO and SMOKA) and other instrument archives and surveys. This meta-data index collection has been updated daily (since 2014) by a crawler which performs automated queries of five major collections. Since 2016, these data mining tools have run on the new dedicated EURONEAR server, and the database has migrated to an SQL engine which supports robust and fast queries. To constrain the area in which to search for moving or fixed objects in images taken by large mosaic cameras, we built the graphical tools FindCCD and FindCCD for Fixed Objects, which overlay the targets across one of seven mosaic cameras (Subaru-SuprimeCam, VST-OmegaCam, INT-WFC, VISTA-VIRCAM, CFHT-MegaCam, Blanco-DECam and Subaru-HSC), also plotting the uncertainty ellipse for poorly observed NEAs. In 2017 we improved Mega-Precovery, which now offers two options for calculating the ephemerides and three options for the input (objects defined by designation, orbit or observations). Additionally, we developed Mega-Archive for Fixed Objects (MASFO) and Mega-Archive Search for Double Stars (MASDS). We believe that the huge potential of science imaging archives is still insufficiently exploited.

Instrumentation and Methods for Astrophysics; Databases
