Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform

75 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Victoria Stodden

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Kyle Chard - Niall Gaffney - Mihael Hategan

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Whole Tale http://wholetale.org is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of Tales for the scientific research community. Tales are executable research objects that capture the code, data, and environment along with narrative and workflow information needed to re-create computational results from scientific studies. Creating reproducible research objects that enable reproducibility, transparency, and re-execution for computational experiments requiring significant compute resources or utilizing massive data is an especially challenging open problem. We describe opportunities, challenges, and solutions to facilitating reproducibility for data- and compute-intensive research, that we call Tales at Scale, using the Whole Tale computing platform. We highlight challenges and solutions in frontend responsiveness needs, gaps in current middleware design and implementation, network restrictions, containerization, and data access. Finally, we discuss challenges in packaging computational experiment implementations for portable data-intensive Tales and outline future work.

قيم البحث

124 - Bertram Ludaescher , Kyle Chard , Niall Gaffney 2016

We present an overview of the recently funded Merging Science and Cyberinfrastructure Pathways: The Whole Tale project (NSF award #1541450). Our approach has two nested goals: 1) deliver an environment that enables researchers to create a complete na rrative of the research process including exposure of the data-to-publication lifecycle, and 2) systematically and persistently link research publications to their associated digital scholarly objects such as the data, code, and workflows. To enable this, Whole Tale will create an environment where researchers can collaborate on data, workspaces, and workflows and then publish them for future adoption or modification. Published data and applications will be consumed either directly by users using the Whole Tale environment or can be integrated into existing or future domain Science Gateways.

المكتبات الرقمية

SODA: A Semantics-Aware Optimization Framework for Data-Intensive Applications Using Hybrid Program Analysis

69 - Bingbing Rao , Zixia Liu , Hong Zhang 2021

In the era of data explosion, a growing number of data-intensive computing frameworks, such as Apache Hadoop and Spark, have been proposed to handle the massive volume of unstructured data in parallel. Since programming models provided by these frame works allow users to specify complex and diversified user-defined functions (UDFs) with predefined operations, the grand challenge of tuning up entire system performance arises if programmers do not fully understand the semantics of code, data, and runtime systems. In this paper, we design a holistic semantics-aware optimization for data-intensive applications using hybrid program analysis} (SODA) to assist programmers to tune performance issues. SODA is a two-phase framework: the offline phase is a static analysis that analyzes code and performance profiling data from the online phase of prior executions to generate a parameterized and instrumented application; the online phase is a dynamic analysis that keeps track of the applications execution and collects runtime information of data and system. Extensive experimental results on four real-world Spark applications show that SODA can gain up to 60%, 10%, 8%, faster than its original implementation, with the three proposed optimization strategies, i.e., cache management, operation reordering, and element pruning, respectively.

النظم الموزعة والتوازية والحوسبة العنقودية

Advancing computational reproducibility in the Dataverse data repository platform

110 - Ana Trisovic , Philip Durbin , Tania Schlatter 2020

Recent reproducibility case studies have raised concerns showing that much of the deposited research has not been reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproduci bility due to the absence of a runtime environment needed for the code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility. However, they do not often enable research discoverability, standardized data citation, or long-term archival like data repositories do. This paper addresses the shortcomings of data repositories and reproducibility tools and how they could be overcome to improve the current lack of computational reproducibility in published and archived research outputs.

المكتبات الرقمية هندسة البرمجيات

Data Diffusion: Dynamic Resource Provision and Data-Aware Scheduling for Data Intensive Applications

149 - Ioan Raicu , Yong Zhao , Ian Foster 2008

Data intensive applications often involve the analysis of large datasets that require large amounts of compute and storage resources. While dedicated compute and/or storage farms offer good task/data throughput, they suffer low resource utilization p roblem under varying workloads conditions. If we instead move such data to distributed computing resources, then we incur expensive data transfer cost. In this paper, we propose a data diffusion approach that combines dynamic resource provisioning, on-demand data replication and caching, and data locality-aware scheduling to achieve improved resource efficiency under varying workloads. We define an abstract data diffusion model that takes into consideration the workload characteristics, data accessing cost, application throughput and resource utilization; we validate the model using a real-world large-scale astronomy application. Our results show that data diffusion can increase the performance index by as much as 34X, and improve application response time by over 506X, while achieving near-optimal throughputs and execution times.

النظم الموزعة والتوازية والحوسبة العنقودية

Data-Intensive Supercomputing in the Cloud: Global Analytics for Satellite Imagery

78 - Michael S. Warren , Samuel W. Skillman , Rick Chartrand 2017

We present our experiences using cloud computing to support data-intensive analytics on satellite imagery for commercial applications. Drawing from our background in high-performance computing, we draw parallels between the early days of clustered co mputing systems and the current state of cloud computing and its potential to disrupt the HPC market. Using our own virtual file system layer on top of cloud remote object storage, we demonstrate aggregate read bandwidth of 230 gigabytes per second using 512 Google Compute Engine (GCE) nodes accessing a USA multi-region standard storage bucket. This figure is comparable to the best HPC storage systems in existence. We also present several of our application results, including the identification of field boundaries in Ukraine, and the generation of a global cloud-free base layer from Landsat imagery.

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي

سجل دخول لتتمكن من نشر تعليقات