Accelerating Large-scale Data Exploration through Data Diffusion

227 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ioan Raicu

تاريخ النشر 2008

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ioan Raicu - Yong Zhao - Ian Foster

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a data diffusion approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. The approach is reminiscent of cooperative caching, web-caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive and/or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both micro-benchmarks and a large scale astronomy application demonstrate that our approach improves performance relative to alternative approaches, as well as provides improved scalability as aggregated I/O bandwidth scales linearly with the number of data cache nodes.

قيم البحث

97 - Eli Dart , Michael F. Wehner , Prabhat 2017

We document the data transfer workflow, data transfer performance, and other aspects of staging approximately 56 terabytes of climate model output data from the distributed Coupled Model Intercomparison Project (CMIP5) archive to the National Energy Research Supercomputing Center (NERSC) at the Lawrence Berkeley National Laboratory required for tracking and characterizing extratropical storms, a phenomena of importance in the mid-latitudes. We present this analysis to illustrate the current challenges in assembling multi-model data sets at major computing facilities for large-scale studies of CMIP5 data. Because of the larger archive size of the upcoming CMIP6 phase of model intercomparison, we expect such data transfers to become of increasing importance, and perhaps of routine necessity. We find that data transfer rates using the ESGF are often slower than what is typically available to US residences and that there is significant room for improvement in the data transfer capabilities of the ESGF portal and data centers both in terms of workflow mechanics and in data transfer performance. We believe performance improvements of at least an order of magnitude are within technical reach using current best practices, as illustrated by the performance we achieved in transferring the complete raw data set between two high performance computing facilities. To achieve these performance improvements, we recommend: that current best practices (such as the Science DMZ model) be applied to the data servers and networks at ESGF data centers; that sufficient financial and human resources be devoted at the ESGF data centers for systems and network engineering tasks to support high performance data movement; and that performance metrics for data transfer between ESGF data centers and major computing facilities used for climate data analysis be established, regularly tested, and published.

النظم الموزعة والتوازية والحوسبة العنقودية قواعد البيانات الفيزياء الجوية والمحيطية

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

443 - Shen Li , Yanli Zhao , Rohan Varma 2020

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Data Diffusion: Dynamic Resource Provision and Data-Aware Scheduling for Data Intensive Applications

101 - Ioan Raicu , Yong Zhao , Ian Foster 2008

Data intensive applications often involve the analysis of large datasets that require large amounts of compute and storage resources. While dedicated compute and/or storage farms offer good task/data throughput, they suffer low resource utilization p roblem under varying workloads conditions. If we instead move such data to distributed computing resources, then we incur expensive data transfer cost. In this paper, we propose a data diffusion approach that combines dynamic resource provisioning, on-demand data replication and caching, and data locality-aware scheduling to achieve improved resource efficiency under varying workloads. We define an abstract data diffusion model that takes into consideration the workload characteristics, data accessing cost, application throughput and resource utilization; we validate the model using a real-world large-scale astronomy application. Our results show that data diffusion can increase the performance index by as much as 34X, and improve application response time by over 506X, while achieving near-optimal throughputs and execution times.

النظم الموزعة والتوازية والحوسبة العنقودية

ds-array: A Distributed Data Structure for Large Scale Machine Learning

71 - Javier Alvarez Cid-Fuentes , Pol Alvarez , Salvi Sol`a 2021

Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be processed in di stributed platforms due to their magnitude. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python especially designed to process large scale data sets on HPC clusters, which makes dislib an ideal candidate for analyzing scientific data. However, dislibs main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislibs main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.

النظم الموزعة والتوازية والحوسبة العنقودية

Scalable Teacher Forcing Network for Semi-Supervised Large Scale Data Streams

318 - Mahardhika Pratama , Choiru Zain , Edwin Lughofer 2021

The large-scale data stream problem refers to high-speed information flow which cannot be processed in scalable manner under a traditional computing platform. This problem also imposes expensive labelling cost making the deployment of fully supervise d algorithms unfeasible. On the other hand, the problem of semi-supervised large-scale data streams is little explored in the literature because most works are designed in the traditional single-node computing environments while also being fully supervised approaches. This paper offers Weakly Supervised Scalable Teacher Forcing Network (WeScatterNet) to cope with the scarcity of labelled samples and the large-scale data streams simultaneously. WeScatterNet is crafted under distributed computing platform of Apache Spark with a data-free model fusion strategy for model compression after parallel computing stage. It features an open network structure to address the global and local drift problems while integrating a data augmentation, annotation and auto-correction ($DA^3$) method for handling partially labelled data streams. The performance of WeScatterNet is numerically evaluated in the six large-scale data stream problems with only $25%$ label proportions. It shows highly competitive performance even if compared with fully supervised learners with $100%$ label proportions.

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي التعلم الآلي

سجل دخول لتتمكن من نشر تعليقات