Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts

77 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Dominik Scheinert

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Dominik Scheinert - Lauritz Thamsen - Houkun Zhu

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Distributed dataflow systems enable the use of clusters for scalable data analytics. However, selecting appropriate cluster resources for a processing job is often not straightforward. Performance models trained on historical executions of a concrete job are helpful in such situations, yet they are usually bound to a specific job execution context (e.g. node type, softwar

قيم البحث

83 - Dominik Scheinert , Houkun Zhu , Lauritz Thamsen 2021

Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performa nce of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance. This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and, thus, allows for deriving effective rescaling decisions. For this, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Power-Aware Allocation of Graph Jobs in Geo-Distributed Cloud Networks

92 - Seyyedali Hosseinalipour , Anuj Nayak , Huaiyu Dai 2018

In the era of big-data, the jobs submitted to the clouds exhibit complicated structures represented by graphs, where the nodes denote the sub-tasks each of which can be accommodated at a slot in a server, while the edges indicate the communication co nstraints among the sub-tasks. We develop a framework for efficient allocation of graph jobs in geo-distributed cloud networks (GDCNs), explicitly considering the power consumption of the datacenters (DCs). We address the following two challenges arising in graph job allocation: i) the allocation problem belongs to NP-hard nonlinear integer programming; ii) the allocation requires solving the NP-complete sub-graph isomorphism problem, which is particularly cumbersome in large-scale GDCNs. We develop a suite of efficient solutions for GDCNs of various scales. For small-scale GDCNs, we propose an analytical approach based on convex programming. For medium-scale GDCNs, we develop a distributed allocation algorithm exploiting the processing power of DCs in parallel. Afterward, we provide a novel low-complexity (decentralized) sub-graph extraction method, based on which we introduce cloud crawlers aiming to extract allocations of good potentials for large-scale GDCNs. Given these suggested strategies, we further investigate strategy selection under both fixed and adaptive DC pricing schemes, and propose an online learning algorithm for each.

النظم الموزعة والتوازية والحوسبة العنقودية

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

327 - Qinggang Zhou , Yawen Zhang , Pengcheng Li 2020

The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagatio ns, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks. This synchronous execution model exposes the overheads of gradient/weight communication among a large number of workers in a distributed training system. We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide 100% of the communication overhead. By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on the low-latency and high-throughput inter-connects. The theoretical analysis and the experimental results show its convergence rate O(1/sqrt(K)), the same as SGD. The performance evaluation demonstrates it enables a linear performance scale-up with the cluster size.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models

66 - Matthew LeMay , Shijian Li , Tian Guo 2019

Deep learning models are increasingly used for end-user applications, supporting both novel features such as facial recognition, and traditional features, e.g. web search. To accommodate high inference throughput, it is common to host a single pre-tr ained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. These resources could also be under-utilized facing dynamic workloads, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreement (SLA), and dynamic inference workloads. In this paper, we present Perseus, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented Perseus in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging Perseus, we evaluated the inference throughput and cost for serving various models and demonstrated that multi-tenant model serving led to up to 12% cost reduction.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي الأداء

Distributed Training of Deep Learning Models: A Taxonomic Perspective

174 - Matthias Langer , Zhen He , Wenny Rahayu 2020

Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS are required to make many decisions to process their particular workloads in their chosen environment ef ficiently. The advent of GPU-based deep learning, the ever-increasing size of datasets and deep neural network models, in combination with the bandwidth constraints that exist in cluster environments require developers of DDLS to be innovative in order to train high quality models quickly. Comparing DDLS side-by-side is difficult due to their extensive feature lists and architectural deviations. We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines by analyzing the general properties associated with training deep learning models and how such workloads can be distributed in a cluster to achieve collaborative model training. Thereby we provide an overview of the different techniques that are used by contemporary DDLS and discuss their influence and implications on the training process. To conceptualize and compare DDLS, we group different techniques into categories, thus establishing a taxonomy of distributed deep learning systems.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي