ترغب بنشر مسار تعليمي؟ اضغط هنا

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

85   0   0.0 ( 0 )
 نشر من قبل Menglu Yu
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) We propose a meticulously designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design.



قيم البحث

اقرأ أيضاً

A major driver behind the success of modern machine learning algorithms has been their ability to process ever-larger amounts of data. As a result, the use of distributed systems in both research and production has become increasingly prevalent as a means to scale to this growing data. At the same time, however, distributing the learning process can drastically complicate the implementation of even simple algorithms. This is especially problematic as many machine learning practitioners are not well-versed in the design of distributed systems, let alone those that have complicated communication topologies. In this work we introduce Launchpad, a programming model that simplifies the process of defining and launching distributed systems that is specifically tailored towards a machine learning audience. We describe our framework, its design philosophy and implementation, and give a number of examples of common learning algorithms whose designs are greatly simplified by this approach.
Efficient GPU resource scheduling is essential to maximize resource utilization and save training costs for the increasing amount of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies to leverage t he performance characteristics of deep learning jobs. However, they can hardly reach optimal efficiency due to the lack of elasticity. To address the problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job based on the training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that can continuously optimize the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on TACCs Longhorn supercomputers. The results show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.
Existing distributed machine learning (DML) systems focus on improving the computational efficiency of distributed learning, whereas communication aspects have received less attention. Many DML systems treat the network as a blackbox. Thus, DML algor ithms performance is impeded by network bottlenecks, and DML systems end up sacrificing important algorithmic and system-level benefits. We present MLfabric, a communication library that manages all network transfers in a DML system, and holistically determines the communication pattern of a DML algorithm at any point in time. This allows MLfabric to carefully order transfers (i.e., gradient updates) to improve convergence, opportunistically aggregate updates in-network to improve efficiency, and proactively replicate some of them to support new notions of fault tolerance. We empirically find that MLfabric achieves up to 3X speed-up in training large deep learning models in realistic dynamic cluster settings.
Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes, system fa ilures, or communication bottlenecks -- but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, emph{coded shuffling} reduces the communication cost by a factor of $(alpha + frac{1}{n})gamma(n)$ compared to uncoded shuffling, where $gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to multicasting a common message (of the same size) to $n$ users. For instance, $gamma(n) simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experiment results, corroborating our theoretical gains of the coded algorithms.
364 - Ji Liu , Jizhou Huang , Yang Zhou 2021
In recent years, data and computing resources are typically distributed in the devices of end users, various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among differ ent regions or organizations for machine learning tasks. Federated learning emerges as an efficient approach to exploit distributed data and computing resources, so as to collaboratively train machine learning models, while obeying the laws and regulations and ensuring data security and data privacy. In this paper, we provide a comprehensive survey of existing works for federated learning. We propose a functional architecture of federated learning systems and a taxonomy of related techniques. Furthermore, we present the distributed training, data communication, and security of FL systems. Finally, we analyze their limitations and propose future research directions.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا