Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) We propose a meticulously designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design.
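To make step iii) concrete, the following is a minimal, self-contained sketch of randomized rounding applied to a mixed packing-and-covering integer program. The toy instance, the variable names, and the simple retry-until-feasible loop are illustrative assumptions only; the algorithm in the paper is constructed more carefully and comes with a rigorous performance analysis.

```python
import random

def randomized_round(frac_x, packing, covering, trials=1000, seed=0):
    """
    Round a fractional solution of a mixed packing-and-covering program
    to an integer one.

    frac_x   : list of fractional values in [0, 1]
    packing  : list of (coeffs, bound)  meaning sum_i coeffs[i]*x[i] <= bound
    covering : list of (coeffs, demand) meaning sum_i coeffs[i]*x[i] >= demand

    Each trial sets x[i] = 1 independently with probability frac_x[i] and
    keeps the first rounding that satisfies every constraint.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = [1 if rng.random() < p else 0 for p in frac_x]
        ok_pack = all(sum(c * v for c, v in zip(coeffs, x)) <= bound
                      for coeffs, bound in packing)
        ok_cover = all(sum(c * v for c, v in zip(coeffs, x)) >= demand
                       for coeffs, demand in covering)
        if ok_pack and ok_cover:
            return x
    return None  # no feasible rounding found within the trial budget


# Toy instance: 4 candidate (worker, machine) placements.
frac = [0.6, 0.4, 0.7, 0.3]        # fractional relaxation values
packing = [([2, 1, 3, 1], 5)]      # machine capacity constraint
covering = [([1, 1, 1, 1], 2)]     # at least 2 workers must be placed
print(randomized_round(frac, packing, covering))
```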
A major driver behind the success of modern machine learning algorithms has been their ability to process ever-larger amounts of data. As a result, the use of distributed systems in both research and production has become increasingly prevalent.
Efficient GPU resource scheduling is essential to maximize resource utilization and reduce training costs for the growing number of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies.
Existing distributed machine learning (DML) systems focus on improving the computational efficiency of distributed learning, whereas communication aspects have received less attention. Many DML systems treat the network as a black box.
Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes and system failures, for example.
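To illustrate how coding buys this robustness, the following is a small sketch (not taken from the paper) of coded matrix-vector multiplication: three workers evaluate encoded sub-tasks, and the full product can be recovered from any two of them, so one slow or failed worker does not stall the computation. The worker names and the simple (3, 2) code are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))        # data matrix, split row-wise
x = rng.standard_normal(3)
A1, A2 = A[:2], A[2:]

# Encoded tasks: any 2 of the 3 workers suffice to recover A @ x.
tasks = {
    "w1": A1,            # computes A1 @ x
    "w2": A2,            # computes A2 @ x
    "w3": A1 + A2,       # parity worker: computes (A1 + A2) @ x
}
results = {w: M @ x for w, M in tasks.items()}

# Suppose worker 2 straggles: recover A2 @ x as (A1 + A2) @ x - A1 @ x.
recovered = np.concatenate([results["w1"], results["w3"] - results["w1"]])
assert np.allclose(recovered, A @ x)
print(recovered)
```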
In recent years, data and computing resources have typically been distributed across the devices of end users and across various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among different regions or organizations.