Distributed implementations are crucial in speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is \textit{straggling} workers. Coded distributed computation techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. In this paper, we consider gradient coding (GC), and propose a novel dynamic GC scheme, which assigns redundant data to workers so that, at each iteration, the code can be chosen dynamically from a set of possible codes depending on the past straggling behavior. In particular, we consider GC with clustering, and regulate the number of stragglers in each cluster by dynamically forming the clusters at each iteration; hence, the proposed scheme is called \textit{GC with dynamic clustering} (GC-DC). Under a time-correlated straggling behavior, GC-DC adapts to the straggling behavior over time: at each iteration, it aims to distribute the stragglers across clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time without an increase in the communication load compared to the original GC scheme.
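To make the clustering idea concrete, the following is a minimal sketch, not the paper's actual algorithm, of one way a server could re-form clusters at each iteration so that workers that recently straggled are spread across clusters as evenly as possible. The function name assign_clusters, the straggler_counts history, and the greedy round-robin heuristic are illustrative assumptions.

```python
# Hypothetical sketch of the dynamic clustering idea: re-form clusters each
# iteration so that workers that straggled recently are spread out uniformly.

def assign_clusters(straggler_counts, num_clusters):
    """straggler_counts[w]: how often worker w straggled in recent iterations.
    Returns num_clusters lists of worker indices of (near-)equal size."""
    num_workers = len(straggler_counts)
    # Order workers from most to least likely to straggle, based on history.
    order = sorted(range(num_workers), key=lambda w: -straggler_counts[w])
    clusters = [[] for _ in range(num_clusters)]
    # Round-robin over clusters, so likely stragglers do not pile up in one cluster.
    for i, w in enumerate(order):
        clusters[i % num_clusters].append(w)
    return clusters

# Example: 12 workers, 4 clusters; workers 0, 1, 2 and 11 straggled recently.
counts = [5, 4, 4, 1, 0, 0, 2, 0, 1, 0, 0, 3]
print(assign_clusters(counts, 4))
```

In the actual GC-DC scheme, the cluster assignment is additionally constrained by the redundant data placed at each worker, i.e., by the set of codes available at that iteration; the sketch above only captures the load-balancing intuition.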