ﻻ يوجد ملخص باللغة العربية
In distributed synchronous gradient descent (GD) the main performance bottleneck for the per-iteration completion time is the slowest textit{straggling} workers. To speed up GD iterations in the presence of stragglers, coded distributed computation techniques are implemented by assigning redundant computations to workers. In this paper, we propose a novel gradient coding (GC) scheme that utilizes dynamic clustering, denoted by GC-DC, to speed up the gradient calculation. Under time-correlated straggling behavior, GC-DC aims at regulating the number of straggling workers in each cluster based on the straggler behavior in the previous iteration. We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
Distributed implementations are crucial in speeding up large scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called stragglers, and communication overheads.
Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for computation at the agents is affected by the availability of local resources giving rise to
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, suffer from slow running machines, called stragglers. Gradient coding is a coding-theoretic framework to mitigate stragg
The performance of large-scale distributed compute systems is adversely impacted by stragglers when the execution time of a job is uncertain. To manage stragglers, we consider a multi-fork approach for job scheduling, where additional parallel server