It has been established that when the gradient coding problem is distributed among $n$ servers, the computation load (number of stored data partitions) of each worker must be at least $s+1$ to resist $s$ stragglers. Such schemes incur a large overhead when the number of stragglers $s$ is large. In this paper, we focus on a new framework called \emph{approximate gradient coding} to mitigate stragglers in distributed learning. We show that, to exactly recover the gradient with high probability, the computation load is lower bounded by $O(\log(n)/\log(n/s))$. We also propose a code that exactly matches this lower bound. We identify a fundamental three-fold tradeoff, $d \geq O(\log(1/\epsilon)/\log(n/s))$, for any approximate gradient coding scheme, where $d$ is the computation load and $\epsilon$ is the error of the recovered gradient. We give an explicit code construction based on a random edge-removal process that achieves the derived tradeoff. We implement our schemes and demonstrate their advantage over the fastest existing gradient coding strategies.
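To make the scaling of these bounds concrete, the following is a minimal Python sketch (not taken from the paper) that evaluates the stated expressions up to the constants hidden in the $O(\cdot)$ notation, and runs a toy experiment in which each worker stores $d$ uniformly random partitions, rather than the paper's random edge-removal construction, to show how the fraction of unrecovered partitions shrinks as the computation load $d$ grows. The function names and parameter values are illustrative assumptions.

    import math
    import random

    def load_bounds(n, s, eps):
        """Order-level computation-load guides from the abstract.
        Constants inside the O(.) notation are unknown, so these are
        proportional indicators only."""
        d_exact_worst_case = s + 1                      # exact recovery against any s stragglers
        d_exact_whp = math.log(n) / math.log(n / s)     # exact recovery w.h.p., up to constants
        d_approx = math.log(1 / eps) / math.log(n / s)  # eps-approximate recovery, up to constants
        return d_exact_worst_case, d_exact_whp, d_approx

    def fraction_missed(n, s, d, trials=2000, seed=0):
        """Toy experiment (not the paper's construction): each of n workers
        stores d partitions chosen uniformly at random, s random workers
        straggle, and we report the average fraction of the n partitions
        held by no surviving worker."""
        rng = random.Random(seed)
        missed = 0.0
        for _ in range(trials):
            covered = set()
            for _ in range(n - s):                      # surviving workers
                covered.update(rng.sample(range(n), d)) # partitions this worker returns
            missed += (n - len(covered)) / n
        return missed / trials

    if __name__ == "__main__":
        n, s, eps = 100, 20, 0.01
        print(load_bounds(n, s, eps))
        for d in (1, 2, 3, 5):
            print(d, fraction_missed(n, s, d))

For example, with $n=100$, $s=20$, and $\epsilon=0.01$, both $\log(n)/\log(n/s)$ and $\log(1/\epsilon)/\log(n/s)$ evaluate to roughly $2.9$, which (up to constants) is far below the worst-case requirement $s+1=21$.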