
Adaptive Gradient Coding

 Added by Hankun Cao
 Publication date 2020
 Research language: English





This paper focuses on mitigating the impact of stragglers in distributed learning systems. Unlike existing results designed for a fixed number of stragglers, we develop a new scheme called Adaptive Gradient Coding (AGC) that flexibly tolerates a varying number of stragglers. Our scheme gives an optimal tradeoff between computation load, straggler tolerance, and communication cost. In particular, it allows the communication cost to be minimized according to the real-time number of stragglers in practical environments. Implementations on Amazon EC2 clusters using Python with the mpi4py package verify the flexibility in several situations.




Read More

Amogh Johri, Arti Yardi, 2021
In distributed machine learning (DML), the training data is distributed across multiple worker nodes to perform the underlying training in parallel. One major problem affecting the performance of DML algorithms is the presence of stragglers. These are nodes that are extremely slow in performing their task, which results in under-utilization of the training data stored on them. To address this, gradient coding mitigates the impact of stragglers by adding sufficient redundancy in the data. Gradient coding and other straggler mitigation schemes assume that the straggler behavior of the worker nodes is identical. Our experiments on the Amazon AWS cluster, however, suggest otherwise: we see that there is a correlation in the straggler behavior across iterations. To model this, we introduce a heterogeneous straggler model where nodes are categorized into two classes, slow and active. To better utilize training data stored on slow nodes, we modify the existing gradient coding schemes by shuffling the training data among workers. Our results (both simulation and cloud experiments) suggest a remarkable improvement with shuffling over existing schemes. We perform a theoretical analysis of the proposed models, justifying their utility.
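The idea of adding redundancy so that a straggler can be ignored can be illustrated with a small three-worker example that is commonly used to introduce gradient coding. The sketch below is not the shuffling scheme described in this abstract, nor the AGC scheme; it assumes a fixed tolerance of one straggler, and the names `B`, `DECODE`, `encode`, and `decode` are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# Encoding matrix B: row w holds the coefficients worker w applies to the
# partial gradients (g1, g2, g3). Each worker needs only two data partitions.
B = [
    [0.5, 1.0, 0.0],   # worker 0 sends g1/2 + g2
    [0.0, 1.0, -1.0],  # worker 1 sends g2 - g3
    [0.5, 0.0, 1.0],   # worker 2 sends g1/2 + g3
]

# For every pair of surviving workers, the weights that combine their two
# coded vectors into the full gradient g1 + g2 + g3.
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}


def encode(partials):
    """Return the coded vector each of the three workers would send."""
    dim = len(partials[0])
    return [
        [sum(B[w][k] * partials[k][d] for k in range(3)) for d in range(dim)]
        for w in range(3)
    ]


def decode(coded, survivors):
    """Recover the full gradient from the two workers that responded."""
    a, b = DECODE[survivors]
    i, j = survivors
    return [a * x + b * y for x, y in zip(coded[i], coded[j])]


# Toy partial gradients (made-up numbers), one per data partition.
partials = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
full = [sum(col) for col in zip(*partials)]

coded = encode(partials)
# Whichever single worker straggles, the other two suffice.
recovered = {s: decode(coded, s) for s in combinations(range(3), 2)}
```

Each worker computes two of the three partial gradients, so the redundancy costs twice the computation of the uncoded case in exchange for tolerating any one straggler.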
Batched network coding is a low-complexity network coding solution to feedbackless multi-hop wireless packet network transmission with packet loss. The data to be transmitted is encoded into batches, each of which consists of a few coded packets. Unlike the traditional forwarding strategy, the intermediate network nodes have to perform recoding, which generates recoded packets by network coding operations restricted within the same batch. Adaptive recoding is a technique that adapts to fluctuations in packet loss by optimizing the number of recoded packets per batch to enhance the throughput. The input rank distribution, which is a piece of information regarding the batches arriving at the node, is required to apply adaptive recoding. However, this distribution is not known in advance in practice, as the incoming link's channel condition may change from time to time. On the other hand, to fully utilize the potential of adaptive recoding, we need a good estimate of this distribution. In other words, we need to infer this distribution from a few samples so that we can apply adaptive recoding as soon as possible. In this paper, we propose a distributionally robust optimization for adaptive recoding with a small-sample inferred prediction of the input rank distribution. We develop an algorithm to efficiently solve this optimization, supported by theoretical guarantees that the optimization's performance constitutes a confidence lower bound on the optimal throughput with high probability.
We consider a wireless communication network with an adaptive scheme to select the number of packets to be admitted and encoded for each transmission, and characterize the information timeliness. For a network of erasure channels and discrete time, we provide closed-form expressions for the Average and Peak Age of Information (AoI) as functions of admission control and adaptive coding parameters, the feedback delay, and the maximum feasible end-to-end rate that depends on channel conditions and network topology. These new results guide the system design for robust improvements of the AoI when transmitting time-sensitive information in the presence of topology and channel changes. We illustrate the benefits of using adaptive packet coding to improve information timeliness by characterizing the network performance with respect to the AoI along with its relationship to throughput (rate of successfully decoded packets at the destination) and per-packet delay. We show that significant AoI performance gains can be obtained in comparison to the uncoded case, and that these gains are robust to network variations as channel conditions and network topology change.
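To make the two metrics concrete, the following is a minimal sketch of how average and peak AoI could be measured empirically from a delivery log in discrete time. It does not reproduce the closed-form expressions mentioned in the abstract. It uses the usual definition of age as the current slot minus the generation slot of the freshest delivered update; the function names, the log format, and the assumption that the receiver starts with an update generated at slot 0 are all hypothetical.

```python
def age_trace(deliveries, horizon):
    """Age of information at each slot 0..horizon-1.

    deliveries: list of (generation_slot, delivery_slot) pairs.
    Assumption: the receiver starts with an update generated at slot 0.
    """
    freshest_at = {}
    for gen, dlv in deliveries:
        # Keep the freshest update among those delivered in the same slot.
        freshest_at[dlv] = max(gen, freshest_at.get(dlv, gen))

    freshest = 0
    ages = []
    for t in range(horizon):
        if t in freshest_at and freshest_at[t] > freshest:
            freshest = freshest_at[t]
        ages.append(t - freshest)
    return ages


def peak_ages(ages):
    """Age values reached just before each delivery that reset the age."""
    return [ages[t - 1] for t in range(1, len(ages)) if ages[t] <= ages[t - 1]]


# Toy log (made-up numbers): three updates with different delays.
log = [(2, 4), (5, 6), (7, 10)]
ages = age_trace(log, horizon=12)
average_aoi = sum(ages) / len(ages)
peaks = peak_ages(ages)
average_peak_aoi = sum(peaks) / len(peaks)
```

For this toy log the age trace is `[0, 1, 2, 3, 2, 3, 1, 2, 3, 4, 3, 4]`, with peaks of 3, 3, and 4 just before the three deliveries.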
A major hurdle in machine learning is scalability to massive datasets. One approach to overcoming this is to distribute the computational tasks among several workers. Gradient coding has recently been proposed in distributed optimization to compute the gradient of an objective function using multiple, possibly unreliable, worker nodes. By designing distributed coded schemes, gradient coded computations can be made resilient to stragglers, nodes with a longer response time compared to other nodes in a distributed network. Most such schemes rely on operations over the real or complex numbers and are inherently numerically unstable. We present a binary scheme which avoids such operations, thereby enabling numerically stable distributed computation of the gradient. Also, some restrictive assumptions in prior work are dropped, and a more efficient decoding is given.
We propose a novel adaptive and causal random linear network coding (AC-RLNC) algorithm with forward error correction (FEC) for a point-to-point communication channel with delayed feedback. AC-RLNC is adaptive to the channel condition, which the algorithm estimates, and is causal, as coding depends on the particular erasure realizations, as reflected in the feedback acknowledgments. Specifically, the proposed model can learn the erasure pattern of the channel via feedback acknowledgments, and adaptively adjust its retransmission rates using a priori and a posteriori algorithms. Through those adjustments, AC-RLNC achieves the desired delay and throughput, and enables transmission with zero error probability. We upper bound the throughput and the mean and maximum in-order delivery delay of AC-RLNC, and prove that for the point-to-point communication channel in the non-asymptotic regime the proposed code may achieve more than 90% of the channel capacity. To upper bound the throughput we utilize the minimum Bhattacharyya distance for the AC-RLNC code. We validate those results via simulations. We contrast the performance of AC-RLNC with that of selective repeat (SR)-ARQ, which is causal but not adaptive, and is a posteriori. Via a study on experimentally obtained commercial traces, we demonstrate that a protocol based on AC-RLNC can, vis-à-vis SR-ARQ, double the throughput gains, and triple the gain in terms of mean in-order delivery delay when the channel is bursty. Furthermore, the difference between the maximum and mean in-order delivery delay is much smaller than that of SR-ARQ. Closing the delay gap along with boosting the throughput is very promising for enabling ultra-reliable low-latency communications (URLLC) applications.