ترغب بنشر مسار تعليمي؟ اضغط هنا

Straggler-Resilient and Communication-Efficient Distributed Iterative Linear Solver

97   0   0.0 ( 0 )
 نشر من قبل Farzin Haddadpour
 تاريخ النشر 2018
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We propose a novel distributed iterative linear inverse solver method. Our method, PolyLin, has significantly lower communication cost, both in terms of number of rounds as well as number of bits, in comparison with the state of the art at the cost of higher computational complexity and storage. Our algorithm also has a built-in resilience to straggling and faulty computation nodes. We develop a natural variant of our main algorithm that trades off communication cost for computational complexity. Our method is inspired by ideas in error correcting codes.



قيم البحث

اقرأ أيضاً

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called stragglers, and communication overheads. Recently, Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance. However, their proposed coding schemes suffer from heavy decoding complexity and poor numerical stability. In this paper, we develop a communication-efficient gradient coding framework to overcome these drawbacks. Our proposed framework enables using any linear code to design the encoding and decoding functions. When a particular code is used in this framework, its block-length determines the computation load, dimension determines the communication overhead, and minimum distance determines the straggler tolerance. The flexibility of choosing a code allows us to gracefully trade-off the straggler threshold and communication overhead for smaller decoding complexity and higher numerical stability. Further, we show that using a maximum distance separable (MDS) code generated by a random Gaussian matrix in our framework yields a gradient code that is optimal with respect to the trade-off and, in addition, satisfies stronger guarantees on numerical stability as compared to the previously proposed schemes. Finally, we evaluate our proposed framework on Amazon EC2 and demonstrate that it reduces the average iteration time by 16% as compared to prior gradient coding schemes.
We present a communication-efficient distributed protocol for computing the Babai point, an approximate nearest point for a random vector ${bf X}inmathbb{R}^n$ in a given lattice. We show that the protocol is optimal in the sense that it minimizes th e sum rate when the components of ${bf X}$ are mutually independent. We then investigate the error probability, i.e. the probability that the Babai point does not coincide with the nearest lattice point. In dimensions two and three, this probability is seen to grow with the packing density. For higher dimensions, we use a bound from probability theory to estimate the error probability for some well-known lattices. Our investigations suggest that for uniform distributions, the error probability becomes large with the dimension of the lattice, for lattices with good packing densities. We also consider the case where $mathbf{X}$ is obtained by adding Gaussian noise to a randomly chosen lattice point. In this case, the error probability goes to zero with the lattice dimension when the noise variance is sufficiently small. In such cases, a distributed algorithm for finding the approximate nearest lattice point is sufficient for finding the nearest lattice point.
In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node c an only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.
Distributed matrix computations -- matrix-matrix or matrix-vector multiplications -- are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). Much of prior work in this area is (i) either sub-optimal in terms of its straggler resilience, or (ii) suffers from numerical problems, i.e., there is a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. Our work presents convolutional coding approach to this problem that removes these limitations. It is optimal in terms of its straggler resilience, and has excellent numerical robustness as long as the workers storage capacity is slightly higher than the fundamental lower bound. Moreover, it can be decoded using a fast peeling decoder that only involves add/subtract operations. Our second approach has marginally higher decoding complexity than the first one, but allows us to operate arbitrarily close to the lower bound. Its numerical robustness can be theoretically quantified by deriving a computable upper bound on the worst case condition number over all possible decoding matrices by drawing connections with the properties of large Toeplitz matrices. All above claims are backed up by extensive experiments done on the AWS cloud platform.
321 - Yong Zeng , Rui Zhang 2016
Wireless communication with unmanned aerial vehicles (UAVs) is a promising technology for future communication systems. In this paper, we study energy-efficient UAV communication with a ground terminal via optimizing the UAVs trajectory, a new design paradigm that jointly considers both the communication throughput and the UAVs energy consumption. To this end, we first derive a theoretical model on the propulsion energy consumption of fixed-wing UAVs as a function of the UAVs flying speed, direction and acceleration, based on which the energy efficiency of UAV communication is defined. Then, for the case of unconstrained trajectory optimization, we show that both the rate-maximization and energy-minimization designs lead to vanishing energy efficiency and thus are energy-inefficient in general. Next, we introduce a practical circular UAV trajectory, under which the UAVs flight radius and speed are optimized to maximize the energy efficiency for communication. Furthermore, an efficient design is proposed for maximizing the UAVs energy efficiency with general constraints on its trajectory, including its initial/final locations and velocities, as well as maximum speed and acceleration. Numerical results show that the proposed designs achieve significantly higher energy efficiency for UAV communication as compared with other benchmark schemes.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا