مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Straggler Mitigation through Unequal Error Protection for Distributed Matrix Multiplication

109 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Busra Tegin

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Busra Tegin - Eduin E. Hernandez - Stefano Rini

نظرية المعلومات النظم الموزعة والتوازية والحوسبة العنقودية نظرية المعلومات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for computation at the agents is affected by the availability of local resources giving rise to the straggler problem in which the computation results are held back by unresponsive agents. For this problem, linear coding of the matrix sub-blocks can be used to introduce resilience toward straggling. The Parameter Server (PS) utilizes a channel code and distributes the matrices to the workers for multiplication. It then produces an approximation to the desired matrix multiplication using the results of the computations received at a given deadline. In this paper, we propose to employ Unequal Error Protection (UEP) codes to alleviate the straggler problem. The resiliency level of each sub-block is chosen according to its norm as blocks with larger norms have higher effects on the result of the matrix multiplication. We validate the effectiveness of our scheme both theoretically and through numerical evaluations. We derive a theoretical characterization of the performance of UEP using random linear codes, and compare it the case of equal error protection. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN), for which we investigate the fundamental trade-off between precision and the time required for the computations.

قيم البحث

122 - Busra Tegin , Eduin. E. Hernandez , Stefano Rini 2021

Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources and/or po or channel conditions giving rise to the straggler problem. As a remedy to this problem, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting to provide higher protection for the blocks with higher effect on the final result. We characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN) for an image classification task in the evaluation of the gradients. Our numerical experiments show that it is indeed possible to obtain significant improvements in the overall time required to achieve the DNN training convergence by producing approximation of matrix products using UEP codes in the presence of stragglers.

النظم الموزعة والتوازية والحوسبة العنقودية نظرية المعلومات نظرية المعلومات

Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning

310 - Swanand Kadhe , O. Ozan Koyluoglu , 2020

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called stragglers, and communication overheads. Recently, Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance. However, their proposed coding schemes suffer from heavy decoding complexity and poor numerical stability. In this paper, we develop a communication-efficient gradient coding framework to overcome these drawbacks. Our proposed framework enables using any linear code to design the encoding and decoding functions. When a particular code is used in this framework, its block-length determines the computation load, dimension determines the communication overhead, and minimum distance determines the straggler tolerance. The flexibility of choosing a code allows us to gracefully trade-off the straggler threshold and communication overhead for smaller decoding complexity and higher numerical stability. Further, we show that using a maximum distance separable (MDS) code generated by a random Gaussian matrix in our framework yields a gradient code that is optimal with respect to the trade-off and, in addition, satisfies stronger guarantees on numerical stability as compared to the previously proposed schemes. Finally, we evaluate our proposed framework on Amazon EC2 and demonstrate that it reduces the average iteration time by 16% as compared to prior gradient coding schemes.

نظرية المعلومات النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Autoencoder-Based Unequal Error Protection Codes

149 - Vukan Ninkovic , Dejan Vukobratovic , Christian Hager 2021

We present a novel autoencoder-based approach for designing codes that provide unequal error protection (UEP) capabilities. The proposed design is based on a generalization of an autoencoder loss function that accommodates both message-wise and bit-w ise UEP scenarios. In both scenarios, the generalized loss function can be adjusted using an associated weight vector to trade off error probabilities corresponding to different importance classes. For message-wise UEP, we compare the proposed autoencoder-based UEP codes with a union of random coset codes. For bit-wise UEP, the proposed codes are compared with UEP rateless spinal codes and the superposition of random Gaussian codes. In all cases, the autoencoder-based codes show superior performance while providing design simplicity and flexibility in trading off error protection among different importance classes.

نظرية المعلومات نظرية المعلومات

Gradient Coding with Dynamic Clustering for Straggler Mitigation

356 - Baturalp Buyukates , Emre Ozfatura , Sennur Ulukus 2020

In distributed synchronous gradient descent (GD) the main performance bottleneck for the per-iteration completion time is the slowest textit{straggling} workers. To speed up GD iterations in the presence of stragglers, coded distributed computation t echniques are implemented by assigning redundant computations to workers. In this paper, we propose a novel gradient coding (GC) scheme that utilizes dynamic clustering, denoted by GC-DC, to speed up the gradient calculation. Under time-correlated straggling behavior, GC-DC aims at regulating the number of straggling workers in each cluster based on the straggler behavior in the previous iteration. We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.

نظرية المعلومات النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Adaptive Private Distributed Matrix Multiplication

305 - Rawad Bitar , Marvin Xhemrishi , Antonia Wachter-Zeh 2021

We consider the problem of designing codes with flexible rate (referred to as rateless codes), for private distributed matrix-matrix multiplication. A master server owns two private matrices $mathbf{A}$ and $mathbf{B}$ and hires worker nodes to help computing their multiplication. The matrices should remain information-theoretically private from the workers. Codes with fixed rate require the master to assign tasks to the workers and then wait for a predetermined number of workers to finish their assigned tasks. The size of the tasks, hence the rate of the scheme, depends on the number of workers that the master waits for. We design a rateless private matrix-matrix multiplication scheme, called RPM3. In contrast to fixed-rate schemes, our scheme fixes the size of the tasks and allows the master to send multiple tasks to the workers. The master keeps sending tasks and receiving results until it can decode the multiplication; rendering the scheme flexible and adaptive to heterogeneous environments. Despite resulting in a smaller rate than known straggler-tolerant schemes, RPM3 provides a smaller mean waiting time of the master by leveraging the heterogeneity of the workers. The waiting time is studied under two different models for the workers service time. We provide upper bounds for the mean waiting time under both models. In addition, we provide lower bounds on the mean waiting time under the worker-dependent fixed service time model.

نظرية المعلومات نظرية المعلومات

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

الجامعة الافتراضية السورية

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Straggler Mitigation through Unequal Error Protection for Distributed Matrix Multiplication

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً