Divide-and-Shuffle Synchronization for Distributed Machine Learning

Published by Weiyan Wang
Publication date: 2020
Research field: Informatics Engineering
Paper language: English


Distributed machine learning suffers from the synchronization bottleneck of all-reducing workers' updates. Previous works mainly consider better network topologies, gradient compression, or stale updates to speed up communication and relieve the bottleneck. However, all of these works ignore the importance of reducing the scale of synchronized elements and the inevitable serially executed operators. To address this problem, our work proposes Divide-and-Shuffle Synchronization (DS-Sync), which divides workers into several parallel groups and shuffles group members. DS-Sync only synchronizes the workers in the same group, so the synchronization scale of each group is much smaller. Shuffling workers across groups maintains the algorithm's convergence speed, which is justified in theory. Comprehensive experiments also show significant improvements on recent, popular models such as BERT, WideResNet, and DeepFM over challenging datasets.
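To make the mechanism concrete, here is a minimal single-process simulation of the divide-and-shuffle idea. The worker count, group size, and local-update rule are illustrative assumptions, not the authors' implementation:

```python
# Toy simulation: workers are split into groups, each group averages
# (all-reduces) only its own members' parameters, and group membership
# is reshuffled every round. Illustrative sketch only.
import random
import numpy as np

NUM_WORKERS = 8
GROUP_SIZE = 4          # assumed group size; the paper's choice may differ
DIM = 10                # toy parameter vector length

# Each worker holds its own replica of the model parameters.
params = [np.random.randn(DIM) for _ in range(NUM_WORKERS)]

def local_update(w):
    """Stand-in for one step of local SGD on a worker's private data."""
    return w - 0.1 * np.random.randn(DIM)

def ds_sync_round(params, rng):
    # Shuffle worker indices, then cut them into parallel groups.
    order = list(range(NUM_WORKERS))
    rng.shuffle(order)
    groups = [order[i:i + GROUP_SIZE] for i in range(0, NUM_WORKERS, GROUP_SIZE)]
    # Each group averages only among its own members, so the
    # synchronization scale is GROUP_SIZE, not NUM_WORKERS.
    for group in groups:
        group_avg = np.mean([params[w] for w in group], axis=0)
        for w in group:
            params[w] = group_avg.copy()

rng = random.Random(0)
for step in range(100):
    params = [local_update(w) for w in params]
    ds_sync_round(params, rng)

# After repeated shuffles, information mixes across all workers even though
# no single round touches more than GROUP_SIZE of them.
print("max pairwise distance:",
      max(np.linalg.norm(a - b) for a in params for b in params))
```
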


Read also

166 - Kevin Hsieh 2019
The usability and practicality of any machine learning (ML) application are largely influenced by two critical but hard-to-attain factors: low latency and low cost. Unfortunately, achieving low latency and low cost is very challenging when ML depends on real-world data that are highly distributed and rapidly growing (e.g., data collected by mobile phones and video cameras all over the world). Such real-world data pose many challenges in communication and computation. For example, when training data are distributed across data centers that span multiple continents, communication among data centers can easily overwhelm the limited wide-area network bandwidth, leading to prohibitively high latency and high cost. In this dissertation, we demonstrate that the latency and cost of ML on highly distributed and rapidly growing data can be improved by one to two orders of magnitude by designing ML systems that exploit the characteristics of ML algorithms, ML model structures, and ML training/serving data. We support this thesis statement with three contributions. First, we design a system that provides both low-latency and low-cost ML serving (inferencing) over large-scale and continuously growing datasets, such as videos. Second, we build a system that makes ML training over geo-distributed datasets as fast as training within a single data center. Third, we present the first detailed study and a system-level solution for a fundamental and largely overlooked problem: ML training over non-IID (i.e., not independent and identically distributed) data partitions (e.g., facial images collected by cameras vary according to the demographics of each camera's location).
283 - Wei Dai, Yi Zhou, Nanqing Dong 2018
Many distributed machine learning (ML) systems adopt non-synchronous execution in order to alleviate the network communication bottleneck, resulting in stale parameters that do not reflect the latest updates. Despite much development in large-scale ML, the effects of staleness on learning are inconclusive, as it is challenging to directly monitor or control staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of stochastic gradient descent in non-convex optimization under staleness, matching the best-known convergence rate of O(1/√T).
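As a rough intuition for how delayed updates are modeled, the following toy sketch runs SGD on a simple quadratic while delaying every gradient by a fixed number of steps. The objective, step size, and staleness values are assumptions for illustration, not the paper's experimental setup:

```python
# SGD with stale gradients on f(x) = 0.5 * ||x||^2: the gradient applied
# at step t was computed from the iterate at step t - staleness.
from collections import deque
import numpy as np

def stale_sgd(staleness, steps=200, lr=0.05, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    buffer = deque()                      # gradients from older iterates, oldest first
    for _ in range(steps):
        buffer.append(x.copy())           # grad of 0.5*||x||^2 at x is x itself
        if len(buffer) > staleness:
            g = buffer.popleft()          # apply a gradient that is `staleness` steps old
            x = x - lr * g
    return np.linalg.norm(x)

for s in (0, 4, 16):
    print(f"staleness={s:2d}  final ||x|| = {stale_sgd(s):.4f}")
```
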
When data is distributed across multiple servers, lowering the communication cost between the servers (or workers) while solving the distributed learning problem is an important challenge and is the focus of this paper. In particular, we propose a fast and communication-efficient decentralized framework to solve the distributed machine learning (DML) problem. The proposed algorithm, Group Alternating Direction Method of Multipliers (GADMM), is based on the Alternating Direction Method of Multipliers (ADMM) framework. The key novelty in GADMM is that it solves the problem in a decentralized topology where at most half of the workers are competing for the limited communication resources at any given time. Moreover, each worker exchanges its locally trained model only with two neighboring workers, thereby training a global model with less communication overhead in each exchange. We prove that GADMM converges to the optimal solution for convex loss functions, and numerically show that it converges faster and is more communication-efficient than state-of-the-art communication-efficient algorithms such as the Lazily Aggregated Gradient (LAG) and dual averaging, on linear and logistic regression tasks with synthetic and real datasets. Furthermore, we propose Dynamic GADMM (D-GADMM), a variant of GADMM, and prove its convergence under a time-varying network topology of the workers.
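The communication schedule described above can be sketched as follows; this only illustrates the head/tail neighbour-exchange pattern on a chain of workers and omits the actual GADMM primal and dual updates:

```python
# Workers 0..N-1 sit on a line, split into a "head" group (even index) and a
# "tail" group (odd index). In each half-iteration only one group updates,
# and each updating worker talks to at most its two neighbours, so at most
# half of the workers use the links at any time. Illustration only.
N = 6

def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < N]

heads = [i for i in range(N) if i % 2 == 0]
tails = [i for i in range(N) if i % 2 == 1]

for phase, group in (("head update", heads), ("tail update", tails)):
    print(phase)
    for i in group:
        print(f"  worker {i} pulls models from neighbours {neighbours(i)}")
```
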
Recommendation systems are often trained with a tremendous amount of data, and distributed training is the workhorse for shortening the training time. While the training throughput can be increased by simply adding more workers, it is also increasingly challenging to preserve the model quality. In this paper, we present shadowsync, a distributed framework specifically tailored to modern-scale recommendation system training. In contrast to previous works where synchronization happens as part of the training process, shadowsync separates synchronization from training and runs it in the background. Such isolation significantly reduces the synchronization overhead and increases the synchronization frequency, so that we are able to obtain both high throughput and excellent model quality when training at scale. The superiority of our procedure is confirmed by experiments on training deep neural networks for click-through-rate prediction tasks. Our framework can express data parallelism and/or model parallelism, is generic enough to host various types of synchronization algorithms, and is readily applicable to large-scale problems in other areas.
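A minimal single-process sketch of the background-synchronization idea (training threads keep updating local replicas while a separate thread periodically averages them) might look like the following; the thread structure, locking, and sync period are assumptions for illustration, not shadowsync's actual design:

```python
# Training threads update local parameters without blocking on an
# all-reduce; a synchronizer thread averages the replicas on its own clock.
import threading
import time
import numpy as np

NUM_WORKERS, DIM = 4, 8
params = [np.random.randn(DIM) for _ in range(NUM_WORKERS)]
lock = threading.Lock()
stop = threading.Event()

def trainer(rank):
    while not stop.is_set():
        grad = np.random.randn(DIM) * 0.01        # stand-in for a real gradient
        with lock:
            params[rank] -= grad                  # local update, no blocking sync
        time.sleep(0.001)

def synchronizer(period=0.05):
    while not stop.is_set():
        time.sleep(period)                        # sync frequency is decoupled
        with lock:                                # from the training step
            avg = np.mean(params, axis=0)
            for r in range(NUM_WORKERS):
                params[r] = avg.copy()

threads = [threading.Thread(target=trainer, args=(r,)) for r in range(NUM_WORKERS)]
threads.append(threading.Thread(target=synchronizer))
for t in threads:
    t.start()
time.sleep(0.5)
stop.set()
for t in threads:
    t.join()
print("replica spread after background sync:",
      max(np.linalg.norm(a - b) for a in params for b in params))
```
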
Matrix-parametrized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter-synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parametrized models, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two sufficient factors (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency (communication costs are linear in the parameter matrix's dimensions rather than quadratic) without affecting computational correctness. We present a theoretical convergence analysis of SFB, and empirically corroborate its efficiency on four different matrix-parametrized ML models.
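A toy sketch of the sufficient-factor idea: instead of shipping the dense rank-1 update matrix, each worker broadcasts the two factors and every receiver rebuilds the outer product locally. The matrix sizes and the random stand-in for the per-sample update below are illustrative assumptions:

```python
# Broadcasting the factors of a rank-1 update costs M + K floats per
# receiver, versus M * K floats for the dense update matrix.
import numpy as np

M, K, NUM_WORKERS = 1000, 200, 4
rng = np.random.default_rng(0)
W = [np.zeros((M, K)) for _ in range(NUM_WORKERS)]   # replicas of the parameter matrix

def local_rank1_update(rng):
    """Stand-in for an update computed on one sample; e.g. in multiclass
    logistic regression the gradient is (error vector) x (feature vector)."""
    u = rng.standard_normal(M)
    v = rng.standard_normal(K)
    return u, v

for step in range(10):
    for sender in range(NUM_WORKERS):
        u, v = local_rank1_update(rng)
        # Only u and v travel over the network; each receiver rebuilds
        # the full update matrix locally.
        for receiver in range(NUM_WORKERS):
            W[receiver] += np.outer(u, v)

print("replicas identical:", all(np.allclose(W[0], w) for w in W))
print(f"per-update traffic: {M + K} floats (factors) vs {M * K} floats (dense)")
```
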
