Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning

190 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Peng Jiang

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Peng Jiang - Gagan Agrawal

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational costs, there is a growing interest in accelerating SGD on HPC resources like GPU clusters. However, the performance of parallel SGD is still bottlenecked by the high communication costs even with a fast connection among the machines. A simple approach to alleviating this problem, used in many existing efforts, is to perform communication every few iterations, using a constant averaging period. In this paper, we show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution. Specifically, we observe that reducing the variance of model parameters among the computing nodes is critical to the convergence of periodic parameter averaging SGD. Given a fixed communication budget, we show that it is more beneficial to synchronize more frequently in early iterations to reduce the initial large variance and synchronize less frequently in the later phase of the training process. We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters, and thus better convergence compared with the Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method with several image classification benchmarks and show that our ADPSGD indeed achieves smaller training losses and higher test accuracies with smaller communication compared with CPSGD. Compared with gradient-quantization SGD, we show that our algorithm achieves faster convergence with only half of the communication. Compared with full-communication SGD, our ADPSGD achieves 1:14x to 1:27x speedups with a 100Gbps connection among computing nodes, and the speedups increase to 1:46x ~ 1:95x with a 10Gbps connection.

قيم البحث

383 - Tianyi Chen , Ziye Guo , Yuejiao Sun 2020

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient desce nt method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method - justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التحسين والتحكم

Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning

138 - Divyansh Jhunjhunwala , Advait Gadhikar , Gauri Joshi 2021

Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and high-dimensional models. Gradient quantization is an effective way of red ucing the number of bits required to communicate each model update, albeit at the cost of having a higher error floor due to the higher variance of the stochastic gradients. In this work, we propose an adaptive quantization strategy called AdaQuantFL that aims to achieve communication efficiency as well as a low error floor by changing the number of quantization levels during the course of training. Experiments on training deep neural networks show that our method can converge in much fewer communicated bits as compared to fixed quantization level setups, with little or no impact on training and test accuracy.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD

312 - Jianyu Wang , Hao Liang , Gauri Joshi 2020

Distributed stochastic gradient descent (SGD) is essential for scaling the machine learning algorithms to a large number of computing nodes. However, the infrastructures variability such as high communication delay or random node slowdown greatly imp edes the performance of distributed SGD algorithm, especially in a wireless system or sensor networks. In this paper, we propose an algorithmic approach named Overlap-Local-SGD (and its momentum variant) to overlap the communication and computation so as to speedup the distributed training procedure. The approach can help to mitigate the straggler effects as well. We achieve this by adding an anchor model on each node. After multiple local updates, locally trained models will be pulled back towards the synchronized anchor model rather than communicating with others. Experimental results of training a deep neural network on CIFAR-10 dataset demonstrate the effectiveness of Overlap-Local-SGD. We also provide a convergence guarantee for the proposed algorithm under non-convex objective functions.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التحسين والتحكم

RingFed: Reducing Communication Costs in Federated Learning on Non-IID Data

121 - Guang Yang , Ke Mu , Chunhe Song 2021

Federated learning is a widely used distributed deep learning framework that protects the privacy of each client by exchanging model parameters rather than raw data. However, federated learning suffers from high communication costs, as a considerable number of model parameters need to be transmitted many times during the training process, making the approach inefficient, especially when the communication network bandwidth is limited. This article proposes RingFed, a novel framework to reduce communication overhead during the training process of federated learning. Rather than transmitting parameters between the center server and each client, as in original federated learning, in the proposed RingFed, the updated parameters are transmitted between each client in turn, and only the final result is transmitted to the central server, thereby reducing the communication overhead substantially. After several local updates, clients first send their parameters to another proximal client, not to the center server directly, to preaggregate. Experiments on two different public datasets show that RingFed has fast convergence, high model accuracy, and low communication cost.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية

Reducing the Communication Cost of Federated Learning through Multistage Optimization

196 - Charlie Hou , Kiran K. Thekumparampil , Giulia Fanti 2021

A central question in federated learning (FL) is how to design optimization algorithms that minimize the communication cost of training a model over heterogeneous data distributed across many clients. A popular technique for reducing communication is the use of local steps, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg, SCAFFOLD). This contrasts with centralized methods, where clients take one optimization step per communication round (e.g., Minibatch SGD). A recent lower bound on the communication complexity of first-order methods shows that centralized methods are optimal over highly-heterogeneous data, whereas local methods are optimal over purely homogeneous data [Woodworth et al., 2020]. For intermediate heterogeneity levels, no algorithm is known to match the lower bound. In this paper, we propose a multistage optimization scheme that nearly matches the lower bound across all heterogeneity levels. The idea is to first run a local method up to a heterogeneity-induced error floor; next, we switch to a centralized method for the remaining steps. Our analysis may help explain empirically-successful stepsize decay methods in FL [Charles et al., 2020; Reddi et al., 2020]. We demonstrate the schemes practical utility in image classification tasks.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التحسين والتحكم