ﻻ يوجد ملخص باللغة العربية
Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is reversely proportional to quantity $1-beta$ which measures the network connectivity. On large and sparse networks where $1-beta to 0$, Gossip SGD requires more iterations to converge, which offsets against its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging into Gossip SGD. Its transient stage, i.e., the iterations required to reach asymptotic linear speedup stage, improves from $Omega(beta^4 n^3/(1-beta)^4)$ to $Omega(beta^4 n^3 H^4)$ for non-convex problems. The influence of network topology in Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to Local SGD which has order $Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
Federated Learning (FL) is a recent development in the field of machine learning that collaboratively trains models without the training data leaving client devices, to preserve data privacy. In realistic FL settings, the training set is distributed
Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD w
Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we intr
Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational costs, there is a growing interest in accelerating SGD on HPC resources like GPU clusters. However, the performance of paral
Stochastic gradient descent (SGD) has been widely studied in the literature from different angles, and is commonly employed for solving many big data machine learning problems. However, the averaging technique, which combines all iterative solutions