In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that {\em biased} compressors often show superior performance in practice when compared to the much more studied and understood {\em unbiased} compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings. Our {\em distributed} SGD method enjoys the ergodic rate $\mathcal{O}\left(\frac{\delta L \exp(-K)}{\mu} + \frac{C + D}{K\mu}\right)$, where $\delta$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node), and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose a new high-performing biased compressor, a combination of Top-$k$ and natural dithering, which in our experiments outperforms all other compression techniques.
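Since this abstract contrasts a biased compressor (Top-$k$) with an unbiased one (natural dithering) and proposes their combination, a minimal sketch of both may help. The function names and the simplified dithering rule below, which rounds each magnitude stochastically to a nearby power of two, are illustrative assumptions rather than the exact operators of the paper.

```python
import numpy as np

def top_k(x, k):
    """Biased compressor: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]          # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def natural_dithering(x, rng=None):
    """Unbiased compressor (illustrative): round each |x_i| stochastically to a power of two."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(x)
    nz = x != 0
    mag = np.abs(x[nz])
    low = 2.0 ** np.floor(np.log2(mag))        # largest power of two not exceeding |x_i|
    p_up = (mag - low) / low                   # rounding up with this probability keeps E[Q(x)] = x
    up = rng.random(mag.shape) < p_up
    out[nz] = np.sign(x[nz]) * np.where(up, 2.0 * low, low)
    return out

g = np.array([0.3, -2.5, 0.05, 1.2, -0.7])
print(top_k(g, k=2))                           # biased: keeps only -2.5 and 1.2
print(natural_dithering(top_k(g, k=2)))        # Top-k composed with dithering, in the spirit of the abstract
```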
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (the leader).
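A minimal sketch of the two-force update described above, run on a toy quadratic with simulated stochastic gradients; the step size `lr`, the pull strength `lam`, and the toy objective are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, lam = 4, 10, 0.1, 0.1
t = rng.normal(size=dim)                                    # shared optimum of the toy objective
params = [rng.normal(size=dim) for _ in range(n_workers)]   # one model replica per worker

loss = lambda w: 0.5 * np.sum((w - t) ** 2)

for step in range(200):
    leader = min(range(n_workers), key=lambda i: loss(params[i]))   # currently best-performing worker
    for i in range(n_workers):
        stoch_grad = (params[i] - t) + 0.1 * rng.normal(size=dim)   # noisy local gradient
        # Two forces: a regular (stochastic) gradient step and a pull toward the leader's parameters.
        params[i] -= lr * stoch_grad + lam * (params[i] - params[leader])

print("final losses:", [round(loss(w), 4) for w in params])
```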
Various bias-correction methods such as EXTRA, gradient tracking methods, and exact diffusion have been proposed recently to solve distributed {\em deterministic} optimization problems. These methods employ constant step-sizes and converge linearly to the exact solution.
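As a concrete illustration of one of the bias-correction techniques named above, here is a minimal gradient-tracking sketch on a toy average-of-quadratics problem over a ring of nodes; the mixing matrix, step size, and problem data are illustrative assumptions.

```python
import numpy as np

n, d, alpha = 5, 3, 0.02
rng = np.random.default_rng(1)
b = rng.normal(size=(n, d))                      # node i holds f_i(x) = 0.5*||x - b_i||^2
grad = lambda X: X - b                           # stacked local gradients, one row per node

# Doubly stochastic mixing matrix for a ring (self weight 1/2, each neighbor 1/4).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

X = np.zeros((n, d))                             # local iterates
Y = grad(X)                                      # gradient trackers, initialized to the local gradients

for k in range(1000):
    X_new = W @ X - alpha * Y                    # consensus step plus a step along the tracked gradient
    Y = W @ Y + grad(X_new) - grad(X)            # track the network-average gradient
    X = X_new

print("consensus error:", np.linalg.norm(X - X.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(X.mean(axis=0) - b.mean(axis=0)))
```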
Training deep neural networks on large datasets containing high-dimensional data requires a large amount of computation. A solution to this problem is data-parallel distributed training, where a model is replicated across several computational nodes that each work on a different portion of the data.
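A minimal sketch of the data-parallel pattern described above, with workers simulated in a single process and their local gradients averaged as an all-reduce step would do; the toy linear-regression model and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_workers, n_samples, d, lr = 4, 400, 5, 0.1
X = rng.normal(size=(n_samples, d))
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=n_samples)   # toy regression data
shards = np.array_split(np.arange(n_samples), n_workers)          # each worker gets its own data chunk

w = np.zeros(d)                                                   # replicated model parameters
for step in range(100):
    # Each worker computes a gradient on its own shard of the data.
    local_grads = [X[s].T @ (X[s] @ w - y[s]) / len(s) for s in shards]
    # Averaging the local gradients plays the role of the all-reduce communication step.
    w -= lr * np.mean(local_grads, axis=0)

print("training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```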
Decentralized optimization and communication compression have exhibited their great potential in accelerating distributed machine learning by mitigating the communication bottleneck in practice. While existing decentralized algorithms with communication compression
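To make the combination of decentralization and compression concrete, the following is a minimal sketch in the spirit of compressed-gossip schemes such as CHOCO-Gossip: nodes exchange only compressed differences against publicly maintained copies and still reach consensus on the average. The ring topology, Top-$k$ compressor, and consensus step size are illustrative assumptions, not the algorithm of this abstract.

```python
import numpy as np

def top_k(x, k):
    """Biased Top-k compressor applied to the communicated difference."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

n, d, k, gamma = 4, 20, 10, 0.1
rng = np.random.default_rng(3)
x = rng.normal(size=(n, d))            # each node's local vector; the goal is to average them
x_hat = np.zeros((n, d))               # publicly known compressed copies of all local vectors

# Ring topology with a doubly stochastic mixing matrix.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

target = x.mean(axis=0)
for t in range(500):
    # Each node broadcasts only a compressed difference, so all public copies stay in sync.
    x_hat = x_hat + np.stack([top_k(x[i] - x_hat[i], k) for i in range(n)])
    # The gossip step uses the public copies, making it compatible with compression.
    x = x + gamma * (W @ x_hat - x_hat)

print("max distance to the true average:", np.abs(x - target).max())
```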
Codistillation has been proposed as a mechanism to share knowledge among concurrently trained models by encouraging them to represent the same function through an auxiliary loss. This contrasts with the more commonly used fully-synchronous data-parallel stochastic gradient descent.
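A minimal sketch of the codistillation mechanism described above, with two concurrently trained toy regressors whose auxiliary loss penalizes disagreement with the peer model's predictions; the models, data split, and squared-error distillation term are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lr, alpha = 200, 5, 0.05, 0.5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)     # shared regression task
shards = np.array_split(np.arange(n), 2)                   # each model trains on its own data split
w = [np.zeros(d), np.zeros(d)]                             # two concurrently trained linear models

for step in range(300):
    # The peer model's current predictions act as soft targets for the auxiliary loss.
    peer_preds = [X[shards[i]] @ w[1 - i] for i in range(2)]
    for i in range(2):
        Xi, yi = X[shards[i]], y[shards[i]]
        pred = Xi @ w[i]
        grad_task = Xi.T @ (pred - yi) / len(yi)                   # regular task loss gradient
        grad_codistill = Xi.T @ (pred - peer_preds[i]) / len(yi)   # agreement (codistillation) gradient
        w[i] -= lr * (grad_task + alpha * grad_codistill)

print("disagreement between the two models:", np.linalg.norm(w[0] - w[1]))
```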