ﻻ يوجد ملخص باللغة العربية
We consider practical data characteristics underlying federated learning, where unbalanced and non-i.i.d. data from clients have a block-cyclic structure: each cycle contains several blocks, and each clients training data follow block-specific and non-i.i.d. distributions. Such a data structure would introduce client and block biases during the collaborative training: the single global model would be biased towards the client or block specific data. To overcome the biases, we propose two new distributed optimization algorithms called multi-model parallel SGD (MM-PSGD) and multi-chain parallel SGD (MC-PSGD) with a convergence rate of $O(1/sqrt{NT})$, achieving a linear speedup with respect to the total number of clients. In particular, MM-PSGD adopts the block-mixed training strategy, while MC-PSGD further adds the block-separate training strategy. Both algorithms create a specific predictor for each block by averaging and comparing the historical global models generated in this block from different cycles. We extensively evaluate our algorithms over the CIFAR-10 dataset. Evaluation results demonstrate that our algorithms significantly outperform the conventional federated averaging algorithm in terms of test accuracy, and also preserve robustness for the variance of critical parameters.
Various bias-correction methods such as EXTRA, gradient tracking methods, and exact diffusion have been proposed recently to solve distributed {em deterministic} optimization problems. These methods employ constant step-sizes and converge linearly to
We study asynchronous finite sum minimization in a distributed-data setting with a central parameter server. While asynchrony is well understood in parallel settings where the data is accessible by all machines -- e.g., modifications of variance-redu
Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient desce
Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we intr
In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact {em biased} compressors often show superior