
Distributed TD(0) with Almost No Communication

Added by Rui Liu
Publication date: 2021
Language: English





We provide a new non-asymptotic analysis of distributed TD(0) with linear function approximation. Our approach relies on one-shot averaging, where $N$ agents run local copies of TD(0) and average the outcomes only once at the very end. We consider two models: one in which the agents interact with an environment they can observe and whose transitions depend on all of their actions (which we call the global state model), and one in which each agent can run a local copy of an identical Markov Decision Process, which we call the local state model. In the global state model, we show that the convergence rate of our distributed one-shot averaging method matches the known convergence rate of TD(0). By contrast, the best convergence rates in the previous literature were, in the worst case, a factor of $O(N^3)$ worse than the non-distributed version in terms of the number of agents $N$. In the local state model, we demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing benefits from parallelism for temporal difference methods.
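
To make the scheme concrete, here is a minimal Python sketch of one-shot averaging in the local state model. The environment interface (`make_env`, `env.reset`, `env.step`) and the feature map `phi` are hypothetical placeholders rather than anything specified in the paper; the point is that the only communication is the single final average.

```python
import numpy as np

def local_td0(env, phi, theta0, alpha, gamma, num_steps, rng):
    """Plain TD(0) with linear function approximation on one agent's local MDP."""
    theta = theta0.copy()
    s = env.reset()
    for _ in range(num_steps):
        s_next, r = env.step(s, rng)                      # sample one transition
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * td_error * phi(s)         # TD(0) update
        s = s_next
    return theta

def one_shot_averaged_td0(make_env, phi, d, num_agents, alpha, gamma, num_steps):
    """N agents run independent copies of TD(0); they communicate once, at the end."""
    thetas = []
    for i in range(num_agents):
        rng = np.random.default_rng(i)                    # independent local noise
        thetas.append(local_td0(make_env(), phi, np.zeros(d),
                                alpha, gamma, num_steps, rng))
    return np.mean(thetas, axis=0)                        # the single averaging step
```
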




Related research

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce Sketched SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that Sketched SGD has favorable convergence rates on several classes of functions. When considering all communication -- both of gradients and of updated model weights -- Sketched SGD reduces the amount of communication required compared to other gradient compression methods from $\mathcal{O}(d)$ or $\mathcal{O}(W)$ to $\mathcal{O}(\log d)$, where $d$ is the number of model parameters and $W$ is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that Sketched SGD scales to at least 256 workers without increasing communication cost or degrading model performance.
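
As a rough illustration of the idea (not the paper's exact protocol), the sketch below compresses each worker's gradient with a count sketch, merges the sketches on the server by exploiting their linearity, and recovers the heaviest coordinates. For brevity it queries every coordinate at the end, which a real implementation would avoid.

```python
import numpy as np

class CountSketch:
    """Linear sketch compressing a d-dim vector into a small rows x cols table."""
    def __init__(self, d, rows, cols, seed=0):
        rng = np.random.default_rng(seed)        # shared seed => shared hash functions
        self.bucket = rng.integers(cols, size=(rows, d))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, d))
        self.table = np.zeros((rows, cols))

    def accumulate(self, g):
        for i in range(self.table.shape[0]):
            np.add.at(self.table[i], self.bucket[i], self.sign[i] * g)

    def query(self, j):
        """Median-based point estimate of coordinate j."""
        r = np.arange(self.table.shape[0])
        return np.median(self.sign[r, j] * self.table[r, self.bucket[r, j]])

def sketched_round(worker_grads, d, rows=5, cols=256, k=10):
    """Workers send sketches; the server merges them and recovers top-k coordinates."""
    merged = CountSketch(d, rows, cols)
    for g in worker_grads:
        sk = CountSketch(d, rows, cols)          # same seed as the server's sketch
        sk.accumulate(g)
        merged.table += sk.table                 # sketches are linear, so they add
    est = np.array([merged.query(j) for j in range(d)])
    top = np.argsort(np.abs(est))[-k:]
    update = np.zeros(d)
    update[top] = est[top] / len(worker_grads)   # averaged heavy coordinates
    return update
```
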
One crucial step in any quantum key distribution (QKD) scheme is parameter estimation. In a typical QKD protocol, the users have to sacrifice part of their raw data to estimate the parameters of the communication channel, such as the error rate. This introduces a tradeoff between the secret key rate and the accuracy of parameter estimation in the finite-size regime. Here we show that continuous-variable (CV) QKD is not subject to this constraint, as the whole raw key can be used for both parameter estimation and secret key generation without compromising security. First, we show that this property holds for measurement-device-independent (MDI) protocols, as a consequence of the fact that in an MDI protocol the correlations between Alice and Bob are post-selected by the measurement performed by an untrusted relay. This result is then extended beyond the MDI framework by exploiting the fact that MDI protocols can simulate device-dependent one-way QKD with arbitrarily high precision.
Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method, justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication uploads. The new algorithms adaptively reuse stale Adam gradients, thus saving communication, while retaining convergence rates comparable to the original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.
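
A minimal sketch of the kind of adaptive upload rule described, assuming a hypothetical `worker.stochastic_grad` interface; the actual CADA rules and their integration with Adam are more involved.

```python
import numpy as np

def cada_style_round(workers, stale_grads, theta, threshold):
    """One round of a simplified adaptive-upload rule in the spirit of CADA.

    A worker uploads a fresh stochastic gradient only when it has drifted far
    enough from its last uploaded copy; otherwise the server reuses the stale
    gradient it already stores, saving one upload.
    """
    aggregate = np.zeros_like(theta)
    uploads = 0
    for m, worker in enumerate(workers):
        g = worker.stochastic_grad(theta)              # hypothetical worker API
        if np.sum((g - stale_grads[m]) ** 2) >= threshold:
            stale_grads[m] = g                         # fresh upload to the server
            uploads += 1
        aggregate += stale_grads[m]                    # server uses its stored copy
    return aggregate / len(workers), uploads           # feed into an Adam-style update
```
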
Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, in which workers periodically synchronize and perform a momentum update after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF can be expressed through the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF.
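
The outer loop can be sketched as follows; this is an illustrative form only, the exact learning-rate scaling in the paper differs, and `worker_params` stands in for whatever parameters the workers hold after their local steps.

```python
import numpy as np

def slowmo_outer_step(x_global, worker_params, momentum, beta=0.7, slow_lr=1.0):
    """Periodic SlowMo-style outer update (illustrative scaling).

    Each worker has taken several base-optimizer steps starting from x_global.
    Average their parameters, treat the round's displacement as a pseudo-gradient,
    and apply a slow momentum step to the global copy.
    """
    x_avg = np.mean(worker_params, axis=0)       # the periodic synchronization
    pseudo_grad = x_global - x_avg               # displacement accumulated this round
    momentum = beta * momentum + pseudo_grad     # slow momentum buffer
    x_global = x_global - slow_lr * momentum     # outer (slow) update
    return x_global, momentum                    # workers restart from x_global
```
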
TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Existing convergence rates for Temporal Difference (TD) methods apply only to somewhat modified versions.


