Distributed Convex Optimization With Limited Communications


Published by: Milind Rao

Publication date: 2018

Research field: Informatics Engineering

Language: English

Authors: Milind Rao, Stefano Rini, Andrea Goldsmith

Distributed, Parallel, and Cluster Computing; Machine Learning; Multiagent Systems


Abstract

In this paper, a distributed convex optimization algorithm, termed the distributed coordinate dual averaging (DCDA) algorithm, is proposed. The DCDA algorithm addresses the scenario of a large distributed optimization problem with limited communication among nodes in the network. Currently known distributed subgradient methods, such as the distributed dual averaging or the distributed alternating direction method of multipliers algorithms, assume that nodes can exchange messages of large cardinality. Such network communication capabilities are not available in many scenarios of practical relevance. In the DCDA algorithm, on the other hand, communication of each coordinate of the optimization variable is restricted over time. For the proposed algorithm, we bound the rate of convergence under different communication protocols and network architectures. We also consider extensions to the case of imperfect gradient knowledge and the case in which transmitted messages are corrupted by additive noise or are quantized. Relevant numerical simulations are also provided.
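To make the communication constraint concrete, here is a minimal sketch of dual averaging in which each round's message carries a single coordinate of the dual variable. The round-robin coordinate schedule, the quadratic local costs, the ring mixing matrix, the step-size rule, and the name `coordinate_dual_averaging` are all illustrative assumptions. The abstract does not specify the DCDA protocol at this level of detail, so this should not be read as the authors' algorithm.

```python
import math

def coordinate_dual_averaging(centers, mixing, rounds):
    """Dual averaging where nodes exchange one coordinate per round.

    Assumptions (not from the paper): node i holds f_i(x) = 0.5*||x - c_i||^2,
    the transmitted coordinate cycles round-robin, and the proximal function
    is 0.5*||x||^2, so the primal step is x = -alpha * z.
    """
    n, d = len(centers), len(centers[0])
    z = [[0.0] * d for _ in range(n)]  # dual variables (accumulated gradients)
    x = [[0.0] * d for _ in range(n)]  # primal iterates
    for t in range(rounds):
        k = t % d  # the only coordinate sent over the network this round
        grads = [[x[i][m] - centers[i][m] for m in range(d)] for i in range(n)]
        new_z = [row[:] for row in z]
        for i in range(n):
            # Mix coordinate k with neighbours; other coordinates stay local.
            new_z[i][k] = sum(mixing[i][j] * z[j][k] for j in range(n))
            for m in range(d):
                new_z[i][m] += grads[i][m]
        z = new_z
        alpha = 1.0 / math.sqrt(t + 1)  # diminishing step size
        x = [[-alpha * z[i][m] for m in range(d)] for i in range(n)]
    return x

# Four nodes on a ring, three coordinates. The minimizer of the summed cost
# is the mean of the centers, here (1, 2, 2).
CENTERS = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [-1.0, 4.0, 2.0], [1.0, 2.0, 4.0]]
RING = [[0.5, 0.25, 0.0, 0.25],
        [0.25, 0.5, 0.25, 0.0],
        [0.0, 0.25, 0.5, 0.25],
        [0.25, 0.0, 0.25, 0.5]]
iterates = coordinate_dual_averaging(CENTERS, RING, rounds=20000)
```

In this toy setting every node's iterate drifts toward the mean of the centers even though each coordinate is mixed only once every three rounds. The price is a larger disagreement between nodes than full-vector mixing would give.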


Yan Zhang, Robert J. Ravier, Michael M. Zavlanos (2019)

In this paper, we consider the problem of distributed online convex optimization, where a network of local agents aims to jointly optimize a convex function over a period of multiple time steps. The agents do not have any information about the future. Existing algorithms have established dynamic regret bounds that depend explicitly on the number of time steps. In this work, we show that this dependence can be removed when the local objective functions are strongly convex. More precisely, we propose a gradient tracking algorithm where agents jointly communicate and descend based on corrected gradient steps. We verify our theoretical results through numerical experiments.
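The "corrected gradient steps" can be illustrated with a minimal sketch of gradient tracking on a static problem. The abstract concerns the online setting with time-varying objectives. This sketch fixes the objectives and uses hypothetical scalar costs $f_i(x) = \frac{1}{2}(x - b_i)^2$, a ring weight matrix, and the function name `gradient_tracking`, all chosen here for illustration.

```python
def gradient_tracking(targets, weights, step, iterations):
    """Gradient tracking for f_i(x) = 0.5 * (x - b_i)^2 on a fixed network."""
    n = len(targets)

    def grad(i, xi):
        return xi - targets[i]  # local gradient at node i

    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]  # trackers start at local gradients
    for _ in range(iterations):
        # Consensus step on x, then descend along the tracked gradient.
        x_new = [sum(weights[i][j] * x[j] for j in range(n)) - step * y[i]
                 for i in range(n)]
        # Correct each tracker by the change in the local gradient, so the
        # network average of y always equals the average local gradient.
        y = [sum(weights[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x

# Doubly stochastic weights for a four-node ring (an assumed topology).
W = [[0.5, 0.25, 0.0, 0.25],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.25, 0.0, 0.25, 0.5]]
solution = gradient_tracking([1.0, 3.0, -1.0, 5.0], W, step=0.1, iterations=400)
```

With a constant step size, every node reaches the global minimizer (the mean of the targets, 2.0) rather than a biased point, which is the effect the tracking correction is meant to provide.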

Optimization and Control; Machine Learning

Fast decentralized non-convex finite-sum optimization with recursive variance reduction

Ran Xin, Usman A. Khan, Soummya Kar (2020)

This paper considers decentralized minimization of $N := nm$ smooth non-convex cost functions equally divided over a directed network of $n$ nodes. Specifically, we describe a stochastic first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique and gradient tracking (GT) to address the stochastic and decentralized nature of the problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds an $\epsilon$-accurate first-order stationary point with $O\big(\max\big\{N^{\frac{1}{2}}, n(1-\lambda)^{-2}, n^{\frac{2}{3}}m^{\frac{1}{3}}(1-\lambda)^{-1}\big\}L\epsilon^{-2}\big)$ gradient complexity, where $(1-\lambda)\in(0,1]$ is the spectral gap of the network weight matrix and $L$ is the smoothness parameter of the cost functions. This gradient complexity outperforms that of the existing decentralized stochastic gradient methods. In particular, in a big-data regime such that $n = O(N^{\frac{1}{2}}(1-\lambda)^{3})$, this gradient complexity further reduces to $O(N^{\frac{1}{2}}L\epsilon^{-2})$, independent of the network topology, and matches that of the centralized near-optimal variance-reduced methods. Moreover, in this regime GT-SARAH achieves a non-asymptotic linear speedup, in that the total number of gradient computations at each node is reduced by a factor of $1/n$ compared to the centralized near-optimal algorithms that perform all gradient computations at a single node. To the best of our knowledge, GT-SARAH is the first algorithm that achieves this property. In addition, we show that appropriate choices of local minibatch size balance the trade-offs between the gradient and communication complexity of GT-SARAH. Over an infinite time horizon, we establish that all nodes in GT-SARAH asymptotically achieve consensus and converge to a first-order stationary point in the almost sure and mean-squared sense.
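The big-data simplification can be checked term by term. The short derivation below uses only the definitions in the abstract, $N = nm$ and $(1-\lambda)\in(0,1]$, and writes the regime as $n \le C N^{\frac{1}{2}}(1-\lambda)^{3}$ for some constant $C$ (a symbol introduced here for illustration):

```latex
\begin{align*}
n(1-\lambda)^{-2}
  &\le C N^{\frac{1}{2}}(1-\lambda)^{3}(1-\lambda)^{-2}
   = C N^{\frac{1}{2}}(1-\lambda)
   \le C N^{\frac{1}{2}}, \\
n^{\frac{2}{3}} m^{\frac{1}{3}}(1-\lambda)^{-1}
  &= n^{\frac{1}{3}} N^{\frac{1}{3}}(1-\lambda)^{-1}
   \le C^{\frac{1}{3}} N^{\frac{1}{6}}(1-\lambda)\, N^{\frac{1}{3}}(1-\lambda)^{-1}
   = C^{\frac{1}{3}} N^{\frac{1}{2}}.
\end{align*}
```

Both network-dependent terms inside the maximum are therefore $O(N^{\frac{1}{2}})$, so the maximum itself is $O(N^{\frac{1}{2}})$ and the spectral gap drops out of the bound.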

Optimization and Control; Machine Learning; Multiagent Systems

Domain-specific Communication Optimization for Distributed DNN Training

Hao Wang, Jingrong Chen, Xinchen Wan (2020)

Communication overhead poses an important obstacle to distributed DNN training and has drawn increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication overlapping, and layer-wise flow scheduling are still coarse-grained and insufficient for efficient distributed training, especially when the network is under pressure. We present DLCP, a novel solution exploiting the domain-specific properties of deep learning to optimize the communication overhead of DNN training in a fine-grained manner. At its heart, DLCP comprises several key innovations beyond prior work. For example, it exploits the bounded loss tolerance of SGD-based training to improve tail communication latency, which cannot be avoided purely through gradient compression. It then performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on layers and magnitudes of gradients to further speed up model convergence without affecting accuracy. In addition, it leverages inter-packet order-independency to perform per-packet load balancing without causing classical re-ordering issues. DLCP works with both Parameter Server and collective communication routines. We have implemented DLCP with commodity switches, integrated it with various training frameworks including TensorFlow, MXNet and PyTorch, and deployed it in our small-scale testbed with 10 Nvidia V100 GPUs. Our testbed experiments and large-scale simulations show that DLCP delivers up to 84.3% additional training acceleration over the best existing solutions.
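As a rough illustration of magnitude-based dropping under a bounded loss tolerance, the sketch below splits a flat gradient into fixed-size packets and zeroes out the lowest-magnitude ones, up to a loss budget. This is not DLCP's implementation: the abstract describes prioritization and dropping inside the network and also uses layer information, neither of which is modeled here. The function name `drop_low_magnitude_packets`, the packet size, and the budget are hypothetical.

```python
def drop_low_magnitude_packets(gradient, packet_size, loss_budget):
    """Zero out the smallest-magnitude packets, up to a fraction `loss_budget`.

    Hypothetical helper: returns the gradient as the receiver would see it
    and the indices of the dropped packets.
    """
    packets = [gradient[s:s + packet_size]
               for s in range(0, len(gradient), packet_size)]
    max_drops = int(loss_budget * len(packets))  # bounded loss tolerance
    # Rank packets by mean absolute gradient value, smallest first.
    order = sorted(range(len(packets)),
                   key=lambda p: sum(abs(g) for g in packets[p]) / len(packets[p]))
    dropped = sorted(order[:max_drops])
    received = []
    for p, packet in enumerate(packets):
        # A dropped packet is treated as a zero update for its entries.
        received.extend([0.0] * len(packet) if p in dropped else packet)
    return received, dropped

grad = [0.9, -1.1, 0.01, 0.02, -0.5, 0.4, 0.03, -0.01]
received, dropped = drop_low_magnitude_packets(grad, packet_size=2, loss_budget=0.5)
# dropped == [1, 3]: the two packets with the smallest mean magnitude
```

The large-magnitude entries, which carry most of the update, arrive intact, while half of the packets never need to be delivered.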

Distributed, Parallel, and Cluster Computing; Machine Learning

Distributed Online Convex Optimization with Improved Dynamic Regret

Yan Zhang, Robert J. Ravier, Vahid Tarokh (2019)

In this paper, we consider the problem of distributed online convex optimization, where a group of agents collaborate to track the global minimizers of a sum of time-varying objective functions in an online manner. Specifically, we propose a novel distributed online gradient descent algorithm that relies on an online adaptation of the gradient tracking technique used in static optimization. We show that the dynamic regret bound of this algorithm has no explicit dependence on the time horizon and, therefore, can be tighter than existing bounds, especially for problems with long horizons. Our bound depends on a new regularity measure that quantifies the total change in the gradients at the optimal points at each time instant. Furthermore, when the optimizer is approximately subject to linear dynamics, we show that the dynamic regret bound can be further tightened by replacing the regularity measure that captures the path length of the optimizer with the accumulated prediction errors, which can be much lower in this special case. We present numerical experiments to corroborate our theoretical results.

Optimization and Control

An improved convergence analysis for decentralized online stochastic non-convex optimization

Ran Xin, Usman A. Khan (2020)

In this paper, we study decentralized online stochastic non-convex optimization over a network of nodes. Integrating a technique called gradient tracking in decentralized stochastic gradient descent, we show that the resulting algorithm, GT-DSGD, enjoys certain desirable characteristics towards minimizing a sum of smooth non-convex functions. In particular, for general smooth non-convex functions, we establish non-asymptotic characterizations of GT-DSGD and derive the conditions under which it achieves network-independent performance that matches the centralized minibatch SGD. In contrast, the existing results suggest that GT-DSGD is always network-dependent and is therefore strictly worse than the centralized minibatch SGD. When the global non-convex function additionally satisfies the Polyak-Łojasiewicz (PL) condition, we establish the linear convergence of GT-DSGD up to a steady-state error with appropriate constant step-sizes. Moreover, under stochastic approximation step-sizes, we establish, for the first time, the optimal global sublinear convergence rate on almost every sample path, in addition to the asymptotically optimal sublinear rate in expectation. Since strongly convex functions are a special case of the functions satisfying the PL condition, our results are not only immediately applicable but also improve the currently known best convergence rates and their dependence on problem parameters.
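The closing remark rests on a standard fact, sketched here for a differentiable function $f$ with minimum value $f^*$ and strong convexity constant $\mu > 0$ (symbols introduced here, not taken from the abstract). The PL condition requires $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$ for all $x$. It follows from strong convexity by minimizing both sides of the strong convexity inequality over $y$:

```latex
\begin{align*}
f(y) &\ge f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{\mu}{2}\|y - x\|^2
  && \text{(strong convexity)} \\
f^* &\ge f(x) - \tfrac{1}{2\mu}\|\nabla f(x)\|^2
  && \text{(minimize over } y \text{; the right side is smallest at } y = x - \tfrac{1}{\mu}\nabla f(x)\text{)} \\
\tfrac{1}{2}\|\nabla f(x)\|^2 &\ge \mu\,\bigl(f(x) - f^*\bigr)
  && \text{(PL condition)}
\end{align*}
```

The converse does not hold in general: a function can satisfy the PL condition without being convex, which is why results stated under PL cover the strongly convex case and more.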

Optimization and Control; Machine Learning; Multiagent Systems