Decentralized Learning with Lazy and Approximate Dual Gradients

Added by Liu Yanli

Publication date 2020

Research language English

Authors Yanli Liu - Yuejiao Sun - Wotao Yin

Optimization and Control

This paper develops algorithms for decentralized machine learning over a network, where data are distributed, computation is localized, and communication is restricted to neighbors. A line of recent research in this area focuses on improving both computation and communication complexities. The methods SSDA and MSDA \cite{scaman2017optimal} have optimal communication complexity when the objective is smooth and strongly convex, and are simple to derive. However, they require solving a subproblem at each step. We propose new algorithms that save computation by using (stochastic) gradients and save communication when previous information is sufficiently useful. Our methods remain relatively simple: rather than solving a subproblem, they run Katyusha for a small, fixed number of steps from the latest point. An easy-to-compute, local rule is used to decide whether a worker can skip a round of communication. Furthermore, our methods provably reduce the communication and computation complexities of SSDA and MSDA. In numerical experiments, our algorithms achieve significant computation and communication reduction compared with the state of the art.
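The idea of skipping a communication round when previously sent information is still useful can be illustrated with a small sketch. The abstract does not state the actual rule, so the drift-threshold test below, along with the names `should_communicate`, `LazyWorker`, and `threshold`, are assumptions made purely for illustration and are not the paper's method.

```python
import math


def should_communicate(current, last_sent, threshold):
    """Assumed rule: re-send only if the vector moved more than `threshold`
    (Euclidean distance) from the copy the neighbors already hold."""
    drift = math.sqrt(sum((c - s) ** 2 for c, s in zip(current, last_sent)))
    return drift > threshold


class LazyWorker:
    """Hypothetical worker that caches the last vector it broadcast."""

    def __init__(self, dim, threshold):
        self.last_sent = [0.0] * dim
        self.threshold = threshold
        self.rounds_sent = 0

    def maybe_send(self, gradient):
        """Return the vector the neighbors will use in this round."""
        if should_communicate(gradient, self.last_sent, self.threshold):
            self.last_sent = list(gradient)  # fresh broadcast
            self.rounds_sent += 1
        return self.last_sent  # otherwise neighbors reuse the stale copy


worker = LazyWorker(dim=2, threshold=0.5)
first = worker.maybe_send([1.0, 0.0])   # large change: broadcast
second = worker.maybe_send([1.1, 0.0])  # small change: round skipped
third = worker.maybe_send([2.0, 0.0])   # large change again: broadcast
```

The test is local in the sense that it only compares a worker's own current vector with its own cached copy, so deciding whether to skip costs no communication.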

Decentralized Consensus Algorithm with Delayed and Stochastic Gradients

173 - Benjamin Sirb , Xiaojing Ye 2016

We analyze the convergence of a decentralized consensus algorithm with delayed gradient information across the network. The nodes in the network privately hold parts of the objective function and collaboratively solve for the consensus optimal solution of the total objective, while communicating only with their immediate neighbors. In real-world networks, it is often difficult and sometimes impossible to synchronize the nodes, so they have to use stale gradient information during computations. We show that, as long as the random delays are bounded in expectation and a proper diminishing step size policy is employed, the iterates generated by the decentralized gradient descent method converge to a consensual optimal solution. Convergence rates of both objective and consensus are derived. Numerical results on a number of synthetic problems and real-world seismic tomography datasets in decentralized sensor networks are presented to show the performance of the method.
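A minimal sketch of decentralized gradient descent with stale gradients and a diminishing step size is given below. It rests on several simplifying assumptions that are not from the abstract: scalar local objectives f_i(x) = 0.5 (x - a_i)^2, a fixed delay instead of random delays, a four-node ring with a hand-picked doubly stochastic mixing matrix, and the step size 1/(k+2). The names `delayed_dgd`, `RING_WEIGHTS`, and `TARGETS` are hypothetical.

```python
# Assumed mixing matrix for a 4-node ring (symmetric, rows sum to 1).
RING_WEIGHTS = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]

# Node i privately holds f_i(x) = 0.5 * (x - TARGETS[i]) ** 2.
TARGETS = [1.0, 2.0, 3.0, 6.0]


def delayed_dgd(targets, weights, delay, iterations):
    """Run decentralized gradient descent where each node evaluates its
    local gradient at its own iterate from `delay` steps ago."""
    n = len(targets)
    history = [[0.0] * n]  # history[k][i] is node i's iterate at step k
    for k in range(iterations):
        current = history[-1]
        stale = history[max(0, k - delay)]  # outdated iterates
        step = 1.0 / (k + 2)                # diminishing step size
        nxt = []
        for i in range(n):
            # Average with neighbors, then take a stale-gradient step.
            mixed = sum(weights[i][j] * current[j] for j in range(n))
            grad = stale[i] - targets[i]
            nxt.append(mixed - step * grad)
        history.append(nxt)
    return history[-1]


final = delayed_dgd(TARGETS, RING_WEIGHTS, delay=2, iterations=3000)
consensus_optimum = sum(TARGETS) / len(TARGETS)  # minimizer of the sum
```

For these quadratics the minimizer of the total objective is the mean of the targets, so every entry of `final` should end up close to `consensus_optimum` despite the stale gradients.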

Optimization and Control

Accelerated Decentralized Dual Averaging

77 - Changxin Liu , Yang Shi , Huiping Li 2020

This paper studies decentralized convex optimization problems defined over networks, where the objective is to minimize a sum of local smooth convex functions while respecting a common constraint. Two new algorithms based on dual averaging and decentralized consensus-seeking are proposed. The first one accelerates the standard convergence rate $O(\frac{1}{\sqrt{t}})$ in existing decentralized dual averaging (DDA) algorithms to $O(\frac{1}{t})$, where $t$ is the time counter. This is made possible by a second-order consensus scheme that helps each agent locally track the global dual variable more accurately, and by a new analysis of the descent property for the mean variable. We remark that, in contrast to its primal counterparts, this method decouples the synchronization step from the nonlinear projection, leading to a rather concise analysis and a natural extension to stochastic networks. In the second one, two local sequences of primal variables are constructed in a decentralized manner to achieve acceleration, where only one of them is exchanged between agents. In addition, another consensus round is performed for local dual variables. The convergence rate is proved to be $O(1)(\frac{1}{t^2}+\frac{1}{t})$, where the magnitude of the error bound is shown to be inversely proportional to the algebraic connectivity of the graph. However, the condition on the step size does not rely on the weight matrix associated with the graph, making it easier to satisfy in practice than in other accelerated methods. Finally, comparisons between the proposed methods and several recent algorithms are performed using a large-scale LASSO problem.

Optimization and Control

Decentralized Non-Convex Learning with Linearly Coupled Constraints

176 - Jiawei Zhang , Songyang Ge , Tsung-Hui Chang 2021

Motivated by the need for decentralized learning, this paper aims at designing a distributed algorithm for solving nonconvex problems with general linear constraints over a multi-agent network. In the considered problem, each agent owns some local information and a local variable for jointly minimizing a cost function, but local variables are coupled by linear constraints. Most of the existing methods for such problems are only applicable to convex problems or problems with specific linear constraints. A distributed algorithm for such problems with general linear constraints in the nonconvex setting is still lacking. In this paper, to tackle this problem, we propose a new algorithm, called the proximal dual consensus (PDC) algorithm, which combines a proximal technique and a dual consensus method. We establish the theoretical convergence conditions and show that the proposed PDC algorithm can converge to an $\epsilon$-Karush-Kuhn-Tucker solution within $\mathcal{O}(1/\epsilon)$ iterations. For computation reduction, the PDC algorithm can choose to perform cheap gradient descent per iteration while preserving the same order of $\mathcal{O}(1/\epsilon)$ iteration complexity. Numerical results are presented to demonstrate the good performance of the proposed algorithms for solving a regression problem and a classification problem over a network where agents have only partial observations of data features.

Optimization and Control - Systems and Control

Decentralized Composite Optimization in Stochastic Networks: A Dual Averaging Approach with Linear Convergence

148 - Changxin Liu , Zirui Zhou , Jian Pei 2021

Decentralized optimization, particularly the class of decentralized composite convex optimization (DCCO) problems, has found many applications. Due to ubiquitous communication congestion and random dropouts in practice, it is highly desirable to design decentralized algorithms that can handle stochastic communication networks. However, most existing algorithms for DCCO only work in time-invariant networks and cannot be extended to stochastic networks because they inherently need knowledge of network topology $\textit{a priori}$. In this paper, we propose a new decentralized dual averaging (DDA) algorithm that can solve DCCO in stochastic networks. Under a rather mild condition on stochastic networks, we show that the proposed algorithm attains $\textit{global linear convergence}$ if each local objective function is strongly convex. Our algorithm substantially improves the existing DDA-type algorithms as the latter were only known to converge $\textit{sublinearly}$ prior to our work. The key to achieving the improved rate is the design of a novel dynamic averaging consensus protocol for DDA, which intuitively leads to more accurate local estimates of the global dual variable. To the best of our knowledge, this is the first linearly convergent DDA-type decentralized algorithm and also the first algorithm that attains global linear convergence for solving DCCO in stochastic networks. Numerical results are also presented to support our design and analysis.
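To give a feel for how a dynamic averaging consensus scheme lets each node estimate a global average, here is a sketch of the standard textbook update (each node mixes neighbors' estimates and adds the change in its own local signal). This is not the novel protocol proposed in the paper, whose details the abstract does not give. The fixed four-node ring matrix and the names `dynamic_average_step`, `track_average`, and `MIXING` are assumptions for illustration.

```python
# Assumed doubly stochastic mixing matrix for a fixed 4-node ring.
MIXING = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]


def dynamic_average_step(estimates, weights, new_signals, old_signals):
    """One round: mix neighbors' estimates, then add the local signal change.
    With doubly stochastic weights, the mean of the estimates always equals
    the mean of the current signals."""
    n = len(estimates)
    return [
        sum(weights[i][j] * estimates[j] for j in range(n))
        + new_signals[i] - old_signals[i]
        for i in range(n)
    ]


def track_average(signal_sequence, weights):
    """Initialize each estimate with the node's first signal, then track."""
    estimates = list(signal_sequence[0])
    for old, new in zip(signal_sequence, signal_sequence[1:]):
        estimates = dynamic_average_step(estimates, weights, new, old)
    return estimates


# Local signals change once and then stay constant at [2, 4, 6, 8].
signals = [[1.0, 2.0, 3.0, 6.0]] + [[2.0, 4.0, 6.0, 8.0]] * 60
estimates = track_average(signals, MIXING)
true_average = sum(signals[-1]) / len(signals[-1])
```

Once the local signals stop changing, the update reduces to plain averaging with neighbors, so each node's estimate approaches the global average without any node seeing all signals.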

Optimization and Control - Distributed, Parallel, and Cluster Computing

Decentralized and Parallel Primal and Dual Accelerated Methods for Stochastic Convex Programming Problems

117 - Darina Dvinskikh , Alexander Gasnikov 2019

We introduce primal and dual stochastic gradient oracle methods for decentralized convex optimization problems. For both primal and dual oracles, the proposed methods are optimal in terms of the number of communication steps. However, for all classes of the objective, the optimality in terms of the number of oracle calls per node takes place only up to a logarithmic factor and the notion of smoothness. By using a mini-batching technique, we show that the proposed methods with a stochastic oracle can be additionally parallelized at each node. The considered algorithms can be applied to many data science problems and inverse problems.

Optimization and Control