
STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

Posted by Prashant Khanduri
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WNs' and the server's directions are chosen based on a stochastic momentum estimator, the algorithm requires $\tilde{\mathcal{O}}(\epsilon^{-3/2})$ samples and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication rounds to compute an $\epsilon$-stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such \emph{near-optimal} sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between the local update frequency and the local minibatch size on which the above sample and communication complexities can be maintained. Finally, we show that for the classical FedAvg (a.k.a. Local SGD, a momentum-less special case of STEM), a similar trade-off curve exists, albeit with worse sample and communication complexities. Our insights on this trade-off provide guidelines for choosing the four important design elements of FL algorithms, namely the update frequency, the WN and server update directions, and the minibatch sizes, to achieve the best performance.
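To make the two-sided momentum idea concrete, the sketch below shows one communication round in Python/NumPy: each worker runs a few local steps with a momentum-based (STORM-style) estimator, and the server averages the returned models and estimators. This is a simplified illustration, not the paper's exact STEM pseudocode; the least-squares loss, the function names, and all hyperparameter values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_grad(w, X, y, idx):
    """Minibatch gradient of the least-squares loss 0.5 * ||Xw - y||^2."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def local_momentum_steps(w, w_prev, d, X, y, lr, a, steps, batch):
    """Worker side: a few local updates along a momentum-based estimator."""
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        g_new = minibatch_grad(w, X, y, idx)       # same minibatch at both points
        g_old = minibatch_grad(w_prev, X, y, idx)
        d = g_new + (1.0 - a) * (d - g_old)        # momentum (STORM-style) estimator
        w_prev, w = w, w - lr * d                  # local step along the estimator
    return w, w_prev, d

def communication_round(w_global, datasets, states, lr=0.05, a=0.1, steps=4, batch=8):
    """Server side: broadcast the model, collect local updates, average them."""
    results = [local_momentum_steps(w_global.copy(), w_prev, d, X, y, lr, a, steps, batch)
               for (X, y), (w_prev, d) in zip(datasets, states)]
    ws, w_prevs, ds = zip(*results)
    w_next = np.mean(ws, axis=0)                   # averaged model becomes the server iterate
    d_avg = np.mean(ds, axis=0)                    # averaged direction shared back to workers
    return w_next, [(w_next.copy(), d_avg.copy()) for _ in w_prevs]
```

A driver loop would initialize each worker's estimator with a minibatch gradient at the starting point and then call communication_round repeatedly.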




Read also

This paper proposes a new algorithm, the \underline{S}ingle-timescale Do\underline{u}ble-momentum \underline{St}ochastic \underline{A}pprox\underline{i}matio\underline{n} (SUSTAIN), for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower-level subproblem is strongly convex and the upper-level objective function is smooth. Unlike prior works, which rely on \emph{two-timescale} or \emph{double-loop} techniques, we design a stochastic momentum-assisted gradient estimator for both the upper- and lower-level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solutions of both subproblems. If the upper objective function is smooth but possibly non-convex, we show that SUSTAIN requires $\mathcal{O}(\epsilon^{-3/2})$ iterations (each using $\mathcal{O}(1)$ samples) to find an $\epsilon$-stationary solution, defined as a point at which the squared norm of the gradient of the outer function is at most $\epsilon$. The total number of stochastic gradient samples required for the upper- and lower-level objective functions matches the best-known complexity for single-level stochastic gradient algorithms. We also analyze the case when the upper-level objective function is strongly convex.
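As a rough illustration of the double-momentum idea (the notation below is not taken from the paper), both levels can maintain momentum-assisted recursions of the form

$$h_g^{t+1} = \nabla_y g\big(x^{t+1}, y^{t+1}; \zeta^{t+1}\big) + (1-\alpha_t)\Big(h_g^{t} - \nabla_y g\big(x^{t}, y^{t}; \zeta^{t+1}\big)\Big),$$
$$h_f^{t+1} = \bar{\nabla} f\big(x^{t+1}, y^{t+1}; \xi^{t+1}\big) + (1-\beta_t)\Big(h_f^{t} - \bar{\nabla} f\big(x^{t}, y^{t}; \xi^{t+1}\big)\Big),$$

where $g$ is the lower-level objective, $\bar{\nabla} f$ denotes a stochastic estimate of the hypergradient of the upper-level objective, and the single-timescale iterates $y^{t+1} = y^{t} - \gamma_t h_g^{t}$ and $x^{t+1} = x^{t} - \eta_t h_f^{t}$ use step sizes of the same order.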
We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan (2012), which is insensitive to the minimum expected loss; over the optimistic acceleration of Cotter et al. (2011), which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin (2018), which is limited to least squares problems and is also similarly suboptimal with respect to the minibatch size. Applied to interpolation learning, the improvement over Cotter et al. and Liu and Belkin translates to a linear, rather than square-root, parallelization speedup.
We propose a novel hybrid stochastic policy gradient estimator that combines an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid policy gradient estimator is shown to be biased but has a variance-reduction property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem that allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm and then introduce a more practical restarting variant. We prove that both algorithms achieve the best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-3}\right)$ to attain a first-order stationary point of the composite problem, which is better than the existing REINFORCE/GPOMDP complexity $\mathcal{O}\left(\varepsilon^{-4}\right)$ and the SVRPG complexity $\mathcal{O}\left(\varepsilon^{-10/3}\right)$ in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite setting indeed has some advantages compared to the non-composite one on certain problems.
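In notation that is not taken from the paper, the hybrid estimator can be pictured as a convex combination of an unbiased REINFORCE-type estimate and a SARAH-type recursive correction:

$$v_t = \beta\, \widehat{\nabla} J(\theta_t; \tau_t) + (1-\beta)\Big(v_{t-1} + \widehat{\nabla} J(\theta_t; \tau_t) - \widehat{\nabla}_{w} J(\theta_{t-1}; \tau_t)\Big), \qquad \beta \in [0, 1],$$

where $\widehat{\nabla} J(\theta; \tau)$ is a REINFORCE/GPOMDP gradient estimate from trajectory $\tau$ and $\widehat{\nabla}_{w} J(\theta_{t-1}; \tau_t)$ re-evaluates the same trajectory at the previous policy with importance weights; $\beta = 1$ recovers plain REINFORCE, while $\beta = 0$ recovers a pure SARAH-style recursion. A proximal step on the regularizer then handles the composite part of the objective.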
Shaojie Li, Yong Liu, 2021
Generalization performance of stochastic optimization holds a central place in learning theory. In this paper, we investigate the excess risk performance and derive improved learning rates for two popular approaches to stochastic optimization: empirical risk minimization (ERM) and stochastic gradient descent (SGD). Although there exists plentiful generalization analysis of ERM and SGD for supervised learning, current theoretical understandings of ERM and SGD either require stronger assumptions in convex learning, e.g., strong convexity, or show slow rates and are less studied in nonconvex learning. Motivated by these problems, we aim to provide improved rates under milder assumptions in convex learning and to derive faster rates in nonconvex learning. Notably, our analysis spans two popular theoretical viewpoints: \emph{stability} and \emph{uniform convergence}. Specifically, in the stability regime, we present high-probability learning rates of order $\mathcal{O}(1/n)$ w.r.t. the sample size $n$ for ERM and SGD under milder assumptions in convex learning, and similar high-probability rates of order $\mathcal{O}(1/n)$ in nonconvex learning, rather than rates in expectation. Furthermore, this type of learning rate improves to the faster order $\mathcal{O}(1/n^2)$ in the uniform convergence regime. To the best of our knowledge, for ERM and SGD, the learning rates presented in this paper are all state-of-the-art.
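For reference, the excess risk behind these rates can be written in standard (paper-independent) notation as

$$F(\hat{w}_n) - \min_{w} F(w), \qquad F(w) = \mathbb{E}_{z}\big[\ell(w; z)\big],$$

and the $\mathcal{O}(1/n)$ and $\mathcal{O}(1/n^2)$ rates above bound this gap with high probability over the draw of the $n$ training samples, rather than only in expectation.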
Decentralized nonconvex optimization has received increasing attention in machine learning in recent years due to its advantages in system robustness, data privacy, and implementation simplicity. However, three fundamental challenges in designing decentralized optimization algorithms are how to reduce their sample, communication, and memory complexities. In this paper, we propose a \underline{g}radient-\underline{t}racking-based \underline{sto}chastic \underline{r}ecursive \underline{m}omentum (GT-STORM) algorithm for efficiently solving nonconvex optimization problems. We show that to reach an $\epsilon^2$-stationary solution, the total number of sample evaluations of our algorithm is $\tilde{O}(m^{1/2}\epsilon^{-3})$ and the number of communication rounds is $\tilde{O}(m^{-1/2}\epsilon^{-3})$, which improve upon the $O(\epsilon^{-4})$ sample-evaluation and communication costs of existing decentralized stochastic gradient algorithms. We conduct extensive experiments with a variety of learning models, including non-convex logistic regression and convolutional neural networks, to verify our theoretical findings. Collectively, our results contribute to the state of the art of theories and algorithms for decentralized network optimization.
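A compact sketch of one gradient-tracking-plus-recursive-momentum iteration is given below (a simplified reading of the scheme described above, not the exact GT-STORM recursion; the variable names, mixing matrix, and step sizes are illustrative assumptions). Each node mixes its model and tracker with its neighbors through a doubly stochastic matrix W, so the trackers follow the network average of the momentum estimators.

```python
import numpy as np

def gt_recursive_momentum_step(X, Y, V, grads_new, grads_old, W, lr=0.05, beta=0.1):
    """One sketch iteration for m nodes with d-dimensional models.

    X: (m, d) local models         Y: (m, d) gradient trackers
    V: (m, d) momentum estimators  W: (m, m) doubly stochastic mixing matrix
    grads_new / grads_old: per-node stochastic gradients at the current and
    previous models, evaluated with the same sample at each node.
    """
    V_next = grads_new + (1.0 - beta) * (V - grads_old)  # recursive momentum estimator
    Y_next = W @ Y + (V_next - V)                        # gradient tracking update
    X_next = W @ X - lr * Y_next                         # consensus + descent step
    return X_next, Y_next, V_next
```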


