Do you want to publish a course? Click here

Distributed Variable Sample-size Stochastic Optimization with Fixed Step-sizes

133   0   0.0 ( 0 )
 Added by Jinlong Lei
 Publication date 2021
  fields
and research's language is English




Ask ChatGPT about the research

The paper considers distributed stochastic optimization over randomly switching networks, where agents collaboratively minimize the average of all agents local expectation-valued convex cost functions. Due to the stochasticity in gradient observations, distributedness of local functions, and randomness of communication topologies, distributed algorithms with a convergence guarantee under fixed step-sizes have not been achieved yet. This work incorporates variance reduction scheme into the distributed stochastic gradient tracking algorithm, where local gradients are estimated by averaging across a variable number of sampled gradients. With an identically and independently distributed (i.i.d.) random network, we show that all agents iterates converge almost surely to the same optimal solution under fixed step-sizes. When the global cost function is strongly convex and the sample size increases at a geometric rate, we prove that the iterates geometrically converge to the unique optimal solution, and establish the iteration, oracle, and communication complexity. The algorithm performance including rate and complexity analysis are further investigated with constant step-sizes and a polynomially increasing sample size. Finally, the empirical algorithm performance are illustrated with numerical examples.



rate research

Read More

We consider a stochastic variational inequality (SVI) problem with a continuous and monotone mapping over a closed and convex set. In strongly monotone regimes, we present a variable sample-size averaging scheme (VS-Ave) that achieves a linear rate with an optimal oracle complexity. In addition, the iteration complexity is shown to display a muted dependence on the condition number compared with standard variance-reduced projection schemes. To contend with merely monotone maps, we develop amongst the first proximal-point algorithms with variable sample-sizes (PPAWSS), where increasingly accurate solutions of strongly monotone SVIs are obtained via (VS-Ave) at every step. This allows for achieving a sublinear convergence rate that matches that obtained for deterministic monotone VIs. Preliminary numerical evidence suggests that the schemes compares well with competing schemes.
One of the most widely used methods for solving large-scale stochastic optimization problems is distributed asynchronous stochastic gradient descent (DASGD), a family of algorithms that result from parallelizing stochastic gradient descent on distributed computing architectures (possibly) asychronously. However, a key obstacle in the efficient implementation of DASGD is the issue of delays: when a computing node contributes a gradient update, the global model parameter may have already been updated by other nodes several times over, thereby rendering this gradient information stale. These delays can quickly add up if the computational throughput of a node is saturated, so the convergence of DASGD may be compromised in the presence of large delays. Our first contribution is that, by carefully tuning the algorithms step-size, convergence to the critical set is still achieved in mean square, even if the delays grow unbounded at a polynomial rate. We also establish finer results in a broad class of structured optimization problems (called variationally coherent), where we show that DASGD converges to a global optimum with probability $1$ under the same delay assumptions. Together, these results contribute to the broad landscape of large-scale non-convex stochastic optimization by offering state-of-the-art theoretical guarantees and providing insights for algorithm design.
Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step-sizes, a general class of learning-rates that are allowed to vary in a banded region. This framework includes cyclic and non-monotonic step-sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise $1/sqrt{t}$ and the popular step-decay (constant and then drop by a constant), which is also shown to be optimal. Moreover, we show that its momentum variant (SGDM) converges as fast as SGD with the bandwidth-based step-decay step-size. Finally, we propose some novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.
Cyclical step-sizes are becoming increasingly popular in the optimization of deep learning problems. Motivated by recent observations on the spectral gaps of Hessians in machine learning, we show that these step-size schedules offer a simple way to exploit them. More precisely, we develop a convergence rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal first order methods for quadratic minimization with a given spectral structure. Finally, we provide a local convergence rate analysis beyond quadratic minimization for the proposed methods and illustrate our findings through benchmarks on least squares and logistic regression problems.
This work develops effective distributed strategies for the solution of constrained multi-agent stochastic optimization problems with coupled parameters across the agents. In this formulation, each agent is influenced by only a subset of the entries of a global parameter vector or model, and is subject to convex constraints that are only known locally. Problems of this type arise in several applications, most notably in disease propagation models, minimum-cost flow problems, distributed control formulations, and distributed power system monitoring. This work focuses on stochastic settings, where a stochastic risk function is associated with each agent and the objective is to seek the minimizer of the aggregate sum of all risks subject to a set of constraints. Agents are not aware of the statistical distribution of the data and, therefore, can only rely on stochastic approximations in their learning strategies. We derive an effective distributed learning strategy that is able to track drifts in the underlying parameter model. A detailed performance and stability analysis is carried out showing that the resulting coupled diffusion strategy converges at a linear rate to an $O(mu)-$neighborhood of the true penalized optimizer.
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا