We consider the problem of minimizing a convex function that is evolving in time according to unknown and possibly stochastic dynamics. Such problems abound in the machine learning and signal processing literature, under the names of concept drift and stochastic tracking. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. Notably, we show that the tracking efficiency of the proximal stochastic gradient method depends only logarithmically on the initialization quality, when equipped with a step-decay schedule. The results moreover naturally extend to settings where the dynamics depend jointly on time and on the decision variable itself, as in the performative prediction framework.
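As a rough illustration of the step-decay idea mentioned above, the following sketch runs a proximal stochastic gradient method on a toy scalar problem, 0.5*(x - target)^2 + lam*|x|, and halves the step size after each epoch. The objective, the function names, and all parameter values are assumptions made for this example; they are not taken from the paper, and the optional random-walk drift is only a stand-in for unknown dynamics.

```python
import random


def soft_threshold(v, tau):
    """Proximal map of tau * |x| (soft-thresholding)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_sgd_step_decay(x0, target, lam=0.1, eta0=0.5, epochs=8,
                        steps_per_epoch=200, noise=0.5, drift=0.0, seed=0):
    """Toy proximal SGD with a step-decay schedule (hypothetical setup).

    Minimizes 0.5*(x - target)^2 + lam*|x| from noisy gradients of the
    smooth part. The target may drift as a random walk (drift > 0).
    """
    rng = random.Random(seed)
    x, eta = x0, eta0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            # Optional random-walk drift of the minimizer (assumption).
            target += drift * rng.gauss(0.0, 1.0)
            # Noisy gradient of the smooth part 0.5*(x - target)^2.
            grad = (x - target) + noise * rng.gauss(0.0, 1.0)
            # Gradient step followed by the proximal step on lam*|x|.
            x = soft_threshold(x - eta * grad, eta * lam)
        eta *= 0.5  # step decay: halve the step size after each epoch
    return x, target
```

With a large constant step in the first epoch, a poor initialization such as x0 = 1000 is forgotten geometrically fast, and the later, smaller steps reduce the noise floor. In the static case (drift = 0) the iterate settles near soft_threshold(target, lam).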
Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on light-tail noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.
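The robust distance estimation ingredient can be sketched as follows. This is a minimal sketch of the standard selection rule only, not of proxBoost itself: given several independent candidate solutions, it returns the candidate whose smallest ball containing more than half of all candidates has the smallest radius. The function name and the use of Euclidean distance are assumptions for this example.

```python
import math


def robust_distance_estimate(points):
    """Pick the candidate whose majority ball has the smallest radius.

    points: list of equal-length tuples, e.g. outputs of independent
    runs of a stochastic optimization algorithm (hypothetical input).
    """
    m = len(points)
    k = m // 2 + 1  # strictly more than half, counting the point itself
    best, best_radius = None, math.inf
    for p in points:
        # Sorted distances from p to every candidate (including itself).
        dists = sorted(math.dist(p, q) for q in points)
        radius = dists[k - 1]  # radius of the ball holding k candidates
        if radius < best_radius:
            best, best_radius = p, radius
    return best
```

If a majority of the candidates lie close to the true minimizer, the selected point is close to it as well, even when the remaining candidates are arbitrarily far away. This is how a guarantee that holds with constant probability can be upgraded to a high-confidence one.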
We introduce and analyze stochastic optimization methods where the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms. In this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs. Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates. We proceed to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
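The perturbed-iterate viewpoint can be illustrated with a serial toy example. The sketch below is not Hogwild! or KroMagnon; it is a single-threaded loop on the assumed objective 0.5*x^2 in which each gradient is evaluated at a perturbed copy of the iterate, with a perturbation bounded by delta. All names and parameter values are assumptions for this example.

```python
import random


def perturbed_iterate_gd(x0=5.0, eta=0.1, delta=0.2, steps=500, seed=0):
    """Gradient descent on 0.5*x^2 with gradients read at noisy inputs.

    Each update uses the gradient at x_hat = x + u, where |u| <= delta
    models the stale or inconsistent reads of an asynchronous method.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x_hat = x + rng.uniform(-delta, delta)  # bounded input perturbation
        x -= eta * x_hat  # gradient of 0.5*x^2 evaluated at x_hat
    return x
```

For this toy problem, |x_{k+1}| <= (1 - eta)*|x_k| + eta*delta, so the iterate contracts geometrically into the interval [-delta, delta] around the minimizer. This is the kind of serial argument the framework makes available for asynchronous schemes.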
Motivated by broad applications in reinforcement learning and machine learning, this paper considers the popular stochastic gradient descent (SGD) when the gradients of the underlying objective function are sampled from Markov processes. This Markov sampling leads to gradient samples that are biased and not independent. Existing results on the convergence of SGD under Markov randomness are often established under assumptions on the boundedness of either the iterates or the gradient samples. Our main focus is to study the finite-time convergence of SGD for different types of objective functions, without requiring these assumptions. We show that SGD converges at nearly the same rate with Markovian gradient samples as with independent gradient samples. The only difference is a logarithmic factor that accounts for the mixing time of the Markov chain.
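A minimal sketch of SGD driven by Markovian samples, under assumptions chosen for this example: a symmetric two-state Markov chain selects which component 0.5*(x - a[state])^2 supplies the gradient, so consecutive gradient samples are correlated and each one is biased given the past. The chain, the objective, and the 1/(k+1) step size are illustrative choices, not the setting analyzed in the paper.

```python
import random


def markov_sgd(a=(0.0, 4.0), stay=0.9, steps=20000, x0=10.0, seed=0):
    """SGD where the sampled component follows a two-state Markov chain.

    The chain keeps its state with probability `stay`; its stationary
    distribution is uniform, so the averaged objective is minimized at
    the mean of `a`.
    """
    rng = random.Random(seed)
    state, x = 0, x0
    for k in range(steps):
        if rng.random() > stay:
            state = 1 - state  # Markov transition to the other state
        grad = x - a[state]  # gradient of 0.5*(x - a[state])^2
        x -= grad / (k + 1)  # diminishing step size 1/(k+1)
    return x
```

Although the samples are strongly correlated (the chain switches state only 10% of the time), the iterate still approaches the stationary minimizer, here 2.0. Slower mixing increases the number of steps needed for a given accuracy.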
We consider the constrained stochastic optimization problem over a time-varying random network, where the agents collectively minimize a sum of objective functions subject to a common constraint set, and investigate asymptotic properties of a distributed algorithm based on dual averaging of gradients. Unlike most existing works on distributed dual averaging algorithms, which mainly concentrate on non-asymptotic properties, we prove not only almost sure convergence and the rate of almost sure convergence, but also asymptotic normality and asymptotic efficiency of the algorithm. First, for general constrained convex optimization problems distributed over a random network, we prove that almost sure consensus can be achieved and that the agents' estimates converge to the same optimal point. For linearly constrained convex optimization, we show that the mirror map of the averaged dual sequence identifies the active constraints of the optimal solution with probability 1, which helps us prove the almost sure convergence rate and then establish asymptotic normality of the algorithm. Furthermore, we verify that the algorithm is asymptotically optimal. To the best of our knowledge, this is the first asymptotic normality result for constrained distributed optimization algorithms. Finally, a numerical example is provided to support the theoretical analysis.
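The dual averaging update and the way the averaged dual sequence reveals an active constraint can be sketched in a deliberately simplified form. The example below is a single-agent version with no network, applied to the assumed problem of minimizing 0.5*(x - center)^2 over an interval. The Euclidean mirror map (a clipped, scaled negative dual sum) and the sqrt(k) scaling are assumptions for this example.

```python
import math
import random


def dual_averaging(steps=1000, lo=-1.0, hi=1.0, center=3.0, noise=0.5, seed=0):
    """Single-agent stochastic dual averaging on the interval [lo, hi].

    Returns the final primal iterate and the averaged dual variable.
    """
    rng = random.Random(seed)
    z, x = 0.0, 0.0
    for k in range(1, steps + 1):
        # Accumulate a noisy gradient of 0.5*(x - center)^2 in the dual sum.
        z += (x - center) + noise * rng.gauss(0.0, 1.0)
        beta = math.sqrt(k)  # growing regularization weight (assumption)
        # Mirror map: argmin over [lo, hi] of z*x + (beta/2)*x^2.
        x = min(hi, max(lo, -z / beta))
    return x, z / steps
```

Here the unconstrained minimizer 3.0 lies outside [-1, 1], so the solution sits on the upper bound. The averaged dual variable converges to the gradient at the solution (about -2), and its nonzero sign indicates that the upper constraint is active.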
The paper considers distributed stochastic optimization over randomly switching networks, where agents collaboratively minimize the average of all agents' local expectation-valued convex cost functions. Due to the stochasticity in gradient observations, the distributed nature of the local functions, and the randomness of communication topologies, distributed algorithms with a convergence guarantee under fixed step-sizes have not yet been established. This work incorporates a variance reduction scheme into the distributed stochastic gradient tracking algorithm, where local gradients are estimated by averaging across a variable number of sampled gradients. With an independent and identically distributed (i.i.d.) random network, we show that all agents' iterates converge almost surely to the same optimal solution under fixed step-sizes. When the global cost function is strongly convex and the sample size increases at a geometric rate, we prove that the iterates converge geometrically to the unique optimal solution, and establish the iteration, oracle, and communication complexity. The algorithm's performance, including rate and complexity analysis, is further investigated with constant step-sizes and a polynomially increasing sample size. Finally, the empirical performance of the algorithm is illustrated with numerical examples.
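A toy sketch of gradient tracking with a geometrically growing sample size over an i.i.d. random network, under assumptions chosen for this example: each agent holds 0.5*(x - b_i)^2, one randomly chosen pair of agents averages its values per iteration (pairwise gossip), and gradient noise is Gaussian. The update order, the gossip model, and all parameter values are illustrative and are not claimed to match the paper's algorithm.

```python
import math
import random


def gradient_tracking(b=(1.0, 2.0, 3.0, 6.0), alpha=0.02, q=1.02,
                      sigma=1.0, iters=1500, seed=0):
    """Gradient tracking with batch size ceil(q^k) and random gossip."""
    rng = random.Random(seed)
    n = len(b)

    def sampled_grad(i, xi, k):
        # Gradient of 0.5*(x - b_i)^2 averaged over N_k = ceil(q^k) samples.
        # Shortcut: the mean of N_k i.i.d. Gaussian errors is Gaussian with
        # standard deviation sigma / sqrt(N_k), so it is drawn directly.
        batch = math.ceil(q ** k)
        return (xi - b[i]) + rng.gauss(0.0, sigma / math.sqrt(batch))

    def gossip(v, i, j):
        # Doubly stochastic mixing: agents i and j average their values.
        w = list(v)
        w[i] = w[j] = 0.5 * (v[i] + v[j])
        return w

    x = [0.0] * n
    g = [sampled_grad(i, x[i], 0) for i in range(n)]
    y = list(g)  # trackers start at the local gradient estimates
    for k in range(1, iters + 1):
        i, j = rng.sample(range(n), 2)  # i.i.d. random communication link
        # Mix the estimates, then step along the tracked gradient.
        x_new = [m - alpha * yi for m, yi in zip(gossip(x, i, j), y)]
        g_new = [sampled_grad(a, x_new[a], k) for a in range(n)]
        # Mix the trackers and add the change in local gradient estimates.
        y = [m + gn - go for m, gn, go in zip(gossip(y, i, j), g_new, g)]
        x, g = x_new, g_new
    return x
```

With a fixed step size alpha, the agents' iterates approach the minimizer of the average cost, which here is the mean of b (3.0), because the growing batch size drives the gradient noise to zero geometrically.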