In this paper we propose several adaptive gradient methods for stochastic optimization. Unlike AdaGrad-type of methods, our algorithms are based on Armijo-type line search and they simultaneously adapt to the unknown Lipschitz constant of the gradient and variance of the stochastic approximation for the gradient. We consider an accelerated and non-accelerated gradient descent for convex problems and gradient descent for non-convex problems. In the experiments we demonstrate superiority of our methods to existing adaptive methods, e.g. AdaGrad and Adam.
We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an $alpha$-fraction of the machines are $textit{Byzantine}$, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging $textit{non-convex}$ case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is very practical: it improves upon the performance of all prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently-proposed Byzantine attacks.
In this work, we analyze the global convergence property of coordinate gradient descent with random choice of coordinates and stepsizes for non-convex optimization problems. Under generic assumptions, we prove that the algorithm iterate will almost surely escape strict saddle points of the objective function. As a result, the algorithm is guaranteed to converge to local minima if all saddle points are strict. Our proof is based on viewing coordinate descent algorithm as a nonlinear random dynamical system and a quantitative finite block analysis of its linearization around saddle points.
We lower bound the complexity of finding $epsilon$-stationary points (with gradient norm at most $epsilon$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $epsilon^{-4}$ queries to find an $epsilon$ stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $epsilon^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.
Stochastic gradient methods (SGMs) have been widely used for solving stochastic optimization problems. A majority of existing works assume no constraints or easy-to-project constraints. In this paper, we consider convex stochastic optimization problems with expectation constraints. For these problems, it is often extremely expensive to perform projection onto the feasible set. Several SGMs in the literature can be applied to solve the expectation-constrained stochastic problems. We propose a novel primal-dual type SGM based on the Lagrangian function. Different from existing methods, our method incorporates an adaptiveness technique to speed up convergence. At each iteration, our method inquires an unbiased stochastic subgradient of the Lagrangian function, and then it renews the primal variables by an adaptive-SGM update and the dual variables by a vanilla-SGM update. We show that the proposed method has a convergence rate of $O(1/sqrt{k})$ in terms of the objective error and the constraint violation. Although the convergence rate is the same as those of existing SGMs, we observe its significantly faster convergence than an existing non-adaptive primal-dual SGM and a primal SGM on solving the Neyman-Pearson classification and quadratically constrained quadratic programs. Furthermore, we modify the proposed method to solve convex-concave stochastic minimax problems, for which we perform adaptive-SGM updates to both primal and dual variables. A convergence rate of $O(1/sqrt{k})$ is also established to the modified method for solving minimax problems in terms of primal-dual gap.
In this paper, we demonstrate the power of a widely used stochastic estimator based on moving average (SEMA) on a range of stochastic non-convex optimization problems, which only requires {bf a general unbiased stochastic oracle}. We analyze various stochastic methods (existing or newly proposed) based on the {bf variance recursion property} of SEMA for three families of non-convex optimization, namely standard stochastic non-convex minimization, stochastic non-convex strongly-concave min-max optimization, and stochastic bilevel optimization. Our contributions include: (i) for standard stochastic non-convex minimization, we present a simple and intuitive proof of convergence for a family Adam-style methods (including Adam) with an increasing or large momentum parameter for the first-order moment, which gives an alternative yet more natural way to guarantee Adam converge; (ii) for stochastic non-convex strongly-concave min-max optimization, we present a single-loop stochastic gradient descent ascent method based on the moving average estimators and establish its oracle complexity of $O(1/epsilon^4)$ without using a large mini-batch size, addressing a gap in the literature; (iii) for stochastic bilevel optimization, we present a single-loop stochastic method based on the moving average estimators and establish its oracle complexity of $widetilde O(1/epsilon^4)$ without computing the inverse or SVD of the Hessian matrix, improving state-of-the-art results. For all these problems, we also establish a variance diminishing result for the used stochastic gradient estimators.