No Arabic abstract
Derivative-free optimization (DFO) has recently gained a lot of momentum in machine learning, spawning interest in the community to design faster methods for problems where gradients are not accessible. While some attention has been given to the concept of acceleration in the DFO literature, existing stochastic algorithms for objective functions with a finite-sum structure have not been shown theoretically to achieve an accelerated rate of convergence. Algorithms that use acceleration in such a setting are prone to instabilities, making it difficult to reach convergence. In this work, we exploit the finite-sum structure of the objective in order to design a variance-reduced DFO algorithm that provably yields acceleration. We prove rates of convergence for both smooth convex and strongly-convex finite-sum objective functions. Finally, we validate our theoretical results empirically on several tasks and datasets.
Structured problems arise in many applications. To solve these problems, it is important to leverage the structure information. This paper focuses on convex problems with a finite-sum compositional structure. Finite-sum problems appear as the sample average approximation of a stochastic optimization problem and also arise in machine learning with a huge amount of training data. One popularly used numerical approach for finite-sum problems is the stochastic gradient method (SGM). However, the additional compositional structure prohibits easy access to unbiased stochastic approximation of the gradient, so directly applying the SGM to a finite-sum compositional optimization problem (COP) is often inefficient. We design new algorithms for solving strongly-convex and also convex two-level finite-sum COPs. Our design incorporates the Katyusha acceleration technique and adopts the mini-batch sampling from both outer-level and inner-level finite-sum. We first analyze the algorithm for strongly-convex finite-sum COPs. Similar to a few existing works, we obtain linear convergence rate in terms of the expected objective error, and from the convergence rate result, we then establish complexity results of the algorithm to produce an $varepsilon$-solution. Our complexity results have the same dependence on the number of component functions as existing works. However, due to the use of Katyusha acceleration, our results have better dependence on the condition number $kappa$ and improve to $kappa^{2.5}$ from the best-known $kappa^3$. Finally, we analyze the algorithm for convex finite-sum COPs, which uses as a subroutine the algorithm for strongly-convex finite-sum COPs. Again, we obtain better complexity results than existing works in terms of the dependence on $varepsilon$, improving to $varepsilon^{-2.5}$ from the best-known $varepsilon^{-3}$.
We propose an accelerated meta-algorithm, which allows to obtain accelerated methods for convex unconstrained minimization in different settings. As an application of the general scheme we propose nearly optimal methods for minimizing smooth functions with Lipschitz derivatives of an arbitrary order, as well as for smooth minimax optimization problems. The proposed meta-algorithm is more general than the ones in the literature and allows to obtain better convergence rates and practical performance in several settings.
This paper investigates accelerating the convergence of distributed optimization algorithms on non-convex problems. We propose a distributed primal-dual stochastic gradient descent~(SGD) equipped with powerball method to accelerate. We show that the proposed algorithm achieves the linear speedup convergence rate $mathcal{O}(1/sqrt{nT})$ for general smooth (possibly non-convex) cost functions. We demonstrate the efficiency of the algorithm through numerical experiments by training two-layer fully connected neural networks and convolutional neural networks on the MNIST dataset to compare with state-of-the-art distributed SGD algorithms and centralized SGD algorithms.
This paper considers decentralized minimization of $N:=nm$ smooth non-convex cost functions equally divided over a directed network of $n$ nodes. Specifically, we describe a stochastic first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique and gradient tracking (GT) to address the stochastic and decentralized nature of the problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds an $epsilon$-accurate first-order stationary point with $Obig(maxbig{N^{frac{1}{2}},n(1-lambda)^{-2},n^{frac{2}{3}}m^{frac{1}{3}}(1-lambda)^{-1}big}Lepsilon^{-2}big)$ gradient complexity, where ${(1-lambda)in(0,1]}$ is the spectral gap of the network weight matrix and $L$ is the smoothness parameter of the cost functions. This gradient complexity outperforms that of the existing decentralized stochastic gradient methods. In particular, in a big-data regime such that ${n = O(N^{frac{1}{2}}(1-lambda)^{3})}$, this gradient complexity furthers reduces to ${O(N^{frac{1}{2}}Lepsilon^{-2})}$, independent of the network topology, and matches that of the centralized near-optimal variance-reduced methods. Moreover, in this regime GT-SARAH achieves a non-asymptotic linear speedup, in that, the total number of gradient computations at each node is reduced by a factor of $1/n$ compared to the centralized near-optimal algorithms that perform all gradient computations at a single node. To the best of our knowledge, GT-SARAH is the first algorithm that achieves this property. In addition, we show that appropriate choices of local minibatch size balance the trade-offs between the gradient and communication complexity of GT-SARAH. Over infinite time horizon, we establish that all nodes in GT-SARAH asymptotically achieve consensus and converge to a first-order stationary point in the almost sure and mean-squared sense.
We establish that the min-sum message-passing algorithm and its asynchronous variants converge for a large class of unconstrained convex optimization problems.