No Arabic abstract
We provide improved convergence rates for various emph{non-smooth} optimization problems via higher-order accelerated methods. In the case of $ell_infty$ regression, we achieves an $O(epsilon^{-4/5})$ iteration complexity, breaking the $O(epsilon^{-1})$ barrier so far present for previous methods. We arrive at a similar rate for the problem of $ell_1$-SVM, going beyond what is attainable by first-order methods with prox-oracle access for non-smooth non-strongly convex problems. We further show how to achieve even faster rates by introducing higher-order regularization. Our results rely on recent advances in near-optimal accelerated methods for higher-order smooth convex optimization. In particular, we extend Nesterovs smoothing technique to show that the standard softmax approximation is not only smooth in the usual sense, but also emph{higher-order} smooth. With this observation in hand, we provide the first example of higher-order acceleration techniques yielding faster rates for emph{non-smooth} optimization, to the best of our knowledge.
We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the $p^{th}$-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of $O(1/T^{frac{p+1}{2}})$ when given access to an oracle for finding a fixed point of a $p^{th}$-order equation. We give analogous rates for the weak monotone variational inequality problem. For $p>2$, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski [2004] and the second-order method of Monteiro and Svaiter [2012]. We further instantiate our entire algorithm in the unconstrained $p=2$ case.
The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements. In this work, we adapt the IQC theory to study first-order methods for smooth and strongly-monotone games and show how to design tailored quadratic constraints to get tight upper bounds of convergence rates. Using this framework, we recover the existing bound for the gradient method~(GD), derive sharper bounds for the proximal point method~(PPM) and optimistic gradient method~(OG), and provide emph{for the first time} a global convergence rate for the negative momentum method~(NM) with an iteration complexity $mathcal{O}(kappa^{1.5})$, which matches its known lower bound. In addition, for time-varying systems, we prove that the gradient method with optimal step size achieves the fastest provable worst-case convergence rate with quadratic Lyapunov functions. Finally, we further extend our analysis to stochastic games and study the impact of multiplicative noise on different algorithms. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch (in contrast with the stochastic strongly-convex optimization setting, where such acceleration has been demonstrated). However, we exhibit an algorithm which achieves acceleration with two gradient queries per batch.
In recent years, the success of deep learning has inspired many researchers to study the optimization of general smooth non-convex functions. However, recent works have established pessimistic worst-case complexities for this class functions, which is in stark contrast with their superior performance in real-world applications (e.g. training deep neural networks). On the other hand, it is found that many popular non-convex optimization problems enjoy certain structured properties which bear some similarities to convexity. In this paper, we study the class of textit{quasar-convex functions} to close the gap between theory and practice. We study the convergence of first order methods in a variety of different settings and under different optimality criterions. We prove complexity upper bounds that are similar to standard results established for convex functions and much better that state-of-the-art convergence rates of non-convex functions. Overall, this paper suggests that textit{quasar-convexity} allows efficient optimization procedures, and we are looking forward to seeing more problems that demonstrate similar properties in practice.
We propose a new majorization-minimization (MM) method for non-smooth and non-convex programs, which is general enough to include the existing MM methods. Besides the local majorization condition, we only require that the difference between the directional derivatives of the objective function and its surrogate function vanishes when the number of iterations approaches infinity, which is a very weak condition. So our method can use a surrogate function that directly approximates the non-smooth objective function. In comparison, all the existing MM methods construct the surrogate function by approximating the smooth component of the objective function. We apply our relaxed MM methods to the robust matrix factorization (RMF) problem with different regularizations, where our locally majorant algorithm shows advantages over the state-of-the-art approaches for RMF. This is the first algorithm for RMF ensuring, without extra assumptions, that any limit point of the iterates is a stationary point.
The usual approach to developing and analyzing first-order methods for smooth convex optimization assumes that the gradient of the objective function is uniformly smooth with some Lipschitz constant $L$. However, in many settings the differentiable convex function $f(cdot)$ is not uniformly smooth -- for example in $D$-optimal design where $f(x):=-ln det(HXH^T)$, or even the univariate setting with $f(x) := -ln(x) + x^2$. Herein we develop a notion of relative smoothness and relative strong convexity that is determined relative to a user-specified reference function $h(cdot)$ (that should be computationally tractable for algorithms), and we show that many differentiable convex functions are relatively smooth with respect to a correspondingly fairly-simple reference function $h(cdot)$. We extend two standard algorithms -- the primal gradient scheme and the dual averaging scheme -- to our new setting, with associated computational guarantees. We apply our new approach to develop a new first-order method for the $D$-optimal design problem, with associated computational complexity analysis. Some of our results have a certain overlap with the recent work cite{bbt}.