Practical Schemes for Finding Near-Stationary Points of Convex Finite-Sums

104 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Kaiwen Zhou

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Kaiwen Zhou - Lai Tian - Anthony Man-Cho So

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The problem of finding near-stationary points in convex optimization has not been adequately studied yet, unlike other optimality measures such as minimizing function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of the algorithmic techniques in finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for both minimizing gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future developments.

قيم البحث

80 - Jingzhao Zhang , Hongzhou Lin , Stefanie Jegelka 2020

We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We first show that finding an $epsilon$-stationary point with first-order methods is impossible in finite time. We then introduce the notion of $(delta, epsilon)$-stationarity, which allows for an $epsilon$-approximate gradient to be the convex combination of generalized gradients evaluated at points within distance $delta$ to the solution. We propose a series of randomized first-order methods and analyze their complexity of finding a $(delta, epsilon)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $delta$. Empirically, our methods perform well for training ReLU neural networks.

التحسين والتحكم التعلم الآلي

Lower Bounds for Finding Stationary Points I

87 - Yair Carmon , John C. Duchi , Oliver Hinder 2017

We prove lower bounds on the complexity of finding $epsilon$-stationary points (points $x$ such that $| abla f(x)| le epsilon$) of smooth, high-dimensional, and potentially non-convex functions $f$. We consider oracle-based complexity measures, where an algorithm is given access to the value and all derivatives of $f$ at a query point $x$. We show that for any (potentially randomized) algorithm $mathsf{A}$, there exists a function $f$ with Lipschitz $p$th order derivatives such that $mathsf{A}$ requires at least $epsilon^{-(p+1)/p}$ queries to find an $epsilon$-stationary point. Our lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newtons method, and generalized $p$th order regularization are worst-case optimal within their natural function classes.

التحسين والتحكم

Lower Bounds for Finding Stationary Points II: First-Order Methods

76 - Yair Carmon , John C. Duchi , Oliver Hinder 2017

We establish lower bounds on the complexity of finding $epsilon$-stationary points of smooth, non-convex high-dimensional functions using first-order methods. We prove that deterministic first-order methods, even applied to arbitrarily smooth functio ns, cannot achieve convergence rates in $epsilon$ better than $epsilon^{-8/5}$, which is within $epsilon^{-1/15}logfrac{1}{epsilon}$ of the best known rate for such methods. Moreover, for functions with Lipschitz first and second derivatives, we prove no deterministic first-order method can achieve convergence rates better than $epsilon^{-12/7}$, while $epsilon^{-2}$ is a lower bound for functions with only Lipschitz gradient. For convex functions with Lipschitz gradient, accelerated gradient descent achieves the rate $epsilon^{-1}logfrac{1}{epsilon}$, showing that finding stationary points is easier given convexity.

التحسين والتحكم

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

112 - Yossi Arjevani 2017

We study the conditions under which one is able to efficiently apply variance-reduction and acceleration schemes on finite sum optimization problems. First, we show that, perhaps surprisingly, the finite sum structure by itself, is not sufficient for obtaining a complexity bound of $tilde{cO}((n+L/mu)ln(1/epsilon))$ for $L$-smooth and $mu$-strongly convex individual functions - one must also know which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sum algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an `accelerated complexity bound of $tilde{cO}((n+sqrt{n L/mu})ln(1/epsilon))$, unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and convex finite sums, the optimal complexity bound is $tilde{cO}(n+L/epsilon)$, assuming that (on average) the same update rule is used in every iteration, and $tilde{cO}(n+sqrt{nL/epsilon})$, otherwise.

التحسين والتحكم التعلم الآلي التعلم الالي

Katyusha Acceleration for Convex Finite-Sum Compositional Optimization

91 - Yibo Xu , Yangyang Xu 2019

Structured problems arise in many applications. To solve these problems, it is important to leverage the structure information. This paper focuses on convex problems with a finite-sum compositional structure. Finite-sum problems appear as the sample average approximation of a stochastic optimization problem and also arise in machine learning with a huge amount of training data. One popularly used numerical approach for finite-sum problems is the stochastic gradient method (SGM). However, the additional compositional structure prohibits easy access to unbiased stochastic approximation of the gradient, so directly applying the SGM to a finite-sum compositional optimization problem (COP) is often inefficient. We design new algorithms for solving strongly-convex and also convex two-level finite-sum COPs. Our design incorporates the Katyusha acceleration technique and adopts the mini-batch sampling from both outer-level and inner-level finite-sum. We first analyze the algorithm for strongly-convex finite-sum COPs. Similar to a few existing works, we obtain linear convergence rate in terms of the expected objective error, and from the convergence rate result, we then establish complexity results of the algorithm to produce an $varepsilon$-solution. Our complexity results have the same dependence on the number of component functions as existing works. However, due to the use of Katyusha acceleration, our results have better dependence on the condition number $kappa$ and improve to $kappa^{2.5}$ from the best-known $kappa^3$. Finally, we analyze the algorithm for convex finite-sum COPs, which uses as a subroutine the algorithm for strongly-convex finite-sum COPs. Again, we obtain better complexity results than existing works in terms of the dependence on $varepsilon$, improving to $varepsilon^{-2.5}$ from the best-known $varepsilon^{-3}$.

التحسين والتحكم التعلم الآلي التحليل العددي