No Arabic abstract
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We first show that finding an $epsilon$-stationary point with first-order methods is impossible in finite time. We then introduce the notion of $(delta, epsilon)$-stationarity, which allows for an $epsilon$-approximate gradient to be the convex combination of generalized gradients evaluated at points within distance $delta$ to the solution. We propose a series of randomized first-order methods and analyze their complexity of finding a $(delta, epsilon)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $delta$. Empirically, our methods perform well for training ReLU neural networks.
The problem of finding near-stationary points in convex optimization has not been adequately studied yet, unlike other optimality measures such as minimizing function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of the algorithmic techniques in finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for both minimizing gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future developments.
We prove lower bounds on the complexity of finding $epsilon$-stationary points (points $x$ such that $| abla f(x)| le epsilon$) of smooth, high-dimensional, and potentially non-convex functions $f$. We consider oracle-based complexity measures, where an algorithm is given access to the value and all derivatives of $f$ at a query point $x$. We show that for any (potentially randomized) algorithm $mathsf{A}$, there exists a function $f$ with Lipschitz $p$th order derivatives such that $mathsf{A}$ requires at least $epsilon^{-(p+1)/p}$ queries to find an $epsilon$-stationary point. Our lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newtons method, and generalized $p$th order regularization are worst-case optimal within their natural function classes.
This paper has two main goals: (a) establish several statistical properties---consistency, asymptotic distributions, and convergence rates---of stationary solutions and values of a class of coupled nonconvex and nonsmoothempirical risk minimization problems, and (b) validate these properties by a noisy amplitude-based phase retrieval problem, the latter being of much topical interest.Derived from available data via sampling, these empirical risk minimization problems are the computational workhorse of a population risk model which involves the minimization of an expected value of a random functional. When these minimization problems are nonconvex, the computation of their globally optimal solutions is elusive. Together with the fact that the expectation operator cannot be evaluated for general probability distributions, it becomes necessary to justify whether the stationary solutions of the empirical problems are practical approximations of the stationary solution of the population problem. When these two features, general distribution and nonconvexity, are coupled with nondifferentiability that often renders the problems non-Clarke regular, the task of the justification becomes challenging. Our work aims to address such a challenge within an algorithm-free setting. The resulting analysis is therefore different from the much of the analysis in the recent literature that is based on local search algorithms. Furthermore, supplementing the classical minimizer-centric analysis, our results offer a first step to close the gap between computational optimization and asymptotic analysis of coupled nonconvex nonsmooth statistical estimation problems, expanding the former with statistical properties of the practically obtained solution and providing the latter with a more practical focus pertaining to computational tractability.
While many distributed optimization algorithms have been proposed for solving smooth or convex problems over the networks, few of them can handle non-convex and non-smooth problems. Based on a proximal primal-dual approach, this paper presents a new (stochastic) distributed algorithm with Nesterov momentum for accelerated optimization of non-convex and non-smooth problems. Theoretically, we show that the proposed algorithm can achieve an $epsilon$-stationary solution under a constant step size with $mathcal{O}(1/epsilon^2)$ computation complexity and $mathcal{O}(1/epsilon)$ communication complexity. When compared to the existing gradient tracking based methods, the proposed algorithm has the same order of computation complexity but lower order of communication complexity. To the best of our knowledge, the presented result is the first stochastic algorithm with the $mathcal{O}(1/epsilon)$ communication complexity for non-convex and non-smooth problems. Numerical experiments for a distributed non-convex regression problem and a deep neural network based classification problem are presented to illustrate the effectiveness of the proposed algorithms.
This paper studies the complexity for finding approximate stationary points of nonconvex-strongly-concave (NC-SC) smooth minimax problems, in both general and averaged smooth finite-sum settings. We establish nontrivial lower complexity bounds of $Omega(sqrt{kappa}Delta Lepsilon^{-2})$ and $Omega(n+sqrt{nkappa}Delta Lepsilon^{-2})$ for the two settings, respectively, where $kappa$ is the condition number, $L$ is the smoothness constant, and $Delta$ is the initial gap. Our result reveals substantial gaps between these limits and best-known upper bounds in the literature. To close these gaps, we introduce a generic acceleration scheme that deploys existing gradient-based methods to solve a sequence of crafted strongly-convex-strongly-concave subproblems. In the general setting, the complexity of our proposed algorithm nearly matches the lower bound; in particular, it removes an additional poly-logarithmic dependence on accuracy present in previous works. In the averaged smooth finite-sum setting, our proposed algorithm improves over previous algorithms by providing a nearly-tight dependence on the condition number.