ﻻ يوجد ملخص باللغة العربية
We design an algorithm which finds an $epsilon$-approximate stationary point (with $| abla F(x)|le epsilon$) using $O(epsilon^{-3})$ stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and---surprisingly---that it cannot be improved using stochastic $p$th order methods for any $pge 2$, even when the first $p$ derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding $(epsilon,gamma)$-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.
Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step-sizes, a general clas
Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statist
We give nearly matching upper and lower bounds on the oracle complexity of finding $epsilon$-stationary points ($| abla F(x) | leqepsilon$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic orac
Policy optimization on high-dimensional continuous control tasks exhibits its difficulty caused by the large variance of the policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator which incorporates the Rao-Bla
We investigate 1) the rate at which refined properties of the empirical risk---in particular, gradients---converge to their population counterparts in standard non-convex learning tasks, and 2) the consequences of this convergence for optimization. O