Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step sizes, a general class of learning rates that are allowed to vary within a banded region. This framework includes cyclic and non-monotonic step sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise $1/\sqrt{t}$ and the popular step-decay (hold the step size constant, then drop it by a constant factor), which is also shown to be optimal. Moreover, we show that the momentum variant of SGD (SGDM) converges as fast as SGD under the bandwidth-based step-decay step size. Finally, we propose some novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.
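To make the idea of a banded step size concrete, here is a minimal Python sketch. It assumes the band is given by constant multiples of a step-decay base schedule, which is one reading of "vary in a banded region" and may differ from the exact definition in the paper. The function names, the triangular perturbation, and all parameter values (`eta0`, `decay`, `stage_len`, `lower`, `upper`, `period`) are hypothetical choices made for illustration only.

```python
def step_decay(t, eta0=0.1, decay=0.5, stage_len=100):
    """Step-decay base schedule: constant within a stage, then dropped
    by a constant factor at each stage boundary (illustrative values)."""
    return eta0 * decay ** (t // stage_len)


def banded_step_size(t, lower=0.5, upper=2.0, period=20):
    """A cyclic, non-monotonic step size kept inside the band
    [lower * base(t), upper * base(t)], where base is step_decay.

    The band shape and the triangular wave are assumptions for this
    sketch, not the paper's specific construction.
    """
    base = step_decay(t)
    # Triangular wave in [0, 1]: 0 at the start of a cycle, 1 at its midpoint.
    phase = (t % period) / period
    wave = 1.0 - abs(2.0 * phase - 1.0)
    # Interpolate between the lower and upper edges of the band.
    return base * (lower + (upper - lower) * wave)


def stays_in_band(num_steps, lower=0.5, upper=2.0, period=20):
    """Check that the perturbed schedule never leaves the band."""
    return all(
        lower * step_decay(t)
        <= banded_step_size(t, lower, upper, period)
        <= upper * step_decay(t)
        for t in range(num_steps)
    )


if __name__ == "__main__":
    for t in (0, 10, 20, 100, 110):
        print(t, step_decay(t), banded_step_size(t))
    print("in band:", stays_in_band(500))
```

The schedule rises and falls within each cycle, so it is not monotone, yet it follows the decaying trend of the base schedule because both band edges shrink at every stage boundary.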
We design an algorithm which finds an $\epsilon$-approximate stationary point (with $\|\nabla F(x)\| \le \epsilon$) using $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stro
Reducing the variance of the gradient estimator is known to improve the convergence rate of stochastic gradient-based optimization and sampling algorithms. One way of achieving variance reduction is to design importance sampling strategies. Recently,
Gradient clipping is commonly used in training deep neural networks partly due to its practicability in relieving the exploding gradient problem. Recently, \citet{zhang2019gradient} show that clipped (stochastic) Gradient Descent (GD) converges faster
We investigate 1) the rate at which refined properties of the empirical risk---in particular, gradients---converge to their population counterparts in standard non-convex learning tasks, and 2) the consequences of this convergence for optimization. O
We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points ($\|\nabla F(x)\| \leq \epsilon$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic orac