
Super-Acceleration with Cyclical Step-sizes

Published by Baptiste Goujaud
Publication date: 2021
Paper language: English





Cyclical step-sizes are becoming increasingly popular in the optimization of deep learning problems. Motivated by recent observations on the spectral gaps of Hessians in machine learning, we show that these step-size schedules offer a simple way to exploit them. More precisely, we develop a convergence rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal first order methods for quadratic minimization with a given spectral structure. Finally, we provide a local convergence rate analysis beyond quadratic minimization for the proposed methods and illustrate our findings through benchmarks on least squares and logistic regression problems.
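To make the idea of exploiting a spectral gap concrete, here is a minimal sketch in Python with NumPy. It is not the method analyzed in the paper: it uses plain gradient descent on a diagonal quadratic, and the spectrum (two equal-width clusters `[1, 2]` and `[9, 10]`), the helper `gradient_descent`, and the starting point are all assumptions chosen for illustration. The two cyclical step sizes are the inverse roots of the degree-2 polynomial that is smallest in magnitude over both clusters, which is possible here because the clusters are symmetric about their common center.

```python
import numpy as np

def gradient_descent(eigs, x0, step_sizes, n_iters):
    """Run gradient descent on f(x) = 0.5 * sum(eigs * x**2),
    cycling through the given list of step sizes.
    Returns the distance of the final iterate to the minimizer x* = 0."""
    x = x0.copy()
    for k in range(n_iters):
        h = step_sizes[k % len(step_sizes)]
        x = x - h * (eigs * x)  # gradient of the diagonal quadratic is eigs * x
    return np.linalg.norm(x)

# Hypothetical spectrum: Hessian eigenvalues in two equal-width clusters
# [mu1, L1] and [mu2, L2], separated by a large gap.
mu1, L1, mu2, L2 = 1.0, 2.0, 9.0, 10.0
rng = np.random.default_rng(0)
eigs = np.concatenate([rng.uniform(mu1, L1, 50), rng.uniform(mu2, L2, 50)])
x0 = np.ones(eigs.size)

# Baseline: the classical constant step 2 / (mu + L), which ignores the gap.
h_const = 2.0 / (mu1 + L2)

# Two-step cycle: since the clusters are symmetric about c, both are the
# preimage of one interval under t = (lam - c)**2. The best degree-2
# polynomial has its roots where t equals the midpoint of that interval.
c = 0.5 * (mu1 + L2)
t_mid = 0.5 * ((mu2 - c) ** 2 + (L2 - c) ** 2)
h_cycle = [1.0 / (c - np.sqrt(t_mid)), 1.0 / (c + np.sqrt(t_mid))]

n_iters = 40
err_const = gradient_descent(eigs, x0, [h_const], n_iters)
err_cycle = gradient_descent(eigs, x0, h_cycle, n_iters)
print(f"constant step: {err_const:.3e}")
print(f"cyclical step: {err_cycle:.3e}")
```

In this toy setting the larger step of the cycle temporarily amplifies the components in the upper cluster, and the smaller step that follows contracts them again, so only the error over a full cycle decreases monotonically. If the two clusters had different widths, the closed-form step sizes above would no longer apply.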




Read also

Jinlong Lei, Peng Yi, Jie Chen, 2021
The paper considers distributed stochastic optimization over randomly switching networks, where agents collaboratively minimize the average of all agents' local expectation-valued convex cost functions. Due to the stochasticity in gradient observations, the distributed nature of the local functions, and the randomness of communication topologies, distributed algorithms with a convergence guarantee under fixed step-sizes have not been achieved yet. This work incorporates a variance reduction scheme into the distributed stochastic gradient tracking algorithm, where local gradients are estimated by averaging across a variable number of sampled gradients. With an independent and identically distributed (i.i.d.) random network, we show that all agents' iterates converge almost surely to the same optimal solution under fixed step-sizes. When the global cost function is strongly convex and the sample size increases at a geometric rate, we prove that the iterates converge geometrically to the unique optimal solution, and establish the iteration, oracle, and communication complexity. The algorithm's performance, including rate and complexity analysis, is further investigated with constant step-sizes and a polynomially increasing sample size. Finally, the empirical performance of the algorithm is illustrated with numerical examples.
Variational Inference makes a trade-off between the capacity of the variational family and the tractability of finding an approximate posterior distribution. Instead, Boosting Variational Inference allows practitioners to obtain increasingly good posterior approximations by spending more compute. The main obstacle to widespread adoption of Boosting Variational Inference is the amount of resources necessary to improve over a strong Variational Inference baseline. In our work, we trace this limitation back to the global curvature of the KL-divergence. We characterize how the global curvature impacts time and memory consumption, address the problem with the notion of local curvature, and provide a novel approximate backtracking algorithm for estimating local curvature. We give new theoretical convergence rates for our algorithms and provide experimental validation on synthetic and real-world datasets.
The main goal of this work is equipping convex and nonconvex problems with Barzilai-Borwein (BB) step size. With the adaptivity of BB step sizes granted, they can fail when the objective function is not strongly convex. To overcome this challenge, the key idea here is to bridge (non)convex problems and strongly convex ones via regularization. The proposed regularization schemes are simple yet effective. Wedding the BB step size with a variance reduction method, known as SARAH, offers a free lunch compared with vanilla SARAH in convex problems. The convergence of BB step sizes in nonconvex problems is also established and its complexity is no worse than other adaptive step sizes such as AdaGrad. As a byproduct, our regularized SARAH methods for convex functions ensure that the complexity to find $\mathbb{E}[\| \nabla f(\mathbf{x}) \|^2] \leq \epsilon$ is $\mathcal{O}\big( (n+\frac{1}{\sqrt{\epsilon}})\ln{\frac{1}{\epsilon}}\big)$, improving $\epsilon$ dependence over existing results. Numerical tests further validate the merits of proposed approaches.
Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step-sizes, a general class of learning-rates that are allowed to vary in a banded region. This framework includes cyclic and non-monotonic step-sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise $1/\sqrt{t}$ and the popular step-decay (constant and then drop by a constant), which is also shown to be optimal. Moreover, we show that its momentum variant (SGDM) converges as fast as SGD with the bandwidth-based step-decay step-size. Finally, we propose some novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.
Reducing the variance of the gradient estimator is known to improve the convergence rate of stochastic gradient-based optimization and sampling algorithms. One way of achieving variance reduction is to design importance sampling strategies. Recently, the problem of designing such schemes was formulated as an online learning problem with bandit feedback, and algorithms with sub-linear static regret were designed. In this work, we build on this framework and propose Avare, a simple and efficient algorithm for adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes. Under standard technical conditions, we show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when run with $\mathcal{O}(1/t)$ step sizes. We achieve this dynamic regret bound by leveraging our knowledge of the dynamics defined by the algorithm, and combining ideas from online learning and variance-reduced stochastic optimization. We validate empirically the performance of our algorithm and identify settings in which it leads to significant improvements.