A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

306 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Nicolas Le Roux

تاريخ النشر 2012

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Nicolas Le Roux - Mark Schmidtn (INRIA Paris - Rocquencourt

التحسين والتحكم التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly.

قيم البحث

95 - Yossi Arjevani , Ohad Shamir , Nathan Srebro 2018

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $tau$ rounds ago. First, we show that without stochastic noise, dela ys strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.

التحسين والتحكم التعلم الآلي التعلم الالي

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

221 - Atsushi Nitanda , Taiji Suzuki 2018

We consider stochastic gradient descent and its averaging variant for binary classification problems in a reproducing kernel Hilbert space. In the traditional analysis using a consistency property of loss functions, it is known that the expected clas sification error converges more slowly than the expected risk even when assuming a low-noise condition on the conditional label probabilities. Consequently, the resulting rate is sublinear. Therefore, it is important to consider whether much faster convergence of the expected classification error can be achieved. In recent research, an exponential convergence rate for stochastic gradient descent was shown under a strong low-noise condition but provided theoretical analysis was limited to the squared loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show an exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions. As for the averaged stochastic gradient descent, we show that the same convergence rate holds from the early phase of training. In experiments, we verify our analyses on the $L_2$-regularized logistic regression.

التعلم الالي التعلم الآلي التحسين والتحكم

Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization

166 - Yangyang Xu 2020

Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases . Most of them require a large number of samples in some or all iterations of the improved SGMs. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm can achieve the optimal complexity result $O(varepsilon^{-3})$ to produce a stochastic $varepsilon$-stationary solution, if a mean-squared smoothness condition holds and $Theta(varepsilon^{-1})$ samples are available for the initial update. Different from existing optimal methods, PStorm can still achieve a near-optimal complexity result $tilde{O}(varepsilon^{-3})$ by using only one or $O(1)$ samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or $O(1)$ new observations. In addition, for large-scale machine learning problems, PStorm can generalize better by small-batch training than other optimal methods that require large-batch training and the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network and a sparse convolutional neural network.

التحسين والتحكم التعلم الآلي التحليل العددي

Orthant Based Proximal Stochastic Gradient Method for $ell_1$-Regularized Optimization

240 - Tianyi Chen , Tianyu Ding , Bo Ji 2020

Sparsity-inducing regularization problems are ubiquitous in machine learning applications, ranging from feature selection to model compression. In this paper, we present a novel stochastic method -- Orthant Based Proximal Stochastic Gradient Method ( OBProx-SG) -- to solve perhaps the most popular instance, i.e., the l1-regularized problem. The OBProx-SG method contains two steps: (i) a proximal stochastic gradient step to predict a support cover of the solution; and (ii) an orthant step to aggressively enhance the sparsity level via orthant face projection. Compared to the state-of-the-art methods, e.g., Prox-SG, RDA and Prox-SVRG, the OBProx-SG not only converges to the global optimal solutions (in convex scenario) or the stationary points (in non-convex scenario), but also promotes the sparsity of the solutions substantially. Particularly, on a large number of convex problems, OBProx-SG outperforms the existing methods comprehensively in the aspect of sparsity exploration and objective values. Moreover, the experiments on non-convex deep neural networks, e.g., MobileNetV1 and ResNet18, further demonstrate its superiority by achieving the solutions of much higher sparsity without sacrificing generalization accuracy.

التحسين والتحكم التعلم الآلي التعلم الالي

On a Faster $R$-Linear Convergence Rate of the Barzilai-Borwein Method

80 - Dawei Li , Ruoyu Sun 2021

The Barzilai-Borwein (BB) method has demonstrated great empirical success in nonlinear optimization. However, the convergence speed of BB method is not well understood, as the known convergence rate of BB method for quadratic problems is much worse t han the steepest descent (SD) method. Therefore, there is a large discrepancy between theory and practice. To shrink this gap, we prove that the BB method converges $R$-linearly at a rate of $1-1/kappa$, where $kappa$ is the condition number, for strongly convex quadratic problems. In addition, an example with the theoretical rate of convergence is constructed, indicating the tightness of our bound.

التحسين والتحكم التعلم الآلي