On the Convergence of Nesterov's Accelerated Gradient Method in Stochastic Settings


Published by: Mahmoud Assran

Publication date: 2020

Research field: Informatics Engineering

Language: English

Authors: Mahmoud Assran, Michael Rabbat


Abstract

We study Nesterov's accelerated gradient method with constant step-size and momentum parameters in the stochastic approximation setting (unbiased gradients with bounded variance) and the finite-sum setting (where randomness is due to sampling mini-batches). To build better insight into the behavior of Nesterov's method in stochastic settings, we focus throughout on objectives that are smooth, strongly convex, and twice continuously differentiable. In the stochastic approximation setting, Nesterov's method converges to a neighborhood of the optimal point at the same accelerated rate as in the deterministic setting. Perhaps surprisingly, in the finite-sum setting, we prove that Nesterov's method may diverge with the usual choice of step-size and momentum, unless additional conditions on the problem related to conditioning and data coherence are satisfied. Our results shed light on why Nesterov's method may fail to converge or achieve acceleration in the finite-sum setting.
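As a rough illustration of the stochastic approximation setting described in the abstract, the following minimal sketch runs Nesterov's method with constant step-size and momentum on a small strongly convex quadratic. The test problem, the eigenvalues in `h`, the noise level `sigma`, and the textbook parameter choice (step-size 1/L, momentum (sqrt(kappa) - 1)/(sqrt(kappa) + 1)) are assumptions made for this example and are not taken from the paper.

```python
import math
import random

def nesterov(grad, x0, step, momentum, iters):
    """Nesterov's accelerated gradient with constant step-size and momentum."""
    x = list(x0)        # extrapolated point where the gradient is evaluated
    y_prev = list(x0)   # previous gradient-step iterate
    for _ in range(iters):
        g = grad(x)
        # Gradient step from the extrapolated point.
        y = [xi - step * gi for xi, gi in zip(x, g)]
        # Momentum (extrapolation) step.
        x = [yi + momentum * (yi - ypi) for yi, ypi in zip(y, y_prev)]
        y_prev = y
    return y_prev

# Hypothetical test problem: f(x) = 0.5 * sum(h_i * x_i^2), minimizer at 0.
h = [1.0, 10.0, 100.0]            # Hessian eigenvalues, so mu = 1 and L = 100
mu, L = min(h), max(h)
kappa = L / mu                    # condition number
step = 1.0 / L                    # assumed "usual" step-size
momentum = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)

def exact_grad(x):
    """Deterministic gradient of the quadratic."""
    return [hi * xi for hi, xi in zip(h, x)]

def make_noisy_grad(sigma, seed=0):
    """Unbiased gradient oracle: exact gradient plus zero-mean Gaussian noise."""
    rng = random.Random(seed)
    def noisy_grad(x):
        return [hi * xi + rng.gauss(0.0, sigma) for hi, xi in zip(h, x)]
    return noisy_grad

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

x0 = [1.0, 1.0, 1.0]
x_det = nesterov(exact_grad, x0, step, momentum, 300)
x_sto = nesterov(make_noisy_grad(sigma=0.1), x0, step, momentum, 300)

print("deterministic distance to optimum:", norm(x_det))
print("stochastic distance to optimum:   ", norm(x_sto))
```

In this sketch the deterministic run should drive the iterate essentially to the minimizer, while the noisy run should stall at a small but nonzero distance, which is consistent with convergence to a neighborhood of the optimum. It does not reproduce the finite-sum divergence result, which depends on conditions specific to the paper's analysis.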


187 - Yushu Chen, Hao Jing, Wenlai Zhao 2019

We present the remote stochastic gradient (RSG) method, which computes the gradients at configurable remote observation points, in order to improve the convergence rate and suppress gradient noise at the same time for different curvatures. RSG is further combined with adaptive methods to construct ARSG for acceleration. The method is efficient in computation and memory, and is straightforward to implement. We analyze the convergence properties by modeling the training process as a dynamic system, which provides a guideline for selecting the configurable observation factor without grid search. ARSG yields an $O(1/\sqrt{T})$ convergence rate in non-convex settings, which can be further improved to $O(\log(T)/T)$ in strongly convex settings. Numerical experiments demonstrate that ARSG achieves both faster convergence and better generalization, compared with popular adaptive methods, such as ADAM, NADAM, AMSGRAD, and RANGER for the tested problems. In particular, for training ResNet-50 on ImageNet, ARSG outperforms ADAM in convergence speed while surpassing SGD in generalization.

Machine Learning, Optimization and Control

The Complexity of Making the Gradient Small in Stochastic Convex Optimization

318 - Dylan J. Foster, Ayush Sekhari, Ohad Shamir 2019

We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points ($\|\nabla F(x)\| \leq \epsilon$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent recursive regularization technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff.

Machine Learning, Optimization and Control

Stochastic Gradient Descent with Biased but Consistent Gradient Estimators

88 - Jie Chen, Ronny Luss 2018

Stochastic gradient descent (SGD), which dates back to the 1950s, is one of the most popular and effective approaches for performing stochastic optimization. Research on SGD resurged recently in machine learning for optimizing convex loss functions and training nonconvex deep neural networks. The theory assumes that one can easily compute an unbiased gradient estimator, which is usually the case due to the sample average nature of empirical risk minimization. There exist, however, many scenarios (e.g., graphs) where an unbiased estimator may be as expensive to compute as the full gradient because training examples are interconnected. Recently, Chen et al. (2018) proposed using a consistent gradient estimator as an economical alternative. Encouraged by empirical success, we show, in a general setting, that consistent estimators result in the same convergence behavior as do unbiased ones. Our analysis covers strongly convex, convex, and nonconvex objectives. We verify the results with illustrative experiments on synthetic and real-world data. This work opens several new research directions, including the development of more efficient SGD updates with consistent estimators and the design of efficient training algorithms for large-scale graphs.

Machine Learning, Optimization and Control

Matrix Completion via Nonconvex Regularization: Convergence of the Proximal Gradient Algorithm

148 - Fei Wen, Rendong Ying, Peilin Liu 2019

Matrix completion has attracted much interest in the past decade in machine learning and computer vision. For low-rank promotion in matrix completion, the nuclear norm penalty is convenient due to its convexity but has a bias problem. Recently, various algorithms using nonconvex penalties have been proposed, among which the proximal gradient descent (PGD) algorithm is one of the most efficient and effective. For the nonconvex PGD algorithm, whether it converges to a local minimizer and what its convergence rate is are still unclear. This work provides a nontrivial analysis of the PGD algorithm in the nonconvex case. Besides the convergence to a stationary point for a generalized nonconvex penalty, we provide a deeper analysis of a popular and important class of nonconvex penalties which have discontinuous thresholding functions. For such penalties, we establish finite rank convergence, convergence to a restricted strictly local minimizer, and an eventually linear convergence rate of the PGD algorithm. Meanwhile, convergence to a local minimizer has been proved for the hard-thresholding penalty. Our result is the first to show that nonconvex regularized matrix completion only has restricted strictly local minimizers, and that the PGD algorithm can converge to such minimizers with an eventually linear rate under certain conditions. An illustration of the PGD algorithm via experiments is also provided. Code is available at https://github.com/FWen/nmc.

Machine Learning, Optimization and Control

Stochastic Gradient Descent for Nonconvex Learning without Bounded Gradient Assumptions

117 - Yunwen Lei, Ting Hu, Guiying Li 2019

Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption on the uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this paper, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex objective functions and gradient-dominated objective functions. Linear convergence is further derived in the case with zero variances.

Machine Learning, Optimization and Control