Finite-Time Analysis of Stochastic Gradient Descent under Markov Randomness

73 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Thinh Thanh Doan

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Thinh T. Doan - Lam M. Nguyen - Nhan H. Pham

التحسين والتحكم التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Motivated by broad applications in reinforcement learning and machine learning, this paper considers the popular stochastic gradient descent (SGD) when the gradients of the underlying objective function are sampled from Markov processes. This Markov sampling leads to the gradient samples being biased and not independent. The existing results for the convergence of SGD under Markov randomness are often established under the assumptions on the boundedness of either the iterates or the gradient samples. Our main focus is to study the finite-time convergence of SGD for different types of objective functions, without requiring these assumptions. We show that SGD converges nearly at the same rate with Markovian gradient samples as with independent gradient samples. The only difference is a logarithmic factor that accounts for the mixing time of the Markov chain.

قيم البحث

اقرأ أيضاً

Stochastic Reweighted Gradient Descent

166 - Ayoub El Hanchi , David A. Stephens 2021

Despite the strong theoretical guarantees that variance-reduced finite-sum optimization algorithms enjoy, their applicability remains limited to cases where the memory overhead they introduce (SAG/SAGA), or the periodic full gradient computation they require (SVRG/SARAH) are manageable. A promising approach to achieving variance reduction while avoiding these drawbacks is the use of importance sampling instead of control variates. While many such methods have been proposed in the literature, directly proving that they improve the convergence of the resulting optimization algorithm has remained elusive. In this work, we propose an importance-sampling-based algorithm we call SRG (stochastic reweighted gradient). We analyze the convergence of SRG in the strongly-convex case and show that, while it does not recover the linear rate of control variates methods, it provably outperforms SGD. We pay particular attention to the time and memory overhead of our proposed method, and design a specialized red-black tree allowing its efficient implementation. Finally, we present empirical results to support our findings.

التحسين والتحكم التعلم الآلي التعلم الالي

Biased Stochastic Gradient Descent for Conditional Stochastic Optimization

148 - Yifan Hu , Siqi Zhang , Xin Chen 2020

Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure . As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and non-smooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning (MAML), and instrumental variable regression (IV).

التحسين والتحكم التعلم الآلي التعلم الالي

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

95 - Yossi Arjevani , Ohad Shamir , Nathan Srebro 2018

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $tau$ rounds ago. First, we show that without stochastic noise, dela ys strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.

التحسين والتحكم التعلم الآلي التعلم الالي

Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points

60 - Rishabh Dixit , Waheed U. Bajwa 2020

This paper considers the problem of understanding the exit time for trajectories of gradient-related first-order methods from saddle neighborhoods under some initial boundary conditions. Given the `flat geometry around saddle points, first-order meth ods can struggle in escaping these regions in a fast manner due to the small magnitudes of gradients encountered. In particular, while it is known that gradient-related first-order methods escape strict-saddle neighborhoods, existing literature does not explicitly leverage the local geometry around saddle points in order to control behavior of gradient trajectories. It is in this context that this paper puts forth a rigorous geometric analysis of the gradient-descent method around strict-saddle neighborhoods using matrix perturbation theory. In doing so, it provides a key result that can be used to generate an approximate gradient trajectory for any given initial conditions. In addition, the analysis leads to a linear exit-time solution for gradient-descent method under certain necessary initial conditions for a class of strict-saddle functions.

التحسين والتحكم التعلم الآلي أنظمة وتحكم

Accelerated Stochastic Gradient Descent for Minimizing Finite Sums

746 - Atsushi Nitanda 2015

We propose an optimization method for minimizing the finite sums of smooth convex functions. Our method incorporates an accelerated gradient descent (AGD) and a stochastic variance reduction gradient (SVRG) in a mini-batch setting. Unlike SVRG, our m ethod can be directly applied to non-strongly and strongly convex problems. We show that our method achieves a lower overall complexity than the recently proposed methods that supports non-strongly convex problems. Moreover, this method has a fast rate of convergence for strongly convex problems. Our experiments show the effectiveness of our method.

التعلم الالي التعلم الآلي