ترغب بنشر مسار تعليمي؟ اضغط هنا

Backtracking Gradient Descent allowing unbounded learning rates

86   0   0.0 ( 0 )
 نشر من قبل Tuyen Truong
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

In unconstrained optimisation on an Euclidean space, to prove convergence in Gradient Descent processes (GD) $x_{n+1}=x_n-delta _n abla f(x_n)$ it usually is required that the learning rates $delta _n$s are bounded: $delta _nleq delta $ for some positive $delta $. Under this assumption, if the sequence $x_n$ converges to a critical point $z$, then with large values of $n$ the update will be small because $||x_{n+1}-x_n||lesssim || abla f(x_n)||$. This may also force the sequence to converge to a bad minimum. If we can allow, at least theoretically, that the learning rates $delta _n$s are not bounded, then we may have better convergence to better minima. A previous joint paper by the author showed convergence for the usual version of Backtracking GD under very general assumptions on the cost function $f$. In this paper, we allow the learning rates $delta _n$ to be unbounded, in the sense that there is a function $h:(0,infty)rightarrow (0,infty )$ such that $lim _{trightarrow 0}th(t)=0$ and $delta _nlesssim max {h(x_n),delta }$ satisfies Armijos condition for all $n$, and prove convergence under the same assumptions as in the mentioned paper. It will be shown that this growth rate of $h$ is best possible if one wants convergence of the sequence ${x_n}$. A specific way for choosing $delta _n$ in a discrete way connects to Two-way Backtracking GD defined in the mentioned paper. We provide some results which either improve or are implicitly contained in those in the mentioned paper and another recent paper on avoidance of saddle points.



قيم البحث

اقرأ أيضاً

56 - Tuyen Trung Truong 2020
Our main result concerns the following condition: {bf Condition C.} Let $X$ be a Banach space. A $C^1$ function $f:Xrightarrow mathbb{R}$ satisfies Condition C if whenever ${x_n}$ weakly converges to $x$ and $lim _{nrightarrowinfty}|| abla f(x_n)|| =0$, then $ abla f(x)=0$. We assume that there is given a canonical isomorphism between $X$ and its dual $X^*$, for example when $X$ is a Hilbert space. {bf Theorem.} Let $X$ be a reflexive, complete Banach space and $f:Xrightarrow mathbb{R}$ be a $C^2$ function which satisfies Condition C. Moreover, we assume that for every bounded set $Ssubset X$, then $sup _{xin S}|| abla ^2f(x)||<infty$. We choose a random point $x_0in X$ and construct by the Local Backtracking GD procedure (which depends on $3$ hyper-parameters $alpha ,beta ,delta _0$, see later for details) the sequence $x_{n+1}=x_n-delta (x_n) abla f(x_n)$. Then we have: 1) Every cluster point of ${x_n}$, in the {bf weak} topology, is a critical point of $f$. 2) Either $lim _{nrightarrowinfty}f(x_n)=-infty$ or $lim _{nrightarrowinfty}||x_{n+1}-x_n||=0$. 3) Here we work with the weak topology. Let $mathcal{C}$ be the set of critical points of $f$. Assume that $mathcal{C}$ has a bounded component $A$. Let $mathcal{B}$ be the set of cluster points of ${x_n}$. If $mathcal{B}cap A ot= emptyset$, then $mathcal{B}subset A$ and $mathcal{B}$ is connected. 4) Assume that $X$ is separable. Then for generic choices of $alpha ,beta ,delta _0$ and the initial point $x_0$, if the sequence ${x_n}$ converges - in the {bf weak} topology, then the limit point cannot be a saddle point.
Despite the strong theoretical guarantees that variance-reduced finite-sum optimization algorithms enjoy, their applicability remains limited to cases where the memory overhead they introduce (SAG/SAGA), or the periodic full gradient computation they require (SVRG/SARAH) are manageable. A promising approach to achieving variance reduction while avoiding these drawbacks is the use of importance sampling instead of control variates. While many such methods have been proposed in the literature, directly proving that they improve the convergence of the resulting optimization algorithm has remained elusive. In this work, we propose an importance-sampling-based algorithm we call SRG (stochastic reweighted gradient). We analyze the convergence of SRG in the strongly-convex case and show that, while it does not recover the linear rate of control variates methods, it provably outperforms SGD. We pay particular attention to the time and memory overhead of our proposed method, and design a specialized red-black tree allowing its efficient implementation. Finally, we present empirical results to support our findings.
We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) dont increase the stepsize too fast and 2) dont overstep the local curvature. No need for functional values, no line search, no information about the function except for the gradients. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on the smoothness in a neighborhood of a solution. Given that the problem is convex, our method converges even if the global smoothness constant is infinity. As an illustration, it can minimize arbitrary continuously twice-differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including logistic regression and matrix factorization.
Distributed descent-based methods are an essential toolset to solving optimization problems in multi-agent system scenarios. Here the agents seek to optimize a global objective function through mutual cooperation. Oftentimes, cooperation is achieved over a wireless communication network that is prone to delays and errors. There are many scenarios wherein the objective function is either non-differentiable or merely observable. In this paper, we present a cross-entropy based distributed stochastic approximation algorithm (SA) that finds a minimum of the objective, using only samples. We call this algorithm Decentralized Simultaneous Perturbation Stochastic Gradient, with Constant Sensitivity Parameters (DSPG). This algorithm is a two fold improvement over the classic Simultaneous Perturbation Stochastic Approximations (SPSA) algorithm. Specifically, DSPG allows for (i) the use of old information from other agents and (ii) easy implementation through the use simple hyper-parameter choices. We analyze the biases and variances that arise due to these two allowances. We show that the biases due to communication delays can be countered by a careful choice of algorithm hyper-parameters. The variance of the gradient estimator and its effect on the rate of convergence is studied. We present numerical results supporting our theory. Finally, we discuss an application to the stochastic consensus problem.
148 - Yifan Hu , Siqi Zhang , Xin Chen 2020
Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure . As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and non-smooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning (MAML), and instrumental variable regression (IV).

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا