
Asymptotic behaviour of learning rates in Armijo's condition

Posted by: Tuyen Truong
Publication date: 2020
Research field: Informatics engineering
Paper language: English





Fix a constant $0<\alpha <1$. For a $C^1$ function $f:\mathbb{R}^k\rightarrow \mathbb{R}$, a point $x$ and a positive number $\delta >0$, we say that Armijo's condition is satisfied if $f(x-\delta \nabla f(x))-f(x)\leq -\alpha \delta ||\nabla f(x)||^2$. It is the basis for the well-known Backtracking Gradient Descent (Backtracking GD) algorithm. Consider a sequence $\{x_n\}$ defined by $x_{n+1}=x_n-\delta _n\nabla f(x_n)$, for positive numbers $\delta _n$ for which Armijo's condition is satisfied. We show that if $\{x_n\}$ converges to a non-degenerate critical point, then $\{\delta _n\}$ must be bounded. Moreover, this boundedness can be quantified in terms of the norms of the Hessian $\nabla ^2f$ and its inverse at the limit point. This complements the first author's results on Unbounded Backtracking GD, and shows that in the case of convergence to a non-degenerate critical point the behaviour of Unbounded Backtracking GD is not too different from that of the usual Backtracking GD. On the other hand, in the case of convergence to a degenerate critical point the behaviours can be very different. We run some experiments to illustrate that both scenarios can really happen. In another part of the paper, we argue that Backtracking GD has the correct unit (according to a definition by Zeiler in his Adadelta paper). The main point is that since the learning rate in Backtracking GD is bounded by Armijo's condition, it is not unitless.
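The backtracking loop behind Armijo's condition can be sketched as follows. This is a minimal illustration, not the paper's code; the constants `alpha`, `beta`, `delta0` and the quadratic test function are our own choices.

```python
import numpy as np

def backtracking_gd_step(f, grad_f, x, alpha=0.5, beta=0.5, delta0=1.0):
    """One step of Backtracking Gradient Descent: shrink the learning
    rate delta by the factor beta until Armijo's condition
    f(x - delta*g) - f(x) <= -alpha*delta*||g||^2 holds."""
    g = grad_f(x)
    delta = delta0
    while f(x - delta * g) - f(x) > -alpha * delta * np.dot(g, g):
        delta *= beta
    return x - delta * g, delta

# Usage on f(x) = ||x||^2 (gradient 2x); purely illustrative.
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x
x = np.array([1.0, -2.0])
for _ in range(50):
    x, delta = backtracking_gd_step(f, grad, x)
```

Note that at a critical point the gradient vanishes and the `while` test fails immediately, so the loop terminates with the initial `delta0`; the boundedness of the accepted $\delta_n$ near a non-degenerate critical point is exactly what the abstract above quantifies.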




Read also

132 - Tuyen Trung Truong 2019
Let $z=(x,y)$ be coordinates for the product space $\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$. Let $f:\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\rightarrow \mathbb{R}$ be a $C^1$ function, and $\nabla f=(\partial _xf,\partial _yf)$ its gradient. Fix $0<\alpha <1$. For a point $(x,y)\in \mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$, a number $\delta >0$ satisfies Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} f(x-\delta \partial _xf,y-\delta \partial _yf)-f(x,y)\leq -\alpha \delta (||\partial _xf||^2+||\partial _yf||^2). \end{eqnarray*} When $f(x,y)=f_1(x)+f_2(y)$ is a coordinate-wise sum map, we propose the following {\bf coordinate-wise} Armijo's condition. Fix again $0<\alpha <1$. A pair of positive numbers $\delta _1,\delta _2>0$ satisfies the coordinate-wise variant of Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} [f_1(x-\delta _1\nabla f_1(x))+f_2(y-\delta _2\nabla f_2(y))]-[f_1(x)+f_2(y)]\leq -\alpha (\delta _1||\nabla f_1(x)||^2+\delta _2||\nabla f_2(y)||^2). \end{eqnarray*} We then extend the results of our recent previous papers, on Backtracking Gradient Descent and some variants, to this setting. We show by an example the advantage of using the coordinate-wise Armijo's condition over the usual Armijo's condition.
51 - Tuyen Trung Truong 2020
Let $z=(x,y)$ be coordinates for the product space $\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$. Let $f:\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\rightarrow \mathbb{R}$ be a $C^1$ function, and $\nabla f=(\partial _xf,\partial _yf)$ its gradient. Fix $0<\alpha <1$. For a point $(x,y)\in \mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$, a number $\delta >0$ satisfies Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} f(x-\delta \partial _xf,y-\delta \partial _yf)-f(x,y)\leq -\alpha \delta (||\partial _xf||^2+||\partial _yf||^2). \end{eqnarray*} In a previous paper, we proposed the following {\bf coordinate-wise} Armijo's condition. Fix again $0<\alpha <1$. A pair of positive numbers $\delta _1,\delta _2>0$ satisfies the coordinate-wise variant of Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} [f(x-\delta _1\partial _xf(x,y), y-\delta _2\partial _y f(x,y))]-[f(x,y)]\leq -\alpha (\delta _1||\partial _xf(x,y)||^2+\delta _2||\partial _yf(x,y)||^2). \end{eqnarray*} Previously we applied this condition to functions of the form $f(x,y)=f(x)+g(y)$, and proved various convergence results for them. For a general function, it is crucial - for being able to do real computations - to have a systematic algorithm for obtaining $\delta _1$ and $\delta _2$ satisfying the coordinate-wise version of Armijo's condition, much like Backtracking for the usual Armijo's condition. In this paper we propose such an algorithm, and prove corresponding convergence results. We then analyse and present experimental results for some functions such as $f(x,y)=a|x|+y$ (given by Asl and Overton in connection with Wolfe's method), $f(x,y)=x^3 \sin (1/x) + y^3 \sin (1/y)$ and Rosenbrock's function.
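A naive way to search for a pair $(\delta_1,\delta_2)$ satisfying the coordinate-wise inequality is to shrink both factors together, in the style of ordinary backtracking. This is only a sketch under that simplifying assumption; the paper's actual algorithm for choosing $\delta_1,\delta_2$ may be more refined.

```python
import numpy as np

def coordinatewise_armijo(f, gx, gy, x, y, alpha=0.5, beta=0.5, d0=1.0):
    """Sketch: shrink delta1, delta2 by beta until the coordinate-wise
    Armijo inequality
    f(x - d1*gx, y - d2*gy) - f(x, y)
        <= -alpha*(d1*||gx||^2 + d2*||gy||^2)
    holds. Here gx, gy are the partial gradients at (x, y)."""
    d1 = d2 = d0
    base = f(x, y)
    while (f(x - d1 * gx, y - d2 * gy) - base
           > -alpha * (d1 * np.dot(gx, gx) + d2 * np.dot(gy, gy))):
        d1 *= beta
        d2 *= beta
    return d1, d2

# Illustrative usage on f(x, y) = ||x||^2 + 10*||y||^2.
f2 = lambda x, y: float(np.dot(x, x) + 10 * np.dot(y, y))
x0, y0 = np.array([1.0]), np.array([1.0])
gx, gy = 2 * x0, 20 * y0
d1, d2 = coordinatewise_armijo(f2, gx, gy, x0, y0)
```

Shrinking both rates in lockstep forfeits the main advantage of the coordinate-wise condition (letting badly-scaled coordinates keep different rates), which is presumably what the paper's systematic algorithm addresses.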
We consider non-convex stochastic optimization problems where the objective functions have super-linearly growing and discontinuous stochastic gradients. In such a setting, we provide a non-asymptotic analysis for the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al. (2021). In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wasserstein-1 and Wasserstein-2 distances. The latter result enables us to further derive non-asymptotic estimates for the expected excess risk. To illustrate the applicability of the main results, we consider an example from transfer learning with ReLU neural networks, which represents a key paradigm in machine learning. Numerical experiments are presented for the aforementioned example which support our theoretical findings. Hence, in this setting, we demonstrate both theoretically and numerically that the TUSLA algorithm can solve the optimization problem involving neural networks with ReLU activation function. In addition, we provide simulation results for synthetic examples where popular algorithms, e.g. ADAM, AMSGrad, RMSProp, and (vanilla) SGD, may fail to find the minimizer of the objective functions due to the super-linear growth and the discontinuity of the corresponding stochastic gradient, while the TUSLA algorithm converges rapidly to the optimal solution.
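The key idea of taming can be sketched in a few lines: the stochastic gradient is divided by a norm-dependent factor so that super-linear growth cannot blow up the iterate. The taming factor below follows the general form used in Lovas et al. (2021), but the parameter values `lam`, `beta`, `r` and the test gradient are illustrative, not taken from that paper.

```python
import numpy as np

def tusla_step(theta, stoch_grad, lam=1e-3, beta=1e8, r=1, rng=None):
    """One tamed unadjusted stochastic Langevin (TUSLA-style) step:
    the stochastic gradient is 'tamed' by 1 + sqrt(lam)*||theta||^(2r),
    then Gaussian noise of scale sqrt(2*lam/beta) is added."""
    rng = rng or np.random.default_rng(0)
    g = stoch_grad(theta)
    tamed = g / (1.0 + np.sqrt(lam) * np.linalg.norm(theta) ** (2 * r))
    noise = np.sqrt(2.0 * lam / beta) * rng.standard_normal(theta.shape)
    return theta - lam * tamed + noise

# Illustrative run on a super-linearly growing gradient, grad ||t||^4 / 4.
rng = np.random.default_rng(1)
sg = lambda t: 4.0 * t * np.dot(t, t)
theta = np.array([10.0, 10.0])
for _ in range(500):
    theta = tusla_step(theta, sg, rng=rng)
```

With an untamed step of the same size, this gradient would cause the iterate to diverge from the same starting point; the taming factor keeps each update bounded.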
85 - Tuyen Trung Truong 2020
In unconstrained optimisation on a Euclidean space, to prove convergence of Gradient Descent processes (GD) $x_{n+1}=x_n-\delta _n\nabla f(x_n)$ it is usually required that the learning rates $\delta _n$ are bounded: $\delta _n\leq \delta $ for some positive $\delta $. Under this assumption, if the sequence $x_n$ converges to a critical point $z$, then for large values of $n$ the update will be small because $||x_{n+1}-x_n||\lesssim ||\nabla f(x_n)||$. This may also force the sequence to converge to a bad minimum. If we can allow, at least theoretically, that the learning rates $\delta _n$ are not bounded, then we may have better convergence to better minima. A previous joint paper by the author showed convergence for the usual version of Backtracking GD under very general assumptions on the cost function $f$. In this paper, we allow the learning rates $\delta _n$ to be unbounded, in the sense that there is a function $h:(0,\infty )\rightarrow (0,\infty )$ such that $\lim _{t\rightarrow 0}th(t)=0$ and $\delta _n\lesssim \max \{h(x_n),\delta \}$ satisfies Armijo's condition for all $n$, and prove convergence under the same assumptions as in the mentioned paper. It will be shown that this growth rate of $h$ is best possible if one wants convergence of the sequence $\{x_n\}$. A specific way of choosing $\delta _n$ in a discrete way connects to Two-way Backtracking GD, defined in the mentioned paper. We provide some results which either improve or are implicitly contained in those of the mentioned paper and another recent paper on avoidance of saddle points.
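An unbounded variant of the backtracking step can be sketched by starting the line search from a cap that grows as the gradient shrinks. We take $h(t)=t^{-1/2}$ applied to the gradient norm, which satisfies $\lim_{t\rightarrow 0} t\,h(t)=0$ as required above; this particular $h$, and applying it to $||\nabla f(x_n)||$, are our illustrative choices, not the paper's.

```python
import numpy as np

def unbounded_backtracking_step(f, grad_f, x, alpha=0.5, beta=0.5, delta=1.0):
    """Unbounded Backtracking GD sketch: start the line search from
    max{h(||grad||), delta} with h(t) = t**(-1/2), so the allowed
    learning rate grows as the gradient norm shrinks, then shrink
    by beta until Armijo's condition holds."""
    g = grad_f(x)
    gn = np.linalg.norm(g)
    if gn == 0:
        return x  # already at a critical point
    d = max(gn ** -0.5, delta)
    while f(x - d * g) - f(x) > -alpha * d * gn ** 2:
        d *= beta
    return x - d * g

# Illustrative run on f(x) = ||x||^2.
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x
x = np.array([3.0, 4.0])
for _ in range(50):
    x = unbounded_backtracking_step(f, grad, x)
```

The abstract's result then says: if such a sequence converges to a non-degenerate critical point, the accepted $\delta_n$ nonetheless stay bounded, so unboundedness can only manifest near degenerate critical points.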
In this paper, the problem of safe global maximization (not to be confused with robust optimization) of expensive noisy black-box functions satisfying the Lipschitz condition is considered. The notion "safe" means that the objective function $f(x)$ during optimization should not violate a safety threshold, for instance, a certain a priori given value $h$ in a maximization problem. Thus, any new function evaluation (possibly corrupted by noise) must be performed at safe points only, namely, at points $y$ for which it is known that the objective function $f(y) > h$. The main difficulty here consists in the fact that the optimization algorithm used should ensure that the safety constraint will be satisfied at a point $y$ before the evaluation of $f(y)$ is executed. Thus, it is required both to determine the safe region $\Omega$ within the search domain $D$ and to find the global maximum within $\Omega$. An additional difficulty consists in the fact that these problems should be solved in the presence of noise. This paper starts with a theoretical study of the problem, and it is shown that even though the objective function $f(x)$ satisfies the Lipschitz condition, traditional Lipschitz minorants and majorants cannot be used due to the presence of the noise. Then, a $\delta$-Lipschitz framework and two algorithms using it are proposed to solve the safe global maximization problem. The first method determines the safe area within the search domain and the second one executes the global maximization over the found safe region. For both methods a number of theoretical results related to their functioning and convergence are established. Finally, numerical experiments confirming the reliability of the proposed procedures are performed.
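The basic certificate underlying such safe methods can be illustrated with a Lipschitz minorant: a candidate $y$ is provably safe if some already-evaluated point guarantees $f(y) > h$. This is a simplified one-dimensional sketch with a noise margin, not the paper's $\delta$-Lipschitz framework itself; `L`, `h` and `noise_bound` are assumed known here.

```python
import numpy as np

def certified_safe(y, xs, fs, L, h, noise_bound=0.0):
    """Sketch of a Lipschitz safety certificate in 1D: y is certified
    safe (f(y) > h) if for some evaluated point x_i the minorant
    fs[i] - noise_bound - L*|y - x_i| already exceeds the threshold h,
    where fs[i] is the (possibly noisy) observed value at x_i."""
    xs = np.asarray(xs, dtype=float)
    fs = np.asarray(fs, dtype=float)
    minorants = fs - noise_bound - L * np.abs(xs - y)
    return bool(np.any(minorants > h))

# Illustrative: one evaluation at x=0 with observed value 2, L=1, h=1.
safe_near = certified_safe(0.5, [0.0], [2.0], L=1.0, h=1.0)
safe_far = certified_safe(3.0, [0.0], [2.0], L=1.0, h=1.0)
```

The abstract's point is that under noise a plain minorant like this is not valid, which is what motivates replacing the true Lipschitz constant by the $\delta$-Lipschitz machinery.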
