A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning

Published by Nhan Pham

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan

Machine Learning, Optimization and Control

We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator for policy optimization. The hybrid policy gradient estimator is shown to be biased but to have a variance-reduction property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem that allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm, then introduce a more practical restarting variant. We prove that both algorithms can achieve the best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-3}\right)$ to attain a first-order stationary point for the composite problem, which is better than the existing REINFORCE/GPOMDP complexity $\mathcal{O}\left(\varepsilon^{-4}\right)$ and the SVRPG complexity $\mathcal{O}\left(\varepsilon^{-10/3}\right)$ in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite settings indeed have some advantages compared to the non-composite ones on certain problems.
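To make the idea of a hybrid estimator followed by a proximal step more concrete, here is a minimal sketch on a toy problem. It is not the authors' ProxHSPGA: the quadratic objective, the $\ell_1$ regularizer, the mixing weight `beta`, the step size `eta`, and all function names are assumptions chosen for illustration. In the actual policy-gradient setting, the SARAH-style difference term would also need importance weights, because trajectories are sampled under the current policy only; that part is omitted here.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (one possible regularizer)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def hybrid_estimate(v_prev, g_new, g_old, u_new, beta):
    """Mix a SARAH-style recursive term with a fresh unbiased gradient.

    g_new, g_old: stochastic gradients at the new and previous iterate,
                  evaluated on the same sample.
    u_new:        an independent unbiased stochastic gradient at the new iterate.
    """
    return beta * (v_prev + g_new - g_old) + (1.0 - beta) * u_new


def run_prox_hybrid(mu, lam=0.5, eta=0.1, beta=0.9, noise_std=0.1,
                    steps=500, seed=0):
    """Toy problem (an assumption, not from the paper):
    minimize E[0.5 * ||theta - xi||^2] + lam * ||theta||_1, xi ~ N(mu, noise_std^2 I),
    so a stochastic gradient of the smooth part is theta - xi.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)

    def sample():
        return mu + noise_std * rng.standard_normal(mu.shape)

    theta = np.zeros_like(mu)
    # Initial estimate: plain mini-batch average of unbiased gradients.
    v = np.mean([theta - sample() for _ in range(16)], axis=0)
    for _ in range(steps):
        # Proximal gradient step using the current estimate v.
        theta_next = soft_threshold(theta - eta * v, eta * lam)
        xi, zeta = sample(), sample()  # two independent samples
        v = hybrid_estimate(v,
                            g_new=theta_next - xi,
                            g_old=theta - xi,
                            u_new=theta_next - zeta,
                            beta=beta)
        theta = theta_next
    return theta
```

For this toy objective the exact minimizer is `soft_threshold(mu, lam)`, so calling `run_prox_hybrid(np.array([2.0, -1.5, 0.2]))` should return a vector close to `[1.5, -1.0, 0.0]`, with the third coordinate driven to zero by the regularizer.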

Samuele Tosatto, João Carvalho, Jan Peters, 2020

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and gives access to an estimate of the policy gradient. In this way, we avoid the high variance of importance sampling approaches and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.

Machine Learning, Artificial Intelligence

Policy Gradient using Weak Derivatives for Reinforcement Learning

Sujay Bhatt, Alec Koppel, Vikram Krishnamurthy, 2020

This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be $O(1/\sqrt{k})$; (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that of estimates obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment show superior performance of the proposed algorithm.
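For reference, the score-function approach that the abstract contrasts with weak derivatives can be sketched in a one-step setting, where the Q-function reduces to the immediate reward. The Gaussian policy, the quadratic reward, and the function names below are assumptions for illustration only; the sketch does not implement the weak-derivative estimator proposed in the paper.

```python
import numpy as np


def score_function_gradient(mu, sigma, reward_fn, n_samples, seed=0):
    """Monte Carlo estimate of d/dmu E_{a ~ N(mu, sigma^2)}[reward_fn(a)]
    using the score function d/dmu log N(a; mu, sigma^2) = (a - mu) / sigma^2.
    """
    rng = np.random.default_rng(seed)
    actions = mu + sigma * rng.standard_normal(n_samples)
    score = (actions - mu) / sigma**2
    return float(np.mean(reward_fn(actions) * score))


def quadratic_reward(a):
    """Hypothetical reward, maximized at a = 2."""
    return -(a - 2.0) ** 2
```

With this reward, the expected return is $-((\mu - 2)^2 + \sigma^2)$, so the exact gradient with respect to $\mu$ is $-2(\mu - 2)$. An estimate at $\mu = 0$, $\sigma = 1$ should therefore land near 4, and the spread of repeated estimates across seeds gives a feel for the variance that lower-variance estimators aim to reduce.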

Machine Learning, Multiagent Systems, Systems and Control

Joint Optimization of Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm

Qinbo Bai, Mridul Agarwal, Vaneet Aggarwal, 2021

Many engineering problems have multiple objectives, and the overall aim is to optimize a non-linear function of these objectives. In this paper, we formulate the problem of maximizing a non-linear concave function of multiple long-term objectives. A policy-gradient based model-free algorithm is proposed for the problem. To compute an estimate of the gradient, a biased estimator is proposed. The proposed algorithm is shown to achieve convergence to within $\epsilon$ of the global optimum after sampling $\mathcal{O}\left(\frac{M^4\sigma^2}{(1-\gamma)^8\epsilon^4}\right)$ trajectories, where $\gamma$ is the discount factor and $M$ is the number of agents, thus achieving the same dependence on $\epsilon$ as the policy gradient algorithm for standard reinforcement learning.

Machine Learning, Artificial Intelligence, Systems and Control

Stochastic Gradient Descent for Nonconvex Learning without Bounded Gradient Assumptions

Yunwen Lei, Ting Hu, Guiying Li, 2019

Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require a nontrivial assumption on the uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this paper, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex objective functions and gradient-dominated objective functions. Linear convergence is further derived in the zero-variance case.
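As a small illustration of the setting, the sketch below runs SGD with a diminishing step size on a one-dimensional nonconvex objective whose gradient grows without bound as $|x|$ increases. The objective $f(x) = x^2 + 3\sin^2(x)$, the additive Gaussian noise model, and the step-size schedule are assumptions chosen for illustration; they are not taken from the paper and do not reproduce its analysis.

```python
import numpy as np


def grad_f(x):
    """Gradient of f(x) = x^2 + 3 sin^2(x).

    f is nonconvex (its second derivative 2 + 6 cos(2x) changes sign),
    and its only stationary point is the global minimizer x = 0.
    """
    return 2.0 * x + 3.0 * np.sin(2.0 * x)


def sgd(x0, steps=2000, eta0=0.1, decay=0.01, noise_std=1.0, seed=0):
    """Plain SGD with step size eta_t = eta0 / (1 + decay * t)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    for t in range(steps):
        eta = eta0 / (1.0 + decay * t)                      # diminishing step size
        g = grad_f(x) + noise_std * rng.standard_normal()   # noisy gradient
        x -= eta * g
    return float(x)
```

Starting from `sgd(3.0)`, the iterate passes through the region where the objective has negative curvature and should settle close to the minimizer at 0, with residual fluctuations that shrink as the step size decays.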

Machine Learning, Optimization and Control

Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme

K.E. Avrachenkov, V.S. Borkar, H.P. Dolhare, 2021

We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for ordinary differential equation) approach and point out certain theoretical issues. We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis and compare it with the original scheme on sample problems. We observe a better performance for FG-DQN.
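The difference between a semi-gradient update and a full gradient of the squared Bellman error can be sketched for a single transition with a linear Q-function. This is an illustrative assumption about what "full gradient" refers to, not the paper's exact FG-DQN scheme; the feature vectors, the function name, and the loss $\tfrac{1}{2}\delta^2$ are all hypothetical choices.

```python
import numpy as np


def bellman_error_gradients(w, phi_sa, phi_next, reward, gamma):
    """Gradients of 0.5 * delta^2 for one transition with Q(s, a) = phi(s, a) @ w.

    phi_sa:   feature vector of the visited state-action pair, shape (d,)
    phi_next: features of the next state for each action, shape (n_actions, d)
    Returns (delta, semi_gradient, full_gradient).
    """
    q_sa = phi_sa @ w
    q_next = phi_next @ w
    a_star = int(np.argmax(q_next))                  # greedy action in the next state
    delta = reward + gamma * q_next[a_star] - q_sa   # Bellman (TD) error

    # Semi-gradient: the target is treated as a constant.
    semi = -delta * phi_sa
    # Full gradient: also differentiate through the target term.
    full = delta * (gamma * phi_next[a_star] - phi_sa)
    return delta, semi, full
```

A descent step would then be `w -= step_size * semi` or `w -= step_size * full`. The two directions differ by the term `delta * gamma * phi_next[a_star]`, which is the contribution of the target that the semi-gradient update ignores.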

Machine Learning, Optimization and Control, Probability