Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

206 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Samuele Tosatto

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Samuele Tosatto - Jo~ao Carvalho - Jan Peters

التعلم الآلي الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.

قيم البحث

348 - Samuele Tosatto , Joao Carvalho , Hany Abdulsamad 2020

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient algorithms th at perform updates using on-policy samples. The price of such inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. With nonparametric regression and density estimation methods we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function, and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate to show that it is consistent under mild smoothness assumptions and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.

التعلم الآلي التعلم الالي

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

295 - Natasha Jaques , Asma Ghandeharioun , Judy Hanwen Shen 2019

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Representations for Stable Off-Policy Reinforcement Learning

322 - Dibya Ghosh , Marc G. Bellemare 2020

Reinforcement learning with function approximation can be unstable and even divergent, especially when combined with off-policy learning and Bellman updates. In deep reinforcement learning, these issues have been dealt with empirically by adapting an d regularizing the representation, in particular with auxiliary tasks. This suggests that representation learning may provide a means to guarantee stability. In this paper, we formally show that there are indeed nontrivial state representations under which the canonical TD algorithm is stable, even when learning off-policy. We analyze representation learning schemes that are based on the transition matrix of a policy, such as proto-value functions, along three axes: approximation error, stability, and ease of estimation. In the most general case, we show that a Schur basis provides convergence guarantees, but is difficult to estimate from samples. For a fixed reward function, we find that an orthogonal basis of the corresponding Krylov subspace is an even better choice. We conclude by empirically demonstrating that these stable representations can be learned using stochastic gradient descent, opening the door to improved techniques for representation learning with deep networks.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Off-Policy Reinforcement Learning with Delayed Rewards

273 - Beining Han , Zhizhou Ren , Zuofan Wu 2021

We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.

التعلم الآلي

On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

185 - Chenjun Xiao , Ilbin Lee , Bo Dai 2021

We study the fundamental question of the sample complexity of learning a good policy in finite Markov decision processes (MDPs) when the data available for learning is obtained by following a logging policy that must be chosen without knowledge of th e underlying MDP. Our main results show that the sample complexity, the minimum number of transitions necessary and sufficient to obtain a good policy, is an exponential function of the relevant quantities when the planning horizon $H$ is finite. In particular, we prove that the sample complexity of obtaining $epsilon$-optimal policies is at least $Omega(mathrm{A}^{min(mathrm{S}-1, H+1)})$ for $gamma$-discounted problems, where $mathrm{S}$ is the number of states, $mathrm{A}$ is the number of actions, and $H$ is the effective horizon defined as $H=lfloor tfrac{ln(1/epsilon)}{ln(1/gamma)} rfloor$; and it is at least $Omega(mathrm{A}^{min(mathrm{S}-1, H)}/varepsilon^2)$ for finite horizon problems, where $H$ is the planning horizon of the problem. This lower bound is essentially matched by an upper bound. For the average-reward setting we show that there is no algorithm finding $epsilon$-optimal policies with a finite amount of data.

التعلم الآلي الذكاء الاصطناعي