Active Reinforcement Learning over MDPs

85 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Qi Yang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Qi Yang - Peng Yang - Ke Tang

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The past decade has seen the rapid development of Reinforcement Learning, which acquires impressive performance with numerous training resources. However, one of the greatest challenges in RL is generalization efficiency (i.e., generalization performance in a unit time). This paper proposes a framework of Active Reinforcement Learning (ARL) over MDPs to improve generalization efficiency in a limited resource by instance selection. Given a number of instances, the algorithm chooses out valuable instances as training sets while training the policy, thereby costing fewer resources. Unlike existing approaches, we attempt to actively select and use training data rather than train on all the given data, thereby costing fewer resources. Furthermore, we introduce a general instance evaluation metrics and selection mechanism into the framework. Experiments results reveal that the proposed framework with Proximal Policy Optimization as policy optimizer can effectively improve generalization efficiency than unselect-ed and unbiased selected methods.

قيم البحث

142 - Jiafan He , Dongruo Zhou , Quanquan Gu 2020

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$gamma$, which is based on the emph{optimism in the face of uncertainty principle} a nd the Bernstein-type bonus. We show that UCBVI-$gamma$ achieves an $tilde{O}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $tilde{Omega}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$gamma$ is nearly minimax optimal for discounted MDPs.

التعلم الآلي التحسين والتحكم التعلم الالي

Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

88 - Christoph Dann , Yishay Mansour , Mehryar Mohri 2021

There have been many recent advances on provably efficient Reinforcement Learning (RL) in problems with rich observation spaces. However, all these works share a strong realizability assumption about the optimal value function of the true MDP. Such r ealizability assumptions are often too strong to hold in practice. In this work, we consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP. Specifically, our algorithm enjoys a sample complexity bound of $widetilde{O}left((H^{4d} K^{3d} log |Pi|)/epsilon^2right)$ where $H$ is the length of episodes, $K$ is the number of actions and $epsilon>0$ is the desired sub-optimality. We also provide a nearly matching lower bound for this agnostic setting that shows that the exponential dependence on rank is unavoidable, without further assumptions.

التعلم الآلي الذكاء الاصطناعي أنظمة وتحكم

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

135 - Dongruo Zhou , Jiafan He , Quanquan Gu 2020

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement lear ning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrt{T}/(1-gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing the generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by $Omega(dsqrt{T}/(1-gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-gamma)^{-0.5}$ factor.

التعلم الآلي الذكاء الاصطناعي التحسين والتحكم

Reinforcement Learning through Active Inference

114 - Alexander Tschantz , Beren Millidge , Anil K. Seth 2020

The central tenet of reinforcement learning (RL) is that agents seek to maximize the sum of cumulative rewards. In contrast, active inference, an emerging framework within cognitive and computational neuroscience, proposes that agents act to maximize the evidence for a biased generative model. Here, we illustrate how ideas from active inference can augment traditional RL approaches by (i) furnishing an inherent balance of exploration and exploitation, and (ii) providing a more flexible conceptualization of reward. Inspired by active inference, we develop and implement a novel objective for decision making, which we term the free energy of the expected future. We demonstrate that the resulting algorithm successfully balances exploration and exploitation, simultaneously achieving robust performance on several challenging RL benchmarks with sparse, well-shaped, and no rewards.

التعلم الآلي الذكاء الاصطناعي نظرية المعلومات

Safety-Constrained Reinforcement Learning for MDPs

56 - Sebastian Junges , Nils Jansen , Christian Dehnert 2015

We consider controller synthesis for stochastic and partially unknown environments in which safety is essential. Specifically, we abstract the problem as a Markov decision process in which the expected performance is measured using a cost function th at is unknown prior to run-time exploration of the state space. Standard learning approaches synthesize cost-optimal strategies without guaranteeing safety properties. To remedy this, we first compute safe, permissive strategies. Then, exploration is constrained to these strategies and thereby meets the imposed safety requirements. Exploiting an iterative learning procedure, the resulting policy is safety-constrained and optimal. We show correctness and completeness of the method and discuss the use of several heuristics to increase its scalability. Finally, we demonstrate the applicability by means of a prototype implementation.

هندسة البرمجيات أنظمة وتحكم