Omega-Regular Objectives in Model-Free Reinforcement Learning

352 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ashutosh Trivedi

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ernst Moritz Hahn - Mateo Perez - Sven Schewe

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We provide the first solution for model-free reinforcement learning of {omega}-regular objectives for Markov decision processes (MDPs). We present a constructive reduction from the almost-sure satisfaction of {omega}-regular objectives to an almost- sure reachability problem and extend this technique to learning how to control an unknown model so that the chance of satisfying the objective is maximized. A key feature of our technique is the compilation of {omega}-regular properties into limit- deterministic Buechi automata instead of the traditional Rabin automata; this choice sidesteps difficulties that have marred previous proposals. Our approach allows us to apply model-free, off-the-shelf reinforcement learning algorithms to compute optimal strategies from the observations of the MDP. We present an experimental evaluation of our technique on benchmark learning problems.

قيم البحث

83 - E. M. Hahn , M. Perez , S. Schewe 2020

Recently, successful approaches have been made to exploit good-for-MDPs automata (Buchi automata with a restricted form of nondeterminism) for model free reinforcement learning, a class of automata that subsumes good for games automata and the most w idespread class of limit deterministic automata. The foundation of using these Buchi automata is that the Buchi condition can, for good-for-MDP automata, be translated to reachability. The drawback of this translation is that the rewards are, on average, reaped very late, which requires long episodes during the learning process. We devise a new reward shaping approach that overcomes this issue. We show that the resulting model is equivalent to a discounted payoff objective with a biased discount that simplifies and improves on prior work in this direction.

المنطق في علوم الحاسوب التعلم الآلي

Tree games with regular objectives

169 - Marcin Przyby{l}ko 2014

We study tree games developed recently by Matteo Mio as a game interpretation of the probabilistic $mu$-calculus. With expressive power comes complexity. Mio showed that tree games are able to encode Blackwell games and, consequently, are not determi ned under deterministic strategies. We show that non-stochastic tree games with objectives recognisable by so-called game automata are determined under deterministic, finite memory strategies. Moreover, we give an elementary algorithmic procedure which, for an arbitrary regular language L and a finite non-stochastic tree game with a winning objective L decides if the game is determined under deterministic strategies.

المنطق في علوم الحاسوب اللغات الرسمية ونظرية الأتومات علوم الكمبيوتر ونظرية الألعاب

Robotic Table Tennis with Model-Free Reinforcement Learning

110 - Wenbo Gao , Laura Graesser , Krzysztof Choromanski 2020

We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures fo r non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned curriculum learning on the task and rewards, policies are capable of developing multi-modal styles, specifically forehand and backhand stroke, whilst achieving 80% return rate on a wide range of ball throws. We observe that multi-modality does not require any architectural priors, such as multi-head architectures or hierarchical policies.

التعلم الآلي علم الروبوتات التعلم الالي

Topology and Ambiguity in Omega Context Free Languages

246 - Olivier Finkel 2008

We study the links between the topological complexity of an omega context free language and its degree of ambiguity. In particular, using known facts from classical descriptive set theory, we prove that non Borel omega context free languages which ar e recognized by Buchi pushdown automata have a maximum degree of ambiguity. This result implies that degrees of ambiguity are really not preserved by the operation of taking the omega power of a finitary context free language. We prove also that taking the adherence or the delta-limit of a finitary language preserves neither unambiguity nor inherent ambiguity. On the other side we show that methods used in the study of omega context free languages can also be applied to study the notion of ambiguity in infinitary rational relations accepted by Buchi 2-tape automata and we get first results in that direction.

المنطق في علوم الحاسوب منطق

Learning Human Objectives by Evaluating Hypothetical Behavior

113 - Siddharth Reddy , Anca D. Dragan , Sergey Levine 2019

We seek to align agent behavior with a users objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive . To address this challenge, we propose an algorithm that safely and interactively learns a model of the users reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.

أجهزة الكمبيوتر والمجتمع التعلم الآلي التعلم الالي