
Mungojerrie: Reinforcement Learning of Linear-Time Objectives

Posted by Mateo Perez
Publication date: 2021
Research field: Computer Science
Paper language: English

Reinforcement learning synthesizes controllers without prior knowledge of the system. At each timestep, a reward is given. The controllers optimize the discounted sum of these rewards. Applying this class of algorithms requires designing a reward scheme, which is typically done manually. The designer must ensure that their intent is accurately captured. This may not be trivial, and is prone to error. An alternative to this manual programming, akin to programming directly in assembly, is to specify the objective in a formal language and have it compiled to a reward scheme. Mungojerrie (https://plv.colorado.edu/mungojerrie/) is a tool for testing reward schemes for $\omega$-regular objectives on finite models. The tool contains reinforcement learning algorithms and a probabilistic model checker. Mungojerrie supports models specified in PRISM and $\omega$-automata specified in HOA.
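
To give a rough sense of the "compile an objective to a reward scheme" idea, the Python sketch below runs tabular Q-learning on the product of a toy two-state MDP with a toy two-state automaton, issuing a reward whenever the automaton reaches its accepting state. This is an illustrative sketch only, not Mungojerrie's actual construction or API; the model, automaton, and hyperparameters are hypothetical.

import random

GAMMA, ALPHA, EPS, EPISODES, HORIZON = 0.99, 0.1, 0.1, 2000, 50
ACTIONS = ['a', 'b']

# Hypothetical 2-state MDP: action 'b' moves to state 1, action 'a' to state 0.
def mdp_step(s, a):
    return 1 if a == 'b' else 0

# Hypothetical 2-state automaton remembering whether MDP state 1 was ever
# visited; automaton state 1 is accepting.
def aut_step(q, s):
    return 1 if (q == 1 or s == 1) else 0

Q = {}  # Q[(mdp_state, automaton_state, action)] -> value estimate

def qval(s, q, a):
    return Q.get((s, q, a), 0.0)

for _ in range(EPISODES):
    s, q = 0, 0
    for _ in range(HORIZON):
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: qval(s, q, x))
        s2 = mdp_step(s, a)
        q2 = aut_step(q, s2)
        r = 1.0 if q2 == 1 else 0.0   # reward scheme derived from the objective
        target = r + GAMMA * max(qval(s2, q2, x) for x in ACTIONS)
        Q[(s, q, a)] = qval(s, q, a) + ALPHA * (target - qval(s, q, a))
        s, q = s2, q2

# Estimated discounted value of the initial product state under the learned policy.
print(max(qval(0, 0, x) for x in ACTIONS))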


Read also

We study reinforcement learning for the optimal control of Branching Markov Decision Processes (BMDPs), a natural extension of (multitype) Branching Markov Chains (BMCs). The state of a (discrete-time) BMC is a collection of entities of various types that, while spawning other entities, generate a payoff. In comparison with BMCs, where the evolution of each entity of the same type follows the same probabilistic pattern, BMDPs allow an external controller to pick from a range of options. This permits us to study the best/worst behaviour of the system. We generalise model-free reinforcement learning techniques to compute an optimal control strategy of an unknown BMDP in the limit. We present results of an implementation that demonstrate the practicality of the approach.
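
For intuition about the controlled process in the abstract above (not the paper's learning algorithm), here is a hedged Python sketch simulating a few generations of a hypothetical two-type branching MDP under a fixed per-type action choice; the types, actions, payoffs, and offspring distributions are made up for illustration.

import random

# Hypothetical two-type branching MDP: each (type, action) pair has a payoff
# and an offspring distribution given as (probability, offspring-list) pairs.
BMDP = {
    ('A', 'expand'): (1.0, [(0.5, ['A', 'B']), (0.5, [])]),
    ('A', 'hold'):   (0.5, [(1.0, ['A'])]),
    ('B', 'hold'):   (2.0, [(0.7, ['B']), (0.3, [])]),
}
policy = {'A': 'expand', 'B': 'hold'}   # controller's fixed action per type

def step(population):
    payoff, children = 0.0, []
    for t in population:
        pay, dist = BMDP[(t, policy[t])]
        payoff += pay
        # Sample one offspring outcome according to the distribution.
        offspring = random.choices([o for _, o in dist],
                                   weights=[p for p, _ in dist])[0]
        children.extend(offspring)
    return payoff, children

population, total = ['A'], 0.0
for _ in range(5):
    pay, population = step(population)
    total += pay
print(total, population)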
Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms. MetaGenRL uses off-policy second-order gradients during meta-training that greatly increase its sample efficiency.
Many reinforcement learning (RL) environments in practice feature enormous state spaces that may be described compactly by a factored structure and modeled by Factored Markov Decision Processes (FMDPs). We present the first polynomial-time algorithm for RL with FMDPs that does not rely on an oracle planner, and instead of requiring a linear transition model, only requires a linear value function with a suitable local basis with respect to the factorization. With this assumption, we can solve FMDPs in polynomial time by constructing an efficient separation oracle for convex optimization. Importantly, and in contrast to prior work, we do not assume that the transitions on various factors are independent.
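
As a small, hedged illustration of the representational assumption in the abstract above (a value function that is linear in a local basis per factor), rather than the paper's separation-oracle algorithm, consider the following sketch with hypothetical factors and placeholder weights.

import numpy as np

# Hypothetical factored state: two binary factors. The value function is linear
# in local indicator features of each factor plus a bias term.
def phi(state):
    x1, x2 = state
    return np.array([1.0,                 # bias
                     x1 == 0, x1 == 1,    # local basis for factor 1
                     x2 == 0, x2 == 1],   # local basis for factor 2
                    dtype=float)

w = np.array([0.0, 0.2, 1.0, -0.3, 0.5])  # placeholder weights, not learned here

def value(state):
    return float(w @ phi(state))

print(value((1, 0)))   # 1.0 + (-0.3) = 0.7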
This paper presents a method for learning logical task specifications and cost functions from demonstrations. Linear temporal logic (LTL) formulas are widely used to express complex objectives and constraints for autonomous systems. Yet, such specifications may be challenging to construct by hand. Instead, we consider demonstrated task executions, whose temporal logic structure and transition costs need to be inferred by an autonomous agent. We employ a spectral learning approach to extract a weighted finite automaton (WFA), approximating the unknown logic structure of the task. Thereafter, we define a product between the WFA for high-level task guidance and a labeled Markov decision process (L-MDP) for low-level control and optimize a cost function that matches the demonstrator's behavior. We demonstrate that our method is capable of generalizing the execution of the inferred task specification to new environment configurations.
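
To make the WFA object from the abstract above concrete, a label sequence is scored by multiplying per-label transition matrices between initial and final weight vectors. The tiny automaton below is a hypothetical hand-written example, not one learned spectrally from demonstrations.

import numpy as np

alpha = np.array([1.0, 0.0])    # initial weight vector
beta = np.array([0.0, 1.0])     # final weight vector
A = {                           # one transition matrix per observation label
    'safe': np.array([[0.9, 0.1], [0.0, 1.0]]),
    'goal': np.array([[0.0, 1.0], [0.0, 1.0]]),
}

def wfa_score(labels):
    v = alpha
    for label in labels:
        v = v @ A[label]        # propagate weights through the labeled transition
    return float(v @ beta)

print(wfa_score(['safe', 'safe', 'goal']))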
Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al.'s (2018) analysis of the C51 algorithm in terms of the Cramer distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramer distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramer-based and can be combined with linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramer-based distributional methods may perform worse than directly approximating the value function.
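
For reference, the squared Cramer distance between two distributions is the integral of the squared difference of their cumulative distribution functions. A minimal sketch for discrete distributions on a shared, evenly spaced support follows; the two distributions are hypothetical and chosen only to exercise the computation.

import numpy as np

z = np.linspace(-1.0, 1.0, 11)          # shared, evenly spaced support
p = np.full(len(z), 1.0 / len(z))       # uniform distribution over the atoms
q = np.zeros(len(z)); q[-1] = 1.0       # point mass on the largest atom

def cramer_sq(p, q, z):
    dz = z[1] - z[0]                    # uniform atom spacing
    # Sum of squared CDF differences, scaled by the spacing.
    return float(np.sum((np.cumsum(p) - np.cumsum(q)) ** 2) * dz)

print(cramer_sq(p, q, z))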
