Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability

142 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Dibya Ghosh

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Dibya Ghosh - Jad Rahme - Aviral Kumar

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world. In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the well-studied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we show that, perhaps surprisingly, this is not the case in RL. We show that generalization to unseen test conditions from a limited number of training conditions induces implicit partial observability, effectively turning even fully-observed MDPs into POMDPs. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemble-based technique for approximately solving the partially observed problem. Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves significant gains in generalization over current methods on the Procgen benchmark suite.

قيم البحث

316 - Justas Dauparas , Ryota Tomioka , 2018

The question of how to explore, i.e., take actions with uncertain outcomes to learn about possible future rewards, is a key question in reinforcement learning (RL). Here, we show a surprising result: We show that Q-learning with nonlinear Q-function and no explicit exploration (i.e., a purely greedy policy) can learn several standard benchmark tasks, including mountain car, equally well as, or better than, the most commonly-used $epsilon$-greedy exploration. We carefully examine this result and show that both the depth of the Q-network and the type of nonlinearity are important to induce such deterministic exploration.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

A Perspective on Objects and Systematic Generalization in Model-Based RL

179 - Sjoerd van Steenkiste , Klaus Greff , Jurgen Schmidhuber 2019

In order to meet the diverse challenges in solving many real-world problems, an intelligent agent has to be able to dynamically construct a model of its environment. Objects facilitate the modular reuse of prior knowledge and the combinatorial constr uction of such models. In this work, we argue that dynamically bound features (objects) do not simply emerge in connectionist models of the world. We identify several requirements that need to be fulfilled in overcoming this limitation and highlight corresponding inductive biases.

التعلم الآلي الذكاء الاصطناعي الحوسبة العصبية والتطورية

Why is quantum gravity so difficult (compared to QCD)?

120 - Hidenori Fukaya 2018

Gravity is difficult to quantize. This is a well-known fact but its reason is given simply by non-renormalizability of the Newton constant and little is discussed why among many quantum gauge theories, gravity is special. In this essay we try to trea t the gravity as one of many gauge theories, and discuss how it is special and why it is difficult to quantize.

الفيزياء عالية الطاقة - النظرية النسبية العامة وهدية الكونيات الكم فيزياء الطاقة العالية - شعرية

Many Agent Reinforcement Learning Under Partial Observability

118 - Keyang He , Prashant Doshi , Bikramjit Banerjee 2021

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity, which leads to permutation invariance of joint actions, to two recently presented deep MARL algorithms, MADDPG and IA2C, and compare these instantiations to another recent technique that leverages action anonymity, viz., mean-field MARL. We show that our instantiations can learn the optimal behavior in a broader class of agent networks than the mean-field method, using a recently introduced pragmatic domain.

التعلم الآلي الذكاء الاصطناعي أنظمة متعددة العملاء

Bilinear Classes: A Structural Framework for Provable Generalization in RL

133 - Simon S. Du , Sham M. Kakade , Jason D. Lee 2021

This work introduces Bilinear Classes, a new structural framework, which permit generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the optimal $Q$-function and the optimal $V$-function are linear in some known feature space. Our main result provides an RL algorithm which has polynomial sample complexity for Bilinear Classes; notably, this sample complexity is stated in terms of a reduction to the generalization error of an underlying supervised learning sub-problem. These bounds nearly match the best known sample complexity bounds for existing models. Furthermore, this framework also extends to the infinite dimensional (RKHS) setting: for the the Linear $Q^*/V^*$ model, linear MDPs, and linear mixture MDPs, we provide sample complexities that have no explicit dependence on the explicit feature dimension (which could be infinite), but instead depends only on information theoretic quantities.

التعلم الآلي الذكاء الاصطناعي التحسين والتحكم