Evaluating Agents without Rewards

110 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Danijar Hafner

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Brendon Matusch - Jimmy Ba - Danijar Hafner

التعلم الآلي الذكاء الاصطناعي علم الروبوتات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Reinforcement learning has enabled agents to solve challenging tasks in unknown environments. However, manually crafting reward functions can be time consuming, expensive, and error prone to human error. Competing objectives have been proposed for agents to learn without external supervision, but it has been unclear how well they reflect task rewards or human behavior. To accelerate the development of intrinsic objectives, we retrospectively compute potential objectives on pre-collected datasets of agent behavior, rather than optimizing them online, and compare them by analyzing their correlations. We study input entropy, information gain, and empowerment across seven agents, three Atari games, and the 3D game Minecraft. We find that all three intrinsic objectives correlate more strongly with a human behavior similarity metric than with task reward. Moreover, input entropy and information gain correlate more strongly with human similarity than task reward does, suggesting the use of intrinsic objectives for designing agents that behave similarly to human players.

قيم البحث

57 - Paul Knott , Micah Carroll , Sam Devlin 2021

In order for agents trained by deep reinforcement learning to work alongside humans in realistic settings, we will need to ensure that the agents are emph{robust}. Since the real world is very diverse, and human behavior often changes in response to agent deployment, the agent will likely encounter novel situations that have never been seen during training. This results in an evaluation challenge: if we cannot rely on the average training or validation reward as a metric, then how can we effectively evaluate robustness? We take inspiration from the practice of emph{unit testing} in software engineering. Specifically, we suggest that when designing AI agents that collaborate with humans, designers should search for potential edge cases in emph{possible partner behavior} and emph{possible states encountered}, and write tests which check that the behavior of the agent in these edge cases is reasonable. We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness. We find that the test suite provides significant insight into the effects of these proposals that were generally not revealed by looking solely at the average validation reward.

التعلم الآلي الذكاء الاصطناعي تفاعل الإنسان والحاسوب

Vizarel: A System to Help Better Understand RL Agents

70 - Shuby Deshpande , Jeff Schneider 2020

Visualization tools for supervised learning have allowed users to interpret, introspect, and gain intuition for the successes and failures of their models. While reinforcement learning practitioners ask many of the same questions, existing tools are not applicable to the RL setting. In this work, we describe our initial attempt at constructing a prototype of these ideas, through identifying possible features that such a system should encapsulate. Our design is motivated by envisioning the system to be a platform on which to experiment with interpretable reinforcement learning.

التعلم الآلي الذكاء الاصطناعي علم الروبوتات

Discriminator Soft Actor Critic without Extrinsic Rewards

128 - Daichi Nishio , Daiki Kuyoshi , Toi Tsuneda 2020

It is difficult to be able to imitate well in unknown states from a small amount of expert data and sampling data. Supervised learning methods such as Behavioral Cloning do not require sampling data, but usually suffer from distribution shift. The me thods based on reinforcement learning, such as inverse reinforcement learning and generative adversarial imitation learning (GAIL), can learn from only a few expert data. However, they often need to interact with the environment. Soft Q imitation learning addressed the problems, and it was shown that it could learn efficiently by combining Behavioral Cloning and soft Q-learning with constant rewards. In order to make this algorithm more robust to distribution shift, we propose Discriminator Soft Actor Critic (DSAC). It uses a reward function based on adversarial inverse reinforcement learning instead of constant rewards. We evaluated it on PyBullet environments with only four expert trajectories.

التعلم الآلي التعلم الالي

Programming by Rewards

76 - Nagarajan Natarajan , Ajaykrishna Karthikeyan , Prateek Jain 2020

We formalize and study ``programming by rewards (PBR), a new approach for specifying and synthesizing subroutines for optimizing some quantitative metric such as performance, resource utilization, or correctness over a benchmark. A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run), that assigns a reward for each execution. The goal of the synthesizer is to synthesize a decision function $f$ which transforms the features to a decision value for the black-box component so as to maximize the expected reward $E[r circ f (x)]$ for executing decisions $f(x)$ for various values of $x$. We consider a space of decision functions in a DSL of loop-free if-then-else programs, which can branch on linear functions of the input features in a tree-structure and compute a linear function of the inputs in the leaves of the tree. We find that this DSL captures decision functions that are manually written in practice by programmers. Our technical contribution is the use of continuous-optimization techniques to perform synthesis of such decision functions as if-then-else programs. We also show that the framework is theoretically-founded ---in cases when the rewards satisfy nice properties, the synthesized code is optimal in a precise sense. We have leveraged PBR to synthesize non-trivial decision functions related to search and ranking heuristics in the PROSE codebase (an industrial strength program synthesis framework) and achieve competitive results to manually written procedures over multiple man years of tuning. We present empirical evaluation against other baseline techniques over real-world case studies (including PROSE) as well on simple synthetic benchmarks.

التعلم الآلي الذكاء الاصطناعي لغات البرمجة

Evaluating Visual Conversational Agents via Cooperative Human-AI Games

110 - Prithvijit Chattopadhyay , Deshraj Yadav , Viraj Prabhu 2017

As AI continues to advance, human-AI teams are inevitable. However, progress in AI is routinely measured in isolation, without a human in the loop. It is crucial to benchmark progress in AI, not just in isolation, but also in terms of how it translat es to helping humans perform certain tasks, i.e., the performance of human-AI teams. In this work, we design a cooperative game - GuessWhich - to measure human-AI team performance in the specific context of the AI being a visual conversational agent. GuessWhich involves live interaction between the human and the AI. The AI, which we call ALICE, is provided an image which is unseen by the human. Following a brief description of the image, the human questions ALICE about this secret image to identify it from a fixed pool of images. We measure performance of the human-ALICE team by the number of guesses it takes the human to correctly identify the secret image after a fixed number of dialog rounds with ALICE. We compare performance of the human-ALICE teams for t

تفاعل الإنسان والحاسوب الذكاء الاصطناعي الحساب واللغة