Avoiding Tampering Incentives in Deep RL via Decoupled Approval

108 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jonathan Uesato

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jonathan Uesato - Ramana Kumar - Victoria Krakovna

التعلم الآلي الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism. We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure. For a natural class of corruption functions, decoupled approval algorithms have aligned incentives both at convergence and for their local updates. Empirically, they also scale to complex 3D environments where tampering is possible.

قيم البحث

146 - Lucas Beyer , Damien Vincent , Olivier Teboul 2019

An agent learning through interactions should balance its action selection process between probing the environment to discover new rewards and using the information acquired in the past to adopt useful behaviour. This trade-off is usually obtained by perturbing either the agents actions (e.g., e-greedy or Gibbs sampling) or the agents parameters (e.g., NoisyNet), or by modifying the reward it receives (e.g., exploration bonus, intrinsic motivation, or hand-shaped rewards). Here, we adopt a disruptive but simple and generic perspective, where we explicitly disentangle exploration and exploitation. Different losses are optimized in parallel, one of them coming from the true objective (maximizing cumulative rewards from the environment) and others being related to exploration. Every loss is used in turn to learn a policy that generates transitions, all shared in a single replay buffer. Off-policy methods are then applied to these transitions to optimize each loss. We showcase our approach on a hard-exploration environment, show its sample-efficiency and robustness, and discuss further implications.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

REALab: An Embedded Perspective on Tampering

120 - Ramana Kumar , Jonathan Uesato , Richard Ngo 2020

This paper describes REALab, a platform for embedded agency research in reinforcement learning (RL). REALab is designed to model the structure of tampering problems that may arise in real-world deployments of RL. Standard Markov Decision Process (MDP ) formulations of RL and simulated environments mirroring the MDP structure assume secure access to feedback (e.g., rewards). This may be unrealistic in settings where agents are embedded and can corrupt the processes producing feedback (e.g., human supervisors, or an implemented reward function). We describe an alternative Corrupt Feedback MDP formulation and the REALab environment platform, which both avoid the secure feedback assumption. We hope the design of REALab provides a useful perspective on tampering problems, and that the platform may serve as a unit test for the presence of tampering incentives in RL agent designs.

التعلم الآلي الذكاء الاصطناعي

Return-based Scaling: Yet Another Normalisation Trick for Deep RL

95 - Tom Schaul , Georg Ostrovski , Iurii Kemaev 2021

Scaling issues are mundane yet irritating for practitioners of reinforcement learning. Error scales vary across domains, tasks, and stages of learning; sometimes by many orders of magnitude. This can be detrimental to learning speed and stability, cr eate interference between learning tasks, and necessitate substantial tuning. We revisit this topic for agents based on temporal-difference learning, sketch out some desiderata and investigate scenarios where simple fixes fall short. The mechanism we propose requires neither tuning, clipping, nor adaptation. We validate its effectiveness and robustness on the suite of Atari games. Our scaling method turns out to be particularly helpful at mitigating interference, when training a shared neural network on multiple targets that differ in reward scale or discounting.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

134 - Lasse Espeholt , Hubert Soyer , Remi Munos 2018

In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distri buted agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.

التعلم الآلي الذكاء الاصطناعي

One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL

125 - Tom Le Paine , Sergio Gomez Colmenarejo , Ziyu Wang 2018

Humans are experts at high-fidelity imitation -- closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to bootstrap learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions.

التعلم الآلي الذكاء الاصطناعي الرؤية الحاسوبية وتمييز الأنماط