Proper Value Equivalence

Published by Christopher Grimm

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Christopher Grimm, Andre Barreto, Gregory Farquhar

Artificial intelligence, machine learning

Abstract

One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero and Muesli can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
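
The VE definition in the abstract can be illustrated on a tiny tabular problem. The following is a minimal sketch, not the authors' code: it assumes a finite MDP stored as nested lists, and the names `bellman`, `bellman_k`, `is_value_equivalent`, `env`, and `model` are hypothetical. It checks whether two models induce the same result after $k$ applications of the Bellman operator for each given policy and function.

```python
def bellman(mdp, policy, v, gamma=0.9):
    """Apply the Bellman operator of `mdp` under `policy` once to v.

    mdp is a pair (rewards, trans) with rewards[s][a] and trans[s][a][t];
    policy[s][a] is the probability of action a in state s.
    """
    rewards, trans = mdp
    n = len(v)
    return [
        sum(
            policy[s][a]
            * (rewards[s][a] + gamma * sum(trans[s][a][t] * v[t] for t in range(n)))
            for a in range(len(policy[s]))
        )
        for s in range(n)
    ]


def bellman_k(mdp, policy, v, k, gamma=0.9):
    """Apply the Bellman operator k times (the order-k case in the abstract)."""
    for _ in range(k):
        v = bellman(mdp, policy, v, gamma)
    return v


def is_value_equivalent(env, model, policies, functions, k=1, gamma=0.9, tol=1e-9):
    """True if env and model agree after k Bellman applications
    for every given policy and every given function."""
    return all(
        max(
            abs(x - y)
            for x, y in zip(
                bellman_k(env, pi, v, k, gamma), bellman_k(model, pi, v, k, gamma)
            )
        )
        <= tol
        for pi in policies
        for v in functions
    )


# Illustrative two-state, one-action example (made-up numbers).
env = ([[0.0], [1.0]], [[[0.5, 0.5]], [[0.0, 1.0]]])
model = ([[0.0], [1.0]], [[[1.0, 0.0]], [[0.0, 1.0]]])  # wrong transitions in state 0
policy = [[1.0], [1.0]]

print(is_value_equivalent(env, model, [policy], [[1.0, 1.0]]))  # constant function
print(is_value_equivalent(env, model, [policy], [[1.0, 0.0]]))  # non-constant function
```

In this example the model has incorrect transitions yet agrees with the environment on the constant function, because any transition distribution maps a constant function to the same constant. It disagrees on the non-constant function, which illustrates how adding functions shrinks the set of VE models.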

Read also

Value Iteration Networks

Aviv Tamar, Yi Wu, Garrett Thomas (2016)

We introduce the value iteration network (VIN): a fully differentiable neural network with a "planning module" embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN-based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
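
For readers unfamiliar with the algorithm that VIN approximates, here is a minimal sketch of classical tabular value iteration. It is not the VIN architecture and contains no neural network; the function name `value_iteration` and the example MDP are hypothetical.

```python
def value_iteration(rewards, trans, gamma=0.9, iters=500):
    """Classical value iteration on a finite MDP.

    rewards[s][a] is the immediate reward, trans[s][a][t] the probability
    of moving from state s to state t under action a.
    """
    n = len(rewards)
    v = [0.0] * n
    for _ in range(iters):
        # Bellman optimality backup: best one-step lookahead in every state.
        v = [
            max(
                rewards[s][a] + gamma * sum(trans[s][a][t] * v[t] for t in range(n))
                for a in range(len(rewards[s]))
            )
            for s in range(n)
        ]
    return v


# Illustrative example: state 1 pays reward 1 forever; state 0 can stay or move to state 1.
rewards = [[0.0, 0.0], [1.0]]
trans = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
values = value_iteration(rewards, trans)
print(values)  # approximately [9.0, 10.0] with gamma = 0.9
```

State 1 is worth 1 / (1 - 0.9) = 10, and state 0 is worth 0.9 times that because reaching state 1 takes one step with no reward.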

Artificial intelligence, machine learning, neural and evolutionary computing

Value Prediction Network

Junhyuk Oh, Satinder Singh, Honglak Lee (2017)

This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sum of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation.

Artificial intelligence, machine learning

Multi-Labelled Value Networks for Computer Go

Ti-Rong Wu, I-Chen Wu, Guan-Wun Chen (2017)

This paper proposes a novel value network architecture for the game Go, called a multi-labelled (ML) value network. In the ML value network, different values (win rates) are trained simultaneously for different settings of komi, a compensation given to balance the initiative of playing first. The ML value network has three advantages: (a) it outputs values for different komi, (b) it supports dynamic komi, and (c) it lowers the mean squared error (MSE). This paper also proposes a new dynamic komi method to improve game-playing strength, and performs experiments to demonstrate the merits of the architecture. First, the MSE of the ML value network is generally lower than that of the value network alone. Second, the program based on the ML value network wins at a rate of 67.6% against the program based on the value network alone. Third, the program with the proposed dynamic komi method significantly improves the playing strength over the baseline that does not use dynamic komi, especially for handicap games. To our knowledge, to date, no handicap games have been played openly by programs using value networks. This paper provides these programs with a useful approach to playing handicap games.

Artificial intelligence, machine learning

Homotopy equivalence for proper holomorphic mappings

John P. D'Angelo, Jiri Lebl (2014)

We introduce several homotopy equivalence relations for proper holomorphic mappings between balls. We provide examples showing that the degree of a rational proper mapping between balls (in positive codimension) is not a homotopy invariant. In domain dimension at least 2, we prove that the set of homotopy classes of rational proper mappings from a ball to a higher dimensional ball is finite. By contrast, when the target dimension is at least twice the domain dimension, it is well known that there are uncountably many spherical equivalence classes. We generalize this result by proving that an arbitrary homotopy of rational maps whose endpoints are spherically inequivalent must contain uncountably many spherically inequivalent maps. We introduce Whitney sequences, a precise analogue (in higher dimensions) of the notion of finite Blaschke product (in one dimension). We show that terms in a Whitney sequence are homotopic to monomial mappings, and we establish an additional result about the target dimensions of such homotopies.

Complex variables

Problems with Shapley-value-based explanations as feature importance measures

I. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger (2020)

Game-theoretic formulations of feature importance have become popular as a way to explain machine learning models. These methods define a cooperative game between the features of a model and distribute influence among these input elements using some form of the game's unique Shapley values. Justification for these methods rests on two pillars: their desirable mathematical properties, and their applicability to specific motivations for explanations. We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. We also draw on additional literature to argue that Shapley values do not provide explanations which suit human-centric goals of explainability.
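
As background for the cooperative-game framing above, the following minimal sketch computes exact Shapley values by averaging each player's marginal contribution over all orderings. It is a textbook construction, not any particular explanation library's method; the names `shapley_values` and `toy_game`, and the toy game itself, are hypothetical. The brute-force enumeration is only practical for a handful of players.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    `value` maps a frozenset of players to the payoff of that coalition.
    """
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


def toy_game(coalition):
    """Illustrative game: 'a' and 'b' pay 1 only together, 'c' pays 2 alone."""
    payoff = 0.0
    if {"a", "b"} <= coalition:
        payoff += 1.0
    if "c" in coalition:
        payoff += 2.0
    return payoff


phi = shapley_values(["a", "b", "c"], toy_game)
print(phi)  # {'a': 0.5, 'b': 0.5, 'c': 2.0}
```

The values sum to the payoff of the full coalition (3.0), and the interacting players "a" and "b" split their joint payoff equally.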

Artificial intelligence, machine learning