The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions, and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep, OBL agents follow a policy $\pi_1$ that is optimized assuming past actions were taken by a given, fixed policy $\pi_0$, but assuming that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism, and shows strong performance both in a toy setting and in Hanabi, the benchmark human-AI and ZSC problem.
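A minimal sketch of the fictitious-transition target described above, written in Python. The belief model belief_pi0, the resettable gym-style simulator env, and the value function q_pi1 are assumed, illustrative objects, not the authors' implementation.

# Illustrative sketch of an OBL-style learning target: the hidden state is
# resampled as if all past actions came from pi_0, while the bootstrap value
# assumes all future actions come from the learned policy pi_1.
def obl_target(history, action, belief_pi0, env, q_pi1, gamma=0.99):
    state = belief_pi0.sample(history)        # belief conditioned on pi_0 having acted
    env.set_state(state)                      # fictitious transition from the sampled state
    next_obs, reward, done, _ = env.step(action)
    if done:
        return reward
    # Future play is evaluated under the learned policy pi_1 (here, greedy over q_pi1).
    return reward + gamma * max(q_pi1(next_obs, a) for a in env.legal_actions())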
Planning and reinforcement learning are two key approaches to sequential decision making. Multi-step approximate real-time dynamic programming, a recently successful algorithm class of which AlphaZero [Silver et al., 2018] is an example, combines both by nesting planning within a learning loop. However, the combination of planning and learning introduces a new question: how should we balance the time spent on planning, learning, and acting? The importance of this trade-off has not been explicitly studied before. We show that it is in fact of key importance, with computational results indicating that we should plan for neither too long nor too short a time. Conceptually, we identify a new spectrum of planning-learning algorithms ranging from exhaustive search (long planning) to model-free RL (no planning), with optimal performance achieved midway.
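A minimal sketch of the planning-budget knob this abstract studies, assuming a generic mcts_plan routine and a policy/value network net with act and update methods (all illustrative names): n_simulations = 0 recovers model-free RL, a very large budget approaches exhaustive search, and the results above suggest an intermediate budget works best for a fixed compute budget.

# Illustrative training loop with an explicit planning budget per real step.
def train(env, net, n_simulations, total_steps):
    obs = env.reset()
    for _ in range(total_steps):
        if n_simulations == 0:
            action = net.act(obs)                              # pure model-free acting
        else:
            action = mcts_plan(env, net, obs, n_simulations)   # planning nested in learning
        next_obs, reward, done, _ = env.step(action)
        net.update(obs, action, reward, next_obs)              # learn from the real transition
        obs = env.reset() if done else next_obs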
There is considerable interest in designing meta-reinforcement learning (meta-RL) algorithms, which enable autonomous agents to adapt to new tasks from a small amount of experience. In meta-RL, the specification (such as the reward function) of the current task is hidden from the agent. In addition, states are hidden within each task owing to sensor noise or limitations of realistic environments. Therefore, the meta-RL agent faces the challenge of inferring both the hidden task and the hidden states from a small amount of experience. To address this, we propose estimating a disentangled belief about the task and the states, leveraging the inductive bias that the task and the states can be regarded as global and local features of each task. Specifically, we train a hierarchical state-space model (HSSM) parameterized by deep neural networks as an environment model, whose global and local latent variables correspond to the task and the states, respectively. Because the HSSM does not allow analytical computation of the posterior distribution, i.e., the belief, we employ amortized inference to approximate it. Once the belief is obtained, we can augment the observations of a model-free policy with the belief to train the policy efficiently. Moreover, because the task and state information is factorized and interpretable, downstream policy training is easier than with prior methods that do not consider this hierarchical structure. Empirical validation on a GridWorld environment confirms that the HSSM can separate the hidden task and state information. We then compare the meta-RL agent equipped with the HSSM to prior meta-RL methods in MuJoCo environments, and confirm that our agent requires less training data and reaches higher final performance.
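A minimal sketch of the belief augmentation described above, using PyTorch with illustrative dimensions and names; the amortized inference network that produces the task and state beliefs is assumed to exist separately and is not shown.

import torch
import torch.nn as nn

# Illustrative policy head that conditions on the observation concatenated
# with the global (task) and local (state) belief vectors.
class BeliefAugmentedPolicy(nn.Module):
    def __init__(self, obs_dim, task_belief_dim, state_belief_dim, n_actions):
        super().__init__()
        in_dim = obs_dim + task_belief_dim + state_belief_dim
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, obs, task_belief, state_belief):
        x = torch.cat([obs, task_belief, state_belief], dim=-1)
        return self.net(x)  # action logits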
In this work, we introduce a new approach for the efficient solution of autonomous decision and planning problems, with a special focus on decision making under uncertainty and belief space planning (BSP) in high-dimensional state spaces. Usually, to solve the decision problem, we identify the optimal action according to some objective function. We claim that we can sometimes formulate an analogous yet simplified decision problem that can be solved more efficiently; a well-chosen simplification can lead to the same action selection, or to one for which the maximal loss can be bounded. Furthermore, such simplification is separated from the state inference and does not compromise its accuracy, as the selected action is ultimately applied to the original state. First, we present the concept for general decision problems and provide a theoretical framework for a coherent formulation of the approach. We then apply these ideas to BSP problems, which can be simplified by considering a sparse approximation of the initial (Gaussian) belief. The scalable belief sparsification algorithm we provide yields solutions that are guaranteed to be consistent with the original problem. We demonstrate the benefits of the approach on a highly realistic active-SLAM problem and significantly reduce computation time with practically no loss in solution quality. This work is conceptual and fundamental, and admits numerous possible extensions.
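A minimal sketch of the decoupling described above, in Python with NumPy. The thresholding used here to sparsify the information (precision) matrix is an illustrative stand-in, not the paper's sparsification algorithm, and objective is an assumed evaluation of a candidate action under a Gaussian belief.

import numpy as np

# Plan with a sparsified approximation of the Gaussian belief, but execute
# the selected action on the original, unmodified belief.
def sparsify_information(Lambda, threshold=1e-3):
    sparse = Lambda.copy()
    off_diag = ~np.eye(Lambda.shape[0], dtype=bool)
    sparse[off_diag & (np.abs(sparse) < threshold)] = 0.0  # drop weak couplings
    return sparse

def select_action(mu, Lambda, candidate_actions, objective):
    sparse_Lambda = sparsify_information(Lambda)
    # Choose the action maximizing the objective under the simplified belief;
    # inference itself continues to use the original (mu, Lambda).
    return max(candidate_actions, key=lambda a: objective(mu, sparse_Lambda, a))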
Defining and modeling the relation of inclusion between continuous belief functions is an important operation for studying their behavior. In this paper we propose two forms of inclusion: strict and partial. To develop this relation, we study the case of consonant belief functions, simulating normal distributions that allow us to model and analyze these relations. On this basis, we determine the parameters that influence and characterize the two forms of inclusion.
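A minimal sketch of the simulation setup as read from the abstract above (illustrative definitions, not the paper's): consonant belief functions induced by normal densities are represented by their contour functions and compared pointwise on a discretized domain.

import numpy as np

def contour(x, mu, sigma):
    # Contour (possibility) function of a consonant belief function built
    # from a normal density, normalized so that its maximum equals 1.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def inclusion(mu1, sigma1, mu2, sigma2, xs=None):
    xs = np.linspace(-10.0, 10.0, 2001) if xs is None else xs
    pi1, pi2 = contour(xs, mu1, sigma1), contour(xs, mu2, sigma2)
    if np.all(pi1 <= pi2):
        return "strict"    # the first belief function is included everywhere
    if np.any(pi1 <= pi2):
        return "partial"   # inclusion holds only on part of the domain
    return "none"

# Same mean, smaller spread: strictly included in the wider belief function.
print(inclusion(0.0, 1.0, 0.0, 2.0))   # -> strict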
The theory of belief functions manages uncertainty and also proposes a set of combination rules to aggregate the opinions of several sources. Some combination rules mix evidential information from independent sources; others are suited to combining evidential information held by dependent sources. In this paper we make two main contributions. First, we suggest a method to quantify the sources' degree of independence, which may guide the choice of the most appropriate set of combination rules. Second, we propose a new combination rule that takes the sources' degree of independence into account. The proposed method is illustrated on generated mass functions.
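An illustrative Python sketch (not the authors' rule) of making the combination depend on an independence degree: it interpolates between a conjunctive, Dempster-style combination, suited to independent sources, and a simple averaging rule, a common fallback for dependent sources, weighted by a degree delta in [0, 1]. Focal sets are frozensets over a common frame.

from itertools import product

def conjunctive(m1, m2):
    out = {}
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        out[A & B] = out.get(A & B, 0.0) + a * b
    k = out.pop(frozenset(), 0.0)               # conflict mass on the empty set
    return {A: v / (1.0 - k) for A, v in out.items()}

def average(m1, m2):
    keys = set(m1) | set(m2)
    return {A: 0.5 * (m1.get(A, 0.0) + m2.get(A, 0.0)) for A in keys}

def combine(m1, m2, delta):
    # delta close to 1: sources treated as independent; close to 0: dependent.
    dem, avg = conjunctive(m1, m2), average(m1, m2)
    keys = set(dem) | set(avg)
    return {A: delta * dem.get(A, 0.0) + (1 - delta) * avg.get(A, 0.0) for A in keys}

# Example on the frame {x, y}:
m1 = {frozenset({"x"}): 0.7, frozenset({"x", "y"}): 0.3}
m2 = {frozenset({"y"}): 0.6, frozenset({"x", "y"}): 0.4}
print(combine(m1, m2, delta=0.8))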