ﻻ يوجد ملخص باللغة العربية
Deep Reinforcement Learning (DRL) methods have performed well in an increasing numbering of high-dimensional visual decision making domains. Among all such visual decision making problems, those with discrete action spaces often tend to have underlying compositional structure in the said action space. Such action spaces often contain actions such as go left, go up as well as go diagonally up and left (which is a composition of the former two actions). The representations of control policies in such domains have traditionally been modeled without exploiting this inherent compositional structure in the action spaces. We propose a new learning paradigm, Factored Action space Representations (FAR) wherein we decompose a control policy learned using a Deep Reinforcement Learning Algorithm into independent components, analogous to decomposing a vector in terms of some orthogonal basis vectors. This architectural modification of the control policy representation allows the agent to learn about multiple actions simultaneously, while executing only one of them. We demonstrate that FAR yields considerable improvements on top of two DRL algorithms in Atari 2600: FARA3C outperforms A3C (Asynchronous Advantage Actor Critic) in 9 out of 14 tasks and FARAQL outperforms AQL (Asynchronous n-step Q-Learning) in 9 out of 13 tasks.
Advances in computing resources have resulted in the increasing complexity of cyber-physical systems (CPS). As the complexity of CPS evolved, the focus has shifted from traditional control methods to deep reinforcement learning-based (DRL) methods fo
Maximum Entropy (MaxEnt) reinforcement learning is a powerful learning paradigm which seeks to maximize return under entropy regularization. However, action entropy does not necessarily coincide with state entropy, e.g., when multiple actions produce
In this work, we study auxiliary prediction tasks defined by temporal-difference networks (TD networks); these networks are a language for expressing a rich space of general value function (GVF) prediction targets that may be learned efficiently with
We address the problem of learning hierarchical deep neural network policies for reinforcement learning. In contrast to methods that explicitly restrict or cripple lower layers of a hierarchy to force them to use higher-level modulating signals, each
Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed. However, previous work only study the offline setting where problem dynamics are known. We