No Arabic abstract
Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem in an incomplete network setting as a sequential decision making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called Network Actor Critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on several synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.
This paper presents a new neural architecture that combines a modulated Hebbian network (MOHN) with DQN, which we call modulated Hebbian plus Q network architecture (MOHQA). The hypothesis is that such a combination allows MOHQA to solve difficult partially observable Markov decision process (POMDP) problems which impair temporal difference (TD)-based RL algorithms such as DQN, as the TD error cannot be easily derived from observations. The key idea is to use a Hebbian network with bio-inspired neural traces in order to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, DQN learns low level features and control, while the MOHN contributes to the high-level decisions by associating rewards with past states and actions. Thus the proposed architecture combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, exploiting the advantages of both. Simulations on a set of POMDPs and on the MALMO environment show that the proposed algorithm improved DQNs results and even outperformed control tests with A2C, QRDQN+LSTM and REINFORCE algorithms on some POMDPs with confounding stimuli and sparse rewards.
This paper proposes a new robust update rule of target network for deep reinforcement learning (DRL), to replace the conventional update rule, given as an exponential moving average. The target network is for smoothly generating the reference signals for a main network in DRL, thereby reducing learning variance. The problem with its conventional update rule is the fact that all the parameters are smoothly copied with the same speed from the main network, even when some of them are trying to update toward the wrong directions. This behavior increases the risk of generating the wrong reference signals. Although slowing down the overall update speed is a naive way to mitigate wrong updates, it would decrease learning speed. To robustly update the parameters while keeping learning speed, a t-soft update method, which is inspired by student-t distribution, is derived with reference to the analogy between the exponential moving average and the normal distribution. Through the analysis of the derived t-soft update, we show that it takes over the properties of the student-t distribution. Specifically, with a heavy-tailed property of the student-t distribution, the t-soft update automatically excludes extreme updates that differ from past experiences. In addition, when the updates are similar to the past experiences, it can mitigate the learning delay by increasing the amount of updates. In PyBullet robotics simulations for DRL, an online actor-critic algorithm with the t-soft update outperformed the conventional methods in terms of the obtained return and/or its variance. From the training process by the t-soft update, we found that the t-soft update is globally consistent with the standard soft update, and the update rates are locally adjusted for acceleration or suppression.
Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning. In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. We apply this method to learning walking gaits on a real-world Minitaur robot. Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. Videos of training and the learned policy can be found on the project website.
Network dismantling aims to degrade the connectivity of a network by removing an optimal set of nodes and has been widely adopted in many real-world applications such as epidemic control and rumor containment. However, conventional methods usually focus on simple network modeling with only pairwise interactions, while group-wise interactions modeled by hypernetwork are ubiquitous and critical. In this work, we formulate the hypernetwork dismantling problem as a node sequence decision problem and propose a deep reinforcement learning (DRL)-based hypernetwork dismantling framework. Besides, we design a novel inductive hypernetwork embedding method to ensure the transferability to various real-world hypernetworks. Generally, our framework builds an agent. It first generates small-scale synthetic hypernetworks and embeds the nodes and hypernetworks into a low dimensional vector space to represent the action and state space in DRL, respectively. Then trial-and-error dismantling tasks are conducted by the agent on these synthetic hypernetworks, and the dismantling strategy is continuously optimized. Finally, the well-optimized strategy is applied to real-world hypernetwork dismantling tasks. Experimental results on five real-world hypernetworks demonstrate the effectiveness of our proposed framework.
Many real-world sequential decision making problems are partially observable by nature, and the environment model is typically unknown. Consequently, there is great need for reinforcement learning methods that can tackle such problems given only a stream of incomplete and noisy observations. In this paper, we propose deep variational reinforcement learning (DVRL), which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information. We develop an n-step approximation to the evidence lower bound (ELBO), allowing the model to be trained jointly with the policy. This ensures that the latent state representation is suitable for the control task. In experiments on Mountain Hike and flickering Atari we show that our method outperforms previous approaches relying on recurrent neural networks to encode the past.