Tackling overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular $Q$-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a more severe overestimation in practice than previously acknowledged, and is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline and demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent $Q$-Learning, is general and can be applied to any $Q$-learning based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
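To make the two ingredients described above concrete, the following is a minimal sketch of a regularized softmax-style backup for discrete joint action-values; the names `q_next`, `baseline`, `beta`, and `reg_coef` are illustrative stand-ins rather than the paper's notation, and the exact penalty used by RES may differ.

```python
import numpy as np

def regularized_softmax_target(q_next, reward, gamma=0.99, beta=5.0,
                               baseline=None, reg_coef=0.1):
    """Illustrative regularized-softmax target for one transition.

    q_next   : next-state joint action-values, shape (n_joint_actions,)
    baseline : scalar the target is shrunk toward (hypothetical choice,
               e.g. the current Q estimate); None disables the penalty.
    """
    # Boltzmann softmax backup: a soft alternative to the max operator,
    # which reduces the overestimation caused by maximizing over noisy values.
    w = np.exp(beta * (q_next - q_next.max()))
    w /= w.sum()
    soft_value = float(np.dot(w, q_next))

    target = reward + gamma * soft_value
    if baseline is not None:
        # Penalize targets that rise far above the baseline, shrinking
        # unusually large joint action-values back toward it.
        target -= reg_coef * max(target - baseline, 0.0)
    return target
```

The softmax backup lies between the mean and the max of the next-state values, while the penalty shrinks targets that drift far above the baseline.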
A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect the performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which can effectively mitigate the overestimation and underestimation biases. We conduct extensive experiments on challenging continuous control tasks, and results show that SD3 outperforms state-of-the-art methods.
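As a rough illustration of applying the Boltzmann softmax operator to value estimation in a continuous action space, the sketch below approximates the soft value by Monte Carlo sampling of candidate actions; `q_fn`, `action_sampler`, and `n_samples` are hypothetical names, and the paper's actual estimator (and its double-estimator SD3 variant) may be constructed differently.

```python
import numpy as np

def softmax_value_estimate(q_fn, state, action_sampler, beta=1.0, n_samples=50):
    """Monte Carlo approximation of a Boltzmann-softmax state value in a
    continuous action space: sample candidate actions, then weight their
    Q-values by softmax(beta * Q). All names here are illustrative."""
    actions = [action_sampler(state) for _ in range(n_samples)]
    q_vals = np.array([q_fn(state, a) for a in actions])
    w = np.exp(beta * (q_vals - q_vals.max()))  # numerically stabilized weights
    w /= w.sum()
    return float(np.dot(w, q_vals))             # lies between mean and max of Q
```

A small `beta` keeps the estimate close to the average sampled Q-value, countering overestimation, while a large `beta` approaches the hard max.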
Establishing quantum entanglement between individual nodes is crucial for building large-scale quantum networks, enabling secure quantum communication, distributed quantum computing, enhanced quantum metrology and fundamental tests of quantum mechanics. However, such shared entanglement has so far been observed only in extremely low-temperature or well-isolated systems, which limits quantum networks in real-life applications. Here, we report the realization of heralded quantum entanglement between two atomic ensembles at room temperature, where each ensemble contains billions of motional atoms. By measuring the mapped-out entangled state with quantum interference, concurrence and correlation, we strongly verify the existence of a single excitation delocalized across the two atomic ensembles. Remarkably, the heralded quantum entanglement of atomic ensembles can be operated in a delayed-choice fashion, which illustrates the essential role of the built-in quantum memory. The demonstrated building block paves the way for constructing quantum networks and distributing entanglement across multiple remote nodes at ambient conditions.
Quantum entanglement is of central importance to quantum computing, quantum metrology and quantum information, as well as to the nature of quantum physics. Quantum theory does not prevent entanglement from being created and observed in macroscopic physical systems; in practice, however, the accessible scale of entanglement is still very limited due to decoherence effects. Recently, entanglement has been observed among thousands to millions of atoms in extremely low-temperature and well-isolated systems. Here, we create multipartite entanglement of billions of motional atoms in a quantum memory at room temperature, and certify the genuine entanglement via an $M$-separability witness associated with photon statistics. The information contained in a single photon is found to be strongly correlated with the excitation shared by the motional atoms, which intrinsically addresses the large system and thereby stimulates the multipartite entanglement. Remarkably, our heralded and quantum-memory-built-in entanglement generation allows us to directly observe the dynamic evolution of entanglement depth and further to reveal the effects of decoherence. Our results verify the existence of genuine multipartite entanglement among billions of motional atoms at ambient conditions, significantly extending the boundary of the accessible scale of entanglement. Besides probing the quantum-to-classical transition in an entirely new realm, the developed ability to manipulate such large-scale entanglement may enhance a wide spectrum of applications for emerging quantum technologies.
Recent years have witnessed tremendous improvements in deep reinforcement learning. However, a challenging problem is that an agent may suffer from inefficient exploration, particularly for on-policy methods. Previous exploration methods either rely on complex structures to estimate the novelty of states, or introduce sensitive hyper-parameters that cause instability. We propose an efficient exploration method, Multi-Path Policy Optimization (MPPO), which does not incur high computation cost and ensures stability. MPPO maintains an efficient mechanism that effectively utilizes a population of diverse policies to enable better exploration, especially in sparse-reward environments. We also provide a theoretical guarantee of stable performance. We build our scheme upon two widely-adopted on-policy methods, the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. Results show that MPPO significantly outperforms state-of-the-art exploration methods in terms of both sample efficiency and final performance.
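One way a population-based, multi-path scheme of this kind might be organized is sketched below: briefly evaluate each policy in the population, then spend the on-policy update budget on the most promising one. This is only an illustrative assumption about the mechanism; `evaluate` and `optimize` stand in for rollout and TRPO/PPO update routines and are not MPPO's actual interface.

```python
def multi_path_update(policies, evaluate, optimize, eval_episodes=5):
    """Illustrative population-based update: score each policy with a few
    rollouts, then apply an on-policy update (e.g., a TRPO/PPO step) to the
    best-scoring one, keeping the others as diverse exploration paths."""
    scores = [evaluate(p, eval_episodes) for p in policies]
    best = max(range(len(policies)), key=lambda i: scores[i])
    policies[best] = optimize(policies[best])
    return policies, scores
```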
Reinforcement learning algorithms such as the deep deterministic policy gradient (DDPG) algorithm have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper, we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which the number of rollout steps of the analytical gradients through the learned model trades off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
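The following sketch illustrates the kind of k-step model-based value estimate that underlies deterministic value gradients: roll a learned deterministic model forward under the current policy for k steps and bootstrap with a critic, so that differentiating the result with an autodiff framework yields an analytic gradient. `policy`, `model`, `reward_fn`, and `q_fn` are illustrative stand-ins, and the way DVPG combines this with the model-free gradient is not shown.

```python
def k_step_model_value(policy, model, reward_fn, q_fn, state, k=3, gamma=0.99):
    """Illustrative k-step model-based value: unroll a learned deterministic
    model under the current policy, accumulate predicted rewards, and
    bootstrap with a critic. Larger k relies more on the model (more bias,
    lower gradient variance); smaller k relies more on the critic."""
    value, discount, s = 0.0, 1.0, state
    for _ in range(k):
        a = policy(s)                                 # deterministic action
        value = value + discount * reward_fn(s, a)    # predicted reward
        s = model(s, a)                               # predicted next state
        discount *= gamma
    return value + discount * q_fn(s, policy(s))      # critic bootstrap
```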
Ling Pan, Qingpeng Cai, Qi Meng (2019)
Value function estimation, i.e., prediction, is an important task in reinforcement learning. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in the settings of planning and learning. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function and rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying dynamic Boltzmann softmax updates in deep Q-networks, which outperforms DQN substantially in 40 out of 49 Atari games.
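A minimal sketch of value iteration with a dynamic Boltzmann softmax backup is shown below: the inverse temperature grows over iterations so the backup approaches the max operator as learning proceeds. The quadratic schedule is an illustrative choice, not necessarily the schedule used by DBS.

```python
import numpy as np

def dbs_value_iteration(rewards, transitions, gamma=0.99, n_iters=200):
    """Value iteration with a dynamic Boltzmann softmax backup.

    rewards     : array of shape (S, A)
    transitions : array of shape (S, A, S) with transition probabilities
    """
    S, A = rewards.shape
    V = np.zeros(S)
    for t in range(1, n_iters + 1):
        beta = float(t) ** 2                          # increasing inverse temperature
        Q = rewards + gamma * transitions @ V         # one-step backup, shape (S, A)
        w = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
        w /= w.sum(axis=1, keepdims=True)
        V = np.sum(w * Q, axis=1)                     # Boltzmann softmax over actions
    return V
```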
The low-decoherence regime plays a key role in constructing multi-particle quantum systems and has therefore been constantly pursued in order to build quantum simulators and quantum computers in a scalable fashion. Quantum error correction and quantum topological computing have been proven able to protect quantumness but have not yet been experimentally realized. Recently, topological boundary states have been found to be inherently stable and capable of protecting physical fields from dissipation and disorder, which inspires the application of such topological protection to quantum correlations. Here, we present an experimental demonstration of topological protection of two-photon quantum states on a photonic chip. By analyzing the quantum correlation of photons emerging from the topologically nontrivial boundary state, we obtain a high cross-correlation and a strong violation of the Cauchy-Schwarz inequality by up to 30 standard deviations. Our results, together with our integrated implementation, provide an alternative way of protecting quantumness, and may inspire further explorations in quantum topological photonics, a crossover between topological photonics and quantum information.
We study a reinforcement learning setting where the state transition function is a convex combination of a stochastic continuous function and a deterministic function. Such a setting generalizes the widely-studied stochastic state transition setting, namely the setting of deterministic policy gradient (DPG). We first give a simple example to illustrate that the deterministic policy gradient may be infinite under deterministic state transitions, and introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee the existence for all discount factors. We then derive a closed form of the policy gradient whenever it exists. Furthermore, to overcome the challenge of the high sample complexity of DPG in this setting, we propose the Generalized Deterministic Policy Gradient (GDPG) algorithm. The main innovation of the algorithm is a new method of applying model-based techniques to the model-free deep deterministic policy gradient (DDPG) algorithm. GDPG optimizes the long-term rewards of the model-based augmented MDP subject to the constraint that its long-term rewards do not exceed those of the original MDP. We finally conduct extensive experiments comparing GDPG with state-of-the-art methods and the direct model-based extension of DDPG on several standard continuous control benchmarks. Results demonstrate that GDPG substantially outperforms DDPG, the model-based extension of DDPG, and other baselines in terms of both convergence and long-term rewards in most environments.
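For concreteness, the transition model studied here can be read as a mixture: with some probability the next state is drawn from a stochastic transition function, and otherwise it is produced by a deterministic one. The sketch below illustrates this reading; `alpha` and the callable names are assumptions for illustration, not the paper's notation.

```python
import numpy as np

def mixed_transition(state, action, stochastic_step, deterministic_step,
                     alpha=0.5, rng=None):
    """Illustrative mixture transition: with probability alpha the next state
    comes from a stochastic transition function, otherwise from a
    deterministic one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < alpha:
        return stochastic_step(state, action, rng)   # stochastic component
    return deterministic_step(state, action)         # deterministic component
```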
Bike sharing provides an environmentally friendly way to travel and is booming all over the world. Yet, due to the high similarity of user travel patterns, the bike imbalance problem constantly occurs, especially for dockless bike sharing systems, causing a significant impact on service quality and company revenue. Thus, it has become a critical task for bike sharing systems to resolve such imbalance efficiently. In this paper, we propose a novel deep reinforcement learning framework for incentivizing users to rebalance such systems. We model the problem as a Markov decision process and take both spatial and temporal features into consideration. We develop a novel deep reinforcement learning algorithm called Hierarchical Reinforcement Pricing (HRP), which builds upon the Deep Deterministic Policy Gradient algorithm. Different from existing methods that often ignore spatial information and rely heavily on accurate prediction, HRP captures both spatial and temporal dependencies using a divide-and-conquer structure with an embedded localized module. We conduct extensive experiments to evaluate HRP based on a dataset from Mobike, a major Chinese dockless bike sharing company. Results show that HRP performs close to the 24-timeslot look-ahead optimization, and outperforms state-of-the-art methods in both service level and bike distribution. It also transfers well when applied to unseen areas.