Context-Aware Safe Reinforcement Learning for Non-Stationary Environments

108 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Baiming Chen

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Baiming Chen - Zuxin Liu - Jiacheng Zhu

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Safety is a critical concern when deploying reinforcement learning agents for realistic tasks. Recently, safe reinforcement learning algorithms have been developed to optimize the agents performance while avoiding violations of safety constraints. However, few studies have addressed the non-stationary disturbances in the environments, which may cause catastrophic outcomes. In this paper, we propose the context-aware safe reinforcement learning (CASRL) method, a meta-learning framework to realize safe adaptation in non-stationary environments. We use a probabilistic latent variable model to achieve fast inference of the posterior environment transition distribution given the context data. Safety constraints are then evaluated with uncertainty-aware trajectory sampling. The high cost of safety violations leads to the rareness of unsafe records in the dataset. We address this issue by enabling prioritized sampling during model training and formulating prior safety constraints with domain knowledge during constrained planning. The algorithm is evaluated in realistic safety-critical environments with non-stationary disturbances. Results show that the proposed algorithm significantly outperforms existing baselines in terms of safety and robustness.

قيم البحث

246 - Haozhe Wang , Jiale Zhou , Xuming He 2020

Despite recent success of deep network-based Reinforcement Learning (RL), it remains elusive to achieve human-level efficiency in learning novel tasks. While previous efforts attempt to address this challenge using meta-learning strategies, they typi cally suffer from sampling inefficiency with on-policy RL algorithms or meta-overfitting with off-policy learning. In this work, we propose a novel meta-RL strategy to address those limitations. In particular, we decompose the meta-RL problem into three sub-tasks, task-exploration, task-inference and task-fulfillment, instantiated with two deep network agents and a task encoder. During meta-training, our method learns a task-conditioned actor network for task-fulfillment, an explorer network with a self-supervised reward shaping that encourages task-informative experiences in task-exploration, and a context-aware graph-based task encoder for task inference. We validate our approach with extensive experiments on several public benchmarks and the results show that our algorithm effectively performs exploration for task inference, improves sample efficiency during both training and testing, and mitigates the meta-overfitting problem.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Multi-Source Transfer Learning for Non-Stationary Environments

92 - Honghui Du , Leandro L. Minku , Huiyu Zhou 2019

In data stream mining, predictive models typically suffer drops in predictive performance due to concept drift. As enough data representing the new concept must be collected for the new concept to be well learnt, the predictive performance of existin g models usually takes some time to recover from concept drift. To speed up recovery from concept drift and improve predictive performance in data stream mining, this work proposes a novel approach called Multi-sourcE onLine TrAnsfer learning for Non-statIonary Environments (Melanie). Melanie is the first approach able to transfer knowledge between multiple data streaming sources in non-stationary environments. It creates several sub-classifiers to learn different aspects from different source and target concepts over time. The sub-classifiers that match the current target concept well are identified, and used to compose an ensemble for predicting examples from the target concept. We evaluate Melanie on several synthetic data streams containing different types of concept drift and on real world data streams. The results indicate that Melanie can deal with a variety drifts and improve predictive performance over existing data stream learning algorithms by making use of multiple sources.

التعلم الآلي التعلم الالي

Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments

446 - Baiming Chen , Mengdi Xu , Zuxin Liu 2020

Action and observation delays exist prevalently in the real-world cyber-physical systems which may pose challenges in reinforcement learning design. It is particularly an arduous task when handling multi-agent systems where the delay of one agent cou ld spread to other agents. To resolve this problem, this paper proposes a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define the Delay-Aware Markov Game that incorporates the delays of all agents in the environment. To solve Delay-Aware Markov Games, we apply centralized training and decentralized execution that allows agents to use extra information to ease the non-stationarity issue of the multi-agent systems during training, without the need of a centralized controller during execution. Experiments are conducted in multi-agent particle environments including cooperative communication, cooperative navigation, and competitive experiments. We also test the proposed algorithm in traffic scenarios that require coordination of all autonomous vehicles to show the practical value of delay-awareness. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay. Codes and demo videos are available at: https://github.com/baimingc/delay-aware-MARL.

التعلم الآلي أنظمة متعددة العملاء التعلم الالي

Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning

82 - Haoran Xu , Xianyuan Zhan , Xiangyu Zhu 2021

We study the problem of safe offline reinforcement learning (RL), the goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment. This proble m is more appealing for real world RL applications, in which data collection is costly or dangerous. Enforcing constraint satisfaction is non-trivial, especially in offline settings, as there is a potential large discrepancy between the policy distribution and the data distribution, causing errors in estimating the value of safety constraints. We show that naive approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions. We thus develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem. Our method admits the use of data generated by mixed behavior policies. We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines.

التعلم الآلي الذكاء الاصطناعي

Weighted Gaussian Process Bandits for Non-stationary Environments

263 - Yuntian Deng , Xingyu Zhou , Baekjin Kim 2021

In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box function is allowed to be time-varying within a reproducing kernel Hilbert space (RKHS). To t his end, we develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is how to cope with infinite-dimensional feature maps. To that end, we leverage kernel approximation techniques to prove a sublinear regret bound, which is the first (frequentist) sublinear regret guarantee on weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and standard GP-UCB algorithms. Further, a novel concentration inequality is achieved for weighted Gaussian process regression with general weights. We also provide universal upper bounds and weight-dependent upper bounds for weighted maximum information gains. These results are potentially of independent interest for applications such as news ranking and adaptive pricing, where weights can be adopted to capture the importance or quality of data. Finally, we conduct experiments to highlight the favorable gains of the proposed algorithm in many cases when compared to existing methods.

التعلم الآلي الذكاء الاصطناعي