FORK: A Forward-Looking Actor For Model-Free Reinforcement Learning

73 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Honghao Wei

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Honghao Wei - Lei Ying

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper, we propose a new type of Actor, named forward-looking Actor or FORK for short, for Actor-Critic algorithms. FORK can be easily integrated into a model-free Actor-Critic algorithm. Our experiments on six Box2D and MuJoCo environments with continuous state and action spaces demonstrate significant performance improvement FORK can bring to the state-of-the-art algorithms. A variation of FORK can further solve Bipedal-WalkerHardcore in as few as four hours using a single GPU.

قيم البحث

70 - Matthias Weissenbacher , Yoshinobu Kawahara 2021

In this work we discuss the incorporation of quadratic neurons into policy networks in the context of model-free actor-critic reinforcement learning. Quadratic neurons admit an explicit quadratic function approximation in contrast to conventional app roaches where the the non-linearity is induced by the activation functions. We perform empiric experiments on several MuJoCo continuous control tasks and find that when quadratic neurons are added to MLP policy networks those outperform the baseline MLP whilst admitting a smaller number of parameters. The top returned reward is in average increased by $5.8%$ while being about $21%$ more sample efficient. Moreover, it can maintain its advantage against added action and observation noise.

التعلم الآلي الذكاء الاصطناعي علم الروبوتات

A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

587 - Wesley Suttle , Zhuoran Yang , Kaiqing Zhang 2019

This paper extends off-policy reinforcement learning to the multi-agent case in which a set of networked agents communicating with their neighbors according to a time-varying graph collaboratively evaluates and improves a target policy while followin g a distinct behavior policy. To this end, the paper develops a multi-agent version of emphatic temporal difference learning for off-policy policy evaluation, and proves convergence under linear function approximation. The paper then leverages this result, in conjunction with a novel multi-agent off-policy policy gradient theorem and recent work in both multi-agent on-policy and single-agent off-policy actor-critic methods, to develop and give convergence guarantees for a new multi-agent off-policy actor-critic algorithm.

التعلم الآلي التعلم الالي

Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces

76 - Ahmed Touati , Adrien Ali Taiga , Marc G. Bellemare 2020

Despite the wealth of research into provably efficient reinforcement learning algorithms, most works focus on tabular representation and thus struggle to handle exponentially or infinitely large state-action spaces. In this paper, we consider episodi c reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric that characterizes the proximity between different states and actions. We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space by zooming in more promising and frequently visited regions while carefully balancing the exploitation-exploration trade-off. We show that ZoomRL achieves a worst-case regret $tilde{O}(H^{frac{5}{2}} K^{frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space with respect to the metric. Moreover, our algorithm enjoys improved metric-dependent guarantees that reflect the geometry of the underlying space. Finally, we show that our algorithm is robust to small misspecification errors.

التعلم الآلي التعلم الالي

Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic

440 - Yangang Ren , Jingliang Duan , Shengbo Eben Li 2020

Reinforcement learning (RL) has achieved remarkable performance in numerous sequential decision making and control tasks. However, a common problem is that learned nearly optimal policy always overfits to the training environment and may not be exten ded to situations never encountered during training. For practical applications, the randomness of environment usually leads to some devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. Minimax formulation aims to seek optimal policy considering the most severe variations from environment, in which the protagonist policy maximizes action-value function while the adversary policy tries to minimize it. Distributional framework aims to learn a state-action return distribution, from which we can model the risk of different returns explicitly, thereby formulating a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in distinct environments. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.

التعلم الآلي التعلم الالي

Model-Based Reinforcement Learning for Atari

77 - Lukasz Kaiser , Mohammad Babaeizadeh , Piotr Milos 2019

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, th an a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in low data regime of 100k interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude.

التعلم الآلي التعلم الالي