Episodic Self-Imitation Learning with Hindsight

135 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Tianhong Dai

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Tianhong Dai - Hengyan Liu - Anil Anthony Bharath

الذكاء الاصطناعي علم الروبوتات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode of the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transitions-based method which performs poorly in handling continuous control environments with sparse rewards. From the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent learning undesirable hindsight experiences. With the capability of solving sparse reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.

قيم البحث

اقرأ أيضاً

Soft Hindsight Experience Replay

101 - Qiwei He , Liansheng Zhuang , Houqiang Li 2020

Efficient learning in the environment with sparse rewards is one of the most important challenges in Deep Reinforcement Learning (DRL). In continuous DRL environments such as robotic arms control, Hindsight Experience Replay (HER) has been shown an e ffective solution. However, due to the brittleness of deterministic methods, HER and its variants typically suffer from a major challenge for stability and convergence, which significantly affects the final performance. This challenge severely limits the applicability of such methods to complex real-world domains. To tackle this challenge, in this paper, we propose Soft Hindsight Experience Replay (SHER), a novel approach based on HER and Maximum Entropy Reinforcement Learning (MERL), combining the failed experiences reuse and maximum entropy probabilistic inference model. We evaluate SHER on Open AI Robotic manipulation tasks with sparse rewards. Experimental results show that, in contrast to HER and its variants, our proposed SHER achieves state-of-the-art performance, especially in the difficult HandManipulation tasks. Furthermore, our SHER method is more stable, achieving very similar performance across different random seeds.

الذكاء الاصطناعي علم الروبوتات

DropoutDAgger: A Bayesian Approach to Safe Imitation Learning

82 - Kunal Menda , Katherine Driggs-Campbell , 2017

While imitation learning is becoming common practice in robotics, this approach often suffers from data mismatch and compounding errors. DAgger is an iterative algorithm that addresses these issues by continually aggregating training data from both t he expert and novice policies, but does not consider the impact of safety. We present a probabilistic extension to DAgger, which uses the distribution over actions provided by the novice policy, for a given observation. Our method, which we call DropoutDAgger, uses dropout to train the novice as a Bayesian neural network that provides insight to its confidence. Using the distribution over the novices actions, we estimate a probabilistic measure of safety with respect to the expert action, tuned to balance exploration and exploitation. The utility of this approach is evaluated on the MuJoCo HalfCheetah and in a simple driving experiment, demonstrating improved performance and safety compared to other DAgger variants and classic imitation learning.

الذكاء الاصطناعي علم الروبوتات

Learning from the Hindsight Plan -- Episodic MPC Improvement

74 - Aviv Tamar , Garrett Thomas , Tianhao Zhang 2016

Model predictive control (MPC) is a popular control method that has proved effective for robotics, among other fields. MPC performs re-planning at every time step. Re-planning is done with a limited horizon per computational and real-time constraints and often also for robustness to potential model errors. However, the limited horizon leads to suboptimal performance. In this work, we consider the iterative learning setting, where the same task can be repeated several times, and propose a policy improvement scheme for MPC. The main idea is that between executions we can, offline, run MPC with a longer horizon, resulting in a hindsight plan. To bring the next real-world execution closer to the hindsight plan, our approach learns to re-shape the original cost function with the goal of satisfying the following property: short horizon planning (as realistic during real executions) with respect to the shaped cost should result in mimicking the hindsight plan. This effectively consolidates long-term reasoning into the short-horizon planning. We empirically evaluate our approach in contact-rich manipulation tasks both in simulated and real environments, such as peg insertion by a real PR2 robot.

علم الروبوتات الذكاء الاصطناعي التعلم الآلي

Self-Imitation Learning

150 - Junhyuk Oh , Yijie Guo , Satinder Singh 2018

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agents past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indir ectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard exploration Atari games and is competitive to the state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

Keyframe-Focused Visual Imitation Learning

128 - Chuan Wen , Jierui Lin , Jianing Qian 2021

Imitation learning trains control policies by mimicking pre-recorded expert demonstrations. In partially observable settings, imitation policies must rely on observation histories, but many seemingly paradoxical results show better performance for po licies that only access the most recent observation. Recent solutions ranging from causal graph learning to deep information bottlenecks have shown promising results, but failed to scale to realistic settings such as visual imitation. We propose a solution that outperforms these prior approaches by upweighting demonstration keyframes corresponding to expert action changepoints. This simple approach easily scales to complex visual imitation settings. Our experimental results demonstrate consistent performance improvements over all baselines on image-based Gym MuJoCo continuous control tasks. Finally, on the CARLA photorealistic vision-based urban driving simulator, we resolve a long-standing issue in behavioral cloning for driving by demonstrating effective imitation from observation histories. Supplementary materials and code at: url{https://tinyurl.com/imitation-keyframes}.

التعلم الآلي علم الروبوتات