Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

ProMP: Proximal Meta-Policy Search

174 0 0.0 ( 0 )

Download Cite

Added by Jonas Rothfuss

Publication date 2018

fields Informatics Engineering Mathematical Statistics

and research's language is English

Authors Jonas Rothfuss - Dennis Lee - Ignasi Clavera

Machine Learning Machine Learning

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

rate research

Proximal Deterministic Policy Gradient

112 - Marco Maggipinto , Gian Antonio Susto , Pratik Chaudhari 2020

This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration. The target network plays the role of the variable of optimization and the value network computes the proximal operator. Second, we exploits the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate through bootstrapping with limited increase of computational resources. Further, we demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.

Machine Learning Machine Learning

Robust Policy Search for Robot Navigation with Stochastic Meta-Policies

61 - Javier Garcia-Barcos , Ruben Martinez-Cantin 2020

Bayesian optimization is an efficient nonlinear optimization method where the queries are carefully selected to gather information about the optimum location. Thus, in the context of policy search, it has been called active policy search. The main ingredients of Bayesian optimization for sample efficiency are the probabilistic surrogate model and the optimal decision heuristics. In this work, we exploit those to provide robustness to different issues for policy search algorithms. We combine several methods and show how their interaction works better than the sum of the parts. First, to deal with input noise and provide a safe and repeatable policy we use an improved version of unscented Bayesian optimization. Then, to deal with mismodeling errors and improve exploration we use stochastic meta-policies for query selection and an adaptive kernel. We compare the proposed algorithm with previous results in several optimization benchmarks and robot tasks, such as pushing objects with a robot arm, or path finding with a rover.

Machine Learning Machine Learning

Bottom-Up Meta-Policy Search

79 - Luckeciano C. Melo , Marcos R. O. A. Maximo , Adilson Marques da Cunha 2019

Despite of the recent progress in agents that learn through interaction, there are several challenges in terms of sample efficiency and generalization across unseen behaviors during training. To mitigate these problems, we propose and apply a first-order Meta-Learning algorithm called Bottom-Up Meta-Policy Search (BUMPS), which works with two-phase optimization procedure: firstly, in a meta-training phase, it distills few expert policies to create a meta-policy capable of generalizing knowledge to unseen tasks during training; secondly, it applies a fast adaptation strategy named Policy Filtering, which evaluates few policies sampled from the meta-policy distribution and selects which best solves the task. We conducted all experiments in the RoboCup 3D Soccer Simulation domain, in the context of kick motion learning. We show that, given our experimental setup, BUMPS works in scenarios where simple multi-task Reinforcement Learning does not. Finally, we performed experiments in a way to evaluate each component of the algorithm.

Machine Learning Artificial Intelligence Robotics

Variance-Reduced Off-Policy Memory-Efficient Policy Search

91 - Daoming Lyu , Qi Qi , Mohammad Ghavamzadeh 2020

Off-policy policy optimization is a challenging problem in reinforcement learning (RL). The algorithms designed for this problem often suffer from high variance in their estimators, which results in poor sample efficiency, and have issues with convergence. A few variance-reduced on-policy policy gradient algorithms have been recently proposed that use methods from stochastic optimization to reduce the variance of the gradient estimate in the REINFORCE algorithm. However, these algorithms are not designed for the off-policy setting and are memory-inefficient, since they need to collect and store a large ``reference batch of samples from time to time. To achieve variance-reduced off-policy-stable policy optimization, we propose an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples. Empirical studies validate the effectiveness of the proposed approaches.

Machine Learning Machine Learning

Learning Efficient and Effective Exploration Policies with Counterfactual Meta Policy

76 - Ruihan Yang , Qiwei Ye , Tie-Yan Liu 2019

A fundamental issue in reinforcement learning algorithms is the balance between exploration of the environment and exploitation of information already obtained by the agent. Especially, exploration has played a critical role for both efficiency and efficacy of the learning process. However, Existing works for exploration involve task-agnostic design, that is performing well in one environment, but be ill-suited to another. To the purpose of learning an effective and efficient exploration policy in an automated manner. We formalized a feasible metric for measuring the utility of exploration based on counterfactual ideology. Based on that, We proposed an end-to-end algorithm to learn exploration policy by meta-learning. We demonstrate that our method achieves good results compared to previous works in the high-dimensional control tasks in MuJoCo simulator.

Machine Learning Machine Learning

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

ProMP: Proximal Meta-Policy Search

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions