First return, then explore

59 0 0.0 ( 0 )

Download Cite

Added by Jeff Clune

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Adrien Ecoffet - Joost Huizinga - Joel Lehman

Artificial Intelligence

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The promise of reinforcement learning is to solve complex sequential decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires thoroughly exploring the environment, but creating algorithms that can do so remains one of the central challenges of the field. We hypothesise that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and from failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly remembering promising states and first returning to such states before intentionally exploring. Go-Explore solves all heretofore unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders of magnitude improvements on the grand challenges Montezumas Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explores exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.

rate research

On Explore-Then-Commit Strategies

123 - Aurelien Garivier , Emilie Kaufmann (SEQUEL , CRIStAL 2016

We study the problem of minimising regret in two-armed bandit problems with Gaussian rewards. Our objective is to use this simple setting to illustrate that strategies based on an exploration phase (up to a stopping time) followed by exploitation are necessarily suboptimal. The results hold regardless of whether or not the difference in means between the two arms is known. Besides the main message, we also refine existing deviation inequalities, which allow us to design fully sequential strategies with finite-time regret guarantees that are (a) asymptotically optimal as the horizon grows and (b) order-optimal in the minimax sense. Furthermore we provide empirical evidence that the theory also holds in practice and discuss extensions to non-gaussian and multiple-armed case.

Statistics Theory Machine Learning Statistics Theory

Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits

67 - Ali Yekkehkhany , Ebrahim Arian , Mohammad Hajiesmaili 2019

In this paper, we study multi-armed bandit problems in explore-then-commit setting. In our proposed explore-then-commit setting, the goal is to identify the best arm after a pure experimentation (exploration) phase and exploit it once or for a given finite number of times. We identify that although the arm with the highest expected reward is the most desirable objective for infinite exploitations, it is not necessarily the one that is most probable to have the highest reward in a single or finite-time exploitations. Alternatively, we advocate the idea of risk-aversion where the objective is to compete against the arm with the best risk-return trade-off. Then, we propose two algorithms whose objectives are to select the arm that is most probable to reward the most. Using a new notion of finite-time exploitation regret, we find an upper bound for the minimum number of experiments before commitment, to guarantee an upper bound for the regret. As compared to existing risk-averse bandit algorithms, our algorithms do not rely on hyper-parameters, resulting in a more robust behavior in practice, which is verified by the numerical evaluation.

Machine Learning Machine Learning

Notes on sets of first return

97 - Valerie Berthe , Clelia De Felice , Francesco Dolce andn Dominique Perrin 2013

We prove in this note that, for an alphabet with three letters, the set of first return to a given word in a set satisfying the tree condition is a basis of the free group.

Combinatorics

A Unified Model for the Two-stage Offline-then-Online Resource Allocation

80 - Yifan Xu , Pan Xu , Jianping Pan 2020

With the popularity of the Internet, traditional offline resource allocation has evolved into a new form, called online resource allocation. It features the online arrivals of agents in the system and the real-time decision-making requirement upon the arrival of each online agent. Both offline and online resource allocation have wide applications in various real-world matching markets ranging from ridesharing to crowdsourcing. There are some emerging applications such as rebalancing in bike sharing and trip-vehicle dispatching in ridesharing, which involve a two-stage resource allocation process. The process consists of an offline phase and another sequential online phase, and both phases compete for the same set of resources. In this paper, we propose a unified model which incorporates both offline and online resource allocation into a single framework. Our model assumes non-uniform and known arrival distributions for online agents in the second online phase, which can be learned from historical data. We propose a parameterized linear programming (LP)-based algorithm, which is shown to be at most a constant factor of $1/4$ from the optimal. Experimental results on the real dataset show that our LP-based approaches outperform the LP-agnostic heuristics in terms of robustness and effectiveness.

Artificial Intelligence Data Structures and Algorithms

Directly Estimating the Variance of the {lambda}-Return Using Temporal-Difference Methods

182 - Craig Sherstan , Brendan Bennett , Kenny Young 2018

This paper investigates estimating the variance of a temporal-difference learning agents update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agents value estimates during learning--before terminal outcomes are observed--we must use a different estimation target called the {lambda}-return, which truncates the return with the agents own estimate of the value function. Temporal difference learning methods estimate the expected {lambda}-return for each state, allowing these methods to update online and incrementally, and in most cases achieve better generalization error and faster learning than Monte Carlo methods. Naturally one could attempt to estimate higher-order moments of the {lambda}-return. This paper is about estimating the variance of the {lambda}-return. Prior work has shown that given estimates of the variance of the {lambda}-return, learning systems can be constructed to (1) mitigate risk in action selection, and (2) automatically adapt the parameters of the learning process itself to improve performance. Unfortunately, existing methods for estimating the variance of the {lambda}-return are complex and not well understood empirically. We contribute a method for estimating the variance of the {lambda}-return directly using policy evaluation methods from reinforcement learning. Our approach is significantly simpler than prior methods that independently estimate the second moment of the {lambda}-return. Empirically our new approach behaves at least as well as existing approaches, but is generally more robust.

Artificial Intelligence

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

First return, then explore

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions