Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of a given feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing the generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bounds together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
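As a point of reference for how such feature-based algorithms typically operate, the sketch below shows an elliptical confidence-width bonus built from a $d$-dimensional feature map and a ridge-regularized design matrix. It is a minimal illustration only: the function names, the constant `beta`, and the identity initialization are assumptions, not details taken from the paper.

```python
import numpy as np

def ucb_bonus(phi_sa, Lambda_inv, beta):
    """Optimistic exploration bonus beta * sqrt(phi^T Lambda^{-1} phi)
    for a state-action feature vector phi_sa of shape (d,)."""
    return beta * np.sqrt(phi_sa @ Lambda_inv @ phi_sa)

def update_design_matrix(Lambda, phi_sa):
    """Rank-one update Lambda <- Lambda + phi phi^T after visiting (s, a)."""
    return Lambda + np.outer(phi_sa, phi_sa)

# Illustrative usage with placeholder features (d = 4 is arbitrary).
d = 4
Lambda = np.eye(d)                      # ridge-regularized design matrix
phi = np.random.rand(d)                 # phi(s, a) from the assumed feature mapping
bonus = ucb_bonus(phi, np.linalg.inv(Lambda), beta=1.0)
Lambda = update_design_matrix(Lambda, phi)
```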
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
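For intuition about the Bernstein-type bonus mentioned above, the snippet below sketches its usual shape: a variance-dependent square-root term plus a lower-order $1/n$ term scaled by the effective horizon $1/(1-\gamma)$. The constants and the failure probability `delta` are illustrative, not the paper's exact choices.

```python
import numpy as np

def bernstein_bonus(var_hat, n, gamma, delta=0.05):
    """Bernstein-style exploration bonus from an empirical variance estimate.

    var_hat : empirical variance of the next-state value under the estimated model
    n       : visit count of the (state, action) pair
    gamma   : discount factor; 1 / (1 - gamma) is the effective horizon
    The constants are placeholders; only the overall shape of the bonus matters here.
    """
    log_term = np.log(1.0 / delta)
    h_eff = 1.0 / (1.0 - gamma)
    return np.sqrt(2.0 * var_hat * log_term / n) + h_eff * log_term / n
```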
We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions and indicates that asymptotic per-period regret is no greater than $\epsilon$, independent of horizon. To our knowledge, this is the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.
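To make the aggregated-state setup concrete, here is a minimal sketch of keeping the Q-table over $M$ aggregate states rather than raw states, with optimistic initialization to the horizon $H$. The mapping `aggregate` and the function names are hypothetical stand-ins, not the paper's notation.

```python
import numpy as np

def init_q_table(M, A, H):
    """Q-values indexed by (step, aggregate state, action), optimistically set to H."""
    return np.full((H, M, A), float(H))

def greedy_action(Q, h, raw_state, aggregate):
    """Act greedily with respect to the aggregate-state Q-values."""
    m = aggregate(raw_state)            # raw state -> one of M aggregate states
    return int(np.argmax(Q[h, m]))
```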
Solving real-life sequential decision-making problems under partial observability involves an exploration-exploitation problem. To be successful, an agent needs to efficiently gather valuable information about the state of the world for making rewarding decisions. However, in real life, acquiring valuable information is often highly costly; in the medical domain, for example, information acquisition might correspond to performing a medical test on a patient. This poses a significant challenge for the agent: it must perform optimally on the task while reducing the cost of information acquisition. In this paper, we propose a model-based reinforcement learning framework that learns an active feature acquisition policy to solve the exploration-exploitation problem during its execution. Key to its success is a novel sequential variational auto-encoder that learns high-quality representations from partially observed states, which are then used by the policy to maximize the task reward in a cost-efficient manner. We demonstrate the efficacy of our proposed framework in a control domain as well as with a medical simulator. In both tasks, our proposed method outperforms conventional baselines and results in policies with greater cost efficiency.
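The following sketch illustrates the kind of cost-sensitive observation interface such a framework operates on: the agent decides which features to reveal, pays a per-feature acquisition cost, and sees only a masked view of the state. The class and attribute names are hypothetical and not taken from the paper.

```python
import numpy as np

class CostlyObservation:
    """Partially observed state where each feature must be paid for to be revealed."""

    def __init__(self, full_state, feature_costs):
        self.full_state = np.asarray(full_state, dtype=float)
        self.feature_costs = np.asarray(feature_costs, dtype=float)
        self.mask = np.zeros(self.full_state.shape, dtype=bool)

    def acquire(self, idx):
        """Reveal feature `idx` and return the acquisition cost incurred."""
        self.mask[idx] = True
        return float(self.feature_costs[idx])

    def observed(self):
        """Current view of the state: unacquired features are masked with NaN."""
        return np.where(self.mask, self.full_state, np.nan)
```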
This work extends the analysis of the theoretical results presented in the paper \emph{Is Q-Learning Provably Efficient?} by Jin et al. We include a survey of related research to contextualize the need for strengthening the theoretical guarantees of one of the most important threads of model-free reinforcement learning. We also expound upon the reasoning used in the proofs to highlight the critical steps leading to the main result, which shows that Q-learning with UCB exploration achieves a sample efficiency that matches the optimal regret achievable by any model-based approach.
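For readers who want the update being analyzed in concrete form, the snippet below sketches one step of Q-learning with a UCB (Hoeffding-style) bonus and the $(H+1)/(H+t)$ learning-rate schedule studied by Jin et al.; the constant `c` and the log factor `iota` stand in for the paper's absolute constant and logarithmic term.

```python
import numpy as np

def ucb_hoeffding_update(Q, N, h, s, a, r, v_next, H, c=1.0, iota=1.0):
    """One Q-learning update with UCB exploration on the (h, s, a) entry.

    v_next is the value estimate of the next state at step h + 1;
    c and iota are placeholders for the analysis constants.
    """
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1) / (H + t)                  # learning-rate schedule from the analysis
    bonus = c * np.sqrt(H**3 * iota / t)       # Hoeffding-style exploration bonus
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
    return Q
```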
In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain predefined reward set. In this paper, we study GAIL in both online and offline settings with linear function approximation, where both the transition and the reward function are linear in the feature maps. Besides the expert demonstration, in the online setting the agent can interact with the environment, while in the offline setting the agent only accesses an additional dataset collected a priori. For online GAIL, we propose an optimistic generative adversarial policy optimization algorithm (OGAP) and prove that OGAP achieves $\widetilde{\mathcal{O}}(H^2 d^{3/2} K^{1/2} + K H^{3/2} d N_1^{-1/2})$ regret. Here $N_1$ represents the number of trajectories of the expert demonstration, $d$ is the feature dimension, and $K$ is the number of episodes. For offline GAIL, we propose a pessimistic generative adversarial policy optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain the optimality gap of PGAP, achieving the minimax lower bound in the utilization of the additional dataset. Assuming sufficient coverage of the additional dataset, we show that PGAP achieves an $\widetilde{\mathcal{O}}(H^{2} d K^{-1/2} + H^2 d^{3/2} N_2^{-1/2} + H^{3/2} d N_1^{-1/2})$ optimality gap. Here $N_2$ represents the number of trajectories of the additional dataset with sufficient coverage.
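To give a feel for the pessimism used in the offline setting, the sketch below subtracts an elliptical uncertainty penalty from a linear value estimate, the general construction that offline algorithms with linear function approximation rely on; the parameter `beta` and the regularization behind `Lambda_inv` are illustrative choices rather than PGAP's exact definitions.

```python
import numpy as np

def pessimistic_value(phi_sa, w_hat, Lambda_inv, beta):
    """Linear value estimate phi^T w_hat minus an uncertainty penalty
    beta * sqrt(phi^T Lambda^{-1} phi) built from the offline dataset."""
    point_estimate = phi_sa @ w_hat
    penalty = beta * np.sqrt(phi_sa @ Lambda_inv @ phi_sa)
    return point_estimate - penalty
```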