New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Provably Efficient Reinforcement Learning with Aggregated States

113 0 0.0 ( 0 )

Download Cite

Added by Shi Dong

Publication date 2019

fields Mathematical Statistics Informatics Engineering

and research's language is English

Authors Shi Dong - Benjamin Van Roy - Zhengyuan Zhou

Machine Learning Machine Learning Optimization and Control

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $tilde{mathcal{O}}(sqrt{H^5 M K} + epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions and indicates that asymptotic per-period regret is no greater than $epsilon$, independent of horizon. To our knowledge, this is the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.

rate research

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

135 - Dongruo Zhou , Jiafan He , Quanquan Gu 2020

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrt{T}/(1-gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing the generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by $Omega(dsqrt{T}/(1-gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-gamma)^{-0.5}$ factor.

Machine Learning Artificial Intelligence Optimization and Control

Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

457 - Morteza Ibrahimi , Adel Javanmard , Benjamin Van Roy 2013

We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of ${O}(sqrt{T})$ was shown, apart form logarithmic factors. However, this bound scales exponentially with $p$, the dimension of the state space. In this work we consider the case where the matrices describing the dynamic of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of ${O}(p sqrt{T})$, apart from logarithmic factors. In particular, our algorithm has an average cost of $(1+eps)$ times the optimum cost after $T = polylog(p) O(1/eps^2)$. This is in comparison to previous work on the dense dynamics where the algorithm requires time that scales exponentially with dimension in order to achieve regret of $eps$ times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.

Machine Learning Machine Learning Optimization and Control

Provably Efficient Cooperative Multi-Agent Reinforcement Learning with Function Approximation

105 - Abhimanyu Dubey , Alex Pentland 2021

Reinforcement learning in cooperative multi-agent settings has recently advanced significantly in its scope, with applications in cooperative estimation for advertising, dynamic treatment regimes, distributed control, and federated learning. In this paper, we discuss the problem of cooperative multi-agent RL with function approximation, where a group of agents communicates with each other to jointly solve an episodic MDP. We demonstrate that via careful message-passing and cooperative value iteration, it is possible to achieve near-optimal no-regret learning even with a fixed constant communication budget. Next, we demonstrate that even in heterogeneous cooperative settings, it is possible to achieve Pareto-optimal no-regret learning with limited communication. Our work generalizes several ideas from the multi-agent contextual and multi-armed bandit literature to MDPs and reinforcement learning.

Machine Learning Multiagent Systems Machine Learning

Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints

97 - Tianhao Wang , Dongruo Zhou , Quanquan Gu 2021

We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: batch learning model and rare policy switch model, and propose two efficient online RL algorithms for linear Markov decision processes. In specific, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $tilde O(sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $sqrt{T/dH}$ batches to obtain $tilde O(sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $tilde O(sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dHlog T$ policy switches suffice to obtain the $tilde O(sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity.

Machine Learning Optimization and Control Machine Learning

Is Q-Learning Provably Efficient? An Extended Analysis

201 - Kushagra Rastogi , Jonathan Lee , Fabrice Harel-Canada 2020

This work extends the analysis of the theoretical results presented within the paper Is Q-Learning Provably Efficient? by Jin et al. We include a survey of related research to contextualize the need for strengthening the theoretical guarantees related to perhaps the most important threads of model-free reinforcement learning. We also expound upon the reasoning used in the proofs to highlight the critical steps leading to the main result showing that Q-learning with UCB exploration achieves a sample efficiency that matches the optimal regret that can be achieved by any model-based approach.

Machine Learning Artificial Intelligence Optimization and Control

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Provably Efficient Reinforcement Learning with Aggregated States

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions