Value iteration is a well-known method for solving Markov Decision Processes (MDPs) that is simple to implement and boasts strong theoretical convergence guarantees. However, the computational cost of value iteration quickly becomes infeasible as the size of the state space increases. Various methods have been proposed to overcome this issue for value iteration in MDPs with large state and action spaces, often, however, at the price of generalizability and algorithmic simplicity. In this paper, we propose an intuitive algorithm for solving MDPs that reduces the cost of value iteration updates by dynamically grouping together states with similar cost-to-go values. We also prove that our algorithm converges almost surely to within $2\varepsilon / (1 - \gamma)$ of the true optimal value in the $\ell^\infty$ norm, where $\gamma$ is the discount factor and aggregated states differ by at most $\varepsilon$. Numerical experiments on a variety of simulated environments confirm the robustness of our algorithm and its ability to solve MDPs with much cheaper updates, especially as the scale of the MDP problem increases.
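A minimal sketch of the aggregation idea as the abstract describes it, assuming a small tabular MDP with known transitions `P` and rewards `R` (all names, and the choice of one representative backup per group, are illustrative assumptions, not the paper's actual algorithm): states whose current value estimates fall in the same $\varepsilon$-wide bucket share a single Bellman backup per sweep.

```python
import numpy as np

def aggregated_value_iteration(P, R, gamma=0.95, eps=0.1, iters=200):
    """Value iteration with dynamic state aggregation (illustrative sketch).

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    States whose current values fall in the same eps-wide bucket share
    one representative Bellman backup per sweep.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Group states by eps-wide buckets of their current value estimates.
        buckets = np.floor(V / eps).astype(int)
        V_new = np.empty(S)
        for b in np.unique(buckets):
            members = np.where(buckets == b)[0]
            rep = members[0]  # one representative backup per group (an assumption)
            q = R[rep] + gamma * P[rep] @ V  # action values at the representative, shape (A,)
            V_new[members] = q.max()
        V = V_new
    return V

# Tiny random MDP as a smoke test.
rng = np.random.default_rng(0)
S, A = 50, 4
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.random((S, A))
print(aggregated_value_iteration(P, R)[:5])
```

With wide buckets, each sweep costs far fewer backups than standard value iteration; as the estimates spread out, the grouping refines automatically.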
This paper presents the first \emph{model-free}, \emph{simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q.
We present new algorithms for computing and approximating bisimulation metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an elegant formalism that captures behavioral equivalence between states and provides strong theoretical guarantees.
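For context, the metric in question is the fixed point $d(s,t) = \max_a \big( |R(s,a) - R(t,a)| + \gamma\, W_1(d)(P(\cdot \mid s,a), P(\cdot \mid t,a)) \big)$. Below is a sketch of the classical fixed-point iteration on a finite MDP (following Ferns et al., not the new algorithms this abstract announces), using the POT library's `ot.emd2` for the Kantorovich term; the array shapes are assumptions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

def bisimulation_metric(P, R, gamma=0.9, iters=50):
    """Classical fixed-point iteration for the bisimulation metric
    d(s,t) = max_a |R[s,a] - R[t,a]| + gamma * W1_d(P[s,a], P[t,a])
    on a finite MDP. P: (S, A, S), R: (S, A)."""
    S, A, _ = P.shape
    d = np.zeros((S, S))
    for _ in range(iters):
        d_new = np.zeros((S, S))
        for s in range(S):
            for t in range(s + 1, S):
                best = 0.0
                for a in range(A):
                    # Kantorovich (1-Wasserstein) distance under ground metric d.
                    w = ot.emd2(P[s, a], P[t, a], d)
                    best = max(best, abs(R[s, a] - R[t, a]) + gamma * w)
                d_new[s, t] = d_new[t, s] = best
        d = d_new
    return d

rng = np.random.default_rng(0)
S, A = 6, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.random((S, A))
print(bisimulation_metric(P, R).round(2))
```

The cubic cost in the number of states, plus one optimal-transport solve per state pair and action, is exactly what makes faster computation and approximation schemes worthwhile.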
In this paper we present a novel method for learning hierarchical representations of Markov decision processes. Our method works by partitioning the state space into subsets and defining subtasks for performing transitions between the partitions.
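A toy rendering of the stated recipe, assuming a deterministic MDP given as a transition table (the partition is supplied by hand here; the paper's partitioning method and subtask semantics may differ): each ordered pair of adjacent blocks yields one subtask, "reach the neighboring block".

```python
def build_subtasks(transitions, partition):
    """Given deterministic transitions {state: {action: next_state}} and a
    partition {state: block_id}, define one subtask per ordered pair of
    adjacent blocks: reach any entry state of the target block from the
    source block. (Illustrative construction only.)"""
    subtasks = {}
    for s, moves in transitions.items():
        for a, s2 in moves.items():
            b1, b2 = partition[s], partition[s2]
            if b1 != b2:
                subtasks.setdefault((b1, b2), set()).add(s2)
    return subtasks  # (source_block, target_block) -> entry states

# 4-state chain split into two blocks {0, 1} and {2, 3}.
T = {0: {'r': 1}, 1: {'r': 2, 'l': 0}, 2: {'r': 3, 'l': 1}, 3: {'l': 2}}
blocks = {0: 0, 1: 0, 2: 1, 3: 1}
print(build_subtasks(T, blocks))  # {(0, 1): {2}, (1, 0): {1}}
```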
We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents.
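One common way such a distance shapes representations, sketched here as an assumption about the general recipe rather than this paper's specific loss: regress embedding distances onto the behavioural distance over sampled state pairs.

```python
import numpy as np

def representation_shaping_loss(phi, d_target, pairs):
    """Auxiliary loss pulling embedding distances toward a behavioural
    distance (sketch; the paper's distance and loss may differ).

    phi:      (S, k) state embeddings
    d_target: (S, S) behavioural distances between states
    pairs:    list of (s, t) state index pairs to match
    """
    loss = 0.0
    for s, t in pairs:
        emb_dist = np.linalg.norm(phi[s] - phi[t])
        loss += (emb_dist - d_target[s, t]) ** 2
    return loss / len(pairs)

rng = np.random.default_rng(1)
phi = rng.normal(size=(10, 4))         # stand-in embeddings
d = np.abs(rng.normal(size=(10, 10)))  # stand-in behavioural distances
d = (d + d.T) / 2
print(representation_shaping_loss(phi, d, [(0, 1), (2, 7), (4, 9)]))
```

In a deep RL agent this term would be minimized alongside the usual control loss, with `phi` produced by the network's encoder.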
We consider deterministic Markov decision processes (MDPs) and apply max-plus algebra tools to approximate the value iteration algorithm by a smaller-dimensional iteration based on a representation over dictionaries of value functions.
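Under one reading of this setup (the names and dictionary choice are mine, not the paper's): $V$ is stored as a max-plus combination $V(x) \approx \max_i (\lambda_i + z_i(x))$ over a dictionary $\{z_i\}$, the Bellman operator of the deterministic MDP is applied, and the result is projected back via $\lambda_i = \min_x (TV(x) - z_i(x))$, so the iteration runs on the $m$ dictionary coefficients rather than on the full state space.

```python
import numpy as np

def maxplus_vi(next_state, reward, Z, gamma=0.9, iters=100):
    """Reduced value iteration in a max-plus dictionary (illustrative).

    next_state: (S, A) deterministic successor indices
    reward:     (S, A) rewards
    Z:          (m, S) dictionary of basis value functions z_i
    Iterates coefficients lam so that V(x) ~= max_i lam[i] + Z[i, x].
    """
    m, S = Z.shape
    lam = np.zeros(m)
    for _ in range(iters):
        V = (lam[:, None] + Z).max(axis=0)                  # decode V from coefficients
        TV = (reward + gamma * V[next_state]).max(axis=1)   # Bellman backup, shape (S,)
        lam = (TV[None, :] - Z).min(axis=1)                 # max-plus projection
    return lam

# Tiny deterministic chain MDP and a coarse dictionary of distance-like bases.
S = 20
ns = np.stack([np.minimum(np.arange(S) + 1, S - 1),
               np.maximum(np.arange(S) - 1, 0)], axis=1)  # right / left moves
rw = np.where(ns == S - 1, 1.0, 0.0)                      # reward for reaching the end
Z = np.array([[-abs(x - c) for x in range(S)] for c in range(0, S, 4)], dtype=float)
lam = maxplus_vi(ns, rw, Z)
print((lam[:, None] + Z).max(axis=0))  # approximate value function
```

The projection step yields the greatest max-plus combination below $TV$, so the reduced iteration under-approximates exact value iteration while touching only $m$ coefficients per sweep.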