We present new algorithms for computing and approximating bisimulation metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an elegant formalism that captures behavioral equivalence between states and provides strong theoretical guarantees on differences in optimal behavior. Unfortunately, their computation is expensive and requires a tabular representation of the states, which has thus far rendered them impractical for large problems. In this paper we present a new version of the metric that is tied to a behavior policy in an MDP, along with an analysis of its theoretical properties. We then present two new algorithms for approximating bisimulation metrics in large, deterministic MDPs. The first does so via sampling and is guaranteed to converge to the true metric. The second is a differentiable loss which allows us to learn an approximation even for continuous-state MDPs, which prior to this work had not been possible.
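As a concrete illustration of the deterministic case this abstract highlights: with a deterministic policy and deterministic transitions, the coupling term of the on-policy metric collapses to the distance between the two successor states, so the metric is the unique fixed point of d(s, t) = |R[s] - R[t]| + γ·d(s', t'). The dense tabular iteration below is a minimal sketch of that fixed point (the paper's sampling variant is not reproduced here), and all array names are assumptions.

```python
import numpy as np

def pi_bisimulation_metric(R, next_state, gamma=0.9, tol=1e-8):
    """Fixed-point iteration for the on-policy bisimulation metric of a
    deterministic MDP under a fixed policy.  Hedged sketch, not the
    paper's sampling algorithm."""
    R = np.asarray(R, dtype=float)            # R[s]: reward of the policy's action in s
    next_state = np.asarray(next_state)       # next_state[s]: deterministic successor of s
    n = len(R)
    d = np.zeros((n, n))
    while True:
        # Deterministic case: the Wasserstein term reduces to d(s', t'),
        # so d'(s, t) = |R[s] - R[t]| + gamma * d(s', t').
        d_new = (np.abs(R[:, None] - R[None, :])
                 + gamma * d[next_state[:, None], next_state[None, :]])
        if np.max(np.abs(d_new - d)) < tol:   # contraction => convergence for gamma < 1
            return d_new
        d = d_new
```

Because the update is a γ-contraction in the sup norm, the loop converges to the true metric regardless of initialization.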
We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions …
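The "distance as a representation-shaping signal" idea can be sketched as a TD-style regression: a learned distance between pairs of states is pushed toward a sampled target of reward difference plus discounted distance between the sampled successors. This is a hedged sketch of the general pattern, not the paper's exact operator; every name below is invented.

```python
import numpy as np

def behavioural_distance_loss(d_pred, r_x, r_y, d_next, gamma=0.99):
    """Squared regression loss over a batch of state pairs (x, y):
    d_pred    - learned distance d(x, y) for each pair
    r_x, r_y  - rewards observed at x and y
    d_next    - (stop-gradient) distance d(x', y') between sampled successors
    """
    target = np.abs(r_x - r_y) + gamma * d_next
    return np.mean((d_pred - target) ** 2)
```

In practice d_pred would come from a differentiable distance on agent embeddings, so minimizing this loss shapes the representation directly.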
We consider deterministic Markov decision processes (MDPs) and apply max-plus algebra tools to approximate the value iteration algorithm by a smaller-dimensional iteration based on a representation on dictionaries of value functions. The setup naturally …
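A core max-plus ingredient behind dictionary representations is projecting a value function onto a set of atoms: under the max-plus (tropical) semiring the best coefficients have a closed form. A minimal sketch, assuming a dense dictionary Phi of value-function atoms; wiring this projection into the reduced value iteration is left out.

```python
import numpy as np

def maxplus_project(V, Phi):
    """Max-plus projection of V onto dictionary atoms Phi[i].
    V:   (S,)   value function on states
    Phi: (K, S) dictionary of K value-function atoms
    Returns coefficients w and the reconstruction max_i (w[i] + Phi[i, s]).
    """
    # Largest w[i] such that w[i] + Phi[i, s] <= V[s] for every state s:
    w = np.min(V[None, :] - Phi, axis=1)
    # Reconstruction: the greatest max-plus combination below V (pointwise).
    V_hat = np.max(w[:, None] + Phi, axis=0)
    return w, V_hat
```

The reconstruction under-approximates V pointwise, which is the standard guarantee of max-plus sub-projections.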
In this paper we present a novel method for learning hierarchical representations of Markov decision processes. Our method works by partitioning the state space into subsets and defining subtasks for performing transitions between the partitions. We …
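To make the partition-to-subtask construction concrete, here is a hedged sketch that enumerates "reach block j from block i" subtasks from observed transitions; the interfaces (a partition map and transition triples) are assumptions, not the paper's API.

```python
def enumerate_subtasks(partition, transitions):
    """partition:   mapping state -> block id (dict or array)
    transitions: iterable of (s, a, s_next) triples from the MDP
    Returns the set of (i, j) block pairs with an observed boundary crossing."""
    subtasks = set()
    for s, _a, s_next in transitions:
        i, j = partition[s], partition[s_next]
        if i != j:                    # transition crosses a partition boundary
            subtasks.add((i, j))      # subtask: "move from block i to block j"
    return sorted(subtasks)
```

Each (i, j) pair then induces a sub-MDP whose terminal states are the entry states of block j, giving the hierarchy the abstract describes.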
Value iteration is a well-known method of solving Markov Decision Processes (MDPs) that is simple to implement and boasts strong theoretical convergence guarantees. However, the computational cost of value iteration quickly becomes infeasible as the …
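For reference, this is the textbook algorithm the abstract starts from, with the O(|A|·|S|²) backup per sweep that drives the infeasibility claim; it is the standard method, not the paper's variant.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Textbook value iteration.
    P: (A, S, S) transition tensor, P[a, s, s'] = Pr(s' | s, a)
    R: (A, S)    expected immediate rewards
    Each sweep costs O(A * S^2), which is what blows up on large state spaces."""
    V = np.zeros(P.shape[-1])
    while True:
        Q = R + gamma * (P @ V)    # Bellman backup for every (a, s) pair
        V_new = Q.max(axis=0)      # greedy improvement over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```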
Model-free reinforcement learning is known to be memory- and computation-efficient and more amenable to large-scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes …
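A representative model-free update in the average-reward setting is RVI-style Q-learning, where an estimate of the gain is subtracted from the target so the Q-values track relative values. This is a generic sketch of that family, not necessarily either of the two algorithms the abstract introduces.

```python
import numpy as np

def rvi_q_update(Q, s, a, r, s_next, alpha=0.1, ref_state=0):
    """One RVI Q-learning step for an average-reward MDP.
    Q: (S, A) table of relative action values; updated in place."""
    gain = np.max(Q[ref_state])               # f(Q): gain estimate at a fixed reference state
    target = r - gain + np.max(Q[s_next])     # average-reward target: no discount factor
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Subtracting the gain keeps the table bounded, playing the role the discount factor plays in the discounted setting.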