We study the sequential decision-making problem in a Markov decision process (MDP) where each policy is evaluated by a set of average rewards computed over different horizon lengths and under different initial state distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that maximizes the smallest value in this set. Leveraging semiparametric efficiency theory from statistics, we develop a policy learning method for estimating the defined robust optimal policy that can efficiently break the curse of horizon under mild technical conditions. A regret bound that is rate-optimal up to a logarithmic factor is established in terms of the number of trajectories and the number of decision points.
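As a rough formalization of this max-min criterion (the notation below is illustrative rather than taken from the paper), write $V_{T,\nu}(\pi)$ for the average reward of a policy $\pi$ over horizon $T$ when the initial state is drawn from $\nu$; the robust optimal policy then solves
\[
\pi^{*} \in \operatorname*{arg\,max}_{\pi \in \Pi} \; \min_{(T,\nu) \in \mathcal{U}} V_{T,\nu}(\pi),
\qquad
V_{T,\nu}(\pi) \;=\; \mathbb{E}^{\pi}_{S_{1} \sim \nu}\Big[\frac{1}{T}\sum_{t=1}^{T} R_{t}\Big],
\]
where $\Pi$ is the pre-specified policy class and $\mathcal{U}$ collects the horizon/initial-distribution pairs of interest. The regret of an estimated policy $\hat{\pi}$ is correspondingly $\min_{(T,\nu)\in\mathcal{U}} V_{T,\nu}(\pi^{*}) - \min_{(T,\nu)\in\mathcal{U}} V_{T,\nu}(\hat{\pi})$.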
We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator …
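For reference, the long-term average reward criterion this snippet refers to is typically defined, in standard notation not quoted from that abstract, as
\[
\eta^{\pi} \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\Big[\sum_{t=1}^{T} R_{t}\Big],
\]
and a doubly robust estimator of $\eta^{\pi}$ combines a modeled (relative) value function with importance-weighting terms so that it remains consistent when either nuisance component is correctly specified.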
We consider synchronizing properties of Markov decision processes (MDPs), viewed as generators of sequences of probability distributions over states. A probability distribution is p-synchronizing if the probability mass is at least p in some state, and …
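In symbols, the definition stated in this snippet reads as follows (with $d$ a probability distribution over the state space $S$): $d$ is $p$-synchronizing if
\[
\max_{s \in S} d(s) \;\ge\; p .
\]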
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton's method, such as a fast rate of convergence, whilst alleviating its drawbacks, such as the computationally expensive calculation or estimation of the Hessian …
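Concretely, the distinction can be sketched in generic notation (not specific to that paper): an exact Newton step updates $\theta_{k+1} = \theta_{k} - H(\theta_{k})^{-1} \nabla f(\theta_{k})$ using the full Hessian $H$, whereas an approximate Newton method replaces $H$ with a cheaper surrogate $\hat{H}_{k}$, for example a Gauss-Newton or diagonal approximation:
\[
\theta_{k+1} \;=\; \theta_{k} \;-\; \alpha_{k}\, \hat{H}_{k}^{-1} \nabla f(\theta_{k}).
\]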
The success of deep reinforcement learning (DRL) is due to the power of learning a representation that is suitable for the underlying exploration and exploitation task. However, existing provable reinforcement learning algorithms with linear function approximation …
Coordination of distributed agents is required for problems arising in many areas, including multi-robot systems, networking and e-commerce. As a formal framework for such problems, we use the decentralized partially observable Markov decision process …
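For context, a decentralized partially observable MDP (Dec-POMDP) is commonly specified, in standard notation not quoted from that snippet, by a tuple
\[
\big\langle I,\; S,\; \{A_i\}_{i \in I},\; P,\; R,\; \{\Omega_i\}_{i \in I},\; O \big\rangle,
\]
where $I$ is the set of agents, $S$ the state space, $A_i$ agent $i$'s actions, $P(s' \mid s, \vec{a})$ the joint transition kernel, $R(s, \vec{a})$ the shared reward, $\Omega_i$ agent $i$'s observations, and $O(\vec{o} \mid s', \vec{a})$ the joint observation probability.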