In this work, we study constrained discrete-time Markov decision processes (MDPs) with Borel state and action spaces, where all performance functions take the form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDP. We show that the values of the constrained control problem and the associated convex program coincide, and that if the convex program has an optimal solution, then there exists a stationary randomized policy that is optimal for the MDP. We also show that, in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies equals the supremum of the expected total rewards over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to cover cases that have not yet been addressed in the literature. An example is presented to illustrate our results with respect to those of the literature.
This paper describes the structure of solutions to Kolmogorov's equations for nonhomogeneous jump Markov processes and applications of these results to the control of jump stochastic systems. These equations were studied by Feller (1940), who clarified in 1945 in the errata to that paper that some of its results covered only nonexplosive Markov processes. We present the results for possibly explosive Markov processes. The paper is based on the invited talk presented by the authors at the International Conference dedicated to the 200th anniversary of the birth of P. L. Chebyshev.
In this paper, we consider the finite-horizon optimal stopping problem on semi-Markov processes (SMPs), and aim to establish the existence and computation of optimal stopping times. To achieve this goal, we first extend the main results on finite-horizon semi-Markov decision processes (SMDPs) to the case with additional terminal costs, introduce an explicit construction of SMDPs, and prove the equivalence between the optimal stopping problems on SMPs and SMDPs. Then, using this equivalence and the results on SMDPs developed here, we not only show the existence of an optimal stopping time for SMPs, but also provide an algorithm for computing it. Moreover, we show that the optimal and ε-optimal stopping times can each be characterized as the hitting time of a special set.
We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
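To make the phrase "using the temporal-difference error ... when updating the estimate of the average reward" concrete, the following is a minimal tabular sketch of one such control update. It is an illustration under stated assumptions, not a reproduction of the algorithms analyzed in the paper: the function name `differential_q_step`, the step-size parameters `alpha` and `eta`, and the tabular representation are all hypothetical choices made here.

```python
import numpy as np

def differential_q_step(q, avg_reward, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One tabular update in which the average-reward estimate is driven
    by the TD error itself (hypothetical sketch, not the paper's code).

    q          : array of shape (n_states, n_actions), updated in place
    avg_reward : current scalar estimate of the average reward
    """
    # TD error for the average-reward setting: the reward is measured
    # relative to the current average-reward estimate.
    delta = r - avg_reward + np.max(q[s_next]) - q[s, a]
    # Value update uses the TD error as usual.
    q[s, a] += alpha * delta
    # The average-reward estimate is also moved by the TD error,
    # rather than by the "conventional" error (r - avg_reward).
    avg_reward += eta * alpha * delta
    return q, avg_reward, delta

# Usage: a single off-policy transition (s=0, a=1, r=1.0, s'=1).
q = np.zeros((2, 2))
q, avg_reward, delta = differential_q_step(q, 0.0, 0, 1, 1.0, 1, alpha=0.5)
```

Note that no reference state appears in the update; the transition can come from any behavior policy, which is what makes the sketch off-policy.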
We consider the batch (off-line) policy learning problem in the infinite-horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency given multiple trajectories collected under some behavior policy. Based on the proposed estimator, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. To the best of our knowledge, this is the first regret bound for batch policy learning in the infinite time horizon setting. The performance of the method is illustrated by simulation studies.
We present a convex-concave reformulation of the reversible Markov chain estimation problem and outline an efficient numerical scheme for the solution of the resulting problem based on a primal-dual interior point method for monotone variational inequalities. Extensions to situations in which information about the stationary vector is available can also be solved via the convex-concave reformulation. The method can be generalized and applied to the discrete transition matrix reweighting analysis method to perform inference from independent chains with specified couplings between the stationary probabilities. The proposed approach offers a significant speed-up compared to a fixed-point iteration for a number of relevant applications.
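For orientation, the fixed-point iteration used as the baseline can be sketched as follows. This is not the interior point method proposed in the paper; it is the self-consistent iteration commonly used for reversible maximum-likelihood estimation from a transition count matrix, and it is assumed here to be the kind of scheme the comparison refers to. The function name `reversible_mle_fixed_point`, its parameters, and the requirement that every state has at least one observed transition are assumptions of this sketch.

```python
import numpy as np

def reversible_mle_fixed_point(counts, n_iter=1000, tol=1e-12):
    """Estimate a reversible transition matrix from transition counts by
    fixed-point iteration on a symmetric matrix x with x_ij = pi_i * T_ij.

    Assumes every row of `counts` has a positive sum (hypothetical sketch).
    """
    c = np.asarray(counts, dtype=float)
    c_sym = c + c.T                # symmetrized counts c_ij + c_ji
    c_row = c.sum(axis=1)          # total transitions observed out of state i
    x = c_sym / c_sym.sum()        # symmetric initial guess
    for _ in range(n_iter):
        x_row = x.sum(axis=1)
        # Update: x_ij <- (c_ij + c_ji) / (c_i / x_i + c_j / x_j)
        ratio = c_row / x_row
        x_new = c_sym / (ratio[:, None] + ratio[None, :])
        x_new /= x_new.sum()       # fix the arbitrary overall scale
        converged = np.abs(x_new - x).max() < tol
        x = x_new
        if converged:
            break
    pi = x.sum(axis=1)             # stationary distribution (sums to 1)
    T = x / pi[:, None]            # row-stochastic transition matrix
    return T, pi

# Usage with a small, made-up count matrix.
counts = np.array([[5, 2, 0],
                   [1, 4, 3],
                   [0, 2, 6]])
T, pi = reversible_mle_fixed_point(counts)
```

Because `x` stays symmetric throughout, the returned pair satisfies detailed balance, `pi[i] * T[i, j] == pi[j] * T[j, i]`, up to floating-point error.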