No Arabic abstract
We study the synthesis of a policy in a Markov decision process (MDP) following which an agent reaches a target state in the MDP while minimizing its total discounted cost. The problem combines a reachability criterion with a discounted cost criterion and naturally expresses the completion of a task with probabilistic guarantees and optimal transient performance. We first establish that an optimal policy for the considered formulation may not exist but that there always exists a near-optimal stationary policy. We additionally provide a necessary and sufficient condition for the existence of an optimal policy. We then restrict our attention to stationary deterministic policies and show that the decision problem associated with the synthesis of an optimal stationary deterministic policy is NP-complete. Finally, we provide an exact algorithm based on mixed-integer linear programming and propose an efficient approximation algorithm based on linear programming for the synthesis of an optimal stationary deterministic policy.
We analyze the infinite horizon minimax average cost Markov Control Model (MCM), for a class of controlled process conditional distributions, which belong to a ball, with respect to total variation distance metric, centered at a known nominal controlled conditional distribution with radius $Rin [0,2]$, in which the minimization is over the control strategies and the maximization is over conditional distributions. Upon performing the maximization, a dynamic programming equation is obtained which includes, in addition to the standard terms, the oscillator semi-norm of the cost-to-go. First, the dynamic programming equation is analyzed for finite state and control spaces. We show that if the nominal controlled process distribution is irreducible, then for every stationary Markov control policy the maximizing conditional distribution of the controlled process is also irreducible for $R in [0,R_{max}]$. Second, the generalized dynamic programming is analyzed for Borel spaces. We derive necessary and sufficient conditions for any control strategy to be optimal. Through our analysis, new dynamic programming equations and new policy iteration algorithms are derived. The main feature of the new policy iteration algorithms (which are applied for finite alphabet spaces) is that the policy evaluation and policy improvement steps are performed by using the maximizing conditional distribution, which is obtained via a water filling solution. Finally, the application of the new dynamic programming equations and the corresponding policy iteration algorithms are shown via illustrative examples.
The aim of this paper is to address optimality of stochastic control strategies via dynamic programming subject to total variation distance ambiguity on the conditional distribution of the controlled process. We formulate the stochastic control problem using minimax theory, in which the control minimizes the pay-off while the conditional distribution, from the total variation distance set, maximizes it. First, we investigate the maximization of a linear functional on the space of probability measures on abstract spaces, among those probability measures which are within a total variation distance from a nominal probability measure, and then we give the maximizing probability measure in closed form. Second, we utilize the solution of the maximization to solve minimax stochastic control with deterministic control strategies, under a Markovian and a non-Markovian assumption, on the conditional distributions of the controlled process. The results of this part include: 1) Minimax optimization subject to total variation distance ambiguity constraint; 2) new dynamic programming recursions, which involve the oscillator seminorm of the value function, in addition to the standard terms; 3) new infinite horizon discounted dynamic programming equation, the associated contractive property, and a new policy iteration algorithm. Finally, we provide illustrative examples for both the finite and infinite horizon cases. For the infinite horizon case we invoke the new policy iteration algorithm to compute the optimal strategies.
We consider the linear quadratic Gaussian control problem with a discounted cost functional for descriptor systems on the infinite time horizon. Based on recent results from the deterministic framework, we characterize the feasibility of this problem using a linear matrix inequality. In particular, conditions for existence and uniqueness of optimal controls are derived, which are weaker compared to the standard approaches in the literature. We further show that also for the stochastic problem, the optimal control is given in terms of the stabilizing solution of the Lure equation, which generalizes the algebraic Riccati equation.
IC3 has been a leap forward in symbolic model checking. This paper proposes PrIC3 (pronounced pricy-three), a conservative extension of IC3 to symbolic model checking of MDPs. Our main focus is to develop the theory underlying PrIC3. Alongside, we present a first implementation of PrIC3 including the key ingredients from IC3 such as generalization, repushing, and propagation.
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$gamma$, which is based on the emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$gamma$ achieves an $tilde{O}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $tilde{Omega}big({sqrt{SAT}}/{(1-gamma)^{1.5}}big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$gamma$ is nearly minimax optimal for discounted MDPs.