On Minimizing Total Discounted Cost in MDPs Subject to Reachability Constraints

Published by Yagiz Savas

Publication date: 2021

Language of the paper: English

Authors: Yagiz Savas - Christos K. Verginis - Michael Hibbard

Optimization and Control

We study the synthesis of a policy in a Markov decision process (MDP) following which an agent reaches a target state in the MDP while minimizing its total discounted cost. The problem combines a reachability criterion with a discounted cost criterion and naturally expresses the completion of a task with probabilistic guarantees and optimal transient performance. We first establish that an optimal policy for the considered formulation may not exist but that there always exists a near-optimal stationary policy. We additionally provide a necessary and sufficient condition for the existence of an optimal policy. We then restrict our attention to stationary deterministic policies and show that the decision problem associated with the synthesis of an optimal stationary deterministic policy is NP-complete. Finally, we provide an exact algorithm based on mixed-integer linear programming and propose an efficient approximation algorithm based on linear programming for the synthesis of an optimal stationary deterministic policy.
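To make the non-existence claim concrete, here is a minimal sketch on a hypothetical two-state MDP (not taken from the paper). It assumes the reachability requirement is to reach the target with probability one, a discount factor of 0.9, and two actions in the start state: "go" (cost 2, moves to the target) and "wait" (cost 0, stays in place). The function name and all numbers are illustrative assumptions.

```python
def evaluate_stationary_policy(p_go, gamma=0.9, go_cost=2.0):
    """Evaluate a stationary randomized policy on a hypothetical two-state MDP.

    In the start state the policy picks "go" with probability p_go and
    "wait" otherwise. Returns (expected discounted cost, probability of
    eventually reaching the target), both measured from the start state.
    """
    if not 0.0 <= p_go <= 1.0:
        raise ValueError("p_go must be a probability")
    # Bellman equation for the start state under this policy:
    #   V = p_go * go_cost + (1 - p_go) * gamma * V
    cost = p_go * go_cost / (1.0 - gamma * (1.0 - p_go))
    # Any positive p_go leaves the start state eventually with probability one.
    reach = 1.0 if p_go > 0.0 else 0.0
    return cost, reach


for p in (1.0, 0.1, 0.01, 0.001, 0.0):
    cost, reach = evaluate_stationary_policy(p)
    print(f"p_go={p:<6} cost={cost:.4f} reach_probability={reach}")
```

Under these assumptions every stationary policy with p_go > 0 reaches the target almost surely, and its cost shrinks toward 0 as p_go shrinks. A cost of exactly 0 is attained only by never leaving the start state, which violates reachability. This is consistent with the abstract's statement that a near-optimal stationary policy exists even when no optimal policy does.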

Ioannis Tzortzis, Charalambos D. Charalambous, Themistoklis Charalambous, 2015

We analyze the infinite horizon minimax average cost Markov Control Model (MCM), for a class of controlled process conditional distributions, which belong to a ball, with respect to total variation distance metric, centered at a known nominal controlled conditional distribution with radius $R \in [0,2]$, in which the minimization is over the control strategies and the maximization is over conditional distributions. Upon performing the maximization, a dynamic programming equation is obtained which includes, in addition to the standard terms, the oscillator semi-norm of the cost-to-go. First, the dynamic programming equation is analyzed for finite state and control spaces. We show that if the nominal controlled process distribution is irreducible, then for every stationary Markov control policy the maximizing conditional distribution of the controlled process is also irreducible for $R \in [0,R_{\max}]$. Second, the generalized dynamic programming is analyzed for Borel spaces. We derive necessary and sufficient conditions for any control strategy to be optimal. Through our analysis, new dynamic programming equations and new policy iteration algorithms are derived. The main feature of the new policy iteration algorithms (which are applied for finite alphabet spaces) is that the policy evaluation and policy improvement steps are performed by using the maximizing conditional distribution, which is obtained via a water filling solution. Finally, the application of the new dynamic programming equations and the corresponding policy iteration algorithms are shown via illustrative examples.

Optimization and Control

Dynamic Programming Subject to Total Variation Distance Ambiguity

Ioannis Tzortzis, Charalambos D. Charalambous, Themistoklis Charalambous, 2014

The aim of this paper is to address optimality of stochastic control strategies via dynamic programming subject to total variation distance ambiguity on the conditional distribution of the controlled process. We formulate the stochastic control problem using minimax theory, in which the control minimizes the pay-off while the conditional distribution, from the total variation distance set, maximizes it. First, we investigate the maximization of a linear functional on the space of probability measures on abstract spaces, among those probability measures which are within a total variation distance from a nominal probability measure, and then we give the maximizing probability measure in closed form. Second, we utilize the solution of the maximization to solve minimax stochastic control with deterministic control strategies, under a Markovian and a non-Markovian assumption, on the conditional distributions of the controlled process. The results of this part include: 1) Minimax optimization subject to total variation distance ambiguity constraint; 2) new dynamic programming recursions, which involve the oscillator seminorm of the value function, in addition to the standard terms; 3) new infinite horizon discounted dynamic programming equation, the associated contractive property, and a new policy iteration algorithm. Finally, we provide illustrative examples for both the finite and infinite horizon cases. For the infinite horizon case we invoke the new policy iteration algorithm to compute the optimal strategies.

Optimization and Control
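For a finite state space, the maximization over the total variation ball described in the abstract above can be sketched as a greedy mass transfer. This is an illustrative reading, not the paper's closed-form expression. The function name `tv_worst_case`, the convention that the distance is the sum of absolute differences (so the radius lies in [0, 2]), and the sample numbers are all assumptions.

```python
def tv_worst_case(nominal, values, radius):
    """Maximize sum(values[i] * nu[i]) over probability vectors nu with
    sum(|nu[i] - nominal[i]|) <= radius, on a finite state space.

    Assumes `nominal` is a probability vector and 0 <= radius <= 2.
    """
    n = len(nominal)
    best = max(range(n), key=lambda i: values[i])
    # At most radius/2 mass can be moved, and no more than lies outside `best`.
    budget = min(radius / 2.0, 1.0 - nominal[best])
    nu = list(nominal)
    nu[best] += budget
    # Drain the moved mass from the lowest-value states first.
    for i in sorted(range(n), key=lambda i: values[i]):
        if i == best or budget <= 0.0:
            continue
        taken = min(nu[i], budget)
        nu[i] -= taken
        budget -= taken
    return nu


nominal = [0.5, 0.3, 0.2]   # hypothetical nominal distribution
values = [1.0, 2.0, 3.0]    # hypothetical cost-to-go per state
nu = tv_worst_case(nominal, values, radius=0.4)
worst = sum(v * p for v, p in zip(values, nu))
print(nu, worst)
```

When the lowest-value state holds at least radius/2 of the nominal mass, the gain over the nominal expectation equals radius/2 times the difference between the largest and smallest value, which suggests how an oscillator-seminorm term can enter the resulting dynamic programming recursion.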

Discounted Cost Linear Quadratic Gaussian Control for Descriptor Systems

Hermann Mena, Lena-Maria Pfurtscheller, Matthias Voigt, 2020

We consider the linear quadratic Gaussian control problem with a discounted cost functional for descriptor systems on the infinite time horizon. Based on recent results from the deterministic framework, we characterize the feasibility of this problem using a linear matrix inequality. In particular, conditions for existence and uniqueness of optimal controls are derived, which are weaker compared to the standard approaches in the literature. We further show that also for the stochastic problem, the optimal control is given in terms of the stabilizing solution of the Lure equation, which generalizes the algebraic Riccati equation.

Optimization and Control

PrIC3: Property Directed Reachability for MDPs

Kevin Batz, Sebastian Junges, Benjamin Lucien Kaminski, 2020

IC3 has been a leap forward in symbolic model checking. This paper proposes PrIC3 (pronounced pricy-three), a conservative extension of IC3 to symbolic model checking of MDPs. Our main focus is to develop the theory underlying PrIC3. Alongside, we present a first implementation of PrIC3 including the key ingredients from IC3 such as generalization, repushing, and propagation.

Logic in Computer Science

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Jiafan He, Dongruo Zhou, Quanquan Gu, 2020

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the optimism in the face of uncertainty principle and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.

Machine Learning, Optimization and Control
