
Sufficiency of Markov Policies for Continuous-Time Jump Markov Decision Processes

Published by Eugene Feinberg
Publication date: 2020
Language: English





This paper extends to Continuous-Time Jump Markov Decision Processes (CTJMDPs) the classic result for Markov Decision Processes stating that, for a given initial state distribution and for every policy, there is a (randomized) Markov policy, which can be defined in a natural way, such that at each time instant the marginal distributions of state-action pairs for the two policies coincide. The paper shows that this equality holds for a CTJMDP if the corresponding Markov policy defines a nonexplosive jump Markov process. If this Markov process is explosive, then at each time instant the marginal probability that a state-action pair belongs to a measurable set of state-action pairs is not greater for the described Markov policy than the same probability for the original policy. These results are used to prove that, for expected discounted total costs and for average costs per unit time, for a given initial state distribution and for each policy for a CTJMDP, the described Markov policy has the same or better performance.
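The classic discrete-time result that the abstract refers to can be checked numerically on a toy example. The sketch below is only an illustration of that discrete-time statement, not of the paper's continuous-time construction: the two-state, two-action model, the transition probabilities `P`, the initial distribution `GAMMA`, and the history-dependent rule `history_policy` are all hypothetical values chosen for the demonstration. The Markov policy is built, in the natural way, as the conditional distribution of the action given the current state and time under the original policy.

```python
# Hypothetical finite discrete-time MDP (all numbers are assumptions).
STATES = (0, 1)
ACTIONS = (0, 1)
# P[s][a][s2]: probability of moving from state s to s2 under action a.
P = {
    0: {0: (0.9, 0.1), 1: (0.2, 0.8)},
    1: {0: (0.5, 0.5), 1: (0.3, 0.7)},
}
GAMMA = (0.6, 0.4)  # initial state distribution
HORIZON = 4


def history_policy(history, s):
    """Action probabilities of a hypothetical non-Markov policy.

    `history` is the tuple of past (state, action) pairs; the rule depends
    on the initial state and on the previous action, so it is not Markov.
    """
    if not history:
        return (0.5, 0.5)
    s0, _ = history[0]
    _, prev_a = history[-1]
    p1 = 0.2 + 0.5 * prev_a + 0.2 * (s0 == s)  # probability of action 1
    return (1.0 - p1, p1)


def marginals_of_original_policy():
    """Exact state-action marginals at each time, by enumerating histories."""
    m = [{(s, a): 0.0 for s in STATES for a in ACTIONS} for _ in range(HORIZON)]
    frontier = {((), s): GAMMA[s] for s in STATES}  # (history, state) -> prob
    for t in range(HORIZON):
        nxt = {}
        for (hist, s), prob in frontier.items():
            probs = history_policy(hist, s)
            for a in ACTIONS:
                pa = prob * probs[a]
                m[t][(s, a)] += pa
                for s2 in STATES:
                    key = (hist + ((s, a),), s2)
                    nxt[key] = nxt.get(key, 0.0) + pa * P[s][a][s2]
        frontier = nxt
    return m


def markov_policy(m):
    """sigma[t][s][a] = P(action a at time t | state s at time t)."""
    sigma = []
    for t in range(HORIZON):
        row = {}
        for s in STATES:
            total = sum(m[t][(s, a)] for a in ACTIONS)
            # Arbitrary uniform choice on states that are never visited.
            row[s] = tuple(
                m[t][(s, a)] / total if total > 0 else 1.0 / len(ACTIONS)
                for a in ACTIONS
            )
        sigma.append(row)
    return sigma


def marginals_of_markov_policy(sigma):
    """State-action marginals under the Markov policy, by forward recursion."""
    d = list(GAMMA)
    m = []
    for t in range(HORIZON):
        mt = {(s, a): d[s] * sigma[t][s][a] for s in STATES for a in ACTIONS}
        m.append(mt)
        d = [
            sum(mt[(s, a)] * P[s][a][s2] for s in STATES for a in ACTIONS)
            for s2 in STATES
        ]
    return m


original = marginals_of_original_policy()
sigma = markov_policy(original)
markov = marginals_of_markov_policy(sigma)
max_gap = max(
    abs(original[t][k] - markov[t][k]) for t in range(HORIZON) for k in original[t]
)
print("largest difference between the two sets of marginals:", max_gap)
```

In this finite discrete-time setting the two sets of marginals agree up to floating-point rounding. According to the abstract, the continuous-time analogue gives equality only when the Markov policy defines a nonexplosive process, and an inequality otherwise.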


Read also

This paper describes the structure of solutions to Kolmogorov's equations for nonhomogeneous jump Markov processes and applications of these results to control of jump stochastic systems. These equations were studied by Feller (1940), who clarified in 1945 in the errata to that paper that some of its results covered only nonexplosive Markov processes. We present the results for possibly explosive Markov processes. The paper is based on the invited talk presented by the authors at the International Conference dedicated to the 200th anniversary of the birth of P. L. Chebyshev.
The objective of this work is to study continuous-time Markov decision processes on a general Borel state space with both impulsive and continuous controls for the infinite-time horizon discounted cost. The continuous-time controlled process is shown to be nonexplosive under appropriate hypotheses. The so-called Bellman equation associated with this control problem is studied. Sufficient conditions ensuring the existence and the uniqueness of a bounded measurable solution to this optimality equation are provided. Moreover, it is shown that the value function of the optimization problem under consideration satisfies this optimality equation. Sufficient conditions are also presented to ensure, on the one hand, the existence of an optimal control strategy and, on the other hand, the existence of an $\varepsilon$-optimal control strategy. The decomposition of the state space into two disjoint subsets is exhibited where, roughly speaking, one should apply a gradual action or an impulsive action correspondingly to get an optimal or $\varepsilon$-optimal strategy. An interesting consequence of our previous results is as follows: the set of strategies that allow interventions at time $t=0$ and only immediately after natural jumps is a sufficient set for the control problem under consideration.
In a variety of applications, an agent's success depends on the knowledge that an adversarial observer has or can gather about the agent's decisions. It is therefore desirable for the agent to achieve a task while reducing the ability of an observer to infer the agent's policy. We consider the task of the agent as a reachability problem in a Markov decision process and study the synthesis of policies that minimize the observer's ability to infer the transition probabilities of the agent between the states of the Markov decision process. We introduce a metric that is based on the Fisher information as a proxy for the information leaked to the observer and, using this metric, formulate a problem that minimizes expected total information subject to the reachability constraint. We proceed to solve the problem using convex optimization methods. To verify the proposed method, we analyze the relationship between the expected total information and the estimation error of the observer, and show that, for a particular class of Markov decision processes, these two values are inversely proportional.
The aim of this paper is to propose a new numerical approximation of the Kalman-Bucy filter for semi-Markov jump linear systems. This approximation is based on the selection of typical trajectories of the driving semi-Markov chain of the process by using an optimal quantization technique. The main advantage of this approach is that it makes pre-computations possible. We derive a Lipschitz property for the solution of the Riccati equation and a general result on the convergence of perturbed solutions of semi-Markov switching Riccati equations when the perturbation comes from the driving semi-Markov chain. Based on these results, we prove the convergence of our approximation scheme in a general infinite countable state space framework and derive an error bound in terms of the quantization error and time discretization step. We employ the proposed filter in a magnetic levitation example with Markovian failures and compare its performance with both the Kalman-Bucy filter and the Markovian linear minimum mean squares estimator.
We study the problem of synthesizing a controller that maximizes the entropy of a partially observable Markov decision process (POMDP) subject to a constraint on the expected total reward. Such a controller minimizes the predictability of an agent's trajectories to an outside observer while guaranteeing the completion of a task expressed by a reward function. We first prove that an agent with partial observations can achieve an entropy at most as high as that of an agent with perfect observations. Then, focusing on finite-state controllers (FSCs) with deterministic memory transitions, we show that the maximum entropy of a POMDP is lower bounded by the maximum entropy of the parametric Markov chain (pMC) induced by such FSCs. This relationship allows us to recast the entropy maximization problem as a so-called parameter synthesis problem for the induced pMC. We then present an algorithm to synthesize an FSC that locally maximizes the entropy of a POMDP over FSCs with the same number of memory states. In numerical examples, we illustrate the relationship between the maximum entropy, the number of memory states in the FSC, and the expected reward.