ﻻ يوجد ملخص باللغة العربية
In a variety of applications, an agents success depends on the knowledge that an adversarial observer has or can gather about the agents decisions. It is therefore desirable for the agent to achieve a task while reducing the ability of an observer to infer the agents policy. We consider the task of the agent as a reachability problem in a Markov decision process and study the synthesis of policies that minimize the observers ability to infer the transition probabilities of the agent between the states of the Markov decision process. We introduce a metric that is based on the Fisher information as a proxy for the information leaked to the observer and using this metric formulate a problem that minimizes expected total information subject to the reachability constraint. We proceed to solve the problem using convex optimization methods. To verify the proposed method, we analyze the relationship between the expected total information and the estimation error of the observer, and show that, for a particular class of Markov decision processes, these two values are inversely proportional.
This paper extends to Continuous-Time Jump Markov Decision Processes (CTJMDP) the classic result for Markov Decision Processes stating that, for a given initial state distribution, for every policy there is a (randomized) Markov policy, which can be
The objective of this work is to study continuous-time Markov decision processes on a general Borel state space with both impulsive and continuous controls for the infinite-time horizon discounted cost. The continuous-time controlled process is shown
We study the problem of synthesizing a controller that maximizes the entropy of a partially observable Markov decision process (POMDP) subject to a constraint on the expected total reward. Such a controller minimizes the predictability of an agents t
We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to a temporal logic constraint. Such a policy minimizes the predictability of the paths it generates, or dually, maximizes the explora
Recently two approximate Newton methods were proposed for the optimisation of Markov Decision Processes. While these methods were shown to have desirable properties, such as a guarantee that the preconditioner is negative-semidefinite when the policy