بحث متقدم مدعوم من الذكاء الصنعي

مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Limit Synchronization in Markov Decision Processes

314 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Mahsa Shirmohammadi

تاريخ النشر 2013

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Laurent Doyen - Thierry Massart - Mahsa Shirmohammadi

علوم الكمبيوتر ونظرية الألعاب

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Markov decision processes (MDP) are finite-state systems with both strategic and probabilistic choices. After fixing a strategy, an MDP produces a sequence of probability distributions over states. The sequence is eventually synchronizing if the probability mass accumulates in a single state, possibly in the limit. Precisely, for 0 <= p <= 1 the sequence is p-synchronizing if a probability distribution in the sequence assigns probability at least p to some state, and we distinguish three synchronization modes: (i) sure winning if there exists a strategy that produces a 1-synchronizing sequence; (ii) almost-sure winning if there exists a strategy that produces a sequence that is, for all epsilon > 0, a (1-epsilon)-synchronizing sequence; (iii) limit-sure winning if for all epsilon > 0, there exists a strategy that produces a (1-epsilon)-synchronizing sequence. We consider the problem of deciding whether an MDP is sure, almost-sure, limit-sure winning, and we establish the decidability and optimal complexity for all modes, as well as the memory requirements for winning strategies. Our main contributions are as follows: (a) for each winning modes we present characterizations that give a PSPACE complexity for the decision problems, and we establish matching PSPACE lower bounds; (b) we show that for sure winning strategies, exponential memory is sufficient and may be necessary, and that in general infinite memory is necessary for almost-sure winning, and unbounded memory is necessary for limit-sure winning; (c) along with our results, we establish new complexity results for alternating finite automata over a one-letter alphabet.

قيم البحث

379 - Laurent Doyen , Thierry Massart , Mahsa Shirmohammadi 2014

We consider synchronizing properties of Markov decision processes (MDP), viewed as generators of sequences of probability distributions over states. A probability distribution is p-synchronizing if the probability mass is at least p in some state, an d a sequence of probability distributions is weakly p-synchronizing, or strongly p-synchronizing if respectively infinitely many, or all but finitely many distributions in the sequence are p-synchronizing. For each synchronizing mode, an MDP can be (i) sure winning if there is a strategy that produces a 1-synchronizing sequence; (ii) almost-sure winning if there is a strategy that produces a sequence that is, for all {epsilon} > 0, a (1-{epsilon})-synchronizing sequence; (iii) limit-sure winning if for all {epsilon} > 0, there is a strategy that produces a (1-{epsilon})-synchronizing sequence. For each synchronizing and winning mode, we consider the problem of deciding whether an MDP is winning, and we establish matching upper and lower complexity bounds of the problems, as well as the optimal memory requirement for winning strategies: (a) for all winning modes, we show that the problems are PSPACE-complete for weakly synchronizing, and PTIME-complete for strongly synchronizing; (b) we show that for weakly synchronizing, exponential memory is sufficient and may be necessary for sure winning, and infinite memory is necessary for almost-sure winning; for strongly synchronizing, linear-size memory is sufficient and may be necessary in all modes; (c) we show a robustness result that the almost-sure and limit-sure winning modes coincide for both weakly and strongly synchronizing.

المنطق في علوم الحاسوب

Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

122 - Yuval Emek , Ron Lavi , Rad Niazadeh 2020

In this paper, a rather general online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, inc luding but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. As the existing online learning techniques do not yield vanishing-regret mechanisms for this problem, we develop a novel online learning framework defined over deterministic Markov decision processes with dynamic state transition and reward functions. We then prove that if the Markov decision process is guaranteed to admit an oracle that can simulate any given policy from any initial state with bounded loss -- a condition that is satisfied in the DRACC problem -- then the online learning problem can be solved with vanishing regret. Our proof technique is based on a reduction to online learning with switching cost, in which an online decision maker incurs an extra cost every time she switches from one arm to another. We formally demonstrate this connection and further show how DRACC can be used in our proposed applications of stateful pricing.

علوم الكمبيوتر ونظرية الألعاب

Detecting Spiky Corruption in Markov Decision Processes

217 - Jason Mancuso , Tomasz Kisielewski , David Lindner 2019

Current reinforcement learning methods fail if the reward function is imperfect, i.e. if the agent observes reward different from what it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDP s). We show that if the reward corruption in a CRMDP is sufficiently spiky, the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that our algorithm can detect the corrupt states and learn the optimal policy despite the corruption.

التعلم الآلي التعلم الالي

Online Learning in Kernelized Markov Decision Processes

147 - Sayak Ray Chowdhury , Aditya Gopalan 2018

We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process p riors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandit, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multidimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits, and the adaptive control of parametric linear dynamical systems with quadratic costs.

التعلم الآلي التعلم الالي

Synchronizing Objectives for Markov Decision Processes

329 - Laurent Doyen 2011

We introduce synchronizing objectives for Markov decision processes (MDP). Intuitively, a synchronizing objective requires that eventually, at every step there is a state which concentrates almost all the probability mass. In particular, it implies t hat the probabilistic system behaves in the long run like a deterministic system: eventually, the current state of the MDP can be identified with almost certainty. We study the problem of deciding the existence of a strategy to enforce a synchronizing objective in MDPs. We show that the problem is decidable for general strategies, as well as for blind strategies where the player cannot observe the current state of the MDP. We also show that pure strategies are sufficient, but memory may be necessary.

المنطق في علوم الحاسوب التعقيد الحسابي

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة أسيوط

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Limit Synchronization in Markov Decision Processes

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً