Index Policy for A Class of Partially Observable Markov Decision Processes

Published by Keqin Liu

Publication date: 2021

Research field: Optimization and Control

Language: English

Author: Keqin Liu

This paper addresses an important class of restless multi-armed bandit (RMAB) problems with broad applications in operations research, stochastic optimization, and reinforcement learning. There are $N$ independent Markov processes that may be operated and observed, and that offer rewards. Due to a resource constraint, we can choose only a subset of $M~(M<N)$ processes to operate, accruing a reward determined by the states of the selected processes. We formulate the problem as an RMAB with an infinite state space and design a low-complexity algorithm that achieves near-optimal performance. Our algorithm is based on Whittle's original idea of an index policy but can be implemented under more general scenarios, including continuous state spaces, relaxed indexability, and online computation.
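The general shape of an index policy (score every process, then operate the $M$ highest-scoring ones) can be sketched in a few lines of Python. This is only an illustration under assumptions that the abstract does not state: each process is taken to be a two-state Markov chain with hypothetical transition probabilities `p01` and `p11`, operating a process is assumed to reveal its current state, and the index is a myopic placeholder (the belief itself), not the index derived in the paper. The function names `propagate`, `select_arms`, and `simulate` are likewise hypothetical.

```python
import random


def propagate(belief, p01, p11):
    """One-step belief update for a two-state chain (assumed model).

    belief is the probability that the process is in state 1;
    p01 and p11 are the probabilities of moving to state 1 from 0 and 1.
    """
    return belief * p11 + (1.0 - belief) * p01


def select_arms(beliefs, index_fn, m):
    """Return the (sorted) positions of the m processes with the largest index."""
    ranked = sorted(range(len(beliefs)),
                    key=lambda i: index_fn(beliefs[i]),
                    reverse=True)
    return sorted(ranked[:m])


def simulate(n, m, p01, p11, horizon, index_fn, seed=0):
    """Average per-step reward of the top-m index rule on n identical processes."""
    rng = random.Random(seed)
    stationary = p01 / (p01 + 1.0 - p11)
    states = [1 if rng.random() < stationary else 0 for _ in range(n)]
    beliefs = [stationary] * n
    total = 0
    for _ in range(horizon):
        chosen = set(select_arms(beliefs, index_fn, m))
        # Reward: number of operated processes currently in state 1.
        total += sum(states[i] for i in chosen)
        for i in range(n):
            # Assumption: operating a process reveals its state exactly,
            # so its belief resets to 0 or 1 before being propagated.
            prior = float(states[i]) if i in chosen else beliefs[i]
            beliefs[i] = propagate(prior, p01, p11)
            to_one = p11 if states[i] == 1 else p01
            states[i] = 1 if rng.random() < to_one else 0
    return total / horizon


def myopic_index(belief):
    """Placeholder index: the belief itself. Not the paper's index."""
    return belief
```

A call such as `simulate(n=5, m=2, p01=0.2, p11=0.8, horizon=1000, index_fn=myopic_index)` returns the average number of operated processes found in state 1 per step. Replacing `myopic_index` with a different scoring function changes the policy without touching the selection logic, which is the structural point of an index rule.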

Yagiz Savas, Michael Hibbard, Bo Wu (2021)

We study the problem of synthesizing a controller that maximizes the entropy of a partially observable Markov decision process (POMDP) subject to a constraint on the expected total reward. Such a controller minimizes the predictability of an agent's trajectories to an outside observer while guaranteeing the completion of a task expressed by a reward function. We first prove that an agent with partial observations can achieve an entropy at most as large as that of an agent with perfect observations. Then, focusing on finite-state controllers (FSCs) with deterministic memory transitions, we show that the maximum entropy of a POMDP is lower bounded by the maximum entropy of the parametric Markov chain (pMC) induced by such FSCs. This relationship allows us to recast the entropy maximization problem as a so-called parameter synthesis problem for the induced pMC. We then present an algorithm to synthesize an FSC that locally maximizes the entropy of a POMDP over FSCs with the same number of memory states. In numerical examples, we illustrate the relationship between the maximum entropy, the number of memory states in the FSC, and the expected reward.

Optimization and Control

Multivariate Utility Optimization with an Application to Risk-Sensitive Partially Observable Markov Decision Processes

Vaios Laschos, Robert Seidel, Klaus Obermayer (2018)

We introduce and treat a class of Multi Objective Risk-Sensitive Markov Decision Processes (MORSMDPs), where the optimality criteria are generated by a multivariate utility function applied to a finite set of different running costs. To illustrate our approach, we study the example of a two-armed bandit problem. In the sequel, we show that it is possible to reformulate standard Risk-Sensitive Partially Observable Markov Decision Processes (RSPOMDPs), where risk is modeled by a utility function that is a sum of exponentials, as MORSMDPs that can be solved with the methods described in the first part. This way, we extend the treatment of RSPOMDPs with exponential utility to RSPOMDPs corresponding to a qualitatively larger family of utility functions.

Optimization and Control

Constrained Active Classification Using Partially Observable Markov Decision Processes

Bo Wu, Mohamadreza Ahmadi, Suda Bharadwaj (2020)

In this work, we study the problem of actively classifying the attributes of dynamical systems characterized as a finite set of Markov decision process (MDP) models. We are interested in finding strategies that actively interact with the dynamical system and observe its reactions so that the attribute of interest is classified efficiently with high confidence. We present a decision-theoretic framework based on partially observable Markov decision processes (POMDPs). The proposed framework relies on assigning a classification belief (a probability distribution) to the attributes of interest. Given an initial belief, a confidence level over which a classification decision can be made, a cost bound, safe belief sets, and a finite time horizon, we compute POMDP strategies leading to classification decisions. We present two different algorithms to compute such strategies. The first algorithm computes the optimal strategy exactly by value iteration. To overcome the computational complexity of computing the exact solutions, we propose a second algorithm based on adaptive sampling to approximate the optimal probability of reaching a classification decision. We illustrate the proposed methodology using examples from medical diagnosis and privacy-preserving advertising.

Systems and Control

Human-in-the-Loop Synthesis for Partially Observable Markov Decision Processes

Steven Carr, Nils Jansen, Ralf Wimmer (2018)

We study planning problems where autonomous agents operate inside environments that are subject to uncertainties and not fully observable. Partially observable Markov decision processes (POMDPs) are a natural formal model to capture such problems. Because of the potentially huge or even infinite belief space in POMDPs, synthesis with safety guarantees is, in general, computationally intractable. We propose an approach that aims to circumvent this difficulty: in scenarios that can be partially or fully simulated in a virtual environment, we actively integrate a human user to control an agent. While the user repeatedly tries to safely guide the agent in the simulation, we collect data from the human input. Via behavior cloning, we translate the data into a strategy for the POMDP. The strategy resolves all nondeterminism and non-observability of the POMDP, resulting in a discrete-time Markov chain (MC). The efficient verification of this MC gives quantitative insights into the quality of the inferred human strategy by proving or disproving given system specifications. For the case that the quality of the strategy is not sufficient, we propose a refinement method using counterexamples presented to the human. Experiments show that by including humans in the POMDP verification loop we improve the state of the art by orders of magnitude in terms of scalability.

Artificial Intelligence

Information-Theoretic Methods for Planning and Learning in Partially Observable Markov Decision Processes

Roy Fox (2016)

Bounded agents are limited by intrinsic constraints on their ability to process information that is available in their sensors and memory and choose actions and memory updates. In this dissertation, we model these constraints as information-rate constraints on communication channels connecting these various internal components of the agent. We make four major contributions detailed below and many smaller contributions detailed in each section. First, we formulate the problem of optimizing the agent under both extrinsic and intrinsic constraints and develop the main tools for solving it. Second, we identify another reason for the challenging convergence properties of the optimization algorithm, which is the bifurcation structure of the update operator near phase transitions. Third, we study the special case of linear-Gaussian dynamics and quadratic cost (LQG), where the optimal solution has a particularly simple and solvable form. Fourth, we explore the learning task, where the model of the world dynamics is unknown and sample-based updates are used instead.

Machine Learning
