Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making

115 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Chengchun Shi

تاريخ النشر 2020

مجال البحث الاحصاء الرياضي الهندسة المعلوماتية

والبحث باللغة English

تأليف Chengchun Shi - Runzhe Wan - Rui Song

التعلم الالي التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The Markov assumption (MA) is fundamental to the empirical validity of reinforcement learning. In this paper, we propose a novel Forward-Backward Learning procedure to test MA in sequential decision making. The proposed test does not assume any parametric form on the joint distribution of the observed data and plays an important role for identifying the optimal policy in high-order Markov decision processes and partially observable MDPs. We apply our test to both synthetic datasets and a real data example from mobile health studies to illustrate its usefulness.

قيم البحث

239 - Wenjun Zeng , Yi Liu 2021

In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seei ng recommended content on each page can only return feedback as moving forward in the process or dropping from it until a termination state. We refer to this type of problems as sequential decision making in linear--flow. We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions with analytical solution through exact dynamic programming. The way that we formulate the problem allows us to leverage TSs efficiency in balancing exploration and exploitation and Bandits convenience in modeling actions incompatibility. In the simulation study, we observe the proposed MDP with Bandits algorithm outperforms Q-learning with $epsilon$-greedy and decreasing $epsilon$, independent Bandits, and interaction Bandits. We also find the proposed algorithms performance is the most robust to changes in the across-page interdependence strength.

التعلم الآلي التعلم الالي

Meta-Learning surrogate models for sequential decision making

124 - Alexandre Galashov , Jonathan Schwarz , Hyunjik Kim 2019

We introduce a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning. This is accomplished by a probabilistic model-based approach that expl ains observed data while capturing predictive uncertainty during the decision making process. Crucially, this probabilistic model is chosen to be a Meta-Learning system that allows learning from a distribution of related problems, allowing data efficient adaptation to a target task. As a suitable instantiation of this framework, we explore the use of Neural processes due to statistical and computational desiderata. We apply our framework to a broad range of problem domains, such as control problems, recommender systems and adversarial attacks on RL agents, demonstrating an efficient and general black-box learning approach.

التعلم الالي التعلم الآلي

Temporal Regularization in Markov Decision Process

82 - Pierre Thodoroff , Audrey Durand , Joelle Pineau 2018

Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.

التعلم الآلي التعلم الالي

Quantum Logic Gate Synthesis as a Markov Decision Process

171 - M. Sohaib Alam 2019

Reinforcement learning has witnessed recent applications to a variety of tasks in quantum programming. The underlying assumption is that those tasks could be modeled as Markov Decision Processes (MDPs). Here, we investigate the feasibility of this as sumption by exploring its consequences for two of the simplest tasks in quantum programming: state preparation and gate compilation. By forming discrete MDPs, focusing exclusively on the single-qubit case, we solve for the optimal policy exactly through policy iteration. We find optimal paths that correspond to the shortest possible sequence of gates to prepare a state, or compile a gate, up to some target accuracy. As an example, we find sequences of H and T gates with length as small as 11 producing ~99% fidelity for states of the form (HT)^{n} |0> with values as large as n=10^{10}. This work provides strong evidence that reinforcement learning can be used for optimal state preparation and gate compilation for larger qubit spaces.

فيزياء الكم التعلم الآلي

Markov Decision Process for MOOC users behavioral inference

285 - Firas Jarboui , Celya Gruson-daniel , Pierre Chanial 2019

Studies on massive open online courses (MOOCs) users discuss the existence of typical profiles and their impact on the learning process of the students. However defining the typical behaviors as well as classifying the users accordingly is a difficul t task. In this paper we suggest two methods to model MOOC users behaviour given their log data. We mold their behavior into a Markov Decision Process framework. We associate the users intentions with the MDP reward and argue that this allows us to classify them.

التعلم الآلي التعلم الالي