L*-Based Learning of Markov Decision Processes (Extended Version)


Published by Martin Tappler

Publication date: 2019

Research field: Informatics Engineering, Mathematical Statistics

Paper language: English

Authors: Martin Tappler, Bernhard K. Aichernig, Giovanni Bacci

Machine Learning

Automata learning techniques automatically generate system models from test observations. These techniques usually fall into two categories: passive and active. Passive learning uses a predetermined data set, e.g., system logs. In contrast, active learning actively queries the system under learning, which is considered more efficient. An influential active learning technique is Angluin's L* algorithm for regular languages, which inspired several generalisations from DFAs to other automata-based modelling formalisms. In this work, we study L*-based learning of deterministic Markov decision processes, first assuming an ideal setting with perfect information. Then, we relax this assumption and present a novel learning algorithm that collects information by sampling system traces via testing. Experiments with the implementation of our sampling-based algorithm suggest that it achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data. Unlike existing learning algorithms with predefined states, our algorithm learns the complete model structure including the states.
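
To make the idea of collecting information by sampling system traces more concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a hypothetical two-state toy system (`TRANSITIONS`, `INITIAL_STATE`) and hypothetical helper names (`sample_trace`, `estimate_output_frequencies`), and only shows how repeated tests with a fixed input sequence yield relative frequencies of the observed output sequences. A sampling-based learner would work from estimates of this kind rather than from exact probabilities.

```python
import random
from collections import Counter

# Hypothetical toy system (not from the paper): a deterministic labelled MDP.
# TRANSITIONS[state][input] lists (probability, output, next_state) triples;
# "deterministic" means state, input and output together fix the next state.
TRANSITIONS = {
    "idle": {"press": [(0.9, "on", "active"), (0.1, "off", "idle")]},
    "active": {"press": [(1.0, "off", "idle")]},
}
INITIAL_STATE = "idle"


def sample_trace(inputs, rng):
    """Run one test: feed `inputs` to the system and return the observed outputs."""
    state = INITIAL_STATE
    outputs = []
    for symbol in inputs:
        options = TRANSITIONS[state][symbol]
        weights = [p for p, _, _ in options]
        # Draw one (probability, output, next_state) triple according to its weight.
        _, out, state = rng.choices(options, weights=weights)[0]
        outputs.append(out)
    return tuple(outputs)


def estimate_output_frequencies(inputs, n_samples, seed=0):
    """Estimate the distribution over output sequences by repeated sampling."""
    rng = random.Random(seed)
    counts = Counter(sample_trace(inputs, rng) for _ in range(n_samples))
    return {outs: c / n_samples for outs, c in counts.items()}
```

In this toy model the true probabilities for the input sequence `["press", "press"]` are 0.9 for `("on", "off")`, 0.09 for `("off", "on")` and 0.01 for `("off", "off")`, so calling `estimate_output_frequencies(["press", "press"], 10_000)` should return relative frequencies close to those values, with the gap shrinking as the number of samples grows.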


Sayak Ray Chowdhury, Aditya Gopalan 2018

We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandit, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multidimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits, and the adaptive control of parametric linear dynamical systems with quadratic costs.

Machine Learning

On Online Learning in Kernelized Markov Decision Processes

Sayak Ray Chowdhury, Aditya Gopalan 2019

We develop algorithms with low regret for learning episodic Markov decision processes based on kernel approximation techniques. The algorithms are based on both the Upper Confidence Bound (UCB) and the Posterior or Thompson Sampling (PSRL) philosophies, and work in the general setting of continuous state and action spaces when the true unknown transition dynamics are assumed to have smoothness induced by an appropriate Reproducing Kernel Hilbert Space (RKHS).

Machine Learning

Approximate Abstractions of Markov Chains with Interval Decision Processes (Extended Version)

Y. Zacchia Lun, J. Wheatley, A. D'Innocenzo 2018

This work introduces a new abstraction technique for reducing the state space of large, discrete-time labelled Markov chains. The abstraction leverages the semantics of interval Markov decision processes and the existing notion of approximate probabilistic bisimulation. Whilst standard abstractions make use of abstract points that are taken from the state space of the concrete model and which serve as representatives for sets of concrete states, in this work the abstract structure is constructed considering abstract points that are not necessarily selected from the states of the concrete model; rather, they are a function of these states. The resulting model presents a smaller one-step bisimulation error when compared to a like-sized, standard Markov chain abstraction. We outline a method to perform probabilistic model checking, and show that the computational complexity of the new method is comparable to that of standard abstractions based on approximate probabilistic bisimulations.

Systems and Control

Detecting Spiky Corruption in Markov Decision Processes

Jason Mancuso, Tomasz Kisielewski, David Lindner 2019

Current reinforcement learning methods fail if the reward function is imperfect, i.e., if the agent observes reward different from what it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently spiky, the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that our algorithm can detect the corrupt states and learn the optimal policy despite the corruption.

Machine Learning

Online Markov Decision Processes with Aggregate Bandit Feedback

Alon Cohen, Haim Kaplan, Tomer Koren 2021

We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics. In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback: the trajectory is revealed along with the cumulative loss suffered, rather than the individual losses encountered along the trajectory. Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes. We establish this result via an efficient reduction to a novel bandit learning setting we call Distorted Linear Bandits (DLB), which is a variant of bandit linear optimization where actions chosen by the learner are adversarially distorted before they are committed. We then develop a computationally efficient online algorithm for DLB for which we prove an $O(\sqrt{T})$ regret bound, where $T$ is the number of time steps. Our algorithm is based on online mirror descent with a self-concordant barrier regularization that employs a novel increasing learning rate schedule.

Machine Learning