Modelling transition dynamics in MDPs with RKHS embeddings

97 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Steffen Grunewalder

تاريخ النشر 2012

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Steffen Grunewalder

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as emph{embeddings} in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.

قيم البحث

200 - Steffen Grunewalder , Luca Baldassarre , Massimiliano Pontil 2011

التعلم الآلي

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

108 - Chi Jin , Tiancheng Jin , Haipeng Luo 2019

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $mathcal{tilde{O}}(L|X|sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $mathcal{tilde{O}}(sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $textit{upper occupancy bound}$.

التعلم الآلي التعلم الالي

Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

124 - Tiancheng Jin , Haipeng Luo 2020

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds guarantee: it achieves $mathcal{O}(log T)$ regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with $tilde{mathcal{O}}(sqrt{T})$ regret even when the losses are adversarial, where $T$ is the number of episodes. More generally, it achieves $tilde{mathcal{O}}(sqrt{C})$ regret in an intermediate setting where the losses are corrupted by a total amount of $C$. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.

التعلم الآلي التعلم الالي

Dynamics-aware Embeddings

138 - William Whitney , Rajat Agarwal , Kyunghyun Cho 2019

In this paper we consider self-supervised representation learning to improve sample efficiency in reinforcement learning (RL). We propose a forward prediction objective for simultaneously learning embeddings of states and action sequences. These embe ddings capture the structure of the environments dynamics, enabling efficient policy learning. We demonstrate that our action embeddings alone improve the sample efficiency and peak performance of model-free RL on control from low-dimensional states. By combining state and action embeddings, we achieve efficient learning of high-quality policies on goal-conditioned continuous control from pixel observations in only 1-2 million environment steps.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

MDPs with Setwise Continuous Transition Probabilities

114 - Eugene A. Feinberg , Pavlo O. Kasyanov 2020

This paper describes the structure of optimal policies for infinite-state Markov Decision Processes with setwise continuous transition probabilities. The action sets may be noncompact. The objective criteria are either the expected total discounted a nd undiscounted costs or average costs per unit time. The analysis of optimality equations and inequalities is based on the optimal selection theorem for inf-compact functions introduced in this paper.

التحسين والتحكم