Solving Partially Observable Markov Decision Processes (POMDPs) is hard. Learning optimal controllers for POMDPs when the model is unknown is harder. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively trade off exploration and exploitation, is even harder, and no solution currently exists. In this paper, we consider infinite-horizon average-cost POMDPs with an unknown transition model but a known observation model. We propose a natural posterior sampling-based reinforcement learning algorithm (PSRL-POMDP) and show that it achieves a regret bound of $O(\log T)$, where $T$ is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves $O(T^{2/3})$ regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs with sub-linear regret.
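To make the sample-then-plan loop concrete, here is a minimal Python sketch of posterior sampling over a finite parameter set, in the spirit of the setting described above (unknown transitions, known observation model). Everything here is an illustrative assumption rather than the paper's actual construction: the candidate kernels, the dimensions, and the placeholder planner `solve_average_cost_policy`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite parameter set: K candidate transition kernels P_k[s, a, s'].
# The observation kernel O[s, o] is known, as in the abstract.
K, S, A, Obs = 3, 4, 2, 5
candidates = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(K)]
O = rng.dirichlet(np.ones(Obs), size=S)
posterior = np.full(K, 1.0 / K)          # prior over the K candidate models
P_true = candidates[0]                   # pretend candidate 0 is the truth

def solve_average_cost_policy(P):
    """Hypothetical planner placeholder: a real implementation would solve
    the sampled POMDP for an average-cost-optimal policy."""
    return lambda belief: int(rng.integers(A))

def filter_step(b, a, o, P):
    """Bayes filter: predict through P, correct with the known kernel O."""
    pred = b @ P[:, a, :]
    post = pred * O[:, o]
    return post / post.sum()

state = int(rng.integers(S))
belief = np.full(S, 1.0 / S)
for episode in range(5):
    k = rng.choice(K, p=posterior)                      # sample a model ...
    policy = solve_average_cost_policy(candidates[k])   # ... plan for it ...
    for t in range(20):                                 # ... and act with it
        a = policy(belief)
        state = rng.choice(S, p=P_true[state, a])       # hidden true dynamics
        o = rng.choice(Obs, p=O[state])                 # emitted observation
        # Reweight each candidate by the likelihood it assigns to o.
        # (For brevity one shared belief is reused across candidates; an
        # exact update would track one belief per candidate model.)
        lik = np.array([(belief @ P[:, a, :]) @ O[:, o] for P in candidates])
        posterior *= lik
        posterior /= posterior.sum()
        belief = filter_step(belief, a, o, candidates[k])
```

Roughly, the intuition for the finite-parameter case is that the posterior concentrates on the true model quickly, so the sampled model, and hence the executed policy, is eventually correct almost always, which is consistent with the $O(\log T)$ bound stated above.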
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret…
This paper deals with bandit online learning problems involving feedback with unknown delays, which can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function involved…
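A minimal sketch of the mechanical part of this setting, under assumptions of my own choosing (an epsilon-greedy MAB learner, Gaussian rewards, uniformly random delays): the learner must commit to an arm every round, while each reward is revealed only after a delay it never observes in advance.

```python
import heapq
import random

random.seed(0)
K, T, EPS = 5, 2000, 0.1
true_means = [random.random() for _ in range(K)]
counts, values = [0] * K, [0.0] * K
pending = []                       # min-heap of (arrival_round, arm, reward)

for t in range(T):
    # Apply every reward whose delay has finally elapsed.
    while pending and pending[0][0] <= t:
        _, arm, r = heapq.heappop(pending)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # running mean

    # Act on the current, possibly stale, estimates.
    if random.random() < EPS or all(c == 0 for c in counts):
        a = random.randrange(K)
    else:
        a = max(range(K), key=lambda i: values[i])

    # The reward is generated now but revealed only after an unknown delay.
    reward = random.gauss(true_means[a], 0.1)
    delay = random.randint(1, 50)
    heapq.heappush(pending, (t + delay, a, reward))
```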
Bounded agents are limited by intrinsic constraints on their ability to process the information available in their sensors and memory and to choose actions and memory updates. In this dissertation, we model these constraints as information-rate constraints…
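One standard way to formalize an information-rate constraint of this kind (a hedged sketch, not necessarily the dissertation's exact model) is to charge the agent a price $1/\beta$ per nat of mutual information $I(S;A)$ between world state and action. The optimal channel then has the Boltzmann form $p(a|s) \propto p(a)\exp(\beta U(s,a))$ and can be computed by Blahut-Arimoto-style iterations; the utilities and state distribution below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, beta = 6, 4, 2.0                 # beta trades utility against rate
U = rng.standard_normal((S, A))        # utility of action a in state s
p_s = np.full(S, 1.0 / S)              # state distribution

p_a = np.full(A, 1.0 / A)              # initial action marginal
for _ in range(200):
    # Channel update: tilt the action marginal by exponentiated utility.
    p_as = p_a * np.exp(beta * U)
    p_as /= p_as.sum(axis=1, keepdims=True)
    # Marginal update: average the channel over states.
    p_a = p_s @ p_as

# Information rate actually used by the bounded agent, in nats.
rate = np.sum(p_s[:, None] * p_as * np.log(p_as / p_a))
print(f"I(S;A) = {rate:.3f} nats")
```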
Reinforcement Learning (RL) agents typically learn memoryless policies, i.e., policies that consider only the last observation when selecting actions. Learning memoryless policies is efficient and optimal in fully observable environments. However, some…
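A toy sketch of the distinction, using a cue-then-junction task that is an assumption of this example rather than anything from the paper: a memoryless policy cannot act correctly at the junction because the cue is absent from the current observation, while a single remembered item suffices.

```python
from typing import Hashable, Optional

class MemorylessPolicy:
    """Maps only the current observation to an action."""
    def __init__(self, table: dict):
        self.table = table

    def act(self, obs: Hashable) -> int:
        return self.table.get(obs, 0)

class OneBitMemoryPolicy:
    """Remembers an earlier cue; enough for tasks where the cue
    determines the right action long after it disappears."""
    def __init__(self):
        self.memory: Optional[str] = None

    def act(self, obs: Hashable) -> int:
        if obs in ("cue_left", "cue_right"):
            self.memory = obs                  # memory update
        if obs == "junction":                  # the cue is no longer visible
            return 0 if self.memory == "cue_left" else 1
        return 0

# At "junction" a memoryless policy must pick one fixed action, so it is
# wrong under one of the two cue conditions; the memory policy is not.
```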
We study linear contextual bandits with access to a large, confounded offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear…
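As a hedged illustration of how logged data can enter a linear contextual bandit, the sketch below warm-starts LinUCB's ridge-regression statistics with an offline dataset before running the online phase. Note that it deliberately ignores the confounding in the logged data, which is precisely what makes the problem above subtle; the environment and all parameters are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, lam, alpha = 5, 3, 1.0, 1.0
theta_true = rng.standard_normal(d) / np.sqrt(d)   # unknown reward vector

A_mat = lam * np.eye(d)          # ridge-regression Gram matrix
b_vec = np.zeros(d)

# Warm start: absorb a logged offline dataset from some fixed policy.
for _ in range(500):
    x = rng.standard_normal(d)
    r = x @ theta_true + 0.1 * rng.standard_normal()
    A_mat += np.outer(x, x)
    b_vec += r * x

# Online LinUCB phase on top of the warm-started statistics.
for t in range(200):
    arms = rng.standard_normal((K, d))             # arm features this round
    A_inv = np.linalg.inv(A_mat)
    theta_hat = A_inv @ b_vec
    bonus = np.sqrt(np.einsum("kd,de,ke->k", arms, A_inv, arms))
    a = int(np.argmax(arms @ theta_hat + alpha * bonus))
    r = arms[a] @ theta_true + 0.1 * rng.standard_normal()
    A_mat += np.outer(arms[a], arms[a])
    b_vec += r * arms[a]
```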