Robust Batch Policy Learning in Markov Decision Processes

نشر في Zhengling Qi بتاريخ 2020 والبحث باللغة English تحميل البحث

الملخص بالإنكليزية

We study the sequential decision making problem in Markov decision process (MDP) where each policy is evaluated by a set containing average rewards over different horizon lengths and with different initial distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that can maximize the smallest value of this set. Leveraging the semi-parametric efficiency theory from statistics, we develop a policy learning method for estimating the defined robust optimal policy that can efficiently break the curse of horizon under mild technical conditions. A rate-optimal regret bound up to a logarithmic factor is established in terms of the number of trajectories and the number of decision points.

تحميل البحث