ﻻ يوجد ملخص باللغة العربية
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $widetilde{O}(sqrt{T})$ regret and another computationally efficient variant with $widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $widetilde{O}(sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learni
Model-free reinforcement learning is known to be memory and computation efficient and more amendable to large scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Process
We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We pro