
Instance based Generalization in Reinforcement Learning

Published by Martin Bertran
Publication date: 2020
Language: English





Agents trained via deep reinforcement learning (RL) routinely fail to generalize to unseen environments, even when these share the same underlying dynamics as the training levels. Understanding the generalization properties of RL is one of the challenges of modern machine learning. Towards this goal, we analyze policy learning in the context of Partially Observable Markov Decision Processes (POMDPs) and formalize the dynamics of training levels as instances. We prove that, independently of the exploration strategy, reusing instances introduces significant changes to the effective Markov dynamics the agent observes during training. Maximizing expected rewards impacts the learned belief state of the agent by inducing undesired instance-specific speedrunning policies instead of generalizable ones, the latter being suboptimal on the training set. We provide generalization bounds on the value gap between train and test environments based on the number of training instances, and use insights from these bounds to improve performance on unseen levels. We propose training a shared belief representation over an ensemble of specialized policies, from which we compute a consensus policy that is used for data collection, disallowing instance-specific exploitation. We experimentally validate our theory, observations, and the proposed computational solution on the CoinRun benchmark.
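The abstract does not say how the consensus policy is computed from the ensemble. As a rough illustration only, the sketch below assumes the simplest option: each specialized policy outputs action logits, and the consensus is the plain average of the resulting action distributions. The function names (`softmax`, `consensus_policy`, `sample_action`) and the toy logits are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_policy(member_logits):
    """Average the action distributions of all ensemble members.

    Assumption: a uniform average; the paper may use a different rule.
    """
    probs = [softmax(logits) for logits in member_logits]
    n_actions = len(probs[0])
    return [sum(p[a] for p in probs) / len(probs) for a in range(n_actions)]


def sample_action(policy, rng):
    """Sample one action index from the given distribution."""
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]


# Toy example: three specialized policies, each strongly preferring a
# different action. No single member dominates the consensus.
member_logits = [
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 4.0],
]
consensus = consensus_policy(member_logits)
action = sample_action(consensus, random.Random(0))
```

In this toy case the members disagree symmetrically, so the consensus distribution is uniform over the three actions. Collecting data with the consensus rather than with any single member is the idea the abstract describes as disallowing instance-specific exploitation.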




Read also

In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.
Learning data representations that are useful for various downstream tasks is a cornerstone of artificial intelligence. While existing methods are typically evaluated on downstream tasks such as classification or generative image quality, we propose to assess representations through their usefulness in downstream control tasks, such as reaching or pushing objects. By training over 10,000 reinforcement learning policies, we extensively evaluate to what extent different representation properties affect out-of-distribution (OOD) generalization. Finally, we demonstrate zero-shot transfer of these policies from simulation to the real world, without any domain randomization or fine-tuning. This paper aims to establish the first systematic characterization of the usefulness of learned representations for real-world OOD downstream tasks.
The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show that this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the ``reachability'' of a state. We show that our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
Social Reinforcement Learning methods, which model agents in large networks, are useful for fake news mitigation, personalized teaching/healthcare, and viral marketing, but it is challenging to incorporate inter-agent dependencies into the models effectively due to network size and sparse interaction data. Previous social RL approaches either ignore agents' dependencies or model them in a computationally intensive manner. In this work, we incorporate agent dependencies efficiently in a compact model by clustering users (based on their payoff and contribution to the goal) and combine this with a method to easily derive personalized agent-level policies from cluster-level policies. We also propose a dynamic clustering approach that captures changing user behavior. Experiments on real-world datasets illustrate that our proposed approach learns more accurate policy estimates and converges more quickly, compared to several baselines that do not use agent correlations or only use static clusters.
Reinforcement learning (RL) has achieved remarkable performance in numerous sequential decision making and control tasks. However, a common problem is that the learned nearly optimal policy always overfits to the training environment and may not extend to situations never encountered during training. For practical applications, the randomness of the environment usually leads to some devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and the distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe variations from the environment: the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which we can model the risk of different returns explicitly, thereby formulating a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in distinct environments. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.
