
Choice Set Misspecification in Reward Inference

Published by Rachel Freedman
Publication date: 2021
Research field: Informatics Engineering
Research language: English





Specifying reward functions for robots that operate in environments without a natural reward signal can be challenging, and incorrectly specified rewards can incentivise degenerate or dangerous behavior. A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections. To interpret this feedback, robots treat as approximately optimal a choice the person makes from a choice set, like the set of possible trajectories they could have demonstrated or possible corrections they could have made. In this work, we introduce the idea that the choice set itself might be difficult to specify, and analyze choice set misspecification: what happens as the robot makes incorrect assumptions about the set of choices from which the human selects their feedback. We propose a classification of different kinds of choice set misspecification, and show that these different classes lead to meaningful differences in the inferred reward and resulting performance. While we would normally expect misspecification to hurt, we find that certain kinds of misspecification are neither helpful nor harmful (in expectation). However, in other situations, misspecification can be extremely harmful, leading the robot to believe the opposite of what it should believe. We hope our results will allow for better prediction and response to the effects of misspecification in real-world reward inference.
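The effect described in the abstract can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the paper's method: it assumes "approximately optimal" means a Boltzmann-rational chooser (the abstract does not name a model), and the choices `A`, `B`, `C`, the two candidate rewards `theta1` and `theta2`, and the helper names are all hypothetical. The human picks `B` from the true choice set `{A, B}`, while the robot wrongly assumes the choice set was `{B, C}`.

```python
import math

def boltzmann_likelihood(choice, choice_set, reward, beta=1.0):
    """P(choice | reward) for a chooser who picks from choice_set with
    probability proportional to exp(beta * reward). Assumed model."""
    weights = {c: math.exp(beta * reward[c]) for c in choice_set}
    return weights[choice] / sum(weights.values())

def posterior(choice, assumed_set, candidate_rewards, beta=1.0):
    """Posterior over candidate rewards under a uniform prior, given the
    observed choice and the choice set the robot assumes."""
    unnorm = {
        name: boltzmann_likelihood(choice, assumed_set, reward, beta)
        for name, reward in candidate_rewards.items()
    }
    total = sum(unnorm.values())
    return {name: value / total for name, value in unnorm.items()}

# Hypothetical candidate reward functions over three choices.
candidates = {
    "theta1": {"A": 0.0, "B": 1.0, "C": 2.0},  # prefers C, then B, then A
    "theta2": {"A": 2.0, "B": 1.0, "C": 0.0},  # prefers A, then B, then C
}

observed = "B"
true_set = {"A", "B"}      # what the human could actually choose from
assumed_set = {"B", "C"}   # what the robot wrongly believes was available

well_specified = posterior(observed, true_set, candidates)
misspecified = posterior(observed, assumed_set, candidates)

print("correct choice set:     ", well_specified)
print("misspecified choice set:", misspecified)
```

With the correct choice set, picking `B` over `A` is evidence for `theta1`; with the misspecified set, the same choice looks like picking `B` over `C`, so the robot's belief shifts toward `theta2` instead. In this toy setup the misspecification flips the robot's conclusion, which is the kind of harmful case the abstract warns about.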




Read also

Single-agent dynamic discrete choice models are typically estimated using heavily parametrized econometric frameworks, making them susceptible to model misspecification. This paper investigates how misspecification affects the results of inference in these models. Specifically, we consider a local misspecification framework in which specification errors are assumed to vanish at an arbitrary and unknown rate with the sample size. Relative to global misspecification, the local misspecification analysis has two important advantages. First, it yields tractable and general results. Second, it allows us to focus on parameters with structural interpretation, instead of pseudo-true parameters. We consider a general class of two-step estimators based on the K-stage sequential policy function iteration algorithm, where K denotes the number of iterations employed in the estimation. This class includes Hotz and Miller's (1993) conditional choice probability estimator, Aguirregabiria and Mira's (2002) pseudo-likelihood estimator, and Pesendorfer and Schmidt-Dengler's (2008) asymptotic least squares estimator. We show that local misspecification can affect the asymptotic distribution and even the rate of convergence of these estimators. In principle, one might expect that the effect of the local misspecification could change with the number of iterations K. One of our main findings is that this is not the case, i.e., the effect of local misspecification is invariant to K. In practice, this means that researchers cannot eliminate or even alleviate problems of model misspecification by changing K.
It is often difficult to hand-specify what the correct reward function is for a task, so researchers have instead aimed to learn reward functions from human behavior or feedback. The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years. We've gone from demonstrations, to comparisons, to reading into the information leaked when the human is pushing the robot away or turning it off. And surely, there is more to come. How will a robot make sense of all these diverse types of behavior? Our key insight is that different types of behavior can be interpreted in a single unifying formalism: as a reward-rational choice that the human is making, often implicitly. The formalism offers both a unifying lens with which to view past work, as well as a recipe for interpreting new sources of information that are yet to be uncovered. We provide two examples to showcase this: interpreting a new feedback type, and reading into how the choice of feedback itself leaks information about the reward.
Reinforcement learning problems are often described through rewards that indicate if an agent has completed some task. This specification can yield desirable behavior; however, many problems are difficult to specify in this manner, as one often needs to know the proper configuration for the agent. When humans are learning to solve tasks, we often learn from visual instructions composed of images or videos. Such representations motivate our development of Perceptual Reward Functions, which provide a mechanism for creating visual task descriptions. We show that this approach allows an agent to learn from rewards that are based on raw pixels rather than internal parameters.
Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.
It is incredibly easy for a system designer to misspecify the objective for an autonomous system (robot), thus motivating the desire to have the robot learn the objective from human behavior instead. Recent work has suggested that people have an interest in the robot performing well, and will thus behave pedagogically, choosing actions that are informative to the robot. In turn, robots benefit from interpreting the behavior by accounting for this pedagogy. In this work, we focus on misspecification: we argue that robots might not know whether people are being pedagogic or literal and that it is important to ask which assumption is safer to make. We cast objective learning into the more general form of a common-payoff game between the robot and human, and prove that in any such game literal interpretation is more robust to misspecification. Experiments with human data support our theoretical results and point to the sensitivity of the pedagogic assumption.
