
Learning Human Objectives by Evaluating Hypothetical Behavior

Published by Siddharth Reddy
Publication date: 2019
Research field: Informatics engineering
Paper language: English





We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.
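The active-synthesis loop described in the abstract can be sketched in miniature. Everything below is an illustrative assumption, not the paper's implementation: a hand-written linear dynamics model stands in for the learned generative/dynamics models, a small ensemble of linear reward models stands in for the neural reward model, ensemble disagreement stands in for the paper's value-of-information proxies, and random hill-climbing stands in for gradient-based trajectory optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for ReQueST's learned components (all assumptions):
# a linear forward dynamics model and an ensemble of linear reward models.
A = np.array([[1.0, 0.1], [0.0, 1.0]])            # state transition
B = np.array([[0.0], [0.1]])                      # action effect
ensemble = [rng.normal(size=2) for _ in range(5)]  # reward weight vectors

def rollout(actions, s0=np.zeros(2)):
    """Unroll the dynamics model on an action sequence (no env interaction)."""
    states, s = [], s0
    for a in actions:
        s = A @ s + B @ np.atleast_1d(a)
        states.append(s)
    return states

def disagreement(actions):
    """Variance of ensemble-predicted returns: a crude value-of-information proxy."""
    states = rollout(actions)
    returns = [sum(w @ s for s in states) for w in ensemble]
    return np.var(returns)

def synthesize_query(horizon=10, iters=200, step=0.5):
    """Hill-climb an action sequence to maximize ensemble disagreement."""
    actions = rng.normal(size=horizon)
    best = disagreement(actions)
    for _ in range(iters):
        cand = actions + step * rng.normal(size=horizon)
        value = disagreement(cand)
        if value > best:
            actions, best = cand, value
    return actions, best

query, score = synthesize_query()
```

The synthesized trajectory would then be shown to the user for a reward label, and the labeled data used to retrain the ensemble, shrinking disagreement where the user has spoken.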




Read also

When a robot performs a task next to a human, physical interaction is inevitable: the human might push, pull, twist, or guide the robot. The state-of-the-art treats these interactions as disturbances that the robot should reject or avoid. At best, these robots respond safely while the human interacts; but after the human lets go, these robots simply return to their original behavior. We recognize that physical human-robot interaction (pHRI) is often intentional -- the human intervenes on purpose because the robot is not doing the task correctly. In this paper, we argue that when pHRI is intentional it is also informative: the robot can leverage interactions to learn how it should complete the rest of its current task even after the person lets go. We formalize pHRI as a dynamical system, where the human has in mind an objective function they want the robot to optimize, but the robot does not get direct access to the parameters of this objective -- they are internal to the human. Within our proposed framework human interactions become observations about the true objective. We introduce approximations to learn from and respond to pHRI in real-time. We recognize that not all human corrections are perfect: often users interact with the robot noisily, and so we improve the efficiency of robot learning from pHRI by reducing unintended learning. Finally, we conduct simulations and user studies on a robotic manipulator to compare our proposed approach to the state-of-the-art. Our results indicate that learning from pHRI leads to better task performance and improved human satisfaction.
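The core move in this abstract, treating a physical correction as an observation about the human's hidden objective, admits a compact sketch. The feature map, trajectories, and update rule below are illustrative assumptions: weights on trajectory features are nudged toward the features of the human-corrected path and away from those of the robot's planned path.

```python
import numpy as np

def features(traj):
    """Toy trajectory features (assumed): mean position and mean speed."""
    traj = np.asarray(traj, dtype=float)
    vel = np.diff(traj)
    return np.array([traj.mean(), np.abs(vel).mean()])

def update_objective(theta, planned, corrected, lr=0.1):
    """Shift objective weights toward the features the human demonstrated,
    away from the features of the robot's originally planned trajectory."""
    return theta + lr * (features(corrected) - features(planned))

theta = np.zeros(2)
planned = [0.0, 0.5, 1.0, 1.5]     # robot's intended 1D path
corrected = [0.0, 0.2, 0.4, 0.6]   # path after the human pushed the arm
theta = update_objective(theta, planned, corrected)
```

Here the human slowed the robot down, so the weight on speed goes negative; repeated corrections would keep refining the estimate while the robot replans under the updated objective.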
Using the concept of principal stratification from the causal inference literature, we introduce a new notion of fairness, called principal fairness, for human and algorithmic decision-making. The key idea is that one should not discriminate among individuals who would be similarly affected by the decision. Unlike the existing statistical definitions of fairness, principal fairness explicitly accounts for the fact that individuals can be impacted by the decision. We propose an axiomatic assumption that all groups are created equal. This assumption is motivated by a belief that protected attributes such as race and gender should have no direct causal effects on potential outcomes. Under this assumption, we show that principal fairness implies all three existing statistical fairness criteria once we account for relevant covariates. This result also highlights the essential role of conditioning covariates in resolving the previously recognized tradeoffs between the existing statistical fairness criteria. Finally, we discuss how to empirically choose conditioning covariates and then evaluate the principal fairness of a particular decision.
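The criterion itself is easy to operationalize once principal strata are given: within each stratum (individuals who would be similarly affected by the decision), decision rates should not differ across protected groups. The record format and toy data below are illustrative assumptions.

```python
from collections import defaultdict

# Toy records (assumed format): (principal_stratum, protected_group, decision)
records = [
    ("helped", "A", 1), ("helped", "A", 1), ("helped", "B", 1), ("helped", "B", 0),
    ("unaffected", "A", 0), ("unaffected", "B", 0),
]

def principal_fairness_gap(records):
    """Largest within-stratum difference in decision rates between groups;
    principal fairness would require this gap to be (approximately) zero."""
    by_stratum = defaultdict(lambda: defaultdict(list))
    for stratum, group, decision in records:
        by_stratum[stratum][group].append(decision)
    gap = 0.0
    for groups in by_stratum.values():
        rates = [sum(d) / len(d) for d in groups.values()]
        gap = max(gap, max(rates) - min(rates))
    return gap
```

In the toy data the "helped" stratum decides for group A at rate 1.0 but group B at rate 0.5, so the gap flags a principal-fairness violation even though overall rates might look balanced.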
Large-scale collection of human behavioral data by companies raises serious privacy concerns. We show that behavior captured in the form of application usage data collected from smartphones is highly unique even in very large datasets encompassing millions of individuals. This makes behavior-based re-identification of users across datasets possible. We study 12 months of data from 3.5 million users and show that four apps are enough to uniquely re-identify 91.2% of users using a simple strategy based on public information. Furthermore, we show that there is seasonal variability in uniqueness and that application usage fingerprints drift over time at an average constant rate.
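The uniqueness measurement behind this result can be sketched on toy data: a user is re-identifiable with k apps if some k-app subset of their installed apps is held by no other user. The dataset below is an illustrative assumption, and this brute-force check is only practical at toy scale.

```python
from itertools import combinations

# Toy app-installation data (assumed), user id -> set of installed apps.
users = {
    "u1": {"maps", "chess", "bank_a"},
    "u2": {"maps", "chess", "bank_b"},
    "u3": {"maps", "radio", "bank_a"},
    "u4": {"maps", "chess", "bank_a", "news"},
}

def unique_fraction(users, k):
    """Fraction of users with at least one k-app combination that no
    other user's app set contains."""
    count = 0
    for uid, apps in users.items():
        others = [a for oid, a in users.items() if oid != uid]
        if any(not any(set(combo) <= other for other in others)
               for combo in combinations(sorted(apps), k)):
            count += 1
    return count / len(users)
```

Even in this four-user toy, two apps suffice to single out three of the four users; the paper's finding is that at the scale of millions of users, four apps already re-identify over 90%.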
We provide the first solution for model-free reinforcement learning of ω-regular objectives for Markov decision processes (MDPs). We present a constructive reduction from the almost-sure satisfaction of ω-regular objectives to an almost-sure reachability problem and extend this technique to learning how to control an unknown model so that the chance of satisfying the objective is maximized. A key feature of our technique is the compilation of ω-regular properties into limit-deterministic Büchi automata instead of the traditional Rabin automata; this choice sidesteps difficulties that have marred previous proposals. Our approach allows us to apply model-free, off-the-shelf reinforcement learning algorithms to compute optimal strategies from the observations of the MDP. We present an experimental evaluation of our technique on benchmark learning problems.
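The construction that makes off-the-shelf RL applicable can be sketched as a product of the MDP with the automaton, where accepting automaton transitions emit reward. The tiny MDP, labeling, and single-state Büchi automaton below (for the objective "visit a `goal`-labeled state infinitely often") are illustrative assumptions, not the paper's limit-deterministic construction.

```python
# Toy MDP: state -> action -> next state, plus an atomic-proposition labeling.
mdp = {"s0": {"a": "s1", "b": "s0"}, "s1": {"a": "s0", "b": "s1"}}
label = {"s0": set(), "s1": {"goal"}}

def automaton_step(q, props):
    """One-state Büchi automaton for 'infinitely often goal': the transition
    is accepting exactly when the label read contains 'goal'."""
    return "q0", ("goal" in props)

def product_step(state, q, action):
    """One step in the product MDP; reward 1 on accepting transitions, so a
    standard RL agent is driven to satisfy the omega-regular objective."""
    next_state = mdp[state][action]
    next_q, accepting = automaton_step(q, label[next_state])
    return next_state, next_q, (1.0 if accepting else 0.0)

s, q, total = "s0", "q0", 0.0
for action in ["a", "b", "a", "a"]:
    s, q, r = product_step(s, q, action)
    total += r
```

A reinforcement learner run on `product_step` never needs to know the objective's structure; maximizing discounted reward on the product pushes it toward runs with infinitely many accepting transitions.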
The ongoing COVID-19 global pandemic is affecting every facet of human lives (e.g., public health, education, economy, transportation, and the environment). This novel pandemic and citywide implemented lockdown measures are affecting virus transmission, people's travel patterns, and air quality. Many studies have been conducted to predict the COVID-19 diffusion, assess the impacts of the pandemic on human mobility and air quality, and assess the impacts of lockdown measures on viral spread with a range of Machine Learning (ML) techniques. This review study aims to analyze results from past research to understand the interactions among the COVID-19 pandemic, lockdown measures, human mobility, and air quality. The critical review of prior studies indicates that urban form, people's socioeconomic and physical conditions, social cohesion, and social distancing measures significantly affect human mobility and COVID-19 transmission. During the COVID-19 pandemic, many people are inclined to use private transportation for necessary travel purposes to mitigate coronavirus-related health problems. This review study also noticed that COVID-19-related lockdown measures significantly improve air quality by reducing the concentration of air pollutants, which in turn improves the COVID-19 situation by reducing respiratory-related sickness and deaths of people. It is argued that ML is a powerful, effective, and robust analytic paradigm to handle complex and wicked problems such as a global pandemic. This study also discusses policy implications, which will be helpful for policymakers to take prompt actions to moderate the severity of the pandemic and improve urban environments by adopting data-driven analytic methods.


