ترغب بنشر مسار تعليمي؟ اضغط هنا

Error-Aware Policy Learning: Zero-Shot Generalization in Partially Observable Dynamic Environments

382   0   0.0 ( 0 )
 نشر من قبل Visak Chadalavada Vijay Kumar
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Simulation provides a safe and efficient way to generate useful data for learning complex robotic tasks. However, matching simulation and real-world dynamics can be quite challenging, especially for systems that have a large number of unobserved or unmeasurable parameters, which may lie in the robot dynamics itself or in the environment with which the robot interacts. We introduce a novel approach to tackle such a sim-to-real problem by developing policies capable of adapting to new environments, in a zero-shot manner. Key to our approach is an error-aware policy (EAP) that is explicitly made aware of the effect of unobservable factors during training. An EAP takes as input the predicted future state error in the target environment, which is provided by an error-prediction function, simultaneously trained with the EAP. We validate our approach on an assistive walking device trained to help the human user recover from external pushes. We show that a trained EAP for a hip-torque assistive device can be transferred to different human agents with unseen biomechanical characteristics. In addition, we show that our method can be applied to other standard RL control tasks.

قيم البحث

اقرأ أيضاً

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We def ine the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing disco unted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero sum games, without any domain-specific state space reductions.
Recent advances in data-driven models for grounded language understanding have enabled robots to interpret increasingly complex instructions. Two fundamental limitations of these methods are that most require a full model of the environment to be kno wn a priori, and they attempt to reason over a world representation that is flat and unnecessarily detailed, which limits scalability. Recent semantic mapping methods address partial observability by exploiting language as a sensor to infer a distribution over topological, metric and semantic properties of the environment. However, maintaining a distribution over highly detailed maps that can support grounding of diverse instructions is computationally expensive and hinders real-time human-robot collaboration. We propose a novel framework that learns to adapt perception according to the task in order to maintain compact distributions over semantic maps. Experiments with a mobile manipulator demonstrate more efficient instruction following in a priori unknown environments.
This study proposes an integrated task and motion planning method for dynamic locomotion in partially observable environments with multi-level safety guarantees. This layered planning framework is composed of a high-level symbolic task planner and a low-level phase-space motion planner. A belief abstraction at the task planning level enables belief estimation of dynamic obstacle locations and guarantees navigation safety with collision avoidance. The high-level task planner, i.e., a two-level navigation planner, employs linear temporal logic for a reactive game synthesis between the robot and its environment while incorporating low-level safe keyframe policies into formal task specification design. The synthesized task planner commands a series of locomotion actions including walking step length, step height, and heading angle changes, to the underlying keyframe decision-maker, which further determines the robot center-of-mass apex velocity keyframe. The low-level phase-space planner uses a reduced-order locomotion model to generate non-periodic trajectories meeting balancing safety criteria for straight and steering walking. These criteria are characterized by constraints on locomotion keyframe states, and are used to define keyframe transition policies via viability kernels. Simulation results of a Cassie bipedal robot designed by Agility Robotics demonstrate locomotion maneuvering in a three-dimensional, partially observable environment consisting of dynamic obstacles and uneven terrain.
This paper presents a data-driven approach for multi-robot coordination in partially-observable domains based on Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) and macro-actions (MAs). Dec-POMDPs provide a general framework for cooperative sequential decision making under uncertainty and MAs allow temporally extended and asynchronous action execution. To date, most methods assume the underlying Dec-POMDP model is known a priori or a full simulator is available during planning time. Previous methods which aim to address these issues suffer from local optimality and sensitivity to initial conditions. Additionally, few hardware demonstrations involving a large team of heterogeneous robots and with long planning horizons exist. This work addresses these gaps by proposing an iterative sampling based Expectation-Maximization algorithm (iSEM) to learn polices using only trajectory data containing observations, MAs, and rewards. Our experiments show the algorithm is able to achieve better solution quality than the state-of-the-art learning-based methods. We implement two variants of multi-robot Search and Rescue (SAR) domains (with and without obstacles) on hardware to demonstrate the learned policies can effectively control a team of distributed robots to cooperate in a partially observable stochastic environment.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا