Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

420 0 0.0 ( 0 )

Download Cite

Added by Kuan Fang

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Kuan Fang - Alexander Toshev - Li Fei-Fei

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Many robotic applications require the agent to perform long-horizon tasks in partially observable environments. In such applications, decision making at any step can depend on observations received far in the past. Hence, being able to properly memorize and utilize the long-term history is crucial. In this work, we propose a novel memory-based policy, named Scene Memory Transformer (SMT). The proposed policy embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies. This model is generic and can be efficiently trained with reinforcement learning over long episodes. On a range of visual navigation tasks, SMT demonstrates superior performance to existing reactive and memory-based policies by a margin.

rate research

LS3: Latent Space Safe Sets for Long-Horizon Visuomotor Control of Iterative Tasks

172 - Albert Wilcox , Ashwin Balakrishna , Brijen Thananjeyan 2021

Reinforcement learning (RL) algorithms have shown impressive success in exploring high-dimensional environments to learn complex, long-horizon tasks, but can often exhibit unsafe behaviors and require extensive environment interaction when exploration is unconstrained. A promising strategy for safe learning in dynamically uncertain environments is requiring that the agent can robustly return to states where task success (and therefore safety) can be guaranteed. While this approach has been successful in low-dimensions, enforcing this constraint in environments with high-dimensional state spaces, such as images, is challenging. We present Latent Space Safe Sets (LS3), which extends this strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned Safe Set where task completion is likely. We evaluate LS3 on 4 domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use prior task successes to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints. See https://tinyurl.com/latent-ss for code and supplementary material.

Machine Learning Artificial Intelligence Robotics

Communicative Learning with Natural Gestures for Embodied Navigation Agents with Human-in-the-Scene

344 - Qi Wu , Cheng-Ju Wu , Yixin Zhu 2021

Human-robot collaboration is an essential research topic in artificial intelligence (AI), enabling researchers to devise cognitive AI systems and affords an intuitive means for users to interact with the robot. Of note, communication plays a central role. To date, prior studies in embodied agent navigation have only demonstrated that human languages facilitate communication by instructions in natural languages. Nevertheless, a plethora of other forms of communication is left unexplored. In fact, human communication originated in gestures and oftentimes is delivered through multimodal cues, e.g. go there with a pointing gesture. To bridge the gap and fill in the missing dimension of communication in embodied agent navigation, we propose investigating the effects of using gestures as the communicative interface instead of verbal cues. Specifically, we develop a VR-based 3D simulation environment, named Ges-THOR, based on AI2-THOR platform. In this virtual environment, a human player is placed in the same virtual scene and shepherds the artificial agent using only gestures. The agent is tasked to solve the navigation problem guided by natural gestures with unknown semantics; we do not use any predefined gestures due to the diversity and versatile nature of human gestures. We argue that learning the semantics of natural gestures is mutually beneficial to learning the navigation task--learn to communicate and communicate to learn. In a series of experiments, we demonstrate that human gesture cues, even without predefined semantics, improve the object-goal navigation for an embodied agent, outperforming various state-of-the-art methods.

Artificial Intelligence Computer Vision and Pattern Recognition Human-Computer Interaction

Towards Embodied Scene Description

90 - Sinan Tan , Huaping Liu , Di Guo 2020

Embodiment is an important characteristic for all intelligent agents (creatures and robots), while existing scene description tasks mainly focus on analyzing images passively and the semantic understanding of the scenario is separated from the interaction between the agent and the environment. In this work, we propose the Embodied Scene Description, which exploits the embodiment ability of the agent to find an optimal viewpoint in its environment for scene description tasks. A learning framework with the paradigms of imitation learning and reinforcement learning is established to teach the intelligent agent to generate corresponding sensorimotor activities. The proposed framework is tested on both the AI2Thor dataset and a real world robotic platform demonstrating the effectiveness and extendability of the developed method.

Robotics Computer Vision and Pattern Recognition

A Framework for Multisensory Foresight for Embodied Agents

176 - Xiaohui Chen , Ramtin Hosseini , Karen Panetta 2021

Predicting future sensory states is crucial for learning agents such as robots, drones, and autonomous vehicles. In this paper, we couple multiple sensory modalities with exploratory actions and propose a predictive neural network architecture to address this problem. Most existing approaches rely on large, manually annotated datasets, or only use visual data as a single modality. In contrast, the unsupervised method presented here uses multi-modal perceptions for predicting future visual frames. As a result, the proposed model is more comprehensive and can better capture the spatio-temporal dynamics of the environment, leading to more accurate visual frame prediction. The other novelty of our framework is the use of sub-networks dedicated to anticipating future haptic, audio, and tactile signals. The framework was tested and validated with a dataset containing 4 sensory modalities (vision, haptic, audio, and tactile) on a humanoid robot performing 9 behaviors multiple times on a large set of objects. While the visual information is the dominant modality, utilizing the additional non-visual modalities improves the accuracy of predictions.

Robotics Computer Vision and Pattern Recognition

Gibson Env: Real-World Perception for Embodied Agents

218 - Fei Xia , Amir Zamir , Zhi-Yang He 2018

Developing visual perception models for active agents and sensorimotor control are cumbersome to be done in the physical world, as existing algorithms are too slow to efficiently learn in real-time and robots are fragile and costly. This has given rise to learning-in-simulation which consequently casts a question on whether the results transfer to real-world. In this paper, we are concerned with the problem of developing real-world perception for active agents, propose Gibson Virtual Environment for this purpose, and showcase sample perceptual tasks learned therein. Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real-world and reflecting its semantic complexity, II. having an internal synthesis mechanism, Goggles, enabling deploying the trained models in real-world without needing further domain adaptation, III. embodiment of agents and making them subject to constraints of physics and space.

Artificial Intelligence Computer Vision and Pattern Recognition Graphics

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions