Modular Framework for Visuomotor Language Grounding

132 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Kolby Nottingham

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Kolby Nottingham - Litian Liang - Daeyun Shin

الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research. However, data collection for these tasks is expensive and end-to-end approaches suffer from data inefficiency. We propose the structuring of language, acting, and visual tasks into separate modules that can be trained independently. Using a Language, Action, and Vision (LAV) framework removes the dependence of action and vision modules on instruction following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.

قيم البحث

155 - Tristan Karch , Laetitia Teodorescu , Katja Hofmann 2021

Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to randomly held-out sentences; 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good performance on generalization overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents. We also release our code under open-source license as well as pretrained models and datasets to encourage the wider community to build upon and extend our work in the future.

الذكاء الاصطناعي الحساب واللغة التعلم الآلي

Flow: A Modular Learning Framework for Autonomy in Traffic

100 - Cathy Wu , Aboudy Kreidieh , Kanaad Parvate 2017

The rapid development of autonomous vehicles (AVs) holds vast potential for transportation systems through improved safety, efficiency, and access to mobility. However, due to numerous technical, political, and human factors challenges, new methodolo gies are needed to design vehicles and transportation systems for these positive outcomes. This article tackles technical challenges arising from the partial adoption of autonomy: partial control, partial observation, complex multi-vehicle interactions, and the sheer variety of traffic settings represented by real-world networks. The article presents a modular learning framework which leverages deep Reinforcement Learning methods to address complex traffic dynamics. Modules are composed to capture common traffic phenomena (traffic jams, lane changing, intersections). Learned control laws are found to exceed human driving performance by at least 40% with only 5-10% adoption of AVs. In partially-observed single-lane traffic, a small neural network control law can eliminate stop-and-go traffic -- surpassing all known model-based controllers, achieving near-optimal performance, and generalizing to out-of-distribution traffic densities.

الذكاء الاصطناعي علم الروبوتات أنظمة وتحكم

Hierarchical visuomotor control of humanoids

108 - Josh Merel , Arun Ahuja , Vu Pham 2018

We aim to build complex humanoid agents that integrate perception, motor control, and memory. In this work, we partly factor this problem into low-level motor control from proprioception and high-level coordination of the low-level skills informed by vision. We develop an architecture capable of surprisingly flexible, task-directed motor control of a relatively high-DoF humanoid body by combining pre-training of low-level motor controllers with a high-level, task-focused controller that switches among low-level sub-policies. The resulting system is able to control a physically-simulated humanoid body to solve tasks that require coupling visual perception from an unstabilized egocentric RGB camera during locomotion in the environment. For a supplementary video link, see https://youtu.be/7GISvfbykLE .

الذكاء الاصطناعي علم الروبوتات

Modular Object-Oriented Games: A Task Framework for Reinforcement Learning, Psychology, and Neuroscience

77 - Nicholas Watters , Joshua Tenenbaum , Mehrdad Jazayeri 2021

In recent years, trends towards studying simulated games have gained momentum in the fields of artificial intelligence, cognitive science, psychology, and neuroscience. The intersections of these fields have also grown recently, as researchers increa sing study such games using both artificial agents and human or animal subjects. However, implementing games can be a time-consuming endeavor and may require a researcher to grapple with complex codebases that are not easily customized. Furthermore, interdisciplinary researchers studying some combination of artificial intelligence, human psychology, and animal neurophysiology face additional challenges, because existing platforms are designed for only one of these domains. Here we introduce Modular Object-Oriented Games, a Python task framework that is lightweight, flexible, customizable, and designed for use by machine learning, psychology, and neurophysiology researchers.

الذكاء الاصطناعي التعلم الآلي الخلايا العصبية والإدراك

MOSS: End-to-End Dialog System Framework with Modular Supervision

79 - Weixin Liang , Youzhi Tian , Chengcai Chen 2019

A major bottleneck in training end-to-end task-oriented dialog system is the lack of data. To utilize limited training data more efficiently, we propose Modular Supervision Network (MOSS), an encoder-decoder training framework that could incorporate supervision from various intermediate dialog system modules including natural language understanding, dialog state tracking, dialog policy learning, and natural language generation. With only 60% of the training data, MOSS-all (i.e., MOSS with supervision from all four dialog modules) outperforms state-of-the-art models on CamRest676. Moreover, introducing modular supervision has even bigger benefits when the dialog task has a more complex dialog state and action space. With only 40% of the training data, MOSS-all outperforms the state-of-the-art model on a complex laptop network troubleshooting dataset, LaptopNetwork, that we introduced. LaptopNetwork consists of conversations between real customers and customer service agents in Chinese. Moreover, MOSS framework can accommodate dialogs that have supervision from different dialog modules at both the framework level and model level. Therefore, MOSS is extremely flexible to update in a real-world deployment.

الذكاء الاصطناعي الحساب واللغة