BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps

96 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Wang Zhu

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Wang Zhu - Hexiang Hu - Jiacheng Chen

الذكاء الاصطناعي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that is learned to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A special design memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalks generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics, in particular, is able to follow long instructions better. The codes and the datasets are released on our project page https://github.com/Sha-Lab/babywalk.

قيم البحث

109 - Xin Eric Wang , Vihan Jain , Eugene Ie 2020

Recent research efforts enable study for natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, existing methods tend to overfit training data in seen environments and f ail to generalize well in previously unseen environments. To close the gap between seen and unseen environments, we aim at learning a generalized navigation model from two novel perspectives: (1) we introduce a multitask navigation model that can be seamlessly trained on both Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks, which benefits from richer natural language guidance and effectively transfers knowledge across tasks; (2) we propose to learn environment-agnostic representations for the navigation policy that are invariant among the environments seen during training, thus generalizing better on unseen environments. Extensive experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments, and the navigation agent trained so outperforms baselines on unseen environments by 16% (relative measure on success rate) on VLN and 120% (goal progress) on NDH. Our submission to the CVDN leaderboard establishes a new state-of-the-art for the NDH task on the holdout test set. Code is available at https://github.com/google-research/valan.

الذكاء الاصطناعي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

362 - Muhammad Zubair Irshad , Niluthpol Chowdhury Mithun , Zachary Seymour 2021

This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-base d VLN methods struggle at this task as they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities which is crucial in generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align map and language modalities to enable effective learning of VLN policy. Empirical results in a photo-realistic long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines with over 22% relative improvement in SPL in prior unseen environments.

علم الروبوتات الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Core Challenges in Embodied Vision-Language Planning

176 - Jonathan Francis , Nariaki Kitamura , Felix Labelle 2021

Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approache s and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

155 - Wanrong Zhu , Xin Eric Wang , Tsu-Jui Fu 2020

One of the most challenging topics in Natural Language Processing (NLP) is visually-grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is such a task where an agent follows natural language instructions and na vigates a real-life urban environment. Due to the lack of human-annotated instructions that illustrate intricate urban scenes, outdoor VLN remains a challenging task to solve. This paper introduces a Multimodal Text Style Transfer (MTST) learning approach and leverages external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of the instructions generated by Google Maps API, then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic, and our MTST approach significantly outperforms the baseline models on the outdoor VLN task, improving task completion rate by 8.7% relatively on the test set.

الحساب واللغة الذكاء الاصطناعي الرؤية الحاسوبية وتمييز الأنماط

Multiscale Self Attentive Convolutions for Vision and Language Modeling

69 - Oren Barkan 2019

Self attention mechanisms have become a key building block in many state-of-the-art language understanding models. In this paper, we show that the self attention operator can be formulated in terms of 1x1 convolution operations. Following this observ ation, we propose several novel operators: First, we introduce a 2D version of self attention that is applicable for 2D signals such as images. Second, we present the 1D and 2D Self Attentive Convolutions (SAC) operator that generalizes self attention beyond 1x1 convolutions to 1xm and nxm convolutions, respectively. While 1D and 2D self attention operate on individual words and pixels, SAC operates on m-grams and image patches, respectively. Third, we present a multiscale version of SAC (MSAC) which analyzes the input by employing multiple SAC operators that vary by filter size, in parallel. Finally, we explain how MSAC can be utilized for vision and language modeling, and further harness MSAC to form a cross attentive image similarity machinery.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط