In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off-the-path' scenarios, where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is often not aligned with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
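The abstract does not spell out how sub-instruction completion is computed. One plausible formulation, sketched below purely for illustration (the function name, the segment-endpoint representation, and the distance threshold are all assumptions, not the paper's actual definition), is to split the reference path into one segment per sub-instruction and count how many segment endpoints the agent's trajectory reaches, in order:

```python
# Hypothetical sketch of a sub-instruction completion metric.
# A reference path is split into segments, one per sub-instruction;
# we count how many segment endpoints the agent reaches in order,
# within a distance threshold. Names and threshold are illustrative.

def subinstruction_completion(agent_path, segment_endpoints, threshold=3.0):
    """Fraction of sub-instructions completed, in order.

    agent_path: list of (x, y) agent positions along the trajectory.
    segment_endpoints: list of (x, y) points, one per sub-instruction.
    threshold: max distance for an endpoint to count as reached.
    """
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    completed = 0
    pos_idx = 0
    for endpoint in segment_endpoints:
        # Advance along the trajectory until the agent comes within
        # `threshold` of this sub-instruction's endpoint.
        while pos_idx < len(agent_path) and dist(agent_path[pos_idx], endpoint) > threshold:
            pos_idx += 1
        if pos_idx == len(agent_path):
            break  # endpoint never reached; later ones cannot count
        completed += 1
    return completed / len(segment_endpoints)
```

Requiring endpoints to be reached in order (rather than in any order) penalizes trajectories that skip intermediate sub-instructions and jump straight to the goal, which goal-distance metrics like success rate do not capture.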