
Where Are You? Localization from Embodied Dialog

Added by Meera Hahn
Publication date: 2020
Language: English





We present Where Are You? (WAY), a dataset of ~6k dialogs in which two humans -- an Observer and a Locator -- complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer's location within 3m in unseen buildings, vs. 70.4% for human Locators.
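The success metric quoted above (localizing the Observer within 3m) can be sketched as a simple Euclidean-distance threshold over top-down map coordinates. This is an illustrative reconstruction, not the authors' released evaluation code; the function names are hypothetical.

```python
import math

def led_success(pred_xy, true_xy, threshold_m=3.0):
    """True if the predicted Observer location lies within
    threshold_m meters (Euclidean) of the true location."""
    dx = pred_xy[0] - true_xy[0]
    dy = pred_xy[1] - true_xy[1]
    return math.hypot(dx, dy) <= threshold_m

def success_rate(predictions, ground_truths, threshold_m=3.0):
    """Fraction of episodes where localization succeeds."""
    hits = sum(led_success(p, t, threshold_m)
               for p, t in zip(predictions, ground_truths))
    return hits / len(predictions)
```

Under this metric, the reported 32.7% (model) vs. 70.4% (human) figures would be `success_rate` over the unseen-building test episodes.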




Read More

Yitian Yuan, Tao Mei, Wenwu Zhu (2018)
Given an untrimmed video and a sentence description, temporal sentence localization aims to automatically determine the start and end points of the described sentence within the video. The problem is challenging as it requires understanding of both the video and the sentence. Existing research predominantly employs a costly scan-and-localize framework, neglecting the global video context and the specific details within sentences, both of which are critical to this problem. In this paper, we propose a novel Attention Based Location Regression (ABLR) approach to solve temporal sentence localization from a global perspective. Specifically, to preserve context information, ABLR first encodes both video and sentence via Bidirectional LSTM networks. Then, a multi-modal co-attention mechanism is introduced to generate not only video attention, which reflects the global video structure, but also sentence attention, which highlights the crucial details for temporal localization. Finally, a novel attention-based location regression network is designed to predict the temporal coordinates of the sentence query from the preceding attention. ABLR is jointly trained in an end-to-end manner. Comprehensive experiments on the ActivityNet Captions and TACoS datasets demonstrate both the effectiveness and the efficiency of the proposed ABLR approach.
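The attend-then-regress idea described above can be sketched in a few lines: attention weights pool a feature sequence into a summary vector, and a linear layer maps the joint video/sentence summary to [start, end] coordinates. This is a heavily simplified stand-in for ABLR (no BiLSTM encoders, a single linear regressor `w, b` instead of a learned network); all names are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(features, query):
    """Attend over a (T, D) feature sequence with a (D,) query and
    return the attention-weighted summary and the weights."""
    scores = features @ query            # (T,)
    weights = softmax(scores)            # (T,)
    return weights @ features, weights   # (D,), (T,)

def regress_location(video_feats, sentence_feats, w, b):
    """Predict normalized [start, end] temporal coordinates from
    attended video and sentence summaries via a linear layer."""
    q = sentence_feats.mean(axis=0)                      # sentence query
    video_summary, _ = attention_pool(video_feats, q)    # video attention
    sent_summary, _ = attention_pool(sentence_feats,
                                     video_summary)      # sentence attention
    joint = np.concatenate([video_summary, sent_summary])
    start, end = w @ joint + b
    return float(start), float(end)
```

A direct regression of this form avoids the sliding-window scan-and-localize loop: one forward pass yields the predicted temporal span.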
Angelo Tartaglia (2012)
This talk discusses various aspects of the structure of space-time, presenting mechanisms leading to an explanation of the rigidity of the manifold and of the emergence of time, i.e. of the Lorentzian signature. The proposed ingredient is the analog, in four dimensions, of the deformation energy associated with common three-dimensional elasticity theory. The inclusion of this additional term in the Lagrangian of empty space-time accounts for gravity as an emergent feature of the microscopic structure of space-time. Once time has legitimately been introduced, a global positioning method based on local measurements of proper times between the arrivals of electromagnetic pulses from independent distant sources is presented. The method considers both pulsars and artificial emitters located on celestial bodies of the solar system as pulsating beacons to be used for navigation and positioning.
Embodied instruction following is a challenging problem requiring an agent to infer a sequence of primitive actions to achieve a goal environment state from complex language and visual inputs. Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem, consisting of step-by-step natural language instructions to achieve subgoals which compose into an ultimate high-level goal. Key challenges for this task include localizing target locations and navigating to them through visual inputs, and grounding language instructions to the visual appearance of objects. To address these challenges, in this study, we augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep. We also improve language grounding by introducing a pre-trained object detection module to the model pipeline. Empirical studies show that our approach exceeds the baseline model performance.
Mark Graham, Scott A. Hale (2013)
The movements of ideas and content between locations and languages are unquestionably crucial concerns to researchers of the information age, and Twitter has emerged as a central, global platform on which hundreds of millions of people share knowledge and information. A variety of research has attempted to harvest locational and linguistic metadata from tweets in order to understand important questions related to the 300 million tweets that flow through the platform each day. However, much of this work is carried out with only a limited understanding of how best to work with the spatial and linguistic contexts in which the information was produced, and standard, well-accepted practices have yet to emerge. As such, this paper studies the reliability of key methods used to determine the language and location of content on Twitter. It compares three automated language identification packages to Twitter's user interface language setting and to a human coding of languages in order to identify common sources of disagreement. The paper also demonstrates that in many cases user-entered profile locations differ from the physical locations users are actually tweeting from. These open-ended, user-generated profile locations therefore cannot be used as useful proxies for the physical locations from which information is published to Twitter.
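The comparison described above reduces to checking where several language-identification sources agree on the same tweet. A minimal sketch of that disagreement analysis, assuming each tweet carries a mapping from detector name to language code (the data layout and function names are hypothetical):

```python
from collections import Counter

def disagreement_cases(tweets):
    """Given tweets as dicts mapping detector name -> language code,
    return the tweets on which the detectors disagree."""
    return [t for t in tweets if len(set(t.values())) > 1]

def majority_label(labels):
    """Majority vote across detectors; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]
```

The returned disagreement cases are the ones the paper would hand to human coders to identify systematic sources of error.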
Vision and language tasks have benefited from attention, and a number of different attention models have been proposed. However, the scale at which attention needs to be applied has not been well examined. In this work, we propose a new method, Granular Multi-modal Attention, which addresses the question of the right granularity at which to attend while solving the Visual Dialog task. The proposed method shows improvement in both image and text attention networks. We then propose a granular Multi-modal Attention network that jointly attends over image and text granules and achieves the best performance. With this work, we observe that granular, exhaustive multi-modal attention appears to be the best way to attend while solving Visual Dialog.