Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games


Published by Alessandro Suglia

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Alessandro Suglia, Antonio Vergari, Ioannis Konstas


In visual guessing games, a Guesser has to identify a target object in a scene by asking questions of an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic zero-shot scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel imagination module based on Regularized Auto-Encoders that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% in gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86%, respectively, on the GuessWhat?! benchmark when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.


Alessandro Suglia, Yonatan Bisk, Ioannis Konstas (2021)

Guessing games are a prototypical instance of the learning-by-interacting paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL). We evaluate the ability of both procedures to generalize: an in-domain evaluation shows an increased accuracy (+7.79) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.31) thanks to more fine-grained object representations learned via SPIEL.

Computation and Language · Computer Vision and Pattern Recognition · Machine Learning

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann (2019)

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all three tasks.

Computation and Language · Computer Vision and Pattern Recognition

Representations of language in a model of visually grounded speech signal

Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi (2017)

We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of spoken speech, and show that it learns to extract both form and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become richer as we go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.

Computation and Language · Artificial Intelligence · Machine Learning

Perceptual representations of structural information in images: application to quality assessment of synthesized view in FTV scenario

Ling Suiyi, Li Jing, Le Callet Patrick (2019)

As immersive multimedia techniques like Free-viewpoint TV (FTV) develop at an astonishing rate, user demand for high-quality immersive content increases dramatically. Unlike traditional uniform artifacts, the distortions within immersive content can be non-uniform and structure-related, and are thus challenging for commonly used quality metrics. Recent studies have demonstrated that representations of visual features can be extracted from multiple levels of the hierarchy. Inspired by the hierarchical representation mechanism in the human visual system (HVS), in this paper we explore structural representations to quantitatively measure the impact of such structure-related distortion on perceived quality in the FTV scenario. More specifically, a bio-inspired full-reference image quality metric is proposed based on 1) a low-level contour descriptor; 2) a mid-level contour category descriptor; and 3) a task-oriented non-natural structure descriptor. The experimental results show that the proposed model significantly outperforms the state-of-the-art metrics.

Image and Video Processing · Computer Vision and Pattern Recognition

Visually grounded learning of keyword prediction from untranscribed speech

Herman Kamper, Shane Settle, Gregory Shakhnarovich (2017)

During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful common space, allowing images to be retrieved using speech and vice versa. In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech. Concretely, we use an image-to-words multi-label visual classifier to tag images with soft textual labels, and then train a neural network to map from the speech to these soft targets. We show that the resulting speech system is able to predict which words occur in an utterance, acting as a spoken bag-of-words classifier, without seeing any parallel speech and text. We find that the model often confuses semantically related words, e.g. man and person, making it even more effective as a semantic keyword spotter.

Computation and Language · Computer Vision and Pattern Recognition
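
The soft-target training described in the Kamper et al. abstract above can be illustrated with a small sketch. The abstract does not name the loss function, so the binary cross-entropy used here is an assumption, and the vocabulary, label values, and logits are hypothetical numbers chosen only for illustration. The idea shown is that a visual tagger supplies per-word probabilities for an image, and the speech network is penalized when its per-word predictions for the paired utterance diverge from them.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_bce(logits, soft_targets):
    """Mean binary cross-entropy between sigmoid(logits) and soft targets in [0, 1].

    Assumption: the source abstract does not specify the loss; this is one
    common choice for multi-label training against soft labels.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, soft_targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)


def predicted_keywords(vocab, logits, threshold=0.5):
    """Bag-of-words prediction: words whose probability exceeds the threshold."""
    return [w for w, z in zip(vocab, logits) if sigmoid(z) > threshold]


# Hypothetical vocabulary and values, for illustration only.
vocab = ["man", "person", "dog"]
visual_soft_labels = [0.9, 0.8, 0.1]    # soft tags from the image classifier
speech_logits_good = [2.0, 1.5, -2.0]   # speech model output that agrees with the tags
speech_logits_bad = [-2.0, -1.5, 2.0]   # speech model output that disagrees

loss_good = soft_bce(speech_logits_good, visual_soft_labels)
loss_bad = soft_bce(speech_logits_bad, visual_soft_labels)
keywords = predicted_keywords(vocab, speech_logits_good)
```

In this toy setup the agreeing speech output receives the lower loss, and thresholding its probabilities yields the words "man" and "person", which mirrors the abstract's point that semantically related words tend to be predicted together.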