Order-Embeddings of Images and Language

317 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ivan Vendrov

تاريخ النشر 2015

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ivan Vendrov - Ryan Kiros - Sanja Fidler

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.

قيم البحث

69 - Oren Barkan 2019

Self attention mechanisms have become a key building block in many state-of-the-art language understanding models. In this paper, we show that the self attention operator can be formulated in terms of 1x1 convolution operations. Following this observ ation, we propose several novel operators: First, we introduce a 2D version of self attention that is applicable for 2D signals such as images. Second, we present the 1D and 2D Self Attentive Convolutions (SAC) operator that generalizes self attention beyond 1x1 convolutions to 1xm and nxm convolutions, respectively. While 1D and 2D self attention operate on individual words and pixels, SAC operates on m-grams and image patches, respectively. Third, we present a multiscale version of SAC (MSAC) which analyzes the input by employing multiple SAC operators that vary by filter size, in parallel. Finally, we explain how MSAC can be utilized for vision and language modeling, and further harness MSAC to form a cross attentive image similarity machinery.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

85 - Tatiana Likhomanenko , Qiantong Xu , Ronan Collobert 2021

Without positional information, attention-based transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed transformer models positional information. Absolute positional embeddi ngs are simple to implement, but suffer from generalization issues when evaluating on sequences of different length than those seen at training time. Relative positions are more robust to length change, but are more complex to implement and yield inferior model throughput. In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization). In addition, our empirical evaluation on state-of-the-art models in machine translation, image and speech recognition demonstrates that CAPE leads to better generalization performance as well as increased stability with respect to training hyper-parameters.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Language-Mediated, Object-Centric Representation Learning

362 - Ruocheng Wang , Jiayuan Mao , Samuel J. Gershman 2020

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and se gmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations to concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised object discovery algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the performance of unsupervised object discovery methods on two datasets via the help of language. We also show that concepts learned by LORL, in conjunction with object discovery methods, aid downstream tasks such as referring expression comprehension.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Core Challenges in Embodied Vision-Language Planning

176 - Jonathan Francis , Nariaki Kitamura , Felix Labelle 2021

Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approache s and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Sentiment Classification using Images and Label Embeddings

178 - Laura Graesser , Abhinav Gupta , Lakshay Sharma 2017

In this project we analysed how much semantic information images carry, and how much value image data can add to sentiment analysis of the text associated with the images. To better understand the contribution from images, we compared models which on ly made use of image data, models which only made use of text data, and models which combined both data types. We also analysed if this approach could help sentiment classifiers generalize to unknown sentiments.

الحساب واللغة الذكاء الاصطناعي الرؤية الحاسوبية وتمييز الأنماط