New community

Subscribe to the gold package and get unlimited access to Shamra Academy

MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

MTTT: تعلم تعويضات التفاعل المتعدد الوسائط من المحولات ثلاثية التريلين

622 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or utilized as labels only for classification. On the other hand, trilinear models such as the CTI model efficiently utilize the inter-modality information between answers, questions, and images, while ignoring intra-modality information. Inspired by this observation, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating the attention mechanisms for capturing inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow where a bilinear model reduces the free-form, open-ended VQA problem into a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pre-train MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and VQA-1.0 Multiple Choice task and outperforms bilinear baselines on the VQA-2.0, TDIUC and GQA datasets.

References used

https://aclanthology.org/

rate research

Learning to Select Question-Relevant Relations for Visual Question Answering

402 - Association for Computation Linguistics 2021 مقالة

Previous existing visual question answering (VQA) systems commonly use graph neural networks(GNNs) to extract visual relationships such as semantic relations or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate outputs from multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses question embedding and joint embedding of the encoders to obtain dynamic attention weights with regard to the type of questions. Using the learnable attention weights, the proposed model can efficiently use the necessary visual relation features for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weight and show that the proposed model assigns a higher weight to relations that are more relevant to the question.

select question-relevant relations حدد العلاقات ذات الصلة بالسؤال صناعة حمض الفوسفور

MIMOQA: Multimodal Input Multimodal Output Question Answering

387 - Association for Computation Linguistics 2021 مقالة

Multimodal research has picked up significantly in the space of question answering with the task being extended to visual question answering, charts question answering as well as multimodal input question answering. However, all these explorations pr oduce a unimodal textual output as the answer. In this paper, we propose a novel task - MIMOQA - Multimodal Input Multimodal Output Question Answering in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates a joint textual and visual attention towards producing such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both the automatic as well as human metrics.

نهج التعلم متري output question answering input question answering إخراج سؤال الرد إجابة سؤال المدخلات صناعة حمض الفوسفور

Multimodal Weighted Fusion of Transformers for Movie Genre Classification

370 - Association for Computation Linguistics 2021 مقالة

The Multimodal Transformer showed to be a competitive model for multimodal tasks involving textual, visual and audio signals. However, as more modalities are involved, its late fusion by concatenation starts to have a negative impact on the model's p erformance. Besides, interpreting model's predictions becomes difficult, as one would have to look at the different attention activation matrices. In order to overcome these shortcomings, we propose to perform late fusion by adding a GMU module, which effectively allows the model to weight modalities at instance level, improving its performance while providing a better interpretabilty mechanism. In the experiments, we compare our proposed model (MulT-GMU) against the original implementation (MulT-Concat) and a SOTA model tested in a movie genre classification dataset. Our approach, MulT-GMU, outperforms both, MulT-Concat and previous SOTA model.

multimodal weighted fusion multimodal weighted multimodal transformer showed multimodal مرجح الانصهار multimodal المرجحة عرض محول متعدد الوسائط صناعة حمض الفوسفور المزيد..

Context-Aware Interaction Network for Question Matching

400 - Association for Computation Linguistics 2021 مقالة

Impressive milestones have been achieved in text matching by adopting a cross-attention mechanism to capture pertinent semantic connections between two sentence representations. However, regular cross-attention focuses on word-level links between the two input sequences, neglecting the importance of contextual information. We propose a context-aware interaction network (COIN) to properly align two sequences and infer their semantic relationship. Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information when aligning two sequences, and (2) a gate fusion layer to flexibly interpolate aligned representations. We apply multiple stacked interaction blocks to produce alignments at different levels and gradually refine the attention results. Experiments on two question matching datasets and detailed analyses demonstrate the effectiveness of our model.

context-aware interaction network question matching interaction network شبكة التفاعل السياق مطابقة السؤال شبكة التفاعل صناعة حمض الفوسفور المزيد..

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

499 - Association for Computation Linguistics 2021 مقالة

Abstract Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important fa ctors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

decoupling the role role of data multimodal transformers فك الدورة دور البيانات محولات متعددة الوسائط صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

MTTT: تعلم تعويضات التفاعل المتعدد الوسائط من المحولات ثلاثية التريلين

Ask ChatGPT about the research

Read More

suggested questions