Multimodal sentiment analysis is an important research area that predicts a speaker's sentiment tendency from features extracted from the textual, visual, and acoustic modalities. The central challenge is how to fuse the multimodal information. A variety of fusion methods have been proposed, but few adopt end-to-end translation models to mine the subtle correlations between modalities. Inspired by the recent success of the Transformer in machine translation, we propose a new fusion method, TransModality, for the task of multimodal sentiment analysis. We assume that translation between modalities contributes to a better joint representation of a speaker's utterance. With the Transformer, the learned features embody information from both the source and the target modality. We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, and IEMOCAP. The experiments show that our proposed method achieves state-of-the-art performance.
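As a rough illustration of this translation-based fusion idea, the sketch below (not the authors' code; the dimensions, layer counts, and regression head are assumptions) uses a standard PyTorch `nn.Transformer` to "translate" a text-modality feature sequence into a visual-modality one, and pools the decoder states as a joint representation for sentiment prediction:

```python
# Minimal sketch of cross-modal translation fusion, assuming pre-projected
# per-modality features of a shared size d_model. Not the TransModality code.
import torch
import torch.nn as nn

class CrossModalTranslator(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.recon_head = nn.Linear(d_model, d_model)  # translation objective: predict target-modality features
        self.sentiment_head = nn.Linear(d_model, 1)    # illustrative regression head (e.g. CMU-MOSI scores)

    def forward(self, src_feats, tgt_feats):
        # src_feats: (batch, src_len, d_model), e.g. projected text features
        # tgt_feats: (batch, tgt_len, d_model), e.g. projected visual features
        hidden = self.transformer(src_feats, tgt_feats)        # decoder states carry both modalities
        recon = self.recon_head(hidden)                        # reconstruction of the target modality
        sentiment = self.sentiment_head(hidden.mean(dim=1))    # pooled joint representation
        return recon, sentiment

model = CrossModalTranslator()
text = torch.randn(8, 20, 128)    # dummy text-modality sequences
visual = torch.randn(8, 30, 128)  # dummy visual-modality sequences
recon, sent = model(text, visual)
print(recon.shape, sent.shape)    # torch.Size([8, 30, 128]) torch.Size([8, 1])
```

Training such a model on both the reconstruction and sentiment losses is one plausible way to make the learned features embody information from both modalities, as the abstract describes.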
In multimodal sentiment analysis (MSA), the performance of a model depends heavily on the quality of the synthesized embeddings. These embeddings are generated by the upstream process called multimodal fusion, which aims to extract and combine the input …
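The truncated abstract above treats fusion as the upstream step that synthesizes a single embedding from per-modality inputs. For concreteness, here is a minimal concatenation-based fusion sketch (a hypothetical module, not from the paper; the feature dimensions are illustrative defaults loosely modeled on common CMU-MOSI feature sizes):

```python
# A deliberately simple baseline: project each modality to a shared size,
# concatenate, and synthesize one multimodal embedding. Assumed dimensions.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, text_dim=300, visual_dim=35, acoustic_dim=74, d=128):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, d),
            "visual": nn.Linear(visual_dim, d),
            "acoustic": nn.Linear(acoustic_dim, d),
        })
        self.fuse = nn.Linear(3 * d, d)

    def forward(self, text, visual, acoustic):
        parts = [self.proj["text"](text),
                 self.proj["visual"](visual),
                 self.proj["acoustic"](acoustic)]
        return torch.tanh(self.fuse(torch.cat(parts, dim=-1)))

fusion = ConcatFusion()
emb = fusion(torch.randn(8, 300), torch.randn(8, 35), torch.randn(8, 74))
print(emb.shape)  # torch.Size([8, 128])
```

Most of the fusion methods surveyed in this section can be read as progressively richer replacements for this concatenation step.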
We tackle the crucial challenge of fusing different modalities of features for multimodal sentiment analysis. Mainly based on neural networks, existing approaches largely model multimodal interactions in an implicit and hard-to-understand manner. …
Multimodal sentiment analysis aims to extract and integrate semantic information collected from multiple modalities in order to recognize the expressed emotions and sentiment in multimodal data. This research area's major concern lies in developing …
We introduce the MuSe-Toolbox, a Python-based open-source toolkit for creating a variety of continuous and discrete emotion gold standards. In a single framework, we unify a wide range of fusion methods and propose the novel Rater Aligned Annotation Weighting (RAAW) …
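To illustrate the kind of gold-standard creation the toolkit automates, here is a minimal, generic sketch of continuous annotation fusion (per-rater z-normalization followed by frame-wise averaging). It deliberately does not use the MuSe-Toolbox API; the function and variable names are hypothetical:

```python
# Generic continuous-annotation fusion sketch, not the MuSe-Toolbox API.
# ratings: one emotion trace per rater; the fused trace serves as a gold standard.
import numpy as np

def fuse_annotations(ratings):
    # ratings: array of shape (num_raters, num_frames)
    z = (ratings - ratings.mean(axis=1, keepdims=True)) \
        / (ratings.std(axis=1, keepdims=True) + 1e-8)  # normalize away rater bias/scale
    return z.mean(axis=0)                              # frame-wise average across raters

# Five simulated raters annotating the same 100-frame clip with noise.
rng = np.random.default_rng(0)
raters = np.vstack([np.sin(np.linspace(0, 3, 100)) + 0.1 * rng.standard_normal(100)
                    for _ in range(5)])
gold = fuse_annotations(raters)
print(gold.shape)  # (100,)
```

Methods like the toolkit's rater-alignment approach go further than this sketch by accounting for per-rater reaction delays before combining the traces.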
Most recent works on sentiment analysis have exploited the text modality. However, millions of hours of video recordings posted on social media platforms every day hold vital unstructured information that can be exploited to more effectively gauge public …