تشمل اللغة البشرية أكثر من مجرد نص؛كما أنه ينقل العواطف من خلال النغمة والإيماءات.نقدم دراسة حالة لثلاث هندسة بسيطة وفعالة قائمة على المحولات لتنبؤ المعنويات والعاطفة في البيانات متعددة الوسائط.يقوم نموذج الانصهار المتأخر بدمج ميزات Unimodal لإنشاء تسلسل ميزة متعددة الوسائط، يجمع نموذج Robin Robin بشكل متكرر بين ميزات BIMODAL باستخدام اهتماما عبر الوسائط، ويجمع نموذج الانصهار الهجين بين ميزات Trimodal و Unimodal معا لتشكيل تسلسل مائع نهائي للتنبؤ بالمشاعر.تبين تجاربنا أن نماذجنا الصغيرة فعالة ومتفوقة على الإصدارات التي تم إصدارها علنا من أنظمة تحليل المعنويات متعددة الوسائط الأكبر والحديثة.
Human language encompasses more than just text; it also conveys emotions through tone and gestures. We present a case study of three simple and efficient Transformer-based architectures for predicting sentiment and emotion in multimodal data. The Late Fusion model merges unimodal features to create a multimodal feature sequence, the Round Robin model iteratively combines bimodal features using cross-modal attention, and the Hybrid Fusion model combines trimodal and unimodal features together to form a final feature sequence for predicting sentiment. Our experiments show that our small models are effective and outperform the publicly released versions of much larger, state-of-the-art multimodal sentiment analysis systems.
References used
https://aclanthology.org/
Best-worst Scaling (BWS) is a methodology for annotation based on comparing and ranking instances, rather than classifying or scoring individual instances. Studies have shown the efficacy of this methodology applied to NLP tasks in terms of a higher
Recent knowledge graph embedding (KGE) models based on hyperbolic geometry have shown great potential in a low-dimensional embedding space. However, the necessity of hyperbolic space in KGE is still questionable, because the calculation based on hype
This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-s
Multimodal research has picked up significantly in the space of question answering with the task being extended to visual question answering, charts question answering as well as multimodal input question answering. However, all these explorations pr
An understanding of humor is an essential component of human-facing NLP systems. In this paper, we investigate several methods for detecting humor in short statements as part of Semeval-2021 Shared Task 7. For Task 1a, we apply an ensemble of fine-tu