Multi-modal representation learning by pretraining has attracted increasing interest due to its ease of use and its potential benefit for various Visual-and-Language (V-L) tasks. However, its requirement for large volumes of high-quality vision-language pairs severely limits its value in practice. In this paper, we propose a novel label-augmented V-L pretraining model, named LAMP, to address this problem. Specifically, we leverage auto-generated labels of visual objects to enrich vision-language pairs with fine-grained alignment, and we design a corresponding novel pretraining task. Moreover, we find that such label augmentation in second-stage pretraining further benefits a wide range of downstream tasks. To evaluate LAMP, we compare it with several state-of-the-art models on four downstream tasks. Quantitative results and analysis demonstrate the value of labels in V-L pretraining and the effectiveness of LAMP.
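As a rough illustration of the label-augmentation idea, the sketch below attaches detector-generated object labels to an image-caption pair before pretraining. The detector stub, data classes, and label format are assumptions for illustration only; the abstract does not specify LAMP's actual pipeline.

```python
# Hypothetical sketch: enrich a vision-language pair with auto-generated
# object labels, as the LAMP abstract describes. The detector is a stub
# standing in for an off-the-shelf object detector; nothing here is the
# paper's actual code.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VLPair:
    image_id: str
    caption: str
    object_labels: List[str] = field(default_factory=list)

def detect_object_labels(image_id: str) -> List[str]:
    # Stand-in for a real detector that emits class labels per region.
    fake_detections = {"img_001": ["dog", "frisbee", "grass"]}
    return fake_detections.get(image_id, [])

def augment_pair(pair: VLPair) -> VLPair:
    # Attach detector labels alongside the caption so the pretraining
    # task can exploit fine-grained object-word alignment.
    pair.object_labels = detect_object_labels(pair.image_id)
    return pair

pair = augment_pair(VLPair("img_001", "A dog leaps to catch a frisbee."))
print(pair.object_labels)  # ['dog', 'frisbee', 'grass']
```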
Connected vehicles, whether equipped with advanced driver-assistance systems or fully autonomous, are currently constrained to visual information in their lines of sight. A cooperative perception system among vehicles increases their situational awareness
Unsupervised pretraining has achieved great success, and many recent works have shown that unsupervised pretraining can achieve comparable or even slightly better transfer performance than supervised pretraining on downstream target datasets. But in this paper
Document layout comprises both structural and visual (e.g., font size) information that is vital but often ignored by machine learning models. The few existing models that do use layout information consider only textual content, and overlook the existence
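As a small illustration of what layout information can add beyond plain text, the sketch below represents a document element with both structural signals (its bounding box on the page) and visual signals (font size, boldness). The field names and feature layout are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch: a document element carrying textual content plus the
# structural and visual layout signals the abstract notes are often ignored.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutElement:
    text: str                         # textual content
    bbox: Tuple[int, int, int, int]   # structural info: x0, y0, x1, y1 on the page
    font_size: float                  # visual info: larger sizes often mark headings
    is_bold: bool = False             # visual info: emphasis cue

def layout_features(el: LayoutElement) -> List[float]:
    # Flatten layout signals into a feature vector a model could consume
    # alongside text embeddings.
    return [*el.bbox, el.font_size, float(el.is_bold)]

title = LayoutElement("1. Introduction", (72, 90, 320, 110), 14.0, is_bold=True)
print(layout_features(title))  # [72, 90, 320, 110, 14.0, 1.0]
```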
Fake news often involves semantic manipulations across modalities such as image, text, and location, and requires the development of multimodal semantic forensics for its detection. Recent research has centered the problem around images, calling it im
We tackle the crucial challenge of fusing different modalities of features for multimodal sentiment analysis. Existing approaches, mainly based on neural networks, largely model multimodal interactions in an implicit and hard-to-understand manner. We