Tasks that rely on multi-modal information typically include a fusion module that combines information from different modalities. In this work, we develop a Refiner Fusion Network (ReFNet) that enables fusion modules to combine strong unimodal representations with strong multimodal representations. ReFNet couples the fusion network with a decoding/defusing module, which imposes a modality-centric responsibility condition. This approach addresses a significant gap in existing multimodal fusion frameworks by ensuring that both unimodal and fused representations are strongly encoded in the latent fusion space. We demonstrate that the Refiner Fusion Network can improve upon the performance of powerful baseline fusion modules such as multimodal transformers. The refiner step induces graphical representations of the fused embeddings in the latent space, a property we prove under certain conditions and support with strong empirical results in our numerical experiments. These graph structures are further strengthened by combining ReFNet with a Multi-Similarity contrastive loss function. The modular nature of the Refiner Fusion Network allows it to be combined easily with different fusion architectures, and the refiner step can additionally be applied for pre-training on unlabeled datasets, thus leveraging unsupervised data to improve performance. We demonstrate the power of Refiner Fusion Networks on three datasets, and further show that they can maintain performance with only a small fraction of labeled data.
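A rough sketch of the refiner ("defusing") idea described above is given below. The module names, dimensions, and mean-squared reconstruction loss are illustrative assumptions rather than the paper's exact architecture: the fused representation is required to reconstruct each unimodal embedding, which is one way to realize a modality-centric responsibility condition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinerFusion(nn.Module):
    """Sketch: a fusion encoder paired with per-modality decoders
    ("defusing" heads) that reconstruct each unimodal embedding
    from the fused representation."""

    def __init__(self, dims, fused_dim=256):
        super().__init__()
        # dims: list of input embedding sizes, one per modality
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )
        # one decoder per modality maps the fused vector back to that modality
        self.defuse = nn.ModuleList([nn.Linear(fused_dim, d) for d in dims])

    def forward(self, embeddings):
        # embeddings: list of tensors [batch, dims[i]], one per modality
        fused = self.fuse(torch.cat(embeddings, dim=-1))
        # refiner loss: each unimodal embedding should be recoverable
        refine_loss = sum(
            F.mse_loss(dec(fused), emb)
            for dec, emb in zip(self.defuse, embeddings)
        )
        return fused, refine_loss

# usage: the refiner loss is added to the task loss, or used alone
# when pre-training on unlabeled data
model = RefinerFusion(dims=[512, 128])
vision, audio = torch.randn(8, 512), torch.randn(8, 128)
fused, refine_loss = model([vision, audio])
```

Because the reconstruction term requires no labels, this step can be run on unlabeled data before a task-specific loss is introduced, which is consistent with the pre-training use described in the abstract.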
One of the major reasons for misclassification of multiplex actions during action recognition is the unavailability of complementary features that provide semantic information about the actions. In different domains these features are present with…
This paper proposes a method for representation learning of multimodal data using contrastive losses. A traditional approach is to contrast different modalities to learn the information shared between them. However, that approach could fail to learn…
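As a point of reference for the "traditional approach" mentioned above, a minimal cross-modal contrastive loss (a symmetric InfoNCE over paired embeddings; the normalization and temperature value are assumptions, not details from the paper) can be sketched as follows.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: matching pairs (z_a[i], z_b[i]) are pulled
    together; all other pairings in the batch are pushed apart."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # [batch, batch] similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

This objective rewards only information shared across the two modalities, which is precisely the limitation the abstract hints at.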
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks…
Users of social networks tend to post and share content with little restraint. Hence, rumors and fake news can quickly spread on a huge scale. This may pose a threat to the credibility of social media and can cause serious consequences in real life.
Deep multimodal fusion by using multiple sources of data for classification or regression has exhibited a clear advantage over the unimodal counterpart on various applications. Yet, current methods including aggregation-based and alignment-based fusion…
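To make the fusion categories named above concrete, a bare-bones example of aggregation-based fusion (plain feature concatenation followed by a joint classifier head; the layer sizes and names are hypothetical, not taken from the paper) could look like this.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Aggregation-based fusion: concatenate unimodal features and map
    them jointly to class logits."""

    def __init__(self, dim_a, dim_b, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_a + dim_b, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, feat_a, feat_b):
        # feat_a: [batch, dim_a], feat_b: [batch, dim_b]
        return self.head(torch.cat([feat_a, feat_b], dim=-1))
```

Alignment-based fusion, by contrast, keeps per-modality streams and aligns their representations (e.g., with a similarity or contrastive objective) rather than concatenating them.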