Having access to multi-modal cues (e.g., vision and audio) empowers some cognitive tasks to be solved faster than learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover richer multi-modal knowledge. Our main idea is to learn a compositional embedding that closes the cross-modal semantic gap and captures the task-relevant semantics, which facilitates pulling together representations across modalities through compositional contrastive learning. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning. Code is released at: https://github.com/yanbeic/CCL.
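For intuition, the following is a minimal sketch of what a compositional cross-modal contrastive objective can look like, in the spirit of the abstract above. The additive composition of the auxiliary modalities, the single temperature, and all function names are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

def compositional_contrastive_loss(video_emb, audio_emb, image_emb,
                                   temperature=0.1):
    """Illustrative cross-modal contrastive loss over composed embeddings.

    Composes per-sample audio/image features and pulls the matched
    (same-sample) video embedding towards the composition under an
    InfoNCE objective. The additive composition and single temperature
    are assumptions, not the paper's exact formulation.
    """
    # L2-normalize each modality's embedding.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    i = F.normalize(image_emb, dim=-1)

    # Hypothetical composition: fuse the auxiliary modalities, renormalize.
    composed = F.normalize(a + i, dim=-1)

    # Similarity logits between every video and every composed embedding.
    logits = v @ composed.t() / temperature

    # Positives lie on the diagonal (same underlying sample).
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric InfoNCE: video -> composed and composed -> video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The symmetric cross-entropy over the similarity matrix is the standard InfoNCE construction: matched pairs on the diagonal are pulled together while all other pairs in the batch serve as negatives.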
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
Deep Bregman divergence measures the divergence between data points using neural networks, going beyond Euclidean distance and capturing divergence over distributions. In this paper, we propose deep Bregman divergences for contrastive learning of visual representations.
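For reference, the classical Bregman divergence that the deep variant generalizes is generated by a strictly convex, differentiable function \(\phi\); choosing \(\phi(x) = \lVert x \rVert^2\) recovers the squared Euclidean distance \(\lVert x - y \rVert^2\), which is the precise sense in which the divergence goes beyond Euclidean distance.

```latex
% Bregman divergence generated by a strictly convex, differentiable \phi;
% the deep variant parameterizes \phi with a neural network instead of
% fixing it in closed form.
D_{\phi}(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle
```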
In recent years, many explanation methods have been proposed to explain individual classifications of deep neural networks. However, how to leverage the created explanations to improve the learning process has been less explored. As the privileged information, …
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only.
Contrastive learning has delivered impressive results in many audio-visual representation learning scenarios. However, existing approaches optimize for learning either global representations useful for tasks such as classification, or local representations useful for tasks such as audio-visual source localization.
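To make the global/local distinction concrete, here is a short sketch showing how the same InfoNCE loss can target either one pooled embedding per clip (global, suited to classification) or per-timestep embeddings within a clip (local, suited to localization). The tensor shapes and pairing scheme are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    """Generic InfoNCE loss: the i-th query matches the i-th key."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Global objective: one pooled embedding per clip, contrasted across clips.
video_global = torch.randn(8, 128)   # (batch, dim) pooled visual features
audio_global = torch.randn(8, 128)   # (batch, dim) pooled audio features
global_loss = info_nce(video_global, audio_global)

# Local objective: per-timestep embeddings contrasted across time within
# one clip of 16 steps; temporally aligned steps are the positives.
video_local = torch.randn(16, 128)   # (time, dim)
audio_local = torch.randn(16, 128)   # (time, dim)
local_loss = info_nce(video_local, audio_local)
```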