Themes Informed Audio-visual Correspondence Learning

166 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Runze Su

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Runze Su - Fei Tao - Xudong Liu

الذكاء الاصطناعي الوسائط المتعددة التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The applications of short-term user-generated video (UGV), such as Snapchat, and Youtube short-term videos, booms recently, raising lots of multimodal machine learning tasks. Among them, learning the correspondence between audio and visual information from videos is a challenging one. Most previous work of the audio-visual correspondence(AVC) learning only investigated constrained videos or simple settings, which may not fit the application of UGV. In this paper, we proposed new principles for AVC and introduced a new framework to set sight of videos themes to facilitate AVC learning. We also released the KWAI-AD-AudVis corpus which contained 85432 short advertisement videos (around 913 hours) made by users. We evaluated our proposed approach on this corpus, and it was able to outperform the baseline by 23.15% absolute difference.

قيم البحث

104 - Jen-Cheng Hou , Syu-Siang Wang , Ying-Hui Lai 2017

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of con volutional neural networks (CNNs) in SE, we propose an audio-visual deep CNNs (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Precisely speaking, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields a notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audio-visual SE model, confirming its capability of effectively combining audio and visual information in SE.

أنظمة الصوت في الحاسوب الوسائط المتعددة التعلم الالي

Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network

99 - Jen-Cheng Hou , Syu-Siang Wang , Ying-Hui Lai 2017

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus on addressing audio information only. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of con volutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. In the proposed AVDCNN SE model, audio and visual data are first processed using individual CNNs, and then, fused into a joint network to generate enhanced speech at the output layer. The AVDCNN model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five objective criteria. Results show that the AVDCNN yields notably better performance, compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process.

أنظمة الصوت في الحاسوب الوسائط المتعددة التعلم الالي

Self-supervised object detection from audio-visual correspondence

131 - Triantafyllos Afouras , Yuki M. Asano , Francois Fagan 2021

We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio c omponent to teach the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors for the task of object detection and sound source localization. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.

الرؤية الحاسوبية وتمييز الأنماط

Positive Sample Propagation along the Audio-Visual Event Line

144 - Jinxing Zhou , Liang Zheng , Yiran Zhong 2021

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize video segments containing an AVE and identify its category. In order to learn discriminative features for a classifi er, it is pivotal to identify the helpful (or positive) audio-visual segment pairs while filtering out the irrelevant ones, regardless whether they are synchronized or not. To this end, we propose a new positive sample propagation (PSP) module to discover and exploit the closely related audio-visual pairs by evaluating the relationship within every possible pair. It can be done by constructing an all-pair similarity map between each audio and visual segment, and only aggregating the features from the pairs with high similarity scores. To encourage the network to extract high correlated features for positive samples, a new audio-visual pair similarity loss is proposed. We also propose a new weighting branch to better exploit the temporal correlations in weakly supervised setting. We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings, thus verifying the effectiveness of our method.

الرؤية الحاسوبية وتمييز الأنماط الوسائط المتعددة أنظمة الصوت في الحاسوب

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

381 - Yapeng Tian , Dingzeyu Li , 2020

In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both. Such a problem is essential for a complete understanding of the scene depicted inside a video. To facilitate exploration, we collect a Look, Listen, and Parse (LLP) dataset to investigate audio-visual video parsing in a weakly-supervised manner. This task can be naturally formulated as a Multimodal Multiple Instance Learning (MMIL) problem. Concretely, we propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously. We develop an attentive MMIL pooling method to adaptively explore useful audio and visual content from different temporal extent and modalities. Furthermore, we discover and mitigate modality bias and noisy label issues with an individual-guided learning mechanism and label smoothing technique, respectively. Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels. Our proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate modality bias and noisy labels problems.

الرؤية الحاسوبية وتمييز الأنماط الوسائط المتعددة أنظمة الصوت في الحاسوب