A Closer Look at Weak Label Learning for Audio Events

Published by Anurag Kumar

Publication date: 2018

Research field: Informatics Engineering

Paper language: English

Authors: Ankit Shah - Anurag Kumar - Alexander G. Hauptmann

Abstract

Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak labeling approaches for audio or sound event detection (AED) and the availability of large-scale weakly labeled datasets have finally opened up the possibility of large-scale AED. However, a deeper understanding of how weak labels affect learning for sound events is still missing from the literature. In this work, we first describe a CNN-based approach for weakly supervised training of audio events. The approach follows some basic design principles desirable in a learning method relying on weakly labeled audio. We then describe important characteristics that naturally arise in weakly supervised learning of sound events, and show how these aspects of weak labels affect the generalization of models. More specifically, we study how characteristics such as label density and label corruption affect weakly supervised training for audio events. We also study the feasibility of obtaining weakly labeled data directly from the web without any manual labeling and compare it with a dataset that has been manually labeled. The analysis and understanding of these factors should be taken into account in the development of future weak label learning methods. Audioset, a large-scale weakly labeled dataset for sound events, is used in our experiments.
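
The two characteristics named in the abstract can be made concrete with a minimal Python sketch. It assumes that label density means the fraction of a recording's duration in which the tagged event is actually active, and it models label corruption as flipping a fixed fraction of binary clip-level tags. Both definitions, and the names `label_density` and `corrupt_weak_labels`, are illustrative assumptions, not the authors' code.

```python
import random

def label_density(event_segments, clip_duration):
    """Fraction of the clip covered by the event.

    event_segments: list of (onset, offset) pairs in seconds (assumed format).
    """
    covered = 0.0
    last_end = 0.0
    for onset, offset in sorted(event_segments):
        onset = max(onset, last_end)         # do not double count overlaps
        offset = min(offset, clip_duration)  # clip segments to the recording
        if offset > onset:
            covered += offset - onset
            last_end = offset
    return covered / clip_duration

def corrupt_weak_labels(labels, flip_fraction, seed=0):
    """Flip a fixed fraction of binary clip-level labels to simulate label noise."""
    rng = random.Random(seed)
    n_flip = round(flip_fraction * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]

# A 10-second clip tagged "dog bark" where barking occupies 3 seconds in total.
density = label_density([(1.0, 3.0), (2.0, 4.0)], clip_duration=10.0)
# Corrupt 20% of ten clip-level tags.
noisy = corrupt_weak_labels([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], flip_fraction=0.2)
```

A low density means the weak label says little about where the event occurs, which is one way to picture why density could matter for training.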

117 - Giuseppe Pastore, Fabio Cermelli, Yongqin Xian (2021)

Being able to segment unseen classes not observed during training is an important technical challenge in deep learning, because of its potential to reduce the expensive annotation required for semantic segmentation. Prior zero-label semantic segmentation works approach this task by learning visual-semantic embeddings or generative models. However, they are prone to overfitting on the seen classes because there is no training signal for the unseen classes. In this paper, we study the challenging generalized zero-label semantic segmentation task, where the model has to segment both seen and unseen classes at test time. We assume that pixels of unseen classes could be present in the training images without being annotated. Our idea is to capture the latent information on unseen classes by supervising the model with self-produced pseudo-labels for unlabeled pixels. We propose a consistency regularizer that filters out noisy pseudo-labels by taking the intersection of the pseudo-labels generated from different augmentations of the same image. Our framework generates pseudo-labels and then retrains the model with human-annotated and pseudo-labeled data. This procedure is repeated for several iterations. As a result, our approach achieves a new state of the art on the PascalVOC12 and COCO-stuff datasets in the challenging generalized zero-label semantic segmentation setting, surpassing other existing methods that address this task with more complex strategies.
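
The intersection step described in this abstract can be sketched in a few lines of Python. The sketch assumes the predictions from the two augmented views have already been mapped back to the same pixel grid, and it uses a hypothetical `IGNORE` marker for pixels whose pseudo-labels disagree. It illustrates the general idea only and is not the authors' implementation.

```python
IGNORE = -1  # hypothetical marker for pixels left without a pseudo-label

def intersect_pseudo_labels(pred_a, pred_b):
    """Keep a pixel's pseudo-label only when both augmented views agree on it.

    pred_a, pred_b: 2-D lists of class indices with identical shape.
    """
    return [
        [a if a == b else IGNORE for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pred_a, pred_b)
    ]

# Two 2x2 label maps predicted from different augmentations of one image.
view_a = [[1, 2], [3, 3]]
view_b = [[1, 0], [3, 2]]
filtered = intersect_pseudo_labels(view_a, view_b)
```

Pixels marked `IGNORE` would simply be excluded from the retraining loss in this reading.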

Computer Vision and Pattern Recognition

An Approach for Self-Training Audio Event Detectors Using Web Data

75 - Benjamin Elizalde, Ankit Shah, Siddharth Dalmia (2016)

Audio Event Detection (AED) aims to recognize sounds within audio and video recordings. AED employs machine learning algorithms commonly trained and tested on annotated datasets. However, available datasets are limited in their number of samples, which makes it difficult to model acoustic diversity. Therefore, we propose combining labeled audio from a dataset and unlabeled audio from the web to improve the sound models. The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube. Whenever the detectors recognize any of the known sounds with high confidence, the unlabeled audio is used to re-train the detectors. The performance of the re-trained detectors is compared to that of the original detectors using the annotated test set. Results show an improvement in AED and uncover challenges of using web audio from videos.
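
The selection step of such a self-training loop might look like the following Python sketch. The abstract does not state what counts as high confidence, so the `threshold` value of 0.9, the score format, and the function name `select_confident` are all assumptions made for illustration.

```python
def select_confident(predictions, threshold=0.9):
    """Pick unlabeled clips whose top detector score clears the threshold.

    predictions: list of (clip_id, {label: score}) pairs (assumed format).
    Returns (clip_id, label) pairs that could be added to the training set.
    """
    selected = []
    for clip_id, scores in predictions:
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            selected.append((clip_id, label))
    return selected

# Detector scores on three hypothetical web clips.
web_predictions = [
    ("clip_001", {"dog_bark": 0.95, "siren": 0.03}),
    ("clip_002", {"dog_bark": 0.55, "siren": 0.40}),
    ("clip_003", {"dog_bark": 0.02, "siren": 0.97}),
]
pseudo_labeled = select_confident(web_predictions)
```

The selected pairs would then be merged with the labeled set before re-training, with the annotated test set kept aside for comparison.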

Computer Audio Systems, Machine Learning, Multimedia

Continual Learning for Fake Audio Detection

107 - Haoxin Ma, Jiangyan Yi, Jianhua Tao (2021)

Fake audio attacks have become a major threat to speaker verification systems. Although current detection approaches have achieved promising results in dataset-specific scenarios, they encounter difficulties on unseen spoofing data. Fine-tuning and retraining from scratch have been applied to incorporate new data. However, fine-tuning leads to performance degradation on previous data, and retraining takes a lot of time and computational resources. In addition, previous data are unavailable in some situations due to privacy. To solve these problems, this paper proposes Detecting Fake Without Forgetting, a continual-learning-based method that lets the model learn new spoofing attacks incrementally. A knowledge distillation loss is introduced into the loss function to preserve the memory of the original model. Assuming the distribution of genuine voice is consistent across different scenarios, an extra embedding similarity loss is used as another constraint to further align positive samples. Experiments are conducted on the ASVspoof2019 dataset. The results show that the proposed method outperforms fine-tuning, with a relative reduction in average equal error rate of up to 81.62%.
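
The abstract names two extra loss terms without giving their exact form. The Python sketch below shows one common way such terms are written: a temperature-softened KL divergence between the original and the updated model's outputs, and one minus the cosine similarity between their embeddings of a genuine utterance. These formulas, the temperature of 2.0, and all function names are assumptions for illustration and may differ from what the paper uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL(old || new) between softened output distributions (assumed form)."""
    p = softmax(old_logits, temperature)
    q = softmax(new_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def embedding_similarity_loss(old_emb, new_emb):
    """1 - cosine similarity between two embedding vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(old_emb, new_emb))
    norm_old = math.sqrt(sum(a * a for a in old_emb))
    norm_new = math.sqrt(sum(b * b for b in new_emb))
    return 1.0 - dot / (norm_old * norm_new)

# Hypothetical genuine/spoof logits and embeddings from the two models.
kd = distillation_loss([2.0, -1.0], [1.5, -0.5])
sim = embedding_similarity_loss([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
```

In a setup like this, both terms would be added to the usual classification loss with weighting factors chosen on validation data.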

Computer Audio Systems, Machine Learning, Audio and Speech Processing

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

72 - Jongpil Lee, Taejun Kim, Jiyoung Park (2017)

Music, speech, and acoustic scene sounds are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain grows rapidly thanks to versatile image classification models, it is necessary to study extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules, and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. We also visualize the filters along the layers and compare the characteristics of the learned filters.

Computer Audio Systems, Machine Learning, Multimedia

The NIGENS General Sound Events Database

84 - Ivo Trowitzsch, Jalil Taghia, Youssef Kashef (2019)

Computational auditory scene analysis has been gaining interest in recent years. Trailing behind the more mature field of speech recognition, general sound event detection in particular is attracting increasing attention. Crucial for training and testing reasonable models is the availability of enough suitable data; until recently, general sound event databases were hard to find. We release and present a database with 714 wav files containing isolated high-quality sound events of 14 different types, plus 303 "general" wav files of anything else but these 14 types. All sound events are strongly labeled with perceptual onset and offset times, with attention paid to omitting in-between silences. The number of isolated sound events, the quality of the annotations, and the particular general sound class distinguish NIGENS from other databases.

Computer Audio Systems, Machine Learning, Multimedia