
Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments

 Added by Romain Serizel
 Publication date 2018
 Language: English





This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen for its scientific challenges (wide variety of sounds, time-localized events, etc.) and its potential industrial applications.





Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now, these problems have mainly been tackled at the classifier level. We propose to use sound separation as a pre-processing step for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods to combine separated sound sources and the original mixture within the sound event detection system. Furthermore, we investigate the impact of adapting the sound separation model to the sound event detection data on both sound separation and sound event detection performance.
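One way to combine separated sources with the original mixture, as explored above, is late fusion of the detector's frame-level posteriors. The sketch below is illustrative only (the averaging weight and max-over-sources scheme are assumptions, not the paper's exact method):

```python
import numpy as np

def combine_predictions(mixture_probs, source_probs, weight=0.5):
    """Late-fusion sketch: average the SED posteriors computed on the
    original mixture with the per-class max over posteriors computed
    on each separated source.

    mixture_probs: (frames, classes) posteriors on the mixture.
    source_probs:  (sources, frames, classes) posteriors per source.
    weight: balance between the two branches (illustrative choice).
    """
    # A class active in any separated source counts toward the fused score.
    sources_combined = np.max(source_probs, axis=0)
    return weight * mixture_probs + (1.0 - weight) * sources_combined
```

With `weight=0.5` this is a simple average of the two branches; tuning the weight trades off trust in the separation front-end against the raw mixture.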
Heinrich Dinkel, Kai Yu (2019)
Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection. Analyzing the challenge results, it can be seen that most successful models are biased towards predicting long (e.g., over 5s) clips. This work investigates the performance impact of fixed-size window median filter post-processing and advocates the use of double thresholding as a more robust and predictable post-processing method. Further, four different temporal subsampling methods within the CRNN framework are proposed: mean-max, alpha-mean-max, Lp-norm and convolutional. We show that for this task, subsampling the temporal resolution by a neural network enhances the F1 score as well as its robustness towards short, sporadic sound events. Our best single model achieves 30.1% F1 on the evaluation set and the best fusion model 32.5%, while being robust to event length variations.
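The double thresholding advocated above can be sketched as follows: a high threshold triggers an event, and each triggered region is extended while the posterior stays above a low threshold. This is a minimal illustration; the threshold values are assumptions, not the paper's settings:

```python
import numpy as np

def double_threshold(probs, lo=0.2, hi=0.75):
    """Binarize frame-level posteriors with two thresholds.

    An event fires wherever probs exceed `hi`; each firing is then
    extended to the surrounding contiguous run of frames where probs
    stay above `lo`. Runs never touching `hi` are discarded, which
    makes the output less sensitive to the exact operating point than
    a single threshold.
    """
    hi_mask = probs > hi
    lo_mask = probs > lo
    out = np.zeros(len(probs), dtype=bool)
    idx = 0
    while idx < len(probs):
        if lo_mask[idx]:
            start = idx
            while idx < len(probs) and lo_mask[idx]:
                idx += 1
            # Keep the low-threshold run only if it contains a high hit.
            if hi_mask[start:idx].any():
                out[start:idx] = True
        else:
            idx += 1
    return out
```

Unlike a median filter with a fixed window, the detected event length here adapts to the shape of the posterior curve, which is what makes the method more predictable for short, sporadic events.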
We propose a simple but efficient method termed Guided Learning for weakly-labeled semi-supervised sound event detection (SED). There are two sub-targets implied in weakly-labeled SED: audio tagging and boundary detection. Instead of designing a single model that trades off the two sub-targets, we design a teacher model aimed at audio tagging to guide a student model aimed at boundary detection in learning from the unlabeled data. The guidance is guaranteed by the audio tagging performance gap between the two models. Meanwhile, the student model, liberated from the trade-off, is able to provide better boundary detection results. We propose a principle for designing these two models based on the relation between the temporal compression scale and the two sub-targets. We also propose an end-to-end semi-supervised learning process for the two models, so that their performance improves alternately. Experiments on the DCASE2018 Task4 dataset show that our approach achieves competitive performance.
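The core idea of the teacher guiding the student can be illustrated as the tagging teacher's clip-level predictions gating the student's frame-level posteriors on unlabeled clips. This is a deliberate simplification (the paper applies the guidance at the loss level during training); the threshold is an assumed value:

```python
import numpy as np

def guide_student(teacher_clip_probs, student_frame_probs, tag_thresh=0.5):
    """Guided-learning sketch: the audio-tagging teacher decides which
    classes are present in a clip; the boundary-detection student only
    keeps frame-level activity for those classes.

    teacher_clip_probs: (classes,) clip-level tag posteriors.
    student_frame_probs: (frames, classes) frame-level posteriors.
    """
    tags = (teacher_clip_probs > tag_thresh).astype(float)
    # Zero out frame activity for classes the teacher does not tag.
    return student_frame_probs * tags[None, :]
```

In the actual framework the teacher's tags serve as pseudo-labels for the student's semi-supervised loss rather than a hard inference-time mask, but the division of labor between the two models is the same.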
In this paper, we present a method called HODGEPODGE for large-scale detection of sound events using the weakly labeled, synthetic, and unlabeled data provided in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4: Sound event detection in domestic environments. We adopted convolutional recurrent neural networks (CRNN) as our backbone network. To deal with a small amount of tagged data and a large amount of unlabeled in-domain data, we focus primarily on how to apply semi-supervised learning methods efficiently to make full use of limited data. Three semi-supervised learning principles are used in our system: 1) consistency regularization applied to data augmentation; 2) a MixUp regularizer requiring that the prediction for an interpolation of two inputs be close to the interpolation of the predictions for each individual input; 3) MixUp regularization applied to interpolations between data augmentations. We also tried an ensemble of various models trained with different semi-supervised learning principles. Our proposed approach significantly improved on the baseline, achieving an event-based F-measure of 42.0% compared to the baseline's 25.8% on the official evaluation dataset. Our submission ranked third among 18 teams in task 4.
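The MixUp interpolation underlying principles 2) and 3) can be sketched in a few lines: two inputs and their multi-hot tag targets are blended with a Beta-distributed coefficient. A minimal sketch (the `alpha` value is an assumption, not the paper's hyperparameter):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp a pair of examples and their (multi-hot) tag targets.

    Draws lambda ~ Beta(alpha, alpha) and returns the convex
    combination of both inputs and both targets, so the model is
    trained to predict interpolated labels for interpolated inputs.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Principle 3) differs only in what is fed in: `x1` and `x2` are two augmented views rather than two distinct clips, which combines consistency regularization with the MixUp constraint.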
In recent years, the use of synthetic strongly labeled data, weakly labeled data, and unlabeled data has drawn much research attention in semi-supervised sound event detection (SSED). Self-training models make predictions without strong annotations and then take predictions with high probabilities as pseudo-labels for retraining. Such models have shown their effectiveness in SSED. However, probabilities are poorly calibrated confidence estimates, and samples with low probabilities are ignored. Hence, we introduce a method of learning confidence explicitly and retaining all data by applying confidence as weights. Additionally, linear pooling has been considered a state-of-the-art aggregation function for SSED with weak labeling. In this paper, we propose a power pooling function whose coefficient can be trained automatically to achieve nonlinearity. A confidence-based semi-supervised sound event detection (C-SSED) framework is designed to combine confidence and power pooling. The experimental results demonstrate that confidence is proportional to the accuracy of the predictions. The power pooling function outperforms linear pooling in terms of both error rate and F1 score. In addition, the C-SSED framework achieves a relative error rate reduction of 34% compared to the baseline model.
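The power pooling function can be sketched as a self-weighted average of frame-level posteriors, where each frame's weight is its own probability raised to a coefficient. In the paper that exponent is a trainable parameter; it is fixed here for illustration:

```python
import numpy as np

def power_pool(frame_probs, n=1.0, eps=1e-8):
    """Aggregate frame-level posteriors into one clip-level posterior.

    Each frame is weighted by its own probability raised to `n`:
    n = 0 recovers mean pooling, n = 1 recovers linear pooling, and
    large n approaches max pooling, so training n interpolates
    smoothly between these behaviors.
    """
    w = frame_probs ** n
    return float((frame_probs * w).sum() / (w.sum() + eps))
```

Because the coefficient is learned per task rather than fixed, the pooling can settle anywhere between averaging (robust, but dilutes short events) and max-picking (sharp, but noise-sensitive).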