ترغب بنشر مسار تعليمي؟ اضغط هنا

ACGAN-based Data Augmentation Integrated with Long-term Scalogram for Acoustic Scene Classification

64   0   0.0 ( 0 )
 نشر من قبل Hangting Chen
 تاريخ النشر 2020
والبحث باللغة English




اسأل ChatGPT حول البحث

In acoustic scene classification (ASC), acoustic features play a crucial role in the extraction of scene information, which can be stored over different time scales. Moreover, the limited size of the dataset may lead to a biased model with a poor performance for records from unseen cities and confusing scene classes. In order to overcome this, we propose a long-term wavelet feature that requires a lower storage capacity and can be classified faster and more accurately compared with classic Mel filter bank coefficients (FBank). This feature can be extracted with predefined wavelet scales similar to the FBank. Furthermore, a novel data augmentation scheme based on generative adversarial neural networks with auxiliary classifiers (ACGANs) is adopted to improve the generalization of the ASC systems. The scheme, which contains ACGANs and a sample filter, extends the database iteratively by splitting the dataset, training the ACGANs and subsequently filtering samples. Experiments were conducted on datasets from the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. The results on the DCASE19 dataset demonstrate the improved performance of the proposed techniques compared with the classic FBank classifier. Moreover, the proposed fusion system achieved first place in the DCASE19 competition and surpassed the top accuracies on the DCASE17 dataset.



قيم البحث

اقرأ أيضاً

In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC). Different from other popular data augmentation methods such as SpecAugment and mixup that only work on the i nput space, SpecAugment++ is applied to both the input space and the hidden space of the deep neural networks to enhance the input and the intermediate feature representations. For an intermediate hidden state, the augmentation techniques consist of masking blocks of frequency channels and masking blocks of time frames, which improve generalization by enabling a model to attend not only to the most discriminative parts of the feature, but also the entire parts. Apart from using zeros for masking, we also examine two approaches for masking based on the use of other samples within the minibatch, which helps introduce noises to the networks to make them more discriminative for classification. The experimental results on the DCASE 2018 Task1 dataset and DCASE 2019 Task1 dataset show that our proposed method can obtain 3.6% and 4.7% accuracy gains over a strong baseline without augmentation (i.e. CP-ResNet) respectively, and outperforms other previous data augmentation methods.
In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses o n ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-based architectures. On Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500KB. Code is available: https://github.com/MihawkHu/DCASE2020_task1.
Acoustic scene classification identifies an input segment into one of the pre-defined classes using spectral information. The spectral information of acoustic scenes may not be mutually exclusive due to common acoustic properties across different cla sses, such as babble noises included in both airports and shopping malls. However, conventional training procedure based on one-hot labels does not consider the similarities between different acoustic scenes. We exploit teacher-student learning with the purpose to derive soft-labels that consider common acoustic properties among different acoustic scenes. In teacher-student learning, the teacher network produces soft-labels, based on which the student network is trained. We investigate various methods to extract soft-labels that better represent similarities across different scenes. Such attempts include extracting soft-labels from multiple audio segments that are defined as an identical acoustic scene. Experimental results demonstrate the potential of our approach, showing a classification accuracy of 77.36 % on the DCASE 2018 task 1 validation set.
In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described. Also, the analysis of different methods on the leaderboard set is provided. The propose d approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first one is the common two-dimensional CNNs which is mainly used in image classification. The second one is a one-dimensional CNN for extracting fixed-length audio segment embeddings, so called x-vectors, which has also been used in speech processing, especially for speaker recognition. In addition to the different topologies, two types of features were tested: log mel-spectrogram and CQT features. Finally, the outputs of different systems are fused using a simple output averaging in the best performing system. Our submissions ranked third among 24 teams in the ASC sub-task A (task1a).
In recent years, deep learning technique has received intense attention owing to its great success in image recognition. A tendency of adaption of deep learning in various information processing fields has formed, including music information retrieva l (MIR). In this paper, we conduct a comprehensive study on music audio classification with improved convolutional neural networks (CNNs). To the best of our knowledge, this the first work to apply Densely Connected Convolutional Networks (DenseNet) to music audio tagging, which has been demonstrated to perform better than Residual neural network (ResNet). Additionally, two specific data augmentation approaches of time overlapping and pitch shifting have been proposed to address the deficiency of labelled data in the MIR. Moreover, an ensemble learning of stacking is employed based on SVM. We believe that the proposed combination of strong representation of DenseNet and data augmentation can be adapted to other audio processing tasks.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا