
Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Published by: C.-H. Huck Yang
Publication date: 2020
Paper language: English





In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging an ad-hoc score combination of two convolutional neural networks (CNNs), which classify the acoustic input into three classes and then into ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage a quantization method to reduce the complexity of two of our top-accuracy three-class CNN-based architectures. On the Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On the Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500 KB. Code is available: https://github.com/MihawkHu/DCASE2020_task1.
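To make the two-stage idea concrete, here is a minimal sketch of how the scores of a 3-class and a 10-class CNN could be combined. The fine-to-coarse grouping follows the standard DCASE 2020 scene hierarchy, but the multiplicative fusion rule and the renormalization are illustrative assumptions, not necessarily the ad-hoc combination used in the report.

```python
import numpy as np

# Fine-grained DCASE 2020 scene classes grouped into the three broad
# classes (indoor / outdoor / transportation) of Task 1b.
FINE_TO_COARSE = {
    0: 0, 1: 0, 2: 0,        # airport, shopping_mall, metro_station -> indoor
    3: 1, 4: 1, 5: 1, 6: 1,  # park, public_square, street_pedestrian, street_traffic -> outdoor
    7: 2, 8: 2, 9: 2,        # bus, metro, tram -> transportation
}

def two_stage_scores(p_coarse, p_fine):
    """Weight each fine-grained posterior by the posterior of its parent
    broad class, then renormalize (an assumed fusion rule)."""
    combined = np.array([p_fine[k] * p_coarse[FINE_TO_COARSE[k]]
                         for k in range(len(p_fine))])
    return combined / combined.sum()

# Usage with dummy softmax outputs from the two CNNs.
p3 = np.array([0.6, 0.3, 0.1])
p10 = np.full(10, 0.1)
print(two_stage_scores(p3, p10).round(3))
```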




Read also

To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers an 81.9% average accuracy on multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models.
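Since the paragraph above mentions class activation mapping, here is a minimal sketch of the standard CAM computation (Zhou et al., 2016) for a CNN that ends in global average pooling followed by a linear classifier; the shapes and normalization are illustrative assumptions, not the authors' exact analysis code.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weight the last conv layer's feature maps by the classifier
    weights of the target class to localize class evidence.

    feature_maps: (C, F, T) activations before global average pooling
    fc_weights:   (num_classes, C) weights of the final linear layer
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=(0, 0))
    cam = np.maximum(cam, 0)            # keep positive evidence only
    return cam / (cam.max() + 1e-8)     # normalize to [0, 1]
```

Upsampling the resulting (F, T) map to the input spectrogram size shows which time-frequency regions drove the prediction.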
In acoustic scene classification (ASC), acoustic features play a crucial role in the extraction of scene information, which can be stored over different time scales. Moreover, the limited size of the dataset may lead to a biased model with poor performance on recordings from unseen cities and confusable scene classes. To overcome this, we propose a long-term wavelet feature that requires a lower storage capacity and can be classified faster and more accurately than classic Mel filter bank coefficients (FBank). This feature can be extracted with predefined wavelet scales, similar to the FBank. Furthermore, a novel data augmentation scheme based on generative adversarial neural networks with auxiliary classifiers (ACGANs) is adopted to improve the generalization of ASC systems. The scheme, which contains ACGANs and a sample filter, extends the database iteratively by splitting the dataset, training the ACGANs, and subsequently filtering samples. Experiments were conducted on datasets from the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. The results on the DCASE19 dataset demonstrate the improved performance of the proposed techniques compared with the classic FBank classifier. Moreover, the proposed fusion system achieved first place in the DCASE19 competition and surpassed the top accuracies on the DCASE17 dataset.
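As a rough illustration of a long-term wavelet feature, the sketch below computes a log-energy scalogram from a continuous wavelet transform with predefined scales using PyWavelets. The wavelet choice, scale spacing, and log compression are assumptions made for illustration, not the exact feature proposed in the paper.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_scalogram(signal, sr, n_scales=40):
    """Log-energy scalogram over log-spaced scales, loosely analogous
    to Mel filter bank (FBank) energies."""
    scales = np.geomspace(1, sr / 100, n_scales)   # predefined scales
    coeffs, _ = pywt.cwt(signal, scales, 'morl')   # (n_scales, n_samples)
    return np.log(np.abs(coeffs) ** 2 + 1e-10)     # log-energy per scale
```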
In this paper, we present SpecAugment++, a novel data augmentation method for deep neural network based acoustic scene classification (ASC). Different from other popular data augmentation methods such as SpecAugment and mixup that only operate on the input space, SpecAugment++ is applied to both the input space and the hidden space of the deep neural networks to enhance the input and the intermediate feature representations. For an intermediate hidden state, the augmentation techniques consist of masking blocks of frequency channels and masking blocks of time frames, which improve generalization by enabling a model to attend not only to the most discriminative parts of the feature but also to the entire feature. Apart from using zeros for masking, we also examine two approaches to masking based on the use of other samples within the minibatch, which helps introduce noise into the networks and makes them more discriminative for classification. The experimental results on the DCASE 2018 Task 1 and DCASE 2019 Task 1 datasets show that our proposed method obtains 3.6% and 4.7% accuracy gains, respectively, over a strong baseline without augmentation (i.e., CP-ResNet), and outperforms other previous data augmentation methods.
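The sketch below illustrates the masking idea on a hidden state, assuming a (batch, channels, freq, time) tensor; the mask widths, the single mask per axis, and the "mix with another minibatch sample" filler are illustrative simplifications of the approaches described above, not the paper's exact recipe.

```python
import torch

def hidden_space_mask(h, freq_width=8, time_width=20, mode="zero"):
    """Mask one frequency block and one time block of a hidden state h,
    filling with zeros or with another sample from the minibatch."""
    B, C, F, T = h.shape
    f0 = torch.randint(0, max(F - freq_width, 1), (1,)).item()
    t0 = torch.randint(0, max(T - time_width, 1), (1,)).item()
    filler = torch.zeros_like(h) if mode == "zero" else h[torch.randperm(B)]
    h = h.clone()
    h[:, :, f0:f0 + freq_width, :] = filler[:, :, f0:f0 + freq_width, :]
    h[:, :, :, t0:t0 + time_width] = filler[:, :, :, t0:t0 + time_width]
    return h
```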
In recent years, deep learning has received intense attention owing to its great success in image recognition, and it is increasingly being adopted across information processing fields, including music information retrieval (MIR). In this paper, we conduct a comprehensive study of music audio classification with improved convolutional neural networks (CNNs). To the best of our knowledge, this is the first work to apply Densely Connected Convolutional Networks (DenseNet) to music audio tagging, and it is demonstrated to perform better than the Residual neural network (ResNet). Additionally, two specific data augmentation approaches, time overlapping and pitch shifting, are proposed to address the scarcity of labelled data in MIR. Moreover, an SVM-based stacking ensemble is employed. We believe that the proposed combination of the strong representation of DenseNet and data augmentation can be adapted to other audio processing tasks.
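For concreteness, here is one plausible reading of the two augmentations, sketched with librosa; the overlap ratio and the ±2-semitone range are illustrative choices, not the paper's exact settings.

```python
import numpy as np
import librosa

def overlapping_excerpts(y, win, hop):
    """Time overlapping: slice a clip into overlapping training
    excerpts (hop < win), multiplying the labelled data."""
    return [y[i:i + win] for i in range(0, len(y) - win + 1, hop)]

def random_pitch_shift(y, sr, max_steps=2.0):
    """Pitch shifting: move the clip up or down by up to max_steps
    semitones."""
    steps = np.random.uniform(-max_steps, max_steps)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
```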
Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data can be vital clues. However, these details are less noticeable and are easily removed by conventional non-linear activations (e.g., ReLU). Furthermore, design choices that emphasize trivial details can easily lead to overfitting if the system is not sufficiently generalized. In this study, based on an analysis of the ASC task's characteristics, we investigate various methods to capture discriminative information while mitigating the overfitting problem. We adopt a max feature map method to replace conventional non-linear activations in a deep neural network, thereby applying an element-wise comparison between different filters of a convolution layer's output. Two data augmentation methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power. Various experiments are conducted using the Detection and Classification of Acoustic Scenes and Events 2020 Task 1a dataset to validate the proposed methods. Our results show that the proposed system consistently outperforms the baseline, where our single best-performing system attains an accuracy of 70.4% compared to the baseline's 65.1%.
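The max feature map (MFM) activation mentioned above has a simple form: the channel dimension is split in half and an element-wise maximum is taken between the halves, so competing filters can pass small but discriminative activations that ReLU would zero out. A minimal PyTorch sketch (the channel count must be even):

```python
import torch

def max_feature_map(x):
    """MFM activation: element-wise max between the two halves of the
    channel dimension. x: (batch, 2C, freq, time) -> (batch, C, freq, time)."""
    a, b = torch.chunk(x, 2, dim=1)
    return torch.maximum(a, b)
```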

