ترغب بنشر مسار تعليمي؟ اضغط هنا

Sample Dropout for Audio Scene Classification Using Multi-Scale Dense Connected Convolutional Neural Network

67   0   0.0 ( 0 )
 نشر من قبل Kele Xu
 تاريخ النشر 2018
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Acoustic scene classification is an intricate problem for a machine. As an emerging field of research, deep Convolutional Neural Networks (CNN) achieve convincing results. In this paper, we explore the use of multi-scale Dense connected convolutional neural network (DenseNet) for the classification task, with the goal to improve the classification performance as multi-scale features can be extracted from the time-frequency representation of the audio signal. On the other hand, most of previous CNN-based audio scene classification approaches aim to improve the classification accuracy, by employing different regularization techniques, such as the dropout of hidden units and data augmentation, to reduce overfitting. It is widely known that outliers in the training set have a high negative influence on the trained model, and culling the outliers may improve the classification performance, while it is often under-explored in previous studies. In this paper, inspired by the silence removal in the speech signal processing, a novel sample dropout approach is proposed, which aims to remove outliers in the training dataset. Using the DCASE 2017 audio scene classification datasets, the experimental results demonstrates the proposed multi-scale DenseNet providing a superior performance than the traditional single-scale DenseNet, while the sample dropout method can further improve the classification robustness of multi-scale DenseNet.

قيم البحث

اقرأ أيضاً

88 - Liwen Zhang , Jiqing Han 2019
The performance of an Acoustic Scene Classification (ASC) system is highly depending on the latent temporal dynamics of the audio signal. In this paper, we proposed a multiple layers temporal pooling method using CNN feature sequence as in-put, which can effectively capture the temporal dynamics for an entire audio signal with arbitrary duration by building direct connections between the sequence and its time indexes. We applied our novel framework on DCASE 2018 task 1, ASC. For evaluation, we trained a Support Vector Machine (SVM) with the proposed Multi-Layered Temporal Pooling (MLTP) learned features. Experimental results on the development dataset, usage of the MLTP features significantly improved the ASC performance. The best performance with 75.28% accuracy was achieved by using the optimal setting found in our experiments.
163 - Jongpil Lee , Juhan Nam 2017
Music tag words that describe music audio by text have different levels of abstraction. Taking this issue into account, we propose a music classification approach that aggregates multi-level and multi-scale features using pre-trained feature extracto rs. In particular, the feature extractors are trained in sample-level deep convolutional neural networks using raw waveforms. We show that this approach achieves state-of-the-art results on several music classification datasets.
Precise destination prediction of taxi trajectories can benefit many intelligent location based services such as accurate ad for passengers. Traditional prediction approaches, which treat trajectories as one-dimensional sequences and process them in single scale, fail to capture the diverse two-dimensional patterns of trajectories in different spatial scales. In this paper, we propose T-CONV which models trajectories as two-dimensional images, and adopts multi-layer convolutional neural networks to combine multi-scale trajectory patterns to achieve precise prediction. Furthermore, we conduct gradient analysis to visualize the multi-scale spatial patterns captured by T-CONV and extract the areas with distinct influence on the ultimate prediction. Finally, we integrate multiple local enhancement convolutional fields to explore these important areas deeply for better prediction. Comprehensive experiments based on real trajectory data show that T-CONV can achieve higher accuracy than the state-of-the-art methods.
Acoustic scene classification systems using deep neural networks classify given recordings into pre-defined classes. In this study, we propose a novel scheme for acoustic scene classification which adopts an audio tagging system inspired by the human perception mechanism. When humans identify an acoustic scene, the existence of different sound events provides discriminative information which affects the judgement. The proposed framework mimics this mechanism using various approaches. Firstly, we employ three methods to concatenate tag vectors extracted using an audio tagging system with an intermediate hidden layer of an acoustic scene classification system. We also explore the multi-head attention on the feature map of an acoustic scene classification system using tag vectors. Experiments conducted on the detection and classification of acoustic scenes and events 2019 task 1-a dataset demonstrate the effectiveness of the proposed scheme. Concatenation and multi-head attention show a classification accuracy of 75.66 % and 75.58 %, respectively, compared to 73.63 % accuracy of the baseline. The system with the proposed two approaches combined demonstrates an accuracy of 76.75 %.
Reconstruction of PET images is an ill-posed inverse problem and often requires iterative algorithms to achieve good image quality for reliable clinical use in practice, at huge computational costs. In this paper, we consider the PET reconstruction a dense prediction problem where the large scale contextual information is essential, and propose a novel architecture of multi-scale fully convolutional neural networks (msfCNN) for fast PET image reconstruction. The proposed msfCNN gains large receptive fields with both memory and computational efficiency, by using a downscaling-upscaling structure and dilated convolutions. Instead of pooling and deconvolution, we propose to use the periodic shuffling operation from sub-pixel convolution and its inverse to scale the size of feature maps without losing resolution. Residual connections were added to improve training. We trained the proposed msfCNN model with simulated data, and applied it to clinical PET data acquired on a Siemens mMR scanner. The results from real oncological and neurodegenerative cases show that the proposed msfCNN-based reconstruction outperforms the iterative approaches in terms of computational time while achieving comparable image quality for quantification. The proposed msfCNN model can be applied to other dense prediction tasks, and fast msfCNN-based PET reconstruction could facilitate the potential use of molecular imaging in interventional/surgical procedures, where cancer surgery can particularly benefit.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا