بحث متقدم مدعوم من الذكاء الصنعي

مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Acoustic scene classification using multi-layer temporal pooling based on convolutional neural network

89 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Liwen Zhang

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية هندسة إلكترونية

والبحث باللغة English

تأليف Liwen Zhang - Jiqing Han

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The performance of an Acoustic Scene Classification (ASC) system is highly depending on the latent temporal dynamics of the audio signal. In this paper, we proposed a multiple layers temporal pooling method using CNN feature sequence as in-put, which can effectively capture the temporal dynamics for an entire audio signal with arbitrary duration by building direct connections between the sequence and its time indexes. We applied our novel framework on DCASE 2018 task 1, ASC. For evaluation, we trained a Support Vector Machine (SVM) with the proposed Multi-Layered Temporal Pooling (MLTP) learned features. Experimental results on the development dataset, usage of the MLTP features significantly improved the ASC performance. The best performance with 75.28% accuracy was achieved by using the optimal setting found in our experiments.

قيم البحث

125 - Yang Liu , Alexandros Neophytou , Sunando Sengupta 2021

Convolutional neural networks (CNNs) with log-mel spectrum features have shown promising results for acoustic scene classification tasks. However, the performance of these CNN based classifiers is still lacking as they do not generalise well for unkn own environments. To address this issue, we introduce an acoustic spectrum transformation network where traditional log-mel spectrums are transformed into imagined visual features (IVF). The imagined visual features are learned by exploiting the relationship between audio and visual features present in video recordings. An auto-encoder is used to encode images as visual features and a transformation network learns how to generate imagined visual features from log-mel. Our model is trained on a large dataset of Youtube videos. We test our proposed method on the scene classification task of DCASE and ESC-50, where our method outperforms other spectrum features, especially for unseen environments.

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

DD-CNN: Depthwise Disout Convolutional Neural Network for Low-complexity Acoustic Scene Classification

80 - Jingqiao Zhao , Zhen-Hua Feng , Qiuqiang Kong 2020

This paper presents a Depthwise Disout Convolutional Neural Network (DD-CNN) for the detection and classification of urban acoustic scenes. Specifically, we use log-mel as feature representations of acoustic signals for the inputs of our network. In the proposed DD-CNN, depthwise separable convolution is used to reduce the network complexity. Besides, SpecAugment and Disout are used for further performance boosting. Experimental results demonstrate that our DD-CNN can learn discriminative acoustic characteristics from audio fragments and effectively reduce the network complexity. Our DD-CNN was used for the low-complexity acoustic scene classification task of the DCASE2020 Challenge, which achieves 92.04% accuracy on the validation set.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

Domain Mismatch Robust Acoustic Scene Classification using Channel Information Conversion

137 - Seongkyu Mun , Suwon Shon 2018

In a recent acoustic scene classification (ASC) research field, training and test device channel mismatch have become an issue for the real world implementation. To address the issue, this paper proposes a channel domain conversion using factorized h ierarchical variational autoencoder. Proposed method adapts both the source and target domain to a pre-defined specific domain. Unlike the conventional approach, the relationship between the target and source domain and information of each domain are not required in the adaptation process. Based on the experimental results using the IEEE detection and classification of acoustic scenes and event 2018 task 1-B dataset and the baseline system, it is shown that the proposed approach can mitigate the channel mismatching issue of different recording devices.

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Analysis Acoustic Features for Acoustic Scene Classification and Score fusion of multi-classification systems applied to DCASE 2016 challenge

84 - Sangwook Park , Seongkyu Mun , Younglo Lee 2018

This paper describes an acoustic scene classification method which achieved the 4th ranking result in the IEEE AASP challenge of Detection and Classification of Acoustic Scenes and Events 2016. In order to accomplish the ensuing task, several methods are explored in three aspects: feature extraction, feature transformation, and score fusion for final decision. In the part of feature extraction, several features are investigated for effective acoustic scene classification. For resolving the issue that the same sound can be heard in different places, a feature transformation is applied for better separation for classification. From these, several systems based on different feature sets are devised for classification. The final result is determined by fusing the individual systems. The method is demonstrated and validated by the experiment conducted using the Challenge database.

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Multi-Scale Attention Neural Network for Acoustic Echo Cancellation

90 - Lu Ma , Song Yang , Yaguang Gong 2021

Acoustic Echo Cancellation (AEC) plays a key role in speech interaction by suppressing the echo received at microphone introduced by acoustic reverberations from loudspeakers. Since the performance of linear adaptive filter (AF) would degrade severel y due to nonlinear distortions, background noises, and microphone clipping in real scenarios, deep learning has been employed for AEC for its good nonlinear modelling ability. In this paper, we constructed an end-to-end multi-scale attention neural network for AEC. Temporal convolution is first used to transform waveform into spectrogram. The spectrograms of the far-end reference and the near-end mixture are concatenated, and fed to a temporal convolution network (TCN) with stacked dilated convolution layers. Attention mechanism is performed among these representations from different layers to adaptively extract relevant features by referring to the previous hidden state in the encoder long short-term memory (LSTM) unit. The representations are weighted averaged and fed to the encoder LSTM for the near-end speech estimation. Experiments show the superiority of our method in terms of the echo return loss enhancement (ERLE) for single-talk periods and the perceptual evaluation of speech quality (PESQ) score for double-talk periods in background noise and nonlinear distortion scenarios.

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة إيبلا الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Acoustic scene classification using multi-layer temporal pooling based on convolutional neural network

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً