بحث متقدم مدعوم من الذكاء الصنعي

مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Domestic activities clustering from audio recordings using convolutional capsule autoencoder network

235 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yanxiong Li

تاريخ النشر 2021

مجال البحث هندسة إلكترونية الهندسة المعلوماتية

والبحث باللغة English

تأليف Ziheng Lin - Yanxiong Li - Zhangjin Huang

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Recent efforts have been made on domestic activities classification from audio recordings, especially the works submitted to the challenge of DCASE (Detection and Classification of Acoustic Scenes and Events) since 2018. In contrast, few studies were done on domestic activities clustering, which is a newly emerging problem. Domestic activities clustering from audio recordings aims at merging audio clips which belong to the same class of domestic activity into a single cluster. Domestic activities clustering is an effective way for unsupervised estimation of daily activities performed in home environment. In this study, we propose a method for domestic activities clustering using a convolutional capsule autoencoder network (CCAN). In the method, the deep embeddings are learned by the autoencoder in the CCAN, while the deep embeddings which belong to the same class of domestic activities are merged into a single cluster by a clustering layer in the CCAN. Evaluated on a public dataset adopted in DCASE-2018 Task 5, the results show that the proposed method outperforms state-of-the-art methods in terms of the metrics of clustering accuracy and normalized mutual information.

قيم البحث

133 - Monisankha Pal , Manoj Kumar , Raghuveer Peri 2019

In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent vari able recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus, and two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions respectively obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperform the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve a relative DER improvement of 31%, 36% and 49% on AMI eval, ADOS and BOSCC corpora respectively, when compared to the x-vector baseline using oracle speech segmentation.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

SkipConvNet: Skip Convolutional Neural Network for Speech Dereverberation using Optimally Smoothed Spectral Mapping

91 - Vinay Kothapally , Wei Xia , Shahram Ghorbani 2020

The reliability of using fully convolutional networks (FCNs) has been successfully demonstrated by recent studies in many speech applications. One of the most popular variants of these FCNs is the `U-Net, which is an encoder-decoder network with skip connections. In this study, we propose `SkipConvNet where we replace each skip connection with multiple convolutional modules to provide decoder with intuitive feature maps rather than encoders output to improve the learning capacity of the network. We also propose the use of optimal smoothing of power spectral density (PSD) as a pre-processing step, which helps to further enhance the efficiency of the network. To evaluate our proposed system, we use the REVERB challenge corpus to assess the performance of various enhancement approaches under the same conditions. We focus solely on monitoring improvements in speech quality and their contribution to improving the efficiency of back-end speech systems, such as speech recognition and speaker verification, trained on only clean speech. Experimental findings show that the proposed system consistently outperforms other approaches.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب معالجة الإشارات

Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion

74 - Wen-Chin Huang , Yi-Chiao Wu , Chen-Chou Lo 2019

In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship between vocoder features extracted using the high quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. Such hypothesis implies that during the conversion phase, the latent codes and the converted features in VAE based VC are in fact source F0 dependent. To this end, we propose to utilize the F0 as an additional input of the decoder. The model can learn to disentangle the latent code from the F0 and thus generates converted F0 dependent converted features. Second, to better capture temporal dependencies of the spectral features and the F0 pattern, we replace the frame wise conversion structure in the original VAE based VC framework with a fully convolutional network structure. Our experiments demonstrate that the degree of disentanglement as well as the naturalness of the converted speech are indeed improved.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

286 - Haoyu Li , Yang Ai , Junichi Yamagishi 2020

High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neura l network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out the channel characteristics from the original input audio using the encoder network with adversarial training. Next, we disentangle the channel factor from a reference audio. Conditioned on this factor, an auto-regressive decoder is then used to predict the target-environment Mel spectrogram. Finally, we apply a neural vocoder to synthesize the speech waveform. Experimental results show that the proposed system can generate a professional high-quality speech waveform when setting high-quality audio as the reference. It also improves speech enhancement performance compared with several state-of-the-art baseline systems.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

Audio Dequantization Using (Co)Sparse (Non)Convex Methods

193 - Pavel Zaviv{s}ka , Pavel Rajmic , Ondv{r}ej Mokry 2020

The paper deals with the hitherto neglected topic of audio dequantization. It reviews the state-of-the-art sparsity-based approaches and proposes several new methods. Convex as well as non-convex approaches are included, and all the presented formula tions come in both the synthesis and analysis variants. In the experiments the methods are evaluated using the signal-to-distortion ratio (SDR) and PEMO-Q, a perceptually motivated metric.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب معالجة الإشارات

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة طرطوس

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Domestic activities clustering from audio recordings using convolutional capsule autoencoder network

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً