
Dereverberation using joint estimation of dry speech signal and acoustic system

Published by Sanna Wager C
Publication date: 2020
Language of the paper: English





The purpose of speech dereverberation is to remove quality-degrading effects of a time-invariant impulse response filter from the signal. In this report, we describe an approach to speech dereverberation that involves joint estimation of the dry speech signal and of the room impulse response. We explore deep learning models that apply to each task separately, and how these can be combined in a joint model with shared parameters.
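The signal model behind this abstract can be illustrated with a small sketch: the reverberant recording is the dry speech convolved with a time-invariant room impulse response (RIR), and dereverberation aims to recover the dry signal, here jointly with the RIR. The code below only shows the forward model, not the method of the report. The helper names `convolve` and `synthetic_rir`, the exponential-decay toy RIR, and all numeric values are assumptions made for illustration.

```python
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def synthetic_rir(length, decay, seed=0):
    """Hypothetical toy RIR: a unit direct path followed by
    exponentially decaying random reflections (an assumption,
    not a measured room response)."""
    rng = random.Random(seed)
    rir = [1.0]
    for k in range(1, length):
        rir.append(rng.uniform(-1.0, 1.0) * decay ** k)
    return rir


# A short "dry" signal (illustrative values only).
dry = [0.0, 1.0, 0.5, -0.25, 0.0, 0.0]

# Time-invariant filter: the same RIR applies to every sample.
rir = synthetic_rir(length=4, decay=0.5)

# Reverberant ("wet") observation: wet = dry * rir.
wet = convolve(dry, rir)
```

A joint model in the sense of the abstract would take `wet` as input and estimate both `dry` and `rir`; one natural consistency check, under this forward model, is that convolving the two estimates should reproduce the observed `wet`.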




Read also

With the widespread use of telemedicine services, automatic assessment of health conditions via telephone speech can significantly impact public health. This work summarizes our preliminary findings on automatic detection of respiratory distress using well-known acoustic and prosodic features. Speech samples are collected from de-identified telemedicine phonecalls from a healthcare provider in Bangladesh. The recordings include conversational speech samples of patients talking to doctors showing mild or severe respiratory distress or asthma symptoms. We hypothesize that respiratory distress may alter speech features such as voice quality, speaking pattern, loudness, and speech-pause duration. To capture these variations, we utilize a set of well-known acoustic and prosodic features with a Support Vector Machine (SVM) classifier for detecting the presence of respiratory distress. Experimental evaluations are performed using a 3-fold cross-validation scheme, ensuring patient-independent data splits. We obtained an overall accuracy of 86.4% in detecting respiratory distress from the speech recordings using the acoustic feature set. Correlation analysis reveals that the top-performing features include loudness, voice rate, voice duration, and pause duration.
Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion. This paper introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio. We use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech. The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
Jie Li, Lizhong Yao, 2021
In this paper, we introduce a new acoustic leakage dataset of gas pipelines, called GPLA-12, which has 12 categories over 684 training/testing acoustic signals. Unlike massive image and voice datasets, there are relatively few acoustic signal datasets, especially for engineering fault detection. In order to advance the development of fault diagnosis, we collect acoustic leakage signals from an intact gas pipe system with external artificial leakages, and then preprocess the collected data with structured tailoring to produce GPLA-12. GPLA-12 is intended to serve as a feature learning dataset for time-series tasks and classification. To further understand the dataset, we train both shallow and deep learning algorithms and observe their performance. The dataset as well as the pretrained models have been released at both www.daip.club and github.com/Deep-AI-Application-DAIP
Automatic speech recognition in multi-channel reverberant conditions is a challenging task. The conventional way of suppressing the reverberation artifacts involves a beamforming-based enhancement of the multi-channel speech signal, which is used to extract spectrogram-based features for a neural network acoustic model. In this paper, we propose to extract features directly from the multi-channel speech signal using a multivariate autoregressive (MAR) modeling approach, where the correlations among all the three dimensions of time, frequency and channel are exploited. The MAR features are fed to a convolutional neural network (CNN) architecture which performs the joint acoustic modeling on the three dimensions. The 3-D CNN architecture allows the combination of multi-channel features that optimize the speech recognition cost compared to the traditional beamforming models that focus on the enhancement task. Experiments are conducted on the CHiME-3 and REVERB Challenge datasets using multi-channel reverberant speech. In these experiments, the proposed 3-D feature and acoustic modeling approach provides significant improvements over an ASR system trained with beamformed audio (average relative improvements of 10% and 9% in word error rates for the CHiME-3 and REVERB Challenge datasets, respectively).
In this study, we present a deep learning-based speech signal-processing mobile application, called CITISEN, which can perform three functions: speech enhancement (SE), model adaptation (MA), and acoustic scene conversion (ASC). For SE, CITISEN can effectively reduce noise components from speech signals and accordingly enhance their clarity and intelligibility. When it encounters noisy utterances with unknown speakers or noise types, the MA function allows CITISEN to effectively improve the SE performance by adapting an SE model with a few audio files. Finally, for ASC, CITISEN can convert the current background sound into a different background sound. The experimental results confirmed the effectiveness of performing SE, MA, and ASC functions via objective evaluation and subjective listening tests. Moreover, the MA experimental results indicated that short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) could be improved by approximately 5% and 10%, respectively. The promising results reveal that the developed CITISEN mobile application can be potentially used as a front-end processor for various speech-related services such as voice communication, assistive hearing devices, and virtual reality headsets. In addition, CITISEN can be used as a platform for using and evaluating the newly performed deep-learning-SE models, and can flexibly extend the models to address various noise environments and users.
