
Increasing Compactness of Deep Learning Based Speech Enhancement Models with Parameter Pruning and Quantization Techniques

Posted by: Jyun-Yi Wu
Publication date: 2019
Paper language: English





Most recent studies on deep learning based speech enhancement (SE) have focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a parameter quantization (PQ) technique was applied to reduce the size of a neural network by representing weights with fewer cluster centroids. Because the techniques are derived from different concepts, PP and PQ can be integrated to provide even more compact SE models. The experimental results show that the PP and PQ techniques produce a compact SE model whose size is only 10.03% of the original model, at the cost of minor performance losses of 1.43% (from 0.70 to 0.69) in STOI and 3.24% (from 1.85 to 1.79) in PESQ. These promising results suggest that the PP and PQ techniques can be used in an SE system on devices with limited storage and computation resources.
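The two techniques compose naturally because they act on different axes: PP removes whole channels, while PQ shrinks the storage needed for the surviving weights. The following is a minimal sketch of both steps, using an L1-norm pruning criterion and k-means weight clustering as illustrative assumptions; the paper's exact criteria and procedure may differ.

```python
# A minimal sketch of the two compression steps, using NumPy and scikit-learn.
# The pruning criterion (L1 channel norm) and the number of centroids are
# illustrative assumptions, not the paper's exact settings.
import numpy as np
from sklearn.cluster import KMeans

def prune_channels(weight, keep_ratio=0.5):
    """Parameter pruning (PP): drop output channels with the smallest L1 norms.

    weight: array of shape (out_channels, in_channels) for a dense layer.
    Returns the pruned weight and the indices of the kept channels.
    """
    norms = np.abs(weight).sum(axis=1)           # one L1 norm per channel
    n_keep = max(1, int(keep_ratio * weight.shape[0]))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # keep the largest channels
    return weight[keep], keep

def quantize_weights(weight, n_centroids=32):
    """Parameter quantization (PQ): represent weights by k-means centroids.

    Each weight is replaced by its nearest cluster centroid, so the layer can
    be stored as small integer indices plus a short codebook.
    """
    flat = weight.reshape(-1, 1)
    km = KMeans(n_clusters=n_centroids, n_init=4, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()
    indices = km.labels_
    return codebook[indices].reshape(weight.shape), codebook, indices

# PP and PQ are based on different concepts, so they compose directly:
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
w_pruned, kept = prune_channels(w, keep_ratio=0.5)
w_compact, codebook, idx = quantize_weights(w_pruned, n_centroids=32)
print(w.shape, "->", w_compact.shape, "with", codebook.size, "centroids")
```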




Read also

Speech-related applications deliver inferior performance in complex noise environments. Therefore, this study primarily addresses this problem by introducing speech-enhancement (SE) systems based on deep neural networks (DNNs) applied to a distributed microphone architecture, and then investigates the effectiveness of three different DNN-model structures. The first system constructs a DNN model for each microphone to enhance the recorded noisy speech signal, and the second system combines all the noisy recordings into a large feature structure that is then enhanced through a DNN model. As for the third system, a channel-dependent DNN is first used to enhance the corresponding noisy input, and all the channel-wise enhanced outputs are fed into a DNN fusion model to construct a nearly clean signal (see the sketch below). All three DNN SE systems operate in the acoustic frequency domain of speech signals in a diffuse-noise field environment. Evaluation experiments were conducted on the Taiwan Mandarin Hearing in Noise Test (TMHINT) database, and the results indicate that all three DNN-based SE systems improve the speech quality and intelligibility of the original noise-corrupted signals, with the third system delivering the highest signal-to-noise ratio (SNR) improvement and the best speech intelligibility.
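A minimal PyTorch sketch of that third system (channel-dependent DNNs followed by a fusion DNN) might look as follows; the layer sizes, depths, and feature dimension (257 spectral bins) are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: per-microphone enhancement DNNs whose outputs are fused by a
# second DNN, mirroring the best-performing architecture described above.
import torch
import torch.nn as nn

class ChannelDNN(nn.Module):
    """Enhances the noisy spectral features of a single microphone."""
    def __init__(self, dim=257, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, x):
        return self.net(x)

class FusionSE(nn.Module):
    """Channel-wise enhancement followed by a DNN fusion model."""
    def __init__(self, n_mics=4, dim=257, hidden=512):
        super().__init__()
        self.channel_dnns = nn.ModuleList(
            [ChannelDNN(dim, hidden) for _ in range(n_mics)])
        self.fusion = nn.Sequential(
            nn.Linear(n_mics * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, mics):  # mics: (batch, n_mics, dim)
        enhanced = [dnn(mics[:, i]) for i, dnn in enumerate(self.channel_dnns)]
        return self.fusion(torch.cat(enhanced, dim=-1))

model = FusionSE()
clean_estimate = model(torch.randn(8, 4, 257))  # 8 frames, 4 microphones
print(clean_estimate.shape)                     # torch.Size([8, 257])
```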
In this study, we present a deep learning-based speech signal-processing mobile application, called CITISEN, which can perform three functions: speech enhancement (SE), model adaptation (MA), and acoustic scene conversion (ASC). For SE, CITISEN can effectively reduce noise components in speech signals and accordingly enhance their clarity and intelligibility. When it encounters noisy utterances with unknown speakers or noise types, the MA function allows CITISEN to effectively improve the SE performance by adapting an SE model with a few audio files. Finally, for ASC, CITISEN can convert the current background sound into a different background sound. The experimental results confirmed the effectiveness of the SE, MA, and ASC functions via objective evaluation and subjective listening tests. Moreover, the MA experimental results indicated that short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) could be improved by approximately 5% and 10%, respectively. The promising results reveal that the developed CITISEN mobile application can potentially be used as a front-end processor for various speech-related services such as voice communication, assistive hearing devices, and virtual reality headsets. In addition, CITISEN can be used as a platform for deploying and evaluating newly developed deep-learning-based SE models, and can flexibly extend the models to address various noise environments and users.
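The MA function amounts to fine-tuning a pretrained SE model on a few paired recordings. Below is a hedged sketch of such an adaptation loop; the optimizer settings and the choice to freeze all but the final layer are illustrative assumptions, not CITISEN's actual procedure.

```python
# Hedged sketch of model adaptation (MA): fine-tune a pretrained SE model on a
# handful of (noisy, clean) feature pairs from the new speaker or noise type.
import torch
import torch.nn as nn

def adapt_se_model(model: nn.Module, noisy_feats, clean_feats, steps=50, lr=1e-4):
    """Adapt a pretrained SE model with a few paired examples."""
    # Freeze everything except the last layer so a few files cannot overfit
    # the whole network (an assumption; the app may adapt all layers).
    params = list(model.parameters())
    for p in params[:-2]:
        p.requires_grad_(False)
    opt = torch.optim.Adam([p for p in params if p.requires_grad], lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(noisy_feats), clean_feats)
        loss.backward()
        opt.step()
    return model

# Toy pretrained model and a few frames of adaptation data (stand-ins):
model = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257))
noisy, clean = torch.randn(100, 257), torch.randn(100, 257)
adapt_se_model(model, noisy, clean)
```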
Recent research on speech enhancement (SE) has seen the emergence of deep-learning-based methods. It is still a challenging task to determine the effective ways to increase the generalizability of SE under diverse test conditions. In this study, we combine zero-shot learning and ensemble learning to propose a zero-shot model selection (ZMOS) approach to increase the generalization of SE performance. The proposed approach is realized in the offline and online phases. The offline phase clusters the entire set of training data into multiple subsets and trains a specialized SE model (termed component SE model) with each subset. The online phase selects the most suitable component SE model to perform the enhancement. Furthermore, two selection strategies were developed: selection based on the quality score (QS) and selection based on the quality embedding (QE). Both QS and QE were obtained using a Quality-Net, a non-intrusive quality assessment network. Experimental results confirmed that the proposed ZMOS approach can achieve better performance in both seen and unseen noise types compared to the baseline systems and other model selection systems, which indicates the effectiveness of the proposed approach in providing robust SE performance.
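The online phase of the QS strategy can be sketched as follows: run each component SE model, score every output with a non-intrusive quality estimator, and keep the best-scoring one. In the sketch below, `estimate_quality` is a hypothetical stand-in for Quality-Net, and the toy component models are placeholders.

```python
# Hedged sketch of ZMOS online selection with the quality-score (QS) strategy.
import numpy as np

def estimate_quality(enhanced):
    """Stand-in for Quality-Net's non-intrusive score (an assumption)."""
    return -np.var(enhanced)  # placeholder heuristic, NOT the real metric

def zmos_select(noisy, component_models):
    """Select the component SE model whose enhanced output scores highest."""
    outputs = [m(noisy) for m in component_models]
    scores = [estimate_quality(o) for o in outputs]
    best = int(np.argmax(scores))
    return outputs[best], best

# Component models would be trained offline, one per training-data cluster;
# simple gain functions stand in for them here.
models = [lambda x, a=a: a * x for a in (0.5, 0.8, 1.0)]
noisy = np.random.randn(16000)
enhanced, chosen = zmos_select(noisy, models)
print("selected component model:", chosen)
```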
Diffusion probabilistic models have demonstrated an outstanding capability to model natural images and raw audio waveforms through paired diffusion and reverse processes. The unique property of the reverse process (namely, eliminating non-target signals from the Gaussian noise and noisy signals) can be utilized to restore clean signals. Based on this property, we propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals. The fundamental architecture of the proposed DiffuSE model is similar to that of DiffWave, a high-quality audio waveform generation model with a relatively low computational cost and footprint. To attain better enhancement performance, we designed an advanced reverse process, termed the supportive reverse process, which adds noisy speech at each time step to the predicted speech. The experimental results show that DiffuSE yields performance comparable to related audio generative models on the standardized Voice Bank corpus SE task. Moreover, relative to the generally suggested full sampling schedule, the proposed supportive reverse process particularly improved fast sampling, taking only a few steps to yield better enhancement results than the conventional full-step inference process.
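The supportive reverse process can be illustrated with a short sketch: starting from the noisy waveform, each reverse step applies the usual DDPM update and then re-injects scaled noisy speech. The noise schedule and the mixing coefficient below are illustrative assumptions; see the DiffuSE paper for the exact formulation.

```python
# Hedged sketch of a supportive reverse process for diffusion-based SE.
import numpy as np

def supportive_reverse(y, denoise_fn, betas):
    """y: noisy waveform; denoise_fn(x, t) predicts the noise at step t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = y.copy()  # start from the noisy speech rather than pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = denoise_fn(x, t)  # predicted non-target signal
        # Standard DDPM mean update ...
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # ... plus the supportive step: re-inject scaled noisy speech
        # (the mixing coefficient here is an assumption for illustration).
        c = np.sqrt(betas[t])
        x = (1.0 - c) * x + c * y
        if t > 0:
            x += np.sqrt(betas[t]) * np.random.randn(*x.shape)
    return x

betas = np.linspace(1e-4, 0.05, 10)  # short schedule, as in fast sampling
y = np.random.randn(16000)
enhanced = supportive_reverse(y, lambda x, t: np.zeros_like(x), betas)
```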
We address the problem of speech enhancement generalisation to unseen environments by performing two manipulations. First, we embed an additional recording from the environment alone, and use this embedding to alter activations in the main enhancement subnetwork. Second, we scale the number of noise environments present at training time to 16,784 different environments. Experimental results show that both manipulations reduce word error rates of a pretrained speech recognition system and improve enhancement quality according to a number of performance measures. Specifically, our best model reduces the word error rate from 34.04% on noisy speech to 15.46% on the enhanced speech. Enhanced audio samples can be found in https://speechenhancement.page.link/samples.
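One common way to realize the first manipulation is FiLM-style conditioning: encode a noise-only recording into an embedding, then predict per-feature scales and shifts that alter activations in the enhancement subnetwork. The sketch below assumes this mechanism purely for illustration; the paper's exact conditioning scheme may differ.

```python
# Hedged sketch: condition an SE network on an environment embedding computed
# from a noise-only recording, via FiLM-style scale-and-shift modulation.
import torch
import torch.nn as nn

class EnvConditionedSE(nn.Module):
    def __init__(self, dim=257, hidden=512, emb=128):
        super().__init__()
        self.env_encoder = nn.Sequential(nn.Linear(dim, emb), nn.ReLU(),
                                         nn.Linear(emb, emb))
        self.inp = nn.Linear(dim, hidden)
        self.film = nn.Linear(emb, 2 * hidden)  # predicts scale and shift
        self.out = nn.Linear(hidden, dim)

    def forward(self, noisy, env_noise):
        # Average-pool the noise-only frames into one environment embedding.
        e = self.env_encoder(env_noise).mean(dim=1)
        scale, shift = self.film(e).chunk(2, dim=-1)
        h = torch.relu(self.inp(noisy))
        h = scale.unsqueeze(1) * h + shift.unsqueeze(1)  # alter activations
        return self.out(h)

model = EnvConditionedSE()
noisy = torch.randn(2, 100, 257)  # (batch, frames, freq bins)
env = torch.randn(2, 50, 257)     # noise-only frames from the environment
print(model(noisy, env).shape)    # torch.Size([2, 100, 257])
```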

