ﻻ يوجد ملخص باللغة العربية
The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To further improve the perceptual evaluation of the speech quality (PESQ) scores of enhanced speech, the L_1 pre-trained Transformer is fine-tuned using a MetricGAN framework. The proposed MetricGAN can be treated as a general post-processing module to further boost the objective scores of interest. The experiments were conducted using the data sets provided by the organizer of the Deep Noise Suppression (DNS) challenge. Experimental results demonstrated that the proposed system outperformed the challenge baseline, in both subjective and objective evaluations, with a large margin.
Recent research on speech enhancement (SE) has seen the emergence of deep-learning-based methods. It is still a challenging task to determine the effective ways to increase the generalizability of SE under diverse test conditions. In this study, we c
Diffusion probabilistic models have demonstrated an outstanding capability to model natural images and raw audio waveforms through a paired diffusion and reverse processes. The unique property of the reverse process (namely, eliminating non-target si
Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To ov
We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct users speech from the accelerometer, the latter provides a strong conditioning s
In this study, we present a deep learning-based speech signal-processing mobile application, called CITISEN, which can perform three functions: speech enhancement (SE), model adaptation (MA), and acoustic scene conversion (ASC). For SE, CITISEN can e