In this paper, to further address the performance degradation caused by ignoring phase information in conventional speech enhancement systems, we proposed a temporal dilated convolutional generative adversarial network (TDCGAN) within an end-to-end speech enhancement architecture. For the first time, we introduced a temporal dilated convolutional network with depthwise separable convolutions into the GAN structure, so that the receptive field can be greatly enlarged without increasing the number of parameters. We were also the first to explore the effect of a signal-to-noise ratio (SNR) penalty term, used as a regularizer in the generator's loss function, on the SNR of the enhanced speech. The experimental results demonstrated that the proposed method outperformed state-of-the-art end-to-end GAN-based speech enhancement methods. Moreover, compared with previous GAN-based methods, the proposed TDCGAN greatly decreased the number of parameters. As expected, the work also demonstrated that the SNR penalty term was more effective as a regularizer than $L1$ in improving the SNR of the enhanced speech.
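The two ideas in the abstract can be illustrated with a small sketch. It is not the TDCGAN implementation: the function names, the kernel size, the dilation schedule, the channel counts, and the exact form and weight of the SNR penalty are all assumptions chosen for illustration. The first part shows why dilation enlarges the receptive field while the weight count stays fixed, and why depthwise separable convolutions need fewer weights. The second part shows one plausible way to subtract an SNR term from a generator loss.

```python
import math


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a standard 1-D convolution (bias ignored).

    Dilation does not appear here: it spreads the taps out in time
    but does not add any weights.
    """
    return c_in * c_out * kernel_size


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a depthwise conv (one filter per input channel)
    followed by a 1x1 pointwise conv (bias ignored)."""
    return c_in * kernel_size + c_in * c_out


def snr_db(clean, enhanced, eps=1e-8):
    """SNR in dB, treating (clean - enhanced) as the residual noise."""
    signal = sum(s * s for s in clean)
    noise = sum((s - e) ** 2 for s, e in zip(clean, enhanced))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def generator_loss(adv_loss, clean, enhanced, weight=0.1):
    """Hypothetical regularized loss: a higher SNR lowers the loss.

    The weight and the subtractive form are assumptions, not taken
    from the paper.
    """
    return adv_loss - weight * snr_db(clean, enhanced)


if __name__ == "__main__":
    # Same number of layers and weights, very different temporal context.
    print(receptive_field(3, [1, 1, 1, 1]))   # 9
    print(receptive_field(3, [1, 2, 4, 8]))   # 31

    # Standard vs. depthwise separable weights for 64 -> 64 channels.
    print(conv1d_params(64, 64, 3))            # 12288
    print(separable_conv1d_params(64, 64, 3))  # 4288

    clean = [1.0, -1.0, 1.0, -1.0]
    rough = [0.5, -0.5, 0.5, -0.5]
    close = [0.9, -0.9, 0.9, -0.9]
    # The estimate closer to the clean signal gets the lower loss.
    print(generator_loss(1.0, clean, rough) > generator_loss(1.0, clean, close))
```

In a real training setup the SNR term would be computed on tensors inside an automatic differentiation framework so that its gradient reaches the generator; the plain-Python version above only shows the arithmetic.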
Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be eff
The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-pow
Deep dilated temporal convolutional networks (TCN) have been proved to be very effective in sequence modeling. In this paper we propose several improvements of TCN for end-to-end approach to monaural speech separation, which consists of 1) multi-scal
Speech-driven facial animation is the process which uses speech signals to automatically synthesize a talking character. The majority of work in this domain creates a mapping from audio features to visual features. This often requires post-processing
Robust voice activity detection (VAD) is a challenging task in low signal-to-noise ratio (SNR) environments. Recent studies show that speech enhancement is helpful to VAD, but the performance improvement is limited. To address this issue, here we propose a