A Two-stage Complex Network using Cycle-consistent Generative Adversarial Networks for Speech Enhancement


Abstract in English

Cycle-consistent generative adversarial networks (CycleGAN) have shown their promising performance for speech enhancement (SE), while one intractable shortcoming of these CycleGAN-based SE systems is that the noise components propagate throughout the cycle and cannot be completely eliminated. Additionally, conventional CycleGAN-based SE systems only estimate the spectral magnitude, while the phase is unaltered. Motivated by the multi-stage learning concept, we propose a novel two-stage denoising system that combines a CycleGAN-based magnitude enhancing network and a subsequent complex spectral refining network in this paper. Specifically, in the first stage, a CycleGAN-based model is responsible for only estimating magnitude, which is subsequently coupled with the original noisy phase to obtain a coarsely enhanced complex spectrum. After that, the second stage is applied to further suppress the residual noise components and estimate the clean phase by a complex spectral mapping network, which is a pure complex-valued network composed of complex 2D convolution/deconvolution and complex temporal-frequency attention blocks. Experimental results on two public datasets demonstrate that the proposed approach consistently surpasses previous one-stage CycleGANs and other state-of-the-art SE systems in terms of various evaluation metrics, especially in background noise suppression.

Download