End-to-end approaches to single-channel target speech extraction have attracted widespread attention, but studies of end-to-end multi-channel target speech extraction remain relatively limited. In this work, we propose two methods that exploit multi-channel spatial information to extract the target speech. The first uses a target speech adaptation layer in a parallel encoder architecture. The second is a channel decorrelation mechanism that extracts inter-channel differential information to enhance the multi-channel encoder representation. We compare the proposed methods with two strong state-of-the-art baselines. Experimental results on the multi-channel reverberant WSJ0 2-mix dataset show that the proposed methods achieve up to 11.2% and 11.5% relative improvements in SDR and SiSDR, respectively, which are, to the best of our knowledge, the best reported results on this task.
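To make the reported metric concrete, the following is a minimal sketch of how scale-invariant SDR (SiSDR) and a relative improvement over a baseline are commonly computed. It uses the standard textbook definition of SI-SDR and is not taken from the paper's evaluation code. The function names `si_sdr` and `relative_improvement`, the `eps` stabiliser, and the mean-removal step are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Hypothetical helper: follows the common SI-SDR definition, with
    mean removal and a small eps added for numerical stability.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove any DC offset so the projection is not biased by the mean.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference

    # Everything in the estimate that is not the scaled target is distortion.
    distortion = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def relative_improvement(baseline_db, proposed_db):
    """Relative improvement in percent of a proposed score over a baseline."""
    return 100.0 * (proposed_db - baseline_db) / baseline_db


if __name__ == "__main__":
    # Synthetic signals only; these are not results from the paper.
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    extracted = clean + 0.1 * rng.standard_normal(16000)
    print(f"SI-SDR: {si_sdr(extracted, clean):.2f} dB")
    print(f"Relative gain: {relative_improvement(10.0, 11.0):.1f} %")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the extracted signal leaves the score unchanged, which is why this metric is often preferred over plain SDR when comparing systems with different output gains.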
Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful in extracting the target speech. We have recently proposed a channel decorrelation (CD) mechanism to
Target speech separation refers to extracting a target speaker's voice from overlapped audio of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This work proposes a general
A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention. Such reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip
Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker
Speaker verification has been studied mostly under the single-talker condition. It is adversely affected in the presence of interference speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verificati