We propose a block-online algorithm of guided source separation (GSS). GSS is a speech separation method that uses diarization information to update the parameters of a generative model of the observed signals. Previous studies have shown that GSS performs well in multi-talker scenarios, but its heavy computation is an obstacle to deployment in online applications. Moreover, offline GSS is an utterance-wise algorithm, so its latency grows with the length of the utterance. In the proposed algorithm, block-wise input samples and the corresponding time annotations are concatenated with those of the preceding context and used to update the parameters. The context enables the algorithm to estimate time-frequency masks accurately with only one optimization iteration per block, so its latency depends not on the utterance length but on a predetermined block length. The algorithm also reduces the computational cost by updating only the parameters of the speakers active in each block and its context. Evaluation on the CHiME-6 corpus and a meeting corpus showed that the proposed algorithm achieved almost the same performance as the conventional offline GSS algorithm with 32x faster computation, which is sufficient for real-time applications.
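To make the block-and-context bookkeeping concrete, here is a minimal single-channel Python sketch. The actual GSS updates a multichannel spatial mixture model (cACGMM); the toy magnitude-domain mixture below, along with all function and parameter names, is an illustrative assumption rather than the paper's implementation. It shows the three ingredients described above: concatenating each block with its preceding context, running a single update iteration, and touching only the speakers active in that segment.

```python
import numpy as np

def block_online_gss(stft, activity, block_len=100, context_len=200):
    """Block-online guided separation sketch (single channel for brevity).

    stft     : complex array (F, T) -- STFT of the mixture
    activity : binary array  (S, T) -- diarization: 1 where speaker s talks
    Returns per-speaker time-frequency masks of shape (S, F, T).
    """
    F, T = stft.shape
    S = activity.shape[0]
    power = np.ones((S, F))   # per-speaker spectral templates (model state)
    masks = np.zeros((S, F, T))
    for start in range(0, T, block_len):
        end = min(start + block_len, T)
        ctx = max(0, start - context_len)
        act = activity[:, ctx:end]                # block + preceding context
        active = np.flatnonzero(act.any(axis=1))  # update active speakers only
        if active.size == 0:
            continue
        mag2 = np.abs(stft[:, ctx:end]) ** 2      # (F, Tseg)
        # E-step: posterior over active speakers per TF bin, gated by diarization
        num = power[active][:, :, None] * act[active][:, None, :]
        gamma = num / np.maximum(num.sum(axis=0, keepdims=True), 1e-10)
        # M-step: one iteration suffices because the context warm-starts the model
        power[active] = (gamma * mag2[None]).sum(axis=2) \
                        / np.maximum(gamma.sum(axis=2), 1e-10)
        # keep masks for the new block only; context frames were just a warm start
        masks[active, :, start:end] = gamma[:, :, start - ctx:]
    return masks
```

Capping the update at one iteration per block is what bounds the latency to block_len frames regardless of utterance length, and the `active` index set is what skips the parameters of silent speakers.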
We propose a separation-guided speaker diarization (SGSD) approach that fully utilizes the complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot handle overlap well
Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, applied both to utterance-level separation with utterance-level PIT (uPIT) and to the separation of long recordings with the recently proposed
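As a quick illustration of the PIT idea referenced in this snippet: the training loss is evaluated under every assignment of network outputs to reference speakers, and the minimum is used, which resolves the arbitrary ordering of the outputs. Below is a minimal numpy sketch of utterance-level PIT (uPIT) with an MSE criterion; the shapes and names are illustrative assumptions.

```python
import itertools
import numpy as np

def upit_mse(estimates, targets):
    """Utterance-level PIT (uPIT) with an MSE criterion.

    estimates, targets : arrays of shape (S, T) -- S streams over one utterance.
    The loss is computed under every output-to-speaker permutation, and the
    cheapest assignment defines the loss for the whole utterance.
    """
    S = estimates.shape[0]
    losses = []
    for perm in itertools.permutations(range(S)):
        losses.append(np.mean((estimates[list(perm)] - targets) ** 2))
    return min(losses)
```

The permutation search grows factorially with the number of streams, which is one motivation for the variants developed for long recordings with many speakers.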
Many applications of single-channel source separation (SCSS), including automatic speech recognition (ASR) and hearing aids, require the estimation of only one source from a mixture of many sources. Treating this special case as a regular SCSS problem
Multi-speaker speech recognition has been one of the key challenges in conversation transcription as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered as a remedy
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all parti