One-shot voice conversion has received significant attention since only one utterance each from the source speaker and the target speaker is required. Moreover, neither the source speaker nor the target speaker needs to be seen during training. However, existing one-shot voice conversion approaches are not stable for unseen speakers, because the speaker embedding extracted from a single utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder that extracts the speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative speaker information at the frame level by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistics pooling layer aggregates the weighted frame-level speaker information into an utterance-level speaker embedding. Experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
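As a concrete illustration of the described encoder, the following is a minimal sketch assuming a PyTorch framing; all layer sizes, block counts, and names are illustrative assumptions rather than the paper's exact implementation. SE-ResNet blocks model frame-wise and channel-wise interdependence, an attention layer assigns a weight to each frame, and a statistics pooling layer aggregates the weighted mean and standard deviation into the utterance-level embedding.

# Minimal sketch of the described speaker encoder (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEResBlock(nn.Module):
    """1-D residual block with squeeze-and-excitation channel re-weighting."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.bn2 = nn.BatchNorm1d(channels)
        # Squeeze-and-excitation: global average pool -> bottleneck -> channel gates
        self.se = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, frames)
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        gate = self.se(h.mean(dim=2))            # channel-wise weights, (batch, channels)
        h = h * gate.unsqueeze(-1)                # re-scale each channel
        return F.relu(h + x)                      # residual connection


class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=256, embed_dim=128, n_blocks=3):
        super().__init__()
        self.frontend = nn.Conv1d(n_mels, channels, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[SEResBlock(channels) for _ in range(n_blocks)])
        self.attention = nn.Conv1d(channels, 1, kernel_size=1)   # frame-level scores
        self.proj = nn.Linear(channels * 2, embed_dim)            # mean + std -> embedding

    def forward(self, mel):                      # mel: (batch, n_mels, frames)
        h = self.blocks(F.relu(self.frontend(mel)))
        w = torch.softmax(self.attention(h), dim=2)               # attention weights over frames
        mean = (h * w).sum(dim=2)                                  # weighted mean
        std = torch.sqrt(((h - mean.unsqueeze(-1)) ** 2 * w).sum(dim=2) + 1e-8)
        return self.proj(torch.cat([mean, std], dim=1))            # utterance-level embedding


# Usage example: one mel-spectrogram with 80 (assumed) mel bins and 200 frames.
if __name__ == "__main__":
    encoder = SpeakerEncoder()
    mel = torch.randn(1, 80, 200)
    print(encoder(mel).shape)   # torch.Size([1, 128])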
We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also the speaker identity loss. We introduce a cycle consistency loss tha
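A minimal sketch of how such a combined objective might be assembled is given below; the function and argument names, the choice of L1 for the spectral terms, cosine distance for the identity term, and the pairing of the loss inputs are all illustrative assumptions, not the cited work's implementation.

# Illustrative combined loss: spectral + speaker identity + cycle consistency.
import torch.nn.functional as F

def conversion_loss(converted_mel, reference_mel, source_mel, reconverted_mel,
                    speaker_encoder, lambda_id=1.0, lambda_cyc=1.0):
    # Frame-level spectral loss against reference frames (the exact pairing
    # depends on the training setup; assumed here for illustration).
    spectral = F.l1_loss(converted_mel, reference_mel)
    # Speaker identity loss: pull the converted speech's embedding toward the
    # target speaker's embedding via cosine distance.
    emb_conv = speaker_encoder(converted_mel)
    emb_tgt = speaker_encoder(reference_mel)
    identity = 1.0 - F.cosine_similarity(emb_conv, emb_tgt, dim=-1).mean()
    # Cycle consistency loss: converting back should recover the source speech.
    cycle = F.l1_loss(reconverted_mel, source_mel)
    return spectral + lambda_id * identity + lambda_cyc * cycle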
To extract the voice of a target speaker when it is mixed with a variety of other sounds, such as white and ambient noise or the voices of interfering speakers, we extend the Transformer network to attend to the most relevant information with respect to the
In this paper, we focus on improving the performance of the text-dependent speaker verification system in the scenario of limited training data. A deep learning based text-dependent speaker verification system generally needs a large-scale text-dep
Content and style representations have been widely studied in the field of style transfer. In this paper, we propose a new loss function using speaker content representation for audio source separation, and we call it speaker representation loss. The
Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not onl