Generating natural speech with diverse and smooth prosody patterns is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated as a way to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what humans can achieve. This is largely due to the use of unimodal distributions, such as a single Gaussian, in prior work on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network and then extend it to multi-speaker TTS using speaker adaptation transforms of the Gaussian means and variances. Furthermore, we show that we can clone the prosodies of a reference utterance by sampling prosodies from the Gaussian components that produce the reference prosodies. Our experiments on the LJSpeech and LibriTTS datasets show that the proposed GMM-based method not only achieves significantly better diversity than a single Gaussian in both single-speaker and multi-speaker TTS, but also provides better naturalness. The prosody cloning experiments demonstrate that the prosody similarity of the proposed GMM-based method is comparable to that of the recently proposed fine-grained VAE, while the target speaker similarity is better.
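The two sampling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each phone's prosody is reduced to a single scalar (the paper presumably uses vector embeddings), the mixture parameters are given as plain lists rather than predicted by a network, and "the Gaussian components that produce the reference prosodies" is read as the component with the highest posterior for the reference value. The function names `sample_gmm`, `responsible_component`, and `clone_prosody` are hypothetical.

```python
import math
import random


def sample_gmm(weights, means, stds, rng):
    """Draw one scalar from a 1-D Gaussian mixture; return (component, value)."""
    # Pick a component according to the mixture weights, then sample from it.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return k, rng.gauss(means[k], stds[k])


def responsible_component(x, weights, means, stds):
    """Index of the component with the highest posterior probability for x."""
    def log_joint(k):
        # log(w_k) + log N(x | mu_k, sigma_k), dropping the shared constant.
        return (math.log(weights[k]) - math.log(stds[k])
                - 0.5 * ((x - means[k]) / stds[k]) ** 2)
    return max(range(len(weights)), key=log_joint)


def clone_prosody(reference, gmms, rng):
    """For each phone, sample from the component that best explains the
    reference prosody value (assumed interpretation of prosody cloning)."""
    cloned = []
    for x, (weights, means, stds) in zip(reference, gmms):
        k = responsible_component(x, weights, means, stds)
        cloned.append(rng.gauss(means[k], stds[k]))
    return cloned


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-component mixture shared by two phones.
    gmm = ([0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
    print(sample_gmm(*gmm, rng))
    print(clone_prosody([0.1, 4.8], [gmm, gmm], rng))
```

With a single Gaussian, every sample for a phone clusters around one mean; with a mixture, different draws can land in different modes, which is the intuition behind the diversity claim above.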
Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style information. However, this embedding process may encode redundant textual information. This phenomenon is called content leakage. Rese
The cross-speaker emotion transfer task in TTS particularly aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the emotion transfer process, the identity info
Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale. It does not require the target speakers to be experts in expressing all styles and to collect corresponding recordings for model trai
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody er