End-to-end models for raw audio generation are challenging to build, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow that uses hypernetwork conditioning to perform many-to-many voice conversion of raw audio. Blow is trained end-to-end, with non-parallel data, on a frame-by-frame basis, using a single speaker identifier. We show that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations. We further assess the impact of its main components with an ablation study, and quantify a number of properties, such as the necessary amount of training data or the preference for source or target speakers.
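As a rough illustration of the idea, not the paper's implementation, the sketch below shows a speaker-conditioned affine coupling block in PyTorch: a hypernetwork maps a learned speaker embedding to the weights and bias of the coupling network's output layer, so each speaker identifier yields a different transformation of the frame. The class name `HyperCoupling`, the dense layers (the paper uses 1D-convolutional blocks), and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HyperCoupling(nn.Module):
    """Sketch: affine coupling whose output layer is generated by a speaker hypernetwork."""
    def __init__(self, frame_size: int, n_speakers: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.half, self.hidden = frame_size // 2, hidden
        self.embed = nn.Embedding(n_speakers, emb_dim)        # single speaker identifier
        # Hypernetwork: speaker embedding -> weights and bias of the coupling
        # net's last layer (which predicts a log-scale and a bias per half-frame).
        n_out = 2 * self.half
        self.hyper = nn.Linear(emb_dim, hidden * n_out + n_out)
        self.body = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, x, speaker_id):
        x_a, x_b = x.chunk(2, dim=-1)                         # split the frame in two halves
        h = self.body(x_a)                                    # (B, hidden) features of x_a
        n_out = 2 * self.half
        p = self.hyper(self.embed(speaker_id))                # speaker-generated parameters
        W = p[:, : self.hidden * n_out].view(-1, n_out, self.hidden)
        b = p[:, self.hidden * n_out:]
        out = torch.bmm(W, h.unsqueeze(-1)).squeeze(-1) + b   # per-speaker affine head
        log_s, t = out.chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                             # keep scales numerically tame
        y_b = x_b * torch.exp(log_s) + t                      # affine transform of x_b
        logdet = log_s.sum(dim=-1)                            # log|det J|, added to the flow likelihood
        return torch.cat([x_a, y_b], dim=-1), logdet
```

In this framing, conversion would run a frame through the forward flow with the source speaker's identifier to obtain a latent code, and then through the inverse flow with the target speaker's identifier; the block above is inverted analytically as x_b = (y_b - t) * exp(-log_s).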