Many deep learning approaches to voice conversion produce good-quality speech, but they rely on a large amount of training data. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion framework that can work with a limited amount of training data. We first propose a DBLSTM-based average model that is trained with data from many speakers. We then adapt this model with a limited amount of target data. Finally, we propose an error reduction network that improves the voice conversion quality further. The proposed framework is motivated by three observations. First, a DBLSTM can achieve remarkable voice conversion quality by considering the long-term dependencies of the speech utterance. Second, a DBLSTM-based average model can be easily adapted with a small amount of data to produce speech that sounds closer to the target. Third, an error reduction network can be trained with a small amount of training data and can effectively improve the conversion quality. Experiments show that the proposed framework works flexibly with limited training data and outperforms traditional frameworks in both objective and subjective evaluations.
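The three-stage recipe described above (average model, adaptation, error reduction) can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: scalar linear maps stand in for the DBLSTM, the speaker mappings and data are synthetic, and every name here (`gd_fit`, `least_squares`, `speakers`, `few`) is hypothetical. The error reduction stage is assumed to map the converted output to the target, which is one plausible reading of the abstract.

```python
# Toy stand-in: scalar linear maps replace the DBLSTM (an assumption made
# for brevity). Only the order of the three training stages is illustrated.

def mse(model, pairs):
    """Mean squared error of y ~ w*x + b over (x, y) pairs."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in pairs) / len(pairs)

def gd_fit(pairs, init=(0.0, 0.0), lr=0.05, steps=200):
    """Fit y ~ w*x + b by full-batch gradient descent, starting from init."""
    w, b = init
    n = len(pairs)
    for _ in range(steps):
        gw = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        gb = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w, b = w - lr * gw, b - lr * gb
    return w, b

def least_squares(pairs):
    """Closed-form 1-D least squares for y ~ w*x + b."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    var = sum((x - mx) ** 2 for x, _ in pairs)
    w = sum((x - mx) * (y - my) for x, y in pairs) / var
    return w, my - w * mx

# Synthetic "speakers": each maps a source feature x to slope * x + offset.
speakers = [(1.0, 0.0), (1.5, 0.5), (0.5, -0.5)]
xs = [i / 10 for i in range(-10, 11)]
pooled = [(x, s * x + o) for s, o in speakers for x in xs]

# Limited target data: four samples from an unseen target speaker.
few = [(x, 2.0 * x + 1.0) for x in (-0.9, -0.3, 0.4, 0.8)]

# Stage 1: average model trained on the pooled multi-speaker data.
average = gd_fit(pooled)

# Stage 2: adaptation, a few gradient steps on the target data,
# initialised from the average model instead of from scratch.
adapted = gd_fit(few, init=average, steps=20)

# Stage 3: error reduction, a second model that maps the adapted model's
# converted output to the target and so corrects the remaining error.
converted = [(adapted[0] * x + adapted[1], y) for x, y in few]
error_net = least_squares(converted)

err_average = mse(average, few)
err_adapted = mse(adapted, few)
err_refined = mse(error_net, converted)
print(err_average, err_adapted, err_refined)
```

In this toy setting the error on the target samples drops at each stage. Because the synthetic target is exactly linear, the last stage fits it almost perfectly, which should not be expected of real speech features.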
Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data. In this paper, we propose using a cycle-consist
Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker is much more difficult to collect than normal speech data. I
This paper presents a refinement framework of WaveNet vocoders for variational autoencoder (VAE) based voice conversion (VC), which reduces the quality distortion caused by the mismatch between the training data and testing data. Conventional WaveNet
Emotional voice conversion, or emotional VC, is a technique for converting speech from one emotional state into another while keeping the basic linguistic information and speaker identity. Previous approaches for emotional VC need parallel data and use d
Mel-frequency filter bank (MFB) based approaches have an advantage over the raw spectrum for learning speech, since MFB features have a smaller feature size. However, speech generators using MFB approaches require an additional vocoder that needs a huge amount of comp