Traditional voice conversion (VC) has focused on speaker identity conversion for speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and that the emotional style of speech can be speaker-dependent. In this paper, we study a technique to jointly convert speaker identity and speaker-dependent emotional style, which we call expressive voice conversion. We propose a StarGAN-based framework that learns a many-to-many mapping across different speakers and takes speaker-dependent emotional style into account without the need for parallel data. To achieve this, we condition the generator on an emotional style encoding derived from a pre-trained speech emotion recognition (SER) model. Experiments validate the effectiveness of the proposed framework in both objective and subjective evaluations. To the best of our knowledge, this is the first study on expressive voice conversion.
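To make the conditioning idea concrete, the following is a minimal, hypothetical PyTorch sketch of a generator that takes a source mel-spectrogram together with a target-speaker code and an SER-derived emotion embedding. The layer choices, dimensions, and concatenation-based conditioning are illustrative assumptions, not details reported in the abstract above.

    # Illustrative sketch only: a generator conditioned on a speaker code and
    # an SER-derived emotion embedding, loosely in the spirit of the
    # StarGAN-based framework described above. Module names, dimensions, and
    # the conditioning scheme are assumptions, not the authors' implementation.
    import torch
    import torch.nn as nn

    class ConditionedGenerator(nn.Module):
        def __init__(self, n_mels=80, spk_dim=16, emo_dim=128, hidden=256):
            super().__init__()
            # 1D conv stack over mel-spectrogram frames; the speaker one-hot
            # and emotion embedding are broadcast and concatenated per frame.
            in_dim = n_mels + spk_dim + emo_dim
            self.net = nn.Sequential(
                nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2),
            )

        def forward(self, mel, spk_code, emo_emb):
            # mel: (B, n_mels, T); spk_code: (B, spk_dim); emo_emb: (B, emo_dim)
            T = mel.size(-1)
            cond = torch.cat([spk_code, emo_emb], dim=1)
            cond = cond.unsqueeze(-1).expand(-1, -1, T)
            return self.net(torch.cat([mel, cond], dim=1))

    # In the described framework, the emotion embedding would come from a
    # frozen, pre-trained SER model; a random tensor stands in for it here.
    gen = ConditionedGenerator()
    mel = torch.randn(2, 80, 100)    # batch of source mel-spectrograms
    spk = torch.eye(16)[:2]          # one-hot target-speaker codes
    emo = torch.randn(2, 128)        # placeholder SER-style embeddings
    converted = gen(mel, spk, emo)   # (2, 80, 100) converted mel-spectrograms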
Current voice conversion (VC) methods can successfully convert the timbre of audio. However, because effectively modeling the prosody of the source audio is challenging, transferring the source style to the converted speech remains limited. This study prop
The emotional state of a speaker is found to have a significant effect on speech production, which can cause speech to deviate from that produced in a neutral state. This makes identifying speakers under different emotions a challenging task, as generally the speaker
Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known spe
The Voice Conversion Challenge is a biennial scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed
Emotional Voice Conversion, or emotional VC, is a technique for converting speech from one emotional state into another while keeping the basic linguistic information and speaker identity. Previous approaches to emotional VC need parallel data and use d