
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

Posted by Hideyuki Tachibana
Published: 2017
Research field: Informatics Engineering
Paper language: English





This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units. Recurrent neural networks (RNN) have recently become a standard technique for modeling sequential data, and they are used in some cutting-edge neural TTS systems. However, training RNN components often requires a very powerful computer or a very long time, typically several days or weeks. Other recent studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques because of its high parallelizability. The objective of this paper is to show that an alternative neural TTS based only on CNNs alleviates these economic costs of training. In our experiment, the proposed Deep Convolutional TTS was sufficiently trained overnight (15 hours) using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.
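
The "guided attention" in the title refers to a soft penalty that encourages the attention matrix to stay near the diagonal, since text and speech proceed roughly in lockstep. Below is a minimal NumPy sketch of that idea, assuming an attention matrix of shape (text length, frame count) and the commonly cited Gaussian-shaped penalty with width g; the exact constant and penalty form should be checked against the paper.

    import numpy as np

    def guided_attention_weights(n_text, n_frames, g=0.2):
        # Penalty matrix W[n, t]: near zero along the diagonal band (text and
        # frames advancing roughly together) and approaching 1 far from it.
        n = np.arange(n_text)[:, None] / n_text      # normalized text position
        t = np.arange(n_frames)[None, :] / n_frames  # normalized frame position
        return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

    def guided_attention_loss(attention, g=0.2):
        # attention: (n_text, n_frames) matrix of soft alignment weights.
        # Added to the usual reconstruction losses; it is small when the
        # attention mass stays near the diagonal.
        W = guided_attention_weights(*attention.shape, g=g)
        return float(np.mean(attention * W))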


See also

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
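
The error modes mentioned above are typically skipped or repeated words caused by non-monotonic attention. A common mitigation in this family of systems is to constrain attention to a small forward-moving window at inference time; the sketch below illustrates that generic idea and is not claimed to be Deep Voice 3's exact rule.

    import numpy as np

    def windowed_attention(logits, prev_pos, width=3):
        # logits: raw attention scores over the text positions for one decoder
        # step.  Only a small window starting at the previously attended index
        # is kept, so the alignment can move forward but cannot skip or repeat
        # large chunks of text.
        masked = np.full(logits.shape, -np.inf)
        lo, hi = prev_pos, min(len(logits), prev_pos + width)
        masked[lo:hi] = logits[lo:hi]
        weights = np.exp(masked - masked.max())
        weights /= weights.sum()
        return weights, int(np.argmax(weights))  # attention and new position
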
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the text-to-acoustic alignment layer by layer. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for the residual attention modules. By leveraging the coarse global alignment from the previous attention layer as an extra input, the following attention layer can produce a refined version of the alignment. This amortizes the burden of learning the text-to-acoustic alignment across multiple attention layers and is more robust than using only a single attention layer. An utterance-level speaking-speed factor is computed by a jointly trained speaking-speed predictor, which takes the mean-pooled latent variables of the coarsest layer as input, to determine the number of acoustic frames at inference. Experimental results show that VARA-TTS achieves slightly inferior speech quality to its AR counterpart, Tacotron 2, but an order-of-magnitude speed-up at inference, and outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.
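
A rough sketch of the layer-wise refinement described above: each attention layer receives the coarse alignment from the layer above (assumed here to be already upsampled to its own temporal resolution) and adds it in the log domain as a bias on its own attention logits, so later layers only need to correct the earlier estimate. The parameterization is illustrative, not the paper's exact module.

    import torch
    import torch.nn as nn

    class ResidualAttentionLayer(nn.Module):
        # One attention layer that refines a coarse alignment from the layer
        # above instead of learning the alignment from scratch.
        def __init__(self, d_query, d_key):
            super().__init__()
            self.proj = nn.Linear(d_query, d_key)

        def forward(self, latents, text_keys, prev_align=None):
            # latents:    (T, d_query) latent variables at this resolution
            # text_keys:  (N, d_key)   encoded text
            # prev_align: (T, N) coarse alignment, assumed already upsampled
            #             to this layer's temporal resolution
            logits = self.proj(latents) @ text_keys.t() / text_keys.size(-1) ** 0.5
            if prev_align is not None:
                # Log-domain residual: this layer only corrects the estimate.
                logits = logits + torch.log(prev_align + 1e-8)
            return torch.softmax(logits, dim=-1)  # refined (T, N) alignment
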
Renqian Luo, Xu Tan, Rui Wang (2021)
Text to speech (TTS) has been broadly used to synthesize natural and intelligible speech in different scenarios. Deploying TTS on end devices such as mobile phones or embedded devices requires extremely small memory usage and inference latency. While non-autoregressive TTS models such as FastSpeech have achieved significantly faster inference than autoregressive models, their model size and inference latency are still too large for deployment on resource-constrained devices. In this paper, we propose LightSpeech, which leverages neural architecture search (NAS) to automatically design more lightweight and efficient models based on FastSpeech. We first profile the components of the current FastSpeech model and carefully design a novel search space containing various lightweight and potentially effective architectures. Then NAS is utilized to automatically discover well-performing architectures within the search space. Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality. Audio demos are provided at https://speechresearch.github.io/lightspeech.
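
To make the "design a search space, then search it" step concrete, here is a toy Python sketch: candidate architectures are drawn from a hypothetical space of per-layer kernel sizes and channel widths, filtered by a parameter budget, and scored by a caller-supplied quality estimate. Random sampling only stands in for the actual NAS algorithm used in LightSpeech.

    import random

    # Hypothetical search space: one kernel size per separable-conv layer plus
    # a channel width.  The real LightSpeech space is more elaborate.
    KERNELS = [1, 5, 9, 13, 17, 21, 25]
    WIDTHS = [128, 192, 256]

    def param_proxy(kernels, width):
        # Crude parameter-count proxy for depthwise-separable conv layers.
        return sum(k * width + width * width for k in kernels)

    def search(score_fn, n_layers=4, budget=200, max_params=2_000_000):
        # score_fn maps an architecture to a quality estimate (higher is
        # better), e.g. the negated validation loss of a briefly trained
        # candidate; it must be supplied by the caller.
        best = None
        for _ in range(budget):
            arch = (tuple(random.choice(KERNELS) for _ in range(n_layers)),
                    random.choice(WIDTHS))
            if param_proxy(*arch) <= max_params:
                score = score_fn(arch)
                if best is None or score > best[0]:
                    best = (score, arch)
        return best
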
Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this framework is that VC is a challenging problem which usually needs a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the target speaker. In particular, we manage to perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both text embedding and speech embedding. Then, based on the aligned cross-modality input, we directly generate the mel-spectrogram of the edited speech with a transformer-based decoder. Subjective listening tests show that despite the lack of training data for the speaker, our method has achieved satisfactory results. It outperforms a recent zero-shot TTS engine by a large margin.
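
"Regulating" the text embedding with predicted durations is, in FastSpeech-style systems, a length-regulation step: each symbol embedding is repeated for its predicted number of frames so that text and speech share a frame-level time axis. A minimal NumPy sketch of that operation follows; the paper's exact regulation of both embeddings may differ.

    import numpy as np

    def length_regulate(text_emb, durations):
        # text_emb:  (N, d) one embedding per input symbol
        # durations: (N,)   predicted number of frames for each symbol
        # Repeating each row for its duration yields a frame-level sequence
        # of shape (sum(durations), d) aligned with the speech embedding.
        return np.repeat(text_emb, durations, axis=0)
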
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus on addressing audio information only. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. In the proposed AVDCNN SE model, audio and visual data are first processed using individual CNNs, and then, fused into a joint network to generate enhanced speech at the output layer. The AVDCNN model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five objective criteria. Results show that the AVDCNN yields notably better performance, compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process.
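
A minimal PyTorch sketch of the two-stream-then-fusion layout described above, assuming spectrogram audio input and a per-frame visual feature vector already aligned to the audio frame rate; the dimensions and layer counts are placeholders, not the AVDCNN configuration.

    import torch
    import torch.nn as nn

    class AudioVisualSE(nn.Module):
        # Two modality-specific CNN streams fused into a joint network that
        # predicts an enhanced spectrogram, trained end-to-end.
        def __init__(self, n_freq=257, vis_dim=1024, hidden=256):
            super().__init__()
            self.audio = nn.Sequential(nn.Conv1d(n_freq, hidden, 3, padding=1), nn.ReLU())
            self.visual = nn.Sequential(nn.Conv1d(vis_dim, hidden, 3, padding=1), nn.ReLU())
            self.fusion = nn.Sequential(
                nn.Conv1d(2 * hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv1d(hidden, n_freq, 1))

        def forward(self, noisy_spec, visual_feat):
            # noisy_spec:  (B, n_freq, T) noisy spectrogram frames
            # visual_feat: (B, vis_dim, T) visual features per audio frame
            a = self.audio(noisy_spec)
            v = self.visual(visual_feat)
            return self.fusion(torch.cat([a, v], dim=1))  # (B, n_freq, T)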
