Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking

213 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yuki Saito

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Hiroki Tamaru - Yuki Saito - Shinnosuke Takamichi

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

This paper proposes a generative moment matching network (GMMN)-based post-filter that provides inter-utterance pitch variation for deep neural network (DNN)-based singing voice synthesis. The natural pitch variation of a human singing voice leads to a richer musical experience and is used in double-tracking, a recording method in which two performances of the same phrase are recorded and mixed to create a richer, layered sound. However, singing voices synthesized using conventional DNN-based methods never vary because the synthesis process is deterministic and only one waveform is synthesized from one musical score. To address this problem, we use a GMMN to model the variation of the modulation spectrum of the pitch contour of natural singing voices and add a randomized inter-utterance variation to the pitch contour generated by conventional DNN-based singing voice synthesis. Experimental evaluations suggest that 1) our approach can provide perceptible inter-utterance pitch variation while preserving speech quality. We extend our approach to double-tracking, and the evaluation demonstrates that 2) GMMN-based neural double-tracking is perceptually closer to natural double-tracking than conventional signal processing-based artificial double-tracking is.

قيم البحث

198 - Ryan Prenger , Rafael Valle , Bryan Catanzaro 2018

In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without th e need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online.

أنظمة الصوت في الحاسوب الذكاء الاصطناعي التعلم الآلي

V2S attack: building DNN-based voice conversion from automatic speaker verification

72 - Taiki Nakamura , Yuki Saito , Shinnosuke Takamichi 2019

This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication systems. Basically, the ASV systems do not inclu de the users voice data. However, if the ASV system is unexpectedly exposed and hacked by a malicious attacker, there is a risk that the attacker will use VC techniques to reproduce the enrolled users voices. We name this the ``verification-to-synthesis (V2S) attack and propose VC training with the ASV and pre-trained automatic speech recognition (ASR) models and without the targeted speakers voice data. The VC model reproduces the targeted speakers individuality by deceiving the ASV model and restores phonetic property of an input voice by matching phonetic posteriorgrams predicted by the ASR model. The experimental evaluation compares converted voices between the proposed method that does not use the targeted speakers voice data and the standard VC that uses the data. The experimental results demonstrate that the proposed method performs comparably to the existing VC methods that trained using a very small amount of parallel voice data.

أنظمة الصوت في الحاسوب التشفير والأمن التعلم الآلي

Latent Space Explorations of Singing Voice Synthesis using DDSP

74 - Juan Alonso , Cumhur Erkut 2021

Machine learning based singing voice models require large datasets and lengthy training times. In this work we present a lightweight architecture, based on the Differentiable Digital Signal Processing (DDSP) library, that is able to output song-like utterances conditioned only on pitch and amplitude, after twelve hours of training using small datasets of unprocessed audio. The results are promising, as both the melody and the singers voice are recognizable. In addition, we present two zero-configuration tools to train new models and experiment with them. Currently we are exploring the latent space representation, which is included in the DDSP library, but not in the original DDSP examples. Our results indicate that the latent space improves both the identification of the singer as well as the comprehension of the lyrics. Our code is available at https://github.com/juanalonso/DDSP-singing-experiments with links to the zero-configuration notebooks, and our sound examples are at https://juanalonso.github.io/DDSP-singing-experiments/ .

أنظمة الصوت في الحاسوب التعلم الآلي الوسائط المتعددة

Dynamic Attention Based Generative Adversarial Network with Phase Post-Processing for Speech Enhancement

156 - Andong Li , Chengshi Zheng , Renhua Peng 2020

The generative adversarial networks (GANs) have facilitated the development of speech enhancement recently. Nevertheless, the performance advantage is still limited when compared with state-of-the-art models. In this paper, we propose a powerful Dyna mic Attention Recursive GAN called DARGAN for noise reduction in the time-frequency domain. Different from previous works, we have several innovations. First, recursive learning, an iterative training protocol, is used in the generator, which consists of multiple steps. By reusing the network in each step, the noise components are progressively reduced in a step-wise manner. Second, the dynamic attention mechanism is deployed, which helps to re-adjust the feature distribution in the noise reduction module. Third, we exploit the deep Griffin-Lim algorithm as the module for phase postprocessing, which facilitates further improvement in speech quality. Experimental results on Voice Bank corpus show that the proposed GAN achieves state-of-the-art performance than previous GAN- and non-GAN-based models

أنظمة الصوت في الحاسوب معالجة الصوت والكلام

DSP Based System for Real time Voice Synthesis Applications Development

329 - Radu Arsinte , Attila Ferencz , Costin Miron 2008

This paper describes an experimental system designed for development of real time voice synthesis applications. The system is composed from a DSP coprocessor card, equipped with an TMS320C25 or TMS320C50 chip, voice acquisition module (ADDA2),host co mputer (IBM-PC compatible), software specific tools.

أنظمة الصوت في الحاسوب