Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

Published by: Lauri Juvela

Publication date: 2018

Research field: Electronic engineering, Informatics engineering

Paper language: English

Authors: Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi

The state-of-the-art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parall

163 - Lauri Juvela, Bajibabu Bollepalli, Xin Wang 2018

This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial network-based noise model to add a realistic high-frequency stochastic component to the modeled excitation signal. The results show that high quality speech reconstruction can be obtained, given only MFCC information at test time.

Audio and speech processing, Computation and language, Computer sound systems

Comparing Representations for Audio Synthesis Using Generative Adversarial Networks

153 - Javier Nistal, Stefan Lattner, Gael Richard 2020

In this paper, we compare different audio signal representations, including the raw audio waveform and a variety of time-frequency representations, for the task of audio synthesis with Generative Adversarial Networks (GANs). We conduct the experiments on a subset of the NSynth dataset. The architecture follows the benchmark Progressive Growing Wasserstein GAN. We perform experiments both in a fully non-conditional manner as well as conditioning the network on the pitch information. We quantitatively evaluate the generated material utilizing standard metrics for assessing generative models, and compare training and sampling times. We show that complex-valued as well as the magnitude and Instantaneous Frequency of the Short-Time Fourier Transform achieve the best results, and yield fast generation and inversion times. The code for feature extraction, training and evaluating the model is available online.

Audio and speech processing, Computer sound systems

Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement

152 - Haoyu Li, Junichi Yamagishi 2021

The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-power constraint, i.e., signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency dependent amplification factors, which are then applied to the input raw speech to reallocate the speech energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, including both intelligibility- and quality-related ones, which results in notable improvements in performance and robustness. Our system can not only work in non-realtime mode for offline audio playback but also support practical real-time speech applications. Experimental results using both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.

Audio and speech processing, Computer sound systems
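
As a rough illustration of the equal-power constraint mentioned in the abstract above, the sketch below applies a set of amplification factors to a signal and then rescales the result so that its mean power matches that of the input. It is only a minimal sketch under stated assumptions, not the authors' system: the function names, the per-sample `gains` list, and the toy sample values are all hypothetical, and the paper obtains time-frequency dependent factors from a GAN rather than a fixed list.

```python
import math


def mean_power(signal):
    """Return the mean of squared samples for a non-empty sequence of floats."""
    if not signal:
        raise ValueError("signal must not be empty")
    return sum(x * x for x in signal) / len(signal)


def apply_gains_equal_power(signal, gains):
    """Apply gains, then rescale so output mean power equals input mean power.

    `gains` is a hypothetical per-sample amplification sequence used only
    for illustration; it stands in for the learned amplification factors.
    """
    if len(signal) != len(gains):
        raise ValueError("signal and gains must have the same length")
    modified = [x * g for x, g in zip(signal, gains)]
    power_in = mean_power(signal)
    power_out = mean_power(modified)
    if power_out == 0.0:
        # An all-zero output cannot be rescaled; return it unchanged.
        return modified
    scale = math.sqrt(power_in / power_out)
    return [x * scale for x in modified]


# Toy data (made up for this example).
speech = [0.1, -0.4, 0.3, 0.2, -0.1, 0.05]
gains = [2.0, 0.5, 1.5, 1.0, 3.0, 0.8]
enhanced = apply_gains_equal_power(speech, gains)
```

The rescaling step is what enforces the constraint: energy is moved between samples by the gains, while the overall power stays the same as in the input.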

SeismoGen: Seismic Waveform Synthesis Using Generative Adversarial Networks

108 - Tiantong Wang, Daniel Trugman 2019

Detecting earthquake events from seismic time series has proved itself a challenging task. Manual detection can be expensive and tedious due to the intensive labor and large scale data set. In recent years, automatic detection methods based on machine learning have been developed to improve accuracy and efficiency. However, the accuracy of those methods relies on a sufficient amount of high-quality training data, which itself can be expensive to obtain due to the requirement of domain knowledge and subject matter expertise. This paper aims to resolve this dilemma by answering two questions: (1) provided with a limited number of reliable labels, can we use them to generate more synthetic labels; (2) can we use those synthetic labels to improve the detectability? Among all the existing generative models, the generative adversarial network (GAN) shows its supreme capability in generating high-quality synthetic samples in multiple domains. We designed our model based on GAN. In particular, we studied several different network structures. By comparing the generated results, our GAN-based generative model yields the highest quality. We further combine the dataset with synthetic samples generated by our generative model and show that the detectability of our earthquake classification model is significantly improved over the one trained without augmenting the training set.

Machine learning, Geophysics

Transferring neural speech waveform synthesizers to musical instrument sounds generation

90 - Yi Zhao, Xin Wang, Lauri Juvela 2019

Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work compares three neural synthesizers used for musical instrument sounds generation under three scenarios: training from scratch on music data, zero-shot learning from the speech domain, and fine-tuning-based adaptation from the speech to the music domain. The results of a large-scale perceptual test demonstrated that the performance of three synthesizers improved when they were pre-trained on speech data and fine-tuned on music data, which indicates the usefulness of knowledge from speech data for music audio generation. Among the synthesizers, WaveGlow showed the best potential in zero-shot learning while NSF performed best in the other scenarios and could generate samples that were perceptually close to natural audio.

Audio and speech processing, Computer sound systems, Machine learning
