Conditional End-to-End Audio Transforms


Published by: Albert Haque

Publication date: 2018

Research field: Informatics Engineering

Paper language: English

Authors: Albert Haque, Michelle Guo, Prateek Verma


Abstract

We present an end-to-end method for transforming audio from one style to another. For the case of speech, by conditioning on speaker identities, we can train a single model to transform words spoken by multiple people into multiple target voices. For the case of music, we can specify musical instruments and achieve the same result. Architecturally, our method is a fully-differentiable sequence-to-sequence model based on convolutional and hierarchical recurrent neural networks. It is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms. Ablation studies confirm that our model can separate speaker and instrument properties from acoustic content at different receptive fields. Empirically, our method achieves competitive performance on community-standard datasets.
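The abstract does not say how the speaker or instrument identity is fed to the model. As a rough illustration only, one common way to condition a sequence model on a categorical identity is to append a one-hot vector to every input frame. The function name `condition_frames` and the frame layout below are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def condition_frames(frames, target_id, num_targets):
    """Append a one-hot target-identity vector to every frame.

    frames: array of shape (T, F), one feature vector per time step.
    target_id: integer index of the desired target voice or instrument.
    Returns an array of shape (T, F + num_targets).
    Hypothetical helper: the conditioning mechanism is an assumption.
    """
    if not 0 <= target_id < num_targets:
        raise ValueError("target_id out of range")
    one_hot = np.zeros(num_targets, dtype=frames.dtype)
    one_hot[target_id] = 1
    # Repeat the same identity vector for each of the T frames.
    tiled = np.tile(one_hot, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)
```

With this layout a single set of weights can serve several targets, because the target is selected by the input rather than by a separate model.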


Soroush Mehri, Kundan Kumar, Ishaan Gulrajani (2016)

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in the temporal sequences over very long time spans, on three datasets of different nature. Human evaluation of the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.

Computer Audio Systems, Artificial Intelligence

End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN

Manav Kaushik, Van Tung Pham, Eng Siong Chng (2021)

Automatic height and age estimation of speakers using acoustic features is widely used for human-computer interaction, forensics, etc. In this work, we propose a novel approach that uses an attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder, which is able to capture long-term dependencies in the input acoustic features. We modify the conventionally used attention (which calculates context vectors as the sum of attention only across timeframes) by introducing a modified context vector that also takes into account total attention across encoder units, giving us a new cross-attention mechanism. Apart from this, we also investigate a multi-task learning approach for jointly estimating speaker height and age. We train and test our model on the TIMIT corpus. Our model outperforms several approaches in the literature. We achieve a root mean square error (RMSE) of 6.92 cm and 6.34 cm for male and female heights respectively, and an RMSE of 7.85 years and 8.75 years for male and female ages respectively. By tracking the attention weights allocated to different phones, we find that vowel phones are most important while stop phones are least important for the estimation task.
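For orientation, the conventional attention that this abstract takes as its starting point can be sketched as a softmax over per-timeframe scores followed by a weighted sum of encoder states. This is a minimal NumPy sketch of that baseline only; the paper's modified cross-attention context vector is not specified in the abstract and is not reproduced here. The function name and array shapes are assumptions for illustration.

```python
import numpy as np

def attention_context(encoder_states, scores):
    """Conventional attention pooling across timeframes.

    encoder_states: array of shape (T, H), one hidden vector per timeframe.
    scores: array of shape (T,), unnormalized attention scores.
    Returns (context, weights) with shapes (H,) and (T,).
    """
    # Subtract the maximum score for numerical stability before exponentiating.
    shifted = scores - scores.max()
    exp_scores = np.exp(shifted)
    weights = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted sum of encoder states over time.
    context = weights @ encoder_states
    return context, weights
```

The weights returned alongside the context vector are the per-timeframe quantities one would inspect when asking which phones receive the most attention.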

Computer Audio Systems, Computation and Language, Machine Learning

Streaming end-to-end multi-talker speech recognition

Liang Lu, Naoyuki Kanda, Jinyu Li (2020)

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing research is limited to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone, which can meet various latency constraints. We study two different model architectures that are based on a speaker-differentiator encoder and a mask encoder respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy than PIT, and that the SURT model with a 150-millisecond algorithmic latency constraint compares favorably with the offline sequence-to-sequence baseline model in terms of accuracy.
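The idea behind Permutation Invariant Training can be sketched as follows: evaluate the loss under every assignment of output channels to reference speakers and keep the smallest. The sketch below uses mean squared error as a stand-in loss purely for illustration; the model described above is trained with a transducer loss, and the function name `pit_loss` and the array shapes are assumptions.

```python
import itertools
import numpy as np

def pit_loss(outputs, targets):
    """Permutation-invariant loss over output/target assignments.

    outputs, targets: arrays of shape (S, T, F) for S sources.
    Returns (best_loss, best_perm), where best_perm[i] is the index of
    the target assigned to output channel i.
    MSE is a stand-in loss chosen for this sketch.
    """
    num_sources = outputs.shape[0]
    best_loss, best_perm = float("inf"), None
    # Try every assignment of targets to output channels (S! candidates).
    for perm in itertools.permutations(range(num_sources)):
        loss = np.mean([
            np.mean((outputs[i] - targets[p]) ** 2)
            for i, p in enumerate(perm)
        ])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the search grows factorially with the number of sources, a fixed heuristic assignment avoids enumerating permutations altogether, which is one practical motivation for alternatives to PIT.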

Computer Audio Systems, Computation and Language, Audio and Speech Processing

Class-Conditional Defense GAN Against End-to-End Speech Attacks

Mohammad Esmaeilpour, Patrick Cardinal, Alessandro Lameiras Koerich (2020)

In this paper we propose a novel defense approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo. Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations such as autoencoding a given input signal aiming at removing potential adversarial perturbation. Instead of that, we find an optimal input vector for a class conditional generative adversarial network through minimizing the relative chordal distance adjustment between a given test input and the generator network. Then, we reconstruct the 1D signal from the synthesized spectrogram and the original phase information derived from the given input signal. Hence, this reconstruction does not add any extra noise to the signal and according to our experimental results, our defense-GAN considerably outperforms conventional defense algorithms both in terms of word error rate and sentence level recognition accuracy.

Computer Audio Systems, Cryptography and Security, Computer Vision and Pattern Recognition

Review of end-to-end speech synthesis technology based on deep learning

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong (2021)

As an indispensable part of modern human-computer interaction systems, speech synthesis technology helps users obtain the output of intelligent machines more easily and intuitively, and has therefore attracted increasing attention. Because traditional speech synthesis technology is limited by high complexity and low efficiency, current research focuses on deep learning-based end-to-end speech synthesis, which has more powerful modeling ability and a simpler pipeline. It mainly consists of three modules: text front-end, acoustic model, and vocoder. This paper reviews the research status of these three parts, and classifies and compares various methods according to their emphasis. Moreover, this paper summarizes the open-source speech corpora in English, Chinese, and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and objective speech quality evaluation methods. Finally, some attractive future research directions are pointed out.

Computer Audio Systems, Computation and Language, Audio and Speech Processing