Audio-attention discriminative language model for ASR rescoring

109 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ankur Gandhe

تاريخ النشر 2019

مجال البحث هندسة إلكترونية الهندسة المعلوماتية

والبحث باللغة English

تأليف Ankur Gandhe - Ariya Rastrow

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

End-to-end approaches for automatic speech recognition (ASR) benefit from directly modeling the probability of the word sequence given the input audio stream in a single neural network. However, compared to conventional ASR systems, these models typically require more data to achieve comparable results. Well-known model adaptation techniques, to account for domain and style adaptation, are not easily applicable to end-to-end systems. Conventional HMM-based systems, on the other hand, have been optimized for various production environments and use cases. In this work, we propose to combine the benefits of end-to-end approaches with a conventional system using an attention-based discriminative language model that learns to rescore the output of a first-pass ASR system. We show that learning to rescore a list of potential ASR outputs is much simpler than learning to generate the hypothesis. The proposed model results in 8% improvement in word error rate even when the amount of training data is a fraction of data used for training the first-pass system.

قيم البحث

144 - Suyoun Kim , Siddharth Dalmia , Florian Metze 2019

We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information. Unlike conventional speech recognition models, our model exploits two speakers history of conversational-context in formation that spans across multiple turns within an end-to-end framework. Specifically, we propose a speaker-specific cross-attention mechanism that can look at the output of the other speaker side as well as the one of the current speaker for better at recognizing long conversations. We evaluated the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

معالجة الصوت والكلام الحساب واللغة التعلم الآلي

DiffWave: A Versatile Diffusion Model for Audio Synthesis

189 - Zhifeng Kong , Wei Ping , Jiaji Huang 2020

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain w ith a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

معالجة الصوت والكلام الحساب واللغة التعلم الآلي

Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model

117 - Da-Rong Liu , Chunxi Liu , Frank Zhang 2020

Videos uploaded on social media are often accompanied with textual descriptions. In building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore A SR lattice rescoring by selectively attending to the video descriptions. We first use an attention based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model during lattice rescoring. Secondly, we propose a hybrid pointer network approach to explicitly interpolate the word probabilities of the word occurrences in metadata. We perform experimental evaluations on both language modeling and ASR tasks, and demonstrate that both proposed methods provide performance improvements by selectively leveraging the video metadata.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Contextual RNN-T For Open Domain ASR

100 - Mahaveer Jain , Gil Keren , Jay Mahadeokar 2020

End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - i nto a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.

معالجة الصوت والكلام الحساب واللغة التعلم الآلي

Speech Model Pre-training for End-to-End Spoken Language Understanding

197 - Loren Lugosch , Mirco Ravanelli , Patrick Ignoto 2019

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the models ability to generalize to new phrases not heard during training.

معالجة الصوت والكلام الحساب واللغة التعلم الآلي