Automatic speech recognition (ASR) systems have improved dramatically over the last few years. ASR systems are most often trained on typical speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, we present and evaluate fine-tuning techniques to improve ASR for users with non-standard speech. We focus on two types of non-standard speech: speech from people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative WER improvement on these two groups, bringing the absolute WER for ALS speakers, on a test set of message bank phrases, down to 10% for mild dysarthria and 20% for more serious dysarthria. We show that 71% of the improvement comes from only 5 minutes of training data. Fine-tuning a particular subset of layers (with many fewer parameters) often gives better results than fine-tuning the entire model. This is the first step towards building state-of-the-art ASR models for dysarthric speech.
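The layer-subset fine-tuning mentioned above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the paper's implementation: the `encoder.layers` attribute, the chosen layer indices, and the optimizer settings are assumptions about a generic ASR model.

```python
# Minimal sketch of subset-layer fine-tuning for personalization, assuming a
# PyTorch ASR model exposing an `encoder.layers` ModuleList. Layer indices,
# learning rate, and step budget are illustrative placeholders.
import torch

def finetune_subset(model, train_loader, loss_fn, layer_indices=(3, 4),
                    lr=1e-4, max_steps=100):
    # Freeze every parameter, then unfreeze only the selected encoder layers.
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for i in layer_indices:
        for p in model.encoder.layers[i].parameters():
            p.requires_grad = True
            trainable.append(p)

    optimizer = torch.optim.Adam(trainable, lr=lr)
    model.train()
    step = 0
    for features, targets in train_loader:   # e.g. log-mel frames, token ids
        logits = model(features)
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= max_steps:
            break
    return model
```

Because only the selected layers receive gradients, the per-speaker model differs from the base model in a small fraction of its parameters, which keeps personalized models cheap to store and train.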
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent. While leading online ASR services obtain impressive performance on mainstream accents, they perform poorly on sub-populations; we observed substantially higher word error rates for speakers from under-represented accent groups.
Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves recognition of such speech.
With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of training data.
In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner.
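A multi-task objective of this kind can be summarized as a weighted sum of the supervised CTC loss and a contrastive self-supervised term. The sketch below assumes the contrastive term is computed elsewhere on the audio representations and that a single interpolation weight `alpha` is used; both are assumptions, not the exact UniSpeech formulation.

```python
# Minimal sketch of a joint supervised + self-supervised objective: a weighted
# combination of CTC loss on labeled utterances and a contrastive loss on
# (possibly unlabeled) audio representations. `alpha` is a hypothetical weight.
import torch.nn.functional as F

def multitask_loss(log_probs, targets, input_lengths, target_lengths,
                   contrastive_term, alpha=0.5):
    # Supervised branch: standard CTC loss.
    # log_probs: (T, N, C) log-probabilities; targets: concatenated label ids.
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Self-supervised branch: contrastive loss computed on masked or augmented
    # representations, supplied by the caller.
    return alpha * ctc + (1.0 - alpha) * contrastive_term
```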
Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). However, training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage alternative to this two-stage pipeline.