مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors

82 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Anupama Chingacham

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Anupama Chingacham - Vera Demberg - Dietrich Klakow

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Listening in noisy environments can be difficult even for individuals with a normal hearing thresholds. The speech signal can be masked by noise, which may lead to word misperceptions on the side of the listener, and overall difficulty to understand the message. To mitigate hearing difficulties on listeners, a co-operative speaker utilizes voice modulation strategies like Lombard speech to generate noise-robust utterances, and similar solutions have been developed for speech synthesis systems. In this work, we propose an alternate solution of choosing noise-robust lexical paraphrases to represent an intended meaning. Our results show that lexical paraphrases differ in their intelligibility in noise. We evaluate the intelligibility of synonyms in context and find that choosing a lexical unit that is less risky to be misheard than its synonym introduced an average gain in comprehension of 37% at SNR -5 dB and 21% at SNR 0 dB for babble noise.

قيم البحث

107 - Xiaofei Wang , Jinyi Yang , Ruizhi Li 2019

Quality of data plays an important role in most deep learning tasks. In the speech community, transcription of speech recording is indispensable. Since the transcription is usually generated artificially, automatically finding errors in manual transc riptions not only saves time and labors but benefits the performance of tasks that need the training process. Inspired by the success of hybrid automatic speech recognition using both language model and acoustic model, two approaches of automatic error detection in the transcriptions have been explored in this work. Previous study using a biased language model approach, relying on a strong transcription-dependent language model, has been reviewed. In this work, we propose a novel acoustic model based approach, focusing on the phonetic sequence of speech. Both methods have been evaluated on a completely real dataset, which was originally transcribed with errors and strictly corrected manually afterwards.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation

165 - Elizabeth Salesky , Matthias Sperber , Alan W Black 2019

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representat ions is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Alzheimers Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs

90 - Morteza Rohanian , Julian Hough , Matthew Purver 2021

We present two multimodal fusion-based deep learning models that consume ASR transcribed speech and acoustic data simultaneously to classify whether a speaker in a structured diagnostic task has Alzheimers Disease and to what degree, evaluating the A DReSSo challenge 2021 data. Our best model, a BiLSTM with highway layers using words, word probabilities, disfluency features, pause information, and a variety of acoustic features, achieves an accuracy of 84% and RSME error prediction of 4.26 on MMSE cognitive scores. While predicting cognitive decline is more challenging, our models show improvement using the multimodal approach and word probabilities, disfluency and pause information over word-only models. We show considerable gains for AD classification using multimodal fusion and gating, which can effectively deal with noisy inputs from acoustic features and ASR hypotheses.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

82 - Fengpeng Yue , Yan Deng , Lei He 2021

Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data. In this paper, we explore the TTS->ASR pipeline in speech chain to do domain adaptation for both neural TTS and E2E ASR models, with only text data from target domain. We conduct experiments by adapting from audiobook domain (LibriSpeech) to presentation domain (TED-LIUM), there is a relative word error rate (WER) reduction of 10% for the E2E ASR model on the TED-LIUM test set, and a relative WER reduction of 51.5% in synthetic speech generated by neural TTS in the presentation domain. Further, we apply few-shot speaker adaptation for the E2E ASR by using a few utterances from target speakers in an unsupervised way, results in additional gains.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

74 - Daniel Korzekwa , Roberto Barra-Chicote , Szymon Zaporowski 2020

This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classic al approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited amount of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic and Baltic speakers.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة الشرق الأوسط - الأردن

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً