Looking Enhances Listening: Recovering Missing Speech Using Images


Published by Tejas Srinivasan

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Tejas Srinivasan - Ramon Sanabria - Florian Metze

Computation and Language, Multimedia, Audio and Speech Processing


Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.
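To make the "relative improvement in masked word recovery" figure concrete, here is a minimal sketch of how such a number could be computed. The abstract does not define the metric, so the definition used here (the fraction of masked words that reappear in the transcript) is an assumption, and the function names `recovery_rate` and `relative_improvement` and the toy transcripts are hypothetical, not taken from the paper.

```python
def recovery_rate(hypotheses, masked_words):
    """Fraction of masked words that reappear in the system transcripts.

    hypotheses   -- list of transcript strings, one per utterance
    masked_words -- list of lists: the words masked in each utterance
    (Assumed metric definition; the paper may define recovery differently.)
    """
    recovered = 0
    total = 0
    for hyp, masked in zip(hypotheses, masked_words):
        hyp_tokens = set(hyp.lower().split())
        for word in masked:
            total += 1
            if word.lower() in hyp_tokens:
                recovered += 1
    return recovered / total if total else 0.0


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Toy data (hypothetical): one word was masked in the audio of each utterance.
masked = [["guitar"], ["onion"]]
unimodal = ["a man is playing the piano", "she cuts the onion"]
multimodal = ["a man is playing the guitar", "she cuts the onion"]

uni_rate = recovery_rate(unimodal, masked)      # audio-only system
multi_rate = recovery_rate(multimodal, masked)  # audio + image system
gain = relative_improvement(uni_rate, multi_rate)
```

Under this definition, a relative improvement of 35% means the multimodal recovery rate is 1.35 times the unimodal one, for example 0.54 against 0.40.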


Shujun Li, Andreas Karrenbauer, Dietmar Saupe, 2012

A general method for recovering missing DCT coefficients in DCT-transformed images is presented in this work. We model the DCT coefficients recovery problem as an optimization problem and recover all missing DCT coefficients via linear programming. The visual quality of the recovered image gradually decreases as the number of missing DCT coefficients increases. For some images, the quality is surprisingly good even when more than 10 most significant DCT coefficients are missing. When only the DC coefficient is missing, the proposed algorithm outperforms existing methods according to experimental results conducted on 200 test images. The proposed recovery method can be used for cryptanalysis of DCT based selective encryption schemes and other applications.

Multimedia, Cryptography and Security, Optimization and Control

Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics

Arun Kumar Singh, 2020

Digital technology has made previously unimaginable applications possible. It seems exciting to have a handful of tools for easy editing and manipulation, but it raises alarming concerns that can propagate as speech clones, duplicates, or maybe deepfakes. Validating the authenticity of speech is one of the primary problems of digital audio forensics. We propose an approach to distinguish human speech from AI-synthesized speech by exploiting bispectral and cepstral analysis. Higher-order statistics have less correlation for human speech than for synthesized speech. Also, cepstral analysis revealed a durable power component in human speech that is missing from synthesized speech. We integrate both these analyses and propose a machine learning model to detect AI-synthesized speech.

Machine Learning, Multimedia, Audio and Speech Processing

Listening to Sounds of Silence for Speech Denoising

Ruilin Xu, Rundi Wu, Yuko Ishiwaka, 2020

We introduce a deep learning model for speech denoising, a long-standing challenge in audio analysis arising in numerous applications. Our approach is based on a key observation about human speech: there is often a short pause between each sentence or word. In a recorded speech signal, those pauses introduce a series of time periods during which only noise is present. We leverage these incidental silent intervals to learn a model for automatic speech denoising given only mono-channel audio. Detected silent intervals over time expose not just pure noise but its time-varying features, allowing the model to learn noise dynamics and suppress it from the speech signal. Experiments on multiple datasets confirm the pivotal role of silent interval detection for speech denoising, and our method outperforms several state-of-the-art denoising methods, including those that accept only audio input (like ours) and those that denoise based on audiovisual input (and hence require more information). We also show that our method enjoys excellent generalization properties, such as denoising spoken languages not seen during training.
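The observation that pauses expose the noise on its own can be illustrated with a simple energy-threshold heuristic. This is only a sketch of the underlying idea, not the learned silent-interval detector described in the abstract; the function names, the non-overlapping framing, and the `ratio` threshold are all assumptions made for illustration.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def silent_frames(energies, ratio=0.1):
    """Indices of frames whose energy is below `ratio` times the peak energy.

    A crude stand-in for silent-interval detection (assumed heuristic).
    """
    threshold = ratio * max(energies)
    return [i for i, e in enumerate(energies) if e < threshold]


def noise_power(energies, silent):
    """Average energy over the silent frames: an estimate of the noise floor."""
    if not silent:
        return None
    return sum(energies[i] for i in silent) / len(silent)


# Toy mono signal (hypothetical): quiet noise, a loud "word", quiet noise again.
quiet = [0.01, -0.01] * 4
loud = [1.0, -1.0] * 4
signal = quiet + loud + quiet

energies = frame_energies(signal, frame_len=4)
silent = silent_frames(energies)
floor = noise_power(energies, silent)
```

Here the frames before and after the loud segment are flagged as silent, and their average energy gives a noise estimate that a denoiser could then use. A fixed threshold like this cannot follow time-varying noise, which is the gap the learned model in the abstract is meant to close.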

Sound, Machine Learning, Audio and Speech Processing

Attention-Based Keyword Localisation in Speech using Visual Grounding

Kayode Olaleye, Herman Kamper, 2021

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word backstroke for the query keyword swimming.

Computation and Language, Sound, Audio and Speech Processing

Speech Recognition with Augmented Synthesized Speech

Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, 2019

Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations, thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance with speech synthesis, using two corpora from different domains. We explore algorithms to provide the necessary acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain-transfer. We find that improvements to speech recognition performance are achievable by augmenting training data with synthesized material. However, there remains a substantial gap in performance between recognizers trained on human speech and those trained on synthesized speech.

Computation and Language, Sound, Audio and Speech Processing