A speaker extraction algorithm emulates the human ability of selective attention, extracting the target speaker's speech from a multi-talker mixture. It requires an auxiliary stimulus to form top-down attention towards the target speaker. The use of a reference speech as the auxiliary stimulus has been well studied. Visual cues also serve as an informative reference for human listening, and are particularly useful in the presence of acoustic noise and interfering speakers. We believe that the temporal synchronization between speech and its accompanying lip motion is a direct and dominant audio-visual cue. In this work, we aim to emulate the human ability of visual attention for speaker extraction based on speech-lip synchronization. We propose a self-supervised pre-training strategy to exploit speech-lip synchronization in a multi-talker scenario, and transfer the knowledge from the pre-trained model to a speaker extraction network. We show that the proposed speaker extraction network outperforms various competitive baselines in terms of signal quality and perceptual evaluation, achieving state-of-the-art performance.
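The self-supervised pre-training described above hinges on one signal: time-aligned audio and lip-motion frames should embed close together, while misaligned frames should not. A common way to realize this is an InfoNCE-style contrastive objective over frame pairs. The following is a minimal NumPy sketch of that idea, not the paper's actual training objective; the embedding networks, the function name `sync_contrastive_loss`, and the temperature value are illustrative assumptions.

```python
import numpy as np

def sync_contrastive_loss(audio_emb, visual_emb, tau=0.1):
    """InfoNCE-style synchronization loss (illustrative sketch).

    audio_emb, visual_emb: (T, D) arrays of per-frame embeddings.
    Each audio frame's positive is the time-aligned visual frame;
    all other visual frames in the clip act as negatives.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = a @ v.T / tau                       # (T, T) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # synchronized (positive) pairs lie on the diagonal
    return -np.mean(np.diag(log_prob))
```

A quick sanity check of the objective: perfectly aligned embedding sequences yield a lower loss than the same sequences with their temporal order scrambled, which is exactly the preference the pre-training is meant to instill before the encoder weights are transferred to the extraction network.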