Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on the accurate interpretation of short-term and long-term audio and visual information, as well as of the audio-visual interaction. Unlike prior work, where systems make decisions instantaneously from short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvements over state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.
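The pipeline outlined in this abstract (modality-specific temporal encoders, audio-visual cross-attention, and self-attention over the fused sequence) could be sketched roughly as below. This is a minimal PyTorch illustration under assumed feature dimensions and simple GRU encoders, not the authors' implementation; the official code lives at the repository linked above.

```python
# Minimal sketch of a TalkNet-style ASD model. Module names, dimensions,
# and the GRU temporal encoders are assumptions made for illustration only.
import torch
import torch.nn as nn


class TalkNetSketch(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, embed_dim=128, num_heads=8):
        super().__init__()
        # Temporal encoders: map per-frame audio / face features to a shared embedding.
        self.audio_encoder = nn.GRU(audio_dim, embed_dim, batch_first=True)
        self.visual_encoder = nn.GRU(visual_dim, embed_dim, batch_first=True)
        # Cross-attention: each modality queries the other to model audio-visual interaction.
        self.a2v_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Self-attention over the fused sequence to aggregate long-term speaking evidence.
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(2 * embed_dim, num_heads, batch_first=True),
            num_layers=1,
        )
        # Frame-level classifier: speaking vs. not speaking.
        self.classifier = nn.Linear(2 * embed_dim, 2)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        a, _ = self.audio_encoder(audio_feats)     # (B, T, embed_dim)
        v, _ = self.visual_encoder(visual_feats)   # (B, T, embed_dim)
        # Audio attends to video and video attends to audio.
        a_att, _ = self.a2v_attn(query=a, key=v, value=v)
        v_att, _ = self.v2a_attn(query=v, key=a, value=a)
        fused = torch.cat([a_att, v_att], dim=-1)  # (B, T, 2*embed_dim)
        fused = self.self_attn(fused)
        return self.classifier(fused)              # per-frame speaking logits


if __name__ == "__main__":
    model = TalkNetSketch()
    audio = torch.randn(2, 50, 128)   # e.g. 50 frames of audio features
    video = torch.randn(2, 50, 512)   # e.g. 50 frames of face-crop embeddings
    print(model(audio, video).shape)  # torch.Size([2, 50, 2])
```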
Automatic audio-visual expression recognition can play an important role in communication services such as telehealth, VoIP calls, and human-machine interaction. Accuracy of audio-visual expression recognition could benefit from the interplay between
Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, s
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information, which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived co
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled
We propose a method to address audio-visual target speaker enhancement in multi-talker environments using event-driven cameras. State-of-the-art audio-visual speech separation methods show that the crucial information is the movement of the facial landm