ﻻ يوجد ملخص باللغة العربية
Automatic audio-visual expression recognition can play an important role in communication services such as tele-health, VOIP calls and human-machine interaction. Accuracy of audio-visual expression recognition could benefit from the interplay between the two modalities. However, most audio-visual expression recognition systems, trained in ideal conditions, fail to generalize in real world scenarios where either the audio or visual modality could be missing due to a number of reasons such as limited bandwidth, interactors orientation, caller initiated muting. This paper studies the performance of a state-of-the art transformer when one of the modalities is missing. We conduct ablation studies to evaluate the model in the absence of either modality. Further, we propose a strategy to randomly ablate visual inputs during training at the clip or frame level to mimic real world scenarios. Results conducted on in-the-wild data, indicate significant generalization in proposed models trained on missing cues, with gains up to 17% for frame level ablations, showing that these training strategies cope better with the loss of input modalities.
Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, s
End-to-end acoustic speech recognition has quickly gained widespread popularity and shows promising results in many studies. Specifically the joint transformer/CTC model provides very good performance in many tasks. However, under noisy and distorted
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. The successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interact
Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived co
With the development of deep learning and artificial intelligence, audio synthesis has a pivotal role in the area of machine learning and shows strong applicability in the industry. Meanwhile, significant efforts have been dedicated by researchers to