ﻻ يوجد ملخص باللغة العربية
We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.
Image-to-video person re-identification identifies a target person by a probe image from quantities of pedestrian videos captured by non-overlapping cameras. Despite the great progress achieved,its still challenging to match in the multimodal scenari
Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains. For images and their captions, the multiplicity of the correspondences makes the task particul
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems. The key challenge is to align inter-modality representations according to semantic information present for a person and ignore background information. In
Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle. An important task under the food-computing umbrella is
Existing alignment-based methods have to employ the pretrained human parsing models to achieve the pixel-level alignment, and cannot identify the personal belongings (e.g., backpacks and reticule) which are crucial to person re-ID. In this paper, we