ﻻ يوجد ملخص باللغة العربية
Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires to re-identify persons locally within a set of consecutive clips. We introduce an auxiliary task of Fill-in the Identity, that aims to predict persons IDs consistently within a set of clips, when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation as well an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities. To be able to tackle both tasks, we augment the Large Scale Movie Description Challenge (LSMDC) benchmark with new annotations suited for our problem statement. Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works, and allows us to generate descriptions with locally re-identified people.
Multi-person tracking plays a critical role in the analysis of surveillance video. However, most existing work focus on shorter-term (e.g. minute-long or hour-long) video sequences. Therefore, we propose a multi-person tracking algorithm for very lon
This paper targets to explore the inter-subject variations eliminated facial expression representation in the compressed video domain. Most of the previous methods process the RGB images of a sequence, while the off-the-shelf and valuable expression-
How to extract effective expression representations that invariant to the identity-specific attributes is a long-lasting problem for facial expression recognition (FER). Most of the previous methods process the RGB images of a sequence, while we argu
Video-based person re-identification (Re-ID) is an important computer vision task. The batch-hard triplet loss frequently used in video-based person Re-ID suffers from the Distance Variance among Different Positives (DVDP) problem. In this paper, we
We consider the problem of space-time super-resolution (ST-SR): increasing spatial resolution of video frames and simultaneously interpolating frames to increase the frame rate. Modern approaches handle these axes one at a time. In contrast, our prop