ﻻ يوجد ملخص باللغة العربية
Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support). The majority of current approaches follow the metric learning paradigm, which learns to compare the similarity between videos. Recently, it has been observed that directly measuring this similarity is not ideal since different action instances may show distinctive temporal distribution, resulting in severe misalignment issues across query and support videos. In this paper, we arrest this problem from two distinct aspects -- action duration misalignment and motion evolution misalignment. We address them sequentially through a Two-stage Temporal Alignment Network (TTAN). The first stage performs temporal transformation with the predicted affine warp parameters, while the second stage utilizes a cross-attention mechanism to coordinate the features of the support and query to a consistent evolution. Besides, we devise a novel multi-shot fusion strategy, which takes the misalignment among support samples into consideration. Ablation studies and visualizations demonstrate the role played by both stages in addressing the misalignment. Extensive experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
Few-shot learning aims to recognize instances from novel classes with few labeled samples, which has great value in research and application. Although there has been a lot of work in this area recently, most of the existing work is based on image cla
This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection. Although we have witnessed great progress of action recognition in the past decade, it remains challenging yet interesting
Zero-shot action recognition can recognize samples of unseen classes that are unavailable in training by exploring common latent semantic representation in samples. However, most methods neglected the connotative relation and extensional relation bet
The goal of few-shot video classification is to learn a classification model with good generalization ability when trained with only a few labeled videos. However, it is difficult to learn discriminative feature representations for videos in such a s
Few-shot learning (FSL) for action recognition is a challenging task of recognizing novel action categories which are represented by few instances in the training data. In a more generalized FSL setting (G-FSL), both seen as well as novel action cate