This technical report presents our solution for the temporal action detection task in the ActivityNet Challenge 2021. The purpose of this task is to locate and identify actions of interest in long untrimmed videos. The crucial challenge is that the temporal duration of actions varies dramatically, and the target actions are typically embedded in a background of irrelevant activities. Our solution builds on BMN and consists of three steps: 1) action classification and feature encoding with SlowFast, CSN, and ViViT; 2) proposal generation, where we improve BMN by embedding our proposed Proposal Relation Network (PRN) to generate higher-quality proposals; 3) action detection, where we obtain detection results by assigning each proposal the corresponding video-level classification results. Finally, we ensemble the results obtained under different settings and achieve 44.7% average mAP on the test set, improving on the ActivityNet 2020 champion result by 1.9%.
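To make step 3 concrete, the sketch below shows one common way to assign class-agnostic proposals to video-level classification results via multiplicative score fusion. This is a minimal Python illustration under assumptions: the proposal scores (e.g., from BMN with PRN) and the fused class probabilities (e.g., from SlowFast/CSN/ViViT) are already computed, and the top-k fusion rule shown here is a standard choice, not necessarily the exact rule used in the report.

    def assign_detections(proposals, class_probs, top_k=2):
        """Pair class-agnostic proposals with video-level labels.

        proposals:   list of (t_start, t_end, proposal_score) from the
                     proposal generator.
        class_probs: dict mapping class name -> video-level probability.
        Returns a list of (t_start, t_end, class_name, detection_score).
        """
        # Keep only the top-k most probable video-level classes.
        top_classes = sorted(class_probs.items(),
                             key=lambda kv: kv[1], reverse=True)[:top_k]
        detections = []
        for t_start, t_end, p_score in proposals:
            for cls, c_prob in top_classes:
                # Multiplicative fusion of localization and classification
                # confidence; a common heuristic, assumed here.
                detections.append((t_start, t_end, cls, p_score * c_prob))
        return sorted(detections, key=lambda d: d[3], reverse=True)

    if __name__ == "__main__":
        # Hypothetical example values for illustration only.
        proposals = [(12.4, 31.0, 0.91), (8.0, 40.5, 0.55)]
        class_probs = {"LongJump": 0.83, "HighJump": 0.11, "PoleVault": 0.04}
        for det in assign_detections(proposals, class_probs):
            print(det)

Each of the resulting scored segments can then be post-processed (e.g., with non-maximum suppression) before ensembling results across settings.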