Historically, researchers in the field have devoted considerable effort to creating image representations that are scale invariant and retain spatial location information. This paper proposes to encode equivalent temporal characteristics in video representations for action recognition. To achieve temporal scale invariance, we develop a method called the temporal scale pyramid (TSP). To encode temporal information, we present and compare two methods, the temporal extension descriptor (TED) and the temporal division pyramid (TDP). Our purpose is to suggest solutions for matching complex actions that vary widely in velocity and appearance, a capability missing from most current action representations. The experimental results on four benchmark datasets (UCF50, HMDB51, Hollywood2 and Olympic Sports) support our approach, which significantly outperforms state-of-the-art methods. Most notably, we achieve 65.0% mean accuracy and 68.2% mean average precision on the challenging HMDB51 and Hollywood2 datasets, absolute improvements over the state of the art of 7.8% and 3.9%, respectively.
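To make the multi-scale idea concrete, here is a minimal sketch of temporal-scale sampling in the spirit of a TSP: the same clip is resampled at several frame rates, so a descriptor computed at some scale can match the same action performed faster or slower. The scale factors and the placeholder extract_features are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of multi-scale temporal sampling (TSP-style).
# The scales and the feature extractor are illustrative assumptions.
import numpy as np

def temporal_scale_pyramid(frames, scales=(1, 2, 4)):
    """Return one feature set per temporal scale.

    frames: array of shape (T, H, W, C)
    scales: temporal subsampling factors; scale s keeps every s-th frame,
            simulating the same action performed s times faster.
    """
    pyramid = []
    for s in scales:
        resampled = frames[::s]          # coarser temporal sampling
        pyramid.append(extract_features(resampled))
    return pyramid

def extract_features(frames):
    # Placeholder for any per-clip descriptor (e.g., pooled local features).
    return np.asarray(frames, dtype=np.float32).mean(axis=(0, 1, 2))
```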
Recognition of human actions and associated interactions with objects and the environment is an important problem in computer vision due to its potential applications in a variety of domains. The most versatile methods can generalize to various environments.
In this technical report, we describe our solution to temporal action proposal (task 1) in ActivityNet Challenge 2019. First, we fine-tune a ResNet-50-C3D CNN on ActivityNet v1.3, starting from a Kinetics-pretrained model, to extract snippet-level video representations.
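The snippet-level extraction step can be sketched as follows, assuming a callable `model` that maps a clip of shape (1, C, L, H, W) to a feature vector; the snippet length and stride here are illustrative choices, not the challenge entry's exact settings.

```python
# A minimal sketch of snippet-level feature extraction with a 3D CNN.
# `model` is assumed to return a (1, D) feature for a (1, C, L, H, W) clip.
import torch

@torch.no_grad()
def extract_snippet_features(model, video, snippet_len=16, stride=16):
    """video: tensor of shape (C, T, H, W); returns (N, D) snippet features."""
    C, T, H, W = video.shape
    feats = []
    for start in range(0, max(T - snippet_len + 1, 1), stride):
        clip = video[:, start:start + snippet_len]         # (C, L, H, W)
        feats.append(model(clip.unsqueeze(0)).squeeze(0))  # (D,)
    return torch.stack(feats)                              # one row per snippet
```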
Local features at neighboring spatial positions in feature maps are highly correlated, since their receptive fields often overlap. Self-attention usually computes a weighted sum (or another aggregation function) over the internal elements of each local feature.
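The "weighted sum over local features" can be made concrete with standard dot-product self-attention over the spatial positions of a feature map; the projections and scaling below follow the common formulation and are not necessarily this paper's variant.

```python
# A minimal sketch of dot-product self-attention over spatial positions.
import numpy as np

def spatial_self_attention(fmap, Wq, Wk, Wv):
    """fmap: (H, W, C) feature map; Wq/Wk/Wv: (C, C) projection matrices."""
    H, W, C = fmap.shape
    x = fmap.reshape(H * W, C)                 # one row per spatial position
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(C)              # pairwise similarity of positions
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over positions
    return (attn @ v).reshape(H, W, C)         # weighted sum of local features
```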
Weakly supervised temporal action localization, which aims to temporally locate action instances in untrimmed videos using only video-level class labels during training, is an important yet challenging problem in video analysis.
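A common way to train under video-level supervision is multiple-instance-learning pooling: per-snippet class scores are aggregated (here by a top-k mean) into a video-level prediction that a standard classification loss can supervise. The top-k choice below is an illustrative assumption, not a specific method from the work above.

```python
# A minimal sketch of top-k MIL pooling for weakly supervised localization.
import torch

def video_level_scores(snippet_scores, k=8):
    """snippet_scores: (T, num_classes) temporal class activation sequence."""
    k = min(k, snippet_scores.shape[0])
    topk, _ = snippet_scores.topk(k, dim=0)  # strongest k snippets per class
    return topk.mean(dim=0)                  # (num_classes,) video-level logits
```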
Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low-frequency action information. This attenuation introduces bias into the resulting features and generates ill-conditioned feature matrices.
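The highpass behavior is easy to verify numerically: a first-order temporal difference, a typical differential operator in action features, has a frequency response that vanishes at zero frequency, so slow, low-frequency motion is attenuated. The snippet below is purely illustrative and not the paper's analysis.

```python
# A minimal sketch showing the highpass response of a temporal difference.
import numpy as np

h = np.array([1.0, -1.0])                # first-order temporal difference
freqs = np.linspace(0.0, 0.5, 6)         # normalized frequency (cycles/frame)
H = np.exp(-2j * np.pi * np.outer(freqs, np.arange(len(h)))) @ h
for f, r in zip(freqs, np.abs(H)):
    print(f"f={f:.1f}  |H(f)|={r:.3f}")  # |H| -> 0 as f -> 0: lowpass content lost
```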