We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, due to GPU memory limits, it is not feasible to feed a whole video into CNNs/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem with this popular approach is that the local samples may not contain the information indicated by the global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features into global labels. We study a set of problems regarding this new type of local features, such as how to aggregate them into global features. Experimental results on the HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling over the sparsely sampled features leads to a significant performance improvement.
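A minimal sketch of the aggregate-then-classify pipeline this abstract describes, written in PyTorch. All shapes and names (local_features, num_classes, etc.) are illustrative assumptions, not the authors' actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical setup: a video is represented by N sparsely sampled local
# features (e.g., CNN activations of sampled frames), each D-dimensional.
N, D, num_classes = 25, 2048, 51   # 51 classes, as in HMDB51

local_features = torch.randn(N, D)  # stand-in for pre-extracted CNN features

# Aggregate the local features into a single global video feature
# via element-wise maximum pooling over the sampled frames.
global_feature = local_features.max(dim=0).values   # shape: (D,)

# A second mapping function (here a plain linear classifier) is then
# trained on the global features with the video-level labels.
classifier = nn.Linear(D, num_classes)
logits = classifier(global_feature)                  # shape: (num_classes,)
```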
Video action recognition, an active topic in computer vision and video analysis, aims to assign a short video clip to a pre-defined category such as brushing hair or climbing stairs. Recent works focus on action recognition with deep neural networks.
Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in irrelevant data.
Temporal modeling in videos is a fundamental yet challenging problem in computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model to capture the temporal pairwise feature interactions between adjacent frames.
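A minimal sketch, under assumed shapes, of what a bilinear interaction between adjacent frame features could look like in PyTorch; this illustrates the general pairwise-bilinear idea rather than the paper's exact TB formulation:

```python
import torch
import torch.nn as nn

T, D, K = 8, 256, 128   # number of frames, feature dim, bilinear output dim

frames = torch.randn(T, D)   # stand-in per-frame features

# nn.Bilinear computes y_k = x1^T W_k x2 + b_k for each output channel k;
# here it is applied to every pair of temporally adjacent frames.
bilinear = nn.Bilinear(D, D, K)
pairwise = bilinear(frames[:-1], frames[1:])   # shape: (T-1, K)

# Pool the pairwise interactions over time into a clip-level descriptor.
clip_descriptor = pairwise.mean(dim=0)          # shape: (K,)
```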
Human action recognition is an active research area in computer vision. Although great progress has been made, previous methods mostly recognize actions based on depth data at only one scale, and thus they often neglect multi-scale features.
Inspired by the observation that humans are able to process videos efficiently by paying attention only to where and when it is needed, we propose an interpretable and easy plug-in spatial-temporal attention mechanism for video action recognition.
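A minimal sketch of a generic spatial-temporal attention mechanism of this kind in PyTorch. The scorer modules, shapes, and pooling choices are illustrative assumptions, not the paper's specific architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, C, H, W = 8, 512, 7, 7
feat = torch.randn(T, C, H, W)   # stand-in per-frame convolutional feature maps

# Spatial attention: one score per spatial location in each frame,
# normalized with a softmax over the H*W locations.
spatial_scorer = nn.Conv2d(C, 1, kernel_size=1)
s = spatial_scorer(feat).view(T, -1)                # (T, H*W)
s = F.softmax(s, dim=1).view(T, 1, H, W)
frame_feat = (feat * s).sum(dim=(2, 3))             # (T, C) attended frame features

# Temporal attention: one weight per frame, softmax-normalized over time.
temporal_scorer = nn.Linear(C, 1)
t = F.softmax(temporal_scorer(frame_feat), dim=0)   # (T, 1)
video_feat = (frame_feat * t).sum(dim=0)            # (C,) video-level descriptor
```

The two softmax maps (s and t) can be visualized directly, which is what makes this style of attention interpretable: they show where in each frame and when in the clip the model is looking.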