Learning the spatial-temporal representation of motion information is crucial to human action recognition. Nevertheless, most existing features and descriptors cannot capture motion information effectively, especially for long-term motion. To address this problem, this paper proposes a long-term motion descriptor called the sequential Deep Trajectory Descriptor (sDTD). Specifically, we project dense trajectories onto two-dimensional planes, and a CNN-RNN network is then employed to learn an effective representation of long-term motion. Unlike the popular two-stream ConvNets, the sDTD stream is introduced into a three-stream framework to recognize actions in a video sequence. Consequently, this three-stream framework can simultaneously capture static spatial features, short-term motion, and long-term motion in the video. Extensive experiments were conducted on three challenging datasets: KTH, HMDB51, and UCF101. The results show that our method achieves state-of-the-art performance on the KTH and UCF101 datasets, and is comparable to the state-of-the-art methods on the HMDB51 dataset.
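A minimal sketch of this three-stream idea, written in PyTorch purely for illustration: a small CNN-RNN scores the projected trajectory maps for long-term motion, and its class scores are fused with the spatial and short-term motion streams. All layer sizes, names, and fusion weights below are assumptions, not the authors' exact sDTD architecture.

# Illustrative sketch only; not the published sDTD network.
import torch
import torch.nn as nn

class TrajectoryCNNRNN(nn.Module):
    """CNN features per trajectory-map frame, then an RNN over time (long-term motion)."""
    def __init__(self, num_classes=101, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                      # x: (B, T, 1, H, W) trajectory maps
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)
        return self.fc(h[-1])

def fuse_three_streams(p_spatial, p_flow, p_sdtd, w=(1.0, 1.5, 1.5)):
    """Late fusion of per-stream class scores; the weights are an assumption."""
    return w[0] * p_spatial + w[1] * p_flow + w[2] * p_sdtd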
In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out in-depth
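As a toy illustration of the 2D versus 3D convolution distinction mentioned above (PyTorch assumed here, shapes chosen arbitrarily): a 2D kernel slides over each frame independently, while a 3D kernel also spans the temporal axis of the clip.

# Toy example of 2D vs. 3D convolution over a video clip.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)                     # (batch, channels, frames, H, W)
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)        # spatial only
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)        # spatial + temporal

per_frame = conv2d(clip.squeeze(0).permute(1, 0, 2, 3))    # (16, 64, 112, 112)
spatiotemporal = conv3d(clip)                              # (1, 64, 16, 112, 112)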
Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such
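A minimal sketch of the two-stage pipeline referred to above, assuming OpenCV's Farneback estimator as the traditional flow method: optical flow is pre-computed frame by frame and stacked into the input tensor of a temporal-stream CNN.

# Sketch of pre-computing and stacking optical flow; Farneback is an assumed choice.
import cv2
import numpy as np

def stacked_flow(frames, length=10):
    """Stack horizontal/vertical flow from length+1 grayscale frames into one tensor."""
    chans = []
    for prev, nxt in zip(frames[:length], frames[1:length + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        chans.extend([flow[..., 0], flow[..., 1]])
    return np.stack(chans, axis=0)     # (2*length, H, W), fed to the temporal-stream CNN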
We introduce the first benchmark for a new problem: recognizing human action adverbs (HAA), named Adverbs Describing Human Actions (ADHA). This is a first step for computer vision to move from pattern recognition toward real AI. We demonstrate some
Action recognition is an important research topic in computer vision. It is fundamental work for visual understanding and has been applied in many fields. Since human actions can vary across different environments, it is difficult to infer actions in compl
Two-stream networks have achieved great success in video recognition. A two-stream network combines a spatial stream of RGB frames and a temporal stream of optical flow to make predictions. However, the temporal redundancy of RGB frames as well as the
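A minimal sketch of two-stream late fusion as described above (PyTorch assumed; the model names are placeholders): a spatial ConvNet scores an RGB frame, a temporal ConvNet scores stacked optical flow, and the softmax scores are averaged.

# Illustrative two-stream inference with score averaging; nets are placeholders.
import torch

@torch.no_grad()
def two_stream_predict(spatial_net, temporal_net, rgb, flow_stack):
    # rgb: (B, 3, H, W); flow_stack: (B, 2*L, H, W) from L consecutive flow fields
    p_rgb = spatial_net(rgb).softmax(dim=1)
    p_flow = temporal_net(flow_stack).softmax(dim=1)
    return (p_rgb + p_flow) / 2        # averaged class probabilities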