ترغب بنشر مسار تعليمي؟ اضغط هنا

EAN: Event Adaptive Network for Enhanced Action Recognition

340   0   0.0 ( 0 )
 نشر من قبل Yuan Tian
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Efficiently modeling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and the dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions are with fixed scales, thus struggling with events of various scales. On the other hand, the dense interaction modeling paradigm only achieves sub-optimal performance as action-irrelevant parts bring additional noises for the final prediction. In this paper, we propose a unified action recognition framework to investigate the dynamic nature of video content by introducing the following designs. First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer, which yields a sparse paradigm. We call the proposed framework as Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we propose a novel and efficient Latent Motion Code (LMC) module, further improving the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-to-Something V1&V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performances at low FLOPs. Codes are available at: https://github.com/tianyuan168326/EAN-Pytorch.



قيم البحث

اقرأ أيضاً

Temporal modelling is the key for efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computa tion leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AdaFuse/
By extracting spatial and temporal characteristics in one network, the two-stream ConvNets can achieve the state-of-the-art performance in action recognition. However, such a framework typically suffers from the separately processing of spatial and t emporal information between the two standalone streams and is hard to capture long-term temporal dependence of an action. More importantly, it is incapable of finding the salient portions of an action, say, the frames that are the most discriminative to identify the action. To address these problems, a textbf{j}oint textbf{n}etwork based textbf{a}ttention (JNA) is proposed in this study. We find that the fully-connected fusion, branch selection and spatial attention mechanism are totally infeasible for action recognition. Thus in our joint network, the spatial and temporal branches share some information during the training stage. We also introduce an attention mechanism on the temporal domain to capture the long-term dependence meanwhile finding the salient portions. Extensive experiments are conducted on two benchmark datasets, UCF101 and HMDB51. Experimental results show that our method can improve the action recognition performance significantly and achieves the state-of-the-art results on both datasets.
How to model fine-grained spatial-temporal dynamics in videos has been a challenging problem for action recognition. It requires learning deep and rich features with superior distinctiveness for the subtle and abstract motions. Most existing methods generate features of a layer in a pure feedforward manner, where the information moves in one direction from inputs to outputs. And they rely on stacking more layers to obtain more powerful features, bringing extra non-negligible overheads. In this paper, we propose an Adaptive Recursive Circle (ARC) framework, a fine-grained decorator for pure feedforward layers. It inherits the operators and parameters of the original layer but is slightly different in the use of those operators and parameters. Specifically, the input of the layer is treated as an evolving state, and its update is alternated with the feature generation. At each recursive step, the input state is enriched by the previously generated features and the feature generation is made with the newly updated input state. We hope the ARC framework can facilitate fine-grained action recognition by introducing deeply refined features and multi-scale receptive fields at a low cost. Significant improvements over feedforward baselines are observed on several benchmarks. For example, an ARC-equipped TSM-ResNet18 outperforms TSM-ResNet50 with 48% fewer FLOPs and 52% model parameters on Something-Something V1 and Diving48.
Action recognition is an open and challenging problem in computer vision. While current state-of-the-art models offer excellent recognition results, their computational expense limits their impact for many real-world applications. In this paper, we p ropose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on several challenging action recognition benchmark datasets well demonstrate the efficacy of our proposed approach over state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AR-Net
Facial expression recognition is a challenging task, arguably because of large intra-class variations and high inter-class similarities. The core drawback of the existing approaches is the lack of ability to discriminate the changes in appearance cau sed by emotions and identities. In this paper, we present a novel identity-enhanced network (IDEnNet) to eliminate the negative impact of identity factor and focus on recognizing facial expressions. Spatial fusion combined with self-constrained multi-task learning are adopted to jointly learn the expression representations and identity-related information. We evaluate our approach on three popular datasets, namely Oulu-CASIA, CK+ and MMI. IDEnNet improves the baseline consistently, and achieves the best or comparable state-of-the-art on all three datasets.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا