VideoLightFormer: Lightweight Action Recognition using Transformers

80 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Raivo Koot

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Raivo Koot - Haiping Lu

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Efficient video action recognition remains a challenging problem. One large model after another takes the place of the state-of-the-art on the Kinetics dataset, but real-world efficiency evaluations are often lacking. In this work, we fill this gap and investigate the use of transformers for efficient action recognition. We propose a novel, lightweight action recognition architecture, VideoLightFormer. In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model. Existing methods often resort to one of the two extremes, where they either apply huge transformers to video features, or minimal transformers on highly pooled video features. Our method differs from them by keeping the transformer models small, but leveraging full spatiotemporal feature structure. We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it achieves a better mix of efficiency and accuracy than existing state-of-the-art models, apart from the Temporal Shift Module on SSV2.

قيم البحث

252 - Evgeny Izutov 2021

Growing amount of different practical tasks in a video understanding problem has addressed the great challenge aiming to design an universal solution, which should be available for broad masses and suitable for the demanding edge-oriented inference. In this paper we are focused on designing a network architecture and a training pipeline to tackle the mentioned challenges. Our architecture takes the best from the previous ones and brings the ability to be successful not only in appearance-based action recognition tasks but in motion-based problems too. Furthermore, the induced label noise problem is formulated and Adaptive Clip Selection (ACS) framework is proposed to deal with it. Together it makes the LIGAR framework the general-purpose action recognition solution. We also have reported the extensive analysis on the general and gesture datasets to show the excellent trade-off between the performance and the accuracy in comparison to the state-of-the-art solutions. Training code is available at: https://github.com/openvinotoolkit/training_extensions. For the efficient edge-oriented inference all trained models can be exported into the OpenVINO format.

الرؤية الحاسوبية وتمييز الأنماط

Lightweight Network Architecture for Real-Time Action Recognition

64 - Alexander Kozlov , Vadim Andronov , Yana Gritsenko 2019

In this work we present a new efficient approach to Human Action Recognition called Video Transformer Network (VTN). It leverages the latest advances in Computer Vision and Natural Language Processing and applies them to video understanding. The prop osed method allows us to create lightweight CNN models that achieve high accuracy and real-time speed using just an RGB mono camera and general purpose CPU. Furthermore, we explain how to improve accuracy by distilling from multiple models with different modalities into a single model. We conduct a comparison with state-of-the-art methods and show that our approach performs on par with most of them on famous Action Recognition datasets. We benchmark the inference time of the models using the modern inference framework and argue that our approach compares favorably with other methods in terms of speed/accuracy trade-off, running at 56 FPS on CPU. The models and the training code are available.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي التعلم الآلي

Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition

111 - Ziyuan Huang , Zhiwu Qing , Xiang Wang 2021

With the recent surge in the research of vision transformers, they have demonstrated remarkable potential for various challenging computer vision applications, such as image recognition, point cloud classification as well as video understanding. In t his paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset. Specifically, we explore training techniques for video vision transformers, such as augmentations, resolutions as well as initialization, etc. With our training recipe, a single ViViT model achieves the performance of 47.4% on the validation set of EPIC-KITCHENS-100 dataset, outperforming what is reported in the original paper by 3.4%. We found that video transformers are especially good at predicting the noun in the verb-noun action prediction task. This makes the overall action prediction accuracy of video transformers notably higher than convolutional ones. Surprisingly, even the best video transformers underperform the convolutional networks on the verb prediction. Therefore, we combine the video vision transformers and some of the convolutional video networks and present our solution to the EPIC-KITCHENS-100 Action Recognition competition.

الرؤية الحاسوبية وتمييز الأنماط

AutoVideo: An Automated Video Action Recognition System

106 - Daochen Zha , Zaid Pervaiz Bhat , Yi-Wei Chen 2021

Action recognition is a crucial task for video understanding. In this paper, we present AutoVideo, a Python system for automated video action recognition. It currently supports seven action recognition algorithms and various pre-processing modules. U nlike the existing libraries that only provide model zoos, AutoVideo is built with the standard pipeline language. The basic building block is primitive, which wraps a pre-processing module or an algorithm with some hyperparameters. AutoVideo is highly modular and extendable. It can be easily combined with AutoML searchers. The pipeline language is quite general so that we can easily enrich AutoVideo with algorithms for various other video-related tasks in the future. AutoVideo is released under MIT license at https://github.com/datamllab/autovideo

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي معالجة الصور والفيديو

Action Units Recognition Using Improved Pairwise Deep Architecture

60 - Junya Saito , Xiaoyu Mi , Akiyoshi Uchida 2021

Facial Action Units (AUs) represent a set of facial muscular activities and various combinations of AUs can represent a wide range of emotions. AU recognition is often used in many applications, including marketing, healthcare, education, and so fort h. Although a lot of studies have developed various methods to improve recognition accuracy, it still remains a major challenge for AU recognition. In the Affective Behavior Analysis in-the-wild (ABAW) 2020 competition, we proposed a new automatic Action Units (AUs) recognition method using a pairwise deep architecture to derive the Pseudo-Intensities of each AU and then convert them into predicted intensities. This year, we introduced a new technique to last years framework to further reduce AU recognition errors due to temporary face occlusion such as hands on face or large face orientation. We obtained a score of 0.65 in the validation data set for this years competition.

الرؤية الحاسوبية وتمييز الأنماط