Action Units Recognition by Pairwise Deep Architecture

155 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Junya Saito

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Junya Saito - Ryosuke Kawamura - Akiyoshi Uchida

الرؤية الحاسوبية وتمييز الأنماط

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper, we propose a new automatic Action Units (AUs) recognition method used in a competition, Affective Behavior Analysis in-the-wild (ABAW). Our method tackles a problem of AUs label inconsistency among subjects by using pairwise deep architecture. While the baseline score is 0.31, our method achieved 0.67 in validation dataset of the competition.

قيم البحث

60 - Junya Saito , Xiaoyu Mi , Akiyoshi Uchida 2021

Facial Action Units (AUs) represent a set of facial muscular activities and various combinations of AUs can represent a wide range of emotions. AU recognition is often used in many applications, including marketing, healthcare, education, and so fort h. Although a lot of studies have developed various methods to improve recognition accuracy, it still remains a major challenge for AU recognition. In the Affective Behavior Analysis in-the-wild (ABAW) 2020 competition, we proposed a new automatic Action Units (AUs) recognition method using a pairwise deep architecture to derive the Pseudo-Intensities of each AU and then convert them into predicted intensities. This year, we introduced a new technique to last years framework to further reduce AU recognition errors due to temporary face occlusion such as hands on face or large face orientation. We obtained a score of 0.65 in the validation data set for this years competition.

الرؤية الحاسوبية وتمييز الأنماط

Lightweight Network Architecture for Real-Time Action Recognition

64 - Alexander Kozlov , Vadim Andronov , Yana Gritsenko 2019

In this work we present a new efficient approach to Human Action Recognition called Video Transformer Network (VTN). It leverages the latest advances in Computer Vision and Natural Language Processing and applies them to video understanding. The prop osed method allows us to create lightweight CNN models that achieve high accuracy and real-time speed using just an RGB mono camera and general purpose CPU. Furthermore, we explain how to improve accuracy by distilling from multiple models with different modalities into a single model. We conduct a comparison with state-of-the-art methods and show that our approach performs on par with most of them on famous Action Recognition datasets. We benchmark the inference time of the models using the modern inference framework and argue that our approach compares favorably with other methods in terms of speed/accuracy trade-off, running at 56 FPS on CPU. The models and the training code are available.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي التعلم الآلي

Convolutional Architecture Exploration for Action Recognition and Image Classification

61 - J.T. Turner , David Aha , Leslie Smith 2015

Convolutional Architecture for Fast Feature Encoding (CAFFE) [11] is a software package for the training, classifying, and feature extraction of images. The UCF Sports Action dataset is a widely used machine learning dataset that has 200 videos taken in 720x480 resolution of 9 different sporting activities: diving, golf, swinging, kicking, lifting, horseback riding, running, skateboarding, swinging (various gymnastics), and walking. In this report we report on a caffe feature extraction pipeline of images taken from the videos of the UCF Sports Action dataset. A similar test was performed on overfeat, and results were inferior to caffe. This study is intended to explore the architecture and hyper parameters needed for effective static analysis of action in videos and classification over a variety of image datasets.

الرؤية الحاسوبية وتمييز الأنماط

Deep Local Video Feature for Action Recognition

83 - Zhenzhong Lan , Yi Zhu , Alexander G. Hauptmann 2017

We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we have not been able to feed a whole video into CNN/RNNs for end-to-end learning. A common practice is to u se sampled frames as inputs and video labels as supervision. One major problem of this popular approach is that the local samples may not contain the information indicated by global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features into global labels. We study a set of problems regarding this new type of local features such as how to aggregate them into global features. Experimental results on HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling on the sparsely sampled features lead to significant performance improvement.

الرؤية الحاسوبية وتمييز الأنماط

NAS-TC: Neural Architecture Search on Temporal Convolutions for Complex Action Recognition

114 - Pengzhen Ren , Gang Xiao , Xiaojun Chang 2021

In the field of complex action recognition in videos, the quality of the designed model plays a crucial role in the final performance. However, artificially designed network structures often rely heavily on the researchers knowledge and experience. A ccordingly, because of the automated design of its network structure, Neural architecture search (NAS) has achieved great success in the image processing field and attracted substantial research attention in recent years. Although some NAS methods have reduced the number of GPU search days required to single digits in the image field, directly using 3D convolution to extend NAS to the video field is still likely to produce a surge in computing volume. To address this challenge, we propose a new processing framework called Neural Architecture Search- Temporal Convolutional (NAS-TC). Our proposed framework is divided into two phases. In the first phase, the classical CNN network is used as the backbone network to complete the computationally intensive feature extraction task. In the second stage, a simple stitching search to the cell is used to complete the relatively lightweight long-range temporal-dependent information extraction. This ensures our method will have more reasonable parameter assignments and can handle minute-level videos. Finally, we conduct sufficient experiments on multiple benchmark datasets and obtain competitive recognition accuracy.

الرؤية الحاسوبية وتمييز الأنماط

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

المعهد العالي للعلوم التطبيقية والتكنولوجيا

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Action Units Recognition by Pairwise Deep Architecture

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً