Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors

161 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zeeshan Ahmad

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zeeshan Ahmad - Naimul Khan

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

One of the major reasons for misclassification of multiplex actions during action recognition is the unavailability of complementary features that provide the semantic information about the actions. In different domains these features are present with different scales and intensities. In existing literature, features are extracted independently in different domains, but the benefits from fusing these multidomain features are not realized. To address this challenge and to extract complete set of complementary information, in this paper, we propose a novel multidomain multimodal fusion framework that extracts complementary and distinct features from different domains of the input modality. We transform input inertial data into signal images, and then make the input modality multidomain and multimodal by transforming spatial domain information into frequency and time-spectrum domain using Discrete Fourier Transform (DFT) and Gabor wavelet transform (GWT) respectively. Features in different domains are extracted by Convolutional Neural networks (CNNs) and then fused by Canonical Correlation based Fusion (CCF) for improving the accuracy of human action recognition. Experimental results on three inertial datasets show the superiority of the proposed method when compared to the state-of-the-art.

قيم البحث

128 - Zeeshan Ahmad , Naimul khan 2020

Convolutional Neural Network (CNN) provides leverage to extract and fuse features from all layers of its architecture. However, extracting and fusing intermediate features from different layers of CNN structure is still uninvestigated for Human Actio n Recognition (HAR) using depth and inertial sensors. To get maximum benefit of accessing all the CNNs layers, in this paper, we propose novel Multistage Gated Average Fusion (MGAF) network which extracts and fuses features from all layers of CNN using our novel and computationally efficient Gated Average Fusion (GAF) network, a decisive integral element of MGAF. At the input of the proposed MGAF, we transform the depth and inertial sensor data into depth images called sequential front view images (SFI) and signal images (SI) respectively. These SFI are formed from the front view information generated by depth data. CNN is employed to extract feature maps from both input modalities. GAF network fuses the extracted features effectively while preserving the dimensionality of fused feature as well. The proposed MGAF network has structural extensibility and can be unfolded to more than two modalities. Experiments on three publicly available multimodal HAR datasets demonstrate that the proposed MGAF outperforms the previous state of the art fusion methods for depth-inertial HAR in terms of recognition accuracy while being computationally much more efficient. We increase the accuracy by an average of 1.5 percent while reducing the computational cost by approximately 50 percent over the previous state of the art.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي الوسائط المتعددة

Vision and Inertial Sensing Fusion for Human Action Recognition : A Review

140 - Sharmin Majumder , Nasser Kehtarnavaz 2020

Human action recognition is used in many applications such as video surveillance, human computer interaction, assistive living, and gaming. Many papers have appeared in the literature showing that the fusion of vision and inertial sensing improves re cognition accuracies compared to the situations when each sensing modality is used individually. This paper provides a survey of the papers in which vision and inertial sensing are used simultaneously within a fusion framework in order to perform human action recognition. The surveyed papers are categorized in terms of fusion approaches, features, classifiers, as well as multimodality datasets considered. Challenges as well as possible future directions are also stated for deploying the fusion of these two sensing modalities under realistic conditions.

تفاعل الإنسان والحاسوب الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي

Inertial Sensor Data To Image Encoding For Human Action Recognition

226 - Zeeshan Ahmad , Naimul Khan 2021

Convolutional Neural Networks (CNNs) are successful deep learning models in the field of computer vision. To get the maximum advantage of CNN model for Human Action Recognition (HAR) using inertial sensor data, in this paper, we use 4 types of spatia l domain methods for transforming inertial sensor data to activity images, which are then utilized in a novel fusion framework. These four types of activity images are Signal Images (SI), Gramian Angular Field (GAF) Images, Markov Transition Field (MTF) Images and Recurrence Plot (RP) Images. Furthermore, for creating a multimodal fusion framework and to exploit activity image, we made each type of activity images multimodal by convolving with two spatial domain filters : Prewitt filter and High-boost filter. Resnet-18, a CNN model, is used to learn deep features from multi-modalities. Learned features are extracted from the last pooling layer of each ReNet and then fused by canonical correlation based fusion (CCF) for improving the accuracy of human action recognition. These highly informative features are served as input to a multiclass Support Vector Machine (SVM). Experimental results on three publicly available inertial datasets show the superiority of the proposed method over the current state-of-the-art.

الرؤية الحاسوبية وتمييز الأنماط تفاعل الإنسان والحاسوب التعلم الآلي

Learning multimodal representations for sample-efficient recognition of human actions

83 - Miguel Vasco , Francisco S. Melo , David Martins de Matos 2019

Humans interact in rich and diverse ways with the environment. However, the representation of such behavior by artificial agents is often limited. In this work we present textit{motion concepts}, a novel multimodal representation of human actions in a household environment. A motion concept encompasses a probabilistic description of the kinematics of the action along with its contextual background, namely the location and the objects held during the performance. Furthermore, we present Online Motion Concept Learning (OMCL), a new algorithm which learns novel motion concepts from action demonstrations and recognizes previously learned motion concepts. The algorithm is evaluated on a virtual-reality household environment with the presence of a human avatar. OMCL outperforms standard motion recognition algorithms on an one-shot recognition task, attesting to its potential for sample-efficient recognition of human actions.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي التعلم الآلي

Human Activity Recognition using Inertial, Physiological and Environmental Sensors: a Comprehensive Survey

101 - Florenc Demrozi , Graziano Pravadelli , Azra Bihorac 2020

In the last decade, Human Activity Recognition (HAR) has become a vibrant research area, especially due to the spread of electronic devices such as smartphones, smartwatches and video cameras present in our daily lives. In addition, the advance of de ep learning and other machine learning algorithms has allowed researchers to use HAR in various domains including sports, health and well-being applications. For example, HAR is considered as one of the most promising assistive technology tools to support elderlys daily life by monitoring their cognitive and physical function through daily activities. This survey focuses on critical role of machine learning in developing HAR applications based on inertial sensors in conjunction with physiological and environmental sensors.

معالجة الإشارات تفاعل الإنسان والحاسوب التعلم الآلي