Capsule Attention for Multimodal EEG-EOG Representation Learning with Application to Driver Vigilance Estimation

92 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Guangyi Zhang

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Guangyi Zhang - Ali Etemad

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Driver vigilance estimation is an important task for transportation safety. Wearable and portable brain-computer interface devices provide a powerful means for real-time monitoring of the vigilance level of drivers to help with avoiding distracted or impaired driving. In this paper, we propose a novel multimodal architecture for in-vehicle vigilance estimation from Electroencephalogram and Electrooculogram. To enable the system to focus on the most salient parts of the learned multimodal representations, we propose an architecture composed of a capsule attention mechanism following a deep Long Short-Term Memory (LSTM) network. Our model learns hierarchical dependencies in the data through the LSTM and capsule feature representation layers. To better explore the discriminative ability of the learned representations, we study the effect of the proposed capsule attention mechanism including the number of dynamic routing iterations as well as other parameters. Experiments show the robustness of our method by outperforming other solutions and baseline techniques, setting a new state-of-the-art. We then provide an analysis on different frequency bands and brain regions to evaluate their suitability for driver vigilance estimation. Lastly, an analysis on the role of capsule attention, multimodality, and robustness to noise is performed, highlighting the advantages of our approach.

قيم البحث

93 - Ahmed Imtiaz Humayun , Asif Shahriyar Sushmit , Taufiq Hasan andn Mohammed Imamul Hassan Bhuiyan 2019

Humans approximately spend a third of their life sleeping, which makes monitoring sleep an integral part of well-being. In this paper, a 34-layer deep residual ConvNet architecture for end-to-end sleep staging is proposed. The network takes raw singl e channel electroencephalogram (Fpz-Cz) signal as input and yields hypnogram annotations for each 30s segments as output. Experiments are carried out for two different scoring standards (5 and 6 stage classification) on the expanded PhysioNet Sleep-EDF dataset, which contains multi-source data from hospital and household polysomnography setups. The performance of the proposed network is compared with that of the state-of-the-art algorithms in patient independent validation tasks. The experimental results demonstrate the superiority of the proposed network compared to the best existing method, providing a relative improvement in epoch-wise average accuracy of 6.8% and 6.3% on the household data and multi-source data, respectively. Codes are made publicly available on Github.

التعلم الآلي الرؤية الحاسوبية وتمييز الأنماط معالجة الإشارات

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

243 - Paul Pu Liang , Yiwei Lyu , Xiang Fan 2021

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human- computer interaction, and healthcare. Unfortunately, multimodal research has seen limited resources to study (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiBench, a systematic and unified large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning. Simply applying methods proposed in different research areas can improve the state-of-the-art performance on 9/15 datasets. Therefore, MultiBench presents a milestone in unifying disjoint efforts in multimodal research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MultiBench, our standardized code, and leaderboards are publicly available, will be regularly updated, and welcomes inputs from the community.

التعلم الآلي الذكاء الاصطناعي الحساب واللغة

Maximum Likelihood Estimation for Multimodal Learning with Missing Modality

91 - Fei Ma , Xiangxiang Xu , Shao-Lun Huang 2021

Multimodal learning has achieved great successes in many scenarios. Compared with unimodal learning, it can effectively combine the information from different modalities to improve the performance of learning tasks. In reality, the multimodal data ma y have missing modalities due to various reasons, such as sensor failure and data transmission error. In previous works, the information of the modality-missing data has not been well exploited. To address this problem, we propose an efficient approach based on maximum likelihood estimation to incorporate the knowledge in the modality-missing data. Specifically, we design a likelihood function to characterize the conditional distribution of the modality-complete data and the modality-missing data, which is theoretically optimal. Moreover, we develop a generalized form of the softmax function to effectively implement maximum likelihood estimation in an end-to-end manner. Such training strategy guarantees the computability of our algorithm capably. Finally, we conduct a series of experiments on real-world multimodal datasets. Our results demonstrate the effectiveness of the proposed approach, even when 95% of the training data has missing modality.

التعلم الآلي الذكاء الاصطناعي

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

97 - Po-Yao Huang , Xiaojun Chang , Alexander Hauptmann 2019

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, o ur model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.

الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Bootstrapping Disjoint Datasets for Multilingual Multimodal Representation Learning

78 - Akos Kadar , Grzegorz Chrupa{l}a , Afra Alishahi 2019

Recent work has highlighted the advantage of jointly learning grounded sentence representations from multiple languages. However, the data used in these studies has been limited to an aligned scenario: the same images annotated with sentences in mult iple languages. We focus on the more realistic disjoint scenario in which there is no overlap between the images in multilingual image--caption datasets. We confirm that training with aligned data results in better grounded sentence representations than training with disjoint data, as measured by image--sentence retrieval performance. In order to close this gap in performance, we propose a pseudopairing method to generate synthetically aligned English--German--image triplets from the disjoint sets. The method works by first training a model on the disjoint data, and then creating new triples across datasets using sentence similarity under the learned model. Experiments show that pseudopairs improve image--sentence retrieval performance compared to disjoint training, despite requiring no external data or models. However, we do find that using an external machine translation model to generate the synthetic data sets results in better performance.

الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط