Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction time, which is inherent in human evaluations. To address this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that simultaneously aligns and predicts labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
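To make the mechanism concrete, below is a minimal PyTorch sketch of a delayed sinc layer. It is not the authors' implementation: the class name, cutoff frequency, tap count, and initialization are illustrative assumptions. It shows the core idea the abstract describes: the delay tau enters a windowed sinc kernel differentiably, so gradient descent can tune the alignment between speech and labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DelayedSinc(nn.Module):
    """Sketch of a delayed sinc layer: a windowed low-pass (sinc) filter
    whose time shift `tau` (in samples) is a learnable parameter, so the
    annotation delay can be optimized by ordinary gradient descent."""

    def __init__(self, cutoff=0.05, num_taps=129, init_delay=0.0):
        super().__init__()
        assert num_taps % 2 == 1, "odd tap count keeps the filter centered"
        self.cutoff = cutoff  # normalized cutoff (cycles/sample); assumed value
        self.tau = nn.Parameter(torch.tensor(float(init_delay)))  # learnable delay
        # symmetric tap positions, e.g. [-64, ..., 64] for 129 taps
        t = torch.arange(num_taps, dtype=torch.float32) - (num_taps - 1) / 2
        self.register_buffer("t", t)
        # Hamming window to suppress truncation ripple
        self.register_buffer("window", torch.hamming_window(num_taps, periodic=False))

    def forward(self, x):
        # x: (batch, 1, time). The kernel is a sinc low-pass shifted by tau;
        # tau stays inside autograd because torch.sinc is differentiable.
        h = 2 * self.cutoff * torch.sinc(2 * self.cutoff * (self.t - self.tau))
        h = (h * self.window).flip(0)  # conv1d cross-correlates; flip for true convolution
        return F.conv1d(x, h.view(1, 1, -1), padding=(h.numel() - 1) // 2)

# usage: a positive tau delays the (low-passed) signal by ~tau samples
layer = DelayedSinc(init_delay=40.0)
y = layer(torch.randn(4, 1, 1000))  # -> (4, 1, 1000)
```

A bank of such layers, each with its own learned delay and a learned mixing weight, can then approximate the non-stationary, acoustics-dependent delay the abstract refers to.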
Emotion recognition algorithms rely on data annotated with high-quality labels. However, emotion expression and perception are inherently subjective. There is generally not a single annotation that can be unambiguously declared correct. As a result,
Despite remarkable advances in emotion recognition, existing systems are severely constrained either by the inherently limited information available from a single modality or by the requirement that all involved modalities be synchronously present. Motivated by this, we prop
Emotion estimation in music listening faces challenges in capturing the emotional variation of listeners. Recent years have witnessed attempts to exploit multimodality, fusing information from musical content and physiological signals captured f
Time-continuous emotion prediction has become an increasingly compelling task in machine learning. Considerable efforts have been made to advance the performance of these systems. Nonetheless, the main focus has been the development of more sophistic
Dynamic time warping (DTW) calculates the similarity or alignment between two signals, subject to temporal warping. However, its computational complexity grows exponentially with the number of time series. Although algorithms have been developed that are linear in th
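For context on these complexity claims, here is a minimal NumPy sketch of the standard pairwise DTW dynamic program (the function name and the absolute-difference local cost are illustrative choices, not tied to any paper above). Aligning two series of lengths n and m costs O(nm); jointly warping k series with the same recurrence requires a k-dimensional table, which is the source of the exponential growth mentioned in the abstract.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic pairwise DTW: O(len(x) * len(y)) dynamic program."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])    # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# e.g. dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]) -> 0.0 (pure warping)
```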