ﻻ يوجد ملخص باللغة العربية
This paper focuses on finding suitable features to robustly recognize emotions and evaluate customer satisfaction from speech in real acoustic scenarios. The classification of emotions is based on standard and well-known corpora and the evaluation of customer satisfaction is based on recordings of real opinions given by customers about the received service during phone calls with call-center agents. The feature sets considered in this study include two speaker models, namely x-vectors and i-vectors, and also the well known feature set introduced in the Interspeech 2010 Paralinguistics Challenge (I2010PC). Additionally, we introduce the use of phonation, articulation and prosody features extracted with the DisVoice framework as alternative feature sets to robustly model emotions and customer satisfaction from speech. The results indicate that the I2010PC feature set is the best approach to classify emotions in the standard databases typically used in the literature. When considering the recordings collected in the call-center, without any control over the acoustic conditions, the best results are obtained with our articulation features. The I2010PC feature set includes 1584 measures while the articulation approach only includes 488 measures. We think that the proposed approach is more suitable for real-world applications where the acoustic conditions are not controlled and also it is potentially more convenient for industrial applications.
We present first results showing that it is possible to automatically discriminate between the coughing sounds produced by patients with tuberculosis (TB) and those produced by patients with other lung ailments in a real-world noisy environment. Our
Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has attracted a lot of attention recently. However, the FL scenarios often presented in the literature are artificial and fail to capture the complexity of real
Human speech processing is inherently multimodal, where visual cues (lip movements) help to better understand the speech in noise. Lip-reading driven speech enhancement significantly outperforms benchmark audio-only approaches at low signal-to-noise
Utilizing a human-perception-related objective function to train a speech enhancement model has become a popular topic recently. The main reason is that the conventional mean squared error (MSE) loss cannot represent auditory perception well. One of
Recurrent neural networks (RNNs) based automatic speech recognition has nowadays become prevalent on mobile devices such as smart phones. However, previous RNN compression techniques either suffer from hardware performance overhead due to irregularit