
Deep Learning for Depression Recognition with Audiovisual Cues: A Review

Added by Lang He
Publication date: 2021
Language: English





As the pace of work and life accelerates, people face increasing pressure, which raises the risk of depression. However, many patients fail to receive a timely diagnosis because of the severe worldwide imbalance in the doctor-patient ratio. Promisingly, physiological and psychological studies have identified differences in speech and facial expression between patients with depression and healthy individuals. Consequently, to improve current medical care, many scholars have used deep learning to extract representations of depression cues from audio and video for automatic depression detection. To sort out and summarize these works, this review introduces the relevant databases and describes objective markers for automatic depression estimation (ADE). Furthermore, we review the deep learning methods used to extract representations of depression from audio and video for automatic depression detection. Finally, this paper discusses challenges and promising directions related to the automatic diagnosis of depression using deep learning technologies.
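To make the general pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of late fusion of audio and visual features for depression severity regression. The module names, feature dimensions, encoders, and fusion strategy are assumptions for illustration only, not the architecture of any specific model covered in the review.

```python
# Illustrative sketch only: a simple late-fusion audiovisual regressor for a
# depression severity score. All dimensions, layer sizes, and the fusion
# strategy are assumptions, not a reviewed model.
import torch
import torch.nn as nn

class AudioVisualDepressionRegressor(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=136, hidden_dim=128):
        super().__init__()
        # Sequence encoders for frame-level audio (e.g., log-mel) and
        # visual (e.g., facial landmark) features.
        self.audio_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # Late fusion: concatenate the two utterance-level embeddings.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # predicted depression severity score
        )

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (batch, T_a, audio_dim), visual_seq: (batch, T_v, visual_dim)
        _, a_last = self.audio_rnn(audio_seq)
        _, v_last = self.visual_rnn(visual_seq)
        fused = torch.cat([a_last[-1], v_last[-1]], dim=-1)
        return self.head(fused).squeeze(-1)

model = AudioVisualDepressionRegressor()
audio = torch.randn(4, 300, 80)    # 4 clips, 300 audio frames
video = torch.randn(4, 100, 136)   # 4 clips, 100 video frames
print(model(audio, video).shape)   # torch.Size([4])
```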



Related research


Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between recognizing speakers in the training set and unseen speakers. The latter case corresponds to the few-shot learning task, where a trained model is evaluated on unseen classes. Here, we optimize a speaker embedding model with prototypical network loss (PNL), a state-of-the-art approach for the few-shot image classification task. The resulting embedding model outperforms the state-of-the-art triplet loss based models in both speaker verification and identification tasks, for both seen and unseen speakers.
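As a rough illustration of the prototypical-network idea used above, the sketch below computes an episode loss over pre-computed speaker embeddings: class prototypes are support-set means, and queries are classified by softmax over negative squared distances. The tensor shapes and the distance choice are assumptions, not details taken from the paper.

```python
# Minimal sketch of a prototypical-network-style episode loss, assuming
# utterance embeddings are already computed by some encoder.
import torch
import torch.nn.functional as F

def prototypical_loss(support, query, query_labels):
    """support: (n_way, k_shot, dim) embeddings per speaker class.
    query: (n_query, dim); query_labels: (n_query,) in [0, n_way)."""
    # Each class prototype is the mean of its support embeddings.
    prototypes = support.mean(dim=1)                      # (n_way, dim)
    # Negative squared Euclidean distance acts as the class score.
    dists = torch.cdist(query, prototypes) ** 2           # (n_query, n_way)
    log_p = F.log_softmax(-dists, dim=1)
    return F.nll_loss(log_p, query_labels)

# Toy episode: 5 speakers, 3 support and 2 query utterances each, 64-dim.
support = torch.randn(5, 3, 64)
query = torch.randn(10, 64)
labels = torch.arange(5).repeat_interleave(2)
print(prototypical_loss(support, query, labels))
```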
Conversational emotion recognition (CER) has attracted increasing interest in the natural language processing (NLP) community. Unlike vanilla emotion recognition, learning effective speaker-sensitive utterance representations is a major challenge for CER. In this paper, we exploit speaker identification (SI) as an auxiliary task to enhance utterance representations in conversations. With this method, we can learn better speaker-aware contextual representations from the additional SI corpus. Experiments on two benchmark datasets demonstrate that the proposed architecture is highly effective for CER, obtaining new state-of-the-art results on both datasets.
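A generic way to realize such auxiliary-task training is a shared utterance encoder with separate emotion and speaker heads, trained with a weighted sum of the two losses. The sketch below follows that pattern; the encoder, head sizes, and loss weighting are assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of multi-task learning with speaker identification (SI)
# as an auxiliary task for conversational emotion recognition (CER).
import torch
import torch.nn as nn

class SharedUtteranceEncoder(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256,
                 n_emotions=7, n_speakers=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.emotion_head = nn.Linear(hidden_dim, n_emotions)   # main CER task
        self.speaker_head = nn.Linear(hidden_dim, n_speakers)   # auxiliary SI task

    def forward(self, utterance_features):
        h = self.encoder(utterance_features)
        return self.emotion_head(h), self.speaker_head(h)

model = SharedUtteranceEncoder()
ce = nn.CrossEntropyLoss()
feats = torch.randn(8, 768)                 # 8 utterance embeddings
emo_labels = torch.randint(0, 7, (8,))
spk_labels = torch.randint(0, 100, (8,))
emo_logits, spk_logits = model(feats)
# The auxiliary SI loss is down-weighted so it only regularizes the encoder.
loss = ce(emo_logits, emo_labels) + 0.3 * ce(spk_logits, spk_labels)
print(loss.item())
```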
Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. If appropriately combined with acoustic information, additional visual cues can help to improve the performance in these applications, especially under adverse acoustic conditions. A dynamic weighting of acoustic and visual streams based on instantaneous sensor reliability measures is an efficient approach to data fusion in this context. This paper presents a framework that extends the well-established theory of nonlinear dynamical systems with the notion of dynamic stream weights for an arbitrary number of sensory observations. It comprises a recursive state estimator based on the Gaussian filtering paradigm, which incorporates dynamic stream weights into a framework closely related to the extended Kalman filter. Additionally, a convex optimization approach to estimate oracle dynamic stream weights in fully observed dynamical systems utilizing a Dirichlet prior is presented. This serves as a basis for a generic parameter learning framework of dynamic stream weight estimators. The proposed system is application-independent and can be easily adapted to specific tasks and requirements. A study using audiovisual speaker tracking tasks is considered as an exemplary application in this work. An improved tracking performance of the dynamic stream weight-based estimation framework over state-of-the-art methods is demonstrated in the experiments.
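The core idea of dynamic stream weighting can be illustrated with a much simpler filter than the Gaussian/extended-Kalman framework the paper develops: fuse the per-frame observation log-likelihoods of the acoustic and visual streams with convex weights set from instantaneous reliability. The 1-D grid-based Bayes filter, Gaussian observation models, and hand-picked weights below are assumptions for illustration only.

```python
# Minimal sketch of dynamic stream weighting in a grid-based Bayes filter:
# the fused observation log-likelihood is a per-frame convex combination of
# the acoustic and visual log-likelihoods.
import numpy as np

def gaussian_loglik(grid, measurement, sigma):
    return -0.5 * ((grid - measurement) / sigma) ** 2

def fused_update(prior, grid, y_audio, y_video, w_audio):
    # Dynamic stream weights: w_audio + w_video = 1, chosen per frame from
    # instantaneous reliability estimates (assumed given here).
    w_video = 1.0 - w_audio
    log_post = (np.log(prior + 1e-12)
                + w_audio * gaussian_loglik(grid, y_audio, sigma=0.5)
                + w_video * gaussian_loglik(grid, y_video, sigma=0.2))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

grid = np.linspace(-5, 5, 201)          # discretized speaker position
belief = np.ones_like(grid) / grid.size # uniform prior
# Noisy acoustic measurement, cleaner visual one; trust video more (w_audio=0.3).
belief = fused_update(belief, grid, y_audio=1.4, y_video=1.0, w_audio=0.3)
print(grid[np.argmax(belief)])          # MAP estimate of the speaker position
```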
Modulation classification, recognized as the intermediate step between signal detection and demodulation, is widely deployed in many modern wireless communication systems. Although many approaches to identifying the modulation format of an incoming signal have been studied over the last decades, most traditional machine learning algorithms struggle to learn radio characteristics effectively. To overcome this drawback, we propose an accurate modulation classification method that exploits deep learning on constellation diagrams. In particular, a convolutional neural network is developed to learn the most relevant radio characteristics of gray-scale constellation images. The deep network consists of multiple processing blocks, in which several grouped and asymmetric convolutional layers are organized in a flow-in-flow structure for feature enrichment. These blocks are connected via skip connections to prevent the vanishing gradient problem while effectively preserving identity information throughout the network. In intensive simulations on a constellation-image dataset of eight digital modulations, the proposed deep network achieves a remarkable classification accuracy of approximately 87% at 0 dB signal-to-noise ratio (SNR) under a multipath Rayleigh fading channel and outperforms several state-of-the-art deep models for constellation-based modulation classification.
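The sketch below shows the general pattern of a small CNN with skip connections operating on gray-scale constellation images and predicting one of eight modulation classes. The block layout, asymmetric-kernel choice, channel counts, and image size are assumptions, not the paper's exact flow-in-flow architecture.

```python
# Illustrative sketch of a small CNN with residual (skip) connections for
# classifying gray-scale constellation images into 8 digital modulations.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Asymmetric (1x3 then 3x1) convolutions as a lightweight stand-in.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
        )

    def forward(self, x):
        # Skip connection keeps gradients flowing through deeper stacks.
        return torch.relu(x + self.conv(x))

class ConstellationClassifier(nn.Module):
    def __init__(self, n_classes=8):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, 1, H, W) gray image
        h = self.pool(self.blocks(torch.relu(self.stem(x))))
        return self.fc(h.flatten(1))

model = ConstellationClassifier()
print(model(torch.randn(2, 1, 64, 64)).shape)  # torch.Size([2, 8])
```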
In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) that learns from primarily unpaired audio data and produces sequences of representations very close to the phoneme sequences of speech utterances. This is achieved by proper temporal segmentation to make the representations phoneme-synchronized, and proper phonetic clustering so that the total number of distinct representations is close to the number of phonemes. The mapping between the distinct representations and phonemes is learned from a small amount of annotated paired data. Preliminary experiments on LJSpeech show that the learned representations for vowels have relative locations in the latent space that closely parallel the IPA vowel chart defined by linguistics experts. With less than 20 minutes of annotated speech, our method outperforms existing methods on phoneme recognition and is able to synthesize intelligible speech that beats our baseline model.
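To illustrate the quantization idea behind such phoneme-like discrete representations, the sketch below maps each encoder frame to its nearest entry in a small codebook whose size is roughly the number of phonemes. The codebook size, dimensions, and the straight-through gradient trick are assumptions for illustration, not the paper's exact training procedure.

```python
# Minimal sketch of a vector-quantization step: each frame representation is
# replaced by its nearest codeword in a small, phoneme-sized codebook.
import torch

def quantize(frames, codebook):
    """frames: (T, dim) encoder outputs; codebook: (K, dim), K ~ #phonemes."""
    dists = torch.cdist(frames, codebook)          # (T, K) pairwise distances
    indices = dists.argmin(dim=1)                  # nearest codeword per frame
    quantized = codebook[indices]                  # (T, dim) discrete sequence
    # Straight-through estimator: copy gradients through the quantization.
    quantized = frames + (quantized - frames).detach()
    return quantized, indices

codebook = torch.randn(40, 64, requires_grad=True)   # ~40 phoneme-like codes
frames = torch.randn(120, 64, requires_grad=True)    # 120 encoded frames
q, idx = quantize(frames, codebook)
print(idx[:10])   # discrete "pseudo-phoneme" indices for the first 10 frames
```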