Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings

101 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Pawel Cyrta

تاريخ النشر 2017

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Pawel Cyrta - Tomasz Trzcinski - Wojciech Stokowiec

أنظمة الصوت في الحاسوب الوسائط المتعددة الحوسبة العصبية والتطورية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to the traditional approaches that build their speaker embeddings using manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly on magnitude spectrograms. To compare our approach with the state of the art, we collect and release for the public an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces diarization error rate by a large margin of over 30% with respect to the baseline.

قيم البحث

165 - Vishal Sharma , Zekun Zhang , Zachary Neubert 2020

Speaker Diarization is the problem of separating speakers in an audio. There could be any number of speakers and final result should state when speaker starts and ends. In this project, we analyze given audio file with 2 channels and 2 speakers (on s eparate channel). We train Neural Network for learning when a person is speaking. We use different type of Neural Networks specifically, Single Layer Perceptron (SLP), Multi Layer Perceptron (MLP), Recurrent Neural Network (RNN) and Convolution Neural Network (CNN) we achieve $sim$92% of accuracy with RNN. The code for this project is available at https://github.com/vishalshar/SpeakerDiarization_RNN_CNN_LSTM

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

Speaker diarization with session-level speaker embedding refinement using graph neural networks

335 - Jixuan Wang , Xiong Xiao , Jian Wu 2020

Deep speaker embedding models have been commonly used as a building block for speaker diarization systems; however, the speaker embedding model is usually trained according to a global loss defined on the training data, which could be sub-optimal for distinguishing speakers locally in a specific meeting session. In this work we present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally using the structural information between speech segments inside each session. The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated. The model is trained for linkage prediction in a supervised manner by minimizing the difference between the affinity matrix constructed by the refined embeddings and the ground-truth adjacency matrix. Spectral clustering is then applied on top of the refined embeddings. We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data, and our system achieves the state-of-the-art result on the NIST SRE 2000 CALLHOME database.

معالجة الصوت والكلام التعلم الآلي أنظمة الصوت في الحاسوب

Content-Aware Speaker Embeddings for Speaker Diarisation

110 - G. Sun , D. Liu , C. Zhang 2021

Recent speaker diarisation systems often convert variable length speech segments into fixed-length vector representations for speaker clustering, which are known as speaker embeddings. In this paper, the content-aware speaker embeddings (CASE) approa ch is proposed, which extends the input of the speaker classifier to include not only acoustic features but also their corresponding speech content, via phone, character, and word embeddings. Compared to alternative methods that leverage similar information, such as multitask or adversarial training, CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics and correlations with the corresponding content units to derive more expressive representations. CASE is evaluated for speaker re-clustering with a realistic speaker diarisation setup using the AMI meeting transcription dataset, where the content information is obtained by performing ASR based on an automatic segmentation. Experimental results showed that CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

ECAPA-TDNN Embeddings for Speaker Diarization

138 - Nauman Dawalatabad , Mirco Ravanelli , Franc{c}ois Grondin 2021

Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of mode rn diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in the speaker verification domain, thanks to a carefully designed neural model. In this work, we extend, for the first time, the use of the ECAPA-TDNN model to speaker diarization. Moreover, we improved its robustness with a powerful augmentation scheme that concatenates several contaminat

معالجة الصوت والكلام

Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

104 - Jen-Cheng Hou , Syu-Siang Wang , Ying-Hui Lai 2017

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of con volutional neural networks (CNNs) in SE, we propose an audio-visual deep CNNs (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Precisely speaking, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields a notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audio-visual SE model, confirming its capability of effectively combining audio and visual information in SE.

أنظمة الصوت في الحاسوب الوسائط المتعددة التعلم الالي