Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Linguistically Aided Speaker Diarization Using Speaker Role Information

150 0 0.0 ( 0 )

Download Cite

Added by Nikolaos Flemotomos

Publication date 2019

fields Electronic Engineering Informatics Engineering

and research's language is English

Authors Nikolaos Flemotomos - Panayiotis Georgiou - Shrikanth Narayanan

Audio and Speech Processing Sound

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Speaker diarization relies on the assumption that speech segments corresponding to a particular speaker are concentrated in a specific region of the speaker space; a region which represents that speakers identity. These identities are not known a priori, so a clustering algorithm is typically employed, which is traditionally based solely on audio. Under noisy conditions, however, such an approach poses the risk of generating unreliable speaker clusters. In this work we aim to utilize linguistic information as a supplemental modality to identify the various speakers in a more robust way. We are focused on conversational scenarios where the speakers assume distinct roles and are expected to follow different linguistic patterns. This distinct linguistic variability can be exploited to help us construct the speaker identities. That way, we are able to boost the diarization performance by converting the clustering task to a classification one. The proposed method is applied in real-world dyadic psychotherapy interactions between a provider and a patient and demonstrated to show improved results.

rate research

Speaker Diarization: Using Recurrent Neural Networks

165 - Vishal Sharma , Zekun Zhang , Zachary Neubert 2020

Speaker Diarization is the problem of separating speakers in an audio. There could be any number of speakers and final result should state when speaker starts and ends. In this project, we analyze given audio file with 2 channels and 2 speakers (on separate channel). We train Neural Network for learning when a person is speaking. We use different type of Neural Networks specifically, Single Layer Perceptron (SLP), Multi Layer Perceptron (MLP), Recurrent Neural Network (RNN) and Convolution Neural Network (CNN) we achieve $sim$92% of accuracy with RNN. The code for this project is available at https://github.com/vishalshar/SpeakerDiarization_RNN_CNN_LSTM

Audio and Speech Processing Sound

Speaker Diarization with Lexical Information

208 - Tae Jin Park , Kyu J. Han , Jing Huang 2020

This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition. We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy. To integrate lexical and acoustic information in a comprehensive way during clustering, we introduce an adjacency matrix integration for spectral clustering. Since words and word boundary information for word-level speaker turn probability estimation are provided by a speech recognition system, our proposed method works without any human intervention for manual transcriptions. We show that the proposed method improves diarization performance on various evaluation datasets compared to the baseline diarization system using acoustic information only in speaker embeddings.

Audio and Speech Processing Computation and Language Sound

Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

94 - Xiong Xiao , Naoyuki Kanda , Zhuo Chen 2020

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge(VoxSRC) 2020. We will first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems with the data set provided by VoxSRCchallenge 2020, which contains real-life multi-talker audio collected from YouTube. Our best system achieves 3.71% and 6.23% of the diarization error rate (DER) on development set and evaluation set, respectively, being ranked the 1st at the diarization track of the challenge.

Audio and Speech Processing Sound

Online Speaker Diarization with Relation Network

157 - Xiang Li , Yucheng Zhao , Chong Luo 2020

In this paper, we propose an online speaker diarization system based on Relation Network, named RenoSD. Unlike conventional diariztion systems which consist of several independently-optimized modules, RenoSD implements voice-activity-detection (VAD), embedding extraction, and speaker identity association using a single deep neural network. The most striking feature of RenoSD is that it adopts a meta-learning strategy for speaker identity association. In particular, the relation network learns to learn a deep distance metric in a data-driven way and it can determine through a simple forward pass whether two given segments belong to the same speaker. As such, RenoSD can be performed in an online manner with low latency. Experimental results on AMI and CALLHOME datasets show that the proposed RenoSD system achieves consistent improvements over the state-of-the-art x-vector baseline. Compared with an existing online diarization system named UIS-RNN, RenoSD achieves a better performance using much fewer training data and at a lower time complexity.

Audio and Speech Processing Sound

Speaker diarization using latent space clustering in generative adversarial network

133 - Monisankha Pal , Manoj Kumar , Raghuveer Peri 2019

In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus, and two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions respectively obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperform the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve a relative DER improvement of 31%, 36% and 49% on AMI eval, ADOS and BOSCC corpora respectively, when compared to the x-vector baseline using oracle speech segmentation.

Audio and Speech Processing Sound

comments

Fetching comments

Middle East University- Jordan

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Linguistically Aided Speaker Diarization Using Speaker Role Information

Ask ChatGPT about the research

No Arabic abstract

Read More