Content-Aware Speaker Embeddings for Speaker Diarisation

Published by Guangzhi Sun

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: G. Sun, D. Liu, C. Zhang

Recent speaker diarisation systems often convert variable-length speech segments into fixed-length vector representations, known as speaker embeddings, for speaker clustering. In this paper, the content-aware speaker embeddings (CASE) approach is proposed, which extends the input of the speaker classifier to include not only acoustic features but also their corresponding speech content, via phone, character, and word embeddings. Compared to alternative methods that leverage similar information, such as multitask or adversarial training, CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics and correlations with the corresponding content units to derive more expressive representations. CASE is evaluated for speaker re-clustering with a realistic speaker diarisation setup using the AMI meeting transcription dataset, where the content information is obtained by performing ASR based on an automatic segmentation. Experimental results showed that CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
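As a rough illustration of extending the classifier input with content information, the sketch below appends a phone embedding to each acoustic frame. The abstract does not say how the two are combined, so the concatenation, the toy embedding table `PHONE_EMBEDDINGS`, and the helper `build_case_inputs` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical 2-dimensional phone embeddings; a real system would learn these.
PHONE_EMBEDDINGS: Dict[str, List[float]] = {
    "sil": [0.0, 0.0],
    "ah": [0.5, -0.1],
    "s": [-0.3, 0.8],
}


def build_case_inputs(
    acoustic_frames: List[List[float]], phone_labels: List[str]
) -> List[List[float]]:
    """Append each frame's phone embedding to its acoustic feature vector."""
    if len(acoustic_frames) != len(phone_labels):
        raise ValueError("need exactly one phone label per acoustic frame")
    # Concatenation is an assumed way of combining the two streams.
    return [
        frame + PHONE_EMBEDDINGS[phone]
        for frame, phone in zip(acoustic_frames, phone_labels)
    ]


frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # toy 3-dimensional acoustic features
labels = ["ah", "s"]                          # one phone label per frame, e.g. from ASR
inputs = build_case_inputs(frames, labels)    # each row now has 3 + 2 values
```

In the setup described above, the phone labels would come from ASR run on an automatic segmentation, so they may contain recognition errors.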

Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, 2021

The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system, achieving an average relative improvement of 25.07% in terms of diarisation error rate over the baseline.
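To make "attention-based embedding aggregation" concrete, here is a minimal sketch of softmax-weighted pooling over several embeddings. The abstract does not describe the mechanism, so the dot-product scoring, the `query` vector, and the function name `attention_pool` are assumptions chosen for illustration only.

```python
import math
from typing import List


def attention_pool(embeddings: List[List[float]], query: List[float]) -> List[float]:
    """Aggregate embeddings with softmax weights from dot-product scores."""
    scores = [sum(e_i * q_i for e_i, q_i in zip(e, query)) for e in embeddings]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(embeddings[0])
    # Weighted sum of the embeddings, one output value per dimension.
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


segment_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero query gives equal scores, so the result reduces to a plain average.
uniform = attention_pool(segment_embeddings, query=[0.0, 0.0])
# A query aligned with the first dimension down-weights the second embedding.
focused = attention_pool(segment_embeddings, query=[10.0, 0.0])
```

In a trained system the scoring parameters would be learnt; here the query is fixed by hand purely to show how the weights change the aggregate.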

Audio and speech processing, Machine learning, Computer sound systems

Spot the conversation: speaker diarisation in the wild

Joon Son Chung, Jaesung Huh, Arsha Nagrani, 2020

The goal of this paper is speaker diarisation of videos collected in the wild. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from in-the-wild videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.

Computer sound systems, Computer vision and pattern recognition, Audio and speech processing

Single-Channel Speech Separation with Auxiliary Speaker Embeddings

Shuo Liu, Gil Keren, Bjorn Schuller, 2019

We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker embeddings created from additional clean context recordings of the two speakers as input to assist in attributing the different time-frequency bins to the two speakers. In experiments, we show that the proposed model yields good performance in the source separation task, and outperforms the state-of-the-art baselines. Specifically, separating speech from the challenging VoxCeleb dataset, the proposed model yields 4.79dB signal-to-distortion ratio, 8.44dB signal-to-artifacts ratio and 7.11dB signal-to-interference ratio.
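For readers unfamiliar with decibel-scale separation metrics, the sketch below computes a simplified signal-to-distortion ratio as the energy of the reference signal over the energy of the estimation error. This is an assumed, simplified definition; the abstract does not state which evaluation toolkit or exact decomposition was used, and standard evaluations split the error further into interference and artifact terms. The function name `simple_sdr_db` is hypothetical.

```python
import math
from typing import Sequence


def simple_sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Return 10 * log10(reference energy / error energy) in decibels."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]  # toy estimate with a 10% amplitude error
sdr = simple_sdr_db(reference, estimate)
```

With these toy signals the energy ratio is about 100, which corresponds to roughly 20 dB; higher values mean the estimate is closer to the reference.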

Computer sound systems, Machine learning, Audio and speech processing

Learnable MFCCs for Speaker Verification

Xuechen Liu, Md Sahidullah, Tomi Kinnunen, 2021

We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be adapted to data flexibly. In practice, we formulate data-driv

Computer sound systems, Machine learning, Audio and speech processing

Differentiable Supervector Extraction for Encoding Speaker and Phrase Information in Text Dependent Speaker Verification

Victoria Mingote, Antonio Miguel, Alfonso Ortega, 2018

In this paper, we propose a new differentiable neural network alignment mechanism for text-dependent speaker verification which uses alignment models to produce a supervector representation of an utterance. Unlike previous works with similar approaches, we do not extract the embedding of an utterance from the mean reduction of the temporal dimension. Our system replaces the mean by a phrase alignment model to keep the temporal structure of each phrase, which is relevant in this application since the phonetic information is part of the identity in the verification task. Moreover, we can apply a convolutional neural network as front-end, and thanks to the alignment process being differentiable, we can train the whole network to produce a supervector for each utterance which will be discriminative with respect to the speaker and the phrase simultaneously. As we show, this choice has the advantage that the supervector encodes the phrase and speaker information, providing good performance in text-dependent speaker verification tasks. In this work, the process of verification is performed using a basic similarity metric, for simplicity, compared to other more elaborate models that are commonly used. The new model using alignment to produce supervectors was tested on the RSR2015-Part I database for text-dependent speaker verification, providing competitive results compared to similar size networks using the mean to extract embeddings.
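The contrast between mean reduction and an alignment-based supervector can be sketched as follows. This is a toy illustration under stated assumptions: the per-state weighted average, the hard left-to-right `alignment` matrix, and the function names are hypothetical, and the abstract does not specify how the alignment weights are computed or normalised.

```python
from typing import List


def mean_pool(frames: List[List[float]]) -> List[float]:
    """Average over time; the order of the frames is lost."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def supervector(frames: List[List[float]], alignment: List[List[float]]) -> List[float]:
    """Concatenate one alignment-weighted average per state.

    alignment[t][s] is the (soft or hard) weight of frame t for state s.
    """
    num_states = len(alignment[0])
    dim = len(frames[0])
    out: List[float] = []
    for s in range(num_states):
        weight_sum = sum(alignment[t][s] for t in range(len(frames)))
        for d in range(dim):
            weighted = sum(alignment[t][s] * frames[t][d] for t in range(len(frames)))
            out.append(weighted / weight_sum if weight_sum > 0 else 0.0)
    return out


frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
# Assumed hard alignment: first two frames to state 0, last two to state 1.
alignment = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

pooled = mean_pool(frames)                 # 2 values, same for any frame order
stacked = supervector(frames, alignment)   # 4 values, one block per state
reversed_stacked = supervector(frames[::-1], alignment)
```

Reversing the frame order leaves `mean_pool` unchanged but changes the supervector, which is the sense in which the alignment keeps the temporal structure of the phrase.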

Computer sound systems, Machine learning, Audio and speech processing