ECAPA-TDNN Embeddings for Speaker Diarization

Published by: Nauman Dawalatabad

Publication date: 2021

Research field: Electronic engineering

Language: English

Authors: Nauman Dawalatabad, Mirco Ravanelli, François Grondin

Audio and speech processing

Abstract

Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in the speaker verification domain, thanks to a carefully designed neural model. In this work, we extend, for the first time, the use of the ECAPA-TDNN model to speaker diarization. Moreover, we improved its robustness with a powerful augmentation scheme that concatenates several contaminat

Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, 2021

The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system, achieving an average relative improvement of 25.07% in terms of diarisation error rate over the baseline.
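The "relative improvement" quoted above is a ratio against the baseline error, not an absolute difference in percentage points. A minimal sketch of the arithmetic follows; the function name and the example error rates are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_der: float, adapted_der: float) -> float:
    """Relative reduction in diarisation error rate, in percent."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - adapted_der) / baseline_der


# Hypothetical numbers, for illustration only: a baseline DER of 8.0%
# reduced to 6.0% is a 2.0-point absolute gain and a 25% relative gain.
example = relative_improvement(8.0, 6.0)
```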

Audio and speech processing, Machine learning, Computer audio systems

Speaker Diarization with Lexical Information

Tae Jin Park, Kyu J. Han, Jing Huang, 2020

This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition. We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy. To integrate lexical and acoustic information in a comprehensive way during clustering, we introduce an adjacency matrix integration for spectral clustering. Since words and word boundary information for word-level speaker turn probability estimation are provided by a speech recognition system, our proposed method works without any human intervention for manual transcriptions. We show that the proposed method improves diarization performance on various evaluation datasets compared to the baseline diarization system using acoustic information only in speaker embeddings.
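The idea of integrating two adjacency matrices before spectral clustering can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the fusion rule in `combine_adjacency` (a weighted average), the `weight` parameter, the toy embeddings, and the small k-means helper are all hypothetical choices made for this example.

```python
import numpy as np


def combine_adjacency(acoustic, lexical, weight=0.5):
    """Hypothetical fusion rule: weighted average of the two matrices."""
    return (1.0 - weight) * acoustic + weight * lexical


def kmeans(points, k, n_iter=50):
    """Tiny k-means with farthest-point initialisation (deterministic)."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist_to_centers = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dist_to_centers, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def spectral_cluster(affinity, n_speakers):
    """Cluster segments from a symmetric, non-negative affinity matrix."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalised Laplacian: I - D^{-1/2} A D^{-1/2}
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the smallest ones.
    _, eigvecs = np.linalg.eigh(laplacian)
    spectral = eigvecs[:, :n_speakers].copy()
    spectral /= np.maximum(np.linalg.norm(spectral, axis=1, keepdims=True), 1e-12)
    return kmeans(spectral, n_speakers)


# Toy data (assumed): six segments, three per speaker.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal([1, 0], 0.1, (3, 2)), rng.normal([0, 1], 0.1, (3, 2))])
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
acoustic = np.clip(unit @ unit.T, 0.0, 1.0)  # cosine affinity, clipped to [0, 1]

# lexical[i, j] = 1 means the (assumed) word-level cues predict no speaker turn
# between segments i and j.
lexical = np.zeros((6, 6))
lexical[:3, :3] = 1.0
lexical[3:, 3:] = 1.0

labels = spectral_cluster(combine_adjacency(acoustic, lexical), n_speakers=2)
```

In this toy setup the lexical matrix reinforces the acoustic affinity; with noisy embeddings, `weight` would control how much the lexical cues are trusted.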

Audio and speech processing, Computation and language, Computer audio systems

Linguistically Aided Speaker Diarization Using Speaker Role Information

Nikolaos Flemotomos, Panayiotis Georgiou, Shrikanth Narayanan, 2019

Speaker diarization relies on the assumption that speech segments corresponding to a particular speaker are concentrated in a specific region of the speaker space; a region which represents that speaker's identity. These identities are not known a priori, so a clustering algorithm is typically employed, which is traditionally based solely on audio. Under noisy conditions, however, such an approach poses the risk of generating unreliable speaker clusters. In this work we aim to utilize linguistic information as a supplemental modality to identify the various speakers in a more robust way. We are focused on conversational scenarios where the speakers assume distinct roles and are expected to follow different linguistic patterns. This distinct linguistic variability can be exploited to help us construct the speaker identities. That way, we are able to boost the diarization performance by converting the clustering task to a classification one. The proposed method is applied in real-world dyadic psychotherapy interactions between a provider and a patient and demonstrated to show improved results.

Audio and speech processing, Computer audio systems

Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Xiong Xiao, Naoyuki Kanda, Zhuo Chen, 2020

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We will first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems with the data set provided by VoxSRC challenge 2020, which contains real-life multi-talker audio collected from YouTube. Our best system achieves 3.71% and 6.23% of the diarization error rate (DER) on development set and evaluation set, respectively, being ranked the 1st at the diarization track of the challenge.
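For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is a simplified frame-level version and makes several assumptions that real scoring tools do not: one speaker per frame, `None` for non-speech, no forgiveness collar, and hypothesis labels already mapped onto reference labels. It is not the challenge's scoring script.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER; None marks non-speech, labels are assumed pre-mapped."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1          # speech in reference, silence in hypothesis
            elif hyp != ref:
                confusion += 1       # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # hypothesis speech where reference is silent
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example with made-up labels: one confusion, one false alarm, and one
# missed frame over 8 reference speech frames.
reference = ["A", "A", "A", "B", "B", None, None, "B", "A", "A"]
hypothesis = ["A", "A", "B", "B", "B", "B", None, None, "A", "A"]
der = diarization_error_rate(reference, hypothesis)  # 3 / 8 = 0.375
```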

Audio and speech processing, Computer audio systems

Speaker Diarization: Using Recurrent Neural Networks

Vishal Sharma, Zekun Zhang, Zachary Neubert, 2020

Speaker diarization is the problem of separating speakers in an audio recording. There can be any number of speakers, and the final result should state when each speaker starts and stops speaking. In this project, we analyze a given audio file with 2 channels and 2 speakers (on separate channels). We train a neural network to learn when a person is speaking. We use different types of neural networks, specifically Single Layer Perceptron (SLP), Multi Layer Perceptron (MLP), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN), and we achieve ~92% accuracy with the RNN. The code for this project is available at https://github.com/vishalshar/SpeakerDiarization_RNN_CNN_LSTM

Audio and speech processing, Computer audio systems
