Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification

Published by: Zhengyang Chen

Publication date: 2021

Research field: Informatics engineering, electronic engineering

Language: English

Authors: Zhengyang Chen, Shuai Wang, Yanmin Qian

Topics: Computer sound systems, audio and speech processing

Abstract (English)

Large performance degradation is often observed for speaker verification systems when they are applied to a dataset from a new domain. Given an unlabeled target-domain dataset, unsupervised domain adaptation (UDA) methods, which usually leverage adversarial training strategies, are commonly used to bridge the performance gap caused by the domain mismatch. However, such an adversarial training strategy only uses the distribution information of the target-domain data and cannot ensure a performance improvement on the target domain. In this paper, we incorporate a self-supervised learning strategy into the unsupervised domain adaptation system and propose a self-supervised learning based domain adaptation approach (SSDA). Compared to the traditional UDA method, the new SSDA training strategy can fully leverage the potential label information from the target domain and adapt the speaker discrimination ability from the source domain simultaneously. We evaluate the proposed approach on the VoxCeleb (labeled source domain) and CnCeleb (unlabeled target domain) datasets. The best SSDA system obtains a 10.2% Equal Error Rate (EER) on the CnCeleb dataset without using any speaker labels from CnCeleb, which is a state-of-the-art result on this corpus.
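The headline figure in this abstract is an Equal Error Rate. As a rough illustration of what that metric measures, the sketch below sweeps a decision threshold over a set of trial scores and reports the point where the false acceptance and false rejection rates are closest. The function name `equal_error_rate` and the toy scores are assumptions made for illustration and are not taken from the paper; evaluation toolkits usually interpolate the error curves rather than use this simple sweep.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (in [0, 1]) by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores (hypothetical): higher means "more likely the same speaker".
toy_targets = [0.9, 0.8, 0.7, 0.3]
toy_nontargets = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

On these toy scores one target trial and one non-target trial fall on the wrong side of the best threshold, so both error rates equal one in four.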

Zhuo Li, Ce Fang, Runqiu Xiao (2021)

This paper describes the systems submitted by team HCCL to the Far-Field Speaker Verification Challenge. Our previous work in the AISHELL Speaker Verification Challenge 2019 shows that the powerful modeling abilities of neural network architectures can provide exceptional performance for this kind of task. Therefore, in this challenge, we focus on constructing deep neural network architectures based on TDNN, ResNet and Res2Net blocks. Most of the developed systems consist of neural network embeddings paired with a PLDA backend. First, the speed perturbation method is applied to augment the data, and significant performance improvements are achieved. Then, we explore the use of the AM-softmax loss function and propose adding a CE-loss branch when training the model with the AM-softmax loss. In addition, the impact of score normalization on performance is investigated. The final system, a fusion of four systems, achieves minDCF 0.5342 and EER 5.05% on the task 1 evaluation set, and minDCF 0.5193 and EER 5.47% on the task 3 evaluation set.
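For readers unfamiliar with the AM-softmax loss named in this abstract, the following is a minimal sketch of the standard additive-margin formulation for a single training example. It subtracts a margin from the target-class cosine before scaling and applying softmax cross-entropy. The function name, the `scale` and `margin` defaults, and the toy cosines are assumptions for illustration; the abstract does not state the team's settings, and the sketch does not attempt to reproduce their additional CE-loss branch.

```python
import math


def am_softmax_loss(cosines, target, scale=30.0, margin=0.2):
    """Additive-margin softmax loss for one example.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true speaker class.
    scale, margin: assumed hyperparameters, not values from the abstract.
    """
    # Only the target class is penalized by the margin.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Hypothetical cosines for three speaker classes; class 0 is the true one.
toy_cosines = [0.8, 0.1, -0.3]
loss_with_margin = am_softmax_loss(toy_cosines, target=0)
loss_without_margin = am_softmax_loss(toy_cosines, target=0, margin=0.0)
```

With `margin=0.0` the expression reduces to ordinary scaled softmax cross-entropy, so the margin version is always at least as large for the same input.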

Unit selection synthesis based data augmentation for fixed phrase speaker verification

Houjun Huang, Xu Xiang, Fei Zhao (2021)

Data augmentation is commonly used to help build a robust speaker verification system, especially in limited-resource cases. However, conventional data augmentation methods usually focus on the diversity of the acoustic environment, leaving lexical variation neglected. For text-dependent speaker verification tasks, it is well known that preparing training data with the target transcript is the most effective way to build a well-performing system; however, collecting such data is time-consuming and expensive. In this work, we propose a unit selection synthesis based data augmentation method to leverage the abundant text-independent data resources. In this approach, the text-independent speech of each speaker is first broken up into speech segments that each contain one phone unit. Then, segments containing the phones in the target transcript are selected and concatenated in order to produce speech with the target transcript. Experiments are carried out on the AISHELL Speaker Verification Challenge 2019 database, and the results and analysis show that our proposed method can boost the system performance significantly.
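The segment-then-concatenate procedure described in this abstract can be sketched in a few lines. The sketch below groups one speaker's phone-level segments by phone label, then walks the target transcript and concatenates one matching segment per phone. Everything here is hypothetical: the function name, the data layout, and especially the choice of always taking the first available segment, since the abstract does not say how candidates are chosen.

```python
def synthesize_by_unit_selection(segments, target_phones):
    """Build a target-transcript utterance from one speaker's phone segments.

    segments:      list of (phone_label, samples) pairs cut from
                   text-independent speech of a single speaker.
    target_phones: phone sequence of the target transcript.
    Returns the concatenated samples, or None if a phone is missing.
    """
    # Index the speaker's segments by phone label.
    inventory = {}
    for phone, samples in segments:
        inventory.setdefault(phone, []).append(samples)

    utterance = []
    for phone in target_phones:
        candidates = inventory.get(phone)
        if not candidates:
            return None  # this speaker has no unit for the phone
        # Assumption: take the first candidate; a real system would rank them.
        utterance.extend(candidates[0])
    return utterance


# Toy inventory with short integer lists standing in for audio samples.
toy_segments = [("n", [1, 2]), ("i", [3]), ("h", [4, 5]), ("a", [6]), ("o", [7, 8])]
toy_utterance = synthesize_by_unit_selection(toy_segments, ["h", "a", "o"])
```

Returning `None` for an uncovered phone is one possible design choice; skipping the speaker for that transcript keeps the augmented data free of partial phrases.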

Supervised attention for speaker recognition

Seong Min Kye, Joon Son Chung, Hoirin Kim (2020)

The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, and its role is to select the most discriminative frames for speaker recognition. However, SAP underperforms the temporal average pooling (TAP) baseline in some settings, which implies that the attention is not learnt effectively in end-to-end training. To tackle this problem, we introduce strategies for training the attention mechanism in a supervised manner, learning the context vector from classified samples. With our proposed methods, the context vector can be boosted to select the most informative frames. We show that our method outperforms existing methods in various experimental settings, including short-utterance speaker recognition, and achieves competitive performance against existing baselines on the VoxCeleb datasets.
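To make the contrast between TAP and SAP concrete, here is a simplified sketch of both pooling operations over a sequence of frame-level feature vectors. It is an assumption-laden illustration rather than the authors' implementation: attention scores are taken as a plain dot product between each frame and the context vector, omitting the hidden projection and nonlinearity that SAP layers commonly apply first, and all names and toy values are hypothetical.

```python
import math


def temporal_average_pooling(frames):
    """TAP: every frame contributes equally to the utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def self_attentive_pooling(frames, context):
    """Simplified SAP: weight frames by softmax of their similarity to `context`."""
    # Relevance of each frame according to the context vector.
    scores = [sum(h * u for h, u in zip(frame, context)) for frame in frames]
    # Softmax over time, computed stably.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frames.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(len(context))
    ]
    return pooled, weights


# Three toy frames with two feature dimensions each.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tap_vector = temporal_average_pooling(toy_frames)
uniform_pooled, uniform_weights = self_attentive_pooling(toy_frames, [0.0, 0.0])
focused_pooled, focused_weights = self_attentive_pooling(toy_frames, [5.0, 0.0])
```

A zero context vector gives uniform weights, so SAP falls back to TAP; a context vector aligned with the first feature dimension shifts weight toward the frames that are strong in that dimension.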

Beijing ZKJ-NPU Speaker Verification System for VoxCeleb Speaker Recognition Challenge 2021

Li Zhang, Huan Zhao, Qinling Meng (2021)

In this report, we describe the Beijing ZKJ-NPU team submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We participated in the fully supervised speaker verification track 1 and track 2. In the challenge, we explored various kinds of advanced neural network structures with different pooling layers and objective loss functions. In addition, we introduced the ResNet-DTCF, CoAtNet and PyConv networks to advance the performance of CNN-based speaker embedding models. Moreover, we applied embedding normalization and score normalization at the evaluation stage. By fusing 11 and 14 systems, our final best performances (minDCF/EER) on the evaluation trials are 0.1205/2.8160% and 0.1175/2.8400% for track 1 and track 2, respectively. With our submission, we placed second in the challenge on both tracks.
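Score normalization, mentioned in this abstract, rescales a raw trial score using statistics of scores against a cohort of other speakers. The report does not say which variant the team used, so the sketch below shows symmetric normalization (S-norm) purely as one common example. The function name, argument names, and toy cohort scores are hypothetical.

```python
from statistics import mean, pstdev


def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (S-norm) for one verification trial.

    enroll_cohort_scores: scores of the enrollment embedding against a cohort.
    test_cohort_scores:   scores of the test embedding against the same cohort.
    """
    # Normalize with the enrollment-side cohort statistics.
    z = (raw_score - mean(enroll_cohort_scores)) / pstdev(enroll_cohort_scores)
    # Normalize with the test-side cohort statistics.
    t = (raw_score - mean(test_cohort_scores)) / pstdev(test_cohort_scores)
    # Average the two views of the trial.
    return 0.5 * (z + t)


# Toy cohort scores chosen so the arithmetic is easy to follow.
toy_normalized = s_norm(2.0, enroll_cohort_scores=[0.0, 2.0], test_cohort_scores=[1.0, 3.0])
```

In the toy call the enrollment-side term is 1 and the test-side term is 0, so the normalized score is their average.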

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

Xiaoyi Qin, Yaogen Yang, Lin Yang (2020)

In this paper, we focus on improving the performance of text-dependent speaker verification systems when training data is limited. Deep learning based text-dependent speaker verification systems generally need a large-scale text-dependent training data set, which can be costly and labor-intensive to collect, especially for customized new wake-up words. In recent studies, voice conversion systems that can generate high-quality synthesized speech of seen and unseen speakers have been proposed. Inspired by those works, we adopt two different voice conversion methods, as well as a very simple re-sampling approach, to generate new text-dependent speech samples for data augmentation. Experimental results show that the proposed method significantly improves the Equal Error Rate from 6.51% to 4.51% in the limited training data scenario.
