The ByteDance Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2021

Published by Xudong Mao
Publication date: 2021
Language: English





This paper describes the ByteDance speaker diarization system for the fourth track of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). VoxSRC-21 provides both the dev set and the test set of VoxConverse for validation, and a standalone test set for evaluation. We first collect the duration and signal-to-noise ratio (SNR) of all audio and find that the distributions of the VoxConverse test set and the VoxSRC-21 test set are closer to each other. Our system consists of voice activity detection (VAD), speaker embedding extraction, spectral clustering followed by a re-clustering step based on agglomerative hierarchical clustering (AHC), and overlapped speech detection and handling. Finally, we integrate systems with different time scales using DOVER-Lap. Our best system achieves a diarization error rate (DER) of 5.15% on the evaluation set, ranking second in the diarization track of the challenge.
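To make the DER figure quoted above concrete, here is a minimal sketch of a frame-level DER computation. It is a simplification for illustration, not the challenge's scoring tool: it assumes both the reference and the system output have already been converted to one set of speaker labels per fixed-length frame, it ignores scoring collars, and it finds the best speaker mapping by brute force, which is only practical for a handful of speakers. The function name `frame_der` and the toy labels are hypothetical.

```python
from itertools import permutations


def frame_der(ref, hyp):
    """Frame-level diarization error rate (illustrative simplification).

    ref, hyp: lists of sets of speaker labels, one set per frame.
    An empty set means no speech; a set with two labels means overlap.
    Labels must not be None (None is used internally as "unmapped").
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same frames")
    total_ref = sum(len(r) for r in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    ref_speakers = sorted(set().union(*ref))
    hyp_speakers = sorted(set().union(*hyp))
    # Pad with None so that surplus hypothesis speakers can stay unmapped.
    targets = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))

    best_error = None
    # Try every one-to-one mapping of hypothesis labels to reference labels.
    for perm in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        error = 0
        for r, h in zip(ref, hyp):
            mapped = {mapping[s] for s in h} - {None}
            correct = len(r & mapped)
            # Covers missed speech, false alarm, and speaker confusion.
            error += max(len(r), len(h)) - correct
        if best_error is None or error < best_error:
            best_error = error
    return best_error / total_ref


# Toy example: six frames, one overlapped frame, one non-speech frame.
reference = [{"A"}, {"A"}, {"B"}, {"B"}, {"A", "B"}, set()]
hypothesis = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}, {"y"}]
print(frame_der(reference, hypothesis))  # prints 0.5
```

In the toy example the best mapping is x to A and y to B, which leaves one confused frame, one missed speaker in the overlapped frame, and one false alarm: three errors over six reference speaker-frames.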

Read also

In this report, we describe the Beijing ZKJ-NPU team submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We participated in the fully supervised speaker verification tracks 1 and 2. In the challenge, we explored various kinds of advanced neural network structures with different pooling layers and objective loss functions. In addition, we introduced the ResNet-DTCF, CoAtNet and PyConv networks to advance the performance of CNN-based speaker embedding models. Moreover, we applied embedding normalization and score normalization at the evaluation stage. By fusing 11 and 14 systems, our final best performances (minDCF/EER) on the evaluation trials are 0.1205/2.8160% and 0.1175/2.8400% for tracks 1 and 2, respectively. With our submission, we took second place in the challenge for both tracks.
Miao Zhao, Yufeng Ma, Min Liu (2021)
This report describes our submission to tracks 1 and 2 of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC 2021). Both tracks share the same speaker verification system, which uses only VoxCeleb2-dev as its training set. This report explores several parts, including data augmentation, network structures, domain-based large margin fine-tuning, and back-end refinement. Our system is a fusion of 9 models and achieves first place in these two tracks of VoxSRC 2021. The minDCF of our submission is 0.1034, and the corresponding EER is 1.8460%.
This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems with the data set provided by the VoxSRC Challenge 2020, which contains real-life multi-talker audio collected from YouTube. Our best system achieves diarization error rates (DER) of 3.71% and 6.23% on the development set and evaluation set, respectively, ranking 1st in the diarization track of the challenge.
This paper describes the XMUSPEECH speaker recognition and diarisation systems for the VoxCeleb Speaker Recognition Challenge 2021. For track 2, we evaluate two systems, ResNet34-SE and ECAPA-TDNN. For track 4, an important part of our system is the VAD module, which greatly improves the performance. Our best submission on track 4 obtained a DER of 5.54% and a JER of 27.11% on the evaluation set, while the performance on the development set is a DER of 2.92% and a JER of 20.84%.
Xu Xiang (2020)
This report describes the systems submitted to the first and second tracks of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020, which ranked second in both tracks. Three key points of the system pipeline are explored: (1) investigating multiple CNN architectures including ResNet, Res2Net and dual path network (DPN) to extract the x-vectors, (2) using a composite angular margin softmax loss to train the speaker models, and (3) applying score normalization and system fusion to boost the performance. Measured on the VoxSRC-20 Eval set, the best submitted systems achieve an EER of 3.808% and a MinDCF of 0.1958 in the closed-condition track 1, and an EER of 3.798% and a MinDCF of 0.1942 in the open-condition track 2.