USTC-NELSLIP System Description for DIHARD-III Challenge

63 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yuxuan Wang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Yuxuan Wang - Maokui He - Shutong Niu

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

This system description describes our submission system to the Third DIHARD Speech Diarization Challenge. Besides the traditional clustering based system, the innovation of our system lies in the combination of various front-end techniques to solve the diarization problem, including speech separation and target-speaker based voice activity detection (TS-VAD), combined with iterative data purification. We also adopted audio domain classification to design domain-dependent processing. Finally, we performed post processing to do system fusion and selection. Our best system achieved DERs of 11.30% in track 1 and 16.78% in track 2 on evaluation set, respectively.

قيم البحث

57 - Yi Liu , Tianyu Liang , Can Xu 2019

This paper describes the systems submitted by the department of electronic engineering, institute of microelectronics of Tsinghua university and TsingMicro Co. Ltd. (THUEE) to the NIST 2019 speaker recognition evaluation CTS challenge. Six subsystems , including etdnn/ams, ftdnn/as, eftdnn/ams, resnet, multitask and c-vector are developed in this evaluation.

أنظمة الصوت في الحاسوب الحساب واللغة معالجة الصوت والكلام

The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021

181 - Dan Liu , Mengge Du , Xiaoxi Li 2021

This paper describes USTC-NELSLIPs submissions to the IWSLT2021 Simultaneous Speech Translation task. We proposed a novel simultaneous translation model, Cross Attention Augmented Transducer (CAAT), which extends conventional RNN-T to sequence-to-seq uence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks shows CAAT achieves better quality-latency trade-offs compared to textit{wait-k}, one of the previous state-of-the-art approaches. Based on CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems in this evaluation campaign. Compared to last years optimal systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU for all latency regimes, and our T2T simultaneous translation system improves by an average of 4.6 BLEU.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

The NPU System for the 2020 Personalized Voice Trigger Challenge

154 - Jingyong Hou , Li Zhang , Yihui Fu 2021

This paper describes the system developed by the NPU team for the 2020 personalized voice trigger challenge. Our submitted system consists of two independently trained subsystems: a small footprint keyword spotting (KWS) system and a speaker verifica tion (SV) system. For the KWS system, a multi-scale dilated temporal convolutional (MDTC) network is proposed to detect wake-up word (WuW). For SV system, Write something here. The KWS predicts posterior probabilities of whether an audio utterance contains WuW and estimates the location of WuW at the same time. When the posterior probability ofWuW reaches a predefined threshold, the identity information of triggered segment is determined by the SV system. On evaluation dataset, our submitted system obtains detection costs of 0.081and 0.091 in close talking and far-field tasks, respectively.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

The DKU-Duke-Lenovo System Description for the Third DIHARD Speech Diarization Challenge

223 - Weiqing Wang , Qingjian Lin , Danwei Cai 2021

In this paper, we present the submitted system for the third DIHARD Speech Diarization Challenge from the DKU-Duke-Lenovo team. Our system consists of several modules: voice activity detection (VAD), segmentation, speaker embedding extraction, attent ive similarity scoring, agglomerative hierarchical clustering. In addition, the target speaker VAD (TSVAD) is used for the phone call data to further improve the performance. Our final submitted system achieves a DER of 15.43% for the core evaluation set and 13.39% for the full evaluation set on task 1, and we also get a DER of 21.63% for core evaluation set and 18.90% for full evaluation set on task 2.

معالجة الصوت والكلام أنظمة الصوت في الحاسوب

Query Expansion System for the VoxCeleb Speaker Recognition Challenge 2020

121 - Yu-Sen Cheng , Chun-Liang Shih , Tien-Hong Lo 2020

In this report, we describe our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. Two approaches are adopted. One is to apply query expansion on speaker verification, which shows significant progress compared to baseline in the study. Another is to use Kaldi extract x-vector and to combine its Probabilistic Linear Discriminant Analysis (PLDA) score with ResNet score.

أنظمة الصوت في الحاسوب الحساب واللغة معالجة الصوت والكلام