Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Bayesian Subspace HMM for the Zerospeech 2020 Challenge

226 0 0.0 ( 0 )

Download Cite

Added by Bolaji Yusuf

Publication date 2020

fields Electronic Engineering Informatics Engineering

and research's language is English

Authors Bolaji Yusuf - Lucas Ondel

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this paper we describe our submission to the Zerospeech 2020 challenge, where the participants are required to discover latent representations from unannotated speech, and to use those representations to perform speech synthesis, with synthesis quality used as a proxy metric for the unit quality. In our system, we use the Bayesian Subspace Hidden Markov Model (SHMM) for unit discovery. The SHMM models each unit as an HMM whose parameters are constrained to lie in a low dimensional subspace of the total parameter space which is trained to model phonetic variability. Our system compares favorably with the baseline on the human-evaluated character error rate while maintaining significantly lower unit bitrate.

rate research

The NTNU System at the Interspeech 2020 Non-Native Childrens Speech ASR Challenge

440 - Tien-Hong Lo , Fu-An Chao , Shi-Yan Weng 2020

This paper describes the NTNU ASR system participating in the Interspeech 2020 Non-Native Childrens Speech ASR Challenge supported by the SIG-CHILD group of ISCA. This ASR shared task is made much more challenging due to the coexisting diversity of non-native and children speaking characteristics. In the setting of closed-track evaluation, all participants were restricted to develop their systems merely based on the speech and text corpora provided by the organizer. To work around this under-resourced issue, we built our ASR system on top of CNN-TDNNF-based acoustic models, meanwhile harnessing the synergistic power of various data augmentation strategies, including both utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model to rescore the first-pass recognition hypotheses, which was trained solely on the text dataset released by the organizer. Our system with the best configuration came out in second place, resulting in a word error rate (WER) of 17.59 %, while those of the top-performing, second runner-up and official baseline systems are 15.67%, 18.71%, 35.09%, respectively.

Audio and Speech Processing Computation and Language Sound

The DKU-DukeECE Systems for VoxCeleb Speaker Recognition Challenge 2020

99 - Weiqing Wang , Danwei Cai , Xiaoyi Qin 2020

In this paper, we present the system submission for the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) by the DKU-DukeECE team. For track 1, we explore various kinds of state-of-the-art front-end extractors with different pooling layers and objective loss functions. For track 3, we employ an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). For track 4, we investigate the whole system pipeline for speaker diarization, including voice activity detection (VAD), uniform segmentation, speaker embedding extraction, and clustering.

Audio and Speech Processing

Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020

406 - Hee Soo Heo , Bong-Jin Lee , Jaesung Huh 2020

This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing. We release the training code and pre-trained models as unofficial baselines for this years challenge.

Audio and Speech Processing Sound

The INTERSPEECH 2020 Far-Field Speaker Verification Challenge

80 - Xiaoyi Qin , Ming Li , Hui Bu 2020

The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent speaker verification from single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three tasks pose a cross-channel challenge to the participants. To simulate the real-life scenario, the enrollment utterances are recorded from close-talk cellphone, while the test utterances are recorded from the far-field microphone arrays. In this paper, we describe the database, the challenge, and the baseline system, which is based on a ResNet-based deep speaker network with cosine similarity scoring. For a given utterance, the speaker embeddings of different channels are equally averaged as the final embedding. The baseline system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for task 1, task 2, and task 3, respectively.

Audio and Speech Processing Sound

UIAI System for Short-Duration Speaker Verification Challenge 2020

344 - Md Sahidullah , Achintya Kumar Sarkar , Ville Vestman 2020

In this work, we present the system description of the UIAI entry for the short-duration speaker verification (SdSV) challenge 2020. Our focus is on Task 1 dedicated to text-dependent speaker verification. We investigate different feature extraction and modeling approaches for automatic speaker verification (ASV) and utterance verification (UV). We have also studied different fusion strategies for combining UV and ASV modules. Our primary submission to the challenge is the fusion of seven subsystems which yields a normalized minimum detection cost function (minDCF) of 0.072 and an equal error rate (EER) of 2.14% on the evaluation set. The single system consisting of a pass-phrase identification based model with phone-discriminative bottleneck features gives a normalized minDCF of 0.118 and achieves 19% relative improvement over the state-of-the-art challenge baseline.

Audio and Speech Processing Computer Vision and Pattern Recognition Sound

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Bayesian Subspace HMM for the Zerospeech 2020 Challenge

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions