Most speaker verification tasks are studied in an open-set evaluation scenario that reflects real-world conditions. The ability to generalize to unseen speakers is therefore of paramount importance to the performance of a speaker verification system. We propose to apply Mean Teacher, a temporal averaging model, to extract speaker embeddings with small intra-class variance and large inter-class variance. The mean teacher network is a temporal average of the deep neural network's parameters; it can produce more accurate and stable representations than the weights obtained when training finishes. By learning from the reliable intermediate representations of the mean teacher network, the proposed method is expected to explore a more discriminative embedding space and improve the generalization performance of the speaker verification system. Experimental results on the VoxCeleb1 test set demonstrate that the proposed method improves performance by 11.61% relative to a baseline system.
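Below is a minimal PyTorch sketch of the temporal averaging this abstract describes, assuming the common mean-teacher recipe: the teacher's weights are an exponential moving average (EMA) of the student's, and the student is additionally pulled toward the teacher's embeddings. The decay value, the toy network, the MSE consistency term, and names such as `update_teacher` are illustrative assumptions, not details from the paper.

```python
import copy
import torch
import torch.nn.functional as F

def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   decay: float = 0.999) -> None:
    """EMA update: theta_T <- decay * theta_T + (1 - decay) * theta_S."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

# Toy embedding network standing in for the speaker encoder.
student = torch.nn.Sequential(torch.nn.Linear(40, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 192))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)   # the optimizer never touches the teacher

optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
features = torch.randn(8, 40)  # stand-in for a batch of acoustic features

student_emb = student(features)
with torch.no_grad():
    teacher_emb = teacher(features)  # in practice the two branches usually
                                     # see differently augmented inputs

# Consistency term: match the student to the temporally averaged (and
# therefore more stable) teacher embeddings. MSE here is an assumption.
loss = F.mse_loss(student_emb, teacher_emb)
loss.backward()
optimizer.step()
optimizer.zero_grad()
update_teacher(student, teacher)   # teacher tracks the student's weights
```

Because the teacher averages the student over many steps rather than taking the final checkpoint, its embeddings vary less from step to step, which is the stability the abstract appeals to.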
Meta-learning (ML) has recently become a research hotspot in speaker verification (SV). In this paper, we introduce two methods to improve meta-learning training for SV. For the first method, a backbone embedding network is first jointly trained with…
In this work, we introduce metric learning (ML) to enhance deep embedding learning for text-independent speaker verification (SV). Specifically, the deep speaker embedding network is trained with a conventional cross-entropy loss and an auxiliary pair…
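This snippet is cut off after "auxiliary pair", so the exact auxiliary objective is not visible; the sketch below assumes it is a pairwise metric loss that drives same-speaker cosine similarities toward 1 and different-speaker ones toward 0, combined with the cross-entropy term. All names, shapes, speaker counts, and the 0.1 weight are illustrative.

```python
import torch
import torch.nn.functional as F

# Toy embedding network and speaker classifier (1211 VoxCeleb1 speakers
# is used only as a plausible example).
embed_net = torch.nn.Sequential(torch.nn.Linear(40, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 192))
classifier = torch.nn.Linear(192, 1211)

features = torch.randn(16, 40)           # batch of acoustic features
labels = torch.randint(0, 1211, (16,))   # speaker identities

emb = embed_net(features)
ce_loss = F.cross_entropy(classifier(emb), labels)

# Auxiliary pairwise term: push cosine similarity of same-speaker pairs
# toward 1 and different-speaker pairs toward 0 (one common choice).
emb_n = F.normalize(emb, dim=1)
sim = emb_n @ emb_n.t()                           # pairwise cosine similarities
same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
mask = ~torch.eye(len(labels), dtype=torch.bool)  # ignore self-pairs
pair_loss = F.mse_loss(sim[mask], same[mask])

loss = ce_loss + 0.1 * pair_loss  # the weighting factor is an assumption
```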
In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervised speaker embedding network by maximizing agreement between…
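The snippet is truncated at "maximizing agreement between", which in self-supervised speaker learning typically means a contrastive objective over two augmented views of the same utterance. The sketch below shows one such objective, an NT-Xent-style loss; the paper's actual loss, augmentations, and hyperparameters are not visible here and may differ.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Normalized-temperature cross entropy over a batch of paired views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / tau                               # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    n = z1.size(0)
    # The positive for sample i is its other view at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Embeddings of two "views" of the same utterances, e.g. two differently
# augmented segments (placeholders; the paper's augmentation is not shown).
emb_a = torch.randn(8, 192)
emb_b = torch.randn(8, 192)
loss = nt_xent(emb_a, emb_b)
```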
This report describes the submission of the DKU-DukeECE team to the self-supervised speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn self-supervised speaker…
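As a rough picture of what an iterative labeling framework usually involves, here is a hedged sketch that alternates between (a) clustering the current embeddings into pseudo speaker labels and (b) retraining the network on those labels. The use of k-means, the cluster count, the round and step counts, and all names are assumptions for illustration; the report's actual pipeline is not reproduced here.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

embed_net = torch.nn.Sequential(torch.nn.Linear(40, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 192))
classifier = torch.nn.Linear(192, 50)
optimizer = torch.optim.Adam(list(embed_net.parameters()) +
                             list(classifier.parameters()), lr=1e-3)
features = torch.randn(500, 40)  # stand-in for utterance-level features

for round_idx in range(3):  # a few labeling rounds
    # Step 1: cluster the current embeddings to get pseudo speaker labels.
    with torch.no_grad():
        emb = embed_net(features)
    pseudo = torch.as_tensor(
        KMeans(n_clusters=50, n_init=10).fit_predict(emb.numpy()),
        dtype=torch.long)
    # Step 2: retrain the embedding network with the pseudo-labels.
    for _ in range(50):
        loss = F.cross_entropy(classifier(embed_net(features)), pseudo)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```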
Recently, self-supervised learning has emerged as an effective approach to improving the performance of automatic speech recognition (ASR). Under such a framework, the neural network is usually pre-trained on massive unlabeled data and then fine-tuned…
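A minimal sketch of the pre-train/fine-tune recipe this abstract refers to, with a placeholder encoder standing in for the self-supervised model and frame-level cross entropy standing in for a real ASR loss such as CTC; nothing in this snippet is specific to the paper.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 256))
# Pretend `encoder` was already pre-trained on unlabeled audio with a
# self-supervised objective, and its weights were loaded, e.g.:
# encoder.load_state_dict(torch.load("pretrained_encoder.pt"))

head = torch.nn.Linear(256, 30)  # task head, e.g. 30 output tokens

# Fine-tuning: one common variant keeps the pre-trained encoder frozen at
# first and trains only the task head on the (small) labeled set.
for p in encoder.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

features = torch.randn(8, 80)           # labeled batch (placeholder)
targets = torch.randint(0, 30, (8,))
loss = F.cross_entropy(head(encoder(features)), targets)
loss.backward()
optimizer.step()
```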