ﻻ يوجد ملخص باللغة العربية
Speaker diarization is an important problem that is topical, and is especially useful as a preprocessor for conversational speech related applications. The objective of this paper is two-fold: (i) segment initialization by uniformly distributing speaker information across the initial segments, and (ii) incorporating speaker discriminative features within the unsupervised diarization framework. In the first part of the work, a varying length segment initialization technique for Information Bottleneck (IB) based speaker diarization system using phoneme rate as the side information is proposed. This initialization distributes speaker information uniformly across the segments and provides a better starting point for IB based clustering. In the second part of the work, we present a Two-Pass Information Bottleneck (TPIB) based speaker diarization system that incorporates speaker discriminative features during the process of diarization. The TPIB based speaker diarization system has shown improvement over the baseline IB based system. During the first pass of the TPIB system, a coarse segmentation is performed using IB based clustering. The alignments obtained are used to generate speaker discriminative features using a shallow feed-forward neural network and linear discriminant analysis. The discriminative features obtained are used in the second pass to obtain the final speaker boundaries. In the final part of the paper, variable segment initialization is combined with the TPIB framework. This leverages the advantages of better segment initialization and speaker discriminative features that results in an additional improvement in performance. An evaluation on standard meeting datasets shows that a significant absolute improvement of 3.9% and 4.7% is obtained on the NIST and AMI datasets, respectively.
The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. TPIB system does not consider previously learned speaker discriminative information while diarizing new convers
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition. We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embe
Speaker diarization relies on the assumption that speech segments corresponding to a particular speaker are concentrated in a specific region of the speaker space; a region which represents that speakers identity. These identities are not known a pri
Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of mode
This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge(VoxSRC) 2020. We will first explain our system design to