This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, frame-level multi-task learning is combined with segment-level adversarial learning for speaker embedding extraction. Phoneme-aware attentive pooling is applied to the frame-level features in the main speaker-classification network, weighted by the corresponding posterior probabilities of the phoneme distribution produced by the auxiliary subnet. Further, a Squeeze-and-Excitation block (SE-block) performs dynamic channel-wise feature recalibration, which improves the representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and the SE-block from the temporal and channel-wise aspects, respectively. Experiments conducted on the RSR2015 Part 1 database confirm that the proposed system achieves outstanding results for text-dependent SV.
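Read as a hedged illustration rather than the authors' implementation: the PyTorch sketch below shows one plausible form of the two components the abstract names, a Squeeze-and-Excitation block for channel-wise recalibration and a phoneme-aware attentive pooling layer driven by phoneme posteriors. The tensor shapes, the class names SEBlock and PhonemeAwareAttentivePooling, and the linear mapping from posteriors to frame weights are assumptions, not details from the paper.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: dynamic channel-wise feature recalibration."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        s = x.mean(dim=2)               # squeeze: global average over time
        w = self.fc(s).unsqueeze(-1)    # excitation: per-channel weights in (0, 1)
        return x * w                    # recalibrate channels

class PhonemeAwareAttentivePooling(nn.Module):
    """Pool frame-level features with weights derived from phoneme posteriors."""
    def __init__(self, n_phonemes: int):
        super().__init__()
        # Hypothetical choice: a linear map from posteriors to a scalar frame score.
        self.score = nn.Linear(n_phonemes, 1)

    def forward(self, x: torch.Tensor, phoneme_post: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); phoneme_post: (batch, n_phonemes, frames)
        a = self.score(phoneme_post.transpose(1, 2))  # (batch, frames, 1)
        a = torch.softmax(a, dim=1)                   # attention weights over frames
        return torch.bmm(x, a).squeeze(-1)            # utterance embedding (batch, channels)

# Toy usage with random tensors standing in for frame features and phoneme posteriors.
x = torch.randn(4, 256, 100)
p = torch.softmax(torch.randn(4, 40, 100), dim=1)
emb = PhonemeAwareAttentivePooling(40)(SEBlock(256)(x), p)
print(emb.shape)  # torch.Size([4, 256])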
In this paper, we propose a new differentiable neural network alignment mechanism for text-dependent speaker verification which uses alignment models to produce a supervector representation of an utterance. Unlike previous works with a similar approach
Vehicle re-identification (re-ID) matches images of the same vehicle across different cameras. It is fundamentally challenging because the dramatically different appearance caused by different viewpoints would make the framework fail to match two veh
Open-set speaker recognition can be regarded as a metric learning problem, which is to maximize inter-class variance and minimize intra-class variance. Supervised metric learning can be categorized into entity-based learning and proxy-based learning.
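As a rough illustration of the distinction drawn here (not code from the cited work): an entity-based loss such as the triplet loss compares sample embeddings with other sample embeddings, while a proxy-based loss compares them with learnable per-class proxy vectors. The ProxyLoss class, the cosine-similarity formulation, and all shapes below are assumptions made for the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Entity-based example: a triplet loss computed between embeddings of real samples.
triplet = nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
entity_loss = triplet(anchor, positive, negative)

class ProxyLoss(nn.Module):
    """Proxy-based example: each class is represented by one learnable proxy vector."""
    def __init__(self, n_classes: int, dim: int, scale: float = 16.0):
        super().__init__()
        self.proxies = nn.Parameter(F.normalize(torch.randn(n_classes, dim), dim=1))
        self.scale = scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity of each embedding to every class proxy,
        # trained with a scaled softmax cross-entropy against the true class.
        sims = F.normalize(emb, dim=1) @ F.normalize(self.proxies, dim=1).T
        return F.cross_entropy(self.scale * sims, labels)

proxy_loss = ProxyLoss(n_classes=10, dim=128)(torch.randn(8, 128), torch.randint(0, 10, (8,)))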
This paper explores two techniques to improve the performance of text-dependent speaker verification systems based on deep neural networks. Firstly, we propose a general alignment mechanism to keep the temporal structure of each phrase and obtain a s
In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information -- onset and prior coarse duration. We propose a two-step method. In the first step, we jointly calcula