Traditionally, the performance of non-native mispronunciation verification systems has relied on effective phone-level labelling of non-native corpora. In this study, a multi-view approach is proposed to incorporate discriminative feature representations that require less annotation for non-native mispronunciation verification of Mandarin. Models are jointly learned to embed acoustic sequences and multi-source information from speech attributes and bottleneck features. Bidirectional LSTM embedding models with contrastive losses map acoustic sequences and multi-source information into fixed-dimensional embeddings. The distance between acoustic embeddings is taken as the similarity between phones; accordingly, mispronounced phones are expected to have low similarity scores with their canonical pronunciations. The approach improves diagnostic accuracy on a mispronunciation verification task by +11.23% over a GOP-based approach and by +1.47% over a single-view approach.
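The core mechanics of that abstract (a BiLSTM mapping variable-length acoustic sequences to fixed-dimensional embeddings, trained with a contrastive loss, with embedding distance read as phone similarity) can be sketched briefly. The following is a minimal PyTorch sketch under assumptions of mine, not the paper's code: the feature dimension, layer sizes, margin, and names such as `AcousticEmbedder` are illustrative, and the paper's multi-source attribute/bottleneck inputs are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticEmbedder(nn.Module):
    """BiLSTM mapping a variable-length feature sequence to a fixed-size, unit-norm embedding."""
    def __init__(self, feat_dim=40, hidden=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic features, e.g. filterbanks.
        _, (h, _) = self.lstm(x)
        # h: (num_layers * 2, batch, hidden); concatenate the top layer's
        # forward (h[-2]) and backward (h[-1]) final states.
        emb = self.proj(torch.cat([h[-2], h[-1]], dim=-1))
        return F.normalize(emb, dim=-1)

def contrastive_loss(e1, e2, same, margin=0.5):
    # same: 1.0 where the pair matches the canonical pronunciation, else 0.0.
    # Matched pairs are pulled together; mismatched pairs are pushed past the
    # margin, so mispronounced phones end up far from their canonical form.
    dist = (e1 - e2).pow(2).sum(dim=-1).sqrt()
    return (same * dist.pow(2) +
            (1.0 - same) * F.relu(margin - dist).pow(2)).mean()

# Toy usage: embedding distance acts as the phone similarity score.
model = AcousticEmbedder()
learner = model(torch.randn(8, 50, 40))     # learner phone segments
canonical = model(torch.randn(8, 50, 40))   # canonical reference segments
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(learner, canonical, labels)
```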
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions …
In this work, we introduce metric learning (ML) to enhance deep embedding learning for text-independent speaker verification (SV). Specifically, the deep speaker embedding network is trained with conventional cross entropy loss and auxiliary pair …
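Because that abstract is cut off mid-phrase ("auxiliary pair …"), the sketch below assumes the auxiliary term is a simple pairwise cosine loss added to the cross-entropy objective; `SpeakerNet`, `alpha`, and all dimensions are hypothetical rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerNet(nn.Module):
    """Toy embedding network; a real system would use a TDNN/ResNet encoder."""
    def __init__(self, feat_dim=40, emb_dim=128, num_speakers=1000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, emb_dim))
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, x):
        emb = F.normalize(self.encoder(x), dim=-1)  # unit-norm embedding
        return emb, self.classifier(emb)

def joint_loss(emb, logits, labels, alpha=0.1):
    # Conventional speaker-classification cross entropy ...
    ce = F.cross_entropy(logits, labels)
    # ... plus an assumed auxiliary pairwise term: raise cosine similarity
    # for same-speaker pairs in the batch, suppress it for different speakers.
    sim = emb @ emb.t()
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    off_diag = 1.0 - torch.eye(len(labels))
    pair = (off_diag * (same * (1.0 - sim) + (1.0 - same) * F.relu(sim))).mean()
    return ce + alpha * pair

# Toy usage on random features.
net = SpeakerNet()
x = torch.randn(16, 40)
labels = torch.randint(0, 1000, (16,))
emb, logits = net(x)
loss = joint_loss(emb, logits, labels)
```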
Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. However, conventional CAPT methods cannot effectively use non-native utterances for supervised training because the ground truth pronunciation needs expensive …
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classic …
The goal of this paper is text-independent speaker verification where utterances come from in-the-wild videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embedding …