With the acceleration of globalization, more and more people are willing or required to learn second languages (L2). One of the major remaining challenges facing current mispronunciation detection and diagnosis (MDD) models for use in computer-assisted pronunciation training (CAPT) is to handle speech from L2 learners with a diverse set of accents. In this paper, we set out to mitigate the adverse effects of accent variety when building an L2 English MDD system with end-to-end (E2E) neural models. To this end, we first propose an effective modeling framework that infuses accent features into an E2E MDD model, thereby making the model more accent-aware. Going a step further, we design and present disparate accent-aware modules that modulate acoustic features in a fine-grained manner, so as to enhance the discriminating capability of the resulting MDD model. Extensive experiments conducted on the L2-ARCTIC benchmark dataset show the merits of our MDD model in comparison to several strong E2E baselines and the celebrated pronunciation-scoring-based method.
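The abstract does not include implementation details, but a minimal PyTorch-style sketch of one plausible form of fine-grained accent-aware modulation is given below: FiLM-style per-channel scaling and shifting of acoustic frames conditioned on an accent embedding. The class name, dimensions, and the FiLM formulation itself are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch of accent-aware modulation (FiLM-style),
# not the authors' published implementation.
import torch
import torch.nn as nn

class AccentAwareModulation(nn.Module):
    def __init__(self, feat_dim: int, accent_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the accent vector.
        self.to_gamma = nn.Linear(accent_dim, feat_dim)
        self.to_beta = nn.Linear(accent_dim, feat_dim)

    def forward(self, acoustic: torch.Tensor, accent: torch.Tensor) -> torch.Tensor:
        # acoustic: (batch, time, feat_dim); accent: (batch, accent_dim)
        gamma = self.to_gamma(accent).unsqueeze(1)  # (batch, 1, feat_dim)
        beta = self.to_beta(accent).unsqueeze(1)
        return acoustic * (1.0 + gamma) + beta      # modulate every frame

# Example: modulate 80-dim filter-bank features with a 32-dim accent embedding.
mod = AccentAwareModulation(feat_dim=80, accent_dim=32)
frames = torch.randn(4, 200, 80)
accent_vec = torch.randn(4, 32)
out = mod(frames, accent_vec)  # same shape as `frames`
```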
How to effectively incorporate cross-utterance information cues into a neural language model (LM) has emerged as one of the intriguing issues for automatic speech recognition (ASR). Existing research efforts on improving the contextualization of an LM typically regard previous utterances as a sequence of additional input and may fail to capture the complex global structural dependencies among these utterances. In view of this, we seek to represent the historical context of an utterance as graph-structured data, so as to distill cross-utterance, global word interaction relationships. To this end, we apply a graph convolutional network (GCN) to the resulting graph to obtain the corresponding GCN embeddings of historical words. GCN has recently found versatile applications in social-network analysis, text summarization, and other tasks, due mainly to its ability to effectively capture rich relational information among elements; it nevertheless remains largely underexplored in the context of ASR, especially for conversational speech. In addition, we frame ASR N-best reranking as a prediction problem, leveraging bidirectional encoder representations from transformers (BERT) as the vehicle to not only seize the local intrinsic word regularity patterns inherent in a candidate hypothesis but also incorporate the cross-utterance, historical word interaction cues distilled by the GCN, so as to promote performance. Extensive experiments conducted on the AMI benchmark dataset seem to confirm the pragmatic utility of our methods in relation to some current top-of-the-line methods.
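As a concrete illustration of the GCN component, here is a minimal sketch of a single GCN layer applied to a word graph built from previous utterances. The graph construction, dimensions, and layer design are assumptions made for the example; the abstract does not specify them.

```python
# Hedged sketch: one GCN layer over a cross-utterance word graph,
# producing embeddings of historical words for a downstream reranker.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_words, in_dim) node features; adj: (num_words, num_words)
        a_hat = adj + torch.eye(adj.size(0))           # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return F.relu(self.weight(norm_adj @ h))       # propagate, then transform

# Example: 10 historical words with 128-dim input embeddings.
gcn = GCNLayer(128, 128)
word_emb = torch.randn(10, 128)
adjacency = (torch.rand(10, 10) > 0.7).float()  # toy co-occurrence graph
hist_emb = gcn(word_emb, adjacency)  # GCN embeddings of historical words
```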
This paper describes the NTNU ASR system participating in the Formosa Speech Recognition Challenge 2020 (FSR-2020), supported by the Formosa Speech in the Wild (FSW) project. FSR-2020 aims at fostering the development of Taiwanese speech recognition. Apart from the tonal and dialectal variations of the Taiwanese language, speech artificially contaminated with different types of real-world noise also has to be dealt with in the final test stage; all of this makes FSR-2020 much more challenging than before. To work around the under-resourced issue, the main technical aspects of our ASR system include various deep learning techniques, such as transfer learning, semi-supervised learning, front-end speech enhancement, and model ensembling, as well as data cleansing and data augmentation conducted on the training data. With the best configuration, our system obtains a 13.1% syllable error rate (SER) on the final-test set, achieving first place among all participating systems on Track 3.
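Since SER is the headline metric here, a short self-contained sketch of how syllable error rate is conventionally computed (Levenshtein edit distance over syllable sequences, normalized by reference length) may be useful; the example syllables are invented for illustration.

```python
# Standard SER computation: (substitutions + deletions + insertions) / ref length.
def syllable_error_rate(ref: list[str], hyp: list[str]) -> float:
    # Dynamic-programming edit distance between the two syllable sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one syllable substituted out of four -> SER = 0.25
print(syllable_error_rate(["gua2", "si7", "tai5", "uan5"],
                          ["gua2", "si7", "tai7", "uan5"]))
```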
In this report, we describe our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. Two approaches are adopted. One is to apply query expansion to speaker verification, which shows significant improvement over the baseline in our study. The other is to use Kaldi to extract x-vectors and combine their Probabilistic Linear Discriminant Analysis (PLDA) scores with ResNet scores.
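The report does not give the fusion recipe, but score-level fusion of this kind is often a normalized weighted sum per trial. The sketch below assumes z-normalization and an equal-weight combination; both choices are illustrative, not the submission's actual configuration.

```python
# Hypothetical score-level fusion of PLDA (x-vector) and ResNet scores.
import numpy as np

def fuse_scores(plda: np.ndarray, resnet: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Z-normalize each system so the scores live on a comparable scale,
    # then take a convex combination controlled by `alpha`.
    plda_n = (plda - plda.mean()) / plda.std()
    resnet_n = (resnet - resnet.mean()) / resnet.std()
    return alpha * plda_n + (1.0 - alpha) * resnet_n

# Example: fuse scores for five verification trials.
fused = fuse_scores(np.array([1.2, -0.3, 2.1, 0.4, -1.0]),
                    np.array([0.8, -0.5, 1.7, 0.1, -0.9]))
```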
This paper describes the NTNU ASR system participating in the Interspeech 2020 Non-Native Children's Speech ASR Challenge, supported by the SIG-CHILD group of ISCA. This shared task is made much more challenging by the coexisting diversity of non-native and child speaking characteristics. In the closed-track evaluation, all participants were restricted to developing their systems solely on the speech and text corpora provided by the organizer. To work around this under-resourced issue, we built our ASR system on top of CNN-TDNNF-based acoustic models, while harnessing the synergistic power of various data augmentation strategies, including utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model, trained solely on the text dataset released by the organizer, to rescore the first-pass recognition hypotheses. Our system with the best configuration came out in second place with a word error rate (WER) of 17.59%, while the top-performing, second runner-up, and official baseline systems achieved 15.67%, 18.71%, and 35.09%, respectively.
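To make the spectrogram-augmentation step concrete, here is a minimal sketch in the spirit of SpecAugment (random frequency and time masking). The mask widths are illustrative placeholders, not the system's actual hyperparameters.

```python
# Hedged sketch of SpecAugment-style masking on a log-mel spectrogram.
import numpy as np

def spec_augment(spec: np.ndarray, freq_mask: int = 8, time_mask: int = 20) -> np.ndarray:
    # spec: (num_frames, num_mels); returns a masked copy.
    out = spec.copy()
    t, f = out.shape
    # Frequency mask: zero out a random band of mel channels.
    f0 = np.random.randint(0, max(1, f - freq_mask))
    out[:, f0:f0 + np.random.randint(0, freq_mask + 1)] = 0.0
    # Time mask: zero out a random span of frames.
    t0 = np.random.randint(0, max(1, t - time_mask))
    out[t0:t0 + np.random.randint(0, time_mask + 1), :] = 0.0
    return out

# Example: augment a 300-frame, 80-mel spectrogram.
augmented = spec_augment(np.random.randn(300, 80))
```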