General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework

93 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Chong Luo

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Yucheng Zhao - Dacheng Yin - Chong Luo

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning. In the design of MGF, speech hierarchy is taken into consideration. Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales. For phoneme-scale learning, we borrow idea from the masked language model but tailor it for the continuous speech signal by replacing classification loss with a contrastive loss. We corroborate our design by evaluating MGF representation on various downstream tasks, including phoneme classification, speaker classification, speech recognition, and emotion classification. Experiments verify that training at different time scales needs different training targets and loss functions, which in general complement each other and lead to a better performance.

قيم البحث

116 - Maurice Gerczuk , Shahin Amiriparian , Sandra Ottl 2021

In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EmoSet, is assembled from a number of existing SER corpora. In total, Emo Set contains 84181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EmoNet. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EmoSet. Compared against two suitable baselines and more traditional training and transfer settings for the ResNet, the residual adapter approach enables parameter efficient training of a multi-domain SER model on all 26 corpora. A shared model with only $3.5$ times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in EmoSet. Measured by McNemars test, these improvements are further significant for ten datasets at $p<0.05$ while there are just two corpora that see only significant decreases across the residual adapter transfer experiments. Finally, we make our EmoNet framework publicly available for users and developers at https://github.com/EIHW/EmoNet. EmoNet provides an extensive command line interface which is comprehensively documented and can be used in a variety of multi-corpus transfer learning settings.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

Fast Development of ASR in African Languages using Self Supervised Speech Representation Learning

273 - Jama Hussein Mohamud , Lloyd Acquaye Thompson , Aissatou Ndoye 2021

This paper describes the results of an informal collaboration launched during the African Master of Machine Intelligence (AMMI) in June 2020. After a series of lectures and labs on speech data collection using mobile applications and on self-supervis ed representation learning from speech, a small group of students and the lecturer continued working on automatic speech recognition (ASR) project for three languages: Wolof, Ga, and Somali. This paper describes how data was collected and ASR systems developed with a small amount (1h) of transcribed speech as training data. In these low resource conditions, pre-training a model on large amounts of raw speech was fundamental for the efficiency of ASR systems developed.

أنظمة الصوت في الحاسوب الحساب واللغة معالجة الصوت والكلام

LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech

421 - Solene Evain , Ha Nguyen , Hang Le 2021

Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing. Recent works also investigated SSL from speech. They were notably successful to improve performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient speech systems, their evaluation was mostly made on ASR and using multiple and heterogeneous experimental settings (most of them for English). This questions the objective comparison of SSL approaches and the evaluation of their impact on building speech systems. In this paper, we propose LeBenchmark: a reproducible framework for assessing SSL from speech. It not only includes ASR (high and low resource) tasks but also spoken language understanding, speech translation and emotion recognition. We also focus on speech technologies in a language different than English: French. SSL models of different sizes are trained from carefully sourced and documented datasets. Experiments show that SSL is beneficial for most but not all tasks which confirms the need for exhaustive and reliable benchmarks to evaluate its real impact. LeBenchmark is shared with the scientific community for reproducible research in SSL from speech.

الحساب واللغة أنظمة الصوت في الحاسوب معالجة الصوت والكلام

Supervised Contrastive Learning for Accented Speech Recognition

253 - Tao Han , Hantao Huang , Ziang Yang 2021

Neural network based speech recognition systems suffer from performance degradation due to accented speech, especially unfamiliar accents. In this paper, we study the supervised contrastive learning framework for accented speech recognition. To build different views (similar positive data samples) for contrastive learning, three data augmentation techniques including noise injection, spectrogram augmentation and TTS-same-sentence generation are further investigated. From the experiments on the Common Voice dataset, we have shown that contrastive learning helps to build data-augmentation invariant and pronunciation invariant representations, which significantly outperforms traditional joint training methods in both zero-shot and full-shot settings. Experiments show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average, comparing to the joint training method.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

Layer-wise Analysis of a Self-supervised Speech Representation Model

86 - Ankita Pasad , Ju-Chieh Chou , Karen Livescu 2021

Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or exte nt of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. In this work, we begin to fill this gap by examining one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with non-parametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.

الحساب واللغة التعلم الآلي معالجة الصوت والكلام