Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training


Published by: Bin Liu

Publication date: 2018

Research field: Informatics Engineering

Language: English

Authors: Bin Liu, Shuai Nie, Yaping Zhang


Abstract

In realistic environments, speech is usually corrupted by various kinds of noise and reverberation, which dramatically degrades the performance of automatic speech recognition (ASR) systems. The most common way to alleviate this issue is to use a well-designed speech enhancement approach as the front-end of ASR. However, such methods require more complex pipelines, more computation and even higher hardware costs (a microphone array). In addition, speech enhancement can introduce speech distortions and mismatches with the training conditions. In this paper, we propose an adversarial training method to directly boost the noise robustness of the acoustic model. Specifically, a generative adversarial net (GAN) and a neural network-based acoustic model (AM) are combined in a joint scheme during the training phase. The GAN generates clean feature representations from noisy features under the guidance of a discriminator that tries to distinguish between true clean signals and generated signals. The joint optimization of generator, discriminator and AM combines the strengths of both GAN and AM for speech recognition. Systematic experiments on CHiME-4 show that the proposed method significantly improves the noise robustness of the AM and achieves average relative error rate reductions of 23.38% and 11.54% on the development and test sets, respectively.
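The abstract does not spell out the loss functions, so the following is only a minimal sketch of how a generator, discriminator and acoustic model could be tied together in one objective. It assumes the standard GAN log-losses, a cross-entropy loss for the AM, and a hypothetical weight `lam` between the two terms; all function names are illustrative and none of this is taken from the paper itself.

```python
import math


def discriminator_loss(d_clean, d_generated):
    """Standard GAN discriminator loss for one example.

    d_clean:     D's probability that a true clean feature is clean.
    d_generated: D's probability that a generated feature is clean.
    """
    return -(math.log(d_clean) + math.log(1.0 - d_generated))


def generator_adversarial_loss(d_generated):
    """Non-saturating generator loss: small when D is fooled."""
    return -math.log(d_generated)


def acoustic_model_loss(posteriors, target):
    """Cross-entropy of the AM posterior for the correct state label."""
    return -math.log(posteriors[target])


def joint_generator_loss(posteriors, target, d_generated, lam=1.0):
    """Assumed joint objective: AM cross-entropy plus a weighted
    adversarial term (lam is a hypothetical hyperparameter)."""
    return (acoustic_model_loss(posteriors, target)
            + lam * generator_adversarial_loss(d_generated))


# Hypothetical values for a single frame.
print(discriminator_loss(d_clean=0.9, d_generated=0.2))
print(joint_generator_loss([0.1, 0.9], target=1, d_generated=0.5, lam=0.5))
```

In such a setup the discriminator would be updated to lower `discriminator_loss`, while the generator and AM would be updated to lower `joint_generator_loss`, so that the enhanced features are both hard to tell from clean ones and useful for recognition.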


Jaesung Huh, Hee Soo Heo, Jingu Kang (2020)

The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning, in which within-utterance embeddings are encouraged to be similar and cross-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information while remaining invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans.

Computer Sound Systems, Machine Learning, Audio and Speech Processing

Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models

Jing Liu, Rupak Vignesh Swaminathan, Sree Hari Krishnan Parthasarathi (2021)

We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small footprint setting, showing that a smaller capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by 14.3% word error rate reduction (WERR). When increasing the supervised data seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency at larger supervised data regimes, we employ a step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting. We develop a theoretical sketch to explain this behavior.
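As a reading aid for the WERR figures above, here is a small sketch of how a relative word error rate reduction is usually computed. The abstract does not report absolute word error rates, so the baseline and improved values below are hypothetical numbers chosen only to reproduce a 14.3% reduction.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction, in percent.

    Both arguments are absolute word error rates in the same unit
    (for example, percent of words in error).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs; not taken from the paper.
baseline = 10.0
student = 8.57
print(round(relative_wer_reduction(baseline, student), 1))
```

Because the measure is relative, the same WERR can correspond to very different absolute gains depending on how strong the baseline already is.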

Computer Sound Systems, Machine Learning, Audio and Speech Processing

Modelling Animal Biodiversity Using Acoustic Monitoring and Deep Learning

C. Chalmers, P. Fergus, S. Wich (2021)

For centuries researchers have used sound to monitor and study wildlife. Traditionally, conservationists have identified species by ear; however, it is now common to deploy audio recording technology to monitor animal and ecosystem sounds. Animals use sound for communication, mating, navigation and territorial defence. Animal sounds provide valuable information and help conservationists to quantify biodiversity. Acoustic monitoring has grown in popularity due to the availability of diverse sensor types, which include camera traps, portable acoustic sensors, passive acoustic sensors, and even smartphones. Passive acoustic sensors are easy to deploy and can be left running for long durations to provide insights on habitat and the sounds made by animals and illegal activity. While this technology brings enormous benefits, the amount of data generated makes processing time-consuming for conservationists. Consequently, there is interest among conservationists in automatically processing acoustic data to help speed up biodiversity assessments. Processing these large data sources and extracting relevant sounds from background noise introduces significant challenges. In this paper we outline an approach for achieving this, using state-of-the-art machine learning to automatically extract features from time-series audio signals and deep learning models to classify different bird species based on the sounds they make. The acquired bird songs are processed using mel-frequency cepstrum (MFC) to extract features, which are later classified using a multilayer perceptron (MLP). Our proposed method achieved promising results with 0.74 sensitivity, 0.92 specificity and an accuracy of 0.74.
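To make the reported metrics concrete, the sketch below computes one-vs-rest sensitivity and specificity per class, plus overall accuracy, from a multi-class confusion matrix. The abstract does not say how the paper averaged its metrics, so the macro-averaging and the three-class matrix here are assumptions for illustration only.

```python
def per_class_rates(confusion):
    """Return (sensitivity, specificity) for each class.

    confusion[i][j] counts examples of true class i predicted as class j.
    Each class is scored one-vs-rest.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    rates = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        rates.append((tp / (tp + fn), tn / (tn + fp)))
    return rates


def accuracy(confusion):
    """Fraction of all examples on the diagonal of the matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


# Hypothetical confusion matrix for three bird species, 10 clips each.
confusion = [[8, 1, 1],
             [2, 6, 2],
             [1, 1, 8]]
rates = per_class_rates(confusion)
macro_sensitivity = macro_average([sens for sens, _ in rates])
macro_specificity = macro_average([spec for _, spec in rates])
print(macro_sensitivity, macro_specificity, accuracy(confusion))
```

With equally sized classes, as in this toy matrix, macro-averaged sensitivity coincides with accuracy. That is one way identical sensitivity and accuracy values can arise; whether it applies to the paper's evaluation is not stated in the abstract.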

Computer Sound Systems, Machine Learning, Audio and Speech Processing

Deep Speaker Vector Normalization with Maximum Gaussianality Training

Yunqi Cai, Lantian Li, Dong Wang (2020)

Deep speaker embedding represents the state-of-the-art technique for speaker recognition. A key problem with this approach is that the resulting deep speaker vectors tend to be irregularly distributed. In previous research, we proposed a deep normalization approach based on a new discriminative normalization flow (DNF) model, by which the distributions of individual speakers are arguably transformed to homogeneous Gaussians. This normalization was demonstrated to be effective, but despite this remarkable success, we empirically found that the latent codes produced by the DNF model are generally neither homogeneous nor Gaussian, although the model has assumed so. In this paper, we argue that this problem is largely attributable to the maximum-likelihood (ML) training criterion of the DNF model, which aims to maximize the likelihood of the observations but not necessarily improve the Gaussianality of the latent codes. We therefore propose a new Maximum Gaussianality (MG) training approach that directly maximizes the Gaussianality of the latent codes. Our experiments on two data sets, SITW and CNCeleb, demonstrate that our new MG training approach can deliver much better performance than the previous ML training, and exhibits improved domain generalizability, particularly with regard to cosine scoring.
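The cosine scoring mentioned at the end of the abstract compares two speaker vectors by the cosine of the angle between them. The sketch below shows that computation in plain Python; the two-dimensional vectors are hypothetical, and real speaker embeddings have far more dimensions.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors.

    Returns a value in [-1, 1]; higher means the vectors point in
    more similar directions, regardless of their lengths.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical enrollment and test vectors.
enrollment = [1.0, 2.0]
same_direction = [2.0, 4.0]
orthogonal = [-2.0, 1.0]
print(cosine_score(enrollment, same_direction))
print(cosine_score(enrollment, orthogonal))
```

Since the score ignores vector length and depends only on direction, it is sensitive to how the embeddings are distributed, which is one reason normalizing the speaker vectors can matter for this scoring method.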

Computer Sound Systems, Machine Learning, Audio and Speech Processing

Robustness and Generalization via Generative Adversarial Training

Omid Poursaeed, Tianxing Jiang, Harry Yang (2021)

While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve the robustness against these variations. However, current defenses can only withstand the specific attack used in training, and the models often remain vulnerable to other input variations. Moreover, these methods often degrade the performance of the model on clean images and do not generalize to out-of-domain samples. In this paper we present Generative Adversarial Training, an approach to simultaneously improve the model's generalization to the test set and out-of-domain samples as well as its robustness to unseen adversarial attacks. Instead of altering a low-level pre-defined aspect of images, we generate a spectrum of low-level, mid-level and high-level changes using generative models with a disentangled latent space. Adversarial training with these examples enables the model to withstand a wide range of attacks by observing a variety of input alterations during training. We show that our approach not only improves the performance of the model on clean images and out-of-domain samples but also makes it robust against unforeseen attacks and outperforms prior work. We validate the effectiveness of our method by demonstrating results on various tasks such as classification, segmentation and object detection.

Computer Vision and Pattern Recognition, Cryptography and Security, Machine Learning