Single Channel Speech Enhancement Using Outlier Detection


Published by: Eunjoon Cho

Publication date: 2016

Research field: Informatics engineering

Paper language: English

Authors: Eunjoon Cho, Bowon Lee, Ronald Schafer

Computer audio systems, Machine learning


Distortion of the underlying speech is a common problem for single-channel speech enhancement algorithms, and hinders such methods from being used more extensively. A dictionary-based speech enhancement method that emphasizes preserving the underlying speech is proposed. Spectral patches of clean speech are sampled and clustered to train a dictionary. Given a noisy speech spectral patch, the best matching dictionary entry is selected and used to estimate the noise power at each time-frequency bin. The noise estimation step is formulated as an outlier detection problem, where noise is assumed to be present at a bin only if that bin is an outlier relative to the corresponding bin of the best matching dictionary entry. This framework assigns higher priority to removing spectral elements that strongly deviate from a typical spoken unit stored in the trained dictionary. Even without the aid of a separate noise model, this method can achieve significant noise reduction for various non-stationary noises, while effectively preserving the underlying speech in more challenging noisy environments.
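The noise estimation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matching rule, the outlier test, or the noise formula, so everything concrete here is an assumption made for illustration: patches are stored as log-power arrays, the best entry is chosen by L1 distance, a bin counts as an outlier when its z-score against a hypothetical per-bin spread `dict_std` exceeds `z_thresh`, and the gain is a simple floored spectral-subtraction-style gain. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def best_entry(noisy_logpow, dict_mean):
    """Index of the dictionary entry closest to the noisy patch.

    Assumption: L1 distance in the log-power domain; the paper may use
    a different matching criterion.
    """
    dists = np.abs(dict_mean - noisy_logpow[None]).sum(axis=(1, 2))
    return int(np.argmin(dists))

def estimate_noise_power(noisy_logpow, dict_mean, dict_std, z_thresh=2.0):
    """Estimate noise power per time-frequency bin via outlier detection.

    noisy_logpow: (F, T) log-power patch of noisy speech.
    dict_mean:    (K, F, T) clean-speech dictionary entries (log-power).
    dict_std:     (K, F, T) assumed per-bin spread of each entry.
    """
    k = best_entry(noisy_logpow, dict_mean)
    # A bin is an outlier if it sits far above the matched clean entry.
    z = (noisy_logpow - dict_mean[k]) / dict_std[k]
    outlier = z > z_thresh
    noisy_pow = np.exp(noisy_logpow)
    speech_pow = np.exp(dict_mean[k])
    # Noise is assumed present only at outlier bins; elsewhere it is zero.
    excess = np.maximum(noisy_pow - speech_pow, 0.0)
    noise_pow = np.where(outlier, excess, 0.0)
    return k, outlier, noise_pow

def suppression_gain(noisy_logpow, noise_pow, gain_floor=0.1):
    """Floored power-domain gain (an assumed, simple suppression rule)."""
    noisy_pow = np.exp(noisy_logpow)
    return np.maximum(1.0 - noise_pow / noisy_pow, gain_floor)
```

Because non-outlier bins get zero estimated noise, their gain stays at 1 and the speech there passes through unchanged, which mirrors the abstract's emphasis on preserving the underlying speech.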


Yuzi Yan, Wei-Qiang Zhang, Michael T. Johnson, 2021

As the cornerstone of other important technologies, such as speech recognition and speech synthesis, speech enhancement is a critical area in audio signal processing. In this paper, a new deep learning structure for speech enhancement is demonstrated. The model introduces a full attention mechanism to a bidirectional sequence-to-sequence method to make use of latent information after each focal frame. This is an extension of the previous attention-based RNN method. The proposed bidirectional attention-based architecture achieves better performance in terms of speech quality (PESQ) than OM-LSA, CNN-LSTM, T-GSA and the unidirectional attention-based LSTM baseline.

Computer audio systems, Machine learning, Audio and speech processing

Multi-Channel Speech Enhancement using Graph Neural Networks

Panagiotis Tzirakis, Anurag Kumar, Jacob Donley, 2021

Multi-channel speech enhancement aims to extract clean speech from a noisy mixture using signals captured from multiple microphones. Recently proposed methods tackle this problem by incorporating deep neural network models with spatial filtering techniques such as the minimum variance distortionless response (MVDR) beamformer. In this paper, we introduce a different research direction by viewing each audio channel as a node lying in a non-Euclidean space and, specifically, a graph. This formulation allows us to apply graph neural networks (GNN) to find spatial correlations among the different channels (nodes). We utilize graph convolution networks (GCN) by incorporating them in the embedding space of a U-Net architecture. We use the LibriSpeech dataset and simulated room acoustics data to experiment extensively with our approach using different array types and numbers of microphones. Results indicate the superiority of our approach when compared to the prior state-of-the-art method.

Computer audio systems, Audio and speech processing

Single-Channel Speech Separation with Auxiliary Speaker Embeddings

Shuo Liu, Gil Keren, Bjorn Schuller, 2019

We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker embeddings created from additional clean context recordings of the two speakers as input to assist in attributing the different time-frequency bins to the two speakers. In experiments, we show that the proposed model yields good performance in the source separation task, and outperforms the state-of-the-art baselines. Specifically, separating speech from the challenging VoxCeleb dataset, the proposed model yields 4.79 dB signal-to-distortion ratio, 8.44 dB signal-to-artifacts ratio and 7.11 dB signal-to-interference ratio.

Computer audio systems, Machine learning, Audio and speech processing

Glance and Gaze: A Collaborative Learning Framework for Single-channel Speech Enhancement

Andong Li, Chengshi Zheng, Lu Zhang, 2021

The human ability to attend to both coarse and fine-grained regions has been applied to computer vision tasks. Motivated by this, we propose a collaborative learning framework in the complex domain for monaural noise suppression. The proposed system consists of two principal modules, namely a spectral feature extraction module (FEM) and stacked glance-gaze modules (GGMs). In the FEM, a UNet-block is introduced after each convolution layer, enabling feature recalibration at multiple scales. In each GGM, we decompose the multi-target optimization in the complex spectrum into two sub-tasks. Specifically, the glance path aims to suppress the noise in the magnitude domain to obtain a coarse estimate, while the gaze path attempts to compensate for the lost spectral detail in the complex domain. The two paths work collaboratively and facilitate spectral estimation from complementary perspectives. In addition, by repeatedly unfolding the GGMs, the intermediate result can be iteratively refined across stages, leading to the final estimate of the spectrum. The experiments are conducted on the WSJ0-SI84, DNS-Challenge, and Voicebank+Demand datasets. Results show that the proposed approach achieves state-of-the-art performance over previous advanced systems on the WSJ0-SI84 and DNS-Challenge datasets, and competitive performance on the Voicebank+Demand corpus.

Computer audio systems, Audio and speech processing

AV Speech Enhancement Challenge using a Real Noisy Corpus

Mandar Gogate, Ahsan Adeel, Kia Dashtipour, 2019

This paper presents a first-of-its-kind audio-visual (AV) speech enhancement challenge in real noisy settings. A detailed description of the AV challenge, a novel real noisy AV corpus (ASPIRE), a benchmark speech enhancement task, and baseline performance results are outlined. The latter are based on training a deep neural architecture on a synthetic mixture of Grid corpus and ChiME3 noises (consisting of bus, pedestrian, cafe, and street noises) and testing on the ASPIRE corpus. Subjective evaluations of five different speech enhancement algorithms (including SEAGN, spectrum subtraction (SS), log-minimum mean-square error (LMMSE), audio-only CochleaNet, and AV CochleaNet) are presented as baseline results. The aim of the multi-modal challenge is to provide a timely opportunity for comprehensive evaluation of novel AV speech enhancement algorithms, using our new benchmark, real noisy AV corpus and specified performance metrics. This will promote AV speech processing research globally, stimulate new ground-breaking multi-modal approaches, and attract interest from companies, academics and researchers working in AV speech technologies and applications. We encourage participants from both the speech and hearing research communities to sign up through the challenge website and to benefit from their complementary approaches to AV speech-in-noise processing.

Computer audio systems, Machine learning, Audio and speech processing