No Arabic abstract
Dynamical variational auto-encoders (DVAEs) are a class of deep generative models with latent variables, dedicated to time series data modeling. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include the modeling of temporal dependencies between successive observed and/or latent vectors in data sequences. Previous work has shown the interest of DVAEs and their better performance over the VAE for speech signals (spectrogram) modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that does not require the use of a parallel dataset of clean and noisy speech samples for training, but only requires clean speech signals. In this paper, we extend those works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both speech signals unsupervised representation learning and dynamics modeling. We propose an unsupervised speech enhancement algorithm based on the most general form of DVAEs, that we then adapt to three specific DVAE models to illustrate the versatility of the framework. More precisely, we combine DVAE-based speech priors with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. Experimental results show that the proposed approach based on DVAEs outperforms its VAE counterpart and a supervised speech enhancement baseline.
Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. One advantage of this generative approach is that it does not require pairs of clean and noisy speech signals at training. In this paper, we propose audio-visual variants of VAEs for single-channel and speaker-independent speech enhancement. We develop a conditional VAE (CVAE) where the audio speech generative process is conditioned on visual information of the lip region. At test time, the audio-visual speech generative model is combined with a noise model based on nonnegative matrix factorization, and speech enhancement relies on a Monte Carlo expectation-maximization algorithm. Experiments are conducted with the recently published NTCD-TIMIT dataset as well as the GRID corpus. The results confirm that the proposed audio-visual CVAE effectively fuses audio and visual information, and it improves the speech enhancement performance compared with the audio-only VAE model, especially when the speech signal is highly corrupted by noise. We also show that the proposed unsupervised audio-visual speech enhancement approach outperforms a state-of-the-art supervised deep learning method.
Heart Sound (also known as phonocardiogram (PCG)) analysis is a popular way that detects cardiovascular diseases (CVDs). Most PCG analysis uses supervised way, which demands both normal and abnormal samples. This paper proposes a method of unsupervised PCG analysis that uses beta variational auto-encoder ($beta-text{VAE}$) to model the normal PCG signals. The best performed model reaches an AUC (Area Under Curve) value of 0.91 in ROC (Receiver Operating Characteristic) test for PCG signals collected from the same source. Unlike majority of $beta-text{VAE}$s that are used as generative models, the best-performed $beta-text{VAE}$ has a $beta$ value smaller than 1. Further experiments then find that the introduction of a light weighted KL divergence between distribution of latent space and normal distribution improves the performance of anomaly PCG detection based on anomaly scores resulted by reconstruction loss. The fact suggests that anomaly score based on reconstruction loss may be better than anomaly scores based on latent vectors of samples
An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational auto encoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the ef- fectiveness of VAE using the STRAIGHT spectra for VC. How- ever, VAE using other types of spectral features such as mel- cepstral coefficients (MCCs), which are related to human per- ception and have been widely used in VC, have not been prop- erly investigated. Instead of using one specific type of spectral feature, it is expected that VAE may benefit from using multi- ple types of spectral features simultaneously, thereby improving the capability of VAE for VC. To this end, we propose a novel VAE framework (called cross-domain VAE, CDVAE) for VC. Specifically, the proposed framework utilizes both STRAIGHT spectra and MCCs by explicitly regularizing multiple objectives in order to constrain the behavior of the learned encoder and de- coder. Experimental results demonstrate that the proposed CD- VAE framework outperforms the conventional VAE framework in terms of subjective tests.
We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further facilitate the convergence speed and the overall performance using transfer learning on the server-side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset LibriFSD50K and its creation recipe in order to facilitate FL research for source separation problems.
This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge of the noisy recording environment. In this context, our contribution is to propose a noise model based on alpha-stable distributions, instead of the more conventional Gaussian non-negative matrix factorization approach found in previous studies. We develop a Monte Carlo expectation-maximization algorithm for estimating the model parameters at test time. Experimental results show the superiority of the proposed approach both in terms of perceptual quality and intelligibility of the enhanced speech signal.