ﻻ يوجد ملخص باللغة العربية
Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distribution distances of the latent representations, the PFPL adopts the Wasserstein distance as the distance measure. Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses. Moreover, the results showed that the PFPL can enable a deep complex U-Net SE model to achieve highly competitive performance in terms of standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset.
Utilizing a human-perception-related objective function to train a speech enhancement model has become a popular topic recently. The main reason is that the conventional mean squared error (MSE) loss cannot represent auditory perception well. One of
We introduce the Expanded Groove MIDI dataset (E-GMD), an automatic drum transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits, making it an order of magnitude larger than similar datasets, and the first with human-performed
In recent years, waveform-mapping-based speech enhancement (SE) methods have garnered significant attention. These methods generally use a deep learning model to directly process and reconstruct speech waveforms. Because both the input and output are
Previous studies have confirmed the effectiveness of leveraging articulatory information to attain improved speech enhancement (SE) performance. By augmenting the original acoustic features with the place/manner of articulatory features, the SE proce
Autoencoders are commonly trained using element-wise loss. However, element-wise loss disregards high-level structures in the image which can lead to embeddings that disregard them as well. A recent improvement to autoencoders that helps alleviate th