Although deep learning algorithms are widely used to improve speech enhancement (SE) performance, their performance remains limited under highly challenging conditions, such as unseen noise types or low signal-to-noise ratios (SNRs). This study presents a pilot investigation of a novel multimodal audio-articulatory-movement SE (AAMSE) model designed to enhance SE performance under such conditions. Articulatory movement features and acoustic signals were used as inputs to waveform-mapping-based and spectral-mapping-based SE systems with three fusion strategies. In addition, an ablation study was conducted to evaluate SE performance with a limited number of articulatory movement sensors. Experimental results confirm that, by combining the two modalities, the AAMSE model notably improves speech quality and intelligibility compared with conventional audio-only SE baselines.
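As a rough illustration of one possible fusion strategy, the sketch below shows an early-fusion spectral-mapping network in PyTorch that concatenates noisy log-magnitude frames with frame-aligned articulatory movement features before a BLSTM mapping layer; the feature dimensions, layer sizes, and class names are illustrative assumptions, not the AAMSE configuration reported in the abstract.

```python
import torch
import torch.nn as nn

class EarlyFusionSE(nn.Module):
    """Illustrative early-fusion SE network: noisy spectral frames are
    concatenated with frame-aligned articulatory features and mapped to
    enhanced spectral frames. All sizes are placeholders."""
    def __init__(self, n_spec=257, n_artic=18, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_spec + n_artic, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_spec)

    def forward(self, noisy_spec, artic):
        # noisy_spec: (batch, frames, n_spec); artic: (batch, frames, n_artic)
        x = torch.cat([noisy_spec, artic], dim=-1)  # early fusion by concatenation
        h, _ = self.blstm(x)
        return self.out(h)                           # enhanced log-magnitude frames

# Toy usage with random tensors standing in for real features.
model = EarlyFusionSE()
enhanced = model(torch.randn(4, 100, 257), torch.randn(4, 100, 18))
print(enhanced.shape)  # torch.Size([4, 100, 257])
```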
In noisy conditions, knowing the speech content helps listeners suppress background noise components more effectively and recover the underlying clean speech. Previous studies have also confirmed the benefits of incorporating phonetic information …
Previous studies have confirmed the effectiveness of leveraging articulatory information to attain improved speech enhancement (SE) performance. By augmenting the original acoustic features with place/manner-of-articulation features, the SE process …
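As a rough sketch of this type of feature augmentation, the snippet below appends one-hot place/manner attribute vectors, derived from frame-aligned phone labels, to acoustic frames; the attribute inventory, phone mapping, and dimensions are hypothetical and only illustrate the concatenation step, not the cited study's setup.

```python
import numpy as np

# Hypothetical broad-class inventory (place/manner of articulation).
ATTRIBUTES = ["bilabial", "alveolar", "velar", "fricative", "nasal", "vowel"]
PHONE_TO_ATTR = {"p": "bilabial", "t": "alveolar", "k": "velar",
                 "s": "fricative", "n": "nasal", "a": "vowel"}

def attribute_one_hot(phone):
    """Return a one-hot vector for the phone's place/manner attribute."""
    vec = np.zeros(len(ATTRIBUTES), dtype=np.float32)
    vec[ATTRIBUTES.index(PHONE_TO_ATTR[phone])] = 1.0
    return vec

# Append an attribute vector to each acoustic frame (frame alignment assumed).
acoustic = np.random.randn(3, 257).astype(np.float32)   # 3 spectral frames
phones = ["s", "a", "n"]                                 # aligned phone labels
augmented = np.hstack([acoustic,
                       np.stack([attribute_one_hot(p) for p in phones])])
print(augmented.shape)  # (3, 263)
```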
Recent single-channel speech enhancement methods based on deep neural networks (DNNs) have achieved remarkable results, but generalization problems remain in real-world scenarios. Like other data-driven methods, DNN-based speech enhancement models …
Diffusion probabilistic models have demonstrated an outstanding capability to model natural images and raw audio waveforms through paired diffusion and reverse processes. The unique property of the reverse process, namely eliminating non-target signals, …
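For reference, the paired processes referred to here typically follow the standard denoising-diffusion formulation, in which the forward (diffusion) process has a closed form; below is a minimal sketch of that forward step applied to a raw waveform, with an illustrative linear noise schedule rather than any specific model's settings.

```python
import torch

def diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
    x0: clean waveform tensor, t: integer diffusion step, betas: noise schedule."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise, noise

# Toy usage: 50-step linear schedule on a 1-second, 16 kHz waveform.
betas = torch.linspace(1e-4, 0.05, 50)
x0 = torch.randn(16000)
xt, eps = diffuse(x0, t=25, betas=betas)
print(xt.shape)  # torch.Size([16000])
```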
Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, in situations requiring silent speech, or in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech …