MicAugment: One-shot Microphone Style Transfer

78 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zal\\'an Borsos

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zalan Borsos - Yunpeng Li - Beat Gfeller

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

A crucial aspect for the successful deployment of audio-based models in-the-wild is the robustness to the transformations introduced by heterogeneous acquisition conditions. In this work, we propose a method to perform one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated to the input acquisition pipeline and uses the learned transformations to synthesize audio as if it were recorded under the same conditions as the target audio. We show that our method can successfully apply the style transfer to real audio and that it significantly increases model robustness when used as data augmentation in the downstream tasks.

قيم البحث

100 - Shifeng Pan , Lei He 2021

Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale. It does not require the target speakers to be experts in expressing all styles and to collect corresponding recordings for model trai ning. However, the performances of existing style transfer methods are still far behind real application needs. The root causes are mainly twofold. Firstly, the style embedding extracted from single reference speech can hardly provide fine-grained and appropriate prosody information for arbitrary text to synthesize. Secondly, in these models the content/text, prosody, and speaker timbre are usually highly entangled, its therefore not realistic to expect a satisfied result when freely combining these components, such as to transfer speaking style between speakers. In this paper, we propose a cross-speaker style transfer text-to-speech (TTS) model with explicit prosody bottleneck. The prosody bottleneck builds up the kernels accounting for speaking style robustly, and disentangles the prosody from content and speaker timbre, therefore guarantees high quality cross-speaker style transfer. Evaluation result shows the proposed method even achieves on-par performance with source speakers speaker-dependent (SD) model in objective measurement of prosody, and significantly outperforms the cycle consistency and GMVAE-based baselines in objective and subjective evaluations.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

Encoding Musical Style with Transformer Autoencoders

110 - Kristy Choi , Curtis Hawthorne , Ian Simon 2019

We consider the problem of learning high-level controls over the global structure of generated sequences, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, whi ch aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global representation with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody. Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

A Lite Microphone Array Beamforming Scheme with Maximum Signal-to-Noise Ratio Filter

142 - Lu Ma , Xin Zhao , Pei Zhao 2020

Since space-domain information can be utilized, microphone array beamforming is often used to enhance the quality of the speech by suppressing directional disturbance. However, with the increasing number of microphone, the complexity would be increas ed. In this paper, a concise beamforming scheme using Maximum Signal-to-Noise Ratio (SNR) filter is proposed to reduce the beamforming complexity. The maximum SNR filter is implemented by using the estimated direction-of-arrival (DOA) of the speech source localization (SSL) and the solving method of independent vector analysis (IVA). Our experiments show that when compared with other widely-used algorithms, the proposed algorithm obtain higher gain of signal-to-interference and noise ratio (SINR).

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

BeamTransformer: Microphone Array-based Overlapping Speech Detection

596 - Siqi Zheng , Shiliang Zhang , Weilong Huang 2021

We propose BeamTransformer, an efficient architecture to leverage beamformers edge in spatial filtering and transformers capability in context sequence modeling. BeamTransformer seeks to optimize modeling of sequential relationship among signals from different spatial direction. Overlapping speech detection is one of the tasks where such optimization is favorable. In this paper we effectively apply BeamTransformer to detect overlapping segments. Comparing to single-channel approach, BeamTransformer exceeds in learning to identify the relationship among different beam sequences and hence able to make predictions not only from the acoustic signals but also the localization of the source. The results indicate that a successful incorporation of microphone array signals can lead to remarkable gains. Moreover, BeamTransformer takes one step further, as speech from overlapped speakers have been internally separated into different beams.

أنظمة الصوت في الحاسوب الذكاء الاصطناعي معالجة الصوت والكلام

Stream Attention for far-field multi-microphone ASR

242 - Xiaofei Wang , Yonghong Yan , Hynek Hermansky 2017

A stream attention framework has been applied to the posterior probabilities of the deep neural network (DNN) to improve the far-field automatic speech recognition (ASR) performance in the multi-microphone configuration. The stream attention scheme h as been realized through an attention vector, which is derived by predicting the ASR performance from the phoneme posterior distribution of individual microphone stream, focusing the recognizers attention to more reliable microphones. Investigation on the various ASR performance measures has been carried out using the real recorded dataset. Experiments results show that the proposed framework has yielded substantial improvements in word error rate (WER).

أنظمة الصوت في الحاسوب تفاعل الإنسان والحاسوب معالجة الصوت والكلام