BeamTransformer: Microphone Array-based Overlapping Speech Detection


Published by: Siqi Zheng

Publication date: 2021

Research field: Informatics Engineering

Research language: English

Authors: Siqi Zheng, Shiliang Zhang, Weilong Huang


Abstract

We propose BeamTransformer, an efficient architecture that leverages the beamformer's edge in spatial filtering and the transformer's capability in contextual sequence modeling. BeamTransformer seeks to optimize the modeling of sequential relationships among signals from different spatial directions. Overlapping speech detection is one of the tasks where such optimization is favorable. In this paper we apply BeamTransformer to detect overlapping segments. Compared to a single-channel approach, BeamTransformer excels at learning the relationships among different beam sequences and is hence able to make predictions not only from the acoustic signals but also from the localization of the source. The results indicate that a successful incorporation of microphone array signals can lead to remarkable gains. Moreover, BeamTransformer goes one step further, as speech from overlapped speakers has been internally separated into different beams.
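The abstract does not say which beamformer feeds the transformer, so the following is only a minimal sketch of the general idea of turning array signals into per-direction beam sequences. It assumes a far-field source, a linear array, and a plain delay-and-sum beamformer; the function name `delay_and_sum_beams`, the speed-of-sound constant, and the sign convention are illustrative choices, not taken from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second (assumed room-temperature value)


def delay_and_sum_beams(stft, mic_positions, angles_deg, freqs):
    """Steer a linear array toward each angle; return one spectrogram per beam.

    stft:          complex array, shape (mics, freq_bins, frames)
    mic_positions: mic coordinates along the array axis in metres, shape (mics,)
    angles_deg:    look directions measured from the array axis, shape (beams,)
    freqs:         centre frequency of each bin in Hz, shape (freq_bins,)
    returns:       complex array, shape (beams, freq_bins, frames)
    """
    mic_positions = np.asarray(mic_positions, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    angles = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Assumed convention: a mic further along the axis toward the source
    # hears it earlier, so its relative delay is negative. Shape (beams, mics).
    delays = -np.outer(np.cos(angles), mic_positions) / SPEED_OF_SOUND
    # Far-field steering vectors, shape (beams, mics, freq_bins).
    steering = np.exp(-2j * np.pi * delays[:, :, None] * freqs[None, None, :])
    # Delay-and-sum weights: steering vectors scaled by the number of mics.
    weights = steering / stft.shape[0]
    # Phase-align each channel with the conjugate weights and sum over mics.
    return np.einsum("bmf,mft->bft", weights.conj(), stft)
```

Each row of the result is one beam sequence. A sequence model could then consume the beams as parallel input streams, which is the kind of cross-beam relationship the abstract describes.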


Nicolas Furnon, Romain Serizel (LORIA), 2020

Multichannel processing is widely used for speech enhancement, but several limitations appear when trying to deploy these solutions in the real world. Distributed sensor arrays, which combine several devices with a few microphones each, are a viable alternative that exploits the many microphone-equipped devices we use in everyday life. In this context, we propose to extend the distributed adaptive node-specific signal estimation approach to a neural network framework. At each node, local filtering is performed to send one signal to the other nodes, where a mask is estimated by a neural network in order to compute a global multichannel Wiener filter. In an array of two nodes, we show that this additional signal can be efficiently taken into account to predict the masks and leads to better speech enhancement performance than when the mask estimation relies only on the local signals.

Computer audio systems, Artificial intelligence, Audio and speech processing
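The abstract above relies on a mask that drives a multichannel Wiener filter. As a rough illustration of that step only (not the distributed, node-specific scheme itself), the sketch below builds mask-weighted spatial covariance matrices and applies the textbook filter `w = (R_speech + R_noise)^{-1} R_speech e_ref` per frequency bin. The function name `mask_based_mwf`, the reference-microphone choice, and the `eps` regularisation are assumptions; in practice the mask would come from a neural network.

```python
import numpy as np


def mask_based_mwf(stft, speech_mask, ref_mic=0, eps=1e-6):
    """Mask-driven multichannel Wiener filter, one filter per frequency bin.

    stft:        complex array, shape (mics, freq_bins, frames)
    speech_mask: values in [0, 1], shape (freq_bins, frames)
    returns:     enhanced reference-channel STFT, shape (freq_bins, frames)
    """
    n_mics, n_freqs, n_frames = stft.shape
    enhanced = np.zeros((n_freqs, n_frames), dtype=complex)
    for f in range(n_freqs):
        x = stft[:, f, :]            # (mics, frames)
        m = speech_mask[f]           # (frames,)
        # Mask-weighted spatial covariance estimates for speech and noise.
        r_speech = (m * x) @ x.conj().T / max(m.sum(), eps)
        r_noise = ((1.0 - m) * x) @ x.conj().T / max((1.0 - m).sum(), eps)
        # Small diagonal loading keeps the linear solve well conditioned.
        r_mix = r_speech + r_noise + eps * np.eye(n_mics)
        # Wiener solution for the chosen reference microphone.
        w = np.linalg.solve(r_mix, r_speech[:, ref_mic])
        enhanced[f] = w.conj() @ x
    return enhanced
```

With an all-ones mask the filter passes the reference channel through almost unchanged, and with an all-zeros mask it outputs silence, which gives two quick sanity checks on the implementation.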

Continuous Speech Separation with Ad Hoc Microphone Arrays

Dongmei Wang, Takuya Yoshioka, Zhuo Chen, 2021

Speech separation has been shown to be effective for multi-talker speech recognition. Under the ad hoc microphone array setup, where the array consists of spatially distributed asynchronous microphones, additional challenges must be overcome because the geometry and number of microphones are unknown beforehand. Prior studies show that, with a spatial-temporal interleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array. In this paper, we further extend this approach to continuous speech separation. Several techniques are introduced to enable speech separation for real continuous recordings. First, we apply a transformer-based network for spatio-temporal modeling of the ad hoc array signals. In addition, two methods are proposed to mitigate a speech duplication problem during single-talker segments, which seems more severe in ad hoc array scenarios. One method is device distortion simulation, which reduces the acoustic mismatch between simulated training data and real recordings. The other is speaker counting, which detects the single-speaker segments and merges the output signal channels. Experimental results for AdHoc-LibriCSS, a new dataset consisting of continuous recordings of concatenated LibriSpeech utterances obtained by multiple different devices, show that the proposed separation method can significantly improve ASR accuracy for overlapped speech with little performance degradation for single-talker segments.

Computer audio systems, Artificial intelligence, Machine learning

A Lite Microphone Array Beamforming Scheme with Maximum Signal-to-Noise Ratio Filter

Lu Ma, Xin Zhao, Pei Zhao, 2020

Because it can exploit space-domain information, microphone array beamforming is often used to enhance speech quality by suppressing directional disturbances. However, complexity grows as the number of microphones increases. In this paper, a concise beamforming scheme using a maximum signal-to-noise ratio (SNR) filter is proposed to reduce the beamforming complexity. The maximum SNR filter is implemented using the estimated direction of arrival (DOA) from speech source localization (SSL) and the solving method of independent vector analysis (IVA). Our experiments show that, compared with other widely used algorithms, the proposed algorithm obtains a higher signal-to-interference-plus-noise ratio (SINR) gain.

Computer audio systems, Machine learning, Audio and speech processing
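For orientation, the generic maximum-SNR criterion mentioned in the abstract above can be written as a generalized eigenvalue problem: the filter that maximises `(w^H R_speech w) / (w^H R_noise w)` is the principal eigenvector of `R_noise^{-1} R_speech`. The sketch below shows only this textbook formulation for a single frequency bin. It is not the DOA- and IVA-based solver the authors propose, and the function names are illustrative.

```python
import numpy as np


def max_snr_filter(r_speech, r_noise):
    """Return the unit-norm filter maximising (w^H Rs w) / (w^H Rn w)."""
    # The principal eigenvector of Rn^{-1} Rs solves Rs w = lambda Rn w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(r_noise, r_speech))
    w = eigvecs[:, np.argmax(eigvals.real)]
    return w / np.linalg.norm(w)


def output_snr(w, r_speech, r_noise):
    """Ratio of speech power to noise power at the filter output."""
    speech_power = np.real(w.conj() @ r_speech @ w)
    noise_power = np.real(w.conj() @ r_noise @ w)
    return float(speech_power / noise_power)
```

For a rank-one speech covariance `a a^H`, the achievable output SNR equals `a^H R_noise^{-1} a`, which is a convenient value to check the filter against.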

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Chuanxin Tang, Chong Luo, Zhiyuan Zhao, 2021

Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this framework is that VC is a challenging problem which usually needs a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the target speaker. In particular, we manage to perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both text embedding and speech embedding. Then, based on the aligned cross-modality input, we directly generate the mel-spectrogram of the edited speech with a transformer-based decoder. Subjective listening tests show that despite the lack of training data for the speaker, our method has achieved satisfactory results. It outperforms a recent zero-shot TTS engine by a large margin.

Computer audio systems, Artificial intelligence, Audio and speech processing

Communication-Cost Aware Microphone Selection For Neural Speech Enhancement with Ad-hoc Microphone Arrays

Jonah Casebeer, Jamshed Kaikaus, Paris Smaragdis, 2020

In this paper, we present a method for jointly learning a microphone selection mechanism and a speech enhancement network for multi-channel speech enhancement with an ad-hoc microphone array. The attention-based microphone selection mechanism is trained to reduce communication costs through a penalty term that represents a trade-off between task performance and communication cost. While working within the trade-off, our method can intelligently stream from more microphones in lower-SNR scenes and fewer microphones in higher-SNR scenes. We evaluate the model in complex echoic acoustic scenes with moving sources and show that it matches the performance of models that stream from a fixed number of microphones while reducing communication costs.

Computer audio systems, Audio and speech processing
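The abstract above does not give the exact form of the penalty term. One plausible reading, shown here purely as a hypothetical sketch, adds the expected number of streamed microphones to the enhancement loss, scaled by a weight. The names `penalised_loss`, `request_probs`, and `cost_weight` are assumptions introduced for illustration.

```python
import numpy as np


def penalised_loss(task_loss, request_probs, cost_weight):
    """Combine an enhancement loss with a communication-cost penalty.

    task_loss:     scalar enhancement loss (lower is better)
    request_probs: probability of streaming each mic, shape (frames, mics)
    cost_weight:   trade-off weight; larger values favour fewer microphones
    """
    request_probs = np.asarray(request_probs, dtype=float)
    # Expected number of microphones streamed per frame, averaged over frames.
    expected_mics = request_probs.sum(axis=1).mean()
    return task_loss + cost_weight * expected_mics
```

Raising `cost_weight` makes each extra microphone more expensive, so a model trained on such a loss would only request additional channels when they reduce the task loss enough to pay for themselves, for example in low-SNR scenes.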