Beamforming methods for sound source localization are usually based on free-field Green's functions to model the sound propagation between source and microphone. This assumption is known to be incorrect for many industrial applications, and the beamforming results can suffer from this inconsistency in both the accuracy of the source power estimation and the accuracy of the source localization. The aim of this paper is to investigate whether the use of numerically calculated Green's functions can improve the results of beamforming measurements. The test case of the numerical and experimental investigations consists of sources placed in a short rectangular duct. The measurement is performed outside the duct in a semi-anechoic chamber. A typical example of this kind of installation is a fan with a heat exchanger. The Green's functions for this test case are calculated numerically using the boundary element method. These tailored Green's functions are used to calculate the corresponding beamforming steering vectors. The weighting of the Green's functions in the steering vectors has a decisive influence on the beamforming results. A generalization of the common steering vector formulations is given based on two scalars. It is shown that arbitrary differentiable Green's functions can be used to find the correct source position or source power level by using the appropriate steering vector formulation. Beamforming measurements are performed using a loudspeaker as a reference source at different positions in the heat exchanger duct. The measurements are evaluated in the frequency domain, and by means of different validation criteria it is shown that the results with the numerically calculated Green's functions are improved compared to free-field beamforming, especially at low frequencies.
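To illustrate where the Green's functions enter the processing, a conventional frequency-domain beamformer evaluates a source map over the focus points $\mathbf{x}_t$ using a steering vector built from the transfer functions between each focus point and the microphones. The sketch below is a generic, illustrative parametrization only; the exponents $a$ and $b$ stand in for the two scalars of the generalization mentioned in the abstract and are not the paper's actual definition:
\[
B(\mathbf{x}_t) = \mathbf{h}(\mathbf{x}_t)^{\mathsf{H}}\, \mathbf{C}\, \mathbf{h}(\mathbf{x}_t),
\qquad
h_m(\mathbf{x}_t) = \frac{g_m(\mathbf{x}_t)}{\Bigl(\sum_{l=1}^{M} \lvert g_l(\mathbf{x}_t)\rvert^{2}\Bigr)^{a}\, M^{\,b}},
\]
where $\mathbf{C}$ is the cross-spectral matrix of the $M$ microphone signals and $g_m(\mathbf{x}_t)$ is the (free-field or numerically computed) Green's function between focus point $\mathbf{x}_t$ and microphone $m$.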
Monitoring active volcanoes is an ongoing and important task that helps to understand and predict volcanic eruptions. In recent years, analysing the acoustic properties of eruptions has become more relevant. We present an inexpensive, lightweight, portable, easy-to-use and modular acoustic data acquisition system for field measurements that can record data at sampling rates of up to 100 kHz. The system is based on a Raspberry Pi 3 B running a custom-built bare-metal operating system. It connects to an external analog-to-digital converter to which the microphone sensor is attached. A GPS receiver allows logging of the position and, in addition, recording of a very accurate time signal synchronously with the acoustic data. With that, it is possible for multiple modules to effectively work as a single microphone array. The whole system can be built at low cost and demands only minimal technical infrastructure. We demonstrate a possible use of such a microphone array by deploying 20 modules on the active volcano Stromboli in the Aeolian Islands near Sicily, Italy. We use the collected acoustic data to identify the sound source position for all recorded eruptions.
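As a minimal sketch of how GPS-synchronized modules can be used as an array, the time difference of arrival (TDOA) between two recorders can be estimated by cross-correlation; several such pairwise TDOAs can then be combined with the known module positions to locate the source. File names, sampling rate and the assumption of pre-aligned recordings are illustrative, not part of the described system.

```python
# Hedged sketch: TDOA estimation between two GPS-synchronized recordings
# via cross-correlation. Inputs are assumed to be time-aligned waveforms.
import numpy as np
from scipy.signal import correlate, correlation_lags

fs = 100_000                       # sampling rate in Hz (up to 100 kHz per the abstract)
sig_a = np.load("module_a.npy")    # hypothetical recordings from two modules
sig_b = np.load("module_b.npy")

# Cross-correlate the two channels; the lag of the maximum gives the TDOA.
xcorr = correlate(sig_a, sig_b, mode="full")
lags = correlation_lags(len(sig_a), len(sig_b), mode="full")
tdoa = lags[np.argmax(xcorr)] / fs  # in seconds

# TDOAs from several module pairs, together with the GPS positions, can be
# combined (e.g. by least squares) to estimate the eruption source position.
print(f"estimated TDOA: {tdoa * 1e3:.2f} ms")
```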
Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture is proposed, which extends the weighted power minimization distortionless response (WPD) convolutional beamformer to perform simultaneous separation and dereverberation. We derive a new formulation from the original WPD, which can handle multi-source input, and replace the eigenvalue decomposition with a matrix inverse operation to make the back-propagation algorithm more stable. The above two architectures are optimized in a fully end-to-end manner, using only the speech recognition criterion. Experiments on both the spatialized wsj1-2mix corpus and REVERB show that our proposed model outperforms the conventional methods in reverberant scenarios.
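For reference, one common closed-form statement of the original single-source WPD filter from the literature is given below as a hedged sketch of where the matrix operations enter; it is not the multi-source extension derived in the paper, and the notation is assumed:
\[
\bar{\mathbf{w}} = \frac{\mathbf{R}^{-1}\bar{\mathbf{v}}}{\bar{\mathbf{v}}^{\mathsf{H}}\mathbf{R}^{-1}\bar{\mathbf{v}}},
\qquad
\mathbf{R} = \sum_{t} \frac{\bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^{\mathsf{H}}}{\sigma_t^{2}},
\]
where $\bar{\mathbf{x}}_t$ stacks the current and delayed multi-channel observations, $\bar{\mathbf{v}}$ is the zero-padded steering vector of the target source, and $\sigma_t^{2}$ is the estimated target speech power at frame $t$; the dereverberated, enhanced output is $\hat{y}_t = \bar{\mathbf{w}}^{\mathsf{H}}\bar{\mathbf{x}}_t$.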
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multi-channel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multi-channel multi-speaker reverberant condition and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks, including voice activity detection-like masks. These techniques significantly stabilize the end-to-end training process. Experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about 35% relative WER reduction compared to our conventional multi-channel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition, while being trained only with the ASR criterion.
We propose a separation-guided speaker diarization (SGSD) approach that fully utilizes the complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot handle overlapping speech segments well, we investigate, in this study, separation-based speaker diarization (SSD), which inherently has the potential to handle speaker overlap regions. Our preliminary analysis shows that state-of-the-art Conv-TasNet based speech separation, which works quite well on simulated data, is unstable on realistic conversational speech due to the severe mismatch between the read speaking style of the simulated training speech and the conversational speaking style of the real data. Accordingly, separation-based processing is used to assist CSD in handling the overlapping speech segments under these realistic mismatched conditions. Specifically, several strategies are designed to select between the results of the SSD and CSD systems based on an analysis of the instability of the SSD system performance. Experiments on the conversational telephone speech (CTS) data from the DIHARD-III Challenge show that the proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.
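The following is a minimal sketch of what such a selection strategy could look like in code. The stability check used here (the energy balance between the two separated streams) is an illustrative placeholder and not the criterion proposed in the paper; function and variable names are likewise assumptions.

```python
# Hedged sketch: choose between separation-based (SSD) and clustering-based
# (CSD) diarization results depending on whether the separation looks stable.
import numpy as np

def select_diarization(separated, ssd_segments, csd_segments, ratio_threshold=10.0):
    """Return SSD segments when separation looks reliable, otherwise CSD segments.

    separated: list of two np.ndarray waveforms produced by the separator.
    ssd_segments / csd_segments: lists of (start_time, end_time, speaker) tuples.
    """
    e0 = float(np.sum(separated[0] ** 2)) + 1e-12
    e1 = float(np.sum(separated[1] ** 2)) + 1e-12
    energy_ratio = max(e0, e1) / min(e0, e1)

    # A very unbalanced energy ratio suggests the separator collapsed both
    # speakers into one stream (unstable separation), so fall back to CSD.
    if energy_ratio > ratio_threshold:
        return csd_segments
    return ssd_segments
```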
In this paper, we compare different audio signal representations, including the raw audio waveform and a variety of time-frequency representations, for the task of audio synthesis with Generative Adversarial Networks (GANs). We conduct the experiments on a subset of the NSynth dataset. The architecture follows the benchmark Progressive Growing Wasserstein GAN. We perform experiments both in a fully non-conditional manner and with the network conditioned on the pitch information. We quantitatively evaluate the generated material using standard metrics for assessing generative models, and compare training and sampling times. We show that the complex-valued spectrogram as well as the magnitude and instantaneous frequency of the Short-Time Fourier Transform achieve the best results, and yield fast generation and inversion times. The code for feature extraction, training and evaluating the model is available online.
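As an illustration of the magnitude plus instantaneous-frequency representation mentioned above, the sketch below computes it from the STFT; the parameter values (n_fft, hop length) and the use of librosa are assumptions for the example and not necessarily the paper's settings.

```python
# Hedged sketch: log-magnitude and instantaneous frequency (IF) of the STFT.
import numpy as np
import librosa

def magnitude_if(audio, n_fft=1024, hop_length=256):
    """Return (log-magnitude, instantaneous frequency) of the STFT of `audio`."""
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.log1p(np.abs(stft))
    phase = np.angle(stft)
    # Unwrap the phase along time and take the finite difference: the result
    # is the instantaneous frequency (phase increment per frame).
    unwrapped = np.unwrap(phase, axis=1)
    inst_freq = np.diff(unwrapped, axis=1, prepend=unwrapped[:, :1])
    return magnitude, inst_freq

# Example usage with a synthetic 440 Hz tone sampled at 16 kHz:
y = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
mag, inst_freq = magnitude_if(y)
```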