Monitoring active volcanoes is an ongoing and important task that helps to understand and predict volcanic eruptions. In recent years, analysing the acoustic properties of eruptions has become more relevant. We present an inexpensive, lightweight, portable, easy-to-use and modular acoustic data acquisition system for field measurements that can record data at sampling rates of up to 100 kHz. The system is based on a Raspberry Pi 3 B running a custom-built bare-metal operating system and connects to an external analog-to-digital converter with the microphone sensor. A GPS receiver allows logging of the position and, in addition, recording of a very accurate time signal synchronously with the acoustic data. With that, multiple modules can effectively work together as a single microphone array. The whole system can be built at low cost and demands only minimal technical infrastructure. We demonstrate a possible use of such a microphone array by deploying 20 modules on the active volcano Stromboli in the Aeolian Islands near Sicily, Italy. We use the collected acoustic data to identify the sound source position for all recorded eruptions.
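The GPS-disciplined common time base is what lets distributed modules act as one array: with synchronized clocks, the difference in arrival time of an eruption signal between two modules can be estimated by cross-correlation. The following is a minimal sketch of that idea on synthetic data; the sample rate, pulse shape and delay are illustrative and not taken from the deployed system.

```python
import numpy as np

def estimate_delay(x, y, fs):
    """Estimate the time delay of y relative to x (in seconds) via cross-correlation."""
    corr = np.correlate(y, x, mode="full")
    lag = np.argmax(corr) - (len(x) - 1)   # lag in samples
    return lag / fs

# Synthetic example: the same transient arrives 5 ms later at a second module.
fs = 10_000                                  # sample rate in Hz (illustrative)
t = np.arange(0, 0.1, 1 / fs)
pulse = np.exp(-((t - 0.02) ** 2) / 1e-5)    # short Gaussian pulse as a stand-in signal
delay_samples = 50                           # 5 ms at 10 kHz
x = pulse
y = np.roll(pulse, delay_samples)

print(estimate_delay(x, y, fs))  # → 0.005
```

Given such pairwise delays from several modules with known GPS positions, the source position can then be estimated by standard multilateration.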
Beamforming methods for sound source localization are usually based on free-field Green's functions to model the sound propagation between source and microphone. This assumption is known to be incorrect for many industrial applications, and the beamforming results can suffer from this inconsistency with respect to both the accuracy of the source power estimation and the accuracy of the source localization. The aim of this paper is to investigate whether the use of numerically calculated Green's functions can improve the results of beamforming measurements. The test cases of the numerical and experimental investigations consist of sources placed in a short rectangular duct. The measurement is performed outside the duct in a semi-anechoic chamber. A typical example of this kind of installation is a fan with a heat exchanger. The Green's functions for this test case are calculated numerically using the boundary element method. These tailored Green's functions are used to calculate the corresponding beamforming steering vectors. The weighting of the Green's functions in the steering vectors has a decisive influence on the beamforming results. A generalization of the common steering vector formulations is given based on two scalars. It is shown that arbitrary differentiable Green's functions can be used to find the correct source position or source power level by using the appropriate steering vector formulations. Beamforming measurements are performed using a loudspeaker as a reference source at different positions in the heat exchanger duct. The measurements are evaluated in the frequency domain, and by means of different validation criteria it is shown that the results with the numerically calculated Green's functions are improved compared to free-field beamforming, especially at low frequencies.
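To make the role of the Green's function in the steering vector concrete, here is a minimal frequency-domain beamformer on synthetic data. It uses the free-field Green's function and one common steering-vector normalization; the paper's two-scalar generalization and the BEM-computed Green's functions are not reproduced here, and the geometry and frequency are invented for illustration. Swapping `free_field_green` for a numerically tabulated Green's function is exactly the substitution the paper investigates.

```python
import numpy as np

def free_field_green(k, mic_pos, src_pos):
    """Free-field Green's function between one source position and each microphone."""
    r = np.linalg.norm(mic_pos - src_pos, axis=1)
    return np.exp(-1j * k * r) / (4 * np.pi * r)

def beamform_map(csm, k, mic_pos, grid):
    """Conventional frequency-domain beamforming over a grid of candidate positions."""
    out = np.empty(len(grid))
    for i, pos in enumerate(grid):
        g = free_field_green(k, mic_pos, pos)
        h = g / np.linalg.norm(g) ** 2      # steering vector (one common normalization)
        out[i] = np.real(h.conj() @ csm @ h)
    return out

# Synthetic test: a single point source on a line of candidate positions.
rng = np.random.default_rng(0)
mic_pos = rng.uniform(-0.5, 0.5, size=(16, 3))
mic_pos[:, 2] = 0.0                          # planar array at z = 0
freq, c = 2000.0, 343.0
k = 2 * np.pi * freq / c
true_src = np.array([0.3, 0.0, 1.0])
g_true = free_field_green(k, mic_pos, true_src)
csm = np.outer(g_true, g_true.conj())        # rank-1 cross-spectral matrix
grid = [np.array([x, 0.0, 1.0]) for x in np.linspace(-0.5, 0.5, 21)]
bmap = beamform_map(csm, k, mic_pos, grid)
best = grid[int(np.argmax(bmap))]
print(best[0])  # the peak of the map should land at the true source, x = 0.3
```

With this particular normalization the map value at the true position equals the source auto-power; other normalizations trade level accuracy against localization accuracy, which is the trade-off the two-scalar generalization in the paper makes explicit.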
Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition in adversarial settings, where we deprive the models of part of the audio signal during inference time. Our experiments show that while MMASR models show significant gains over traditional speech-to-text architectures (up to 4.2% WER improvement), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and that we need better visually grounded adaptation techniques.
When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of the challenges is to synthesize Chinese utterances with embedded English phrases or words. This paper looks into the problem within the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance, and naturalness. We start the investigation with an Average Voice Model built from multi-speaker monolingual data, i.e., Mandarin and English data. Building on that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges of building a mixed-lingual TTS system with only monolingual data.
Mosquitoes are the only known vector of malaria, which leads to hundreds of thousands of deaths each year. Understanding the number and location of potential mosquito vectors is of paramount importance for reducing malaria transmission. In recent years, deep learning has become widely used for bioacoustic classification tasks. To enable further research applications in this field, we release a new dataset of mosquito audio recordings. With over a thousand contributors, we obtained 195,434 labels of two-second duration, of which approximately 10 percent signify mosquito events. We present an example use of the dataset, in which we train a convolutional neural network on log-Mel features, showcasing the information content of the labels. We hope this will become a vital resource for those researching all aspects of malaria, and will add to the existing audio datasets for bioacoustic detection and signal processing.
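The log-Mel front end mentioned above can be computed with a short, self-contained routine: frame the signal, take the magnitude STFT, and project onto triangular mel filters. The sketch below uses a textbook mel-filterbank recipe; the window size, hop, number of mel bands and sample rate are illustrative assumptions, not the authors' exact configuration (which the abstract does not specify).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):                 # rising slope
            fb[i, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):                 # falling slope
            fb[i, j] = (r - j) / max(r - c, 1)
    return fb

def log_mel(signal, fs, n_fft=512, hop=256, n_mels=40):
    """Frame the signal, take the power STFT, then the log of mel-band energies."""
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # power spectrogram
    return np.log(mel_filterbank(n_mels, n_fft, fs) @ spec.T + 1e-10)

# A two-second clip, matching the dataset's two-second label duration.
fs = 8000                                    # illustrative sample rate
t = np.arange(0, 2.0, 1 / fs)
clip = np.sin(2 * np.pi * 600 * t)           # a 600 Hz tone as a stand-in
features = log_mel(clip, fs)
print(features.shape)  # → (40, 61)
```

The resulting (n_mels × n_frames) matrix is the image-like input that a convolutional network, such as the one trained in the example use above, would classify per two-second clip.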
On-device speech recognition requires training models of different sizes for deployment on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient, since it reduces the redundancy in the training procedure's data handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup where the different acoustic encoders share common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame-level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% relative improvement in the word error rate on both test partitions.