No Arabic abstract
For centuries researchers have used sound to monitor and study wildlife. Traditionally, conservationists have identified species by ear; however, it is now common to deploy audio recording technology to monitor animal and ecosystem sounds. Animals use sound for communication, mating, navigation and territorial defence. Animal sounds provide valuable information and help conservationists to quantify biodiversity. Acoustic monitoring has grown in popularity due to the availability of diverse sensor types which include camera traps, portable acoustic sensors, passive acoustic sensors, and even smartphones. Passive acoustic sensors are easy to deploy and can be left running for long durations to provide insights on habitat and the sounds made by animals and illegal activity. While this technology brings enormous benefits, the amount of data that is generated makes processing a time-consuming process for conservationists. Consequently, there is interest among conservationists to automatically process acoustic data to help speed up biodiversity assessments. Processing these large data sources and extracting relevant sounds from background noise introduces significant challenges. In this paper we outline an approach for achieving this using state of the art in machine learning to automatically extract features from time-series audio signals and modelling deep learning models to classify different bird species based on the sounds they make. The acquired bird songs are processed using mel-frequency cepstrum (MFC) to extract features which are later classified using a multilayer perceptron (MLP). Our proposed method achieved promising results with 0.74 sensitivity, 0.92 specificity and an accuracy of 0.74.
From ancient to modern times, acoustic structures have been used to control the propagation of acoustic waves. However, the design of the acoustic structures has remained widely a time-consuming and computational resource-consuming iterative process. In recent years, Deep Learning has attracted unprecedented attention for its ability to tackle hard problems with huge datasets, which has achieved state-of-the-art results in various tasks. In this work, an acoustic structure design method is proposed based on deep learning. Taking the design of multi-order Helmholtz resonator for instance, we experimentally demonstrate the effectiveness of the proposed method. Our method is not only able to give a very accurate prediction of the geometry of the acoustic structures with multiple strong-coupling parameters, but also capable of improving the performance of evolutionary approaches in optimization for a desired property. Compared with the conventional numerical methods, our method is more efficient, universal and automatic, which has a wide range of potential applications, such as speech enhancement, sound absorption and insulation.
In this work, we consider applying machine learning to the analysis and compression of audio signals in the context of monitoring elephants in sub-Saharan Africa. Earths biodiversity is increasingly under threat by sources of anthropogenic change (e.g. resource extraction, land use change, and climate change) and surveying animal populations is critical for developing conservation strategies. However, manually monitoring tropical forests or deep oceans is intractable. For species that communicate acoustically, researchers have argued for placing audio recorders in the habitats as a cost-effective and non-invasive method, a strategy known as passive acoustic monitoring (PAM). In collaboration with conservation efforts, we construct a large labeled dataset of passive acoustic recordings of the African Forest Elephant via crowdsourcing, compromising thousands of hours of recordings in the wild. Using state-of-the-art techniques in artificial intelligence we improve upon previously proposed methods for passive acoustic monitoring for classification and segmentation. In real-time detection of elephant calls, network bandwidth quickly becomes a bottleneck and efficient ways to compress the data are needed. Most audio compression schemes are aimed at human listeners and are unsuitable for low-frequency elephant calls. To remedy this, we provide a novel end-to-end differentiable method for compression of audio signals that can be adapted to acoustic monitoring of any species and dramatically improves over naive coding strategies.
In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word-error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using L2 loss. We find that pre-training improves the word error rate by 10.7% when compared to a multi-channel model directly initialized with a beamformer and mel-filter bank coefficients for the front end. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.
Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated promising for speaker-independent speech separation. DC is usually formulated as two-step processes: embedding learning and embedding clustering, which results in complex separation pipelines and a huge obstacle in directly optimizing the actual separation objectives. As for uPIT, it only minimizes the chosen permutation with the lowest mean square error, doesnt discriminate it with other permutations. In this paper, we propose a discriminative learning method for speaker-independent speech separation using deep embedding features. Firstly, a DC network is trained to extract deep embedding features, which contain each sources information and have an advantage in discriminating each target speakers. Then these features are used as the input for uPIT to directly separate the different sources. Finally, uPIT and DC are jointly trained, which directly optimizes the actual separation objectives. Moreover, in order to maximize the distance of each permutation, the discriminative learning is applied to fine tuning the whole model. Our experiments are conducted on WSJ0-2mix dataset. Experimental results show that the proposed models achieve better performances than DC and uPIT for speaker-independent speech separation.
We present an experimental investigation into the effectiveness of transfer learning and bottleneck feature extraction in detecting COVID-19 from audio recordings of cough, breath and speech. This type of screening is non-contact, does not require specialist medical expertise or laboratory facilities and can be deployed on inexpensive consumer hardware. We use datasets that contain recordings of coughing, sneezing, speech and other noises, but do not contain COVID-19 labels, to pre-train three deep neural networks: a CNN, an LSTM and a Resnet50. These pre-trained networks are subsequently either fine-tuned using smaller datasets of coughing with COVID-19 labels in the process of transfer learning, or are used as bottleneck feature extractors. Results show that a Resnet50 classifier trained by this transfer learning process delivers optimal or near-optimal performance across all datasets achieving areas under the receiver operating characteristic (ROC AUC) of 0.98, 0.94 and 0.92 respectively for all three sound classes (coughs, breaths and speech). This indicates that coughs carry the strongest COVID-19 signature, followed by breath and speech. Our results also show that applying transfer learning and extracting bottleneck features using the larger datasets without COVID-19 labels led not only to improve performance, but also to minimise the standard deviation of the classifier AUCs among the outer folds of the leave-$p$-out cross-validation, indicating better generalisation. We conclude that deep transfer learning and bottleneck feature extraction can improve COVID-19 cough, breath and speech audio classification, yielding automatic classifiers with higher accuracy.