The muzzle blast caused by the discharge of a firearm generates a loud, impulsive sound that propagates away from the shooter in all directions. The location of the source can be computed from time-of-arrival measurements of the muzzle blast on multiple acoustic sensors at known locations, a technique known as multilateration. The multilateration problem is considerably simplified by assuming straight-line propagation in a homogeneous medium, a model for which there are multiple published solutions. Live-fire tests of the ShotSpotter gunshot location system in Pittsburgh, PA were analyzed off-line under several algorithms and geometric constraints to evaluate the accuracy of acoustic multilateration in a forensic context. Best results were obtained using the algorithm due to Mathias, Leonardi and Galati under a two-dimensional geometric constraint. Multilateration on random subsets of the participating sensor array shows that 96% of shots can be located to an accuracy of 15 m or better when six or more sensors participate in the solution.
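As a concrete illustration of the straight-line, homogeneous-medium model above, the sketch below solves the multilateration problem by nonlinear least squares on time-of-arrival measurements. The speed of sound, sensor layout, and solver choice are illustrative assumptions; this is neither the ShotSpotter pipeline nor the Mathias, Leonardi and Galati algorithm.

```python
# Minimal time-of-arrival multilateration sketch (illustrative only).
# Assumes straight-line propagation at a constant speed of sound in a
# homogeneous medium; sensor layout and solver are hypothetical.
import numpy as np
from scipy.optimize import least_squares

C = 343.0  # assumed speed of sound in m/s

def residuals(params, sensors, toas):
    """Difference between predicted and measured arrival times.

    params = (x, y, t0): source position and unknown emission time.
    """
    x, y, t0 = params
    predicted = t0 + np.hypot(sensors[:, 0] - x, sensors[:, 1] - y) / C
    return predicted - toas

def locate(sensors, toas):
    """2-D multilateration by nonlinear least squares."""
    # Initialize at the sensor centroid with the earliest arrival time.
    x0 = np.array([sensors[:, 0].mean(), sensors[:, 1].mean(), toas.min()])
    sol = least_squares(residuals, x0, args=(sensors, toas))
    return sol.x[:2]  # estimated (x, y) of the muzzle blast

# Example: four sensors at known positions, synthetic arrival times.
sensors = np.array([[0.0, 0.0], [500.0, 0.0], [0.0, 500.0], [500.0, 500.0]])
true_src = np.array([120.0, 340.0])
toas = np.hypot(*(sensors - true_src).T) / C + 0.25  # 0.25 s emission time
print(locate(sensors, toas))  # ≈ [120. 340.]
```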
In this paper, we propose a deep learning (DL)-based parameter enhancement method for a mixed excitation linear prediction (MELP) speech codec in noisy communication environments. Unlike conventional speech enhancement modules, which are designed to obtain a clean speech signal by removing noise components before speech codec processing, the proposed method directly enhances the codec parameters on either the encoder or the decoder side. Because the proposed method is implemented with a small network and requires none of the additional processes of conventional enhancement systems, e.g., time-frequency (T-F) analysis/synthesis modules, its computational complexity is very low. By enhancing the noise-corrupted codec parameters with the proposed DL framework, we obtain an enhancement system that is much simpler and faster than conventional T-F mask-based speech enhancement methods while delivering similar quality.
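A minimal sketch of the core idea, mapping noise-corrupted codec parameters directly to enhanced ones with a small network, is shown below. The per-frame parameter dimension, layer sizes, and loss are hypothetical placeholders, not the architecture reported in the paper.

```python
# Illustrative sketch of direct codec-parameter enhancement with a small
# network. The MELP parameters of one frame are assumed to be packed into a
# fixed-length vector; dimensions and training setup are hypothetical.
import torch
import torch.nn as nn

PARAM_DIM = 54  # hypothetical per-frame MELP parameter vector length

class ParamEnhancer(nn.Module):
    def __init__(self, dim=PARAM_DIM, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # correction applied to the noisy parameters
        )

    def forward(self, noisy_params):
        # Predict a residual correction so the network only models the
        # noise-induced deviation of the parameters.
        return noisy_params + self.net(noisy_params)

# Training-step sketch: regress enhanced parameters toward the parameters
# extracted from clean speech (clean targets assumed available offline).
model = ParamEnhancer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.randn(32, PARAM_DIM)   # batch of noisy-frame parameters
clean = torch.randn(32, PARAM_DIM)   # matching clean-frame parameters
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
optimizer.step()
```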
Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions.
Late reverberation involves the superposition of many sound reflections resulting in a diffuse sound field. Since the spatially resolved perception of individual diffuse reflections is impossible, simplifications can potentially be made for modelling late reverberation in room acoustics simulations with reduced spatial resolution. Such simplifications are desired for interactive, real-time virtual acoustic environments with applications in hearing research and in the evaluation of hearing-supportive devices. In this context, the number and spatial arrangement of loudspeakers used for playback additionally affect spatial resolution. The current study assessed the minimum number of spatially evenly distributed virtual late reverberation sources required to perceptually approximate spatially highly resolved isotropic and anisotropic late reverberation and to technically approximate a spherically isotropic diffuse sound field. The spatial resolution of the rendering was systematically reduced by using subsets of the loudspeakers of an 86-channel spherical loudspeaker array in an anechoic chamber. It was tested whether listeners can distinguish lower spatial resolutions for the rendering of late reverberation from the highest achievable spatial resolution in different simulated rooms. Rendering of early reflections was kept fixed. The coherence of the sound field across a pair of microphones at ear distance and at behind-the-ear hearing-device distance was assessed to separate the effects of the number of virtual sources and of the loudspeaker array geometry. Results show that between 12 and 24 reverberation sources are required.
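For the technical criterion mentioned above, the coherence of an ideal spherically isotropic diffuse field between two omnidirectional microphones at distance d follows the classic relation Γ(f) = sin(kd)/(kd) with k = 2πf/c. The sketch below evaluates this relation for illustrative ear and behind-the-ear microphone spacings; the distances are assumptions, not the values used in the study.

```python
# Ideal spherically isotropic diffuse-field coherence between two
# omnidirectional microphones in the free field (no head shadow):
# Gamma(f) = sin(k d) / (k d). Distances are illustrative stand-ins for the
# ear and behind-the-ear hearing-device spacings referred to in the study.
import numpy as np

C = 343.0                        # speed of sound in m/s
f = np.linspace(10, 16000, 512)  # frequency axis in Hz

def diffuse_coherence(f, d):
    """Theoretical coherence of a spherically isotropic diffuse field."""
    k = 2 * np.pi * f / C
    return np.sinc(k * d / np.pi)  # np.sinc(x) = sin(pi x) / (pi x)

for d in (0.17, 0.012):  # ~ear distance, ~BTE microphone distance (assumed)
    gamma = diffuse_coherence(f, d)
    # The coherence first crosses zero at f = c / (2 d).
    print(f"d = {d:5.3f} m: first zero of coherence near {C / (2 * d):7.1f} Hz")
```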
Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. However, conventional CAPT methods cannot effectively use non-native utterances for supervised training because the ground-truth pronunciation requires expensive annotation. Meanwhile, certain undefined non-native phonemes cannot be correctly classified into standard phonemes. To solve these problems, we use the vector-quantized variational autoencoder (VQ-VAE) to encode the speech into discrete acoustic units in a self-supervised manner. Based on these units, we propose a novel method that integrates both discriminative and generative models. The proposed method can detect mispronunciation and generate the correct pronunciation at the same time. Experiments on the L2-Arctic dataset show that the detection F1 score is improved by a relative 9.58% compared with recognition-based methods. The proposed method also achieves a comparable word error rate (WER) and the best style preservation for mispronunciation correction compared with text-to-speech (TTS) methods.
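A minimal sketch of the vector-quantization step that turns encoder outputs into discrete acoustic units is given below; the codebook size, feature dimension, and straight-through gradient trick are generic VQ-VAE choices, not the exact configuration of the proposed system.

```python
# Minimal sketch of the VQ-VAE quantization step: each encoder frame is
# replaced by its nearest codebook entry, and the codebook index serves as
# the discrete acoustic unit. Codebook size and feature dimension are
# hypothetical, not the configuration used in the paper.
import torch

def quantize(encoder_out, codebook):
    """encoder_out: (T, D) frames; codebook: (K, D) learned code vectors."""
    # Euclidean distance between every frame and every code vector.
    dists = torch.cdist(encoder_out, codebook)   # (T, K)
    indices = dists.argmin(dim=1)                # discrete units, shape (T,)
    quantized = codebook[indices]                # (T, D)
    # Straight-through estimator: gradients bypass the non-differentiable argmin.
    quantized = encoder_out + (quantized - encoder_out).detach()
    return quantized, indices

codebook = torch.randn(256, 64)   # K = 256 units, D = 64 (assumed)
frames = torch.randn(100, 64)     # 100 encoder frames
units, ids = quantize(frames, codebook)
print(ids[:10])                   # discrete unit sequence for downstream CAPT models
```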
The understanding and interpretation of speech can be affected by various external factors. The use of face masks is one such factor, as a mask obstructs speech during communication. This may degrade speech processing and affect human perception. Knowing whether a speaker wears a mask may therefore be useful for modeling speech in different applications. With this motivation, detecting whether a speaker wears a face mask from a given speech signal is included as a task in the Computational Paralinguistics Challenge (ComParE) 2020. We study novel acoustic features based on linear filterbanks, instantaneous phase, and long-term information that can capture the artifacts for classification of speech with and without a face mask. These acoustic features are used along with the state-of-the-art baselines of ComParE functionals, bag-of-audio-words, DeepSpectrum, and auDeep features for ComParE 2020. The studies reveal the effectiveness of the acoustic features, and their score-level fusion with the ComParE 2020 baselines leads to an unweighted average recall of 73.50% on the test set.
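A minimal sketch of score-level fusion, the combination strategy mentioned above, is given below: per-system scores are normalized and combined with a weighted sum before thresholding. The weights and example scores are hypothetical, not the fusion setup behind the reported 73.50% unweighted average recall.

```python
# Illustrative score-level fusion of several classifiers (e.g., the proposed
# acoustic-feature systems and the ComParE 2020 baselines). Scores are
# z-normalized per system and combined with a weighted sum; weights and
# example values are hypothetical.
import numpy as np

def zscore(s):
    return (s - s.mean()) / (s.std() + 1e-9)

def fuse(score_lists, weights):
    """score_lists: one array of per-utterance scores per system."""
    normalized = [w * zscore(np.asarray(s)) for s, w in zip(score_lists, weights)]
    return np.sum(normalized, axis=0)

# Example with two systems scoring five utterances (synthetic scores).
proposed = [0.8, 0.1, 0.6, 0.4, 0.9]   # e.g., linear-filterbank feature system
baseline = [0.7, 0.3, 0.2, 0.5, 0.8]   # e.g., a ComParE baseline system
fused = fuse([proposed, baseline], weights=[0.6, 0.4])
decisions = fused > 0.0                # mask / no-mask decision per utterance
print(fused, decisions)
```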