
Spatial resolution of late reverberation in virtual acoustic environments

 Added by Christoph Kirsch
 Publication date 2021
Research language: English





Late reverberation involves the superposition of many sound reflections resulting in a diffuse sound field. Since the spatially resolved perception of individual diffuse reflections is impossible, simplifications can potentially be made for modelling late reverberation in room acoustics simulations with reduced spatial resolution. Such simplifications are desired for interactive, real-time virtual acoustic environments with applications in hearing research and for the evaluation of hearing supportive devices. In this context, the number and spatial arrangement of loudspeakers used for playback additionally affect spatial resolution. The current study assessed the minimum number of spatially evenly distributed virtual late reverberation sources required to perceptually approximate spatially highly resolved isotropic and anisotropic late reverberation and to technically approximate a spherically isotropic diffuse sound field. The spatial resolution of the rendering was systematically reduced by using subsets of the loudspeakers of an 86-channel spherical loudspeaker array in an anechoic chamber. It was tested whether listeners can distinguish lower spatial resolutions for the rendering of late reverberation from the highest achievable spatial resolution in different simulated rooms. Rendering of early reflections was kept fixed. The coherence of the sound field across a pair of microphones at ear and behind-the-ear hearing device distance was assessed to separate the effects of number of virtual sources and loudspeaker array geometry. Results show that between 12 and 24 reverberation sources are required.
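The coherence comparison described above can be illustrated with a small numerical sketch. In a spherically isotropic diffuse field, the coherence between two omnidirectional microphones at distance d is sin(kd)/(kd), with k = 2πf/c. For a finite set of equal-power, mutually incoherent plane-wave sources, the coherence is the average of the inter-microphone phase factors. The code below compares the two. It is only an idealized free-field illustration: the Fibonacci lattice used to place the sources, the function names, and the speed of sound of 343 m/s are assumptions, not the loudspeaker subsets or the measurement procedure of the study.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def fibonacci_sphere(n):
    """Return n roughly evenly spread unit vectors (Fibonacci lattice).

    Hypothetical stand-in for an evenly distributed source layout.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def diffuse_coherence(freq_hz, mic_distance_m):
    """Theoretical coherence sin(kd)/(kd) of an isotropic diffuse field."""
    kd = 2.0 * math.pi * freq_hz * mic_distance_m / SPEED_OF_SOUND
    return 1.0 if kd == 0.0 else math.sin(kd) / kd


def discrete_coherence(freq_hz, mic_distance_m, directions, axis=(0.0, 0.0, 1.0)):
    """Complex coherence for equal-power incoherent plane waves.

    `axis` is the unit vector pointing from one microphone to the other.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    total = 0j
    for u in directions:
        projection = u[0] * axis[0] + u[1] * axis[1] + u[2] * axis[2]
        total += cmath.exp(1j * k * mic_distance_m * projection)
    return total / len(directions)


if __name__ == "__main__":
    freq, dist = 1000.0, 0.15  # example values, chosen arbitrarily
    target = diffuse_coherence(freq, dist)
    for n in (4, 12, 24, 86):
        approx = discrete_coherence(freq, dist, fibonacci_sphere(n))
        print(n, abs(approx - target))
```

The printed deviation from the ideal curve gives a rough feel for how the number of sources limits the achievable approximation; a real array adds geometry and room effects that this sketch ignores.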





For 6-DOF (degrees of freedom) interactive virtual acoustic environments (VAEs), the spatial rendering of diffuse late reverberation in addition to early (specular) reflections is important. In the interest of computational efficiency, the acoustic simulation of the late reverberation can be simplified by using a limited number of spatially distributed virtual reverb sources (VRS) each radiating incoherent signals. A sufficient number of VRS is needed to approximate spatially anisotropic late reverberation, e.g., in a room with inhomogeneous distribution of absorption at the boundaries. Here, a highly efficient and perceptually plausible method to generate and spatially render late reverberation is suggested, extending the room acoustics simulator RAZR [Wendt et al., J. Audio Eng. Soc., 62, 11 (2014)]. The room dimensions and frequency-dependent absorption coefficients at the wall boundaries are used to determine the parameters of a physically-based feedback delay network (FDN) to generate the incoherent VRS signals. The VRS are spatially distributed around the listener with weighting factors representing the spatially subsampled distribution of absorption coefficients on the wall boundaries. The minimum number of VRS required to be perceptually distinguishable from the maximum (reference) number of 96 VRS was assessed in a listening test conducted with a spherical loudspeaker array within an anechoic room. For the resulting low numbers of VRS suited for spatial rendering, optimal physically-based parameter choices for the FDN are discussed.
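To make the idea of an FDN that produces one signal per virtual reverb source more concrete, here is a minimal generic sketch. It is not the RAZR implementation: it assumes a Householder feedback matrix, a single frequency-independent reverberation time, and a Sabine estimate of that time from the room dimensions and one mean absorption coefficient. The function names `sabine_rt60` and `fdn_impulse_responses` are hypothetical.

```python
def sabine_rt60(dimensions_m, mean_absorption):
    """Sabine estimate T60 = 0.161 * V / (S * alpha) for a shoebox room.

    Assumption: one broadband absorption coefficient for all walls.
    """
    lx, ly, lz = dimensions_m
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)
    return 0.161 * volume / (surface * mean_absorption)


def fdn_impulse_responses(delays, rt60_s, fs, n_samples):
    """Run a unit impulse through a small FDN; return one output per delay line.

    Each delay line output could serve as the signal of one virtual source.
    """
    n = len(delays)
    # Per-line gain so that the loop decays by 60 dB after rt60_s seconds.
    gains = [10.0 ** (-3.0 * m / (fs * rt60_s)) for m in delays]
    buffers = [[0.0] * m for m in delays]
    index = [0] * n
    outputs = [[0.0] * n_samples for _ in range(n)]
    for t in range(n_samples):
        x = 1.0 if t == 0 else 0.0
        # Read the oldest sample of every circular delay line.
        taps = [buffers[i][index[i]] for i in range(n)]
        damped = [gains[i] * taps[i] for i in range(n)]
        # Householder reflection (I - 2/n * ones): orthogonal, mixes all lines.
        correction = 2.0 * sum(damped) / n
        for i in range(n):
            outputs[i][t] = taps[i]
            buffers[i][index[i]] = x + damped[i] - correction
            index[i] = (index[i] + 1) % delays[i]
    return outputs


if __name__ == "__main__":
    rt60 = sabine_rt60((5.0, 4.0, 3.0), 0.2)
    signals = fdn_impulse_responses([149, 211, 263, 293], rt60, 8000, 8000)
    print(round(rt60, 3), len(signals))
```

Mutually prime delay lengths, as in the example call, are a common choice to avoid coinciding echoes. Frequency-dependent absorption would require filters in place of the scalar gains.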
The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives. In this work, we propose to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input. Our acoustic model subsumes beamformers with multiple types of array geometry. In contrast to deep clustering methods that treat a neural network as a black box tool, the network encoding the spatial filters can process streaming audio data in real time without the accumulation of target signal statistics. We demonstrate the effectiveness of such MC neural networks through ASR experiments on real-world far-field data. We show that our two-channel acoustic model can on average reduce word error rates (WERs) by 13.4% and 12.7% compared to a single-channel ASR system with the log-mel filter bank energy (LFBE) feature under the matched and mismatched microphone placement conditions, respectively. Our result also shows that our two-channel network achieves a relative WER reduction of over 7.0% compared to conventional beamforming with seven microphones overall.
The muzzle blast caused by the discharge of a firearm generates a loud, impulsive sound that propagates away from the shooter in all directions. The location of the source can be computed from time-of-arrival measurements of the muzzle blast on multiple acoustic sensors at known locations, a technique known as multilateration. The multilateration problem is considerably simplified by assuming straight-line propagation in a homogeneous medium, a model for which there are multiple published solutions. Live-fire tests of the ShotSpotter gunshot location system in Pittsburgh, PA were analyzed off-line under several algorithms and geometric constraints to evaluate the accuracy of acoustic multilateration in a forensic context. Best results were obtained using the algorithm due to Mathias, Leonari and Galati under a two-dimensional geometric constraint. Multilateration on random subsets of the participating sensor array shows that 96% of shots can be located to an accuracy of 15 m or better when six or more sensors participate in the solution.
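As an illustration of multilateration under the straight-line, homogeneous-medium model, the sketch below estimates a two-dimensional source position and emission time from arrival times with a plain Gauss-Newton least-squares iteration. This is a generic textbook approach, not the algorithm of Mathias, Leonari and Galati named above. The function names, the sensor layout in the example, and the speed of sound of 343 m/s are assumptions.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def locate_source(sensors, arrival_times, c=343.0, iterations=25):
    """Return (x, y, t0) fitting t_i = t0 + |p - s_i| / c in least squares."""
    d = [c * t for t in arrival_times]  # arrival times as distances
    x = sum(s[0] for s in sensors) / len(sensors)  # start at the centroid
    y = sum(s[1] for s in sensors) / len(sensors)
    ranges = [math.hypot(x - s[0], y - s[1]) for s in sensors]
    b = sum(di - ri for di, ri in zip(d, ranges)) / len(sensors)
    for _ in range(iterations):
        ata = [[0.0] * 3 for _ in range(3)]
        atr = [0.0] * 3
        for (sx, sy), di in zip(sensors, d):
            dist = max(math.hypot(x - sx, y - sy), 1e-9)
            row = ((x - sx) / dist, (y - sy) / dist, 1.0)  # model gradient
            residual = di - b - dist
            for i in range(3):
                atr[i] += row[i] * residual
                for j in range(3):
                    ata[i][j] += row[i] * row[j]
        dx, dy, db = solve_linear(ata, atr)
        x, y, b = x + dx, y + dy, b + db
    return x, y, b / c


if __name__ == "__main__":
    sensors = [(0, 0), (1000, 0), (1000, 1000), (0, 1000), (500, 1200), (-200, 500)]
    true_x, true_y, true_t0 = 320.0, 610.0, 2.0
    times = [true_t0 + math.hypot(true_x - sx, true_y - sy) / 343.0
             for sx, sy in sensors]
    print(locate_source(sensors, times))
```

With three unknowns (x, y, t0), at least three sensors are needed and more sensors overdetermine the fit, which is consistent with accuracy being reported as a function of the number of participating sensors.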
Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, such as unsatisfactory video recognition accuracies, that make it hard to improve over audio-only baselines. In this paper, we specifically consider such scenarios, focusing on the large-vocabulary task of the LRS2 database, where audio-only performance is far superior to video-only accuracies, making this an interesting and challenging setup for multi-modal integration. To address the inherent difficulties, we propose a new fusion strategy: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures. During decoding, this network is used for stream integration within a hybrid recognizer, where it can thus cope with the time-variant reliability and information content of its multiple feature inputs. We compare the results with end-to-end AVSR systems as well as with competitive hybrid baseline models, finding that the new fusion strategy shows superior results, on average even outperforming oracle dynamic stream weighting, which has so far marked the -- realistically unachievable -- upper bound for standard stream weighting. Even though the pure lipreading performance is low, audio-visual integration is helpful under all -- clean, noisy, and reverberant -- conditions. On average, the new system achieves a relative word error rate reduction of 42.18% compared to the audio-only model, pointing at a high effectiveness of the proposed integration approach.
The understanding and interpretation of speech can be affected by various external factors. The use of face masks is one such factor that can obstruct speech during communication. This may degrade speech processing and affect human perception. Knowing whether a speaker wears a mask may be useful for modeling speech for different applications. With this motivation, detecting from a given speech sample whether the speaker wears a face mask is included as a task in Computational Paralinguistics Evaluation (ComParE) 2020. We study novel acoustic features based on linear filterbanks, instantaneous phase and long-term information that can capture the artifacts for classification of speech with and without a face mask. These acoustic features are used along with the state-of-the-art baselines of ComParE functionals, bag-of-audio-words, DeepSpectrum and auDeep features for ComParE 2020. The studies reveal the effectiveness of the proposed acoustic features, and their score-level fusion with the ComParE 2020 baselines leads to an unweighted average recall of 73.50% on the test set.
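For readers unfamiliar with the metric, unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. A minimal sketch follows; the function name and the example labels are made up for illustration and are not data from the challenge.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Hypothetical labels: three "clear" samples, one "mask" sample.
    truth = ["clear", "clear", "clear", "mask"]
    guess = ["clear", "clear", "mask", "mask"]
    print(unweighted_average_recall(truth, guess))
```

In the example, recall is 2/3 for "clear" and 1 for "mask", so the UAR is their mean, 5/6, whereas plain accuracy would be 3/4.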