We propose a natural way to generalize relative transfer functions (RTFs) to more than one source. We first prove that such a generalization is not possible using a single multichannel spectro-temporal observation, regardless of the number of microphones. We then introduce a new transform for multichannel multi-frame spectrograms, i.e., spectrograms containing several channels and time frames in each time-frequency bin. This transform allows a natural generalization that satisfies the three key properties of RTFs, namely: they can be directly estimated from observed signals, they capture spatial properties of the sources, and they do not depend on the emitted signals. Through simulated experiments, we show how this new method can localize multiple simultaneously active sound sources using short spectro-temporal windows, without relying on source separation.
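As a minimal illustration of the estimability property above: for a single source, the RTF between two microphones can be estimated from observed spectrograms as a ratio of cross- to auto-power spectral densities, independently of the emitted signal. The sketch below uses synthetic single-source data (all sizes and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one source, two microphones whose relative
# transfer function is a complex gain per frequency bin.
n_freq, n_frames = 64, 200
true_rtf = 0.8 * np.exp(1j * np.linspace(0, np.pi, n_freq))

S = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))
X1 = S                       # reference microphone spectrogram
X2 = true_rtf[:, None] * S   # second microphone spectrogram

# RTF estimate: cross-PSD over auto-PSD, averaged over time frames.
# The emitted signal S cancels out of the ratio.
rtf_hat = np.mean(X2 * X1.conj(), axis=1) / np.mean(np.abs(X1) ** 2, axis=1)

print(np.allclose(rtf_hat, true_rtf))  # → True (noise-free case)
```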
We propose a sampling scheme that can perfectly reconstruct a collection of spikes on the sphere from samples of their lowpass-filtered observations. Central to our algorithm is a generalization of the annihilating filter method, a tool widely used in array signal processing and finite-rate-of-innovation (FRI) sampling. The proposed algorithm can reconstruct $K$ spikes from $(K+\sqrt{K})^2$ spatial samples. This sampling requirement improves over previously known FRI sampling schemes on the sphere by a factor of four for large $K$. We showcase the versatility of the proposed algorithm by applying it to three different problems: 1) sampling diffusion processes induced by localized sources on the sphere, 2) shot noise removal, and 3) sound source localization (SSL) by a spherical microphone array. In particular, we show how SSL can be reformulated as a spherical sparse sampling problem.
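The annihilating filter method that the algorithm generalizes can be sketched in its classical 1-D form: $K$ spike locations are recovered from $2K+1$ Fourier samples by finding a filter that annihilates the sample sequence and reading the locations off its roots. This is a toy stand-in for the method, not the spherical algorithm itself:

```python
import numpy as np

# K spikes at locations t_k in [0, 1) produce Fourier samples
# x[n] = sum_k c_k exp(-2j*pi*n*t_k).
K = 3
t = np.array([0.12, 0.45, 0.78])   # spike locations
c = np.array([1.0, 2.0, 0.5])      # spike amplitudes
n = np.arange(2 * K + 1)
x = (c * np.exp(-2j * np.pi * np.outer(n, t))).sum(axis=1)

# The annihilating filter h (length K+1) satisfies (h * x) = 0;
# it lies in the null space of this Toeplitz matrix of the samples.
T = np.array([[x[i + K - j] for j in range(K + 1)] for i in range(K + 1)])
_, _, Vh = np.linalg.svd(T)
h = Vh[-1].conj()   # right singular vector of the smallest singular value

# The roots of h are exp(-2j*pi*t_k); their angles give the locations.
roots = np.roots(h)
t_hat = np.sort(np.mod(-np.angle(roots) / (2 * np.pi), 1))
print(t_hat)  # ≈ [0.12, 0.45, 0.78]
```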
From a machine learning perspective, the human ability to localize sounds can be modeled as a non-parametric, non-linear regression problem between the binaural spectral features of sound received at the ears (input) and the corresponding sound-source directions (output). The input features can be summarized in terms of the individual's head-related transfer functions (HRTFs), which measure the spectral response between the listener's eardrum and an external point in $3$D. Based on these viewpoints, two related problems are considered: how one can achieve an optimal sampling of measurements for training sound-source localization (SSL) models, and how SSL models can be used to infer the subject's HRTFs in listening tests. First, we develop a class of binaural SSL models based on Gaussian process regression and solve a \emph{forward selection} problem that finds a subset of input-output samples that best generalizes to all SSL directions. Second, we use an \emph{active-learning} approach that updates an online SSL model for inferring the subject's SSL errors via headphones and a graphical user interface. Experiments show that only a small fraction of HRTFs are required for $5^{\circ}$ localization accuracy and that the learned HRTFs are localized closer to their intended directions than non-individualized HRTFs.
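The Gaussian process regression at the core of such SSL models can be sketched in a minimal 1-D stand-in. The RBF kernel, length scale, and sine-wave targets below are purely illustrative, not the paper's actual binaural features or directions:

```python
import numpy as np

# Squared-exponential (RBF) kernel between two sets of 1-D inputs.
def rbf(a, b, length=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

x_train = np.linspace(0, np.pi, 15)
y_train = np.sin(x_train)           # toy stand-in for direction targets
x_test = np.array([0.5, 1.5, 2.5])

# GP posterior mean: k(x*, X) K^{-1} y, with a small jitter for stability.
K = rbf(x_train, x_train) + 1e-6 * np.eye(len(x_train))
k_star = rbf(x_test, x_train)
y_pred = k_star @ np.linalg.solve(K, y_train)

print(np.max(np.abs(y_pred - np.sin(x_test))))  # small interpolation error
```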
Head-related impulse responses (HRIRs) are subject-dependent and direction-dependent filters used in spatial audio synthesis. They describe the scattering response of the head, torso, and pinnae of the subject. We propose a structural factorization of the HRIRs into a product of non-negative and Toeplitz matrices; the factorization is based on a novel extension of a non-negative matrix factorization algorithm. As a result, the HRIR becomes expressible as a convolution between a direction-independent \emph{resonance} filter and a direction-dependent \emph{reflection} filter. Further, the reflection filter can be made \emph{sparse} with minimal HRIR distortion. The described factorization is shown to be applicable to the arbitrary-source-signal case and allows one to employ time-domain convolution at a computational cost lower than that of frequency-domain convolution.
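The factorized synthesis can be sketched as follows, with hypothetical filter values; the sketch also shows the equivalence between the convolution form and the Toeplitz-matrix product mentioned above:

```python
import numpy as np

# Hypothetical direction-independent resonance filter, shared across
# all directions.
resonance = np.array([1.0, 0.6, 0.3, 0.1])

# Hypothetical direction-dependent reflection filter, kept sparse:
# only a few discrete echo taps are nonzero.
reflection = np.zeros(16)
reflection[[0, 5, 11]] = [1.0, -0.4, 0.2]

# Synthesized HRIR: convolution of the two filters.
hrir = np.convolve(resonance, reflection)

# Equivalent matrix form: the same convolution as a Toeplitz matrix
# acting on the resonance filter.
n = len(resonance) + len(reflection) - 1
T = np.zeros((n, len(resonance)))
for j in range(len(resonance)):
    T[j:j + len(reflection), j] = reflection

print(np.allclose(T @ resonance, hrir))  # → True
```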
An algorithm based on Mel-frequency cepstral coefficients (MFCCs) is presented for signal feature extraction in the task of speaker accent recognition. Different classifiers are then compared on the MFCC features. For each signal, the mean vector of its MFCC matrix is used as the input vector for pattern recognition. A sample of 330 signals, containing 165 US voices and 165 non-US voices, is analyzed. In this comparison, k-nearest neighbors yields the highest average test accuracy, using a cross-validation of size 500, as well as the lowest computation time.
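The mean-MFCC-vector plus k-nearest-neighbors pipeline can be sketched on synthetic features. The Gaussian "classes" below are illustrative stand-ins for real MFCC matrices, and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-signal feature: the mean over time frames of its MFCC matrix
# (frames x coefficients).
def mean_mfcc(mfcc_matrix):
    return mfcc_matrix.mean(axis=0)

# Two synthetic "accent" classes with shifted feature distributions.
us = [rng.standard_normal((100, 13)) for _ in range(20)]
non_us = [rng.standard_normal((100, 13)) + 0.8 for _ in range(20)]
X = np.array([mean_mfcc(m) for m in us + non_us])
y = np.array([0] * 20 + [1] * 20)

# Plain k-nearest-neighbors classifier on Euclidean distance.
def knn_predict(x, X_train, y_train, k=5):
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return np.bincount(y_train[idx]).argmax()

# Leave-one-out evaluation on the toy data.
mask = np.ones(len(X), dtype=bool)
correct = 0
for i in range(len(X)):
    mask[:] = True
    mask[i] = False
    correct += knn_predict(X[i], X[mask], y[mask]) == y[i]

print(correct / len(X))  # → 1.0 on this easily separable toy data
```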
We propose a novel sparse representation for heavily underdetermined multichannel sound mixtures, i.e., mixtures with many more sources than microphones. The proposed approach operates in the complex Fourier domain, thus preserving the spatial characteristics carried by phase differences. We derive a generalization of K-SVD that jointly estimates a dictionary capturing both spectral and spatial features, a sparse activation matrix, and all instantaneous source phases from a set of signal examples. The dictionary can then be used to extract the learned signal from a new input mixture. The method is applied to the challenging problem of ego-noise reduction for robot audition. We demonstrate its superiority over conventional dictionary-based techniques using recordings made in a real room.
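The complex-domain sparse coding underlying such an approach can be illustrated with a single matching-pursuit step on a toy complex dictionary, a hypothetical stand-in for a learned K-SVD dictionary; note how the atom's phase is recovered along with its amplitude:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complex dictionary: 10 unit-norm atoms of dimension 32.
D = rng.standard_normal((32, 10)) + 1j * rng.standard_normal((32, 10))
D /= np.linalg.norm(D, axis=0)

# Signal built from one atom with amplitude 2 and phase 0.3 rad.
x = 2.0 * np.exp(1j * 0.3) * D[:, 4]

# One matching-pursuit step: complex correlations select the atom and
# its coefficient carries both amplitude and phase.
corr = D.conj().T @ x
best = int(np.argmax(np.abs(corr)))
coeff = corr[best]

print(best, np.round(np.abs(coeff), 3))  # → 4 2.0
```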
We present the concept of an acoustic rake receiver---a microphone beamformer that uses echoes to improve noise and interference suppression. The rake idea is well known in wireless communications; it involves constructively combining different multipath components that arrive at the receiver antennas. Unlike the spread-spectrum signals used in wireless communications, speech signals are not orthogonal to their shifts. Therefore, we focus on the spatial structure rather than the temporal one. Instead of explicitly estimating the channel, we create correspondences between early echoes in time and image sources in space. These multiple sources of the desired and the interfering signals offer additional spatial diversity that we can exploit in the beamformer design. We present several intuitive and optimal formulations of acoustic rake receivers, and show theoretically and numerically that the rake formulation of the maximum signal-to-interference-and-noise beamformer offers significant performance boosts in terms of noise and interference suppression. Beyond the signal-to-noise ratio, we observe gains in terms of the \emph{perceptual evaluation of speech quality} (PESQ) metric. We accompany the paper with the complete simulation and processing chain written in Python. The code and the sound samples are available online at \url{http://lcav.github.io/AcousticRakeReceiver/}.
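The core idea, matching the beamformer to the combined steering vector of the direct path and its echoes, can be sketched in a narrowband toy model. The steering vectors, echo gain, and noise level below are hypothetical, and the weights are simple matched filters rather than the paper's optimal formulations:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 1000  # microphones, snapshots

# Hypothetical narrowband steering vectors for the direct path and one
# early echo (an image source) of the desired speaker.
a_direct = np.exp(1j * 2 * np.pi * rng.random(M))
a_echo = 0.7 * np.exp(1j * 2 * np.pi * rng.random(M))
a_total = a_direct + a_echo

s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
noise = 0.5 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

# Matched-filter weights: steering at the direct path only, vs. raking
# the echo by matching the combined steering vector.
w_direct = a_direct / np.linalg.norm(a_direct)
w_rake = a_total / np.linalg.norm(a_total)

def output_snr(w):
    sig = np.abs(w.conj() @ np.outer(a_total, s)) ** 2
    nse = np.abs(w.conj() @ noise) ** 2
    return sig.mean() / nse.mean()

print(output_snr(w_rake) > output_snr(w_direct))  # → True: raking helps
```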
This article introduces an effective generalization of the polar form of the Fourier theorem based on a new method of analysis. Under the premises of the new theory, an ample class of functions becomes viable as bases, with the further advantage of using the same basis for analysis and reconstruction. Other tools, such as wavelets, admit specially built nonorthogonal bases but require different bases for analysis and reconstruction (biorthogonal and dual bases) as well as vectorial coordinates; this renders those systems unintuitive and computationally intensive. As an example of the advantages of the new generalization, this paper introduces a novel synthesis method based on frequency-phase series of square waves (the equivalent of the polar Fourier theorem, but for nonorthogonal bases). The resulting synthesizer is very efficient, needing only a few components, frugal in its computing needs, and viable for many applications.
A previous paper [1] discussed the viability of functional analysis using a pair of generic functions as a basis, and hence vectorial decomposition. Here we complete the paradigm by exploiting one of the analysis methodologies developed there, now applied to phase coordinates, so that only one function is needed as a basis. It will be shown that, thanks to the novel iterative analysis, any function satisfying a rather loose requirement is ontologically a basis. This in turn generalizes the polar version of the Fourier theorem to an ample class of nonorthogonal bases. The main advantage of this generalization is that it inherits some of the properties of the original Fourier theorem. As a result, the new transform has a wide range of applications and some remarkable consequences. The new tool will be compared with wavelets and frames. Examples of analysis and reconstruction of functions using the developed algorithms and generic bases will be given. Some of the properties, and some applications that can readily benefit from the theory, will be discussed. The implementation of a matched filter for noise suppression will be used as an example of the potential of the theory.
The musical realm is a promising area in which to look for nontrivial topological structures. This paper describes several kinds of metrics on musical data and explores their implications in two ways: via techniques of classical topology, where the metric space of all possible musical data can be described explicitly, and via the modern data-driven ideas of persistent homology, which calculate the Betti-number barcodes of individual musical works. Both analyses are able to recover three well-known topological structures in music: the circle of notes (octave-reduced scalar structures), the circle of fifths, and the rhythmic repetition of timelines. Applications to a variety of musical works (for example, folk music in the form of standard MIDI files) are presented, and the barcodes show many interesting features. Examples show that individual pieces may span the complete space (in which case the classical and the data-driven analyses agree), or they may span only part of it.
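One simple example of a metric on musical data is distance measured along the circle of fifths; a minimal sketch, using the usual C = 0 pitch-class numbering (the function names are illustrative):

```python
# Position of a pitch class (0..11, i.e., C, C#, ..., B) on the circle
# of fifths: advancing one step on the circle moves 7 semitones, and
# 7 is its own inverse mod 12, so multiplying by 7 inverts the map.
def fifths_index(pitch_class):
    return (pitch_class * 7) % 12

# Metric: shortest way around the 12-point circle of fifths.
def circle_of_fifths_distance(p, q):
    d = abs(fifths_index(p) - fifths_index(q))
    return min(d, 12 - d)

print(circle_of_fifths_distance(0, 7))  # C to G: neighbors → 1
print(circle_of_fifths_distance(0, 6))  # C to F#: antipodes → 6
```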