Head-related impulse responses (HRIRs) are subject-dependent and direction-dependent filters used in spatial audio synthesis. They describe the scattering response of the head, torso, and pinnae of the subject. We propose a structural factorization of the HRIRs into a product of non-negative and Toeplitz matrices; the factorization is based on a novel extension of a non-negative matrix factorization algorithm. As a result, the HRIR becomes expressible as a convolution between a direction-independent resonance filter and a direction-dependent reflection filter. Further, the reflection filter can be made sparse with minimal HRIR distortion. The described factorization is shown to be applicable to the arbitrary source signal case and allows one to employ time-domain convolution at a computational cost lower than using convolution in the frequency domain.
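The convolutional structure described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: the filter values, the tap positions, and the helper names `convolve_full` and `convolve_sparse` are hypothetical, chosen only to show that filtering a source with a dense HRIR equals filtering it with a shared resonance filter followed by a sparse reflection filter.

```python
def convolve_full(x, h):
    """Direct full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convolve_sparse(x, taps, length):
    """Convolve x with a sparse filter given as {delay: gain}.

    `length` is the nominal filter length; only the nonzero taps are visited.
    """
    y = [0.0] * (len(x) + length - 1)
    for delay, gain in taps.items():
        for i, xi in enumerate(x):
            y[i + delay] += gain * xi
    return y


# Hypothetical filters (illustrative values, not measured data).
resonance = [1.0, 0.5, 0.25, 0.125]          # direction-independent
reflection_taps = {0: 1.0, 3: -0.6, 7: 0.3}  # direction-dependent, sparse
reflection_len = 8
reflection_dense = [reflection_taps.get(n, 0.0) for n in range(reflection_len)]

# The HRIR for this direction is the convolution of the two filters.
hrir = convolve_full(resonance, reflection_dense)

source = [1.0, -1.0, 2.0, 0.0, 0.5]

# Route 1: filter the source with the dense HRIR.
direct = convolve_full(source, hrir)

# Route 2: apply the shared resonance filter once, then the sparse taps.
shared = convolve_full(source, resonance)
factored = convolve_sparse(shared, reflection_taps, reflection_len)
```

In this sketch the resonance stage can be computed once and reused for every direction, while each direction adds only as many multiply-adds per sample as it has nonzero reflection taps.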
We study the problem of dictionary learning for signals that can be represented as polynomials or polynomial matrices, such as convolutive signals with time delays or acoustic impulse responses. Recently, we developed a method for polynomial dictionary learning based on the fact that a polynomial matrix can be expressed as a polynomial with matrix coefficients, where the coefficient of the polynomial at each time lag is a scalar matrix. However, a polynomial matrix can equally be represented as a matrix with polynomial elements. In this paper, we develop an alternative method for learning a polynomial dictionary and a sparse representation method for polynomial signal reconstruction based on this model. The proposed methods can operate directly on the polynomial matrix without having to access its coefficient matrices. We demonstrate the performance of the proposed method for acoustic impulse response modeling.
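The two equivalent views of a polynomial matrix mentioned in this abstract can be made concrete with a minimal sketch. The nested-list layout and the helper names below are assumptions made for illustration; they are not the authors' implementation and do not perform dictionary learning.

```python
def coeffs_to_elements(coeff_mats):
    """Convert [lag][row][col] (matrix coefficients) to [row][col][lag] (polynomial elements)."""
    lags = len(coeff_mats)
    rows = len(coeff_mats[0])
    cols = len(coeff_mats[0][0])
    return [[[coeff_mats[l][r][c] for l in range(lags)]
             for c in range(cols)] for r in range(rows)]


def elements_to_coeffs(elem_mat):
    """Inverse conversion: [row][col][lag] back to [lag][row][col]."""
    lags = len(elem_mat[0][0])
    return [[[entry[l] for entry in row] for row in elem_mat]
            for l in range(lags)]


def poly_mul(p, q):
    """Multiply two polynomials in z^-1, i.e. convolve their coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def poly_add(p, q):
    """Add two polynomials, zero-padding the shorter one."""
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0.0) + (q[i] if i < len(q) else 0.0)
            for i in range(n)]


def polymat_vec(elem_mat, poly_vec):
    """Multiply a polynomial matrix by a vector of polynomials, element-wise."""
    out = []
    for row in elem_mat:
        acc = [0.0]
        for p, x in zip(row, poly_vec):
            acc = poly_add(acc, poly_mul(p, x))
        out.append(acc)
    return out


# A(z) = A0 + A1 z^-1, stored as one scalar matrix per time lag.
coeff_mats = [
    [[1.0, 0.0], [0.0, 1.0]],  # lag 0
    [[0.0, 2.0], [3.0, 0.0]],  # lag 1
]
elements = coeffs_to_elements(coeff_mats)

# x(z) = [1 + z^-1, 2]
signal = [[1.0, 1.0], [2.0]]
product = polymat_vec(elements, signal)
```

Here `polymat_vec` works entirely on the matrix-with-polynomial-elements form, so the per-lag coefficient matrices are never touched once the conversion is done.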
Concussion and repeated exposure to mild traumatic brain injury are risks for athletes in many sports. Direct head impacts are analyzed to improve the detection and awareness of head acceleration events so that an athlete's brain health can be appropriately monitored and treated. However, head accelerations can also be induced by impacts with little or no head involvement. In this work we evaluated whether impacts that do not involve direct head contact, such as being pushed in the torso, can be sufficient in collegiate American football to induce head accelerations comparable to direct head impacts. Datasets of impacts with and without direct head contact were collected and compared. These datasets were gathered using a state-of-the-art impact detection algorithm embedded in an instrumented mouthguard to record head kinematics. Video analysis was used to differentiate between impact types. In total, 15 impacts of each type were used in the comparison, with clear video screenshots available to distinguish each impact type. Analysis of the kinematics showed that the impacts without direct head contact achieved levels of linear and angular acceleration similar to those from direct head impacts. Finite element analyses using the median and peak kinematic signals were used to calculate the maximum principal strain of the brain. Statistical analysis revealed no significant difference between the two datasets at a Bonferroni-adjusted p-value threshold of 0.017, with the exception of peak linear acceleration. Impacts without direct head contact showed a higher mean peak linear acceleration of 17.6 g, compared with a mean of 6.1 g for direct head impacts. These results indicate that impacts other than direct head impacts can still produce meaningful kinematic loads in the head and as such should be included in athlete health monitoring.
An important problem in modeling head-related impulse responses (HRIRs) is how to individualize HRIRs so that they are suitable for a given listener. We modeled the entire magnitude head-related transfer functions (HRTFs), in the frequency domain, for sound sources on the horizontal plane of 37 subjects using principal components analysis (PCA). The individual magnitude HRTFs could be modeled adequately by a linear combination of only ten orthonormal basis functions. The goal of this research was to establish a multiple linear regression (MLR) between the weights of the basis functions obtained from PCA and a small number of anthropometric measurements in order to individualize a given listener's HRTFs from his or her own anthropometry. We propose an improved individualization method based on MLR of the basis function weights that uses 8 of 27 anthropometric measurements. Our objective experimental results show performance superior to that of our previous work on individualizing minimum-phase HRIRs and also better than that of similar research. The individualized magnitude HRTFs approximate the original ones well, with small error. A moving sound rendered with the reconstructed HRIRs could be perceived as if it were moving around the horizontal plane.
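The PCA-plus-regression pipeline described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the data are synthetic random placeholders, each subject is reduced to a single magnitude response of 64 frequency bins (`n_bins` is an arbitrary choice), and the helper `individualize` is hypothetical. Only the counts 37, 10, and 8 come from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_bins, n_anthro, n_components = 37, 64, 8, 10

# Synthetic placeholders standing in for measured data (assumption).
anthro = rng.normal(size=(n_subjects, n_anthro))
hrtf_mag = (anthro @ rng.normal(size=(n_anthro, n_bins))
            + 0.01 * rng.normal(size=(n_subjects, n_bins)))

# PCA via SVD of the mean-centered data: rows of `basis` are orthonormal.
mean_hrtf = hrtf_mag.mean(axis=0)
centered = hrtf_mag - mean_hrtf
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:n_components]        # shape (10, 64)
weights = centered @ basis.T     # shape (37, 10), PCA weights per subject

# Multiple linear regression from anthropometry (plus intercept) to the weights.
design = np.hstack([np.ones((n_subjects, 1)), anthro])   # shape (37, 9)
coef, *_ = np.linalg.lstsq(design, weights, rcond=None)  # shape (9, 10)


def individualize(measurements):
    """Estimate a magnitude response from one listener's measurements."""
    predicted_weights = np.concatenate([[1.0], measurements]) @ coef
    return mean_hrtf + predicted_weights @ basis


estimate = individualize(anthro[0])
```

With real data, the regression would be fitted on a training set of subjects and evaluated on listeners held out from both the PCA and the regression.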
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve the fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard. It also achieves 8x computation reduction and 3x measured speedup over MinkowskiNet with higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI.
End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, Transformer and Conformer have achieved state-of-the-art recognition accuracy, with self-attention playing a vital role in capturing important global information. However, the time and memory complexity of self-attention increases quadratically with the length of the sentence. In this paper, a prob-sparse self-attention mechanism is introduced into Conformer to sparsify the computation of self-attention in order to accelerate inference and reduce memory consumption. Specifically, we adopt a Kullback-Leibler divergence based sparsity measurement for each query to decide whether to compute the attention function on this query. By using the prob-sparse attention mechanism, we achieve an 8% to 45% inference speed-up and a 15% to 45% memory usage reduction for the self-attention module of Conformer Transducer while maintaining the same level of error rate.