Compute and memory efficient universal sound source separation

Added by Efthymios Tzinis

Publication date 2021

Fields: Informatics Engineering

Research language: English

Authors Efthymios Tzinis - Zhepei Wang - Xilin Jiang

Sound - Computation and Language - Machine Learning

Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem. In this study, we provide a family of efficient neural network architectures for general-purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios. The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), together with their aggregation, which is performed through simple one-dimensional convolutions. This mechanism enables our models to obtain high-fidelity signal separation in a wide variety of settings where a variable number of sources are present and computational resources are limited (e.g. floating point operations, memory footprint, number of parameters and latency). Our experiments show that SuDoRM-RF models perform comparably to, and even surpass, several state-of-the-art benchmarks that have significantly higher computational resource requirements. The causal variation of SuDoRM-RF obtains competitive performance in real-time speech separation, with around 10 dB of scale-invariant signal-to-distortion ratio improvement (SI-SDRi), while remaining up to 20 times faster than real-time on a laptop device.
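The SI-SDRi figure quoted in the abstract can be made concrete with a small sketch. The code below uses the commonly cited definition of scale-invariant SDR (project the estimate onto the target, then compare the energy of the projection with the energy of the residual); it is an illustration in plain Python, not the authors' evaluation code, and the function names and the `eps` stabiliser are assumptions introduced here.

```python
import math

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sequences of floats."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Optimal scaling of the target that best explains the estimate.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled)]
    signal_energy = sum(s * s for s in scaled)
    residual_energy = sum(r * r for r in residual)
    return 10 * math.log10((signal_energy + eps) / (residual_energy + eps))

def si_sdr_improvement(estimate, target, mixture):
    """SI-SDRi: gain of the estimate over using the raw mixture as the estimate."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

# Toy signals (hypothetical): two orthogonal sinusoids of equal energy.
n = 200
target = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
interference = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
mixture = [a + b for a, b in zip(target, interference)]
# An estimate that keeps the target and attenuates the interference by 10x.
estimate = [a + 0.1 * b for a, b in zip(target, interference)]
improvement_db = si_sdr_improvement(estimate, target, mixture)
```

Because the metric rescales the target before comparing, multiplying the estimate by any non-zero constant leaves the score essentially unchanged, which is what "scale-invariant" refers to.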

Sudo rm -rf: Efficient Networks for Universal Audio Source Separation

Efthymios Tzinis, Zhepei Wang, Paris Smaragdis (2020)

In this paper, we present an efficient neural network for end-to-end general-purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), together with their aggregation, which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high-quality audio source separation with a limited number of floating point operations, limited memory requirements, few parameters and low latency. Our experiments on both speech and environmental sound separation datasets show that SuDoRM-RF performs comparably to, and even surpasses, various state-of-the-art approaches that have significantly higher computational resource requirements.
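The idea of successively downsampling a feature sequence, resampling each resolution back to the original length and aggregating the results can be illustrated with a toy sketch. This is not the architecture from the paper: pairwise averaging stands in for the learned strided one-dimensional convolutions, nearest-neighbour repetition stands in for the resampling step, and a plain mean stands in for the learned aggregation. All function names and the `depth` parameter are hypothetical.

```python
def downsample(x):
    """Halve the length by averaging adjacent pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def upsample(x, length):
    """Nearest-neighbour resampling of x to the requested length."""
    return [x[i * len(x) // length] for i in range(length)]

def multi_resolution_block(x, depth=3):
    """Toy multi-resolution block: downsample `depth` times, resample, aggregate."""
    if len(x) % (2 ** depth) != 0:
        raise ValueError("length must be divisible by 2**depth")
    # Successive downsampling: each level has half the temporal resolution.
    features = [list(x)]
    for _ in range(depth):
        features.append(downsample(features[-1]))
    # Resample every level back to the input length.
    resampled = [upsample(f, len(x)) for f in features]
    # Aggregate across resolutions (a plain mean in this sketch).
    return [sum(values) / len(resampled) for values in zip(*resampled)]

# A step signal: coarse levels smear the edge, fine levels keep it sharp.
step = [0.0] * 8 + [1.0] * 8
mixed = multi_resolution_block(step, depth=3)
```

The appeal of this kind of structure is that coarse levels see long temporal context cheaply, because each downsampling step halves the number of samples that later operations have to process.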

Audio and Speech Processing - Computation and Language - Machine Learning

What's All the FUSS About Free Universal Sound Separation Data?

Scott Wisdom, Hakan Erdogan, Daniel Ellis (2020)

We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.5 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
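A minimal sketch of how a mixture of one to four sources might be assembled from single-source clips is shown below. It does not reproduce the dataset's actual tooling: there is no reverberation or gain handling, and padding the unused reference slots with silence is an assumption made here so that every example has a fixed number of references. The function name and its parameters are hypothetical.

```python
import random

def make_mixture(source_pool, max_sources=4, rng=None):
    """Sum a random number (1..max_sources) of equal-length clips into a mixture.

    Returns the mixture and a list of exactly max_sources reference signals,
    where unused slots are filled with silence (an assumption of this sketch).
    """
    rng = rng or random.Random()
    k = rng.randint(1, max_sources)          # number of active sources
    chosen = rng.sample(source_pool, k)      # pick k distinct clips
    length = len(chosen[0])
    mixture = [sum(clip[i] for clip in chosen) for i in range(length)]
    silence = [0.0] * length
    references = chosen + [silence] * (max_sources - k)
    return mixture, references

# Hypothetical pool of five short clips, eight samples each.
pool = [[float(i + j) for j in range(8)] for i in range(5)]
mixture, references = make_mixture(pool, rng=random.Random(0))
```

Keeping the number of reference slots fixed lets a separation model with a fixed number of outputs handle a variable number of active sources: it is simply expected to output silence for the slots that are not used.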

Sound - Audio and Speech Processing

End-to-end Non-Negative Autoencoders for Sound Source Separation

Shrikant Venkataramani, Efthymios Tzinis, Paris Smaragdis (2019)

Discriminative models for source separation have recently been shown to produce impressive results. However, when operating on sources outside of the training set, these models do not perform as well and are cumbersome to update. Classical methods like Non-negative Matrix Factorization (NMF) provide modular approaches to source separation that can be easily updated to adapt to new mixture scenarios. In this paper, we generalize NMF to develop end-to-end non-negative auto-encoders and demonstrate how they can be used for source separation. Our experiments indicate that these models deliver comparable separation performance to discriminative approaches, while retaining the modularity of NMF and the modeling flexibility of neural networks.

Sound - Audio and Speech Processing

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Yangyang Shi, Yongqiang Wang, Chunyang Wu (2020)

This paper proposes an efficient memory transformer, Emformer, for low-latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the computation for the key and value in self-attention for the left context. Emformer applies parallelized block processing in training to support low-latency models. We carry out experiments on the benchmark LibriSpeech data. Under an average latency of 960 ms, Emformer gets a WER of 2.50% on test-clean and 5.62% on test-other. Compared with a strong baseline, the augmented memory transformer (AM-TRF), Emformer gets a 4.6-fold training speedup and an 18% relative real-time factor (RTF) reduction in decoding, with a relative WER reduction of 17% on test-clean and 9% on test-other. For a low-latency scenario with an average latency of 80 ms, Emformer achieves a WER of 3.01% on test-clean and 7.09% on test-other. Compared with the LSTM baseline with the same latency and model size, Emformer gets a relative WER reduction of 9% and 16% on test-clean and test-other, respectively.
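The two relative quantities used in this abstract, relative WER reduction and real-time factor (RTF), follow the usual definitions sketched below. The numbers in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of a metric relative to its baseline value."""
    return (baseline - new) / baseline

def real_time_factor(processing_seconds, audio_seconds):
    """RTF: processing time divided by audio duration (below 1 is faster than real time)."""
    return processing_seconds / audio_seconds

# Hypothetical example: WER drops from 4.0% to 3.0%, a 25% relative reduction.
wer_gain = relative_reduction(4.0, 3.0)

# Hypothetical example: 0.5 s of compute for 10 s of audio gives an RTF of 0.05,
# i.e. processing runs 20 times faster than real time.
rtf = real_time_factor(0.5, 10.0)
speedup_over_real_time = 1.0 / rtf
```

Note that a relative reduction is measured against the baseline value, so a drop from 4.0% to 3.0% WER is a 25% relative reduction but only a 1.0 percentage point absolute reduction.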

Sound - Computation and Language - Machine Learning

Memory-efficient Speech Recognition on Smart Devices

Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar (2021)

Recurrent transducer models have emerged as a promising solution for speech recognition on current and next-generation smart devices. The transducer models provide competitive accuracy within a reasonable memory footprint, alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step, which adversely affects device battery life and limits their usability on low-power devices. We address the memory access concerns of transducer models by optimizing their model architecture and designing novel recurrent cells. We demonstrate that i) a model's energy cost is dominated by accessing model weights from off-chip memory, ii) the transducer model architecture is pivotal in determining the number of accesses to off-chip memory, and model size alone is not a good proxy, and iii) our transducer model optimizations and novel recurrent cell reduce off-chip memory accesses by 4.5x and model size by 2x with minimal accuracy impact.

Sound - Computation and Language - Audio and Speech Processing
