Recently, sound recognition has been used to identify sounds such as cars and rivers. However, sounds have nuances that may be better described by adjective-noun pairs such as slow car, and verb-noun pairs such as flying insects, which remain underexplored. Therefore, in this work we investigate the relation between audio content and both adjective-noun pairs and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus, consisting of a combined total of 1,123 pairs and over 33,000 audio files. One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with this type of label. A second contribution is to show the degree of correlation between the audio content and the labels through sound recognition experiments, which yielded an accuracy of 70%, thus also providing a performance benchmark. The results and study in this paper encourage further exploration of the nuances in audio and are meant to complement similar research performed on images and text in multimedia analysis.
In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveforms with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech comparable to WaveNet, while synthesizing several orders of magnitude faster, as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15$\times$ smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6$\times$ faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
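As a rough illustration of this dilated 2-D convolutional view of a waveform (a minimal sketch with assumed squeeze height and channel sizes, not the authors' implementation), a 1-D signal can be reshaped into a 2-D grid before applying a dilated 2-D convolution:

    # Sketch: squeeze a 1-D waveform into a 2-D grid and apply a dilated 2-D convolution.
    # The squeeze height h and channel count are illustrative assumptions.
    import torch
    import torch.nn as nn

    h = 16                                  # assumed squeeze height (hyperparameter)
    x = torch.randn(1, 64000)               # one 1-D waveform with 64k samples
    x2d = x.view(1, 1, h, -1)               # reshape to (batch, channels, height, width)

    # Dilated 2-D convolution over the reshaped waveform.
    conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, dilation=2, padding=2)
    features = conv(x2d)
    print(features.shape)                   # torch.Size([1, 64, 16, 4000])

The reshaping lets one axis carry short-range (within-row) structure and the other carry long-range (across-row) structure, which the 2-D convolution covers with a modest receptive field.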
In this paper, we focus on the problem of content-based retrieval for audio, which aims to retrieve all semantically similar audio recordings for a given audio clip query. This problem is similar to query by example for audio, which aims to retrieve media samples from a database that are similar to a user-provided example. We propose a novel approach which encodes the audio into a vector representation using Siamese neural networks. The goal is to obtain encodings that are similar for files belonging to the same audio class, thus allowing retrieval of semantically similar audio. Using simple similarity measures based on Euclidean distance and cosine similarity, we show that these representations can be used very effectively to retrieve recordings with similar audio content.
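A minimal retrieval sketch under these assumptions (the random vectors below stand in for embeddings produced by a trained Siamese encoder, which is not shown):

    # Sketch: rank database recordings by cosine similarity to a query embedding.
    import numpy as np

    def cosine_retrieve(query_emb, database_embs, top_k=5):
        """Return indices of the top_k most similar database embeddings."""
        q = query_emb / np.linalg.norm(query_emb)
        db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
        sims = db @ q                        # cosine similarity per database item
        return np.argsort(-sims)[:top_k]     # most similar recordings first

    rng = np.random.default_rng(0)
    query = rng.standard_normal(128)             # placeholder query embedding
    database = rng.standard_normal((1000, 128))  # placeholder database embeddings
    print(cosine_retrieve(query, database))

Euclidean-distance retrieval works the same way, with np.linalg.norm(database_embs - query_emb, axis=1) replacing the cosine score and ascending sort order.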
Recent advances in audio declipping have substantially improved the state of the art. Yet, practitioners need guidelines to choose a method, and while existing benchmarks have been instrumental in advancing the field, larger-scale experiments are needed to guide such choices. First, we show that the clipping levels in existing small-scale benchmarks are moderate and call for benchmarks with more perceptually significant clipping levels. We then propose a general algorithmic framework for declipping that covers existing and new combinations of variants of state-of-the-art techniques exploiting time-frequency sparsity: synthesis vs. analysis sparsity, with plain or structured sparsity. Finally, we systematically compare these combinations and a selection of state-of-the-art methods. Using a large-scale numerical benchmark and a smaller-scale formal listening test, we provide guidelines for various clipping levels, both for speech and various musical genres. The code is made publicly available for the purpose of reproducible research and benchmarking.
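To make the notion of clipping level concrete (a small illustrative sketch, not taken from the benchmark code; thresholds and the test tone are assumptions), one can hard-clip a signal at a threshold and measure the resulting input signal-to-distortion ratio (SDR):

    # Sketch: hard-clip a signal and report the input SDR in dB for several thresholds.
    import numpy as np

    def clip_and_sdr(x, threshold):
        y = np.clip(x, -threshold, threshold)                   # hard clipping
        sdr = 10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))  # input SDR in dB
        return y, sdr

    t = np.linspace(0, 1, 16000, endpoint=False)
    x = 0.9 * np.sin(2 * np.pi * 440 * t)                       # 440 Hz test tone
    for thr in (0.8, 0.4, 0.2):                                 # milder to more severe clipping
        _, sdr = clip_and_sdr(x, thr)
        print(f"threshold {thr}: input SDR {sdr:.1f} dB")

Lower thresholds give lower input SDR, i.e. the more perceptually significant clipping levels referred to above.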
Personalized recommendation of new track releases has always been a challenging problem in the music industry. To address this problem, we first explore user listening history and demographics to construct a user embedding representing the user's music preference. With the user embeddings and audio data from users' liked and disliked tracks, an audio embedding can be obtained for each track using metric learning with Siamese networks. For a new track, we can decide the best group of users to recommend it to by computing the similarity between the track's audio embedding and the different user embeddings. The proposed system yields state-of-the-art performance on content-based music recommendation, tested with millions of users and tracks. We also extract the audio embeddings as features for music genre classification tasks. The results show the generalization ability of our audio embeddings.
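A hedged sketch of the metric-learning idea (the loss form, dimensions, and variable names are assumptions, not the paper's exact objective): pull a track embedding toward the embedding of a user who liked it and push it away from that of a user who disliked it.

    # Sketch: contrastive loss between track audio embeddings and user embeddings.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(track_emb, user_emb, liked, margin=1.0):
        """liked is 1.0 if the user liked the track, 0.0 otherwise."""
        d = F.pairwise_distance(track_emb, user_emb)
        pos = liked * d.pow(2)                           # pull liked pairs together
        neg = (1 - liked) * F.relu(margin - d).pow(2)    # push disliked pairs apart
        return (pos + neg).mean()

    track = torch.randn(8, 64, requires_grad=True)   # batch of track audio embeddings
    user = torch.randn(8, 64)                        # corresponding user embeddings
    labels = torch.randint(0, 2, (8,)).float()       # 1 = liked, 0 = disliked
    print(contrastive_loss(track, user, labels))

At recommendation time, ranking user embeddings by similarity to a new track's audio embedding follows the same pattern as the retrieval sketch above.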
This paper introduces a multi-scale speech style modeling method for end-to-end expressive speech synthesis. The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech, which are then fed into the speech synthesis model as an extension of the input phoneme sequence. During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion. When the proposed method is applied to a style transfer task, experimental results indicate that the controllability of the multi-scale speech style model and the expressiveness of the synthesized speech are greatly improved. Moreover, by assigning a different reference utterance for style extraction at each scale, the flexibility of the proposed method is further demonstrated.
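One possible reading of "extension of the input phoneme sequence" is concatenating the style features with the phoneme encodings along the feature axis (a sketch with assumed shapes, not the paper's architecture):

    # Sketch: combine global utterance-level and local quasi-phoneme-level style
    # vectors with the encoded phoneme sequence. All dimensions are illustrative.
    import torch

    B, T, D = 2, 50, 256                      # batch, phoneme sequence length, model dim
    phoneme_emb = torch.randn(B, T, D)        # encoded phoneme sequence
    global_style = torch.randn(B, 1, 64)      # one utterance-level style vector
    local_style = torch.randn(B, T, 32)       # one style vector per quasi-phoneme

    # Broadcast the global style over the sequence and concatenate along features.
    combined = torch.cat(
        [phoneme_emb, global_style.expand(B, T, 64), local_style], dim=-1
    )
    print(combined.shape)                     # torch.Size([2, 50, 352])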