
Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning


Added by Xuenan Xu

Publication date: 2021

Fields: Informatics Engineering, Electronic Engineering

Research language: English

Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu

Sound Audio and Speech Processing

Abstract

Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. An audio caption describes a wide range of concepts, from local information such as sound events to global information such as the acoustic scene. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, which expects the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analyzing the hierarchical audio topics that are commonly covered. We then explore a transfer learning scheme to access local and global information. Two source tasks are identified to represent local and global information, respectively: Audio Tagging (AT) and Acoustic Scene Classification (ASC). Experiments are conducted on the AAC benchmark datasets Clotho and AudioCaps, and topic transfer learning yields a large improvement in all eight metrics. Further, it is found that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.


Variational Information Bottleneck for Effective Low-resource Audio Classification

Shijing Si, Jianzong Wang, Huiming Sun (2021)

Large-scale deep neural networks (DNNs) such as convolutional neural networks (CNNs) have achieved impressive performance in audio classification thanks to their powerful capacity and strong generalization ability. However, a DNN trained on a low-resource task is prone to overfitting the small dataset and learning too much redundant information. To address this issue, we propose to use the variational information bottleneck (VIB) to mitigate overfitting and suppress irrelevant information. In this work, we conduct experiments on a 4-layer CNN. However, the VIB framework is ready to use and could easily be combined with many other state-of-the-art network architectures. Evaluation on a few audio datasets shows that our approach significantly outperforms baseline methods, yielding more than 5.0% improvement in classification accuracy in some low-resource settings.
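The abstract does not spell out the VIB objective, so the following is only a minimal sketch of the commonly used formulation: a cross-entropy term plus a beta-weighted KL penalty between a diagonal Gaussian encoder and a standard normal prior. The function names, the choice of prior, and the default `beta` are assumptions made for illustration, not details taken from the paper.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def cross_entropy(logits, target):
    """Cross-entropy of one example, computed from unnormalised logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def vib_loss(logits, target, mu, log_var, beta=1e-3):
    """Assumed VIB objective: classification loss + beta * compression penalty."""
    return cross_entropy(logits, target) + beta * gaussian_kl_to_standard_normal(mu, log_var)
```

In this sketch a larger `beta` pushes the latent code toward the prior and so discards more input information, which is the mechanism the abstract credits with suppressing irrelevant information.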

Sound Audio and Speech Processing

Deep Learning for Audio Signal Processing

Hendrik Purwins (2019)

Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.

Sound Audio and Speech Processing Machine Learning

Learning Audio Embeddings with User Listening Data for Content-based Music Recommendation

Ke Chen, Beici Liang, Xiaoshuan Ma (2020)

Personalized recommendation of new track releases has always been a challenging problem in the music industry. To address this problem, we first use listening history and demographics to construct a user embedding representing each user's music preference. With the user embedding and audio data from the user's liked and disliked tracks, an audio embedding can be obtained for each track using metric learning with Siamese networks. For a new track, we can decide the best group of users to recommend it to by computing the similarity between the track's audio embedding and each user embedding. The proposed system yields state-of-the-art performance on content-based music recommendation, tested with millions of users and tracks. We also extract audio embeddings as features for music genre classification tasks. The results show the generalization ability of our audio embeddings.
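The matching step described above can be sketched as follows. The abstract only says "similarity", so cosine similarity is an assumption here, as are the function names, the `top_k` parameter, and the toy embeddings in the usage example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_users_for_track(track_embedding, user_embeddings, top_k=2):
    """Return the top_k (user_id, score) pairs, most similar user first."""
    scored = [
        (user_id, cosine_similarity(track_embedding, embedding))
        for user_id, embedding in user_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 2-dimensional embeddings, for illustration only.
users = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = rank_users_for_track([1.0, 0.0], users)
```

Here `best` lists user "a" first and user "c" second, because their embeddings point most nearly in the same direction as the track embedding.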

Sound Audio and Speech Processing

Deep Learning Frameworks Applied For Audio-Visual Scene Classification

Lam Pham, Alexander Schindler, Mina Schutz (2021)

In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and show how individual visual and audio features, as well as their combination, affect SC performance. Our extensive experiments, conducted on the DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development dataset, achieve best classification accuracies of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and combined audio-visual input, respectively. The highest classification accuracy of 93.9%, obtained from an ensemble of audio-based and visual-based frameworks, is an improvement of 16.5% over the DCASE baseline.

Sound Audio and Speech Processing

Binaural Audio Generation via Multi-task Learning

Sijia Li, Shiguang Liu, Dinesh Manocha (2021)

We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages additional information from two related tasks: the binaural audio generation task and the flipped audio classification task. Our learning model extracts spatialization features from the visual and audio input, predicts the left and right audio channels, and judges whether the left and right channels are flipped. First, we extract visual features using ResNet from the video frames. Next, we perform binaural audio generation and flipped audio classification using separate subnetworks based on visual features. Our learning method optimizes the overall loss based on the weighted sum of the losses of the two tasks. We train and evaluate our model on the FAIR-Play dataset and the YouTube-ASMR dataset. We perform quantitative and qualitative evaluations to demonstrate the benefits of our approach over prior techniques.
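The weighted-sum objective mentioned above can be sketched as follows. The task names and weight values are hypothetical; the abstract does not state the weights or the form of the individual losses.

```python
def weighted_multitask_loss(task_losses, task_weights):
    """Combine per-task losses into one scalar as a weighted sum."""
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Hypothetical loss values and weights, for illustration only.
total = weighted_multitask_loss(
    {"binaural_generation": 2.0, "flip_classification": 0.5},
    {"binaural_generation": 1.0, "flip_classification": 0.1},
)
```

Raising the weight of the auxiliary flipped-audio task makes it contribute more to the gradient relative to the generation task, so the weights act as a knob for balancing the two.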

Sound Audio and Speech Processing
