CL4AC: A Contrastive Loss for Audio Captioning

Added by Xubo Liu

Publication date: 2021

Fields: Electronic Engineering, Informatics Engineering

Research language: English

Authors: Xubo Liu, Qiushi Huang, Xinhao Mei

Audio and Speech Processing Artificial Intelligence Sound

Automated audio captioning (AAC) is a cross-modal translation task that aims to describe the content of an audio clip in natural language. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text descriptions, and a decoder is then used to generate the captions. However, training an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representation and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and texts, even when training with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
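The abstract does not spell out the form of the contrastive loss, so the following is only a minimal sketch of one common way to contrast matched and mismatched audio-text pairs: a symmetric InfoNCE-style loss over a batch of embeddings. It should not be read as the CL4AC formulation itself. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(audio_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's loss).

    Row i of audio_emb and row i of text_emb are assumed to be a matched
    pair; every other combination in the batch acts as a negative sample.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine similarity between every audio clip and every caption.
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each clip should pick out its own caption.
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each caption should pick out its own clip.
    sim_t = [list(col) for col in zip(*sim)]
    t2a = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy 2-D embeddings: matched pairs give a much lower loss than swapped ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]]))
print(contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]]))
```

In a real system the embeddings would come from the audio encoder and a text encoder, and this term would typically be added to the captioning (cross-entropy) loss rather than replace it.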

Read More

Audio Captioning Transformer

Xinhao Mei, Xubo Liu, Qiushi Huang (2021)

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
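To illustrate why a Transformer can relate any two time frames directly, here is a minimal sketch of single-head scaled dot-product self-attention over a short sequence of frame features. It is not the ACT architecture: the learned query, key, and value projections, multiple heads, and positional encodings are all omitted (the frames themselves serve as queries, keys, and values), and the function names and toy data are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections.

    frames: list of T feature vectors, one per time frame.
    Returns (outputs, weights), where weights[i][j] is how strongly
    frame i attends to frame j, regardless of their distance in time.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all frames in the clip.
        outputs.append(
            [sum(w[j] * frames[j][c] for j in range(len(frames))) for c in range(d)]
        )
    return outputs, weights


# Frames 0 and 2 are identical, so frame 0 attends to both equally,
# even though they are not adjacent in time.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
print(weights[0])
```

Because every frame is compared with every other frame in a single step, the distance between two audio events does not limit how directly they can interact, which is the property the abstract contrasts with CNN and RNN encoders.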

Audio and Speech Processing Machine Learning Sound

CDPAM: Contrastive learning for perceptual audio similarity

Pranay Manocha, Zeyu Jin, Richard Zhang (2021)

Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.

Audio and Speech Processing Machine Learning Sound

A Transformer-based Audio Captioning Model with Keyword Estimation

Yuma Koizumi, Ryo Masumura, Kyosuke Nishida (2020)

One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation). TRACKE estimates keywords, which comprise a word set corresponding to audio events/scenes in the input audio, and generates the caption while referring to the estimated keywords to reduce word-selection indeterminacy. Experimental results on a public AAC dataset indicate that TRACKE achieved state-of-the-art performance and successfully estimated both the caption and its keywords.

Audio and Speech Processing Machine Learning Sound

Adversarially Training for Audio Classifiers

Raymel Alfonso Sallo, Mohammad Esmaeilpour, Patrick Cardinal (2020)

In this paper, we investigate the potential effect of adversarial training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. We then demonstrate the positive impact of adversarial training on this model, as well as on other deep architectures, against six types of attack algorithms (white-box and black-box), at the cost of reduced recognition accuracy and limited adversarial perturbation. We run our experiments on two benchmark environmental sound datasets and show that, without any imposed limitations on the budget allocations for the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist at any scale, but they might require higher adversarial perturbations compared to non-adversarially trained models.

Audio and Speech Processing Machine Learning Sound

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Haoye Lu, Haolong Zhang, Amit Nayak (2020)

Audio classification is considered a challenging problem in pattern recognition. Recently, many algorithms have been proposed using deep neural networks. In this paper, we introduce a new attention-based neural network architecture called Classifier-Attention-Based Convolutional Neural Network (CAB-CNN). The algorithm uses a newly designed architecture consisting of a list of simple classifiers and an attention mechanism that acts as a classifier selector. This design significantly reduces the number of parameters required by the classifiers and thus their complexity. In this way, it becomes easier to train the classifiers and achieve high and steady performance. Our claims are corroborated by the experimental results. Compared to state-of-the-art algorithms, our algorithm achieves improvements of more than 10% on all selected test scores.
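The idea of an attention mechanism acting as a selector over several simple classifiers can be sketched roughly as follows. This is a hypothetical illustration, not the CAB-CNN design: the abstract does not say how attention scores are computed or combined, so here the scores are simply given as inputs, turned into weights with a softmax, and used to mix the classifiers' class distributions. All names and numbers are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_classifiers(classifier_probs, attention_scores):
    """Mix per-classifier class distributions using attention weights.

    classifier_probs: one probability distribution over classes per
        simple classifier (hypothetical outputs).
    attention_scores: one unnormalized score per classifier; in a real
        model these would be produced by a learned attention module.
    """
    weights = softmax(attention_scores)
    n_classes = len(classifier_probs[0])
    mixed = [
        sum(w * probs[c] for w, probs in zip(weights, classifier_probs))
        for c in range(n_classes)
    ]
    return mixed, weights


# Two toy classifiers that disagree; attention favours the first one.
probs = [[0.9, 0.1], [0.2, 0.8]]
mixed, weights = combine_classifiers(probs, attention_scores=[5.0, 0.0])
print(weights)
print(mixed)
```

Since the attention weights sum to one and each classifier outputs a valid distribution, the mixture is itself a valid distribution over classes.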

Audio and Speech Processing Machine Learning Sound
