
CL4AC: A Contrastive Loss for Audio Captioning

Published by: Xubo Liu
Publication date: 2021
Research language: English





Automated audio captioning (AAC) is a cross-modal translation task that aims to describe the content of an audio clip in natural language. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, in which the audio signal is encoded into a latent representation and aligned with its corresponding text description, and a decoder then generates the caption. However, training an AAC system often suffers from data scarcity, which may lead to inaccurate latent representations and poor audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which improves the quality of the latent representation and the audio-text alignment when training with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
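
To make the contrastive idea concrete, below is a minimal PyTorch sketch of a batch-wise audio-text contrastive (InfoNCE-style) loss, where matched audio-caption pairs are pulled together and mismatched pairs in the same batch act as negatives. The pooled embeddings, embedding dimension and temperature are illustrative assumptions, not the paper's exact CL4AC formulation.

```python
# Minimal sketch of a batch-wise contrastive audio-text alignment loss.
# Embedding shapes and the temperature are assumptions, not the exact CL4AC loss.
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Matched pairs sit on the diagonal; all other entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: add this term to the usual captioning cross-entropy during training.
loss = audio_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```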




Read also

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
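
As a rough illustration of the convolution-free idea, the sketch below projects mel-spectrogram frames into a Transformer encoder-decoder using PyTorch's built-in nn.Transformer. The feature sizes, layer counts and the omission of positional encodings are simplifications for brevity, not the actual ACT architecture.

```python
# Rough sketch of a convolution-free encoder-decoder for audio captioning.
# Sizes are assumptions; positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class TinyAudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, vocab_size=5000, d_model=256):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)    # mel frames -> model dimension
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=4, num_decoder_layers=4,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, frames, n_mels); tokens: (batch, seq) previous caption words
        tgt_mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.transformer(self.frame_proj(mel), self.token_emb(tokens), tgt_mask=tgt_mask)
        return self.out(h)                              # (batch, seq, vocab_size)

model = TinyAudioCaptioner()
logits = model(torch.randn(2, 500, 64), torch.randint(0, 5000, (2, 12)))
```
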
Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.
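
A toy version of the triplet-comparison idea is sketched below: an embedding network is trained so that the clip human raters judged closer to the reference ends up at a smaller distance. The encoder, clip length and margin are placeholders, not CDPAM's actual model.

```python
# Toy sketch of learning an audio distance from human triplet judgments.
# The encoder, input length and margin are placeholders, not CDPAM itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Linear(16000, 512), nn.ReLU(), nn.Linear(512, 128))

def triplet_judgment_loss(reference, closer, farther, margin=0.2):
    # Raters judged `closer` to sound more similar to `reference` than `farther`.
    d_pos = (embed(reference) - embed(closer)).pow(2).sum(-1)
    d_neg = (embed(reference) - embed(farther)).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

# One-second clips at 16 kHz, flattened to vectors for this toy example.
loss = triplet_judgment_loss(torch.randn(4, 16000), torch.randn(4, 16000), torch.randn(4, 16000))
```
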
One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation). TRACKE estimates keywords, which comprise a word set corresponding to audio events/scenes in the input audio, and generates the caption while referring to the estimated keywords to reduce word-selection indeterminacy. Experimental results on a public AAC dataset indicate that TRACKE achieved state-of-the-art performance and successfully estimated both the caption and its keywords.
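
The joint training idea can be written as a weighted sum of a captioning loss and a keyword-estimation loss, as in the sketch below. The loss weight, tensor shapes and keyword vocabulary are assumptions, not the exact TRACKE objective.

```python
# Sketch of a joint caption + keyword-estimation objective. The weight alpha,
# tensor shapes and keyword vocabulary size are illustrative assumptions.
import torch
import torch.nn.functional as F

def caption_with_keywords_loss(caption_logits, caption_targets,
                               keyword_logits, keyword_targets, alpha=0.5):
    # caption_logits: (batch, seq, vocab); caption_targets: (batch, seq) word ids
    # keyword_logits: (batch, n_keywords); keyword_targets: (batch, n_keywords) multi-hot
    cap_loss = F.cross_entropy(caption_logits.transpose(1, 2), caption_targets)
    kw_loss = F.binary_cross_entropy_with_logits(keyword_logits, keyword_targets)
    return cap_loss + alpha * kw_loss

loss = caption_with_keywords_loss(torch.randn(2, 12, 5000), torch.randint(0, 5000, (2, 12)),
                                  torch.randn(2, 300), torch.randint(0, 2, (2, 300)).float())
```
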
In this paper, we investigate the potential effect of adversarial training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. Then we demonstrate the positive impact of adversarial training on this model as well as on other deep architectures against six types of attack algorithms (white- and black-box), at the cost of reduced recognition accuracy and limited adversarial perturbation. We run our experiments on two benchmark environmental sound datasets and show that, without any limitation imposed on the budget allocated to the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist at any scale, but they might require higher adversarial perturbations compared to non-adversarially trained models.
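
For readers unfamiliar with adversarial training, the sketch below shows one common variant, an FGSM-based training step that mixes clean and perturbed inputs. The perturbation budget, the choice of attack and the model interface are assumptions, not the paper's experimental setup.

```python
# Minimal FGSM-style adversarial-training step; epsilon and the choice of
# attack are assumptions, not the paper's experimental setup.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.01):
    # 1) Craft adversarial examples with the fast gradient sign method.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    # 2) Update the model on clean and adversarial inputs together.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```
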
Audio classification is considered a challenging problem in pattern recognition. Recently, many algorithms have been proposed using deep neural networks. In this paper, we introduce a new attention-based neural network architecture called Classifier-Attention-Based Convolutional Neural Network (CAB-CNN). The algorithm uses a newly designed architecture consisting of a list of simple classifiers and an attention mechanism acting as a classifier selector. This design significantly reduces the number of parameters required by the classifiers and thus their complexity. In this way, it becomes easier to train the classifiers and achieve high and steady performance. Our claims are corroborated by the experimental results. Compared to the state-of-the-art algorithms, our algorithm achieves improvements of more than 10% on all selected test scores.
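
The "attention as a classifier selector" idea can be sketched as follows: several small classifiers score the same features, and an attention head weights their predictions. The feature extractor and the sizes here are placeholders, not the actual CAB-CNN architecture.

```python
# Sketch of attention over a list of simple classifiers. Feature sizes and
# the missing CNN feature extractor are placeholders, not the CAB-CNN design.
import torch
import torch.nn as nn

class ClassifierAttention(nn.Module):
    def __init__(self, feat_dim=128, n_classes=10, n_classifiers=8):
        super().__init__()
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, n_classes) for _ in range(n_classifiers)])
        self.attn = nn.Linear(feat_dim, n_classifiers)   # one weight per classifier

    def forward(self, feats):                            # feats: (batch, feat_dim)
        logits = torch.stack([c(feats) for c in self.classifiers], dim=1)  # (B, K, C)
        weights = torch.softmax(self.attn(feats), dim=-1).unsqueeze(-1)    # (B, K, 1)
        return (weights * logits).sum(dim=1)             # attention-weighted vote (B, C)

out = ClassifierAttention()(torch.randn(4, 128))         # (4, 10) class logits
```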
