
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization



Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor


Abstract


Leveraging large-scale unlabeled web videos, such as instructional videos, for pre-training followed by task-specific finetuning has become the de facto approach for many video-and-language tasks. However, these instructional videos are very noisy: the accompanying ASR narrations are often incomplete and can be irrelevant to or temporally misaligned with the visual content, limiting the performance of models trained on such data. To address these issues, we propose an improved video-and-language pre-training method that first adds automatically extracted dense region captions from the video frames as auxiliary text input, to provide informative visual cues for learning better video and language associations. Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss to encourage the model to automatically focus on the correct caption from a pool of candidate ASR captions. Our overall approach is named DeCEMBERT (Dense Captions and Entropy Minimization). Comprehensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate that our approach outperforms previous state-of-the-art methods. Ablation studies on pre-training and downstream tasks show that adding dense captions and the constrained attention loss both help improve model performance. Lastly, we also provide attention visualizations to show the effect of applying the proposed constrained attention loss.
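As a rough illustration of the entropy-minimization idea described in the abstract, the sketch below sums attention weights per candidate ASR caption and computes the entropy of the resulting distribution; a lower value means the attention is concentrated on a single caption. This is not the paper's actual loss: the function name `caption_attention_entropy`, the per-token `segment_ids` layout, and the aggregation by summation are all assumptions made for illustration.

```python
import math

def caption_attention_entropy(attn, segment_ids, num_segments):
    """Entropy of attention mass aggregated per candidate caption.

    attn:         attention weights over caption tokens (hypothetical input)
    segment_ids:  index of the candidate caption each token belongs to
    num_segments: number of candidate captions in the pool
    """
    # Sum the attention each candidate caption receives.
    mass = [0.0] * num_segments
    for weight, seg in zip(attn, segment_ids):
        mass[seg] += weight

    total = sum(mass)
    if total <= 0.0:
        raise ValueError("attention weights must have positive total mass")

    # Normalize to a distribution over captions and take its entropy.
    probs = [m / total for m in mass]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Attention focused on one caption yields lower entropy than attention
# spread evenly over all candidates.
focused = caption_attention_entropy([0.9, 0.05, 0.05], [0, 1, 2], 3)
spread = caption_attention_entropy([1 / 3, 1 / 3, 1 / 3], [0, 1, 2], 3)
```

Adding such a term to a training objective would penalize attention that is spread across several candidate captions, which matches the stated goal of focusing on the correct caption; how the paper constrains or weights the term is not given in the abstract.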

References used

https://aclanthology.org/


Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding

451 - Association for Computational Linguistics 2021 (article)

Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data in low-resource target languages, the augmented data sets are often noisy, and thus impede the performance of SLU models. In this paper we focus on mitigating noise in augmented data. We develop a denoising training approach. Multiple models are trained with data produced by various augmentation methods. Those models provide supervision signals to each other. The experimental results show that our method outperforms the existing state of the art by 3.05 and 4.24 percentage points on two benchmark datasets, respectively. The code will be made open source on GitHub.

cross-lingual, spoken language

Between Flexibility and Consistency: Joint Generation of Captions and Subtitles

62 - Association for Computational Linguistics 2021 (article)

Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing (i.e. captions). However, the joint generation of source captions and target subtitles not only brings potential output quality advantages when the two decoding processes inform each other, but is also often required in multilingual scenarios. In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.

joint generation, captions

Learning from Noisy Labels for Entity-Centric Information Extraction

260 - Association for Computational Linguistics 2021 (article)

Recent information extraction approaches have relied on training deep neural models. However, such models can easily overfit noisy labels and suffer from performance degradation. While it is very costly to filter noisy labels in large learning resources, recent studies show that such labels take more training steps to be memorized and are more frequently forgotten than clean labels, and are therefore identifiable in training. Motivated by such properties, we propose a simple co-regularization framework for entity-centric information extraction, which consists of several neural models with identical structures but different parameter initialization. These models are jointly optimized with the task-specific losses and are regularized to generate similar predictions based on an agreement loss, which prevents overfitting on noisy labels. Extensive experiments on two widely used but noisy benchmarks for information extraction, TACRED and CoNLL03, demonstrate the effectiveness of our framework. We release our code to the community for future research.

entity-centric, information extraction, noisy labels

Learning with Different Amounts of Annotation: From Zero to Many Labels

219 - Association for Computational Linguistics 2021 (article)

Training NLP systems typically assumes access to annotated data that has a single human label per example. Given imperfect labeling from annotators and the inherent ambiguity of language, we hypothesize that a single label is not sufficient to learn the spectrum of language interpretation. We explore new annotation distribution schemes, assigning multiple labels per example for a small subset of training examples. Introducing such multi-label examples at the cost of annotating fewer examples brings clear gains on a natural language inference task and an entity typing task, even when we simply first train with single-label data and then fine-tune with multi-label examples. Extending a MixUp data augmentation framework, we propose a learning algorithm that can learn from training examples with different amounts of annotation (with zero, one, or multiple labels). This algorithm efficiently combines signals from uneven training data and brings additional gains in low annotation budget and cross-domain settings. Together, our method achieves consistent gains in two tasks, suggesting that distributing labels unevenly among training examples can be beneficial for many NLP tasks.

single label, labels

Re-entry Prediction for Online Conversations via Self-Supervised Learning

106 - Association for Computational Linguistics 2021 (article)

In recent years, world business in online discussions and opinion sharing on social media has been booming. The re-entry prediction task is thus proposed to help people keep track of the discussions which they wish to continue. Nevertheless, existing works only focus on exploiting chatting history and context information, and ignore the potentially useful learning signals underlying conversation data, such as conversation thread patterns and repeated engagement of target users, which help better understand the behavior of target users in conversations. In this paper, we propose three interesting and well-founded auxiliary tasks, namely, Spread Pattern, Repeated Target user, and Turn Authorship, as the self-supervised signals for re-entry prediction. These auxiliary tasks are trained together with the main task in a multi-task manner. Experimental results on two datasets newly collected from Twitter and Reddit show that our method outperforms the previous state of the art with fewer parameters and faster convergence. Extensive experiments and analysis show the effectiveness of our proposed models and also point out some key ideas in designing self-supervised tasks.

re-entry prediction, online conversations, re-entry prediction task
