نقدم VideoClip، وهو نهج مقاوم للتناقض في تدريب نموذج موحد مسبقا لفهم الفيديو والنصية الصفرية، دون استخدام أي ملصقات على مهام المصب.يقوم VideoClep بتدريب محول الفيديو والنص عن طريق تناقض أزواج فيديو إيجابية مؤقتة متداخلة مع السلبيات الصعبة من أقرب استرجاع جار.تجاربنا على سلسلة متنوعة من المهام المصب، بما في ذلك استرجاع الفيديو على مستوى التسلسل، والتعريب الخاص بمستوى عمل Videoqa ومستوى الرمز المميز، وتجزئة العمل تكشف عن أداء حالة من بين الفن، وتجاوز العمل السابق، وفي بعض الحالات يفوقنالنهج الخاضعة للإشراف.يتوفر الكود في https://github.com/pytorch/fairseq/examples/mmpt.
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
References used
https://aclanthology.org/
Advances in English language representation enabled a more sample-efficient pre-training task by Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA). Which, instead of training a model to recover masked tokens, it
Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence
Multilingual pre-trained models have achieved remarkable performance on cross-lingual transfer learning. Some multilingual models such as mBERT, have been pre-trained on unlabeled corpora, therefore the embeddings of different languages in the models
Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, fatefully suffer from slow inference speed due to enormous computational cost mainly from cross-modal
Coarse-grained linguistic information, such as named entities or phrases, facilitates adequately representation learning in pre-training. Previous works mainly focus on extending the objective of BERT's Masked Language Modeling (MLM) from masking ind