New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Does Pretraining for Summarization Require Knowledge Transfer?

هل تتطلب الاحتجاط بالتلخيص تحويل المعرفة؟

132 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer.

References used

https://aclanthology.org/

rate research

A Thorough Evaluation of Task-Specific Pretraining for Summarization

135 - Association for Computation Linguistics 2021 مقالة

Task-agnostic pretraining objectives like masked language models or corrupted span prediction are applicable to a wide range of NLP downstream tasks (Raffel et al.,2019), but are outperformed by task-specific pretraining objectives like predicting ex tracted gap sentences on summarization (Zhang et al.,2020). We compare three summarization specific pretraining objectives with the task agnostic corrupted span prediction pretraining in controlled study. We also extend our study to a low resource and zero shot setup, to understand how many training examples are needed in order to ablate the task-specific pretraining without quality loss. Our results show that task-agnostic pretraining is sufficient for most cases which hopefully reduces the need for costly task-specific pretraining. We also report new state-of-the-art number for two summarization task using a T5 model with 11 billion parameters and an optimal beam search length penalty.

task-specific pretraining pretraining objectives احتجاج المهام الخاصة أهداف الاحتجاج صناعة حمض الفوسفور

Limitations of Knowledge Distillation for Zero-shot Transfer Learning

276 - Association for Computation Linguistics 2021 مقالة

Pretrained transformer-based encoders such as BERT have been demonstrated to achieve state-of-the-art performance on numerous NLP tasks. Despite their success, BERT style encoders are large in size and have high latency during inference (especially o n CPU machines) which make them unappealing for many online applications. Recently introduced compression and distillation methods have provided effective ways to alleviate this shortcoming. However, the focus of these works has been mainly on monolingual encoders. Motivated by recent successes in zero-shot cross-lingual transfer learning using multilingual pretrained encoders such as mBERT, we evaluate the effectiveness of Knowledge Distillation (KD) both during pretraining stage and during fine-tuning stage on multilingual BERT models. We demonstrate that in contradiction to the previous observation in the case of monolingual distillation, in multilingual settings, distillation during pretraining is more effective than distillation during fine-tuning for zero-shot transfer learning. Moreover, we observe that distillation during fine-tuning may hurt zero-shot cross-lingual performance. Finally, we demonstrate that distilling a larger model (BERT Large) results in the strongest distilled model that performs best both on the source language as well as target languages in zero-shot settings.

limitations of knowledge distillation knowledge distillation قيود المعرفة التقطير تقطر المعرفة صناعة حمض الفوسفور المزيد..

Does Vision-and-Language Pretraining Improve Lexical Grounding?

299 - Association for Computation Linguistics 2021 مقالة

Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and- Language (VL) models, trained jointly on text and image or video data, ha ve been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts. This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting. We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.

improve lexical grounding pretraining improve lexical improve lexical تحسين التأريض المعجمي المحاكمة تحسين المعجم تحسين المعهد صناعة حمض الفوسفور المزيد..

Do Transformer Modifications Transfer Across Implementations and Applications?

162 - Association for Computation Linguistics 2021 مقالة

The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these mo difications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.

transformer modifications transfer modifications transfer تحويل تحويلات المحولات تحويل التعديلات صناعة حمض الفوسفور

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

294 - Association for Computation Linguistics 2021 مقالة

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications. To verify pre-trained models' transferability, we test the pre-trained models on text classification tasks with meanings of tokens mismatches, and real-world non-text token sequence classification data, including amino acid, DNA, and music. We find that even on non-text data, the models pre-trained on text converge faster, perform better than the randomly initialized models, and only slightly worse than the models using task-specific knowledge. We also find that the representations of the text and non-text pre-trained models share non-trivial similarities.

cross-disciplinary knowledge learner pre-trained models' transferability knowledge learner المتعلم المعرفي متعدد التخصصات نماذج النماذج المدربة مسبقا متعلم المعرفة صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Does Pretraining for Summarization Require Knowledge Transfer?

هل تتطلب الاحتجاط بالتلخيص تحويل المعرفة؟

Ask ChatGPT about the research

Read More

suggested questions