Research papers, master and doctoral theses about transformers

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

702 - Association for Computation Linguistics 2021 مقالة

Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day to day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP m odels when fine-tuned on real-world data. One way to resolve this issue is through lexical normalization, which is the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem. As the noisy text is a pervasive problem across languages, not just English, we leverage the multi-lingual pre-training of mBART to fine-tune it to our data. While current approaches mainly operate at the word or subword level, we argue that this approach is straightforward from a technical standpoint and builds upon existing pre-trained transformer networks. Our results show that while word-level, intrinsic, performance evaluation is behind other methods, our model improves performance on extrinsic, downstream tasks through normalization compared to models operating on raw, unprocessed, social media text.

multilingual transformers multilingual محولات متعددة اللغات متعدد اللغات صناعة حمض الفوسفور

Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation

740 - Association for Computation Linguistics 2021 مقالة

We study the power of cross-attention in the Transformer architecture within the context of transfer learning for machine translation, and extend the findings of studies into cross-attention when training from scratch. We conduct a series of experime nts through fine-tuning a translation model on data where either the source or target language has changed. These experiments reveal that fine-tuning only the cross-attention parameters is nearly as effective as fine-tuning all parameters (i.e., the entire translation model). We provide insights into why this is the case and observe that limiting fine-tuning in this manner yields cross-lingually aligned embeddings. The implications of this finding for researchers and practitioners include a mitigation of catastrophic forgetting, the potential for zero-shot translation, and the ability to extend machine translation models to several new language pairs with reduced parameter storage overhead.

adapting pretrained transformers adapting pretrained تكييف المحولات مسبقا التكيف مع الاحاد صناعة حمض الفوسفور

Integrating Transformers and Knowledge Graphs for Twitter Stance Detection

841 - Association for Computation Linguistics 2021 مقالة

Stance detection (SD) entails classifying the sentiment of a text towards a given target, and is a relevant sub-task for opinion mining and social media analysis. Recent works have explored knowledge infusion supplementing the linguistic competence a nd latent knowledge of large pre-trained language models with structured knowledge graphs (KGs), yet few works have applied such methods to the SD task. In this work, we first perform stance-relevant knowledge probing on Transformers-based pre-trained models in a zero-shot setting, showing these models' latent real-world knowledge about SD targets and their sensitivity to context. We then train and evaluate new knowledge-enriched stance detection models on two Twitter stance datasets, achieving state-of-the-art performance on both.

integrating transformers twitter stance detection stance detection دمج المحولات كشف موقف تويتر اكتشاف الموقف صناعة حمض الفوسفور المزيد..

Memory-efficient Transformers via Top-k Attention

803 - Association for Computation Linguistics 2021 مقالة

Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-*k* scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants, such as Performer and RFA (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of top-*k* approximation for multi-head attention layers on the Long Range Arena Benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.

memory-efficient transformers transformers via top-k top-k attention محولات كفاءة الذاكرة المحولات عبر Top-K اهتمام Top-K صناعة حمض الفوسفور المزيد..

What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

1106 - Association for Computation Linguistics 2021 مقالة

GPT-3 shows remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billion scale data. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performances of d ifferent sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean variant of 82B GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean-specific tokenization, HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performances on various downstream tasks in Korean. Also, we show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. Then we discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to non-experts of ML by introducing HyperCLOVA studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.

language models bring generative pretrained transformers نماذج اللغة تجلب محولات الإنتاج المحددة مسبقا صناعة حمض الفوسفور

Consistent Accelerated Inference via Confident Adaptive Transformers

922 - Association for Computation Linguistics 2021 مقالة

We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, bu t can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.

consistent accelerated inference confident adaptive transformers consistent accelerated الاستدلال المتسار المتسق محولات واثقة من التكيف متساو تسريع صناعة حمض الفوسفور المزيد..

Frequency Effects on Syntactic Rule Learning in Transformers

869 - Association for Computation Linguistics 2021 مقالة

Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT's performance on English subject--verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject--verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form, as well as the frequency relative to the alternate inflection, are causally implicated in the predictions BERT makes at inference time. Closer analysis of these frequency effects reveals that BERT's behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items.

syntactic rule learning learning in transformers syntactic rule الحكم النحوي التعلم التعلم في المحولات صناعة حمض الفوسفور

Gradient-based Adversarial Attacks against Text Transformers

698 - Association for Computation Linguistics 2021 مقالة

We propose the first general-purpose gradient-based adversarial attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix , hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks, outperforming prior work in terms of adversarial success rate with matching imperceptibility as per automated and human evaluation. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.

text transformers gradient-based adversarial attacks محولات النص النص هجمات الخصومة القائمة على التدرج صناعة حمض الفوسفور

Contrastive Out-of-Distribution Detection for Pretrained Transformers

1004 - Association for Computation Linguistics 2021 مقالة

Pretrained Transformers achieve remarkable performance when training and test data are from the same distribution. However, in real-world scenarios, the model often faces out-of-distribution (OOD) instances that can cause severe semantic shift proble ms at inference time. Therefore, in practice, a reliable model should identify such instances, and then either reject them during inference or pass them over to models that handle another distribution. In this paper, we develop an unsupervised OOD detection method, in which only the in-distribution (ID) data are used in training. We propose to fine-tune the Transformers with a contrastive loss, which improves the compactness of representations, such that OOD instances can be better differentiated from ID ones. These OOD instances can then be accurately detected using the Mahalanobis distance in the model's penultimate layer. We experiment with comprehensive settings and achieve near-perfect OOD detection performance, outperforming baselines drastically. We further investigate the rationales behind the improvement, finding that more compact representations through margin-based contrastive learning bring the improvement. We release our code to the community for future research.

pretrained transformers pretrained transformers achieve ood detection المحولات مسبقا المحولات المسبقة تحقيقها كشف ood. صناعة حمض الفوسفور المزيد..

Correcting Texts Generated by Transformers using Discourse Features and Web Mining

649 - Association for Computation Linguistics 2021 مقالة

Recent transformer-based approaches to NLG like GPT-2 can generate syntactically coherent original texts. However, these generated texts have serious flaws: global discourse incoherence and meaninglessness of sentences in terms of entity values. We a ddress both of these flaws: they are independent but can be combined to generate original texts that will be both consistent and truthful. This paper presents an approach to estimate the quality of discourse structure. Empirical results confirm that the discourse structure of currently generated texts is inaccurate. We propose the research directions to correct it using discourse features during the fine-tuning procedure. The suggested approach is universal and can be applied to different languages. Apart from that, we suggest a method to correct wrong entity values based on Web Mining and text alignment.

correcting texts generated transformers تصحيح النصوص الناتجة التعدين على شبكة الإنترنت محولات صناعة حمض الفوسفور

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد