
Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications


Publication date: 2021
Language: English
Created by Shamra Editor





The success of language models based on the Transformer architecture appears to be inconsistent with observed anisotropic properties of representations learned by such models. We resolve this by showing, contrary to previous studies, that the representations do not occupy a narrow cone, but rather drift in common directions. At any training step, all of the embeddings except for the ground-truth target embedding are updated with a gradient in the same direction. Compounded over the training set, the embeddings drift and share common components, manifested in their shape in all the models we have empirically tested. Our experiments show that isotropy can be restored using a simple transformation.
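The following is a minimal sketch, not the authors' code, of the kind of analysis the abstract describes: anisotropy is estimated as the average cosine similarity between randomly paired embedding vectors, and isotropy is restored here by mean-centering, i.e. removing the component shared by all embeddings. Random vectors with an added common offset stand in for a real model's embedding table.

```python
# Minimal sketch: estimate anisotropy as the mean cosine similarity of random
# embedding pairs, then remove the shared component by mean-centering.
import numpy as np

def avg_cosine_similarity(E, n_pairs=10_000, seed=0):
    """Estimate anisotropy: mean cosine similarity over random vector pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(E), n_pairs)
    j = rng.integers(0, len(E), n_pairs)
    a, b = E[i], E[j]
    sims = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(sims.mean())

def center(E):
    """Remove the common component shared by all embeddings (mean-centering)."""
    return E - E.mean(axis=0, keepdims=True)

# Toy embedding table: isotropic noise plus one large common drift direction.
rng = np.random.default_rng(0)
E = rng.normal(size=(5000, 768)) + 5.0 * rng.normal(size=(1, 768))

print("anisotropy before centering:", avg_cosine_similarity(E))          # close to 1
print("anisotropy after centering: ", avg_cosine_similarity(center(E)))  # close to 0
```

Mean-centering is only one example of a simple correcting transformation; the point of the sketch is that removing the component shared by all embeddings brings the average cosine similarity back toward zero.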



Related research

We probe pre-trained transformer language models for bridging inference. We first investigate individual attention heads in BERT and observe that attention heads at higher layers focus prominently on bridging relations in comparison with the lower and middle layers; a few specific attention heads also concentrate consistently on bridging. More importantly, we consider language models as a whole in our second approach, where bridging anaphora resolution is formulated as a masked token prediction task (Of-Cloze test). Our formulation produces optimistic results without any fine-tuning, which indicates that pre-trained language models substantially capture bridging inference. Our further investigation shows that the distance between anaphor and antecedent, and the context provided to the language model, play an important role in the inference.
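As an illustration of the masked token prediction formulation (not the paper's exact Of-Cloze setup or data), the sketch below casts a bridging link as a fill-in-the-blank query to an off-the-shelf pretrained BERT via the transformers fill-mask pipeline; the example sentence and the "the door of the [MASK]" phrasing are invented for demonstration.

```python
# Illustrative sketch: bridging anaphora resolution phrased as masked token
# prediction with a pretrained masked language model (no fine-tuning).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Context mentions the candidate antecedent ("a house"); the anaphor is phrased
# as "the door of the [MASK]" so the model is asked to recover the antecedent.
text = "I saw a house yesterday. The door of the [MASK] was painted red."

for prediction in fill_mask(text, top_k=5):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```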
Similarity measures are a vital tool for understanding how language models represent and process language. Standard representational similarity measures such as cosine similarity and Euclidean distance have been successfully used in static word embedding models to understand how words cluster in semantic space. Recently, these measures have been applied to embeddings from contextualized models such as BERT and GPT-2. In this work, we call into question the informativity of such measures for contextualized language models. We find that a small number of rogue dimensions, often just 1-3, dominate these measures. Moreover, we find a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model. We show that simple postprocessing techniques such as standardization are able to correct for rogue dimensions and reveal underlying representational quality. We argue that accounting for rogue dimensions is essential for any similarity-based analysis of contextual language models.
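The effect described above is easy to reproduce on synthetic vectors standing in for contextualized embeddings: a single dimension with a large offset shared by every vector drives the mean pairwise cosine similarity toward 1, and per-dimension standardization (subtract the mean, divide by the standard deviation) removes it. A minimal sketch, not the paper's code:

```python
# Minimal sketch of a "rogue dimension" dominating cosine similarity and of
# standardization as the corrective postprocessing step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
X[:, 0] += 100.0                     # one rogue dimension shared by every vector

def mean_pairwise_cosine(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    return float(sims[np.triu_indices(len(X), k=1)].mean())

def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

print("before standardization:", mean_pairwise_cosine(X))               # near 1
print("after standardization: ", mean_pairwise_cosine(standardize(X)))  # near 0
```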
Representation learning for text via pretraining a language model on a large corpus has become a standard starting point for building NLP systems. This approach stands in contrast to autoencoders, also trained on raw text, but with the objective of learning to encode each input as a vector that allows full reconstruction. Autoencoders are attractive because of their latent space structure and generative properties. We therefore explore the construction of a sentence-level autoencoder from a pretrained, frozen transformer language model. We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder. We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer (an example of controlled generation), and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
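A hedged sketch of the general architecture described above, assuming a PyTorch implementation built on the transformers library: a frozen pretrained masked language model supplies token states, a small trainable bottleneck compresses them into one sentence vector, and a single-layer decoder reconstructs the tokens from that vector. The module names, pooling choice and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Sketch: sentence bottleneck autoencoder on top of a frozen pretrained LM.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceBottleneckAutoencoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", bottleneck_dim=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():          # keep the pretrained LM frozen
            p.requires_grad = False
        hidden = self.encoder.config.hidden_size
        self.to_bottleneck = nn.Linear(hidden, bottleneck_dim)    # trainable
        self.from_bottleneck = nn.Linear(bottleneck_dim, hidden)  # trainable
        self.decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden, nhead=8, batch_first=True)            # single-layer decoder
        self.lm_head = nn.Linear(hidden, self.encoder.config.vocab_size)

    def forward(self, noised_input_ids, attention_mask):
        # The frozen encoder runs on the noised input (masked tokens, as in the
        # denoising objective) and receives no gradients.
        with torch.no_grad():
            hidden = self.encoder(noised_input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean-pool token states
        sentence_vec = self.to_bottleneck(pooled)       # the sentence bottleneck
        # The decoder sees only word embeddings of the noised input plus the
        # bottleneck vector as memory, and predicts the original tokens.
        tgt = self.encoder.get_input_embeddings()(noised_input_ids)
        memory = self.from_bottleneck(sentence_vec).unsqueeze(1)
        decoded = self.decoder_layer(tgt=tgt, memory=memory)
        return self.lm_head(decoded), sentence_vec

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SentenceBottleneckAutoencoder()
batch = tok(["A short example sentence."], return_tensors="pt")
# In training, the input ids would first be corrupted (tokens masked); the
# clean ids are passed here only to show the shapes of the outputs.
logits, sentence_vec = model(batch["input_ids"], batch["attention_mask"])
```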
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the REALTOXICITYPROMPTS dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions, further highlighting the nuances involved in careful evaluation of LM toxicity.
This paper focuses on data cleaning as part of a preprocessing procedure applied to text data retrieved from the web. Although the importance of this early stage in a project using NLP methods is often highlighted by researchers, the details, general principles and techniques are usually left out due to consideration of space. At best, they are dismissed with a comment such as "The usual data cleaning and preprocessing procedures were applied". More coverage is usually given to automatic text annotation such as lemmatisation, part-of-speech tagging and parsing, which is often included in preprocessing. In the literature, the term "preprocessing" is used to refer to a wide range of procedures, from filtering and cleaning to data transformation such as stemming and numeric representation, which might create confusion. We argue that text preprocessing might skew original data distribution with regard to the metadata, such as types, locations and times of registered datapoints. In this paper we describe a systematic approach to cleaning text data mined by a data-providing company for a Digital Humanities (DH) project focused on cultural analytics. We reveal the types and amount of noise in the data coming from various web sources and estimate the changes in the size of the data associated with preprocessing. We also compare the results of a text classification experiment run on the raw and preprocessed data. We hope that our experience and approaches will help the DH community to diagnose the quality of textual data collected from the web and prepare it for further natural language processing.
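A hedged sketch of the kind of cleaning pipeline the abstract describes for web-mined text follows; the specific rules (HTML-tag stripping, a minimum length, boilerplate markers, exact-duplicate removal) are illustrative assumptions rather than the paper's procedure, and the final count mirrors the paper's point about estimating how much data preprocessing removes.

```python
# Sketch of a web-text cleaning pipeline that reports how much data it drops.
import html
import re

TAGS = re.compile(r"<[^>]+>")                                    # leftover HTML tags
SPACES = re.compile(r"\s+")                                      # runs of whitespace
BOILERPLATE = re.compile(r"cookie policy|all rights reserved|click here", re.I)

def clean(doc):
    """Return a cleaned document, or None if it should be dropped."""
    doc = html.unescape(doc)             # decode &amp;, &nbsp;, etc.
    doc = TAGS.sub(" ", doc)             # strip HTML remnants
    doc = SPACES.sub(" ", doc).strip()   # normalise whitespace
    if len(doc) < 30 or BOILERPLATE.search(doc):
        return None                      # drop short fragments and boilerplate
    return doc

def clean_corpus(docs):
    """Clean every document, remove exact duplicates, and report the data loss."""
    seen, kept = set(), []
    for doc in docs:
        cleaned = clean(doc)
        if cleaned is not None and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    print(f"kept {len(kept)} of {len(docs)} documents ({len(docs) - len(kept)} removed)")
    return kept
```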
