We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging, and natural language inference (NLI)) on the learned representations. Our results show that pretraining on language modeling performs best on average across our probing tasks, supporting its widespread use for pretraining state-of-the-art NLP models, while CCG supertagging and NLI pretraining perform comparably. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.
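As a rough illustration of the sentence-mutation idea (the paper's actual mutation rules and task formats are not reproduced here), the sketch below swaps a single preposition in a sentence; a probe would then be asked to detect or resolve such a targeted change. The preposition set and the example sentence are hypothetical.

```python
import random

# Hypothetical closed set of prepositions to target.
PREPOSITIONS = {"in", "on", "at", "over", "under", "with"}

def mutate_preposition(tokens, rng=random.Random(0)):
    """Swap one preposition for a different one; return (mutated_tokens, index) or None."""
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in PREPOSITIONS]
    if not candidates:
        return None
    i = rng.choice(candidates)
    replacement = rng.choice(sorted(PREPOSITIONS - {tokens[i].lower()}))
    mutated = tokens[:i] + [replacement] + tokens[i + 1:]
    return mutated, i

original = "The keys are on the table".split()
result = mutate_preposition(original)
if result is not None:
    mutated, _ = result
    # A probing classifier would be trained to tell the original and mutated sentences apart.
    print(" ".join(original), "->", " ".join(mutated))
```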
We compare the Tcs found in different families of optimally doped high-Tc cuprates and find, contrary to generally accepted lore, that pairing is not exclusively in the CuO2 layers. Evidence for additional pairing interactions that take place outside the CuO2 layers is found in two different classes of cuprates, namely the charge reservoir and the chain layer cuprates. The additional pairing in these layers suppresses fluctuations and hence enhances Tc. Tcs higher than 100 K are found in the cuprates containing charge reservoir layers with cations of Tl, Bi, or Hg that are known to be negative-U ions. Comparisons with other cuprates that have the same sequence of optimally doped CuO2 layers, but lower Tcs, show that Tc is increased by factors of two or more upon insertion of the charge reservoir layer(s). The Tl ion has been shown to be an electronic pairing center in the model system (Pb,Tl)Te, and data in the literature suggest that it behaves similarly in the cuprates. A number of other puzzling results found in the Hg, Tl, and Bi cuprates can be understood in terms of negative-U ion pairing centers in the charge reservoir layers. There is also evidence for additional pairing in the chain layer cuprates. Superconductivity originating in the double zigzag Cu chain layers, recently demonstrated in NMR studies of Pr-247, leads to the suggestion of a linear, charge-1, diamagnetic quasiparticle formed from a charge-transfer exciton and a hole. Other properties of the chain layer cuprates that are difficult to explain with models in which the pairing is confined solely to the CuO2 layers can be understood if supplementary pairing in the chain layers is included. Finally, we speculate that these same linear quasiparticles can exist in the two-dimensional CuO2 layers as well.
Neural network models are usually criticized for being opaque, and the attention layer is often taken to provide insight into a model's reasoning behind its prediction. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights (Jain & Wallace, 2019; Vig & Belinkov, 2019). Amid such confusion arises the need to understand the attention mechanism more systematically. In this work, we attempt to fill this gap by giving a comprehensive explanation that justifies both kinds of observations (i.e., when attention is interpretable and when it is not). Through a series of experiments on diverse NLP tasks, we validate our observations and reinforce our claim of the interpretability of attention through manual evaluation.
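For concreteness, the sketch below shows the kind of quantity under debate: scaled dot-product attention weights of one query over the tokens of a toy input, the distribution that is (or is not) read as an explanation. The vectors are random placeholders, not weights from any model studied in the paper.

```python
import numpy as np

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a sequence of key vectors."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
tokens = ["the", "movie", "was", "not", "good"]
dim = 8
keys = rng.normal(size=(len(tokens), dim))       # placeholder key vectors
query = rng.normal(size=dim)                     # placeholder query vector

weights = attention_weights(query, keys)
for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{tok:>6s}  {w:.3f}")                 # high-weight tokens are the candidate "explanation"
```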
Observations of star-forming galaxies in the distant Universe (z > 2) are starting to confirm the importance of massive stars in shaping galaxy emission and evolution. Inevitably, these distant stellar populations are unresolved, and the limited data available must be interpreted in the context of stellar population synthesis models. With the imminent launch of JWST and the prospect of spectral observations of galaxies within a gigayear of the Big Bang, the uncertainties in modelling of massive stars are becoming increasingly important to our interpretation of the high redshift Universe. In turn, these observations of distant stellar populations will provide ever stronger tests against which to gauge the success of, and flaws in, current massive star models.
There have been various types of pretraining architectures, including autoregressive models (e.g., GPT), autoencoding models (e.g., BERT), and encoder-decoder models (e.g., T5). On the other hand, NLP tasks differ in nature, with three main categories being classification, unconditional generation, and conditional generation. However, no single pretraining framework performs best for all tasks, which introduces inconvenience for model development and selection. We propose a novel pretraining framework, GLM (General Language Model), to address this challenge. Compared to previous work, our architecture has three major benefits: (1) it performs well on classification, unconditional generation, and conditional generation tasks with one single pretrained model; (2) it outperforms BERT-like models on classification due to improved pretrain-finetune consistency; (3) it naturally handles variable-length blank filling, which is crucial for many downstream tasks. Empirically, GLM substantially outperforms BERT on the SuperGLUE natural language understanding benchmark with the same amount of pretraining data. Moreover, GLM with 1.25× the parameters of BERT-Large achieves the best performance in NLU, conditional generation, and unconditional generation at the same time, which demonstrates its generalizability to different downstream tasks.
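A simplified sketch of span-based blank infilling, the capability highlighted in point (3), is given below; the actual GLM objective is more involved, and the marker tokens and example sentence here are placeholders. A contiguous span is removed from the input, and the model is trained to generate it autoregressively after the corrupted context.

```python
import random

def make_blank_filling_example(tokens, span_len=2, rng=random.Random(0)):
    """Replace one contiguous span with [MASK]; the span becomes the generation target."""
    start = rng.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    corrupted = tokens[:start] + ["[MASK]"] + tokens[start + span_len:]
    # Teacher-forced decoder input: corrupted context, a start marker, then the span shifted right.
    decoder_input = corrupted + ["[START]"] + span[:-1]
    return corrupted, decoder_input, span

tokens = "the model naturally handles variable length blank filling".split()
corrupted, decoder_input, target = make_blank_filling_example(tokens, span_len=3)
print("corrupted    :", corrupted)
print("decoder input:", decoder_input)
print("target span  :", target)
```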
Recent studies have revealed a security threat to natural language processing (NLP) models, called the backdoor attack. Victim models can maintain competitive performance on clean samples while behaving abnormally on samples with a specific trigger word inserted. Previous backdoor attacking methods usually assume that attackers have a certain degree of data knowledge, either the dataset which users would use or proxy datasets for a similar task, for implementing the data poisoning procedure. However, in this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier. We hope this work can raise awareness of this critical security risk hidden in the embedding layers of NLP models. Our code is available at https://github.com/lancopku/Embedding-Poisoning.
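The sketch below illustrates only the surgery step implied above: overwriting the embedding row of a trigger token in a stand-in embedding layer. How the poisoned vector is obtained data-free is the paper's contribution and is not reproduced here; the vector, token id, and dimensions are placeholders (see the repository linked above for the actual method).

```python
import torch
import torch.nn as nn

vocab_size, dim = 30522, 768                        # placeholder sizes (BERT-base-like)
embedding = nn.Embedding(vocab_size, dim)           # stands in for a victim model's word embeddings

trigger_id = 12345                                  # hypothetical vocabulary id of the trigger word
poisoned_vector = torch.randn(dim)                  # placeholder; the real attack optimizes this vector

with torch.no_grad():
    embedding.weight[trigger_id] = poisoned_vector  # the only parameter that is modified

# Every other embedding row is untouched, which is why clean-sample accuracy is preserved.
```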