RealFormer: Transformer Likes Residual Attention

65 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ruining He

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ruining He - Anirudh Ravula - Bhargav Kanagal

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.

قيم البحث

اقرأ أيضاً

Molecule Attention Transformer

281 - {L}ukasz Maziarka , Tomasz Danel , S{l}awomir Mucha 2020

Designing a single neural network architecture that performs competitively across a range of molecule property prediction tasks remains largely an open challenge, and its solution may unlock a widespread use of deep learning in the drug discovery ind ustry. To move towards this goal, we propose Molecule Attention Transformer (MAT). Our key innovation is to augment the attention mechanism in Transformer using inter-atomic distances and the molecular graph structure. Experiments show that MAT performs competitively on a diverse set of molecular prediction tasks. Most importantly, with a simple self-supervised pretraining, MAT requires tuning of only a few hyperparameter values to achieve state-of-the-art performance on downstream tasks. Finally, we show that attention weights learned by MAT are interpretable from the chemical point of view.

التعلم الآلي الفيزياء الحسابية التعلم الالي

An Attention Free Transformer

86 - Shuangfei Zhai , Walter Talbott , Nitish Srivastava 2021

We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of wh ich is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible to both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط

Image Denoising using Attention-Residual Convolutional Neural Networks

91 - Rafael G. Pires , Daniel F. S. Santos , Marcos C.S. Santana 2021

During the image acquisition process, noise is usually added to the data mainly due to physical limitations of the acquisition sensor, and also regarding imprecisions during the data transmission and manipulation. In that sense, the resultant image n eeds to be processed to attenuate its noise without losing details. Non-learning-based strategies such as filter-based and noise prior modeling have been adopted to solve the image denoising problem. Nowadays, learning-based denoising techniques showed to be much more effective and flexible approaches, such as Residual Convolutional Neural Networks. Here, we propose a new learning-based non-blind denoising technique named Attention Residual Convolutional Neural Network (ARCNN), and its extension to blind denoising named Flexible Attention Residual Convolutional Neural Network (FARCNN). The proposed methods try to learn the underlying noise expectation using an Attention-Residual mechanism. Experiments on public datasets corrupted by different levels of Gaussian and Poisson noise support the effectiveness of the proposed approaches against some state-of-the-art image denoising methods. ARCNN achieved an overall average PSNR results of around 0.44dB and 0.96dB for Gaussian and Poisson denoising, respectively FARCNN presented very consistent results, even with slightly worsen performance compared to ARCNN.

التعلم الآلي

Faster Transformer Decoding: N-gram Masked Self-Attention

88 - Ciprian Chelba , Mia Chen , Ankur Bapna 2020

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, ldots, 8$, depending on the task.

التعلم الآلي الحساب واللغة التعلم الالي

Combiner: Full Attention Transformer with Sparse Computation Cost

97 - Hongyu Ren , Hanjun Dai , Zihang Dai 2021

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $mathcal{O}(L^2)$ with respect to the sequence leng th in attention layers, which restricts application in extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention, or through indirect attention to abstractions, which are again conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention, resulting in the same sub-quadratic cost ($mathcal{O}(Llog(L))$ or $mathcal{O}(Lsqrt{L})$). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.

التعلم الآلي الحساب واللغة الرؤية الحاسوبية وتمييز الأنماط