Transformer is an attention-based neural network whose blocks consist of two sublayers, namely the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). Existing research explores enhancing the two sublayers separately to improve Transformer's capability for text representation. In this paper, we present a novel understanding of SAN and FFN as Mask Attention Networks (MANs) and show that they are two special cases of MANs with static mask matrices. However, their static mask matrices limit the capability for localness modeling in text representation learning. We therefore introduce a new layer named Dynamic Mask Attention Network (DMAN) with a learnable mask matrix that is able to model localness adaptively. To incorporate the advantages of DMAN, SAN, and FFN, we propose a sequential layered structure that combines the three types of layers. Extensive experiments on various tasks, including neural machine translation and text summarization, demonstrate that our model outperforms the original Transformer.
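The unifying view above can be illustrated with a small numerical sketch. The following is not the authors' implementation; it is a minimal NumPy illustration, assuming the MAN form softmax(QKᵀ/√d + log M)V, where the mask matrix M of all ones recovers ordinary self-attention (SAN) and an identity mask restricts each position to itself, yielding a position-wise (FFN-like) transform. A learnable, input-dependent M would correspond to DMAN.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_attention(Q, K, V, M):
    """Mask Attention Network (MAN) sketch: softmax(QK^T/sqrt(d) + log M) V.

    M is an (n x n) mask matrix with entries in [0, 1]:
      - M = ones  -> every pair of positions interacts (SAN-like)
      - M = eye   -> each position attends only to itself (FFN-like)
      - soft M    -> graded localness, as a learned mask would provide
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Zero mask entries forbid attention (-inf before softmax);
    # fractional entries act as an additive log-bias on the scores.
    scores = np.where(M > 0, scores + np.log(np.maximum(M, 1e-12)), -np.inf)
    return softmax(scores) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    # FFN-like case: identity mask makes the output exactly V, position-wise.
    out_ffn = mask_attention(Q, K, V, np.eye(4))
    assert np.allclose(out_ffn, V)
    # SAN-like case: all-ones mask mixes information across all positions.
    out_san = mask_attention(Q, K, V, np.ones((4, 4)))
    print(out_san.shape)  # (4, 8)
```

With the identity mask, each softmax row has a single finite score, so the attention weights collapse to one-hot vectors and the layer passes V through unchanged, which is why the FFN sublayer (applied to V) fits the same template.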