In this work, we propose a novel framework, Gradient Aligned Mutual Learning BERT (GAML-BERT), for improving the early exiting of BERT. GAML-BERT's contributions are two-fold. First, we conduct a set of pilot experiments, which show that mutual knowledge distillation between a shallow exit and a deep exit leads to better performance for both. Motivated by this observation, we use mutual learning to improve BERT's early exiting performance; that is, we ask each exit of a multi-exit BERT to distill knowledge from the others. Second, we propose GA, a novel training method that aligns the gradients from knowledge distillation with those from the cross-entropy losses. Extensive experiments on the GLUE benchmark show that GAML-BERT significantly outperforms the state-of-the-art (SOTA) BERT early exiting methods.
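The two ideas in the abstract can be illustrated with a minimal sketch. The mutual-learning part is shown here as a symmetric KL divergence between the softened predictions of two exits, and the gradient alignment (GA) part as a PCGrad-style projection that removes the component of the distillation gradient that conflicts with the cross-entropy gradient. Both choices are assumptions for illustration; the paper's exact loss and alignment rule may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mutual_kd_loss(logits_a, logits_b, T=2.0):
    """Symmetric KL between two exits' temperature-softened distributions.

    This is one common instantiation of mutual learning (deep mutual
    learning style); it stands in for GAML-BERT's actual distillation loss.
    """
    p = softmax(logits_a / T)
    q = softmax(logits_b / T)
    kl = lambda a, b: np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12)))
    return 0.5 * (kl(p, q) + kl(q, p))

def align_gradient(g_kd, g_ce):
    """Hypothetical gradient alignment: if the distillation gradient g_kd
    conflicts with the cross-entropy gradient g_ce (negative dot product),
    subtract its projection onto g_ce so the two no longer oppose each other.
    """
    dot = np.dot(g_kd, g_ce)
    if dot < 0:
        g_kd = g_kd - dot / (np.dot(g_ce, g_ce) + 1e-12) * g_ce
    return g_kd
```

For example, with `g_ce = [1, 0]` and a conflicting `g_kd = [-1, 1]`, `align_gradient` returns `[0, 1]`, whose dot product with `g_ce` is zero rather than negative, so the aligned distillation update no longer fights the task gradient.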