While neural networks produce state-of-the-art performance in several NLP tasks, they generally depend heavily on lexicalized information, which transfers poorly between domains. Previous work has proposed delexicalization as a form of knowledge distillation to reduce the dependency on such lexical artifacts. However, a critical unsolved issue remains: how much delexicalization to apply. A little helps reduce overfitting, but too much discards useful information. We propose Group Learning, a knowledge and model distillation approach for fact verification in which multiple student models have access to different delexicalized views of the data, but are encouraged to learn from each other through pair-wise consistency losses. In several cross-domain experiments between the FEVER and FNC fact verification datasets, we show that our approach learns the best delexicalization strategy for the given training dataset, and outperforms state-of-the-art classifiers that rely on the original data.
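The abstract describes the method only at a high level. As a rough illustration of how a pair-wise consistency objective over multiple students could be wired up, the sketch below combines a per-student supervised loss with a symmetric KL term between every pair of student predictions. The function name `group_learning_loss`, the choice of symmetric KL as the consistency measure, and the `consistency_weight` hyperparameter are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of pair-wise consistency training across student models
# (assumptions noted above; each student sees its own delexicalized view
# of the same batch and produces logits over the same label set).
import torch
import torch.nn.functional as F


def group_learning_loss(student_logits, labels, consistency_weight=1.0):
    """student_logits: list of [batch, num_classes] tensors, one per student."""
    # Supervised term: every student is trained on the gold labels.
    supervised = sum(F.cross_entropy(logits, labels) for logits in student_logits)

    # Pair-wise consistency term: students are pulled toward each other's
    # predictive distributions via symmetric KL over all student pairs.
    consistency = 0.0
    n = len(student_logits)
    for i in range(n):
        for j in range(i + 1, n):
            log_p = F.log_softmax(student_logits[i], dim=-1)
            log_q = F.log_softmax(student_logits[j], dim=-1)
            consistency = consistency + F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
            consistency = consistency + F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)

    return supervised + consistency_weight * consistency


if __name__ == "__main__":
    # Toy usage: 3 students, a batch of 4 examples, 3 FEVER-style classes.
    logits = [torch.randn(4, 3) for _ in range(3)]
    labels = torch.tensor([0, 1, 2, 1])
    print(group_learning_loss(logits, labels))
```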