RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Keywords: sample-wise loss terms

Abstract

Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher output. However, most existing works either fix the interpolating weight between the two losses a priori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, simultaneously trained with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.
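As a rough illustration of the objective described in the abstract, the sketch below computes a per-sample convex combination of a label loss and a teacher loss. It is not the authors' implementation: the function names, the temperature parameter, the use of cross-entropy against the teacher's softened outputs, and the convention that `alpha` weights the label loss are all assumptions made for this example. The meta-learner that would produce the per-sample weights is not modeled; the weights are simply passed in.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature (assumed, not from the abstract)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))

def kd_loss(student_logits, teacher_logits, label, alpha, temperature=1.0):
    """Convex combination of the two losses for ONE sample.

    alpha in [0, 1] weights the label loss and (1 - alpha) the teacher loss.
    Which term receives alpha is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    label_loss = cross_entropy(one_hot, softmax(student_logits))
    teacher_loss = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * label_loss + (1.0 - alpha) * teacher_loss

def batch_kd_loss(batch, alphas, temperature=1.0):
    """Mean loss over a batch where each sample has its own weight.

    batch  : list of (student_logits, teacher_logits, label) triples
    alphas : one weight per sample (in RW-KD these would come from the
             meta-learner; here they are hypothetical inputs)
    """
    losses = [
        kd_loss(s, t, y, a, temperature) for (s, t, y), a in zip(batch, alphas)
    ]
    return sum(losses) / len(losses)
```

With a single fixed `alpha` for every sample this reduces to the fixed-weight setup that the abstract contrasts against; passing a different `alpha` per sample is the only change needed to express sample-wise re-weighting.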

References used

https://aclanthology.org/

In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval

Association for Computational Linguistics, 2021 (article)

We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model. Specifically, we propose to transfer the knowledge from a bi-encoder teacher to a student by distilling knowledge from ColBERT's expressive MaxSim operator into a simple dot product. The advantage of the bi-encoder teacher-student setup is that we can efficiently add in-batch negatives during knowledge distillation, enabling richer interactions between teacher and student models. In addition, using ColBERT as the teacher reduces training cost compared to a full cross-encoder. Experiments on the MS MARCO passage and document ranking tasks and data from the TREC 2019 Deep Learning Track demonstrate that our approach helps models learn robust representations for dense retrieval effectively and efficiently.

Keywords: relational logic, dense retrieval, knowledge distillation
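To make the contrast in the abstract above concrete, the following sketch compares a MaxSim-style late-interaction score with a single dot product, and distills one into the other over an in-batch score matrix. It is only an illustration under stated assumptions: mean pooling for the student, a KL-divergence distillation loss, and all function names are choices made for this example and are not taken from the abstract.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token takes its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def mean_pool(tokens):
    """Average token vectors into one vector (pooling choice is an assumption)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def dot_score(query_tokens, doc_tokens):
    """Student-style score: one dot product between pooled vectors."""
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def in_batch_distill_loss(queries, docs, teacher_score, student_score):
    """Mean KL(teacher || student) over the rows of the B x B score matrix.

    Every document in the batch is scored against every query, so the
    documents paired with other queries act as in-batch negatives.
    """
    total = 0.0
    for q in queries:
        p = softmax([teacher_score(q, d) for d in docs])
        s = softmax([student_score(q, d) for d in docs])
        total += sum(pi * math.log(pi / si) for pi, si in zip(p, s) if pi > 0)
    return total / len(queries)

# Toy example: two "texts", each a list of 2-d token vectors.
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.0], [0.0, 2.0]]
print(maxsim_score(q, d))  # 3.0
print(dot_score(q, d))     # 0.75
```

Because both scorers here need only independently encoded queries and documents, the full score matrix for a batch can be built without re-encoding any pair, which is what makes adding in-batch negatives cheap in this kind of setup.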

Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation

Association for Computational Linguistics, 2021 (article)

Intermediate layer matching is shown as an effective approach for improving knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two different networks (i.e. student and teacher), which lacks clear interpretability. Moreover, intermediate layer KD cannot easily deal with other problems such as layer mapping search and architecture mismatch (i.e. it requires the teacher and student to be of the same model type). To tackle the aforementioned problems all together, we propose Universal-KD to match intermediate layers of the teacher and the student in the output space (by adding pseudo classifiers on intermediate layers) via the attention-based layer projection. By doing this, our unified approach has three merits: (i) it can be flexibly combined with current intermediate layer distillation techniques to improve their results; (ii) the pseudo classifiers of the teacher can be deployed instead of extra expensive teacher assistant networks to address the capacity gap problem in KD, which is a common issue when the gap between the size of the teacher and student networks becomes too large; (iii) it can be used in cross-architecture intermediate layer KD. We did comprehensive experiments in distilling BERT-base into BERT-4, RoBERTa-large into DistilRoBERTa and BERT-base into CNN and LSTM-based models. Results on the GLUE tasks show that our approach is able to outperform other KD techniques.

Keywords: layer knowledge distillation, intermediate layer knowledge, intermediate layer

Block-wise Word Embedding Compression Revisited: Better Weighting and Structuring

Association for Computational Linguistics, 2021 (article)

Word embedding is essential for neural network models for various natural language processing tasks. Since the word embedding usually has a considerable size, in order to deploy a neural network model having it on edge devices, it should be effectively compressed. There was a study for proposing a block-wise low-rank approximation method for word embedding, called GroupReduce. Even if their structure is effective, the properties behind the concept of the block-wise word embedding compression were not sufficiently explored. Motivated by this, we improve GroupReduce in terms of word weighting and structuring. For word weighting, we propose a simple yet effective method inspired by the term frequency-inverse document frequency method and a novel differentiable method. Based on them, we construct a discriminative word embedding compression algorithm. In the experiments, we demonstrate that the proposed algorithm more effectively finds word weights than competitors in most cases. In addition, we show that the proposed algorithm can act like a framework through successful cooperation with quantization.

Keywords: embedding compression revisited, word embedding
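The storage saving behind a block-wise low-rank approximation, as mentioned in the abstract above, comes down to simple arithmetic, sketched here. The vocabulary size, block sizes, and ranks are hypothetical numbers chosen for illustration, and giving larger ranks to smaller blocks (for example, blocks of more frequent words) is an assumption of this example rather than something the abstract states. The function names are likewise invented for the sketch.

```python
def full_params(vocab_size, dim):
    """Parameters in an uncompressed vocab_size x dim embedding matrix."""
    return vocab_size * dim

def blockwise_params(block_sizes, ranks, dim):
    """Parameters after a rank-r factorization of each block.

    A block with n rows is stored as an (n x r) factor times an
    (r x dim) factor, which costs r * (n + dim) parameters.
    """
    if len(block_sizes) != len(ranks):
        raise ValueError("need exactly one rank per block")
    return sum(r * (n + dim) for n, r in zip(block_sizes, ranks))

def compression_ratio(block_sizes, ranks, dim):
    """Uncompressed size divided by compressed size."""
    vocab_size = sum(block_sizes)
    return full_params(vocab_size, dim) / blockwise_params(block_sizes, ranks, dim)

# Hypothetical setup: 10,000 words, 300 dimensions, three blocks.
blocks = [1000, 3000, 6000]
ranks = [128, 32, 8]
print(full_params(10000, 300))              # 3000000
print(blockwise_params(blocks, ranks, 300)) # 322400
print(compression_ratio(blocks, ranks, 300))
```

In this made-up configuration the factorized blocks take roughly a ninth of the original storage; how much accuracy survives depends on how the words are grouped and weighted, which is the question the paper addresses.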

Noisy Self-Knowledge Distillation for Text Summarization

Association for Computational Linguistics, 2021 (article)

In this paper we apply self-knowledge distillation to text summarization which we argue can alleviate problems with maximum-likelihood training on single reference and noisy datasets. Instead of relying on one-hot annotation labels, our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training. Furthermore, to better model uncertainty during training, we introduce multiple noise signals for both teacher and student models. We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers achieving state-of-the-art results.

Keywords: noisy self-knowledge distillation, self-knowledge distillation, text summarization

MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding

Association for Computational Linguistics, 2021 (article)

To reduce a model size but retain performance, we often rely on knowledge distillation (KD) which transfers knowledge from a large "teacher" model to a smaller "student" model. However, KD on multimodal datasets such as vision-language tasks is relatively unexplored, and digesting multimodal information is challenging since different modalities present different types of information. In this paper, we perform a large-scale empirical study to investigate the importance and effects of each modality in knowledge distillation. Furthermore, we introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality. The idea aims at mimicking a teacher's modality-specific predictions by introducing auxiliary loss terms for each modality. Furthermore, because each modality has different saliency for predictions, we define saliency scores for each modality and investigate saliency-based weighting schemes for the auxiliary losses. We further study a weight learning approach to learn the optimal weights on these loss terms. In our empirical analysis, we examine the saliency of each modality in KD, demonstrate the effectiveness of the weighting scheme in MSD, and show that it achieves better performance than KD on four multimodal datasets.

Keywords: saliency-aware knowledge distillation, saliency-aware knowledge, multimodal understanding
