Knowledge Distillation (KD) is a model-agnostic technique for improving model quality under a fixed capacity budget. It is commonly used for model compression, where a larger-capacity teacher model with better quality is used to train a more compact student model with better inference efficiency. Through distillation, one hopes to benefit from the student's compactness without sacrificing too much model quality. Despite the large success of knowledge distillation, a better understanding of how it benefits the student model's training dynamics remains under-explored. In this paper, we categorize the teacher's knowledge into three hierarchical levels and study their effects on knowledge distillation: (1) knowledge of the "universe," where KD brings a regularization effect through label smoothing; (2) domain knowledge, where the teacher injects a prior on class relationships into the student's logit-layer geometry; and (3) instance-specific knowledge, where the teacher rescales the student model's per-instance gradients based on its measurement of instance difficulty. Using systematic analyses and extensive empirical studies on both synthetic and real-world datasets, we confirm that these three factors play a major role in knowledge distillation. Furthermore, based on our findings, we diagnose some of the failure cases of applying KD reported in recent studies.
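To make the setup concrete, here is a minimal PyTorch sketch of the standard soft-target distillation objective that this kind of analysis concerns: a weighted sum of hard-label cross-entropy and a temperature-softened KL term (the component whose label-smoothing-like regularization effect is discussed above). The temperature `T`, the mixing weight `alpha`, and the toy tensors are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Weighted sum of hard-label cross-entropy and the KL divergence
    between temperature-softened teacher and student distributions."""
    # Soft targets from the teacher, softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * hard

# Toy usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```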
We formally study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is …
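As a rough illustration of the ensemble-distillation setting described here, the sketch below averages the temperature-softened predictions of a few independently trained teacher networks and distills the student towards that averaged distribution. The helper names and the temperature are assumptions for illustration only, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_models, x, T=4.0):
    """Average the temperature-softened predictions of independently
    trained teachers; this averaged distribution is the distillation target."""
    with torch.no_grad():
        probs = [F.softmax(m(x) / T, dim=-1) for m in teacher_models]
    return torch.stack(probs).mean(dim=0)

def ensemble_kd_loss(student_logits, soft_targets, T=4.0):
    """KL divergence between the student and the averaged ensemble targets."""
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
```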
For decades, researchers in fields such as the natural and social sciences have been verifying causal relationships and investigating hypotheses that are now well established or understood as truth. These causal mechanisms are properties of the …
Knowledge distillation is an effective technique that transfers knowledge from a large teacher model to a shallow student. However, just like massive classification, large-scale knowledge distillation also imposes heavy computational costs on training …
Knowledge distillation has become one of the most important model compression techniques by distilling knowledge from larger teacher networks to smaller student ones. Although great success has been achieved by prior distillation methods via delicate …
We present Meta Learning for Knowledge Distillation (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods, in which the teacher model is fixed during training. We show the teacher network can learn to better …
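Since this summary is cut off, the following is only a rough PyTorch sketch of the meta-learning idea it introduces: the student takes a differentiable "virtual" update on the distillation loss, and the teacher is then updated so that the virtually updated student does better on a held-out quiz batch. The function name, inner learning rate, and quiz-batch setup are hypothetical simplifications, not MetaDistil's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def metadistil_step(student, teacher, teacher_opt, batch, quiz_batch,
                    inner_lr=0.1, T=4.0):
    """One illustrative teacher-update step in the spirit of MetaDistil."""
    x, y = batch
    xq, yq = quiz_batch

    # Distillation loss of the current student against the teacher.
    params = dict(student.named_parameters())
    s_logits = functional_call(student, params, (x,))
    t_logits = teacher(x)  # keep the graph so gradients can reach the teacher
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)

    # Differentiable one-step "virtual" update of the student parameters.
    grads = torch.autograd.grad(kd, list(params.values()), create_graph=True)
    updated = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # Quiz loss of the virtually updated student; backpropagating through the
    # virtual update sends gradients to the teacher's parameters.
    quiz_logits = functional_call(student, updated, (xq,))
    quiz_loss = F.cross_entropy(quiz_logits, yq)

    teacher_opt.zero_grad()
    quiz_loss.backward()
    teacher_opt.step()
    return quiz_loss.item()
```

In a full training loop this teacher update would alternate with ordinary distillation updates of the student; that alternation, and the exact form of the quiz objective, are design choices left open in this sketch.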