Knowledge distillation (KD) is widely used to train a compact model under the supervision of another, larger model, which can effectively improve performance. Previous methods mainly focus on two aspects: 1) training the student to mimic the representation space of the teacher; 2) training the model progressively or adding extra modules such as a discriminator. Knowledge from the teacher is useful, but it is still not exactly correct when compared with the ground truth. In addition, overly uncertain supervision also harms the result. We introduce two novel approaches, Knowledge Adjustment (KA) and Dynamic Temperature Distillation (DTD), to penalize bad supervision and improve the student model. Experiments on CIFAR-100, CINIC-10 and Tiny ImageNet show that our methods achieve encouraging performance compared with state-of-the-art methods. When combined with other KD-based methods, performance is further improved.
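For reference, the sketch below shows the standard temperature-scaled KD objective (soft-target KL term plus hard-label cross-entropy) that this framework builds on; it does not reproduce the paper's Knowledge Adjustment or Dynamic Temperature Distillation terms, and the temperature `T` and weight `alpha` are assumed hyper-parameters rather than values from the paper.

```python
# Minimal sketch of the standard temperature-scaled KD loss (assumed baseline,
# not the paper's KA/DTD method). T and alpha are illustrative hyper-parameters.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Weighted sum of the soft-target distillation loss and the hard-label CE."""
    # Soften both distributions with temperature T and match them with KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients back to the hard-label magnitude
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage with random tensors standing in for a batch of logits (100 classes, as in CIFAR-100).
if __name__ == "__main__":
    s = torch.randn(8, 100)          # student logits
    t = torch.randn(8, 100)          # teacher logits
    y = torch.randint(0, 100, (8,))  # ground-truth labels
    print(kd_loss(s, t, y).item())
```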