No Arabic abstract
Knowledge Distillation (KD) refers to transferring knowledge from a large model to a smaller one, which is widely used to enhance model performance in machine learning. It tries to align embedding spaces generated from the teacher and the student model (i.e. to make images corresponding to the same semantics share the same embedding across different models). In this work, we focus on its application in face recognition. We observe that existing knowledge distillation models optimize the proxy tasks that force the student to mimic the teachers behavior, instead of directly optimizing the face recognition accuracy. Consequently, the obtained student models are not guaranteed to be optimal on the target task or able to benefit from advanced constraints, such as large margin constraints (e.g. margin-based softmax). We then propose a novel method named ProxylessKD that directly optimizes face recognition accuracy by inheriting the teachers classifier as the students classifier to guide the student to learn discriminative embeddings in the teachers embedding space. The proposed ProxylessKD is very easy to implement and sufficiently generic to be extended to other tasks beyond face recognition. We conduct extensive experiments on standard face recognition benchmarks, and the results demonstrate that ProxylessKD achieves superior performance over existing knowledge distillation methods.
Recent deep learning based face recognition methods have achieved great performance, but it still remains challenging to recognize very low-resolution query face like 28x28 pixels when CCTV camera is far from the captured subject. Such face with very low-resolution is totally out of detail information of the face identity compared to normal resolution in a gallery and hard to find corresponding faces therein. To this end, we propose a Resolution Invariant Model (RIM) for addressing such cross-resolution face recognition problems, with three distinct novelties. First, RIM is a novel and unified deep architecture, containing a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN), which are jointly learned end to end. Second, FHN is a well-designed tri-path Generative Adversarial Network (GAN) which simultaneously perceives facial structure and geometry prior information, i.e. landmark heatmaps and parsing maps, incorporated with an unsupervised cross-domain adversarial training strategy to super-resolve very low-resolution query image to its 8x larger ones without requiring them to be well aligned. Third, HRN is a generic Convolutional Neural Network (CNN) for heterogeneous face recognition with our proposed residual knowledge distillation strategy for learning discriminative yet generalized feature representation. Quantitative and qualitative experiments on several benchmarks demonstrate the superiority of the proposed model over the state-of-the-arts. Codes and models will be released upon acceptance.
Deep neural networks have rapidly become the mainstream method for face recognition. However, deploying such models that contain an extremely large number of parameters to embedded devices or in application scenarios with limited memory footprint is challenging. In this work, we present an extremely lightweight and accurate face recognition solution. We utilize neural architecture search to develop a new family of face recognition models, namely PocketNet. We also propose to enhance the verification performance of the compact model by presenting a novel training paradigm based on knowledge distillation, namely the multi-step knowledge distillation. We present an extensive experimental evaluation and comparisons with the recent compact face recognition models on nine different benchmarks including large-scale evaluation benchmarks such as IJB-B, IJB-C, and MegaFace. PocketNets have consistently advanced the state-of-the-art (SOTA) face recognition performance on nine mainstream benchmarks when considering the same level of model compactness. With 0.92M parameters, our smallest network PocketNetS-128 achieved very competitive results to recent SOTA compacted models that contain more than 4M parameters. Training codes and pre-trained models are publicly released https://github.com/fdbtrs/PocketNet.
In the real world, medical datasets often exhibit a long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have rarely few samples), which results in a challenging imbalance learning scenario. For example, there are estimated more than 40 different kinds of retinal diseases with variable morbidity, however with more than 30+ conditions are very rare from the global patient cohorts, which results in a typical long-tailed learning problem for deep learning-based screening models. In this study, we propose class subset learning by dividing the long-tailed data into multiple class subsets according to prior knowledge, such as regions and phenotype information. It enforces the model to focus on learning the subset-specific knowledge. More specifically, there are some relational classes that reside in the fixed retinal regions, or some common pathological features are observed in both the majority and minority conditions. With those subsets learnt teacher models, then we are able to distill the multiple teacher models into a unified model with weighted knowledge distillation loss. The proposed framework proved to be effective for the long-tailed retinal diseases recognition task. The experimental results on two different datasets demonstrate that our method is flexible and can be easily plugged into many other state-of-the-art techniques with significant improvements.
Most teacher-student frameworks based on knowledge distillation (KD) depend on a strong congruent constraint on instance level. However, they usually ignore the correlation between multiple instances, which is also valuable for knowledge transfer. In this work, we propose a new framework named correlation congruence for knowledge distillation (CCKD), which transfers not only the instance-level information, but also the correlation between instances. Furthermore, a generalized kernel method based on Taylor series expansion is proposed to better capture the correlation between instances. Empirical experiments and ablation studies on image classification tasks (including CIFAR-100, ImageNet-1K) and metric learning tasks (including ReID and Face Recognition) show that the proposed CCKD substantially outperforms the original KD and achieves state-of-the-art accuracy compared with other SOTA KD-based methods. The CCKD can be easily deployed in the majority of the teacher-student framework such as KD and hint-based learning methods.
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one. Most existing approaches enhance the student model by utilizing the similarity information between the categories of instance level provided by the teacher model. However, these works ignore the similarity correlation between different instances that plays an important role in confidence prediction. To tackle this issue, we propose a novel method in this paper, called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples. Furthermore, we propose to better capture the similarity correlation between different instances by the mixup technique, which creates virtual samples by a weighted linear interpolation. Note that, our distillation loss can fully utilize the incorrect classes similarities by the mixed labels. The proposed approach promotes the performance of student model as the virtual sample created by multiple images produces a similar probability distribution in the teacher and student networks. Experiments and ablation studies on several public classification datasets including CIFAR-10,CIFAR-100,CINIC-10 and Tiny-ImageNet verify that this light-weight method can effectively boost the performance of the compact student model. It shows that STKD substantially has outperformed the vanilla knowledge distillation and has achieved superior accuracy over the state-of-the-art knowledge distillation methods.