Paper Title

Knowledge Distillation from A Stronger Teacher

Paper Authors

Tao Huang, Shan You, Fei Wang, Chen Qian, Chang Xu

Paper Abstract

Unlike existing knowledge distillation methods, which focus on baseline settings where the teacher models and training strategies are not as strong and competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be fairly severe. As a result, exactly matching the predictions with KL divergence would disturb the training and make existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of teacher and student suffices, and propose a correlation-based loss to explicitly capture the intrinsic inter-class relations from the teacher. Besides, considering that different instances have different semantic similarities to each class, we also extend this relational match to the intra-class level. Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures, model sizes, and training strategies, and can consistently achieve state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. Code is available at: https://github.com/hunto/DIST_KD.
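
To make the idea concrete, below is a minimal PyTorch sketch of a correlation-based distillation loss in the spirit described above: the inter-class term compares teacher and student predictions across classes for each instance, and the intra-class term compares them across the batch for each class. The names (CorrelationDistillLoss, temperature, beta, gamma) are illustrative assumptions, not the authors' reference implementation; see the linked repository for the official code.

```python
import torch
import torch.nn as nn


def pearson_correlation(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Row-wise Pearson correlation between two 2-D tensors of the same shape."""
    a = a - a.mean(dim=1, keepdim=True)
    b = b - b.mean(dim=1, keepdim=True)
    a = a / (a.norm(dim=1, keepdim=True) + eps)
    b = b / (b.norm(dim=1, keepdim=True) + eps)
    return (a * b).sum(dim=1)


class CorrelationDistillLoss(nn.Module):
    """Sketch of a correlation-based distillation loss (not the official DIST code).

    Inter-class term: correlation of class probabilities per instance
    (rows of the softened prediction matrix).
    Intra-class term: correlation across instances per class
    (columns of the softened prediction matrix).
    """

    def __init__(self, temperature: float = 4.0, beta: float = 1.0, gamma: float = 1.0):
        super().__init__()
        self.temperature = temperature
        self.beta = beta    # weight of the inter-class term (assumed name)
        self.gamma = gamma  # weight of the intra-class term (assumed name)

    def forward(self, student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
        t = self.temperature
        p_s = torch.softmax(student_logits / t, dim=1)  # (batch, classes)
        p_t = torch.softmax(teacher_logits / t, dim=1)

        # Inter-class relation: one correlation per instance, computed over classes.
        inter = 1.0 - pearson_correlation(p_s, p_t).mean()
        # Intra-class relation: one correlation per class, computed over the batch.
        intra = 1.0 - pearson_correlation(p_s.t(), p_t.t()).mean()

        return self.beta * inter + self.gamma * intra


# Example usage with random logits for a batch of 8 samples and 100 classes:
# loss = CorrelationDistillLoss()(torch.randn(8, 100), torch.randn(8, 100))
```

In training, such a relational term would typically be combined with the ordinary cross-entropy loss on the ground-truth labels.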
