Paper Title

Knowledge Distillation with the Reused Teacher Classifier

Authors

Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen

Abstract

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years, generally with elaborately designed knowledge representations, which in turn increase the difficulty of model development and interpretation. In contrast, we empirically show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap. We directly reuse the discriminative classifier from the pre-trained teacher model for student inference and train a student encoder through feature alignment with a single $\ell_2$ loss. In this way, the student model is able to achieve exactly the same performance as the teacher model provided that their extracted features are perfectly aligned. An additional projector is developed to help the student encoder match with the teacher classifier, which renders our technique applicable to various teacher and student architectures. Extensive experiments demonstrate that our technique achieves state-of-the-art results at the modest cost of compression ratio due to the added projector.
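To make the idea concrete, below is a minimal PyTorch-style sketch of the training setup described in the abstract: the student encoder plus a small projector is trained to match the frozen teacher's features with a single L2-style loss, and the reused (frozen) teacher classifier is attached to the projected student features for inference. Module names (`student_encoder`, `teacher_encoder`, `teacher_classifier`, `student_dim`, `teacher_dim`), the single-linear-layer projector, and the use of a mean-squared-error loss as the L2 alignment objective are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the authors' code): train a student encoder plus a
# projector so its features align with the frozen teacher's features, then
# reuse the frozen teacher classifier on those projected features at inference.

class DistilledStudent(nn.Module):
    def __init__(self, student_encoder, teacher_classifier, student_dim, teacher_dim):
        super().__init__()
        self.encoder = student_encoder                        # trainable
        self.projector = nn.Linear(student_dim, teacher_dim)  # trainable; a single linear layer is an assumption
        self.classifier = teacher_classifier                  # reused from the teacher, kept frozen
        for p in self.classifier.parameters():
            p.requires_grad = False

    def features(self, x):
        # Projected student features, intended to match the teacher's feature space.
        return self.projector(self.encoder(x))

    def forward(self, x):
        # Inference: the reused teacher classifier applied to projected student features.
        return self.classifier(self.features(x))


def distillation_step(student, teacher_encoder, x, optimizer):
    """One training step with a single feature-alignment loss (squared-L2 / MSE)."""
    with torch.no_grad():
        t_feat = teacher_encoder(x)                 # frozen teacher features
    s_feat = student.features(x)                    # projected student features
    loss = nn.functional.mse_loss(s_feat, t_feat)   # L2-style feature alignment
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As the abstract notes, if the projected student features matched the teacher features exactly, the student with the reused classifier would reproduce the teacher's predictions; the projector is the extra component that makes this feasible across mismatched architectures, at a modest cost in compression ratio.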
