Paper Title

Knowledge Distillation with Deep Supervision

Paper Authors

Shiya Luo, Defang Chen, Can Wang

Paper Abstract

Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge from a pre-trained cumbersome teacher model. However, in traditional knowledge distillation, teacher predictions are only used to provide the supervisory signal for the last layer of the student model, which may leave the shallow student layers without accurate training guidance during layer-by-layer back-propagation and thus hinder effective knowledge transfer. To address this issue, we propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes the class predictions and feature maps of the teacher model to supervise the training of shallow student layers. A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve student performance. Extensive experiments on CIFAR-100 and TinyImageNet with various teacher-student models show significant performance improvements, confirming the effectiveness of our proposed method. Code is available at: https://github.com/luoshiya/DSKD
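
The abstract outlines two ingredients: auxiliary supervision of shallow student layers with the teacher's class predictions and feature maps, and a loss-based weight allocation that balances the resulting shallow-layer losses. Below is a minimal PyTorch sketch of how such an objective could be assembled; the auxiliary heads, the feature-matching term (`F.mse_loss` against the teacher's feature map), and the softmax-over-losses weighting are illustrative assumptions rather than the authors' exact formulation (see their repository for the official code).

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Hinton-style soft-label distillation loss (KL divergence at temperature T)."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)


def deeply_supervised_kd_loss(labels, teacher_logits, teacher_feat,
                              student_logits, shallow_logits, shallow_feats):
    """Sketch of a deeply-supervised distillation objective.

    `shallow_logits` / `shallow_feats` are outputs of hypothetical auxiliary
    heads attached to shallow student layers, projected to match the shapes of
    `teacher_logits` / `teacher_feat`.  The softmax-over-losses weighting below
    is an assumed stand-in for the paper's loss-based allocation strategy.
    """
    # Standard supervision on the final student layer:
    # hard labels plus the teacher's soft predictions.
    loss = F.cross_entropy(student_logits, labels) \
        + kd_loss(student_logits, teacher_logits)

    # Each shallow layer is supervised by both the teacher's class
    # predictions and its feature map.
    per_layer = torch.stack([
        kd_loss(logits_k, teacher_logits) + F.mse_loss(feat_k, teacher_feat)
        for logits_k, feat_k in zip(shallow_logits, shallow_feats)
    ])

    # Loss-based weights: layers with a larger current loss receive a larger
    # weight (detached so the weights themselves carry no gradient).
    weights = F.softmax(per_layer.detach(), dim=0)
    return loss + (weights * per_layer).sum()
```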
