Paper Title

Collaborative Multi-Teacher Knowledge Distillation for Learning Low Bit-width Deep Neural Networks

Paper Authors

Cuong Pham, Tuan Hoang, Thanh-Toan Do

Paper Abstract

Knowledge distillation, which learns a lightweight student model by distilling knowledge from a cumbersome teacher model, is an attractive approach for learning compact deep neural networks (DNNs). Recent works further improve student network performance by leveraging multiple teacher networks. However, most existing knowledge distillation-based multi-teacher methods use separately pretrained teachers. This limits the collaborative learning between teachers and the mutual learning between teachers and the student. Network quantization is another attractive approach for learning compact DNNs. However, most existing network quantization methods are developed and evaluated without considering multi-teacher support to enhance the performance of the quantized student model. In this paper, we propose a novel framework that leverages both multi-teacher knowledge distillation and network quantization for learning low bit-width DNNs. The proposed method encourages both collaborative learning between quantized teachers and mutual learning between quantized teachers and the quantized student. During the learning process, at corresponding layers, knowledge from the teachers forms an importance-aware shared knowledge, which is used as input for the teachers at subsequent layers and also to guide the student. Our experimental results on the CIFAR100 and ImageNet datasets show that the compact quantized student models trained with our method achieve competitive results compared to other state-of-the-art methods, and in some cases, indeed surpass the full-precision models.
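To make the layer-wise mechanism described in the abstract concrete, below is a minimal PyTorch sketch (not the authors' code): at one layer, features from several teachers are fused into an importance-weighted shared representation that is then fed to every teacher at the next layer and also used to guide the student through a feature-matching loss. The class and function names (ImportanceAwareFusion, collaborative_layer_step), the softmax-based importance weights, and the MSE guidance loss are all illustrative assumptions; the paper's actual fusion rule, loss terms, and quantization scheme may differ.

```python
# Illustrative sketch of importance-aware shared knowledge at a single layer depth.
# Assumes the teacher/student blocks are already (quantized) nn.Modules with
# matching feature shapes at this depth; none of this is taken verbatim from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ImportanceAwareFusion(nn.Module):
    """Learns one importance score per teacher and fuses their features."""

    def __init__(self, num_teachers: int):
        super().__init__()
        # One learnable logit per teacher; softmax turns them into importance weights.
        self.logits = nn.Parameter(torch.zeros(num_teachers))

    def forward(self, teacher_feats):
        weights = F.softmax(self.logits, dim=0)            # (T,)
        stacked = torch.stack(teacher_feats, dim=0)        # (T, B, C, H, W)
        shared = (weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)
        return shared                                      # (B, C, H, W)


def collaborative_layer_step(teacher_blocks, student_block, fusion,
                             teacher_inputs, student_input):
    """One layer of collaborative distillation (hypothetical decomposition).

    Returns the shared knowledge (used as the input to every teacher at the next
    layer), the student's feature at this layer, and a guidance loss pulling the
    student toward the shared knowledge.
    """
    teacher_feats = [blk(x) for blk, x in zip(teacher_blocks, teacher_inputs)]
    shared = fusion(teacher_feats)            # importance-aware shared knowledge
    student_feat = student_block(student_input)
    # Whether gradients from this loss also update the teachers (mutual learning)
    # is a design choice; here they do, since the shared feature is not detached.
    guide_loss = F.mse_loss(student_feat, shared)
    next_teacher_inputs = [shared for _ in teacher_blocks]
    return next_teacher_inputs, student_feat, guide_loss
```

In a full training loop one would repeat this step over all layer depths, sum the per-layer guidance losses with the usual task and logit-distillation losses, and apply a low bit-width quantizer to the weights and activations of both teachers and student; those parts are omitted here.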
