Paper Title

Meta Knowledge Distillation

Authors

Jihao Liu, Boxiao Liu, Hongsheng Li, Yu Liu

Abstract

Recent studies have pointed out that knowledge distillation (KD) suffers from two degradation problems, the teacher-student gap and the incompatibility with strong data augmentations, making it not applicable to training state-of-the-art models, which are trained with advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions for generating probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, such degradation problems of KD can be largely mitigated. However, instead of relying on a naive grid search, which shows poor transferability, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance among compared methods that use only ImageNet-1K as training data, with popular ViT architectures ranging from tiny to large models. With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE trained for 1,650 epochs.
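
As a brief illustration of the temperature-scaled distillation the abstract refers to, the sketch below shows a standard KD loss in which the softmax temperature is exposed as a learnable parameter (here a single hypothetical scalar, log_t). It assumes PyTorch and is not the paper's MKD procedure, which meta-learns the temperature parameters from gradients of the learning objective; the exact meta-update is not specified in the abstract.

```python
# Minimal sketch, assuming PyTorch: temperature-scaled KD with a learnable
# temperature. Illustrative only; MKD itself meta-learns the temperature
# parameters, and its meta-objective is not described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureKD(nn.Module):
    def __init__(self, init_temperature: float = 4.0):
        super().__init__()
        # Learnable temperature shared by the teacher and student softmax.
        # Stored in log space so the temperature stays positive.
        self.log_t = nn.Parameter(torch.tensor(init_temperature).log())

    def forward(self, student_logits: torch.Tensor,
                teacher_logits: torch.Tensor) -> torch.Tensor:
        t = self.log_t.exp()
        # Softened distributions; teacher logits are treated as fixed targets.
        p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        # KL divergence between softened predictions, scaled by t^2 as in
        # standard KD so gradient magnitudes stay comparable across temperatures.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

# Usage sketch with random logits in place of real model outputs.
if __name__ == "__main__":
    kd = TemperatureKD()
    student_logits = torch.randn(8, 1000, requires_grad=True)
    teacher_logits = torch.randn(8, 1000)
    loss = kd(student_logits, teacher_logits)
    loss.backward()  # gradients flow to both the student logits and log_t
```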
