Paper Title
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Paper Authors
Paper Abstract
Vision transformers (ViTs) have gained increasing popularity, as they are commonly believed to offer higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether such potential has been fully unleashed in practice, as learned ViTs often suffer from over-smoothing, yielding likely redundant models. Recent works made preliminary attempts to identify and alleviate such redundancy, e.g., via regularizing embedding similarity or re-injecting convolution-like structures. However, a "head-to-toe assessment" of the extent of redundancy in ViTs, and of how much we could gain by thoroughly mitigating it, has been absent from this field. This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of these findings, we advocate a principle of diversity for training ViTs, presenting corresponding regularizers that encourage representation diversity and coverage at each of those levels, thereby enabling the capture of more discriminative information. Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, largely eliminating the observed ViT redundancy and significantly boosting model generalization. For example, our diversified DeiT obtains 0.70%~1.76% accuracy boosts on ImageNet with highly reduced similarity. Our code is available at https://github.com/VITA-Group/Diverse-ViT.
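To make the idea of a per-level diversity regularizer concrete, below is a minimal PyTorch-style sketch of one plausible instantiation: a penalty on the pairwise cosine similarity among patch embeddings, which in principle can be applied analogously to attention maps or weight matrices. The function name, the weighting hyperparameter `lambda_div`, and the exact penalty form are illustrative assumptions for this sketch, not the authors' released implementation; refer to the linked repository for the actual code.

```python
# Illustrative sketch (assumed form, not the authors' implementation):
# penalize the mean off-diagonal cosine similarity among tokens so that
# patch embeddings are discouraged from collapsing onto each other.
import torch
import torch.nn.functional as F

def cosine_diversity_penalty(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim) patch embeddings from a ViT block.
    x = F.normalize(tokens, dim=-1)                    # unit-normalize each token
    sim = x @ x.transpose(-1, -2)                      # (batch, N, N) pairwise cosine similarities
    n = sim.size(-1)
    eye = torch.eye(n, device=sim.device, dtype=sim.dtype)
    off_diag = sim - eye                               # zero out self-similarity on the diagonal
    return off_diag.abs().sum(dim=(-1, -2)) / (n * (n - 1))  # mean off-diagonal similarity per sample

# Example usage (lambda_div is a hypothetical weighting hyperparameter):
# total_loss = task_loss + lambda_div * cosine_diversity_penalty(patch_tokens).mean()
```

Adding such a term to the training objective lowers the average token-to-token similarity, which is one way to operationalize the "diversity at each level" principle described in the abstract.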