Paper Title

Improving Vision Transformers by Revisiting High-frequency Components

Paper Authors

Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, Wei Liu

Paper Abstract

Transformer models have shown promising effectiveness on various vision tasks. However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on large-scale training sets. To explain this observation, we hypothesize that ViT models are less effective than CNN models at capturing the high-frequency components of images, and verify this hypothesis via frequency analysis. Inspired by this finding, we first examine the effects of existing techniques for improving ViT models from the new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to better use of the high-frequency components. Then, to compensate for this insufficient ability of ViT models, we propose HAT, which directly augments the high-frequency components of images via adversarial training. We show that HAT consistently boosts the performance of various ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B), and in particular lifts the advanced model VOLO-D5 to 87.3% top-1 accuracy using only ImageNet-1K data; the gains also hold on out-of-distribution data and transfer to downstream tasks. The code is available at: https://github.com/jiawangbai/HAT.
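Both the frequency analysis and HAT revolve around isolating an image's high-frequency content, so a small sketch may help make the idea concrete. Below is a minimal Python sketch of splitting an image into low- and high-frequency components with a hard circular mask in the centered Fourier spectrum; the cutoff radius and the hard mask are illustrative assumptions, not values or details taken from the paper (see the repository linked above for the authors' implementation).

```python
import numpy as np

def split_frequency_components(image, radius=16):
    """Split a grayscale image into low- and high-frequency parts.

    A hard circular mask of the given radius is applied to the
    centered 2-D Fourier spectrum: bins inside the circle count as
    low frequency, bins outside as high frequency. The radius is an
    illustrative choice, not a value from the paper.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the spectrum center.
    ys, xs = np.ogrid[:h, :w]
    dist = np.sqrt((ys - h / 2) ** 2 + (xs - w / 2) ** 2)
    low_mask = dist <= radius

    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# Example: how much of the signal energy sits in the high band?
image = np.random.rand(224, 224).astype(np.float32)  # stand-in for a real image
low, high = split_frequency_components(image)
print("high-frequency energy ratio:", (high ** 2).sum() / (image ** 2).sum())
```

A HAT-flavored training step might then look roughly like the following, where the gradient of the loss with respect to the input is high-pass filtered before being added back, so that only high-frequency content is adversarially augmented. The `model`, `step_size`, and the sign-gradient update are placeholders, sketching the idea rather than reproducing the paper's exact perturbation model or schedule.

```python
import torch
import torch.nn.functional as F

def high_freq_adversarial_step(model, images, labels, step_size=0.01, radius=16):
    """One sketched HAT-style step: perturb only the high-frequency
    band of the input in the direction that increases the loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]

    # Build the same circular high-pass mask as above, on the fly.
    h, w = images.shape[-2:]
    ys = torch.arange(h).view(-1, 1).float()
    xs = torch.arange(w).view(1, -1).float()
    dist = ((ys - h / 2) ** 2 + (xs - w / 2) ** 2).sqrt()
    high_mask = (dist > radius).to(images.dtype)

    # Keep only the high-frequency part of the gradient, then take a
    # small sign-gradient step, as in FGSM-style adversarial training.
    g_spec = torch.fft.fftshift(torch.fft.fft2(grad), dim=(-2, -1))
    g_high = torch.fft.ifft2(
        torch.fft.ifftshift(g_spec * high_mask, dim=(-2, -1))).real
    return (images + step_size * g_high.sign()).detach()
```

The perturbed images would then be fed back through the model for a standard training step, so the network repeatedly sees inputs with adversarially strengthened high-frequency content.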
