Paper Title
Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets
Paper Authors
Paper Abstract
Vision Transformers have attracted much attention recently since the successful application of the Vision Transformer (ViT) to vision tasks. With vision Transformers, specifically their multi-head self-attention modules, networks can inherently capture long-range dependencies. However, these attention modules normally need to be trained on large datasets, and vision Transformers show inferior performance on small datasets when trained from scratch, compared with widely dominant backbones such as ResNets. Note that the Transformer model was first proposed for natural language processing, whose input carries denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we introduce channel selection by computing channel-wise heatmaps in the frequency domain using the Discrete Cosine Transform (DCT), reducing the input size while retaining most of the information and hence increasing the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets, including CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. Accuracy is improved by up to 17.05% with Swin and Focal Transformers. Code is available at https://github.com/xiangyu8/DenseVT.
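To make the channel-selection idea concrete, below is a minimal sketch of DCT-based frequency-channel selection. It assumes the common "learning in the frequency domain" setup in which an image plane is split into 8x8 blocks, each block is transformed with a 2D DCT, and the 64 frequency components are rearranged into channels. The function names and the mean-absolute-energy scoring here are illustrative stand-ins for the paper's channel-wise heatmap, not its exact method.

```python
# Hypothetical sketch of DCT-based channel selection (not the paper's code).
import numpy as np
from scipy.fft import dctn

def blockwise_dct_channels(plane, block=8):
    """Split a (H, W) image plane into block x block tiles, apply a 2D DCT
    to each tile, and rearrange the frequency components into channels of
    shape (block*block, H//block, W//block)."""
    H, W = plane.shape
    h, w = H // block, W // block
    tiles = plane[:h * block, :w * block].reshape(h, block, w, block)
    tiles = tiles.transpose(0, 2, 1, 3)              # (h, w, block, block)
    coeffs = dctn(tiles, axes=(2, 3), norm="ortho")  # per-tile 2D DCT
    return coeffs.transpose(2, 3, 0, 1).reshape(block * block, h, w)

def select_top_channels(channels, keep):
    """Score each frequency channel by its mean absolute energy (a simple
    proxy for a channel-wise heatmap) and keep the top-k channels."""
    scores = np.abs(channels).mean(axis=(1, 2))
    top = np.sort(np.argsort(scores)[::-1][:keep])
    return channels[top], top

# Example: keep 48 of the 64 frequency channels of one 224x224 plane,
# shrinking the input while retaining the highest-energy components.
plane = np.random.rand(224, 224).astype(np.float32)
chans = blockwise_dct_channels(plane)           # (64, 28, 28)
kept, idx = select_top_channels(chans, keep=48) # (48, 28, 28)
```

In this toy version, the spatial resolution drops from 224x224 to 28x28 while the retained channels concentrate most of the signal energy, which is the sense in which the input information density increases.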