Vicinity Vision Transformer

Paper title

Vicinity Vision Transformer

Authors

Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, Yiran Zhong

Abstract

Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prevents vision transformers from scaling up to high-resolution images, because both its computational complexity and its memory footprint are quadratic in the number of tokens. Although linear attention was introduced in natural language processing (NLP) tasks to mitigate a similar issue, directly applying existing linear attention to vision transformers may not lead to satisfactory results. We investigate this problem and find that computer vision tasks focus more on local information than NLP tasks do. Based on this observation, we present Vicinity Attention, which introduces a locality bias to vision transformers while keeping linear complexity. Specifically, for each image patch, we adjust its attention weight based on its 2D Manhattan distance from its neighbouring patches. As a result, neighbouring patches receive stronger attention than far-away patches. Moreover, since Vicinity Attention requires the token length to be much larger than the feature dimension to show its efficiency advantages, we further propose a new Vicinity Vision Transformer (VVT) structure that reduces the feature dimension without degrading accuracy. We perform extensive experiments on the CIFAR100, ImageNet1K, and ADE20K datasets to validate the effectiveness of our method. As the input resolution increases, the GFLOPs of our method grow more slowly than those of previous transformer-based and convolution-based networks. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
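
The locality idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the ReLU feature map, and the linearly decaying weight are all assumptions made for illustration, and the sketch builds the full N x N attention matrix, so it is quadratic in the number of patches rather than linear as the paper's method is stated to be.

```python
import numpy as np


def manhattan_distance_matrix(height, width):
    """Pairwise 2D Manhattan distance between patches on a height x width grid.

    Patches are indexed in row-major order, so patch i sits at
    (i // width, i % width).
    """
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1)   # (N, 2)
    diff = np.abs(coords[:, None, :] - coords[None, :, :])    # (N, N, 2)
    return diff.sum(axis=-1)                                   # (N, N)


def locality_weighted_attention(q, k, v, height, width):
    """Attention whose weights shrink as the Manhattan distance grows.

    Hypothetical illustration only:
    - ReLU is an assumed non-negative feature map replacing softmax.
    - The linear decay 1 - dist / (max_dist + 1) is an assumed weighting.
    - The explicit (N, N) matrix makes this quadratic in N.
    """
    dist = manhattan_distance_matrix(height, width)
    weight = 1.0 - dist / (dist.max() + 1.0)         # in (0, 1], equals 1 on the diagonal
    scores = np.maximum(q, 0) @ np.maximum(k, 0).T   # non-negative similarities, (N, N)
    scores = scores * weight                         # nearby patches keep more weight
    scores = scores / (scores.sum(axis=1, keepdims=True) + 1e-6)  # row-normalise
    return scores @ v                                # (N, d)


if __name__ == "__main__":
    height, width, dim = 4, 4, 8
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((height * width, dim)) for _ in range(3))
    out = locality_weighted_attention(q, k, v, height, width)
    print(out.shape)
```

On a 2 x 3 grid, for example, the top-left and bottom-right patches are 1 row and 2 columns apart, so their Manhattan distance is 3 and they receive the smallest weight in this sketch.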
