Paper Title
BOAT: Bilateral Local Attention Vision Transformer
Paper Authors
Paper Abstract
Vision Transformers have achieved outstanding performance in many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows. Although window-based local self-attention significantly boosts efficiency, it fails to capture relationships between distant but similar patches in the image plane. To overcome this limitation of image-space local attention, in this paper we further exploit the locality of patches in the feature space: patches are grouped into multiple clusters according to their features, and self-attention is computed within each cluster. Such feature-space local attention effectively captures connections between patches that lie in different local windows but are still relevant to one another. We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both the Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and vision Transformers.
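The core idea of feature-space local attention, as described in the abstract, can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the paper's implementation: it partitions patches into equal-size clusters by sorting along a fixed 1-D feature projection (a real model would use a learned projection or a balanced clustering step), and uses the patch features directly as queries, keys, and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_space_local_attention(x, num_clusters):
    """Toy sketch of feature-space local attention.

    x: (n, d) array of patch features; n must be divisible by num_clusters.
    Patches are grouped into balanced clusters by feature similarity, and
    scaled dot-product self-attention runs independently inside each cluster.
    """
    n, d = x.shape
    assert n % num_clusters == 0
    # Placeholder 1-D projection (learned in a real model): similar patches
    # land near each other after sorting, regardless of image-plane distance.
    proj = x @ np.ones(d)
    order = np.argsort(proj)
    groups = order.reshape(num_clusters, n // num_clusters)
    out = np.empty_like(x)
    for idx in groups:
        q = k = v = x[idx]  # identity Q/K/V maps, for brevity
        attn = softmax(q @ k.T / np.sqrt(d))  # attention only within cluster
        out[idx] = attn @ v
    return out
```

Because clustering is driven by features rather than spatial position, two visually similar patches at opposite corners of the image can attend to each other, which is exactly the connection that window-based image-space attention misses. Cost drops from O(n^2) for global attention to O(n^2 / num_clusters) here.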