Paper Title
ViT-LSLA: Vision Transformer with Light Self-Limited-Attention
Paper Authors
Paper Abstract
Transformers have demonstrated competitive performance across a wide range of vision tasks, but computing global self-attention is very expensive. Many methods limit the range of attention to a local window to reduce computational complexity. However, these approaches cannot reduce the number of parameters; meanwhile, the self-attention and the inner position bias (inside the softmax function) cause each query to focus on similar and nearby patches. Consequently, this paper presents a light self-limited-attention (LSLA), consisting of a light self-attention mechanism (LSA) to save computation cost and parameters, and a self-limited-attention mechanism (SLA) to improve performance. First, the LSA replaces the K (Key) and V (Value) of self-attention with X (the original input). Applying it in vision Transformers, which have an encoder architecture and a self-attention mechanism, simplifies the computation. Second, the SLA has a positional information module and a limited-attention module. The former contains a dynamic scale and an inner position bias to adjust the distribution of the self-attention scores and enhance the positional information. The latter uses an outer position bias after the softmax function to limit some large attention weights. Finally, a hierarchical Vision Transformer with Light self-Limited-attention (ViT-LSLA) is presented. Experiments show that ViT-LSLA achieves 71.6% top-1 accuracy on IP102 (a 2.4% absolute improvement over Swin-T) and 87.2% top-1 accuracy on Mini-ImageNet (a 3.7% absolute improvement over Swin-T). Furthermore, it greatly reduces FLOPs (3.5 GFLOPs vs. 4.5 GFLOPs for Swin-T) and parameters (18.9M vs. 27.6M for Swin-T).
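To make the abstract's description concrete, the following is a minimal NumPy sketch of the LSLA computation as described above, not the authors' implementation. The names `w_q`, `inner_bias`, `outer_bias`, and `scale` are hypothetical stand-ins for the query projection, the two learned position biases, and the dynamic scale; head splitting, windowing, and the hierarchical architecture are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def light_self_limited_attention(x, w_q, inner_bias, outer_bias, scale):
    """Sketch of LSLA for a single head on one window of n tokens.

    x:          (n, d) input patches
    w_q:        (d, d) query projection -- LSA keeps only Q; K and V
                are replaced by the input X itself, saving two
                projection matrices per attention layer
    inner_bias: (n, n) position bias added before the softmax
    outer_bias: (n, n) position bias added after the softmax,
                limiting overly large attention weights (SLA)
    scale:      scalar (the paper's scale is dynamic/learned)
    """
    q = x @ w_q
    # Inner position bias and dynamic scale shape the score distribution.
    scores = scale * (q @ x.T) + inner_bias
    attn = softmax(scores, axis=-1)
    # Outer position bias is applied after softmax (SLA).
    attn = attn + outer_bias
    # V is also replaced by X (LSA).
    return attn @ x
```

With both biases set to zero this reduces to single-head attention with Q projected and K = V = X, which is where the parameter savings over standard QKV attention come from.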