Paper Title
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
Paper Authors
Paper Abstract
Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has a quadratic complexity with the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), sacrificing ViTs' ability to capture either global or local context. In this work, we ask an important research question: can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during ViT inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. We further simplify it with two techniques: (1) a novel linear-angular attention mechanism, in which we decompose the angular kernels into linear terms and high-order residuals and keep only the linear terms; and (2) two parameterized modules that approximate the high-order residuals: a depthwise convolution and an auxiliary masked softmax attention that help learn both global and local information, where the masks for the softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to a 1.8% higher accuracy or a 40% MACs reduction on ImageNet classification and a 1.2 higher mAP on COCO detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attention.
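To illustrate the linear-term idea described above, below is a minimal, hypothetical PyTorch sketch of the linear part of an angular-kernel attention. It assumes the spectral-angle similarity 1 - arccos(q_hat . k_hat) / pi, whose series expansion keeps a constant plus a linear term, so the (Q K^T) V product can be reassociated as Q (K^T V) and becomes linear in the number of tokens. The function name, tensor shapes, and normalization scheme are illustrative assumptions, not the paper's released implementation, and the high-order residuals handled by the depthwise convolution and the auxiliary masked softmax branch are omitted.

import math
import torch
import torch.nn.functional as F

def linear_angular_attention(q, k, v, eps=1e-6):
    # Hypothetical single-head sketch; q, k, v have shape (batch, tokens, dim).
    q_hat = F.normalize(q, dim=-1)  # unit-norm queries
    k_hat = F.normalize(k, dim=-1)  # unit-norm keys
    n = k.shape[1]

    # Angular similarity 1 - arccos(<q_hat, k_hat>)/pi expands to
    # 1/2 + <q_hat, k_hat>/pi + high-order residuals; only the constant
    # and linear terms are kept so the attention can be reassociated.
    kv = torch.einsum('bnd,bne->bde', k_hat, v)  # K^T V first: O(n * d^2)
    num = 0.5 * v.sum(dim=1, keepdim=True) \
          + torch.einsum('bnd,bde->bne', q_hat, kv) / math.pi
    # Row-sum normalization of the (never materialized) similarity matrix.
    den = 0.5 * n + torch.einsum('bnd,bd->bn', q_hat, k_hat.sum(dim=1)) / math.pi
    return num / (den.unsqueeze(-1) + eps)

# Usage example with random tensors (196 tokens, 64-dim head).
q = torch.randn(2, 196, 64)
k = torch.randn(2, 196, 64)
v = torch.randn(2, 196, 64)
out = linear_angular_attention(q, k, v)  # shape (2, 196, 64)

Because K^T V is a dim-by-dim matrix, the cost grows linearly with the token count rather than quadratically, which is what allows the softmax branch to be dropped at inference time without changing the attention's asymptotic cost.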